How to Use Mod-Rewrite to Simplify URL Rewriting in Apache - A Basic Guide to the Mod-Rewrite Module

Friday, November 2, 2012

Introduction
URL rewriting is the process of manipulating a URL that is sent to a web server so that it is modified dynamically at the server, for example to include additional parameters, and then handed to a different resource through a server-initiated internal redirection. The web server performs all of this on the fly, so the browser is kept out of the loop regarding both the change made to the URL and the redirection.
URL rewriting can benefit your websites and web applications by hiding implementation details such as script names and query strings, by making URLs friendlier to search engines and users, and by keeping the structure of the website easier to maintain when it changes later.
In this article we will be taking a look at how we can implement URL Rewriting on an Apache based web server environment using the mod_rewrite module for Apache.
What is mod_rewrite?
Mod_rewrite is one of the most popular modules for the Apache web server, and many web developers and administrators consider it the best thing to happen to Apache. It has so many tricks up its sleeve that it is often called the Swiss Army knife of Apache modules. Apart from simple URL rewriting, it can keep the internal URLs of a site hidden, improve search engine visibility, stop bandwidth thieves from hot linking, make restructuring a site hassle free, and give users friendlier URLs. Because it is so versatile, the module can feel daunting to master at times, but a thorough understanding of the basics takes you most of the way.
Let's Begin! - A look at everything you need on your test environment to get mod_rewrite alive and kicking.
First and foremost, you need a properly configured Apache web server on your test machine. Mod_rewrite usually ships with Apache, but it can be missing, for example on a Linux machine where Apache was built without it, and in that case you will have to install it. To use mod_rewrite, Apache must also be configured to load the module at startup. On a shared server you will have to ask your web hosting company to install and load the module.
On your local machine you can check whether the module is installed by looking in Apache's modules directory for a file named mod_rewrite.so. If the file is there, the module can be loaded into Apache dynamically. In many default configurations the module is not loaded when Apache starts, so you have to enable it in the web server's configuration file, as explained below.
How to Enable mod_rewrite on Apache?
You can make Apache load the mod_rewrite module by using the LoadModule directive in the httpd.conf file. Open this file in a text editor and find a line similar to the one given below.
#LoadModule rewrite_module modules/mod_rewrite.so
Uncomment this line by removing the # and save the httpd.conf file. Restart your Apache server and if all went well mod_rewrite module will now be enabled on your web server.
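As a rough sketch, the relevant parts of httpd.conf might look like the fragment below once the line is uncommented. The directory path /var/www/html is only an assumed document root, so substitute your own. The AllowOverride and Options lines are included on the assumption that you will put your rules in a .htaccess file, as the rest of this article does: Apache ignores rewrite directives in .htaccess files unless overrides of the FileInfo class are allowed for that directory, and the rewrite engine in that context needs FollowSymLinks (or SymLinksIfOwnerMatch) to be enabled.

```apache
# httpd.conf (fragment)

# Load mod_rewrite at startup; the path is relative to ServerRoot.
LoadModule rewrite_module modules/mod_rewrite.so

# Assumed document root; replace with the directory that holds your site.
<Directory "/var/www/html">
    # The rewrite engine in .htaccess files requires symlink following.
    Options FollowSymLinks
    # Allow .htaccess files here to use mod_rewrite directives.
    AllowOverride FileInfo
</Directory>
```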
Let's Rewrite Our First URL Using mod_rewrite
OK, the mod_rewrite module is now loaded on your server. Let's have a look at how to switch the rewrite engine on and make it work for us.
To turn the engine on you have to add a single line to your .htaccess file. The .htaccess files are configuration files that contain Apache directives and provide distributed, directory-level configuration for a website. Create a .htaccess file in your web server's test directory, or in any other directory where you want URL rewriting to be active, and add the line given below to it.
RewriteEngine on
Now the rewrite engine is turned on and Apache is ready to rewrite URLs for you. Let's look at a sample rewrite instruction that makes a request to our server for first.html get served from second.html at server level. Add the line given below to your .htaccess file, after the RewriteEngine directive that we added before.
RewriteRule ^first.html$ second.html
I will explain what we have done here in the next section, but if all went well, any request for first.html made on your server will be served from second.html. This is one of the simplest forms of URL rewriting.
A point to note here is that the redirection is kept entirely hidden from the client, which is how it differs from a classic HTTP redirect. The client, or browser, is given the impression that the content of second.html is being fetched from first.html. This lets websites generate URLs on the fly without the client's awareness, and it is what makes URL rewriting so powerful.
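To make the contrast concrete, here is a small sketch of the same rule written both ways. It assumes the .htaccess file sits in the document root, and it uses the [R] flag, which is described in the flags section later in this article. The dot is escaped as \. so that it matches only a literal dot rather than any character.

```apache
RewriteEngine on

# Internal rewrite: the browser's address bar keeps showing first.html.
RewriteRule ^first\.html$ second.html

# External redirect (alternative): uncomment this rule and comment out the one
# above to send the browser a 302 response pointing at /second.html instead.
# RewriteRule ^first\.html$ /second.html [R,L]
```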
Basics of mod_rewrite module
Now we know that mod_rewrite can be enabled for an entire website or for a specific directory by using a .htaccess file, and we have written a basic rewrite directive in the previous example. Here I will explain exactly what we did in that first sample rewrite.
The mod_rewrite module provides a set of configuration directives for URL rewriting, and the RewriteRule directive that we saw in the previous sample is the most important one. The mod_rewrite engine uses regular-expression pattern matching and substitution to make its translations, so a good grasp of regular expressions will help you a lot.
Note: Regular expressions are too vast a subject to fit into the scope of this article. I will try to write another article on that topic someday.
1. The RewriteRule Directive
The general syntax of the RewriteRule is very straightforward.
RewriteRule Pattern Substitution [Flags]
The Pattern part is the pattern which the rewrite engine will look for in the incoming URL to catch. So in our first sample ^first.html$ is the Pattern. The pattern is written as a regular expression.
The Substitution is the replacement or translation that is to be done on the caught pattern in the URL. In our sample second.html is the Substitution part.
Flags are optional, and they make the rewrite engine do certain other tasks apart from the substitution on the URL string. The flags, if present, are written within square brackets and separated by commas.
Let's take a look at a more complex rewrite rule. Take a look at the following URL.
yourwebsiteurl/articles.php?category=stamps&id=122
Now we will convert the above URL in to a search engine and user friendly URL like the one given below.
yourwebsiteurl/articles/stamps/122
Create a page called articles.php with the following code:
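<?php
// Read the two GET variables; fall back to an empty string if one is missing.
$category = isset($_GET['category']) ? $_GET['category'] : '';
$id = isset($_GET['id']) ? $_GET['id'] : '';
// Escape the values before printing so that markup in the URL is not rendered.
echo "Category : " . htmlspecialchars($category) . "<br/>";
echo "ID : " . htmlspecialchars($id);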
This page simply prints the two GET variables passed to it on the webpage.
Open the .htaccess file and add the rule given below.
RewriteEngine on
RewriteRule ^articles/(\w+)/([0-9]+)$ /articles.php?category=$1&id=$2
The pattern ^articles/(\w+)/([0-9]+)$ can be broken down as follows:
^articles/ - checks that the request starts with 'articles/'
(\w+)/ - checks that this part is a single word (one or more letters, digits or underscores) followed by a forward slash. The parentheses capture the matched value, which we need for building the query string of the substituted URL. Each pair of parentheses stores its match in a special variable that can be back-referenced in the substitution as $1, $2 and so on, in the order in which the pairs appear.
([0-9]+)$ - checks for one or more digits at the end of the URL.
Try requesting the articles.php file in your test server with the below given url.
yourwebsiteurl/articles/coins/1222
The rewrite rule you have written will kick in and you will see the same result as if the URL requested were:
yourwebsiteurl/articles.php?category=coins&id=1222
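Now you can build on this sample to write more and more complex URL rewriting rules. By using URL rewriting in the above example we have achieved a URL that is friendly to search engines and users. Because the pattern only accepts a word for the category and digits for the id, it also filters out the most casual script-kiddie tampering with those parameters, although articles.php should still validate its input, since it remains reachable directly by its real URL.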
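One possible refinement of the rule, offered as a sketch rather than a required form: the variant below also accepts an optional trailing slash, stops further rule processing with [L], and uses the [QSA] flag so that any query string on the friendly URL is appended to the rewritten one. The sort parameter in the comment is purely hypothetical.

```apache
RewriteEngine on

# /articles/stamps/122          -> /articles.php?category=stamps&id=122
# /articles/stamps/122/         -> same target (the trailing slash is optional)
# /articles/stamps/122?sort=asc -> /articles.php?category=stamps&id=122&sort=asc
RewriteRule ^articles/(\w+)/([0-9]+)/?$ /articles.php?category=$1&id=$2 [QSA,L]
```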
What does the Flags parameter of RewriteRule directive do?
RewriteRule flags give us a way to control how mod_rewrite handles each rule. The flags are written inside a single pair of square brackets, separated by commas, and there are more than a dozen to choose from. They range from flags that control how rules are interpreted to more complex ones, such as those that send specific HTTP headers back to the client when the pattern matches.
Let's look at some of the basic flags.
  • [NC] flag (nocase) - This makes mod_rewrite treat the pattern in a case-insensitive manner.
  • [F] flag (forbidden) - This makes Apache send a 403 Forbidden HTTP response back to the client.
  • [R] flag (redirect) - This makes mod_rewrite issue a formal HTTP redirect instead of the internal Apache redirect, so the client is informed about the redirection. By default it sends a 302 Moved Temporarily response, but the flag takes an optional parameter that changes the response code. To send a 301 Moved Permanently response, write the flag as [R=301].
  • [G] flag (gone) - This makes Apache respond with a 410 Gone HTTP response.
  • [L] flag (last) - This makes mod_rewrite stop processing the rules that follow if the current rule matches. In a .htaccess file the rewritten URL is then passed through the rule set again, so write your patterns so that they do not match their own output.
  • [N] flag (next) - This makes the rewrite engine stop processing and loop back to the start of the rule list. Note that the URL used for pattern matching on the next pass is the rewritten one. This flag can create an endless loop, so use it with extreme care.
There are other flags too, but they are too complex to explain within the scope of this article. You can find more information on them in the mod_rewrite manual.
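The flags described above might be used as in the following sketch. All of the file and directory names (old-page.html, private/, retired.html, about.html) are made up for illustration. A substitution of - (a single dash) tells mod_rewrite to leave the URL unchanged, which is useful when a rule exists only for its flag.

```apache
RewriteEngine on

# [R=301,L]: permanent external redirect, then stop processing rules.
RewriteRule ^old-page\.html$ /new-page.html [R=301,L]

# [F]: answer 403 Forbidden for anything under private/.
RewriteRule ^private/ - [F]

# [G]: answer 410 Gone for a page that was removed on purpose.
RewriteRule ^retired\.html$ - [G]

# [NC]: match About.html, ABOUT.HTML and so on; [L] ends this pass.
RewriteRule ^about\.html$ /about-us.html [NC,L]
```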
2. The RewriteCond Directive
This directive gives you the additional power of conditional checks on a range of parameters. Combined with RewriteRule, it lets you rewrite URLs only when certain conditions are met. RewriteCond directives are like the if() statement in your programming language, but here they decide whether the substitution of the RewriteRule that follows should take place or not. Things like preventing hot linking, or checking whether the client meets certain criteria before the URL is rewritten, can be achieved with this directive.
The general syntax of the RewriteCond is:
RewriteCond string-to-test condition-pattern
The string-to-test part of RewriteCond has access to a large set of variables, such as HTTP header variables, request variables, server variables and time variables, so you can do a lot of complex conditional checking while writing directives. You can use any of these variables as the string to test by writing it in the %{NAME} format. For example, the HTTP_REFERER variable is written as %{HTTP_REFERER}.
The condition-pattern part can be a simple string or a very complex regular expression; your imagination is the only limit with this module.
Let's take a look at an example of conditional rewriting using the RewriteCond directive:
RewriteCond %{HTTP_USER_AGENT} ^Mozilla/4(.*)MSIE
RewriteRule ^index.html$ /index.ie.html [L]
RewriteCond %{HTTP_USER_AGENT} ^Mozilla/5(.*)Gecko
RewriteRule ^index.html$ /index.netscape.html [L]
RewriteRule ^index.html$ /index.other.html [L]
This example uses HTTP_USER_AGENT as the test string of the RewriteCond directive. It reads the User-Agent header to identify the visitor's browser, matches it against a set of known values, and serves a different page depending on the result. A RewriteCond applies only to the RewriteRule that immediately follows it. The first RewriteCond checks HTTP_USER_AGENT for a match on the ^Mozilla/4(.*)MSIE pattern, which occurs when a user visits the page with IE. The RewriteRule just under it then kicks in and rewrites the URL to serve the index.ie.html page to the IE visitor.
Similarly, the second RewriteCond checks for Mozilla-based browsers, and its RewriteRule substitutes index.netscape.html when the ^Mozilla/5(.*)Gecko pattern matches. The third RewriteRule has no condition and is there to catch all other browsers: it is considered only if both the first and the second RewriteCond fail. A point to note in the above example is the use of the [L] flag on all the RewriteRule directives. It stops the remaining rules from being applied once one RewriteRule has matched.
Two flags can be used to further control the way the RewriteCond directive behaves: [NC], which makes the match case-insensitive, and [OR], which chains multiple RewriteCond directives with a logical OR instead of the default AND.
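Hot-link protection, mentioned earlier, is a common use of RewriteCond. The sketch below assumes the site is served from example.com, a placeholder domain, and refuses image requests whose Referer header names some other site. The second rule shows [OR] with two made-up user-agent names.

```apache
RewriteEngine on

# Block image hot linking: the referer is present AND is not our own site.
RewriteCond %{HTTP_REFERER} !^$
RewriteCond %{HTTP_REFERER} !^https?://(www\.)?example\.com/ [NC]
RewriteRule \.(gif|jpe?g|png)$ - [F,NC]

# Refuse every request from either of two (hypothetical) user agents.
RewriteCond %{HTTP_USER_AGENT} ^BadBot [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^EvilScraper [NC]
RewriteRule ^ - [F]
```

The first condition lets through requests that carry no Referer at all, which is a design choice: some browsers and proxies strip the header, and blocking them would break images for legitimate visitors.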
By using these two directives - RewriteRule and RewriteCond - you can implement a lot of powerful URL Rewriting functionality on your website.
Other mod_rewrite Directives
  1. RewriteBase Directive - This directive solves the problem of RewriteRule producing non-existent URLs when the physical file system structure on the web server differs from the URL structure of the website. It sets the URL prefix that is added to relative substitutions in a .htaccess file. For a .htaccess file in the document root, the statement given below is usually enough. RewriteBase /
  2. RewriteMap Directive - This directive is very powerful, as it lets you map keys to replacement values from a lookup table and use the result in the substitution to generate URLs on the fly. This can be especially useful for large e-commerce or CMS applications where you need to replace each section name or category name in the URL with a corresponding id taken from a database. The map itself has to be defined in the server configuration, not in a .htaccess file.
  3. RewriteLog Directive - This directive sets the log file in which the mod_rewrite engine records the actions it takes while processing client requests. The syntax is: RewriteLog /path/to/logfile This directive must be defined in the httpd.conf file, as it is applied on a per-server basis. It exists in Apache 2.2 and earlier; Apache 2.4 removed it in favor of the LogLevel directive.
  4. RewriteLogLevel Directive - This directive tells mod_rewrite how much information about its internal processing to log. It takes a value from 0 to 9, where 0 means no logging and 9 means everything is logged. A higher level of logging can slow Apache down, so a level above 2 should be used only for debugging. The syntax is: RewriteLogLevel levelnumber Like RewriteLog, this directive was removed in Apache 2.4.
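As an illustration of the RewriteMap and logging directives, the fragment below is a sketch for the main server configuration (httpd.conf). The map name categoryids, the file paths and the map contents are all assumptions. The map is taken to be a plain text file with one "name id" pair per line, for example a line reading stamps 1. Note that in server context the pattern is matched against the full URL path, including its leading slash.

```apache
# httpd.conf (server or virtual host context)
RewriteEngine on

# Text-file lookup table mapping category names to ids (hypothetical path).
RewriteMap categoryids "txt:/etc/apache2/category-map.txt"

# With "stamps 1" in the map: /articles/stamps/122 -> /articles.php?category=1&id=122
# ${categoryids:$1|0} looks up the captured name and falls back to 0 if absent.
RewriteRule ^/articles/(\w+)/([0-9]+)$ /articles.php?category=${categoryids:$1|0}&id=$2 [L]

# Apache 2.2 and earlier: log rewrite processing at a low level of detail.
RewriteLog "/var/log/apache2/rewrite.log"
RewriteLogLevel 2

# Apache 2.4 and later: remove the two lines above and use LogLevel instead.
# LogLevel alert rewrite:trace2
```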

Conclusion
In this article we have taken only a brief look at the power of the mod_rewrite module. It only scratches the surface, but I hope it is enough to get you started with this module in your web server environment.
