URL Rewriting for the Fearful

Drew McLellan

I think it was Marilyn Monroe who said, “If you can’t handle me at my worst, please just fix these rewrite rules, I’m getting an internal server error.” Even the blonde bombshell hated configuring URL rewrites on her website, and I think most of us know where she was coming from.

The majority of website projects I work on require some amount of URL rewriting, and I find it mildly enjoyable — I quite like a good rewrite rule. I suspect you may not share my glee, so in this article we’re going to go back to basics to try to make the whole rigmarole more understandable.

When we think about URL rewriting, usually that means adding some rules to an .htaccess file for an Apache web server. As that’s the most common case, that’s what I’ll be sticking to here. If you work with a different server, there’s often documentation specifically for translating from Apache’s mod_rewrite rules. I even found an automatic converter for nginx.

This isn’t going to be a comprehensive guide to every URL rewriting problem you might ever have. That would take us until Christmas. If you consider yourself a trial-and-error dabbler in the HTTP 500-infested waters of URL rewriting, then hopefully this will provide a little bit more of a basis to help you figure out what you’re doing. If you’ve ever found yourself staring at the white screen of death after screwing up your .htaccess file, don’t worry. As Michael Jackson once insipidly whined, you are not alone.

The basics

Rewrite rules form part of the Apache web server’s configuration for a website, and can be placed in a number of different locations as part of your virtual host configuration. By far the simplest and most portable option is to use an .htaccess file in your website root. Provided your server has mod_rewrite available, all you need to do to kick things off in your .htaccess file is:

RewriteEngine on

The general formula for a rewrite rule is:

RewriteRule URL/to/match URL/to/use/if/it/matches [options]

When we talk about URL rewriting, we’re normally talking about one of two things: redirecting the browser to a different URL; or rewriting the URL internally to use a particular file. We’ll look at those in turn.

Redirects

Redirects match an incoming URL, and then redirect the user’s browser to a different address. These can be useful for maintaining legacy URLs if content changes location as part of a site redesign. Redirecting the old URL to the new location makes sure that any incoming links, such as those from search engines, continue to work.

In 1998, Sir Tim Berners-Lee wrote that Cool URIs don’t change, encouraging us all to go the extra mile to make sure links keep working forever. I think that sometimes it’s fine to move things around — especially to correct bad URL design choices of the past — provided that you can do so while keeping those old URLs working. That’s where redirects can help.

A redirect might look like this

RewriteRule ^article/used/to/be/here.php$ /article/now/lives/here/ [R=301,L]

Rewriting

By default, web servers closely map page URLs to the files in your site. On receiving a request for http://example.com/about/history.html the server goes to the configured folder for the example.com website, and then goes into the about folder and returns the history.html file.

A rewrite rule changes that process by breaking the direct relationship between the URL and the file system. “When there’s a request for /about/history.html” a rewrite rule might say, “use the file /about_section.php instead.”

This opens up lots of possibilities for creative ways to map URLs to the files that know how to serve up the page. Most MVC frameworks will have a single rule to rewrite all page URLs to one single file. That file will be a script which kicks off the framework to figure out what to do to serve the page.

RewriteRule ^for/this/url/$ /use/this/file.php [L]

Matching patterns

By now you’ll have noted the weird ^ and $ characters wrapped around the URL we’re trying to match. That’s because what we’re actually using here is a pattern. Technically, it is what’s called a Perl Compatible Regular Expression (PCRE) or simply a regex or regexp. We’ll call it a pattern because we’re not animals.

What are these patterns? If I asked you to enter your credit card expiry date as MM/YY then chances are you’d wonder what I wanted your credit card details for, but you’d know that I wanted a two-digit month, a slash, and a two-digit year. That’s not a regular expression, but it’s the same idea: using some placeholder characters to define the pattern of the input you’re trying to match.

We’ve already met two regexp characters.

^: Matches the beginning of a string
$: Matches the end of a string

When a pattern starts with ^ and ends with $ it’s to make sure we match the complete URL start to finish, not just part of it. There are lots of other ways to match, too:

[0-9]: Matches a number, 0–9. [2-4] would match numbers 2 to 4 inclusive.
[a-z]: Matches lowercase letters a–z
[A-Z]: Matches uppercase letters A–Z
[a-z0-9]: Combining some of these, this matches letters a–z and numbers 0–9

These are what we call character groups. The square brackets basically tell the server to match from the selection of characters within them. You can put any specific characters you’re looking for within the brackets, as well as the ranges shown above.

However, all these just match one single character. [0-9] would match 8 but not 84 — to match 84 we’d need to use [0-9] twice.

[0-9][0-9]

So, if we wanted to match 1984 we could to do this:

[0-9][0-9][0-9][0-9]

…but that’s getting silly. Instead, we can do this:

[0-9]{4}

That means any character between 0 and 9, four times. If we wanted to match a number, but didn’t know how long it might be (for example, a database ID in the URL) we could use the + symbol, which means one or more.

[0-9]+

This now matches 1, 123 and 1234567.

Putting it into practice

Let’s say we need to write a rule to match article URLs for this website, and to rewrite them to use /article.php under the hood. The articles all have URLs like this:

2013/article-title/

They start with a year (from 2005 up to 2013, currently), a slash, and then have a URL-safe version of the article title (a slug), ending in a slash. We’d match it like this:

^[0-9]{4}/[a-z0-9-]+/$

If that looks frightening, don’t worry. Breaking it down, from the start of the URL (^) we’re looking for four numbers ([0-9]{4}). Then a slash — that’s just literal — and then anything lowercase a–z or 0–9 or a dash ([a-z0-9-]) one or more times (+), ending in a slash (/$).

Putting that into a rewrite rule, we end up with this:

RewriteRule ^[0-9]{4}/[a-z0-9-]+/$ /article.php

We’re getting close now. We can match the article URLs and rewrite them to use article.php. Now we just need to make sure that article.php knows which article it’s supposed to display.

Capturing groups, and replacements

When rewriting URLs you’ll often want to take important parts of the URL you’re matching and pass them along to the script that handles the request. That’s usually done by adding those parts of the URL on as query string arguments. For our example, we want to make sure that article.php knows the year and the article title we’re looking for. That means we need to call it like this:

/article.php?year=2013&slug=article-title

To do this, we need to mark which parts of the pattern we want to reuse in the destination. We do this with round brackets or parentheses. By placing parentheses around parts of the pattern we want to reuse, we create what’s called a capturing group. To capture an important part of the source URL to use in the destination, surround it in parentheses.

Our pattern now looks like this, with parentheses around the parts that match the year and slug, but ignoring the slashes:

^([0-9]{4})/([a-z0-9-]+)/$

To use the capturing groups in the destination URL, we use the dollar sign and the number of the group we want to use. So, the first capturing group is $1, the second is $2 and so on. (The $ is unrelated to the end-of-pattern $ we used before.)

RewriteRule ^([0-9]{4})/([a-z0-9-]+)/$ /article.php?year=$1&slug=$2

The value of the year capturing group gets used as $1 and the article title slug is $2. Had there been a third group, that would be $3 and so on. In regexp parlance, these are called back-references as they refer back to the pattern.

Options

Several brain-taxing minutes ago, I mentioned some options as the final part of a rewrite rule. There are lots of options (or flags) you can set to change how the rule is processed. The most useful (to my mind) are:

R=301: Perform an HTTP 301 redirect to send the user’s browser to the new URL. A status of 301 means a resource has moved permanently and so it’s a good way of both redirecting the user to the new URL, and letting search engines know to update their indexes.
L: Last. If this rule matches, don’t bother processing the following rules.

Options are set in square brackets at the end of the rule. You can set multiple options by separating them with commas:

RewriteRule ^([0-9]{4})/([a-z0-9-]+)/$ /article.php?year=$1&slug=$2 [L]

RewriteRule ^about/([a-z0-9-]+).jsp/$ /about/$1/ [R=301,L]

Common pitfalls

Once you’ve built up a few rewrite rules, things can start to go wrong. You may have been there: a rule which looks perfectly good is somehow not matching. One common reason for this is hidden behind that [L] flag.

L for Last is a useful option to tell the rewrite engine to stop once the rule has been matched. This is what it does — the remaining rules in the .htaccess file are then ignored. However, once a URL has been rewritten, the entire set of rules are then run again on the new URL. If the new URL matches any of the rules, that too will be rewritten and on it goes.

One way to avoid this problem is to keep your ‘real’ pages under a folder path that will never match one of your rules, or that you can exclude from the rewrite rules.

Useful snippets

I find myself reusing the same few rules over and over again, just with minor changes. Here are some useful examples to refer back to.

Excluding a directory

As mentioned above, if you’re rewriting lots of fancy URLs to a collection of real files it can be helpful to put those files in a folder and exclude it from rewrite rules. This helps solve the issue of rewrite rules reapplying to your newly rewritten URL. To exclude a directory, put a rule like this at the top of your file, before your other rules. Our files are in a folder called _source, the dash in the rule means do nothing, and the L flag means the following rules won’t be applied.

RewriteRule ^_source - [L]

This is also useful for excluding things like CMS folders from your website’s rewrite rules

RewriteRule ^perch - [L]

Adding or removing www from the domain

Some folk like to use a www and others don’t. Usually, it’s best to pick one and go with it, and redirect the one you don’t want. On this site, we don’t use www.24ways.org so we redirect those requests to 24ways.org.

This uses a RewriteCond which is like an if for a rewrite rule: “If this condition matches, then apply the following rule.” In this case, it’s if the HTTP HOST (or domain name, basically) matches this pattern, then redirect everything:

RewriteCond %{HTTP_HOST} ^www.24ways.org$ [NC]
RewriteRule ^(.*)$ http://24ways.org/$1 [R=301,L]

The [NC] flag means ‘no case’ — the match is case-insensitive. The dots in the domain are escaped with a backslash, as a dot is a regular expression character which means match anything, so we escape it because we literally mean a dot in this instance.

Removing file extensions

Sometimes all you need to do to tidy up a URL is strip off the technology-specific file extension, so that /about/history.php becomes /about/history. This is easily achieved with the help of some more rewrite conditions.

RewriteCond %{REQUEST_FILENAME} !-f
RewriteCond %{REQUEST_FILENAME} !-d
RewriteCond %{REQUEST_FILENAME}.php -f
RewriteRule ^(.+)$ $1.php [L,QSA]

This says if the file being asked for isn’t a file (!-f) and if it isn’t a directory (!-d) and if the file name plus .php is an actual file (-f) then rewrite by adding .php on the end. The QSA flag means ‘query string append’: append the existing query string onto the rewritten URL.

It’s these sorts of more generic catch-all rules that you need to watch out for when your .htaccess gets rerun after a successful match. Without care they can easily rematch the newly rewritten URL.

Logging for when it all goes wrong

Although not possible within your .htaccess file, if you have access to your Apache configuration files you can enable rewrite logging. This can be useful to track down where a rule is going wrong, if it’s matching incorrectly or failing to match. It also gives you an overview of the amount of work being done by the rewrite engine, enabling you to rearrange your rules and maximise performance.

RewriteEngine On
RewriteLog "/full/system/path/to/rewrite.log"
RewriteLogLevel 5

To be doubly clear: this will not work from an .htaccess file — it needs to be added to the main Apache configuration files. (I sometimes work using MAMP PRO locally on my Mac, and this can be pasted into the snappily named Customized virtual host general settings box in the Advanced tab for your site.)

The white screen of death

One of the most frustrating things when working with rewrite rules is that when you make a mistake it can result in the server returning an HTTP 500 Internal Server Error. This in itself isn’t an error message, of course. It’s more of a notification that an error has occurred. The real error message can usually be found in your Apache error log.

If you have access to your server logs, check the Apache error log and you’ll usually find a much more descriptive error message, pointing you towards your mistake. (Again, if using MAMP PRO, go to Server, Apache and the View Log button.)

In conclusion

Rewriting URLs can be a bear, but the advantages are clear. Keeping a tidy URL structure, disconnected from the technology or file structure of your site can result in URLs that are easier to use and easier to maintain into the future.

If you’re redesigning a site, remember that cool URIs don’t change, so budget some time to make sure that any content you move has a rewrite rule associated with it to keep any links working.