Wednesday, March 27, 2013

Interpreting common status codes in web logs



The status codes you find in your web logs are useful troubleshooting tools, but only if you know what they mean.




Status codes


When a web browser talks to a web server, the server lets the client know the status of its request by sending a "status code". This status code shows up in the server's access logs as a number. There are a lot of different status codes a server can return, and you can view the full list on the W3C's website.

Fortunately there are only a few status codes that you're likely to see in your access logs, so consider the following descriptions to be highlights from the full list of status codes.
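If you want a quick sense of which status codes your server is returning most often, you can tally them straight from the log. Here's a minimal sketch that assumes Apache's default "combined" log format (where the status code is the ninth whitespace-separated field) and a log file at /var/log/apache2/access.log, both of which may differ on your setup:
awk '{print $9}' /var/log/apache2/access.log | sort | uniq -c | sort -rn

Each line of the output is a count followed by a status code, with the most frequent codes at the top.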

200 - OK


The 200 status code indicates that the request was successful. This is the one you want to see in your logs. At its most basic it means that when a web browser asked for a file, the server was able to find it and send it back to the browser.
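For illustration, a successful request in an Apache-style access log might look like the following line (the exact layout depends on your configured log format, and the address, file, and sizes here are made up):
127.0.0.1 - - [27/Mar/2013:10:15:32 +0000] "GET /index.html HTTP/1.1" 200 2326

The 200 after the quoted request is the status code, and the 2326 after it is the number of bytes sent back.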

403 - Forbidden


The 403 status code indicates that the server understood the request but refuses to fulfill it: the client is not allowed to access the requested resource.

One circumstance that can cause a 403 status is a directory that doesn't have "Indexes" enabled and doesn't contain an index file the server can access. In other words, the client asked for a directory, and the server found nothing there it could show to the client.
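If you actually want the server to generate a listing for a directory with no index file, enabling indexes for that directory in the Apache configuration would look something like this (the path here is just an example):
<Directory /var/www/wordpress>
    Options +Indexes
</Directory>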

A more common circumstance is that the permissions on the file or directory being requested don't allow access by the web server's user. If the web server is running as user "www-data", any files you want the web server to serve will have to be accessible by the user "www-data". For example, if a directory's permissions look like:
drwx------ 5 root     root     4096 2009-12-18 01:39 wordpress

Then the user "www-data" will not be able to access any of the files inside. Requests for the "wordpress" directory or any of its contents will yield 403 status codes instead of the requested files.

For more information on how Linux file permissions work, you can read this article series. In a nutshell, the web server user needs to have read permission for files in order to serve them, and it has to have read and execute permissions for directories in order to see files inside them.
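Using the "wordpress" directory from the example above, one way to open up read access for the web server would be something like the following sketch (double-check it against your own security requirements before running it):
chmod 755 wordpress
find wordpress -type d -exec chmod 755 {} +
find wordpress -type f -exec chmod 644 {} +

That makes the directories world-readable and world-searchable (755) and the files world-readable (644), which covers the "www-data" user without changing the files' owner.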

404 - Not found


A 404 status code means that the requested file could not be found. If you see this error often, you should check the links on your site to make sure they're pointing to the right places.

Since Linux filesystems are typically case-sensitive, you should also make sure the capitalization of the request in the URL matches the name of the file on disk. For example, if a file is named "File.txt" and the URL requests "file.txt", the web server won't find the file. Either the URL or the file name would need to be changed so the capitalization matches.
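If you suspect a capitalization mismatch, a case-insensitive search will show you the name as it actually exists on disk (the path here is just an example; point it at your document root):
find /var/www -iname 'file.txt'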

A couple of commonly requested files are worth noting.

robots.txt


If you see 404 errors connected to a file named "robots.txt", that's the result of a spider program (like the ones web search engines use) checking your preferences for indexing your site.

If you don't want to restrict web spiders' access to your site at all, you can just create an empty robots.txt file and the 404 errors will go away.
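Creating that empty file is a one-liner, assuming your document root is /var/www/html (adjust the path for your site):
touch /var/www/html/robots.txt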

The robots.txt file can be useful if there are parts of the site that you want search engines to ignore. If you don't want search engines to record anything in the "orders" or "scripts" directories on your site, for example, you could use the following robots.txt file:
User-agent: *
Disallow: /orders/
Disallow: /scripts/

A slash at the end of a Disallow entry lets the search engine robot know that the entry refers to a directory.

The "User-agent" part of the file describes what user agent the robots.txt would apply to. The "*" means that you want the rule to apply to everybody. You can have more than one User-agent entry in a robots.txt file, as in:
User-agent: EvilSearch
Disallow: /

User-agent: *
Disallow:

In that file, the EvilSearch engine's robot would be asked not to record anything on the site (thus the "/"), while every other robot would be allowed to record anything it can find (which is what the empty argument to Disallow means).

Note that the robots.txt instructions aren't enforced in any way. A spider can freely ignore them. The better search engines (the ones you've heard of) tend to obey the robots.txt file, while spiders used by spammers and email harvesters will ignore robots.txt entirely.

favicon.ico


Any 404 errors connected to "favicon.ico" are the result of a web browser checking for a favorites icon for the site. That's another "file not found" error that can be safely ignored if you don't want to make a favorites icon for the site.

The favorites icon is often used by modern browsers both as an icon in a bookmarks list and as an identifying icon in a tabbed interface. If you've noticed that bringing up a site puts an image associated with the site next to your address bar or in the tab for that page, the favicon.ico file is where your browser got that image.

There are ways to point a browser to another file for the favorites icon, but if you want to make a quick-and-dirty favorites icon there are several utilities on the web that either allow you to create your own or convert an image file. Once you've generated the favicon.ico file you can upload it to the document root of your site and the associated 404 errors should stop appearing in your log.
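If you'd rather keep the icon somewhere other than the document root, the usual approach is a link element in the head section of your pages; the file name and path here are just an illustration:
<link rel="icon" href="/images/site-icon.ico">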

500 - Internal server error


The 500 status code is kind of a catch-all error code for when a module or external program doesn't do what the web server was expecting it to do. If you have a module that proxies requests back to an application server behind your web server, and the application server is having problems, then the server could return a 500 error to web clients.
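As a concrete example of that kind of setup, an Apache mod_proxy configuration along these lines (the backend address is an assumption) hands matching requests to an application server on port 8080, and errors from that backend can surface in your logs as 500-series status codes:
ProxyPass /app/ http://localhost:8080/app/
ProxyPassReverse /app/ http://localhost:8080/app/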

503 - Service unavailable


The 503 status code appears when the web server can't create a new connection to handle an incoming request. If you see this status code in your logs it usually means that you're getting more web traffic than can be handled by your current web server configuration. You'll then need to look into increasing the number of clients the server can handle at one time in order to be rid of this status code.
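With Apache 2.2's prefork MPM, for example, the number of simultaneous clients is governed by MaxClients. A sketch (the right values depend on how much memory your server has and how heavy each request is):
ServerLimit      256
MaxClients       256

Raising those values lets the server accept more simultaneous connections at the cost of additional memory; raise them too far and the server can start swapping, which is worse than the 503s.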
