Wednesday, March 27, 2013

Reading apache web logs

Reading apache web logs


Whether you're dealing with web server difficulties or just want to see what apache is up to, your best bet is to look in its logs.




Keeping tabs on your web server


Sooner or later you'll want to know more about what your web server is up to. Luckily, apache (like many other server applications) keeps a diary of sorts called a "log".

Well, actually, more than one log, so the analogy isn't terribly good. Unless you think of your web server as a very organized diary-writer, maintaining different diaries for different kinds of events that have happened throughout the day.

Still not a great analogy, but it will do. In plainer terms: Logs are where apache records events like visitors to your site and problems it's encountered.

By default apache writes stuff about its activities in two types of logs — the error log and the access log.

Error log


The error log is where your web server records anything it doesn't think is quite right. Much of the time what gets recorded there are actual errors, like a visiting web client requesting a file that doesn't exist. Sometimes you'll also see warnings in there that don't indicate that a problem has occurred yet, but advise you that a particular event or configuration could cause problems later.

If you're having trouble with your web server this is the place to go first. For example, if you try to start your web server and it fails without telling you anything on the command line, it may be recording a reason in its error log. There you may find out about a misconfiguration or learn that it couldn't bind to the address or port it's configured for (possibly because some other program is already using the port).

Access log


The access log is where your web server records all the visitors to your site. There you can see what files users are accessing, how the web server responded to requests, and other information like what kind of web browsers visitors are using.

The access log can be used with programs called "traffic analyzers" to track the site's usage over time.

It can also be used to watch for unusual client behaviors that indicate someone is looking for a vulnerability they can exploit to hack your machine. If someone is sending unusual requests to an application you're running on your web server (like phpmyadmin or WordPress), it's usually a good idea to make sure you're running the latest version of the software.

Where to find your web logs


Before you can read your logs you'll need to find them. The most straightforward way to do that is to look for the configuration directives that tell apache where to create them.

Error log


To find the error log look in your main apache config file. The error log should be defined there with the "ErrorLog" directive. For example:
ErrorLog /var/log/apache2/error.log

Note that a lot of systems will restrict the permissions for apache's log directory to just root, so you may need to use the sudo command to look at the error log. For instance:
sudo cat /var/log/apache2/error.log

Access log


The access log is typically defined inside a virtual host block but can sometimes have a default defined in the main apache config file. You'll want to look for the "CustomLog" directive:
CustomLog /var/log/apache2/access.log combined

The first argument to CustomLog gives the file's location. The second argument ("combined") defines the format of the log. We'll get into what that means and how to change it later.

If a default CustomLog is defined in the main apache configuration and a different CustomLog is defined within a virtual host, the access log (or logs) defined in the virtual host will replace the default access log for just that virtual host.

Reading the logs


Now that you know where to find the logs, let's look at what's inside each. And most importantly, let's look at what they can tell you about your web server.

Error log


The error log is where the server will log, well, errors. These are usually errors the program encountered when trying to start a process or use a module, but they can also be errors that were sent to web clients, like a "file not found" error.

An error log entry for a file not found error would look something like:
[Mon Aug 23 15:25:35 2010] [error] [client 80.154.42.54] File does not exist: /var/www/phpmy-admin

In this case, a web client tried to visit a page in a "phpmy-admin" directory that didn't exist. Fortunately I happen to know that I don't have phpmy-admin installed, so it's not a broken link I need to fix. It's just some script kiddie looking for an exploitable version of that software. It's a good indication that I should install a program like fail2ban to block people like him.

Error log components


The first part of the log entry is the date and time (server time) when the event occurred. Apart from just being informative, that time can be useful for looking for entries in other logs at the same time. In this case I could check the access log to see the full URL that the web client tried to visit. If it were an error that indicated a module had trouble talking to a database, then I could look in the database server's logs at the same time to see what prevented the connection from happening.

The next part, "[error]", describes the level of the alert. This will often be "error", but sometimes other levels will indicate that the message logged is just a warning, or it may represent a critical error that caused the web server to shut down or fail to start.

The next part, "[client 80.154.42.54]", shows the source of the error. In this case the source is a web client, so the visitor's IP address was logged.

The last part of the log entry is the error itself.

Combined log format


The most common format for web log entries (and the default for most modern web servers) is the "combined" format, also referred to as "CLF" (Combined Log Format). A log entry in combined log format might look like this:
123.65.150.10 - - [23/Aug/2010:03:50:59 +0000] "POST /wordpress3/wp-admin/admin-ajax.php HTTP/1.1" 200 2 "http://www.example.com/wordpress3/wp-admin/post-new.php" "Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_4; en-US) AppleWebKit/534.3 (KHTML, like Gecko) Chrome/6.0.472.25 Safari/534.3"

There's a lot of stuff there, but when you break the log entry down it contains a standard set of information in a standard order.

Combined log components


The first entry is the IP address of the web client accessing your server.

The second entry above is "-", which is what gets logged when there's nothing to put in that part of the log. In this case, the entry would represent the name of a remote log, if one were being used. You'll pretty much always see "-" here.

The third entry above is another "-". That slot contains the username the web client was authorized under, if any. If you enabled password protection for a file or directory, then the username the visitor used to log in would be recorded here.

The next entry is the date and time of the access.

The next entry is the first line of the request the web client sent to the server. In this case it's:
POST /wordpress3/wp-admin/admin-ajax.php HTTP/1.1

That entry means the web client sent a "POST" request (a submission of information) to the file at "/wordpress3/wp-admin/admin-ajax.php". That's a relative location, which means that if you wanted to find that file you'd start at the document root of that virtual host. If your document root was "/var/www", then the file being accessed above would be at "/var/www/wordpress3/wp-admin/admin-ajax.php". The last entry describes the protocol used for the request, in this case HTTP version 1.1.

The next entry tells us the status code that was returned for the request. The code above, "200", is hopefully one you'll see most often in your access logs — it means that the file was found and served to the client. Other common status codes are "403" (access forbidden) and "404" (file not found). We go into more detail about status codes in another article, but for a full list you can visit the official w3 website's list of status codes.

The next number is the size of the response your server sent, in bytes. In this case it was a very small response (2 bytes), so it was likely just an acknowledgement from the server rather than a full page access.

The next entry is the "referrer URL". In this case the entry is:
http://www.example.com/wordpress3/wp-admin/post-new.php

That's the page the web client visited before sending the recorded access request. Usually that means it's the page that linked to the one they accessed. The referrer can be useful information if you're wondering where people are finding links to your site (from a Google search, or a link from a partner site), or if you want to find the page that contained a bad link if the access entry was an error.

The last entry is called the "user agent". Most of the time that just means it's the identifier used by the web browser the visitor used. In this case, the user agent was:
Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_4; en-US) AppleWebKit/534.3 (KHTML, like Gecko) Chrome/6.0.472.25 Safari/534.3

The user agent is pretty specific sometimes. In this entry the web browser told the server not only what its name was (Chrome, in this case), but also what operating system it's running on (Mac OS X), the version of the browser and the system, and the components that the web browser uses from the operating system. It's usually a lot more than you need, but if you know your site will display differently in different browsers, all that information can be used by a web application to tailor the page it returns to look best on that particular visitor's browser.

Putting them together


Whew! Lots of stuff there, but it's useful stuff. To give another example, let's look in again on our would-be intruder from earlier. Looking in the error log I saw what time he tried to access the non-existent directory. By looking at the same time in the access log I can see more information about what he tried to access:
80.154.42.54 - - [23/Aug/2010:15:25:35 +0000] "GET /phpmy-admin/scripts/setup.php HTTP/1.1" 404 347 "-" "ZmEu"

There's the same IP address and the same time. So we can see that his script used a "GET" method (a request for a page) to ask for the setup script for php-myadmin. The "404" status means that the file wasn't found. And while the user agent entry certainly isn't any web browser on the market, some web searches will turn up other people who have been hit by what is probably the same script. So even if that user agent isn't the browser, it can be useful in determining the type of attack your site was experiencing, and how many 404s you can expect your server to have to handle when it hits you.

Common log format


It isn't used a lot anymore, but you may run into a CustomLog directive that uses the log format of "common" in some older configurations of apache. The "common" format is similar to "combined", but omits the referrer and user agent information at the end of the log entry. Otherwise it can be read the same way as a "combined" format log.

Useful commands


Here are a few commands that can make browsing log files a little quicker or easier. These are very basic overviews of the commands in question. For more information on each you can check their respective man pages.

cat


The "cat" command simply displays the contents of a file. To see the whole error log all at once, you might run:
sudo cat /var/log/apache2/error.log

less


If the log you want to look at is particularly large, you probably don't want to look at the whole thing at once. To browse through a file you can use the "less" command:
sudo less /var/log/apache2/error.log

While less is displaying a file you can hit the space bar to page down, and the up and down arrow keys on your keyboard to scroll up or down one line at a time.

tail


The "tail" command returns lines from the end of a file. By default it displays the last ten lines of the file, so this command would display the last ten lines of an access log:
sudo tail /var/log/apache2/access.log

To specify the number of lines to grab, use the "-n [number]" option. To display the last 100 lines of the access log, you could run:
sudo tail -n 100 /var/log/apache2/access.log

You can also save yourself a little typing by just using "-[number]" instead of "-n [number]", as in:
sudo tail -100 /var/log/apache2/access.log

The tail command is useful if you're just looking for recent activity in a log. If you want to watch the end of a file for changes as they happen, you can use the "-f" option:
sudo tail -f /var/log/apache2/access.log

With this version of tail running, when a new line is added to the log file you'll see it on your screen too. To get out of tail when it's in this mode use control-C.

grep


If you're looking for a particular item in a web log (like a certain IP address, or any "404" responses), skimming through the log manually can be tiresome. It's easier to let the "grep" command do the work for you.

The grep command will look through its input, or a file, and return any lines that contain the search term sent to it. To look for the term "404" in an access log, you might run:
sudo grep 404 /var/log/apache2/access.log

The first argument is the text grep is searching for, and the second argument is the file to search.

If you want to look for a phrase, you can do that by enclosing the phrase in quotes. To look for requests for a particular file, you could run:
sudo grep "GET /images/butterfly.jpg" /var/log/apache2/access.log

By default grep's searches are case-sensitive. If you specify capital letters like "GET", then lines with "get" in lowercase letters won't be returned as hits. To make the search case-insensitive, pass grep the "-i" option, as in:
sudo grep -i "get /images/butterfly.jpg" /var/log/apache2/access.log

You can combine tail and grep by using what's called a "pipe":
sudo tail -n 100 /var/log/apache2/access.log | grep 404

The first part of that statement just lists the last 100 lines of the access log. The next character, "|", is the "pipe". It redirects the output of the last command and sends it to the next command in the statement. In this case that second command is "grep", searching for 404. So the above command would return any 404 errors found in the last 100 access log entries.

No comments:

Post a Comment