Yuvalinux - Place to Learn Linux: March 2013

Wednesday, March 27, 2013

Scheduling awstats report generation

We've looked at running awstats reports, but only manually. Let's automate report generation so all you need to worry about is looking at those sweet, sweet numbers.

Automating awstats

In the previous article in this series we set up awstats for your site and ran an update of the reports manually. That's all well and good, but the command to update is big and ugly, and it would be kind of a pain to have to run it every time you want to view updated stats.

Fortunately Linux is chock full of ways to automate stuff. One way is to create a cron script to do all the updating, but since we're shooting for simple we should use a tool that's already processing your web logs on a regular basis. We can piggyback our updates onto logrotate's regular rotation tasks.

Scheduling reports with logrotate

With this approach we'll take advantage of the fact that logrotate is already performing regular log rotation for your domain. I mean, you do have your logs rotating automatically, right?

If not, visit this article series on logrotate and follow the directions there to set up log rotation for your virtual host. Log rotation keeps those logs from becoming giant disk-space-eating behemoths, so it's a very good idea. Really, do it now. I'll wait.

Now that we're certain you have logrotate managing your web logs, let's add a step to what logrotate does when it performs the rotation.

Editing the logrotate entry

Let's look at the logrotate.d file for a virtual host that's just a modification of the default entry for apache on Ubuntu:

/home/demo/public_html/example.com/logs/*.log {
    weekly
    missingok
    rotate 52
    compress
    delaycompress
    notifempty
    create 640 root adm
    sharedscripts
    postrotate
        if [ -f "`. /etc/apache2/envvars ; echo ${APACHE_PID_FILE:-/var/run/apache2.pid}`" ]; then
            /etc/init.d/apache2 reload > /dev/null
        fi
    endscript
}

Don't worry too much about most of those entries if you haven't seen them before. There are only a couple important bits to note here.

First, take a look at the "weekly" directive in that logrotate file. That's okay for simple log rotation, but you probably want your web traffic stats updating daily. In that case, change it to "daily" so the rotation (and anything attached to it) runs every day. Keep in mind that "rotate 52" keeps a year of weekly logs but only 52 days of daily ones, so you may want to raise that number to keep more archived log files.

Next we look at what we want to add to the logrotate process. See that "postrotate" block up there? Don't worry about what's inside, just note that it's there. What that does is run some stuff after the log rotation is done (in this case, tell apache to reload if it's running). The reason we're looking at it is because there's also a "prerotate" directive that we can use to have awstats run through a log file before it's rotated.

The "prerotate" directive should run the stats update and generate the reports. And it should run just like the command we ran to get our reports. We would create a prerotate block like:

prerotate
    /usr/local/awstats/tools/awstats_buildstaticpages.pl -update -config=www.example.com -dir=/home/demo/public_html/example.com/public/webstats -awstatsprog=/usr/local/awstats/wwwroot/cgi-bin/awstats.pl > /dev/null
endscript

The "> /dev/null" bit redirects the normal output of the command so it doesn't get sent to the console or emailed to root.
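One detail worth knowing about that redirect: a plain "> /dev/null" discards only standard output. Anything the command writes to standard error can still reach the console or root's mailbox, which is usually what you want from a scheduled job. Here's a minimal sketch of the difference. The "fake_report" function is a made-up stand-in, not part of awstats:

```shell
#!/bin/sh
# fake_report is a hypothetical stand-in for the real command:
# one line of normal output, one line of error output.
fake_report() {
    echo "Build main page"
    echo "Error: config file not found" >&2
}

# What survives "> /dev/null": only the error line.
kept=$(fake_report 2>&1 > /dev/null)
echo "still visible: $kept"

# Adding "2>&1" after the redirect silences errors as well.
silenced=$(fake_report > /dev/null 2>&1)
echo "fully silenced: [$silenced]"
```

Leaving errors visible means a broken config path shows up in your mail instead of failing silently.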

Inserted into the existing logrotate file for our virtual host, the whole thing would look like:

/home/demo/public_html/example.com/logs/*.log {
    daily
    missingok
    rotate 52
    compress
    delaycompress
    notifempty
    create 640 root adm
    sharedscripts
    prerotate
        /usr/local/awstats/tools/awstats_buildstaticpages.pl -update -config=www.example.com -dir=/home/demo/public_html/example.com/public/webstats -awstatsprog=/usr/local/awstats/wwwroot/cgi-bin/awstats.pl > /dev/null
    endscript
    postrotate
        if [ -f "`. /etc/apache2/envvars ; echo ${APACHE_PID_FILE:-/var/run/apache2.pid}`" ]; then
            /etc/init.d/apache2 reload > /dev/null
        fi
    endscript
}

And with that, every time the logs get rotated, the web stats get updated too.
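Before trusting the edited entry to the nightly run, it can be worth asking logrotate to read it in debug mode. The "-d" flag makes logrotate print what it would do without changing the logs or its state file. A small sketch follows; the "dry_run_rotation" helper and the file name under /etc/logrotate.d are assumptions, so substitute the name of your own entry:

```shell
#!/bin/sh
# dry_run_rotation CONFIG
# Ask logrotate to describe what it would do with CONFIG.
# Returns 2 if logrotate or the config file can't be found.
dry_run_rotation() {
    command -v logrotate > /dev/null 2>&1 || return 2
    [ -r "$1" ] || return 2
    logrotate -d "$1"
}

# Hypothetical file name; use whatever your entry is called.
dry_run_rotation /etc/logrotate.d/example.com ||
    echo "dry run did not complete (is logrotate installed and the file readable?)"
```

You will probably need sudo to read the log directory. A syntax slip such as a missing "endscript" should show up as an error in the debug output.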

Why use logrotate?

Instead of using logrotate to run the web stats update we could just schedule the stats to update through cron. So why insert the commands into logrotate?

The main reason is accuracy. By sticking a web stats update into the log rotation process we make sure that awstats is looking at log entries up until the last possible moment, when the log is rotated. If you have the web server reloading instead of restarting there might be a couple log entries missed (since the old log file would still be used as the old connections finish). The information lost in that case is negligible and not worth the downtime required to fully restart the web server.

If you want the web traffic reports to update more often than the web logs are rotated, you can use cron to run the update and report-building scripts on a more frequent schedule (like hourly). Awstats will recognize log entries that have already been processed and skip them when analyzing the data.
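If you go the hourly route, a script along these lines could be dropped into /etc/cron.hourly (assuming your distribution has that directory). Treat it as a sketch: the paths are the example paths from this article, and "update_stats" is just a name picked for illustration.

```shell
#!/bin/sh
# Sketch of an hourly update script, e.g. saved as
# /etc/cron.hourly/awstats-www-example-com (hypothetical file name).
AWSTATSDIR=/usr/local/awstats
DOMAIN=www.example.com
REPORTDIR=/home/demo/public_html/example.com/public/webstats

update_stats() {
    build="$AWSTATSDIR/tools/awstats_buildstaticpages.pl"
    # Bail out if awstats isn't where we expect it.
    [ -x "$build" ] || return 1
    "$build" -update -config="$DOMAIN" -dir="$REPORTDIR" \
        -awstatsprog="$AWSTATSDIR/wwwroot/cgi-bin/awstats.pl" > /dev/null
}

update_stats || echo "awstats update did not run for $DOMAIN" >&2
```

The message on the last line goes to standard error, so cron would mail it to root if the update can't run.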

Monthly reports

Something you'll notice if you leave awstats running the way we've set it up is that the reports being generated only go into detail for the current month. Dividing the information up by month is a decent interval, but you may want to look at previous months in detail instead of only the current one.

So it's entirely optional, but if you'd like to keep monthly reports around it's really just a matter of tailoring a shell script to fit your site.

A cron script

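We'll name the script after the awstats config file for the site in question and put it in the cron monthly directory. (One caveat: on Debian and Ubuntu the scripts in that directory are launched by run-parts, which skips file names containing dots. On those systems use a name without dots, like "awstats-www-example-com", wherever the file name appears below.) Using "www.example.com", we would create the file: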

/etc/cron.monthly/awstats.www.example.com

Inside the file put the following script:

#!/bin/sh
#
# Run the awstats build script to generate a report for last month.
#

# Modify these 3 variables for your environment:
#
# The location of the awstats installation
AWSTATSDIR=/usr/local/awstats

# Your main domain, as you reported it to awstats
DOMAIN=www.example.com

# The directory where you're storing the reports for this domain
REPORTDIR=/home/demo/public_html/example.com/public/webstats

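LASTMONTH=`date -d "last month" +%B`
LASTMONTHNUM=`date -d "last month" +%m`
# The year last month fell in (not the current year when run in January)
LASTMONTHYEAR=`date -d "last month" +%Y`
LASTMONTHDIR=$REPORTDIR/$LASTMONTH

mkdir -p "$LASTMONTHDIR"
cp -Rf "$AWSTATSDIR/wwwroot/icon" "$LASTMONTHDIR/awstatsicons"

# Pass both month and year so a January run reports on last December
"$AWSTATSDIR/tools/awstats_buildstaticpages.pl" -month=$LASTMONTHNUM -year=$LASTMONTHYEAR -update -config=$DOMAIN -dir="$LASTMONTHDIR" -awstatsprog="$AWSTATSDIR/wwwroot/cgi-bin/awstats.pl" > /dev/null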

Change the values of "AWSTATSDIR", "DOMAIN", and "REPORTDIR" to match your environment and site.

After saving the file make your new script executable:

sudo chmod a+x /etc/cron.monthly/awstats.www.example.com

You can even run that script now to test it (and to get a detailed report for last month's data, if you have some). If you do, be sure to run it using sudo.

What it does

When the script runs it creates (if it doesn't already exist) a directory named after last month. If the current month is September, the directory will be named "August" (or the equivalent for your machine's locale setting). The awstats icons directory will be copied there, then a report will be generated for last month and put in the month's directory.

To look at a monthly report you'll use a similar URL to viewing the current report, but you'll insert the name of the month you want to view in front of the page name. The first letter of the monthly directory names is capitalized, so it will have to be capitalized in the URL used to visit a monthly report as well.

To take our example URL from earlier and use it to look at the August statistics we would change it to:

http://www.example.com/webstats/August/awstats.www.example.com.html

Keeping more monthly reports

Note that the way the script is written the monthly reports get replaced every year (since they're only made available by month name, not by year).

If you want monthly reports to never get overwritten you can modify the script to add the year to the directory names. Change the line that defines "LASTMONTH" above to something like:

LASTMONTH=`date -d "last month" +%B-%Y`

If the current month were September 2010, the above line would cause the script to use the directory name "August-2010".
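One wrinkle if you run the script by hand rather than from cron.monthly: GNU date works out "last month" by stepping the month back and keeping the day number. On the 31st that can land on a date that doesn't exist and roll forward into the current month again. A common workaround is to step back from the first of the month instead. Here's a minimal sketch ("prev_month" is a made-up helper name, and it assumes GNU date, just like the script does):

```shell
#!/bin/sh
# prev_month REFERENCE_DATE FORMAT
# Print the month before REFERENCE_DATE using a date(1) format string.
# Requires GNU date.
prev_month() {
    # Snap to the first of the month so stepping back is always safe.
    first=$(date -d "$1" +%Y-%m-01) || return 1
    date -d "$first 1 month ago" "+$2"
}

prev_month 2010-09-14 %B-%Y        # directory name for a September 2010 run
prev_month 2011-01-01 %m-%Y        # month and year for a January run
prev_month "$(date +%Y-%m-%d)" %B  # last month's name, relative to today
```

The month name that comes out still depends on your machine's locale setting. The numeric formats (%m and %Y) don't.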

Generating and viewing awstats reports

Now that awstats is installed we take a look at actually running the analysis and viewing the reports.

Awstats in action

If you followed along with the first part of this series you should have awstats installed and configured for your site. In this article we'll look at a simple approach to report generation from the command line.

This approach will create static html pages to display your web traffic.

Build a report

Time to tell awstats to generate your reports. Fortunately for our "start with something simple" approach, there's a script that rolls generating several reports into one step.

awstats_buildstaticpages.pl

We're going to use a script that's included with awstats, "awstats_buildstaticpages.pl". This script updates the stats and generates a bunch of standard reports, using the main "awstats.pl" script behind the scenes. For a closer look at what reports this script will build, check the awstats online documentation.

For our example the command would look like:

sudo /usr/local/awstats/tools/awstats_buildstaticpages.pl -update -config=www.example.com -dir=/home/demo/public_html/example.com/public/webstats -awstatsprog=/usr/local/awstats/wwwroot/cgi-bin/awstats.pl

Okay, yeah. I admit that's kind of long. But it's not as scary as it seems, honest. Especially since you won't have to memorize it.

Let's break that down so you know what to put where.

The script itself

/usr/local/awstats/tools/awstats_buildstaticpages.pl

This part is the script we're running, "awstats_buildstaticpages.pl". If you installed to a location other than "/usr/local/awstats" you'll want to change this part to point to the actual location of the script on your machine.

The -update option

-update

Including "-update" at the beginning of the options tells the script to update the stats analysis before generating the reports.

The -config option

-config=www.example.com

The "config" value should be the main domain name for the site. Note that this domain matches up with the name of the config file you created in the first part of this series. The name of your config file should have "awstats." before the main domain name, and ".conf" after it, since that's pretty much what this script will be looking for.

In short, replace "www.example.com" with your main domain name.

The -dir option

-dir=/home/demo/public_html/example.com/public/webstats

The "-dir" option refers to the directory where you want awstats to create its reports. That directory should contain an "awstatsicons" directory containing awstats' standard image files.

The -awstatsprog option

-awstatsprog=/usr/local/awstats/wwwroot/cgi-bin/awstats.pl

For "-awstatsprog" you'll want the value to be the location of the "awstats.pl" script, which is the main awstats script. If you installed awstats someplace other than "/usr/local/awstats", adjust accordingly.
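To see how the pieces fit together, here's a small sketch that assembles the same command from three variables and prints it instead of running it. The variable names and the "report_command" helper are only illustrative:

```shell
#!/bin/sh
# Values from the example in this article; adjust for your own setup.
AWSTATSDIR=/usr/local/awstats
DOMAIN=www.example.com
REPORTDIR=/home/demo/public_html/example.com/public/webstats

# Print the full report-building command on one line.
report_command() {
    printf '%s -update -config=%s -dir=%s -awstatsprog=%s\n' \
        "$AWSTATSDIR/tools/awstats_buildstaticpages.pl" \
        "$DOMAIN" \
        "$REPORTDIR" \
        "$AWSTATSDIR/wwwroot/cgi-bin/awstats.pl"
}

report_command
```

Printing first makes it easy to eyeball the paths before running the real thing with sudo.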

The script's results

Once you run that big command (all on one line) you should see that the script launches the awstats update process, then tells you about every one of the 20 reports it's generating.

If the script encountered an error it should give you some troubleshooting advice (like making sure you used the right "config" identifier).

Note that the last line, the "Main HTML page" line, gives the main page of the report.

If you take a look in your reports directory you should now see a bunch of html files there:

$ ls /home/demo/public_html/example.com/public/webstats
awstats.www.example.com.alldomains.html
awstats.www.example.com.allhosts.html
...

View the report

Now we get to see the results of our hard work. Point your browser to the "main html file" that was identified by the script we ran to generate the report.

http://www.example.com/webstats/awstats.www.example.com.html

The important part here is working out the address you'll use to view the reports. If you discover at this point that you created the reports in a directory you can't see from a browser, you may want to make a new reports directory. Copy the "awstatsicons" directory into it, change the "-dir" option accordingly, then run the report generation again to make sure it works with the new directory.

If all goes well you'll see something like:

[Screenshot: Awstats example]

(Without the smudges, of course.)

You might see less than a day's worth of traffic in this initial report, or perhaps a week, depending on how often your web logs are rotated. So not a lot that's interesting just yet, but enough to make sure the reports were generated properly.

Visits, hits, pages and bandwidth

There are a bunch of reports available, linked at the top of your main report's page. The main statistics you'll see at the beginning of the report bear some quick explanation, just so you know what you're looking at.

Unique visitors

The "unique visitors" stat tracks the number of different visitors your site received. For awstats this mostly means the number of unique IP addresses it saw in your web logs. This number isn't perfectly accurate, since visitors behind proxy servers and home routers can throw it off a bit (since those visitors would only appear in your web logs under the IP addresses belonging to the proxies or routers).

Number of visits

This stat tracks the total number of visits the site received, return visits included. A "visit" for these purposes encompasses all page hits from a visitor within an hour or so of each other. If the same IP address appears in the web logs the next day, that counts as a second visit.

Pages

A "page", in web traffic terms, is the main page of a visited URL. This would be the HTML or PHP file that was requested by the visitor. If a page includes the contents of other HTML files, only the main page is counted as a "page" in the traffic stats.

Hits

Pretty much everything a web browser asks for from a site is a "hit". The main page, headers and footers, images, videos — everything the browser has to ask for is a hit. A complex site will produce a fair number of hits per page visit.

Bandwidth

In the combined log format the web server records the size of each response it sends to the browser. The total of those outgoing response sizes is the "bandwidth" statistic in awstats. This is not necessarily the total bandwidth used by the site. It's just the total that got recorded in your web server's access logs.
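If you'd like a feel for where these numbers come from, the rough sketch below tallies similar figures straight from a combined-format log with awk. It is only an approximation of the idea, not how awstats itself counts (awstats also filters robots, tracks visits over time, and so on). The "summarize_log" name and the list of non-page file extensions are assumptions made for this example:

```shell
#!/bin/sh
# summarize_log LOGFILE
# Rough tallies from a combined-format access log.
summarize_log() {
    awk '
    {
        hits++                      # every logged request is a hit
        ips[$1] = 1                 # field 1 is the remote host
        path = $7                   # field 7 is the requested path
        sub(/\?.*/, "", path)       # ignore any query string
        # Treat anything that is not a stylesheet, script or image as a page
        if (path !~ /\.(css|js|gif|jpe?g|png|ico)$/) pages++
        # Field 10 is the response size, or "-" when nothing was sent
        if ($10 ~ /^[0-9]+$/) bytes += $10
    }
    END {
        n = 0
        for (ip in ips) n++
        printf "visitors=%d pages=%d hits=%d bytes=%d\n", n, pages, hits, bytes
    }' "$1"
}

# Try it on a three-line sample log.
sample=$(mktemp)
cat > "$sample" <<'EOF'
192.0.2.10 - - [23/Aug/2010:03:50:59 +0000] "GET /index.html HTTP/1.1" 200 1000 "-" "UA"
192.0.2.10 - - [23/Aug/2010:03:51:00 +0000] "GET /logo.png HTTP/1.1" 200 500 "http://www.example.com/index.html" "UA"
198.51.100.7 - - [23/Aug/2010:04:00:00 +0000] "GET /about.php?x=1 HTTP/1.1" 304 - "-" "UA"
EOF
summarize_log "$sample"    # prints: visitors=2 pages=2 hits=3 bytes=1500
rm -f "$sample"
```

Two visitors, three hits, but only two pages: the image request counts as a hit and adds to the bandwidth without counting as a page.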

A note about referer spam

You may notice that the "referer" information in your reports contains links to referring web sites. This is useful for checking out sites that are linking to you but there's a potential drawback to putting this information on a web page, and that's "referer spam".

There's a school of thought among less-reputable web admins that encourages doing whatever you can to increase your search engine ratings. One of those tactics involves finding a site with a publicly-accessible web stats page and then running a script that visits the site a bunch of times using their web site as a referrer. The theory is that search engines will count the stats page as another site linking to their site.

In practice it doesn't work that well (most major search engines are wise to the practice and account for it), but that doesn't mean we should encourage the inconsiderate jerks to keep trying it.

The preferred way to keep the stats pages from being used for spamming purposes is to protect the stats directory from unauthorized access. You can do that by password-protecting that part of the site, or by restricting access to that directory to just localhost and using ssh tunneling to view your stats.

If you want to keep your stats public you should at least modify your site's "robots.txt" file to tell the major search engines not to index your stats pages. If you don't have a robots.txt file in the document root of your site this is a good time to create one.

Inside the robots.txt file you just need to add a "Disallow" rule for the web stats directory. If you don't have a robots.txt file already, you can use something like the following:

User-agent: *
Disallow: /webstats/

That would tell any robot that complies with the robots.txt file not to index the "webstats" part of the site. That way your stats site won't show up on major search engines at all, defeating the purpose of any efforts to manipulate your referers report.

If you want your web stats to show up on search engines for some reason, then at least tell robots not to index the referer page report:

User-agent: *
Disallow: /webstats/awstats.www.example.com.refererpages.html
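If you'd rather script the robots.txt change, something like the following sketch would do it. The "add_disallow" helper is a made-up name. Note that it simply appends to an existing file, so the rule lands under whatever "User-agent" group comes last. If your robots.txt already has several groups, check the result by hand.

```shell
#!/bin/sh
# add_disallow DOCROOT PATH
# Make sure DOCROOT/robots.txt contains the line "Disallow: PATH".
add_disallow() {
    robots="$1/robots.txt"
    rule="Disallow: $2"
    # Start a new file with a catch-all group if none exists yet.
    [ -f "$robots" ] || printf 'User-agent: *\n' > "$robots"
    # Append the rule only if that exact line isn't already present.
    grep -qxF "$rule" "$robots" || printf '%s\n' "$rule" >> "$robots"
}

# Demonstration in a scratch directory standing in for the document root.
docroot=$(mktemp -d)
add_disallow "$docroot" /webstats/
add_disallow "$docroot" /webstats/    # second call changes nothing
cat "$docroot/robots.txt"
```

On a real site the first argument would be your document root, such as /home/demo/public_html/example.com/public in our example.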

How to Install awstats on Linux

The awstats program is a versatile tool for generating web traffic reports. We'll walk through a simple installation to track stats for your site.

Web log analysis

If you run a web site you might get curious about statistics like how many people visit your site each month and what sites or search engines they used to find you. That's where web traffic analysis comes in.

There are many options for analyzing your traffic, but in this article series we'll look at a program called "awstats". Awstats runs through your web logs (which are lying around on the disk anyway) and generates reports based on what it finds. The reports break down the data to show you information like what the more popular parts of your site are, what search terms people used to get there, and which search engines have spidered your site lately.

We'll aim for a simple approach to analyzing logs with awstats in this series. There are some nifty features of awstats we won't be using (like dynamically generating reports via CGI), but the benefit of this approach is that it's light on resource use and easy to set up no matter what web server you use.

Why not just use page tagging?

A different approach to web traffic analysis is called "page tagging", used by services like Google Analytics. It involves embedding a javascript tag in your pages that causes visiting web browsers to report their visit to a master data server. Because the browser can also set a cookie to go along with the javascript, a page tagging approach can give you very good data about what individual users are up to with regards to your site.

The approach awstats uses is called "log analysis". The analyzer program goes through your web logs line-by-line and sorts out what files got served and where the requests came from. Because this approach doesn't rely on the visiting web browser executing any specific code properly, the total traffic numbers a log analyzer gathers will be closer to an accurate tally. The downside is that the only identifying information recorded about a visitor is the IP address they used, which isn't always a reliable way to distinguish between users (since several of them could be behind the same proxy server or firewall).

In the end, neither approach is really superior to the other.

Page tagging gives you better information about how often visitors return to your site and what they do there, but doesn't record visitors that can't or won't execute the tag (like older browsers, many mobile phones, users with privacy concerns, and search engine robots).

Log analysis gives you better information about how much traffic your web server handles but is less reliable when it comes to determining site usage patterns.

See where I'm going? Both approaches have complementary strengths and weaknesses. Somewhere between the page tagging statistics and the web log analysis lies the whole picture.

For the most accurate assessment you'll want to have both types of usage reports available and extrapolate from there.

Prerequisites

Before installing awstats pick out the virtual host you want to report on. If you want to use more than one you can go through this guide again for each one, but make sure each virtual host is logging to its own access log. It's possible to use a single log for multiple sites, but it's more complicated and isn't recommended. We're going for simple, remember?

Web server

First make sure you have a web server. Hey, might as well be thorough.

With that out of the way, we want to see if the virtual host we're going to be tracking is logging in the right format. While awstats can handle some other log formats, what we want to use is the standard "combined" web log format.

Most web servers, like nginx or lighttpd, use the combined log format by default. No problems there unless you went out of your way to change it.

If you're using apache it might be logging in either a "combined" or a "common" log format. To find out which, take a look in your virtual host config file and look for the "CustomLog" directive:

CustomLog /var/www/access.log combined

That last word is the one to check for the format. If it isn't there, or if it says something like "common", change the config so it's using "combined" for the format instead. For more information on the combined log format, check out this article for apache or this article for nginx.

If you altered the format used for the virtual host's log remember to reload the web server to implement the change.
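A quick way to check the format without opening an editor is to pull the CustomLog lines out of the config. The sketch below assumes the simple case shown above (an unquoted log path followed by a format nickname), and "custom_log_formats" is just an illustrative name:

```shell
#!/bin/sh
# custom_log_formats CONFIG
# Print the log path and format nickname of each active CustomLog line.
# Commented-out lines are skipped because their first word starts with "#".
custom_log_formats() {
    awk 'tolower($1) == "customlog" {
        print $2, (NF >= 3 ? $3 : "(no format given)")
    }' "$1"
}

# Demonstration on a throwaway config fragment.
conf=$(mktemp)
cat > "$conf" <<'EOF'
<VirtualHost *:80>
    ServerName www.example.com
    # CustomLog /var/www/old.log common
    CustomLog /var/www/access.log combined
</VirtualHost>
EOF
custom_log_formats "$conf"    # prints: /var/www/access.log combined
rm -f "$conf"
```

Point it at your real virtual host file to see at a glance whether the last word is "combined".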

Perl

The awstats scripts use a scripting language called "perl". It's used for a lot of things, so you probably have it installed already.

To check, run the command:

perl -v

If you get a response that gives you a perl version, you're set. If you get a "command not found" error then you need to install perl. You should be able to do that through your distribution's package manager, like yum or aptitude.

Download and extract

We're actually not going to use your Linux distribution's pre-packaged version of awstats (even if it has one). The awstats program gets updated regularly, and new versions include data on the latest web browser and operating system identifiers. It's best if you install the source package for awstats, then manually update it regularly so you get the most accurate reports possible.

You can get the latest version of awstats from the project's download page. You can decide if you want the latest beta for cutting-edge reporting data, or if you want to get the latest stable version instead and play it safe. In this guide we used the latest stable version (6.95 at the time of this writing).

Get the download with the ".tar.gz" extension. It's more unixy, and saves you from checking script permissions after the install.

You can either download the package from the awstats site to your desktop and then upload it to your VPS with scp, or you can download it directly to the VPS if you have wget installed:

wget http://prdownloads.sourceforge.net/awstats/awstats-6.95.tar.gz

Once you have the package on your VPS, unpack it:

tar -xvzf awstats-6.95.tar.gz

You should end up with a directory named after the awstats version, like "awstats-6.95". Now you just need to move that to wherever you actually want awstats installed (I used /usr/local/awstats):

sudo mv awstats-6.95 /usr/local/awstats

If you want to update to another version of awstats later, just go through those steps with the new version. Replace the old "awstats" directory with the new one. Simple.
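If you expect to repeat the upgrade now and then, the steps can be wrapped in a small function. This is only a sketch under a few assumptions: the tarball has already been downloaded, it unpacks into a single top-level directory (as the awstats release did here), and "upgrade_awstats" is a name invented for this example. It also keeps the previous install around as a ".old" directory in case you need to go back.

```shell
#!/bin/sh
# upgrade_awstats TARBALL INSTALLDIR
# Unpack TARBALL and put its top-level directory in place as INSTALLDIR,
# keeping any previous install as INSTALLDIR.old.
upgrade_awstats() {
    tarball=$1
    target=$2
    work=$(mktemp -d) || return 1
    tar -xzf "$tarball" -C "$work" || return 1

    # Find the directory the archive unpacked into.
    new=
    for d in "$work"/*/; do
        [ -d "$d" ] && new=${d%/} && break
    done
    [ -n "$new" ] || return 1

    # Move the old install aside, then move the new one into place.
    if [ -d "$target" ]; then
        rm -rf "$target.old"
        mv "$target" "$target.old"
    fi
    mv "$new" "$target"
    rm -rf "$work"
}

# Usage (run with sufficient privileges for the install location):
#   upgrade_awstats awstats-6.95.tar.gz /usr/local/awstats
```

Your config file in /etc/awstats and the data directory live outside the install directory, so they aren't touched by the swap.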

Choose the reports directory

Next you'll need to create an output directory for your reports. This can be pretty much anywhere you like, since the reports are just static html pages. They just need to be accessible from your web browser if you want to view them.

For this example, let's say we're going to be tracking traffic for "www.example.com", and we put that site's files in a directory in the "demo" user's home directory. We're feeling unoriginal, so we'll make a "webstats" directory for awstats' reports:

mkdir -p /home/demo/public_html/example.com/public/webstats

The awstats icons

The html pages that awstats creates when it makes its reports want to use some images to make them a bit less bland. Let's make sure a copy of those images will be available to the reports we generate:

cp -R /usr/local/awstats/wwwroot/icon /home/demo/public_html/example.com/public/webstats/awstatsicons

When you update awstats you may want to re-copy this directory as well, to catch any additions (like icons for new browsers).

Create the data directory

We'll need to give awstats a directory where it can store the data it uses to generate its reports. The default is "/var/lib/awstats". That's a pretty good location.

sudo mkdir -p /var/lib/awstats

Copy the config template

Now to set up a config file telling awstats how to process the logs for your domain. First, create a directory to hold the config:

sudo mkdir -p /etc/awstats

Next we'll copy a template config file into that directory that we can modify for our domain.

The template config file is located in the awstats installation's wwwroot/cgi-bin directory:

[awstats install]/wwwroot/cgi-bin/awstats.model.conf

Name the new config file in the style of "awstats.[main domain name].conf". If you were creating a config for "www.example.com" and had installed awstats to "/usr/local/awstats", your copy command would look like:

sudo cp /usr/local/awstats/wwwroot/cgi-bin/awstats.model.conf /etc/awstats/awstats.www.example.com.conf

Customize the configuration

Time to dig around in the config file we created. Using your favorite text editor (nano or vi, usually), edit:

/etc/awstats/awstats.www.example.com.conf

Change "www.example.com" above to the name of the domain you'll be tracking. If the file doesn't exist you may have made a typo when you copied it. The file should be chock full of stuff right now.

Fortunately, part of keeping things simple is only needing to change a few config settings, and those are toward the beginning of the file. Let's look at the settings you'll need to pay particular attention to and their default values.

LogFile

LogFile="/var/log/httpd/mylog.log"

The LogFile value is a pretty important one — it tells awstats where to find the log it's supposed to be analyzing. For our example, we'd change that value to the location of example.com's access log:

LogFile="/home/demo/public_html/example.com/logs/access.log"

LogFormat

LogFormat=1

The LogFormat directive is probably not one you'll need to change, but I mention it in case you've got your domain logging in a custom format, or if you absolutely want to use another standard format like the common log format. The commented text that precedes this directive explains how you can tell awstats what your log format looks like.

There are also some predefined log formats. The default, "1", represents the combined log format. If you are using common log format you would use "4" here instead.

SiteDomain

SiteDomain=""

The SiteDomain value is where you tell awstats what your site's main domain name is. If we usually direct visitors to "www.example.com", we'd change this setting to:

SiteDomain="www.example.com"

HostAliases

HostAliases="localhost 127.0.0.1 REGEX[myserver\.com$]"

The HostAliases setting tells awstats all the different domains people might use to visit your site. It's there so it can separate external referring sites from internal links.

The default has some funky "REGEX" stuff in there — that describes a "regular expression", which is a flexible but daunting way to describe a search filter. That one above just checks for "myserver.com", for instance. We're not going to keep that. Regular expressions aren't simple (but they are useful, if you know how to use them).

All you really need for this setting is a list of domains. It's good to include "localhost" and "127.0.0.1" in there, and to throw in your main domain name and any alternates you have for the site separated by spaces:

HostAliases="www.example.com example.com www.olddomain.com olddomain.com localhost 127.0.0.1"

DNSLookup

DNSLookup=2

The DNSLookup setting is actually not one you'll want to change. At its current setting awstats won't query DNS for visitors' IP addresses (a value of 2 only consults a static DNS cache file, if you've set one up). What awstats might glean from that information is what country the visitor is in. This can be nice to know and chart, but not nice enough for the amount of time and effort it can take awstats to make all those DNS queries.

If you really want country data for visitors, check the awstats plugin site for a couple alternatives to DNS lookup. They have an impact on awstats performance as well, but not as much as straight DNS lookups.

DirData

DirData="/var/lib/awstats"

Remember that data directory you made? This is where you tell awstats what it was. If you didn't use "/var/lib/awstats", be sure and change this value to point to your data directory.

DirIcons

DirIcons="/awstatsicons"

When the generated reports reference images they do so using this value. That directory is relative to the location of the reports. In this case, it will point to the "awstatsicons" directory we made by copying the default images directory. If you want to rename that directory you'll need to change it here so the generated reports can find the images.
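If you set up awstats for several sites, making these edits by hand gets old. Here's a hedged sketch of doing it with sed instead. The "set_directive" helper is made up for this example, and it assumes the value you pass contains no "|", "&" or backslash characters:

```shell
#!/bin/sh
# set_directive FILE NAME VALUE
# Rewrite the line starting with NAME= so it reads NAME="VALUE".
# Commented-out lines (starting with "#") are left alone.
set_directive() {
    tmp=$(mktemp) || return 1
    sed "s|^$2=.*|$2=\"$3\"|" "$1" > "$tmp" && cat "$tmp" > "$1"
    status=$?
    rm -f "$tmp"
    return $status
}

# Demonstration on a scratch file holding a few of the default lines.
conf=$(mktemp)
cat > "$conf" <<'EOF'
SiteDomain=""
HostAliases="localhost 127.0.0.1 REGEX[myserver\.com$]"
DNSLookup=2
EOF
set_directive "$conf" SiteDomain www.example.com
set_directive "$conf" HostAliases "www.example.com example.com localhost 127.0.0.1"
cat "$conf"
```

The same call works for LogFile and DirData. Run it against a copy of the real config first, then compare the two files with diff before replacing anything.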

Customizing apache web logs

You can create your own custom formats for apache web logs, to record more information or to make them easier to read. Here's how.

Changing the log format

If you know how to read web logs then you may have an idea of how you would want to write them differently — maybe add a little here, trim a little out there, switch the order around a bit. Luckily, you can do that with the access logs through a couple built-in commands and a handful of log variables.

LogFormat

Apache's "LogFormat" directive is what lets you define your own access log setup. Let's look at how that directive would be used to define the combined log format (CLF):

LogFormat "%h %l %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-Agent}i\"" combined

That first argument, in quotes, is the string that describes the log format. The last argument, "combined", gives a nickname to the format that can be used by CustomLog later on.

That format string contains a bunch of placeholders that describe the data to be included in the log. That first one, for example, is "%h" and represents the IP address of the visitor (the identifier for their host). A bit further on, "%t" represents the time of the request.

Components of the CLF

Let's look at that CLF format string side-by-side with an access log entry in the format:

%h %l %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-Agent}i\"
123.65.150.10 - - [23/Aug/2010:03:50:59 +0000] "POST /wordpress3/wp-admin/admin-ajax.php HTTP/1.1" 200 2 "http://www.example.com/wordpress3/wp-admin/post-new.php" "Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_4; en-US) AppleWebKit/534.3 (KHTML, like Gecko) Chrome/6.0.472.25 Safari/534.3"

Okay, they don't look too pretty together, but there is a correlation between each element in the format string and the components of the log entry below it. Breaking down what the stuff in the format string means:

%h      The remote host
%l      The remote logname (usually just "-")
%u      The authenticated user (if any)
%t      The time of the access
\"%r\"  The first line of the request
%>s     The final status of the request
%b      The size of the server's response, in bytes
\"%{Referer}i\"     The referrer URL, taken from the request's headers
\"%{User-Agent}i\"  The user agent, taken from the request's headers

So reading along, we see that in place of "%h" is "123.65.150.10", the remote host. After that, "%l" becomes "-" for the remote logname, "%u" turns into "-" for the authenticated user (since this connection didn't require authentication), "%t" is replaced with "[23/Aug/2010:03:50:59 +0000]" because that's the time the request was received, and so on.

Note that the places in the log format where a quote character (") was used, it was escaped in the format string with a backslash (\"). The escape is there because if it were a quote symbol by itself, LogFormat would think the format string was complete at that point. The backslash tells it to keep reading.

The last two parts of the format, the referrer and the user agent, use a format component that requires an argument — in this case which header should be extracted from the request by %i. The referrer and user agent headers are, appropriately, named "Referer" and "User-Agent", respectively.

Well, mostly appropriately. "Referer" is misspelled. That's the spelling of the header name in the HTTP standards, however, so it is "Referer" for all time when talking about web link referrers. A bit of lexicographical trivia for you there. Enjoy.

Other format components

Apart from what we saw in our breakdown of the combined log format, there are other components you can include in a LogFormat entry. Some commonly-used components are:

%{cookie}C

The contents of the cookie named "cookie" for the request.

%{header}i

The contents of the HTTP header named "header" for the request.

%{VAR}e

The contents of the environment variable "VAR" for the request.

%k

The number of keepalive requests handled by the connection that spawned the logged request. The first time a request is sent the keepalive value will be zero, but each subsequent request that uses the same keepalive connection will increase that number by one. This can be handy for seeing how many requests a keepalive connection handles before it's terminated.

If keepalives aren't enabled this value will always be zero.

If you only see very low numbers for the keepalives value in a log but have a long keepalive timeout set, then it may be worth trying a much shorter timeout for keepalives. That way apache won't be maintaining connections in memory for longer than it needs to.

%T

How long the server took to serve the request, in seconds.

%v

The ServerName of the virtual host the request was sent to. This format code can be handy if you're writing the accesses for more than one virtual host to the same log file.

For a full list of format components see the apache documentation for LogFormat.

Make your own log format

While the LogFormat entry is useful for interpreting what appears in the logs, it can also be used to create your own formats.

If you want your log to add the length of time it takes to serve requests to its access entries, you might make a LogFormat directive that looks like:

LogFormat "%h %l %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-Agent}i\" %T" timed_combined

All we have to do is add a "%T" to the end of the format string, then give it a new nickname — for our example, "timed_combined".

Using the new log format

Now, if you want to tell your virtual host to make an access log using the new format, you can include in the virtual host definition:

CustomLog /var/log/apache2/timed.log timed_combined

To recap: A LogFormat directive takes a format you give it and assigns it a nickname you choose. Then you use CustomLog to tell apache to write the access log using the new format by telling it where to write the log and the nickname of your log format.
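Once a log is being written in the "timed_combined" format, the extra "%T" value is easy to put to work. As a minimal sketch, assuming the format defined above (so the serve time is the last field on each line), this lists requests that took two seconds or more. The log lines here are made-up samples written to a temporary file; on a real server you'd point awk at the log file itself.

```shell
# Hypothetical sample data standing in for /var/log/apache2/timed.log.
log=$(mktemp)
cat > "$log" <<'EOF'
10.0.0.1 - - [23/Aug/2010:03:50:59 +0000] "GET /index.html HTTP/1.1" 200 512 "-" "ExampleBrowser/1.0" 0
10.0.0.2 - - [23/Aug/2010:03:51:04 +0000] "GET /report.php HTTP/1.1" 200 20480 "-" "ExampleBrowser/1.0" 3
10.0.0.1 - - [23/Aug/2010:03:51:10 +0000] "POST /search.php HTTP/1.1" 200 1024 "-" "ExampleBrowser/1.0" 1
EOF

# %T is the last field on the line ($NF); $7 is the requested path.
# Print the serve time and path for requests that took 2 seconds or more.
slow=$(awk '$NF >= 2 {print $NF, $7}' "$log")
echo "$slow"

rm -f "$log"
```

The two-second threshold is arbitrary. Pick whatever counts as "slow" for your site.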

Adding more custom logs

You can have more than one CustomLog directive for a virtual host. If you already have a CustomLog using the "combined" format, you don't have to remove it when adding your "timed_combined" log. This can be useful if you want to maintain one log in CLF that a web log analyzer program can read and another log file with just the information you care about when you're skimming the entries.

So if you wanted another log with just the stuff you wanted in it, you might take that "timed_combined" format and remove the things you feel are distractions. If you decided to remove the remote log entry, the user entry, and the user agent entry, you could create that format with:

LogFormat "%h %t \"%r\" %>s %b \"%{Referer}i\" %T" slim

And then add a new CustomLog directive to use the "slim" format:

CustomLog /var/log/apache2/slim.log slim

Precedence

Note that any logs defined in a virtual host will override log directives in the main apache config file. So if the main config file has the CustomLog entry:

CustomLog /var/log/apache2/access.log combined

And the virtual host has another CustomLog entry:

CustomLog /var/log/apache2/example.com.log combined

Then the virtual host will log its accesses to the "example.com.log" file, but not to the "access.log" file. If you wanted accesses to be logged to both files, you would need to include a line for the main access.log file in the virtual host definition, as in:

CustomLog /var/log/apache2/access.log combined
CustomLog /var/log/apache2/example.com.log combined

Rotating new logs

When you create any new logs, you should remember to configure logrotate to rotate them regularly. Otherwise they may grow and grow until they eat all your disk space right up. Any logs in the default apache log directory should get rotated under apache's default rules, but if you put a new log in another directory you may need to add a rule to logrotate.

Interpreting common status codes in web logs

The status codes you find in your web logs are useful troubleshooting tools, but only if you know what they mean.

Status codes

When a web browser talks to a web server, the server lets the client know the status of its request by sending a "status code". This status code will show up in the access logs of the server as a number. There are a lot of different status codes that can be passed to a web client, and you can view the full list at w3's website.

Fortunately there are only a few status codes that you're likely to see in your access logs, so consider the following descriptions to be highlights from the full list of status codes.

200 - OK

The 200 status code indicates that the request was successful. This is the one you want to see in your logs. At its most basic it means that when a web browser asked for a file, the server was able to find it and send it back to the browser.

403 - Forbidden

The 403 status code indicates that the server understood the request but refuses to fulfill it, usually because the client isn't allowed access to what it asked for.

One circumstance that can cause a 403 status is if you do not have "Indexes" enabled for a directory, and the directory doesn't have an index file in it that the server can access. In other words, the client asked for a directory, and the server doesn't find anything there it can show to the client.

A more common circumstance is that the permissions on the file or directory being requested don't allow access by the web server's user. If the web server is running as user "www-data", any files you want the web server to serve will have to be accessible by the user "www-data". For example, if a directory's permissions look like:

drwx------ 5 root     root     4096 2009-12-18 01:39 wordpress

Then the user "www-data" will not be able to access any of the files inside. Requests sent to the server that ask for the "wordpress" directory or any of its contents will yield 403 status codes instead of serving the file requested.

For more information on how Linux file permissions work, you can read this article series. In a nutshell, the web server user needs to have read permission for files in order to serve them, and it has to have read and execute permissions for directories in order to see files inside them.
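Here's a small sketch of checking and loosening a directory's permissions along those lines. It works in a throwaway temporary directory rather than a real document root, and it assumes GNU stat (the "-c" option), which is what most Linux distributions ship. Opening a directory up to "group" and "other" is only one possible fix; changing the directory's owner or group to suit the web server user is another.

```shell
# Work in a throwaway directory so nothing real is touched.
# The name "wordpress" mirrors the listing above; the location is hypothetical.
base=$(mktemp -d)
mkdir "$base/wordpress"

# Owner-only access, like the drwx------ listing above.
chmod 700 "$base/wordpress"
before=$(stat -c '%a' "$base/wordpress")

# Add read and execute for group and other so another user
# (such as the web server's user) can enter the directory and see its files.
chmod go+rx "$base/wordpress"
after=$(stat -c '%a' "$base/wordpress")

echo "before=$before after=$after"   # before=700 after=755

rm -rf "$base"
```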

404 - Not found

A 404 status code means that the requested file could not be found. If you see this error often you should check the links on your site to make sure they're pointing to the right places.

Since the filesystem is case-sensitive you should also make sure the capitalization matches between the request in the URL and the name of the file on the disk. For example, if a file is named "File.txt" and the URL requests "file.txt", the file won't be found by the web server. Either the URL or the file name would need to be changed so the capitalization matches in both instances.
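If you suspect a capitalization mismatch, a case-insensitive search of the document root can turn up the file's real name. This is just a sketch: the document root here is a temporary directory with one sample file in it, and it relies on the "-iname" option supported by the find command on Linux.

```shell
# A stand-in document root containing the file from the example above.
docroot=$(mktemp -d)
touch "$docroot/File.txt"

# The name as it appeared in the failing URL.
requested="file.txt"

# -iname matches the file name without regard to case.
actual=$(find "$docroot" -iname "$requested" -exec basename {} \;)
echo "requested=$requested actual=$actual"

rm -rf "$docroot"
```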

A couple of commonly-requested files are worthy of note.

robots.txt

If you see 404 errors connected to a file named "robots.txt", that's the result of a spider program (like web search engines use) checking to see what your preferences are for indexing your site.

If you don't want to restrict the access of web spider robots to your site, you can just create an empty robots.txt file and the 404 errors will go away.

The robots.txt file can be useful if there are parts of the site that you want search engines to ignore. If you don't want search engines to record anything in the "orders" or "scripts" directories on your site, for example, you could use the following robots.txt file:

User-agent: *
Disallow: /orders/
Disallow: /scripts/

A slash at the end of a disallow will let the search engine robot know that it refers to a directory.

The "User-agent" part of the file describes what user agent the robots.txt would apply to. The "*" means that you want the rule to apply to everybody. You can have more than one User-agent entry in a robots.txt file, as in:

User-agent: EvilSearch
Disallow: /

User-agent: *
Disallow:

In that file, the EvilSearch engine's robot would be asked not to record anything on the site (thus the "/"), while everything else will be allowed to record anything they can find (which is what the empty argument to Disallow means).

Note that the robots.txt instructions aren't enforced in any way. A spider can freely ignore them. The better search engines (the ones you've heard of) tend to obey the robots.txt file, while spiders used by spammers and email harvesters will ignore robots.txt entirely.

favicon.ico

Any 404 errors connected to "favicon.ico" are the result of a web browser checking for a favorites icon for the site. That's another file not found error that can be safely ignored if you don't want to make a favorites icon for the site.

The favorites icon is often used by modern browsers both as an icon in a bookmarks list and as an identifying icon in a tabbed interface. If you've noticed that bringing up a site puts an image associated with the site next to your address bar or in the tab for that page, the favicon.ico file is where your browser got that image.

There are ways to point a browser to another file for the favorites icon, but if you want to make a quick-and-dirty favorites icon there are several utilities on the web that either allow you to create your own or convert an image file. Once you've generated the favicon.ico file you can upload it to the document root of your site and the associated 404 errors should stop appearing in your log.

500 - Internal server error

The 500 status code is kind of a catch-all error code for when a module or external program doesn't do what the web server was expecting it to do. If you have a module that proxies requests back to an application server behind your web server, and the application server is having problems, then the server could return a 500 error to web clients.

503 - Service unavailable

The 503 status code appears when the web server can't create a new connection to handle an incoming request. If you see this status code in your logs it usually means that you're getting more web traffic than can be handled by your current web server configuration. You'll then need to look into increasing the number of clients the server can handle at one time in order to be rid of this status code.
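To get a quick feel for which status codes are turning up in a log, you can have awk tally them. The sketch below assumes the combined log format, where the status code is the ninth whitespace-separated field as long as the request line has the usual three parts (method, path, protocol). The log entries are made-up samples in a temporary file.

```shell
# Hypothetical sample data standing in for a real access log.
log=$(mktemp)
cat > "$log" <<'EOF'
10.0.0.1 - - [23/Aug/2010:03:50:59 +0000] "GET /index.html HTTP/1.1" 200 512 "-" "ExampleBrowser/1.0"
10.0.0.2 - - [23/Aug/2010:03:51:04 +0000] "GET /missing.html HTTP/1.1" 404 347 "-" "ExampleBrowser/1.0"
10.0.0.3 - - [23/Aug/2010:03:51:10 +0000] "GET /robots.txt HTTP/1.1" 404 347 "-" "ExampleBot/2.0"
10.0.0.1 - - [23/Aug/2010:03:51:15 +0000] "GET /private/ HTTP/1.1" 403 298 "-" "ExampleBrowser/1.0"
10.0.0.1 - - [23/Aug/2010:03:51:20 +0000] "GET /about.html HTTP/1.1" 200 2048 "-" "ExampleBrowser/1.0"
10.0.0.4 - - [23/Aug/2010:03:51:25 +0000] "GET /contact.html HTTP/1.1" 200 1024 "-" "ExampleBrowser/1.0"
EOF

# Tally field 9 (the status code), then sort so the most frequent code comes first.
summary=$(awk '{count[$9]++} END {for (code in count) print count[code], code}' "$log" | sort -rn)
echo "$summary"

rm -f "$log"
```

Each output line is a count followed by a status code, so a sudden pile of 404s or 503s stands out right away.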

Reading apache web logs

Whether you're dealing with web server difficulties or just want to see what apache is up to, your best bet is to look in its logs.

Keeping tabs on your web server

Sooner or later you'll want to know more about what your web server is up to. Luckily, apache (like many other server applications) keeps a diary of sorts called a "log".

Well, actually, more than one log, so the analogy isn't terribly good. Unless you think of your web server as a very organized diary-writer, maintaining different diaries for different kinds of events that have happened throughout the day.

Still not a great analogy, but it will do. In plainer terms: Logs are where apache records events like visitors to your site and problems it's encountered.

By default apache writes stuff about its activities in two types of logs — the error log and the access log.

Error log

The error log is where your web server records anything it doesn't think is quite right. Much of the time what gets recorded there are actual errors, like a visiting web client requesting a file that doesn't exist. Sometimes you'll also see warnings in there that don't indicate that a problem has occurred yet, but advise you that a particular event or configuration could cause problems later.

If you're having trouble with your web server this is the place to go first. For example, if you try to start your web server and it fails without telling you anything on the command line, it may be recording a reason in its error log. There you may find out about a misconfiguration or learn that it couldn't bind to the address or port it's configured for (possibly because some other program is already using the port).

Access log

The access log is where your web server records all the visitors to your site. There you can see what files users are accessing, how the web server responded to requests, and other information like what kind of web browsers visitors are using.

The access log can be used with programs called "traffic analyzers" to track the site's usage over time.

It can also be used to watch for unusual client behaviors that indicate someone is looking for a vulnerability they can exploit to hack your machine. If someone is sending unusual requests to an application you're running on your web server (like phpmyadmin or WordPress), it's usually a good idea to make sure you're running the latest version of the software.

Where to find your web logs

Before you can read your logs you'll need to find them. The most straightforward way to do that is to look for the configuration directives that tell apache where to create them.

Error log

To find the error log look in your main apache config file. The error log should be defined there with the "ErrorLog" directive. For example:

ErrorLog /var/log/apache2/error.log

Note that a lot of systems will restrict the permissions for apache's log directory to just root, so you may need to use the sudo command to look at the error log. For instance:

sudo cat /var/log/apache2/error.log

Access log

The access log is typically defined inside a virtual host block but can sometimes have a default defined in the main apache config file. You'll want to look for the "CustomLog" directive:

CustomLog /var/log/apache2/access.log combined

The first argument to CustomLog gives the file's location. The second argument ("combined") defines the format of the log. We'll get into what that means and how to change it later.

If a default CustomLog is defined in the main apache configuration and a different CustomLog is defined within a virtual host, the access log (or logs) defined in the virtual host will replace the default access log for just that virtual host.

Reading the logs

Now that you know where to find the logs, let's look at what's inside each. And most importantly, let's look at what they can tell you about your web server.

Error log

The error log is where the server will log, well, errors. These are usually errors the program encountered when trying to start a process or use a module, but they can also be errors that were sent to web clients, like a "file not found" error.

An error log entry for a file not found error would look something like:

[Mon Aug 23 15:25:35 2010] [error] [client 80.154.42.54] File does not exist: /var/www/phpmy-admin

In this case, a web client tried to visit a page in a "phpmy-admin" directory that didn't exist. Fortunately I happen to know that I don't have phpmy-admin installed, so it's not a broken link I need to fix. It's just some script kiddie looking for an exploitable version of that software. It's a good indication that I should install a program like fail2ban to block people like him.

Error log components

The first part of the log entry is the date and time (server time) when the event occurred. Apart from just being informative, that time can be useful for looking for entries in other logs at the same time. In this case I could check the access log to see the full URL that the web client tried to visit. If it were an error that indicated a module had trouble talking to a database, then I could look in the database server's logs at the same time to see what prevented the connection from happening.

The next part, "[error]", describes the level of the alert. This will often be "error", but sometimes other levels will indicate that the message logged is just a warning, or it may represent a critical error that caused the web server to shut down or fail to start.

The next part, "[client 80.154.42.54]", shows the source of the error. In this case the source is a web client, so the visitor's IP address was logged.

The last part of the log entry is the error itself.
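Knowing where each component sits makes it possible to summarize an error log. As a rough sketch, this counts error entries per client address. It assumes entries in the layout shown above (newer apache versions may format the error log differently), and only the first sample line comes from this article; the other two are invented for the example.

```shell
# Sample error log in a temporary file. The first entry is the one from above;
# the other two are hypothetical.
log=$(mktemp)
cat > "$log" <<'EOF'
[Mon Aug 23 15:25:35 2010] [error] [client 80.154.42.54] File does not exist: /var/www/phpmy-admin
[Mon Aug 23 15:25:36 2010] [error] [client 80.154.42.54] File does not exist: /var/www/pma
[Mon Aug 23 15:30:02 2010] [error] [client 10.0.0.7] File does not exist: /var/www/favicon.ico
EOF

# Pull out whatever follows "client " inside the square brackets,
# then count how many entries each address produced, busiest first.
clients=$(sed -n 's/.*\[client \([^]]*\)\].*/\1/p' "$log" | sort | uniq -c | sort -rn | awk '{print $1, $2}')
echo "$clients"

rm -f "$log"
```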

Combined log format

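The most common format for web log entries (and the default for most modern web servers) is the "combined" format, which this article abbreviates as "CLF". (Be aware that in apache's own documentation "CLF" stands for the older Common Log Format, described later in this article.) A log entry in combined log format might look like this: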

123.65.150.10 - - [23/Aug/2010:03:50:59 +0000] "POST /wordpress3/wp-admin/admin-ajax.php HTTP/1.1" 200 2 "http://www.example.com/wordpress3/wp-admin/post-new.php" "Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_4; en-US) AppleWebKit/534.3 (KHTML, like Gecko) Chrome/6.0.472.25 Safari/534.3"

There's a lot of stuff there, but when you break the log entry down it contains a standard set of information in a standard order.

Combined log components

The first entry is the IP address of the web client accessing your server.

The second entry above is "-", which is what gets logged when there's nothing to put in that part of the log. In this case, the entry would hold the remote logname (the client's identity as reported by an identd service), if one were available. You'll pretty much always see "-" here.

The third entry above is another "-". That slot contains the username the web client was authorized under, if any. If you enabled password protection for a file or directory, then the username the visitor used to log in would be recorded here.

The next entry is the date and time of the access.

The next entry is the first line of the request the web client sent to the server. In this case it's:

POST /wordpress3/wp-admin/admin-ajax.php HTTP/1.1

That entry means the web client sent a "POST" request (a submission of information) to the file at "/wordpress3/wp-admin/admin-ajax.php". That's a relative location, which means that if you wanted to find that file you'd start at the document root of that virtual host. If your document root was "/var/www", then the file being accessed above would be at "/var/www/wordpress3/wp-admin/admin-ajax.php". The last part of the request line describes the protocol used for the request, in this case HTTP version 1.1.

The next entry tells us the status code that was returned for the request. The code above, "200", is hopefully one you'll see most often in your access logs — it means that the file was found and served to the client. Other common status codes are "403" (access forbidden) and "404" (file not found). We go into more detail about status codes in another article, but for a full list you can visit the official w3 website's list of status codes.

The next number is the size of the response your server sent, in bytes. In this case it was a very small response (2 bytes), so it was likely just an acknowledgement from the server rather than a full page access.

The next entry is the "referrer URL". In this case the entry is:

http://www.example.com/wordpress3/wp-admin/post-new.php

That's the page the web client visited before sending the recorded access request. Usually that means it's the page that linked to the one they accessed. The referrer can be useful information if you're wondering where people are finding links to your site (from a Google search, or a link from a partner site), or if you want to find the page that contained a bad link if the access entry was an error.

The last entry is called the "user agent". Most of the time that just means it's the identifier used by the web browser the visitor used. In this case, the user agent was:

Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_4; en-US) AppleWebKit/534.3 (KHTML, like Gecko) Chrome/6.0.472.25 Safari/534.3

The user agent is pretty specific sometimes. In this entry the web browser told the server not only what its name was (Chrome, in this case), but also what operating system it's running on (Mac OS X), the version of the browser and the system, and the components that the web browser uses from the operating system. It's usually a lot more than you need, but if you know your site will display differently in different browsers, all that information can be used by a web application to tailor the page it returns to look best on that particular visitor's browser.

Putting them together

Whew! Lots of stuff there, but it's useful stuff. To give another example, let's look in again on our would-be intruder from earlier. Looking in the error log I saw what time he tried to access the non-existent directory. By looking at the same time in the access log I can see more information about what he tried to access:

80.154.42.54 - - [23/Aug/2010:15:25:35 +0000] "GET /phpmy-admin/scripts/setup.php HTTP/1.1" 404 347 "-" "ZmEu"

There's the same IP address and the same time. So we can see that his script used a "GET" method (a request for a page) to ask for the setup script for phpMyAdmin. The "404" status means that the file wasn't found. And while the user agent entry certainly isn't any web browser on the market, some web searches will turn up other people who have been hit by what is probably the same script. So even if that user agent isn't a browser, it can be useful in determining the type of attack your site is experiencing, and how many 404s you can expect your server to have to handle when it hits you.
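That cross-referencing step can be scripted. A minimal sketch, assuming the combined log format: take the client address from the error log entry and list the time, path, and status of everything that address requested in the access log. The access log here is a temporary file holding the two entries from this article (the first with its user agent shortened).

```shell
# Sample access log in a temporary file.
log=$(mktemp)
cat > "$log" <<'EOF'
123.65.150.10 - - [23/Aug/2010:03:50:59 +0000] "POST /wordpress3/wp-admin/admin-ajax.php HTTP/1.1" 200 2 "http://www.example.com/wordpress3/wp-admin/post-new.php" "Mozilla/5.0"
80.154.42.54 - - [23/Aug/2010:15:25:35 +0000] "GET /phpmy-admin/scripts/setup.php HTTP/1.1" 404 347 "-" "ZmEu"
EOF

# The address taken from the error log entry.
ip="80.154.42.54"

# For lines from that address, print the time (minus its leading "["),
# the requested path, and the status code.
hits=$(awk -v ip="$ip" '$1 == ip {print substr($4, 2), $7, $9}' "$log")
echo "$hits"

rm -f "$log"
```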

Common log format

It isn't used a lot anymore, but you may run into a CustomLog directive that uses the log format of "common" in some older configurations of apache. The "common" format is similar to "combined", but omits the referrer and user agent information at the end of the log entry. Otherwise it can be read the same way as a "combined" format log.

Useful commands

Here are a few commands that can make browsing log files a little quicker or easier. These are very basic overviews of the commands in question. For more information on each you can check their respective man pages.

cat

The "cat" command simply displays the contents of a file. To see the whole error log all at once, you might run:

sudo cat /var/log/apache2/error.log

less

If the log you want to look at is particularly large, you probably don't want to look at the whole thing at once. To browse through a file you can use the "less" command:

sudo less /var/log/apache2/error.log

While less is displaying a file you can hit the space bar to page down, and the up and down arrow keys on your keyboard to scroll up or down one line at a time.

tail

The "tail" command returns lines from the end of a file. By default it displays the last ten lines of the file, so this command would display the last ten lines of an access log:

sudo tail /var/log/apache2/access.log

To specify the number of lines to grab, use the "-n [number]" option. To display the last 100 lines of the access log, you could run:

sudo tail -n 100 /var/log/apache2/access.log

You can also save yourself a little typing by just using "-[number]" instead of "-n [number]", as in:

sudo tail -100 /var/log/apache2/access.log

The tail command is useful if you're just looking for recent activity in a log. If you want to watch the end of a file for changes as they happen, you can use the "-f" option:

sudo tail -f /var/log/apache2/access.log

With this version of tail running, when a new line is added to the log file you'll see it on your screen too. To get out of tail when it's in this mode use control-C.

grep

If you're looking for a particular item in a web log (like a certain IP address, or any "404" responses), skimming through the log manually can be tiresome. It's easier to let the "grep" command do the work for you.

The grep command will look through its input, or a file, and return any lines that contain the search term sent to it. To look for the term "404" in an access log, you might run:

sudo grep 404 /var/log/apache2/access.log

The first argument is the text grep is searching for, and the second argument is the file to search.

If you want to look for a phrase, you can do that by enclosing the phrase in quotes. To look for requests for a particular file, you could run:

sudo grep "GET /images/butterfly.jpg" /var/log/apache2/access.log

By default grep's searches are case-sensitive. If you specify capital letters like "GET", then lines with "get" in lowercase letters won't be returned as hits. To make the search case-insensitive, pass grep the "-i" option, as in:

sudo grep -i "get /images/butterfly.jpg" /var/log/apache2/access.log

You can combine tail and grep by using what's called a "pipe":

sudo tail -n 100 /var/log/apache2/access.log | grep 404

The first part of that statement just lists the last 100 lines of the access log. The next character, "|", is the "pipe". It takes the output of the command before it and sends it as input to the command after it. In this case that second command is "grep", searching for 404. So the above command would return any lines containing "404" in the last 100 access log entries. Those will mostly be 404 errors, but can include other entries where "404" happens to appear (in a response size or a file name, for example).
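Since a plain search for "404" matches that text anywhere on the line, it can help to search for the status code in its usual position instead. Here's a sketch of the difference, assuming the combined log format, where the status code comes right after the closing quote of the request. The sample entries are invented to show the false matches.

```shell
# Hypothetical sample data: only the second entry is an actual 404 response.
log=$(mktemp)
cat > "$log" <<'EOF'
10.0.0.1 - - [23/Aug/2010:03:50:59 +0000] "GET /index.html HTTP/1.1" 200 404 "-" "ExampleBrowser/1.0"
10.0.0.2 - - [23/Aug/2010:03:51:04 +0000] "GET /missing.html HTTP/1.1" 404 347 "-" "ExampleBrowser/1.0"
10.0.0.3 - - [23/Aug/2010:03:51:10 +0000] "GET /404.html HTTP/1.1" 200 1024 "-" "ExampleBrowser/1.0"
EOF

# -c makes grep print the number of matching lines instead of the lines themselves.
loose=$(grep -c 404 "$log")          # matches the byte count and the file name too
strict=$(grep -c '" 404 ' "$log")    # only matches 404 right after the quoted request

echo "loose=$loose strict=$strict"   # loose=3 strict=1

rm -f "$log"
```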

Barebones apache install for CentOS

This article describes how to install an apache web server on CentOS with no extras. It's intended only for users who are experienced administrators or who just want a basic web server install with no details on including modules like PHP or customizing apache for their site.

Why "barebones"?

A barebones article is intended for users who just want to get a software package up and running with the default options and no frills. It's best used by either experienced Linux administrators or users needing to get a package installed to satisfy a prerequisite without going through extensive customization. Most users are advised to use the more in-depth tutorials found elsewhere in the Slicehost articles repository so they can better learn the software they are implementing.

For a more comprehensive survey of this topic, check the links in the "Further reading" section at the end of the article.

Installing apache

Run the following command:

sudo yum install httpd

Adding iptables rules for apache

The Slicehost articles on configuring CentOS slices leave ports 80 and 443 open.

If you are using the CentOS default iptables rules or have modified the Slicehost default rules, you can add rules for apache with the following commands:

sudo /sbin/iptables -I INPUT -p tcp --dport 80 -m state --state NEW,ESTABLISHED -j ACCEPT
sudo /sbin/iptables -I OUTPUT -p tcp --sport 80 -m state --state ESTABLISHED -j ACCEPT
sudo /sbin/iptables -I INPUT -p tcp --dport 443 -m state --state NEW,ESTABLISHED -j ACCEPT
sudo /sbin/iptables -I OUTPUT -p tcp --sport 443 -m state --state ESTABLISHED -j ACCEPT

To save those rules so they'll take effect next time the slice is rebooted, run:

sudo service iptables save

Starting and stopping apache

You can start apache with:

sudo /usr/sbin/apachectl start

Similarly, you can stop apache with:

sudo /usr/sbin/apachectl stop

To restart apache gracefully, so existing connections aren't broken but new ones will use any recent configuration changes, run:

sudo /usr/sbin/apachectl graceful

Executing apachectl by itself will show what options can be passed to the command.

/usr/sbin/apachectl

Starting apache at boot time

Ensure apache will start when the slice reboots by running:

sudo /sbin/chkconfig httpd on

Where to put documents

Apache will serve documents for the default site from the directory:

/var/www/html

This directory is called apache's "document root". You can make documents available via the web by putting them in that directory or in a subdirectory. If you were to have "www.example.com" pointing to your slice you could see the default index file in the document root by going to this URL:

http://www.example.com

To access files or subdirectories in the document root, think of the base URL for your web server as an alias for the document root, then add a path to the URL telling the web server to look deeper. For example, if "www.example.com" gets you to the document root of "/var/www/html", to view the file at "/var/www/html/mysite/mypage.html" you would use the URL:

http://www.example.com/mysite/mypage.html

When no filename is specified in the URL apache will look for a default page like "index.html" or "index.htm" in the target directory before returning the default welcome page, an error, or a listing of files in the directory (depending on your configuration).

Log files

Apache's log files are located in the directory:

/var/log/httpd

By default only the root user has access to that directory, so you will need to use sudo to get a directory listing or view files.

The log file that records errors is:

/var/log/httpd/error_log

The log file that records page accesses to the default site is:

/var/log/httpd/access_log

Configuration files

Apache's configuration files are located in:

/etc/httpd

The main configuration file for the web server is:

/etc/httpd/conf/httpd.conf

The main configuration file sets up the default web site that will be served by apache, as well as defining any modules that will be enabled. It's well-commented so it's worth skimming the file to see what directives are included. Some highlights are:

Listen 80

The "Listen" directive tells apache to listen to a port, an IP address, or a combination of the two. You can include more than one Listen directive. By default, with just the one Listen directive configured, apache will listen to port 80 on all available IP addresses.

DocumentRoot "/var/www/html"

The DocumentRoot directive, as it happens, tells apache where the document root is located. The document root is where apache will look first for files to serve (see the earlier section, "Where to put documents", for more on the document root).

<Directory "/var/www/html">

The Directory directive starts a configuration block that applies the options it contains only to the defined directory and its subdirectories. There can be more than one Directory block in the httpd.conf file. A brief example of a full Directory block is the default "/" Directory entry, which is very restrictive:

<Directory />
    Options FollowSymLinks
    AllowOverride None
</Directory>

Umask and unusual file permissions and types

In this last entry in our series on Linux file permissions we look at the umask and some more advanced file permissions. We also throw in some discussion of other file types you may see in a directory listing.

Odds and ends

We've covered the basics of file permissions in Linux, as well as how to view and change them. Now, in this final article of the series, we'll look at the rest of it. Bits and pieces that might be useful, or might just satisfy curiosity, or can help you avoid trouble down the line.

The umask

We'll start with the subject of "umask", a means of controlling the permissions on files and directories when they're first created.

The umask is set when you log in, and is usually set in one of the default shell config files (like /etc/profile). You can override the umask for a particular user by setting their umask in the user's shell profile, usually in "~/.bashrc". The setting looks something like:

umask 022

The umask octal value is kind of the reverse of chmod permissions — you set it with an octal value, but instead of specifying the permissions you want the created file to have, you specify what you don't want it to have. For comic book fans, think of it as Bizarro Permission.

In the example above, the "2" set for "group" and "other" means, instead of adding write permission to the created file, everything except write permission is added for those two categories. The "0" means all permissions are set for the file owner.

You will sometimes see the umask expressed as four digits, like "0022". Both styles work. That first digit is for setting some special permissions, which we will describe shortly. But the quick version: You usually won't want to set those with umask, so if you send umask four digits just use a zero as the first digit.

Note that the default behavior for files is to omit the executable permission for all categories. So while the above example only omits the "write" permission, a file created with that umask would have the octal permissions "644", while a directory would include the executable permissions, and thus be "755".

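If you prefer a mathematical way to look at it, take the maximum default permissions (777 for a directory, 666 for a file), then subtract the umask value to get the initial permissions. Strictly speaking the umask switches bits off rather than subtracting, so the shortcut only works when each umask digit removes permissions the default actually has; a umask of 033 gives a new file 644, not 633.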

Another example to illustrate the point:

umask 027

This one removes write from "group" and everything from "other", making a file that can be read and written to by its owner, read by its group, and denies all access to everyone else. A created directory would be the same, but would also include "execute" for owner and group.
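
To check the arithmetic for yourself, here is a small sketch that creates a file and a directory under "umask 027" and reads back their octal permissions. It assumes GNU stat (for the "-c" option) and works in a throwaway scratch directory; the names workdir, newfile and newdir are just placeholders.

```shell
# Scratch area so nothing important is touched (names are placeholders).
workdir=$(mktemp -d)

# Run the umask change in a subshell so it doesn't stick to your session.
(
    umask 027
    touch "$workdir/newfile"
    mkdir "$workdir/newdir"
)

# GNU stat: %a prints the permissions in octal.
file_mode=$(stat -c '%a' "$workdir/newfile")
dir_mode=$(stat -c '%a' "$workdir/newdir")
echo "file: $file_mode, directory: $dir_mode"

rm -r "$workdir"
```

With that umask the file should come out as 640 and the directory as 750.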

You can view the current umask setting for your shell session by typing simply:

umask

Symbolic values and umask

You will almost always see the umask set with the octal value assignment. This is mostly because of history and backward compatibility — older shells only support octal values for umask. But current Linux distributions ship with default user shells that support symbolic values for umask (like bash), so if you prefer to go symbolic, you should be fine.

One significant point of note about using symbolic values with umask is that you don't do the "use the reverse of what you want" thing like you do with octal umask. If you do an assignment of symbolic permissions with umask, it works a lot like it would if you used chmod. The one difference would be that the behavior of not assigning execute permission to regular files when they're created still applies.

To express the equivalent of "umask 022" with symbols, you can directly assign the permissions:

umask u=rwx,g=rx,o=rx

Again, noting that the "x" permission will only be applied to directories, not regular files. You can also use relative symbols with umask, but that can get a little tricky. If you use relative symbols with umask, like:

umask g+w

Then the adjustment is applied to whatever the current umask is set to. That can be confusing if the default umask gets changed later, though - imagine the above example if someone comes along later and takes group read and execute permission out of the default umask. You'll end up with files and directories that your group can write to but can't read or cd into.

So if you know you just want to change the umask for a particular category, it's best to just use something like:

umask g=rwx

That way you'll still be using the default for the "other" category, but you can be sure group will always have the permissions you need set.
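
The difference between relative and absolute symbols is easier to see side by side. The sketch below assumes bash and uses 077 as a stand-in for "someone tightened the default umask later"; the variable names are made up for the illustration.

```shell
# Each $( ... ) runs in a subshell, so your real umask is left alone.

# Relative change on top of the usual default: group gains write.
relative_default=$(umask 022; umask g+w; umask -S)

# Same relative change after the default was tightened to 077:
# group ends up with write but no read or execute.
relative_tightened=$(umask 077; umask g+w; umask -S)

# Absolute assignment gives group rwx whatever the default was.
absolute_tightened=$(umask 077; umask g=rwx; umask -S)

echo "022 then g+w:   $relative_default"
echo "077 then g+w:   $relative_tightened"
echo "077 then g=rwx: $absolute_tightened"
```

The second line of output is the confusing case described above: group can write but can't read, which is rarely what anyone intended.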

It's handy to know that you can get the symbolic representation of the current umask by passing umask the "-S" option:

umask -S

For a umask that would create files with the default octal permission "644" and directories with "755", the results of umask checks would look like:

$ umask
0022
$ umask -S
u=rwx,g=rx,o=rx

The high-order bits

There are some other types of permissions you can set that we haven't really talked about yet. They are the "high order" permissions. To set them with octal you need to use a four-digit octal number, and the first digit will represent the high-order permissions. The high-order permissions are setuid, setgid, and text (the "sticky bit").

Since you may notice them lying around your file system and wonder about them, let's briefly cover what each is for, and how they look in an "ls -l".

Setuid

If the setuid bit is set on a file, when you execute the file the process will run as if it were launched by the file's owner. If you're running as user "demo" and run a file owned by root that has "setuid" on it, then when the program runs it will run as if root launched it.

As you might imagine, you want to be careful with this one. The smallest of security holes in a program running as root can lead to pretty big exploits. For this reason Linux ignores "setuid" on scripts: a script with the bit set will still launch, but it runs as the user who launched it rather than as the file's owner.

Setuid is represented by an "s" in the "user" category when viewed in ls. A file with setuid looks like:

-rwsr-xr-x 2 root root     122880 2010-04-14 20:12 sudo

When setuid is set on a directory, the system ignores it.

In octal representation, the setuid bit is "4". So setting the permissions in octal for the sudo program above would look like:

chmod 4755 sudo

Symbolically, setuid is "s" added to the user category only. So adding setuid to sudo could look like:

chmod u+s sudo

Setgid

The setgid permission works like setuid, except it causes a file to run as the file's group instead of the group of the user that launched it. So a program with the setgid bit on it that's in the group "www-data" will always run as if it were launched by a user with the primary group "www-data", whether the user that actually launched it is in "www-data" or not.

A file can have both setuid and setgid active at the same time.

Setgid looks the same as setuid in ls, it just appears in the group category instead of the user category:

-rwxr-sr-x 1 root crontab   31656 2009-05-12 21:58 crontab

In octal representation the setgid bit is "2". So setting the permissions for crontab above would look like:

chmod 2755 crontab

Symbolically, setgid is "s" added to the group category only. So adding setgid to crontab could look like:

chmod g+s crontab

It's possible for a file to have setgid set, but not be executable by its group (it's also possible for this to happen with setuid, but is much less likely). When that happens, setgid is displayed as a capital "S" instead of the usual lowercase "s". If you changed the permissions for crontab in the above example so that only root could run it, but kept setgid active, the end result would look like:

-rwxr-Sr-- 1 root crontab   31656 2009-05-12 21:58 crontab
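
If you'd like to see those letters appear without touching a real system binary, you can try the same octal values on a scratch file. This is only a sketch: it assumes GNU stat (the "%A" format prints permissions the way "ls -l" does), and the file name "demo" is a placeholder.

```shell
workdir=$(mktemp -d)
touch "$workdir/demo"

# setuid plus 755: the "s" lands in the user category.
chmod 4755 "$workdir/demo"
setuid_view=$(stat -c '%A' "$workdir/demo")

# setgid plus 755: the "s" lands in the group category.
chmod 2755 "$workdir/demo"
setgid_view=$(stat -c '%A' "$workdir/demo")

# setgid without group execute shows up as a capital "S".
chmod 2744 "$workdir/demo"
setgid_noexec_view=$(stat -c '%A' "$workdir/demo")

echo "4755: $setuid_view"
echo "2755: $setgid_view"
echo "2744: $setgid_noexec_view"

rm -r "$workdir"
```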

Directories handle setgid differently. If setgid is set on a directory, every file created in that directory will be created with the directory's group instead of the creating user's group. Furthermore, new subdirectories will inherit the setgid bit from the parent directory.

The inheritance behavior of setgid on a directory can be useful if you want a particular directory and all its contents to always be accessible to users in a particular group. You can just put the setgid permission on the parent directory (and any existing subdirectories), then change the default umask so files and directories will be created with group write permissions.

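Note that using setuid or setgid with the "-R" flag for chmod (for recursive permission changes) is rarely what you want: "chmod -R g+s" puts the bit on every regular file as well as every directory. To set it on just the directories, select them with a tool like "find" and run chmod on those.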
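
Here is one way the shared-directory setup might look in practice, sketched in a scratch directory. It assumes GNU find and stat, and the directory names (shared, existing, new) are invented for the example. Directories are picked out with "find -type d" so regular files are left alone.

```shell
workdir=$(mktemp -d)
mkdir -p "$workdir/shared/existing"

# Put setgid on the parent and on subdirectories that already exist.
find "$workdir/shared" -type d -exec chmod g+s {} +

# A directory created afterwards picks up the bit by itself.
new_dir_view=$(umask 022; mkdir "$workdir/shared/new"; stat -c '%A' "$workdir/shared/new")
existing_view=$(stat -c '%A' "$workdir/shared/existing")

echo "existing subdirectory: $existing_view"
echo "new subdirectory:      $new_dir_view"

rm -r "$workdir"
```

The "s" in the group slot of the new subdirectory is the inherited setgid bit; nobody ran chmod on it.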

The sticky bit

The "sticky bit" confuses a lot of people. There's a good reason for this: The sticky bit means different things to different versions of Unix. Fortunately we only need to worry about Linux, so we only have to talk about one implementation of the sticky bit.

The sticky bit is ignored when set on files.

When set on a directory, the sticky bit tells the system that files in that directory can only be renamed or deleted by the user that owns them (and root). The most common use for the sticky bit, and really the only one you're ever likely to need, is on /tmp:

drwxrwxrwt   5 root root  4096 2010-07-16 02:47 tmp

The "t" at the end of the permissions is the sticky bit. The letter hearkens back to the sticky bit's original meaning, which involved caching text in memory. A handy mnemonic might be to think of the sticky bit as the "tmp bit" or "text bit", depending on which association might work best for you.

That's the trouble with using letters to abbreviate this stuff. Because "sticky bit" is a memorable name you'll run into a lot of instances of people mistakenly reading an "s" in a permissions list as the sticky bit, when "s" is actually setuid and setgid. You will now know better. Feel free to correct people when they say "sticky bit" when they mean "setuid" or "setgid". Unless they are particularly large and ill-tempered, in which case it may be best to just let it slide.

Anyway, you probably won't want to set the sticky bit anywhere else. It's useful in /tmp because it makes the permissions there a little more restrictive than a directory usually would be with 777 permissions.

In octal representation, the sticky bit is "1". So if you accidentally deleted /tmp (trust me, it can happen) and wanted to recreate it, you could set the octal permissions of your new /tmp with:

chmod 1777 /tmp
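
To see the "t" show up without experimenting on the real /tmp, you can apply the same octal value to a scratch directory. A quick sketch, assuming GNU stat; "faketmp" is just a placeholder name.

```shell
workdir=$(mktemp -d)
mkdir "$workdir/faketmp"

# Same permissions as a real /tmp: wide open, plus the sticky bit.
chmod 1777 "$workdir/faketmp"

# GNU stat: %A prints the permissions the way "ls -l" does.
sticky_view=$(stat -c '%A' "$workdir/faketmp")
echo "$sticky_view"

rm -r "$workdir"
```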

A high-order cheat sheet

To summarize the octal values of those high-order permissions:

setuid = 4
setgid = 2
sticky = 1

Other file types

There are file types other than regular files, directories and symlinks that you might see in directory listings, particularly in /dev. These special file types represent ways for programs to talk with other programs or with hardware.

They illustrate part of what made Unix so weird and special at its creation: Treating most interactions similarly to file interactions. It's not a perfect setup, but the biggest benefit is providing a simple and fairly standard way for programs to talk to other parts of the system.

You might not need to use any of these file types yourself but an overview can be helpful, if only to know something about what the system is doing behind the scenes. Mostly I cover them here because you might see them in a directory listing and want some idea of what the heck they are.

These file types can be recognized by the first letter in an "ls -l" result, where you'd usually see a "-" for a regular file, "d" for a directory, or "l" for a symlink.

Socket

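A socket file is a special type of file that lets programs on the same machine talk to each other using the same calls they would use for a network connection, with a path in the file system standing in for the network address. Instead of doing a "write" and having the text wind up in a text file, the text gets delivered to whatever program is listening on the socket.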

A socket file is labeled with an "s" in the first slot, as in:

srw-rw-rw- 1 root   root           0 2010-02-26 22:46 log

That particular example is a socket located at /dev/log. That socket can be used by programs (like "logger") to send log entries to a syslog daemon without needing to connect to the daemon directly.

Named pipe

A named pipe basically lets one program put data into the pipe and have another program read it. It's usually created with the "mkfifo" command.

You might have been told at some point to "pipe" data from one program to another with the "|" separator. If so, then you've used an "unnamed pipe" before. A named pipe is like that, except it lives in the filesystem and can be reused.

A named pipe is labeled with "p" in the file type slot of a directory listing.

For example, a named pipe at /dev/xconsole might be used by syslog to send logging data to xconsole, which in turn would be displayed to a user running x-windows. In a directory listing it would look like:

prw-r----- 1 syslog adm            0 2010-07-15 17:09 xconsole
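
You can try a named pipe yourself in a few lines. This is a minimal sketch in a scratch directory; the pipe name "demo_pipe" and the message are made up for the example.

```shell
workdir=$(mktemp -d)
mkfifo "$workdir/demo_pipe"

# First character of the listing is the file type ("p" for a named pipe).
pipe_type=$(ls -l "$workdir/demo_pipe" | cut -c1)

# The writer blocks until a reader opens the pipe, so run it in the background.
echo "hello from the writer" > "$workdir/demo_pipe" &

# Read one line back out of the pipe, then let the writer finish.
read -r received < "$workdir/demo_pipe"
wait

echo "type: $pipe_type, received: $received"
rm -r "$workdir"
```

Nothing is ever stored in the pipe file itself; the data goes straight from the writer to the reader.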

Block special file

A block special file is a representation of a block device in the file system. A block device is a piece of hardware that the system would read from or write to in "blocks" of data, like a hard drive.

A block special file is represented by a "b" in the directory listing. The first partition of a hard drive, at /dev/sda1, would look like:

brw-rw---- 1 root   disk      8,   1 2010-02-26 22:45 sda1

Character device

A character special file is an interface to a character device. A character device is similar to a block device, but instead of reading from or writing to the interface in blocks, the system talks to the device one character at a time.

Where a block device tends to be something that stores data for later retrieval (like a disk), a character device is usually one where the system only needs to send or receive data of a more immediate nature - like sending a file to a printer, or receiving keystrokes from a keyboard.

A character special file is labeled with a "c" in a directory listing.

For example, a terminal session is a character device, usually named something like /dev/tty1:

crw------- 1 root root 4, 0 Jun 30 03:29 /dev/tty1
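
If you just want the type letter for a handful of paths, you can cut it out of the "ls" output. The helper below is a hypothetical convenience function, not a standard command, and the paths are ones that should exist on most Linux systems.

```shell
# Hypothetical helper: print the file type letter from "ls -l" for one path.
# -d makes ls describe the entry itself rather than a directory's contents.
type_letter() {
    ls -ld "$1" | cut -c1
}

null_type=$(type_letter /dev/null)      # a character special file
root_type=$(type_letter /)              # a directory
passwd_type=$(type_letter /etc/passwd)  # normally a regular file

echo "/dev/null:   $null_type"
echo "/:           $root_type"
echo "/etc/passwd: $passwd_type"
```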

Summary

You probably now know more than you ever wanted to about Linux file permissions and file types. What can I say? I like to be thorough. Most of it is the sort of information you may never need, but when you do (like when you need to change default file permissions, or want to send data easily between programs), it can save you some research time if you know something of what's possible beforehand.

Linux file permission concepts

Linux file permissions are strange and wondrous things. Start down the path of understanding by looking at the core concepts behind them before moving on to practical applications.

File access

In a multi-user environment like Linux it's important to control which users can modify or delete various files on the system. This control isn't just a necessary security precaution — it prevents catastrophic accidents. If a user can only affect a minimum number of files, there's less chance that a mistyped command or a typo in a script will destroy an essential file or publish confidential information to a public web site.

To approach the problem of managing file access, we first need to understand the concepts of file ownership and file permissions. Once we've wrapped our brains around those basics we can move on to the matter of actually checking and changing file access details.

Note that this ownership and permission stuff also applies to directories, since they're basically a special kind of file as far as the filesystem is concerned. The way permissions can be applied to directories is a little different from regular files, but the basic concepts are the same, so we'll mostly just talk about "files" and keep in mind that it can mean "directories" too.

So we'll start with one of our two basic concepts: File ownership.

The basics: Ownership

Every file and directory on a Linux file system has an owner. Just like ownership in "real life", the owner of a file is the one who gets to assign permissions for the file. If user "mom" owns the file "lawndarts", you'll need mom's permission to play with lawndarts. Maybe she'll let you mess with lawndarts, or maybe she'll deny you access to it altogether. Or she'll just let you look at lawndarts without getting to play with it, not knowing how much it tortures you to see them sitting there on a high shelf, unreachable, just because she thinks you'll put your eye out.

Sorry. Got a little sidetracked. Where was I?

Right, file ownership. The user who owns a file gets to change its permissions. Everything after that depends on the permissions that are set — whether someone (even the owner!) can read the file, or change it, or delete it. It's a simple privilege, but far-reaching in its impact.

File group

While every file has a user who owns it and can control its permissions, every file also belongs to a group. A "group", for the purposes of Linux files, describes a set of users who have file permissions that may be different from the common user. A user can be in more than one group, but a file can only be in one group.

Group ownership is a handy way to let a file owner assign one set of permissions to a file for people he doesn't know ("You can look, but can't touch"), and another set of permissions for people he trusts with the file ("You can look and touch. But no one else").

Changing ownership

A normal user can control a file's permissions, but can't actually give a file away. To do that, you need to use a third party to broker the deal - the superuser. The superuser is more commonly known as "root". You may have run into him before.

If you aren't logged in as root that usually means that you'll need to use "sudo" to use root privileges to change a file's owner.

The filesystem is more flexible about changing a file's group. You can still use root to change the group, but the file's owner can also switch a file to another group so long as the user belongs to the group in question.

chown

The main command used to change a file's owner or group is "chown". The most common syntax used with chown is:

chown user:group file1 file2 file3

Breaking that down, the "user" in the example above is the user you want to own the file, and "group" is the group you want the file to belong to. A colon separates the two. After the user/group pair you list one or more files that will be affected by the change.

The user and the group are both optional with chown, though of course if you omit both you won't actually change anything. If you want to just change the owner for a file, you can use:

chown user file1 file2 file3

If you want to use chown to just change the group, make sure to include the colon even though you won't be specifying a user:

chown :group file1 file2 file3

It's important to note that chown also accepts a period in place of the colon when separating the user and group names. This is outdated behavior, but you'll sometimes see it in old scripts or documentation so chown still supports it. If you see a period in someone's example you can use it with chown and it will work fine, but I'd still recommend using the colon instead.

The reason I recommend against the use of a period as the chown separator is that it's possible (discouraged, and often made difficult, but still possible) to create a user that has a period in its name. You will probably never, ever run into this, but I like to be thorough.

Fortunately if you do encounter a username with a period in it and want to use chown with it, chown will handle the period gracefully if you include a colon between the user and group names. If you don't want to change the group, you can just leave that part blank but still include the colon, as in:

chown john.smith: file1 file2 file3

chgrp

If you don't like that messy colon hanging around when you just want to change the group for a file, there's an alternative command:

chgrp group file1 file2 file3

This works just like "chown :group", but it's easier to type and read.
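
Since a file's owner can move it to any group the owner belongs to, you can try chgrp without root. The sketch below picks the last group ID listed for your user, which may simply be your primary group if you only have one. It assumes GNU stat, and the file name "report" is a placeholder.

```shell
workdir=$(mktemp -d)
touch "$workdir/report"

# Last group ID your user belongs to (numeric, to sidestep odd group names).
target_gid=$(id -G | awk '{print $NF}')

# chgrp accepts a numeric group ID as well as a group name.
chgrp "$target_gid" "$workdir/report"

# GNU stat: %g prints the numeric group ID of the file.
new_gid=$(stat -c '%g' "$workdir/report")
echo "wanted group $target_gid, file is now in group $new_gid"

rm -r "$workdir"
```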

Using -R

There are times you'll want to change the owner of not just a particular directory, but everything inside the directory and its subdirectories. When that comes up, just use the "-R" flag to make a "recursive" change:

chown -R user:group directoryname

The "-R" flag works with chgrp as well. With both commands the change will first be applied to the parent directory, then the command will iterate through everything inside the directory (including subdirectories) and apply the change to each of them as well.

Affecting symlinks

You run into a special circumstance when you try to use chown or chgrp with a symlink. A symlink is kind of like an alias for another file, similar to a shortcut in Windows. Rather than apply the change to the symlink itself, by default the filesystem will apply the change to the target of the symlink. So if the symlink "link" points to the file "thefile", consider this command:

chown user:group link

When that command executes, the system will actually change the owner and group for the target file "thefile". The ownership of the symlink "link" will remain the same.

If you want to change the owner and/or group of a symlink, use the "-h" flag for chown and chgrp, as in:

chown -h user:group link

The other basics: Permissions

Now that we hopefully understand ownership — namely, that it allows the control of permissions — let's talk about permissions themselves.

There are two parts to permissions. The first involves what someone is allowed to do with a file, and the second involves who that "someone" can be. Let's look at the possibilities for "what" before we go into the "who".

What can be done

When controlling what can be done to a file or directory, there are three categories of actions: read, write, and execute.

What is specifically allowed or disallowed can be different for files and directories, so we'll talk about both for each category.

Read

The "read" permission controls, well, who can read a file. If you don't have read permissions for a file you can't look inside and see its contents.

The "read" permission for a directory controls whether or not you can see a list of the files in the directory. Note, however, that to do so you will also need "execute" permission for the directory.

Write

The "write" permission on a file controls whether or not you can change the file's contents. If you want to edit the text in an html file, for example, you need write permission before you can do so.

The "write" permission on a directory controls whether or not you can add, delete, or rename files in that directory.

Note that the way write permissions work, only the write permission on the enclosing directory will affect whether or not you can rename or delete a file. Well, at the operating system level, anyway; some programs, like "rm", will stop and ask for confirmation before deleting a file you can't write to. There's nothing stopping another program that doesn't have a similar check built into it from deleting a file you can't write to and don't own.

The rename and delete permissions might seem a little weird until you consider what we mentioned earlier: The filesystem considers a directory to be a special kind of file. If you think of a directory as a special file that lists the files it contains and how to find them on the disk, it might make a bit more sense that write permission for the directory would let you delete or change that list.
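
A quick experiment makes the point. In this sketch a file has every write bit removed, yet it can still be deleted because the directory holding it is writable. The "-f" flag tells rm not to stop and ask questions; the file name "locked" is a placeholder.

```shell
workdir=$(mktemp -d)
echo "do not edit" > "$workdir/locked"

# Take away every write bit on the file itself.
chmod 444 "$workdir/locked"

# The directory is still writable by you, so the entry can be removed.
rm -f "$workdir/locked"

if [ -e "$workdir/locked" ]; then
    outcome="still there"
else
    outcome="deleted"
fi
echo "read-only file in a writable directory: $outcome"

rmdir "$workdir"
```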

Note that, just like read, to exercise your write permissions in a directory you will also need "execute" permission for said directory.

Execute

The "execute" permission for a file allows you to run that file from the command line. In order to run any command ("chown", "ls", "rm", etc.), you have to have execute permission for the file representing that command. If you try to run a command and get a "permission denied" error, it's because you don't have execute permission.

The "execute" permission for a directory lets you perform an operation in that directory, or to change your working directory ("cd") to that directory.

Remember those two "Note that..." lines in the sections on read and write, about needing execute permission for a directory too? This is why. Even if you have read permission for a directory, without execute permission "ls" can't get at the entries inside it: at best it shows you bare file names along with a string of "permission denied" errors, and it can't report sizes, owners or permissions. The system wants execute permission on the directory before it will let you reach anything in it.

Basically, to affect anything inside a directory, you need to be able to "execute" the directory first.

Who can do what

Now that we have an idea of what permissions are available, let's look at what categories we can use to control who actually gets affected by those permissions. The categories are: user, group, and other.

User

The "user" permission category refers to permissions that apply to the owner of the file. It's the only category that specifically targets only one user, because only one user can own the file.

It's tempting to think of this category as "owner", but I recommend against it. You'll see why in successive articles, but the main reason is that there's already another category that starts with an "o" (that being "other"), and that can get confusing. Stick to "user", as in "the user who owns the file", since that can be abbreviated to "u". Trust me for now.

Group

The "group" category refers to users that are in the same group as the file. If the file is in the group "devs", and the file has write permission for its group, that would mean that users in the "devs" group will have write access to the file.

Other

The "other" category is a catch-all for everyone who doesn't fall under "user" or "group". You use this category to determine whether that faceless mass of anonymous users will be able to read the file, or edit it, or run it as a command.

Category priority

It's important to note that permission categories are applied in the above order, and the first permission category the system finds for a user is the only one it will apply. If you're the owner of the file, your permissions are whatever are set for "user", so the system won't bother checking the group permissions for the file — it's already found what it's going to use.

The reason this is important is that if you set a permission on "other", that permission will not be applied to the file's owner or to anyone in that file's group. Those users will get the permission set in "user" or "group", respectively.

If you don't set "read" permission on a file for the "group" category but do set it for the "user" and "other" categories, that will mean that users in the file's group will not have read access but everyone else will. Look at it as an easy way to prevent access for a small group of users without needing to add everyone else to a more privileged group. Just put the offending users and the file in the "outcasts" group and remove group read access for the file in question, and you're set.
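
The same first-match rule applies to the file's owner, which you can check without setting up any groups. The sketch below gives "group" and "other" read permission but leaves "user" with nothing. It assumes you are running as a normal user, since root skips this check; the file name "note" is a placeholder.

```shell
workdir=$(mktemp -d)
echo "readable by strangers" > "$workdir/note"

# No permissions for user, read-only for group and other.
chmod 044 "$workdir/note"

# As the owner you land in the "user" category, so the read fails
# even though "other" would have been allowed.
# (Assumption: you are not running this as root, who skips the check.)
if cat "$workdir/note" > /dev/null 2>&1; then
    owner_read="allowed"
else
    owner_read="denied"
fi
echo "owner read: $owner_read"

rm -rf "$workdir"
```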

Permissions plus users

Combining ownership, user categories and permissions gives you a lot of options for controlling access to files and directories. A few common examples:

If you make a file read-only for "other" but let "user" and "group" write to it, then you can establish a group of editors for a file while still allowing other users (like the one running the web server, for example) to read it. Just add the privileged users to the same group as the file.

Setting "read" permission for "user" but removing it from "group" and "other" will ensure that only the owner of the file can view its contents. This is handy when you're just using a file for your own testing purposes, or don't want someone else coming along and criticizing a document until you're done writing it.

Setting "execute" permission for a file allows it to be run as a command, so if you have a command you only want specific users to be able to run, remove "execute" permission on the file for the "other" category.

Directories get the same treatment. Many system log directories are set to "read" and "execute" by just "user" (usually root) and exclude those permissions from other categories to ensure that only someone with superuser access will be able to view the logs, no matter what permissions are set on the files themselves.

Why root exists

All of this leads up to the root user's reason for being: access and control. The root user can change the ownership and permissions of any file or directory on the system. That user can also read and write files and directories regardless of the permissions set on them.

If "user" can't read a file but "other" can, root can read it. Similarly, if "user" can read the file but "other" can't, then root can still read the file. Even if no category has "read" permission (not user, not group, and not other), root can still read the file. The one exception is "execute": root can only run a file as a command if at least one category has execute permission.

Removing write permissions from all categories is still useful for files you really don't want to accidentally change, since it stops every normal user (including the owner) from editing them. It won't stop root, though. The same goes for deletion: going back to what we discussed about the way permissions work with directories, removing the "write" permission from the enclosing directory keeps normal users from deleting a file, but root can still delete it.

Changing permissions

We actually won't go into that in this article, but we will in the next article. The command is "chmod", and it's complicated, and this article is running long already.

Summary

With a basic understanding of how file permissions work in Linux, you should be better prepared to secure files from accidental or malicious harm. You should also be able to keep an eye out for errors that can be caused by restrictive file permissions, like an application being unable to write to its log (no write permission for the user that owns the process), or a web server being unable to serve an html file (no read permission, or the directory doesn't have execute permission).

A bit confusing? Definitely. Useful? That too. All this stuff about file ownership and permissions makes more sense once you learn how to view the permissions, which will be the topic of the next article in this series.