Reducing the size and I/O load of Apache's web server log files

Apache, like most other web servers, has a mechanism to write an "access log" recording every HTTP request to the server. The information logged is valuable, and includes things like the IP address of the user making the request, the date and time, the size of the response in bytes, the HTTP status code, the request's URI, the referer, and the browser/operating system that the user is using.

This information is useful in many ways, including troubleshooting, bandwidth estimation, traffic analysis using packages like Awstats, and much more ...

The problem: Disk space and Disk I/O

However, for large web sites, those that get hundreds of thousands of page views a day or more, the size of the log file and the time it takes for Awstats to process it become a problem.

For example, on a site that used to get around 2 million page views per day, the access log accumulated 133,910,121 entries in a single week and consumed over 38 gigabytes! As the logs are rotated, the old logs consume even more disk space. Compressing the old logs helps for a while, but the current and most recent logs normally remain uncompressed.
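As a rough sanity check of those numbers (assuming a full 7-day week and binary gigabytes), the per-entry size and per-page request count work out as follows:

```shell
# Back-of-envelope arithmetic from the figures quoted above.
awk 'BEGIN {
    entries   = 133910121        # log entries in one week
    bytes     = 38 * 1024^3      # "over 38 gigabytes"
    pages_day = 2000000          # ~2 million page views per day

    printf "avg entry size: %.0f bytes\n", bytes / entries        # ~305
    printf "entries per day: %.1f million\n", entries / 7 / 1e6   # ~19.1
    printf "requests per page: %.1f\n", entries / 7 / pages_day   # ~9.6
}'
```

Nearly ten log entries per page view suggests that the bulk of the log was static assets, which is why excluding them (as described below) shrinks the log so dramatically.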

Moreover, Awstats was taking so long that processing the log accumulated since its last invocation took more time than the interval until the next invocation, creating an insurmountable backlog that would never clear. Logging every request also increases the load on the disk, wasting both disk space and I/O throughput, since the disk is busy writing log entries all the time.

The Solution: Excluding static files

In order to solve this, we started excluding most static files from Apache's logging. Only the Drupal-generated pages would be logged, but not graphics files, stylesheets, JavaScript, ...etc.

This solution can be accomplished by modifying the vhost for the site as follows.

Normally, in the vhost entry for the site, you will have something like:

CustomLog /var/log/apache2/access-example.com.log combined

Now we replace it with this:

SetEnvIf Request_URI "\.(jpg|xml|png|gif|ico|swf)$|\.(js|css)(\?.)?$" DontLog
CustomLog /var/log/apache2/access-example.com.log combined env=!DontLog

What this code does is declare an environment variable called DontLog whenever the request URI matches the pattern, which covers the common static file types.

Note that the optional question mark followed by one character after js and css is there for Drupal 6 and later, which append a query string to CSS and JavaScript file URLs.

The CustomLog line then adds a condition: if the DontLog variable is not set, the request is logged; otherwise it is not.
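Before deploying such a pattern, it helps to check it against a few sample URIs outside Apache. SetEnvIf uses POSIX extended regular expressions, so grep -E is a close stand-in; the pattern below is a variant that escapes the query-string question mark explicitly, and the file paths are only examples:

```shell
# Check which request URIs the exclusion pattern would match,
# using grep -E as a stand-in for SetEnvIf's regex engine.
pattern='\.(jpg|xml|png|gif|ico|swf)$|\.(js|css)(\?.)?$'

matches() { printf '%s' "$1" | grep -qE "$pattern"; }

matches "/sites/default/files/logo.png" && echo "logo.png: excluded from log"
matches "/misc/drupal.js?a"             && echo "drupal.js?a: excluded from log"
matches "/node/123"                     || echo "/node/123: logged normally"
```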

Conclusion

Once we did this, the space used on the server dropped considerably, and so did the I/O load on the disk where the logs resided.

Obviously there are some downsides: for example, Awstats will no longer report hits or accurate bandwidth, and tracking image leeching becomes harder.

But all things considered, the compromise is worth it despite these drawbacks.

The site has grown since and is doing over 3 million page views per day. Had we not come up with the solution detailed above, we would have run into serious issues ...

Comments

I haven't run into a problem

I haven't run into a problem yet with 800 Mbit of traffic or in clusters. I have been using this since Apache 2.0. The benefits have been noticeable, especially when disk I/O is already overloaded.

Can't quantify it

We can't quantify exactly how much performance gain it added.

The disk space usage was just insane, and we would have run out of it within several weeks if we had continued to log images.

But I see this as having removed a show stopper that blocked normal site operation. We removed a blockage rather than gained speed.

Mythical?

Mythical? Nice choice of words here. I hope you are not implying that it does not exist or is made up?

As for your comment, please ask on the relevant page, and I will answer there.

We solved this by writing all logs to a socket

We have multiple web servers and needed to combine all the logs together for webalizer. We did this by adding:

LogFormat "vhost-%v %h %l %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-Agent}i\"" combined
CustomLog "|/usr/bin/logger -p local1.info -u /var/log/httpd/apache_log.socket" combined

to the httpd.conf

and then setting up syslog-ng to create the socket and send the logs via UDP to a remote host. On the receiving server, we use syslog-ng to filter based on the vhost-%v token in the incoming stream and write each vhost to its own file.
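On the receiving side, a minimal syslog-ng configuration along these lines could do that filtering. This is only a sketch: the names, path, and port are illustrative, and match() semantics differ between syslog-ng versions (3.x expects an explicit value("MESSAGE") argument).

```
# Receive forwarded Apache logs over UDP.
source s_net { udp(ip(0.0.0.0) port(514)); };

# Pick out one vhost's entries by the "vhost-..." token the LogFormat prepends.
filter f_example { match("vhost-example.com"); };

destination d_example { file("/var/log/combined/access-example.com.log"); };

log { source(s_net); filter(f_example); destination(d_example); };
```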

This solved a lot of issues, such as real-time combining of all servers' logs, as well as off-loading the log processing to another server.

mod_sflow

Another option is to use mod_sflow for monitoring.

Instead of selectively logging filtered requests, mod_sflow uses random sampling to provide scalable real-time monitoring of large Apache clusters. In addition to sampling web requests, the module maintains performance counters that can be trended using tools like Cacti.

For more information:

sFlow: HTTP

To /dev/null with you

To /dev/null with you wretched access log!