Apache, like all other web servers, has a mechanism to write an "access log" recording every HTTP request to the server. The logged information is valuable and includes the IP address of the user making the request, the date and time, the size of the response in bytes, the HTTP status code, the request's URI, the referer, and the browser and operating system that the user is using.
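For reference, the fields listed above correspond to the "combined" log format used later in this article. The definition below is the stock one from Apache's default configuration, shown here only as an illustration; your distribution may define it in a different file.

```apache
# Stock definition of the "combined" nickname.
#   %h       client IP address (or hostname, if lookups are enabled)
#   %l %u    identd logname and authenticated user (usually "-")
#   %t       date and time the request was received
#   \"%r\"   first line of the request: method, URI and protocol
#   %>s      final HTTP status code
#   %b       size of the response in bytes, excluding headers
#   Referer and User-Agent are taken from the request headers
LogFormat "%h %l %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-Agent}i\"" combined
```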
This information is useful in many ways, including troubleshooting, bandwidth estimation, traffic analysis using packages like Awstats, and much more ...
The problem: Disk space and Disk I/O
However, for large web sites, those that get hundreds of thousands of page views a day or more, the size of the log file and the time it takes Awstats to process it become a problem.
For example, on a site that used to get around 2 million page views per day, the log for a single week had 133,910,121 entries and consumed over 38 gigabytes! That is roughly 19 million entries per day, or more than nine log entries for every page view. As the logs are rotated, more disk space is consumed by the old logs. Compressing the old logs helps for a while, but the current and previous logs are normally still left uncompressed.
Moreover, Awstats was taking so long that processing the entries logged since its last invocation took more time than remained until the next invocation, causing an insurmountable backlog that would never clear. The logging itself also increases the load on the disk, wasting both disk space and I/O throughput, as the disk is occupied writing log entries all the time.
The Solution: Excluding static files
In order to solve this, we started excluding most static files from Apache's logging. Only the Drupal-generated pages would be logged, but not graphics files, stylesheets, JavaScript, etc.
This solution can be accomplished by modifying the vhost for the site as follows.
Normally, in the vhost entry for the site, you will have something like:
CustomLog /var/log/apache2/access-example.com.log combined
Now we replace it with this:
SetEnvIf Request_URI "\.(jpg|xml|png|gif|ico|js|css|swf|js\?.|css\?.)$" DontLog
CustomLog /var/log/apache2/access-example.com.log combined Env=!DontLog
What this code does is declare an environment variable called DontLog if the Request URI matches the above pattern. The pattern basically covers all common static file types.
Note that this is for Drupal 6 or later, which appends a question mark followed by one letter to js and css URLs; hence the two extra alternatives at the end of the pattern.
The CustomLog line then adds a condition: if the DontLog variable is not set, the request is logged; otherwise it is not.
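A possible variation, sketched here rather than taken from the site's actual configuration: SetEnvIfNoCase performs the same match case-insensitively, so a file such as LOGO.PNG is skipped too. The jpeg extension is an addition for illustration. According to the mod_setenvif documentation, Request_URI generally does not include the query string, so in this form the suffix that Drupal appends to js and css URLs should not need its own alternative; verify this against your own logs before relying on it.

```apache
# Case-insensitive match on the request path; sets DontLog for common static files.
# "jpeg" is added for illustration and was not part of the original pattern.
SetEnvIfNoCase Request_URI "\.(jpg|jpeg|png|gif|ico|js|css|swf|xml)$" DontLog

# Log only the requests for which DontLog was not set.
CustomLog /var/log/apache2/access-example.com.log combined env=!DontLog
```

After reloading Apache, requesting a page and one of its images and then checking the access log should show an entry for the page only.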
Conclusion
Once we did this, the disk space used on the server was much less than before, and so was the I/O load on the disk where the logs reside.
Obviously there are some downsides: for example, Awstats will no longer report accurate hits or bandwidth. It also makes tracking image leeching harder.
But all things considered, this compromise is worth the effort and the drawbacks.
The site has grown since and is doing over 3 million page views per day. Had we not come up with the solution detailed above, we would have run into serious issues ...
Comments
Peter Phaal (not verified)
mod_sflow
Wed, 2011/01/12 - 16:46
Another option is to use mod_sflow for monitoring.
Instead of selectively logging filtered requests, mod_sflow uses random sampling to provide scalable real-time monitoring of large Apache clusters. In addition to sampling web requests, the module maintains performance counters that can be trended using tools like Cacti.
For more information:
sFlow: HTTP
Visitor (not verified)
To /dev/null with you
Wed, 2012/02/15 - 21:31
To /dev/null with you wretched access log!