Apache, like most other web servers, can write an "access log" recording every HTTP request made to the server. The logged information is valuable and includes the IP address of the client making the request, the date and time, the size of the response in bytes, the HTTP status code, the request's URI, the referer, and the user agent (the browser and operating system that the user is running).
This information is useful in many ways, including troubleshooting, bandwidth estimation, traffic analysis using packages like Awstats, and much more ...
The Problem: Disk space and disk I/O
However, for large web sites, those that get hundreds of thousands of page views a day or more, the size of the log file and the time it takes Awstats to process it become a problem.
For example, on a site that used to get around 2 million page views per day, the log had 133,910,121 entries for one week alone and consumed over 38 gigabytes! As the logs are rotated, the old logs consume even more disk. Compressing the old logs helps for a while, but the current and previous logs normally remain uncompressed.
Moreover, Awstats was taking so long that processing the entries logged since its last run took more time than was left until its next run, causing an insurmountable backlog that would never clear. Logging every request also increases the load on the disk, wasting both disk space and I/O throughput, since the disk is busy writing log entries all the time.
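For a sense of scale, the figures above can be turned into per-entry and per-second numbers. This is only a back-of-the-envelope sketch in Python: it assumes "38 gigabytes" means decimal gigabytes (and treats it as a lower bound, since the log was "over" that size), takes "around 2 million page views per day" at face value, and uses a function name, log_scale, that is ours.

```python
# Figures quoted in the article; the byte count is a lower bound.
ENTRIES_PER_WEEK = 133_910_121
LOG_BYTES = 38 * 10**9            # assumption: decimal gigabytes
PAGE_VIEWS_PER_DAY = 2_000_000    # "around 2 million", so approximate

SECONDS_PER_WEEK = 7 * 24 * 60 * 60


def log_scale(entries, log_bytes, page_views_per_day):
    """Derive rough per-entry and per-second figures for one week of logs."""
    return {
        "bytes_per_entry": log_bytes / entries,
        "entries_per_second": entries / SECONDS_PER_WEEK,
        "entries_per_page_view": entries / (page_views_per_day * 7),
    }


if __name__ == "__main__":
    scale = log_scale(ENTRIES_PER_WEEK, LOG_BYTES, PAGE_VIEWS_PER_DAY)
    for name, value in scale.items():
        print(f"{name}: {value:.1f}")
```

Under these assumptions, each page view accounts for between nine and ten log entries, which suggests that most of the log consists of requests for something other than the pages themselves.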
The Solution: Excluding static files
To solve this, we started excluding most static files from Apache's logging. Only the Drupal-generated pages would be logged, but not graphics files, stylesheets, JavaScript, and so on.
This solution can be accomplished by modifying the vhost for the site as follows.
Normally, in the vhost entry for the site, you will have something like:
CustomLog /var/log/apache2/access-example.com.log combined
Now we replace it with this:
SetEnvIf Request_URI "\.(jpg|xml|png|gif|ico|js|css|swf|js\?.|css\?.)$" DontLog
CustomLog /var/log/apache2/access-example.com.log combined env=!DontLog
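This declares an environment variable called DontLog whenever the request URI matches the above pattern, which covers the most common types of static files.
Note that this is for Drupal 6 or later, which appends a question mark and one character to js and css URLs, hence the last two alternatives in the pattern. According to the mod_setenvif documentation, Request_URI generally does not include the query string, so in practice the plain js and css alternatives are the ones that match these requests.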
The CustomLog line then puts a condition: if the DontLog variable is not set, then log the request, otherwise do not log it.
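To check which requests the condition would drop before touching a live server, the decision can be mimicked in a few lines of Python. This is only a sketch: Python's re module stands in for Apache's regex engine, the sample paths are made up, the names STATIC_PATTERN and should_log are ours, and the query string is stripped before matching on the assumption that Request_URI excludes it.

```python
import re
from urllib.parse import urlsplit

# The static file extensions from the SetEnvIf pattern. The query-string
# variants are left out because the path is matched without its query string.
STATIC_PATTERN = re.compile(r"\.(jpg|xml|png|gif|ico|js|css|swf)$")


def should_log(request_target: str) -> bool:
    """Return True if the request would still be written to the access log."""
    path = urlsplit(request_target).path                # drop "?query"
    dont_log = STATIC_PATTERN.search(path) is not None  # SetEnvIf sets DontLog
    return not dont_log                                 # CustomLog ... !DontLog


if __name__ == "__main__":
    samples = [
        "/node/123",
        "/misc/drupal.js?A",
        "/sites/default/files/logo.png",
        "/rss.xml",
        "/user/login",
    ]
    for target in samples:
        verdict = "logged" if should_log(target) else "skipped"
        print(f"{target:35} {verdict}")
```

One thing this makes visible is that xml is in the list, so a feed such as /rss.xml would be skipped along with the images and stylesheets.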
Conclusion
Once we did this, the space used on the server was much less than before, and so was the I/O load on the disk where the logs resided.
Obviously there are some downsides: for example, Awstats will no longer report accurate hit counts or bandwidth. It also makes tracking image leeching harder.
But all things considered, this compromise is worth the effort and the drawbacks.
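Before accepting that trade-off on another site, it may help to measure what the filter would hide. The sketch below counts the entries and response bytes in an existing, unfiltered log that belong to static files. It assumes the combined log format, uses Python's re module as a stand-in for Apache's regex engine, and runs on made-up sample entries; the function name hidden_by_filter is ours.

```python
import re
from urllib.parse import urlsplit

# Static file extensions, as in the SetEnvIf pattern.
STATIC = re.compile(r"\.(jpg|xml|png|gif|ico|js|css|swf)$")
# Request target, status code, and response size of a combined-format entry.
ENTRY = re.compile(r'"[A-Z]+ (\S+) [^"]*" (\d{3}) (\d+|-)')


def hidden_by_filter(lines):
    """Return (entries, response_bytes) that the filter would keep out of the log."""
    entries = 0
    response_bytes = 0
    for line in lines:
        match = ENTRY.search(line)
        if match is None:
            continue  # not a recognizable log entry
        target, _status, size = match.groups()
        if STATIC.search(urlsplit(target).path):
            entries += 1
            response_bytes += 0 if size == "-" else int(size)
    return entries, response_bytes


if __name__ == "__main__":
    sample = [
        '203.0.113.5 - - [05/Aug/2010:02:30:00 +0000] "GET /node/1 HTTP/1.1" 200 5120 "-" "Mozilla/5.0"',
        '203.0.113.5 - - [05/Aug/2010:02:30:01 +0000] "GET /misc/drupal.js?A HTTP/1.1" 200 900 "-" "Mozilla/5.0"',
        '203.0.113.5 - - [05/Aug/2010:02:30:01 +0000] "GET /files/logo.png HTTP/1.1" 304 - "-" "Mozilla/5.0"',
    ]
    print(hidden_by_filter(sample))
```

Run against a real log, by passing an open file instead of the sample list, this gives a rough idea of how many hits and how much bandwidth Awstats would stop seeing.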
The site has since grown and now serves over 3 million page views per day. Had we not come up with the solution detailed above, we would have run into serious issues ...
Comments
Visitor (not verified)
Turn BufferedLogs on as well.
Thu, 2010/08/05 - 02:30
Turn BufferedLogs on as well.
Khalid
Experimental
Thu, 2010/08/05 - 11:09
The documentation for Apache mod_log_config says this is experimental and should be used with caution.
Visitor (not verified)
I haven't run into a problem
Thu, 2010/08/05 - 11:26
I haven't run into a problem yet with 800 Mbit traffic or in clusters. I have been using this since Apache 2.0. The benefits have been noticeable, especially when disk I/O is already overloaded.
Visitor (not verified)
Another downside of this
Thu, 2010/08/05 - 04:49
Another downside of this tweak could be the lack of information about 404s for static files.
Visitor (not verified)
Nice "trick",
Thu, 2010/08/05 - 08:06
Nice "trick", thanks.
Question: do you happen to know how big this performance improvement is?
Khalid
Can't quantify it
Thu, 2010/08/05 - 11:07
We can't quantify exactly how much performance gain it added.
The disk space usage was just insane, and we would have run out of it quickly in several weeks if we continued to log images.
But, I see this as a show stopper and blockage of normal site operation. We removed a block rather than gained speed.
Visitor (not verified)
I wonder if/when Khalid will
Fri, 2010/08/06 - 09:16
I wonder if/when Khalid will have to move this mythical site to two servers. I think he might be running out of ways to keep it on one.
Khalid
Mythical?
Fri, 2010/08/06 - 11:12
Mythical? Nice choice of words here. I hope you are not implying that it does not exist or is made up?
As for your comment, please ask on the relevant page, and I will answer there.
Dave Augustus (not verified)
We solved this by writing all logs to a socket
Mon, 2012/03/26 - 13:09
We have multiple web servers and needed to combine all the logs together for Webalizer. We did this by adding:
LogFormat "vhost-%v %h %l %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-Agent}i\"" combined
CustomLog "|/usr/bin/logger -p local1.info -u /var/log/httpd/apache_log.socket" combined
to the httpd.conf
and then setting up syslog-ng to create the socket and send the logs via UDP to a remote host. On the receiving server, we use syslog-ng to filter based on the vhost-%v in the incoming stream to write to a file.
This solved a lot of issues, such as real-time combining of all servers' logs, as well as off-loading the log processing to another server.
Visitor (not verified)
or... disable logging
Fri, 2010/08/06 - 12:21
or... disable logging altogether.