Aggressive crawlers that hit your web site a lot can cause performance problems.
There are many ways to identify aggressive crawlers, including writing custom scripts that analyze your web server logs.
One tool that we found to be useful in analyzing which crawlers hit the site the most today or yesterday is Go Access.
Getting Go Access
Go Access is available for Ubuntu Natty Narwahl (11.04) only, but not earlier LTS releases.
Therefore you have to build it from source for 8.04 or 10.04, and before that, you also need to install some dependencies.
Download Go Access
If you are installing from source, you must use the tar ball, not the git checkout, since it does not have a configure script.
Visit the Go Access web site, and download the latest version:
# tar xzvf goaccessXXX.tar.gz # cd goaccessXXX
Since Go Access requires glib2 and ncurses, you need to install the dev versions of those packages:
# aptitude install libglib2.0-dev libncurses5-dev
Building Go Access
Use the following commands to build Go Access:
# ./configure # make # make install
Running Go Access
Go Access can be run to directly access the web server's log, using the -f option. In this mode, Go Access will update itself to reflect new accesses to the web server in real time.
Make sure you have enough free memory before you run Go Access.
Filtering web server logs
However, for a large site, this mode will not be practical, because it will have to sift through gigabytes of data. Also, you may be interested in what happened today or yesterday, rather than what happened a week ago.
Therefore, you need to filter the log by day, before piping the output to goaccess.
For example, to select a certain day, do the following:
# grep '19/Oct/2011' /var/log/apache2/access-example.com.log | goaccess -asb
To filter on a specific hour, you can use:
grep '20/Oct/2011:17' /var/log/apache2/access-example.com.log | goaccess -asb
This will take a few minutes on a large site, so be patient.
Performance of Go Access
Go Access takes some time to parse the log and generate the data structures needed for the report.
For example, for a site with a log file with 17,450,514 lines, and filtering on a single day with 3,413,460 hits, the generation time was 175 seconds. The site is on a dedicated server with dual Quad Core Xeon E5620 @ 2.4GHz and 8GB RAM.
On the same server, with direct access to an access log file with 19,087,802 lines (i.e. I used -f directly), it took 622 seconds.
On another similar server that had 9,197,594 lines, without grep filtering (i.e. I used -f directly). It took 139 seconds on a similar server (16GB instead of 8GB).
Here are other Go Access performance figures.
So, be patient ...
Take a few minutes to familiarize yourself with the display, and how to drill down. Access the built in help using the F1 key.
For crawlers, you will be interested in "module" 8 (IP Addresses).
Press 8, then the letter o to open module 8. You will see the top IP addresses accessing the site.
You should see something like this:
586.50 MB - 30070 18.104.22.168 537.67 MB - 28199 22.214.171.124 48.80 MB - 3570 126.96.36.199 42.51 MB - 3146 188.8.131.52
Obviously, the first two IP addresses access the site far more than other IP addresses for that day. This is not good ...
Use the arrow keys to highlight and address, then press the letter o again to know which company/network this address belongs to.
Problematic Crawler Cases
In once case, it was Google's crawlers who where the culprit. We explain that in another article on our site.
In another case, it was an address that belonged to Amazon's EC2.
For the Amazon EC2 case, here is how we discovered that something is terribly wrong:
We checked the logs for what this IP address is doing.
grep ^184.108.40.206 access-example.com.log| less
In our case, we found tens of thousands of accesses like this to various URLs.
220.127.116.11 - - [02/Oct/2011:16:57:42 -0700] "HEAD /some-url HTTP/1.1" 200 - "-" "-"
Now block the IP address:
iptables -I INPUT -s 18.104.22.168 -j DROP
Your site should see the load go down considerably, if crawlers are indeed the problem.