Aggressive crawlers that hit your web site a lot can cause performance problems.
There are many ways to identify aggressive crawlers, including writing custom scripts that analyze your web server logs.
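For example, a quick sketch of such a script is a one-liner that counts requests per client IP in the Apache log (the log path here is just an example; adjust it for your setup):
awk '{print $1}' /var/log/apache2/access-example.com.log | sort | uniq -c | sort -rn | head -20
The first field of the common/combined log format is the client address, so this prints the 20 busiest addresses for the whole log.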
One tool that we found useful for analyzing which crawlers hit the site the most on a given day is GoAccess.
Getting GoAccess
GoAccess is packaged for Ubuntu starting with Natty Narwhal (11.04), but not for the earlier LTS releases.
Therefore you have to build it from source on 8.04 or 10.04, and before building, you also need to install some dependencies.
Download GoAccess
If you are installing from source, use the tarball rather than the git checkout, since the latter does not include a configure script.
Visit the GoAccess web site, download the latest version, then unpack it:
# tar xzvf goaccessXXX.tar.gz
# cd goaccessXXX
Install Dependencies
Since GoAccess requires glib2 and ncurses, you need to install the dev versions of those packages:
# aptitude install libglib2.0-dev libncurses5-dev
Building GoAccess
Use the following commands to build GoAccess:
# ./configure
# make
# make install
Running GoAccess
GoAccess can read the web server's log directly, using the -f option. In this mode, GoAccess updates itself in real time as new requests hit the web server.
Make sure you have enough free memory before you run GoAccess.
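For example, assuming the Apache log lives at the path used elsewhere in this article, real-time mode looks like this:
goaccess -f /var/log/apache2/access-example.com.log -asb
The -a, -s and -b options are the same ones used with the piped examples below.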
Filtering web server logs
However, for a large site, this mode will not be practical, because it will have to sift through gigabytes of data. Also, you may be interested in what happened today or yesterday, rather than what happened a week ago.
Therefore, you need to filter the log by day, before piping the output to goaccess.
For example, to select a certain day, do the following:
# grep '19/Oct/2011' /var/log/apache2/access-example.com.log | goaccess -asb
To filter on a specific hour, you can use:
grep '20/Oct/2011:17' /var/log/apache2/access-example.com.log | goaccess -asb
This will take a few minutes on a large site, so be patient.
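You can also widen the filter to a range of hours with an extended regular expression, for example (same log path assumed):
egrep '20/Oct/2011:(08|09|10|11)' /var/log/apache2/access-example.com.log | goaccess -asb
This selects everything logged between 08:00 and 11:59 on that day before handing it to goaccess.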
Performance of GoAccess
GoAccess takes some time to parse the log and generate the data structures needed for the report.
For example, on a site with a log file of 17,450,514 lines, filtering on a single day with 3,413,460 hits took 175 seconds to generate the report. The site is on a dedicated server with dual Quad Core Xeon E5620 @ 2.4GHz and 8GB of RAM.
On the same server, reading an access log of 19,087,802 lines directly (i.e. using -f), it took 622 seconds.
On another, similar server (16GB of RAM instead of 8GB), reading a log of 9,197,594 lines directly with -f (no grep filtering) took 139 seconds.
Here are other GoAccess performance figures.
So, be patient ...
Drilling down
Take a few minutes to familiarize yourself with the display and how to drill down. Access the built-in help using the F1 key.
For crawlers, you will be interested in "module" 8 (IP Addresses).
Press 8, then the letter o to open module 8. You will see the top IP addresses accessing the site.
You should see something like this:
586.50 MB - 30070  66.249.71.243
537.67 MB - 28199  66.249.71.114
 48.80 MB -  3570  87.250.254.242
 42.51 MB -  3146  123.126.50.78
Obviously, the first two IP addresses access the site far more than other IP addresses for that day. This is not good ...
Use the arrow keys to highlight an address, then press the letter o again to see which company/network this address belongs to.
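If you prefer to check from the command line, an ordinary whois lookup gives the same information; for example, for the top address in the listing above:
whois 66.249.71.243 | grep -iE 'orgname|netname|descr'
The grep just trims the output to the ownership fields; different registries label them differently, hence the three alternatives.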
Problematic Crawler Cases
In one case, it was Google's crawlers that were the culprit. We explain that in another article on our site.
In another case, it was an address that belonged to Amazon's EC2.
For the Amazon EC2 case, here is how we discovered that something was terribly wrong:
We checked the logs to see what this IP address was doing.
grep '^184.73.45.144' access-example.com.log | less
In our case, we found tens of thousands of accesses like this to various URLs.
184.73.45.144 - - [02/Oct/2011:16:57:42 -0700] "HEAD /some-url HTTP/1.1" 200 - "-" "-"
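To get a feel for how widespread this was, a couple of quick commands against the same log count the hits and show which URLs were requested most (the field number assumes the standard combined log format):
grep -c '^184.73.45.144 ' access-example.com.log                        # total hits from that address
grep '^184.73.45.144 ' access-example.com.log | awk '{print $7}' | sort | uniq -c | sort -rn | head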
Now block the IP address:
iptables -I INPUT -s 184.73.45.144 -j DROP
Your site should see the load go down considerably, if crawlers are indeed the problem.
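If you later need to review or undo the block, iptables can list and delete the rule as well:
iptables -L INPUT -n --line-numbers          # list current rules with their numbers
iptables -D INPUT -s 184.73.45.144 -j DROP   # remove the block once it is no longer needed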
Comments
Lawrence (not verified)
Similar situation
Sun, 2012/05/20 - 09:46
While troubleshooting a similar issue on a client's server I noted that the Bing/MSN bot was acting much in the same way as this - causing lots of unnecessary traffic.
I checked that via tailing logs and grepping the connection counts.
My own process detailed here - http://www.computersolutions.cn/blog/2012/05/msn-bing-crawler-spider-madness/
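A typical one-liner for that sort of per-IP connection count (not necessarily exactly what I ran) is something like:
netstat -ntu | awk 'NR>2 {print $5}' | cut -d: -f1 | sort | uniq -c | sort -rn | head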
GoAccess seems like a much nicer tool for some of that, although it won't amalgamate the ip traffic from a given company as one source unfortunately.
Was feeling happy with myself for finding and sorting out Bing's excesses, and then the client complained server load was up yet again.
Used GoAccess, and found that one IP - 184.183.29.53 - had done about 2.7G of that day's traffic on one of their sites.
We have logs spread over multiple sites rather than consolidated, so my check was more like this (to cover the last few days plus the rotated log):
grep 'May/2012' /home/*/logs/*log /home/*/logs/*_log.1 | goaccess -asb
The current day's traffic was around 5G total, with 2.7G from that one IP, and no identifying details for it other than a presumably faked browser ID.
2.76 GB - 53922 /home/mybeonline/logs/access_log.1:184.183.29.53
The only other address ranges that came close in hit count (actual traffic was substantially lower) had valid reverse DNS and IP ownership for common spiders (e.g. GoogleBot, Sogou Web Spider, Baidu Spider).
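Checking the reverse DNS for an address is straightforward, e.g. for one of the Googlebot addresses mentioned in the article above:
host 66.249.71.243            # or: dig -x 66.249.71.243 +short
A genuine Googlebot address should resolve to a crawl-*.googlebot.com name.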
184.183.29.53 on the other hand identified itself as either
"Python-urllib/2.5"
or
"Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.5; en-US; rv:1.9.0.11) Gecko/2009060214 Firefox/3.0.11"
A line count of the logs showed 70,000-odd log lines:
grep 'Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.5; en-US; rv:1.9.0.11) Gecko/2009060214 Firefox/3.0.11' /home/*/logs/*log /home/*/logs/*_log.1 | wc -l
70120
A quick double check to make sure that the user agent matched the IP gave the same result.
grep 'Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.5; en-US; rv:1.9.0.11) Gecko/2009060214 Firefox/3.0.11' /home/*/logs/*log /home/*/logs/*_log.1 | grep 184.183.29.53 | wc -l
70120
Guess the Firefox/3.0.11 is the new spambot Favourite!
As the user agent seems remarkably close to the one listed on the botnet page - https://2bits.com/botnet/botnet-hammering-web-site-causing-outages.html - you may want to watch out for that one in future too.
Cheers,
Lawrence.