Identifying aggressive crawlers using Go Access

Aggressive crawlers that hit your web site at a high rate can cause performance problems.

There are many ways to identify aggressive crawlers, including writing custom scripts that analyze your web server logs.

One tool that we found to be useful in analyzing which crawlers hit the site the most today or yesterday is Go Access.

Getting Go Access

Go Access is packaged for Ubuntu Natty Narwhal (11.04), but not for earlier LTS releases.

Therefore you have to build it from source for 8.04 or 10.04, and before that, you also need to install some dependencies.

Download Go Access

If you are installing from source, you must use the tar ball, not the git checkout, since the git checkout does not include a configure script.

Visit the Go Access web site, and download the latest version:

# tar xzvf goaccessXXX.tar.gz
# cd goaccessXXX

Install Dependencies

Since Go Access requires glib2 and ncurses, you need to install the dev versions of those packages:

# aptitude install libglib2.0-dev libncurses5-dev

Building Go Access

Use the following commands to build Go Access:

# ./configure 
# make
# make install

Running Go Access

Go Access can be run to directly access the web server's log, using the -f option. In this mode, Go Access will update itself to reflect new accesses to the web server in real time.

Make sure you have enough free memory before you run Go Access.

Filtering web server logs

However, for a large site, this mode will not be practical, because it will have to sift through gigabytes of data. Also, you may be interested in what happened today or yesterday, rather than what happened a week ago.

Therefore, you need to filter the log by day, before piping the output to goaccess.

For example, to select a certain day, do the following:

# grep '19/Oct/2011' /var/log/apache2/ | goaccess -asb

To filter on a specific hour, you can use:

# grep '20/Oct/2011:17' /var/log/apache2/ | goaccess -asb

This will take a few minutes on a large site, so be patient.
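
If you run this kind of filter regularly, the date string can be generated rather than typed by hand. The following is a minimal sketch, assuming GNU date (for the -d option); the function name log_day and the ACCESS_LOG variable are placeholders made up for this example, not part of Go Access.

```shell
#!/bin/sh
# Sketch: build the date string Apache writes in its access log
# (for example 19/Oct/2011) so it can be used as a grep pattern.
# Assumes GNU date; LC_ALL=C forces English month abbreviations.

log_day() {
    # $1 is anything GNU date -d accepts: today, yesterday, 2011-10-19
    LC_ALL=C date -d "$1" '+%d/%b/%Y'
}

# Hypothetical usage; ACCESS_LOG stands for your own log file:
#   grep "$(log_day yesterday)" "$ACCESS_LOG" | goaccess -asb

log_day 2011-10-19   # prints 19/Oct/2011
```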

Performance of Go Access

Go Access takes some time to parse the log and generate the data structures needed for the report.

For example, for a site with a log file with 17,450,514 lines, and filtering on a single day with 3,413,460 hits, the generation time was 175 seconds. The site is on a dedicated server with dual Quad Core Xeon E5620 @ 2.4GHz and 8GB RAM.

On the same server, with direct access to an access log file with 19,087,802 lines (i.e. I used -f directly), it took 622 seconds.

On another similar server (16GB of RAM instead of 8GB), with a log file of 9,197,594 lines and no grep filtering (i.e. I used -f directly), it took 139 seconds.

Here are other Go Access performance figures.

So, be patient ...

Drilling down

Take a few minutes to familiarize yourself with the display, and how to drill down. Access the built-in help using the F1 key.

For crawlers, you will be interested in "module" 8 (IP Addresses).

Press 8, then the letter o to open module 8. You will see the top IP addresses accessing the site.

You should see something like this:

586.50 MB - 30070
537.67 MB - 28199
 48.80 MB - 3570
 42.51 MB - 3146 

Obviously, the first two IP addresses access the site far more than other IP addresses for that day. This is not good ...

Use the arrow keys to highlight an address, then press the letter o again to find out which company/network this address belongs to.
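
To cross-check the hit counts from module 8 outside of Go Access, a plain pipeline is enough. This is a rough sketch, assuming the Apache common or combined log format, where the client address is the first field. The function name top_ips and the sample lines are made up for illustration; the addresses come from the documentation-only ranges of RFC 5737.

```shell
#!/bin/sh
# Sketch: list the busiest client addresses in an access log read from stdin.
# Assumes the client address is the first whitespace-separated field.

top_ips() {
    # $1 = how many addresses to show (default 10)
    awk '{ print $1 }' | sort | uniq -c | sort -rn | head -n "${1:-10}"
}

# Made-up sample lines, not real traffic:
sample='192.0.2.10 - - [19/Oct/2011:10:00:00 -0700] "GET /a HTTP/1.1" 200 512 "-" "-"
192.0.2.10 - - [19/Oct/2011:10:00:01 -0700] "GET /b HTTP/1.1" 200 512 "-" "-"
198.51.100.7 - - [19/Oct/2011:10:00:02 -0700] "GET /a HTTP/1.1" 200 512 "-" "-"'

# Each output line is a hit count followed by an address, busiest first.
printf '%s\n' "$sample" | top_ips 2
```

Unlike Go Access, this only counts hits and does not report bandwidth.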

Problematic Crawler Cases

In one case, Google's crawlers were the culprit. We explain that in another article on our site.

In another case, it was an address that belonged to Amazon's EC2.

For the Amazon EC2 case, here is how we discovered that something was terribly wrong:

We checked the logs for what this IP address is doing.

grep ^| less

In our case, we found tens of thousands of accesses like this to various URLs:

- - [02/Oct/2011:16:57:42 -0700] "HEAD /some-url HTTP/1.1" 200 - "-" "-"
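
A quick way to see a pattern like this is to count requests per HTTP method for the suspect address. The sketch below assumes the combined log format, where field 1 is the client address and field 6 is the opening quote followed by the method. The function name methods_for_ip, the sample lines, and the address 192.0.2.10 (a documentation-only address) are placeholders for this example.

```shell
#!/bin/sh
# Sketch: count requests per HTTP method for one client address.
# Log lines are read from stdin; field 6 looks like "HEAD or "GET.

methods_for_ip() {
    # $1 = client address to inspect
    awk -v ip="$1" '$1 == ip { m = $6; sub(/^"/, "", m); n[m]++ }
                    END { for (m in n) print n[m], m }' | sort -rn
}

# Made-up sample lines, not real traffic:
sample='192.0.2.10 - - [02/Oct/2011:16:57:42 -0700] "HEAD /some-url HTTP/1.1" 200 - "-" "-"
192.0.2.10 - - [02/Oct/2011:16:57:43 -0700] "HEAD /other-url HTTP/1.1" 200 - "-" "-"
192.0.2.10 - - [02/Oct/2011:16:57:44 -0700] "GET / HTTP/1.1" 200 512 "-" "-"
192.0.2.100 - - [02/Oct/2011:16:57:45 -0700] "GET / HTTP/1.1" 200 512 "-" "-"'

printf '%s\n' "$sample" | methods_for_ip 192.0.2.10
```

The exact comparison on field 1 is deliberate: a grep anchored with ^ on 192.0.2.10 would also match 192.0.2.100, and its unescaped dots match any character.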

Now block the IP address:

iptables -I INPUT -s -j DROP
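
Since a mistyped rule can lock out legitimate visitors, it may help to print the rule and look at it before running it as root. This is only a sketch: the function name block_rule is made up, the check on the argument is deliberately loose (digits and dots only), and 192.0.2.10 is a documentation-only placeholder address, not the one from our logs.

```shell
#!/bin/sh
# Sketch: print (not run) the iptables command that drops traffic
# from one IPv4 address, after a loose sanity check on the argument.

block_rule() {
    case "$1" in
        "" | *[!0-9.]* )
            echo "not an IPv4 address: $1" >&2
            return 1
            ;;
    esac
    echo "iptables -I INPUT -s $1 -j DROP"
}

block_rule 192.0.2.10   # prints: iptables -I INPUT -s 192.0.2.10 -j DROP

# To review the current rules later:
#   iptables -L INPUT -n --line-numbers
# To remove the block again:
#   iptables -D INPUT -s 192.0.2.10 -j DROP
```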

Your site should see the load go down considerably, if crawlers are indeed the problem.




Similar situation

While troubleshooting a similar issue on a client's server I noted that the Bing/MSN bot was acting in much the same way as this - causing lots of unnecessary traffic.

I checked that by tailing the logs and grepping the connection counts.
My own process detailed here -

GoAccess seems like a much nicer tool for some of that, although it won't amalgamate the ip traffic from a given company as one source unfortunately.
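
One crude workaround for the amalgamation problem is to group hits by network prefix before looking at them. The sketch below groups IPv4 addresses by their first three octets. That is an assumption made for simplicity, since a company's real address ranges rarely line up with /24 boundaries. The function name hits_by_prefix and the sample lines are made up for this example.

```shell
#!/bin/sh
# Sketch: count hits per /24 prefix for log lines read from stdin.
# Assumes IPv4 client addresses in the first field.

hits_by_prefix() {
    awk '{ split($1, o, "."); print o[1] "." o[2] "." o[3] ".0/24" }' |
        sort | uniq -c | sort -rn
}

# Made-up sample lines using documentation-only addresses:
sample='192.0.2.10 - - [01/May/2012:10:00:00 +0000] "GET / HTTP/1.1" 200 1 "-" "-"
192.0.2.11 - - [01/May/2012:10:00:01 +0000] "GET / HTTP/1.1" 200 1 "-" "-"
198.51.100.7 - - [01/May/2012:10:00:02 +0000] "GET / HTTP/1.1" 200 1 "-" "-"'

# Each output line is a hit count followed by a prefix, busiest first.
printf '%s\n' "$sample" | hits_by_prefix
```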

Was feeling happy with myself for finding and sorting out Bing's excesses, and then the client complained that server load was up yet again.
Used GoAccess, and found that a single ip had done about 2.7G of that day's traffic on one of their sites.

We have logs spread over multiple sites rather than conglomerated, so my check was more like this (to check the last few days+ rotated log)

grep 'May/2012' /home/*/logs/*log /home/*/logs/*_log.1 | goaccess -asb

The current day's traffic was around 5G total, with 2.7G of it from that one ip, and no identifying details for it other than a presumably faked browser id.
2.76 GB - 53922 /home/mybeonline/logs/access_log.1:
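
A side note on the line above: when grep is given more than one file, it prefixes every matching line with the file name, which is presumably why the log path shows up at the start of the entry. With GNU or BSD grep, the -h option suppresses that prefix. The sketch below demonstrates the difference on two throwaway files; the function name first_field_demo and the sample log line are made up for this example.

```shell
#!/bin/sh
# Sketch: compare the first field of grep output with and without -h
# when more than one file is searched.

first_field_demo() {
    # $1 = extra grep option to try (-h), or nothing for plain grep
    dir=$(mktemp -d) || return 1
    line='192.0.2.10 - - [01/May/2012:10:00:00 +0000] "GET / HTTP/1.1" 200 1 "-" "-"'
    printf '%s\n' "$line" > "$dir/a_log"
    printf '%s\n' "$line" > "$dir/b_log"
    # $1 is left unquoted on purpose so an empty value adds no argument
    grep $1 'May/2012' "$dir"/*_log | awk '{ print $1 }' | sort -u
    rm -rf "$dir"
}

first_field_demo      # first field is path:address, one entry per file
first_field_demo -h   # first field is the bare address
```

Applied to the command above, that would mean running grep -h 'May/2012' on the same set of log files.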

The only other address ranges that came close in hit count (actual traffic was substantially lower) had valid reverse dns and ip ownership for common spiders (eg GoogleBot, Sogou Web Spider, Baidu Spider). This ip, on the other hand, identified itself as
"Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.5; en-US; rv: Gecko/2009060214 Firefox/3.0.11"

A line count of the logs showed 70,000 odd log lines
grep 'Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.5; en-US; rv: Gecko/2009060214 Firefox/3.0.11' /home/*/logs/*log /home/*/logs/*_log.1 | wc -l

A quick double check to make sure that the user agent matched the ip gave the same result.
grep 'Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.5; en-US; rv: Gecko/2009060214 Firefox/3.0.11' /home/*/logs/*log /home/*/logs/*_log.1 | grep | wc -l
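
The same check can be done in one pass by listing every address that sends a given user agent, with a count for each. This is a sketch, assuming the combined log format, where splitting a line on double quotes puts the user agent in the sixth piece. The function name ips_for_agent, the sample lines, and the ExampleBot/1.0 agent string are made up for this example.

```shell
#!/bin/sh
# Sketch: count hits per client address for one exact user agent string.
# Log lines are read from stdin. With -F'"', piece 1 holds the address
# and timestamp, and piece 6 holds the user agent.

ips_for_agent() {
    # $1 = exact user agent string to look for
    awk -F'"' -v ua="$1" '$6 == ua { split($1, f, " "); print f[1] }' |
        sort | uniq -c | sort -rn
}

# Made-up sample lines using documentation-only addresses:
sample='192.0.2.10 - - [01/May/2012:10:00:00 +0000] "GET / HTTP/1.1" 200 1 "-" "ExampleBot/1.0"
192.0.2.10 - - [01/May/2012:10:00:01 +0000] "GET /a HTTP/1.1" 200 1 "-" "ExampleBot/1.0"
198.51.100.7 - - [01/May/2012:10:00:02 +0000] "GET / HTTP/1.1" 200 1 "-" "OtherAgent/2.0"'

printf '%s\n' "$sample" | ips_for_agent 'ExampleBot/1.0'
```

If only one address shows up, the user agent and the ip match, as they did here. When reading several log files at once, feed this with grep -h or cat so the file name prefix does not end up in the address field.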

Guess Firefox/3.0.11 is the new spambot favourite!
As the User Agent seems remarkably close to the one listed on the botnet page, you may want to look out for it in future too.