Strategies for dealing with resource-wasting crawlers

In a previous article, we showed how a combination of Microsoft's WebDAV MiniReDir and Drupal Single SignOn results in an unintentionally aggressive crawler that has the same effect as a Denial of Service (DoS) attack.

And in our session yesterday at DrupalCon we touched on that topic again.

So, in this article we expand on this topic and provide more details.

Detection of crawlers

Before we can act against crawlers, we first have to detect them.

To do this, there are several methods:

ntop

The ntop tool is useful for detecting lots of connections from a single IP address. It runs as a daemon that listens on a certain port, and has an HTML interface that uses JavaScript.

Apachetop

The Apachetop tool is useful for real-time monitoring of the Apache access log. It can be used to list each IP address that is currently connected to the server.

Using a custom Drupal module

Here is some prototype code that logs the IP address and the time of every page access. This data can then be filtered using a SQL query.

We can then detect whether a certain IP address has hit the server too often in the past period.

This is not suitable for very large sites, due to the volume of database INSERTs that this will incur.

function crawlers_boot() {
  $ip = ip_address();
  // Log the IP address and timestamp
  db_query("INSERT INTO {crawlers} VALUES ('%s', %d, 0)", $ip, time());
}

function crawlers_cron() {
  // Clean the table so it does not grow forever
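  // Assumption: keep one day of history; adjust the retention period as needed
  db_query("DELETE FROM {crawlers} WHERE timestamp < %d", time() - 86400);
}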

Blocking: Drupal modules

At the Drupal level, there are the following modules that can help in certain cases.

Blocking: UserAgent + Teapot

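The following code borrows the 418 "I'm a teapot" status code from the HTCPCP protocol, described in detail in RFC 2324. Although the RFC was intended as humor, this HTTP status code is useful here because it makes refused requests easy to find in the access log.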

Add this to settings.php:

$ua = array(
  'Microsoft URL Control',
  'LucidMedia ClickSense',
);

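// Some clients send no User-Agent header at all
$agent = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '';

foreach ($ua as $string) {
  if (FALSE !== strpos($agent, $string)) {
    // Refuse the request before Drupal does any further work
    header("HTTP/1.0 418 I'm a teapot");
    exit();
  }
}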

Because settings.php is loaded at the very start of Drupal's bootstrap, the request ends before the database or any modules are loaded, and therefore does not use lots of resources on the server. The 418 HTTP status code is also logged to the access log.

We can then detect the IP addresses that were refused, via this command:

grep " 418 - " /var/log/apache2/access-example.com.log |
  awk '{print $1}' |
  sort -u

This lists each IP address that recently received a 418 response once, like this:

84.59.127.111
88.233.173.222
...
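If we want the addresses ranked by how often they were refused, rather than just listed once each, a sketch along these lines may help. It assumes the Apache common or combined log format, where the status code is the ninth whitespace-separated field. The function name rank_teapot_ips is ours, for illustration only.

```shell
#!/bin/sh
# Rank client IP addresses by the number of 418 responses they received.
# Reads an Apache access log on stdin and prints "count ip", busiest first.
# Assumes the common/combined log format (status code in field 9).
rank_teapot_ips() {
  awk '$9 == 418 {print $1}' |   # keep only the client IP of 418 responses
    sort |                       # group identical addresses together
    uniq -c |                    # count each address
    sort -rn |                   # highest count first
    awk '{print $1, $2}'         # strip the padding that uniq -c adds
}
```

It can be run as: rank_teapot_ips < /var/log/apache2/access-example.com.log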

The following shell script then blocks the IP addresses given on the command line:

#!/bin/sh
#
# Script to block one or more IP addresses

# Check arguments
if [ $# -eq 0 ]; then
  # Display a usage error message
  echo "Usage: $(basename "$0") ip-address [ip-address ...]" >&2
  exit 1
fi

for IP in "$@"
do
  # Insert a rule at the top of the INPUT chain that drops all packets from this address
  iptables -I INPUT -s "$IP" -j DROP
done
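Since the addresses fed to this script come from a log file, it may be worth checking that each argument really looks like an IPv4 address before handing it to iptables. The following is a minimal sketch of such a check in plain POSIX sh. The function name is_ipv4 is hypothetical, and IPv6 addresses are deliberately not handled.

```shell
#!/bin/sh
# Return 0 if the argument is a dotted-quad IPv4 address, 1 otherwise.
is_ipv4() {
  case "$1" in
    # Reject empty input, stray characters, empty octets, and octets
    # longer than three digits.
    '' | *[!0-9.]* | *..* | .* | *. | *[0-9][0-9][0-9][0-9]*) return 1 ;;
  esac

  # Split the address on dots into the positional parameters.
  old_ifs=$IFS
  IFS=.
  set -- $1
  IFS=$old_ifs

  # There must be exactly four octets, each no larger than 255.
  [ $# -eq 4 ] || return 1
  for octet in "$@"; do
    [ "$octet" -le 255 ] || return 1
  done
  return 0
}
```

Inside the loop of the blocking script, the iptables call could then be guarded with something like: is_ipv4 "$IP" || continue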

We can now combine the above as follows:

block-ip.sh `grep " 418 - " /var/log/apache2/access-example.com.log | 
  awk '{print $1}' | 
  sort -u`

Making the blacklist persistent

Rules added with iptables are lost when the server reboots, so we need to save the list and restore it at boot time.

Save it by:

iptables-save > /root/data/ip-blocked-list.txt

And then modify the /etc/rc.local file, and place the following before the "exit 0" line:

iptables-restore < /root/data/ip-blocked-list.txt

Using connlimit

In newer Linux kernels, there is a connlimit module for iptables. It can be used like this:

iptables -I INPUT -p tcp -m connlimit --connlimit-above 5 -j REJECT

This limits connections from each IP address to no more than 5 simultaneous connections. This sort of "rations" connections, and prevents crawlers from hitting the site simultaneously.
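Before picking a limit, it can help to see how many simultaneous connections each address actually holds. Here is a rough sketch that reads the output of netstat -tn on stdin and reports the addresses above a given threshold. The function name connections_over_limit is ours, it only understands IPv4 peers, and counting ESTABLISHED connections is an approximation of what connlimit tracks, not an exact match.

```shell
#!/bin/sh
# Print "count ip" for every remote IPv4 address that holds more than
# the given number of ESTABLISHED TCP connections, busiest first.
# Expects `netstat -tn` style output on stdin, where field 5 is the
# remote address:port and field 6 is the connection state.
connections_over_limit() {
  limit=$1
  awk -v limit="$limit" '
    $1 ~ /^tcp/ && $6 == "ESTABLISHED" {
      split($5, peer, ":")      # peer[1] is the address, peer[2] the port
      count[peer[1]]++
    }
    END {
      for (ip in count)
        if (count[ip] > limit)
          print count[ip], ip
    }
  ' | sort -rn
}
```

For example, netstat -tn | connections_over_limit 5 would list the addresses that the rule above would start rejecting.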

But alas, this does not work on Ubuntu 8.04 LTS; it was fixed in 8.10 and later. See bug #60439 on Launchpad.

Your turn

What other strategies do you employ for dealing with crawlers? Please post a comment below.
