In a previous article, we showed how the combination of Microsoft's WebDAV MiniReDir and Drupal Single SignOn results in an unintentionally aggressive crawler that has the same effect as a Denial of Service (DoS) attack.
And in our session yesterday at DrupalCon we touched on that topic again.
So, in this article we expand on this topic and provide more details.
Detection of crawlers
Before we can act against crawlers, we have to detect them.
There are several ways to do this:
ntop
The ntop tool is useful for detecting lots of connections from a single IP address. It runs as a daemon that listens on a port and provides an HTML interface that uses JavaScript.
Apachetop
The Apachetop tool is useful for real-time monitoring based on the Apache access log. It can be used to list each IP address that is currently hitting the server.
Using a custom Drupal module
Here is some prototype code that logs the IP address and the time of every page access. This data can then be filtered using a SQL query.
We can then detect whether a certain IP address has hit the server repeatedly in the past period.
This is not suitable for very large sites, due to the volume of database INSERTs that this will incur.
function crawlers_boot() {
  $ip = ip_address();
  // Log the IP address and timestamp
  db_query("INSERT INTO {crawlers} VALUES ('%s', %d, 0)", $ip, time());
}

function crawlers_cron() {
  // Clean the table so it does not grow forever
  db_query("DELETE FROM {crawlers} WHERE timestamp < %d", time() - 120);
}

function crawlers_reports() {
  // Report on the crawlers.
  // This is left as a community exercise ...
}
Blocking: Drupal modules
At the Drupal level, there are the following modules that can help in certain cases.
Blocking: UserAgent + Teapot
The following code borrows the 418 status code from the HTCPCP protocol described in detail in RFC 2324. Although the RFC was intended as humor, this status code is useful here because it is easy to tell apart from normal responses.
Add this to settings.php:
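$ua = array(
  'Microsoft URL Control',
  'LucidMedia ClickSense',
);

// Some clients send no User-Agent header at all
$agent = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '';

foreach ($ua as $string) {
  if (FALSE !== strpos($agent, $string)) {
    // Refuse the request before Drupal does any real work
    header("HTTP/1.0 418 I'm a teapot");
    exit();
  }
}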
Because settings.php is loaded very early, the request ends after only a small part of the Drupal bootstrap, and therefore does not use lots of resources on the server. The 418 HTTP status code is also logged to the access log.
We can then list the IP addresses that triggered this response, via this command:
grep " 418 - " /var/log/apache2/access-example.com.log | awk '{print $1}' | sort -u
This prints each IP address that received a 418 response once, like this:
84.59.127.111
88.233.173.222
...
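Since `sort -u` only removes duplicates, it does not show which addresses are the heaviest offenders. The following is a minimal sketch of a ranked variant, not part of the original setup: the function name `top_teapot_ips` is made up for illustration, and it assumes the client IP is the first field of each log line and that teapot responses appear as ` 418 - `, as in the command above.

```shell
#!/bin/sh
# top_teapot_ips [N]
# Read access-log lines on stdin and print the N client IPs (default 10)
# that received the most 418 responses, as "count ip", busiest first.
# Assumption: the client IP is the first whitespace-separated field.
top_teapot_ips() {
    grep ' 418 - ' |
        awk '{print $1}' |
        sort |
        uniq -c |
        sort -rn |
        head -n "${1:-10}"
}
```

It could be fed from the same log file, for example `top_teapot_ips 20 < /var/log/apache2/access-example.com.log`.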
We can then use this shell script to block the IP addresses from the command line:
#!/bin/sh
#
# Script to block one or more IP addresses

# Check arguments
if [ $# -eq 0 ]; then
  # Display a usage error message
  echo "Usage: $(basename "$0") ip-address ..."
  exit 1
fi

for IP in "$@"
do
  # Block the IP address
  iptables -I INPUT -s "$IP" -j DROP
done
We can now combine the above as follows:
block-ip.sh `grep " 418 - " /var/log/apache2/access-example.com.log | awk '{print $1}' | sort -u`
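Because this feeds log contents straight into iptables, it may be worth checking that each value really looks like an IPv4 address first. Below is a minimal sketch under that assumption; the helper names `is_ipv4` and `filter_ipv4` are hypothetical and not part of the original script, and IPv6 addresses are deliberately not handled.

```shell
#!/bin/sh
# is_ipv4 ADDRESS
# Succeed only if ADDRESS is a dotted quad with every octet in 0..255.
is_ipv4() {
    # Shape check: four groups of 1-3 digits separated by dots
    printf '%s\n' "$1" | grep -Eq '^([0-9]{1,3}\.){3}[0-9]{1,3}$' || return 1

    # Range check: split on dots and compare each octet
    old_ifs=$IFS
    IFS=.
    set -- $1
    IFS=$old_ifs
    for octet in "$@"; do
        [ "$octet" -le 255 ] || return 1
    done
    return 0
}

# filter_ipv4
# Read one candidate per line on stdin, print only the valid addresses.
filter_ipv4() {
    while read -r candidate; do
        if is_ipv4 "$candidate"; then
            printf '%s\n' "$candidate"
        fi
    done
}
```

With these helpers defined, `| filter_ipv4` could be appended after `sort -u` in the combined command so that only well-formed addresses reach block-ip.sh.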
Making the blacklist persistent
The iptables rules are lost on reboot, so we need to save the list and restore it at startup.
Save it with:
iptables-save > /root/data/ip-blocked-list.txt
And then modify the /etc/rc.local file, and place the following before the "exit 0" line:
iptables-restore < /root/data/ip-blocked-list.txt
Using connlimit
In newer Linux kernels, there is a connlimit module for iptables. It can be used like this:
iptables -I INPUT -p tcp -m connlimit --connlimit-above 5 -j REJECT
This limits each IP address to no more than 5 simultaneous connections. In effect, it "rations" connections, and prevents a crawler from opening many connections to the site at the same time.
But alas, this does not work with Ubuntu 8.04 LTS; it was fixed in 8.10 and later. See bug #60439 on Launchpad.
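The rule above applies to every TCP port, which also rations services such as SSH. As a sketch of a narrower variant, the hypothetical helper below (`connlimit_rule` is a made-up name) only prints an iptables command restricted to one port, so it can be reviewed before being run as root. It assumes the web server listens on port 80 and that the connlimit module works on your kernel.

```shell
#!/bin/sh
# connlimit_rule PORT LIMIT
# Print (do not run) an iptables command that rejects new TCP connections
# to PORT from any source IP that already has more than LIMIT open.
connlimit_rule() {
    # Both arguments must be non-empty and purely numeric
    for value in "$1" "$2"; do
        case "$value" in
            ''|*[!0-9]*)
                echo "Usage: connlimit_rule PORT LIMIT" >&2
                return 1
                ;;
        esac
    done
    echo "iptables -I INPUT -p tcp --syn --dport $1 -m connlimit --connlimit-above $2 -j REJECT --reject-with tcp-reset"
}
```

For example, `connlimit_rule 80 5` prints a rule equivalent to the one above but limited to port 80; the output can be pasted into a root shell once checked.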
Your turn
What other strategies do you employ for dealing with crawlers? Please post a comment below.