Google Crawler hitting your site too aggressively?

If your Drupal site suffers occasional slowdowns or outages, check whether crawlers are hitting your site too hard.

We've seen several clients complain about this, and upon investigation we found that the culprit was Google's own crawler.

The telltale sign is a large number of queries whose LIMIT clause carries a high offset. Depending on your site's specifics, these are likely to show up as slow queries too.

This means that crawlers are accessing very old content (hundreds of pages back).

Here is an example from a recent client:

SELECT node.nid AS nid, ...
LIMIT 4213, 26

SELECT node.nid AS nid, ...
LIMIT 7489, 26

SELECT node.nid AS nid, ...
LIMIT 8893, 26

As you can see, Google's crawler is going back 340+ pages for the last query.
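
The page depth can be read straight off the LIMIT clause: in MySQL's "LIMIT offset, count" form, the first number is the row offset and the second is the row count. Here is a minimal sketch of that arithmetic, assuming the count of 26 corresponds to one page of the listing (the function name is made up for illustration):

```python
def pages_back(offset, rows_per_page):
    """Return how many full pages precede the given LIMIT offset."""
    return offset // rows_per_page

# Offsets taken from the three example queries above.
for offset in (4213, 7489, 8893):
    print(offset, pages_back(offset, 26))
```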

Your web server's access log will show something like this:

1.2.3.4 - - [26/Feb/2012:07:26:59 -0800] "GET /blah-blah?page=621 HTTP/1.1" 200 10017 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"

Note the page= part, and Googlebot as the user agent.
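
To see how widespread this is, you can scan the log for such requests. The following is a rough sketch, not a finished tool: it assumes log lines in the combined format shown above, and the function name and the min_page threshold of 100 are arbitrary choices for illustration. Keep in mind that the user agent string can be spoofed, so a match only suggests Googlebot.

```python
import re
from collections import Counter

# Captures the request path and the last quoted field (the user agent)
# of a combined-format access log line.
LINE_RE = re.compile(r'"GET (?P<path>\S+) HTTP/[\d.]+" .* "(?P<agent>[^"]*)"$')
PAGE_RE = re.compile(r'[?&]page=(\d+)')

def deep_googlebot_hits(lines, min_page=100):
    """Count Googlebot requests per path whose page= value is >= min_page."""
    hits = Counter()
    for line in lines:
        m = LINE_RE.search(line.strip())
        if not m or "Googlebot" not in m.group("agent"):
            continue
        page = PAGE_RE.search(m.group("path"))
        if page and int(page.group(1)) >= min_page:
            # Group by the path without its query string.
            hits[m.group("path").split("?")[0]] += 1
    return hits

# The sample line is the log entry quoted above.
sample = [
    '1.2.3.4 - - [26/Feb/2012:07:26:59 -0800] "GET /blah-blah?page=621 HTTP/1.1" '
    '200 10017 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"',
]
print(deep_googlebot_hits(sample))
```

In practice you would pass an open log file instead of the sample list, since the function only needs an iterable of lines.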

The solution is often to go into Google Webmaster Tools and reduce the crawl rate for the site, so that Google is not hitting too many pages at the same time. Start with 20%. You may need to go to 40% in severe cases.

Either way, you need to experiment to find a value that fits your site's specific case.

Here is how to change Google's crawl rate.

Comments

Good advice - also applies to faceted navigation

I've seen this effect on large sites all too often.

If you have webalizer installed, the "agents" part is useful. It's not uncommon to see 50% of visits coming from Googlebot. If you then compare the value of "Total Unique URLs" to the number of pages indexed in Google (search for site:yourdomain.com in Google), you might find that Google has 2-5x this number, as it indexes parameters as part of the URL. You might also notice /node as the most visited content (in webalizer) if you have left "publish to front page" as the default on your content types. Even if you have a different front page, /node still exists and Google finds it (usually from the RSS feed?). Take a look at yourdomain.com/node and how many pages there are. Google loves /node. http://drupal.org/node, as an example, has 150+ pages.

On the same note, you need to be extra careful if you have faceted navigation. This can potentially create millions of unique URL combinations, as the bots treat parameters as part of the URL.

Indexing a lot of unnecessary pages is not a good thing: not for your resources, not for Google's resources (crawler fatigue), and not for SEO. See http://searchnewscentral.com/20110601167/General-SEO/solving-duplicate-content-issues-arising-from-faceted-navigation.html

Page and facet parameters can be excluded in Webmaster Tools for Google, but a more universal - though a bit brute force - approach is robots.txt, e.g. adding something like "Disallow: /*?*filter*" for Facets API. But you probably want to exclude only some facets, as the facets are in fact a part of the site structure and could be important for the distribution of PageRank.
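
Before deploying a wildcard rule like that, it can help to check which URLs it would actually catch. Below is a small sketch that assumes the wildcard semantics Google documents for robots.txt ('*' matches any run of characters, a trailing '$' anchors the end of the path). The helper names and the sample paths are hypothetical, and other crawlers may interpret wildcards differently or not at all.

```python
import re

def robots_pattern_to_regex(pattern):
    """Translate a wildcard robots.txt path pattern into a compiled regex."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape the literal pieces and let each '*' match any run of characters.
    body = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.compile("^" + body + ("$" if anchored else ""))

def is_blocked(pattern, url_path):
    """Return True if url_path (path plus query string) matches the pattern."""
    return robots_pattern_to_regex(pattern).search(url_path) is not None

rule = "/*?*filter*"
# Hypothetical paths, for illustration only.
for path in ("/search?filter=color:red",
             "/search?page=3&filter=size:l",
             "/search?page=3",
             "/filter-guide"):
    print(path, is_blocked(rule, path))
```

Note that the rule only fires when "filter" appears after a "?", so an ordinary page such as the hypothetical /filter-guide is left alone.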

Atle