How Google and Bing crawlers were confused by quicktabs

Published Thu, 2011/11/03 - 21:28, Updated Fri, 2011/11/04 - 12:23

Quick Tabs is a widely used Drupal module. Site builders like it because it improves usability in some cases by reducing clutter.

Incidentally, the way this module works has caused us to run into performance issues with certain uses. See the previous article about how Quick Tabs can sure use more caching, and a case study involving Quick Tabs.

This is the third case of a site having problems caused by a combination of Quick Tabs and confused crawlers, including Google's own crawler.

The Symptoms

The main symptoms were server slowdowns, spikes in slow queries, and outages.

When investigating the issue, we found that this particular site had close to 66,000 nodes.

But Google Webmaster Tools listed over 12 million URLs with "quicktabs" as a parameter!

Googlebot confused

When investigating the web server log, we found the following:

GET /user/20847?quicktabs_8=0 HTTP/1.1" ... YandexBot/3.0
GET /user/13938?quicktabs_8=2 HTTP/1.1" ... Googlebot/2.1
GET /node/feed?quicktabs_2=0&quicktabs_3=0&quicktabs_7=4 HTTP/1.1" ... Googlebot/2.1
GET /forums/example-url?quicktabs_8=3&quicktabs_2=2 HTTP/1.1" ... YandexBot/3.0

Even some nonsense URLs like:

/frontpage/playlist.xml?page=74&271=&quicktabs_2=3

No wonder Google indexed orders of magnitude more URLs than the site has pages.

You can also use a tool like the extremely useful GoAccess to analyze the logs and see the top accesses by IP address.
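If you prefer a quick script over a full log analyzer, the same question (which crawlers request quicktabs URLs, and how often) can be answered with a few lines of Python. This is only a minimal sketch: it assumes the common "combined" access log format, and the IP addresses, timestamps, and response sizes in the sample are invented. Only the request paths and bot names come from the log excerpt above.

```python
import re
from collections import Counter

# Hypothetical sample lines in combined log format. IPs, timestamps, and sizes
# are made up for illustration; paths and bot names are from the article.
SAMPLE_LOG = """\
192.0.2.10 - - [03/Nov/2011:10:00:01 +0000] "GET /user/20847?quicktabs_8=0 HTTP/1.1" 200 5120 "-" "Mozilla/5.0 (compatible; YandexBot/3.0)"
192.0.2.20 - - [03/Nov/2011:10:00:02 +0000] "GET /user/13938?quicktabs_8=2 HTTP/1.1" 200 5120 "-" "Mozilla/5.0 (compatible; Googlebot/2.1)"
192.0.2.20 - - [03/Nov/2011:10:00:03 +0000] "GET /node/feed?quicktabs_2=0&quicktabs_3=0&quicktabs_7=4 HTTP/1.1" 200 812 "-" "Mozilla/5.0 (compatible; Googlebot/2.1)"
192.0.2.10 - - [03/Nov/2011:10:00:04 +0000] "GET /forums/example-url?quicktabs_8=3&quicktabs_2=2 HTTP/1.1" 200 7300 "-" "Mozilla/5.0 (compatible; YandexBot/3.0)"
192.0.2.20 - - [03/Nov/2011:10:00:05 +0000] "GET /node/1 HTTP/1.1" 200 4096 "-" "Mozilla/5.0 (compatible; Googlebot/2.1)"
192.0.2.30 - - [03/Nov/2011:10:00:06 +0000] "GET /user/1?quicktabs_8=1 HTTP/1.1" 200 5120 "-" "Mozilla/5.0 (Windows NT 6.1) Firefox/7.0"
"""

# Request line, status, size, referrer, then the quoted user agent.
LINE_RE = re.compile(
    r'"(?:GET|HEAD|POST) (?P<url>\S+) [^"]*" \d{3} \S+ "[^"]*" "(?P<agent>[^"]*)"'
)
# First user-agent token that contains one of the usual crawler markers.
BOT_RE = re.compile(r"[\w-]*(?:bot|spider|crawler|slurp)[\w-]*", re.IGNORECASE)


def quicktabs_hits_by_bot(lines):
    """Count requests per crawler whose query string mentions 'quicktabs'."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.search(line)
        if not match:
            continue
        query = match.group("url").partition("?")[2]
        if "quicktabs" not in query:
            continue
        bot = BOT_RE.search(match.group("agent"))
        if bot:
            counts[bot.group(0)] += 1
    return counts


if __name__ == "__main__":
    for name, hits in quicktabs_hits_by_bot(SAMPLE_LOG.splitlines()).most_common():
        print(name, hits)
```

To run it against a real log, pass the open file object instead of the sample lines. The regular expression would need adjusting if your server uses a different log format.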

Avoiding The Problem

To solve this we did several things, on several fronts.

First, in settings.php, we added the following:

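// Canned 404 response, sent before Drupal does a full bootstrap.
$error_404_header = 'HTTP/1.0 404 Not Found';
$error_404_text  = '<html><head><title>404 Not Found';
$error_404_text .= '</title></head><body><h1>Not Found</h1>';
$error_404_text .= '<p>The requested URL was not found on this server.';
$error_404_text .= '</p></body></html>';

// Substrings that identify crawlers, matched case-insensitively
// against the user agent.
$bot_ua = array(
  'bot',
  'spider',
  'crawler',
  'slurp',
);

// Either value can be missing, for example when settings.php is loaded
// from the command line, so default to an empty string.
$query_string = isset($_SERVER['QUERY_STRING']) ? $_SERVER['QUERY_STRING'] : '';
$user_agent = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '';

// Bail out early when a crawler requests any URL with a quicktabs parameter.
if (FALSE !== strpos($query_string, 'quicktabs')) {
  foreach ($bot_ua as $bot) {
    if (FALSE !== stripos($user_agent, $bot)) {
      header($error_404_header);
      print $error_404_text;
      exit();
    }
  }
}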
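
Before deploying a rule like this, it can help to check which requests it would actually catch. The following Python function is a hypothetical mirror of the settings.php check above, not part of Drupal or of the site's code. It assumes the same two conditions: a case-sensitive "quicktabs" substring in the query string, and a case-insensitive crawler marker in the user agent.

```python
# Same crawler markers as the $bot_ua array in settings.php.
BOT_MARKERS = ("bot", "spider", "crawler", "slurp")


def should_block(query_string, user_agent):
    """Return True when the settings.php rule would answer with a 404.

    Both arguments may be None, mirroring a missing server variable.
    """
    if "quicktabs" not in (query_string or ""):
        return False
    agent = (user_agent or "").lower()
    return any(marker in agent for marker in BOT_MARKERS)


if __name__ == "__main__":
    samples = [
        ("quicktabs_8=2", "Googlebot/2.1"),
        ("quicktabs_8=2", "Mozilla/5.0 (Windows NT 6.1) Firefox/7.0"),
        ("page=2", "Googlebot/2.1"),
    ]
    for query, agent in samples:
        print(query, "|", agent, "->", should_block(query, agent))
```

Note that the match is deliberately broad: any user agent containing "bot" is treated as a crawler, so an unusual browser or tool with that substring in its name would also get the 404 on quicktabs URLs.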

Bing and robots.txt

We later found that Bing was hitting the site too much as well.

To solve this, we added the following to the robots.txt file:

User-agent: *
Crawl-delay: 10
# No access for quicktabs in the URL
Disallow: /*?quicktabs_*
Disallow: /*&quicktabs_*
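
To see which URLs these two Disallow lines cover, here is a small Python sketch. It assumes the wildcard semantics used by the major search engines, where "*" matches any run of characters and a trailing "$" anchors the end of the URL. The helper names are made up for this illustration, and the last test URL is a hypothetical page without a quicktabs parameter.

```python
import re

# The two Disallow patterns from the robots.txt above.
DISALLOW = ["/*?quicktabs_*", "/*&quicktabs_*"]


def robots_pattern_to_regex(pattern):
    """Translate a robots.txt path pattern into a compiled regex.

    '*' matches any run of characters; a trailing '$' anchors the end.
    Everything else is matched literally, starting at the beginning of the URL.
    """
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    body = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.compile("^" + body + ("$" if anchored else ""))


RULES = [robots_pattern_to_regex(p) for p in DISALLOW]


def is_disallowed(url):
    """Return True if the path plus query string matches any Disallow pattern."""
    return any(rule.match(url) for rule in RULES)


if __name__ == "__main__":
    urls = [
        "/user/20847?quicktabs_8=0",
        "/node/feed?quicktabs_2=0&quicktabs_3=0&quicktabs_7=4",
        "/frontpage/playlist.xml?page=74&271=&quicktabs_2=3",
        "/node/1?page=2",  # hypothetical URL that should stay crawlable
    ]
    for url in urls:
        print(url, "->", "blocked" if is_disallowed(url) else "allowed")
```

The first pattern catches URLs where quicktabs is the first query parameter, and the second catches it in any later position, which is why both lines are needed.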

Google Webmaster Tools

We also went into Google Webmaster Tools, and changed crawling to ignore the quicktabs parameter.

To do so, go under "Site Configuration" then "URL Parameters", edit the "quicktabs" parameter, and set it to "No: Does not affect page content".

Refer to this help page for more information.

The Results

The crawlers were not only hitting the site too frequently. Their requests were also filling the page cache with these pages and overflowing the memory allocated for memcache. So the server was being overloaded in two ways, not only one, and performance for regular users suffered.

Here is a comparison of before and after for a one week period.

Before the change for crawlers, we had 66,280 slow queries per week for 292 unique queries, eating up a total of 350,766 seconds.

After the change for crawlers, we have only 10,660 slow queries per week for 162 unique queries, eating up only 47,446 seconds.

This is a more than 7X decrease in the time consumed by slow queries.
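
For anyone who wants to check the arithmetic, the ratios can be recomputed from the figures above. This short Python sketch uses only the numbers reported in this article; the dictionary keys are just labels chosen for the example.

```python
# Weekly slow query figures reported above.
before = {"slow_queries": 66280, "unique_queries": 292, "seconds": 350766}
after = {"slow_queries": 10660, "unique_queries": 162, "seconds": 47446}


def improvement(metric):
    """Return (before/after ratio, percent reduction) for one metric."""
    b, a = before[metric], after[metric]
    return b / a, 100.0 * (b - a) / b


for metric in before:
    ratio, percent = improvement(metric)
    print(f"{metric}: {ratio:.1f}x fewer, {percent:.1f}% reduction")
```

The "seconds" row is the one behind the 7X figure. The number of slow queries itself dropped by a somewhat smaller factor.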

Similar problem with Faceted Navigation

Thanks for the great article. We did a lot of research for Facet API on a similar problem. In case you are interested, the relevant research, plus some interesting responses, is located in the "Validate SEO approach taken by Facet API and determine if it would be worth adding an item to the checklist" post on Drupal.org. Given the topic, I thought it would be relevant to cross-post.

Alterable RobotsTxt in core

Another great reason why it would be good to have the RobotsTxt module in core: http://drupal.org/node/495608

Thank you for this approach.

I was just about to launch a new site tomorrow and, who knows, the Google PageRank would probably have fallen.

cloaking

Targeting search bots specifically in settings.php is an example of cloaking and is a high-risk activity. Google may penalize your site for such behaviour.

See this video made by Google for more info:
http://www.youtube.com/watch?feature=player_embedded&v=QHtnfOgp65Q

Maybe

We've also asked Google not to crawl the URLs that have a quicktabs query parameter, and to treat that parameter as not changing the page.

We are returning a 404 when the URL is being accessed by a crawler, AND there is a quicktabs query parameter. The idea here is to bail out early and not incur the added performance overhead of a full Drupal bootstrap.

This is a meaningless URL anyways, and should not be indexed as a separate page.

So far, Google has not reacted negatively to that, perhaps since we adjusted Google Webmaster Tools to match.

Deal with the problem at its source

There is an open issue in the Quicktabs module to deal with this.

I just posted the following theme function override that removes the unneeded query string: http://drupal.org/node/354867#comment-5204212. (Since the query string is only needed to support people who don't have Javascript enabled.)

If you add your comments to that issue, supporting my proposal in comment #48, or, better, writing a patch, perhaps it can get committed.

It should be noted that Google can have duplicate content issues with all kinds of query string parameters when they don't actually affect the output of the page (unlike, say, the ?page query string parameter, which does). These can all be excluded via robots.txt using a wildcard exclusion.

This post: http://www.propdrop.com/blog/duplicate-content-problems-drupal-quicktabs-module recommends a canonical tag via Nodewords as the best solution, but that apparently wasn't working for us in all cases.

Secure argument passing

Related to this explosion of URLs is a great, ageing patch for Drupal - Secure argument passing. Help wanted.

Thanks.

Hi,

I just ran into this issue today... thanks much for the information and fixes...

Eric
