Quick Tabs is a widely used Drupal module. Site builders like it because it improves usability in some cases by reducing clutter.
Incidentally, the way this module works has caused us to run into performance issues on several sites. See our previous articles: Quick Tabs can sure use more caching, and a case study involving Quick Tabs.
This is the third case of a site having problems caused by a combination of Quick Tabs and confused crawlers, including Google's own crawler.
The Symptoms
The main symptoms were a slowdown of the server, spikes in slow queries, outages, etc.
When investigating the issue, we found that this particular site had close to 66,000 nodes.
But Google Webmaster Tools listed over 12 million URLs having "quicktabs" as a parameter!
Googlebot confused
When investigating the web server log, we found the following:
GET /user/20847?quicktabs_8=0 HTTP/1.1" ... YandexBot/3.0
GET /user/13938?quicktabs_8=2 HTTP/1.1" ... Googlebot/2.1
GET /node/feed?quicktabs_2=0&quicktabs_3=0&quicktabs_7=4 HTTP/1.1" ... Googlebot/2.1
GET /forums/example-url?quicktabs_8=3&quicktabs_2=2 HTTP/1.1" ... YandexBot/3.0
Even some nonsense URLs like:
/frontpage/playlist.xml?page=74&271=&quicktabs_2=3
No wonder Google indexed orders of magnitude more URLs than the site actually has pages.
Additionally, you can use a tool like the extremely useful GoAccess to analyze the logs and see the top accesses by IP address.
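If GoAccess is not handy, a plain awk/sort pipeline gives a similar "top IPs" view. This is a minimal sketch of ours, not from the original investigation; it assumes the client IP is the first field, as in the standard Apache/Nginx combined log format:

```shell
# Tally requests per client IP in an access log and show the 5 busiest IPs.
# The client IP is field $1 in the combined log format; adjust the field
# number if your log format differs.
top_ips() {
  awk '{ count[$1]++ } END { for (ip in count) print count[ip], ip }' "$1" \
    | sort -rn \
    | head -5
}
# Example: top_ips /var/log/apache2/access.log
```

An IP that dominates this list, combined with a crawler user agent in the matching log lines, is a good hint of the kind of crawler storm described above.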
Avoiding The Problem
To solve this, we did several things on several fronts.
First, in settings.php, we added the following:
$error_404_header = 'HTTP/1.0 404 Not Found';
$error_404_text  = '<html><head><title>404 Not Found';
$error_404_text .= '</title></head><body><h1>Not Found</h1>';
$error_404_text .= '<p>The requested URL was not found on this server.';
$error_404_text .= '</p></body></html>';

$bot_ua = array(
  'bot',
  'spider',
  'crawler',
  'slurp',
);

if (FALSE !== strpos($_SERVER['QUERY_STRING'], 'quicktabs')) {
  foreach ($bot_ua as $bot) {
    if (FALSE !== stripos($_SERVER['HTTP_USER_AGENT'], $bot)) {
      header($error_404_header);
      print $error_404_text;
      exit();
    }
  }
}
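To make the logic easy to eyeball outside PHP, here is the same condition expressed as a small shell function. This is purely an illustrative sketch of the check above; the function name is ours, and it is not part of the site's configuration:

```shell
# Return success (0) only when the query string mentions quicktabs
# AND the user agent looks like a crawler -- the same two conditions
# the settings.php snippet checks before sending the early 404.
is_bot_quicktabs_request() {
  query_string="$1"
  user_agent="$2"
  case "$query_string" in
    *quicktabs*) ;;   # query string mentions quicktabs: keep checking
    *) return 1 ;;    # no quicktabs parameter: serve the page normally
  esac
  # Case-insensitive match against the same UA substrings as the PHP code.
  printf '%s' "$user_agent" | grep -qiE 'bot|spider|crawler|slurp'
}
```

A request like /user/20847?quicktabs_8=0 from Googlebot matches both conditions and gets the early 404; the same URL from a regular browser does not, and neither does a bot request without the quicktabs parameter.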
Bing and robots.txt
We later found that Bing is hitting the site too much as well.
To solve this, we added the following to the robots.txt file:
User-agent: *
Crawl-delay: 10
# No access for quicktabs in the URL
Disallow: /*?quicktabs_*
Disallow: /*&quicktabs_*
Google Webmaster Tools
We also went into Google Webmaster Tools and changed crawling to ignore the quicktabs parameter.
To do so, go under "Site Configuration", then "URL Parameters", then edit the "quicktabs" parameter and set it to "No: Does not affect page content".
Refer to this help page for more information.
The Results
Not only were the crawlers hitting the site too quickly, they were also putting these pages in the page cache, overflowing the memory allocated to memcached, and thus hurting performance for regular users. The server was being overloaded in two ways, not one.
Here is a comparison of before and after for a one week period.
Before the change for crawlers, we had 66,280 slow queries per week for 292 unique queries, eating up a total of 350,766 seconds.
After the change for crawlers, we had only 10,660 slow queries per week for 162 unique queries, eating up only 47,446 seconds.
This is a 7X decrease in the CPU time used by slow queries.
Comments
Eric (not verified)
Thanks.
Fri, 2011/11/04 - 01:17
Hi,
I just ran into this issue today... thanks much for the information and fixes...
Eric
moshe weitzman (not verified)
Secure argument passing
Fri, 2011/11/04 - 09:46
Related to this explosion of URLs is a great, ageing patch for Drupal: Secure argument passing. Help wanted.
EvanDonovan (not verified)
Deal with the problem at its source
Fri, 2011/11/04 - 13:25
There is an open issue in the Quicktabs module to deal with this.
I just posted a theme function override that removes the unneeded query string: http://drupal.org/node/354867#comment-5204212. (The query string is only needed to support people who don't have JavaScript enabled.)
If you add your comments to that issue, supporting my proposal in comment #48, or, better yet, write a patch, perhaps it can get committed.
It should be noted that Google can have duplicate content issues with all kinds of query string parameters that don't actually affect the output of the page (unlike, say, the ?page parameter, which does). These can all be excluded from robots.txt using wildcard exclusions.
This post: http://www.propdrop.com/blog/duplicate-content-problems-drupal-quicktabs-module recommends a canonical tag via Nodewords as the best solution, but that apparently wasn't working for us in all cases.
dalin (not verified)
cloaking
Fri, 2011/11/04 - 13:53
Targeting search bots specifically in settings.php is an example of cloaking, which is a high-risk activity. Google may penalize your site for such behaviour.
See this video made by Google for more info:
http://www.youtube.com/watch?feature=player_embedded&v=QHtnfOgp65Q
Khalid
Maybe
Fri, 2011/11/04 - 14:41
We've also asked Google to not crawl the URLs that have the quicktabs query parameter, and to consider it as not changing the page.
We are returning a 404 when the URL is being accessed by a crawler AND there is a quicktabs query parameter. The idea here is to bail out early and not incur the added performance overhead of a full Drupal bootstrap.
This is a meaningless URL anyway, and should not be indexed as a separate page.
So far, Google has not reacted negatively to that, perhaps since we adjusted Google Webmaster to match.
lennyaspen (not verified)
Thank you for this approach.
Mon, 2011/11/07 - 15:24
I was just about to launch a new site tomorrow, and who knows, the Google PageRank probably would have fallen.
Dave Reid (not verified)
Alterable RobotsTxt in core
Sat, 2011/12/10 - 19:25
Another great reason why it would be good to have the RobotsTxt module in core: http://drupal.org/node/495608
cpliakas (not verified)
Similar problem with Faceted Navigation
Tue, 2013/04/09 - 09:11
Thanks for the great article. We did a lot of research for Facet API on a similar problem, and in case you are interested, the relevant research plus some interesting responses is located in the "Validate SEO approach taken by Facet API and determine if it would be worth adding an item to the checklist" post on Drupal.org. Given the topic, I thought it would be relevant to cross-post.