For all of the sites we consult on and manage, we use the excellent memcache module, which replaces Drupal core's database caching. Database caching works for simple, low-traffic sites, but cannot scale to heavy traffic or complex sites.
Recently we were asked to consult on the slow performance of a site with an all authenticated audience. The site is indeed complex, with over 235 enabled modules, 130 enabled rules, and 110 views.
The site had been moved from a dedicated server to an Amazon AWS cluster, with the site on one EC2 instance, the database on an RDS instance, and memcache on a third instance. The move was made in the hope that AWS would improve performance.
However, to the client's dismay, performance after the move went from bad (~ 5 to 7 seconds to load a page) to worse (~ 15 seconds).
We recommended to the client that we perform a Drupal performance assessment on the site. We got a copy of the site, recreated it in our lab, and proceeded with the investigation.
After some investigation, we found that running memcached on the same server that runs PHP significantly improved performance. This was logical: on a site where all users are logged in, there are plenty of calls to cache_get() and cache_set(). Each of these calls has to make a round trip over the network to the other server and back, even if it returns nothing. The same goes for database queries.
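To make the round-trip argument concrete, here is a back-of-the-envelope sketch. The call count and per-trip latencies are made-up illustrative values, not measurements from this site, and the function name is hypothetical.

```php
<?php
// Rough model: every cache_get()/cache_set() that leaves the web server pays
// one network round trip, whether or not the cache returns anything.
// All numbers below are ASSUMED for illustration, not measured on the site.
function estimated_cache_network_seconds($calls_per_page, $round_trip_ms) {
  return $calls_per_page * $round_trip_ms / 1000.0;
}

// Hypothetical page that makes 1,000 cache calls.
$calls = 1000;
printf("remote memcached (1.0 ms per trip): %.2f s\n", estimated_cache_network_seconds($calls, 1.0));
printf("local memcached (0.1 ms per trip): %.2f s\n", estimated_cache_network_seconds($calls, 0.1));
```

The point is only that the cost grows linearly with the number of calls, so a latency that looks negligible per call becomes visible on a page that makes many of them.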
Instead of 29.0, 15.8, 15.9, and 15.5 seconds for different pages on the live site, the page loads in our lab on a single medium server were: 3.6, 5.5, 1.4, 1.5 and 0.6 seconds.
However, this victory was short-lived. Once we put load on the web site, we ran into other bottlenecks.
We started with 200 concurrent logged in users on our lab server, and kept investigating and tweaking, running a performance test after each tweak to assess its impact.
The initial figures were: an average response time of 13.93 seconds, and only 6,200 page load attempts for 200 users (with 436 time outs).
So, what did we find? We found that the culprit was memcache! Yes, the very thing that helps sites scale was severely impeding scalability!
Why is this so? Because of the way it was configured for locking and stampede protection.
The settings.php for the site had these two lines:
$conf['lock_inc'] = 'sites/all/modules/memcache/memcache-lock.inc';
$conf['memcache_stampede_protection'] = TRUE;
Look at the memcache.inc, lines 180 to 201, in the valid() function:
if (!$cache) {
  if (variable_get('memcache_stampede_protection', FALSE) ... ) {
    static $lock_count = 0;
    $lock_count++;
    if ($lock_count <= variable_get('memcache_stampede_wait_limit', 3)) {
      lock_wait(..., variable_get('memcache_stampede_wait_time', 5));
      $cache = ...;
    }
  }
}
The above is for version 7.x of the module, and the same logic is in the Drupal 8.x branch as well.
If memcache_stampede_protection is set to TRUE, then with the default values shown above a request makes up to three attempts, each waiting up to 5 seconds. The total can therefore be as high as 15 seconds when the site is busy, which is exactly what we were seeing. Most of the PHP processes end up waiting for the lock, and no free PHP processes are available to serve requests from other site visitors.
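The worst-case figure is simple multiplication, sketched below. The function name is hypothetical; the two inputs are the defaults from the variable_get() calls in the snippet above.

```php
<?php
// Upper bound on the time one request can spend in lock_wait() under
// stampede protection: the attempt counter is static, so it is shared across
// the whole request, and each attempt blocks for at most $wait_time seconds.
function worst_case_stampede_wait($wait_limit, $wait_time) {
  return $wait_limit * $wait_time;
}

// Defaults quoted above: 3 attempts, 5 seconds each.
echo worst_case_stampede_wait(3, 5), " seconds\n"; // 15 seconds
```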
One possible solution is to lower the number of attempts to 2 (memcache_stampede_wait_limit = 2), and the wait time for each attempt to 1 second (memcache_stampede_wait_time = 1), but that is still 2 seconds of wait!
We did exactly that, and re-ran our tests.
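For reference, the reduced limits might look like this in settings.php. This is a sketch that assumes the two variables are overridden through $conf in the same way as the other memcache settings; the variable names come from the variable_get() calls quoted earlier.

```php
// settings.php: keep stampede protection on, but bound the wait.
$conf['memcache_stampede_protection'] = TRUE;
// At most 2 lock attempts per request (the module code defaults to 3).
$conf['memcache_stampede_wait_limit'] = 2;
// Wait at most 1 second per attempt (the module code defaults to 5).
$conf['memcache_stampede_wait_time'] = 1;
```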
The figures were much better: for 200 concurrent logged in users, the average response time was 2.89 seconds, and a total of 10,042 page loads, with 100% success (i.e. no time outs).
But a response time of ~ 3 seconds is still slow, and there is still the possibility of a pile up condition when all PHP processes are waiting.
So, we decided that the best course of action was to use neither memcache's locking nor its stampede protection, and hence disabled the two lines in settings.php by commenting them out:
//$conf['lock_inc'] = 'sites/all/modules/memcache/memcache-lock.inc';
//$conf['memcache_stampede_protection'] = TRUE;
The results were much better: for 200 concurrent logged in users, the average response time was 1.09 seconds, and a total of 11,196 pages with 100% success rate (no timeouts).
At this point, the server's CPU utilization was 45-55%, meaning that it can handle more users.
But wait! We forgot something: the last test was run with the xhprof profiler left enabled by mistake after profiling the web site! That uses up a lot of CPU time and causes heavy writes to the disk.
So we disabled xhprof and ran another test, and the results were fantastic: for 200 concurrent logged in users, the average response time was just 0.20 seconds, and a total of 11,892 pages with 100% success rate (no timeouts).
Eureka!
Note that for the above tests, we disabled all the rules, disabled a couple of modules that have slow queries, and commented out the history table update query in core's node.module:node_tag_new(). So, these figures are idealized somewhat.
Also, this is a server that is not particularly new (made in 2013), and uses regular spinning disks (not SSDs).
For now, the main bottleneck has been uncovered and overcome. The site is now limited only by other factors, such as available CPU, the speed of its disks, and the complexity of its modules, rules and views.
So, check your settings.php to see if you have memcache_stampede_protection enabled, and disable it if it is.
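One way to do that check is a small standalone script like the sketch below. The helper is hypothetical (it is not part of Drupal or the memcache module), and the settings path is an assumption that you should adjust for your site.

```php
<?php
// Hypothetical helper: return the active (uncommented) settings.php lines
// that set the two options discussed in this article, keyed by option name.
function find_memcache_lock_settings($settings_source) {
  $found = array();
  foreach (preg_split('/\R/', $settings_source) as $line) {
    $trimmed = ltrim($line);
    // Ignore lines that start as comments (//, #, /* or docblock-style *).
    if (preg_match('/^(\/\/|#|\/\*|\*)/', $trimmed)) {
      continue;
    }
    if (preg_match('/\$conf\[[\'"](lock_inc|memcache_stampede_protection)[\'"]\]/', $trimmed, $m)) {
      $found[$m[1]] = $trimmed;
    }
  }
  return $found;
}

// Usage: the path below is an assumption; many sites keep settings elsewhere.
$path = 'sites/default/settings.php';
if (is_readable($path)) {
  print_r(find_memcache_lock_settings(file_get_contents($path)));
}
```

The script only reports which lines are active; it does not interpret the values, so a line that sets memcache_stampede_protection to FALSE will also be listed and can be ignored.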
Comments
Brian Tully (not verified)
As always, very helpful info from 2bits. When was this written?
Tue, 2019/10/15 - 11:44
Hey there! Long time listener, first time caller ;)
The information in this article certainly validates what I've been observing and testing. While we (Acquia) initially recommended using memcache locking to avoid database deadlocks, lately i've been recommending setting database tx_isolation to READ-COMMITTED, which seems more reliable and performant than memcache locking (especially when stampede protection is also enabled).
I'm curious as to when this article was posted. Have you considered displaying the published date field on your articles so that users can have the context on when the information was published? I think that would be really helpful.
Thanks again for all you do and all the great information you share! :)
Khalid
Article is not that old
Tue, 2019/10/15 - 17:28
The article is not that old. It was posted on Feb 12th, 2018.
Brian (not verified)
Thanks, but there's no way of knowing that
Sat, 2024/02/17 - 12:14
Hi Khalid. You say the article was posted on Feb 12th, 2018 -- but how do you know that? There's nothing on the page to indicate when it was published. My suggestion is simply to make the published date visible in the template so that users can know whether the information is recent or not.