Drupal Performance and Scalability With Excessive I/O Load or Memory Exhaustion

On many occasions, we see web site performance suffering due to misconfiguration or oversight of system resources. Here is an example where RAM and disk I/O severely impacted web site performance, and how we fixed both.

A recent project for a client with poor site performance uncovered issues within the application itself, i.e. how the Drupal site was put together. However, overcoming those issues was not enough to achieve the required scalability, with several hundred logged-in users on the site at the same time.

First, regarding memory: the server was configured with too many PHP-FPM processes, which left no room in memory for the filesystem buffers and cache. These help a lot in reducing disk I/O load.

Here is a partial display of memory usage from when we were monitoring the server, before we fixed it:

used     buffers  cache    free
7112M    8892k    746M     119M
7087M    9204k    738M     151M
7081M    9256k    770M     125M
7076M    4436k    768M     136M
7087M    4556k    760M     133M

As you can see, buffers, cache, and free memory together amount to less than 1 GB of the total RAM, while used RAM is around 7 GB.

We calculated how much RAM is really needed by watching the main components on the server. In this case the calculation was:

Memcache + MySQL + (Apache2 memory per process x number of processes) + (PHP-FPM memory per process x number of processes)

We then adjusted the number of PHP-FPM processes down to a reasonable figure, so that total application RAM was no more than 70% of the total RAM on the server.
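The calculation above can be sketched in a few lines of Python. This is only an illustration: the function name and every figure passed to it are hypothetical placeholders, not measurements from this server, and the per-process sizes would have to come from watching the actual processes.

```python
def max_php_fpm_processes(total_ram_mb, fixed_mb, apache_proc_mb,
                          apache_procs, php_fpm_proc_mb, app_share=0.70):
    """Return how many PHP-FPM processes fit within the application RAM budget.

    fixed_mb is the combined footprint of Memcache and MySQL.
    app_share is the fraction of total RAM the application may use;
    the rest is left for filesystem buffers and cache.
    """
    budget_mb = total_ram_mb * app_share
    remaining_mb = budget_mb - fixed_mb - apache_proc_mb * apache_procs
    if remaining_mb <= 0:
        # The other components already exceed the budget.
        return 0
    return int(remaining_mb // php_fpm_proc_mb)


# Hypothetical figures, for illustration only.
limit = max_php_fpm_processes(
    total_ram_mb=8192,       # assumed total RAM
    fixed_mb=512 + 2048,     # assumed Memcache + MySQL
    apache_proc_mb=20,       # assumed memory per Apache2 process
    apache_procs=50,         # assumed number of Apache2 processes
    php_fpm_proc_mb=80,      # assumed memory per PHP-FPM process
)
print("PHP-FPM process limit:", limit)
```

The resulting number is a ceiling for the PHP-FPM process limit, not a target. A lower value leaves even more room for buffers and cache.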

The result is shown below. Used memory is now about 1.8 GB instead of 7 GB. The free memory will gradually be taken up by cache and buffers, making I/O operations much faster.

used     buffers  cache    free
1858M    50.9M    1793M    4283M
1880M    51.2M    1795M    4258M
1840M    52.1M    1815M    4278M
1813M    52.4M    1815M    4304M

Another issue with the server, partially caused by the lack of cache and buffers described above but also by forgotten settings, was a severe bottleneck in disk I/O performance. The disk was so tied up that everything had to wait. I/O wait was 30%, as seen in top and htop. This is very high; it should usually be no more than 1 or 2%.
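For reference, the I/O wait percentage that top and htop report can be approximated from two samples of the aggregate "cpu" line in /proc/stat on Linux. The following is a minimal sketch under that assumption; the function names are ours, and it is not how we collected the figures in this article.

```python
import time


def parse_cpu_line(line):
    """Parse the aggregate 'cpu' line of /proc/stat into a list of tick counters."""
    fields = line.split()
    if not fields or fields[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line")
    return [int(value) for value in fields[1:]]


def iowait_percent(before, after):
    """Percentage of CPU time spent waiting for I/O between two samples."""
    # Field order: user nice system idle iowait irq softirq steal [guest guest_nice]
    # Guest time is already counted in user, so only the first eight are summed.
    deltas = [b - a for a, b in zip(before[:8], after[:8])]
    total = sum(deltas)
    if total <= 0:
        return 0.0
    return 100.0 * deltas[4] / total


def measure_iowait(interval=5.0, path="/proc/stat"):
    """Sample /proc/stat twice, `interval` seconds apart (Linux only)."""
    with open(path) as stat_file:
        first = parse_cpu_line(stat_file.readline())
    time.sleep(interval)
    with open(path) as stat_file:
        second = parse_cpu_line(stat_file.readline())
    return iowait_percent(first, second)
```

Calling measure_iowait() on a Linux server returns the I/O wait over the sampling interval as a percentage, which can then be compared against the 1 to 2% guideline.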

We also observed excessive disk reads and writes, as follows:

disk read   disk write   i/o read   i/o write
5199k       1269k        196        59.9
1731k       1045k        80         50.7
7013k       1106k        286        55.2
23M         1168k        607        58.4
9121k       1369k        358        59.7

Upon investigating, we found that the rules_debug_log setting was on. The site had 130 enabled rules, and the syslog module was enabled. We found a file under /var/log/ that was growing by over a GB per day. Writing Rules debugging output on every page load tied up the disk when a few hundred users were on the site.
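One way to spot an oversized log file like this one is to list the largest files under a directory. The sketch below is a generic helper with a hypothetical name, not the exact method we used. Running it a second time some minutes later and comparing the sizes shows which files are growing.

```python
import os


def largest_files(root, limit=5):
    """Return up to `limit` (size_in_bytes, path) pairs under root, largest first."""
    sizes = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                sizes.append((os.path.getsize(path), path))
            except OSError:
                # File is unreadable or was rotated away mid-scan; skip it.
                continue
    sizes.sort(reverse=True)
    return sizes[:limit]
```

For example, largest_files("/var/log") returns the five largest files found there. Reading some log directories may require elevated privileges.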

After disabling the Rules debug log setting, I/O wait went down to 1.3%, a significant improvement.

Here are the disk I/O figures after the fix:

disk read   disk write   i/o read   i/o write
192k        429k         10.1       27.7
292k        334k         16.0       26.3
2336k       429k         83.6       30.7
85k         742k         4.53       30.8

The site now has response times of 1 second or less, instead of the previous 3 to 4 seconds.