This article describes a network stack anomaly that I saw on a busy Drupal site.
On a normal weekday, under the regular peak-hour load, netstat shows the following:
# netstat -ant | grep ":80 " | wc -l
4833
Meanwhile, apachetop shows that we are serving 80 to 90 requests per second:
last hit: 15:56:28 atop runtime: 0 days, 00:00:35 15:56:29
All: 3069 reqs ( 87.7/sec) 29.9M ( 875.0K/sec) 10218.5B/req
2xx: 2704 (88.1%) 3xx: 364 (11.9%) 4xx: 1 ( 0.0%) 5xx: 0 ( 0.0%)
R ( 30s): 2623 reqs ( 87.4/sec) 25.9M ( 882.4K/sec) 10.1K/req
2xx: 2278 (86.8%) 3xx: 344 (13.1%) 4xx: 1 ( 0.0%) 5xx: 0 ( 0.0%)
So, life is good, things are normal and the server can handle all that gracefully.
Load under Digg
That site normally gets on Digg's front page anywhere from once every two months to a couple of times per month. Previous instances have been uneventful (except for the increased load, of course).
This time, something unusual happened. The bottleneck was in the network stack, with Munin showing a lot of failed connections.
netstat showed an unusually high number of connections:
# netstat -ant | grep ":80 " | wc -l
13826
# netstat -ant | grep ":80 " | wc -l
13777
# netstat -ant | grep ":80 " | wc -l
13957
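The raw count above does not say which TCP states those connections were in. A minimal sketch of one way to break the count down by state is shown here. The function name count_states is made up for this example, it assumes the usual Linux `netstat -ant` column layout (local address in column 4, state in column 6), and unlike the grep above it only counts sockets whose local port is 80.

```shell
#!/bin/sh
# Hypothetical helper (not from the original article): reads `netstat -ant`
# output on stdin and prints the number of local port-80 sockets per TCP state,
# largest count first.
# Assumes the Linux netstat layout: Proto Recv-Q Send-Q Local Foreign State.
count_states() {
    awk '
        # Keep only TCP lines whose local address ends in :80
        $1 ~ /^tcp/ && $4 ~ /:80$/ { n[$6]++ }
        END { for (s in n) print n[s], s }
    ' | sort -rn
}

# Typical use on the web server:
#   netstat -ant | count_states
```

A breakdown dominated by one state (for example SYN_RECV or TIME_WAIT) would point at a different part of the stack than an even spread, but the measurements above do not show which case applied here.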
The system itself was idle while this happened, since few requests were making it through to Apache. Munin showed a spike in Apache traffic followed by a brief "blank" period, after which Apache started serving requests again.
Right after the system recovered, Apachetop showed the following:
last hit: 16:38:50 atop runtime: 0 days, 00:03:10 16:38:51
All: 41066 reqs ( 216.1/sec) 467.3M ( 2518.6K/sec) 11.7K/req
2xx: 38440 (93.6%) 3xx: 2608 ( 6.4%) 4xx: 18 ( 0.0%) 5xx: 0 ( 0.0%)
R ( 30s): 6283 reqs ( 209.4/sec) 72.6M ( 2478.5K/sec) 11.8K/req
2xx: 5882 (93.6%) 3xx: 398 ( 6.3%) 4xx: 3 ( 0.0%) 5xx: 0 ( 0.0%)
Note that the request rate is 209 to 216 requests per second, which is more than double the roughly 87 requests per second normally served at peak.
Because MaxClients had been tuned, the system never started swapping.
The big question
So, the question is: what can be done at the network stack level to prevent this kind of outage?
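A first step toward answering that might be to record what the kernel's connection queue limits currently are. The sketch below is only that: the function name show_tcp_limits and the choice of these four Linux settings are assumptions made for this example, and nothing measured above establishes that any of them caused the outage. The optional argument (an alternate root directory) exists only so the function can be tried against a copy of the files.

```shell
#!/bin/sh
# Hypothetical helper (not from the original article): prints a few Linux
# settings related to TCP connection queues, read from /proc/sys.
# An alternate root directory can be passed as $1.
show_tcp_limits() {
    root="${1:-/proc/sys}"
    for key in net/core/somaxconn \
               net/ipv4/tcp_max_syn_backlog \
               net/ipv4/tcp_syncookies \
               net/core/netdev_max_backlog; do
        # Show the key in the dotted form that sysctl uses
        name=$(printf '%s' "$key" | tr '/' '.')
        if [ -r "$root/$key" ]; then
            printf '%s = %s\n' "$name" "$(cat "$root/$key")"
        else
            printf '%s = (not available)\n' "$name"
        fi
    done
}

# Print the values for the running system
show_tcp_limits
```

Capturing these values both on a normal day and during a spike, alongside the netstat counts, would give a baseline to compare against before changing anything.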