This article describes an anomaly I have seen on a busy Drupal site, related to the network stack.
Normal load
On a normal weekday, during peak hours under the regular load, netstat shows the following:
# netstat -ant | grep ":80 " | wc -l
4833
Apachetop, meanwhile, shows we are serving 80 to 90 requests per second:
last hit: 15:56:28         atop runtime:  0 days, 00:00:35                15:56:29
All:         3069 reqs (  87.7/sec)        29.9M (  875.0K/sec)    10218.5B/req
2xx:    2704 (88.1%) 3xx:     364 (11.9%) 4xx:       1 ( 0.0%) 5xx:       0 ( 0.0%)
R ( 30s):    2623 reqs (  87.4/sec)        25.9M (  882.4K/sec)       10.1K/req
2xx:    2278 (86.8%) 3xx:     344 (13.1%) 4xx:       1 ( 0.0%) 5xx:       0 ( 0.0%)
So life is good, things are normal, and the server handles all of that gracefully.
Load under digg
The site normally makes it to digg's front page anywhere from once every two months to a couple of times per month. Previous instances have been uneventful (apart from the increased load, of course).
This time, something unusual happened. The bottleneck was in the network stack, with Munin showing a lot of failed connections.
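The "failed connections" graphed by Munin most likely come from the kernel's own TCP counters, which can also be read directly. A quick check, as a sketch (assuming the Linux net-tools netstat; the exact counter wording varies between kernel versions):

# netstat -s | grep -i -E "failed|reset|listen|overflow"

Counters such as "failed connection attempts", and anything about the listen queue overflowing, are the interesting ones here, since they distinguish connections the kernel refused from connections Apache merely served slowly.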
netstat showed an unusual number of connections:
# netstat -ant | grep ":80 " | wc -l
13826
# netstat -ant | grep ":80 " | wc -l
13777
# netstat -ant | grep ":80 " | wc -l
13957
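When the count jumps like that, the next useful step is to see how the connections are distributed across TCP states, since a pile-up in SYN_RECV or TIME_WAIT points to a different bottleneck than a pile-up in ESTABLISHED. A one-liner sketch built on the same netstat output (the state is the sixth column of netstat -ant):

# netstat -ant | grep ":80 " | awk '{print $6}' | sort | uniq -c | sort -rn

A large SYN_RECV count would suggest the listen/SYN backlog was overflowing before Apache ever saw the requests, while mostly ESTABLISHED connections would point back at Apache or the application.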
The system was idle while this happened, since not that many requests made it to Apache. Munin showed a spike in Apache traffic, followed by a brief "blank" period, after which the server started serving requests again.
Right after the system recovered, Apachetop showed the following:
last hit: 16:38:50         atop runtime:  0 days, 00:03:10                16:38:51
All:        41066 reqs ( 216.1/sec)       467.3M ( 2518.6K/sec)       11.7K/req
2xx:   38440 (93.6%) 3xx:    2608 ( 6.4%) 4xx:      18 ( 0.0%) 5xx:       0 ( 0.0%)
R ( 30s):    6283 reqs ( 209.4/sec)        72.6M ( 2478.5K/sec)       11.8K/req
2xx:    5882 (93.6%) 3xx:     398 ( 6.3%) 4xx:       3 ( 0.0%) 5xx:       0 ( 0.0%)
Note that the rate is 209 to 216 requests per second, more than double what is normally served.
Because MaxClients was tuned appropriately, the system never went into swap.
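For reference, the usual way MaxClients gets sized for the prefork MPM is from the memory left over for Apache divided by the average size of an Apache process. A rough sketch of that calculation (illustrative only; the process name apache2 depends on the distribution, and would be httpd on Red Hat style systems):

# free -m
# ps -o rss= -C apache2 | awk '{ sum += $1; n++ } END { printf "%.1f MB average per process over %d processes\n", sum/n/1024, n }'

MaxClients is then chosen so that MaxClients times the average process size stays well below the RAM left after MySQL and the rest of the stack, which is exactly what keeps the box out of swap during a spike like this.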
The big question
So, the question is: what can be done at the network stack level to prevent this period of outage?
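The usual candidates at that level are the kernel's listen backlog and SYN queue limits, and whether SYN cookies are enabled. A sketch of the knobs typically inspected on a Linux box (the -w values below are placeholders for illustration, not tuned recommendations for this site):

# sysctl net.core.somaxconn net.ipv4.tcp_max_syn_backlog net.ipv4.tcp_syncookies
# sysctl -w net.core.somaxconn=1024
# sysctl -w net.ipv4.tcp_max_syn_backlog=4096

Apache's own ListenBacklog directive matters too, since the effective backlog is the smaller of what the application asks for and what the kernel allows.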
Comments
Visitor (not verified)
That kind of problem can be
Tue, 2007/09/04 - 14:05

That kind of problem can be quite complex to investigate, and you need to look at all the components in your infrastructure.
Are you sure it was your web server that caused the refused connections, and not a firewall in front of the system, or e.g. a backend server with NFS or MySQL?
Did you notice any messages in /var/log/messages or syslog about SYN cookies or the maximum number of open files being reached?
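These are the kinds of searches that would turn such messages up, as a sketch (the log file names depend on the syslog configuration):

# grep -i -E "syn flood|syn cookies" /var/log/messages /var/log/syslog
# grep -i "too many open files" /var/log/messages /var/log/syslog
# cat /proc/sys/fs/file-nr

The last file shows the number of allocated file handles against the system-wide maximum.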
Or maybe it's much simpler: if your MaxClients is set to a value X and you get 4 times X requests at the same time, your clients will have to wait. If they have to wait too long, the connection times out and will probably be logged as 'failed' in Munin.
Khalid
Thanks
Thu, 2007/09/06 - 13:05

At the time the digg happened, it was indeed port 80 that showed the ~14,000 connections.
There is nothing else of note in /var/log (syslog, messages, kern.log).
The last one shows some messages, but those appear at other times too, with no ill effects.
That could be the case, but it would not explain why the load on Drupal, CPU, etc. is not high, and why the system is idle most of the time.
--
2bits -- Drupal and Backdrop CMS consulting