This article describes an anomaly I have seen on a busy Drupal site, related to the network stack.
Normal load
On a normal weekday, during peak hours under the regular load, netstat shows the following:
# netstat -ant | grep ":80 " | wc -l
4833
Apachetop, meanwhile, shows we are serving 80 to 90 requests per second:
last hit: 15:56:28         atop runtime:  0 days, 00:00:35                15:56:29
All:         3069 reqs (  87.7/sec)        29.9M (  875.0K/sec)    10218.5B/req
2xx:    2704 (88.1%) 3xx:     364 (11.9%) 4xx:       1 ( 0.0%) 5xx:       0 ( 0.0%)
R ( 30s):    2623 reqs (  87.4/sec)        25.9M (  882.4K/sec)       10.1K/req
2xx:    2278 (86.8%) 3xx:     344 (13.1%) 4xx:       1 ( 0.0%) 5xx:       0 ( 0.0%)
So life is good, things are normal, and the server handles all of that gracefully.
Load under digg
The site normally makes it to digg's front page anywhere from once every two months to a couple of times per month. Previous instances have been uneventful (apart from the increased load, of course).
This time, something unusual happened. The bottleneck was in the network stack, with Munin showing a lot of failed connections.
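The "failed connections" graphed by Munin most likely come from the kernel's own TCP counters, which can also be read directly. A quick check, as a sketch (assuming the Linux net-tools netstat; the exact counter wording varies between kernel versions):

# netstat -s | grep -i -E "failed|reset|listen|overflow"

Counters such as "failed connection attempts", and anything about the listen queue overflowing, are the interesting ones here, since they distinguish connections the kernel refused from connections Apache merely served slowly.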
netstat showed an unusual number of connections:
# netstat -ant | grep ":80 " | wc -l
13826
# netstat -ant | grep ":80 " | wc -l
13777
# netstat -ant | grep ":80 " | wc -l
13957
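When the count jumps like that, the next useful step is to see how the connections are distributed across TCP states, since a pile-up in SYN_RECV or TIME_WAIT points to a different bottleneck than a pile-up in ESTABLISHED. A one-liner sketch built on the same netstat output (the state is the sixth column of netstat -ant):

# netstat -ant | grep ":80 " | awk '{print $6}' | sort | uniq -c | sort -rn

A large SYN_RECV count would suggest the listen/SYN backlog was overflowing before Apache ever saw the requests, while mostly ESTABLISHED connections would point back at Apache or the application.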
The system was idle while this happened, since not that many requests made it to Apache. Munin showed a spike in Apache traffic, followed by a brief "blank" period, after which the server started serving requests again.
Right after the system recovered, Apachetop showed the following:
last hit: 16:38:50         atop runtime:  0 days, 00:03:10                16:38:51
All:        41066 reqs ( 216.1/sec)       467.3M ( 2518.6K/sec)       11.7K/req
2xx:   38440 (93.6%) 3xx:    2608 ( 6.4%) 4xx:      18 ( 0.0%) 5xx:       0 ( 0.0%)
R ( 30s):    6283 reqs ( 209.4/sec)        72.6M ( 2478.5K/sec)       11.8K/req
2xx:    5882 (93.6%) 3xx:     398 ( 6.3%) 4xx:       3 ( 0.0%) 5xx:       0 ( 0.0%)
Note that the rate is 209 to 216 requests per second, more than double what is normally served.
Because MaxClients was tuned appropriately, the system never went into swap.
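For reference, the usual way MaxClients gets sized for the prefork MPM is from the memory left over for Apache divided by the average size of an Apache process. A rough sketch of that calculation (illustrative only; the process name apache2 depends on the distribution, and would be httpd on Red Hat style systems):

# free -m
# ps -o rss= -C apache2 | awk '{ sum += $1; n++ } END { printf "%.1f MB average per process over %d processes\n", sum/n/1024, n }'

MaxClients is then chosen so that MaxClients times the average process size stays well below the RAM left after MySQL and the rest of the stack, which is exactly what keeps the box out of swap during a spike like this.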
The big question
So, the question is: what can be done at the network stack level to prevent this period of outage?
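The usual candidates at that level are the kernel's listen backlog and SYN queue limits, and whether SYN cookies are enabled. A sketch of the knobs typically inspected on a Linux box (the -w values below are placeholders for illustration, not tuned recommendations for this site):

# sysctl net.core.somaxconn net.ipv4.tcp_max_syn_backlog net.ipv4.tcp_syncookies
# sysctl -w net.core.somaxconn=1024
# sysctl -w net.ipv4.tcp_max_syn_backlog=4096

Apache's own ListenBacklog directive matters too, since the effective backlog is the smaller of what the application asks for and what the kernel allows.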
Comments
Visitor (not verified)
That kind of problem can be
Tue, 2007/09/04 - 14:05

That kind of problem can be quite complex to investigate, and you need to look at all the components in your infrastructure.
Are you sure it was your web server that caused the refused connections, and not a firewall in front of the system, or e.g. a backend server with NFS or MySQL?
Did you notice any messages in /var/log/messages or syslog about SYN cookies or the maximum number of open files being reached?
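These are the kinds of searches that would turn such messages up, as a sketch (the log file names depend on the syslog configuration):

# grep -i -E "syn flood|syn cookies" /var/log/messages /var/log/syslog
# grep -i "too many open files" /var/log/messages /var/log/syslog
# cat /proc/sys/fs/file-nr

The last file shows the number of allocated file handles against the system-wide maximum.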
Or maybe it's much simpler: if your MaxClients is set to a value X and you get 4 times X requests at the same time, your clients will have to wait. If they have to wait too long, the connection times out and will probably be logged as 'failed' in Munin.
Khalid
Thanks
Thu, 2007/09/06 - 13:05

At the time the digg happened, it was indeed port 80 that showed the ~14,000 connections.
There is nothing else of note in /var/log (syslog, messages, kern.log).
The last one shows some messages, but those appear at other times too, with no ill effects.
That could be the case, but it would not explain why the load on Drupal, CPU, etc. is not high, and why the system is idle most of the time.
--
2bits -- Drupal and Backdrop CMS consulting