How relying on connections to third party servers can be detrimental to performance

One client of ours was facing severe issues with their relatively new well equipped server: the server stopped responding to web requests, and was rebooted, only to stop responding again.

Upon investigation, we found out that pages were taking a lot of time to load.

This only happened when viewing a node in full page view, not when the
nodes were in lists (just as in views, node edit form ...etc.)

Devel was showing this:

Page execution time was 189895.68 ms. Executed 863 queries in 619.69 milliseconds.
Page execution time was 309544.34 ms. Executed 682 queries in 125.29 milliseconds.
Page execution time was 309567.13 ms. Executed 533 queries in 123.34 milliseconds. 

As you can see, it is more than 5 minutes for the page to load, and it was not because of database queries.

At one point we found the server using 100% of all 8 CPUs, with 350+ processes, compared to a normal of 95 to 150 processes under normal load. More curiously, no one process was hogging the CPU. All the time was system not user time.

  1  [||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||100.0%]     Tasks: 352 total, 1 running
2  [||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||100.0%]     Load average: 69.28 123.59 112.93
3  [||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||100.0%]     Uptime: 02:16:10
4  [||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||100.0%]
5  [||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||100.0%]
6  [||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||100.0%]
7  [||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||100.0%]
8  [||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||100.0%]
15573 www-data  20   0  257M 31840  2540 S  8.8  0.4  0:00.87 /usr/sbin/apache2 -k start
15016 www-data  20   0  265M 39632  2684 S  2.5  0.5  0:03.80 /usr/sbin/apache2 -k start
16099 root      20   0 19528  1508   996 R  1.3  0.0  0:00.67 htop
5762 mysql     20   0  322M 79324  2504 S  0.6  1.0 16:46.69 /usr/sbin/mysqld --basedir=/usr
15536 www-data  20   0  243M 18152  2836 S  0.6  0.2  0:59.75 /usr/sbin/apache2 -k start
14937 www-data  20   0  264M 38608  2604 D  0.6  0.5  0:02.23 /usr/sbin/apache2 -k start
15112 www-data  20   0  234M  9208  2748 S  0.6  0.1  0:04.44 /usr/sbin/apache2 -k start 

We suspected a network issue, and using netstat, we were able to confirm that it is the case, since we saw the following:

tcp        0    212        ESTABLISHED 7110/apache2
tcp        0    240        ESTABLISHED 7071/apache2
tcp        0    217        ESTABLISHED 7095/apache2
tcp        0    246        ESTABLISHED 6451/apache2
tcp        0    207        ESTABLISHED 6487/apache2
tcp        0    203        ESTABLISHED 6460/apache2
tcp        0    206        ESTABLISHED 7534/apache2
tcp        0    206        ESTABLISHED 7060/apache2
tcp        0    206        ESTABLISHED 7545/apache2
tcp        0    191        ESTABLISHED 5762/apache2
tcp        0    214        ESTABLISHED 7073/apache2
tcp        0    219        ESTABLISHED 7081/apache2
tcp        0    192        ESTABLISHED 6448/apache2
tcp        0    217        ESTABLISHED 6449/apache2
tcp        0    207        ESTABLISHED 6443/apache2
tcp        0    211        ESTABLISHED 7046/apache2
tcp        0    206        ESTABLISHED 7112/apache2
tcp        0    188        ESTABLISHED 6979/apache2
tcp        0    212        ESTABLISHED 6442/apache2
tcp        0    194        ESTABLISHED 6552/apache2
tcp        0    210        ESTABLISHED 6532/apache2
tcp        0    212        ESTABLISHED 6551/apache2
tcp        0    243        ESTABLISHED 7036/apache2
tcp        0    207        ESTABLISHED 6560/apache2
tcp        0    191        ESTABLISHED 5813/apache2
tcp        0    192        ESTABLISHED 6438/apache2
tcp        0    194        ESTABLISHED 6578/apache2
tcp        0    217        ESTABLISHED 7121/apache2
tcp        0    207        ESTABLISHED 7048/apache2
tcp        0    202        ESTABLISHED 6465/apache2
tcp        0    212        ESTABLISHED 7108/apache2
tcp        0    259        ESTABLISHED 6437/apache2
tcp        0    225        ESTABLISHED 7024/apache2
tcp        0    232        ESTABLISHED 6479/apache2
tcp        0    243        ESTABLISHED 5526/apache2 

With DNS resolving the names, we saw this:

tcp        0    192   ESTABLISHED 6564/apache2
tcp        0    207   ESTABLISHED 8323/apache2
tcp        0    212   ESTABLISHED 7118/apache2
tcp        0    199   ESTABLISHED 7119/apache2
tcp        0    196   ESTABLISHED 6476/apache2
tcp        0    222   ESTABLISHED 6523/apache2
tcp        0    212   ESTABLISHED 6559/apache2
tcp        0    224   ESTABLISHED 5703/apache2
tcp        0    202   ESTABLISHED 6562/apache2
tcp        0    227   ESTABLISHED 6457/apache2
tcp        0    205   ESTABLISHED 7062/apache2
tcp        0    236   ESTABLISHED 7109/apache2
tcp        0    236   ESTABLISHED 7013/apache2
tcp        0    204   ESTABLISHED 6563/apache2
tcp        0    256   ESTABLISHED 6960/apache2
tcp        0    185   ESTABLISHED 6489/apache2
tcp        0    206   ESTABLISHED 6475/apache2
tcp        0    207   ESTABLISHED 7922/apache2 

Now, we know that it was connections to How many? 169 connection at one point.

This means 169 Apache processes tied up doing nothing, occuping memory, waiting for the other end to respond. Moreover, with more incoming requests, more and more memory was consumed causing the server to start swapping and going into thrashing and "freezing up".

Digging into the site's module, we found that it was the diggthis that was the culprit. More accurately, it was a combination of using module, with a problem with a remote server accessed from the backend.

Disabling that module solved the problem.

So, the morale of the story is to avoid having server-side code that does calls to third party servers outside of your control. If that server is down, or slow, your site will be down or slow too. At least have a reasonable timeout value if you must do so.



Signs of the widget bubble

I'll go you one further: these 3rd party "widgets" in drupal should use cron.php to update their own cache occasionally, in addition to having a short temper when it comes to outside servers locking them up in some perverted "keep-alive" ruse.

You are right ...

You are right. Your blog post is mainly about front end widgets. They do slow down the user's experience, but they have no effect on the back end of the server.

This case of diggthis module, doing server to server connections over the network killed the server totally, not only made the site feel slow to users browsing it.
2bits -- Drupal consulting

The Web 2.0 Myth

Thanks again for bringing attention to something that everyone should know.

There is no such thing as a free lunch.

The truth off web 2.0 is that you can mash anything together... the myth of web 2.0 is that you can do it in real time. If you are building for performance and scaling DO NOT use web services in real time.

As this example shows a dependency on an a service outside of your control can ruin your day.


One more vote to back up the others

Indeed, I've become very disillusioned about most third-party services. In the better cases they seem like they slow things down 'a little bit', at worse they're totally crippling to the performance and even the availability of a site, and in all in all cases it puts a part of your code base out of your hands and makes it un-testable (unit testing-wise). Stay away from them as much as possibly can, and only use them if you absolutely have_to_have_to, and *never, ever* use one on for mission critical functionality. (says someone who has used a third-party service in user registration workflow)

Dependence on Twitter/Facebook

Indeed, many sites now want to be social-network compatible and adding FB like/share and Twitter tweet buttons has become the norm. I shudder to think the repercussions of such sites slowing down..