How relying on connections to third party servers can be detrimental to performance

One client of ours was facing severe issues with their relatively new well equipped server: the server stopped responding to web requests, and was rebooted, only to stop responding again.

Upon investigation, we found out that pages were taking a lot of time to load.

This only happened when viewing a node in full page view, not when the
nodes were in lists (just as in views, node edit form ...etc.)

Devel was showing this:

Page execution time was 189895.68 ms. Executed 863 queries in 619.69 milliseconds.
Page execution time was 309544.34 ms. Executed 682 queries in 125.29 milliseconds.
Page execution time was 309567.13 ms. Executed 533 queries in 123.34 milliseconds. 

As you can see, it is more than 5 minutes for the page to load, and it was not because of database queries.

At one point we found the server using 100% of all 8 CPUs, with 350+ processes, compared to a normal of 95 to 150 processes under normal load. More curiously, no one process was hogging the CPU. All the time was system not user time.

  1  [||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||100.0%]     Tasks: 352 total, 1 running
2  [||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||100.0%]     Load average: 69.28 123.59 112.93
3  [||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||100.0%]     Uptime: 02:16:10
4  [||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||100.0%]
5  [||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||100.0%]
6  [||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||100.0%]
7  [||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||100.0%]
8  [||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||100.0%]
Mem[|||||||||||||||||||||||||||||||||||||||||||||||||||||||||7417/8002MB]
Swp[|||||||||||||||||||||||||||||||||||||||||||||||||||||||||1729/2000MB]
PID USER     PRI  NI  VIRT   RES   SHR S CPU% MEM%   TIME+  Command
15573 www-data  20   0  257M 31840  2540 S  8.8  0.4  0:00.87 /usr/sbin/apache2 -k start
15016 www-data  20   0  265M 39632  2684 S  2.5  0.5  0:03.80 /usr/sbin/apache2 -k start
16099 root      20   0 19528  1508   996 R  1.3  0.0  0:00.67 htop
5762 mysql     20   0  322M 79324  2504 S  0.6  1.0 16:46.69 /usr/sbin/mysqld --basedir=/usr
15536 www-data  20   0  243M 18152  2836 S  0.6  0.2  0:59.75 /usr/sbin/apache2 -k start
14937 www-data  20   0  264M 38608  2604 D  0.6  0.5  0:02.23 /usr/sbin/apache2 -k start
15112 www-data  20   0  234M  9208  2748 S  0.6  0.1  0:04.44 /usr/sbin/apache2 -k start 

We suspected a network issue, and using netstat, we were able to confirm that it is the case, since we saw the following:

tcp        0    212 10.0.0.1:50803     64.191.203.26:80        ESTABLISHED 7110/apache2
tcp        0    240 10.0.0.1:45267     64.191.203.26:80        ESTABLISHED 7071/apache2
tcp        0    217 10.0.0.1:45287     64.191.203.26:80        ESTABLISHED 7095/apache2
tcp        0    246 10.0.0.1:50812     64.191.203.26:80        ESTABLISHED 6451/apache2
tcp        0    207 10.0.0.1:50844     64.191.203.26:80        ESTABLISHED 6487/apache2
tcp        0    203 10.0.0.1:45202     64.191.203.26:80        ESTABLISHED 6460/apache2
tcp        0    206 10.0.0.1:50810     64.191.203.26:80        ESTABLISHED 7534/apache2
tcp        0    206 10.0.0.1:50842     64.191.203.26:80        ESTABLISHED 7060/apache2
tcp        0    206 10.0.0.1:50756     64.191.203.26:80        ESTABLISHED 7545/apache2
tcp        0    191 10.0.0.1:45276     64.191.203.26:80        ESTABLISHED 5762/apache2
tcp        0    214 10.0.0.1:50795     64.191.203.26:80        ESTABLISHED 7073/apache2
tcp        0    219 10.0.0.1:50801     64.191.203.26:80        ESTABLISHED 7081/apache2
tcp        0    192 10.0.0.1:45207     64.191.203.26:80        ESTABLISHED 6448/apache2
tcp        0    217 10.0.0.1:50832     64.191.203.26:80        ESTABLISHED 6449/apache2
tcp        0    207 10.0.0.1:45240     64.191.203.26:80        ESTABLISHED 6443/apache2
tcp        0    211 10.0.0.1:50814     64.191.203.26:80        ESTABLISHED 7046/apache2
tcp        0    206 10.0.0.1:50799     64.191.203.26:80        ESTABLISHED 7112/apache2
tcp        0    188 10.0.0.1:50735     64.191.203.26:80        ESTABLISHED 6979/apache2
tcp        0    212 10.0.0.1:45220     64.191.203.26:80        ESTABLISHED 6442/apache2
tcp        0    194 10.0.0.1:45256     64.191.203.26:80        ESTABLISHED 6552/apache2
tcp        0    210 10.0.0.1:50742     64.191.203.26:80        ESTABLISHED 6532/apache2
tcp        0    212 10.0.0.1:50798     64.191.203.26:80        ESTABLISHED 6551/apache2
tcp        0    243 10.0.0.1:45225     64.191.203.26:80        ESTABLISHED 7036/apache2
tcp        0    207 10.0.0.1:45229     64.191.203.26:80        ESTABLISHED 6560/apache2
tcp        0    191 10.0.0.1:50830     64.191.203.26:80        ESTABLISHED 5813/apache2
tcp        0    192 10.0.0.1:45270     64.191.203.26:80        ESTABLISHED 6438/apache2
tcp        0    194 10.0.0.1:50794     64.191.203.26:80        ESTABLISHED 6578/apache2
tcp        0    217 10.0.0.1:50759     64.191.203.26:80        ESTABLISHED 7121/apache2
tcp        0    207 10.0.0.1:50819     64.191.203.26:80        ESTABLISHED 7048/apache2
tcp        0    202 10.0.0.1:50805     64.191.203.26:80        ESTABLISHED 6465/apache2
tcp        0    212 10.0.0.1:45222     64.191.203.26:80        ESTABLISHED 7108/apache2
tcp        0    259 10.0.0.1:45242     64.191.203.26:80        ESTABLISHED 6437/apache2
tcp        0    225 10.0.0.1:45238     64.191.203.26:80        ESTABLISHED 7024/apache2
tcp        0    232 10.0.0.1:50749     64.191.203.26:80        ESTABLISHED 6479/apache2
tcp        0    243 10.0.0.1:45218     64.191.203.26:80        ESTABLISHED 5526/apache2 

With DNS resolving the names, we saw this:

tcp        0    192 example.com:45209 services.digg.com:www   ESTABLISHED 6564/apache2
tcp        0    207 example.com:45213 services.digg.com:www   ESTABLISHED 8323/apache2
tcp        0    212 example.com:50734 services.digg.com:www   ESTABLISHED 7118/apache2
tcp        0    199 example.com:50821 services.digg.com:www   ESTABLISHED 7119/apache2
tcp        0    196 example.com:45258 services.digg.com:www   ESTABLISHED 6476/apache2
tcp        0    222 example.com:45261 services.digg.com:www   ESTABLISHED 6523/apache2
tcp        0    212 example.com:45286 services.digg.com:www   ESTABLISHED 6559/apache2
tcp        0    224 example.com:45249 services.digg.com:www   ESTABLISHED 5703/apache2
tcp        0    202 example.com:50755 services.digg.com:www   ESTABLISHED 6562/apache2
tcp        0    227 example.com:45244 services.digg.com:www   ESTABLISHED 6457/apache2
tcp        0    205 example.com:50736 services.digg.com:www   ESTABLISHED 7062/apache2
tcp        0    236 example.com:50745 services.digg.com:www   ESTABLISHED 7109/apache2
tcp        0    236 example.com:45277 services.digg.com:www   ESTABLISHED 7013/apache2
tcp        0    204 example.com:50757 services.digg.com:www   ESTABLISHED 6563/apache2
tcp        0    256 example.com:45295 services.digg.com:www   ESTABLISHED 6960/apache2
tcp        0    185 example.com:50818 services.digg.com:www   ESTABLISHED 6489/apache2
tcp        0    206 example.com:45221 services.digg.com:www   ESTABLISHED 6475/apache2
tcp        0    207 example.com:45268 services.digg.com:www   ESTABLISHED 7922/apache2 

Now, we know that it was connections to services.digg.com. How many? 169 connection at one point.

This means 169 Apache processes tied up doing nothing, occuping memory, waiting for the other end to respond. Moreover, with more incoming requests, more and more memory was consumed causing the server to start swapping and going into thrashing and "freezing up".

Digging into the site's module, we found that it was the diggthis that was the culprit. More accurately, it was a combination of using module, with a problem with a remote server accessed from the backend.

Disabling that module solved the problem.

So, the morale of the story is to avoid having server-side code that does calls to third party servers outside of your control. If that server is down, or slow, your site will be down or slow too. At least have a reasonable timeout value if you must do so.

Contents: 

Comments

Signs of the widget bubble

I'll go you one further: these 3rd party "widgets" in drupal should use cron.php to update their own cache occasionally, in addition to having a short temper when it comes to outside servers locking them up in some perverted "keep-alive" ruse.

You are right ...

You are right. Your blog post is mainly about front end widgets. They do slow down the user's experience, but they have no effect on the back end of the server.

This case of diggthis module, doing server to server connections over the network killed the server totally, not only made the site feel slow to users browsing it.
--
2bits -- Drupal consulting

The Web 2.0 Myth

Thanks again for bringing attention to something that everyone should know.

There is no such thing as a free lunch.

The truth off web 2.0 is that you can mash anything together... the myth of web 2.0 is that you can do it in real time. If you are building for performance and scaling DO NOT use web services in real time.

As this example shows a dependency on an a service outside of your control can ruin your day.

andre

One more vote to back up the others

Indeed, I've become very disillusioned about most third-party services. In the better cases they seem like they slow things down 'a little bit', at worse they're totally crippling to the performance and even the availability of a site, and in all in all cases it puts a part of your code base out of your hands and makes it un-testable (unit testing-wise). Stay away from them as much as possibly can, and only use them if you absolutely have_to_have_to, and *never, ever* use one on for mission critical functionality. (says someone who has used a third-party service in user registration workflow)

Dependence on Twitter/Facebook

Indeed, many sites now want to be social-network compatible and adding FB like/share and Twitter tweet buttons has become the norm. I shudder to think the repercussions of such sites slowing down..