There's lots written about scaling large complex websites, but what does a casual sysadmin do to fix a simple Apache config? TechMeme picked up my Facebook blog post and suddenly hundreds of readers were coming, peaking at an awesome 4 requests/second. And my poor little blog got very slow: 30-second page loads. Wutdo?

At first I assumed my server was melting; my blog software is old and simple. But uptime and top showed no CPU problems, and free and vmstat showed no swapping. Blosxom's simplicity is a good thing: there's very little to go wrong, particularly with a little caching.

So the server was fine. Why was my blog taking 30 seconds to load? The key insight came from Google Chrome's resource panel, which shows a little timeline of the page load. My main page was taking 17 seconds! That's not good; the script itself executes in about 400ms. Worse, my page loads a bunch of images, CSS, and JavaScript from my server, and some of those loads were taking 10+ seconds as well. For static files.

The little light went on in my head: connection queueing. With prefork, each web request gets its own Apache process that has to hang around until the client at the other end of the Internet receives all the bytes. Apache has a limited number of workers and one slow browser may be using several at once. I only had 20 workers; it doesn't take long for them all to be sitting around doing nothing except waiting for slow clients to download data. Once that happens new requests have to queue up and wait 10+ seconds, and things go downhill quickly. You can diagnose queueing via netstat and Apache server-status.
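A back-of-the-envelope sketch of why 20 workers run out so fast, using Little's law (average busy workers = arrival rate × time each request holds a worker). The 4 requests/second, 20 workers, and 400ms figures come from this post; the 5-second hold time for slow clients is an assumption for illustration. The second half parses the "Key: value" lines that mod_status emits at server-status?auto; the sample text is made up, not captured from this server, and the function names are hypothetical.

```python
import math


def busy_workers(requests_per_second: float, hold_seconds: float) -> int:
    """Little's law: average busy workers = arrival rate * time a worker is held."""
    return math.ceil(requests_per_second * hold_seconds)


def saturating_hold_time(workers: int, requests_per_second: float) -> float:
    """Average hold time (seconds) at which every worker is busy."""
    return workers / requests_per_second


def parse_auto_status(text: str) -> dict:
    """Split 'Key: value' lines, as in mod_status's server-status?auto output."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(": ")
        if sep:
            fields[key] = value
    return fields


# Figures from the post: 4 requests/second at peak, 20 prefork workers.
print(busy_workers(4, 0.4))         # workers held only for the ~400 ms script run
print(busy_workers(4, 5.0))         # slow clients, assumed to hold a worker for 5 s
print(saturating_hold_time(20, 4))  # hold time at which all 20 workers are busy

# Made-up sample of a saturated server, not real output from this blog.
# In the scoreboard, "W" is a worker sending a reply and "K" is keepalive.
sample = "BusyWorkers: 20\nIdleWorkers: 0\nScoreboard: WWWWWWWWWWWWWWWWKKKK"
status = parse_auto_status(sample)
print(status["IdleWorkers"], status["Scoreboard"].count("W"))
```

The point of the arithmetic: the worker count needed depends on how long clients take to receive bytes, not on how long the script runs. Zero idle workers with a scoreboard full of "W" is roughly what queueing looks like from the inside.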

The quick fix for queueing was to make my blog simpler, removing a bunch of images and JavaScript to reduce the number of requests each page load makes to my server. The other quick fix was to modify prefork MaxClients, doubling to 40 workers. More workers is generally better right up until you have too many and run out of RAM or CPU, so I was a bit conservative. Fortunately the blog slimming was enough and I had things running OK within about 10 minutes of the problem starting.
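One rough way to pick a ceiling for MaxClients is to divide the RAM you can spare by the memory of a single Apache process. This is a minimal sketch of that arithmetic; every number in it (1024 MB total, 256 MB reserved for everything else, 25 MB per worker) is a hypothetical placeholder, not a measurement from this server, and the function name is made up.

```python
def max_workers_for_ram(total_mb: int, reserved_mb: int, per_worker_mb: int) -> int:
    """Largest worker count whose combined memory fits in the RAM left over."""
    if per_worker_mb <= 0:
        raise ValueError("per_worker_mb must be positive")
    usable_mb = max(total_mb - reserved_mb, 0)
    return usable_mb // per_worker_mb


# Hypothetical box: 1 GB RAM, 256 MB kept for the OS and other daemons,
# 25 MB resident per Apache prefork process.
print(max_workers_for_ram(1024, 256, 25))
```

Measure the per-process figure on the real server (top or ps will show it) before trusting the result. It only covers the RAM half of the tradeoff; CPU is a separate limit.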

Those were all quick fixes in a crisis. What's the right solution? A big fix is moving static resources to another server; I'd never quite believed that advice before this experience. Another fix is tuning Apache better, particularly using a different multi-processing module (MPM) than prefork. A final fix is making more of my JavaScript asynchronous so the blog renders sooner. (The real solution is migrating my blog and getting out of the admin business entirely.)

  2010-09-10 18:18 Z