Tuesday, January 31, 2006

Tuning Apache, part 1

There was a link on Digg a couple of days ago to an article about how to tune Apache so as to survive a Slashdotting. After reading it through, I came to the conclusion that the author had no idea what he was talking about. Not only did he admit that he had never experienced the "Slashdot Effect", but his advice was just plain wrong. I offered a few comments there, but I figured that I should elaborate on a few of them here. I'll post each major configuration topic as a new blog entry, and today's entry is about HTTP's Keep-Alive feature.

A brief history of Keep-Alives
The original HTTP protocol did not allow keep-alives, which meant that a connection was made to the server for each file that needed to be downloaded. This was a very inefficient method of doing things, especially since web pages typically had several files that needed to be downloaded in order to be properly displayed. Why was it inefficient? For two reasons:
  1. Each connection must first be established with a three-packet handshake (SYN, SYN-ACK, and ACK). That handshake costs a full network round-trip before the client can even send its first request, which obviously slowed things down.
  2. Due to TCP's slow-start mechanism (TCP being the transport underneath HTTP), a connection's throughput ramps up gradually the longer it stays open. By continuously opening and closing new connections, HTTP would never be able to fully utilize its available bandwidth.
The designers of HTTP realized this weakness in the protocol, and took steps to correct it: keep-alives first appeared as an extension to HTTP/1.0, and persistent connections became the default behavior in HTTP/1.1. A client could now keep a connection to the web server open indefinitely, or at least as long as the server permitted. Although this somewhat went against HTTP's original design goal of being "stateless", it allowed the protocol to overcome its speed and overhead problems.
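To make this concrete, here's a sketch of what the negotiation looks like on the wire (headers trimmed to the relevant ones, lengths made up). An HTTP/1.0 client has to ask for a persistent connection explicitly; in HTTP/1.1, connections are persistent by default unless one side sends "Connection: close":

    GET /index.html HTTP/1.0
    Connection: keep-alive

    HTTP/1.0 200 OK
    Connection: keep-alive
    Content-Length: 1234

The client can then send its next request over the same connection, skipping the handshake entirely.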

A brief introduction to Apache
Now let's examine how Apache works. When you start Apache (in its default "prefork" mode), a main parent process is created. This parent is responsible for spawning and supervising a pool of "worker" processes, each of which accepts incoming connections. A worker reads a user's requests and sends back responses, and it can only service one connection at a time; once it is done with a connection, it goes back to waiting for a new one.

Apache and Keep-Alives
So, in theory, keep-alives are a great thing. They allow web clients and servers to fully utilize their available bandwidth, and they reduce latency by eliminating the overhead of repeatedly opening new connections. In a perfect world, you would want Apache's KeepAliveTimeout setting to be "infinity", so that web clients maintain a connection to the web server for as long as possible and everything on your web site loads as fast as possible.

Apache allows you to configure its behavior in regard to keep-alives through a few options in its configuration file:
  • KeepAlive: either On or Off, depending on whether Apache should allow connections to be used for multiple requests
  • KeepAliveTimeout: how long, in seconds, Apache will wait after a request has been answered for another request before closing the connection
  • MaxKeepAliveRequests: how many total requests a client can issue across a single connection
  • MaxClients: the total number of worker processes that Apache will allow at any given time
The default Apache configuration file sets KeepAlive to be on, with a KeepAliveTimeout of 15 seconds and MaxKeepAliveRequests of 100. The MaxClients setting is set to 150.
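For reference, those defaults look like this in httpd.conf (taken from a stock configuration of this era; your distribution's file may differ slightly):

    KeepAlive On
    KeepAliveTimeout 15
    MaxKeepAliveRequests 100
    MaxClients 150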

Apache meets its match
Unfortunately, nothing in life is free, not even keep-alives. Each client connection requires Apache to dedicate a worker process (newly created or pulled from the idle pool) to service its requests. These worker processes can only handle one connection at a time, and with the default settings an idle connection is held open for up to 15 seconds after its last request before Apache closes it. Apache will create new worker processes for new connections until it hits its MaxClients limit of 150. Thus, the cost of a keep-alive is one worker process tied up for the duration of the KeepAliveTimeout.

Now imagine what happens when 1,000 web clients try to access your web site at the same moment (e.g. when it first shows up on Slashdot). The first 150 clients will successfully connect to your web server, because Apache will create workers to service their requests. However, those clients do not immediately go away; after they've downloaded your page, they hold their connections open for up to 15 seconds before your server forces them closed. The next 850 clients will be unable to reach the web server, as all of the available Apache worker processes will be tied up idling on the already-served connections to the first 150 clients. Some of those 850 clients will queue up and wait for an Apache process to become available, but most will give up.

Perhaps some readers are wondering why you wouldn't just increase the MaxClients setting to something high enough to handle your peak load, like 2000 or something. This is a very bad idea; you can increase Apache's MaxClients, but only at your own peril. Because each Apache worker process consumes a chunk of memory, you can only fit so many in RAM before the server begins to thrash violently, swapping things between RAM and the hard drive in a futile attempt to keep up. The result is a totally unresponsive server; by setting MaxClients too high, you will have caused your own demise. I will talk about how to figure out a good value for MaxClients in a future post, but a good rule of thumb might be to divide your total RAM by 5 megabytes. Thus, a server with 512 megabytes of RAM could probably handle a MaxClients setting of 100. This is probably a somewhat conservative estimate, but it should give you a starting point.
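As a concrete sketch of that rule of thumb (the numbers here are illustrative, not measured): a server with 1 gigabyte of RAM and workers averaging 5 megabytes apiece works out to 1024 / 5, or roughly 200:

    # 1024 MB of RAM / ~5 MB per worker process = ~200
    MaxClients 200

Do measure your own workers' actual memory usage before settling on a number; a heavy mod_perl or PHP setup can easily use two or three times that much per process.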

A partial solution
So how do you fix the problem, other than by adding many gigabytes of RAM to the server? One easy way to get around this limitation is to either reduce the KeepAliveTimeout to a mere second or two, or else to simply turn KeepAlive off completely. I have found that turning it down to 2 seconds seems to give the client enough time to request all of the files needed for a page without having to open multiple connections, yet allows Apache to terminate the connection soon enough to be able to handle many more clients than usual.
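In httpd.conf, that tweak is a one-line change (2 seconds is simply the value that has worked for me; tune it to your own traffic):

    KeepAlive On
    KeepAliveTimeout 2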

One interesting thing to take note of is what the major Apache-based web sites allow in terms of keep-alive timeouts. In my (very brief) experiments, it seems that CNN, Yahoo, craigslist, and Slashdot don't permit keep-alives at all, while the BBC has a very short keep-alive timeout of under 5 seconds. On the other hand, several other major Apache-based sites do use a large keep-alive timeout (Apple, CNET, etc.), but they may have decided to take the hit in server capacity so that their sites feel as "fast" as possible to users.
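Incidentally, you can run this experiment yourself: make a request to a site and look at the response headers. When keep-alives are enabled, Apache advertises its settings in a Keep-Alive header, e.g.:

    HTTP/1.1 200 OK
    Keep-Alive: timeout=15, max=100
    Connection: Keep-Alive

A "Connection: close" in the response instead means the server won't hold the connection open at all.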

Of course, this isn't a perfect solution. It would be nice to have both high performance and long-lived client connections. Apache 2.2, from what I understand, includes an experimental MPM (the "event" MPM) that hands idle keep-alive connections off to a dedicated thread, so that workers aren't tied up waiting on them. If it turns out to work well, it could be a near-perfect solution to the problem. It does have its drawbacks (it is a threaded MPM, and threading is not recommended if you use PHP), but it could be incredibly useful in some situations.