Sunday, April 23, 2006

Under the covers of a memory access

When someone writes a computer program, they use many statements that look something like A=B+1. This simple operation retrieves a piece of data from memory location B, adds one to it, and then stores the result in memory location A. Statements of this form are incredibly common in computer programs; they probably make up the majority of the operations performed by programs. With something so simple and common, you would think that it would be a trivial matter for the computer to carry it out. This is not the case at all. In this posting, I will give a very high-level introduction to some of the steps involved in just retrieving a single piece of data from memory. I have made a number of simplifications (such as placing the memory address translation before any of the caches), but hopefully it is detailed enough to give people a taste of the complexities involved.

The first step is to translate the memory address used by the program into a different address, called a physical address. Most modern computers use a special piece of hardware called a memory-management unit (MMU) to allow each process on a system to pretend that it is the only thing running. It does this by having the process use a large set of virtual ("fake") addresses which are then translated by the MMU into physical ("real") addresses on demand. Of course, this can be a slow process; the computer must have a set of translation tables for each process that map from virtual to physical memory addresses, and it must consult this table for every single memory access. These tables are stored in the computer's memory, and thus each memory access potentially turns into TWO memory accesses. Fortunately, a high-speed caching mechanism called the translation lookaside buffer (TLB) is used to speed up translation table lookups, which makes the virtual-to-physical translation process fast enough to be feasible.
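To make the translation step a little more concrete, here is a sketch of the very first thing the MMU does: splitting a virtual address into a page number (which gets translated) and an offset within the page (which is carried through unchanged). The address and the 4 KB page size below are just illustrative assumptions, not properties of any particular machine.

```shell
# Hypothetical example: split a virtual address into page number and
# offset, assuming 4 KB pages. The MMU does this split in hardware
# before consulting the translation tables (or the TLB).
vaddr=$(( 0x08052ABC ))   # made-up virtual address
page_size=4096

page_number=$(( vaddr / page_size ))
offset=$(( vaddr % page_size ))

printf 'page number: 0x%X, offset: 0x%X\n' "$page_number" "$offset"
# prints: page number: 0x8052, offset: 0xABC
```

Only the page number needs to be looked up in the translation tables; with 4 KB pages, one table entry covers 4096 consecutive addresses, which is also what makes caching the lookups in the TLB so effective.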

Once the computer has the physical address of the desired item in memory, it then proceeds to look in a set of high-speed caches for it. Although the RAM in a computer is incredibly fast when compared to something like a hard drive, it is actually somewhat slow when compared to the speed of a CPU. Thus, memory caches exist between the processor and memory that are small but fast. Whenever a piece of data is retrieved from RAM, it is subsequently placed into one or more of these caches so that future references to it are much quicker. A lot of effort has been put into making these various caches as fast and efficient as possible, as they can have a dramatic effect on the performance of a computer. In many modern computers, there are two levels of caches: L1, which is very small but also very fast, and L2, which is somewhat larger but also somewhat slower. Some processors even have a third cache (L3) that is even larger than the previous two. Each of these caches must be checked in turn to see if one of them contains the requested data.
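The payoff of all this caching can be sketched with some back-of-the-envelope arithmetic. The latencies and hit rate below are made-up round numbers, not measurements from any real processor:

```shell
# Effective access time = hit_rate * cache latency + miss_rate * memory latency
# (all numbers below are hypothetical round figures)
cache_ns=1      # assumed cache hit latency, in nanoseconds
memory_ns=100   # assumed main-memory latency, in nanoseconds
hit_rate=95     # assumed cache hit rate, in percent

effective=$(( (hit_rate * cache_ns + (100 - hit_rate) * memory_ns) / 100 ))
echo "effective access time: ${effective} ns"
# prints: effective access time: 5 ns
```

Even with memory a hundred times slower than the cache, a 95% hit rate brings the average access down to a few nanoseconds, which is why so much engineering effort goes into these caches.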

If the requested data is not in any of the caches, then the computer is forced to actually retrieve it from main memory. This is a somewhat lengthy process and is one of the big obstacles in speeding up computers; in the time that it takes to retrieve a piece of data from memory, a processor could possibly execute hundreds or thousands of instructions. Unfortunately, the processor many times must simply sit and twiddle its thumbs while the memory system services the request.

There is another complication, though. Sometimes the operating system on the computer will tell a little white lie to the running processes by letting them believe that something is in RAM when it really is on the hard drive. This can happen in a few different circumstances, the most well-known of which is when the operating system "swaps" something out from memory to the drive. This works rather well when a block of data doesn't need to be used soon, but it does complicate memory accesses because the operating system needs to dynamically and transparently move things in and out of memory. Thus, if the requested piece of data has been swapped out to the hard drive, the memory system will essentially wake up the operating system and tell it to put the data back into memory. This process is incredibly time consuming and involves thousands upon thousands of additional memory accesses, as the operating system must determine where the data is located on the hard drive and communicate with the disk to load it in.

Once the requested piece of data has been found, the memory system returns the data to the processor (placing it into one or more of the caches along the way). After all is said and done, that single memory access has potentially caused numerous additional memory accesses, instruction executions, and even hard disk accesses. It is easy to see why computer and operating system designers work very hard to mitigate the costs associated with memory accesses by taking full advantage of high-speed caching mechanisms and advanced algorithms to predict program behavior. Even compilers use special tricks to try to ameliorate the situation, by doing things like reordering instructions and making the most efficient use of low-level processor resources.

This overview of a computer's memory system has been very brief, and thus is somewhat incorrect in its details. For further information, I would recommend that people read a good computer architecture book (such as Computer Organization and Design) as well as an introductory operating systems book (Operating System Concepts covers many of these details). Wikipedia also has extensive coverage of many low-level computer architecture details like this.

Saturday, February 04, 2006

Understanding memory usage on Linux

This entry is for those people who have ever wondered, "Why the hell is a simple KDE text editor taking up 25 megabytes of memory?" Many people are led to believe that many Linux applications, especially KDE or Gnome programs, are "bloated" based solely upon what tools like ps report. While any given program may or may not be bloated, ps alone can't tell you; many programs are much more memory efficient than they seem.

What ps reports
The ps tool can output various pieces of information about a process, such as its process id, current running state, and resource utilization. Two of the possible outputs are VSZ and RSS, which stand for "virtual size" and "resident set size" and are commonly used by geeks around the world to see how much memory processes are taking up.

For example, here is the output of ps aux for KEdit on my computer:

USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
dbunker 3468 0.0 2.7 25400 14452 ? S 20:19 0:00 kdeinit: kedit

According to ps, KEdit has a virtual size of about 25 megabytes and a resident size of about 14 megabytes (both numbers above are reported in kilobytes). It seems that most people like to randomly choose to accept one number or the other as representing the real memory usage of a process. I'm not going to explain the difference between VSZ and RSS right now but, needless to say, this is the wrong approach; neither number is an accurate picture of what the memory cost of running KEdit is.
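You can pull these two numbers for any process yourself. For example, here they are for your current shell (the exact values will vary from machine to machine, but VSZ should always be at least as large as RSS, since the resident pages are a subset of the virtual address space):

```shell
# Print the VSZ and RSS of the current shell, in kilobytes.
read -r vsz rss <<EOF
$(ps -o vsz= -o rss= -p $$)
EOF
echo "VSZ=${vsz}K RSS=${rss}K"
[ "$vsz" -ge "$rss" ] && echo "VSZ >= RSS, as expected"
```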

Why ps is "wrong"
Depending on how you look at it, ps is not reporting the real memory usage of processes. What it is really doing is showing how much real memory each process would take up if it were the only process running. Of course, a typical Linux machine has several dozen processes running at any given time, which means that the VSZ and RSS numbers reported by ps are almost definitely "wrong". In order to understand why, it is necessary to learn how Linux handles shared libraries in programs.

Most major programs on Linux use shared libraries to facilitate certain functionality. For example, a KDE text editing program will use several KDE shared libraries (to allow for interaction with other KDE components), several X libraries (to allow it to display images and copy and pasting), and several general system libraries (to allow it to perform basic operations). Many of these shared libraries, especially commonly used ones like libc, are used by many of the programs running on a Linux system. Due to this sharing, Linux is able to use a great trick: it will load a single copy of the shared libraries into memory and use that one copy for every program that references it.

For better or worse, many tools don't care very much about this very common trick; they simply report how much memory a process uses, regardless of whether that memory is shared with other processes as well. Two programs could therefore use a large shared library and yet have its size count towards both of their memory usage totals; the library is being double-counted, which can be very misleading if you don't know what is going on.

Unfortunately, a perfect representation of process memory usage isn't easy to obtain. Not only do you need to understand how the system really works, but you need to decide how you want to deal with some hard questions. Should a shared library that is only needed for one process be counted in that process's memory usage? If a shared library is used by multiple processes, should its memory usage be evenly distributed among the different processes, or just ignored? There isn't a hard and fast rule here; you might have different answers depending on the situation you're facing. It's easy to see why ps doesn't try harder to report "correct" memory usage totals, given the ambiguity.

Seeing a process's memory map
Enough talk; let's see what the situation is with that "huge" KEdit process. To see what KEdit's memory looks like, we'll use the pmap program (with the -d flag):

Address Kbytes Mode Offset Device Mapping
08048000 40 r-x-- 0000000000000000 0fe:00000 kdeinit
08052000 4 rw--- 0000000000009000 0fe:00000 kdeinit
08053000 1164 rw--- 0000000008053000 000:00000 [ anon ]
40000000 84 r-x-- 0000000000000000 0fe:00000 ld-2.3.5.so
40015000 8 rw--- 0000000000014000 0fe:00000 ld-2.3.5.so
40017000 4 rw--- 0000000040017000 000:00000 [ anon ]
40018000 4 r-x-- 0000000000000000 0fe:00000 kedit.so
40019000 4 rw--- 0000000000000000 0fe:00000 kedit.so
40027000 252 r-x-- 0000000000000000 0fe:00000 libkparts.so.2.1.0
40066000 20 rw--- 000000000003e000 0fe:00000 libkparts.so.2.1.0
4006b000 3108 r-x-- 0000000000000000 0fe:00000 libkio.so.4.2.0
40374000 116 rw--- 0000000000309000 0fe:00000 libkio.so.4.2.0
40391000 8 rw--- 0000000040391000 000:00000 [ anon ]
40393000 2644 r-x-- 0000000000000000 0fe:00000 libkdeui.so.4.2.0
40628000 164 rw--- 0000000000295000 0fe:00000 libkdeui.so.4.2.0
40651000 4 rw--- 0000000040651000 000:00000 [ anon ]
40652000 100 r-x-- 0000000000000000 0fe:00000 libkdesu.so.4.2.0
4066b000 4 rw--- 0000000000019000 0fe:00000 libkdesu.so.4.2.0
4066c000 68 r-x-- 0000000000000000 0fe:00000 libkwalletclient.so.1.0.0
4067d000 4 rw--- 0000000000011000 0fe:00000 libkwalletclient.so.1.0.0
4067e000 4 rw--- 000000004067e000 000:00000 [ anon ]
4067f000 2148 r-x-- 0000000000000000 0fe:00000 libkdecore.so.4.2.0
40898000 64 rw--- 0000000000219000 0fe:00000 libkdecore.so.4.2.0
408a8000 8 rw--- 00000000408a8000 000:00000 [ anon ]
... (trimmed) ...
mapped: 25404K writeable/private: 2432K shared: 0K

I cut out a lot of the output; the rest is similar to what is shown. Even without the complete output, we can see some very interesting things. One important thing to note is that each shared library is listed twice: once for its code segment and once for its data segment. The code segments have a mode of "r-x--", while the data segments are set to "rw---". The Kbytes, Mode, and Mapping columns are the only ones we will care about; the rest are unimportant to the discussion.

If you go through the output, you will find that the lines with the largest Kbytes numbers are usually the code segments of the included shared libraries (the ones that start with "lib" are the shared libraries). Conveniently, those are exactly the parts that can be shared between processes. If you factor out all of the parts that are shared between processes, you end up with the "writeable/private" total, which is shown at the bottom of the output. This is what can be considered the incremental cost of this process, factoring out the shared libraries. Therefore, the cost to run this instance of KEdit (assuming that all of the shared libraries were already loaded) is around 2 megabytes. That is quite a different story from the 14 or 25 megabytes that ps reported.
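That "writeable/private" total isn't magic; it can be approximated from the per-mapping lines by summing the Kbytes of every writeable (rw) mapping. Here is a sketch using awk over a shortened, made-up excerpt in the same pmap -d format (real pmap also excludes shared writeable mappings, so this is only an approximation):

```shell
# Sum the Kbytes column ($2) of every mapping whose mode ($3) is
# writeable; the sample lines are an abbreviated pmap -d excerpt.
sample='08052000 4 rw--- 0000000000009000 0fe:00000 kdeinit
08053000 1164 rw--- 0000000008053000 000:00000 [ anon ]
40000000 84 r-x-- 0000000000000000 0fe:00000 ld-2.3.5.so
40015000 8 rw--- 0000000000014000 0fe:00000 ld-2.3.5.so'

echo "$sample" | awk '$3 ~ /^rw/ { total += $2 } END { print total "K writeable/private" }'
# prints: 1176K writeable/private
```

Note how the big r-x-- code segment (ld-2.3.5.so's 84K here) is left out of the total entirely, since other processes can share it.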

What does it all mean?
The moral of this story is that process memory usage on Linux is a complex matter; you can't just run ps and know what is going on. This is especially true when you deal with programs that create a lot of identical child processes, like Apache. ps might report that each Apache process uses 10 megabytes of memory, when the reality might be that the marginal cost of each Apache process is 1 megabyte of memory. This information becomes critical when tuning Apache's MaxClients setting, which determines how many simultaneous requests your server can handle (although see one of my past postings for another way of increasing Apache's performance).

It also shows that it pays to stick with one desktop's software as much as possible. If you run KDE for your desktop, but mostly use Gnome applications, then you are paying a large price for a lot of redundant (but different) shared libraries. By sticking to just KDE or just Gnome apps as much as possible, you reduce your overall memory usage due to the reduced marginal memory cost of running new KDE or Gnome applications, which allows Linux to use more memory for other interesting things (like the file cache, which speeds up file accesses immensely).

Monday, February 06, 2006

Re: memory usage on Linux

A lot of people made good points in the comments section of my last posting (Understanding memory usage on Linux). Here are some of the general ideas that were mentioned:

(1) Several comments noted that non-x86 hardware has a different approach to shared memory between processes. This is true; some architectures do not handle shared memory in the same way as x86. To be honest, I don't know which platforms those are, so I'm not going to even try to list them. Thus, my previous post should be taken with a big grain of salt if you're working on a non-x86 platform.

(2) Many people also noted that this shared library feature of Linux isn't some fancy new thing, which is completely true. Microsoft Windows platforms undoubtedly have the same basic sharing feature, just like any full-featured modern operating system. My post only addressed Linux because, to be honest, I'm a Linux-centric kind of person.

(3) Yes, I did commit the sin of using "it's" instead of "its". To all of the English majors in the audience, I offer my most sincere apology.

(4) A few comments mentioned the memory size of Firefox. I must admit that I began this article with Firefox instead of KEdit as the primary example, but I was forced to switch to KEdit when I saw how big Firefox's private/writeable size was; KEdit illustrated my point much better. :)

(5) If the word "marginal" that I used confused anyone, then feel free to just mentally replace it with the word "incremental".

Thanks to everyone that commented on the posting; part of my reason for writing it was to see what other people thought, as other people usually know more than I do about any given subject.

Thursday, February 09, 2006

Great computer books

There was a very good comment on my Understanding memory usage on Linux post a couple of days ago. Besides having some insightful points about memory usage, the poster made mention of Linux Kernel Development, a book by Robert Love on the Linux 2.6 kernel. I own this book and love it; I'm not really a kernel hacker, but I have found the information in the book invaluable when it comes to understanding how Linux ticks. I highly recommend it to anyone that wants to delve into the kernel.

Although Linux Kernel Development can be read without having too much theoretical operating systems background, I would still recommend that people also pick up a good general OS book. My preference is Operating System Concepts by Silberschatz, Galvin, and Gagne, but that may be because it's the one that I used for my college OS class. I've wanted to try out Andrew Tanenbaum's Modern Operating Systems, but it's a little too spendy for me. Plus, I'd feel like I couldn't read it around other Linux geeks, what with Tanenbaum's addiction to microkernels...

My other favorite topic, besides operating systems, is networking. In this field I have two all-time favorites, one of which is Tanenbaum's Computer Networks. Not only does he cover a huge range of networking topics, but he does it in the classic hacker way -- with humor. The other favorite of mine is Computer Networking: A Top-Down Approach by Kurose and Ross. It's more accessible than Tanenbaum's book and has a nice solid, professional feel to it. Both of these books are on the high side in terms of price, but are well worth it.

Disclaimer: Greg Gagne of Operating System Concepts was my advisor in college (although, interestingly enough, I didn't know about his book until I went to graduate school). Other than that, I do not have any personal or financial ties to any of these books.

Thursday, January 18, 2007

Oh no -- I've become unbalanced!

I recently discovered an interesting feature of the Linux kernel. It all started when I noticed that some new web servers at work were exhibiting a peculiar behavior -- the two CPU cores didn't seem to have balanced workloads. It seemed that usually the first core would have the majority of the load, while the second core would usually have much less (all witnessed via the mpstat -P ALL 10 command). Given that these servers were Apache web servers, I thought that they should typically be within a few percent of each other, but instead they were separated by anywhere from 5-25%.

After poking around a bit, I decided to tweak the server to try to force the kernel to balance better. These servers each had two Ethernet interfaces: one to a 'front channel' across which the HTTP interaction with clients was performed, and one to a 'back channel' across which some database and other calls were made. One of the early thoughts that I had was to dedicate each of the two cores to one of the two Ethernet interfaces; this way, each core would always service the interrupts from the same interface. The default setup on the 2.6 kernel is to have all interrupts serviced by the first core, so I thought that perhaps the data load on the two interfaces was forcing the first core to run a bit hotter than the other.

It was easy enough to make the change. I first checked to find the interrupt numbers for the two interfaces (ifconfig), then forced the second one to always use the second processor core (echo "2" >/proc/irq/177/smp_affinity). This achieved the desired effect of rebalancing the interrupts, but it didn't solve my original problem: the cores were still unbalanced. In fact, the problem became even more interesting -- the two cores had almost flipped their loads, with the second core having the consistently higher usage.
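The value written to smp_affinity is a hexadecimal CPU bitmask: bit N set means core N is allowed to service that interrupt. A tiny helper makes the "2" above a little less magic (the IRQ number 177 is specific to my machine):

```shell
# Compute the smp_affinity mask that pins an interrupt to one core.
core=1                            # the second core, counting from zero
mask=$(printf '%x' $(( 1 << core )))
echo "$mask"
# prints: 2  -- i.e. the value written with: echo "2" >/proc/irq/177/smp_affinity
```

A mask of 1 would pin the interrupt to the first core, and 3 (binary 11) would allow either core to service it.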

At first I thought that perhaps the second interface might be transferring more data; that was wrong (ifconfig). Next I thought that, even with less total data transferred, perhaps the second interface was triggering more interrupts. That, too, was incorrect (cat /proc/interrupts). Finally, I checked to make sure that both interfaces were using the same driver, which they were (dmesg | grep eth). I had just about given up when I finally decided to put some logging into every web page (using PHP) to see what processor it used. Imagine my surprise when I discovered that nearly every web process started off on the lower-usage core, but then almost immediately migrated to the higher-usage core!

I had uncovered the superficial culprit: when a user requested a web page, one of the first things that the server would do is get a copy of their current session data from another server through the second interface. The moment that the second interface was used, the web process would flip over to the second core (or whatever core was assigned to service that interface's interrupts).

A little more research into the kernel code revealed the underlying cause (caveat: I'm not a kernel hacker, so the details are still a bit fuzzy). When a network interface receives some data, Linux will find the process that is sleeping while waiting for that data and wake it up. When waking up the process, Linux will try to keep it on the same processor core that serviced the interrupt. I imagine that it does this in an effort to better utilize the core's memory cache, which should speed things up. However, it also appears to have a slight unbalancing effect across the cores, which is exactly what I was seeing on my servers.

Of course, there are ways to fix this "problem". The easiest is to simply remove the flags in the kernel code that trigger this behavior (you'll find them in include/linux/topology.h; just remove any lines that say SD_WAKE_AFFINE), and then recompile the kernel. I've discovered, though, that this change doesn't seem to improve things like I thought it might. It did seem to balance out the load between the two cores, but the average load across the cores is actually slightly higher (1-2%, usually). This would make some sense; like I said before, the kernel is probably trying to make more efficient use of each core's cache, which apparently has a greater positive effect than having a more equal load distribution.

In the end, I suppose I can live with unbalanced CPU cores. I just wish that this behavior had been a bit better documented -- it was a rather painful process to have to figure it out myself. I suppose it could have been worse, though: I could be using closed-source Windows. ;)

Monday, February 06, 2006

Hello world!

After my recent Understanding memory usage on Linux post was linked from some large web sites (Slashdot, Digg, and del.icio.us so far), I thought it would be fun to create a map that showed where the resulting ~60,000 hits came from.

The map displays a dot for each hit that I received over the last few days. This isn't an exact science, as I can only determine where everyone's ISPs are located, but I figure it is a pretty close estimate. I would love to create an hour-by-hour slideshow, but that will have to wait for another day.

So here's a "hello world" to all of my fellow geeks in Russia, Nigeria, Egypt, Bermuda, the middle of the Caspian Sea, Sydney, Sri Lanka, Argentina, Iceland, Italia, and everywhere else.

Tuesday, January 31, 2006

Tuning Apache, part 1

There was a link on Digg a couple of days ago to an article about how to tune Apache so as to survive a Slashdotting. After reading it through, I came to the conclusion that the author had no idea what he was talking about. Not only did he admit that he had never experienced the "Slashdot Effect", but his advice was just plain wrong. I offered a few comments there, but I figured that I should elaborate on a few of them here. I'll post each major configuration topic as a new blog entry, and today's entry is about HTTP's Keep-Alive feature.

A brief history of Keep-Alives
The original HTTP protocol did not allow keep-alives, which meant that a connection was made to the server for each file that needed to be downloaded. This was a very inefficient method of doing things, especially since web pages typically had several files that needed to be downloaded in order to be properly displayed. Why was it inefficient? For two reasons:
  1. Each connection requires an overhead of at least 3 packets to be initiated (the SYN, SYN-ACK, and ACK packets). This means that at least one full round trip is required before any data can flow, which obviously slowed things down.
  2. Due to the nature of TCP, which underlies HTTP, a connection gets "faster" the longer it is open. By continuously opening and closing new connections, HTTP would never be able to fully utilize its available bandwidth.
The designers of HTTP realized this weakness in the protocol, and took steps to correct it in the next version of HTTP. This new version incorporated the concept of keep-alives, whereby a client could keep a connection to the web server open indefinitely, or at least as long as the server permitted. Although this somewhat went against HTTP's original design goal of being "stateless", it allowed HTTP to overcome its speed and overhead problems.
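Some rough arithmetic shows how much connection overhead keep-alives save. The numbers below are simplified assumptions (a 3-packet handshake plus roughly 4 teardown packets per connection, and a hypothetical page made of 10 files):

```shell
files=10                          # assumed: 1 HTML file plus 9 supporting files
per_conn_overhead=$(( 3 + 4 ))    # SYN/SYN-ACK/ACK plus a ~4-packet teardown

without_keepalive=$(( files * per_conn_overhead ))
with_keepalive=$per_conn_overhead # one connection serves all ten requests

echo "overhead packets without keep-alive: $without_keepalive"
echo "overhead packets with keep-alive:    $with_keepalive"
# prints 70 packets of overhead without keep-alive versus 7 with it
```

And that is before counting the round-trip latency of each handshake or the TCP slow-start penalty paid by every fresh connection.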

A brief introduction to Apache
Now let's examine how Apache works. When you start Apache, a main "coordinator" process is created. This main process is responsible for accepting incoming connections and passing them off to "worker" processes that it creates. These workers then read users' requests and send back responses. Once a worker is done servicing a user's requests, it reports back to the main process and then waits for a new connection to be handed to it.

Apache and Keep-Alives
So, in theory, keep-alives are a great thing. They allow web clients and servers to fully utilize their available bandwidth, and they reduce latency by eliminating the overhead of frequently opening new connections. In a perfect world, you would want Apache's KeepAliveTimeout setting to be "infinity", so that web clients maintain a connection to the web server for as long as possible and everything on your web site loads as fast as possible.

Apache allows you to configure its behavior in regard to keep-alives through a few options in its configuration file:
  • KeepAlive: either On or Off, depending on whether Apache should allow connections to be used for multiple requests
  • KeepAliveTimeout: how long, in seconds, Apache will wait after a request has been answered for another request before closing the connection
  • MaxKeepAliveRequests: how many total requests a client can issue across a single connection
  • MaxClients: the total number of worker processes that Apache will allow at any given time
The default Apache configuration file sets KeepAlive to be on, with a KeepAliveTimeout of 15 seconds and MaxKeepAliveRequests of 100. The MaxClients setting is set to 150.
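For reference, those defaults look like this in httpd.conf (the directive names are real Apache directives; the values are the stock defaults cited above):

```shell
# Emit the default keep-alive-related settings in httpd.conf syntax.
cat <<'EOF'
KeepAlive On
KeepAliveTimeout 15
MaxKeepAliveRequests 100
MaxClients 150
EOF
```

It is the interaction between KeepAliveTimeout and MaxClients that causes the trouble described next.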

Apache meets its match
Unfortunately, nothing in life is free, not even keep-alives. Each client connection requires Apache to create a worker process (or use a waiting one) to service its requests. These worker processes can only handle one connection at a time, and with the default settings each connection is held open for up to 15 idle seconds after every request. Apache will create a new worker process for each new connection until it hits its MaxClients limit of 150. Thus, the cost of a keep-alive is one worker process tied up for the duration of the KeepAliveTimeout.

Now imagine what happens when 1,000 web clients try to access your web site at the same moment (e.g. when it first shows up on Slashdot). The first 150 clients will successfully connect to your web server, because Apache will create workers to service their requests. However, those web clients do not immediately leave; after they've downloaded your page, they will hold open their connections for 15 seconds until your server forces their connection to close. The next 850 clients will be unable to access the web server, as all of the available Apache worker processes will be used up, waiting for 15 seconds on the unused connections to the first 150 clients. Some of those 850 clients will queue up and wait for an available Apache process to service their request, but most will give up.
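In the worst case, the scenario above reduces to very simple arithmetic: with every worker pinned to an idle keep-alive connection, the server can only admit new clients at a rate of MaxClients divided by KeepAliveTimeout:

```shell
max_clients=150
keepalive_timeout=15   # seconds each worker idles on a kept-alive connection
rate=$(( max_clients / keepalive_timeout ))
echo "worst-case new clients per second: $rate"
# prints: worst-case new clients per second: 10
```

Ten new clients per second is nowhere near enough to absorb a Slashdotting, even though each of those clients needs only a fraction of a second of actual work.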

Perhaps some readers are wondering why you wouldn't just increase the MaxClients setting to something high enough to handle your peak load, like 2000 or something. This is a very bad idea; you can increase Apache's MaxClients, but only at your own peril. Because each Apache process consumes a bit of memory, you can only fit a certain number in memory before the web server begins to violently thrash, swapping things between RAM and the hard drive in a futile attempt to make it work. The result is a totally unresponsive server; by increasing MaxClients too high, you will have caused your own demise. I will talk about how to figure out a good value for MaxClients in a future post, but a good rule of thumb might be to divide your total RAM by 5 megabytes. Thus, a server with 512 megabytes of RAM could probably handle a MaxClients setting of 100. This is probably a somewhat conservative estimate, but it should give you a starting point.
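The rule of thumb from the previous paragraph, written out (the 5 megabytes per worker is an assumed, conservative figure, not a measurement):

```shell
ram_mb=512       # total RAM on the hypothetical server
per_worker_mb=5  # assumed memory cost per Apache worker process
echo "suggested MaxClients: $(( ram_mb / per_worker_mb ))"
# prints: suggested MaxClients: 102 (i.e. roughly the 100 mentioned above)
```

In practice you would also subtract the memory needed by the OS and any database before dividing, which pushes the safe number lower still.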

A partial solution
So how do you fix the problem, other than by adding many gigabytes of RAM to the server? One easy way to get around this limitation is to either reduce the KeepAliveTimeout to a mere second or two, or else to simply turn KeepAlive off completely. I have found that turning it down to 2 seconds seems to give the client enough time to request all of the files needed for a page without having to open multiple connections, yet allows Apache to terminate the connection soon enough to be able to handle many more clients than usual.

One interesting thing to take note of is what the major Apache-based web sites allow in terms of keep-alive timeouts. In my (very brief) experiments, it seems that CNN, Yahoo, craigslist, and Slashdot don't permit keep-alives at all, while the BBC has a very short keep-alive timeout of under 5 seconds. On the other hand, several other major Apache-based sites do use a large keep-alive timeout (Apple, CNET, etc...); they may have decided that they would prefer to take the performance hit so that they can have the fastest web sites possible.

Of course, this isn't a perfect solution. It would be nice to have both high performance and long-lived client connections. Apache 2.2, from what I understand, includes an experimental module that allows keep-alives to be handled very efficiently. If it turns out to work well, then it could be a near-perfect solution to the problem. It does have its limitations (it seems to require a threaded MPM, which is not recommended if you use PHP), but it could be incredibly useful in some situations.