Tuesday, February 21, 2006

Beware Ethernet flow control

After recently being bitten by Ethernet's flow control mechanism, I decided to learn about this somewhat obscure but commonly used feature of modern networks. This post is a summary of what I discovered, along with the mechanism's benefits and dangers.

What is flow control?
Ethernet flow control, or 802.3x, is a way for a network device to tell its immediate neighbor that it is overloaded with data, such as when it is receiving data faster than it can process it. It allows an overloaded device to send out a special Ethernet frame, called a pause frame, that asks the device on the other end of the wire to stop sending temporarily. If the sender honors the pause frame, the overloaded device gets time to catch up on the backlog of received data that it hasn't had a chance to process yet.
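For the curious, here's a rough Python sketch of what a pause frame contains (the source MAC below is made up): the destination is the reserved MAC Control multicast address 01-80-C2-00-00-01, the EtherType is 0x8808, the opcode is 0x0001, and the pause time is expressed in "quanta" of 512 bit times, with 0xFFFF being the longest pause a device can request.

# Sketch of an 802.3x pause frame; the source MAC is a placeholder.
import struct

def build_pause_frame(src_mac, pause_quanta):
    dst_mac = bytes.fromhex("0180c2000001")   # reserved MAC Control multicast address
    header = struct.pack("!HHH", 0x8808, 0x0001, pause_quanta)  # EtherType, PAUSE opcode, pause time
    frame = dst_mac + src_mac + header
    return frame.ljust(60, b"\x00")           # pad to the 60-byte minimum (FCS not included)

frame = build_pause_frame(bytes.fromhex("001122334455"), 0xffff)
print(frame.hex())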

There is also an older flow control method called "back pressure" that is used in half-duplex environments (i.e. shared, hub-based Ethernet rather than switched, full-duplex links). It consists of the overloaded device "jamming" the medium temporarily until it is able to accept more data. I don't know much about half-duplex flow control, so I won't mention it again; everything here applies solely to full-duplex flow control via 802.3x. Also, TCP has its own flow control mechanism that is entirely different from Ethernet's; I won't fully explain TCP's approach here, as it would merit a lengthy discussion of its own.

Rules of the game
When thinking about Ethernet flow control, it is important to keep several things in mind:
  1. Flow control operates at a lower layer than TCP or IP, and thus is independent of them. Put another way, flow control works regardless of which higher-level protocols run on top of it. An important side effect of this is that neither TCP nor IP knows what Ethernet's flow control is doing; they operate under the assumption that there is no flow control other than what they may or may not provide themselves.
  2. Flow control functions between two directly connected network devices, and flow control frames are never forwarded between links. Thus, two computers that are connected via a switch will never send pause frames to each other, but could send pause frames to the switch itself (and vice versa: the switch can send pause frames to the two computers).
  3. Pause frames have a limited duration; they will automatically "expire" after a certain amount of time. The expiration time is set by the device that transmits the pause frame.
  4. A paused link does not discriminate between protocols; until the pause expires, nothing crosses the link other than more pause frames.
Perhaps you have begun to see some issues with flow control in light of some of the above points. Let's start looking at them.

TCP breakage
Okay, the heading is an exaggeration: TCP doesn't stop working when flow control is enabled. However, an important part of it does stop working correctly: its own congestion control (often loosely called TCP flow control). TCP uses a more complex mechanism of timeouts and acknowledgement segments to figure out when the network or the remote device is overloaded. It basically sends at a faster and faster pace until it sees that some of its sent data isn't reaching the remote device, and then it slows down. This lets TCP utilize network links in a somewhat intelligent manner, since an overloaded network or device will cause some TCP segments to be lost, which in turn causes the sender to slow down.
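As a rough illustration, here is a toy model of that loss-driven ramp-up (this is not real TCP, just the basic idea; all numbers are made up):

# Toy model: keep sending faster until data is lost, then back off.
link_capacity = 100     # how much the path can deliver per round trip (made-up units)
rate = 1

for rtt in range(10):
    lost = rate > link_capacity      # an overloaded path drops the excess
    print(f"rtt {rtt}: sending at {rate}, loss seen: {lost}")
    rate = max(1, rate // 2) if lost else rate * 2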

Now consider what happens when Ethernet flow control is mixed with TCP flow control. Let's assume that we have two directly connected computers, one of which is much slower than the other. The faster sending computer starts sending lots of data to the slower receiving computer. The receiver eventually notices that it is getting overloaded with data and sends a pause frame to the sender. The sender sees the pause frame and stops sending temporarily. Once the pause frame expires, the sender will resume sending its flood of data to the other computer. Unfortunately, the TCP engine on the sender will not recognize that the receiver is overloaded, as there was no lost data -- the receiver will typically stop the sender before it loses any data. Thus, the sender will continue to speed up at an exponential rate; because it didn't see any lost data, it will send data twice as fast as before! Because the receiver has a permanent speed disadvantage, this will require the receiver to send out pause frames twice as often. Things start snowballing until the receiver pauses the sender so often that the sender starts dropping its own data before it sends it, and thus finally sees some data being lost and slows down.
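The same toy model with a pausing receiver in the middle shows why TCP stays oblivious for so long: nothing is dropped, so the sender keeps doubling until its own transmit queue overflows. Again, every number here is made up.

# Toy model with Ethernet flow control in the way: the receiver pauses the
# sender instead of dropping data, so TCP sees no loss until the sender's
# own transmit queue overflows.
receiver_capacity = 10   # what the slow receiver can absorb per round trip
tx_queue_limit = 256     # sender's transmit queue size (same made-up units)
rate, tx_queue = 1, 0

for rtt in range(12):
    tx_queue += max(0, rate - receiver_capacity)   # paused traffic piles up locally
    if tx_queue > tx_queue_limit:
        print(f"rtt {rtt}: local queue overflows at rate {rate}; TCP finally backs off")
        rate, tx_queue = max(1, rate // 2), 0
    else:
        rate *= 2                                  # no loss visible, keep doubling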

Is this a problem? In some ways it isn't. Because TCP is a reliable protocol, nothing is ever really "lost"; it is simply retransmitted and life goes on. Ethernet flow control accomplishes the same thing as TCP flow control in this situation, as they both slow down the data transmission to the speed that the slower device can handle. There are some arguments to be made for there being an awkward overlap between the two flow control mechanisms, but it could be worse.

Unfortunately, it does get worse.

Head-of-line blocking
In the last example, I considered the case where two computers were directly connected to each other. This example is too simplistic to be of much use -- when was the last time you saw two directly connected computers? It is a bit of a rarity. Let's now look at what happens when you introduce a switch into the mix. For our purposes, let us assume that the switch fully supports Ethernet flow control and that it is willing to use it. Our new setup will consist of two desktop computers and one file server, all of which are attached to the switch. It isn't any fun to make everything perfect, so let's also say that one of the desktops has a 10 Mbps connection to the switch while the other desktop and the server have 100 Mbps connections.

This setup is usually fine -- the 10 Mbps connection will be slower than the others, but it doesn't cause too many problems, just slower service to the one desktop. Things can get ugly, though, if Ethernet flow control is enabled on the switch. Imagine that the 10 Mbps desktop requests a large file from the file server. The file server begins sending the file to the desktop slowly at first, but quickly picks up steam. Eventually it will be sending data to the desktop at 11 Mbps, which is more than the poor 10 Mbps connection can handle. Without flow control enabled on the switch, the switch would simply start dropping frames destined for the desktop, which the file server's TCP would notice, causing it to throttle back its sending rate.

With flow control enabled on the switch, though, the switch takes a very different approach: it sends out its own pause frames to any port that is sending data to the now-overloaded 10 Mbps port. This means that the file server will receive a pause frame from the switch, requesting it to cease all transmissions for a certain amount of time. Is this a problem? Yes! Because a pause halts all traffic on the link, any other data that the file server is sending will be held up as well, including data destined for the 100 Mbps desktop computer. Eventually the pause will expire and the file server will continue sending out data. Unfortunately, TCP on the file server will not know that anything is wrong and will continue sending data at faster and faster speeds, overloading the 10 Mbps desktop again. As before, the cycle repeats until the file server starts dropping its own data. Unlike the previous situation, though, the innocent 100 Mbps bystander is penalized as well and will see its transfers from the file server drop to 10 Mbps speeds.

This situation is called head-of-line blocking, and it is the major reason why Ethernet flow control is somewhat dangerous to use. When enabled on network switches, it can create situations where one slow link in a network can bring the rest of the network to a crawl. It gets especially bad if the backbones in your network have flow control enabled; it should be obvious by this point just how bad that could get.

When to enable flow control
So what should you do? Should you completely disable flow control on all computers and switches? Not necessarily. It is generally safe to leave flow control enabled on computers. Switches, though, should either have flow control disabled or configured such that they will honor received pause frames but will never send out new pause frames. Some Cisco switches are even permanently configured this way -- they can receive pause frames but never emit them. To be honest, the complete answer to flow control is somewhat more complicated than this (e.g. you could probably enable pause frame emission if a switch port is connected to a slow backplane), but the safest bet is to disable flow control when given the option.
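As an aside, if you want to see what your Linux machines are doing, many NIC drivers expose their pause settings through ethtool. Here's a rough example; the interface name is an assumption, and whether the settings can actually be changed depends on the driver.

# Check (and optionally change) a Linux host's pause-frame settings.
import subprocess

iface = "eth0"                                   # assumed interface name
print(subprocess.run(["ethtool", "-a", iface],
                     capture_output=True, text=True).stdout)

# To keep honoring received pause frames but never emit any, something like
# this should work, depending on the NIC driver (root required):
# subprocess.run(["ethtool", "-A", iface, "rx", "on", "tx", "off"], check=True)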

Thursday, February 09, 2006

Great computer books

There was a very good comment on my Understanding memory usage on Linux post a couple of days ago. Besides having some insightful points about memory usage, the poster made mention of Linux Kernel Development, a book by Robert Love on the Linux 2.6 kernel. I own this book and love it; I'm not really a kernel hacker, but I have found the information in the book invaluable when it comes to understanding how Linux ticks. I highly recommend it to anyone that wants to delve into the kernel.

Although Linux Kernel Development can be read without having too much theoretical operating systems background, I would still recommend that people also pick up a good general OS book. My preference is Operating System Concepts by Silberschatz, Galvin, and Gagne, but that may be because it's the one that I used for my college OS class. I've wanted to try out Andrew Tanenbaum's Modern Operating Systems, but it's a little too spendy for me. Plus, I'd feel like I couldn't read it around other Linux geeks, what with Tanenbaum's addiction to microkernels...

My other favorite topic, besides operating systems, is networking. In this field I have two all-time favorites, one of which is Tanenbaum's Computer Networks. Not only does he cover a huge range of networking topics, but he does it in the classic hacker way -- with humor. The other favorite of mine is Computer Networking: A Top-Down Approach by Kurose and Ross. It's more accessible than Tanenbaum's book and has a nice solid, professional feel to it. Both of these books are on the high side in terms of price, but are well worth it.

Disclaimer: Greg Gagne of Operating System Concepts was my advisor in college (although, interestingly enough, I didn't know about his book until I went to graduate school). Other than that, I do not have any personal or financial ties to any of these books.

Monday, February 06, 2006

Hello world!

After my recent Understanding memory usage on Linux post was linked from some large web sites (Slashdot, Digg, and del.icio.us so far), I thought it would be fun to create a map that showed where the resulting ~60,000 hits came from.

The map displays a dot for each hit that I received over the last few days. This isn't an exact science, as I can only determine where everyone's ISPs are located, but I figure it is a pretty close estimate. I would love to create an hour-by-hour slideshow, but that will have to wait for another day.

So here's a "hello world" to all of my fellow geeks in Russia, Nigeria, Egypt, Bermuda, the middle of the Caspian Sea, Sydney, Sri Lanka, Argentina, Iceland, Italia, and everywhere else.

Re: memory usage on Linux

A lot of people made good points in the comments section of my last posting (Understanding memory usage on Linux). Here are some of the general ideas that were mentioned:

(1) Several comments noted that non-x86 hardware has a different approach to shared memory between processes. This is true; some architectures do not handle shared memory in the same way as x86. To be honest, I don't know which platforms those are, so I'm not going to even try to list them. Thus, my previous post should be taken with a big grain of salt if you're working on a non-x86 platform.

(2) Many people also noted that this shared library feature of Linux isn't some fancy new thing, which is completely true. Microsoft Windows platforms undoubtedly have the same basic sharing feature, just like any full-featured modern operating system. My post only addressed Linux because, to be honest, I'm a Linux-centric kind of person.

(3) Yes, I did commit the sin of using "it's" instead of "its". To all of the English majors in the audience, I offer my most sincere apology.

(4) A few comments mentioned the memory size of Firefox. I must admit that I began this article with Firefox instead of KEdit as the primary example, but I was forced to switch to KEdit when I saw how big Firefox's private/writeable size was; KEdit illustrated my point much better. :)

(5) If the word "marginal" that I used confused anyone, then feel free to just mentally replace it with the word "incremental".

Thanks to everyone that commented on the posting; part of my reason for writing it was to see what other people thought, as other people usually know more than I do about any given subject.

Saturday, February 04, 2006

Understanding memory usage on Linux

This entry is for those people who have ever wondered, "Why the hell is a simple KDE text editor taking up 25 megabytes of memory?" Many people are led to believe that Linux applications, especially KDE or Gnome programs, are "bloated" based solely upon what tools like ps report. While some programs certainly are bloated, ps alone can't tell you that -- many programs are much more memory efficient than they seem.

What ps reports
The ps tool can output various pieces of information about a process, such as its process id, current running state, and resource utilization. Two of the possible outputs are VSZ and RSS, which stand for "virtual size" and "resident set size" and which geeks around the world commonly use to gauge how much memory processes are taking up.

For example, here is the output of ps aux for KEdit on my computer:

USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
dbunker 3468 0.0 2.7 25400 14452 ? S 20:19 0:00 kdeinit: kedit

According to ps, KEdit has a virtual size of about 25 megabytes and a resident size of about 14 megabytes (both numbers above are reported in kilobytes). Most people seem to pick one of these numbers more or less at random and accept it as the real memory usage of the process. I'm not going to explain the difference between VSZ and RSS right now, but needless to say, this is the wrong approach; neither number is an accurate picture of the memory cost of running KEdit.
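If you're curious where ps gets these numbers, you can read them straight out of /proc yourself. Here's a small Python sketch (assuming a Linux /proc filesystem) that prints the same two values for a given PID:

# Print VmSize and VmRSS (in kB) for a PID given on the command line,
# or for this script itself if no PID is given.
import sys

pid = sys.argv[1] if len(sys.argv) > 1 else "self"
with open(f"/proc/{pid}/status") as f:
    for line in f:
        if line.startswith(("VmSize", "VmRSS")):
            print(line.strip())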

Why ps is "wrong"
Depending on how you look at it, ps is not reporting the real memory usage of processes. What it is really doing is showing how much real memory each process would take up if it were the only process running. Of course, a typical Linux machine has several dozen processes running at any given time, which means that the VSZ and RSS numbers reported by ps are almost definitely "wrong". In order to understand why, it is necessary to learn how Linux handles shared libraries in programs.

Most major programs on Linux use shared libraries to provide common functionality. For example, a KDE text editor will use several KDE shared libraries (to allow it to interact with other KDE components), several X libraries (to allow it to draw its interface and handle copying and pasting), and several general system libraries (to allow it to perform basic operations). Many of these shared libraries, especially commonly used ones like libc, are used by many of the programs running on a Linux system. Because of this sharing, Linux can use a great trick: it loads a single copy of each shared library into memory and uses that one copy for every program that references it.
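To get a feel for how widely a single library gets reused, here's a rough Python sketch that counts how many running processes map libc. It only shows who references the library; the point of the trick is that the kernel keeps one copy of the read-only pages in physical memory for all of them.

# Count the running processes whose address space maps a libc shared object.
import glob

count = 0
for maps_file in glob.glob("/proc/[0-9]*/maps"):
    try:
        with open(maps_file) as f:
            if any("libc" in line for line in f):
                count += 1
    except OSError:
        pass    # the process exited, or we lack permission to read it
print(f"{count} running processes map libc")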

For better or worse, many tools don't take this very common trick into account; they simply report how much memory a process uses, regardless of whether that memory is shared with other processes. Two programs could therefore use a large shared library and yet have its size count towards both of their memory usage totals; the library gets double-counted, which can be very misleading if you don't know what is going on.

Unfortunately, a perfect representation of process memory usage isn't easy to obtain. Not only do you need to understand how the system really works, but you also need to decide how to deal with some hard questions. Should a shared library that is only needed for one process be counted in that process's memory usage? If a shared library is used by multiple processes, should its memory usage be divided evenly among those processes, or just ignored? There isn't a hard and fast rule here; you might have different answers depending on the situation you're facing. It's easy to see why ps doesn't try harder to report "correct" memory usage totals, given the ambiguity.

Seeing a process's memory map
Enough talk; let's see what the situation is with that "huge" KEdit process. To see what KEdit's memory looks like, we'll use the pmap program (with the -d flag):

Address Kbytes Mode Offset Device Mapping
08048000 40 r-x-- 0000000000000000 0fe:00000 kdeinit
08052000 4 rw--- 0000000000009000 0fe:00000 kdeinit
08053000 1164 rw--- 0000000008053000 000:00000 [ anon ]
40000000 84 r-x-- 0000000000000000 0fe:00000 ld-2.3.5.so
40015000 8 rw--- 0000000000014000 0fe:00000 ld-2.3.5.so
40017000 4 rw--- 0000000040017000 000:00000 [ anon ]
40018000 4 r-x-- 0000000000000000 0fe:00000 kedit.so
40019000 4 rw--- 0000000000000000 0fe:00000 kedit.so
40027000 252 r-x-- 0000000000000000 0fe:00000 libkparts.so.2.1.0
40066000 20 rw--- 000000000003e000 0fe:00000 libkparts.so.2.1.0
4006b000 3108 r-x-- 0000000000000000 0fe:00000 libkio.so.4.2.0
40374000 116 rw--- 0000000000309000 0fe:00000 libkio.so.4.2.0
40391000 8 rw--- 0000000040391000 000:00000 [ anon ]
40393000 2644 r-x-- 0000000000000000 0fe:00000 libkdeui.so.4.2.0
40628000 164 rw--- 0000000000295000 0fe:00000 libkdeui.so.4.2.0
40651000 4 rw--- 0000000040651000 000:00000 [ anon ]
40652000 100 r-x-- 0000000000000000 0fe:00000 libkdesu.so.4.2.0
4066b000 4 rw--- 0000000000019000 0fe:00000 libkdesu.so.4.2.0
4066c000 68 r-x-- 0000000000000000 0fe:00000 libkwalletclient.so.1.0.0
4067d000 4 rw--- 0000000000011000 0fe:00000 libkwalletclient.so.1.0.0
4067e000 4 rw--- 000000004067e000 000:00000 [ anon ]
4067f000 2148 r-x-- 0000000000000000 0fe:00000 libkdecore.so.4.2.0
40898000 64 rw--- 0000000000219000 0fe:00000 libkdecore.so.4.2.0
408a8000 8 rw--- 00000000408a8000 000:00000 [ anon ]
... (trimmed) ...
mapped: 25404K writeable/private: 2432K shared: 0K

I cut out a lot of the output; the rest is similar to what is shown. Even without the complete output, we can see some very interesting things. One important thing to note is that each shared library is listed twice: once for its code segment and once for its data segment. The code segments have a mode of "r-x--", while the data segments are set to "rw---". The Kbytes, Mode, and Mapping columns are the only ones we will care about, as the rest are unimportant to the discussion.

If you go through the output, you will find that the lines with the largest Kbytes numbers are usually the code segments of the included shared libraries (the ones that start with "lib" are the shared libraries). The great thing about code segments is that they are exactly the parts that can be shared between processes. If you factor out all of the portions that are shared between processes, you end up with the "writeable/private" total, shown at the bottom of the output. This can be considered the incremental cost of this process once the shared libraries are accounted for. Therefore, the cost to run this instance of KEdit (assuming that all of the shared libraries were already loaded) is around 2 megabytes -- quite a different story from the 14 or 25 megabytes that ps reported.
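If you want to double-check that bottom-line figure, here's a small Python sketch that re-derives it from the pmap -d output by summing every mapping that is writeable but not shared. It assumes the column layout shown above and takes a PID on the command line.

# Re-derive pmap's "writeable/private" total: sum mappings that are
# writeable and not shared, i.e. memory this process cannot share.
import subprocess
import sys

out = subprocess.run(["pmap", "-d", sys.argv[1]], capture_output=True, text=True).stdout
total_kb = 0
for line in out.splitlines():
    parts = line.split()
    if len(parts) >= 3 and parts[1].isdigit():      # skip the header and summary lines
        kbytes, mode = int(parts[1]), parts[2]
        if "w" in mode and "s" not in mode:
            total_kb += kbytes
print(f"writeable/private (approx): {total_kb}K")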

What does it all mean?
The moral of this story is that process memory usage on Linux is a complex matter; you can't just run ps and know what is going on. This is especially true when you deal with programs that create a lot of identical child processes, like Apache. ps might report that each Apache process uses 10 megabytes of memory, when the reality might be that the marginal cost of each Apache process is 1 megabyte of memory. This information becomes critical when tuning Apache's MaxClients setting, which determines how many simultaneous requests your server can handle (although see one of my past postings for another way of increasing Apache's performance).
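To make that concrete, here's some back-of-the-envelope math with purely hypothetical numbers:

# What matters for MaxClients is the marginal (private/writeable) cost per
# child, not what ps reports. All of these numbers are assumptions.
ram_for_apache_kb = 512 * 1024    # memory set aside for Apache
shared_base_kb    = 10 * 1024     # code and libraries shared by all children
marginal_kb       = 1 * 1024      # private/writeable cost per child

print((ram_for_apache_kb - shared_base_kb) // marginal_kb)   # ~502 children fit
# Sizing by the ps-reported 10 MB per child would instead suggest only ~50.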

It also shows that it pays to stick with one desktop's software as much as possible. If you run KDE for your desktop, but mostly use Gnome applications, then you are paying a large price for a lot of redundant (but different) shared libraries. By sticking to just KDE or just Gnome apps as much as possible, you reduce your overall memory usage due to the reduced marginal memory cost of running new KDE or Gnome applications, which allows Linux to use more memory for other interesting things (like the file cache, which speeds up file accesses immensely).

Lack of monitoring tools for Linux

I've been looking for a way to monitor a few dozen Linux servers lately, and there just doesn't seem to be a nice integrated tool to do it. In particular, I am looking for something that:
  • Pulls various SNMP data from a list of Linux servers
  • Stores said data for a user-specifiable amount of time
  • Generates useful graphs of said data
  • Sends emails out when said data exceeds certain thresholds
  • Provides a decent web interface for controlling everything
  • Runs under Linux
Maybe I'm just blind, but there doesn't seem to be anything that can do all of the above. I can accomplish some of it using mon, for example, but then I don't have a decent web interface, data retrieval/storage, or graphing. I can use Cacti, but then I don't have good alerting or data storage (RRD files are "lossy"). I would write my own, but then I lose the nice user interface.

Undoubtedly, someone will eventually come out with the complete package that satisfies my every desire. Once that happens, I'll just be one step away from having everything I ever wanted from Linux, with better cluster administration tools being my last hurdle.

Thursday, February 02, 2006

User comments on news sites

Yesterday, ksl.com unveiled its new comment system. I must say that I love being able to see what other people think about news stories, especially local ones; comments are the feature of Slashdot that I love the most.

This is quite the contrast to the Washington Post, which recently turned off user comments on its blog postings. Although I can understand why they are nervous about some of the user-supplied content, I cannot help but think that they are shooting themselves in the foot. What's strange is that it seems they weren't even allowing people to comment on the main news stories in the first place, only on the blog postings.

For me, half of the value of news stories on places like Slashdot and now ksl.com is seeing how everyone else is reacting. There is a sense of community there that is sorely missing in much of today's media.