Thursday, January 18, 2007

Oh no -- I've become unbalanced!

I recently discovered an interesting feature of the Linux kernel. It all started when I noticed that some new web servers at work were exhibiting a peculiar behavior -- the two CPU cores didn't have balanced workloads. The first core usually carried the majority of the load, while the second usually had much less (all witnessed via the mpstat -P ALL 10 command). Given that these servers were Apache web servers, I expected the cores to stay within a few percent of each other, but instead they were separated by anywhere from 5-25%.
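(If you don't have mpstat from the sysstat package handy, a rough sketch of the same check can be done straight from /proc/stat, which mpstat itself reads. Each "cpuN" line lists cumulative jiffy counters for that core.)

```shell
# Per-core counters from /proc/stat: name, then user, nice, system,
# idle, ... jiffies. Comparing two snapshots over time shows which
# core is busier; mpstat just does this bookkeeping for you.
grep '^cpu[0-9]' /proc/stat
```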

After poking around a bit, I decided to tweak the server to try to force the kernel to balance the load better. These servers each had two Ethernet interfaces: one to a 'front channel' across which the HTTP interaction with clients was performed, and one to a 'back channel' across which some database and other calls were made. One of my early thoughts was to dedicate each of the two cores to one of the two Ethernet interfaces, so that each core would always service the interrupts from the same interface. The default setup on the 2.6 kernel is to have all interrupts serviced by the first core, so I thought that perhaps the data load on the two interfaces was forcing the first core to run a bit hotter than the other.

It was easy enough to make the change. I first looked up the interrupt numbers for the two interfaces (ifconfig), then forced the second one to always use the second processor core (echo "2" >/proc/irq/177/smp_affinity). This rebalanced the interrupts as intended, but it didn't solve my original problem: the cores were still unbalanced. In fact, the problem became even more interesting -- the two cores had almost flipped their loads, with the second core now consistently carrying the higher usage.
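(For anyone following along: smp_affinity takes a hexadecimal bitmask, where bit N set means core N may service the interrupt -- the "2" above is just the mask with bit 1 set. A sketch of computing the mask for an arbitrary core; IRQ 177 is the number from my own server, yours will differ.)

```shell
# Bit N of the smp_affinity mask corresponds to CPU core N
# (cores are numbered from 0, so the second core is core 1).
core=1
mask=$(printf '%x' $((1 << core)))
echo "mask for core $core is $mask"
# Writing it requires root; substitute your interface's IRQ number:
# echo "$mask" > /proc/irq/177/smp_affinity
```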

At first I thought that perhaps the second interface might be transferring more data; it wasn't (ifconfig). Next I thought that, even with less total data transferred, perhaps the second interface was triggering more interrupts. That, too, was incorrect (cat /proc/interrupts). Finally, I checked that both interfaces were using the same driver, which they were (dmesg | grep eth). I had just about given up when I finally decided to add some logging to every web page (using PHP) to record which processor it ran on. Imagine my surprise when I discovered that nearly every web process started off on the lower-usage core, but almost immediately migrated to the higher-usage core!
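(The per-process logging trick doesn't need PHP specifically: field 39 of /proc/&lt;pid&gt;/stat is the core the process last ran on, so any language that can read a file can do it. A sketch from the shell -- note that /proc/self here refers to whichever process opens the file, so this reports awk's own core, which is exactly the kind of sample the logging collected.)

```shell
# Field 39 of /proc/<pid>/stat is "processor": the CPU core the
# process last ran on. (Counting fields with awk breaks if the
# process name contains spaces, which awk's does not.)
cpu=$(awk '{ print $39 }' /proc/self/stat)
echo "this process last ran on core $cpu"
```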

I had uncovered the superficial culprit: when a user requested a web page, one of the first things that the server would do is get a copy of their current session data from another server through the second interface. The moment that the second interface was used, the web process would flip over to the second core (or whatever core was assigned to service that interface's interrupts).

A little more research into the kernel code revealed the underlying mechanism (caveat: I'm not a kernel hacker, so the details are still a bit fuzzy). When a network interface receives data, Linux finds the process that is sleeping while waiting for that data and wakes it up. When waking the process, Linux tries to keep it on the same processor core that serviced the interrupt. I imagine it does this in an effort to better utilize that core's memory cache, which should speed things up. However, it also appears to have a slight unbalancing effect on the other cores, which is exactly what I was seeing on my servers.

Of course, there are ways to fix this "problem". The easiest is to simply remove the flags in the kernel code that trigger this behavior (you'll find them in include/linux/topology.h; just remove any lines that say SD_WAKE_AFFINE) and then recompile the kernel. I've discovered, though, that this change doesn't improve things the way I thought it might. It did balance out the load between the two cores, but the average load across the cores was actually slightly higher (1-2%, usually). That makes some sense: as I said before, the kernel is probably trying to make more efficient use of each core's cache, which apparently has a greater positive effect than a more equal load distribution.
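(For the curious, the edit itself is just a line deletion; here's a sketch of it run against a stand-in file, since I can't reproduce the real include/linux/topology.h here -- the fragment below is illustrative, not the actual kernel source.)

```shell
# Stand-in fragment shaped like the flag lists in topology.h.
cat > /tmp/topology.h <<'EOF'
.flags = SD_LOAD_BALANCE
       | SD_BALANCE_NEWIDLE
       | SD_WAKE_AFFINE
       | SD_WAKE_BALANCE,
EOF
# Delete every line mentioning SD_WAKE_AFFINE, keeping a .bak backup.
# (In the real file, check that the surrounding expression stays valid.)
sed -i.bak '/SD_WAKE_AFFINE/d' /tmp/topology.h
if ! grep -q SD_WAKE_AFFINE /tmp/topology.h; then echo "flag removed"; fi
```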

In the end, I suppose I can live with unbalanced CPU cores. I just wish that this behavior had been a bit better documented -- it was a rather painful process to have to figure it out myself. I suppose it could have been worse, though: I could be using closed-source Windows. ;)


At 3:06 PM, Anonymous Anonymous said...

Excellent article. I have observed this issue too, but didn't dare to dig into the kernel :(

At 3:21 AM, Blogger Otheus said...

This might be of some help:

Tuning the soft affinity can be done with a parameter called "cache_hot_time". However, in this article, the proposal is to make affinity tighter, not looser.

You can manually set the affinity of processes. Perhaps you could modify Apache (through a module?) to set the affinity of the current process to a different CPU -- using a round-robin scheme, for instance: take the current PID modulo the number of cores.
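(A sketch of that round-robin idea from the shell -- my addition, not the commenter's exact recipe -- using taskset from util-linux to do the actual pinning.)

```shell
# Map a worker onto a core by PID modulo the online core count,
# then pin it there with taskset.
ncores=$(getconf _NPROCESSORS_ONLN)
core=$(( $$ % ncores ))
echo "would pin PID $$ to core $core"
# Uncomment to actually pin this shell (needs taskset installed):
# taskset -pc "$core" $$
```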

In 2.6.23 a new scheduler appeared: CFS, the "Completely Fair Scheduler".

Since 2.6, the scheduler groups CPUs into "domains", so that the cores within the first 4-core CPU can share a process with greater affinity than they would with cores on the second 4-core CPU.

