Virtual Threads: Beware Ethernet flow control

Tuesday, February 21, 2006

Beware Ethernet flow control

After having been recently bitten by Ethernet's flow control mechanism, I decided to learn about this somewhat obscure but commonly used facet of modern networks. This post is a summary of what I discovered about it and its associated benefits and dangers.

What is flow control?
Ethernet flow control, or 802.3x, is a way for a network device to tell its immediate neighbor that it is overloaded with data, such as when a device is receiving data faster than it can process it. It allows an overloaded device to send out a special Ethernet frame, called a pause frame, that asks the device on the other end of the wire to stop sending data temporarily. If the neighbor honors the pause frame, then the overloaded device has time to catch up on the backlog of received data that it hasn't had time to process yet.
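To make the pause frame a little more concrete, here is a minimal Python sketch that assembles one as raw bytes. The destination address, EtherType, and opcode are the values 802.3x reserves for pause frames; the function name and the example source address are my own placeholders, and the sketch only builds the bytes rather than putting anything on the wire.

```python
import struct

# Values reserved by IEEE 802.3x for MAC Control PAUSE frames.
PAUSE_DEST_MAC = bytes.fromhex("0180c2000001")  # reserved multicast address
MAC_CONTROL_ETHERTYPE = 0x8808
PAUSE_OPCODE = 0x0001


def build_pause_frame(src_mac: bytes, quanta: int) -> bytes:
    """Return a pause frame (without the trailing FCS) requesting `quanta` pause units.

    `build_pause_frame` is a hypothetical helper for illustration only.
    """
    if len(src_mac) != 6:
        raise ValueError("src_mac must be 6 bytes")
    if not 0 <= quanta <= 0xFFFF:
        raise ValueError("quanta must fit in 16 bits")
    header = PAUSE_DEST_MAC + src_mac + struct.pack("!H", MAC_CONTROL_ETHERTYPE)
    payload = struct.pack("!HH", PAUSE_OPCODE, quanta)
    # Pad to the 60-byte minimum frame size (64 bytes once the FCS is appended).
    return (header + payload).ljust(60, b"\x00")


# Example: a made-up, locally administered source address asking for the longest pause.
frame = build_pause_frame(bytes.fromhex("021122334455"), 0xFFFF)
print(frame.hex())
```

Note that the frame carries no information about which traffic caused the overload; it only says "stop" and for how long.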

There also exists an older method for flow control called "back pressure" that is used in half-duplex environments (i.e. non-switched Ethernet). It consists of the overloaded device "jamming" the medium temporarily until it has the ability to accept more data. I don't know much about half-duplex flow control, and thus I won't mention it again; everything here applies solely to full-duplex flow control via 802.3x. Also, TCP has a mechanism for performing its own flow control that is entirely different from Ethernet's flow control; I will not be fully explaining TCP's flow control method here, as it would merit a lengthy discussion itself.

Rules of the game
When thinking about Ethernet flow control, it is important to keep several things in mind:

Flow control operates at a lower layer than TCP or IP, and thus is independent of them. Put another way, flow control can be used regardless of what higher-level protocols are put on top of it. An important side-effect of this is that neither TCP nor IP knows what Ethernet's flow control is doing; they operate under the assumption that there is no flow control other than what they may or may not provide themselves.
Flow control functions between two directly connected network devices, and flow control frames are never forwarded between links. Thus, two computers that are connected via a switch will never send pause frames to each other, but could send pause frames to the switch itself (and vice versa: the switch can send pause frames to the two computers).
Pause frames have a limited duration; they will automatically "expire" after a certain amount of time. The expiration time is set by the device that transmits the pause frame.
A paused link does not discriminate between protocols; it will prevent any data from being passed across the link other than more pause frames.
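As a rough illustration of how long a pause can last: 802.3x expresses the pause time as a 16-bit count of "quanta", where one quantum is 512 bit times on the link. The short Python sketch below turns that into wall-clock time; the function name is my own.

```python
PAUSE_QUANTUM_BITS = 512  # one pause quantum is 512 bit times (802.3x)
MAX_QUANTA = 0xFFFF       # the pause time field is 16 bits wide


def pause_seconds(quanta: int, link_bps: float) -> float:
    """Convert a pause time in quanta to seconds for a given link speed."""
    return quanta * PAUSE_QUANTUM_BITS / link_bps


for name, bps in [("10 Mbps", 10e6), ("100 Mbps", 100e6), ("1 Gbps", 1e9)]:
    print(f"{name}: longest single pause = {pause_seconds(MAX_QUANTA, bps) * 1000:.2f} ms")
```

So a single pause frame can silence a 100 Mbps link for roughly a third of a second at most, and a device that stays overloaded has to keep sending new pause frames to hold its neighbor off.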

Perhaps you have begun to see some issues with flow control in light of some of the above points. Let's start looking at them.

TCP breakage
Okay, that isn't true; TCP doesn't stop working when flow control is enabled. However, an important part of it does stop working correctly: its own flow control mechanism. (Strictly speaking, the mechanism I describe here is what TCP calls congestion control, but I'll keep calling it flow control.) TCP flow control uses a more complex mechanism of timeouts and acknowledgement segments to determine when the network or a remote device is overloaded. It basically sends at a faster and faster pace until it sees that some of its sent data isn't getting to the remote device and then slows down. This allows TCP to utilize network links in a somewhat intelligent manner, as an overloaded network or device will cause some TCP segments to be lost and thus cause the sender to send data at a slower rate.

Now consider what happens when Ethernet flow control is mixed with TCP flow control. Let's assume that we have two directly connected computers, one of which is much slower than the other. The faster sending computer starts sending lots of data to the slower receiving computer. The receiver eventually notices that it is getting overloaded with data and sends a pause frame to the sender. The sender sees the pause frame and stops sending temporarily. Once the pause frame expires, the sender will resume sending its flood of data to the other computer. Unfortunately, the TCP engine on the sender will not recognize that the receiver is overloaded, as there was no lost data -- the receiver will typically stop the sender before it loses any data. Thus, the sender will continue to speed up at an exponential rate; because it didn't see any lost data, it will send data twice as fast as before! Because the receiver has a permanent speed disadvantage, this will require the receiver to send out pause frames twice as often. Things start snowballing until the receiver pauses the sender so often that the sender starts dropping its own data before it sends it, and thus finally sees some data being lost and slows down.
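The snowballing can be sketched with a deliberately crude Python model. It assumes the sender doubles its attempted rate every round because it never sees loss, and that the receiver pauses the link for exactly the fraction of time needed to keep up. Real TCP and real NICs are far messier, so treat the numbers as an illustration of the trend only; the function names are my own.

```python
def pause_fraction(send_rate: float, recv_rate: float) -> float:
    """Fraction of time the sender must sit paused so the receiver keeps up."""
    if send_rate <= recv_rate:
        return 0.0
    return 1.0 - recv_rate / send_rate


def simulate(recv_rate: float, start_rate: float, rounds: int) -> list[tuple[float, float]]:
    """Double the sender's attempted rate each round, as if it never saw loss."""
    history = []
    rate = start_rate
    for _ in range(rounds):
        history.append((rate, pause_fraction(rate, recv_rate)))
        rate *= 2  # no loss observed, so the sender keeps speeding up
    return history


for rate, paused in simulate(recv_rate=10.0, start_rate=5.0, rounds=5):
    print(f"attempted {rate:5.1f} Mbps -> link paused {paused:.0%} of the time")
```

In this toy model the link goes from never paused to paused most of the time within a few doublings, which is the snowball effect described above.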

Is this a problem? In some ways it isn't. Because TCP is a reliable protocol, nothing is ever really "lost"; it is simply retransmitted and life goes on. Ethernet flow control accomplishes the same thing as TCP flow control in this situation, as they both slow down the data transmission to the speed that the slower device can handle. There are some arguments to be made for there being an awkward overlap between the two flow control mechanisms, but it could be worse.

Unfortunately, it does get worse.

Head-of-line blocking
In the last example, I considered the case where two computers were directly connected to each other. This example is too simplistic to be of much use -- when was the last time you saw two directly connected computers? It is a bit of a rarity. Let's now look at what happens when you introduce a switch into the mix. For our purposes, let us assume that the switch fully supports Ethernet flow control and that it is willing to use it. Our new setup will consist of two desktop computers and one file server, all of which are attached to the switch. It isn't any fun to make everything perfect, so let's also say that one of the desktops has a 10 Mbps connection to the switch while the other desktop and the server have 100 Mbps connections.

This setup is usually fine -- the 10 Mbps connection will be slower than the others, but it doesn't cause too many problems, just slower service to the one desktop. Things could get ugly, though, if Ethernet flow control is enabled on the switch. Imagine that the 10 Mbps desktop requests a large file from the file server. The file server begins to send the file to the desktop initially at a slow rate, but quickly picks up steam. Eventually, the file server will start to send data to the desktop at 11 Mbps, which is more than the poor 10 Mbps connection can handle. Without flow control enabled on the switch, the switch would start to simply drop data segments destined to the desktop, which the file server would notice and start to throttle back its sending rate.

With flow control enabled on the switch, though, the switch takes a very different approach; it will send out its own pause frames to any port that is sending data to the now-overloaded 10 Mbps port. This means that the file server will receive a pause frame from the switch, requesting it to cease all transmissions for a certain amount of time. Is this a problem? Yes! Because pause frames cease all transmissions on the link, any other data that the file server is sending will be paused as well, including data that may be destined to the 100 Mbps desktop computer. Eventually the pause will expire and the file server will continue sending out data. Unfortunately, the TCP mechanism on the file server will not know that anything is wrong and will continue sending out data at faster and faster speeds, thus overloading the 10 Mbps desktop again. As before, the cycle will keep repeating itself until the file server starts dropping its own data. Unlike the previous situation, the innocent 100 Mbps desktop bystander is penalized and will see its transfers from the file server drop to 10 Mbps speeds.

This situation is called head-of-line blocking, and it is the major reason why Ethernet flow control is somewhat dangerous to use. When enabled on network switches, it can create situations where one slow link in a network can bring the rest of the network to a crawl. It gets especially bad if the backbones in your network have flow control enabled; it should be obvious by this point just how bad that could get.
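Here is a back-of-the-envelope Python sketch of the bystander effect. It assumes the file server splits its 100 Mbps link between the two desktops in a fixed ratio, and that with flow control enabled the switch pauses the server's whole link just enough for the slow port's share to fit. The function and parameter names are mine, and real switches buffer and schedule in more complicated ways.

```python
def fast_desktop_throughput(server_link: float, slow_link: float,
                            share_to_slow: float, pause_enabled: bool) -> float:
    """Throughput (Mbps) the fast desktop sees from the server in this toy model."""
    offered_slow = server_link * share_to_slow
    offered_fast = server_link * (1.0 - share_to_slow)
    if not pause_enabled or offered_slow <= slow_link:
        # Without pause frames the switch drops the slow port's excess,
        # and traffic to the fast desktop is unaffected.
        return offered_fast
    # With pause frames the whole server link is throttled until the
    # slow port's share fits, so the fast desktop's traffic shrinks too.
    scale = slow_link / offered_slow
    return offered_fast * scale


for pause in (False, True):
    mbps = fast_desktop_throughput(100.0, 10.0, 0.5, pause)
    print(f"pause frames {'on' if pause else 'off'}: fast desktop gets about {mbps:.0f} Mbps")
```

With an even split, the fast desktop's share falls from about 50 Mbps to about 10 Mbps once pause frames are in play, even though its own link is nowhere near full.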

When to enable flow control
So what should you do? Should you completely disable flow control on all computers and switches? Not necessarily. It is generally safe to leave flow control enabled on computers. Switches, though, should either have flow control disabled or configured such that they will honor received pause frames but will never send out new pause frames. Some Cisco switches are even permanently configured this way -- they can receive pause frames but never emit them. To be honest, the complete answer to flow control is somewhat more complicated than this (e.g. you could probably enable pause frame emission if a switch port is connected to a slow backplane), but the safest bet is to disable flow control when given the option.
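On a Linux host, the "honor pause frames but never send them" policy corresponds to turning pause reception on and pause transmission off. The sketch below only builds the `ethtool -A` command line as a list of arguments and does not run it; the helper name and the interface name `eth0` are assumptions for illustration. Switches have their own vendor-specific configuration, which I won't try to reproduce here, and not every driver accepts every combination of settings.

```python
import shlex


def pause_settings_command(interface: str, honor_pause: bool, send_pause: bool) -> list[str]:
    """Build an `ethtool -A` invocation for the given pause policy (hypothetical helper)."""
    def on_off(flag: bool) -> str:
        return "on" if flag else "off"

    # rx controls whether received pause frames are honored;
    # tx controls whether this interface emits pause frames itself.
    return ["ethtool", "-A", interface, "rx", on_off(honor_pause), "tx", on_off(send_pause)]


cmd = pause_settings_command("eth0", honor_pause=True, send_pause=False)
print(shlex.join(cmd))
```

Running `ethtool -a` on the same interface afterwards shows the settings that actually took effect.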

16 Comments:

At 2:49 PM, Anonymous said...: Great Post!

Thanks for sharing your insights into flow control, I found it very helpful.

Keep up the great posts :)

Jonathan
At 2:46 AM, Tarry Singh said...: This is a great blog. I had already bookmarked you; now I'm tempted to link to you on my blog! :-)

Good professional blogging!

Tarry
At 2:31 AM, Justin said...: Wonderful!!!

I get more useful information from this article.
At 4:22 PM, Anonymous said...: Thanks for the article. I took it and went to one of our senior engineers (we make switches) and we pulled out the Broadcom Theory of Operations and looked up pause control.

What you say is essentially correct for a very simple network, but with real switches the issues are much more complicated.

A few concepts that helped me:
-dropped packets vs lossless switching, the two ways the switch can react.
-this is a layer 1 protocol, so knowledge of the packet contents is not present.
-it's the receiving end that generates a pause frame.
-switches are using multiple methods of traffic control simultaneously.
-switches can notice that a port doesn't support pause and react differently than they would if it does.

The Broadcom documentation is NDA I think, and anyway my understanding is limited. Plus it's written in Broadcomese. So I'll attempt a weak paraphrase and explanation based on our engineer's explanation, and stay away from Broadcom specifics.

While in theory it is possible to create head-of-line blocking using pause frames, the switch is doing a lot of real-time processing to prevent it. For one thing, many packets will be subject to prioritization (not at layer 1, of course), and the switch may well have a full packet queue on the port before the receiving end shouts "pause!" If so, it will have to drop the packets or do some fancy queue adjusting. For another, it's a slow machine at the end of a fast connection that's likely to cause trouble. If there is a 10 meg connection which is overwhelmed, that will happen at the switch side, the buffers will overflow, and packets will be dropped before the far side can cry "pause!" Of course both things can happen simultaneously, in which case packets will be dropped and a pause frame may get through too.

It looks like the switch is capable of manipulating the length of the pause as well, though I didn't see the algorithm.

So the switch is trying to do lossless traffic management by default using the pause frames, but generally packet priority will also be implemented at a higher level, depending on selectively dropped packets to manage traffic. (These dropped packets may be broadcast, and hence never resent, BTW.) So if you are setting up your switch, you will be using both types of traffic management at once. You'll have to make complex decisions about where to set the various parameters in order to get smooth operation among the protocols.

I don't claim expertise, just had access to some resources most of you don't have.
At 6:06 PM, Devin said...: Anonymous,

Just a few points to note about your comment:

1) A switch can and will generate pause frames, depending on its configuration. Imagine two switches, #1 and #2, connected to each other at 10 Mbps; if servers connected to switch #1 try to send 100 Mbps worth of data to desktops connected to switch #2, then switch #1's output buffer to switch #2 will quickly overflow and switch #1 will send pause frames to the servers.

2) Most switches don't do any "fancy queue adjusting". You can sometimes manually override the size of the queue, but the common case is to always have a fixed-size input or output queue.

3) A switch that receives a pause frame will not manipulate the length of the pause; only the device that transmitted the pause will do that.

4) Higher level prioritization on end devices has little or nothing to do with Ethernet flow control. It may or may not influence whether or not an end device decides to start issuing pause frames, but it does nothing to help in a head-of-line blocking situation.

The following link has comments from some switch manufacturers about flow control. They are very enlightening, especially Cisco's:

http://www.networkworld.com/netresources/0913flow2.html
At 6:19 PM, Anonymous said...: Great,

But how do I identify the slower end station in a switched network with an unmanaged switch?

Thanks

Jeffery
At 11:11 AM, Anonymous said...: Thank you! This helped enlighten me about an issue I was having with Bloomberg's T1 routers. I had my switch set with flow control on, at 10 Mbps half duplex, as per their specs. I am assuming they didn't, and thus they were seeing their circuits oversubscribed. When I asked them about flow control they didn't know what it was. Go figure. I turned it off on my end and their circuits are seeing normal load.
At 4:32 AM, Anonymous said...: Very good post - I've been bitten recently by the flow control monster, and it's good to get a good description of what was happening on the network.

Anthony
At 11:13 AM, Anonymous said...: Thank you, thank you!

Was trying to figure out if I should turn my flow control on, but it looks like it may be better left disabled. "If it ain't broke, don't fix it".
At 12:36 PM, Anonymous said...: I'm no fan of Ethernet flow control, but in practice, I think you'll find that the interaction with TCP isn't as bad as you make it out to be.

TCP does double the transmission rate during startup and halve it in response to loss as you say. But TCP is also measuring the round-trip time for every ACK as the traffic flows. If transmission rate exceeds the ability of the network to carry the traffic or of the receiver to swallow it, this round-trip time will go up and TCP will slow down.

If likened to driving a car, you could say that TCP takes its foot off the gas when the round-trip time goes up, but it steps on the brake in response to loss.

If you were to actually construct the example networks you gave and watch carefully with a network monitor, it's very unlikely that you'd see loss with a single TCP stream passing through a switch port. Loss does happen as multiple TCP streams contend for limited bandwidth through something like a single switch port, but it usually isn't a problem.

More details here: http://www.29west.com/docs/THPM/ethernet-flow-control.html

And here: http://www.29west.com/docs/THPM/tcp-latency.html

Bob
At 7:13 AM, Anonymous said...: Hi
Thanks for the useful article!

I’m adding flow control to my Ethernet driver (core by Synopsys).
This MAC core supports flow control in full duplex using the pause operation and control frame transmission. It also supports flow control in half duplex, using back pressure.

So I have some doubts and questions:
1) When enabling flow control in the MAC driver, do I need to verify the link partner's capability? My PHYs support pause and ANE.
For example:
if (priv->fc && phydev->pause)
[... init HW FC ... ]
2) Which existing driver could I use as an example?
3) I can program the flow control register with the pause value, but it is not clear to me which value I should use.
4) If I configure the HW to handle TX flow control, when and how will the interface send a pause command?
5) When my MAC receives a pause frame, it is not clear to me whether the pause time used is the value programmed into the corresponding register or the value within the received frame. In the latter case, should the receive process check for this kind of frame, or is it handled by the HW?

Sorry for the silly questions and thanks again
Regards
Giuseppe
At 4:11 AM, Unknown said...: Devin,
Really neat explanation!! Keep it up. Thanks, I enjoyed reading it.

According to my understanding, Ethernet switches are designed not to allow flow control on uplink (backbone) ports.

Actually, I myself validated an Ethernet switch design wherein I had a checker to detect this failure (i.e. flow control on an uplink port).
At 12:12 PM, CCIE STUDY GROUP said...: Hi,

Nice explanation of how flow control works. I have a query: can I enable or disable flow control on a per-port basis on Cisco switches?

Sriram
At 3:42 AM, Anonymous said...: Hi,

Thanks for a good article. I have a question, though: if I have a Cisco switch with flow control enabled, how will I know the number of pause frames the switch has already sent out? Are there pause frame stats?

Thanks,

Peter
At 10:29 AM, Anonymous said...: Hi..
Awesome Post ,
I was looking for a bandwidth controller (with 1 WAN and 4 to 8 LAN ports), and one guy told me I can control the bandwidth to each user using flow control. Does anyone know a good product for controlling bandwidth on a per-interface basis, like 128K for LAN 1 and 512K for LAN 2?
Help me..
YSV
At 6:57 AM, Anonymous said...: Thanks for such an interesting post. I will be back to read the blog again. I have found that Interoute have a great product -
ethernet reach
