Little Server Guy - Info from a Guy Trying to Run a Little WS2003 Server on the Cheap

Microsoft's Dead Gateway Detection and RFC 816

I was pretty surprised when I googled these two terms and found many pages that repeat the same information: either from the 1982 RFC or from Microsoft's KB. Well, no one else may have anything to say on these subjects, but I do.

I'm going to start with my recommendations and then I'll explain how these are justified.

First of all, dead gateway detection (DGWD), as far as I can observe its implementation in Windows, will work in specific network configurations. But in other network configurations, Windows DGWD simply shuts down an otherwise functioning NIC, which is exactly the opposite of its purpose. It needs to be redesigned.

Recommendations for system managers:

Unless you have a month to waste and a spare server to play with, configure your server with one and only one default gateway on each and every network connection intended to carry TCP/IP traffic, and disable DGWD by creating HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters\EnableDeadGWDetect as a DWORD value and setting it to 0. Without this value, DGWD is enabled by default.
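For anyone who would rather import the setting than click through regedit, here is a minimal sketch that builds the text of a .reg file for the value described above. The key path and value name come from the paragraph above; the helper name build_reg_file is my own invention for illustration, and the script only produces text, it does not touch the registry.

```python
# Sketch: build the text of a .reg file that sets EnableDeadGWDetect.
# The helper name is hypothetical; the key path and value name are the
# ones given in the recommendation above.

KEY_PATH = r"HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters"
VALUE_NAME = "EnableDeadGWDetect"


def build_reg_file(enabled: bool = False) -> str:
    """Return .reg file text that sets the DWORD value to 1 (on) or 0 (off)."""
    dword = 1 if enabled else 0
    lines = [
        "Windows Registry Editor Version 5.00",
        "",
        f"[{KEY_PATH}]",
        f'"{VALUE_NAME}"=dword:{dword:08x}',
        "",
    ]
    # .reg files conventionally use Windows line endings.
    return "\r\n".join(lines)


if __name__ == "__main__":
    print(build_reg_file(enabled=False))
```

Save the output as a .reg file and try it on a spare machine before importing it on a server you care about.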

Recommendations for Microsoft:

1. Default Gateways in windows should be clearly described as pertaining to a specific network connection.
2. Every network connection should be configured with a default gateway.
3. Dead gateway detection only makes sense for a specific gateway. Therefore, any dead gateway settings should apply to a gateway, not to an interface (which may not be the only interface using a gateway) or a computer (which can have multiple gateways).
4. Dead gateway responses should be applied to all/only those NICs using the dead gateway.
5. The response to a dead gateway must never be to leave a NIC non-functional, as happens now. The RFC is clear on this.
6. Switching gateways can only be a valid response to a dead gateway when there is another known gateway within the same subnet as the dead gateway. In the absence of this, the dead gateway should continue to be used, on the assumption that either it will return to service, or that the criteria that identified it as dead actually identified poor performance and not death.
7. DGWD criteria should be used to identify when a primary connection does not work. DGWD provides a way to maintain sub-optimal connectivity when a primary connection does not work. This is different from gateway optimization, which chooses one gateway to use in preference to another based on their relative performance. A dead gateway has NO connectivity. Microsoft's implementation tries to switch gateways when a gateway's performance on the most current communications drops, even though the gateway may still be functioning and the system may even know it is still functioning. This is actually an optimization, except that the system has no means to determine if the switch improved performance, or if it should be switched back. DGWD should be designed to deal with dead gateways and not with sub-optimal gateways.
8. When a gateway is declared dead, the list of alternate gateways should include all other default gateways and all multiple default gateways configured on all NICs on that subnet.
9. The multiple default gateway warning message offered by the network connection properties dialogs should be changed or eliminated.
10. If DGWD is to be enabled by default (as it is now), its parameters should be settable via dialog boxes rather than regedit.
11. The logical steps of the DGWD should be published, including not just the criteria, but how the list of alternate gateways is assembled and ordered, and the criteria used to select an alternate gateway.

Discussion

1. Default Gateways in windows should be clearly described as pertaining to a specific network connection.
2. Every network connection should be configured with a default gateway.

Microsoft's KB incorrectly states that the default gateway is a host property (http://support.microsoft.com/kb/157025/en-us): "Only one default gateway should be configured on any multihomed computer. The default gateway is a global configuration for the server, not a setting that must be set for each network adapter. The server is already aware of all the networks it is directly connected to, and adds a route to each network for which it has a TCP/IP address." It's not clear what this means, but obviously, for a host with NICs on two different subnets, a global default gateway would not work. In practice, when multiple NICs are on different subnets on my WS2003 server, it appears that each network connection needs its own default gateway specification to function, even if a route exists in the routing table.

Another KB article incorrectly says that multiple default gateways cause connectivity problems (http://support.microsoft.com/kb/159168/EN-US/), as does the warning dialog displayed when multiple gateways are defined in the network connection properties dialogs. The connectivity problem is actually a result of the faulty DGWD implemented by MS. Clearly, a NIC on a unique subnet needs a default gateway of its own, since it can't directly use a gateway on a different subnet. A further KB article (http://support.microsoft.com/kb/162055/en-us) supports this, describing the behavior with multiple default gateways.

The proper understanding is that one default gateway needs to be defined for each unique subnet served by a NIC in the computer. Windows will then create a default route (a 0.0.0.0 route) for each default gateway spec, where the route includes the interface that physically links to that gateway. Moreover (pardon my incomplete understanding on the application side here), it appears that applications (say IIS, for example) need to route responses back out on the interface on which the request was received. Thus, a route needs to exist in the routing table that not only gets to a functioning default gateway on the subnet that IIS is listening to, but is also defined to use the interface that IIS is listening to. This means that each network connection needs its own default gateway spec, even if that gateway is the same as the gateway on another network connection in the same server.
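To make the "one gateway per connection, on that connection's subnet" rule concrete, here is a small sketch that checks a list of network connections against it. This is only my illustration of the rule, not anything Windows does; the function names, the dictionary layout, and the addresses are all made up for the example.

```python
import ipaddress

# Sketch: check that every network connection has a default gateway and that
# the gateway lives on the connection's own subnet. All names and addresses
# here are hypothetical examples.


def gateway_on_subnet(nic_address: str, prefix_len: int, gateway: str) -> bool:
    """True if `gateway` is inside the subnet that the NIC address belongs to."""
    network = ipaddress.ip_interface(f"{nic_address}/{prefix_len}").network
    return ipaddress.ip_address(gateway) in network


def check_connections(connections):
    """Return a list of human-readable problems, empty if all is well."""
    problems = []
    for conn in connections:
        gateway = conn.get("gateway")
        if gateway is None:
            problems.append(f"{conn['name']}: no default gateway configured")
        elif not gateway_on_subnet(conn["address"], conn["prefix_len"], gateway):
            problems.append(
                f"{conn['name']}: gateway {gateway} is not on this connection's subnet"
            )
    return problems


if __name__ == "__main__":
    example = [
        {"name": "LAN A", "address": "192.168.1.10", "prefix_len": 24, "gateway": "192.168.1.1"},
        {"name": "LAN B", "address": "10.0.0.5", "prefix_len": 24, "gateway": "192.168.1.1"},
        {"name": "LAN C", "address": "172.16.0.5", "prefix_len": 16, "gateway": None},
    ]
    for problem in check_connections(example):
        print(problem)
```

In the example, "LAN B" is flagged because its gateway sits on another subnet, and "LAN C" is flagged because it has no gateway at all.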

Thus, the MS KB has this backward: it suggests an incorrect configuration to avoid a problem that is actually caused by faulty DGWD.

3. Dead gateway detection only makes sense for a specific gateway. Therefore, any dead gateway settings should apply to a gateway, not to an interface (which may not be the only interface using a gateway) or a computer (which can have multiple gateways).

RFC 816, which is apparently the sole basis for MS's DGWD, was written in 1982. Enough said? It doesn't appear to contemplate a multihomed server. But, simply put, for any given gateway, you can only substitute another gateway on the same subnet. Why? A gateway is by definition a device on a subnet with connections to and knowledge of other subnets. Being a device on a subnet, it wouldn't be visible to a NIC on any other subnet. Thus, DGWD has to take place within a subnet.

Within a subnet, multiple gateways could be in use (I think this is possible, with separate NICs routing traffic out on the same subnet but via different gateways on that subnet). Thus, DGWD needs to be done on a per-gateway basis. Thus, all traffic routed to a gateway needs to be considered, even if that traffic is moving via more than one NIC in the computer. Traffic on the same subnet, but using another gateway, should not be considered. Any settings, such as how to determine when the gateway is dead, or whether to use DGWD at all, should pertain to the gateway.

If an interface is switched by DGWD to another gateway, the settings for that new gateway should then be used. It may be, for example, that the new gateway is configured not to use DGWD. Or it may need more tolerant criteria for determining that it is dead.

4. Dead gateway responses should be applied to all/only those NICs using the dead gateway.

Just as all traffic through a gateway should be used to determine if a gateway is dead, when a gateway is pronounced dead, the response should be applied to all NICs that were routing traffic through it.

5. The response to a dead gateway must never be to leave a NIC non-functional, as happens now. The RFC is clear on this.
6. Switching gateways can only be a valid response to a dead gateway when there is another known gateway within the same subnet as the dead gateway. In the absence of this, the dead gateway should continue to be used, on the assumption that either it will return to service, or that the criteria that identified it as dead actually identified poor performance and not death.

The current DGWD algorithm appears to monitor gateways even when no alternate gateways are defined. It then pronounces them dead and switches the default to the next alternate gateway, namely nothing. This leaves the NIC unable to function. This should never happen. A NIC should only be switched to another valid gateway address, and then the NIC must continue to be monitored to see if that new gateway is dead. If no alternate gateway is known, no switching should be done at all, and the NIC should continue to use the dead gateway, whether functioning or not.
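The switching rule I'm arguing for in points 5 and 6 can be sketched in a few lines. This is my proposal, not Microsoft's algorithm; the function name next_gateway and the addresses in the example are hypothetical.

```python
import ipaddress
from typing import List

# Sketch of the proposed rule: switch only to another known gateway on the
# same subnet; if there is none, keep using the "dead" gateway rather than
# leaving the NIC with no gateway at all.


def next_gateway(current: str, subnet: str, known_gateways: List[str]) -> str:
    """Return the gateway a NIC should use after `current` is declared dead."""
    network = ipaddress.ip_network(subnet)
    for candidate in known_gateways:
        if candidate != current and ipaddress.ip_address(candidate) in network:
            return candidate
    # No valid alternate: never switch to "nothing".
    return current


if __name__ == "__main__":
    # One alternate on the same subnet: switch to it.
    print(next_gateway("192.168.1.1", "192.168.1.0/24", ["192.168.1.1", "192.168.1.254"]))
    # Only gateway known on this subnet: keep it.
    print(next_gateway("192.168.1.1", "192.168.1.0/24", ["192.168.1.1", "10.0.0.1"]))
```

The important part is the last line of the function: with no usable alternate, the answer is the gateway already in use, never an empty entry.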


7. DGWD should be used when a primary connection does not work. DGWD provides a way to maintain sub-optimal connectivity when a primary connection does not work. This is different from gateway optimization, which chooses one gateway to use in preference to another based on their relative performance. A dead gateway has NO connectivity. Microsoft's implementation tries to switch gateways when a gateway's performance on the most current communications drops, even though the gateway may still be functioning and the system may even know it is still functioning. This is actually an optimization, except that the system has no means to determine if the switch improved performance, or if it should be switched back. DGWD should be designed to deal with dead gateways and not with sub-optimal gateways.

MS's KB defines (well, as of NT 4.0...) the dead gateway algorithm (http://support.microsoft.com/kb/171564/en-us).

The criteria at the bottom of this article show that what is being done is dead route detection. I quote, "When any TCP connection that is routed through the default gateway has attempted to send a TCP packet to the destination a number of times equal to one-half of the registry value TcpMaxDataRetransmissions, but receives no response, the algorithm changes the Route Cache Entry (RCE) for that one remote IP address to use the next default gateway in the list." This is interesting because 1) the parameter TcpMaxDataRetransmissions has been subverted, since the system really won't retry that many times; and 2) it's not clear what it means to change an RCE for a remote IP address if the NIC is serving traffic from the internet, where the default route (0.0.0.0) would be the one used. But note that this test is specific to a particular IP destination, which may have its own unique connectivity issues that lie far beyond the gateway.

Completing the excerpt from the article, "When 25 percent of the TCP connections have moved to the next default gateway, the algorithm advises IP to change the default gateway for the whole computer to the one that the connections are now using." There are two issues here: 1) The gateway may not be dead at all. Only 25% of the "connections" have to be having problems. Why 25%? What if the gateway is serving a small business that has light traffic, with few connections? What if it is 4am and there is only one connection and it times out? There seem to be many situations in which this criterion may lead to a gateway being declared dead when a single remote destination has problems. This is NOT what the RFC contemplated. 2) The article says that the "default gateway for the whole computer" will be changed. But as argued above, each NIC really needs its own default gateway, and if the NICs serve different subnets, the gateways must be on different subnets, and thus must be different. So changing the default gateway for the whole computer is an oxymoron, and if it were done it could easily break one or more NICs. Guess what? It does.
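The arithmetic behind the light-traffic problem is easy to show. The sketch below is only my reading of the two quoted KB sentences, not Microsoft's code: the function names are hypothetical, the rounding of "one-half" is an assumption, and so is treating "25 percent" as "at least 25 percent".

```python
# Sketch of the two thresholds as the KB excerpt describes them.
# Assumptions (not confirmed by the KB): integer division for "one-half",
# and ">=" for the 25 percent test.


def retries_before_reroute(tcp_max_data_retransmissions: int) -> int:
    """Retransmissions after which one connection is moved to the next gateway."""
    return tcp_max_data_retransmissions // 2


def kb_would_switch_default_gateway(total_connections: int,
                                    moved_connections: int,
                                    threshold: float = 0.25) -> bool:
    """True if the share of moved connections reaches the threshold."""
    if total_connections == 0:
        return False
    return moved_connections / total_connections >= threshold


if __name__ == "__main__":
    # 4am, a single connection to a flaky remote host: 1 of 1 is 100 percent.
    print(kb_would_switch_default_gateway(1, 1))
    # Light traffic: one bad destination out of four is already 25 percent.
    print(kb_would_switch_default_gateway(4, 1))
    # Busy server: one bad destination out of a hundred is not enough.
    print(kb_would_switch_default_gateway(100, 1))
```

The fewer connections a server has, the more a single troubled destination counts, which is exactly backward for a test that is supposed to detect a dead gateway.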

New criteria need to be developed which identify a dead gateway. If any connections are working through a gateway, it isn't dead.  Using a percent of connections is unworkable for NICs providing any server function that might have periods of light traffic, since one unreliable connection could trip a dead gateway response.
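As a starting point, the "if anything works, it isn't dead" rule could look like the sketch below. The function name is hypothetical, and how to treat a gateway with no traffic at all is my own assumption (no evidence, so not dead).

```python
from typing import List

# Sketch of a stricter criterion: a gateway is dead only when there is
# traffic through it and none of that traffic is getting a response.


def gateway_is_dead(connection_ok: List[bool]) -> bool:
    """`connection_ok` holds one flag per connection routed via the gateway."""
    if not connection_ok:
        # Assumption: with no traffic there is no evidence of death.
        return False
    return not any(connection_ok)


if __name__ == "__main__":
    print(gateway_is_dead([False]))               # single failing connection
    print(gateway_is_dead([True, False, False]))  # one connection still works
    print(gateway_is_dead([]))                    # idle NIC
```

Note that a single failing connection on an otherwise idle NIC still reads as dead here, so a real design would also want some minimum amount of evidence before acting.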

Choosing an optimum gateway may be a desirable feature for some users, but it should not be part of the core tcp/ip implementation.

8. When a gateway is declared dead, the list of alternate gateways should include all other default gateways and all multiple default gateways configured on all NICs on that subnet.

A gateway serves a subnet. All the gateways on that subnet that are known to the computer should be in the list of backup gateways. So if another NIC on the same subnet has a different gateway, it should be an alternate. If the NIC in question, or any other NIC on that subnet, has multiple gateways defined, these should be in the list. Obviously one needs to be selected, but in practice this should probably be addressed by a manually configured metric.
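Here is a sketch of how such a list might be assembled: collect every gateway configured on any NIC, keep the ones on the dead gateway's subnet, and order them by a manually configured metric. The GatewayEntry type, the function name, and the addresses are hypothetical; this is what I'd like to see, not what Windows does.

```python
import ipaddress
from typing import Dict, List, NamedTuple

# Sketch: build the ordered list of alternate gateways for a dead gateway.


class GatewayEntry(NamedTuple):
    nic: str      # name of the network connection the gateway is configured on
    address: str  # gateway IP address
    metric: int   # manually configured metric, lower is preferred


def alternate_gateways(dead: str, subnet: str, entries: List[GatewayEntry]) -> List[str]:
    """Return same-subnet gateways other than `dead`, best metric first."""
    network = ipaddress.ip_network(subnet)
    best: Dict[str, int] = {}
    for entry in entries:
        if entry.address == dead:
            continue
        if ipaddress.ip_address(entry.address) not in network:
            continue
        # The same gateway may be configured on several NICs; keep its best metric.
        if entry.address not in best or entry.metric < best[entry.address]:
            best[entry.address] = entry.metric
    return sorted(best, key=lambda a: (best[a], ipaddress.ip_address(a)))


if __name__ == "__main__":
    configured = [
        GatewayEntry("NIC 1", "192.168.1.1", 10),    # the dead gateway
        GatewayEntry("NIC 1", "192.168.1.254", 20),  # second gateway on NIC 1
        GatewayEntry("NIC 2", "192.168.1.2", 5),     # another NIC, same subnet
        GatewayEntry("NIC 3", "10.0.0.1", 1),        # different subnet, ignored
    ]
    print(alternate_gateways("192.168.1.1", "192.168.1.0/24", configured))
```

With the example data, the gateway from NIC 2 comes first because of its lower metric, and the gateway on the 10.0.0.0 subnet never appears no matter how attractive its metric is.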

9. The multiple default gateway warning message offered by the network connection properties dialogs should be changed or eliminated.

As mentioned above, all the MS documentation pertaining to DGWD is wrong, including the warning dialog that comes up when a second default gateway is entered. First, the DGWD needs to be fixed. Then this message needs to be changed.

10. If DGWD is to be enabled by default (as it is now), its parameters should be settable via dialog boxes rather than regedit.

Right now, DGWD is a time bomb for multihomed computers. Turning it off requires finding the documentation and then creating a registry value. The parameters for how DGWD works apparently are not accessible. These should all have a dialog box for configuration.

11. The logical steps of the DGWD should be published, including not just the criteria, but how the list of alternate gateways is assembled and ordered, and the criteria used to select an alternate gateway.


Glossary

NIC: Network Interface Controller: a physical device that sits in a host and provides a connection for a wire(less) to another device on a network.

Network Connection: A windows construct: a logical grouping of specifications that apply to one specific NIC in a windows computer.

Interface: Used in network definitions that aren't specific to the OS; a pointer to a physical NIC. 

These three entities exist one-to-one-to-one. The presence of one NIC in a Windows computer means there will be one and only one Network Connection and one and only one interface associated with that NIC.

Gateway: a network device with knowledge of and connections to more than one subnet.

Comments? Combine the word "richard" with the words "server guy" with no spaces or punctuation and stick it in front of the domain this page is hosted on. That should get an electronic message to me.
(c) 2006 Richard Skerritt