Little Server Guy -
Info from a Guy Trying to Run a Little WS2003 Server on the Cheap
Microsoft's Dead Gateway Detection and RFC 816
I was pretty surprised when I googled these two terms and found many pages that
repeat the same information: either from the 1982 RFC or from Microsoft's KB. Well,
no one else may have anything to say on these subjects, but I do.
I'm going to start with my recommendations and then I'll explain how these are justified.
First of all, dead gateway detection (DGWD), as far as I can observe its implementation
in Windows, will work in specific network configurations. But in other network configurations,
Windows DGWD simply shuts down an otherwise functioning NIC, serving exactly the
opposite of its purpose. It needs to be redesigned.
Recommendations for system managers:
Unless you have a month to waste and a spare server to play with, configure
your server with one and only one default gateway on each and every network connection
intended to carry TCP/IP traffic, and disable DGWD by creating the value
HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters\EnableDeadGWDetect
as a DWORD and setting it to 0. Without this value, DGWD is enabled by default.
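The registry change described above can be scripted. What follows is a minimal sketch, assuming the standard reg.exe command-line tool is present on the server. The helper name dgwd_disable_command is my own invention for illustration, and the sketch only builds the command; it does not run it.

```python
# Hypothetical helper: builds (but does not execute) a reg.exe command that
# creates EnableDeadGWDetect as a DWORD set to 0 under the Tcpip parameters key.
KEY = r"HKLM\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters"


def dgwd_disable_command(key=KEY):
    """Return the reg.exe argument list that sets EnableDeadGWDetect to 0."""
    return [
        "reg", "add", key,
        "/v", "EnableDeadGWDetect",  # value name
        "/t", "REG_DWORD",           # value type
        "/d", "0",                   # 0 disables dead gateway detection
        "/f",                        # overwrite an existing value without prompting
    ]


if __name__ == "__main__":
    # Print the command so it can be reviewed before being run by hand.
    print(" ".join(dgwd_disable_command()))
```

Printing the command instead of executing it keeps the sketch safe to try on any machine; on the server itself the printed line would be run from an administrator command prompt.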
Recommendations for Microsoft:
1. Default Gateways in Windows should be clearly described as pertaining to a specific
network connection.
2. Every network connection should be configured with a default gateway.
3. Dead Gateway detection only makes sense for a specific gateway. Therefore, any
dead gateway settings should apply to a gateway, not to an interface (which may not
be the only interface using a gateway) or a computer (which can have multiple gateways).
4. Dead gateway responses should be applied to all/only those NICs using the dead
gateway.
5. The response to a dead gateway must never be to leave a NIC non-functional, as
happens now. The RFC is clear on this.
6. Switching gateways can only be a valid response to a dead gateway when there
is another known gateway within the same subnet as the dead gateway. In the absence
of this, the dead gateway should continue to be used, on the assumption that either
it will return to service, or that the criteria that identified it as dead actually
identified poor performance and not death.
7. DGWD criteria should be used to identify when a primary connection does
not work. DGWD provides a way to maintain sub-optimal connectivity when a primary
connection does not work. This is different from gateway optimization, which chooses
one gateway to use in preference to another based on their relative performance.
A dead gateway has NO connectivity. Microsoft's implementation tries to switch gateways
when a gateway's performance on the most current communications drops, even though
the gateway may still be functioning and the system may even know it is still functioning.
This is actually an optimization, except that the system has no means to determine
if the switch improved performance, or if it should be switched back. DGWD should
be designed to deal with dead gateways and not with sub-optimal gateways.
8. When a gateway is declared dead, the list of alternate gateways should include
all other default gateways and all multiple default gateways configured on all NICs
on that subnet.
9. The multiple default gateway warning message offered by the network connection
properties dialogs should be changed or eliminated.
10. If DGWD is to be enabled by default (as it is now), its parameters should be
settable via dialog boxes rather than regedit.
11. The logical steps of the DGWD should be published, including not just the criteria,
but how the list of alternate gateways is assembled and ordered, and the criteria
used to select an alternate gateway.
Discussion
1. Default Gateways in Windows should be clearly described as pertaining to a specific
network connection.
2. Every network connection should be configured with a default gateway.
Microsoft's KB incorrectly states that the default gateway is a host property
(http://support.microsoft.com/kb/157025/en-us):
"Only one default gateway should be configured on any multihomed computer. The default
gateway is a global configuration for the server, not a setting that must be set
for each network adapter. The server is already aware of all the networks it is
directly connected to, and adds a route to each network for which it has a TCP/IP
address." It's not clear what this means, but obviously, for a host with NICs
on two different subnets, a global default gateway would not work. In practice,
when multiple NICs are on different subnets on my WS2003, it appears that each network
connection needs its own default gateway specification to function, even if a route
exists in the routing table.
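The subnet arithmetic behind this point can be sketched in a few lines. This is only an illustration, using made-up addresses and a hypothetical helper name; it shows that a single gateway address lies inside the subnet of just one of two NICs, and says nothing about how Windows itself builds its routes.

```python
import ipaddress


def gateway_on_nic_subnet(nic_cidr, gateway):
    """Return True if the gateway address lies inside the NIC's own subnet."""
    nic = ipaddress.ip_interface(nic_cidr)
    return ipaddress.ip_address(gateway) in nic.network


# Hypothetical server with two NICs on two different subnets.
nics = {"LAN": "192.168.1.10/24", "DMZ": "10.0.0.10/24"}
global_gateway = "192.168.1.1"  # a single "global" default gateway

for name, cidr in nics.items():
    # Only the NIC that shares a subnet with the gateway can reach it directly.
    print(name, gateway_on_nic_subnet(cidr, global_gateway))
```

With these addresses the gateway is directly reachable from the LAN NIC only; the DMZ NIC would need a gateway of its own on 10.0.0.0/24.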
Another KB article incorrectly says that multiple default gateways cause connectivity
problems (http://support.microsoft.com/kb/159168/EN-US/),
as does the warning dialog displayed when multiple gateways are defined in the
network connection properties dialogs. The connectivity problem is actually a result
of the faulty DGWD implemented by MS. Clearly, a NIC on a unique subnet needs a
default gateway of its own since it can't directly use a gateway on a different
subnet. Yet another KB article (http://support.microsoft.com/kb/162055/en-us)
supports this, describing the behavior with multiple default gateways.
The proper understanding is that one default gateway needs to be defined for each
unique subnet served by a NIC in the computer. Windows will then create default
routes (0.0.0.0 routes) based on each default gateway spec, where the route includes
the interface that physically links to that gateway. Moreover (pardon my incomplete
understanding on the application side here), it appears that applications (say IIS,
for example) need to route responses back out on the interface on which the request
was received. Thus, a route needs to exist in the routing table that not only gets
to a functioning default gateway on the subnet that IIS is listening to, but is also
defined to use the interface that IIS is listening to. This means that each network
connection needs its own default gateway spec, even if that gateway is the same
as the gateway on another network connection in the same server.
Thus, MS KB is backward on this, suggesting incorrect configuration to avoid the
problem caused by faulty DGWD.
3. Dead Gateway detection only makes sense for a specific gateway. Therefore, any dead
gateway settings should apply to a gateway, not to an interface (which may not be
the only interface using a gateway) or a computer (which can have multiple gateways).
RFC 816, which is the apparent sole basis for MS's DGWD, was written in 1982. Enough
said? It doesn't appear to contemplate a multihomed server. But, simply put, for
any given gateway, you can only substitute another gateway on the same subnet. Why?
The gateway is so defined, as a device on a subnet with connection to and knowledge
of other subnets. Being a device on a subnet, it wouldn't be visible to a NIC on
any other subnet. Thus, DGWD has to take place within a subnet.
Within a subnet, multiple gateways could be in use (I think this is possible, with
separate NICs routing traffic out on the same subnet but via different gateways
on that subnet). Thus, DGWD needs to be done on a per-gateway basis. Thus, all traffic
routed to a gateway needs to be considered, even if that traffic is moving via more
than one NIC in the computer. Traffic on the same subnet, but using another gateway,
should not be considered. Any settings, such as how to determine when the gateway
is dead, or whether to use DGWD at all, should pertain to the gateway.
If an interface is switched by DGWD to another gateway, the settings for that new
gateway should then be used. It may be, for example, that the new gateway is configured
not to use DGWD. Or it may need more tolerant criteria for determining that it is dead.
4. Dead gateway responses should be applied to all/only those NICs using the dead
gateway.
Just as all traffic through a gateway should be used to determine if a gateway is
dead, when a gateway is pronounced dead, the response should be applied to all NICs
that were routing traffic through it.
5. The response to a dead gateway must never be to leave a NIC non-functional, as
happens now. The RFC is clear on this.
6. Switching gateways can only be a valid response to a dead gateway when there
is another known gateway within the same subnet as the dead gateway. In the absence
of this, the dead gateway should continue to be used, on the assumption that either
it will return to service, or that the criteria that identified it as dead actually
identified poor performance and not death.
The current DGWD algorithm appears to monitor gateways even when no alternate gateways
are defined. It then pronounces them dead, and switches the default to the next alternate
gateway, namely nothing. This leaves the NIC unable to function. This should never
happen. A NIC should only be switched to another valid gateway address, and then
the NIC must continue to be monitored to see if that new gateway is dead. If no
alternate gateway is known, no switching should be done at all, and the NIC should
continue to use the dead gateway, whether functioning or not.
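The response rule argued for here can be written down compactly. The sketch below is my illustration of the proposal, not a description of what Windows does; the function name and the example addresses are hypothetical.

```python
import ipaddress


def next_gateway(dead, subnet, known_gateways):
    """Pick a replacement for a dead gateway, or keep the dead one.

    A candidate qualifies only if it sits in the same subnet as the dead
    gateway. If no candidate qualifies, the dead gateway is returned
    unchanged, so the NIC is never left without a gateway.
    """
    network = ipaddress.ip_network(subnet)
    for candidate in known_gateways:
        if candidate != dead and ipaddress.ip_address(candidate) in network:
            return candidate
    return dead


# An alternate exists on the same subnet: switch to it.
print(next_gateway("10.0.0.1", "10.0.0.0/24", ["10.0.0.1", "10.0.0.2"]))
# No alternate on that subnet: keep using the "dead" gateway.
print(next_gateway("10.0.0.1", "10.0.0.0/24", ["10.0.0.1", "192.168.1.1"]))
```

The second call is the case that matters: with nothing valid to switch to, the function returns the original gateway rather than nothing.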
7. DGWD should be used when a primary connection does not work. DGWD provides
a way to maintain sub-optimal connectivity when a primary connection does not work.
This is different from gateway optimization, which chooses one gateway to use in
preference to another based on their relative performance. A dead gateway has NO
connectivity. Microsoft's implementation tries to switch gateways when a gateway's
performance on the most current communications drops, even though the gateway may
still be functioning and the system may even know it is still functioning. This
is actually an optimization, except that the system has no means to determine if
the switch improved performance, or if it should be switched back. DGWD should be designed
to deal with dead gateways and not with sub-optimal gateways.
MS's KB defines (well, as of NT 4.0...) the dead gateway algorithm
(http://support.microsoft.com/kb/171564/en-us).
The criteria at the bottom of this article show that what is being done is dead
route detection. I quote, "When any TCP connection that is routed through the default
gateway has attempted to send a TCP packet to the destination a number of times
equal to one-half of the registry value TcpMaxDataRetransmissions, but receives
no response, the algorithm changes the Route Cache Entry (RCE) for that one remote
IP address to use the next default gateway in the list." This is interesting because
1) the parameter TcpMaxDataRetransmissions has been subverted, since the system really
won't retry that many times; and 2) it's not clear what it means to change an RCE
for a remote IP address if the NIC is serving traffic from the internet, where the
default route (0.0.0.0) would be the one used. But note that this test is specific
to a particular IP destination, which may have its own unique connectivity issues
that lie far beyond the gateway.
Completing the excerpt from the article, "When 25 percent of the TCP connections
have moved to the next default gateway, the algorithm advises IP to change the default
gateway for the whole computer to the one that the connections are now using." There
are two issues here: 1) The gateway may not be dead at all. Only 25% of the "connections"
have to be having problems. Why 25%? What if the gateway is serving a small business
that has light traffic, with few connections? What if it is 4am and there is only
one connection and it times out? There seem to be many situations in which these
criteria may lead to a gateway being declared dead when a single remote destination
has problems. This is NOT what the RFC contemplated. 2) The article says that the
"default gateway for the whole computer" will be changed. But as argued above, each
NIC really needs its own default gateway, and if the NICs serve different subnets,
the gateways must be on different subnets, and thus must be different. So changing
the default gateway for the whole computer is a contradiction in terms,
and if it were done it could easily break one or more NICs. Guess what? It does.
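To see how easily the two quoted criteria can trip, here is a rough simulation. It is a sketch of my reading of the KB text, not Microsoft's code: the function name is hypothetical, the value 5 for TcpMaxDataRetransmissions is an assumed setting, and the way "one-half" is rounded and the "at least 25 percent" comparison are my guesses.

```python
def kb_switches_default_gateway(unanswered_retries, max_retransmissions=5,
                                threshold=0.25):
    """Apply the two quoted KB criteria to a list of TCP connections.

    unanswered_retries holds, per connection, the number of sends that got
    no response. Returns True if the default gateway for the whole computer
    would be changed under this reading of the article.
    """
    if not unanswered_retries:
        return False
    # "one-half of the registry value TcpMaxDataRetransmissions" (rounding assumed)
    retry_limit = max_retransmissions / 2
    moved = sum(1 for retries in unanswered_retries if retries >= retry_limit)
    # "When 25 percent of the TCP connections have moved..."
    return moved / len(unanswered_retries) >= threshold


# 4am, one connection, one unreachable remote host: the gateway is switched.
print(kb_switches_default_gateway([3]))
# Busy period, ten connections, same single bad remote host: no switch.
print(kb_switches_default_gateway([0] * 9 + [3]))
```

The same single unreachable destination produces opposite outcomes depending only on how many other connections happen to be open, which is the weakness of a percentage test under light traffic.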
New criteria need to be developed which identify a dead gateway. If any connections
are working through a gateway, it isn't dead. Using a percentage of connections
is unworkable for NICs providing any server function that might have periods of
light traffic, since one unreliable connection could trip a dead gateway response.
Choosing an optimum gateway may be a desirable feature for some users, but it should
not be part of the core TCP/IP implementation.
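The rule "if any connections are working through a gateway, it isn't dead" is simple enough to state as code. This is a sketch of the proposed criterion only; the function name is hypothetical, and treating "no traffic at all" as "not dead" is my own assumption.

```python
def gateway_is_dead(got_response):
    """Proposed test: dead only if no connection through the gateway works.

    got_response holds one boolean per active connection routed via the
    gateway, True if that connection has recently received any reply.
    """
    if not got_response:
        return False  # assumption: no traffic means no evidence of death
    return not any(got_response)


# Three of four connections are failing, but one works: not dead.
print(gateway_is_dead([True, False, False, False]))
# Nothing is getting through: dead.
print(gateway_is_dead([False, False]))
```

This does not settle the case of a single connection that times out on its own, where a dead gateway and a dead remote host look the same; it only guarantees that a gateway carrying any working traffic is never abandoned.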
8. When a gateway is declared dead, the list of alternate gateways should include
all other default gateways and all multiple default gateways configured on all NICs
on that subnet.
A gateway serves a subnet. All the gateways on that subnet that are known to the
computer should be in the list of backup gateways. So if another NIC on the same
subnet has a different gateway, it should be an alternate. If the NIC in question,
or any other NIC on that subnet, has multiple gateways defined, these should be in
the list. Obviously one needs to be selected, but in practice this should probably
be addressed by a manually configured metric.
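Assembling that list might look like the following. It is a sketch under stated assumptions: the data layout (a list of NICs, each with a CIDR address and a list of gateway/metric pairs) and the function name are hypothetical, and ordering by lowest metric is one reading of "a manually configured metric".

```python
import ipaddress


def alternate_gateways(dead, nics):
    """List alternates for a dead gateway, lowest metric first.

    nics is a list of dicts with 'address' (CIDR string) and 'gateways'
    (list of (gateway_ip, metric) pairs). Only NICs whose subnet contains
    the dead gateway contribute candidates.
    """
    dead_ip = ipaddress.ip_address(dead)
    best_metric = {}
    for nic in nics:
        network = ipaddress.ip_interface(nic["address"]).network
        if dead_ip not in network:
            continue  # this NIC is on a different subnet
        for gateway, metric in nic["gateways"]:
            if gateway == dead:
                continue
            # Keep the lowest metric seen for a gateway listed on several NICs.
            best_metric[gateway] = min(metric, best_metric.get(gateway, metric))
    return sorted(best_metric,
                  key=lambda g: (best_metric[g], ipaddress.ip_address(g)))


nics = [
    {"address": "10.0.0.10/24", "gateways": [("10.0.0.1", 1), ("10.0.0.2", 20)]},
    {"address": "10.0.0.11/24", "gateways": [("10.0.0.3", 10)]},
    {"address": "192.168.1.10/24", "gateways": [("192.168.1.1", 1)]},
]
print(alternate_gateways("10.0.0.1", nics))
```

In this example the gateway on the 192.168.1.0/24 NIC never appears in the list, while the gateway configured on the second 10.0.0.0/24 NIC does, ahead of the higher-metric one on the first NIC.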
9. The multiple default gateway warning message offered by the network connection
properties dialogs should be changed or eliminated.
As mentioned above, all the MS documentation pertaining to DGWD is wrong, including
the warning dialog that comes up when a second default gateway is entered. First,
the DGWD needs to be fixed. Then this message needs to be changed.
10. If DGWD is to be enabled by default (as it is now), its parameters should be
settable via dialog boxes rather than regedit.
Right now, DGWD is a time bomb for multihomed computers. Turning it off requires
finding the documentation and then creating a registry value. The parameters for
how DGWD works apparently are not accessible. These should all have a dialog box
for configuration.
11. The logical steps of the DGWD should be published, including not just the criteria,
but how the list of alternate gateways is assembled and ordered, and the criteria
used to select an alternate gateway.
Glossary
NIC: Network Interface Controller: a physical device that sits in a host and provides
a wired or wireless connection to another device on a network.
Network Connection: A Windows construct: a logical grouping of specifications that
apply to one specific NIC in a Windows computer.
Interface: Used in network definitions that aren't specific to the OS; a pointer
to a physical NIC.
These three entities exist one-to-one-to-one. The presence of one NIC in a Windows
computer means there will be one and only one Network Connection and one and only
one interface associated with that NIC.
Gateway: a network device with knowledge of and connections to more than one
subnet.
Comments? Combine the word "richard" with the words "server guy" with no spaces
or punctuation and stick it in front of the domain this page is hosted on. That
should get an electronic message to me.
(c) 2006 Richard Skerritt