Hi,
Lately we encountered a network issue with one of our virtualised Windows 2003 servers.
The symptoms:
Long downloads fail. There is no error; the data stream just stops.
Speed drops to zero and stays there.
But not always: the behaviour was pretty unpredictable.
At first we thought this was an IIS issue, so we began to search in that direction.
Changed some parameters, fiddled a bit with the settings…
But no, we were wrong. This was proved by installing Apache on the system and running into the very same problem.
We decided to put a sniffer between our server and another testing machine, only to discover a LOT of bad TCP/IP packets.
A bit demotivated, we began a seemingly endless journey on the internet, searching for people who had a problem that resembled ours.
Until we found a post about someone with a Windows 2000 – XenServer driver issue.
(http://forums.citrix.com/message.jspa?messageID=1337520)
Our attention turned to the Windows drivers, and not much later the almighty Google came up with another Citrix post: (http://forums.citrix.com/thread.jspa?threadID=234961&tstart=0).
This issue kind of resembles an old problem we used to have with XenServer 3.2 and one of our servers… Anyway, that’s not the problem here, but it does kind of prove there is something fishy with these Xen PV drivers. (The Citrix people even admitted it on that page!)
So finally, we fixed it by disabling TCP/IP offloading in Windows.
This way Windows, not the Xen network card, calculates the TCP checksums.
However, this has one downside: it kind of hogs the first CPU.
I managed to get 100% CPU usage on CPU0 just by downloading stuff through IIS, so make sure not too many services are sitting on CPU0 only! (I reconfigured MS SQL to use all the CPUs but CPU0 to prevent the server from running into problems.)
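For anyone who wants to keep SQL Server off CPU0 as well, here is a rough sketch of how the bitmask could be worked out. The post does not say how the reconfiguration was done, so this assumes SQL Server's `affinity mask` option of `sp_configure`, where bit 0 stands for CPU0, bit 1 for CPU1, and so on. The function names are made up for illustration, and the generated statements are only printed, not executed.

```python
def affinity_mask_without_cpu0(cpu_count):
    """Return a bitmask with one bit per CPU, all set except bit 0 (CPU0)."""
    # The option is treated here as a 32-bit signed value, so stay below 32 CPUs.
    if not 2 <= cpu_count <= 31:
        raise ValueError("this sketch handles 2 to 31 CPUs")
    all_cpus = (1 << cpu_count) - 1   # e.g. 4 CPUs -> 0b1111
    return all_cpus & ~1              # clear bit 0 -> 0b1110


def sql_for_mask(mask):
    """Build the T-SQL text (assumed approach) that would apply the mask."""
    return (
        "EXEC sp_configure 'show advanced options', 1; RECONFIGURE;\n"
        "EXEC sp_configure 'affinity mask', {0}; RECONFIGURE;".format(mask)
    )


if __name__ == "__main__":
    mask = affinity_mask_without_cpu0(4)
    print(mask)                # 14, i.e. CPU1, CPU2 and CPU3
    print(sql_for_mask(mask))
```

On a 4-CPU guest this gives a mask of 14, which leaves CPU0 free for the network stack. Check the statements against your own SQL Server version before running them.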
So, the key to Windows 2003 on XenServer without TCP/IP related headaches is located in the registry at:
HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters\
DisableTaskOffload=1
(DWORD value. 1 means offloading is disabled, 0 means it is enabled. By default this value will not be there; you can just add it.)
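If you have several servers to fix, a .reg file saves some clicking. Below is a small sketch that builds the text of such a file, with DisableTaskOffload set to 1 (the value that turns offloading off). The helper name `reg_file` is just an illustration. The script only prints the text: save it as a .reg file and import it yourself after checking it.

```python
# Registry path taken from the post; the helper below is a hypothetical convenience.
KEY = r"HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters"


def reg_file(values):
    """Return .reg file text that sets the given DWORD values under KEY."""
    lines = ["Windows Registry Editor Version 5.00", "", "[{0}]".format(KEY)]
    for name, number in values.items():
        # DWORDs are written as eight hexadecimal digits in .reg files.
        lines.append('"{0}"=dword:{1:08x}'.format(name, number))
    return "\r\n".join(lines) + "\r\n"


if __name__ == "__main__":
    print(reg_file({"DisableTaskOffload": 1}))
```

Passing a dictionary keeps the helper usable for any other DWORD under the same key.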
For the really adventurous people out there: you can try RSS (Receive-Side Scaling), which should make the other CPUs available for handling NIC packets.
Personally, I did not test it, but keep in mind that if you DO enable it, the TCP checksum calculation can start having an impact on ALL of your CPUs.
The key:
HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters\
EnableRSS=1
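Before changing anything, it can help to see what is currently set. The sketch below reads both values back with Python's standard `winreg` module. It assumes a Python 3 interpreter is available on the machine, which may well not be the case on an old Windows 2003 guest. The helper names are made up, and on non-Windows systems the lookup simply returns nothing.

```python
import sys

# Path below HKEY_LOCAL_MACHINE, as given in the post.
TCPIP_PARAMS = r"SYSTEM\CurrentControlSet\Services\Tcpip\Parameters"


def read_dword(name):
    """Return the value called `name` under the Tcpip parameters key, or None."""
    if sys.platform != "win32":
        return None  # no registry to read outside Windows
    import winreg
    try:
        with winreg.OpenKey(winreg.HKEY_LOCAL_MACHINE, TCPIP_PARAMS) as key:
            value, _value_type = winreg.QueryValueEx(key, name)
            return value
    except FileNotFoundError:
        return None  # the value has not been created yet


def describe(name, value):
    """Format one line of the report."""
    if value is None:
        return "{0}: not set".format(name)
    return "{0}: {1}".format(name, value)


if __name__ == "__main__":
    for name in ("DisableTaskOffload", "EnableRSS"):
        print(describe(name, read_dword(name)))
```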
Now if only Citrix would fix this out of the box, we could be happy!
Greets,
Koen