Getting Xen networking to work on a Dell PowerEdge 2950
What happened to our networking?
Here we were, admiring our brand new Dell PowerEdge 2950 server, which we were going to use as a Xen host. Fedora Core 6 installed without a hitch, but when we booted the Xen kernel, Dom0 lost connectivity on the first network card. The second NIC, which Xen’s bridging setup had not touched, still worked like a charm. So, regretfully, we set out to learn once again things we would rather have left unlearned…
Dell, Broadcom and the art of technical support
After googling around a bit, we found that the problem was hardly a new one. It appears that the Broadcom NetXtreme II BCM5708 NICs that ship with the Dell PowerEdge 2950 do not play ball with Xen’s bridging setup, at least not when support for Dell’s IPMI card (DRAC) is enabled. Xen does some internal tweaking with your network interfaces and ends up assigning eth0’s MAC address to a new virtual interface, which becomes Dom0’s eth0. The original eth0 gets renamed to peth0 and is given a MAC address of fe:ff:ff:ff:ff:ff, since under normal circumstances this interface should not be visible on the LAN.
It appears that the Broadcom BCM5708 NICs filter out all traffic that is not directed at the NIC’s physical MAC address, at least when the NIC’s management function is enabled. The funny thing is that Dell acknowledged this problem as far back as the summer of 2006, but they haven’t seen fit to bring out a firmware update to solve it. Calling Dell Support brought us no further: nothing is known about an eventual release of a patch. How they are going to cope with the new SUSE Linux Enterprise Server 10 and Red Hat Enterprise Linux 5, both of which proudly announce the wonders of Xen virtualisation, I don’t know.
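To see the renaming described above on your own Dom0, you can compare the MAC addresses that eth0 and peth0 report. The helpers below are only a sketch: link_mac and is_xen_placeholder_mac are names we made up, and the placeholder value fe:ff:ff:ff:ff:ff is an assumption taken from the bridging script shown further down.

```shell
#!/bin/sh
# Sketch only: link_mac and is_xen_placeholder_mac are hypothetical helpers,
# not part of Xen's scripts.

# Read the output of "ip link show DEV" on stdin and print its MAC address.
link_mac() {
    sed -n 's/.*link\/ether \(..:..:..:..:..:..\).*/\1/p'
}

# Return 0 if $1 is the placeholder address that the stock bridging
# setup is assumed to put on the physical device.
is_xen_placeholder_mac() {
    [ "$1" = "fe:ff:ff:ff:ff:ff" ]
}

# Usage on a live Dom0 (needs iproute2); left commented out so that the
# helpers can be sourced without touching the system:
# for dev in eth0 peth0; do
#     mac=$(ip link show "$dev" | link_mac)
#     if is_xen_placeholder_mac "$mac"; then
#         echo "$dev: $mac (placeholder, the NIC may drop traffic)"
#     else
#         echo "$dev: $mac"
#     fi
# done
```

If peth0 shows the placeholder while the management function is enabled, you are most likely looking at the problem described here.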
Two solutions for the problem
The first, quick and dirty workaround is to disable the management functions on your Broadcom NICs. Since this means going without IPMI support and forgoing remote reboot and serial text console functionality, that’s more of a choice between Scylla and Charybdis. Anyway, this workaround is documented at:
How to disable the management function on Broadcom NetXtreme II BCM5708 NIC
We chose a second solution, which is slightly more complex but has the advantage of keeping IPMI connectivity intact. Since the Broadcom NICs require that the physical NIC’s MAC address remains unchanged, we had to tweak Xen’s bridging scripts to allow this. Dom0’s network interfaces, which normally are assigned the physical NICs’ MAC addresses, get a locally administered MAC address out of Xen’s MAC address range instead. Beware: you should take care to assign different addresses to all Xen servers on the same subnet, or you may experience connectivity problems to your Dom0s.
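One way to avoid handing out the same address twice is to derive it from something that already differs per server, such as the host name. This is merely a sketch of that idea, not what we ran ourselves: xen_mac_for is a made-up helper, and since it keeps only 24 bits of a checksum, two names can still collide, so do compare the results across your servers.

```shell
#!/bin/sh
# Sketch only: xen_mac_for is a hypothetical helper.
# It maps a name plus an interface index to an address in 00:16:3e:xx:xx:xx.

xen_mac_for() {
    name=$1
    index=${2:-0}
    # cksum prints "<crc> <length>"; keep the CRC, a decimal number
    crc=$(printf '%s/%s' "$name" "$index" | cksum | cut -d' ' -f1)
    # the low 24 bits of the CRC become the last three octets
    printf '00:16:3e:%02x:%02x:%02x\n' \
        $(( (crc >> 16) & 255 )) $(( (crc >> 8) & 255 )) $(( crc & 255 ))
}

# One address per bridged interface of this host
xen_mac_for "$(uname -n)" 0
xen_mac_for "$(uname -n)" 1
```

The same name and index always give the same address, so the result can be written into a script once and left alone.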
So here’s what we did:
In your Xen config file, found at /etc/xen/xend-config.sxp in our installation, change the default network script to a custom one:
#(network-script network-bridge)
# add this one
(network-script my-network-bridge)
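If you would rather script this change than edit the file by hand, something along these lines should do. It is a sketch: use_custom_bridge_script is a made-up helper, and it assumes the active line reads exactly (network-script network-bridge), as it did in our file.

```shell
#!/bin/sh
# Sketch only: use_custom_bridge_script is a hypothetical helper.
# It reads a xend config on stdin, comments out the stock network-script
# line and adds the custom one right below it.

use_custom_bridge_script() {
    sed 's/^(network-script network-bridge)[[:space:]]*$/#&\
(network-script my-network-bridge)/'
}

# Usage, working from a backup copy (commented out on purpose):
# cp /etc/xen/xend-config.sxp /root/xend-config.sxp.orig
# use_custom_bridge_script < /root/xend-config.sxp.orig > /etc/xen/xend-config.sxp
```

Lines that are already commented out are left alone, so running it a second time changes nothing.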
Then you have to create your custom bridging script. Place it in /etc/xen/scripts/my-network-bridge, or in whichever directory your Xen scripts reside. The script should contain the following lines; the second network-bridge call is not strictly necessary if you only want to bridge one network interface:
#!/bin/sh
dir=$(dirname "$0")
# The mac addresses you assign here should be different for every server.
# Also, watch out for conflicts with your DomUs' mac addresses.
# Use the 00:16:3e:xx:xx:xx range, which is Xen's assigned range
"$dir/network-bridge" "$@" vifnum=0 macaddr=00:16:3e:00:00:00
"$dir/network-bridge" "$@" vifnum=1 macaddr=00:16:3e:00:00:01
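Before restarting xend it does not hurt to check the addresses you typed in. The following is a sketch with two made-up helpers, is_xen_mac and has_duplicates; it only checks the form of the addresses and cannot tell whether another machine on the LAN already uses one of them.

```shell
#!/bin/sh
# Sketch only: is_xen_mac and has_duplicates are hypothetical helpers.

# Return 0 if $1 is a well-formed address inside 00:16:3e:xx:xx:xx.
is_xen_mac() {
    printf '%s\n' "$1" | grep -Eiq '^00:16:3e(:[0-9a-f]{2}){3}$'
}

# Return 0 if any argument occurs more than once, ignoring case.
has_duplicates() {
    dups=$(printf '%s\n' "$@" | tr 'A-F' 'a-f' | sort | uniq -d)
    [ -n "$dups" ]
}

# Example: the two addresses used in the wrapper script above
for m in 00:16:3e:00:00:00 00:16:3e:00:00:01; do
    is_xen_mac "$m" || echo "not in Xen's range: $m" >&2
done
if has_duplicates 00:16:3e:00:00:00 00:16:3e:00:00:01; then
    echo "duplicate address in list" >&2
fi
```

Feeding has_duplicates the Dom0 addresses of all your servers together with those of your DomUs covers the conflicts mentioned in the script comments.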
Then finally, modify your network-bridge script like this; each change is flagged by the comment that precedes it:
#============================================================================
# Default Xen network start/stop script.
# Xend calls a network script when it starts.
# The script name to use is defined in /etc/xen/xend-config.sxp
# in the network-script field.
#
# This script creates a bridge (default xenbr${vifnum}), adds a device
# (default eth${vifnum}) to it, copies the IP addresses from the device
# to the bridge and adjusts the routes accordingly.
#
# If all goes well, this should ensure that networking stays up.
# However, some configurations are upset by this, especially
# NFS roots. If the bridged setup does not meet your needs,
# configure a different script, for example using routing instead.
#
# Usage:
#
# network-bridge (start|stop|status) {VAR=VAL}*
#
# Vars:
#
# vifnum Virtual device number to use (default 0). Numbers >=8
# require the netback driver to have nloopbacks set to a
# higher value than its default of 8.
# bridge The bridge to use (default xenbr${vifnum}).
# netdev The interface to add to the bridge (default eth${vifnum}).
# antispoof Whether to use iptables to prevent spoofing (default no).
#
# Internal Vars:
# pdev="p${netdev}"
# vdev="veth${vifnum}"
# vif0="vif0.${vifnum}"
#
# start:
# Creates the bridge
# Copies the IP and MAC addresses from netdev to vdev
# Renames netdev to be pdev
# Renames vdev to be netdev
# Enslaves pdev, vdev to bridge
#
# stop:
# Removes netdev from the bridge
# Transfers addresses, routes from netdev to pdev
# Renames netdev to vdev
# Renames pdev to netdev
# Deletes bridge
#
# status:
# Print addresses, interfaces, routes
#
#============================================================================
dir=$(dirname "$0")
. "$dir/xen-script-common.sh"
. "$dir/xen-network-common.sh"
findCommand "$@"
evalVariables "$@"
vifnum=${vifnum:-$(ip route list | awk '/^default / { print $NF }' | sed 's/^[^0-9]*//')}
vifnum=${vifnum:-0}
bridge=${bridge:-xenbr${vifnum}}
netdev=${netdev:-eth${vifnum}}
antispoof=${antispoof:-no}
# add a new macaddr parameter to the script
macaddr=${macaddr:-ff:ff:ff:ff:ff:ff}
pdev="p${netdev}"
vdev="veth${vifnum}"
vif0="vif0.${vifnum}"
get_ip_info() {
    addr_pfx=`ip addr show dev $1 | egrep '^ *inet' | sed -e 's/ *inet //' -e 's/ .*//'`
    gateway=`ip route show dev $1 | fgrep default | sed 's/default via //'`
}
do_ifup() {
    if ! ifup $1 ; then
        if [ ${addr_pfx} ] ; then
            # use the info from get_ip_info()
            ip addr flush $1
            ip addr add ${addr_pfx} dev $1
            ip link set dev $1 up
            [ ${gateway} ] && ip route add default via ${gateway}
        fi
    fi
}
# Usage: transfer_addrs src dst
# Copy all IP addresses (including aliases) from device $src to device $dst.
transfer_addrs () {
    local src=$1
    local dst=$2
    # Don't bother if $dst already has IP addresses.
    if ip addr show dev ${dst} | egrep -q '^ *inet ' ; then
        return
    fi
    # Address lines start with 'inet' and have the device in them.
    # Replace 'inet' with 'ip addr add' and change the device name $src
    # to 'dev $src'.
    ip addr show dev ${src} | egrep '^ *inet ' | sed -e "
s/inet/ip addr add/
s@\([0-9]\+\.[0-9]\+\.[0-9]\+\.[0-9]\+/[0-9]\+\)@\1@
s/${src}/dev ${dst}/
" | sh -e
    # Remove automatic routes on destination device
    ip route list | sed -ne "
/dev ${dst}\( \|$\)/ {
  s/^/ip route del /
  p
}" | sh -e
}
# Usage: transfer_routes src dst
# Get all IP routes to device $src, delete them, and
# add the same routes to device $dst.
# The original routes have to be deleted, otherwise adding them
# for $dst fails (duplicate routes).
transfer_routes () {
    local src=$1
    local dst=$2
    # List all routes and grep the ones with $src in.
    # Stick 'ip route del' on the front to delete.
    # Change $src to $dst and use 'ip route add' to add.
    ip route list | sed -ne "
/dev ${src}\( \|$\)/ {
  h
  s/^/ip route del /
  P
  g
  s/${src}/${dst}/
  s/^/ip route add /
  P
  d
}" | sh -e
}
##
# link_exists interface
#
# Returns 0 if the interface named exists (whether up or down), 1 otherwise.
#
link_exists()
{
    if ip link show "$1" >/dev/null 2>/dev/null
    then
        return 0
    else
        return 1
    fi
}
# Set the default forwarding policy for $dev to drop.
# Allow forwarding to the bridge.
antispoofing () {
    iptables -P FORWARD DROP
    iptables -F FORWARD
    iptables -A FORWARD -m physdev --physdev-in ${pdev} -j ACCEPT
    iptables -A FORWARD -m physdev --physdev-in ${vif0} -j ACCEPT
}
# Usage: show_status dev bridge
# Print ifconfig and routes.
show_status () {
    local dev=$1
    local bridge=$2
    echo '============================================================'
    ip addr show ${dev}
    ip addr show ${bridge}
    echo ' '
    brctl show ${bridge}
    echo ' '
    ip route list
    echo ' '
    route -n
    echo '============================================================'
}
op_start () {
    if [ "${bridge}" = "null" ] ; then
        return
    fi
    if ! link_exists "$vdev"; then
        if link_exists "$pdev"; then
            # The device is already up.
            return
        else
            echo "
Link $vdev is missing.
This may be because you have reached the limit of the number of interfaces
that the loopback driver supports. If the loopback driver is a module, you
may raise this limit by passing it as a parameter (nloopbacks=<N>); if the
driver is compiled statically into the kernel, then you may set the parameter
using loopback.nloopbacks=<N> on the domain 0 kernel command line.
" >&2
            exit 1
        fi
    fi
    create_bridge ${bridge}
    if link_exists "$vdev"; then
        mac=`ip link show ${netdev} | grep 'link\/ether' | sed -e 's/.*ether \(..:..:..:..:..:..\).*/\1/'`
        preiftransfer ${netdev}
        transfer_addrs ${netdev} ${vdev}
        if ! ifdown ${netdev}; then
            # If ifdown fails, remember the IP details.
            get_ip_info ${netdev}
            ip link set ${netdev} down
            ip addr flush ${netdev}
        fi
        ip link set ${netdev} name ${pdev}
        ip link set ${vdev} name ${netdev}
        # do not use the default bridging function as this resets the mac address
        # setup_bridge_port ${pdev}
        # instead use the code below, which leaves the mac address intact
        ip link set ${pdev} down
        ip link set ${pdev} arp off
        ip link set ${pdev} multicast off
        ip addr flush ${pdev}
        setup_bridge_port ${vif0}
        # do not assign pethX's mac address to Dom0's ethX
        # ip link set ${netdev} addr ${mac} arp on
        # instead use the $macaddr parameter we passed to the script
        ip link set ${netdev} addr ${macaddr} arp on
        ip link set ${bridge} up
        add_to_bridge ${bridge} ${vif0}
        add_to_bridge2 ${bridge} ${pdev}
        do_ifup ${netdev}
    else
        # old style without ${vdev}
        transfer_addrs ${netdev} ${bridge}
        transfer_routes ${netdev} ${bridge}
    fi
    if [ ${antispoof} = 'yes' ] ; then
        antispoofing
    fi
}
op_stop () {
    if [ "${bridge}" = "null" ]; then
        return
    fi
    if ! link_exists "$bridge"; then
        return
    fi
    if link_exists "$pdev"; then
        ip link set dev ${vif0} down
        mac=`ip link show ${netdev} | grep 'link\/ether' | sed -e 's/.*ether \(..:..:..:..:..:..\).*/\1/'`
        transfer_addrs ${netdev} ${pdev}
        if ! ifdown ${netdev}; then
            get_ip_info ${netdev}
        fi
        ip link set ${netdev} down arp off
        ip link set ${netdev} addr fe:ff:ff:ff:ff:ff
        ip link set ${pdev} down
        ip addr flush ${netdev}
        # do not reassign pethX's mac address, since it hasn't changed
        # ip link set ${pdev} addr ${mac} arp on
        # but do switch ARP back on, which the line above used to do
        # (op_start turned it off on ${pdev})
        ip link set ${pdev} arp on
        brctl delif ${bridge} ${pdev}
        brctl delif ${bridge} ${vif0}
        ip link set ${bridge} down
        ip link set ${netdev} name ${vdev}
        ip link set ${pdev} name ${netdev}
        do_ifup ${netdev}
    else
        transfer_routes ${bridge} ${netdev}
        ip link set ${bridge} down
    fi
    brctl delbr ${bridge}
}
# adds $dev to $bridge but waits for $dev to be in running state first
add_to_bridge2() {
    local bridge=$1
    local dev=$2
    local maxtries=10
    echo -n "Waiting for ${dev} to negotiate link."
    ip link set ${dev} up
    for i in `seq ${maxtries}` ; do
        if ifconfig ${dev} | grep -q RUNNING ; then
            break
        else
            echo -n '.'
            sleep 1
        fi
    done
    if [ ${i} -eq ${maxtries} ] ; then echo '(link isnt in running state)' ; fi
    add_to_bridge ${bridge} ${dev}
}
case "$command" in
    start)
        op_start
        ;;
    stop)
        op_stop
        ;;
    status)
        show_status ${netdev} ${bridge}
        ;;
    *)
        echo "Unknown command: $command" >&2
        echo 'Valid commands are: start, stop, status' >&2
        exit 1
esac
Finally, comment out the MAC addresses (the HWADDR lines) in the interface configuration files used by ifup; otherwise your OS will complain that eth0 and eth1 are being set up with MAC addresses different from those listed in these files:
DEVICE=eth0
BOOTPROTO=static
BROADCAST=192.168.255.255
#HWADDR=00:1A:A0:05:12:AB
IPADDR=192.168.0.11
NETMASK=255.255.0.0
NETWORK=192.168.0.0
ONBOOT=yes
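With more than one interface or server to change, this edit can be scripted as well. Again, this is only a sketch: comment_out_hwaddr is a made-up helper, and the file location in the usage lines is the usual Fedora one, which we assume here rather than state as fact for your system.

```shell
#!/bin/sh
# Sketch only: comment_out_hwaddr is a hypothetical helper.
# It reads an interface configuration file on stdin and comments out
# its HWADDR line, leaving everything else untouched.

comment_out_hwaddr() {
    sed 's/^HWADDR=/#HWADDR=/'
}

# Usage, working from backup copies kept outside the configuration
# directory (commented out on purpose; the path is an assumption):
# for dev in eth0 eth1; do
#     cfg=/etc/sysconfig/network-scripts/ifcfg-$dev
#     cp "$cfg" "/root/ifcfg-$dev.orig"
#     comment_out_hwaddr < "/root/ifcfg-$dev.orig" > "$cfg"
# done
```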
Restart your services as needed, and you’ll be able to connect to your network interfaces without a hitch!