For the last few years we have been running IBM x3850 x5 servers in our cloud, with Emulex 10G NICs for all of our networking needs. One of our services was starting to run out of steam, so we picked up a pair of HP DL580 G8 servers for the needed horsepower. These servers were configured with Intel-based 10G NICs. At the time of the purchase, our main concern was getting maximum CPU performance, and minimal attention was paid to 10G NIC selection.
As we put these servers into production, we noticed that the VMware iSCSI performance on the new HP hosts seemed especially good. The E7-8891v2 Xeons undoubtedly play a large role in the improved performance, but we began to wonder how much of the improvement could be attributed to the 10G NICs. Not one to let such thoughts sit idle for long, I asked our friends at Zones if we could borrow a few 10G NICs for performance evaluation. Within a few weeks I had dual-port Intel and Mellanox 10G NICs in hand for testing, in addition to the Emulex NICs already present in our IBM hosts.
The VMware virtualization environment is a very flexible and powerful tool for delivering a variety of computing services. The flexibility this product has brought to the IT world has made many nearly impossible actions simple and error free. One of the more distinctive features it introduced is vMotion: moving a running virtual computer from one physical host to another without downtime. Cool!
All great until it fails.
Just the other day, I saw a situation where a VM was moved from one host to another and it disappeared from the network. Check vCenter – everything is running. Even other VMs on the same networks are working fine. What?
This is a case of ARP cache poisoning, in the sense that the switches are left holding stale address entries. Each network interface has a unique address on the network. The switches keep a database of these addresses and map each one to a specific switch port. If you move a device from one port on a switch to another, the switch notices that the interface has disappeared and sends out a query asking where address X is. On a single switch, that is easy. When you daisy-chain switches, each switch also keeps a record of the addresses that can be found on neighboring switches. The more switches between the initial location of the VM and its final location, the longer it takes for all of them to catch up. A typical timeout per switch is 5 minutes.
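To make the aging behavior concrete, here is a minimal Python sketch of a single switch's address table. It is a toy model, not how any particular vendor implements it: the `ToySwitch` class, the port numbers, and the MAC address are all hypothetical, and the 300-second aging value simply mirrors the 5-minute timeout mentioned above.

```python
from dataclasses import dataclass, field

AGING_SECONDS = 300  # assumed: the "typical" 5-minute timeout; real switches vary


@dataclass
class ToySwitch:
    """Toy model of a switch address table: MAC -> (port, last_seen)."""
    aging: int = AGING_SECONDS
    table: dict = field(default_factory=dict)

    def learn(self, mac: str, port: int, now: int) -> None:
        # A switch learns a location from the source address of a frame it receives.
        self.table[mac] = (port, now)

    def lookup(self, mac: str, now: int):
        # Return the known port, or None if the MAC is unknown or has aged out.
        entry = self.table.get(mac)
        if entry is None:
            return None
        port, last_seen = entry
        if now - last_seen >= self.aging:
            del self.table[mac]  # entry expired; the switch no longer knows the port
            return None
        return port


switch = ToySwitch()
vm_mac = "00:50:56:aa:bb:cc"  # hypothetical VM MAC address

switch.learn(vm_mac, port=1, now=0)       # VM last seen on port 1 at t=0
# The VM then moves behind port 7, but this switch sees no frame from it yet.
stale = switch.lookup(vm_mac, now=60)     # still answers with the old port
expired = switch.lookup(vm_mac, now=300)  # entry has aged out
switch.learn(vm_mac, port=7, now=301)     # first frame seen from the new location
fresh = switch.lookup(vm_mac, now=302)

print(stale, expired, fresh)  # 1 None 7
```

Until the entry expires or the switch sees a frame from the new port, traffic for the VM keeps going to the old port, which is the window in which the VM appears to have vanished.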
It is this multi-switch configuration that caused the problem we saw. Switches were originally designed for rare changes in the network topology. Virtualization has changed that. When VMware moves a VM from one host to another in a large environment, the machine might get dropped on a switch many hops away from the original switch. Depending on switch configurations, it can take from 5 to 30 minutes before the switches discover where the moved VM has shown up. OUCH! That is an outage!
So how to fix it? Immediately flush the ARP cache on all the switches when you first notice the problem. For a long-term fix, see the suggestions below:
For a Cisco switch environment, disable ARP caching and enable ARP masquerading.
For a Force10 switch, run this command: mac-address-table station-move refresh-arp