The VMware virtualization environment is a very flexible and powerful tool for delivering a variety of computing services. The flexibility this product has brought to the IT world has made many nearly impossible actions simple and far less error-prone. One of its most distinctive features is vMotion: moving a running virtual machine from one physical host to another without downtime. Cool!
All great until it fails.
Just the other day, I saw a situation where a VM was moved from one host to another and it disappeared from the network. Check vCenter – everything is running. Even other VMs on the same networks are working fine. What?
This is a case of stale address caching on the switches, often loosely called ARP cache poisoning. Each network interface has a unique (MAC) address on the network. The network switches keep a table of these addresses and map each one to a specific switch port. If you move a device from one port on a switch to another, the switch has to notice that the interface has disappeared from the old port and relearn where address X now lives. On one switch, easy. When you daisy-chain network switches, each switch also keeps a record of the addresses that can be reached through its neighboring switches. The more switches between the initial location of the VM and its final location, the longer the update takes. A typical timeout per switch is 5 minutes.
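To make the address table behavior described above concrete, here is a minimal Python sketch of a single switch. It is a toy model, not how any vendor actually implements it: the `Switch` class, the MAC address, and the port numbers are made up for illustration, and the 300-second aging value is simply the 5-minute figure quoted above.

```python
from dataclasses import dataclass, field

AGING_SECONDS = 300  # the 5-minute figure from the post; real defaults vary by vendor


@dataclass
class Switch:
    """Toy model of one switch's address table: MAC -> (port, last_seen)."""
    aging: int = AGING_SECONDS
    table: dict = field(default_factory=dict)

    def learn(self, mac: str, port: int, now: float) -> None:
        # A switch learns (or relearns) an address when it sees a frame from it.
        self.table[mac] = (port, now)

    def lookup(self, mac: str, now: float):
        # Return the known port for mac, or None if unknown or aged out.
        entry = self.table.get(mac)
        if entry is None:
            return None
        port, last_seen = entry
        if now - last_seen >= self.aging:
            del self.table[mac]  # entry expired, forget it
            return None
        return port


sw = Switch()
vm_mac = "00:50:56:aa:bb:cc"           # hypothetical VM address
sw.learn(vm_mac, port=1, now=0)        # VM first seen behind port 1

# The VM moves behind port 7 but the switch has not heard from it yet.
stale = sw.lookup(vm_mac, now=60)      # still the old port: traffic is misdirected
expired = sw.lookup(vm_mac, now=300)   # entry aged out: the switch must rediscover
sw.learn(vm_mac, port=7, now=301)      # first frame seen from the new location
fresh = sw.lookup(vm_mac, now=302)

print(stale, expired, fresh)  # prints: 1 None 7
```

The window between the move and the relearn is the outage: until the entry expires or the switch hears from the VM at its new port, traffic keeps going to the old one.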
It is this multi-switch configuration that caused the problem we saw. Switches were originally designed for rare changes in the network topology. Virtualization has changed that. When VMware moves a VM from one host to another in a large environment, the machine might land on a switch many hops away from the original switch. Depending on switch configurations, it can take from 5 to 30 minutes before the switches discover where the moved VM has shown up. OUCH! That is an outage!
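As a rough back-of-the-envelope check on those numbers, the sketch below multiplies the number of switch hops by the per-switch timeout. It assumes the simple model used in this post, where each switch on the path times out one after another; real networks may expire entries in parallel or relearn sooner, so treat the result as an upper bound. The function name is made up for illustration.

```python
PER_SWITCH_TIMEOUT_MIN = 5  # per-switch figure quoted in the post


def worst_case_outage_minutes(hops: int, per_switch: int = PER_SWITCH_TIMEOUT_MIN) -> int:
    """Upper bound, assuming each switch on the path ages out in sequence."""
    if hops < 1:
        raise ValueError("hops must be at least 1")
    return hops * per_switch


for hops in (1, 3, 6):
    print(f"{hops} hop(s): up to {worst_case_outage_minutes(hops)} minutes")
```

Under that assumption, one hop gives the 5-minute low end and six hops gives the 30-minute high end of the range above.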
So how to fix it? As soon as you notice the problem, flush the ARP cache on all the switches. To fix it long term, see the suggestions below:
For a Cisco switch environment, disable ARP caching and enable ARP masquerading.
For a Force10 switch, run this command: mac-address-table station-move refresh-arp