In datacenters, Disaster Recovery is not just a concept; it is something we must take into consideration every day. The company/customer makes money if the services we provide are operational. We try to make everything as stable and secure as possible, but we also have to plan for the worst: what if the building is gone tomorrow? For this contingency, we usually identify a location FarFarAway that is not subject to the same risks as the primary location. EZ. Find a location, sign a contract, install and configure the environment and that’s it, right? Ummm… nope. Now comes the hard part.
The hard part for DR has always been replicating data from the production site to the recovery site. You have to make sure all of the data gets across to the recovery site in time and consistently. Since we are a rather large VMware shop, and the entire production environment is now VMs, all we have to do is replicate the VMs and their additional data to the recovery site.
We use a number of tools for replication:
- SAN based replication (Compellent)
- VMware’s Site Recovery Manager (in evaluation at the moment)
- vRanger
- Offsite replication of our daily backups (Quantum DXi)
The SAN replication moves the data that is not encased in a VM, which in our case is file server data. vRanger and VMware’s SRM replicate the VMs. vRanger is cheap, but it is poorly integrated with vCenter and has a rather weak understanding of Recovery Point Objectives (RPOs). SRM is awfully expensive, but it can be managed from the vSphere client and will plan the replication schedule to make sure the replica is never older than the RPO you specify.
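To make the Recovery Point Objective idea concrete, here is a minimal sketch of the check a replication scheduler has to satisfy: no replica may be older than the RPO. The function name, VM names and timestamps are made up for illustration; this is not how SRM or vRanger implement it.

```python
from datetime import datetime, timedelta

def rpo_violations(last_sync, rpo, now):
    """Return the names of replicas whose last successful sync is older than the RPO."""
    return sorted(name for name, synced in last_sync.items() if now - synced > rpo)

# Hypothetical inventory: VM name -> time of last successful replication.
now = datetime(2024, 1, 1, 12, 0)
last_sync = {
    "vm-web01": now - timedelta(minutes=20),
    "vm-db01": now - timedelta(hours=5),
}

# With a 4-hour RPO, only the replica synced 5 hours ago is out of compliance.
print(rpo_violations(last_sync, timedelta(hours=4), now))  # ['vm-db01']
```

A tool with a weak grasp of RPOs just runs jobs on a fixed schedule. A tool that understands them works backwards from a check like this one to decide when each replication has to start.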
All of these are great tools, but I had been banging my head against the wall trying to figure out why the replication processes were unable to make full use of the available bandwidth. SAN to SAN replication was just fine, but vRanger and SRM were struggling. I tested various TCP performance enhancements, opened cases with VMware, etc. Nothing helped until we installed a shiny new Compellent SAN in the DR location.
By the time the new SAN came in, I was getting quite frustrated with the DR replications. Every night at least 10% of the replication processes would fail. Some required a restart, others were in such a bad state the replica VMs had to be wiped and re-synced. It was bad. Really bad. When the new SAN was installed, the first thing I did was move a few of the troublesome replications to the new SAN. I nearly fell out of my chair when I looked at the bandwidth stats! The replications were running flat out, using the entire capacity of the WAN link. This had never happened before! All of the replicas are now running to the new SAN and the replication failure rate is less than 1%. Yes!
As it turns out, recovery site storage performance is critical in reducing the latencies in the replication process. I was using an old hand-me-down Compellent, as well as a Quantum DXi, as storage devices. The IOPS performance of the new SAN beats both of these devices by at least an order of magnitude. And it shows in the replication performance.
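A back-of-the-envelope way to see the effect: sustained replication can go no faster than the slower of the WAN link and the write rate of the target storage. The sketch below is a deliberately simplified model that ignores latency, TCP windowing, compression and caching. The IOPS, I/O size and link speed are illustrative assumptions, not measurements from our environment.

```python
def replication_throughput_mbps(wan_mbps, target_write_iops, io_size_kib):
    """Estimate the sustained replication rate in megabits/s.

    The rate is capped by whichever is slower: the WAN link or the
    target storage's write throughput (IOPS x I/O size).
    """
    storage_mbps = target_write_iops * io_size_kib * 1024 * 8 / 1_000_000
    return min(wan_mbps, storage_mbps)

# Assumed numbers: 100 Mbit/s WAN link, 32 KiB writes.
old_san = replication_throughput_mbps(100, 200, 32)   # slow target: storage-bound
new_san = replication_throughput_mbps(100, 2000, 32)  # 10x the IOPS: WAN-bound

print(old_san)  # 52.4288
print(new_san)  # 100
```

In this model, a tenfold jump in target IOPS moves the bottleneck from the storage to the WAN link, which is what a replication suddenly "running flat out" looks like.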
So, if you are having VM replication performance issues, take a look at your DR site SAN. It might not be good enough.