Category Archives: Datacenter

Datacenter related experiments

vCenter 5.5 web interface and the Cubic congestion algorithm

It has been a while since I last posted, but I thought this one might be of interest to a few of you, especially since I keep running into it every so often.

Some time ago I did a performance analysis of NICs from multiple manufacturers. As part of that test, I also tested the impact of changing the TCP congestion algorithm in VMware from the default New Reno to Cubic. Cubic consistently delivered better throughput in the tests, and since then I have updated all of our hosts to use it.

Unfortunately, if the algorithm is set via the vCenter 5.5 web interface, not all of the required changes are applied properly to the target ESXi host, and upon reboot the host is unable to connect to vCenter. It took quite a bit of work with VMware until we discovered the root cause. Fortunately, we did identify a simple workaround, documented here: Continue reading

ESXi 5.5 enable SSD

VMware’s ESXi servers are frequently unable to recognize attached SSD devices, especially when they are behind a RAID controller. If a storage device is not recognized as an SSD, it cannot be designated for vFRC or the Virtual Flash Host Swap Cache function. A set of commands outlined in VMware’s KB 2013188 will mark the storage device as a local SSD and enable the required functionality. If you are up to living dangerously, you can mark any local storage device as an SSD. Continue reading
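A rough sketch of what that tagging sequence looks like, written as a helper that only prints the commands so they can be reviewed before being pasted into an ESXi shell. The function name `ssd_tag_commands` and the device ID `naa.EXAMPLE` are made up for illustration, and `VMW_SATP_LOCAL` is an assumption that fits typical local disks; check `esxcli storage nmp device list` for the SATP your device actually uses, and treat KB 2013188 as the authority on the exact steps.

```shell
#!/bin/sh
# Print (do not run) the esxcli commands that tag a device as SSD.
# usage: ssd_tag_commands DEVICE_ID [SATP]
# DEVICE_ID is the naa.* / mpx.* identifier of the disk (hypothetical below).
# SATP defaults to VMW_SATP_LOCAL, which is an assumption for local disks.
ssd_tag_commands() {
    dev="$1"
    satp="${2:-VMW_SATP_LOCAL}"
    # 1. Add a claim rule that sets the SSD flag for this device.
    echo "esxcli storage nmp satp rule add --satp $satp --device $dev --option enable_ssd"
    # 2. Reclaim the device so the new rule takes effect.
    echo "esxcli storage core claiming reclaim --device $dev"
    # 3. Show the device details to confirm the SSD flag.
    echo "esxcli storage core device list --device $dev"
}
```

Calling `ssd_tag_commands naa.EXAMPLE` prints the three commands in order; nothing touches the host until you run them yourself.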

VMware 10G NIC Performance Evaluation

For the last few years we have been running IBM x3850 X5 servers in our cloud, with Emulex 10G NICs for all of our networking needs. One of our services was starting to run out of steam, and we picked up a pair of HP DL580 G8 servers for the needed horsepower. These servers were configured with Intel-based 10G NICs. At the time of the purchase, our main concern was to get the maximum CPU performance, and minimal attention was paid to 10G NIC selection.

As we put these servers into production, we noticed that the VMware iSCSI performance on the new HP hosts seemed especially good. The E7-8891v2 Xeons undoubtedly have a large role to play in the improved performance, but we began to wonder how much of the improvement could be attributed to the 10G NICs. Not one to let such thoughts sit idle for long, I asked our friends at Zones if we could borrow a few 10G NICs for performance evaluation. Within a few weeks I had dual-port Intel and Mellanox 10G NICs in hand for testing, in addition to the Emulex NICs already present in our IBM hosts. Continue reading

Manually disable vmreplica state (unable to extend disk)

Just ran into an issue with a VMware VM that was at one time replicated offsite using VMware’s replication tool. For some reason, when the service was decommissioned, (at least) one of our VMs was still marked as a replicated machine. As long as a VM is marked as an active replica, its disks cannot be extended. In our case it appears that it was also in the middle of a replication, long since abandoned.

To fix, we found this article: StuffThatMightBeUseful

In the article, the author points out that the VM needs to be shut down. It is probably better if the VM is down, but in our case we did not have this option and changed the setting on the running VM. It worked and did not crash. Yet.

The steps are simple:

  1. Find the VM_ID (number in 1st column): vim-cmd vmsvc/getallvms | grep "name of your VM"
  2. Confirm current state: vim-cmd hbrsvc/vmreplica.getState VM_ID
  3. Fix: vim-cmd hbrsvc/vmreplica.disable VM_ID
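
The same three steps can be wrapped in a small script. This is only a sketch: the helper names (`vm_id_from_list`, `run`, `disable_replica`) and the `DRY_RUN` switch are invented for illustration, and the ID lookup assumes the VM name contains no spaces and appears as the second column of the `getallvms` output.

```shell
#!/bin/sh
# Sketch: find a VM ID by name, then check and disable its stale replica state.
# DRY_RUN=1 (the default here) only prints the vim-cmd calls.
DRY_RUN="${DRY_RUN:-1}"

# Read `vim-cmd vmsvc/getallvms` output on stdin and print the numeric ID
# (1st column) of the VM whose name (2nd column) matches $1 exactly.
# Assumption: the VM name has no spaces.
vm_id_from_list() {
    awk -v name="$1" '$1 ~ /^[0-9]+$/ && $2 == name { print $1; exit }'
}

# Run a command, or just print it when DRY_RUN=1.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

# Steps 2 and 3 from the list above, for a given VM ID.
disable_replica() {
    run vim-cmd hbrsvc/vmreplica.getState "$1"
    run vim-cmd hbrsvc/vmreplica.disable "$1"
}
```

On a host, something like `id=$(vim-cmd vmsvc/getallvms | vm_id_from_list myvm)` followed by `disable_replica "$id"` shows what would be executed; set `DRY_RUN=0` once the printed commands look right.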

Patch, Patch and Patch Some More

As some of you know, I look after a datacenter dedicated to providing a specialized cloud service for the printing industry. We use similar servers and similar infrastructure to another datacenter I know. This includes IBM x3690 and x3850 enterprise grade servers, 10G networking, VMware virtualization, etc. These servers do not have the fastest processors, but they have awesome on-line out-of-band diagnostics, can handle terabytes of RAM and will have replacement parts delivered PDQ. Ours have been running for the past two years without failure or data loss.

In my talks with the manager of the other datacenter, I have been hearing complaints about these servers. Frequent crashes, data loss, disk failures and more. Huh? They have the same hardware, and are running a newer VMware release (5.1) than we are (5.0). Digging deeper, I began to see a pattern. Out of date patches. I usually run no more than 4 months behind on host patches, but looking at their systems, most of them are still at the original release level. But it goes deeper than that.

In a virtual datacenter, there are five levels of patches:

  • Application
  • Operating System
  • Hypervisor (VMware, Hyper-V, Xen, etc.)
  • Firmware
  • BIOS

Most users only see the top two levels, through application updates, Windows service packs, etc. However, when you run a virtual datacenter, the bottom three levels become very important. A VMware host may run up to 100 virtual machines. An outage on this host affects all of the machines it hosts. Making sure the host is operational becomes Very Important.

VMware and Microsoft both provide simple mechanisms to keep the virtualization environment up to date. Microsoft does this via the Windows Update function. VMware provides a free add-on and plugin called Update Manager. Both of these tools are very easy to use and only require a brief maintenance window in which to update the hypervisor and reboot the host.

When you buy a host and its accessories from HP or IBM (and, I assume, from Dell), you can use the update tools they provide to bring the firmware for all the accessories up to date with a single reboot. IBM provides UpdateXpress and HP provides a new HP Service Pack for ProLiant utility. Both of these tools can be configured to bring the host system up to date with the latest fixes for all of the RAID controllers, disks, network interfaces, BIOS, remote management, etc.

All of the updates provided by HP and IBM come with release notes. Think back to the last time your server crashed. If you think it was storage related, find out the firmware level of your RAID controller and find the latest firmware update for the controller. Read through the release notes. You should not be surprised to see that the issue you ran into has already been patched… but you are not running that patch on your server. Time Bomb? You bet! No, really, you are betting with your paycheck.

So.. Patch! Every three to four months patch completely. Find the tools, learn them and use them. It is better to take a controlled, scheduled outage than one right in the middle of the day with data loss.

Good luck! Don’t let that unscheduled outage kick you in the a$$!

A VMware Replication Performance Secret

In datacenters, Disaster Recovery is not just a concept but something we must take into consideration every day. The company and its customers make money only while the services we provide are operational. We try to make everything as stable and secure as possible, but we also have to plan for the worst: what if the building is gone tomorrow? For this contingency, we usually identify a location FarFarAway that is not subject to the same risks as the primary location. EZ. Find a location, sign a contract, install and configure the environment and that’s it, right? Ummm… nope. Now comes the hard part.

The hard part for DR has always been replicating data from the production site to the recovery site. You have to make sure all of the data gets across to the recovery site in time and consistently. Since we are a rather large VMware shop, and the entire production environment is now VMs, all we have to do is replicate the VMs and their additional data to the recovery site.

We use a number of tools for replication:

  • SAN based replication (Compellent)
  • vRanger
  • VMware’s Site Recovery Manager (in evaluation at the moment)
  • Offsite replication of our daily backups (Quantum DXi)

The SAN replication moves the data that is not encased in a VM, in our case, file server data. vRanger and VMware’s SRM replicate the VMs. vRanger is cheap, is poorly integrated with vCenter, and has a rather weak understanding of Recovery Point Objectives. SRM is awfully expensive, can be managed from the vSphere client and will plan the replication schedule to make sure the replica is never older than you specify.

All of these are great tools, but I had been banging my head against the wall trying to identify why the replication processes were unable to make full use of the available bandwidth. SAN to SAN replication was just fine, but vRanger and SRM were struggling. I tested various TCP performance enhancements, opened cases with VMware, etc. Nothing helped until we installed a shiny new Compellent SAN in the DR location.

By the time the new SAN came in, I was getting quite frustrated with the DR replications. Every night at least 10% of the replication processes would fail. Some required a restart, others were in such a bad state the replica VMs had to be wiped and re-synced. It was bad. Really bad. When the new SAN was installed, the first thing I did was move a few of the troublesome replications to the new SAN. I nearly fell out of my chair when I looked at the bandwidth stats! The replications were running flat out, using the entire capacity of the WAN link. This had never happened before! All of the replicas are now running to the new SAN and the replication failure rate is less than 1%. Yes!

As it turns out, recovery site storage performance is critical in reducing the latencies in the replication process. I was using an old hand-me-down Compellent, as well as a Quantum DXi, as storage devices. The IOPS performance of the new SAN beats both of these devices by at least an order of magnitude. And it shows in the replication performance.
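
A back-of-the-envelope way to see the effect is to compare what the target storage can absorb with what the WAN link can carry. The sketch below is a simplification, not a model of how vRanger or SRM actually write: it ignores latency, caching and compression, the function names are made up, and every number fed into it is hypothetical rather than measured on the systems described here.

```shell
#!/bin/sh
# Rough ceiling on replication throughput set by the target storage.
# usage: storage_mbit IOPS WRITE_SIZE_KB
# Prints the sustained write rate in Mbit/s (integer, 1 KB taken as 1000 bytes).
storage_mbit() {
    echo $(( $1 * $2 * 8 / 1000 ))
}

# usage: bottleneck WAN_MBIT IOPS WRITE_SIZE_KB
# Reports whether the WAN link or the target storage is the limiting factor.
bottleneck() {
    wan="$1"
    storage=$(storage_mbit "$2" "$3")
    if [ "$storage" -lt "$wan" ]; then
        echo "storage-bound: ${storage} of ${wan} Mbit/s"
    else
        echo "wan-bound: ${wan} Mbit/s"
    fi
}
```

With made-up figures, `bottleneck 1000 300 64` reports a storage-bound replication using about 153 Mbit/s of a 1000 Mbit/s link, while a target ten times faster (`bottleneck 1000 3000 64`) is limited by the WAN instead.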

So, if you are having VM replication performance issues, take a look at your DR site SAN. It might not be good enough.