As some of you know, I look after a datacenter dedicated to providing a specialized cloud service for the printing industry. We use servers and infrastructure similar to those of another datacenter I know. This includes IBM x3690 and x3850 enterprise-grade servers, 10G networking, VMware virtualization, etc. These servers do not have the fastest processors, but they have awesome online out-of-band diagnostics, can handle terabytes of RAM and will have replacement parts delivered PDQ. Ours have been running for the past two years without failure or data loss.
In my talks with the manager of the other datacenter, I have been hearing complaints about these servers: frequent crashes, data loss, disk failures and more. Huh? They have the same hardware and are running a newer VMware release (5.1) than we are (5.0). Digging deeper, I began to see a pattern: out-of-date patches. I usually run no more than 4 months behind on host patches, but looking at their systems, most of them are still at the original release level. But it goes deeper than that.
In a virtual datacenter, there are five levels of patches:
- Application
- Operating System
- Hypervisor (VMware, Hyper-V, Xen, etc.)
Most users only see the top two levels, through application updates, Windows service packs and the like. However, when you run a virtual datacenter, the bottom three levels become very important. A VMware host may run up to 100 virtual machines, and an outage on that host affects every machine it hosts. Making sure the host is operational becomes Very Important.
VMware and Microsoft both provide simple mechanisms to keep the virtualization environment up to date. Microsoft does this via the Windows Update function. VMware provides a free add-on and plugin called Update Manager. Both of these tools are very easy to use and only require a brief maintenance window in which to update the hypervisor and reboot the host.
When you buy a host and its accessories from HP or IBM (and, I assume, from Dell as well), you can use the update tools they provide to bring the firmware for all the accessories up to date with a single reboot. IBM provides UpdateXpress and HP provides a new HP Service Pack for ProLiant utility. Both of these tools can be configured to bring the host system up to date with the latest fixes for all of the RAID controllers, disks, network interfaces, BIOS, remote management, etc.
All of the updates provided by HP and IBM come with release notes. Think back to the last time your server crashed. If you think it was storage-related, find out the firmware level of your RAID controller and find the latest firmware update for the controller. Read through the release notes – you should not be surprised to see that the issue you ran into has already been patched… but you are not running that patch on your server. Time Bomb? You bet! No, really, you are betting with your paycheck.
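To make the "installed level versus latest level" comparison concrete, here is a minimal Python sketch. It assumes you have already collected the installed and latest firmware versions by hand or from your vendor's tool; the component names and version numbers are made up for illustration, and it assumes purely numeric dotted versions, which real firmware does not always use.

```python
def parse_version(text):
    """Turn a dotted version string such as '12.12.0' into a tuple of ints."""
    return tuple(int(part) for part in text.split("."))


def outdated_components(installed, latest):
    """Return names of components whose installed firmware is older than the latest."""
    stale = []
    for name, version in installed.items():
        newest = latest.get(name)
        if newest is not None and parse_version(version) < parse_version(newest):
            stale.append(name)
    return sorted(stale)


# Hypothetical inventory; real values would come from your vendor's update tool.
installed = {"raid_controller": "12.12.0", "bios": "1.40", "nic": "7.4.8"}
latest = {"raid_controller": "12.13.0", "bios": "1.40", "nic": "7.10.2"}

print(outdated_components(installed, latest))  # ['nic', 'raid_controller']
```

The versions are compared as tuples of numbers rather than as plain strings, because a string comparison would wrongly rank "7.10.2" below "7.4.8".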
So... patch! Every three to four months, patch completely. Find the tools, learn them and use them. It is better to take a controlled, scheduled outage than one right in the middle of the day with data loss.
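If you want a simple reminder of which hosts have slipped past that window, something like the following Python sketch would do. The host names and dates are hypothetical, and the four-month threshold is just my own rule of thumb from above; you would have to record the last full patch date for each host yourself.

```python
from datetime import date


def months_between(earlier, later):
    """Whole calendar months elapsed from earlier to later."""
    months = (later.year - earlier.year) * 12 + (later.month - earlier.month)
    if later.day < earlier.day:
        months -= 1  # the current month has not fully elapsed yet
    return months


def hosts_due_for_patching(last_patched, today, max_months=4):
    """Return hosts whose last full patch is at least max_months old."""
    return sorted(
        host
        for host, patched_on in last_patched.items()
        if months_between(patched_on, today) >= max_months
    )


# Hypothetical host names and dates, for illustration only.
last_patched = {
    "esx-host-01": date(2013, 1, 15),
    "esx-host-02": date(2013, 5, 2),
    "esx-host-03": date(2012, 9, 30),
}

print(hosts_due_for_patching(last_patched, today=date(2013, 6, 1)))
# ['esx-host-01', 'esx-host-03']
```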
Good luck! Don’t let that unscheduled outage kick you in the a$$!