VMware ESXi 6.0 Introduced A Changed Block Tracking Bug (2136854) – Why Recovery Testing Is Critical
VMware vSphere is by far the most advanced hypervisor for backup and recovery. VMware pioneered critical features for data access, speed, efficiency, and management that allow virtualization administrators to handle many critical demands from their customers and partners. Obviously, they do pretty darn well with cool features outside of backup and recovery. That’s why it’s natural for admins to champ at the bit when a new ESXi version comes out, especially when that one feature you’ve wanted for a long time is finally available.
In February of this year, VMware announced vSphere 6.0. As a backup vendor – and a virtualization fan – I was pretty excited about some of the new features, especially in the storage arena.
However, as with any major release, there’s always that extra level of planning and caution to take with the rollout. Implementing a major update to a hypervisor that runs many of your most critical business applications is pretty scary, especially if it was your call to launch it. A lot can go wrong without a solid plan to mitigate risk – and roll back, if necessary.
This got me wondering…would yet another Changed Block Tracking (CBT) bug rear its ugly head in ESXi 6.0? It was only a few months before 6.0 was released that a CBT bug was announced. Fortunately, Unitrends was not affected.
Unitrends supported vSphere 6.0 very early. Unfortunately, as early adopters started to upgrade, it didn’t take long before the first KB was posted and a few support cases rolled in – Backing up a virtual machine with Changed Block Tracking (CBT) enabled fails after upgrading to or installing VMware ESXi 6.0 (2114076).
Does this mean CBT is bad or problematic? No, not at all. In fact, it is one of those incredible features I’ve always loved VMware for implementing. It helps users back up a ton more data, much more frequently than they ever could without it, and it allows VMware to stand above its hypervisor competitors for being the only one to have it for so long.
It just means that CBT, especially in releases where a lot of storage functionality has been added, is one of those features that you need to watch very carefully in your planning efforts. It is core to how most backup vendors read data, so issues in this area of the hypervisor can have a serious impact (e.g., unnoticed corruption).
So now that we are many months past GA for vSphere 6.0, all is in the clear. Right?
Unfortunately…no. VMware recently posted yet another KB – Backups with Changed Block Tracking can return incorrect changed sectors in ESXi 6.0 (2136854).
Unfortunately, if you are impacted, it could mean that your backups are corrupt. Obviously, that’s not a good thing. You may have been running ESXi 6.0 for quite some time, and anything that shakes your confidence in recovery is a real concern, especially since any solid VM backup vendor uses CBT.
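To see why an incorrect list of changed sectors can silently corrupt a backup, here is a toy model in Python. It is only a sketch of the general idea behind changed-block incrementals, not VMware's CBT API or the Unitrends implementation; the function names and the four-block "disk" are made up for illustration.

```python
import hashlib


def incremental_backup(previous_backup, disk, changed_blocks):
    """Copy only the blocks that the tracker reports as changed (toy model)."""
    backup = list(previous_backup)
    for index in changed_blocks:
        backup[index] = disk[index]
    return backup


def digest(blocks):
    """Checksum over all blocks, used to compare a backup with the disk."""
    h = hashlib.sha256()
    for block in blocks:
        h.update(block)
    return h.hexdigest()


# Full backup of a hypothetical four-block disk.
disk = [b"AAAA", b"BBBB", b"CCCC", b"DDDD"]
full = list(disk)

# The guest then rewrites blocks 1 and 3.
disk[1] = b"bbbb"
disk[3] = b"dddd"

# A correct tracker reports both blocks; a faulty one misses block 3.
good = incremental_backup(full, disk, changed_blocks=[1, 3])
bad = incremental_backup(full, disk, changed_blocks=[1])

good_matches = digest(good) == digest(disk)
bad_matches = digest(bad) == digest(disk)
```

In this model the faulty incremental completes without any error, yet it still holds the stale contents of block 3. Nothing in the backup job itself flags the problem; only a comparison against the source, or an actual test recovery, reveals it.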
This is exactly why recovery testing is no longer a nice-to-have component of a continuity plan. It needs to be a requirement. Nobody wants to find out that their backups are not recoverable, but you really do not want to discover that fact after an outage or a data loss event. There are too many external factors that can impact recoverability these days to avoid testing.
Unitrends offers Recovery Assurance to help IT administrators with this problem. It is available via our recovery automation software called ReliableDR, which is a premium feature of our Enterprise Plus edition of Unitrends Enterprise Backup software. It is also available as an add-on for our Recovery Series physical appliances.
However, since it is not easy to tell whether you are impacted by this latest CBT bug, Recovery Series and UEB customers can use it for free until at least January 31, 2016. We do not want this issue to leave you worried about your Unitrends backups.
What can you do?
What can ReliableDR test?
It can test the recoverability of your VMware backups (as well as Hyper-V and Windows backups) by spinning up one or more dependent virtual machine backups using our Instant Recovery functionality. Everything runs in an isolated sandbox that is shielded from your production network. From there, it can do the following for Windows and Linux VMs:
- Perform database checks
- Check that application services were properly started
- Perform network tests to other dependent VMs within the sandbox
- Even run custom scripts for any additional verification not available out of the box
ReliableDR jobs will execute tests against the latest available Unitrends backups, but you can use the Failover To CRP (Certified Recovery Point) feature to manually run tests against older backups.
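As an illustration of the kind of custom verification script mentioned in the list above, here is a minimal Python sketch that checks whether dependent services answer on their TCP ports. How ReliableDR invokes custom scripts and reads their results is not covered here, so treat the function names and the way failures are reported as assumptions rather than the product's interface.

```python
import socket


def port_is_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


def failed_checks(targets, timeout=3.0):
    """Return the (host, port) pairs that could not be reached."""
    return [
        (host, port)
        for host, port in targets
        if not port_is_open(host, port, timeout)
    ]
```

For example, calling `failed_checks([("10.0.0.5", 1433), ("10.0.0.6", 443)])` with hypothetical sandbox addresses for a database VM and a web VM returns an empty list only if both services accept connections. A wrapper script could exit with a non-zero status when the list is not empty.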
What to do if you discover recovery issues?
The best option is likely to upgrade to the latest patch that VMware states will address this issue, as documented in KB article 2136854. It may be worth raising a support request with VMware to ensure it addresses your specific situation. You should also consult Unitrends KB 3765 for additional instructions after applying the patch.
If you are unable to fix the issue with the patch, then use the most practical workaround documented in that article.
Above and beyond the stated workarounds from VMware, Unitrends has another alternative. You can leverage guest-level protection for your VMs with an agent.
You may find that this option is the most practical because it allows you to continue to get fast incremental backups. It does so by using a completely separate mechanism within the guest operating system to track changed data between backups. Just keep in mind that it represents a bigger change to your VMware backup strategy compared to other alternatives.
We apologize that we have to deliver the bad news about the realities of why backups can sometimes be at risk. We do hope that our offer will help you gain confidence in your recoverability, avoid any major recovery issues, and help you incorporate Recovery Assurance into your future continuity plans.