In a previous post, we reviewed a GitLab issue and asked our readers: what are the worst sysadmin mistakes you have made?
While searching for information on the topic, I discovered this Reddit thread with over 200 comments from sysadmins around the world describing their biggest mistakes and what they learned from them. It is a great source of information on the human mistakes that can cause application downtime.
In the Reddit thread, you will read about all kinds of sysadmin mistakes: deleting the wrong folders or files, applying the wrong permissions or an invalid switch configuration, running a bad SQL UPDATE query, dropping the wrong disk from a RAID configuration. The list goes on and on. Reading the comments will keep a sysadmin up at night wondering how they will recover from their next mistake.
Over the years, human error has been a leading cause of data center outages. Even the most experienced system administrators and large companies with the best talent make mistakes. Here are three examples:
IT staff are always under pressure while maintaining, updating, and running infrastructure. Fixing issues, deploying new servers, and many other routine activities cause problems that go unnoticed by the end users they support. But sooner or later a bigger outage will happen, and a sysadmin must have a disaster recovery plan in place to minimize downtime.
It is even worse when any employee in the company can cause a disaster. Every week we see issues caused when an employee downloads a virus, ransomware, or some other cybersecurity threat that propagates over the network and causes an outage.
But as we saw with the GitLab issue, even a documented disaster recovery plan, with five different backup methods they could potentially recover from, did not guarantee recovery. None of their recovery methods worked.
Given the number of changes, cyber threats, and dynamic environments, it is critical to have solutions that can automate frequent disaster recovery tests and alert you when issues are found during recovery. This is how you minimize downtime: prepare and test.
That is why Unitrends invented Recovery Assurance technology and built it into our product to automate daily recovery testing of backups. This is how Unitrends helps sysadmins sleep better at night.
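To make the idea of an automated recovery test with alerting more concrete, here is a minimal, generic sketch in Python. It is not how Recovery Assurance or any other product is implemented. It assumes a hypothetical setup in which a backup is a directory tree and a manifest maps each relative file path to the SHA-256 digest recorded at backup time. The function names, the manifest format, and the print-based alert are all illustrative assumptions.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of a file's contents."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def run_recovery_test(backup_dir: Path, manifest: dict[str, str]) -> list[str]:
    """Restore a backup into a scratch directory and verify every file.

    `manifest` maps relative file paths to the digest recorded when the
    backup was taken (a hypothetical format chosen for this sketch).
    Returns a list of problems; an empty list means the test passed.
    """
    problems = []
    with tempfile.TemporaryDirectory() as scratch:
        restored = Path(scratch) / "restore"
        # Stand-in for a real restore step: copy the backup tree.
        shutil.copytree(backup_dir, restored)
        for rel_path, expected in manifest.items():
            target = restored / rel_path
            if not target.is_file():
                problems.append(f"missing after restore: {rel_path}")
            elif sha256_of(target) != expected:
                problems.append(f"checksum mismatch: {rel_path}")
    return problems


def alert(problems: list[str]) -> None:
    """Placeholder alert: a real setup would page or email someone."""
    for problem in problems:
        print(f"RECOVERY TEST FAILED: {problem}")
```

Run on a schedule (for example, once a day), a script like this would call run_recovery_test and pass any non-empty result to alert, so that a broken backup is discovered long before it is needed.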