Last week GitLab went down because of human error: a sysadmin mistakenly removed a production folder. When I read that, I remembered my early days working as a system administrator. I managed a lot of Linux/UNIX servers, and I made my job easier by automating things with scripts and remote commands (at that time Ansible didn’t exist and automation tools got little attention). I learned early on that even the most experienced admin will make a mistake and get in trouble.
One day I had the task of adding and removing some users on all systems. Logging in to each server and making the changes manually would have taken hours, so I created a script that read the passwd file, parsed it, added or removed the users, and overwrote the file on all servers. As you can guess, things went wrong. For some reason, running my script left every system with an empty passwd file. No one could log in to any of the servers to fix the issue. Fortunately, we had backups in place and were able to recover the passwd file on all systems. I thought that was the end of my job, but my manager told me not to worry and to learn from the mistake.
That is the day I truly learned the importance of testing code on dev/test servers first, and of having a proper backup and disaster recovery solution in place that can save the day when things go wrong.
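For illustration, here is a minimal Python sketch of the kind of guardrails a script like that could include. It is not the original script: the function name `update_passwd`, the `.bak` suffix, and the specific sanity checks are all assumptions made for this example. It refuses to write an empty or malformed result, keeps a backup copy, and swaps the new file in atomically. On a real system, tools such as `useradd`, `userdel`, or `vipw` are the safer route.

```python
import os
import shutil
import tempfile


def update_passwd(path, add=(), remove=()):
    """Hypothetical helper: rewrite a passwd-style file with basic safety checks.

    `add` is an iterable of complete passwd lines, `remove` an iterable of
    user names. Returns the list of lines that were written.
    """
    with open(path) as f:
        lines = [line.rstrip("\n") for line in f if line.strip()]

    to_remove = set(remove)
    kept = [l for l in lines if l.split(":", 1)[0] not in to_remove]

    # Append new entries, skipping user names that already exist.
    existing = {l.split(":", 1)[0] for l in kept}
    for entry in add:
        name = entry.split(":", 1)[0]
        if name not in existing:
            kept.append(entry)
            existing.add(name)

    # Sanity checks (assumed for this sketch): never write an empty file,
    # never drop root, and require the usual seven colon-separated fields.
    if not kept:
        raise ValueError("refusing to write an empty passwd file")
    if "root" not in existing:
        raise ValueError("refusing to write a passwd file without root")
    for l in kept:
        if l.count(":") != 6:
            raise ValueError("malformed entry: %r" % l)

    # Keep a copy of the original next to it before touching anything.
    shutil.copy2(path, path + ".bak")

    # Write to a temp file in the same directory, then rename it over the
    # original so readers never see a half-written file.
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "w") as f:
            f.write("\n".join(kept) + "\n")
            f.flush()
            os.fsync(f.fileno())
        shutil.copymode(path, tmp)  # preserve permission bits (not ownership)
        os.replace(tmp, path)
    except BaseException:
        if os.path.exists(tmp):
            os.unlink(tmp)
        raise
    return kept
```

Run against a copy of the file on a test box first, for example `update_passwd("/tmp/passwd.copy", remove=["olduser"])`, and only then point it at anything that matters.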
One of my favorite features of Unitrends Backup is Instant Recovery/File Level Recovery, which allows you to create virtual copies of your servers, databases, or files from any backup and make them accessible to users for dev/test purposes. They can test all possible changes, and if something breaks, no problem: changes can be discarded and the environment refreshed for new testing in just a few minutes. This consumes no extra space and, of course, it can all be automated as part of your DevOps process using the Unitrends REST APIs or Unitrends PowerShell.
So here is my question to spur discussion: what is the worst mistake you ever made as a sysadmin?