When I was growing up, I remember hearing the modified adage:
“To err is to be human, to really screw things up requires a computer.”
This was repeated quite often, especially by the previous generation. Today, in the DevOps and Continuous Integration/Deployment world, there is an additional line that can be added:
“To err is to be human, to really screw things up requires a computer, and to screw up all things all at once in a coordinated effort is DevOps.”
As a seasoned SysAdmin, I find the DevOps method of managing infrastructure amazing. It allows large systems and environments to be managed with remarkable efficiency. However, with the efficiency of keeping things running smoothly comes the ease with which they can go horribly wrong.
When DevOps fails miserably
DevOps going horribly wrong was very publicly brought to light with an Amazon S3 outage in US-east-1 a few years back. According to AWS’s summary:
“…an authorized S3 team member using an established playbook executed a command which was intended to remove a small number of servers for one of the S3 subsystems that is used by the S3 billing process. Unfortunately, one of the inputs to the command was entered incorrectly and a larger set of servers was removed than intended.”
Imagine the turmoil inside that individual’s head minutes after they hit “Enter” on that command.
We’ve all been there at some scale.
Thoughts likely ranged from, “Seriously? I know I typed that right, I’ve done it 1000 times!” to, “CTRL-C, CTRL-X, CTRL-Z, ESC, ESC, ESC!” to, “Why are there no limits on that script to prevent this?!” and I’m sure, to “File-> Open-> Resume.rtf.”
Learning from someone else’s very public DevOps mistake
The AWS failure inspired Deft to take a hard look at our own DevOps infrastructure management practices.
Could something like a simple missing input flag cause a massive outage for one or more of our clients? Do we need to re-architect our configuration management to prevent something like this?
Deft manages the infrastructure of our AWS and Linux-based environments with Ansible. We store all of the infrastructure “code” in git repositories and collaborate on it there. This gives us insight into what changes have been made, how we can roll them back if necessary, and, of course, the ability to find out who made a change (not to place blame, but to verify the use case, etc.).
Do our DevOps tools offer anything in the way of disaster prevention and sanity checks?
Consider the following command we use in Ansible for regular maintenance on an environment:
/path/to/client/playbook/$ ansible-playbook site.yml --tags=maintenance --limit=stage
This script runs only the tasks tagged as maintenance, and only in the “Stage” environment. Once the changes are vetted, we can change the --limit= to prod and apply them to Production.
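For context, those options only work because the playbook itself defines the tags and host groups. A minimal, hypothetical sketch of the pattern (the host groups, task names, and modules here are illustrative, not lifted from Deft’s actual playbooks) might look like this:

# site.yml (hypothetical excerpt): tasks carry tags so a run can be
# narrowed with --tags, and the play targets host groups so --limit
# can narrow the hosts.
- hosts: stage:prod
  become: true
  tasks:
    - name: Apply pending package updates
      ansible.builtin.package:
        name: "*"
        state: latest
      tags:
        - maintenance

    - name: Enforce the baseline sshd configuration
      ansible.builtin.template:
        src: sshd_config.j2
        dest: /etc/ssh/sshd_config
      tags:
        - baseline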
However, what if we mistype the --tags= parameter like this:
/path/to/client/playbook/$ ansible-playbook site.yml --tags=mintenance --limit=stage
Since there are no tasks tagged as mintenance, nothing happens.
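Still, there’s no need to rely on luck here. Ansible can tell you what a tag and limit would select before anything runs; these are standard ansible-playbook options, shown against the same hypothetical playbook:

# List every tag defined in the playbook, so a typo like "mintenance"
# stands out before anything runs.
/path/to/client/playbook/$ ansible-playbook site.yml --list-tags

# Preview exactly which tasks a given tag and limit would select,
# without executing any of them.
/path/to/client/playbook/$ ansible-playbook site.yml --tags=maintenance --limit=stage --list-tasks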
Things get a bit more interesting if we forget it altogether:
/path/to/client/playbook/$ ansible-playbook site.yml --limit=stage
This will run the entire playbook against the Staging environment, running all tasks and resetting all configurations to baseline. But is this actually a problem? Not if you’re practicing DevOps properly.
If you’re practicing safe DevOps, you have not made any changes to the environment outside of the Ansible scripts. Resetting the environment to “baseline” shouldn’t be an issue, as it should have never left baseline in the first place. True, some package updates might be triggered, but you were doing that anyway with the --tags=maintenance option!
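If you want an extra safety net before the real run, Ansible’s built-in dry-run mode is worth mentioning (the path is a placeholder, as above):

# Dry run against Stage: --check reports what would change without
# changing anything, and --diff shows the file-level differences.
/path/to/client/playbook/$ ansible-playbook site.yml --limit=stage --check --diff

Not every module fully supports check mode, so treat the output as a preview rather than a guarantee, but it cheaply catches the “wait, why is it about to touch that?” moments.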
The same is true if you really screw up the command and forget the limit as well:
/path/to/client/playbook/$ ansible-playbook site.yml
This runs all tasks on all environments, including Production. Again, the baseline reset shouldn’t be an issue. You do run the risk of having package updates applied to Production without vetting them in Stage. However, we structure our playbook so that tasks run sequentially through the environments, so as long as the DevOps admin is attentive, they should notice the full-playbook run while tasks are still being applied to Stage and CTRL-C the command before it reaches the Production part of the playbook.
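To make that “Stage first, then Production” ordering concrete, here is one hypothetical shape such a site.yml could take; this is a sketch of the sequencing described above, not Deft’s actual playbook:

# site.yml (hypothetical): plays run in the order they are listed,
# so a full, un-limited run always works through Stage before it
# ever reaches Production.
- name: Maintain the Stage environment
  hosts: stage
  become: true
  roles:
    - common

- name: Maintain the Production environment
  hosts: prod
  become: true
  roles:
    - common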
What happens if we’re not even in the correct playbook?
/path/to/wrong/playbook/$ ansible-playbook site.yml --tags=maintenance --limit=stage
A different environment than the one we intended gets the maintenance. At least it’s maintenance designed for that environment, since each playbook is custom-written for its environment. This might not be ideal, but it sure won’t be catastrophic.
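Part of what keeps this contained is ordinary Ansible hygiene: each playbook directory can carry its own ansible.cfg and inventory, so a group name like stage only ever resolves to that environment’s hosts. A hypothetical example:

# ansible.cfg kept alongside each client's playbook (hypothetical),
# pointing at that client's inventory and nothing else.
[defaults]
inventory = ./inventory/hosts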
Issues to look out for when using Ansible
Ansible is not without its pitfalls. Consider ad-hoc commands:
/path/to/client/playbook/$ ansible stage -a 'rm -rf /'
Any Linux admin knows the old rm -rf / command; it frees up disk space like no other! Coupled with DevOps, it frees up disk space on ALL the Stage servers at the same time. However, if you’re practicing proper DevOps, then this command (no matter how tempting) should never enter the picture.
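If disk space genuinely needs reclaiming, the safe-DevOps version of that urge is a reviewable, version-controlled task rather than an ad-hoc one-liner. A hypothetical sketch (the log path, retention period, and file name are placeholders):

# roles/common/tasks/cleanup.yml (hypothetical): prune old logs
# instead of reaching for rm -rf. The find module only reports
# matches; the file module removes exactly those paths.
- name: Find application logs older than 30 days
  ansible.builtin.find:
    paths: /var/log/myapp
    age: 30d
    patterns: "*.log"
  register: stale_logs
  tags:
    - maintenance

- name: Remove the stale logs
  ansible.builtin.file:
    path: "{{ item.path }}"
    state: absent
  loop: "{{ stale_logs.files }}"
  tags:
    - maintenance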
In the case of the AWS S3 outage, it sounds like there were some fail-safes built into the script. The compounding factor, according to their summary, was that:
“We have not completely restarted the index subsystem … for many years.”
I view this as a testament to how well things run at AWS; a mistake this crippling had not happened in several years of operation.
And even if you are practicing perfectly safe DevOps, there may be a need to make ad-hoc changes, especially during on-the-fly troubleshooting. In light of this, it may actually be beneficial to occasionally “make the mistake” of running the full playbook and returning the environment to baseline, as long as it is done in a regular and controlled manner (think Stage first, then Production). If it happens more often than every few years, the potential impact will be minimal.
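In practice, a controlled re-base can be as simple as scheduling the “mistake” deliberately, along these lines:

# Re-base Stage first, vet it, then re-base Production.
/path/to/client/playbook/$ ansible-playbook site.yml --limit=stage
# ...monitoring, smoke tests, and a deep breath go here...
/path/to/client/playbook/$ ansible-playbook site.yml --limit=prod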
The key takeaway
Overall, safe DevOps means that ALL changes to the environment should be made through a configuration and change management system.
That system needs to be architected to withstand the occasional human error. Changes should be rolled up through Testing and Staging environments, even when there’s a live Production problem. It always seems quicker to make fixes ad-hoc, but eventually the environment will be “re-based,” and those ad-hoc changes will be lost; the pain is orders of magnitude worse when the re-base blows away years of patch-and-fix configs.
Also, don’t fear the re-base — just use it in a controlled test to ensure that your environment is consistent.
Need help implementing DevOps culture? Contact us.