The S3 outage summary (https://aws.amazon.com/message/41926/) describing the cause of the 2/28/2017 outage specifically says an 'established playbook' was used when S3 was taken down, which sounds like they used an Ansible playbook. And I'm pretty sure I know how such a terrible error happened, because I've worked around this problem with Ansible in my own projects.
The normal way to run against a subset of a group of servers in Ansible is to add the '--limit' parameter, which filters the list of servers based on a host pattern. So, for example, you run the playbook against the 'web' group with a limit of 'webserver01'; this runs the Ansible playbook only on 'webserver01', not on all the 'web' servers.
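For illustration, a limited run might look something like this (the inventory and playbook names here are hypothetical):

```
# Hypothetical example: run site.yml only on webserver01
ansible-playbook -i inventory site.yml --limit webserver01
```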
The problem is, if you mistype '--limit' or leave it off entirely, the playbook runs against the whole group. This is a horrible design flaw.
The workaround is to not use '--limit' at all, but instead specify `hosts: "{{ target }}"` in your playbook. Then you must pass '-e "target=webserver01"' or you get an error saying no hosts were specified. The target can be a pattern, so "target=web:database" or "target=webserver0*" also works; this is flexible enough that you never need '--limit', and it avoids this dangerous design flaw in Ansible.
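A minimal sketch of that pattern, with hypothetical file and task names:

```
# site.yml (hypothetical) -- the hosts line comes entirely from the
# 'target' variable, so there is no group to silently fall back on
- hosts: "{{ target }}"
  tasks:
    - name: Show which host this play is running on
      ansible.builtin.debug:
        msg: "Running on {{ inventory_hostname }}"
```

```
# 'target' must be passed explicitly; patterns work too, e.g. target=web:database
ansible-playbook -i inventory site.yml -e "target=webserver01"
```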