Technical Musings: How To Avoid Taking Down Your S3 With Ansible

Friday, March 3, 2017

How To Avoid Taking Down Your S3 With Ansible

The S3 outage summary (https://aws.amazon.com/message/41926/) describing the cause of the 2/28/2017 outage specifically says an 'established playbook' was used to take down S3, which sounds like they used an Ansible playbook.  And I'm pretty sure I know how such a terrible error happened, because I've worked around this problem with Ansible in my own projects.

The normal way you run a playbook against a subset of a group of servers in Ansible is to add the '--limit' parameter, which filters the list of hosts in the group.  So, for example, you run against the 'web' group with a limit of 'webserver01'; this runs the Ansible playbook only on 'webserver01', not on all the 'web' servers.
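As a sketch of what that looks like on the command line (the playbook name 'site.yml' and host names here are hypothetical, not from the outage report):

```shell
# Intended: run only on webserver01 out of the whole 'web' group
ansible-playbook site.yml --limit webserver01

# The same command with --limit forgotten (or mistyped so it matches
# nothing it should) runs against every host the play targets
ansible-playbook site.yml
```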

The problem is, if you mistype '--limit' or leave it off entirely, the playbook runs against the whole group.  This is a horrible design flaw.

The workaround is not to use '--limit' at all, but instead to specify `hosts: "{{ target }}"` in your playbook.  Then you must pass '-e "target=webserver01"' or you get an error because no hosts were specified.  The target can be a pattern, so "target=web:database" or "target=webserver0*" works; this is flexible enough that you never need '--limit', and it avoids this dangerous design flaw of Ansible.
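A minimal sketch of such a playbook, assuming a file named 'site.yml' (the task is just an illustrative debug message, not from the original post):

```yaml
# site.yml -- refuses to run against anything unless 'target' is passed in.
# With 'target' undefined, Ansible errors out instead of defaulting to
# every host (the exact error message varies by Ansible version).
- hosts: "{{ target }}"
  tasks:
    - name: Show which host this play is running on
      debug:
        msg: "Running on {{ inventory_hostname }}"
```

You would then invoke it with something like `ansible-playbook site.yml -e "target=webserver01"`, or with a pattern such as `-e "target=web:database"` to hit multiple groups deliberately.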
