A presentation at Python Belgrade in Belgrade, Serbia, by Luka Kladaric
ZERO-DOWNTIME DATACENTER FAILOVERS (SWITCHING HOSTING PROVIDERS FOR DUMMIES) Luka Kladaric @ Python Belgrade Jan 2018.
WHO? Luka Kladaric; formerly: web developer for 10+ years; now: architecture, infrastructure & security consultant; also a startup founder and remote work evangelist
migrating an entire company's infrastructure from Rackspace to Amazon AWS
60 virtual machines; 3 bare-metal boxes (db); assorted networking equipment
the migration took 2 months to execute but a year and a half to prepare...
WHY?
hand-crafted build server; unreproducible jobs for 3 Android apps, each completely different
massive monolithic 10 GB git repository; touching anything triggers a rollout of everything; no concept of "stable"
half the servers are not deployable from scratch, or their deployability is unknown
no local dev environments; half the company has to VPN into production to get any work done; everyone works directly on production systems; no db schema migration system == no db versioning
horrible code review tool (Rietveld)
same mysql account used by everyone everywhere; that mysql account is "root"; that mysql db is 1.5 TB big
no access to LB config; it has a bunch of magic in it; changes often result in issues and outages
no server metrics / perfdata; no idea if overprovisioned and by how much
no access to the disaster recovery instance if the primary DC goes down (access goes through the primary DC)
RACKSPACE WAS REALLY TERRIBLE: a constant pain to deal with; unexpected outages with causes never explained; unresponsive support team; zero flexibility
HOW LONG WOULD IT TAKE TO MIGRATE THIS? optimistically: 3 months; conservatively: 6-9 months (of dedicated work); realistically: a year (with interruptions)
NO LEADERSHIP BUY-IN: 2 failed attempts to get approval; Infrastructure team makes a pact: "Do Things The Right Way From Now On"; mask cleanup work with ongoing maintenance
PLOT TWIST: RACKSPACE STARTS FALLING APART
NEW ESTIMATE: 19 man-days (after final push for preparation)
HOSTING COST ESTIMATE: before: $18k; after: $6k; savings: $12k (-66%!)
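The savings figure can be checked with a quick calculation. This is only a sketch of the arithmetic on the two numbers from the slide; the variable names are illustrative, and the amounts are taken to be monthly bills in USD, as the results slide later states.

```python
# Figures from the slide (assumed to be monthly hosting bills in USD).
before = 18_000  # Rackspace
after = 6_000    # AWS estimate

savings = before - after
savings_pct = savings / before * 100

print(f"savings: ${savings:,} ({savings_pct:.1f}%)")
```

The exact reduction is two thirds, which the slide shortens to 66%.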
GOT APPROVAL!
Actually executed in 25-30 man-days over 2 months
HOW?
build server rebuilt from scratch; deployed from Ansible; all build jobs defined in code with inheritance and templating; tweaking jobs through UI disabled
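To illustrate what "build jobs defined in code with inheritance and templating" can look like, here is a minimal sketch in plain Python. It is not the tooling used in the talk (the slides do not name it); `BASE_JOB`, `make_job`, the app names, and the `git.example.com` URL are all hypothetical.

```python
from copy import deepcopy
from string import Template

# Hypothetical base job: settings every build job inherits.
BASE_JOB = {
    "scm": "git",
    "timeout_minutes": 30,
    "steps": ["checkout", "build", "test"],
}

# Hypothetical repository URL template, filled in per job.
REPO_TEMPLATE = "git@git.example.com:android/${name}.git"


def make_job(name, **overrides):
    """Return a job that inherits BASE_JOB, then applies per-job overrides."""
    job = deepcopy(BASE_JOB)  # copy so jobs never share mutable state
    job.update(overrides)
    job["name"] = name
    job["repo"] = Template(REPO_TEMPLATE).substitute(name=name)
    return job


JOBS = [
    make_job("app-one"),
    make_job("app-two"),
    make_job("app-three", timeout_minutes=60),  # only the difference is stated
]

for job in JOBS:
    print(job["name"], job["repo"], job["timeout_minutes"])
```

Because every job is derived from one base definition, the three Android builds differ only where a difference is written down, and the whole set can be reviewed and versioned like any other code.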
monolithic git repository split up into 40 smaller repositories; changes trigger rollout only on affected project
all servers rebuilt and redeployed with Ansible; "upgrading the fleet to Ubuntu 16.04" ;)
better code review tool (Phabricator); allows code ownership, rules of engagement per repository; don't ask me about Phabricator (it's amazing)
most dev work doesn't require VPN any more; but even if it did...
no more shared mysql root account (RIP); no write access to production database (to people or their software); local dev environments! (far from perfect)
db schema migration system based on gh-ost; sql scripts -> code review -> git -> web ui to run
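The last step of that pipeline, turning a reviewed SQL change into a gh-ost invocation, could be sketched as follows. This is an assumption about the shape of such a wrapper, not the system from the talk: `build_gh_ost_command`, the host, the database, and the table names are hypothetical, and credentials are left out on purpose. The flags themselves (`--host`, `--database`, `--table`, `--alter`, `--execute`) are gh-ost options; without `--execute` gh-ost performs a dry run.

```python
import shlex


def build_gh_ost_command(host, database, table, alter, execute=False):
    """Build (but do not run) a gh-ost command line for one schema change."""
    cmd = [
        "gh-ost",
        f"--host={host}",
        f"--database={database}",
        f"--table={table}",
        f"--alter={alter}",
    ]
    if execute:
        # Without this flag gh-ost only does a dry run.
        cmd.append("--execute")
    return cmd


# Hypothetical change that has already passed code review.
cmd = build_gh_ost_command(
    host="db-replica.example.internal",
    database="appdb",
    table="users",
    alter="ADD COLUMN last_seen DATETIME NULL",
)
print(shlex.join(cmd))
```

Defaulting to a dry run means a web UI built on top of this would need an explicit second step before anything touches the production table.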
all LB logic slowly moved to our own haproxies; haproxy configuration auto-generated from Ansible; makes it easy to shuffle things around
all apps slowly migrated to be served through haproxies; avoiding Rackspace LB magic
metrics, metrics, metrics (Datadog ftw)
TIME SPENT: A YEAR AND A HALF
AND ALL BEFORE WE HAD APPROVAL TO DO ANYTHING ;)
"yeah yeah yeah... how did you do the actual migration?"
VPN bridge between AWS and RS; ~20 MB/s, ~20ms ping; good enough to treat as a "local" connection for shorter periods of time
mysql master-master replication between DCs; this was a massive pain to achieve with a 1.5TB db
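A rough calculation shows one reason the database size hurts. The sketch below assumes the initial copy travels over the ~20 MB/s VPN bridge at full, sustained speed and uses decimal units (1 TB = 10^12 bytes); the slides do not say how the data was actually seeded, so treat this as a lower bound under those assumptions.

```python
# Figures from the slides; decimal units are an assumption.
DB_SIZE_BYTES = 1.5e12          # 1.5 TB database
LINK_BYTES_PER_SECOND = 20e6    # ~20 MB/s VPN bridge

transfer_seconds = DB_SIZE_BYTES / LINK_BYTES_PER_SECOND
transfer_hours = transfer_seconds / 3600

print(f"one full copy: {transfer_seconds:,.0f} s, about {transfer_hours:.1f} hours")
```

Nearly a day per full copy, before any retries, means every failed attempt to set up replication is expensive.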
recreate the entire fleet in AWS
app servers in both DCs (java+python)
haproxies in both DCs; aware of app servers in both DCs; preferring local ones but falling back to remote if necessary; "no request left behind"
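One way to get "prefer local, fall back to remote" in haproxy is the `backup` keyword on a `server` line: backup servers receive traffic only when no regular server is available. The sketch below generates such a backend from a host list, in the spirit of the auto-generated configuration mentioned earlier. It is an assumption about how this could be done, not the talk's actual templates; `render_backend`, the server names, and the addresses are hypothetical.

```python
def render_backend(name, servers, local_dc):
    """Render an haproxy backend; servers outside local_dc are marked backup."""
    lines = [f"backend {name}", "    balance roundrobin"]
    for srv in servers:
        line = f"    server {srv['name']} {srv['addr']} check"
        if srv["dc"] != local_dc:
            # Remote servers are used only if all local ones are down.
            line += " backup"
        lines.append(line)
    return "\n".join(lines)


# Hypothetical inventory covering both datacenters.
SERVERS = [
    {"name": "app-aws-1", "addr": "10.0.1.10:8080", "dc": "aws"},
    {"name": "app-aws-2", "addr": "10.0.1.11:8080", "dc": "aws"},
    {"name": "app-rs-1", "addr": "10.1.1.10:8080", "dc": "rs"},
]

aws_config = render_backend("app", SERVERS, local_dc="aws")
rs_config = render_backend("app", SERVERS, local_dc="rs")
print(aws_config)
```

Rendering the same inventory with `local_dc="rs"` flips which servers are marked `backup`, so each datacenter's haproxies get a mirrored view of the same fleet.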
CloudFlare used for near-instant DNS failover; but even stray requests will get handled
RESULTS
core production migrated in days; internal tools migrated within a week or two; developer tools migrated within a month (git hosting, build server, etc); obscure legacy services migrated within 2 months
all hardware at Rackspace decommissioned within 3 months
side effect: actual HA instead of fake HA; old "two or more of everything" approach translated well into Availability Zones
cost estimate? right on the money. once the dust settled the $18k/mo bill from RS was replaced with a $6k/mo bill from AWS
AND IT WAS GOOD
The moral of this story is: don't wait for permission to do your job right.
The team was well aware of how broken things were. If we had pushed for it to be a single massive project, it would never have happened.
QUESTIONS?
THANK YOU! Luka Kladaric; twitter: @kll; luka@sekura.io; www.sekura.io