ZERO-DOWNTIME DATACENTER FAILOVERS
(SWITCHING HOSTING PROVIDERS FOR DUMMIES)
Luka Kladaric @ Python Belgrade Jan 2018.
Slide 2
WHO?
Luka Kladaric
formerly: web developer for 10+ years
now: architecture, infrastructure & security consultant
also a startup founder and remote work evangelist
Slide 3
migrating an entire company's infrastructure from Rackspace to Amazon AWS
Slide 4
60 virtual machines
3 baremetal boxes (db)
assorted networking equipment
Slide 5
the migration took 2 months to execute but a year and a half to prepare...
Slide 6
WHY?
Slide 7
hand-crafted build server, unreproducible jobs for 3 Android apps...
...each completely different
Slide 8
massive monolithic 10 GB git repository
touching anything triggers a rollout of everything
no concept of "stable"
Slide 9
half the servers are not deployable from scratch or their deployability is unknown
Slide 10
no local dev environments
half the company has to VPN into production to get any work done
everyone works directly on production systems
no db schema migration system == no db versioning
Slide 11
horrible code review tool (Rietveld)
Slide 12
same mysql account used by everyone everywhere
Slide 13
same mysql account used by everyone everywhere
that mysql account is "root"
Slide 14
same mysql account used by everyone everywhere
that mysql account is "root"
that mysql db is 1.5 TB big
Slide 15
no access to LB config
has a bunch of magic in it
changes often result in issues and outages
Slide 16
no server metrics / perfdata
no idea if overprovisioned and by how much
Slide 17
no access to disaster recovery instance in case the primary DC went down (access goes through primary DC)
Slide 18
RACKSPACE WAS REALLY TERRIBLE
a constant pain to deal with
unexpected outages of never explained causes
unresponsive support team
zero flexibility
Slide 19
HOW LONG WOULD IT TAKE TO MIGRATE THIS?
optimistically:
conservatively:
realistically:
Slide 20
HOW LONG WOULD IT TAKE TO MIGRATE THIS?
optimistically: 3 months
conservatively:
realistically:
Slide 21
HOW LONG WOULD IT TAKE TO MIGRATE THIS?
optimistically: 3 months
conservatively: 6-9 months (of dedicated work)
realistically:
Slide 22
HOW LONG WOULD IT TAKE TO MIGRATE THIS?
optimistically: 3 months
conservatively: 6-9 months (of dedicated work)
realistically: a year (with interruptions)
Slide 23
NO LEADERSHIP BUY-IN
2 failed attempts to get approval
Infrastructure team makes a pact "Do Things The Right Way From Now On"
mask cleanup work with ongoing maintenance
Slide 24
PLOT TWIST
RACKSPACE STARTS FALLING APART
Slide 25
NEW ESTIMATE
19 man-days
(after final push for preparation)
Slide 26
HOSTING COST ESTIMATE
before: $18k
after: $6k
savings: $12k (-66%!)
Slide 27
GOT APPROVAL!
Slide 28
Actually executed in 25-30 man-days over 2 months
Slide 29
HOW?
Slide 30
build server rebuilt from scratch
deployed from Ansible
all build jobs defined in code with inheritance and templating
tweaking jobs through UI disabled
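One way to picture "jobs defined in code with inheritance and templating" is a shared base job that each app overrides only where it differs. This is a minimal sketch under assumptions, not the setup from the talk: the job fields, app names and the Gradle command are all hypothetical.

```python
from string import Template

# Hypothetical base job; every field name and value here is illustrative.
BASE_JOB = {
    "node": "android-builder",
    "timeout_minutes": 30,
    "command": Template("./gradlew $task"),
    "task": "assembleRelease",
}

# Each app lists only what differs from the base (hypothetical app names).
JOBS = {
    "app-one": {},                           # inherits everything
    "app-two": {"timeout_minutes": 60},      # overrides one setting
    "app-three": {"task": "bundleRelease"},  # overrides the template input
}


def resolve_job(name: str) -> dict:
    """Merge a job's overrides onto the base job and render its command."""
    merged = {**BASE_JOB, **JOBS[name]}
    merged["name"] = name
    merged["command"] = merged["command"].substitute(task=merged["task"])
    return merged


if __name__ == "__main__":
    for job_name in JOBS:
        print(resolve_job(job_name))
```

Because the definitions live in code, a change to the base job reaches every app through review and version control rather than through clicks in a UI.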
Slide 31
monolithic git repository split up into 40 smaller repositories
changes trigger rollout only on affected project
Slide 32
all servers rebuilt and redeployed with Ansible
"upgrading the fleet to Ubuntu 16.04" ;)
Slide 33
better code review tool (Phabricator) allows code ownership rules of engagement per repository don't ask me about Phabricator (it's amazing)
Slide 34
most dev work doesn't require VPN any more
but even if it did...
Slide 35
no more shared mysql root account (RIP)
no write access to production database (to people or their software)
local dev environments! (far from perfect)
Slide 36
db schema migration system based on gh-ost
sql scripts -> code review -> git -> web ui to run
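The last step of that pipeline, turning a reviewed change into a gh-ost invocation, could look roughly like the sketch below. The `Migration` shape, the function name and the example host are assumptions for illustration; only the gh-ost flags shown (`--host`, `--database`, `--table`, `--alter`, `--execute`) are real, and credentials are deliberately left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Migration:
    """A reviewed schema change as it might be stored in git (hypothetical shape)."""
    database: str
    table: str
    alter: str  # the ALTER clause only, e.g. "ADD COLUMN note TEXT"


def gh_ost_command(m: Migration, host: str, execute: bool = False) -> list[str]:
    """Build the gh-ost argument list for one migration.

    Without --execute gh-ost only does a dry run, so that is the default here.
    """
    cmd = [
        "gh-ost",
        f"--host={host}",
        f"--database={m.database}",
        f"--table={m.table}",
        f"--alter={m.alter}",
    ]
    if execute:
        cmd.append("--execute")
    return cmd


if __name__ == "__main__":
    # Hypothetical migration and host name.
    change = Migration("shop", "orders", "ADD COLUMN note TEXT")
    print(" ".join(gh_ost_command(change, host="db.example.internal")))
```

Returning an argument list instead of a shell string means a web UI could hand it to `subprocess.run` without worrying about quoting the ALTER clause.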
Slide 37
all LB logic slowly moved to our own haproxies
haproxy configuration auto-generated from Ansible
makes it easy to shuffle things around
Slide 38
all apps slowly migrated to be served through haproxies
avoiding Rackspace LB magic
Slide 39
metrics, metrics, metrics (Datadog ftw)
Slide 40
TIME SPENT:
A YEAR AND A HALF
Slide 41
AND ALL BEFORE WE HAD APPROVAL TO DO ANYTHING
;)
Slide 42
"yeah yeah yeah... how did you do the actual migration?"
Slide 43
VPN bridge between AWS and RS
~20 MB/s, ~20 ms ping
good enough to treat as a "local" connection for shorter periods of time
Slide 44
mysql master-master replication between DCs
this was a massive pain to achieve with a 1.5 TB db
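A rough back-of-the-envelope calculation shows part of why the database size hurts. It takes the ~20 MB/s VPN bridge figure and the 1.5 TB database size from the slides and assumes, optimistically, that the link is saturated the whole time with no compression and no retries, so the result is a lower bound rather than a measurement.

```python
def transfer_hours(size_tb: float, rate_mb_per_s: float) -> float:
    """Hours needed to move size_tb terabytes at rate_mb_per_s (decimal units)."""
    size_mb = size_tb * 1_000_000  # 1 TB = 1,000,000 MB
    return size_mb / rate_mb_per_s / 3600


if __name__ == "__main__":
    hours = transfer_hours(1.5, 20)
    print(f"{hours:.1f} hours")  # about 20.8 hours
```

Even in the best case a single full copy of the data takes most of a day, so any failed attempt at seeding the replica is expensive to repeat.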
Slide 45
recreate the entire fleet in AWS
Slide 46
app servers in both DCs (java+python)
Slide 47
haproxies in both DCs
aware of app servers in both DCs
preferring local ones but falling back to remote if necessary
"no request left behind"
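The "prefer local, fall back to remote" behaviour can be expressed in haproxy by marking remote servers with the `backup` keyword. The sketch below generates such a backend in Python; it is an illustration under assumptions, not the Ansible templates from the talk, and the function name, DC labels and addresses are hypothetical.

```python
def render_backend(name: str, local_dc: str, servers: dict[str, list[str]]) -> str:
    """Render an haproxy backend that prefers servers in local_dc.

    servers maps a DC name to a list of "host:port" strings. Servers outside
    local_dc get haproxy's `backup` keyword, so they only receive traffic
    when no local server is up.
    """
    lines = [f"backend {name}", "    balance roundrobin"]
    # Sort so the local DC comes first; other DCs keep their original order.
    for dc in sorted(servers, key=lambda d: d != local_dc):
        for i, addr in enumerate(servers[dc], start=1):
            suffix = "" if dc == local_dc else " backup"
            lines.append(f"    server {dc}-{i} {addr} check{suffix}")
    return "\n".join(lines)


if __name__ == "__main__":
    # Hypothetical addresses; "aws" is treated as the local DC here.
    fleet = {"rs": ["10.0.0.1:8080"], "aws": ["10.1.0.1:8080", "10.1.0.2:8080"]}
    print(render_backend("app", "aws", fleet))
```

Rendering the same server list with a different `local_dc` gives the mirror-image config for the other datacenter. Note that haproxy by default sends traffic to only the first available backup server unless `option allbackups` is set.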
Slide 48
CloudFlare used for near-instant DNS failover
but even stray requests will get handled
Slide 49
RESULTS
Slide 50
core production migrated in days
internal tools migrated within a week or two
developer tools migrated within a month (git hosting, build server, etc)
obscure legacy services migrated within 2 months
Slide 51
all hardware at Rackspace decommissioned within 3 months
Slide 52
side effect: actual HA instead of fake HA
old "two or more of everything" approach translated well into Availability Zones
Slide 53
cost estimate? right on the money.
once the dust settled the $18k/mo bill from RS was replaced with a $6k/mo bill from AWS
Slide 54
AND IT WAS GOOD
Slide 55
Slide 56
The moral of this story is: don't wait for permission to do your job right.
Slide 57
Slide 58
1. If you see something broken, fix it
Slide 59
1. If you see something broken, fix it
2. If you don't have time to fix it - write it down
Slide 60
1. If you see something broken, fix it
2. If you don't have time to fix it - write it down
3. But do come back to it when you can steal a minute
Slide 61
1. If you see something broken, fix it
2. If you don't have time to fix it - write it down
3. But do come back to it when you can steal a minute
4. Even if it takes months to make progress
Slide 62
The team was well aware of how broken things were. If we pushed for it to be a single massive project, it would've never happened.
Slide 63
QUESTIONS?
Slide 64
THANK YOU!
Luka Kladaric
twitter: @kll
luka@sekura.io
www.sekura.io