Zero-downtime datacenter failovers (Switching hosting providers for dummies)

A presentation at Python Belgrade in January 2018 in Belgrade, Serbia by Luka Kladaric

Slide 1

ZERO-DOWNTIME DATACENTER FAILOVERS
(SWITCHING HOSTING PROVIDERS FOR DUMMIES)
Luka Kladaric @ Python Belgrade Jan 2018.

Slide 2

WHO?
Luka Kladaric
formerly: web developer for 10+ years
now: architecture, infrastructure & security consultant
also a startup founder and remote work evangelist

Slide 3

migrating an entire company's infrastructure from Rackspace to Amazon AWS

Slide 4

60 virtual machines
3 bare-metal boxes (db)
assorted networking equipment

Slide 5

the migration took 2 months to execute
but a year and a half to prepare...

Slide 6

WHY?

Slide 7

hand-crafted build server, unreproducible jobs for 3 Android apps...
...each completely different

Slide 8

massive monolithic 10 GB git repository
touching anything triggers a rollout of everything
no concept of "stable"

Slide 9

half the servers are not deployable from scratch
or their deployability is unknown

Slide 10

no local dev environments
half the company has to VPN into production to get any work done
everyone works directly on production systems
no db schema migration system == no db versioning

Slide 11

horrible code review tool (Rietveld)

Slide 12

same mysql account used by everyone everywhere

Slide 13

same mysql account used by everyone everywhere
that mysql account is "root"

Slide 14

same mysql account used by everyone everywhere
that mysql account is "root"
that mysql db is 1.5 TB big

Slide 15

no access to LB config
has a bunch of magic in it
changes often result in issues and outages

Slide 16

no server metrics / perfdata
no idea if overprovisioned and by how much

Slide 17

no access to disaster recovery instance if the primary DC goes down
(access goes through primary DC)

Slide 18

RACKSPACE WAS REALLY TERRIBLE
a constant pain to deal with
unexpected outages with causes never explained
unresponsive support team
zero flexibility

Slide 19

HOW LONG WOULD IT TAKE TO MIGRATE THIS?
optimistically:
conservatively:
realistically:

Slide 20

HOW LONG WOULD IT TAKE TO MIGRATE THIS?
optimistically: 3 months
conservatively:
realistically:

Slide 21

HOW LONG WOULD IT TAKE TO MIGRATE THIS?
optimistically: 3 months
conservatively: 6-9 months (of dedicated work)
realistically:

Slide 22

HOW LONG WOULD IT TAKE TO MIGRATE THIS?
optimistically: 3 months
conservatively: 6-9 months (of dedicated work)
realistically: a year (with interruptions)

Slide 23

NO LEADERSHIP BUY-IN
2 failed attempts to get approval
Infrastructure team makes a pact: "Do Things The Right Way From Now On"
mask cleanup work as ongoing maintenance

Slide 24

PLOT TWIST
RACKSPACE STARTS FALLING APART

Slide 25

NEW ESTIMATE
19 man-days
(after final push for preparation)

Slide 26

HOSTING COST ESTIMATE
before: $18k
after: $6k
savings: $12k (-66%!)
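
The percentage follows directly from the two figures. As a quick check in Python (slide 53 later gives both amounts as monthly bills; the function name is just for illustration):

```python
def savings(before, after):
    """Return (absolute savings, savings as a percentage of the old bill)."""
    saved = before - after
    return saved, saved / before * 100


saved, pct = savings(18_000, 6_000)
print(f"saved ${saved:,.0f}, i.e. {pct:.1f}% of the old bill")
```

Two thirds of the old bill is strictly 66.7%, which the slide shows as 66%.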

Slide 27

GOT APPROVAL!

Slide 28

Actually executed in 25-30 man-days over 2 months

Slide 29

HOW?

Slide 30

build server rebuilt from scratch
deployed from Ansible
all build jobs defined in code with inheritance and templating
tweaking jobs through UI disabled
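
The slide does not name the tooling behind the job definitions, so the following is only a sketch of what "jobs defined in code with inheritance and templating" can look like. Every name in it (`BASE_JOB`, `define_job`, the job fields, the app names, the Gradle command) is hypothetical.

```python
from string import Template

# Hypothetical shared defaults that every build job inherits.
BASE_JOB = {
    "scm": "git",
    "timeout_minutes": 30,
    "command": Template("./gradlew assemble$flavor"),
}


def define_job(name, flavor, **overrides):
    """Build one job: start from BASE_JOB, apply overrides, fill in the template."""
    job = {**BASE_JOB, **overrides, "name": name}
    job["command"] = job["command"].substitute(flavor=flavor)
    return job


# Three jobs that differ only where they have to.
JOBS = [
    define_job("android-app-a", "Release"),
    define_job("android-app-b", "Release", timeout_minutes=60),
    define_job("android-app-c", "Debug"),
]

for job in JOBS:
    print(job["name"], job["timeout_minutes"], job["command"])
```

With definitions like these kept in git, a job changed by hand in the UI would simply be overwritten on the next deploy, which fits with disabling UI edits altogether.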

Slide 31

monolithic git repository split up into 40 smaller repositories
changes trigger rollout only on affected project

Slide 32

all servers rebuilt and redeployed with Ansible
"upgrading the fleet to Ubuntu 16.04" ;)

Slide 33

better code review tool (Phabricator)
allows code ownership rules of engagement per repository
don't ask me about Phabricator (it's amazing)

Slide 34

most dev work doesn't require VPN any more
but even if it did...

Slide 35

no more shared mysql root account (RIP)
no write access to production database (to people or their software)
local dev environments! (far from perfect)

Slide 36

db schema migration system based on gh-ost
sql scripts -> code review -> git -> web ui to run
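
How the web UI invoked gh-ost is not described in the talk. As a rough sketch, the last step of that pipeline could boil down to turning a reviewed ALTER statement into a gh-ost command line. The helper name and the host, schema, and table values are made up; the flags are standard gh-ost options, and the command is only printed here, not run.

```python
import shlex


def gh_ost_command(host, database, table, alter, execute=False):
    """Build the argv for a gh-ost online schema change.

    Without --execute gh-ost only performs a dry run, so that is
    the default here.
    """
    argv = [
        "gh-ost",
        f"--host={host}",
        f"--database={database}",
        f"--table={table}",
        f"--alter={alter}",
    ]
    if execute:
        argv.append("--execute")
    return argv


argv = gh_ost_command(
    host="db-replica.example.internal",  # hypothetical host
    database="app",                      # hypothetical schema
    table="users",                       # hypothetical table
    alter="ADD COLUMN last_seen DATETIME NULL",
)
print(shlex.join(argv))
```

Defaulting to the dry run means a migration has to be promoted to a real run explicitly, which suits a flow where scripts are reviewed before anyone presses the button.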

Slide 37

all LB logic slowly moved to our own haproxies
haproxy configuration auto-generated from Ansible
makes it easy to shuffle things around

Slide 38

all apps slowly migrated to be served through haproxies
avoiding Rackspace LB magic

Slide 39

metrics, metrics, metrics
(Datadog ftw)

Slide 40

TIME SPENT: A YEAR AND A HALF

Slide 41

AND ALL BEFORE WE HAD APPROVAL TO DO ANYTHING ;)

Slide 42

"yeah yeah yeah... how did you do the actual migration?"

Slide 43

VPN bridge between AWS and RS
~20 MB/s, ~20 ms ping
good enough to treat as a "local" connection for shorter periods of time

Slide 44

mysql master-master replication between DCs
this was a massive pain to achieve with a 1.5 TB db
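
Some back-of-the-envelope arithmetic hints at why. If the initial copy went over the VPN bridge at the roughly 20 MB/s quoted for it, moving 1.5 TB once would take most of a day. The sketch assumes decimal units and a fully saturated link, neither of which the slides state.

```python
def transfer_hours(size_bytes, bytes_per_second):
    """Hours needed to move size_bytes at a constant rate."""
    return size_bytes / bytes_per_second / 3600


DB_SIZE = 1.5e12   # 1.5 TB, assuming decimal terabytes
LINK_RATE = 20e6   # ~20 MB/s, assuming the link is fully used

hours = transfer_hours(DB_SIZE, LINK_RATE)
print(f"{hours:.1f} hours")
```

That is close to 21 hours for a single pass, during which the source database keeps taking writes that the new replica then has to catch up on.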

Slide 45

recreate the entire fleet in AWS

Slide 46

app servers in both DCs (java+python)

Slide 47

haproxies in both DCs
aware of app servers in both DCs
preferring local ones but falling back to remote if necessary
"no request left behind"
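
In HAProxy, "prefer local, fall back to remote" is commonly expressed by marking the remote servers as `backup`, so they only receive traffic when no local server is up. The talk shows neither the actual configuration nor the Ansible template, so the sketch below is an assumption: a small Python function that renders such a backend from a hypothetical inventory, with all names and addresses invented.

```python
def render_backend(name, local_dc, servers):
    """Render an HAProxy backend that prefers servers in local_dc.

    servers is a list of (server_name, dc, address) tuples. Servers
    outside local_dc get the `backup` flag, so HAProxy only uses them
    when every non-backup server is down.
    """
    lines = [f"backend {name}", "    balance roundrobin"]
    for server_name, dc, address in servers:
        flags = "check" if dc == local_dc else "check backup"
        lines.append(f"    server {server_name} {address} {flags}")
    return "\n".join(lines)


# Hypothetical inventory: two app servers per datacenter.
INVENTORY = [
    ("app-rs-1", "rs", "10.0.1.11:8080"),
    ("app-rs-2", "rs", "10.0.1.12:8080"),
    ("app-aws-1", "aws", "10.1.1.11:8080"),
    ("app-aws-2", "aws", "10.1.1.12:8080"),
]

print(render_backend("app", "aws", INVENTORY))
```

Rendering the same inventory with `local_dc="rs"` gives the mirror-image configuration for the proxies on the Rackspace side, so one inventory can drive both datacenters.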

Slide 48

CloudFlare used for near-instant DNS failover
but even stray requests will get handled

Slide 49

RESULTS

Slide 50

core production migrated in days
internal tools migrated within a week or two
developer tools migrated within a month (git hosting, build server, etc)
obscure legacy services migrated within 2 months

Slide 51

all hardware at Rackspace decommissioned within 3 months

Slide 52

side effect: actual HA instead of fake HA
old "two or more of everything" approach translated well into Availability Zones

Slide 53

cost estimate? right on the money.
once the dust settled the $18k/mo bill from RS was replaced with a $6k/mo bill from AWS

Slide 54

AND IT WAS GOOD

Slide 56

The moral of this story is: don't wait for permission to do your job right.

Slide 58

  1. If you see something broken, fix it

Slide 59

  1. If you see something broken, fix it
  2. If you don't have time to fix it - write it down

Slide 60

  1. If you see something broken, fix it
  2. If you don't have time to fix it - write it down
  3. But do come back to it when you can steal a minute

Slide 61

  1. If you see something broken, fix it
  2. If you don't have time to fix it - write it down
  3. But do come back to it when you can steal a minute
  4. Even if it takes months to make progress

Slide 62

The team was well aware of how broken things were.
If we had pushed for it to be a single massive project, it would never have happened.

Slide 63

QUESTIONS?

Slide 64

THANK YOU!
Luka Kladaric
twitter: @kll
luka@sekura.io
www.sekura.io