Divide and Conquer So Nothing is Big

Tom Harrison · Tom Harrison’s Blog · Jan 19, 2017


Agile is a way of thinking about solving big problems by breaking them into small pieces. But if you don’t like Agile, it’s OK. Divide and Conquer.

All software problems worth mentioning are huge, even if they might seem straightforward.

We have a good one at my company now.

Our current deployment process is creaky

Our deployment process, which has evolved over time, needs a makeover. Deployment is hard … unless you happened to have started a company after tools like Capistrano, Jenkins, Travis, xUnit, and other Build/Test/Deploy frameworks had it all figured out. Or unless, like Ruby on Rails, your framework has had a built-in database migration tool since day one (brilliant!).

Our company is now big, getting bigger, and quickly. We got that way over time and started before all of these fancy new innovations.

No Downtime Deploys

So we need to make our software more easily deployable.

This is terrifying and huge.

The folks that have been there since the start know where all the bodies are buried. They know all the things that could go wrong, because they already have.

We have an incredibly careful, well managed process that’s been crafted over the years. It still has manual components, and most software releases are rolling changes (no downtime) but still, when we have database migrations, we have to do them all at once.

That was fine when we had a few customers in a small database. But now we have hundreds of customers, with more big ones coming, spread across a slew of different database instances. So while we figured out how to scale the database horizontally, the deployment part is still a one-shot deal.

In particular, when we need to add or modify the schema, we need to run a script to upgrade — a migration. But it works on all database instances at once.

Everything … everything depends on that database schema. So as we ticked through the components, it turned out to be kind of hard not to take the whole system down at once.

Kind of OK last year

This was “kind of OK” as is for most customers, but now we’re moving around the globe, and our 4am-6am maintenance window is inching towards lunchtime for our European customers, or opening hours for our convenience store customers.

So we want to be able to make schema changes with almost no scheduled maintenance window, and during a window that fits our customers’ needs.

That’s a big deal, as it happens. So big, it’s daunting.

I have the benefit of having come from the more recent past, where even at large companies with the same international and other constraints, the deployment process had been designed from the outset to meet these needs.

To me, it looks simple.

The truth is in between, and (as we learned in our lessons from Reality 101) probably closer to “daunting” than “simple”.

Blue-Green Deployments

The core design of all systems should probably follow Martin Fowler’s Blue-Green Deployment strategy. The main ideas are:

  1. You have all incoming requests routable to one of two system environments he calls “blue” and “green”. These are effectively identical, parallel systems which flip-flop between next version and live version.
  2. Whichever environment is not live is, for a little while, a hot-rollback option if something goes wrong. Shortly thereafter, it gets upgraded and is the hot-failover site. Just as you’re ready to roll the next deploy, it’s the staging environment for the next release. (There’s a sketch of the cutover just after this list.)
  3. Changes involving database schema updates will be a) forward compatible, and b) deployed first and separately, so you can have old code running against the new schema. In the large majority of cases, you can accomplish more complicated changes (e.g., index builds) in idle times, so you don’t need to “bounce” your system. Most simple schema updates can happen under live load. Rarely should you need to bounce a database.
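
To make the flip-flop concrete, here’s a minimal sketch in Python of what a cutover script might look like. The environment URLs, the health-check endpoint, and the “pointer” file the router reads are all invented for illustration; this is not our real tooling.

```python
# Hypothetical blue-green cutover sketch. Environment URLs, the health
# endpoint, and the pointer file the router reads are invented examples.
import json
import sys
import urllib.request

ENVIRONMENTS = {
    "blue": "https://blue.internal.example.com",
    "green": "https://green.internal.example.com",
}
POINTER_FILE = "/etc/router/live_environment.json"  # what the router consults


def is_healthy(base_url: str) -> bool:
    """Return True if the environment answers its health check."""
    try:
        with urllib.request.urlopen(base_url + "/healthz", timeout=5) as resp:
            return resp.status == 200
    except OSError:
        return False


def flip_to(candidate: str) -> None:
    """Point the router at `candidate`, but only after it proves healthy.

    The previously live environment stays warm as the hot-rollback option.
    """
    if not is_healthy(ENVIRONMENTS[candidate]):
        sys.exit(f"{candidate} failed its health check; aborting cutover")
    with open(POINTER_FILE, "w") as f:
        json.dump({"live": candidate}, f)
    print(f"Router now sends traffic to {candidate}")


if __name__ == "__main__":
    flip_to(sys.argv[1])  # e.g. `python flip.py green`
```

In a real setup the “pointer” would more likely be a load balancer or DNS change, but the shape is the same: health-check the idle side, flip, and keep the old side warm as your rollback.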

The key is to plan to avoid breaking changes, where two interdependent systems need to be upgraded in parallel in order to keep working. This kind of process is indeed more work. But it’s much, much, safer and easier to recover from when there are problems.

There are many reasons why this is a well-established pattern. It does indeed mean that you need to have a full-time hot spare. But then you should have that anyway if your business is serious. Ever found your spare tire has gone flat? Avoid that problem by making every deployment a test of your disaster recovery system, too. Nice.

Easy for you to say…

I mostly recapitulate Fowler’s piece, but he and I both make a few blithe assumptions, such as: you have a complete replica of your production system warmed up and ready to roll, and probably two equally capable data centers with great hardware. This does cost real money.

It’s a hell of a lot easier as your systems get bigger, and also if you’re hosted in a virtualized or cloud environment (or even… containerized) where you can spin up a new cluster in a matter of minutes and where multiple availability zones are assumed.

But the key message is about the database. Through discipline, you need to be able to apply DB changes that work with old and new code first, then deploy the code on top second.
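
As a sketch of that discipline, here’s what an “expand first, contract much later” migration might look like, assuming Postgres and psycopg2; the table and column names are invented for illustration.

```python
# Hypothetical "expand first, contract later" migration sketch
# (Postgres via psycopg2). Table and column names are made up.
import psycopg2

EXPAND_STEPS = [
    # Additive and backward compatible: old code simply ignores the new column.
    "ALTER TABLE orders ADD COLUMN IF NOT EXISTS shipped_at timestamptz",
    # Build the index without taking long locks on a live table.
    "CREATE INDEX CONCURRENTLY IF NOT EXISTS idx_orders_shipped_at"
    " ON orders (shipped_at)",
]

CONTRACT_STEPS = [
    # Destructive: only safe once no running code still reads legacy_status.
    "ALTER TABLE orders DROP COLUMN IF EXISTS legacy_status",
]


def run(steps, dsn="dbname=app"):
    conn = psycopg2.connect(dsn)
    conn.autocommit = True  # CREATE INDEX CONCURRENTLY cannot run in a transaction
    with conn.cursor() as cur:
        for sql in steps:
            print("applying:", sql)
            cur.execute(sql)
    conn.close()


if __name__ == "__main__":
    run(EXPAND_STEPS)      # ships first, while old code is still live
    # run(CONTRACT_STEPS)  # ships much later, as its own release
```

The expand steps are safe to run while old code is still serving traffic; the contract steps only ship after every consumer of the old column is gone, usually several releases later.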

But just one more assumption…

But by far the biggest assumption is that you never have “big bang” releases that drop multiple new things at once.

How is this possible?

For example, today Slack released a long-awaited feature: conversation threads. For a product like Slack, this is a huge feature. It’s a big bang as far as I, the user, am concerned. So how?

One of the assumptions baked into this whole notion is that you make many small and frequent deployments, not big bang releases. Sound familiar? Let’s not call it Agile, because it’s going out of style, so I’m told. But just sayin’, Martin Fowler was one of the few cool kids who were there for the Agile Manifesto.

How can you do lots of small deployments if it’s a heavy manual process? Answer: automate it, with continuous integration (CI) processes. How do you test so fast? Tests are part of CI! Oh good god, the next thing I am going to say is that you should have continuous delivery (CD). Maybe next year.

All of this stuff was crazy whack-a-doodle just 10 or 15 years ago; now it’s all worked out (major props to Jenkins and Travis CI; I have used and love both), and it’s pretty much the way things are done in software.

Divide And Conquer

If I had to guess, Slack has been working on their big new feature for the last six months. Little dependencies, most of which you never saw, were likely wiggled in over small releases as the new schema, the new UI, and the new actions were all revised in little steps, and then the old code was swept away. Further, I assume they have had beta tests and internal testing going on for months, enabled for some users and not others using feature flags.
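
A first cut at feature flags doesn’t need much machinery. Here’s a hypothetical sketch (the flag name and rollout rules are made up, and certainly aren’t Slack’s): enable a feature for an allowlist of users plus a stable percentage of everyone else.

```python
# Minimal feature-flag sketch: a flag is resolved per user, so a feature can
# ship dark, open up to an allowlist of testers, then roll out by percentage.
# The flag name and rules are invented for illustration.
import hashlib

FLAGS = {
    "conversation_threads": {
        "allow_users": {"internal-qa", "alice@example.com"},
        "percent": 5,  # of everyone else
    },
}


def bucket(user_id: str) -> int:
    """Stable 0-99 bucket per user, so a rollout doesn't flicker between requests."""
    digest = hashlib.sha256(user_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % 100


def is_enabled(flag: str, user_id: str) -> bool:
    rule = FLAGS.get(flag)
    if rule is None:
        return False  # unknown flags default to off
    if user_id in rule["allow_users"]:
        return True
    return bucket(user_id) < rule["percent"]


if __name__ == "__main__":
    print(is_enabled("conversation_threads", "alice@example.com"))  # allowlisted
    print(is_enabled("conversation_threads", "some-random-user"))   # ~5% chance
```

Because the bucket is derived from the user id, each user gets a consistent experience across requests, and turning the percentage up to 100 is the “launch.”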

Back to our problem

So to recap, we were talking about how we have the problem of retrofitting a no-downtime deployment process into our existing codebase.

We need to build and learn how to use a few things. We need some routing software and the means to control it.

We need to be ready to have both of the data centers we’re in now, and more in the future, serve as hot spares or easily spun-up staging environments (both of which we do today).

We certainly need to understand the dependencies between various systems and think about ways to minimize coupling.

We need to think about ways we can partition our data, both logically and geographically.
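
For example, even a thin lookup layer that maps each customer to a database instance gives us a seam for both logical and geographic partitioning. A hypothetical sketch (the shard names, regions, and connection strings are invented):

```python
# Hypothetical customer-to-database routing sketch. Shard names, regions, and
# connection strings are invented; a real version would read the mapping from
# configuration or a service rather than a literal dict.
from dataclasses import dataclass


@dataclass(frozen=True)
class Shard:
    name: str
    region: str
    dsn: str


SHARDS = {
    "us-east-1a": Shard("us-east-1a", "us-east", "postgres://db-use1a/app"),
    "eu-west-1a": Shard("eu-west-1a", "eu-west", "postgres://db-euw1a/app"),
}

# Assignments live in one place, so moving a customer (or a whole region)
# becomes a data change rather than a code change.
CUSTOMER_SHARD = {
    "acme-stores": "us-east-1a",
    "bonjour-cafe": "eu-west-1a",
}


def shard_for(customer_id: str) -> Shard:
    try:
        return SHARDS[CUSTOMER_SHARD[customer_id]]
    except KeyError:
        raise LookupError(f"no shard assigned for customer {customer_id!r}")


if __name__ == "__main__":
    print(shard_for("bonjour-cafe").dsn)
```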

We need to build a process and infrastructure for feature flagging.

And then we need to make it work.

The only way we’ll do this: piece by piece. Divide, and conquer.

Image: NASA / Bill Anders, public domain
