Timothy Fitz has a very interesting blog post on Continuous Deployment at IMVU (subtitled “Doing the impossible fifty times a day”), detailing how all committed code gets automagically pushed to their cluster of servers, assuming it passes all tests. One very nice aspect is that each change first goes live on only a small subset of their machines. If there’s any significant variation in a series of metrics measured on those machines (load average, errors generated, etc.), the revision is automatically rolled back rather than pushed to the rest of the cluster.
In the comments someone raises the question of how such a system can work when database schema changes are required, describing this as the “achilles heel of partial cluster deployment”.
At BlackStar we didn’t have a system anywhere near this advanced, but we did need as close to zero downtime as possible, so we had to come up with a way of putting database changes live that couldn’t break running code in the meantime.
One of the most common schema changes in an evolving system is the gradual migration of all 1-1 relationships to 1-many or many-many. (Someone recently posited that a database archaeologist could tell the age of a system by how many 1-1 relationships still existed. I can’t remember who or where, though. Leave a comment and I’ll credit them.)
So, for example, when you start out it’s common to have an ‘email’ column in a ‘user’ table. Eventually, though, it will become necessary to handle a user who needs two or more different email addresses. The obvious solution is to split out an ‘email’ table, migrate all the existing data into it, and update the code to use that table instead of the ‘user’ table. However, when different machines can be running different versions of that code (the “before” and “after” versions) simultaneously, you have a problem. If you put the schema changes live first, the “before” versions suddenly break. If you put the code live first, the “after” versions won’t work until you change the schema. In an environment where down-time is acceptable, you just turn everything off, make the schema changes, push the new code, and you’re fine. But what to do when it isn’t?
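To make the failure mode concrete, here’s a minimal sketch using SQLite (table and column names just follow the example above, and the column is dropped by rebuilding the table, since older SQLite versions have no `DROP COLUMN`). After the one-shot migration, any server still running the “before” code is querying a column that no longer exists:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE user (id INTEGER PRIMARY KEY, name TEXT, email TEXT)")
conn.execute("INSERT INTO user (name, email) VALUES ('alice', 'alice@example.com')")

# The naive one-shot migration: split the email out into its own table...
conn.execute("CREATE TABLE email (id INTEGER PRIMARY KEY, user_id INTEGER, address TEXT)")
conn.execute("INSERT INTO email (user_id, address) SELECT id, email FROM user")

# ...and immediately drop the old column (here by rebuilding the table).
conn.execute("CREATE TABLE user_new (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("INSERT INTO user_new SELECT id, name FROM user")
conn.execute("DROP TABLE user")
conn.execute("ALTER TABLE user_new RENAME TO user")

# Any server still running the "before" code now fails at runtime:
try:
    conn.execute("SELECT email FROM user WHERE name = 'alice'")
except sqlite3.OperationalError as exc:
    print("before-code breaks:", exc)  # no such column: email
```

The data itself migrated fine; it’s the mix of code versions that breaks, which is exactly what the staged approach below avoids.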
Well, then you need to do everything in stages:
- First, you need to create the new table. No code uses it yet; it’s simply a schema change, so you can safely make it go live.
- Once that’s deployed, you change any code that writes email addresses to write to both places. Users are still only allowed a single email address, but now it gets inserted into both the ‘user’ table and the ‘email’ table. Under normal circumstances such duplication is bad, but it’s only a temporary measure: everything will be properly normalised by the time we’re finished.
- Once that code is successfully live everywhere, you can run a migration on all the existing data. Any new email address added in the live system is going into both tables, but before any code can read from the new table, we need to make sure it’s comprehensive, so all pre-existing addresses need to be migrated. For a simple case like this you can probably run a single SQL command; for more complex scenarios you may need a more involved script – though for those you may be better off breaking the change down into a series of small migrations like this one.
- Once you’re sure that both tables are perfectly in sync, and are staying that way, you can start to migrate all the code that reads email addresses to use the new table. This doesn’t have to happen all at once. In a well-factored system the scope of this change should be very small, but in reality you’re likely to have code strewn all over the place that reads this data. The doubled data source, though, means those reads can be migrated one by one without blocking any other changes. (At BlackStar we generally made such changes very quickly, as we couldn’t put the new functionality live until the migration was complete, but we also had a couple of cases where it took several months to change all the code to use the new table.)
- Eventually, when you’re sure that no code reads from the old column, you can remove the code that writes to it, leaving, of course, the code that writes to the new table.
- Once that’s live everywhere, you can delete the column. Or, if deleting a column would take too long on your system and might cause some downtime, you can just delete the data from it, record by record if need be. (Or, of course, you can apply a similar multi-step approach: create a new user table without this column, migrate all the code to use it instead, and then delete the old one.)
- Now you have a system that, from the outside, functions identically to when you started – users can still only have a single email address. But that restriction is no longer baked into the underlying schema, so you can now take whatever code imposes it and fix it to allow multiple addresses, without worrying about bringing the database into sync.
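The whole sequence can be sketched in one place with SQLite. This is only an illustration of the staged pattern under the ‘user’/‘email’ example – in a real system each stage is a separate deployment, not consecutive lines of one script:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Starting point: the 1-1 schema, with the email stored on the user row.
cur.execute("CREATE TABLE user (id INTEGER PRIMARY KEY, name TEXT, email TEXT)")
cur.execute("INSERT INTO user (name, email) VALUES ('alice', 'alice@example.com')")

# Stage 1: create the new table. No code reads or writes it yet.
cur.execute("""CREATE TABLE email (
                   id INTEGER PRIMARY KEY,
                   user_id INTEGER NOT NULL REFERENCES user(id),
                   address TEXT NOT NULL)""")

# Stage 2: writers now write to both places (temporary, deliberate duplication).
def add_user(name, address):
    cur.execute("INSERT INTO user (name, email) VALUES (?, ?)", (name, address))
    cur.execute("INSERT INTO email (user_id, address) VALUES (?, ?)",
                (cur.lastrowid, address))

add_user("bob", "bob@example.com")

# Stage 3: backfill the rows that predate the dual writes.
cur.execute("""INSERT INTO email (user_id, address)
                   SELECT id, email FROM user
                   WHERE email IS NOT NULL
                     AND id NOT IN (SELECT user_id FROM email)""")

# Stage 4: readers switch to the new table, one call site at a time.
rows = cur.execute("""SELECT u.name, e.address
                          FROM user u JOIN email e ON e.user_id = u.id
                          ORDER BY u.name""").fetchall()
print(rows)  # [('alice', 'alice@example.com'), ('bob', 'bob@example.com')]

# Stages 5-6: stop writing user.email, then clear (or drop) the old column.
cur.execute("UPDATE user SET email = NULL")
conn.commit()
```

At every intermediate point, a query against either the old column or the new table returns the same answer, which is what lets old and new code coexist on the cluster.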
It’s a much more involved process, but at every step everything is consistent no matter which version of the code is active on a given server, everything continues to run safely, and there’s no need for any down time.
The actual time it takes to get from stage 1 to stage 7 depends not only on how long it takes to develop the code changes, but also on the gap between each deployment. If you only deploy changes once a week, it can take a few months to work through all the steps. If, however, you can get to a position where you can safely deploy multiple times per day, then of course you can finish much, much more quickly. And if you only deploy once a month, or even once a quarter – well, then you have even bigger problems.