Archive

Posts Tagged ‘Databases’

Splitting a WordPress blog in two

May 13th, 2009 No comments

This blog had its seventh birthday recently. I know there are many amongst you who have been blogging since before the term was even coined, and who make more posts in a month than I’ve made in seven years, but still.

Anyway, back in the early days of blogging, a significant percentage of blog posts weren’t original content, but the equivalent of retweeting: a way of passing on to your readers something interesting you’d read elsewhere. Of course the vast majority of those were links to other people’s blogs. It’s how word spread about interesting posts before digg and reddit and twitter and the like.

I tried to do something slightly different for a while: rather than just regurgitating other blog posts, I instead regurgitated interesting snippets from real dead tree books I was reading, picking interesting excerpts chapter by chapter.

It seemed to be well received, and I had a lot of fun choosing which couple of paragraphs from each chapter could convey something interesting enough to both stand alone without the surrounding context and also encourage others to seek out the book for more depth.

Early in 2004, I seem to have abandoned the idea. Likely it’s just because I was super-busy with Twingle, and then with Unite, and I probably always meant to get around to picking it up again, but just never did. Until now.

I decided, however, to do this on a new separate blog: dustyvolumes.com. So I had to work out how to move all the old posts to there. This was significantly more complicated than I expected. Doubtless someone will point me to a WordPress plugin that could have made the whole thing take 30 seconds, but in the absence of that, here’s the gory details for anyone else who ever wants to do something like this.

First, of course, I needed to have the new blog set up. I’m assuming that’s self-evident, and needs no further explanation.

Next I needed to find all the posts I wanted to move. I already had them all tagged with “Books”, so this part was fairly easy and avoided an even longer manual process. WordPress doesn’t have an ‘export by tag/category’ option, though—the only way to restrict an export is by author. So I had to go into “Posts > Edit”, find a post with the relevant tag, and click that tag to give me a list of all those posts. Then I could do a Bulk Edit of each to change the author to a new temporary account I set up just for this purpose. There were multiple pages of them, and there doesn’t seem to be a way to operate on more than one page at a time, so I went through them page by page. It was repetitive enough to make me want to find a short-cut, but there weren’t quite enough pages to make it worthwhile.

Then I exported all the posts by my new author, and imported those into the new blog. I did some more tidying up there of tags and categories etc, and found a few posts that should probably still remain on this blog instead (they were tagged with Books too, but were, for example, about me getting rid of my collection before moving to Estonia, rather than being excerpts suitable for Dusty Volumes), so deleted them from there, and changed the author here back to me on each of them in turn (I wanted that author to match exactly the posts that were on the other blog so I could continue to operate on those here).

Now I had the new blog working, but hit the much harder problem of what to do about the posts here. I could, of course, just have deleted the posts that I’d moved, but I still get quite a few hits on them from Google searches and links from other blogs, as well as some internal links to them, and I didn’t want to break all those. After some research I found a couple of WordPress plugins for setting up redirection. The first one I tried, “Redirection“, has lots and lots of features, but wasn’t quite what I wanted. The second, “Redirect“, was perfect. It does only one thing, but does it simply, and does it well. Using the Custom Field options in WordPress, it lets you set a ‘Redirect’ field with a value of the URL that viewers should be redirected to on viewing a given post. So now it was just a matter of going through and setting those up one by one.

Thankfully the WordPress import maintains the post ID from the export, so I didn’t need to spend any time building a map of which IDs should map where: each relevant post would just need to redirect to http://dustyvolumes.com/archives/<id>. I did a couple of these manually to make sure everything was working, but there was no way I wanted to do another 150 or so by hand. It was time to go to the database.

I’ve never actually explored the WordPress schema before, but there aren’t very many tables, and it’s fairly easy to work out what’s going on. (There’s probably decent documentation for it all too, but I tend to prefer to just work things like this out manually.) I’m not going to detail all the SQL commands I had to run: if you don’t know enough to work them out yourself you probably shouldn’t be playing with the database directly anyway, and should just do this the longwinded way (and I really don’t want to be fielding questions on it 6 months from now when the schema has changed). But it was a simple matter to just select the IDs of all posts by my fake ‘author’, and insert the relevant Redirect custom field values.

However, this still left a large number of ‘Books’ entries in my tagcloud that really weren’t there any more, so I also wanted to remove all the tags from these posts too. Ideally the Bulk Edit should be capable of this, but it currently only allows you to add a tag to multiple posts, not remove one, so again I went to the database. This one was slightly trickier, as it’s a cross-table DELETE, but again, if don’t know how to do that, you shouldn’t just be pasting in random SQL you found on someone’s blog somewhere.

Unfortunately, although that successfully removed all the tags, the tag cloud still proudly declared that I had a huge number of “Books” posts. WordPress, presumably for speed, keeps a total of how many posts are assigned to each category in a different table, and, being a typical modern webapp, maintains that count in client code rather than in the database itself. So having manually removed lots of tags without updating the count field too, my database was now out of sync with itself. MySQL doesn’t do cross-table UPDATEs with aggregates, so this time I needed an UPDATE with a subselect of a COUNT(*).

Including lots of cautious exploratory SELECTs, lots of LIMITs of my UPDATEs and DELETEs to make sure the right thing was happening each time, and backing up carefully after each major change, the whole thing took about an hour. I could possibly have done it all via the web interface in that time, but it would have been a close call, and there was a very high chance that I’d have gotten so bored in the middle of it that I’d have abandoned it half-way through, promising to finish it another day (and likely never quite gotten around to it). This way was mentally stimulating rather than draining, thus giving much more satisfaction when done, and I learned much more about the WordPress database structure that could be very useful if I ever decide to write a Plugin.

And now I have two blogs to rarely write in…

Gradual deployment of schema changes

February 16th, 2009 4 comments

Timothy Fritz has a very interesting blog post on Continuous Deployment at IMVU (subtitled “Doing the impossible fifty times a day”), detailing how all committed code gets automagically pushed to their cluster of servers assuming it passes all tests. One very nice aspect of it is that the change is first put live on only a small set of their machines. Then if there’s any significant variation in a series of metrics tested on those machines (load average, errors generated, etc.) the revision is automatically rolled back rather than pushed to the remainder of the cluster.

In the comments someone raises the question of how such a system can work when database schema changes are required, describing this as the “achilles heel of partial cluster deployment”.

At BlackStar we didn’t have a system anywhere near this advanced, but we did have a requirement to have as close to zero downtime as possible and so we needed to come up with a system for putting database changes live in a way that couldn’t break code in the meantime.

One of the most common schema changes in an evolving system is the gradual migration of all 1-1 relationships to 1-many or many-many. (Someone recently posited that a database archaeologist could tell the age of a system by how many 1-1 relationships still existed. I can’t remember who or where, though. Leave a comment and I’ll credit them.)

So, for example, when you start out, it’s common to have an ‘email’ column in a ‘user’ table. Eventually, though, it will become necessary to handle a user who needs to use two or more different email addresses. The obvious solution is to split out an ‘email’ table, migrate all the existing data into it, and update the code to use that table instead of the ‘user’ table. However, when you can have different machines potentially running different versions of that code (the “before” and “after” versions) simultaneously, then you have problems. If you put the database schema changes live first, then the “before” versions will suddenly break. If you put the code live first, then the “after” version won’t work until you change the schema. In an environment where down-time is acceptable, then you just turn everything off, make the schema changes, push the new code, and you’re fine. But what to do when it isn’t?

Well, then you need to do everything in stages:

  1. First, you need to create the new table. No code uses it yet, it’s simply a schema change, so you can safely make it go live.
  2. Once deployed, you change any code that writes email address to write to both places. Users are still only allowed a single email address, but now that gets inserted into both the ‘user’ table and the ‘email’ table. Under normal circumstances such duplication is bad, but it’s only a temporary measure. Everything will be properly normalised when we’re finished.
  3. Once that code is successfully live everywhere, you can then run a migration on all the existing data. Any new email addresses being added in the live system are being added to both tables, but before we can change any code to read from the new table, we need to make sure it’s comprehensive. So all pre-existing addresses need to be migrated. For a simple case like this you can probably use run a single SQL command; for more complex scenarios you may need a more involved script – but for those you may be better off breaking it down in to a series of migrations like this.
  4. Once you’re sure that both tables are perfectly in sync, and are staying that way, you can start to migrate all code that reads email addresses to use the new table. This doesn’t have to happen all at once. In a well factored system the scope of this change should be very small, but in reality you’re likely to have code strewn all over the place that reads this data. But the doubled data source means they can gradually be eliminated one by one without blocking any other changes. (At BlackStar we generally made such changes very quickly as we couldn’t put the new functionality we wanted live until we were complete, but we also had a couple of cases where it was a much longer process that took several months to change all the code to use the new table).
  5. Eventually, when you’re sure that no code reads from the old table you can remove the old code that writes to it, leaving, of course, the code that writes to the new table.
  6. Once that’s live everywhere, you can delete the column. Or, if deleting a column takes too long on your system and might cause some downtime, you can just delete all the data from it, record by record if needs be. (Or, of course, you can apply a similar multi-step approach to create a new user table without this column, migrate all the code to use it instead, and then delete the old one.)
  7. Now you have a system that, on the outside, functions identically to when you started – users can still only have a single email address. But that is no longer true of the underlying data schema. So you can now take whatever code imposes this restriction and fix it to allow for multiples without worrying about bringing the database into sync.

It’s a much more involved process, but at every step everything is consistent no matter which version of the code is active on a given server, everything continues to run safely, and there’s no need for any down time.

The actual time that it takes to get from stage 1 to stage 7 depends not only on how long it takes to develop the code changes, but also the gap between each deployment. If you only deploy changes once a week, it can take a few months to work through all the steps. If, however, you can get to a position where you can safely deploy multiple times per day, then you can of course be complete much much quicker. And if you only deploy once a month, or even once a quarter, well, then you have even bigger problems.

mytop and locked threads

January 5th, 2007 2 comments

Recently I’ve had to do some heavy-duty maintenance work on a MySQL database that’s still in heavy use, can’t take more than about 30 seconds of downtime, is having serious problems due to a 50 million row table that really needs trimmed down, but is missing the crucial indexes that would allow that to happen easily. Without the indexes, even deleting the complete data set for obsolete accounts can take several hours, never mind the time to perform the more complex purges for active accounts (which should be happening daily, but as they take too long, haven’t been happening for a long time, thus making the problem worse every day!) Of course, adding the indexes that would make this all much easier would also lock the table up even longer.

I have a plan to solve this by temporarily replicating the table to a slave version that has the correct indexes and then swapping the tables (I’ll blog the complete details later if I can get it to work), but whilst I’ve been in investigation mode, I’ve been relying heavily on mytop. This is a wonderful little utility for watching what’s happening in a mysql database, similar to the unix ‘top’ command.

Because the data is in a MyISAM table, and thus has table level locking, it’s prone to the old problem of a long select causing an insert or update to block, which in turn causes all other selects behind that to block as well. So I need to be very careful that none of my queries are causing a big queue. For this, mytop is almost perfect, with one small caveat: out of the box it doesn’t show which threads are locked. I don’t care if my select takes too long if it isn’t blocking anything, but once a queue forms, I need to be ready to kill my thread. Mostly I can work this out from what else is executing, but I prefer letting the computer do that sort of work for me. So I made a simple one line addition to the code.

At around line 1000 there’s a an ‘if ($HAS_COLOR)’ block to print the output in different colours depending on the type of command being executed at the database. At the end of that I added:

print RED()    if $thread->{State} && $thread->{State} eq 'Locked';

Now any locked thread is instantly recognizable, and I can react much quicker to any problem. It’s also quite interesting to watch what’s happening even when I’m not meddling, and see how many locks are naturally arising anyway!

There hasn’t been a release of mytop for a few years now, the mailing list has vanished, and I’m not sure whether it’s even actively maintained any more. So I’m not expecting to see this show up in the live code any time soon. But this post will remind me to add this on any future installations where it might come in useful. The beauty of Free Software!

Tags:

MySQL and Random

March 20th, 2003 2 comments

Yesterday, when writing some tests involving data stored in MySQL, I was quite impressed to discover that I could add random data to the table by doing:

UPDATE table SET column = FLOOR(8 * RAND()) + 1;

I was originally worried that this would pick a random value and then assign it to every row, but no, it picks a new random number each time.

Of course, I shouldn’t have been quite so surprised as I’ve long been a fan of picking a random row via:

SELECT * FROM table ORDER BY RAND() LIMIT 1

I’d just never thought through the implication that this must indeed be selecting a random value for each row in order to sort by it…

Tags:

Stored Procedures for Web Security

January 10th, 2003 No comments

V. Satheesh Babu has an interesting article on using Oracle Stored Procedures to add an extra layer of security to web applications.

Tags:

Spatial Indexes in MySQL

January 7th, 2003 No comments

Jeremy shows how to use spatial indexes in MySQL:


  SELECT name
    FROM map_test
  WHERE Contains(GeomFromText('POLYGON((0 0, 0 3, 3 3, 3 0, 0 0))'), loc) = 1;
Tags:

MySQL best practices

July 14th, 2002 No comments

Jeremy Zawody ably dissects the insanity that is the latest MySQL article on O’Reilly’s OnLamp. This article is written by one of the authors of O’Reilly’s latest book on MySQL, and has probably served to ensure that I don’t buy this book.

I’d add the following comments to Jeremy’s:

4. Don’t store binary data in MySQL

I rarely store BLOBs in a table with other data. But I do store them in MySQL. My reasoning is much more prosaic though. I frequently use MySQL from its command line interface, and much as it’s bad practice in real life code, I like being able to do a “SELECT *” in a query there. And I hate not being able to see all the results I’m interested in “at a glance”. Having BLOBs in the table makes the command line unusable in this manner unless you explicitly name the columns you want to see when operating in this mode – and I hate having to do that. So, I just throw all my BLOBs into a different table and get the best of both worlds.

5. Stick to ANSI SQL.

Most programmers now accept that premature optimisation is the root of all evil. But is there an equivalent warning against “premature portability”? Reese claims that making use of MySQL’s extensions “may make it impossible to port the application to another database without a significant rewrite.”

I’ve met lots of people who migrated applications away from MySQL, but none of them seemed to have any good reasons for it, and most did it as part of a complete migration away from a LAMP-style set-up to a more traditional enterprise set-up with Oracle and Java etc, where they had to rewrite everything anyway. At BlackStar we were persistently told that we’d never be able to scale a LAMP set-up, or that we’d never get funding without ‘solidifying’ our set-up. Well, we scaled it quite nicely, and the VCs (after a few initial queries) seemed quite happy by the huge difference in costs between our set-up and that of many of our peers.

If it’s a prerequisite that your software has to be database agnostic, then, yes, ignore all non-standard features in every database. But you’ll probably have to ignore quite a few ‘standard’ features too, as not every database handles them the same way.

If it’s not, then why bother? If your application is well developed you’ll abstract most of your database code away anyway. Migrating it if you ever need to may not be trivial, but it should be a long way from “impossible”.

6. Create your own sequence generation scheme.

This one has me completely boggled.

* You can have only one AUTO_INCREMENT column per table.

Yes. So? Why would I need more? This is equivalent of saying “You can have only one PRIMARY KEY per table.” I really don’t know what’s behind this, but can only assume that Reese doesn’t actually understand what AUTO_INCREMENT is used for.

* You cannot have a unique sequence for multiple tables.

Again, so what? If you really need something like this, then have a ‘mykey’ table that has a single column which is an AUTO_INCREMENTing value. Then when you need a “super-unique” value for another table just INSERT a null row into the ‘mykey’, get the value that appeared, and use it as your key in the other table.

If you find yourself doing this though, you should probably revisit your schema or approach to the problem, as you’re probably doing something wrong somewhere.

* You cannot easily determine from an application what values MySQL has automatically generated.

Umm. Why would you want to? If I have a DEFAULT value for any other column I can’t determine what values were automatically generated there either. What’s the problem?

Personally I use AUTO_INCREMENT almost everywhere. Even when I have an ‘obvious’ primary key, I rarely use it, preferring to mark it as UNIQUE instead. I’ve spent far too long in the past updating records all over the entire database because a value for a primary key changed, and was referenced in 40 other tables. I much prefer all primary keys, and thus foreign keys, to be meaningless. Then application data can change with impunity. AUTO_INCREMENT makes this style of development trivial. It takes a while to get used to, but this style of development is one of the things that makes MySQL so powerful.

7. Do not mix display code and database code.

Well, yes, but what’s that got to do with MySQL?

8. Normalize with zeal, denormalize sparingly.

Well, yes, but what’s that got to do with MySQL? Maybe this point would be interesting if Reese had pointed out that the need for denormalisation is even rarer with MySQL as it’s blinding fast even if you need a 12 table join. But he didn’t.

9. Use connection pooling in Web servers and application servers.

If you’re using a traditional database, then yes, make sure you do this. But with MySQL connecting to the database is not an expensive operation. Like many other things that take an age in other databases, MySQL is lightning fast in this regard. Yes, abstract your connection information away – you don’t want to have to update multiple files if you want to change it – but don’t worry too much about the speed here. Compared to whatever you’re going to asking the server for, the connection overhead will be tiny. Don’t worry about this one until you really need to.

For my money, the most important MySQL tip omitted is “write small queries”. One of the biggest problems with the traditional MySQL tables is their locking. If you have a table that has frequent writes, you’ll probably run into the scenario where Person A writes a SELECT that joins a lot of tables, does a complex GROUP BY and ORDER BY, and patiently waits for the result. In the meantime Person B comes along an issues an UPDATE to one of those tables. It patiently waits until Person A’s query is finished, but meanwhile Person C comes along and issues a simple SELECT on that table. Now MySQL has a problem. It’s can’t finish the UPDATE until the long SELECT has completed. And it can’t do C’s SELECT in case B’s UPDATE effects that. Because this is MySQL it may only take 10 seconds, instead of the 10 minutes it might have taken on another database, and this speed actually lulls developers into thinking that that’s OK. But in a high-traffic web site it’s probably not. If enough of these start to back up, you get all sorts of jams. Normally fast queries will start taking too long and everything will start getting sluggish. MySQL will resolve it all soon enough, but it’s not a nice experience.

The solution is simple. Avoid queries that take more than a tenth of a second to run. It’s MySQL – that should be fairly easy. Don’t be afraid to split your query into a few smaller queries. It’ll still be fast. If you need to run complex reports, then do it on ‘snapshot’ mirrors of the data that aren’t getting updated.

MySQL’s newer table types have apparently solved this problem for the most part – but the 1/10th of a second rule is a good one anyway. MySQL provides an option to log all slow queries to a file. Turn it on, and turn its setting down to 1 second (it’s the lowest it will go in the current version, but will apparently be reduced to allow sub-second setting in the next version). Then monitor what queries are getting logged – and attack them.

Every database has strengths and weaknesses. Playing to MySQL’s strengths and working around its weaknesses is fairly easy. And if you do it, you’re unlikely to ever need to port your application to a different RDBMS.

Tags: