The tangled web: data migrations

06 September 2006

Migrations are one of my favorite things about Rails. I remember being amazed at their simplicity and usefulness when I first saw them. I also remember a previous boss asserting that we would just use SQL in the migrations because, to paraphrase, that way we would know what was happening. There’s that saying about tossing rubies before chickens, or something. But as I was saying: migrations.

Migrations are particularly useful when they are the only means used to change the database. That way, you know what state you are in, and unless you have a rare, non-reversible migration, you can migrate up and down at will. That all works great for the database structure. But what about data? We obviously won’t be funneling all the data changes through migrations. This isn’t much of an issue if your application starts as a blank slate and acquires data through use.

Recently we had an application that needed a bunch of preloaded data. Unfortunately, it wasn’t a one-shot loading of data once the structure was mostly solidified. I needed to try out stuff and wanted the super useful migrations to be my tool. Previously, this had been done with some ad hoc `rake` tasks, which turned out to be, frankly, a mess.

The single biggest challenge in working with the data was that its state would change independently of the migrations. There was no way to ensure the state between one migration and the next.

I decided that the one thing I could assert was that certain records with certain id’s did not exist before a migration ran and did exist after it (reversed when migrating down). It turned out this would be good enough for what I needed.

My first stab at implementing this was a couple of helper functions that I used in migrations to update the data files (in YAML format) with keys for the id’s. This was a big problem because the files kept changing and were tracked in Subversion, so they couldn’t really be shared by more than one developer. Playing around a bit, I came across this idea:

    class CreateGeographicTypesFacts < ActiveRecord::Migration
      def self.up
        capture(:features, :details) do
          # do whatever is needed to create records
        end
      end

      def self.down
        release :features, :details
      end
    end

Basically, as the data migration runs, I want to keep track of what records are created (and which of those are possibly destroyed). Obviously, this assumes that I’ll be adding data on the up migration and removing it on the down. To implement this, I opened up the `ActiveRecord::Migration` class and added a couple methods:

    def self.capture(*args, &block)
      args.each do |name|
        klass = name.to_s.classify.constantize
        klass.class_eval <<-end_eval
          @@capture_ids = []
          def self.capture_ids
            @@capture_ids.uniq
          end

          after_create do |obj|
            @@capture_ids << obj.id
          end

          after_destroy do |obj|
            @@capture_ids.delete obj.id
          end
        end_eval
      end

      yield

      File.open(data_migration_file, 'w') do |f|
        f.write args.inject({}) { |hash, name|
          klass = name.to_s.classify.constantize
          hash[name] = klass.capture_ids
          hash
        }.to_yaml
      end
    end

This `capture` method adds `after_create` and `after_destroy` callbacks to keep track of the id’s. These are written out to a file in a `data` subdirectory of the `db/migrate` directory. These files can be ignored by Subversion, so each developer can run the migrations on their own database without clashing.
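The `data_migration_file` helper used above isn’t shown in the post. Purely as an illustration, here is a minimal sketch of how such a helper might build the per-migration paths; the class name and the one-YAML-file-per-migration naming scheme are my assumptions, not the actual implementation:

```ruby
# Hypothetical sketch of a data_migration_file-style path helper.
# The db/migrate/data layout matches the post; the naming scheme
# (one YAML file keyed by migration name) is assumed.
class DataMigrationPath
  DATA_DIR = File.join('db', 'migrate', 'data')

  # e.g. DataMigrationPath.for('create_geographic_types_facts')
  #   => "db/migrate/data/create_geographic_types_facts.yml"
  def self.for(migration_name)
    File.join(DATA_DIR, "#{migration_name}.yml")
  end
end
```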

Migrating down is basically a matter of removing the records that were created when migrating up.

    def self.release(*args)
      data_file = data_migration_file(:down)
      data = File.open(data_file, 'r') { |f| YAML.load f }
      if block_given?
        yield(*args.inject([]) { |list, name| list << data[name] }.compact)
      else
        args.each do |name|
          klass = name.to_s.classify.constantize
          klass.delete data[name] unless data[name].blank?
        end
      end
      File.unlink data_file
    end

If you provide a block to the `release` method, you take care of doing whatever you want to the records. If you just want to delete them, don’t pass a block and `release` will delete them. I used `delete` rather than `destroy` so that everything you specify on capture is deleted without running into issues with `:dependent => :destroy`.
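Stripped of ActiveRecord, the capture/release bookkeeping boils down to a YAML round trip: write the created id’s out keyed by name, then read them back for deletion and remove the tracking file. A minimal, ActiveRecord-free sketch of that idea (the function names and file location here are illustrative, not the plugin’s API):

```ruby
require 'yaml'
require 'tmpdir'

# Record the ids a block reports as created, and persist them to a
# YAML file keyed by name (mirroring the capture method above).
def capture_ids(file, name)
  ids = []
  yield ids   # the block "creates" records and appends their ids
  File.open(file, 'w') { |f| f.write({ name.to_s => ids }.to_yaml) }
end

# Read the ids back and remove the tracking file
# (mirroring the no-block branch of release).
def release_ids(file, name)
  data = YAML.load(File.read(file))
  File.unlink(file)
  data[name.to_s] || []
end

file = File.join(Dir.tmpdir, 'capture_demo.yml')
capture_ids(file, :features) { |ids| ids.concat([1, 2, 3]) }
released = release_ids(file, :features)   # => [1, 2, 3]
```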

This has been useful for adding and removing data while trying out different table structures. Obviously, it’s a rather targeted solution. I’ll be packaging it as a plugin and posting it to [PLANET ARGON’s](http://www.planetargon.org) community site soon.