How to copy images partially from a live Drupal site to a development/test site

Copying a Drupal site from a live server to a test one is normally a straightforward process: you dump the database, and create a tar archive of the site, extract the archive, load the database dump, make a few changes to the settings.php file, and voila: you have a test site that mirrors the live site. You can then use this for development, training or any other purpose.

We recently need to do this to a a live site, but it had way too much images: with 5.6 GB's worth of images, making an exact copy on a smaller server was not an option. So, we devised the following method to copy images only from the last 7 days, to make the test site more manageable, yet reflecting what the live site has very closely.

This assumes that the images are either attachments using the core upload module, or the image module. Doing this with imagecache, and CCK requires a bit more SQL.

We login to the live site, and change to the directory where Drupal is:

cd /var/www/live.site

We create a database dump as usual:

mysqldump -uuser -p dbname > yoursite.sql 

We create a tar archive, excluding the files directory:

tar --exclude="files" -czvf  yoursite.tar.gz .

Then, we find out what theUNIX epoch date now

date +%s

We get a number like: 1209843315

We then calculate what 7 days worth of seconds is by multiplying 7*24*60*60. We get 604800.

We now run the SQL to get the path names of the files/images for the last 7 days, and we create a cpio archive out of them.

echo "SELECT f.filepath FROM node n INNER JOIN files f USING (nid)
WHERE created > 1209843315 - 604800;"| mysql -uuser -p dbname | grep
-v "^filepath$" | cpio -o -O files.cpio

In this case, we got about 100 MBs worth of files. Far more manageable than 5.6GB, and easily transferred over the network.

Now, you need to copy the files.cpio and yoursite.tar.gz to the test server.

Once you do that, you change the Drupal directory

cd /var/www/test.site

Then you extract the tar archive:

tar xzvf yoursite.tar.gz 

And load the database

mysql -uuser -p testdb < yoursite.sql 

Then extract the subset of images that we created earlier.

cpio -iv -I files.cpio

We now have a functional site with a subset of the images and files ...

Contents: 

Comments

Why not use rsync and find?

We use rsync and find to do a similar thing. Find adds quite a lot more flexibility (you can limit the list of files by size, data and/or file extension) and rsync makes it a lot faster on the second run - since you don't transfer previously copied files (it also does transparent compression of data with the z switch).

ssh USER@DOMAIN "cd sites/default/files; find -size -FILESIZE" > files_lt_FILESIZE
rsync -avz --files-from=files_lt_FILESIZE -e ssh "USER@DOMAIN:~/sites/default/files" files

You can extend the find command to only exclude certain filetypes (e.g. --exclude="*.flv") or only get files created ormodified in the last week (e.g. -mtime -7). We find filesize the most useful criteria though. There are lots more possible criteria (including and/or) you can find on the 'find' command man page (or http://www.opengroup.org/onlinepubs/009695399/utilities/find.html).

Incidentally, we also use rsync for transferring database backups - this makes a huge difference when you have a large database dump to transfer, since the changes compared to the last local copy of the database are likely quite small. Make sure you rsync the uncompressed dump however (and use compression on the rsync), since some compression schemes make rsync much less efficient.