Fast, Parallel MySQL Backups and Imports with Mydumper

Published Mon, 2012/10/15 - 08:49, Updated Mon, 2014/03/10 - 17:30

We recently did a Drupal Performance Assessment project for a very large site, with a massive database. The database takes over 174GB, and houses a total of over 707 million rows.

Exporting and importing a database of that size is challenging in many ways.

We had to use different tools and techniques than the standard mysqldump utility, which was causing the import to exceed 14 days for this database. We were able to cut that down to less than 4 days.

While we plan to talk about this site in detail at Drupal Camp Toronto, and publish an article on it, this article is about a very useful byproduct.

Maatkit mk-parallel-dump

We first considered the use of Maatkit parallel dump. However, this is deprecated and is no longer supported in Percona Toolkit, the successor to Maatkit.

Mydumper and Myloader

We then found the right tool: Mydumper.

Mydumper is a very useful tool for faster backups for many sites with smaller database sizes. The tool is written by current and past MySQL/Oracle staff.

The trick for this tool is that it relies on doing dumps and imports in parallel using multiple threads, rather than using a single thread, which is slow and inefficient.

Compiling and Installing Mydumper from Source

Although Mydumper is available as a package in the repository for Ubuntu Server 12.04 LTS Precise Pangolin, it does not work, and casues segmentation fault error.

For this reason, we have to build it from source. However, a patch is also needed in order for it to compile correctly.

This is described in detail in the following steps:

Install the required packages for compiling from source

sudo aptitude install cmake g++ patch libglib2.0-dev \
  libmysqlclient-dev zlib1g-dev libpcre3-dev patch

Download Mydumper

You need to download latest MyDumper release, 0.6.1 at this time:

wget http://goo.gl/VOOiN6

Extract the archive

Then extract the archive in a directory:

mv VOOiN6 mydumper-0.6.1.tar.gz 
tar xzf mydumper-0.6.1.tar.gz 

Change to the directory where you extracted the source:

cd mydumper-0.6.1/

Get and apply the patch

IMPORTANT: Skip this step if you are using the newer 0.6.1 release!

If you are using the older 0.5.1 release, then you need to get a specific patch from Launchpad issue #1002291, and apply it.

wget http://goo.gl/eUMx6

Now patch the source with what you just downloaded, as follows:

patch -p1 < eUMx6

Compile the source

Then you compile the source. Do not forget the "." after cmake!

cmake .
make

Install

Finally, you need to install the binaries to /usr/local/bin:

sudo make install

Using Mydumper

As we said earlier, Mydumper works by having multiple threads dumping multiple tables, or parts of tables, concurrently.

The program has many options, but the important ones are:

--database

Specifies the name of the database to dump

--threads

The number of threads to use. Generally set this to the number of CPUs in the server, but it can be more in some cases.

--rows

The number of rows per "chunk", before creating anothe chunk for the same table. We used 500,000, and got good results, but it depends on how many rows your tables have.

You can use this script for backup.

#!/bin/sh

DIR_BACKUP=/backup-directory
DB_NAME=database_name
DB_USER=user
DB_PASS=password

DOW=`date "+%a"`
DB_DUMP=$DIR_BACKUP/dbdump.$DB_NAME.$DOW

mydumper \
  --database=$DB_NAME \
  --host=$DB_HOST \
  --user=$DB_USER \
  --password=$DB_PASS \
  --outputdir=$DB_DUMP \
  --rows=500000 \
  --compress \
  --build-empty-files \
  --threads=2 \
  --compress-protocol

Save that file to /usr/local/bin/dbdump.sh

Make it executable using:

chmod +x /usr/local/bin/dbdump.sh

Create a crontab job to run it daily, say at 4:10 am local server time.

10 04 * * * /usr/local/bin/dbdump.sh

Description of Mydumper's output data

Mydumper does not output to files, but rather to files in a directory. The --outputdir option specifies the name of the directory to use.

The output is two parts:

Schema

For each table in the database, a file containing the CREATE TABLE statement will be created. It will be named:

dbname.tablename-schema.sql.gz

You could manipulate this schema and gain some advantages, for example uncompress the file, edit the table creation statement to remove all indexes, other than the primary index. This would make the loading of huge tables much faster than with indexes.

Remember to do a CREATE INDEX for each one after that.

Data

For each table with number of rows above the --rows parameter, you will have a file called:

dbname.tablename.0000n.sql.gz

Where "n" starts with 0 up to the number of.

Import/restore using Myloader

You don't need to know the internals of how Mydumper stores its data in order to do a restore/import. The Myloader tool does that for you.

If you are importing on a machine that has more CPUs than the machine you dumped on, you can use a higher number for the --threads parameter, for example:

myloader \
  --database=$DB_NAME \
  --directory=$DB_DUMP \
  --queries-per-transaction=50000 \
  --threads=6 \
  --compress-protocol \
  --verbose=3

So, what improvement can we see? Here are the details.

Before: Using mysqldump:

For a site that has 5.8 million rows in one table, and 1.39 million in another, there was a noticeable improvement of using Mydumper over mysqldump.

11:00:01 - Start dump of database to .../dbdump.live.Sat.sql.gz
11:08:33 - Successfully dumped database live

So, it takes 8.5 minutes for the dump to complete.

After: Using Mydumper:

After using Mydumper with --rows=500000, the backup was much faster:

11:00:01 - Start dump of database to .../dbdump.live.Mon
11:00:43 - Finished dumping the database live

Only 42 seconds!

Other approaches for faster backups

For sites that do not have large tables that warrant the use of Mydumper, there are other approaches that can help.

For example, you can exclude the data from certain tables with transient content, such as watchdog, sessions, and all the cache tables.

This partial code snippet uses some shell script wizardry to exclude such tables with transient data, and only include their table creation statements, but not their content:

  #!/bin/sh

  # Tables to backup structure, but not data
  DB_NO_DATA="/^(watchdog|sessions|cache(_.+)?)$/"

  ...

  TABLES=`mysql -u... -p... -B -N -e 'SHOW TABLES;' $DB_NAME`
  for TABLE in
  do
    RESULT=`echo $TABLE | awk "$DB_NO_DATA"`
    if [ "$RESULT" != "" ]; then
      # Structure only
      OPTS=" --no-data "
    else
      # Default, structure AND data
      OPTS=""
    fi
    mysqldump $OPTS -u... -p... $DB_NAME $TABLE >> $DB_DUMP
    if [ $? != 0 ]; then
      echo "ERROR: Failed in dumping database $DB_NAME"
    fi
  done

  # Compress the backup
  bzip2 $DB_DUMP
  mv $DB_DUMP.bz2 $DB_DUMP
  echo "Successful dump for database $DB_NAME"

A similar approach for excluding contents of tables with transient content is outlined in this article, if you are using Capistrano.

A different way that relies, like Mydumper, on doing parallel dump and restore, is to use Lullabot's MySQL Parallel, which uses a shell script wrapper around GNU Parallel, and Parallel bzip2. This is of course much better than plain mysqldump, but will not parallelize portions of tables. So a site with a huge table will do other tables quickly, but will be slow for the huge table since all of it will be done in a single thread.

Other use cases for Mydumper/Myloader

More information on how we used MyDumper/MyLoader to cut down the export/import of a massive Drupal database, can be found in our presentation: Huge! Drupal site with 381 modules and 174GB MySQL database and 200 million row tables at Drupal Camp Toronto in November 2012..

Further Information

You can find more information about Mydumper in the following articles:

Update: Added link to presentation that contains MyDumper/MyLoader info.

Equivalent Centos packages

I would like to try this on my Centos VPS. Could anyone provide me with the names of Centos packages for libglib2.0-dev, libmysqlclient-dev, zlib1g-dev, libpcre3-dev

mydumper/myloader is also

mydumper/myloader is also available for Centos pre-compiled from the REMI repository.

Re: Equivalent Centos packages

yum install glib2-devel mysql-devel zlib-devel pcre-devel cmake gcc-c++

Maatkit parallel dump no

Maatkit parallel dump no longer exists because it has been superseded by the much more capable Percona XtraBackup
http://www.percona.com/software/percona-xtrabackup

I've never seen benchmarks, but according to the author of Mydumper XtraBackup should be noticeably faster for InnoDB tables.

Awesome, thanks! Mac users,

Awesome, thanks!

Mac users, you can do this even easier using homebrew: brew install mydumper

Is your Drupal or WordPress site slow?
Is it suffering from server resources shortages?
Is it experiencing outages?
Contact us for Drupal and WordPress Performance Optimization and Tuning Consulting