We recently did a Drupal Performance Assessment project for a very large site with a massive database. The database takes up over 174GB, and houses a total of over 707 million rows.

Exporting and importing a database of that size is challenging in many ways.

We had to use different tools and techniques than the standard mysqldump utility, which was causing the import to exceed 14 days for this database. We were able to cut that down to less than 4 days.

While we plan to talk about this site in detail at Drupal Camp Toronto, and publish an article on it, this article is about a very useful byproduct.

Maatkit mk-parallel-dump

We first considered the use of Maatkit parallel dump. However, this is deprecated and is no longer supported in Percona Toolkit, the successor to Maatkit.

Mydumper and Myloader

We then found the right tool: Mydumper.

Mydumper is a very useful tool for faster backups, even for many sites with smaller databases. The tool is written by current and former MySQL/Oracle staff.

The key to this tool is that it does dumps and imports in parallel using multiple threads, rather than a single thread, which is slow and inefficient.

Installing MyDumper

If you are on a Debian derived Linux system, such as a recent version of Ubuntu (e.g. 20.04 or later), then MyDumper is already available in your standard repositories.

Install it using:

sudo apt install mydumper

Compiling Mydumper from Source

If you are on a system that does not have MyDumper in its repositories, then you will need to compile it from source.
To do this, follow the instructions on the GitHub project page.
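
As a rough sketch, the build typically looks like the following on an older Debian-derived system that lacks the package. The package names, repository URL, and build steps are assumptions that vary between MyDumper versions, so treat the instructions on the GitHub project page as authoritative:

# Build dependencies commonly needed on Debian/Ubuntu (names vary by release)
sudo apt install cmake g++ libglib2.0-dev libmysqlclient-dev zlib1g-dev libpcre3-dev

# Fetch, build and install MyDumper (repository URL is an assumption; check the project page)
git clone https://github.com/mydumper/mydumper.git
cd mydumper
cmake .
make
sudo make install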

Older versions of Ubuntu required a patch to be applied to MyDumper before compiling it from source. That is no longer needed if you use a recent version of Ubuntu and MyDumper. But in case you still need that, please refer to an archived version of this very article.

Using Mydumper

As we said earlier, Mydumper works by having multiple threads dumping multiple tables, or parts of tables, concurrently.

The program has many options, but the important ones are:

--database

Specifies the name of the database to dump

--threads

The number of threads to use. Generally set this to the number of CPUs in the server, but it can be more in some cases.

--rows

The number of rows per "chunk", before creating another chunk for the same table. We used 500,000 and got good results, but it depends on how many rows your tables have.
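
Tying these options together, a minimal one-off invocation might look like this. It is only a sketch; the database name, credentials, and output directory are placeholders, and nproc (from GNU coreutils) reports the number of CPUs:

mydumper \
  --database=database_name \
  --user=user \
  --password=password \
  --threads=$(nproc) \
  --rows=500000 \
  --outputdir=/backup-directory/database_name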

You can use this script for backup.

#!/bin/sh

# Directory under which the dump directories will be created
DIR_BACKUP=/backup-directory

# Database connection details
DB_NAME=database_name
DB_USER=user
DB_PASS=password
DB_HOST=localhost

# Day of the week (e.g. Mon), so a week's worth of dumps is rotated
DOW=`date "+%a"`
DB_DUMP=$DIR_BACKUP/dbdump.$DB_NAME.$DOW

mydumper \
  --database=$DB_NAME \
  --host=$DB_HOST \
  --user=$DB_USER \
  --password=$DB_PASS \
  --outputdir=$DB_DUMP \
  --rows=500000 \
  --compress \
  --build-empty-files \
  --threads=2 \
  --compress-protocol

Save that file to /usr/local/bin/dbdump.sh

Make it executable using:

chmod +x /usr/local/bin/dbdump.sh

Create a crontab job to run it daily, say at 4:10 am local server time.

10 04 * * * /usr/local/bin/dbdump.sh

Description of Mydumper's output data

Mydumper does not output to a single file, but rather to a set of files in a directory. The --outputdir option specifies the name of the directory to use.

The output has two parts:

Schema

For each table in the database, a file containing the CREATE TABLE statement will be created. It will be named:

dbname.tablename-schema.sql.gz

You can manipulate this schema to gain some advantages. For example, uncompress the file and edit the CREATE TABLE statement to remove all indexes other than the primary key. This makes loading huge tables much faster than loading them with their indexes in place.

Remember to run a CREATE INDEX for each removed index after the import completes.
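
For example, a hedged sketch of that workflow, with hypothetical table, column, and index names:

# 1. Uncompress the schema file for the huge table
gunzip dbname.tablename-schema.sql.gz

# 2. Edit dbname.tablename-schema.sql and remove the secondary KEY lines from the
#    CREATE TABLE statement (keep the PRIMARY KEY, and watch for a trailing comma
#    on the line before the closing parenthesis).

# 3. Recompress it so it sits alongside the other .sql.gz files
gzip dbname.tablename-schema.sql

# 4. After the import finishes, recreate each index you removed
mysql -u user -p database_name \
  -e 'CREATE INDEX tablename_field ON tablename (field);'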

Data

For each table whose row count exceeds the --rows parameter, you will have one or more files named:

dbname.tablename.0000n.sql.gz

Where "n" starts at 0 and goes up to the number of chunks for that table.

Import/restore using Myloader

You don't need to know the internals of how Mydumper stores its data in order to do a restore/import. The Myloader tool does that for you.

If you are importing on a machine that has more CPUs than the machine you dumped on, you can use a higher number for the --threads parameter, for example:

myloader \
  --database=$DB_NAME \
  --directory=$DB_DUMP \
  --queries-per-transaction=50000 \
  --threads=6 \
  --compress-protocol \
  --verbose=3
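
For completeness, here is a minimal restore wrapper in the same style as the backup script above. It is only a sketch; the paths, credentials, and thread count are placeholders to adjust for your server:

#!/bin/sh

# Database connection details and the dump directory to restore from
DB_NAME=database_name
DB_USER=user
DB_PASS=password
DB_HOST=localhost
DB_DUMP=/backup-directory/dbdump.database_name.Mon

myloader \
  --database=$DB_NAME \
  --host=$DB_HOST \
  --user=$DB_USER \
  --password=$DB_PASS \
  --directory=$DB_DUMP \
  --queries-per-transaction=50000 \
  --threads=6 \
  --compress-protocol \
  --verbose=3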

So, what improvement can we see? For a site that has 5.8 million rows in one table and 1.39 million in another, there was a noticeable improvement when using Mydumper over mysqldump. Here are the details.

Before: Using mysqldump:

11:00:01 - Start dump of database to .../dbdump.live.Sat.sql.gz
11:08:33 - Successfully dumped database live

So, the dump took about 8.5 minutes to complete.

After: Using Mydumper:

After using Mydumper with --rows=500000, the backup was much faster:

11:00:01 - Start dump of database to .../dbdump.live.Mon
11:00:43 - Finished dumping the database live

Only 42 seconds!

Other approaches for faster backups

For sites that do not have large tables that warrant the use of Mydumper, there are other approaches that can help.

For example, you can exclude the data from certain tables with transient content, such as watchdog, sessions, and all the cache tables.

This partial code snippet uses some shell script wizardry to exclude such tables with transient data, and only include their table creation statements, but not their content:

#!/bin/sh

# Tables to backup structure, but not data
DB_NO_DATA="/^(watchdog|sessions|cache(_.+)?)$/"

...

TABLES=`mysql -u... -p... -B -N -e 'SHOW TABLES;' $DB_NAME`
for TABLE in $TABLES
do
  RESULT=`echo $TABLE | awk "$DB_NO_DATA"`
  if [ "$RESULT" != "" ]; then
    # Structure only
    OPTS=" --no-data "
  else
    # Default, structure AND data
    OPTS=""
  fi
  mysqldump $OPTS -u... -p... $DB_NAME $TABLE >> $DB_DUMP
  if [ $? != 0 ]; then
    echo "ERROR: Failed in dumping database $DB_NAME"
  fi
done

# Compress the backup
bzip2 $DB_DUMP
mv $DB_DUMP.bz2 $DB_DUMP
echo "Successful dump for database $DB_NAME"

A similar approach for excluding the contents of tables with transient data is outlined in this article, if you are using Capistrano.

A different way that, like Mydumper, relies on doing the dump and restore in parallel is Lullabot's MySQL Parallel, a shell script wrapper around GNU Parallel and Parallel bzip2. This is of course much better than plain mysqldump, but it does not parallelize portions of tables. So a site with one huge table will get through the other tables quickly, but will remain slow on the huge table, since all of it is still processed in a single thread.
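
To illustrate that limitation, a per-table parallel dump along those lines might look roughly like the sketch below. This is not MySQL Parallel's actual code; the database name, credentials, and backup path are placeholders. Each table is a single job, so one huge table still runs in one thread:

# Rough sketch: dump each table as its own job, four at a time
DB_NAME=database_name
mysql -u... -p... -B -N -e 'SHOW TABLES;' $DB_NAME \
  | parallel -j 4 "mysqldump -u... -p... $DB_NAME {} | bzip2 > /backup-directory/$DB_NAME.{}.sql.bz2"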

Other use cases for Mydumper/Myloader

More information on how we used MyDumper/MyLoader to cut down the export/import time of a massive Drupal database can be found in our presentation: Huge! Drupal site with 381 modules and 174GB MySQL database and 200 million row tables, at Drupal Camp Toronto in November 2012.

Further Information

You can find more information about Mydumper in the following articles:

Update: Added link to presentation that contains MyDumper/MyLoader info.
Update: Removed details of patching and compiling for older versions of Ubuntu.

Comments

Mon, 2012/10/22 - 13:47

Maatkit parallel dump no longer exists because it has been superseded by the much more capable Percona XtraBackup
http://www.percona.com/software/percona-xtrabackup

I've never seen benchmarks, but according to the author of Mydumper XtraBackup should be noticeably faster for InnoDB tables.

Fri, 2013/04/05 - 12:52

I would like to try this on my Centos VPS. Could anyone provide me with the names of Centos packages for libglib2.0-dev, libmysqlclient-dev, zlib1g-dev, libpcre3-dev

Fri, 2013/05/17 - 07:38

mydumper/myloader is also available for Centos pre-compiled from the REMI repository.
