Exporting and importing a database of that size is challenging in many ways.
We had to use tools and techniques other than the standard mysqldump utility, with which the import would have taken more than 14 days for this database. We were able to cut that down to less than 4 days.
While we plan to talk about this site in detail at Drupal Camp Toronto, and publish an article on it, this article is about a very useful byproduct.
Mydumper and Myloader
We then found the right tool: Mydumper.
Mydumper is a very useful tool for faster backups, and not only for huge databases: many sites with smaller databases benefit as well. The tool is written by current and past MySQL/Oracle staff.
The key to this tool's speed is that it performs dumps and imports in parallel, using multiple threads rather than a single thread, which is slow and inefficient.
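The idea can be sketched in plain shell with xargs -P, which runs one process per table concurrently. This is illustrative only: the table names and database name below are placeholders, and real Mydumper works at a lower level, also splitting a single table into chunks. The echo makes this a dry run that just prints the commands it would run:

```shell
# Dry run: print one mysqldump command per table, up to 4 at a time in
# parallel. Table names and the database name "mydb" are placeholders.
echo "node users watchdog" | tr ' ' '\n' \
  | xargs -P 4 -I{} echo mysqldump mydb {} ">" {}.sql
```

Removing the inner echo would actually run the dumps, one table per process, which is the per-table parallelism that tools like Lullabot's MySQL Parallel (mentioned below) provide.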
If you are on a Debian-derived Linux system, such as a recent version of Ubuntu (e.g. 20.04 or later), then MyDumper is already available in the standard repositories.
Install it using:
sudo apt install mydumper
Compiling Mydumper from Source
If you are on a system that does not have MyDumper in its repositories, then you will need to compile from source.
To do this, follow the instructions on the Github project page.
Older versions of Ubuntu required a patch to be applied to MyDumper before compiling it from source. That is no longer needed if you use a recent version of Ubuntu and MyDumper. But in case you still need that, please refer to an archived version of this very article.
As we said earlier, Mydumper works by having multiple threads dumping multiple tables, or parts of tables, concurrently.
The program has many options, but the important ones are:

--database
Specifies the name of the database to dump.

--threads
The number of threads to use. Generally, set this to the number of CPUs in the server, but it can be more in some cases.

--rows
The number of rows per "chunk" before creating another chunk for the same table. We used 500,000 and got good results, but it depends on how many rows your tables have.
You can use this script for backup.
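A minimal sketch of such a script, assuming mydumper is installed and connection credentials come from a MySQL options file (e.g. ~/.my.cnf); the database name, destination path, and thread count are placeholders to adjust for your server:

```shell
#!/bin/sh
# Sketch of a daily Mydumper backup script. Database name, destination
# directory, and thread count are placeholders; adjust for your server.

DB_NAME=live
DEST=/var/backups/dbdump.$DB_NAME.$(date +%a)   # e.g. dbdump.live.Mon

# Dump with multiple threads, splitting large tables into 500,000-row
# chunks, and compress the output files.
mydumper \
  --database=$DB_NAME \
  --threads=8 \
  --rows=500000 \
  --compress \
  --outputdir=$DEST
```

The $(date +%a) suffix keeps one rotating backup directory per weekday.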
Save that file to /usr/local/bin/dbdump.sh
Make it executable using:
chmod +x /usr/local/bin/dbdump.sh
Create a crontab job to run it daily, say at 4:10 am local server time.
10 04 * * * /usr/local/bin/dbdump.sh
Description of Mydumper's output data
Mydumper does not output to a single file, but rather to a set of files in a directory. The --outputdir option specifies the name of the directory to use.
The output has two parts:

For each table in the database, a file containing the CREATE TABLE statement will be created. It will be named:

db_name.table_name-schema.sql.gz
You can manipulate this schema file to your advantage: for example, uncompress it and edit the table creation statement to remove all indexes other than the primary key. This makes loading huge tables much faster than loading them with the indexes in place.

Remember to issue a CREATE INDEX statement for each removed index after the import completes.
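As an illustration of this trick, here is a sketch that strips secondary indexes from a schema file; the table and index names are hypothetical, and deleting a KEY line also means removing the now-dangling comma on the preceding line:

```shell
# Hypothetical schema file, standing in for an uncompressed
# db_name.table_name-schema.sql produced by Mydumper.
cat > node-schema.sql <<'EOF'
CREATE TABLE `node` (
  `nid` int(10) unsigned NOT NULL,
  `changed` int(11) NOT NULL DEFAULT '0',
  PRIMARY KEY (`nid`),
  KEY `node_changed` (`changed`)
) ENGINE=InnoDB;
EOF

# Drop the KEY lines (secondary indexes), keeping the PRIMARY KEY,
# and remove the comma left dangling after the PRIMARY KEY line.
sed -i -e '/^  KEY /d' -e 's/^\(  PRIMARY KEY .*\),$/\1/' node-schema.sql

cat node-schema.sql

# After the import completes, re-create each removed index, e.g.:
# mysql -e 'CREATE INDEX node_changed ON node (changed);' db_name
```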
For each table with a number of rows above the --rows parameter, you will have one or more files called:

db_name.table_name.n.sql.gz

Where "n" starts at 0 and goes up to the number of chunks for that table.
Import/restore using Myloader
You don't need to know the internals of how Mydumper stores its data in order to do a restore/import. The Myloader tool does that for you.
If you are importing on a machine that has more CPUs than the machine you dumped on, you can use a higher number for the --threads parameter, for example:

myloader --directory=... --threads=16
So, what improvement can we see? Here are the details.
Before: Using mysqldump:
For a site that has 5.8 million rows in one table, and 1.39 million in another, there was a noticeable improvement of using Mydumper over mysqldump.
11:00:01 - Start dump of database to .../dbdump.live.Sat.sql.gz
11:08:33 - Successfully dumped database live
So, it takes 8.5 minutes for the dump to complete.
After: Using Mydumper:
After using Mydumper with --rows=500000, the backup was much faster:
11:00:01 - Start dump of database to .../dbdump.live.Mon
11:00:43 - Finished dumping the database live
Only 42 seconds!
Other approaches for faster backups
For sites that do not have large tables that warrant the use of Mydumper, there are other approaches that can help.
For example, you can exclude the data from certain tables with transient content, such as watchdog, sessions, and all the cache tables.
This partial code snippet uses some shell script wizardry to exclude such tables with transient data, and only include their table creation statements, but not their content:
# Tables to backup structure, but not data
DB_NO_DATA='/^(watchdog|sessions|cache|cache_.*)$/'

TABLES=`mysql -u... -p... -B -N -e 'SHOW TABLES;' $DB_NAME`

for TABLE in $TABLES
do
  RESULT=`echo $TABLE | awk "$DB_NO_DATA"`
  if [ "$RESULT" != "" ]; then
    # Structure only
    OPTS=" --no-data "
  else
    # Default, structure AND data
    OPTS=""
  fi

  mysqldump $OPTS -u... -p... $DB_NAME $TABLE >> $DB_DUMP
  if [ $? != 0 ]; then
    echo "ERROR: Failed in dumping database $DB_NAME"
    exit 1
  fi
done

# Compress the backup
bzip2 $DB_DUMP
mv $DB_DUMP.bz2 $DB_DUMP

echo "Successful dump for database $DB_NAME"
A similar approach for excluding contents of tables with transient content is outlined in this article, if you are using Capistrano.
A different approach that, like Mydumper, relies on parallel dump and restore is Lullabot's MySQL Parallel, which uses a shell script wrapper around GNU Parallel and parallel bzip2. This is of course much better than plain mysqldump, but it does not parallelize portions of tables. So a site with one huge table will finish the other tables quickly, but will be slow on the huge table, since all of it is processed in a single thread.
Other use cases for Mydumper/Myloader
More information on how we used MyDumper/MyLoader to cut down the export/import of a massive Drupal database can be found in our presentation: Huge! Drupal site with 381 modules and 174GB MySQL database and 200 million row tables, given at Drupal Camp Toronto in November 2012.
You can find more information about Mydumper in the following articles:
- Faster alternative to MySQL mysqldump backup and restore with multi-threaded Mydumper & Myloader tool
Update: Added link to presentation that contains MyDumper/MyLoader info.
Update: Removed details of patching and compiling for older versions of Ubuntu.