Holimetrix Tech Blog


Handling 3TB of data with MySQL

Thibault Ketterer

Every company has to start with something. Even if you claim to be in the big data world, MySQL can be a good start for your first year.
At Holimetrix we receive lots of data from our clients: we put a tag on their websites in order to track their TV ads. The bigger the client, the bigger the data we receive.

So we started with MySQL, and the data got huge quite quickly.
Here are some hints about what we learned from this kind of workflow.

Choose your MySQL Engine:

There are now Oracle MySQL, Percona Server and MariaDB.

Percona also ships with interesting tools: percona-toolkit (formerly known as maatkit).

As we run Debian and we don't like to build things ourselves, we ran everything on Percona; the packages were good but updated really often.
If you don't want to be annoyed by too-frequent updates (and MySQL restarts!) when you upgrade your machine, try this:

echo "percona-server-server hold" | dpkg --set-selections  
echo "percona-server-common-5.6 hold" | dpkg --set-selections  

If I were making the choice now, I would benchmark MariaDB against Percona. I dislike the Oracle packages; they feel unfinished.

delete is no good

If you want or have to delete all data before a given date: please don't.

On a table with 100+ million rows, it takes a very long time.
In InnoDB/XtraDB, DELETE is transactional (only DROP TABLE is not), so MySQL first keeps a copy of every affected row until the transaction completes.

The options for speeding things up lead to our next big hint: bulkload is fast.
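One common way to avoid a huge DELETE entirely (a sketch with hypothetical table and column names, not necessarily the exact commands we ran) is to copy the rows you want to keep into a fresh table and swap it in, since DROP TABLE does not go through the transaction machinery:

```sql
-- Hypothetical names: an "events" table with a "created_at" column.
CREATE TABLE events_keep LIKE events;

-- Copy only the rows you want to keep.
INSERT INTO events_keep
SELECT * FROM events WHERE created_at >= '2016-01-01';

-- Atomic swap, then a cheap non-transactional drop.
RENAME TABLE events TO events_old, events_keep TO events;
DROP TABLE events_old;
```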

Bulkload is fast

We measured that some DELETEs could take 9 hours, while a full bulkload took 1 hour with the right options.
When I say fast, don't misunderstand me: MySQL is not fast when handling terabytes of data, bulkload is just less slow-ish.

The good bulkload strategy

In more technical detail, here is what you need to know to make a fast and reliable MySQL bulkload.
MySQL bulkload is based on text CSV files (beware of encoding). Those CSV files can grow big, and you cannot load them in one run.

If you feed a 50 GB CSV dump to MySQL in one go, it will exhaust the host's memory and crash. A better strategy is to split your file into multiple parts. That's the purpose of Percona's pt-fifo-split (formerly from maatkit):
it splits the file into parts of x lines and writes them to a named pipe (mkfifo).
MySQL reads the pipe and loads the data part by part.
Based on your data structure, your volume and your MySQL machine, it's up to you to find the fastest part size. For us it was 200,000 lines on every load. We had 40 GB free on the host and a pretty big host (12 disks, 500 GB of RAM, 80% of it for the MySQL index cache).

You can base your bulkload on this kind of script:
script 1 fifo

pt-fifo-split /srv/tmp/big_dump.csv --fifo /tmp/my-fifo --lines 200000 && rm /tmp/my-fifo  

script 2 insert

while [ -e /tmp/my-fifo ]; do  
   time mysql -e "set foreign_key_checks=0; set sql_log_bin=0; set unique_checks=0; load data local infile '/tmp/my-fifo' into table mybase.mytable fields terminated by ';' ENCLOSED BY '\"' lines terminated by '\n' ;"
   sleep 1
done

Optimize / DELETE / reload when you can
Over time, indexes get dirty: if you delete a third of the data and then re-insert into MySQL, the database will begin to slow down.
To avoid those problems you'll have to optimize the table. Better, you can drop the database and restore from a backup; it may be faster. If you bulkload your backup, it can be much faster than an optimize.
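If you do stay with the existing table, the rebuild can be triggered in place (hypothetical table name; note that on InnoDB this does a full table copy, so plan for the downtime):

```sql
-- Rebuilds the table and its indexes; on InnoDB, OPTIMIZE TABLE
-- maps to a full table rebuild (ALTER TABLE ... FORCE).
OPTIMIZE TABLE mybase.mytable;
```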

Good MySQL hardware configuration

CPU is not important; disks are, and RAM too. It all depends on your workload. We were constantly inserting small batches of data and running huge, complex SELECTs. For this we decided lots of RAM would be a good thing. My advice here is to start some hosts at a cloud provider of your choice (AWS/GCloud/Azure/...) and test your workloads. Only trust your own benchmarks: your workload is not the one from some MySQL performance blog, it's yours, and you have to bench it before making decisions.

some parameters

The devil is in the details, and these parameters can be very dangerous if you don't have a good, reliable backup of your data.

innodb buffer pool size

The most obvious parameter to tune: it keeps your data and index pages cached in RAM instead of hitting the disks. Combined with trx commit set to 2, it has proven really effective.
innodb_buffer_pool_size = 290G

trx commit

With this you can control when MySQL flushes its log buffer to disk.

innodb_flush_log_at_trx_commit = 2

In our workload, with lots of small inserts, this parameter gave us a 20% performance boost. The trade-off is that a crash can lose up to about one second of committed transactions.
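Together, the two settings above look like this in my.cnf (the buffer pool value is the one from our 500 GB host; size it to your own RAM):

```ini
[mysqld]
# Cache data and index pages in RAM.
innodb_buffer_pool_size        = 290G
# Flush the log to disk once per second instead of at every commit:
# faster small inserts, but a crash can lose up to ~1s of transactions.
innodb_flush_log_at_trx_commit = 2
```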

Always work on your requests

Do less, think twice, etc. By working a bit on our SQL queries and our algorithms, we were able to get a 30% performance boost in two weeks of work.

BIG tables to ALTER

When you want to alter a table with 100+ million rows, do it offline if you can. You can also use the bulkload technique to speed it up.

The result will be about 5 times faster than altering the table in place.
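A sketch of that bulkload-based ALTER (hypothetical names; the idea is to dump, recreate with the new schema, and reload, rather than letting MySQL copy the table row by row):

```sql
-- 1. Dump the current data to CSV.
SELECT * FROM mybase.mytable
INTO OUTFILE '/srv/tmp/mytable_dump.csv'
FIELDS TERMINATED BY ';' ENCLOSED BY '"' LINES TERMINATED BY '\n';

-- 2. Recreate the table with the new schema.
DROP TABLE mybase.mytable;
CREATE TABLE mybase.mytable (
  -- ... new column definitions here ...
);

-- 3. Reload with LOAD DATA, ideally chunked through pt-fifo-split
--    as in the bulkload scripts earlier.
LOAD DATA INFILE '/srv/tmp/mytable_dump.csv' INTO TABLE mybase.mytable
FIELDS TERMINATED BY ';' ENCLOSED BY '"' LINES TERMINATED BY '\n';
```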

in the end

As a conclusion, I will say MySQL was a really good choice for our first year. But as the data grew, all those tricks were not enough anymore and we moved to big data tools: Hadoop, Pig, Spark... (to be discussed in other articles)

In computer science since 2004. First a developer, then devops. I have managed many different architectures, from Solaris 8 to the latest Debian. Deputy CTO @holimetrix since March 2016.