Every company has to start with something. Even if you claim to be in the big data world, MySQL can be a good start for your first year.
At Holimetrix we receive lots of data from our clients: we have a tag on their websites in order to track their TV ads. The bigger the client, the bigger the data we receive.
So we started with MySQL, and the data got huge quite quickly.
Here are some hints about what we learned from this kind of workflow.
Choose your MySQL Engine:
There are now three main flavours: Oracle MySQL, Percona Server and MariaDB.
- The Oracle version might be good to get the latest features of MySQL 5.7 (mainly JSON).
- MariaDB is really good for clustering.
- Percona (the guys from mysqlperformanceblog) has a reputation for being faster.
Percona also ships with interesting tools: percona-toolkit (formerly known as maatkit).
As we run Debian and we don't like to do things ourselves, we ran everything on Percona; the packages were good but updated really often.
If you don't want to be annoyed by too-frequent updates (and the MySQL restarts that come with them!) when you upgrade your machine, put the packages on hold:
echo "percona-server-server hold" | dpkg --set-selections
echo "percona-server-common-5.6 hold" | dpkg --set-selections
If I had to make the choice now, I would benchmark MariaDB and Percona. I dislike the Oracle packages; they don't feel finished.
delete is no good
If you want or have to delete all data before a given date: please don't.
On a table with 100 million rows, it would take very long.
In InnoDB/XtraDB, deletes are transactional (only DROP TABLE is not), so MySQL will first make a copy of your data before doing anything. Your options to speed things up are:
- don't delete
- partition your data by time and drop old partitions
- drop the table, write a script to filter data from a dump and bulkload the result
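The filter-a-dump option in the last bullet can be sketched in a few lines of shell. This is only an illustration: the column layout, the ';' separator and the cutoff date are assumptions, so adapt it to your real schema.

```shell
# Sample data standing in for a real CSV dump -- assume column 1 holds
# an ISO date (YYYY-MM-DD) and fields are ';'-separated.
printf '%s\n' \
  '2014-06-01;old row' \
  '2015-03-10;recent row' \
  '2015-07-22;recent row' > big_dump.csv

# Keep only the rows we still want, instead of DELETEing in place;
# ISO dates compare correctly as strings, so a plain >= works.
awk -F';' '$1 >= "2015-01-01"' big_dump.csv > filtered_dump.csv
cat filtered_dump.csv
```

The filtered file can then be bulkloaded into a freshly created table, which avoids the transactional copy that a big DELETE would trigger.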
This leads to our next big hint: bulkload is fast.
Bulkload is fast
We measured that a delete could take 9 hours, while a full bulkload, with the right options, could take 1 hour.
When I say fast, don't misunderstand me: MySQL is not fast at handling TBs of data, bulkload is just less slow-ish.
The good bulkload strategy
In more technical details, here is what you need to know to make a fast and reliable MySQL bulkload.
MySQL bulkload is based on CSV text files (beware of encoding). Those CSVs can grow big, and you cannot load them in one run.
If you feed a 50GB CSV dump to MySQL, it will crash after exhausting the host's memory.
A better strategy is to split your file into multiple parts. That's the purpose of Percona's file splitter:
pt-fifo-split (formerly maatkit)
It splits the file into parts of N lines and writes them to a named pipe (mkfifo).
MySQL can read the pipe and load the data part by part.
Based on your data structure, your volume and your MySQL machine, it's up to you to define which part size is the fastest. For us it was 200000 lines per load; we had 40GB free on the host and a pretty big host (12 disks, 500GB RAM, 80% of it for MySQL's index cache).
You can base your bulkload on that kind of scripts:
script 1 fifo
pt-fifo-split /srv/tmp/big_dump.csv --fifo /tmp/my-fifo --lines 200000 && rm /tmp/my-fifo
script 2 insert
while [ -e /tmp/my-fifo ]; do
  date
  time mysql -e "set foreign_key_checks=0; set sql_log_bin=0; set unique_checks=0; load data local infile '/tmp/my-fifo' into table mybase.mytable fields terminated by ';' ENCLOSED BY '\"' lines terminated by '\n';"
  sleep 1
done
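If pt-fifo-split is not available, plain `split` gives a similar chunked load without the fifo. This is a self-contained sketch: the sample data and file names are made up, the chunk size is shrunk to 2 lines so the demo is visible (production would use 200000), and the LOAD DATA step is shown as a comment rather than executed.

```shell
# Generate a small stand-in for the real dump (5 rows).
seq 1 5 | sed 's/^/2015-01-01;row_/' > big_dump.csv

# Split into fixed-size chunk files; production would use -l 200000.
split -l 2 big_dump.csv chunk_

for part in chunk_*; do
  # In production each chunk would be loaded with the same LOAD DATA
  # statement as in the script above, e.g.:
  #   mysql -e "load data local infile '$part' into table mybase.mytable ..."
  echo "$part: $(wc -l < "$part") lines"
done
```

The trade-off versus the fifo approach: split materializes every chunk on disk, so you need the extra free space, whereas the fifo never writes the chunks anywhere.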
Optimize / delete / reload when you can
With time, indexes get dirty: if you delete a third of the data and then re-insert into MySQL, the database will begin to slow down.
To avoid those problems you'll have to optimize the table. Better: drop the database and restart from a backup, it may be faster. If you bulkload your backup, it can be much faster than OPTIMIZE.
Good MySQL hardware configuration
CPU is not important; disks are important, and so is RAM. It all depends on your workload. We were constantly inserting small rows and running huge, complex selects. For this we decided lots of RAM would be a good thing. My advice on this part would be to start some hosts on the cloud provider of your choice (AWS/GCloud/Azure/...) and test your workloads. Only trust your own benchmarks: your workload is not the one on mysqlperformanceblog, it's yours, and you have to bench it before making decisions.
The devil is in the details, and the following parameters can be very dangerous if you don't have a good and reliable backup of your data.
innodb buffer pool size
The most obvious parameter to tune: it sets how much RAM InnoDB uses to cache data and indexes. Combined with innodb_flush_log_at_trx_commit = 2 it has proven really effective.
innodb_buffer_pool_size = 290G
innodb flush log at trx commit
With this you can control when MySQL flushes its transaction log:
- 1 is the default: full ACID, flush at every commit.
- 0 is the dangerous mode: flush once per second; any mysqld process crash can erase the last second of transactions.
- 2 is somewhere in the middle: not ACID but really faster; only an operating-system crash or a power outage can erase the last second of transactions.
innodb_flush_log_at_trx_commit = 2
In our workload, with lots of small inserts, this parameter gave us a 20% performance boost.
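Put together, the two settings above live in the [mysqld] section of my.cnf. The values below are the ones quoted in this article, from a 500GB-RAM host; size them to your own machine.

```ini
# my.cnf fragment -- values from a 500GB-RAM host, not a recommendation
# for smaller machines.
[mysqld]
innodb_buffer_pool_size        = 290G
# 2 = write the log at every commit but fsync only once per second:
# not fully ACID; an OS crash or power loss can cost the last second
# of transactions.
innodb_flush_log_at_trx_commit = 2
```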
Always work on your queries
Do fewer things, think twice, etc. By working a bit on our SQL queries and our algorithms, we were able to get a 30% performance boost in two weeks of work.
BIG tables to ALTER
When you want to alter a table with 100 million+ rows, do it offline if you can. You can also use the bulkload technique to speed it up:
- export table as csv (or use your backup)
- create the new table with the new schema (with a new name if you have enough space, or drop the old one)
- bulkload it
- (rename the table if you had space for both)
The result will be about 5 times faster than altering the table in place.
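The steps above (in the have-enough-space variant) can be written down as a single SQL script. This dry-run sketch only writes the statements to a file, so nothing is executed and you can review before piping it into mysql; all table, database and column names are made up.

```shell
# Write the rebuild-instead-of-ALTER recipe to a file.
cat > rebuild.sql <<'SQL'
-- 1. new table with the new schema (placeholder columns)
CREATE TABLE mybase.mytable_new (id INT PRIMARY KEY);
-- 2. bulkload the exported CSV (or your backup) into it
LOAD DATA LOCAL INFILE '/srv/tmp/big_dump.csv' INTO TABLE mybase.mytable_new
  FIELDS TERMINATED BY ';' ENCLOSED BY '"' LINES TERMINATED BY '\n';
-- 3. atomic swap, then drop the old table
RENAME TABLE mybase.mytable TO mybase.mytable_old,
             mybase.mytable_new TO mybase.mytable;
DROP TABLE mybase.mytable_old;
SQL

cat rebuild.sql   # review, then run with: mysql < rebuild.sql
```

RENAME TABLE swaps both names in one atomic statement, so readers never see a missing table during the switch.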
in the end
As a conclusion, I would say MySQL was a really good choice for our first year. But as the data grew, all those tricks were not enough anymore and we moved to big data: Hadoop, Pig, Spark... (to be discussed in other articles)