At Holimetrix, we deal with millions of log lines every day. Archiving and compressing these logs was a challenge for us, so we ran a benchmark of the main solutions available on the market.
Our strategy is pretty simple: make the most of our machines' CPUs at night, when they are not too busy, to achieve maximum compression and optimize our storage capacity.
We selected four challengers for their usability within scripts and robustness:
- Gzip, probably the most famous and oldest one (established in the early 1990s). It is based on the LZ77 and Huffman algorithms.
- Bzip2, which shares its name with its algorithm and was developed by Julian Seward, the creator of Valgrind. It uses the BWT (Burrows-Wheeler transform) and Huffman algorithms.
- Xz, which uses the "new" LZMA2 algorithm.
- LZ4, developed by the Frenchman Yann Collet. Its compression is based primarily on LZ77.
To choose between them, we took a pragmatic approach and tested each of these tools.
We ran the tests on this reference server:
- Debian Version 8.3
- Kernel SMP Debian 3.16.7-ckt20-1+deb8u3
- 8 cores Intel Xeon CPU E5-1410 v2 @ 2.80GHz
- 64 GB RAM
- ext4 file system on an SSD
We used a small 1.3 GB log file to test the compression tools:
dario@mirkwood:~$ ll access.log
-rw-r--r-- 1 dario dario 1356661100 Jan 30 15:50 access.log
Each tool was run the same way, at compression level 9:
- time gzip -9v access.log
- time bzip2 -9v access.log
- time xz -9v access.log
- time lz4 -9v access.log
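The four runs above could be wrapped in a small helper so that every tool is measured the same way. This is only a minimal sketch, not the exact procedure used for the benchmark: the function name `bench_one`, the `.bench` suffix and the optional level argument are assumptions made for illustration. It writes to stdout with `-c` so that the original file is kept between runs.

```shell
#!/bin/sh
# Sketch: time one compressor and report the size of its output.
# bench_one and the ".bench" suffix are illustrative names, not part of any tool.
bench_one() {
    tool=$1            # gzip, bzip2, xz or lz4
    input=$2           # file to compress
    level=${3:-9}      # compression level, 9 by default as in the runs above
    out="$input.$tool.bench"

    # Skip tools that are not installed on this machine.
    command -v "$tool" >/dev/null 2>&1 || { echo "$tool: not installed"; return 0; }

    start=$(date +%s)
    # -c writes the compressed data to stdout, so the input file is left in place.
    "$tool" "-$level" -c "$input" > "$out" || return 1
    end=$(date +%s)

    size=$(wc -c < "$out" | tr -d ' ')
    echo "$tool: $((end - start)) s, $size bytes"
}
```

It could then be called as `for tool in gzip bzip2 xz lz4; do bench_one "$tool" access.log; done`. The timing has a one-second resolution, which is enough for runs that take minutes but too coarse for small files.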
No surprise: with its compression level set to the maximum, bzip2 takes much longer to compress our file than its competitors, and the same applies to decompression. This is not a serious issue for us, as our servers have a good part of the night to carry out these operations.
Although LZ4 is a fairly recent technology, it beats Gzip on its own ground, with incredible compression and decompression speed.
We will now test the most important aspect for us, the compression ratio.
It appears that bzip2 and xz are pretty much neck and neck. Gzip and lz4 are definitely out due to their insufficient compression ratio.
We therefore chose xz for our nightly batch processing, since it is a little faster than bzip2.
When taking a closer look at the xz options, we discovered the "extreme" mode:
-e, --extreme       try to improve compression ratio by using more CPU time;
                    does not affect decompressor memory requirements
We ran a small test to compare performance with and without extreme mode.
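A comparison of this kind could be scripted roughly as follows. This is a sketch under assumptions: `compare_extreme` is a hypothetical helper name, and the level argument exists only so the sketch can be tried quickly on small files. It compresses the same input twice, once with the plain level and once with the `e` suffix, and prints the time and size of each run.

```shell
#!/bin/sh
# Sketch: compress one file with and without --extreme and print time and size.
# compare_extreme is an illustrative name; temporary outputs are deleted afterwards.
compare_extreme() {
    input=$1
    level=${2:-9}      # 9 by default; "-9e" is the short form of "-9 --extreme"

    command -v xz >/dev/null 2>&1 || { echo "xz: not installed"; return 0; }

    for opts in "-$level" "-${level}e"; do
        out=$(mktemp) || return 1
        start=$(date +%s)
        # -c writes to stdout, so the input file is not replaced.
        xz "$opts" -c "$input" > "$out" || { rm -f "$out"; return 1; }
        end=$(date +%s)
        echo "xz $opts: $((end - start)) s, $(wc -c < "$out" | tr -d ' ') bytes"
        rm -f "$out"
    done
}
```

For example, `compare_extreme access.log` would run `xz -9` and then `xz -9e` on the same file.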
With extreme mode on, compression time is multiplied by almost 6. But the compression ratio also improves dramatically: we save over 98% of the space, and our initial 1.3 GB file drops to 23 MB!
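The percentage can be checked from the two file sizes. In the sketch below, `space_saving` is a hypothetical helper, and 23000000 bytes is only an approximation of "23 MB", since the exact compressed size is not given here.

```shell
#!/bin/sh
# Space saving in percent: (1 - compressed / original) * 100.
# space_saving is an illustrative helper name.
space_saving() {
    awk -v orig="$1" -v comp="$2" 'BEGIN { printf "%.2f\n", (1 - comp / orig) * 100 }'
}

# 1356661100 is the original size from the file listing;
# 23000000 is an assumed byte count standing in for "23 MB".
space_saving 1356661100 23000000
```

With these inputs the result is a little above 98%, in line with the figure quoted above.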
Despite this amazing result, we decided to use xz without extreme mode to keep compression and decompression speed reasonable.
We will certainly keep an eye on lz4, which is quite impressive for such a young product and could soon become an interesting alternative to xz.
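A nightly job along these lines might look like the following sketch. It is an illustration, not our production script: the function name `compress_old_logs`, the `*.log` pattern, the one-day age threshold and the optional level argument are all assumptions.

```shell
#!/bin/sh
# Sketch: compress logs older than one day with xz at low CPU priority.
# compress_old_logs, the "*.log" pattern and the age threshold are illustrative choices.
compress_old_logs() {
    log_dir=$1
    level=${2:-9}      # level 9 without -e, matching the choice described above

    # -mtime +0 matches files last modified more than 24 hours ago.
    # nice -n 19 runs xz at the lowest scheduling priority.
    # Without -c or -k, xz replaces each file with file.xz.
    find "$log_dir" -type f -name '*.log' -mtime +0 \
        -exec nice -n 19 xz "-$level" {} \;
}
```

Such a function could be called from a cron entry scheduled during the night, with the log directory passed as its argument.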
Cool links: