 7: ERR174310_1-2.fastq.gz
 8: m1310[03,04,05,08,09,10]* - PacBio
 9: Ecoli_R73
14: sample_2-10_sorted.bam

The ID compression here is based around the cram_modules branch of io_lib:

    https://github.com/jkbonfield/io_lib/tree/cram_modules

This contains plugins for off-the-shelf compression codecs such as zstd and libbsc, as well as a custom codec for identifiers named "names3".

For ease of testing, I have now made a couple of standalone binaries of names3 (at 1MB and 5MB block sizes). These have been built via:

    git clone git@github.com:jkbonfield/io_lib.git
    cd io_lib
    git checkout cram_modules
    ./bootstrap
    ./configure --disable-shared
    make
    cd codec_src
    cc -O3 cram_codec_names3.c -DTEST_MAIN -I.. -L../io_lib/.libs \
        -lstaden-read -lbz2 -DBLK_SIZE=1000000 -o names3.1M -static
    cc -O3 cram_codec_names3.c -DTEST_MAIN -I.. -L../io_lib/.libs \
        -lstaden-read -lbz2 -DBLK_SIZE=5000000 -o names3.5M -static

Example usage on the first 1 million names from data set 16:

    $ xz < /tmp/_n | wc -c
    4012596

    $ ./names3.1M < /tmp/_n | wc -c
    3393587

    $ ./names3.1M < /tmp/_n | ./names3.1M -d | md5sum
    2da4206b4364d7bb7ef0142543979421  -

    $ md5sum /tmp/_n
    2da4206b4364d7bb7ef0142543979421  /tmp/_n

IDs
===

Data set 16
-----------

Tool/seq-per-blk      Size        Enc(s)   Dec(s)
bsc/1000              68230510      56.6    24.2
bsc/10000             53054480      28.4    15.4
bsc/100000            49957820      29.2    15.8
zlib/1000             75745625       8.7     1.3
zlib/10000            65303357      10.4     1.1
zlib/100000           64095645      13.2     1.1
zstd/1000             77904135      23.5     1.5
zstd/10000            65700493      27.3     0.8
zstd/100000           62831254      21.6     0.6
zstd_fqz_n3/1000      69080857      69.5     6.5
zstd_fqz_n3/10000     48570719      47.6     4.6
zstd_fqz_n3/100000    43435039      46.6     4.8
names3.1M             44661139      45.9     5.1
names3.5M             42872449      44.3     5.4

Data set 10
-----------

Tool/seq-per-blk      Size        Enc(s)   Dec(s)
raw                   9415113466
bsc/1000              1477507852  1359.0   868.4
bsc/10000             1242009047   791.7   501.2
bsc/100000            1091425173   818.1   386.5
zlib/1000             1715624836   253.8    34.2
zlib/10000            1655419479   288.1    31.7
zlib/100000           1648685918   295.8    32.5
zstd/1000             1714390615   411.6    18.0
zstd/10000            1521422576   344.2    14.1
zstd/100000           1423886560   453.4    15.2
zstd_fqz_n3/1000      1454275810  1625.4   153.3
zstd_fqz_n3/10000     FAIL
zstd_fqz_n3/100000    FAIL
names3.1M             1122148682  1933.4   159.8
names3.5M             1044068348  1541.3   150.9

The "names3" (aka n3) codec is buggy in its current implementation, but the file format shows promise, and when it works it generally works well. Its output is considerably smaller than that of the other tools tested on data set 16, although on data set 09 (below) bsc wins out.

The method uses a combination of whole-record LZ (referring back {dist} records rather than a {dist,len} byte range) and a prefix dictionary (e.g. computed using a trie in the encoder). The format is an order-0 RANS compressed dictionary, then 1-5 blocks of distance codes stored as their successive low 7 bits, each block compressed with order-0 RANS (later blocks are omitted if all of their values are zero), followed by the remaining literals compressed using bzip2.
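To make that distance layout concrete, here is a minimal standalone sketch of the low-7-bit block split. This is an illustrative toy of my own, not the actual names3 code: the order-0 RANS compression of each block is elided, and the example distances are invented.

    /* Toy sketch of a names3-style distance layout, NOT the real
     * names3 code.  Each whole-record LZ distance ("how many records
     * back the matching name is") is split into up to 5 successive
     * low 7-bit chunks, forming up to 5 parallel byte blocks.  This
     * sketch keeps every block up to the last non-zero one; each
     * surviving block would then be compressed separately with
     * order-0 RANS (not shown). */
    #include <stdio.h>
    #include <stdint.h>

    #define NDIST      6
    #define MAX_PLANES 5

    int main(void) {
        /* Invented example distances, measured in records not bytes. */
        uint32_t dist[NDIST] = {1, 1, 200, 1, 70000, 3};
        uint8_t plane[MAX_PLANES][NDIST];
        int nplanes = 0;

        /* Split each distance into successive low 7-bit chunks. */
        for (int i = 0; i < NDIST; i++) {
            uint32_t d = dist[i];
            for (int p = 0; p < MAX_PLANES; p++) {
                plane[p][i] = d & 0x7f;
                d >>= 7;
            }
        }

        /* Record the last block containing any non-zero byte;
         * blocks beyond it are omitted from the output. */
        for (int p = 0; p < MAX_PLANES; p++) {
            int nonzero = 0;
            for (int i = 0; i < NDIST; i++)
                nonzero |= plane[p][i];
            if (nonzero)
                nplanes = p + 1;
        }

        printf("%d block(s) kept\n", nplanes);
        for (int p = 0; p < nplanes; p++) {
            for (int i = 0; i < NDIST; i++)
                printf(" %3d", plane[p][i]);
            printf("\n");
        }
        return 0;
    }

On this input only three blocks survive, since no distance needs more than 21 bits; small distances presumably dominate real data, so the later blocks are usually absent or highly compressible.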
Data set 09
-----------

Tool/seq-per-blk      Size        Enc(s)   Dec(s)
bsc/1000              77759126      75.8    52.9
bsc/10000             66013994      29.2    19.1
bsc/100000            59574350      33.5    17.9
zlib/1000             93350051      31.1     1.8
zlib/10000            89976155      85.5     1.5
zlib/100000           88999138      95.2     1.5
zstd/1000             73522143      74.4     1.3
zstd/10000            72112054     114.9     1.0
zstd/100000           69253251     165.9     1.0
zstd_fqz_n3/1000      73521914      74.4     1.3
zstd_fqz_n3/10000     67550945      86.1     0.0   (DIED decoding)
zstd_fqz_n3/100000    61939253     103.0     9.9
names3.1M             62210509      80.9     9.6
names3.5M             61450477      74.9     9.8

n3 sometimes fails on 09, but bsc easily beats it anyway. Note: this data set is excessively deep, with multiple libraries, sorted on position first and then on other fields (including name). Hence string delta works well.

The data set 09 names were also "split -l 100000" into external files and compressed with standalone general-purpose tools. Times and sizes are aggregates across all the split files, e.g.:

    $ time (for i in xaa*; do xz < $i; done) | wc -c
    60472476

    real    3m25.885s
    user    3m21.425s
    sys     0m3.532s

Tool          Size       Time
zstd          98998735   0m4.387s
zstd -22      69205275   4m30.231s
xz            60472476   3m25.885s
bsc           59526668   0m40.540s
bzip2         64900800   0m31.960s
lpaq 8        47521969   5m54.370s
fqzcomp -n1   47225675   <0m22.685s   <--- best
fqzcomp -n2   67464075   <0m21.959s

So even boring old bzip2 is both faster and smaller than maximum-level zstd on this particular data set. A modern BWT compressor is considerably smaller still. Best of all is the naive original name model in fqzcomp (-n1), which uses the previous name as a template (a toy sketch of this idea is at the end of this section). This works because the names occur in groups, due to being semi-sorted. I am unsure whether this is a good approach to recommend as a general-purpose method, unless the data is (at least partially) name sorted.

Lossy mode: this is implemented in CRAM by exploiting read pairs. Any template known to have all of its copies within a single CRAM slice can have its read names discarded. However, this is the only aligned input (BAM) out of the 3 data sets and it has no read-pairs, so lossy mode is irrelevant here. In this scenario, to keep pairing we can throw ALL read names away and just assign a numeric counter, giving a total compressed size of a single flag bit.

Data set 14
-----------
Ecoli_R73.fastq

Tool          Size
fqzcomp -n1   238554
fqzcomp -n2   200583
CRAM (zlib)   455744   (default CRAM)
name3/1k      272476
name3/10k     202058
name3/100k    195271
bsc/1k        335332
bsc/10k       271246
bsc/100k      243460
names3.1M     201449
names3.5M     195775

All were verified during decompression. No timings are given, as these all run incredibly quickly due to the small file size. Here names3 doesn't trigger its crash and comes out looking very good, even beating the fqzcomp non-random-access format.

Data set 07
-----------
ERR174310_1.fastq.gz & ERR174310_2.fastq.gz (figures are per file)

Tool          Size        Time
fqzcomp -n1   413815000   <23m40s
fqzcomp -n2   408490655   <25m04s

Note that fqzcomp encodes a single file rather than a pair of files, but the names are identical between the two files of a pair. While the code doesn't exploit this, it is trivial to uncompress the names from one file and duplicate them to the other, so notionally the size for this data set is that of either _1 or _2 alone (and not their sum).

I haven't yet had time to do a full alignment and sort of data set 7, so I do not know what the CRAM lossy ID compression will be.
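Finally, the toy sketch promised above of the previous-name-as-template idea behind fqzcomp -n1. This is my own illustration, not fqzcomp's actual model or bitstream: the example names are invented, and a real coder would feed the match flags and literals into an entropy coder rather than printing them.

    /* Toy illustration of coding each name against the previous name
     * as a template.  Characters that match the same position in the
     * previous name become a match flag (shown as '.'); everything
     * else is a literal.  With semi-sorted names, long runs of match
     * flags dominate and compress to almost nothing. */
    #include <stdio.h>
    #include <string.h>

    static void delta_vs_prev(const char *prev, const char *cur) {
        size_t np = strlen(prev), nc = strlen(cur);
        printf("%-22s -> ", cur);
        for (size_t i = 0; i < nc; i++) {
            if (i < np && prev[i] == cur[i])
                putchar('.');       /* match flag: cheap to code   */
            else
                putchar(cur[i]);    /* literal: the changed digits */
        }
        putchar('\n');
    }

    int main(void) {
        /* Invented example names in a semi-sorted style. */
        const char *names[] = {
            "ERR174310.100998/1",
            "ERR174310.100999/1",
            "ERR174310.101000/1",
        };
        for (int i = 1; i < 3; i++)
            delta_vs_prev(names[i-1], names[i]);
        return 0;
    }

Note the carry in the third name, which turns a counter incremented by one into four literals; token-based models that code the numeric field as an arithmetic delta avoid this cost.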