 7: ERR174310_1-2.fastq.gz
 8: m1310[03,04,05,08,09,10]* - PacBio
 9: Ecoli_R73
14: sample_2-10_sorted.bam

The ID compression here is based around the cram_modules branch of io_lib:

    https://github.com/jkbonfield/io_lib/tree/cram_modules

This contains plugins for off-the-shelf compression codecs such as zstd and libbsc, as well as a custom codec for identifiers named "names3".

For ease of testing, I have now made a couple of standalone binaries of names3 (at 1MB and 5MB block sizes). These have been built via:

    git clone git@github.com:jkbonfield/io_lib.git
    cd io_lib
    git checkout cram_modules
    ./bootstrap
    ./configure --disable-shared
    make
    cd codec_src
    cc -O3 cram_codec_names3.c -DTEST_MAIN -I.. -L../io_lib/.libs \
        -lstaden-read -lbz2 -DBLK_SIZE=1000000 -o names3.1M -static
    cc -O3 cram_codec_names3.c -DTEST_MAIN -I.. -L../io_lib/.libs \
        -lstaden-read -lbz2 -DBLK_SIZE=5000000 -o names3.5M -static

Example usage on the first 1 million names from data set 16:

    $ xz < /tmp/_n | wc -c
    4012596

    $ ./names3.1M < /tmp/_n | wc -c
    3393587

    $ ./names3.1M < /tmp/_n | ./names3.1M -d | md5sum
    2da4206b4364d7bb7ef0142543979421  -

    $ md5sum /tmp/_n
    2da4206b4364d7bb7ef0142543979421  /tmp/_n

IDs
===

Data set 16
-----------

Tool/seq-per-blk      Size        Enc(s)   Dec(s)
bsc/1000              68230510      56.6    24.2
bsc/10000             53054480      28.4    15.4
bsc/100000            49957820      29.2    15.8
zlib/1000             75745625       8.7     1.3
zlib/10000            65303357      10.4     1.1
zlib/100000           64095645      13.2     1.1
zstd/1000             77904135      23.5     1.5
zstd/10000            65700493      27.3     0.8
zstd/100000           62831254      21.6     0.6
zstd_fqz_n3/1000      69080857      69.5     6.5
zstd_fqz_n3/10000     48570719      47.6     4.6
zstd_fqz_n3/100000    43435039      46.6     4.8
names3.1M             44661139      45.9     5.1
names3.5M             42872449      44.3     5.4

Data set 10
-----------

Tool/seq-per-blk      Size        Enc(s)   Dec(s)
raw                   9415113466
bsc/1000              1477507852  1359.0   868.4
bsc/10000             1242009047   791.7   501.2
bsc/100000            1091425173   818.1   386.5
zlib/1000             1715624836   253.8    34.2
zlib/10000            1655419479   288.1    31.7
zlib/100000           1648685918   295.8    32.5
zstd/1000             1714390615   411.6    18.0
zstd/10000            1521422576   344.2    14.1
zstd/100000           1423886560   453.4    15.2
zstd_fqz_n3/1000      1454275810  1625.4   153.3
zstd_fqz_n3/10000     FAIL
zstd_fqz_n3/100000    FAIL
names3.1M             1122148682  1933.4   159.8
names3.5M             1044068348  1541.3   150.9

The "names3" (aka n3) codec is buggy in its current implementation, but the file format shows promise, and when it works it generally works well. Its output is considerably smaller than that of the other tools tested on data set 16, although on data set 09 (below) bsc wins out.

The method uses a combination of whole-record LZ (referring back {dist} records rather than a {dist,len} byte range) and a prefix dictionary (e.g. computed using a trie in the encoder). The format is an order-0 RANS compressed dictionary, then 1-5 blocks of distance codes stored as their successive low 7 bits, each block compressed with order-0 RANS (later blocks are omitted if all of their values are zero), followed by the remaining literals compressed using bzip2.
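To make that distance layout concrete, here is a minimal standalone sketch of the low-7-bit block split. This is an illustrative toy of my own, not the actual names3 code: the order-0 RANS compression of each block is elided, and the example distances are invented.

    /* Toy sketch of a names3-style distance layout, NOT the real
     * names3 code.  Each whole-record LZ distance ("how many records
     * back the matching name is") is split into up to 5 successive
     * low 7-bit chunks, forming up to 5 parallel byte blocks.  This
     * sketch keeps every block up to the last non-zero one; each
     * surviving block would then be compressed separately with
     * order-0 RANS (not shown). */
    #include <stdio.h>
    #include <stdint.h>

    #define NDIST      6
    #define MAX_PLANES 5

    int main(void) {
        /* Invented example distances, measured in records not bytes. */
        uint32_t dist[NDIST] = {1, 1, 200, 1, 70000, 3};
        uint8_t plane[MAX_PLANES][NDIST];
        int nplanes = 0;

        /* Split each distance into successive low 7-bit chunks. */
        for (int i = 0; i < NDIST; i++) {
            uint32_t d = dist[i];
            for (int p = 0; p < MAX_PLANES; p++) {
                plane[p][i] = d & 0x7f;
                d >>= 7;
            }
        }

        /* Record the last block containing any non-zero byte;
         * blocks beyond it are omitted from the output. */
        for (int p = 0; p < MAX_PLANES; p++) {
            int nonzero = 0;
            for (int i = 0; i < NDIST; i++)
                nonzero |= plane[p][i];
            if (nonzero)
                nplanes = p + 1;
        }

        printf("%d block(s) kept\n", nplanes);
        for (int p = 0; p < nplanes; p++) {
            for (int i = 0; i < NDIST; i++)
                printf(" %3d", plane[p][i]);
            printf("\n");
        }
        return 0;
    }

On this input only three blocks survive, since no distance needs more than 21 bits; small distances presumably dominate real data, so the later blocks are usually absent or highly compressible.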
Data set 09
-----------

Tool/seq-per-blk      Size        Enc(s)   Dec(s)
bsc/1000              77759126      75.8    52.9
bsc/10000             66013994      29.2    19.1
bsc/100000            59574350      33.5    17.9
zlib/1000             93350051      31.1     1.8
zlib/10000            89976155      85.5     1.5
zlib/100000           88999138      95.2     1.5
zstd/1000             73522143      74.4     1.3
zstd/10000            72112054     114.9     1.0
zstd/100000           69253251     165.9     1.0
zstd_fqz_n3/1000      73521914      74.4     1.3
zstd_fqz_n3/10000     67550945      86.1     0.0   (DIED decoding)
zstd_fqz_n3/100000    61939253     103.0     9.9
names3.1M             62210509      80.9     9.6
names3.5M             61450477      74.9     9.8

n3 sometimes fails on 09, but bsc easily beats it anyway. Note: this data set is excessively deep, with multiple libraries, sorted on position first and then on other fields (including name). Hence string delta works well.

The data set 09 names were also "split -l 100000" into external files and compressed with standalone general-purpose tools. Times and sizes are aggregates across all the split files, e.g.:

    $ time (for i in xaa*; do xz < $i; done) | wc -c
    60472476

    real    3m25.885s
    user    3m21.425s
    sys     0m3.532s

Tool          Size       Time
zstd          98998735   0m4.387s
zstd -22      69205275   4m30.231s
xz            60472476   3m25.885s
bsc           59526668   0m40.540s
bzip2         64900800   0m31.960s
lpaq 8        47521969   5m54.370s
fqzcomp -n1   47225675   <0m22.685s   <--- best
fqzcomp -n2   67464075   <0m21.959s

So even boring old bzip2 is both faster and smaller than maximum-level zstd on this particular data set. A modern BWT compressor is considerably smaller still. Best of all is the naive original name model in fqzcomp (-n1), which uses the previous name as a template (a toy sketch of this idea is at the end of this section). This works because the names occur in groups, due to being semi-sorted. I am unsure whether this is a good approach to recommend as a general-purpose method, unless the data is (at least partially) name sorted.

Lossy mode: this is implemented in CRAM by exploiting read pairs. Any template known to have all of its copies within a single CRAM slice can have its read names discarded. However, this is the only aligned input (BAM) out of the 3 data sets and it has no read-pairs, so lossy mode is irrelevant here. In this scenario, to keep pairing we can throw ALL read names away and just assign a numeric counter, giving a total compressed size of a single flag bit.

Data set 14
-----------
Ecoli_R73.fastq

Tool          Size
fqzcomp -n1   238554
fqzcomp -n2   200583
CRAM (zlib)   455744   (default CRAM)
name3/1k      272476
name3/10k     202058
name3/100k    195271
bsc/1k        335332
bsc/10k       271246
bsc/100k      243460
names3.1M     201449
names3.5M     195775

All were verified during decompression. No timings are given, as these all run incredibly quickly due to the small file size. Here names3 doesn't trigger its crash and comes out looking very good, even beating the fqzcomp non-random-access format.

Data set 07
-----------
ERR174310_1.fastq.gz & ERR174310_2.fastq.gz (figures are per file)

Tool          Size        Time
fqzcomp -n1   413815000   <23m40s
fqzcomp -n2   408490655   <25m04s

Note that fqzcomp encodes a single file rather than a pair of files, but the names are identical between the two files of a pair. While the code doesn't exploit this, it is trivial to uncompress the names from one file and duplicate them to the other, so notionally the size for this data set is that of either _1 or _2 alone (and not their sum).

I haven't yet had time to do a full alignment and sort of data set 7, so I do not know what the CRAM lossy ID compression will be.
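Finally, the toy sketch promised above of the previous-name-as-template idea behind fqzcomp -n1. This is my own illustration, not fqzcomp's actual model or bitstream: the example names are invented, and a real coder would feed the match flags and literals into an entropy coder rather than printing them.

    /* Toy illustration of coding each name against the previous name
     * as a template.  Characters that match the same position in the
     * previous name become a match flag (shown as '.'); everything
     * else is a literal.  With semi-sorted names, long runs of match
     * flags dominate and compress to almost nothing. */
    #include <stdio.h>
    #include <string.h>

    static void delta_vs_prev(const char *prev, const char *cur) {
        size_t np = strlen(prev), nc = strlen(cur);
        printf("%-22s -> ", cur);
        for (size_t i = 0; i < nc; i++) {
            if (i < np && prev[i] == cur[i])
                putchar('.');       /* match flag: cheap to code   */
            else
                putchar(cur[i]);    /* literal: the changed digits */
        }
        putchar('\n');
    }

    int main(void) {
        /* Invented example names in a semi-sorted style. */
        const char *names[] = {
            "ERR174310.100998/1",
            "ERR174310.100999/1",
            "ERR174310.101000/1",
        };
        for (int i = 1; i < 3; i++)
            delta_vs_prev(names[i-1], names[i]);
        return 0;
    }

Note the carry in the third name, which turns a counter incremented by one into four literals; token-based models that code the numeric field as an arithmetic delta avoid this cost.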