CE2, lossless quality
=====================

We present a variety of standard external compression tools, for
comparative purposes, along with custom quality value encoders in CRAM.
These include the default order-1 rANS already implemented in CRAM v3
and an adaptation of the quality model from the fqzcomp (-q2) tool.

The rANS1 bit stream is as defined in the CRAM specification
(http://samtools.github.io/hts-specs/).

The fqzcomp CRAM plugin, using Scramble's "cram_modules" branch
(https://sourceforge.net/p/staden/code/HEAD/tree/io_lib/branches/cram_modules/),
is derived from the level 2 quality model described in this paper:
http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0059190

Specifically the "context" used to encode quality Q(n) is built from:

    : Q(n-1)              // previous quality value
    : MAX(Q(n-2),Q(n-3))  // max of the two values before that
    : Q(n-2) == Q(n-3)    // a single boolean; do the prev. two values match
    : MIN(7*8,delta)/8    // delta computed since start of read; 3 bits' worth

The running delta starts at 5 for each read and is updated by:

    delta = delta + ( Q(n-1)>Q(n) ? Q(n-1) - Q(n) : 0 )

The fqzcomp tool was a standalone FASTQ compressor, but here we use it
as a plugin to CRAM.  In this context it takes a block of quality
values as input and also has access to the unencoded BAM records.  The
decoder, however, only has the quality stream, so unlike fqzcomp the
CRAM plugin also has to encode additional data into the quality block
to permit the qualities to be decoded in isolation.  To do this it
extends the fqzcomp format with the addition of the read length (so it
knows when to reset delta and the Q(n-1)..Q(n-3) context) and the
orientation of the quality string, taken from the BAM FLAG field.
Additionally it has gained extra code for spotting 100% duplications of
the previous quality string; see below for the reasons for this.
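As a concrete illustration of this model, the context computation
might look as follows in C.  This is a minimal sketch of the scheme as
described above, not the plugin's actual source; the assumption of
6-bit quality values and the exact bit packing are illustrative only.

    #include <stdint.h>

    typedef struct {
        uint8_t q1, q2, q3;  /* Q(n-1), Q(n-2), Q(n-3) */
        int     delta;       /* running delta, reset per read */
    } qctx;

    /* Called at the start of each read; delta starts at 5. */
    static void qctx_reset(qctx *c) {
        c->q1 = c->q2 = c->q3 = 0;
        c->delta = 5;
    }

    /* Build the context used to select the model for coding Q(n). */
    static uint32_t qctx_get(const qctx *c) {
        int d = c->delta < 7*8 ? c->delta : 7*8;             /* MIN(7*8,delta) */
        uint32_t ctx = c->q1;                                /* previous quality */
        ctx = (ctx << 6) | (c->q2 > c->q3 ? c->q2 : c->q3);  /* MAX(Q(n-2),Q(n-3)) */
        ctx = (ctx << 1) | (c->q2 == c->q3);                 /* do prev. two match? */
        ctx = (ctx << 3) | (d / 8);                          /* 3 bits of delta */
        return ctx;
    }

    /* After coding q = Q(n), update the state ready for Q(n+1). */
    static void qctx_update(qctx *c, uint8_t q) {
        if (c->q1 > q)
            c->delta += c->q1 - q;  /* delta += Q(n-1)>Q(n) ? Q(n-1)-Q(n) : 0 */
        c->q3 = c->q2;
        c->q2 = c->q1;
        c->q1 = q;
    }

The encoder calls qctx_get() to choose the frequency model for the
range coder, codes the symbol, then calls qctx_update().  The decoder
mirrors this exactly, which is why the context may only be built from
values it has already decoded.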
Data set 09
-----------

Tool/seq-per-blk   Size        Enc(s)  Dec(s)
bsc/1000           833618988    267.0   207.4
bsc/10000          813329602    252.4   190.3
bsc/100000         798610572    312.2   227.5
rANS1/1000         886674571     16.7     8.3
rANS1/10000        867558424     17.3     8.6
rANS1/100000       866132619     27.4     8.6
fqz/1000           886674571     18.9     8.3   (redo; it detected rANS1 was better)
fqz/10000          853315557    119.8   143.2
fqz/100000         808792284    118.5   118.0

-- external tools, no random access --
bsc                793682714    158.9    56.1   (333.5 cpu, threaded)
gzip               975609900
bzip2              901698074
rngcod/comp1       867240333     48.0           (adaptive order-1 entropy coder)
rngcod/comp2       806331129     52.2           (adaptive order-2 entropy coder)
fqzcomp -q1        788815867    <75             (slowest of qual+name+id)
fqzcomp -q2        778564559   <103
fqzcomp -q3        778252683   <111

Bsc is the Block Sorting Compressor
(https://github.com/IlyaGrebnov/libbsc).  Comp1 is from Michael
Schindler's "rngcod" range coder demonstration, using an adaptive
order-1 model; comp2 is my own adaptive order-2 equivalent.
Comp1/comp2 are there as data points for very simple encoding
techniques and to show how a static order-2 rANS model might perform
if we shared the frequency table between multiple slices.  (In CRAM
terms this means putting the table in the container compression header
and having multiple slices per container; something we discussed early
on but did not implement.)

Timings for the external fqzcomp tool come from wall-clock time.  It
encodes identifiers, sequences and quality values in parallel, so the
time for quality alone is unknown (but it is likely the slowest of the
three).  Decode times weren't measured, but they are usually a little
slower than encode times.

Conclusion: bsc is surprisingly good, better than the fqzcomp model,
but it is slow.  Bsc as an external tool had different parameters to
my usage of it as a library, including multi-threading, which gave
very different speeds.  An order-2 [rt]ANS codec would help here too,
perhaps getting down to around the 820MB mark.

Data set 16
-----------

Tool/seq-per-blk   Size        Enc(s)  Dec(s)
bsc/1000           710737162    236.1    95.2
bsc/10000          689254788    243.2   164.3
bsc/100000         676054814    321.7   227.9
rANS1/1000         733300118     16.5     8.5
rANS1/10000        714245853     17.1     8.3
rANS1/100000       712828981     27.0     9.6
fqz/1000           753777116    192.7   200.0   (new fqz with dedup)
fqz/10000          672545339    105.6   125.0
fqz/100000         646493190     91.3   105.1

-- external tools, in BAM orientation --
rANS1                   715024762    18.2    9.7   (R/A at ~1MB blocks)
(deskpro)r32x16_avx2o1  716717407     8.8    2.3   (R/A at ~1MB blocks)
(deskpro)r32x16_avx2o0  1013103538    4.4    1.5   (R/A at ~1MB blocks)
bsc-3.1.0 -t -m0        673778956   167s (463 cpu) (no random access?)

-- external tools, in FASTQ orientation --
rANS1                   705721512    18.7    9.2   (R/A at ~1MB blocks)
(deskpro)r32x16_avx2o1  707253117     8.8    2.3   (R/A at ~1MB blocks)
fqzcomp -q1             660906421   ~93s           (no random access)
fqzcomp -q2             635844496  ~123s           (no random access)
fqzcomp -q3             618993894  ~143s           (no random access)
bsc-3.1.0 -t -m0        659671028   182s (468 cpu) (no random access?)

All tests bar the r32x16_avx2 ones were run on the same server
(Intel(R) Xeon(R) CPU E5-2660 0 @ 2.20GHz).  The avx2 method was run
on a desktop machine (Intel(R) Core(TM) i5-4570 CPU @ 3.20GHz) as the
server lacked AVX2 support.  It is evident that mid-level compression
can be achieved at extremely high rates if we are willing to exploit
SIMD.  (The standard rANS1 file format isn't ideal for SIMD due to the
interdependency of its data.)

Conclusion: reordering qualities into their original FASTQ orientation
helps with most methods.  (CRAM uses aligned/BAM orientation, but note
the fqz codec for CRAM flips these and stores an additional copy of
the original orientation inside the fqz quality block.)

CRAM's fqz codec is based on fqzcomp -q2 as it is expected to be
operated in block mode to permit random access.  Even so it needs
large blocks to maximise efficiency and still cannot match fqzcomp
-q3; however, it is still considerably smaller and faster than libbsc.

Data set 10
-----------

Set 10 is RNAseq and has multiple mappings of the same sequence.  This
gives rise to duplicate quality strings, so it benefits from LZ or BWT
processing over raw entropy encoding methods.  Thus detection of
whole-entry duplications may be beneficial, but only in repetitive
areas.  The first 10 million lines of quality contain 204700
duplicated lines, of which 178819 are duplicates of the immediately
preceding line.  By emitting a code for "same as last line" we save
~4% storage.

Orig qual:
    bsc    770000000 into 250270936 in 72.8 seconds
    rANS1  770000000 into 270235820 in  7.4 seconds
    comp2  770000000 into 254166072 in 19.9 seconds

dedup1 (dedup previous line only):
    bsc    746588575 into 243597924 in 61.0 seconds
    rANS1  746588575 into 258990929 in  9.5 seconds
    comp2  746588575 into 244409910 in 18.3 seconds

dedup10k (dedup up to 10k lines away, with back-reference):
    bsc    744689536 into 243527960 in 61.5 seconds
    rANS1  744689536 into 258518510 in  9.8 seconds
    comp2  744689536 into 243922525 in 18.3 seconds

Dedup1 is very quick to implement, so it is a no-brainer to code up
(see the sketch below).  Order-2 rANS or tANS would help further.
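To illustrate why dedup1 is so cheap, a minimal sketch follows.  This
is not the actual CRAM fqz dedup encoding; the one-byte flag layout is
invented here purely for clarity.

    #include <stddef.h>
    #include <string.h>

    /* Encode one quality string, replacing an exact repeat of the
     * previous record's string with a single "same as last line"
     * token.  Returns the number of bytes written to out, which must
     * have room for len+1 bytes. */
    size_t dedup1_encode(const char *qual, size_t len,
                         const char *prev, size_t prev_len,
                         char *out) {
        if (len == prev_len && memcmp(qual, prev, len) == 0) {
            out[0] = 1;                 /* duplicate of previous line */
            return 1;
        }
        out[0] = 0;                     /* literal quality string follows */
        memcpy(out + 1, qual, len);
        return 1 + len;
    }

The transformed stream is then handed to the usual entropy coder
(rANS1, comp2, bsc, ...), giving the dedup1 sizes in the tables above.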
Inside CRAM:

Tool/seq-per-blk    Size         Enc(s)  Dec(s)
bsc/1000            6217036972   2741.1  1754.5
bsc/10000           6021859786   2168.2  1509.8
bsc/100000          5953446434   2453.0  1599.1
rANS1/1000          6773212352    164.4    77.9
rANS1/10000         6444091732    146.3    73.8
rANS1/100000        6421605206    159.8    77.6
fqz/1000            6773589349    169.3    82.9  (detected rANS1 was better)
fqz/10000           6163647096   1010.6  1180.6
fqz/100000          5873759283    752.4   946.2
dedup-rANS1/10000   6185333488    152.7    96.1  (vs 6444091732)

fqz CRAM module with dedup:
fqz/1000            6773624515   2625.3  2485.4  (forced, but too small to adapt)
fqz/10000           5934828621   1071.9  1228.2
fqz/100000          5641065113    768.7   969.8

(Dedup encoding is poorly implemented currently, with too many
malloc/memcpy calls, so these timings could be improved.)


=============================================================================

Lossy quality
=============

The lossy compression is achieved by the Crumble tool, available from:
https://github.com/jkbonfield/crumble

This employs a noddy consensus caller, based on Gap5's consensus
algorithm, to evaluate whether the existence of quality values is
necessary to achieve confident SNP calls.  If not, they are removed,
based on various parameters (loosely adjusted by crumble -1 through
crumble -9, or each set explicitly if required).  Quality values that
are retained are kept losslessly; all others are reduced to a binary
state (high & low).

All tests below use the GATK caller.  We have not yet evaluated the
Samtools/BCFtools calling pipeline.

Data set 11
~~~~~~~~~~~

(Columns in the tables below are tranche threshold, variant type,
filter, recall, precision and F-score.)

Chromosome 11:

Orig          total size 656888474
Crumble -1    total size 640548371
Crumble -9    total size 564459481
Crumble -9p8  total size  82126745  (rANS1 codec, 10k seq block)
Crumble -9p8  total size  96394731  (fqz codec, 100k seq block)

Lossless
 90.0  SNP PASS  0.486779  0.937853  0.640905
 99.0  SNP PASS  0.545241  0.939863  0.690122
 99.9  SNP PASS  0.559192  0.939294  0.701035
100.0  SNP PASS  0.562877  0.937361  0.703380

Crumble -1
 90.0  SNP PASS  0.490596  0.938082  0.644259
 99.0  SNP PASS  0.545794  0.939670  0.690513
 99.9  SNP PASS  0.559841  0.939848  0.701699
100.0  SNP PASS  0.563835  0.937691  0.704221

Crumble -9
 90.0  SNP PASS  0.494090  0.939940  0.647706
 99.0  SNP PASS  0.554055  0.941328  0.697544
 99.9  SNP PASS  0.567129  0.941047  0.707736
100.0  SNP PASS  0.571484  0.938837  0.710485

Crumble -9 -p8
 90.0  SNP PASS  0.521593  0.946976  0.672677
 99.0  SNP PASS  0.582612  0.946927  0.721382
 99.9  SNP PASS  0.598140  0.946122  0.732924
100.0  SNP PASS  0.602672  0.943078  0.735393

Chromosome 20:

Orig          total size 291952417
Crumble -1    total size 285105238
Crumble -9    total size 252045288
Crumble -9p8  total size  37003532  (rANS1 codec, 10k seq block)
Crumble -9p8  total size  38185172  (fqz codec, 100k seq block)
Crumble -9p8  total size  36636817  (rANS1 codec, 100k seq block)

Orig (lossless)
 90.0  SNP PASS  0.464911  0.926027  0.619036
 99.0  SNP PASS  0.516630  0.928990  0.663998
 99.9  SNP PASS  0.528064  0.928534  0.673247
100.0  SNP PASS  0.530541  0.925990  0.674583

Crumble -1
 90.0  SNP PASS  0.467715  0.926810  0.621693
 99.0  SNP PASS  0.518904  0.929253  0.665940
 99.9  SNP PASS  0.529108  0.927853  0.673916
100.0  SNP PASS  0.531522  0.926092  0.675403

Crumble -9
 90.0  SNP PASS  0.473572  0.928868  0.627315
 99.0  SNP PASS  0.528469  0.931185  0.674273
 99.9  SNP PASS  0.537052  0.930801  0.681115
100.0  SNP PASS  0.539264  0.927697  0.682054

Crumble -9p8
 90.0  SNP PASS  0.498481  0.936273  0.650584
 99.0  SNP PASS  0.556244  0.936675  0.697988
 99.9  SNP PASS  0.568738  0.935600  0.707436
100.0  SNP PASS  0.572087  0.931868  0.708944

I didn't expect standard Crumble alone to do well on compression
ratio, and indeed it doesn't until P-block is added in too.  This is
because each column is error prone, with high uncertainty in the
consensus, leading to relatively few locations where qualities can be
discarded.  The P-block algorithm, however, degrades all quality
scores and so has a large impact (a sketch is given below); the net
effect is a 7.9-fold compression ratio.

Bizarrely, all lossy modes improve GATK SNP calling, both reducing
false positives and reducing false negatives.  (I suspect a flaw
somewhere in the calling code, or in our use of it, as this shouldn't
be possible.)  Surprisingly the fqzcomp codec doesn't improve over a
basic order-1 rANS codec for this data set.  (Likely fqzcomp is
over-tuned for Illumina and takes too long to adapt.)
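For reference, P-block is a "horizontal" quantiser working along each
read: a run of quality values is replaced by a single representative
provided every value in the run stays within +/- p of it.  A minimal
sketch of one plausible in-place implementation follows (using the
first value of each run as its representative; the actual tool may
choose representatives differently).

    #include <stdlib.h>

    /* Quantise one read's qualities in place with parameter p.  Each
     * output run becomes a constant value, which is what makes the
     * downstream entropy coders so much more effective. */
    void pblock(unsigned char *qual, int len, int p) {
        int i = 0;
        while (i < len) {
            int rep = qual[i], j = i;
            while (j < len && abs(qual[j] - rep) <= p)
                j++;                     /* extend run while within +/- p */
            while (i < j)
                qual[i++] = (unsigned char)rep;  /* emit the representative */
        }
    }

With p=8 (the crumble -9p8 setting used above) long stretches of
similar qualities collapse into constant runs, which both rANS1 and
fqz can then code very cheaply.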
Data set 12 (*Robot*)
~~~~~~~~~~~~~~~~~~~~~

Using the "gatk_call.sh" script, as per N16525, followed by hap.py for
calculation of precision/recall at the various tranche levels.

Chr11
-----

Lossless
 90.0  SNP PASS  0.791391  0.999702  0.883433
 99.0  SNP PASS  0.945127  0.999431  0.971521
 99.9  SNP PASS  0.982748  0.998555  0.990588
100.0  SNP PASS  0.995121  0.996951  0.996035

Crumble -1
 90.0  SNP PASS  0.811031  0.999491  0.895452
 99.0  SNP PASS  0.941339  0.999194  0.969404
 99.9  SNP PASS  0.985180  0.998588  0.991839
100.0  SNP PASS  0.995121  0.996862  0.995991

Crumble -9
 90.0  SNP PASS  0.806344  0.999616  0.892638
 99.0  SNP PASS  0.944942  0.999252  0.971338
 99.9  SNP PASS  0.980456  0.998874  0.989579
100.0  SNP PASS  0.995261  0.996708  0.995984

Crumble -9 -p8
 90.0  SNP PASS  0.799004  0.999650  0.888136
 99.0  SNP PASS  0.939423  0.999467  0.968515
 99.9  SNP PASS  0.981628  0.998748  0.990114
100.0  SNP PASS  0.995372  0.996606  0.995989

Sizes:
Lossless      CRAM/rANS/10k  2282732421
Crumble -1    CRAM/rANS/10k   326499298
Crumble -9    CRAM/rANS/10k   245398824
Crumble -9p8  CRAM/rANS/10k   131670326
Crumble -9p8  CRAM/fqz/100k   135769367  (NB: poorer than rANS1)

Chr20
-----

(With no filtering, using the files downloaded from Jan Voges.)
20194777 seqs, 3017194085 bases => 47.87x coverage

Lossless
 90.0  SNP PASS  0.800695  0.999747  0.889218
 99.0  SNP PASS  0.940134  0.999536  0.968925
 99.9  SNP PASS  0.987226  0.998613  0.992887
100.0  SNP PASS  0.996417  0.997505  0.996961

Crumble -1
 90.0  SNP PASS  0.804979  0.999768  0.891862
 99.0  SNP PASS  0.952861  0.999395  0.975573
 99.9  SNP PASS  0.983565  0.998371  0.990913
100.0  SNP PASS  0.996526  0.997365  0.996945

Crumble -9
 90.0  SNP PASS  0.817924  0.999638  0.899697
 99.0  SNP PASS  0.951194  0.999509  0.974753
 99.9  SNP PASS  0.989812  0.998617  0.994195
100.0  SNP PASS  0.996526  0.997272  0.996899

Crumble -9 -p8
 90.0  SNP PASS  0.810696  0.999616  0.895298
 99.0  SNP PASS  0.944200  0.999489  0.971058
 99.9  SNP PASS  0.984157  0.998877  0.991462
100.0  SNP PASS  0.996729  0.996884  0.996806

100% loss (all Q30)  (QS size 187820 => just block overheads)
 90.0  SNP PASS  0.815634  0.999618  0.898302
 99.0  SNP PASS  0.947642  0.999113  0.972697
 99.9  SNP PASS  0.985170  0.998216  0.991650
100.0  SNP PASS  0.995887  0.994261  0.995073

QS sizes and decode times:

Method                       Size        Decode time(s)
Uncompressed                 3017194085
Lossless      CRAM/rANS/10k  1045660210   11.9
Crumble -1    CRAM/rANS/10k   146073925   11.2
Crumble -9    CRAM/rANS/10k   103828736   11.1
Crumble -9p8  CRAM/rANS/10k    61519457   11.1
Lossless      CRAM/fqz/100k   943982528  156.5
Crumble -1    CRAM/fqz/100k   134376049   61.3
Crumble -9    CRAM/fqz/100k    94828059   78.7
Crumble -9p8  CRAM/fqz/100k    55894285   66.2

This data is so deep that the accuracy of SNP calling is very high,
potentially making quality values less useful.  I tested low coverage
by artificially reducing the data using samtools view -s, shrinking it
to ~10x coverage (9.987x) to test calling rates at reduced coverage.
Raw size ~630,000,000.
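The subsampling command was of roughly this shape (the seed and exact
fraction here are illustrative only; 9.987/47.87 implies a fraction of
about 0.209):

    samtools view -b -s 42.209 in.bam -o sub10x.bam

where the integer part of the -s argument is the random seed and the
fractional part is the proportion of reads to keep.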
Depth 10, lossless, rANS1, size 218255702
 90.0  SNP PASS  0.668874  0.982608  0.795941
 99.0  SNP PASS  0.788123  0.982274  0.874553
 99.9  SNP PASS  0.805057  0.981092  0.884400
100.0  SNP PASS  0.812223  0.980849  0.888607

Depth 10, crumble -1, rANS1, size 80409731
 90.0  SNP PASS  0.681071  0.982693  0.804542
 99.0  SNP PASS  0.792111  0.982437  0.877067
 99.9  SNP PASS  0.807611  0.981225  0.885993
100.0  SNP PASS  0.815400  0.980922  0.890535

Depth 10, crumble -9, rANS1, size 38547722
 90.0  SNP PASS  0.707616  0.983246  0.822966
 99.0  SNP PASS  0.799511  0.982973  0.881801
 99.9  SNP PASS  0.815696  0.982863  0.891511
100.0  SNP PASS  0.824934  0.982103  0.896684

Depth 10, crumble -9p8, rANS1, size 17844840
 90.0  SNP PASS  0.709875  0.984978  0.825100
 99.0  SNP PASS  0.830402  0.983850  0.900637
 99.9  SNP PASS  0.843051  0.982677  0.907525
100.0  SNP PASS  0.851697  0.982144  0.912281

Depth 10, all bases Q30

Although recall and precision rates are down on the 40x sample (.997
vs .891 F-score), it is still demonstrable that the more qualities we
discard the better the GATK calling gets.

Figures for the entirety of data set 12 (full depth):

Crumble (BAM to BAM):

@ seq3c[scratch01/mpeg]; time ~/work/crumble/crumble -9 -p8 NA12878_V2.5_Robot_2.aln_bowtie2.sorted.dupmark.rg.realn.recal.bam NA12878_V2.5_Robot_2.aln_bowtie2.sorted.dupmark.rg.realn.recal.crumble-9p8.bam
real    663m37.836s
user    649m48.797s
sys       5m47.962s

CRAM default (BAM to CRAM, rANS1 codec, 10k seqs per block):

@ seq3c[scratch01/mpeg]; time scramble -r hum_ref/human_g1k_v37.fasta -t16 NA12878_V2.5_Robot_2.aln_bowtie2.sorted.dupmark.rg.realn.recal.crumble-9p8.bam NA12878_V2.5_Robot_2.aln_bowtie2.sorted.dupmark.rg.realn.recal.crumble-9p8.cram
real     27m26.279s
user    341m37.001s
sys      13m17.554s

Block content_id 12, total size 3517043197  R  QS

CRAM with the fqzcomp codec and 100k seqs per block:

@ seq3c[scratch01/mpeg]; time CRAM_CODEC_DIR=~/scratch/data/mpeg/dataset/fqz/ ~/scratch/data/mpeg/dataset/scramble.modules -r hum_ref/human_g1k_v37.fasta -t16 -s100000 NA12878_V2.5_Robot_2.aln_bowtie2.sorted.dupmark.rg.realn.recal.crumble-9p8.bam NA12878_V2.5_Robot_2.aln_bowtie2.sorted.dupmark.rg.realn.recal.crumble-9p8.fqz.cram
Adding codec 18 libcram_codec_fqzcomp_qual.so
real     30m52.060s
user    393m6.574s
sys      14m13.621s

Block content_id 12, total size 3217310516  [q] QS

OLD versions of this analysis follow, labelled with >F04 and >F44 to
indicate which FLAG filter options were used when discarding
alignments.

>F04 Chr20 sizes
>F04 -----------
>F04
>F04 The lossy compressed size is a combination of the amount/placement
>F04 of data being lost and the compression algorithm.  These two
>F04 components are mostly orthogonal, so we tested the lossless and
>F04 lossy sizes with both the default CRAM v3 codecs (rANS order 1 at
>F04 10,000 seqs per slice) and with the fqzcomp plugin module at
>F04 100,000 seqs per slice.
>F04
>F04 Times listed here are CRAM only for encoding the QS data series.
>F04 They do not include the time to lossily reduce the qualities; see
>F04 below for Crumble timings.
>F04
>F04 Method                      Size (bytes)  x-fold
>F04 0% loss (raw)               2622005008
>F04 0% loss (CRAM), rANS1        903633030     2.90
>F04 Crumble -1, rANS1            103102947    25.43
>F04 Crumble -9, rANS1             60613822    43.26
>F04 Crumble -9 -p8, rANS1         42175805    62.17
>F04 Lossless CRAM, fqz,100k      815488380     3.22
>F04 Crumble -1, fqz,100k          95389680    27.49
>F04 Crumble -9, fqz,100k          55115911    47.57
>F04 Crumble -9 -p8, fqz,100k      38299417    68.46
>F04
>F04
>F04 Chr20 time and memory
>F04 ---------------------
>F04
>F04 Measured on chr20 only.
>F04 The memory needed by the lossy compression stage is tiny.  About
>F04 two-thirds of the CPU time is BAM decoding/encoding rather than the
>F04 quality loss algorithm itself, which could be multi-threaded.
>F04
>F04 Crumble -9 -p8 to BAM chr20:    11m18s, 4.2Mb memory
>F04 Samtools BAM to BAM (no loss):   7m20s, 1.5Mb memory
>F04
>F04 @ seq3c[dataset/10]; /usr/bin/time ~/work/crumble/crumble -Obam -9 -p8 /local/scratch01/mpeg/_NA12878.chr20.bqsr.bam /tmp/_tmp.bam
>F04 664.54user 8.11system 11:17.67elapsed 99%CPU (0avgtext+0avgdata 16816maxresident)k
>F04 482320inputs+4835816outputs (1major+2280378minor)pagefaults 0swaps
>F04
>F04 @ seq3c[dataset/10]; /usr/bin/time samtools view -b /tmp/_tmp.bam -o /tmp/_tmp2.bam
>F04 424.71user 12.14system 7:20.45elapsed 99%CPU (0avgtext+0avgdata 5920maxresident)k
>F04 1607752inputs+4835760outputs (0major+5484489minor)pagefaults 0swaps
>F04
>F04
>F04 Chr20 recall / precision
>F04 ------------------------
>F04
>F04 Figures measured with hap.py for each of the 90.0, 99.0, 99.9 and
>F04 100.0 calibration tranches (as per the doc):
>F04
>F04 for i in 90.0 99.0 99.9 100.0; do HGREF=/nfs/srpipe_references/references/Human/1000Genomes/all/fasta/human_g1k_v37.fasta PYTHONPATH=/software/CGP/external-apps/CNVkit-0.7.10/lib/python2.7/ /software/python-2.7.3/bin/python hap.py/bin/hap.py giab-filt-norm.vcf _NA12878.chr20.bqsr.crumble-1.gatk.snps.recal$i.vcf -r hum_ref/human_g1k_v37.fasta -f giab.bed -o __NA12878.chr20.bqsr.crumble-1.gatk.happy_root$i -T chr20.bed --threads 16 --roc VQLSOD;done
>F04
>F04 0% quality loss
>F04 Threshold       Recall    Precis.   F-score
>F04  90.0  SNP PASS  0.823781  0.999641  0.903231
>F04  99.0  SNP PASS  0.947362  0.999573  0.972767
>F04  99.9  SNP PASS  0.985964  0.999211  0.992543
>F04 100.0  SNP PASS  0.996448  0.997458  0.996953
>F04
>F04 100% quality loss (all quality set to Q20)
>F04  90.0  SNP PASS  0.803265  0.999632  0.890755
>F04  99.0  SNP PASS  0.945243  0.999210  0.971478
>F04  99.9  SNP PASS  0.979546  0.997810  0.988594
>F04 100.0  SNP PASS  0.994517  0.996317  0.995416
>F04
>F04 100% quality loss (all quality set to Q30)
>F04  90.0  SNP PASS  0.808811  0.999711  0.894186
>F04  99.0  SNP PASS  0.948483  0.999294  0.973226
>F04  99.9  SNP PASS  0.985793  0.997793  0.991757
>F04 100.0  SNP PASS  0.995934  0.994277  0.995105
>F04
>F04 100% quality loss (all quality set to Q40 / 'I')
>F04  90.0  SNP PASS  0.824669  0.999566  0.903733
>F04  99.0  SNP PASS  0.948826  0.997674  0.972637
>F04  99.9  SNP PASS  0.987273  0.995961  0.991598
>F04 100.0  SNP PASS  0.996028  0.983480  0.989714
>F04
>F04 Crumble -1
>F04 Threshold       Recall    Precis.   F-score
>F04  90.0  SNP PASS  0.814201  0.999790  0.897502
>F04  99.0  SNP PASS  0.946194  0.999441  0.972089
>F04  99.9  SNP PASS  0.985590  0.998422  0.991965
>F04 100.0  SNP PASS  0.996588  0.997334  0.996961
>F04
>F04 Crumble -9
>F04 Threshold       Recall    Precis.   F-score
>F04  90.0  SNP PASS  0.806786  0.999749  0.892962
>F04  99.0  SNP PASS  0.945197  0.999654  0.971663
>F04  99.9  SNP PASS  0.985107  0.999242  0.992124
>F04 100.0  SNP PASS  0.996573  0.997039  0.996806
>F04
>F04 Crumble -9 with P-block 8 (rANS1 sz 42175805)
>F04 Threshold       Recall    Precis.   F-score
>F04  90.0  SNP PASS  0.823844  0.999622  0.903261
>F04  99.0  SNP PASS  0.947222  0.999490  0.972654
>F04  99.9  SNP PASS  0.987771  0.999070  0.993388
>F04 100.0  SNP PASS  0.996807  0.996807  0.996807
>F04
>F04
>F04 The conclusion here appears to be that there is very little
>F04 difference in SNP accuracy between original, crumble -1, crumble -9
>F04 and crumble with P-block.
>F04 The 99.0 filter level shows a decrease in recall and corresponding
>F04 F-score, while other levels appear to show increases throughout.
>F04 It is expected that tweaking the fixed low and high quality values
>F04 used in crumble could have an effect here, but this hasn't been
>F04 properly explored.
>F04
>F04 Note however that a lot of the parameters that change between
>F04 crumble -1 and crumble -9 are related to preserving indel calling,
>F04 which is not part of this test.  Specifically the quality values of
>F04 neighbouring bases in a short tandem repeat are of great importance
>F04 for indel calling.
>F04
>F04 We performed a rough test of indel calling on the entire GATK
>F04 output (-stand_emit_conf 10 -stand_call_conf 30) without the
>F04 recalibration steps.
>F04
>F04 Lossless        Recall    Precision
>F04 INDEL ALL       0.960191  0.912072
>F04 INDEL PASS      0.955201  0.940314
>F04 SNP ALL         0.997289  0.996405
>F04 SNP PASS        0.996448  0.997412
>F04
>F04 Crumble -1      Recall    Precision
>F04 INDEL ALL       0.962252  0.907703
>F04 INDEL PASS      0.956937  0.938432
>F04 SNP ALL         0.997398  0.996375
>F04 SNP PASS        0.996588  0.997288
>F04
>F04 Crumble -9      Recall    Precision
>F04 INDEL ALL       0.965831  0.906715
>F04 INDEL PASS      0.961601  0.935061
>F04 SNP ALL         0.997398  0.996127
>F04 SNP PASS        0.996573  0.997008
>F04
>F04 Crumble -9 -p8  Recall    Precision
>F04 INDEL ALL       0.965181  0.906659
>F04 INDEL PASS      0.960408  0.934986
>F04 SNP ALL         0.997523  0.995631
>F04 SNP PASS        0.996807  0.996776
>F04
>F04 While we do not know the veracity of the indel calls, we can
>F04 measure the deviation caused by lossy encoding.  It is evident that
>F04 the precision deviation with crumble -1 is much less than with
>F04 crumble -9 or -9 -p8, although the recall figures seem to improve
>F04 with more aggressive compression.
>F04
>F04
>F04 Default crumble quality values for lost values are: -c 25 -l 10 -u 40
>F04     mismatch cons:  1..24 => 10
>F04     mismatch cons: 25..99 => 40
>F04     match cons:           => 40
>F04
>F04 Try again with Q30 as max instead?  -c 20 -l 10 -u 30
>F04     mismatch cons:  1..19 => 10
>F04     mismatch cons: 20..99 => 30
>F04     match cons:           => 30
>F04
>F04 Crumble -9 -p8 -c 20 -l 10 -u 30 (rANS1 QS size 39922081; was 42175805)
>F04 Threshold       Recall    Precis.   F-score
>F04  90.0  SNP PASS  0.799651  0.999649  0.888535  (-- was 0.903261, big drop in recall)
>F04  99.0  SNP PASS  0.943701  0.999423  0.970763  (-- was 0.972654)
>F04  99.9  SNP PASS  0.989781  0.998350  0.994047  (+  was 0.993388)
>F04 100.0  SNP PASS  0.996635  0.997101  0.996868  (~  was 0.996807)
>F04
>F04 crumble -9 -p8 -c15 -u35
>F04 SNP PASS  0.827551  0.999567  0.905462  64193 53123 11070
>F04 SNP PASS  0.947876  0.999442  0.972976  64193 60847  3346
>F04 SNP PASS  0.984889  0.999241  0.992013  64193 63223   970
>F04 SNP PASS  0.996729  0.996744  0.996736  64193 63983   210
>F04
>F04 crumble -9 -p8 -l10 -c20 -u40 (rANS1 sz 41565704)
>F04 SNP PASS  0.828813  0.999587  0.906225  64193 53204 10989  + (vs -9 -p8)
>F04 SNP PASS  0.946178  0.999441  0.972080  64193 60738  3455  -
>F04 SNP PASS  0.986400  0.999069  0.992694  64193 63320   873  -
>F04 SNP PASS  0.996791  0.996822  0.996806  64193 63987   206  =
>F04
>F04 crumble -9 -p8 -l10 -c20 -u35 (rANS1 sz 40851277)
>F04 SNP PASS  0.817581  0.999657  0.899498  64193 52483 11710
>F04 SNP PASS  0.950072  0.999459  0.974140  64193 60988  3205
>F04 SNP PASS  0.983596  0.999145  0.991310  64193 63140  1053
>F04 SNP PASS  0.996729  0.996977  0.996853  64193 63983   210
>F04
>F04 crumble -9 -p8 -l10 -c20 -u35 (U40 for agree)
>F04 SNP PASS  0.815572  0.999599  0.898257  64193 52354 11839  - (vs -9 -p8 -l10 -c20 -u35)
>F04 SNP PASS  0.948857  0.999491  0.973516  64193 60910  3283  -
>F04 SNP PASS  0.990404  0.998822  0.994595  64193 63577   616  + (++ recall, - precis)
>F04 SNP PASS  0.996791  0.996900  0.996845  64193 63987   206  ~
>F04
>F04 crumble -9 -p8 -l5 -c20 -u35 (rANS1 sz 40684464)
>F04 SNP PASS  0.790725  0.999842  0.883072  64193 50759 13434
>F04 SNP PASS  0.964482  0.999000  0.981438  64193 61913  2280
>F04 SNP PASS  0.985450  0.997808  0.991590  64193 63259   934
>F04 SNP PASS  0.996869  0.996962  0.996915  64193 63992   201

>F44 NB: This data set was processed using samtools view -F 0xF40, so
>F44 is single-ended.  While not correct, it has some use for evaluating
>F44 performance on shallower data sets (~20x in this case).
>F44
>F44 Data set 12 (*Robot*)
>F44 ~~~~~~~~~~~~~~~~~~~~~
>F44
>F44 Using the "gatk_call.sh" script, as per N16525, followed by hap.py
>F44 for calculation of precision/recall at the various tranche levels.
>F44
>F44 Chr20 sizes
>F44 -----------
>F44
>F44 The lossy compressed size is a combination of the amount/placement
>F44 of data being lost and the compression algorithm.  These two
>F44 components are mostly orthogonal, so we tested the lossless and
>F44 lossy sizes with both the default CRAM v3 codecs (rANS order 1 at
>F44 10,000 seqs per slice) and with the fqzcomp plugin module at
>F44 100,000 seqs per slice.
>F44
>F44 Times listed here are CRAM only for encoding the QS data series.
>F44 They do not include the time to lossily reduce the qualities; see
>F44 below for Crumble timings.
>F44
>F44 Method                    Size (bytes)  x-fold  Enc(s)  Dec(s)
>F44 Lossless QS (raw)         1292621755     1.00     N/A     N/A
>F44 Lossless CRAM, rANS1       505376417     2.56    21.9     5.3
>F44 Lossless CRAM, fqz,100k    451304814     2.86   103.7    66.7
>F44 Crumble -1 QS, rANS1       138208917     9.35    10.1     4.9
>F44 Crumble -1 QS, fqz,100k    126346751    10.23    45.7    35.2
>F44 Crumble -9 QS, rANS1        76475355    16.90     9.1     4.9
>F44 Crumble -9 QS, fqz,100k     70540126    18.32    34.6    30.3
>F44
>F44 As mentioned above, Crumble may be considered a "vertical"
>F44 compressor in that it acts on entire aligned columns of qualities.
>F44 P-block, R-block and similar may be considered "horizontal"
>F44 compressors in that they squash qualities along the length of
>F44 individual reads.
>F44 Combining the two algorithms gives further improvements.  Timings
>F44 for these haven't been kept as it involved running various external
>F44 tools in an inefficient manner (e.g. converting BAM to SAM,
>F44 CompressSAM, DecompressSAM, and then converting the decoded SAM
>F44 back to CRAM).  It is expected that implementing P-block directly
>F44 within Crumble would have a minimal encode overhead and a
>F44 non-existent decode overhead for CRAM.
>F44
>F44 Method                            Size (bytes)  x-fold
>F44 Crumble -9 + PBlock 1             64437688      20.06
>F44 Crumble -9 + PBlock 2             59257354      21.81
>F44 Crumble -9 + PBlock 4             51704458      25.00
>F44 Crumble -9 + PBlock 8             42078669      30.72
>F44 Crumble -9 + PBlock 1 + fqz,100k  60387992      21.41
>F44 Crumble -9 + PBlock 2 + fqz,100k  55618114      23.24
>F44 Crumble -9 + PBlock 4 + fqz,100k  48539269      26.63
>F44 Crumble -9 + PBlock 8 + fqz,100k  39003300      33.14
>F44
>F44 PBlock's own binary encoding for level 4 (CompressQual -q1 -m1 -l4)
>F44 consumes 107,041,156 bytes, so is quite large.  However the
>F44 replacement of qualities by runs of identical values makes CRAM's
>F44 own entropy encoders work better.
>F44
>F44 Chr20 time and memory
>F44 ---------------------
>F44
>F44 Measured on chr20 only.  Output format BAM vs vanilla CRAM v3 makes
>F44 minimal difference; memory is larger with CRAM due to its larger
>F44 block size.  The actual memory needed by the lossy compression
>F44 stage is tiny.
>F44
>F44 Crumble -9 to BAM chr20:   6m3s,  ~3 Mb memory
>F44 Crumble -9 to CRAM chr20:  7min,  242 Mb memory
>F44
>F44
>F44 Chr20 recall / precision
>F44 ------------------------
>F44
>F44 Figures measured with hap.py for each of the 90.0, 99.0, 99.9 and
>F44 100.0 calibration tranches (as per the doc):
>F44
>F44 for i in 90.0 99.0 99.9 100.0; do HGREF=/nfs/srpipe_references/references/Human/1000Genomes/all/fasta/human_g1k_v37.fasta PYTHONPATH=/software/CGP/external-apps/CNVkit-0.7.10/lib/python2.7/ /software/python-2.7.3/bin/python hap.py/bin/hap.py giab-filt-norm.vcf _NA12878.chr20.bqsr.crumble-1.gatk.snps.recal$i.vcf -r hum_ref/human_g1k_v37.fasta -f giab.bed -o __NA12878.chr20.bqsr.crumble-1.gatk.happy_root$i -T chr20.bed --threads 16 --roc VQLSOD;done
>F44
>F44 Orig
>F44 Threshold       Recall    Precis.   F-score
>F44  90.0  SNP PASS  0.788466  0.999092  0.881370
>F44  99.0  SNP PASS  0.941567  0.997031  0.968506
>F44  99.9  SNP PASS  0.958531  0.995937  0.976876
>F44 100.0  SNP PASS  0.967255  0.995176  0.981017
>F44
>F44 Crumble -1
>F44  90.0  SNP PASS  0.828034  0.998141  0.905165
>F44  99.0  SNP PASS  0.930444  0.997795  0.962943
>F44  99.9  SNP PASS  0.962348  0.997207  0.979467
>F44 100.0  SNP PASS  0.970402  0.995732  0.982904
>F44
>F44 Crumble -9
>F44  90.0  SNP PASS  0.816320  0.998685  0.898341
>F44  99.0  SNP PASS  0.934463  0.998236  0.965297
>F44  99.9  SNP PASS  0.969592  0.997228  0.983216
>F44 100.0  SNP PASS  0.974670  0.996084  0.985261
>F44
>F44 Crumble -9 with P-block 4
>F44  90.0  SNP PASS  0.811412  0.998734  0.895381
>F44  99.0  SNP PASS  0.929463  0.998243  0.962626
>F44  99.9  SNP PASS  0.973097  0.996681  0.984748
>F44 100.0  SNP PASS  0.975075  0.996275  0.985561
>F44
>F44 Crumble -9 with P-block 8
>F44  90.0  SNP PASS  0.828486  0.998629  0.905636
>F44  99.0  SNP PASS  0.929276  0.998210  0.962510
>F44  99.9  SNP PASS  0.968299  0.997545  0.982704
>F44 100.0  SNP PASS  0.978051  0.995844  0.986867
>F44
>F44 100% lossy (all quality set to Q40 / 'I').
>F44
>F44
>F44 The conclusion here appears to be that there is very little
>F44 difference in SNP accuracy between original, crumble -1, crumble -9
>F44 and crumble with P-block.
>F44 The 99.0 filter level shows a decrease in recall and corresponding
>F44 F-score, while other levels appear to show increases throughout.
>F44 It is expected that tweaking the fixed low and high quality values
>F44 used in crumble could have an effect here, but this hasn't been
>F44 properly explored.
>F44
>F44 Note however that a lot of the parameters that change between
>F44 crumble -1 and crumble -9 are related to preserving indel calling,
>F44 which is not part of this test.  Specifically the quality values of
>F44 neighbouring bases in a short tandem repeat are of great importance
>F44 for indel calling.
>F44
>F44 We performed a rough test of indel calling on the entire GATK
>F44 output (-stand_emit_conf 10 -stand_call_conf 30) without the
>F44 recalibration steps.
>F44
>F44 Lossless               Recall    Precis.   F-score
>F44 INDEL ALL              0.856058  0.936213  0.894343  (>=10)
>F44 INDEL PASS             0.822432  0.954780  0.883678  (>=30)
>F44
>F44 Crumble -1
>F44 INDEL ALL              0.864844  0.935299  0.898693  (>=10)
>F44 INDEL PASS             0.836642  0.954226  0.891574  (>=30)
>F44
>F44 Crumble -9
>F44 INDEL ALL              0.877969  0.931810  0.904089  (>=10)
>F44 INDEL PASS             0.852045  0.953522  0.899932  (>=30)
>F44
>F44 Crumble -9 + PBlock 4
>F44 INDEL ALL              0.880356  0.930703  0.904830  (>=10)
>F44 INDEL PASS             0.854214  0.953980  0.901345  (>=30)
>F44
>F44 While we do not know the veracity of the indel calls, we can
>F44 measure the deviation caused by lossy encoding.  It is evident that
>F44 the deviation with crumble -1 is much less than with crumble -9,
>F44 but both tend to favour lower precision with higher recall (to an
>F44 extent that still increases the F-score).  The best F-score comes
>F44 from more aggressive compression, but without knowing the absolute
>F44 validity of these calls it is hard to judge this.
>F44
>F44
>F44 Chr11 sizes
>F44 -----------
>F44
>F44 Method                    Size (bytes)  x-fold  Enc(s)  Dec(s)
>F44 Lossless QS (raw)         2845105400     1.00     N/A     N/A
>F44 Lossless CRAM, rANS1      1065305769     2.67    42.7    21.7
>F44 Lossless CRAM, fqz,100k    985856117     2.89   124.4   145.0
>F44 Crumble -1 QS, rANS1       295071612     9.64    17.5    10.5
>F44 Crumble -1 QS, fqz,100k    272042681    10.46    67.3    78.0
>F44 Crumble -9 QS, rANS1       166800786    17.06    16.6    10.2
>F44 Crumble -9 QS, fqz,100k    154424543    18.42    56.4    64.9
>F44
>F44
>F44 Chr11 time and memory
>F44 ---------------------
>F44
>F44 Lossy portion only (so not including the advanced CRAM compression
>F44 models, e.g. fqz):
>F44
>F44 Crumble -1 BAM to BAM chr11:  15m54s  51Mb
>F44 Crumble -9 BAM to BAM chr11:  13m8s   51Mb
>F44
>F44
>F44 Chr11 recall / precision
>F44 ------------------------
>F44
>F44 Lossless
>F44 Threshold       Recall    Precis.   F-score
>F44  90.0  SNP PASS  0.825431  0.998146  0.9036
>F44  99.0  SNP PASS  0.924882  0.997766  0.9599
>F44  99.9  SNP PASS  0.958421  0.997278  0.9775
>F44 100.0  SNP PASS  0.970596  0.995405  0.9828
>F44
>F44 Crumble -1
>F44 Threshold       Recall    Precis.   F-score
>F44  90.0  SNP PASS  0.795312  0.998547  0.8854
>F44  99.0  SNP PASS  0.925347  0.998180  0.9604
>F44  99.9  SNP PASS  0.964892  0.997220  0.9808
>F44 100.0  SNP PASS  0.972202  0.995781  0.9839
>F44
>F44 Crumble -9
>F44 Threshold       Recall    Precis.   F-score
>F44  90.0  SNP PASS  0.820729  0.998539  0.9009
>F44  99.0  SNP PASS  0.932650  0.998068  0.9643
>F44  99.9  SNP PASS  0.961826  0.997569  0.9794
>F44 100.0  SNP PASS  0.975039  0.995373  0.9851
>F44
>F44 As with Chr20, F-scores are only minimally impacted by lossy
>F44 compression, and typically they even improve slightly.