CE2, lossless quality
=====================

We present a variety of standard external compression tools, for
comparative purposes, along with custom quality value encoders in CRAM.
These include the default order-1 rANS already implemented in CRAM v3
and an adaptation of the quality model from the fqzcomp (-q2) tool.

The rANS1 bit stream is as defined in the CRAM specification
(http://samtools.github.io/hts-specs/).

The fqzcomp CRAM plugin, using Scramble's "cram_modules" branch
(https://sourceforge.net/p/staden/code/HEAD/tree/io_lib/branches/cram_modules/),
is derived from the level 2 quality model described in this paper:
http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0059190

Specifically the "context" used to encode quality Q(n) is built from:

    : Q(n-1)              // previous quality value
    : MAX(Q(n-2),Q(n-3))  // max of the two values before that
    : Q(n-2) == Q(n-3)    // a single boolean; do the prev. two values match
    : MIN(7*8,delta)/8    // delta computed since start of read; 3 bits' worth

The running delta starts at 5 for each read and is updated by:

    delta = delta + ( Q(n-1)>Q(n) ? Q(n-1) - Q(n) : 0 )

The fqzcomp tool was a standalone FASTQ compressor, but here we use it
as a plugin to CRAM.  In this context it takes a block of quality
values as input and also has access to the unencoded BAM records.  The
decoder, however, only has the quality stream, so unlike fqzcomp the
CRAM plugin also has to encode additional data into the quality block
to permit the qualities to be decoded in isolation.  To do this it
extends the fqzcomp format with the addition of the read length (so it
knows when to reset delta and the Q(n-1)..Q(n-3) context) and the
orientation of the quality string, taken from the BAM FLAG field.
Additionally it has gained extra code for spotting 100% duplications of
the previous quality string; see below for the reasons for this.
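As a concrete illustration of this model, the context computation
might look as follows in C.  This is a minimal sketch of the scheme as
described above, not the plugin's actual source; the assumption of
6-bit quality values and the exact bit packing are illustrative only.

    #include <stdint.h>

    typedef struct {
        uint8_t q1, q2, q3;  /* Q(n-1), Q(n-2), Q(n-3) */
        int     delta;       /* running delta, reset per read */
    } qctx;

    /* Called at the start of each read; delta starts at 5. */
    static void qctx_reset(qctx *c) {
        c->q1 = c->q2 = c->q3 = 0;
        c->delta = 5;
    }

    /* Build the context used to select the model for coding Q(n). */
    static uint32_t qctx_get(const qctx *c) {
        int d = c->delta < 7*8 ? c->delta : 7*8;             /* MIN(7*8,delta) */
        uint32_t ctx = c->q1;                                /* previous quality */
        ctx = (ctx << 6) | (c->q2 > c->q3 ? c->q2 : c->q3);  /* MAX(Q(n-2),Q(n-3)) */
        ctx = (ctx << 1) | (c->q2 == c->q3);                 /* do prev. two match? */
        ctx = (ctx << 3) | (d / 8);                          /* 3 bits of delta */
        return ctx;
    }

    /* After coding q = Q(n), update the state ready for Q(n+1). */
    static void qctx_update(qctx *c, uint8_t q) {
        if (c->q1 > q)
            c->delta += c->q1 - q;  /* delta += Q(n-1)>Q(n) ? Q(n-1)-Q(n) : 0 */
        c->q3 = c->q2;
        c->q2 = c->q1;
        c->q1 = q;
    }

The encoder calls qctx_get() to choose the frequency model for the
range coder, codes the symbol, then calls qctx_update().  The decoder
mirrors this exactly, which is why the context may only be built from
values it has already decoded.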
Data set 09
-----------

Tool/seq-per-blk   Size        Enc(s)  Dec(s)
bsc/1000           833618988    267.0   207.4
bsc/10000          813329602    252.4   190.3
bsc/100000         798610572    312.2   227.5
rANS1/1000         886674571     16.7     8.3
rANS1/10000        867558424     17.3     8.6
rANS1/100000       866132619     27.4     8.6
fqz/1000           886674571     18.9     8.3   (redo; it detected rANS1 was better)
fqz/10000          853315557    119.8   143.2
fqz/100000         808792284    118.5   118.0

-- external tools, no random access --
bsc                793682714    158.9    56.1   (333.5 cpu, threaded)
gzip               975609900
bzip2              901698074
rngcod/comp1       867240333     48.0           (adaptive order-1 entropy coder)
rngcod/comp2       806331129     52.2           (adaptive order-2 entropy coder)
fqzcomp -q1        788815867    <75             (slowest of qual+name+id)
fqzcomp -q2        778564559   <103
fqzcomp -q3        778252683   <111

Bsc is the Block Sorting Compressor
(https://github.com/IlyaGrebnov/libbsc).  Comp1 is from Michael
Schindler's "rngcod" range coder demonstration, using an adaptive
order-1 model; comp2 is my own adaptive order-2 equivalent.
Comp1/comp2 are there as data points for very simple encoding
techniques and to show how a static order-2 rANS model might perform
if we shared the frequency table between multiple slices.  (In CRAM
terms this means putting the table in the container compression header
and having multiple slices per container; something we discussed early
on but did not implement.)

Timings for the external fqzcomp tool come from wall-clock time.  It
encodes identifiers, sequences and quality values in parallel, so the
time for quality alone is unknown (but it is likely the slowest of the
three).  Decode times weren't measured, but they are usually a little
slower than encode times.

Conclusion: bsc is surprisingly good, better than the fqzcomp model,
but it is slow.  Bsc as an external tool had different parameters to
my usage of it as a library, including multi-threading, which gave
very different speeds.  An order-2 [rt]ANS codec would help here too,
perhaps getting down to around the 820MB mark.

Data set 16
-----------

Tool/seq-per-blk   Size        Enc(s)  Dec(s)
bsc/1000           710737162    236.1    95.2
bsc/10000          689254788    243.2   164.3
bsc/100000         676054814    321.7   227.9
rANS1/1000         733300118     16.5     8.5
rANS1/10000        714245853     17.1     8.3
rANS1/100000       712828981     27.0     9.6
fqz/1000           753777116    192.7   200.0   (new fqz with dedup)
fqz/10000          672545339    105.6   125.0
fqz/100000         646493190     91.3   105.1

-- external tools, in BAM orientation --
rANS1                   715024762    18.2    9.7   (R/A at ~1MB blocks)
(deskpro)r32x16_avx2o1  716717407     8.8    2.3   (R/A at ~1MB blocks)
(deskpro)r32x16_avx2o0  1013103538    4.4    1.5   (R/A at ~1MB blocks)
bsc-3.1.0 -t -m0        673778956   167s (463 cpu) (no random access?)

-- external tools, in FASTQ orientation --
rANS1                   705721512    18.7    9.2   (R/A at ~1MB blocks)
(deskpro)r32x16_avx2o1  707253117     8.8    2.3   (R/A at ~1MB blocks)
fqzcomp -q1             660906421   ~93s           (no random access)
fqzcomp -q2             635844496  ~123s           (no random access)
fqzcomp -q3             618993894  ~143s           (no random access)
bsc-3.1.0 -t -m0        659671028   182s (468 cpu) (no random access?)

All tests bar the r32x16_avx2 ones were run on the same server
(Intel(R) Xeon(R) CPU E5-2660 0 @ 2.20GHz).  The avx2 method was run
on a desktop machine (Intel(R) Core(TM) i5-4570 CPU @ 3.20GHz) as the
server lacked AVX2 support.  It is evident that mid-level compression
can be achieved at extremely high rates if we are willing to exploit
SIMD.  (The standard rANS1 file format isn't ideal for SIMD due to the
interdependency of its data.)

Conclusion: reordering qualities into their original FASTQ orientation
helps with most methods.  (CRAM uses aligned/BAM orientation, but note
the fqz codec for CRAM flips these and stores an additional copy of
the original orientation inside the fqz quality block.)

CRAM's fqz codec is based on fqzcomp -q2 as it is expected to be
operated in block mode to permit random access.  Even so it needs
large blocks to maximise efficiency and still cannot match fqzcomp
-q3; however, it is still considerably smaller and faster than libbsc.

Data set 10
-----------

Set 10 is RNAseq and has multiple mappings of the same sequence.  This
gives rise to duplicate quality strings, so it benefits from LZ or BWT
processing over raw entropy encoding methods.  Thus detection of
whole-entry duplications may be beneficial, but only in repetitive
areas.  The first 10 million lines of quality contain 204700
duplicated lines, of which 178819 are duplicates of the immediately
preceding line.  By emitting a code for "same as last line" we save
~4% storage.

Orig qual:
    bsc    770000000 into 250270936 in 72.8 seconds
    rANS1  770000000 into 270235820 in  7.4 seconds
    comp2  770000000 into 254166072 in 19.9 seconds

dedup1 (dedup previous line only):
    bsc    746588575 into 243597924 in 61.0 seconds
    rANS1  746588575 into 258990929 in  9.5 seconds
    comp2  746588575 into 244409910 in 18.3 seconds

dedup10k (dedup up to 10k lines away, with back-reference):
    bsc    744689536 into 243527960 in 61.5 seconds
    rANS1  744689536 into 258518510 in  9.8 seconds
    comp2  744689536 into 243922525 in 18.3 seconds

Dedup1 is very quick to implement, so it is a no-brainer to code up
(see the sketch below).  Order-2 rANS or tANS would help further.
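To illustrate why dedup1 is so cheap, a minimal sketch follows.  This
is not the actual CRAM fqz dedup encoding; the one-byte flag layout is
invented here purely for clarity.

    #include <stddef.h>
    #include <string.h>

    /* Encode one quality string, replacing an exact repeat of the
     * previous record's string with a single "same as last line"
     * token.  Returns the number of bytes written to out, which must
     * have room for len+1 bytes. */
    size_t dedup1_encode(const char *qual, size_t len,
                         const char *prev, size_t prev_len,
                         char *out) {
        if (len == prev_len && memcmp(qual, prev, len) == 0) {
            out[0] = 1;                 /* duplicate of previous line */
            return 1;
        }
        out[0] = 0;                     /* literal quality string follows */
        memcpy(out + 1, qual, len);
        return 1 + len;
    }

The transformed stream is then handed to the usual entropy coder
(rANS1, comp2, bsc, ...), giving the dedup1 sizes in the tables above.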
Inside CRAM:

Tool/seq-per-blk    Size         Enc(s)  Dec(s)
bsc/1000            6217036972   2741.1  1754.5
bsc/10000           6021859786   2168.2  1509.8
bsc/100000          5953446434   2453.0  1599.1
rANS1/1000          6773212352    164.4    77.9
rANS1/10000         6444091732    146.3    73.8
rANS1/100000        6421605206    159.8    77.6
fqz/1000            6773589349    169.3    82.9  (detected rANS1 was better)
fqz/10000           6163647096   1010.6  1180.6
fqz/100000          5873759283    752.4   946.2
dedup-rANS1/10000   6185333488    152.7    96.1  (vs 6444091732)

fqz CRAM module with dedup:
fqz/1000            6773624515   2625.3  2485.4  (forced, but too small to adapt)
fqz/10000           5934828621   1071.9  1228.2
fqz/100000          5641065113    768.7   969.8

(Dedup encoding is poorly implemented currently, with too many
malloc/memcpy calls, so these timings could be improved.)


=============================================================================

Lossy quality
=============

The lossy compression is achieved by the Crumble tool, available from:
https://github.com/jkbonfield/crumble

This employs a noddy consensus caller, based on Gap5's consensus
algorithm, to evaluate whether the existence of quality values is
necessary to achieve confident SNP calls.  If not, they are removed,
based on various parameters (loosely adjusted by crumble -1 through
crumble -9, or each set explicitly if required).  Quality values that
are retained are kept losslessly; all others are reduced to a binary
state (high & low).

All tests below use the GATK caller.  We have not yet evaluated the
Samtools/BCFtools calling pipeline.

Data set 11
~~~~~~~~~~~

(Columns in the tables below are tranche threshold, variant type,
filter, recall, precision and F-score.)

Chromosome 11:

Orig          total size 656888474
Crumble -1    total size 640548371
Crumble -9    total size 564459481
Crumble -9p8  total size  82126745  (rANS1 codec, 10k seq block)
Crumble -9p8  total size  96394731  (fqz codec, 100k seq block)

Lossless
 90.0  SNP PASS  0.486779  0.937853  0.640905
 99.0  SNP PASS  0.545241  0.939863  0.690122
 99.9  SNP PASS  0.559192  0.939294  0.701035
100.0  SNP PASS  0.562877  0.937361  0.703380

Crumble -1
 90.0  SNP PASS  0.490596  0.938082  0.644259
 99.0  SNP PASS  0.545794  0.939670  0.690513
 99.9  SNP PASS  0.559841  0.939848  0.701699
100.0  SNP PASS  0.563835  0.937691  0.704221

Crumble -9
 90.0  SNP PASS  0.494090  0.939940  0.647706
 99.0  SNP PASS  0.554055  0.941328  0.697544
 99.9  SNP PASS  0.567129  0.941047  0.707736
100.0  SNP PASS  0.571484  0.938837  0.710485

Crumble -9 -p8
 90.0  SNP PASS  0.521593  0.946976  0.672677
 99.0  SNP PASS  0.582612  0.946927  0.721382
 99.9  SNP PASS  0.598140  0.946122  0.732924
100.0  SNP PASS  0.602672  0.943078  0.735393

Chromosome 20:

Orig          total size 291952417
Crumble -1    total size 285105238
Crumble -9    total size 252045288
Crumble -9p8  total size  37003532  (rANS1 codec, 10k seq block)
Crumble -9p8  total size  38185172  (fqz codec, 100k seq block)
Crumble -9p8  total size  36636817  (rANS1 codec, 100k seq block)

Orig (lossless)
 90.0  SNP PASS  0.464911  0.926027  0.619036
 99.0  SNP PASS  0.516630  0.928990  0.663998
 99.9  SNP PASS  0.528064  0.928534  0.673247
100.0  SNP PASS  0.530541  0.925990  0.674583

Crumble -1
 90.0  SNP PASS  0.467715  0.926810  0.621693
 99.0  SNP PASS  0.518904  0.929253  0.665940
 99.9  SNP PASS  0.529108  0.927853  0.673916
100.0  SNP PASS  0.531522  0.926092  0.675403

Crumble -9
 90.0  SNP PASS  0.473572  0.928868  0.627315
 99.0  SNP PASS  0.528469  0.931185  0.674273
 99.9  SNP PASS  0.537052  0.930801  0.681115
100.0  SNP PASS  0.539264  0.927697  0.682054

Crumble -9p8
 90.0  SNP PASS  0.498481  0.936273  0.650584
 99.0  SNP PASS  0.556244  0.936675  0.697988
 99.9  SNP PASS  0.568738  0.935600  0.707436
100.0  SNP PASS  0.572087  0.931868  0.708944

I didn't expect standard Crumble alone to do well on compression
ratio, and indeed it doesn't until P-block is added in too.  This is
because each column is error prone, with high uncertainty in the
consensus, leading to relatively few locations where qualities can be
discarded.  The P-block algorithm, however, degrades all quality
scores and so has a large impact (a sketch is given below); the net
effect is a 7.9-fold compression ratio.

Bizarrely, all lossy modes improve GATK SNP calling, both reducing
false positives and reducing false negatives.  (I suspect a flaw
somewhere in the calling code, or in our use of it, as this shouldn't
be possible.)  Surprisingly the fqzcomp codec doesn't improve over a
basic order-1 rANS codec for this data set.  (Likely fqzcomp is
over-tuned for Illumina and takes too long to adapt.)
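For reference, P-block is a "horizontal" quantiser working along each
read: a run of quality values is replaced by a single representative
provided every value in the run stays within +/- p of it.  A minimal
sketch of one plausible in-place implementation follows (using the
first value of each run as its representative; the actual tool may
choose representatives differently).

    #include <stdlib.h>

    /* Quantise one read's qualities in place with parameter p.  Each
     * output run becomes a constant value, which is what makes the
     * downstream entropy coders so much more effective. */
    void pblock(unsigned char *qual, int len, int p) {
        int i = 0;
        while (i < len) {
            int rep = qual[i], j = i;
            while (j < len && abs(qual[j] - rep) <= p)
                j++;                     /* extend run while within +/- p */
            while (i < j)
                qual[i++] = (unsigned char)rep;  /* emit the representative */
        }
    }

With p=8 (the crumble -9p8 setting used above) long stretches of
similar qualities collapse into constant runs, which both rANS1 and
fqz can then code very cheaply.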
Data set 12 (*Robot*)
~~~~~~~~~~~~~~~~~~~~~

Using the "gatk_call.sh" script, as per N16525, followed by hap.py for
calculation of precision/recall at the various tranche levels.

Chr11
-----

Lossless
 90.0  SNP PASS  0.791391  0.999702  0.883433
 99.0  SNP PASS  0.945127  0.999431  0.971521
 99.9  SNP PASS  0.982748  0.998555  0.990588
100.0  SNP PASS  0.995121  0.996951  0.996035

Crumble -1
 90.0  SNP PASS  0.811031  0.999491  0.895452
 99.0  SNP PASS  0.941339  0.999194  0.969404
 99.9  SNP PASS  0.985180  0.998588  0.991839
100.0  SNP PASS  0.995121  0.996862  0.995991

Crumble -9
 90.0  SNP PASS  0.806344  0.999616  0.892638
 99.0  SNP PASS  0.944942  0.999252  0.971338
 99.9  SNP PASS  0.980456  0.998874  0.989579
100.0  SNP PASS  0.995261  0.996708  0.995984

Crumble -9 -p8
 90.0  SNP PASS  0.799004  0.999650  0.888136
 99.0  SNP PASS  0.939423  0.999467  0.968515
 99.9  SNP PASS  0.981628  0.998748  0.990114
100.0  SNP PASS  0.995372  0.996606  0.995989

Sizes:
Lossless      CRAM/rANS/10k  2282732421
Crumble -1    CRAM/rANS/10k   326499298
Crumble -9    CRAM/rANS/10k   245398824
Crumble -9p8  CRAM/rANS/10k   131670326
Crumble -9p8  CRAM/fqz/100k   135769367  (NB: poorer than rANS1)

Chr20
-----

(With no filtering, using the files downloaded from Jan Voges.)
20194777 seqs, 3017194085 bases => 47.87x coverage

Lossless
 90.0  SNP PASS  0.800695  0.999747  0.889218
 99.0  SNP PASS  0.940134  0.999536  0.968925
 99.9  SNP PASS  0.987226  0.998613  0.992887
100.0  SNP PASS  0.996417  0.997505  0.996961

Crumble -1
 90.0  SNP PASS  0.804979  0.999768  0.891862
 99.0  SNP PASS  0.952861  0.999395  0.975573
 99.9  SNP PASS  0.983565  0.998371  0.990913
100.0  SNP PASS  0.996526  0.997365  0.996945

Crumble -9
 90.0  SNP PASS  0.817924  0.999638  0.899697
 99.0  SNP PASS  0.951194  0.999509  0.974753
 99.9  SNP PASS  0.989812  0.998617  0.994195
100.0  SNP PASS  0.996526  0.997272  0.996899

Crumble -9 -p8
 90.0  SNP PASS  0.810696  0.999616  0.895298
 99.0  SNP PASS  0.944200  0.999489  0.971058
 99.9  SNP PASS  0.984157  0.998877  0.991462
100.0  SNP PASS  0.996729  0.996884  0.996806

100% loss (all Q30)  (QS size 187820 => just block overheads)
 90.0  SNP PASS  0.815634  0.999618  0.898302
 99.0  SNP PASS  0.947642  0.999113  0.972697
 99.9  SNP PASS  0.985170  0.998216  0.991650
100.0  SNP PASS  0.995887  0.994261  0.995073

QS sizes and decode times:

Method                       Size        Decode time(s)
Uncompressed                 3017194085
Lossless      CRAM/rANS/10k  1045660210   11.9
Crumble -1    CRAM/rANS/10k   146073925   11.2
Crumble -9    CRAM/rANS/10k   103828736   11.1
Crumble -9p8  CRAM/rANS/10k    61519457   11.1
Lossless      CRAM/fqz/100k   943982528  156.5
Crumble -1    CRAM/fqz/100k   134376049   61.3
Crumble -9    CRAM/fqz/100k    94828059   78.7
Crumble -9p8  CRAM/fqz/100k    55894285   66.2

This data is so deep that the accuracy of SNP calling is very high,
potentially making quality values less useful.  I tested low coverage
by artificially reducing the data using samtools view -s, shrinking it
to ~10x coverage (9.987x) to test calling rates at reduced coverage.
Raw size ~630,000,000.
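The subsampling command was of roughly this shape (the seed and exact
fraction here are illustrative only; 9.987/47.87 implies a fraction of
about 0.209):

    samtools view -b -s 42.209 in.bam -o sub10x.bam

where the integer part of the -s argument is the random seed and the
fractional part is the proportion of reads to keep.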
Depth 10, lossless, rANS1, size 218255702
 90.0  SNP PASS  0.668874  0.982608  0.795941
 99.0  SNP PASS  0.788123  0.982274  0.874553
 99.9  SNP PASS  0.805057  0.981092  0.884400
100.0  SNP PASS  0.812223  0.980849  0.888607

Depth 10, crumble -1, rANS1, size 80409731
 90.0  SNP PASS  0.681071  0.982693  0.804542
 99.0  SNP PASS  0.792111  0.982437  0.877067
 99.9  SNP PASS  0.807611  0.981225  0.885993
100.0  SNP PASS  0.815400  0.980922  0.890535

Depth 10, crumble -9, rANS1, size 38547722
 90.0  SNP PASS  0.707616  0.983246  0.822966
 99.0  SNP PASS  0.799511  0.982973  0.881801
 99.9  SNP PASS  0.815696  0.982863  0.891511
100.0  SNP PASS  0.824934  0.982103  0.896684

Depth 10, crumble -9p8, rANS1, size 17844840
 90.0  SNP PASS  0.709875  0.984978  0.825100
 99.0  SNP PASS  0.830402  0.983850  0.900637
 99.9  SNP PASS  0.843051  0.982677  0.907525
100.0  SNP PASS  0.851697  0.982144  0.912281

Depth 10, all bases Q30

Although recall and precision rates are down on the 40x sample (.997
vs .891 F-score), it is still demonstrable that the more qualities we
discard the better the GATK calling gets.

Figures for the entirety of data set 12 (full depth):

Crumble (BAM to BAM):

@ seq3c[scratch01/mpeg]; time ~/work/crumble/crumble -9 -p8 NA12878_V2.5_Robot_2.aln_bowtie2.sorted.dupmark.rg.realn.recal.bam NA12878_V2.5_Robot_2.aln_bowtie2.sorted.dupmark.rg.realn.recal.crumble-9p8.bam
real    663m37.836s
user    649m48.797s
sys       5m47.962s

CRAM default (BAM to CRAM, rANS1 codec, 10k seqs per block):

@ seq3c[scratch01/mpeg]; time scramble -r hum_ref/human_g1k_v37.fasta -t16 NA12878_V2.5_Robot_2.aln_bowtie2.sorted.dupmark.rg.realn.recal.crumble-9p8.bam NA12878_V2.5_Robot_2.aln_bowtie2.sorted.dupmark.rg.realn.recal.crumble-9p8.cram
real     27m26.279s
user    341m37.001s
sys      13m17.554s

Block content_id 12, total size 3517043197  R  QS

CRAM with the fqzcomp codec and 100k seqs per block:

@ seq3c[scratch01/mpeg]; time CRAM_CODEC_DIR=~/scratch/data/mpeg/dataset/fqz/ ~/scratch/data/mpeg/dataset/scramble.modules -r hum_ref/human_g1k_v37.fasta -t16 -s100000 NA12878_V2.5_Robot_2.aln_bowtie2.sorted.dupmark.rg.realn.recal.crumble-9p8.bam NA12878_V2.5_Robot_2.aln_bowtie2.sorted.dupmark.rg.realn.recal.crumble-9p8.fqz.cram
Adding codec 18 libcram_codec_fqzcomp_qual.so
real     30m52.060s
user    393m6.574s
sys      14m13.621s

Block content_id 12, total size 3217310516  [q] QS

OLD versions of this analysis follow, labelled with >F04 and >F44 to
indicate which FLAG filter options were used when discarding
alignments.

>F04 Chr20 sizes
>F04 -----------
>F04
>F04 The lossy compressed size is a combination of the amount/placement
>F04 of data being lost and the compression algorithm.  These two
>F04 components are mostly orthogonal, so we tested the lossless and
>F04 lossy sizes with both the default CRAM v3 codecs (rANS order 1 at
>F04 10,000 seqs per slice) and with the fqzcomp plugin module at
>F04 100,000 seqs per slice.
>F04
>F04 Times listed here are CRAM only for encoding the QS data series.
>F04 They do not include the time to lossily reduce the qualities; see
>F04 below for Crumble timings.
>F04
>F04 Method                      Size (bytes)  x-fold
>F04 0% loss (raw)               2622005008
>F04 0% loss (CRAM), rANS1        903633030     2.90
>F04 Crumble -1, rANS1            103102947    25.43
>F04 Crumble -9, rANS1             60613822    43.26
>F04 Crumble -9 -p8, rANS1         42175805    62.17
>F04 Lossless CRAM, fqz,100k      815488380     3.22
>F04 Crumble -1, fqz,100k          95389680    27.49
>F04 Crumble -9, fqz,100k          55115911    47.57
>F04 Crumble -9 -p8, fqz,100k      38299417    68.46
>F04
>F04
>F04 Chr20 time and memory
>F04 ---------------------
>F04
>F04 Measured on chr20 only.
>F04 The memory needed by the lossy compression stage is tiny.  About
>F04 two-thirds of the CPU time is BAM decoding/encoding rather than the
>F04 quality loss algorithm itself, which could be multi-threaded.
>F04
>F04 Crumble -9 -p8 to BAM chr20:    11m18s, 4.2Mb memory
>F04 Samtools BAM to BAM (no loss):   7m20s, 1.5Mb memory
>F04
>F04 @ seq3c[dataset/10]; /usr/bin/time ~/work/crumble/crumble -Obam -9 -p8 /local/scratch01/mpeg/_NA12878.chr20.bqsr.bam /tmp/_tmp.bam
>F04 664.54user 8.11system 11:17.67elapsed 99%CPU (0avgtext+0avgdata 16816maxresident)k
>F04 482320inputs+4835816outputs (1major+2280378minor)pagefaults 0swaps
>F04
>F04 @ seq3c[dataset/10]; /usr/bin/time samtools view -b /tmp/_tmp.bam -o /tmp/_tmp2.bam
>F04 424.71user 12.14system 7:20.45elapsed 99%CPU (0avgtext+0avgdata 5920maxresident)k
>F04 1607752inputs+4835760outputs (0major+5484489minor)pagefaults 0swaps
>F04
>F04
>F04 Chr20 recall / precision
>F04 ------------------------
>F04
>F04 Figures measured with hap.py for each of the 90.0, 99.0, 99.9 and
>F04 100.0 calibration tranches (as per the doc):
>F04
>F04 for i in 90.0 99.0 99.9 100.0; do HGREF=/nfs/srpipe_references/references/Human/1000Genomes/all/fasta/human_g1k_v37.fasta PYTHONPATH=/software/CGP/external-apps/CNVkit-0.7.10/lib/python2.7/ /software/python-2.7.3/bin/python hap.py/bin/hap.py giab-filt-norm.vcf _NA12878.chr20.bqsr.crumble-1.gatk.snps.recal$i.vcf -r hum_ref/human_g1k_v37.fasta -f giab.bed -o __NA12878.chr20.bqsr.crumble-1.gatk.happy_root$i -T chr20.bed --threads 16 --roc VQLSOD;done
>F04
>F04 0% quality loss
>F04 Threshold       Recall    Precis.   F-score
>F04  90.0  SNP PASS  0.823781  0.999641  0.903231
>F04  99.0  SNP PASS  0.947362  0.999573  0.972767
>F04  99.9  SNP PASS  0.985964  0.999211  0.992543
>F04 100.0  SNP PASS  0.996448  0.997458  0.996953
>F04
>F04 100% quality loss (all quality set to Q20)
>F04  90.0  SNP PASS  0.803265  0.999632  0.890755
>F04  99.0  SNP PASS  0.945243  0.999210  0.971478
>F04  99.9  SNP PASS  0.979546  0.997810  0.988594
>F04 100.0  SNP PASS  0.994517  0.996317  0.995416
>F04
>F04 100% quality loss (all quality set to Q30)
>F04  90.0  SNP PASS  0.808811  0.999711  0.894186
>F04  99.0  SNP PASS  0.948483  0.999294  0.973226
>F04  99.9  SNP PASS  0.985793  0.997793  0.991757
>F04 100.0  SNP PASS  0.995934  0.994277  0.995105
>F04
>F04 100% quality loss (all quality set to Q40 / 'I')
>F04  90.0  SNP PASS  0.824669  0.999566  0.903733
>F04  99.0  SNP PASS  0.948826  0.997674  0.972637
>F04  99.9  SNP PASS  0.987273  0.995961  0.991598
>F04 100.0  SNP PASS  0.996028  0.983480  0.989714
>F04
>F04 Crumble -1
>F04 Threshold       Recall    Precis.   F-score
>F04  90.0  SNP PASS  0.814201  0.999790  0.897502
>F04  99.0  SNP PASS  0.946194  0.999441  0.972089
>F04  99.9  SNP PASS  0.985590  0.998422  0.991965
>F04 100.0  SNP PASS  0.996588  0.997334  0.996961
>F04
>F04 Crumble -9
>F04 Threshold       Recall    Precis.   F-score
>F04  90.0  SNP PASS  0.806786  0.999749  0.892962
>F04  99.0  SNP PASS  0.945197  0.999654  0.971663
>F04  99.9  SNP PASS  0.985107  0.999242  0.992124
>F04 100.0  SNP PASS  0.996573  0.997039  0.996806
>F04
>F04 Crumble -9 with P-block 8 (rANS1 sz 42175805)
>F04 Threshold       Recall    Precis.   F-score
>F04  90.0  SNP PASS  0.823844  0.999622  0.903261
>F04  99.0  SNP PASS  0.947222  0.999490  0.972654
>F04  99.9  SNP PASS  0.987771  0.999070  0.993388
>F04 100.0  SNP PASS  0.996807  0.996807  0.996807
>F04
>F04
>F04 The conclusion here appears to be that there is very little
>F04 difference in SNP accuracy between original, crumble -1, crumble -9
>F04 and crumble with P-block.
>F04 The 99.0 filter level shows a decrease in recall and corresponding
>F04 F-score, while other levels appear to show increases throughout.
>F04 It is expected that tweaking the fixed low and high quality values
>F04 used in crumble could have an effect here, but this hasn't been
>F04 properly explored.
>F04
>F04 Note however that a lot of the parameters that change between
>F04 crumble -1 and crumble -9 are related to preserving indel calling,
>F04 which is not part of this test.  Specifically the quality values of
>F04 neighbouring bases in a short tandem repeat are of great importance
>F04 for indel calling.
>F04
>F04 We performed a rough test of indel calling on the entire GATK
>F04 output (-stand_emit_conf 10 -stand_call_conf 30) without the
>F04 recalibration steps.
>F04
>F04 Lossless        Recall    Precision
>F04 INDEL ALL       0.960191  0.912072
>F04 INDEL PASS      0.955201  0.940314
>F04 SNP ALL         0.997289  0.996405
>F04 SNP PASS        0.996448  0.997412
>F04
>F04 Crumble -1      Recall    Precision
>F04 INDEL ALL       0.962252  0.907703
>F04 INDEL PASS      0.956937  0.938432
>F04 SNP ALL         0.997398  0.996375
>F04 SNP PASS        0.996588  0.997288
>F04
>F04 Crumble -9      Recall    Precision
>F04 INDEL ALL       0.965831  0.906715
>F04 INDEL PASS      0.961601  0.935061
>F04 SNP ALL         0.997398  0.996127
>F04 SNP PASS        0.996573  0.997008
>F04
>F04 Crumble -9 -p8  Recall    Precision
>F04 INDEL ALL       0.965181  0.906659
>F04 INDEL PASS      0.960408  0.934986
>F04 SNP ALL         0.997523  0.995631
>F04 SNP PASS        0.996807  0.996776
>F04
>F04 While we do not know the veracity of the indel calls, we can
>F04 measure the deviation caused by lossy encoding.  It is evident that
>F04 the precision deviation with crumble -1 is much less than with
>F04 crumble -9 or -9 -p8, although the recall figures seem to improve
>F04 with more aggressive compression.
>F04
>F04
>F04 Default crumble quality values for lost values are: -c 25 -l 10 -u 40
>F04     mismatch cons:  1..24 => 10
>F04     mismatch cons: 25..99 => 40
>F04     match cons:           => 40
>F04
>F04 Try again with Q30 as max instead?  -c 20 -l 10 -u 30
>F04     mismatch cons:  1..19 => 10
>F04     mismatch cons: 20..99 => 30
>F04     match cons:           => 30
>F04
>F04 Crumble -9 -p8 -c 20 -l 10 -u 30 (rANS1 QS size 39922081; was 42175805)
>F04 Threshold       Recall    Precis.   F-score
>F04  90.0  SNP PASS  0.799651  0.999649  0.888535  (-- was 0.903261, big drop in recall)
>F04  99.0  SNP PASS  0.943701  0.999423  0.970763  (-- was 0.972654)
>F04  99.9  SNP PASS  0.989781  0.998350  0.994047  (+  was 0.993388)
>F04 100.0  SNP PASS  0.996635  0.997101  0.996868  (~  was 0.996807)
>F04
>F04 crumble -9 -p8 -c15 -u35
>F04 SNP PASS  0.827551  0.999567  0.905462  64193 53123 11070
>F04 SNP PASS  0.947876  0.999442  0.972976  64193 60847  3346
>F04 SNP PASS  0.984889  0.999241  0.992013  64193 63223   970
>F04 SNP PASS  0.996729  0.996744  0.996736  64193 63983   210
>F04
>F04 crumble -9 -p8 -l10 -c20 -u40 (rANS1 sz 41565704)
>F04 SNP PASS  0.828813  0.999587  0.906225  64193 53204 10989  + (vs -9 -p8)
>F04 SNP PASS  0.946178  0.999441  0.972080  64193 60738  3455  -
>F04 SNP PASS  0.986400  0.999069  0.992694  64193 63320   873  -
>F04 SNP PASS  0.996791  0.996822  0.996806  64193 63987   206  =
>F04
>F04 crumble -9 -p8 -l10 -c20 -u35 (rANS1 sz 40851277)
>F04 SNP PASS  0.817581  0.999657  0.899498  64193 52483 11710
>F04 SNP PASS  0.950072  0.999459  0.974140  64193 60988  3205
>F04 SNP PASS  0.983596  0.999145  0.991310  64193 63140  1053
>F04 SNP PASS  0.996729  0.996977  0.996853  64193 63983   210
>F04
>F04 crumble -9 -p8 -l10 -c20 -u35 (U40 for agree)
>F04 SNP PASS  0.815572  0.999599  0.898257  64193 52354 11839  - (vs -9 -p8 -l10 -c20 -u35)
>F04 SNP PASS  0.948857  0.999491  0.973516  64193 60910  3283  -
>F04 SNP PASS  0.990404  0.998822  0.994595  64193 63577   616  + (++ recall, - precis)
>F04 SNP PASS  0.996791  0.996900  0.996845  64193 63987   206  ~
>F04
>F04 crumble -9 -p8 -l5 -c20 -u35 (rANS1 sz 40684464)
>F04 SNP PASS  0.790725  0.999842  0.883072  64193 50759 13434
>F04 SNP PASS  0.964482  0.999000  0.981438  64193 61913  2280
>F04 SNP PASS  0.985450  0.997808  0.991590  64193 63259   934
>F04 SNP PASS  0.996869  0.996962  0.996915  64193 63992   201

>F44 NB: This data set was processed using samtools view -F 0xF40, so
>F44 is single-ended.  While not correct, it has some use for evaluating
>F44 performance on shallower data sets (~20x in this case).
>F44
>F44 Data set 12 (*Robot*)
>F44 ~~~~~~~~~~~~~~~~~~~~~
>F44
>F44 Using the "gatk_call.sh" script, as per N16525, followed by hap.py
>F44 for calculation of precision/recall at the various tranche levels.
>F44
>F44 Chr20 sizes
>F44 -----------
>F44
>F44 The lossy compressed size is a combination of the amount/placement
>F44 of data being lost and the compression algorithm.  These two
>F44 components are mostly orthogonal, so we tested the lossless and
>F44 lossy sizes with both the default CRAM v3 codecs (rANS order 1 at
>F44 10,000 seqs per slice) and with the fqzcomp plugin module at
>F44 100,000 seqs per slice.
>F44
>F44 Times listed here are CRAM only for encoding the QS data series.
>F44 They do not include the time to lossily reduce the qualities; see
>F44 below for Crumble timings.
>F44
>F44 Method                    Size (bytes)  x-fold  Enc(s)  Dec(s)
>F44 Lossless QS (raw)         1292621755     1.00     N/A     N/A
>F44 Lossless CRAM, rANS1       505376417     2.56    21.9     5.3
>F44 Lossless CRAM, fqz,100k    451304814     2.86   103.7    66.7
>F44 Crumble -1 QS, rANS1       138208917     9.35    10.1     4.9
>F44 Crumble -1 QS, fqz,100k    126346751    10.23    45.7    35.2
>F44 Crumble -9 QS, rANS1        76475355    16.90     9.1     4.9
>F44 Crumble -9 QS, fqz,100k     70540126    18.32    34.6    30.3
>F44
>F44 As mentioned above, Crumble may be considered a "vertical"
>F44 compressor in that it acts on entire aligned columns of qualities.
>F44 P-block, R-block and similar may be considered "horizontal"
>F44 compressors in that they squash qualities along the length of
>F44 individual reads.
>F44 Combining the two algorithms gives further improvements.  Timings
>F44 for these haven't been kept as it involved running various external
>F44 tools in an inefficient manner (e.g. converting BAM to SAM,
>F44 CompressSAM, DecompressSAM, and then converting the decoded SAM
>F44 back to CRAM).  It is expected that implementing P-block directly
>F44 within Crumble would have a minimal encode overhead and a
>F44 non-existent decode overhead for CRAM.
>F44
>F44 Method                            Size (bytes)  x-fold
>F44 Crumble -9 + PBlock 1             64437688      20.06
>F44 Crumble -9 + PBlock 2             59257354      21.81
>F44 Crumble -9 + PBlock 4             51704458      25.00
>F44 Crumble -9 + PBlock 8             42078669      30.72
>F44 Crumble -9 + PBlock 1 + fqz,100k  60387992      21.41
>F44 Crumble -9 + PBlock 2 + fqz,100k  55618114      23.24
>F44 Crumble -9 + PBlock 4 + fqz,100k  48539269      26.63
>F44 Crumble -9 + PBlock 8 + fqz,100k  39003300      33.14
>F44
>F44 PBlock's own binary encoding for level 4 (CompressQual -q1 -m1 -l4)
>F44 consumes 107,041,156 bytes, so is quite large.  However the
>F44 replacement of qualities by runs of identical values makes CRAM's
>F44 own entropy encoders work better.
>F44
>F44 Chr20 time and memory
>F44 ---------------------
>F44
>F44 Measured on chr20 only.  Output format BAM vs vanilla CRAM v3 makes
>F44 minimal difference; memory is larger with CRAM due to its larger
>F44 block size.  The actual memory needed by the lossy compression
>F44 stage is tiny.
>F44
>F44 Crumble -9 to BAM chr20:   6m3s,  ~3 Mb memory
>F44 Crumble -9 to CRAM chr20:  7min,  242 Mb memory
>F44
>F44
>F44 Chr20 recall / precision
>F44 ------------------------
>F44
>F44 Figures measured with hap.py for each of the 90.0, 99.0, 99.9 and
>F44 100.0 calibration tranches (as per the doc):
>F44
>F44 for i in 90.0 99.0 99.9 100.0; do HGREF=/nfs/srpipe_references/references/Human/1000Genomes/all/fasta/human_g1k_v37.fasta PYTHONPATH=/software/CGP/external-apps/CNVkit-0.7.10/lib/python2.7/ /software/python-2.7.3/bin/python hap.py/bin/hap.py giab-filt-norm.vcf _NA12878.chr20.bqsr.crumble-1.gatk.snps.recal$i.vcf -r hum_ref/human_g1k_v37.fasta -f giab.bed -o __NA12878.chr20.bqsr.crumble-1.gatk.happy_root$i -T chr20.bed --threads 16 --roc VQLSOD;done
>F44
>F44 Orig
>F44 Threshold       Recall    Precis.   F-score
>F44  90.0  SNP PASS  0.788466  0.999092  0.881370
>F44  99.0  SNP PASS  0.941567  0.997031  0.968506
>F44  99.9  SNP PASS  0.958531  0.995937  0.976876
>F44 100.0  SNP PASS  0.967255  0.995176  0.981017
>F44
>F44 Crumble -1
>F44  90.0  SNP PASS  0.828034  0.998141  0.905165
>F44  99.0  SNP PASS  0.930444  0.997795  0.962943
>F44  99.9  SNP PASS  0.962348  0.997207  0.979467
>F44 100.0  SNP PASS  0.970402  0.995732  0.982904
>F44
>F44 Crumble -9
>F44  90.0  SNP PASS  0.816320  0.998685  0.898341
>F44  99.0  SNP PASS  0.934463  0.998236  0.965297
>F44  99.9  SNP PASS  0.969592  0.997228  0.983216
>F44 100.0  SNP PASS  0.974670  0.996084  0.985261
>F44
>F44 Crumble -9 with P-block 4
>F44  90.0  SNP PASS  0.811412  0.998734  0.895381
>F44  99.0  SNP PASS  0.929463  0.998243  0.962626
>F44  99.9  SNP PASS  0.973097  0.996681  0.984748
>F44 100.0  SNP PASS  0.975075  0.996275  0.985561
>F44
>F44 Crumble -9 with P-block 8
>F44  90.0  SNP PASS  0.828486  0.998629  0.905636
>F44  99.0  SNP PASS  0.929276  0.998210  0.962510
>F44  99.9  SNP PASS  0.968299  0.997545  0.982704
>F44 100.0  SNP PASS  0.978051  0.995844  0.986867
>F44
>F44 100% lossy (all quality set to Q40 / 'I').
>F44
>F44
>F44 The conclusion here appears to be that there is very little
>F44 difference in SNP accuracy between original, crumble -1, crumble -9
>F44 and crumble with P-block.
>F44 The 99.0 filter level shows a decrease in recall and corresponding
>F44 F-score, while other levels appear to show increases throughout.
>F44 It is expected that tweaking the fixed low and high quality values
>F44 used in crumble could have an effect here, but this hasn't been
>F44 properly explored.
>F44
>F44 Note however that a lot of the parameters that change between
>F44 crumble -1 and crumble -9 are related to preserving indel calling,
>F44 which is not part of this test.  Specifically the quality values of
>F44 neighbouring bases in a short tandem repeat are of great importance
>F44 for indel calling.
>F44
>F44 We performed a rough test of indel calling on the entire GATK
>F44 output (-stand_emit_conf 10 -stand_call_conf 30) without the
>F44 recalibration steps.
>F44
>F44 Lossless               Recall    Precis.   F-score
>F44 INDEL ALL              0.856058  0.936213  0.894343  (>=10)
>F44 INDEL PASS             0.822432  0.954780  0.883678  (>=30)
>F44
>F44 Crumble -1
>F44 INDEL ALL              0.864844  0.935299  0.898693  (>=10)
>F44 INDEL PASS             0.836642  0.954226  0.891574  (>=30)
>F44
>F44 Crumble -9
>F44 INDEL ALL              0.877969  0.931810  0.904089  (>=10)
>F44 INDEL PASS             0.852045  0.953522  0.899932  (>=30)
>F44
>F44 Crumble -9 + PBlock 4
>F44 INDEL ALL              0.880356  0.930703  0.904830  (>=10)
>F44 INDEL PASS             0.854214  0.953980  0.901345  (>=30)
>F44
>F44 While we do not know the veracity of the indel calls, we can
>F44 measure the deviation caused by lossy encoding.  It is evident that
>F44 the deviation with crumble -1 is much less than with crumble -9,
>F44 but both tend to favour lower precision with higher recall (to an
>F44 extent that still increases the F-score).  The best F-score comes
>F44 from more aggressive compression, but without knowing the absolute
>F44 validity of these calls it is hard to judge this.
>F44
>F44
>F44 Chr11 sizes
>F44 -----------
>F44
>F44 Method                    Size (bytes)  x-fold  Enc(s)  Dec(s)
>F44 Lossless QS (raw)         2845105400     1.00     N/A     N/A
>F44 Lossless CRAM, rANS1      1065305769     2.67    42.7    21.7
>F44 Lossless CRAM, fqz,100k    985856117     2.89   124.4   145.0
>F44 Crumble -1 QS, rANS1       295071612     9.64    17.5    10.5
>F44 Crumble -1 QS, fqz,100k    272042681    10.46    67.3    78.0
>F44 Crumble -9 QS, rANS1       166800786    17.06    16.6    10.2
>F44 Crumble -9 QS, fqz,100k    154424543    18.42    56.4    64.9
>F44
>F44
>F44 Chr11 time and memory
>F44 ---------------------
>F44
>F44 Lossy portion only (so not including the advanced CRAM compression
>F44 models, e.g. fqz):
>F44
>F44 Crumble -1 BAM to BAM chr11:  15m54s  51Mb
>F44 Crumble -9 BAM to BAM chr11:  13m8s   51Mb
>F44
>F44
>F44 Chr11 recall / precision
>F44 ------------------------
>F44
>F44 Lossless
>F44 Threshold       Recall    Precis.   F-score
>F44  90.0  SNP PASS  0.825431  0.998146  0.9036
>F44  99.0  SNP PASS  0.924882  0.997766  0.9599
>F44  99.9  SNP PASS  0.958421  0.997278  0.9775
>F44 100.0  SNP PASS  0.970596  0.995405  0.9828
>F44
>F44 Crumble -1
>F44 Threshold       Recall    Precis.   F-score
>F44  90.0  SNP PASS  0.795312  0.998547  0.8854
>F44  99.0  SNP PASS  0.925347  0.998180  0.9604
>F44  99.9  SNP PASS  0.964892  0.997220  0.9808
>F44 100.0  SNP PASS  0.972202  0.995781  0.9839
>F44
>F44 Crumble -9
>F44 Threshold       Recall    Precis.   F-score
>F44  90.0  SNP PASS  0.820729  0.998539  0.9009
>F44  99.0  SNP PASS  0.932650  0.998068  0.9643
>F44  99.9  SNP PASS  0.961826  0.997569  0.9794
>F44 100.0  SNP PASS  0.975039  0.995373  0.9851
>F44
>F44 As with Chr20, F-scores are only minimally impacted by lossy
>F44 compression, and typically they even improve slightly.