CRAM documentation
==================

Most of my CE1, CE2 and CE3 submissions revolve around CRAM codecs.  For
CE2 and CE3 these are just single data types (all qualities or all
identifiers), so they are trivial to experiment with on the command line
too, albeit possibly via unix split to get lots of separate files to
simulate keeping the ability to do random access.  CE1 however is more
nebulous, describing sequences and pairings.  These span many CRAM data
series, so a more complete description of how CRAM works is beneficial to
understanding the encoded file formats.

Firstly, see the official CRAM specification:

    https://github.com/samtools/hts-specs/blob/master/CRAMv3.tex
    https://github.com/samtools/hts-specs/blob/master/CRAMv3.pdf

Note: I'm partially responsible for this, but not the original author.
Even so, I ought to have fixed it by now as it's really not an easy
document to read, nor is it at all tightly worded.  Therefore I will
attempt to clarify some of the most glaring omissions here.


ITF8
----

Integer values are written in a variable sized format where the number of
top bits set in the first byte indicates how many additional bytes should
be read.  Thus 0 to 127 fits in 1 byte and anything above this needs more.
A tragic omission was no ability to store negative numbers, so these all
get turned into very large positives!


Data series
-----------

CRAM breaks data by type into "data series".  Trivial examples are "RN"
(read name, aka identifier) and "QS" (quality score), while others are a
bit more specific, such as "DL" (length of a CIGAR "D" - deletion -
operation), "SC" (soft-clipped bases referred to in a CIGAR "S" operation)
and "NF" (number of records to the next fragment, used when both members
of a read-pair fall within the same CRAM slice).

Note: Each read is processed in a breadth first fashion; we encode all
data series for the first read, then all data series for the second read,
and so on.  There are conditions that indicate whether or not a data
series is appended to (for example the "DL" listed above is only appended
to if the record contains a deletion), thus the flow of data is tightly
bound to the program flow logic too.  A sketch of this per-record flow
follows.
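To make the "program flow" point concrete, below is a minimal hypothetical
sketch of a per-record encode loop.  The record struct and the put_*
helpers are invented for illustration (they are not the real
Scramble/htslib API); the point is simply that a data series such as DL is
only appended to when the record actually contains a deletion feature.

    #include <stdio.h>

    /* Hypothetical per-record encode flow; all names here are illustrative
     * only and are not the real Scramble/htslib API.  The put_* stubs stand
     * in for appending one value to the named CRAM data series. */
    static void put_int  (const char *ds, int v)         { printf("%s=%d\n", ds, v); }
    static void put_byte (const char *ds, char c)        { printf("%s=%c\n", ds, c); }
    static void put_bytes(const char *ds, const char *s) { printf("%s=%s\n", ds, s); }

    struct feature { char code; int pos, val; const char *seq; };
    struct record  { int flag, len, pos, nfeat; const struct feature *feat; };

    static void encode_record(const struct record *r) {
        put_int("BF", r->flag);                /* BAM flag                     */
        put_int("RL", r->len);                 /* read length                  */
        put_int("AP", r->pos);                 /* position (typically a delta) */
        put_int("FN", r->nfeat);               /* number of features (edits)   */
        for (int i = 0; i < r->nfeat; i++) {
            put_byte("FC", r->feat[i].code);   /* feature type                 */
            put_int ("FP", r->feat[i].pos);    /* position delta from last     */
            switch (r->feat[i].code) {
            case 'X': put_int  ("BS", r->feat[i].val); break; /* substitution code */
            case 'D': put_int  ("DL", r->feat[i].val); break; /* deletion length   */
            case 'S': put_bytes("SC", r->feat[i].seq); break; /* soft-clip bases   */
            }
        }
        /* QS, RN, mate fields, aux tags and so on follow in the same fashion. */
    }

    int main(void) {
        const struct feature f[] = { { 'X', 51, 0, NULL }, { 'D', 12, 3, NULL } };
        const struct record  r   = { 0x49, 100, 1, 2, f };
        encode_record(&r);
        return 0;
    }
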
containers, slices, blocks
--------------------------

CRAM files consist of a header followed by a series of containers.  Each
container has a compression header and one or more slices.  Each slice
(and indeed the container compression header) is stored in a series of
blocks.

The compression header holds information about how each data series is
encoded and which blocks it is stored in.  Technically we can mix multiple
data series into the same block (which is when it becomes important to
identify the program flow order), but practically speaking this is rarely
of benefit to compression efficiency.

The cram_dump tool can demonstrate this:

    Container pos 573 size 6045659
        Ref id:        0
        Ref pos:       1 + 35030
        Rec counter:   0
        No. recs:      100000
        No. bases      15000000
        No. blocks:    27
        No. landmarks: 1 {325}

    Container_header block pos 598
        Preservation map:
            RN => 1 (0x1)
            SM => 11410149 (0xae1ae5)
            TD => 11410087 (0xae1aa7)
            AP => 1 (0x1)
        Substitution map:
            A: CGTN
            C: AGTN
            G: ACTN
            T: ACGN
            N: ACGT
        TD map:
            0: BCZXDZSMSASS
            1: BCZXDZSMS
            2: BCZXDZSMC
            3: BCZXDZSMCASS
            4: BCZXDZSMCASC
        Record encoding map:
            FN => EXTERNAL {26}
            FP => EXTERNAL {28}
            RL => HUFFMAN {1, 128, 150, 1, 0}
            RN => BYTE_ARRAY_STOP {0, 11}
            QS => EXTERNAL {12}
            BA => EXTERNAL {30}
            BB => BYTE_ARRAY_LEN {1, 1, 42, 1, 1, 37}
            TL => EXTERNAL {32}
            IN => BYTE_ARRAY_STOP {0, 13}
            BF => EXTERNAL {15}
            MF => EXTERNAL {21}
            TS => EXTERNAL {22}
            CF => EXTERNAL {16}
            NF => EXTERNAL {24}
            AP => EXTERNAL {17}
            FC => EXTERNAL {27}
            MQ => EXTERNAL {19}
            DL => EXTERNAL {29}
            BS => EXTERNAL {31}
            NP => EXTERNAL {23}
            SC => BYTE_ARRAY_STOP {0, 14}
            NS => EXTERNAL {20}
            RG => HUFFMAN {1, 0, 1, 0}
            RI => HUFFMAN {1, 0, 1, 0}
        Tag encoding map:
            SMC => BYTE_ARRAY_LEN {3, 4, 1, 1, 1, 0, 1, 4, 224, 83, 77, 67}
            SMS => BYTE_ARRAY_LEN {3, 4, 1, 2, 1, 0, 1, 4, 224, 83, 77, 83}
            BCZ => BYTE_ARRAY_STOP {9, 224, 66, 67, 90}
            XDZ => BYTE_ARRAY_STOP {9, 224, 88, 68, 90}
            ASS => BYTE_ARRAY_LEN {3, 4, 1, 2, 1, 0, 1, 4, 224, 65, 83, 83}
            ASC => BYTE_ARRAY_LEN {3, 4, 1, 1, 1, 0, 1, 4, 224, 65, 83, 67}

The tag encoding map here is for the XX:Y:data auxiliary tags in SAM, eg
SM:i:100 (100 being 1 byte so "C" in BAM) or BC:Z:foo.

EXTERNAL, BYTE_ARRAY_STOP and BYTE_ARRAY_LEN are encoding instructions
indicating how the data is laid out.  EXTERNAL is used for arrays where
the caller already knows the size (so it is not an explicit part of the
stored data) or for single items such as a character or an integer.
BYTE_ARRAY_STOP and BYTE_ARRAY_LEN are methods of storing variable sized
data where the size is also one of the items being stored, either as, say,
a nul terminated string or where the length itself has its own distinct
encoding instructions.

There is also a "CORE" block, mostly now unused, which worked on a bunch
of bit-packers in CRAM for HUFFMAN, BETA, ELIAS-GAMMA, SUB-EXPONENTIAL,
etc.  Of these we typically now only use HUFFMAN, and again only for the
special case of zero-length codes which are used to define constants.

Typically we're only using one slice per container, although the code does
permit multiples.  In theory this would be used to reduce the overheads of
the compression header while keeping high granularity (per slice) of
random access.

The slice is dumped out by cram_dump as:

    Slice 1/1, container offset 325, file offset 923
        Slice content type MAPPED_SLICE
        Slice ref seq    0
        Slice ref start  1
        Slice ref span   35030
        Slice MD5        209abd0aa842316600786525e3a815f3
        Rec counter      0
        No. records      100000
        No. blocks       25
        Blk IDS:         {11, 12, 15, 16, 17, 19, 20, 21, 22, 23, 24, 26,
                          27, 28, 29, 30, 31, 32, 4281155, 4281171,
                          5459283, 4342618, 5784666, 5459267}
        Ref base id:     -1

    Block 1/25
        Size:         0 comp / 0 uncomp
        Method:       RAW
        Content type: CORE
        Content id:   0

    Block 2/25
        Size:         476489 comp / 1872622 uncomp
        Method:       BSC?
        Content type: EXTERNAL
        Content id:   11
        Keys:         RN
        (\265/\375\240\356\222\034\000\254A\004zp\023b\016\260\301\006\3143"\262\273\273\222\236/\345\007\026\006\031\006!\006\017\241\036\241\306\360[&z\004\011\213\344\335l\342\213\030)\016=\020k\247\305H\213\036d\\207v\006\242W\205\360\021\302\301\010\325 "#\206Z\237c\012\275B\226\204b\015\313\340F5M\336\260

    Block 3/25
        Size:         4959119 comp / 15000000 uncomp

    ...

Here we see it has listed all the blocks referred to in the compression
header.
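To tie the encoding map and the content ids together: the RN entry above,
BYTE_ARRAY_STOP {0, 11}, means read names come from the external block
with content id 11, reading bytes until the stop byte (0) is hit.  A
minimal sketch of that decode step is below; the ext_block struct is an
invented stand-in for the real (already uncompressed) htslib block
structures, which differ.

    #include <stdio.h>
    #include <string.h>
    #include <stddef.h>

    /* Invented stand-in for an already-uncompressed external block. */
    struct ext_block { int content_id; const unsigned char *data; size_t size, pos; };

    /* Decode one BYTE_ARRAY_STOP item: copy bytes from the block until the
     * stop byte is found, returning the item length or -1 on error. */
    static ptrdiff_t byte_array_stop(struct ext_block *b, unsigned char stop,
                                     unsigned char *out, size_t out_size) {
        const unsigned char *start = b->data + b->pos;
        const unsigned char *end   = memchr(start, stop, b->size - b->pos);
        if (!end)
            return -1;                        /* malformed: no stop byte found */
        size_t len = (size_t)(end - start);
        if (len > out_size)
            return -1;
        memcpy(out, start, len);
        b->pos += len + 1;                    /* consume the stop byte as well */
        return (ptrdiff_t)len;
    }

    int main(void) {
        /* Two nul-separated read names, as they might sit in block id 11. */
        const unsigned char names[] = "HS25_09827:2:1204:18768:54085#49\0read2";
        struct ext_block b = { 11, names, sizeof(names) - 1, 0 };
        unsigned char rn[256];
        ptrdiff_t n = byte_array_stop(&b, '\0', rn, sizeof(rn));
        if (n >= 0)
            printf("RN = %.*s\n", (int)n, (const char *)rn);
        return 0;
    }
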
Each block has a content id number (also referred to in the various
EXTERNAL, BYTE_ARRAY_STOP and BYTE_ARRAY_LEN encoding meta-data), and a
"method", which is essentially a compression type; eg gzip, bzip2, bsc,
rANS, etc.


Block encoders
--------------

As described earlier, the actual CRAM blocks themselves are then encoded
using what CRAM refers to as an "external" codec.  These started as just
zlib and bzip2, but then progressed to lzma and rANS (a custom part of
CRAM), and in my submission I added libbsc, zstd and custom quality and
name codecs.  The interface for these is simply a byte stream; data in to
data out.  There are no block transforms, so these would need to be folded
into any custom codecs.

rANS is probably the only truly novel part of vanilla CRAM.  It was
derived from the initial rANS implementation by Fabian Giesen
(https://github.com/rygorous/ryg_rans) but has since been substantially
revised and sped up by myself (https://github.com/jkbonfield/rans_static).
CRAM's rANS has both order-0 and order-1 implementations.  It is strictly
a static entropy encoder, requiring a frequency table to be stored within
the block.  Practically speaking this was a mistake; it should have been
stored in the compression header, permitting efficient compression using
many slices per container, which in turn would make static order-2
compression achievable too (although dynamic table updates likely become
better at that stage).

The rANS used within CRAM is still an earlier draft, specifically the
"rANS_static4x8.c" algorithm in github.  I have AVX2 and AVX512
implementations of this algorithm, but they are not binary compatible with
the original rANS.  These however run much faster; for example order-0
quality value decoding runs at ~1400 MB/s and order-1 decoding at 870-1060
MB/s, tested on a 3.2GHz Intel i5 4570.  To achieve that speed they
interleave decoding from 32 portions of the block.  The original CRAM
method only has 4-way interleaving, but it's still a quick entropy codec.


Partial decoding
----------------

CRAM can be sliced two ways.  The slice size is the smallest element of
random access across the genome, assuming sorted data.  There is also the
optimisation of only decoding some data series while ignoring others.  For
example, for "samtools flagstat" we know we only need to decode the BAM
FLAG column, held in the BF CRAM data series and possibly amended for
read-pairs via the CF, NF, etc data series.

In practice this is complex.  If every data series has been written to its
own "external" block then this becomes easy - we map SAM columns to the
related CRAM data series and then uncompress the appropriate blocks,
ignoring the others.  The API always returns a valid in-memory BAM record,
but some fields will be replaced with place-holder values if they were not
requested to be decoded.

If however the CRAM file, as is permitted, mixes multiple data series
together into the same block (either multiple series into the CORE block
or several sharing the same external block) then the decoder needs to
understand the interdependency between data types.  For example if BF (bam
flag) and DL (deletion length) shared the same block then to decode BF we
also have to be decoding DL, which in turn means we need to decode all the
other CIGAR fields, bringing in an additional dependency on FN and FC.
Possibly the program code itself has no option for decoding without
interpreting, which then brings in other dependencies on BS (base
substitution) etc.  While achievable (it's implemented in Scramble), this
isn't pretty and it requires iteration until the full effect of
data-series dependencies has been resolved.  The recommendation therefore
is: don't ever mix multiple data series in the same block.  A toy
illustration of that dependency expansion is sketched below.
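As a toy illustration of that dependency expansion (this is not the
Scramble code; the data series names are real but the block layout, the
hard-coded DL->FN/FC rule and everything else is invented), the decoder
keeps growing the set of required data series until nothing new is pulled
in:

    #include <stdio.h>
    #include <string.h>

    struct ds { const char *name; int block; int required; };

    int main(void) {
        struct ds series[] = {
            { "BF", 1, 1 },   /* requested: BAM flags               */
            { "DL", 1, 0 },   /* shares block 1 with BF             */
            { "FN", 2, 0 },   /* needed to make sense of DL         */
            { "FC", 2, 0 },
            { "QS", 3, 0 },   /* unrelated; never becomes required  */
        };
        int n = sizeof(series) / sizeof(series[0]);

        /* Grow the required set until a fixed point: anything sharing a block
         * with a required series becomes required, and (as one invented example
         * of a program-flow dependency) DL additionally drags in FN and FC. */
        int changed = 1;
        while (changed) {
            changed = 0;
            for (int i = 0; i < n; i++) {
                if (!series[i].required)
                    continue;
                for (int j = 0; j < n; j++)
                    if (!series[j].required && series[j].block == series[i].block)
                        series[j].required = changed = 1;
                if (strcmp(series[i].name, "DL") == 0)
                    for (int j = 0; j < n; j++)
                        if (!series[j].required &&
                            (!strcmp(series[j].name, "FN") || !strcmp(series[j].name, "FC")))
                            series[j].required = changed = 1;
            }
        }

        for (int i = 0; i < n; i++)
            if (series[i].required)
                printf("must decode %s\n", series[i].name);
        return 0;
    }
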
Decoding
--------

CRAM processes one read at a time, decoding the necessary fields in turn.
The order isn't explicitly described in the specification, but basically
it's the same order as the keys in the tables above.  Cram_dump can show
this too.  Eg:

    Rec 11/10000 at 1,4
    BF = 73 => SAM 0x49 (ret 0, out_sz 1)
    CF = 3 (ret 0, out_sz 1)
    RL = 100 (ret 0, out_sz 1)
    AP = 1 (ret 0, out_sz 1)
    RG = 0 (ret 0, out_sz 1)
    RN = HS25_09827:2:1204:18768:54085#49 (ret 0, out_sz 32)
    Detached
    MF = 2 (ret 0, out_sz 1)
    NS = 0 (ret 0, out_sz 1)
    NP = 10011 (ret 0, out_sz 1)
    TS = 0 (ret 0, out_sz 1)
    TL = 7 (ret 0, out_sz 1)
    0: TN= X0c
    ...
    FN = 3 (ret 0, out_sz 1)
    0: FC = X (ret 0, out_sz 1)
    0: FP = 51+0 = 51 (ret 0, out_sz 1)
    0: BS = 0 (ret 0)
    1: FC = X (ret 0, out_sz 1)
    1: FP = 12+51 = 63 (ret 0, out_sz 1)
    1: BS = 0 (ret 0)
    2: FC = X (ret 0, out_sz 1)
    2: FP = 24+63 = 87 (ret 0, out_sz 1)
    2: BS = 0 (ret 0)
    MQ = 0 (ret 0, out_sz 1)
    QS = CABCFGFEIFFGGGGFGGGFGGIEHHFFEHHHFEEGFHB@FHFEACFGFH@?FFGECFGGFG-HDBHG4CHFFFCDFG8HGCIHD;,C@@7G>5F5AH,2

We can see the data series being pulled: BF (bam flags), CF (cram flags),
RL (read length - likely a constant), AP (alignment position - typically a
delta to the last one), etc.  FN is the "feature number" and indicates how
many edits have been made with respect to the reference.  Each "feature"
then has a type (FC; all "X" here for substitution), a position relative
to the last feature (FP) and potentially additional meta data (BS - the
base substitution value itself).


Seq decoding
------------

The observation here is that if we store the new base when it differs from
the reference then, assuming an even distribution and simplifying to ACGT
only, we'll end up at around 2 bits per edited base.  However we know we
wouldn't store a "C" substitution if the reference base was already "C",
so in fact we only have 3 possibilities per substitution instead of 4.
Hence we use a 2D lookup table where BS, the substitution feature value,
is the X axis and the reference base is the Y axis.  The exact table
itself can be optimised per file and is recorded in the compression
header.  From above:

    Substitution map:
        A: CGTN
        C: AGTN
        G: ACTN
        T: ACGN
        N: ACGT

So ref G and BS 2 would indicate a G->T substitution.  For non-ACGTN ref
or seq bases we use the "BA" data series and store the bases verbatim.
Similarly for inserted bases or soft-clips.


Seq pairing
-----------

If a template has been sequenced as a pair it will likely have two
alignment records in SAM.  The SAM format has RNEXT, PNEXT and TLEN fields
to describe how these pairs relate to one another - which reference the
mate matches, at what position, and the observed template length.  (The
definition of TLEN however is hotly disputed!)

When this occurs, the reads *usually* are near each other on the genome
and thus both records of the pair often occur within the same CRAM slice.
This in turn permits RNEXT, PNEXT and TLEN to be computed on-the-fly
during the CRAM decode.  To do so, all CRAM stores is the number of
records between this read and its mate, via the NF data series.  Rather
bizarrely however the read-name is *not* deduplicated via this method.
(I've no idea why - not my format!)  A sketch of this on-the-fly mate
resolution follows.
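Below is a hedged sketch of that on-the-fly mate resolution.  The record
layout, the assumption that NF counts the records strictly between a read
and its mate, and the TLEN convention used (leftmost start to rightmost
end, signed) are all illustrative choices here, not a restatement of the
specification.

    #include <stdio.h>

    /* Illustrative only: a cut-down record with just the fields needed to
     * show how RNEXT/PNEXT/TLEN can be reconstructed from NF in one slice. */
    struct rec { int ref, pos, end, has_mate_downstream, nf;
                 int rnext, pnext, tlen; };

    static void resolve_pairs(struct rec *r, int nrecs) {
        for (int i = 0; i < nrecs; i++) {
            if (!r[i].has_mate_downstream)
                continue;
            int j = i + r[i].nf + 1;      /* index of the mate (assumed convention) */
            if (j >= nrecs)
                continue;                 /* would really be an error/detached case */
            r[i].rnext = r[j].ref;   r[i].pnext = r[j].pos;
            r[j].rnext = r[i].ref;   r[j].pnext = r[i].pos;
            /* One common TLEN convention: leftmost start to rightmost end, signed. */
            int tlen = r[j].end - r[i].pos + 1;
            r[i].tlen =  tlen;
            r[j].tlen = -tlen;
        }
    }

    int main(void) {
        struct rec r[3] = {
            { 0, 100, 199, 1, 1, -1, -1, 0 },   /* mate is 1 record downstream */
            { 0, 150, 249, 0, 0, -1, -1, 0 },   /* unrelated read              */
            { 0, 300, 399, 0, 0, -1, -1, 0 },   /* the mate                    */
        };
        resolve_pairs(r, 3);
        printf("rec0: PNEXT=%d TLEN=%d\n", r[0].pnext, r[0].tlen);
        printf("rec2: PNEXT=%d TLEN=%d\n", r[2].pnext, r[2].tlen);
        return 0;
    }
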
If however the pairs aren't near each other, or are near but fall either
side of a slice boundary, the CRAM records are labelled as "detached" (a
bit in the CF - cram flags - data series) and the values get stored
verbatim via the MF (mate flags), NS (mate ref ID; RNEXT), NP (mate
position; PNEXT) and TS (template size; TLEN) data series.

There are a few foibles though.  Some software defined TLEN to be template
end-start while others defined it to be end-start+1.  Even worse, there is
dispute over whether it is 5' to 3' locations or first to last (this
distinction can matter when reads overlap each other).  Therefore in order
to avoid round-trip mistakes, Scramble computes the template size to see
if the decoding algorithm would preserve the value in the input file, and
if not it adjusts the CF flag to state the sequence is detached, even when
it is not.  This isn't ideal as it ties the deduplication of RNEXT and
PNEXT to the less stable TLEN field.

When doing lossy read-name compression, it is this read-pairing "detached"
flag which is used to determine which templates can have names generated
on-the-fly during decode and which cannot.  For example a read-pair
spanning two slices, either nearby slices or perhaps long-distance due to
genomic structural rearrangements, will be set as "detached" and both
records get their names stored verbatim.  This means it's easy to do
random access and not break the pairing.  However a read pair held within
a slice will not have the detached flag (*iff* TLEN is valid) and so no
name is stored.  The weakness here should be obvious - we need TLEN (zero
isn't enough) and it has to agree with the algorithm in the published SAM
specification, otherwise we cannot do lossy name compression.  I would
prefer this to be encoded in a different manner.


Auxiliary field decoding
------------------------

Requoting some of the compression header from above:

    TD map:
        0: BCZXDZSMSASS
        1: BCZXDZSMS
        2: BCZXDZSMC
        3: BCZXDZSMCASS
        4: BCZXDZSMCASC

    Record encoding map:
        TL => EXTERNAL {32}

    Tag encoding map:
        SMC => BYTE_ARRAY_LEN {3, 4, 1, 1, 1, 0, 1, 4, 224, 83, 77, 67}
        SMS => BYTE_ARRAY_LEN {3, 4, 1, 2, 1, 0, 1, 4, 224, 83, 77, 83}
        BCZ => BYTE_ARRAY_STOP {9, 224, 66, 67, 90}
        XDZ => BYTE_ARRAY_STOP {9, 224, 88, 68, 90}
        ASS => BYTE_ARRAY_LEN {3, 4, 1, 2, 1, 0, 1, 4, 224, 65, 83, 83}
        ASC => BYTE_ARRAY_LEN {3, 4, 1, 1, 1, 0, 1, 4, 224, 65, 83, 67}

The TD map is the "tag dictionary".  It consists of a series of triplets
matching the BAM encoding formats for auxiliary tags; eg BCZ for BC:Z: and
SMS for SM:i where 0 <= i <= 65535.  TD could get large, but it is bounded
by the number of sequence "reads" in the container and is typically far
smaller.

TL is the tag line, indicating which entry in the dictionary this record
uses.  So TL 1 would imply this record has BC:Z, XD:Z and SM:i tags to be
decoded.  Finally the tag encoding map is used to figure out where each
aux tag in turn is decoded from.  The tag binary format itself is as per
the BAM specification, which makes it quick and easy to convert from CRAM
to BAM in memory.  (A small sketch of this decode flow is shown below.)

It would be possible to sort tags so that BCZSMS and SMSBCZ were the same
dictionary entry, but doing so will change the order of tags output.  This
isn't defined in SAM (or indeed may be explicitly undefined), so it is a
valid thing to do, but it hasn't been done within Scramble.
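Putting TD, TL and the tag encoding map together, a minimal hypothetical
sketch of the per-record aux decode follows.  The dictionary is copied
from the dump above; decode_tag() is an invented stand-in for looking the
three-character key up in the tag encoding map and decoding one value from
the relevant block.

    #include <stdio.h>
    #include <string.h>

    /* TD dictionary taken from the dump above; each entry is a run of
     * 3-character keys (two-letter tag name plus BAM type code). */
    static const char *td[] = {
        "BCZXDZSMSASS", "BCZXDZSMS", "BCZXDZSMC", "BCZXDZSMCASS", "BCZXDZSMCASC",
    };

    /* Invented stand-in: a real decoder would look the key up in the tag
     * encoding map and pull one value from the relevant block. */
    static void decode_tag(const char key[3]) {
        printf("decode %c%c:%c:...\n", key[0], key[1], key[2]);
    }

    /* Decode all aux tags for one record, given its TL value. */
    static void decode_aux(int tl) {
        const char *line = td[tl];
        for (size_t i = 0; i + 3 <= strlen(line); i += 3)
            decode_tag(line + i);
    }

    int main(void) {
        decode_aux(1);   /* TL=1 => BC:Z, XD:Z and SM:i (type 'S') */
        return 0;
    }
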
There are also some additional tag types which are computed on-the-fly.
These are RG (read-group; stored elsewhere in the RG data series as a
numeric index into the SAM @RG lines), MD (an edit string describing how
this sequence differs from the reference) and NM (an integer holding the
number of mismatches to the reference).

Sadly there is no way of knowing whether MD and NM existed in the original
record, therefore they are either always created or never created.  A file
containing a mix of reads with and without these tags will always have
some differences after a round trip.  However they are derived data and
typically computed using tools like "samtools calmd".  There is a small
workaround for this CRAM shortcoming, which is to always store MD and NM
verbatim as actual tags in TD, and then to ask the code not to
auto-compute them.  Obviously this adds to the storage costs.


Read identifiers
----------------

Read IDs are just the RN data series, typically stored using
BYTE_ARRAY_STOP, so a concatenated series of nul terminated read names
that are then compressed per slice using a codec.  The default codec here
will be zlib, but typically bzip2, bsc etc will do far better.  As
mentioned before, read names are NOT deduplicated even for the case where
CRAM knows the reads form a valid pair.  (This is a flaw.)  They could
also often benefit from simple transforms removing common prefixes and
suffixes, although this typically helps zlib only.  The custom names3
codec in some of the submissions (when not being buggy!) is a far more
advanced way of encoding read names.


Quality values
--------------

These are in the CRAM QS data series.  The read length is explicitly
recorded in the RL data series (often constant for Illumina data) so the
size is already known, meaning the quality strings simply abut each other.
The orientation of the quality strings here is as seen within the SAM
file.  This is typically suboptimal, and on some data sets flipping
qualities back to the original read orientation can improve compression by
a couple of percent, although doing so either adds the requirement of
storing an additional bit per record in the QS block or adds a code
dependency on the BF (bam flags) data series.

Typically Scramble ends up using rANS order-1 to encode qualities as there
is minimal redundancy for LZ techniques to work on.  The exception to this
is RNAseq data with high numbers of secondary alignments; a custom codec
to detect whether the QS string perfectly matches the previous QS string
removed almost all of this overhead, but some more exploration is
required.


Final words
-----------

It should be clear that there is a large amount of leeway in which
encoders we use for which data series, which blocks they go to
(multiplexing several into a block or giving each data series its own
block), as well as which compression method to use per block.  Some of
this is tuneable by command line arguments, but there are also
considerably different choices made between the C (Scramble, Samtools) and
the original Java (Cramtools / htsjdk) implementations of CRAM.  For
example the first 1 million records of 9827_2#49.bam convert to CRAM as:

    -rw-r--r-- 1 jkb team117 62266410 Jan 5 17:17 _j.cram    Cramtools
    -rw-r--r-- 1 jkb team117 59006802 Jan 5 17:18 _c.cram    Scramble
    -rw-r--r-- 1 jkb team117 58698854 Jan 5 17:18 _9.cram    Scramble -9

Then there are also size differences gained by changing codecs or block
sizes.
Eg also permitting bzip2 and lzma and upping the block size to 100,000
reads per slice instead of the default 10,000 gives this (still a valid
CRAM file and decodable by the Java cramtools):

    -rw-r--r-- 1 jkb team117 56414688 Jan 5 17:23 _+.cram    Scramble -9jZ -s 100000

[This makes evaluation of the "CRAM size" less clear.  A bugbear of mine
is the lack of parameters, or often even program names and version
numbers, in papers when comparing format X against "CRAM".]

James Bonfield
Author of the C CRAM implementation in Scramble & Htslib/Samtools.
Co-maintainer of the CRAM specification from version 2.1 onwards.
(The original CRAM authors are Markus Fritz and Vadim Zalunin at the EBI,
with Vadim being the author of the Java implementation.)