CRAM documentation
==================

Most of my CE1, CE2 and CE3 submissions revolve around CRAM codecs.  For
CE2 and CE3 these are just single data types (all qualities or all
identifiers), so they are trivial to experiment with on the command line
too, albeit possibly via unix split to get lots of separate files to
simulate keeping the ability to do random access.  CE1 however is more
nebulous, describing sequences and pairings.  These span many CRAM data
series, so a more complete description of how CRAM works is beneficial to
understanding the encoded file formats.

Firstly, see the official CRAM specification:

    https://github.com/samtools/hts-specs/blob/master/CRAMv3.tex
    https://github.com/samtools/hts-specs/blob/master/CRAMv3.pdf

Note: I'm partially responsible for this, but not the original author.
Even so, I ought to have fixed it by now as it's really not an easy
document to read, nor is it at all tightly worded.  Therefore I will
attempt to clarify some of the most glaring omissions here.


ITF8
----

Integer values are written in a variable sized format where the number of
top bits set in the first byte indicates how many additional bytes should
be read.  Thus 0 to 127 fits in 1 byte and anything above this needs more.
A tragic omission was no ability to store negative numbers, so these all
get turned into very large positives!


Data series
-----------

CRAM breaks data by type into "data series".  Trivial examples are "RN"
(read name, aka identifier) and "QS" (quality score), while others are a
bit more specific, such as "DL" (length of a CIGAR "D" - deletion -
operation), "SC" (soft-clipped bases referred to in a CIGAR "S" operation)
and "NF" (number of records to the next fragment, used when both members
of a read-pair fall within the same CRAM slice).

Note: Each read is processed in a breadth first fashion; we encode all
data series for the first read, then all data series for the second read,
and so on.  There are conditions that indicate whether or not a data
series is appended to (for example the "DL" listed above is only appended
to if the record contains a deletion), thus the flow of data is tightly
bound to the program flow logic too.  A sketch of this per-record flow
follows.
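To make the "program flow" point concrete, below is a minimal hypothetical
sketch of a per-record encode loop.  The record struct and the put_*
helpers are invented for illustration (they are not the real
Scramble/htslib API); the point is simply that a data series such as DL is
only appended to when the record actually contains a deletion feature.

    #include <stdio.h>

    /* Hypothetical per-record encode flow; all names here are illustrative
     * only and are not the real Scramble/htslib API.  The put_* stubs stand
     * in for appending one value to the named CRAM data series. */
    static void put_int  (const char *ds, int v)         { printf("%s=%d\n", ds, v); }
    static void put_byte (const char *ds, char c)        { printf("%s=%c\n", ds, c); }
    static void put_bytes(const char *ds, const char *s) { printf("%s=%s\n", ds, s); }

    struct feature { char code; int pos, val; const char *seq; };
    struct record  { int flag, len, pos, nfeat; const struct feature *feat; };

    static void encode_record(const struct record *r) {
        put_int("BF", r->flag);                /* BAM flag                     */
        put_int("RL", r->len);                 /* read length                  */
        put_int("AP", r->pos);                 /* position (typically a delta) */
        put_int("FN", r->nfeat);               /* number of features (edits)   */
        for (int i = 0; i < r->nfeat; i++) {
            put_byte("FC", r->feat[i].code);   /* feature type                 */
            put_int ("FP", r->feat[i].pos);    /* position delta from last     */
            switch (r->feat[i].code) {
            case 'X': put_int  ("BS", r->feat[i].val); break; /* substitution code */
            case 'D': put_int  ("DL", r->feat[i].val); break; /* deletion length   */
            case 'S': put_bytes("SC", r->feat[i].seq); break; /* soft-clip bases   */
            }
        }
        /* QS, RN, mate fields, aux tags and so on follow in the same fashion. */
    }

    int main(void) {
        const struct feature f[] = { { 'X', 51, 0, NULL }, { 'D', 12, 3, NULL } };
        const struct record  r   = { 0x49, 100, 1, 2, f };
        encode_record(&r);
        return 0;
    }
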
containers, slices, blocks
--------------------------

CRAM files consist of a header followed by a series of containers.  Each
container has a compression header and one or more slices.  Each slice
(and indeed the container compression header) is stored in a series of
blocks.

The compression header holds information about how each data series is
encoded and which blocks it is stored in.  Technically we can mix multiple
data series into the same block (which is when it becomes important to
identify the program flow order), but practically speaking this is rarely
of benefit to compression efficiency.

The cram_dump tool can demonstrate this:

    Container pos 573 size 6045659
        Ref id:        0
        Ref pos:       1 + 35030
        Rec counter:   0
        No. recs:      100000
        No. bases      15000000
        No. blocks:    27
        No. landmarks: 1 {325}

    Container_header block pos 598
        Preservation map:
            RN => 1 (0x1)
            SM => 11410149 (0xae1ae5)
            TD => 11410087 (0xae1aa7)
            AP => 1 (0x1)
        Substitution map:
            A: CGTN
            C: AGTN
            G: ACTN
            T: ACGN
            N: ACGT
        TD map:
            0: BCZXDZSMSASS
            1: BCZXDZSMS
            2: BCZXDZSMC
            3: BCZXDZSMCASS
            4: BCZXDZSMCASC
        Record encoding map:
            FN => EXTERNAL {26}
            FP => EXTERNAL {28}
            RL => HUFFMAN {1, 128, 150, 1, 0}
            RN => BYTE_ARRAY_STOP {0, 11}
            QS => EXTERNAL {12}
            BA => EXTERNAL {30}
            BB => BYTE_ARRAY_LEN {1, 1, 42, 1, 1, 37}
            TL => EXTERNAL {32}
            IN => BYTE_ARRAY_STOP {0, 13}
            BF => EXTERNAL {15}
            MF => EXTERNAL {21}
            TS => EXTERNAL {22}
            CF => EXTERNAL {16}
            NF => EXTERNAL {24}
            AP => EXTERNAL {17}
            FC => EXTERNAL {27}
            MQ => EXTERNAL {19}
            DL => EXTERNAL {29}
            BS => EXTERNAL {31}
            NP => EXTERNAL {23}
            SC => BYTE_ARRAY_STOP {0, 14}
            NS => EXTERNAL {20}
            RG => HUFFMAN {1, 0, 1, 0}
            RI => HUFFMAN {1, 0, 1, 0}
        Tag encoding map:
            SMC => BYTE_ARRAY_LEN {3, 4, 1, 1, 1, 0, 1, 4, 224, 83, 77, 67}
            SMS => BYTE_ARRAY_LEN {3, 4, 1, 2, 1, 0, 1, 4, 224, 83, 77, 83}
            BCZ => BYTE_ARRAY_STOP {9, 224, 66, 67, 90}
            XDZ => BYTE_ARRAY_STOP {9, 224, 88, 68, 90}
            ASS => BYTE_ARRAY_LEN {3, 4, 1, 2, 1, 0, 1, 4, 224, 65, 83, 83}
            ASC => BYTE_ARRAY_LEN {3, 4, 1, 1, 1, 0, 1, 4, 224, 65, 83, 67}

The tag encoding map here is for the XX:Y:data auxiliary tags in SAM, eg
SM:i:100 (100 being 1 byte so "C" in BAM) or BC:Z:foo.

EXTERNAL, BYTE_ARRAY_STOP and BYTE_ARRAY_LEN are encoding instructions
indicating how the data is laid out.  EXTERNAL is used for arrays where
the caller already knows the size (so it is not an explicit part of the
stored data) or for single items such as a character or an integer.
BYTE_ARRAY_STOP and BYTE_ARRAY_LEN are methods of storing variable sized
data where the size is also one of the items being stored, either as, say,
a nul terminated string or where the length itself has its own distinct
encoding instructions.

There is also a "CORE" block, mostly now unused, which worked on a bunch
of bit-packers in CRAM for HUFFMAN, BETA, ELIAS-GAMMA, SUB-EXPONENTIAL,
etc.  Of these we typically now only use HUFFMAN, and again only for the
special case of zero-length codes which are used to define constants.

Typically we're only using one slice per container, although the code does
permit multiples.  In theory this would be used to reduce the overheads of
the compression header while keeping high granularity (per slice) of
random access.

The slice is dumped out by cram_dump as:

    Slice 1/1, container offset 325, file offset 923
        Slice content type MAPPED_SLICE
        Slice ref seq    0
        Slice ref start  1
        Slice ref span   35030
        Slice MD5        209abd0aa842316600786525e3a815f3
        Rec counter      0
        No. records      100000
        No. blocks       25
        Blk IDS:         {11, 12, 15, 16, 17, 19, 20, 21, 22, 23, 24, 26,
                          27, 28, 29, 30, 31, 32, 4281155, 4281171,
                          5459283, 4342618, 5784666, 5459267}
        Ref base id:     -1

    Block 1/25
        Size:         0 comp / 0 uncomp
        Method:       RAW
        Content type: CORE
        Content id:   0

    Block 2/25
        Size:         476489 comp / 1872622 uncomp
        Method:       BSC?
        Content type: EXTERNAL
        Content id:   11
        Keys:         RN
        (\265/\375\240\356\222\034\000\254A\004zp\023b\016\260\301\006\3143"\262\273\273\222\236/\345\007\026\006\031\006!\006\017\241\036\241\306\360[&z\004\011\213\344\335l\342\213\030)\016=\020k\247\305H\213\036d\\207v\006\242W\205\360\021\302\301\010\325 "#\206Z\237c\012\275B\226\204b\015\313\340F5M\336\260

    Block 3/25
        Size:         4959119 comp / 15000000 uncomp

    ...

Here we see it has listed all the blocks referred to in the compression
header.
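To tie the encoding map and the content ids together: the RN entry above,
BYTE_ARRAY_STOP {0, 11}, means read names come from the external block
with content id 11, reading bytes until the stop byte (0) is hit.  A
minimal sketch of that decode step is below; the ext_block struct is an
invented stand-in for the real (already uncompressed) htslib block
structures, which differ.

    #include <stdio.h>
    #include <string.h>
    #include <stddef.h>

    /* Invented stand-in for an already-uncompressed external block. */
    struct ext_block { int content_id; const unsigned char *data; size_t size, pos; };

    /* Decode one BYTE_ARRAY_STOP item: copy bytes from the block until the
     * stop byte is found, returning the item length or -1 on error. */
    static ptrdiff_t byte_array_stop(struct ext_block *b, unsigned char stop,
                                     unsigned char *out, size_t out_size) {
        const unsigned char *start = b->data + b->pos;
        const unsigned char *end   = memchr(start, stop, b->size - b->pos);
        if (!end)
            return -1;                        /* malformed: no stop byte found */
        size_t len = (size_t)(end - start);
        if (len > out_size)
            return -1;
        memcpy(out, start, len);
        b->pos += len + 1;                    /* consume the stop byte as well */
        return (ptrdiff_t)len;
    }

    int main(void) {
        /* Two nul-separated read names, as they might sit in block id 11. */
        const unsigned char names[] = "HS25_09827:2:1204:18768:54085#49\0read2";
        struct ext_block b = { 11, names, sizeof(names) - 1, 0 };
        unsigned char rn[256];
        ptrdiff_t n = byte_array_stop(&b, '\0', rn, sizeof(rn));
        if (n >= 0)
            printf("RN = %.*s\n", (int)n, (const char *)rn);
        return 0;
    }
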
Each block has a content id number (also referred to in the various
EXTERNAL, BYTE_ARRAY_STOP and BYTE_ARRAY_LEN encoding meta-data), and a
"method", which is essentially a compression type; eg gzip, bzip2, bsc,
rANS, etc.


Block encoders
--------------

As described earlier, the actual CRAM blocks themselves are then encoded
using what CRAM refers to as an "external" codec.  These started as just
zlib and bzip2, but then progressed to lzma and rANS (a custom part of
CRAM), and in my submission I added libbsc, zstd and custom quality and
name codecs.  The interface for these is simply a byte stream; data in to
data out.  There are no block transforms, so these would need to be folded
into any custom codecs.

rANS is probably the only truly novel part of vanilla CRAM.  It was
derived from the initial rANS implementation by Fabian Giesen
(https://github.com/rygorous/ryg_rans) but has since been substantially
revised and sped up by myself (https://github.com/jkbonfield/rans_static).
CRAM's rANS has both order-0 and order-1 implementations.  It is strictly
a static entropy encoder, requiring a frequency table to be stored within
the block.  Practically speaking this was a mistake; it should have been
stored in the compression header, permitting efficient compression using
many slices per container, which in turn would make static order-2
compression achievable too (although dynamic table updates likely become
better at that stage).

The rANS used within CRAM is still an earlier draft, specifically the
"rANS_static4x8.c" algorithm in github.  I have AVX2 and AVX512
implementations of this algorithm, but they are not binary compatible with
the original rANS.  These however run much faster; for example order-0
quality value decoding runs at ~1400 MB/s and order-1 decoding at 870-1060
MB/s, tested on a 3.2GHz Intel i5 4570.  To achieve that speed they
interleave decoding from 32 portions of the block.  The original CRAM
method only has 4-way interleaving, but it's still a quick entropy codec.


Partial decoding
----------------

CRAM can be sliced two ways.  The slice size is the smallest element of
random access across the genome, assuming sorted data.  There is also the
optimisation of only decoding some data series while ignoring others.  For
example, for "samtools flagstat" we know we only need to decode the BAM
FLAG column, held in the BF CRAM data series and possibly amended for
read-pairs via the CF, NF, etc data series.

In practice this is complex.  If every data series has been written to its
own "external" block then this becomes easy - we map SAM columns to the
related CRAM data series and then uncompress the appropriate blocks,
ignoring the others.  The API always returns a valid in-memory BAM record,
but some fields will be replaced with place-holder values if they were not
requested to be decoded.

If however the CRAM file, as is permitted, mixes multiple data series
together into the same block (either multiple series into the CORE block
or several sharing the same external block) then the decoder needs to
understand the interdependency between data types.  For example if BF (bam
flag) and DL (deletion length) shared the same block then to decode BF we
also have to be decoding DL, which in turn means we need to decode all the
other CIGAR fields, bringing in an additional dependency on FN and FC.
Possibly the program code itself has no option for decoding without
interpreting, which then brings in other dependencies on BS (base
substitution) etc.  While achievable (it's implemented in Scramble), this
isn't pretty and it requires iteration until the full effect of
data-series dependencies has been resolved.  The recommendation therefore
is: don't ever mix multiple data series in the same block.  A toy
illustration of that dependency expansion is sketched below.
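As a toy illustration of that dependency expansion (this is not the
Scramble code; the data series names are real but the block layout, the
hard-coded DL->FN/FC rule and everything else is invented), the decoder
keeps growing the set of required data series until nothing new is pulled
in:

    #include <stdio.h>
    #include <string.h>

    struct ds { const char *name; int block; int required; };

    int main(void) {
        struct ds series[] = {
            { "BF", 1, 1 },   /* requested: BAM flags               */
            { "DL", 1, 0 },   /* shares block 1 with BF             */
            { "FN", 2, 0 },   /* needed to make sense of DL         */
            { "FC", 2, 0 },
            { "QS", 3, 0 },   /* unrelated; never becomes required  */
        };
        int n = sizeof(series) / sizeof(series[0]);

        /* Grow the required set until a fixed point: anything sharing a block
         * with a required series becomes required, and (as one invented example
         * of a program-flow dependency) DL additionally drags in FN and FC. */
        int changed = 1;
        while (changed) {
            changed = 0;
            for (int i = 0; i < n; i++) {
                if (!series[i].required)
                    continue;
                for (int j = 0; j < n; j++)
                    if (!series[j].required && series[j].block == series[i].block)
                        series[j].required = changed = 1;
                if (strcmp(series[i].name, "DL") == 0)
                    for (int j = 0; j < n; j++)
                        if (!series[j].required &&
                            (!strcmp(series[j].name, "FN") || !strcmp(series[j].name, "FC")))
                            series[j].required = changed = 1;
            }
        }

        for (int i = 0; i < n; i++)
            if (series[i].required)
                printf("must decode %s\n", series[i].name);
        return 0;
    }
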
Decoding
--------

CRAM processes one read at a time, decoding the necessary fields in turn.
The order isn't explicitly described in the specification, but basically
it's the same order as the keys in the tables above.  Cram_dump can show
this too.  Eg:

    Rec 11/10000 at 1,4
    BF = 73 => SAM 0x49 (ret 0, out_sz 1)
    CF = 3 (ret 0, out_sz 1)
    RL = 100 (ret 0, out_sz 1)
    AP = 1 (ret 0, out_sz 1)
    RG = 0 (ret 0, out_sz 1)
    RN = HS25_09827:2:1204:18768:54085#49 (ret 0, out_sz 32)
    Detached
    MF = 2 (ret 0, out_sz 1)
    NS = 0 (ret 0, out_sz 1)
    NP = 10011 (ret 0, out_sz 1)
    TS = 0 (ret 0, out_sz 1)
    TL = 7 (ret 0, out_sz 1)
    0: TN= X0c
    ...
    FN = 3 (ret 0, out_sz 1)
    0: FC = X (ret 0, out_sz 1)
    0: FP = 51+0 = 51 (ret 0, out_sz 1)
    0: BS = 0 (ret 0)
    1: FC = X (ret 0, out_sz 1)
    1: FP = 12+51 = 63 (ret 0, out_sz 1)
    1: BS = 0 (ret 0)
    2: FC = X (ret 0, out_sz 1)
    2: FP = 24+63 = 87 (ret 0, out_sz 1)
    2: BS = 0 (ret 0)
    MQ = 0 (ret 0, out_sz 1)
    QS = CABCFGFEIFFGGGGFGGGFGGIEHHFFEHHHFEEGFHB@FHFEACFGFH@?FFGECFGGFG-HDBHG4CHFFFCDFG8HGCIHD;,C@@7G>5F5AH,2

We can see the data series being pulled: BF (bam flags), CF (cram flags),
RL (read length - likely a constant), AP (alignment position - typically a
delta to the last one), etc.  FN is the "feature number" and indicates how
many edits have been made with respect to the reference.  Each "feature"
then has a type (FC; all "X" here for substitution), a position relative
to the last feature (FP) and potentially additional meta data (BS - the
base substitution value itself).


Seq decoding
------------

The observation here is that if we store the new base when it differs from
the reference then, assuming an even distribution and simplifying to ACGT
only, we'll end up at around 2 bits per edited base.  However we know we
wouldn't store a "C" substitution if the reference base was already "C",
so in fact we only have 3 possibilities per substitution instead of 4.
Hence we use a 2D lookup table where BS, the substitution feature value,
is the X axis and the reference base is the Y axis.  The exact table
itself can be optimised per file and is recorded in the compression
header.  From above:

    Substitution map:
        A: CGTN
        C: AGTN
        G: ACTN
        T: ACGN
        N: ACGT

So ref G and BS 2 would indicate a G->T substitution.  For non-ACGTN ref
or seq bases we use the "BA" data series and store the bases verbatim.
Similarly for inserted bases or soft-clips.


Seq pairing
-----------

If a template has been sequenced as a pair it will likely have two
alignment records in SAM.  The SAM format has RNEXT, PNEXT and TLEN fields
to describe how these pairs relate to one another - which reference the
mate matches, at what position, and the observed template length.  (The
definition of TLEN however is hotly disputed!)

When this occurs, the reads *usually* are near each other on the genome
and thus both records of the pair often occur within the same CRAM slice.
This in turn permits RNEXT, PNEXT and TLEN to be computed on-the-fly
during the CRAM decode.  To do so, all CRAM stores is the number of
records between this read and its mate, via the NF data series.  Rather
bizarrely however the read-name is *not* deduplicated via this method.
(I've no idea why - not my format!)  A sketch of this on-the-fly mate
resolution follows.
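Below is a hedged sketch of that on-the-fly mate resolution.  The record
layout, the assumption that NF counts the records strictly between a read
and its mate, and the TLEN convention used (leftmost start to rightmost
end, signed) are all illustrative choices here, not a restatement of the
specification.

    #include <stdio.h>

    /* Illustrative only: a cut-down record with just the fields needed to
     * show how RNEXT/PNEXT/TLEN can be reconstructed from NF in one slice. */
    struct rec { int ref, pos, end, has_mate_downstream, nf;
                 int rnext, pnext, tlen; };

    static void resolve_pairs(struct rec *r, int nrecs) {
        for (int i = 0; i < nrecs; i++) {
            if (!r[i].has_mate_downstream)
                continue;
            int j = i + r[i].nf + 1;      /* index of the mate (assumed convention) */
            if (j >= nrecs)
                continue;                 /* would really be an error/detached case */
            r[i].rnext = r[j].ref;   r[i].pnext = r[j].pos;
            r[j].rnext = r[i].ref;   r[j].pnext = r[i].pos;
            /* One common TLEN convention: leftmost start to rightmost end, signed. */
            int tlen = r[j].end - r[i].pos + 1;
            r[i].tlen =  tlen;
            r[j].tlen = -tlen;
        }
    }

    int main(void) {
        struct rec r[3] = {
            { 0, 100, 199, 1, 1, -1, -1, 0 },   /* mate is 1 record downstream */
            { 0, 150, 249, 0, 0, -1, -1, 0 },   /* unrelated read              */
            { 0, 300, 399, 0, 0, -1, -1, 0 },   /* the mate                    */
        };
        resolve_pairs(r, 3);
        printf("rec0: PNEXT=%d TLEN=%d\n", r[0].pnext, r[0].tlen);
        printf("rec2: PNEXT=%d TLEN=%d\n", r[2].pnext, r[2].tlen);
        return 0;
    }
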
If however the pairs aren't near each other, or are near but fall either
side of a slice boundary, the CRAM records are labelled as "detached" (a
bit in the CF - cram flags - data series) and the values get stored
verbatim via the MF (mate flags), NS (mate ref ID; RNEXT), NP (mate
position; PNEXT) and TS (template size; TLEN) data series.

There are a few foibles though.  Some software defined TLEN to be template
end-start while others defined it to be end-start+1.  Even worse, there is
dispute over whether it is 5' to 3' locations or first to last (this
distinction can matter when reads overlap each other).  Therefore in order
to avoid round-trip mistakes, Scramble computes the template size to see
if the decoding algorithm would preserve the value in the input file, and
if not it adjusts the CF flag to state the sequence is detached, even when
it is not.  This isn't ideal as it ties the deduplication of RNEXT and
PNEXT to the less stable TLEN field.

When doing lossy read-name compression, it is this read-pairing "detached"
flag which is used to determine which templates can have names generated
on-the-fly during decode and which cannot.  For example a read-pair
spanning two slices, either nearby slices or perhaps long-distance due to
genomic structural rearrangements, will be set as "detached" and both
records get their names stored verbatim.  This means it's easy to do
random access and not break the pairing.  However a read pair held within
a slice will not have the detached flag (*iff* TLEN is valid) and so no
name is stored.  The weakness here should be obvious - we need TLEN (zero
isn't enough) and it has to agree with the algorithm in the published SAM
specification, otherwise we cannot do lossy name compression.  I would
prefer this to be encoded in a different manner.


Auxiliary field decoding
------------------------

Requoting some of the compression header from above:

    TD map:
        0: BCZXDZSMSASS
        1: BCZXDZSMS
        2: BCZXDZSMC
        3: BCZXDZSMCASS
        4: BCZXDZSMCASC

    Record encoding map:
        TL => EXTERNAL {32}

    Tag encoding map:
        SMC => BYTE_ARRAY_LEN {3, 4, 1, 1, 1, 0, 1, 4, 224, 83, 77, 67}
        SMS => BYTE_ARRAY_LEN {3, 4, 1, 2, 1, 0, 1, 4, 224, 83, 77, 83}
        BCZ => BYTE_ARRAY_STOP {9, 224, 66, 67, 90}
        XDZ => BYTE_ARRAY_STOP {9, 224, 88, 68, 90}
        ASS => BYTE_ARRAY_LEN {3, 4, 1, 2, 1, 0, 1, 4, 224, 65, 83, 83}
        ASC => BYTE_ARRAY_LEN {3, 4, 1, 1, 1, 0, 1, 4, 224, 65, 83, 67}

The TD map is the "tag dictionary".  It consists of a series of triplets
matching the BAM encoding formats for auxiliary tags; eg BCZ for BC:Z: and
SMS for SM:i where 0 <= i <= 65535.  TD could get large, but it is bounded
by the number of sequence "reads" in the container and is typically far
smaller.

TL is the tag line, indicating which entry in the dictionary this record
uses.  So TL 1 would imply this record has BC:Z, XD:Z and SM:i tags to be
decoded.  Finally the tag encoding map is used to figure out where each
aux tag in turn is decoded from.  The tag binary format itself is as per
the BAM specification, which makes it quick and easy to convert from CRAM
to BAM in memory.  (A small sketch of this decode flow is shown below.)

It would be possible to sort tags so that BCZSMS and SMSBCZ were the same
dictionary entry, but doing so will change the order of tags output.  This
isn't defined in SAM (or indeed may be explicitly undefined), so it is a
valid thing to do, but it hasn't been done within Scramble.
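Putting TD, TL and the tag encoding map together, a minimal hypothetical
sketch of the per-record aux decode follows.  The dictionary is copied
from the dump above; decode_tag() is an invented stand-in for looking the
three-character key up in the tag encoding map and decoding one value from
the relevant block.

    #include <stdio.h>
    #include <string.h>

    /* TD dictionary taken from the dump above; each entry is a run of
     * 3-character keys (two-letter tag name plus BAM type code). */
    static const char *td[] = {
        "BCZXDZSMSASS", "BCZXDZSMS", "BCZXDZSMC", "BCZXDZSMCASS", "BCZXDZSMCASC",
    };

    /* Invented stand-in: a real decoder would look the key up in the tag
     * encoding map and pull one value from the relevant block. */
    static void decode_tag(const char key[3]) {
        printf("decode %c%c:%c:...\n", key[0], key[1], key[2]);
    }

    /* Decode all aux tags for one record, given its TL value. */
    static void decode_aux(int tl) {
        const char *line = td[tl];
        for (size_t i = 0; i + 3 <= strlen(line); i += 3)
            decode_tag(line + i);
    }

    int main(void) {
        decode_aux(1);   /* TL=1 => BC:Z, XD:Z and SM:i (type 'S') */
        return 0;
    }
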
There are also some additional tag types which are computed on-the-fly.
These are RG (read-group; stored elsewhere in the RG data series as a
numeric index into the SAM @RG lines), MD (an edit string describing how
this sequence differs from the reference) and NM (an integer holding the
number of mismatches to the reference).

Sadly there is no way of knowing whether MD and NM existed in the original
record, therefore they are either always created or never created.  A file
containing a mix of reads with and without these tags will always have
some differences after a round trip.  However they are derived data and
typically computed using tools like "samtools calmd".  There is a small
workaround for this CRAM shortcoming, which is to always store MD and NM
verbatim as actual tags in TD, and then to ask the code not to
auto-compute them.  Obviously this adds to the storage costs.


Read identifiers
----------------

Read IDs are just the RN data series, typically stored using
BYTE_ARRAY_STOP, so a concatenated series of nul terminated read names
that are then compressed per slice using a codec.  The default codec here
will be zlib, but typically bzip2, bsc etc will do far better.  As
mentioned before, read names are NOT deduplicated even for the case where
CRAM knows the reads form a valid pair.  (This is a flaw.)  They could
also often benefit from simple transforms removing common prefixes and
suffixes, although this typically helps zlib only.  The custom names3
codec in some of the submissions (when not being buggy!) is a far more
advanced way of encoding read names.


Quality values
--------------

These are in the CRAM QS data series.  The read length is explicitly
recorded in the RL data series (often constant for Illumina data) so the
size is already known, meaning the quality strings simply abut each other.
The orientation of the quality strings here is as seen within the SAM
file.  This is typically suboptimal, and on some data sets flipping
qualities back to the original read orientation can improve compression by
a couple of percent, although doing so either adds the requirement of
storing an additional bit per record in the QS block or adds a code
dependency on the BF (bam flags) data series.

Typically Scramble ends up using rANS order-1 to encode qualities as there
is minimal redundancy for LZ techniques to work on.  The exception to this
is RNAseq data with high numbers of secondary alignments; a custom codec
to detect whether the QS string perfectly matches the previous QS string
removed almost all of this overhead, but some more exploration is
required.


Final words
-----------

It should be clear that there is a large amount of leeway in which
encoders we use for which data series, which blocks they go to
(multiplexing several into a block or giving each data series its own
block), as well as which compression method to use per block.  Some of
this is tuneable by command line arguments, but there are also
considerably different choices made between the C (Scramble, Samtools) and
the original Java (Cramtools / htsjdk) implementations of CRAM.  For
example the first 1 million records of 9827_2#49.bam convert to CRAM as:

    -rw-r--r-- 1 jkb team117 62266410 Jan 5 17:17 _j.cram    Cramtools
    -rw-r--r-- 1 jkb team117 59006802 Jan 5 17:18 _c.cram    Scramble
    -rw-r--r-- 1 jkb team117 58698854 Jan 5 17:18 _9.cram    Scramble -9

Then there are also size differences gained by changing codecs or block
sizes.
Eg also permitting bzip2 and lzma and upping the block size to 100,000
reads per slice instead of the default 10,000 gives this (still a valid
CRAM file and decodable by the Java cramtools):

    -rw-r--r-- 1 jkb team117 56414688 Jan 5 17:23 _+.cram    Scramble -9jZ -s 100000

[This makes evaluation of the "CRAM size" less clear.  A bugbear of mine
is the lack of parameters, or often even program names and version
numbers, in papers when comparing format X against "CRAM".]

James Bonfield
Author of the C CRAM implementation in Scramble & Htslib/Samtools.
Co-maintainer of the CRAM specification from version 2.1 onwards.
(The original CRAM authors are Markus Fritz and Vadim Zalunin at the EBI,
with Vadim being the author of the Java implementation.)