Musings
-------

These are some rather random musings on what I find desirable in a file
format.  It isn't exhaustive and is simply a brain dump.  Perhaps as a
community we could agree on which features are desirable (maybe with a
weight), so we can measure how well each file format fits.

1. Containerised

   a) Random access granularity needs to be configurable.
      - Best compression for long-term archival is probably large blocks.
      - Small blocks are better for workloads that cherry-pick individual
        regions, eg evaluating known SNPs.  BAM typically holds 500-1000
        records per BGZF block.  CRAM defaults to 10,000 records per slice,
        but users have reported that 1000 per slice is necessary to get
        BAM-equivalent performance in random-access scenarios.

   b) Possibly a multi-level container, with compression metrics in the
      outer level and random access possible at the inner level.  May trade
      off size vs random access.  (A sketch appears at the end of this
      section.)
      - Eg static frequency tables in the outer container, in blocks of say
        100,000 records.
      - 100 inner containers per outer, each containing 1000 records.
        Hence random access per 1000 records, at a cost of 2 seeks and
        reads instead of 1.

2. Indexing

   a) Requires a spatial index (eg R-tree, nested containment list; a small
      sketch appears at the end of this section).
      - Sequences are not "point" objects; they have lengths.
      - We need to know which sequences overlap a region query, rather than
        simply start beyond it.
      - Consider the case of mixing many short sequences (eg Illumina) with
        a few long sequences (ONT, PacBio).

   b) Index the index itself, so large indices don't have to be entirely
      loaded into memory before querying.

   c) Self-indexing formats are most desirable, so the index cannot become
      detached from the file.

3. Security

   a) Encrypted files at rest.
      - Possibility to encrypt per chromosome, or per region?  (Use case:
        consider EBI's EGA vs ENA archives.)
      - Consider the ChrY SNPs correlated with surnames (partial
        deanonymisation), where we may want to grant access to everything
        bar ChrY.  ChrY is a known case, but at any point we may discover
        another problematic region.
      - Traceability.  Can we discover who leaked our data?

   b) Data validity.  (A block checksum / resync sketch appears at the end
      of this section.)
      - Checksums; what's acceptable?
      - Error recovery?
      - Resync points in case of lost fragments.  Either deliberate, or a
        format with enough known plaintext to allow auto-detection of block
        boundaries.
      - EOF marker; being able to detect when we ran out of disk space or
        truncated a file.  Truncations *normally* mean corrupted data, but
        if dealing with 100,000s of files this isn't guaranteed for all.
      - Clear magic numbers for file type detection.
      - Signing - can we detect fake data?  Can we validate the author?

4. Data access patterns

   - Slice by region.
     - Either chromosome:start-end (aligned) or record N to record M
       (unaligned).
   - Slice by data type.
     - All data.
     - Minus (specific?) optional auxiliary tags.
     - Minus quality values?
   - Desire the ability to create valid data streams after slicing, without
     expensive transcoding or decompression & recompression.  (A block-copy
     sketch appears at the end of this section.)
     - Eg pick a region and the data to elide, and produce the new format:

           filter -r chr10:20000-30000 --no_qual --no_aux < in.foo > out.foo

       where in.foo and out.foo are both valid "foo" data streams.
     - This is ideal for server-based workloads (eg the GA4GH streaming
       API), where the client gives hints to the server in order to reduce
       the size of the downloaded file.
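
A minimal sketch of the two-level container idea in 1b, under invented
assumptions: the outer container holds shared metadata (standing in for
static frequency tables) plus an offset table, and each inner block of
~1000 records is compressed independently.  Fetching one inner block then
costs two seeks and reads, one for the outer header and one for the block
itself.  The field layout, sizes and use of zlib are illustrative only, not
a proposal for any real format.

    import json
    import struct
    import zlib

    RECORDS_PER_INNER = 1000   # inner-block granularity, as in 1b


    def write_outer(fh, records, shared_metadata):
        """Append one outer container of text records; return its start offset."""
        start = fh.tell()
        blobs = [zlib.compress("\n".join(records[i:i + RECORDS_PER_INNER]).encode())
                 for i in range(0, len(records), RECORDS_PER_INNER)]
        meta = json.dumps(shared_metadata).encode()   # eg static frequency tables

        # Outer header: meta length, meta, block count, then (offset, size) pairs.
        table, pos = [], 0
        for b in blobs:
            table.append((pos, len(b)))
            pos += len(b)
        header = struct.pack("<I", len(meta)) + meta + struct.pack("<I", len(blobs))
        header += b"".join(struct.pack("<QQ", off, size) for off, size in table)

        fh.write(struct.pack("<I", len(header)) + header)
        fh.write(b"".join(blobs))
        return start


    def read_inner(fh, outer_start, inner_idx):
        """Return (shared_metadata, records) for one inner block: 2 seeks/reads."""
        fh.seek(outer_start)                                    # seek + read 1
        (hlen,) = struct.unpack("<I", fh.read(4))
        header = fh.read(hlen)

        (mlen,) = struct.unpack_from("<I", header, 0)
        meta = json.loads(header[4:4 + mlen])
        off, size = struct.unpack_from("<QQ", header, 8 + mlen + 16 * inner_idx)

        fh.seek(outer_start + 4 + hlen + off)                   # seek + read 2
        records = zlib.decompress(fh.read(size)).decode().split("\n")
        return meta, records

With 100 inner blocks per outer container, the shared metadata is amortised
over 100,000 records while random access stays at 1000-record granularity,
which is the size-vs-access trade-off described above.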
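
For the overlap queries in 2a, here is a small nested containment list
sketch (one of the structures named above).  It is purely illustrative: a
real index would be built over compressed blocks or slices rather than
individual records, and the class and field names are invented.  The point
it demonstrates is that a record overlaps a query if it starts before the
query end *and* ends after the query start; a plain "first record at or
after position X" lookup misses this once long ONT/PacBio reads are mixed
with short Illumina reads.

    import bisect
    from collections import namedtuple

    Interval = namedtuple("Interval", "start end data")   # half-open [start, end)


    class NCList:
        """Minimal nested containment list supporting overlap queries."""

        def __init__(self, intervals):
            # Sort so any containing interval precedes the intervals it contains.
            ivs = sorted(intervals, key=lambda iv: (iv.start, -iv.end))
            self.top = []        # intervals not contained in any other
            self.children = {}   # id(parent interval) -> contained intervals
            stack = []           # chain of currently "open" containing intervals
            for iv in ivs:
                while stack and stack[-1].end < iv.end:
                    stack.pop()              # top of stack does not contain iv
                sublist = (self.children.setdefault(id(stack[-1]), [])
                           if stack else self.top)
                sublist.append(iv)
                stack.append(iv)

        def overlaps(self, qstart, qend, sublist=None, out=None):
            """Return every stored interval overlapping [qstart, qend)."""
            out = [] if out is None else out
            sublist = self.top if sublist is None else sublist
            # Within a sublist, no interval contains another, so ends ascend
            # with starts; recomputed per call here only for brevity.
            ends = [iv.end for iv in sublist]
            i = bisect.bisect_right(ends, qstart)  # first interval ending after qstart
            while i < len(sublist) and sublist[i].start < qend:
                out.append(sublist[i])
                self.overlaps(qstart, qend,
                              self.children.get(id(sublist[i]), []), out)
                i += 1
            return out


    # Both reads overlap the query, even though the long read starts 50kb
    # before it; a start-position-only index would report just the short one.
    ncl = NCList([Interval(100, 200, "illumina"), Interval(50, 50000, "ont")])
    print(ncl.overlaps(150, 160))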
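
For the checksum, resync and EOF-marker bullets in 3b, a sketch under
invented framing assumptions: every block starts with a fixed sync marker
and carries its own length and CRC32, and a distinct EOF marker closes the
file (much as BGZF ends with a fixed empty block).  A reader can then flag
per-block corruption, resync by scanning for the next marker, and
distinguish a clean end of file from truncation.  The marker bytes, layout
and choice of CRC32 are illustrative only.

    import struct
    import zlib

    SYNC = b"\xf0FOOBLK\x9c"   # hypothetical 8-byte block marker
    EOF = b"\xf0FOOEOF\x9c"    # hypothetical 8-byte end-of-file marker


    def write_blocks(fh, blocks):
        for payload in blocks:
            fh.write(SYNC)
            fh.write(struct.pack("<I", len(payload)))
            fh.write(struct.pack("<I", zlib.crc32(payload)))
            fh.write(payload)
        fh.write(EOF)                      # absence of this implies truncation


    def read_blocks(fh):
        """Yield (status, payload); status is 'ok', 'corrupt' or 'truncated'."""
        data = fh.read()
        pos = 0
        while True:
            if data[pos:pos + len(EOF)] == EOF:
                return                     # clean end of file
            if data[pos:pos + len(SYNC)] != SYNC:
                # Lost our place: resync by scanning for the next known marker.
                nxt = data.find(SYNC, pos + 1)
                if nxt == -1:
                    yield "truncated", b""
                    return
                yield "corrupt", data[pos:nxt]
                pos = nxt
                continue
            hdr = data[pos + len(SYNC):pos + len(SYNC) + 8]
            if len(hdr) < 8:
                yield "truncated", b""
                return
            size, crc = struct.unpack("<II", hdr)
            payload = data[pos + len(SYNC) + 8:pos + len(SYNC) + 8 + size]
            if len(payload) < size:
                yield "truncated", payload
                return
            yield ("ok" if zlib.crc32(payload) == crc else "corrupt"), payload
            pos += len(SYNC) + 8 + size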
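
Finally, the no-transcode slicing under item 4: if each data series (names,
sequences, qualities, auxiliary tags) lives in its own compressed block
tagged with a type code, then the --no_qual / --no_aux style of filtering
reduces to copying the wanted blocks verbatim, with no decompression or
recompression.  The [type][size][payload] layout and the type codes below
are hypothetical; region filtering would additionally use the index to pick
which blocks to copy.

    import struct

    QUAL, AUX = b"Q", b"A"   # hypothetical per-series type codes


    def copy_without(src, dst, drop_types):
        """Copy a stream of [type:1][size:4][payload] blocks, skipping drop_types."""
        while True:
            hdr = src.read(5)
            if len(hdr) < 5:
                return                               # end of input
            btype = hdr[:1]
            size = struct.unpack("<I", hdr[1:])[0]
            payload = src.read(size)                 # still compressed; never inflated
            if btype not in drop_types:
                dst.write(hdr + payload)


    # eg: copy_without(open("in.foo", "rb"), open("out.foo", "wb"), {QUAL, AUX})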