Musings
-------

These are some rather random musings on what I find desirable in a file
format.  It isn't exhaustive and is simply a brain dump.  Perhaps as a
community we could agree on which features are desirable (maybe with a
weight), so we can measure how well each file format fits.

1. Containerised

   a) Random access granularity needs to be configurable.
      - Best compression for long-term archival is probably large blocks.
      - Small blocks are better for workloads that cherry-pick individual
        regions, eg evaluating known SNPs.  BAM typically holds 500-1000
        records per BGZF block.  CRAM defaults to 10,000 records per slice,
        but users have reported that 1000 per slice is necessary to get
        BAM-equivalent performance in random-access scenarios.

   b) Possibly a multi-level container, with compression metrics in the
      outer level and random access possible at the inner level.  May trade
      off size vs random access.  (A sketch appears at the end of this
      section.)
      - Eg static frequency tables in the outer container, in blocks of say
        100,000 records.
      - 100 inner containers per outer, each containing 1000 records.
        Hence random access per 1000 records, at a cost of 2 seeks and
        reads instead of 1.

2. Indexing

   a) Requires a spatial index (eg R-tree, nested containment list; a small
      sketch appears at the end of this section).
      - Sequences are not "point" objects; they have lengths.
      - We need to know which sequences overlap a region query, rather than
        simply start beyond it.
      - Consider the case of mixing many short sequences (eg Illumina) with
        a few long sequences (ONT, PacBio).

   b) Index the index itself, so large indices don't have to be entirely
      loaded into memory before querying.

   c) Self-indexing formats are most desirable, so the index cannot become
      detached from the file.

3. Security

   a) Encrypted files at rest.
      - Possibility to encrypt per chromosome, or per region?  (Use case:
        consider EBI's EGA vs ENA archives.)
      - Consider the ChrY SNPs correlated with surnames (partial
        deanonymisation), where we may want to grant access to everything
        bar ChrY.  ChrY is a known case, but at any point we may discover
        another problematic region.
      - Traceability.  Can we discover who leaked our data?

   b) Data validity.  (A block checksum / resync sketch appears at the end
      of this section.)
      - Checksums; what's acceptable?
      - Error recovery?
      - Resync points in case of lost fragments.  Either deliberate, or a
        format with enough known plaintext to allow auto-detection of block
        boundaries.
      - EOF marker; being able to detect when we ran out of disk space or
        truncated a file.  Truncations *normally* mean corrupted data, but
        if dealing with 100,000s of files this isn't guaranteed for all.
      - Clear magic numbers for file type detection.
      - Signing - can we detect fake data?  Can we validate the author?

4. Data access patterns

   - Slice by region.
     - Either chromosome:start-end (aligned) or record N to record M
       (unaligned).
   - Slice by data type.
     - All data.
     - Minus (specific?) optional auxiliary tags.
     - Minus quality values?
   - Desire the ability to create valid data streams after slicing, without
     expensive transcoding or decompression & recompression.  (A block-copy
     sketch appears at the end of this section.)
     - Eg pick a region and the data to elide, and produce the new format:

           filter -r chr10:20000-30000 --no_qual --no_aux < in.foo > out.foo

       where in.foo and out.foo are both valid "foo" data streams.
     - This is ideal for server-based workloads (eg the GA4GH streaming
       API), where the client gives hints to the server in order to reduce
       the size of the downloaded file.
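
A minimal sketch of the two-level container idea in 1b, under invented
assumptions: the outer container holds shared metadata (standing in for
static frequency tables) plus an offset table, and each inner block of
~1000 records is compressed independently.  Fetching one inner block then
costs two seeks and reads, one for the outer header and one for the block
itself.  The field layout, sizes and use of zlib are illustrative only, not
a proposal for any real format.

    import json
    import struct
    import zlib

    RECORDS_PER_INNER = 1000   # inner-block granularity, as in 1b


    def write_outer(fh, records, shared_metadata):
        """Append one outer container of text records; return its start offset."""
        start = fh.tell()
        blobs = [zlib.compress("\n".join(records[i:i + RECORDS_PER_INNER]).encode())
                 for i in range(0, len(records), RECORDS_PER_INNER)]
        meta = json.dumps(shared_metadata).encode()   # eg static frequency tables

        # Outer header: meta length, meta, block count, then (offset, size) pairs.
        table, pos = [], 0
        for b in blobs:
            table.append((pos, len(b)))
            pos += len(b)
        header = struct.pack("<I", len(meta)) + meta + struct.pack("<I", len(blobs))
        header += b"".join(struct.pack("<QQ", off, size) for off, size in table)

        fh.write(struct.pack("<I", len(header)) + header)
        fh.write(b"".join(blobs))
        return start


    def read_inner(fh, outer_start, inner_idx):
        """Return (shared_metadata, records) for one inner block: 2 seeks/reads."""
        fh.seek(outer_start)                                    # seek + read 1
        (hlen,) = struct.unpack("<I", fh.read(4))
        header = fh.read(hlen)

        (mlen,) = struct.unpack_from("<I", header, 0)
        meta = json.loads(header[4:4 + mlen])
        off, size = struct.unpack_from("<QQ", header, 8 + mlen + 16 * inner_idx)

        fh.seek(outer_start + 4 + hlen + off)                   # seek + read 2
        records = zlib.decompress(fh.read(size)).decode().split("\n")
        return meta, records

With 100 inner blocks per outer container, the shared metadata is amortised
over 100,000 records while random access stays at 1000-record granularity,
which is the size-vs-access trade-off described above.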
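
For the overlap queries in 2a, here is a small nested containment list
sketch (one of the structures named above).  It is purely illustrative: a
real index would be built over compressed blocks or slices rather than
individual records, and the class and field names are invented.  The point
it demonstrates is that a record overlaps a query if it starts before the
query end *and* ends after the query start; a plain "first record at or
after position X" lookup misses this once long ONT/PacBio reads are mixed
with short Illumina reads.

    import bisect
    from collections import namedtuple

    Interval = namedtuple("Interval", "start end data")   # half-open [start, end)


    class NCList:
        """Minimal nested containment list supporting overlap queries."""

        def __init__(self, intervals):
            # Sort so any containing interval precedes the intervals it contains.
            ivs = sorted(intervals, key=lambda iv: (iv.start, -iv.end))
            self.top = []        # intervals not contained in any other
            self.children = {}   # id(parent interval) -> contained intervals
            stack = []           # chain of currently "open" containing intervals
            for iv in ivs:
                while stack and stack[-1].end < iv.end:
                    stack.pop()              # top of stack does not contain iv
                sublist = (self.children.setdefault(id(stack[-1]), [])
                           if stack else self.top)
                sublist.append(iv)
                stack.append(iv)

        def overlaps(self, qstart, qend, sublist=None, out=None):
            """Return every stored interval overlapping [qstart, qend)."""
            out = [] if out is None else out
            sublist = self.top if sublist is None else sublist
            # Within a sublist, no interval contains another, so ends ascend
            # with starts; recomputed per call here only for brevity.
            ends = [iv.end for iv in sublist]
            i = bisect.bisect_right(ends, qstart)  # first interval ending after qstart
            while i < len(sublist) and sublist[i].start < qend:
                out.append(sublist[i])
                self.overlaps(qstart, qend,
                              self.children.get(id(sublist[i]), []), out)
                i += 1
            return out


    # Both reads overlap the query, even though the long read starts 50kb
    # before it; a start-position-only index would report just the short one.
    ncl = NCList([Interval(100, 200, "illumina"), Interval(50, 50000, "ont")])
    print(ncl.overlaps(150, 160))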
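
For the checksum, resync and EOF-marker bullets in 3b, a sketch under
invented framing assumptions: every block starts with a fixed sync marker
and carries its own length and CRC32, and a distinct EOF marker closes the
file (much as BGZF ends with a fixed empty block).  A reader can then flag
per-block corruption, resync by scanning for the next marker, and
distinguish a clean end of file from truncation.  The marker bytes, layout
and choice of CRC32 are illustrative only.

    import struct
    import zlib

    SYNC = b"\xf0FOOBLK\x9c"   # hypothetical 8-byte block marker
    EOF = b"\xf0FOOEOF\x9c"    # hypothetical 8-byte end-of-file marker


    def write_blocks(fh, blocks):
        for payload in blocks:
            fh.write(SYNC)
            fh.write(struct.pack("<I", len(payload)))
            fh.write(struct.pack("<I", zlib.crc32(payload)))
            fh.write(payload)
        fh.write(EOF)                      # absence of this implies truncation


    def read_blocks(fh):
        """Yield (status, payload); status is 'ok', 'corrupt' or 'truncated'."""
        data = fh.read()
        pos = 0
        while True:
            if data[pos:pos + len(EOF)] == EOF:
                return                     # clean end of file
            if data[pos:pos + len(SYNC)] != SYNC:
                # Lost our place: resync by scanning for the next known marker.
                nxt = data.find(SYNC, pos + 1)
                if nxt == -1:
                    yield "truncated", b""
                    return
                yield "corrupt", data[pos:nxt]
                pos = nxt
                continue
            hdr = data[pos + len(SYNC):pos + len(SYNC) + 8]
            if len(hdr) < 8:
                yield "truncated", b""
                return
            size, crc = struct.unpack("<II", hdr)
            payload = data[pos + len(SYNC) + 8:pos + len(SYNC) + 8 + size]
            if len(payload) < size:
                yield "truncated", payload
                return
            yield ("ok" if zlib.crc32(payload) == crc else "corrupt"), payload
            pos += len(SYNC) + 8 + size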
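
Finally, the no-transcode slicing under item 4: if each data series (names,
sequences, qualities, auxiliary tags) lives in its own compressed block
tagged with a type code, then the --no_qual / --no_aux style of filtering
reduces to copying the wanted blocks verbatim, with no decompression or
recompression.  The [type][size][payload] layout and the type codes below
are hypothetical; region filtering would additionally use the index to pick
which blocks to copy.

    import struct

    QUAL, AUX = b"Q", b"A"   # hypothetical per-series type codes


    def copy_without(src, dst, drop_types):
        """Copy a stream of [type:1][size:4][payload] blocks, skipping drop_types."""
        while True:
            hdr = src.read(5)
            if len(hdr) < 5:
                return                               # end of input
            btype = hdr[:1]
            size = struct.unpack("<I", hdr[1:])[0]
            payload = src.read(size)                 # still compressed; never inflated
            if btype not in drop_types:
                dst.write(hdr + payload)


    # eg: copy_without(open("in.foo", "rb"), open("out.foo", "wb"), {QUAL, AUX})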