As user requirements have grown to include 600+ BAM files (and 400+?? bigWig) and no doubt more in the near future the current flat .ini configuration system has become unworkable.
There is a case for continuing to use the 'ZMap' config file for existing basic config but several parts of this system need to be handled differently and the obvious solution is an SQLite database. Note that in the context of a ZMap session this data is entirely static; we explicitly choose to store any session state information elsewhere. Initially we will convert to SQLite large configuration data (style, featuresetsm columns, servers) and replicate the existing flat file information unchanged.
This will also be an opportunity to tidy up some configuration options that have been added hurriedly over that past few years. It is also an opportunity to address the complications caused by using GQuarks as unique id's for data items - with SQL we can generate unique id's in the database and have names simply as names.
There are two main use-cases: as part of the Otterlace annotation systema and as a standalone genome browser. Otterlace can generate a database in whatever format is needed and our only concerns are that ZMap can read the data and Otterlace can generate it easily. When acting as a standalone browser we need to ensure that users unskilled in SQL can use ZMap and make minor modifications to configuration, a few examples of this will be considered later. It is imagined that a 'standalone users' includes those who have very simple configuration needs eg Ensembl style with only a few Core tracks. This can be addressed by using an open source SQL browser or by providing a short Perl script to act as a very basic local webserver that handles HTML form submission. Additionally several small Perl scripts can be provided to access and update records from the command line. (eg to re-order columns on the display, add a new data source etc etc.
As the development of ZMap is likely to require many more configuration changes in the future we need to consider how this this will affect database structure. In terms of efficiency there are no real constraints as we are dealing with small amounts of data (for a database) and only accessing it as configuration (and storing this in memory for real-time use).
By default ZMap will attempt to read a database in the selected configuration directory (defaults to ~/.ZMap, or can be set via the arg --conf_dir=there) called 'ZMap_config.db'. This file name can be set via another command line argument:
zmap --conf_file=ZMap_my_stuff --conf_db=my_config.dbor by a configuration option in the ZMap configuration file:
[ZMap] database = ZMap_otterlace.db; ZMap_encode.dbWhere files names are specified they are relative to the selected configuration unless they begin with '/'.
387 defined in a recent-ish but out of date file These work in a hierarchy and inherit attributes, if stored in SQL ths would be as a two simple tables that get processed by ZMap as at present from the ini stanza data - style inheritance will remain as a ZMap application function.
style table style-id unique key, indexed name attibute table style-id ref to style, indexed attribute-id generic attribute: completely future proof value text, converted to numeric by application codeNote that styles have a large number of options some of which are mutually exclusive. Styles are referenced by featureset and columns; in existing flat file configuration this is by unique ID derived from the name, and we would be best converting this to a database unique ID.
These map into columns on display and each one contains data of the same type. Tradtionally the type of data was identified by the display style but this confuses presentation with the data model. The type of data (eg transcript, alignment) implies the type and amount of information that can be attached to each feature and the relationships between parts, and there is a need to differentiate between some datasets operationally - BAM data is such high volume that we need to restrict access to small sequences only and do that on demand rather than on startup.
With BAM data we expect approx 1000 of these to start with.
Featuresets have a display style and must be linked to a source/ server.
ACEDB supports request by column and supplies a mapping from featureset to column - it will be necessary to add this to the data extracted from SQL. We'd like to avoid updating the SQLite database OTF, as then it would become state not configuration.
These are the sources of feature data for ZMap and can be (currently) ACEDB, DAS (not fully implemented) pipe or file. Note that we do have DAS based data but this is accessed via a pipe script. Each server can support one or more featuresets and has options such as load on startup. As new source types (eg BAM, coverage) extra options become necessary, although most of these are best tied to the featuresets.
Historically ZMap has operated with a structure derived from ACEBD where columns of data (tracks) are requested and these may contain several featuresets. Columns are ordered on the screen left to right in reverse and forward strands. Columns can be shown/ hidden and this implies operating on several featuresets in tandem.
With BAM coverage data we need to map several featuresets into sub-columns - this has been achieved by inventing 'virtual featuresets' which map into coverage columns. What is really needed here is nested grouping; the top level can be used to show/ hide several sub-columns and appears on the load-column dialog, and the intermediate level groups several featuresets into a display entity: two quite distinct functions.
Historically column order has been defined by a list of columns (as plain text) and while this works well for small numbers it has become unwieldy and also suffers from the (recent) existance of multiple lists and the fact that this single list of columns has multiple functions: display order, column request, grouping of featuresets. These need to be teased apart into distinct configuration options.
As the database is regarded as static (write once, no updates) and only used for configuration there is little need for a higly optimised design and we opt for maximum flexibility. Tables are defined as blank (named) records that have any number of related attibutes which are all defined in another table; attributes could in theory be shared between table types but it is probably clearer/ less confusing to define separate ones for each. Where one table relates to another explicitly then we define a key in the table rather than using an attribute record to link them as this will make understanding and SQL syntax a bit easier.
SQLite syntax differs from MyySQL and PostGres and this is summarised very well in this document. Each record has a unique rowid automatically - refer to this and that for details. Extra (+ composite) indices may be added if needed. Column data types are flexible and in the schema just give a general indication of what format the data takes. We avoid the use of the automatic rowid as a primary key as it's not a standard SQL format, and declare primary keys explicitly.
The main use by ZMap is to read in the whole database on startup and store in C data structures (and by Otterlace to write in bulk) and in that respect there is no need for any indexes. However for manual changes (which are very likely with ZMap operating standalone) we need to ensure that queries can be processed efficiently and therefore include sensible indices as normal.
Sample queries are provided to help future users and also to allow testing of the schema.
To avoid much duplication of attributes we define most tables as being grouped or inheriting data from other of the same type. This is something that SQL does not do but we define the relationship between two records in the same table and deal with inheritance in the application (C) code. These records (styles, servers and featuresets) can be organised in a hierarchy (as that's the data we wish to model) but this is process outside of SQL - SQL just stores the information.
Here's a suggested schema, it's pre-implementation and may be subject to change. Most of these queries have been tested but the exact syntax is not guaranteed - and example schema file will be created and circulated and should be used for real development; what's below is for discussion.
-- sample database for ZMap config -- use 'sqlite".read thisfile;"' to create the schema -- we use rowid extensively to access tables, no need to specify explicitly -- ref to http://www.sqlite.org/autoinc.html and also http://www.sqlite.org/lang_createtable.html -- feature display style CREATE TABLE zmap_style ( style_id INTEGER PRIMARY KEY, name TEXT, -- styles inherit attributes from thier parents, recursively -- parent_id is as in former style defintions parent_id INTEGER ); CREATE INDEX idx_zmap_style_name ON zmap_style (style_name); -- generic attributes of style -- includes parent_style_id for nested style definitions CREATE TABLE zmap_style_attribute ( style_id INTEGER, tag TEXT, value TEXT ); CREATE INDEX idx_zmap_style_attribute ON zmap_style_attribute (style_id); -- a (base) set of features of the same type CREATE TABLE zmap_featureset ( featureset_id INTEGER PRIMARY KEY, name TEXT, style_id INTEGER, column_id INTEGER, -- which column this appears in -- may be NULL if this is in a group server_id INTEGER, -- where it comes from -- featuresets can be grouped together, recursively -- a group of featuresets is defined as another featureset record group_id INTEGER -- the group that this featureset belongs to ); CREATE INDEX idx_zmap_featureset_name ON zmap_featureset (featureset_name); CREATE INDEX idx_zmap_featureset_column ON zmap_featureset (column_id); CREATE INDEX idx_zmap_featureset_group ON zmap_featureset (group_id); -- generic attributes of featureset CREATE TABLE zmap_featureset_attribute ( featureset_id INTEGER, tag TEXT, value TEXT ); CREATE INDEX idx_zmap_featureset_attribute ON zmap_featureset_attribute (featureset_id); CREATE TABLE zmap_column ( column_id INTEGER PRIMARY KEY, name TEXT, style_id INTEGER -- optional style for the column ); CREATE INDEX idx_zmap_column_name ON zmap_featureset (column_name); -- generic attributes of column -- includes column order (position) CREATE TABLE zmap_column_attribute ( column_id INTEGER, tag TEXT, value TEXT ); CREATE INDEX idx_zmap_column_attribute ON zmap_column_attribute (column_id); -- server - provides features from one or more featuresets CREATE TABLE zmap_server ( server_id INTEGER PRIMARY KEY, name TEXT, -- servers can inherit attibute from others, recursively -- eg pipe_DAS for generic DAS options, pipe_ENCODE for BAM servers from ENCODE server_type_id INTEGER ); CREATE INDEX idx_zmap_server_name ON zmap_server (name); -- generic attributes of server CREATE TABLE zmap_server_attribute ( server_id INTEGER, tag TEXT, value TEXT ); CREATE INDEX idx_zmap_server_attribute ON zmap_server_attribute (server_id);
There is a need for quite a lot of these and also a need to be able to add more on an ad-hoc basis. These will be doucumented by adding attribute records with a related table id (eg feautreset_id, style_id etc) of 0 and the value set to some descriptive text. These can then be listed quite easily via a simple SQL query.
select tag as attribute, value as description from zmap_style_attribute where style_id = 0;
Note that attributes are not uniquely specified per parent table, and where the original .ini file configuration has lists of data these will be added as separate attribute records with the same tag.
A fiedl in the columns record will be use to store a colum's display position. In case of mis-configuration (ie two columns with the same position) ZMap will carry on regardless - the ordering of the misconfigured columns will be stable relative to the others . There is a feature request active to drag and drop columns when live, but this table defines the default which is explicitly defined centrally on request from Havana.
update zmap_column_attribute set value=value+1 where tag = "position" and value >= 6; insert into zmap_column_attribute (column_id, tag. value values(123, "position", 6);
Columns are defined as appearing on different Tabs on the load columns dialog and are grouped with other similar columns. Order is defined relative to each Tab: these are defined from several different data sources (eg otterlace, encode) and these tabs are optional depending on the desired configuration. It is believed that each tab contains data of significanlty different types and there is no requirement to mix these.
Groups of columns are nominally displayed together but it is possible to have a group of columns in the load columns dialog that are disjunct on the display. Many columns are optional depending on species and these are defined for all species/ datasets which implies that the column ordering within each Tab will be stable.
Thanks to James Gilbert for these... (Note some of the field names differ from the above, it's an earlier iteration.
-- Example adding this ini style file data into the schema: -- -- [EST_align] -- colours=normal fill Purple ; normal border #3d1954 -- parent-style=dna_align -- alignment-mask-sets=self ; vertebrate_mRNA INSERT INTO zmap_style (style_name) VALUES ('EST_Align'); SELECT style_id FROM zmap_style WHERE style_name = 'EST_Align'; -- Assuming SELECT returns '102', then: INSERT INTO zmap_style_attribute (style_id, style_tag, style_value) VALUES ('102', 'colour', 'normal fill Purple'); INSERT INTO zmap_style_attribute (style_id, style_tag, style_value) VALUES ('102', 'colour', 'normal border #3d1954'); INSERT INTO zmap_style_attribute (style_id, style_tag, style_value) VALUES ('102', 'parent-style', 'dna_align'); INSERT INTO zmap_style_attribute (style_id, style_tag, style_value) VALUES ('102', 'alignment-mask-set', 'self'); INSERT INTO zmap_style_attribute (style_id, style_tag, style_value) VALUES ('102', 'alignment-mask-set', 'vertebrate_mRNA'); -- Example query to fetch info for EST_Align style: SELECT fa.style_tag , fa.style_value FROM zmap_style f , zmap_style_attribute fa WHERE f.style_id = fa.style_id AND f.style_name = 'EST_Align'; -- Example query to fetch info for EST_Mouse style: SELECT fa.featureset_tag , fa.featureset_value FROM zmap_featureset f , zmap_featureset_attribute fa WHERE f.featureset_id = fa.featureset_id AND f.featureset_name = 'EST_Mouse';
Separate databases will be defined for different major sets of data - Otterlace will prove a database for all its data and ZMap will use another database for ENCODE data. Further data sets may be added in future as needed. ZMap will read in configuration data by submitting identical queries to each configured database in turn.
Typically ZMap is run in an Otterlace environment using data from a single species/ genomic sequence, but there is a need for multiple views containing data from more than one species (eg human/mouse, multiple mouse strains). Servers are defined generically and ZMap plugs in sequence information such as species chromosome start and end. Columns (filters) are defined as relevant to a particular species, but featuresets and servers are not.
This is all human data and columns should be defined as human specific, although this may not be necessary as we can simply not use the ENCODE configuration database when looking at other species' data.
Here we assume a user has no knowledge of SQL and we wish to ensure that reasonable configuration changes can be effected.
This requires two seperate SQL queries and clearly needs a trivial script. Columns are ordered via a series of records and to move one it has to be deleted and inserted in the sequence, which involves updating many records. However this can be done with a single SQL query.
Styles have many attributes and these can be presented on a generic SQL browser style form. Note that due to style modes some options will be meaningless and we need to prevent misuse.
This is quite complex. The user must define URLs and query options (assuming we provide generic scripts to drive the request process). Featuresets provided must be defined and linked to styles, which may need to be created. Columns must be defined and featuresets grouped and mapped into them. Columns must be placed on the load columns dialog in the appropriate tab and group.
Clearly this requires extensive user documentation, and as much scripted help as we can provide. This looks optimistic for a naive user.
Initially we will provide scripts to generate this data in some agreed format and do not expect end user configuration. We can provide alternatives on request quite easily.
One of the uses of database driven configuration is to facilitate implementation of a tidier Load Columns dialog. In this, columns that may be requested are grouped into Tabs (eg Core, RNAseq, etc) and within each tab they are grouped further (eg Transcripts,EST's etc). This will be very easy to implement using two attributes connected to each column.
Another requirement is to handle columns that may not be requested directly or only be requested from the marked region, and this is a clear case for more attributes.
ENCODE contains a large amount of data and on the dashboard page there is a table that uses four or five dimensions to group the data. Mapping this onto a ZMap dialog is not simple (we have chosen two levels of grouping and do not want to make this more complex; in particular we want to avoid implementing third party interfaces embedded in our appllication). The attempted solutuion is to have Tabs for major ENCODE groupings (cell type) anf groups for another (rnaExtract) and to have all other variable lumped together. This means we do not have a single Tab for ENCODE, and if we ever have another large dataset included at the same time will run out of space. One solution would be to have tabs within tabs or to provide separate dialog for major groupings of data,
Originally it was intended to implement data to support the new load columns dialog only and fit this onto the existing ZMap with existing configuration from flat .ini files. The requirement to provide access to all ENCODE data at the same time (via the new load columns dialog as requested) means that it makes more sense to implement featuresets and servers in the database at the same time. This is due to the fact than an interim solution would require duplicated effort a) defining columns b) defining featuresets and servers and mapping these to columns, and having two teams working on different views of the same data would probably introduce errors.
Therefore, we propose the following sequence of events:
The main problem to solve here is constructing long urls that contain script options for different sources and sequences. Many servers are similar (eg otterlace gff sources, bam_get, bam_coverage) and share similar options: these common options will be defined in a server records that others refer to. Attribute strings must allow server spcific data to be added: generically we need to add species (dataset) chromosome, start, end, cvser etc, and these cannot be defined as hard coded is we are to use a server for different species as would be required with ZMap running with two sequences loaded.
There are three main categories of option:
Some standard tokens will be defined to cover what's needed. These will be defined as '$token' as $ appears not to be used in existing URL syntax. Initially we will use only the tokens needed by encode servers, and as is currently implemented in ZMap we will adjust sequence specific options (start, end, dataset, gff_seqname) automatically.
url=pipe:///bigwig_get? -seq_id=chr4-04& -csver=GRCh37& -end=9721721& -file=http%3A%2F%2Fhgdownload-test.cse.ucsc.edu%2FgoldenPath%2Fhg19%2FencodeDCC%2FwgEncodeCshlLongRnaSeq%2FreleaseLatest%2FwgEncodeCshlLongRnaSeqAg04450CellPapPlusRawSigRep2.bigWig &-gff_feature_source=Tier3_AG04450_cell_longPolyA_rep2_coverage_plus &-gff_version=2 &-start=9324643 &-strand=1 url=pipe:///bam_get? --seq_id=chr4-04 &--csver=GRCh37 &--end=9721721 &--file=http%3A%2F%2Fhgdownload-test.cse.ucsc.edu%2FgoldenPath%2Fhg19%2FencodeDCC%2FwgEncodeCshlLongRnaSeq%2FreleaseLatest%2FwgEncodeCshlLongRnaSeqHsmmCellPapAlnRep2.bam &--gff_feature_source=Tier3_HSMM_cell_longPolyA_rep2 &--start=9324643 &-gff_version=2Note that 'normal' otterlace scripts have a different set of options:
url=pipe:///filter_get? analysis=HUVEC_WholeCell_Ap_Long_CSHL &client=otterlace &cookie_jar=%2Fnfs%2Fusers%2Fnfs_m%2Fmh17%2F.otter%2Fns_cookie_jar &cs=chromosome &csver=Otter &dataset=human &end=9721721 &gff_seqname=chr4-04 &gff_source=ensembl_cshl_huvec_3 &metakey=gencode_dev_cshl_huvec &name=4 &server_script=get_gff_genes &session_dir=%2Fvar%2Ftmp%2Flace_61.mh17.23729.1 &start=9324643 &transcript_analyses=IDR_0.1 &type=chr4-04 &url_root=http%3A%2F%2Fdev.sanger.ac.uk%3A80%2Fcgi-bin%2Fotter%2F61and to process these URLs reliably we have to be careful to preserve what's configured. We also need to process seq_id' and 'gff_seqname' interchangably and update both if they are present, but not add seq_id if it's not there. Specifically this means that encode URL's need to include -seq_id= and not -gff_seqname=. Otterlace URLs do not include seq_id but do include gff_seqname, and this is added if not present.
Some of the otterlace options are session specific and need to be passed to ZMap somehow,