ZMap displays columns of data containing one or more featuresets. Requesting or showing/hiding a column operates on all the featuresets in it.
This is not flexible enough to cope with the requirements for BAM data and we need to find a way to configure and operate featuresets that are related in several different ways.
Please refer to the following diagram, which illustrates how the current ZMap architecture cannot implement what is required (but see Bingo below).
a) shows the traditional approach derived from ACEDB and now in use with data provided by pipe servers (NOTE: this is not a brilliant example as all the featuresets actually come from one pipe server).
b) shows a simple BAM data column with two reps of the same experiment combined into one column. Each rep is sourced from it's own pipe server and the features can be colour code via styles easily - this fits quite well into existing architecture.
c) shows a coverage column. Coverage is displayed as several featuresets in a column, so for instance for a given grouping of cell lines in ENCODE (eg tier 1) we present all the cytosol experiments for each cell line/ cell compartment combination and each one appears as a heatmap . Each heatmap contains the coverage data for all the reps and therefore has data from several sources. Heatmaps have been implemented as single canvas items that contain many subfeatures and assume that they contain data from one featureset. They are positioned in the column in the order specified in the configuration which says which featuresets are in the column.
There are some other issues to consider which all make it more essential to have each coverage heatmap sourced from one featureset and groups of these configured as displayed in the same column:
Ultimately the problem is that we need three levels of grouping (featureset, related featureset, and column) and we only have two.
Extra configuration options have beed added to relate coverage to data to allow requests via the mouse and this has become unwieldy.
This leads to the current interim solution where the pipe script that provides coverage data amalgamates several different data sources into one featureset, which is not an ideal solution as:
It may be possible to display several coverage featuresets in one heatmap via enhanced styles options but this is mis-using the styles data which should be for how data is displayed (it's appearance) not how it is grouped - to display a heatmap it has to exist as a unique object/ data structure in ZMap.
It would also be possible to add configuration data to link a few coverage featuresets together and combine these before display. Using current practice as an example we could do something like this:
[ZMap] columns=coverage_Tier1_cytosol; etc [columns] coverage_Tier1_cytosol = coverage_Tier1_cytosol_GM12878 ; coverage_Tier1_cytosol_H1-hESC ; coverage_Tier1_cytosol_K562 [featuresets] coverage_Tier1_cytosol_K562 = coverage_Tier1_cytosol_K562_LongPolyA_Rep1 ; coverage_Tier1_cytosol_K562_LongPolyA_Rep2with the last stanza being new. Note that there would be more featuresets to be combined in practice. The source featuresets would be configured for servers but not appear in the columns dialog. The composite featureset would appear in the column as configured.
It is possible to implement this as mapping each source featureset to the composite at the display interface, which provides all the required functionality using existing architecture. Note that it is only relevant to display items like heatmaps and requires the used of a column-wide canvas item such as ZMapWindowGraphDensityItem. Currently in a heatmap features are displayed in one canvas item which is named according to column (which includes strand) and featureset. If configured we can easily replace featureset with the one mapped to - all the features will appear in one place and each feature will still be as in the orignal featureset.
This also means we keep all the processing of coverage data internal to ZMap.
A subsequent re-implementation of the basic features display code requires a similar process when more than one short reads featureset is to be displayed in one column. This is simple than the coverage configuration above as it requires only one virtual featureset. See the diagram below and box for sample confiurgation. The virtual featureset is needed to avoid data from different real featuresets overlapping when bumped.
The difference between BAM coverage and data columns for the user is that coverage colums appear in the load columns dialog, and have several featuresets summarise in heatmap sub-columns, and data columns have the real features intermingled in the same column. Data may be requested via the coverage column menu. Both may only be requested for the mark.
NOTE that this is an interim solution and will change in the near future.
NOTE Strictly speaking we can display several featuresets on one column but and this will work. However, BAM data is referenced by coverage data and in the current implemntation this is via a virtual featureset of which there can be one. It would be good to chnage this at come point.... (ie short read not using a virtual featureset, more than one related featureset for coverage columns);
Here is a picture of some sample featuresets and columns as needed for ENCODE data:
Configuration:
[Featuresets] # These are virtual featuresets; data from several real featuresets is intermingled into one heatmap # Each heatmap is displayed staggered in one column Tier1_K562_Cell_coverage = Tier1_K562_Cell_Rep1_Coverage; Tier1_K562_Cell_Rep2_Coverage Tier1_GM12345_Cell_coverage = Tier1_GM12345_Cell_Coverage; Tier1_GM12345_Cell_Rep2_Coverage # this is a virtual featureset that holds all the data from several real ones # data is displayed as normal basic features Tier1_K562_Cell_features = Tier1_K562_Cell_Rep1; Tier1_K562_Cell_Rep2 [Columns] # The data column contains two featuresets: data is intermingled but stored in a virutal featureset Tier1_K562_Cell = Tier1_K562_Cell_features # The coverage column contains some virtual featuresets Tier1_Cell_Coverage = Tier1_K562_Cell_coverage; Tier1_GM12345_Cell_coverage [Featureset-related] # we relate the virtual coverage featureset to the data column Tier1_K562_Cell_related = Tier1_K562_Cell # we also need to ensure that data may only be requested from the mark and does not appear on the load columns dialog. [ZMap] seq_data = Tier1_K562_Cell_features; Tier1_K562_Cell_Rep1; Tier1_K562_Cell_Rep2 columns = Tier1_Cell_CoverageResult:
This is quite instructive in the sense that we can see how complex the configuration is and how it has become tangled. There are a few places in the code where it has been necessary to invent ACEDB style configuration data as it is needed for various functions, and this at times can overwrite the data read in at startup. NOTE:What appears here may only be partial documentation.
NOTE:there is a slight nuance with the [featureset-related] stanza. To link a coverage column such that the user can right click to request the related underlying data we need to related each real coverage featureset to the column for the underlying data. To prevent the coverage column from being requestable without the mark being set we need to relate the virtual featureset also. This could probably be fixed by a small tweak to the code. Likewise there is a report that the virtual featureset requires a style to be set.
Featuresets and columns have unique IDs and display names and these can get used interchanagably by mistake.
Due to ACEDB legacy issues during the request protocol we add featuresets to a column's list, for all servers at once, and repeat this for every server, which adds to the configured configuration. Really this should onyl be done for ACEDB and this should involve revisiting the server protocol. In the short term we must avoid adding featuresets that become virtualised.
Existing flat file stanza driven configuration is quite limited and is not a practical solution for large amounts of data and each modification is fairly tedious.
Some notes on converting existing configuration file to SQLite can be found here
To make future modifications easier it is proposed to define simple data items (eg featuresets, columns, servers, styles) and then define relations between them. Each relation will consist of a type and two unique id's, but note that each id may belong to different tables. The following relations will be handled (and can be added to without changing the structure of the configuration data):
This would replace the existing data in the ZMapFeatureContextMap data structure and a dynamic interface provided to give a list of data related in some way. The ACEDB interface would be modified either to ignore configuration and styles from ACEDB or to convert into the new format. (NOTE the ext_curated column contains approx 1000 featuresets which are mostly empty and not configured in current practice).
This way of working does not perform well with ordered lists - for example column ordering.
ZMap has been programmed to assume some configuration if not present. such as 'featureset appears in column of same name', and conversely has been programmed to ignore features with no style.