Columns Featuresets and Data Sources

Definitions

zmapFeature.shtml has some notes about the words used for various data items in the ZMap code (which cannot be changed without a lot of work) and this section aims to define terms to be used to describe things that can be configured by the user/ otterlace or presented to the user/ annotators, or used in an external interface eg data servers and X-remote. This may seem pedantic, but there is some confusion caused by the re-use of words to refer to different objects.

ACEDB includes a number of configuration options driven by database tables and this functionality must be supported, and other than that we need to be able to specify what data to display, where to request the data, and where to display it.

We also have to specify a display style for each data item and the display columns themselves.

Featureset

A collection of data items of the same type. These will be identified in a GFF file with the same name (ie a GFF source, typically representing the output from a single analysis), and each individual data item (corresponding to one line in a GFF file) is a feature.

To make the implementation of external servers easier it is also desirable to allow the featureset name to be different from the GFF source name.

Each featureset is assigned a style, which defaults to having the same name as the featureset. Featuresets with no style defined cannot be displayed. Due to the behaviour of external sources it is desirable to be able to specify a style of a different name, and this would also allow styles to be shared amongst sources.

NB: currently for a pipeServer the featureset name must be the same as the GFF source name, and the pipeServer must also be given a styles file containing a style of the same name.

Column

A display column in a window in a view. ZMap has a list of display columns which specify in order which data to display across the window. Tradtionally this has been inferred from the list of featuresets specified for each server such that all the featuresets from the first server appear in order, followed by all the featuresets for the other servers as configured.

A column can be filled with any number of featuresets each with thier own style, and can also be assigned a style for the whole column which can specify things like width, display options (eg show/hide), strand specific, frame specific etc. Note that if a column is defined as strand specific then ZMap will draw two columns, one on either side of the strand separator, but for each strand the order will be as specified.

It's not clear about whether or not this is currently true, (style for column as well as featureset) perhaps the first style in a colums's list is used; however it makes sense to me to do this explicitly. Update: all the styles needed by a column get lumped together, lookup for a parameter is first come first served.

Styles

These are supplied either by a text file or from a Server (eg ACEDB, DAS). The style defines how to display a feature, and combines data for the appearance and also when and where to display it. Refer to Request Timing below for some notes about a third aspect of styles.

Source

Where to request a data item from, which can only be a data server such as ACEDB, pipe, DAS, file.

Mapping of things to other things:

We need all these to be configurable, with sensible defaults (ie same name for a 1-1 relationship)

Name Spaces and Collisions

There are a lot of different name spaces here, mostly that default to having the same values for the same object. If two names co-incide then ZMap will treat that as both referring to the same thing. For example a column called 'EST_human' maps to featuresets 'EST_human' and 'Saturated_EST_human'. The first of these has a style 'EST_human' and so does the column and the both refer to the one style 'EST_human'

Strategy

There are a number of competing issues:

There is a conflict between the 'best' configuration between ACEDB and pipeServers which have a different perspective: a pipeServer is intended to supply one featureset and ACEDB to supply many and to provide links between them and other configuration options. However, it is likely that even pipeServers will supply multiple GFF sources, and that these can appear in different display columns - all combinations are expected in practice.

A Structural Perspective

Any solution will be a compromise - what is proposed is to provide configuration options for ZMap from the perspective of ZMap rather than any particular external module. By referring to the ISO/OSI 7-layer model (try google) we hope to suggest the cleanest way to divide up the various configuration options. Other than this the approach is to avoid too great a re-write of Zmap code and to use formats similar to what's currently in use. Similar code has to be maintained to support exisitng ACEDB configurations and only having one implementation will increase reliability.

Display layout

The columns to display and in which order should be defined explicitly and relate to ZMap rather than data sources or featuresets. Raw data (sets of features supplied externally) is logically at a different level from presentation and should not refer to presentation issues.

Current column lists and ordering is inferred from server config. By defining this explicitly we avoid having to define servers in display order (and can mix & match featuresets), and also choose columns to display without having to change server configuration.

In a simple configuration with one featureset per column a single style can be used to define column related data (eg hide/show) and feature related data (eg blue). However for a column with many featuresets it may be cleaner to define this in a seperate style - one for the column and one for each featureset. In which case the column may be defined with an optional style.

This data will take the form of a new option in the [ZMap] stanza as 'featureset:style':

[ZMap]
columns = EST_human ; Repeats ; etc
This looks very similar to the existing featuresets option in the Server stanzas, but in this case does not refer directly to any featureset; it is a list of display columns and nothing more.

Another stanza [columns] will allow the user to specify what featuresets to fill a column with, and this allows data from more than one source to be displayed in the same column. By default (if not specified in the [columns] stanza) each column will be assigned data from the featureset of the same name.

[columns]
Repeats = repeatmasker_line ; repeatmasker_sine
mRNA = vertebrate_mRNA ; polyA_site ; polyA_signal

If no column list is defined in the [ZMap] stanza then each server's list of featuresets will be used to define the columns to display in order of servers and thier featuresets.

Display styles

Again this is logically at a different level from the raw data. Traditionally a feature's style has been linked to the feature type, but ACEDB does provide a mapping to arbitrary display styles as a separate database query. It seems logical to provide a similar function in ZMap configuration, but to continue support of existing ACEDB function we need to make this optional for ACEDB. We can default a style to be the same name as the featureset without breaking modularity.

To add in the ability to specify an arbitary style for a featureset we add another stanza for this mapping, and also another one for the column (column specific parameters can optionally be defined seperately from the individual featuresets):

[featureset_style]
vertebrate_mRNA = vertRNA

[column_style]
vertebrate_mRNA = vertRNA_col

Request timing

We wish to define some data as 'requested on startup' or 'requested on demand'. Currently (April 2010) this is done via server config ('delayed=true/false') and this applies to all featuresets supplied by that server.

It would be possible to specify startup and delayed featuresets per server, but perhaps this is overkill: it is just as easy to configure two servers.

ACEDB also defines some featuresets as 'deferred' via the featuresets style, which means that they are not requested with the other features on startup. This relates to communication issues rather than display and we propose that the style options 'deferred' and 'loaded' are removed and replaced by ZMap configuration options. (NB:There is also a third style option 'current_bump_mode' which relates to display but implies a 1-1 featureset-style mapping which is also a candidate for review).

The ZMap Columns dialog lists columns that can be requested post startup - this could / should? be changed to include all configured display columns or featuresets

Mapping featuresets to data sources

From a users point of view they wish to request extra features at various times after ZMap startup and we need to define how this process works. The startup situation can be treated in an identical manner (except for the user interaction of course).

From otterlace they request a column (which implies one or more featuresets), from ZMap they also request a column (for deferred styles). When these requests are given to ZMap then they may consist of a list of featuresets. The server's 'featuresets' parameter is used to define where to request this from.

In the GFF data stream each features is identified by a source name and this maps 1-1 to the featureset as defined in ZMap's configuration. There may be a need to provide a further mapping of GFF-source-name to featureset-name but at present this is not implemented.

Configuration format summarised

All display columns will be defined in terms of featuresets (and default to a 1-1 mapping):

[columns]
Repeats = repeatmasker_line ; repeatmasker_sine
mRNA = vertebrate_mRNA ; polyA_site ; polyA_signal

Each featureset may be given a style:

[featureset-style]
EST_rat = EST_rat_style

Column styles may be defined as follows:

[column-style]
EST_rat = align_col_style

Each featureset may be mapped to a (case sensitive) GFF source name. This is the name that is displayed in the right hand status box in ZMap and it defaults to the featureset name. If desired this can be set for any of the featuresets defined as featureset name = GFF source name. Note that this will not affect how GFF files are parsed, it does not map GFF-source-name to featureset-name.
[featureset-source]
vertebrate_mRNA = vertrna

For GFF sources we also need a description text (which was supplied by ACEDB) and corresponds to the description in the load column dialog in Otterlace. This will appear in yet another stanza:

[featureset-description]
vertebrate_mRNA = Vertebrate Messenger RNA

Each column may also be given a description:

[column-description]
vertebrate_mRNA = Vertebrate messenger RNA

Some featuresets may be related to others. The first case we need is to link a coverage featureset to the data it represents, and in this case we expect to request the data featureset via a right-click on the coverage.

[featureset-related]
coverage=data
NOTE: many other relationships are possible and this is an interim solution. Due to the increasing complexity of ZMap configuration we expect to replace the existing flat file with an SQLite database.

It is also possible to define 'virtual featuresets' that cause several featuresets to be displayed as one.NOTE this is only revelan tot coverage style displays where each featureset effectivel has its own sub-column. See here for why.

[featuresets]
coverage_Tier1_cytosol_K562 = coverage_Tier1_cytosol_K562_LongPolyA_Rep1 ; coverage_Tier1_cytosol_K562_LongPolyA_Rep2

Refer also to SQLite configuration for some other notes on future directions.

A Note about GLib Key Files

The spec for Glib's key file functions is quite explicit in saying thay keys may consist only of alphanumerics and '-' and this causes us some problems when dealing with the many names used by otterlace that include '_'. In addition to this we would like to name columns with multiple words with whitespace between them.

Fortunately, by using the functions g_key_file_get_keys() and g_key_file_get_string() we can use keys containing space and underscore. There is a small risk that GLib will change their implementation, in which case we would have to replace some of the config file code.

Keys containing whitespace will be normalised to have each whitespace set as a single space character.

Implementation

Configuration file

The alternative stanza format above will be used, as it is probably easier to read.

Legacy issues with GFF and ACEDB etc

In the ZMap code there are a few mappings, originally derived from ACEDB and we need to use these (from ACEDB) and provide similar data for pipeServers. Or rather that we need to implement the above configuration data in ZMap and either patch the ACEDB data into the same data structures or replace it with our own. ZMapView holds the following mappings:

and ZMapWindow has feature_set_names (columns in display order, which turns out to be the featuresets as defined in ZMap [source] stanzas config) and featureset_2_styles.

ZMapFeatureContext has feature_set_names which is the 'requested featuresets' in the context. This derives from the list of featuresets specified in ZMap [server] config stanzas.

Requesting data

In zmapView.c/zMapViewLoadFeatures() (but not zmapView.c/zMapViewConnect()) when geven a request for a GFF source the code tried to map this to a featureset (ie display column) and then tries to find the featureset in any of the configured servers. So, traditionally ZMap has requested display columns aka featuresets from ACEDB and ACEDB has supplied several GFF sources in reply.

This has some implications for the request protocol in ZMap - ACEDB can accept requests for GFF sources directly or the columns that they are to map to. Some OTF requests for data involve GFF sourcfes that are not mentioned in the ZMap config but instead relate to the source_2_featureset list that is returned from ACEDB. pipeServers can only accept GFF sources as request fodder, which is OK as that is the format supplied by Otterlace, and all these can be configured without too much trouble.

It is important that all these mapping lists are merged and we have to consider the ZMap config file ACEDB and DAS.

Displaying data received from a server

The above deals with finding where to request a GFF source from, the converse is where to display GFF data when it comes back from a server (ie what column to put it in).

zMapWindowCreateSetColumns() and set_name_create_set_columns() in zmapWindowDrawFeatures.c both call produce_column() and appears to assume that the featureset is a display column rather than a GFF source. The columns are created in zmapWindowDrawFeatures.c/windowDrawContextCB() which is an execute function - look at the end of the BLOCK level. The window data (window->feature_set_names) holds the list of columns.

Instead, looking at zmapGFF2parser.c we can see that the makeNewFeature() function does the reverse source to featureset and source to style mapping: we simply have to pass this data to the pipeServer and features should end up in the correct display columns. Even with features being supplied by different servers for the same column ZMap will merge in the new context to the old, so there should be no problem.

Featureset column and style data

The ZMapView structure will be modified to contain the mappings above, and the ...(continued p95). As the mappings operate as quark to quark the config stansas may be read in an any order.

Extracting styles from columns data

zmapWindowContainerFeatureSet.c contains code that handles all the features and styles in a column, and there are functions to extract style information from a style table associated with the column. This works by finding the first match for a style parameter in the complete list of styles. If we add a new style to match the column name (if this is not already done) then this process should continue to work as is.

Where to store column styles?

zmapWIndowUtils.c/zmapWindowFeatureSetStyles() reveals that if we just add a column's style to the view->featureset_2_style list that will be enough - this data is copied to the window at some point and then used to find all the styles necessary.