It is thought that the volume of data is too great for direct use by ZMap as the time needed to load and display woud be impractical. Instead ZMap will load summary data and display data volumes in fixed size bins over the sequence in view. The actual data will be displayed by Blixem and requested over shorter regions defined by the mark.
In the short term we do not expect to be able to handle summary data and instead provide a menu driven interface to run Blixem. This will involve adding a 'Blixem Short Reads' option to the column menu for each configured short-reads data sources. Later on when summary data is available the summary columns can be Blixem'ed directly. This menu option will be interpreted as if the user clicked on a summary column and not a feature.
By default Blixem operates with a region of 200k bases centred on the mouse (which can be set by the Edit/Preferences main menu item) but for short reads data this will be different: the user must set the mark and if not set the menu options for short reads data will not appear.
Using ENCODE data as a abasis for development we choose a subset of this and make it avaialble via a column menu.
more text needed here...
The major problem to solve is one of passing round information regarding what data to request. Otterlace has to supply enough configuration information to ZMap and Zmap has to pass on this plus sequence information to Blixem.
BAM is a file format not a data source and we need to be able to specify multiple data sources in the BAM format and present these to the user with relevant names.
There are currently a number of ways that ZMap is asked to load and display data and these operate by some implicit protocols. It may be appropriate to redefine some of this to avoid duplication in Blixem configuration.
ZMap can be started up without specifying a 'default sequence' and this results in the main window (not normally visible to Otterlace users) being ready to accept sequence requests, and no view window being displayed. Views can be created by requesting this from the main window (the ZMap Manager) or via an XRemote request from Otterlace, (both of which reuqire a sequence name and start and end coordinates) but note that currently there is no clear linkage between a view and the data sources or request protocol used to request the data. This has implications for standalone operation and currently the ability to specify different sequences is limited to operation with ACEDB. Note that ZMap does not provide this kind of function with pipe servers.
ZMap will use the sequence name of the current view and the requested start and end coordinates (specified by the mark) as the source of this data and pass these directly to Blixem. Blixem will call a script provided by Anacode to generate a GFF file containing BAM file sourced data and pass this data to the script via command line arguments.
The blixem configuration file blixemrc will be amended to include a stanza for fetching short-read data, which will include a script with optional arguments, not including the sequence name or start or end coordinates. Refer here for some relevant comments regarding other pipe-server arguments which may affect this configuration in future.
The config file already specifies the default pfetch mode for our current data type (EMBL sequences). This fetch mode relates to fetching missing data for sequences that Blixem already knows about (i.e. those in its <datafile> input file). The new fetch method will fetch all sequences within a particular region from a particular source, without any prior knowledge of what those sequences are. We could in the future want additional fetch methods for different data types that we don't yet know about, which may fetch sequences or just sequence data.
The following suggestion for a config file format is an attempt to be as flexible as possible in these requirements. See the wiki page for more info about the Blixem config file.
[blixem] bulk-fetch = pfetch-socket # default fetch method to use when there is no data type given for an item in the GFF file user-fetch = pfetch-http # optionally, this can be used to specify a different fetch method for when the user double-clicks a sequence [embl] # config options for the embl data type bulk-fetch = pfetch-socket [short-read] # config options for the short-read data type bulk-fetch = region-fetch [region-fetch] # config options for this particular fetch method script = /nfs/users/nfs_j/jh13/work/bam/bam_get ...
Assuming that we have two BAM sources configured (in ZMap) as:
[ZMap] seq-data=14dpf;Liverthe Blixem configuration file will include new stanzas for each source of BAM data in a format as follows:
[14dpf] args=-bam_path=path_to_data_file =bam_cs=NBCIM37 name=14dpf description=14 days post fertilisation [Liver] args=-bam_path=path_to_data_file =bam_cs=NBCIM37 description=Adult Liver etcand Blixem with know to read the relevant stanza by referring to the GFF file provided by ZMap with a single line for each short-reads request. The name of each stanza will be exactly as specified in the source field. For Blixem the name and description items are optional.
Note that this implies that the blixem configuration file must not be generated for each invocation of Blixem if the dataset changes.
The GFF file would then look something like:
##gff-version 3 ##sequence-region chr4-04 210623 364887 # the reference sequence region that is passed to blixem chr4-04 source1 region 215000 300000 0.000000 . . dataType=short-read # the coords here specify the region to fetch short-reads for chr4-04 source2 region 215000 300000 0.000000 . . dataType=short-read # optionally fetch short-reads for a second source chr4-04 vertebrate_mRNA nucleotide_match 212438 212568 71.000000 + . Target=AK075066.1 125 255 +;percentID=77.1;dataType=embl chr4-04 vertebrate_mRNA nucleotide_match 211806 212636 544.000000 + . Target=AK123078.1 831 1683 +;percentID=84; # no dataType tag => use default-fetch
Note that Blixem will need to pass chromosome coords to the bam-fetch script. ZMap should pass everything to Blixem in chromosome coords plus the required offset (via the -O argument) that Blixem will need to use to display its coords in the same coordinate system as ZMap.
All other options relating to short-reads data will be passed in the GFF file and Blixem configuration and no further command line arguments are needed.
seq-data | A list of featuresets containing SR data As for DNA and protein featuresets we try to match both columns and featuresets to allow simple configuration. (Note: each column/featureset has options for short names and more lengthy descriptions). Refer here for some relevant comments regarding extra ZMap config options |
The [ZMap] stanza will also include:
dataset | eg 'human' |
sequence | eg 'chr14-10' |
Blixem will find the script path and fixed arguments from a stanza in its configuration file 'blixemrc'. Other arguments will be appended as follows:
-seq_id | the name of the reference sequence eg 'chr6-18' |
-gff_seqname | (alternative format of '-seq_id for use by ZMap) |
-start | request start coordinate |
-end | request end coordinate |
-bam_path | path to the BAM file |
-bam_cs | coordinate system of the BAM file, probably NCBIM37 |
In the near future ZMap will also be able to request data from this script and all script configuration will be done via the existing pipe server mechanism.
As specified above
[ZMap] seq-data=14dpf;Liverwill be used to specify BAM data that may be used requested by Blixem. ZMap will in future be able to display coverage data for BAM datasets and also request the real data from the mark by R-clicking on a coverage column.. Each set of BAM data and coverage data will be named as distinct featuresets and these may be requested via the existing pipe server mechanism - either by a request on startup or via the columns dialog or by request from otterlace. We expect coverage data to be displayed by default and the users to request sub-sequences of BAM data by using the RC menu on a coverage column (although there is a request for a special pane on the load columns dialog).
Each BAM pipe server will support the relevant ZMap featureset, and coverage data will link to real data via another stanza in the ZMap configuration:
[featureset-related] 14dpf_coverage=14dpf # (coverage = real data)and right clicking on a coverage column will display an option to display the real data column. (This will replace the inclusion of all BAM data sets in the RC menu as at present). Whether or not this causes the columns to switch or display together is not relevant here. Existing ZMap configuration (eg [featureset-description]) will be used to specify display names and helpful information as at present.
Referring to the request for a new pane on the load columns dialog, the existance of a [Coverage] mapping may be used to include featuresets in this page or not, and also the existing load columns pane.
Depending on where the user clicks a differnt choice of Blixem menu option appear and in the short term all cases will include access to short reads data as configured, but only if the mark is set. It would be preferable to display these as disabled/ greyed out but this option has not been programmed and would require extra work to include.
In zmapWindowMenus.c an anonymous enum of BLIX_ options has been defined and this is used to specify what kind of homology to fetch (DNA, peptide, column or feature(s) and we can add a new open BLIX_SEQ. Eventually we aim to interface by the user clicking on the desired column but in the short term BLIX_SEQ will be used as the base of a range of values referring to a list of featuresets defined in the ZMap config, which is stored in the view's context map and is accessable to the Window code. (view|window->context_map.seq_data_featuresets). When short reads data can be displayed ion ZMap then this can revert to a single value.
This data is copied to a data structure ZMapWindowCallbackCommandAlign by zmapWindowFeatureFuncs.c/zmapWindowCallBlixemOnPos(). There is another enum ZMapWindowAlignSetType in zmapWindow.h which defines the same options as the BLIX_ stuff but is accessable by other modules. We map BLIX_ to ZMAPWINDOW_ALIGNCMD_ and at the same time insert the appropriate featureset as chosen from the menu as if the user clicked on a column.
Blixem expect to receive a list of features in a GFF file and for short reads data the features are not known and we give a 'source' feature instead corresponding to the featureset, which we will pass through as text (the name).
Th interface to the zmapViewCallBlixem() code is via the functiondoBlixemCmd() which is given a ZMapWindowCallbackCommandAlign structure and calls zmapViewCallBlixem() passing as arguments each of the members of that structure. doBlixemCommand() copies these arguments to another data structure blixemDataStruct which is private to zmapViewCallBlixem.c.
Many features are duplicated and when bumped take up a lot of space on the screen, and we can compress this by collapsing identical features into a single display item with a score, which can be dislayed by width or heat colour.
The features are explicitly named and displaying a composite feature will loose some of this information, however this is not thought to be a problem. If we store all the source features in the feature context and have these sorted by start coordinate then it will be relatively simple to provide a list of these if necessary.
Compressiing features on display presents some problems:
A style option will be added to switch on compression, and this will apply to any type of feature: ideally we should be dealing with alignments only, but simple features are implicated as BAM data is currently presented like this.
[short-read] collapse = true score-mode=width # or heat score-scale=log # (default is linear)We provide an option for log scale scoring to cope with sparse and highly populated regions. Previous types of style do not precess the scores and where log scales are required these have hostirically been pre-computed, However we need to keep the scores as integer numbers (a count of the number of features) and display then logarithmically. The score-scale must default to linear to prevent contamination of existing scored features.
min and max score apply as normal, but not that this may apply to the actual score not the log scale.
NOTE if we collapse alignments then these may have a score and score mode set. If both collase and score mode is set then collapse will take priority. If collapsed alignments have differtn scores then we will loose this information from the display - if it is necessary to observe different scores then it is not valid to collapse features.
When a featureset is merged into the feature context we will process this as follows:
As the original; features appear in the feature context they can be searched for and despite not being in the canvas they need to map to a canvas item. One way to resolve this is to have all the source features in the FToI hash mappng to the same canvas item (the CanvasFeatureset. This is a many to one mapping, but the reverse is one to one. Alternatively we can just frop the feature from the FToI hash.
If the style used specifies an alignment then we need to run some different code and require each part of the alignment to match. NOTE however as we are dealing with short reads we can assume that each alignment is one simple gapped block and that each one appears as a simgle feature. We do not need to match several 'exons' as for EST masking.