Implementing fixed style density plots in ZMap
Here we define how to display a coverage featureset as a density plot. The data is supplied via GFF and includes start and end coordinates for the bins and a score which is the number of features in the bin. Note that even though the display is likely to be log(score) we need the actual number so as to be able to combine bins as zoom level changes.
Whether or not to produce a density plot will be determined by a column's style using some new parameters.
A new mode (density) will be introduced which is automatically a graph mode display but allows bins to be combined.
Density mode styles will trigger base features being re-processed into display data structures based on pixel boundaries.
Multiple featuresets may be displayed in a single column and each one will produce its own graphical display and the styles should be defined to ensure clarity - heatmaps and histograms do not overlap well but wiggle plots may.
Configuration
An example:
[density-plot]
graph-mode = heatmap | histogram | line # default = heatmap
graph-scale = log | linear # default = log
colours = normal fill border # default = fill white; border black
# frame-colours are also possible according to frame mode
mode=graph
graph-density=true
graph-density-min-bin=2
graph_scale=log
For a heatmap the colours used will be calculated according to the scale and between the fill and border colours, with border being the maximum density.
Other existing style parameters will be used in a density mode style:
width = pixels
min-score = 0 # number of features for minimum score
max-score = 2000 # number of features for max score
min-mag =
max-mag =
Note that min-score and max-score define the number of steps in the display and also the behaviour of a heatmap - 0 will mean no features and displays as the fill colour and a single feature will be the next step up.
Bumping density plots
It is a requirement to bump a heatmap into a histogram and it is possible that other options could be asked for. We can introduce new bump modes and/or bump styles to achieve this. Note that a lot of bump modes make little sense for binned data.
Implementation - drawing the plots
Change of philosophy - virtual features
There are some problems with simply drawing bins as received:
It is easy to imagine in excess of 10k features in a column. If we have several of these then we have performance problems.
If we overlap bins at low zoom then we still see the maximum level (histograms and line plots) but for heatmaps we may miss data.
Bins may not map well to pixel boundaries.
To avoid performance overheads due to large amounts of data a single density plot canvas item (ZMapWindowDensityItem) will be added per column and this will handle the display and selection of all bins. Please note the following:
This item will cover the whole extent of the column, ie it will range from start to end coordinate of its parent, and will set the width of the column as the width of the widest bin.
It will be added transparently from the Item Factory when a source bin item is added to the canvas.
Each source bin item added will be added to a data structure (a skip-list) held by the density item but not added to the canvas directly.
The FToIHash will point to the real canvas item for each source bin added (a many-to-one relationship) and will also have a pointer to the original feature (a 1-1 mapping).
On draw or zoom the density item will calculate a list of display bins, which will be based on pixel boundaries and may hold several source bins depending on zoom.
It will draw any section of the plot on expose, driving the GDK_Draw function from its internal display list.
For a wiggle plot a single GDK_PolyLine call will be used - there cannot be more than 1200 lines for this on current monitors, but the number of segments will be set to some arbitrary limit.
For histograms and heatmaps rectangles will be drawn for each bin.
Mouse and keyboard events will be handled and the relevant display or source bin used to extract meta information and respond as for individual features at present.
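The re-binning step on draw or zoom could look something like the sketch below. This is not the ZMap implementation: it uses a plain sorted array where the design calls for a skip-list, the Bin struct and rebin() are hypothetical names, and it assumes that combining bins means summing their raw counts (taking the maximum or the mean would be equally easy to substitute).

```c
#include <stddef.h>

/* Hypothetical bin: inclusive base coordinates plus a raw feature count. */
typedef struct { long start, end; double score; } Bin;

/* Combine source bins (sorted by start, non-overlapping) into display bins.
 * Source bins whose start falls in the same pixel are merged and their raw
 * counts summed; this is why the raw count, not log(score), must be kept.
 * At high zoom each source bin lands in its own pixel and is copied as is.
 * Returns the number of display bins written (at most max_out). */
static size_t rebin(const Bin *src, size_t n_src, long seq_start,
                    double bases_per_pixel, Bin *out, size_t max_out)
{
    size_t n_out = 0;
    long last_pixel = -1;

    for (size_t i = 0; i < n_src; i++) {
        long pixel = (long)((src[i].start - seq_start) / bases_per_pixel);

        if (n_out > 0 && pixel == last_pixel) {
            out[n_out - 1].end = src[i].end;       /* extend current display bin */
            out[n_out - 1].score += src[i].score;  /* accumulate raw count */
        } else {
            if (n_out == max_out) break;
            out[n_out++] = src[i];                 /* start a new display bin */
            last_pixel = pixel;
        }
    }
    return n_out;
}
```

For example, four 10-base bins displayed at 20 bases per pixel collapse into two display bins, while at 5 bases per pixel they are passed through unchanged.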
Other implementation issues
The window status bar, feature details window, and feature search dialog will all need tweaking to work well with density data. In particular feature search makes no sense for density columns and the assumptions in the feature context FToIHash such as 'each feature has exactly one corresponding canvas item' will be broken.
Effort needed
What needs to happen:
styles to define display parameters
implement density canvas item (interface to canvas)... cut and paste
on display and zoom compute display bins
on expose draw a series of poly lines OTF or rectangles from the bins data for the exposed region
catch mouse click on column: show info on status bar: look up bin from canvas item...
catch double click: show feature dialog ?
patch feature search to handle density columns
resolve canvas_item->feature references
Further notes (during implementation)
Creating canvas items
- we have a feature and decide to display it
- this goes through stuff like zmapWindowDrawFeatures and ends up in the windowItemFactory.
- according to the feature's style a different factory method gets called:
{ ZMAPSTYLE_MODE_GRAPH, drawSimpleGraphFeature }
- this calls:
canvas_item = zMapWindowCanvasItemCreate(parent, y1, feature, style);
zMapWindowCanvasItemAddInterval(canvas_item, NULL, 0.0, y2 - y1, x1, x2);
- CanvasItemCreate calls feature_is_drawable() which returns a type such as ZMAP_TYPE_WINDOW_GRAPH_FEATURE
depending on the feature's style
NB in zmapWindowBasicFeature.c this process is repeated.
What we need to do is to change drawSimpleGraphFeature() so that it makes a singleton graph column item
and adds the source features to this instead of calling zmapWindowCanvasItemCreate(), which will be clean.
Either the class data needs to include a hash of existing graph column items or we look up existing ones in the parent.
Glyphs also get run thro' basic feature, which is odd.
Patching density items via the item factory
This is slightly intricate and not immediately obvious.
Preserving graph mode=histogram (non density plot graphs)
A small preparatory task was to take simple graph features (the only implementation so far - graph mode histogram as simple boxes) out of zmapWindowBasicFeature.c and to create a real ZMapWindowGraphItem class. This involved creating a new file zmapWindowGraphItem.c based on BasicFeature and adjusting code in zmapWindowItemFactory.c and zmapWindowCanvasItem.c/feature_is_drawable().
Note that the item factory (zmapWindowFToIFactoryRunSingle()) chooses a factory method indexed by a feature's style mode and then each one of these calls zMapWindowCanvasItemCreate() which determines the type of canvas item to create via feature_is_drawable() based on the feature's style mode, so effectively the same decision is being made twice in a row.
As each factory method is tied directly to each style mode by the above we cannot call different methods for different types of graph and the method already defined (drawSimpleGraphFeature()) must necessarily be used to draw other types of graph, so to avoid confusion this has been renamed to drawGraphFeature().
After creating an empty canvas item of the correct type (previously Basic, now Graph), drawGraphFeature adds the actual data via the class->add_interval() function which resolves to zmapWindowGraphItem.c/zmap_window_graph_item_add_interval(). This adds a foo canvas rect to the ZMapWindowCanvasItem which is a foo canvas group.
Inserting the ZMapWindowDensityItem item once per column
In drawGraphFeature we test the graph mode style's density flag and if set add the ZMapWindowDensityItem once only. This is implemented by the class directly, which contains a GHashTable of these indexed by the column unique id, which is accessible to the factory via the run_data->container foo canvas group.
Note that we make this more complicated by adding style_id as well and we construct a composite GQuark (once) in the item factory.
Adding data to the ZMapWindowDensityItem instead of the foo canvas
Using the style's density flag as above we choose whether to call zmapWindowCanvasItemCreate or not.
Having obtained our singleton data structure we add our graph item to the source bins list.
Displaying data - making assumptions
A coverage display is implicitly non-overlapping; if we wish to display overlapping data then we will use two different styles which will cause two different density displays to exist. This allows an optimisation when looking up display data on expose - we can search on start coordinate and look at the previous item to see if it overlaps the exposed region.
It is intended to replace normal graphs with the density module (which will operate in the same way but without re-calculating bins on zoom), and in this case it is possible we may have to handle overlapping data. This can be done automatically (no configuration needed) by remembering the longest item to display and offsetting search coordinates by this amount. As features are expected to be small this will be efficient enough - we simply scan a sorted linked list. Taking a hypothetical worst case this is unlikely to be time consuming.
If necessary we can create display indexes for long and short items to provide an efficient solution without risking an incorrect display.
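The lookup on expose can be sketched as follows. This is an illustration only: a sorted array and binary search stand in for the skip-list, and Seg, lower_bound() and exposed() are hypothetical names. The 'longest' argument is the length in bases of the longest item seen so far, which is the offset applied to the search coordinate as described above.

```c
#include <stddef.h>

/* Hypothetical display segment with inclusive base coordinates. */
typedef struct { long start, end; } Seg;

/* Index of the first segment with start >= key (n if there is none).
 * segs must be sorted by start. */
static size_t lower_bound(const Seg *segs, size_t n, long key)
{
    size_t lo = 0, hi = n;
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (segs[mid].start < key) lo = mid + 1; else hi = mid;
    }
    return lo;
}

/* Collect the indices of segments overlapping the exposed region [y1, y2].
 * The search starts 'longest' bases before y1 so that an item starting
 * above the region but extending into it is not missed. */
static size_t exposed(const Seg *segs, size_t n, long longest,
                      long y1, long y2, size_t *hits, size_t max_hits)
{
    size_t count = 0;
    size_t i = lower_bound(segs, n, y1 - longest);

    for (; i < n && segs[i].start <= y2; i++) {
        if (segs[i].end >= y1 && count < max_hits)
            hits[count++] = i;
    }
    return count;
}
```

For non-overlapping coverage data 'longest' is simply the bin size, so at most one extra item is examined before the exposed region.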
Work in progress
- graph items have been made into ZWCI and have foo rects added as intervals
rather than existing as a variant of basic feature
- free standing glyphs are added as foo glyph items to basic features
- for density plots we need to add GraphFeature for the single
density column wide item (a hack) and then add an interval to that which
will be the foo extension density item
This is needed as the ZWCI is a foo group extension and does not draw directly
whereas foo items do
- when canvas items get simplified then we turn the basic/density ZWCI into a simple
foo extension as for other types
- zMapWindowGraphDensityItemGetDensityItem() returns the ZWCI if found.
If not then the item factory (via drawGraphFeature()) will add it and add the interval
Note that we will call foo_canvas_item_new() directly rather than via zMapWindowCanvasItemCreate()
- problems:
canvas item->feature gets accessed directly and cannot be NULL (OK: set to first)
We can invent a singleton feature that gets assigned data depending on mouse clicks??
we need to handle long items??
style needs to be set globally -> this looks like the add_interval() call but we need to
check compatibility and/or write the code directly
- to do:
* paint the graph in the _draw function
* test as for plain graph features: should display the same
* test using two featuresets with the same data from GFF files and display as old and new style
* display as lines not boxes -> combine data points into a poly_line
* display as heatmap: boxes are full width and colour coded; ok with irregular size bins?
* handle mouse clicks etc, point and bounds
handle long items (trivial?.. ignored without mishap)
handle hide and show
* re-bucket on display and zoom
* get some timing stats
add on GC skew
add print/ screen dump handling
Operating coverage display and bumping
heatmaps for related files (eg tissue types) will be displayed in one column and bumping will bump all heatmaps together
currently bumping to wiggle is preferred over histogram
users want to be able to hide 'uninteresting' heatmap/ wiggle features (as in feature hide/show)
To implement this new style and bump options will be provided:
[coverage]
bump-style=style-name # disables 'bump-more-options'
stagger=pixels
show-zero=true/false
overlap=true/false # need to optimise alignments code (not relevant here)
stagger will offset each featureset item in the column by a set amount and these will be ordered as per column configuration, for example:
[columns]
coverage = liver; heart ; brain
will specify that the featuresets are in that order from left to right. Note that each set's width is specified separately and this may indicate overlap (eg for wiggles) or a gap (eg for a heatmap). This is necessary to avoid having to define different styles for each featureset with a different offset - it is important that the featuresets appear in the same (predictable) order and position within the column.
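The arithmetic behind stagger is simple enough to sketch. The code below assumes that the Nth featureset in the column configuration is drawn at N times the stagger value; FeatureSet, set_offset() and column_width() are hypothetical names used only for illustration.

```c
#include <stddef.h>

/* Hypothetical per-featureset display data: name and style width in pixels. */
typedef struct { const char *name; double width; } FeatureSet;

/* Horizontal offset of the featureset at position 'index' in the column
 * configuration (0 = left-most). */
static double set_offset(size_t index, double stagger)
{
    return index * stagger;
}

/* Overall column width: the right-most edge reached by any featureset.
 * A width larger than stagger makes neighbours overlap (eg wiggles);
 * a smaller width leaves a gap (eg heatmaps). */
static double column_width(const FeatureSet *sets, size_t n, double stagger)
{
    double widest = 0.0;
    for (size_t i = 0; i < n; i++) {
        double right = set_offset(i, stagger) + sets[i].width;
        if (right > widest) widest = right;
    }
    return widest;
}
```

With liver, heart and brain each 16 pixels wide and stagger=20, the sets start at 0, 20 and 40 pixels with a 4 pixel gap between them, and the column is 56 pixels wide.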
The whole column may be shown/ hidden as normal, but the RC menu will be amended to allow a single featureset to be hidden / displayed with a keyboard shortcut for the selected one. Note that we will have to show column highlight differently and for heatmaps this may be difficult. One solution would be by using an outline for select. The columns dialog popup was intended to allow selection of featuresets in a column at some point... NOTE that this will be deferred until the dialog design is revisited, and in the short term only a menu driven interface will be provided.
Handling short reads data
To avoid accumulating too many features on the ZMap display data outside the mark will be removed. This data will be displayed on 'request from mark' and remain until the mark is reset and set again. Adjusting the mark will not remove displayed data.
Handling focus interface
Given a mouse click that is attached to a single column wide foo item we can find the relevant feature implied by the mouse coordinates and set the foo item's feature to be this and the focus code can be adjusted to cope with this by adding the feature to the focus item data structure.
Focus code is called from zmapWindowItem.c. For a simple mouse click this causes no problems as we can find the relevant feature quite easily. For an alignment (highlight same name items) this requires a feature search and will not work with density items using the current form of feature search (which returns a list of foo items which would all be the same).
This list of items is then added to the focus list en-masse and if this is derived from density items it will work.
However in the short term we know that alignments are not displayed as density items and this need not concern us.
Focus code is also called from zmapWindowState.c and this restores a single 'hot' item. This (current ZMap) fails to restore the previously highlit data and is relatively simple to emulate. Window state is stored via some 'serialise' function - it's not clear if this is necessary, as we only restore state eg after a RevComp, and the feature context is still valid.
There is a further problem to solve of finding a feature from a feature id; previously this has been done by finding the corresponding canvas item which has a pointer back to the feature. (Note: the following suggestion was not followed.) Focus code could be extended to store start, end coordinates of features (via the feature itself) and this can be used to look up the feature in the density item's skip list.
Focus items a) have to be highlit and b) the highlight must persist until removed. Taking into account that density items (and normal graphs) do not overlap we can implement this with a simple flag in the graph segment data structure and just display a different colour, which is stored in the column wide extended foo canvas (density) item.
To handle overlapped items (eg alignments) it will be necessary to post-process focus items and display them after the others - perhaps the logical way to do this is to display focus items at the end of every expose event, but that is outside the scope of this module.
Note that focus code highlights features using a given colour not necessarily the features' style's SELECTED colour if a window default has been set. This default can alternatively be set using the root style. The interface is via a ZMapWindowCanvasItem function (set_colour) and for density items we will need to patch into this. Focus code works using a set of flags for different groups of focus items (eg selected, evidence) and we can store these flags in the graph-density item struct for each feature which will allow us to implement all optional colours, storing these in the global graph density struct. Note that for density items we only need to handle 'selected' but implementing this interface now will allow other data types with more display options (eg masked alignments) to be implemented easily.
Unfortunately due to scope issues it is not possible to use focus code to get colours, and the canvas item class interface is a bit convoluted, so to keep things simple we will store a limited number of alternate colours directly in the graph density item struct and choose these according to a flag. This will limit density items to being normal colour or selected (as defined by the focus code) only.
Handling the FToI search interface
In the short term we can apply restrictions to this. Search for coverage data by name is not particularly useful and we can just disallow this or make it fail. Likewise coverage data is not valid as evidence features (transcript/highlight evidence) and we can assume this will not occur in the short term (see zmapWindowFeature.c # 1352 approx).
In the longer term we need to modify the interface to return feature as well as foo item.
Feature search has been made to work with density items - FToI search returns a list of FToIHash data structures instead of canvas items. These contain a reference to the feature as well as the canvas item.
Implementing low-zoom triggered density plots in ZMap
(under construction)
These are earlier thoughts, now abandoned. The idea was to improve performance by switching to coverage, but this loses information (eg alignment score).
A separate style will be added for optional density mode displays and each normal feature's style will refer to this (in a later implementation).
Bumping the column will revert to using the normal style. The density style bump mode will be used to control whether or not this is allowed.
A density mode style will automatically also be a graph mode style and graph-mode configuration will be used to drive the display. The difference between simple graph mode and density graph mode is that one uses features and the other feature-frequencies per pixel.
Note that a column style may be set independently of the included featuresets or can be inferred by merging the featureset styles. All featuresets in a column will be treated as one for the purposes of density plots.
(For example if we created a UniProt column then Swissprot and Trembl would be displayed in different colours as at present when displayed at high zoom but would be combined to create a single density plot.)
Note that although most of the relevant data consists of alignments some high volume data (eg ensembl_variation) are basic features and therefore these parameters must apply to all style modes.
[trembl]
density-style = density-plot-trembl
Implementation
Normally we would wish to display a density plot when the zoom level is such that features are not readily discernible due to overlap. This is a) to give extra information and b) to implement a faster display. Using the foo canvas we are limited to nominally 1000 pixels at minimum zoom and 30k pixels at maximum, but note that future developments may change this. However we expect future technology to allow us to display significantly more features in less time. We do wish to avoid drawing 30k density features - this would be missing the point of using these. A solution would be to impose a maximum number of these in the style and increase the vertical size of the bins by whole numbers of pixels as the zoom level increases.
One way to decide when to switch between display modes would be the existing min and max zoom parameters in styles and this looks like a good place to start. The disadvantage is that to achieve best results the style has to be set appropriately for the data and this is a trial & error technique and will always be a compromise.
Another way would be to set a minimum average or maximum bin volume - if no bins have more than X features then display the features.
What is important is that there is a clear switch between one style and the other. For example if we defined styles to display at different zoom levels and display real features and density data in the same column there could be a transitional zoom level where both were displayed and we need to avoid this.
GC content
(needs to be reviewed, ref to density plots above)
Not a density plot
GC content provides an easy way to implement wiggle plots as a graph mode feature which can later be used in a density plot environment subject to configuration. Note that they are defined as style mode = graph!
Rolling window size
A cursory scan of google suggests a window of 50-500bp - 50 as the GC skew can change suddenly and 500 relating to 'flanking DNA', with values being calculated at 100 bp steps (Human Molecular Genetics, 1999, Vol. 8, No. 6 1061-1067). This suggests that we must provide this parameter in the style, and also possibly provide more than one graph, using different window sizes. One way to implement this would be to define hard coded featuresets (like DNA) such as GC_Skew1, GC_Skew2 etc and map these into a single column (if desired) and attach different styles to specify the window size.
If the zoom level is so low as to contain more bases than the window size per pixel then the GC content will be calculated per pixel. This will be triggered by setting min and max zoom levels per GC_Skew style (interpreted as base pairs per pixel), and a per pixel average will be triggered with a window size of 0.
GC Skew will not be displayed if the number of bases per pixel is greater than the window size, and the rolling window size will be set by the min_zoom style parameter. The step size can also be parameterised; this will be useful for all density styles and provides a way to limit the number of features displayed.
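A rolling GC content calculation with a configurable window and step might be sketched as below. This is an illustration under stated assumptions rather than the planned implementation: the function names are hypothetical, the result is GC content as a percentage of the window (not the (G-C)/(G+C) skew statistic), and the per pixel average triggered by a window size of 0 is not covered.

```c
#include <stddef.h>

/* 1 if the base is G or C (either case), else 0. */
static int is_gc(char base)
{
    return base == 'G' || base == 'C' || base == 'g' || base == 'c';
}

/* Rolling GC content: one percentage per window of 'window' bases, with the
 * window advanced by 'step' bases each time. The running count is updated
 * incrementally rather than recounting each window.
 * Returns the number of values written to out (at most max_out). */
static size_t gc_content(const char *dna, size_t len, size_t window,
                         size_t step, double *out, size_t max_out)
{
    size_t n_out = 0, gc = 0, start = 0;

    if (window == 0 || step == 0 || len < window) return 0;

    for (size_t i = 0; i < window; i++)          /* count the first window */
        gc += is_gc(dna[i]);

    while (n_out < max_out) {
        out[n_out++] = 100.0 * gc / window;
        if (start + step + window > len) break;  /* next window would overrun */
        for (size_t k = 0; k < step; k++) {      /* slide one base at a time */
            gc -= is_gc(dna[start + k]);
            gc += is_gc(dna[start + window + k]);
        }
        start += step;
    }
    return n_out;
}
```

For the sequence GGCCAATT with a window of 4 and a step of 2 this yields 100, 50 and 0 percent; each value would become one point on the wiggle plot.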
Graphical display
A simple linear scale plot with 50% GC content as the centre and min and max_score in the style being used to define the extremes of the percentages to be displayed (eg 30 and 80%).
Efficiency
Sneaking past the FooCanvas
Implementing a wiggle plot as a series of poly-line features may operate significantly faster in the foo canvas than a feature per line segment. Implementing a single polyline may also operate slower due to having to redraw the whole feature every time. If we work in a 30K pixel maximum window size and 100 lines per feature then this results in 300 features, which may paint quickly, and if so will allow us to paint rolling averages at maximum zoom. These parameters can be held in a ZMap config stanza if necessary. This may be significantly faster than histograms or heatmaps. Implementing parameterised plots like this will also not tie us to any particular canvas.
More realistically if we have (for example) a window size of 40 and a step size of 10 then a 30K pixel display will require 3000 line segments which could be configured as 100 features of 30 line segments.
If we have a sequence of 200k bp then the maximum number of line segments would be 2x20k ie 40k.
Showing details
The window ruler can be modified to show details of the displayed averages at the current sequence coordinate if the mouse is over a GC column.
Redisplay on zoom?
If we assume that we can display nominally 300 features of 100 line segments efficiently using the Foo Canvas then we can just display the same features regardless of zoom level, but some features must be hidden or displayed depending on the zoom level, and hopefully this is already coded??
By definition the per pixel average must be redisplayed on every zoom change.
Long item handling must also be applied to these features.
Log and linear scales
If external data is supplied as logarithmic then the display style will have to be set to linear. To avoid confusion we will set linear as the default and this configuration option can be omitted.
Log format displays will have to be calculated per bin and if the log() function is slow then we can calculate threshold values per horizontal pixel and look these up per bin.
Sample Configuration
Here we illustrate GC Skew wiggle plots for low zoom and rolling average of 500 and 50 bp. Note that the styles define whether or not we use wiggle plots or histograms etc and this is not hard coded.
ZMap Config
[ZMap]
# number of lines per feature
wiggle_feature=100
[Columns]
# 3 hard coded featuresets, triggered by the prefix 'GC_Skew_'
# (styles default to the same name)
GC_Skew = GC_Skew_0; GC_Skew_500; GC_Skew_50
Styles
[GC_Skew] # column style
width=20
[GC_Skew_0]
mode = graph # not a density feature!
graph-mode = line
colours = normal border green
min-mag = 0 # per pixel average
max-mag=500 # don't display at bigger zoom, relates to GC_Skew_500
[GC_Skew_500]
mode = graph
graph-mode = line
colours = normal border blue
min-mag=500
[GC_Skew_50]
mode = graph
graph-mode = line
colours = normal border red
min-mag=50
To Do
add style and configuration options
generate rolling averages as configured for loaded DNA and store as features (in featuresets in the feature context) with several values held in an array allocated off the heap. Sort these into an ordered list.
implement a new canvas item for wiggle plots
check min/max zoom show/hide code and adjust as necessary
add wiggle items to long items code.