Density plots

A caveat: this page started of as design ideas and developed into implementation notes which evolved as requirements and technical obscurities became clearer, so be skeptical af what you read here. It's mostly true but pedantry cannot be acheived. Later on the density code gets to be used as a basis for the new-model-canvas.

Motivation

Summary data

There are two main reasons for wanting to present a set of features in terms of volumes: 1) speed and 2) usefulness.

Especially at low zoom levels high volume columns are displayed with many features overlapping with the result that most are not visible despite being drawn on the canvas. Experiments with filtering out invisible features from the display while still presenting the same picture show that improvements in speed by a factor of 20x are realistic.

Features displayed in low resolutions are not very useful to annotators - the display at low zoom levels serves mainly to show where features are and a density plot would give more information - as mentioned above overlapping features are not visible and help no-one. These kinds of display have been requested specifically for RNAseq data.

GC content

A related requirement is to show GC content which is a rolling average calculated from the DNA sequence and this can be displayed as a histogram or wiggle plot.

Display output

Here we consider displaying any kind of feature entirely in terms of how many there are and there a few ways of representing this information. We expect this process to be useful primarily for alignment features (eg RNAseq)

Heat map

Fixed width with each vertical pixel colour coded according to how many features overlap that region.

Bar chart

Horizontal stripes with the width corresponding to the number of features.

Wiggle plot

A line chart (what most people think of as a graph)

a picture of density plots

Typically the density scale will be chosen as logarithmic.

Other browsers (eg GBrowse - server side web based) implement density plots by using pre-computed bins based on different zoom levels but this imposes a requirement on the data source. Figures such as "a few minutes for 16k features" have been mentioned for this pre-computation and this is not a good option for ZMap - it would increase the data needed to display the initial view, and the intention is to make this as fast as possible.

Implementing fixed style density plots in ZMap

Here we define how to display a coverage featureset as a density plot. The data is supplied via GFF and includes start and end coordinates for the bins and a score which is the number of features in the bin. Note that even though the display is likely to be log(score) we need the actual number so as to be able to combine bins as zoom level changes.

Whether or not to produce a density plot will be determined by a column's style using some new parameters. A new mode (density) will be introduced which is automatically a graph mode display but allows bins to be combined. Density mode styles will trigger base features being re-processed into display data structures based on pixel boundaries. Multiple featuresets may be displayed in a single column and each one will produce its own graphical display and the styles should be defined to ensure clarity - heatmaps and histograms do not overlap well but wiggle plots may.

Configuration

An example:

[density-plot]
graph-mode = heatmap | histogram | line  # default = heatmap
graph-scale = log | linear        # default = log
colours = normal fill border # default = fill white; border black
# frame-colours are also possible according to frame mode
mode=graph
graph-density=true
graph-density-min-bin=2
graph_scale=log

For a heatmap the colours used will be calculated according to the scale and between the fill and border colours, with border being the maximum density.

Other existing style parameters will be used in a density mode style:

width =  pixels
min-score = 0     # number of features for minimum score
max-score = 2000  # number of features for max score

min-mag =
max-mag =
Note that min-score and max-score define the number of steps in the display and also the behaviour of a heatmap - 0 will mean no features and displays as the fill colour and a single feature will be the next step up.

Bumping density plots

It is a requirement to bump a heatmap into a histogram and it is possible that other options could be asked for. We can introduce new bump modes and/or bump styles to achieve this. Note that a lot of bump modes make little sense for binned data.

Implementation - drawing the plots

Change of philosophy - virtual features

There are some problems with simply drawing bins as received:

To avoid performance overheads due to large amount of data a single density plot canvas item (ZMapWindowDensityItem) will be added per column and this will handle the display and selection of all bins. Please note the following:

Other implementation issues

The window status bar, feature details window, and feature search dialog will all need tweaking to work well with density data. In particular feature search makes no sense for density columns and the assumptions in the feature context FtoIHash such as 'each feature has exactly one corresponding canvas item' will be broken.

Effort needed

What needs to happen:

Further notes (during implementation)

Creating canvas items

- we have a feature and decide to display it
- this goes through stuff like zmapWindowDrawFeatures and ends up in the windowItemFactory.
- according to the feature's style a different factory method gets called:
      { ZMAPSTYLE_MODE_GRAPH,        drawSimpleGraphFeature }
- this calls:
        canvas_item  = zMapWindowCanvasItemCreate(parent, y1, feature, style);
        zMapWindowCanvasItemAddInterval(canvas_item, NULL, 0.0, y2 - y1, x1, x2);
- CanvasItemCreate calls feature_is_drawable() which returns a type such as ZMAP_TYPE_WINDOW_GRAPH_FEATURE
depending on the feature's style

NB in zmapWindowBasicFeature.c this process is repeeated.

What we need to do is to change drawSimpleGraphFeature() so that it makes a singleton graph column item
and adds the source features to this instead of calling zmapWindowCanvasItemCreate(), which will be clean.

Either the class data needs to include a hash of existing graph column items or we look up existing ones in the parent.

Glyphs also get run thro' basic feature, which is odd.

Patching density items via the item factory

This is slightly intricate and not immediately obvious.

Preserving graph mode=histogram (non density plot graphs)

A small preparatory task was to take simple graph features (the only implementaion so far - graph mode histogram as simple boxes) out of zmapWindowBasicFeature.c and to create a real ZMapWindowGraphItem class. This involved creating a new file zmapWindowGraphItem.c based on BasicFeature and adjusting code in zmapWindowItemFactory.c/ and zmapWindowCanvasItem.c/feature_is_drawable().

Note that the item factory (zmapWindowFToIFactoryRunSingle() chooses a factory method indexed by a feature's style mode and then each one of these calls zMapWindowCanvasItemCreate() which determines the type of canvas item to create via feature_is_drawable() based on the feature's style mode, so effectively the same decision is being made twice in a row.

As each factory method is tied directly to each style mode by the above we cannot call different methods for different types of graph and the method already defined (drawSimpleGraphFeature() must necessarily be used to draw other types of graph, so to avoid confusion this has been renamed to drawGraphFeature()

After creating an empty canvas item of the correct type (previously Basic, now Graph), drawGraphFeature adds the actual data via the class->add_interval() function which resolves to zmapWindowGraphItem.c/zmap_window_graph_item_add_interval(). This adds a foo canvas rect to the ZMapWindowCanvasItem which is a foo canvas group.

Inserting the ZMapWindowDensityItem item once per column

In drawGraphFeature we test the graph mode style's density flag and if set add the ZMapWindowDensityItem once only. Ths is implemented by the class directly, which contains a GHashTable of these indexed by the column unique id, which is accessable to the factory via ther run_data->container foo canvas group.

Note that we make this more complicated by adding style_id as well and we construct a composite GQuark (once) in the item factory.

Adding data to the ZMapWindowDensityItem instead of the foo canvas

Using the styles density flag as above we choose whether to call zmapWindowCanvasItemCreate or not. Having obtained out singleton data structure we add our graph iten to the source bins list.

Displaying data - making assumptions

A coverage display is implicitly non overlapping,; if we wish to display overlapping data the we wi8ll use two different styles which will cause two different density displays to exist. This allows an optimisation when looking up display data on expose - we can search on start coordinate and look at the previous item to see if it overlaps the exposed region.

It is intended to replace normal graphs with the density module (which will operate in the same way but without re-calculating bins on zoom), and in this case it is possible we may have to handle overlapping data. This can be done automatically (no configuration needed) by remembering the longest item to display and offset search coordinates by this amount. As features are expected to be small this will be efficient enough - we simply scan a sorted linked list. Taking a hypothetical worst case this is unlikely to be time consuming. If necessary we can create display indexes for long and short items to provide an effcient solution wothout risking an incorrect display.

Work in progress
- graph items have been made into ZWCI and have foo rects added as intervals
  rather than existing as a variant of basic feature
- free standing glyphs are added as foo glyph items to basic features
- for density plots we need to add GraphFeature for the single
  density column wide item (a hack) and then add an interval to that which
  will be the foo extension density item
  This is needed as the ZWCI is a foo group extension and does not draw directly
  whereas foo items do
- when canvas items get simplified then we turn the basic/density ZWCI into a simple
  foo extension as for other types

- zMapWindowGraphDensityItemGetDensityItem() returns the ZWCI if found.
  If not then the item factory (via drawGraphFeature()) will add it and add the interval
  Note that we will call foo_canvas_item_new() directly rather than via zMapWindowCanvasItemCreate()

- problems:
  canvas item->feature gets accessed directly and cannot be NULL (OK: set to first)
  We can invent a singleton feature that gets assigned data depending on mouse clicks??
  we need to handle long items??
  style needs to be set globally -> this looks like the add_interval() call but we need to
  check compatablilty and/or write the code directly

- to do:
*  paint the graph in the _draw function
*  test as for plain graph features: should display the same
*  test using two featuresets with the same data from GFF files and display as old and new style
* display as lines not boxes -> combine data points into a poly_line
* display as heatmap: boxes are full width and colour coded; ok with irregular size bins?
* handle mouse clicks etc, point and bounds
  handle long items (trivial?.. ignored without mishap)
  handle hide and show
* re-bucket on display and zoom
* get some timing stats
  add on GC skew

  add print/ screen dump handling
Operating coverage display and bumping

To implement this new style and bump options will be provided:

[coverage]
bump-style=style-name	# disables 'bump-more-options'
stagger=pixels
show-zero=true/false
overlap=true/false	# need to optimise aligments code (not relevant here)
stagger will offset each featureset item in the column by a set amount and these will be ordered as per column confguration, for example:
[columns]
coverage = liver; heart ; brain
will specify that the featuresets are in that order from left to right. Note that each set's width is specified separately and this may indicate overlap (eg for wiggles) or a gap (eg for a heatmap). This is necessary to avoid having to define different styles for each featureset with a different offset - it is important that the featuresets appear in the same (predictable) order and position within the column.

The whole column may be shown/ hidden as normal, but the RC menu will be amended to allow a single featureset to be hidden / displayed with a keyboard shortcut for the selected one. Note that we will have to show column highlight differently and for heatmaps this may be difficult. One solution would be by using an outline for select. The columns dialog popup was intended to allow selection of featuresets in a column at some point... NOTE that this will be deferred until the dialog design is revisited, and in the short term only a menu driven interface will be provided.

Handling short reads data

To avoid accumulating too many features on the ZMap display data outside the mark will be removed. This data will be displayed on 'request from mark' and remain until the mark is reset and set again. Adjusting the mark will not remove displayed data.

Handling focus interface

Given a mouse click that is attached to a single column wide foo item we can find the relevant feature implied by the mouse coordinates and set the foo item's feature to be this and the focus code can be adjusted to cope with his by adding the feature to the focus item data structure.

Focus code is called from in zmapWindowItem.c For a simple mouse click this causes no problems as we can find the relevant feature quite easily. For an alignment (highlight same name items) this requires a feature search and will not work with density items using the current form of feature search (which returns a list of foo items which would all be the same). This list of items is then added to the focus list en-masse and if this is derived from density items it will work. However in the short term we know that alignments are not displayed as density items and this need not concern us.

Focus code is also called from zmapWindowState.c and this restores a single 'hot' item. This (current ZMap) fails to restore the previously highlit data and is relatively simple to emulate. Window state is stored via some 'serialise' function - it's not clear if this is necessary, as we only restore state eg after a RevComp, and the feature context is still valid.

There is a further problem to solve of finding a feature from a feature id; previously this has been done by finding the corresponding canvas item which has a pointer back to the feature. Next sentence not followedFocus code could be extended to store start, end coordinates of features (via the feature itself) and this can be used to look up the feature in the density items skip list.

Focus items a) have to be highlit and b) the highlight must persist until removed. Taking into account that density items (and normal graphs) do not overlap we can implement this with a simple flag in the graph segment data structure and just display a different colour, which is stroed in the column wide extenced foo canvas (density) item.

To handle overlapped items (eg alignments) it will be necessary to post-process focus items and display them after the others - perhaps the logical way to of this is to display focus itesm at the end of every expose event, but that is outside the scope of this module.

Note that focus code highlights features using a given colour not necessarily the features' style's SELECTED colour if a window default has been set. This default can alternatively be set using the root style. The interface is via a ZMapWindowCanvasItem function (set_colour) and for density items we will need to patch into this. Focus code works using a set of flags for different groups of focus items (eg selected, evidence) and we can store these flags in the graph-density item struct for each feature which will allow us to implement all optional colours, storing these in the global graph density struct. Note that for density items we only need to handle 'selected' but implementing this interface now will allow other data types with more display options (eg masked alignments) to be implemented easily.

Unfortunately due to scope issues it is not possible to use focus code to get colours, and the canvas item class interface is a bit convoluted, so to keep things simple we will store a limited number of alternate colours directly in the graph density item struct and choose these according to a flag. This will limit density items to being normal colour or selected (as defined by the focus code) only.

Handling the FToI search interface

In the short term we can appply restrictions to this. Search for coverage data by name is not particularly useful and we can just disallow this or make it fail. Likewise coverage data is not valid as evidence features (transcript/highlight evidence) and we can assume this will not occur in the short term. (see zmapWindowFeature.c # 1352 approx).

In the longer term we need to modify the interface to return feature as well as foo item.

Feature search has been made to work with density items - FToI search returns a list of FToIHash data structures instead of canvas items. These contain a reference to the feature as well as the canvas item.

Implementing low-zoom triggered density plots in ZMap

(under construction)

These are earlier thoughts now abandoned The idea was to improve perfomance by switching to coverage, but htis looses information (eg alignment score).

A separate style will be added for optional density mode displays and each normal feature's style will refer to this (in a later implementation).

Bumping the column will revert to using the normal style. The density style bump mode will be used to control whether or not this is allowed.

A density mode style will automatically also be a graph mode style and graph-mode configuration will be used to drive the display. The difference between simple graph mode and density graph mode is that one uses features and the other feature-frequencies per pixel.

Note that a column style may be set independantly of the included featuresets or can be inferred by merging the featureset styles. All featuresets in a column will be treated as one for the purposes of density plots. (for example if we created a UniProt column then Swissprot and Trembl would be displayed in different colours as at present when displayed at high zoom but would be combined to create a single density plot.

Note that although most of the relevant data consists of alignments some high volume data (eg ensembl_variation) are basic features and therefore these parameters must apply to all style modes.

[trembl]
density-style = density-plot-trembl

Implementation

Normally we would wish to display a density plot when the zoom level is such that features are not readily discernable due to overlap. This is a) to give extra information and b) to implement a faster display. Using the foo canvas we are limited to nominally 1000 pixels at minumum zoom and 30k pixels at maximum, but note that future developments may change this. However we expect future technology to allow us to display significantly more features in less time. We do wish to avoid drawing 30k density features - this would be missing the point of using these. A solution would be to impose a maximum number of these in the style and increase the vertical size of the bins by whole numbers of pixesl as the xoom level increases.

One way to decide when to switch between display modes would be the existing min and max zoom parameters in styles and this looks like a good place to start. The disadvantage is that to achieve best results the style has to be set appropriately for the data and this is a trial & error technique and will always be a compromise.

Another way would be to set a minimum average or maximium bin volume - if no bins have more than X features then display the features.

What is important is that there is a clear switch between one style and the other. For example if we defined styles to display at different zoom levels and display real features and density data in the same column there could be a transitional zoom level where both were displayed and we need to avoid this.

GC content

(needs to be reviewed, ref to density plots above)

Not a density plot

GC content provides an easy way to implement wiggle plots as a graph mode feature which can later be used in a density plot environment subject to configuration. Note that they are defined as style mode = graph!

Rolling window size

A cursory scan of google suggests a window of 50-500bp - 50 as the GC skew can change suddenly and 500 relating to 'flanking DNA', with values being calculated at 100 bp steps (Human Molecular Genetics, 1999, Vol. 8, No. 6 1061?1067). This suggests that we must provide this paramet in the style, and also possibly provide more than one graph, using different window sizes. One way to implement this would be to define hard coded featuresets (like DNA) such as GC_Skew1, GC_Skew2 etc and map these into a single column (if desired) and attach different styles to specify the window size.

If the zoom level is so low as to contain more bases than the window size per pixel then the GC content will be calculated per pixel. This will be triggered by setting min and max zoom levels per GC_Skew style (interpreted as base pairs per pixel), and a per pixel average will be triggered with a window size of 0. GC Skew will not be displayed if the number of bases per pixel is greater than the window size, and the rooling window size will be set but the min_zoom style parameter. The step size can also be parameterised; this will be useful for all density styles and provides a way to limit the munber of features displayed.

Graphical display

A simple linear scale plot with 50% GC content as the centre and min and max_score in the style being used to define the extremes of the percentages to be displayed (eg 30 and 80%).

Efficiency

Sneaking past the FooCanvas

Implementing a wiggle plot as a series of poly-line features may operate significantly faster in the foo canvas then a feature per line segment. Implementing a single polyline may also operate slower due to having to redraw the whole feature every time. If we work in a 30K pixel maximum window size and 100 lines per feature then this results in 300 features, which may paint quickly, and if so will allow us to paint rolling averages at maximum zoom. These parameters can be held in a ZMap config stanza if necessary. This may be significantly faster than histograms or heatmaps. Implementing parameterised plots like this will also not tie us to any particular canvas.

More realistically if we have (for example) a window size of 40 and a step size of 10 then a 30K pixel display will require 3000 line segments which could be configured are 100 features of 30 line segments. If we have a sequence of 200k bp then the maximum number of line segments would be 2x20k ie 40k.

Showing details

The window ruler can be modified to show details of the displayed averages at the current sequence coordinate if the mouse is over a GC column.

Redisplay on zoom?

If we assume that we can display nominally 300 features of 100 line segments effciently using the Foo Canvas then we can just display the same features regardless of zoom level, but some features must be hidden or displayed depending on the zoom level, and hopefully this is already coded??

By definition the per pixel average but be redisplayed on every zoom change.

Long item handling must also be applied to these features.

Log and linear scales

If external data is supplies as logarithmic then the display style will have to be set to linear to avoid confusion we will set linear as the default and this confgiuration option can be omitted. Log format displays will have to be calculated per bin and if the log() function is slow then we can calculate threshold values per horizontal pixel and look these up per bin.

Sample Configuration

Here we illustrate GC Skew wiggle plots for low zoom and rolling average of 500 and 50 bp. Note that the styles define whether of not we use wiggle plots or histograms etc and this is not hard coded.

ZMap Config

[ZMap]
# number of lines per feature
wiggle_feature=100

[Columns]
# 3 hard coded featuresets, triggered by the prefix 'GC_Skew_'
# (styles default to the same name)
GC_Skew = GC_Skew_0; GC_Skew_500; GC_Skew_50

Styles

[GC_Skew]   # column style
width=20

[GC_Skew_0]
mode = graph         # not a density feature!
graph-mode = line
colours = normal border green
min-mag = 0         # per pixel average
max-mag=500         # don't display at bigger zoom, relates to GC_Skew_500

[GC_Skew_500]
mode = graph
graph-mode = line
colours = normal border blue
min-mag=500

[GC_Skew_50]
mode = graph
graph-mode = line
colours = normal border red
min-mag=50

To Do