RT 69060: Clustering (masking) EST's and mRNA's

Background

In 2008 ticket 69060 was created requesting this:

I think Jen may have already discussed this with you, but would it be
possible to introduce some clustering of ESTs with mRNAs. In an ideal
world it would be great to eliminate (ie remove from Zmap) any ESTs
which are completely covered by an mRNA (including human ESTs vs human
mRNAs, mouse vs mouse, etc) as this would reduce the number of ESTs we
need to look at and should make navigation round Zmap more simple. I
guess it might also speed up the redrawing as there would hopefully be a
lot fewer boxes.
and recently (after the latest annotation test) we were requested to see how easy this would be to implement.

Please Note

Design issues

This was once intended to be an interactive process w/ otterlace- see here. However, reading the text and a quick chat w/ the requestor suggest that an interactive process was not wanted.

A brief User Guide

Note this section written months after the rambling notes below

What do we mean by masking?

We mask alignments with other alignments with the intention of hiding those that are uninteresting ie give no new information. Here an alignment is a series of matched features with the same name and an uninteresting one is exactly the same as another except perhaps a little shorter at one or both ends. We apply this process to EST's and mRNA's and this is defined by ZMap configuration as discussed in the sections below.

How does this affect the ZMap display?

By default a masked featureset (eg EST_Human) will have masked features hidden (not displayed) so there will be fewer features on the screen and bumping the column will show less data. There is an option on the Right-Click Menu ('Show Masked Features') that will display the masked features, and these will be in grey, to indicate that we think they give no information not given by another feature.

Masking is done at the featureset level and a feature is masked in all windows (eg if you v-split) but each window can be set to display the masked features or not, just like 'bump-column' only applies to one column & window at a time.

Masking is done whenever new data arrives, so if you request (eg) another EST column then you may find that displayed features either disappear or turn grey.

Highlighting a feature from 'list all this column features'

If a feature is masked and hidden you can still find it via the feature search windows and if you select a hidden masked feature then it will be displayed for you. Note that this will not cause the other masked features in the column to display.

When specifically is a feature masked?

EST's are masked by themselves and other EST featuresets and mRNA's and mRNA's are masked by themsleves. Perhaps the easiest way to illustrate this is with a diagram.

NOTE originally the request was to mask as in this diagram but since the first implementation it was decided to insist on exact splicing in all cases and in the diagram mRNA 3 would not mask EST C, and mRNA1 would not mask any EST.

Other notes

With masked features not displayed we expect to see fewer 'imcomplete homology' markers

Some alignments can repeat over a DNA sequence and this can affect how the masking algorithm operates as it does not take into account repetition. (Note that this is consistent with current ZMap practice in displaying co-linear lines between features).

Things to define/ solve

How to decide when to not display a feature?

An EST feature (ie group of ESTs with the same name) will be masked if each feature is covered by an mRNA feature (ie group of mRNAs with the same name). Optionally we can mask an EST column against itself and also do the same with mRNA data, but in this case we must only mask data that has the same number of 'exons' to avoid masking alternate splicing events.

How to choose features to compare?

As features are effectively blank data and thier names are 'user-defined' we cannot hard code these and one obvious way to mark featuresets for 'clustering' and non-display would be via styles eg:

[EST_Human]
# list of featuresets to compare this featureset with
alignment-mask-sets = self ; vertebrate_mRNA ; cDNA
If we wish to mask a featureset against itself then we include its own name on the mask-sets list or the key word 'self'. It will be more efficient to include a featureset's self reference first. The 'self' keyword is necessary to allow a common style to be used for a number of EST columns.

Display Options

Masked data can be not displayed, or displayed in grey (for example). The grey colour can be specified as a window colour in the ZMap config file.

[ZMapWindow]
colour-masked-feature-border = light grey
colour-masked-feature-fill = dark grey

If styles cause data to be hidden from the annotators it may be advisable to allow this to be switched off and to display all features on demand. A new option can be added to the RC menu to toggle display of masked features independantly of existing bump options; although the non-display function is only useful to the user in bumped columns it will result in much faster display of non-bumped columns. This option will apply to the column whether bumped or not. Note that this flag must be stored in the featureset struct and is distinct from flags per feature which say whether or not each feature is displayed. If no colour are defined as above or the column is not maskable then the menu option will not be visible and the masked features will never be displayed.

Implementation

To gain display speed it is best to avoid adding hidden items to the foo canvas which implies that the masked features must be flagged as hidden in the feature context. But as features may arrive at any time (via pipeservers or via delayed loading) this status can change from time to time and as new featuresets arrive more features may be masked from display, in which case the foo canvas items for the features must be hidden or destroyed. The foo canvas provides a hide function or features may be removed via a GTK_object destroy function, and the foo canvas/ ZMapCanvasItems have to provide a dispose handler.

We must provide functions to add and remove flagged features from the canvas (to implement the RC menu option)

The fact that we expect featuresets to be displayed separately (esp w/ pipe servers) implies that we must be able to mask each EST featureset independantly against each mRNA featureset, although if several mRNA featuresets are available when an EST column is displayed then these could be processed together.

In the interests of simplicity masking shall be performed on the feature context - flags will be maintained in the feature context to decide which features are masked and how this is presented to the user can be independant of this.

Note that canvas items refer to features and not vice versa, but a canvas item can be found relatively quickly via a hash table in the ZMapWindow structure.

Implementation README

Masking algorithm

Coding

There may be a combinatorial problem to solve. A sample session from human chr-4-04 presented about 4-500 vertebrate mRNA features and 2-3000 EST_Human and given that each feature may contain many parts a naive algorithm could get quite slow.

The following is proposed:

Masking a set with itself

The same procedure as above is followed but with the restriction that each exon must be matched (no gaps are allowed) and splice junctions must be the same. (which means that only the 5' and 3' coordinates may differ).

Data Structures

The feature structure contains a union depending on the featuretype and for EST and mRNA we use style mode=align, which corresponds to a homol data structure. The style data uses a similar structure but the structure is called an alignment. Extra flags will be added to these to handle masking status.

Sorted access to the feature context

Existing practice

The Feature Context has features referenced by a hash table which provides random unsorted access. NB Tests have revealed that creating a hash table of 100k items will fail: the function does not return, although 90K appears to work quite quickly. The FooCanvas stores data in a GList (which is quite slow) and the Window items code maintains a hash table to map features to canvas items. The canvas items refer back to thier features directly.

To avoid disturbing existing code we will add data to the feature context to provide sorted access - it would be better to replace the hash table with a B-tree or similar but not very practical in the short term.

A quick glance at GLib show that they provide binary trees and n-way trees but neither of these are ideal. The n-way trees just provide more lists but obviously would be slow with our data on random access. The binary trees are an opaque structre and while they provide lookup and traverse functions do not implement next or previous functions.

Relating feature context data to canvas items: alignments

Alignment features correspond to features in EnsEMBl and represent gapped alignments. Each feature is represented by a feature.homol structure which includes a GArray to specify the gaps if present; these gaps are small and are displayed as lines accross the feature. Several distinct features of the same name can be joined up with vertical lines. When displayed each part (alignment feature) is expressed as a separate canvas item, and these can be re-arranged in different ways according to the choice of bump mode. In Ensembl the data is represented as several distinct items with the same name.

This implies that we have to connect features of the same name together and then sort these composite objects into start coordinate order. Column bumping already does a similar task, but for canvas items - we need lists of features of the same name. To make the sorting and comparison algorithm run faster we will prepend each list with 'whole alignment' entry which will hold the start and end coordinates of each set of alignments.

Sorting and maintaining data

We expect EST data and mRNA data to be static within a ZMap session - once sorted there is no requirement to repeat the process. Additional featuresets supplied by servers will be stored in distinct context featuresets regardless of whether they are to be displayed in the same column. Note that this requires ZMap configuration to specify unique featureset names - this is possible even if the ultimate source of two data sets are named, via the [featureset-source] stanza. If in future we request features for subsets of the region being used then we will need to re-sort a featureset as new data arrives. Data will be sorted into a GList of pointers to the features which are stored in the featureset structure. If new data arrives the set will be flagged as unsorted by freeing the list.

Choosing to hide or display masked features in grey

Motivation

We wish to hide (ie not display at all) features that are masked a) to unclutter the display and b) to run faster and this option is chosen by not specifying a 'masked features colour' for the window.

For testing and also as an option to show all the data but direct the user's attention to important features we wish to display masked features in grey, and this colour choice will logically apply to the window rather than a column. If a masked features colour is specified then these features will be displayed but initially hidden - display speed will be faster if they are not displayed but typically the difference will not be extreme as we are dealing in some cases with only 1000 features, and display speed is dominated by protien alignments where volumes are more like 50k+. But NB: volumes of 10-20K are not rare.

How is this done?

In zmapWindowFeature.c/zmapWindowFeatureDraw(), called from zmapWindowDrawFeatures.c/ProcessFeature() we test to see if the window colour has been set and if not just don't draw masked features, in which case we never have to consider what colour to use.

If the window colour is set then when we draw the features we have to access this colour by a process which is a little obscure. In zmapWindowCanvasItem.c/zMapWindowCanvasItemAddInterval() we call zmapWindowAlignmentFeature.c/zmap_window_alignment_feature_set_colour() via the over-ridden class structure function set_colour() for alignment features. This has no direct access to the colour data but we can access the canvas featureset group (aka column) via the canvas item parent references. This ZMapWindowContainerFeatureSet object has a reference to the window which is opaque, and we have to impelement a ZMapWindow function to access the colour spec. Unfortunately we have to write two of these functions to get past two levels of scope incompatability, one to get to the featureset and one to reach the window.

Controlling the display and column menus

Each featureset in the view's context has a masker_sorted_features GList which is non-NULL if the column has been masked ie if masking has been configured. Each feature has flags to specify masked and displayed.

The featureset will be displayed in a column (or more in case of strand and frame) and when it is displayed in a column flags in the ZMapWindowContainerFeatureset will be set to say if the column is maskable (contains one or more featuresets that are masked) and masked (mask data is not displayed). These flags will drive the column menu.

Currently there are no plans to select display of individual featuresets in a column, if so it may be wise to add a list of contained featuresets to the ZMapWindowContainerFeatureset structure.

Toggling display of masked items

By default masked data will not be displayed and if we select 'Show Maksed Features' then we will scan all ZMapWindowCanvasItems in that column and display the masked data. Deselecting this will remove the masked items from the FooCanvas. Each canvas item refers to it's feature and when the display status of a feature is changed the flag in the feature structure that says 'displayed' will be updated.

NOTE: We assume that each feature is displayed in one column only, but the situation where multiple windows have been opened is more complex. In line with existing practice each window in a view reflects the same display state and changes in one occur in the others. To accomodate this we must display or hide features according to menu choice (or column state) rather than feature state. Each time we show or hide a feature we set the flags in the feature accordingly.

NOTE: If a featureset is loaded into a columm we must also check the display state of the column. As per normal we mask the new column and display unmasked features, but if the column state requires masked features to be shown then we unmask the new features.

Where to put the show/hide function and implementation

zmapWindowContainerFeatureset.c/zMapWindowContainerFeatureSetShowHideMaskedFeatures()

Hidden features were originally intended to be not displayed as we would have added or removed items from the canvas according to display state. However this is a relatively complicated process due to the item factory... and also the fact that featuresets can appear in several columns (not particularly relevant for current practice with ESTs and mRNAs but we have to code it properly). Instead we display as normal and use foo_canvas_item_show/hide(). Data volumes in this case are not huge, typically 1000 features total, so performance will not be affected much, and operating the show/hide menu will be quicker.

Displaying in multiple windows and updating as new masker sets arrive

There may be some value in displaying masked columns in different windows independantly - this would allow user to compare them - in which case we can operate using each column's state without difficulty. (NB: column bumping operates this way).

As data arrives via pipes in no particular order we have to mask data that is already displayed, and depending on the column stat this will mean hiding it or changing the colour. This process must be applied to all open windows.

As we perform the masking in the View's feature context not the window we cannot easily hide features as they are masked and therefore we are require to post process the windows via a for-all-windows function.

Displaying new features

There is a logical possibility that we may receive a new feature from otterlace via XML, although we don't expect this to happen for EST or mRNA types (transcripts are more likely). It's not immediately obvious from the code what would happen if this arose - it looks like we get a context with just the new features in to draw which refers to the merged context, in which case we would need to free the masker_sorted_features list for the affected featureset and re-mask. As this is unlikely to occur this has not been done.

Much more likely is requesting data for sub-regions and which case we can get large updates to a context featureset. On receiving any data for an existing featureset we will free the masker_sorted_features list, which implies a re-sort. and all data in that featureset will be processed for masking. Each feature has a masked and displayed flag and the masked flag will be reset before processsing. Masking is performed from the justMergeContext() function in zmapView.c and display from justDisplayContext(). It is valid to only mask with the new data.

Displaying new data in an existing featureset

Using existing code the new data will be drawn via the 'diff_context' data structure, which contains references to all the new features in the merged context. These can be drawn as-is.

Masking existing data

zMapFeatureMaskFeatureSets() can return a list of featuresets masked and this can be processed to re-display or modify existing columns (excluding new ones which display as normal). As we draw the context after masking we know that the feature context is in the correct state at this time. We will never need to draw existing features and will only ever have to set features as masked, not the reverse. We can scan each window's canvas tree and process only displayed items (new ones will not be displayed yet, so this will run at maximum efficiency, and as we process the window context not the features context there will be no errors matching active columns via a parallel set of code to the display functions.

NOTE: If masked colours are not defined then we never show masked features, and if a featureset is masked after loading then we remove the features from the Canvas.

Results

Using [chr11-03_2098831-2266877] and 3 EST columns and vertebrate_mRNA (and each set masking itself) we get:

mask set vertebrate_mRNA with set vertebrate_mRNA
masked 15, failed 1099, tried 125 of 125 composite features

mask set EST_Other with set EST_Other
masked 68, failed 3066, tried 322 of 322 composite features
mask set EST_Other with set vertebrate_mRNA
masked 260, failed 1401, tried 322 of 322 composite features

mask set EST_Mouse with set EST_Mouse
masked 27, failed 350, tried 87 of 87 composite features
mask set EST_Mouse with set vertebrate_mRNA
masked 69, failed 365, tried 87 of 87 composite features

mask set EST_Human with set EST_Human
masked 109, failed 6279, tried 528 of 528 composite features
mask set EST_Human with set vertebrate_mRNA
masked 346, failed 1812, tried 514 of 528 composite features
Which in rough terms gives us 70% of ESTs filtered off the display.

Using chr11-03_71088949-71922184, which has more EST features we have much higher failed comparisons, but the high figure of 221163 compare to a worst case of 11417641 or 2% of that required by the naive algorithm. EST human is masked only at a level of 68%

mask set vertebrate_mRNA with set vertebrate_mRNA
masked 134, failed 11271, tried 577 of 577 composite features

mask set EST_Human with set EST_Human
masked 439, failed 221163, tried 3379 of 3379 composite features
mask set EST_Human with set vertebrate_mRNA
masked 1861, failed 70075, tried 3379 of 3379 composite features

mask set EST_Other with set EST_Other
masked 256, failed 11421, tried 1023 of 1023 composite features
mask set EST_Other with set vertebrate_mRNA
masked 687, failed 11129, tried 1023 of 1023 composite features

mask set EST_Mouse with set EST_Mouse
masked 113, failed 1870, tried 457 of 457 composite features
mask set EST_Mouse with set vertebrate_mRNA
masked 339, failed 4171, tried 457 of 457 composite features

The algortihm appears to run effciently: # failed is the number of comparisons used with no effect. The worst case is quadratic (like the obvious naive algorithm) but we are a long way from that.

A note about homology markers

By hiding masked alignments wich typically include those that are missing a few bases at either end we display much fewer alignemtnes that would be displayed with incomplete homology markers - an obvious side effect but disconcerting when you first see it. This can be verified by clicking on 'Display Masked Features'.

A note about duplicate alignments

An alignment can math to several places on the genome due to sequence duplications and this produces some potential problems. Each alignment is made of several features with the same name that match to splicing and if a series is repeated on the same strand then we can get longer series of features. This will result in features not being masked, which is not harmful but could be improved. NB a similar problem occurs in the display of homology markers - if in doubt try 'List same name features'.