An attempt to speed things up a bit. Note that here 'new model' refers more to Oliver Cromwell than MVC, although it is one of the design aims to make the view and window have less model and more view.
This design follows an initial experiment with coverage displays which involved creating a single ZMapWindowCanvasItem (foo canvas group) containing a single ZMapWindowGraphDensityItem (foo canvas item) for each featureset. Features were represented by simple data structures accessed via a skip list from the ZMapWindowGraphDensityItem and were not added directly to the foo canvas. Expose events were handled by calling GDK functions for the relevant feature structures.
This initial experiment relates to handling the foo canvas at a level below the column objects. The new model canvas is to be implemented first by replacing data and code at that level and then by revisiting the container design, but for clarity we present here a top-down discussion from canvas root to individual sub-feature.
The code and data are quite complex for little gain - each feature on display is 10 levels deep in the Foo Canvas, and positioning features on the screen is a complex process that is prone to error.
There is a lot of ZMapCanvas code that does not take advantage of object-oriented or plain old structured programming patterns, and this can be improved significantly.
Adding features to the canvas is a time consuming process that also involves some complex interactions with X. The Foo Canvas is not designed to handle the number of features displayed by ZMap.
This will be simplified to look like the diagram below. The feature context will remain as is with potentially many blocks and more than one align, although with present use we are restricted to a single block. If more than one block is implemented in future this will be done via a separate pane on display (a ZMapWindow with its own canvas) for each block, and different alignments can be locked together or otherwise.
Within a block there will be no strand containers but instead we will add ZMapWindowColumn groups directly to the block. Reverse and forward strand columns and the strand separator will be identified by strand explicitly and will be positioned by using the existing columns sorting code, tweaked as necessary.
Each container will hold data to specify strand and frame and also borders if any.
As the number of columns is not very great (typically less than 100) these can be stored in a simple linked list.
The feature context has coordinates specified as forward strand chromosome based from 1 (although the segment viewed is unlikely to be). In a block these coordinates will map to zero based canvas coordinates, with the first base in the viewed sequence mapping to 0.0.
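The coordinate mapping described above can be sketched as follows. This is a minimal illustration only: the `BlockRange` struct and both function names are hypothetical and are not taken from the ZMap source.

```c
#include <assert.h>

/* Hypothetical block descriptor: seq_start is the 1-based chromosome
 * coordinate of the first base in the viewed sequence. */
typedef struct {
    long seq_start;
    long seq_end;
} BlockRange;

/* Map a 1-based forward-strand chromosome coordinate to a zero-based
 * canvas coordinate, so that seq_start maps to 0.0. */
static double block_to_canvas(const BlockRange *block, long chr_coord)
{
    assert(chr_coord >= block->seq_start && chr_coord <= block->seq_end);
    return (double)(chr_coord - block->seq_start);
}

/* Inverse mapping; any sub-base fraction is truncated. */
static long canvas_to_block(const BlockRange *block, double canvas_y)
{
    assert(canvas_y >= 0.0);
    return block->seq_start + (long)canvas_y;
}
```

With a viewed segment starting at base 1000001, base 1000101 would land at canvas coordinate 100.0 regardless of where the segment sits on the chromosome.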
Note that currently reverse complement operates on the feature context which is then re-displayed.
A canvas column contains a set of features that the user can request and show/hide and this may include data from several sources. (But note that the user interacts with one or both strands together depending on circumstances). Examples include Repeats, Uniprot (= Swissprot + TrEMBL), ext_curated (~400 potential sources), BAM paired reads (experimental repeats etc). The column foo canvas group implements the overall layout and positioning of features in the window and within that we have foo canvas items that implement the display of features (canvas featuresets).
A canvas featureset is not the same as a context featureset in that it is used to organise the presentation of features to the user, whereas a context featureset is simply a collection of features of the same type which can be displayed in several columns, eg via strand and frame (the data model).
A canvas featureset may contain data from several sources (context featuresets) and may contain a subset of the data from its context featuresets (eg reverse strand only).
A canvas column may contain one or more canvas featuresets. This is especially relevant to the handling of coverage data compared with simple features - to display several heatmap featuresets side by side and interact with them distinctly we need to store these as separate data structures. Conversely, when we combine several sources into one mixed column, to handle display and selection events we need to have these held in the same data structure.
This leads to a design as below, where we map source data (context featuresets) into containers (canvas featuresets).
Following experiments with coverage displays and graph features it looks feasible to code each featureset as a single foo canvas item and to have this item handle expose events and display individual features. This means that what were Foo Canvas Items (and ZMapWindowCanvasItems) become simple data structures, and performance improvements of 5x on add and 3x on paint are achievable.
It is intended to implement the 'featureset summarise' function within a canvas featureset, allowing this to be switched on without breaking assumptions made by other parts of ZMap. Note that due to the structuring options outlined above we can operate this code on single context featuresets or combinations of these.
Each featureset item is a class which handles display etc for individual features. A data structure is used by the class to describe each displayed feature and this refers back to the feature context. The existing FToI hash can be used to find a canvas item from a context feature or key, but has to be modified to return the feature as well as the canvas item as these canvas items are composite structures.
The primary function of the canvas featureset object is to display features and this is to be handled efficiently by the use of a skip list. Features are added to a canvas featureset via the ItemFactory as at present, and on display the skip list is created by sorting the list of features and generating extra layers of pointers in a single pass. As this data is static we can be assured of a non-degenerate structure. Where several context featuresets are added to a canvas featureset we just re-create the skip list each time - compared with the total computation required, re-creating the skip list is not an issue.
Other than the initial display where the whole loaded sequence is visible, ZMap normally operates zoomed in and expose events are only relevant to small sub-sequences. The skip list allows access to the start coordinate of an expose region in nominally O(log n) time, and then processing the bottom level of the list serially until we reach the end of the region is as efficient as we can get (O(1) per feature).
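The build-then-seek pattern described above might look roughly like the sketch below. It is an assumption-laden illustration, not the ZMap implementation: all type and function names are hypothetical, the level rule (node i joins level l when i is a multiple of 2^l) is one simple way to get a deterministic structure from sorted static data, and the longest feature length stands in for the overlap distance.

```c
#include <assert.h>
#include <stdlib.h>

#define MAX_LEVEL 16

typedef struct { double y1, y2; } Extent;   /* hypothetical feature extent */

typedef struct SkipNode {
    double y1, y2;
    struct SkipNode *next[MAX_LEVEL];   /* next[0] is the full sorted list */
} SkipNode;

typedef struct {
    SkipNode head;     /* sentinel, never holds a feature */
    SkipNode *nodes;   /* backing array, sorted by y1 */
    double max_len;    /* longest feature, used as the overlap distance */
} SkipList;

static int cmp_y1(const void *a, const void *b)
{
    const SkipNode *fa = a, *fb = b;
    return (fa->y1 > fb->y1) - (fa->y1 < fb->y1);
}

/* Sort once, then thread every level in a single pass over the nodes. */
static SkipList *skiplist_build(const Extent *ext, size_t n)
{
    SkipList *sl = calloc(1, sizeof *sl);
    SkipNode *last[MAX_LEVEL];
    size_t i;
    int l;

    assert(sl != NULL);
    sl->nodes = calloc(n ? n : 1, sizeof *sl->nodes);
    assert(sl->nodes != NULL);
    for (i = 0; i < n; i++) {
        sl->nodes[i].y1 = ext[i].y1;
        sl->nodes[i].y2 = ext[i].y2;
        if (ext[i].y2 - ext[i].y1 > sl->max_len)
            sl->max_len = ext[i].y2 - ext[i].y1;
    }
    qsort(sl->nodes, n, sizeof *sl->nodes, cmp_y1);

    for (l = 0; l < MAX_LEVEL; l++)
        last[l] = &sl->head;
    for (i = 0; i < n; i++) {
        for (l = 0; l < MAX_LEVEL && (i & (((size_t)1 << l) - 1)) == 0; l++) {
            last[l]->next[l] = &sl->nodes[i];
            last[l] = &sl->nodes[i];
        }
    }
    return sl;
}

/* First node with y1 >= y, or NULL if there is none. */
static SkipNode *skiplist_seek(SkipList *sl, double y)
{
    SkipNode *node = &sl->head;
    int l;

    for (l = MAX_LEVEL - 1; l >= 0; l--)
        while (node->next[l] != NULL && node->next[l]->y1 < y)
            node = node->next[l];
    return node->next[0];
}

/* Count features overlapping [y_top, y_bot]; a real paint function
 * would draw each feature where the counter is incremented. */
static size_t skiplist_expose(SkipList *sl, double y_top, double y_bot)
{
    size_t hits = 0;
    SkipNode *node = skiplist_seek(sl, y_top - sl->max_len);

    for (; node != NULL && node->y1 <= y_bot; node = node->next[0])
        if (node->y2 >= y_top)
            hits++;
    return hits;
}
```

Because the list is rebuilt from scratch whenever features are added, no insert or delete operation is needed, which keeps the code small. Freeing the list is omitted for brevity.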
A simple object (eg basic feature, glyph) corresponds to a single GDK item and each one is unrelated to the others. They are accessed via their featureset's skip list.
A compound object is a set of simple objects and each of these is accessed via its featureset's skip list in exactly the same way as simple objects. However these have left and right pointers to other parts of the same compound object which may be some distance away, and other compound objects may appear in the gap between. An alignment and a transcript may be represented by similar structures - alignments may have extra data to handle gaps. If a user selects a compound object it is a trivial matter to find all the parts, eg for highlighting.
Left and right links between parts of a compound object will be created as features are added to a canvas featureset. Either this can be done by another sorting phase or via the FtoI hash functions (for alignments) or via an extra function provided by a child class (for transcripts). It may be convenient to set up composited alignments in the feature context while doing this, or to implement VULGAR strings and treat alignments like transcripts.
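As an illustration of how the left and right links make it trivial to find every part of a compound object, here is a minimal sketch. The `Part` struct and function names are hypothetical; the real canvas feature structure would carry more fields.

```c
#include <assert.h>
#include <stddef.h>

typedef struct Part {
    double y1, y2;
    int focus;            /* set when the part should be highlighted */
    struct Part *left;    /* previous part of the same compound object */
    struct Part *right;   /* next part, possibly far away on the canvas */
} Part;

/* Append 'part' to the compound object whose current last part is 'tail'. */
static void part_link(Part *tail, Part *part)
{
    assert(tail->right == NULL && part->left == NULL);
    tail->right = part;
    part->left = tail;
}

/* Starting from any selected part, flag every part of the same object.
 * Returns the number of parts visited. */
static int compound_set_focus(Part *selected, int focus)
{
    Part *p = selected;
    int n = 0;

    while (p->left != NULL)       /* rewind to the first part */
        p = p->left;
    for (; p != NULL; p = p->right) {
        p->focus = focus;
        n++;
    }
    return n;
}
```

Selecting the middle exon of a three-part object would visit all three parts without touching the skip list or the feature context.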
Initially basic features will be implemented and then extended to include alignments - these form the bulk of the data handled by ZMap and give the greatest performance gains. There are some issues to face here beyond what may need to be done for transcripts, as transcripts already have some structure in the feature context (a wild guess) whereas alignments are linked by a 'same name' relationship and no data structuring exists.
To implement alignments in the CanvasFeatureset we will not make any changes to the feature context - this is to localise effort and avoid unravelling large amounts of code with no clear boundaries. The existing interface is by feature not featureset and this presents a problem in that retrofitting same-name links in the CanvasFeatureset naively would involve a double lookup for each one - first via the FtoI hash and then the CanvasFeatureset skip list. Conversely, pre-sorting the data and adding in same-name order to allow efficient operation requires external coding.
An obvious strategy would be to sort by name before creating the skip list, add the links and then sort by sequence coordinate, as at present. This is practical enough but requires careful coding if features are added later. (NOTE: handling addition and deletion of features needs to be specified clearly regardless).
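The sort-link-sort strategy can be sketched as below. This is a hedged illustration with hypothetical names; it sorts an index of pointers rather than the features themselves so that the left and right links stay valid after the second sort.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

typedef struct Feat {
    const char *name;            /* same-name features form one alignment */
    double y1;
    struct Feat *left, *right;
} Feat;

static int by_name_then_y(const void *a, const void *b)
{
    const Feat *fa = *(Feat *const *)a, *fb = *(Feat *const *)b;
    int c = strcmp(fa->name, fb->name);
    return c != 0 ? c : (fa->y1 > fb->y1) - (fa->y1 < fb->y1);
}

static int by_y(const void *a, const void *b)
{
    const Feat *fa = *(Feat *const *)a, *fb = *(Feat *const *)b;
    return (fa->y1 > fb->y1) - (fa->y1 < fb->y1);
}

/* Link same-name features, then leave the index in coordinate order
 * ready for skip list creation. */
static void link_same_name(Feat **index, size_t n)
{
    size_t i;

    assert(index != NULL || n == 0);
    for (i = 0; i < n; i++)
        index[i]->left = index[i]->right = NULL;

    qsort(index, n, sizeof *index, by_name_then_y);
    for (i = 0; i + 1 < n; i++) {
        if (strcmp(index[i]->name, index[i + 1]->name) == 0) {
            index[i]->right = index[i + 1];
            index[i + 1]->left = index[i];
        }
    }
    qsort(index, n, sizeof *index, by_y);
}
```

Features added later would invalidate the links at the join points, so in this sketch the simplest safe option is to call `link_same_name()` again over the whole index.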
A further consideration is that different types of feature can be mapped into one CanvasFeatureset and we need to trigger sideways links only when appropriate, or accept the performance overhead of the extra sort and scan.
Compound objects consist of a series of boxes (for example) and these may be joined up by lines and also have glyphs attached. Depending on the display mode and the type of object the decorations may not always be displayed.
For a transcript we always display a bent line between blocks and for an alignment we display traffic light lines and also glyphs if bumped. To minimise memory use these decorations will not be represented explicitly by their own data structures but instead drawn on demand depending on the display mode. However, some extra data will be stored to assist in this and help rapid display:
Alignments are displayed either as simple boxes or as a series of boxes joined by lines (to show gaps). In the existing implementation all of these are distinct foo canvas items, which is quite inefficient. Instead we will store each feature as is (a simple box) and when bumped or at high zoom draw the appropriate series of boxes and lines. It will be relatively easy to trigger this according to how visible these decorations will be related to their size in pixels.
Each alignment will be stored as a single feature and displayed as a simple box if unbumped. NOTE also that gaps are not displayed on the reverse strand.
When bumped we will refer to the feature's gaps array and taking into account base and pixel sizes construct a list of gaps to draw - several may overlap a single pixel at low zoom. The feature actually stores the bases that are aligned not the gaps and each one corresponds to a box for display; at high zoom these are joined by a black co-linear line but often they overlap a single pixel. To produce an efficient paint we will join up blocks if they are separated by one pixel, and add horizontal lines afterwards. If a gap is bigger than one pixel then we add a vertical line after the box. On changing the zoom level we recalculate this data.
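A minimal sketch of the block-merging step follows. The struct and function names are hypothetical, coordinates are assumed non-negative, and "separated by one pixel" is interpreted here as "at most one pixel apart", which is an assumption.

```c
#include <assert.h>
#include <stddef.h>

typedef struct { long start, end; } AlignBlock;           /* aligned bases */
typedef struct { long y1, y2; int line_after; } DrawBox;  /* pixels */

/* Convert sorted aligned blocks into boxes in pixel units for the current
 * zoom. Blocks at most one pixel apart are merged into one box; a larger
 * gap sets line_after on the preceding box. 'out' must have room for n
 * boxes. Returns the number of boxes produced. */
static size_t blocks_to_boxes(const AlignBlock *blocks, size_t n,
                              double pixels_per_base, DrawBox *out)
{
    size_t i, n_out = 0;

    assert(pixels_per_base > 0.0);
    for (i = 0; i < n; i++) {
        long y1 = (long)((double)blocks[i].start * pixels_per_base);
        long y2 = (long)((double)blocks[i].end * pixels_per_base);

        if (n_out > 0 && y1 - out[n_out - 1].y2 <= 1) {
            if (y2 > out[n_out - 1].y2)
                out[n_out - 1].y2 = y2;   /* extend the previous box */
            continue;
        }
        if (n_out > 0)
            out[n_out - 1].line_after = 1;
        out[n_out].y1 = y1;
        out[n_out].y2 = y2;
        out[n_out].line_after = 0;
        n_out++;
    }
    return n_out;
}
```

The result would be cached per feature and recalculated on a zoom change, as the text describes, so the paint function only has to walk a short list of boxes.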
Thus we have a design for feature data structures that looks like the diagram below, where the green exons form a complete transcript object (a similar structure will be used for alignments). We can select or search for sub-parts of compound objects and are at choice whether we wish to access the part or the whole, and the process for each is simple and well defined. Each compound object is a linked list of sub-parts and the start and end correspond to the start and end of the list. Each sub-part refers back to the feature context. Note that in the canvas there is no data structure corresponding to the whole transcript; it is just a sum of parts.
Other than bump-style all bump modes simply specify how to arrange features in the column and compound objects are simply fed into this code as simple objects covering the whole range. (NOTE that no extra data structures are created). Bumping is done a) by adjusting the X coordinate of each feature and b) adding decorations. Unlike the existing canvas we do not add any new data structures for decorations but simply paint on demand. Two X coordinates will be stored for each feature: 1) for unbumped and 2) for the currently selected bump mode, with the appropriate one used by the GDK code.
We wish to avoid multiple scans and sorts for bumping compound objects and the following strategy will be used:
There are conflicting requirements regarding columns with multiple featuresets. For graph mode displays (eg coverage heatmaps) it is essential to have each featureset displayed separately and the features for each one accessed by a different skip list, yet for other types of feature (eg repeats) we have several featuresets that we want to display intermingled and this requires a skip list for the column, not each featureset. This can be handled by providing a virtual featureset for the column and mapping each real featureset to this, triggering (initially) off style mode, although a separate style attribute can easily be created if necessary.
Initial experiments with coverage displays have worked by adding a single canvas item for each featureset which has a width defined by the style and a height corresponding to the whole sequence. Clipping has not been coded for high zoom as we only paint objects within the scroll region. X coordinates have been ignored as there has been little benefit.
For a bumped alignment column we have significant X coordinates and need to ensure that expose events are handled efficiently. Displayable objects are found using a skip list which is sorted by Y coordinate and this provides effective selection of the vertical region of interest. Features within that region are scanned from the top downwards and we can simply ignore features that lie left or right of the expose area.
If we take as a reasonable worst case a protein alignment column with 200k features that is bumped without setting the mark and also that the bumped display is 200 features wide, and we are at minimum zoom and we expose one column of the bumped data then the overhead of scanning the X coordinates is 199k feature comparisons and list links. This can be measured quite easily, and for 100k features we have 4ms user time to run a test program, increasing this to 1000k gives 8ms. 4ms is required to run the test program with no data - this implies that 250k comparisons take approx 1ms user time.
Therefore the overhead is minimal and naive code can be used.
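The naive scan could be as simple as the following sketch, which assumes the starting feature has already been located via the skip list. The `Box` and `Rect` types and the function name are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

typedef struct Box {
    double x1, x2, y1, y2;
    struct Box *next;      /* next feature in Y order */
} Box;

typedef struct { double x1, y1, x2, y2; } Rect;

/* Walk features in Y order from 'first', stopping once past the bottom
 * of the exposed area, and skip any that lie wholly left or right of it.
 * Returns the number that would be painted; *n_compared (if not NULL)
 * receives the number of features examined. */
static size_t scan_expose(const Box *first, Rect area, size_t *n_compared)
{
    size_t painted = 0, compared = 0;
    const Box *b;

    assert(area.x1 <= area.x2 && area.y1 <= area.y2);
    for (b = first; b != NULL && b->y1 <= area.y2; b = b->next) {
        compared++;
        if (b->x2 < area.x1 || b->x1 > area.x2)
            continue;                 /* outside the exposed X range */
        if (b->y2 < area.y1)
            continue;                 /* ends above the exposed area */
        painted++;
    }
    if (n_compared != NULL)
        *n_compared = compared;
    return painted;
}
```

Each rejected feature costs two comparisons and one pointer dereference, which is the per-feature overhead the timing figures above refer to.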
Traditionally co-linear lines and glyphs have been added to same-name features (or rather to the container) as discrete foo canvas items and these get painted as normal canvas items. On un-bump they are removed from the canvas.
To achieve faster performance we would like to simply paint the existing features as bumped and then add the decorations as part of the normal paint function for the features. This is relatively simple at first glance - the type of decorations needed can be calculated on demand and switched on as required. It is a simple task to draw a line from one feature to the next, but scroll events cause expose events that do not pick up the decorations as they extend beyond the bounds of the original features. It is relatively simple to add a 'canvas extent' for features and catch paint events downstream, but extending a feature upstream could result in the index no longer being sorted. Note also that if we also try to paint lines upstream of a feature this will not be effective, as we may expose an area between each one and have no reason to paint anything.
Clearly as colinear lines can extend large distances we wish to avoid the paint overhead when un-bumped and have to set the extent whenever the bump mode is changed.
We also have to change the featureset overlap distance to pick up upstream features. Note that although this gives a significant performance overhead it does so because we actually want to paint all those features, and this cannot be avoided.
Glyphs may be slightly tricky to add on as we have to calculate the maximum size of a glyph to define the extra overlap distance.
The root cause of these problems is that we are dealing with feature parts and we need to display some part in the middle that we can't find on an expose. If we could deal with alignments as composite objects then these could easily pick up any relevant expose event.
There is a performance gain to be had by caching glyph shapes as these are all the same shape and size (but may be inverted or displayed in a different colour). Glyphs in general can be sized by score (which can be set in the style) and we have to handle this (efficiently) if someone configures it (eg as for GF_Splice glyphs as standalone features), but for bumped alignments we currently have a restricted level of variation. This means we only need to have four data structures for each glyph style definition, but have to allow for one per feature 'just in case'.
To provide a clean interface the alignment code will use a #define or function call to access a sub-feature glyph and these will be painted on demand. We prefer to avoid large amounts of memory allocation (which saves memory as not all alignments ...). Glyphs may be stored per feature (quick and easy but uses a lot of memory; NOTE: this could require 100MB for 200k alignments if pre-allocated) or in a hash table (memory efficient but a little fiddly). The problem is to find a way to index glyphs that is efficient enough but also general - note that we need to index an instance of a glyph, not its definition. What we are aiming at is to just have examples of each variation (max 8 per style, of which we currently have 2) in a way that works regardless of our knowledge of current use.
To handle an arbitrary number of glyph instances we must have one glyph per feature (at each end) and the only way to do this efficiently in terms of memory is via a pointer or key in the alignment struct (we wish to avoid recalculating on expose for speed). If we can use these pointers/keys to refer to a few glyph instances that can be mapped to a feature's coordinates then we have an efficient solution.
Note this code is part of the glyph implementation and not restricted to alignment features, although other types of features would have to code the interface if required.
A nuance: score_mode ALT displays the glyph using a different colour, and this can be implemented by the calling code easily enough. We have the option to use shorter signatures and fewer glyph instances; however this is not hugely significant in terms of speed or memory use.
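One way to realise "a key in the alignment struct referring to a few shared glyph instances" is sketched below. Everything here is an assumption made for illustration: the signature fields, the triangle shape, the one-byte keys and all names are hypothetical, and colour is deliberately left to the caller as suggested for score_mode ALT.

```c
#include <assert.h>

#define MAX_GLYPH_INSTANCES 8   /* the text expects at most 8 variants per style */
#define GLYPH_NONE (-1)

typedef struct {
    int style_id;    /* hypothetical id of the glyph style definition */
    int inverted;    /* 0 or 1 */
} GlyphSig;

typedef struct {
    GlyphSig sig;
    double dx[3], dy[3];   /* triangle vertices relative to an anchor point */
} GlyphInstance;

typedef struct {
    GlyphInstance inst[MAX_GLYPH_INSTANCES];
    int n;
} GlyphCache;

/* An alignment holds one small key per end instead of its own glyph. */
typedef struct {
    double y1, y2;
    signed char glyph_start, glyph_end;
} AlignFeature;

/* Return the key of the instance matching 'sig', creating it on first
 * use, or GLYPH_NONE when the cache is full. */
static int glyph_cache_get(GlyphCache *cache, GlyphSig sig)
{
    double sign = sig.inverted ? -1.0 : 1.0;
    GlyphInstance *g;
    int i;

    for (i = 0; i < cache->n; i++)
        if (cache->inst[i].sig.style_id == sig.style_id &&
            cache->inst[i].sig.inverted == sig.inverted)
            return i;
    if (cache->n == MAX_GLYPH_INSTANCES)
        return GLYPH_NONE;

    g = &cache->inst[cache->n];
    g->sig = sig;
    g->dx[0] = -3.0; g->dy[0] = 0.0;
    g->dx[1] =  3.0; g->dy[1] = 0.0;
    g->dx[2] =  0.0; g->dy[2] = 4.0 * sign;
    return cache->n++;
}

/* Map vertex v of a shared instance onto a feature's coordinate at paint time. */
static double glyph_vertex_y(const GlyphCache *cache, int key, int v, double anchor_y)
{
    assert(key >= 0 && key < cache->n && v >= 0 && v < 3);
    return anchor_y + cache->inst[key].dy[v];
}
```

In this sketch the per-feature cost is two bytes, and the shape is computed once per variant rather than once per feature or once per expose.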
Due to historical constraints alignments are sourced from multiple features and then linked together and in future we expect this to change, with a single cigar string and sequence for a whole series. Transcripts appear in several lines of GFF but are assembled on input into complex features. These then have to be added to the canvas as a series of simple features linked together. Data volumes are much lower and they are normally viewed bumped, and there is no difference in the display format whether bumped or not.
Alignments were implemented by having all decorations (such as co-linear lines) as virtual canvas items a) to handle alternate display formats without deleting and adding large amounts of data to the canvas on bumping. and b) to reduce the amount of features in the canvas. Neither of these constraints are relevant to transcripts.
A new feature specific function will be written to handle adding transcripts to the canvas, which will display each exon and intron as simple canvas features and link these together via the existing left and right links. CDS and UTR regions could be added as distinct canvas features but this would make selection of a complete exon more complex and this will not be done. An exon with a CDS/UTR split will be displayed as a single feature but may consist of 1, 2, or 3 boxes. Unlike alignments, intron lines will be added as explicit canvas features.
There will be no data structure in the canvas for a composite transcript object, just an intron/exon structure, each of which will refer to the transcript feature in the feature context and be linked to its siblings.
In the initial trial we had the luxury of knowing that features do not overlap and focus highlight has been implemented simply by setting the colour of individual features.
For the more general case we need to display multiple features on top of others and in the existing implementation this has been done by re-ordering the features in a column (eg via foo_canvas_raise_to_top()). Note that this has resulted in a few anomalies in the past and there are instances where the focus code is deficient (eg revcomp will restore only a single focus item).
A review of this is needed and different methods used: as we have sorted data for display then we cannot re-order features to highlight them.
One way, using the proposed canvas structure, is to flag each focussed item and have the display code display these after displaying all others. This would be a simple process of adding focus items to a list on expose (but note that wiggle plots are perhaps complicated ... these currently do not show focus highlights as there is no obvious way to do this).
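The "paint focus items last" idea can be sketched as follows. The names and the flag value are hypothetical, and painting is simulated by recording item ids in the order they would be drawn.

```c
#include <assert.h>
#include <stdlib.h>

#define FEATURE_FOCUS 0x1u   /* hypothetical flag bit */

typedef struct Item {
    int id;
    unsigned flags;
    struct Item *next;       /* next item in sorted display order */
} Item;

/* Walk the exposed items once: paint unfocussed items immediately and
 * defer focussed ones to a list that is painted afterwards, so they end
 * up on top without re-ordering the sorted data. Records the paint order
 * in order[] (capacity cap) and returns the number painted. */
static size_t paint_with_focus_last(const Item *first, int *order, size_t cap)
{
    const Item **deferred = malloc((cap ? cap : 1) * sizeof *deferred);
    size_t n = 0, n_deferred = 0, i;
    const Item *it;

    assert(deferred != NULL);
    for (it = first; it != NULL && n + n_deferred < cap; it = it->next) {
        if (it->flags & FEATURE_FOCUS)
            deferred[n_deferred++] = it;
        else
            order[n++] = it->id;
    }
    for (i = 0; i < n_deferred; i++)
        order[n++] = deferred[i]->id;
    free(deferred);
    return n;
}
```

The deferred list only ever holds items inside the exposed region, so its size is bounded by what is being painted anyway.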
An alternative would be to post process a focus items list on expose; this does not fit so well into the overall canvas structure but could possibly be handled by the block objects. This would be difficult to implement if as proposed we do canvas items first and then containers.
The above implies data structured approximately like this:
simple feature struct
    feature type
    y coord
    x coord (unbumped)
    x coord (bumped)
    left link pointer to sub-feature (NULL)
    right link pointer to sub-feature (NULL)

complex feature struct
    simple feature struct
    left link pointer to sub-feature or NULL
    right link pointer to sub-feature or NULL

transcript feature struct
    complex feature struct

alignment feature struct
    complex feature struct
    gaps data
    homology data

NOTE that we will pre-calculate same-name alignment groups from existing GFF data, or use VULGAR strings to retrieve this, and do this when features are first displayed: this is not something computed on bump or select as at present.
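Expressed in C, the outline above might look like the sketch below. All type and field names are hypothetical, the start/end pair for Y and the opaque pointers for gaps, homology and the context feature are placeholders, and the links live in the base struct where simple features leave them NULL.

```c
#include <assert.h>
#include <stddef.h>

typedef enum {
    FEATURE_BASIC,
    FEATURE_TRANSCRIPT,
    FEATURE_ALIGNMENT
} FeatureType;

/* Simple feature: the links are always NULL for simple features. */
typedef struct CanvasFeature {
    FeatureType type;
    double y1, y2;                  /* y coord (start and end) */
    double x;                       /* x coord when unbumped */
    double x_bumped;                /* x coord for the current bump mode */
    struct CanvasFeature *left;     /* previous sub-feature or NULL */
    struct CanvasFeature *right;    /* next sub-feature or NULL */
    void *context_feature;          /* refers back to the feature context */
} CanvasFeature;

/* Transcript parts need nothing beyond the linked base. */
typedef struct {
    CanvasFeature base;
} TranscriptFeature;

/* Alignments add gaps and homology data. */
typedef struct {
    CanvasFeature base;
    void *gaps;
    void *homology;
} AlignmentFeature;

/* Checked downcast: the base is the first member, so the addresses match. */
static AlignmentFeature *as_alignment(CanvasFeature *feature)
{
    assert(feature->type == FEATURE_ALIGNMENT);
    return (AlignmentFeature *)feature;
}
```

Keeping the base as the first member means featureset-level code can treat every feature as a `CanvasFeature *` and only feature-specific code needs the cast.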
We wish to implement a base level of code and data that interfaces to featureset ie whole column functions and then provide feature specific functions per type of feature. We also wish to avoid the use of GObjects which implement OO style features quite slowly in real time rather than at compile time.
There are two obvious ways to approach this:
feature -> featureset-interface -> display()
There are a number of clearly defined functional parts:
The existing GraphDensity module can be used as a basis for this with the addition of handling for overlap and focus.
Subsequently we can add on column summarise functions.
Thus we expect an implementation involving the following files:
zmapWindowCanvasBlock.c        root/block container
zmapWindowCanvasColumn.c       column container
zmapWindowCanvasFeatureSet.c   canvas featureset item
zmapWindowCanvasColBump.c      column bump
zmapWindowCanvasFocus.c        handle highlights

and also a few more to handle feature specific things like display styles:

zmapWindowCanvasTranscript.c
zmapWindowCanvasAlignment.c
zmapWindowCanvasBasic.c
zmapWindowCanvasGlyph.c
zmapWindowCanvasGraphDensity.c
zmapWindowCanvasAssembly.c
zmapWindowCanvasText.c
zmapWindowCanvasSequence.c
etc
Currently all displayable items are foo canvas items which are children of a foo canvas group that is a ZMapWindowCanvasItem, and adding a featureset foo canvas item requires there to be a ZmapWindowCanvasItem around it. This is required by the current interface via the ItemFactory ?? and also some code in the base ZMapWindowCanvasItem ??. We wish to remove ZMapWindowCanvasItem but cannot do this until all instances can be removed.
WHY?
In the interim this means we have to add dummy ZmapWindowCanvasItems around featureset canvas items, and this has been done for ZmapWindowGraphDensityItems using a GraphItem type. To avoid extra work and repetition of code we need to change this to be a generic featureset type of ZMapWindowCanvasItem and to provide an interface to generic functions that will work for multiple types of features. This will result in the base data structure that can be extended per feature type.
Note however that the graph density items code will remain as is, as it performs some quite different processes such as re-binning the source data. After implementing generic featureset items this may be reviewed and common code integrated in some way if this turns out to be appropriate. It is likely that the overall canvas interface to GDI will have to be changed slightly.
Copy the basic feature code for alignments - this will give the same outward appearance as existing code for unbumped data.
Using the interface as above, modify the ItemFactory to create singleton alignment featuresets and add features as simple data structures.
Add a module to handle display of alignments as unbumped (simple boxes, no decorations). Note this requires overlap to be handled on paint.
Implement left and right links for same-name features.
NOTE: Column summarise code could be added here.
Historically this area has been implemented in a way that results in a few anomalies and we cannot alter the interface without changing a lot of application level code. What we are aiming to provide is:
For canvas featuresets this function will be called for the relevant features and currently sets flags in the feature data structure - to minimise memory use it was decided to use flags rather than adding colour attributes. Features are painted using colours that are cached by the featureset (to reduce the number of GDK calls), and currently a focus highlight can only be one colour, which is set in the featureset data struct, and used for the specified feature(s) if a flag is set.
The interface used by the ZMap code sets colours in FooCanvasItems generated by a series of configuration options and could be any values at all but in practice can only be taken from a small set. To use memory efficiently while still allowing full generality and extendability of focus/select code the following process will be adopted.
This should process data via a canvas featureset's skip list and set feature's x-coordinates.
Modify display code to test for X-coordinates if bumped, and handle decorations and gaps. The featureset struct will hold a 'current bump mode' (if set) and if not will calculate bump coordinates by lazy evaluation. Note that if the mark changes we could have to recalculate these, depending on interpretation of overlap.
It is desired to implement code in such a way as to allow common features to be coded at the featureset level and for different types of features to operate via a similar interface. Historically this has been done by using GObjects and extending FooCanvasItems but we wish to avoid this as it will run quite slowly.
The main performance gains we expect to achieve are by not polling every feature and by optimising paint operations, but performance gains of 4x are easily achievable by not using GObjects.
On the assumption that we code in C then we can inherit a base data structure quite easily by including it in our child object.
    struct base_feature
    {
        int y1, y2;
    };

    struct alignment_feature
    {
        struct base_feature base;
        /* etc */
    };

One consequence of this is that memory allocation for features either has to work with the largest child object or operate separate free lists for each type. As memory use has been dominated by alignments and these are the most complex objects (with gaps) it is tempting to do the former, but with the advent of paired reads this may no longer be valid.
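The "work with the largest child object" option could be sketched with a union and a single free list, as below. The extra fields, the `type` member and the function names are assumptions for illustration only.

```c
#include <assert.h>
#include <stdlib.h>

struct base_feature { int type; int y1, y2; };
struct alignment_feature { struct base_feature base; void *gaps; };
struct transcript_feature { struct base_feature base; void *left, *right; };

/* Every allocation is the size of the largest child, so one free list
 * can serve all feature types. */
union any_feature {
    struct base_feature base;
    struct alignment_feature alignment;
    struct transcript_feature transcript;
    union any_feature *next_free;    /* only valid while on the free list */
};

static union any_feature *free_list = NULL;

/* Take a slot from the free list, or fall back to malloc(). */
static struct base_feature *feature_alloc(int type)
{
    union any_feature *f = free_list;

    if (f != NULL)
        free_list = f->next_free;
    else
        f = malloc(sizeof *f);
    if (f == NULL)
        return NULL;
    f->base.type = type;
    f->base.y1 = f->base.y2 = 0;
    return &f->base;
}

/* Return a slot to the free list for re-use by any feature type. */
static void feature_free(struct base_feature *feature)
{
    union any_feature *f = (union any_feature *)feature;

    f->next_free = free_list;
    free_list = f;
}
```

The trade-off is the one the text identifies: small features waste the difference between their size and the largest child, which matters more if a small type such as paired reads comes to dominate.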
In an attempt to keep things simple, a simple array of functions indexed by feature type (an enum) can be used. This array can be maintained by the featureset class and a wrapper function provided to type check and call safely. Functions can be inherited or replaced quite easily.
Functions will be limited to those needed by display, mouse, focus etc. and are defined and maintained by the featureset class, not the feature classes themselves. Note that unlike GObjects this is a global array of blank functions and not a series of function pointers defined in a class struct.
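A minimal sketch of such a function array and its checking wrapper is shown below. Only a paint function is modelled, it returns an int purely so the dispatch can be observed, and all names are hypothetical. Falling back to the basic feature's function when a type has none registered is one assumed way of providing "inheritance".

```c
#include <assert.h>
#include <stddef.h>

typedef enum {
    FEATURE_INVALID,
    FEATURE_BASIC,
    FEATURE_ALIGNMENT,
    FEATURE_N_TYPES
} FeatureType;

typedef struct { FeatureType type; int y1, y2; } Feature;

typedef int (*PaintFunc)(const Feature *feature);

/* Global array owned by the featureset code; entries start out blank. */
static PaintFunc paint_funcs[FEATURE_N_TYPES];

/* Register or replace the paint function for one feature type. */
static void featureset_set_paint(FeatureType type, PaintFunc func)
{
    assert(type > FEATURE_INVALID && type < FEATURE_N_TYPES);
    paint_funcs[type] = func;
}

/* Wrapper: checks the type, then calls the type's function or, if none
 * is registered, the basic one. Returns 0 when nothing could be called. */
static int featureset_paint(const Feature *feature)
{
    PaintFunc func;

    if (feature == NULL || feature->type == FEATURE_INVALID ||
        feature->type >= FEATURE_N_TYPES)
        return 0;
    func = paint_funcs[feature->type];
    if (func == NULL)
        func = paint_funcs[FEATURE_BASIC];
    return func != NULL ? func(feature) : 0;
}

/* Example functions: the alignment version extends the basic one. */
static int basic_paint(const Feature *feature)
{
    (void)feature;
    return 1;
}

static int alignment_paint(const Feature *feature)
{
    return basic_paint(feature) + 1;
}
```

Because the array is indexed directly by the enum, a call costs one bounds check and one indirect call, with no per-object class pointer to follow.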