Optimising ZMap

Overview

Some ZMap functions take a long time, notably 3-Frame toggle, RevComp and zoom, and we wish to make these run much faster. The design so far has been to draw all the data into a single bitmap which can be scrolled using hardware; inevitably, as data size increases, display functions get slower.

It may be beneficial to consider some different aspects of speed and perceived speed:


Ideas/Action Notes on ZMap build Results

Measurements

Test data and environment

We need to be able to measure how long a given process takes, to see whether improvements can be made or are effective. For this we need a reproducible test environment with a fixed set of data and operations that can be run, and results generated, automatically.

Some ad-hoc timing code has been used in the past, and suffers the disadvantage that it must be coded and then removed for a production build. It could be kept in a separate development-only build and automatically removed from production versions, but instead the X-remote test program could be enhanced to gather this data, which would allow the timing code to remain in the ZMap source.
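
As a minimal sketch of timing code that can stay in the source, the helper below only reports when a run-time switch is on. All names here (timing_set_enabled, timing_now, timing_report) are hypothetical and not taken from ZMap, and clock() measures CPU time rather than wall-clock time, which may or may not be what is wanted.

```c
#include <stdio.h>
#include <time.h>

/* Hypothetical run-time switch, standing in for a config option. */
static int timing_enabled = 0;

void timing_set_enabled(int enabled)
{
    timing_enabled = enabled;
}

/* CPU time used by this process so far, in seconds. */
double timing_now(void)
{
    return (double)clock() / CLOCKS_PER_SEC;
}

/* Print one tab-separated timing record to stdout and return the elapsed
 * time in seconds. When timing is disabled nothing is printed and 0.0 is
 * returned, so the call can be left in production code. */
double timing_report(const char *label, double start_time)
{
    double elapsed;

    if (!timing_enabled)
        return 0.0;

    elapsed = timing_now() - start_time;
    printf("TIMING\t%s\t%.6f\n", label, elapsed);
    return elapsed;
}
```

A caller would take t = timing_now() before the operation and call timing_report("RevComp", t) after it; a test program can then collect the TIMING lines from stdout.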

Actions we may want to time

After some experience we may wish to split some of these into smaller chunks.

Action plan

Constructive delivery

With a view to getting maximum results from minimum work, a practical approach will be used:

This process will be subject to 'constant review'.

Effect of data size

It would be useful to have a similar test sequence run on datasets of different sizes to see whether certain functions are especially sensitive to data size. However, just testing with a large dataset may be a simpler strategy - if any part of the code has a problem then it will become apparent.

Timing and call frequency data from VTune

Running ZMap under VTune is quite useful - we get a hardware assisted profile of the whole PC which appears to be quite accurate and data can be grouped by process, thread, module (eg zmap, foo-canvas, glib) and source file and ordered by name, CPU % and call count. This does not appear to provide cumulative totals but is very effective for showing the contribution of individual functions. Call frequencies appear to be estimated but are driven by execution (or caching?) of call instructions.

Note that the CPU percentages given are relative to a module and therefore cannot be compared directly between modules. There also appears to be no way to export the data, eg to a spreadsheet, so it looks quite difficult to get a cumulative total of time spent within a function. There appears to be no cut & paste function either.

Canvas choice

Jeremy has produced some statistics comparing various canvas widgets and this reveals that the foo-canvas is the best choice, only bettered by openGL.

Ad hoc test programs

As styles are used to draw every feature, a simple test program was written to test the effect of using function calls to read style parameters (in ~mh17/src/styles). This shows a 10x performance improvement when reading the style's structure members directly.
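
The kind of comparison such a test program makes can be sketched as follows. This is not the program in ~mh17/src/styles; SketchStyle and the function names are hypothetical, and the struct is far simpler than a real style. Note that when the accessor is in the same source file an optimising compiler may inline it, so a real measurement would need the accessor in a separate module, as it is for ZMap styles.

```c
#include <stddef.h>

/* Hypothetical, much simplified style structure. */
typedef struct
{
    double width;
    int    bump_mode;
} SketchStyle;

/* Accessor function: the only route to the data while the struct
 * definition is hidden in a private header. */
double sketch_style_get_width(const SketchStyle *style)
{
    return style->width;
}

/* Read every style's width through the accessor function. */
double sum_widths_by_call(const SketchStyle *styles, size_t n)
{
    double total = 0.0;
    size_t i;

    for (i = 0; i < n; i++)
        total += sketch_style_get_width(&styles[i]);

    return total;
}

/* Read every style's width directly from the structure member. */
double sum_widths_direct(const SketchStyle *styles, size_t n)
{
    double total = 0.0;
    size_t i;

    for (i = 0; i < n; i++)
        total += styles[i].width;

    return total;
}
```

Timing each function over a few million styles would give the ratio between the two access methods on a given compiler and machine.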

Initial timing results

Here's a summary of the big picture from loading ZMap, toggling 3-Frame and then clicking on RevComp.
All other modules are less than 0.7%. ZMap used 97% of the CPU.

Module        CPU %
libGObject    24.6
libGLib       18.9
libc          15.6
zmap          12.8
libgdk        11.2
libpthread    11.0
libX11         2.7
vmlinux        1.4

Within the ZMap module, grouping the data by source file reveals (all others less than 2.0%):
Source file                  CPU %
foo_canvas.c                 32.9
foo_canvas_rect-ellipse.c    15.9
zmapWindowCanvasItem.c       13.0
crtn.S                        5.7
zmapStyle.c                   3.1
zmapFeatureUtils.c            2.8
zmapWindowFeature.c           2.6
zmapWindowItemFactory         2.3
zmapFeature.c                 2.0

Grouping by function reveals that:

Ideas for performance improvement

Targeting the right code

Other than loading feature data we regard all startup and configuration code as acceptably fast, and we can afford to perform extensive validation of data as necessary. However, recent changes to the startup behaviour of ZMap/otterlace may require a review of this - if we operate a separate pipe server for each featureset then an inefficient way of reading this data may become an issue.

Checking compiler options

Have we selected the best compiler optimisations?

Speeding up style data access (a)

Styles are GObjects and are read in from a file or a database such as ACEDB. Style data is currently not accessible outside the module other than by function call, and this was deemed appropriate to ensure data integrity. Structure members are set via a GObject->set() function call, which is inevitably quite slow.

However, accessing styles takes up 2%+ of the ZMap CPU and this can be reduced to a small fraction of 1% by allowing direct access to the style data structure (access is currently prevented by having the style structure defined in a private header).

Action plan

It is suggested that the implementation is changed as follows:

Expected gain 2% of zmap CPU, about 0.4% overall

Results

Little difference to overall time used, but VTune reports a change of ~2% of ZMap CPU for StyleIsPropertySetID().

Speeding up style data access (b)

To find a single basic feature's style (the majority of features) the window items class factory calls zmapWindowContainerFeatureSetStyleFromID(), and to set the colours a separate call to (class)->get_style() is made. Glyphs go through a similar process:

style = (ZMAP_CANVAS_ITEM_GET_CLASS(basic)->get_style)(basic);
which translates as:
style = zmap_window_canvas_item_get_style(basic);
This does not appear in VTune as it's static, but it calls some global functions:
zMapWindowCanvasItemIntervalGetTopLevelObject   0.1%  CPU   100k calls = 0.001%  per 1k calls
zmapWindowContainerCanvasItemGetContainer       0.55% CPU   600k calls = 0.0009% per 1k calls
zmapWindowContainerFeatureSetStyleFromID        0.4%  CPU   500k calls = 0.0008% per 1k calls
So for each basic feature we expect to use 0.0027% (the three functions above, called via get_style()) + 0.0008% (the factory's own call to zmapWindowContainerFeatureSetStyleFromID()) = 0.0035% CPU per 1000 features just to look up the style. The situation may be worse: zmapWindowContainerFeatureSetStyleFromID() calls a GObject type check function and then another function which calls g_hash_table_lookup(); each of these is implicated in 25% of the CPU of its respective module, and both of those modules use significantly more CPU than ZMap. This is significantly more than is required to read the style data once we have the struct, even using function calls.

Action plan

Restructuring the feature data to speed up style access

The server model used by ZMap is such that display styles must be present in the server so that it can filter out data that has no display style. In the case of ACEDB, styles are traditionally derived from the database, and for pipe servers (and optionally for ACEDB) styles are passed to the server in a file. All servers return styles in a data structure which is then merged with existing styles.

There are also some hard coded styles that are provided by ZMap.

Features, when read in by the server, are given a style id which is later used to look up the style in a small hash table owned by the column the feature is to be displayed in. The whole feature context is passed over to ZMap and merged into the existing one.

By combining the styles data with the feature context from each server it would be possible to include a pointer to a feature's style in the feature itself, giving instant lookup. This has some implications:

Expected gain 1.4% of zmap CPU, plus some contribution from GLib and Gobject, about 0.5-1.0% overall
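
The idea of holding a style pointer in the feature can be sketched as below. SketchStyle, SketchFeature and both functions are hypothetical stand-ins rather than ZMap's real structures, and a linear search stands in for whatever table the column owns. The point is that the id lookup is done once per feature, after the context is merged, instead of on every draw.

```c
#include <stddef.h>

/* Hypothetical, simplified style and feature structures. */
typedef struct
{
    unsigned int id;
    double       width;
} SketchStyle;

typedef struct
{
    int                start;
    int                end;
    unsigned int       style_id;   /* id assigned when the feature is read in */
    const SketchStyle *style;      /* resolved once; NULL if not resolved */
} SketchFeature;

/* Find a style by id (linear search as a stand-in for the real table). */
const SketchStyle *sketch_find_style(const SketchStyle *styles, size_t n_styles,
                                     unsigned int id)
{
    size_t i;

    for (i = 0; i < n_styles; i++)
        if (styles[i].id == id)
            return &styles[i];

    return NULL;
}

/* Resolve every feature's style pointer in one pass.
 * Returns the number of features whose style could not be found. */
size_t sketch_resolve_styles(SketchFeature *features, size_t n_features,
                             const SketchStyle *styles, size_t n_styles)
{
    size_t i, unresolved = 0;

    for (i = 0; i < n_features; i++)
    {
        features[i].style = sketch_find_style(styles, n_styles, features[i].style_id);
        if (features[i].style == NULL)
            unresolved++;
    }

    return unresolved;
}
```

After this pass the drawing code can read feature->style->width directly, with no lookup and no type check.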

Initial results

The featureset CanvasGroup now holds a copy of its style and each feature has a pointer to this. In ProcessFeature() the function calls to look up styles have been removed. The column group still has copies of all the styles needed - any changing parameters such as current bump mode are stored in these, not in the private featureset copies.

Further work required

The column group objects need to be given pointers to the featureset styles instead of making copies of all the styles needed, so that all the code accesses the same instance of each style.

Sub-feature types are still processed by style lookup via the column group, and should be implemented as pointers: these extra styles would be accessible only through their parent via each feature's style pointer.

See below for performance measurements.

Fixing zMapWindowFeatureStrand

This function decides which strand a feature belongs on, which involves looking up the style in a window-global GData list; attaching the style directly to the feature will save us another 1.7% of ZMap CPU.

Action plan

Removed this function's style lookup after restructuring the data.

Expected gain 1.7% of zmap CPU, 0.2% overall

Results

Apparently little change: is this % at the level of noise?

Removing Asserts

Arguably the Assert calls used in ZMap perform a valid function during development but when the code functions correctly they should never be called and they are a waste of CPU.

The function zMapFeatureIsValid() is only called from Assert (38 times) and uses 1.3% of the zmap CPU. There are many other calls to Assert (817 in total) and if we pro-rate this at 10% per call this implies a much greater saving of 15% of the zmap CPU. This seems quite high, and most other calls are probably less frequent.

How can we justify removing Asserts?

These are already coded as macros and can be adjusted to be included only in development versions of code. During development they are used to catch programming errors and are only valid where there is a logical error in the code that has broken an assumption about the data. They should not be used to detect errors in external data (from users, other programs or other modules). During testing we hope to find all these logical errors but on occasion we have reports from users.

It would be advisable to create a test environment that can exercise ZMap functions and be run automatically before releasing any build - this would give greater confidence, and it should also be noted that Asserts do not prevent problems from occurring.

Action plan

Implement a debug/production build option to control how Asserts are compiled.

Extend the x-remote or other test program to automatically exercise most of the ZMap code. Note that here we are not testing for correct function but only that ZMap does not abort - the test can be done with user interaction.

Expected gain 1.7% of zmap CPU, 0.2% overall, plus a few % more
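
A build-controlled Assert could look something like the sketch below. SKETCH_DEBUG_BUILD, SKETCH_ASSERT and the functions are hypothetical names, not ZMap's existing macros. The important property is that in a production build the asserted expression is not evaluated at all, so an expensive check such as a validity function costs nothing.

```c
#include <stdio.h>
#include <stdlib.h>

/* Define SKETCH_DEBUG_BUILD (eg with -DSKETCH_DEBUG_BUILD) for development
 * builds; leave it undefined for production builds. */
#ifdef SKETCH_DEBUG_BUILD
#define SKETCH_ASSERT(expr)                                           \
    do {                                                              \
        if (!(expr))                                                  \
        {                                                             \
            fprintf(stderr, "Assert failed: %s (%s:%d)\n",            \
                    #expr, __FILE__, __LINE__);                       \
            abort();                                                  \
        }                                                             \
    } while (0)
#else
#define SKETCH_ASSERT(expr) ((void)0)   /* expression is never evaluated */
#endif

/* Counts how often the (potentially expensive) validity check runs. */
static int validity_checks_run = 0;

int feature_is_valid(const void *feature)
{
    validity_checks_run++;
    return feature != NULL;
}

int validity_checks(void)
{
    return validity_checks_run;
}

/* Returns 1 if the feature was processed, 0 if there was nothing to do. */
int process_feature(const void *feature)
{
    SKETCH_ASSERT(feature_is_valid(feature));

    return feature != NULL ? 1 : 0;
}
```

In a production build validity_checks() stays at zero however many features are processed; in a development build every call is checked and a failure aborts with the file and line.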

Speeding up GLib

GData keyed data lists

Processing these (just the function g_datalist_id_get_data()) accounts for 14% of 17% of the total or approximately 3% CPU overall.

They are used only for styles and feature contexts - lists of featuresets. Given that we can easily have 300+ styles these would be better coded as a GHashTable.

It appears that this function is only called from zMapFindStyle() which could be removed from most of the code if we did as above. Note that this function is called from processFeature(), (once directly and once via zmapWindowFeatureStrand()) which is called to display every feature, and has to search the window-global list of ~300 styles for each feature.

Action plan

Remove the style GData list structures and replace them with small hashes, and integrate styles into the feature context.

Expected gain 3% CPU overall, plus a few % more
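
To illustrate why a hash should beat a list that has to be searched for each of ~300 styles, here is a small self-contained sketch. It does not use GLib: the fixed-size open-addressing table stands in for a GHashTable and the array search stands in for the keyed data list, and all the names are hypothetical. In ZMap itself the natural choice would be GHashTable.

```c
#include <stddef.h>

#define SKETCH_TABLE_SIZE 1024u   /* power of two, well above ~300 styles */

typedef struct
{
    unsigned int id;
    double       width;
} SketchStyle;

/* Must be zero-initialised before use (all slots empty). */
typedef struct
{
    const SketchStyle *slots[SKETCH_TABLE_SIZE];
} SketchStyleTable;

/* Multiplicative hash, masked down to a slot index. */
static unsigned int sketch_slot(unsigned int id)
{
    return (id * 2654435761u) & (SKETCH_TABLE_SIZE - 1u);
}

/* Insert or replace a style. Returns 0 on success, -1 if the table is full. */
int sketch_table_insert(SketchStyleTable *table, const SketchStyle *style)
{
    unsigned int i, slot = sketch_slot(style->id);

    for (i = 0; i < SKETCH_TABLE_SIZE; i++)
    {
        unsigned int probe = (slot + i) & (SKETCH_TABLE_SIZE - 1u);

        if (table->slots[probe] == NULL || table->slots[probe]->id == style->id)
        {
            table->slots[probe] = style;
            return 0;
        }
    }

    return -1;
}

/* Hash lookup: normally one or two probes, independent of the style count. */
const SketchStyle *sketch_table_lookup(const SketchStyleTable *table, unsigned int id)
{
    unsigned int i, slot = sketch_slot(id);

    for (i = 0; i < SKETCH_TABLE_SIZE; i++)
    {
        unsigned int probe = (slot + i) & (SKETCH_TABLE_SIZE - 1u);

        if (table->slots[probe] == NULL)
            return NULL;                    /* empty slot: id is not present */
        if (table->slots[probe]->id == id)
            return table->slots[probe];
    }

    return NULL;
}

/* List lookup: on average half the styles are examined per call. */
const SketchStyle *sketch_list_lookup(const SketchStyle *styles, size_t n, unsigned int id)
{
    size_t i;

    for (i = 0; i < n; i++)
        if (styles[i].id == id)
            return &styles[i];

    return NULL;
}
```

Both lookups return the same style; the difference is only in how many entries are examined per call, which matters when the lookup is made for every feature displayed.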

Results

GData has been removed from styles and now is only used for feature sets.

Significant changes in CPU use can be observed:
Function                  Before CPU %   After CPU %
g_hash_table_lookup       22.4           27.1
g_datalist_id_get_data    14.7            2.7
which equates to a saving of 7.3% of GLib CPU, which is approx 50% more significant than ZMap CPU.

However real time used to display data is the same as before.

Further work

g_datalist_set_data() remains at 6% (from 7%) - this is used for 'multiline-features' in the GFF parser and while we would expect this only to apply to a small fraction of the features it is identified as having 14M calls. It may be called for every feature, in which case replacing this last instance with a hash table may be worthwhile.

Speeding up GObject (a)

GObject takes up 25% of the total CPU and this is dominated by casts and type checking. We can gain 14% of 25% by replacing G_TYPE_CHECK_INSTANCE_CAST with a simple cast, although it might be good to have the option to switch this back on for development.

Action plan

Implement a global header or build option to allow these macros to be changed easily. See the separate notes on how to operate the build system.

This option is controlled by:

#if GOBJ_CAST

Expected gain 4% CPU overall
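
The switch can be sketched without GObject as follows. Everything except the GOBJ_CAST name is a hypothetical stand-in: a type tag in the first member plays the role of the GObject type information, and sketch_checked_cast() plays the role of the checked cast. In the real headers the two branches would be G_TYPE_CHECK_INSTANCE_CAST and a plain C cast.

```c
#include <stdio.h>

#ifndef GOBJ_CAST
#define GOBJ_CAST 0   /* 0: plain casts (production), 1: checked casts (development) */
#endif

typedef enum { SKETCH_TYPE_ITEM = 1, SKETCH_TYPE_STYLE = 2 } SketchType;

/* Stand-in for the object header: first member of every object. */
typedef struct
{
    SketchType type;
} SketchObject;

typedef struct
{
    SketchObject parent;
    double       width;
} SketchStyle;

/* Checked cast: warns if the object is not of the wanted type, then
 * returns the pointer unchanged. */
void *sketch_checked_cast(void *obj, SketchType wanted)
{
    if (obj != NULL && ((SketchObject *)obj)->type != wanted)
        fprintf(stderr, "warning: invalid cast to type %d\n", (int)wanted);

    return obj;
}

#if GOBJ_CAST
#define SKETCH_STYLE(obj) ((SketchStyle *)sketch_checked_cast((obj), SKETCH_TYPE_STYLE))
#else
#define SKETCH_STYLE(obj) ((SketchStyle *)(obj))
#endif

/* Code using the macro is identical in both builds. */
double sketch_width_of(void *obj)
{
    return SKETCH_STYLE(obj)->width;
}
```

Building with -DGOBJ_CAST=1 restores the run-time checks for development without touching any of the calling code.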

Details

These macros appear in:
include/ZMap/zmapBase.h:2
include/ZMap/zmapGUITreeView.h:2
include/ZMap/zmapStyle.h:2
libcurlobject/libcurlobject.h:2
libpfetch/libpfetch.h:6
zmapWindow/items/zmapWindowAlignmentFeature.h:2
zmapWindow/items/zmapWindowAssemblyFeature.h:2
zmapWindow/items/zmapWindowBasicFeature.h:2
zmapWindow/items/zmapWindowCanvasItem.h:2
zmapWindow/items/zmapWindowContainerAlignment.h:2
zmapWindow/items/zmapWindowContainerBlock.h:2
zmapWindow/items/zmapWindowContainerChildren.h:8
zmapWindow/items/zmapWindowContainerContext.h:2
zmapWindow/items/zmapWindowContainerFeatureSet.h:2
zmapWindow/items/zmapWindowContainerGroup.h:2
zmapWindow/items/zmapWindowContainerStrand.h:2
zmapWindow/items/zmapWindowGlyphItem.h:2
zmapWindow/items/zmapWindowLongItem.h:2
zmapWindow/items/zmapWindowSequenceFeature.h:2
zmapWindow/items/zmapWindowTextFeature.h:2
zmapWindow/items/zmapWindowTextItem.h:2
zmapWindow/items/zmapWindowTranscriptFeature.h:2
zmapWindow/zmapWindowDNAList.h:2
zmapWindow/zmapWindowFeatureList.h:4
zmapStyle and zmapWindow/items/* will be changed and the other files left unchanged.

Results

There was no change: further inspection reveals that this cast macro was never called for BasicFeatures, which account for the bulk of CanvasItems. It is thought that most of the calls to these dynamic cast functions are indirect and may be inside the foo canvas and GLib.

Speeding up GObject (b)

Another function, G_TYPE_CHECK_INSTANCE_TYPE, uses 5% of the total CPU, but cannot be easily removed as it is used to make choices about what code to run. There are 140 of these but given that there are 140M calls in our test data some major gains could be expected if we could remove a few of them - there are cases where this function is called when we can reasonably expect it to succeed in all cases.

Action plan

Inspect calls to these macros and identify ones that can be removed. Create new macros for these that can be switched on or off globally.

Expected gain 2-3% CPU overall, but given that plan (a) above had no effect it's probably not worth the large effort involved.

GObject parameters

A lot of functions connected with GValue and GParam appear near the top of the list, but as foo-canvas items use these mechanisms it seems unlikely that this can be improved without a major re-design. However, as we have control of the windowCanvasItem code it may be possible to make some significant gains.

Action plan

Initially do nothing. After investigating other issues review how the windowCanvasItems work.

How much improvement can be achieved?

Most of the above is tinkering with micro efficiency and looks like gaining us about 10% and is unlikely to gain more than 20% even if extended, although it may be that an iterative process will highlight new bottlenecks as the most obvious are cleared.

Given that all ZMap does is to draw boxes on a window, what is the best performance we can expect? We have data for foo-canvas performance and if we factor in an equivalent number of floating point operations then this may give us some idea of what should be achievable.

The vast majority of features are 'basic features' ie a simple rectangle and the foo canvas handles drawing the lines and fill colour. ZMap has to calculate the coordinates for each one and to estimate the work required we have:

If we add these up we get 6 multiplications and 22 additions, and 16 of the additions are arguably not necessary - the relative position of each level of the feature context is calculated for each feature.

Here's a summary of some real timings. The foo canvas timing is for an 'expose' event which may not be the whole story.
Operation                                    Time     Comment
100k x 16 FP additions                       0.013s
100k x 6 FP multiplications                  0.005s
expose 100k foo canvas items                 0.010s
Revcomp 100k features                        0.050s
Display 100k features                        7 sec
Lookup 300 item data list 50k times          0.180s   was thought to be a problem; equates to 360ms per 100k lookups
Create hash table of 50k items               0.100s   done for trembl column
Lookup 1M hash table entries in 50k table    0.050s   not affected by table size

NOTE Tests reveal that creating a hash table of 100k items fails - the code does not return for a very long time.

Action plan

Implement a test environment using x-remote and perform various experiments as described above. Review where the CPU time is going and what can be achieved.

Multiple foo-canvases

If we create one canvas per column then we avoid any need to re-calculate x-coordinates for columns that are already drawn, and if the foo canvas performance degrades significantly for large amounts of data then this could create a significant improvement. For example, if it operates at O(n log n) for real data then splitting the canvas into 16 equal sections reduces the work from n log n to n log(n/16), a gain of only about 1.3x for 100k items; a 4x improvement in speed would need the cost to grow as roughly n^1.5. However, as some columns (eg swissprot, trembl) hold the majority of the data, even this is unlikely to occur in practice.

Displaying bitmaps on the foo canvas

Currently we display individual feature items as foo canvas items, and when these overlap (eg when viewing a whole clone) much of the time is used to overlay existing features. If we could generate our own bitmap quicker than via the foo canvas and then display the bitmap, we could avoid significant foo canvas/GLib overhead. Mouse events would of course have to be translated by ZMap.

How quickly can we draw a bitmap?

Using G2 to paint 50k filled rectangles of up to 1k bp on a canvas of 150k takes...

How to find out? Add a key handler to ZMap to call a function that does that for the trembl featureset from the feature context (not the foo-canvas) and writes the bitmap to a file using G2. Also run it with no drawing to find out how long it takes to access the features and calculate coordinates. Crib some code from screenshot/print. Verify the output by viewing the file and test different formats.

Is G2 efficient? A simple stand-alone test program would demonstrate how quickly a bitmap can be generated using random rectangles. png may be an inefficient format if it includes drawing instructions for each feature.

How to determine the maximum size of a column's bitmap? Y is easy as we have seq start and end for the window. X is more complex as not all styles have a width set and features are sized according to a score, but max and min score or width have to be set and it must be possible to calculate this from the set of styles needed by a column. Removing blank space from around a column of features would also be simple if we keep a running min and max value - the bitmap can be moved and displayed partially.
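
Keeping a running min and max while the features are processed could be as simple as the sketch below. SketchExtent and its functions are hypothetical names, and x1/x2 are assumed to be the left and right edges already calculated for each feature from its style and score.

```c
/* Running horizontal extent of the features drawn in one column. */
typedef struct
{
    double min_x;
    double max_x;
    int    have_data;   /* 0 until the first feature has been added */
} SketchExtent;

SketchExtent sketch_extent_empty(void)
{
    SketchExtent extent = { 0.0, 0.0, 0 };

    return extent;
}

/* Widen the extent to include one feature spanning x1..x2 (x1 <= x2). */
void sketch_extent_add(SketchExtent *extent, double x1, double x2)
{
    if (!extent->have_data)
    {
        extent->min_x = x1;
        extent->max_x = x2;
        extent->have_data = 1;
        return;
    }

    if (x1 < extent->min_x)
        extent->min_x = x1;
    if (x2 > extent->max_x)
        extent->max_x = x2;
}

/* Width of bitmap needed once blank space either side is removed. */
double sketch_extent_width(const SketchExtent *extent)
{
    return extent->have_data ? extent->max_x - extent->min_x : 0.0;
}
```

The bitmap would then be drawn with every x coordinate offset by min_x and placed at min_x within the column.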

GDK bitmaps

Cursory inspection of gdk_draw_pixbuf() reveals that bitmaps can be drawn with hardware acceleration using mediaLib in three different chipsets.

Initial Results

Analysis of 3-Frame and RevComp

Test protocol and documentation

ZMap is run with the command line argument '--conf-file=ZMap_time' on malcolm's PC (deskpro18979) and STDOUT redirected to a file and the following config option set:

[debug]
timing=true
Data is provided by running the 'acepdf' alias.

ZMap is allowed to finish loading data, then 3-Frame is selected and, when that is complete, RevComp; ZMap is then shut down. The output file starts with a comment containing the config file name and the date. Output files are stored in ~mh17/zmap/timing/ and are named to reflect the mods made before testing, with some more detailed information recorded in ~mh17/zmap/timing/optimise.log.

Initial comments

Using a simple manually generated printout of timings for various parts of these functions it is clear that performance is dominated by zMapWindowDrawFeatureSet(), which calls ProcessFeature() for each feature, and this function has already been flagged above as inefficient. In particular it can be seen that displaying the Trembl columns (about 50k features) takes 5 seconds whether in 3-frame mode or not.

Creating columns takes 0.2sec, which is suspiciously slow, but this is insignificant relative to drawing features, which costs the equivalent of 12k floating point multiplications per feature.

Processing the window takes 2.8 seconds.

Reverse complementing the features themselves takes only 50ms.

Comments about timing methodology

Simple addition of timer functions to the code is easy but tedious, and works well for major functions. Some automated procedure that gave cumulative times for all functions would be a lot more efficient in terms of developer time, and would also provide higher quality information:

Where does the time go?

Adding further data to ProcessFeature reveals that the trembl column takes about 4 seconds to draw and then more than one second to bump, even though it is not bumped on startup. Possibly it is configured without a valid bump mode as default (eg ZMAPBUMP_UNBUMP); however this is relatively unimportant.

Experiments with commenting out code show that almost all the time is used by zMapWindowFeatureDraw(), which apart from some innocuous parent lookups calls zmapWindowFToIFactoryRunSingle(). Inside this almost all the time is spent in ((method)->method)(), and from within that almost all the time is spent in:

post_create() is the function that adds lists of foo canvas items to features, background overlay and underlay.

Limits to speed

From this it seems likely that without a major redesign we are limited to the speed of the foo canvas, which, if we stripped out most of our code or made it run in minimal time, would be about 4x as fast as the current ZMap.

Ways forwards

Simplifying window canvas items

For the vast majority of features (alignments) a simple rectangle is enough and the current creation of a canvas item group for each feature consumes a lot of memory and time. By removing this complexity we could expect to save 50%.

Using integer arithmetic

It may be possible to speed up the zmap and the foo canvas by modifying it to use integer arithmetic. Here's a comparison of floating point, long and long long on a 32 bit PC doing 60M operations:
Operation        double   long    long long
Multiplication   0.544    0.396   0.356
Division         1.240    1.732   2.056
Addition         0.536    0.356   0.184

Some of this seems anomalous, but perhaps the long long division is compiled via conversion to double? If we can avoid large amounts of division (eg by pre-calculating reciprocals) and convert to long long arithmetic then there is a chance to save 30% plus the division bonus on arithmetic operations.

However, if most of the time is spent operating the foo canvas and GLib then this would likely be ineffective.
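
Pre-calculating a reciprocal, as mentioned above, means doing one division when the scale changes and one multiplication per coordinate afterwards. The sketch below shows this for a sequence-to-pixel conversion; SketchScale and the function names are hypothetical, and it uses doubles rather than long long to keep the example short. The result can differ from a true division in the last bit for scales that are not exactly representable.

```c
/* Conversion from sequence coordinates to pixels for one window. */
typedef struct
{
    double seq_start;         /* sequence coordinate at the top of the window */
    double pixels_per_base;   /* reciprocal of bases per pixel, computed once */
} SketchScale;

/* Do the single division here, when the zoom level is set. */
SketchScale sketch_scale_make(double seq_start, double bases_per_pixel)
{
    SketchScale scale;

    scale.seq_start = seq_start;
    scale.pixels_per_base = 1.0 / bases_per_pixel;

    return scale;
}

/* Per-coordinate work: one subtraction and one multiplication, no division. */
double sketch_to_pixels(const SketchScale *scale, double coord)
{
    return (coord - scale->seq_start) * scale->pixels_per_base;
}
```

With the figures in the table above, this swaps a division for a multiplication on every coordinate, at the cost of one extra division per zoom change.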

Reducing the GObject overhead

The VTune data suggests this may be where a lot of time is spent. However, changing this would be a lot of work.