Here's a summary of the big picture from loading ZMap, toggling 3-Frame and then clicking on RevComp.
All other modules are less than 0.7%. ZMap used 97% of the CPU.
Module | CPU % |
---|---|
libGObject | 24.6 |
libGLib | 18.9 |
libc | 15.6 |
zmap | 12.8 |
libgdk | 11.2 |
libpthread | 11.0 |
libX11 | 2.7 |
vmlinux | 1.4 |
Within the ZMap module, grouping the data by source file reveals the following (all others less than 2.0%):
Source file | CPU % |
---|---|
foo_canvas.c | 32.9 |
foo_canvas_rect-ellipse.c | 15.9 |
zmapWindowCanvasItem.c | 13.0 |
crtn.S | 5.7 |
zmapStyle.c | 3.1 |
zmapFeatureUtils.c | 2.8 |
zmapWindowFeature.c | 2.6 |
zmapWindowItemFactory | 2.3 |
zmapFeature.c | 2.0 |
Grouping by function reveals that:
Other than loading feature data, we regard all startup and configuration code as acceptably fast, and we can afford to perform extensive validation of data as necessary. However, recent changes to the startup behaviour of ZMap/otterlace may require a review of this: if we operate a separate pipe server for each featureset then an inefficient way of reading this data may become an issue.
Have we selected the best compiler optimisations?
Styles are GObjects and are read in from a file or a database such as ACEDB. Style data is currently not accessible outside of the module other than by function call, and this was deemed appropriate to ensure data integrity. Structure members are set via a GObject->set() function call, which is inevitably quite slow.
However, accessing styles takes up 2%+ of the CPU, and this can be reduced to a small fraction of 1% by allowing direct access to the style data structure (access is currently prevented by having the style structure defined in a private header).
It is suggested that the implementation is changed as follows:
Expected gain 2% of zmap CPU, about 0.4% overall
Little difference to the overall time used, but VTune reports a change of ~2% of ZMap CPU for StyleIsPropertySetID().
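The direct-access idea above can be sketched as follows. This is a minimal illustration only: the struct layout, the `StyleProp` ids and the `style_is_prop_set()` / `style_set_width()` helpers are hypothetical names and are not ZMap's actual definitions. It assumes the style structure is visible in a public header and records which properties have been set in a bitmask.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical property ids; a real style has many more. */
typedef enum {
    STYLE_PROP_WIDTH,
    STYLE_PROP_MIN_SCORE,
    STYLE_PROP_MAX_SCORE
} StyleProp;

/* Hypothetical style structure, assumed to be visible to callers. */
typedef struct {
    uint32_t is_set;     /* bit n is 1 when property n has been set */
    double   width;
    double   min_score;
    double   max_score;
} Style;

/* Testing a property is a shift and a mask: no property lookup,
 * no type check and (when inlined) no function call. */
static inline bool style_is_prop_set(const Style *style, StyleProp prop)
{
    return ((style->is_set >> prop) & 1u) != 0;
}

/* Setting a field also records that it has been set. */
static inline void style_set_width(Style *style, double width)
{
    style->width = width;
    style->is_set |= 1u << STYLE_PROP_WIDTH;
}
```

The trade-off is the one noted above: callers can now read (and corrupt) the structure directly, so data integrity relies on convention rather than on the accessor functions.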
To find a single basic feature's style (the majority of features) the window items class factory calls zmapWindowContainerFeatureSetStyleFromID(), and to set the colours a separate call to (class)->get_style() is made. Glyphs go through a similar process:
    style = (ZMAP_CANVAS_ITEM_GET_CLASS(basic)->get_style)(basic);
which translates as:
    style = zmap_window_canvas_item_get_style(basic);
This does not appear in VTune as it's static, but it calls some globals:
Function | CPU % | Calls | CPU % per 1k calls |
---|---|---|---|
zMapWindowCanvasItemIntervalGetTopLevelObject | 0.1 | 100k | 0.001 |
zmapWindowContainerCanvasItemGetContainer | 0.55 | 600k | 0.0009 |
zmapWindowContainerFeatureSetStyleFromID | 0.4 | 500k | 0.0008 |
So for each basic feature we expect to use 0.0027 + 0.0008 = 0.0035% CPU per 1000 features just to look up the style (0.0027 is the sum of the three calls above, and 0.0008 is the factory's separate call to zmapWindowContainerFeatureSetStyleFromID()). The situation may be worse: zmapWindowContainerFeatureSetStyleFromID calls a GObject type check function and then another function which calls g_hash_table_lookup. Each of these is implicated in 25% of the CPU of its respective module, and both modules use significantly more CPU than ZMap. This is significantly more than is required to read the style data once we have the struct, even using function calls.
The server model used by ZMap is such that display styles must be present in the server so that it can filter out data that has no display style. In the case of ACEDB, styles are traditionally derived from the database, and for pipe servers (and optionally for ACEDB) styles are passed to the server in a file. All servers return styles in a data structure which is then merged with existing styles.
There are also some hard coded styles that are provided by ZMap.
Features when read in by the server are given a style id which is later used to look up the style in a small hash table owned by the column the feature is to be displayed in. The whole feature context is passed over to ZMap and merged into the existing one.
By combining the styles data with the feature context from each server it would be possible to include a pointer to a feature's style in the feature itself, giving instant lookup. This has some implications:
Expected gain 1.4% of zmap CPU, plus some contribution from GLib and GObject, about 0.5-1.0% overall
The featureset CanvasGroup now holds a copy of its style and each feature has a pointer to this. In ProcessFeature() the function calls to lookup styles have been removed. The column group still has copies of all the styles needed - any changing parameters such as current bump mode are stored in these not the private featureset copies.
The column group objects need to be given pointers to the featureset styles instead of making copies of all the styles needed, so that all the code accesses the same instance of each style.
Sub-feature types are still processed by style lookup via the column group and should be implemented as pointers: these extra styles would be accessible only through their parent via each feature's style pointer.
See below for performance measurements.
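The arrangement described above, where the featureset holds a copy of its style and each feature points at it, can be sketched as follows. The `Style`, `Feature` and `FeatureSet` types and `featureset_attach_style()` are hypothetical stand-ins for illustration, not the real ZMap structures.

```c
#include <stddef.h>

/* Hypothetical, much reduced style. */
typedef struct {
    double        width;
    unsigned long fill_colour;
} Style;

typedef struct {
    int          start, end;
    unsigned int style_id;   /* id assigned when the feature was read in */
    const Style *style;      /* resolved once, then used directly */
} Feature;

typedef struct {
    Style    style;          /* the featureset's own copy of its style */
    Feature *features;
    size_t   n_features;
} FeatureSet;

/* Copy the style into the featureset and point every feature at that copy.
 * After this, drawing code reads feature->style with no lookup by id. */
static void featureset_attach_style(FeatureSet *set, const Style *style)
{
    set->style = *style;
    for (size_t i = 0; i < set->n_features; i++)
        set->features[i].style = &set->style;
}
```

Because every feature shares one pointer target, a change made to the featureset's copy is seen by all of its features, which is why per-column state such as bump mode has to live elsewhere.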
This function decides which strand a feature belongs on which involves looking up the style in a window-global GData list, and attaching the style directly to the feature will save us another 1.7% on ZMap CPU.
Removed this function's style lookup after restructuring the data.
Expected gain 1.7% of zmap CPU, 0.2% overall
Apparently little change: is this % at the level of noise?
Arguably the Assert calls used in ZMap perform a valid function during development but when the code functions correctly they should never be called and they are a waste of CPU.
The function zMapFeatureIsValid() is only called from Assert (38 times) and uses 1.3% of the zmap CPU. There are many other calls to Assert (817 in total) and if we pro-rate this at 10% per call this implies a much greater saving of 15% of the zmap CPU. This seems quite high, and most other calls are probably less frequent.
These are already coded as macros and can be adjusted to be included only in development versions of code. During development they are used to catch programming errors and are only valid where there is a logical error in the code that has broken an assumption about the data. They should not be used to detect errors in external data (from users, other programs or other modules). During testing we hope to find all these logical errors but on occasion we have reports from users.
It would be advisable to create a test environment that can exercise ZMap functions and be run automatically before releasing any build - this would give greater confidence, and it should also be noted that Asserts do not prevent problems from occurring.
Implement a debug/ production build option to control how Asserts are compiled.
Extend the x-remote or other test program to automatically exercise most of the ZMap code. Note that here we are not testing for correct function but only that ZMap does not abort - the test can be done without user interaction.
Expected gain 1.7% of zmap CPU, 0.2% overall, plus a few % more
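A debug/production switch for Asserts could look something like the sketch below. `ZMAP_DEBUG_BUILD` and `SKETCH_ASSERT` are hypothetical names chosen for illustration (ZMap's own Assert macro and build flags may differ), and `feature_is_valid()` is only a stand-in for an expensive check such as zMapFeatureIsValid().

```c
#include <stdio.h>
#include <stdlib.h>

/* ZMAP_DEBUG_BUILD is a hypothetical flag, e.g. passed as -DZMAP_DEBUG_BUILD. */
#ifdef ZMAP_DEBUG_BUILD
#define SKETCH_ASSERT(expr)                                        \
    do {                                                           \
        if (!(expr)) {                                             \
            fprintf(stderr, "Assert failed: %s (%s:%d)\n",         \
                    #expr, __FILE__, __LINE__);                    \
            abort();                                               \
        }                                                          \
    } while (0)
#else
/* Production build: the expression is never evaluated. */
#define SKETCH_ASSERT(expr) ((void)0)
#endif

/* Counts how often the expensive check actually runs. */
int validity_checks_run = 0;

/* Stand-in for an expensive validity check. */
int feature_is_valid(const int *start_end)
{
    validity_checks_run++;
    return start_end != NULL && start_end[0] <= start_end[1];
}

/* Length of a feature given as {start, end}, inclusive. */
int feature_length(const int *start_end)
{
    SKETCH_ASSERT(feature_is_valid(start_end));
    return start_end[1] - start_end[0] + 1;
}
```

Since the expression disappears entirely in a production build, it must not have side effects that the rest of the code depends on.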
Processing these (just the function g_datalist_id_get_data()) accounts for 14% of 17% of the total or approximately 3% CPU overall.
They are used only for styles and feature contexts - lists of featuresets. Given that we can easily have 300+ styles these would be better coded as a GHashTable.
It appears that this function is only called from zMapFindStyle() which could be removed from most of the code if we did as above. Note that this function is called from processFeature(), (once directly and once via zmapWindowFeatureStrand()) which is called to display every feature, and has to search the window-global list of ~300 styles for each feature.
Remove the style GData list structures and replace them with small hashes, and integrate styles into the feature context.
Expected gain 3% CPU overall, plus a few % more
GData has been removed from styles and now is only used for feature sets.
Significant changes in CPU use can be observed:
Function | Before CPU % | After CPU % |
---|---|---|
g_hash_table_lookup | 22.4 | 27.1 |
g_datalist_id_get_data | 14.7 | 2.7 |
g_datalist_set_data() remains at 6% (from 7%) - this is used for 'multiline-features' in the GFF parser, and while we would expect this only to apply to a small fraction of the features it is identified as having 14M calls. It may be called for every feature, in which case replacing this last instance with a hash table may be worthwhile.
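To illustrate why a hash keyed on the style id beats searching a list of ~300 styles for every feature, here is a minimal self-contained sketch of an open-addressing table keyed by a non-zero integer id. It is a stand-in only: in ZMap the same role would be played by a GHashTable, and `StyleTable`, `style_table_insert()` and `style_table_lookup()` are hypothetical names.

```c
#include <stddef.h>

#define TABLE_SIZE 1024u   /* power of two, comfortably above ~300 styles */

typedef struct {
    unsigned int id;       /* 0 marks an empty slot */
    const void  *style;
} Slot;

/* Must be zero-initialised before use. */
typedef struct {
    Slot slots[TABLE_SIZE];
} StyleTable;

/* Multiplicative hash reduced to a slot index. */
static unsigned int slot_for(unsigned int id)
{
    return (id * 2654435761u) & (TABLE_SIZE - 1u);
}

/* Insert or overwrite. Returns 0 on success, -1 if id is 0 or the table is full. */
int style_table_insert(StyleTable *t, unsigned int id, const void *style)
{
    if (id == 0) return -1;
    unsigned int i = slot_for(id);
    for (unsigned int probes = 0; probes < TABLE_SIZE; probes++) {
        if (t->slots[i].id == 0 || t->slots[i].id == id) {
            t->slots[i].id = id;
            t->slots[i].style = style;
            return 0;
        }
        i = (i + 1u) & (TABLE_SIZE - 1u);   /* linear probing */
    }
    return -1;
}

/* Returns the style for id, or NULL if absent. Typically one or two probes,
 * independent of how many styles are stored. */
const void *style_table_lookup(const StyleTable *t, unsigned int id)
{
    if (id == 0) return NULL;
    unsigned int i = slot_for(id);
    for (unsigned int probes = 0; probes < TABLE_SIZE; probes++) {
        if (t->slots[i].id == id) return t->slots[i].style;
        if (t->slots[i].id == 0) return NULL;
        i = (i + 1u) & (TABLE_SIZE - 1u);
    }
    return NULL;
}
```

A list search costs on average about 150 comparisons per lookup for 300 styles, whereas the table above normally needs only a handful of probes at most.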
GObject takes up 25% of the total CPU and this is dominated by casts and type checking. We can gain 14% of 25% by replacing G_TYPE_CHECK_INSTANCE_CAST with a simple cast, although it might be good to have the option to switch this back on for development.
Implement a global header or build option to allow these macros to be changed easily. See the separate notes on how to operate the build system.
This option is controlled by:
#if GOBJ_CAST
Expected gain 4% CPU overall
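A switch of this kind can be sketched as below. Only `GOBJ_CAST` comes from the text above; `Instance`, `BasicFeature`, `checked_cast()` and the `BASIC_FEATURE()` macro are hypothetical stand-ins so that the example is self-contained, with `checked_cast()` playing the role of the run-time check behind G_TYPE_CHECK_INSTANCE_CAST.

```c
#include <stddef.h>

#ifndef GOBJ_CAST
#define GOBJ_CAST 0   /* 1: checked casts (development), 0: plain casts */
#endif

/* Minimal stand-in for a typed instance. */
typedef struct { int type_id; } Instance;
typedef struct { Instance base; double x1, x2; } BasicFeature;
enum { TYPE_BASIC_FEATURE = 1 };

/* Counts how many run-time checks were performed. */
int checked_casts_run = 0;

/* Stand-in for a checked cast: returns NULL on a type mismatch. */
void *checked_cast(void *obj, int type_id)
{
    checked_casts_run++;
    return (obj != NULL && ((Instance *)obj)->type_id == type_id) ? obj : NULL;
}

#if GOBJ_CAST
#define BASIC_FEATURE(obj) ((BasicFeature *)checked_cast((obj), TYPE_BASIC_FEATURE))
#else
#define BASIC_FEATURE(obj) ((BasicFeature *)(obj))
#endif
```

With `GOBJ_CAST` set to 0 the macro is a plain C cast and costs nothing at run time; building with `-DGOBJ_CAST=1` restores the checks for development.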
include/ZMap/zmapBase.h:2
include/ZMap/zmapGUITreeView.h:2
include/ZMap/zmapStyle.h:2
libcurlobject/libcurlobject.h:2
libpfetch/libpfetch.h:6
zmapWindow/items/zmapWindowAlignmentFeature.h:2
zmapWindow/items/zmapWindowAssemblyFeature.h:2
zmapWindow/items/zmapWindowBasicFeature.h:2
zmapWindow/items/zmapWindowCanvasItem.h:2
zmapWindow/items/zmapWindowContainerAlignment.h:2
zmapWindow/items/zmapWindowContainerBlock.h:2
zmapWindow/items/zmapWindowContainerChildren.h:8
zmapWindow/items/zmapWindowContainerContext.h:2
zmapWindow/items/zmapWindowContainerFeatureSet.h:2
zmapWindow/items/zmapWindowContainerGroup.h:2
zmapWindow/items/zmapWindowContainerStrand.h:2
zmapWindow/items/zmapWindowGlyphItem.h:2
zmapWindow/items/zmapWindowLongItem.h:2
zmapWindow/items/zmapWindowSequenceFeature.h:2
zmapWindow/items/zmapWindowTextFeature.h:2
zmapWindow/items/zmapWindowTextItem.h:2
zmapWindow/items/zmapWindowTranscriptFeature.h:2
zmapWindow/zmapWindowDNAList.h:2
zmapWindow/zmapWindowFeatureList.h:4
zmapStyle and zmapWindow/items/* will be changed and the other files left unchanged.
There was no change: further inspection reveals that this cast macro was never called for basic features, which account for the bulk of CanvasItems. It is thought that most of the calls to these dynamic cast functions are indirect and may be inside the foo canvas and GLib.
Another function, G_TYPE_CHECK_INSTANCE_TYPE, uses 5% of the total CPU, but cannot be easily removed as it is used to make choices about what code to run. There are 140 of these, but given that there are 140M calls in our test data some major gains could be expected if we could remove a few of them - there are cases where this function is called when we can reasonably expect it to succeed in all cases.
Inspect calls to these macros and identify ones that can be removed. Create new macros for these that can be switched on or off globally
Expected gain 2-3% CPU overall, but given that plan (a) above had no effect it's probably not worth the large effort involved.
A lot of functions connected with GValue and GParam appear near the top of the list, but as foo-canvas items use these mechanisms it seems unlikely that this can be improved without a major re-design. However, as we have control of the windowCanvasItem code it may be possible to make some significant gains.
Initially do nothing. After investigating other issues review how the windowCanvasItems work.
Most of the above is tinkering with micro efficiency and looks like gaining us about 10% and is unlikely to gain more than 20% even if extended, although it may be that an iterative process will highlight new bottlenecks as the most obvious are cleared.
Given that all ZMap does is to draw boxes on a window, what is the best performance we can expect? We have data for foo-canvas performance and if we factor in an equivalent number of floating point operations then this may give us some idea of what should be achievable.
The vast majority of features are 'basic features' ie a simple rectangle and the foo canvas handles drawing the lines and fill colour. ZMap has to calculate the coordinates for each one and to estimate the work required we have:
Here's a summary of some real timings. The foo canvas timing is for an 'expose' event which may not be the whole story.
Operation | Time | Comment |
---|---|---|
100k x 16 FP additions | 0.013s | |
100k x 6 FP multiplications | 0.005s | |
expose 100k foo canvas items | 0.010s | |
Revcomp 100k features | 0.050s | |
Display 100k features | 7 sec | |
Lookup 300 item data list 50k times | 0.180s | Was thought to be a problem; equates to 360ms per 100k lookups (3.6 microseconds each) |
Create hash table of 50k items | 0.100s | Done for trembl column |
Lookup 1M hash table entries in 50k table | 0.050s | not affected by table size |
NOTE Tests reveal that creating a hash table of 100k items fails - the code does not return for a very long time.
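Timings like the arithmetic rows in the table above can be gathered with a very small harness. The sketch below is an assumption about how such a measurement might be made, not the code actually used for the table; `time_fp_additions()` is a hypothetical name and it measures CPU time via the standard `clock()` call.

```c
#include <time.h>

/* Perform `items` x 16 floating point additions and return the elapsed
 * CPU time in seconds. The accumulated sum is written to *sink, and the
 * accumulator is volatile, so the compiler cannot optimise the loop away. */
double time_fp_additions(long items, double *sink)
{
    volatile double acc = 0.0;
    clock_t start = clock();

    for (long i = 0; i < items; i++)
        for (int j = 0; j < 16; j++)
            acc += 1.0;

    clock_t end = clock();
    *sink = acc;
    return (double)(end - start) / CLOCKS_PER_SEC;
}
```

Calling `time_fp_additions(100000L, &sum)` corresponds to the '100k x 16 FP additions' row. The resolution of `clock()` is coarse on some systems, so very short runs are best repeated and averaged.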
Implement a test environment using x-remote and perform various experiments as described above. Review where the CPU time is going and what can be achieved.
If we create one canvas per column then we avoid any need to re-calculate x-coordinates for columns that are already drawn, and if the foo canvas performance degrades significantly for large amounts of data then this could create a significant improvement. For example, if it operates at O(n log n) for real data then splitting the canvas into 16 equal sections removes log2(16) = 4 from the log factor, which for 100k items (log2 n of about 17) is a gain of roughly 1.3x; a 4x improvement would need behaviour closer to O(n^1.5). However, as some columns (eg swissprot, trembl) hold the majority of the data, this is unlikely to occur in practice.
Currently we display individual feature items as foo canvas items and when these overlap (eg when viewing a whole clone) then much of the time is used to overlay existing features. If we could generate our own bitmap quicker than via the foo canvas and then display the bitmap then we could avoid significant foo canvas/ glib overhead. Mouse events would of course have to be translated by ZMap.
Using G2 to paint 50k filled rectangles of up to 1k bp on a canvas of 150k takes...
How to find out? Add a key handler to ZMap to call a function that does that for the trembl featureset from the feature context (not the foo-canvas) and writes the bitmap to a file using G2. Also run it with no drawing to find out how long it takes to access the features and calculate coordinates. Crib some code from screenshot/print. Verify the output by viewing the file and test different formats.
Is G2 efficient? A simple stand-alone test program would demonstrate how quickly a bitmap can be generated using random rectangles. png may be an inefficient format if it includes drawing instructions for each feature.
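As a baseline for such a stand-alone test, the sketch below paints filled rectangles straight into an in-memory bitmap. It deliberately does not use G2 or the foo canvas: it is a hypothetical lower bound on the cost of filling pixels, and the `Bitmap` type and the `bitmap_*` functions are illustrative names only.

```c
#include <stdlib.h>
#include <string.h>

/* One byte per pixel, rows stored consecutively. */
typedef struct {
    int width, height;
    unsigned char *pixels;
} Bitmap;

/* Allocate a zeroed bitmap. Returns 0 on success, -1 on allocation failure. */
int bitmap_init(Bitmap *bm, int width, int height)
{
    bm->width = width;
    bm->height = height;
    bm->pixels = calloc((size_t)width * (size_t)height, 1);
    return bm->pixels ? 0 : -1;
}

void bitmap_free(Bitmap *bm)
{
    free(bm->pixels);
    bm->pixels = NULL;
}

/* Fill the rectangle [x1,x2] x [y1,y2] (inclusive), clipped to the bitmap. */
void bitmap_fill_rect(Bitmap *bm, int x1, int y1, int x2, int y2,
                      unsigned char colour)
{
    if (x1 < 0) x1 = 0;
    if (y1 < 0) y1 = 0;
    if (x2 >= bm->width)  x2 = bm->width - 1;
    if (y2 >= bm->height) y2 = bm->height - 1;
    if (x1 > x2 || y1 > y2) return;

    for (int y = y1; y <= y2; y++)
        memset(bm->pixels + (size_t)y * (size_t)bm->width + (size_t)x1,
               colour, (size_t)(x2 - x1 + 1));
}

/* Paint n pseudo-random rectangles, 8 pixels wide and up to max_len tall.
 * The same seed always gives the same picture. */
void bitmap_paint_random(Bitmap *bm, int n, int max_len, unsigned int seed)
{
    srand(seed);
    for (int i = 0; i < n; i++) {
        int x = rand() % bm->width;
        int y = rand() % bm->height;
        int len = 1 + rand() % max_len;
        bitmap_fill_rect(bm, x, y, x + 7, y + len - 1,
                         (unsigned char)(1 + i % 255));
    }
}
```

Wrapping a call such as `bitmap_paint_random(&bm, 50000, 1000, 1)` in `clock()` calls gives a figure to compare against the same job done through G2 or the foo canvas.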
How to determine the maximum size of a column's bitmap? Y is easy as we have seq start and end for the window. X is more complex as not all styles have a width set and features are sized according to a score, but max and min score or width have to be set and it must be possible to calculate this from the set of styles needed by a column. Removing blank space from around a column of features would also be simple if we keep a running min and max value - the bitmap can be moved and displayed partially.
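The sizing approach just described can be sketched as follows, under the assumption that each style used by a column can report a maximum display width. `Extent`, `column_max_width()`, `extent_init()` and `extent_add()` are hypothetical names used for illustration.

```c
#include <stddef.h>

/* Horizontal span actually used by the features drawn so far. */
typedef struct {
    double x1, x2;
} Extent;

/* The widest style needed by a column gives the bitmap width to allocate. */
double column_max_width(const double *style_widths, size_t n_styles)
{
    double max = 0.0;
    for (size_t i = 0; i < n_styles; i++)
        if (style_widths[i] > max)
            max = style_widths[i];
    return max;
}

/* Start with an 'empty' extent (x1 > x2) so the first feature sets both ends. */
void extent_init(Extent *e, double column_width)
{
    e->x1 = column_width;
    e->x2 = 0.0;
}

/* Grow the running min and max as each feature is drawn. */
void extent_add(Extent *e, double x1, double x2)
{
    if (x1 < e->x1) e->x1 = x1;
    if (x2 > e->x2) e->x2 = x2;
}
```

Once all features are drawn, only the strip between `x1` and `x2` needs to be displayed, which removes the blank space either side of the features.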
Cursory inspection of gdk_draw_pixbuf() reveals that bitmaps can be drawn with hardware acceleration using mediaLib in three different chipsets.
ZMap is run with the command line argument '--conf-file=ZMap_time' on malcolm's PC (deskpro18979) and STDOUT redirected to a file and the following config option set:
    [debug]
    timing=true
Data is provided by running the 'acepdf' alias.
ZMap is allowed to finish loading data, then 3-Frame is selected and, when that is complete, RevComp, and then ZMap is shut down. The output file starts with a comment containing the config file name and the date. Output files are stored in ~mh17/zmap/timing/ and are named to reflect the modifications made before testing, with some more detailed information recorded in ~mh17/zmap/timing/optimise.log.
Using a simple manually generated printout of timings for various parts of these functions it is clear that performance is dominated by zMapWindowDrawFeatureSet(), which calls ProcessFeature() for each feature, and this function has already been flagged above as inefficient. In particular it can be seen that displaying the Trembl columns (about 50k features) takes 5 seconds whether in 3-frame mode or not.
Creating columns takes 0.2 sec, which is suspiciously slow, but this is insignificant relative to drawing features, which equates to 12k floating point multiplications per feature.
Processing the window takes 2.8 seconds.
Reverse complementing the features themselves takes only 50ms.
Simple addition of timer functions to the code is easy but tedious and works well for major functions. Some automated procedure that gave cumulative times for all functions would be a lot more efficient in terms of developer time, and would also provide higher quality information:
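For reference, the manual approach amounts to something like the following sketch, which accumulates a total time and call count per instrumented function. The `Timer` type and the `TIMER_START` / `TIMER_STOP` macros are hypothetical and are not the timing functions ZMap actually uses; `process_feature()` is a trivial stand-in for the real per-feature work.

```c
#include <time.h>

/* Cumulative CPU time and call count for one instrumented function. */
typedef struct {
    const char *name;
    clock_t     total;
    long        calls;
} Timer;

/* Bracket the code to be measured with these two macros. */
#define TIMER_START()      clock_t timer_t0_ = clock()
#define TIMER_STOP(timer)  do { (timer).total += clock() - timer_t0_; \
                                (timer).calls++; } while (0)

Timer process_feature_timer = { "process_feature", 0, 0 };

/* Stand-in for a per-feature function: returns the feature length. */
double process_feature(double start, double end)
{
    TIMER_START();
    double length = end - start + 1.0;
    TIMER_STOP(process_feature_timer);
    return length;
}

/* Total time spent inside the instrumented region, in seconds. */
double timer_seconds(const Timer *t)
{
    return (double)t->total / CLOCKS_PER_SEC;
}
```

Every function of interest needs its own timer and its own pair of macro calls, and the calls to `clock()` add overhead of their own, which is why this only works well for major functions.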
Adding further data to ProcessFeature reveals that the trembl column takes about 4 seconds to draw and then more than one second to bump, even though it is not bumped on startup. Possibly it is configured without a valid bump mode as default (eg ZMAPBUMP_UNBUMP); however this is relatively unimportant.
Experiments with commenting out code show that almost all the time is used by zMapWindowFeatureDraw(), which apart from some innocuous parent lookups calls zmapWindowFToIFactoryRunSingle(). Inside this almost all the time is spent in ((method)->method)(), and from within that almost all the time is spent in:
post_create() is the function that adds lists of foo canvas items to features, background overlay and underlay.
From this it seems likely that without a major redesign we are limited to the speed of the foo canvas, which if we stripped out most of our code or made it run in minimal time would be about 4x as fast as the current ZMap.
For the vast majority of features (alignments) a simple rectangle is enough and the current creation of a canvas item group for each feature consumes a lot of memory and time. By removing this complexity we could expect to save 50%.
It may be possible to speed up ZMap and the foo canvas by modifying them to use integer arithmetic. Here's a comparison of floating point, long and long long on a 32 bit PC doing 60M operations:
Operation | double | long long | long |
---|---|---|---|
Multiplication | 0.544 | 0.396 | 0.356 |
Division | 1.240 | 1.732 | 2.056 |
Addition | 0.536 | 0.356 | 0.184 |
Some of this seems anomalous, but perhaps the long long division is compiled via conversion to double?? If we can avoid large amounts of division (eg by pre-calculating reciprocals) and convert to long long arithmetic then there is a chance to save 30% plus the division bonus on arithmetic operations.
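The 'pre-calculated reciprocal' idea can be sketched with fixed-point integer arithmetic. This is a hypothetical illustration, not existing ZMap or foo canvas code: `make_scale()` and `base_to_pixel()` are invented names, the scale factor carries 16 fractional bits, and non-negative inputs are assumed.

```c
#include <stdint.h>

#define FIXED_SHIFT 16   /* number of fractional bits in the scale factor */

/* Compute pixels-per-base once per zoom level as a fixed-point factor.
 * This is the only division; `bases` must be greater than zero. */
int64_t make_scale(int64_t pixels, int64_t bases)
{
    return (pixels << FIXED_SHIFT) / bases;
}

/* Convert a base offset to a pixel offset with one multiplication and
 * one shift. The result is truncated towards zero for non-negative input. */
int64_t base_to_pixel(int64_t base_offset, int64_t scale)
{
    return (base_offset * scale) >> FIXED_SHIFT;
}
```

For example, with 1024 pixels covering 4096 bases the scale is 16384, and base offset 2048 maps to pixel 512. When the ratio is not exactly representable the truncated scale factor introduces a small error that grows with the offset, so the number of fractional bits would need to be chosen to suit the sequence lengths involved.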
However, if most of the time is spent operating the foo canvas and GLib then this would likely be ineffective.
The VTune data suggests this may be where a lot of time is spent. However, changing this would be a lot of work.