Paper on ZMap


Abstract

ZMap is a database-independent sequence display program that can be integrated into annotation systems running on Unix and Unix-like systems with the X Window System. ZMap is written in C for performance and includes many optimisations that allow it to handle large volumes of data under the control of the annotator. It is multi-threaded, allowing data to be loaded in the background while the GUI remains responsive. It currently supports acedb, GFF and DAS v1 data streams, and support for other formats can easily be added via a "plug-in" architecture (not actually completely correct currently...).

Introduction

In general, sequence annotation systems have been closely tied to the underlying database or flat-file system holding the annotation data. While this makes for easier programming, it has two major disadvantages: the systems can only be used with a limited number of data formats, and users are stuck with whatever annotation semantics the system imposes.

Some systems (e.g. Apollo) included the option of adding code to support other database formats, while others (e.g. acedb) required that users translate their data into the supported format. While converting data to the supported format or writing code to support your own data format is possible, it is time consuming and does not solve problems of data semantics. Different semantics arise for a number of reasons, including the underlying general philosophy of the researchers, differences in the organisms being studied, and historical but intractable differences.

A solution to these problems is to separate data display from data editing. While there are some differences between annotation displays (e.g. vertical vs. horizontal layout), they have largely converged on the same glyphs and basic layouts. This makes sense, as it enables users to navigate new and different systems without in-depth experience. It also creates the opportunity for a data-independent annotation viewer that can be used with many annotation systems. ZMap is an attempt to produce this kind of viewer.

Design Goals

The ZMap project was initiated with the goal of significantly improving the annotation interface available to researchers at the Sanger Institute. To do this it had to fulfil a number of goals:

Experience has shown that interpreter-based languages (Perl, Python, Java) do not provide the performance needed to display large volumes of data. The C language was chosen because of its potential for good performance and its portability. The GTK toolkit was used for the GUI, with the foocanvas used for the actual sequence display. This combination has proved robust and has provided the performance to display large numbers of features in very large scrollable windows. The following sections describe the key components of ZMap.

Threading model

Providing a responsive GUI while loading data is a problem that has been tackled in different ways. Prior to the introduction of a portable threading interface, code had to be written so that long-running functions would allow periodic updates to the GUI. The X Window server is a good example of this. The introduction of threading makes the problem easier to tackle, as the long-running function can be run in a separate thread without the need to call back to the GUI.

Certain software constraints mean that a simple model presents itself. The X Window library, while thread safe, is not multi-threaded, meaning that there is little or no gain in having more than one thread in the GUI code. The obvious model is therefore to have one "master" thread running the GUI and in effect controlling the application; "slave" threads are then used to fetch and process data for display by the GUI thread (see figure XX).
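The division of labour between the GUI thread and a data thread can be sketched with POSIX threads. This is a minimal illustration only, not ZMap's actual code: the LoadRequest type and the load_start, load_poll and load_finish functions are hypothetical names, and the "loading" is simulated by a counting loop.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical request shared between the GUI ("master") thread and one
 * data-loading ("slave") thread. None of these names come from ZMap. */
typedef struct {
    pthread_mutex_t lock;
    bool done;           /* set by the worker when loading has finished */
    size_t n_features;   /* result handed back to the GUI thread */
    size_t n_requested;  /* input: how many features to "load" */
} LoadRequest;

/* Worker thread body: stands in for a slow fetch from a data source. */
static void *load_worker(void *arg)
{
    LoadRequest *req = arg;
    size_t loaded = 0;

    for (size_t i = 0; i < req->n_requested; i++)
        loaded++;  /* a real source would fetch and parse one feature here */

    /* Publish the result under the lock so the GUI thread sees it whole. */
    pthread_mutex_lock(&req->lock);
    req->n_features = loaded;
    req->done = true;
    pthread_mutex_unlock(&req->lock);
    return NULL;
}

/* Called from the GUI thread: start the load and return immediately.
 * Returns 0 on success, as pthread_create does. */
static int load_start(LoadRequest *req, size_t n_requested, pthread_t *thread)
{
    assert(req != NULL && thread != NULL);
    pthread_mutex_init(&req->lock, NULL);
    req->done = false;
    req->n_features = 0;
    req->n_requested = n_requested;
    return pthread_create(thread, NULL, load_worker, req);
}

/* Called periodically from the GUI thread (for example from a timer or
 * idle handler): a brief check, so the GUI never waits for the load. */
static bool load_poll(LoadRequest *req, size_t *n_features)
{
    bool done;

    pthread_mutex_lock(&req->lock);
    done = req->done;
    if (done)
        *n_features = req->n_features;
    pthread_mutex_unlock(&req->lock);
    return done;
}

/* Called from the GUI thread once load_poll has reported completion. */
static void load_finish(LoadRequest *req, pthread_t thread)
{
    pthread_join(thread, NULL);
    pthread_mutex_destroy(&req->lock);
}
```

In this sketch the GUI thread would call load_start once, call load_poll from its event loop until it returns true, and then call load_finish; only the GUI thread ever touches the display.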

The separation of the threads also leads naturally to a "plugin" model for adding new data source modules. There is a single standardised interface between the GUI thread and its source threads, so new modules can be added without altering the interface code that provides the "bridge" between the threads (see the bridge pattern in design patterns). See figure XXX.
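One common way to express such a standardised interface in C is a table of function pointers that every source module fills in. The sketch below assumes this approach for illustration; the SourceFuncs table, the toy "gff" module and the source_find and source_load functions are all hypothetical and do not reproduce ZMap's real interface.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical source interface: each data source module supplies the
 * same table of functions, so the calling code never changes. */
typedef struct {
    const char *protocol;                          /* e.g. "gff" */
    int  (*open)(void *ctx, const char *url);      /* 0 on success */
    int  (*get_features)(void *ctx, size_t *n_out);
    void (*close)(void *ctx);
} SourceFuncs;

/* A toy module standing in for a GFF reader. */
typedef struct { int is_open; } GffCtx;

static int gff_open(void *ctx, const char *url)
{
    (void)url;  /* a real module would open the file or connection here */
    ((GffCtx *)ctx)->is_open = 1;
    return 0;
}

static int gff_get_features(void *ctx, size_t *n_out)
{
    if (!((GffCtx *)ctx)->is_open)
        return -1;
    *n_out = 3;  /* placeholder result; no real parsing is done */
    return 0;
}

static void gff_close(void *ctx)
{
    ((GffCtx *)ctx)->is_open = 0;
}

static const SourceFuncs gff_source = {
    "gff", gff_open, gff_get_features, gff_close
};

/* Registry of available modules; adding a format means adding an entry. */
static const SourceFuncs *const sources[] = { &gff_source };

static const SourceFuncs *source_find(const char *protocol)
{
    for (size_t i = 0; i < sizeof sources / sizeof sources[0]; i++)
        if (strcmp(sources[i]->protocol, protocol) == 0)
            return sources[i];
    return NULL;
}

/* "Bridge" code: talks to any module purely through the function table. */
static int source_load(const SourceFuncs *src, void *ctx,
                       const char *url, size_t *n_out)
{
    int rc;

    assert(src != NULL && n_out != NULL);
    if (src->open(ctx, url) != 0)
        return -1;
    rc = src->get_features(ctx, n_out);
    src->close(ctx);
    return rc;
}
```

With this layout, supporting a new format only requires writing another set of functions and registering its table; source_load itself is untouched.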

(should we add stuff about cancelling....currently it doesn't work that well so try it out first...also I think that restart doesn't work ???)

Data display

Sequence display must cope with very large coordinate systems that are then mapped to the screen. The foocanvas holds its "world view" in floating point coordinates, allowing plotting at as fine a scale as required, even for whole chromosome display. The canvas has the concept of a world view and a subsection of that world view that is displayable (the "scrollable area"). This is important both for performance and because underlying window systems usually have a maximum limit on the size of window they can display. For X windows the biggest window that can usefully be handled is 32k by 32k pixels. ZMap therefore limits the scrollable area to this size or less but allows the scrollable area as a whole to be scrolled, so that even at high zoom levels the user can quickly and easily scroll over large areas of sequence.
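The arithmetic behind the limited scrollable area can be illustrated as follows. This is a sketch under stated assumptions rather than ZMap's implementation: the Span type and scroll_region function are hypothetical, the zoom factor is taken to be pixels per base, and the "32k" limit is assumed to be 32767 pixels.

```c
#include <assert.h>

/* Assumed pixel limit for an X window dimension; the text says only "32k". */
#define MAX_WINDOW_PIXELS 32767.0

/* A range in world (sequence) coordinates. Hypothetical type. */
typedef struct { double start, end; } Span;

/* Return the part of the sequence that can be made scrollable at the given
 * zoom (pixels per base), centred on `centre` where possible and always
 * kept inside the sequence. */
static Span scroll_region(Span seq, double centre, double pixels_per_base)
{
    Span out = seq;
    double max_world;

    assert(seq.end > seq.start && pixels_per_base > 0.0);

    /* Largest world-coordinate extent that still fits in the pixel limit. */
    max_world = MAX_WINDOW_PIXELS / pixels_per_base;
    if (seq.end - seq.start <= max_world)
        return out;  /* the whole sequence fits in one window */

    out.start = centre - max_world / 2.0;
    out.end = out.start + max_world;

    /* Slide the region back inside the sequence if it overhangs an end. */
    if (out.start < seq.start) {
        out.start = seq.start;
        out.end = seq.start + max_world;
    }
    if (out.end > seq.end) {
        out.end = seq.end;
        out.start = seq.end - max_world;
    }
    return out;
}
```

For a sequence of one million bases, a zoom of 0.01 pixels per base needs only 10000 pixels, so the whole sequence is scrollable; at 1 pixel per base only a window of 32767 bases around the centre can be, and the region has to be moved as the user scrolls.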

A primary design goal was to allow users to rapidly scroll around and zoom in and out of the sequence being annotated. In addition, as in any good text editor, they can split the view of the sequence an arbitrary number of times both horizontally and vertically, allowing, for instance, the simultaneous viewing of different sections of a long transcript.

The foocanvas, like other canvas packages, allows the addition of custom-written "canvas items". This is particularly important for a genome annotation program, as there are current and emerging standards for the shapes used to represent different kinds of features. ZMap supports a number of these (see Fig xx) and it is easy to add more. They range from very simple glyphs to much more complex items that display a complete transcript with CDS, UTR and other parts marked up.

ZMap has the concept of a 'mark' region, which by default is the entire displayed sequence but can easily be set to encompass just the area of the window or any feature within the window. Many operations are limited to the mark region, which has two benefits:

Efficiency in data display

Displaying all features for a large sequence can be too slow even with optimised, compiled code. Fortunately, it is possible to optimise display both by differential loading and by differential display of features.

Each feature set has a minimum and maximum zoom factor controlling the range of magnifications at which the features are displayed. This allows the user to specify that very numerous features such as homologies are only displayed at higher magnifications.

Features can be selectively loaded under the control of the user and sys... Loading all features for a large sequence can consume large amounts of memory, and annotators usually do not need to see all features for the entire range of the sequence they are annotating; they instead concentrate on sub-areas of interest.

ZMap and Otterlace

The first annotation system that ZMap has been used with is the Otterlace system, a Perl/Tk-based application with associated pipelines that is used for vertebrate annotation at the Sanger Institute and elsewhere.

Conclusions

need acedb, otterlace, gmod, apollo refs