Coordinate systems and configuration

Coordinate systems for generic databases

What's the problem?

It is common (especially in model organism databases) to use chromosome coordinates to refer to sequence information, but many databases are based on fragments, eg BACs (clones), and these typically use coordinates based from 1. ZMap implements a mapping process to define which coordinates are presented to the user and which coordinates are used when referring to the underlying database.

There is also a lot of code based on the idea of several blocks of sequence data and possibly several aligns (which contain the blocks) sourced from different organisms. Because this code was not completed there are a few incompatibilities in the code, and some code assumes that only a single align and block exist.

This introduces complexity with reverse complement and also with simply displaying the data on the foo canvas, and it is in these two areas that assumptions have been made about there being only one block. Note also that drawing the scale bar assumes a single contiguous sequence, which is also incompatible with multiple blocks.

A Strategy

Attempts to move ZMap from 1-based coordinates to chromosome coordinates have proved quite difficult. To achieve a working system in a reasonable time it is proposed to re-implement and debug a single align and block system, with data structures that may be compatible with multiple blocks, and to abandon the idea of multiple aligns and blocks for the present.

Some of the problems to solve are:

Coordinate display

Typical chromosome coordinates are quite large and are not easy for annotators to use. ZMap will display coordinates on the Scale bar and on the Ruler as block relative - the first base in any block will be 1 and the last will be the length of the block. The Ruler will also display coordinates in the parent span if these differ from block relative. The Status bar will show coordinates as block relative (which will be consistent with the scale bar). Currently there is no way to display parent (eg chromosome) coordinates in the status bar.

For reverse complement ZMap currently inverts the coordinates relative to the block end and displays coordinates as negative. For a sequence that is not 1-based this implies a possibly large shift in coordinates, and if multiple blocks existed collisions could occur. ZMap will present a single block in this format and chromosome coordinates will always be displayed as forward strand, as at present we do not know the chromosome size.

Feature display and sequence coordinates - implementation

ZMap stores sequence and feature data in a FeatureContext which contains a single Align which contains a single contiguous Block of features. There is provision to have multiple aligns and blocks and in this case the blocks of features could be mapped from any genomic region, for example to model genome rearrangements.

Note that at present each feature item on the foo canvas is drawn explicitly at sequence coordinates and the x-coordinate is set explicitly for each one; these are not block relative or column relative. Foo canvas groups, although they are containers and have an extent, may overlap, and the items displayed may appear anywhere on the canvas.

Multiple aligns and blocks - some historical notes

This was a ZMap feature that was started but not finished.

A single align corresponds to data from a single organism and is displayed with blocks containing columns of features arranged vertically. The first align is known as the master align and its blocks are displayed in order as specified by configuration (when implemented). Any further aligns are displayed to the right and their blocks are positioned as specified in configuration (when implemented). These blocks map to the master align, but as the sequences will include re-arrangements and indels it will not be possible to map the whole block accurately. At present it is thought that multiple aligns are too cumbersome for practical use, as very large windows (and monitors) will be needed, and remarks here about configuration and use are speculative. For the moment we restrict ZMap to handle a single block in a single align.

NOTE: Requests for data from servers can be made for single blocks only.

Much of the top level interface to ZMap specifies a sequence as start to end, which implies a single align and block, and at present multiple blocks and aligns are not supported.

Overall structure - a summary

Reverse Complement

ZMap implements reverse complement by reversing the entire feature context. Features are RevComp'd relative to their containing block, which means that each feature's start and end coordinates are swapped and the coordinates are reflected in the block's end coordinate.
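As a minimal sketch of the swap-and-reflect step described above, the following uses hypothetical names (Span, reflect_in_block, revcomp_feature) that are not taken from the ZMap source. It assumes the reflection keeps the block's own extent fixed (coord becomes x1 + x2 - coord); for a 1-based block this is the same as reflecting in the end coordinate (end + 1 - coord).

```c
#include <assert.h>

/* Illustrative type only; ZMap's own structs carry many more fields. */
typedef struct { int x1, x2; } Span;

/* Reflect one coordinate within the block so that the block keeps its
 * extent. For a 1-based block (x1 == 1) this is end + 1 - coord.
 * ASSUMPTION: this is one plausible reading of "reflected in the
 * block's end coordinate". */
int reflect_in_block(Span block, int coord)
{
  assert(block.x1 <= coord && coord <= block.x2);
  return block.x1 + block.x2 - coord;
}

/* Reverse complement a feature's span: reflect both ends and swap
 * them so that x1 <= x2 still holds afterwards. */
Span revcomp_feature(Span block, Span feature)
{
  Span out;

  out.x1 = reflect_in_block(block, feature.x2);
  out.x2 = reflect_in_block(block, feature.x1);
  return out;
}
```

Applying revcomp_feature() twice with the same block returns the original span, which is a useful sanity check for any real implementation.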

If a block is RevComp'd then coordinates are displayed as negative block relative values and the ordering of features is reversed on the display. This interacts with the block_to_sequence.reversed flag - if a reversed block is RevComp'd it will be displayed as forward strand.

Each block will have a RevComp'd flag - note that at present the Window has this and it may be worth preserving a global Revcomp function for the window.

What happens when we reverse complement a block?

Previous code performed RevComp by reflecting coordinates in the end coordinate for a region, and for 1-based sequences from ACEDB with no parent span defined the result was incorrect chromosome coordinates. Without a definition of chromosome size it is not possible to calculate correct reverse strand coordinates and therefore chromosome coordinates will always be displayed as forward strand.

ZMap displays forward strand coordinates as 1-based and block relative (ie 1 to block size) as these are much smaller and easier for users to work with. Reverse strand coordinates in ZMap are the distance from the end of the current block and are expressed as negative and 1-based at the end of the displayed sequence.
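The two display conventions above can be sketched as follows. The function names are hypothetical, and the reverse case assumes that the coordinate passed in has already been RevComp'd (ie reflected within the block), so the first displayed base shows as -(block length) and the last as -1.

```c
#include <assert.h>

/* Illustrative type only, not the ZMap struct. */
typedef struct { int x1, x2; } Span;

/* Forward strand: 1-based, block relative (1 .. block length). */
int display_forward(Span block, int seq_coord)
{
  assert(block.x1 <= seq_coord && seq_coord <= block.x2);
  return seq_coord - block.x1 + 1;
}

/* After RevComp: seq_coord is the already reflected coordinate.
 * The value shown is the negative distance from the block end, so the
 * display runs from -(block length) down to -1. */
int display_reversed(Span block, int seq_coord)
{
  assert(block.x1 <= seq_coord && seq_coord <= block.x2);
  return seq_coord - block.x2 - 1;
}
```

For a block spanning 1001 to 2000 the forward display runs from 1 to 1000 and the RevComp'd display from -1000 to -1.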

Data used to perform coordinate mapping

ZMapFeatureContext->parent_span

This defines the genomic region that the sequence(s) being viewed are extracted from. Note that at present it is assumed that this is a single contiguous sequence of DNA (eg a single chromosome, or a database holding a sequence based on some other coordinate system). If more than one chromosome is to be represented then it is necessary to concatenate these into a single sequence; this will be reviewed in future and is likely to change. Parent span coordinates are always forward strand.

typedef struct
{
  Coord x1, x2 ;
} ZMapSpanStruct, *ZMapSpan ;

(in ZMapFeatureContextStruct)
ZMapSpanStruct parent_span;
If parent_span is not defined then ZMap will define it as 1-based and treat it as the concatenation of all the blocks in the align.

ZMapFeatureAlignment->sequence_span

This defines the extent of the sequence data held in blocks and is the min to max block coordinate. It is calculated automatically as blocks are loaded.

(in ZMapFeatureAlignmentStruct)
ZMapSpanStruct sequence_span;
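A minimal sketch of how a min-to-max span could be accumulated as blocks are loaded. The names here (Span, sequence_span_add_block, have_blocks) are hypothetical and this is not the ZMap implementation.

```c
#include <assert.h>

/* Illustrative stand-in for ZMapSpanStruct. */
typedef struct { int x1, x2; } Span;

/* Widen align_span to include one more loaded block.
 * *have_blocks is 0 until the first block has been merged. */
void sequence_span_add_block(Span *align_span, int *have_blocks, Span block)
{
  assert(block.x1 <= block.x2);

  if (!*have_blocks)
    {
      /* First block: the span is simply the block's extent. */
      *align_span = block;
      *have_blocks = 1;
      return;
    }

  if (block.x1 < align_span->x1)
    align_span->x1 = block.x1;
  if (block.x2 > align_span->x2)
    align_span->x2 = block.x2;
}
```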

ZMapFeatureBlock->block_to_sequence

This defines the genomic region within the parent span that a block covers. It contains the subsection of the parent span that is included, and the coordinates that are used to refer to the data being viewed internally by ZMap. There is a flag to say whether or not the region is reversed (ie reverse strand relative to the parent span).

typedef struct
{
      /* NOTE even if reversed coords are as start < end */
  ZMapSpanStruct parent;          /* start/end in parent span (context) */
  ZMapSpanStruct block;           /* start,end in align, aka child seq */
  gboolean reversed;
} ZMapMapBlockStruct, *ZMapMapBlock ;


(in ZMapFeatureBlockStruct)
  ZMapMapBlockStruct block_to_sequence ;
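To illustrate how such a mapping could be used, here is a sketch that converts a block coordinate to a parent span coordinate. The types mirror the structs above but are simplified stand-ins, block_to_parent() is a hypothetical helper, and the handling of the reversed flag (counting back from the parent end) is an assumption rather than something taken from the ZMap source.

```c
#include <assert.h>

/* Simplified stand-ins for ZMapSpanStruct and ZMapMapBlockStruct. */
typedef struct { int x1, x2; } Span;

typedef struct
{
  Span parent;     /* start/end in parent span, always start < end */
  Span block;      /* start/end in align */
  int reversed;    /* non-zero if reverse strand relative to parent */
} MapBlock;

/* Map a coordinate inside the block to the parent span.
 * ASSUMPTION: a reversed block counts back from the parent end. */
int block_to_parent(const MapBlock *map, int coord)
{
  int offset;

  assert(map->block.x1 <= coord && coord <= map->block.x2);
  offset = coord - map->block.x1;

  return map->reversed ? map->parent.x2 - offset
                       : map->parent.x1 + offset;
}
```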

Overall structure - examples

Some examples (note that we do not limit ZMap to these two options):

Currently we have this:
top level coords      1-------------------------------------------------n
                               |                    |
                               |                    |
sequence coords                1------x-------y-----m
                                      |       |
                                      |       |
block coords                          x-------y
                                      |       |
                                      |       |
display coords                        x-------y


ZMapFeatureContextStruct.parent_span.x1,x2 = 1,n
ZMapFeatureBlockStruct.block_to_sequence.parent.x1,x2 = x,y
ZMapFeatureBlockStruct.block_to_sequence.block.x1,x2 = x,y
but we want to go to this (using chromosome coordinates internally):
top level coords      1--------a------x-------y-----b-------------------n
                               |      |       |     |
                               |      |       |     |
sequence coords                a------x-------y-----b
                                      |       |
                                      |       |
block coords                          x-------y
                                      |       |
                                      |       |
display coords                      x-a+1   y-a+1

ZMapFeatureContextStruct.parent_span.x1,x2 = 1,n
ZMapFeatureBlockStruct.block_to_sequence.parent.x1,x2 = x,y
ZMapFeatureBlockStruct.block_to_sequence.block.x1,x2 = x,y

i.e. everything is in the coord system of the top level sequence which in otterlace would be the chromosome but in another database could be a clone, contig or whatever. NOTE that for a single align and block x == a and y == b.

Mapping versus Alignment

When mapping a genomic sequence to an underlying database both forward and reverse strands are mapped together. This is different from alignment which is where a sequence on one strand of a DNA segment is aligned with a sequence on another region and strand (although the strands may in fact be the same).

Note that there are two different data structures defined in zmapFeature.h: ZMapAlignBlock which is used to process gapped alignments, and ZMapMapAlign which is used to specify the mapping between Align and parent sequence.

Reverse Complementing

This occurs per block and we display coordinates as negative forward strand coordinates (block relative) with the corresponding parent (eg chromosome) coordinate as for normal forward strand.

Loose Ends

The following items have been noted while updating the source to reflect the above:

Chromosome and ZMap coordinates

Chromosome coordinates are useful when dealing with external sources and people but are unwieldy when annotating and ZMap has traditionally held features based on slice coordinates ie based from 1.

The user will always be presented with ZMap coordinates in status widgets. The ruler also provides a tooltip with the ZMap coordinate and, if available, the corresponding chromosome coordinate.

Traditionally the chromosome coordinates have been derived from the sequence name (eg 'chr3-18_123124234-234242342').
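As an illustration only, coordinates could be pulled out of a name of this shape as sketched below. parse_name_coords() is a hypothetical helper, and it assumes the name always ends in "_<start>-<end>" as in the example above; ZMap's own parsing may differ.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Returns 1 and fills *start and *end if name ends in
 * "_<start>-<end>", otherwise returns 0 and leaves them untouched.
 * ASSUMPTION: the coordinates follow the last underscore. */
int parse_name_coords(const char *name, long *start, long *end)
{
  const char *underscore;
  long s, e;
  char trailing;

  assert(name != NULL && start != NULL && end != NULL);

  underscore = strrchr(name, '_');
  if (underscore == NULL)
    return 0;

  /* The extra %c detects trailing junk: exactly two conversions
   * must succeed for the name to be accepted. */
  if (sscanf(underscore + 1, "%ld-%ld%c", &s, &e, &trailing) != 2)
    return 0;

  if (s < 1 || e < s)
    return 0;

  *start = s;
  *end = e;
  return 1;
}
```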

New configuration options

As a first step towards specifying the region of interest in ZMap configuration we will provide the following options:

[ZMap]
start=12324124
end=234242234
csname=chromosome
csver=Otter
and initially any values other than chromosome and Otter will be invalid.
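A sketch of the validity rule just described, using a hypothetical CoordConfig struct to hold the four values after they have been read from the [ZMap] stanza. The range checks on start and end are an added assumption and are not stated above.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical holder for the four [ZMap] options. */
typedef struct
{
  long start, end;
  const char *csname;
  const char *csver;
} CoordConfig;

/* Returns 1 if the configuration is acceptable, 0 otherwise. */
int coord_config_valid(const CoordConfig *cfg)
{
  assert(cfg != NULL && cfg->csname != NULL && cfg->csver != NULL);

  /* ASSUMPTION: coordinates are positive and start <= end. */
  if (cfg->start < 1 || cfg->end < cfg->start)
    return 0;

  /* Initially only chromosome / Otter are accepted. */
  if (strcmp(cfg->csname, "chromosome") != 0)
    return 0;
  if (strcmp(cfg->csver, "Otter") != 0)
    return 0;

  return 1;
}
```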

Subsequently various other parameters currently used in pipe server URLs will be controlled by extra options in [ZMap].

Use with otterlace

dev_otterlace (and eventually test_otterlace and otterlace) will accept an environment variable to choose between chromosome coordinates and 1-based coordinates. This will control whether or not start and end are configured in ZMap and also whether pipe server scripts rebase their coordinates from 1. ACEDB must be configured separately.

Handling chromosome coordinates without new configuration

If available, the chromosome coordinates in the sequence name in the ZMap config will be used to present chromosome coordinates on request by the user (eg in the ruler tooltip). Pipe servers and ACEDB should provide GFF data based from 1 and requests for data will be based from 1.

Handling chromosome coordinates with new configuration

ZMap will store features with chromosome coordinates and Pipe servers and ACEDB must provide these in their GFF output. Coordinates presented to the user will be adjusted to be based from 1. Requests for data (eg load from mark) will be in chromosome coordinates. NB ACEDB must be configured with extra data to facilitate this.
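As a sketch of what this implies for a request such as load from mark: a mark held in 1-based display coordinates has to be shifted by the configured start before it is sent to a server. The names below are hypothetical, and the sketch assumes forward strand data and that display coordinate 1 corresponds to the configured start.

```c
#include <assert.h>

/* Illustrative type only. */
typedef struct { long x1, x2; } LongSpan;

/* Convert a marked region in 1-based display coordinates into the
 * chromosome coordinates used for the data request.
 * ASSUMPTION: display coordinate 1 is the configured start. */
LongSpan mark_to_request(long region_start, LongSpan mark_display)
{
  LongSpan request;

  assert(region_start >= 1);
  assert(mark_display.x1 >= 1 && mark_display.x1 <= mark_display.x2);

  request.x1 = mark_display.x1 + region_start - 1;
  request.x2 = mark_display.x2 + region_start - 1;
  return request;
}
```

With start=12324124 as in the configuration example, a mark of 1001 to 2000 becomes a request for 12325124 to 12326123.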

Implications for ZMap

Data requests will always be in the native coordinate system.

The scale display must be adjusted when chromosome coords are configured.

Window and sequence coordinates

Nomenclature

Name        System     Description
world       FooCanvas  real world coordinate relating to sequence/feature
canvas      FooCanvas  pixel coordinate relating to the scroll region
window      FooCanvas  same as canvas coordinates
display     ZMap       1-based coordinates shown on the Ruler and ScaleBar;
                       also equivalent to the position of a feature on the canvas
sequence    ZMap       coordinates corresponding to loaded data
                       (may be 1-based or not depending on the database)
chromosome  ZMap       external (physical) coordinates relating to where the
                       loaded data has come from

The FooCanvas code comments that window and canvas coordinates are the same, but note that canvas are integer and window are real. The code suggests that they are both pixel coordinates.

Sequence coordinates and RevComp

Sequence coordinates correspond to our raw data, ie the sequence and features loaded from servers (see above for details of how real-world chromosome data can be mapped into ZMap coordinates depending on configuration). On RevComp this data is complemented (all sequence and feature data) and this means reflecting start and end coordinates in the end of the parent span. The MVC paradigm suggests that this is wrong: to create a new view of the data there should be no need to alter the model, but this is how ZMap has been implemented. To change this so that the data remain static and the view changes would need consideration of at least the following:

Note that to date ZMap has not known the extent of the parent span (eg the size of the chromosome the loaded sequence has been taken from) and typically the end of the loaded sequence has been used instead.

Note also that historically sequence data has been loaded as 1-based.

Window coordinates

Integer sequence coordinates are used to specify the position of features on the display and these are expressed as real numbers for addition to the canvas. The canvas implements zoom and these real number coordinates correspond to pixel coordinates on the user's screen depending on the zoom level. There is a maximum range of 30,000 pixels supported by the canvas and at high zooms not all data can be displayed. The canvas has a scroll region which corresponds to the extent of the data displayed in real number coordinates. The data actually visible on the canvas widget is determined by the scroll offset which is linked to the scrollbar position.
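The effect of the pixel limit can be sketched as below. The constant is the 30,000 pixel figure quoted above; the function name, the pixels-per-base zoom parameter and the choice to keep the region centred on its midpoint are all assumptions made for illustration and are not the ZMap or FooCanvas implementation.

```c
#include <assert.h>

#define CANVAS_MAX_PIXELS 30000.0

/* Shrink a scroll region [*top, *bottom] (world coordinates) so that
 * its extent in pixels does not exceed the canvas limit.
 * ASSUMPTION: the clamped region stays centred on the original
 * midpoint; a real implementation would centre on the user's focus. */
void clamp_scroll_region(double *top, double *bottom, double pixels_per_base)
{
  double span, max_span, mid;

  assert(pixels_per_base > 0.0);
  assert(*top <= *bottom);

  span = *bottom - *top;
  max_span = CANVAS_MAX_PIXELS / pixels_per_base;

  if (span > max_span)
    {
      mid = (*top + *bottom) / 2.0;
      *top = mid - max_span / 2.0;
      *bottom = mid + max_span / 2.0;
    }
}
```

At one pixel per base only 30,000 bases fit in the scroll region, which is why not all data can be displayed at high zoom.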

Coordinates: Usage

Things to be aware of

As historically coordinates received from ACEDB were 1-based and the parent span was unknown, much of the ZMap code has had to deal only with special cases: reverse complementing a sequence of 1-N in a parent span of 1-N gives a reversed sequence of 1-N.

If the parent span is unknown and the loaded sequence span is used instead, then a sequence of X-Y in a span of 1-Y will give a reversed sequence of 1-(Y-X+1), ie a sequence that is 1-based, and this is another special case.

Historically the Window and also the Scale bar have implemented a display 'origin' which has been used to calculate display coordinates. This is quite confusing, as for a typical 1-based RevComp'd sequence the origin is the sequence length, the start display coordinate is (-sequence_length) and the start real world coordinate is 1.

Previous abandoned implementations of multiple blocks would not work well with a window->origin as this related to the whole window not a series of disjunct blocks. If features were displayed as block relative perhaps this would be workable, but they are not. To be future proof any display 'origin' must be connected to the containing block not the window.

Display of features

Each feature has sequence start and end (feature->x1,x2) expressed as sequence coordinates. When a feature is added to the canvas these are expressed as real numbers (world coordinates). The canvas scroll region is set to the extent of all the features on the canvas and clipped to the 30,000 pixel foo canvas limit. The canvas scroll offset controls which part of the scroll region is visible in the canvas widget.

ScaleBar

This provides an indication of display coordinates on the left of the ZMap display and occupies a separate foo canvas. Displayed coordinates are 1-based; when RevComp'd they are displayed as negative and end at -1.

Note: the ScaleBar is variously referred to as the Ruler in the ZMap source code (and is coded in zmapWindowRuler.c) and should not be confused with the Ruler, which is a horizontal line drawn as a cursor when the middle mouse button is held down.

Ruler

This allows the user to see if features line up by drawing a horizontal line across the display where the cursor is. When active the display and chromosome coordinates are displayed in a tooltip.

Status Bar

Coordinates of whole or sub-features are displayed in some of these boxes and these are the display coordinates, which must correspond to the coordinates shown on the Scale Bar.

Requests for data from servers

To implement request from mark (and the corresponding function via an XRemote request) ZMap will use chromosome coordinates, as this is unambiguous and will operate with separately hosted and unrelated servers.

To get chromosome coordinates from the mark we have to:

For reverse strand data we also have to:

Implementation

Feature canvas coordinates

It's not obvious from the code how feature coordinates are processed to become canvas coordinates, mainly because the data structures and code are very complex and contain obscure details. What appears to be the case is that:

The following snippet of code (in drawSimpleFeature()) is what makes the canvas item block relative.

  zmapWindowSeq2CanOffset(&y1, &y2, feature_offset);
It is called from zmapWindowFToIFactoryRunSingle() via this function call, and similar functions are used for other ZMapWindowCanvasItem types.
            /* get the block offset for the display context not the load request context */
            offset = block->block_to_sequence.block.x1;

            run_data.factory   = factory;
            run_data.container = features_container;
            run_data.context   = context;
            run_data.align     = align;
            run_data.block     = block;
            run_data.set       = set;

            run_data.canvas_item = current_item;

            item   = ((method)->method)(&run_data, feature, offset,
                          points[0], points[1],
                          points[2], points[3],
                          style);

One implication of this is that glyphs attached to features which are displayed as foo_canvas_items directly must be displayed as block relative. For historical reasons despite ZMapWindowCanvasItems being complex objects these are displayed directly in the containing column (ZMapWindowContainerFeatureSet).