pipeServer - Pipe Server Configuration and Use

Author: mh17, 27 Nov 2009; Converted to HTML 02 Mar 2010

Summary

The pipeServer module is based on the existing fileServer and is configured using a [source] stanza in the ZMap config file for each pipe. The module takes a GFF format stream as input, and it is intended that only one or very few featuresets will be included in each pipe; the idea is to load data in parallel in order to reduce startup time.

Scripts are to be installed locally with ZMap, and the directory itself is identified in the [ZMap] stanza in ~/.ZMap/ZMap. Each script is defined by a URL, and parameters can be given in the query string.

Although the intention is to load much of the data directly from database sources instead of passing it via ACEDB, it is still necessary for ZMap to load the essential data from ACE (transcripts, featureset-to-style mappings and style information), as GFF does not allow definition of styles. ZMap can also run from pipes or files only, provided styles are specified in another file.

The pipeServer module replaces the old fileServer module and supports both kinds of input.

Configuration

(The complete configuration options are documented separately.) In ZMap's main configuration file (~/.ZMap/ZMap, or another file if so specified) we specify each pipeServer and also set up a few other options as needed. This is driven from the [ZMap] stanza, and each source is specified in a stanza of its own:

[ZMap]
# define the location for scripts and data
# Of course these could be stored centrally somewhere if desired.
# and pipeServers can use absolute paths instead
script-dir = /nfs/users/nfs_m/mh17/ZMap/scripts
data-dir = /nfs/users/nfs_m/mh17/ZMap/scripts

# Each data source must be referenced in [ZMap] by listing the source stanzas like this
# Columns are displayed in the order given:
# in this example all the acedb features first then the 'other-feed' etc.
# later in the ZMap file each source must be defined in its own stanza.
sources = acedb ; other-feed ; yet_another_one

# If styles are not defined via ACEDB then a file must be given in the
# source stanza eg:
stylesfile = /nfs/users/nfs_m/mh17/zmap/styles/ZMap.b0250.file.styles

[other-feed]
# config options for 'other-feed'
#

When configured, ZMap will request data from each source in parallel, which should reduce loading time considerably. Each script obtains and sends the data in whatever way it chooses. The scripts replace the existing mechanism, in which Otterlace retrieves the data sequentially and adds it to ACEDB on startup.
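
As a rough illustration of how the sources list ties the stanzas together, the following Python sketch checks that every name listed in sources has a stanza of its own. It is not part of ZMap: check_sources is a hypothetical helper, and the sketch assumes that Python's configparser can read the file, which holds for simple stanzas like those above but is not guaranteed for every ZMap config.

```python
import configparser


def check_sources(config_text):
    """Return (names, missing): the sources listed in [ZMap] and those
    that have no stanza of their own in the same config text."""
    cfg = configparser.ConfigParser(interpolation=None)
    cfg.read_string(config_text)
    # The sources list is separated by ';' with optional surrounding spaces.
    raw = cfg["ZMap"]["sources"]
    names = [name.strip() for name in raw.split(";") if name.strip()]
    missing = [name for name in names if not cfg.has_section(name)]
    return names, missing


sample = """\
[ZMap]
sources = acedb ; other-feed ; yet_another_one

[other-feed]
delayed = true
"""

names, missing = check_sources(sample)
print("sources:", names)
print("missing stanzas:", missing)
```

In this sample only [other-feed] is defined, so the other two names are reported as missing.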

Each source stanza has a 'delayed' option, which allows data to be requested on demand rather than at startup:

# delayed == connect from X-Remote request or load columns dialog, otherwise connect on startup
delayed = true

URL interpretation

In each source stanza (one must exist for each data source) the syntax is the same as for existing file:// and acedb:// sources, but specifically for pipe:// sources we interpret the configuration as follows:

URLs take the form

      <scheme>://[user][:password]@<host>[:port]/[url-path][;typecode][?query][#fragment]
and:
      <scheme> will be 'pipe'

      user:password@host are not used and if present are ignored

      port is not used and if present will be ignored

      url-path is the path of the script.
      Note that according to http://rfc.net/rfc1738.html a single leading '/' signifies a relative path and two signify an absolute path.  We will interpret relative paths as relative to the ZMap scripts directory (script-dir).

      typecode is not used and will be ignored if present

      query will be expanded into a normal argv vector

      fragment is not used and will be ignored
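
The interpretation above can be sketched in Python. This is only an illustration and not ZMap's own code: the function name parse_pipe_url and the directory /opt/zmap/scripts are hypothetical, the sketch assumes the user, host and port parts are absent (as in the examples in this document), and it does not decode any escaped characters in the query.

```python
import posixpath


def parse_pipe_url(url, script_dir):
    """Split a pipe:// URL into an argv-style list: [script, 'key=value', ...]."""
    prefix = "pipe://"
    if not url.startswith(prefix):
        raise ValueError("not a pipe:// URL: %r" % url)
    rest = url[len(prefix):]
    rest, _, _fragment = rest.partition("#")   # fragment is ignored
    path, _, query = rest.partition("?")
    path = path.partition(";")[0]              # typecode is ignored
    if path.startswith("/"):
        script = path                          # pipe:///abs/path -> absolute
    else:
        script = posixpath.join(script_dir, path)  # relative to script-dir
    # Each key=value pair in the query becomes one argument.
    args = [item for item in query.split("&") if item]
    return [script] + args


print(parse_pipe_url("pipe://getgff.pl?file=b0250_curated.gff", "/opt/zmap/scripts"))
print(parse_pipe_url("pipe:///software/anacode/bin/get_genes.pl?dataset=human&name=1", "/opt/zmap/scripts"))
```

The query is split off before the typecode is stripped, so a ';' inside a query value (for example a featuresets list) is left untouched.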

Typically we expect a pipe:// data source to have only one (or very few) feature sets, as a major design aim is to exploit concurrent operation. Other configuration parameters will operate as normal (eg 'sequence=true' (which can only appear in one source) and 'navigator_sets=xxx,yyyy').

Here is an example for a test script that simply outputs an existing GFF file.

[b0250]
url = pipe://getgff.pl?file=b0250_curated.gff
featuresets = curated_features ; curated ; genomic_canonical
styles = curated_features ; curated ; genomic_canonical

A more realistic example with an absolute path (featuresets, styles and stylesfile would also need to be specified):

[human_xyx]
url=pipe:///software/anacode/bin/get_genes.pl?dataset=human&name=1&
      analysis=ccds_gene&end=161655109&csver=Otter&cs=chromosome&
      type=chr1-14&metakey=ens_livemirror_ccds_db&start=161542637&
      featuresets=CCDS:Coding;CCDS:Transcript
(url broken up for readability).

Script operation: some rules

A script must start with #!<program> or else it will not be exec'd. (Assuming Linux)

A script may obtain data in any way it likes but must output valid GFF data and nothing else on STDOUT (but anything is valid in a comment).

Brief error messages may be output to STDERR and these will be appended to the ZMap log. STDERR output is intended only to alert users to some failure (eg 'warning not all data found' or 'cannot connect to database') and not as a detailed log of script activity; if this is needed then the script should maintain its own log file. A warning message will be presented to the user, consisting of the last line in STDERR, and hopefully this will be enough to explain the situation without resorting to log files.

Regardless of whether an error message is sent ZMap will attempt to use the GFF data provided.

ZMap will probably read STDERR after STDOUT is closed, and only if some error is encountered.

Arguments will be given in the format key=value with no preceding dashes, as extracted from the server query string. (If people care about this we could change it...)

Extra arguments may be added subject to implementation:

zmap_start=zmap_start_coord   # in zmap coordinates not bases (if configured)
zmap_end=zmap_end_coord
wait=9                        # delay some seconds before sending data
                              # (can be given in the query string, main use is for testing)
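
A minimal sketch of a script that follows these rules, written in Python rather than Perl. It mirrors the getgff.pl test example only in spirit: the names parse_args and main, the error texts and the use of a non-zero exit status are assumptions made for illustration, not documented ZMap behaviour.

```python
#!/usr/bin/env python3
"""Hypothetical pipe script: copies an existing GFF file to STDOUT."""
import sys
import time


def parse_args(argv):
    """Turn ['key=value', ...] into a dict; arguments have no leading dashes."""
    args = {}
    for item in argv:
        key, sep, value = item.partition("=")
        if sep:  # skip anything that is not in key=value form
            args[key] = value
    return args


def main(argv, out=sys.stdout, err=sys.stderr):
    args = parse_args(argv)
    # 'wait' delays the output by some seconds (mainly useful for testing).
    time.sleep(float(args.get("wait", 0)))
    try:
        with open(args["file"]) as gff:
            for line in gff:
                out.write(line)  # GFF data and nothing else on STDOUT
    except KeyError:
        err.write("no file= argument given\n")  # brief message on STDERR
        return 1
    except OSError as exc:
        err.write("cannot read GFF file: %s\n" % exc)
        return 1
    return 0


# Only act when arguments were supplied on the command line.
if __name__ == "__main__" and len(sys.argv) > 1:
    sys.exit(main(sys.argv[1:]))
```

It could be tried from a shell as, for example, ./getgff.py file=b0250_curated.gff wait=2 (getgff.py is a hypothetical file name).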

Speed of data retrieval and overloading servers

A problem

Running 40 requests in parallel does reduce the time needed to retrieve data, but many simultaneous requests to a single server can be detrimental. It is therefore necessary to provide the ability to group some pipe servers together and request their data in series.

In conjunction with this it is also necessary to allow each single featureset to be requested individually. Note that although most pipe servers will be configured for a single featureset it is possible to serve many from a single script.

Intricacies of the request protocol and configuration

ZMap will request data by featureset and, optionally, a coordinate range, but the data is displayed in columns, each of which may contain several featuresets. Each server is configured to supply one or more featuresets. There is no advantage in defining servers and columns such that a server provides featuresets for several columns while most columns require data from several servers, but this is still possible and needs to be handled.

When presented with a request for a featureset, ZMap will determine which servers to request the data from; if this has the side effect of requesting extra featuresets, these will be stored automatically. The existing configuration (Feb 2011) links a featureset list to a single URL in a pipe server stanza, and to prevent overlapping requests for given featuresets it is simplest to group pipe servers (stanzas) together.
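
The lookup described above can be sketched as follows. This is an illustration only, not ZMap's implementation: servers_for is a hypothetical function, and the mapping reuses the stanza and featureset names from the examples earlier in this document.

```python
def servers_for(requested, server_featuresets):
    """Return (servers, extras): the servers that supply any requested
    featureset, and the extra featuresets that arrive as a side effect."""
    wanted = set(requested)
    servers = []
    extras = set()
    for server, featuresets in server_featuresets.items():
        provided = set(featuresets)
        if provided & wanted:
            servers.append(server)
            extras |= provided - wanted
    return servers, sorted(extras)


# Stanza name -> featuresets configured for that server.
config = {
    "b0250": ["curated_features", "curated", "genomic_canonical"],
    "human_xyx": ["CCDS:Coding", "CCDS:Transcript"],
}

print(servers_for(["curated"], config))
```

Requesting 'curated' alone selects the b0250 server, and its other two featuresets come along as extras to be stored.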

Implementation

An extra stanza will be provided in ZMap config:

[Server-groups]
group1 = ensembl_bodymap_breast ; ensembl_bodymap_heart ; etc
group2 = Exonerate_EST_Human ; EST_Mouse ; Exonerate_EST_Mouse ; etc

This defines servers that should not be active at the same time. The group names are not used by ZMap except to facilitate configuration. When ZMap queues requests it will start only one server per group, hold the remaining ones in a list, and request them as active servers complete. Note that we assume that each server in a group will terminate after supplying data (this is how pipe servers work); because of this, an ACEDB or other long-term database connection should not be configured into a server group with any other servers.
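
The queueing rule can be sketched in Python as below. This is a sketch under stated assumptions rather than ZMap's code: the class GroupScheduler and its methods request and finished are hypothetical names, and servers that belong to no group are assumed to start immediately.

```python
from collections import deque


class GroupScheduler:
    """Allow at most one active server per group; queue the rest."""

    def __init__(self, groups):
        # groups: {group_name: [server_name, ...]}
        self.group_of = {s: g for g, servers in groups.items() for s in servers}
        self.active = {}                              # group -> running server
        self.waiting = {g: deque() for g in groups}   # group -> queued servers

    def request(self, server):
        """Return True if the server may start now, False if it was queued."""
        group = self.group_of.get(server)
        if group is None:
            return True                 # ungrouped servers are not restricted
        if group in self.active:
            self.waiting[group].append(server)
            return False
        self.active[group] = server
        return True

    def finished(self, server):
        """Mark a server as complete; return the next one to start, or None."""
        group = self.group_of.get(server)
        if group is None or self.active.get(group) != server:
            return None
        del self.active[group]
        if self.waiting[group]:
            nxt = self.waiting[group].popleft()
            self.active[group] = nxt
            return nxt
        return None


sched = GroupScheduler({
    "group1": ["ensembl_bodymap_breast", "ensembl_bodymap_heart"],
    "group2": ["Exonerate_EST_Human", "EST_Mouse"],
})
print(sched.request("ensembl_bodymap_breast"))   # first in group1
print(sched.request("ensembl_bodymap_heart"))    # same group, so queued
print(sched.finished("ensembl_bodymap_breast"))  # releases the queued server
```

The scheme relies on every grouped server eventually finishing, which is why a long-lived connection would block the rest of its group.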

Note that: