NAME

GFF::GeneFeatureSet.pm - Perl extension for GFF (Homol)GeneFeature Set Container


SYNOPSIS

use GFF ; # contains an implicit 'use GFF::GeneFeatureSet ;'


AUTHORS

Copyright (c) 1999 Created by Tim Hubbard th@sanger.ac.uk.
Augmented by Richard Bruskiewich rbsk@sanger.ac.uk

Sanger Institute, Wellcome Trust Genome Campus, Cambs, UK. All rights reserved.

This program is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation.


DESCRIPTION

GFF::GeneFeatureSet (derived from GFF) is a Perl Object for General Feature Format. A GFF::GeneFeatureSet object is a container object for a set of GFF::GeneFeature (or GFF::HomolGeneFeature) objects.

How to Read Method Protocols

Normal Perl data type notations are used for argument declarations in the method protocols. A backslash denotes argument passing by reference. Class methods are invoked using the 'class->method(args)' or 'method class args' Perl call formats.


SOURCE CODE

The most current release of the Perl source code for this module is available here. All bug reports may be submitted to Richard Bruskiewich (rbsk@sanger.ac.uk).


GFF::GeneFeatureSet Construction Methods

new( $version, $seqname, $start, $end )
Class method to construct a new empty GFF::GeneFeatureSet object of version ``$version''. If $version is not specified, it is taken to be the current default GFF version. GFF::Region() values for $seqname, $start and $end may also be provided.

addGeneFeature( $GeneFeature, \&filter, $copy )
Method to add a reference to a GeneFeature object (first argument) to a GFF::GeneFeatureSet object, possibly subject to an optional, user-defined &filter function. This predicate (boolean) function ``&filter'' tests the GeneFeature object (given as an argument) for inclusion in the GFF::GeneFeatureSet object set, based upon user criteria. If a ``&filter'' is not provided, then the GeneFeature is unconditionally included. If the $copy argument is set (and 'true' == non-null), then copies (not the original object references) of the source GeneFeature objects should be added to the invoking object. If $copy is specified

nextGeneFeature()
Method to completely remove the next ('head') GeneFeature from the GeneFeatureSet, returning it to the caller. The order of elements returned is FIFO relative to addGeneFeature() method calls.
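The filter and FIFO behaviour described for addGeneFeature() and nextGeneFeature() can be pictured with a plain-Perl sketch. It does not use the GFF classes: features are modelled as hypothetical hash references, the set as an array, and add_feature() and next_feature() are illustrative names only.

```perl
use strict;
use warnings;

# Hypothetical stand-in for a feature set: a plain array used as a FIFO queue.
my @set;

# Add a feature, subject to an optional predicate.
# Returns 1 if the feature was added, 0 if the predicate rejected it.
sub add_feature {
    my ($gf, $filter) = @_;
    return 0 if defined $filter && !$filter->($gf);
    push @set, $gf;
    return 1;
}

# Remove and return the head of the queue (undef when the queue is empty).
sub next_feature { return shift @set }

my $exons_only = sub { $_[0]{feature} eq 'exon' };
add_feature({ feature => 'exon',   start => 10 }, $exons_only);
add_feature({ feature => 'intron', start => 20 }, $exons_only);   # rejected by the filter
add_feature({ feature => 'exon',   start => 30 });                # no filter: always added

my $first = next_feature();   # the feature that was added first
```

Here the intron never enters the queue, and the feature starting at 10 is handed back before the one starting at 30.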

addGFF( $GeneFeatureSet2, \&filter, $copy )
Method to append GeneFeatures in another GFF::GeneFeatureSet object to the GFF::GeneFeatureSet object invoking the method (where ``$GeneFeatureSet2'' above is the object reference of the second object, given as the method argument). Use the ``union'' method if you wish to merge two GFF::GeneFeatureSet objects without duplication of GeneFeature objects. If a ``&filter'' (see addGeneFeature above) is not provided, then the GeneFeature is unconditionally included. If the $copy argument is set (and 'true' == non-null), then copies (not the original object references) of the source GeneFeature objects should be appended to the invoking object.

copy( $version )
Method to duplicate the invoking GFF::GeneFeatureSet object. If the optional '$version' argument is specified (and greater than 0) then the new copy is cast into the specified version. This allows for GFF version casting of GeneFeatureSets.


GFF::GeneFeatureSet Input/Output Methods

read_header( \*INPUT )
Reads in the GFF file header meta-comments from the top (head) of an input file, (re)setting the meta-data of the current object accordingly.

Reading stops at the first non-comment field encountered; the method returns non-null if any meta-comments were encountered.

read( \*INPUT, \&filter, \&converter )
Creates a GeneFeature object for each line of a stream from a GFF formatted file and adds references to them to the GeneFeatureSet object. An optional ``&converter'' function may be used to modify or filter input lines on the fly. This function should take a $string and return a $string; lines converting to an empty string are skipped by the read. An optional, user-defined predicate (boolean) function ``&filter'' tests the resulting GeneFeature object (given as its argument) for conditional inclusion in the GeneFeatureSet object, based upon user criteria. If a ``&filter'' is not provided, then the GeneFeature is unconditionally included. Comment lines (lines beginning with a ``#'') are also skipped. Use the GFF::trace(1) command to trace read input to \*STDERR.
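The order of operations in read() (skip comments, convert the line, parse it, then filter the parsed feature) can be sketched without the GFF classes. In this sketch read_features() is a hypothetical helper, features are plain hash references, and only the first five tab-delimited columns are parsed; none of this is the module's own implementation.

```perl
use strict;
use warnings;

# Read tab-delimited lines, applying an optional converter (string -> string)
# and an optional filter (feature -> boolean).
sub read_features {
    my ($fh, $filter, $converter) = @_;
    my @features;
    while (my $line = <$fh>) {
        chomp $line;
        next if $line =~ /^#/;                        # skip comment lines
        $line = $converter->($line) if $converter;    # rewrite the raw line
        next if !defined $line || $line eq '';        # empty conversion: skip the line
        my ($seqname, $source, $feature, $start, $end) = split /\t/, $line;
        my $gf = { seqname => $seqname, source => $source, feature => $feature,
                   start   => $start,   end    => $end };
        next if $filter && !$filter->($gf);           # predicate on the parsed feature
        push @features, $gf;
    }
    return @features;
}

my $data = "##gff-version 2\n"
         . "chr1\tgenscan\texon\t100\t200\n"
         . "chr1\tblast\tsimilarity\t150\t300\n"
         . "chr2\tgenscan\texon\t10\t50\n";
open my $in, '<', \$data or die "cannot open in-memory handle: $!";

# The converter blanks out chr2 lines; the filter keeps exons only.
my @kept = read_features(
    $in,
    sub { $_[0]{feature} eq 'exon' },
    sub { $_[0] =~ /^chr2\t/ ? '' : $_[0] },
);
close $in;
```

The converter sees raw text and the filter sees parsed features, which is why line-level edits belong in the former and attribute tests in the latter.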

pipe( \*INPUT, \*OUTPUT, \&filter )
Pipe a GFF file from \*INPUT to \*OUTPUT. An optional ``&filter'' function may be used to modify or filter input features on the fly. This function should take a $gf and return a $gf or 0; features returning 0 are skipped. Comment lines (lines beginning with a ``#'') are piped verbatim to output.

The merit of this method is that it does not read the whole GFF file into memory, so one can use a filter function to make small, simple sequential modifications to a GFF file without incurring a large memory overhead.

Use the GFF::trace(1) command to trace read input to \*STDERR.

read_msp( \*INPUT, $source, \&filter, \&name_parser )
Creates a HomolGeneFeature object for each line of a stream from an MSP file (output from MSPcrunch) and adds references to them to the invoking GeneFeatureSet object. The user specifies the origin of the MSP data as the $source string argument to the function. The &name_parser argument is a user-defined function which takes two arguments, $group and \@array, where @array is assumed to be the (delimiter split) remainder of a line of data beyond the group field. An optional, user-defined predicate (boolean) function ``&filter'' tests the resulting GeneFeature object (given as its argument) for inclusion in the GeneFeatureSet object, based upon user criteria. If a ``&filter'' is not provided, then the GeneFeature is unconditionally included. Comment lines (lines beginning with a ``#'') are also skipped.

read_parse( \*INPUT, \&parser, \&filter, $type )
Method uses the GFF::GeneFeature::new_from_parse() or HomolGeneFeature::new_from_parse() method to create GeneFeature objects from each line of the specified stream input filehandle, using a user-defined parsing function ``&parser'' (see new_from_parse for &parser protocol details). The resulting GeneFeature object references are added to the invoking GeneFeatureSet object. An optional, user-defined predicate (boolean) function ``&filter'' tests the resulting GeneFeature objects (given as its first argument) for inclusion in the GeneFeatureSet object, based upon user criteria. If a ``&filter'' is not provided, then the GeneFeature is unconditionally included. Comment lines (lines beginning with a ``#'') are also skipped. The optional $type parameter may be set to ``HomolGeneFeature'' or ``GeneFeature'' (defaults to ``GeneFeature'', if not specified).

eachGeneFeature()
Method to return an array of refs to the GeneFeatures in a GeneFeatureSet object.

dump_header( \*OUTPUT )
Method to dump meta-comment fields that this GeneFeatureSet object knows about, currently, the ##version, ##date and ##sequence-region (if defined). Returns non-null if any such fields are defined (and thus printed).

dump( \*OUTPUT, $tab, $newline, $flen, $inorder, $tag )
Dump out a GeneFeatureSet object (via GFF::GeneFeature::dump_string()). If \*OUTPUT is not given, \*STDOUT is used. The method returns 'true' (``1'') if non-empty GeneFeatureSet object; 'false' (``0'') if the GeneFeatureSet object is empty.

The ``$tab'' argument is a boolean flag, where a ``true'' (non-null) value directs the use of a tab as the field delimiter in the output line. Otherwise, blank space is used as the delimiter (default is ``true'' if not specified).

The ``$newline'' argument is passed to GeneFeature dump_string, which passes it on to GFF::GeneFeature::dump_group() to affect group printing.

The ``$flen'' argument is a boolean flag, where a non-null value stipulates that the length of the current output line should be printed as an extra field at the end of the output line (assumed null if not specified). Note: the extra length of this field is *not* added to the displayed line size, but the extra field is tab delimited, if $tab is set.

The ``$inorder'' argument is a boolean flag (default: true) which forces sorting of the GeneFeatures during dump by ``start'' coordinate order. Users may wish to suppress sorting (i.e. explicitly set $inorder to 0 ``false''), for performance reasons, when the GeneFeatureSet file is large.

The optional $tag argument controls dumping of [group] fields (see GFF::GeneFeature::dump()).

dump_matches( \*OUTPUT, $tab, $show_nomatches )
Dump out a GeneFeatureSet object (via the method to dump out a GeneFeature object) along with information about (overlap) matching GeneFeatures. If \*OUTPUT is not given, \*STDOUT is used. The ``$tab'' argument is a boolean flag, where a ``true'' (non-null) value directs the use of a tab as the field delimiter in the output line (assumed ``true'' (non-null) if not specified). Otherwise, blank space is used as the delimiter. Normally, only features with matches are dumped. The optional boolean flag '$show_nomatches', when defined and non-null, directs that 'no match' records are reported too.


GFF::GeneFeatureSet Access Methods

The various GFF::GeneFeatureSet object parameters may be set or queried by the following access methods. All the methods can take arguments as noted to set the variable. With or without an argument, the methods return the current (or newly set) values, as a list, except as specifically noted below:

date( $year, $month, $date )
date of the GeneFeatureSet file (meta-comment ##date line).

region( $sequence, $start, $end )
sequence region of the GeneFeatureSet file (meta-comment ##sequence-region line).

version( $version )
object protocol version (see GFF::GeneFeatureSetObject::version()).


GFF::GeneFeatureSet Simple Set Operations

Note - all set comparisons are by reference only - comparisons cannot check if two GeneFeature objects with distinct reference pointers actually contain the same data!

member( $GeneFeature )
Method to test if a GeneFeature object (object reference ``$GeneFeature'') is a member of an existing GeneFeatureSet object

union( $GeneFeatureSet2 )
Method to generate a new GeneFeatureSet object which is a union of 2 GeneFeatureSet objects (the one invoking the method plus the one specified by the GeneFeatureSet object reference argument, $GeneFeatureSet2, to the method).

intersection( $GeneFeatureSet2 )
Method to generate a new GeneFeatureSet object containing every GeneFeature in the first (invoking) GeneFeatureSet that is also a member of the second GeneFeatureSet (argument $GeneFeatureSet2 above).

difference( $GeneFeatureSet2 )
Method to generate a new GeneFeatureSet object containing every GeneFeature in the first (invoking) GeneFeatureSet that is not in the second GeneFeatureSet (argument $GeneFeatureSet2 above).


GFF::GeneFeatureSet Feature Partition Methods

This series of methods partition a GeneFeatureSet into subsets based upon specified attribute or feature criteria.

filter( \&function )
Method to generate a new GeneFeatureSet object based on a filtered version of the object. Filtering is carried out by passing a reference to a subroutine, ``&function'', which is applied to each GeneFeature object in the invoking GeneFeatureSet object. This user-defined &function should be designed to accept a reference to a single GeneFeature object and to return the predicate (boolean) outcome of some test upon that object reflecting the user filter criterion: 1 implying inclusion, 0 implying exclusion of the GeneFeature from the new GeneFeatureSet object set.

exclude( \&function )
Method to generate a new GeneFeatureSet object based on a filtered version of the object. This method is simply the negation of filter(), in that a discriminant function value of 1 implies *exclusion* and 0 implies inclusion. This may be handy in that the same discriminant function can therefore be used to partition a set by using filter() and exclude() sequentially.
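The partitioning idea behind filter() and exclude() can be sketched with Perl's built-in grep. This is an illustration on plain hash references (a hypothetical stand-in for GeneFeature objects), not a call into the module.

```perl
use strict;
use warnings;

my @features = (
    { feature => 'exon',   strand => '+' },
    { feature => 'intron', strand => '+' },
    { feature => 'exon',   strand => '-' },
);

# One predicate, returning 1 or 0, reused for both halves of the partition.
my $is_exon = sub { $_[0]{feature} eq 'exon' ? 1 : 0 };

# filter() analogue: keep members for which the predicate is true.
my @included = grep { $is_exon->($_) } @features;

# exclude() analogue: keep members for which the same predicate is false.
my @excluded = grep { !$is_exon->($_) } @features;
```

Every feature lands in exactly one of the two lists, so the sizes of @included and @excluded always add up to the size of the original list.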

rewriteField( $field, $target, $rewrite, $returnAll, $record )
This method creates a new GeneFeatureSet object containing GeneFeatures in which a designated $field (specified by a string value 'SEQNAME', 'SOURCE', 'FEATURE' or 'GROUP' - case insensitive) is rewritten in all GeneFeatures that match the specified $target value. The $target can either be a simple identifier or a Perl regular expression; for GFF Version 1 'GROUP' fields, $target matches the field itself; for GFF Version 2 'GROUP' fields, $target should be a simple identifier matching a tag of some tag-value pair. If $field is 'SEQNAME', the return object 'sequence-region' is renamed to the first field matched.

The specified field is overwritten with the specified $rewrite value (which may be a simple identifier or a full Perl search & replace expression, namely, ``s/<search>/<replacement>/''). Note: the 's' should be the very first character in the string and should be immediately followed by a non-alphanumeric delimiter character, for this to work properly. Backreferenced ($1 et al.) replacement values and the 'g', 'i' and 'x' search modifiers are permitted.

Note: if the 'target' is simply an asterisk '*' ('wildcard'), then all fields of the designated type are rewritten with the $rewrite specification.

The new GeneFeatureSet object returned only contains GeneFeatures which triggered a rewrite, unless the optional 'returnAll' boolean flag argument is defined and non-null, in which case a COPY of all GeneFeatures is returned, whether or not it was modified.

If the optional $record argument is set (to a simple tag identifier), then the old $field value is recorded as the $record [group] field tag value.

cluster( \&comparison, $single )
Method to build an array of GeneFeatureSet objects, each containing a group of GeneFeature objects sharing some attribute [pairwise comparison]. The &comparison function reference is the operational user definition of this shared attribute, which, when given the references to each of two GeneFeature objects, returns 1 implying inclusion in a cluster, or 0 implying exclusion from a cluster, based upon shared attributes. If the $single flag is set to 1 (assumed 0 if omitted) then all singular GeneFeature objects not assigned to a group are added to the array as independent, single-membered GeneFeatureSet object groups.

features( \&discriminator )
Method to return a hash, key indexed by distinct feature types and their occurrence. The keys of the hash are based upon a user-defined ``&discriminator'' function which takes a GeneFeature object reference as its input parameter and returns a (string) key value label characteristic of a feature type of interest. If the &discriminator function returns undef or a null string for a given GeneFeature object argument, then that GeneFeature is ignored by the method. For GeneFeatures with a non-null &discriminator return value, the method uses this return value as a hash key to maintain a cumulative count of the occurrence of this return value (``feature type'') in the current GeneFeatureSet object.

theFeature( \&discriminator )
Method to test if the &discriminator function returns a single value, based upon a user-defined ``&discriminator'' function which takes a GeneFeature object as its input parameter and returns a (string) key value characterizing the feature types of interest. If such a singular feature is found, then it is returned with its frequency of occurrence in the GeneFeatureSet file.

the( \&discriminator )
Non-fatal method to test if the discriminant function returns a singular value when acting upon the given GFF::GeneFeatureSet. Returns the single value if unique; returns undef otherwise. This is good for obtaining data such as the common strand of a GeneFeatureSet 'gene' object.
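The discriminator-based tally of features() and the single-value test of the() can be sketched in plain Perl. The helpers tally() and the_value() are hypothetical names, and the features are plain hash references rather than GeneFeature objects.

```perl
use strict;
use warnings;

my @features = (
    { feature => 'exon', strand => '+' },
    { feature => 'exon', strand => '+' },
    { feature => 'CDS',  strand => '+' },
);

# Count features by the key a discriminator returns; undef or '' keys are ignored.
sub tally {
    my ($discriminator, @gfs) = @_;
    my %count;
    for my $gf (@gfs) {
        my $key = $discriminator->($gf);
        next if !defined $key || $key eq '';
        $count{$key}++;
    }
    return %count;
}

# Return the key if the discriminator yields exactly one distinct value, else undef.
sub the_value {
    my ($discriminator, @gfs) = @_;
    my %count = tally($discriminator, @gfs);
    my @keys  = keys %count;
    return @keys == 1 ? $keys[0] : undef;
}

my %by_type = tally(sub { $_[0]{feature} }, @features);
my $strand  = the_value(sub { $_[0]{strand} },  @features);   # all features share '+'
my $type    = the_value(sub { $_[0]{feature} }, @features);   # two types: no single value
```

The same discriminator can serve both purposes: the tally gives the distribution, and the single-value test is just a check that the tally has one key.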

group( \&discriminator, $copy )
Method to build a hash of GeneFeatureSet objects, each containing a group of GeneFeature objects sharing some attribute [fixed, named], operationally defined by a user &discriminator function taking a GeneFeature object reference as the input argument and returning a unique name string labelling the attribute. The group method returns a hash of GeneFeatureSet object references, key indexed by the attribute name strings. The (meta-comment) 'sequence-region' start and end coordinates are set to the minimum and maximum start and end respectively of the GeneFeatures included into each group GeneFeatureSet object.

If the (optional) $copy switch is 'true' (non-null) then the new group GeneFeatureSet objects (dereferenced by the hash) are composed of copies of the original GeneFeatures. In other words, modifications of these new GeneFeatures will not modify the GeneFeatures objects in the original object.
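A minimal sketch of this grouping logic, assuming features are plain hash references and each group is a hash holding its members plus the running minimum start and maximum end. The helper group_by() and the 'members' key are hypothetical; the copy made here is a shallow one.

```perl
use strict;
use warnings;

my @features = (
    { seqname => 'chr1', start => 100, end => 200 },
    { seqname => 'chr2', start =>  10, end =>  50 },
    { seqname => 'chr1', start =>  40, end => 120 },
);

# Group features by discriminator key, tracking each group's min start and max end.
sub group_by {
    my ($discriminator, $copy, @gfs) = @_;
    my %groups;
    for my $gf (@gfs) {
        my $key = $discriminator->($gf);
        my $g   = $groups{$key} ||= { members => [], start => $gf->{start}, end => $gf->{end} };
        push @{ $g->{members} }, $copy ? { %$gf } : $gf;   # shallow copy when $copy is true
        $g->{start} = $gf->{start} if $gf->{start} < $g->{start};
        $g->{end}   = $gf->{end}   if $gf->{end}   > $g->{end};
    }
    return %groups;
}

my %by_seq = group_by(sub { $_[0]{seqname} }, 1, @features);

# Because copies were requested, this edit does not touch the original feature.
$by_seq{chr1}{members}[0]{start} = 999;
```

With the copy flag off, the last assignment would change $features[0] as well, since the group would then hold the same reference.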

group_value_string( $source, $feature, $tag )
Method to return a string constructed from the list of values associated with a given $tag of [group] tag-value pairs, from a specified GeneFeature record, of the invoking gene feature set, which matches the given $source and $feature (which may be Perl regular expressions). $source and $feature may also be undef or '*', designating that any source or feature can match. The $tag should be a simple tag identifier (*not* a Perl regex). Only values from the first such $tag encountered in the gene feature set are returned. Returns 'undef' if no such tag-value list is found.

deleteTag($tag)
Invokes the GFF::GeneFeature::deleteTag() method on every gene feature object in the current GeneFeature set. Note: this operation directly modifies the feature objects concerned.

label( $membergroup )
Method to add a label (``$membergroup'') to each GeneFeature in this GeneFeatureSet, indexing a reference pointing back to the invoking GeneFeatureSet object.

label_pair( $membergroup )
Method to make a list of the GeneFeatureSet objects with which the GeneFeatures in this GeneFeatureSet are paired by some label (``$membergroup'').

addMember( $ParentGeneFeatureSet, $membergroup )
Method to add a member record to indicate the parent GeneFeatureSet object for this particular grouping (``$membergroup'') of GeneFeatures.

getMember( $membergroup )
Method to get reference to parent GeneFeatureSet object of GeneFeature object under this particular grouping (``$membergroup'').

containsMembers()
Method to test whether or not the GeneFeatureSet contains members (that is, whether $membergroups has members).

getAllMembers()
Method to get hash reference of all $membergroups.


GFF::GeneFeatureSet Geometric Partition Methods

This series of methods partition a GeneFeatureSet into subsets based upon coordinate (geometric) criteria.

complement($source,$feature,$strand)
Method returns a set of 'gene features' constructed from the geometric complement of the calling GeneFeatureSet, that is, all the subsequences *NOT* spanned by the input feature records.

The $source, $feature and $strand arguments are used to label the <source>, <feature> and <strand> fields respectively (default: 'GFF_Complement' for <source> and/or <feature>; '.' for <strand>).

The '$tag' value is a [group] field tag, common to features in the input GeneFeatureSet, which is used to annotate the complementary features in the form [$tag $feature1 $feature2] where $feature1 and $feature2 are the features flanking the newly generated complement feature. If complementary features are generated at the start and/or end of the host coordinate range, then the special names 'Start' and 'End' are used as feature names for the $tag labelling.

The '$append' argument, if defined and non-null, directs that the new GeneFeatureSet of complemented features is appended to the invoking GeneFeatureSet, which itself is returned.
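The geometric complement itself can be illustrated on bare coordinate pairs. The sketch below assumes a host range given explicitly as ($lo, $hi) and features as [start, end] array references; the helper name and argument layout are illustrative only and say nothing about how the module labels or tags the resulting features.

```perl
use strict;
use warnings;

# Return the gaps within [$lo, $hi] that are not covered by any [start, end] span.
sub complement {
    my ($lo, $hi, @spans) = @_;
    my @gaps;
    my $next = $lo;                                    # first position not yet covered
    for my $s (sort { $a->[0] <=> $b->[0] } @spans) {
        push @gaps, [ $next, $s->[0] - 1 ] if $s->[0] > $next;
        $next = $s->[1] + 1 if $s->[1] + 1 > $next;    # never move backwards
    }
    push @gaps, [ $next, $hi ] if $next <= $hi;        # trailing gap up to the range end
    return @gaps;
}

# Spans 10-20 and 15-40 overlap; 60-80 stands alone.
my @gaps = complement(1, 100, [10, 20], [15, 40], [60, 80]);
```

For this input the uncovered stretches are 1-9, 41-59 and 81-100; the first and last correspond to the 'Start' and 'End' flanking cases mentioned above.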

self_overlap(\*OUTPUT, $strict, $exact, $tag)
Method to generate a new GeneFeatureSet object based on a filtered version of the object, based on any pairwise overlap detected by the GFF::GeneFeature::overlap_logical() method. Overlaps may optionally be reported to the filehandle reference ``\*OUTPUT'' (output is sent to STDOUT if \*OUTPUT is omitted or undef; output is suppressed if a null (zero) value is passed to the method for \*OUTPUT). Note: the method does not test whether the features being merged match in kind. Thus the user is responsible for making sure that the GeneFeatureSet object only contains features mergeable in the semantic sense (e.g. all the features are exons predicted by a single prediction algorithm).

If the optional $strict flag is set 'true' (non-null) then only overlapping Gene Features which match identically with respect to <seqname>, <source>, <feature> and <strand> (and [group] $tag -- see below) are deleted (i.e. only truly 'duplicate' records deleted). $strict defaults to 'false' if omitted.

The optional $exact flag specifies that an exact match is required for overlaps (defaults to false if the argument is omitted or undef).

If '$tag' is defined, then the specified [group] tag-value must also match for $strict matches.

self_overlap_merge( $tolerance, $strand, $group_tag, $addscores )
Method to generate a new GeneFeatureSet object based on a merging of overlapping (but otherwise similar) GeneFeature records using the GFF::GeneFeature::overlap_merge() method, i.e. if two GeneFeatures span 1-10 and 8-16, they are replaced with a single new one spanning 1-16. See the caveat note above (under self_overlap) about feature merger semantics.

The optional $tolerance value is passed on to the GFF::GeneFeature::overlap_merge() method.

Defined, non-null '$strand' boolean flag forces the merging to be strand sensitive.

The optional $group_tag argument is passed to the GFF::GeneFeature::overlap_merge() function (which see). This may be a simple [group] tag identifier or a modifier function.

The optional $addscores argument stipulates that all merged feature scores are added to the merged version of the feature.
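The 1-10 plus 8-16 example can be worked through with a short interval-merging sketch. It operates on bare [start, end] pairs and ignores tolerance, strand, group tags and scores, so it shows only the coordinate arithmetic, not the behaviour of overlap_merge() itself.

```perl
use strict;
use warnings;

# Merge overlapping [start, end] spans into the smallest set of covering spans.
sub merge_overlaps {
    my @spans = @_;
    my @merged;
    for my $s (sort { $a->[0] <=> $b->[0] } @spans) {
        if (@merged && $s->[0] <= $merged[-1][1]) {
            # Overlaps the last merged span: extend its end if needed.
            $merged[-1][1] = $s->[1] if $s->[1] > $merged[-1][1];
        }
        else {
            push @merged, [ @$s ];   # copy, so the input spans are left unmodified
        }
    }
    return @merged;
}

my @merged = merge_overlaps([8, 16], [1, 10], [30, 40]);
```

Sorting by start coordinate first means each span only ever needs to be compared with the most recently merged one.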

intersect_range( $start, $end, $exact, $strand )
Method to generate a new GeneFeatureSet object from all GeneFeatures in the invoking GeneFeatureSet object which overlap the specified $start and $end range. By default, GeneFeatures are included if they overlap at all with the range.

Use of the optional $exact boolean flag (default: 'false') specifies that both the start and end of GeneFeatures must lie completely within the specified range. Otherwise, any overlap with the range counts as a hit.

The optional $strand ('+','-' or '.') argument forces the intersections to be strand sensitive (default: ignore the strand value).
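The difference between the default 'any overlap' test and the $exact containment test comes down to two coordinate comparisons, sketched here on plain hash references. The helper in_range() is a hypothetical name and strand handling is left out.

```perl
use strict;
use warnings;

my @features = (
    { start =>  50, end => 120 },   # straddles the start of the range
    { start => 110, end => 150 },   # lies entirely inside the range
    { start => 300, end => 400 },   # lies outside the range
);

# Select features overlapping [$start, $end]; with $exact, require full containment.
sub in_range {
    my ($start, $end, $exact, @gfs) = @_;
    my $hit = $exact
        ? sub { $_[0]{start} >= $start && $_[0]{end} <= $end }    # fully inside
        : sub { $_[0]{start} <= $end   && $_[0]{end} >= $start }; # any overlap
    return grep { $hit->($_) } @gfs;
}

my @any    = in_range(100, 200, 0, @features);
my @inside = in_range(100, 200, 1, @features);
```

Against the range 100-200, the first two features overlap, but only the second is fully contained.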

intersect_overlap( $GeneFeatureSet2, $tolerance, $single, $strand, $soft, $tag )
Method to generate a new GeneFeatureSet object based on intersection with second GeneFeatureSet object (``GeneFeatureSet2''), on a GeneFeature by GeneFeature basis (using the GeneFeature ``match'' method). Refer to the GFF::GeneFeature::match() method for information on $tolerance, $single and $strand method arguments.

The optional argument '$soft', when 'true' (non-null), specifies that intersection match copies of GeneFeatures from both the invoking and $GeneFeatureSet2 sets are kept in the intersection set (i.e. the intersection is based upon coordinate matches, but is a 'union' of the matching GeneFeatures). This is useful for situations where one needs to use the geometric intersection set in some context where the feature identity must be preserved (e.g. when taking a difference set of features from a given set which are not in the geometric intersection set).

GFF Version 2: The optional '$tag' argument is a [group] tag which is used to mark up the gene feature matches, using the match description.

intersect_overlap_merge( $GeneFeatureSet2, $tolerance, $strand, $tag )
Method to generate a new GeneFeatureSet object based on intersection merger of features in a second GeneFeatureSet object (``GeneFeatureSet2''), on a GeneFeature by GeneFeature basis (using the GeneFeature ``overlap_merge'' method; refer to the GFF::GeneFeature::overlap_merge() method for information on $tolerance and $strand method arguments). Note: the method returns 'copied' versions of the overlapping invoking features.

GFF Version 2: The optional '$tag' argument is a [group] tag which is used to mark up the gene feature matches, using the match description. This value is also used to rename the <feature> field of the overlap merged gene feature.

intersect_overlap_set( $GeneFeatureSet2, $tolerance, $single, $strand )
Check for exact match in position between two GeneFeatureSet objects (Q: Tim: is this meant to be an exact overlap?) Refer to the GFF::GeneFeature::match() method for information on $tolerance, $single and $strand method arguments.

intersect_overlap_matches( $GeneFeatureSet2, $tolerance, $single, $strand, $verbose, $file )
Method to generate a new GeneFeatureSet object based on 'overlap' intersection with a second GeneFeatureSet object. The new object contains GeneFeatures from the first object, which have matches to GeneFeatures in the second GeneFeatureSet object. These matches are recorded in the new GeneFeatureSet object and may be retrieved by GeneFeature *Match*() methods. Refer to the GFF::GeneFeature::match() method for information on $tolerance, $single and $strand method arguments. If the $verbose flag is present and equal to 1, then errors are reported to $file (or STDOUT if $file is omitted). Note: a side effect of this method is to symmetrically record all GeneFeature matches in both the invoking GeneFeatureSet object and the second $GeneFeatureSet2 object.

order_gf( $descending, $by_end )
Method returns sorted array of GeneFeature objects in the GeneFeatureSet. By default, the array is sorted (low to high) by segment 'start' coordinates. If the optional '$descending' is defined and non-null, coordinate sorting is 'high to low'. If the optional '$by_end' argument is defined and non-null, then the feature 'end' coordinate is used for the sort, instead of the start coordinate.

order_gf_by_size( $descending )
Method returns sorted array of GeneFeature objects in the GeneFeatureSet, sorted from the smallest gene feature segment length to the largest gene segment length.

If the optional boolean '$descending' flag is given, the order is reversed, from largest to smallest.
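The sort orders described for order_gf() and order_gf_by_size() can be sketched with Perl's sort on plain hash references. The helper names and the 'name' key are hypothetical, added only so the resulting order is easy to read.

```perl
use strict;
use warnings;

my @features = (
    { name => 'a', start => 30, end => 90 },
    { name => 'b', start => 10, end => 95 },
    { name => 'c', start => 20, end => 25 },
);

# Sort by start (default) or end coordinate, ascending unless $descending is set.
sub order_by_coord {
    my ($descending, $by_end, @gfs) = @_;
    my $key    = $by_end ? 'end' : 'start';
    my @sorted = sort { $a->{$key} <=> $b->{$key} } @gfs;
    return $descending ? reverse(@sorted) : @sorted;
}

# Sort by segment length, smallest first unless $descending is set.
sub order_by_size {
    my ($descending, @gfs) = @_;
    my @sorted = sort { ($a->{end} - $a->{start}) <=> ($b->{end} - $b->{start}) } @gfs;
    return $descending ? reverse(@sorted) : @sorted;
}

my @by_start    = map { $_->{name} } order_by_coord(0, 0, @features);
my @by_end_desc = map { $_->{name} } order_by_coord(1, 1, @features);
my @by_size     = map { $_->{name} } order_by_size(0, @features);
```

Comparing end minus start is enough for ordering by size, since adding 1 to every length does not change the order.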


Methods to do calculations on GFF::GeneFeatureSet objects

count()
Method to count the number of GeneFeature objects in a GeneFeatureSet object

strands()
Method to return the strands of the GeneFeature objects which were found in the GeneFeatureSet object, in the form of a string.

frames()
Method to return the frames of the GeneFeature objects which were found in the GeneFeatureSet object, in the form of a string.

minScore( $matched )
Method to find minimum score of the GeneFeature members of a GeneFeatureSet object. If $matched is defined and equal to 1, then the method only evaluates objects which are matched to another GeneFeature.

maxScore( $matched )
Method to find maximum score of the GeneFeature members of a GeneFeatureSet object. If $matched is defined and equal to 1, then the method only evaluates objects which are matched to another GeneFeature.

avScore( $matched, $unweighted)
Method to find average score of the GeneFeature members of a GeneFeatureSet object. If $matched is defined and equal to 1, then the method only evaluates objects which are matched to another GeneFeature.

The '$unweighted' argument, when non-NULL, stipulates that a simple sum of scores rather than a feature length weighted sum of scores should be taken for the computation of the average.

scoreRange( $matched )
Method to return the minimum and maximum score information for members of a GeneFeatureSet object. Returns a list of minimum and maximum scores in an array context. Returns the spread (difference) between maximum and minimum in a scalar context. If $matched is defined and equal to 1, then the method only evaluates objects which are matched to another GeneFeature.
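Returning a pair in list context and a single spread value in scalar context is what Perl's wantarray is for. The sketch below shows the pattern on a bare list of scores; score_range() is a hypothetical helper, not the module's method, and it uses min and max from the core List::Util module.

```perl
use strict;
use warnings;
use List::Util qw(min max);

# Return (min, max) in list context, or max - min in scalar context.
sub score_range {
    my @scores = @_;
    return unless @scores;                  # nothing to measure
    my ($lo, $hi) = (min(@scores), max(@scores));
    return wantarray ? ($lo, $hi) : $hi - $lo;
}

my ($min, $max) = score_range(12.5, 40, 7.5);   # list context: the pair
my $spread      = score_range(12.5, 40, 7.5);   # scalar context: the difference
```

The caller selects the form simply by assigning to a list or to a scalar.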

maxScoreGf( \&filter )
Method to return the GeneFeature object having the maximum score in a GeneFeatureSet object. If the optional, user-defined &filter predicate function (taking a GeneFeature object reference as its argument; returning a value of 1 for inclusion or 0 for exclusion) is provided, then only the &filter defined subset of GeneFeatures is assessed for scores.

max_min_range( $text ) -- Tim's old version (deprecated)
Method to find minimum and maximum range of all members in a GeneFeatureSet object. The method returns (max_range, min_range) list pair unless the optional non-null '$text' argument is provided, in which case, the method returns the string ``min_range-max_range''.

min_max_range( $filter )
Method to find minimum and maximum range of all members in a GeneFeatureSet object. In a list context, the method returns (min_range, max_range) list pair; in a scalar context, the range difference value, max_range minus min_range, is returned.

An optional boolean predicate $filter function (taking a GeneFeature reference as its argument and returning 1 == inclusion, 0 == exclusion) may be used to limit range calculation to a specific subset of GeneFeatures in the invoking GeneFeatureSet object.

min_max_range_homol()
Method to find maximum and minimum range, of HomolGeneFeature members only, in a GeneFeatureSet object. In a list context, the method returns (min_range, max_range) list pair; in a scalar context, the range difference value, max_range minus min_range, is returned.

start_range()
Method to find the minimum start coordinate among members of a GeneFeatureSet object.

end_range()
Method to find the maximum end coordinate among members of a GeneFeatureSet object.

remap( $offset )
Method to add an $offset amount to the start and end coordinates of every GeneFeature in a GeneFeatureSet object (uses GFF::GeneFeature->remap()). Note: the start and end of the original gene features (not copies thereof) of the gene feature set are changed.

mask_length()
Calculates the sum of all segment lengths (end-start+1) of all the GeneFeatures in the GeneFeatureSet object. Method assumes non-overlapping features (i.e. that a self_overlap_merge() method call has been done first).

mask_length_true($max,$min)
Calculates true coverage of all the GeneFeatures in the GeneFeatureSet object. Need to supply maximum range ($max,$min). $min defaults to 1 if not specified.
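The reason mask_length() assumes non-overlapping features is easiest to see by comparing a plain sum of lengths with a coverage count on overlapping spans. Both helpers below are hypothetical, work on bare [start, end] pairs, and use a sweep over sorted spans; how mask_length_true() computes coverage internally is not stated in this document, so the sweep is an assumption made for illustration.

```perl
use strict;
use warnings;

my @spans = ([1, 10], [8, 16], [30, 40]);

# Plain sum of segment lengths: positions in overlapping spans are counted twice.
sub summed_length {
    my $total = 0;
    $total += $_->[1] - $_->[0] + 1 for @_;
    return $total;
}

# Coverage: count each covered position once, sweeping spans in start order.
sub covered_length {
    my ($total, $covered_to) = (0, undef);
    for my $s (sort { $a->[0] <=> $b->[0] } @_) {
        # Start counting after whatever has already been covered.
        my $from = (defined($covered_to) && $covered_to >= $s->[0])
                 ? $covered_to + 1
                 : $s->[0];
        next if $from > $s->[1];            # span adds nothing new
        $total += $s->[1] - $from + 1;
        $covered_to = $s->[1];
    }
    return $total;
}

my $summed  = summed_length(@spans);
my $covered = covered_length(@spans);
```

The spans 1-10 and 8-16 share positions 8 to 10, so the plain sum comes out three higher than the coverage.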

lengthStats($filter)
Calculates the mean of the segment lengths (end-start+1) of all the GeneFeatures in the GFF::GeneFeatureSet object.

In a scalar context, this method simply returns the mean value of feature lengths.

In a list context, the mean, standard deviation and underlying variable values of total number of features, total feature lengths, and sum of feature lengths squared, plus a reference to a (sparse) hash of tallies for each non-zero length class (keyed by lengths) are also returned, in that order respectively.

An optional boolean predicate $filter function (taking a GeneFeature reference as its argument and returning 1 == inclusion, 0 == exclusion) may be used to limit the calculation to a specific subset of GeneFeatures in the invoking GeneFeatureSet object.

The method returns 'undef' if no features are found upon which to compute statistics.
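The quantities listed above (count, sum of lengths, sum of squared lengths and a per-length tally) are enough to derive the mean and a standard deviation in one pass. The sketch below assumes the sample (n-1) form of the standard deviation, which this document does not specify, and uses bare [start, end] pairs with a hypothetical helper name.

```perl
use strict;
use warnings;

# Mean and sample standard deviation of segment lengths, from running sums.
sub length_stats {
    my @spans = @_;
    my ($n, $sum, $sumsq, %tally) = (0, 0, 0);
    for my $s (@spans) {
        my $len = $s->[1] - $s->[0] + 1;
        $n++;
        $sum   += $len;
        $sumsq += $len * $len;
        $tally{$len}++;                      # sparse tally keyed by length
    }
    return unless $n;                        # no features: undef in scalar context
    my $mean = $sum / $n;
    # Assumption: sample standard deviation (divide by n - 1).
    my $sd = $n > 1 ? sqrt(($sumsq - $n * $mean**2) / ($n - 1)) : 0;
    return wantarray ? ($mean, $sd, $n, $sum, $sumsq, \%tally) : $mean;
}

my @spans = ([1, 10], [1, 20], [1, 30]);     # lengths 10, 20 and 30
my ($mean, $sd, $n, $sum, $sumsq, $tally) = length_stats(@spans);
my $mean_only = length_stats(@spans);        # scalar context: just the mean
```

Keeping the running sums in the return list lets a caller pool statistics from several sets without revisiting the individual features.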

shared_matches( $GeneFeatureSet2 )
Method to return the number of (overlap) matching GeneFeature objects between the invoking GeneFeatureSet object and a second GeneFeatureSet object (``$GeneFeatureSet2'').

score( $tolerance, $nep, $net, \*VERBOSE )
Method returns a (4 x 3) array of scores for a set of GeneFeature objects which contain (overlap) matches to other GeneFeatures. This scoring provides the number of features (N) and fractional accuracy (specificity) and coverage (sensitivity) for exact matches, overlaps, 5' and 3' alignments of the gene feature matches, respectively. The $tolerance indicates how precisely the boundaries must match to count as a match to be scored. The $nep is the number of predicted exons and the $net is the number of true exons. If a score is ``infinite'' due to division by zero, a -1 is returned for that value. The optional '\*VERBOSE' argument, if defined, is a file device for dumping of detailed (signed) match offset statistics.


Revision History

2.093 (24/11/99) - rbsk: - fixed logical errors and performance issues in intersect_overlap*() methods. Thanks to Alessandro Guffante for bringing these to my attention ;-) - order_gf() argument semantics changed: the single $reverse argument is replaced by $descending, to designate 'high to low' sorting by coordinates; an $end argument was added to force use of the gene feature end coordinate instead of the start coordinate. The old '$reverse' behaviour is thus obtained with $descending == $end == non-null; the method still defaults to 'start coordinate, ascending sort'.

2.092 (19/11/99) - rbsk: - cluster() method: for matches, the comparison function can now return a non-null $name string for unique labelling of the cluster, which becomes the ##sequence-region $name for the cluster - dump_matches(): $show_nomatches argument added. Normally, only features with matches are dumped; the optional boolean flag '$show_nomatches', when defined and non-null, directs that 'no match' records are reported too.

2.091 (10/11/99) - rbsk: - major algorithmic performance enhancement of pairwise overlap methods!

2.090 (5/11/99) - rbsk: - added the 'read_header()' method

2.089 (30/10/99) - rbsk: - added the 'pipe()' method

2.088 (21/10/99) - rbsk: - added GFF complement() method

2.087 (13/10/99) - rbsk: - added lengthStats() method

2.086 (12/10/99) - rbsk: - added method nextGeneFeature() (FIFO inverse of addGeneFeature()).

2.085 (3/10/99) - rbsk: - $group_tag in self_overlap_merge() access CODE ref and subsumed into GFF::GeneFeature::overlap_merge() method.

2.084 (30/9/99) - rbsk: - $tag argument in dump() - $group_tag argument moved over in self_overlap_merge() - using order_gf() in *overlap*() methods (and run tracing)

2.083 (27/9/99) - rbsk: - $strand argument in self_overlap_merge() && intersect_range() - make $strict mode in self_overlap() strand sensitive

2.082 (21/9/99) - rbsk: - added optional '$tag' argument to intersect_overlap() - created intersect_overlap_merge() method - created the deleteTag() method

2.081 (9/9/99) - rbsk: order_by_gf_size() method added

3/9/99 - rbsk: $exclude to addGeneFeature() method; exclude() method added

31/8/99 - rbsk: self_overlap_merge() method: $group_tag specification allows for recording of merged features (by $group_tag); optional '$tolerance' value provides for overlap merge where the two features lie within $tolerance base pairs of each other

21/7/99 - rbsk: rewriteField() $target matches made case insensitive and framed by /^...$/

12/7/99 - rbsk: creation from miscellaneous GFF analysis code; transferred methods makeGenes(), constructGene() and mRNA from the GFF::GeneFeatureSet to new GFF::Analysis module

7/7/99 - rbsk: read() $echo argument added to provide user feedback during reading in of large GFF files...

5/7/99 - rbsk: transcript() method renamed to mRNA() - biologically more accurate :-)

28/6/99 - rbsk: transcript() method added

14/6/99 - rbsk: recoded self_overlap() slightly to account for undefined group_values

24/5/99 - rbsk: $tag argument to self_overlap(); new() region() setting bug fixed

21/5/99 - rbsk: revised the rewriteField() method (see above) ; Note that the $group_tag/value & $saveold arguments were removed from this method

14/5/99 - rbsk: created the 'copy()' method

7/5/99 - rbsk: reinserted Tim's old max_min_range() method, for backwards compatibility (deprecated?)

6/5/99 - rbsk: added '$verbose' argument to score() method

28/4/99 - rbsk: renamed GFF.pm to GFF::GeneFeatureSet.pm

27/4/99 - rbsk: added '$filter' argument to min_max_range(); added minScore() and scoreRange() methods

26/4/99 - rbsk: renamed max_min_range() to min_max_range()

23/4/99 - rbsk: addGeneFeatureSet() documentation fixed...

21/4/99 - rbsk: intersect_range(), containsMembers(), and getAllMembers() methods added; rewriteField(): added '$saveold' argument; $copy argument added to group(), addGeneFeature(), addGeneFeatureSet() methods

19/4/99 - rbsk: deletion bug in self_overlap() fixed... GeneFeatureSetPair functionality (score() method et al.) merged with GeneFeatureSet.pm

16/4/99 - rbsk: added $exact flag to self_overlap() method; overlap method debugged too; added $soft flag to intersect_overlap()

1/4/99 - rbsk: order_gf(): optional $reverse argument added to reverse the sorting order (high to low, by segment end coordinates).

26/3/99 - rbsk: new method 'rewriteField()' added.

23/3/99 - rbsk: $strict flag added to self_overlap() method

16/3/99 - rbsk: GeneFeatureSet objects now subclassed from GeneFeatureSetObject class; moved ``version()'' method from GeneFeatureSet.pm into the new base class GeneFeatureSetObject.pm

25/2/99 - rbsk:
- Extensively revised and improved the documentation
- Added Version 2 GeneFeatureSet code, including &version class function
- Converted all ``constructor'' type methods in all GeneFeatureSet libraries to class methods (i.e. must now be invoked as class->new*(args) or ``new class args'')
- Standardize all file glob arguments to \*FILE references
- Use ``croak'' instead of ``die'' in theFeature
- Added GeneFeatureSet ``intersection'' set operation; renamed ``intersect_not'' to ``difference''
- Default $type for read_parse is now ``GeneFeature''
- Added ``frames'' method; renamed ``strand'' to ``strands''
- max_min_range() modified to sense array v/s scalar context
- max_min_range_homol() just ignores non-HomolGen

SEE ALSO

GFF, GFF::GeneFeature, GFF::HomolGeneFeature, GFF::Analysis, GFF::GifGFF.