Sanger Institute, Wellcome Trust Genome Campus, Cambs, UK. All rights reserved.
This program is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation.
How to Read Method Protocols
Normal Perl data type notations are used for argument declarations in the method protocols. A backslash denotes argument passing by reference. Class methods are invoked using the 'class->method(args)' or 'method class args' Perl call formats.
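The notation above can be illustrated with a minimal sketch. The class Demo::Set and its add_all() method are hypothetical stand-ins invented for this example; they are not part of the GFF modules.

```perl
use strict;
use warnings;

# Hypothetical stand-in class, for illustration only.
package Demo::Set;

sub new {
    my ($class, @members) = @_;
    return bless { members => [@members] }, $class;
}

# Takes an array by reference (written \@array in a method protocol).
sub add_all {
    my ($self, $array_ref) = @_;
    push @{ $self->{members} }, @$array_ref;
    return scalar @{ $self->{members} };
}

package main;

my $set_a = Demo::Set->new('exon1');    # 'class->method(args)' format
my $set_b = new Demo::Set('exon1');     # 'method class args' format

my @more  = ('exon2', 'exon3');
my $count = $set_a->add_all(\@more);    # backslash: argument passed by reference
print "set A now holds $count members\n";
```

Both call formats construct the same kind of object; the arrow form is generally the less ambiguous of the two.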
$start and $end may also be provided.
addGeneFeature() method calls.
Reading stops at the first non-comment field encountered; the method returns non-null if meta-comments were encountered.
$string and return a $string; lines converting to an empty string are skipped by the read. An optional, user-defined predicate (boolean) function ``&filter'' tests the resulting GeneFeature object (given as its argument) for conditional inclusion in the GeneFeatureSet object, based upon user criteria. If a ``&filter'' is not provided, then the GeneFeature is unconditionally included. Comment lines (lines beginning with a ``#'') are also skipped.
Use the GFF::trace(1) command to have read input traced to \*STDERR.
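The role of an optional ``&filter'' predicate during reading can be sketched as follows. This is an illustration only: read_features() is a hypothetical function, plain hashes stand in for GeneFeature objects, and only the first six tab-delimited fields are parsed.

```perl
use strict;
use warnings;

# Hypothetical reader; plain hashes stand in for GeneFeature objects.
sub read_features {
    my ($fh, $filter) = @_;
    my @kept;
    while (my $line = <$fh>) {
        chomp $line;
        next if $line =~ /^#/;       # comment lines are skipped
        next if $line =~ /^\s*$/;    # empty lines are skipped
        my ($seqname, $source, $feature, $start, $end, $score) = split /\t/, $line;
        my $gf = {
            seqname => $seqname, source => $source, feature => $feature,
            start   => $start,   end    => $end,    score   => $score,
        };
        # Without a filter, every feature is included unconditionally.
        push @kept, $gf if !$filter || $filter->($gf);
    }
    return \@kept;
}

my $gff = join "\n",
    "# a comment line",
    "seq1\tgenscan\texon\t100\t200\t0.9",
    "seq1\tgenscan\tintron\t201\t300\t0.1",
    "";

open my $in, '<', \$gff or die "cannot open in-memory file: $!";
my $exons = read_features($in, sub { $_[0]{feature} eq 'exon' });
close $in;

open $in, '<', \$gff or die "cannot open in-memory file: $!";
my $all = read_features($in);    # no filter supplied
close $in;

printf "%d exon(s) kept, %d feature(s) in total\n", scalar @$exons, scalar @$all;
```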
$gf and return a $gf or 0; features returning 0 are skipped. Comment lines (lines beginning with a ``#'') are piped verbatim to output.
The merit of this method is that it does not read the whole GFF file into memory, so one can use a filter function to make small, simple sequential modifications to a GFF file without incurring a large memory overhead.
Use the GFF::trace(1) command to have read input traced to \*STDERR.
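The streaming behaviour described above can be sketched in a few lines. This is not the module's implementation: pipe_features() is a hypothetical function, and each feature is represented as a reference to an array of tab-split fields rather than a GeneFeature object.

```perl
use strict;
use warnings;

# Hypothetical line-by-line pipe: only one line is held in memory at a time.
sub pipe_features {
    my ($in, $out, $filter) = @_;
    while (my $line = <$in>) {
        if ($line =~ /^#/) {             # comments are piped verbatim
            print {$out} $line;
            next;
        }
        chomp $line;
        my @fields = split /\t/, $line;
        my $result = $filter->(\@fields);
        next unless $result;             # a return value of 0 skips the feature
        print {$out} join("\t", @$result), "\n";
    }
    return;
}

my $input = "##gff-version 2\n"
          . "seq1\told\texon\t100\t200\t0.9\n"
          . "seq1\told\texon\t300\t400\t0.1\n";

open my $in,  '<', \$input     or die "cannot read in-memory input: $!";
open my $out, '>', \my $buffer or die "cannot write in-memory output: $!";

pipe_features($in, $out, sub {
    my ($f) = @_;
    return 0 if $f->[5] < 0.5;   # drop low-scoring features
    $f->[1] = 'new';             # small sequential modification: rename the source
    return $f;
});

close $in;
close $out;
print $buffer;
```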
$source string argument to the function. The &name_parser argument is a user-defined function which takes two arguments $group and \@array, where @array is assumed to be the (delimiter split) remainder of a line of data beyond the group field. An optional, user-defined predicate (boolean) function ``&filter'' tests the resulting GeneFeature object (given as its argument) for inclusion in the GeneFeatureSet object, based upon user criteria. If a ``&filter'' is not provided, then the GeneFeature is unconditionally included. Comment lines (lines beginning with a ``#'') are also skipped.
&parser protocol details). The resulting GeneFeature object references are added to the invoking GeneFeatureSet object. An optional, user-defined predicate (boolean) function ``&filter'' tests the resulting GeneFeature objects (given as its first argument) for inclusion in the GeneFeatureSet object, based upon user criteria. If a ``&filter'' is not provided, then the GeneFeature is unconditionally included. Comment lines (lines beginning with a ``#'') are also skipped.
The optional $type parameter may be set to ``HomolGeneFeature'' or ``GeneFeature'' (defaults to ``GeneFeature'', if not specified).
The ``$tab'' argument is a boolean flag, where a ``true'' (non-null) value directs the use of a tab as the field delimiter in the output line. Otherwise, blank space is used as the delimiter (default is ``true'' if not specified).
The ``$newline'' argument is passed to GeneFeature dump_string, which passes it on to GFF::GeneFeature::dump_group() to affect group printing.
The ``$flen'' argument is a boolean flag, where a non-null value stipulates that the length of the current output line should be printed as an extra field at the end of the output line (assumed null if not specified). Note: the extra length of this field is *not* added to the displayed line size, but the extra field is tab delimited if $tab is set.
The ``$inorder'' argument is a boolean flag (default: true) which forces sorting of the GeneFeatures during dump by ``start'' coordinate order. Users may wish to suppress sorting (i.e. explicitly set $inorder to 0, ``false''), for performance reasons, when the GeneFeatureSet file is large.
The optional $tag argument controls dumping of [group] fields (see GFF::GeneFeature::dump()).
$GeneFeatureSet2 above).
$GeneFeatureSet2 above).
&function should be designed to accept a reference to a single GeneFeature object and to return the predicate (boolean) outcome of some test upon that object reflecting the user filter criterion: 1 implying inclusion, 0 implying exclusion of the GeneFeature from the new GeneFeatureSet object.
filter(), in that the discriminant function value 1 implies *exclusion* and 0 implies inclusion. This may be handy in that the same discriminant functions can therefore be used to partition a set by using filter() and exclude() sequentially.
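Partitioning a set with a single discriminant function can be sketched as below. The functions filter_set() and exclude_set() are hypothetical stand-ins that mimic the documented semantics of filter() and exclude() on plain arrays of hashes; they are not the module's methods.

```perl
use strict;
use warnings;

# Hypothetical stand-ins: 1 from the predicate means "include" for
# filter_set() and "exclude" for exclude_set().
sub filter_set  { my ($set, $f) = @_; return [ grep {  $f->($_) } @$set ]; }
sub exclude_set { my ($set, $f) = @_; return [ grep { !$f->($_) } @$set ]; }

my @features = (
    { feature => 'exon',   start => 100, end => 200 },
    { feature => 'intron', start => 201, end => 299 },
    { feature => 'exon',   start => 300, end => 400 },
);

my $is_exon = sub { $_[0]{feature} eq 'exon' ? 1 : 0 };

# The same predicate yields two disjoint subsets covering the whole set.
my $exons  = filter_set(\@features, $is_exon);
my $others = exclude_set(\@features, $is_exon);

printf "%d exon(s), %d other feature(s)\n", scalar @$exons, scalar @$others;
```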
$field (specified by a string value 'SEQNAME', 'SOURCE', 'FEATURE' or 'GROUP' - case insensitive) in all GeneFeatures that match the specified $target value (which can either be a simple identifier or a Perl regular expression; for GFF Version 1 'GROUP' fields, $target matches the field itself; for GFF Version 2 'GROUP' fields, $target should be a simple identifier matching a tag of some tag-value). If $field is 'SEQNAME', the return object 'sequence-region' is renamed to the first field matched.
The specified field is overwritten with the specified $rewrite value (which may be a simple identifier or a full Perl search & replace expression, namely, ``s/<search>/<replacement>/''). Note: the 's' should be the very first character in the string and should be immediately followed by a non-alphanumeric delimiter character, for this to work properly. Backreferenced ($1 et al.) replacement values and the 'g', 'i' & 'x' search modifiers are permitted.
Note: if the 'target' is simply an asterisk '*' ('wildcard'), then all fields of the designated type are rewritten with the $rewrite specification.
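How a $rewrite string might be told apart as either a simple identifier or a search & replace expression can be sketched as follows. This is only one plausible reading of the rule stated above ('s' first, then a non-alphanumeric delimiter); parse_rewrite() is a hypothetical helper and does not reproduce the module's own parsing.

```perl
use strict;
use warnings;

# Hypothetical helper: classify a $rewrite specification.
sub parse_rewrite {
    my ($rewrite) = @_;
    # 's' must be the first character, immediately followed by a
    # non-alphanumeric delimiter character.
    if ($rewrite =~ /^s([^A-Za-z0-9\s])/) {
        my $d = quotemeta $1;
        if ($rewrite =~ /^s${d}(.*?)${d}(.*?)${d}([gix]*)$/) {
            return { search => $1, replacement => $2, modifiers => $3 };
        }
    }
    # Anything else is treated as a simple identifier to write verbatim.
    return { literal => $rewrite };
}

my $subst   = parse_rewrite('s/^chr(\d+)$/Chr$1/i');
my $literal = parse_rewrite('sequence1');

print "search:      $subst->{search}\n";
print "replacement: $subst->{replacement}\n";
print "modifiers:   $subst->{modifiers}\n";
print "literal:     $literal->{literal}\n";
```

Note that 'sequence1' begins with an 's' but is followed by an alphanumeric character, so it is treated as a plain identifier.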
The new GeneFeatureSet object returned only contains GeneFeatures which triggered a rewrite, unless the optional 'returnAll' boolean flag argument is defined and non-null, in which case a COPY of all GeneFeatures is returned, whether or not it was modified.
If the optional $record argument is set (to a simple tag identifier), then the old $field value is recorded as the $record [group] field tag value.
&comparison function reference is the operational user definition of this shared attribute, which when given the references to each of two GeneFeature objects, returns a 1 implying inclusion in a cluster, or 0 implying exclusion from a cluster, based upon shared attributes. If the $single flag is set to 1 (assumed 0 if omitted) then all singular GeneFeature objects not assigned to a group are added to the array as independent, single-membered GeneFeatureSet object groups.
&discriminator function returns undef or a null string for a given GeneFeature object argument, then that GeneFeature is ignored by the method. For GeneFeatures with a non-null &discriminator return value, the method uses this return value as a hash key to maintain a cumulative count of the occurrence of this return value (``feature type'') in the current GeneFeatureSet object.
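The counting scheme just described can be sketched as below, assuming plain hashes as stand-ins for GeneFeature objects. The function name count_types() is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical tally: the discriminator's return value is used as a hash key.
sub count_types {
    my ($set, $discriminator) = @_;
    my %count;
    for my $gf (@$set) {
        my $key = $discriminator->($gf);
        next unless defined $key && $key ne '';   # undef or null string: ignored
        $count{$key}++;
    }
    return \%count;
}

my @features = (
    { source => 'genscan', feature => 'exon' },
    { source => 'genscan', feature => 'exon' },
    { source => 'blastx',  feature => 'similarity' },
    { source => 'manual',  feature => '' },       # ignored by the tally
);

my $counts = count_types(\@features, sub { $_[0]{feature} });
print "$_: $counts->{$_}\n" for sort keys %$counts;
```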
&discriminator function returns a single value, based upon a user-defined ``&discriminator'' function which takes a GeneFeature object as its input parameter and returns a (string) key value characterizing the feature types of interest. If such a singular feature is found, then it is returned with its frequency of occurrence in the GeneFeatureSet file.
&discriminator function taking a GeneFeature object reference as the input argument and returning a unique name string labelling the attribute. The group method returns a hash of GeneFeatureSet object references, key indexed by the attribute name strings. The (meta-comment) 'sequence-region' start and end coordinates are set to the minimum and maximum start and end respectively of the GeneFeatures included into each group GeneFeatureSet object.
If the (optional) $copy switch is 'true' (non-null) then the new group GeneFeatureSet objects (dereferenced by the hash) are composed of copies of the original GeneFeatures. In other words, modifications of these new GeneFeatures will not modify the GeneFeature objects in the original object.
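Grouping by a discriminator, with the region bounds and the effect of $copy, can be sketched as follows. This is an illustration under stated assumptions: group_by() is a hypothetical function, plain hashes stand in for both GeneFeature and GeneFeatureSet objects, and the copy is a shallow one.

```perl
use strict;
use warnings;

# Hypothetical grouping: returns a hash of groups keyed by attribute name.
sub group_by {
    my ($set, $discriminator, $copy) = @_;
    my %group;
    for my $gf (@$set) {
        my $name = $discriminator->($gf);
        my $g = $group{$name} ||= { start => $gf->{start}, end => $gf->{end}, members => [] };
        # Region spans the minimum start and maximum end of the members.
        $g->{start} = $gf->{start} if $gf->{start} < $g->{start};
        $g->{end}   = $gf->{end}   if $gf->{end}   > $g->{end};
        # With $copy set, members are (shallow) copies of the originals.
        push @{ $g->{members} }, $copy ? { %$gf } : $gf;
    }
    return \%group;
}

my @features = (
    { source => 'genscan', start => 300, end => 400 },
    { source => 'genscan', start => 100, end => 200 },
    { source => 'blastx',  start => 150, end => 250 },
);

my $groups = group_by(\@features, sub { $_[0]{source} }, 1);

# Modifying a copied member leaves the original feature untouched.
$groups->{genscan}{members}[0]{start} = 1;

printf "genscan region: %d-%d\n", $groups->{genscan}{start}, $groups->{genscan}{end};
printf "original start is still %d\n", $features[0]{start};
```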
$tag of [group] tag-value pairs, from a specified GeneFeature record, of the invoking gene feature set, which matches the given $source and $feature (which may be Perl regular expressions). $source and $feature may also be undef or '*', designating that any source or feature can match. The $tag should be a simple tag identifier (*not* a Perl regex). Only values from the first such $tag encountered in the gene feature set are returned. Returns 'undef' if no such tag-value list is found.
Method to test whether or not GeneFeatureSet contains members.
$membergroups has members.
The $source, $feature and $strand arguments are used to label the <source>, <feature> and <strand> fields respectively (default: 'GFF_Complement' for <source> and/or <feature>; '.' for <strand>).
The '$tag' value is a [group] field tag, common to features in the input GeneFeatureSet, which is used to annotate the complementary features in the form [$tag $feature1 $feature2] where $feature1 and $feature2 are the features flanking the newly generated complement feature. If complementary features are generated at the start and/or end of the host coordinate range, then the special names 'Start' and 'End' are used as feature names for the $tag labelling.
The '$append' argument, if defined and non-null, directs that the new GeneFeatureSet of complemented features is appended to the invoking GeneFeatureSet, which itself is returned.
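The idea of complementary features and their flanking labels can be sketched as below. The details are assumptions made for illustration: coordinates are treated as inclusive, each feature carries a hypothetical 'name' key standing in for its $tag value, and complement() here is a free-standing function rather than the module's method.

```perl
use strict;
use warnings;

# Hypothetical complement: returns the gaps between features in a host range.
sub complement {
    my ($features, $range_start, $range_end) = @_;
    my @gaps;
    my ($cursor, $left) = ($range_start, 'Start');
    for my $gf (sort { $a->{start} <=> $b->{start} } @$features) {
        if ($gf->{start} > $cursor) {
            push @gaps, {
                start => $cursor,
                end   => $gf->{start} - 1,
                flank => [ $left, $gf->{name} ],
            };
        }
        # Advance past this feature (also copes with overlapping features).
        if ($gf->{end} + 1 > $cursor) {
            $cursor = $gf->{end} + 1;
            $left   = $gf->{name};
        }
    }
    push @gaps, { start => $cursor, end => $range_end, flank => [ $left, 'End' ] }
        if $cursor <= $range_end;
    return \@gaps;
}

my @exons = (
    { name => 'e2', start => 300, end => 400 },
    { name => 'e1', start => 100, end => 200 },
);

my $gaps = complement(\@exons, 1, 1000);
printf "%d-%d [%s %s]\n", $_->{start}, $_->{end}, @{ $_->{flank} } for @$gaps;
```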
$file is omitted or undef; output is suppressed if a null (zero) value is passed to the method for \*OUTPUT. Note: the method does not test whether the features being merged match in kind. Thus the user is responsible for making sure that the GeneFeatureSet object only contains features mergeable in the semantic sense (e.g. all the features are exons predicted by a single prediction algorithm).
If the optional $strict flag is set 'true' (non-null) then only overlapping Gene Features which match identically with respect to <seqname>, <source>, <feature> and <strand> (and [group] $tag -- see below) are deleted (i.e. only truly 'duplicate' records are deleted). $strict defaults to 'false' if omitted.
The optional $exact flag specifies that an exact match is required for overlaps (defaults to false if the argument is omitted or undef).
If '$tag' is defined, then the specified [group] tag-value must also match for $strict matches.
The optional $tolerance value is passed on to the GFF::GeneFeature::overlap_merge() method.
Defined, non-null '$strand' boolean flag forces the merging to be strand sensitive.
The optional $group_tag argument is passed to the GFF::GeneFeature::overlap_merge() function (which see). This may be a simple [group] tag identifier or a modifier function.
The optional $addscores argument stipulates that all merged feature scores are added to the merged version of the feature.
$start and $end range. By default, GeneFeatures are included if they overlap at all with the range.
Use of the optional $exact boolean flag (default: 'false') specifies that both the start and end of GeneFeatures must lie completely within the specified range. Otherwise, any overlap with the range counts as a hit.
The optional $strand ('+', '-' or '.') argument forces the intersections to be strand sensitive (default: ignore the strand value).
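The difference between the default overlap test, the $exact test and strand sensitivity can be sketched as below. The function in_range() is a hypothetical helper operating on plain hashes with inclusive coordinates; it is not the module's method.

```perl
use strict;
use warnings;

# Hypothetical range test for a single feature.
sub in_range {
    my ($gf, $start, $end, $exact, $strand) = @_;
    # Strand is only considered when a $strand argument is supplied.
    return 0 if defined $strand && $gf->{strand} ne $strand;
    if ($exact) {
        # Both ends must lie within the range.
        return ($gf->{start} >= $start && $gf->{end} <= $end) ? 1 : 0;
    }
    # Default: any overlap with the range counts.
    return ($gf->{start} <= $end && $gf->{end} >= $start) ? 1 : 0;
}

my $inside   = { start => 150, end => 180, strand => '+' };
my $straddle = { start => 190, end => 250, strand => '-' };

printf "inside:   overlap=%d exact=%d\n",
    in_range($inside, 100, 200), in_range($inside, 100, 200, 1);
printf "straddle: overlap=%d exact=%d plus-strand-only=%d\n",
    in_range($straddle, 100, 200), in_range($straddle, 100, 200, 1),
    in_range($straddle, 100, 200, 0, '+');
```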
$single and $strand method arguments.
The optional argument '$soft', when 'true' (non-null), specifies that intersection match copies of GeneFeatures from both the invoking and $GeneFeatureSet2 sets are kept in the intersection set (i.e. the intersection is based upon coordinate matches, but is a 'union' of the matching GeneFeatures). This is useful for situations where one needs to use the geometric intersection set in some context where the feature identity must be preserved (e.g. when taking a difference set of features from a given set which are not in the geometric intersection set).
GFF Version 2: The optional '$tag' argument is a [group] tag which is used to mark up the gene feature matches, using the match description.
$tolerance and $strand method arguments. Note: the method returns 'copied' versions of the overlapping invoking features.
GFF Version 2: The optional '$tag' argument is a [group] tag which is used to mark up the gene feature matches, using the match description. This value is also used to rename the <feature> field of the overlap merged gene feature.
$single and $strand method arguments.
$single and $strand method arguments.
If the $verbose flag is present and equal to 1, then errors are reported to $file (or STDOUT if $file is omitted). Note: a side effect of this method is to symmetrically record all GeneFeature matches in both the invoking GeneFeatureSet object and the second $GeneFeatureSet2 object.
If the optional boolean '$descending' flag is given, the order is reversed, from largest to smallest.
$matched is defined and equal to 1, then the method only evaluates objects which are matched to another GeneFeature.
$matched is defined and equal to 1, then the method only evaluates objects which are matched to another GeneFeature.
$matched is defined and equal to 1, then the method only evaluates objects which are linked to another GeneFeature.
The '$unweighted' argument, when non-NULL, stipulates that a simple sum of scores rather than a feature length weighted sum of scores should be taken for the computation of the average.
$matched is defined and equal to 1, then the method only evaluates objects which are matched to another GeneFeature.
An optional boolean predicate $filter function (taking a GeneFeature reference as its argument and returning 1 == inclusion, 0 == exclusion) may be used to limit range calculation to a specific subset of GeneFeatures in the invoking GeneFeatureSet object.
$offset amount to the start and end coordinates of every GeneFeature in a GeneFeatureSet object (uses GFF::GeneFeature->remap()). Note: the start and end of the original gene features (not copies thereof) of the gene feature set are changed.
self_overlap_merge() method call has been done first).
$min defaults to 1 if not specified.
In a scalar context, this method simply returns the mean value of feature lengths.
In a list context, the mean, standard deviation and underlying variable values of total number of features, total feature lengths, and sum of feature lengths squared, plus a reference to a (sparse) hash of tallies for each non-zero length class (keyed by lengths) are also returned, in that order respectively.
An optional boolean predicate $filter function (taking a GeneFeature reference as its argument and returning 1 == inclusion, 0 == exclusion) may be used to limit the calculation to a specific subset of GeneFeatures in the invoking GeneFeatureSet object.
The method returns 'undef' if no features are found upon which to compute statistics.
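The context-sensitive return values described above can be sketched as follows. Several details are assumptions made for illustration: feature length is taken as end - start + 1, the standard deviation is the sample (n - 1) form, and length_stats() is a hypothetical function working on plain hashes.

```perl
use strict;
use warnings;

# Hypothetical length statistics with scalar/list context sensitivity.
sub length_stats {
    my ($set, $filter) = @_;
    my ($n, $sum, $sumsq, %tally) = (0, 0, 0);
    for my $gf (@$set) {
        next if $filter && !$filter->($gf);      # optional predicate
        my $len = $gf->{end} - $gf->{start} + 1; # assumed inclusive coordinates
        $n++;
        $sum   += $len;
        $sumsq += $len**2;
        $tally{$len}++;                          # sparse tally keyed by length
    }
    return unless $n;                            # nothing to compute statistics on

    my $mean = $sum / $n;
    my $var  = $n > 1 ? ($sumsq - $n * $mean**2) / ($n - 1) : 0;
    $var = 0 if $var < 0;                        # guard against rounding error
    my $sd = sqrt $var;

    return wantarray ? ($mean, $sd, $n, $sum, $sumsq, \%tally) : $mean;
}

my @features = (
    { feature => 'exon', start => 100, end => 200 },   # length 101
    { feature => 'exon', start => 300, end => 350 },   # length 51
);

my $mean_only = length_stats(\@features);                        # scalar context
my ($mean, $sd, $n, $sum, $sumsq, $tally) = length_stats(\@features);  # list context
my $none = length_stats(\@features, sub { 0 });                  # filter rejects all

printf "mean=%.1f sd=%.3f n=%d\n", $mean, $sd, $n;
```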
$tolerance indicates how precisely the boundaries must match to count as a match to be scored. The $nep is the number of predicted exons and the $net is the number of true exons. If a score is ``infinite'' due to division by zero, a -1 is returned for that value. The optional '\*VERBOSE' argument, if defined, is a file device for dumping of detailed (signed) match offset statistics.
order_gf() argument semantics changes: the single $reverse argument is replaced by $descending, to designate 'high to low' sorting by coordinates, $end argument added to force usage of the gene feature end coordinate instead of the start coordinate. The old '$reverse' argument is thus replaced by $descending == $end == non-null; the method still defaults to 'start coordinate, ascending sort';
2.092 (19/11/99) - rbsk: - cluster() method, for matches, the comparison function can now return a non-null $name string for unique labelling of the cluster which becomes the ##sequence-region $name for the cluster - dump_matches(): $show_nomatches argument added; Normally, only features with matches are dumped. The optional boolean flag '$show_nomatches' when defined and non-null, directs that 'no match' records are reported too.
2.091 (10/11/99)- rbsk: - major algorithmic performance enhancement of pairwise overlap methods!
2.090 (5/11/99) - rbsk: - added the 'read_header()' method
2.089 (30/10/99)- rbsk: - added the 'pipe()' method
2.088 (21/10/99)- rbsk: - added GFF complement() method
2.087 (13/10/99)- rbsk: - added lengthStats() method
2.086 (12/10/99)- rbsk: - added method nextGeneFeature() (FIFO inverse of addGeneFeature()).
2.085 (3/10/99) - rbsk: - $group_tag in self_overlap_merge() access CODE ref and subsumed into GFF::GeneFeature::overlap_merge() method.
2.084 (30/9/99) - rbsk: - $tag argument in dump() - $group_tag argument moved over in self_overlap_merge() - using order_gf() in *overlap*() methods (and run tracing)
2.083 (27/9/99) - rbsk: - $strand argument in self_overlap_merge() && intersect_range() - make $strict mode in self_overlap() strand sensitive
2.082 (21/9/99) - rbsk: - added optional '$tag' argument to intersect_overlap() - created intersect_overlap_merge() method - created the deleteTag() method
2.081 (9/9/99) - rbsk: order_by_gf_size() method added
3/9/99 - rbsk: $exclude to addGeneFeature() method; exclude() method added
31/8/99 - rbsk: self_overlap_merge() method: $group_tag specification allows for recording of merged features (by $group_tag) optional '$tolerance' value provides for overlap merge where the two features lie within $tolerance base pairs of each other
21/7/99 - rbsk: rewriteField() $target matches made case insensitive and framed by /^...$/
12/7/99 - rbsk: creation from miscellaneous GFF analysis code; transferred methods makeGenes(), constructGene() and mRNA from the GFF::GeneFeatureSet to new GFF::Analysis module
7/7/99 - rbsk: read() $echo argument added to provide user feedback during reading in of large GFF files...
5/7/99 - rbsk: transcript() method renamed to mRNA() - biologically more accurate :-)
28/6/99 - rbsk: transcript() method added
14/6/99 - rbsk: recoded self_overlap() slightly to account for undefined group_values
24/5/99 - rbsk: $tag argument to self_overlap() new() region() setting bug fixed
21/5/99 - rbsk: revised the rewriteField() method (see above) ; Note that the $group_tag/value & $saveold arguments were removed from this method
14/5/99 - rbsk: created the 'copy()' method
7/5/99 - rbsk: reinserted Tim's old max_min_range() method, for backwards compatibility (deprecated?)
6/5/99 - rbsk: added '$verbose' argument to score() method
28/4/99 - rbsk: renamed GFF.pm to GFF::GeneFeatureSet.pm
27/4/99 - rbsk: added '$filter' argument to min_max_range() added minScore() and scoreRange() methods
26/4/99 - rbsk: renamed max_min_range() to min_max_range()
23/4/99 - rbsk: addGeneFeatureSet() documentation fixed...
21/4/99 - rbsk: intersect_range(), containsMembers(), and getAllMembers() methods added rewriteField(): added '$saveold' argument. $copy argument added to group(), addGeneFeature(), addGeneFeatureSet() methods
19/4/99 - rbsk: deletion bug in self_overlap() fixed... GeneFeatureSetPair functionality (score() method et al.) merged with GeneFeatureSet.pm
16/4/99 - rbsk: added $exact flag to self_overlap() method; overlap method debugged too added $soft flag to intersect_overlap()
1/4/99 - rbsk: order_gf: optional $reverse argument to reverse sorting order to high to low by segment end coordinates.
26/3/99 - rbsk: new method 'rewriteField()' added.
23/3/99 - rbsk: $strict flag added to self_overlap() method
16/3/99 - rbsk: GeneFeatureSet objects now subclassed from GeneFeatureSetObject class; moved ``version()'' method from GeneFeatureSet.pm into the new base class GeneFeatureSetObject.pm
25/2/99 - rbsk: Extensively revised and improved the documentation Added Version 2 GeneFeatureSet code, including &version class function Also: Converted all ``constructor'' type methods in all GeneFeatureSet libraries to class methods (i.e. must now be invoked as class->new*(args) or ``new class args'') Standardize all file glob arguments to \*FILE references Use ``croak'' instead of ``die'' in theFeature Added GeneFeatureSet ``intersection'' set operation; Renamed ``intersect_not'' to ``difference'' Default $type for read_parse is now ``GeneFeature'' Added ``frames'' method; rename ``strand'' to ``strands'' max_min_range() modified to sense array v/s scalar context max_min_range_homol() just ignores non-HomolGen