NAME

GFF::Analysis.pm - Perl utility library of General Feature Format ("GFF") analysis routines, built on the GFF Perl object libraries.

SYNOPSIS

    # include what functions you need; GFF::Analysis.pm contains an implicit 'use GFF ;'

    use GFF::Analysis qw(constructGene makeGenes mRNA featureLengthStats segregateGeneFeatures normalize mergeGeneFeatures cleanUpSeqName normalize_mRNA);

DESCRIPTION

GFF::Analysis (derived from GFF) is a utility library for analysing General Feature Format (GFF) data, built upon the GFF Perl module library.

Exports:

     constructGene()
     makeGenes()
     mRNA()
     featureLengthStats()
     segregateGeneFeatures()
     cleanUpSeqName() # name cleanup protocol used by normalize()
     normalize()
     mergeGeneFeatures()
     normalize_mRNA() # calls segregateGeneFeatures(), normalize() and mergeGeneFeatures() sequentially

AUTHORSHIP

This program is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation.

SOURCE CODE

The most current release of the Perl source code for this module is available here. All bug reports may be submitted to Richard Bruskiewich (rbsk@sanger.ac.uk).

METHODS

Note: These methods are not 'object' invoked, but take a GFF::GeneFeatureSet reference as their first argument.

constructGene($gffi)

Given a GeneFeatureSet object ('$gffi') whose GeneFeatures have <feature> fields labelled 'exon' (and possibly 'promoter', 'transcription_start' and/or 'polyA_signal') and all belong to a single gene, as defined by a common [group] field label, this method returns an augmented GeneFeatureSet object fully describing a 'gene', with introns, UTRs and flanking sequences inferred from the original GeneFeatures.

The GeneFeatures in the GeneFeatureSet passed in are assumed to all be on the same <strand> and to share the same <seqname>, <source> and group_value('Sequence') gene name (i.e. as clustered by makeGenes()). The method therefore takes all these values from the first GeneFeature it encounters in the GeneFeatureSet! The <frame> and <score> are assumed to be irrelevant for all features added by this method (e.g. introns), and are thus set to '.'.

The incoming GeneFeatures are also assumed to be non-overlapping, since this assumption drives the identification of gaps between GeneFeatures ('introns' et al.). If the start of the first GeneFeature (or the end of the last) does not coincide with the start (or end) of the <seqname> region range, then the 5' (or 3') flanking region is inferred and labelled as such. This labelling is also influenced by the presence of 'promoter', 'transcription_start' and/or 'polyA_signal' GeneFeatures.
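
The gap inference described above can be illustrated without the GFF classes. The following is a minimal sketch, assuming the exons are available as plain [start, end] pairs (1-based, inclusive) on a single strand; the helper name infer_introns is hypothetical and is not part of GFF::Analysis, and UTR and flanking-region labelling are left out.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of GFF::Analysis): given non-overlapping
# exons as [start, end] pairs (1-based, inclusive), return the gaps
# between consecutive exons as [start, end] pairs.
sub infer_introns {
    my @exons = sort { $a->[0] <=> $b->[0] } @_;
    my @introns;
    for my $i (1 .. $#exons) {
        my $gap_start = $exons[$i - 1][1] + 1;
        my $gap_end   = $exons[$i][0] - 1;
        # Adjacent exons leave no gap, so nothing is emitted for them.
        push @introns, [$gap_start, $gap_end] if $gap_start <= $gap_end;
    }
    return @introns;
}

my @introns = infer_introns([100, 200], [301, 400], [551, 700]);
print join(', ', map { "$_->[0]..$_->[1]" } @introns), "\n";
```

The sketch relies on the same non-overlap assumption as the text: if two exons overlapped, the computed gap would be empty or negative and is simply skipped here.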

makeGenes($gffi)

After clustering the predicted exons, promoters, polyA signals etc. of a GeneFeatureSet by 'gene' groups (i.e. by Version 1 [group] tags or by Version 2 [group] field 'Sequence' tag-values), this method uses constructGene() to infer additional GeneFeatures (e.g. introns, [5'|3'] UTRs and flanking regions). It then returns all the GeneFeatures (old and new) in a new GeneFeatureSet object.

mRNA($gffi, $seq, $pattern)

Returns a single string representing an mRNA or similar gapped entity, built from the gene features in $gffi whose <feature> field matches $pattern. The method expects a string '$seq' holding the sequence from which the features are to be extracted, and returns the concatenation of all the subsequences defined by the matching gene features.
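
The concatenation step can be sketched in plain Perl. This is an illustration only, assuming the matching features have already been reduced to [start, end] pairs (1-based, inclusive) on the forward strand; the helper name splice_segments is hypothetical, and strand handling is not shown.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of GFF::Analysis): concatenate the
# subsequences of $seq covered by [start, end] pairs (1-based, inclusive),
# taken in ascending order of start coordinate.
sub splice_segments {
    my ($seq, @segments) = @_;
    my $spliced = '';
    for my $seg (sort { $a->[0] <=> $b->[0] } @segments) {
        my ($start, $end) = @$seg;
        # substr() is 0-based, so shift the 1-based start down by one.
        $spliced .= substr($seq, $start - 1, $end - $start + 1);
    }
    return $spliced;
}

my $seq = 'AAACCCGGGTTT';
print splice_segments($seq, [1, 3], [7, 9]), "\n";   # prints "AAAGGG"
```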

featureLengthStats($gffo,\%statab,$label)

This method applies the GFF::GeneFeatureSet::lengthStats() method to a given GeneFeatureSet and stores the results in a hash passed by reference ('\%statab'), as a secondary-level hash indexed under the given $label.

    i.e.  \%statab->{"$label"}->{'<data_label>'}

The '<data_label>' secondary hash keys for these statistics are as follows:

    'M'    == mean length
    'SD'   == standard deviation of lengths
    'N'    == total number of features
    'Cov'  == total sum of lengths ('coverage')
    'Cov2' == total sum of lengths squared
    'LenC' == reference to an array of feature incidence 
              counts for each class of length
    
In addition to setting the table, the method returns these
values as a list (in an array context), in the order indicated
above.
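
To make the relationship between these keys concrete, here is a minimal sketch that derives 'M' and 'SD' from the running totals 'N', 'Cov' and 'Cov2'. It is not the library's implementation: the helper name length_stats is hypothetical, the use of the sample standard deviation (n - 1 denominator) is an assumption, and 'LenC' is omitted because the length classes are not specified here.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of GFF::Analysis): summarise a list of
# feature lengths under the keys 'N', 'Cov', 'Cov2', 'M' and 'SD'.
# Assumption: 'SD' is the sample standard deviation (n - 1 denominator).
sub length_stats {
    my @lengths = @_;
    my %stat = (N => scalar @lengths, Cov => 0, Cov2 => 0, M => 0, SD => 0);
    for my $len (@lengths) {
        $stat{Cov}  += $len;
        $stat{Cov2} += $len * $len;
    }
    if ($stat{N} > 0) {
        $stat{M} = $stat{Cov} / $stat{N};
    }
    if ($stat{N} > 1) {
        # Sum of squared deviations, computed from the two running totals.
        my $ss = $stat{Cov2} - $stat{Cov} ** 2 / $stat{N};
        $ss = 0 if $ss < 0;    # guard against floating-point round-off
        $stat{SD} = sqrt($ss / ($stat{N} - 1));
    }
    return \%stat;
}

# Store the result under a label, mirroring the two-level table layout.
my %statab;
$statab{Exon} = length_stats(100, 200, 300);
printf "N=%d M=%.1f SD=%.1f Cov=%d\n", @{ $statab{Exon} }{qw(N M SD Cov)};
```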

segregateGeneFeatures($gffi,\%statab,$trace)

Given a GeneFeatureSet containing 'sequence', 'exon' and 'CDS' records, this method returns a list of four $gff object references, one for each of the segregated subsets: 'genes', 'pseudogenes', 'exons' and 'CDSs', respectively.

If a suitable reference '\%statab' to a hash table is given, then the method also compiles statistics for each subset into that table using the featureLengthStats() method (see above). That is, the table is of the form:

        \%statab->{('Gene'|'Exon'|'CDS')}->{'<data item>'}

where the '<data item>'s are the statistics returned by featureLengthStats().

Another side effect of the method is that mRNA/coding_exon and CDS/exon redundancies are filtered out of the dataset.

The optional '$trace' boolean flag, when non-null, turns on GFF module tracing.

mergeGeneFeatures($gffg,$gffp,$gffe,$gffc,\%statab,$trace)

This method performs the reverse operation to segregateGeneFeatures(), taking four types of features ('gene (sequence)', 'pseudogene (sequence)', 'exon' and 'CDS' records) and remerging them into cohesive 'gene' sets based upon overlapping GFF coordinates. The method then returns a list of references to each of the 'true' and 'pseudo' GeneFeatureSets.

Along the way, a further set of statistics may optionally be computed and stored in the hash table passed to the function by reference as \%statab. These statistics pertain to 'exons' and 'CDSs' per (pseudo)gene. They are stored in the hash at the primary level under the 'Gene' key, at the secondary level in hashes under keys such as 'Exon' and 'CDS', and then under the specific data items, e.g.

    \%statab->{Gene}->{(Exon|CDS|Transcript|Translation)}->{<data item>}

The '<data item>'s are as follows:

    'M'     == mean features per gene
    'SD'    == standard deviation of features per gene
    'X'     == total sum of given features
    'X2'    == total sum of given features squared
    'GeneC' == reference to an array of gene incidence 
               counts for each class of 'per gene' values

The optional '$trace' boolean flag, when non-null, turns on GFF module tracing.

cleanUpSeqName($name)

Default name filter used by normalize(), which removes:

    - CDS/mRNA suffixes
    - 5'/3' suffixes
    - alphabetic or digit isoform suffixes

The cleaned up name is returned.

normalize($gffndg, $gffndp, $gffnde, $gffndc, \&nameFilter, \%statab, $trace)

Method to return a normalized set of gene descriptions in which all isoforms and exons have been merged into distinct, non-overlapping non-duplicated sets of data.

&nameFilter should take a $name string as input, clean up any name decorations, and return the cleaned $name string. The argument is interpreted as follows: if nameFilter is undef or NULL, name clean-up is suppressed; if nameFilter is defined and non-NULL but is not a reference to a subroutine, the standard clean-up of names is performed; otherwise, the user-supplied subroutine is used.
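
The three-way interpretation of the nameFilter argument can be sketched as follows. This is an illustration under stated assumptions, not the module's code: 'NULL' is read here as any false Perl value, resolve_name_filter is a hypothetical helper, and default_cleanup is a trivial stand-in that only trims whitespace (it does not reproduce cleanUpSeqName()).

```perl
use strict;
use warnings;

# Stand-in for the module's default clean-up; this trivial version only
# trims surrounding whitespace and is NOT the real cleanUpSeqName().
sub default_cleanup {
    my ($name) = @_;
    $name =~ s/^\s+|\s+$//g;
    return $name;
}

# Hypothetical helper: choose the filter to apply to each name.
sub resolve_name_filter {
    my ($name_filter) = @_;
    # undef or a false value ("NULL"): suppress clean-up entirely.
    return sub { return $_[0] } unless $name_filter;
    # A code reference: use the caller's own subroutine.
    return $name_filter if ref($name_filter) eq 'CODE';
    # Any other true value: fall back to the standard clean-up.
    return \&default_cleanup;
}

my $upper = sub { return uc $_[0] };
for my $choice (undef, 1, $upper) {
    my $filter = resolve_name_filter($choice);
    print '[', $filter->(' b0303.3 '), "]\n";
}
```

Resolving the argument to a code reference once, up front, means the rest of a normalization loop can call the filter unconditionally.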

normalize_mRNA($gffi, \&nameFilter, $source, \%statab, $trace )

Invokes segregate, normalize and merge routines (see above) to normalize transcript GFF.

The optional $source argument is used to rewrite the <source> field of the file to a uniform value (default: 'Gene').

Providing a defined reference to an empty hash, '\%statab', triggers the compilation of statistics about the file, as generated by the segregateGeneFeatures() and mergeGeneFeatures() methods (see above).

Separate statistics are generated for each of transcripts, genes, exons and CDSs. (Note that 'genes' are defined as the normalized transcript sequence spans output by the method, whereas 'transcripts' are the sequence records before overlaps are merged.) Note also that the gene sequence count 'N' is taken after normalization, whereas all other feature counts 'N' are unnormalized numbers.

A defined and non-null $trace flag turns on runtime tracing of the normalization method.

REVISION HISTORY

2.09 (19/10/99) - rbsk: segregateGeneFeatures() and mergeGeneFeatures() handle pseudogenes separately and explicitly;

2.08 (16/10/99) - rbsk: for greater flexibility, I extracted out normalize_mRNA() functionality, into externally visible methods:

                        - featureLengthStats()
                        - segregateGeneFeatures()
                        - mergeGeneFeatures()

2.07 (13/10/99) - rbsk: added $statab argument to normalize_mRNA()

2.06 (9/10/99) - rbsk: now exporting cleanUpSeqName(); keep 'em:' prefixes for now...

2.05 (27/9/99) - rbsk: need to make all normalization procedures strand sensitive!

2.04 (21/7/99) - rbsk: normalize_mRNA() should not consider source(*CDS) records with feature(sequence) to be redundant, in case the specific 'sequence' only has a CDS specified (but no mRNA); This method now also relabels <source> fields to 'TranscriptSet'

2.03 (16/7/99) - rbsk: removed draw_graph() from here (into GFF::Graph()) because of Curve_plot.pm usage, which won't be universal outside Sanger (at least until Raphael decides to release it for general use?)

2.02 (14/7/99) - rbsk: transferred draw_graph() from GeneFeatureSet to here and generalized to multiple GFF plot (API changed)

2.01 (12/7/99) - rbsk: creation from miscellaneous GFF analysis code; transferred methods makeGenes(), constructGene() and mRNA from GFF::GeneFeatureSet to this module