use GFF::Analysis qw(constructGene makeGenes mRNA featureLengthStats
                     segregateGeneFeatures normalize mergeGeneFeatures
                     cleanUpSeqName normalize_mRNA);
Exports:
    constructGene()
    makeGenes()
    mRNA()
    featureLengthStats()
    segregateGeneFeatures()
    cleanUpSeqName()    # name cleanup protocol used by normalize
    normalize()
    mergeGeneFeatures()
    normalize_mRNA()    # calls segregateGeneFeatures(), normalize(), and mergeGeneFeatures() sequentially
Sanger Institute, Wellcome Trust Genome Campus, Cambs, UK. All rights reserved.
This program is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation
GFF::GeneFeatures in the calling GeneFeatureSet object are assumed to all
be on the same <strand> and to come from the same <seqname>, <source> and
group_value('Sequence') named gene (i.e. a makeGenes() clustering); the
method therefore reads all of these values from the first GeneFeature
encountered in the GeneFeatureSet. The <frame> and <score> are assumed to
be irrelevant for all features added in this method (e.g. introns), and
are thus set to '.'. The incoming GeneFeatures are also assumed to be
non-overlapping, since this assumption drives the identification of
'inter' GeneFeature gaps ('introns' et al.). Also, if the start of the
first GeneFeature (or the end of the last) does not coincide with the
start (or end) of the <seqname> region range, then the 5' (and/or 3')
flanking regions are inferred and so labelled in the field. This latter
labelling is also influenced by the presence of 'promoter',
'transcription_start' and/or 'polyA_signal' GeneFeatures.
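The gap inference described above can be illustrated with a small
stand-alone sketch. It does not use the GFF classes: features are plain
[start, end, label] array references, infer_gaps() is a hypothetical
helper, and the labels "5'flank", "intron" and "3'flank" are assumed
spellings. It also assumes a forward-strand region and ignores the
influence of 'promoter', 'transcription_start' and 'polyA_signal'
features, so it is a simplification rather than the module's own logic.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of GFF::Analysis): given a region range and
# a list of non-overlapping [start, end, label] features, fill in the gaps.
sub infer_gaps {
    my ($region_start, $region_end, @features) = @_;
    my @sorted = sort { $a->[0] <=> $b->[0] } @features;
    my @out;
    my $cursor = $region_start;    # first position not yet accounted for
    for my $i (0 .. $#sorted) {
        my ($start, $end, $label) = @{ $sorted[$i] };
        if ($start > $cursor) {
            # a gap before the first feature is a 5' flank, otherwise an
            # 'inter' feature gap (labelled 'intron' here for simplicity)
            my $gap = $i == 0 ? "5'flank" : 'intron';
            push @out, [ $cursor, $start - 1, $gap ];
        }
        push @out, [ $start, $end, $label ];
        $cursor = $end + 1;
    }
    # anything left after the last feature is the 3' flank
    push @out, [ $cursor, $region_end, "3'flank" ] if $cursor <= $region_end;
    return @out;
}

my @filled = infer_gaps(1, 1000, [ 100, 200, 'exon' ], [ 300, 400, 'exon' ]);
print join("\t", @$_), "\n" for @filled;
```

Because the input is assumed to be non-overlapping, a single sorted pass
with a running cursor is enough to find every gap.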
i.e. \%statab->{"$label"}->{'<data_label>'}
The '<data_label>' secondary hash keys for these statistics are as follows:
    'M'    == mean length
    'SD'   == standard deviation of lengths
    'N'    == total number of features
    'Cov'  == total sum of lengths ('coverage')
    'Cov2' == total sum of lengths squared
    'LenC' == reference to an array of feature incidence counts
              for each class of length
As a side effect of setting the table, these values are also returned (in
an array context) as a list, in the order indicated above.
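As a rough illustration of how the keys listed above relate to each other,
the following stand-alone sketch computes them from a plain list of
lengths. length_stats() and its $bin parameter (the width of a length
class for 'LenC') are hypothetical, and the use of the sample standard
deviation (n - 1) is an assumption; the module may bin lengths and
estimate SD differently.

```perl
use strict;
use warnings;

# Hypothetical sketch: summarise feature lengths into the documented keys.
# $bin (width of one length class for 'LenC') is an assumed parameter.
sub length_stats {
    my ($lengths, $bin) = @_;
    $bin ||= 100;
    my %s = (N => 0, Cov => 0, Cov2 => 0, LenC => []);
    for my $len (@$lengths) {
        $s{N}++;
        $s{Cov}  += $len;                  # running sum of lengths
        $s{Cov2} += $len * $len;           # running sum of squared lengths
        $s{LenC}[ int($len / $bin) ]++;    # incidence count per length class
    }
    $s{M} = $s{N} ? $s{Cov} / $s{N} : 0;
    # sample variance from the two running sums (assumed estimator)
    my $var = $s{N} > 1 ? ($s{Cov2} - $s{N} * $s{M}**2) / ($s{N} - 1) : 0;
    $s{SD} = sqrt($var > 0 ? $var : 0);
    $_ //= 0 for @{ $s{LenC} };            # unseen classes count as zero
    return \%s;
}

my $stats = length_stats([ 100, 200, 300 ]);
printf "M=%.1f SD=%.1f N=%d Cov=%d\n", @{$stats}{qw(M SD N Cov)};
# prints "M=200.0 SD=100.0 N=3 Cov=600"
```

Keeping 'Cov' and 'Cov2' as running sums means the mean and standard
deviation can be derived in one pass without storing individual lengths.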
$gff object pointers representing the segregated subsets for each of
'genes', 'pseudogenes', 'exons' and 'CDSs', respectively.
If a suitable reference '\%statab' to a hash table is given, then the
method also compiles statistics for each subset into that table using the
featureLengthStats() method (see above). That is, the table is of the form:
    \%statab->{('Gene'|'Exon'|'CDS')}->{'<data item>'}
where the '<data item>'s are the statistics returned by featureLengthStats().
Another side effect of the method is that mRNA/coding_exon and CDS/exon redundancies are filtered out of the dataset.
The optional '$trace' boolean flag, when non-null, turns on GFF module tracing.
segregateGeneFeatures(), taking four types of features - 'gene (sequence)',
'pseudogene (sequence)', 'exon' and 'CDS' records - and remerging them
into cohesive 'gene' sets based upon overlapping GFF coordinates. The
method then returns a list of references to each of the 'true' and
'pseudo' GeneFeatureSets.
Along the way, a further set of statistics may (optionally) be computed
and stored in the hash table referenced by the \%statab argument passed
to the function. These statistics pertain to 'exons' and 'CDSs' per
(pseudo)gene. They are stored at the primary level under the 'Gene' key,
at the secondary level under keys such as 'Exon' and 'CDS', and then under
the specific data items, e.g.
    \%statab->{Gene}->{(Exon|CDS|Transcript|Translation)}->{<data item>}
The '<data item>'s are as follows:
    'M'     == mean features per gene
    'SD'    == standard deviation of features per gene
    'X'     == total sum of given features
    'X2'    == total sum of given features squared
    'GeneC' == reference to an array of gene incidence counts
               for each class of 'per gene' values
The optional '$trace' boolean flag, when non-null, turns on GFF module tracing.
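The idea of remerging records into 'gene' sets by overlapping coordinates
can be sketched as follows. This is a stand-alone simplification, not the
module's code: cluster_overlaps() is a hypothetical helper, spans are plain
[start, end] array references, and <seqname> and <strand> are ignored, so
it assumes that all spans lie on one strand of one sequence.

```perl
use strict;
use warnings;

# Hypothetical sketch: cluster [start, end] spans that overlap into 'genes'.
sub cluster_overlaps {
    my @spans = sort { $a->[0] <=> $b->[0] || $a->[1] <=> $b->[1] } @_;
    my @clusters;
    for my $span (@spans) {
        my $last = $clusters[-1];
        if ($last && $span->[0] <= $last->{end}) {
            # overlaps the current cluster: add it and extend the end
            push @{ $last->{members} }, $span;
            $last->{end} = $span->[1] if $span->[1] > $last->{end};
        }
        else {
            # no overlap: start a new cluster
            push @clusters,
                { start => $span->[0], end => $span->[1], members => [$span] };
        }
    }
    return @clusters;
}

my @genes = cluster_overlaps([ 100, 500 ], [ 400, 900 ], [ 2000, 2500 ]);
printf "%d-%d (%d members)\n", @{$_}{qw(start end)}, scalar @{ $_->{members} }
    for @genes;
```

The number of members in each cluster is the kind of 'per gene' value from
which statistics such as 'M' and 'SD' could then be derived.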
normalize(), which removes:
    - CDS/mRNA suffixes
    - 5'/3' suffixes
    - alphabet or digit isoform suffixes
The cleaned up name is returned.
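A minimal sketch of this kind of name cleanup is shown below. The exact
spelling of the suffixes is not given here, so the patterns are
assumptions: ".mRNA"/".CDS", ".5'"/".3'", and a single trailing isoform
letter after a digit. Digit isoform suffixes are left out because their
format cannot be inferred from the description. clean_name() and the
example names are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical cleanup; every suffix spelling below is an assumption.
sub clean_name {
    my ($name) = @_;
    $name =~ s/[._](?:CDS|mRNA)$//i;    # assumed form: "NAME.mRNA", "NAME.CDS"
    $name =~ s/[._][53]'$//;            # assumed form: "NAME.5'", "NAME.3'"
    $name =~ s/(?<=\d)[a-z]$//;         # assumed form: isoform letter, "8a"
    return $name;
}

# Made-up example names; each one reduces to "ZK637.8".
print clean_name($_), "\n" for "ZK637.8a.mRNA", "ZK637.8b.CDS", "ZK637.8.3'";
```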
&nameFilter should take a $name string as input, perform clean up of name
decorations, then return the cleaned $name string. If nameFilter is undef
or NULL, then name clean-up is suppressed. If nameFilter is defined and
non-NULL but is not a reference to a subroutine, then a standard cleanup
of names is performed. Otherwise, the user supplied subroutine is used.
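The three nameFilter cases can be summarised in a short stand-alone
sketch. apply_name_filter() and standard_cleanup() are hypothetical names;
the latter is only a placeholder for the module's standard cleanup, and
treating 'NULL' as any false value (0 or the empty string) is an
assumption.

```perl
use strict;
use warnings;

# Placeholder for the standard cleanup (assumption: strips ".mRNA" only).
sub standard_cleanup {
    my ($name) = @_;
    $name =~ s/\.mRNA$//;
    return $name;
}

# Hypothetical dispatcher mirroring the three documented nameFilter cases.
sub apply_name_filter {
    my ($name, $filter) = @_;
    return $name unless $filter;                         # undef/NULL: suppressed
    return $filter->($name) if ref $filter eq 'CODE';    # user subroutine
    return standard_cleanup($name);                      # any other true value
}

print apply_name_filter('gene1.mRNA', undef), "\n";              # gene1.mRNA
print apply_name_filter('gene1.mRNA', 1), "\n";                  # gene1
print apply_name_filter('gene1.mRNA', sub { uc shift }), "\n";   # GENE1.MRNA
```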
The optional $source argument is used to rewrite the <source> field of the
file to a uniform value (default: 'Gene').
Providing a defined reference to an empty hash, '\%statab', triggers the
compilation of statistics about the file, as generated in the
segregateGeneFeatures() and mergeGeneFeatures() methods (see above).
Separate statistics are generated for each of transcripts, genes, exons and CDSs. (Note that 'genes' are defined as the normalized transcript sequence spans output by the method, whereas 'transcripts' are the sequence records before overlaps are merged.) Note that the gene sequence count 'N' is taken after normalization, whereas all other feature counts 'N' are unnormalized numbers.
A defined and non-null $trace flag turns on runtime tracing of the
normalization method.
segregateGeneFeatures() and mergeGeneFeatures() handle pseudogenes
separately and explicitly;
2.08 (16/10/99) - rbsk: for greater flexibility, I extracted
normalize_mRNA() functionality into externally visible methods:
    - featureLengthStats()
    - segregateGeneFeatures()
    - mergeGeneFeatures()
2.07 (13/10/99) - rbsk: added $statab argument to normalize_mRNA()
2.06 (9/10/99) - rbsk: now exporting cleanUpSeqName(); keep 'em:' prefixes
for now...
2.05 (27/9/99) - rbsk: need to make all normalization procedures strand
sensitive!
2.04 (21/7/99) - rbsk: normalize_mRNA() should not consider source(*CDS)
records with feature(sequence) to be redundant, in case the specific
'sequence' only has a CDS specified (but no mRNA); this method now also
relabels <source> fields to 'TranscriptSet'
2.03 (16/7/99) - rbsk: removed draw_graph() from here (into GFF::Graph())
because of Curve_plot.pm usage, which won't be universal outside Sanger
(at least until Raphael decides to release it for general use?)
2.02 (14/7/99) - rbsk: transferred draw_graph() from GeneFeatureSet to
here and generalized to multiple GFF plot (API changed)
2.01 (12/7/99) - rbsk: creation from miscellaneous GFF analysis code;
transferred methods makeGenes(), constructGene() and mRNA() from
GFF::GeneFeatureSet to this module