The topic of computational gene recognition has become more and more
important as long stretches of DNA are sequenced in the Human Genome Project.
How do we know where the genes are located from the sequence information
alone? The papers listed in this
bibliography are an accumulation of more than 15 years
of research in computational molecular biology on this topic.
To get an overall picture of how this task might be accomplished, let's ask the following questions:
Where are the genes unlikely to be located? -
excluding intergenic regions.
How do transcription factors know where to bind a region
of DNA? -
searching for consensus patterns in the promoter region
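As a rough illustration of consensus-pattern searching, the sketch below expands a consensus string written in IUPAC ambiguity codes into a regular expression and reports every position where it matches. The function name `find_consensus` and the example consensus `TATAWAW` (a commonly quoted TATA-box form) are assumptions chosen for illustration, not something taken from the papers in this bibliography; real promoter finders usually use weight matrices rather than exact consensus matches.

```python
import re

# IUPAC nucleotide ambiguity codes expanded to regex character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]",
    "B": "[CGT]", "D": "[AGT]", "H": "[ACT]", "V": "[ACG]",
    "N": "[ACGT]",
}

def find_consensus(seq, consensus):
    """Return 0-based start positions where the consensus matches seq.

    A lookahead is used so that overlapping matches are all reported.
    """
    body = "".join(IUPAC[ch] for ch in consensus.upper())
    pattern = re.compile("(?=" + body + ")")
    return [m.start() for m in pattern.finditer(seq.upper())]

# Hypothetical example: look for a TATA-box-like pattern.
hits = find_consensus("GGTATAAAAGG", "TATAWAW")
```

Here `hits` holds the start positions of the matches in the example sequence; an exact-match search like this is only a first filter, since true binding sites often deviate from the consensus.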
Where are the transcription, splicing, and translation
start and stop signals? - searching for the start codon,
the stop codons, and the splice sites
Be careful, though, with the non-universal genetic code
(see
a list here)
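The start/stop codon search can be sketched as a simple open-reading-frame scan over the three forward frames. This is a minimal sketch, not a gene finder: it ignores the reverse strand and splicing, and the names `find_orfs` and `min_codons` are hypothetical. The stop-codon set is a parameter precisely because of the non-universal genetic code; the vertebrate mitochondrial set used in the example (where TGA is not a stop) is shown only as an illustration.

```python
STANDARD_STOPS = frozenset({"TAA", "TAG", "TGA"})

def find_orfs(seq, start="ATG", stops=STANDARD_STOPS, min_codons=2):
    """Return sorted (begin, end) pairs, 0-based and half-open.

    Each pair spans a start codon through the first in-frame stop
    codon on the forward strand; min_codons counts both of them.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        begin = None  # position of the currently open start codon, if any
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if begin is None:
                if codon == start:
                    begin = i
            elif codon in stops:
                if (i + 3 - begin) // 3 >= min_codons:
                    orfs.append((begin, i + 3))
                begin = None
    return sorted(orfs)

# Same sequence, two genetic codes (illustrative stop sets).
VERT_MITO_STOPS = frozenset({"TAA", "TAG", "AGA", "AGG"})
standard = find_orfs("ATGTGAAAATAA")
mito = find_orfs("ATGTGAAAATAA", stops=VERT_MITO_STOPS)
```

With the standard stops the frame closes at TGA, while with the mitochondrial set the same frame reads through to TAA, so the two calls report different ORF boundaries for identical DNA.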
What does a coding region do (that non-coding regions do not)?
[Hint: every three nucleotides are translated
into one amino acid!] -
recognizing the period-three pattern in coding regions
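One way to make the period-three idea concrete is a Fourier measure: turn each base into a 0/1 indicator sequence and look at its spectral power at frequency 1/3. The sketch below assumes that formulation; the function name `period3_power` and the normalization by sequence length are illustrative choices rather than a method taken from a specific paper listed here.

```python
import cmath

def period3_power(seq):
    """Power at frequency 1/3, summed over the four base indicators.

    For each base, sum exp(-2*pi*i*k/3) over the positions k where the
    base occurs, square the magnitude, add up the four bases, and
    divide by the sequence length.
    """
    seq = seq.upper()
    n = len(seq)
    if n == 0:
        return 0.0
    w = cmath.exp(-2j * cmath.pi / 3)
    total = 0.0
    for base in "ACGT":
        # w ** (k % 3) equals exp(-2*pi*i*k/3) because w**3 == 1.
        s = sum(w ** (k % 3) for k, b in enumerate(seq) if b == base)
        total += abs(s) ** 2
    return total / n

periodic = period3_power("ATG" * 30)   # strongly period-three
flat = period3_power("A" * 90)         # no period-three signal
```

In practice this would be evaluated in a sliding window along the sequence, with windows that score well above the background treated as candidate coding regions.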
Can we learn from examples? What is learned is
usually species-dependent (non-universal)
- checking whether the codon usage in your sequence
is closer to that in coding or non-coding sequences
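The codon-usage comparison can be sketched as a log-likelihood ratio: estimate codon frequencies from known coding and non-coding examples, then sum log(coding frequency / non-coding frequency) over the codons of the query. The names `codon_frequencies` and `coding_score`, the add-one pseudocount, and the tiny training sets are all assumptions for illustration; a real model would be trained on many sequences from the species of interest.

```python
import math
from collections import Counter
from itertools import product

CODONS = ["".join(p) for p in product("ACGT", repeat=3)]
CODON_SET = set(CODONS)

def codon_frequencies(examples, pseudocount=1.0):
    """Estimate codon frequencies from sequences read in frame 0."""
    counts = Counter()
    for seq in examples:
        seq = seq.upper()
        for i in range(0, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if codon in CODON_SET:  # skip codons with ambiguous bases
                counts[codon] += 1
    total = sum(counts.values()) + pseudocount * len(CODONS)
    return {c: (counts[c] + pseudocount) / total for c in CODONS}

def coding_score(seq, coding_freq, noncoding_freq):
    """Log-likelihood ratio; positive means closer to coding usage."""
    seq = seq.upper()
    score = 0.0
    for i in range(0, len(seq) - 2, 3):
        codon = seq[i:i + 3]
        if codon in coding_freq:
            score += math.log(coding_freq[codon] / noncoding_freq[codon])
    return score

# Toy training data, purely illustrative.
coding_freq = codon_frequencies(["ATGGCTGCTGCTTAA"])
noncoding_freq = codon_frequencies(["ATATATATATATATA"])
score = coding_score("GCTGCTGCT", coding_freq, noncoding_freq)
```

Because the frequency tables come from training examples, a table learned on one species should not be assumed to transfer to another, which is the species-dependence noted above.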
Does this sequence look familiar? I have seen this gene before
somewhere... (yeah, right :-)) - database similarity search.
Papers on the databases themselves are labeled by
.
This web page,
http://130.132.229.55/gdp/gdp.html , provides step-by-step procedures
for this approach.
Well, that pretty much summarizes the methods used in computational gene recognition!
But there are challenges! (added on October 11, 1999)