Introduction to the Problem of Computational Gene Recognition


The topic of computational gene recognition has become more and more important as long stretches of DNA are being sequenced in the Human Genome Project. How do we know where the genes are located from the sequence information alone? The papers listed in this bibliography represent more than 15 years of research in computational molecular biology on this topic.

To have an overall picture of how this task could possibly be accomplished, let's ask the following questions:

Where are the genes unlikely to be located? - excluding inter-genic regions

How do transcription factors know where to bind on a region of DNA? - searching for consensus patterns in the promoter region
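
To make the consensus-pattern idea concrete, here is a minimal sketch that slides a window along a sequence and reports every position that matches a consensus within a few mismatches. The function name `find_consensus`, the default consensus `TATAAA` (one commonly quoted form of the TATA box), and the mismatch threshold are illustrative assumptions, not something taken from the papers in this bibliography.

```python
def find_consensus(seq, consensus="TATAAA", max_mismatches=1):
    """Return (position, mismatches) for every window of seq that differs
    from the consensus at no more than max_mismatches positions.

    The default consensus is only an illustrative choice.
    """
    seq = seq.upper()
    width = len(consensus)
    hits = []
    for i in range(len(seq) - width + 1):
        window = seq[i:i + width]
        # Count positions where the window disagrees with the consensus.
        mismatches = sum(1 for a, b in zip(window, consensus) if a != b)
        if mismatches <= max_mismatches:
            hits.append((i, mismatches))
    return hits


if __name__ == "__main__":
    # Made-up sequence, for illustration only.
    print(find_consensus("GGGCTATAAAGGC", max_mismatches=0))
```

Real promoter searches usually replace the fixed string with a position weight matrix, so that some positions tolerate variation more than others; the exact-or-near-exact match above is just the simplest version of the idea.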

Where are the transcription, splicing, and translation start and stop signals? - searching for the start codon, the stop codons, and the splice sites. Be careful, though, with the non-universal genetic code (see a list here)
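
As a rough sketch of the start/stop part of this search, the code below scans the three forward reading frames for an ATG followed by an in-frame stop codon. Everything named here (`find_orfs`, `STANDARD_STOPS`, `min_codons`) is hypothetical and chosen for illustration. It ignores splice sites and the reverse strand entirely, and the stop codons are a parameter precisely because the genetic code is not universal.

```python
# Stop codons of the standard genetic code. Other codes differ: for example,
# in the vertebrate mitochondrial code TGA encodes tryptophan.
STANDARD_STOPS = frozenset({"TAA", "TAG", "TGA"})


def find_orfs(seq, stops=STANDARD_STOPS, min_codons=2):
    """Return (start, end) pairs for forward-strand open reading frames.

    start is the index of the A in ATG; end is one past the stop codon,
    so seq[start:end] is the whole ORF. min_codons counts the codons
    before the stop, including the ATG.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # first in-frame start codon opens the ORF
            elif start is not None and codon in stops:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None  # look for the next ORF in this frame
    return orfs


if __name__ == "__main__":
    # Made-up sequence, for illustration only.
    print(find_orfs("CCATGAAATTTTAGCC"))
```

Passing a different `stops` set is the simplest way to respect a non-standard code; a fuller treatment would also swap the start codons and handle introns, which this sketch does not attempt.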

What do coding regions do (that non-coding regions do not)? [Hint: They are translated three nucleotides at a time, each triplet giving one amino acid!] - recognizing the period-three pattern in coding regions
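
One way to look for the period-three pattern is to turn the sequence into four 0/1 indicator signals (one per nucleotide) and measure how much Fourier power sits at frequency 1/3. The sketch below does just that. The function name `period3_power` and the normalization by the squared sequence length are my own assumptions, and in practice the measure is computed in a sliding window and compared against a threshold that has to be chosen empirically.

```python
import cmath


def period3_power(seq):
    """Return the summed Fourier power at frequency 1/3 over the four
    nucleotide indicator signals, divided by len(seq) ** 2.

    Values near 0 mean no period-three signal; larger values mean the
    nucleotides prefer particular positions modulo 3.
    """
    seq = seq.upper()
    n = len(seq)
    if n == 0:
        return 0.0
    total = 0.0
    for base in "ACGT":
        # Discrete Fourier component at frequency 1/3 for this base's
        # indicator signal (1 where the base occurs, 0 elsewhere).
        component = sum(
            cmath.exp(-2j * cmath.pi * k / 3)
            for k, ch in enumerate(seq)
            if ch == base
        )
        total += abs(component) ** 2
    return total / n ** 2


if __name__ == "__main__":
    # Toy inputs: a perfectly periodic repeat versus a homopolymer.
    print(period3_power("ATG" * 30), period3_power("A" * 90))
```

A perfectly repeating triplet is of course an extreme case; real coding regions show a much weaker but still detectable peak, which is why window size matters.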

Can we learn from examples? What is learned this way is usually species-dependent (non-universal). - checking whether the codon usage in your sequence is closer to that in coding or non-coding sequences
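
Here is a small sketch of the learn-from-examples idea: estimate codon frequencies from known coding and known non-coding sequences, then score a query by its summed log-odds. The names `codon_frequencies` and `coding_log_odds`, the pseudocount, and the assumption that the query is already in the right reading frame are all mine, and the training strings in the usage example are made up rather than real data.

```python
import math
from collections import Counter
from itertools import product

# All 64 codons over the DNA alphabet.
CODONS = ["".join(p) for p in product("ACGT", repeat=3)]
CODON_SET = set(CODONS)


def codon_frequencies(sequences, pseudocount=1.0):
    """Estimate codon frequencies from sequences read in frame 0.

    The pseudocount keeps unseen codons from getting zero probability.
    """
    counts = Counter()
    for seq in sequences:
        seq = seq.upper()
        for i in range(0, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if codon in CODON_SET:  # skip codons with N or other symbols
                counts[codon] += 1
    total = sum(counts[c] for c in CODONS) + pseudocount * len(CODONS)
    return {c: (counts[c] + pseudocount) / total for c in CODONS}


def coding_log_odds(seq, coding_freq, noncoding_freq):
    """Sum log(P_coding / P_noncoding) over the frame-0 codons of seq.

    Positive scores mean the codon usage looks more like the coding set.
    """
    seq = seq.upper()
    score = 0.0
    for i in range(0, len(seq) - 2, 3):
        codon = seq[i:i + 3]
        if codon in CODON_SET:
            score += math.log(coding_freq[codon] / noncoding_freq[codon])
    return score


if __name__ == "__main__":
    # Made-up training sets, for illustration only.
    coding = codon_frequencies(["GCCGCCGCC" * 5])
    noncoding = codon_frequencies(["ATATATATA" * 5])
    print(coding_log_odds("GCCGCC", coding, noncoding))
    print(coding_log_odds("ATATAT", coding, noncoding))
```

Because the frequencies are learned from one organism's genes, the same tables should not be expected to work on a distant species; retraining on that species' own examples is the usual remedy.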

Does this sequence look familiar? I have seen this gene before somewhere... (yeah, right :-)) - database similarity search. Papers on the databases themselves are labeled by . This web page, http://130.132.229.55/gdp/gdp.html, provides step-by-step procedures for this approach.

Well, that pretty much summarizes the methods used in computational gene recognition!

But there are challenges! (added on October 11, 1999)