Cow gene-scaffolds and genes in Ensembl

The low sequence coverage of the Btau_1.0 cow genome assembly (3x) means that all of the normal problems experienced when predicting gene structures in draft genome assemblies (missing sequence, fragmentation, misassemblies, misplacements, small insertions/deletions/substitutions) are exacerbated. In particular, many genes will be represented only partially (or not at all) in the assembly, and many others (particularly those with large genomic extent) will be found in pieces, distributed across more than one scaffold. The standard Ensembl gene-build pipeline is therefore unsuitable for such low-coverage genomes.

General method

We have developed a new gene-building methodology for low-coverage genomes that relies on a whole-genome alignment (WGA) to an annotated reference genome. The WGA underlying each annotated gene structure in the reference genome is used to infer "gene-scaffolds": assemblies of scaffolds in the target genome that contain complete gene structures.
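As an illustration only, the inference of a gene-scaffold from the alignment can be sketched as follows. The sketch assumes the WGA is available as a list of blocks, each mapping a reference interval to a target scaffold and strand; the names Block and gene_scaffold_layout are hypothetical and are not part of the Ensembl code. It does not check for inconsistent layouts (for example, a scaffold that reappears after a different one), which a real implementation would need to handle.

```python
from typing import NamedTuple


class Block(NamedTuple):
    """One local alignment block (hypothetical representation)."""
    ref_start: int   # start in the reference genome, inclusive
    ref_end: int     # end in the reference genome, inclusive
    scaffold: str    # name of the aligned target scaffold
    strand: int      # +1 or -1: orientation of the scaffold relative to the reference


def gene_scaffold_layout(blocks, gene_start, gene_end):
    """Return the ordered (scaffold, strand) chain underlying one reference gene."""
    # Keep only the blocks that overlap the reference gene.
    overlapping = [b for b in blocks
                   if b.ref_end >= gene_start and b.ref_start <= gene_end]
    layout = []
    # Walk along the reference gene and record each change of scaffold.
    for b in sorted(overlapping, key=lambda b: b.ref_start):
        entry = (b.scaffold, b.strand)
        if not layout or layout[-1] != entry:
            layout.append(entry)
    return layout


blocks = [
    Block(500, 900, "scaffold_B", -1),
    Block(100, 400, "scaffold_A", 1),
    Block(410, 480, "scaffold_A", 1),
    Block(2000, 2500, "scaffold_C", 1),
]
print(gene_scaffold_layout(blocks, 150, 800))
```

In this toy input, the reference gene spanning 150-800 is covered by two scaffolds, so the implied gene-scaffold joins scaffold_A (forward) to scaffold_B (reversed); scaffold_C lies outside the gene and is ignored.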

The protein-coding transcripts of the reference gene structures are projected through the WGA onto the implied gene-scaffolds in the target genome. Small insertions/deletions that disrupt the reading-frame of the resultant transcripts are corrected by inserting "frame-shift" introns into the structure in a manner similar to that of the Ensembl chimpanzee gene-build.
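The idea of a frame-shift intron can be sketched as follows. This is a simplified illustration under our own assumptions, not the rule set used by the pipeline: it handles only the case where the target carries extra bases relative to the reference, and it places a very short intron over the bases that break the frame. The function name insert_frameshift_intron and the coordinate convention (1-based, inclusive) are hypothetical.

```python
def insert_frameshift_intron(exon, insert_pos, insert_len):
    """Split an exon around a frame-disrupting insertion in the target.

    exon        -- (start, end) in target coordinates, 1-based inclusive
    insert_pos  -- first target base of the insertion
    insert_len  -- number of extra target bases relative to the reference

    Returns a list of one or two exons. The intron covers insert_len % 3
    bases, which is the minimum needed to restore the reading frame.
    """
    start, end = exon
    skip = insert_len % 3
    if skip == 0:
        # A multiple of three does not disturb the frame; leave the exon intact.
        return [exon]
    if not (start < insert_pos and insert_pos + skip - 1 < end):
        raise ValueError("insertion must lie strictly inside the exon")
    # The bases insert_pos .. insert_pos + skip - 1 become a tiny intron.
    return [(start, insert_pos - 1), (insert_pos + skip, end)]


pieces = insert_frameshift_intron((100, 199), 150, 1)
print(pieces)
print(sum(e - s + 1 for s, e in pieces) % 3)
```

Here a 100-base exon carrying one extra base is split into pieces of 50 and 49 bases, so the exonic length is again a multiple of three.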

When the WGA implies that the sequence containing an internal exon is missing from the assembly, and the location is consistent with an intra- or inter-scaffold gap, the exon is placed on the gap sequence. This results in a run of X's of the correct length in the translation.
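The effect on the translation can be shown with a small sketch. It assumes that gap sequence is represented as a run of N's and that any codon containing an ambiguous base is translated as X; the translate function below is a minimal stand-in written for this example, not the Ensembl translation code.

```python
from itertools import product

# Standard genetic code, with codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = ("FFLLSSSSYY**CC*W"
         "LLLLPPPPHHQQRRRR"
         "IIIMTTTTNNKKSSRR"
         "VVVVAAAADDEEGGGG")
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}


def translate(cds):
    """Translate a coding sequence; codons with non-ACGT bases become X."""
    protein = []
    # Ignore any trailing partial codon.
    for i in range(0, len(cds) - len(cds) % 3, 3):
        codon = cds[i:i + 3].upper()
        protein.append(CODON_TABLE.get(codon, "X"))
    return "".join(protein)


# Three exons: the middle one (9 bases) falls in an assembly gap.
exons = ["ATGGCC", "N" * 9, "TGGTAA"]
print(translate("".join(exons)))  # MAXXXW*
```

Because the exon placed on the gap keeps the length implied by the alignment, the flanking exons stay in frame and the missing residues appear as three X's.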

Cow specifics

The Ensembl-annotated human genome (version 31.35d) was used as the reference. A whole-genome alignment between the human and cow genomes was generated in-house using BLASTZ (Schwartz S et al., Genome Res. 13:103-107). The resulting set of local alignments was processed into a form suitable for the above method using the Axt tools written by Jim Kent (W.J. Kent et al., PNAS 100:1484-9).

The WGA-based approach described above will miss genes in regions where the alignment quality is poor, and also genes that are specific to cow and its close relatives. The standard mammalian Ensembl gene-build pipeline was therefore used to identify additional gene structures.