Fifth assembly Zv5 of the zebrafish genome released

The assembly comprises a total sequence length of 1,630,306,866 bp in 16,214 fragments. It has been tied to the FPC map (data freeze 15th February 2005) and contains 699 Mb from 4,519 sequenced clones. The FPC contigs provided a template for placing the sequenced clones. We filled in the remainder of the sequence with contigs from a whole-genome shotgun (WGS) assembly (see details below). This integration of clone sequences and WGS contigs is based on a mixed strategy that considers sequence alignments and the placement of BAC ends and of features such as zebrafish cDNAs and markers.

In this release, sequences that are based on FPC contigs are named Zv5_scaffold followed by a number. The WGS contigs that could not be placed in those scaffolds are named Zv5_NA followed by a number. Zv5_scaffolds were placed onto chromosomes where possible. In accordance with the agreement reached at the European zebrafish meeting in Paris, 2003, we translated linkage group numbers directly into chromosome numbers (e.g. linkage group 1 = chromosome 1).
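As a rough illustration of the naming scheme above, the following Python sketch sorts sequence names into the two classes. It assumes the number follows the prefix directly (e.g. Zv5_scaffold42, Zv5_NA7); the exact separator is not stated here, and the function names and class labels are hypothetical.

```python
import re
from typing import Optional, Tuple

# Assumption: the number follows the prefix with no separator.
_SCAFFOLD_RE = re.compile(r"^Zv5_scaffold(\d+)$")
_NA_RE = re.compile(r"^Zv5_NA(\d+)$")


def classify(name: str) -> Optional[Tuple[str, int]]:
    """Return (class label, number) for a Zv5 sequence name, or None."""
    match = _SCAFFOLD_RE.match(name)
    if match:
        # Sequence based on an FPC contig.
        return ("fpc_scaffold", int(match.group(1)))
    match = _NA_RE.match(name)
    if match:
        # WGS contig that could not be placed in an FPC-based scaffold.
        return ("unplaced_wgs", int(match.group(1)))
    return None


def chromosome_for_linkage_group(linkage_group: int) -> int:
    """Linkage group numbers translate directly into chromosome numbers."""
    if not 1 <= linkage_group <= 25:
        raise ValueError("expected a linkage group between 1 and 25")
    return linkage_group


if __name__ == "__main__":
    for example in ("Zv5_scaffold42", "Zv5_NA7", "something_else"):
        print(example, classify(example))
```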

Please note that this is still a *preliminary* assembly and there are a number of points to remember. Regions of the genome where the physical map is incomplete or contains gaps are covered by sequence that contains a higher number of misassemblies. In general, highly variable regions do not form clusters for assembly, since their reads are quite likely to come from different haplotypes. This variability also affects the generation of the physical map, resulting in assembly dropouts and false duplications.

A pre-Ensembl database built on the Zv5 assembly, featuring the sequence and raw computes, is now available.

The assembly can be searched using BLAST or SSAHA. Individual contigs of interest can be downloaded there using the Export Data option.

The whole assembly can be downloaded from ftp://ftp.ensembl.org/pub/assembly/zebrafish/Zv5release

Assembly Statistics

The WGS assembly is based on 20,541,433 reads comprising 14,160,626,498 bp with a coverage of 6.5-7x. This set includes 6,882,050 reads from a new library generated from a single Tuebingen doubled haploid zebrafish. In order to increase the continuity of contigs in the finished or near-finished regions, we shredded 1,366,419 reads from finished clones in the tiling path. Of this set, 18,969,500 reads were ultimately placed in the assembly. Phusion was used to cluster the reads, and phrap was used for cluster assembly and consensus generation. This resulted in 247,928 contigs with an N50 size of 20,629 bp. Contigs are joined into supercontigs based on read-pair information, with gap sizes estimated from the insert sizes of the different libraries. Small supercontigs with fewer than 3 reads or smaller than 0.5 kb were rejected. There are 105,987 supercontigs in the WGS assembly with an N50 size of 687,451 bp.
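The rejection rule and the N50 figures quoted above can be illustrated with a short Python sketch. It is not the code used to build the assembly: the (length, read count) record layout and the function names are hypothetical, and N50 is computed using the common definition (the length of the sequence at which the running total, taken from longest to shortest, first reaches half of the summed length).

```python
from typing import Iterable, List, Tuple

MIN_READS = 3        # supercontigs with fewer than 3 reads are rejected
MIN_LENGTH_BP = 500  # supercontigs smaller than 0.5 kb are rejected

# Hypothetical record layout: (length in bp, number of reads).
Supercontig = Tuple[int, int]


def filter_supercontigs(records: Iterable[Supercontig]) -> List[Supercontig]:
    """Keep only supercontigs that pass both thresholds."""
    return [
        (length, reads)
        for length, reads in records
        if reads >= MIN_READS and length >= MIN_LENGTH_BP
    ]


def n50(lengths: Iterable[int]) -> Tuple[int, int]:
    """Return (N50 length, number of sequences needed to reach half the total)."""
    ordered = sorted(lengths, reverse=True)
    if not ordered:
        raise ValueError("no sequence lengths given")
    half_total = sum(ordered) / 2
    running = 0
    for count, length in enumerate(ordered, start=1):
        running += length
        if running >= half_total:
            return length, count
    raise AssertionError("unreachable for a non-empty list of positive lengths")


if __name__ == "__main__":
    toy = [(1000, 5), (400, 10), (2000, 2), (500, 3)]
    kept = filter_supercontigs(toy)
    print(kept)
    print(n50(length for length, _ in kept))
```

The second value returned by n50 is the count of sequences that make up half of the total length, which is presumably what the "n" reported next to an N50 value refers to.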

The integration of the WGS assembly with the clone sequences results in the released Zv5 assembly (bp measures include estimated gap sizes):

  • Total bases = 1,630,306,866 bp
  • Scaffolds = 16,214
  • Largest = 9,935,765 bp
  • N50 = 1,116,981 bp, n = 520

  • 1,200,129,620 bp in FPC contigs tied to chromosomes 1-25.
  • 183,993,739 bp in 265 scaffolds tied to unplaced FPC contigs.
  • 246,183,507 bp in 14,676 NA scaffolds.
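
As a quick consistency check, the three categories listed above add up to the reported total. The Python sketch below repeats that arithmetic and derives each category's share of the assembly; the dictionary keys and the function name are illustrative only.

```python
from typing import Dict

TOTAL_BP = 1_630_306_866

# Figures copied from the list above; the labels are illustrative.
BREAKDOWN_BP: Dict[str, int] = {
    "FPC contigs tied to chromosomes 1-25": 1_200_129_620,
    "scaffolds tied to unplaced FPC contigs": 183_993_739,
    "NA scaffolds": 246_183_507,
}


def shares(breakdown: Dict[str, int], total: int) -> Dict[str, float]:
    """Return each category's fraction of the total sequence length."""
    return {label: bp / total for label, bp in breakdown.items()}


if __name__ == "__main__":
    assert sum(BREAKDOWN_BP.values()) == TOTAL_BP
    for label, fraction in shares(BREAKDOWN_BP, TOTAL_BP).items():
        print(f"{label}: {fraction:.1%}")
```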