Sixth assembly (Zv6) of the zebrafish genome released

General information

The assembly comprises a total sequence length of 1,626,077,335 bp in 6,653 fragments. It has been tied to the FPC map (data freeze 12th March 2006), which provides a tiling path of sequenced clones. 1.02 Gb of sequence from 7,615 sequenced clones (5,994 finished and 1,621 unfinished) was taken as a scaffold and completed with contigs from a whole genome shotgun (WGS) assembly (see details below). This integration of clone sequences and WGS contigs is based on a mixed strategy that considers sequence alignments and the placement of BAC ends and of features such as zebrafish cDNAs and markers.

In this release the integration algorithm has been updated to allow the placement of WGS contigs that contain markers but could not be linked to the FPC contigs. Information from markers has also been used to detect misjoins in the WGS supercontigs, reducing the conflicts with the marker panels. There are cases where markers from different chromosomes appear in a single WGS contig or sequenced clone; in these situations priority has been given to the genetic panels HS and MGH. Some of these cases may be due to misassemblies but others, in particular those in sequenced clones, suggest inconsistencies between the marker panels.
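The priority rule for conflicting markers can be illustrated with a small sketch. Only the panel names HS and MGH come from the text above; the function name, the "OTHER" panel label, the majority vote and the handling of ties are assumptions made for illustration and are not a description of the actual integration algorithm.

```python
from collections import Counter

# Panels that take priority when markers disagree (from the release notes).
PRIORITY_PANELS = {"HS", "MGH"}


def assign_chromosome(marker_hits):
    """Pick a chromosome for one contig or clone.

    marker_hits is a list of (panel, chromosome) pairs. Hits from the
    priority panels are used alone whenever any exist; otherwise all hits
    are used. Returns None when there are no hits or the vote is tied.
    The majority vote and tie handling are illustrative assumptions.
    """
    if not marker_hits:
        return None
    priority = [chrom for panel, chrom in marker_hits if panel in PRIORITY_PANELS]
    candidates = priority if priority else [chrom for _, chrom in marker_hits]
    ranked = Counter(candidates).most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return None  # unresolved conflict
    return ranked[0][0]


# "OTHER" is a hypothetical panel name used only for this example.
hits = [("HS", 3), ("OTHER", 7), ("OTHER", 7)]
print(assign_chromosome(hits))  # 3
```

In this toy case the single HS marker outweighs the two markers from the other panel, which is one way of reading "priority has been given to the genetic panels HS and MGH".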

The sequences that are based on FPC contigs or are linked to chromosomes via markers are named Zv6_scaffold followed by a number. The WGS contigs that could not be placed onto chromosomes are named Zv6_NA followed by a number. Following the agreement reached at the European Zebrafish Meeting in Paris, 2003, we translated linkage group numbers directly into chromosome numbers (e.g. linkage group 1 = chromosome 1).
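For scripts that process the assembly, the naming scheme can be handled with a short helper. This is a minimal sketch: it assumes the number follows the prefix directly (for example Zv6_scaffold12 or Zv6_NA45), which the text does not spell out, and the function names are hypothetical.

```python
import re

# Assumed formats: "Zv6_scaffold<number>" and "Zv6_NA<number>".
_SCAFFOLD = re.compile(r"^Zv6_scaffold(\d+)$")
_NA = re.compile(r"^Zv6_NA(\d+)$")


def classify_sequence(name):
    """Return (kind, number) for a Zv6 sequence name.

    kind is "scaffold" for FPC-based or marker-linked sequences,
    "NA" for unplaced WGS contigs, and "unknown" otherwise.
    """
    match = _SCAFFOLD.match(name)
    if match:
        return ("scaffold", int(match.group(1)))
    match = _NA.match(name)
    if match:
        return ("NA", int(match.group(1)))
    return ("unknown", None)


def linkage_group_to_chromosome(linkage_group):
    """Linkage group numbers map directly onto chromosome numbers 1-25."""
    if not 1 <= linkage_group <= 25:
        raise ValueError("zebrafish linkage groups are numbered 1 to 25")
    return linkage_group


print(classify_sequence("Zv6_scaffold12"))  # ('scaffold', 12)
print(classify_sequence("Zv6_NA45"))        # ('NA', 45)
print(linkage_group_to_chromosome(1))       # 1
```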

Please note:

This is still a *preliminary* assembly and there are a number of points to remember. The regions of the assembly covered by WGS contigs are of lower quality. In general, highly variable regions do not form clusters, since they are quite likely from different haplotypes. This also affects the generation of the physical map, resulting in assembly dropouts and false duplications. Special attention has been paid to these issues in this assembly, which is reflected in the drop in the number of scaffolds, in particular for regions not attached to chromosomes.

Resources

A pre-Ensembl database built on the Zv6 assembly, featuring the sequence and raw computes, is now available.

The assembly can be searched using BLAST or SSAHA2. Individual contigs of interest can be downloaded using the Export Data option.

The whole assembly can be downloaded from ftp://ftp.ensembl.org/pub/assembly/zebrafish/Zv6release

Assembly Statistics

The WGS assembly used to fill the gaps in the tiling path is the same as the one used for Zv5. It is based on 20,541,433 reads comprising 14,160,626,498 bp, giving a coverage of 6.5-7x. This set includes 6,882,050 reads from a library generated from a single doubled haploid Tuebingen zebrafish. In order to increase the continuity of contigs in the finished or near-finished regions, we generated 1,366,419 shredded reads from finished clones in the tiling path. From this set, 18,969,500 reads were finally placed in the assembly. Phusion was used to cluster the reads and phrap was used for consensus generation. This resulted in 247,928 contigs with an N50 size of 20,629 bp. Contigs are joined into supercontigs based on read-pair information, with gap sizes estimated from the insert sizes of the different libraries. Small supercontigs with fewer than 3 reads or smaller than 0.5 kb were rejected. There are 105,987 supercontigs in the WGS assembly with an N50 size of 687,451 bp.
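For readers unfamiliar with the N50 statistic and the supercontig size filter mentioned above, the following is a minimal sketch using the standard definition of N50. The function names and the toy lengths are hypothetical; this is not the code used to produce the assembly figures.

```python
def n50(lengths):
    """Return (N50, n) for a collection of sequence lengths.

    N50 is the length of the sequence at which the running total, taken
    from longest to shortest, first reaches half of the summed length.
    n is the number of sequences needed to get there.
    """
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for count, length in enumerate(ordered, start=1):
        running += length
        if running >= half_total:
            return length, count
    raise ValueError("lengths must not be empty")


def keep_supercontig(num_reads, length_bp):
    """Apply the stated filter: reject supercontigs with fewer than
    3 reads or shorter than 0.5 kb (500 bp)."""
    return num_reads >= 3 and length_bp >= 500


# Toy lengths, not real data: total 20, half 10, reached at the second sequence.
print(n50([8, 5, 4, 3]))         # (5, 2)
print(keep_supercontig(2, 900))  # False
print(keep_supercontig(3, 500))  # True
```

The second value returned by n50 corresponds to a count of sequences at or above the N50 length.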

The integration of the WGS assembly with the clone sequences results in the Zv6 assembly (bp measures include estimated gap sizes):

Total bases = 1,626,077,335 bp

Scaffolds = 6,653

Largest = 9,115,136 bp

N50 = 1,247,221 bp, n = 327

1,547,299,723 bp in scaffolds placed on chromosomes 1-25 (includes 100 bp gaps between scaffolds).

15,986,232 bp in 68 scaffolds tied to unplaced FPC contigs.

63,164,280 bp in 2,898 NA scaffolds.
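As a quick consistency check, the three category figures can be added up and compared with the stated total. All numbers below are copied from this page; the variable names are illustrative. The categories sum to slightly more than the total, and the difference is an exact multiple of 100 bp. That would be consistent with the 100 bp inter-scaffold gaps being counted in the chromosome figure but not in the total, although this interpretation is an assumption and is not confirmed by the release notes.

```python
# Figures copied from the statistics above.
total_bp = 1_626_077_335
chromosome_bp = 1_547_299_723    # includes 100 bp gaps between scaffolds
unplaced_fpc_bp = 15_986_232
na_bp = 63_164_280

category_sum = chromosome_bp + unplaced_fpc_bp + na_bp
difference = category_sum - total_bp

print(category_sum)      # 1626450235
print(difference)        # 372900
print(difference % 100)  # 0, i.e. a whole number of 100 bp units
```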