Fourth assembly (Zv4) of the zebrafish genome released

Please note that this is still a *preliminary* assembly, and there are a number of points to remember:

There is a high level of misassembly. This is because the source DNA came from ~1000 5-day-old embryos, and the polymorphism rate is at least 1 in 200 bp, with additional significant indels. As a result, highly variable regions of the genome do not form clusters for assembly, since the sequences that originate from a given region are quite likely to come from different haplotypes. This causes assembly dropouts in some regions and false duplications in others, where phrap splits different haplotypes into multiple paths. We are working on the assembly code, Phusion, to address these issues. However, there is an enormous amount of useful sequence in this assembly, and we hope this outweighs the problems in the assembly.

The assembly comprises a total sequence length of 1,560,480,686 bp in 21,333 fragments. This assembly has been tied to the FPC map (data freeze 17th of May, 2004). It contains 443 Mb from 2,828 finished clones and 121 Mb from 1,272 unfinished clones. The FPC contigs provided a template for placing the finished and unfinished clone sequence. We use a new mixed strategy of sequence alignment and BAC end positions to fill the remainder of the sequence with WGS assembly contigs. The WGS sequences used for this approach are identical to those used in Zv3; however, the FPC data and the process of integrating the two data types have much improved. In this release, sequences that are based on FPC contigs are named Zv4_scaffold followed by a number. The WGS contigs that could not be placed in those scaffolds are named Zv4_NA followed by a number. Zv4_scaffolds were placed onto chromosomes where possible. According to the agreement reached at the European zebrafish meeting in Paris, 2003, we translated linkage group numbers directly into chromosome numbers (e.g. linkage group 1 = chromosome 1).
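The naming convention above can be used to separate FPC-based scaffolds from unplaced WGS contigs when processing sequence names. The following is a minimal sketch only: it assumes the number follows the prefix directly with no separator (e.g. "Zv4_scaffold12", "Zv4_NA345"), which the text does not confirm, and the function name and category labels are hypothetical.

```python
import re

# Assumption: names look like "Zv4_scaffold12" or "Zv4_NA345", i.e. the
# number follows the prefix directly. Adjust the pattern if the released
# files use a separator.
NAME_RE = re.compile(r"^Zv4_(scaffold|NA)(\d+)$")


def classify(name):
    """Return (kind, number) for a Zv4 sequence name.

    kind is "fpc_scaffold" for sequences based on FPC contigs,
    "unplaced_wgs" for WGS contigs that could not be placed, and
    "unknown" for anything that does not match the assumed pattern.
    """
    match = NAME_RE.match(name)
    if match is None:
        return ("unknown", None)
    kind = "fpc_scaffold" if match.group(1) == "scaffold" else "unplaced_wgs"
    return (kind, int(match.group(2)))
```

Names that do not fit the assumed pattern are reported as "unknown" rather than guessed at.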

A pre-Ensembl database built on the Zv4 assembly, featuring the sequence and raw computes, is now available.

The assembly can be searched using BLAST or SSAHA. Individual contigs of interest can be downloaded there via the Export Data option.

The whole assembly can be downloaded at ftp://ftp.ensembl.org/pub/assembly/zebrafish/Zv4release

Assembly Statistics

We started with 13,122,073 reads comprising 9,107,933,259 bp (average read length 694 bp). The coverage is roughly 5.7x. There are 10,504,790 unique reads (80% of the total) placed in the assembly. (Note: untrimmed reads and placed reads align well with the assembly.)
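The read statistics can be cross-checked with simple arithmetic. The sketch below uses only the figures quoted above; the variable names are illustrative, and the genome size is merely the value implied by dividing total bases by the stated coverage, not a figure given in this release.

```python
# Figures quoted in the assembly statistics.
total_reads = 13_122_073
total_bases = 9_107_933_259
placed_reads = 10_504_790
stated_coverage = 5.7

# Average read length: total bases divided by number of reads (about 694 bp).
avg_read_length = total_bases / total_reads

# Fraction of reads placed in the assembly (about 0.80).
placed_fraction = placed_reads / total_reads

# Genome size implied by the stated coverage (about 1.6 Gb).
# This is a derived estimate, not a number from the release notes.
implied_genome_size = total_bases / stated_coverage

print(f"average read length: {avg_read_length:.0f} bp")
print(f"placed reads: {placed_fraction:.0%}")
print(f"implied genome size: {implied_genome_size / 1e9:.2f} Gb")
```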

Phusion was used to cluster the reads, and phrap was used for cluster assembly and consensus generation.

Small supercontigs with fewer than 3 reads or shorter than 0.5 kb were rejected.
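The rejection rule above amounts to a simple filter. The following sketch illustrates the rule as stated, taking 0.5 kb as 500 bp; the `Supercontig` record and its field names are hypothetical and do not reflect the actual data structures used in the assembly pipeline.

```python
from dataclasses import dataclass

MIN_READS = 3        # supercontigs with fewer reads are rejected
MIN_LENGTH_BP = 500  # 0.5 kb; shorter supercontigs are rejected


@dataclass
class Supercontig:
    # Hypothetical record type for illustration only.
    name: str
    n_reads: int
    length_bp: int


def is_kept(sc):
    """Return True if the supercontig passes both thresholds."""
    return sc.n_reads >= MIN_READS and sc.length_bp >= MIN_LENGTH_BP


candidates = [
    Supercontig("a", n_reads=2, length_bp=1200),   # too few reads
    Supercontig("b", n_reads=5, length_bp=450),    # too short
    Supercontig("c", n_reads=3, length_bp=500),    # meets both thresholds
]
kept = [sc.name for sc in candidates if is_kept(sc)]
```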

1,260,930,206 bp (75 %) could be tied to the FPC map.

Scaffold statistics (bp measures include estimated gap sizes):

Total bases = 1,560,480,686 bp

Scaffolds = 21,333

Average length = 73,148 bp

Largest = 7,353,829 bp

N50 = 719,174 bp, n = 520
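For readers unfamiliar with the N50 measure, the sketch below shows one common way to compute it from a list of scaffold lengths: sort the lengths in descending order and report the length at which the running total first reaches half of all bases. It assumes that "n" above is the number of scaffolds needed to reach that point, which is the usual convention but is not spelled out here. The function name is illustrative, and the example uses toy lengths rather than the Zv4 data.

```python
def n50(lengths):
    """Return (N50, n) for an iterable of sequence lengths.

    N50 is the length of the sequence at which the cumulative sum of
    lengths, taken from longest to shortest, first reaches half of the
    total. n is the number of sequences needed to reach that point.
    """
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for count, length in enumerate(ordered, start=1):
        running += length
        if running >= half_total:
            return length, count
    raise ValueError("no sequence lengths given")


# Toy example: total is 30, so half is 15; 10 + 8 = 18 reaches it.
toy_n50, toy_n = n50([10, 8, 6, 4, 2])

# Consistency check on the quoted average: total bases over scaffold count.
average_length = 1_560_480_686 // 21_333
```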