FASTA Database Files

Introduction

Ensembl provides sequence databases of transcript and translation models predicted by the Ensembl analysis and annotation pipeline, as well as by ab initio methods. The database files, in FASTA format, are available from the corresponding 'fasta' directories on the ftp.ensembl.org site. This document describes the current naming convention and the sequence header line format used by Ensembl. Similar descriptions are also available in the README files in the FTP site directories.

Ensembl also provides a more general description of the FTP site structure. The 'fasta' directories contain the following sub-directories:

To facilitate storage and download, all databases are compressed with GNU Zip (gzip, *.gz).
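Because the files are gzip-compressed, they can be read without unpacking them first. The following is a minimal sketch using only the Python standard library; the function name read_fasta is hypothetical and not part of any Ensembl tool, and the file name in the usage note is simply one of the examples from this document.

```python
import gzip


def read_fasta(path):
    """Yield (header, sequence) pairs from a gzip-compressed FASTA file.

    The header is returned without the leading '>' character.
    """
    header, chunks = None, []
    # Mode "rt" decompresses on the fly and returns text lines.
    with gzip.open(path, "rt") as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                # A new header line closes the previous record, if any.
                if header is not None:
                    yield header, "".join(chunks)
                header, chunks = line[1:], []
            elif header is not None and line:
                chunks.append(line)
    # Emit the last record in the file.
    if header is not None:
        yield header, "".join(chunks)
```

For example, iterating over read_fasta("Homo_sapiens.NCBI34.may.dna.chromosome.1.fa.gz") would yield one pair per sequence in that file, assuming the file has been downloaded to the working directory.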

FASTA Database File Names

All files deposited in these directories obey a common naming scheme:

species.version.month.sequence type.[status].[id type].[id].fa.gz 

FASTA Sequence Header Lines

The FASTA sequence header lines are designed to be consistent across all types of Ensembl sequences and to give enough information for a sequence to be identified outside the context of its sequence database file.

>ID SEQTYPE:IDTYPE LOCATION 

The following sequence header line is an example for the simple Ensembl cDNA header format:

>ENST00000289823 cdna:known chromosome:NCBI34:8:21922367:21927699:1 
 ^               ^    ^     ^
 ID              |    |     LOCATION 
                 |    IDTYPE 
                 SEQTYPE 

A great deal more transcript-specific metadata can be added to the header. The extended format appends it as a META field:

>ID SEQTYPE:IDTYPE LOCATION META 

An example for the extended sequence header format is the following line:

>ENST00000289823 cdna:known chromosome:NCBI34:8:21922367:21927699:1 gene:ENSG00000158815:HUGO:FGF17
 ^               ^    ^     ^                                       ^ 
 ID              |    |     LOCATION                                META
                 |    IDTYPE 
                 SEQTYPE 
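The header layout described above can be split into its fields with a few string operations. The sketch below is illustrative only: the names Header and parse_header are hypothetical, the fields are assumed to be separated by whitespace as in the examples, and LOCATION and META are kept as opaque strings because this document does not define their internal structure.

```python
from typing import NamedTuple, Optional


class Header(NamedTuple):
    seq_id: str
    seqtype: str
    idtype: str
    location: str
    meta: Optional[str]  # None for the simple header format


def parse_header(line):
    """Parse '>ID SEQTYPE:IDTYPE LOCATION [META]' into a Header."""
    fields = line.strip().lstrip(">").split()
    if len(fields) < 3:
        raise ValueError(f"expected at least 3 fields, got: {line!r}")
    # SEQTYPE and IDTYPE share one field, separated by a colon.
    seqtype, _, idtype = fields[1].partition(":")
    # Anything after LOCATION is treated as META (assumption).
    meta = " ".join(fields[3:]) or None
    return Header(fields[0], seqtype, idtype, fields[2], meta)
```

Applied to the extended example line above, this gives seq_id "ENST00000289823", seqtype "cdna", idtype "known", the full chromosome string as location, and "gene:ENSG00000158815:HUGO:FGF17" as meta. For the simple format, meta is None.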

DNA Directories

Top Level

These files contain the full sequence of the assembly in FASTA format, with one chromosome per file. File names follow this scheme:

species.version.month.sequence type.id type.id.fa.gz 

Examples

The genomic sequence of human chromosome 1:

Homo_sapiens.NCBI34.may.dna.chromosome.1.fa.gz 

The repeat-masked version of the genomic sequence of human chromosome 1 contains '_rm' in the name:

Homo_sapiens.NCBI34.may.dna_rm.chromosome.1.fa.gz 

Non-chromosomal assembly sequences (e.g. the mitochondrial genome, or sequence contigs not yet mapped to chromosomes):

Homo_sapiens.NCBI34.may.dna.nonchromosomal.fa.gz 
Homo_sapiens.NCBI34.may.dna_rm.nonchromosomal.fa.gz 
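As an illustration of the naming scheme, the file names above can be split on dots into their fields. This is a sketch under the assumption that no field itself contains a dot, which holds for the examples shown here but is not guaranteed by this document; the function name parse_dna_filename and the dictionary keys are hypothetical.

```python
def parse_dna_filename(name):
    """Split a top-level DNA file name into its dot-separated fields."""
    suffix = ".fa.gz"
    if not name.endswith(suffix):
        raise ValueError(f"not a gzip-compressed FASTA file name: {name!r}")
    parts = name[: -len(suffix)].split(".")
    # Five fields without an id (e.g. 'nonchromosomal'), six with one.
    if len(parts) not in (5, 6):
        raise ValueError(f"unexpected number of fields in {name!r}")
    species, version, month, seq_type, id_type = parts[:5]
    return {
        "species": species,
        "version": version,
        "month": month,
        "sequence_type": seq_type,
        "id_type": id_type,
        "id": parts[5] if len(parts) == 6 else None,
    }
```

For the chromosome 1 example, the result has species "Homo_sapiens", version "NCBI34", month "may", sequence_type "dna", id_type "chromosome" and id "1"; for the non-chromosomal files, id is None.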

Sequence Level

These files represent dumps of the assembly at the sequence level in FASTA format.

species.version.month.sequence type.id type.fa.gz 

Examples

Unmasked sequence file name examples:

Homo_sapiens.NCBI34.may.dna.contig.fa.gz 
Anopheles_gambiae.MOZ2a.may.dna.chunk.fa.gz 
Fugu_rubripes.FUGU2.may.dna.scaffold.fa.gz 

Repeat masked files contain '_rm' in the file name:

Homo_sapiens.NCBI34.may.dna_rm.contig.fa.gz 
Anopheles_gambiae.MOZ2a.may.dna_rm.chunk.fa.gz 
Fugu_rubripes.FUGU2.may.dna_rm.scaffold.fa.gz 

Note that the sequence 'id type' varies between species: contigs in human, chunks in Anopheles gambiae, and scaffolds in Takifugu rubripes (named Fugu_rubripes in the file names above).
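The '_rm' convention makes it straightforward to tell masked and unmasked files apart, or to derive one name from the other. The helpers below are a hypothetical sketch, not an Ensembl utility; they assume that the sequence type is always the fourth dot-separated field, as in the examples in this document.

```python
def is_repeat_masked(name):
    """Return True if the sequence-type field of a file name ends in '_rm'."""
    fields = name.split(".")
    if len(fields) < 4:
        raise ValueError(f"too few fields in {name!r}")
    return fields[3].endswith("_rm")


def masked_counterpart(name):
    """Return the repeat-masked file name for an unmasked one.

    A name that is already masked is returned unchanged.
    """
    fields = name.split(".")
    if len(fields) < 4:
        raise ValueError(f"too few fields in {name!r}")
    if not fields[3].endswith("_rm"):
        fields[3] += "_rm"
    return ".".join(fields)
```

For instance, masked_counterpart("Anopheles_gambiae.MOZ2a.may.dna.chunk.fa.gz") returns "Anopheles_gambiae.MOZ2a.may.dna_rm.chunk.fa.gz", regardless of which id type the species uses.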

README

Each directory on ftp.ensembl.org contains an auto-generated README file that explains the file names and FASTA header line conventions in use.