Saccharomyces Genome Resequencing Project data, version 7, 15-3-2007 ==================================================================== Please contact David Carter, dmc@sanger.ac.uk, if you need any help or clarification on the following. This file documents the FTP-downloadable data for the project, located at ftp://ftp.sanger.ac.uk/pub/dmc/yeast. Sections of this file: Known problems Alignment filtering method Imputing Nucleotides Contents of this directory Version history ----------- Known problems and incompletenesses: (1) IMPORTANT: the per-strain sequences in subdirectory "assembly" are only provisional. They do not take account of rearrangements or any indels too large to feature in ssahaSNP alignments (i.e. over 20 to 30bp or so), but they do represent our best guess at nucleotide values (i.e. substitutions) and small indels (up to 20 or 30bp). So they should be useful for primer design, deriving (most) gene sequences, etc, but not for investigating large-scale differences between strains. I am working on improving this. (2) No complete S paradoxus assembly is available; we use the 832-contig assembly available from http://www.broad.mit.edu/annotation/fungi/comp_yeasts/downloads.html. The version of the S cerevisiae genome we use is an intermediate one from November 2005; we plan to update it soon. (3) S cerevisiae strain UWOPS87-2421 has ABI sequence reads and alignments of those reads to the reference genome (subdirectory "read"), but was accidentally omitted from subsequent processing (subdirectories "match" and "report"). This should be corrected in the next month or two; mail me if you are particularly interested in this strain. (4) More sequence data is expected for S cerevisiae strains 273614X, K11, YJM975, YS2 and YS9; this should appear here during March or April 2007. ----------- Alignment filtering method Typically, a given read aligns to many places in the reference genome with many different scores. By default, ssahaSNP discards all alignments of a read except the unique maximum-scoring one (if it exists). The most obvious negative consequence of this is that exactly-repeated regions of the reference genome receive no hits. We therefore run ssahaSNP in a very "lax" mode, generating many weak and/or short alignments. We then group the alignments for a read pair into consistent clusters, and choose the best-scoring cluster, unless there is a near-tie in which case we cautiously reject all clusters. The genome browser interface shows which alignments were rejected for which reasons. ----------- Imputing Nucleotides We impute nucleotide values and associated error probabilities for all reference base positions for all strains by applying Felsenstein's algorithm to the trees created by Mark Minichiello's "Margarita" ancestral recombination graph constructor. Margarita was run on the SNPs identified in the sorted.gz file described below, and the imputer output will differ from this in the following ways: - it can change a value at a SNP if closely related strains in that part of the genome disagree - it will infer variants even where the NQS criterion is not met, e.g. where multiple mutations are present within a few bases - it will also infer values for a strain when there is no data at all in that position for that strain (generally with lower confidence) - see note 1 below for a caveat in this case. The current version of this data is really just a proof of principle and we would be pleased to have feedback on it. There are many reasons why it is suboptimal, and these should be borne in mind. Here are the most important: 1) When there are no alignments from some strain to a given region of the genome, we assume this is because of the shallow sequencing coverage, and try to impute values. Of course, the reason could be that that region is deleted altogether in the strain in question, so that imputation is not appropriate. We are intending to look for long indels in the near future. 2) The model of mutation in the phylogenetic trees is very simplistic. Gap symbols are treated as if they were a "fifth nucleotide", and mutations from one nucleotide to any of the other four are assumed to be equally likely (transitions are not favoured over transversions). A mutation probably of 0.0001 per tree branch is assumed, regardless of genomic region, codon position, or branch height. 3) We run Margarita to create 100 ARGs per chromosome, and then assume these ARGs sample the space of ARGs uniformly, i.e. we assign a prior probability of 0.01 to each one. 4) Margarita does not see positions with more than two nucleotide values, nor does it take base quality calls into account. For the latter reason, we only show it nucleotides satisfying NQS. This latter restriction should be removed soon. 5) We assume that the recombination, if any, always occurs exactly halfway between each pair of SNPs seen by Margarita. 6) The data has not been thoroughly checked, so some subtle bugs may still be present. While the above approximations seem a sensible initial set, we are keen to find out which ones, if any, lead to serious inaccuracies. -------------- Contents of this directory. NOTE: in all files, S cerevisiae chromosome names are chr01 ... chr16, with chrm for the mitochondrion and scplasm1 for the 2-micron circle. For S paradoxus, contig names are c000, c001 ... c831, instead of the c0, c1, ... in the original assembly. Subdirectory "ref": reference genomes in Fasta format ref/cere/genome.fa Cerevisiae reference genome ref/cere/ALL.fa genome.fa plus some vector sequences not accounted for by ssahaSNP (so that as much vector contamination as possible can be filtered out) ref/extras The vector sequences. ref/para/ As for ref/cere, but for paradoxus. Subdirectory "aliases": various definition files aliases/cere_chr.txt Mapping from ?NCBI codes for cerevisiae chromosome names to the form that we use, chrNN (e.g. we call chrVIII chr08) aliases/filterParams.txt Formal definitions of meth1 and meth2 (so it is easy to define other methods) aliases/strainSets.txt Which strains make up the CERE and PARA sets. Throughout this data, "cere" and "para" refer to (reference genomes for) species, while "CERE" and "PARA" refer to strain sets. The distinction is maintained to allow us to match cerevisiae strains to the paradoxus genome or vice versa, though we have not done so here. aliases/strains.txt Mapping from non-canonical to canonical strain names. The non-canonical names are abbreviations and/or errors used when the strains were sequenced, but the canonical names are the ones that occur throughout the data provided here. aliases/traceDirs.txt Strain read locations in Sanger internal trace repository -- only of local interest (we have now submitted the reads to trace.ensembl.org, however). Subdirectory "read": per-strain alignment of reads read/$STRAIN/renamed.fastq Fastq file for the reads for each strain, after clipping to remove low-quality ends, screening out matches to (most) vector sequences, canonicalisation of strain and read names, and sorting by name. Each sequence is followed by the corresponding quality scores in single character format - see below qualities.gz. For cerevisiae strains, some reads have been renamed to correct errors in the sequencing process; thus, X.p1k and X.q1k (when both exist) should now (nearly) always be read pairs. We have not been able to determine if similar corrections are required for paradoxus, because of the fragmentary nature of the reference assembly. read/$STRAIN/renamed.wm File showing the good-quality parts of each read. Format of each line is: readName Length Lo1 Hi1 Lo2 Hi2 ... where LoN-HiN is a good-quality region. read/$STRAIN/name_corrections.txt File showing the name corrections (if any) applied to create renamed.fastq from the original clip.fastq (from which reads were submitted to Genbank). A line "X Y" means X was renamed to Y; "X" on its own means X was discarded, usually because it represents an inferior or failed attempt to sequence a read. read/$STRAIN/$SPECIES/$METHOD/ssahaSNP.sorted.gz SsahaSNP output: matches of reads for $STRAIN to the reference genome for $SPECIES, filtered using $METHOD (always "methc" at present, for "clustering method"). read/$STRAIN/$SPECIES/$METHOD/alilines.txt Accepted alignments for this strain, species and method. read/$STRAIN/$SPECIES/$METHOD/alilines_status.txt Status and cluster ID for every alignment produced by ssahaSNP. read/$STRAIN/$SPECIES/$METHOD/clusters.txt Cluster memberships and scores. Subdirectory "match": per-chromosome/contig processing derived from read/* match/$SET/$SPECIES/$METHOD/p$PART/$CHR Directory containing files for matches to chromosome (or contig) $CHR of the $SPECIES genome from strain set $SET, filtered using $METHOD. $PART is just an extra directory level to avoid the inefficiency of having hundreds of sister contig directories for paradoxus: the $PART of a $CHR is its final character if that is a digit, or 0 otherwise. Within each such directory are the following files, in order of creation: matches.gz SsahaSNP output (as in ssahaSNP.sorted.gz above) for this data. Thus the union of all matches.gz files under a given $SET, $SPECIES and $METHOD should be identical to the union of the ssahaSNP.sorted.gz files for the same conditions, but distributed differently. qualities.gz Base qualities for each match in matches.gz. A quality character Q represents a quality value q = ascii(Q) - ascii(!), such that P(error) = 10 ^ (-q/10). The next two files basically have a row per base position, with the strains organised by column. sorted.gz The first line of this file is of the form ">strain1 strain2 ..." showing the (alphabetical) order applied to the strains whose values occur in the file. (Not all strains match to all paradoxus contigs, for example). Then each following line consists of interleaved nucleotide and quality characters, one such pair for each strain, beginning with the reference genome and continuing with strain1, strain2 etc. Thus the third column is the nucleotide value for strain1, and the fourth is its quality value. The following conventions apply in sorted.gz and/or imputed.gz in the directory: -- If the nucleotide is blank, no read for that strain was matched (and survived filtering) at this point. -- For sorted.gz, if a non-blank nucleotide is upper case, it passes NQS; if lower-case, it fails. -- Gaps are indicated by digits 1 to 9. The different digits are equivalent except that they represent overlapping runs of different extent; they are distinguished only so that Margarita treats them differently. -- If the nucleotide is non-blank and non-gap but the quality is blank, the error probability is very low (either this is the reference genome, i.e. column 2 of the line, or the evidence is so strong that Q would have an ascii code greater than 126. imputed.gz This is a filled-out version of sorted.gz, with the same format (except as noted below) but with nucleotide values imputed, and quality scores adjusted, by using the techniques based on ancestral recombination graphs. Thus this file should (for better or worse) contain a nucleotide and quality character for every strain in every position. The difference in format from sorted.gz is that a lower-case nucleotide denotes a value inferred where there was no value, or occasionally a different nucleotide, in sorted.gz. Upper case means that the same nucleotide occurs here in sorted.gz, irrespective of NQS. The "IMPORTANT CAVEAT" given for assemblies (below) also applies to the imputed.gz files, from which the assemblies are derived. The caveat is that rearrangements are NOT reflected in imputed.gz files, even when they show up elsewhere, e.g. in the genome browser. Subdirectory "report": snps_cere_CERE_methc.txt.gz snps_para_PARA_methc.txt.gz Summary of all SNPs in match/.../sorted.gz files (see above) that satisfy NQS. That is, these SNPs are derived by combining matches for the same strain to the same genome position, but _not_ by imputing missing values. isnps_cere_CERE_methc.txt.gz isnps_para_PARA_methc.txt.gz As for snps...gz, but derived from imputed.gz instead of sorted.gz. A typical SNP entry is as follows: TTCCATGGCTGAGTTGTAGTC chr01 26724 gtcCAtGGCTCAGTTgTAGtt chr01 26724 Y55 G>C 51 2/2 NNNNNNNNNTTTYYTTTTTTT This represents a 21-base window centered on the SNP. The fields are: referenceSequence chr offset strainSequence chr offset strain R>S q vote qualityValues where: referenceSequence is the reference sequence (perhaps with some gaps) chr is the chromosome or contig of the ref sequence offset is the offset (zero-based) of the SNP strainSequence is the sequence for the strain with the SNP X>Y shows the SNP itself: X in ref sequence, Y in strain q is the quality score: P(error) = 10 ^ (-q/10) vote is M/N, indicating that M of the N matches for this strain to this position had nucleotide value Y (Absent in the "isnps..." case). qualityValues is the quality characters for the strainSequence isnp_counts_cere_CERE_methc.txt isnp_counts_para_PARA_methc.txt Imputed SNP counts over the whole genome between each pair of strains for the species in question. Subdirectory "assembly": {preimp,postimp}/STRAIN.fastq.gz {preimp,postimp}/STRAIN.fa.gz Compressed FASTQ and FASTA format files for the given strain, pre- and post- SNP imputation, i.e. derived from sorted.gz and imputed.gz files respectively. The postimp files should be what you want for most purposes. These files represent a first attempt at assemblies for each strain. IMPORTANT CAVEAT, which I am working on resolving: they do not take ANY account of rearrangements or any indels too large to feature in ssahaSNP alignments (i.e. over 20 to 30bp or so), but they do represent our best guess at nucleotide values (i.e. substitutions) and small indels (up to 20 or 30bp). So they should be useful for primer design, deriving (most) gene sequences, etc. FASTQ format consists of sets of four lines like this: @header sequence +header quality The two "header" items in a given set of four lines are identical. In these files (as opposed to FASTQ in general), each header consists of: chunk name: strain.chromosome.nnn for nnn=1,2,3 ... chromosome, low offset and high offset for strain direction of alignment: "+" or "-" (always "+" at present because we're not yet handling rearrangements) chromosome, low offset and high offset for reference (ref chromosome is always same as strain chromosome at present) For "chromosome", read "contig" in the case of paradoxus, as usual. Where we have an insertion in the strain, we will have high offset = low offset - 1 in reference; and vice versa for an insertion in the reference (deletion in the strain). In the latter case, the sequence and quality strings are of zero length. Where we have neither an insertion nor a deletion, high-low should be the same for strain and reference, i.e. nucleotides align one-to-one. For easy reading, chunks have a maximum length of 1000. In a sequence line, the meaning of lower case is different for preimp and postimp. Preimp: nucleotide fails NQS, _or_ no nucleotide present (no coverage), so copied from reference Postimp: nucleotide was imputed (missing in pre-imputation). In a quality line, which should always be the same length as the sequence line above it, each character represents an upper bound on the probability that the corresponding nucleotide value in the sequence line is wrong. The formula to convert a character C to an error probability P is: 0.1 * (ASCII(C) - ASCII("!")) P <= 0.1 Thus for example: ASCII "A" is 65, and ASCII "!" is 33, giving a difference of 32, so P is 0.1 to the power 3.2 which is about 0.00063. The following perl script prints out this value for every character you give it: perl -ne '{print exp(log(0.1)*0.1*(ord($_)-ord("!"))) . "\n"}' "!" as a quality character (giving P <= 1, i.e. making no claims to accuracy at all) is used in preimp when there was no nucleotide present in the strain. "!" can also occur as a real quality character, but a long string of "!" characters almost certainly means the former case. To say the same things procedurally: To convert an ASCII character to an error probability: -- if the character is space, the probability is very close to zero. Otherwise: -- find the code of the character, e.g. with ord() in perl -- subtract the code for "!", which is 33 -- divide by 10 -- take 0.1 to that power To convert an error probability to an ASCII code: -- take log (to base 10) of the error probability -- multiply by -10 -- round to nearest integer -- add the code for "!", 33 -- if the resulting value is 127 or more, return the space character; otherwise return the character with that value, e.g. with chr() in perl. Subdirectory "browser_data": browser_data contains the .fa and .gff files used by the Gbrowse interface to the data (http://www.sanger.ac.uk/Teams/Team71/durbin/sgrp/browser.shtml). browser_data/cere.tgz is for cerevisiae, browser_data/para.tgz is paradoxus. Please mail me if you have any questions on formats that cannot be resolved by comparing the files with what you see in the browser. ---------- Version history 7, 14-3-2007: 1) Nearly-complete data for strains for second phase of project added: now there are 36 S cerevisiae strains and 27 S paradoxus. 2) Browser .gff files are now smaller in number and larger in size, to work around a SNP display problem. 6, 21-11-2006: 1) "imputed" and "report" files for S paradoxus are now (I think) complete; previously some were missing. 2) "browser_data" directory added. 3) "assembly" directory added. 5, 9-11-2006: 1) New alignment-filtering method implemented, based on clustering alignments into consistent sets. 2) New version of Margarita takes account of sequencing-error probabilities. 3) Cerevisiae read names corrected where necessary and possible, to make p1k and q1k correspond. 4, 19-5-2006: 1) S paradoxus strain Z1_1 added. 2) "report" files (detailing SNPs) have summary statistics at the end. [Omitted for space reasons: 3) ssahaSNP output files, read/{STRAIN}/{cere,para}/ssahaSNP.out.gz included.] 3, 17-5-2006: 1) Some wrong lines in imputed.gz files were fixed. 2) Gap error probability changed to 10^-2.5. 2, 13-4-2006: 1) Most vector sequences have been screened out of the reads using ssahaScreen. 2) Two alternative methods have been used to filter matches to the reference genome. See "Filtering Methods" below. 3) The sequence for the cerevisiae two-micron circle plasmid, SCPLASM1, has been added to the cerevisiae genome, and accounts for about 0.165% of the alignments (1 in 600). 4) Reads for an additional paradoxus strain, A4, have been added. 5) The SNP files are about twice as large as before, because: -- overlapping matches for the same strain have their quality scores combined by taking the maximum score, and -- NQS may be satisfied with respect to any other strain, not only with respect to the reference genome. 6) A preliminary version of imputed sequence data is provided by using ancestral recombination graphs to infer likely nucleotide values when no reads, or only low-quality reads, for a strain align to a given location. See "Imputing Nucleotides" below. 1, 9-2-2006: first version