Saccharomyces Genome Resequencing Project data, version 7, 15-3-2007
====================================================================

Please contact David Carter, dmc@sanger.ac.uk, if you need
any help or clarification on the following.

This file documents the FTP-downloadable data for the project,
located at ftp://ftp.sanger.ac.uk/pub/dmc/yeast.

Sections of this file:

Known problems
Alignment filtering method
Imputing Nucleotides
Contents of this directory
Version history

-----------

Known problems and incompletenesses:

(1) IMPORTANT: the per-strain sequences in subdirectory "assembly" are
only provisional.  They do not take account of rearrangements or any
indels too large to feature in ssahaSNP alignments (i.e. over 20 to
30bp or so), but they do represent our best guess at nucleotide values
(i.e. substitutions) and small indels (up to 20 or 30bp). So they
should be useful for primer design, deriving (most) gene sequences,
etc, but not for investigating large-scale differences between
strains.  I am working on improving this.

(2) No complete S paradoxus assembly is available; we use the
832-contig assembly available from
http://www.broad.mit.edu/annotation/fungi/comp_yeasts/downloads.html.
The version of the S cerevisiae genome we use is an intermediate one
from November 2005; we plan to update it soon.

(3) S cerevisiae strain UWOPS87-2421 has ABI sequence reads and
alignments of those reads to the reference genome (subdirectory
"read"), but was accidentally omitted from subsequent processing
(subdirectories "match" and "report"). This should be corrected
in the next month or two; mail me if you are particularly interested
in this strain.

(4) More sequence data is expected for S cerevisiae strains
273614X, K11, YJM975, YS2 and YS9; this should appear here during
March or April 2007.

-----------

Alignment filtering method

Typically, a given read aligns to many places in the reference genome
with many different scores. By default, ssahaSNP discards all alignments
of a read except the unique maximum-scoring one (if it exists). The
most obvious negative consequence of this is that exactly-repeated
regions of the reference genome receive no hits.

We therefore run ssahaSNP in a very "lax" mode, generating many
weak and/or short alignments. We then group the alignments for a read pair
into consistent clusters, and choose the best-scoring cluster, unless
there is a near-tie in which case we cautiously reject all clusters.
The genome browser interface shows which alignments were rejected
for which reasons.

-----------

Imputing Nucleotides

We impute nucleotide values and associated error probabilities for all
reference base positions for all strains by applying Felsenstein's
algorithm to the trees created by Mark Minichiello's "Margarita"
ancestral recombination graph constructor.  Margarita was run on the 
SNPs identified in the sorted.gz file described below, and the imputer
output will differ from this in the following ways:
 - it can change a value at a SNP if closely related strains in that 
part of the genome disagree
 - it will infer variants even where the NQS criterion is not met, e.g.
where multiple mutations are present within a few bases
 - it will also infer values for a strain when there is no data at all
in that position for that strain (generally with lower confidence) - see
note 1 below for a caveat in this case.

The current version of this data is really just a proof of principle
and we would be pleased to have feedback on it. There are many reasons
why it is suboptimal, and these should be borne in mind. Here are
the most important:

1) When there are no alignments from some strain to a given region of
   the genome, we assume this is because of the shallow sequencing
   coverage, and try to impute values. Of course, the reason could be
   that that region is deleted altogether in the strain in question,
   so that imputation is not appropriate. We are intending to
   look for long indels in the near future.

2) The model of mutation in the phylogenetic trees is very simplistic.
   Gap symbols are treated as if they were a "fifth nucleotide", and
   mutations from one nucleotide to any of the other four are assumed
   to be equally likely (transitions are not favoured over transversions).
   A mutation probably of 0.0001 per tree branch is assumed, regardless
   of genomic region, codon position, or branch height.

3) We run Margarita to create 100 ARGs per chromosome, and then assume
   these ARGs sample the space of ARGs uniformly, i.e. we assign a
   prior probability of 0.01 to each one.

4) Margarita does not see positions with more than two nucleotide
   values, nor does it take base quality calls into account. For
   the latter reason, we only show it nucleotides satisfying NQS.
   This latter restriction should be removed soon.

5) We assume that the recombination, if any, always occurs exactly
   halfway between each pair of SNPs seen by Margarita.

6) The data has not been thoroughly checked, so some subtle bugs
   may still be present.

While the above approximations seem a sensible initial set, we are keen
to find out which ones, if any, lead to serious inaccuracies.

--------------

Contents of this directory.

  NOTE: in all files, S cerevisiae chromosome names are chr01 ... chr16,
  with chrm for the mitochondrion and scplasm1 for the 2-micron circle.
  For S paradoxus, contig names are c000, c001 ... c831, instead of
  the c0, c1, ... in the original assembly.

Subdirectory "ref": reference genomes in Fasta format

ref/cere/genome.fa
  Cerevisiae reference genome
ref/cere/ALL.fa
  genome.fa plus some vector sequences not accounted for by ssahaSNP
  (so that as much vector contamination as possible can be filtered out)
ref/extras
  The vector sequences.
ref/para/
  As for ref/cere, but for paradoxus.

Subdirectory "aliases": various definition files

aliases/cere_chr.txt
  Mapping from ?NCBI codes for cerevisiae chromosome names to
  the form that we use, chrNN (e.g. we call chrVIII chr08)
aliases/filterParams.txt
  Formal definitions of meth1 and meth2 (so it is easy to define
  other methods)
aliases/strainSets.txt
  Which strains make up the CERE and PARA sets. Throughout this
  data, "cere" and "para" refer to (reference genomes for) species,
  while "CERE" and "PARA" refer to strain sets. The distinction is
  maintained to allow us to match cerevisiae strains to the paradoxus
  genome or vice versa, though we have not done so here.
aliases/strains.txt
  Mapping from non-canonical to canonical strain names. The
  non-canonical names are abbreviations and/or errors used when
  the strains were sequenced, but the canonical names are the
  ones that occur throughout the data provided here.
aliases/traceDirs.txt
  Strain read locations in Sanger internal trace repository -- only
  of local interest (we have now submitted the reads to
  trace.ensembl.org, however).

Subdirectory "read": per-strain alignment of reads

read/$STRAIN/renamed.fastq 
  Fastq file for the reads for each strain, after clipping to
  remove low-quality ends, screening out matches to (most) vector
  sequences, canonicalisation of strain and read names, and sorting
  by name.  Each sequence is followed by the corresponding quality 
  scores in single character format - see below qualities.gz.
  For cerevisiae strains, some reads have been renamed to correct
  errors in the sequencing process; thus, X.p1k and X.q1k (when
  both exist) should now (nearly) always be read pairs. We have
  not been able to determine if similar corrections are required
  for paradoxus, because of the fragmentary nature of the reference
  assembly.
read/$STRAIN/renamed.wm
  File showing the good-quality parts of each read. Format of each
  line is: readName Length Lo1 Hi1 Lo2 Hi2 ... where LoN-HiN is
  a good-quality region.
read/$STRAIN/name_corrections.txt
  File showing the name corrections (if any) applied to create
  renamed.fastq from the original clip.fastq (from which reads
  were submitted to Genbank). A line "X Y" means X was renamed to
  Y; "X" on its own means X was discarded, usually because it
  represents an inferior or failed attempt to sequence a read.
read/$STRAIN/$SPECIES/$METHOD/ssahaSNP.sorted.gz
  SsahaSNP output: matches of reads for $STRAIN to the reference
  genome for $SPECIES, filtered using $METHOD (always "methc" at present,
  for "clustering method").
read/$STRAIN/$SPECIES/$METHOD/alilines.txt
  Accepted alignments for this strain, species and method.
read/$STRAIN/$SPECIES/$METHOD/alilines_status.txt
  Status and cluster ID for every alignment produced by ssahaSNP.
read/$STRAIN/$SPECIES/$METHOD/clusters.txt
  Cluster memberships and scores.

Subdirectory "match": per-chromosome/contig processing derived from read/*

match/$SET/$SPECIES/$METHOD/p$PART/$CHR
  Directory containing files for matches to chromosome (or contig)
  $CHR of the $SPECIES genome from strain set $SET, filtered using 
  $METHOD. $PART is just an extra directory level to avoid
  the inefficiency of having hundreds of sister contig directories
  for paradoxus: the $PART of a $CHR is its final character if that 
  is a digit, or 0 otherwise.

  Within each such directory are the following files, in order of creation:

  matches.gz
    SsahaSNP output (as in ssahaSNP.sorted.gz above) for this data.
    Thus the union of all matches.gz files under a given $SET, $SPECIES
    and $METHOD should be identical to the union of the ssahaSNP.sorted.gz
    files for the same conditions, but distributed differently.

  qualities.gz
    Base qualities for each match in matches.gz. A quality character
    Q represents a quality value q = ascii(Q) - ascii(!), such that
    P(error) = 10 ^ (-q/10).

  The next two files basically have a row per base position, with the
  strains organised by column.

  sorted.gz
    The first line of this file is of the form ">strain1 strain2 ..."
    showing the (alphabetical) order applied to the strains whose
    values occur in the file. (Not all strains match to all paradoxus
    contigs, for example). Then each following line consists of
    interleaved nucleotide and quality characters, one such pair for
    each strain, beginning with the reference genome and continuing
    with strain1, strain2 etc. Thus the third column is the nucleotide
    value for strain1, and the fourth is its quality value. The
    following conventions apply in sorted.gz and/or imputed.gz in the
    directory:

      -- If the nucleotide is blank, no read for that strain was
         matched (and survived filtering) at this point.
      -- For sorted.gz, if a non-blank nucleotide is upper case, it passes
         NQS; if lower-case, it fails.
      -- Gaps are indicated by digits 1 to 9. The different digits are
         equivalent except that they represent overlapping runs of
         different extent; they are distinguished only so that Margarita
         treats them differently.
      -- If the nucleotide is non-blank and non-gap but the quality 
         is blank, the error probability is very low (either this
         is the reference genome, i.e. column 2 of the line, or
         the evidence is so strong that Q would have an ascii code
         greater than 126.

   imputed.gz

     This is a filled-out version of sorted.gz, with the same format
     (except as noted below) but with nucleotide values imputed, and
     quality scores adjusted, by using the techniques based on 
     ancestral recombination graphs. Thus this file should (for better
     or worse) contain a nucleotide and quality character for every
     strain in every position.

     The difference in format from sorted.gz is that a lower-case
     nucleotide denotes a value inferred where there was no value,
     or occasionally a different nucleotide, in sorted.gz. Upper
     case means that the same nucleotide occurs here in sorted.gz,
     irrespective of NQS.

     The "IMPORTANT CAVEAT" given for assemblies (below) also applies
     to the imputed.gz files, from which the assemblies are derived.
     The caveat is that rearrangements are NOT reflected in imputed.gz
     files, even when they show up elsewhere, e.g. in the genome browser.
  
Subdirectory "report":

  snps_cere_CERE_methc.txt.gz
  snps_para_PARA_methc.txt.gz

  Summary of all SNPs in match/.../sorted.gz files (see above) that
  satisfy NQS. That is, these SNPs are derived by combining matches
  for the same strain to the same genome position, but _not_ by
  imputing missing values. 

  isnps_cere_CERE_methc.txt.gz
  isnps_para_PARA_methc.txt.gz

  As for snps...gz, but derived from imputed.gz instead of sorted.gz.

  A typical SNP entry is as follows:

  TTCCATGGCTGAGTTGTAGTC chr01 26724
  gtcCAtGGCTCAGTTgTAGtt chr01 26724 Y55 G>C 51 2/2
  NNNNNNNNNTTTYYTTTTTTT

  This represents a 21-base window centered on the SNP. The fields are:

  referenceSequence     chr offset
  strainSequence        chr offset strain R>S q vote
  qualityValues

  where:

  referenceSequence is the reference sequence (perhaps with some gaps)
  chr               is the chromosome or contig of the ref sequence
  offset            is the offset (zero-based) of the SNP
  strainSequence    is the sequence for the strain with the SNP
  X>Y               shows the SNP itself: X in ref sequence, Y in strain
  q                 is the quality score: P(error) = 10 ^ (-q/10)
  vote              is M/N, indicating that M of the N matches for this
                    strain to this position had nucleotide value Y
                    (Absent in the "isnps..." case).
  qualityValues     is the quality characters for the strainSequence

  isnp_counts_cere_CERE_methc.txt
  isnp_counts_para_PARA_methc.txt

  Imputed SNP counts over the whole genome between each pair of strains
  for the species in question.

Subdirectory "assembly":

  {preimp,postimp}/STRAIN.fastq.gz
  {preimp,postimp}/STRAIN.fa.gz

  Compressed FASTQ and FASTA format files for the given strain, pre-
  and post- SNP imputation, i.e. derived from sorted.gz and imputed.gz
  files respectively. The postimp files should be what you want for
  most purposes.

  These files represent a first attempt at assemblies for each strain.
  IMPORTANT CAVEAT, which I am working on resolving: they do not take
  ANY account of rearrangements or any indels too large to feature in
  ssahaSNP alignments (i.e. over 20 to 30bp or so), but they do
  represent our best guess at nucleotide values (i.e. substitutions)
  and small indels (up to 20 or 30bp). So they should be useful for
  primer design, deriving (most) gene sequences, etc.

  FASTQ format consists of sets of four lines like this:

  @header
  sequence
  +header
  quality

  The two "header" items in a given set of four lines are identical.
  In these files (as opposed to FASTQ in general), each header consists of:
    chunk name: strain.chromosome.nnn for nnn=1,2,3 ...
    chromosome, low offset and high offset for strain
    direction of alignment: "+" or "-" (always "+" at present because we're
       not yet handling rearrangements)
    chromosome, low offset and high offset for reference (ref chromosome is
       always same as strain chromosome at present)

  For "chromosome", read "contig" in the case of paradoxus, as usual.

  Where we have an insertion in the strain, we will have high offset =
  low offset - 1 in reference; and vice versa for an insertion in the
  reference (deletion in the strain).  In the latter case, the sequence
  and quality strings are of zero length.  Where we have neither an
  insertion nor a deletion, high-low should be the same for strain and
  reference, i.e. nucleotides align one-to-one. For easy reading, chunks
  have a maximum length of 1000.

  In a sequence line, the meaning of lower case is different for preimp and postimp.
    Preimp: nucleotide fails NQS, _or_ no nucleotide present (no coverage), so copied
      from reference
    Postimp: nucleotide was imputed (missing in pre-imputation).

  In a quality line, which should always be the same length as the sequence
  line above it, each character represents an upper bound on the probability that the
  corresponding nucleotide value in the sequence line is wrong. The formula
  to convert a character C to an error probability P is:
  
            0.1 * (ASCII(C) - ASCII("!"))
    P <= 0.1

  Thus for example: ASCII "A" is 65, and ASCII "!" is 33, giving a difference
  of 32, so P is 0.1 to the power 3.2 which is about 0.00063. The following
  perl script prints out this value for every character you give it:

    perl -ne '{print exp(log(0.1)*0.1*(ord($_)-ord("!"))) . "\n"}'

  "!" as a quality character (giving P <= 1, i.e. making no claims to
  accuracy at all) is used in preimp when there was no nucleotide present in
  the strain. "!" can also occur as a real quality character, but a long string of "!"
  characters almost certainly means the former case.

  To say the same things procedurally:

  To convert an ASCII character to an error probability:
    -- if the character is space, the probability is very close to zero.
    Otherwise:
    -- find the code of the character, e.g. with ord() in perl
    -- subtract the code for "!", which is 33
    -- divide by 10
    -- take 0.1 to that power

  To convert an error probability to an ASCII code:
    -- take log (to base 10) of the error probability
    -- multiply by -10
    -- round to nearest integer
    -- add the code for "!", 33
    -- if the resulting value is 127 or more, return the space character;
       otherwise return the character with that value, e.g. with chr() in perl.

Subdirectory "browser_data":

  browser_data contains the .fa and .gff files used by the Gbrowse
  interface to the data (http://www.sanger.ac.uk/Teams/Team71/durbin/sgrp/browser.shtml).
  browser_data/cere.tgz is for cerevisiae, browser_data/para.tgz is paradoxus.
  Please mail me if you have any questions on formats that cannot be resolved
  by comparing the files with what you see in the browser.

----------

Version history

7, 14-3-2007:

1) Nearly-complete data for strains for second phase of project added: now
   there are 36 S cerevisiae strains and 27 S paradoxus.

2) Browser .gff files are now smaller in number and larger in size, to work
   around a SNP display problem.

6, 21-11-2006:

1) "imputed" and "report" files for S paradoxus are now (I think) complete;
   previously some were missing.

2) "browser_data" directory added.

3) "assembly" directory added.

5, 9-11-2006:

1) New alignment-filtering method implemented, based on clustering
   alignments into consistent sets.

2) New version of Margarita takes account of sequencing-error probabilities.

3) Cerevisiae read names corrected where necessary and possible, to
   make p1k and q1k correspond.

4, 19-5-2006: 

1) S paradoxus strain Z1_1 added.

2) "report" files (detailing SNPs) have summary statistics at the end.

[Omitted for space reasons:
 3) ssahaSNP output files, read/{STRAIN}/{cere,para}/ssahaSNP.out.gz included.]

3, 17-5-2006:

1) Some wrong lines in imputed.gz files were fixed.

2) Gap error probability changed to 10^-2.5.

2, 13-4-2006:

1) Most vector sequences have been screened out of the reads
   using ssahaScreen.

2) Two alternative methods have been used to filter matches
   to the reference genome. See "Filtering Methods" below.

3) The sequence for the cerevisiae two-micron circle plasmid,
   SCPLASM1, has been added to the cerevisiae genome, and accounts
   for about 0.165% of the alignments (1 in 600).

4) Reads for an additional paradoxus strain, A4, have been added.

5) The SNP files are about twice as large as before, because:
     -- overlapping matches for the same strain have their quality
        scores combined by taking the maximum score, and
     -- NQS may be satisfied with respect to any other strain, not
        only with respect to the reference genome.

6) A preliminary version of imputed sequence data is provided by
   using ancestral recombination graphs to infer likely nucleotide
   values when no reads, or only low-quality reads, for a strain
   align to a given location. See "Imputing Nucleotides" below.

1,  9-2-2006: first version