*** Overview ***

The VCF files contain short indel (<50 nt) calls for the low-coverage samples from the CEU, YRI, and JPT/CHB populations. All lines where the FILTER column says 'PASS' should be considered high-confidence indel calls.

*** Procedure ***

The calls were made using the following procedure:

1. Extract candidate indels from Illumina, 454 and SOLiD data. The full set of candidates is available at
   http://www.well.ox.ac.uk/~gerton/1000G/LC/pilot1-indelcalls-17sept09.tgz
   The total number of candidates considered across all populations was 8,504,899. The candidate indel set was compiled by Gerton Lunter from candidates provided by the Sanger, Broad, Oxford, Sanger/LUMC and TGEN groups. All candidates were tested in all populations. JPT and CHB were analysed jointly.

2. Realign reads around candidate indels to candidate haplotypes using the indel caller Dindel (Albers et al.). At this stage Dindel was used both to produce indel site calls (that is, to decide whether a candidate indel segregates in the population) and to produce genotype likelihoods for each individual at each called site.

3. Finally, QCALL (Quang Si Le, Richard Durbin) was used to impute genotypes from the genotype likelihoods for the sites called by Dindel, making use of LD structure. Note that QCALL filtered out a small fraction (<0.25%) of the sites called by Dindel; these are sites where the genotype likelihoods are not consistent with the local LD structure. If a site was filtered out in this way, the FILTER column in the sites VCF file says 'NoQCALL'.

*** Novel indels ***

The indels were checked against dbSNP 129, the indels from (Mills et al., Genome Research 2006), and the indels from the Watson and Venter genomes. Because indel placement is inconsistent across databases, the criterion for 'novel' is less precise than that for SNPs; the criterion used is given in the header of the VCF files.
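
As an illustration of the FILTER conventions described above ('PASS' for high-confidence calls, 'NoQCALL' for sites removed by QCALL), the following is a minimal Python sketch of how a sites file might be split by FILTER value. It assumes plain tab-separated VCF text with FILTER as the seventh column, as in standard VCF. The function name split_by_filter and the three example records are hypothetical and are not taken from the released files.

```python
from collections import Counter


def split_by_filter(vcf_lines):
    """Return (records with FILTER == 'PASS', counts of every FILTER value).

    Assumes tab-separated VCF text in which FILTER is the 7th column.
    Header lines (starting with '#') and blank lines are skipped.
    """
    kept = []
    counts = Counter()
    for line in vcf_lines:
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        filter_value = fields[6]
        counts[filter_value] += 1
        if filter_value == "PASS":
            kept.append(fields)
    return kept, counts


# Hypothetical records, for illustration only (not real calls).
example_lines = [
    "##fileformat=VCFv4.0",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "\t".join(["1", "10583", ".", "GA", "G", ".", "PASS", "."]),
    "\t".join(["1", "20210", ".", "T", "TAC", ".", "NoQCALL", "."]),
    "\t".join(["2", "33015", ".", "CTT", "C", ".", "PASS", "."]),
]

kept, counts = split_by_filter(example_lines)
print(len(kept), "high-confidence calls;", dict(counts))
```

For a real file, the same function can be passed an open file handle in place of the example list, since it only iterates over lines.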

*** Imputation notes ***

Even though candidates from all technologies were used, the support for candidate indels was evaluated only on the Illumina sequence data. For the following samples there was no Illumina data, and as a result their genotypes are completely imputed from other samples (any SOLiD/454 data for these samples was *not* used to compute genotype likelihoods).

List of samples without Illumina data, imputed from other samples:

CEU: NA12814 NA11840 NA12872 NA12815 NA12812 NA12760 NA12874 NA12762 NA06985 NA12873 NA12234
YRI: NA19141 NA19143
JPTCHB: NA18969 NA18970

*** LOF variants ***

In coding regions, indel rates are significantly lower due to selection. Since noise levels (factors resulting in false positives) are expected to be approximately constant across the genome, the false discovery rate of the indel call set is expected to be higher in coding regions. To lower the number of false positive indel calls, we applied more stringent filters to the subset of indels that were called in the genome-wide set and were predicted to fall into the loss-of-function (LOF) class.

The stringent filter considers the range of positions where an indel would yield the same alternative haplotype sequence as the original called indel (for instance, in a repeat, the deletion of any repeat unit would give the same alternative haplotype), plus 4 bases of reference sequence on both sides of this region. It requires that this window is covered by at least one read on the forward strand and at least one read on the reverse strand, each with at most one mismatch between the read and the alternative haplotype sequence resulting from the indel (regardless of base qualities).

This filter removed the excess of 1-bp frameshift insertions seen in JPT/CHB with respect to CEU in the less stringently filtered genome-wide indel call set, although it is expected to remove a significant number of true positive calls as well. The indels that pass this stringent filter have been annotated in the VCF files by 'SF' in the INFO field.
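
To make the "range of positions where an indel would yield the same alternative haplotype" concrete, here is a small Python sketch for the deletion case. It is not the code used to produce the call set. It finds by brute force every start position at which deleting the same number of bases gives an identical haplotype, then extends that range by 4 reference bases on each side. The function names, the 0-based half-open coordinates and the example sequence are all assumptions made for illustration.

```python
def equivalent_deletion_starts(ref, start, length):
    """All 0-based starts where deleting `length` bases from `ref`
    yields the same haplotype as deleting ref[start:start + length]."""
    alt = ref[:start] + ref[start + length:]
    return [
        q
        for q in range(len(ref) - length + 1)
        if ref[:q] + ref[q + length:] == alt
    ]


def required_window(ref, start, length, flank=4):
    """Half-open reference interval [lo, hi) spanning every equivalent
    placement of the deletion plus `flank` bases on each side
    (clipped to the ends of `ref`)."""
    starts = equivalent_deletion_starts(ref, start, length)
    lo = max(0, starts[0] - flank)
    hi = min(len(ref), starts[-1] + length + flank)
    return lo, hi


# Hypothetical reference fragment containing the repeat CACACA at 8..13.
ref = "GGCTAAGTCACACATTGCAGGT"

# Deleting one "CA" unit can be placed at any of several positions
# inside the repeat without changing the resulting haplotype.
starts = equivalent_deletion_starts(ref, start=8, length=2)
window = required_window(ref, start=8, length=2)
print(starts, window)
```

Under the filter described above, reads on both strands would then be required to span a window of this kind with at most one mismatch against the alternative haplotype; that read-level check is not sketched here.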

*** Note ***

The 'NoQCALL' subset of calls is likely to be enriched for false calls, but it may contain potentially interesting targets for association studies, since one reason for a site being filtered by QCALL could be low LD with nearby SNPs. The 'NoQCALL' indels are present only in the 'sites' file and not in the 'genotypes' file.

*** Questions ***

If you have any questions, please email Kees Albers at caa (at) sanger.ac.uk

Kees Albers (caa (at) sanger.ac.uk)
Gerton Lunter
Quang Si Le
Richard Durbin