*** Overview ***

The VCF files contain short indel (<50 nt) calls for the low-coverage samples for CEU, YRI, and JPT/CHB. All lines where the FILTER column says 'PASS' should be considered high-confidence indel calls.

*** Procedure ***

The calls were made using the following procedure:

1. Extract candidate indels from Illumina, 454 and SOLiD data. The full set of candidates is available at
   http://www.well.ox.ac.uk/~gerton/1000G/LC/pilot1-indelcalls-17sept09.tgz
   The total number of candidates considered across all populations was 8,504,899. The candidate indel set was compiled by Gerton Lunter from candidates provided by the Sanger, Broad, Oxford, Sanger/LUMC and TGEN groups. All candidates were tested in all populations. JPT and CHB were analysed jointly.

2. Realign reads around candidate indels to candidate haplotypes using the indel caller Dindel (Albers et al.). At this stage Dindel was used both to produce indel site calls (deciding whether a candidate indel segregates in the population) and to produce genotype likelihoods for each individual at a called site.

3. Finally, QCALL (Quang Si Le, Richard Durbin) was used to impute genotypes from the genotype likelihoods for the sites called by Dindel, by making use of LD structure. Note that QCALL filtered out a small fraction (<0.25%) of the sites called by Dindel; these are sites where the genotype likelihoods are not consistent with the local LD structure. If a site is filtered out in this way, the FILTER column in the sites VCF file will say 'NoQCALL'.

*** Novel indels ***

The indels were checked against dbSNP 129, the indels from (Mills et al., Genome Research 2006), and the indels from the Watson and Venter genomes. Due to inconsistencies in indel placement across the various databases, the criterion for 'novel' is less precise than that for SNPs; it is given in the header of the VCF files.
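As an illustration of the FILTER convention described above ('PASS' for high-confidence calls, 'NoQCALL' for sites filtered out by QCALL), the following Python sketch separates the data lines of a sites VCF file by their FILTER value. It is a minimal example under stated assumptions, not part of the calling pipeline: it assumes the standard tab-separated VCF layout with FILTER as the seventh column, the function name split_by_filter is hypothetical, and the records in EXAMPLE_VCF are made up for demonstration.

```python
import io

# Hypothetical VCF excerpt for illustration only; these are not real calls.
EXAMPLE_VCF = (
    "##fileformat=VCFv4.0\n"
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n"
    "1\t1000\t.\tAT\tA\t50\tPASS\t.\n"
    "1\t2000\t.\tG\tGCC\t12\tNoQCALL\t.\n"
    "1\t3000\t.\tCTTT\tC\t45\tPASS\t.\n"
)


def split_by_filter(handle):
    """Group VCF data lines by the value of the FILTER column (column 7)."""
    groups = {}
    for line in handle:
        # Skip meta-information lines and the column header line.
        if line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        # Skip blank or truncated lines that have no FILTER column.
        if len(fields) < 7:
            continue
        groups.setdefault(fields[6], []).append(fields)
    return groups


groups = split_by_filter(io.StringIO(EXAMPLE_VCF))
high_confidence = groups.get("PASS", [])  # high-confidence indel calls
no_qcall = groups.get("NoQCALL", [])      # sites filtered out by QCALL
print(len(high_confidence), "PASS;", len(no_qcall), "NoQCALL")
```

To apply the same function to a file on disk, pass an open text file handle instead of the io.StringIO object.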
*** Imputation notes ***

Even though candidates from all technologies were used, the support for candidate indels was evaluated only on the Illumina sequence data. For the following samples there was no Illumina data, and as a result their genotypes are completely imputed from other samples (any SOLiD/454 data for these samples was *not* used to compute genotype likelihoods).

List of samples without Illumina data, imputed from other samples:

CEU: NA12814 NA11840 NA12872 NA12815 NA12812 NA12760 NA12874 NA12762 NA06985 NA12873 NA12234
YRI: NA19141 NA19143
JPTCHB: NA18969 NA18970

*** Note ***

The 'NoQCALL' subset of calls is likely to be enriched for false calls, but it may contain potentially interesting targets for association studies, as one reason for these sites being filtered by QCALL could be low LD with nearby SNPs. The 'NoQCALL' indels are present only in the 'sites' file and not in the 'genotypes' file.

*** Questions ***

If you have any questions, please email Kees Albers at caa (at) sanger.ac.uk

Kees Albers (caa (at) sanger.ac.uk)
Gerton Lunter
Quang Si Le
Richard Durbin