A Comprehensive Evaluation of SNP Genotype Imputation
Our analysis was based on genome-wide SNP data of 449 healthy German individuals (PopGen control individuals) generated with Affymetrix 5.0 (500k), Affymetrix 6.0 (1000k), and Illumina 550k SNP arrays (see Reference, Methods). Imputation performance was assessed for four publicly available computer programs, namely BEAGLE, IMPUTE, MACH, and PLINK, using the release 22 CEU phasing data from HapMap as a reference. Only those SNPs that were present in the HapMap phasing data and that passed quality control in the German samples (Methods) were included. Genotypes obtained with one array were then used to impute genotypes for SNPs unique to the other array, and vice versa. Imputation accuracy was quantified by means of the concordance rate between the imputed and observed genotypes. Each program applies its own form of confidence threshold (CT) to imputation with respect to the complete sample cohort. Imputation efficacy was quantified as the proportion of imputable SNPs that had confidence values equal to or exceeding the CT. After initially employing the default CT values for benchmarking, we also varied the CT to assess the impact of this parameter upon both accuracy and efficacy.
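The two measures described above can be illustrated with a minimal sketch. This is not the code from the published R scripts. It assumes one confidence value per SNP and computes concordance only over SNPs that reach the CT. The function name `evaluate_imputation`, the record layout, and the toy genotypes are hypothetical and chosen only for illustration.

```python
def evaluate_imputation(snps, ct):
    """Return (accuracy, efficacy) for a list of imputed SNPs.

    Each SNP is a dict with hypothetical keys:
      'confidence' : per-SNP confidence value reported by the program
      'imputed'    : imputed genotypes, one per individual
      'observed'   : genotypes observed on the other array (None if missing)
    """
    # Efficacy: proportion of imputable SNPs with confidence >= CT.
    passed = [s for s in snps if s["confidence"] >= ct]
    efficacy = len(passed) / len(snps) if snps else float("nan")

    # Accuracy: concordance rate between imputed and observed genotypes,
    # restricted here (an assumption) to SNPs that reached the CT.
    total = 0
    matches = 0
    for s in passed:
        for imputed, observed in zip(s["imputed"], s["observed"]):
            if observed is None:
                continue  # skip individuals without an observed genotype
            total += 1
            if imputed == observed:
                matches += 1
    accuracy = matches / total if total else float("nan")
    return accuracy, efficacy


# Toy data, invented for illustration only.
toy_snps = [
    {"confidence": 0.95,
     "imputed": ["AA", "AG", "GG", "AG"],
     "observed": ["AA", "AG", "GG", "AA"]},
    {"confidence": 0.99,
     "imputed": ["CC", "CT"],
     "observed": ["CC", "CT"]},
    {"confidence": 0.60,
     "imputed": ["TT", "TT"],
     "observed": ["TT", "GT"]},
]

# Varying the CT shows the trade-off between accuracy and efficacy.
for threshold in (0.9, 0.5):
    acc, eff = evaluate_imputation(toy_snps, threshold)
    print(f"CT={threshold}: accuracy={acc:.3f}, efficacy={eff:.3f}")
```

In this toy example, lowering the CT admits the third SNP, which raises efficacy but lowers the concordance rate. This is the kind of trade-off that varying the CT is meant to expose.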
Nothnagel M, Ellinghaus D, Schreiber S, Krawczak M, Franke A.
A comprehensive evaluation of SNP genotype imputation.
Hum Genet. 2008 Dec 17. [Epub ahead of print]
DOI: 10.1007/s00439-008-0606-5. PMID: 19089453.
The "PopGen samples data" are publicly available to all academic groups, as highlighted in our manuscript. However, to comply with German data protection law, we must ask you to fill out a short transfer agreement. Please send this form to PopGen (info(at)popgen(dot)de). Please note that more than 300 different phenotypes are also available for these PopGen control individuals (http://www.popgen.de) and can be requested as well (ask for the list). We appreciate your understanding.
Preprocessed HapMap reference (release 22 CEU phasing data) used for imputation:
HAPMAP-ref.tar.gz (274 MB)
R-scripts and result files for generating statistics and graphs:
data.scripts_freeze_08-12-02.tar.gz (34 KB)
Requirements for using data:
Requirements for using example scripts:
- POSIX-compliant UNIX system, for example Linux, Mac OS X, or OpenBSD
- Bourne-again shell (bash)
Questions regarding the data or scripts?
Email: d(dot)ellinghaus(at)ikmb(dot)uni-kiel(dot)de