Over the past decade, genetic epidemiology studies have progressed from studying single genetic variants in candidate genes to investigating millions of common genetic variants in genome-wide association studies (GWAS). A paradigm shift in human genetics is currently taking place towards genome-wide rare variant association studies as well as gene-gene (GxG) and gene-environment (GxE) interaction studies. However, the explosive growth of genetic data presents enormous challenges for future genome-wide association studies, placing great demands on computational resources and data management infrastructure. Current studies often comprise tens of thousands of samples, sometimes more than 100,000, and imputation against the latest reference panels yields over 30 million genotypes per sample.
We developed a flexible in-house software pipeline for (semi-)automated quality control and analysis of genome-wide data sets. The pipeline is implemented in Python and R in an object-oriented style, makes extensive use of the open-source PLINK software and runs on our Linux computer cluster with a batch processing system. Parameters, options and data sets can be combined flexibly and are currently read from configuration text files. An attached plotting routine facilitates fast evaluation of candidate loci (e.g. regional association plots) and summary statistics (e.g. Manhattan and quantile-quantile (Q-Q) plots) (Figure 1). All computations are carried out on our high-performance compute systems.
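To illustrate how parameters read from a configuration text file could drive a PLINK quality-control step, the following is a minimal sketch and not the pipeline's actual code. The section names, keys, file paths and the function `build_plink_command` are hypothetical and chosen only for illustration. The sketch assumes PLINK 1.9 binary filesets and uses only its standard filter flags (`--maf`, `--geno`, `--mind`, `--hwe`).

```python
import configparser
import shlex

# Hypothetical configuration text; section names, keys and paths are
# illustrative assumptions, not the pipeline's real configuration format.
EXAMPLE_CONFIG = """
[dataset]
bfile = /data/study/raw
out = /data/study/qc

[qc]
maf = 0.01
geno = 0.02
mind = 0.02
hwe = 1e-6
"""

# Standard PLINK 1.9 filters: minor allele frequency, per-variant missingness,
# per-sample missingness, and Hardy-Weinberg equilibrium p-value threshold.
QC_FLAGS = ("maf", "geno", "mind", "hwe")


def build_plink_command(config_text, plink_binary="plink"):
    """Translate configuration text into a PLINK argument list."""
    parser = configparser.ConfigParser()
    parser.read_string(config_text)

    dataset = parser["dataset"]
    command = [plink_binary, "--bfile", dataset["bfile"]]

    # Add only the filters that are present in the configuration.
    qc = parser["qc"] if parser.has_section("qc") else {}
    for flag in QC_FLAGS:
        if flag in qc:
            command += ["--" + flag, qc[flag]]

    # Write the filtered data as a new binary fileset.
    command += ["--make-bed", "--out", dataset["out"]]
    return command


if __name__ == "__main__":
    # Print the command instead of running it; on a cluster this line
    # would typically be placed in a batch job script.
    print(shlex.join(build_plink_command(EXAMPLE_CONFIG)))
```

Keeping the command as an argument list rather than a single string lets it be passed directly to `subprocess.run` or written into a batch job script without extra quoting logic.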