Many study designs have been applied successfully to genetic analyses. This primer focuses primarily on GWAS, the paradigm that has recently made it possible to assess the entire genome and identify specific genetic differences among human beings that contribute to variation in disease susceptibility.
The human genome sequence comprises roughly 3 billion nucleotide bases (6 billion if one considers its diploid nature). Although more than 99% of that sequence does not differ from person to person, the differences in sequence are what is of interest because, along with environmental and behavioral differences, they contribute to phenotypic divergence. Variation in the DNA sequence can take different forms. By far the most common form is a site in the sequence where individuals differ by a single base. These differences are known as single nucleotide polymorphisms, or “SNPs,” and they occur, on average, at about one site per 300 bases.
More than 10 million SNPs are thought to be present in the human genome, consistent with about one site per 300 bases across roughly 3 billion bases. SNPs are also relatively common: by definition, the minor allele of any given SNP is present in at least 5% of individuals. GWAS have traditionally used high-throughput genotyping to assess SNP variation across the genome and identify sites where allele frequencies differ between individuals with and without disease (or with and without a certain phenotype). Genotyping 10 million sites in every participant could be very expensive and time-consuming. The genome, however, exhibits a structural property known as linkage disequilibrium (LD), whereby large sections of DNA sequence within a given chromosome are highly correlated. This property allows a shortcut that makes GWAS feasible and cost-effective: representative SNPs (“tag” SNPs) from each section of correlated sequence can be genotyped and then used to infer genotypes at unmeasured bases within the same section. These sections of correlated sequence are known as “haplotypes.” Genotyping tag SNPs across known haplotypes means that GWAS are possible with only 500,000 to 1 million genotyped SNPs. Indeed, using this method, over the past 5 years or so more than 400 GWAS have been published, identifying over 150 risk variants for more than 60 common diseases and traits.
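The tagging idea can be illustrated with a minimal sketch. It is not the procedure used by any particular genotyping platform or study. It assumes genotypes are coded as minor-allele counts (0, 1, or 2), measures LD as the squared correlation r² between two SNPs, and treats r² ≥ 0.8 as "well tagged", which is an illustrative threshold. The SNP names, the toy genotypes, and the functions `r_squared` and `pick_tag_snps` are hypothetical.

```python
from itertools import combinations


def r_squared(x, y):
    """Squared Pearson correlation between two genotype vectors (0/1/2 counts)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # monomorphic site: correlation undefined, treat as uncorrelated
    return cov * cov / (var_x * var_y)


def pick_tag_snps(genotypes, threshold=0.8):
    """Greedy tag selection (illustrative only).

    genotypes maps a SNP name to a list of per-individual genotypes.
    Returns a dict mapping each chosen tag SNP to the SNPs it stands in for.
    """
    names = list(genotypes)
    # For every SNP, record which SNPs it is in strong LD with (including itself).
    covers = {s: {s} for s in names}
    for a, b in combinations(names, 2):
        if r_squared(genotypes[a], genotypes[b]) >= threshold:
            covers[a].add(b)
            covers[b].add(a)

    untagged = set(names)
    tags = {}
    while untagged:
        # Pick the still-untagged SNP that captures the most untagged SNPs.
        candidates = [s for s in names if s in untagged]
        best = max(candidates, key=lambda s: len(covers[s] & untagged))
        tags[best] = sorted(covers[best] & untagged)
        untagged -= covers[best]
    return tags


# Hypothetical genotypes for five SNPs in six individuals.
toy = {
    "snp_a": [0, 1, 2, 0, 1, 2],
    "snp_b": [0, 1, 2, 0, 1, 2],  # identical to snp_a
    "snp_c": [0, 0, 1, 1, 2, 2],
    "snp_d": [2, 2, 1, 1, 0, 0],  # snp_c with the alleles coded the other way round
    "snp_e": [1, 1, 0, 2, 1, 1],  # not strongly correlated with any other SNP
}

tags = pick_tag_snps(toy)
for tag, tagged in tags.items():
    print(tag, "tags", tagged)
```

On this toy panel, three tag SNPs stand in for all five sites, which is the same economy that lets 500,000 to 1 million genotyped SNPs represent a far larger set. Because r² is unchanged when the allele coding is flipped, `snp_c` and `snp_d` are treated as perfectly correlated.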