For this module, we are going to go through the scenario where your PI gives you a dataset and tells that you have to perform the standard filtering steps, analyze it, and return the results (without actually telling you what the dataset represents). Luckily, we learned many of the relevant commands in the first genetic variation module, so you have all the tools you need to impress your demanding PI.
If you look in the '~/32_Analysis_of_Genetic_Variation_II_data/' directory (which is not in this assigment folder but should be in your home directory), you will see three files with extensions that should look somewhat familiar to you from the last class:
malmo_v4.bed: a binary ped file containing the genotypes for each subject. Note that this is different than the bed files that we have seen before!
malmo_v4.bim: an extended MAP file containing allele name information corresponding to the genotypes in the bed file
malmo_v4.fam: the first six columns of the original .ped file, corresponding to the pedigree and phenotype information for each subject
You will also see a file called malmo_v4_mds.mds, which we will investigate in parts 2 and 3.
First, let's run the basic PLINK command to look at this data, and send the output to 'plink_output/validation' within this assignment folder. Note that to use the binary file, you have to use the '--bfile' flag instead of the '--file' flag.
Now let's perform similar filtering to what we did for the first module: add a MAF threshold of 0.01, a SNP genotyping rate of 0.05, and missingness per individual/sample of 0.1. In addition to these filters, we will also add a filter to exclude SNPs that significantly diverge from Hardy-Weinberg equilibrium, which is meant to identify SNPs that are under specific evolutionary forces and may therefore not be appropriate to include in a case/control association study. Look through the PLINK thresholding page to find the flag for filtering by HWE. What is that flag?
Now, let's filter out SNPs that significantly deviate from HWE with a p-value of 0.001 or less, in addition to the other filters described above. Record the code you ran here, and send the output to 'plink_output/filtering' in this directory:
How many individuals were removed for low genotyping?
How many SNPs were filtered out for diverging from HWE?
8232 SNPs were removed.
How many SNPs failed the missingness test?
1738 SNPs failed missingness test.
How many SNPs failed the MAF test?
13571 SNPs failed frequency test
How many SNPs passed all the filtering?
69165 SNPs passed all the filtering.
How many cases and controls passed filtering?
1000 cases and 1000 controls.
2. Multi-dimensional scaling analysis
A very common and important analysis in genetic studies is to consider the population stratification of the sample: because different ancestries/populations have different genetic architectures, comparing people without consideration of which population they come from can lead to false results, where a genetic signal may seem to be associated with a trait but really be due to differences between the case and control populations. The standard way to analyze this stratification is to calculate the genotype-based pairwise distances between each individual and each other individual, perform multi-dimensional scaling on these distances, which provides a way to make a 2D plot of the similarity of the individuals in the dataset, and cluster individuals based on this information to account for different populations.
You can read this PLINK page that describes approaches for understanding population stratification. We have generated a multidimensional scaling file for you called 'malmo_v4_mds.mds', also in the ~/32_Analysis_of_Genetic_Variation_II_data/ directory. Read through the linked PLINK page to find the flags that you would use to generate an MDS file with 3 dimensions (but do not run this command, as it takes a long time!):
Next, download this MDS file, read it into R (noting that there is a header), and plot the first two dimensions against each other (the 3rd and 4th columns of the data file), and save the plot into the images/ folder of this assignment.
Do you see anything suggesting that there is population stratification in this sample?
The plot suggests that there is indeed population stratification in the sample between, as the first two dimsensions seem to show 2 distinct populations stratified within the sample.
3. Running logistic regression-based association
Given that we have seen population stratification, the question of how to account for this in our association test still remains. Instead of using the basic '--assoc' flag to do our association test, we now have to use a more sophisticated flag, which is the '--logistic' flag. This tells PLINK that we want to run a logistic regression, which is one way to predict a categorical value (i.e. the case or control status of each individual) while taking genotype weights as well as other covariates, in this case the population clusters, into account. In PLINK, we can specify the MDS file as a covariate for the logistic regression model to account for using the '--covar' flag with the malmo_v3_indcln.mds file as the argument.
We will also use two new flags that we haven't seen before: to adjust for multiple testing and sort the output by correct signifiance, we use the '--adjust' flag, and to generate 95% confidence intervals around the estimated parameters, we use the '--ci 0.95' flag.
Write the command with all the filters that we used before plus the --logistic flag, the --covar flag with the MDS file as the argument, and the two flags described above to run this logistic association test, and send the output to plink_output/filtering_and_logistic:
This may take a few minutes, but now we have our logistic regression results in two files: plink_output/filtering_and_logistic.assoc.logistic, which contains the logistic regression and the 95% confidence intervals for the odds ratios of each SNP, and plink_output/filtering_and_logistic.assoc.logistic.adjusted, which contains the SNPs sorted by their adjusted p-values. Give the command to look at the first 30 lines of the adjusted file:
$ head -30 filtering_and_logistic.assoc.logistic.adjusted
The two output files contain different information, including the actual position of each SNP in filtering_and_logistic.assoc.logistic. One important analysis that we should do is to figure out if we have many different and independent significant signals or if there are some significant SNPs that are close to each other, suggesting that they are in linkage disequilibrium and may be tagging the same causal mechanism. Since we don't have too many significant SNPs, we can figure this out by simply grepping for the top Bonferroni SNPs in filtering_and_logistic.assoc.logistic and looking at where they reside. Note that each SNP has 4 lines in this file, for each of four tests: additive, which accounts for all the covariates, and one test for each of the three MDS covariates.
Fill in the following cell with the grep code you ran for each SNP and the corresponding output (you only need to include the first line of the result). We have filled out the first one for you including the header line:
$ head -1 plink_output/filtering_and_logistic.assoc.logistic; grep "rs2244797" plink_output/filtering_and_logistic.assoc.logistic CHR SNP BP A1 TEST NMISS OR SE L95 U95 STAT P 20 rs2244797 36825603 1 ADD 1981 0.6245 0.07523 0.5389 0.7237 -6.258 3.891e-10$ head -1 filtering_and_logistic.assoc.logistic; grep "rs2823799" filtering_and_logistic.assoc.logistic CHR SNP BP A1 TEST NMISS OR SE L95 U95 STAT P 21 rs2823799 16733736 1 ADD 2127 1.456 0.0659 1.28 1.657 5.702 1.182e-08$ head -1 filtering_and_logistic.assoc.logistic; grep "rs2250350 " filtering_and_logistic.assoc.logistic CHR SNP BP A1 TEST NMISS OR SE L95 U95 STAT P 20 rs2250350 36803410 2 ADD 2122 0.6428 0.07916 0.5504 0.7507 -5.583 2.368e-08$ head -1 filtering_and_logistic.assoc.logistic; grep "rs2247983" filtering_and_logistic.assoc.logistic CHR SNP BP A1 TEST NMISS OR SE L95 U95 STAT P 20 rs2247983 36831323 1 ADD 2123 0.6734 0.07239 0.5843 0.776 -5.464 4.666e-08$ head -1 filtering_and_logistic.assoc.logistic; grep "rs2757522" filtering_and_logistic.assoc.logistic CHR SNP BP A1 TEST NMISS OR SE L95 U95 STAT P 20 rs2757522 36830925 1 ADD 2118 0.6754 0.07244 0.586 0.7784 -5.418 6.036e-08$ head -1 filtering_and_logistic.assoc.logistic; grep "rs11088585" filtering_and_logistic.assoc.logistic CHR SNP BP A1 TEST NMISS OR SE L95 U95 STAT P 21 rs11088585 16720831 2 ADD 2116 1.412 0.06499 1.243 1.604 5.308 1.107e-07$ head -1 filtering_and_logistic.assoc.logistic; grep "rs240461" filtering_and_logistic.assoc.logistic CHR SNP BP A1 TEST NMISS OR SE L95 U95 STAT P 21 rs240461 16703974 2 ADD 2109 1.416 0.06687 1.242 1.614 5.204 1.952e-07$ head -1 filtering_and_logistic.assoc.logistic; grep "rs2255213" filtering_and_logistic.assoc.logistic CHR SNP BP A1 TEST NMISS OR SE L95 U95 STAT P 20 rs2255213 36819791 1 ADD 2124 0.6844 0.07289 0.5933 0.7895 -5.203 1.96e-07$ head -1 filtering_and_logistic.assoc.logistic; grep "rs11911745" filtering_and_logistic.assoc.logistic CHR SNP BP A1 TEST NMISS OR SE L95 U95 STAT P 21 rs11911745 16725323 2 ADD 2119 1.404 0.06543 1.235 1.597 5.19 2.106e-07$ head -1 filtering_and_logistic.assoc.logistic; grep "rs2244281" filtering_and_logistic.assoc.logistic CHR SNP BP A1 TEST NMISS OR SE L95 U95 STAT P 20 rs2244281 36821845 2 ADD 2096 0.6873 0.07291 0.5958 0.7929 -5.143 2.709e-07
By 'clumping' nearby SNPs together based on their positions (the third column here), how many distinct association signals do you think there are?
There seem to be 2 distinct association signals, one on chrom. 20 and one on 21.
What are the general odds ratios for each of the distinct signals you identified? Are they in the same direction, and can you generally interpret what these values represent? How do these correspond to the 'STAT' column in the logistic output file?
For the Chr20 signal, the odds ratio is around .67-68. For Chr21, the odds ratio is around 1.4. The two are in opposite directions (1 has an odds ratio < 1 and the other has an odds ratio > 1) suggesting that exposure/disease is less likely with the chr20 signal (shown by a negative value in the STAT column). Similarly, exposure/disease is more likely with the Chr21 signal (shown by a positive vlaue in the STAT column).
Finally, go on the genome browser and check these regions of associations, making sure to use the hg18 assembly; what genes do you see in these regions and can you give any general interpretation given that we haven't told you what our actual phenotype is?
For Chr20 signal, the primary gene in the region is ACTR5, a protein-coding gene involved in involved in transcriptional regulation, DNA replication and probably DNA repair. For the Chr21 signal, the primary gene seems to be the cluster host gene MIR99AHG, which encodes a long non-coding RNA. Based on the above characterizations, the variant for the lncRNA on Chr21 seems to implicated in some kind of loss of function mutation that negatively affects normal cell function.
Now you can give these results to your PI, who will surely be impressed with your knowledge of the somewhat esoteric art of running PLINK commands!