For this module we will be using R and Rstudio. The lesson assumes no prior knowledge of R, but some experience may be helpful to some learners. At the end of the lesson, learners will be able to:
-Open and type commands into R and/or Rstudio, and have them executed in the console
-Know how to get help with R errors or warnings
-Have an understanding that there are user/community contributions called ‘packages’ that can help with specific areas
-Calculate the phenotypic covariance matrix
-Compute the additive co-ancestry matrix a set of markers
-Impute missing marker data
-Estimate population structure using a PCA
-Run a GWAS for a trait of interest
For this module you will need to download base R and rstudio (if you don’t already have them). For this lesson we will be using R 3.4.3 “Kite-Eating Tree”. There are distributions of base R for all operating major operating systems.
-MAC OS users go here: https://cran.r-project.org/bin/macosx/R-3.4.3.pkg
-Windows users go here: https://cran.r-project.org/bin/windows/base/
Rstudio is an IDE (integrated development environment) for R, and it streamlines writing of scripts, project management, and plotting - click the link below and choose “Rstudio desktop” (free version) and choose your OS version:
For this module we will be using a Picea glauca (white spruce) dataset of 15 phenotypes, 1694 individual trees, and some 6000 SNP markers. The markers have no linkage map, nor does the data set include pedigree based information. Picea glauca like most conifers, has a very large estimated genome size of 16.15 pg or 2100cM (estimated genetic map size), and 12 autosomal chromosomes. This means that we have very low marker coverage - this is further detailed in the paper (Beaulieu et al. 2014). White spruce is widespread in the Boreal and used mostly in residential home construction in Canada and paper making.
We will use a slightly altered version of the original dataset for the sake of time; however, the original dataset and associated paper by Beaulieu et al. (2014) lives here if anyone is interested in taking a look.
Since this is the first time we’ve opened R let’s list the objects in our workspace:
ls()
## character(0)
There should be no objects in your workspace. If there are you can wipe the workspace in rstudio using the tiny broom icon in the environment panel. Or, you can simply type this command (NOTE only use this command if you are certain you want to wipe all objects from the workspace).
rm(list=ls())
ls()
## character(0)
You can type these directly in the console, or preferrably in our script file. If we type commands in our console, they will be more difficult to recall or remember.
Now that are workspace is a blank space we can start working with our spruce dataset. First we will load the phenotypes into the workspace:
phenos <- read.table("S1.txt", header = TRUE, sep="\t")
If you’re new to R, you’ll notice that when you executed this command on the console - nothing happened. That’s generally good news, as it means your command was successfully executed. You’ll notice that if you type ls()
an object named "phenos"
.
The first thing to do before examining any data is examine it and visualize it. R has several useful commands to do this. First, let’s figure out the size of our data.
dim(phenos)
## [1] 1694 15
The output of this command tells us that we have an array of 1694 rows x 15 columns. The rows always come first, columns second in R. That’s great; however, we would like to get a better picture of the data. We’ll do that with the command str()
or the structure command.
str(phenos)
## 'data.frame': 1694 obs. of 15 variables:
## $ Tree : Factor w/ 1694 levels "E560A3_1_001_5",..: 1 2 3 4 5 6 7 8 9 10 ...
## $ q1 : num 2.854 1.015 2.006 2.023 0.764 ...
## $ q2 : num 1.932 0.057 -0.972 -2.276 -2.041 ...
## $ cellpop : num 1497 1278 1144 1536 1446 ...
## $ coars : num 274 327 356 297 292 ...
## $ cryst : num 2.64 2.58 2.46 2.72 2.54 ...
## $ dens : num 411 413 406 456 419 ...
## $ mfa : num 21.6 19.9 25.9 14.8 25.5 ...
## $ moe : num 8.9 9.01 6.64 13.54 6.89 ...
## $ rad : num 27.9 28.8 30.8 26.9 28.2 ...
## $ rwidth : num 3.51 2.87 3.7 2.58 2.82 ...
## $ specsurf: num 403 363 353 367 379 ...
## $ tad : num 25.1 28.4 29.7 25.4 25.5 ...
## $ cellwt : num 1.91 2.09 2.16 2.12 1.99 ...
## $ h97 : int 480 575 725 520 460 400 495 540 370 405 ...
This not only gives us the dimensions of the data, the datatype we are dealing with, the column names, and the column classes. What does str()
mean though? We can ask R. Type the command below for any function and you’ll get the help file to pop up with any arguments that might be contained within that fuction.
?str()
There’s also another function summary()
which produces a brief summary of all data within our data.frame()
columns.
summary(phenos)
## Tree q1 q2
## E560A3_1_001_5: 1 Min. :-36.8760 Min. :-5.237000
## E560A3_1_002_1: 1 1st Qu.: -0.9895 1st Qu.:-1.624250
## E560A3_1_003_4: 1 Median : 0.4155 Median :-0.692000
## E560A3_1_004_2: 1 Mean : 0.0000 Mean : 0.000006
## E560A3_1_004_3: 1 3rd Qu.: 1.7158 3rd Qu.: 0.405000
## E560A3_1_005_4: 1 Max. : 13.0050 Max. :26.453000
## (Other) :1688
## cellpop coars cryst dens
## Min. : 972.5 Min. :196.8 Min. :2.409 Min. :341.3
## 1st Qu.:1265.0 1st Qu.:284.1 1st Qu.:2.583 1st Qu.:394.6
## Median :1361.0 Median :305.7 Median :2.614 Median :416.3
## Mean :1379.7 Mean :307.7 Mean :2.616 Mean :418.8
## 3rd Qu.:1478.3 3rd Qu.:331.4 3rd Qu.:2.651 3rd Qu.:438.6
## Max. :2037.2 Max. :511.7 Max. :2.829 Max. :662.3
##
## mfa moe rad rwidth
## Min. : 9.982 Min. : 4.667 Min. :21.73 Min. :1.266
## 1st Qu.:15.859 1st Qu.: 8.579 1st Qu.:27.29 1st Qu.:2.649
## Median :18.479 Median : 9.915 Median :28.65 Median :3.098
## Mean :18.970 Mean : 9.933 Mean :28.61 Mean :3.116
## 3rd Qu.:21.624 3rd Qu.:11.182 3rd Qu.:29.83 3rd Qu.:3.551
## Max. :34.206 Max. :17.159 Max. :35.05 Max. :5.350
##
## specsurf tad cellwt h97
## Min. :248.9 Min. :20.82 Min. :1.516 Min. :210.0
## 1st Qu.:357.8 1st Qu.:25.64 1st Qu.:1.918 1st Qu.:420.0
## Median :376.3 Median :26.74 Median :2.034 Median :500.0
## Mean :377.8 Mean :26.71 Mean :2.044 Mean :499.6
## 3rd Qu.:397.4 3rd Qu.:27.73 3rd Qu.:2.156 3rd Qu.:575.0
## Max. :497.5 Max. :33.42 Max. :3.255 Max. :890.0
##
The summary()
can be very useful for figuring out if there’s a difference between mean and median values for example.
Since phenotypic correlations are important to consider in multivariate breeders’ selection we can produce this matrix in R.
cor(phenos)
An error is generated Error in cor(phenos) : x must be numeric
. If you remember from the str
and summary
commands, one of the columns is not numeric.
If we re-execute our command, this time without that column, it should work. We can do this with indexing.
cor(phenos[,-1])
## q1 q2 cellpop coars cryst
## q1 1.000000e+00 6.636573e-07 0.1126717 0.005282580 0.08424207
## q2 6.636573e-07 1.000000e+00 0.1327060 -0.003165443 0.01780984
## cellpop 1.126717e-01 1.327060e-01 1.0000000 -0.734802338 0.28978906
## coars 5.282580e-03 -3.165443e-03 -0.7348023 1.000000000 -0.24610054
## cryst 8.424207e-02 1.780984e-02 0.2897891 -0.246100539 1.00000000
## dens 1.553326e-01 1.800205e-01 0.3622618 0.348782785 0.04544474
## mfa -5.583501e-03 3.200013e-02 0.1171037 -0.210611746 -0.42482115
## moe 8.928539e-02 5.144695e-02 0.1484318 0.234571698 0.46750141
## rad -1.469887e-01 -1.259350e-01 -0.8916835 0.599320506 -0.24898820
## rwidth -1.103279e-01 -1.783114e-01 -0.4775837 0.121632541 -0.14949548
## specsurf -8.920521e-02 -9.239948e-02 0.3449667 -0.867492315 0.15353612
## tad -3.038410e-02 -8.775873e-02 -0.8758502 0.731012677 -0.24498786
## cellwt 8.732436e-02 9.601474e-02 -0.2741314 0.845108059 -0.13653765
## h97 -5.385958e-02 -3.510943e-02 -0.4832443 0.337976722 -0.10478333
## dens mfa moe rad rwidth
## q1 0.15533261 -0.005583501 0.08928539 -0.14698873 -0.1103279
## q2 0.18002049 0.032000128 0.05144695 -0.12593498 -0.1783114
## cellpop 0.36226185 0.117103666 0.14843176 -0.89168354 -0.4775837
## coars 0.34878278 -0.210611746 0.23457170 0.59932051 0.1216325
## cryst 0.04544474 -0.424821149 0.46750141 -0.24898820 -0.1494955
## dens 1.00000000 -0.138458829 0.54317426 -0.42162981 -0.5074612
## mfa -0.13845883 1.000000000 -0.86515231 -0.05660189 0.2583345
## moe 0.54317426 -0.865152312 1.00000000 -0.20416534 -0.4571478
## rad -0.42162981 -0.056601886 -0.20416534 1.00000000 0.5370733
## rwidth -0.50746124 0.258334485 -0.45714784 0.53707331 1.0000000
## specsurf -0.72874941 0.251858441 -0.45715612 -0.21491294 0.1667078
## tad -0.21723958 -0.150876729 -0.05722434 0.59396771 0.3320997
## cellwt 0.79392125 -0.213577129 0.45865483 0.14486824 -0.2106084
## h97 -0.20912127 0.063519823 -0.18034881 0.46891413 0.6033065
## specsurf tad cellwt h97
## q1 -0.08920521 -0.03038410 0.08732436 -0.05385958
## q2 -0.09239948 -0.08775873 0.09601474 -0.03510943
## cellpop 0.34496672 -0.87585024 -0.27413144 -0.48324426
## coars -0.86749232 0.73101268 0.84510806 0.33797672
## cryst 0.15353612 -0.24498786 -0.13653765 -0.10478333
## dens -0.72874941 -0.21723958 0.79392125 -0.20912127
## mfa 0.25185844 -0.15087673 -0.21357713 0.06351982
## moe -0.45715612 -0.05722434 0.45865483 -0.18034881
## rad -0.21491294 0.59396771 0.14486824 0.46891413
## rwidth 0.16670782 0.33209965 -0.21060843 0.60330651
## specsurf 1.00000000 -0.40721253 -0.97163670 -0.12919524
## tad -0.40721253 1.00000000 0.35297580 0.42334539
## cellwt -0.97163670 0.35297580 1.00000000 0.09946855
## h97 -0.12919524 0.42334539 0.09946855 1.00000000
We can exclude multiple columns by using the combine function c()
cor(phenos[,-c(1:3)])
## cellpop coars cryst dens mfa
## cellpop 1.0000000 -0.7348023 0.28978906 0.36226185 0.11710367
## coars -0.7348023 1.0000000 -0.24610054 0.34878278 -0.21061175
## cryst 0.2897891 -0.2461005 1.00000000 0.04544474 -0.42482115
## dens 0.3622618 0.3487828 0.04544474 1.00000000 -0.13845883
## mfa 0.1171037 -0.2106117 -0.42482115 -0.13845883 1.00000000
## moe 0.1484318 0.2345717 0.46750141 0.54317426 -0.86515231
## rad -0.8916835 0.5993205 -0.24898820 -0.42162981 -0.05660189
## rwidth -0.4775837 0.1216325 -0.14949548 -0.50746124 0.25833449
## specsurf 0.3449667 -0.8674923 0.15353612 -0.72874941 0.25185844
## tad -0.8758502 0.7310127 -0.24498786 -0.21723958 -0.15087673
## cellwt -0.2741314 0.8451081 -0.13653765 0.79392125 -0.21357713
## h97 -0.4832443 0.3379767 -0.10478333 -0.20912127 0.06351982
## moe rad rwidth specsurf tad
## cellpop 0.14843176 -0.89168354 -0.4775837 0.3449667 -0.87585024
## coars 0.23457170 0.59932051 0.1216325 -0.8674923 0.73101268
## cryst 0.46750141 -0.24898820 -0.1494955 0.1535361 -0.24498786
## dens 0.54317426 -0.42162981 -0.5074612 -0.7287494 -0.21723958
## mfa -0.86515231 -0.05660189 0.2583345 0.2518584 -0.15087673
## moe 1.00000000 -0.20416534 -0.4571478 -0.4571561 -0.05722434
## rad -0.20416534 1.00000000 0.5370733 -0.2149129 0.59396771
## rwidth -0.45714784 0.53707331 1.0000000 0.1667078 0.33209965
## specsurf -0.45715612 -0.21491294 0.1667078 1.0000000 -0.40721253
## tad -0.05722434 0.59396771 0.3320997 -0.4072125 1.00000000
## cellwt 0.45865483 0.14486824 -0.2106084 -0.9716367 0.35297580
## h97 -0.18034881 0.46891413 0.6033065 -0.1291952 0.42334539
## cellwt h97
## cellpop -0.27413144 -0.48324426
## coars 0.84510806 0.33797672
## cryst -0.13653765 -0.10478333
## dens 0.79392125 -0.20912127
## mfa -0.21357713 0.06351982
## moe 0.45865483 -0.18034881
## rad 0.14486824 0.46891413
## rwidth -0.21060843 0.60330651
## specsurf -0.97163670 -0.12919524
## tad 0.35297580 0.42334539
## cellwt 1.00000000 0.09946855
## h97 0.09946855 1.00000000
cor(phenos[,-c(1,4:9,10)])
## q1 q2 rwidth specsurf tad
## q1 1.000000e+00 6.636573e-07 -0.1103279 -0.08920521 -0.03038410
## q2 6.636573e-07 1.000000e+00 -0.1783114 -0.09239948 -0.08775873
## rwidth -1.103279e-01 -1.783114e-01 1.0000000 0.16670782 0.33209965
## specsurf -8.920521e-02 -9.239948e-02 0.1667078 1.00000000 -0.40721253
## tad -3.038410e-02 -8.775873e-02 0.3320997 -0.40721253 1.00000000
## cellwt 8.732436e-02 9.601474e-02 -0.2106084 -0.97163670 0.35297580
## h97 -5.385958e-02 -3.510943e-02 0.6033065 -0.12919524 0.42334539
## cellwt h97
## q1 0.08732436 -0.05385958
## q2 0.09601474 -0.03510943
## rwidth -0.21060843 0.60330651
## specsurf -0.97163670 -0.12919524
## tad 0.35297580 0.42334539
## cellwt 1.00000000 0.09946855
## h97 0.09946855 1.00000000
Let’s generate the phenotypic covariance matrix, that we could actually use in a breeding program.
pheno_covar <- cov(phenos[,-c(1:3)])
pheno_cor <- cor(phenos[,-c(1:3)])
R comes with numerous helpful commands for plotting and visualizing data. For our phenotypic dataset, let’s use a couple basic plots to get a sense of our trait data
hist(phenos$cryst)
plot(phenos$coars ~ phenos$cryst)
pairs(pheno_cor)
One of the other positives with using R is that its open-source community contributes many pacakges. A package is a set of functions contributed by a/many users for a specific area of study. Packages may have dependencies on other packages. R has a very active community, and there are many thousands of packages that users can take advantage of, the field of genetics is no exception.
For this lesson, we will use a couple of pkgs rrBLUP
from Jeff Endelman’s lab at the University of Wisconsin Madison. We will install the package and load it into the R workspace with the function:
install.packages("rrBLUP")
library(rrBLUP)
If we ever want to read more about a package we can use the citation
function, that is: citation(rrBLUP)
to read more about who generated the package or help()
, that is: help(rrBLUP)
.
citation("rrBLUP")
##
## To cite the rrBLUP package in publications use:
##
## Endelman, J.B. 2011. Ridge regression and other kernels for
## genomic selection with R package rrBLUP. Plant Genome 4:250-255.
##
## A BibTeX entry for LaTeX users is
##
## @Article{,
## title = {Ridge regression and other kernels for genomic selection with R package rrBLUP},
## author = {J. B. Endelman},
## journal = {Plant Genome},
## year = {2011},
## volume = {4},
## pages = {250-255},
## }
help(rrBLUP)
## No documentation for 'rrBLUP' in specified packages and libraries:
## you could try '??rrBLUP'
R can read in different data types, and R packages expand that capability. Above, we read in a tab-delimited text file and it’s not uncommon to read in csv (comma separated value files). There are even numerous packages that allow you to read in native ‘.xls’ and/or ‘.xlsx’ files.
We will load an ‘.RData’ file, which has been pre-processed in advance, which contains a single object genos
, however note that RData files can contain multiple objects if you choose. We’ll also load the linkage map (this is a fake made up linkage map, with a 13th chromosome for one of the organelles or unknown SNPs).
load("genos.RData")
dim(genos.rrBLUP)
## [1] 1694 6716
marker.map <- read.csv("map.csv", header = TRUE)
marker.map <- marker.map[,-1]
What happens if we use the functions above to get an idea of the data structure? Type them in and see what happens.
Could we use our indexing technique from the phenotypic correlations to get an idea?
On your own see if you can print a small subset of the object to get an idea of its structure.
We have no information on pedigree with this dataset, and as such we really can’t differentiate between IBD (identity-by-descent) and IBS (identity-by-state). The more SNPs, the more likely IBS starts to approximates IBD. Nonetheless, we’ll proceed with this toy example. We’ll use the function A.mat
from the rrBLUP
package we previously loaded. Let’s first look at what this function does by typing its name in the console, and we can also find out more by doing what?
library(rrBLUP)
A.mat.spruce <- A.mat(genos.rrBLUP, min.MAF=0.05,
impute.method="mean",
return.imputed=TRUE)
dim(A.mat.spruce)
## NULL
You’ll notice here that I asked also for the option to return an imputed matrix based on the mean of each marker column. This is for downstream analyses, as many GS (genomic selection) and GWAS canned packages/functions require no missing data. I also chose the method that just happens to be the quickest for imputation - NOT THE MOST ACCURATE.
What type of object is retured? It’s a list, which is a special data type that allows the storing of different object types, of different lengths.
We can unlist()
both of those objects for different purposes. For the next step we’ll want to use the imputed marker matrix.
marker_imputed <- unlist(A.mat.spruce$imputed)
We don’t know much about this dataset, save that it’s a bunch of open pollinated individual trees in Canada, whose parents originated from three different latitudes/altitudes in the Province of Quebec and are included in the data set.
While kinship is probably the most important type of structure to consider it’s useful to get an estimate of population structure nonetheless. We will use principal component analysis (PCA), not to be confused with principal co-ordinate analysis (PCoA). Here we are using one of the base functions for principal co-ordinate analysis, but for larger dataset there are other options: EIGENSTRAT, flash-pca, SNPRelate (another R package), and many more.
We will use another nice plotting library to look at our PCs - ggplot2. What do you notice about the plot that strikes you given the limited information you have on the lines?
pca.spruce <- prcomp(marker_imputed)
str(pca.spruce)
library(ggplot2)
PCi<-data.frame(pca.spruce$x)
ggplot(PCi,aes(x=PC1,y=PC2)) +
geom_point(size=3,alpha=0.5)
ggplot(PCi,aes(x=PC1,y=PC2)) +
geom_point(size=3,alpha=0.5, color="orange") + theme_classic()
Even though we have a pretty good estimate of coancestry and population structure, and it’s normally a good idea to include these in GWAS, here we will perform a GWAS with kinship included as the additive co-ancestry GWAS due to the local computation limits of this exercise. If you would like, you can try including the coancestry matrix (I have) and the PCs in your own GWAS. We’ll choose the tree height from 1997, at 22 years of age, and include the tree id in a data.frame
.
?GWAS
pheno <- read.table("S1.txt", header=TRUE)
h97 <- data.frame(pheno$Tree, pheno$h97)
markerT <- data.frame(t(marker_imputed))
colnames(markerT) <- h97$pheno.Tree
genogwas <- marker.map[marker.map$Marker_Name %in% rownames(markerT),]
genogwas <- data.frame(genogwas, markerT)
Ok, now let’s run our GWAS, we’ll remove the default min.MAF = 0.05, since we already did this when we calculated the additive matrix.
GWAS(pheno = h97, geno = genogwas,
min.MAF = 0.05,
K = A.mat.spruce$A,
n.core = 4,
P3D = TRUE,
plot = TRUE)
Sometimes it’s useful to gather scripts that you can use or modify for the task at hand. A few of these are provided below, but we won’t cover them today. You may want to try it out at a later date.
Recoding SNP nuleotide data is one of the things that is a common requisite for many packages for quantitative and population genetics. E.g. taking “AA”, “CC”, “AC”, and translating it to 2, 1, 0 or 1,0,-1. There are other formats too. The function below can do this quickly (try it out later if you’d like).
The original spruce dataset from Data Dryad is loaded into the workspace, we define our function, and then we apply it to the genos
.
genos <- read.table("S2.txt", header = TRUE, sep="\t", stringsAsFactors = TRUE)
genos[1:5,1:5]
## X.Marker. PGAS1_00001 PGAS1_00002 PGAS1_00003 PGAS1_00004
## 1 E560A3_1_001_5 G:G A:A A:A C:C
## 2 E560A3_1_002_1 G:G A:A G:G C:C
## 3 E560A3_1_003_4 G:G A:A A:A C:C
## 4 E560A3_1_004_2 G:G A:A G:G C:G
## 5 E560A3_1_004_3 G:G A:A A:A C:C
recode.SNP.rrBLUP <- function(x) {
alleles <- unique(x)
y <- rep(0,length(x))
y[which(x==alleles[1])] <- -1
y[which(x==alleles[2])] <- 0
y[which(x==alleles[3])] <- 1
y[which(x=="N")] <- NA
return(y)
}
X <- apply(genos[,-1],2,recode.SNP.rrBLUP)
X[1:5,1:5]
## PGAS1_00001 PGAS1_00002 PGAS1_00003 PGAS1_00004 PGAS1_00005
## [1,] -1 -1 -1 -1 -1
## [2,] -1 -1 0 -1 -1
## [3,] -1 -1 -1 -1 -1
## [4,] -1 -1 0 0 -1
## [5,] -1 -1 -1 -1 -1
If you want the 0, 1, 2 format that is often the required by other programs, you can simply (this is required for SNPRelate and PLINK uses it too):
Xadd <- X + 1
Xadd[1:5,1:5]
## PGAS1_00001 PGAS1_00002 PGAS1_00003 PGAS1_00004 PGAS1_00005
## [1,] 0 0 0 0 0
## [2,] 0 0 1 0 0
## [3,] 0 0 0 0 0
## [4,] 0 0 1 1 0
## [5,] 0 0 0 0 0
The authors provide estimates of narrow-sense heritability in their paper in Table 1 (Beaulieu et al. 2014), but don’t provide details on location, environment, or pedigree. As such, we can’t calculate heritability using a mixed model or linear model approach. Fortunately, we can estimate the heritability of single traits based on markers alone, by making use of the additive co-ancestry matrix that we’ve calculated already in the lesson above.
library(heritability)
A <- A.mat.spruce$A
colnames(A) <- pheno$Tree
rownames(A) <- pheno$Tree
h2_h97 <- marker_h2(pheno$h97, pheno$Tree, K=A)
h2_h97$h2
Since we have the narrow sense heritability (h2), and we have the phenotypic variances (Vp), we can easily calculate the additive variance (Va) of each of the trait, by re-arranging the equation h2 = Va/Vp to Va= h2 * Vp. Note that the function marker_h2
calculates both as well, we are merely going through the motions here to understand Gmatrix construction. We can also use the h2 in assessing the response to various selection coefficients.
va_h97 <- h2_h97$h2 * var(pheno$h97)
S = 10
R = h2_h97$h2 * S
R