Genome Wide Association Studies with treeWAS

Genome Wide Association Studies (GWAS) can be used to find significant genomic associations with a given trait or characteristic (phenotype). We can analyze the entire genome of individuals to pinpoint single nucleotide polymorphisms (SNPs) or other genetic markers (e.g., the presence or absence of a gene) that exhibit statistical associations with particular phenotypes. A key consideration when running a GWAS is to control for population structure as this may induce false positive calls in where variants are identified due to shared ancestry rather than because of the trait of interest.

treeWAS is a package available in R that has been specifically developed to run bacterial GWAS. This method can be very powerful for identifying SNPs in bacterial genomics that are significantly associated with a given phenotype. This phenotype can be a binary or continuous variable.

This practical will guide you through running treeWAS to perform a GWAS using the 200 M. tuberculosis samples from Moldova.

The following files will be used for this exercise:


1. First we will set up the packages and options:

# Set packages and download treeWAS
require(devtools)
install_github("caitiecollins/treeWAS", build_vignettes = TRUE)
require(treeWAS)
options(stringsAsFactors = F)


2. To run a GWAS with treeWAS, you will need a genotype (sequence) file, a phenotype file, and a phylogeny. First we will first load the tree file. This will be the un-timed ML phylogeny “TB_Moldova.tree”:

tree <- ape::read.tree("TB_Moldova.tree") 


3. Next load in the genotype file in DNAbin format of from ‘ape’. This will be the multi-sequence FASTA file. We need to convert this to a specific format for treeWAS using the ‘DNAbin2genind’ function:

dna <- read.dna(file = "TB_Moldova.fasta", format = "fasta")
genotype <- DNAbin2genind(dna)@tab


4. We will use the phenotype found in the third column of the “TB_Moldova_metadata.csv” file you have been provided with. This will be converted to a named vector using the sequence names. This is a binary variable, with 1 = presence of phenotype, 0 = absence of phenotype.

phenotype <- read.csv(file= "TB_Moldova_metadata.csv")
phen <- as.vector(unlist(phenotype[,3])) # The 
names(phen) <- phenotype[,1]


4. The final step is to run treeWAS, we can also set a random seed to replicate the result.

You will get 7 plots after the run completes. There will be one plot with the input tree with tips coloured by the phenotype, and 6 results plots. These will be a manhatten plot and a bar graph for each of three statistical measures to determine signficant associations, with a red line illustrating the cutoff for significance with bonferroni correction.

  1. the “Terminal Score”, measures sample-wide association across the leaves ofthe phylogenetic tree.
  2. the “Simultaneous Score”, measures the degree of parallel change in the phenotype and genotype across branches of the tree.
  3. the “Subsequent Score”, measures the proportion of the tree in which the genotype and phenotype co-exist.
out <- treeWAS(snps = genotype,
               phen = phen,
               tree = tree,
               seed = 2345)


The results will show you the coordinates in the genotype file that have been found to be significantly associated with the phenotype.


This is the end of practical 7. Navigate back to the homepage here