Identifying homoplasy

We can identify sites that may be under selection by looking for homoplasies across our phylogenetic tree. Homoplasies are instances where mutations have evolved independently across different lineages due to natural selection. This can be accomplished using various computational methods and statistical tests designed to detect signatures of positive selection acting on specific sites in a phylogenetic context.

There are various ways to carry out this analysis but here we will use a program called HomoplasyFinder to identify sites that are inconsistent with the phylogeny and then plot highly discrepant sites on tree.

We will use data from 200 Mycobacterium tuberculosis samples that were collected in Moldova between 2018 and 2019. We have provided you with the following files:

The following files will be used for this exercise:

1. First, we will run the homoplasy analysis using HomoplasyFinder. This is a tool in R, though it has quite difficult dependencies. Therefore, I have provided a script to run this program on our data. In your terminal please run the command in your terminal:

Rscript runHomoplasyFinder.R -f TB_Moldova.fasta -t TB_Moldova.tree

Next we will open the results in R to view the findings, identify where we may have homoplasies, and plot these on a tree. First, we will install and load all the relevant packages:

if (!requireNamespace("ggplot2", quietly = TRUE)) {
install.packages("ggplot2")
}
if (!requireNamespace("seqinr", quietly = TRUE)) {
install.packages("seqinr")
}
if (!requireNamespace("BiocManager", quietly = TRUE)) {
install.packages("BiocManager")
}
library(ggplot2)
library(seqinr)
library(BiocManager)

BiocManager::install("ggtree")
library(ggtree)
options(stringsAsFactors = F)

2. HomoplasyFinder will write 3 files to your directory. We are interested in the file that contains the positions in which there is evidence of homoplasy. We will read this file and call it 'results'.

resultsFile <- paste0("consistencyIndexReport_", format(Sys.Date(), "%d-%m-%y"), ".txt")
results <- read.table(resultsFile, header=TRUE, sep="\t", stringsAsFactors=FALSE)

3. Let's view the results. We have 4 columns, the first is the position in our sequence file, the second is an index that measures the degree of homoplasy, and third is a count of the nucleotides at that position across all samples. Importantly the fourth column shows us the number of times a change in nucleotide has occurred across the tree. A high number is evidence of homoplasy.

head(results)

4. We can filter for only positions in which changes to the nucleotide (mutation) has occurred a large number of times across the tree.

results <- results[which(results$MinimumNumberChangesOnTree>10),]

5. View the new results. We are left with two sites.

head(results)

6. We can plot the tree with the nucleotide found in each tip (sequence) to see how these mutations look across our phylogeny. We will do this one at a time for illustrative purposes. We will start with the first site:

fasta<-seqinr::read.fasta("TB_Moldova.fasta",forceDNAtolower = F) # Read fasta
tree<-read.tree("TB_Moldova.tree") # Read the phylogeny

## Make a matrix of the sequences
Seq_df<-as.data.frame(matrix(as.character(unlist(fasta)),nrow=length(fasta),byrow = T))

## Extract the site that you want to plot to a vector and name it by the sequence name
site<-Seq_df[,6028]
names(site)<-names(fasta)

## pick a colour for each nucleotide and change the vector to the colour
site[which(site=="c")]<-"lightblue"
site[which(site=="g")]<-"red"

## Reorder the site vector to match the order of sequences on the tree
site<-site[order(match(names(site),tree$tip.label))]

## plot the tree with the sites coloured at the tips.
ggtree(tree) + geom_tippoint(color=site)


Exercise: Open the index file provided ("TB_Moldova_index.txt"). Find out the true positions of the high likelihood convergent SNPs. Search the internet for results and try to work out what these mutations are. What characteristic do you think the isolates carry these mutations have?

Next activity: dD/dS