Phylogeography using ancestral reconstruction

By reconstructing the ancestral geographic ranges of strains or lineages at different points in their evolutionary history, ancestral reconstruction methods help infer the spatial dynamics of population divergence. These reconstructions integrate phylogenetic trees and geographical data and enable researchers to trace the origins and migration of populations or the spread of disease.

Here we will use the ‘ace’ function in the R package ‘ape’ to carry out ancestral reconstruction on the location of SARS-CoV-2 strains of the P.1 variant collected in the Pacific Northwest of North America. This will allow us to track the variant and identify instances when there may have been a migration of infections between states and countries. The data files we will be using are:


1. First we will set up the packages and options:

# Install and load required packages
if (!requireNamespace("ape", quietly = TRUE)) {
  install.packages("ape")
}
if (!requireNamespace("phytools", quietly = TRUE)) {
  install.packages("phytools")
}
if (!requireNamespace("ggalluvial", quietly = TRUE)) {
  install.packages("ggalluvial")
}
if (!requireNamespace("ggplot2", quietly = TRUE)) {
  install.packages("ggplot2")
}
if (!requireNamespace("tibble", quietly = TRUE)) {
  install.packages("tibble")
}
if (!requireNamespace("dplyr", quietly = TRUE)) {
  install.packages("dplyr")
}
if (!requireNamespace("tidyr", quietly = TRUE)) {
  install.packages("tidyr")
}
if (!requireNamespace("ggtree", quietly = TRUE)) {
  install.packages("ggtree")
}
if (!require("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("ggtree")

library(ape) library(phytools) library(ggalluvial) library(tidyr) library(ggplot2) library(tibble) library(dplyr) library(tidyr) library(ggtree)
options(stringsAsFactors = F)


2. Load in the timed phylogeny. We will remove the Wuhan 1 reference sequence from the tree as we are only concerned with the ancestral movements of the P1 variant.

# Load the timed phylogenetic tree 
tree <- read.tree("P1_timetree.tree")
tree <- drop.tip(tree,"MN908947.3")


3. Read the location data into R. Again, only use metadata for sequences in the tree.

Metadata <- read.csv("P1_dates_location.csv") 
Metadata <- Metadata[which(Metadata$Name %in% tree$tip.label),]


4. Make a vector of the locations only and name this vector as the sequence names.

Location<-Metadata$Location
names(Location)<-Metadata$Name


5. As with some of the earlier analyses, ancestral reconstruction requires a bifuricating tree so we can test for multifurication and make it bifuricating, and make any zero branch lengths a small non-zero value.

if (!is.binary(tree)){# if not bifuricating tree
  tree=multi2di(tree) # Change to bifuricating
  tree$edge.length[which(tree$edge.length==0)]<-0.000000000001 # make all branch length non-zero
}


6. Now we can run the ancestral reconstruction using the ‘ace’ function. The method we will use is “ML”, which stands for maximum likelihood and the type of data is discrete “d”.

anc_Location<-suppressWarnings(ace(Location,tree, type="d", method = "ML"))


7. The result will be a list of 6 objects including a matrix of the likelihoods of each ancestral location at each node (anc_location$lik.anc). Here we will find the location for each node that has the highest likelihood and assign it to a names vector.

## infer the most likely ancestral location of each node
node_anc <- colnames(anc_Location$lik.anc)[apply(anc_Location$lik.anc, 1, which.max)]
names(node_anc)<-row.names(anc_Location$lik.anc)


8. We can plot the inferred location at each node on the tree. We first need to make a new vector of colours for each location and then use ggtree to plot.

node_anc_colors <- node_anc
node_anc_colors[which(node_anc_colors=="Washington")]<-"violet"
node_anc_colors[which(node_anc_colors=="Oregon")]<-"lightblue"
node_anc_colors[which(node_anc_colors=="Montana")]<-"palegreen"
node_anc_colors[which(node_anc_colors=="Alberta")]<-"salmon1"
node_anc_colors[which(node_anc_colors=="BritishColumbia")]<-"gold"

## Plot using ggtree
ggtree(tree,right=TRUE, mrsd = max(Metadata$Date),color="lightgrey") + 
  geom_nodepoint(color=node_anc_colors) + theme_tree2()


Question: Around what time does it look like P1 first entered BC? From where?


9. Now let’s make a data frame of the location at each node and tip, and the location at the previous node. This way we can then identify nodes in which there has been a change in location.

melt_tree<-as_tibble(tree)
Location_changes<-data.frame(Origin=melt_tree$parent,Destination=melt_tree$node)
for (i in 1:length(Location)){
  Location_changes$Origin[i]<-node_anc[which(names(node_anc)==Location_changes$Origin[i])]
  Location_changes$Destination[i]<-Location[which(names(Location)==melt_tree$label[i])]
}
for (i in (length(Location)+1):nrow(Location_changes)){
  Location_changes$Origin[i]<-node_anc[which(names(node_anc)==Location_changes$Origin[i])]
  Location_changes$Destination[i]<-node_anc[which(names(node_anc)==Location_changes$Destination[i])]
}


10. We need to transform the data into a format that can be used to make our final plot.

## Make result data frame summarizing the number of movements between and within locations.
result <- Location_changes %>%
  group_by(Origin, Destination) %>%
  summarize(freq = n()) %>%
  ungroup()

11. We can inspect the first few lines of this table to see the number of transitions between locations:

head(result)

12. Now we need to transform the results table to be in the format required by the alluvial plot. We can also view this again with the head() function:

## Transform the data into the alluvial format
alluvial_data <- pivot_longer(result, cols = c(Origin, Destination),
                              names_to = "Direction", values_to = "Location") %>%
  group_by(Direction) %>%
  mutate(row = row_number()) %>%
  select(Direction, Location, row, freq)

alluvial_data$Location <- as.factor(alluvial_data$Location) # Make location a factor for plotting
head(alluvial_data)


13. Finally we can make the alluvial plot with the package ‘ggalluvial’ to show all the inferred movements between nodes. This can give us a visual representation of the migration.

ggplot(alluvial_data,
       aes(x=Direction, stratum = Location, y = freq, 
           fill=Location, alluvium=row, label=Location)) +
  geom_stratum(width = 1/6, alpha=0.8) +
  geom_flow(width = 1/6)+
  geom_text(stat = "stratum", size=3, min.y = 1) +
  scale_x_discrete(limits = c("Origin", "Destination"), expand = c(.01, .01)) + theme_minimal()

Question: From which region in the Pacific northwest have the most cases of P.1 crossed borders?

Follow-up Activity: You have been provided with the timed tree of SARS-CoV-2 B.1.1.7 sequences from Taiwan that we built in practical session 3 (Taiwan_B117_time.nwk), along with a CSV file with the locations for each sequence (Taiwan_B117_dates_location.csv). Work together to carry out a phylogeographic analysis with these data.


Next activity: Phylodynamics analysis with BEAST2