Two-step Bayesian timed trees with BactDating

An alternative approach to building a timed tree from an un-timed phylogeny is to use Bayesian methods.

Much like the Maximum Likelihood approach in the previous exercise, this approach will use the existing topology of the input tree and estimate the date of nodes and re-scale branches per unit time by the simultaneous Bayesian estimation of the molecular clock rate and coalescent rate.

Bayeisan approaches can be beneficial when using for datasets with uncertain parameters, such as the substitution rate, as you can set intial priors that are updated through the analysis using Markov Chain Monte Carlo (MCMC) sampling for estimation.

BactDating is an Bayesian approach implemented in R to construct dated phylogenies. Here we will go through the two-step timed phylogeny building using an un-timed phylogeny built with IQtree of 200 M. tuberculosis isolates collected in Moldova between 2018 - 2019. A file with the collection dates has also been provided.


1. First we will set up the packages and options:

# Set packages and options
  require("ape")
require("BactDating")
options(stringsAsFactors = F)


2. Next load the un-timed Maximum Likelihood phylogenetic tree of 200 M. tuberculosis isolates collected between 2018 and 2019 in Moldova that was built with IQTree. This tree has been provided in the data folder.

tree <- ape::read.tree("TB_Moldova.tree") 


3. For BactDaing, the branches of the tree must be in units of number of SNPs. The branches of the input tree that we get from ML are in substitutions per site. This was built using a multi-alignment sequence of 12,723 SNPs and sequences aligned to the H37Rv reference strain, which is 4,411,532 bp. Therefore, we can scale the branches to the number of SNPs by multiplying the branch (edges) by the genome length.

tree$edge.length<-tree$edge.length*12723


4. The un-timed phylogeny may contain multifurications, where multiple branches descend from a single node. BactDating requires a bifuricating tree, so we can test for multifurication and make it bifuricating. This may induce branch lengths of zero, so we will assign a a very small, non-zero value to these branches.

is.binary(tree)
if (!is.binary(tree)){# if not bifuricating tree
  tree=multi2di(tree) # Change to bifuricating
  tree$edge.length[which(tree$edge.length==0)]<-0.000000000001 # make all branch length non-zero
}
is.binary(tree)


5. Next we need to create a vector of tip dates. Here, we are using the collection dates of the sequences. BactDating can also handle missing data and infer the date.

date_file<- read.csv("TB_Moldova_metadata.csv") # CSV including names and dates 
dates <- date_file[,2] # Make a vector of the date column 
names(dates)<-date_file[,1] # Name the date vector by sample name


6. Now we can set up the parameters for BactDating. You can enter ?BactDate to check all possible parameters. Here we will set the following:

# Number of MCMC iterations to run. As this is a Bayesian approach, it will employ MCMC to sample from probability distributions.
MCMC_iterations <- 2000 

# The molecular clock model to use (poisson or negbin or strictgamma or relaxedgamma or mixedgamma or arc or carc or mixedcarc).
model <- "mixedgamma" 

# The initial mutation rate in subsitution per genome per unit time (years).
initMu <- 0.5 

# # Whether to update the mutation rate through MCMC iterations
updateMu <- FALSE 

# Minimum branch length to consider to infer dates (we will use all branches so consider the shortest branch)
minbralen = min(tree$edge.length) 

# Update the root to be optimal. (Our tree is not rooted so we will update) 
updateRoot = TRUE 

# Show a progress bar while running the analysis
showProgress = TRUE 


7. The main function to produce a dated phylogeny is ‘bactdate’, We will run ‘bactdate’ using our set parameters and save it to “result”.

result=bactdate(tree = tree, date = dates, nbIts = MCMC_iterations, model = model, initMu = initMu,
	updateMu = updateMu, updateRoot = updateRoot, minbralen = minbralen, showProgress = showProgress)


8. View the result variable, you will see the key posterior parameters. You can explore the result further to find the confidence intervals around the dates on all nodes.

result


9. The timed phylogeny will be saved in the $tree variable. We can write this to a .tree Newick file using ape.

ape::write.tree(result$tree,"TB_Moldova_timed.tree")


10. You can view the final timed tree using the plot() function in R but for a clearer view, open the tree using FigTree.

plot(tree)

Exercise - Try building the tree with a different clock model and mutation rate, what changes do you see in the root and node dates?

Next activity: Timed Bayesian phylogenies with BEAST