Transmission

Transmission Clustering and Network Reconstruction

For the final activity in practical session 4, we will carry out some analyses to investigate possible transmission links in our data.

This activity will comprise two sections. First, we will produce transmission clusters comprised of sequences that are linked to another sequence in the cluster by a maximum of a given SNP threshold (i.e., all sequence). This allows us to identify isolates that may be linked by recent transmission. There are multiple software available to cluster sequences, either directly from the sequence alignments (e.g., TransCluster) or through phylogenetic methods (e.g., TreeCluster). Here, we will produce transmission clusters manually using ‘igraph’ in R.

In the second part of the activity, we will go a step further and carry out a full transmission network reconstruction to identify possible direct and indirect transmission events between isolates using probabilistic modelling approaches. This not only allows us to identify potential transmission as with transmission clustering, but we can infer the time and direction of person-to-person transmission events. Various software have been developed to carry out this analysis (for more examples please read this paper) but here we will use TransPhylo, an R program that uses a Bayesian approach with MCMC sampling to infer a transmission network and allows for epidemiological parameters to be included to best estimate transmission networks.

We will use the the same dataset as we used to build a timed phylogeny for these analyses:

TB_Cluster.fasta - A FASTA alignment file of concatenated SNPs from 38 M. tuberculosis samples collected between 2005 - 2014 in British Columbia. These isolates all share a MIRU-VNTR type, suggesting they may be linked by transmission.

TB_Cluster.txt - A text file with two columns, the name of the 38 M. tuberculosis samples and their collection dates.

TB_Cluster.tree - A timed phylogeny produced using BEAST2 (from practical session 3).

First we will set up the packages and options:

# Install and load required packages
if (!requireNamespace("ape", quietly = TRUE)) {
  install.packages("ape")
}
if (!requireNamespace("seqinr", quietly = TRUE)) {
  install.packages("seqinr")
}
if (!requireNamespace("TransPhylo", quietly = TRUE)) {
  install.packages("TransPhylo")
}
if (!requireNamespace("ggplot2", quietly = TRUE)) {
  install.packages("ggplot2")
}
if (!requireNamespace("igraph", quietly = TRUE)) {
  install.packages("igraph")
}
if (!requireNamespace("dplyr", quietly = TRUE)) {
  install.packages("dplyr")
}
if (!requireNamespace("tidyr", quietly = TRUE)) {
  install.packages("tidyr")
}
if (!requireNamespace("lubridate", quietly = TRUE)) {
  install.packages("lubridate")
}

library(ape)
library(seqinr)
library(TransPhylo)
library(ggplot2)
library(igraph)
library(dplyr)
library(tidyr)
library(lubridate)

options(stringsAsFactors = F)

Transmission clusters

1. Load in the aligned sequences from the TB cluster in FASTA format using the ‘ape’ package.

sequences <- read.dna("TB_Cluster.fasta",as.matrix = T,format = "fasta")

2. Use the dist.dna() function in the ‘ape’ package to calculate the pairwise SNP distance between sequences.

distance <- dist.dna(sequences,model = "N")

3. Define a SNP threshold as the maximum distance allowed to link sequences.

threshold <- 5  # maximum allowed distance for linking

4. We must now transform the distance matrix to a data frame with the columns: 1. Column name, 2. Row name, 3. Distance value. We can also ignore the diagonal (where the row name = column name).

distance_mat <- as.matrix(distance)

distance_transform <- as.data.frame(distance_mat) %>%
  mutate(rowname = rownames(distance_mat)) %>%
  pivot_longer(
    -rowname,
    names_to = "colname",
    values_to = "value"
  ) %>%
  filter(!is.na(value), rowname != colname) %>% 
  filter(match(rowname, rownames(distance_mat)) > match(colname, colnames(distance_mat))) # ignore diagonal

5. Extract the rows with a value <= the SNP threshold.

close_sequences <- distance_transform[which(distance_transform$value <= threshold),]

6. Build an igraph object using the row and column names (columns 1 and 2)

g <- graph_from_edgelist(as.matrix(close_sequences[,c(1,2)]), directed = FALSE)

7. Extract the clusters and build a results table with the cluster designations.

clusters <- components(g)

results <- data.frame(Names = names(clusters$membership), Cluster = as.numeric(clusters$membership))

How many clusters do you find? Are all sequences assigned a cluster?

Repeat this analysis with a higher SNP threshold, what differences do you find?

Transmission network reconstruction

1. Load in the timed phylogeny produced from BEAST2 and the table containing the names and collection dates for each sequence.

tree <- read.nexus("TB_Cluster.tree")

dates <- read.table("TB_Cluster.txt",header = T)

TransPhylo requires the date of the last collected sample in the dataset to scale the branches of the tree. This has to be in the decimal date format.

2. The following command will find the maximum date of the sequences and convert it to a decimal with the ‘lubridate’ package.

date_last_sample <- max(decimal_date(as.Date(dates$Date)))

3. TransPhylo next requires the tree to be converted to a specialised ptree format:

ptree <- ptreeFromPhylo(tree, dateLastSample = date_last_sample)

As TransPhylo employs a Bayesian approach, we can include prior epidemiological parameters in the model and choose to fix them or update them through the MCMC sampling. The full list of parameters that can be specified can be seen if you enter help(“inferTTree”).

Here we will specify prior parameters for the generation time distribution (the time between cases in transmission chain becoming infectious) and the sampling time distribution (the time between infection and sampling of individuals), which are modelled with a gamma distribution. This best models these distributions for an organism such as TB that can have a long and variable latency (non-infectious) period. A value between 0 and 1 for the proportion of cases we believe we have sampled in our dataset can also be specified with the pi parameter and we can also set the number of MCMC iterations to run our model.

4. The following lines of script with set the prior parameters:

w.shape = 1.3 # Shape of the generation time distribution
w.scale = 3.3333 # Scale of the generation time distribution
ws.shape = 1.1 # Shape of the sampling time distribution
ws.scale = 2.5 # Shape of the sampling time distribution

pi = 0.3 # A starting value for the sampling proportion (here we think we have ~30% of all cases in our dataset)

num_iterations <- 5000 # The number of MCMC iterations

5. We can now run the main function of TransPhylo that uses an MCMC sampling algorithm to infer a transmission tree from the phylogenetic tree - inferTTree. This can also include our starting values for the epidemiological parameters specified in step 4:

set.seed(1234) # Set a seed for reproducibility

record <- inferTTree(ptree, w.shape = w.shape, w.scale = w.scale,
                    ws.shape = ws.shape, ws.scale = ws.scale,
                    startPi = pi, mcmcIterations = num_iterations)

Once inferTTree has completed, we will have a posterior collection of estimates sampled at each of our MCMC iterations. We want to check for convergence of our analysis, which is when the MCMC sampling has reached a stationary distribution and results of our model are meaningful.

6. We can inspect the trace plots of a few key metrics with the following command:

plot(record)

Question: Do you think your analysis has converged and the results are meaningful?

7. We can now look at some of the results to charcterise potential transmission events in our dataset. Firstly, we can plot a heatmap showing the pairwise probability of direct transmission between all of our isolates:

mat = computeMatWIW(record)

lattice::levelplot(mat,xlab='',ylab='')

Finally, we can extract the optimum inferred transmission network in a tree format. First, as we have a posterior sample of transmission trees from the MCMC output, we must find a consensus transmission tree using the consTTree.

8. We can set a burnin (% of samples to exclude at the beginning of the run before convergence) and a minimum probability of transmission to be included in the network.

ttree <- consTTree(record, burnin = 0.2, minimum = 0.5) # Burnin of 20% and minimum probability of 0.5.

9. Now we can plot the transmission network, with estimates on the timing of transmission events:

plotTTree2(ttree, maxTime = 2020)

TransPhylo also contains other key results. To explore these further, please see the tutorial here.

Questions:

Can you determine the difference between the black and white dots?

How many direct transmission events between sampled cases can you identify (you may need to increase the size of the plot using the Zoom button or by exporting the plot)?