Filter variants and construct consensus sequence
From the previous exercise, we now have VCF files of potential SNPs and Indels that have been identified in our 3 samples against the H37Rv reference strain. We now want to assign a probable nucleotide call at these sites for each of our samples while filtering possible false positives or ambiguous calls.
The final step of variant calling ‘VariantFiltration” added labels in the ’FILTER’ (column 6) field of the VCF at sites that did not meet the given threshold for that filter parameter. We can filter these variants that have failed any QC steps to decrease false positive variant calls. We may also want to set other thresholds to accept the per-sample variant calls, such as a minimum read depth at that position to account for spurious calls due to low coverage.
In addition, while some software you may use for downstream analysis can use the VCF file directly as input, many will require a aligned sequence file, for example in the FASTA format. As such, we will need to construct a consensus sequence per sample by assigning a nucleotide at each position depending on the call in the VCF.
There are programs available to do this (e.g., VCFtools, vcfProcess) but here we will go through some steps in R to do manual filtration of the VCF file and construct a consensus sequence of concatenated SNPs for each sample, which we will output as an multi-sequence alignment in the FASTA format.
Note: We could assemble the full length consensus sequence from the VCF but for M.tb this is > 4 million bases so we can assume that the sequence is the same as the reference in all samples if there is not a variant call (row) for that position. Also we will not use the Indel VCF file here but the tools above can filter Indels and construct a binary matrix of presence/absence of Indels for each sample.
1. Set the options and load packages. We will use the R package ‘seqinr’ to read and write FASTA files and the ‘ape’ package to calculate the pairwise SNP distance.
# Set packages and options
if (!require(seqinr)){
install.packages("seqinr",repos = "http://cran.us.r-project.org")
library(seqinr)
}
if (!require(ape)){
install.packages("ape",repos = "http://cran.us.r-project.org")
library(ape)
}
options(stringsAsFactors = F)
2. From the previous exercise you should have the VCF file in a data.frame called ‘vcf’ with the correct column names. If not please run this:
vcf<-read.table(file = "SNPs_filtered.vcf") # Read vcf file
vcf_header<-as.matrix(read.table("SNPs_filtered.vcf",comment.char=" ",sep="\n"))
column_names<-unlist(strsplit(vcf_header[45],"\t")) # Isolate column names
colnames(vcf)<-column_names # Set column names
3. The first filtering step is to remove any SNPs that did not meet the threshold for any filter in the ‘VariantFiltration’ step of GATK in the previous exercise. This information is contained in column 7 - “FILTER”:
vcf<-vcf[which(vcf[,7]=="PASS"),] # Choose only rows with a "PASS"
4. GATK will also sometimes call SNPs that overlap indels, these can be difficult to code so for this exercise we will remove these:
vcf<-vcf[-grep("*",vcf[,5],fixed = T),] # Remove any SNPs spanning indels
4. We have applied some filters to the VCF, now we want to assign a nucleotide for each sampled isolate at each position in the VCF (row) that has a read depth (coverage) over a given threshold. First we will create tables of just the GT and DP fields along with vectors of the reference and alternative alleles:
GT<-1 # Genotype position in format field
DP<-3 # Read depth position in format field
Ref<-vcf[,4] # Make vector of reference alleles
Alt1<-vcf[,5] # Make vector of alternative alleles, note there may be more than 1 separated by a comma.
# Functions to split and extract the GT and DP elements
extract_GT <- function(x) {
element <- sapply(strsplit(x, ":"), function(y) y[GT])
return(element)
}
extract_DP <- function(x) {
element <- sapply(strsplit(x, ":"), function(y) y[DP])
return(element)
}
# GT and DP tables
GT_mat <- data.frame(lapply(vcf[,c(10:12)], extract_GT)) # GT table
DP_mat <- data.frame(lapply(vcf[,c(10:12)], extract_DP)) # DP table
DP_mat<-apply(DP_mat, 2, as.numeric) # Make DP_mat numeric
5. We will then make assign the nucleotide to each position for each sample based on the genotype call at that site. We will iterate through the rows and assign reference or alternative alleles depending on the call, or an ambiguous call “N” for sites with no coverage or mixed calls:
For this example, mixed sites (e.g., 0/1, 0/2 etc) will be called as ambiguous but these may be coded differently. (Please read here for more details)
mixed_calls<-c("0/1","0/2","0/3","1/2","1/3","2/3","0|1","0|2","0|3","1|2","1|3","2|3")
for (i in 1:nrow(GT_mat)){ # iterate through rows
GT_mat[i,which(GT_mat[i,]=="./.")]<- "N" # No coverage
GT_mat[i,which(GT_mat[i,] %in% mixed_calls)]<- "N" # Mixed call
GT_mat[i,which(GT_mat[i,]=="0/0" | GT_mat[i,]=="0|0" )]<- Ref[i] # Reference call
GT_mat[i,which(GT_mat[i,]=="1/1" | GT_mat[i,]=="1|1")]<- unlist(strsplit(Alt1[i],","))[1] # First alt allele
GT_mat[i,which(GT_mat[i,]=="2/2" | GT_mat[i,]=="2|2")]<- unlist(strsplit(Alt1[i],","))[2] # Second alte allele
GT_mat[i,which(GT_mat[i,]=="3/3" | GT_mat[i,]=="3|3")]<- unlist(strsplit(Alt1[i],","))[3] # Third alt allele
}
6. We may want to mask sites with low read depth (< 5 reads covering that position) by assigning any call with less than 5 depth (DP < 5) as an ambiguous call “N”:
GT_mat[DP_mat < 5]<-"N"
7. This will result in a table of nucleotides, with each row being the position in the genome and each column a different sampled isolate. We can also rename the rows as the positions in the VCF file:
row.names(GT_mat)<-vcf[,2]
head(GT_mat)
## ERR190342 ERR212102 ERR221611
## 237 G A G
## 1849 C C N
## 1977 G G G
## 2097 C A C
## 2532 T C T
## 3446 C C T
Extra activity: You may want to remove any rows that has no variation after filtering and assigning nucleotides (i.e., all samples have the same nucleotide at this position) to reduce the size of the fasta and speed up any downstream analysis. Write another function to remove rows of the GT_mat with no variation.
8. We can convert our table to a fasta file by transposing it and transforming it into a list with each samples sequence a list element, then writing it to a fasta file with ‘seqinr’:
seqinr::write.fasta(sequences = as.list(apply(t(GT_mat), 1, paste, collapse="")),
names = colnames(GT_mat),
file.out="TB.fasta",open="w")
You can view your FASTA file by opening this file in AliView
9. Finally, we can calculate a pairwise SNP distance matrix from our nucleotide table (GT_mat) using the dist.dna function in ‘ape’. This will give us an idea of the genomic divergence between our samples.
SNP_matrix<-as.DNAbin(matrix(GT_mat,byrow = T)) # Convert the table to a DNAbin object via converting to a matrix
names(SNP_matrix)<-colnames(GT_mat) # Rename the samples using the column names
dist.dna(SNP_matrix,model = "N") # Calculate the pairwise distance as the number of sites that differ between the samples
## ERR190342 ERR212102
## ERR212102 1672
## ERR221611 798 1706
Next activity: Align consensus sequences