Align consensus sequences
In the previous exercise, we constructed a FASTA file of aligned sequences comprised of SNPs built using a VCF file from sequences that had all been aligned to the same reference strain. In some cases though, you may receive consensus sequences from different sources that have not been aligned. This may be because they have been mapped against different reference strains or assembled using a reference-free approach (e.g., de novo assembly). This will be the case when downloading SARS-CoV-2 sequences from GISAID (Practical 1, exercise 1).
To compare sequences and identify variation to allow for phylogenetic tree building and genomic comparison, sequences must be aligned so that the position of corresponding sites matches. Here, we will use the program "mafft" to align the 246 SARS-CoV-2 B.1.1.7 variant sequences that have been collected from Taiwan and assembled in different groups. We will align the sequences against the well-characterized Wuhan-1 SARS-CoV-2 sequence (GenBank accession number MN908947.3), which is 29903 bp in length and has been been used previously as a reference strain.
The data we will use for this exercise are:
-
Taiwan_B.117.fasta - A FASTA file containing the unaligned 246 SARS-CoV-2 B.1.1.7 sequences downloaded from GISAID.
-
Wuhan1.fasta - The Wuhan-1 SARS-CoV-2 strain.
-
View the unaligned "Taiwan_B117.fasta" file in AliView. It should look like this:

-
Input the following command to align the sequences to the Wuhan-1 reference strain:
- View your newly aligned sequence using AliView.
mafft --auto --thread -2 --keeplength --preservecase --addfragments Taiwan_B117.fasta Wuhan1.fasta > Taiwan_B117_aligned.fasta
This may take a few minutes to execute.
| Option | Description |
|---|---|
auto |
Automatically selects an appropriate strategy from L-INS-i, FFT-NS-i, and FFT-NS-2, according to data size. |
thread |
The number of threads to use for multi-threading. Here we have used 2 threads, but you can use more to increase speed. |
keeplength |
Aligns the sequences while preserving the length of the reference strain. This is optimal to allow for better indexing of sites against the well-characterized reference sequence. |
preservecase |
Retains all nucleotides as lower/uppercase. |
addfragments |
Will add fragments of sequences to existing alignments (good for closely related samples. |
Taiwan_B117.fasta |
The name of our unaligned input file. |
Wuhan1.fasta |
The name of our reference sequence. |
Taiwan_B117_aligned.fasta |
The name of our output file. |
This will align your sequences to the Wuhan 1 reference strain, and add the reference strain into the alignment (the name will be the GenBank accession number MN908947.3).
Exercise:
The recommendation for SARS-CoV-2 sequence alignments to the Wuhan 1 reference is to mask sites at the start and end of the sequence as these are prone to errors in sequencing, which results in a high number of ambiguities. The positions to mask are all nucleotides from 1 - 264 and 29674 - 29903.
The solution is to cut the sequence at these positions or to replace all nucleotides at these positions with an ambiguous base call "N". The latter is the best option as we keep the sequence length as 29903, this will allow us to retain the positions relative to Wuhan 1 that . It will also retain the position to easily interpret any estimates for the substitution or mutation rate in any subsequent analysis (we will cover this in future practicals).
-
Open R
-
Make sure the program 'seqinr' is installed:
- Read in the aligned sequence file "Taiwan_B117_aligned.fasta". This will assign the sequence alignment as a list in R.
-
Try to write a short script in R that will change the nucleotides at the masked positions and call this new list "Taiwan_B117_aligned_masked"
-
Save the new masked fasta file with the following command:
- Finally, view your aligned and masked sequence file by opening with AliView.
install.packages("seqinr")
library(seqinr)
B117_aligned<-seqinr::read.fasta("Taiwan_B117_aligned.fasta", forceDNAtolower = F)
seqinr::write.fasta(Taiwan_B117_aligned_masked,names(Taiwan_B117_aligned_masked),"Taiwan_B117_aligned_masked.fasta",open = "w")
Tips:
1. Each samples sequence is list element, so you can use the 'lapply' function to apply to all sequences
2. Use the 'replace' function
3. The names of each list element will not be carried out after replacing so add the names of "Taiwan_B117_aligned" to "Taiwan_B117_aligned_masked"
Please work together if needed and if you get stuck, there is a solution here.
Next activity: Maximum Likelihood phylogenetic trees