Mapping/aligning sequence data to a reference genome

After inspecting and potentially improving the quality of the raw sequence FASTQ files, we now want to attempt to reconstruct the genome from the short reads. This process involves aligning the reads against a reference sequence, identifying similarities and differences, and ultimately generating a consensus sequence.

Here, we will go through an approach to do this using BWA. This is a specialized software for mapping sequences against a reference, which is particularly good for low-divergence organisms such as TB.

For this activity, we will use the following files found in your /data/ folder:


  1. First we can view the options in BWA by simply typing the following command in your Terminal or WSL:
  2.   bwa
    

    This will display all options for BWA:

    Description


  3. Before mapping the reads, we need to first index our reference genome using the BWA ‘index’ command:
  4. bwa index H37Rv.fasta
    
  5. This will create five new index files associated with the H37Rv.fasta file. To view these, you can enter:
  6. ls H37Rv*
    


    We can now align our paired-end reads for each sample to the reference. As you can see from the options shown in step 1, there are multiple options in BWA for alignment, with the main alignment algorithms being:

    1. aln/samse/sampe for BWA-backtrack (best for reads under 100bp)

    2. bwasw for BWA-SW (better for reads over 100bp)
    3. mem for BWA-MEM (best for reads over 100bp, faster and more accurate than bwasw)

    We will just be using the standard parameters for our alignments so will not add any other options but it is important to review and optimize these options for your data. Further information on these options can be found [here].(https://bio-bwa.sourceforge.net/bwa.shtml)

  7. The following command will align our paired reads to the H37Rv reference strain:
  8. bwa mem H37Rv.fasta Test3_R1.fastq.gz Test3_R2.fastq.gz > Test3.sam
    

    This will produce a SAM file in the same folder, which we are calling TB1.sam.


  9. We now need to convert the SAM file into its binary format – BAM file. This is the format that is required as an input for variant calling software (to be covered in tomorrow's activities) and visualization tools, as well as reducing the file size. We can use SAMtools ‘view’ to convert from SAM to BAM (the -b option tells the program to output a BAM file):
  10. samtools view -b Test3.sam > Test3.bam
    

    Tip: You can now remove the SAM file now to free up space as we now have the alignment in BAM format:

    rm Test3.sam 
    


  11. BAM files must be sorted and indexed by the chromosomes/contigs/scaffolds in your reference genome in order to efficiently display/access the data in the BAM file. We can do this with SAMtools:
  12. samtools sort Test3.bam > Test3_sorted.bam 
    

    Tip: You can rename the sorted BAM file (which will also remove the unsorted BAM file):

    mv Test3_sorted.bam Test3.bam
    

    Then we will index the sorted BAM file that will create an additional .bai file:

    samtools index Test3.bam 
    


  13. Finally, you can print summary statistics from the indexed BAM file using the SAMtools ‘flagstat’ option:
  14. samtools flagstat Test3.bam 
    

    It will look something like this:

    Description

Where each line specifies:

Metric Description
1. Total number of reads Total count of reads in the sequencing dataset.
2. Secondary Number of partial alignments of a read in different locations of the genome.
3. Supplementary Number of chimeric alignments in the dataset.
4. Duplicates Number of duplicate reads in the dataset.
5. Number of mapped reads Count of reads that successfully align to a reference genome.
6. Paired reads in sequencing Total number of paired-end reads in the sequencing dataset.
7. Number of forward reads Count of reads aligned in the forward orientation.
8. Number of reverse reads Count of reads aligned in the reverse orientation.
9. Number of properly mapped reads Reads that are properly mapped, i.e., same chromosome, opposite orientation, and within a few deviations from the expected insert size.
10. Number of reads mapped along with their mates Total number of reads where both ends of the paired reads are successfully mapped. (7 is a subset of 8)
11. Singletons Number of reads mapped alone, i.e., mates not mapped. (8 + 9 = 3)
12. Reads whose mates are mapped to a different chromosome Reads where one end is mapped to a different chromosome than its mate.
13. Reads whose mates are mapped to a different chromosome with mapping quality greater than 5 Subset of (10) with a higher mapping quality threshold.


Question: Why might line 5, number of mapped reads, be low? How might you increase the number of mapped reads?


Your “TB1" sequence data is now aligned to the reference genome H37Rv and stored in a BAM file, which has been indexed and sorted to the reference.


Visualizing BAM files

  1. Open the Program “Tablet” or "BamView" that will have saved to the applications on your computer. The opening screen should look like this:
  2. Description


  3. Click the “Open Assembly” button (in red above) to load alignment (TB1.bam) and reference file (H37Rv.fasta) by navigating to the folder these are stored in after clicking 'browse' in the “Select assembly files”.
  4. Note the BAM file will be the ‘primary assembly’ and H37Rv as the ‘Reference/Consensus’.

    Description


  5. This should have loaded the alignment file. Click the space under the “Contigs” section (highlighted in red below):
  6. Description


    Tablet loads and can display up to 25,000bp at a time as default, delimited by the Overview window.

    The reference sequence (and amino acid product) is displayed at [1]. The aligned reads are at [2], where each row is an aligned read. You can hover your curser over a read to display information about that read, including the number of mismatches against the reference strain.

  7. To navigate through the genome, you will need to use the bar on top by dragging it (3). To navigate through the 25 Kb loaded you can click and drag the Read panel (2) or use the button below (4). Note that the red numbers in each corner of the Read panel will display the region currently being shown (5).

Exercise:

Scroll to base 116000, what do you see? What feature do you think this might be?

Hint. It may become clearer if you click the “Tag variants” button in the top panel.

Exercise:

Jump to base 1,997,457, what do you see? Why do you think this is the case? What feature do you think this might be?


Next activity: De novo assembly