De novo (reference-free) assembly

De novo assembly is the process of reconstructing genomes from sequencing read data without the use of a reference genome. It involves overlapping and merging reads to form longer contiguous sequences (contigs) that represent the genomic sequence.
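To make the idea of overlapping and merging reads concrete, here is a minimal toy sketch in Python. It uses a naive greedy strategy (repeatedly merging the two sequences with the longest suffix-to-prefix overlap) on made-up reads. This is only an illustration under those assumptions: production assemblers such as SPAdes are built on de Bruijn graphs and handle sequencing errors, reverse complements and repeats, none of which this sketch attempts. The function names and the `min_len` parameter are hypothetical.

```python
def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of `a` that is also a prefix of `b`.

    Returns 0 if the best overlap is shorter than `min_len`.
    """
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(reads: list[str], min_len: int = 3) -> list[str]:
    """Merge the best-overlapping pair until no overlaps remain."""
    contigs = list(reads)
    while True:
        best_n, best_i, best_j = 0, -1, -1
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best_n:
                        best_n, best_i, best_j = n, i, j
        if best_n == 0:
            return contigs  # nothing left to merge
        # Join the pair, keeping the shared bases only once.
        merged = contigs[best_i] + contigs[best_j][best_n:]
        contigs = [c for k, c in enumerate(contigs) if k not in (best_i, best_j)]
        contigs.append(merged)


# Made-up reads: the first three overlap, the fourth does not.
reads = ["ATGGCGT", "GCGTACC", "TACCTGA", "GGGGCCCC"]
contigs = greedy_assemble(reads)
print(contigs)
```

The three overlapping reads collapse into one contig while the unrelated read stays separate, which is the same reason a real assembly often ends up as several contigs rather than one sequence.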

This process is useful when no suitable reference genome is available or when the goal is to discover novel genomic features, such as structural variations, insertions, or new genes. It is also valuable for assembling organisms with high levels of recombination, as it allows reconstruction of unique genome arrangements without bias from a reference.

The goal is to assemble the read data into a single consensus sequence, though this is rarely achieved with short-read data, and you will likely end up with multiple contigs. Long-read sequencing data is much more likely to yield a single, complete sequence.

We will use a de novo assembler called Unicycler. It uses the SPAdes assembler to assemble the sequencing reads into contigs, polishes the assembly to improve its accuracy and, where possible, circularizes genomes.

For this activity, we will use the paired-end read files Kleb1_R1.fastq.gz and Kleb1_R2.fastq.gz, found in your /data/ folder.


  1. First, view the options in Unicycler by typing the following command in your terminal:

        unicycler

    This will display all options for Unicycler:

    [Image: Unicycler options]

  2. We can then run the main Unicycler command. This may take a while to complete, as it tests different k-mer lengths (see Lecture 1 for the definition of a k-mer) to produce the optimal assembly. Perhaps run this command over the lunch break:

        unicycler -1 Kleb1_R1.fastq.gz -2 Kleb1_R2.fastq.gz --keep 0 -t 2 -o Kleb1

    Where the options specified are:

    Option   Description
    -1       The forward reads of the short-read paired-end sequence data
    -2       The reverse reads of the short-read paired-end sequence data
    --keep   Which files to keep in the output directory. 0 = only keep final files: assembly (FASTA, GFA and log)
    -t       The number of threads to use
    -o       The name of the output directory

    Once the run finishes, we will have the de novo assembled contigs, built with the k-mer length that Unicycler judges optimal according to quality metrics (outlined on the Unicycler GitHub). These will be in the Kleb1/ folder.

  3. Navigate to this folder and view the files:

        cd Kleb1/
        ls  # list the files in this folder

    These files include a FASTA file with the assembled contigs (assembly.fasta).

  4. View the FASTA file in AliView:

    [Image: assembly.fasta in AliView]

    Discuss the assembled contigs. Do you think we have reconstructed the complete genome?

    Instead of inspecting the files manually, we can use software such as QUAST to calculate different metrics to assess our assemblies.

  5. Run QUAST with the following command to calculate assembly metrics and write the new files to a folder called QUAST/:

        quast -o QUAST assembly.fasta

    This will produce files in multiple formats in the Kleb1/QUAST/ folder that show the assembly metrics. We will view the 'report.txt' file.

  6. Either open the report.txt file in your text editor, or view it in Terminal or WSL using the following commands:

        cd QUAST/
        vi report.txt

    The report.txt file looks like this:

    [Image: example report.txt]

Please discuss the output: (1) How many contigs do we have? (2) Given that the Klebsiella pneumoniae genome is ~5.5 Mb, do you think we have reconstructed the full genome?
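Both questions can also be checked directly from the FASTA file. The following is a minimal Python sketch, not part of the practical itself: `contig_lengths` is a hypothetical helper, the ~5.5 Mb figure is the expected genome size quoted above, and the inline `demo` text is a made-up stand-in so that the example runs without the real assembly.

```python
import io


def contig_lengths(handle):
    """Return the length of each sequence in an open FASTA file."""
    lengths = []
    current = None  # length of the record being read, None before the first header
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            if current is not None:
                lengths.append(current)
            current = 0
        elif line and current is not None:
            current += len(line)
    if current is not None:
        lengths.append(current)
    return lengths


EXPECTED_GENOME_SIZE = 5_500_000  # ~5.5 Mb, the figure quoted in the text

# Made-up two-contig FASTA used in place of assembly.fasta.
demo = io.StringIO(">1\nACGTACGT\nACGT\n>2\nGGCC\n")
lengths = contig_lengths(demo)
total = sum(lengths)

print(f"contigs: {len(lengths)}")
print(f"total length: {total} bp ({total / EXPECTED_GENOME_SIZE:.1%} of expected size)")
```

To run it on the real assembly, replace the `demo` handle with `open("assembly.fasta")`. A total length close to the expected genome size, spread over many contigs, suggests that most of the genome is present but fragmented.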

Some of the metrics are less intuitive (e.g., N50, L50). We may not have time to discuss them, but they are important for assessing the contiguity of the assembly. Please read here for more details.
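As a rough guide to these two metrics: if the contigs are sorted from longest to shortest and added up until at least half of the total assembly length is covered, N50 is the length of the last contig added and L50 is the number of contigs that were needed. The Python sketch below works through this on made-up contig lengths. The function name `n50_l50` is hypothetical, and QUAST's own values may differ slightly on real data because it applies a minimum contig length filter by default.

```python
def n50_l50(lengths):
    """Return (N50, L50) for a list of contig lengths."""
    ordered = sorted(lengths, reverse=True)  # longest contig first
    half = sum(ordered) / 2
    running = 0
    for count, length in enumerate(ordered, start=1):
        running += length
        if running >= half:
            # `length` is the shortest contig needed to reach 50% of the total.
            return length, count
    raise ValueError("no contig lengths given")


# Made-up contig lengths (total = 300).
toy_lengths = [80, 70, 50, 40, 30, 20, 10]
n50, l50 = n50_l50(toy_lengths)
print(f"N50 = {n50}, L50 = {l50}")
```

Here the two longest contigs (80 + 70 = 150) already cover half of the 300 bases, so N50 is 70 and L50 is 2. A higher N50 and a lower L50 indicate a more contiguous assembly.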


This is the end of the activities in practical session 1. Navigate back to the homepage for other activities here.