Identifying genes under selection with dN/dS

We can identify genes that may be under selection in populations using genomic sequencing data. A method to estimate which genes are under positive or negative selection is to calculate the dN/dS ratio for a gene, which is the ratio of non-synonymous (amino acid changing and thus potentially protein changing) mutations and synonymous (non-amino acid changing) mutations.

A dN/dS value of > 1 implies that a gene is under positive selection and the higher this value, the stronger the selection. This shows that the ratio of non-synonymous mutations is higher than synonymous and beneficial mutations are being fixed in the gene, which can be the result of adaptation to new pressures or environments.

Conversely, a dN/dS ratio value of < 1 will suggest that a gene is under purifying selection and the number of synonymous SNPs is higher than non-synonymous. This is leads to conservation of the protein function of the gene, implying these may be important housekeeping genes. We can also have values of ~ 1 that implies the genes is evolving neutrally, with no selection.

There are multiple standalone software that will calculate dN/dS (e.g.,PAML, HyPhy), along with alternative methods for estimating which genes are under selection using population genetics, including as Site-Frequency Spectrum methods (e.g., Tajima’s D) and identifying specific mutations associated with a phenotype (e.g., GWAS - next activity).

This practical will take you through a more manual approach for calculating the dN/dS value per gene using the ‘kaks’ function in the ‘seqinr’ package in R.


1. First we will set up the packages and options:

# Set packages
require(seqinr)
options(stringsAsFactors = F)


2. Next we need an alignment of full genes that are in a format for each nucleotide (base) to be read in the correct codon position (i.e., 1st, 2nd and 3rd codon position).

From the download folder, we have the multi-sample alignment called “FullSequence.fasta”, which contains gene and intergenic regions, and an index file showing the names and positions (Name, Start, Stop) of the genes in the sequence alignment called “FullSequence_index.csv”.

Activity: Write an R script to make a multi-sample alignment of each gene separately, naming the resulting FASTA the name of the gene contained in the first column of the FullSequence_index.csv file (e.g., E_Gene.fasta).

The solution is here but please attempt to write the code yourselves.


3. We we can then calculate the dN/dS score for each gene by reading in the FASTA and calculating the number of non-synonymous sites (ka) and synonymous sites (ks) for each gene using the ‘kaks’ function in seqinr. I have given you the code to run one of the genes, please calculate the dN/dS ratio for all genes:

Gene_alignment <- read.alignment("E_Gene.fasta",format = "fasta")

Result <- kaks(Gene_alignment)

print(mean(Result$ka)/mean(Result$ks))


Questions:

1. Which genes have the highest and lowest dN/dS values?

2. Which genes are under purifying or positive selection?

3. Think about why these genes might be under purifying or positive selection?


Next activity: GWAS