View on GitHub

7th Annual training course on Viral Bioinformatics and Genomics (24 - 28 June 2024)

McCall computer cluster, Garscube campus, University of Glasgow, United Kingdom

De novo assembly practicals

Sreenu Vattipally

MRC-University of Glasgow Centre for Virus Research
University of Glasgow, G61 1QH
E-mail: Sreenu.Vattipally@glasgow.ac.uk

In the past decade, next-generation sequencing (NGS) has become an important tool for understanding viral evolution. It also has revolutionised many areas of biological research. NGS platforms can now generate large amounts of sequence data in a relatively short time, making them well-suited for studies of genomes, transcriptomes, and epigenetics.

One of the important applications of NGS is whole-genome sequencing (WGS), which can be used to obtain the complete genome sequence of an organism. However, the primary challenge with WGS is that it is difficult to assemble the genome without a reference.

De novo assembly is the process of assembling a genome without using any reference. This can be a challenge since it requires correctly assembling millions of short DNA sequences (reads) that cover the entire genome. This approach has been used to discover new viruses, characterise known viruses, and study the evolution of viruses.

There are several steps involved in viral de novo assembly, including, quality control, read trimming, assembly, scaffolding of contigs, and gap filling.

In this practical, we will be using three different assemblers i.e. Spades, IDBA UD and ABYSS to assemble the SARS-CoV-2 genome from Illumina reads.

Create a directory for the analysis

mkdir -p ~/DeNovo/SARS-CoV-2
cd ~/DeNovo/SARS-CoV-2/

Copy the data for this practical (SRA ID: SRR21065613) to the current directory and uncompress them.

cp /home3/bvv2t/exFiles/FQs/SRR21065613* .
gunzip SRR21065613_*

Do a quality check and trimming using trim galore program.

trim_galore -q 20 -length 50 --illumina --paired SRR21065613_1.fastq SRR21065613_2.fastq

After running trim galore you should see two new files (SRR21065613_1_val_1.fq and SRR21065613_2_val_2.fq) in your working directory.

Now create individual directories for Spades, IDBA UD and Abyss along with Reference and Validation folders.

mkdir Spades Abyss IDBA_UD Ref Validate

De novo assembly using Spades

First, go to the Spades directory you have created above

cd ~/DeNovo/SARS-CoV-2/Spades

Run spades assembly program with different k-mer sizes.

/software/SPAdes-v3.15.5/bin/spades.py -k 21,45,73,101 -1 ../SRR21065613_1_val_1.fq -2 ../SRR21065613_2_val_2.fq -o .

Here

-k : different k-mer sizes
-1: First reads file
-2: Second reads file
-o: output location. Here we are saving in the current directory(.), we can also specify any name.

The program will take some time depending on the CPU power and available memory to complete the job. If everything goes without errors, the final assembled contigs will be stored in contigs.fasta file. In this file, contigs are saved in a sorted manner, with the longest contig first. Remember, these contigs can be positive or negative strands. We have to check the orientation comparing them with the reference sequence.

**TODO: What is the length of the longest contig generated?

De novo assembly using IDBA_UD

Now let us try running IDBA UD.

cd ~/DeNovo/SARS-CoV-2/IDBA_UD/

Though it is an assembly program for high throughput reads, IDBA_UD works only with fasta files(not fastq). It will not consider the quality of the reads while assembling. So… we have to convert reads from fastq to fasta and submit them to IDBA_UD.

fq2fa --merge ../SRR21065613_1_val_1.fq ../SRR21065613_2_val_2.fq reads.fa

Now run IDBA program

idba_ud --mink=21 --maxk=81 step=10 -r reads.fa -o . --num_threads=4

Here

fq2fa: program to convert reads from fastq to fasta
mink: minimum k-mer size
maxk: maximum k-mer size
step: k-mer size increment
r: input reads
o: output directory
num_threads: number of processing threads to run the program

The final results will be stored in contig.fa file in the current directory.

**TODO: Check the length of the longest contig.

De novo assembly using ABYSS

Go to the Abyss directory

cd ~/DeNovo/SARS-CoV-2/Abyss/

Run Abyss assembler

abyss-pe k=51 n=4 in='../SRR21065613_1_val_1.fq ../SRR21065613_2_val_2.fq' name=Abyss-51 B=2G

Here

k: k-mer size
n: number of threads to use
in: input reads
name: output name

As you notice from above, the abyss assembler works with one k-mer at a time. Since we do not know the optimal k-mer size for our data it is advisable to use various k-mer sizes. Use the below for loop to run the program with different k-mers.

for k in 41 51 61 71 81; 
	do 
	abyss-pe k=$k n=4 in='../SRR21065613_1_val_1.fq ../SRR21065613_2_val_2.fq' name=Abyss-$k B=2G; 
done

This will run abyss with k-mer sizes from 41 to 81 with the increment of 10 and save the output in corresponding files. To get the assembly stats, run

abyss-fac *-contigs.fa

Now combine all the contigs generated by different kmer sizes

cat Abyss*-unitigs.fa > Abyss-contigs.fa

**TODO: Check the length of the longest contig.