7th Annual training course on Viral Bioinformatics and Genomics (24 - 28 June 2024)

McCall computer cluster, Garscube campus, University of Glasgow, United Kingdom

NGS/HTS Data Practical

Derek W. Wright, MRC-University of Glasgow Centre for Virus Research, Derek.Wright@glasgow.ac.uk

Overview

In this practical, we will be exploring the FASTA, multi-FASTA and FASTQ formats.

Linux Commands

Commands that you need to enter into the terminal window (command line) are presented in a box with a fixed-width font, like this:

ls

A few tips to remember:

Shorthand/wildcard symbols help to save typing:

Setup

File Formats

Dataset

cp -r /home3/dw73x/Formats .

. is a shortcut for the current working directory

cd Formats

View the FASTA Format

Nucleotide Sequence

less single_seq.fasta

Amino Acid Sequence

less protein.faa

Multi-FASTA Format

less BabayanEtAl_sequences.fasta 
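
Every record in a FASTA file starts with a header line beginning with `>`, so counting those lines gives the number of sequences in a multi-FASTA file. The following is a minimal sketch on a made-up file; `toy.fasta` and its contents are hypothetical and not part of the course dataset.

```shell
# Create a tiny multi-FASTA file for illustration (made-up data)
printf '>seq1\nACGT\n>seq2\nGGCC\nTTAA\n>seq3\nAT\n' > toy.fasta

# Count the header lines (lines starting with '>'), one per sequence
n_seqs=$(grep -c '^>' toy.fasta)
echo "Sequences: $n_seqs"

# List only the header lines to see the sequence names
grep '^>' toy.fasta
```

The same `grep -c '^>'` command can be pointed at BabayanEtAl_sequences.fasta. The quotes around `'^>'` matter: without them the shell would treat `>` as output redirection.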

View the FASTQ Format

less reads_R1.fastq
grep '@SRR1553467.279000' reads_R1.fastq
grep '@SRR1553467.279000' -A 3 reads_R1.fastq
wc -l reads_R1.fastq
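
The commands above can be combined to inspect a single read and to estimate how many reads a file holds. The sketch below assumes the common four-line FASTQ layout (header, sequence, `+` separator, quality string) with no wrapped sequence lines; `toy.fastq` and its reads are made up for illustration.

```shell
# Create a tiny FASTQ file with two reads (made-up data, not from the course dataset)
printf '@read1\nACGT\n+\nIIII\n@read2\nGGCC\n+\nFFFF\n' > toy.fastq

# Print one complete record: the header line plus the 3 lines after it
record=$(grep -A 3 '^@read2' toy.fastq)
echo "$record"

# Count lines; reading from standard input makes wc print only the number
n_lines=$(wc -l < toy.fastq)

# Four lines per read, so divide the line count by 4
n_reads=$((n_lines / 4))
echo "Reads: $n_reads"
```

Dividing the line count by four is generally safer than counting lines that start with `@`, because a quality string can also begin with `@`.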

Compressed Files

FASTQ files are often gzipped (compressed) and have a .fastq.gz extension. Use the commands zcat, zmore, zless and zgrep to access these compressed files.

zless 00013_OS_L_NA_S1_R1_001.fastq.gz
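
The z-prefixed commands read a gzipped file directly, so it does not need to be decompressed to disk first. A minimal sketch, assuming a Linux system with gzip installed and the four-line FASTQ layout; `toy_gz.fastq` is a made-up file, not part of the course dataset.

```shell
# Create and compress a tiny FASTQ file (made-up data)
printf '@read1\nACGT\n+\nIIII\n@read2\nGGCC\n+\nFFFF\n' > toy_gz.fastq
gzip -f toy_gz.fastq   # replaces the file with toy_gz.fastq.gz

# Show the first record (4 lines) of the compressed file
zcat toy_gz.fastq.gz | head -n 4

# Count reads: decompress to standard output, count lines, divide by 4
n_reads_gz=$(( $(zcat toy_gz.fastq.gz | wc -l) / 4 ))
echo "Reads: $n_reads_gz"

# Search inside the compressed file; -c counts the matching lines
n_hits=$(zgrep -c '^@read' toy_gz.fastq.gz)
echo "Header matches: $n_hits"
```

Piping `zcat` into other commands leaves the .gz file untouched, which is useful when the uncompressed data would be very large.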