Trimming
========

Sequencing data are usually provided as raw reads in FASTQ format, which typically require pre-processing before further analysis. Reads often carry additional oligonucleotide sequences: these may have been introduced by the experimental design (e.g., sample barcodes), be specific to a sequencing platform (e.g., adapter sequences) or represent a real biological signal (e.g., a spliced leader). The most common pre-processing step is read trimming, which:

1. Trims adapters and primers from reads
2. Filters on base qualities, in different ways depending on the program

It is important to trim reads and remove adapters to avoid problems in downstream analyses such as transcriptomics :cite:`Williams2016`, variant analysis or de novo assembly. Many tools have been developed for this purpose and an extensive list of them is presented below. Each has its own advantages, and some have already been benchmarked :cite:`DelFabbro2013`. Here, tools for trimming and adapter removal are benchmarked in the context of consensus calling and variant calling for viral genomes. Duplicate removal and contaminant sequence filtering are covered elsewhere.

Tools
#####

.. list-table:: Trimming programs
   :widths: 33 33 34
   :header-rows: 1

   * - Name
     - Website
     - Publication
   * - Btrim
     - http://graphics.med.yale.edu/trim/
     - :cite:`Kong2011`
   * - ConDeTri
     - https://github.com/linneas/condetri
     - :cite:`Smeds2011`
   * - Cutadapt
     - https://cutadapt.readthedocs.io/en/stable/
     - :cite:`EJ200`
   * - FASTX
     - http://hannonlab.cshl.edu/fastx_toolkit/
     - N/A
   * - Flexbar
     - https://github.com/seqan/flexbar/wiki/Manual
     - :cite:`Dodt2012`
   * - PRINSEQ
     - http://prinseq.sourceforge.net/index.html
     - :cite:`Schmieder2011`
   * - SolexaQA + DynamicTrim
     - http://solexaqa.sourceforge.net/
     - :cite:`Cox2010`
   * - Scythe
     - https://github.com/vsbuffalo/scythe
     - N/A
   * - Seqtk
     - https://github.com/lh3/seqtk
     - N/A
   * - Sickle
     - https://github.com/najoshi/sickle
     - N/A
   * - TagDust2
     - http://tagdust.sourceforge.net/
     - :cite:`Lassmann2015`
   * - Trimmomatic
     - http://www.usadellab.org/cms/index.php?page=trimmomatic
     - :cite:`Bolger2014`
   * - Atropos
     - https://github.com/jdidion/atropos
     - :cite:`Didion2017`
   * - fastp
     - https://github.com/OpenGene/fastp
     - :cite:`Chen2018`
   * - FastqPuri
     - https://github.com/jengelmann/FastqPuri
     - :cite:`Perez-Rubio2019`
   * - ea-utils
     - https://expressionanalysis.github.io/ea-utils/
     - :cite:`Aronesty2013`
   * - TrimGalore
     - https://www.bioinformatics.babraham.ac.uk/projects/trim_galore/
     - N/A
   * - BBDuk
     - https://jgi.doe.gov/data-and-tools/bbtools/bb-tools-user-guide/bbduk-guide/
     - N/A

How to run
##########

Btrim:
------

Btrim trims low-quality bases at either or both ends of a FASTQ read, identifies barcodes, and handles indels in adapters and barcodes. Adapters are provided in a 6-column tab-delimited text file, in which the user can specify which region of the read to search for adapters. See the ``btrim64`` usage information.

To trim the Illumina adapter AGATCGGAAGAGC with a moving window of size 10 and a cutoff of 25 for the average quality score within the window, performing only 3'-end trimming and keeping a minimal sequence length of 40:

.. code-block:: bash

   btrim64 -w 10 -a 25 -p illumina_adapter.txt -3 -P -o output.fastq -l 40 -t <(gunzip -c path_to_your_fastq/*.gz) -C -z

The disadvantage is that Btrim does not deal directly with paired-end reads: the individual files need to be trimmed first, and the Perl script ``paired_end_trim.pl`` is then used to create the files in which both ends passed the trimming. Another two files (``*.se``) are generated for the reads where only one end passed the trimming.

.. code-block:: bash

   btrim64 -a 25 -q -t input_R1.fq -o output_R1.25 -s output_R1.sum25
   btrim64 -a 25 -q -t input_R2.fq -o output_R2.25 -s output_R2.sum25
   paired_end_trim.pl output_R1.sum25 output_R2.sum25 output_R1.25 output_R2.25

Btrim works directly on ``.gz`` files and provides a total sequence count and a read-length distribution.

**Defaults:**

* maximum number of errors in the 5'-adapter [default=3]
* maximum number of errors in the 3'-adapter [default=4]
* minimal insert size [default=25]
* size of the moving window for quality trimming [default=5]
* cutoff for average quality scores within the moving window for quality trimming [default=15]

**Disadvantages:** Dealing with paired-end reads is a two-step process.

ConDeTri:
---------

ConDeTri performs single-end and paired-end trimming without user interaction, trimming and removing reads with low quality scores. See ``condetri_v2.2.pl -h``.

.. code-block:: bash

   condetri.pl -sc=33 -fastq1=input_R1.fq -fastq2=input_R2.fq -prefix=output

**Defaults:**

* HQ = high quality threshold [25]
* LQ = low quality threshold [10]
* Frac = fraction of the read that must exceed HQ [0.8]
* Minlen = minimum allowed read length [50]
* MH = when this number of sequential HQ bases is reached, the trimming stops [5]
* ML = maximum number of LQ bases allowed after a stretch of HQ bases from the 3'-end [1]

**Disadvantages:** Seems very slow.

Cutadapt:
---------

See ``cutadapt --help``.

.. code-block:: bash

   --quality-base N      Assume that quality values in FASTQ are encoded as
                         ascii(quality + N). This needs to be set to 64 for
                         some old Illumina FASTQ files. Default: 33
   -m LEN[:LEN2], --minimum-length LEN[:LEN2]
                         Discard reads shorter than LEN. Default: 0

   cutadapt -a ADAPT1 -A ADAPT2 [options] -o out1.fastq -p out2.fastq in1.fastq in2.fastq

Trim_galore
-----------

Trim_galore is a wrapper around Cutadapt with a few additions. See ``trim_galore -h``.

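A minimal paired-end sketch, assuming the same quality and length cutoffs as used in the benchmark below (file names are placeholders):

.. code-block:: bash

   # Quality-trim to Q25, discard reads shorter than 50 bp and keep unpaired survivors
   trim_galore --quality 25 --length 50 --retain_unpaired --paired \
       input_R1.fastq.gz input_R2.fastq.gz
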
Flexbar:
--------

Flexbar can handle colour-space reads (SOLiD), maintains read pairing and separates barcoded reads. It preprocesses high-throughput sequencing data efficiently, demultiplexes barcoded runs and removes adapter sequences. Several adapter-removal presets for Illumina libraries are included. Flexbar computes exact overlap alignments using SIMD and multicore parallelism. Trimming and filtering features are also provided, e.g. trimming of homopolymers at read ends. Flexbar increases read mapping rates and improves genome as well as transcriptome assemblies. Unique molecular identifiers can be extracted in a flexible way. The software supports data in FASTA and FASTQ format from multiple sequencing platforms. See ``flexbar -hh``.

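A minimal paired-end sketch with the same settings as used in the benchmark below (TruSeq adapter preset, quality trimming at Q25 and a 50 bp minimum length; file names are placeholders):

.. code-block:: bash

   # -qf i1.8 selects the Illumina 1.8+ quality format; -s writes unpaired survivors to separate files
   flexbar -r input_R1.fastq.gz -p input_R2.fastq.gz \
       --adapter-preset TruSeq -ap ON -qt 25 -qf i1.8 -m 50 -s
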
PRINSEQ:
--------

One of the big advantages of PRINSEQ is its large number of different graphical outputs. See ``prinseq-lite.pl -help``.

.. code-block:: bash

   perl prinseq-lite.pl -verbose -fastq test_1.fq -fastq2 test_2.fq -ns_max_n 0 -out_good test_no_ns -out_bad test_with_ns

Singletons are automatically filtered and stored in separate files (here ``test_no_ns_1_singletons.fastq`` and ``test_no_ns_2_singletons.fastq``) in addition to the files ``test_no_ns_1.fastq``, ``test_no_ns_2.fastq``, ``test_with_ns_1.fastq`` and ``test_with_ns_2.fastq``.

**Disadvantages:** There is no direct way of dealing with gzipped input, so you need to unzip on the fly:

.. code-block:: bash

   prinseq-lite.pl -fastq <(gzip -d -c input.fq.gz)

For paired-end data, you need to unzip and trim (and re-zip if you want to keep your reads compressed) and then match the pairs up again, e.g. with Pairfq (https://github.com/sestaton/Pairfq):

.. code-block:: bash

   gzip -dc input_R1.fastq.gz | prinseq-lite.pl -fastq stdin -out_bad null -out_good stdout \
       -trim_qual_right 20 -min_qual_mean 20 -min_len 35 -log prinseq.input_R1.log | gzip > input_R1.trim.gz
   gzip -dc input_R2.fastq.gz | prinseq-lite.pl -fastq stdin -out_bad null -out_good stdout \
       -trim_qual_right 20 -min_qual_mean 20 -min_len 35 -log prinseq.input_R2.log | gzip > input_R2.trim.gz
   pairfq makepairs -f input_R1.trim.gz -r input_R2.trim.gz -fp output_R1.trim.fastq-common.out.gz \
       -rp output_R2.trim.fastq-common.out.gz -fs output_R1.trim.fastq-unique.out.gz \
       -rs output_R2.trim.fastq-unique.out.gz --compress gzip

BBDuk:
------

One annoying thing about ``bbduk.sh`` is that file-name autocompletion is hampered by the way arguments are passed to the bash script (e.g., ``in1=input_R1.fq``). BBDuk uses a k-mer based approach for adapter trimming, alongside quality trimming and a large number of other data-filtering options. See ``bbduk.sh -h``.

Trimmomatic:
------------

A very highly cited program with a broad range of functionality that is reasonably straightforward to use. It has no command-line help, but a manual PDF can be found on the project website.

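A minimal paired-end sketch with the same settings as used in the benchmark below (TruSeq adapter clipping, a Q25 sliding window and a 50 bp minimum length; file names are placeholders):

.. code-block:: bash

   # Output order: paired R1, unpaired R1, paired R2, unpaired R2
   TrimmomaticPE -phred33 input_R1.fastq.gz input_R2.fastq.gz \
       output_R1.fastq output_R1_unpaired.fastq output_R2.fastq output_R2_unpaired.fastq \
       ILLUMINACLIP:TruSeq2-PE.fa:2:30:10 LEADING:3 TRAILING:3 SLIDINGWINDOW:5:25 MINLEN:50
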
Benchmark
#########

The current benchmarking pipeline is available at :vatk:`trimming/trimtest.snake`.

.. image:: dag.png

Commands used for benchmarking:

.. code-block:: bash

   # Btrim
   btrim64 -a 25 -l 50 -q -s tmp.r1.s -o output_R1 -t input_R1.fastq.gz
   btrim64 -a 25 -l 50 -q -s tmp.r2.s -o output_R2 -t input_R2.fastq.gz
   paired_end_trim.pl tmp.r1.s tmp.r2.s output_R1 output_R2
   cat output_R1.se output_R2.se > output_unpaired.fastq
   mv output_R1.pe output_R1.fastq
   mv output_R2.pe output_R2.fastq

.. code-block:: bash

   # ConDeTri
   condetri.pl -sc=33 -fastq1=input_R1.fastq.gz -fastq2=input_R2.fastq.gz -prefix=output

.. code-block:: bash

   # Cutadapt
   cutadapt -m 50 -q 25 -o output_R1.fastq.gz -p output_R2.fastq.gz --too-short-output=tooshort.1.fastq.gz --too-short-paired-output=tooshort.2.fastq.gz input_R1.fastq.gz input_R2.fastq.gz
   cutadapt -m 50 -q 25 -o rescued.a.fastq.gz tooshort.1.fastq.gz
   cutadapt -m 50 -q 25 -o rescued.b.fastq.gz tooshort.2.fastq.gz
   cat rescued.a.fastq.gz rescued.b.fastq.gz > rescued.fastq.gz
   gunzip -c rescued.fastq.gz > output_unpaired.fastq
   gunzip -c output_R1.fastq.gz > output_R1.fastq
   gunzip -c output_R2.fastq.gz > output_R2.fastq

.. code-block:: bash

   # Flexbar
   flexbar -r input_R1.fastq.gz -p input_R2.fastq.gz -qt 25 -m 50 -qf i1.8 --adapter-preset TruSeq -ap ON -s
   mv flexbarOut_1.fastq output_R1.fastq
   mv flexbarOut_2.fastq output_R2.fastq
   cat flexbarOut_1_single.fastq flexbarOut_2_single.fastq > output_unpaired.fastq

.. code-block:: bash

   # PRINSEQ
   gunzip -c input_R1.fastq.gz > tmp_R1.fq
   gunzip -c input_R2.fastq.gz > tmp_R2.fq
   prinseq-lite.pl -verbose -min_qual_mean 25 -min_len 50 -fastq tmp_R1.fq -fastq2 tmp_R2.fq -ns_max_n 0 -out_good tmp_no_ns -out_bad null
   cat tmp_no_ns_1_singletons.fastq tmp_no_ns_2_singletons.fastq > output_unpaired.fastq
   mv tmp_no_ns_1.fastq output_R1.fastq
   mv tmp_no_ns_2.fastq output_R2.fastq

.. code-block:: bash

   # Trimmomatic
   TrimmomaticPE -phred33 input_R1.fastq.gz input_R2.fastq.gz output_R1.fastq tmp1_unpaired.fq output_R2.fastq tmp2_unpaired.fq ILLUMINACLIP:TruSeq2-PE.fa:2:30:10 LEADING:3 TRAILING:3 SLIDINGWINDOW:5:25 MINLEN:50
   cat tmp1_unpaired.fq tmp2_unpaired.fq > output_unpaired.fastq

.. code-block:: bash

   # Trim Galore
   trim_galore --quality 25 --length 50 --retain_unpaired --stringency 1 --dont_gzip --paired input_R1.fastq.gz input_R2.fastq.gz
   cat input_R1_unpaired_1.fq input_R2_unpaired_2.fq > output_unpaired.fastq
   mv input_R1_val_1.fq output_R1.fastq
   mv input_R2_val_2.fq output_R2.fastq

.. code-block:: bash

   # BBDuk
   gunzip -c input_R1.fastq.gz > tmp_R1.fq
   gunzip -c input_R2.fastq.gz > tmp_R2.fq
   bbduk.sh in1=tmp_R1.fq in2=tmp_R2.fq out1=output_R1.fastq out2=output_R2.fastq qtrim=lr trimq=25 minlen=50 rieb=false outm1=tmp1_unpaired.fq outm2=tmp2_unpaired.fq overwrite=true
   cat tmp1_unpaired.fq tmp2_unpaired.fq > tmp_unpaired.fq
   bbduk.sh in=tmp_unpaired.fq out=output_unpaired.fastq qtrim=lr trimq=25 minlen=50 overwrite=true

Time comparison
---------------

.. image:: time_plot.png

Running times increase in the order btrim < cutadapt < trimgalore < trimmomatic.

Number of reads remaining
-------------------------

.. image:: count_plot.png

More paired reads are removed for btrim, condetri and trimmomatic, but a reasonable number are rescued as single reads.

Number of bases kept
--------------------

.. image:: base_plot.png

The effect of trimming in terms of bases lost is not dramatic for btrim, condetri, trimmomatic and bbduk.

Per read position quality
-------------------------

.. image:: qual_plot.png

The average quality is higher for trimmomatic and btrim, and the decrease in quality over the length of the read is not as steep. Condetri per-base quality is slightly lower, while prinseq, trimgalore and cutadapt have even lower average quality and flexbar has the lowest. Flexbar rescues very few single reads, hence its distribution. Some further singleton quality filtering is probably needed for cutadapt, trim_galore and flexbar.

Per read position mismatch
--------------------------

.. image:: mismatch_plot.png

Trimmomatic and btrim have very low mismatch frequencies, with condetri slightly higher. Trimgalore, cutadapt and prinseq have higher levels of mismatches along the length of the read, and flexbar has the highest mismatch frequency. In the context of low-frequency variant calling, Trimmomatic and btrim are therefore good options. BBDuk also performs well: in this benchmark it does not rescue any singletons, but it also does not discard as many reads, because it trims read length more aggressively, which results in a higher average quality towards the 3' end of the read. Condetri also performs reasonably well. If depth of coverage across the genome is limited, poor trimming will also affect the consensus calling, so it is advisable to use trimmomatic or btrim there as well. Overall, the tools that use a window or k-mer approach perform better on the metrics used here.

References
##########

.. bibliography:: trimrefs.bib
   :style: plain