Trimming
========

Sequencing data are usually provided as raw reads in FASTQ format, which typically require pre-processing before further analysis. Reads often carry additional oligonucleotide sequences: these may have been introduced by the experimental design (e.g., sample barcodes), be specific to a sequencing platform (e.g., adapter sequences) or represent a real biological signal (e.g., a spliced leader). The most common pre-processing step is read trimming, which:

1. Trims adapters and primers from reads
2. Filters on base qualities, in different ways depending on the program

It is important to trim reads and remove adapters to avoid problems in downstream analyses such as transcriptomics :cite:`Williams2016`, variant analysis or de novo assembly. Many tools have been developed for this purpose and an extensive list of them is presented below. Each has its own advantages, and some have already been benchmarked :cite:`DelFabbro2013`. Here, tools for trimming and adapter removal are benchmarked in the context of consensus calling and variant calling for viral genomes. Duplicate removal and contaminant sequence filtering are covered elsewhere.

Tools
#####

.. list-table:: Trimming programs
   :widths: 33 33 34
   :header-rows: 1

   * - Name
     - Website
     - Publication
   * - Btrim
     - http://graphics.med.yale.edu/trim/
     - :cite:`Kong2011`
   * - ConDeTri
     - https://github.com/linneas/condetri
     - :cite:`Smeds2011`
   * - Cutadapt
     - https://cutadapt.readthedocs.io/en/stable/
     - :cite:`EJ200`
   * - FASTX
     - http://hannonlab.cshl.edu/fastx_toolkit/
     - N/A
   * - Flexbar
     - https://github.com/seqan/flexbar/wiki/Manual
     - :cite:`Dodt2012`
   * - PRINSEQ
     - http://prinseq.sourceforge.net/index.html
     - :cite:`Schmieder2011`
   * - SolexaQA + DynamicTrim
     - http://solexaqa.sourceforge.net/
     - :cite:`Cox2010`
   * - Scythe
     - https://github.com/vsbuffalo/scythe
     - N/A
   * - Seqtk
     - https://github.com/lh3/seqtk
     - N/A
   * - Sickle
     - https://github.com/najoshi/sickle
     - N/A
   * - TagDust2
     - http://tagdust.sourceforge.net/
     - :cite:`Lassmann2015`
   * - Trimmomatic
     - http://www.usadellab.org/cms/index.php?page=trimmomatic
     - :cite:`Bolger2014`
   * - Atropos
     - https://github.com/jdidion/atropos
     - :cite:`Didion2017`
   * - fastp
     - https://github.com/OpenGene/fastp
     - :cite:`Chen2018`
   * - FastqPuri
     - https://github.com/jengelmann/FastqPuri
     - :cite:`Perez-Rubio2019`
   * - ea-utils
     - https://expressionanalysis.github.io/ea-utils/
     - :cite:`Aronesty2013`
   * - TrimGalore
     - https://www.bioinformatics.babraham.ac.uk/projects/trim_galore/
     - N/A
   * - BBDuk
     - https://jgi.doe.gov/data-and-tools/bbtools/bb-tools-user-guide/bbduk-guide/
     - N/A

How to run
##########

Btrim:
------

Btrim trims low-quality bases at either or both ends of a FASTQ read, identifies barcodes, and handles indels in adapters and barcodes. Adapters are provided in a 6-column tab-delimited text file, in which the user can specify which region of the read to search for adapters. See the ``btrim64`` usage information.

To trim the Illumina adapter AGATCGGAAGAGC with a moving window of size 10 and a cutoff of 25 for the average quality score within the window, performing only 3'-end trimming and keeping a minimal sequence length of 40:

.. code-block:: bash

   btrim64 -w 10 -a 25 -p illumina_adapter.txt -3 -P -o output.fastq -l 40 -t <(gunzip -c path_to_your_fastq/*.gz) -C -z

The disadvantage is that Btrim does not deal directly with paired-end reads: the individual files need to be trimmed first, and the Perl script ``paired_end_trim.pl`` is then used to create the files in which both ends passed the trimming. Another two files (``*.se``) are generated for the reads where only one end passed the trimming.

.. code-block:: bash

   btrim64 -a 25 -q -t input_R1.fq -o output_R1.25 -s output_R1.sum25
   btrim64 -a 25 -q -t input_R2.fq -o output_R2.25 -s output_R2.sum25
   paired_end_trim.pl output_R1.sum25 output_R2.sum25 output_R1.25 output_R2.25

Btrim works directly on ``.gz`` files and provides a total sequence count and a read-length distribution.

**Defaults:**

* maximum number of errors in the 5'-adapter [default=3]
* maximum number of errors in the 3'-adapter [default=4]
* minimal insert size [default=25]
* size of the moving window for quality trimming [default=5]
* cutoff for average quality scores within the moving window for quality trimming [default=15]

**Disadvantages:** Dealing with paired-end reads is a two-step process.

ConDeTri:
---------

ConDeTri performs single-end and paired-end trimming without user interaction, trimming and removing reads with low quality scores. See ``condetri_v2.2.pl -h``.

.. code-block:: bash

   condetri.pl -sc=33 -fastq1=input_R1.fq -fastq2=input_R2.fq -prefix=output

**Defaults:**

* HQ = high quality threshold [25]
* LQ = low quality threshold [10]
* Frac = fraction of the read that must exceed HQ [0.8]
* Minlen = minimum allowed read length [50]
* MH = when this number of sequential HQ bases is reached, the trimming stops [5]
* ML = maximum number of LQ bases allowed after a stretch of HQ bases from the 3'-end [1]

**Disadvantages:** Seems very slow.

Cutadapt:
---------

See ``cutadapt --help``.

.. code-block:: bash

   --quality-base N      Assume that quality values in FASTQ are encoded as
                         ascii(quality + N). This needs to be set to 64 for
                         some old Illumina FASTQ files. Default: 33
   -m LEN[:LEN2], --minimum-length LEN[:LEN2]
                         Discard reads shorter than LEN. Default: 0

   cutadapt -a ADAPT1 -A ADAPT2 [options] -o out1.fastq -p out2.fastq in1.fastq in2.fastq

Trim_galore
-----------

Trim_galore is a wrapper around Cutadapt with a few additions. See ``trim_galore -h``.

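A minimal paired-end sketch, assuming the same quality and length cutoffs as used in the benchmark below (file names are placeholders):

.. code-block:: bash

   # Quality-trim to Q25, discard reads shorter than 50 bp and keep unpaired survivors
   trim_galore --quality 25 --length 50 --retain_unpaired --paired \
       input_R1.fastq.gz input_R2.fastq.gz
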
Flexbar:
--------

Flexbar can handle colour-space reads (SOLiD), maintains read pairing and separates barcoded reads. It preprocesses high-throughput sequencing data efficiently, demultiplexes barcoded runs and removes adapter sequences. Several adapter-removal presets for Illumina libraries are included. Flexbar computes exact overlap alignments using SIMD and multicore parallelism. Trimming and filtering features are also provided, e.g. trimming of homopolymers at read ends. Flexbar increases read mapping rates and improves genome as well as transcriptome assemblies. Unique molecular identifiers can be extracted in a flexible way. The software supports data in FASTA and FASTQ format from multiple sequencing platforms. See ``flexbar -hh``.

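A minimal paired-end sketch with the same settings as used in the benchmark below (TruSeq adapter preset, quality trimming at Q25 and a 50 bp minimum length; file names are placeholders):

.. code-block:: bash

   # -qf i1.8 selects the Illumina 1.8+ quality format; -s writes unpaired survivors to separate files
   flexbar -r input_R1.fastq.gz -p input_R2.fastq.gz \
       --adapter-preset TruSeq -ap ON -qt 25 -qf i1.8 -m 50 -s
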
PRINSEQ:
--------

One of the big advantages of PRINSEQ is its large number of different graphical outputs. See ``prinseq-lite.pl -help``.

.. code-block:: bash

   perl prinseq-lite.pl -verbose -fastq test_1.fq -fastq2 test_2.fq -ns_max_n 0 -out_good test_no_ns -out_bad test_with_ns

Singletons are automatically filtered and stored in separate files (here ``test_no_ns_1_singletons.fastq`` and ``test_no_ns_2_singletons.fastq``) in addition to the files ``test_no_ns_1.fastq``, ``test_no_ns_2.fastq``, ``test_with_ns_1.fastq`` and ``test_with_ns_2.fastq``.

**Disadvantages:** There is no direct way of dealing with gzipped input, so you need to unzip on the fly:

.. code-block:: bash

   prinseq-lite.pl -fastq <(gzip -d -c input.fq.gz)

For paired-end data, you need to unzip and trim (and re-zip if you want to keep your reads compressed) and then match the pairs up again, e.g. with Pairfq (https://github.com/sestaton/Pairfq):

.. code-block:: bash

   gzip -dc input_R1.fastq.gz | prinseq-lite.pl -fastq stdin -out_bad null -out_good stdout \
       -trim_qual_right 20 -min_qual_mean 20 -min_len 35 -log prinseq.input_R1.log | gzip > input_R1.trim.gz
   gzip -dc input_R2.fastq.gz | prinseq-lite.pl -fastq stdin -out_bad null -out_good stdout \
       -trim_qual_right 20 -min_qual_mean 20 -min_len 35 -log prinseq.input_R2.log | gzip > input_R2.trim.gz
   pairfq makepairs -f input_R1.trim.gz -r input_R2.trim.gz -fp output_R1.trim.fastq-common.out.gz \
       -rp output_R2.trim.fastq-common.out.gz -fs output_R1.trim.fastq-unique.out.gz \
       -rs output_R2.trim.fastq-unique.out.gz --compress gzip

BBDuk:
------

One annoying thing about ``bbduk.sh`` is that file-name autocompletion is hampered by the way arguments are passed to the bash script (e.g., ``in1=input_R1.fq``). BBDuk uses a k-mer based approach for adapter trimming, alongside quality trimming and a large number of other data-filtering options. See ``bbduk.sh -h``.

Trimmomatic:
------------

A very highly cited program with a broad range of functionality that is reasonably straightforward to use. It has no command-line help, but a manual PDF can be found on the project website.

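A minimal paired-end sketch with the same settings as used in the benchmark below (TruSeq adapter clipping, a Q25 sliding window and a 50 bp minimum length; file names are placeholders):

.. code-block:: bash

   # Output order: paired R1, unpaired R1, paired R2, unpaired R2
   TrimmomaticPE -phred33 input_R1.fastq.gz input_R2.fastq.gz \
       output_R1.fastq output_R1_unpaired.fastq output_R2.fastq output_R2_unpaired.fastq \
       ILLUMINACLIP:TruSeq2-PE.fa:2:30:10 LEADING:3 TRAILING:3 SLIDINGWINDOW:5:25 MINLEN:50
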
Benchmark
#########

The current benchmarking pipeline is available at :vatk:`trimming/trimtest.snake`.

.. image:: dag.png

Commands used for benchmarking:

.. code-block:: bash

   # Btrim
   btrim64 -a 25 -l 50 -q -s tmp.r1.s -o output_R1 -t input_R1.fastq.gz
   btrim64 -a 25 -l 50 -q -s tmp.r2.s -o output_R2 -t input_R2.fastq.gz
   paired_end_trim.pl tmp.r1.s tmp.r2.s output_R1 output_R2
   cat output_R1.se output_R2.se > output_unpaired.fastq
   mv output_R1.pe output_R1.fastq
   mv output_R2.pe output_R2.fastq

.. code-block:: bash

   # ConDeTri
   condetri.pl -sc=33 -fastq1=input_R1.fastq.gz -fastq2=input_R2.fastq.gz -prefix=output

.. code-block:: bash

   # Cutadapt
   cutadapt -m 50 -q 25 -o output_R1.fastq.gz -p output_R2.fastq.gz --too-short-output=tooshort.1.fastq.gz --too-short-paired-output=tooshort.2.fastq.gz input_R1.fastq.gz input_R2.fastq.gz
   cutadapt -m 50 -q 25 -o rescued.a.fastq.gz tooshort.1.fastq.gz
   cutadapt -m 50 -q 25 -o rescued.b.fastq.gz tooshort.2.fastq.gz
   cat rescued.a.fastq.gz rescued.b.fastq.gz > rescued.fastq.gz
   gunzip -c rescued.fastq.gz > output_unpaired.fastq
   gunzip -c output_R1.fastq.gz > output_R1.fastq
   gunzip -c output_R2.fastq.gz > output_R2.fastq

.. code-block:: bash

   # Flexbar
   flexbar -r input_R1.fastq.gz -p input_R2.fastq.gz -qt 25 -m 50 -qf i1.8 --adapter-preset TruSeq -ap ON -s
   mv flexbarOut_1.fastq output_R1.fastq
   mv flexbarOut_2.fastq output_R2.fastq
   cat flexbarOut_1_single.fastq flexbarOut_2_single.fastq > output_unpaired.fastq

.. code-block:: bash

   # PRINSEQ
   gunzip -c input_R1.fastq.gz > tmp_R1.fq
   gunzip -c input_R2.fastq.gz > tmp_R2.fq
   prinseq-lite.pl -verbose -min_qual_mean 25 -min_len 50 -fastq tmp_R1.fq -fastq2 tmp_R2.fq -ns_max_n 0 -out_good tmp_no_ns -out_bad null
   cat tmp_no_ns_1_singletons.fastq tmp_no_ns_2_singletons.fastq > output_unpaired.fastq
   mv tmp_no_ns_1.fastq output_R1.fastq
   mv tmp_no_ns_2.fastq output_R2.fastq

.. code-block:: bash

   # Trimmomatic
   TrimmomaticPE -phred33 input_R1.fastq.gz input_R2.fastq.gz output_R1.fastq tmp1_unpaired.fq output_R2.fastq tmp2_unpaired.fq ILLUMINACLIP:TruSeq2-PE.fa:2:30:10 LEADING:3 TRAILING:3 SLIDINGWINDOW:5:25 MINLEN:50
   cat tmp1_unpaired.fq tmp2_unpaired.fq > output_unpaired.fastq

.. code-block:: bash

   # Trim Galore
   trim_galore --quality 25 --length 50 --retain_unpaired --stringency 1 --dont_gzip --paired input_R1.fastq.gz input_R2.fastq.gz
   cat input_R1_unpaired_1.fq input_R2_unpaired_2.fq > output_unpaired.fastq
   mv input_R1_val_1.fq output_R1.fastq
   mv input_R2_val_2.fq output_R2.fastq

.. code-block:: bash

   # BBDuk
   gunzip -c input_R1.fastq.gz > tmp_R1.fq
   gunzip -c input_R2.fastq.gz > tmp_R2.fq
   bbduk.sh in1=tmp_R1.fq in2=tmp_R2.fq out1=output_R1.fastq out2=output_R2.fastq qtrim=lr trimq=25 minlen=50 rieb=false outm1=tmp1_unpaired.fq outm2=tmp2_unpaired.fq overwrite=true
   cat tmp1_unpaired.fq tmp2_unpaired.fq > tmp_unpaired.fq
   bbduk.sh in=tmp_unpaired.fq out=output_unpaired.fastq qtrim=lr trimq=25 minlen=50 overwrite=true

Time comparison
---------------

.. image:: time_plot.png

Running times increase in the order btrim < cutadapt < trimgalore < trimmomatic.

Number of reads remaining
-------------------------

.. image:: count_plot.png

More paired reads are removed for btrim, condetri and trimmomatic, but a reasonable number are rescued as single reads.

Number of bases kept
--------------------

.. image:: base_plot.png

The effect of trimming in terms of bases lost is not dramatic for btrim, condetri, trimmomatic and bbduk.

Per read position quality
-------------------------

.. image:: qual_plot.png

The average quality is higher for trimmomatic and btrim, and the decrease in quality over the length of the read is not as steep. Condetri per-base quality is slightly lower, while prinseq, trimgalore and cutadapt have even lower average quality and flexbar has the lowest. Flexbar rescues very few single reads, hence its distribution. Some further singleton quality filtering is probably needed for cutadapt, trim_galore and flexbar.

Per read position mismatch
--------------------------

.. image:: mismatch_plot.png

Trimmomatic and btrim have very low mismatch frequencies, with condetri slightly higher. Trimgalore, cutadapt and prinseq have higher levels of mismatches along the length of the read, and flexbar has the highest mismatch frequency. In the context of low-frequency variant calling, Trimmomatic and btrim are therefore good options. BBDuk also performs well: in this benchmark it does not rescue any singletons, but it also does not discard as many reads, because it trims read length more aggressively, which results in a higher average quality towards the 3' end of the read. Condetri also performs reasonably well. If depth of coverage across the genome is limited, poor trimming will also affect the consensus calling, so it is advisable to use trimmomatic or btrim there as well. Overall, the tools that use a window or k-mer approach perform better on the metrics used here.

References
##########

.. bibliography:: trimrefs.bib
   :style: plain