Pre-processing and QC

Loading modules

module load gencore
module load gencore_qc

Quality trimming using trimmomatic

It is always a good idea to perform quality assessment of your raw fastq files before assembling your reads. In addition, there are other tools that perform read error correction. All these steps can result in a noticeable improvement, as well as an improvement in computational run time and resource requirements. The data that we are assembling today have already been quality trimmed and filtered. If however you are looking to perform that yourself, using trimmomatic for instance, then the command below is a typical example.

trimmomatic \
PE -threads 12 -trimlog trimmomatic.log \
Sample_READ1.fastq.gz Sample_READ2.fastq.gz \
Sample_trimmed_READ1_PE.fastq Sample_trimmed_READ1_SE.fastq\
Sample_trimmed_READ2_PE.fastq Sample_trimmed_READ2_SE.fastq \
ILLUMINACLIP:/path/to/adapter_sequences_file/trimmomatic_adapter.fa:2:30:10 TRAILING:3 LEADING:3 SLIDINGWINDOW:4:15 MINLEN:36

For more information on trimmomatic quality filtering, please refer to the trimmomatic homepage (http://www.usadellab.org/cms/?page=trimmomatic\)

FastQC

FastQC allows you to inspect various quality metrics that can inform your quality trimming decisions. We recommend running FastQC before and after you perform your quality trimming. This will allow you to assess and inspect the effects of quality trimming, as well as how much of your raw data (reads) where lost during the quality trimming stage.

Below is an example of FastQC on the read1 file.

fastqc sample_read1.fastq.gz --extract -o sample_R1_FASTQC -t 12

Working with multiple samples

In cases where you have QC'ed multiple samples, it can become very difficult to visualize all the QC reports separately. Imagine you have sequenced 6 samples (paired end reads), perform FastQC, followed by Trimmomatic, and then FastQC again.

This means,

Raw FastQC: 6 x 2 = 12
Trimmed FastQC: 6 x 2 = 12
If QC'ing mateless reads (i.e. where only R1's or R2's survived): 6 x 2 = 12
At least 24 QC reports, and possibly 36!

It then makes more sense to aggregate the reports and visualize them together, how?

Using MultiQC (http://multiqc.info/\)

All you have to do it point multiqc to the top level directory, and it will automatically scan the directory structure (or tree) and consolidate all the reports into one interactive HTML report.

Suppose you have a directory in your scratch folder /scratch/$USER/my_project/, which is where you have performed your analysis. Then all you have to do,

cd /scratch/$USER/my_project
multiqc -n My_FastQC_reports -l My_FastQC_reports.html /scratch/$USER/my_project

MultiQC can detect and parse a lot of different kinds of reports (e.g. PICARD output, SnpEff, Samtools, FastQC among others), for more information, please refer to the software homepage above.

Pre-Processing and QC