Quality Control
In any high-throughput sequencing experiment there will be information you are not interested in and data that retains no value. Whether it is low-quality reads, ambiguous bases, or contaminating sequences, there will always be some noise, no matter the application. Environmental samples in particular introduce many factors that add noise by the time you sequence. The goal of metagenomics is to sequence everything in your sample, hence the prefix "meta." Unfortunately, neither sample collection methods nor library prep protocols can completely exclude contaminating sequences. When metagenomic samples are obtained from a host, a fairly common application of metagenomics, host DNA/RNA will be present in your sequencing data, and generally in large quantities. It is therefore of utmost importance to remove this confounding data to help purify an already complex dataset. This starts with removing sequences that you expect to be present but are not interested in.
Overview
- Obtain potential contaminant reference sequences
- Remove contaminant and low quality reads
Potential Contaminants
This section of the tutorial will cover how to use the KneadData software. One of the steps in this software package is to align your data against contaminant sequences. Here are a few suggestions for deciding which sequences to use to eliminate your contaminants:
- If you obtained your sample from a host, use that host's genome, or grab one from a closely related organism.
- If PhiX was added to your library, use the PhiX genome.
- If you sequenced RNA, you can use an rRNA database or dedicated rRNA-removal software, such as SortMeRNA.
Once you have all your contaminant sequences, put them in a single file and index it with bowtie2.
To create this reference database, you can use the cat command together with the > and >> shell redirects, which write to a new reference file and append to an existing one, respectively.
To combine all of your sequences into one file:
cat file1.fasta file2.fasta file3.fasta > references.fasta
To append files to an existing genome file:
cat file1.fasta file2.fasta >> references.fasta
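As a small self-contained check of this workflow, the commands below build toy FASTA files on the fly (the file names and sequences are illustrative, not real genomes) and verify the combined reference by counting header lines:

```shell
# Create two toy FASTA files (illustrative sequences, not real genomes)
printf '>seq1\nACGT\n' > file1.fasta
printf '>seq2\nGGCC\n>seq3\nTTAA\n' > file2.fasta

# Write the combined reference with >
cat file1.fasta file2.fasta > references.fasta

# Later, append another file with >>
printf '>seq4\nCCGG\n' > file3.fasta
cat file3.fasta >> references.fasta

# Every FASTA record starts with '>', so counting headers counts sequences
grep -c '>' references.fasta   # prints 4
```

Counting headers this way is a quick sanity check that no file was missed before you spend time indexing.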
Once all of your sequences are in one file, index it with bowtie2:
bowtie2-build references.fasta references
For this tutorial, we will only use the human genome. You can download it with KneadData using the following command, though I have already provided it:
kneaddata_database --download human_genome bowtie2 ./
This will download the bowtie2 indexed human genome into the current directory.
NOTE: For targeted queries, it is not always necessary to remove contaminant sequences as they are generally not represented in the databases. However, there are some exceptions and it is good practice to always filter your data.
Quality Filtering
KneadData invokes Trimmomatic for its quality filtering/trimming, as well as Tandem Repeat Finder (TRF) and FastQC. In essence, Trimmomatic is capable of throwing away reads or parts of reads that have low quality scores, as well as trimming adaptor sequences. TRF finds tandem repeats and removes them, while FastQC generates a quality report for your data. These are common problems that arise in almost all sequencing runs and should be handled appropriately.
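To see what "low quality scores" means concretely: FASTQ quality characters are Phred scores encoded as printable ASCII, and modern Illumina data uses the standard Phred+33 offset, so Q = (ASCII code of the character) - 33. A quick way to decode a single character with only coreutils:

```shell
# 'I' is ASCII 73, so under Phred+33 it encodes Q40 (a high-confidence call);
# low-quality tails are often runs of characters like '#' (ASCII 35, Q2).
printf 'I' | od -An -tu1 | awk '{print $1 - 33}'   # prints 40
```

Trimmomatic's trimming decisions (e.g. sliding-window quality cutoffs) operate on these decoded Phred values.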
Running KneadData
To run KneadData you need:
- An indexed contaminant database
- Reads in fastq format
The command to execute is as follows:
kneaddata --input demo.fastq \
--bowtie2-options "--very-sensitive -p 4" \
--trimmomatic "${EBROOTGENCORE_METAGENOMICS}/share/trimmomatic-0.36-3" \
--reference-db Homo_sapiens \
--output kneaddata_output
This will create the following files in the folder kneaddata_output:
- demo_kneaddata_Homo_sapiens_bowtie2_contam.fastq: reads that were identified as contaminants from the database.
- demo_kneaddata.fastq: reads that did not match the reference database (your cleaned data).
- demo_kneaddata.trimmed.fastq: reads after quality trimming.
- demo_kneaddata.log: a log of the run.
Now your data is much cleaner than it was at the beginning, and you can proceed to the next step. However, if inspection of your filtered data reveals remaining contamination, you may need to remove additional contaminant sequences and/or adjust the quality-filtering parameters.
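One quick post-filtering inspection is simply counting how many reads survived. A FASTQ record is exactly four lines, so reads = total lines / 4. The snippet below builds a toy FASTQ to demonstrate; in practice you would point it at demo_kneaddata.fastq and compare the count against your input:

```shell
# Build a toy FASTQ with two reads (illustrative stand-in for demo_kneaddata.fastq)
printf '@r1\nACGT\n+\nIIII\n@r2\nGGCC\n+\nIIII\n' > toy.fastq

# Four lines per record, so divide the line count by 4
echo $(( $(wc -l < toy.fastq) / 4 ))   # prints 2
```

If the surviving-read count drops far more than expected, that is a cue to revisit your contaminant database or relax the trimming parameters.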