Quality Control

In any high-throughput sequencing experiment there are pieces of information we are not interested in and data that retain no value. Be it low-quality bases, ambiguous bases, or contaminating sequences, there will always be some noise, no matter the application. This is especially true of samples obtained from an environment, where many factors can add to the noise once it comes time to sequence. The goal of metagenomics is to sequence everything in your sample, hence the prefix "meta." Unfortunately, neither sample collection methods nor library prep protocols are capable of completely excluding contaminating sequences. When metagenomic samples are obtained from a host, a fairly common application of metagenomics, host DNA/RNA will be present in your sequencing data, and generally in large quantities. It is therefore of the utmost importance to remove this confounding data and help purify an already complex dataset. This starts with removing sequences that you expect to be present but are not interested in.


  • Obtain potential contaminant reference sequences
  • Remove contaminant and low quality reads

Potential Contaminants

This section of the tutorial will cover how to use the KneadData software. One of the steps in this pipeline is to align your reads against contaminant sequences. Here are a few suggestions for choosing which sequences to use for contaminant removal:

  • If you obtained your sample from a host, use that host's genome, or grab one from a closely related organism.
  • If PhiX was added to your library, use the PhiX genome.
  • If you sequenced RNA, you can use an rRNA database or dedicated rRNA-removal software, such as SortMeRNA.

Once you have all of your contaminant sequences, combine them into a single file and index it with bowtie2.

To create this reference database, you can use the cat command with the > and >> redirects, which write to a new file and append to an existing file, respectively.

To combine all sequences into one file:

cat file1.fasta file2.fasta file3.fasta > references.fasta

To append files to an existing reference file:

cat file1.fasta file2.fasta >> references.fasta
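A quick way to confirm the concatenation worked is to count the FASTA records in the combined file (assuming your reference file is named references.fasta as above):

```shell
# Quick sanity check: each FASTA record starts with a '>' header line,
# so counting those lines gives the number of reference sequences.
if [ -f references.fasta ]; then
    grep -c '^>' references.fasta
fi
```

The count should match the total number of sequences across the files you concatenated.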

Once all of your sequences are in a single file, index it with bowtie2:

bowtie2-build references.fasta references
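A successful build leaves six index files named after the basename you supplied (very large genomes get a .bt2l suffix instead of .bt2). A quick check that indexing finished:

```shell
# bowtie2-build writes six index files for the given basename:
# references.1.bt2 ... references.4.bt2,
# references.rev.1.bt2, references.rev.2.bt2
ls references.*.bt2 2>/dev/null || echo "no index files found"
```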

For this tutorial, we will only use the human genome. You can download it with KneadData using the following command, though I have already provided this:

kneaddata_database --download human_genome bowtie2 ./

This will download the bowtie2 indexed human genome into the current directory.

NOTE: For targeted queries, it is not always necessary to remove contaminant sequences as they are generally not represented in the databases. However, there are some exceptions and it is good practice to always filter your data.

Quality Filtering

KneadData invokes Trimmomatic for its quality filtering/trimming, as well as Tandem Repeat Finder (TRF) and FastQC. In essence, Trimmomatic is capable of throwing away reads or parts of reads that have low quality scores, as well as trimming adaptor sequences. TRF finds tandem repeats and removes them, while FastQC generates a quality report for your data. These are common problems that arise in almost all sequencing runs and should be handled appropriately.
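If the defaults prove too strict or too lenient for your data, Trimmomatic's steps can be tuned from the KneadData command line. A minimal sketch, assuming KneadData's --trimmomatic-options flag (the step values below are illustrative, not recommendations):

```shell
# Sketch: pass custom Trimmomatic steps through KneadData.
# SLIDINGWINDOW:4:20 cuts a read once the mean quality in a 4-base
# window drops below 20; MINLEN:60 discards reads shorter than 60 bases.
TRIM_OPTS="SLIDINGWINDOW:4:20 MINLEN:60"
echo kneaddata --input demo.fastq \
    --reference-db Homo_sapiens \
    --trimmomatic-options "$TRIM_OPTS" \
    --output kneaddata_output
```

The leading echo only prints the assembled command so you can review it; remove it to actually run KneadData.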

Running KneadData

To run KneadData you need:

  • An indexed contaminant database
  • Reads in fastq format

The command to execute is as follows:

kneaddata --input demo.fastq \
    --bowtie2-options "--very-sensitive -p 4" \
    --trimmomatic "${EBROOTGENCORE_METAGENOMICS}/share/trimmomatic-0.36-3" \
    --reference-db Homo_sapiens \
    --output kneaddata_output

This will create the following files in the folder kneaddata_output:

  • demo_kneaddata_Homo_sapiens_bowtie2_contam.fastq: reads that aligned to the contaminant database and were removed.
  • demo_kneaddata.fastq: the final cleaned reads, i.e. those that did not match the reference database.
  • demo_kneaddata.trimmed.fastq: reads after quality trimming, before contaminant removal.
  • demo_kneaddata.log: a log of the run.
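A quick, tool-free way to gauge how much was filtered is to count the reads in each output file:

```shell
# A FASTQ record is four lines, so line count / 4 = read count.
# Comparing counts across the output files shows how much was removed.
for f in kneaddata_output/*.fastq; do
    [ -f "$f" ] || continue   # skip if the glob matched nothing
    echo "$f: $(( $(wc -l < "$f") / 4 )) reads"
done
```

If available in your installation, the kneaddata_read_count_table utility can build a similar summary from the log files.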

Your data is now much cleaner than it was at the beginning, and you can proceed to the next step. However, after inspecting the filtered data you may find it necessary to remove additional contaminants and/or adjust the quality-filtering parameters.
