De novo genome assembly using BioX-workflow and HPCrunner
Assembling a de novo genome involves multiple steps, software, and parameters. There is no single approach, however, producing a "first draft" assembly should, at the very least, address,
- Pre processing and QC.
- Assembling using multiple tools/parameters (at least 2 or 3).
- Assembly metrics comparison.
- Assembly validation.
- Annotation of final "first draft".
As you can see, this will involve multiple submission scripts, SLURM submissions, as well as a significant amount of editing scripts, which can be prone to human error. Instead, we will be using the NYUAD Core Bioinformatics workflow system to automate this process.
In order to complete this exercise in time, the SPADES assembly has already been precomputed for you (it can take 3-3.5 using this dataset). However, we will still use this precomputed assembly in the assessment and validation stage.
Guessing which kmer size works best for a given dataset is exactly that, GUESSING! As, such you will randomly choose 2 kmer sizes (and ideally, you should choose multiple kmers usually at 6 step intervals, i.e. 21, 27, 33, 39 etc.). You will then be asked to work in groups of 3 and use what you have learned to select the best performing assembly.
Finally, we will annotate our assembled genome online using PATRIC (https://www.patricbrc.org/\).
Below is an overview of the entire workflow,
The workflow
In case you haven't done so, start by loading the required modules,
module load gencore
module load gencore_dev
module load gencore_de_novo_genomic
gencore_dev is our development module, and where hpcrunner.pl is located. In the near future, this script will be available in all the "production level" modules and you will no longer have to load gencore_dev
The yaml configuration file for this workflow can be found in the "conf" directory.
cd /scratch/$USER/de_novo_genome_assembly/conf
Our dataset is a standard PE library of a bacterial genome. Let's have a look at the the help menu of SPADES and ABySS to see if any other parameters should be used. You can either run the defaults, or feel free to experiment using other parameters that "make sense".
Next we will need to edit the configuration file and adjust the kmer sizes.
vi de_novo_sequencing_training.yml
The kmer sizes can be found at the beginning of the file (global parameters). Replace ADD_YOUR_KMER_SIZE_HERE with your kmer size e.g. 21. Do this for both the kmers that you selected.
1 ---
2 global:
3 - indir: "data/raw"
4 - outdir: "data/analysis"
5 - root: "data/raw"
6 - rootdir: "data/raw"
7 - root_analysis_dir: "data/analysis"
8 - analysis: "data/analysis/{$sample}"
9 - analysis_dir: "data/analysis/{$sample}"
10 - spades_dir: "data/analysis/{$sample}/spades"
11 - abyss: "data/analysis/{$sample}/abyss_pe"
12 - abyss_dir: "data/analysis/{$sample}/abyss_pe"
13 - gapcloser_spades_dir: "data/analysis/{$sample}/gapcloser_spades"
14 - gapcloser_abyss_dir: "data/analysis/{$sample}/gapcloser_spades"
15 - READ1: "{$self->root_dir}/{$sample}_read1.fastq.gz"
16 - READ2: "{$self->root_dir}/{$sample}_read2.fastq.gz"
17 - TR1: "{$self->trimmomatic_dir}/{$sample}_read1_trimmomatic"
18 - TR2: "{$self->trimmomatic_dir}/{$sample}_read2_trimmomatic"
19 - kmer1: ADD_YOUR_KMER_SIZE_HERE
20 - kmer2: ADD_YOUR_KMER_SIZE_HERE
21 - kmer1_a: "{$self->abyss_dir}/Abyss_k{$self->kmer1}/Abyss.assembly-k{$self->kmer1}-scaffolds.fa"
22 - kmer2_a: "{$self->abyss_dir}/Abyss_k{$self->kmer2}/Abyss.assembly-k{$self->kmer2}-scaffolds.fa"
23 - file_rule: (Sample.*)$
24 - by_sample_outdir: 1
25 - find_by_dir: 1
26 - wait: 0
Hint: In vi, i (insert mode). ESC, :, wq <enter> to save and quit
Next, we will review the rest of the configuration workflow and understand each section.
For a more detailed explanation of the workflows, refer to the HPC section of this gitbook.
For a breakdown of the parameters used in the execution of the software, please refer to the "Individual commands of the assembly workflow" section of the gitbook.
Submitting the analysis workflow
Now we are ready to submit our workflow to the scheduler (SLURM). First we will create the execution script using the configuration file (yaml) with the biox-workflow.pl script. Then, we will submit the workflow to the scheduler using the hpcrunner.pl script.
cd /scratch/$USER/de_novo_genome_assembly
biox-workflow.pl --workflow conf/de_novo_sequencing_training.yml > de_novo.sh
hpcrunner.pl submit_jobs --infile de_novo.sh --project YOUR_NET_ID
Let's check the status of the SLURM jobs using the SLURM commands (refer to the SLURM cheat-sheet https://wikis.nyu.edu/pages/viewpage.action?pageId=82710132&preview=/82710132/83427688/slurm_cheat_sheet.pdf)
Examining the output
Let's start navigating through the analysis directories (before and after gap-closing) and get some basic assembly statistics. We will be using abyss-fac to produce the assembly metrics.
cd /scratch/$USER/de_novo_genome_assembly/data/analysis/Sample_test/
ls -l
cd abyss_pe/Abyss_k(Kmer_size)/
abyss-fac -s 500 Abyss.assembly-k(Kmer_size)-scaffolds.fa
Time to work in groups of 3 and compare the metrics!
Now, that we have assessed the assembly metrics, we will examine the output from Quast and look for the number of predicted genes in addition to the scaffold composition.
Copy the quast HTML report to your local machine (rsync, FileZilla etc.) and open it in your web browser.
Let's look at the assemblies and the number of predicted genes for each one of our assemblies.
Annotating the assembly
Now that we assembled, improved, and selected our "best" performing assembly, we can proceed to annotating and visualizing our organism. Since this is a small prokaryote genome, we can use an online annotation portal called PATRIC (https://www.patricbrc.org/)
Once you navigate to the PATRIC homepage, you will first need to create a free account and register. We can then proceed to uploading our data to the PATRIC portal and annotate our genome.
For your reference, we have added 2 screen capture videos of creating an account, uploading, analyzing, and navigating the analysis tools using PATRIC below.
PATRIC part 1 - https://drive.google.com/a/nyu.edu/file/d/0B3oQ_4LrSMdHdEFQelNkNUVQTjA/view?usp=sharing
PATRIC part 2 - https://drive.google.com/open?id=0B3oQ_4LrSMdHZzFOTHBYSzBzcDQ
Remember, you will only need to upload the fasta file of the assembly, there is no need to upload your reads or any other files generated during this tutorial.
Have fun exploring your genome!