Assembly

Below are the steps, and commands that we will use during the assembly stage. For Abyss, we need to specify the kmer size, and although there are automated ways of achieving this, for the purpose of this tutorial, we suggest you pick 2 kmer sizes

AbySS assemblies (e.g. k45 & k63)

K=45

  1. abyss-pe -j 14 \
    k=45 name=Abyss.assembly-k45 \
    lib='pe1' \
    pe1='data/Sample_test_trimmed_R1_PE.fastq.gz data/Sample_test_trimmed_R2_PE.fastq.gz' \
    se='data/Sample_test_trimmed_R1_SE.fastq.gz data/Sample_test_trimmed_R2_SE.fastq.gz' \
    --directory=Abyss/Abyss_k45
    

    And the complete SLURM (abyssk45_SLURM.sh) script looks something like this,

#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=14
#SBATCH --time=06:00:00
#SBATCH --mem 45000

module purge all
module load gencore/1
module load gencore_de_novo_genomic/1.0

abyss-pe -j 14 \
k=45 name=Abyss.assembly-k45 \
lib='pe1' \
pe1='data/Sample_test_trimmed_R1_PE.fastq.gz data/Sample_test_trimmed_R2_PE.fastq.gz' \
se='data/Sample_test_trimmed_R1_SE.fastq.gz data/Sample_test_trimmed_R2_SE.fastq.gz' \
--directory=Abyss/Abyss_k45

K=63

abyss-pe -j 14 \
k=63 name=Abyss.assembly-k63 \
lib='pe1' \
pe1='data/Sample_test_trimmed_R1_PE.fastq.gz data/Sample_test_trimmed_R2_PE.fastq.gz' \
se='data/Sample_test_trimmed_R1_SE.fastq.gz data/Sample_test_trimmed_R2_SE.fastq.gz' \
--directory=Abyss/Abyss_k63

And the complete SLURM (abyssk63_SLURM.sh) script looks something like this,

#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=14
#SBATCH --time=06:00:00
#SBATCH --mem 45000

module purge all
module load gencore/1
module load gencore_de_novo_genomic/1.0

abyss-pe -j 14 \
k=63 name=Abyss.assembly-k63 \
lib='pe1' \
pe1='data/Sample_test_trimmed_R1_PE.fastq.gz data/Sample_test_trimmed_R2_PE.fastq.gz' \
se='data/Sample_test_trimmed_R1_SE.fastq.gz data/Sample_test_trimmed_R2_SE.fastq.gz' \
--directory=Abyss/Abyss_k63

Parameters

-j Number of threads (CPUs) to use.

k= Size of kmer to assemble.

lib= Name(s) of sequenced libraries. In this case we only have 1 paired end library, in case we had additional libraries (say 1 more mate pair library) we will name it here e.g. lib='pe1 mp1'

pe1= Full path to the reads specified in the libraries above (in the lib= part), read1 first followed by read2's (space separated).

se= Path to single end (unmated) reads separated by spaces in case of multiple file.

--directory Path to output directory.

SPADES

Spades uses an 'auto' kmer selection mode (although kmer sizes can be specified as well). Below is the command for assembling using Spades.

Spades kmer=auto

spades.py -o SPADES --careful \
-1 data/Sample_test_trimmed_R1_PE.fastq.gz -2 data/Sample_test_trimmed_R2_PE.fastq.gz \
-s data/Sample_test_trimmed_R1_SE.fastq.gz \
-s data/Sample_test_trimmed_R2_SE.fastq.gz \
-t 14 -m 50

And the complete SLURM script for SPADES

#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=14
#SBATCH --time=06:00:00
#SBATCH --mem 50000

module load gencore/1
module load gencore_de_novo_genomic/1.0

spades.py -o SPADES --careful \
-1 data/Sample_test_trimmed_R1_PE.fastq.gz -2 data/Sample_test_trimmed_R2_PE.fastq.gz \
-s data/Sample_test_trimmed_R1_SE.fastq.gz \
-s data/Sample_test_trimmed_R2_SE.fastq.gz \
-t 14 -m 50

Parameters

-o Path to output directory

--careful Tries to reduce the number of mis-matches and short indels (Insertions/deletions).

-1 Path to forward reads (Read1's) file.

-2 Path to reverse reads (Read2's) file.

-s Path to single end files (unmated).

-t Number of threads to use (CPUs).

-m Upper memory limit (in Gb).

Note that setting the memory in the SLURM part of the script should match what we tell SPADES using the -m parameter. The same applies for the number of CPUs -t argument.

Assembly improvement - GapCloser

Next, we will use the reads and attempt to close "fill in" some of the resulting gaps. This step takes into account the length of the reads and their insert sizes (in case of paired reads). Like always, there are multiple tools that have been developed to accomplish this task, and although some will perform better (in certain circumstances) than others, here we will be focusing on SOAPdenovo's GapCloser.

First, we need to prepare a configuration file with some basic information about our assemblies and sequenced library.

Let's call it gapclose.config and this is what it looks like.

[LIB]
avg_ins=400
reverse_seq=0
q1=data/Sample_test_trimmed_R1_PE.fastq.gz
q2=data/Sample_test_trimmed_R2_PE.fastq.gz

In this config file, we specify the average insert size of the paired end library (don't worry, the algorithm has leniency in the estimation), whether the sequences should be reversed (0 for our paired end reads i.e. no), and finally the path to our read 1's and read 2's.

We can then run the command like so,

For the SPADES assembly,

GapCloser -a SPADES/scaffolds.fasta -b gapclose.config -o SPADES/gapclosed_spades.fasta -l 155 -t 14

For the AbySS k45 assembly,

GapCloser -a Abyss/Abyss_k45/Abyss.assembly-k45-scaffolds.fa -b gapclose.config -o Abyss/Abyss_k45/gapclosed_abyssk45.fasta -l 155 -t 14

For the AbySS k63 assembly,

GapCloser -a Abyss/Abyss_k63/Abyss.assembly-k63-scaffolds.fa -b gapclose.config -o Abyss/Abyss_k63/gapclosed_abyssk63.fasta -l 155 -t 14

Parameters

-a Assembly filename (input).

-b Configuration filename.

-o Path and name of gap_closed output file.

-l Maximum read length.

-t Number of threads (CPUs) to use.

At the end of these commands, you will get a brief report mentioning how many gaps were closed.

Again, these commands will need to be embedded within SLURM scripts similar to the assembly examples above.

Assembly Evaluation

Now that we have our assemblies, we need to evaluate and assess the 3 different assemblies that we produced. For this we will use Quast. Quast can take as input multiple fasta files of scaffolds (or contigs), in addition if you have a reference (which we don't) and a GFF annotation file of that reference (this could be a closely related species), then you can provide that as well.

Quast will look for length distributions of scaffolds/contigs, gaps, alignments and mis-assemblies.

The command is as follows,

quast -o QUAST -t 14 \
-s SPADES/scaffolds.fasta Abyss/Abyss_k45/Abyss.assembly-k45-scaffolds.fa Abyss/Abyss_k63/Abyss.assembly-k63-scaffolds.fa \
-l SPADES,AbySSk45,AbySSk63 \
-f -1 data/Sample_test_trimmed_R1_PE.fastq.gz -2 q2=data/Sample_test_trimmed_R2_PE.fastq.gz

Parameters

-o Output folder path

-t Number of threads to use.

-s Assemblies that you want to assess, here we can supply multiple assemblies (in our case 3) separated by spaces.

-l Labels to use foreach assembly (the order must match the order which was used in the -s argument above).

-f Enable gene prediction mode.

-1 Path to forward reads (Read1's).

-2 Path to reverse reads (Read2's).

Once Quast completes, all the output will be available in the "QUAST" older. The output contains plain text, PDF, and HTML reports. Now let's see which assembly we pick!

results matching ""

    No results matching ""