Read alignment
Once data are in a FASTQ format the first step of any NGS analysis is to align the short reads against the reference genome. This module describes how to map short DNA sequence reads, assess the quality of the alignment and prepare to visualize the mapping of the reads.
Required Modules:
Overview:
- Align reads to reference
- Sort sam file (output from alignment) and convert to bam
- Alignment Metrics
- Mark duplicates
- Prepare reference dictionary, fasta index, and bam index
1) The Burroughs Wheeler Transform
2) Performing a read alignment using Illumina data
We will use the BWA MEM algorithm to align input reads to your reference genome. We use BWA MEM because it is recommended in the Broads best practices and because it has been found to produce better results for variant calling. Note that BWA MEM is recommended for longer reads, ie. 75bp and up.
Alternative aligners such as Bowtie2 may be used.
Note: Most aligners require an indexed reference sequence as input. Reference index files for the sample data have been prebuilt and are available in:
/scratch/work/cgsb/gencore/data/variant_calling/ref/prebuilt/
If required, index files can be built from a reference sequence (in FASTA format) using the following command:
bwa index <reference.fasta>
Using the reference sequence in the sample dataset, we can build the index files using the following command:
bwa index ./ref/GCF_000001405.33_GRCh38.p7_chr20_genomic.fna
If executed correctly, you should see the following output:
[bwa_index] Pack FASTA... 0.95 sec
[bwa_index] Construct BWT for the packed sequence...
[BWTIncCreate] textLength=128888334, availableWord=21068624
[BWTIncConstructFromPacked] 10 iterations done. 34753182 characters processed.
[BWTIncConstructFromPacked] 20 iterations done. 64202446 characters processed.
[BWTIncConstructFromPacked] 30 iterations done. 90372990 characters processed.
[BWTIncConstructFromPacked] 40 iterations done. 113629422 characters processed.
[bwt_gen] Finished constructing BWT in 48 iterations.
[bwa_index] 33.57 seconds elapse.
[bwa_index] Update BWT... 0.68 sec
[bwa_index] Pack forward-only FASTA... 0.60 sec
[bwa_index] Construct SA from BWT and Occ... 11.97 sec
[main] Version: 0.7.8-r455
[main] CMD: bwa index GCF_000001405.33_GRCh38.p7_chr20_genomic.fna
[main] Real time: 48.246 sec; CPU: 47.800 sec
Let's take a look at the output using ls -l
GCF_000001405.33_GRCh38.p7_chr20_genomic.fna
GCF_000001405.33_GRCh38.p7_chr20_genomic.fna.amb
GCF_000001405.33_GRCh38.p7_chr20_genomic.fna.ann
GCF_000001405.33_GRCh38.p7_chr20_genomic.fna.bwt
GCF_000001405.33_GRCh38.p7_chr20_genomic.fna.pac
GCF_000001405.33_GRCh38.p7_chr20_genomic.fna.sa
We can see 5 new files, all having the same basename as the original reference sequence file. These are the index files required by BWA.
Note: If the reference is greater than 2GB, you need to specify a different algorithm when building the BWA index, as follows:
bwa index -a bwtsw <reference.fasta>
Once we have the reference index, we can proceed to the alignment step. We run BWA as follows:
bwa mem -M -R <readgroup_info> <ref> <reads_1.fastq> <reads_2.fastq> > <output.sam>
Command explained:
bwa mem
Invoke the bwa mem algorithm
-M
This flag tells bwa to consider split reads as secondary, required for GATK variant calling
-R <readgroup_info>
Provide the readgroup as a string. The read group information is key for downstream GATK functionality. The GATK will not work without a read group tag.
<ref>
The name of your reference sequence. Note that all index files must be present in the same directory and have the same basename as the reference sequence
<reads_1.fastq>, <reads_2.fastq>
Your input reads. In this case, mates of a paired end library
<output.sam>
The name of your outut file
Put it altogether:
bwa mem -M -R '@RG\tID:sample_1\tLB:sample_1\tPL:ILLUMINA\tPM:HISEQ\tSM:sample_1' ./ref/GCF_000001405.33_GRCh38.p7_chr20_genomic.fna reads_1.fastq reads_2.fastq > aligned_reads.sam
If everything worked, you should have a new aligned_reads.sam
file.
2) Sort sam and convert to bam
The algorithms used in downsteam steps require the data to be sorted by coordinate and in bam format in order to be processed. We use Picard Tools and issue a single command to both sort the sam file produced in step 1 and output the resulting sorted data in bam format:
java -jar picard.jar SortSam INPUT=aligned_reads.sam OUTPUT=sorted_reads.bam SORT_ORDER=coordinate
If this executed correctly, you should see something like the folloing:
[Wed Dec 07 11:38:40 EST 2016] picard.sam.SortSam INPUT=aligned_reads.sam OUTPUT=sorted_reads.bam SORT_ORDER=coordinate VERBOSITY=INFO QUIET=false VALIDATION_STRINGENCY=STRICT COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_INDEX=false CREATE_MD5_FILE=false
[Wed Dec 07 11:38:40 EST 2016] Executing as [email protected] on Linux 2.6.32-431.29.2.el6.x86_64 amd64; Java HotSpot(TM) 64-Bit Server VM 1.8.0-b132; Picard version: 1.129(b508b2885562a4e932d3a3a60b8ea283b7ec78e2_1424706677) IntelDeflater
INFO 2016-12-07 11:39:05 SortSam Finished reading inputs, merging and writing to output now.
[Wed Dec 07 11:39:23 EST 2016] picard.sam.SortSam done. Elapsed time: 0.72 minutes.
Runtime.totalMemory()=1988100096
Let's take a look at the files before and after this step to see what happened. We will use samtools
to view the sam/bam files.
Let's take a look at the first few lines of the original file. We'll use the samtools view
command to view the sam file, and pipe the output to head -5
to show us only the 'head' of the file (in this case, the first 5 lines).
samtools view aligned_reads.sam | head -5
Output:
HS2000-940_146:5:1101:1161:63226 73 NC_000020.11 23775298 60 78M22S = 23775298 0 CTGNTAGCCCTGCTGAATCTCCCTCCTGACCCAACTCCCTCNTNNNNNNNGCTGGGTGACTGCTGNCNNCACNGGCTGTGNNNNNNNNNNNNNCAGCTGG ?@@#4ADDDFDFFHIGGFCFHCHFGIHGCGHEHHEHD3?BH#0#######--5CEECG=?AEEHE################################### NM:i:13 MD:Z:3G37C1C0T0A0C0T0C0T15T1C0T3T5 AS:i:52 XS:i:0 RG:Z:sample_1
HS2000-940_146:5:1101:1161:63226 133 NC_000020.11 23775298 0 * = 23775298 0 NNCTCNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNGNNNNCNAAAGGAGCCTGGGT #################################################################################################### AS:i:0 XS:i:0 RG:Z:sample_1
HS2000-940_146:5:1101:1262:12434 99 NC_000020.11 23843774 60 100M = 23843977 258 ATCAATGGTGTTTCTTTGCCAAGCTTCCTTAGTCGCCTTTAATCGGGAAAAGGTCTTCATTCTTTCTTGTCTTTGTTACCCTGTCATTTTTGAAGATAAC ?@@BDDFFFFHHHGHIIIHJEGHIIIGIJICHIGIIGIJIDHHGJJJ:;8EFH=CFHGGHIIIJJHHGBEHFFFFEEDCCCCEDCCADDEDD(5>5>@5@ NM:i:0 MD:Z:100 AS:i:100 XS:i:0 RG:Z:sample_1
HS2000-940_146:5:1101:1262:12434 147 NC_000020.11 23843977 60 55M45S = 23843774 -258 GTACATCATTCTGGGAGGCCAGGATACCATTGTCCAATTGCNNNNGGATGTTAATNNNNNNNNNNNNNNNNNNNNANNNNCTTCCNNNNNNNNCCACTCT #CDDCCC@38C<DDDDBBCDC;5ADDDCCACAADCC?=5,,####>CGHDC;;--####################1####FC<24########FDDD=B= NM:i:4 MD:Z:41T0G0A0T10 AS:i:47 XS:i:28 RG:Z:sample_1
HS2000-940_146:5:1101:1295:90112 83 NC_000020.11 23920564 60 100M = 23920284 -380 GCAAGTGAATGCTCTTTCCCACAGCAAAGGATTAACTGATTTCTGCTACTTGTGGCTCAGAGGCCAGGGACACTTGACCTGTCCTAGGAAGGCTGTCACC #@DDDD@><DDDEEAC?FFFHHEC=HECEAIGGE@IIGGHB;HAFD??F?IIJIEGGCIIJJJIHCIIIIIHIIHECIJIIHIJJJIFGHHFFDFFF@C@ NM:i:0 MD:Z:100 AS:i:100 XS:i:0 RG:Z:sample_1
Let's compare this initial alignment file to the new sorted file:
samtools view sorted_reads.bam | head -5
Output:
HS2000-940_146:5:2109:14063:29918 161 NC_000020.11 64145 4 54S46M = 23724989 23660944 TTCCAATCCATTCCATTCCATCACACTGCATTCCATTCCATTCCAATCCCCTCAACTCCACTCCACTCCACTCCATTCCACTCCAATCAATTCCATTGCA @CCFFFFFHGDHHJIJJJJJIJIFHHGCGIHHIJJJGHIHIIIJJIHIGGGHIJJE:FFHIGIDHJGIGGIJJ@;CDHGGEIHHHEHF;CCB>;;3;>;> MD:Z:6G27C8T2 RG:Z:sample_1 NM:i:3 AS:i:33 XS:i:31
HS2000-940_146:5:2110:1521:37886 163 NC_000020.11 1217420 0 55S21M24S = 1217591 271 GACAGTTCTGAAGAGAGCAGGGGTTCTTCCAGCATTGCATTTGAGCTCCGAAAATGGACAGACTGCCTCCTCAAGTCGGTCCTTGACCTCCGTGCACCCT ?7:DDD:B,+ADD43C?BF++<2<):**11*1:;C*0?0B?F>GCDBF30'-'-8@8..@1@E@;37@)?76@########################### MD:Z:21 RG:Z:sample_1 NM:i:0 AS:i:21 XS:i:34
HS2000-940_146:5:2110:1521:37886 83 NC_000020.11 1217591 0 100M = 1217420 -271 ATATCCAGACAAACAGGGTCTGGAGTAGACCTCCAGCAAATTCCAACAGACCTGCAGCTGAGGGTCCTGACTGTTAGAAGGAAAACTAACAAACAGAAAG CA@>9CA;5?@>6;;..1;77;3=77)75CCF=)>HG@>BB3GEIGFB??BHDIIIGBHGFC:13@ACF9CAA,@F?EDA4IGGGAGHHHGHEFFFD?@@ MD:Z:3C4G17G13C59 RG:Z:sample_1 NM:i:4 AS:i:81 XS:i:81
HS2000-940_146:5:1102:10582:53061 113 NC_000020.11 1544082 22 100M = 23852837 22308739 TTCTGTTGATTTGGGGTGGAGAGTTCTGTAGAGGTCTGTTAGGTCTGCTTGGTCCAGAGCTGAGCTCAAATCCTGAATATCCTTGTTAATTTTCTGTCTC ###CCA:(>(;?:8;A>A>D@>D@=3=>7=@72@@AED@7BFCF?EIIF?0;IGAFD9HFD9D9@>E?9+<HCIGHEA?C3DDFDGBFD?C?DDAAD?;; MD:Z:32T4A26T4G30 RG:Z:sample_1 NM:i:4 AS:i:80 XS:i:70
HS2000-940_146:5:1112:8371:47601 99 NC_000020.11 2502086 10 100M = 2502334 348 AACTAGAATAACCAATGCAGAGAAGTCCTTAAAGGACCTGATGGAGCTGAAAACCAAGGCACAAGAACTACGTGATGAATACACAAGCCTCAGTAGCCGA CCCFFFFFHHHHHIJIIJJJJGJJIHHIIJJJJIJGIJJIGGHGHHGGIIJJJJJJJGHGGGIIJIIGHHHGEDFEEFEEEEEEDDDBDDDCC>CCDDBB MD:Z:33A22T5G12C4G19 RG:Z:sample_1 NM:i:5 AS:i:75 XS:i:79
Is the output consistent with what we expect?
3) Alignment Metrics
Let's compute some statistics to see how well our reads aligned to the reference genome. We'll use samtools flagstat
for this.
samtools flagstat aligned_reads.sam
Output:
194492 + 0 in total (QC-passed reads + QC-failed reads)
80 + 0 secondary
0 + 0 supplementary
0 + 0 duplicates
193804 + 0 mapped (99.65% : N/A)
194412 + 0 paired in sequencing
97206 + 0 read1
97206 + 0 read2
190812 + 0 properly paired (98.15% : N/A)
193108 + 0 with itself and mate mapped
616 + 0 singletons (0.32% : N/A)
0 + 0 with mate mapped to a different chr
0 + 0 with mate mapped to a different chr (mapQ>=5)
Hint: Save these metrics to a text file by piping the output to a new file
samtools flagstat aligned_reads.sam > alignment_metrics.txt
4) Mark Duplicates
During the sequencing process, the same DNA fragments may be sequenced several times. These duplicate reads are not informative and cannot be considered as evidence for or against a putative variant. For example, duplicates can arise during sample preparation e.g. library construction using PCR. Without this step, you risk having over-representation in your sequence of areas preferentially amplified during PCR. Duplicate reads can also result from a single amplification cluster, incorrectly detected as multiple clusters by the optical sensor of the sequencing instrument. These duplication artifacts are referred to as optical duplicates.
We use Picard Tools to locate and tag duplicate reads in a BAM or SAM file, where duplicate reads are defined as originating from a single fragment of DNA.
Note that this step does not remove the duplicate reads, but rather flags them as such in the read's SAM record. We'll take a look at how this is done shortly. Downstream GATK tools will ignore reads flagged as duplicates by default.
Note: Duplicate marking should not be applied to amplicon sequencing or other data types where reads start and stop at the same positions by design.
java -jar picard.jar MarkDuplicates INPUT=sorted_reads.bam OUTPUT=dedup_reads.bam METRICS_FILE=metrics.txt
If this executed correctly, you should see something like the following:
[Mon Dec 19 17:29:19 EST 2016] Executing as [email protected] on Linux 2.6.32-431.29.2.el6.x86_64 amd64; Java HotSpot(TM) 64-Bit Server VM 1.8.0-b132; Picard version: 1.129(b508b2885562a4e932d3a3a60b8ea283b7ec78e2_1424706677) IntelDeflater
INFO 2016-12-19 17:29:19 MarkDuplicates Start of doWork freeMemory: 797279896; totalMemory: 798490624; maxMemory: 11276386304
INFO 2016-12-19 17:29:19 MarkDuplicates Reading input file and constructing read end information.
INFO 2016-12-19 17:29:19 MarkDuplicates Will retain up to 43370716 data points before spilling to disk.
INFO 2016-12-19 17:29:22 MarkDuplicates Read 194420 records. 0 pairs never matched.
INFO 2016-12-19 17:29:22 MarkDuplicates After buildSortedReadEndLists freeMemory: 540268072; totalMemory: 909639680; maxMemory: 11276386304
INFO 2016-12-19 17:29:22 MarkDuplicates Will retain up to 352387072 duplicate indices before spilling to disk.
INFO 2016-12-19 17:29:23 MarkDuplicates Traversing read pair information and detecting duplicates.
INFO 2016-12-19 17:29:23 MarkDuplicates Traversing fragment information and detecting duplicates.
INFO 2016-12-19 17:29:23 MarkDuplicates Sorting list of duplicate records.
INFO 2016-12-19 17:29:24 MarkDuplicates After generateDuplicateIndexes freeMemory: 906398800; totalMemory: 3729260544; maxMemory: 11276386304
INFO 2016-12-19 17:29:24 MarkDuplicates Marking 15269 records as duplicates.
INFO 2016-12-19 17:29:24 MarkDuplicates Found 31 optical duplicate clusters.
INFO 2016-12-19 17:29:27 MarkDuplicates Before output close freeMemory: 3757499808; totalMemory: 3762290688; maxMemory: 11276386304
INFO 2016-12-19 17:29:27 MarkDuplicates After output close freeMemory: 3753828760; totalMemory: 3758620672; maxMemory: 11276386304
[Mon Dec 19 17:29:27 EST 2016] picard.sam.markduplicates.MarkDuplicates done. Elapsed time: 0.14 minutes.
Runtime.totalMemory()=3758620672
These stats are broken down in the metrics.txt file:
LIBRARY UNPAIRED_READS_EXAMINED READ_PAIRS_EXAMINED UNMAPPED_READS UNPAIREDD_READ_DUPLICATES READ_PAIR_DUPLICATE READ_PAIR_OPTICAL_DUPLICATES PERCENT_DUPLICATION ESTIMATED_LIBRARY_SIZE
sample_1 616 96554 688 165 7552 31 0.078818 586767
Let's take a look at the bam file before and after the Mark Duplicates step to see how reads are flagged as duplicates.
Refresher: The second column in a SAM file is known as the bitwise flag. This flag allows for the storage of lots of information in a highly efficient format. Let's look at the first read in sorted_reads.bam:
samtools view sorted_reads.bam | head -1
HS2000-940_146:5:2109:14063:29918 161 NC_000020.11 64145 4 54S46M = 23724989 23660944 TTCCAATCCATTCCATTCCATCACACTGCATTCCATTCCATTCCAATCCCCTCAACTCCACTCCACTCCACTCCATTCCACTCCAATCAATTCCATTGCA @CCFFFFFHGDHHJIJJJJJIJIFHHGCGIHHIJJJGHIHIIIJJIHIGGGHIJJE:FFHIGIDHJGIGGIJJ@;CDHGGEIHHHEHF;CCB>;;3;>;> MD:Z:6G27C8T2 RG:Z:sample_1 NM:i:3 AS:i:33 XS:i:31
Question: What is the bitwise flag value for this read?
(Answer: 161)
Question: What does this value represent? http://broadinstitute.github.io/picard/explain-flags.html
(Answer: read paired, mate reverse strand, second in pair)
Note: "read is PCR or optical duplicate" is also stored in this flag
Let's look at the read on line 58, before and after marking duplicates:
samtools view sorted_reads.bam | head -58 | tail -1
HS2000-1010_101:7:2306:21017:95177 81 NC_000020.11 8563852 0 48S48M4S = 23923303 15359405 CTACAATGAGAATCACAAAACACTGTTCAAATAAATCAAAGAAGATACAAACAAATGGGAAAACATCCCATGCTCATGGATAGGAGGAATCAATATCATT #CDCDEFFFDDDBEA>CA;=A@HD=7;GGEFC@D@IHHCGFB>IGCHHG>GIGJHHEGEJHGIJJJJJIEGCJIJJJJIGGEIIJJJHGHHHDFFFFCCC MD:Z:10A26A10 RG:Z:sample_1 NM:i:2 AS:i:38 XS:i:37
samtools view dedup_reads.bam | head -58 | tail -1
HS2000-1010_101:7:2306:21017:95177 1105 NC_000020.11 8563852 0 48S48M4S = 23923303 15359405 CTACAATGAGAATCACAAAACACTGTTCAAATAAATCAAAGAAGATACAAACAAATGGGAAAACATCCCATGCTCATGGATAGGAGGAATCAATATCATT #CDCDEFFFDDDBEA>CA;=A@HD=7;GGEFC@D@IHHCGFB>IGCHHG>GIGJHHEGEJHGIJJJJJIEGCJIJJJJIGGEIIJJJHGHHHDFFFFCCC MD:Z:10A26A10 PG:Z:MarkDuplicates RG:Z:sample_1 NM:i:2AS:i:38 XS:i:37
5) Prepare indexed bam file
We use samtools to build the bam index:
samtools index dedup_reads.bam
We should have 3 new files:
GCF_000001405.33_GRCh38.p7_chr20_genomic.fna.dict
- GATK reference dictionary
GCF_000001405.33_GRCh38.p7_chr20_genomic.fna.fai
- fasta Index
dedup_reads.bam.bai
- bam index