Base calling and duplicate removal | Base calling and duplicate removal, also known as initial analysis | Sequencing platform configuration software | FASTQ format |
Primer removal | Primer sequences for amplicon sequencing must be removed from the reads | CutAdapt, BWA, etc. | FASTQ or BAM format |
Adaptor removal | Adaptor sequences must be removed from the ends of reads. If left untrimmed, they may interfere with alignment and cause false-positive or false-negative variant calls | CutAdapt, BWA, Trimmomatic, SeqPrep, etc. | FASTQ or BAM format |
Low-quality base removal | Low-quality bases may also interfere with alignment and cause false results. These bases should usually be trimmed from the ends of reads | CutAdapt, BWA, Trimmomatic, SeqPrep, etc. | FASTQ or BAM format |
Alignment | In the alignment step, paired-end or single-end reads are aligned to the reference genome. SNVs and small indels can be recognized in this step | BWA, Novoalign, Stampy, SOAP2, LifeScope, Bowtie, etc. | BAM format |
Duplicate removal (optional) | Duplicates can be introduced by PCR amplification during library construction and sequencing. Duplicates that do not plausibly originate from the original DNA decrease the accuracy of variant calling and should be removed. Probe hybridization capture sequencing generates fewer duplicates, because DNA is randomly fragmented during library construction. Amplicon sequencing does not require deduplication when no allele barcodes are used, but does require it when they are | Picard MarkDuplicates, SAMtools, etc. | BAM format |
Indel realignment (optional) | Misalignment is common around indels, especially at the beginning or end of reads, and can cause false results. Local realignment can identify these locations, reduce this error, and increase accuracy | GATK RealignerTargetCreator and IndelRealigner, SRMA, etc. | BAM format |
Base quality score recalibration (optional) | Base quality scores can be recalibrated after alignment/realignment to decrease the false-positive rate | GATK BaseRecalibrator and PrintReads, ReQON, etc. | BAM format |
Variant calling | Variant calling refers to the detection and description of variations (including SNVs and small indels) based on differences between sequencing data and reference genomes | GATK UnifiedGenotyper, GATK HaplotypeCaller, SAMtools, MuTect, Varscan, Platypus, etc. | VCF format |
Annotation | Variant interpretation relies on detailed annotation. Basic annotation includes gene name, gene structure areas (exon, splicing region, intron, intergenic region, etc.), and coding information. SNP information, pathogenicity, and other references can also be included | ANNOVAR, SnpEff, Cartagenia Bench Lab NGS, dbSNP, 1000 Genomes, ESP6500, SIFT, PhyloP, MutationTaster, COSMIC, OMIM, ClinVar, HGMD, etc. | CSV, TSV, TXT, Excel, etc. |
Filtering | Disease-related variants can be identified by strictly filtering the large volume of annotated variant-calling results. Typical filtering criteria remove low-quality variants, variants in non-coding regions (e.g., intron and intergenic region), synonymous SNVs, and known low-frequency SNPs in healthy populations. Labs should set up an internal database to track the false positives that recur on their own platforms and rigorously filter them out | Cartagenia Bench Lab NGS, SnpSift, etc. | CSV, TSV, TXT, Excel, database, etc. |
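The filtering criteria in the last row of the table can be illustrated with a minimal Python sketch. It is not the method of any tool listed above. The record type `AnnotatedVariant`, its field names, the region and effect labels, and the quality threshold of 30 are all hypothetical and chosen only for illustration; real annotation output (for example from ANNOVAR or SnpEff) uses its own columns and vocabularies, and each lab would set its own cutoffs.

```python
from dataclasses import dataclass

# Hypothetical labels for non-coding regions; real annotators use their own terms.
NON_CODING_REGIONS = {"intron", "intergenic"}


@dataclass
class AnnotatedVariant:
    """Hypothetical annotated variant record (field names are assumptions)."""
    gene: str
    region: str                 # e.g. "exon", "splicing", "intron", "intergenic"
    effect: str                 # e.g. "missense", "synonymous", "frameshift"
    quality: float              # variant quality score reported by the caller
    known_population_snp: bool  # flagged as a known SNP in healthy populations


def passes_filters(variant, min_quality=30.0):
    """Return True if the variant survives all four criteria from the table."""
    if variant.quality < min_quality:         # low-quality variant
        return False
    if variant.region in NON_CODING_REGIONS:  # non-coding region
        return False
    if variant.effect == "synonymous":        # synonymous SNV
        return False
    if variant.known_population_snp:          # known SNP in healthy populations
        return False
    return True


def filter_variants(variants, min_quality=30.0):
    """Keep only the variants that pass every filter, preserving input order."""
    return [v for v in variants if passes_filters(v, min_quality)]


# Placeholder records (not real data), one per filtering outcome.
variants = [
    AnnotatedVariant("GENE_A", "exon", "missense", 50.0, False),
    AnnotatedVariant("GENE_B", "exon", "missense", 10.0, False),
    AnnotatedVariant("GENE_C", "intron", "none", 60.0, False),
    AnnotatedVariant("GENE_D", "exon", "synonymous", 60.0, False),
    AnnotatedVariant("GENE_E", "exon", "missense", 60.0, True),
]
kept = filter_variants(variants)
print([v.gene for v in kept])
```

In this sketch only the first placeholder record passes. The lab-specific false-positive database recommended in the table could be added as one more check in `passes_filters`, for example a lookup against a set of recurrent artefact positions.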