Table 2. A brief description of the procedure for clinical tumor NGS testing

| Step | Description | Tools and databases | Output |
| --- | --- | --- | --- |
| Base calling and duplicate removal | Base calling and duplicate removal, also known as initial analysis | Sequencing platform configuration software | FASTQ format |
| Primer removal | Primer sequences used in amplicon sequencing must be removed from the reads | Cutadapt, BWA, etc. | FASTQ or BAM format |
| Adaptor removal | Adaptor sequences are removed from the ends of reads. If left untrimmed, they may interfere with alignment and cause false-positive or false-negative variant calls | Cutadapt, BWA, Trimmomatic, SeqPrep, etc. | FASTQ or BAM format |
| Low-quality base removal | Low-quality bases may also interfere with alignment and cause false results. They should usually be trimmed from the ends of reads | Cutadapt, BWA, Trimmomatic, SeqPrep, etc. | FASTQ or BAM format |
| Alignment | Paired-end or single-end reads are aligned to the reference genome. SNVs and small indels can be recognized in this step | BWA, Novoalign, Stampy, SOAP2, LifeScope, Bowtie, etc. | BAM format |
| Duplicate removal (optional) | Duplicates can be introduced by PCR amplification during library construction and sequencing. Duplicates that do not reflect the original DNA decrease calling accuracy and should be removed. Probe hybridization capture sequencing generates fewer duplicates because DNA is randomly fragmented during library construction. Amplicon sequencing requires deduplication only if allele barcodes are used; without them, deduplication is not required | Picard MarkDuplicates, SAMtools, etc. | BAM format |
| Indel realignment (optional) | Misalignment is common around indels, especially at the beginning or end of reads, and can cause false results. Local realignment identifies these locations, minimizes the error, and increases accuracy | GATK RealignerTargetCreator and IndelRealigner, SRMA, etc. | BAM format |
| Base quality score recalibration (optional) | Base quality scores can be recalibrated after alignment or realignment to decrease the false-positive rate | GATK BaseRecalibrator and PrintReads, ReQON, etc. | BAM format |
| Variant calling | Variant calling is the detection and description of variations (including SNVs and small indels) based on differences between the sequencing data and the reference genome | GATK UnifiedGenotyper, GATK HaplotypeCaller, SAMtools, MuTect, VarScan, Platypus, etc. | VCF format |
| Annotation | Variant interpretation relies on detailed annotation. Basic annotation includes the gene name, the gene structure region (exon, splicing region, intron, intergenic region, etc.), and coding information. SNP information, pathogenicity, and other references can also be included | ANNOVAR, SnpEff, Cartagenia Bench Lab NGS, dbSNP, 1000 Genomes, ESP6500, SIFT, PhyloP, MutationTaster, COSMIC, OMIM, ClinVar, HGMD, etc. | CSV, TSV, TXT, Excel, etc. |
| Filtering | Disease-related variants can be identified by strictly filtering the large volume of annotated variant calls. Typical filtering criteria remove low-quality variants, variants in non-coding regions (e.g., introns and intergenic regions), synonymous SNVs, and known low-frequency SNPs in healthy populations. Labs should set up an internal database to track the false positives that recur on their own platforms and filter them out rigorously | Cartagenia Bench Lab NGS, SnpSift, etc. | CSV, TSV, TXT, Excel, database, etc. |
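
The filtering criteria in the last row can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the method of any tool named in the table: the `Variant` record, its field names, the region and consequence labels, the default thresholds (`min_qual`, `max_population_af`), and the `lab_false_positives` set standing in for a lab's internal false-positive database are all hypothetical. In particular, the population rule is modeled here as a frequency cutoff, which is one possible reading of the criterion.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Variant:
    """Hypothetical annotated variant record (field names are assumptions)."""
    chrom: str
    pos: int
    ref: str
    alt: str
    qual: float           # variant quality reported by the caller
    region: str           # e.g. "exonic", "splicing", "intronic", "intergenic"
    consequence: str      # e.g. "missense", "synonymous", "frameshift"
    population_af: float  # frequency in a healthy-population database, 0.0 if absent


# Region labels treated as non-coding in this sketch (assumed vocabulary).
NON_CODING_REGIONS = frozenset({"intronic", "intergenic"})


def filter_variants(variants, min_qual=30.0, max_population_af=0.01,
                    lab_false_positives=frozenset()):
    """Return the variants that pass every filter.

    The default thresholds are placeholders, not recommended values.
    lab_false_positives holds (chrom, pos, ref, alt) keys of recurrent
    platform-specific false positives recorded by the lab.
    """
    kept = []
    for v in variants:
        key = (v.chrom, v.pos, v.ref, v.alt)
        if v.qual < min_qual:                      # low-quality variant
            continue
        if v.region in NON_CODING_REGIONS:         # intron or intergenic region
            continue
        if v.consequence == "synonymous":          # synonymous SNV
            continue
        if v.population_af >= max_population_af:   # known population SNP
            continue
        if key in lab_false_positives:             # recurrent lab false positive
            continue
        kept.append(v)
    return kept
```

Keeping each criterion as a separate check makes it straightforward to log which rule removed a given variant, which helps when a lab reviews and extends its own false-positive list.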