With modern NGS instruments capable of generating billions of reads in a single experiment, the computational analysis required to make sense of the data can seem complex. This post will break down the typical NGS data analysis workflow into its individual components and detail the importance of bioinformatics in NGS.
At a glance, the standard NGS data analysis pipeline looks complicated; however, it can be simplified and broken down into three individual sections (Figure 1):
1. Primary Analysis
2. Secondary Analysis
3. Tertiary Analysis
Primary Analysis
This involves the conversion of raw instrument signal data into sequence data consisting of nucleotide base calls, e.g. FASTQ files. Primary analysis generally takes place on the NGS instrument itself, for example the conversion of raw Binary Base Call (BCL) files on an Illumina sequencer into biological sequence data in the form of millions of short reads. Signal processing differs between platforms: Illumina instruments convert fluorescent signals into nucleotide base calls, whereas Ion Torrent instruments detect a pH change that is converted into a voltage signal (Figure 2).
Primary analysis also includes the pre-processing of NGS reads after they have been converted from raw signals. This is performed to ensure that only high-quality reads of the optimal length are used for downstream analysis.
Pre-processing steps
- Filtering: Reads are filtered based on base call quality (Phred score) and read length. Low-confidence base calls can lead to the detection of false-positive variants, so they need to be removed, and reads that are too short are likely to align to multiple regions of the genome, resulting in poor mapping metrics (a minimal filtering sketch follows this list).
- Demultiplexing: Multiplexing in NGS refers to sequencing multiple samples simultaneously on the same instrument. Demultiplexing is the separation of the resulting reads into per-sample files according to the barcode index assigned to each sample.
- Trimming: Adaptor sequences ligated to the ends of libraries during library preparation need to be removed from the sequencing reads, as they can interfere with mapping and assembly. Reads are also trimmed to remove poor-quality bases from their ends; tools such as Trimmomatic have been developed specifically for this purpose.
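As a rough illustration of the filtering step, the Python sketch below keeps only reads that meet a minimum length and mean Phred quality. The file names and thresholds are hypothetical, and a production pipeline would normally rely on a dedicated tool such as Trimmomatic rather than hand-rolled code.

```python
# Minimal FASTQ quality filter (illustrative only; file names and thresholds
# are hypothetical). Phred score Q = -10*log10(P_error), stored in FASTQ as
# the ASCII character with code Q + 33.
MIN_LENGTH = 50        # discard reads shorter than this
MIN_MEAN_QUALITY = 20  # Q20 corresponds to a ~1% base-call error rate

def mean_phred(quality_string, offset=33):
    scores = [ord(ch) - offset for ch in quality_string]
    return sum(scores) / len(scores)

with open("sample.fastq") as fin, open("sample.filtered.fastq", "w") as fout:
    while True:
        record = [fin.readline() for _ in range(4)]  # header, sequence, '+', qualities
        if not record[0]:
            break  # end of file
        sequence = record[1].strip()
        qualities = record[3].strip()
        if len(sequence) >= MIN_LENGTH and mean_phred(qualities) >= MIN_MEAN_QUALITY:
            fout.writelines(record)
```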
Secondary Analysis
Once high-quality sequence reads have been generated, the next step in the data analysis workflow is to align the reads against a reference genome, or perform a de novo assembly, and then call any variants detected. Many different file types are used and generated during secondary analysis; some of these are detailed in Table 1 below.
| FILE TYPE | DESCRIPTION | WHERE IT IS USED |
|---|---|---|
| FASTQ | Text-based file format containing raw sequence reads and the associated quality score of each base | Storage of raw sequence data; input to sequence alignment |
| BED | Browser Extensible Data file; a tab-delimited text file used to store genomic regions as coordinates | Directing variant calling pipelines to specific genomic regions |
| SAM | Sequence Alignment Map file; stores text-based information for reads aligned to a reference sequence | Stores information on read alignment, e.g. position and quality |
| BAM | Binary Alignment Map file; a compressed binary version of a SAM file that can be opened in genome browsers to view read alignments | Input to variant calling pipelines |
| VCF | Variant Call Format; a text file storing sequence variants, with each variant occupying a single row | Generated by variant calling pipelines; input to variant annotation |

Table 1: Summary of the different file types used in NGS data analysis
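To make the two most commonly encountered formats concrete, here is what a single FASTQ record looks like (four lines per read); the read name, sequence and quality string are invented for illustration:

```
@read_001 barcode=ACGTACGT
GATTTGGGGTTCAAAGCAGT
+
IIIIHHHHGGGGFFFFEEEE
```

And a single, simplified VCF data line (tab-separated, again with invented values) showing the standard CHROM, POS, ID, REF, ALT, QUAL, FILTER and INFO columns:

```
chr1    1234567    .    G    A    228    PASS    DP=142;AF=0.49
```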
Alignment
The aim of sequence alignment is to find the genomic location from which each read originates and to determine how many reads align to that position. The preferred method is to align the reads against a known reference genome, e.g. hg19 for humans (Figure 3). In de novo assembly no reference is used; instead, reads are aligned to one another based on sequence similarity to create long consensus sequences called contigs.
A persistent problem in NGS is that short reads can align equally well to multiple locations in the genome; the longer the read, the easier it is to place unambiguously. Paired-end reads reduce this issue because the two reads of a pair are separated by a known approximate distance, which can be used to validate their alignment positions. It is therefore crucial to remove reads that are too short before alignment, as misaligned reads will lead to false-positive variant calls.
Traditional alignment methods such as BLAST cannot be used for the large data sets generated by NGS, as they are too memory-intensive. This has led to the development of more than 90 alignment tools, some designed specifically for applications such as RNA-Seq or Bisulfite-Seq. Alignment tools take FASTQ files as input (plus a reference, if one is used) and write the aligned reads to a SAM or BAM file.
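As one concrete example of this step, the sketch below drives a typical short-read alignment with BWA-MEM and SAMtools from Python. BWA and SAMtools are only two of the many available tools and are assumed to be installed; the file names are placeholders.

```python
# Illustrative alignment workflow: FASTQ in, sorted and indexed BAM out.
# Assumes the bwa and samtools executables are on the PATH.
import subprocess

reference = "reference.fa"        # reference genome, e.g. hg19
fastq_r1 = "sample_R1.fastq"      # paired-end read files after pre-processing
fastq_r2 = "sample_R2.fastq"

# Build the BWA index for the reference (only needed once per reference)
subprocess.run(["bwa", "index", reference], check=True)

# Align the paired-end reads; BWA-MEM writes SAM to stdout
with open("sample.sam", "w") as sam_out:
    subprocess.run(["bwa", "mem", reference, fastq_r1, fastq_r2],
                   stdout=sam_out, check=True)

# Sort the alignments into a compressed BAM file and index it
subprocess.run(["samtools", "sort", "-o", "sample.sorted.bam", "sample.sam"], check=True)
subprocess.run(["samtools", "index", "sample.sorted.bam"], check=True)
```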
Before the aligned reads can be used for variant calling analysis, there are additional post-alignment processing steps that ensure only the highest quality reads are retained in the BAM file. Some of these steps include:
- Duplicate removal: PCR duplicates are identical reads that originate from the same DNA molecule and are generated during library preparation. Duplicate reads may cause false positives or skew allele frequencies, so they are removed from the analysis (see the sketch after this list).
- Local read realignment: Reads are realigned around candidate indels, which improves the accuracy of indel detection and reduces mismatches caused by misalignment.
- Base quality score recalibration: Alignment metrics are used to estimate new base call quality scores, because the raw Phred-scaled quality scores produced by base-calling algorithms may not accurately reflect the true base-calling error rates.
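A minimal sketch of this kind of BAM clean-up, using the third-party pysam library (assumed to be installed), is shown below. It drops unmapped reads, reads flagged as PCR duplicates (duplicate marking itself is usually done upstream, e.g. with Picard MarkDuplicates) and reads with low mapping quality; the threshold and file names are hypothetical.

```python
# Post-alignment filtering sketch using pysam (third-party library).
# Keeps only mapped, non-duplicate reads above a mapping-quality threshold.
import pysam

MIN_MAPPING_QUALITY = 30  # hypothetical threshold

with pysam.AlignmentFile("sample.sorted.bam", "rb") as bam_in, \
     pysam.AlignmentFile("sample.clean.bam", "wb", template=bam_in) as bam_out:
    for read in bam_in:
        if read.is_unmapped or read.is_duplicate:
            continue  # skip unmapped reads and flagged PCR duplicates
        if read.mapping_quality < MIN_MAPPING_QUALITY:
            continue  # skip poorly mapped reads
        bam_out.write(read)
```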
Variant Calling
After reads have been aligned and processed, the next step in the pipeline is to identify differences between the selected reference genome and the newly sequenced reads. In short, the aim of variant calling is to identify polymorphic sites where nucleotides differ from the reference. A BAM file is used as input for the analysis and contains information on all the aligned reads and their associated quality scores.
BAM files can be opened in a genome browser such as JBrowse or IGV to view the individual reads aligned against the reference; this is useful for troubleshooting and for ruling out false positives caused by errors. If a variant is observed in only a small percentage of the reads, it is likely to be a false positive (Figure 4). Statistical methods are used to evaluate whether observed deviations from the reference sequence have occurred by chance (i.e. whether a variant is real).
There are multiple tools available for variant calling, such as samtools and GATK, which are both widely used and highly scalable. Many of these tools run only on Linux, and the choice of tool depends on the type of variant you are looking to identify: SNPs and small indels (<50 bp) require different algorithms and parameters from pipelines designed to detect large indels and structural variants such as CNVs. Variant calling tools also provide filters that remove variants based on pre-selected criteria, for example filtering out all variants with an allele frequency below 5%, or with a quality score of 10 or less (a minimal sketch of this kind of filtering follows below). This is an important step in the generation of the VCF file, as a standard human whole-exome sequencing run can detect upwards of 50,000 variants, many of them benign or common in the population. After the VCF has been generated and filtered, the next step in the workflow is to annotate the variants, determine what their effect on protein function may be and, in a clinical case, classify each variant as pathogenic or benign.
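The sketch below applies exactly those two example thresholds to a VCF by hand. It assumes the variant caller wrote an AF (allele frequency) tag to the INFO column; the file names are hypothetical, and in practice this filtering would usually be done with the caller's own options or a tool such as bcftools.

```python
# Minimal VCF filter: keep variants with QUAL > 10 and INFO AF >= 5%.
# Assumes the variant caller emitted an AF tag in the INFO column.
MIN_QUAL = 10.0
MIN_ALLELE_FREQUENCY = 0.05

def parse_info(info_field):
    # INFO is a semicolon-separated list of KEY=VALUE pairs (or bare flags)
    pairs = (item.split("=", 1) for item in info_field.split(";"))
    return {p[0]: (p[1] if len(p) == 2 else True) for p in pairs}

with open("sample.vcf") as fin, open("sample.filtered.vcf", "w") as fout:
    for line in fin:
        if line.startswith("#"):
            fout.write(line)  # copy header lines unchanged
            continue
        fields = line.rstrip("\n").split("\t")
        qual = float(fields[5]) if fields[5] != "." else 0.0
        info = parse_info(fields[7])
        allele_freq = float(str(info.get("AF", "0")).split(",")[0])
        if qual > MIN_QUAL and allele_freq >= MIN_ALLELE_FREQUENCY:
            fout.write(line)
```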
Tertiary Analysis
The third and final step of the NGS analysis workflow addresses the important issue of making sense of the observed data. In the context of human genetics, this means finding the link between the variant data and the phenotype observed in a patient. Tertiary analysis begins with variant annotation, which adds contextual information to the variants detected in the previous steps.
Annotation
Variant annotation refers to the process of predicting the biological effect or function of genetic variants, whether for a human clinical case or for HIV resistance mutations. Annotation tools take the VCF generated by the variant calling pipeline and output a report of annotated variants and their biological effects. These tools include comprehensive software packages such as the Alissa Informatics Platform from Agilent, which can perform sequence analysis from variant calling right through to variant annotation and reporting.
A VCF can contain thousands of variants; filtering them and understanding their biological effect increases the likelihood of finding an actionable variant. Variants that are common in the population can be ruled out when looking for a rare disease-causing variant, so these are typically filtered out (a hypothetical post-annotation filter is sketched below). Variant annotation tools draw on multiple sources, ranging from predictive algorithms for protein function (SIFT, PolyPhen) to databases of known variants and clinical associations such as dbSNP and ClinVar. Other sources of information used to annotate variants include population frequency databases (1000 Genomes), oncology databases (COSMIC) and pharmacogenomics databases (PharmGKB).
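As a toy example of this kind of prioritisation, the sketch below filters a tab-separated report exported from an annotation tool; the column names (GENE, CONSEQUENCE, POPULATION_AF, CLINVAR), the frequency threshold and the file name are all invented for illustration.

```python
# Hypothetical post-annotation filter: keep rare, protein-altering variants
# from a tab-separated annotation report with invented column names.
import csv

MAX_POPULATION_AF = 0.01  # drop variants seen in >1% of the population
DAMAGING = {"missense_variant", "stop_gained", "frameshift_variant"}

with open("annotated_variants.tsv") as fin:
    for row in csv.DictReader(fin, delimiter="\t"):
        population_af = float(row["POPULATION_AF"] or 0)
        if population_af <= MAX_POPULATION_AF and row["CONSEQUENCE"] in DAMAGING:
            print(row["GENE"], row["CONSEQUENCE"], row["CLINVAR"])
```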
Interpretation
Variant interpretation in the human context is usually performed by a qualified individual such as a clinical geneticist and/or genetic counsellor. Their job involves collating all the available patient information, including family history of disease, and matching the patient's genotype with the clinical phenotype. Depending on the case, they may also request genetic profiles of the parents to understand the inheritance of potential disease alleles. This is where variant annotation and filtering come into play: the main goal is to reduce the workload of the individuals doing the interpretation. The ideal scenario would be finding a single genetic variant that perfectly explains the patient's phenotype; however, this is very rarely the case.
In an effort to standardize the variant interpretation process and enable a more systematic approach, the ACMG and AMP have developed a set of guidelines for Mendelian disease diagnosis. The ACMG guidelines apply a set of standard questions to each case, alongside phenotypic information, to rank variants based on several factors such as family history, protein function, population frequency and known disease associations. At the end of the interpretation process, a variant is classified as pathogenic or benign for an individual and their phenotype. Variants may also be classified as variants of uncertain significance (VUS), meaning there is currently not enough evidence to classify them as either pathogenic or benign. As more evidence is gathered and further testing is performed, these classifications may change.