Part 3: NGS Target Enrichment quality metrics
If you are performing NGS target enrichment assays in your lab, make sure you understand a few key quality metrics.
Thanks to the massive data output of modern NGS instruments such as Illumina’s NovaSeq, whole-genome sequencing (WGS) of human samples has never been cheaper. WGS is useful for piecing together an organism’s entire genome, but depending on the question being asked, target enrichment may be a better approach. It is cheaper to perform, has lower data storage costs, and requires less laborious downstream data analysis than WGS. Most importantly, target enrichment allows deep sequencing of a selected region or regions of interest without wasting precious sequencing data on regions you have no interest in (e.g. long intronic sequences).
A practical example of a target enrichment approach is NGS of somatic tissue samples to detect oncogenic mutations. Some low-frequency mutations require very deep sequencing to detect, as they may have a variant allele frequency (VAF) of 5% or less. A targeted gene panel enriches for and sequences only the genes of interest, thereby providing the high sequencing depth required for detecting somatic variants.
Besides quality control of the raw sequencing reads, it is also crucial to assess whether the target enrichment has been successful, i.e. whether most of the reads mapped to the target and whether the targeted bases reached sufficient coverage. In a previous post, we briefly compared amplicon-based and capture-based target enrichment methods. Here we cover a few important QC metrics that can be used to assess the run quality of these methods. The QC parameters described below are relevant to both methods and can be used to assess the performance of any NGS target enrichment assay.
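To make the link between VAF and depth concrete, the short Python sketch below estimates how many reads would be expected to carry a variant at a given depth. The function name and the depths used are illustrative assumptions, and the calculation ignores sampling noise and sequencing error, so treat it as a rough guide rather than a detection limit.

```python
def expected_variant_reads(depth: int, vaf_percent: float) -> float:
    """Expected number of reads carrying the variant at one position.

    depth        read depth at the position (e.g. 100 for 100X)
    vaf_percent  variant allele frequency as a percentage (e.g. 5 for 5%)
    """
    if depth < 0 or not 0 <= vaf_percent <= 100:
        raise ValueError("depth must be >= 0 and vaf_percent within 0-100")
    return depth * vaf_percent / 100


# Hypothetical depths: a 5% VAF variant is supported by only a handful
# of reads at modest depth, and by many more at high depth.
for depth in (30, 100, 1000):
    reads = expected_variant_reads(depth, 5)
    print(f"{depth}X depth, 5% VAF: about {reads:g} variant reads expected")
```

At low depth the expected number of supporting reads is too small to separate a real variant from noise, which is why targeted panels that concentrate reads on a few genes are attractive for somatic variant detection.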
On-target rate (%)
The on-target rate is the percentage of sequencing data that maps to your region of interest (i.e. the probe target region or primer amplicon). It is typically calculated as the number of sequenced bases covering the target regions divided by the total number of mapped bases output by the sequencer, expressed as a percentage (Figure 1).
Conversely, the off-target rate refers to sequencing data that maps to other, unintended genomic locations and is considered wasted data. Some off-target sequencing is inevitable, and a considerable proportion of it is panel-specific, caused by promiscuous hybridization of the probes.
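As a minimal sketch of the calculation described above, the Python snippet below derives the on-target and off-target rates from two base counts. The function names and the numbers in the usage lines are hypothetical; in practice these counts would come from your alignment QC tool.

```python
def on_target_rate(on_target_bases: int, total_mapped_bases: int) -> float:
    """Percentage of mapped bases that fall inside the target regions."""
    if total_mapped_bases <= 0:
        raise ValueError("total_mapped_bases must be positive")
    if not 0 <= on_target_bases <= total_mapped_bases:
        raise ValueError("on_target_bases must be between 0 and the total")
    return 100.0 * on_target_bases / total_mapped_bases


def off_target_rate(on_target_bases: int, total_mapped_bases: int) -> float:
    """Percentage of mapped bases that fall outside the target regions."""
    return 100.0 - on_target_rate(on_target_bases, total_mapped_bases)


# Hypothetical run: 750 Mb of 1,000 Mb mapped bases land on target.
print(on_target_rate(750_000_000, 1_000_000_000))   # on-target %
print(off_target_rate(750_000_000, 1_000_000_000))  # off-target %
```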
Mean read depth
Read depth, also called sequencing depth or depth of coverage, is the number of reads mapped to a single genomic position following alignment and removal of duplicate reads. The mean read depth is calculated as the total number of bases aligned to the target region divided by the target region size. It indicates how many reads, on average, are likely to be aligned at a given reference base position. In the hypothetical example in Figure 2, 6 reads map to the sequence of exon 1, so exon 1 has a read depth of 6X. However, the entire target region includes both exon 1 and exon 2, so the mean read depth will be lower because of the poor read depth of exon 2.
Coverage is not uniform across the whole target (certain regions are GC-rich, for example), so mean read depth is not always the best metric for assessing assay performance: it gives no indication of areas with poor sequencing depth. For example, a mean depth of 100X across the target may hide the fact that some regions have a read depth of 150X whereas others have only 50X.
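The following Python sketch illustrates both points: how the mean read depth is computed from per-base depths, and how the mean can hide poorly covered regions. The exon lengths and the depth of exon 2 are made-up values chosen for illustration; only the 6X depth of exon 1 comes from the example above.

```python
def mean_read_depth(per_base_depth: list[int]) -> float:
    """Total aligned bases over the target divided by the target size."""
    if not per_base_depth:
        raise ValueError("target region must contain at least one base")
    return sum(per_base_depth) / len(per_base_depth)


# Hypothetical target: exon 1 (10 bases at 6X) and exon 2 (10 bases at 2X).
exon1 = [6] * 10
exon2 = [2] * 10
target = exon1 + exon2

print("exon 1 mean depth:", mean_read_depth(exon1))
print("exon 2 mean depth:", mean_read_depth(exon2))
print("target mean depth:", mean_read_depth(target))
# The mean alone does not show that half the target sits at only 2X,
# so it is worth reporting the minimum (or a coverage threshold) as well.
print("lowest per-base depth:", min(target))
```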
Target covered at ≥20X (%)
The percentage of all target bases achieving 20X or greater read depth is a reliable assessment of sequencing coverage across the entire target. This metric measures the efficiency of the target capture and can be adjusted for the desired coverage, e.g. by calculating what percentage of target bases are covered at ≥40X. The higher this value, the better: for whole-exome sequencing, you would expect at least 90% of all target bases to be covered at ≥20X for confident variant calling. Poor-quality DNA samples tend to have a lower percentage of target covered, as certain genomic regions may be harder to capture and sequence. The read depth to aim for when setting up an NGS experiment depends on the type of sequencing you’re performing, the application, and your research question. See https://genohub.com/recommended-sequencing-coverage-by-application/ for some guidance on this.
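A minimal Python sketch of this metric is shown below. It assumes you already have a list of per-base depths for the target (for example, exported from a depth-reporting tool); the function name and the depth values are hypothetical.

```python
def pct_target_covered(per_base_depth: list[int], min_depth: int = 20) -> float:
    """Percentage of target bases with read depth >= min_depth."""
    if not per_base_depth:
        raise ValueError("target region must contain at least one base")
    covered = sum(1 for depth in per_base_depth if depth >= min_depth)
    return 100.0 * covered / len(per_base_depth)


# Hypothetical per-base depths for a 10-base target.
depths = [25, 30, 18, 22, 5, 40, 20, 19, 21, 35]

print(pct_target_covered(depths))                # threshold of 20X
print(pct_target_covered(depths, min_depth=40))  # stricter 40X threshold
```

Making the threshold a parameter mirrors the way the metric is adjusted in practice, since the same per-base depths can be summarised at 20X, 40X or any other depth the application calls for.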
Coverage uniformity (fold-80)
Coverage uniformity describes the read distribution along the target regions of the genome and reflects how evenly the probes in a panel capture their targets. It can be expressed as the fold-80 base penalty (fold-80), defined as the fold of additional sequencing required to ensure that 80% of the target bases achieve the mean coverage. The lower the on-target rate, or the higher the fold-80, the greater the capture inefficiency and the more sequencing is wasted.
Uniform coverage reduces the amount of sequencing required to reach a sufficient depth of coverage for all regions of interest. In reality, however, no NGS assay is 100% uniform: some targets are under-sequenced, others are over-sequenced, and off-target regions are also captured.
For example, if one million reads produce a mean coverage of 30X, a fold-80 of 2.0 means two million reads would be required to ensure that 80% of the targeted bases reach 30X. A fold-80 of 1.4 would mean that increasing sequencing to 1.4 million reads would achieve the same goal. Therefore, the lower the fold-80, the more uniform the coverage and the less sequencing is required to reach the desired depth of coverage.
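The fold-80 arithmetic can be sketched in Python as follows. This assumes one common way of computing the metric: the mean depth divided by the depth that at least 80% of target bases meet or exceed. Tools may differ in details such as whether zero-coverage targets are excluded, and the function names and per-base depths here are hypothetical.

```python
def fold_80(per_base_depth: list[int]) -> float:
    """Mean depth divided by the depth reached by at least 80% of bases."""
    if not per_base_depth:
        raise ValueError("target region must contain at least one base")
    n = len(per_base_depth)
    mean_depth = sum(per_base_depth) / n
    # Number of bases that make up 80% of the target, rounded up.
    k = -(-8 * n // 10)
    # Depth that the k best-covered bases all meet or exceed.
    depth_80 = sorted(per_base_depth, reverse=True)[k - 1]
    if depth_80 == 0:
        return float("inf")  # no amount of extra sequencing is implied
    return mean_depth / depth_80


def reads_required(current_reads: int, fold80: float) -> int:
    """Reads needed so that 80% of target bases reach the current mean depth."""
    return round(current_reads * fold80)


# Hypothetical target with a mean depth of 30X but uneven coverage.
depths = [40, 40, 40, 40, 30, 30, 30, 20, 15, 15]
penalty = fold_80(depths)
print("fold-80:", penalty)
print("reads required:", reads_required(1_000_000, penalty))
```

A perfectly uniform target gives a fold-80 of 1.0, meaning no additional sequencing is needed; the further the value rises above 1.0, the more reads are spent over-sequencing well-covered regions to rescue poorly covered ones.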
Duplication rate (%)
This is calculated as the percentage of mapped reads that share the same 5′ and 3′ coordinates as another read. Duplicate reads mostly arise during PCR-based library preparation, although they may also result from artefacts on the sequencing instrument, and they can indicate bias originating from poor sample quality. NGS libraries prepared from poor-quality DNA or very low DNA input amounts tend to have higher duplication levels.
Most analysis pipelines remove duplicate reads after alignment to ensure that only unique reads remain. This reduces sequencing depth but increases confidence in variant calls, since each unique read is an independent data point. In the example in Figure 3, the raw read depth (with duplicates) is 8X because 8 reads span the target; after duplicate removal, the true read depth is 3X, as only three unique reads remain. If duplicates aren’t removed, allele frequency estimates can be skewed, because the allele carried by the duplicated reads is over-represented relative to the other allele. Some duplicate reads might also contain mutations introduced during PCR that aren’t a true representation of the actual sample, so it is critical that duplicates are removed before downstream analysis.
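The effect of duplicate removal on depth and allele frequency can be sketched in Python as below. This is a deliberately simplified illustration: reads are treated as duplicates when they share chromosome, start and end coordinates, whereas production tools also take strand, mate position and base quality into account. The `Read` type, the function names and all coordinates and alleles are hypothetical.

```python
from typing import NamedTuple


class Read(NamedTuple):
    chrom: str
    start: int   # 5' coordinate of the aligned read
    end: int     # 3' coordinate of the aligned read
    allele: str  # base observed at the variant position


def remove_duplicates(reads: list[Read]) -> list[Read]:
    """Keep the first read seen for each (chrom, start, end) combination."""
    seen = set()
    unique = []
    for read in reads:
        key = (read.chrom, read.start, read.end)
        if key not in seen:
            seen.add(key)
            unique.append(read)
    return unique


def allele_fraction(reads: list[Read], allele: str) -> float:
    """Fraction of reads carrying the given allele."""
    return sum(r.allele == allele for r in reads) / len(reads)


# Eight raw reads from three original fragments: the fragment carrying
# the "T" allele was amplified five times during PCR.
raw = (
    [Read("chr1", 100, 250, "T")] * 5
    + [Read("chr1", 110, 260, "C")] * 2
    + [Read("chr1", 120, 270, "C")]
)
unique = remove_duplicates(raw)

print("raw depth:", len(raw), "T fraction:", allele_fraction(raw, "T"))
print("deduplicated depth:", len(unique),
      "T fraction:", round(allele_fraction(unique, "T"), 3))
print("duplication rate (%):", 100.0 * (len(raw) - len(unique)) / len(raw))
```

In this made-up case the raw reads overstate the frequency of the "T" allele because one fragment was amplified more than the others; after deduplication each original fragment counts once.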
Agilent target enrichment solutions
Agilent is a world leader in target enrichment, with three portfolios of target enrichment solutions to suit the panel size required and the enrichment method. SureSelect uses Agilent’s tried-and-tested oligonucleotide probes for efficient hybrid capture of targets of any size (e.g. whole-exome). HaloPlex and SureMASTR deliver great assay performance using amplicon-based target enrichment techniques for panel sizes up to 5 Mb.
Diagnostech also supply target-enrichment solutions from ArcherDx, a leading genomics company in precision oncology. They have a range of amplicon-based NGS kits for the detection of variants associated with cancers and inherited diseases, including a liquid biopsy assay.