Part 4: Intro to NGS Quality Control metrics
Run-level QC metrics
Illumina
Sequencing Analysis Viewer (SAV) is an Illumina tool used to monitor sequencing performance during a run or to check QC metrics after the run has finished. It can be used on the Illumina instrument itself or on a standalone PC that has SAV installed. The software is freely available for download from Illumina.
To open a run in SAV and view its metrics, the following files and folders are required from the sequencing output folder:
- InterOp (folder)
- RunInfo.xml
- RunParameters.xml
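Before copying a run folder to a standalone PC, it can help to confirm that these three items are present. The following is a minimal sketch in Python; the function name `missing_sav_inputs` is hypothetical, and the check assumes the file names are spelled exactly as in the list above.

```python
from pathlib import Path

# Items SAV needs from the sequencing output folder (names taken from the list above).
REQUIRED_ITEMS = ("InterOp", "RunInfo.xml", "RunParameters.xml")


def missing_sav_inputs(run_folder):
    """Return the names of required items that are absent from run_folder."""
    run_path = Path(run_folder)
    missing = []
    for name in REQUIRED_ITEMS:
        item = run_path / name
        # InterOp must be a folder; the two XML entries must be regular files.
        present = item.is_dir() if name == "InterOp" else item.is_file()
        if not present:
            missing.append(name)
    return missing
```

An empty list means the folder holds everything listed above; any names returned still need to be copied across.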
When a sequencing run is opened in SAV, the ‘Analysis’ tab is displayed. It contains graphs of important QC metrics such as the Flow Cell Chart, Data by Cycle, Data by Lane and Q-score distribution (Figure 1).
The ‘Run Summary’ table contains useful data from your sequencing run, including yield, error rate, %Q30, Cluster Density and Cluster PF (%). The first table summarises the data for each read type generated, e.g. the total yield of all Read 1 reads in this run was 13.4 Gb (Figure 2).
The parameters reported in the ‘Run Summary’ are explained as follows:
Yield (Gb)
The yield, also referred to as output, is the number of bases generated in the sequencing run. It is expressed in gigabases (Gb), where 1 Gb is 1 billion bases. This should not be confused with the similar-looking GB, which refers to gigabytes, a unit of digital information in computing. Yield is important because it determines how many samples you can multiplex on a single run and how deep your sequencing coverage will be. Illumina sequencing kits generally have a defined output based on the number of reads generated and the read length. For example, a MiSeq v2 kit with 2 x 150 bp reads (300 bp in total) will generate approximately 15 million reads and therefore ~4.5 Gb of raw sequencing data (300 bp x 15 million = 4.5 Gb).
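The yield arithmetic above can be sketched in a few lines of Python. The function names are hypothetical, and the 5 Mb genome size and 100x target depth in the usage note are illustrative assumptions rather than values from any kit specification.

```python
def expected_yield_gb(fragment_count, read_length_bp, reads_per_fragment=2):
    """Expected raw yield in gigabases (1 Gb = 1e9 bases).

    fragment_count follows the text's "15 million reads" for a 2 x 150 bp run,
    where each counted unit contributes 2 x 150 = 300 bp.
    """
    total_bases = fragment_count * read_length_bp * reads_per_fragment
    return total_bases / 1e9


def samples_per_run(yield_gb, genome_size_bp, target_depth):
    """Whole number of samples that fit in a run at the requested mean depth."""
    bases_per_sample = genome_size_bp * target_depth
    return int(yield_gb * 1e9 // bases_per_sample)
```

With the MiSeq v2 figures from the text, `expected_yield_gb(15_000_000, 150)` returns 4.5. Under the assumed 5 Mb genome and 100x depth, `samples_per_run(4.5, 5_000_000, 100)` returns 9, before allowing for any loss to filtering or uneven multiplexing.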
Error Rate (%)
This refers to the percentage of bases called incorrectly at any one cycle along the read. The error rate is calculated from the reads that align to Illumina’s PhiX, a control library of known base composition that can be spiked into your final library. Error rate is only calculated if the PhiX control is included in your run; if it is not, %Q30 is your next best metric of base call quality. The error rate increases along the length of the read due to the nature of the sequencing chemistry and because reagents are expended.
%Q30
This parameter represents the percentage of bases with a Phred quality score (Q-score) of 30 or higher. The Q-score is a quality indicator of individual base calls and is a logarithmic measure of the probability (P) that a base is called incorrectly: Q = -10 x log10(P). A Q-score of 30 therefore means that 1 in 1000 base calls may be incorrect (Figure 3). Figure 3 also lists the probability of an incorrect base call and the base call accuracy for Phred quality scores between 10 and 60.
PHRED QUALITY SCORE | PROBABILITY OF INCORRECT BASE CALL | BASE CALL ACCURACY |
---|---|---|
10 | 1 in 10 | 90% |
20 | 1 in 100 | 99% |
30 | 1 in 1 000 | 99.9% |
40 | 1 in 10 000 | 99.99% |
50 | 1 in 100 000 | 99.999% |
60 | 1 in 1 000 000 | 99.9999% |
Figure 3: Base call accuracy – the probability of an incorrect base call for each Phred Quality Score (Q-score).
Most Illumina runs should generate around 70-80% of bases at Q30 or higher, which is indicative of a successful sequencing run. The expected Q30 values are listed for each sequencing kit in the manufacturer's specifications. In poor quality runs, the %Q30 will be lower and low-quality data will be filtered out. The value can be viewed as an average for the entire run or as an average for individual reads (e.g. the Read 1 Q30 score). The Q-score decreases slightly as a sequencing run progresses due to reagent expenditure and polymerase errors, which is why it is normal to see lower Q30 values in Read 2 compared to Read 1.
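The relationship between Q-score and error probability in Figure 3, and the %Q30 calculation itself, can be illustrated with a short Python sketch. It assumes quality strings use the Phred+33 ASCII encoding found in current FASTQ files; the function names are hypothetical and this is not how SAV computes the metric internally.

```python
import math


def phred_to_error_probability(q):
    """Probability that a base call with Q-score q is wrong: P = 10^(-Q/10)."""
    return 10 ** (-q / 10)


def error_probability_to_phred(p):
    """Q-score for an error probability p: Q = -10 * log10(P)."""
    return -10 * math.log10(p)


def percent_q30(quality_string, offset=33, threshold=30):
    """Percentage of bases in one FASTQ quality string with Q >= threshold.

    offset=33 assumes Phred+33 encoding (an assumption about the input file).
    """
    scores = [ord(ch) - offset for ch in quality_string]
    if not scores:
        return 0.0
    return 100 * sum(score >= threshold for score in scores) / len(scores)
```

For example, in the quality string `"IIII5555"` the character `I` encodes Q40 and `5` encodes Q20, so `percent_q30("IIII5555")` returns 50.0.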
Cluster Density (K/mm²)
The cluster density is the density of sequencing clusters on the flow cell (in thousands per mm²) after clonal amplification. This value is directly correlated to the final library concentration loaded onto the Illumina flow cell. The cluster density is a critically important metric that influences run quality, reads passing filter, Q30 scores, and total data output. Loading a flow cell with a library concentration that is too high will cause overclustering, which makes it difficult for the camera to focus on individual clusters and may cause some clusters to fail QC. Overclustering leads to poor run performance, lower Q30 scores, and lower total data output because bad quality clusters are filtered out. Underclustering occurs when the library concentration is too low and results in lower data output; however, data quality remains high because the camera is able to focus on each cluster. Optimal clustering is achieved by following the manufacturer's loading recommendations and performing the required library QC. This ensures that clusters are spaced closely enough to achieve the best data output without interfering with the camera focus (Figure 4).
Clusters Passing Filter (%PF)
In Illumina clustering, a single molecule should generate a single cluster with a clear signal for the base being sequenced. The %PF is the percentage of clusters that passed the onboard “Chastity filter” of the Illumina instrument. The clusters that do not pass this filter are removed from downstream analysis and discarded from the final data. The %PF is an indication of signal purity from each cluster and is adversely affected by poor quality libraries or overclustering. Overclustered flow cells generally have higher numbers of overlapping clusters, leading to poor template generation (clustering) and a subsequent decrease in the %PF. Reduced final yield is a by-product of lower %PF.
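The link between %PF and final yield can be made concrete with a small calculation. This is a sketch only: the function names are hypothetical, and the cluster counts in the usage note are invented for illustration, not taken from a real run.

```python
def percent_pf(clusters_passing_filter, clusters_raw):
    """Percentage of raw clusters that passed the chastity filter."""
    if clusters_raw <= 0:
        raise ValueError("clusters_raw must be positive")
    return 100 * clusters_passing_filter / clusters_raw


def yield_after_filter_gb(clusters_raw, pf_percent, cycles):
    """Yield in Gb once non-PF clusters are discarded (one base per cluster per cycle)."""
    return clusters_raw * (pf_percent / 100) * cycles / 1e9
```

With an assumed 20 million raw clusters and 300 cycles, a %PF of 90 gives about 5.4 Gb, whereas a %PF of 60 gives about 3.6 Gb from the same flow cell.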
Ion Torrent
The sequencing quality metrics for an Ion Torrent NGS run can be viewed in real-time on the Torrent Suite server where the run plan was created. A detailed PDF report is also generated at the completion of a sequencing run and contains important QC metrics displayed in graphical or table format.
ISP Loading (%)
The percentage of chip wells that contain an Ion Sphere Particle (ISP). The ISP Density image is a visual representation of well-loading distribution across the physical chip surface (Figure 5). Red indicates areas of high loading and blue indicates areas of low loading. Loading of ≥80% is generally considered optimal. ISP loading density can be affected by template preparation on the Ion Chef, Ion OneTouch 2, or Ion OneTouch ES instruments. Bubbles and missing chip blocks or tiles can also affect ISP loading density.
Total bases (Gb)
This metric describes the throughput of the sequencing run and reports the total number of filtered and trimmed bases in the output BAM file, expressed in gigabases (Gb). It is equivalent to the Illumina Yield metric and correlates with the size of the Ion chip used for the sequencing run. Bigger chips such as the Ion 550 have more wells and can therefore hold more ISPs, which produces higher yields of data.
Key Signal
The key signal is the average signal for all library ISPs containing the library key (TCAG). The library key is a short, known sequence of bases (TCAG) added to the library during templating. All four nucleotides are incorporated, and the library key signal is determined, during the first eight nucleotide flows of the sequencing run. The key signal measures the number of templates per ISP, and therefore the efficiency of the templating reaction. During a sequencing run, the voltage signal drops as the run nears completion because of the reaction conditions, and it eventually becomes indistinguishable from the background noise. As the signal drops closer to the background noise, the quality of the read decreases and more 3′ quality trimming occurs. Therefore, the higher the starting key signal, the smaller the impact of signal droop and the less 3′ quality trimming is expected.
Clonal ISPs (%)
Clonal ISPs (%) is the percentage of clonal ISPs on a chip. An ISP is clonal if all its DNA fragments are cloned from a single original library template. All the fragments on such an ISP are identical and produce the exact same signal as each nucleotide is flowed in turn across the chip, thereby amplifying the voltage. This percentage is calculated by dividing the number of ISPs with a single DNA template by the total number of ISPs. If an ISP contains fragments from more than one template, it is said to be polyclonal. Polyclonal ISPs are filtered out of the final data because the nucleotides may differ between templates and lead to poor quality data.
Total Reads & Usable Reads
The total reads value is the total number of reads that are written to sample or no-match output BAM files. This is also defined as the total number of reads making up the final library. These reads can be considered as passing the quality filters in Table 1 since filtered reads are not included in this count.
QUALITY FILTER | DESCRIPTION |
---|---|
Polyclonal | Filters reads from ISPs with >1 unique library template population. Occasionally, low or unexpected signal ISPs can also get caught in this filter. |
Low Quality | Filters reads with unrecognizable key signal, low signal quality, and reads trimmed to <25 bases. |
Primer Dimer | Filters reads where no or only a very short sequencing insert is present. Reads that, after P1 adapter trimming, have a trimmed length of <25 bases are considered primer dimers. |
Table 1: Description of the quality gates used to filter out poor quality reads.
The usable reads are calculated as the percentage of library ISPs that pass the polyclonal, low quality, and primer-dimer filters. This percentage is calculated by dividing the library reads passing filter (total reads) by the number of library ISPs identified before filtering (Figure 6).
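The usable reads calculation described above can be sketched as follows. The function name and the counts in the usage note are hypothetical; Torrent Suite reports these values directly, so this only illustrates the arithmetic.

```python
def usable_reads(library_isps, polyclonal, low_quality, primer_dimer):
    """Return (total_reads, usable_percent) from the three filter counts in Table 1.

    library_isps is the number of library ISPs identified before filtering.
    """
    if library_isps <= 0:
        raise ValueError("library_isps must be positive")
    filtered = polyclonal + low_quality + primer_dimer
    total_reads = library_isps - filtered  # reads passing all three filters
    return total_reads, 100 * total_reads / library_isps
```

For an assumed chip with 1,000,000 library ISPs, of which 250,000 are polyclonal, 100,000 are low quality and 50,000 are primer dimers, `usable_reads(1_000_000, 250_000, 100_000, 50_000)` returns `(600000, 60.0)`.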
Sample-level QC metrics
FastQC is a popular software tool used to evaluate the quality of raw sequence data coming from high-throughput NGS pipelines. Individual sample data is imported from FASTQ, BAM or SAM files and an overview of basic quality control metrics is provided via several different analyses (called modules). The output from FastQC is an HTML report that can be viewed in your browser. The report contains one result section for each FastQC module. We will cover a few of the most important modules below.
Per Base Sequence Quality
A box-and-whisker plot displaying aggregated quality scores (Q-scores) at each base position across all reads in the data file. The blue line is the mean quality score at each base position; the higher the line, the better the quality of the base call. The red line within each yellow box represents the median quality score at that position. The yellow box is the interquartile range (25th to 75th percentile). The lower and upper whiskers represent the 10th and 90th percentile scores respectively (Figure 7).
It is normal with all Illumina instruments for the median quality score to start out lower over the first 5-7 bases and then rise. The average quality score will steadily drop over the length of the read, with longer reads having worse Q-scores towards the end. With paired-end reads the average quality scores for read 1 will almost always be higher than for read 2 due to the nature of SBS chemistry. A warning will be issued if the lower quartile for any base is less than 10, or if the median for any base is less than 25.
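The warning rule quoted above can be approximated with a short Python sketch. It assumes Phred+33 quality strings and uses the standard library's `statistics.quantiles` with the "inclusive" method, which may not match FastQC's own percentile calculation exactly; the function names are hypothetical.

```python
import statistics


def per_base_quality(quality_strings, offset=33):
    """Summarise quality per read position.

    Returns a list of (position, mean, lower_quartile, median) tuples,
    with 1-based positions. offset=33 assumes Phred+33 encoding.
    """
    max_len = max(len(q) for q in quality_strings)
    summary = []
    for pos in range(max_len):
        # Only reads long enough to cover this position contribute a score.
        scores = [ord(q[pos]) - offset for q in quality_strings if len(q) > pos]
        if len(scores) >= 2:
            lower_q = statistics.quantiles(scores, n=4, method="inclusive")[0]
        else:
            lower_q = scores[0]
        summary.append((pos + 1, statistics.mean(scores), lower_q, statistics.median(scores)))
    return summary


def positions_with_warning(summary):
    """Positions where the lower quartile is < 10 or the median is < 25."""
    return [pos for pos, _mean, lower_q, median in summary
            if lower_q < 10 or median < 25]
```

For the four toy quality strings `["II", "I#", "I#", "I#"]`, position 1 is Q40 in every read while position 2 is mostly Q2, so `positions_with_warning(per_base_quality(...))` returns `[2]`.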
Per Base Sequence Content
This plot reports the percentage of bases called for each of the four nucleotides at each position across all reads in the file (Figure 8). In a diverse library such as those prepared for whole-genome sequencing, the proportion of each of the four bases should remain relatively constant over the length of the read with %A=%T and %G=%C, as this reflects the overall amount of these bases in the genome. Libraries that are less diverse (e.g. amplicon libraries) will show a skewed proportion of the nucleotide bases. This may be displayed as a warning in FastQC; however, it is normal for such libraries and not an indication of any issues.
Per Sequence Quality Scores
A plot of the total number of reads on the y-axis against the average Q-score over the full length of each read on the x-axis. It allows you to easily see if a subset of your reads have universally low quality scores. You may sometimes see a subset of reads that have slightly lower quality scores, which could be due to poor imaging if they lie on the edge of the field of view. This is acceptable, but such reads should only represent a small percentage of the total. The distribution of average read quality should be fairly tight in the upper range of the plot, indicating that the majority of reads have high quality scores (Figure 9).
If a large proportion of the reads have overall poor quality, then this could indicate a more severe problem, possibly with just a section of the run (e.g. bad tiles on the flow cell).
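A per-position base composition like the one plotted by this module can be computed with a few lines of Python. This is a minimal sketch with a hypothetical function name; it ignores N calls and other non-ACGT characters when computing percentages, which is an assumption and not necessarily how FastQC handles them.

```python
from collections import Counter


def per_base_content(reads):
    """Return one dict per read position mapping A, C, G, T to a percentage."""
    max_len = max(len(read) for read in reads)
    rows = []
    for pos in range(max_len):
        # Count the base observed at this position in every read long enough.
        counts = Counter(read[pos] for read in reads if len(read) > pos)
        total = sum(counts[base] for base in "ACGT")
        rows.append({base: (100 * counts[base] / total if total else 0.0)
                     for base in "ACGT"})
    return rows
```

For the toy reads `["ACGT", "ACGA", "TCGT", "GCGA"]`, the first position is 50% A, 25% G and 25% T, while the second position is 100% C, the kind of skew an amplicon library would show.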
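The distribution behind this plot can be sketched by averaging the Q-scores of each read and counting reads per mean score. The sketch assumes Phred+33 quality strings; the function names and the Q20 cut-off in `fraction_below` are illustrative assumptions.

```python
from collections import Counter


def mean_quality_histogram(quality_strings, offset=33):
    """Map each whole-number mean Q-score to the number of reads with that mean."""
    histogram = Counter()
    for quality in quality_strings:
        if not quality:
            continue  # skip empty records
        mean_q = sum(ord(ch) - offset for ch in quality) / len(quality)
        histogram[int(mean_q)] += 1  # round down to a whole Q-score bin
    return dict(sorted(histogram.items()))


def fraction_below(histogram, threshold=20):
    """Fraction of reads whose mean Q-score bin is below threshold."""
    total = sum(histogram.values())
    if total == 0:
        return 0.0
    return sum(n for q, n in histogram.items() if q < threshold) / total
```

For `["IIII", "IIII", "####"]` the histogram is `{2: 1, 40: 2}`, showing one read of universally low quality alongside two high-quality reads.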
Sequence Duplication Levels
In a highly diverse library, most sequences should occur only once in the final set of reads. A low level of duplication may indicate a very high level of coverage of the target sequence, but a high level of duplication is more likely to indicate enrichment bias (e.g. PCR over-amplification). PCR duplicates misrepresent the true proportion of sequences in your starting material and can lead to false-positive variant calls and skewed allele frequencies. The blue line in the graph shows the percentage of reads (y-axis) that are present at a given duplication level (x-axis).
For whole-genome sequencing data it is expected that almost all of your reads will be unique (appearing only once in the data). This indicates a highly diverse library that was not over-sequenced. In whole-transcriptome RNA-Seq, some transcripts will be highly abundant and others lowly abundant due to differences in gene expression. Duplicate reads are expected for high abundance transcripts, so this module may be flagged as failed by FastQC even though the duplication is expected in this case (Figure 10).
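The idea of a duplication level can be illustrated by counting exact-match reads. This is a simplified sketch with a hypothetical function name: it counts every read in memory and does not reproduce the sampling or binning that FastQC applies to large files.

```python
from collections import Counter


def duplication_levels(reads):
    """Map duplication level (copies of a sequence) to the percentage of reads at that level."""
    copies_per_sequence = Counter(reads)
    reads_at_level = Counter()
    for copies in copies_per_sequence.values():
        # A sequence seen `copies` times contributes `copies` reads to that level.
        reads_at_level[copies] += copies
    total = len(reads)
    return {level: 100 * n / total for level, n in sorted(reads_at_level.items())}
```

For `["AAA", "AAA", "CCC", "GGG"]` the result is `{1: 50.0, 2: 50.0}`: half of the reads are unique and half belong to a sequence seen twice. A whole-genome library would be expected to place almost all reads at level 1.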
References:
- Sequencing Analysis Viewer v2.4 Guide: https://support.illumina.com/content/dam/illumina-support/documents/documentation/software_documentation/sav/sequencing-analysis-viewer-v-2-4-software-guide-15066069-04.pdf
- How to assess an Ion S5/Ion GeneStudio S5 sequencing run report: https://assets.thermofisher.com/TFS-Assets/LSG/manuals/MAN0017983_Assess_Ion_S5_Sequencing_Run_UB.pdf
- FastQC: https://www.bioinformatics.babraham.ac.uk/projects/fastqc/