This web page was produced as an assignment for an undergraduate course at Davidson College.

DNA damage is a major cause of sequencing errors, directly confounding variant identification.


“DNA damage is a major cause of sequencing errors, directly confounding variant identification” is a study published in Science showing that DNA damage directly causes most erroneous identifications of somatic variants and other low-frequency variants (Chen, et al., 2017). An earlier study of high-quality human genomic DNA had shown that certain library preparations cause oxidative damage. Chen et al. set out to develop a method for measuring the DNA damage present in sequencing runs. Mutagenic damage results in a global imbalance between the variants found in read 1 (R1) and read 2 (R2) during paired-end sequencing, so the researchers needed a way to measure this imbalance. They “devised an analysis strategy based on this imbalance to deconvolute both the origin and orientation of variants and computed a metric, the Global Imbalance Value (GIV) score, that is indicative of damage” (Chen, et al., 2017). Using the GIV score, the researchers concluded that widely used data sets such as The Cancer Genome Atlas (TCGA) and the 1000 Genomes Project contain widespread damage that is directly linked to sequencing errors. Furthermore, these sequencing errors affect the identification of somatic variants in these data sets.

Explanation of figures:


Fig 1. GIV score. (replicated Chen, et al., 2017).

This first figure does not ask a question; instead, it supports the principle behind the GIV score. Figure 1A illustrates Illumina paired-end sequencing. This technique reads the two ends of the same DNA molecule, known as the “paired ends”: one end is sequenced, then the molecule is turned around and the other end is sequenced. The two resulting sequences are called the “paired-end reads.” Oxidative damage affects only one base of a pair and leads to an excess of G-to-T transversion errors in Read 1 (R1). When R1 reads carry an excess of G-to-T variants, R2 reads carry an excess of C-to-A transversion errors (the reverse complement of G-to-T). The GIV score is important because it uses this imbalance to measure DNA damage. Figure 1A juxtaposes this diagram of DNA damage with one showing true variation, which does not cause an imbalance. The take-home message of Figure 1B is that oxidative damage is the cause of the excess G-to-T variants in the unrepaired DNA samples.
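The strand relationship described above can be sketched in a few lines of Python. This is only an illustration: the function names and the `G_T`-style labels are assumptions made for this example and are not taken from the authors' code.

```python
# Watson-Crick pairing, used to view a substitution from the opposite strand.
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def opposite_strand_substitution(ref, alt):
    """Return a single-base substitution as it appears on the opposite strand.

    For a single base, the reverse complement is simply the complement.
    """
    return COMPLEMENT[ref], COMPLEMENT[alt]


def substitution_classes():
    """List the 12 possible single-nucleotide substitution classes."""
    bases = "ACGT"
    return [f"{ref}_{alt}" for ref in bases for alt in bases if ref != alt]


# A G-to-T change read on one strand is a C-to-A change on the other.
g_to_t_partner = opposite_strand_substitution("G", "T")
```

Because every class has exactly one partner on the opposite strand, a true variant should show up as one class in R1 and its partner class in R2 in roughly equal numbers, which is why an excess on one side points to damage.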


Fig 2. GIV scores (y axis) for the 12 nucleotide substitution classes (x axis). (replicated Chen, et al., 2017).

The question being asked was how widespread the damage that leads to an excess of G-to-T variants is in The Cancer Genome Atlas (TCGA) and the 1000 Genomes Project data sets. In the GIV analysis, a GIV score above 1.5 is defined as damaged, while a GIV score of 1 is defined as undamaged. The 1000 Genomes Project data set shows substantial damage-related G-to-T imbalance (GIVG_T score ≥ 1.5). In the TCGA data set, most of the data show G-to-T imbalance (GIVG_T score ≥ 2). A GIV score above 1.5 means that “there are 1.5 times more variants on R1 than on R2, suggesting that at least one-third of the variants are erroneous” (Chen, et al., 2017). The researchers concluded that in most of these public data sets, at least one-third of the G-to-T variant reads result from damage.
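The arithmetic behind the “one-third” statement can be illustrated with a minimal Python sketch. It assumes the GIV score is simply the ratio of R1 to R2 variant counts, as the quotation describes; the paper's actual computation may involve additional steps, and the function names and counts below are hypothetical.

```python
def giv_score(r1_count, r2_count):
    """Ratio of variants seen on R1 to variants seen on R2 (simplified assumption)."""
    if r2_count <= 0:
        raise ValueError("r2_count must be positive")
    return r1_count / r2_count


def min_erroneous_fraction(giv):
    """Lower bound on the fraction of R1 variants that are excess.

    Assumes the R2 count reflects the balanced level expected from true variation,
    so the excess is (R1 - R2) / R1 = 1 - 1 / GIV.
    """
    if giv <= 1.0:
        return 0.0
    return 1.0 - 1.0 / giv


# Hypothetical counts: 300 G-to-T variants on R1 versus 200 on R2.
score = giv_score(300, 200)                 # 1.5
fraction = min_erroneous_fraction(score)    # about 0.33, i.e. one-third
```

Under the same assumption, a score of 2 would mean that at least half of the R1 variants are excess, which is why the higher TCGA scores are more worrying.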



Fig 3. Target enrichment experiment. (replicated Chen, et al., 2017).

Their data showed that the damage resulting in G-to-T transversions is random. Errors caused by random damage should therefore appear at low allelic frequency. Somatic variants typically occur at low frequency, whereas germline variants occur at higher frequency. The researchers therefore tested how the damage affects somatic variants, since germline variants should be unaffected. They did so by repeating the oxidative damage experiments using common library preparation procedures. Figure 3A shows that without DNA repair, the somatic variant frequency of G-to-T transversions in R1 is higher than with DNA repair. The only difference between 3B and 3C is that 3C includes variant frequencies only for R1, which is more G-to-T specific than including both R1 and R2. These two figures demonstrate that over 75% of the G-to-T variant positions at the lower frequencies can be removed by DNA repair. This supports the idea that those positions were erroneous and resulted from oxidative damage. These data show that DNA damage directly affects the accuracy of identifying somatic variants, which occur at very low and low-to-moderate frequencies.


Fig 4. Variants identified in TCGA data sets. (replicated Chen, et al., 2017).

The next question these researchers looked at was “the extent that damage affects somatic variant calls in cancer samples [using] Varscan, popular analysis tool, to identify germline and somatic variants for all TCGA tumor samples with matched tumor-normal pairs” (Chen, et al., 2017). Returning to the concept in Figure 1A (true variation leads to no imbalance), the researchers examined the global balance of somatic mutation calls between R1 and R2 reads. The imbalance is easy to see visually as the G-to-T damage level increases. In addition, the fraction of G-to-T variants increased with the GIVG_T score-based damage level, unlike the other variant types. Figures 4C and 4D contrast with each other. Figure 4D shows no excess in R1 reads because the G-to-T variants there are germline variants (which are unaffected because they occur at higher frequency). Figure 4C, on the other hand, shows an excess in R1 reads for high-confidence G-to-T somatic variants (which is supported by Figure 3). Figure 4E shows a positive correlation between the percentage of estimated false-positive somatic variants and the GIVG_T score. The main take-home message of Figure 4 is that there is a correlation between DNA damage and false-positive variant calls. From this information, the researchers deduced that erroneous identification of somatic variants is caused by DNA damage.


There are several major conclusions from this paper. In Figure 2, GIV scores revealed that both The Cancer Genome Atlas (TCGA) and the 1000 Genomes Project contain extensive DNA damage that produces excess G-to-T variants. In Figure 3, target enrichment experiments suggest that damage mostly affects the ability to identify low-frequency variants, such as somatic variants. Perhaps even more important, DNA repair was shown to almost completely remove the oxidative damage that occurs during common library preparation procedures. This result supported the GIV score as an indicator of damage that affects somatic variant calls. Figure 4 examines high-confidence mutation calls in the TCGA data sets (Chen, et al., 2017). The paper points out that more attention needs to be paid to differentiating between true and artificial somatic variants, both in widely used data sets and in future projects that rely on sequencing samples. Detailed criteria must be put in place to reduce the impact of DNA damage on variant calling.

My Opinions of the Paper:

My main issue with this paper arises after Figure 4. Throughout the study, readers can see the data being discussed in the various figures. It seems a bit suspicious to me that the authors didn’t create a figure after evaluating how the damage affects current TCGA reference variant files. I’m guessing that the data sets predicted to have heavily damaged variants were not nearly as visually pleasing as the weakly damaged data sets. Nevertheless, the researchers concluded that damage artifacts occur even among high-confidence calls in large-scale variant calling. Overall, I’m glad that this study provided evidence that scientific protocols for differentiating between true and artificial somatic variants need to improve. However, I’m still not completely satisfied with their proposed solution, which feels like a “catch-22” situation. In attempting to further differentiate between true and artificial somatic variants, scientists could increase the false-negative rate. This makes me wonder: which is the lesser of two evils? Is it accepting errors caused by DNA damage, or creating variant-calling algorithms that could remove true variants?



Chen, Lixin, Pingfang Liu, Thomas Corwin Evans, and Laurence Michele Ettwiller. "DNA damage is a major cause of sequencing errors, directly confounding variant identification." Science 355 (2017): 752-56. Web.

© Copyright 2017 Department of Biology, Davidson College, Davidson, NC 28035