This web page was produced as an assignment for an undergraduate course at Davidson College.

DNA damage is a pervasive cause of sequencing errors, directly confounding variant identification



Article Summary:

In this article, Chen et al. examine the validity of low-frequency variants reported in large databases, particularly The Cancer Genome Atlas (TCGA) and the 1000 Genomes Project. Their concern is that many of these variants could result from oxidative damage to the DNA during sequencing rather than from real differences in the genome. They look specifically at G to T and C to A variants and use paired-end sequencing to determine whether these variants come from oxidative damage or are valid variants in the genome. Their approach asks two questions: whether G to T and C to A variants are more common than other nucleotide changes, and whether these variants are imbalanced between read 1 and read 2 of the same sample. The paper concludes that these variants are indicative of DNA damage and that many of the low-frequency somatic variants in the genome databases are not real variants but a result of improper sample treatment during sequencing.

Figure 1:

Figure 1

Figure 1. (A) Schematic of how DNA variants arise through oxidative damage during sequencing. (B) Evidence showing that during paired-end sequencing the

Figure 1 presents the overall idea behind the GIV score and its link to DNA damage. The left column of Figure 1A shows how oxidative damage produces errors in paired-end sequencing. In the example, some of the G's in a sequence are oxidatively damaged. After adaptor ligation, PCR amplification copies the damaged strand, and the polymerase pairs each damaged G with an A instead of a C, so later copies carry a T where the original strand had a G. No cell is involved at this step; the change is fixed in place by the amplification itself. Once the strands are copied and sequenced, read 1 and read 2 no longer agree, which leads to the mapping and identification of false variants. This imbalance is the basis for the article's GIV (Global Imbalance Value) score: the greater the imbalance between the reads, the greater the damage to the DNA. The right side of the figure shows paired-end sequencing of undamaged DNA, which ends with no imbalance between read 1 and read 2. Figure 1B shows the fraction of G to T or C to A variants under conditions that either allow or do not allow DNA repair. In the left panel, the conditions that do not allow DNA repair show more G to T variants than those that do. This indicates that when the DNA cannot be repaired after oxidative damage, G to T variants increase in only one of the two reads, which is an imbalance. The right panel of Figure 1B shows the same result for C to A variants, again demonstrating an imbalance between the two reads after oxidative damage.
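The read imbalance described above can be written as a simple ratio. The sketch below is only an illustration: it assumes that a GIV score for G to T is the number of G to T variants seen in read 1 divided by the number seen in read 2. The function name `giv_score` and the example counts are made up for this page and are not taken from Chen et al.

```python
def giv_score(read1_count: int, read2_count: int) -> float:
    """Ratio of variant counts in read 1 to read 2 (assumed GIV definition).

    A value near 1 means the two reads agree (no imbalance); values well
    above 1 mean the variant type is enriched in read 1.
    """
    if read2_count <= 0:
        raise ValueError("read2_count must be positive to form a ratio")
    return read1_count / read2_count


# Hypothetical counts of G to T variants, not data from the paper.
balanced = giv_score(read1_count=100, read2_count=100)
imbalanced = giv_score(read1_count=300, read2_count=100)
print(balanced, imbalanced)  # 1.0 3.0
```

Under this assumption, undamaged DNA would give a score close to 1, and a damaged sample would give a larger score because the extra G to T changes show up in only one read.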

Figure 2:

Figure 2

Figure 2. (A) GIV scores calculated from 1000 Genomes Project data. (B) GIV scores from The Cancer Genome Atlas.

In Figure 2, Chen et al. show that C to A and G to T are low-frequency variants that occur in both The Cancer Genome Atlas and the 1000 Genomes Project databases. A GIV score greater than 1.5 is indicative of DNA damage, and the GIV scores for G to T and C to A changes are greater than 1.5 in the data from both databases. This supports their hypothesis that oxidative damage during sequencing, rather than real mutation, could be the cause of these G to T and C to A variants.
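The 1.5 cutoff can be applied as a simple filter. This is a minimal sketch that assumes GIV scores have already been computed for each variant type; the helper `flag_damaged` and the scores in `example_scores` are hypothetical and chosen only to illustrate the threshold.

```python
from typing import Dict, List

DAMAGE_THRESHOLD = 1.5  # cutoff quoted in the text above


def flag_damaged(giv_by_variant: Dict[str, float],
                 threshold: float = DAMAGE_THRESHOLD) -> List[str]:
    """Return the variant types whose GIV score exceeds the threshold."""
    return sorted(v for v, score in giv_by_variant.items() if score > threshold)


# Hypothetical scores for one data set, not values from the paper.
example_scores = {"G>T": 2.4, "C>A": 1.8, "A>G": 1.0, "T>C": 0.9}
print(flag_damaged(example_scores))  # ['C>A', 'G>T']
```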

Figure 3:

Figure 3

Figure 3. (A) G to T variant frequencies with and without DNA repair, for both read 1 and read 2. (B) Number of variants per Mb, divided by nucleotide-to-nucleotide variant type and percent frequency. (C) Same data as part B, but including only read 1.

In Figure 3, Chen et al. demonstrate that the frequency of G to T variants differs between samples where DNA repair is allowed after oxidative damage and samples where it is not. Figure 3A shows that read 1 has more G to T variants when DNA repair is disabled than when it is allowed. Without DNA repair there is an imbalance between the reads, which supports the idea that sequencing oxidatively damaged DNA produces false variants. Based on Figure 1, it is implied that read 2 would show C to A variants, but these data are not shown. Figure 3B highlights the higher number of positions per megabase with G to T and C to A variants compared to the other possible nucleotide changes. The different panels also show that the number of G to T and C to A variants increases as variant frequency decreases. Figure 3C zooms in on the data from read 1 alone. Compared with the two reads together, read 1 has more G to T variants, and the authors attribute this imbalance to oxidative damage. The point is not just that G to T variants are more common overall, but that they are more common in one read than in the other, so there must be an imbalance between the reads, which the authors hypothesize results from oxidative damage.
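The comparison in Figure 3 depends on counting each variant type separately for read 1 and read 2. The sketch below shows one way such a tally could be done, assuming the mismatches have already been extracted as (read number, reference base, observed base) tuples. The function `tally_by_read` and the `observations` list are hypothetical and are not the authors' pipeline.

```python
from collections import Counter


def tally_by_read(observations):
    """Count each (read number, substitution) pair.

    observations: iterable of (read_number, ref_base, alt_base) tuples,
    where read_number is 1 or 2.
    """
    counts = Counter()
    for read_number, ref, alt in observations:
        if ref != alt:  # skip positions that match the reference
            counts[(read_number, f"{ref}>{alt}")] += 1
    return counts


# Hypothetical mismatches, not data from the paper.
observations = [
    (1, "G", "T"), (1, "G", "T"), (1, "G", "T"), (1, "A", "G"),
    (2, "G", "T"), (2, "C", "A"), (2, "C", "A"), (2, "A", "G"),
]
counts = tally_by_read(observations)
print(counts[(1, "G>T")], counts[(2, "G>T")])  # 3 1
```

In this made-up example G to T appears three times in read 1 but only once in read 2, which is the kind of imbalance the figure is meant to show, while A to G is balanced between the reads.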

Figure 4:

Figure 4

Figure 4. (A) Tumor data from The Cancer Genome Atlas (TCGA) show a wide range of GIV scores for G to T damage. (B) Number of somatic variants of each nucleotide-to-nucleotide type for various TCGA data sets, ordered by increasing amount of damage. (C) Same as B but using only high-confidence variants from VarScan. (D) Same as B but using only germline variants. (E) Estimated false G to T variant calls based on GIV score.

In Figure 4, Chen et al. want to demonstrate that oxidative damage affects these large-scale databases. Figure 4A shows that the data sets span a range of GIV scores, so the amount of damage is not consistent. Some data sets could contain many false positives and others few, and this inconsistency calls into question their reliability for G to T and C to A variants. Figure 4B zooms in on read 1 data from somatic cells and shows the VarScan score of each variant type, with the data sets in the same order as in Figure 4A. The VarScan score indicates the percentage of somatic variants of that mutation type. This panel highlights that G to T and C to A variants make up a large percentage of the somatic variants found in the database. Figure 4C narrows the data from Figure 4B to high-confidence variants only. Here the G to T and C to A variants are far fewer than among the total variants, although significance is not discussed. This suggests that where confidence in variant validity is higher, there are fewer G to T and C to A variants, so there must be less damage in these reads. Figure 4D is the same type of plot but for germline variants only, and it shows that G to T and C to A variants are more frequent in somatic cells. Figure 4E shows the G to T variants ordered by GIV score, along with the estimated percentage of those variants in The Cancer Genome Atlas that are valid. Most of the data cluster around 70%, which suggests that a substantial share of the low-frequency variants in large databases such as The Cancer Genome Atlas may not be valid.
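To see how a GIV score could be turned into an estimate of false calls, here is a back-of-the-envelope sketch. It is my own simplification, not the estimator used by Chen et al.: it assumes the GIV score is the read 1 to read 2 ratio of G to T counts, that undamaged DNA would give equal counts in both reads, and that every extra read 1 variant is a damage artifact. The name `estimated_false_fraction` is hypothetical.

```python
def estimated_false_fraction(giv: float) -> float:
    """Fraction of read 1 variants attributed to damage under the
    equal-counts assumption: (R1 - R2) / R1 = 1 - 1 / GIV.

    Scores at or below 1 show no read 1 excess, so the estimate is 0.
    """
    if giv <= 0:
        raise ValueError("GIV score must be positive")
    return max(0.0, 1.0 - 1.0 / giv)


# Hypothetical GIV scores, not values from the paper.
for giv in (1.0, 2.0, 4.0):
    print(giv, estimated_false_fraction(giv))
```

With these assumptions a score of 2 would mean that half of the read 1 G to T calls are artifacts, and a score of 4 would mean three quarters are. Real data would need a more careful model, since true variants and damage are mixed in both reads.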

Conclusions/My Opinion:

Overall, this paper questions the accuracy of large genome databases with regard to low-frequency variants, because there is little regulation of the conditions under which a genome is sequenced before it enters a database. The paper begins with the authors' own method and emphasizes the need for the GIV score. In Figure 1, Chen et al. ask whether oxidation affects the results of sequencing. By comparing read 1 and read 2 from paired-end sequencing of DNA that is allowed to repair and DNA whose repair is disabled, they conclude that oxidative damage does affect G to T and C to A variants. In Figure 2, they turn to two databases, the 1000 Genomes Project and The Cancer Genome Atlas, and ask whether G to T and C to A variants are more common in DNA whose GIV score indicates damage. The data show that this holds for both databases. They then ask whether G to T and C to A variants are simply more common or whether a sequencing error is causing an imbalance between the reads. They first measure the number of G to T variants under similar conditions, once with DNA repair allowed and once with it disabled, and find an imbalance between the reads. They then compare G to T and C to A variants with other variant types across variant frequencies. G to T and C to A variants are more common than other types among the low-frequency variants, and even more common in one read than in the other, which further suggests imbalance. Together these results support the GIV scores and variant frequencies shown in Figure 2.

Finally, in Figure 4 they focus on data from The Cancer Genome Atlas to determine whether these variants are more common in somatic cells, whether they are just as common in data considered to be of higher confidence, and how this relates to the validity of low-frequency variants. They conclude that oxidative damage is common, at varying levels, throughout the data currently in these databases, and that this lowers confidence in low-frequency variants.

I think this article raises a very good point: we should not assume that all of the data in these databases are valid, and there is a need for more standards for the sequencing data that go into them. The paper provides a substantial amount of evidence, but not a single figure shows statistical significance. This made me question how valid the data are, or whether the figures were arranged to show what the authors wanted. The authors could also ask whether effects other than oxidative damage skew sequencing data during sample preparation, and another direction would be to look at more ways sequencing errors can occur. I was also troubled by the claims in Figure 4 about the difference between somatic and germline cells. If the problem is oxidative damage during DNA sequencing, why are there not many G to T and C to A variants in germline cells? The figure shows that germline cells still have G to T and C to A variants, but not many. My own logic was that a true sequencing error would occur in both somatic and germline cells and affect both equally. Seeing significance would have been really helpful on this figure as well. Finally, I did not like that Figure 4E shows a predicted number of false-positive somatic variants. The whole paper tries to show that these variants are false positives and that the databases contain invalid variants, so I would prefer to see real data rather than the authors' own estimates supporting their claims.


Chen, Lixin, Pingfang Liu, Thomas Corwin Evans, and Laurence Michele Ettwiller. "DNA damage is a pervasive cause of sequencing errors, directly confounding variant identification." Science 355 (2017): 752-56. Web.

*Link in the title of this page*

Stephanie's Home Page


© Copyright 2017 Department of Biology, Davidson College, Davidson, NC 28035