This web page was produced as an assignment
for an undergraduate course at Davidson College.
Anthony Ciancone's Genomics
Second Assignment Page
DNA damage is a pervasive cause of sequencing errors, directly confounding variant identification
Lixin Chen, Pingfang Liu, Thomas C. Evans Jr.,* Laurence M. Ettwiller*
The authors of this paper argue that DNA is
damaged at a low rate
and that some of the damaged DNA they have identified in genomic libraries
might be mistaken for somatic mutations. Point mutations in data sets are
identified by deep sequencing and analysis, but the detection threshold for
these single point mutations is the same as for point DNA damage.
They looked first at the global imbalance of variants
detected between the two reads of a pair (R1 and R2). They used a Global
Imbalance Value (GIV) to quantify this; imbalance is directly proportional to
DNA damage. A GIV above 1 indicates damaged DNA. They prepped DNA damaged by
oxidation to 8-oxo-dG, which results in G-to-T transversions after
amplification. They also tried to repair DNA with an enzyme cocktail that is
standard in labs. They confirmed their methodology with some tests on
publicly available genomes.
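To make the idea of a GIV concrete, here is a minimal Python sketch. It assumes the score is simply the G-to-T variant rate on R1 divided by the same rate on R2; the exact formula used in the paper may differ. The function names and the toy counts are hypothetical and only illustrate the imbalance idea.

```python
from collections import Counter


def variant_rate(observations, ref_base, alt_base):
    """Fraction of reference `ref_base` positions that were read as `alt_base`.

    `observations` is an iterable of (reference_base, read_base) pairs.
    """
    counts = Counter(observations)
    total_ref = sum(n for (ref, _), n in counts.items() if ref == ref_base)
    if total_ref == 0:
        return 0.0
    return counts[(ref_base, alt_base)] / total_ref


def global_imbalance_value(r1_obs, r2_obs, ref_base="G", alt_base="T"):
    """Assumed GIV: variant rate on R1 divided by the variant rate on R2."""
    r1_rate = variant_rate(r1_obs, ref_base, alt_base)
    r2_rate = variant_rate(r2_obs, ref_base, alt_base)
    if r2_rate == 0:
        # No variants on R2: balanced only if R1 has none either.
        return float("inf") if r1_rate > 0 else 1.0
    return r1_rate / r2_rate


# Hypothetical toy data: 3% G-to-T on R1 versus 1% on R2.
r1 = [("G", "G")] * 970 + [("G", "T")] * 30
r2 = [("G", "G")] * 990 + [("G", "T")] * 10
giv = global_imbalance_value(r1, r2)
print(f"G-to-T GIV: {giv:.2f}")
```

A true somatic or germline variant should appear at about the same rate on both reads, so its score stays near 1, while damage that affects only one read pushes the score away from 1.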
1000 GP and TCGA
They looked at the 1000 Genomes Project and The
Cancer Genome Atlas and found significant G-to-T damage present in both,
suggesting that up to one third of the identified reads are actually DNA damage.
They then performed experiments on their own data set
(?) and on a cancer probe, applying the same oxidative damage and GIV analysis.
They found that most of the very low frequency variant reads were actually
damaged parts of DNA, confirmed by enzymatic fixing.
They claim that they found 180 false positives, or
about 1 false read per cancer gene.
Varscan/TCGA Data Set
Varscan is the analysis tool used to identify somatic TCGA tumor
variants. An excess of one mutation type suggests DNA damage, similar to
Most of the public data sets showed an excess of G-to-T, especially the ones
predicted by the program to be highly damaged.
They estimated the false positive rate to be 50%
in 78% of tumors analyzed. These false positives strongly correlated with
damage, suggesting confounding results on previously noted somatic variants.
Lung Adenocarcinoma TCGA Data Set
They downloaded publicly available LAC-TCGA data
and looked for damage. They split the data sets up into low-to-moderately
damaged and highly damaged. The highly damaged set contained a moderate
increase in expected damage. The Mutect2 however contained significant
(9%) for either G-to-T or C-to-A damage.
This piece was a real eye-opener because I feel
like I forget too often that researchers in a field agreeing on certain
methodologies does not make those methodologies irrefutably correct. This
paper offers substantial evidence that two of the widely used, publicly available
data sets for genomes contain significant DNA damage. There may be a real
problem with confounding specific somatic variations with actual DNA damage,
complicating a lot of research already done on these subjects.
I do have some questions I would ask about the
authors’ findings, however. For instance, they claim that a significant amount
of damage is due to DNA oxidation. Could it also be that a similar oxidative
mechanism is the cause of the mutations in the first place? Perhaps all this
shows is that DNA can be very susceptible to oxidative damage, inside or
outside of the body. Also, how do they know that 8-oxo-dG operates the same
way in vitro as it does in the body? The authors claim that an enzymatic
cocktail can repair a lot of this DNA damage. How do they know that this
“repairing” does not clean over actual mutations, considering they (as far as
I can tell) do not know the exact mechanism by which it operates? In their
defense, they do acknowledge
that their stringent qualifications for DNA damage may lead to
for actual somatic mutations.
Overall though, this paper was an interesting read
and provides more evidence that the human genome is complicated and that
there is still much work to be done in improving the methodologies used to
study it. The next logical step might then be to test whether different types of
damage are caused by sample preparation.
This figure is split into two panels, both showing a flow chart of the
sequencing techniques used to validate their work. The left side of the figure
depicts what real oxidative damage would look like, whereas the
right side shows what actual somatic mutations would look like. They are
measuring the imbalance in base reads between the R1 and R2 reads, which is the
basis for their GIV score and quantification of DNA damage.
Like figure 1A, this figure is split into two parts, with the left showing data
for G-to-T variants on R1 and R2 reads and the right showing complementary
C-to-A variants on the same reads. On the left graph, R1 reads without
enzymatic fixing show a higher degree of G-to-T variants. The R2 reads on that
graph show no G-to-T imbalance. On the right side of figure 1B, R2 reads
without enzymatic fixing show a higher degree of C-to-A variants. This
imbalance is not present for R1 reads. The authors say that this
complementarity evidences what DNA damage should look like with the
Parts A and B show very similar things, so I will
discuss them together. A and B are both
box and whisker plots which show the log2 GIV score for
all twelve of the DNA nucleotide mutations (e.g. G-to-T, G-to-A, C-to-A,
etc.). Each point on both graphs shows a single GIV score for a sequencing run
of 5 million base pairs. Part A depicts data from the 1000 Genomes Project;
part B, a subset from TCGA. The authors have drawn a black line where a GIV
score of 1.5 would be, indicative of DNA damage. For both data sets, C-to-A and
G-to-T show both high and low GIV scores, indicating damaged DNA data sets. In
part B, the authors also note that enzymatically repaired DNA was included in
the GIV scoring, explaining the bimodal distribution of C-to-A and G-to-T
mutations, with scores both above and below the 1.5 GIV score threshold.
This was data from the enrichment experiment, which involved using a commercial
cancer panel probe to get an accurate read of 151 cancer genes. Part A looks at
the G-to-T variant frequency of R1 and R2 reads at different base positions,
comparing samples treated or not treated with the repairing enzymatic cocktail.
Unrepaired R1 reads appear to contain a higher G-to-T variant frequency than
any other type of read.
B: Four paired
bar graphs, each showing the relative distribution of all 12 of the
variants between repaired and unrepaired reads. Each bar is split up into
relative portions of these variants per megabase. The four graphs are grouped
by how rare the variant was during the experiment, with variant frequency
increasing from left to right. For variants showing up less than 1%
or between 1-5% of the time, unrepaired DNA showed significantly more G-to-T
and C-to-A variants across all reads than repaired DNA. For more common
variant frequencies, this difference was not seen.
C: The same general graphs as in B, but for R1 reads only. Now there is a
significantly increased proportion of only G-to-T reads for unrepaired DNA.
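The grouping by variant frequency can be sketched in a few lines of Python. This is only an illustration: the text names the "<1%" and "1-5%" groups, so the third ">5%" group, the function names, and the toy variant lists are assumptions, and the fractions below are not the paper's results.

```python
from collections import defaultdict


def bin_label(frequency):
    """Assign a variant frequency (0.0 to 1.0) to a frequency group."""
    if not 0.0 <= frequency <= 1.0:
        raise ValueError("frequency must be between 0 and 1")
    if frequency < 0.01:
        return "<1%"
    if frequency <= 0.05:
        return "1-5%"
    return ">5%"  # assumed catch-all group for common variants


def spectrum_by_bin(variants):
    """Return {group: {variant_type: fraction of variants in that group}}.

    `variants` is an iterable of (variant_type, frequency) pairs.
    """
    counts = defaultdict(lambda: defaultdict(int))
    for variant_type, frequency in variants:
        counts[bin_label(frequency)][variant_type] += 1
    spectrum = {}
    for label, by_type in counts.items():
        total = sum(by_type.values())
        spectrum[label] = {t: n / total for t, n in by_type.items()}
    return spectrum


# Hypothetical toy calls: rare G>T and C>A variants dominate without repair.
unrepaired = ([("G>T", 0.004)] * 6 + [("C>A", 0.006)] * 2
              + [("A>G", 0.008)] * 2 + [("A>G", 0.30)] * 4)
repaired = ([("G>T", 0.004)] * 1 + [("C>A", 0.006)] * 1
            + [("A>G", 0.008)] * 2 + [("A>G", 0.30)] * 4)

unrepaired_spectrum = spectrum_by_bin(unrepaired)
repaired_spectrum = spectrum_by_bin(repaired)
for label in ("<1%", ">5%"):
    print(label, unrepaired_spectrum[label], repaired_spectrum[label])
```

Comparing the two spectra group by group mirrors the paired bars: a damage signature shows up as an excess of one variant type in the rare groups of the unrepaired sample only.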
Here the researchers were looking at the TCGA data set for somatic variants using
Varscan, a popular data analysis tool. The graph shows all the 1800
runs in order of increasing GIV score
for G-to-T imbalances. Most data sets went over the 1.5 GIV score threshold,
suggesting widespread DNA damage.
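The ordering and thresholding described here can be sketched as follows. The 1.5 cutoff comes from the text; the run names and scores are made-up placeholders, and the function name is hypothetical.

```python
DAMAGE_THRESHOLD = 1.5  # GIV cutoff mentioned in the text


def summarize_runs(giv_by_run, threshold=DAMAGE_THRESHOLD):
    """Order runs by increasing GIV and report which exceed the threshold."""
    ordered = sorted(giv_by_run.items(), key=lambda item: item[1])
    damaged = [run for run, giv in ordered if giv > threshold]
    fraction = len(damaged) / len(ordered) if ordered else 0.0
    return ordered, damaged, fraction


# Hypothetical G-to-T GIV scores for four sequencing runs.
scores = {"run_a": 1.1, "run_b": 2.4, "run_c": 1.7, "run_d": 0.9}
ordered, damaged, fraction = summarize_runs(scores)
print("ordered:", [run for run, _ in ordered])
print("damaged:", damaged, f"({fraction:.0%} of runs)")
```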
This figure confirms the data presented in part A by presenting a breakdown of
the fraction of each type of mutation present. For R1 reads, G-to-T is more
prevalent in the samples than C-to-A and every other type of mutation. This
includes all the reads/data.
The same as B except for high confidence samples only. These samples were
predicted by the researchers’ algorithm to be highly damaged.
same as C with R1 reads except only looking at germline variants using
No significant DNA damage is noted.
The researchers estimated the false-positive discovery rate of somatic variants
by looking at the G-to-T GIV score. They found a strong correlation of 0.79,
meaning false positives correlated with estimated damage reasonably well.
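For readers who want to see how such a correlation is computed, here is a minimal Python sketch of a Pearson correlation coefficient. Whether the reported 0.79 is a Pearson r or another statistic is an assumption on my part, and the paired values below are invented for illustration only.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical per-tumor values: G-to-T GIV and estimated false positive rate.
giv_scores = [1.0, 1.4, 1.9, 2.6, 3.2]
false_positive_rates = [0.10, 0.35, 0.30, 0.60, 0.70]
r = pearson_r(giv_scores, false_positive_rates)
print(f"correlation between GIV and false positive rate: {r:.2f}")
```

A value close to 1 would mean that tumors with more estimated damage also tend to have more false positive calls, which is the relationship the authors report.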
Lixin Chen, Pingfang Liu, Thomas Corwin Evans, and Laurence Michele Ettwiller. "DNA
Damage Is a Major Cause of Sequencing Errors, Directly Confounding
Variant Identification." Science 355 (2017): 752-56. Web. 27
Click here to view the original paper.
This is my first
assignment homepage. Click
here to return to Anthony's Genomics homepage.
Biology Home Page
Email Questions or Comments: email@example.com
© Copyright 2017 Department of Biology,
Davidson College, Davidson, NC 28035