Cancer, correlations and clustering

A recent paper in Science claims that two-thirds of the variation in cancer risk between different tissues is explained by the number of stem cell divisions. It then goes on to separate cancers into D-type (deterministic), in which environmental factors and hereditary predispositions strongly affect risk, and R-type (replicative), in which stochastic factors related to errors in DNA replication strongly affect risk. Given the media interest in cancer and the headline-grabbing line in the abstract that “The majority [of cancer] is due to ‘bad luck’”, it’s not surprising that this paper received a lot of mainstream media attention. The paper has received a lot of attention from bloggers too (Dan Graur, Understanding uncertainty, Aaron Meyer). Here, I want to focus on the paper itself rather than the media interpretation, which on the whole was a fairly accurate representation of the authors’ conclusions.

The central aim of this paper was to assess the relationship between the number of stem cell divisions in different tissues and the incidence of cancer in those tissues. To do this, the authors conducted a literature search for tissues in which the number of stem cells and their divisions had been quantitatively assessed and for which estimates of the lifetime incidence of cancer were available. In the paper they say that this was possible for 31 tissue types, although in fact only 25 discrete tissue types are included, because some of the defined cancer classifications originate from the same tissue type, for example smoking and non-smoking lung adenocarcinoma.

The first figure of the paper (reproduced below) shows a strong correlation between the total number of stem cell divisions and the lifetime risk of cancer. The Pearson correlation coefficient of 0.804 gives an r² of 0.65, which is where the claim that stem cell divisions explain two-thirds of the variation comes from. However, there are a number of reasons why this observation does not mean that two-thirds of cancers are due to “bad luck”.
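
As a quick check of that arithmetic, here is a minimal Python sketch. The `pearson_r` helper and the toy data are illustrative assumptions of mine, not the authors' code or data; only the 0.804 figure comes from the paper.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Toy data (hypothetical): a perfectly linear relationship gives r = 1.
toy_r = pearson_r([1, 2, 3], [2, 4, 6])

# The coefficient reported in the paper, squared to get variance explained.
reported_r = 0.804
variance_explained = reported_r ** 2
print(round(variance_explained, 3))  # 0.646
```

Note that r² here describes variation between tissues in log-log space; on its own it says nothing about the fraction of individual cancer cases attributable to any cause.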


  1. Some cancers were split by environmental factors and hereditary predispositions whilst others were not. The clearest example of this is the split between smoking and non-smoking lung cancer. Hereditary forms of colorectal cancer, familial adenomatous polyposis (FAP) and Lynch syndrome, were also included separately. The forms of FAP included in this study were those in which defects in the APC gene are inherited, which leads to a lifetime risk of nearly 100% (the authors inferred a lifetime risk of 1, or 100%). The choice of which cancers get separate lifetime risks is entirely arbitrary and leads us onto the second problem.
  2. The lifetime risks are averages over all individuals in the population. By looking only at the average lifetime risk for a particular cancer, the impact of environmental factors which are known to contribute to cancer risk, such as weight and diet, is averaged out. As we can see from the separate data points for the three colorectal cancers, the two duodenum cancers and the two lung cancer groups, splitting a cancer by known risk factors increases the spread of the data, reducing the percentage of variation in lifetime risk which can be explained by stem cell divisions. If we were to split each cancer by known risk factors such as diet, we would expect to see a lower proportion of variance explained by stem cell divisions. In essence, known confounding factors have not been accounted for.
  3. The correlation only describes how much of the variation in cancer incidence rates can be explained by stem cell divisions. Their analysis only deals with the variation in cancer rates between tissues. It cannot tell us how many instances of cancer are due to “bad luck”. The lifetime risk of some cancers is due almost entirely to environmental or genetic factors, and these cancers account for a considerable proportion of all cancer cases. For example, lung cancer had the second highest prevalence of all forms of cancer in the UK in 2011, and the data from this paper suggest smokers are ~20 times more likely to get lung cancer than non-smokers.
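
The second point can be illustrated with a small Python sketch. Everything here is hypothetical: the tissue values, the scatter, and the simplifying assumption that each cancer splits into two equally weighted subgroups sitting symmetrically around the average in log space. The only number borrowed from the post is the roughly 20-fold risk ratio between smokers and non-smokers.

```python
import math


def r_squared(xs, ys):
    """Squared Pearson correlation, i.e. the proportion of variance explained."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov ** 2 / (var_x * var_y)


# Hypothetical tissues: log10 stem cell divisions and log10 average lifetime risk.
log_divisions = [6.0, 7.0, 8.0, 9.0, 10.0, 11.0, 12.0]
scatter = [0.4, -0.6, 0.3, -0.2, 0.5, -0.5, 0.1]  # made-up deviations from the trend
log_avg_risk = [-7.0 + 0.5 * x + e for x, e in zip(log_divisions, scatter)]

# Assumption: every cancer has an exposed and an unexposed subgroup whose
# risks differ 20-fold, placed symmetrically around the average (log scale).
half_gap = math.log10(20) / 2

split_x, split_y = [], []
for x, y in zip(log_divisions, log_avg_risk):
    split_x += [x, x]
    split_y += [y + half_gap, y - half_gap]

r2_avg = r_squared(log_divisions, log_avg_risk)
r2_split = r_squared(split_x, split_y)
print(f"r^2 using averages:  {r2_avg:.2f}")
print(f"r^2 after splitting: {r2_split:.2f}")
```

Because the split adds spread in risk without adding any spread in stem cell divisions, the proportion of variance explained can only go down in this toy setup.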

The authors then go on to split cancers into two groups based on whether stochastic effects associated with stem cell division can explain the lifetime risk of a given cancer. Essentially, what they are doing is looking at the residuals from the linear regression fitted to the data in Figure 1 (i.e. how well the risk estimated from the linear regression matches the actual lifetime risk). However, rather than take the residuals, they invent a metric, the “Extra risk score” (ERS), from the product of the log10 total stem cell divisions and the log10 lifetime risk which, according to the supplementary materials, represents “the (negative value of the) area of the rectangle formed in the upper-left quadrant by the two coordinates (in logarithmic scale) of a data point as its sides”. The figure below shows the same data as Figure 1 above with a graphical representation of what the ERS represents. In this example, the log10 lifetime risk of thyroid follicular cancer (-1.98) is multiplied by the log10 total stem cell divisions (8.77) to give an ERS of -17.43, which is equivalent to -1 * the area of the rectangle shown in the figure below. The more negative the ERS, the further the point is away from the origin on the graph shown below.
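
To make the definition concrete, here is a minimal Python sketch of the ERS as described above. The function names are my own, and this is not the authors' code. Note that the two-decimal log values quoted in the text multiply to about -17.36 rather than -17.43; presumably the quoted score was computed from unrounded inputs, but that is an assumption on my part.

```python
import math


def ers_from_logs(log10_risk, log10_divisions):
    """Extra risk score: product of the two log10 coordinates."""
    return log10_risk * log10_divisions


def ers(lifetime_risk, total_divisions):
    """Same score, starting from the raw (untransformed) values."""
    return ers_from_logs(math.log10(lifetime_risk), math.log10(total_divisions))


# Worked example using the rounded log10 values quoted in the text.
thyroid_ers = ers_from_logs(-1.98, 8.77)
print(round(thyroid_ers, 2))  # -17.36

# The rectangle interpretation: width is |log10 risk|, height is log10 divisions.
rectangle_area = abs(-1.98) * 8.77
```

Since the lifetime risk is at most 1, its log10 is never positive, so the ERS is always zero or negative.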



This ERS is later adjusted to an aERS, in which the larger the score, the less well stem cell divisions explain the incidence of cancer. Why they’ve chosen this approach rather than using the residuals is not clear, but there are at least two major problems I can see.

  1. FAP colorectal has a log10 lifetime risk of 0. Its ERS will therefore be 0 regardless of the number of stem cell divisions. The ERS therefore cannot estimate how well the linear regression predicts the lifetime incidence of FAP colorectal.
  2. The ERS does not correlate well with the residuals from the linear regression, partly because the gradient of the regression line in log-log space is not 1 (it’s close to 0.5). As a result, cancers with a high number of stem cell divisions are biased towards higher ERS values, and vice versa. For example, colorectal adenocarcinoma does not have an unusually high lifetime risk given the number of stem cell divisions (Figure 1 above); however, its ERS places it within the group of tumors which are strongly influenced by additional risks (see Figure 2 at the end of the post). The figure below shows a plot of the residuals from the linear regression against the ERS, with each point coloured by the number of cell divisions. Tumors with a high number of cell divisions have an ERS which poorly reflects the residuals from the linear regression.
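
Both problems can be reproduced with a few made-up points. The sketch below is a hypothetical illustration in Python, not a reanalysis of the paper's data: the three points are placed exactly on a line of gradient 0.5 (the intercept of -7 is arbitrary), so every residual is zero, yet the ERS still changes with the number of divisions.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit; returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    return slope, mean_y - slope * mean_x


# Hypothetical cancers lying exactly on a gradient-0.5 line in log-log space.
log_divisions = [8.0, 10.0, 12.0]
log_risks = [-7.0 + 0.5 * x for x in log_divisions]  # -3.0, -2.0, -1.0

slope, intercept = fit_line(log_divisions, log_risks)
residuals = [y - (intercept + slope * x) for x, y in zip(log_divisions, log_risks)]
ers_scores = [x * y for x, y in zip(log_divisions, log_risks)]

print(residuals)   # all zero: each risk is exactly what the fit predicts
print(ers_scores)  # [-24.0, -20.0, -12.0]: rises with the number of divisions

# Problem 1: a log10 lifetime risk of 0 gives an ERS of 0 for any number of divisions.
fap_scores = [0.0 * x for x in log_divisions]
```

A residual-based score would treat all three hypothetical cancers identically, whereas the ERS ranks the one with the most divisions as having the most "extra" risk.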



Figure 2 from the paper (below) demonstrates that they can separate their cancer types using their ERS into two groups, which they label as D-type (deterministic, i.e. environmental factors and hereditary predispositions strongly affect their risk) and R-type (replicative, i.e. stochastic factors related to DNA replication errors strongly affect their risk). In the text of the paper they explain that “Machine learning methods were employed to classify tumors based only on this score”, which sounds fancy but essentially means they used k-means clustering to partition the cancers into two groups. In this instance it is not informative, as the data form a smooth continuum in this 1-dimensional space. The great thing about k-means is that it will always partition your data into k clusters regardless of whether it is sensible to do so or not! To make the groupings seem clearer, they adjust the ERS to give an adjusted score (aERS) so that all R-type tumors have a negative aERS and all D-type tumors have a positive aERS. However, it is clear from the figure that the scores are actually on a continuum. Unsurprisingly, Lynch colorectal has a very high aERS, reflecting the fact that the incidence of Lynch colorectal is largely explained by genetic factors. This additional analysis does not tell us anything we couldn’t get from Figure 1 and, for the reasons explained above, the ERS is actually a poor indicator of how well the number of stem cell divisions explains the lifetime risk for a tumor.
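
To see why a two-cluster result is not evidence of two groups, here is a minimal 1-dimensional two-means sketch in Python. It is a bare-bones version of Lloyd's algorithm written for illustration, not the authors' implementation, and the evenly spaced scores are hypothetical.

```python
def two_means_1d(values, max_iter=100):
    """Lloyd's algorithm for k = 2 in one dimension.

    Centres start at the minimum and maximum so the result is deterministic.
    """
    centers = [min(values), max(values)]
    labels = []
    for _ in range(max_iter):
        # Assign each value to its nearest centre (ties go to cluster 0).
        labels = [0 if abs(v - centers[0]) <= abs(v - centers[1]) else 1
                  for v in values]
        # Move each centre to the mean of its members.
        new_centers = []
        for k in (0, 1):
            members = [v for v, lab in zip(values, labels) if lab == k]
            new_centers.append(sum(members) / len(members) if members else centers[k])
        if new_centers == centers:
            break
        centers = new_centers
    return labels, centers


# Hypothetical scores forming a perfectly even continuum with no gap anywhere.
scores = [float(i) for i in range(21)]
labels, centers = two_means_1d(scores)

print(labels.count(0), labels.count(1))  # 11 10
print(centers)                           # [5.0, 15.5]
```

The algorithm returns two tidy clusters even though the spacing between neighbouring scores is identical across the boundary, so the existence of the two groups is a property of the method rather than of the data.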



So what can we take from this paper? Well, it’s clear that the number of stem cell divisions explains a considerable proportion of the variation between the average lifetime risks for these cancers, but the newspaper headline that two-thirds of cancers are due to bad luck isn’t supported by the data here. Sadly, the classification of cancers performed here is very poor and doesn’t lead to any novel observations I would have any confidence in. The first figure is certainly striking, though, and makes a convincing case that the number of stem cell divisions is an important component to consider when comparing cancer rates across different tissue types.