Finding your keys in a mass of high-throughput data

There is a common scientific allegory that is often used in criticism of discovery-based science. A person is looking around at night under a lamppost when a passerby asks, “What are you doing?” “Looking for my keys,” the person replies. “Oh, did you lose them here?” asks the concerned citizen. “No,” the person replies, “I lost them over there, but the light is much better here.”


The argument as applied to science in a nutshell is that we commonly ask questions based on where the ‘light is good’- that is, where we have the technology to be able to answer the question, rather than asking a better question in the first place. This recent piece covering several critiques of cancer genomics projects is a good example, and uses the analogy liberally throughout- referencing its use in the original articles covered.

One indictment of the large-scale NCI project, The Cancer Genome Atlas (TCGA) is as follows:

The Cancer Genome Atlas, an ambitious effort to chart and catalog all the significant mutations that every important cancer can possibly accrue. But these efforts have largely ended up finding more of the same. The Cancer Genome Atlas is a very significant repository, but it may end up accumulating data that’s irrelevant for actually understanding or curing cancer.

Here’s my fundamental problem with the metaphor, and its use as a criticism of scientific endeavors such as the TCGA. The problem is this: we don’t know where the keys are a priori, and the light is brightest under the lamppost, so we would be stupid NOT to look there first. The indictment comes post-hoc, and so benefits from knowing the answer (or at least having a good idea of the answer). Unraveling this anti-metaphor: we don’t know what causes cancer, genomics seems like a likely place to look, and we have the technology to do so. If we had started by looking elsewhere, critics would be left wondering why we didn’t look in the obvious place. Even if we didn’t find anything new with the TCGA (and the jury is still very much out on that point), that is still a positive step forward in the understanding of cancer: it means that we’ve looked in genomics and haven’t found the answer. This can be used to drive the next round of investigation. If we hadn’t done it, we simply wouldn’t know, and that would prevent taking certain steps to move forward.

Of course, the value of the metaphor is that it can be used to urge caution in investigation. If we have a good notion that our keys are not under the light, then maybe we ought to be thinking about going to get our flashlights to look in the right area to start with. We should also be very careful that, in funding the large projects to look where the light is, we don’t sacrifice other projects that may end up yielding fruit. It is true that large projects tend to be over-hyped, and the answers are promised (or all but) before the questions have even begun to be answered. Part of this is necessary salesmanship to be able to get these things off the ground at all, but overselling does not reflect well on anyone in the long run. Finally, the momentum of large projects or motivating ideas (“sequence all cancer genomes”) can be significant and may carry the ideas beyond what is useful. When we’ve figured out that the keys are not under the lamppost, we had better figure out where to look next rather than combing over the same well-lit ground.

Part of this piece reflects very well on work that I’m involved with- proteomic and phosphoproteomic characterization of tumors from TCGA under the Clinical Proteomics Tumor Analysis Consortium (CPTAC):

“as Yaffe points out, the real action takes place at the level of proteins, in the intricacies of the signaling pathways involving hundreds of protein hubs whose perturbation is key to a cancer cell’s survival. When drugs kill cancer cells they don’t target genes, they directly target proteins”

So examining the signaling pathways that are involved in cancer directly, as opposed to looking at gene expression or modification as a proxy of activity, may indeed be the way to go to elucidate the causes of cancer. We believe that integrating this kind of information, which is closer to actual function, with the depth of knowledge provided by the TCGA will give significant insight into the biology of cancer and its underlying causes. But we’ve only started looking under that lamppost.

So, the next time you hear someone using this analogy as an indictment of a project or approach, ask yourself whether they are using this argument post-hoc. That is, “they looked under the lamppost and didn’t find anything so their approach was flawed”. It wasn’t. It was likely the clearest, most logical step, and the one most likely to yield fruit given a reasonable cost-benefit assessment.


Eight red flags in bioinformatics analyses

A recent comment in Nature by C. Glenn Begley outlines six red flags that basic science research won’t be reproducible. Excellent read and excellent points. The point of this comment, based on experience from writing two papers in which:

Researchers — including me and my colleagues — had just reported that the majority of preclinical cancer papers in top-tier journals could not be reproduced, even by the investigators themselves [1,2].

was to summarize the common problems observed in the non-reproducible papers surveyed since the author could not reveal the identities of the papers themselves. Results in a whopping 90% of papers they surveyed could not be reproduced, in some cases even by the same researchers in the same lab, using the same protocols and reagents. The ‘red flags’ are really warnings to researchers of ways that they can fool themselves (as well as reviewers and readers in high-profile journals) and things that they should do to avoid falling into the traps found by the survey. These kinds of issues are major problems in analysis of high-throughput data for biomarker studies, and other purposes as well. As I was reading this I realized that I’d written several posts about these issues, but applied to bioinformatics and computational biology research. Therefore, here is my brief summary of these six red flags, plus two more that are more specific to high-throughput analysis, as they apply to computational analysis- linking to my previous posts or those of others as applicable.

  1. Were experiments performed blinded? This is something I hadn’t previously considered directly but my post on how it’s easy to fool yourself in science does address this. In some cases blinding your bioinformatic analysis might be possible, and it would certainly be very helpful in making sure that you’re not ‘guiding’ your findings to a predetermined answer. The cases where this is especially important are those where the analysis is directly targeted at addressing a hypothesis. In these cases a solution may be to have a colleague review the results in a blinded manner- though this may take more thought and work than would reviewing the results of a limited set of Western blots.
  2. Were basic experiments repeated? This is one place where high-throughput methodology and analysis might have a step up on ‘traditional’ science involving (for example) Western blots. Though it’s a tough fight and sometimes not done correctly, the need for replicates is well-recognized as discussed in my recent post on the subject. In studies where the point is determining patterns from high-throughput data (biomarker studies, for example) it is also quite important to see if the study has found their pattern in an independent dataset. Often cross-validation is used as a substitute for an independent dataset- but this falls short. Many biomarkers have been found not to generalize to different datasets (other patient cohorts). Examination of the pattern in at least one other independent dataset strengthens the claim of reproducibility considerably.
  3. Were all the results presented? This is an important point but can be tricky in analysis that involves many ‘discovery’ focused analyses. It is not important to present every comparison, statistical test, heatmap, or network generated during the entire arc of the analysis process. However, when addressing hypotheses (see my post on the scientific method as applied in computational biology) that are critical to the arguments presented in a study it is essential that you present your results, even where those results are confusing or partly unclear. Obviously, this needs to be undertaken through a filter to balance readability and telling a coherent story– but results that partly do not support the hypothesis are very important to present.
  4. Were there positive and negative controls? This is just incredibly central to the scientific method but is a problem in high-throughput data analysis. At the most basic level, analyzing the raw (or mostly raw) data from instruments, this is commonly performed but rarely reported. In a number of recent cases in my group we’ve found real problems in the data that were revealed by simply looking at these built-in controls, or by figuring out what basic comparisons could be used as controls (for example, do gene expression profiles from biological replicates correlate with each other?). What statistical associations do you expect to see and what do you expect not to see? These checks are good to prevent fooling yourself- and if they are important they should be presented.
  5. Were reagents validated? For data analysis this should be: “Was the code used to perform the analysis validated?” I’ve not written much on this but there are several out there who make it a central point in their discussions including Titus Brown. Among his posts on this subject are here, here, and here. If your code (an extremely important reagent in a computational experiment) does not function as it should the results of your analyses will be incorrect. A great example of this is from a group that hunted down a bunch of errors in a series of high-profile cancer papers I posted about recently. The authors of those papers were NOT careful about checking that the results of their analyses were correct.
  6. Were statistical tests appropriate? There is just too much to write on this subject in relation to data analysis. There are many ways to go wrong here- inappropriate data for a test, inappropriate assumptions, inappropriate data distribution. I am not a statistician so I will not weigh in on the possibilities here. But it’s important. Really important. Important enough that if you’re not a statistician you should have a good friend/colleague who is and can provide specific advice to you about how to handle statistical analysis.
  7. New! Was multiple hypothesis correction correctly applied? This is really an addition to flag #6 above specific for high-throughput data analysis. Multiple hypothesis correction is very important to high-throughput data analysis because of the number of statistical comparisons being made. It is a way of filtering predictions or statistical relationships observed to provide more conservative estimates. Essentially it extends the question, “how likely is it that the difference I observed in one measurement is occurring by chance?” to the population-level question, “how likely is it that I would find this difference by chance if I looked at a whole bunch of measurements?”. Know it. Understand it. Use it.
  8. New! Was an appropriate background distribution used? Again, an extension to flag #6. When judging significance of findings it is very important to choose a correct background distribution for your test. An example is in proteomics analysis. If you want to know what functional groups are overrepresented in a global proteomics dataset should you choose your background to be all proteins that are coded for by the genome? No- because the set of proteins that can be measured by proteomics (in general) is highly biased to start with. So to get an appropriate idea of which functional groups are enriched you should choose the proteins actually observed in all conditions as a background.
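
To make flag #7 a little more concrete, here is a minimal sketch of one common correction, the Benjamini-Hochberg false discovery rate procedure, in plain Python. The choice of procedure, the function name benjamini_hochberg, and the eight p-values are my own illustrative assumptions; they do not come from the Comment or from any particular study.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(p_values)
    # Indices of the p-values, sorted from smallest to largest p-value.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value to the smallest so that the adjusted
    # values never increase as the raw p-values decrease.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical p-values from eight comparisons (made up for illustration).
p_values = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205]
q_values = benjamini_hochberg(p_values)

raw_hits = sum(p < 0.05 for p in p_values)
fdr_hits = sum(q < 0.05 for q in q_values)
print(f"significant before correction: {raw_hits}, after correction: {fdr_hits}")
```

In this toy case several comparisons that pass an uncorrected 0.05 threshold no longer pass once the number of tests is taken into account. In a real analysis I would lean on a vetted statistics package rather than a hand-rolled function like this one.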

The comment by Glenn Begley wraps up with this statement about why these problems are still present in research:

Every biologist wants and often needs to get a paper into Nature or Science or Cell, yet the scientific community fails to recognize the perverse incentive this creates.

I think this is true, but you could substitute “any peer-reviewed journal” for “Nature or Science or Cell”- the problem comes at all levels. It’s also true that these problems are particularly relevant to high-throughput data analysis because they can be less hypothesis directed and more discovery oriented, because they are generally more expensive and there’s thus more scrutiny of the results (in some cases), and due to rampant enthusiasm and overselling of potential results arising from these kinds of studies.

Illustration from Derek Roczen

The big question: Will following these rules improve reproducibility in high-throughput data analysis? The Comment talks about these being things that were present in reproducible studies (that small 10% of the papers), but does that mean that paying attention to them will improve reproducibility, especially in the case of high-throughput data analysis? There are issues that are more specific to high-throughput data (such as my flags #7 and #8, above), but essentially these flags are a great starting point to evaluate the integrity of a computational study. With high-throughput methods, and their resulting papers, gaining importance all the time, we need to consider these flags both as producers and as consumers.
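
As a rough illustration of flag #8, this sketch scores the same hypothetical hit list for enrichment of one functional category against two different backgrounds, using a one-sided hypergeometric test. Every count here is made up, and the function name enrichment_p_value is my own; the hypergeometric test is one common choice for this kind of question, not the only one.

```python
from math import comb


def enrichment_p_value(background_size, category_in_background,
                       hit_list_size, category_in_hits):
    """One-sided hypergeometric p-value: the chance of seeing at least
    category_in_hits category members in a random hit list of this size."""
    total = comb(background_size, hit_list_size)
    most_possible = min(hit_list_size, category_in_background)
    tail = sum(
        comb(category_in_background, k)
        * comb(background_size - category_in_background, hit_list_size - k)
        for k in range(category_in_hits, most_possible + 1)
    )
    return tail / total


# Hypothetical hit list: 100 proteins of interest, 15 of them in the category.
hits, hits_in_category = 100, 15

# Background A (assumed): every protein encoded by the genome.
p_genome = enrichment_p_value(20000, 400, hits, hits_in_category)

# Background B (assumed): only proteins actually observed by proteomics,
# where this category happens to be over-represented to begin with.
p_observed = enrichment_p_value(3000, 300, hits, hits_in_category)

print(f"genome-wide background:      p = {p_genome:.2e}")
print(f"observed-proteome background: p = {p_observed:.2e}")
```

With these invented numbers the category looks dramatically enriched against the whole genome and far less remarkable against the proteins that could actually be measured, which is the trap that flag #8 describes.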


  1. Prinz, F., Schlange, T. & Asadullah, K. Nature Rev. Drug Discov. 10, 712 (2011).
  2. Begley, C. G. & Ellis, L. M. Nature 483, 531–533 (2012).

How can two be worse than one? Replicates in high-throughput experiments

[Disclaimer: I’m not a lot of things. Statistician is high on that list of things I’m not.]

A fundamental rift between statisticians/computational biologists and bench biologists related to high-throughput data collection (and low-throughput as well, though it’s not discussed as much) is that of the number of replicates to use in the experimental design.

Replicates are multiple copies of samples under the same conditions that are used to assess the underlying variability in measurement. A biological replicate is when the source of the sample is different, meaning that different individuals were used, for human samples, for example, or different cultures were grown independently, for bacterial cultures. This is different from a technical replicate, where one sample is taken or grown, then subsequently split up into replicates that will assess the technical variability of the instrument being used to gather the data (for example, though other types of technical replicates are used too sometimes). Most often you will not know the extent of variability arising from the biology or process and so it is difficult to choose the right balance of replicates without doing pilot studies first. With well-established platforms (microarrays, e.g.) the technical/process variability is understood, but the biological variability is generally not. These choices must also be balanced with expense in terms of money, time, and effort. Choice of number of replicates of each type can mean the difference between a usable experiment that will answer the questions posed and a waste of time and effort that will frustrate everyone involved.

The fundamental rift is this:

  • More is better: statisticians want to make sure that the data gathered, which can be very expensive, can be used to accurately estimate the variability. More is better, and very few experimental designs have as many replicates as statisticians would like.
  • No need for redundant information: Bench biologists, on the other hand, tend to want to get as much science done as possible. Replicates are expensive and often aren’t that interesting in terms of the biology that they reveal when they work- that is, if replicates 1, 2, and 3 agree then wouldn’t it be more efficient to just have run replicate 1 in the first place and use replicates 2 and 3 to get more biology?

This is a vast generalization, and many biologists gathering experimental data understand the statistical issues inherent in this problem- more so in certain fields like genome-wide association studies.

Three replicates is kind of a minimum for statistical analysis. This number doesn’t give you any room if any of the replicates fail for technical reasons, but if they’re successful you can at least get an estimate of variation in the form of standard deviation out (not a very robust estimate mind you, but the calculation will run). I’ve illustrated the point in the graph below.

Running one replicate can be understood for some situations, and the results have to be presented with the rather large caveat that they will need to be validated in follow-on studies.

Two replicates? Never a good idea. This is solidly in the “why bother?” category. If the data points agree, great. But how much confidence can you have that they’re not just accidentally lining up? If they disagree, you’re out of luck. If you have ten replicates and one doesn’t agree you could, if you investigated the underlying reason for this failure, exclude it from the analysis as an ‘outlier’ (this can get into shady territory pretty fast- but there are sound ways to do this). However, with two replicates they just don’t agree and you have no idea which value to believe. Many times two replicates are the result of an experimental design with more replicates in which some of the samples have failed for some reason. But an experimental design should never be initiated with just two replicates. It doesn’t make sense- though I’ve seen many and have participated in analysis of some too (thus giving me this opinion).
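
A small simulation can make the point about replicate number. This is only a sketch under simple assumptions: normally distributed measurements with a true standard deviation of 1 and no failed samples. The function name sd_estimate_range and the number of trials are arbitrary choices of mine. It estimates the standard deviation many times from 2, 3, and 10 replicates and reports the central 90% range of those estimates.

```python
import random
import statistics


def sd_estimate_range(n_replicates, true_sd=1.0, n_trials=5000, seed=0):
    """Simulate many experiments and return the 5th and 95th percentiles
    of the sample standard deviation estimated from n_replicates values."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    estimates = []
    for _ in range(n_trials):
        sample = [rng.gauss(0.0, true_sd) for _ in range(n_replicates)]
        estimates.append(statistics.stdev(sample))
    estimates.sort()
    low = estimates[int(0.05 * n_trials)]
    high = estimates[int(0.95 * n_trials)]
    return low, high


# Central 90% range of the estimated SD for each replicate count.
spread = {n: sd_estimate_range(n) for n in (2, 3, 10)}
for n, (low, high) in spread.items():
    print(f"{n:>2} replicates: estimated SD mostly between {low:.2f} and {high:.2f}")
```

Under these assumptions the range of estimates narrows steadily as replicates are added. With two replicates the estimate can land almost anywhere between near zero and about twice the true value, so it tells you very little about the real variability.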

There is much more that can be said on this topic but this is a critical issue that can ruin costly and time-consuming high-throughput experiments before they’ve even started.