Finding your keys in a mass of high-throughput data

There is a common scientific allegory often used in criticism of discovery-based science. A person is looking around at night under a lamppost when a passerby asks, “What are you doing?” “Looking for my keys,” the person replies. “Oh, did you lose them here?” asks the concerned citizen. “No,” the person replies, “I lost them over there, but the light is much better here.”


The argument as applied to science, in a nutshell, is that we commonly ask questions based on where the ‘light is good’ (that is, where we have the technology to answer the question) rather than asking a better question in the first place. This recent piece covering several critiques of cancer genomics projects is a good example, and it uses the analogy liberally throughout, referencing its use in the original articles covered.

One indictment of the large-scale NCI project, The Cancer Genome Atlas (TCGA), is as follows:

The Cancer Genome Atlas, an ambitious effort to chart and catalog all the significant mutations that every important cancer can possibly accrue. But these efforts have largely ended up finding more of the same. The Cancer Genome Atlas is a very significant repository, but it may end up accumulating data that’s irrelevant for actually understanding or curing cancer.

Here’s my fundamental problem with the metaphor, and its use as a criticism of scientific endeavors such as the TCGA. The problem is this: we don’t know where the keys are a priori, and the light is brightest under the lamppost, so we would be stupid NOT to look there first. The indictment comes post-hoc, and so benefits from knowing the answer (or at least having a good idea of the answer). Unraveling this anti-metaphor: we don’t know what causes cancer, genomics seems like a likely place to look, and we have the technology to do so. If we had started by looking elsewhere, critics would be left wondering why we didn’t look in the obvious place. Even if we didn’t find anything new with the TCGA (and the jury is still very much out on that point), that is still a positive step forward in the understanding of cancer: it means that we’ve looked in genomics and haven’t found the answer. This can be used to drive the next round of investigation. If we hadn’t done it, we simply wouldn’t know, and that would prevent taking certain steps to move forward.

Of course, the value of the metaphor is that it can be used to urge caution in investigation. If we have a good notion that our keys are not under the light, then maybe we ought to be thinking about going to get our flashlights to look in the right area to start with. We should also be very careful that, in funding the large projects to look where the light is, we don’t sacrifice other projects that may end up yielding fruit. It is true that large projects tend to be over-hyped, and that answers are promised (or all but) before the questions have even begun to be addressed. Part of this is necessary salesmanship to get these things off the ground at all, but overselling does not reflect well on anyone in the long run. Finally, the momentum of large projects or motivating ideas (“sequence all cancer genomes”) can be significant and may carry the ideas beyond what is useful. When we’ve figured out that the keys are not under the lamppost, we had better figure out where to look next rather than combing over the same well-lit ground.

Part of this piece reflects very well on work that I’m involved with: proteomic and phosphoproteomic characterization of tumors from TCGA under the Clinical Proteomic Tumor Analysis Consortium (CPTAC):

“as Yaffe points out, the real action takes place at the level of proteins, in the intricacies of the signaling pathways involving hundreds of protein hubs whose perturbation is key to a cancer cell’s survival. When drugs kill cancer cells they don’t target genes, they directly target proteins”

So examining the signaling pathways that are involved in cancer directly, as opposed to looking at gene expression or modification as a proxy of activity, may indeed be the way to go to elucidate the causes of cancer. We believe that integrating this kind of information, which is closer to actual function, with the depth of knowledge provided by the TCGA will give significant insight into the biology of cancer and its underlying causes. But we’ve only started looking under that lamppost.

So, the next time you hear someone using this analogy as an indictment of a project or approach, ask yourself whether they are using the argument post-hoc. That is, “they looked under the lamppost and didn’t find anything, so their approach was flawed”. It wasn’t. It was likely the clearest, most logical step, and the one most likely to yield fruit given a reasonable cost-benefit assessment.

