A recent comment in Nature by C. Glenn Begley outlines six red flags that basic science research won’t be reproducible. Excellent read and excellent points. The point of this comment, based on experience from writing two papers in which:
was to summarize the common problems observed in the non-reproducible papers surveyed since the author could not reveal the identities of the papers themselves. Results in a whopping 90% of papers they surveyed could not be reproduced, in some cases even by the same researchers in the same lab, using the same protocols and reagents. The ‘red flags’ are really warnings to researchers of ways that they can fool themselves (as well as reviewers and readers in high-profile journals) and things that they should do to avoid falling into the traps found by the survey. These kinds of issues are major problems in analysis of high-throughput data for biomarker studies, and other purposes as well. As I was reading this I realized that I’d written several posts about these issues, but applied to bioinformatics and computational biology research. Therefore, here is my brief summary of these six red flags, plus two more that are more specific to high-throughput analysis, as they apply to computational analysis- linking to my previous posts or those of others as applicable.
- Were experiments performed blinded? This is something I hadn’t previously considered directly but my post on how it’s easy to fool yourself in science does address this. In some cases blinding your bioinformatic analysis might be possible and certainly be very helpful in making sure that you’re not ‘guiding’ your findings to a predetermined answer. The cases where this is especially important is when the analysis is directly targeted at addressing a hypothesis. In these cases a solution may be to have a colleague review the results in a blinded manner- though this may take more thought and work than would reviewing the results of a limited set of Western blots.
- Were basic experiments repeated? This is one place where high-throughput methodology and analysis might have a step up on ‘traditional’ science involving (for example) Western blots. Though it’s a tough fight and sometimes not done correctly, the need for replicates is well-recognized as discussed in my recent post on the subject. In studies where the point is determining patterns from high-throughput data (biomarker studies, for example) it is also quite important to see if the study has found their pattern in an independent dataset. Often cross-validation is used as a substitute for an independent dataset- but this falls short. Many biomarkers have been found not to generalize to different datasets (other patient cohorts). Examination of the pattern in at least one other independent dataset strengthens the claim of reproducibility considerably.
- Were all the results presented? This is an important point but can be tricky in analysis that involves many ‘discovery’ focused analyses. It is not important to present every comparison, statistical test, heatmap, or network generated during the entire arc of the analysis process. However, when addressing hypotheses (see my post on the scientific method as applied in computational biology) that are critical to the arguments presented in a study it is essential that you present your results, even where those results are confusing or partly unclear. Obviously, this needs to be undertaken through a filter to balance readability and telling a coherent story– but results that partly do not support the hypothesis are very important to present.
- Were there positive and negative controls? This is just incredibly central to the scientific method but is a problem in high-throughput data analysis. At the most basic level, analyzing the raw (or mostly raw) data from instruments, this is commonly performed but never reported. In a number of recent cases in my group we’ve found real problems in the data that were revealed by simply looking at these built-in controls, or by figuring out what basic comparisons could be used as controls (for example, do gene expression from biological replicates correlate with each other?). What statistical associations do you expect to see and what do you expect not to see? These checks are good to prevent fooling yourself- and if they are important they should be presented.
- Were reagents validated? For data analysis this should be: “Was the code used to perform the analysis validated?” I’ve not written much on this but there are several out there who make it a central point in their discussions including Titus Brown. Among his posts on this subject are here, here, and here. If your code (an extremely important reagent in a computational experiment) does not function as it should the results of your analyses will be incorrect. A great example of this is from a group that hunted down a bunch of errors in a series of high-profile cancer papers I posted about recently. The authors of those papers were NOT careful about checking that the results of their analyses were correct.
- Were statistical tests appropriate? There is just too much to write on this subject in relation to data analysis. There are many ways to go wrong here- inappropriate data for a test, inappropriate assumptions, inappropriate data distribution. I am not a statistician so I will not weigh in on the possibilities here. But it’s important. Really important. Important enough that if you’re not a statistician you should have a good friend/colleague who is and can provide specific advice to you about how to handle statistical analysis.
- New! Was multiple hypothesis correction correctly applied? This is really an addition to flag #6 above specific for high-throughput data analysis. Multiple hypothesis correction is very important to high-throughput data analysis because of the number of statistical comparisons being made. It is a way of filtering predictions or statistical relationships observed to provide more conservative estimates. Essentially it extends the question, “how likely is it that the difference I observed in one measurement is occurring by chance?” to the population-level question, “how likely is it that I would find this difference by chance if I looked at a whole bunch of measurements?”. Know it. Understand it. Use it.
- New! Was an appropriate background distribution used? Again, an extension to flag #6. When judging significance of findings it is very important to choose a correct background distribution for your test. An example is in proteomics analysis. If you want to know what functional groups are overrepresented in a global proteomics dataset should you choose your background to be all proteins that are coded for by the genome? No- because the set of proteins that can be measured by proteomics (in general) is highly biased to start with. So to get an appropriate idea of which functional groups are enriched you should choose the proteins actually observed in all conditions as a background.
The comment by Glenn Begely wraps up with this statement about why these problems are still present in research:
Every biologist wants and often needs to get a paper into Nature or Science or Cell, yet the scientific community fails to recognize the perverse incentive this creates.
I think this is true, but you could substitute “any peer-reviewed journal” for “Nature or Science or Cell”- the problem comes at all levels. It’s also true that these problems are particularly relevant to high-throughput data analysis because they can be less hypothesis directed and more discovery oriented, because they are generally more expensive and there’s thus more scrutiny of the results (in some cases), and due to rampant enthusiasm and overselling of potential results arising from these kinds of studies.
The big question: Will following these rules improve reproducibility in high-throughput data analysis? The Comment talks about these being things that were present in reproducible studies (that small 10% of the papers) but does that mean that paying attention to them will improve reproducibility, especially in the case of high-throughput data analysis? There are issues that are more specific to high-throughput data (as my flags #7 and #8, above) but essentially these flags are a great starting point to evaluate the integrity of a computational study. With high-throughput methods, and their resulting papers, gaining importance all the time we need to consider these both as producers and consumers.
- Prinz, F., Schlange, T. & Asadullah, K. Nature Rev. Drug Discov. 10, 712 (2011).
- Begley, C. G. & Ellis, L. M. Nature 483, 531–533 (2012).