Eight red flags in bioinformatics analyses

A recent Comment in Nature by C. Glenn Begley outlines six red flags that basic science research won’t be reproducible. It’s an excellent read with excellent points. The Comment builds on the experience from two papers. As the author puts it:

Researchers — including me and my colleagues — had just reported that the majority of preclinical cancer papers in top-tier journals could not be reproduced, even by the investigators themselves [1, 2].

Because the author could not reveal the identities of the papers themselves, the point of the Comment was to summarize the common problems observed in the non-reproducible papers surveyed. Results in a whopping 90% of the papers they surveyed could not be reproduced, in some cases even by the same researchers in the same lab, using the same protocols and reagents. The ‘red flags’ are really warnings to researchers about the ways they can fool themselves (as well as reviewers and readers of high-profile journals), and about what they should do to avoid falling into the traps found by the survey.

These kinds of issues are major problems in the analysis of high-throughput data for biomarker studies, and for other purposes as well. As I was reading the Comment I realized that I’d written several posts about these issues, but applied to bioinformatics and computational biology research. Therefore, here is my brief summary of these six red flags as they apply to computational analysis, plus two more that are more specific to high-throughput analysis, linking to my previous posts or those of others as applicable.

  1. Were experiments performed blinded? This is something I hadn’t previously considered directly, but my post on how it’s easy to fool yourself in science does address this. In some cases blinding your bioinformatic analysis might be possible, and it would certainly be very helpful in making sure that you’re not ‘guiding’ your findings to a predetermined answer. This is especially important when the analysis is directly targeted at addressing a hypothesis. In these cases a solution may be to have a colleague review the results in a blinded manner, though this may take more thought and work than reviewing the results of a limited set of Western blots.
  2. Were basic experiments repeated? This is one place where high-throughput methodology and analysis might have a step up on ‘traditional’ science involving (for example) Western blots. Though it’s a tough fight and sometimes not done correctly, the need for replicates is well-recognized, as discussed in my recent post on the subject. In studies where the point is determining patterns from high-throughput data (biomarker studies, for example) it is also quite important to see whether the study has found its pattern in an independent dataset. Often cross-validation is used as a substitute for an independent dataset, but this falls short. Many biomarkers have been found not to generalize to different datasets (other patient cohorts). Examination of the pattern in at least one other independent dataset strengthens the claim of reproducibility considerably.
  3. Were all the results presented? This is an important point but can be tricky in analysis that involves many ‘discovery’ focused analyses. It is not important to present every comparison, statistical test, heatmap, or network generated during the entire arc of the analysis process. However, when addressing hypotheses (see my post on the scientific method as applied in computational biology) that are critical to the arguments presented in a study, it is essential that you present your results, even where those results are confusing or partly unclear. Obviously, this needs to be undertaken through a filter to balance readability and telling a coherent story, but results that partly do not support the hypothesis are very important to present.
  4. Were there positive and negative controls? This is just incredibly central to the scientific method but is a problem in high-throughput data analysis. At the most basic level, analyzing the raw (or mostly raw) data from instruments, this is commonly performed but never reported. In a number of recent cases in my group we’ve found real problems in the data that were revealed by simply looking at these built-in controls, or by figuring out what basic comparisons could be used as controls (for example, does gene expression from biological replicates correlate with each other?). What statistical associations do you expect to see and what do you expect not to see? These checks are good for keeping you from fooling yourself, and if they are important they should be presented.
  5. Were reagents validated? For data analysis this should be: “Was the code used to perform the analysis validated?” I’ve not written much on this, but there are several people out there who make it a central point in their discussions, including Titus Brown. Among his posts on this subject are here, here, and here. If your code (an extremely important reagent in a computational experiment) does not function as it should, the results of your analyses will be incorrect. A great example of this comes from a group that hunted down a bunch of errors in a series of high-profile cancer papers, which I posted about recently. The authors of those papers were NOT careful about checking that the results of their analyses were correct.
  6. Were statistical tests appropriate? There is just too much to write on this subject in relation to data analysis. There are many ways to go wrong here: inappropriate data for a test, inappropriate assumptions, inappropriate data distribution. I am not a statistician so I will not weigh in on the possibilities here. But it’s important. Really important. Important enough that if you’re not a statistician you should have a good friend/colleague who is and who can provide specific advice to you about how to handle statistical analysis.
  7. New! Was multiple hypothesis correction correctly applied? This is really an addition to flag #6 above specific for high-throughput data analysis. Multiple hypothesis correction is very important to high-throughput data analysis because of the number of statistical comparisons being made. It is a way of filtering predictions or statistical relationships observed to provide more conservative estimates. Essentially it extends the question, “how likely is it that the difference I observed in one measurement is occurring by chance?” to the population-level question, “how likely is it that I would find this difference by chance if I looked at a whole bunch of measurements?”. Know it. Understand it. Use it.
  8. New! Was an appropriate background distribution used? Again, an extension to flag #6. When judging the significance of findings it is very important to choose a correct background distribution for your test. An example is in proteomics analysis. If you want to know what functional groups are overrepresented in a global proteomics dataset, should you choose your background to be all proteins that are coded for by the genome? No, because the set of proteins that can be measured by proteomics (in general) is highly biased to start with. So to get an appropriate idea of which functional groups are enriched you should choose the proteins actually observed in all conditions as a background.
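
To make flags #7 and #8 a bit more concrete, here is a minimal Python sketch. It is an illustration only, not code from any of the studies discussed: the function names (benjamini_hochberg, enrichment_p) are hypothetical, all the numbers are invented, and Benjamini-Hochberg is just one common choice of multiple hypothesis correction. The first function adjusts a list of p-values for the number of tests performed. The second computes a one-sided hypergeometric enrichment p-value, so the same observation can be scored against two different backgrounds.

```python
from math import comb


def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values, in the input order."""
    m = len(p_values)
    # Indices of the p-values, sorted from smallest to largest p-value.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def enrichment_p(hits_in_group, hits_total, group_in_background, background_size):
    """One-sided hypergeometric p-value: chance of seeing at least
    hits_in_group members of a functional group among hits_total hits
    drawn from the chosen background."""
    upper = min(hits_total, group_in_background)
    total = comb(background_size, hits_total)
    tail = sum(
        comb(group_in_background, k)
        * comb(background_size - group_in_background, hits_total - k)
        for k in range(hits_in_group, upper + 1)
    )
    return tail / total


# Toy numbers, invented for illustration only.
raw_p = [0.01, 0.04, 0.03, 0.20]
print("raw p-values:     ", raw_p)
print("adjusted p-values:", benjamini_hochberg(raw_p))

# Flag #8: 50 proteins change, 10 of them belong to one functional group.
# Background A: every protein encoded by a (hypothetical) genome.
p_genome = enrichment_p(10, 50, group_in_background=100, background_size=4000)
# Background B: only the proteins actually observed by proteomics.
p_observed = enrichment_p(10, 50, group_in_background=60, background_size=500)
print("p with genome background:  ", p_genome)
print("p with observed background:", p_observed)
```

In this toy setup the functional group is over-sampled among the observed proteins (60 of 500, versus 100 of 4000 in the genome), so the genome-wide background makes the enrichment look far more significant than the observed-protein background does. That gap is exactly the bias flag #8 warns about. In a real analysis the enrichment p-values for all functional groups tested would themselves go through the multiple hypothesis correction from flag #7.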

The Comment by Glenn Begley wraps up with this statement about why these problems are still present in research:

Every biologist wants and often needs to get a paper into Nature or Science or Cell, yet the scientific community fails to recognize the perverse incentive this creates.

I think this is true, but you could substitute “any peer-reviewed journal” for “Nature or Science or Cell”: the problem comes at all levels. It’s also true that these problems are particularly relevant to high-throughput data analysis because such studies can be less hypothesis directed and more discovery oriented, because they are generally more expensive and there’s thus more scrutiny of the results (in some cases), and because of rampant enthusiasm and overselling of the potential results arising from these kinds of studies.

Illustration from Derek Roczen

The big question: Will following these rules improve reproducibility in high-throughput data analysis? The Comment talks about these being things that were present in reproducible studies (that small 10% of the papers) but does that mean that paying attention to them will improve reproducibility, especially in the case of high-throughput data analysis? There are issues that are more specific to high-throughput data (as my flags #7 and #8, above) but essentially these flags are a great starting point to evaluate the integrity of a computational study. With high-throughput methods, and their resulting papers, gaining importance all the time we need to consider these both as producers and consumers.


  1. Prinz, F., Schlange, T. & Asadullah, K. Nature Rev. Drug Discov. 10, 712 (2011).
  2. Begley, C. G. & Ellis, L. M. Nature 483, 531–533 (2012).

Cool example of invisible science

I recently posted on invisible science, unexpected observations that don’t fit the hypothesis and can be easily discarded or overlooked completely. Through a collaboration we just published a paper that demonstrates this concept very well. Here’s its story in a nutshell.

A few years back I published a paper in PLoS Pathogens that described the first use* of a machine learning approach to identify bacterial type III effectors from protein sequence. Type III effectors are proteins that are made by bacteria and exported through a complicated structure (the type III secretion apparatus, a.k.a. the injectisome) directly into a host cell. Inside the host cell these effectors interact with host proteins and networks to effect a change, one that is beneficial for the invading bacteria, and allow survival in an environment that’s not very hospitable for bacterial growth. Though there are a lot of these kinds of proteins known, no pattern has been found that specifies secretion by the type III mechanism. It’s a mystery still.

(* there was another paper published back-to-back with mine in PLoS Pathogens that reported the same thing. Additionally, two other papers were published subsequently in other journals that reiterated our findings. I wrote a review of this field here.)

So on the basis of the model that I published, my collaborators (Drs. Heffron and Niemann) thought it would be cool to see if a consensus signal (an average of the different parts my model predicted to be important for secretion) would be hyper-secreted (i.e. would be secreted at a high level). I sent them a couple of predictions and some time later (maybe 8 months) Dr. Niemann contacted me to say that the consensus sequence was not, in fact, secreted. So it looked like the prediction wasn’t any good and that some work had been done to get this negative result.

But not so fast. Because they’d had some issues with how they’d made the initial construct for the experiment, they remade the construct used to express the consensus. The first one (that was not secreted) used a native promoter and upstream gene sequence. This is the region that causes a gene to be expressed, then allows the ribosome to bind to the mRNA and start translation of the actual coding sequence. The native upstream sequence was just taken from a real effector. When they redid the construct they used a non-native upstream sequence from a bacteriophage (a virus that infects bacteria), commonly used for expressing genes. All of a sudden, they got secretion from the same consensus sequence. This was a very weird result: why would changing the untranslated region suddenly change the function that the protein sequence was supposed to be directing?

Figure 1. Translocation of a consensus effector seq.

The path of this experiment could have taken a very different turn here. Dr. Niemann could have simply ignored that ‘spurious’ result and decided that the native promoter gave the right answer: the consensus sequence wasn’t secreted.

However, in this case the spurious result was the interesting one. Why did the construct with the bacteriophage upstream region get secreted? The only difference was in the upstream RNA (since the difference was in the non-coding region and the protein produced was exactly the same). Dr. Niemann pressed on and found that the RNA itself was directing secretion. He also found other examples of native upstream sequences that direct secretion in the bacterium we were working on (Salmonella Typhimurium). This had never been observed before in Salmonella, though it was known for a few effectors from Yersinia pestis. He also identified an RNA-binding protein, Hfq, that was required for this secretion. This paper is currently available as a preprint from the journal.

Niemann GS, Brown RN, Mushamiri IT, Nguyen NT, Taiwo R, Stufkens A, Smith RD, Adkins JN, McDermott JE, Heffron F. RNA Type III Secretion Signals that require Hfq. J Bacteriol. 2013 Feb 8. [Epub ahead of print]

So this story never ended in validation of my consensus sequence. Actually, in all likelihood it can’t direct secretion (the results in the paper show that, though it’s not highlighted). But the story turned out to be more interesting and more impactful and it shows why it’s good to be flexible in science. If you see the results of each experiment only in black and white (it supports or does not support my original hypothesis) this will be extremely limiting to the science you can accomplish.