Another word about balance

[4/17/2015 updated: A reader pointed out that my formulae for specificity and accuracy contained errors. It turns out that both measures were being calculated correctly, just a typing error on the blog. I’ve corrected them below.] 

TL;DR summary

Evaluating a binary classifier based on an artificial balance of positive examples and negative examples (which is commonly done in this field) can cause underestimation of method accuracy but vast overestimation of the positive predictive value (PPV) of the method. Since PPV is likely the only metric that really matters to a particular kind of important end user, the biologist wanting to find a couple of novel positive examples in the lab based on your prediction, this is a potentially very big problem with reporting performance.

The long version

Previously I wrote a post about the importance of having a naturally balanced set of positive and negative examples when evaluating the performance of a binary classifier produced by machine learning methods. I’ve continued to think about this problem and realized that I didn’t have a very good handle on what kinds of effects artificially balanced sets would have on performance. Though the metrics I’m using are very simple, I felt it would be worthwhile to demonstrate the effects, so I ran a simple simulation.

  1. I produced random prediction sets with a set portion of positives predicted correctly (85%) and a set portion of negatives predicted correctly (95%).
  2. The ‘naturally’ occurring ratio of positive to negative examples could be varied but for the figures below I used 1:100.
  3. I varied the ratio of positive to negative examples used to estimate performance and
  4. Calculated several commonly used measures of performance:
    1. Accuracy ((TP+TN)/(TP+FP+TN+FN); that is, the percentage of predictions, positive or negative, that are correct relative to the total number of predictions)
    2. Specificity (TN/(TN+FP); that is, the percentage of negative examples that are correctly predicted as negative)
    3. AUC (area under the receiver operating characteristic curve; a summary metric that is commonly used in classification to evaluate performance)
    4. Positive predictive value (TP/(TP+FP); that is, out of all positive predictions what percentage are correct)
    5. False discovery rate (FDR; 1-PPV; percentage of positive predictions that are wrong)
  5. Repeated these calculations with 20 different random prediction sets
  6. Plotted the results as box plots, which summarize the median (dark line in the middle), the interquartile range (the box), and whiskers extending 1.5 times the interquartile range from the box- dots above or below are outside this range.
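
The steps above can be sketched in a few lines of Python. This is a minimal reimplementation of the simulation as described, not the original code; the 85%/95% rates and the 1:100 natural ratio are the settings quoted above.

```python
import random

def simulate(n_pos, n_neg, sens=0.85, spec=0.95, seed=0):
    """Generate random predictions with a fixed portion of positives (sens)
    and negatives (spec) predicted correctly, then compute the performance
    measures discussed above."""
    rng = random.Random(seed)
    tp = sum(rng.random() < sens for _ in range(n_pos))  # correct positives
    fn = n_pos - tp
    tn = sum(rng.random() < spec for _ in range(n_neg))  # correct negatives
    fp = n_neg - tn
    ppv = tp / (tp + fp)
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "specificity": tn / (tn + fp),
        "PPV": ppv,
        "FDR": 1 - ppv,
    }

# 'natural' 1:100 ratio versus an artificially balanced 1:1 split
natural = simulate(n_pos=1000, n_neg=100_000)
balanced = simulate(n_pos=1000, n_neg=1000)
```

Running this reproduces the pattern in the plots below: specificity is stable regardless of the ratio, while PPV collapses (and FDR soars) once the natural excess of negatives is included.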

The results are not surprising but do demonstrate the pitfalls of using artificially balanced data sets. Keep in mind that there are many publications that limit their training and evaluation datasets to a 1:1 ratio of positive to negative examples.


Accuracy estimates are actually worse than they should be for the artificial splits because fewer of the negative results are being considered.


Specificity stays largely the same and is a good estimate because it isn't affected by the ratio of negatives to positive examples. Sensitivity (the same measure but for positive examples) also doesn't change for the same reason.



Happily the AUC doesn't actually change that much- mostly it's just much more variable with smaller ratios of negatives to positives. So an AUC from a 1:1 split should be considered to be in the right ballpark, but maybe off from the real value by a bit.


Positive predictive value (PPV)

Aaaand there's where things go to hell.


False discovery rate (FDR)

Same thing here. The FDR is extremely high (>90%) in the real dataset, but the artificial balanced sets vastly underestimate it.




Why is this a problem?

The last two plots, PPV and FDR, are where the real trouble is. The problem is that the artificial splits vastly overestimate PPV and underestimate FDR (note that the Y axis scale on these plots runs from 0 to close to 1). Why is this important? Because, in general, PPV is what an end user is likely to be concerned about. I’m thinking of the end user that wants to use your great new method for predicting that proteins are members of some very important functional class. They will then apply your method to their own examples (say their newly sequenced bacteria) and rank the positive predictions. They couldn’t care less about the negative predictions because that’s not what they’re interested in. So they take the top few predictions to the lab (they can’t afford to do 100s, only the best few, say 5, predictions) and experimentally validate them.

If your method’s PPV is actually 95% it’s fairly likely that all 5 of their predictions will pan out (it’s NEVER really as likely as that due to all kinds of factors, but for the sake of argument), making them very happy and allowing the poor grad student whose project it is to actually graduate.

However, the actual PPV from the example above is about 5%. This means that the poor grad student who slaves for weeks over experiments to validate at least ONE of your stinking predictions will probably end up empty-handed for their efforts and will have to spend another 3 years struggling to get their project to the point of graduation.

Given a large enough ratio in the real dataset (e.g. protein-protein interactions, where the number of positive examples in human is somewhere around 50-100k but the number of negatives is somewhere around 4.5×10^8, a ratio of ~1:10000) the real PPV can fall to essentially 0, whereas the artificially estimated PPV can stay very high.
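
The collapse follows directly from the definition of PPV: with sensitivity s, specificity p, and r negatives per positive, PPV = s / (s + (1 − p)·r). Here is a quick sketch using the simulation’s 85%/95% settings (these closed-form numbers are illustrative; the exact values in the plots depend on the simulated sets):

```python
def ppv(sens, spec, neg_per_pos):
    """Closed-form PPV: per positive example we expect `sens` true positives;
    per negative example we expect (1 - spec) false positives."""
    tp = sens
    fp = (1 - spec) * neg_per_pos
    return tp / (tp + fp)

balanced_ppv = ppv(0.85, 0.95, 1)       # ~0.94 on an artificial 1:1 split
natural_ppv = ppv(0.85, 0.95, 100)      # ~0.15 at a natural 1:100 ratio
extreme_ppv = ppv(0.85, 0.95, 10_000)   # ~0.002 at ~1:10000 (the PPI case)
```

Same classifier, same per-class error rates; only the background ratio changes, and PPV goes from excellent to essentially zero.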

So, don’t be that bioinformatician who publishes the paper with performance results based on a vastly artificial balance of positive versus negative examples that ruins some poor graduate student’s life down the road.


Multidrug resistance in bacteria

So I just published a paper on predicting multi drug resistance transporters in the journal F1000 Research. This was part of my diabolical* plot (and here) to get grant money (*not really diabolical, but definitely risky, and hopefully clever). So what’s the paper about? Here’s my short explanation, hopefully aimed so that everyone can understand.

TL;DR version (since I wrote more than I thought I was going to)

Antibiotic resistance in bacteria is a rapidly growing health problem- if our existing antibiotics become useless against pathogens we’ve got a big problem. One of the mechanisms of resistance is that bacteria have transporters, proteins that pump out the antibiotics so they can’t kill the bacteria. There are many different kinds of these transporters and finding more of them will help us understand resistance mechanisms. We’ve used a method based on understanding written language to interpret the sequence of proteins (the order of building blocks used to build the protein) and predict a meaning from this- the meaning being the function of antibiotic transporter. We applied this approach to a large set of proteins from bacteria in the environment (a salty lake in Washington state in this case) because it’s known that these poorly understood bacteria have a lot of new proteins that can be transferred to human pathogens and give them superpowers (that is, antibiotic resistance).

(now the long version)

Antibiotic resistance in bacteria

This is a growing world health problem that you’ve probably heard about. Prior to the discovery of antibiotics bacterial infections were a very serious problem that we couldn’t do much about. Antibiotics changed all that, providing a very effective way to treat common and uncommon bacterial infections, and saving countless lives. The problem is that there are a limited number of different kinds of antibiotics that we have (that is, that have been discovered and are clinically effective without drastic side effects) and the prevalence of strains of common bacterial pathogens with resistance to one or more of these antibiotics is growing at an alarming rate. The world will be a very different place if we no longer have effective antibiotics (see this piece for a scary peek into what it’ll be like).

How does this happen? The driving force is Darwinian selection- survival of the fittest. Imagine that the pathogens are a herd of deer and that antibiotics are a wolf pack. The wolf pack easily kills off the slower deer, but leaves the fastest ones to live and reproduce, leading to faster offspring that are harder to kill. Also, the fast deer can pass off their speed to slow deer that are around, making them hard to kill.

Bacterial resistance to antibiotics works in a somewhat similar way. Bacteria can evolve, driven by natural selection, and they reproduce very quickly- but they have an even faster way to accomplish this adaptation than evolving new functions from the ground up. They can exchange genetic material, including the plans for resistance mechanisms (genes that code for resistance proteins) with other bacteria. And they can make these exchanges between bacteria of different species, so a resistant pathogen can pass off resistance to another pathogen, or an innocuous environmental bacteria can pass off a resistance gene to a pathogen making it resistant.

There are three main classes of resistance. First, the bacteria can develop resistance by altering the target of the antibiotic so that it can no longer kill. The ‘target’ in this case is often a protein that the bacteria use to do some critical thing- and the antibiotic mucks it up so that the bacteria die since they can’t accomplish that thing they need to do. Think of this like a disguise- the deer put on a nose and glasses and long coat a la Scooby Doo, and the wolves run right by without noticing. Second, the bacteria can produce an enzyme (a protein that alters small molecules in some way, like sugars or drugs) that transforms the antibiotic into an ineffective form. Think of this like the deer using handcuffs to cuff the legs of the wolves together so they can’t run anymore, and thus can’t chase and kill the deer (which are the bacteria, if you remember). Third, the bacteria can produce special transporter proteins that pump the antibiotic out of the inside of the (bacterial) cell and away from the vital machinery that the antibiotic is targeting. Think of this like the possibility that deer engineers have developed portable wolf catapults. When a wolf gets too close it’s simply catapulted over the trees so it can’t do its evil business (in this case, actually good business, because the wolves are the antibiotics, remember?).

Antibiotic resistance and the resistome

The problem addressed in the paper

The problem we address in the paper is related to the third mechanism of resistance- the transporter proteins. There are a number of types of these transporters that can transport several or many different kinds of antibiotics at the same time- thus multi drug resistance transporters. Still, it’s likely that there are a lot of these kinds of proteins out there that we don’t recognize as such- in many cases you can’t just look at the sequence (see the section below) of the protein and figure out what it does.

The point of the paper is to develop a method that can detect these kinds of proteins and look for those beyond what we already know about. The long range view is that this will help us understand better how these kinds of proteins work and possibly suggest ways to block them (using novel antibiotics) to make existing antibiotics more effective again.

An interesting thing that has become clear in the last few years is that environmental bacteria have a large number of different resistance mechanisms to existing antibiotics (and probably to antibiotics we don’t even know about yet). And there are a LOT of environmental bacteria in just about every place on earth. Most of these we don’t know anything about. This has been called the “antibiotic resistome” meaning that it’s a vast reservoir of unknown potential for resistance that can be transferred to human pathogens. In the case of the second mechanism of resistance, the enzymes, these likely have evolved since bacteria in these environmental communities are undergoing constant warfare with each other- producing molecules (like antibiotics) that are designed to kill other species. In the case of the third resistance mechanism (the transporters) this could also be true, but these transporters seem to have a lot of other functions too- like ridding the bacteria of harmful molecules that might be in the environment like salts.

Linguistic-based sequence analysis 

The paper uses an approach that was developed in linguistics (study of language) to analyze proteins. This works because the building blocks of proteins (see below) can be viewed as a kind of language, where different combinations of blocks in different orders can give rise to different meanings- that is, different functions for the protein.

The sequence of a protein refers to the fact that proteins are made up of long chains of amino acids. Amino acids are just building block molecules, and there are 20 different kinds that are commonly found in proteins. These 20 different kinds make up an alphabet, and the alphabet is used to “spell” the protein. The list of amino acid letters that represents the protein is its sequence. It’s relatively easy to get the sequences of proteins for many bacteria, but the problem of what these sequences actually do is very much an open one. Proteins with similar sequences often do similar things. But there are some interesting exceptions to this that I can illustrate using actual letters and sentences.

The first is that similar sequences might have different meanings.

1) “When she looked at the pool Jala realized it was low.”

2) “When she looked at the pool Jala realized she was too slow.”

The second is that very different sentences might have similar meanings.

1) “When he looked at the pool Joe realized it was dirty.”

2) “The dirty pool caught Joe’s attention.”

(these probably aren’t the BEST sentences to illustrate this, if you have better suggestions please let me know)

The multi drug transporters have elements of both problems. There are large families of transporter proteins that are pretty similar in terms of protein sequence- but the proteins actually transport different things (like non-antibiotic molecules), and at this point we can’t just look at the sequences and figure out what they transport for many examples. There are also several families of multi drug transporters that have pretty different sequences between families but all do essentially the same job of transporting several types of drugs.

Linguistics, and especially computational linguistics, has been focused on developing algorithms (computer code) to interpret language into meaning. The approach we use in the paper, called PILGram, does exactly this and has been applied to interpretation of natural (English) language for other projects. We just adapted it somewhat so that it would work on protein sequences. Then we trained the method (since the method learns by example) on a set of proteins where we know the answer- previously identified multi drug transporters. After it was trained and we evaluated how well it could do its intended job (that is, taking protein sequences and figuring out if they are multi drug transporters or not), we let it loose on a large set of proteins from bacteria in a very salty lake in northern Washington state called Hot Lake.

What we found

First we found that the linguistic-based method did pretty well on some protein sequence problems where we already knew what the answer was. These PROSITE patterns are from a database where scientists have spent a lot of effort figuring out protein motifs (like figures of speech in language that always mean the same thing) for a whole collection of different protein functions. PILGram was able to do pretty well (though not perfectly) at figuring out what those motifs were- even though we didn’t spend any time on looking through the protein sequences, which is what PROSITE did. So that was good.

We then showed that the method could predict multi drug resistance transporters, a set of proteins where a common motif isn’t known. Again, it does fairly well – not perfect but much better than existing ways of doing this. We evaluated how well it did by pretending we didn’t know the answers for a set of proteins when we actually do know the answer- this is called ‘holding out’ some of the data. The trained method (trained on the set of proteins we didn’t hold out) was then used to predict whether or not the held out proteins were multi drug transporters and we could evaluate how well they did by comparing with the real answers.
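
The hold-out evaluation described above can be sketched generically. This is an illustration of the scheme, not the actual PILGram code, and the example data are made up:

```python
import random

def holdout_split(labelled_examples, frac_held_out=0.2, seed=0):
    """Split labelled examples into a training set and a held-out test set.
    The model is trained only on the first list; the second is kept aside
    and used to compare predictions against the known answers."""
    rng = random.Random(seed)
    shuffled = list(labelled_examples)
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * frac_held_out)
    return shuffled[n_test:], shuffled[:n_test]  # (train, held_out)

# Toy labelled set: (protein id, is it a multi drug transporter?)
examples = [(f"protein_{i}", i % 2 == 0) for i in range(100)]
train, held_out = holdout_split(examples)
```

Because the held-out examples never influence training, performance on them is an honest estimate of how the method will do on proteins it has never seen.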

Finally, we found that the method identified a number of likely looking candidate multi drug transporters from the Hot Lake community proteins and we listed a few of these candidates.

The next step will be to look at these candidates in the lab and see if they actually are multi drug transporters or not. This step is called “validation”. If they are (or at least one or two are) then that’s good for the method- it says that the method can actually predict something useful. If not then we’ll have to refine the method further and try to do better (though a negative result in a limited validation doesn’t necessarily mean that the method doesn’t work). This step, along with a number of computational improvements to the method, is what I proposed in the grant I just submitted. So if I get the funding I get to do this fun stuff.

More information

How to know when to let go

This post is a story in two parts. The first part is about the most in-depth peer review I think I’ve ever gotten. The second deals with making the decision to pull the plug on a project.

Part 1: In which Reviewer 3 is very thorough, and right.

Sometimes Reviewer 3 (that anonymous peer reviewer who consistently causes problems) is right on the money. To extend some of the work I’ve done to predict problematic functions of proteins I started a new effort about 2 years ago now. It went really slowly at first and I’ve never put a huge amount of effort into it, but I thought it had real promise. Essentially it was based on gathering up examples of a functional family that

"At last we meet, reviewer 3, if that is indeed your real name"


could be used in a machine learning-type approach. The functional family (in this case I chose E3 ubiquitin ligases) is problematic in that there are functionally similar proteins that show little or no sequence similarity by traditional BLAST search. Anyway, using a somewhat innovative approach we developed a model that could predict these kinds of proteins (which are important bacterial virulence effectors) pretty well (much better than BLAST). We wrote up the results and sent it off for an easy publication.

Of course, that’s not the end of the story. The paper was rejected from PLoS One, based on in-depth comments from reviewer 1 (hereafter referred to as Reviewer 3, because). As part of the paper submission we had included supplemental data, enough to replicate our findings, as should be the case*. Generally this kind of data isn’t scrutinized very closely (if at all) by reviewers. This case was different. Reviewer 3 is a functional prediction researcher of some kind (anonymous, so I don’t really know) and their lab is set up to look at these kinds of problems- though probably not from the bacterial pathogenesis angle judging from a few of the comments. So Reviewer 3’s critique can be summed up in their own words:

I see the presented paper as a typical example of computational “solutions” (often based on machine-learning) that produce reasonable numbers on artificial test data, but completely fail in solving the underlying biologic problem in real science.

Ouch. Harsh. And partly true. They are actually wrong about that point from one angle (the work solves a real problem- see Part 2, below) but right from another angle (that problem had apparently already been solved, at least practically speaking). They went on, “my workgroup performed a small experiment to show that a simple classifier based on sequence similarity and protein domains can perform at least as well as <my method> for the envisioned task.” In the review they then present an analysis they did on my supplemental data in which they simply searched for existing Pfam domains that were associated with ubiquitin ligase function. Their analysis, indeed, shows that just searching for these four known domains could predict this function as well as or better than my method. This is interesting because it’s the first time that I can remember where a reviewer has gone into the supplemental data to do an analysis for the review. This is not a problem at all- in fact, it’s a very good thing. Although I’m disappointed to have my paper rejected I was happy that a knowledgeable and thorough peer reviewer had done due diligence and exposed this somewhat gaping hole in my approach/results. It’s worth noting that the other reviewer identified himself, was very knowledgeable, and was favorable to the paper- he just missed this point because it’s fairly specific- and, as I detail below, wrong in a particular kind of way.

So, that’s it right? Game over. Take my toys and go home (or to another more pressing project). Well, maybe or maybe not.

Part 2: In which I take a close look at Reviewer 3’s points and try to rescue my paper

One of the hardest things to learn is how to leave something that you’ve put considerable investment into and move on to more productive pastures. This is true in relationships, investments, and, certainly, academia. I don’t want to just drop this two year project (albeit, not two solid years) without taking a close look to see if there’s something I can do to rescue it. Without going into the details of specific points Reviewer 3 made I’ll tell you about my thought process on this topic.

So, first. One real problem here is that the Pfam models Reviewer 3 used were constructed from the examples I was using. That means that their approach is circular: the Pfam model can identify the examples of E3 ubiquitin ligases because it was built from those same examples. They note that four different Pfam models can describe most of the examples I used. From the analysis that I did in the paper and then again following Reviewer 3’s comments, I found that these models do not cross-predict, whereas my model does. That is, my single model can predict the same as these four different individual models. These facts both mean that Reviewer 3’s critique is not exactly on the mark- my method does some good stuff that Pfam/BLAST can’t do. Unfortunately, neither of these facts makes my method any more practically useful. That is, if you want to predict E3 ubiquitin ligase function you can use Pfam domains to do so.

Which leads me to the second point of possible rescue. Reviewer 3’s analysis, and my subsequent re-analysis to check to make sure they were correct, identified around 30 proteins that are known ubiquitin ligases but which do not have one of these four Pfam domains. These are false negative predictions, by the Pfam method. Using my method these are all predicted to be ubiquitin ligases with pretty good accuracy. This is a definite good point then to my method, meaning that my method can correctly identify those known ligases that don’t have known domains. There! I have something useful that I can publish, right? Well, not so fast. I was interested in seeing what Pfam domains might be in those proteins other than the four ligase domains so I looked more closely. Unfortunately what I found was that these proteins all had a couple of other domains that were specific to E3 ubiquitin ligases but that Reviewer 3 didn’t notice. Sigh. So that means that all the examples in my E3 ubiquitin ligase dataset can be correctly identified by around 6 Pfam domains, again rendering my method essentially useless, though not incorrect. It is worth noting that it is certainly possible that my method would be much better at identification of new E3 ligases that don’t fall into these 6 ‘families’ – but I don’t have any such examples, so I don’t really know and can’t demonstrate this in the paper.

So where does this leave me? I have a method that is sound, but solves a problem that may not have needed to be solved (as Reviewer 3 pointed out, sort of). I would very much like to publish this paper since I, and several other people, have spent a fair amount of time on it. But I’m left a bit empty-handed. Here are the three paths I can see to publication:

  1. Experimental validation. I make some novel predictions with my method and then have a collaborator validate them. Great idea but this would take a lot of time and effort and luck to pull off. Of course, if it worked it would demonstrate the method’s utility very solidly. Not going to happen right now I think.
  2. Biological insight. I make some novel observations given my model that point out interesting biology underpinning bacterial/viral E3 ubiquitin ligases. This might be possible, and I have a little bit of it in the paper already. However, I think I’d need something solid and maybe experimentally validated to really push this forward.
  3. Another function. Demonstrate that the general approach works on another functional group- one that actually is a good target for this kind of thing. This is something I think I have (another functional group) and I just need to do some checking to really make sure first (like running Pfam on it, duh.) I can then leave the ubiquitin ligase stuff in there as my development example and then apply the method to this ‘real’ problem. This is most likely what I’ll do here (assuming that the new example function I have actually is a good one) since it requires the least amount of work.

So, full disclosure: I didn’t know when I started writing this post this morning what I was going to do with this paper and had pretty much written it off. But now I’m thinking that there may be a relatively easy path to publication with option 3 above. If my new example doesn’t pan out I may very well have to completely abandon this project and move on. But if it does work then I’ll have a nice little story requiring a minimum of (extra) effort.

As a punchline to this story- I’ve written a grant using this project as a fairly key piece of preliminary data. That grant is being reviewed today- as I write. As I described above, there’s nothing wrong with the method- and it actually fits nicely (still) to demonstrate what I needed it to for the grant. However, if the grant is funded then I’ll have actual real money to work on this and that will open other options up for this project. Here’s hoping. If the grant is funded I’ve decided I’ll establish a regular blog post to cover it, hopefully going from start to (successfully renewed) finish on my first R01. So, again, here’s hoping.

*Supplemental data in a scientific manuscript is the figures, tables, and other kinds of data files that either can’t be included in the main text because of size (no one wants to read 20 pages of gene listings in a paper- though I have seen stuff like this) or because the information is felt to be non-central to the main story and better left for more interested readers.

The false dichotomy of multiple hypothesis testing

[Disclaimer: I’m not a statistician, but I do play one at work from time to time. If I’ve gotten something wrong here please point it out to me. This is an evolving thought process for me that’s part of the larger picture of what the scientific method does and doesn’t mean- not the definitive truth about multiple hypothesis testing.]

There’s a division in research between hypothesis-driven and discovery-driven endeavors. In hypothesis-driven research you start out with a model of what’s going on (this can be explicitly stated or just the amalgamation of what’s known about the system you’re studying) and then design an experiment to test that hypothesis (see my discussions on the scientific method here and here). In discovery-driven research you start out with more general questions (that can easily be stated as hypotheses, but often aren’t) and generate larger amounts of data, then search the data for relationships using statistical methods (or other discovery-based methods).

The problem with analysis of large amounts of data is that when you’re applying a statistical test to a dataset you are actually testing many, many hypotheses at once. This means that your level of surprise at finding something that you call significant (arbitrarily but traditionally a p-value of less than 0.05) may be inflated by the fact that you’re looking a whole bunch of times (thus increasing the odds that you’ll observe SOMETHING just on random chance alone- see this excellent xkcd cartoon for an example, see below since I’ll refer to this example). So you need to apply some kind of multiple hypothesis correction to your statistical results to reduce the chances that you’ll fool yourself into thinking that you’ve got something real when actually you’ve just got something random. In the XKCD example below a multiple hypothesis correction using Bonferroni’s method (one of the simplest and most conservative corrections) would suggest that the threshold for significance should be moved to 0.05/20=0.0025 – since 20 different tests were performed.

Here’s where the problem of a false dichotomy occurs. Many researchers who analyze large amounts of data believe that utilizing a hypothesis-based approach mitigates the effect of multiple hypothesis testing on their results. That is, they believe that they can focus their investigation of the data to a subset constrained by a model/hypothesis and thus reduce the effect that multiple hypothesis testing has on their analysis. Instead of looking at 10,000 proteins in a study they now look at only the 25 proteins that are thought to be present in a particular pathway of interest (where the pathway represents the model based on existing knowledge). This is like saying, “we believe that jelly beans in the blue-green color range cause acne” and then drawing your significance threshold at 0.05/4=0.0125 – since there are ~4 jelly beans tested that are in the blue-green color range (not sure if ‘lilac’ counts or not- that would make 5). All well and good EXCEPT for the fact that the actual chance of detecting something by random chance HASN’T changed. In large scale data analysis (transcriptome analysis, e.g.) you’ve still MEASURED everything else. You’ve just chosen to limit your investigation to a smaller subset and then can ‘go easy’ on your multiple hypothesis correction.
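
The arithmetic behind this point is easy to demonstrate: correct only for the subset you chose to report, and the number of chance hits across everything you actually measured is unchanged. A sketch with purely random (null) data, using the 10,000-protein / 25-protein numbers from the example above:

```python
import random

def bonferroni_threshold(alpha, n_tests):
    """Bonferroni-corrected per-test significance threshold
    (0.05/20 = 0.0025 in the jelly-bean example)."""
    return alpha / n_tests

rng = random.Random(7)
n_measured = 10_000   # everything the experiment actually measured
n_reported = 25       # the pathway subset picked for 'hypothesis-driven' analysis

# Null data: p-values are uniform on [0, 1], so nothing is truly significant.
pvals = [rng.random() for _ in range(n_measured)]

# Correcting only for the 25 reported tests gives a lenient threshold...
threshold = bonferroni_threshold(0.05, n_reported)  # 0.002
# ...but across everything measured we still expect about
# n_measured * threshold = 20 hits by chance alone.
chance_hits = sum(p < threshold for p in pvals)
```

The ~20 chance hits don’t go away because you only looked at 25 of them; they’re still sitting in the data you measured but chose not to examine.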

The counter-argument that might be made to this point is that by doing this you’re testing a specific hypothesis, one that you believe to be true and may be supported by existing data . This is a reasonable point in one sense- it may lend credence to your finding that there is existing information supporting your result. But on the other hand it doesn’t change the fact that you still could be finding more things by chance than you realize because you simply hadn’t looked at the rest of your data. It turns out that this is true not just of analysis of big data, but also of some kinds of traditional experiments aimed at testing individual – associative- hypotheses. The difference there is that it is technically unfeasible to actually test a large amount of the background cases (generally limited to one or two negative controls). Also a mechanistic hypothesis (as opposed to an associative one) is based on intervention, which tells you something different and so is not (as) subject to these considerations.

Imagine that you’ve dropped your car keys in the street and you don’t know what they look like (maybe you’re borrowing a friend’s car). You’re pretty sure you dropped them in front of the coffee shop on a block with 7 other shops on it, but you did walk the length of the block before you noticed the keys were gone. You walk directly back to look in front of the coffee shop and find a set of keys. Great, you’re done. You found your keys, right? What if you’d looked in front of the other stores and found other sets of keys there? You didn’t look, but that doesn’t make it any less likely that you’re wrong about these keys (your existing knowledge/model/hypothesis, “I dropped them in front of the coffee shop”, could easily be wrong).

XKCD: significant

Eight red flags in bioinformatics analyses

A recent comment in Nature by C. Glenn Begley outlines six red flags warning that a piece of basic science research won’t be reproducible. It’s an excellent read with excellent points. The point of the comment, based on the experience of writing two papers in which:

Researchers — including me and my colleagues — had just reported that the majority of preclinical cancer papers in top-tier journals could not be reproduced, even by the investigators themselves [1,2].

was to summarize the common problems observed in the non-reproducible papers surveyed, since the author could not reveal the identities of the papers themselves. Results in a whopping 90% of the papers surveyed could not be reproduced, in some cases even by the same researchers in the same lab, using the same protocols and reagents. The ‘red flags’ are really warnings to researchers of ways that they can fool themselves (as well as reviewers and readers at high-profile journals) and things that they should do to avoid falling into the traps found by the survey. These kinds of issues are major problems in the analysis of high-throughput data for biomarker studies and other purposes as well. As I was reading this I realized that I’d written several posts about these issues as applied to bioinformatics and computational biology research. Therefore, here is my brief summary of these six red flags, plus two more that are specific to high-throughput analysis, with links to my previous posts or those of others as applicable.

  1. Were experiments performed blinded? This is something I hadn’t previously considered directly, but my post on how it’s easy to fool yourself in science does address it. In some cases blinding your bioinformatic analysis might be possible and would certainly be very helpful in making sure that you’re not ‘guiding’ your findings to a predetermined answer. The case where this is especially important is when the analysis is directly targeted at addressing a hypothesis. In such cases a solution may be to have a colleague review the results in a blinded manner, though this may take more thought and work than reviewing the results of a limited set of Western blots.
  2. Were basic experiments repeated? This is one place where high-throughput methodology and analysis might have a step up on ‘traditional’ science involving (for example) Western blots. Though it’s a tough fight and sometimes not done correctly, the need for replicates is well recognized, as discussed in my recent post on the subject. In studies where the point is determining patterns from high-throughput data (biomarker studies, for example) it is also quite important to see whether the pattern holds in an independent dataset. Often cross-validation is used as a substitute for an independent dataset, but this falls short: many biomarkers have been found not to generalize to different datasets (other patient cohorts). Examination of the pattern in at least one other independent dataset strengthens the claim of reproducibility considerably.
  3. Were all the results presented? This is an important point, but it can be tricky in analyses that involve many ‘discovery’-focused explorations. It is not necessary to present every comparison, statistical test, heatmap, or network generated during the entire arc of the analysis process. However, when addressing hypotheses (see my post on the scientific method as applied in computational biology) that are critical to the arguments presented in a study, it is essential that you present your results, even where those results are confusing or partly unclear. Obviously, this needs to be filtered to balance readability against telling a coherent story, but results that do not fully support the hypothesis are very important to present.
  4. Were there positive and negative controls? This is just incredibly central to the scientific method but remains a problem in high-throughput data analysis. At the most basic level, analyzing the raw (or mostly raw) data from instruments, this is commonly performed but never reported. In a number of recent cases in my group we’ve found real problems in the data that were revealed simply by looking at these built-in controls, or by figuring out what basic comparisons could serve as controls (for example, do gene expression measurements from biological replicates correlate with each other?). What statistical associations do you expect to see, and what do you expect not to see? These checks are good for preventing you from fooling yourself, and if they are important they should be presented.
  5. Were reagents validated? For data analysis this should be: “Was the code used to perform the analysis validated?” I’ve not written much on this, but there are several people out there who make it a central point in their discussions, including Titus Brown. Among his posts on the subject are here, here, and here. If your code (an extremely important reagent in a computational experiment) does not function as it should, the results of your analyses will be incorrect. A great example of this comes from a group that hunted down a bunch of errors in a series of high-profile cancer papers, which I posted about recently. The authors of those papers were NOT careful about checking that the results of their analyses were correct.
  6. Were statistical tests appropriate? There is just too much to write on this subject in relation to data analysis. There are many ways to go wrong here: inappropriate data for a test, inappropriate assumptions, inappropriate data distributions. I am not a statistician so I will not weigh in on all the possibilities here. But it’s important. Really important. Important enough that if you’re not a statistician you should have a good friend/colleague who is and who can provide specific advice about how to handle statistical analysis.
  7. New! Was multiple hypothesis correction correctly applied? This is really an addition to flag #6 above, specific to high-throughput data analysis. Multiple hypothesis correction is very important in high-throughput data analysis because of the number of statistical comparisons being made; it filters the predictions or statistical relationships observed to provide more conservative estimates. Essentially it extends the question, “how likely is it that the difference I observed in one measurement occurred by chance?” to the population-level question, “how likely is it that I would find this difference by chance if I looked at a whole bunch of measurements?”. Know it. Understand it. Use it.
  8. New! Was an appropriate background distribution used? Again, an extension to flag #6. When judging the significance of findings it is very important to choose a correct background distribution for your test. An example is in proteomics analysis: if you want to know which functional groups are overrepresented in a global proteomics dataset, should you choose your background to be all proteins coded for by the genome? No, because the set of proteins that can be measured by proteomics is (in general) highly biased to start with. To get an appropriate idea of which functional groups are enriched, you should instead choose the proteins actually observed across all conditions as the background.
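To make the background-distribution point concrete, here is a hedged sketch of the same enrichment question asked against two different backgrounds, using a one-sided hypergeometric test. Every count below is invented for illustration; the only point is that the choice of background changes the answer.

```python
from math import comb

# Hedged sketch of flag #8: one enrichment question, two backgrounds.
# All counts below are invented for illustration.

def hypergeom_pvalue(k, K, n, N):
    """P(X >= k): probability of drawing at least k group members when
    sampling n proteins from a background of N, of which K are in the group."""
    return sum(comb(K, i) * comb(N - K, n - i)
               for i in range(k, min(K, n) + 1)) / comb(N, n)

# Suppose 40 of 200 differentially abundant proteins fall in a functional group.
# Wrong background: all 20,000 genome-encoded proteins (400 in the group).
p_genome = hypergeom_pvalue(40, 400, 200, 20_000)

# Better background: the 4,000 proteins actually observed by proteomics
# (300 of them in the group), since measurable proteins are biased to start with.
p_observed = hypergeom_pvalue(40, 300, 200, 4_000)

print(p_genome, p_observed)
```

Both p-values come out small with these made-up numbers, but the genome-wide background inflates the apparent significance by many orders of magnitude, which is exactly the trap the flag warns about.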

The comment by Glenn Begley wraps up with this statement about why these problems are still present in research:

Every biologist wants and often needs to get a paper into Nature or Science or Cell, yet the scientific community fails to recognize the perverse incentive this creates.

I think this is true, but you could substitute “any peer-reviewed journal” for “Nature or Science or Cell”; the problem comes at all levels. It’s also true that these problems are particularly relevant to high-throughput data analysis: such studies can be less hypothesis-directed and more discovery-oriented, they are generally more expensive and thus (in some cases) face more scrutiny of their results, and there is rampant enthusiasm for and overselling of the potential results arising from these kinds of studies.

Illustration from Derek Roczen

The big question: will following these rules improve reproducibility in high-throughput data analysis? The Comment describes these as practices that were present in the reproducible studies (that small 10% of the papers), but does that mean that paying attention to them will improve reproducibility, especially in the case of high-throughput data analysis? There are issues that are more specific to high-throughput data (my flags #7 and #8, above), but essentially these flags are a great starting point for evaluating the integrity of a computational study. With high-throughput methods, and their resulting papers, gaining importance all the time, we need to consider these flags both as producers and as consumers.


  1. Prinz, F., Schlange, T. & Asadullah, K. Nature Rev. Drug Discov. 10, 712 (2011).
  2. Begley, C. G. & Ellis, L. M. Nature 483, 531–533 (2012).

Five minute explanation: Cyanothece transcriptional model

Because the paper is behind a paywall, I’m making it available here as the submitted manuscript. Eventually I’ll get with the program and start releasing on arXiv or Figshare, but for now it’s here. I’ve tried to make the version somewhat pretty (I get really tired of reading papers that are double-spaced and have the figures and tables at the end).


McDermott J.E., Oehmen C., McCue L.A., Hill H., Choi D.M., Stöckel J., Liberton M., Pakrasi H.B., Sherman L.A. (2011) A model of cyclic transcriptomic behavior in Cyanothece species ATCC 51142. Mol Biosystems 7(8):2407-2418. PMID: 21698331

*but behind a paywall at Molecular BioSystems

Here available as the submitted manuscript and supplemental information.


Cyanothece sp. 51142 is an ocean-dwelling cyanobacterium that is capable of fixing nitrogen in the dark and photosynthesizing in the light, two normally incompatible activities. Unlike some other cyanobacteria it makes this switch inside the same cell every light/dark cycle (normally about 12 hours). This makes it interesting from the standpoint of bioenergy production, but also of regulation: the process by which it is able to drastically rearrange its machinery every 12 hours is not well understood.

A ‘wreath’ network of transcriptional changes in Cyanothece over a 24-hour period.

What was done?

We used multiple transcriptomic datasets (measurements of levels of gene expression) taken at different times in the light/dark cycle to construct a general model of the functional processes occurring in Cyanothece. The interesting part was that we did not impose the circular shape on the model; it arose naturally from analysis of the data, and it really does represent a clock, with the pattern of gene expression at different times of day being located at different positions on the clock face. We then used a mathematical approach to relate the expression levels of drivers (regulators) to groups of genes that can be associated with different functions. The model allows us to plug in different starting points and predict what the state of the system will be at future times.
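I won’t reproduce the paper’s actual inference method here, but the “plug in a starting point and simulate forward” idea can be sketched with a toy discrete-time linear model. Everything below (the two gene groups, the rotation matrix, the step counts) is invented purely to illustrate the flavor of such a model:

```python
import math

# Hypothetical sketch (not the published model): the expression state
# evolves by a fixed linear map, x_{t+1} = W x_t, iterated forward
# from any chosen starting point to predict later states.

def predict(state, weights, steps):
    """Apply the linear map `weights` to `state` the given number of times."""
    for _ in range(steps):
        state = [sum(w * x for w, x in zip(row, state)) for row in weights]
    return state

# Toy two-group "clock": a rotation cycles expression between a
# light-phase gene group and a dark-phase gene group every 12 steps.
theta = 2 * math.pi / 12
W = [[math.cos(theta), -math.sin(theta)],
     [math.sin(theta),  math.cos(theta)]]

dawn = [1.0, 0.0]           # light-phase group high, dark-phase group low
dusk = predict(dawn, W, 6)  # half a cycle later the pattern inverts
full = predict(dawn, W, 12) # a full cycle returns to the starting state
```

The appeal of this kind of model is exactly what the paragraph above describes: once the map is fitted, you can start it anywhere (e.g., with a regulator perturbed) and read off the predicted state of the system at any future time.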

Why is it important?

The model we constructed can be changed and results simulated to predict what will happen in a real experiment. These kinds of models are good for focusing experimental efforts by predicting interesting behavior. An example question might be to ask what would happen to the timing of photosynthesis (as judged by gene transcription) if the levels of a key regulator are changed. The resulting prediction(s) can then be tested experimentally to discover new things about the system.

The story

This paper took about five years to get written and accepted. That’s from the point at which I decided that a paper should be written to the point that it was published. It was from the first project I worked on at my then new position. I came up with the wreath visualization early in the process and, after having convinced myself and others that it was real, found that it was a very compelling way to think about the diurnal (day/night) cycle. The figure has been used in many different forms, mainly as eye candy. I’m amused when I see it on a poster that I had nothing to do with (from my workplace PNNL). It has even been used around the web.


Cool example of invisible science

I recently posted on invisible science, unexpected observations that don’t fit the hypothesis and can be easily discarded or overlooked completely. Through a collaboration we just published a paper that demonstrates this concept very well. Here’s its story in a nutshell.

A few years back I published a paper in PLoS Pathogens that described the first use* of a machine learning approach to identify bacterial type III effectors from protein sequence. Type III effectors are proteins that are made by bacteria and exported through a complicated structure (the type III secretion apparatus, a.k.a. the injectisome) directly into a host cell. Inside the host cell these effectors interact with host proteins and networks to effect a change, one that is beneficial for the invading bacteria, and allow survival in an environment that’s not very hospitable to bacterial growth. Though a lot of these kinds of proteins are known, no pattern has been found that specifies secretion by the type III mechanism. It’s a mystery still.

(* there was another paper published back-to-back with mine in PLoS Pathogens that reported the same thing. Additionally, two other papers were published subsequently in other journals that reiterated our findings. I wrote a review of this field here.)
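For readers unfamiliar with this kind of method, the general shape of the approach (featurize candidate protein sequences, then train a classifier on known effectors versus non-effectors) can be sketched. This is not the published model: the 30-residue window, the composition features, the nearest-centroid classifier, and every sequence below are invented stand-ins.

```python
# Hypothetical sketch of sequence-based effector prediction (not the
# published PLoS Pathogens model). Features: amino-acid composition of
# the N-terminal region; classifier: nearest centroid. All sequences
# and parameters are invented.

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def composition(seq, n_terminal=30):
    """Fraction of each amino acid in the first n_terminal residues."""
    region = seq[:n_terminal].upper()
    return [region.count(aa) / len(region) for aa in AMINO_ACIDS]

def centroid(vectors):
    """Element-wise mean of a list of equal-length feature vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]

def classify(seq, effector_centroid, other_centroid):
    """Label a sequence by which training centroid its features are closer to."""
    feats = composition(seq)
    d_eff = sum((f - c) ** 2 for f, c in zip(feats, effector_centroid))
    d_oth = sum((f - c) ** 2 for f, c in zip(feats, other_centroid))
    return "effector" if d_eff < d_oth else "non-effector"

# Toy training data: the "effectors" here are artificially serine-rich.
effectors = ["MSSSTSSLSRSSSPQSSSAMSSSTSPLSSS", "MSTSSSPSSSLSQSSSTMSSSASSSPLSSS"]
others = ["MKLVILGAGVIGLEVAWELAKRGHEVTVLD", "MARTKQTARKSTGGKAPRKQLATKAARKSA"]

eff_c = centroid([composition(s) for s in effectors])
oth_c = centroid([composition(s) for s in others])
```

In practice the published work used richer features and a proper machine learning classifier, but the pipeline shape (features from the N-terminus, labels from known secretion data) is the part this sketch is meant to convey.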

So, on the basis of the model that I published, my collaborators (Drs. Heffron and Niemann) thought it would be cool to see whether a consensus signal (an average of the different parts my model predicted to be important for secretion) would be hyper-secreted (i.e., secreted at a high level). I sent them a couple of predictions and some time later (maybe 8 months) Dr. Niemann contacted me to say that the consensus sequence was not, in fact, secreted. So it looked like the prediction wasn’t any good, and some work had been done to get this negative result.

But not so fast: because they’d had some issues with how the initial construct was made, they remade the construct used to express the consensus sequence. The first one (which was not secreted) used a native promoter and upstream gene sequence, taken from a real effector. This is the region that causes a gene to be expressed, then allows the ribosome to bind to the mRNA and start translation of the actual coding sequence. When they redid the construct they used a non-native upstream sequence from a bacteriophage (a virus that infects bacteria), commonly used for expressing genes. All of a sudden, they got secretion from the same consensus sequence. This was a very weird result: why would changing the untranslated region suddenly change the function that the protein sequence was supposed to be directing?

Figure 1. Translocation of a consensus effector seq.

The path of this experiment could have taken a very different turn here. Dr. Niemann could have simply ignored that ‘spurious’ result and decided that the native promoter was the right answer: the consensus sequence wasn’t secreted.

However, in this case the spurious result was the interesting one. Why did the construct with the bacteriophage upstream region get secreted? The only difference was in the upstream RNA (since the difference was in the non-coding region, the protein produced was exactly the same). Dr. Niemann pressed on and found that the RNA itself was directing secretion. He also found other examples of native upstream sequences that direct secretion in the bacterium we were working on (Salmonella Typhimurium). This had never been observed before in Salmonella, though it was known for a few effectors from Yersinia pestis. He also identified an RNA-binding protein, Hfq, that was required for this secretion. The paper is currently available as a preprint from the journal.

Niemann GS, Brown RN, Mushamiri IT, Nguyen NT, Taiwo R, Stufkens A, Smith RD, Adkins JN, McDermott JE, Heffron F. RNA Type III Secretion Signals that require Hfq. J Bacteriol. 2013 Feb 8. [Epub ahead of print]

So this story never ended in validation of my consensus sequence. Actually, in all likelihood it can’t direct secretion (the results in the paper show that, though it’s not highlighted). But the story turned out to be more interesting and more impactful, and it shows why it’s good to be flexible in science. If you see the results of each experiment only in black and white (it supports or does not support my original hypothesis), this will be extremely limiting to the science you can accomplish.