Another word about balance

[4/17/2015 updated: A reader pointed out that my formulae for specificity and accuracy contained errors. It turns out that both measures were being calculated correctly, just a typing error on the blog. I’ve corrected them below.] 

TL;DR summary

Evaluating a binary classifier on an artificially balanced set of positive and negative examples (which is commonly done in this field) can cause underestimation of the method’s accuracy but vast overestimation of its positive predictive value (PPV). Since PPV is likely the only metric that really matters to one important kind of end user- the biologist who wants to find a couple of novel positive examples in the lab based on your predictions- this is a potentially very big problem with how performance is reported.

The long version

Previously I wrote a post about the importance of having a naturally balanced set of positive and negative examples when evaluating the performance of a binary classifier produced by machine learning methods. I’ve continued to think about this problem and realized that I didn’t have a very good handle on what kinds of effects artificially balanced sets would have on performance. Though the metrics I’m using are very simple, I felt it would be worthwhile to demonstrate the effects, so I did a simple simulation (a code sketch follows the list below).

  1. I produced random prediction sets with a fixed proportion of positives predicted correctly (85%; i.e., sensitivity) and a fixed proportion of negatives predicted correctly (95%; i.e., specificity).
  2. The ‘naturally’ occurring ratio of positive to negative examples could be varied but for the figures below I used 1:100.
  3. I varied the ratio of positive to negative examples used to estimate performance and
  4. Calculated several commonly used measures of performance:
    1. Accuracy ((TP+TN)/(TP+FP+TN+FN); that is, the percentage of predictions, positive or negative, that are correct relative to the total number of predictions)
    2. Specificity (TN/(TN+FP); that is, the percentage of negative examples that are correctly predicted as negative)
    3. AUC (area under the receiver operating characteristic curve; a summary metric that is commonly used in classification to evaluate performance)
    4. Positive predictive value (TP/(TP+FP); that is, out of all positive predictions what percentage are correct)
    5. False discovery rate (FDR; 1-PPV; percentage of positive predictions that are wrong)
  5. Repeated these calculations with 20 different random prediction sets
  6. Plotted the results as box plots, which summarize the median (dark line in the middle), the interquartile range (the box), and whiskers extending to 1.5 times the interquartile range from the box- dots above or below are outside this range.
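
For concreteness, here is a minimal sketch of this kind of simulation in Python. It is not the code behind the figures; the sensitivity, specificity, set sizes, and random seed are just illustrative stand-ins for the values described above.

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_metrics(n_pos, n_neg, sens=0.85, spec=0.95):
    """Simulate a classifier with fixed sensitivity and specificity,
    then compute the performance measures listed above."""
    # Each positive example is predicted correctly with probability `sens`,
    # each negative example with probability `spec`.
    tp = rng.binomial(n_pos, sens)
    fn = n_pos - tp
    tn = rng.binomial(n_neg, spec)
    fp = n_neg - tn
    return {
        "accuracy": (tp + tn) / (n_pos + n_neg),
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "ppv": tp / (tp + fp),
        "fdr": fp / (tp + fp),
    }

# 'Natural' 1:100 ratio versus an artificially balanced 1:1 evaluation set.
for n_pos, n_neg in [(1000, 100000), (1000, 1000)]:
    print(f"1:{n_neg // n_pos}", simulate_metrics(n_pos, n_neg))
```

Repeating each call 20 times per ratio and box-plotting the metrics gives plots with the qualitative behavior described below.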

The results are not surprising but do demonstrate the pitfalls of using artificially balanced data sets. Keep in mind that there are many publications that limit their training and evaluation datasets to a 1:1 ratio of positive to negative examples.


Accuracy estimates are actually worse than they should be for the artificial splits because fewer of the negative results are being considered.




Specificity stays largely the same and is a good estimate because it isn’t affected by the ratio of negatives to positive examples. Sensitivity (the same measure but for positive examples) also doesn’t change for the same reason.



Happily the AUC doesn’t actually change that much- mostly it’s just much more variable with smaller ratios of negatives to positives. So an AUC from a 1:1 split should be considered to be in the right ballpark, but maybe off from the real value by a bit.

Positive predictive value (PPV)


Aaaand there’s where things go to hell.

False discovery rate (FDR)

Same thing here. The FDR is extremely high (>90%) in the real dataset, but the artificially balanced sets vastly underestimate it.




Why is this a problem?

The last two plots, PPV and FDR, are where the real trouble is. The problem is that the artificial splits vastly overestimate PPV and underestimate FDR (note that the Y axis scale on these plots runs from 0 to close to 1). Why is this important? Because, in general, PPV is what an end user is likely to be concerned about. I’m thinking of the end user who wants to use your great new method for predicting that proteins are members of some very important functional class. They will then apply your method to their own examples (say, their newly sequenced bacterium) and rank the positive predictions. They couldn’t care less about the negative predictions because that’s not what they’re interested in. So they take the top few predictions to the lab (they can’t afford to do hundreds, only the best few, say 5, predictions) and experimentally validate them.

If your method’s PPV is actually 95%, it’s fairly likely that all 5 of their predictions will pan out (it’s NEVER really as likely as that due to all kinds of factors, but for the sake of argument), making them very happy and allowing the poor grad student whose project it is to actually graduate.

However, the actual PPV from the example above is about 5%. This means that the poor grad student who slaves for weeks over experiments to validate at least ONE of your stinking predictions will probably end up empty-handed for their efforts and will have to spend another 3 years struggling to get their project to the point of graduation.

Given a large enough ratio in the real dataset (e.g. protein-protein interactions, where the number of positive examples in human is somewhere around 50-100k but the number of negatives is somewhere around 4.5x10^8, a ratio of ~1:10000) the real PPV can fall to essentially 0, whereas the artificially estimated PPV can stay very high.
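
To see why, it helps to write the expected PPV in terms of sensitivity, specificity, and the negative:positive ratio. Here is a minimal sketch using the nominal 85%/95% values from the simulation above; the simulated values plotted above will differ somewhat, but the trend is the same.

```python
def expected_ppv(sens, spec, neg_per_pos):
    """Expected PPV for a classifier with the given sensitivity and
    specificity when there are `neg_per_pos` negatives per positive."""
    tp = sens                       # true positives per positive example
    fp = (1 - spec) * neg_per_pos   # false positives contributed by the negatives
    return tp / (tp + fp)

for ratio in (1, 100, 10000):
    print(f"1:{ratio:<6} expected PPV = {expected_ppv(0.85, 0.95, ratio):.4f}")
```

At 1:1 the expected PPV is above 0.9; at 1:10000 it collapses to a fraction of a percent, even though the sensitivity and specificity haven’t changed at all.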

So, don’t be that bioinformatician who publishes the paper with performance results based on a vastly artificial balance of positive versus negative examples that ruins some poor graduate student’s life down the road.


Magic Hands

Too good to be true or too good to pass up?


There’s been a lot of discussion about the importance of replication in science (read an extensive and very thoughtful post about that here) and notable occurrences of non-reproducible science being published in high-impact journals. One example is the recent retraction of the two STAP stem cell papers from Nature and the accompanying debate over who should be blamed and how. Another is the publication of a study (see also my post about this) in which research labs responsible for high-impact publications were challenged to reproduce their own findings; many of those findings could not be replicated, even in the same labs they were originally performed in. These, and similar cases and studies, indicate serious problems in the scientific process- especially, it seems, for some high-profile studies published in high-impact journals.

I was surprised, therefore, at the reaction of some older, very experienced PIs recently after a talk I gave at a university. I mentioned these problems and briefly explained the results of the study on reproducibility to them- that, in 90% of the cases, the same lab could not reproduce the results that they had previously published. They were generally unfazed. “Oh”, one said, “probably just a post-doc with magic hands that’s no longer in the group”. And all agreed on the difficulty of reproducing results for difficult and complicated experiments.

So my question is: do these fabled lab technicians actually exist? Are there those people who can “just get things to work”? And is this actually a good thing for science?

I have some personal experience in this area. I was quite good at futzing around with getting a protocol to work the first time. I would get great results. Once. Then I would continue to ‘innovate’ and find that I couldn’t replicate my previous work. In my early experiences I sometimes would not keep notes well enough to allow me to go back to the point where I got it to work. Which was quite disturbing and could send me into a non-productive tailspin of trying to replicate the important results. Other times I’d written things down sufficiently that I could get them to work again. And still others I found that someone else in the lab could consistently get better results out of the EXACT SAME protocol- apparently followed the same way. They had magic hands. Something about the way they did things just *worked*. There were some protocols in the lab that just seemed to need this magic touch- some people had it and some people didn’t. But does that mean that the results these protocols produced were wrong?

What kinds of procedures seem to require “magic hands”? One example is from when I was doing electron microscopy (EM) as a graduate student. We were working constantly at improving our protocols for making two-dimensional protein crystals for EM. This was delicate work, which involved mixing protein with a buffer in a small droplet, layering on a special lipid, incubating for some amount of time to let the crystals form, then lifting the fragile lipid monolayer (hopefully with protein crystals) off onto an EM grid and finally staining with an electron dense stain or flash freezing in liquid nitrogen. The buffers would change, the protein preparations would change, the incubation conditions would change, and how the EM grids were applied to our incubation droplets to lift off the delicate 2D crystals was subject to variation. Any one of these things could scuttle getting good crystals and would therefore produce a non-replication situation. There were several of us in the lab that did this and were successful in getting it to work- but it didn’t always work and it took some time to develop the right ‘touch’ to get it to work. The number of factors that *potentially* contributed to success or failure was daunting and a bit disturbing- and sometimes didn’t seem to be amenable to communication in a written protocol. The line between superstition and required steps was very thin.

But this is true of many protocols that I worked with throughout my lab career* – they were often complicated, multi-step procedures that could be affected by many variables- from the ambient temperature and humidity to who prepared the growth media and when. Not that all of these variables DID affect the outcomes, but when an experiment failed there was a long list of possible causes. And the secret with this long list? It probably didn’t include all the factors that did affect the outcome. There were likely hidden factors that could be causing problems. So is someone with magic hands lucky, gifted, or simply persistent? I know of a few examples where all three qualities were likely present- with the last one being, in a way, most important. Yes, my collaborator’s post-doc was able to do amazing things and get amazing results. But (and I know this was the case) she worked really long and hard to get them. She probably repeated experiments many, many times in some cases before she got it to work. And then she repeated the exact combination to repeat the experiments again. And again. And sometimes even that wasn’t enough (oops, the buffer ran out and had to be remade, but the lot number on the bottle was different, and weren’t they working on the DI water supply last week? Now my experiment doesn’t work anymore.)

So perhaps it’s not so surprising that many of the key findings from these papers couldn’t be repeated, even in the same labs. For one thing, there was not the same incentive to get it to work- so the post-doc, or another graduate student who’s taken over the same duties, probably tried once to repeat the experiment. Maybe twice. Didn’t work. Huh. That’s unfortunate. And that’s about as much time as we’re going to put into this little exercise. The protocols could be difficult, complicated, and have many known and unknown variables affecting their outcomes.

But does it mean that all these results are incorrect? Does it mean that the underlying mechanisms or biology that was discovered was just plain wrong? No. Not necessarily. Most, if not all, of these high-profile publications that failed to repeat spawned many follow-on experiments and studies. It’s likely that many of the findings were borne out by orthogonal experiments, that is, experiments that test implications of these findings, and by extension the results of the original finding itself. Because of its nature the study was conducted anonymously- so we don’t really know, but it’s probably true. This is an important point, and one that was brought up by the experienced PIs I was talking with: sometimes direct replication may not be the most important thing. Important, yes. But perhaps not deal-killing if it doesn’t work. The results still might stand IF, and only if, second, third, and fourth orthogonal experiments can be performed that tell the same story.

Does this mean that you actually can make stem cells by treating regular cultured cells with an acid bath? Well, probably not. For some of these surprising, high-profile findings the ‘replication’ that is discussed is other labs trying to see if the finding is correct. So they try the protocols that have been reported, but it’s likely that they also try other orthogonal experiments that would, if positive, support the original claim.

"OMG! This would be so amazing if it's true- so, it MUST be true!"


So this gets back to my earlier discussions on the scientific method and the importance of being your own worst skeptic (see here and here). For every positive result the first reaction should be “this is wrong”, followed by, “but- if it WERE right then X, Y, and Z would have to be true. And we can test X, Y, and Z by…”. The burden of scientific ‘truth’** is in replication, but in replication of the finding– NOT NECESSARILY in replication of the identical experiments.

*I was a labbie for quite a few of my formative years. That is, I actually got my hands dirty and did real, honest-to-god experiments, with Eppendorf tubes, vortexers, water baths, cell culture, the whole bit. Then I converted and became what I am today – a creature purely of silicon and code. Which suits me quite well. This is all just to add an “I kinda know what I’m talking about here- at least somewhat” note to my post.

** where I’m using a very scientific meaning of truth here, which is actually something like “a finding that has extensive support through multiple lines of complementary evidence”

Eight red flags in bioinformatics analyses

A recent comment in Nature by C. Glenn Begley outlines six red flags that suggest basic science research won’t be reproducible. Excellent read and excellent points. The point of the comment, based on experience from writing two papers in which:

Researchers — including me and my colleagues — had just reported that the majority of preclinical cancer papers in top-tier journals could not be reproduced, even by the investigators themselves [1,2].

was to summarize the common problems observed in the non-reproducible papers surveyed since the author could not reveal the identities of the papers themselves. Results in a whopping 90% of papers they surveyed could not be reproduced, in some cases even by the same researchers in the same lab, using the same protocols and reagents. The ‘red flags’ are really warnings to researchers of ways that they can fool themselves (as well as reviewers and readers in high-profile journals) and things that they should do to avoid falling into the traps found by the survey. These kinds of issues are major problems in analysis of high-throughput data for biomarker studies, and other purposes as well. As I was reading this I realized that I’d written several posts about these issues, but applied to bioinformatics and computational biology research. Therefore, here is my brief summary of these six red flags, plus two more that are more specific to high-throughput analysis, as they apply to computational analysis- linking to my previous posts or those of others as applicable.

  1. Were experiments performed blinded? This is something I hadn’t previously considered directly but my post on how it’s easy to fool yourself in science does address this. In some cases blinding your bioinformatic analysis might be possible and would certainly be very helpful in making sure that you’re not ‘guiding’ your findings to a predetermined answer. This is especially important when the analysis is directly targeted at addressing a hypothesis. In these cases a solution may be to have a colleague review the results in a blinded manner- though this may take more thought and work than reviewing the results of a limited set of Western blots.
  2. Were basic experiments repeated? This is one place where high-throughput methodology and analysis might have a step up on ‘traditional’ science involving (for example) Western blots. Though it’s a tough fight and sometimes not done correctly, the need for replicates is well-recognized as discussed in my recent post on the subject. In studies where the point is determining patterns from high-throughput data (biomarker studies, for example) it is also quite important to see if the study has found their pattern in an independent dataset. Often cross-validation is used as a substitute for an independent dataset- but this falls short. Many biomarkers have been found not to generalize to different datasets (other patient cohorts). Examination of the pattern in at least one other independent dataset strengthens the claim of reproducibility considerably.
  3. Were all the results presented? This is an important point but can be tricky in analyses that involve many ‘discovery’ focused steps. It is not necessary to present every comparison, statistical test, heatmap, or network generated during the entire arc of the analysis process. However, when addressing hypotheses (see my post on the scientific method as applied in computational biology) that are critical to the arguments presented in a study it is essential that you present your results, even where those results are confusing or partly unclear. Obviously, this needs to be done through a filter to balance readability and telling a coherent story- but results that only partly support the hypothesis are very important to present.
  4. Were there positive and negative controls? This is just incredibly central to the scientific method but is a problem in high-throughput data analysis. At the most basic level, analyzing the raw (or mostly raw) data from instruments, this is commonly performed but never reported. In a number of recent cases in my group we’ve found real problems in the data that were revealed by simply looking at these built-in controls, or by figuring out what basic comparisons could be used as controls (for example, do gene expression from biological replicates correlate with each other?). What statistical associations do you expect to see and what do you expect not to see? These checks are good to prevent fooling yourself- and if they are important they should be presented.
  5. Were reagents validated? For data analysis this should be: “Was the code used to perform the analysis validated?” I’ve not written much on this but there are several out there who make it a central point in their discussions including Titus Brown. Among his posts on this subject are here, here, and here. If your code (an extremely important reagent in a computational experiment) does not function as it should the results of your analyses will be incorrect. A great example of this is from a group that hunted down a bunch of errors in a series of high-profile cancer papers I posted about recently. The authors of those papers were NOT careful about checking that the results of their analyses were correct.
  6. Were statistical tests appropriate? There is just too much to write on this subject in relation to data analysis. There are many ways to go wrong here- inappropriate data for a test, inappropriate assumptions, inappropriate data distribution. I am not a statistician so I will not weigh in on the possibilities here. But it’s important. Really important. Important enough that if you’re not a statistician you should have a good friend/colleague who is and can provide specific advice to you about how to handle statistical analysis.
  7. New! Was multiple hypothesis correction correctly applied? This is really an addition to flag #6 above, specific to high-throughput data analysis. Multiple hypothesis correction is very important in high-throughput data analysis because of the number of statistical comparisons being made. It is a way of filtering predictions or statistical relationships to provide more conservative estimates. Essentially it extends the question, “how likely is it that the difference I observed in one measurement is occurring by chance?” to the population-level question, “how likely is it that I would find this difference by chance if I looked at a whole bunch of measurements?”. Know it. Understand it. Use it. (See the first sketch after this list.)
  8. New! Was an appropriate background distribution used? Again, an extension to flag #6. When judging the significance of findings it is very important to choose a correct background distribution for your test. An example is in proteomics analysis. If you want to know which functional groups are overrepresented in a global proteomics dataset, should you choose your background to be all proteins coded for by the genome? No- because the set of proteins that can be measured by proteomics (in general) is highly biased to start with. So to get an appropriate idea of which functional groups are enriched you should choose the proteins actually observed across all conditions as the background. (See the second sketch after this list.)
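
On flag #7, here is a minimal sketch of Benjamini-Hochberg correction using statsmodels; the p-values are invented purely for illustration.

```python
import numpy as np
from statsmodels.stats.multitest import multipletests

# Invented p-values, e.g. from per-gene differential expression tests:
# a handful of real effects buried in thousands of null results.
rng = np.random.default_rng(1)
pvals = np.concatenate([
    rng.uniform(0, 1e-8, size=20),   # genuinely tiny p-values
    rng.uniform(0, 1, size=9980),    # consistent with the null
])

# Naive thresholding at 0.05 "discovers" hundreds of hits by chance alone...
print("uncorrected hits:", int((pvals < 0.05).sum()))

# ...while Benjamini-Hochberg controls the false discovery rate across all tests.
reject, qvals, _, _ = multipletests(pvals, alpha=0.05, method="fdr_bh")
print("BH-corrected hits:", int(reject.sum()))
```

Without correction roughly 5% of the null tests clear the 0.05 cutoff, which here means hundreds of spurious ‘discoveries’; after correction the hit list shrinks back to roughly the planted signals.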
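
On flag #8, here is a minimal sketch of the same functional-enrichment test run against two different backgrounds, using a hypergeometric test from scipy; all of the counts are invented for illustration.

```python
from scipy.stats import hypergeom

def enrichment_p(hits_in_set, set_size, hits_in_background, background_size):
    """P(>= hits_in_set annotated proteins in a random draw of set_size
    from a background containing hits_in_background annotated proteins)."""
    return hypergeom.sf(hits_in_set - 1, background_size,
                        hits_in_background, set_size)

# Invented example: 40 of 200 differentially abundant proteins carry some annotation.
# Background 1: every protein coded by the genome (20,000 proteins, 1,000 annotated).
print("vs. genome background:   p = %.2e" % enrichment_p(40, 200, 1000, 20000))
# Background 2: only proteins actually observed by proteomics (3,000 proteins,
# 450 annotated)- the observable set is itself biased toward certain functions.
print("vs. observed background: p = %.2e" % enrichment_p(40, 200, 450, 3000))
```

The genome-wide background makes the enrichment look wildly significant, while the observed-protein background, which is the fairer comparison here, gives a much more modest p-value.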

The comment by Glenn Begley wraps up with this statement about why these problems are still present in research:

Every biologist wants and often needs to get a paper into Nature or Science or Cell, yet the scientific community fails to recognize the perverse incentive this creates.

I think this is true, but you could substitute “any peer-reviewed journal” for “Nature or Science or Cell”- the problem comes at all levels. It’s also true that these problems are particularly relevant to high-throughput data analysis: such analyses can be less hypothesis-directed and more discovery-oriented, they are generally more expensive (and thus, in some cases, attract more scrutiny of the results), and there is rampant enthusiasm and overselling of potential results arising from these kinds of studies.

Illustration from Derek Roczen

The big question: will following these rules improve reproducibility in high-throughput data analysis? The Comment presents these as things that were present in the reproducible studies (that small 10% of the papers), but does that mean that paying attention to them will improve reproducibility, especially in the case of high-throughput data analysis? There are issues that are more specific to high-throughput data (my flags #7 and #8, above), but essentially these flags are a great starting point for evaluating the integrity of a computational study. With high-throughput methods, and their resulting papers, gaining importance all the time, we need to consider these flags both as producers and as consumers of such studies.


  1. Prinz, F., Schlange, T. & Asadullah, K. Nature Rev. Drug Discov. 10, 712 (2011).
  2. Begley, C. G. & Ellis, L. M. Nature 483, 531–533 (2012).

Cool example of invisible science

I recently posted on invisible science, unexpected observations that don’t fit the hypothesis and can be easily discarded or overlooked completely. Through a collaboration we just published a paper that demonstrates this concept very well. Here’s its story in a nutshell.

A few years back I published a paper in PLoS Pathogens that described the first use* of a machine learning approach to identify bacterial type III effectors from protein sequence. Type III effectors are proteins that are made by bacteria and exported through a complicated structure (the type III secretion apparatus, a.k.a. the injectisome) directly into a host cell. Inside the host cell these effectors interact with host proteins and networks to effect a change, one that is beneficial for the invading bacteria, and allow survival in an environment that’s not very hospitable for bacterial growth. Though there are a lot of these kinds of proteins known, no pattern has been found that specifies secretion by the type III mechanism. It’s a mystery still.

(* there was another paper published back-to-back with mine in PLoS Pathogens that reported the same thing. Additionally, two other papers were published subsequently in other journals that reiterated our findings. I wrote a review of this field here.)

So on the basis of the model that I published, my collaborators (Drs. Heffron and Niemann) thought it would be cool to see whether a consensus signal (an average of the different parts my model predicted to be important for secretion) would be hyper-secreted (i.e. secreted at a high level). I sent them a couple of predictions and some time later (maybe 8 months) Dr. Niemann contacted me to say that the consensus sequence was not, in fact, secreted. So it looked like the prediction wasn’t any good, and some work had been done to get this negative result.

But not so fast: because they’d had some issues with how they’d made the initial construct, they remade the construct used to express the consensus sequence. The first one (that was not secreted) used a native promoter and upstream gene sequence- the region that causes a gene to be expressed, then allows the ribosome to bind to the mRNA and start translation of the actual coding sequence. That native upstream sequence was just taken from a real effector.

Figure 1. Translocation of a consensus effector seq.

When they redid the construct they used a non-native upstream sequence from a bacteriophage (a virus that infects bacteria), commonly used for expressing genes. All of a sudden, they got secretion from the same consensus sequence. This was a very weird result: why would changing the untranslated region suddenly change the function that the protein sequence was supposed to be directing?

The path of this experiment could have taken a very different turn here. Dr. Niemann could have simply ignored that ‘spurious’ result and decided that the native promoter was the right answer- the consensus sequence wasn’t secreted.

However, in this case the spurious result was the interesting one. Why did the construct with the bacteriophage upstream region get secreted? The only difference was in the upstream RNA (since the difference was in the non-coding region and the protein produced was exactly the same). Dr. Niemann pressed on and found that the RNA itself was directing secretion. And he found other examples of native upstream sequences that do the same thing in the bacterium we were working on (Salmonella Typhimurium). This had never been observed before in Salmonella, though it was known for a few effectors from Yersinia pestis. He also identified an RNA-binding protein, Hfq, that was required for this secretion. This paper is currently available as a preprint from the journal.

Niemann GS, Brown RN, Mushamiri IT, Nguyen NT, Taiwo R, Stufkens A, Smith RD, Adkins JN, McDermott JE, Heffron F. RNA Type III Secretion Signals that require Hfq. J Bacteriol. 2013 Feb 8. [Epub ahead of print]

So this story never ended in validation of my consensus sequence. Actually, in all likelihood it can’t direct secretion (the results in the paper show that, though it’s not highlighted). But the story turned out to be more interesting and more impactful and it shows why it’s good to be flexible in science. If you see the results of each experiment only in black and white (it supports or does not support my original hypothesis) this will be extremely limiting to the science you can accomplish.