Another word about balance

[4/17/2015 updated: A reader pointed out that my formulae for specificity and accuracy contained errors. It turns out that both measures were being calculated correctly, just a typing error on the blog. I’ve corrected them below.] 

TL;DR summary

Evaluating a binary classifier on an artificially balanced set of positive and negative examples (a common practice in this field) can cause underestimation of the method’s accuracy but vast overestimation of its positive predictive value (PPV). Since PPV is likely the only metric that really matters to one important kind of end user - the biologist who wants to take a couple of your positive predictions into the lab and find something novel - this is a potentially very big problem with how performance is reported.

The long version

Previously I wrote a post about the importance of having a naturally balanced set of positive and negative examples when evaluating the performance of a binary classifier produced by machine learning methods. I’ve continued to think about this problem and realized that I didn’t have a very good handle on what kinds of effects artificially balanced sets would have on performance estimates. Though the metrics I’m using are very simple, I felt it would be worthwhile to demonstrate the effects, so I did a simple simulation.

  1. I produced random prediction sets with a set portion of positives predicted correctly (85%) and a set portion of negatives predicted correctly (95%).
  2. The ‘naturally’ occurring ratio of positive to negative examples could be varied but for the figures below I used 1:100.
  3. I varied the ratio of positive to negative examples used to estimate performance and
  4. Calculated several commonly used measures of performance:
    1. Accuracy ((TP+TN)/(TP+FP+TN+FN); that is, the percentage of predictions, positive or negative, that are correct relative to the total number of predictions)
    2. Specificity (TN/(TN+FP); that is, the percentage of negative examples that are correctly predicted as negative)
    3. AUC (area under the receiver operating characteristic curve; a summary metric that is commonly used in classification to evaluate performance)
    4. Positive predictive value (TP/(TP+FP); that is, out of all positive predictions what percentage are correct)
    5. False discovery rate (FDR; 1-PPV; percentage of positive predictions that are wrong)
  5. Repeated these calculations with 20 different random prediction sets
  6. Plotted the results as box plots, which summarize the median (the dark line in the middle), the interquartile range (the box), and whiskers extending to 1.5 times the interquartile range beyond the box; dots above or below the whiskers fall outside this range. (A minimal sketch of this simulation is shown just below.)
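To make the setup concrete, here is a minimal sketch of this kind of simulation (my own reconstruction in Python, not the code used for the figures below; the function name, the sensitivity/specificity values, and the ratios are assumed parameters, and the AUC calculation is omitted for brevity):

```python
import numpy as np

def simulate_metrics(n_pos=1000, neg_per_pos=100, sens=0.85, spec=0.95, rng=None):
    """Simulate one random prediction set with fixed sensitivity and specificity,
    then compute confusion-matrix metrics at the given negative:positive ratio."""
    rng = rng or np.random.default_rng()
    n_neg = n_pos * neg_per_pos
    tp = rng.binomial(n_pos, sens)   # positives called correctly with probability sens
    fn = n_pos - tp
    tn = rng.binomial(n_neg, spec)   # negatives called correctly with probability spec
    fp = n_neg - tn
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "ppv": tp / (tp + fp),
        "fdr": fp / (tp + fp),
    }

# Compare an artificially balanced (1:1) evaluation set with the 'natural' 1:100 ratio,
# using 20 random prediction sets each, as in the post.
for ratio in (1, 10, 100):
    runs = [simulate_metrics(neg_per_pos=ratio) for _ in range(20)]
    acc = np.mean([r["accuracy"] for r in runs])
    ppv = np.mean([r["ppv"] for r in runs])
    print(f"neg:pos = {ratio}:1  mean accuracy = {acc:.3f}  mean PPV = {ppv:.3f}")
```

Running something like this across a range of ratios and drawing each metric as a box plot reproduces the qualitative behavior described below.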

The results are not surprising but do demonstrate the pitfalls of using artificially balanced data sets. Keep in mind that there are many publications that limit their training and evaluation datasets to a 1:1 ratio of positive to negative examples.

Accuracy

Accuracy estimates are actually worse than they should be for the artificial splits because fewer of the negative results are being considered.
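As a rough worked example using the simulation’s assumed 85% sensitivity and 95% specificity: a balanced 1:1 evaluation set gives an expected accuracy of (0.85 + 0.95)/2 = 0.90, while the natural 1:100 ratio gives (0.85×1 + 0.95×100)/101 ≈ 0.95, so the balanced split understates the accuracy you would actually see.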

Specificity

Specificity stays largely the same and is a good estimate because it isn’t affected by the ratio of negatives to positive examples. Sensitivity (the same measure but for positive examples) also doesn’t change for the same reason.

AUC

Happily the AUC doesn’t actually change that much- mostly it’s just much more variable with smaller ratios of negatives to positives. So an AUC from a 1:1 split should be considered to be in the right ballpark, but maybe off from the real value by a bit.

Positive predictive value (PPV)

Aaaand there’s where things go to hell.

False discovery rate (FDR)

Same thing here. The FDR is extremely high (>90%) in the real dataset, but the artificial balanced sets vastly underestimate it.


Why is this a problem?

The last two plots, PPV and FDR, are where the real trouble is. The problem is that the artificial splits vastly overestimate PPV and underestimate FDR (note that the Y axis scale on these plots runs from 0 to close to 1). Why is this important? Because, in general, PPV is what an end user is likely to care about. I’m thinking of the end user who wants to use your great new method for predicting that proteins are members of some very important functional class. They will then apply your method to their own examples (say their newly sequenced bacterium) and rank the positive predictions. They couldn’t care less about the negative predictions because that’s not what they’re interested in. So they take the top few predictions to the lab (they can’t afford to do hundreds, only the best few, say 5, predictions) and experimentally validate them.

If your method’s PPV is actually 95% it’s fairly likely that all 5 of their predictions will pan out (it’s NEVER really as likely as that due to all kinds of factors, but for the sake of argument), making them very happy and allowing the poor grad student whose project it is to actually graduate.

However, the actual PPV from the example above is about 5%. This means that the poor grad student who slaves for weeks over experiments to validate at least ONE of your stinking predictions will probably end up empty-handed for their efforts and will have to spend another 3 years struggling to get their project to the point of graduation.

Given a large enough ratio in the real dataset (e.g. protein-protein interactions, where the number of positive examples in human is somewhere around 50-100k but the number of negatives is somewhere around 4.5×10^8, a ratio of roughly 1:10,000), the real PPV can fall to essentially 0, whereas the artificially estimated PPV can stay very high.
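To see why, it helps to write the expected PPV in terms of sensitivity, specificity, and the class ratio: PPV = sens / (sens + (1 − spec) × N), where N is the number of negatives per positive. A quick sketch, again using the simulation’s assumed 85% sensitivity and 95% specificity (the exact numbers will differ for any real classifier):

```python
def expected_ppv(sens, spec, neg_per_pos):
    """Expected positive predictive value for a classifier with the given
    sensitivity and specificity, at neg_per_pos negatives per positive."""
    tp = sens * 1.0                  # expected true positives per positive example
    fp = (1.0 - spec) * neg_per_pos  # expected false positives per positive example
    return tp / (tp + fp)

print(expected_ppv(0.85, 0.95, 1))       # ~0.94 on an artificial 1:1 split
print(expected_ppv(0.85, 0.95, 10_000))  # ~0.002 at a ~1:10,000 ratio, essentially zero
```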

So, don’t be that bioinformatician who publishes the paper with performance results based on a vastly artificial balance of positive versus negative examples that ruins some poor graduate student’s life down the road.


Regret

Well, there probably ARE some exceptions here.

So I first thought of this as a funny way of expressing relief over a paper being accepted that was a real pain to get finished. But after I thought about the general idea awhile I actually think it’s got some merit in science. Academic publication is not about publishing airtight studies with every possibility examined and every loose end or unconstrained variable nailed down. It can’t be. That would limit scientific productivity to zero because it’s not possible. Science is an evolving dialogue, some of it involving elements of the truth.

The dirty little secret (or elegant grand framework, depending on your perspective) of research is that science is not about finding the truth. It’s about moving our understanding closer to the truth. Oftentimes that involves false positive observations- not because of the misconduct of science but because of its proper conduct. You should never publish junk or anything that’s deliberately misleading. But you can’t help publishing things that sometimes move us further away from the truth. The idea in science is that these erroneous findings will be corrected by further iterations and may even provide an impetus for driving studies that advance science. So publish away!

The false dichotomy of multiple hypothesis testing

[Disclaimer: I’m not a statistician, but I do play one at work from time to time. If I’ve gotten something wrong here please point it out to me. This is an evolving thought process for me that’s part of the larger picture of what the scientific method does and doesn’t mean- not the definitive truth about multiple hypothesis testing.]

There’s a division in research between hypothesis-driven and discovery-driven endeavors. In hypothesis-driven research you start out with a model of what’s going on (this can be explicitly stated or just the amalgamation of what’s known about the system you’re studying) and then design an experiment to test that hypothesis (see my discussions on the scientific method here and here). In discovery-driven research you start out with more general questions (that can easily be stated as hypotheses, but often aren’t) and generate larger amounts of data, then search the data for relationships using statistical methods (or other discovery-based methods).

The problem with analysis of large amounts of data is that when you’re applying a statistical test to a dataset you are actually testing many, many hypotheses at once. This means that your level of surprise at finding something that you call significant (arbitrarily but traditionally a p-value of less than 0.05) may be inflated by the fact that you’re looking a whole bunch of times, thus increasing the odds that you’ll observe SOMETHING by random chance alone (see the excellent xkcd cartoon below, which I’ll use as a running example). So you need to apply some kind of multiple hypothesis correction to your statistical results to reduce the chances that you’ll fool yourself into thinking that you’ve got something real when actually you’ve just got something random. In the xkcd example, a multiple hypothesis correction using Bonferroni’s method (one of the simplest and most conservative corrections) would move the threshold for significance to 0.05/20 = 0.0025, since 20 different tests were performed.
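Here is a minimal sketch of that correction (the p-values are hypothetical, purely for illustration):

```python
alpha = 0.05
n_tests = 20  # twenty jelly-bean colors tested

# Bonferroni: shrink the per-test threshold from alpha to alpha / n_tests.
threshold = alpha / n_tests
print(threshold)  # 0.0025

# Equivalently, inflate each raw p-value and compare against alpha.
p_values = [0.032, 0.001, 0.64, 0.049]          # hypothetical raw p-values
adjusted = [min(p * n_tests, 1.0) for p in p_values]
print([p_adj < alpha for p_adj in adjusted])    # only 0.001 survives the correction
```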

Here’s where the problem of a false dichotomy occurs. Many researchers who analyze large amounts of data believe that utilizing a hypothesis-based approach mitigates the effect of multiple hypothesis testing on their results. That is, they believe that they can focus their investigation of the data on a subset constrained by a model/hypothesis and thus reduce the effect that multiple hypothesis testing has on their analysis. Instead of looking at 10,000 proteins in a study they now look at only the 25 proteins that are thought to be present in a particular pathway of interest (where the pathway here represents the model based on existing knowledge). This is like saying, “we believe that jelly beans in the blue-green color range cause acne” and then drawing your significance threshold at 0.05/4 = 0.0125, since ~4 of the jelly beans tested are in the blue-green color range (not sure if ‘lilac’ counts or not- that would make 5). All well and good EXCEPT for the fact that the actual chance of detecting something by random chance HASN’T changed. In large-scale data analysis (transcriptome analysis, for example) you’ve still MEASURED everything else. You’ve just chosen to limit your investigation to a smaller subset, and then you can ‘go easy’ on your multiple hypothesis correction.
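A small simulation of that scenario (entirely hypothetical numbers, just to illustrate the point): even when nothing is truly changing, correcting only for the 25 proteins you chose to look at makes the subset look clean, while hundreds of equally “significant” chance hits still sit in the data you measured but didn’t examine.

```python
import numpy as np

rng = np.random.default_rng(0)

n_measured = 10_000   # everything actually measured (e.g. a whole transcriptome)
n_pathway = 25        # the subset picked out by the model/hypothesis
alpha = 0.05

# Pure null case: no protein is truly changing, so every p-value is uniform.
p = rng.uniform(size=n_measured)
pathway = rng.choice(n_measured, size=n_pathway, replace=False)

print("pathway hits at the subset-corrected threshold:", (p[pathway] < alpha / n_pathway).sum())
print("chance hits across everything measured:", (p < alpha).sum())  # roughly 500 expected
```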

The counter-argument that might be made to this point is that by doing this you’re testing a specific hypothesis, one that you believe to be true and that may be supported by existing data. This is a reasonable point in one sense- it may lend credence to your finding that there is existing information supporting your result. But on the other hand it doesn’t change the fact that you still could be finding more things by chance than you realize, simply because you hadn’t looked at the rest of your data. It turns out that this is true not just of analysis of big data, but also of some kinds of traditional experiments aimed at testing individual (associative) hypotheses. The difference there is that it is technically infeasible to actually test a large number of the background cases (you’re generally limited to one or two negative controls). Also, a mechanistic hypothesis (as opposed to an associative one) is based on intervention, which tells you something different and so is not (as) subject to these considerations.

Imagine that you’ve dropped your car keys in the street and you don’t know what they look like (maybe you borrowed a friend’s car). You’re pretty sure you dropped them in front of the coffee shop on a block with 7 other shops on it- but you did walk the length of the block before you noticed the keys were gone. You walk directly back to look in front of the coffee shop and find a set of keys. Great, you’re done. You found your keys, right? But what if you had looked in front of the other stores and found other sets of keys? You didn’t look- but that doesn’t make it any less likely that you’re wrong about these keys (your existing knowledge/model/hypothesis, “I dropped them in front of the coffee shop”, could easily be wrong).

[xkcd: Significant]

Always use the right pen

Spike Lee’s 1989 classic “Do the Right Thing” is about a lot of things. It’s about life in general, and I still don’t fully understand its message- why did Mookie throw the garbage can through Sal’s window? Was it the right thing to do?

It was NOT about life in academia – but it did have elements of the conflict between creative and destructive influences that I find very compelling. And Radio Raheem’s Love/Hate speech seemed to speak to me in a different way. Anyway, here’s an homage I did for fun.

[Image: RedPenBlackPen_v4small]

Not being part of the rumor mill

I had something happen today that made me stop and think. I repeated a bit of ‘knowledge’ – something science-y that had to do with a celebrity. This was a factoid that I have repeated many other times. Each time I do I state this factoid with a good deal of authority in my voice and with the security that this is “fact”. Someone who was in the room said, “really?” Of course, as a quick Google check to several sites (including snopes.com) showed- this was, at best, an unsubstantiated rumor, and probably just plain untrue. But the memory voice in my head had spoken with such authority! How could it be WRONG? I’m generally pretty good at picking out bits of misinformation that other people present and checking it, but I realized that I’m not always so good about detecting it when I do it myself.

Of course, this is how rumors get spread and disinformation gets disseminated. As scientists we are not immune to it- even if we’d like to think we are. And we actually could be big players in it. You see, people believe us. We speak with the authority of many years of schooling and many big science-y wordings. And the real danger is repeating or producing factoids that fall in “science” but outside what we’re really experts in (where we should know better). Because many non-scientists see us as experts IN SCIENCE. People hear us spout some random science-ish factoid and they LISTEN to us. And then they, in turn, repeat what we’ve said, except that this time they say it with authority because it was stated, with authority, by a reputable source. US. And I realized that this was the exact same reason that it seemed like fact to me. Because it had been presented to me AS FACT by someone who I looked up to and trusted.

So this is just a note of caution about being your own worst critic – even in normal conversation. Especially when it comes to those slightly too plausible factoids. Though it may not seem like it sometimes people do listen to us.

15 great ways to fool yourself about your results

I’ve written before about how easy it is to fool yourself and some tips on how to avoid it for high-throughput data. Here is a non-exhaustive list of ways you too can join in the fun!

  1. Those results SHOULD be that good. Nearly perfect. It all makes sense.
  2. Our bioinformatics algorithm worked! We put input in and out came output! Yay! Publishing time.
  3. Hey, these are statistically significant results. I don’t need to care about how many different ways I tested to see if SOMETHING was significant about them.
  4. We only need three replicates to come to our conclusions. Really, it’s what everyone does.
  5. These results don’t look all THAT great, but the biological story is VERY compelling.
  6. A pilot study can yield solid conclusions, right?
  7. Biological replicates? Those are pretty much the same as technical replicates, right?
  8. Awesome! Our experiment eliminated one alternate hypothesis. That must mean our hypothesis is TRUE!
  9. Model parameters were chosen based on what produced reasonable output: therefore, they are biologically correct.
  10. The statistics on this comparison just aren’t working out right. If I adjust the background I’m comparing to I can get much better results. That’s legit, right?
  11. Repeating the experiment might spoil these good results I’ve got already.
  12. The goal is to get the p-value less than 0.05. End.Of.The.Line. (h/t Siouxsie Wiles)
  13. Who, me biased? Bias is for chumps and those not so highly trained in the sciences as an important researcher such as myself. (h/t Siouxsie Wiles)
  14. It doesn’t seem like the right method to use- but that’s the way they did it in this one important paper, so we’re all good. (h/t Siouxsie Wiles)
  15. Sure the results look surprising, and I apparently didn’t write down exactly what I did, and my memory of it is kinda fuzzy because I did the experiment six months ago, but I must’ve done it THIS way because that’s what would make the most sense.
  16. My PI told me to do this, so it’s the right thing to do. If I doubt that it’s better not to question it since that would make me look dumb.
  17. Don’t sweat the small details- I mean what’s the worst that could happen?

Want to AVOID doing this? Check out my previous post on ways to do robust data analysis and the BioStat Decision Tool from Siouxsie Wiles that will walk you through the process of choosing appropriate statistical analyses for your purposes! Yes, it is JUST THAT EASY!

Feel free to add to this list in the comments. I’m sure there’s a whole gold mine out there. Never a shortage of ways to fool yourself.


Gut feelings about gut feelings about marriage

An interesting study was published about the ‘gut feelings’ of newlyweds, and how they can predict future happiness in the marriage. The study assessed gut feelings (as opposed to stated feelings, which are likely to be biased in the rosy direction) of newlyweds towards their spouse by a word association and controlled for several different variables (like how the same people react to random strangers with the word association). They found that newlyweds that had more ‘gut feeling’ positive associations about their spouse were in happier relationships after four years. Sounds pretty good, right? Fits with what you might think about gut feelings.

The interesting point (which is nicely put in a Nature piece that covers this study) is that after other effects are factored out of the analysis the positive association was statistically significant, but that it could only explain 2% of the eventual difference in happiness (this analysis was apparently done by the Nature reporter, and not reported in the original paper). 2%! That’s not a very meaningful effect- even though it may be statistically significant. Though the study is certainly interesting and likely contains quite a bit of good data – this effect seems vanishingly small.
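As a quick, purely hypothetical illustration of how a “significant” effect can still be tiny (simulated numbers, not the study’s data): with a reasonably large sample, a correlation explaining only about 2% of the variance is easily statistically significant.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

n = 1000  # hypothetical sample size, chosen only to make the point
gut_feeling = rng.normal(size=n)
happiness = 0.14 * gut_feeling + rng.normal(size=n)  # true r^2 of roughly 2%

r, p = stats.pearsonr(gut_feeling, happiness)
print(f"r = {r:.2f}, r^2 = {r**2:.3f}, p = {p:.2g}")  # tiny effect, very small p-value
```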

For interest, here are the titles of the original paper and some of the follow-on news pieces written about it, showing how they make the results seem much more clear-cut and meaningful than they are.

Title of the original Science paper:

Though They May Be Unaware, Newlyweds Implicitly Know Whether Their Marriage Will Be Satisfying

Title of the Nature piece covering this study:

Newlyweds’ gut feelings predict marital happiness: Four-year study shows that split-second reactions foretell future satisfaction.

Headline from New Zealand Herald article:

Gut instinct key to a long and happy marriage

Headline from New York Daily News

Newlyweds’ gut feelings on their marriage are correct: study


Invisible science and how it’s too easy to fool yourself

If you want the full effect of this post, watch the video below first before reading further.

[Video embedded here; scroll down for the post]

So, did you see it? I did, probably because I was primed to watch for it, but apparently 50% of subjects don’t!

I heard about a really interesting psychology experiment today (and I LOVE these kinds of things that show us how we’re not as smart as we think we are) called the invisible gorilla experiment. The setup is simple: the subjects watch a video of kids passing balls back and forth. The kids are wearing either red or white shirts. The object is to count the number of times a ball is passed between kids with white shirts. It takes concentration, since the kids are moving and mixing and tossing fast. At some point a gorilla walks into view, beats its chest, and walks off. Subjects are then asked if they saw a gorilla. Surprisingly (or not- because it’s one of THESE kinds of experiments) 50% of the subjects don’t remember seeing a gorilla. What they’ve been told to look for and pay attention to is the ball and the color of the shirts- gorillas don’t figure into that equation, and your brain, which is very good at filtering out irrelevant information, filters them out.

Anyway, it got me thinking about how we do science. Some of the most interesting, useful, exciting, groundbreaking results in science arise from the unexpected result. You’ve set up your experiment perfectly, you execute it perfectly, and it turns out WRONG! It doesn’t fit with your hypothesis, but in some weird way. Repeat the experiment a few times. If that doesn’t fix the problem then work on changing the experiment until you get rid of that pesky weird result. Ahhhh, there, you’ve ‘fixed’ it- now things will fit with what you expected in the first place.

Most of the time spurious, weird results are probably just that- not very interesting. However, there are probably a lot of times when there are weird results that we as scientists don’t even see. We don’t expect to see them, so we don’t see them. And those could be incredibly interesting. I can see this being the case in what I do a lot of: analysis of high-throughput data (lots of measurements of lots of components at the same time, like microarray expression data). It’s sometimes like trying to count the number of times the kids wearing white shirts pass the ball back and forth- but where there are 300 shirt colors and 2500 kids. Ouch. A gorilla wandering into that mess would be about as obvious as Waldo at a multi-colored referees’ convention. That is, not so much. I wonder how many interesting things are missed, and how important that is. In high-throughput data analysis the goal is often to focus on what’s important and ignore the rest- but if the rest is telling an important and dominant story, we’re really missing the boat.

I’ve found that one of the best things I can do in my science is to be my own reviewer, my own critic, and my own skeptic. If some result turns out exceptionally well I don’t believe it. Actually, the better a result looks relative to what I expected, the less I believe it at first. I figure if I don’t do this someone down the line will- and it will come back to me. I try to eliminate all the other possibilities of what could be going on (using the scientific method algorithm I’ve previously described). I try to rigorously oppose all my findings until I myself am convinced. However, studies like the invisible gorilla really make me wonder how good I am at seeing things that I’m not specifically looking for.