15 great ways to fool yourself about your results

I’ve written before about how easy it is to fool yourself with high-throughput data, along with some tips on how to avoid it. Here is a non-exhaustive list of ways you too can join in the fun!

  1. Those results SHOULD be that good. Nearly perfect. It all makes sense.
  2. Our bioinformatics algorithm worked! We put input in and out came output! Yay! Publishing time.
  3. Hey, these are statistically significant results. I don’t need to care about how many different ways I tested to see if SOMETHING was significant about them.
  4. We only need three replicates to come to our conclusions. Really, it’s what everyone does.
  5. These results don’t look all THAT great, but the biological story is VERY compelling.
  6. A pilot study can yield solid conclusions, right?
  7. Biological replicates? Those are pretty much the same as technical replicates, right?
  8. Awesome! Our experiment eliminated one alternate hypothesis. That must mean our hypothesis is TRUE!
  9. Model parameters were chosen based on what produced reasonable output: therefore, they are biologically correct.
  10. The statistics on this comparison just aren’t working out right. If I adjust the background I’m comparing to I can get much better results. That’s legit, right?
  11. Repeating the experiment might spoil these good results I’ve got already.
  12. The goal is to get the p-value less than 0.05. End.Of.The.Line. (h/t Siouxsie Wiles)
  13. Who, me biased? Bias is for chumps and those not so highly trained in the sciences as an important researcher such as myself. (h/t Siouxsie Wiles)
  14. It doesn’t seem like the right method to use- but that’s the way they did it in this one important paper, so we’re all good. (h/t Siouxsie Wiles)
  15. Sure the results look surprising, and I apparently didn’t write down exactly what I did, and my memory on it’s kinda fuzzy because I did the experiment six months ago, but I must’ve done it THIS way because that’s what would make the most sense.
  16. My PI told me to do this, so it’s the right thing to do. If I doubt that it’s better not to question it since that would make me look dumb.
  17. Don’t sweat the small details- I mean what’s the worst that could happen?

Want to AVOID doing this? Check out my previous post on ways to do robust data analysis and the BioStat Decision Tool from Siouxsie Wiles that will walk you through the process of choosing appropriate statistical analyses for your purposes! Yes, it is JUST THAT EASY!

Feel free to add to this list in the comments. I’m sure there’s a whole gold mine out there. Never a shortage of ways to fool yourself.

 

Gut feelings about gut feelings about marriage

An interesting study was published about the ‘gut feelings’ of newlyweds, and how they can predict future happiness in the marriage. The study assessed newlyweds’ gut feelings toward their spouse (as opposed to their stated feelings, which are likely to be biased in the rosy direction) using a word-association task, and controlled for several different variables (like how the same people react to random strangers in the same task). Newlyweds who had more positive ‘gut feeling’ associations about their spouse were in happier relationships four years later. Sounds pretty good, right? Fits with what you might think about gut feelings.

The interesting point (nicely made in a Nature piece that covers this study) is that after other effects are factored out of the analysis, the positive association was statistically significant but could only explain 2% of the eventual difference in happiness (this analysis was apparently done by the Nature reporter and is not reported in the original paper). 2%! That’s not a very meaningful effect, even though it may be statistically significant. The study is certainly interesting and likely contains quite a bit of good data, but this particular effect seems vanishingly small.
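
To get a feel for how an effect this small can still clear the significance bar, here’s a toy simulation in Python (the sample size and the exact effect size are my own assumptions, not numbers from the study): a predictor built to explain only ~2% of the variance in later happiness will still tend to produce a ‘significant’ p-value with a few hundred couples.

```python
# Toy simulation (assumed numbers, not the study's data): an effect that
# explains only ~2% of the variance can still be statistically significant.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
n = 270  # assumed number of couples, roughly "a few hundred"

gut = rng.normal(size=n)  # implicit 'gut feeling' score toward spouse
# later happiness = tiny true contribution from gut feelings + everything else
happiness = np.sqrt(0.02) * gut + np.sqrt(0.98) * rng.normal(size=n)

r, p = stats.pearsonr(gut, happiness)
print(f"r = {r:.2f}, variance explained = {r**2:.1%}, p = {p:.3f}")
```

Statistically detectable, practically tiny- which is exactly the distinction the headlines below gloss over.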

For interest here are the titles of the paper and some follow-on news pieces that were written about it and how they make the results seem much more clear cut and meaningful.

Title of the original Science paper:

Though They May Be Unaware, Newlyweds Implicitly Know Whether Their Marriage Will Be Satisfying

Title of the Nature piece covering this study:

Newlyweds’ gut feelings predict marital happiness: Four-year study shows that split-second reactions foretell future satisfaction.

Headline from New Zealand Herald article:

Gut instinct key to a long and happy marriage

Headline from New York Daily News:

Newlyweds’ gut feelings on their marriage are correct: study

 

A word about balance

I’ve been reviewing machine learning papers lately and have seen a particular problem repeatedly. Essentially it’s a problem of how a machine learning algorithm is trained and evaluated for performance versus how it would actually be applied. I’ve seen this particular problem in a whole bunch of published papers too, so I thought I’d write a blog rant post about it. I’ve included a quick-and-dirty primer on machine learning approaches at the end of this post for those interested.

The problem is this: methods are often evaluated using an artificial balance of positive versus negative training examples, one that can artificially inflate estimates of performance over what would actually be obtained in a real world application.

I’ve seen lots of studies that use a balanced approach to training. That is, the number of positive examples is matched with the number of negative examples. The problem is that many times the number of negative examples in a ‘real world’ application is much larger than the number of positive examples- sometimes by orders of magnitude. The reason that is often given for choosing to use a balanced training set? That this provides better performance and that training on datasets with a real distribution of examples would not work well since any pattern in the features from the positive examples would be drowned out by the sheer number of negative examples. So essentially- that when we use a real ratio of positive to negative examples in our evaluation our method sucks. Hmmmmm……

This argument is partly true- some machine learning algorithms do perform very poorly on highly unbalanced datasets. Support Vector Machines (SVMs), though, and some other kinds of machine learning approaches seem to do pretty well. Some studies follow the initial balanced training step with an evaluation on a real-world set- that is, one with a ‘naturally’ occurring balance of positive and negative examples. This is a perfectly reasonable approach. However, too many studies don’t do this step, or perform a follow-on ‘validation’ on a dataset with more negative examples but still nowhere near the number that would be present in a real dataset. And importantly, the ‘bad’ studies report the performance results from the balanced (and thus artificial) dataset.

The issue here is that evaluation on a dataset with an even number of positive and negative examples can vastly overestimate performance, because it understates the number of false positive predictions that would be made in a real application. Imagine we have a training set with 50 positive examples and a matched 50 negative examples. The algorithm is trained on these examples and cross-validation (random division of the training set for evaluation purposes- see below) shows that it predicts 40 of the positives to be positive (TP) and 48 of the negatives to be negative (TN). So it misclassifies two negative examples as positives, with scores as good as or better than the true positives. That doesn’t seem too bad- the majority of positive predictions are still true positives. Now imagine that the actual ratio of positives to negatives in a real-world dataset is 1:50, that is, for every positive example there are 50 negative examples. What’s not done in these problem cases is extrapolating the performance of the algorithm to that real-world dataset. There, the same 4% false positive rate applied to 2,500 negatives gives about 100 false positive predictions- now outnumbering the 40 true positive predictions and making the results a lot less confident than originally estimated. The example I use here is actually a generous one. I frequently deal with datasets (and review or read papers) where the ratios are 1:100 to 1:10,000, where this can substantially impact results.
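
To make that arithmetic concrete, here is a minimal Python sketch (using the made-up rates from the 50/50 example above, not any real classifier) that extrapolates the same true positive and false positive rates to different class ratios:

```python
def precision_at_ratio(tpr, fpr, n_pos, neg_per_pos):
    """Expected precision for a classifier with fixed per-example true
    positive rate (tpr) and false positive rate (fpr), applied to a set
    with n_pos positives and neg_per_pos negatives for every positive."""
    tp = tpr * n_pos                # expected true positives
    fp = fpr * n_pos * neg_per_pos  # expected false positives
    return tp / (tp + fp)

# Rates from the toy example: 40/50 positives recovered, 2/50 negatives misclassified
tpr, fpr = 40 / 50, 2 / 50

print(precision_at_ratio(tpr, fpr, 50, 1))    # balanced 1:1   -> ~0.95
print(precision_at_ratio(tpr, fpr, 50, 50))   # realistic 1:50 -> ~0.29
print(precision_at_ratio(tpr, fpr, 50, 120))  # 1:120          -> ~0.14
```

Same hypothetical classifier, same error rates- only the class balance of the evaluation set has changed.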

So the evaluation of a machine learning method should involve a step where a naturally occurring ratio of positive to negative examples is represented. Though the natural ratio may not be known precisely for some applications, a reasonable estimate should be used. The performance of the method should be reported based on THIS evaluation, not the evaluation on the balanced set- since that number is likely to be inflated anywhere from a little to a lot.
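
As a sketch of what that evaluation might look like in practice (synthetic data and an assumed 1:50 ratio, purely for illustration): balance the training data if your algorithm needs it, but hold out and report on a test set that keeps the natural ratio.

```python
# Sketch with synthetic data: train on a balanced subsample, but evaluate
# on a held-out set that preserves the natural (here, assumed 1:50) ratio.
import numpy as np
from sklearn.svm import SVC
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import precision_score, recall_score

rng = np.random.default_rng(0)

# 100 positives and 5000 negatives (a 1:50 ratio); the two classes differ
# by a modest shift in 10 noisy features
X = np.vstack([rng.normal(loc=0.8, size=(100, 10)),
               rng.normal(loc=0.0, size=(5000, 10))])
y = np.array([1] * 100 + [0] * 5000)

# Held-out test set that keeps the 1:50 ratio
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3,
                                           stratify=y, random_state=0)

# Balance only the training data by undersampling negatives
pos = np.where(y_tr == 1)[0]
neg = rng.choice(np.where(y_tr == 0)[0], size=len(pos), replace=False)
idx = np.concatenate([pos, neg])

clf = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
clf.fit(X_tr[idx], y_tr[idx])

# Report performance at the realistic ratio, not the balanced one
pred = clf.predict(X_te)
print("precision:", round(precision_score(y_te, pred), 2))
print("recall:   ", round(recall_score(y_te, pred), 2))
```

The undersampling step is optional; the point is that the reported precision and recall come from the set with the realistic class ratio.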

For those that are interested in real examples of this problem I’ve got two studies from one of my own areas of research- type III effector prediction in bacteria. In Gram-negative bacteria with type III secretion systems there are an unknown number of secreted effectors (proteins that are injected into host cells to effect virulence), but we estimate on the order of 50-100 for a genome like Salmonella Typhimurium, which has 4500 proteins total, so the ratio should be around 1:40 to 1:150 for most bacteria like this. In my own study on type III effector prediction I used a 1:120 ratio for evaluation for exactly this reason. A subsequent paper in this area chose to use a 1:2 ratio because “the number of non-T3S proteins was much larger than the number of positive proteins,…, to overcome the imbalance between positive and negative datasets.” If you’ve been paying attention, THAT is not a good reason, and I didn’t review that paper (though I’m not saying that their conclusions are incorrect since I haven’t closely evaluated their study).

  1. Samudrala R, Heffron F and McDermott JE. 2009. Accurate prediction of secreted substrates and identification of a conserved putative secretion signal for type III secretion systems. PLoS Pathogens 5(4):e1000375.
  2. Wang Y, Zhang Q, Sun MA, Guo D. 2011. High-accuracy prediction of bacterial type III secreted effectors based on position-specific amino acid composition profiles. Bioinformatics 27(6):777-84.

So the trick here is to not fool yourself, and in turn fool others. Make sure you’re being your own worst critic. Otherwise someone else will take up that job instead.

Quick and Dirty Primer on Machine Learning

Machine learning is an approach to pattern recognition that learns patterns from data. Often the pattern that is learned is a particular combination of features- properties of the examples- that can separate one group of examples from another. A simple example would be trying to identify all the basketball players at an awards ceremony for football, basketball, and baseball players. You would start by selecting some features, that is, player attributes, that you think might separate the groups. You might select hair color, length of the shorts or pants in the uniform, height, and handedness of the player as potential features. Obviously these features would not be equally powerful at identifying basketball players, but a good algorithm will make the best use of the features it is given.

A machine learning algorithm then looks at all the examples: the positive examples (basketball players) and the negative examples (everyone else). The algorithm considers the values of the features in each group and, ideally, finds the best way to separate the two groups. Generally, to evaluate the algorithm, all the examples are divided into a training set, used to learn the pattern, and a testing set, used to test how well the pattern works on an independent set. Cross-validation, a common method of evaluation, does this repeatedly, each time dividing the larger group into training and testing sets by randomly selecting positive and negative examples to put into each set.

Evaluation is very important, since the reported performance gives end users an idea of how well the method will work in their real-world application, where they don’t know the answers already. Performance measures vary, but for classification they generally involve comparing the predictions made by the algorithm with the known ‘labels’ of the examples- that is, whether the player is a basketball player or not. There are four categories of prediction: true positives (TP), where the algorithm predicts a basketball player and the example really is one; true negatives (TN), where the algorithm predicts not a basketball player and the example is not one; false positives (FP), where the algorithm predicts a basketball player and the example is not one; and false negatives (FN), where the algorithm predicts not a basketball player and the example actually is one.
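
For the curious, here’s roughly what that workflow might look like in code, using scikit-learn and made-up ‘athlete’ data (the feature values are invented for illustration; nothing here comes from a real dataset):

```python
# Cross-validated classification of (fake) basketball players vs. everyone else
import numpy as np
from sklearn.svm import SVC
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import confusion_matrix

rng = np.random.default_rng(1)

# Two features per player: height (cm) and pant/short length (cm)
basketball = np.column_stack([rng.normal(200, 8, 50),   # taller on average...
                              rng.normal(25, 5, 50)])   # ...with shorter shorts
others = np.column_stack([rng.normal(183, 8, 150),
                          rng.normal(70, 25, 150)])

X = np.vstack([basketball, others])
y = np.array([1] * len(basketball) + [0] * len(others))  # the known 'labels'

# 5-fold cross-validation: each example is predicted while held out of training
clf = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
pred = cross_val_predict(clf, X, y, cv=5)

tn, fp, fn, tp = confusion_matrix(y, pred).ravel()
print(f"TP={tp}  TN={tn}  FP={fp}  FN={fn}")
```

The four counts printed at the end are exactly the TP/TN/FP/FN categories described above.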

Features (height and pant length) of examples (basketball players and non-basketball players) plotted against each other. Trying to classify based on either of the individual features won’t work well but a machine learning algorithm can provide a good separation. I’m showing something that an SVM might do here- but the basic idea is the same with other ML algorithms.

The good, the bad, and the ugly: Open access, peer review, investigative reporting, and pit bulls

We all have strong feelings about things based on anecdotal evidence- it’s part of human nature. Science is aimed at testing those anecdotal feelings (we call them hypotheses) in a more rigorous fashion, to support or refute our gut feelings about a subject. Many times those gut feelings are wrong- especially about new concepts and ideas that come along. Open access publishing certainly falls into this category- a new and interesting business model that many people have very strong feelings about. There is, therefore, a need for the second part: scientific studies that illuminate how well it’s working.

Recently the very prestigious journal Science published an article, titillatingly titled “Who’s Afraid of Peer Review? A spoof paper concocted by Science reveals little or no scrutiny at many open-access journals.” I’ve seen it posted and reposted on Twitter and Facebook by a number of colleagues, and, indeed, when I first read about it I was intrigued. These posts have been accompanied by sentiments such as “I never trusted open access” or “now you know why you get so many emails from open access journals”- in other words, gut feelings about the overall quality of open access journals.

Here’s the basic rundown: John Bohannon concocted a fake but believable scientific paper with a critical flaw. He submitted it to a large number of open access journals under different names, then recorded which journals accepted it, along with the correspondence with each journal- some of which is pretty damning (i.e. it looks like they didn’t do any peer review on the paper). Several high-profile open access journals like PLoS One rejected the paper, but many journals accepted the flawed paper. On the one hand, the study is an ambitious and groundbreaking investigation into how well journals execute peer review, the heart of scientific publishing. The author is to be commended on this undertaking, which is considerably more comprehensive (in terms of the number of journals targeted) than anything in the past.

On the other hand, the ‘study’, which concludes that open access peer review is flawed, is itself deeply flawed and was not, in fact, peer reviewed (it is categorized as a “News” piece for Science). The reason is really simple- the ‘study’ was not conceived as a scientific study at all. It was investigative reporting, which is much different. The goal of investigative reporting is to call attention to important and often times unrecognized problems. In this way Dr. Bohannon’s piece was probably quite successful because it does highlight the very lax or non-existent peer review at a large number of journals. However, the focus on open access is harmful misdirection that only muddies the waters.

Here’s what’s not in question: Dr. Bohannon found that a large number of the journals he submitted his fake paper to seemed to accept it with little or no peer review. (However, it is worth noting that Gunther Eysenbach, an editor for one of the journals contacted, reports that he rejected the paper because it was out of the journal’s scope, and that for some reason his journal was not included in the final list of journals in Bohannon’s paper.)

What this says about peer review in general is striking: the fake paper was flawed in a pretty serious way and should not have passed peer review. This conclusion of the piece is a good and important one: peer review at a surprising number of journals is flawed, or just non-existent.

What the results do not say is anything about whether open access contributes to this problem. Open access was not a variable in Dr. Bohannon’s study. However, it is one of the main conclusions of the piece- that the open access model is flawed. So essentially, this ‘study’ presents conclusions about a question it was not designed to answer: are open access journals more likely than for-pay journals to have shoddy peer review processes? No for-pay journals were tested in the sting, thus no results. It MAY be that open access is worse than for-pay in terms of peer review, but THIS WAS NOT TESTED BY THE STUDY. Partly this is the fault of Science’s promotion of the piece, which plays up the open access angle quite a bit- but it is really implicit in the study itself. Interestingly, this is how Dr. Bohannon describes the spoof paper’s second flawed experiment:

The second experiment is more outrageous. The control cells were not exposed to any radiation at all. So the observed “interactive effect” is nothing more than the standard inhibition of cell growth by radiation. Indeed, it would be impossible to conclude anything from this experiment.

This neatly summarizes the fundamental flaw in his own study: the control journals (more traditional for-pay journals) were not queried at all, so nothing can be concluded from this study- in terms of open access, anyway.

The heart of the problem is that the very well-respected journal Science is now asking the reader to accept conclusions that are not based in the scientific method. This is the equivalent of stating that pit bulls are more dangerous than other breeds because they bite 10,000 people per year in the US (I just made that figure up). End of story. How many people were bitten by other breeds? We don’t know, because we didn’t look at those statistics. How do we support our conclusion? Because people feel that pit bulls are more dangerous than other breeds- just as some scientists distrust open access journals as “predatory” or worse. So, in a very real way, the well-respected for-pay journal Science is preying upon the ‘gut feelings’ of readers who may distrust open access and feeding them pseudoscience, or at least pseudo-conclusions, about open access.

A number of very smart and well-spoken (well, written) people have posted on this subject and made some other excellent points. See posts from Michael Eisen, Björn Brembs, Paul Baskin, and Gunther Eysenbach.

Excitement about great results

My first attempt at an XKCD-style plot about science, inspired by the great results I posted about yesterday- which still stand, by the way.

"OMG! This would be so amazing if it's true- so, it MUST be true!"

“OMG! This would be so amazing if it’s true- so, it MUST be true!”

Finding your keys in a mass of high-throughput data

There is a common scientific allegory that is often used to criticize discovery-based science. A person is looking around at night under a lamppost when a passerby asks, “What are you doing?” “Looking for my keys,” the person replies. “Oh, did you lose them here?” asks the concerned citizen. “No,” the person replies, “I lost them over there, but the light is much better here.”

The argument as applied to science in a nutshell is that we commonly ask questions based on where the ‘light is good’- that is, where we have the technology to be able to answer the question, rather than asking a better question in the first place. This recent piece covering several critiques of cancer genomics projects is a good example, and uses the analogy liberally throughout- referencing its use in the original articles covered.

One indictment of the large-scale NCI project, The Cancer Genome Atlas (TCGA), is as follows:

The Cancer Genome Atlas, an ambitious effort to chart and catalog all the significant mutations that every important cancer can possibly accrue. But these efforts have largely ended up finding more of the same. The Cancer Genome Atlas is a very significant repository, but it may end up accumulating data that’s irrelevant for actually understanding or curing cancer.

Here’s my fundamental problem with the metaphor and its use as a criticism of scientific endeavors such as the TCGA. The problem is this: we don’t know where the keys are a priori, and the light is brightest under the lamppost, so we would be stupid NOT to look there first. The indictment comes post-hoc, and so benefits from knowing the answer (or at least having a good idea of the answer). Unraveling this anti-metaphor: we don’t know what causes cancer, genomics seems like a likely place to look and we have the technology to do so, and if we had started by looking elsewhere critics would be left wondering why we didn’t look in the obvious place. The fact that we didn’t find anything new with the TCGA (and the jury is still very much out on that point) is still a positive step forward in the understanding of cancer- it means that we’ve looked in genomics and haven’t found the answer. This can be used to drive the next round of investigation. If we hadn’t done it, we simply wouldn’t know, and that would prevent taking certain steps to move forward.

Of course, the value of the metaphor is that it can be used to urge caution in investigation. If we have a good notion that our keys are not under the light, then maybe we ought to go get our flashlights and look in the right area to start with. We should also be very careful that, in funding the large projects to look where the light is, we don’t sacrifice other projects that may end up yielding fruit. It is true that large projects tend to be over-hyped, with the answers promised (or all but promised) before the questions have even begun to be answered. Part of this is necessary salesmanship to get these things off the ground at all, but overselling does not reflect well on anyone in the long run. Finally, the momentum of large projects or motivating ideas (“sequence all cancer genomes”) can be significant and may carry the ideas beyond what is useful. When we’ve figured out that the keys are not under the lamppost we had better figure out where to look next rather than combing over the same well-lit ground.

Part of this piece reflects very well on work that I’m involved with- proteomic and phosphoproteomic characterization of tumors from TCGA under the Clinical Proteomics Tumor Analysis Consortium (CPTAC):

“as Yaffe points out, the real action takes place at the level of proteins, in the intricacies of the signaling pathways involving hundreds of protein hubs whose perturbation is key to a cancer cell’s survival. When drugs kill cancer cells they don’t target genes, they directly target proteins”

So examining the signaling pathways that are involved in cancer directly, as opposed to looking at gene expression or modification as a proxy for activity, may indeed be the way to go to elucidate the causes of cancer. We believe that integrating this kind of information, which is closer to actual function, with the depth of knowledge provided by the TCGA will give significant insight into the biology of cancer and its underlying causes. But we’ve only started looking under that lamppost.

So, the next time you hear someone using this analogy as an indictment of a project or approach, ask yourself whether they are using the argument post-hoc- that is, “they looked under the lamppost and didn’t find anything, so their approach was flawed.” It wasn’t- it was likely the clearest, most logical step, and the one most likely to yield fruit given a reasonable cost-benefit assessment.

 

Invisible science and how it’s too easy to fool yourself

If you want the full effect of this post, watch the video below first before reading further.

[scroll down for the post]

.

.

.

.

.

.

.

.

.

.

So, did you see it? I did, probably because I was primed to watch for it, but apparently 50% of subjects don’t!

I heard about a really interesting psychology experiment today (and I LOVE these kinds of things that show us how we’re not as smart as we think we are) called the invisible gorilla experiment. The setup is simple: the subjects watch a video of kids passing balls back and forth. The kids are wearing either red or white shirts. The object is to count the number of times a ball is passed between kids with white shirts. It takes concentration, since the kids are moving and mixing and tossing fast. At some point a gorilla walks into view, beats its chest, and walks off. Subjects are then asked if they saw a gorilla. Surprisingly (or not- because it’s one of THESE kinds of experiments), 50% of the subjects don’t remember seeing a gorilla. What they’ve been told to look for and pay attention to is the ball and the color of the shirts- gorillas don’t figure into that equation, and your brain, which is very good at filtering out irrelevant information, filters this out.

Anyway, it got me thinking about how we do science. Some of the most interesting, useful, exciting, groundbreaking results in science arise from the unexpected result. You’ve set up your experiment perfectly, you execute it perfectly, and it turns out WRONG! It doesn’t fit with your hypothesis, but in some weird way. Repeat the experiment a few times. If that doesn’t fix the problem then work on changing the experiment until you get rid of that pesky weird result. Ahhhh, there, you’ve ‘fixed’ it- now things will fit with what you expected in the first place.

Most of the time spurious, weird results are probably just that- not very interesting. However, there are probably a lot of times when there are weird results that we as scientists don’t even see. We don’t expect to see them, so we don’t see them. And those could be incredibly interesting. I can see this being the case in what I do a lot of: analysis of high-throughput data (lots of measurements for lots of components at the same time, like microarray expression data). It’s sometimes like trying to count the number of times the kids wearing white shirts pass the ball back and forth- but where there are 300 shirt colors and 2500 kids. Ouch. A gorilla wandering into that mess would be about as obvious as Waldo at a multi-colored referees’ convention. That is, not so much. I wonder how many interesting things are missed, and how important that is. In high-throughput data analysis the goal is often to focus on what’s important and ignore the rest- but if the rest is telling an important and dominant story we’re really missing the boat.

I’ve found that one of the best things I can do in my science is to be my own reviewer, my own critic, and my own skeptic. If some result turns out exceptionally well I don’t believe it. Actually, the better a result looks compared with what I expected, the less I believe it at first. I figure if I don’t do this someone down the line will- and it will come back to me. I try to eliminate all the other possibilities of what could be going on (using the scientific method algorithm I’ve previously described). I try to rigorously oppose all my findings until I myself am convinced. However, studies like the invisible gorilla experiment really make me wonder how good I am at seeing things that I’m not specifically looking for.