Another word about balance

[4/17/2015 updated: A reader pointed out that my formulae for specificity and accuracy contained errors. It turns out that both measures were being calculated correctly, just a typing error on the blog. I’ve corrected them below.] 

TL;DR summary

Evaluating a binary classifier on an artificial balance of positive and negative examples (which is commonly done in this field) can cause underestimation of method accuracy but vast overestimation of the positive predictive value (PPV) of the method. Since PPV is likely the only metric that really matters to one important kind of end user- the biologist wanting to find a couple of novel positive examples in the lab based on your predictions- this is a potentially very big problem in reporting performance.

The long version

Previously I wrote a post about the importance of having a naturally balanced set of positive and negative examples when evaluating the performance of a binary classifier produced by machine learning methods. I’ve continued to think about this problem and realized that I didn’t have a very good handle on what kinds of effects artificially balanced sets would have on performance. Though the metrics I’m using are very simple I felt that it would be worthwhile to demonstrate the effects so did a simple simulation.

  1. I produced random prediction sets with a set proportion of positives predicted correctly (85%) and a set proportion of negatives predicted correctly (95%).
  2. The ‘naturally’ occurring ratio of positive to negative examples could be varied but for the figures below I used 1:100.
  3. I varied the ratio of positive to negative examples used to estimate performance and
  4. Calculated several commonly used measures of performance:
    1. Accuracy ((TP+TN)/(TP+FP+TN+FN); that is, the percentage of positive or negative predictions that are correct relative to the total number of predictions)
    2. Specificity (TN/(TN+FP); that is, the percentage of negative predictions that are correct relative to the total number of negative examples)
    3. AUC (area under the receiver operating characteristic curve; a summary metric that is commonly used in classification to evaluate performance)
    4. Positive predictive value (TP/(TP+FP); that is, out of all positive predictions what percentage are correct)
    5. False discovery rate (FDR; 1-PPV; percentage of positive predictions that are wrong)
  5. Repeated these calculations with 20 different random prediction sets
  6. Plotted the results as box plots, which summarize the median (dark line in the middle), the interquartile range (the box), and whiskers extending to 1.5 times the interquartile range from the box- dots above or below are outside this range. (A minimal sketch of this simulation is given just after this list.)
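For concreteness, here is roughly how such a simulation might look in Python. This is a minimal sketch assuming the 85% sensitivity and 95% specificity above and 20 random prediction sets- my reconstruction, not the code used to make the figures.

```python
# Sketch of the simulation: random prediction sets at fixed sensitivity (0.85)
# and specificity (0.95), evaluated at different positive:negative ratios.
import numpy as np

rng = np.random.default_rng(0)

def one_prediction_set(n_pos, n_neg, sens=0.85, spec=0.95):
    # Each positive is predicted correctly with probability `sens`,
    # each negative with probability `spec`.
    tp = rng.binomial(n_pos, sens)
    fn = n_pos - tp
    tn = rng.binomial(n_neg, spec)
    fp = n_neg - tn
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    specificity = tn / (tn + fp)
    ppv = tp / (tp + fp)
    return accuracy, specificity, ppv, 1 - ppv   # last value is FDR

# Compare an artificial 1:1 balance with the 'natural' 1:100 balance.
for neg_per_pos in (1, 100):
    runs = np.array([one_prediction_set(100, 100 * neg_per_pos) for _ in range(20)])
    acc, spec, ppv, fdr = runs.mean(axis=0)
    print(f"1:{neg_per_pos:<4} accuracy={acc:.2f} specificity={spec:.2f} "
          f"PPV={ppv:.2f} FDR={fdr:.2f}")
```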

The results are not surprising but do demonstrate the pitfalls of using artificially balanced data sets. Keep in mind that there are many publications that limit their training and evaluation datasets to a 1:1 ratio of positive to negative examples.

Accuracy

Accuracy estimates are actually worse than they should be for the artificial splits because fewer of the negative results are being considered.

Specificity

Specificity stays largely the same and is a good estimate because it isn’t affected by the ratio of negatives to positive examples. Sensitivity (the same measure but for positive examples) also doesn’t change for the same reason.

AUC

Happily the AUC doesn’t actually change that much- mostly it’s just much more variable with smaller ratios of negatives to positives. So an AUC from a 1:1 split should be considered to be in the right ballpark, but maybe off from the real value by a bit.

Positive predictive value (PPV)

Aaaand there’s where things go to hell.

False discovery rate (FDR)

Same thing here. The FDR is extremely high (>90%) in the real dataset, but the artificially balanced sets vastly underestimate it.

Why is this a problem?

The last two plots, PPV and FDR, are where the real trouble is. The problem is that the artificial splits vastly overestimate PPV and underestimate FDR (note that the Y axis scale on these plots runs from 0 to close to 1). Why is this important? This is important because, in general, PPV is what an end user is likely to be concerned about. I’m thinking of the end user that wants to use your great new method for predicting that proteins are members of some very important functional class. They will then apply your method to their own examples (say their newly sequenced bacteria) and rank the positive predictions. They couldn’t care less about the negative predictions because that’s not what they’re interested in. So they take the top few predictions to the lab (they can’t afford to do 100s, only the best few, say 5, predictions) and experimentally validate them.

If your method’s PPV is actually 95% it’s fairly likely that all 5 of their predictions will pan out (it’s NEVER really as likely as that due to all kinds of factors, but for the sake of argument), making them very happy and allowing the poor grad student whose project it is to actually graduate.

However, the actual PPV from the example above is about 5%. This means that the poor grad student who slaves for weeks over experiments to validate at least ONE of your stinking predictions will probably end up empty-handed for their efforts and will have to spend another 3 years struggling to get their project to the point of graduation.

Given a large enough ratio in the real dataset (e.g. protein-protein interactions, where the number of positive examples is somewhere around 50-100k in human but the number of negatives is somewhere around 4.5 x 10^8, a ratio of ~1:10000) the real PPV can fall to essentially 0, whereas the artificially estimated PPV can stay very high.
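To put rough numbers on that, here is a quick back-of-the-envelope sketch of how PPV behaves as the true positive:negative ratio grows, assuming the 85% sensitivity and 95% specificity used in the simulation above. The exact values will differ for any real method, but the trend is the point.

```python
# PPV (and the chance that at least one of 5 validated predictions is real)
# as a function of the true positive:negative ratio, for a classifier with
# 85% sensitivity and 95% specificity.
sens, spec = 0.85, 0.95

for neg_per_pos in (1, 10, 100, 10_000):
    tp = sens                      # expected true positives per positive example
    fp = (1 - spec) * neg_per_pos  # expected false positives per positive example
    ppv = tp / (tp + fp)
    # Rough odds that at least one of the top 5 lab-validated predictions pans
    # out, treating each as an independent draw at this PPV (an approximation).
    hit = 1 - (1 - ppv) ** 5
    print(f"1:{neg_per_pos:<6} PPV={ppv:.3f}  P(at least 1 of 5 validates)={hit:.2f}")
```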

So, don’t be that bioinformatician who publishes the paper with performance results based on a vastly artificial balance of positive versus negative examples that ruins some poor graduate student’s life down the road.

Big Data Showdown

One of the toughest parts of collaborative science is communication across disciplines. I’ve had many (generally initial) conversations with bench biologists, clinicians, and sometimes others that go approximately like:

“So, tell me what you can do with my data.”

“OK- tell me what questions you’re asking.”

“Um,.. that kinda depends on what you can do with it.”

“Well, that kinda depends on what you’re interested in…”

And this continues.

But the great part- the part about it that I really love- is that given two interested parties you’ll sometimes work to a point of mutual understanding, figuring out the borders and potential of each other’s skills and knowledge. And you generally work out a way of communicating that suits both sides and (mostly) works to get the job done. This is really when you start to hit the point of synergistic collaboration- and also, sadly, usually about the time you run out of funding to do the research.

Human Protein Tweetbots

I came up with an interesting idea today based on someone’s joke at a meeting. I’m paraphrasing here but the joke was “let’s just get all the proteins Facebook accounts and let their graph algorithms sort everything out”. Which isn’t as nutty as it sounds- at least using some of FB’s algorithms, if they’re available, to figure out interesting biology from protein networks. But it got me thinking about social media and computational biology.

The Cellular Social Network Can be a tough place

Scientists use Twitter for a lot of different purposes. One of these is to keep abreast of the scientific literature. This is generally done by following other scientists in disciplines that are relevant to your work, journals and preprint archives that post their newest papers as they’re published, and other aggregators like professional societies and special interest groups.

Many biologists have broad interests, but even journals for your sub-sub-sub field publish papers that you might not be that interested in. Many biologists also have specific genes, proteins, complexes, or pathways that are of interest to them.

My thought was simple. Spawn a bunch of Tweetbots (each with their own Twitter account) that would be tied to a specific gene/protein, complex, or pathway. These Tweetbots would search PubMed (and possibly other sources) and post links to ‘relevant’ new publications – probably simply containing the name of the protein or an alias. I think that you could probably set some kind of popularity bar for actually having a Tweetbot (e.g. BRCA1 would certainly have one, but a protein like SLC10A4 might not).
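As a rough sketch of what the core of each Tweetbot might look like (using Biopython’s Entrez module to query PubMed; the example protein, aliases, email address, and the final tweeting step are all placeholders- I haven’t built this):

```python
# Hypothetical core loop of a per-protein Tweetbot: find recent PubMed records
# mentioning the protein (or an alias) and format a short post with a link.
from Bio import Entrez

Entrez.email = "you@example.org"  # placeholder; NCBI asks for a real address
ALIASES = ["BRCA1", "breast cancer 1"]  # aliases for one example protein

def recent_papers(aliases, max_results=5):
    query = " OR ".join(f'"{alias}"[Title/Abstract]' for alias in aliases)
    handle = Entrez.esearch(db="pubmed", term=query, retmax=max_results)
    record = Entrez.read(handle)
    handle.close()
    return record["IdList"]

for pmid in recent_papers(ALIASES):
    # A real bot would hand this string to the Twitter API; here it is just printed.
    print(f"New {ALIASES[0]} paper: https://pubmed.ncbi.nlm.nih.gov/{pmid}/")
```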

Sure there are other ways you can do this- for example you can set up automatic notifications on PubMed that email you new publications with keywords- and there might already be specific apps that try to do something like this- but they’re not Twitter. One potential roadblock would be the process of opening so many Twitter accounts- which I’m thinking you can’t do automatically (but don’t know that for sure). To make it useful you’d probably have to start out with at least 1000 of them, maybe more, but wouldn’t need to do all proteins (!) or even all ~30K human proteins.

I’m interested in getting feedback about this idea. I’m not likely to implement it myself (though I probably could)- but would other biologists see this as useful? Interesting? Could you see any other applications or twists to make it better?

The false dichotomy of multiple hypothesis testing

[Disclaimer: I’m not a statistician, but I do play one at work from time to time. If I’ve gotten something wrong here please point it out to me. This is an evolving thought process for me that’s part of the larger picture of what the scientific method does and doesn’t mean- not the definitive truth about multiple hypothesis testing.]

There’s a division in research between hypothesis-driven and discovery-driven endeavors. In hypothesis-driven research you start out with a model of what’s going on (this can be explicitly stated or just the amalgamation of what’s known about the system you’re studying) and then design an experiment to test that hypothesis (see my discussions on the scientific method here and here). In discovery-driven research you start out with more general questions (that can easily be stated as hypotheses, but often aren’t) and generate larger amounts of data, then search the data for relationships using statistical methods (or other discovery-based methods).

The problem with analysis of large amounts of data is that when you’re applying a statistical test to a dataset you are actually testing many, many hypotheses at once. This means that your level of surprise at finding something that you call significant (arbitrarily but traditionally a p-value of less than 0.05) may be inflated by the fact that you’re looking a whole bunch of times (thus increasing the odds that you’ll observe SOMETHING by random chance alone- see the excellent xkcd cartoon below, which I’ll refer to throughout this example). So you need to apply some kind of multiple hypothesis correction to your statistical results to reduce the chances that you’ll fool yourself into thinking that you’ve got something real when actually you’ve just got something random. In the xkcd example below a multiple hypothesis correction using Bonferroni’s method (one of the simplest and most conservative corrections) would suggest that the threshold for significance should be moved to 0.05/20=0.0025, since 20 different tests were performed.
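As a minimal illustration of that correction in code (simulated p-values under the null, i.e. no jelly bean color really has an effect; not data from any real experiment):

```python
# Bonferroni correction for the jelly bean scenario: 20 colors tested, none of
# which truly has an effect, so all p-values are uniform under the null.
import numpy as np

rng = np.random.default_rng(42)
n_tests = 20
alpha = 0.05
p_values = rng.uniform(0, 1, n_tests)

bonferroni_threshold = alpha / n_tests            # 0.05 / 20 = 0.0025
uncorrected_hits = np.sum(p_values < alpha)       # counted as 'significant' by chance
corrected_hits = np.sum(p_values < bonferroni_threshold)

print(f"Uncorrected (p < {alpha}): {uncorrected_hits} 'significant' colors")
print(f"Bonferroni (p < {bonferroni_threshold}): {corrected_hits} significant colors")
```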

Here’s where the problem of a false dichotomy occurs. Many researchers who analyze large amounts of data believe that utilizing a hypothesis-based approach mitigates the effect of multiple hypothesis testing on their results. That is, they believe that they can focus their investigation of the data to a subset constrained by a model/hypothesis and thus reduce the effect that multiple hypothesis testing has on their analysis. Instead of looking at 10,000 proteins in a study they now look at only the 25 proteins that are thought to be present in a particular pathway of interest (where the pathway here represent the model based on existing knowledge). This is like saying, “we believe that jelly beans in the blue green color range cause acne” and then drawing your significance threshold at 0.05/4=0.0125 – since there are ~4 jelly beans tested that are in the blue-green color range (not sure if ‘lilac’ counts or not- that would make 5). All well and good EXCEPT for the fact that the actual chance of detecting something by random chance HASN’T changed. In large scale data analysis (transcriptome analysis, e.g.) you’ve still MEASURED everything else. You’ve just chosen to limit your investigation to a smaller subset and then can ‘go easy’ on your multiple hypothesis correction.

The counter-argument that might be made to this point is that by doing this you’re testing a specific hypothesis, one that you believe to be true and that may be supported by existing data. This is a reasonable point in one sense- it may lend credence to your finding that there is existing information supporting your result. But on the other hand it doesn’t change the fact that you still could be finding more things by chance than you realize, because you simply hadn’t looked at the rest of your data. It turns out that this is true not just of analysis of big data, but also of some kinds of traditional experiments aimed at testing individual (associative) hypotheses. The difference there is that it is technically infeasible to actually test a large number of the background cases (generally limited to one or two negative controls). Also, a mechanistic hypothesis (as opposed to an associative one) is based on intervention, which tells you something different and so is not (as) subject to these considerations.

Imagine that you’ve dropped your car keys in the street and you don’t know what they look like (maybe you’re borrowing a friend’s car). You’re pretty sure you dropped them in front of the coffee shop on a block with 7 other shops on it- but you did walk the length of the block before you noticed the keys were gone. You walk directly back to look in front of the coffee shop and find a set of keys. Great, you’re done. You found your keys, right? But what if you had looked in front of the other stores and found other sets of keys? You didn’t look- but that doesn’t make it any less likely that you’re wrong about these keys (your existing knowledge/model/hypothesis, “I dropped them in front of the coffee shop”, could easily be wrong).

XKCD: significant

Spaghetti plots? Sashimi? Food-themed Plots for Science!

For whatever reason bioinformaticians and other plot makers like to name (or re-name) plotting methods with food themes. Just saw this paper for “Sashimi plots” to represent alternative isoform expression from RNA-seq data.

Sashimi plots: Quantitative visualization of alternative isoform expression from RNA-seq data

That prompted me to post this from my Tumblr http://scieastereggs.tumblr.com (growing collection of funny bits in scientific publications):

————————————————————————————————

Spaghetti plots? Lasagne? OK then I can do rigatoni plots

This possibly somewhat satirical paper makes the case for “lasagne plots”, following on the spaghetti plots that are popular in some fields for representing longitudinal data. Lasagna plots are presented as an alternative for large datasets, though the authors state: “To remain consistent with the Italian cuisine-themed spaghetti plot, we refer to heatmaps as ‘lasagna plots’.” The remainder of the paper is a pretty straight-on discussion and demonstration of why and when these plots are better than spaghetti plots.

Lasagna plots: A saucy alternative to spaghetti plots

Bruce J. Swihart, Brian Caffo, Bryan D. James, Matthew Strand, Brian S. Schwartz, Naresh M. Punjabi

Interestingly, a recent paper reimagines heatmaps as “quilt” plots (though less satirically so). This opens whole new doors in the thematic renaming of methods for plotting data.

(h/t @leonidkruglyak)

But, in keeping with the Italian cuisine-themed spaghetti and lasagne plots: Now introducing Rigatoni plots!

(no pasta was harmed in the making of this plot. Well, OK. It was harmed a little)

Need to show outliers? Tasty, tasty outliers? No problem! (thanks @Lewis_Lab)

(capers. They’re capers)

via Spaghetti plots? Lasagne? OK then I can do rigatoni plots.

Gender bias in scientific publishing

The short version: This is a good paper about an important topic, gender bias in publication. The authors try to address two main points: What is the relationship between gender and research output? And what is the relationship between author gender and paper impact? The study shows a bias in the number of papers published by gender, but apparently fails to control for the relative number of researchers of each gender found in each field. This means that the first point of the paper, that women publish less than men, can’t be separated from the well-known gender bias in most of these fields- i.e. there are more men than women. This seems like a strange oversight, and it’s only briefly mentioned in the paper. The second point, which is made well and clearly, is that papers authored by women are cited less than those authored by men. This is the only real take home of the paper, though it is a very important and alarming one.

What the paper does say: that papers authored by women are cited less than those authored by men.

What the paper does NOT say: that women are less productive than men, on average, in terms of publishing papers.

The slightly longer version

This study on gender bias in scientific publishing is a really comprehensive look at gender and publishing world-wide (though it is biased toward the US). The authors do a good job of laying out previous work in this area and then indicate that they are interested in looking at scientific productivity with respect to differences in gender. The first stated goal is to provide an analysis of: “the relationship between gender and research output (for which our proxy was authorship on published papers).”

The study is not in any way incorrect (that I can see in my fairly cursory read-through) but it does present the data in a way that is a bit misleading. Most of the paper describes gathering pretty comprehensive data on gender in published papers relative to author position, geographic location, and several other variables. This is then used to ‘show’ that women are less productive than men in scientific publication, but it omits a terribly important step- they never seem to normalize for the ratio of women to men in positions that might be publishing at all. That is, their results very clearly reiterate that there is a gender bias in the positions themselves- but they don’t say anything (that I can see) about the productivity of individuals (how many papers were published by each author, for example).

They do mention this issue in their final discussion:

UNESCO data show that in 17% of countries an equal number of men and women are scientists. Yet we found a grimmer picture: fewer than 6% of countries represented in the Web of Science come close to achieving gender parity in terms of papers published.

And, though this is true, it seems like a less-than-satisfying analysis of the data.

On the other hand, the result that they show at the last- the number of times a paper is cited when a male or female name is included in various locations- is pretty compelling and is really their novel finding. This is a pretty sobering analysis, and the authors provide some ideas on how to address this issue, which seems to be part of the larger problem of providing equal opportunities and advantages to women in science.

A word about balance

I’ve been reviewing machine learning papers lately and have seen a particular problem repeatedly. Essentially it’s a problem of how a machine learning algorithm is trained and evaluated for performance versus how it would be actually applied. I’ve seen this particular problem also in a whole bunch of published papers too so thought I’d write a blog rant post about it. I’ve given a quick-and-dirty primer to machine learning approaches at the end of this post for those interested.

The problem is this: methods are often evaluated using an artificial balance of positive versus negative training examples, one that can artificially inflate estimates of performance over what would actually be obtained in a real world application.

I’ve seen lots of studies that use a balanced approach to training. That is, the number of positive examples is matched with the number of negative examples. The problem is that many times the number of negative examples in a ‘real world’ application is much larger than the number of positive examples- sometimes by orders of magnitude. The reason that is often given for choosing to use a balanced training set? That this provides better performance and that training on datasets with a real distribution of examples would not work well since any pattern in the features from the positive examples would be drowned out by the sheer number of negative examples. So essentially- that when we use a real ratio of positive to negative examples in our evaluation our method sucks. Hmmmmm……

This argument is partly true- some machine learning algorithms do perform very poorly with highly unbalanced datasets. Support Vector Machines (SVMs), though, and some other kinds of machine learning approaches seem to do pretty well. Some studies then follow this initial balanced training step with an evaluation on a real world set- that is, one with a ‘naturally’ occurring balance of positive and negative examples. This is a perfectly reasonable approach. However, too many studies don’t do this step, or perform a follow-on ‘validation’ on a dataset with more negative examples but still nowhere near the number that would be present in a real dataset. And importantly- the ‘bad’ studies report the performance results from the balanced (and thus artificial) dataset.

The issue here is that evaluation on a dataset with an even number of positive and negative examples can vastly overestimate performance by decreasing the number of false positive predictions that are made. Imagine that we have a training set with 50 positive examples and a matched set of 50 negative examples. The algorithm is trained on these examples and cross-validation (random division of the training set for evaluation purposes- see below) reveals that the algorithm predicts 40 of the positives to be positive (TP) and 48 of the negatives to be negative (TN). So it misclassifies two negative examples as positive examples (FP), with scores that make them look as good as or better than the true positives- which wouldn’t be too bad, since the majority of positive predictions would still be true positives. Now imagine that the actual ratio of positive to negative examples in a real world application is 1:50; that is, for every positive example there are 50 negative examples. What’s not done in these problem cases is extrapolating the performance of the algorithm to such a real world dataset. In that case you’d expect to see 100 false positive predictions- now outnumbering the true positive predictions and making the results a lot less confident than originally estimated. The example I use here is actually a generous one. I frequently deal with datasets (and review or read papers) where the ratios are 1:100 to 1:10,000, where this can substantially impact results.
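Here is that extrapolation written out- a quick sketch using the toy numbers from the paragraph above:

```python
# Extrapolating balanced cross-validation results (50 positives, 50 negatives)
# to a realistic 1:50 positive:negative ratio, using the toy numbers above.
tp, fn = 40, 10   # 50 positives: 40 predicted correctly
tn, fp = 48, 2    # 50 negatives: 48 predicted correctly

sensitivity = tp / (tp + fn)   # 0.80
fp_rate = fp / (fp + tn)       # 0.04

ppv_balanced = tp / (tp + fp)
print(f"PPV on the balanced evaluation set: {ppv_balanced:.2f}")   # ~0.95

# Real-world extrapolation: same 50 positives, but now 50 * 50 = 2500 negatives.
n_pos, n_neg = 50, 50 * 50
expected_tp = sensitivity * n_pos   # 40
expected_fp = fp_rate * n_neg       # 100
ppv_real = expected_tp / (expected_tp + expected_fp)
print(f"Expected PPV at a 1:50 ratio: {ppv_real:.2f}")             # ~0.29
```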

So the evaluation of a machine learning method should involve a step where a naturally occurring ratio of positive and negative examples is represented. Though this natural ratio may not be clearly evident for some applications, it should be given a reasonable estimate. The performance of the method should be reported based on THIS evaluation, not the evaluation on the balanced set- since that is likely to be inflated from a little to a lot.

For those that are interested in real examples of this problem I’ve got two example studies from one of my own areas of research- type III effector prediction in bacteria. In Gram negative bacteria with type III secretion systems there are an unknown number of secreted effectors (proteins that are injected into host cells to effect virulence) but we estimate on the order of 50-100 for a genome like Salmonella Typhimurium, which has 4500 proteins total, so the ratio should be around 1:40 to 1:150 for most bacteria like this. In my own study on type III effector prediction I used a 1:120 ratio for evaluation for exactly this reason. A subsequent paper in this area was published that chose to use a 1:2 ratio because “the number of non-T3S proteins was much larger than the number of positive proteins,…, to overcome the imbalance between positive and negative datasets.” If you’ve been paying attention, THAT is not a good reason and I didn’t review that paper (though I’m not saying that their conclusions are incorrect since I haven’t closely evaluated their study).

  1. Samudrala R, Heffron F and McDermott JE. 2009. Accurate prediction of secreted substrates and identification of a conserved putative secretion signal for type III secretion systems. PLoS Pathogens 5(4):e1000375.
  2. Wang Y, Zhang Q, Sun MA, Guo D. 2011. High-accuracy prediction of bacterial type III secreted effectors based on position-specific amino acid composition profiles. Bioinformatics. 2011 Mar 15;27(6):777-84.

So the trick here is to not fool yourself, and in turn fool others. Make sure you’re being your own worst critic. Otherwise someone else will take up that job instead.

Quick and Dirty Primer on Machine Learning

Machine learning is an approach to pattern recognition that learns patterns from data. Often times the pattern that is learned is a particular pattern of features, properties of the examples, that can classify one group of examples from another. A simple example would be to try to identify all the basketball players at an awards ceremony for football, basketball, and baseball players. You would start out by selecting some features, that is, player attributes, that you think might separate the groups out. You might select hair color, length of shorts or pants in the uniform, height, and handedness of the player as potential features.

Obviously all these features would not be equally powerful at identifying basketball players, but a good algorithm will be able to make best use of the features. A machine learning algorithm could then look at all the examples: the positive examples, basketball players; and the negative examples, everyone else. The algorithm would consider the values of the features in each group and ideally find the best way to separate the two groups.

Generally to evaluate the algorithm all the examples are separated into a training set, to learn the pattern, and a testing set, to test how well the pattern works on an independent set. Cross-validation, a common method of evaluation, does this repeatedly, each time separating the larger group into training and testing sets by randomly selecting positive and negative examples to put into each set. Evaluation is very important since the performance of the method will provide end users with an idea of how well the method has worked for their real world application where they don’t know the answers already.

Performance measures vary but for classification they generally involve comparing predictions made by the algorithm with the known ‘labels’ of the examples- that is, whether the player is a basketball player or not. There are four categories of prediction: true positives (TP), the algorithm predicts a basketball player where there is a real basketball player; true negatives (TN), the algorithm predicts not a basketball player when the example is not a basketball player; false positives (FP), the algorithm predicts a basketball player when the example is not; and false negatives (FN), the algorithm predicts not a basketball player when the example actually is.

Features (height and pant length) of examples (basketball players and non-basketball players) plotted against each other. Trying to classify based on either of the individual features won’t work well but a machine learning algorithm can provide a good separation. I’m showing something that an SVM might do here- but the basic idea is the same with other ML algorithms.
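To make the primer concrete, here is a small sketch of the kind of two-feature classification the figure describes (entirely synthetic ‘height’ and ‘pant length’ values, a linear SVM as the example classifier, and cross-validation for the performance estimate- purely illustrative):

```python
# Toy two-feature classifier: separate basketball players from everyone else
# using height and pant length, evaluated by cross-validation.
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

rng = np.random.default_rng(1)
n = 50  # examples per class

# Synthetic features: basketball players are taller and wear shorter pants.
height = np.concatenate([rng.normal(200, 8, n), rng.normal(180, 8, n)])   # cm
pants = np.concatenate([rng.normal(25, 4, n), rng.normal(80, 10, n)])     # cm
X = np.column_stack([height, pants])
y = np.concatenate([np.ones(n), np.zeros(n)])  # 1 = basketball player

# Linear SVM with 5-fold cross-validated accuracy.
clf = SVC(kernel="linear")
scores = cross_val_score(clf, X, y, cv=5)
print(f"Cross-validated accuracy: {scores.mean():.2f} +/- {scores.std():.2f}")
```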

What if I were my own post-doc mentor?

Recently I’ve had time, and reason, to reflect upon what was expected of me during the early portion of my post-doc and what I was able to deliver. It started me thinking: how would I judge myself as a post-doc if I (the me right now) were my own mentor?

My post-doc started 12 years ago and ended when I landed my current job, 7 years ago. I’ve given a short introduction that includes some context: where I was coming from and what I settled on for my post-doc project.

Background: I did my PhD in a structural virology lab in a microbiology and immunology department. I started out solidly on the bench science side then worked my way slowly into image analysis and some coding as we developed methods for analysis of electron microscopy images to get structural information.

May 2001: Interviewed for a post-doc position with Dr. Ram Samudrala in the Department of Microbiology at UW. Offered a position and accepted soon after. My second day on the job, sitting in an office with a wonderful panoramic view of downtown Seattle from tall tower to tall tower, was September 11th 2001.

First idea on the job: Was to develop a one-dimensional cellular automaton to predict protein structure. It didn’t work, but I learned a lot of coding. I’m planning on writing a post about that and will link to it here (in the near future).

Starting project: My starting project that I finally settled on was to predict structures for all the tractable proteins in the rice, Oryza sativa, proteome, a task that I’m pretty sure has never been completed by anyone. The idea here is that there are three classes of protein sequence: those which have structures that have been solved for that specific protein, those that have significant sequence similarity to proteins with solved structures, and those that are not similar to sequences with known structures. Also, there’s a problem with large proteins that have many domains. These need to be broken up into their domains (structurally and functionally distinct regions of the protein) before they can be predicted. So I started organizing and analyzing sequences in the rice proteome. This quickly took on a life of its own and became my post-doc project. I did still work some with structure but focused more on how to represent data, access it, and use it from multiple levels to make predictions that were not obvious from any of the individual data sources. This is an area that I continue to work in in my current position. What came out of it was The Bioverse, a repository for genomic and proteomic data, and a way to represent that data in a way that was accessible to anyone with interest. The first version was coded all by me from the ground up in a colossal, and sometimes misguided, monolithic process that included a workflow pipeline, a webserver, a network viewer, and a database, of sorts. It makes me tired just thinking of it. Ultimately the Bioverse was an idea that didn’t have longevity for a number of different reasons- maybe I’ll write a post about that in the future.

Publishing my first paper as a post-doc: My first paper was a short note for the Nucleic Acids Research special issue on databases on the Bioverse that I’d developed. I submitted it one and a half years after starting my post-doc.

Now the hard part, what if I were my own mentor: How would mentor me view post-doc me?

How would I evaluate myself if I were my own mentor? Hard to say, but I’m pretty sure mentor me would be frustrated at post-doc me’s lack of progress publishing papers. However, I think mentor me would also see the value in the amount and quality of the technical work post-doc me had done, though I’m not sure mentor me would give post-doc me the kind of latitude I’d need to get to that point. Mentor me would think that post-doc me needed mentoring. You know- mentor me needs to DO something, right? And I’m not sure how post-doc me would react to that. Probably it would be fine, but I’m not sure it’d be helpful. Mentor me would push for greater productivity, and post-doc me would chafe under the stress. We might very well have a blow up over that.

Mentor me would be frustrated that post-doc me was continually reinventing the wheel in terms of code. Mentor me would push post-doc me to learn more about what was already being done in the field and what resources existed that had similarities with what post-doc me was doing. Mentor me would be frustrated with post-doc me’s lack of vision for the future: did post-doc me consider writing a grant? How long did post-doc me want to remain a post-doc? How did post-doc me think they’d be able to land a job with minimal publications?

Advice that mentor me would give post-doc me? Probably to focus more on getting science done and publishing some of it than futzing around with (sometimes unnecessary) code. I might very well be wrong about that too. The path that I took through my post-doc and to my current independent scientist position might very well be the optimal path for what I do now.

I (mentor me) filled out an evaluation form that is similar to the one I have to do for my current post-docs (see below). Remember, this was 12 years ago- so it’s a bit fuzzy. I (post-doc me) come out OK- but with a number of places for improvement.

This evaluation makes me realize how ideas and evaluations of “success”, “progress”, and even “potential as an independent scientist” can be very complicated and can evolve rapidly over time for the same person. As a mentor there is not a single clear path to promote these qualities in your mentees. In fact, mentorship is hard. Too much mentorship and you could stifle good qualities. Too little and you could let those qualities die. And here’s the kicker: or not. What you do as a mentor might not have as much to do with eventual outcomes of success as you’d like to think.

How would mentor me rate post-doc me if I had to evaluate using the same criteria that I now use for my own post-docs?

Eight red flags in bioinformatics analyses

A recent comment in Nature by C. Glenn Begley outlines six red flags that basic science research won’t be reproducible. Excellent read and excellent points. The point of this comment, based on experience from writing two papers in which:

Researchers — including me and my colleagues — had just reported that the majority of preclinical cancer papers in top-tier journals could not be reproduced, even by the investigators themselves [1,2].

was to summarize the common problems observed in the non-reproducible papers surveyed, since the author could not reveal the identities of the papers themselves. Results in a whopping 90% of the papers surveyed could not be reproduced, in some cases even by the same researchers in the same lab, using the same protocols and reagents. The ‘red flags’ are really warnings to researchers of ways that they can fool themselves (as well as reviewers and readers in high-profile journals) and things that they should do to avoid falling into the traps found by the survey. These kinds of issues are major problems in analysis of high-throughput data for biomarker studies and other purposes as well. As I was reading this I realized that I’d written several posts about these issues, but applied to bioinformatics and computational biology research. Therefore, here is my brief summary of these six red flags, plus two more that are more specific to high-throughput analysis, as they apply to computational analysis- linking to my previous posts or those of others as applicable.

  1. Were experiments performed blinded? This is something I hadn’t previously considered directly but my post on how it’s easy to fool yourself in science does address this. In some cases blinding your bioinformatic analysis might be possible and certainly be very helpful in making sure that you’re not ‘guiding’ your findings to a predetermined answer. The cases where this is especially important is when the analysis is directly targeted at addressing a hypothesis. In these cases a solution may be to have a colleague review the results in a blinded manner- though this may take more thought and work than would reviewing the results of a limited set of Western blots.
  2. Were basic experiments repeated? This is one place where high-throughput methodology and analysis might have a step up on ‘traditional’ science involving (for example) Western blots. Though it’s a tough fight and sometimes not done correctly, the need for replicates is well-recognized as discussed in my recent post on the subject. In studies where the point is determining patterns from high-throughput data (biomarker studies, for example) it is also quite important to see if the study has found their pattern in an independent dataset. Often cross-validation is used as a substitute for an independent dataset- but this falls short. Many biomarkers have been found not to generalize to different datasets (other patient cohorts). Examination of the pattern in at least one other independent dataset strengthens the claim of reproducibility considerably.
  3. Were all the results presented? This is an important point but can be tricky in analysis that involves many ‘discovery’ focused analyses. It is not important to present every comparison, statistical test, heatmap, or network generated during the entire arc of the analysis process. However, when addressing hypotheses (see my post on the scientific method as applied in computational biology) that are critical to the arguments presented in a study it is essential that you present your results, even where those results are confusing or partly unclear. Obviously, this needs to be undertaken through a filter to balance readability and telling a coherent story– but results that partly do not support the hypothesis are very important to present.
  4. Were there positive and negative controls? This is just incredibly central to the scientific method but is a problem in high-throughput data analysis. At the most basic level, analyzing the raw (or mostly raw) data from instruments, this is commonly performed but never reported. In a number of recent cases in my group we’ve found real problems in the data that were revealed by simply looking at these built-in controls, or by figuring out what basic comparisons could be used as controls (for example, do gene expression from biological replicates correlate with each other?). What statistical associations do you expect to see and what do you expect not to see? These checks are good to prevent fooling yourself- and if they are important they should be presented.
  5. Were reagents validated? For data analysis this should be: “Was the code used to perform the analysis validated?” I’ve not written much on this but there are several out there who make it a central point in their discussions including Titus Brown. Among his posts on this subject are here, here, and here. If your code (an extremely important reagent in a computational experiment) does not function as it should the results of your analyses will be incorrect. A great example of this is from a group that hunted down a bunch of errors in a series of high-profile cancer papers I posted about recently. The authors of those papers were NOT careful about checking that the results of their analyses were correct.
  6. Were statistical tests appropriate? There is just too much to write on this subject in relation to data analysis. There are many ways to go wrong here- inappropriate data for a test, inappropriate assumptions, inappropriate data distribution. I am not a statistician so I will not weigh in on the possibilities here. But it’s important. Really important. Important enough that if you’re not a statistician you should have a good friend/colleague who is and can provide specific advice to you about how to handle statistical analysis.
  7. New! Was multiple hypothesis correction correctly applied? This is really an addition to flag #6 above specific for high-throughput data analysis. Multiple hypothesis correction is very important to high-throughput data analysis because of the number of statistical comparisons being made. It is a way of filtering predictions or statistical relationships observed to provide more conservative estimates. Essentially it extends the question, “how likely is it that the difference I observed in one measurement is occurring by chance?” to the population-level question, “how likely is it that I would find this difference by chance if I looked at a whole bunch of measurements?”. Know it. Understand it. Use it.
  8. New! Was an appropriate background distribution used? Again, an extension to flag #6. When judging the significance of findings it is very important to choose a correct background distribution for your test. An example is in proteomics analysis. If you want to know what functional groups are overrepresented in a global proteomics dataset, should you choose your background to be all proteins that are coded for by the genome? No- because the set of proteins that can be measured by proteomics (in general) is highly biased to start with. So to get an appropriate idea of which functional groups are enriched you should choose the proteins actually observed in all conditions as a background (see the sketch just after this list).
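Here is a minimal sketch of why the background matters- made-up counts, with a hypergeometric test standing in for whatever enrichment statistic you prefer:

```python
# Enrichment of one functional category among a set of 'interesting' proteins,
# computed against two different backgrounds. All counts are made up.
from scipy.stats import hypergeom

hits_in_category = 40     # category members among your 400 interesting proteins
n_hits = 400

# Background 1: every protein encoded in the genome.
genome_size = 20000
category_in_genome = 500

# Background 2: only the proteins actually observed in the proteomics data,
# where the category happens to be over-represented to begin with.
observed_size = 4000
category_in_observed = 300

# P(seeing at least this many category members by chance) under each background.
p_genome = hypergeom.sf(hits_in_category - 1, genome_size, category_in_genome, n_hits)
p_observed = hypergeom.sf(hits_in_category - 1, observed_size, category_in_observed, n_hits)

print(f"Genome background:   p = {p_genome:.1e}")    # looks strongly enriched
print(f"Observed background: p = {p_observed:.1e}")  # far less impressive
```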

The comment by Glenn Begley wraps up with this statement about why these problems are still present in research:

Every biologist wants and often needs to get a paper into Nature or Science or Cell, yet the scientific community fails to recognize the perverse incentive this creates.

I think this is true, but you could substitute “any peer-reviewed journal” for “Nature or Science or Cell”- the problem comes at all levels. It’s also true that these problems are particularly relevant to high-throughput data analysis because such studies can be less hypothesis-directed and more discovery-oriented, because they are generally more expensive and there’s thus more scrutiny of the results (in some cases), and because of the rampant enthusiasm and overselling of potential results arising from these kinds of studies.

Illustration from Derek Roczen

The big question: Will following these rules improve reproducibility in high-throughput data analysis? The Comment talks about these being things that were present in reproducible studies (that small 10% of the papers), but does that mean that paying attention to them will improve reproducibility, especially in the case of high-throughput data analysis? There are issues that are more specific to high-throughput data (as in my flags #7 and #8, above), but essentially these flags are a great starting point for evaluating the integrity of a computational study. With high-throughput methods, and their resulting papers, gaining importance all the time, we need to consider these both as producers and as consumers.

References

  1. Prinz, F., Schlange, T. & Asadullah, K. Nature Rev. Drug Discov. 10, 712 (2011).
  2. Begley, C. G. & Ellis, L. M. Nature 483, 531–533 (2012).

Job opening: worst critic. Better fill it for yourself, otherwise someone else will.

A recent technical comment in Science (here) reminded me of a post I’d been meaning to write. We need to be our own worst critics. And by “we” I’m specifically talking about the bioinformaticians and computational biologists who are doing lots of transformations with lots of data all the time- but this generally applies to any scientist.

The technical comment I referred to is behind a paywall so I’ll summarize. The first group published the discovery of a mechanism for X-linked dosage compensation in Drosophila based on, among other things, ChIP-seq data (to determine transcription factor binding to DNA). The authors of the comment found that the initial analysis of the data had used an inappropriate normalization step- and the error is pretty simple: instead of multiplying a ratio by a factor (the square root of the number of bins used in a moving average) they multiplied the log2 transform of the ratio by the factor. This resulted in greatly exaggerated ratios and artificially induced a statistically significant difference where there was none. Importantly, the authors of the comment noticed this when,

We noticed that the analysis by Conrad et al. reported unusually high Pol II ChIP enrichment levels. The average enrichment at the promoters of bound genes was reported to be ~30,000-fold over input (~15 on a log2 scale), orders of magnitude higher than what is typical of robust ChIP-seq experiments.

This is important because it means that this was an obvious flag that the original authors SHOULD have seen and wondered about at some point. If they wondered about it they SHOULD have looked further into their analysis and done some simple tests to determine if what they were seeing (30,000 fold increase) was actually reasonable. In all likelihood they would have found their error. Of course, they may not have ended up with a story that could be published in Science- but at least they would not have had the embarrassment of being caught out that way. This is not to say that there is any indication of wrongdoing on the part of the original paper- it seems that they made an honest mistake.
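A toy numeric illustration of the described slip (a made-up enrichment ratio of 2.8 and a 100-bin window; this is not a re-derivation of the actual analysis, just the arithmetic of the error):

```python
# How applying a scale factor on the log2 scale (instead of to the ratio itself)
# explodes apparent enrichment. Numbers are made up for illustration.
import math

ratio = 2.8                  # a modest, plausible ChIP enrichment ratio
factor = math.sqrt(100)      # square root of the number of bins = 10

intended = math.log2(ratio * factor)    # scale the ratio, then log-transform
slipped = math.log2(ratio) * factor     # scale the log2-transformed ratio

print(f"Intended: {intended:.1f} on the log2 scale ({2 ** intended:,.0f}-fold)")
print(f"Slipped:  {slipped:.1f} on the log2 scale ({2 ** slipped:,.0f}-fold)")
# The slipped version lands around 15 on the log2 scale (~30,000-fold), the
# implausibly high enrichment that tipped off the authors of the comment.
```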

In this story the authors likely fell prey to the Confirmation Bias, the tendency to believe results that support your hypothesis. This is a particularly enticing and tricky bias and I have fallen prey to it many times. As far as I know, these errors have never made it into any of my published work. However, falling for particularly egregious examples (arising from mistakes in machine learning applications, for example) trains you to be on the lookout for it in other situations. Essentially it boils down to the following:

  1. Be suspicious of all your results.
  2. Be especially suspicious of results that support your hypothesis.
  3. The amount you should be suspicious should be proportional to the quality of the results. That is, the better the results are the more you should be suspicious of them and the more rigorously you should try to disprove them.

This is essentially wrapped up in the scientific method (my post about that here)- but it bears repeating and revisiting. You need to be extremely critical of your own work. If something works, check to make sure that it actually does work. If it works extremely well, be very suspicious and look at the problem from multiple angles. If you don’t someone else may, and they may not write as nice of things about you as YOU would.

The example I give above is nice in its clarity and it resulted in calling into question the findings of a Science paper (which is embarrassing). However, there are much, much worse cases with more serious consequences.

Take, for instance, the work Keith Baggerly and Kevin Coombes did to uncover a series of cancer papers that had multiple data processing, analysis, and interpretation errors. The NY Times ran a good piece on it. It is more complicated, involving both (likely) unintentional errors in processing, analysis, or interpretation and possibly more serious issues of impropriety. I won’t go into the details here, but their original paper in The Annals of Applied Statistics, “Deriving chemosensitivity from cell lines: Forensic bioinformatics and reproducible research in high-throughput biology“, should be required reading for any bioinformatics or computational biology researcher. The paper painstakingly and clearly goes through the results of several high profile papers from the same group and reconstructs, first, the steps they must have taken to get the results they did, then second, where the errors occurred, and finally, the results if the analysis had been done correctly.

Their conclusions are startling and scary: they found that the methods were often times not described clearly such that a reader could easily reconstruct what was done and they found a number of easily explainable errors that SHOULD have been caught by the researchers.

These were associated with one group and a particular approach, but I can easily recognize the first, if not the second, in many papers. That is, it is often times very difficult to tell what has actually been done to process the data and analyze it. Steps that have to be there are missing in the methods sections, parameters for programs are omitted, data is referred to but not provided, and the list goes on. I’m sure that I’ve been guilty of this from time to time. It is difficult to remember that writing the “boring” parts of the methods may actually ensure that someone else can do what you’ve done. And sharing your data? That’s just a no-brainer, but something that is too often overlooked in the rush to publish.

So these are cautionary tales. For those of us handling lots of data of different types for different purposes and running many different types of analysis to obtain predictions we must always be on guard against our own worst enemy, ourselves and the errors we might make. And we must be our own worst (and best) critics: if something seems too good to be true, it probably is.