Best Practices

[Comic: Best Practices]

This comic is inspired, not by real interactions I’ve had with developers (no developer has ever volunteered to get within 20 paces of my code), but rather by online discussions about the importance of ‘proper’ coding. Here’s a comic from xkcd that makes a different point:

My reaction to this, as a bench biology-trained computational biologist who has never taken a computer programming class, is “who cares?” If it works, really, who cares?

Sure, there are very good reasons for standard programming practices, standards, and clean, efficient code, even in bioinformatics (or especially so). But those reasons apply almost exclusively to approaches you already have quite a bit of experience with: you’ve worked out the bugs, figured out how the code behaves with the underlying data, and made sure it’s actually useful in terms of the biology. At least 75% of my job is the other kind of work: trying out and discarding many approaches for any particular problem I’m working on. It’s important to have a record of these attempts, but this code doesn’t have to be clean or efficient. There are exceptions; if you have code that takes a loooong time to run even once, you probably want to make it as efficient as you can. But for the vast majority of the things I do, even with large amounts of data, I can determine whether they’re working in a reasonable amount of time using inefficient code (anything written in R, for example).

The other part, where good coding is important, is when you want the code to be usable by other people. This is an incredibly important part of computational biology and I’m not trying to downplay its importance here. This is when you’re relatively certain that the code will be looked at and/or used by other people in your own group and when you publish or release the code to a wider audience.

For further reading on this subject, here’s a post from Byte Size Biology that covers some great ideas for writing *research* code. And here is some dissenting opinion from Living in an Ivory Basement touting the importance of good programming practices (note: I don’t disagree, but I do believe that at least 75% of the coding I do shouldn’t have such a high bar- it’s not necessary and I’d never get anything done). Finally, here are some of my thoughts on how coding really follows the scientific method.

Another word about balance

[4/17/2015 updated: A reader pointed out that my formulae for specificity and accuracy contained errors. It turns out that both measures were being calculated correctly, just a typing error on the blog. I’ve corrected them below.] 

TL;DR summary

Evaluating a binary classifier based on an artificial balance of positive and negative examples (which is commonly done in this field) can cause underestimation of the method’s accuracy but vast overestimation of its positive predictive value (PPV). Since PPV is likely the only metric that really matters to one particularly important kind of end user (the biologist wanting to find a couple of novel positive examples in the lab based on your predictions), this is a potentially very big problem with how performance is reported.

The long version

Previously I wrote a post about the importance of having a naturally balanced set of positive and negative examples when evaluating the performance of a binary classifier produced by machine learning methods. I’ve continued to think about this problem and realized that I didn’t have a very good handle on what kinds of effects artificially balanced sets would have on performance. Though the metrics I’m using are very simple, I felt it would be worthwhile to demonstrate the effects, so I did a simple simulation (a rough code sketch follows the list below):

  1. I produced random prediction sets with a fixed proportion of positives predicted correctly (85%) and a fixed proportion of negatives predicted correctly (95%).
  2. The ‘naturally’ occurring ratio of positive to negative examples could be varied but for the figures below I used 1:100.
  3. I varied the ratio of positive to negative examples used to estimate performance and
  4. Calculated several commonly used measures of performance:
    1. Accuracy ((TP+TN)/(TP+FP+TN+FN); that is, the percentage of predictions, positive or negative, that are correct relative to the total number of predictions)
    2. Specificity (TN/(TN+FP); that is, the percentage of negative examples that are correctly predicted as negative relative to the total number of negative examples)
    3. AUC (area under the receiver operating characteristic curve; a summary metric that is commonly used in classification to evaluate performance)
    4. Positive predictive value (TP/(TP+FP); that is, out of all positive predictions what percentage are correct)
    5. False discovery rate (FDR; 1-PPV; percentage of positive predictions that are wrong)
  5. Repeated these calculations with 20 different random prediction sets
  6. Plotted the results as box plots, which summarize the median (dark line in the middle), the interquartile range (the box), and whiskers extending up to 1.5 times the interquartile range from the box- dots above or below fall outside this range.

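For the curious, here’s a minimal sketch in R of the kind of simulation described above. This isn’t my original code; the 85% sensitivity, 95% specificity, and the 1:1 and 1:100 ratios come from the list above, the set sizes and function name are arbitrary, and AUC is left out since it needs continuous prediction scores rather than binary calls.

```r
# Minimal sketch of the simulation described above (not the original code).
# Sensitivity (0.85), specificity (0.95), and the ratios are taken from the
# list; the set sizes are arbitrary.
set.seed(42)

simulate_metrics <- function(n_pos, n_neg, sens = 0.85, spec = 0.95) {
  tp <- rbinom(1, n_pos, sens)   # positives predicted correctly
  fn <- n_pos - tp
  tn <- rbinom(1, n_neg, spec)   # negatives predicted correctly
  fp <- n_neg - tn
  c(accuracy    = (tp + tn) / (tp + fp + tn + fn),
    specificity = tn / (tn + fp),
    ppv         = tp / (tp + fp),
    fdr         = fp / (tp + fp))
}

# 20 random prediction sets each, for an artificially balanced 1:1 split and
# the 'natural' 1:100 ratio of positives to negatives.
balanced <- replicate(20, simulate_metrics(n_pos = 1000, n_neg = 1000))
natural  <- replicate(20, simulate_metrics(n_pos = 1000, n_neg = 100000))

round(rowMeans(balanced), 3)
round(rowMeans(natural), 3)

# Box plots of any one metric across the 20 replicates, e.g. PPV:
boxplot(balanced["ppv", ], natural["ppv", ],
        names = c("1:1", "1:100"), ylab = "PPV")
```
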
The results are not surprising but do demonstrate the pitfalls of using artificially balanced data sets. Keep in mind that there are many publications that limit their training and evaluation datasets to a 1:1 ratio of positive to negative examples.

Accuracy

Accuracy estimates are actually worse than they should be for the artificial splits because fewer of the negative results are being considered.

Specificity

Specificity stays largely the same and is a good estimate because it isn’t affected by the ratio of negatives to positive examples. Sensitivity (the same measure but for positive examples) also doesn’t change for the same reason.

AUC

Happily the AUC doesn’t actually change that much- mostly it’s just much more variable with smaller ratios of negatives to positives. So an AUC from a 1:1 split should be considered to be in the right ballpark, but maybe off from the real value by a bit.

Positive predictive value (PPV)

Aaaand there’s where things go to hell.

False discovery rate (FDR)

Same thing here. The FDR is extremely high (>90%) in the real dataset, but the artificial balanced sets vastly underestimate it.

Why is this a problem?

The last two plots, PPV and FDR, are where the real trouble is. The problem is that the artificial splits vastly overestimate PPV and underestimate FDR (note that the Y axis scale on these plots runs from 0 to close to 1). Why is this important? Because, in general, PPV is what an end user is likely to be concerned about. I’m thinking of the end user who wants to use your great new method for predicting that proteins are members of some very important functional class. They will then apply your method to their own examples (say, their newly sequenced bacteria) and rank the positive predictions. They couldn’t care less about the negative predictions because that’s not what they’re interested in. So they take the top few predictions to the lab (they can’t afford to do hundreds; only the best few, say 5, predictions) and experimentally validate them.

If your method’s PPV is actually 95%, it’s fairly likely that all 5 of their predictions will pan out (it’s NEVER really as likely as that due to all kinds of factors, but for the sake of argument), making them very happy and allowing the poor grad student whose project it is to actually graduate.

However, the actual PPV from the example above is about 5%. This means that the poor grad student who slaves for weeks over experiments to validate at least ONE of your stinking predictions will probably end up empty-handed for their efforts and will have to spend another 3 years struggling to get their project to the point of graduation.

Given a large enough ratio in the real dataset (e.g. protein-protein interactions, where the number of positive examples is somewhere around 50-100k in human but the number of negatives is somewhere around 4.5 x 10^8, a ratio of ~1:10000) the real PPV can fall to essentially 0, whereas the artificially estimated PPV can stay very high.
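
To see how quickly this happens, here’s a quick back-of-the-envelope sketch. The function name is mine, and the 85%/95% sensitivity and specificity values are carried over from the simulation above, so the exact numbers are only illustrative.

```r
# PPV as a function of the number of negatives per positive, for fixed
# sensitivity and specificity (values assumed from the simulation above):
# PPV = sens / (sens + (1 - spec) * negatives_per_positive)
ppv_at_ratio <- function(neg_per_pos, sens = 0.85, spec = 0.95) {
  sens / (sens + (1 - spec) * neg_per_pos)
}

ppv_at_ratio(c(1, 100, 10000))
# roughly 0.94 at 1:1, 0.15 at 1:100, and 0.002 at 1:10000
```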

So, don’t be that bioinformatician who publishes the paper with performance results based on a vastly artificial balance of positive versus negative examples that ruins some poor graduate student’s life down the road.

How to know when to let go

This post is a story in two parts. The first part is about the most in-depth peer review I think I’ve ever gotten. The second deals with making the decision to pull the plug on a project.

Part 1: In which Reviewer 3 is very thorough, and right.

Sometimes Reviewer 3 (that anonymous peer reviewer who consistently causes problems) is right on the money. To extend some of the work I’ve done to predict problematic functions of proteins, I started a new effort about two years ago now. It went really slowly at first and I’ve never put a huge amount of effort into it, but I thought it had real promise. Essentially it was based on gathering up examples of a functional family that could be used in a machine learning-type approach.

[Image: “At last we meet, Reviewer 3, if that is indeed your real name”]

The functional family (in this case I chose E3 ubiquitin ligases) is problematic in that there are functionally similar proteins that show little or no sequence similarity by traditional BLAST search. Anyway, using a somewhat innovative approach we developed a model that could predict these kinds of proteins (which are important bacterial virulence effectors) pretty well (much better than BLAST). We wrote up the results and sent it off for an easy publication.

Of course, that’s not the end of the story. The paper was rejected from PLoS One, based on in-depth comments from reviewer 1 (hereafter referred to as Reviewer 3, because). As part of the paper submission we had included supplemental data, enough to replicate our findings, as should be the case*. Generally this kind of data isn’t scrutinized very closely (if at all) by reviewers. This case was different. Reviewer 3 is a functional prediction researcher of some kind (anonymous, so I don’t really know) and their lab is set up to look at these kinds of problems- though probably not from the bacterial pathogenesis angle judging from a few of the comments. So Reviewer 3’s critique can be summed up in their own words:

I see the presented paper as a typical example of computational “solutions” (often based on machine-learning) that produce reasonable numbers on artificial test data, but completely fail in solving the underlying biologic problem in real science.

Ouch. Harsh. And partly true. They are actually wrong about that point from one angle (the work solves a real problem- see Part 2, below) but right from another angle (that problem had apparently already been solved, at least practically speaking). They went on, “my workgroup performed a small experiment to show that a simple classifier based on sequence similarity and protein domains can perform at least as well as <my method> for the envisioned task.” In the review they then presented an analysis they did on my supplemental data, in which they simply searched for existing Pfam domains that were associated with ubiquitin ligase function. Their analysis, indeed, shows that just searching for these four known domains could predict this function as well as or better than my method. This is interesting because it’s the first time I can remember a reviewer going into the supplemental data to do an analysis for the review. This is not a problem at all- in fact, it’s a very good thing. Although I’m disappointed to have my paper rejected, I was happy that a knowledgeable and thorough peer reviewer had done due diligence and exposed this somewhat gaping hole in my approach/results. It’s worth noting that the other reviewer identified himself and was very knowledgeable and favorable to the paper- he just missed this point because it’s fairly specific and, as I detail below, wrong in a particular kind of way.

So, that’s it right? Game over. Take my toys and go home (or to another more pressing project). Well, maybe or maybe not.

Part 2: In which I take a close look at Reviewer 3’s points and try to rescue my paper

One of the hardest things to learn is how to leave something that you’ve put considerable investment into and move on to more productive pastures. This is true in relationships, investments, and, certainly, academia. I don’t want to just drop this two-year project (albeit not two solid years) without taking a close look to see if there’s something I can do to rescue it. Without going into the details of the specific points Reviewer 3 made, I’ll tell you about my thought process on this topic.

So, first. One real problem here is that the Pfam models Reviewer 3 used were constructed from the examples I was using. That means that their approach is circular: the Pfam model can identify the examples of E3 ubiquitin ligases because it was built from those same examples. They note that four different Pfam models can describe most of the examples I used. From the analysis that I did in the paper and then again following Reviewer 3’s comments, I found that these models do not cross-predict, whereas my model does. That is, my single model can make the same predictions as these four different individual models. These facts both mean that Reviewer 3’s critique is not exactly on the mark- my method does some good stuff that Pfam/BLAST can’t do. Unfortunately, neither of these facts makes my method any more practically useful. That is, if you want to predict E3 ubiquitin ligase function you can use Pfam domains to do so.

Which leads me to the second point of possible rescue. Reviewer 3’s analysis, and my subsequent re-analysis to check that they were correct, identified around 30 proteins that are known ubiquitin ligases but which do not have one of these four Pfam domains. These are false negative predictions by the Pfam method. Using my method these are all predicted to be ubiquitin ligases with pretty good accuracy. This, then, is a definite point in favor of my method: it can correctly identify those known ligases that don’t have known domains. There! I have something useful that I can publish, right? Well, not so fast. I was interested in seeing what Pfam domains might be in those proteins other than the four ligase domains, so I looked more closely. Unfortunately what I found was that these proteins all had a couple of other domains that were specific to E3 ubiquitin ligases but that Reviewer 3 didn’t notice. Sigh. So that means that all the examples in my E3 ubiquitin ligase dataset can be correctly identified by around 6 Pfam domains, again rendering my method essentially useless, though not incorrect. It is worth noting that it is certainly possible that my method would be much better at identifying new E3 ligases that don’t fall into these 6 ‘families’ – but I don’t have any such examples, so I don’t really know and can’t demonstrate this in the paper.

So where does this leave me? I have a method that is sound but solves a problem that may not have needed to be solved (as Reviewer 3 pointed out, sort of). I would very much like to publish this paper since I, and several other people, have spent a fair amount of time on it. But I’m left a bit empty-handed. Here are the three paths I can see to publication:

  1. Experimental validation. I make some novel predictions with my method and then have a collaborator validate them. Great idea but this would take a lot of time and effort and luck to pull off. Of course, if it worked it would demonstrate the method’s utility very solidly. Not going to happen right now I think.
  2. Biological insight. I make some novel observations given my model that point out interesting biology underpinning bacterial/viral E3 ubiquitin ligases. This might be possible, and I have a little bit of it in the paper already. However, I think I’d need something solid and maybe experimentally validated to really push this forward.
  3. Another function. Demonstrate that the general approach works on another functional group- one that actually is a good target for this kind of thing. This is something I think I have (another functional group) and I just need to do some checking to really make sure first (like running Pfam on it, duh.) I can then leave the ubiquitin ligase stuff in there as my development example and then apply the method to this ‘real’ problem. This is most likely what I’ll do here (assuming that the new example function I have actually is a good one) since it requires the least amount of work.

So, full disclosure: I didn’t know when I started writing this post this morning what I was going to do with this paper and had pretty much written it off. But now I’m thinking that there may be a relatively easy path to publication with option 3 above. If my new example doesn’t pan out I may very well have to completely abandon this project and move on. But if it does work then I’ll have a nice little story requiring a minimum of (extra) effort.

As a punchline to this story- I’ve written a grant using this project as a fairly key piece of preliminary data. That grant is being reviewed today- as I write. As I described above, there’s nothing wrong with the method- and it actually fits nicely (still) to demonstrate what I needed it to for the grant. However, if the grant is funded then I’ll have actual real money to work on this and that will open up other options for this project. Here’s hoping. If the grant is funded I’ve decided I’ll write regular blog posts to cover it, hopefully going from start to (successfully renewed) finish on my first R01. So, again, here’s hoping.

*Supplemental data in a scientific manuscript is the figures, tables, and other kinds of data files that either can’t be included in the main text because of size (no one wants to read 20 pages of gene listings in a paper- though I have seen stuff like this) or because the information is felt to be non-central to the main story and better left for more interested readers.