This post is a story in two parts. The first part is about the most in-depth peer review I think I’ve ever gotten. The second deals with making the decision to pull the plug on a project.
Part 1: In which Reviewer 3 is very thorough, and right.
Sometimes Reviewer 3 (that anonymous peer reviewer who consistently causes problems) is right on the money. To extend some of the work I’ve done to predict problematic functions of proteins I started a new effort about 2 years ago now. It went really slowly at first and I’ve never put a huge amount of effort into it, but I thought it had real promise. Essentially it was based on gathering up examples of a functional family that could be used in a machine learning-type approach. The functional family (in this case I chose E3 ubiquitin ligases) is problematic in that there are functionally similar proteins that show little or no sequence similarity by traditional BLAST search. Anyway, using a somewhat innovative approach we developed a model that could predict these kinds of proteins (which are important bacterial virulence effectors) pretty well (much better than BLAST). We wrote up the results and sent it off for an easy publication.
“At last we meet, reviewer 3, if that is indeed your real name”
Of course, that’s not the end of the story. The paper was rejected from PLoS One, based on in-depth comments from reviewer 1 (hereafter referred to as Reviewer 3, because). As part of the paper submission we had included supplemental data, enough to replicate our findings, as should be the case*. Generally this kind of data isn’t scrutinized very closely (if at all) by reviewers. This case was different. Reviewer 3 is a functional prediction researcher of some kind (anonymous, so I don’t really know) and their lab is set up to look at these kinds of problems- though probably not from the bacterial pathogenesis angle judging from a few of the comments. So Reviewer 3’s critique can be summed up in their own words:
I see the presented paper as a typical example of computational “solutions” (often based on machine-learning) that produce reasonable numbers on artificial test data, but completely fail in solving the underlying biologic problem in real science.
Ouch. Harsh. And partly true. They are actually wrong about that point from one angle (the work solves a real problem- see Part 2, below) but right from another angle (that problem had apparently already been solved, at least practically speaking). They went on, “my workgroup performed a small experiment to show that a simple classifier based on sequence similarity and protein domains can perform at least as well as <my method> for the envisioned task.” In the review they then present an analysis they did on my supplemental data in which they simply searched for existing Pfam domains that were associated with ubiquitin ligase function. Their analysis, indeed, shows that just searching for these four known domains could predict this function as well as or better than my method. This is interesting because it’s the first time that I can remember where a reviewer has gone into the supplemental data to do an analysis for the review. This is not a problem at all- in fact, it’s a very good thing. Although I’m disappointed to have my paper rejected I was happy that a knowledgeable and thorough peer reviewer had done due diligence and exposed this somewhat gaping hole in my approach/results. It’s worth noting that the other reviewer identified himself, was very knowledgeable, and was favorable to the paper- he just missed this point, which is fairly specific (and also wrong, at least in a particular kind of way that I detail below).
So, that’s it right? Game over. Take my toys and go home (or to another more pressing project). Well, maybe or maybe not.
Part 2: In which I take a close look at Reviewer 3’s points and try to rescue my paper
One of the hardest things to learn is how to leave something that you’ve put considerable investment into and move on to more productive pastures. This is true in relationships, investments, and, certainly, academia. I don’t want to just drop this two-year project (albeit, not two solid years) without taking a close look to see if there’s something I can do to rescue it. Without going into the details of the specific points Reviewer 3 made, I’ll tell you about my thought process on this topic.
So, first. One real problem here is that the Pfam models Reviewer 3 used were constructed from the examples I was using. That means that their approach is circular: the Pfam model can identify the examples of E3 ubiquitin ligases because it was built from those same examples. They note that four different Pfam models can describe most of the examples I used. From the analysis that I did in the paper and then again following Reviewer 3’s comments, I found that these models do not cross-predict, whereas my model does. That is, my single model can predict the same as these four different individual models. These facts both mean that Reviewer 3’s critique is not exactly on the mark- my method does some good stuff that Pfam/BLAST can’t do. Unfortunately, neither of these facts makes my method any more practically useful. That is, if you want to predict E3 ubiquitin ligase function you can use Pfam domains to do so.
Which leads me to the second point of possible rescue. Reviewer 3’s analysis, and my subsequent re-analysis to make sure they were correct, identified around 30 proteins that are known ubiquitin ligases but which do not have one of these four Pfam domains. These are false negative predictions by the Pfam method. Using my method these are all predicted to be ubiquitin ligases with pretty good accuracy. This is a definite point in favor of my method, then: it can correctly identify those known ligases that don’t have known domains. There! I have something useful that I can publish, right? Well, not so fast. I was interested in seeing what Pfam domains might be in those proteins other than the four ligase domains so I looked more closely. Unfortunately what I found was that these proteins all had a couple of other domains that were specific to E3 ubiquitin ligases but that Reviewer 3 didn’t notice. Sigh. So that means that all the examples in my E3 ubiquitin ligase dataset can be correctly identified by around 6 Pfam domains, again rendering my method essentially useless, though not incorrect. It is worth noting that it is certainly possible that my method would be much better at identification of new E3 ligases that don’t fall into these 6 ‘families’ – but I don’t have any such examples, so I don’t really know and can’t demonstrate this in the paper.
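For anyone who wants to run this kind of sanity check on their own dataset, the domain-lookup baseline and the follow-up look at the missed proteins can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the protein names and domain labels are made up (they are not real Pfam accessions), the function names are hypothetical, and it assumes you already have a table of which domains were found in each protein. It is not the analysis from the paper or the review.

```python
from collections import Counter


def domain_baseline(protein_domains, ligase_domains):
    """Call a protein a ligase if it carries at least one ligase-associated domain."""
    return {
        protein: bool(domains & ligase_domains)
        for protein, domains in protein_domains.items()
    }


def false_negatives(predictions, known_ligases):
    """Known ligases that the domain baseline failed to flag."""
    return sorted(p for p in known_ligases if not predictions.get(p, False))


# Made-up example data: placeholder labels, not real Pfam domains.
ligase_domains = {"DOMAIN_A", "DOMAIN_B"}
protein_domains = {
    "prot1": {"DOMAIN_A"},
    "prot2": {"DOMAIN_X"},
    "prot3": {"DOMAIN_B", "DOMAIN_X"},
    "prot4": set(),
}
known_ligases = {"prot1", "prot2", "prot3"}

predictions = domain_baseline(protein_domains, ligase_domains)
missed = false_negatives(predictions, known_ligases)

# Which domains do the missed ligases carry? Anything recurring here is a
# candidate for a ligase-specific domain that the baseline list left out.
other_domains = Counter(d for p in missed for d in protein_domains[p])

print(missed)
print(other_domains.most_common())
```

The last step is the one that matters for the story above: if the missed proteins turn out to share a few extra domains, the baseline list was simply incomplete, and a longer list of domains covers the dataset.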
So where does this leave me? I have a method that is sound, but solves a problem that may not have needed solving (as Reviewer 3 pointed out, sort of). I would very much like to publish this paper since I, and several other people, have spent a fair amount of time on it. But I’m left a bit empty-handed. Here are the three paths I can see to publication:
- Experimental validation. I make some novel predictions with my method and then have a collaborator validate them. Great idea but this would take a lot of time and effort and luck to pull off. Of course, if it worked it would demonstrate the method’s utility very solidly. Not going to happen right now I think.
- Biological insight. I make some novel observations given my model that point out interesting biology underpinning bacterial/viral E3 ubiquitin ligases. This might be possible, and I have a little bit of it in the paper already. However, I think I’d need something solid and maybe experimentally validated to really push this forward.
- Another function. Demonstrate that the general approach works on another functional group- one that actually is a good target for this kind of thing. This is something I think I have (another functional group) and I just need to do some checking to really make sure first (like running Pfam on it, duh.) I can then leave the ubiquitin ligase stuff in there as my development example and then apply the method to this ‘real’ problem. This is most likely what I’ll do here (assuming that the new example function I have actually is a good one) since it requires the least amount of work.
So, full disclosure: I didn’t know when I started writing this post this morning what I was going to do with this paper and had pretty much written it off. But now I’m thinking that there may be a relatively easy path to publication with option 3 above. If my new example doesn’t pan out I may very well have to completely abandon this project and move on. But if it does work then I’ll have a nice little story requiring a minimum of (extra) effort.
As a punchline to this story- I’ve written a grant using this project as a fairly key piece of preliminary data. That grant is being reviewed today- as I write. As I described above, there’s nothing wrong with the method- and it actually fits nicely (still) to demonstrate what I needed it to for the grant. However, if the grant is funded then I’ll have actual real money to work on this and that will open other options up for this project. Here’s hoping. If the grant is funded I’ve decided I’ll establish a regular blog post to cover it, hopefully going from start to (successfully renewed) finish on my first R01. So, again, here’s hoping.
*Supplemental data in a scientific manuscript is the figures, tables, and other kinds of data files that either can’t be included in the main text because of size (no one wants to read 20 pages of gene listings in a paper- though I have seen stuff like this) or because the information is felt to be non-central to the main story and better left for more interested readers.