So you toil for 4+ years in graduate school, 4+ years as a post-doc, land your first academic gig. Now you get to do all this awesome science, right? Well, sorta…
So I got this comment from a reviewer on one of my grants:
The use of the term “hypothesis” throughout this application is confusing. In research, hypotheses pertain to phenomena that can be empirically observed. Observation can then validate or refute a hypothesis. The hypotheses in this application pertain to models not to actual phenomena. Of course the PI may hypothesize that his models will work, but that is not hypothesis-driven research.
There are a lot of things I can say about this statement, which really rankles. As a thought experiment replace all occurrences of the word “model” with “Western blot” in the above comment. Does the comment still hold?
At this point it may be informative to get some definitions, keeping in mind that the _working_ definitions in science can have somewhat different connotations.
Hypothesis: a supposition or proposed explanation made on the basis of limited evidence as a starting point for further investigation.
This definition says nothing about empirical observation- and I would argue that it would be fairly widely accepted in biological sciences research, though the underpinning of the reviewer's comment- empirically observed phenomena- is probably in the minds of many biologists.
So then, also from Google:
Empirical: based on, concerned with, or verifiable by observation or experience rather than theory or pure logic.
Here’s where the real meat of the discussion is. Empirical evidence is based on observation or experience as opposed to being based on theory or pure logic. It’s important to understand that the “models” being referred to in my grant are machine learning statistical models that have been derived from sequence data (that is, observation).
I would argue that including some theory or logic in a model that’s based on observation is exactly what science is about- this is what the basis of a hypothesis IS. All the hypotheses considered in my proposal were based on empirical observation, filtered through some form of logic/theory (if X is true then it’s reasonable to conclude Y), and would be tested by returning to empirical observations (either of protein sequences or experimentation at the actual lab bench).
I believe that the reviewer was confused by the use of statistics, which is a largely empirical endeavor (based on the observation of data, though filtered through theory), and of computation, which they do not see as empirical. Back to my original thought experiment: a lot of assumptions, theory, and logic goes into the interpretation of a Western blot- or any other common lab experiment. However, this does not mean that we can't use them to formulate further hypotheses.
This debate is really fundamental to my scientific identity. I am a biologist who uses computers (algorithms, visualization, statistics, machine learning and more) to do biology. If the reviewer is correct, then I’m pretty much out of a job I guess. Or I have to settle back on “data analyst” as a job title (which is certainly a good part of my job, but not the core of it).
So I’d appreciate feedback and discussion on this. I’m interested to hear what other people think about this point.
Last spring I posted about a proposal I’d put in where I’d published the key piece of preliminary data in F1000 Research, a journal that offers post-publication peer review.
The idea was that I could get my paper published (it’s available here) and accessible to reviewers prior to submission of my grant. It could then be peer-reviewed and I could address the revisions after that. This strategy was driven by the lag time between proposal submission and review for NIH, which is about 4 months. Also, it used to be possible to include papers that hadn’t been formally accepted by a journal as an appendix to NIH grants. This hasn’t been possible for some time now. But I figured this might be a pretty good way to get preliminary data out to the grant reviewers in a published form with quick turnaround. Or at least that you could utilize that lag time to also function as review time for your paper.
I was able to get my paper submitted to F1000 Research and obtained a DOI and URL that I could include as a citation in my grant. Details here.
The review for the grant was completed in early June of this year and the results were not what I had hoped- the grant wasn’t even scored, despite being totally awesome (of course, right?). But for this post I’ll focus on the parts that are pertinent to the “gambit”- the use of post-publication peer review as preliminary data.
The results here were mostly discouraging regarding post-publication peer review being used this way, which was disappointing. But let me briefly describe the timeline, which is important for understanding a large caveat about the results.
I received first-round reviews from two reviewers in a blindingly fast 10 and 16 days after initial submission. Both were encouraging, but had some substantial (and substantially helpful) requests. You can read them here and here. It took me longer than it could have to address these completely- though I did some new analysis and added additional explanation to several important points. I then resubmitted around May 12th. However, due to some kind of issue the revised version wasn't made available by F1000 Research until May 29th. Given that the NIH review panel met in the first week of June, it is likely that the reviewers didn't see the revised (and much improved) version. The reviewers then got back final comments in early June (again- blindingly fast). You can read those here and here. The paper was accepted/approved/indexed in mid-June.
The grant had comments from three reviewers and each had something to say about the paper as preliminary data.
The first reviewer had the most negative comments.
It is not appropriate to point reviewers to a paper in order to save space in the proposal.
Alone this comment is pretty odd and makes me think that the reviewer was annoyed by the approach. So I can't refer to a paper as preliminary data? On the face of it this is absolutely ridiculous. Science, and the accumulation of scientific knowledge, just doesn't work in a way that allows you to include all your preliminary data completely (as well as your research approach and everything else) in the space of a 12-page grant. However, their further comments (which directly follow this one) shed some light on their thinking.
The PILGram approach should have been described in sufficient detail in the proposal to allow us to adequately assess it. The space currently used to lecture us on generative models could have been better used to actually provide details about the methods being developed.
So reading between the (somewhat grumpy) lines I think they mean to say that I should have done a better job of presenting some important details in the text itself. But my guess is that the first reviewer was not thrilled by the prospect of using a post-publication peer reviewed paper as preliminary data for the grant. Not thrilled.
- Reviewer 1: Thumbs down.
Second reviewer’s comment.
The investigators revised the proposal according to prior reviews and included further details about the method in the form of a recently ‘published’ paper (the quotes are due to the fact that the paper was submitted to a journal that accepts and posts submissions even after peer review – F1000 Research). The public reviewers’ comments on the paper itself raise several concerns with the method proposed and whether it actually works sufficiently well.
This comment, unfortunately, is likely due to the timeline I presented above. I think they saw the first version of the paper, read the paper comments, and figured that there were holes in the whole approach. Even if my revisions had been available it seems like there still would have been issues, unless I had already gotten the final approval for the paper.
- Reviewer 2: Thumbs down- although maybe not with the annoyed thrusting motions that the first reviewer was presumably making.
Finally, the third reviewer (contrary to scientific lore) was the most gentle.
A recent publication is suggested by the PI as a source of details, but there aren't many in that manuscript either.
I’m a little puzzled about this since the paper is pretty comprehensive. But maybe this is an effect of reading the first version, not the final version. So I would call this neutral on the approach.
- Reviewer 3: No decision.
The takeaway from this gambit is mixed.
I think if it had been executed better (by me) I could have gotten the final approval through by the time the grant reviewers were looking at it and then a lot of the hesitation and negative feelings would have gone away. Of course, this would be dependent on having paper reviewers that were as quick as those that I got- which certainly isn’t a sure thing.
I think that the views of biologists on preprints, post-publication review, and other ‘alternative’ publishing options are changing. Hopefully more biologists will start using these methods- because, frankly, in a lot of cases they make a lot more sense than traditional closed-access, non-transparent peer review.
However, the field can be slow to change. I will probably try this, or something like this, again. Honestly, what do I have to lose exactly? Overall, this was a positive experience and one where I believe I was able to make a contribution to science. I just hope my next grant is a better substrate for this kind of experiment.
Other posts on this process:
- My original Proposal gambit post
- Proposal gambit – update 1
- A press release from F1000 Research about this paper
- The published paper “Prediction of multi-drug resistance transporters using a novel sequence analysis method”
- The GitHub repository with code
- My post giving an overview of the paper, along with a fun infographic
People worry about a lot of different things. I'm no different. It's also true that people are very poor at statistical reasoning, and thus at risk assessment. Add on top of this that we now get nearly instantaneous news about events happening in regions that contain thousands, or millions, or billions of people.
So, when several swimmers are attacked by sharks in a short period of time, for example, this is reported at a national or international level *because* these are notable and *very rare* occurrences. In general people worry about being attacked by sharks when swimming in the ocean (it crosses my mind every time I take a swim). The truth is that you're more likely to be struck by lightning and killed (51 deaths per year in the US) than to be bitten by a shark while swimming (14 attacks per year, one death every two years). This is to say nothing of the dangers of driving (or riding) to the beach to get in the water. Yikes!
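A quick back-of-the-envelope comparison of those figures (assuming a round US population of about 320 million- my number, purely for illustration):

```python
# Rough annual per-person death risk in the US, using the figures quoted above.
# The US population (~320 million) is an assumed round number for illustration.
US_POPULATION = 320_000_000

lightning_deaths_per_year = 51
shark_deaths_per_year = 0.5  # roughly one death every two years

lightning_risk = lightning_deaths_per_year / US_POPULATION
shark_risk = shark_deaths_per_year / US_POPULATION

print(f"Lightning: ~1 death per {US_POPULATION / lightning_deaths_per_year:,.0f} people per year")
print(f"Sharks:    ~1 death per {US_POPULATION / shark_deaths_per_year:,.0f} people per year")
print(f"Lightning is ~{lightning_risk / shark_risk:.0f}x more deadly")
```

By these numbers, lightning is roughly a hundred times more likely to kill you in a given year than a shark- and neither comes anywhere near heart disease or car crashes.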
If the news actually represented statistical risks appropriately it would look very different. Actually- unlike my comic below, it wouldn't be silence. It would be filled with heart disease, car crashes, ladder falls, and that sort of thing. It'd be boring and no one would watch. Show us the SHARK ATTACKS!!!
I even wrote a song about this (maybe I’ll post a video if I get brave). But, to be clear, it doesn’t (yet) include sharks.
I’ve been thinking lately about how events in your academic life can lead to unintended, and often times unrecognized, downstream effects. Recently I realized that I’m having trouble putting together a couple of papers that I’m supposed to be leading. After some reflection I came to the conclusion that at least one reason is I’ve been affected by the long, tortuous, and somewhat degrading process of trying to get a large and rather important paper published. This paper has been in the works, and through multiple submission/revision cycles, for around five years. And it starts to really wear on your academic psyche after that time, though it can be hard to recognize. I think that my failure to get that paper published (so far) is partly holding me back on putting together these other papers. Partly this is about the continuing and varied forms of rejection you experience in this process, but partly it’s about the fact that there’s something sitting there that shouldn’t be sitting there. Even though I don’t currently have any active tasks that I have to complete for that problem paper it still weighs on me.
The silver lining is that once I recognized that this was a factor things started to seem easier with those projects and the story I was trying to tell. Anyway, I think we as academics should have our own therapists that specialize in problems such as this. It would be very helpful.
This comic is inspired, not by real interactions I’ve had with developers (no developer has ever volunteered to get within 20 paces of my code), but rather by discussions online on the importance of ‘proper’ coding. Here’s a comic from xkcd which has a different point:
My reaction to this– as a bench biology-trained computational biologist who has never taken a computer programming class– is “who cares?” If it works, really, who cares?
Sure, there are very good reasons for standard programming practices, standards, and clean, efficient code. Even in bioinformatics (or especially so). But these apply almost exclusively to approaches you've already spent quite a bit of time with: working out the bugs, figuring out how the code behaves with the underlying data, making sure it's actually useful in terms of the biology. That exploratory work is at least 75% of my job. I try, and discard, many approaches for any particular problem I'm working on. It's important to have a record of these attempts, but that code doesn't have to be clean or efficient. There are exceptions- when you have code that takes a loooong time to run even once, you probably want to make it as efficient as you can. But for the vast majority of what I do- even with large amounts of data- I can determine whether it's working in a reasonable amount of time using inefficient code (anything written in R, for example).
The other part, where good coding is important, is when you want the code to be usable by other people. This is an incredibly important part of computational biology and I’m not trying to downplay its importance here. This is when you’re relatively certain that the code will be looked at and/or used by other people in your own group and when you publish or release the code to a wider audience.
For further reading on this subject, here's a post from Byte Size Biology that covers some great ideas for writing *research* code. And here is some dissenting opinion from Living in an Ivory Basement touting the importance of good programming practices (note- I don't disagree, but I do believe that at least 75% of the coding I do shouldn't have such a high bar- it's not necessary and I'd never get anything done). Finally, here are some of my thoughts on how coding really follows the scientific method.
I’ve been fascinated with the idea of investment, and how it can color your thoughts, feelings, and opinions about something. Not the monetary sense of the word (though probably that too) but the emotional and intellectual sense of the word. If you’ve ever been in a bad relationship you might have fallen prey to this reasoning- “I’m in this relationship and I’m not getting out because reasons so admitting that’s it’s absolutely terrible for me is unthinkable so I’m going to pretend like it’s not and I’m going to believe that it’s not and I’m going to tell everyone that I’m doing great”. I really believe this can be a motivating factor for a big chunk of human behavior.
And it’s certainly a problem in science. When you become too invested in an idea or an approach or a tool- that is, you’ve spent a considerable amount of time researching or promoting it- it can be very difficult to distance yourself from that thing and admit that you might have it wrong. That would be unthinkable.
Sometimes this investment pitfall is contagious. If you're working with others toward common goals on a project, the problem of investment can become more complicated. That is, if I've said something, and some amount of group effort has been put into the idea, but it turns out I was wrong about it, it can be difficult to raise that with the rest of the group. Though, I note, it really is imperative that it be raised. This becomes more difficult if the ideas or preliminary results you've put forward become part of the project- through presentations made by others, or through further investment of project resources to follow up on these leads.
I think this sometimes happens when you’re writing an early draft of a document- though the effect can be more subtle here. If you write words down and put out ideas that are generally sound and on-point it can be hard for you, or others who may edit the paper after you, to erase these. More importantly a first draft, no matter how preliminary or draft-y, can establish an organization that can be hard to break. Clearly if there are parts that really don’t work, or don’t fit, or aren’t true, they can be removed fairly easily. The bigger problems lie in those parts that are *pretty good*. I’ve looked back at my own preliminary drafts and realized (after a whole lot of work trying to get things to fit) that the initial overall organization was somehow wrong- and that I really need to rip it all apart and start over, at least in terms of the organization. I’ve also seen this in other people’s work, where something just doesn’t seem right about a paper, but I really can’t place my finger on what- at least not without a bunch of effort.
Does this mean that you should very carefully plan out your preliminary drafts? Not at all. That’s essentially the route to complete gridlock and non-productivity. Rather, you should be aware of this problem and be willing to be flexible. Realize that what you put down on the paper for the first draft (or early versions of analysis) is subject to change- and make others you are working with aware of this explicitly (simply labeling something as “preliminary analysis” or “rough draft” isn’t explicit enough). And don’t be afraid to back away from it if it’s not working out. It’s much better if that happens earlier in the process than later- that is, it’s better to completely tear down a final draft of a paper than to have reviewers completely miss the point of what you’re trying to say after you’ve submitted it.
I posted a while back about encountering two vehicles with the same 3-letter code on their license plates as mine while driving to work one morning. Interestingly, in the following months I found myself paying more and more attention to license plates and saw at least 6-7 other vehicles in the area (a small three-city region with about 200K residents) with the same code.
Spooky. I started to feel like there was some kind of cosmological numerology going on in license plates around me that was trying to send me a message. BUT WHAT WAS IT?
A conclusion I drew from my thinking on the probability of that happening was that:
it is evident that there can be multiple underlying and often hidden explanatory variables that may be influencing such probabilities [from my post]
It was suggested that part of my noticing the plates could have been confirmation bias: I was looking for something, so I noticed that thing more often than normal against a pretty variable and unconnected background. I'm sure that's true. However, I was sitting in traffic one evening (yes, we do have *some* traffic around here) and saw three plates that started with the letters ARK in the space of about 5 minutes. Weird.
So THEN I started really looking at the plates around me and noticed a strong underlying variable that pretty much explains it all. But it’s kinda interesting. I first noticed that Washington state seems to have recently switched from three number-three letter plates to three letter-four number plates. I then noticed that the starting letters for both kinds of plates were in a narrow range, W-Z for the old plates and A-C for the new plates. There don’t seem to be *any* plates outside that range right now (surveying a couple of hundred plates over the last couple of days). W is really underrepresented as is C – the tails of the distribution. This makes me guess that there’s a rolling distribution with a window of about 6 letters for license plates (in the state of Washington, other states have other systems or are on a different pattern). This probably changes with time as people have to renew their plates, buy new vehicles and get rid of the old. So the effective size of the license plate universe I tried to calculate in my previous post is much smaller than what I was thinking.
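As a sketch, here's the rough arithmetic for how much a rolling-window scheme shrinks the plate universe. The window sizes and plate formats are my guesses from casual observation, not official Department of Licensing numbers:

```python
# Full universe if every plate pattern were in circulation (Washington style):
# old: 3 numbers + 3 letters; new: 3 letters + 4 numbers.
full_old = 10**3 * 26**3   # 17,576,000
full_new = 26**3 * 10**4   # 175,760,000

# Rolling-window hypothesis (my guess): only plates whose first letter falls
# in a narrow window are on the road at once (W-Z old, A-C new, per my survey).
window_old = 4  # starting letters W, X, Y, Z
window_new = 3  # starting letters A, B, C

effective_old = 10**3 * window_old * 26**2   # 2,704,000
effective_new = window_new * 26**2 * 10**4   # 20,280,000

shrinkage = (full_old + full_new) / (effective_old + effective_new)
print(f"Full universe:      {full_old + full_new:,} plates")
print(f"Effective universe: {effective_old + effective_new:,} plates (~{shrinkage:.0f}x smaller)")
```

Under these assumptions the effective universe is nearly an order of magnitude smaller than the naive count- which makes repeated letter-code sightings far less spooky.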
I don’t know why I find this so interesting but it really is. I know this is just some system that the Washington State Department of Licensing has and I could probably go to an office and just ask, but it seems like it’s a metaphor for larger problems of coincidence, underlying mechanisms, and science. I’m actually pretty satisfied with my findings, even though they won’t be published as a journal article (hey- you’re still reading, right?). On my way to pick up lunch today I noticed some more ARK plates (4) and these two sitting right next to each other (also 3 other ABG plates in other parts of the parking lot).
The universe IS trying to tell me something. It's about science, stupid.
I’m a big fan of peer review. Most of the revisions that reviewers suggest are very reasonable and sometimes really improve the manuscript. Other times it doesn’t seem to work that way. I’ve noticed this is especially true when the manuscript goes through multiple rounds of peer review at different journals. It can become a franken-paper, unloved by the very reviewers who made it.
[4/17/2015 updated: A reader pointed out that my formulae for specificity and accuracy contained errors. It turns out that both measures were being calculated correctly, just a typing error on the blog. I’ve corrected them below.]
The short version
Evaluating a binary classifier on an artificially balanced set of positive and negative examples (which is commonly done in this field) can cause underestimation of method accuracy but vast overestimation of the positive predictive value (PPV) of the method. Since PPV is likely the only metric that really matters to one important kind of end user- the biologist wanting to validate a couple of novel positive predictions in the lab- this is a potentially very big problem with reporting performance.
The long version
Previously I wrote a post about the importance of having a naturally balanced set of positive and negative examples when evaluating the performance of a binary classifier produced by machine learning methods. I’ve continued to think about this problem and realized that I didn’t have a very good handle on what kinds of effects artificially balanced sets would have on performance. Though the metrics I’m using are very simple I felt that it would be worthwhile to demonstrate the effects so did a simple simulation.
- I produced random prediction sets with a set portion of positives predicted correctly (85%) and a set portion of negatives predicted correctly (95%).
- The ‘naturally’ occurring ratio of positive to negative examples could be varied but for the figures below I used 1:100.
- I varied the ratio of positive to negative examples used to estimate performance and
- Calculated several commonly used measures of performance:
- Accuracy ((TP+TN)/(TP+FP+TN+FN); that is, the percentage of predictions, positive or negative, that are correct relative to the total number of predictions)
- Specificity (TN/(TN+FP); that is, the percentage of negative examples that are correctly predicted relative to the total number of negative examples)
- AUC (area under the receiver operating characteristic curve; a summary metric that is commonly used in classification to evaluate performance)
- Positive predictive value (TP/(TP+FP); that is, out of all positive predictions what percentage are correct)
- False discovery rate (FDR; 1-PPV; percentage of positive predictions that are wrong)
- Repeated these calculations with 20 different random prediction sets
- Plotted the results as box plots, which show the median (dark line in the middle), the interquartile range (the box), and whiskers extending to 1.5 times the interquartile range from the box- dots above or below the whiskers are outside this range.
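The steps above can be sketched roughly like this- a minimal reimplementation of my own (not the original code), assuming the 85%/95% rates listed:

```python
import random

def simulate(n_pos, n_neg, sens=0.85, spec=0.95, seed=None):
    """Simulate a classifier that gets sens of positives and spec of
    negatives right, then compute the metrics described above."""
    rng = random.Random(seed)
    tp = sum(rng.random() < sens for _ in range(n_pos))  # true positives
    tn = sum(rng.random() < spec for _ in range(n_neg))  # true negatives
    fp = n_neg - tn                                      # false positives
    return {
        "accuracy": (tp + tn) / (n_pos + n_neg),
        "specificity": tn / (tn + fp),
        "ppv": tp / (tp + fp),
        "fdr": fp / (tp + fp),
    }

# Artificially balanced 1:1 evaluation vs the 'natural' 1:100 ratio
balanced = simulate(1000, 1000, seed=42)
natural = simulate(1000, 100_000, seed=42)
print(f"1:1   PPV ~{balanced['ppv']:.2f}")  # close to 0.94
print(f"1:100 PPV ~{natural['ppv']:.2f}")   # close to 0.15
```

Accuracy and specificity barely move between the two splits, but PPV collapses once the evaluation set reflects the natural class ratio.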
The results are not surprising but do demonstrate the pitfalls of using artificially balanced data sets. Keep in mind that there are many publications that limit their training and evaluation datasets to a 1:1 ratio of positive to negative examples.
Positive predictive value (PPV)
False discovery rate (FDR)
Why is this a problem?
The last two plots, PPV and FDR, are where the real trouble is. The problem is that the artificial splits vastly overestimate PPV and underestimate FDR (note that the Y axis scale on these plots runs from 0 to close to 1). Why is this important? Because, in general, PPV is what an end user is likely to care about. I'm thinking of the end user who wants to use your great new method for predicting that proteins are members of some very important functional class. They will apply your method to their own examples (say, their newly sequenced bacterium) and rank the positive predictions. They couldn't care less about the negative predictions because that's not what they're interested in. So they take the top few predictions to the lab (they can't afford to do hundreds, only the best few, say 5, predictions) and experimentally validate them.
If your method’s PPV really is 95%, it’s fairly likely that all 5 of their predictions will pan out (it’s NEVER really as likely as that due to all kinds of factors, but for the sake of argument), making them very happy and allowing the poor grad student whose project it is to actually graduate.
However, the actual PPV from the example above is about 5%. This means that the poor grad student who slaves for weeks over experiments to validate at least ONE of your stinking predictions will probably end up empty-handed for their efforts and will have to spend another 3 years struggling to get their project to the point of graduation.
Given a large enough ratio in the real dataset (e.g. protein-protein interactions, where the number of positive examples in human is somewhere around 50-100K but the number of negatives is somewhere around 4.5×10^8, a ratio of ~1:10000) the real PPV can fall to essentially 0, whereas the artificially estimated PPV can stay very high.
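For intuition, this gap can be computed in closed form from sensitivity and specificity alone (here using the illustrative 85%/95% rates from my simulation, not measurements from any real method):

```python
def ppv(sens, spec, neg_per_pos):
    """Closed-form positive predictive value: expected true positives per
    positive example over expected total positive predictions."""
    tp = sens                        # expected TP per positive example
    fp = (1 - spec) * neg_per_pos    # expected FP per positive example
    return tp / (tp + fp)

# Same 85% sensitivity / 95% specificity as the simulation above:
for ratio in (1, 100, 10_000):
    print(f"1:{ratio:<6} PPV = {ppv(0.85, 0.95, ratio):.4f}")
```

With these rates an artificially balanced (1:1) evaluation reports a PPV around 0.94, the natural 1:100 ratio gives about 0.15, and at 1:10000 the very same classifier's PPV is essentially 0.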
So, don’t be that bioinformatician who publishes the paper with performance results based on a vastly artificial balance of positive versus negative examples that ruins some poor graduate student’s life down the road.