More Science Caution Signs

You asked for it (you don’t remember? Well, you did) so you got it. More science caution signs.

This time I had some help. See the contributions of ideas from:

And there were some other ideas too that I just haven’t put into a visual representation yet- so there may be another installment of these important warning signs in the future.



What is a hypothesis?

So I got this comment from a reviewer on one of my grants:

The use of the term “hypothesis” throughout this application is confusing. In research, hypotheses pertain to phenomena that can be empirically observed. Observation can then validate or refute a hypothesis. The hypotheses in this application pertain to models not to actual phenomena. Of course the PI may hypothesize that his models will work, but that is not hypothesis-driven research.

There are a lot of things I can say about this statement, which really rankles. As a thought experiment replace all occurrences of the word “model” with “Western blot” in the above comment. Does the comment still hold?

At this point it may be informative to get some definitions, keeping in mind that the _working_ definitions in science can have somewhat different connotations.

From Google:

Hypothesis: a supposition or proposed explanation made on the basis of limited evidence as a starting point for further investigation.

This definition has nothing about empirical observation- and I would argue that this definition would be fairly widely accepted in biological sciences research, though the underpinnings of the reviewer’s comment- empirically observed phenomena- probably are in the minds of many biologists.

So then, also from Google:

Empirical: based on, concerned with, or verifiable by observation or experience rather than theory or pure logic.

Here’s where the real meat of the discussion is. Empirical evidence is based on observation or experience as opposed to being based on theory or pure logic. It’s important to understand that the “models” being referred to in my grant are machine learning statistical models that have been derived from sequence data (that is, observation).

I would argue that including some theory or logic in a model that’s based on observation is exactly what science is about- this is what the basis of a hypothesis IS. All the hypotheses considered in my proposal were based on empirical observation, filtered through some form of logic/theory (if X is true then it’s reasonable to conclude Y), and would be tested by returning to empirical observations (either of protein sequences or experimentation at the actual lab bench).

I believe that the reviewer was confused by the use of statistics, which is a largely empirical endeavor (based on the observation of data, though filtered through theory), and computation, which they do not see as empirical. Back to my original thought experiment: there are a lot of assumptions, theory, and logic that go into the interpretation of a Western blot, or any other common lab experiment. However, this does not mean that we can’t use them to formulate further hypotheses.

This debate is really fundamental to my scientific identity. I am a biologist who uses computers (algorithms, visualization, statistics, machine learning and more) to do biology. If the reviewer is correct, then I’m pretty much out of a job I guess. Or I have to settle back on “data analyst” as a job title (which is certainly a good part of my job, but not the core of it).

So I’d appreciate feedback and discussion on this. I’m interested to hear what other people think about this point.

Proposal gambit – Betting the ranch

Last spring I posted about a proposal I’d put in where I’d published the key piece of preliminary data in F1000 Research, a journal that offers post-publication peer review.

The idea was that I could get my paper published (it’s available here) and accessible to reviewers prior to submission of my grant. It could then be peer-reviewed and I could address the revisions after that. This strategy was driven by the lag time between proposal submission and review for NIH, which is about 4 months. Also, it used to be possible to include papers that hadn’t been formally accepted by a journal as an appendix to NIH grants. This hasn’t been possible for some time now. But I figured this might be a pretty good way to get preliminary data out to the grant reviewers in a published form with quick turnaround. Or at least that you could utilize that lag time to also function as review time for your paper.

I was able to get my paper submitted to F1000 Research and obtained a DOI and URL that I could include as a citation in my grant. Details here.

The review for the grant was completed in early June of this year and the results were not what I had hoped- the grant wasn’t even scored, despite being totally awesome (of course, right?). But for this post I’ll focus on the parts that are pertinent to the “gambit”- the use of post-publication peer review as preliminary data.

The results here were mostly unencouraging RE post-publication peer review being used this way, which was disappointing. But let me briefly describe the timeline, which is important to understand a large caveat about the results.

I received first-round reviews from two reviewers in a blindingly fast 10 and 16 days after initial submission. Both were encouraging, but had some substantial (and substantially helpful) requests. You can read them here and here. It took me longer than it could have to address these completely – though I did some new analysis and added additional explanation to several important points. I then resubmitted around May 12th. However, due to some kind of issue the revised version wasn’t made available by F1000 Research until May 29th. Given that the NIH review panel met in the first week of June it is likely that the grant reviewers didn’t see the revised (and much improved) version. The paper’s reviewers then got back final comments in early June (again- blindingly fast). You can read those here and here. The paper was accepted/approved/indexed in mid-June.

The grant had comments from three reviewers and each had something to say about the paper as preliminary data.

The first reviewer had the most negative comments.

It is not appropriate to point reviewers to a paper in order to save space in the proposal.

Alone this comment is pretty odd and makes me think that the reviewer was annoyed by the approach. So I can’t refer to a paper as preliminary data? On the face of it this is absolutely ridiculous. Science, and the accumulation of scientific knowledge, just doesn’t work in a way that allows you to include all your preliminary data completely (as well as your research approach and everything else) in the space of a 12-page grant. However, their further comments (which directly follow this one) shed some light on their thinking.

The PILGram approach should have been described in sufficient detail in the proposal to allow us to adequately assess it. The space currently used to lecture us on generative models could have been better used to actually provide details about the methods being developed.

So reading between the (somewhat grumpy) lines I think they mean to say that I should have done a better job of presenting some important details in the text itself. But my guess is that the first reviewer was not thrilled by the prospect of using a post-publication peer reviewed paper as preliminary data for the grant. Not thrilled.

  • Reviewer 1: Thumbs down.

Second reviewer’s comment.

The investigators revised the proposal according to prior reviews and included further details about the method in the form of a recently ‘published’ paper (the quotes are due to the fact that the paper was submitted to a journal that accepts and posts submissions even after peer review – F1000 Research). The public reviewers’ comments on the paper itself raise several concerns with the method proposed and whether it actually works sufficiently well.

This comment, unfortunately, is likely due to the timeline I presented above. I think they saw the first version of the paper, read the paper comments, and figured that there were holes in the whole approach. If my revisions had been available it seems like there still would have been issues, unless I had already gotten the final approval for the paper.

  • Reviewer 2: Thumbs down- although maybe not with the annoyed thrusting motions that the first reviewer was presumably making.

Finally, the third reviewer (contrary to scientific lore) was the most gentle.

A recent publication is suggested by the PI as a source of details, but there aren’t many in that manuscript either.

I’m a little puzzled about this since the paper is pretty comprehensive. But maybe this is an effect of reading the first version, not the final version. So I would call this neutral on the approach.

  • Reviewer 3: No decision.


The takeaway from this gambit is mixed.

I think if it had been executed better (by me) I could have gotten the final approval through by the time the grant reviewers were looking at it and then a lot of the hesitation and negative feelings would have gone away. Of course, this would be dependent on having paper reviewers that were as quick as those that I got- which certainly isn’t a sure thing.

I think that the views of biologists on preprints, post-publication review, and other ‘alternative’ publishing options are changing. Hopefully more biologists will start using these methods- because, frankly, in a lot of cases they make a lot more sense than the traditional closed-access, non-transparent peer review processes.

However, the field can be slow to change. I will probably try this, or something like this, again. Honestly, what do I have to lose exactly? Overall, this was a positive experience and one where I believe I was able to make a contribution to science. I just hope my next grant is a better substrate for this kind of experiment.

Other posts on this process:



Another word about balance

[4/17/2015 updated: A reader pointed out that my formulae for specificity and accuracy contained errors. It turns out that both measures were being calculated correctly, just a typing error on the blog. I’ve corrected them below.] 

TL;DR summary

Evaluating a binary classifier based on an artificial balance of positive examples and negative examples (which is commonly done in this field) can cause underestimation of method accuracy but vast overestimation of the positive predictive value (PPV) of the method. Since PPV is likely the only metric that really matters to a particular kind of important end user, the biologist wanting to find a couple of novel positive examples in the lab based on your prediction, this is a potentially very big problem with reporting performance.

The long version

Previously I wrote a post about the importance of having a naturally balanced set of positive and negative examples when evaluating the performance of a binary classifier produced by machine learning methods. I’ve continued to think about this problem and realized that I didn’t have a very good handle on what kinds of effects artificially balanced sets would have on performance. Though the metrics I’m using are very simple, I felt that it would be worthwhile to demonstrate the effects, so I did a simple simulation.

  1. I produced random prediction sets with a set portion of positives predicted correctly (85%) and a set portion of negatives predicted correctly (95%).
  2. The ‘naturally’ occurring ratio of positive to negative examples could be varied but for the figures below I used 1:100.
  3. I varied the ratio of positive to negative examples used to estimate performance and
  4. Calculated several commonly used measures of performance:
    1. Accuracy ((TP+TN)/(TP+FP+TN+FN); that is, the percentage of positive or negative predictions that are correct relative to the total number of predictions)
    2. Specificity (TN/(TN+FP); that is, the percentage of negative predictions that are correct relative to the total number of negative examples)
    3. AUC (area under the receiver operating characteristic curve; a summary metric that is commonly used in classification to evaluate performance)
    4. Positive predictive value (TP/(TP+FP); that is, out of all positive predictions what percentage are correct)
    5. False discovery rate (FDR; 1-PPV; percentage of positive predictions that are wrong)
  5. Repeated these calculations with 20 different random prediction sets
  6. Plotted the results as box plots, which summarize the median (dark line in the middle), the interquartile range (the box), and lines (whiskers) extending up to 1.5 times the interquartile range from the box; dots above or below are outside this range.
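
A minimal sketch of the steps above in Python, for anyone who wants to poke at the effect themselves. This is not the code behind the figures: the function names, the 500 positives per prediction set, and the fixed seed are all assumptions made for illustration, and AUC is left out because it needs ranked scores rather than yes/no predictions.

```python
import random

def simulate(n_pos, n_neg, rng, sens=0.85, spec=0.95):
    """Draw one random prediction set; return (TP, FN, TN, FP)."""
    tp = sum(rng.random() < sens for _ in range(n_pos))  # positives called correctly
    tn = sum(rng.random() < spec for _ in range(n_neg))  # negatives called correctly
    return tp, n_pos - tp, tn, n_neg - tn

def metrics(tp, fn, tn, fp):
    """Compute the simple measures listed above from confusion counts."""
    ppv = tp / (tp + fp) if (tp + fp) else float("nan")
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "ppv": ppv,
        "fdr": 1 - ppv,
    }

def mean_metrics(neg_per_pos, n_pos=500, repeats=20, seed=0):
    """Average each measure over `repeats` random prediction sets."""
    rng = random.Random(seed)
    runs = [metrics(*simulate(n_pos, n_pos * neg_per_pos, rng))
            for _ in range(repeats)]
    return {name: sum(r[name] for r in runs) / repeats for name in runs[0]}

# Compare an artificial 1:1 split with 1:10 and the 'natural' 1:100 ratio.
for ratio in (1, 10, 100):
    summary = mean_metrics(ratio)
    print(f"1:{ratio}", {k: round(v, 3) for k, v in summary.items()})
```

Because this is an idealized re-creation, its numbers need not match the plots exactly, but the direction of each effect (accuracy, specificity, PPV, FDR) can be checked against them.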

The results are not surprising but do demonstrate the pitfalls of using artificially balanced data sets. Keep in mind that there are many publications that limit their training and evaluation datasets to a 1:1 ratio of positive to negative examples.


Accuracy estimates are actually worse than they should be for the artificial splits because fewer of the negative results are being considered.


Specificity stays largely the same and is a good estimate because it isn’t affected by the ratio of negatives to positive examples. Sensitivity (the same measure but for positive examples) also doesn’t change for the same reason.


Happily the AUC doesn’t actually change that much- mostly it’s just much more variable with smaller ratios of negatives to positives. So an AUC from a 1:1 split should be considered to be in the right ballpark, but maybe off from the real value by a bit.

Positive predictive value (PPV)

Aaaand there’s where things go to hell.

False discovery rate (FDR)

Same thing here. The FDR is extremely high (>90%) in the real dataset, but the artificial balanced sets vastly underestimate it.



Why is this a problem?

The last two plots, PPV and FDR, are where the real trouble is. The problem is that the artificial splits vastly overestimate PPV and underestimate FDR (note that the Y axis scale on these plots runs from 0 to close to 1). Why is this important? This is important because, in general, PPV is what an end user is likely to be concerned about. I’m thinking of the end user that wants to use your great new method for predicting that proteins are members of some very important functional class. They will then apply your method to their own examples (say their newly sequenced bacteria) and rank the positive predictions. They couldn’t care less about the negative predictions because that’s not what they’re interested in. So they take the top few predictions to the lab (they can’t afford to do 100s, only the best few, say 5, predictions) and experimentally validate them.

If your method’s PPV is actually 95% it’s fairly likely that all 5 of their predictions will pan out (it’s NEVER really as likely as that due to all kinds of factors, but for sake of argument), making them very happy and allowing the poor grad student whose project it is to actually graduate.

However, the actual PPV from the example above is about 5%. This means that the poor grad student who slaves for weeks over experiments to validate at least ONE of your stinking predictions will probably end up empty-handed for their efforts and will have to spend another 3 years struggling to get their project to the point of graduation.
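
To put rough numbers on those two scenarios, here is a back-of-the-envelope sketch. It assumes each of the 5 predictions is independently correct with probability equal to the PPV, which is a simplification (real top-ranked predictions are neither independent nor equally reliable), and the function names are made up for illustration.

```python
def p_all_correct(ppv, k):
    """Chance that all k tested predictions validate, assuming independence."""
    return ppv ** k

def p_none_correct(ppv, k):
    """Chance that none of the k tested predictions validate."""
    return (1 - ppv) ** k

for ppv in (0.95, 0.05):
    print(f"PPV={ppv:.2f}: all 5 correct {p_all_correct(ppv, 5):.2f}, "
          f"none correct {p_none_correct(ppv, 5):.2f}")
```

Under that assumption, a PPV of 95% gives about a 77% chance that all 5 pan out, while a PPV of 5% gives about a 77% chance that none of them do.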

Given a large enough ratio in the real dataset (e.g. protein-protein interactions where the number of positive examples is somewhere around 50-100k in human but the number of negatives is somewhere around 4.5x10^8, a ratio of ~1:10000) the real PPV can fall to essentially 0, whereas the artificially estimated PPV can stay very high.
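
Why the PPV collapses can be seen from a short derivation. The symbols here are introduced only for this sketch: s is the sensitivity, c is the specificity, P is the number of positives, and r is the number of negatives per positive. The values s = 0.85 and c = 0.95 are borrowed from the simulation settings, and sampling noise is ignored.

```latex
\begin{align*}
TP &= sP, \qquad FP = (1 - c)\,rP \\
PPV &= \frac{TP}{TP + FP} = \frac{sP}{sP + (1 - c)\,rP} = \frac{s}{s + (1 - c)\,r} \\
r = 1: \quad PPV &= \frac{0.85}{0.85 + 0.05} \approx 0.94 \\
r = 10^{4}: \quad PPV &= \frac{0.85}{0.85 + 500} \approx 0.0017
\end{align*}
```

Sensitivity and specificity do not depend on r, but PPV does: for fixed s and c it shrinks roughly in proportion to 1/r, so a 1:1 evaluation split says very little about PPV at the natural ratio.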

So, don’t be that bioinformatician who publishes the paper with performance results based on a vastly artificial balance of positive versus negative examples that ruins some poor graduate student’s life down the road.


Multidrug resistance in bacteria

So I just published a paper on predicting multi drug resistance transporters in the journal F1000 Research. This was part of my diabolical* plot (and here) to get grant money (*not really diabolical, but definitely risky, and hopefully clever). So what’s the paper about? Here’s my short explanation, hopefully aimed so that everyone can understand.

TL;DR version (since I wrote more than I thought I was going to)

Antibiotic resistance in bacteria is a rapidly growing health problem- if our existing antibiotics become useless against pathogens we’ve got a big problem. One of the mechanisms of resistance is that bacteria have transporters, proteins that pump out the antibiotics so they can’t kill the bacteria. There are many different kinds of these transporters and finding more of them will help us understand resistance mechanisms. We’ve used a method based on understanding written language to interpret the sequence of proteins (the order of building blocks used to build the protein) and predict a meaning from this- the meaning being the function of antibiotic transporter. We applied this approach to a large set of proteins from bacteria in the environment (a salty lake in Washington state in this case) because it’s known that these poorly understood bacteria have a lot of new proteins that can be transferred to human pathogens and give them superpowers (that is, antibiotic resistance).

(now the long version)

Antibiotic resistance in bacteria

This is a growing world health problem that you’ve probably heard about. Prior to the discovery of antibiotics bacterial infections were a very serious problem that we couldn’t do much about. Antibiotics changed all that, providing a very effective way to treat common and uncommon bacterial infections, and saving countless lives. The problem is that there are a limited number of different kinds of antibiotics that we have (that is, that have been discovered and are clinically effective without drastic side effects) and the prevalence of strains of common bacterial pathogens with resistance to one or more of these antibiotics is growing at an alarming rate. The world will be a very different place if we no longer have effective antibiotics (see this piece for a scary peek into what it’ll be like).

How does this happen? The driving force is Darwinian selection- survival of the fittest. Imagine that the pathogens are a herd of deer and that antibiotics are a wolf pack. The wolf pack easily kills off the slower deer, but leaves the fastest ones to live and reproduce, leading to faster offspring that are harder to kill. Also, the fast deer can pass off their speed to slow deer that are around, making them hard to kill.

Bacterial resistance to antibiotics works in a somewhat similar way. Bacteria can evolve, driven by natural selection, and they reproduce very quickly- but they have an even faster way to accomplish this adaptation than evolving new functions from the ground up. They can exchange genetic material, including the plans for resistance mechanisms (genes that code for resistance proteins) with other bacteria. And they can make these exchanges between bacteria of different species, so a resistant pathogen can pass off resistance to another pathogen, or an innocuous environmental bacterium can pass off a resistance gene to a pathogen, making it resistant.

There are three main classes of resistance. First, the bacteria can develop resistance by altering the target of the antibiotic so that it can no longer kill. The ‘target’ in this case is often a protein that the bacteria uses to do some critical thing- and the antibiotic mucks it up so that bacteria die since they can’t accomplish that thing they need to do. Think of this like a disguise- the deer put on a nose and glasses and long coat à la Scooby Doo, and the wolves run right by without noticing. Second, the bacteria can produce an enzyme (a protein that alters small molecules, like sugars or drugs, in some way) that transforms the antibiotic into an ineffective form. Think of this like the deer using handcuffs to cuff the legs of the wolves together so they can’t run anymore, and thus can’t chase and kill the deer (which are the bacteria if you remember). Third, the bacteria can produce special transporter proteins that pump the antibiotic out of the inside of the cell (the bacterial cell) and away from the vital machinery that the antibiotic is targeting to kill the bacteria. Think of this like the possibility that deer engineers have developed portable wolf catapults. When a wolf gets too close it’s simply catapulted over the trees so it can’t do its evil business (in this case, actually good business because the wolves are the antibiotics, remember?).

Antibiotic resistance and the resistome

The problem addressed in the paper

The problem we address in the paper is related to the third mechanism of resistance- the transporter proteins. There are a number of types of these transporters that can transport several or many different kinds of antibiotics at the same time- thus multi drug resistance transporters. Still, it’s likely that there are a lot of these kinds of proteins out there that we don’t recognize as such- in many cases you can’t just look at the sequence (see the section below) of the protein and figure out what it does.

The point of the paper is to develop a method that can detect these kinds of proteins and look for those beyond what we already know about. The long range view is that this will help us understand better how these kinds of proteins work and possibly suggest ways to block them (using novel antibiotics) to make existing antibiotics more effective again.

An interesting thing that has become clear in the last few years is that environmental bacteria have a large number of different resistance mechanisms to existing antibiotics (and probably to antibiotics we don’t even know about yet). And there are a LOT of environmental bacteria in just about every place on earth. Most of these we don’t know anything about. This has been called the “antibiotic resistome” meaning that it’s a vast reservoir of unknown potential for resistance that can be transferred to human pathogens. In the case of the second mechanism of resistance, the enzymes, these likely have evolved since bacteria in these environmental communities are undergoing constant warfare with each other- producing molecules (like antibiotics) that are designed to kill other species. In the case of the third resistance mechanism (the transporters) this could also be true, but these transporters seem to have a lot of other functions too- like ridding the bacteria of harmful molecules that might be in the environment like salts.

Linguistic-based sequence analysis 

The paper uses an approach that was developed in linguistics (study of language) to analyze proteins. This works because the building blocks of proteins (see below) can be viewed as a kind of language, where different combinations of blocks in different orders can give rise to different meanings- that is, different functions for the protein.

The sequence of a protein refers to the fact that proteins are made up of long chains of amino acids. Amino acids are just building block molecules, and there are 20 different kinds that are commonly found in proteins. These 20 different kinds make up an alphabet, and the alphabet is used to “spell” the protein. The list of amino acid letters that represents the protein is its sequence. It’s relatively easy to get the sequences of proteins for many bacteria, but the problem of what these sequences actually do is very much an open one. Proteins with similar sequences often times do similar things. But there are some interesting exceptions to this that I can illustrate using actual letters and sentences.

The first is that similar sequences might have different meanings.

1) “When she looked at the pool Jala realized it was low.”

2) “When she looked at the pool Jala realized she was too slow.”

The second is that very different sentences might have similar meanings.

1) “When he looked at the pool Joe realized it was dirty.”

2) “The dirty pool caught Joe’s attention.”

(these probably aren’t the BEST sentences to illustrate this, if you have better suggestions please let me know)

The multi drug transporters have elements of both problems. There are large families of transporter proteins that are pretty similar in terms of protein sequence- but the proteins actually transport different things (like non-antibiotic molecules), and at this point we can’t just look at the sequences and figure out what they transport for many examples. There are also several families of multi drug transporters that have pretty different sequences between families but all do essentially the same job of transporting several types of drugs.

Linguistics, and especially computational linguistics, has been focused on developing algorithms (computer code) to interpret language into meaning. The approach we use in the paper, called PILGram, does exactly this and has been applied to interpretation of natural (English) language for other projects. We just adapted it somewhat so that it would work on protein sequences. Then we trained the method (since the method learns by example) on a set of proteins where we know the answer- previously identified multi drug transporters. After the method was trained and we had evaluated how well it could do its intended job (that is, taking protein sequences and figuring out if they are multi drug transporters or not), we let it loose on a large set of proteins from bacteria in a very salty lake in northern Washington state called Hot Lake.

What we found

First we found that the linguistic-based method did pretty well on some protein sequence problems where we already knew what the answer was. These PROSITE patterns are from a database where scientists have spent a lot of effort figuring out protein motifs (like figures of speech in language that always mean the same thing) for a whole collection of different protein functions. PILGram was able to do pretty well (though not perfectly) at figuring out what those motifs were- even though we didn’t spend any time on looking through the protein sequences, which is what PROSITE did. So that was good.

We then showed that the method could predict multi drug resistance transporters, a set of proteins where a common motif isn’t known. Again, it does fairly well – not perfect but much better than existing ways of doing this. We evaluated how well it did by pretending we didn’t know the answers for a set of proteins when we actually do know the answer- this is called ‘holding out’ some of the data. The trained method (trained on the set of proteins we didn’t hold out) was then used to predict whether or not the held out proteins were multi drug transporters and we could evaluate how well they did by comparing with the real answers.
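
For the curious, the ‘holding out’ idea can be sketched in a few lines of Python. Everything here is a toy stand-in and none of it is PILGram: the sequences are made up, the function names are hypothetical, and the ‘learner’ just picks the single amino acid letter most enriched in the positive training examples.

```python
import random

def hold_out_split(examples, held_out_fraction=0.2, seed=0):
    """Shuffle labelled examples and split them into (training, held_out)."""
    shuffled = examples[:]
    random.Random(seed).shuffle(shuffled)
    n_held = max(1, int(len(shuffled) * held_out_fraction))
    return shuffled[n_held:], shuffled[:n_held]

def train(training):
    """Toy learner: find the letter most enriched in positive sequences."""
    def freq(seqs, aa):
        total = sum(len(s) for s in seqs) or 1
        return sum(s.count(aa) for s in seqs) / total
    pos = [s for s, label in training if label]
    neg = [s for s, label in training if not label]
    alphabet = sorted({aa for s, _ in training for aa in s})
    best = max(alphabet, key=lambda aa: freq(pos, aa) - freq(neg, aa))
    return lambda seq: best in seq  # predict 'positive' if the letter is present

def evaluate(predict, held_out):
    """Fraction of held-out examples whose prediction matches the real answer."""
    return sum(predict(seq) == label for seq, label in held_out) / len(held_out)

# Made-up sequences: the positives all contain a W, the negatives do not.
examples = [("MKWLA", True), ("AWGLL", True), ("WKKLA", True),
            ("GAWLM", True), ("LWAKG", True), ("MKLLA", False),
            ("AGGLL", False), ("GKKLA", False), ("GALLM", False),
            ("LGAKG", False)]

training, held_out = hold_out_split(examples)
predict = train(training)           # the learner never sees the held-out set
print(evaluate(predict, held_out))  # compare predictions with the real answers
```

The important design point is the one in the text: the held-out proteins play no part in training, so the score reflects how the method does on examples it has not seen.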

Finally, we found that the method identified a number of likely looking candidate multi drug transporters from the Hot Lake community proteins and we listed a few of these candidates.

The next step will be to look at these candidates in the lab and see if they actually are multi drug transporters or not. This step is called “validation”. If they are (or at least one or two are) then that’s good for the method- it says that the method can actually predict something useful. If not then we’ll have to refine the method further and try to do better (though a negative result in a limited validation doesn’t necessarily mean that the method doesn’t work). This step, along with a number of computational improvements to the method, is what I proposed in the grant I just submitted. So if I get the funding I get to do this fun stuff.

More information

Proposal gambit

I am currently (this minute… well, not THIS minute, but just a minute ago, and in a minute) in the throes of revising a resubmission of a previously submitted R01 proposal to NIH. This proposal generally covers novel methods to build protein-sequence-based classifiers for problematic functional classes- that is, groups of proteins that have a shared function but either are very divergent in their sequence (meaning that they can’t be associated by traditional sequence similarity approaches) or have a lot of similar sequences with divergent functions (and the function that’s interesting can’t be easily disambiguated).

I got good feedback from reviewers on the previous version (though I did not get discussed- for those who aren’t familiar with the process, to get a score- and thus a chance at funding- your grant has to be in the top 50% of the grants that the review panel reads, then it moves on to actual discussion in the panel and scoring). Their main complaint was that I had not described the novel method I was proposing in sufficient detail, and so they were intrigued but couldn’t assess if this would really work or not. The format of NIH R01-level grants (12 pages for the research part) means that to provide details of methods you really need to have published your preliminary results. Also- if it’s published it really lends weight to the fact that you can do it and get it through peer review (or pay your way into a publication in a fly-by-night journal).

So anyway. I’ve put this resubmission off since last year and I’m not getting any younger and I don’t have a publication to reference on the method in the proposal yet. So here’s my gambit. I’ve been working on the paper that will provide preliminary data and it was really nearly finished; it just needed a good push to get it finalized, which came in the form of this grant. My plan is to finish up the last couple of details on the paper and submit it to F1000 Research because it offers online publication immediately with subsequent peer review. I’ve been intrigued by this emerging model recently and wanted to try it anyway. But this allows me to reference the online version very soon after I upload it (maybe tomorrow) and include it as a bona fide citation for my grant. The idea is that by the time the grant is reviewed (3 months hence) the paper will have passed peer review and will be an actual citation.

But it’s a gambit. It’s possible that the paper will still be under review or will have received harsh reviews by the time the reviewers look at it. It’s also possible that since I won’t have a traditional journal citation in text for the proposal- I’ll need to supply a URL to my online version- that the reviewers will just frown on this whole idea and it might even piss them off making them think I’m trying to get away with something (which I totally am, though it’s not unethical or against the rules in any way that I can see). However, I’m pretty sure that this is a lot more common on the CS side (preprint servers, and the like) so I’m betting on that flying.

Anyway, I’ll have an update in 3+ months on how this worked out for me. I actually have high hopes for this proposal- which does scare me a little. But I’m totally used to dealing with rejection, as I’ve mentioned before on numerous occasions. Wish me luck!

Big Data Showdown

One of the toughest parts of collaborative science is communication across disciplines. I’ve had many (generally initial) conversations with bench biologists, clinicians, and sometimes others that go approximately like:

“So, tell me what you can do with my data.”

“OK- tell me what questions you’re asking.”

“Um... that kinda depends on what you can do with it.”

“Well, that kinda depends on what you’re interested in…”

And this continues.

But the great part- the part about it that I really love- is that given two interested parties you’ll sometimes work to a point of mutual understanding, figuring out the borders and potential of each other’s skills and knowledge. And you generally work out a way of communicating that suits both sides and (mostly) works to get the job done. This is really when you start to hit the point of synergistic collaboration- and also, sadly, usually about the time you run out of funding to do the research.


Well, there probably ARE some exceptions here.

So I first thought of this as a funny way of expressing relief over a paper being accepted that was a real pain to get finished. But after I thought about the general idea awhile I actually think it’s got some merit in science. Academic publication is not about publishing airtight studies with every possibility examined and every loose end or unconstrained variable nailed down. It can’t be. That would limit scientific productivity to zero because it’s not possible. Science is an evolving dialogue, some of it involving elements of the truth.

The dirty little secret (or elegant grand framework, depending on your perspective) of research is that science is not about finding the truth. It’s about moving our understanding closer to the truth. Oftentimes that involves false positive observations- not because of the misconduct of science but because of its proper conduct. You should never publish junk or anything that’s deliberately misleading. But you can’t help publishing things that sometimes move us further away from the truth. The idea in science is that these erroneous findings will be corrected by further iterations and may even provide an impetus for driving studies that advance science. So publish away!

A Fine Trip Spoiled

I had a dream the other night that inspired this comic. My dream was about waiting for a connecting flight. I decided to take it easy and do something fun, then realized that my flight was leaving soon and I was nowhere near the gate. Then I got on a train and realized I was going the wrong direction. Anyway, I woke up to the realization that I’d relaxed and done fun stuff most of the weekend (I did work some in the evenings) and that I had an unfinished grant that was still due this week. As it turned out I finished up my grant quite nicely despite the slacking off- or maybe even because of the slacking off. But it gave me the inspiration for this comic.

You see, writing and submitting a grant proposal is a lot like planning for a vacation that you’ll probably never get to take. The work you’re proposing should be fun and interesting (otherwise, why are you trying to get money to do it, right?) but your chances are pretty slim that you’ll ever get to do it- at least in the form that you propose it. I’ve started to think of the grant process as a long game (see this post from one DrugMonkey)- one where the act of writing a single grant is mainly just positioning for the next grant you’ll write down the line. Writing grants gives you the opportunity to come up with ideas, to consolidate your thoughts, and to think through the science that you want to do and how you want to do it. The process can push you to publish your work so that you can cite it as preliminary data. And it can forge long-lasting collaborations that go beyond failed proposals (though funded proposals certainly help to cement these relationships in a much more sure way).

I think “A Fine Trip Spoiled” may be the title of my autobiography when I get rich and famous.