Making a super villain

I’ve written about Reviewer 3 before (here, here, here, and here). Somehow the third reviewer has come to embody the capriciousness (and sometimes meanness) of the anonymous peer review process. Note that I believe in the peer review process, but am a realist about what it means and what it accomplishes. It doesn’t mean that every paper passing peer review is perfect and it doesn’t mean that every peer reviewer is doing a great job of reviewing.

When I’m a reviewer I see the peer review process through the lens of the line from Spider-Man (Stan Lee), “with great power comes great responsibility”. I strive to put as much effort into each paper I review as I would expect and want from the reviewers who review my papers. Sometimes that means that I don’t get my reviews back exactly on time- but better that than a crappy, half-thought-through review. I’m not sure that I always succeed. Sometimes I think that I may have missed points made by the authors, or I may have the wrong idea about an approach or result. However, if I’ve done a good job of trying to get it right the peer review process is working.



I’ve been thinking lately about how events in your academic life can lead to unintended, and oftentimes unrecognized, downstream effects. Recently I realized that I’m having trouble putting together a couple of papers that I’m supposed to be leading. After some reflection I came to the conclusion that at least one reason is I’ve been affected by the long, tortuous, and somewhat degrading process of trying to get a large and rather important paper published. This paper has been in the works, and through multiple submission/revision cycles, for around five years. And it starts to really wear on your academic psyche after that time, though it can be hard to recognize. I think that my failure to get that paper published (so far) is partly holding me back on putting together these other papers. Partly this is about the continuing and varied forms of rejection you experience in this process, but partly it’s about the fact that there’s something sitting there that shouldn’t be sitting there. Even though I don’t currently have any active tasks that I have to complete for that problem paper it still weighs on me.

The silver lining is that once I recognized that this was a factor things started to seem easier with those projects and the story I was trying to tell. Anyway, I think we as academics should have our own therapists that specialize in problems such as this. It would be very helpful.


Writing Yourself Into A Corner

I’ve been fascinated with the idea of investment, and how it can color your thoughts, feelings, and opinions about something. Not the monetary sense of the word (though probably that too) but the emotional and intellectual sense of the word. If you’ve ever been in a bad relationship you might have fallen prey to this reasoning- “I’m in this relationship and I’m not getting out because reasons so admitting that it’s absolutely terrible for me is unthinkable so I’m going to pretend like it’s not and I’m going to believe that it’s not and I’m going to tell everyone that I’m doing great”. I really believe this can be a motivating factor for a big chunk of human behavior.

And it’s certainly a problem in science. When you become too invested in an idea or an approach or a tool- that is, you’ve spent a considerable amount of time researching or promoting it- it can be very difficult to distance yourself from that thing and admit that you might have it wrong. That would be unthinkable.

Sometimes this investment pitfall is contagious. If you’re on a project working together with others for common goals the problem of investment can become more complicated. That is, if I’ve said something, and some amount of group effort has been put into this idea, but it turns out I was wrong about it, it can be difficult to raise that to the rest of the group. Though, I note, it is really imperative that it be raised. This can become more difficult if the ideas or preliminary results you’ve put forward become part of the project- through presentations made by others or through further investment of project resources to follow up on these leads.

I think this sometimes happens when you’re writing an early draft of a document- though the effect can be more subtle here. If you write words down and put out ideas that are generally sound and on-point it can be hard for you, or others who may edit the paper after you, to erase these. More importantly a first draft, no matter how preliminary or draft-y, can establish an organization that can be hard to break. Clearly if there are parts that really don’t work, or don’t fit, or aren’t true, they can be removed fairly easily. The bigger problems lie in those parts that are *pretty good*. I’ve looked back at my own preliminary drafts and realized (after a whole lot of work trying to get things to fit) that the initial overall organization was somehow wrong- and that I really need to rip it all apart and start over, at least in terms of the organization. I’ve also seen this in other people’s work, where something just doesn’t seem right about a paper, but I really can’t place my finger on what- at least not without a bunch of effort.

Does this mean that you should very carefully plan out your preliminary drafts? Not at all. That’s essentially the route to complete gridlock and non-productivity. Rather, you should be aware of this problem and be willing to be flexible. Realize that what you put down on the paper for the first draft (or early versions of analysis) is subject to change- and make others you are working with aware of this explicitly (simply labeling something as “preliminary analysis” or “rough draft” isn’t explicit enough). And don’t be afraid to back away from it if it’s not working out. It’s much better if that happens earlier in the process than later- that is, it’s better to completely tear down a final draft of a paper than to have reviewers completely miss the point of what you’re trying to say after you’ve submitted it.


Your Manuscript On Peer Review

I’m a big fan of peer review. Most of the revisions that reviewers suggest are very reasonable and sometimes really improve the manuscript. Other times it doesn’t seem to work that way. I’ve noticed this is especially true when the manuscript goes through multiple rounds of peer review at different journals. It can become a franken-paper, unloved by the very reviewers who made it.

Multidrug resistance in bacteria

So I just published a paper on predicting multi drug resistance transporters in the journal F1000 Research. This was part of my diabolical* plot (and here) to get grant money (*not really diabolical, but definitely risky, and hopefully clever). So what’s the paper about? Here’s my short explanation, hopefully aimed so that everyone can understand.

TL;DR version (since I wrote more than I thought I was going to)

Antibiotic resistance in bacteria is a rapidly growing health problem- if our existing antibiotics become useless against pathogens we’ve got a big problem. One of the mechanisms of resistance is that bacteria have transporters, proteins that pump out the antibiotics so they can’t kill the bacteria. There are many different kinds of these transporters and finding more of them will help us understand resistance mechanisms. We’ve used a method based on understanding written language to interpret the sequence of proteins (the order of building blocks used to build the protein) and predict a meaning from this- the meaning being the function of antibiotic transporter. We applied this approach to a large set of proteins from bacteria in the environment (a salty lake in Washington state in this case) because it’s known that these poorly understood bacteria have a lot of new proteins that can be transferred to human pathogens and give them superpowers (that is, antibiotic resistance).

(now the long version)

Antibiotic resistance in bacteria

This is a growing world health problem that you’ve probably heard about. Prior to the discovery of antibiotics bacterial infections were a very serious problem that we couldn’t do much about. Antibiotics changed all that, providing a very effective way to treat common and uncommon bacterial infections, and saving countless lives. The problem is that there are a limited number of different kinds of antibiotics that we have (that is, that have been discovered and are clinically effective without drastic side effects) and the prevalence of strains of common bacterial pathogens with resistance to one or more of these antibiotics is growing at an alarming rate. The world will be a very different place if we no longer have effective antibiotics (see this piece for a scary peek into what it’ll be like).

How does this happen? The driving force is Darwinian selection- survival of the fittest. Imagine that the pathogens are a herd of deer and that antibiotics are a wolf pack. The wolf pack easily kills off the slower deer, but leaves the fastest ones to live and reproduce, leading to faster offspring that are harder to kill. Also, the fast deer can pass off their speed to slow deer that are around, making them hard to kill.

Bacterial resistance to antibiotics works in a somewhat similar way. Bacteria can evolve, driven by natural selection, and they reproduce very quickly- but they have an even faster way to accomplish this adaptation than evolving new functions from the ground up. They can exchange genetic material, including the plans for resistance mechanisms (genes that code for resistance proteins) with other bacteria. And they can make these exchanges between bacteria of different species, so a resistant pathogen can pass off resistance to another pathogen, or an innocuous environmental bacterium can pass off a resistance gene to a pathogen making it resistant.

There are three main classes of resistance. First, the bacteria can develop resistance by altering the target of the antibiotic so that it can no longer kill. The ‘target’ in this case is often a protein that the bacteria uses to do some critical thing- and the antibiotic mucks it up so that bacteria die since they can’t accomplish that thing they need to do. Think of this like a disguise- the deer put on a nose and glasses and long coat à la Scooby Doo, and the wolves run right by without noticing. Second, the bacteria can produce an enzyme (a protein that alters small molecules in some way like sugars or drugs) that transforms the antibiotic into an ineffective form. Think of this like the deer using handcuffs to cuff the legs of the wolves together so they can’t run anymore, and thus can’t chase and kill the deer (which are the bacteria if you remember). Third, the bacteria can produce special transporter proteins that pump the antibiotic out of the inside of the cell (the bacterial cell) and away from the vital machinery that the antibiotic is targeting to kill the bacteria. Think of this like the possibility that deer engineers have developed portable wolf catapults. When a wolf gets too close it’s simply catapulted over the trees so it can’t do its evil business (in this case, actually good business because the wolves are the antibiotics, remember?).

Antibiotic resistance and the resistome

The problem addressed in the paper

The problem we address in the paper is related to the third mechanism of resistance- the transporter proteins. There are a number of types of these transporters that can transport several or many different kinds of antibiotics at the same time- thus multi drug resistance transporters. Still, it’s likely that there are a lot of these kinds of proteins out there that we don’t recognize as such- in many cases you can’t just look at the sequence (see the section below) of the protein and figure out what it does.

The point of the paper is to develop a method that can detect these kinds of proteins and look for those beyond what we already know about. The long range view is that this will help us understand better how these kinds of proteins work and possibly suggest ways to block them (using novel antibiotics) to make existing antibiotics more effective again.

An interesting thing that has become clear in the last few years is that environmental bacteria have a large number of different resistance mechanisms to existing antibiotics (and probably to antibiotics we don’t even know about yet). And there are a LOT of environmental bacteria in just about every place on earth. Most of these we don’t know anything about. This has been called the “antibiotic resistome” meaning that it’s a vast reservoir of unknown potential for resistance that can be transferred to human pathogens. In the case of the second mechanism of resistance, the enzymes, these likely have evolved since bacteria in these environmental communities are undergoing constant warfare with each other- producing molecules (like antibiotics) that are designed to kill other species. In the case of the third resistance mechanism (the transporters) this could also be true, but these transporters seem to have a lot of other functions too- like ridding the bacteria of harmful molecules that might be in the environment like salts.

Linguistic-based sequence analysis 

The paper uses an approach that was developed in linguistics (study of language) to analyze proteins. This works because the building blocks of proteins (see below) can be viewed as a kind of language, where different combinations of blocks in different orders can give rise to different meanings- that is, different functions for the protein.

The sequence of a protein refers to the fact that proteins are made up of long chains of amino acids. Amino acids are just building block molecules, and there are 20 different kinds that are commonly found in proteins. These 20 different kinds make up an alphabet, and the alphabet is used to “spell” the protein. The list of amino acid letters that represents the protein is its sequence. It’s relatively easy to get the sequences of proteins for many bacteria, but the problem of what these sequences actually do is very much an open one. Proteins with similar sequences often times do similar things. But there are some interesting exceptions to this that I can illustrate using actual letters and sentences.

The first is that similar sequences might have different meanings.

1) “When she looked at the pool Jala realized it was low.”

2) “When she looked at the pool Jala realized she was too slow.”

The second is that very different sentences might have similar meanings.

1) “When he looked at the pool Joe realized it was dirty.”

2) “The dirty pool caught Joe’s attention.”

(these probably aren’t the BEST sentences to illustrate this, if you have better suggestions please let me know)

The multi drug transporters have elements of both problems. There are large families of transporter proteins that are pretty similar in terms of protein sequence- but the proteins actually transport different things (like non-antibiotic molecules), and at this point we can’t just look at the sequences and figure out what they transport for many examples. There are also several families of multi drug transporters that have pretty different sequences between families but all do essentially the same job of transporting several types of drugs.
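The two sentence pairs above can be scored with a plain character-level similarity measure, which is roughly what "just looking at the sequence" amounts to. This is only a sketch using Python's standard `difflib`; it isn't the method from the paper, and real protein comparison uses alignment tools rather than this.

```python
from difflib import SequenceMatcher

def similarity(a, b):
    """Character-level similarity between 0 and 1 (1 means identical)."""
    return SequenceMatcher(None, a, b).ratio()

# Pair 1: nearly identical wording, different meanings
similar_text = similarity(
    "When she looked at the pool Jala realized it was low.",
    "When she looked at the pool Jala realized she was too slow.",
)

# Pair 2: very different wording, similar meanings
similar_meaning = similarity(
    "When he looked at the pool Joe realized it was dirty.",
    "The dirty pool caught Joe's attention.",
)

print(f"pair 1: {similar_text:.2f}   pair 2: {similar_meaning:.2f}")
```

The first pair scores higher than the second even though it's the second pair that shares a meaning, which is the trap a similarity-only approach falls into.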

Linguistics, and especially computational linguistics, has been focused on developing algorithms (computer code) to interpret language into meaning. The approach we use in the paper, called PILGram, does exactly this and has been applied to interpretation of natural (English) language for other projects. We just adapted it somewhat so that it would work on protein sequences. Then we trained the method (since the method learns by example) on a set of proteins where we know the answer- previously identified multi drug transporters. After this was trained and we evaluated how well it could do its intended job (that is, taking protein sequences and figuring out if they are multi drug transporters or not) we let it loose on a large set of proteins from bacteria in a very salty lake in northern Washington state called Hot Lake.

What we found

First we found that the linguistic-based method did pretty well on some protein sequence problems where we already knew what the answer was. These problems were PROSITE patterns, which come from a database where scientists have spent a lot of effort figuring out protein motifs (like figures of speech in language that always mean the same thing) for a whole collection of different protein functions. PILGram was able to do pretty well (though not perfectly) at figuring out what those motifs were- even though we didn’t spend any time on looking through the protein sequences, which is what PROSITE did. So that was good.

We then showed that the method could predict multi drug resistance transporters, a set of proteins where a common motif isn’t known. Again, it does fairly well – not perfect but much better than existing ways of doing this. We evaluated how well it did by pretending we didn’t know the answers for a set of proteins when we actually do know the answer- this is called ‘holding out’ some of the data. The trained method (trained on the set of proteins we didn’t hold out) was then used to predict whether or not the held out proteins were multi drug transporters and we could evaluate how well it did by comparing with the real answers.
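Here's a minimal sketch of what 'holding out' looks like in code. Everything in it is an assumption for illustration: the sequences are made up, and the "learner" is a toy that just remembers three-letter chunks seen only in the positive examples. It is a stand-in, not PILGram.

```python
import random

# MADE-UP toy data: (sequence, is_multidrug_transporter)
EXAMPLES = [
    ("MKTWLGAVLL", True), ("MSWLGIFAPT", True),
    ("MAQWLGLLVS", True), ("MTWLGSSIAL", True),
    ("MKKPEDRTQS", False), ("MSDEQRKNHT", False),
    ("MAEEKRDPQN", False), ("MTNQHDEKRS", False),
]

def split_holdout(examples, holdout_fraction=0.25, seed=0):
    """Shuffle the labelled examples and set a fraction aside as held-out data."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    n_holdout = max(1, int(len(shuffled) * holdout_fraction))
    return shuffled[n_holdout:], shuffled[:n_holdout]

def kmers(sequence, k=3):
    """All overlapping chunks of length k in the sequence."""
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}

def train(training_set, k=3):
    """Toy learner: keep the k-mers that occur only in positive examples."""
    positive, negative = set(), set()
    for sequence, label in training_set:
        (positive if label else negative).update(kmers(sequence, k))
    return positive - negative

def predict(model, sequence, k=3):
    """Call a sequence a transporter if it contains any learned k-mer."""
    return bool(model & kmers(sequence, k))

def evaluate(model, heldout_set):
    """Fraction of held-out examples where the prediction matches the real answer."""
    correct = sum(predict(model, seq) == label for seq, label in heldout_set)
    return correct / len(heldout_set)

training, heldout = split_holdout(EXAMPLES)
model = train(training)          # the model never sees the held-out proteins
accuracy = evaluate(model, heldout)
print(f"held-out accuracy: {accuracy:.2f}")
```

The important part is the order of operations: split first, train only on what's left, then score against the answers that were hidden. The toy data is contrived so the toy learner gets it right, which says nothing about real proteins.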

Finally, we found that the method identified a number of likely looking candidate multi drug transporters from the Hot Lake community proteins and we listed a few of these candidates.

The next step will be to look at these candidates in the lab and see if they actually are multi drug transporters or not. This step is called “validation”. If they are (or at least one or two are) then that’s good for the method- it says that the method can actually predict something useful. If not then we’ll have to refine the method further and try to do better (though a negative result in a limited validation doesn’t necessarily mean that the method doesn’t work). This step, along with a number of computational improvements to the method, is what I proposed in the grant I just submitted. So if I get the funding I get to do this fun stuff.

More information

Proposal gambit – Update 1

Last week I posted about my strategy for a proposal I’m just submitting. Pretty simple really, just using a publication in a post-publication peer review journal (F1000 Research) as the crucial piece of my preliminary data in my grant. Here’s an update on the process.

So, if you’re going to predicate an R01 submission on having a citation to a paper with a crucial set of preliminary data in it… don’t leave it until the last minute. I submitted my paper to F1000 Research on Thursday (one week prior to the submission date for my grant). They responded very quickly – next day, with requests for some minor changes and to send the figures separately (I had included them in the document). No problems, but then the weekend came up and I ended up getting everything back to them on Sunday evening. Fine. Monday came and went and I didn’t have a link. Also on Monday I was surprised because I was erroneously told that I had to have the absolute final version of my grant to our grants and contracts office that day. With no citation. I scrambled to make myself an arXiv account so that I could get it out that way (a good thing in any case). But turns out it was incorrect and I could still make minor modifications after that.

So yesterday (Tuesday) I pinged F1000 Research, politely and with acknowledgment that this was a short turnaround time, and mentioned that I wanted to put the citation in the grant. They replied on Wednesday morning apologizing for the delay (nice, but there was no delay- I was really trying to push things fast) and saying that the formatted version should be ready in a couple of days and GIVING ME A DOI for the paper! Perfect. That’s what I really needed to include in the grant.

So today the updated grant was actually submitted- a whole day early, probably a first. Now it’s just a matter of settling in until June when it will be reviewed. Of course, I still need to get my paper reviewed, but I think that won’t be a huge problem.

Overall this process is going swimmingly. And I’ve been really pleased with my interactions with F1000 Research so far.

Proposal gambit

I am currently (this minute… well, not THIS minute, but just a minute ago, and in a minute) in the throes of revising a resubmission of a previously submitted R01 proposal to NIH. This proposal generally covers novel methods to build protein-sequence-based classifiers for problematic functional classes- that is, groups of proteins that have a shared function but either are very divergent in their sequence (meaning that they can’t be associated by traditional sequence similarity approaches) or have a lot of similar sequences with divergent functions (and the function that’s interesting can’t be easily disambiguated).

I got good feedback from reviewers on the previous version (though I did not get discussed- for those who aren’t familiar with the process, to get a score- and thus a chance at funding- your grant has to be in the top 50% of the grants that the review panel reads, then it moves on to actual discussion in the panel and scoring). Their main complaint was that I had not described the novel method I was proposing in sufficient detail, and so they were intrigued but couldn’t assess if this would really work or not. The format of NIH R01-level grants (12 pages for the research part) means that to provide details of methods you really need to have published your preliminary results. Also- if it’s published it really lends weight to the fact that you can do it and get it through peer review (or pay your way into a publication in a fly-by-night journal).

So anyway. I’ve put this resubmission off since last year and I’m not getting any younger and I don’t have a publication to reference on the method in the proposal yet. So here’s my gambit. I’ve been working on the paper that will provide preliminary data and it was really nearly finished; it just needed a good push to get it finalized, which came in the form of this grant. My plan is to finish up the last couple of details on the paper and submit it to F1000 Research because it offers online publication immediately with subsequent peer review. I’ve been intrigued by this emerging model recently and wanted to try it anyway. But this allows me to reference the online version very soon after I upload it (maybe tomorrow) and include it as a bona fide citation for my grant. The idea is that by the time it’s reviewed (3 months hence) it will have passed peer review and will be an actual citation.

But it’s a gambit. It’s possible that the paper will still be under review or will have received harsh reviews by the time the reviewers look at it. It’s also possible that since I won’t have a traditional journal citation in text for the proposal- I’ll need to supply a URL to my online version- that the reviewers will just frown on this whole idea and it might even piss them off making them think I’m trying to get away with something (which I totally am, though it’s not unethical or against the rules in any way that I can see). However, I’m pretty sure that this is a lot more common on the CS side (preprint servers, and the like) so I’m betting on that flying.

Anyway, I’ll have an update in 3+ months on how this worked out for me. I actually have high hopes for this proposal- which does scare me a little. But I’m totally used to dealing with rejection, as I’ve mentioned before on numerous occasions. Wish me luck!


Well, there probably ARE some exceptions here.

So I first thought of this as a funny way of expressing relief over a paper being accepted that was a real pain to get finished. But after I thought about the general idea awhile I actually think it’s got some merit in science. Academic publication is not about publishing airtight studies with every possibility examined and every loose end or unconstrained variable nailed down. It can’t be. That would limit scientific productivity to zero because it’s not possible. Science is an evolving dialogue, some of it involving elements of the truth.

The dirty little secret (or elegant grand framework, depending on your perspective) of research is that science is not about finding the truth. It’s about moving our understanding closer to the truth. Oftentimes that involves false positive observations- not because of the misconduct of science but because of its proper conduct. You should never publish junk or anything that’s deliberately misleading. But you can’t help publishing things that sometimes move us further away from the truth. The idea in science is that these erroneous findings will be corrected by further iterations and may even provide an impetus for driving studies that advance science. So publish away!

The $1000 paper

[Updated 11/2/2014 with green open access and NIH PubMed central caveats]

Anyone familiar with the debate around open access scientific journals knows that it can be expensive to publish your work there (see this list of some publication charges). In one model of open access publication the cost is shifted to the authors, who are usually funding publications from their grant money, and those charges can be in the thousands of US dollars per paper. The Public Library of Science (PLoS) journals charge between $1300 and $2900 per article, though they have a program for partial to full coverage of these charges. The result is that anyone can access, download, and read the paper free of charge opening up the research to a much wider audience.

During the Twitter discussion of alternative scientific metrics spawned by the so-called “Kardashian index” paper (see my post here) some metrics regarding publishing were suggested. One that was suggested to me (though unfortunately who suggested it is now lost in my Twitter feed- sorry) was to create a metric that calculated how expensive a paper would be to read, if you didn’t have institutional or other subscriptions to the publishers.

Here are the assumptions used:

  1. No access to any subscriptions
  2. You would purchase every paper/chapter cited in the paper
  3. You would pay non-student prices (where applicable)
  4. You’d buy the book if you couldn’t purchase individual chapters
  5. Updated! Pointed out by  that I forgot one very important caveat. Many of these for-pay papers may be available as “green open access” (self archiving their own publications) or by requirements such as those imposed by the NIH that require deposition of papers in the PubMed Central repository.

This is actually an interesting idea – and it’s only taken me about 5 months to get to it but I calculated numbers for three papers (see Table and full spreadsheet here).

The bottom line is that it would be EXPENSIVE to read a single paper this way: over $750 for each paper, and over $1000 for two of the three (with the caveat that I’ve only looked at 3 papers total).

Table showing cost of citations for three papers

| Paper | Journal | Total citations | Open access (OA) citations | Average cost per citation | Total cost |
|---|---|---|---|---|---|
| Paper 1 | PLoS Computational Biology | 37 | 5 | $38.43 | $1,422.05 |
| Paper 2 | Nature | 27 | 3 | $27.87 | $752.50 |
| Paper 3 | Journal of Bacteriology | 41 | 3 | $38.71 | $1,587.22 |

This has a linear relationship with the number of citations in the paper as demonstrated in this graph (again, small sample size).
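As a rough check on that, here's a sketch that fits an ordinary least-squares line through the three (number of citations, total cost) pairs from the table. Three points can't establish that the relationship really is linear; this only shows what the trend line looks like for these numbers.

```python
# (number of citations, total cost in US dollars) for the three papers in the table
points = [(37, 1422.05), (27, 752.50), (41, 1587.22)]

def least_squares_line(points):
    """Return (slope, intercept) of the ordinary least-squares line through the points."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

slope, intercept = least_squares_line(points)
print(f"about ${slope:.2f} per additional citation (intercept ${intercept:.2f})")
```

The negative intercept is a reminder not to extrapolate from so few papers.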


Of course, this is mostly an academic exercise (like most things I do- I’m an academic) since nobody reads every citation and most people who want to read specific citations would have access to institutional subscriptions. However, it points out a hidden cost to research publication that (I don’t think) is thought about by most researchers.

It would be fairly simple to code up a calculator for this metric given that many journals are published by the same publishers who have pretty consistent pricing. But I’ve got to get back to work now and publish more papers.
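For what it's worth, a bare-bones version of that calculator might look like the sketch below. The publisher names and prices are hypothetical placeholders, not real pricing, and the `Citation` class and `cost_to_read` function are just names made up for this example. Open access citations cost nothing, and the average is taken over all citations, which is how the averages in the table work out.

```python
from dataclasses import dataclass

# HYPOTHETICAL per-article prices in US dollars, keyed by publisher.
# Real prices would have to be looked up for each publisher.
PRICE_BY_PUBLISHER = {
    "Publisher A": 35.00,
    "Publisher B": 42.00,
}

@dataclass
class Citation:
    publisher: str
    open_access: bool = False

def cost_to_read(citations, prices):
    """Return (total cost, average cost per citation) for a reference list.

    Assumes no subscriptions: every non-open-access citation is bought
    at the publisher's per-article price.
    """
    total = 0.0
    for citation in citations:
        if citation.open_access:
            continue  # free to read
        total += prices[citation.publisher]
    average = total / len(citations) if citations else 0.0
    return total, average

references = [
    Citation("Publisher A"),
    Citation("Publisher A"),
    Citation("Publisher B"),
    Citation("Publisher B", open_access=True),
]
total, average = cost_to_read(references, PRICE_BY_PUBLISHER)
print(f"total ${total:.2f}, average ${average:.2f} per citation")
```

A fuller version would also need to handle books and chapters (assumption 4 in the list above) and check for green open access copies, which this sketch ignores.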