Journaling a Computational Biology Project: Part 3

Day 4 (link to previous post)

It figures that the week I decide to return to using the cluster (the PIC, in case you’re interested) is the week that they have to shut it down for construction. So ran no more permutations today- that’ll have to wait until next week.

Didn’t really do any other work on the paper or project today either- busy doing other things. So not much to report today actually. I did talk a bit about the results with my post-doc at our semi-weekly MSFAB (Mad Scientist Friday Afternoon Beer). We both agreed that the permutation test was a good idea and possibly the only way to get an estimate of real false discovery rates. Along these lines, as I reported yesterday, the first round of permutations returned some fairly significant results. These actually exceeded the Bonferroni-corrected p-values I was getting, which are supposed to tell you essentially the same thing. So it seems in this case that Bonferroni, generally a conservative multiple hypothesis correction, was not conservative enough. Good lesson to remember.
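To make that comparison concrete, here is a rough sketch of the check. The p-values below are made up for illustration (they are not from the project), and the function names are my own. The idea is just that if a run with shuffled labels still produces hits under the Bonferroni cutoff, the cutoff is not as conservative as it should be for this data.

```python
def bonferroni_threshold(alpha, n_tests):
    """Per-test p-value cutoff that controls the family-wise error rate at alpha."""
    return alpha / n_tests


def n_significant(p_values, cutoff):
    """Number of tests whose p-value is at or below the cutoff."""
    return sum(p <= cutoff for p in p_values)


# Made-up numbers for illustration only (assumption, not project data).
n_tests = 1000
cutoff = bonferroni_threshold(0.05, n_tests)

real_p = [1e-9, 3e-7, 2e-5, 0.004, 0.2]      # p-values with the true labels
permuted_p = [4e-6, 8e-5, 0.03, 0.4, 0.9]    # p-values from one shuffled-label run

real_hits = n_significant(real_p, cutoff)
null_hits = n_significant(permuted_p, cutoff)

print(f"Bonferroni cutoff: {cutoff:.2e}")
print(f"hits with real labels: {real_hits}, hits with shuffled labels: {null_hits}")
```

Any hits in the shuffled-label run are false positives by construction, so they are a direct warning sign about the threshold.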

Journaling a computational biology project: Part 2

Day 3 (link to my previous entry)

Uh-oh- roadblock. Remember how I was saying this project was dirt simple?

It's just THIS simple. This has to work- there's no WAY it could fail.

This came much faster than I thought it would. I’ve got actual data and I have to figure out if there’s a story there. Or rather, where the story is. The results from my large-scale parallel run are interesting, but I’m not sure they clearly demonstrate how this approach is better than previous approaches. Also, I had to rerun the whole thing to get all the results- turns out I was only capturing about 1/5th of them- but the end problem was the same. The results are very significant, but not head and shoulders above previous results, and don’t really demonstrate what I was hoping they would. Strongly anyway. Time for some thinkin. Never as dirt simple as I think it will be to start with.

Down, down, down, down...

Anyway, pushing onwards doing permutations. The question here is how likely I would be to see the scores I’m getting just by chance alone. So I permute the labels on my data and run the thing a few times with random labels. The permutation is done on the sample level- the data I’m using is from observations under condition 1 and condition 2- and I have multiple observations from each condition. So to permute I just randomize which observations I’m saying are from condition 1 and condition 2.

I’ve done the first couple of randomized runs and they’re actually coming up with some reasonably significant results. This means that I’ll have to compare the random scores with my real scores in order to establish a false discovery rate, which I can then use as a threshold for reporting.
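Since I’m not giving details of the actual algorithm, here is only a minimal sketch of the permutation idea in Python. The scoring function (absolute difference in condition means per feature) is a stand-in I made up for illustration, as are all the function names and the toy data. The parts that mirror what I described are the sample-level label shuffle and the comparison of random scores against real scores to estimate a false discovery rate at a given threshold.

```python
import random
from statistics import mean


def score_features(data, labels):
    """Stand-in score (assumption): |mean(cond 1) - mean(cond 2)| per feature.

    data[i][j] is the value of feature j in observation i;
    labels[i] is 1 or 2, the condition assigned to observation i.
    """
    scores = []
    for j in range(len(data[0])):
        g1 = [row[j] for row, lab in zip(data, labels) if lab == 1]
        g2 = [row[j] for row, lab in zip(data, labels) if lab == 2]
        scores.append(abs(mean(g1) - mean(g2)))
    return scores


def permuted_scores(data, labels, n_perm, rng):
    """Re-score the data n_perm times with the sample labels shuffled."""
    null = []
    for _ in range(n_perm):
        shuffled = labels[:]
        rng.shuffle(shuffled)  # sample-level permutation; group sizes are preserved
        null.append(score_features(data, shuffled))
    return null


def empirical_fdr(real, null, threshold):
    """Average number of null hits per permutation divided by the real hits."""
    real_hits = sum(s >= threshold for s in real)
    if real_hits == 0:
        return 0.0
    null_hits = sum(s >= threshold for run in null for s in run) / len(null)
    return min(1.0, null_hits / real_hits)


# Toy data: 6 observations per condition, 20 features, feature 0 truly shifted.
rng = random.Random(0)
labels = [1] * 6 + [2] * 6
data = [[rng.gauss(0, 1) for _ in range(20)] for _ in labels]
for row, lab in zip(data, labels):
    if lab == 2:
        row[0] += 5.0

real = score_features(data, labels)
null = permuted_scores(data, labels, n_perm=50, rng=rng)
fdr = empirical_fdr(real, null, threshold=2.0)
print(f"estimated FDR at threshold 2.0: {fdr:.3f}")
```

In practice you would scan over thresholds and report at the most permissive one that keeps the estimated FDR under whatever level you are comfortable with. More permutations give a steadier estimate, at the cost of more cluster time.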

I’ve also started to put things into a kind of an outline for the paper. Here’s what I’ve got so far- I’ve taken the details of what I’m doing out for blogging purposes- but you get the idea:


  1. General background about the problem we’re developing our method on
  2. Description of the previous algorithm, what it offers and what is the gap that our approach will fill
  3. Specific details about the data set we’re using
  4. Summary of what our approach is and what results we’ll be presenting in the paper


  1. First apply the previous algorithm on our data (this hasn’t been done). Possibly validate on an external dataset
  2. Show how our algorithm improves results over previous
  3. Add in the extra idea we came up with that will also be a novel twist on the approach
  4. Show what kind of biological information can be derived from these new approaches. This is really open at this point since I’m not really sure what I’ll get yet. But preparing for it and thinking about it so writing it down.
  5. Validation on an external dataset (i.e. a different one from the one I’m using)- maybe. This might be difficult to impossible.

Journaling a computational biology study: Part 1

The process of how I compose a computational biology study, execute it, and write it up seems to follow a kind of set pattern, for the relatively simple projects anyway. So I thought I’d blog about this process as it happens.

I debated on several takes on this:

  1. Blogging as it’s happening with full technical details
  2. Blogging a journal that I could then release after the paper was finished and submitted
  3. Blogging as it’s happening but not talking about the technical details

The first is appealing, but probably wouldn’t go over well with my employer- and it is a simple idea that someone else could pick up and run with. I’m not paranoid, but in this case it might be too enticing for someone with the right skills. The second seems like not as much fun. So I’m opting for the third option and will be blogging about what I’m doing generally, but not giving specifics on the science, algorithms, or data I’m working with.

Background on the project

I’m not starting cold on this project. I came up with the idea last year but haven’t had time to implement it until now. It’s a dirt simple extension of an existing method that has the potential to be very interesting. I have the problem and data in hand to work on it. Last year we implemented a parallel version of a prototype of the algorithm. Now that I can actually work on it I can see a clear path to a finished project- being a submitted paper, or possibly inclusion as a part of a larger paper.

Day 1

Started out by revisiting the idea. Thinking about it and doing some PubMed searches. I just wanted to make sure that it hadn’t been done by anyone, especially the groups that developed the original algorithm. Nothing seems to be there- which is good, because as I said- it’s dirt simple.

Mid-day talked myself out of the idea in its original form- it can’t work as simply as I’d thought.

Relayed my thoughts to my post-doc, who reassured me that it was actually that simple and we could do it the way I originally envisioned. He was right. We talked about the statistics and algorithms for it for a while.

Got my old code working again. Revised the core bits to handle the new idea. Actually ran some data through on my laptop using a very limited dataset. Looks like it works! So fun to actually be coding again and not just writing papers, grants, emails, or notes. Opened a blank Word document to do some writing. *sigh*

Decided on a tentative title (which will change) and a tentative author list. Myself, the post-doc who I talked with about it, the programmer who coded the parallel version previously, a post-doc who hasn’t worked on it yet, but probably will, and a senior domain expert. Yes, I’m doing this very early on. But as I said, there’s a clear path from here to a paper- it’s not too early.

Day 2

More testing on the prototype code to make sure that it’s behaving as I think it should. Also coded up an alternative data pre-processing step that seems to be a good idea. Comparing results from the two pre-processing methods showed that they give different answers. I’ll have to iron that one out later when working with the real datasets.

Figured out the plan for the project- at least in broad strokes. Run on complete dataset, implement a random permutation strategy to estimate false discovery rate, break up dataset and show how the method works on individual parts of it (this is specific to the problem), find another dataset for validation, write it up. Yes, it’s just that simple.

Discussed an additional very interesting strategy with post-doc number 1 that will really add novelty and hopefully value to the study. Also discussed the permutation strategy in some detail. That will be really important to demonstrate that this actually works.

Spent most of the day revising the code for the parallel implementation to get the new ideas and testing it out on our cluster to see if it works. Slow progress, but finally got the entire thing to run! I did a couple of test runs using a limited dataset and only running on 2 nodes. When those worked I did the whole shebang. Finished in about an hour on 60 nodes, which is really pretty impressive given what it’s doing. Definitely a win!

Now to work on putting some words down for the Introduction section. I also like to outline the results section by generally writing about how I think it will go in a glorified outline. I’ve posted about this process previously here.



Leading a collaborative scientific paper: My tips on cat herding

Large collaborative research projects, centers, or consortia have a single goal: to be funded for another round. That’s completely cynical, but it is not so far off the truth. The point of these projects is to advance science by bringing together many different experts in many different areas to do more than what could be done in a single R01-size endeavor. If there are no project-wide collaborative papers that come out of this effort going to high-profile journals there will be nothing- or very little- to make the claim that the project was successful. Why not just fund 3-8 R01-sized projects that can work in isolation and accomplish the same thing or more? So publications are important.

The second thing to understand is that there’s no such thing as a ‘group-written’ paper, in my experience. Not truly. Someone always needs to step forward and take ownership of the paper to drive things forward, otherwise it’s dead in the water. Maybe it can be two people, maybe it can be more- I’ve never seen it happen. So someone needs to step forward and be chief cat herder. This is a thankless job, but if it results in a solid, collaborative manuscript it can be very satisfying. Not to mention the fact that you will (or very much SHOULD) have your name first in the author order.

Here’s my metaphor for spearheading such a monster, errrr… paper.

Imagine that you’ve gathered a painter, a sculptor who works in clay, a sculptor who works with metal, and a DJ in a room- actually in many cases they’re not even in the same room, they’re distributed around the country in their own studios. Around the room (or in their studios) you have a canvas and paint, a block of clay, a pile of metal, and a box of vinyl. Your job is to assemble a work of art that incorporates all those elements together, blends them where appropriate, and is clear about how the pieces all fit together. You have a limited time to accomplish this. Art critics will be visiting after you’re finished to evaluate your work. Go.

Here is my list of thoughts on how to approach this kind of problem.

  1. Don’t think of this as a collaborative paper. In all likelihood the actual driving of the paper will be done by one person, and that’s you. If you wait around for everyone to chime in, contribute, and take ownership of their sections, you will never get anything done. If you aren’t the leader of the paper, but the leader isn’t leading, it MAY be possible to just start the process and take leadership. This can be politically dangerous and really depends on the specifics of the project and collaborations, but it’s something to keep in mind. You could be a hero.
  2. Think of this as a collaborative paper. This is a collaborative effort. I realize that this is directly contradictory to my first point. However, it is very important that you don’t lose sight of the fact that you are not the expert in many areas of the paper that you have to put together. Make use of others’ expertise, but try to frame this as direct requests for input on well-defined portions.
  3. Have a basic understanding of each component. This is really important. Everyone has different expertise and you will not become an expert in a new area by writing a paper. Don’t try. But if there are things that you really are not familiar with that need to go into the paper, brush up on them by reading (actually reading from start to finish) previous papers from the group or current review articles in the area. This will allow you to understand at least where the collaborator is coming from and what they can offer.
  4. Don’t overload collaborators with many outlines and drafts. This will only make your collaborators stop paying attention. Instead try to put out one or two outlines, with discussion (teleconference or in person) between. Also with the draft, work with individuals to get portions completed instead of doing everything in multiple rounds of drafts that are commented on by everyone.
  5. Choose a way of collaborating on writing and communicate it to contributors. If you use MS Word for drafts make sure everyone has the “Track Changes” option turned on. Otherwise it’s a nightmare to figure out what parts have been changed. Part of your job will be to manually merge all these changes into a single document. This is a tremendous pain in the ass, but it allows you to evaluate all contributions and make decisions about what to include or how things should be worded. Google Docs seems to work well for producing drafts collaboratively, but at some point the draft should be moved to a single document for finalization.
  6. At the early stages include, don’t exclude. Welcome everyone’s input and suggestions. At some point it may be necessary to make hard decisions about directions of the paper and that may make people unhappy. That’s something you have to live with- but try to listen to the group about these decisions. If there are people with suggestions on more work to do (either experiments/analysis or writing) and their suggestions seem reasonable, make it clear that it’s up to them to carry through with the actual work and try to get a timeline from them for completion. If their piece is essential to the project make sure that you have a plan for extracting this from them- there’s probably a nicer way to put this, but that’s the idea.
  7. At the later stages don’t let newcomers (or others) distract from the plan. If they have really great suggestions, listen to them. If their suggestions seem to distract from the story you are telling fall back on the, “well that’s a great idea, why don’t you investigate that and we can include it if the reviewers request it”- that is, after submission and review.
  8. Have a strategy to create the story you’re going to tell. It can be very difficult to start on a paper cold, when there’s only been discussion about what should be done. A reasonable approach is to do some preliminary analysis yourself then take this to the larger group for input. Make it clear that this is only one possible path and that you’re just trying to promote discussion. Make sure you’re telling a story- this is actually what a scientific paper is about. Be flexible about what the story is. It has to be consistent with the data available- but you may choose to incorporate portions of the results and leave out others that do not help the story along. See also my post on how to write a scientific paper.
  9. Try to avoid redundant effort. Generally this isn’t an issue because everyone is an expert in different areas so the actual work shouldn’t be redundant. Sometimes data analysis needs to be defined to avoid redundancy. If there are large sections to be written (such as an Introduction) it’s better to break it into smaller bits for different people to work on and call this out in the outline or draft so people are clear on who’s doing what. Everyone can revise/comment on all sections toward the end and that’s easier to merge than two disparate documents that are trying to talk about the same thing.
  10. Navigate author order and authorship carefully. This is tremendously important for most people on the project. The critical positions to identify are first author and last author (for biology papers anyway). If you are leading the paper you should be first author, but always remember that for many journals you can specify two or even three ‘first’ authors. For this kind of paper that might be necessary. Don’t try to limit authorship too much. These kinds of papers will have lots of authors. But try to be consistent; if you accept suggestions from everyone’s groups wholesale, it can cause conflicts. Consider that one group might consider technicians who performed the work to be worthy of authorship. If you say OK to this the other groups may chime in with all their technicians, etc. Follow the rules of authorship that you feel comfortable with and believe are ethically consistent, but remember that many, many people may have made significant contributions to the paper. This can be one of the most politically treacherous portions of the paper- have fun!
  11. Find a champion. Identify a senior author who you can communicate with and who you believe will support your positions, or at least will listen to your positions. There may arise situations that require having someone with authority agreeing with you to get others to fall in line.

Finally, here’s an example of a large collaborative research paper that I’ve recently published. It didn’t turn out quite as grand as I’d hoped (what paper does?) but it’s still a nice example of integrating the input of many different groups. I am currently working on (leading) at least three more such papers that are in various stages of being completed.

McDermott JE, Shankaran H, Eisfeld AJ, Belisle SE, Neuman G, Li C, McWeeney S, Sabourin C, Kawaoka Y, Katze MG, Waters KM. (2011). Conserved host response to highly pathogenic avian influenza virus infection in human cell culture, mouse and macaque model systems. BMC Systems Biology. 5(1):190.