The process of how I compose a computational biology study, execute it, and write it up seems to follow a kind of set pattern, for the relatively simple projects anyway. So I thought I’d blog about this process as it happens.
I debated several takes on this:
- Blogging as it’s happening with full technical details
- Blogging a journal that I could then release after the paper was finished and submitted
- Blogging as it’s happening but not talking about the technical details
The first is appealing, but probably wouldn’t go over well with my employer, and it is a simple idea that someone else could pick up and run with. I’m not paranoid, but in this case it might be too enticing for someone with the right skills. The second seems like not as much fun. So I’m opting for the third option and will be blogging about what I’m doing generally, but not giving specifics on the science, algorithms, or data I’m working with.
Background on the project
I’m not starting cold on this project. I came up with the idea last year but haven’t had time to implement it until now. It’s a dirt simple extension of an existing method that has the potential to be very interesting. I have the problem and data in hand to work on it. Last year we implemented a parallel version of a prototype of the algorithm. Now that I can actually work on it I can see a clear path to a finished project: a submitted paper, or possibly inclusion as part of a larger paper.
Started out by revisiting the idea: thinking about it and doing some PubMed searches. I just wanted to make sure that it hadn’t been done by anyone, especially the groups that developed the original algorithm. Nothing seems to be there, which is good because, as I said, it’s dirt simple.
Mid-day talked myself out of the idea in its original form: it can’t work as simply as I’d thought.
Relayed my thoughts to my post-doc, who reassured me that it was actually that simple and we could do it the way I originally envisioned. He was right. We talked about the statistics and algorithms for it for a while.
Got my old code working again. Revised the core bits to handle the new idea. Actually ran some data through on my laptop using a very limited dataset. Looks like it works! So fun to actually be coding again and not just writing papers, grants, emails, or notes. Opened a blank Word document to do some writing. *sigh*
Decided on a tentative title (which will change) and a tentative author list: myself, the post-doc I talked with about it, the programmer who coded the parallel version previously, a post-doc who hasn’t worked on it yet but probably will, and a senior domain expert. Yes, I’m doing this very early on. But as I said, there’s a clear path from here to a paper, so it’s not too early.
More testing on the prototype code to make sure that it’s behaving as I think it should. Also coded up an alternative data pre-processing step that seems to be a good idea. Comparing results from the two pre-processing methods shows that they give different answers. I’ll have to iron that one out later when working with the real datasets.
Figured out the plan for the project, at least in broad strokes. Run on the complete dataset, implement a random permutation strategy to estimate the false discovery rate, break up the dataset and show how the method works on individual parts of it (this is specific to the problem), find another dataset for validation, write it up. Yes, it’s just that simple.
Discussed an additional very interesting strategy with post-doc number 1 that will really add novelty and hopefully value to the study. Also discussed the permutation strategy in some detail. That will be really important to demonstrate that this actually works.
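For readers who haven’t run into permutation-based false discovery rate estimates before, here is a minimal generic sketch of the idea in Python. It is not the method from this project (I’m not sharing those specifics). The scoring function, the two-group labels, and every name in it (`score_features`, `permutation_fdr`, `threshold`) are assumptions made up purely for illustration. The idea: shuffle the labels many times, count how many features pass the threshold by chance alone, and compare that to how many pass with the real labels.

```python
import random
from statistics import mean


def score_features(rows, labels):
    """Hypothetical score: absolute difference in group means per feature.

    rows   -- list of features, each a list of values (one per sample)
    labels -- list of 0/1 group labels, one per sample
    """
    scores = []
    for values in rows:
        group_a = [v for v, lab in zip(values, labels) if lab == 0]
        group_b = [v for v, lab in zip(values, labels) if lab == 1]
        scores.append(abs(mean(group_a) - mean(group_b)))
    return scores


def permutation_fdr(rows, labels, threshold, n_permutations=200, seed=0):
    """Estimate the FDR at `threshold` using label shuffling as the null."""
    rng = random.Random(seed)
    observed_hits = sum(s >= threshold for s in score_features(rows, labels))
    if observed_hits == 0:
        return 0.0  # nothing called significant, so no false discoveries

    shuffled = list(labels)
    null_hits = []
    for _ in range(n_permutations):
        rng.shuffle(shuffled)  # breaks any real link between values and labels
        null_scores = score_features(rows, shuffled)
        null_hits.append(sum(s >= threshold for s in null_scores))

    # Average chance hits per run divided by real hits, capped at 1.
    return min(1.0, mean(null_hits) / observed_hits)


if __name__ == "__main__":
    labels = [0, 0, 0, 0, 1, 1, 1, 1]
    rows = [
        [0, 0, 0, 0, 10, 10, 10, 10],  # toy feature that tracks the labels
        [1, 2, 1, 2, 1, 2, 1, 2],      # toy feature that does not
    ]
    print(permutation_fdr(rows, labels, threshold=5))
```

With only eight samples the toy estimate is very coarse. In a real analysis the number of permutations, what exactly gets shuffled, and the choice of score all depend on the structure of the data.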
Spent most of the day revising the code for the parallel implementation to incorporate the new ideas and testing it out on our cluster to see if it works. Slow progress, but finally got the entire thing to run! I did a couple of test runs using a limited dataset and only running on 2 nodes. When those worked I did the whole shebang. Finished in about an hour on 60 nodes, which is really pretty impressive given what it’s doing. Definitely a win!
Now to work on putting some words down for the Introduction section. I also like to outline the results section by generally writing about how I think it will go in a glorified outline. I’ve posted about this process previously here.