Journaling a Computational Biology Project: Part 6

Day 23: (link to my previous post in this series)

First a note about where this series is going. I decided to write this series of posts to journal the evolution of a (fairly simple) computational biology project from the start- or close to it- to the end- publication of a paper. For various reasons I mentioned in my first post, I want to be circumspect about the actual method and application. However, I’m currently keeping a separate set of posts, written in real time, mirroring each of the ones I post here. Those posts give details and will have links to the data I’m using. I plan to post them all alongside the originals at the point the paper is submitted (or possibly accepted, haven’t decided). That may even be the location of the supplemental information and data that will accompany the paper. That way you can see what I’m talking about. In the meantime I hope that the *process*, if not the content, will be interesting enough to keep following.

Starting out with my “side project” spawned from my primary project.

Wow. It’s rare when you clean up a bug and find that it improves your results significantly. Really significantly. With the ‘bug’ I mentioned in my last post (which was actually just a misunderstanding of how they had implemented the original algorithm) I was getting essentially nothing. A confusing set of nothing.

It is hard to explain my joy in this working this well. And my ever-present suspicion that something’s wrong. Meep-meep!

After fixing this bug the results are surprisingly good. I’m doing some comparisons between subsets of data and finding that the differences I’m seeing are substantial. And I’m getting a vast improvement over the previous approach- which is exactly what I was hoping for. Very exciting. Last night I ran all the comparisons and put them into a spreadsheet that I can share with collaborators to get their feedback on what it might mean in terms of the biology. It’s like wiping off a dirty window so you can now see into a room filled with interesting things.

On (and back) to my original project

When I last posted about this I had run the initial algorithm on the cluster and started to look at the data, but then realized that I’d need to do permutations to get p-value estimates. I coded this up in the parallel algorithm, using a label permutation approach: the labels are permuted a bunch of times (1,000) for each true calculation in the algorithm, and the permuted results are compared with the true value. This will slow things down a bit, but should result in a defensible score/p-value combination for the algorithm. And I’m running it on a cluster. The slowest part of my initial run was simply writing out all the output files to the central file system- they’re huge.
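
For anyone who hasn’t seen label permutation before, here’s a minimal sketch of the idea in Python- the score function, data, and sample sizes are stand-ins for illustration, not my actual method:

```python
import numpy as np

def permutation_pvalue(score_fn, data, labels, n_perm=1000, seed=None):
    """Estimate a p-value by recomputing the score on label-permuted data."""
    rng = np.random.default_rng(seed)
    observed = score_fn(data, labels)
    null_scores = np.empty(n_perm)
    for i in range(n_perm):
        shuffled = rng.permutation(labels)        # shuffle the condition labels
        null_scores[i] = score_fn(data, shuffled)
    # add-one correction so the estimate is never exactly zero
    return (1 + np.sum(null_scores >= observed)) / (n_perm + 1)

# toy example: difference of group means as the "score"
rng = np.random.default_rng(0)
data = np.concatenate([rng.normal(1.0, 1, 10), rng.normal(0.0, 1, 10)])
labels = np.array([1] * 10 + [0] * 10)
diff_of_means = lambda x, y: x[y == 1].mean() - x[y == 0].mean()
print(permutation_pvalue(diff_of_means, data, labels, seed=1))
```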

Last night I updated the code, then this morning tested it out with a small set of data and two nodes on the cluster. This didn’t work- so I debugged it (a stupid but easily made mistake) and tested again: success! Now I’m running the real thing and waiting for the output. Not sure whether I’ll have time to look it over carefully today.

Afternoon update: nothing yet- the job is still waiting in the queue, so it hasn’t started running. I may call it a day.

Journaling a Computational Biology Project: Part 5

Days 6-22: (link to my previous post in this series)

I said “real time” and I meant real time. This is what happens to projects when there are 4-5 or more other projects vying for attention, plus a bunch of internal strategy planning, plus outside reviews to complete, plus papers to wrap up and submit, plus summer vacation stuff, oh, and plus Twitter stuff and making a t-shirt. You don’t get stuff done for long periods of time.

Anyway, what I have gotten done is to focus on my highly related side project that arose from my initial project. The key here was that I could process the data exactly the same way, then just use it in a slightly modified version of another existing algorithm, and voilà! Magic happens.

Of course this wasn’t how things shook out. I did the data processing and ran the algorithm and found that the results weren’t really that enlightening. In fact they looked a bit ho-hum. Suspiciously so, actually. So there are two alternatives: alternative 1 is that the method is functioning perfectly, the biology just is like this, and I’m not going to get very much from this approach; alternative 2 is that the method is not working for some reason. To test this I ran a subset of the data through both the existing method and my new extended method. For reasons having to do with the particular method I’m using and how I’m extending it, the two runs should yield exactly the same results.
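
In case it helps to picture it, the sanity check boils down to something like this- with hypothetical `existing_method` and `extended_method` functions standing in for the real ones:

```python
import numpy as np

def check_reduces_to_original(existing_method, extended_method, control_subset):
    """On a subset where the extension should change nothing, the extended
    method ought to reproduce the original method's scores."""
    assert np.allclose(existing_method(control_subset),
                       extended_method(control_subset)), \
        "extended method disagrees with the original on the control subset"
```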

Which do you think was the case? Well, of course, results from the two runs look different and it’s alternative 2. I messed something up or my thinking about the problem was wrong. This morning I sat down for 5 minutes (that’s how long this process took me, really) and read the methods section of the paper describing the method I’m trying to extend. Actually I just read the first paragraph of the methods section. Ohhhhhhhhhhh. Go figure. It actually HELPS to read how they implemented the existing method in the first place <facepalm>.

Facepalm. Old school.

So now I’m recoding and rerunning- hoping that this improves things. The way they did it in the first place is elegantly simple, so it works with my extensions, no problem. I’ll see. And you will too in the next installment. Which could be tomorrow. Or in December. Who knows?

Journaling a Computational Biology Project: Part 4

Day 5 (link to my previous post in this series)

So the cluster was back online at 11:45 this morning, meaning that I could restart my false-discovery rate estimation with more permutations. However, I decided to take a short detour to quickly implement a related idea that came up while I was coding up this one. It involves a similar data preprocessing step, so I modularized my code (making the data preprocessing more independent from the rest of the code). After I ran

Eek! A BUG!

the code I realized I had a big ole stinkin’ BUG that meant I wasn’t really preprocessing at all. This was bad: I should’ve seen it before, and it cost me a bunch of time on the cluster running fairly meaningless code (200 CPU hours or so). It was also very good: it means that the so-so results I was getting previously may have been because the algorithm wasn’t actually working correctly (but in a way that was impossible to see from the end results).

So, optimistically, I started the job again- but it’s going to run overnight, so I won’t know the answer until tomorrow (and you won’t either). In any case I got the side-track idea working, though it’s still waiting on the addition of one thing to make it really interesting. Lesson for the day: check your code carefully! (This is only about the 350th time I’ve learned this lesson, believe me.)

Journaling a Computational Biology Project: Part 3

Day 4 (link to previous post)

It figures that the week I decide to return to using the cluster (the PIC, in case you’re interested) is the week they have to shut it down for construction. So no permutations ran today- that’ll have to wait until next week.

Didn’t really do any other work on the paper or project today either- busy doing other things. So not much to report, actually. I did talk a bit about the results with my post-doc at our semi-weekly MSFAB (Mad Scientist Friday Afternoon Beer). We both agreed that the permutation test was a good idea and possibly the only way to get an estimate of real false discovery rates. Along these lines, as I reported yesterday, the first round of permutations came back with some fairly significant results. These actually exceeded the Bonferroni-corrected p values I was getting, which is supposed to tell you essentially the same thing. So it seems that in this case Bonferroni, generally a conservative multiple-hypothesis correction, was not conservative enough. Good lesson to remember.
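
In case that comparison sounds abstract, here’s a toy sketch of what it amounts to- the distributions are made up purely to illustrate the point, not taken from my data:

```python
import numpy as np
from scipy import stats

alpha, n_tests = 0.05, 1000
# analytic route: the score cutoff Bonferroni implies *if* the null were truly normal
z_cutoff = stats.norm.ppf(1 - alpha / n_tests)

# permutation route: scores from label-permuted runs (here a heavier-tailed toy null)
null_scores = np.random.default_rng(0).standard_t(df=3, size=200_000)

# fraction of permuted-data scores beating the "Bonferroni-significant" cutoff;
# if this is much larger than alpha / n_tests, the correction wasn't conservative enough
print(np.mean(null_scores > z_cutoff), alpha / n_tests)
```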

Journaling a computational biology project: Part 2

Day 3 (link to my previous entry)

Uh-oh- roadblock. Remember how I was saying this project was dirt simple?

It’s just THIS simple. This has to work- there’s no WAY it could fail.

This came much faster than I thought it would. I’ve got actual data and I have to figure out if there’s a story there. Or rather, where the story is. The results from my large-scale parallel run are interesting, but I’m not sure they clearly demonstrate how this approach is better than previous approaches. Also, I had to rerun the whole thing to get all the results- turns out I was only capturing about 1/5th of them- but the end problem was the same. The results are very significant, but not head and shoulders above previous results, and don’t really demonstrate what I was hoping they would. Not strongly, anyway. Time for some thinkin’. Never as dirt simple as I think it will be to start with.

Down, down, down, down…

Anyway, pushing onwards doing permutations. The question here is how likely I would be to see the scores I’m getting just by chance alone. So I permute the labels on my data and run the thing a few times with random labels. The permutation is done at the sample level- the data I’m using is from observations under condition 1 and condition 2, and I have multiple observations from each condition. So to permute I just randomize which observations I’m saying are from condition 1 and which from condition 2.

I’ve done the first couple of randomized runs and they’re actually coming up with some reasonably significant results. This means that I’ll have to compare the random scores with my real scores in order to establish a false discovery rate, which I can then use as a threshold for reporting.
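
Roughly, that false-discovery-rate estimate will look something like the sketch below- the scores here are toy numbers, not mine:

```python
import numpy as np

def empirical_fdr(real_scores, null_scores, threshold, n_perm):
    """Expected hits per permuted dataset divided by real hits at a threshold."""
    real_hits = np.sum(real_scores >= threshold)
    expected_false = np.sum(null_scores >= threshold) / n_perm
    return expected_false / max(real_hits, 1)

# toy data: 50 "real" effects buried in 950 nulls, 100 permuted datasets pooled
rng = np.random.default_rng(0)
n_perm = 100
real_scores = np.concatenate([rng.normal(3, 1, 50), rng.normal(0, 1, 950)])
null_scores = rng.normal(0, 1, 1000 * n_perm)
for t in (2.0, 2.5, 3.0):
    print(t, round(empirical_fdr(real_scores, null_scores, t, n_perm), 3))
```

The reporting threshold would then be the loosest score cutoff that keeps this estimate under whatever false discovery rate I decide is acceptable.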

I’ve also started to put things into a kind of an outline for the paper. Here’s what I’ve got so far- I’ve taken the details of what I’m doing out for blogging purposes- but you get the idea:

Introduction

  1. General background about the problem we’re developing our method on
  2. Description of the previous algorithm, what it offers and what is the gap that our approach will fill
  3. Specific details about the data set we’re using
  4. Summary of what our approach is and what results we’ll be presenting in the paper

Results

  1. First apply the previous algorithm to our data (this hasn’t been done). Possibly validate on an external dataset
  2. Show how our algorithm improves results over previous
  3. Add in the extra idea we came up with that will also be a novel twist on the approach
  4. Show what kind of biological information can be derived from these new approaches. This is really open at this point since I’m not sure yet what I’ll get. But I’m preparing for it and thinking about it, so I’m writing it down.
  5. Validation on an external dataset (i.e. a different one from the one I’m using)- maybe. This might be difficult to impossible.

Journaling a computational biology study: Part 1

The process of how I compose a computational biology study, execute it, and write it up seems to follow a kind of set pattern, for the relatively simple projects anyway. So I thought I’d blog about this process as it happens.

I debated several takes on this:

  1. Blogging as it’s happening with full technical details
  2. Blogging a journal that I could then release after the paper was finished and submitted
  3. Blogging as it’s happening but not talking about the technical details

The first is appealing, but probably wouldn’t go over well with my employer- and it is a simple idea that someone else could pick up and run with. I’m not paranoid, but in this case it might be too enticing for someone with the right skills. The second seems like not as much fun. So I’m opting for the third option and will be blogging about what I’m doing generally, but not giving specifics on the science, algorithms, or data I’m working with.

Background on the project

I’m not starting cold on this project. I came up with the idea last year but haven’t had time to implement it until now. It’s a dirt simple extension of an existing method that has the potential to be very interesting. I have the problem and data in hand to work on it. Last year we implemented a parallel version of a prototype of the algorithm. Now that I can actually work on it, I can see a clear path to a finished project- that being a submitted paper, or possibly inclusion as part of a larger paper.

Day 1

Started out by revisiting the idea. Thinking about it and doing some PubMed searches. I just wanted to make sure that it hadn’t been done by anyone, especially the groups that developed the original algorithm. Nothing seems to be there- which is good, because as I said- it’s dirt simple.

Mid-day I talked myself out of the idea in its original form- it can’t work as simply as I’d thought.

Relayed my thoughts to my post-doc, who reassured me that it was actually that simple and we could do it the way I originally envisioned. He was right. We talked about the statistics and algorithms for it for a while.

Got my old code working again. Revised the core bits to handle the new idea. Actually ran some data through on my laptop using a very limited dataset. Looks like it works! So fun to actually be coding again and not just writing papers, grants, emails, or notes. Opened a blank Word document to do some writing. *sigh*

Decided on a tentative title (which will change) and a tentative author list. Myself, the post-doc who I talked with about it, the programmer who coded the parallel version previously, a post-doc who hasn’t worked on it yet, but probably will, and a senior domain expert. Yes, I’m doing this very early on. But as I said, there’s a clear path from here to a paper- it’s not too early.

Day 2

More testing on the prototype code to make sure that it’s behaving as I think it should. Also coded up an alternative data pre-processing step that seems to be a good idea. Comparing results from both pre-processing methods, I determined that they give different answers. I’ll have to iron that out later when working with the real datasets.

Figured out the plan for the project- at least in broad strokes. Run on complete dataset, implement a random permutation strategy to estimate false discovery rate, break up dataset and show how the method works on individual parts of it (this is specific to the problem), find another dataset for validation, write it up. Yes, it’s just that simple.

Discussed an additional very interesting strategy with post-doc number 1 that will really add novelty and hopefully value to the study. Also discussed the permutation strategy in some detail. That will be really important to demonstrate that this actually works.

Spent most of the day revising the code for the parallel implementation to incorporate the new ideas and testing it out on our cluster to see if it works. Slow progress, but I finally got the entire thing to run! I did a couple of test runs using a limited dataset and only running on 2 nodes. When those worked I did the whole shebang. Finished in about an hour on 60 nodes, which is really pretty impressive given what it’s doing. Definitely a win!

Now to work on putting some words down for the Introduction section. I also like to sketch the Results section ahead of time by writing, in a glorified outline, how I think it will go. I’ve posted about this process previously here.