Day 23: (link to my previous post in this series)
First, a note about where this series is going. I decided to write this series of posts to journal the evolution of a (fairly simple) computational biology project from the start, or close to it, to the end: publication of a paper. For reasons I mentioned in my first post, I want to be circumspect about the actual method and application. However, I'm currently keeping a separate set of posts, mirroring each of these in real time. Those posts give the details and will link to the data I'm using. I plan to post them all alongside the originals once the paper is submitted (or possibly accepted; I haven't decided). That mirror may even become the home of the supplemental information and data that will accompany the paper, so you'll be able to see exactly what I've been talking about. In the meantime, I hope the *process*, if not the content, will be interesting enough to keep following.
Starting out with my “side project” spawned from my primary project.
Wow. It's rare that you fix a bug and find it improves your results significantly. Really significantly. With the 'bug' I mentioned in my last post (which was actually just a misunderstanding of how the original algorithm had been implemented) I was getting essentially nothing. A confusing set of nothing.
After fixing this bug the results are surprisingly good. I'm doing some comparisons between subsets of data and finding that the differences I'm seeing are substantial. And I'm getting a vast improvement over the previous approach, which is exactly what I was hoping for. Very exciting. Last night I ran all the comparisons and put them into a spreadsheet I can share with collaborators to get their feedback on what it might mean biologically. It's like wiping off a dirty window so you can finally see into a room filled with interesting things.
On (and back) to my original project
When I last posted about this I had run the initial algorithm on the cluster and started looking at the data, but then realized I'd need to do permutations to get p-value estimates. I coded this into the parallel algorithm, using a label permutation approach: the labels are shuffled many times (1,000) for each true calculation, and the resulting null scores are compared with the true score. This will slow things down a bit, but should yield a defensible score/p-value combination for the algorithm. And I'm running it on a cluster. The slowest part of my initial run was simply writing all the output files out to the central file system; they're huge.
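(Since I'm keeping the actual method under wraps, here's only the generic shape of a label permutation test. The function and statistic names below are my own illustrations, not from my real code, and the toy mean-difference statistic stands in for whatever the algorithm actually computes.)

```python
import numpy as np

rng = np.random.default_rng(0)

def permutation_pvalue(score_fn, values, labels, n_perm=1000):
    """Empirical p-value by label permutation.

    score_fn(values, labels) -> float is the statistic of interest.
    The labels are shuffled n_perm times to build a null distribution,
    and the true score is compared against it.
    """
    true_score = score_fn(values, labels)
    null = np.empty(n_perm)
    for i in range(n_perm):
        null[i] = score_fn(values, rng.permutation(labels))
    # The +1 in numerator and denominator keeps the estimate away from
    # exactly zero (the smallest reportable p is 1 / (n_perm + 1)).
    p = (1 + np.sum(null >= true_score)) / (1 + n_perm)
    return true_score, p

# Toy example: difference in group means between two labeled groups.
def mean_diff(x, g):
    return x[g == 1].mean() - x[g == 0].mean()

x = np.concatenate([rng.normal(0, 1, 50), rng.normal(1, 1, 50)])
g = np.array([0] * 50 + [1] * 50)
score, p = permutation_pvalue(mean_diff, x, g)
```

Note the trade-off mentioned above: each true calculation now costs 1,001 evaluations of the statistic instead of one, which is why running it in parallel on the cluster matters.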
Last night I updated the code, and this morning I tested it with a small set of data and two nodes on the cluster. It didn't work, so I debugged it (a stupid but easily made mistake) and tested again: success! Now I'm running the real thing and waiting for the output. Not sure whether I'll have time to look it over carefully today.
Afternoon update: nothing yet; the job is still waiting in the queue, so it hasn't started running. I may call it a day.