Day 23: (link to my previous post in this series)
First a note about where this series is going. I decided to write this series of posts to journal the evolution of a (fairly simple) computational biology project from the start- or close to it- to the end- publication of a paper. For various reasons I mentioned in my first post I want to be circumspect about the actual method and application. However, I’m currently keeping a separate set of posts, mirroring each of these, that I write in real time. Those posts give details and will have links to the data I’m using. I plan to post them all alongside the originals at the point the paper is submitted (or possibly accepted, haven’t decided). That may even be the location of the supplemental information and data that will accompany the paper. That way you can see what I’m talking about. In the meantime I hope that the *process*, if not the content, will be interesting enough to keep following.
Starting out with my “side project” spawned from my primary project.
Wow. It’s rare when you clean up a bug and find that it improves your results significantly. Really significantly. With the ‘bug’ I mentioned in my last post (which was actually just a misunderstanding of how they had implemented the original algorithm) I was getting essentially nothing. A confusing set of nothing.
After fixing this bug the results are surprisingly good. I’m doing some comparisons between subsets of data and finding that the differences I’m seeing are substantial. And I’m getting a vast improvement over the previous approach- which is exactly what I was hoping for. Very exciting. Last night I ran all the comparisons and put them into a spreadsheet that I can share with collaborators to get their feedback on what it might mean in terms of the biology. It’s like wiping off a dirty window so you can now see into a room filled with interesting things.
On (and back) to my original project
When I last posted about this I had run the initial algorithm on the cluster and started to look at the data, but then realized that I’d need to do permutations to get p value estimates. I coded this up in the parallel algorithm and decided to use a label permutation approach where the data is permuted a bunch of times (1000) for each true calculation in the algorithm and results are compared with the true value of the algorithm. This will slow things down a bit, but should result in a defensible score/p value combination for the algorithm. And I’m running it on a cluster. The slowest part of my initial run was simply writing out all the output files to the central file system- they’re huge.
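The label-permutation scheme itself is generic, even if I’m not showing the actual algorithm. A minimal sketch of the idea, with a made-up test statistic (difference of group means) standing in for the real calculation:

```python
import numpy as np

rng = np.random.default_rng(0)

def permutation_pvalue(values, labels, stat, n_perm=1000):
    """Estimate a p value by shuffling the condition labels.

    values : 1-D array of measurements
    labels : boolean array, True = condition 1
    stat   : function(values, labels) -> test statistic
    """
    true_stat = stat(values, labels)
    null = np.empty(n_perm)
    for i in range(n_perm):
        # shuffle which observations get which condition label
        null[i] = stat(values, rng.permutation(labels))
    # fraction of permuted statistics at least as extreme as the real one
    # (the +1's keep the estimate away from an impossible p of zero)
    return (1 + np.sum(null >= true_stat)) / (n_perm + 1)

# toy stand-in statistic: difference of group means
def mean_diff(v, lab):
    return v[lab].mean() - v[~lab].mean()
```

With 1000 permutations the smallest p value you can report is about 0.001, which is one reason the number of permutations matters when you’re doing lots of tests.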
Last night I updated the code, then this morning tested it out with a small set of data and two nodes on the cluster. This didn’t work- so I debugged it (a stupid, but easily made, mistake) and tested again: success! Now I’m running the real thing and waiting for the output. Not sure whether I’ll have time to look this over carefully today.
Afternoon update: nothing yet, job is still waiting in the queue meaning that it hasn’t started running yet. I may call it for the day.
Days 6-22: (link to my previous post in this series)
I said “real time” and I meant real time. This is what happens to projects when there are 4-5 or more other projects vying for attention, plus a bunch of internal strategy planning things, plus outside reviews to complete, plus papers to wrap up and submit, plus summer vacation stuff, oh and plus Twitter stuff and making a t shirt and stuff. You don’t get stuff done for long periods of time.
Anyway, what I have gotten done is to focus on my highly related side project that arose from my initial project. The key here was that I could process the data exactly the same way then just use it in a slightly modified version of another existing algorithm, and voila! magic happens.
Of course this wasn’t how things shook out. I did the data processing and ran the algorithm and found that the results weren’t really that enlightening. In fact they looked a bit ho-hum. Suspiciously so actually. So there are two alternatives: alternative 1 is that the method is functioning perfectly and that the biology just is like this and I’m not going to get very much from this approach; alternative 2 is that the method is not working for some reason. To test this I ran a subset of the data using the existing method and my new extended method. For reasons regarding the particular method I’m using and how I’m extending it these two subsets should yield exactly the same results.
Which do you think was the case? Well, of course, results from the two subsets look different and it’s alternative 2. I messed something up or my thinking about the problem was wrong. This morning I sat down for 5 minutes (that’s how long this process took me, really) and read the methods section for the paper describing the method I’m trying to extend. Actually I just read the first paragraph of the methods section. Ohhhhhhhhhhh. Go figure. It actually HELPS to read how they implemented the existing method in the first place <facepalm>.
So now I’m recoding and rerunning- hoping that this improves things. The way they did it in the first place is elegantly simple, so it works with my extensions no problems. I’ll see. And you will too in the next installment. Which could be tomorrow. Or in December. Who knows?
As a part of the very funny #MelodramaticLabNotebook hashtag on Twitter I Tweeted this the other day:
At last we meet, reviewer 3, if that is indeed your real name #MelodramaticLabNotebook
— Jason McDermott (@BioDataGanache) August 20, 2013
Which went mini-viral and got me the most retweets I’ve gotten for any Tweet to date.
Yes, reviewer 3. The arch nemesis of authors of scientific papers. The queen/king of rejections. (for non-science-y types, see my explanation of the basics of peer review below) For some reason it seems that reviewer 3 has gotten a bad name. It’d be interesting to actually look at the data here, but the review process is mostly private, so I think that’d be difficult. My guess is that reviewer 3 is generally a tie-breaker. A pinch-hitter brought in by the editor to break the impasse of having reviewer 1 give a positive review and reviewer 2 give a negative review. Reviewer 3 may have a bad name since it may be easier to see the points of the negative reviewer, rather than being swayed by a positive review. I don’t know- I’m just guessing.
Anyway, back to me. One response to my Tweet was this gem. Alex Cagan drew this comic visualization of my Tweet, which is completely awesome.
I needed a new science awesomeness t shirt so I took Alex’s design and made it into a t shirt on CafePress (currently on its way to my house!). You too can sport this wonderful creation if you’re brave enough.
A (very) brief overview of peer review process for scientific papers
So for those not familiar with the process of scientific publication here’s a brief summary. This is how it generally works in biology/chemistry/physics but there are exceptions. A researcher completes a scientific study and wants to (needs to) share it with the greater scientific community. This is generally done through a myriad of scientific journals that publish these kinds of papers. The researcher writes up the study into a manuscript and submits it to a relevant journal. An editor at the journal then sends the paper out to peer reviewers, other researchers who have the expertise to evaluate the study and manuscript to see if it’s acceptable for publication. These are generally anonymous reviewers and this is done on a volunteer basis (i.e. the journal does not pay reviewers to review). Most journals solicit two reviews. If the reviewers agree to reject or accept the paper then the editor will generally go along with that. If the reviewers are split then the editor can make a decision in cases where they feel that it is merited, or they can send it to another reviewer. Some papers can be reviewed by more than three peer reviewers- not sure why this happens.
The label of “scientist” is usually assigned to a particular type of person. One who is highly educated and generally paid to do science as a profession. While this is not incorrect (these people are generally scientists)- it is not inclusive. Studies have highlighted that very young children use something very like the scientific method to discover things about their environment, the people they interact with, and how the world works.
A group of middle schoolers from New York were listed as co-authors on a PLoS One publication about elephant behavior last spring. The study was featured in a New York Times article, which mentions it as one of the first times teenagers have been listed as co-authors on a scientific study. This is really a big accomplishment for them- and it’s great to see them get recognition and respect from their coauthors, the established scientists. The paper itself doesn’t really highlight the role the teenagers played in the experiment, however; it’s written as more of a straight-on scientific paper.
However, this isn’t the first publication like this. In 2010 a paper on behavior in bees was published in Biology Letters that lists as its first author “P.S. Blackawton”, where the “P.S.” stands for “Public School”. This paper is really cool because the entire process was guided by established scientists, but was thought of, executed by, and largely written by elementary school children, aged 8-10. The abstract for the paper gives a good overview of how, and why, the study was organized this way.
This is science: the process of playing with rules that enables one to reveal previously unseen patterns of relationships that extend our collective understanding of nature and human nature.
What a great thing not to forget as an “established scientist”.
Here’s an excerpt from the Discussion section that highlights what a refreshingly delightful read this paper is:
This experiment is important, because, as far as we know, no one in history (including adults) has done this experiment before. It tells us that bees can learn to solve puzzles (and if we are lucky we will be able to get them to do Sudoku in a couple of years’ time).
It is really important to remember that the label of “scientist” does not just apply to those with PhDs (or on their way), with white coats, doing experiments in laboratories. Scientists are those who follow the scientific method, intentionally or not, to discover things about the world around them. And non-traditional scientists can bring a whole new perspective to challenging problems.
So the cluster was back online at 11:45 AM this morning, meaning that I could restart my false-discovery rate estimation with more permutations. However, I decided to take a short detour to quickly implement a related idea that came up while I was coding this one. It involves a similar data preprocessing step, so I modularized my code (making the data preprocessing more independent from the rest of the code). After I ran the code I realized I had a big ole stinkin’ BUG that meant that I wasn’t really preprocessing at all. This was bad: I should’ve seen it before, and it cost me a bunch of time on the cluster running fairly meaningless code (200 hours CPU time or so). It was also very good: it means that the so-so results I was getting previously may have been because the algorithm wasn’t actually working correctly (but in a way that was impossible to see from the end results).
So, optimistically, I started the job again- but it’s going to run over night, so I won’t know the answer until tomorrow (and you won’t either). In any case I got the side-track idea working, though it’s still waiting on the addition of one thing to make it really interesting. Lesson for the day: check your code carefully! (this is only about the 350th time I’ve learned this lesson, believe me)
Day 4 (link to previous post)
It figures that the week I decide to return to using the cluster (the PIC, in case you’re interested) is the week that they have to shut it down for construction. So I ran no more permutations today- that’ll have to wait until next week.
Didn’t really do any other work on the paper or project today either- busy doing other things. So not much to report today, actually. I did talk a bit about the results with my post-doc at our semi-weekly MSFAB (Mad Scientist Friday Afternoon Beer). We both agreed that the permutation test was a good idea and possibly the only way to get an estimate of real false discovery rates. Along these lines, as I reported yesterday, the first round of permutations returned some fairly significant results. These actually exceeded the Bonferroni-corrected p values I was getting, which are supposed to tell you essentially the same thing. So it seems in this case that Bonferroni, generally a conservative multiple hypothesis correction, was not conservative enough. Good lesson to remember.
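The comparison between the two corrections is a generic calculation, so here’s a rough sketch of what each threshold looks like (function names and numbers are illustrative, not from my actual code):

```python
import numpy as np

def bonferroni_threshold(alpha, n_tests):
    # Bonferroni: each individual test must clear alpha / n_tests
    return alpha / n_tests

def permutation_threshold(null_scores, alpha):
    # empirical alternative: pick the score that only a fraction
    # alpha of the permuted (null) scores exceed
    return np.quantile(null_scores, 1 - alpha)
```

If fewer results clear the permutation-derived cutoff than clear the Bonferroni-corrected p values, the empirical null is the stricter of the two- which is what I was seeing here.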
Uh-oh- roadblock. Remember how I was saying this project was dirt simple?
This came much faster than I thought it would. I’ve got actual data and I have to figure out if there’s a story there. Or rather, where the story is. The results from my large-scale parallel run are interesting, but I’m not sure they clearly demonstrate how this approach is better than previous approaches. Also, I had to rerun the whole thing to get all the results- turns out I was only capturing about 1/5th of them- but the end problem was the same. The results are very significant, but not head and shoulders above previous results, and don’t really demonstrate what I was hoping they would. Strongly, anyway. Time for some thinkin’. Never as dirt simple as I think it will be to start with.
Anyway, pushing onwards doing permutations. The question here is how likely I would be to see the scores I’m getting just by chance alone. So I permute the labels on my data and run the thing a few times with random labels. The permutation is done at the sample level- the data I’m using is from observations under condition 1 and condition 2- and I have multiple observations from each condition. So to permute I just randomize which observations I’m saying are from condition 1 and which from condition 2.
I’ve done the first couple of randomized runs and they’re actually coming up with some reasonably significant results. This means that I’ll have to compare the random scores with my real scores in order to establish a false discovery rate, which I can then use as a threshold for reporting.
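The random-vs-real comparison for a false discovery rate is standard, so here’s a minimal sketch of how that kind of estimate works (these helper names are hypothetical, and a real version would pool many permutation runs):

```python
import numpy as np

def empirical_fdr(real_scores, null_scores, threshold):
    """Estimate the FDR at a given score threshold.

    FDR ~ (false positives expected from the permuted null)
          / (positives observed among the real scores)
    """
    real_scores = np.asarray(real_scores)
    null_scores = np.asarray(null_scores)
    n_real = np.sum(real_scores >= threshold)
    if n_real == 0:
        return 0.0
    # scale the null count to the size of the real score list
    expected_false = np.sum(null_scores >= threshold) * (
        len(real_scores) / len(null_scores))
    return min(1.0, expected_false / n_real)

def threshold_for_fdr(real_scores, null_scores, target=0.05):
    # scan thresholds from most to least stringent and return the
    # loosest one that still meets the target FDR
    best = None
    for t in sorted(set(real_scores), reverse=True):
        if empirical_fdr(real_scores, null_scores, t) <= target:
            best = t
        else:
            break
    return best
```

The reporting threshold then falls out directly: pick the lowest score cutoff whose estimated FDR is still under whatever rate you’re willing to tolerate.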
I’ve also started to put things into a kind of an outline for the paper. Here’s what I’ve got so far- I’ve taken the details of what I’m doing out for blogging purposes- but you get the idea:
- General background about the problem we’re developing our method on
- Description of the previous algorithm, what it offers and what is the gap that our approach will fill
- Specific details about the data set we’re using
- Summary of what our approach is and what results we’ll be presenting in the paper
- First apply the previous algorithm on our data (this hasn’t been done). Possibly validate on an external dataset
- Show how our algorithm improves results over previous
- Add in the extra idea we came up with that will also be a novel twist on the approach
- Show what kind of biological information can be derived from these new approaches. This is really open at this point since I’m not sure what I’ll get yet. But I’m preparing for it and thinking about it, so I’m writing it down.
- Validation on an external dataset (i.e. a different one from the one I’m using)- maybe. This might be difficult to impossible.
You want to keep your brain in good shape and ward off dementia in older age, right? But what will do that? Turns out a bunch of things. Here I’m just highlighting a couple of recent news reports that I found funny in combination.
- Hot cocoa: two cups of hot cocoa a day helps sharpen seniors’ brains, found a study in 60 seniors reported in Neurology. And coffee: caffeine intake can improve cognition and memory. The idea is that it improves blood flow to your brain.
- Alcohol: OK this could go either way. Moderate alcohol consumption may be able to ward off dementia, but there doesn’t seem to be a consensus among different studies as indicated in this review.
- Sex: or more accurately orgasms seem to stimulate multiple areas of the brain. Dr. Barry Komisaruk, a Rutgers neuroscientist, has been studying orgasms for years (insert witty quip here) and has found multiple associations using MRI of subjects’ brains undergoing orgasm. However, it seems that there are few controlled studies in this area and so…. more research is needed.
So what’s the take home message? Start off the evening with hot cocoa…
The process of how I compose a computational biology study, execute it, and write it up seems to follow a kind of set pattern, for the relatively simple projects anyway. So I thought I’d blog about this process as it happens.
I debated on several takes on this:
- Blogging as it’s happening with full technical details
- Blogging a journal that I could then release after the paper was finished and submitted
- Blogging as it’s happening but not talking about the technical details
The first is appealing, but probably wouldn’t go over well with my employer- and it is a simple idea that someone else could pick up and run with. I’m not paranoid, but in this case it might be too enticing for someone with the right skills. The second seems like not as much fun. So I’m opting for the third option and will be blogging about what I’m doing generally, but not giving specifics on the science, algorithms, or data I’m working with.
Background on the project
I’m not starting cold on this project. I came up with the idea last year but haven’t had time to implement it until now. It’s a dirt simple extension of an existing method that has the potential to be very interesting. I have the problem and data in hand to work on it. Last year we implemented a parallel version of a prototype of the algorithm. Now that I can actually work on it I can see a clear path to a finished project- being a submitted paper, or possibly inclusion as a part of a larger paper.
Started out by revisiting the idea. Thinking about it and doing some PubMed searches. I just wanted to make sure that it hadn’t been done by anyone, especially the groups that developed the original algorithm. Nothing seems to be there- which is good, because as I said- it’s dirt simple.
Mid-day talked myself out of the idea in its original form- it can’t work as simply as I’d thought.
Relayed my thoughts to my post-doc, who reassured me that it was actually that simple and we could do it the way I originally envisioned. He was right. We talked about the statistics and algorithms for it for a while.
Got my old code working again. Revised the core bits to handle the new idea. Actually ran some data through on my laptop using a very limited dataset. Looks like it works! So fun to actually be coding again and not just writing papers, grants, emails, or notes. Opened a blank Word document to do some writing. *sigh*
Decided on a tentative title (which will change) and a tentative author list. Myself, the post-doc who I talked with about it, the programmer who coded the parallel version previously, a post-doc who hasn’t worked on it yet, but probably will, and a senior domain expert. Yes, I’m doing this very early on. But as I said, there’s a clear path from here to a paper- it’s not too early.
More testing on the prototype code to make sure that it’s behaving as I think it should. Also coded up an alternative data pre-processing step that seems like a good idea. Comparing results from both pre-processing methods, I determined that they give different answers. I’ll have to iron that one out later when working with the real datasets.
Figured out the plan for the project- at least in broad strokes. Run on complete dataset, implement a random permutation strategy to estimate false discovery rate, break up dataset and show how the method works on individual parts of it (this is specific to the problem), find another dataset for validation, write it up. Yes, it’s just that simple.
Discussed an additional very interesting strategy with post-doc number 1 that will really add novelty and hopefully value to the study. Also discussed the permutation strategy in some detail. That will be really important to demonstrate that this actually works.
Spent most of the day revising the code for the parallel implementation to get the new ideas and testing it out on our cluster to see if it works. Slow progress, but finally got the entire thing to run! I did a couple of test runs using a limited dataset and only running on 2 nodes. When those worked I did the whole shebang. Finished in about an hour on 60 nodes, which is really pretty impressive given what it’s doing. Definitely a win!
Now to work on putting some words down for the Introduction section. I also like to outline the results section by generally writing about how I think it will go in a glorified outline. I’ve posted about this process previously here.