Oh yeah, it’s significant. REALLY significant.

Matthew Hankins recently wrote a very nice post cataloging the ways that researchers try to indicate that some result is *this* close to being significant, but doesn’t quite make the cut. His point is a very good one: a result from a statistical test is either significant, meaning it passes some rather arbitrary threshold (say, a p value less than 0.05), or it isn’t. There’s no almost in significance, no trending toward significance, no flirting with significance. It’s significant or it’s insignificant, period.

I thought it would be useful to also catalog the flip side of this coin: what about when a result passes a significance test and keeps on going? These results are still “just significant”- not “ultra-incredibly significant”, since significance is a binary value. Accordingly, I have assembled a list of ways that authors have expressed that a result has a very low p value and thus is very significant. I also sampled real publications and found that the actual use of these phrases danced lightly about the verge of sending out tendrils to touch something that is close to significance. I feel that this result really wanted to be significant and was moving in that direction, toward significance. With a little more effort and if everyone believes, it can be significant. Say it with me, “I believe it’s significant. I believe it’s significant” (p value 0.99).

Disclaimer: I know that I have, on more than one occasion, been a perpetrator of each of these errors, stating that a result is ‘close to significant’ or is ‘highly significant’. I’ll try to be better in the future.

Thanks to Shanon White for the idea for this post. This is an incomplete list. If you have other examples please add them to the comments or Tweet them with the hashtag #ohsosignificant.

  • highly significant (p<1e-8)
  • very significant (p<0.01)
  • extremely significant (p<0.0001)
  • whoah baby that’s significant (p<1e-6)
  • I got your significance right here (p<0.0002)
  • Holy Sh*t! In your FACE b*tches. This sh*t’s significant (p<1e-90)
  • By the power of Greyskull, we have the significance! (p<1e-23)
  • Say hello to my little significance (p<1e-14)
  • You can’t HANDLE the significance (p<1e-30)
  • BAM! There it is daawwg! That’s significance right there! (p<1e-20)
  • You call THAT significant? That’s not significant. THIS is significant (p<1e-45)
  • the mostest significant in the whole wide world (p<1e-29)
  • Neener neener neener motherf**ker (p<1e-65)
  • significance of the utmost elevated level (p<1e-9)
  • Oh that’s good. Really good. Actually I’m thinking that might be Science or Nature good it’s that good. Holy crap, this is actually working. For once it’s working. Oh god I’m so excited, I’m going to totally rub it in the faces of my smug thesis committee. That’ll show them. Yeah. Oh god I hope it’s not wrong. Please let it be not wrong (p<1e-18)
  • solidly, unequivocally significant (p<1e-12)
  • Bonferroni? We don’t need no stinkin’ Bonferroni (p<1e-56)
  • First, we brought you a significant result (p<0.05). Then we rolled out a very significant result (p<0.01). But can we go further? That’s just crazy, right? Nope. We did it, presenting our new ultra significant result (p<1e-20), now with smoother trending.

Here’s the serious part of this post

This is a semantic argument at its heart. It’s true, and important, that statistical significance does not come in shades of gray; it either is or it isn’t. However, we as intelligent, statistically savvy readers interpret these statements, or at least those that are on the border and not hyperbole, as meaning, “if we were to shift the arbitrary threshold we used for statistical significance to a more lenient/conservative value, then the result we talk about would meet our new criterion for significance”. Yes, the authors should have just set that level of significance to start out with and not bothered to backtrack to make a point. And yes, many of the real statements on Matthew’s post and the (mostly) fake statements on mine are in the realm of the far out and are just silly (a p value of 0.3 being ‘nearly’ significant, really!?). But really the important thing is that you clearly and completely report your findings and the methods you used to arrive at those findings (and conclusions), and that you provide access to your data so that the interested reader can make their own judgement.
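To make the threshold point concrete, here is a minimal sketch in Python. The function name `is_significant` and the p values are made up for illustration; the only thing it assumes is the usual rule that a result counts as significant when its p value is below the chosen threshold.

```python
def is_significant(p_value: float, alpha: float = 0.05) -> bool:
    """Binary call: True when the p value is below the threshold alpha."""
    return p_value < alpha


# Hypothetical p values, chosen only for illustration.
p_values = [0.3, 0.06, 0.03, 1e-20]

# The same results judged against two different (arbitrary) thresholds.
for alpha in (0.05, 0.1):
    calls = [is_significant(p, alpha) for p in p_values]
    print(f"alpha={alpha}: {calls}")
```

Only the p value of 0.06 changes sides when the threshold moves from 0.05 to 0.1. The results at 0.03 and 1e-20 get exactly the same answer under both thresholds, which is the sense in which "highly significant" adds nothing to the binary call.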

 

 

Job opening: worst critic. Better fill it for yourself, otherwise someone else will.

A recent technical comment in Science (here) reminded me of a post I’d been meaning to write. We need to be our own worst critics. And by “we” I’m specifically talking about the bioinformaticians and computational biologists who are doing lots of transformations with lots of data all the time- but this generally applies to any scientist.

The technical comment I referred to is behind a paywall so I’ll summarize. The first group published the discovery of a mechanism for X-linked dosage compensation in Drosophila based on, among other things, ChIP-seq data (to determine transcription factor binding to DNA). The authors of the comment found that the initial analysis of the data had used an inappropriate normalization step, and the error is pretty simple: instead of multiplying a ratio by a factor (the square root of the number of bins used in a moving average) they multiplied the log2 transform of the ratio by the factor. This greatly exaggerated the ratios and artificially induced a statistically significant difference where there was none. Importantly, the authors of the comment describe how they noticed the problem:

We noticed that the analysis by Conrad et al. reported unusually high Pol II ChIP enrichment levels. The average enrichment at the promoters of bound genes was reported to be ~30,000-fold over input (~15 on a log2 scale), orders of magnitude higher than what is typical of robust ChIP-seq experiments.

This is important because it means that this was an obvious flag that the original authors SHOULD have seen and wondered about at some point. If they wondered about it they SHOULD have looked further into their analysis and done some simple tests to determine if what they were seeing (30,000 fold increase) was actually reasonable. In all likelihood they would have found their error. Of course, they may not have ended up with a story that could be published in Science- but at least they would not have had the embarrassment of being caught out that way. This is not to say that there is any indication of wrongdoing on the part of the original paper- it seems that they made an honest mistake.
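The size of that kind of error is easy to see with a toy calculation. The sketch below is not the original authors' code: the function names, the ratio of 8 and the 25 bins are all hypothetical, picked so that the mistaken value lands near the ~15 on a log2 scale quoted above. It also assumes that the correctly scaled ratio is log2-transformed afterwards for reporting.

```python
import math


def intended_scaling(ratio: float, n_bins: int) -> float:
    """Multiply the ratio itself by sqrt(n_bins); result is a plain ratio."""
    return ratio * math.sqrt(n_bins)


def mistaken_scaling(ratio: float, n_bins: int) -> float:
    """Multiply log2(ratio) by sqrt(n_bins); result is on a log2 scale."""
    return math.log2(ratio) * math.sqrt(n_bins)


# Hypothetical inputs, for illustration only.
ratio, n_bins = 8.0, 25

intended_log2 = math.log2(intended_scaling(ratio, n_bins))
mistaken_log2 = mistaken_scaling(ratio, n_bins)

print(f"intended: log2 = {intended_log2:.2f}, fold = {2 ** intended_log2:.0f}")
print(f"mistaken: log2 = {mistaken_log2:.2f}, fold = {2 ** mistaken_log2:.0f}")
```

With these made-up inputs the intended calculation gives a 40-fold enrichment, while the mistaken one gives 2^15, or about 32,768-fold. Multiplying a log by a factor is the same as raising the original ratio to that power, which is why the numbers blow up so quickly.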

In this story the authors likely fell prey to the Confirmation Bias, the tendency to believe results that support your hypothesis. This is a particularly enticing and tricky bias and I have fallen prey to it many times. As far as I know, these errors have never made it into any of my published work. However, falling for particularly egregious examples (arising from mistakes in machine learning applications, for example) trains you to be on the lookout for it in other situations. Essentially it boils down to the following:

  1. Be suspicious of all your results.
  2. Be especially suspicious of results that support your hypothesis.
  3. The amount you should be suspicious should be proportional to the quality of the results. That is, the better the results are the more you should be suspicious of them and the more rigorously you should try to disprove them.

This is essentially wrapped up in the scientific method (my post about that here)- but it bears repeating and revisiting. You need to be extremely critical of your own work. If something works, check to make sure that it actually does work. If it works extremely well, be very suspicious and look at the problem from multiple angles. If you don’t, someone else may, and they may not write things about you that are as nice as what YOU would write.

The example I give above is nice in its clarity and it resulted in calling into question the findings of a Science paper (which is embarrassing). However, there are much, much worse cases with more serious consequences.

Take, for instance, the work Keith Baggerly and Kevin Coombes did to uncover a series of cancer papers that had multiple data processing, analysis and interpretation errors. The NY Times ran a good piece on it. It is more complicated: it involves (likely) unintentional errors in processing, analysis, or interpretation and could also involve more serious issues of impropriety. I won’t go into the details here but their original paper in The Annals of Applied Statistics, “Deriving chemosensitivity from cell lines: Forensic bioinformatics and reproducible research in high-throughput biology”, should be required reading for any bioinformatics or computational biology researcher. The paper painstakingly and clearly goes through the results of several high profile papers from the same group and reconstructs, first, the steps they must have taken to get the results they did, then second, where the errors occurred, and finally, the results if the analysis had been done correctly.

Their conclusions are startling and scary: they found that the methods were often not described clearly enough for a reader to reconstruct what was done, and they found a number of easily explainable errors that SHOULD have been caught by the researchers.

These were associated with one group and a particular approach, but I can easily recognize the first, if not the second, in many papers. That is, it is often very difficult to tell what has actually been done to process the data and analyze it. Steps that have to be there are missing in the methods sections, parameters for programs are omitted, data is referred to but not provided, and the list goes on. I’m sure that I’ve been guilty of this from time to time. It is difficult to remember that writing the “boring” parts of the methods may actually ensure that someone else can do what you’ve done. And sharing your data? That’s just a no-brainer, but something that is too often overlooked in the rush to publish.

So these are cautionary tales. Those of us handling lots of data of different types for different purposes, and running many different types of analysis to obtain predictions, must always be on guard against our own worst enemy: ourselves and the errors we might make. And we must be our own worst (and best) critics: if something seems too good to be true, it probably is.

 

Time to review for scientific publications revisited

Anna Sharman wrote a couple of excellent posts about time to first response for journals and time to publication after acceptance for journals. Following up my previous post on time spent at the journal (from submission to acceptance) she wrote:

So I went back through my email archive and reconstructed the process for all the papers I previously listed, plus a couple. And I corrected a few inaccuracies in my previous report (I was lumping the two rejections that I then resubmitted anew with their eventually accepted versions- not really fair for the journals). Here are the results, which show the time from submission to first response (in all cases listed except the one with the asterisk this is when I received the first reviews), overall time to acceptance, and finally time to publication after acceptance. The publication time is the first time the article appears on a website since most of these journals have epub before ‘print’ policies. PLoS journals don’t have a physical volume (there is not a physical paper-and-glue PLoS journal) but they do release volumes, collections of articles, at set times.

Overall my reanalysis decreased the mean total time at the journal (from 5.7 months to 4.9 months) and showed that the actual time spent under review (as opposed to time when I was revising the paper according to reviewers’ comments) was about half that- about 2.6 months. I would be interested in seeing if this is typical or not since this is one variable that could be very specific to the way that I work.

The outlier in the analysis is my PLoS One publication that was first considered at PLoS Computational Biology. This appears to have a very short turnaround time, but this is only because the editor at PLoS One evaluated my responses to the reviews received from PLoS Computational Biology and made a decision on that basis.

Finally, this analysis does not take into account several places where the effective time to acceptance was much longer. The aforementioned PLoS One publication was actually submitted to the RECOMB Systems Biology conference where it reviewed well (about 2.5 months) and was then recommended for consideration at PLoS Computational Biology, where it did not review as well. From start to finish this was close to a year before it was actually published. Likewise the BMC Systems Biology publication that was rejected then resubmitted went through a long process of editorial consideration at the end that extended the time we had it (i.e. shortened the time in review by my calculations) by a lot since we had challenged inappropriate reviews at the editorial level.

The original impetus for the post was

And the current analysis revises my initial assessment since what Nick was really asking about was the turnaround time- that is, the time from submission to receipt of the first reviews for the paper. In this case 100 days is quite a bit longer than normal (as judged by my limited analysis here) since the mean turnaround times I get are about 2 months.

Table. (revised) Survey of time in review for a number of my own papers.

| PMID | Journal | Time to first review | Months until acceptance | Months spent in review | Acceptance to publication |
|---|---|---|---|---|---|
| 23335946 | Expert opinions | 1.9 | 4.7 | 1.9 | 0.9 |
| 22546282 | BMC Sys Bio (rejected) | 2.0 | | 2.0 | |
| 22546282 | BMC Sys Bio (new submission) | 1.9 | 11.2 | 3.0 | 1.3 |
| 23071432 | PLoS CB | 3.1 | 12.2 | 7.6 | 1.8 |
| 22745654 | PLoS One | 2.7 | 6.6 | 4.2 | 2.7 |
| 22074594 | BMC Sys Bio | 3.2 | 3.4 | 3.3 | 1.0 |
| 21698331 | Mol BioSystems | 1.8 | 4.6 | 3.0 | 1.0 |
| 21339814 | PLoS CB (rejected) | 1.0 | | 1.0 | |
| 21339814 | PLoS One (new submission *) | 0.4 | 0.4 | 0.4 | 0.9 |
| 20974833 | Infection and Immunity | 1.8 | 1.8 | 2.2 | 0.7 |
| 20877914 | Mol BioSystems | 0.8 | 1.2 | 0.9 | 1.4 |
| 19390620 | PLoS Pathogens | 1.8 | 4.4 | 2.4 | 4.6 |
| 20974834 | Infection and Immunity | 2.0 | 2.9 | 2.0 | 0.4 |
| | Mean (months) | 1.9 | 4.9 | 2.6 | 1.5 |
| | Std dev (days) | 24 | 119 | 57 | 35 |
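For anyone who wants to repeat this with their own papers, the means in the table can be recomputed with a few lines of Python. This is a minimal sketch: the column values are copied from the table above, the variable and function names are made up, and it assumes that blank cells (the rejected submissions) are simply skipped when averaging.

```python
from statistics import mean

# Columns copied from the table, in row order; None marks a blank cell.
first_review = [1.9, 2.0, 1.9, 3.1, 2.7, 3.2, 1.8, 1.0, 0.4, 1.8, 0.8, 1.8, 2.0]
to_acceptance = [4.7, None, 11.2, 12.2, 6.6, 3.4, 4.6, None, 0.4, 1.8, 1.2, 4.4, 2.9]
in_review = [1.9, 2.0, 3.0, 7.6, 4.2, 3.3, 3.0, 1.0, 0.4, 2.2, 0.9, 2.4, 2.0]
to_publication = [0.9, None, 1.3, 1.8, 2.7, 1.0, 1.0, None, 0.9, 0.7, 1.4, 4.6, 0.4]


def column_mean(values):
    """Mean in months over the cells that are present, rounded to one decimal."""
    present = [v for v in values if v is not None]
    return round(mean(present), 1)


for name, column in [
    ("Time to first review", first_review),
    ("Months until acceptance", to_acceptance),
    ("Months spent in review", in_review),
    ("Acceptance to publication", to_publication),
]:
    print(f"{name}: {column_mean(column)}")
```

Skipping the blank cells reproduces the "Mean (months)" row, so that appears to be how the averages were taken.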

 

 

 

Fact from fiction: The scientific method is alive and well

Weirdly I’ve read two blog posts today from apparently completely independent sources (here and here) that both state essentially the same thing: the scientific method is harmful to creativity and is not the “only way to do science”. I’ve posted here and here about the scientific method previously. While I applaud their efforts to make science more approachable to all and I do agree that conceiving of the scientific method too rigidly is a mistake, the basic premise of these posts is absolutely wrong.

Both use examples of observation being an integral part of science:

In 1928 Alexander Fleming accidentally left a cover off a petri dish used to cultivate bacteria. The plate was contaminated by a mold that contained penicillin. In this case, there was no problem or question to start with. It was an accident. -Rhett Allain

and:

In some instances, scientists may use computers to model, or simulate, conditions. Other times, researchers will test ideas in the real world. Sometimes they begin an experiment with no idea what may happen. They might disturb some system just to see what happens, -Jennifer Cutraro

 

It is certainly true that chance observations can start the process of scientific investigation, but in the first example (which the author follows with several other examples from other fields- all with the same basic problem), the observation being described is the very start and not the end product. Fleming did NOT just discover penicillin by observing a petri dish that was left open by accident. In fact he probably initially had a hunch (based on an internal model of some kind that he’d built up during his career and the results he observed from this accident) that something was interesting and worth pursuing further using, wait for it, THE SCIENTIFIC METHOD. This would involve constructing controlled tests to isolate the source of the mold (and indeed to show that it was mold at all), validate what it was doing (for example, maybe the mold grew but had no effect on the growth of the bacteria, which were instead killed by the sudden draft of cold air, maybe aliens came into the lab at night and killed off the bacteria one-by-one, maybe…), and then identify the factor that was causing this (magic pixie dust! churches! very small rocks!). I’m not up on the history, but I’m sure that this actually took many years and used the scientific method many times in the kind of iterative cycle that I’ve described in my previous posts. The fact that the initial observation was accidental has almost nothing to do with the subsequent application of the scientific method to follow it up. Many accidents or weird observations are discarded as being uninteresting or not worth pursuing, sometimes in error.

The second blog post simply lists several examples of ways that a scientist might start out using the scientific method (similar to the story of penicillin)- and Ms. Cutraro then uses the words ‘test’ and ‘experiment’, which are both components of the scientific method. She is not describing scientific discovery, she’s describing the very first steps toward scientific discovery. Ms. Cutraro writes:

In contrast, geologists, scientists who study the history of Earth as recorded in rocks, won’t necessarily do experiments, Schweingruber points out. “They’re going into the field, looking at landforms, looking at clues and doing a reconstruction to figure out the past,” she explains. Geologists are still collecting evidence, “but it’s a different kind of evidence.”

This kind of science does not generally involve physical experiments, that’s true. However, the gathering of evidence to support or discard a model is a version of the scientific method. The process of “looking at clues” and “doing a reconstruction” can be part of the scientific method (in fact if you’re looking at clues then you are certainly using the scientific method). Imagine we identify a landform that seems to be formed by running water. We (as geologists) can test whether this is true by observing many such landforms, the terrain around them, and other features that might support the idea of the landform being formed by water, or not- in which case a new hypothesis/model must be formed. There may not be the ability to physically test this hypothesis by running a gigantic experiment involving tons of rock and millions of years of running water, but it is still very much science.

So my position is that the scientific method, in the broad sense but as I’ve previously outlined it, is inherent to our ability to discriminate fact from fiction- in fact the two are essentially the same thing. The act of stating a formal hypothesis is often something that is either unstated or unconscious, but it is always present and it is part of how we learn. To state that we can discover things without using the scientific method is misleading (at best).

And, importantly, both of these discussions sell the idea of a model short.

…make models of stuff. Really, that is what we do in science. We try to make equations or conceptual ideas or computer programs that can agree with real life and predict future events in real life. That is science. -Rhett Allain

That is, they seem to separate the idea of a model from the process of the scientific method, when the two cannot be separated. A model, whether a conceptual gathering of existing knowledge into a picture of “how things should be” or instantiated in some way, like a computer algorithm, is an absolute requirement of the scientific method. In fact, a model does not exist outside the scientific method. If a model predicts future events then these predictions must be validated using the… yes, that again.

So perhaps what the authors of these two posts really mean (and this is suggested by some of their writing) is that the traditional view of the scientific method as a rigidly defined set of steps is not wholly comprehensive. Each of these steps must be thought of for what it means and how it applies to everyday science and indeed the rest of life. Science IS the scientific method. It is the way that we learn things about reality. And it is the only way we can exclude sets of plausible fictions to guide us toward fact.

 

How long is long: Time in review for scientific publications

Though there seems to be a lot of anecdotal information about how long it takes to get your scientific paper reviewed by a peer-reviewed journal, there doesn’t seem to be much actual data about this. Although some journals (like PNAS) list dates for “sent to review” and “approval”, these may not include the whole process- time for editorial consideration for example- and are probably not representative. PLoS journals do a great job and list the date received and the date of acceptance, but I couldn’t figure out a way to get that information in bulk (I didn’t inquire by the way- maybe another project in the works). The length of time it takes to review a paper for publication can have numerous impacts on projects, grant proposals, and the ability to submit to another journal if the paper is rejected.

Having been a peer reviewer for some time I realize that often it’s difficult to return reviews on time, and this is one source of delay. Editorial delays, because of the volume of submissions being considered or other reasons, are another. And then there are just difficult papers that may take more than the normal number of reviewers because of conflicting reviews or reviews that are not clearly positive or negative. This process can easily stretch out over months. Then after reviews come back the authors must address the reviewers’ concerns and submit their revisions. This is also a source of delay, and can be highly variable. Some journals (BMC journals for example) limit the number of revisions possible on a single manuscript to two- but they allow resubmission of the revised manuscript as a “new” submission after that- presumably to be handled by the same editor.

Over the last few years I’ve tracked the time it takes to get a paper accepted, from the time of first submission, for papers that I’m responsible for (first or last author papers). This doesn’t include rejected papers. Those times are generally fairly short, especially at higher impact journals where editors make the initial decision about whether a paper will be peer reviewed at all.

This is NOT a representative sample, but it does capture many of the elements I’ve discussed above. These numbers are pretty much in line with an evaluation of PLoS One turnaround times.

So the answer is: No, I wouldn’t consider 100 days to be fast, but it’s not exactly slow either. In fact, it may be in line with what can generally be expected from the scientific publication system. I’d be very interested to hear other researchers’ opinions on their times in peer review and if you have data all the better.

Table. Survey of time in review for a number of my own papers.

| PMID | Journal | Days in review | Months |
|---|---|---|---|
| 23335946 | Expert opinions | 142 | 4.7 |
| 22546282 | BMC Sys Bio | 335 | 11.2 |
| 23071432 | PLoS CB | 366 | 12.2 |
| 22745654 | PLoS One | 198 | 6.6 |
| 22074594 | BMC Sys Bio | 103 | 3.4 |
| 21698331 | Mol BioSystems | 137 | 4.6 |
| 21339814 | PLoS One | 193 | 6.4 |
| 20974833 | Infection and Immunity | 55 | 1.8 |
| 20877914 | Mol BioSystems | 36 | 1.2 |
| 19390620 | PLoS Pathogens | 132 | 4.4 |
| | Mean | 170 | 5.7 |
| | Std dev | 108 | 4 |
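If you have your own dates and want to compare, the summary rows can be reproduced with Python’s standard library. This is a sketch under two assumptions that match the numbers in the table: months are days divided by 30, and the standard deviation is the sample (n-1) version. The variable names are made up.

```python
from statistics import mean, stdev

# "Days in review" column copied from the table above.
days = [142, 335, 366, 198, 103, 137, 193, 55, 36, 132]

DAYS_PER_MONTH = 30  # assumption: a month is counted as 30 days

months = [round(d / DAYS_PER_MONTH, 1) for d in days]
mean_days = mean(days)
sd_days = stdev(days)  # sample standard deviation (n - 1 in the denominator)

print(f"months per paper: {months}")
print(f"mean: {mean_days:.0f} days ({mean_days / DAYS_PER_MONTH:.1f} months)")
print(f"std dev: {sd_days:.0f} days ({sd_days / DAYS_PER_MONTH:.1f} months)")
```

The standard deviation is almost two thirds of the mean, which is another reminder of how small and variable this sample is.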

 

Money is deeply, fundamentally weird.

Ever since I read this article in Wired magazine (you know, the paper things that are thinner than books and you still find in doctor’s offices?) I’ve had this feeling that the sands are shifting beneath my feet. How can you truly know the value of what’s in your wallet? Count your money? Try again. Money is something other than what we normally think it is. The financial credit crisis of 2007-ish happened in part because of people and groups wanting to buy lots of debt. Why does anyone want to buy someone else’s debt? It makes sense (debt gets paid back with interest, which makes money for the owner of the debt)- but is pretty weird, really. And many people, myself included, felt that they ‘lost’ a large amount of value following that time- but what was that value really? (An awesome overview of this whole thing can be found at This American Life’s podcast about it- HIGHLY recommended.) Why does the economy contract? Isn’t it weird that tomorrow there may be more (or less) value in the world than there is today?

Money is fluid, and so is value. Imagine that you have currency based on gold (work with me here). You can set a value for a certain amount of currency based on the very real mass of gold that it represents. No problems. Everything is smooth sailing, right? Well, all of a sudden a massive gold vein is discovered near a subway station in Manhattan. And all of a sudden the value of what you have in your pocket is not what it was in the morning. You worked the same amount for it, right? So why did it change?

The Wired article describes a market, based in the online game Everquest, then just a few years old. In this game players can earn currency (in virtual gold pieces I think) by playing the game. The demand, at that time, was such that the virtual gold pieces had real value. That is, there was an exchange rate between Everquest ‘gold’ and real dollars. Think about it for a minute. Instead of thinking, “what weirdos are going to pay real money to buy gold in an online game”, the real question is what does this say about our “real” money? You could, in theory, go to work all day in a virtual world for virtual currency- that is, play a game that enough other interested parties are playing- and then exchange that currency for things that you really need. Who carries coins and bills in their pockets? Credit cards are where it’s at: money goes in directly from your employer and then gets transferred to the store you’ve made a purchase at- no physical instantiation involved at all. There have been sweatshops uncovered where the workers are playing games for days on end to get virtual currency (that then is turned into ‘real’ money).

The more recent advent of Bitcoin is a similar-type example of our ability as humans to strike bargains between each other. Their (it’s a decentralized, open source effort, so maybe that’s more of “our”) system is pretty cool and complex, but with thought behind it, which is more than I can say of the US monetary system. Money is a bargain between people. It’s not only based on trust, hope, and need, it’s actually a human instantiation of those very emotions. So when you pull out a dollar bill to pay for something, think about how you’re handing over your trust, hope and need to the sales clerk. But I wouldn’t advise mentioning that to them. That would be weird.

Five minute explanation: Cyanothece transcriptional model


Because the paper is behind a paywall, I’m making it available as the submitted manuscript. Eventually I’ll get with the program and start releasing on ArXiv or Figshare, but for now it’s here. I’ve tried to make the version somewhat pretty (I get really tired of reading papers that are double-spaced and have the figures and tables at the end).

Citation

McDermott J.E., Oehmen C., McCue L.A., Hill H., Choi D.M., Stöckel J., Liberton M., Pakrasi H.B., Sherman L.A. (2011) A model of cyclic transcriptomic behavior in Cyanothece species ATCC 51142. Mol Biosystems 7(8):2407-2418. PMID: 21698331

*but behind a paywall at Molecular BioSystems

Here available as the submitted manuscript and supplemental information.

Background

Cyanothece sp. 51142 is an ocean-dwelling cyanobacterium that is capable of fixing nitrogen in the dark and photosynthesizing in the light, two normally incompatible activities. Unlike some other cyanobacteria it makes this switch inside the same cell every light/dark cycle (normally about 12 hours). This makes it interesting from the standpoint of bioenergy production but also regulation. The process of how it is able to drastically rearrange its machinery every 12 hours is not well understood.

A ‘wreath’ network of transcriptional changes in Cyanothece over a 24 hour period.

What was done?

We used multiple transcriptomic datasets (measurements of levels of gene expression) taken at different times in the light/dark cycle to construct a general model of the functional processes occurring in Cyanothece. The interesting part about this was that we did not impose the circular shape on the model, it arose naturally from analysis of the data, and it really does represent a clock- with the pattern of gene expression at different times of day being located at different locations on the clock face. We then used a mathematical approach to relate the expression levels of drivers (regulators) with groups of genes that can be associated with different functions. The model allows us to plug in different starting points and predict what the state of the system will be at future times.

Why is it important?

The model we constructed can be changed and results simulated to predict what will happen in a real experiment. These kinds of models are good for focusing experimental efforts by predicting interesting behavior. An example question might be to ask what would happen to the timing of photosynthesis (as judged by gene transcription) if the levels of a key regulator are changed. The resulting prediction(s) can then be tested experimentally to discover new things about the system.

The story

This paper took about five years to get written and accepted. That’s from the point at which I decided that a paper should be written to the point that it was published. It was from the first project I worked on at my then new position. I came up with the wreath visualization early in the process and, after having convinced myself and others that it was real, found that it was a very compelling way to think about the diurnal (day/night) cycle. The figure has been used in many different forms, mainly as eye candy. I’m amused when I see it on a poster that I had nothing to do with (from my workplace PNNL). It has even been used around the web.

 

Gaming the system: How to get an astronomical h-index with little scientific impact

The old scientific adage “publish or perish” has garnered a lot of debate lately. I’ve posted about my own scientific impact as well as the impact of papers published about computational methods that are named versus unnamed in the title. Certainly publications remain the currency of scientific careers, for better or worse- though I think this is changing with more emphasis being placed on other, more flexible and open, forms of scientific outreach. There’s a lot of talk about this subject from various places including ByteSizeBiology, Peter Lawrence, and Michael Eisen - to name a few.

The purpose of this post is to highlight an instance of abuse of the system- kind of in a funny (odd, surprising, shocking) way. This is similar in spirit to recent reports that a math paper generated by an algorithm that strings mathematical words together was accepted into a journal.

I was searching gene names to research a paper I was writing a couple of years ago and started to notice a weird pattern. Some genes were mostly absent from the literature (that is, no one has actually studied their function, and they haven’t been highlighted in any other screen-type studies that identify lots of things). However, a number of publications on completely different genes looked suspiciously similar. Many of these had titles that included the words “integrative genomic analyses” or “identification and characterization of [gene] in silico”. They all had two authors, M. Katoh and either M. Katoh or Y. Katoh (though some had more authors), and most were published in a few journals: the International Journal of Molecular Medicine and the International Journal of Oncology, both with low but respectable impact factors (1.8 or so). Many, though not all, of these papers seem to be rehashed digests of information obtained from databases combined with review-type information about potential functions related to cancer or biomedicine. This PubMed search retrieves most of these citations for your amusement.

A quick search in Web of Knowledge for “Katoh M” as an author and “INTERNATIONAL JOURNAL OF ONCOLOGY” as a publication retrieves 99 publications, with a jaw-dropping h-index of 48 (the h-index measures the scientific impact of a group of publications: it is the largest number h such that h of the publications have each been cited at least h times). Results from the “INTERNATIONAL JOURNAL OF MOLECULAR MEDICINE” are only slightly less impressive (h-index of 37 with exactly 99 publications as well; see the screen capture of results below). Following up with a search of the three main names here, Masaru, Masuko, or Yuriko (there was also a mysteriously named “Mom Katoh”, who may be the ringleader of the bunch- but she/he only had a couple of publications) retrieves 216 publications with a combined h-index of 56, a number that any biologist would die for (or at least should be very happy with).
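For anyone who wants to check an h-index by hand, here is a minimal sketch of the calculation. The function name and the citation counts are made up for illustration; they are not the Katoh numbers, which came straight from Web of Knowledge.

```python
def h_index(citations):
    """Return the largest h such that h papers have at least h citations each."""
    # Rank papers from most cited to least cited.
    ranked = sorted(citations, reverse=True)
    h = 0
    for rank, count in enumerate(ranked, start=1):
        if count >= rank:
            h = rank   # the top `rank` papers all have at least `rank` citations
        else:
            break      # every later paper has fewer citations than its rank
    return h


# Hypothetical citation counts for a small set of papers (illustration only).
example_counts = [25, 8, 5, 3, 3, 1, 0]
print(h_index(example_counts))
```

Note that the h-index can never be larger than the number of publications, so an h-index of 48 from 99 papers means nearly half of them were each cited at least 48 times.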

Web of Knowledge Search for Katoh M

Masaru is affiliated with the apparently reputable National Cancer Research Institute in Japan. But Masuko and Yuriko don’t seem to be closely affiliated with any place in particular (judging by a Google search).

Some of these publications may, in fact, contain valuable information and results- I certainly haven’t gone through each and every one. However, a large number of these “integrative genomic analyses” are not useful, seem to have been targeted at genes with little characterization, and are written based on template text. The high citation numbers that they get, then, may be due to a lack of care on the part of those citing them: the papers are included simply because they appear to be the only comprehensive functional study of a particular gene that has turned up in the citing study. It certainly emphasizes the need for caution when “filling in” citations that are not central to the main story of a publication (and for which writers, myself certainly included, are less critical about the source).

How important is having a name for your computational method?

When building software tools, databases, or reporting approaches to data analysis or modeling, choosing a name is important. I started out writing this post with the notion that this is true, searched briefly for evidence to back me up, then realized that I could do this analysis myself. Or at least enough to get an idea of how important having a name for your method might be.

Here’s what I did: I gathered all publications in the journal Bioinformatics published between 2004 and 2008 (3517 or so) from the Web of Knowledge/Science. I then identified those publications that referenced software tools or databases by looking for titles that start with “[name]: [title of paper]”, giving approximately 954 publications (there are more than this that fit the bill, more on that in a minute). I calculated the mean number of citations for the publications in each group, named and unnamed (not adjusting for years in publication)- that’s the “All” comparison in the figure below. The difference shows that publications that use a name garner more citations (and thus have more ‘impact’ by this measure) and this was statistically significant by t test (p value 0.005). However, this could be due to a difference in the nature of the publications. Perhaps tools are just more likely to be cited than more scientific studies about specific systems (I think they are). So I went through an arbitrary selection of 500 of the publications without a name and identified a conservative set of 158 that looked like they could have had names associated with them, based on their titles. This was a bit of an arbitrary endeavor, but I think I did an OK job. That comparison is the “Matched” comparison below and shows a much more marked difference.
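If you want to try something similar on your own export, the steps above can be sketched roughly as follows. This is not the script I used: the regular expression, the function names, and the example titles and citation counts are all assumptions made up for illustration, and the test statistic here is Welch’s version of the t test, which may differ from what a spreadsheet computes.

```python
import re
from math import sqrt
from statistics import mean, variance

# Assumed pattern: a single token (no spaces) followed by a colon and a title.
# This is a guess at "[name]: [title of paper]"; it misses names with spaces.
NAMED_TITLE = re.compile(r"^\s*([^\s:]+):\s+\S")


def method_name(title):
    """Return the leading method name in a title, or None if there is none."""
    match = NAMED_TITLE.match(title)
    return match.group(1) if match else None


def split_by_name(records):
    """Split (title, citations) records into named and unnamed citation lists."""
    named, unnamed = [], []
    for title, cites in records:
        (named if method_name(title) else unnamed).append(cites)
    return named, unnamed


def welch_t(a, b):
    """Welch's t statistic for two samples (no p value computed here)."""
    standard_error = sqrt(variance(a) / len(a) + variance(b) / len(b))
    return (mean(a) - mean(b)) / standard_error


# Hypothetical records; real ones would come from a Web of Knowledge export.
records = [
    ("FooAlign: fast alignment of short reads", 30),
    ("BarDB: a database of hypothetical motifs", 40),
    ("QuxNet: network inference from expression data", 50),
    ("A comparison of clustering methods for expression data", 10),
    ("Improved inference of regulatory motifs", 12),
    ("Statistical analysis of alignment scores", 14),
]

named, unnamed = split_by_name(records)
print("mean citations, named:", mean(named))
print("mean citations, unnamed:", mean(unnamed))
print("Welch t statistic:", welch_t(named, unnamed))
```

Turning the t statistic into a p value needs the t distribution, which a stats package or a spreadsheet will provide.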

You can find a spreadsheet with my analysis here: Bioinformatics_Pubs_WOS_2008

The bottom line: The publications with named methods garnered over three times as many citations as the pubs with no names and this was also statistically significant (p value 0.05, because of the smaller number of publications in the matched set).

Impact analysis of pubs with named methods versus unnamed methods

There are a number of ways I could improve on this comparison and I’d be happy to entertain suggestions on it. However, I think the results of this are quite interesting. There are some reasons that they might be true (that are unrelated to actually having a name). The first thing I can think of is that the named publications are likely to be application notes, which describe the release of more mature, tested software than the non-named publications that may describe more of the research and proving of the method- that is, they may be more likely to have a tool that is actually usable by others (and thus citable) than the other kind of publication, which may not even provide software at all. A good way to examine this would be to construct a matched set of publications that have no named method, but do have associated software (or web interface). However, I really don’t have time for doing that, and it sounds painfully boring.

However, another non-exclusive notion that this result suggests is that simply the presence of a recognizable, easily usable name for a method increases the likelihood that it will be cited in future work. A name gives the complicated and hard-to-describe process in the paper a “handle” that is easy to remember. This is actually fairly interesting psychologically and suggests what I believe many scientists already realize, that marketing (the choice of a good name for example) can be key in scientific impact. We can debate whether or not that’s a good thing, but it’s generally true in science.

So these results seem to suggest that a way to increase scientific impact is to name your method. Though, of course, correlation does not imply causation- so it certainly might not work that way. I’m really interested in seeing if there are patterns in the choice of name that extend to impact, but I’m not sure about how to do that. The length of the name (number of characters) has no correlation with number of citations, but that’s as far as I’ve gotten. Any suggestions?
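For the curious, a name-length check like the one mentioned above could look something like this. It is only a sketch: I am assuming a plain Pearson correlation, and the method names and citation counts are invented for illustration.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    return covariance / sqrt(spread_x * spread_y)


def name_length_vs_citations(pubs):
    """Correlate the character length of each method name with its citations."""
    lengths = [len(name) for name, _ in pubs]
    cites = [count for _, count in pubs]
    return pearson_r(lengths, cites)


# Hypothetical (method name, citation count) pairs, for illustration only.
example_pubs = [
    ("FooAlign", 30),
    ("BarDB", 40),
    ("QuxNet", 50),
    ("LongHypotheticalToolName", 35),
]
print(name_length_vs_citations(example_pubs))
```

A value near zero would match what I saw; values near 1 or -1 would mean name length tracks citations closely.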

 

Survival of the fitness: how to do good by your health on travel

I don’t travel a lot compared to some people I work with, but I do a bit of business travel. I just returned from a quick trip to DC. If you travel this way, and you’re trying to maintain an exercise regimen of any kind you know how hard it can be.

from DUSAN PETRICIC in The Scientist

When you get to your hotel you just want to lie in bed, relax, and veg out- meetings can go all day, and the food can be, to put it VERY generously, less than healthy. It’s easy to take the vacation way out. That is, to think, “hey, this business travel is kinda like a vacation and I can just let all this health stuff slide for a bit”. Slippery slope- very slippery. It’s not just the travel time you’re talking about, it’s also the time when you get back and start dodging your workout routines and your healthy eating because you’re out of practice. Actually, business travel can be a great opportunity (see me with the optimism) to do more than you usually do- if not in the eating area, at least in the fitness area. Here are some things that have helped me (and that I aspire to, I’m certainly not perfect in this area). I’m intentionally trying to avoid advice that’s good but applies to your fitness at any time, not just on travel.

Eating

  1. Bring along healthy snacks/small meals with you. This beats the heck out of buying stuff in the airport, on the airplane, from the hotel snack bar or (heaven forbid) minibar, or from a random vending machine. This wins on the nutrition front and on your wallet too. I generally pack energy bars (the Clif Zbars for kids are actually great for grownups too and about 120 calories); instant oatmeal with extras (brown sugar, dried fruit, peanut butter), since hotel rooms almost always have coffee makers- but don’t forget a spoon; crackers and tuna fish (Starkist has cute packages, but you can easily make your own); and fruit (NOT bananas, but apples, pears, etc.). All of this should make it through security OK- I’ve never had a problem (even with the PB, which is kind of a ‘paste’).
  2. Don’t give up on eating well, but realize that there are just those times. Dinners out with colleagues, free food buffets, cookies and muffins provided at the conference, alcohol and more alcohol- all those things can be tricky. Make sure that you keep a rough estimate of your caloric intake in your head and try to match it by (or, if you’re really good, precede it by) doing something from the exercise list below- that way things even out, more-or-less.
  3. You probably won’t eat your best, but DON’T eat your worst. This is just common sense, but it’s really easy to forget. If you’re going to eat bad don’t go whole hog- there are generally better choices and worse choices. Try to go toward the light.
  4. Use jet lag and busy meetings to your advantage. Sometimes jet lag and busy meetings (without food available) can be your friend. You may not be hungry at the times you normally are and you may be able to avoid some of the bad by simply skipping it (this can go both ways- I get hungry early in the morning on the East coast for some reason). Also, for me busy is better. I’ll simply forget that I’m hungry (at least hungry in that bored-so-I’ll-munch way).

Exercise

  1. Bring your workout clothes, dummy. It seems simple, but it’s probably not the thing you’re thinking of when you’re packing. Don’t forget workout shoes (I use some flat shoes that pack easily) and an mp3 player if you normally use one.
  2. Make use of the hotel gym. Most business hotels have workout rooms. Make sure you ask when you check in where it is and when it’s open. Use it but don’t be tied to your normal workout schedule since it probably won’t work on travel.
  3. Walk. If your meeting is in the city, walk. Walk to the conference (if it’s somewhere else), to dinner, or just plan to walk around during your breaks. This is the thing that’s really helped me and it’s fun too. Do some research prior to your trip to make sure you’ll be walking in safe areas or just ask at the front desk before you venture out. Walking back to the hotel, even a longish way, after dinner can be a good way to make up some calories- but ask at the restaurant about a safe path. Running works too.
  4. Get out and see the place. If you have breaks or free time go and see the sights, but walk. Use public transportation (Metro is best) to get from A to B and walk the rest. Travel like this is a great opportunity and walking is one of the best ways to actually see someplace.
  5. Use the stairs. Not just to get up to your room, but use the stairs to work out. It may be that the hotel doesn’t have a gym or that the gym isn’t the greatest. Use the stairs. Climbing 10 floors (about 5 minutes) should burn somewhere around 50 calories- and you can do it many times. It’s likely that no one will see you sweat, but this might not be the most interesting place to work out. Listen to music or podcasts to pass the time.
  6. Do a workout in your hotel room. You can blast the tunes, watch a movie, or do this completely naked (but please close the blinds, please). There are lots of different fitness regimens that you can do with no equipment at all- and they can kick your ass. Here’s a good set specifically for the hotel room stay from NerdFitness.
  7. Dress like you mean it. Planning to put on your workout clothes provides a much lower energy barrier than actually working out. So do that first. When you’re standing around in your workout clothes you’ll start to feel stupid for not working out. It actually works.
  8. Use your layover. Airports are big. Some are really big. Use that fact. If you have a layover of more than about 45 minutes start walking. Plan out your walk so that you don’t end up far away from your gate when you need to board- which would make you feel dumb, sweat, and probably hate me for my stupid ideas. Try walking the whole thing. If you have a roller bag so much the better. Dragging one of those things around will only make things better. Skip the moving walkways- instead try to beat the people standing on them (or walking on them even) to the other end. Pretend like you’re in a super hurry to catch your plane, it’s fun.