Top Posts of 2013

Although I started blogging in 2012, 2013 has been my first full year of blogging. It’s been fun so far, if a bit sporadic. I’ve posted approximately once a week, which is a bit less than I’d like. My top posts for the year are listed in order below. I’m looking forward to continuing to blog, and improve, in the coming year and beyond. My blog resolution for 2014 is to post more frequently but also to work on a few posts that are more like mini-papers: studies of actual data that are interesting to the scientific community, similar to my analysis of review times in journals. (Caveat: this ranking is based on absolute numbers, so it short-changes more recent posts that haven’t had as much time to be viewed. But really I think it’s pretty reasonable.)

I had some failures in 2013 too. Some posts that I was sure would knock it out of the park didn’t garner much interest. Also, I started a series (parts 1, 2, 3, 4, 5, 6) that was supposed to chronicle my progress on a computational biology project in real time. That series has stalled because it was a bit harder to put together the project than I thought it would be (this is not surprising in the least BTW) and I ran into other more pressing things I needed to do. I’m still planning on finishing this- it seems like a perfect project for the Jan-Feb lull that sometimes occurs.

Top Posts of 2013 for The Mad Scientist Confectioner’s Club

  1. Scientific paper easter eggs: Far and away my most viewed post. A list of funny things that authors have hidden in scientific papers, but also of just funny (intentionally or not) scientific papers. And these keep coming too- so much so that I started a Tumblr to add new ones.
  2. How long is long: Time in review for scientific publications/Time to review for scientific publications revisited: These two posts have analysis I’ve done of the time my papers spent in review. After some Twitter discussions I posted the second one that looked at how long the papers took to get their first review returned, which is more fair to the journals (my first post looked at overall time, including the time that I spent revising the papers). Look for a continuation of this in 2014, hopefully including contribution of data from other people.
  3. Eight red flags in bioinformatics analyses: I’m still working on revising this post into a full paper since I think there’s a lot of good stuff in there. Unfortunately on the back burner right now. However, I did get my first Nature publication (in the form of a Comment) out of the deal. Not bad.
  4. Reviewer 3, I presume?: This post was to recap the (moderate) success of a Tweet I made, and the turning of that Tweet into a sweet T-shirt!
  5. Gaming the system: How to get an astronomical h-index with little scientific impact: One of my favorite posts (though I think I wrote it in 2012) does a bit of impact analysis on a Japanese bioinformatics group that published (and still publishes) a whole bunch of boilerplate papers- and got an h-index close to 50!
  6. How can two be worse than one? Replicates in high-throughput experiments: I’m including this one so that this list isn’t 5 long, and also because I like this post. This is essentially a complaint about the differences between the way that statisticians and data analysts (computational biologists, e.g.) see replicates in high-throughput data and how wet-lab biologists see them. It has yielded one of my new favorite quotes (from myself) that’s not actually in the post: “The only reason to do an experiment with two replicates is because you know replicates are important, but you don’t know why.”

Have a great New Year and see everyone in 2014!

Gender bias in scientific publishing

The short version: This is a good paper about an important topic, gender bias in publication. The authors try to address two main points: What is the relationship between gender and research output?; and What is the relationship between author gender and paper impact? The study shows a bias in number of papers published by gender, but apparently fails to control for the relative number of researchers of each gender found in each field. This means that the first point of the paper, that women publish less than men, can’t be separated from the well-known gender bias in most of these fields- i.e. there are more men than women. This seems like a strange oversight, and it’s only briefly mentioned in the paper. The second point, which is made well and clearly, is that papers authored by women are cited less than those authored by men. This is the only real take home of the paper, though it is a very important and alarming one.
What the paper does say: that papers authored by women are cited less than those authored by men.
What the paper does NOT say: that women are less productive than men, on average, in terms of publishing papers.
The slightly longer version
This study on gender bias in scientific publishing is a really comprehensive look at gender and publishing world-wide (though it is biased toward the US). The authors do a good job of laying out previous work in this area and then indicate that they are interested in looking at scientific productivity with respect to differences in gender. The first stated goal is to provide an analysis of: “the relationship between gender and research output (for which our proxy was authorship on published papers).”
The study is not in any way incorrect (that I can see in my fairly cursory read-through) but it does present the data in a way that is a bit misleading. Most of the paper describes gathering pretty comprehensive data on gender in published papers relative to author position, geographic location, and several other variables. This is then used to ‘show’ that women are less productive than men in scientific publication, but it omits a terribly important step- the authors never seem to normalize for the ratio of women to men in positions that might be publishing at all. That is, their results very clearly reiterate that there is a gender bias in the positions themselves- but don’t say anything (that I can see) about the productivity of individuals (how many papers were published by each author, for example).
They do mention this issue in their final discussion:
“UNESCO data show that in 17% of countries an equal number of men and women are scientists. Yet we found a grimmer picture: fewer than 6% of countries represented in the Web of Science come close to achieving gender parity in terms of papers published.”
And, though this is true, it seems like a less-than-satisfying analysis of the data.
On the other hand, the result that they show at the end- the number of times a paper is cited when a male or female name is included in various author positions- is pretty compelling and is really their novel finding. This is actually a pretty sobering analysis, and the authors provide some ideas on how to address this issue, which seems to be part of the larger problem of providing equal opportunities and advantages to women in science.

Reviewer 3… was RIGHT!

I’m just taking a pass at revising a paper I haven’t really looked at in about six months. I’m coming to a sobering realization: reviewer 3 was right! The paper did deserve to be rejected because of the way it was written and, in spots, poor presentation.

I’ve noticed this before but this was a pretty good example. The paper was originally reviewed for a conference and the bulk of the critique was that it was hard to understand and that some of the data that should have been there wasn’t presented. Because I didn’t get a shot at resubmitting (it being a conference) I decided to do a bit more analysis and quickly realized that a lot of the results I’d come up with (but not all) weren’t valid. Or rather, they didn’t validate in another dataset. The reviewers didn’t catch that, but it meant that I shelved the paper for a while until I had time to really revise.

Now I’ve redone the analysis, updated with results that actually work, and have been working on the paper. There are lots of places in the paper where I clearly was blinded to my own knowledge at the time- and I think that’s very common. That is, I presented ideas and results without adequate explanation. At the time it all made sense to me because I was in the moment- but now it seems confusing, even to me. One reviewer stated that the paper is “difficult for me to assess its biological significance in its current form” and another that “I find the manuscript difficult to follow.” Yet another noted that the paper, “lacks a strong biological hypothesis”, which was mainly due to poor presentation on my part.

There were some more substantive comments as well- and I’m addressing those in my revision but this was a good wake-up call for someone like me who has a number of manuscripts under their belt, to be more careful about reading my own work with a fresh eye and having more colleagues or collaborators read my work before it goes in. One thing that I like to do (but often don’t do) is to have someone not involved with the manuscript or the project take a read over the paper. That way you get really fresh eyes – like those of a reviewer – that can point out places where things just don’t add up. Wish me luck for the next round with this paper!

15 great ways to fool yourself about your results

I’ve written before about how easy it is to fool yourself and some tips on how to avoid it for high-throughput data. Here is a non-exhaustive list of ways you too can join in the fun!

  1. Those results SHOULD be that good. Nearly perfect. It all makes sense.
  2. Our bioinformatics algorithm worked! We put input in and out came output! Yay! Publishing time.
  3. Hey, these are statistically significant results. I don’t need to care about how many different ways I tested to see if SOMETHING was significant about them.
  4. We only need three replicates to come to our conclusions. Really, it’s what everyone does.
  5. These results don’t look all THAT great, but the biological story is VERY compelling.
  6. A pilot study can yield solid conclusions, right?
  7. Biological replicates? Those are pretty much the same as technical replicates, right?
  8. Awesome! Our experiment eliminated one alternate hypothesis. That must mean our hypothesis is TRUE!
  9. Model parameters were chosen based on what produced reasonable output: therefore, they are biologically correct.
  10. The statistics on this comparison just aren’t working out right. If I adjust the background I’m comparing to I can get much better results. That’s legit, right?
  11. Repeating the experiment might spoil these good results I’ve got already.
  12. The goal is to get the p-value less than 0.05. End.Of.The.Line. (h/t Siouxsie Wiles)
  13. Who, me biased? Bias is for chumps and those not so highly trained in the sciences as an important researcher such as myself. (h/t Siouxsie Wiles)
  14. It doesn’t seem like the right method to use- but that’s the way they did it in this one important paper, so we’re all good. (h/t Siouxsie Wiles)
  15. Sure the results look surprising, and I apparently didn’t write down exactly what I did, and my memory on it’s kinda fuzzy because I did the experiment six months ago, but I must’ve done it THIS way because that’s what would make the most sense.
  16. My PI told me to do this, so it’s the right thing to do. If I doubt that it’s better not to question it since that would make me look dumb.
  17. Don’t sweat the small details- I mean what’s the worst that could happen?

Want to AVOID doing this? Check out my previous post on ways to do robust data analysis and the BioStat Decision Tool from Siouxsie Wiles that will walk you through the process of choosing appropriate statistical analyses for your purposes! Yes, it is JUST THAT EASY!

Feel free to add to this list in the comments. I’m sure there’s a whole gold mine out there. Never a shortage of ways to fool yourself.


Gut feelings about gut feelings about marriage

An interesting study was published about the ‘gut feelings’ of newlyweds, and how they can predict future happiness in the marriage. The study assessed gut feelings (as opposed to stated feelings, which are likely to be biased in the rosy direction) of newlyweds towards their spouse by a word association and controlled for several different variables (like how the same people react to random strangers with the word association). They found that newlyweds that had more ‘gut feeling’ positive associations about their spouse were in happier relationships after four years. Sounds pretty good, right? Fits with what you might think about gut feelings.

The interesting point (which is nicely put in a Nature piece that covers this study) is that after other effects are factored out of the analysis the positive association was statistically significant, but that it could only explain 2% of the eventual difference in happiness (this analysis was apparently done by the Nature reporter, and not reported in the original paper). 2%! That’s not a very meaningful effect- even though it may be statistically significant. Though the study is certainly interesting and likely contains quite a bit of good data – this effect seems vanishingly small.
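To see how an effect can be statistically significant and still tiny, here’s a rough sketch in Python. It rests on some loud assumptions: I’m reading “explains 2%” as a squared correlation (r² = 0.02), using the textbook t statistic for a correlation coefficient, and plugging in made-up sample sizes that are NOT taken from the study. The 1.96 cutoff is the usual large-sample approximation for a two-sided 5% test.

```python
import math

# Assumption: "explains 2% of the difference" is read as r^2 = 0.02.
VARIANCE_EXPLAINED = 0.02
R = math.sqrt(VARIANCE_EXPLAINED)  # the corresponding correlation, about 0.14


def t_statistic(r, n):
    """t statistic for testing a Pearson correlation r against zero, given n observations."""
    return r * math.sqrt((n - 2) / (1.0 - r * r))


def looks_significant(n, cutoff=1.96):
    """Rough two-sided 5% check using the large-sample cutoff of 1.96."""
    return abs(t_statistic(R, n)) > cutoff


# Hypothetical sample sizes, for illustration only.
for n in (50, 200, 1000):
    print(f"n={n}: r^2={VARIANCE_EXPLAINED}, t={t_statistic(R, n):.2f}, "
          f"passes 5% cutoff: {looks_significant(n)}")
```

The point of the sketch is that the effect size never changes: it stays at 2% of the variance while the verdict on ‘significance’ flips purely because of sample size.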

For interest, here are the titles of the paper and of some follow-on news pieces that were written about it. Note how they make the results seem much more clear cut and meaningful.

Title of the original Science paper:

Though They May Be Unaware, Newlyweds Implicitly Know Whether Their Marriage Will Be Satisfying

Title of the Nature piece covering this study:

Newlyweds’ gut feelings predict marital happiness: Four-year study shows that split-second reactions foretell future satisfaction.

Headline from New Zealand Herald article:

Gut instinct key to a long and happy marriage

Headline from New York Daily News:

Newlyweds’ gut feelings on their marriage are correct: study


A word about balance

I’ve been reviewing machine learning papers lately and have seen a particular problem repeatedly. Essentially it’s a problem of how a machine learning algorithm is trained and evaluated for performance versus how it would actually be applied. I’ve seen this particular problem in a whole bunch of published papers too, so I thought I’d write a blog rant about it. I’ve given a quick-and-dirty primer on machine learning approaches at the end of this post for those interested.

The problem is this: methods are often evaluated using an artificial balance of positive versus negative training examples, one that can artificially inflate estimates of performance over what would actually be obtained in a real world application.

I’ve seen lots of studies that use a balanced approach to training. That is, the number of positive examples is matched with the number of negative examples. The problem is that many times the number of negative examples in a ‘real world’ application is much larger than the number of positive examples- sometimes by orders of magnitude. The reason that is often given for choosing to use a balanced training set? That this provides better performance and that training on datasets with a real distribution of examples would not work well since any pattern in the features from the positive examples would be drowned out by the sheer number of negative examples. So essentially- that when we use a real ratio of positive to negative examples in our evaluation our method sucks. Hmmmmm……

This argument is partly true- some machine learning algorithms do perform very poorly with highly unbalanced datasets. Support Vector Machines (SVM) and some other kinds of machine learning approaches, though, seem to do pretty well. Some studies then follow this initial balanced training step with an evaluation on a real world set – that is, one with a ‘naturally’ occurring balance of positive and negative examples. This is a perfectly reasonable approach. However, too many studies don’t do this step, or perform a follow-on ‘validation’ on a dataset with more negative examples, but still nowhere near the number that would be present in a real dataset. And importantly- the ‘bad’ studies report the performance results from the balanced (and thus, artificial) dataset.

The issue here is that evaluation on a dataset with an even number of positive and negative examples can vastly overestimate performance by decreasing the number of false positive predictions that are made. Imagine that we have a training set with 50 positive examples and a matched number of 50 negative examples. The algorithm is trained on these examples and cross-validation (random division of the training set for evaluation purposes- see below) reveals that the algorithm predicts 40 of the positives to be positive (TP) and 48 of the negatives to be negative (TN). So it misclassifies two negative examples as positive examples (FP), with scores that make them look as good as or better than the TPs- which wouldn’t be too bad, since the majority of positive predictions would be true positives. Now imagine that the actual ratio of positive to negative examples in a real world dataset was 1:50, that is, for every positive example there are 50 negative examples. What’s not done in these problem cases is extrapolating the performance of the algorithm to such a real world dataset. With the same 50 positives there would be 2,500 negatives, and misclassifying 2 out of every 50 negatives (4%) means you’d expect to see 100 false positive predictions- now outnumbering the 40 true positive predictions and making the results a lot less confident than originally estimated. The example I use here is actually a generous one. I frequently deal with datasets (and review or read papers) where the ratios are 1:100 to 1:10,000, where this can substantially impact results.
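The extrapolation in this example is simple enough to sketch in a few lines of Python. The function name and arguments are made up for illustration, and the sketch assumes that the error rates measured on the balanced set carry over unchanged to the real world set, which is itself an optimistic assumption.

```python
def extrapolate(tp, fn, fp, tn, negatives_per_positive):
    """Rescale balanced-set results to a set with the given negative:positive ratio.

    Assumes the false positive rate seen on the balanced set (fp out of fp + tn)
    also holds for the larger pool of negatives.
    """
    positives = tp + fn
    real_negatives = positives * negatives_per_positive
    expected_fp = fp * real_negatives / (fp + tn)
    expected_tp = tp  # the number of positives does not change
    precision = expected_tp / (expected_tp + expected_fp)
    return expected_tp, expected_fp, precision


# Numbers from the example: 50 positives (40 found), 50 negatives (2 misclassified).
balanced = extrapolate(tp=40, fn=10, fp=2, tn=48, negatives_per_positive=1)
real_world = extrapolate(tp=40, fn=10, fp=2, tn=48, negatives_per_positive=50)

for name, (tp, fp, precision) in (("balanced 1:1", balanced), ("real world 1:50", real_world)):
    print(f"{name}: TP={tp}, expected FP={fp:.0f}, precision={precision:.2f}")
```

On the balanced set about 95% of positive predictions are correct (40 of 42). At a 1:50 ratio that drops to about 29% (40 of 140), even though the algorithm itself hasn’t changed at all.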

So the evaluation of a machine learning method should involve a step where a naturally occurring ratio of positive and negative examples is represented. Though this natural ratio may not be clearly evident for some applications, it should be given a reasonable estimate. The performance of the method should be reported based on THIS evaluation, not the evaluation on the balanced set- since that is likely to be inflated from a little to a lot.

For those that are interested in real examples of this problem I’ve got two example studies from one of my own areas of research- type III effector prediction in bacteria. In Gram negative bacteria with type III secretion systems there are an unknown number of secreted effectors (proteins that are injected into host cells to effect virulence) but we estimate on the order of 50-100 for a genome like Salmonella Typhimurium, which has 4500 proteins total, so the ratio should be around 1:40 to 1:150 for most bacteria like this. In my own study on type III effector prediction I used a 1:120 ratio for evaluation for exactly this reason. A subsequent paper in this area was published that chose to use a 1:2 ratio because “the number of non-T3S proteins was much larger than the number of positive proteins,…, to overcome the imbalance between positive and negative datasets.” If you’ve been paying attention, THAT is not a good reason and I didn’t review that paper (though I’m not saying that their conclusions are incorrect since I haven’t closely evaluated their study).

  1. Samudrala R, Heffron F and McDermott JE. 2009. Accurate prediction of secreted substrates and identification of a conserved putative secretion signal for type III secretion systems. PLoS Pathogens 5(4):e1000375.
  2. Wang Y, Zhang Q, Sun MA and Guo D. 2011. High-accuracy prediction of bacterial type III secreted effectors based on position-specific amino acid composition profiles. Bioinformatics 27(6):777-784.

So the trick here is to not fool yourself, and in turn fool others. Make sure you’re being your own worst critic. Otherwise someone else will take up that job instead.

Quick and Dirty Primer on Machine Learning

Machine learning is an approach to pattern recognition that learns patterns from data. Oftentimes the pattern that is learned is a particular pattern of features, properties of the examples, that can classify one group of examples from another. A simple example would be to try to identify all the basketball players at an awards ceremony for football, basketball, and baseball players. You would start out by selecting some features, that is, player attributes, that you think might separate the groups out. You might select hair color, length of shorts or pants in the uniform, height, and handedness of the player as potential features. Obviously all these features would not be equally powerful at identifying basketball players, but a good algorithm will be able to make best use of the features.

A machine learning algorithm could then look at all the examples: the positive examples, basketball players; and the negative examples, everyone else. The algorithm would consider the values of the features in each group and ideally find the best way to separate the two groups. Generally to evaluate the algorithm all the examples are separated into a training set, to learn the pattern, and a testing set, to test how well the pattern works on an independent set. Cross-validation, a common method of evaluation, does this repeatedly, each time separating the larger group into training and testing sets by randomly selecting positive and negative examples to put into each set.

Evaluation is very important since the performance of the method will provide end users with an idea of how well the method has worked for their real world application where they don’t know the answers already. Performance measures vary but for classification they generally involve comparing predictions made by the algorithm with the known ‘labels’ of the examples- that is, whether the player is a basketball player or not. There are four categories of prediction: true positives (TP), the algorithm predicts a basketball player where there is a real basketball player; true negatives (TN), the algorithm predicts not a basketball player when the example is not a basketball player; false positives (FP), the algorithm predicts a basketball player when the example is not; and false negatives (FN), the algorithm predicts not a basketball player when the example actually is.
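As a minimal sketch of how those four categories get counted, here is a bit of Python. The labels and predictions are invented for illustration (True means “basketball player”), and the function name is hypothetical, not from any particular library.

```python
def confusion_counts(labels, predictions):
    """Count TP, TN, FP and FN by comparing known labels with predictions."""
    counts = {"TP": 0, "TN": 0, "FP": 0, "FN": 0}
    for actual, predicted in zip(labels, predictions):
        if predicted and actual:
            counts["TP"] += 1      # predicted basketball player, really is one
        elif predicted and not actual:
            counts["FP"] += 1      # predicted basketball player, but is not
        elif not predicted and actual:
            counts["FN"] += 1      # missed a real basketball player
        else:
            counts["TN"] += 1      # correctly predicted not a basketball player
    return counts


# Made-up example: 3 basketball players and 5 other athletes.
labels      = [True, True, True, False, False, False, False, False]
predictions = [True, True, False, True, False, False, False, False]

counts = confusion_counts(labels, predictions)
precision = counts["TP"] / (counts["TP"] + counts["FP"])  # how many positive calls are right
recall = counts["TP"] / (counts["TP"] + counts["FN"])     # how many real positives are found
print(counts, f"precision={precision:.2f}", f"recall={recall:.2f}")
```

Precision is the measure that suffers most when the real world has far more negatives than the evaluation set, since every extra negative is another chance for a false positive.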

Features (height and pant length) of examples (basketball players and non-basketball players) plotted against each other. Trying to classify based on either of the individual features won't work well but a machine learning algorithm can provide a good separation. I'm showing something that an SVM might do here- but the basic idea is the same with other ML algorithms.
