Well, there probably ARE some exceptions here.

So I first thought of this as a funny way of expressing relief over a paper being accepted that was a real pain to get finished. But after I thought about the general idea awhile I actually think it’s got some merit in science. Academic publication is not about publishing airtight studies with every possibility examined and every loose end or unconstrained variable nailed down. It can’t be. That would limit scientific productivity to zero because it’s not possible. Science is an evolving dialogue, some of it involving elements of the truth.

The dirty little secret (or elegant grand framework, depending on your perspective) of research is that science is not about finding the truth. It’s about moving our understanding closer to the truth. Oftentimes that involves false positive observations- not because of the misconduct of science but because of its proper conduct. You should never publish junk or anything that’s deliberately misleading. But you can’t help publishing things that sometimes move us further away from the truth. The idea in science is that these erroneous findings will be corrected by further iterations and may even provide an impetus for driving studies that advance science. So publish away!

Magic Hands

Too good to be true or too good to pass up?

There’s been a lot of discussion about the importance of replication in science (read an extensive and very thoughtful post about that here) and notable occurrences of non-reproducible science being published in high-impact journals. One example is the recent retraction of the two STAP stem cell papers from Nature and the accompanying debate over who should be blamed and how. Another is the publication of a study (see also my post about this) in which research labs responsible for high-impact publications were challenged to reproduce their findings, which showed that many of these findings could not be replicated, even in the same labs they were originally performed in. These, and similar cases and studies, indicate serious problems in the scientific process- especially, it seems, for some high-profile studies published in high-impact journals.

I was surprised, therefore, at the reaction of some older, very experienced PIs recently after a talk I gave at a university. I mentioned these problems, and briefly explained the results of the study on reproducibility to them- that, in 90% of the cases, the same lab could not reproduce the results that they had previously published. They were generally unfazed. “Oh”, one said, “probably just a post-doc with magic hands that’s no longer in the group”. And all agreed on the difficulty of reproducing results for difficult and complicated experiments.

So my question is: do these fabled lab technicians actually exist? Are there those people who can “just get things to work”? And is this actually a good thing for science?

I have some personal experience in this area. I was quite good at futzing around with getting a protocol to work the first time. I would get great results. Once. Then I would continue to ‘innovate’ and find that I couldn’t replicate my previous work. In my early experiences I sometimes would not keep notes well enough to allow me to go back to the point where I got it to work. Which was quite disturbing and could send me into a non-productive tailspin of trying to replicate the important results. Other times I’d written things down sufficiently that I could get them to work again. And still others I found that someone else in the lab could consistently get better results out of the EXACT SAME protocol- apparently followed the same way. They had magic hands. Something about the way they did things just *worked*. There were some protocols in the lab that just seemed to need this magic touch- some people had it and some people didn’t. But does that mean that the results these protocols produced were wrong?

What kinds of procedures seem to require “magic hands”? One example is from when I was doing electron microscopy (EM) as a graduate student. We were working constantly at improving our protocols for making two-dimensional protein crystals for EM. This was delicate work, which involved mixing protein with a buffer in a small droplet, layering on a special lipid, incubating for some amount of time to let the crystals form, then lifting the fragile lipid monolayer (hopefully with protein crystals) off onto an EM grid and finally staining with an electron dense stain or flash freezing in liquid nitrogen. The buffers would change, the protein preparations would change, the incubation conditions would change, and how the EM grids were applied to our incubation droplets to lift off the delicate 2D crystals was subject to variation. Any one of these things could scuttle getting good crystals and would therefore produce a non-replication situation. There were several of us in the lab that did this and were successful in getting it to work- but it didn’t always work and it took some time to develop the right ‘touch’ to get it to work. The number of factors that *potentially* contributed to success or failure was daunting and a bit disturbing- and sometimes didn’t seem to be amenable to communication in a written protocol. The line between superstition and required steps was very thin.

But this is true of many protocols that I worked with throughout my lab career* – they were often complicated, multi-step procedures that could be affected by many variables- from the ambient temperature and humidity to who prepared the growth media and when. Not that all of these variables DID affect the outcomes but when an experiment failed there was a long list of possible causes. And the secret with this long list? It probably didn’t include all the factors that did affect the outcome. There were likely hidden factors that could be causing problems. So is someone with magic hands lucky, gifted, or simply persistent? I know of a few examples where all three qualities were likely present- with the last one being, in a way, most important. Yes, my collaborator’s post-doc was able to do amazing things and get amazing results. But (and I know this was the case) she worked really long and hard to get them. She probably repeated experiments many, many times in some cases before she got them to work. And then she repeated the exact combination to repeat the experiments again. And again. And sometimes even that wasn’t enough (oops, the buffer ran out and had to be remade, but the lot number on the bottle was different, and weren’t they working on the DI water supply last week? Now my experiment doesn’t work anymore.)

So perhaps it’s not so surprising that many of these key findings from these papers couldn’t be repeated, even in the same labs. There was not the same incentive to get it to work for one thing- so that post-doc, or another graduate student who’s taken over the same duties, probably tried once to repeat the experiment. Maybe twice. Didn’t work. Huh? That’s unfortunate. And that’s about as much time as we’re going to put into this little exercise. The protocols could be difficult, complicated, and have many known and unknown variables affecting their outcomes.

But does it mean that all these results are incorrect? Does it mean that the underlying mechanisms or biology that were discovered were just plain wrong? No. Not necessarily. Most, if not all, of these high-profile publications that failed to repeat spawned many follow-on experiments and studies. It’s likely that many of the findings were borne out by orthogonal experiments, that is, experiments that test implications of these findings, and by extension the results of the original finding itself. Because of the nature of this study it was conducted anonymously- so we don’t really know, but it’s probably true. An important point, and one that was brought up by the experienced PIs I was talking with, is that sometimes direct replication may not be the most important thing. Important, yes. But perhaps not deal-killing if it doesn’t work. The results still might stand IF, and only if, second, third, and fourth orthogonal experiments can be performed that tell the same story.

Does this mean that you actually can make stem cells by treating regular cultured cells with an acid bath? Well, probably not. For some of these surprising, high-profile findings the ‘replication’ that is discussed is other labs trying to see if the finding is correct. So they try the protocols that have been reported, but it’s likely that they also try other orthogonal experiments that would, if positive, support the original claim.

"OMG! This would be so amazing if it's true- so, it MUST be true!"

So this gets back to my earlier discussions on the scientific method and the importance of being your own worst skeptic (see here and here). For every positive result the first reaction should be “this is wrong”, followed by, “but- if it WERE right then X, Y, and Z would have to be true. And we can test X, Y, and Z by…”. The burden of scientific ‘truth’** is in replication, but in replication of the finding– NOT NECESSARILY in replication of the identical experiments.

*I was a labbie for quite a few of my formative years. That is, I actually got my hands dirty and did real, honest-to-god experiments, with Eppendorf tubes, vortexers, water baths, cell culture, the whole bit. Then I converted and became what I am today – a creature purely of silicon and code. Which suits me quite well. This is all just to add to my post a “I kinda know what I’m talking about here- at least somewhat”.

** where I’m using a very scientific meaning of truth here, which is actually something like “a finding that has extensive support through multiple lines of complementary evidence”

Please help me with my simple demonstration

I’ve written before about the importance of replicates. Here’s my funny idea of how a scientist might carry out that meme where you post a picture of yourself holding up a sign and see how far it gets passed around the internet, to demonstrate the danger of posting stuff to kids/students/etc. And what is up with that anyway? It’s interesting and cool the first few times you see someone do it. But after that it starts to get a *little* bit old.

I am but a poor scientist trying to demonstrate (very confidently) a simple concept.

The false dichotomy of multiple hypothesis testing

[Disclaimer: I’m not a statistician, but I do play one at work from time to time. If I’ve gotten something wrong here please point it out to me. This is an evolving thought process for me that’s part of the larger picture of what the scientific method does and doesn’t mean- not the definitive truth about multiple hypothesis testing.]

There’s a division in research between hypothesis-driven and discovery-driven endeavors. In hypothesis-driven research you start out with a model of what’s going on (this can be explicitly stated or just the amalgamation of what’s known about the system you’re studying) and then design an experiment to test that hypothesis (see my discussions on the scientific method here and here). In discovery-driven research you start out with more general questions (that can easily be stated as hypotheses, but often aren’t) and generate larger amounts of data, then search the data for relationships using statistical methods (or other discovery-based methods).

The problem with analysis of large amounts of data is that when you’re applying a statistical test to a dataset you are actually testing many, many hypotheses at once. This means that your level of surprise at finding something that you call significant (arbitrarily but traditionally a p-value of less than 0.05) may be inflated by the fact that you’re looking a whole bunch of times (thus increasing the odds that you’ll observe SOMETHING just on random chance alone- see this excellent xkcd cartoon for an example, see below since I’ll refer to this example). So you need to apply some kind of multiple hypothesis correction to your statistical results to reduce the chances that you’ll fool yourself into thinking that you’ve got something real when actually you’ve just got something random. In the XKCD example below a multiple hypothesis correction using Bonferroni’s method (one of the simplest and most conservative corrections) would suggest that the threshold for significance should be moved to 0.05/20=0.0025 – since 20 different tests were performed.
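To put rough numbers on the jelly bean example, here is a minimal sketch of the Bonferroni arithmetic. It assumes the 20 tests are independent, which is my simplification and not something the cartoon states, and the function names are just ones I made up for illustration.

```python
def bonferroni_threshold(alpha, n_tests):
    """Per-test significance threshold after Bonferroni correction."""
    return alpha / n_tests


def familywise_error_rate(per_test_alpha, n_tests):
    """Chance of at least one false positive across n_tests tests.

    Assumes every null hypothesis is true and the tests are independent.
    """
    return 1.0 - (1.0 - per_test_alpha) ** n_tests


n_colors = 20  # jelly bean colors tested in the cartoon
alpha = 0.05   # the traditional (arbitrary) significance level

corrected = bonferroni_threshold(alpha, n_colors)
print("corrected threshold:", corrected)
print("chance of a false hit, uncorrected:",
      round(familywise_error_rate(alpha, n_colors), 3))
print("chance of a false hit, corrected:",
      round(familywise_error_rate(corrected, n_colors), 3))
```

Under those assumptions the uncorrected chance of at least one "significant" color comes out at roughly 64%, and the correction pulls it back to just under 5%.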

Here’s where the problem of a false dichotomy occurs. Many researchers who analyze large amounts of data believe that utilizing a hypothesis-based approach mitigates the effect of multiple hypothesis testing on their results. That is, they believe that they can focus their investigation of the data to a subset constrained by a model/hypothesis and thus reduce the effect that multiple hypothesis testing has on their analysis. Instead of looking at 10,000 proteins in a study they now look at only the 25 proteins that are thought to be present in a particular pathway of interest (where the pathway here represents the model based on existing knowledge). This is like saying, “we believe that jelly beans in the blue-green color range cause acne” and then drawing your significance threshold at 0.05/4=0.0125 – since there are ~4 jelly beans tested that are in the blue-green color range (not sure if ‘lilac’ counts or not- that would make 5). All well and good EXCEPT for the fact that the actual chance of detecting something by random chance HASN’T changed. In large scale data analysis (transcriptome analysis, e.g.) you’ve still MEASURED everything else. You’ve just chosen to limit your investigation to a smaller subset and then can ‘go easy’ on your multiple hypothesis correction.
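Here is a back-of-the-envelope sketch of what ‘going easy’ means for the 10,000 protein example. It assumes the worst case where nothing real is going on at all (every null hypothesis is true, so p-values are uniformly distributed); that assumption and the function name are mine, for illustration only.

```python
def expected_false_positives(n_null_tests, threshold):
    """Expected count of null tests with p < threshold (uniform p-values)."""
    return n_null_tests * threshold


n_measured = 10_000  # proteins actually measured in the study
n_pathway = 25       # proteins in the pathway of interest
alpha = 0.05

lenient = alpha / n_pathway   # Bonferroni over the pathway only
strict = alpha / n_measured   # Bonferroni over everything measured

print("lenient threshold:", lenient)
print("strict threshold:", strict)
print("expected chance hits in the pathway (lenient):",
      expected_false_positives(n_pathway, lenient))
print("expected chance hits in all measured proteins (lenient):",
      expected_false_positives(n_measured, lenient))
print("expected chance hits in all measured proteins (strict):",
      expected_false_positives(n_measured, strict))
```

So at the lenient threshold you would expect something like 20 proteins elsewhere in the dataset to look every bit as ‘significant’ as a hit in your pathway, purely by chance. You just never looked at them.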

The counter-argument that might be made to this point is that by doing this you’re testing a specific hypothesis, one that you believe to be true and may be supported by existing data. This is a reasonable point in one sense- it may lend credence to your finding that there is existing information supporting your result. But on the other hand it doesn’t change the fact that you still could be finding more things by chance than you realize because you simply hadn’t looked at the rest of your data. It turns out that this is true not just of analysis of big data, but also of some kinds of traditional experiments aimed at testing individual – associative- hypotheses. The difference there is that it is technically infeasible to actually test a large number of the background cases (generally limited to one or two negative controls). Also a mechanistic hypothesis (as opposed to an associative one) is based on intervention, which tells you something different and so is not (as) subject to these considerations.

Imagine that you’ve dropped your car keys in the street and you don’t know what they look like (maybe you’re borrowing a friend’s car). You’re pretty sure you dropped them in front of the coffee shop on a block with 7 other shops on it- but you did walk the length of the block before you noticed the keys were gone. You walk directly back to look in front of the coffee shop and find a set of keys. Great, you’re done. You found your keys, right? What if you looked in front of the other stores and found other sets of keys? You didn’t look- but that doesn’t make it less likely that you’re wrong about these keys (your existing knowledge/model/hypothesis “I dropped them in front of the coffee shop” could easily be wrong).

XKCD: significant

A word about balance

I’ve been reviewing machine learning papers lately and have seen a particular problem repeatedly. Essentially it’s a problem of how a machine learning algorithm is trained and evaluated for performance versus how it would be actually applied. I’ve seen this particular problem in a whole bunch of published papers too, so I thought I’d write a blog rant post about it. I’ve given a quick-and-dirty primer to machine learning approaches at the end of this post for those interested.

The problem is this: methods are often evaluated using an artificial balance of positive versus negative training examples, one that can artificially inflate estimates of performance over what would actually be obtained in a real world application.

I’ve seen lots of studies that use a balanced approach to training. That is, the number of positive examples is matched with the number of negative examples. The problem is that many times the number of negative examples in a ‘real world’ application is much larger than the number of positive examples- sometimes by orders of magnitude. The reason that is often given for choosing to use a balanced training set? That this provides better performance and that training on datasets with a real distribution of examples would not work well since any pattern in the features from the positive examples would be drowned out by the sheer number of negative examples. So essentially- that when we use a real ratio of positive to negative examples in our evaluation our method sucks. Hmmmmm……

This argument is partly true- some machine learning algorithms do perform very poorly with highly unbalanced datasets. Support Vector Machines (SVMs) and some other kinds of machine learning approaches, though, seem to do pretty well. Some studies then follow this initial balanced training step with an evaluation on a real world set – that is, one with a ‘naturally’ occurring balance of positive and negative examples. This is a perfectly reasonable approach. However, too many studies don’t do this step, or perform a follow-on ‘validation’ on a dataset with more negative examples, but still nowhere near the number that would be present in a real dataset. And importantly- the ‘bad’ studies report the performance results from the balanced (and thus, artificial) dataset.

The issue here is that evaluation on a dataset with an even number of positive and negative examples can vastly overestimate performance by decreasing the number of false positive predictions that are made. Imagine that we have a training set with 50 positive examples and a matched number of 50 negative examples. The algorithm is trained on these examples and cross-validation (random division of the training set for evaluation purposes- see below) reveals that the algorithm predicts 40 of the positives to be positive (TP) and 48 of the negatives to be negative (TN). So it misclassifies two negative examples as positive examples (FP), with scores that make them look as good or better than the other TPs- which wouldn’t be too bad, since the majority of positive predictions would be true positives. Now imagine that the actual ratio of positive to negative examples in a real world example was 1:50, that is, for every positive example there are 50 negative examples. What’s not done in these problem cases is extrapolating the performance of the algorithm to that real world dataset. The same 50 positives would come with 2,500 negatives, and at the same false positive rate (2 of 50, or 4%) you’d expect to see 100 false positive predictions- now outnumbering the 40 true positive predictions and making the results a lot less confident than originally estimated. The example I use here is actually a generous one. I frequently deal with datasets (and review or read papers) where the ratios are 1:100 to 1:10,000, where this can substantially impact results.
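The extrapolation is simple enough to write down as a few lines of code. This is just a sketch of the arithmetic for the 50/50 example, assuming that the false positive rate measured on the balanced set carries over unchanged to the real world set (which is itself optimistic). The function name is mine.

```python
def extrapolate_precision(tp, fn, fp, tn, negatives_per_positive):
    """Rescale balanced-set results to a different positive:negative ratio.

    Returns (expected false positives, precision) for a dataset with the
    same number of positives but negatives_per_positive negatives for each
    positive, assuming the false positive rate stays the same.
    """
    n_positives = tp + fn
    false_positive_rate = fp / (fp + tn)
    expected_fp = false_positive_rate * n_positives * negatives_per_positive
    precision = tp / (tp + expected_fp)
    return expected_fp, precision


# Counts from the balanced cross-validation example: 50 positives, 50 negatives
tp, fn, fp, tn = 40, 10, 2, 48

for ratio in (1, 50, 100, 10_000):
    expected_fp, precision = extrapolate_precision(tp, fn, fp, tn, ratio)
    print(f"1:{ratio}  expected FP = {expected_fp:.0f}  "
          f"precision = {precision:.3f}")
```

At 1:1 about 95% of the positive predictions are real. At 1:50 it drops to under 30%, and it only gets worse from there.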

So the evaluation of a machine learning method should involve a step where a naturally occurring ratio of positive and negative examples is represented. Though this natural ratio may not be clearly evident for some applications, it should be given a reasonable estimate. The performance of the method should be reported based on THIS evaluation, not the evaluation on the balanced set- since that is likely to be inflated from a little to a lot.

For those that are interested in real examples of this problem I’ve got two example studies from one of my own areas of research- type III effector prediction in bacteria. In Gram negative bacteria with type III secretion systems there are an unknown number of secreted effectors (proteins that are injected into host cells to effect virulence) but we estimate on the order of 50-100 for a genome like Salmonella Typhimurium, which has 4500 proteins total, so the ratio should be around 1:40 to 1:150 for most bacteria like this. In my own study on type III effector prediction I used a 1:120 ratio for evaluation for exactly this reason. A subsequent paper in this area was published that chose to use a 1:2 ratio because “the number of non-T3S proteins was much larger than the number of positive proteins,…, to overcome the imbalance between positive and negative datasets.” If you’ve been paying attention, THAT is not a good reason and I didn’t review that paper (though I’m not saying that their conclusions are incorrect since I haven’t closely evaluated their study).

  1. Samudrala R, Heffron F and McDermott JE. 2009. Accurate prediction of secreted substrates and identification of a conserved putative secretion signal for type III secretion systems. PLoS Pathogens 5(4):e1000375.
  2. Wang Y, Zhang Q, Sun MA and Guo D. 2011. High-accuracy prediction of bacterial type III secreted effectors based on position-specific amino acid composition profiles. Bioinformatics 27(6):777-784.

So the trick here is to not fool yourself, and in turn fool others. Make sure you’re being your own worst critic. Otherwise someone else will take up that job instead.

Quick and Dirty Primer on Machine Learning

Machine learning is an approach to pattern recognition that learns patterns from data. Oftentimes the pattern that is learned is a particular pattern of features, properties of the examples, that can distinguish one group of examples from another. A simple example would be to try to identify all the basketball players at an awards ceremony for football, basketball, and baseball players. You would start out by selecting some features, that is, player attributes, that you think might separate the groups out. You might select hair color, length of shorts or pants in the uniform, height, and handedness of the player as potential features. Obviously all these features would not be equally powerful at identifying basketball players, but a good algorithm will be able to make best use of the features. A machine learning algorithm could then look at all the examples: the positive examples, basketball players; and the negative examples, everyone else. The algorithm would consider the values of the features in each group and ideally find the best way to separate the two groups. Generally to evaluate the algorithm all the examples are separated into a training set, to learn the pattern, and a testing set, to test how well the pattern works on an independent set. Cross-validation, a common method of evaluation, does this repeatedly, each time separating the larger group into training and testing sets by randomly selecting positive and negative examples to put into each set. Evaluation is very important since the performance of the method will provide end users with an idea of how well the method will work for their real world application where they don’t know the answers already. Performance measures vary but for classification they generally involve comparing predictions made by the algorithm with the known ‘labels’ of the examples- that is, whether the player is a basketball player or not.
There are four categories of prediction: true positives (TP), the algorithm predicts a basketball player where there is a real basketball player; true negatives (TN), the algorithm predicts not a basketball player when the example is not a basketball player; false positives (FP), the algorithm predicts a basketball player when the example is not; and false negatives (FN), the algorithm predicts not a basketball player when the example actually is.
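For the code-minded, the four categories can be tallied like this. It’s a toy sketch: the eight players and their predictions are invented for illustration, and the function name is mine.

```python
def confusion_counts(labels, predictions):
    """Tally TP, TN, FP, FN from true labels and predicted labels.

    Both arguments are sequences of booleans, where True means
    'is a basketball player'.
    """
    counts = {"TP": 0, "TN": 0, "FP": 0, "FN": 0}
    for actual, predicted in zip(labels, predictions):
        if predicted and actual:
            counts["TP"] += 1   # predicted basketball, really basketball
        elif not predicted and not actual:
            counts["TN"] += 1   # predicted other, really other
        elif predicted and not actual:
            counts["FP"] += 1   # predicted basketball, really other
        else:
            counts["FN"] += 1   # predicted other, really basketball
    return counts


# Invented example: 3 basketball players and 5 other athletes
labels      = [True, True, True,  False, False, False, False, False]
predictions = [True, True, False, True,  False, False, False, False]

print(confusion_counts(labels, predictions))
```

Most of the common performance measures (sensitivity, specificity, precision and so on) are just different ratios of these four counts.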

Features (height and pant length) of examples (basketball players and non-basketball players) plotted against each other. Trying to classify based on either of the individual features won't work well but a machine learning algorithm can provide a good separation. I'm showing something that an SVM might do here- but the basic idea is the same with other ML algorithms.

A case for failure in science

If you’re a scientist and not failing most of the time you’re doing it wrong. The scientific method in a nutshell is to take a best guess based on existing knowledge (the hypothesis) then collect evidence to test that guess, then evaluate what the evidence says about the guess. Is it right or is it wrong? Most of the time this should fail. The helpful and highly accurate plot below illustrates why.

Science is about separating the truth of the universe from the false possibilities about what might be true. There are vastly fewer true things than false possibilities in the universe. Therefore if we’re not failing by disproving our hypotheses then we really are failing at being scientists. In fact, as scientists all we really HAVE is failure. That is, we can never prove something is true, only eliminate incorrect possibilities. Therefore, 100% of our job is failure. Or rather success at elimination of incorrect possibilities.

So if you’re not failing on a regular repeated basis, you’re doing something wrong. Either you’re not being skeptical and critical enough of your own work or you’re not posing interesting hypotheses for testing. So stretch a little bit. Take chances. Push the boundaries (within what is testable using the scientific method and available methods/data, of course). Don’t be afraid of failure. Embrace it!

How much failure, exactly, is there to be had out there? This plot should be totally unhelpful in answering that question


Eight red flags in bioinformatics analyses

A recent comment in Nature by C. Glenn Begley outlines six red flags that basic science research won’t be reproducible. Excellent read and excellent points. The point of this comment, based on experience from writing two papers in which:

Researchers — including me and my colleagues — had just reported that the majority of preclinical cancer papers in top-tier journals could not be reproduced, even by the investigators themselves12.

was to summarize the common problems observed in the non-reproducible papers surveyed since the author could not reveal the identities of the papers themselves. Results in a whopping 90% of papers they surveyed could not be reproduced, in some cases even by the same researchers in the same lab, using the same protocols and reagents. The ‘red flags’ are really warnings to researchers of ways that they can fool themselves (as well as reviewers and readers in high-profile journals) and things that they should do to avoid falling into the traps found by the survey. These kinds of issues are major problems in analysis of high-throughput data for biomarker studies, and other purposes as well. As I was reading this I realized that I’d written several posts about these issues, but applied to bioinformatics and computational biology research. Therefore, here is my brief summary of these six red flags, plus two more that are more specific to high-throughput analysis, as they apply to computational analysis- linking to my previous posts or those of others as applicable.

  1. Were experiments performed blinded? This is something I hadn’t previously considered directly but my post on how it’s easy to fool yourself in science does address this. In some cases blinding your bioinformatic analysis might be possible and would certainly be very helpful in making sure that you’re not ‘guiding’ your findings to a predetermined answer. The cases where this is especially important are those where the analysis is directly targeted at addressing a hypothesis. In these cases a solution may be to have a colleague review the results in a blinded manner- though this may take more thought and work than would reviewing the results of a limited set of Western blots.
  2. Were basic experiments repeated? This is one place where high-throughput methodology and analysis might have a step up on ‘traditional’ science involving (for example) Western blots. Though it’s a tough fight and sometimes not done correctly, the need for replicates is well-recognized as discussed in my recent post on the subject. In studies where the point is determining patterns from high-throughput data (biomarker studies, for example) it is also quite important to see if the study has found their pattern in an independent dataset. Often cross-validation is used as a substitute for an independent dataset- but this falls short. Many biomarkers have been found not to generalize to different datasets (other patient cohorts). Examination of the pattern in at least one other independent dataset strengthens the claim of reproducibility considerably.
  3. Were all the results presented? This is an important point but can be tricky in analysis that involves many ‘discovery’ focused analyses. It is not important to present every comparison, statistical test, heatmap, or network generated during the entire arc of the analysis process. However, when addressing hypotheses (see my post on the scientific method as applied in computational biology) that are critical to the arguments presented in a study it is essential that you present your results, even where those results are confusing or partly unclear. Obviously, this needs to be undertaken through a filter to balance readability and telling a coherent story– but results that partly do not support the hypothesis are very important to present.
  4. Were there positive and negative controls? This is just incredibly central to the scientific method but is a problem in high-throughput data analysis. At the most basic level, analyzing the raw (or mostly raw) data from instruments, this is commonly performed but never reported. In a number of recent cases in my group we’ve found real problems in the data that were revealed by simply looking at these built-in controls, or by figuring out what basic comparisons could be used as controls (for example, does gene expression from biological replicates correlate?). What statistical associations do you expect to see and what do you expect not to see? These checks are good to prevent fooling yourself- and if they are important they should be presented.
  5. Were reagents validated? For data analysis this should be: “Was the code used to perform the analysis validated?” I’ve not written much on this but there are several out there who make it a central point in their discussions including Titus Brown. Among his posts on this subject are here, here, and here. If your code (an extremely important reagent in a computational experiment) does not function as it should the results of your analyses will be incorrect. A great example of this is from a group that hunted down a bunch of errors in a series of high-profile cancer papers I posted about recently. The authors of those papers were NOT careful about checking that the results of their analyses were correct.
  6. Were statistical tests appropriate? There is just too much to write on this subject in relation to data analysis. There are many ways to go wrong here- inappropriate data for a test, inappropriate assumptions, inappropriate data distribution. I am not a statistician so I will not weigh in on the possibilities here. But it’s important. Really important. Important enough that if you’re not a statistician you should have a good friend/colleague who is and can provide specific advice to you about how to handle statistical analysis.
  7. New! Was multiple hypothesis correction correctly applied? This is really an addition to flag #6 above, specific to high-throughput data analysis. Multiple hypothesis correction is very important to high-throughput data analysis because of the number of statistical comparisons being made. It is a way of filtering the predictions or statistical relationships observed to provide more conservative estimates. Essentially it extends the question, “how likely is it that the difference I observed in one measurement is occurring by chance?” to the population-level question, “how likely is it that I would find this difference by chance if I looked at a whole bunch of measurements?”. Know it. Understand it. Use it.
  8. New! Was an appropriate background distribution used? Again, an extension to flag #6. When judging significance of findings it is very important to choose a correct background distribution for your test. An example is in proteomics analysis. If you want to know what functional groups are overrepresented in a global proteomics dataset should you choose your background to be all proteins that are coded for by the genome? No- because the set of proteins that can be measured by proteomics (in general) is highly biased to start with. So to get an appropriate idea of which functional groups are enriched you should choose the proteins actually observed in all conditions as a background.
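To make flag #7 concrete, here is a minimal sketch of one common correction, the Benjamini-Hochberg false discovery rate procedure. The choice of this particular method, the function name, and the p-values are all assumptions made for illustration; they do not come from any of the studies discussed here.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(p_values)
    # Indices of the p-values, sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical p-values from ten separate comparisons (made up for illustration).
p_values = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
q_values = benjamini_hochberg(p_values)

raw_hits = sum(p < 0.05 for p in p_values)  # calls made without any correction
fdr_hits = sum(q < 0.05 for q in q_values)  # calls that survive the correction
print(raw_hits, fdr_hits)
```

With these made-up numbers, fewer comparisons pass the 0.05 cutoff after correction than before. That is the "more conservative estimate" the flag is describing.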

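The background question in flag #8 can be sketched the same way. The example below uses a one-sided hypergeometric test, which is one common choice for enrichment analysis but an assumption here. The function name and every count are hypothetical, picked only to show how the answer can flip depending on the background.

```python
from math import comb


def hypergeom_enrichment_p(background_size, background_in_group, hits, hits_in_group):
    """Chance of seeing at least `hits_in_group` group members among `hits`
    proteins drawn without replacement from the chosen background."""
    total = comb(background_size, hits)
    upper = min(hits, background_in_group)
    tail = sum(
        comb(background_in_group, k)
        * comb(background_size - background_in_group, hits - k)
        for k in range(hits_in_group, upper + 1)
    )
    return tail / total


# Hypothetical counts: 100 proteins change, and 6 of them are in the group.
hits, hits_in_group = 100, 6

# Background 1: every protein coded for by the genome (group is 90 of 4500).
p_vs_genome = hypergeom_enrichment_p(4500, 90, hits, hits_in_group)

# Background 2: only proteins actually observed by proteomics (group is 75 of 1500).
p_vs_observed = hypergeom_enrichment_p(1500, 75, hits, hits_in_group)

print(p_vs_genome, p_vs_observed)
```

In this made-up case the functional group looks enriched against the whole genome but not against the proteins that could actually be measured, because the group is already overrepresented among observable proteins.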
The comment by Glenn Begley wraps up with this statement about why these problems are still present in research:

Every biologist wants and often needs to get a paper into Nature or Science or Cell, yet the scientific community fails to recognize the perverse incentive this creates.

I think this is true, but you could substitute “any peer-reviewed journal” for “Nature or Science or Cell”- the problem comes at all levels. It’s also true that these problems are particularly relevant to high-throughput data analysis because they can be less hypothesis directed and more discovery oriented, because they are generally more expensive and there’s thus more scrutiny of the results (in some cases), and due to rampant enthusiasm and overselling of potential results arising from these kinds of studies.

Illustration from Derek Roczen

The big question: Will following these rules improve reproducibility in high-throughput data analysis? The Comment talks about these being things that were present in reproducible studies (that small 10% of the papers) but does that mean that paying attention to them will improve reproducibility, especially in the case of high-throughput data analysis? There are issues that are more specific to high-throughput data (such as my flags #7 and #8 above) but essentially these flags are a great starting point to evaluate the integrity of a computational study. With high-throughput methods, and their resulting papers, gaining importance all the time, we need to consider these both as producers and consumers.


  1. Prinz, F., Schlange, T. & Asadullah, K. Nature Rev. Drug Discov. 10, 712 (2011).
  2. Begley, C. G. & Ellis, L. M. Nature 483, 531–533 (2012).

How can two be worse than one? Replicates in high-throughput experiments

[Disclaimer: I’m not a lot of things. Statistician is high on that list of things I’m not.]

A fundamental rift between statisticians/computational biologists and bench biologists related to high-throughput data collection (and low-throughput as well, though it’s not discussed as much) is that of the number of replicates to use in the experimental design.

Replicates are multiple copies of samples under the same conditions that are used to assess the underlying variability in measurement. A biological replicate is when the source of the sample is different, meaning that different individuals were used (for human samples, for example) or different cultures were grown independently (for bacterial cultures). This is different from a technical replicate, where one sample is taken or grown, then subsequently split up into replicates that will assess the technical variability of, for example, the instrument being used to gather the data (other types of technical replicates are sometimes used too). Most often you will not know the extent of variability arising from the biology or process and so it is difficult to choose the right balance of replicates without doing pilot studies first. With well-established platforms (e.g. microarrays) the technical/process variability is understood, but the biological variability is generally not. These choices must also be balanced with expense in terms of money, time, and effort. Choice of number of replicates of each type can mean the difference between a usable experiment that will answer the questions posed and a waste of time and effort that will frustrate everyone involved.

The fundamental rift is this:

  • More is better: Statisticians want to make sure that the data gathered, which can be very expensive, can be used to accurately estimate the variability. More is better, and very few experimental designs have as many replicates as statisticians would like.
  • No need for redundant information: Bench biologists, on the other hand, tend to want to get as much science done as possible. Replicates are expensive and often aren’t that interesting in terms of the biology that they reveal when they work- that is, if replicates 1, 2, and 3 agree then wouldn’t it be more efficient to just have run replicate 1 in the first place and use replicates 2 and 3 to get more biology?

This is a vast generalization, and many biologists gathering experimental data understand the statistical issues inherent in this problem- more so in certain fields like genome-wide association studies.

Three replicates is kind of a minimum for statistical analysis. This number doesn’t give you any room if any of the replicates fail for technical reasons, but if they’re successful you can at least get an estimate of variation in the form of standard deviation out (not a very robust estimate mind you, but the calculation will run). I’ve illustrated the point in the graph below.

Running one replicate can be understood for some situations, but the results have to be presented with the rather large caveat that they will need to be validated in follow-on studies.

Two replicates? Never a good idea. This is solidly in the “why bother?” category. If the data points agree, great. But how much confidence can you have that they’re not just accidentally lining up? If they disagree, you’re out of luck. If you have ten replicates and one doesn’t agree you could, if you investigated the underlying reason for this failure, exclude it from the analysis as an ‘outlier’ (this can get into shady territory pretty fast- but there are sound ways to do this). However, with two replicates they just don’t agree and you have no idea which value to believe. Many times two replicates are the result of an experimental design with more replicates in which some of the samples have failed for some reason. But an experimental design should never be initiated with just two replicates. It doesn’t make sense- though I’ve seen many and have participated in analysis of some too (thus giving me this opinion).
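The "accidentally lining up" worry can be sketched with a small simulation. This is only an illustration under stated assumptions: measurements are drawn from a normal distribution with a known spread, and the function name, trial count, and the "less than half the true value" cutoff are all hypothetical choices, not taken from any real experiment.

```python
import random
import statistics


def fraction_badly_underestimated(n_replicates, true_sd=1.0, n_trials=2000, seed=0):
    """Fraction of simulated experiments whose replicates agree so well that
    the estimated standard deviation is less than half the true value."""
    rng = random.Random(seed)
    too_small = 0
    for _ in range(n_trials):
        sample = [rng.gauss(0.0, true_sd) for _ in range(n_replicates)]
        # statistics.stdev needs at least two points; it raises
        # StatisticsError for a single replicate.
        if statistics.stdev(sample) < 0.5 * true_sd:
            too_small += 1
    return too_small / n_trials


# Compare designs with two, three, and ten replicates.
results = {n: fraction_badly_underestimated(n) for n in (2, 3, 10)}
for n, fraction in results.items():
    print(n, fraction)
```

Under these assumptions, the fewer the replicates, the more often they look deceptively consistent. With a single replicate there is no variability estimate at all.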

There is much more that can be said on this topic but this is a critical issue that can ruin costly and time-consuming high-throughput experiments before they’ve even started.


Cool example of invisible science

I recently posted on invisible science, unexpected observations that don’t fit the hypothesis and can be easily discarded or overlooked completely. Through a collaboration we just published a paper that demonstrates this concept very well. Here’s its story in a nutshell.

A few years back I published a paper in PLoS Pathogens that described the first use* of a machine learning approach to identify bacterial type III effectors from protein sequence. Type III effectors are proteins that are made by bacteria and exported through a complicated structure (the type III secretion apparatus- aka. the injectisome) directly into a host cell. Inside the host cell these effectors interact with host proteins and networks to effect a change, one that is beneficial for the invading bacteria, and allow survival in an environment that’s not very hospitable for bacterial growth. Though there are a lot of these kinds of proteins known, there’s no pattern that has been found to specify secretion by the type III mechanism. It’s a mystery still.

(* there was another paper published back-to-back with mine in PLoS Pathogens that reported the same thing. Additionally, two other papers were published subsequently in other journals that reiterated our findings. I wrote a review of this field here.)

So on the basis of the model that I published, my collaborators (Drs. Heffron and Niemann) thought it would be cool to see if a consensus signal (an average of the different parts my model predicted to be important for secretion) would be hyper-secreted (i.e. secreted at a high level). I sent them a couple of predictions and some time later (maybe 8 months) Dr. Niemann contacted me to say that the consensus sequence was not, in fact, secreted. So it looked like the prediction wasn’t any good and that some work had been done to get this negative result.

But not so fast. Because they’d had some issues with how they’d made the initial construct to do the experiment, they remade the construct used to express the consensus. The first one (that was not secreted) used a native promoter and upstream gene sequence. This is the region that causes a gene to be expressed, then allows the ribosome to bind to the mRNA and start translation of the actual coding sequence. The native upstream sequence was just taken from a real effector. When they redid the construct they used a non-native upstream sequence from a bacteriophage (a virus that infects bacteria), commonly used for expressing genes. All of a sudden, they got secretion from the same consensus sequence. This was a very weird result: why would changing the untranslated region suddenly change the function that the protein sequence was supposed to be directing?

Figure 1. Translocation of a consensus effector seq.

The path of this experiment could have taken a very different turn here. Dr. Niemann could have simply ignored that ‘spurious’ result and decided that the native promoter was the right answer- the consensus sequence wasn’t secreted.

However, in this case the spurious result was the interesting one. Why did the bacteriophage upstream region construct get secreted? The only difference was in the upstream RNA (since the difference was in the non-coding region and the protein produced was exactly the same). Dr. Niemann pressed on and found that the RNA itself was directing secretion. He also found other examples of native upstream sequences that did the same thing in the bacterium we were working on (Salmonella Typhimurium). This had never been observed before in Salmonella, though it was known for a few effectors from Yersinia pestis. He also identified an RNA-binding protein, Hfq, that was required for this secretion. This paper is currently available as a preprint from the journal.

Niemann GS, Brown RN, Mushamiri IT, Nguyen NT, Taiwo R, Stufkens A, Smith RD, Adkins JN, McDermott JE, Heffron F. RNA Type III Secretion Signals that require Hfq. J Bacteriol. 2013 Feb 8. [Epub ahead of print]

So this story never ended in validation of my consensus sequence. Actually, in all likelihood it can’t direct secretion (the results in the paper show that, though it’s not highlighted). But the story turned out to be more interesting and more impactful and it shows why it’s good to be flexible in science. If you see the results of each experiment only in black and white (it supports or does not support my original hypothesis) this will be extremely limiting to the science you can accomplish.