I had a weird thing happen on my way in to work this morning. On the main road, just a short distance from my parking lot, I noticed that the SUV in front of me had the same three-letter combination on its license plate as mine, “YGK”. Then I noticed that the car in front of THEM had the SAME three-letter combination! Wow. What are the odds of that happening? Well, I’m not going to tell you the odds of that happening, because I don’t really know. But it did happen. An odd coincidence for sure, but maybe not as cosmically connected as you might be inclined to think.

First off, let’s think about the odds of drawing the same 3-letter combination out of a hat two times in a row (approximating what happened here, because my license plate is fixed). There are 26^3 = 17,576 different possible 3-letter combinations, probably minus one or two for words that aren’t allowed, like “ASS” and, ummm, well maybe there’s another. The chance of drawing one particular combination twice in a row would be 1/17,576 × 1/17,576, or about 1 in 309 million. So this means that you could sit and draw letters out of this hat every second (that is, drawing two sets of three letters every second) for about 10 years before you’d expect to have this happen. Now clearly I’m simplifying here, but still. So for my license plate story I’d be unlikely to have this happen in my lifetime, since I’m only driving every now and then and I’m not generally even paying attention to other people’s license plates to see if this has happened or not.
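The hat arithmetic above can be checked in a few lines. This is only a sketch of the simplified model in the text (every combination equally likely, no banned words, one pair of draws per second); the function name `hat_draw_odds` is made up for illustration.

```python
SECONDS_PER_YEAR = 365.25 * 24 * 60 * 60


def hat_draw_odds(letters=26, length=3):
    """Odds of drawing one specific combination twice in a row.

    Assumes every combination is equally likely and none are banned,
    which is the same simplification the text makes.
    """
    combos = letters ** length          # number of 3-letter combinations
    one_in = combos ** 2                # two independent draws must both match
    # With one attempt per second, the average wait is `one_in` seconds
    # (the mean of a geometric distribution).
    years = one_in / SECONDS_PER_YEAR
    return combos, one_in, years


combos, one_in, years = hat_draw_odds()
print(f"{combos:,} combinations; 1 in {one_in:,}; "
      f"about {years:.1f} years at one attempt per second")
```

Note that the 10-year figure is an average wait, not a guarantee: some runs would hit the match much sooner and some much later.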

So here are some reasons why it’s not TOO surprising that it did happen. First, assuming all combinations are used, there are about 1000 other vehicles in WA state with the same letters, which narrows the field a bit, but only a bit, since there are ~6 million registered vehicles (at least in 2012, though some portion of these have the longer 7-character plates). Second, it is likely that plates are issued in order of request (though I’m not 100% sure about that, it would seem to make sense). That means that vehicles purchased at about the same time as mine (2001) are probably far more likely to have the same set of letters. That was about 13 years ago, which means that those vehicles are going to be of a certain age. I would also include geography, since that could be another factor influencing which numbers/letters you get, but I did get my license plate on the other side of the state. I don’t have a clear idea of how this would bias the probability of seeing three matching license plates in a row, but it fits in with my next point, which is hidden or partially hidden explanatory variables.

When my wife and I lived in Portland, far before we had such encumbrances as kids to drag us down, we often did a bunch of activities on a weekend. I started to be surprised to notice some of the same people turning up at different places (parks, restaurants, bookstores, museums, etc.) far across town. This happened more than you’d expect in a moderately sized city. Interestingly, in Seattle when we had a kid this also happened. And it happens all the time in our current city(ies), which are much smaller. My idea about this is that it’s not surprising at all. Our choice of activities and times is dictated or heavily influenced by our age, interests, kidlet status, etc., as are other people’s. So instead of thinking of the chances of repeatedly bumping into the same set of people out of the entire population, think about the chances if the background distribution is much more limited, constrained (in part) by those interests and other personal constraints. The probability of this happening then rises considerably because you’re considering a smaller number of possible people. I’m sure this has been described before in statistics and would love it if someone knew what it’s called (leave a comment).
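To put rough numbers on the smaller-background idea, here is a toy calculation. Everything in it is an assumption for illustration: the pool sizes, the crowd size, and the function name `p_repeat_encounter` are invented, and it pretends the crowd at a venue is a simple random draw from the pool.

```python
def p_repeat_encounter(pool_size, seen_before, crowd):
    """Chance that a crowd contains at least one person you've seen before.

    Toy model: `crowd` people are drawn without replacement from a pool of
    `pool_size`, of whom `seen_before` are people you would recognize.
    """
    p_none = 1.0
    for i in range(crowd):
        # Probability that the next person drawn is NOT someone seen before.
        p_none *= (pool_size - seen_before - i) / (pool_size - i)
    return 1.0 - p_none


# Hypothetical numbers only: a whole city versus the much smaller slice of it
# that shares your age, interests, and schedule.
whole_city = p_repeat_encounter(pool_size=600_000, seen_before=50, crowd=100)
similar_people = p_repeat_encounter(pool_size=20_000, seen_before=50, crowd=100)
print(f"whole city: {whole_city:.3f}   similar people: {similar_people:.3f}")
```

The point is only the direction of the effect: shrink the pool the crowd is really drawn from and the chance of a repeat encounter goes up a lot, even though nothing else changed.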

How does this fit into my license plate experience? I don’t really have a clear idea, but it is evident that there can be multiple underlying and often hidden explanatory variables that may be influencing such probabilities. Perhaps my workplace is enriched in people who think like me and hold on to vehicles for a long time, AND who purchased vehicles at about the same time. I think that’s probably likely, though I have no idea how to test it. If that’s true then the chances of running into someone else with the same letters on their plates, or two people at the same time, would have to go up quite a lot. Still, what are the odds?

How can two be worse than one? Replicates in high-throughput experiments

[Disclaimer: I’m not a lot of things. Statistician is high on that list of things I’m not.]

A fundamental rift between statisticians/computational biologists and bench biologists related to high-throughput data collection (and low-throughput as well, though it’s not discussed as much) is that of the number of replicates to use in the experimental design.

Replicates are multiple copies of samples under the same conditions that are used to assess the underlying variability in measurement. A biological replicate is when the source of the sample is different, meaning, for example, that different individuals were used for human samples, or different cultures were grown independently for bacterial cultures. This is different from a technical replicate, where one sample is taken or grown, then subsequently split up into replicates that will assess the technical variability of the instrument being used to gather the data (that’s one example; other types of technical replicates are sometimes used too). Most often you will not know the extent of variability arising from the biology or process, and so it is difficult to choose the right balance of replicates without doing pilot studies first. With well-established platforms (e.g., microarrays) the technical/process variability is understood, but the biological variability is generally not. These choices must also be balanced with expense in terms of money, time, and effort. Choice of number of replicates of each type can mean the difference between a usable experiment that will answer the questions posed and a waste of time and effort that will frustrate everyone involved.

The fundamental rift is this:

  • More is better: Statisticians want to make sure that the data gathered, which can be very expensive, can be used to accurately estimate the variability. More is better, and very few experimental designs have as many replicates as statisticians would like.
  • No need for redundant information: Bench biologists, on the other hand, tend to want to get as much science done as possible. Replicates are expensive and often aren’t that interesting in terms of the biology that they reveal when they work. That is, if replicates 1, 2, and 3 agree, then wouldn’t it be more efficient to just have run replicate 1 in the first place and use replicates 2 and 3 to get more biology?

This is a vast generalization, and many biologists gathering experimental data understand the statistical issues inherent in this problem, more so in certain fields like genome-wide association studies.

Three replicates is kind of a minimum for statistical analysis. This number doesn’t give you any room if any of the replicates fail for technical reasons, but if they’re successful you can at least get an estimate of variation out, in the form of a standard deviation (not a very robust estimate, mind you, but the calculation will run). I’ve illustrated the point in the graph below.
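As a rough numerical companion to the "not very robust" remark, the sketch below simulates how much a standard deviation estimated from only a few replicates bounces around. It assumes measurements are normally distributed with a true standard deviation of 1, which real data may not be; the function name and trial count are arbitrary choices for illustration.

```python
import random
import statistics


def sd_estimate_spread(n_replicates, n_trials=5000, seed=0):
    """Middle 95% range of the sample SD across many simulated experiments.

    Each simulated experiment draws `n_replicates` values from a normal
    distribution whose true standard deviation is 1.0 (an assumption).
    """
    rng = random.Random(seed)
    estimates = sorted(
        statistics.stdev([rng.gauss(0.0, 1.0) for _ in range(n_replicates)])
        for _ in range(n_trials)
    )
    low = estimates[int(0.025 * n_trials)]
    high = estimates[int(0.975 * n_trials)]
    return low, high


for n in (3, 10):
    low, high = sd_estimate_spread(n)
    print(f"n={n:2d}: estimated SD mostly falls between {low:.2f} and {high:.2f} (true value 1.0)")
```

With three replicates the range of plausible estimates is much wider than with ten, so the calculation runs but the answer can be far from the truth.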

Running one replicate can be understood for some situations, and the results have to be presented with the rather large caveat that they will need to be validated in follow-on studies.

Two replicates? Never a good idea. This is solidly in the “why bother?” category. If the data points agree, great. But how much confidence can you have that they’re not just accidentally lining up? If they disagree, you’re out of luck. If you have ten replicates and one doesn’t agree you could, if you investigated the underlying reason for this failure, exclude it from the analysis as an ‘outlier’ (this can get into shady territory pretty fast, but there are sound ways to do this). However, with two replicates that don’t agree you have no idea which value to believe. Many times two replicates are the result of an experimental design that had more replicates, some of which failed for some reason. But an experimental design should never be initiated with just two replicates. It doesn’t make sense, though I’ve seen many and have participated in the analysis of some too (thus giving me this opinion).
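The "which value do you believe?" problem can be shown with a small sketch. It uses one common robust screen, a modified z-score built on the median and the median absolute deviation (MAD), with the conventional 0.6745 scaling and 3.5 cutoff. This is a stand-in chosen for illustration, not a method named in the post, and the measurement values are made up.

```python
import statistics


def flag_outliers(values, threshold=3.5):
    """Flag values whose modified z-score (median/MAD based) exceeds threshold.

    A flag is only a prompt to investigate the sample, not a license to
    drop it automatically.
    """
    med = statistics.median(values)
    mad = statistics.median([abs(v - med) for v in values])
    if mad == 0:
        # No spread around the median to measure against; flag nothing.
        return [False] * len(values)
    return [0.6745 * abs(v - med) / mad > threshold for v in values]


# Made-up measurements: nine replicates near 10 and one at 14.
ten = [10.1, 9.9, 10.0, 10.2, 9.8, 10.1, 9.9, 10.0, 10.2, 14.0]
# The same disagreement with only two replicates.
two = [10.0, 14.0]

print(flag_outliers(ten))  # only the last value is flagged
print(flag_outliers(two))  # neither value can be singled out
```

With two points, both sit exactly the same distance from the median, so no screen based on the data alone can say which one is off. With ten, the other nine define what normal looks like.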

There is much more that can be said on this topic but this is a critical issue that can ruin costly and time-consuming high-throughput experiments before they’ve even started.