The Scientific Method as it Applies to Coding and Data Analysis

This is a follow-up on my previous post that gives a short, biased overview of the scientific method.

Essentially, it occurred to me while coding (which I used to do before I became too busy to do so) that the way coding works follows the scientific method, more or less. Here’s how I think this works, based on the steps I outlined in my previous post.

  1. Hypothesis formation. Code has a purpose. It accomplishes something. It can accomplish this something in a myriad of ways. In coding the hypothesis formulation is the initial step where you form an idea of the algorithm that you want to write. It’s not exactly a ‘hypothesis’ in the scientific sense of the word, but it is a model of how something should work.
  2. Formulation of an experiment. In coding this is the step at which you instantiate how you will test the hypothesis, that is, how you will write the code. This would be where you write pseudocode to get your idea down on paper. With many smaller programming tasks this step only occurs in the programmer’s head.
  3. Execution of the experiment. Write the code and execute it with some input.
  4. Evaluation of the initial hypothesis. This is where things start to converge and get interesting. You’ve executed your code on some inputs where you have an idea of what the answer should be (your control, I guess). Now what does the answer look like? Does it match what you expect to see? No? Then you need to go on to the next step. Yes? Then you’re OK, for now. But try some other inputs and see if the results still look good.
  5. Formulation of further experiments. This is debugging. That is, you use your hypothesis/model to interpret the results and how they differ from what you expect. This interpretation is aimed at identifying the step or steps that may be incorrect in your code that would lead to such a result. This can be trivial or not, but it is the feedback loop that defines the scientific method. The hypotheses springing from this step are very similar to scientific hypotheses: my hypothesis is that step X is broken because of Y. It is easy to test this: fix step X (given the reason Y) and go back to step 3.
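
To make steps 3 through 5 concrete, here is a minimal sketch in Python. Everything in it is made up for illustration: `gc_content` is a hypothetical function under test, the `controls` list is a set of inputs where I already know the answer, and `evaluate` is just one way you might compare observed results against expectations.

```python
def gc_content(seq):
    """Hypothetical function under test: fraction of G or C bases in a DNA string."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return sum(base in "GC" for base in seq) / len(seq)


# Step 4 controls: inputs paired with the answer we expect to see.
controls = [
    ("GGCC", 1.0),
    ("ATAT", 0.0),
    ("ATGC", 0.5),
    ("", 0.0),
]


def evaluate(func, controls, tol=1e-9):
    """Step 3 and 4: run func on each control, collect the cases that disagree."""
    failures = []
    for given, expected in controls:
        observed = func(given)
        if abs(observed - expected) > tol:
            failures.append((given, expected, observed))
    return failures


failures = evaluate(gc_content, controls)
for given, expected, observed in failures:
    # Step 5: each mismatch is the starting point for a debugging hypothesis.
    print(f"input {given!r}: expected {expected}, observed {observed}")
```

An empty `failures` list means you’re OK, for now. Anything in it is the raw material for step 5: a hypothesis about which step is broken and why.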

So when you are coding, whether you’re aware of it or not, you are formulating, testing, and revising small hypotheses all the time. It’s interesting to think of the process in this way, and it makes sense. The question is, if you think of it this way does it change how you code? I’m not sure about that point. I always have the data that I eventually want to use as input, and I routinely have positive and negative controls: data for which I know, or can reasonably figure out, what the result should be, and some broken input data that should break the code at some reasonable point. Experiments to run can include introducing mutations (by breaking or modifying key steps in the code), swapping out routines, or mutating the input data in different ways. Each of these mutations results in a different hypothesis that will be evaluated by comparing the results with what you expect to find.
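
Here is a rough sketch of what positive controls, negative controls, and a code mutation might look like in Python. The routine `mean`, its deliberately broken variant `mutated_mean`, and the two checking functions are all hypothetical names I picked for the example, not anything from a real library.

```python
def mean(values):
    """Hypothetical routine under test."""
    if len(values) == 0:
        raise ValueError("mean of empty input is undefined")
    return sum(values) / len(values)


def mutated_mean(values):
    """Deliberate mutation: an off-by-one error in the denominator."""
    if len(values) == 0:
        raise ValueError("mean of empty input is undefined")
    return sum(values) / (len(values) + 1)


def passes_positive_control(func):
    """Positive control: an input where we know what the result should be."""
    return func([2, 4, 6]) == 4


def passes_negative_control(func):
    """Negative control: broken input should break the code at a reasonable point."""
    try:
        func([])
    except ValueError:
        return True
    return False


for name, func in [("mean", mean), ("mutated_mean", mutated_mean)]:
    print(name, passes_positive_control(func), passes_negative_control(func))
```

The hypothesis behind the mutation is that the positive control is sensitive enough to notice a broken denominator. If `mutated_mean` had passed, that would tell you the control is too weak, not that the mutation is harmless.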

Another interesting point is that this analogy can be extended to many other activities. I think this has been noted by others, for example that this is exactly how children with no scientific training learn things about their environments. Data analysis, for many different purposes related to computational biology, is somewhere I use this approach all the time. It’s very similar to the process I outline above for coding; just substitute the steps of analysis for the writing of code.

One of the big differences between coding or data analysis and doing actual wet bench experiments is that, for the most part, the computational aspects are deterministic; you get the same answer every time you use the same input and parameters. Of course, there are many stochastic algorithms that make this process non-deterministic, but those steps will function in the same stochastic way each time, for example when creating a random normal distribution of values. In wet bench science EVERYTHING in the environment of the lab (and quite possibly beyond) can factor into your experiment. Even if you use the same reagents, with the same measurements (measured on the same measuring instruments), with the same timing, and analyze the exact same samples (that is, the exact same CODE and INPUTS for the experiment), you are NOT guaranteed the same output. Frequently the most frustrating part of wet bench science is that you can’t get something to work again, even though you’re doing everything exactly the same (what do you know? The phase of the moon is affecting the binding affinity of your antibody).
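
As a small illustration of how tame computational randomness is compared to the lab, here is a sketch using Python’s standard `random` module. It goes one step further than what I said above and assumes you fix the random seed, in which case even the stochastic step gives exactly the same values every time. The function name `sample_normal` and the seed values are just picked for the example.

```python
import random


def sample_normal(n, seed):
    """Draw n values from a standard normal distribution using a seeded generator."""
    rng = random.Random(seed)  # a private generator, so the seed fully determines the output
    return [rng.gauss(0.0, 1.0) for _ in range(n)]


first = sample_normal(5, seed=42)
second = sample_normal(5, seed=42)     # same "reagents", same "protocol"
different = sample_normal(5, seed=43)  # change one parameter

print(first == second)     # same seed: identical draws
print(first == different)  # different seed: different draws
```

There is no equivalent of a seed for the phase of the moon.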

It’s actually one of the main reasons I don’t do bench science anymore. Computational biology is, in some ways, much cleaner. Don’t even get me started on the ways that THAT statement isn’t true.