What is a hypothesis?

So I got this comment from a reviewer on one of my grants:

The use of the term “hypothesis” throughout this application is confusing. In research, hypotheses pertain to phenomena that can be empirically observed. Observation can then validate or refute a hypothesis. The hypotheses in this application pertain to models not to actual phenomena. Of course the PI may hypothesize that his models will work, but that is not hypothesis-driven research.

There are a lot of things I can say about this statement, which really rankles. As a thought experiment replace all occurrences of the word “model” with “Western blot” in the above comment. Does the comment still hold?

At this point it may be informative to get some definitions, keeping in mind that the _working_ definitions in science can have somewhat different connotations.

From Google:

Hypothesis: a supposition or proposed explanation made on the basis of limited evidence as a starting point for further investigation.

This definition has nothing about empirical observation- and I would argue that this definition would be fairly widely accepted in biological sciences research, though the underpinnings of the reviewer’s comment- empirically observed phenomena- probably are in the minds of many biologists.

So then, also from Google:

Empirical: based on, concerned with, or verifiable by observation or experience rather than theory or pure logic.

Here’s where the real meat of the discussion is. Empirical evidence is based on observation or experience as opposed to being based on theory or pure logic. It’s important to understand that the “models” being referred to in my grant are machine learning statistical models that have been derived from sequence data (that is, observation).

I would argue that including some theory or logic in a model that’s based on observation is exactly what science is about- this is what the basis of a hypothesis IS. All the hypotheses considered in my proposal were based on empirical observation, filtered through some form of logic/theory (if X is true then it’s reasonable to conclude Y), and would be tested by returning to empirical observations (either of protein sequences or experimentation at the actual lab bench).

I believe that the reviewer was confused by the use of statistics, which is a largely empirical endeavor (based on the observation of data- though filtered through theory) and computation, which they do not see as empirical. Back to my original thought experiment: there are a lot of assumptions, theory, and logic that go into the interpretation of a Western blot – or any other common lab experiment. However, this does not mean that we can’t use them to formulate further hypotheses.

This debate is really fundamental to my scientific identity. I am a biologist who uses computers (algorithms, visualization, statistics, machine learning and more) to do biology. If the reviewer is correct, then I’m pretty much out of a job I guess. Or I have to settle back on “data analyst” as a job title (which is certainly a good part of my job, but not the core of it).

So I’d appreciate feedback and discussion on this. I’m interested to hear what other people think about this point.

Best Practices


This comic is inspired, not by real interactions I’ve had with developers (no developer has ever volunteered to get within 20 paces of my code), but rather by discussions online on the importance of ‘proper’ coding. Here’s a comic from xkcd which has a different point:

My reaction to this– as a bench biology-trained computational biologist who has never taken a computer programming class– is “who cares?” If it works, really, who cares?

Sure, there are very good reasons for standard programming practices, standards, and clean, efficient code. Even in bioinformatics (or especially so). But these apply almost exclusively to approaches you’ve already had quite a bit of experience with: you’ve worked out the bugs, figured out how the approach works with the underlying data, and made sure that it’s actually useful in terms of the biology. Getting to that point is at least 75% of my job. I try and discard many approaches for any particular problem I’m working on. It’s important to have a record of these attempts, but this code doesn’t have to be clean or efficient. There are exceptions: when you have code that takes a loooong time to run even once, you probably want to make it as efficient as you can. For the vast majority of the things I do- even with large amounts of data- I can determine if they’re working or not in a reasonable amount of time using inefficient code (anything written in R, for example).

The other part, where good coding is important, is when you want the code to be usable by other people. This is an incredibly important part of computational biology and I’m not trying to downplay its importance here. This is when you’re relatively certain that the code will be looked at and/or used by other people in your own group and when you publish or release the code to a wider audience.

For further reading on this subject here’s a post from Byte Size Biology that covers some great ideas for writing *research* code. And here is some dissenting opinion from Living in an Ivory Basement touting the importance of good programming practices (note- I don’t disagree, but do believe that at least 75% of the coding I do should not have such a high bar- it’s not necessary and I’d never get anything done). Finally, here are some of my thoughts on how coding really follows the scientific method.

Big Data Showdown

One of the toughest parts of collaborative science is communication across disciplines. I’ve had many (generally initial) conversations with bench biologists, clinicians, and sometimes others that go approximately like:

“So, tell me what you can do with my data.”

“OK- tell me what questions you’re asking.”

“Um... that kinda depends on what you can do with it.”

“Well, that kinda depends on what you’re interested in…”

And this continues.

But the great part- the part about it that I really love- is that given two interested parties you’ll sometimes work to a point of mutual understanding, figuring out the borders and potential of each other’s skills and knowledge. And you generally work out a way of communicating that suits both sides and (mostly) works to get the job done. This is really when you start to hit the point of synergistic collaboration- and also, sadly, usually about the time you run out of funding to do the research.

Asked and answered: Computational Biology Contribution?

So someone asked me this question today: “as a computational biologist,how can you be useful to the world?”. OK so they didn’t ask me, per se, they got to my blog by typing the question into a search engine and I saw this on my WordPress stats page (see bottom of this post). Which made me think- “I don’t know what page they were directed to- but I know I haven’t addressed that specific question before on my blog”. So here’s a quick answer, especially relevant since I’ve been talking with CS people about this at the ACM-BCB meeting the last few days.

As a computational biologist how can you be useful to the world?

  1. Choose your questions carefully. Make sure that the algorithm you’re developing, the software that you’re designing, the fundamental hypothesis that you’re researching is actually one that people (see collaborators, below) are interested in and see the value in. Identify the gaps in the biology that you can address. Don’t build new software for the sake of building new software- generally people (see collaborators) don’t care about a different way to do the same thing, even if it’s moderately better than the old way.
  2. Collaborate with biologists, clinicians, public health experts, etc. Go to the people who have the problems. What they can offer you is focus on important problems that will improve the impact of your research (you want NIH funding? You HAVE to have impact and probably collaborators). What you can give them is a solution to a problem that they are actually facing. Approach the relationship with care though, since this is where the language barrier between fields can be very difficult (I have a post on this coming in the near future). Make sure that you interact with these collaborators during the process- that way you don’t go off and do something completely different from what they had in their heads.
  3. In research be rigorous. The last thing that anyone in any discipline needs is a study that has not considered validation, generalizability, statistical significance, or comparison to a gold standard or reasonable facsimile thereof. Consider finding a statistician, or a senior computational biologist mentor, to at least run your ideas by- they can be very helpful.
  4. In software development be thoughtful. Consider robustness of your code- have you tested it extensively? How will average users (see collaborators, above) be able to get their data into it? How will average users be able to interpret the results of your methods? Put effort into working with those collaborators to define the user interface and user experience. They don’t (to a point) care about execution times as long as it finishes in a reasonable amount of time (have your software estimate time to completion and display it) and it gives good results. They do care if they can’t use it (or rather they completely don’t care and will stop working with you on the spot).
  5. Sometimes people don’t know what they need until they see it. This is a tip for at least 10th level computational biologists (to make a D&D analogy). This was a tenet of Steve Jobs of Apple and I believe it to be true. Sometimes, someone with passion and skill has to break new ground and do something that no one is asking them to do but that they will LOVE and won’t know how they lived without it. IT IS HIGHLY LIKELY THAT THIS IS NOT YOU. This is a pretty sure route to madness, wearing a tin hat, and spouting “you fools! you’ll never understand my GENIUS”- keep that in mind.
  6. For a computational biologist with some experience make sure that you pass it along. Attend conferences where there are likely to be younger faculty/staff members, students, and post-docs. Comment on their posters and engage. When possible suggest or make connections with collaborators (see above) for them. Question them closely on the four points above- just asking the questions may be an effective way of conveying importance. Organize sessions at these conferences. In your own institution be an accessible and engaged mentor. This has the most potential to increase your impact on the world. It’s true.

Next week: “pathogens found in confectionary” (OK- probably not going to get to that one, but interesting anyway)

People be searchin'


The uncanny valley of multidisciplinary studies

This was inspired by a conversation with a colleague today who suggested the term, as well as a particularly thorny paper that has now been in review for going on two years, and has been reviewed by four journals (and one conference)- and of course the wonderful xkcd for the format. Ugh! Sometimes it really does feel like I’m a zombie. A multidisciplinary undead. Blarg.

Arrgh - I'm a zombie. Brains!


Here’s a link to the Wikipedia entry on “uncanny valley”. It’s from robotics and it describes how robots make us feel increasingly uncomfortable, uncanny, as they get more and more human-like. It’s not a completely appropriate analogy to link it to publishing computational biology studies, but I think it actually makes a lot of sense. From the reviewers’ point of view the methods, language, format, and sometimes even goals of a multidisciplinary paper become more and more foreign as they move further into the territory of the other field. If they are too far one way or another they won’t be seen by the other side’s reviewers. Of course, there are those reviewers who are completely familiar with the middle ground- we’ll call them zombie-lovers- who have no problems. But getting a review like that is an exception rather than the rule.

What if I were my own post-doc mentor?

Recently I’ve had time, and reason, to reflect upon what was expected of me during the early portion of my post-doc and what I was able to deliver. It started me thinking: how would I judge myself as a post-doc if I (the me right now) were my own mentor?

My post-doc started 12 years ago and completed when I landed my current job, 7 years ago. I’ve given a short introduction that includes some context: where I was coming from and what I settled on for my post-doc project.

Background: I did my PhD in a structural virology lab in a microbiology and immunology department. I started out solidly on the bench science side then worked my way slowly into image analysis and some coding as we developed methods for analysis of electron microscopy images to get structural information.

May 2001: Interviewed for a post-doc position with Dr. Ram Samudrala in the Department of Microbiology at UW. Offered a position and accepted soon after. My second day on the job, sitting in an office with a wonderful panoramic view of downtown Seattle from tall tower to tall tower, was September 11th 2001.

First idea on the job: Was to develop a one-dimensional cellular automaton to predict protein structure. It didn’t work, but I learned a lot of coding. I’m planning on writing a post about that and will link to it here (in the near future).

Starting project: My starting project that I finally settled on was to predict structures for all the tractable proteins in the rice, Oryza sativa, proteome, a task that I’m pretty sure has never been completed by anyone. The idea here is that there are three classes of protein sequence: those which have structures that have been solved for that specific protein, those that have significant sequence similarity to proteins with solved structures, and those that are not similar to sequences with known structures. Also, there’s a problem with large proteins that have many domains. These need to be broken up into their domains (structurally and functionally distinct regions of the protein) before they can be predicted. So I started organizing and analyzing sequences in the rice proteome. This quickly took on a life of its own and became my post-doc project. I did still work some with structure but focused more on how to represent data, access it, and use it from multiple levels to make predictions that were not obvious from any of the individual data sources. This is an area that I continue to work in in my current position. What came out of it was The Bioverse, a repository for genomic and proteomic data, and a way to represent that data in a way that was accessible to anyone with interest. The first version was coded all by me from the ground up in a colossal, and sometimes misguided, monolithic process that included a workflow pipeline, a webserver, a network viewer, and a database, of sorts. It makes me tired just thinking of it. Ultimately the Bioverse was an idea that didn’t have longevity for a number of different reasons- maybe I’ll write a post about that in the future.

Publishing my first paper as a post-doc: My first paper was a short note for the Nucleic Acids Research special issue on databases on the Bioverse that I’d developed. I submitted it one and a half years after starting my post-doc.

Now the hard part, what if I were my own mentor: How would mentor me view post-doc me?

How would I evaluate myself if I were my own mentor? Hard to say, but I’m pretty sure mentor me would be frustrated at post-doc me’s lack of progress publishing papers. However, I think mentor me would also see the value in the amount and quality of the technical work post-doc me had done, though I’m not sure mentor me would give post-doc me the kind of latitude I’d need to get to that point. Mentor me would think that post-doc me needed mentoring. You know- mentor me needs to DO something, right? And I’m not sure how post-doc me would react to that. Probably it would be fine, but I’m not sure it’d be helpful. Mentor me would push for greater productivity, and post-doc me would chafe under the stress. We might very well have a blow up over that.

Mentor me would be frustrated that post-doc me was continually reinventing the wheel in terms of code. Mentor me would push post-doc me to learn more about what was already being done in the field and what resources existed that had similarities with what post-doc me was doing. Mentor me would be frustrated with post-doc me’s lack of vision for the future: did post-doc me consider writing a grant? How long did post-doc me want to remain a post-doc? How did post-doc me think they’d be able to land a job with minimal publications?

Advice that mentor me would give post-doc me? Probably to focus more on getting science done and publishing some of it than futzing around with (sometimes unnecessary) code. I might very well be wrong about that too. The path that I took through my post-doc and to my current independent scientist position might very well be the optimal path for what I do now.

I (mentor me) filled out an evaluation form that is similar to the one I have to do for my current post-docs (see below). Remember, this was 12 years ago- so it’s a bit fuzzy. Post-doc me comes out OK- but with a number of places for improvement.

This evaluation makes me realize how ideas and evaluations of “success”, “progress”, and even “potential as an independent scientist” can be very complicated and can evolve rapidly over time for the same person. As a mentor there is not a single clear path to promote these qualities in your mentees. In fact, mentorship is hard. Too much mentorship and you could stifle good qualities. Too little and you could let those qualities die. And here’s the kicker: or not. What you do as a mentor might not have as much to do with eventual outcomes of success as you’d like to think.


How would mentor me rate post-doc me if I had to evaluate using the same criteria that I now use for my own post-docs?


Journaling a Computational Biology Project: Part 5

Days 6-22: (link to my previous post in this series)

I said “real time” and I meant real time. This is what happens to projects when there are 4-5 or more other projects vying for attention, plus a bunch of internal strategy planning things, plus outside reviews to complete, plus papers to wrap up and submit, plus summer vacation stuff, oh and plus Twitter stuff and making a t shirt and stuff. You don’t get stuff done for long periods of time.

Anyway, what I have gotten done is to focus on my highly related side project that arose from my initial project. The key here was that I could process the data exactly the same way then just use it in a slightly modified version of another existing algorithm, and voila! magic happens.

Of course this wasn’t how things shook out. I did the data processing and ran the algorithm and found that the results weren’t really that enlightening. In fact they looked a bit ho-hum. Suspiciously so actually. So there are two alternatives: alternative 1 is that the method is functioning perfectly and that the biology just is like this and I’m not going to get very much from this approach; alternative 2 is that the method is not working for some reason. To test this I ran a subset of the data using the existing method and my new extended method. For reasons regarding the particular method I’m using and how I’m extending it these two subsets should yield exactly the same results.
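A minimal sketch of that kind of check, in Python. Everything in it is a stand-in: `existing_method` and `extended_method` are hypothetical placeholders (simple per-feature averages), not the actual algorithms, which aren't described in this series. The only thing it's meant to show is the shape of the test: run both versions on a subset where the extension should reduce to the original, and compare.

```python
import math


def existing_method(samples):
    """Placeholder for the published method: per-feature mean."""
    n = len(samples)
    return [sum(col) / n for col in zip(*samples)]


def extended_method(samples, weights=None):
    """Placeholder for the extension: per-feature weighted mean.

    With no weights supplied it should reduce to existing_method,
    which is exactly the property the check relies on.
    """
    if weights is None:
        weights = [1.0] * len(samples)
    total = sum(weights)
    return [
        sum(w * x for w, x in zip(weights, col)) / total
        for col in zip(*samples)
    ]


def results_agree(a, b, tol=1e-9):
    """True if two result vectors match element-wise within tol."""
    return len(a) == len(b) and all(
        math.isclose(x, y, rel_tol=tol, abs_tol=tol) for x, y in zip(a, b)
    )


# Tiny made-up subset: 3 samples (rows) by 3 features (columns).
subset = [[1.0, 4.0, 2.5], [2.0, 3.0, 0.5], [0.0, 5.0, 1.5]]

old = existing_method(subset)
new = extended_method(subset)

if results_agree(old, new):
    print("Subset results match: the extension reduces to the original here.")
else:
    print("Mismatch: something is off in the code or in the thinking.")
```

A mismatch doesn't say where the problem is, only that one exists. It's cheap enough to run before doing any interpretation of the full results.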

Which do you think was the case? Well, of course, results from the two subsets look different and it’s alternative 2. I messed something up or my thinking about the problem was wrong. This morning I sat down for 5 minutes (that’s how long this process took me, really) and read the methods section for the paper describing the method I’m trying to extend. Actually I just read the first paragraph of the methods section. Ohhhhhhhhhhh. Go figure. It actually HELPS to read how they implemented the existing method in the first place <facepalm>.

Facepalm. Old school.

So now I’m recoding and rerunning- hoping that this improves things. The way they did it in the first place is elegantly simple, so it works with my extensions no problems. I’ll see. And you will too in the next installment. Which could be tomorrow. Or in December. Who knows?

Journaling a computational biology project: Part 2

Day 3 (link to my previous entry)

Uh-oh- roadblock. Remember how I was saying this project was dirt simple?

It's just THIS simple. This has to work- there's no WAY it could fail.


This came much faster than I thought it would. I’ve got actual data and I have to figure out if there’s a story there. Or rather, where the story is. The results from my large-scale parallel run are interesting, but I’m not sure they clearly demonstrate how this approach is better than previous approaches. Also, I had to rerun the whole thing to get all the results (turns out I was only capturing about 1/5th of them)- but the end problem was the same. The results are very significant, but not head and shoulders above previous results, and don’t really demonstrate what I was hoping they would. Strongly anyway. Time for some thinkin. Never as dirt simple as I think it will be to start with.

Down, down, down, down...


Anyway, pushing onwards doing permutations. The question here is how likely would I be to see the scores I’m getting just by chance alone. So I permute the labels on my data and run the thing a few times with random labels. The permutation is done on the sample level- the data I’m using is from observations under condition 1 and condition 2- and I have multiple observations from each condition. So to permute I just randomize which observations I’m saying are from condition 1 and condition 2.

I’ve done the first couple of randomized runs and they’re actually coming up with some reasonably significant results. This means that I’ll have to compare the random scores with my real scores in order to establish a false discovery rate, which I can then use as a threshold for reporting.
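Here is a rough sketch of that label-permutation idea in Python, using only the standard library. The scoring function (absolute difference in per-feature means between the two conditions), the toy data, and all of the names are assumptions made up for illustration, since the real method and data aren't described here. The FDR estimate is the simple empirical one: the average number of permuted-label scores passing a threshold per run, divided by the number of real scores passing it.

```python
import random
from statistics import mean


def score_features(data, labels):
    """Hypothetical score: |mean(condition 1) - mean(condition 2)| per feature."""
    scores = []
    for feature in zip(*data):  # columns are features, rows are samples
        c1 = [v for v, lab in zip(feature, labels) if lab == 1]
        c2 = [v for v, lab in zip(feature, labels) if lab == 2]
        scores.append(abs(mean(c1) - mean(c2)))
    return scores


def permuted_scores(data, labels, n_perm, rng):
    """Pool the scores from n_perm runs with sample labels shuffled."""
    null = []
    for _ in range(n_perm):
        shuffled = labels[:]
        rng.shuffle(shuffled)  # permute at the sample level
        null.extend(score_features(data, shuffled))
    return null


def estimate_fdr(real, null, n_perm, threshold):
    """Empirical FDR: expected false hits per run / observed real hits."""
    real_hits = sum(s >= threshold for s in real)
    if real_hits == 0:
        return 0.0  # nothing would be reported at this threshold
    null_hits_per_run = sum(s >= threshold for s in null) / n_perm
    return min(1.0, null_hits_per_run / real_hits)


# Toy data (made up): 4 samples per condition, 50 features,
# with the first 5 features shifted upward in condition 2.
rng = random.Random(0)
labels = [1] * 4 + [2] * 4
data = []
for lab in labels:
    row = [rng.gauss(0, 1) for _ in range(50)]
    if lab == 2:
        for j in range(5):
            row[j] += 3.0
    data.append(row)

real = score_features(data, labels)
n_perm = 20
null = permuted_scores(data, labels, n_perm, rng)

for threshold in (1.0, 2.0, 3.0):
    fdr = estimate_fdr(real, null, n_perm, threshold)
    print(f"threshold {threshold}: estimated FDR {fdr:.3f}")
```

With only a handful of samples per condition there are a limited number of distinct label permutations, so the null distribution from a few runs will be coarse. That's one reason to treat a threshold picked this way as approximate.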

I’ve also started to put things into a kind of an outline for the paper. Here’s what I’ve got so far- I’ve taken the details of what I’m doing out for blogging purposes- but you get the idea:


  1. General background about the problem we’re developing our method on
  2. Description of the previous algorithm, what it offers and what is the gap that our approach will fill
  3. Specific details about the data set we’re using
  4. Summary of what our approach is and what results we’ll be presenting in the paper


  1. First apply the previous algorithm on our data (this hasn’t been done). Possibly validate on an external dataset
  2. Show how our algorithm improves results over previous
  3. Add in the extra idea we came up with that will also be a novel twist on the approach
  4. Show what kind of biological information can be derived from these new approaches. This is really open at this point since I’m not really sure what I’ll get yet. But preparing for it and thinking about it so writing it down.
  5. Validation on an external dataset (i.e. a different one from the one I’m using)- maybe. This might be difficult to impossible.

Journaling a computational biology study: Part 1

The process of how I compose a computational biology study, execute it, and write it up seems to follow a kind of set pattern, for the relatively simple projects anyway. So I thought I’d blog about this process as it happens.

I debated on several takes on this:

  1. Blogging as it’s happening with full technical details
  2. Blogging a journal that I could then release after the paper was finished and submitted
  3. Blogging as it’s happening but not talking about the technical details

The first is appealing, but probably wouldn’t go over well with my employer- and it is a simple idea that someone else could pick up and run with. I’m not paranoid, but in this case it might be too enticing for someone with the right skills. The second seems like not as much fun. So I’m opting for the third option and will be blogging about what I’m doing generally, but not giving specifics on the science, algorithms, or data I’m working with.

Background on the project

I’m not starting cold on this project. I came up with the idea last year but haven’t had time to implement it until now. It’s a dirt simple extension of an existing method that has the potential to be very interesting. I have the problem and data in hand to work on it. Last year we implemented a parallel version of a prototype of the algorithm. Now that I can actually work on it I can see a clear path to a finished project- being a submitted paper, or possibly inclusion as a part of a larger paper.

Day 1

Started out by revisiting the idea. Thinking about it and doing some PubMed searches. I just wanted to make sure that it hadn’t been done by anyone, especially the groups that developed the original algorithm. Nothing seems to be there- which is good, because as I said- it’s dirt simple.

Mid-day talked myself out of the idea in its original form- it can’t work as simply as I’d thought.

Relayed my thoughts to my post-doc who reassured me that it was actually that simple and we could do it the way I originally envisioned. He was right. We talked about the statistics and algorithms for it for a while.

Got my old code working again. Revised the core bits to handle the new idea. Actually ran some data through on my laptop using a very limited dataset. Looks like it works! So fun to actually be coding again and not just writing papers, grants, emails, or notes. Opened a blank Word document to do some writing. *sigh*

Decided on a tentative title (which will change) and a tentative author list. Myself, the post-doc who I talked with about it, the programmer who coded the parallel version previously, a post-doc who hasn’t worked on it yet, but probably will, and a senior domain expert. Yes, I’m doing this very early on. But as I said, there’s a clear path from here to a paper- it’s not too early.

Day 2

More testing on the prototype code to make sure that it’s behaving as I think it should. Also coded up an alternative data pre-processing step that seems to be a good idea. Comparing results from the two pre-processing methods shows that they give different answers. I’ll have to iron that one out later when working with the real datasets.

Figured out the plan for the project- at least in broad strokes. Run on complete dataset, implement a random permutation strategy to estimate false discovery rate, break up dataset and show how the method works on individual parts of it (this is specific to the problem), find another dataset for validation, write it up. Yes, it’s just that simple.

Discussed an additional very interesting strategy with post-doc number 1 that will really add novelty and hopefully value to the study. Also discussed the permutation strategy in some detail. That will be really important to demonstrate that this actually works.

Spent most of the day revising the code for the parallel implementation to incorporate the new ideas and testing it out on our cluster to see if it works. Slow progress, but finally got the entire thing to run! I did a couple of test runs using a limited dataset and only running on 2 nodes. When those worked I did the whole shebang. Finished in about an hour on 60 nodes, which is really pretty impressive given what it’s doing. Definitely a win!

Now to work on putting some words down for the Introduction section. I also like to outline the results section by generally writing about how I think it will go in a glorified outline. I’ve posted about this process previously here.



Job opening: worst critic. Better fill it for yourself, otherwise someone else will.

A recent technical comment in Science (here) reminded me of a post I’d been meaning to write. We need to be our own worst critics. And by “we” I’m specifically talking about the bioinformaticians and computational biologists who are doing lots of transformations with lots of data all the time- but this generally applies to any scientist.

The technical comment I referred to is behind a paywall so I’ll summarize. The first group published the discovery of a mechanism for X-linked dosage compensation in Drosophila based on, among other things, ChIP-seq data (to determine transcription factor binding to DNA). The authors of the comment found that the initial analysis of the data had used an inappropriate normalization step – and the error is pretty simple: instead of multiplying a ratio by a factor (the square root of the number of bins used in a moving average) they multiplied the log2 transform of the ratio by the factor. This resulted in greatly exaggerated ratios and artificially induced a statistically significant difference where there was none. Importantly, the authors of the comment describe how they noticed this:

We noticed that the analysis by Conrad et al. reported unusually high Pol II ChIP enrichment levels. The average enrichment at the promoters of bound genes was reported to be ~30,000-fold over input (~15 on a log2 scale), orders of magnitude higher than what is typical of robust ChIP-seq experiments.

This is important because it means that this was an obvious flag that the original authors SHOULD have seen and wondered about at some point. If they wondered about it they SHOULD have looked further into their analysis and done some simple tests to determine if what they were seeing (30,000 fold increase) was actually reasonable. In all likelihood they would have found their error. Of course, they may not have ended up with a story that could be published in Science- but at least they would not have had the embarrassment of being caught out that way. This is not to say that there is any indication of wrongdoing on the part of the original paper- it seems that they made an honest mistake.
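To make the arithmetic concrete, here is a small Python sketch of the two scalings as summarized above. It is based only on the description in this post, not on the original analysis code. The input ratio (2-fold) and the bin count (225, so the factor is 15) are made-up values chosen to land near the quoted ~15 on a log2 scale, and the plausibility ceiling in `looks_implausible` is an arbitrary assumption, not a published cutoff.

```python
import math


def intended_scaling(ratio, n_bins):
    """Multiply the ratio itself by sqrt(n_bins), as the post describes."""
    return ratio * math.sqrt(n_bins)


def erroneous_scaling(ratio, n_bins):
    """Multiply log2(ratio) by sqrt(n_bins); the result is then read as log2."""
    return math.log2(ratio) * math.sqrt(n_bins)


def looks_implausible(log2_enrichment, max_log2=10.0):
    """Flag values above an assumed ceiling (hypothetical; 2**10 = 1024-fold)."""
    return log2_enrichment > max_log2


# Illustrative values only; neither number comes from the paper.
ratio, n_bins = 2.0, 225

good = math.log2(intended_scaling(ratio, n_bins))  # log2(2 * 15)
bad = erroneous_scaling(ratio, n_bins)             # log2(2) * 15

for name, value in (("intended", good), ("erroneous", bad)):
    flag = "CHECK THIS" if looks_implausible(value) else "ok"
    print(f"{name}: {value:.2f} on a log2 scale (~{2 ** value:.0f}-fold) {flag}")
```

The same modest input ratio ends up either around 30-fold or around 30,000-fold depending on where the log is taken. A one-line range check on the output is the kind of simple test that would have raised the flag.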

In this story the authors likely fell prey to the Confirmation Bias, the tendency to believe results that support your hypothesis. This is a particularly enticing and tricky bias and I have fallen prey to it many times. As far as I know, these errors have never made it into any of my published work. However, falling for particularly egregious examples (arising from mistakes in machine learning applications, for example) trains you to be on the lookout for it in other situations. Essentially it boils down to the following:

  1. Be suspicious of all your results.
  2. Be especially suspicious of results that support your hypothesis.
  3. The amount you should be suspicious should be proportional to the quality of the results. That is, the better the results are the more you should be suspicious of them and the more rigorously you should try to disprove them.

This is essentially wrapped up in the scientific method (my post about that here)- but it bears repeating and revisiting. You need to be extremely critical of your own work. If something works, check to make sure that it actually does work. If it works extremely well, be very suspicious and look at the problem from multiple angles. If you don’t, someone else may, and they may not write things about you that are as nice as what YOU would write.

The example I give above is nice in its clarity and it resulted in calling into question the findings of a Science paper (which is embarrassing). However, there are much, much worse cases with more serious consequences.

Take, for instance, the work Keith Baggerly and Kevin Coombes did to uncover a series of cancer papers that had multiple data processing, analysis and interpretation errors. The NY Times ran a good piece on it. It is more complicated and involves both (likely) unintentional errors in processing, analysis, or interpretation and, possibly, more serious issues of impropriety. I won’t go into the details here but their original paper in The Annals of Applied Statistics, “Deriving chemosensitivity from cell lines: Forensic bioinformatics and reproducible research in high-throughput biology”, should be required reading for any bioinformatics or computational biology researcher. The paper painstakingly and clearly goes through the results of several high profile papers from the same group and reconstructs, first, the steps they must have taken to get the results they did, then second, where the errors occurred, and finally, the results if the analysis had been done correctly.

Their conclusions are startling and scary: they found that the methods were often times not described clearly such that a reader could easily reconstruct what was done and they found a number of easily explainable errors that SHOULD have been caught by the researchers.

These were associated with one group and a particular approach, but I can easily recognize the first, if not the second, in many papers. That is, it is often times very difficult to tell what has actually been done to process the data and analyze it. Steps that have to be there are missing in the methods sections, parameters for programs are omitted, data is referred to but not provided, and the list goes on. I’m sure that I’ve been guilty of this from time to time. It is difficult to remember that writing the “boring” parts of the methods may actually ensure that someone else can do what you’ve done. And sharing your data? That’s just a no-brainer, but something that is too often overlooked in the rush to publish.

So these are cautionary tales. Those of us handling lots of data of different types for different purposes, and running many different types of analysis to obtain predictions, must always be on guard against our own worst enemy: ourselves and the errors we might make. And we must be our own worst (and best) critics: if something seems too good to be true, it probably is.