Asked and answered: Computational Biology Contribution?

So someone asked me this question today: “as a computational biologist, how can you be useful to the world?”. OK, so they didn’t ask me, per se- they got to my blog by typing the question into a search engine, and I saw this on my WordPress stats page (see bottom of this post). Which made me think: “I don’t know what page they were directed to, but I know I haven’t addressed that specific question before on my blog”. So here’s a quick answer, especially relevant since I’ve been talking with CS people about this at the ACM-BCB meeting over the last few days.

As a computational biologist how can you be useful to the world?

  1. Choose your questions carefully. Make sure that the algorithm you’re developing, the software that you’re designing, the fundamental hypothesis that you’re researching is actually one that people (see collaborators, below) are interested in and see the value in. Identify the gaps in the biology that you can address. Don’t build new software for the sake of building new software- generally people (see collaborators) don’t care about a different way to do the same thing, even if it’s moderately better than the old way.
  2. Collaborate with biologists, clinicians, public health experts, etc. Go to the people who have the problems. What they can offer you is focus on important problems, which will improve the impact of your research (you want NIH funding? You HAVE to have impact and probably collaborators). What you can give them is a solution to a problem they are actually facing. Approach the relationship with care, though, since this is where the language barrier between fields can be very difficult (I have a forthcoming post on this). Make sure that you interact with these collaborators during the process- that way you don’t go off and do something completely different from what they had in their heads.
  3. In research be rigorous. The last thing that anyone in any discipline needs is a study that has not considered validation, generalizability, statistical significance, or having a gold-standard or reasonable facsimile thereof to compare to. Consider collaborating with a statistician to at least run your ideas by- they can be very helpful, or a senior computational biologist mentor.
  4. In software development be thoughtful. Consider the robustness of your code- have you tested it extensively? How will average users (see collaborators, above) get their data into it? How will average users be able to interpret the results of your methods? Put effort into working with those collaborators to define the user interface and user experience. They don’t (to a point) care about execution times as long as it finishes in a reasonable amount of time and gives good results (have your software estimate time to completion and display it- see the sketch after this list). They do care if they can’t use it (or rather, they completely don’t care and will stop working with you on the spot).
  5. Sometimes people don’t know what they need until they see it. This is a tip for at least 10th level computational biologists (to make a D&D analogy). This was a tenet of Steve Jobs of Apple and I believe it to be true. Sometimes, someone with passion and skill has to break new ground and do something that no one is asking them to do but that they will LOVE and won’t know how they lived without it. IT IS HIGHLY LIKELY THAT THIS IS NOT YOU. This is a pretty sure route to madness, wearing a tin hat, and spouting “you fools! you’ll never understand my GENIUS”- keep that in mind.
  6. If you’re a computational biologist with some experience, make sure that you pass it along. Attend conferences where there are likely to be younger faculty/staff members, students, and post-docs. Comment on their posters and engage. When possible, suggest or make connections with collaborators (see above) for them. Question them closely on the first four points above- just asking the questions may be an effective way of conveying their importance. Organize sessions at these conferences. In your own institution be an accessible and engaged mentor. This has the most potential to increase your impact on the world. It’s true.
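On the time-to-completion point in item 4, here's a minimal sketch of what I mean, assuming a simple loop over independent work items (`items` and `process` are placeholders, not from any real tool- libraries like tqdm will do this for you):

```python
import time

def run_with_eta(items, process):
    """Process items one at a time, printing a rough estimate of time
    remaining after each- naive linear extrapolation from elapsed time."""
    start = time.time()
    total = len(items)
    for i, item in enumerate(items, 1):
        process(item)
        elapsed = time.time() - start
        remaining = elapsed / i * (total - i)  # assumes items take similar time
        print(f"{i}/{total} done, ~{remaining:.0f}s remaining", end="\r")
    print()
```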

Next week: “pathogens found in confectionary” (OK- probably not going to get to that one, but interesting anyway)

People be searchin’

How to review a scientific manuscript

Finished up another paper review yesterday and I was thinking about the process of actually doing the review. I’ve reviewed a bunch of papers over the years and I follow a general strategy that seems to work well for me. I’m sure there are lots of great ways of doing this- and I’m not trying to be comprehensive here, just giving some ideas.

The general process:

  1. I start by printing out the paper. I’ve reviewed a few manuscripts completely electronically, but I find that to be difficult. It really helps me to have a paper copy that I can jot notes on and underline sections.
  2. I do a first read-through, going pretty much straight through and not sweating it if I don’t get something right away, since I know I’ll go back over it again.
  3. During this read-through I mark sections that seem confusing, jot down questions in the margins, and underline misspelled or misused words.
  4. Generally at this point I’ll start writing up my review- which consists of a summary paragraph, a list of major comments, and a list of minor comments (but check the journal guidelines for specifics). This allows me to start the process and get something down on paper. I usually start by listing out the minor comments and slowly add in the major comments.
  5. I re-read the paper guided by the questions I’ve noted. This allows me to delve into sections that are confusing to see if the section is actually confusing or if I’m just missing something. That’s sometimes the hardest call to make as a reviewer. As I go back through the paper I try to develop and refine my major comments.

Here are some things to remember as you’re reviewing papers:

  1. You have an obligation and duty as a reviewer to be thorough and make sure that you’ve really tried to understand what the authors are saying. For me this means not ignoring those nagging feelings I sometimes get: “well, it seems OK, but this one section is a little fuzzy. It seems odd”. It’s easy to brush that feeling aside and believe that the authors know what they’re talking about. But don’t do that. Really look at the argument they’re making and try to understand it. Many times I’ll be able to tell whether they’ve got it right by putting a little effort into it. If you can’t understand it after having tried then you’re in the shoes of a future reader- and it’s perfectly all right to comment that you didn’t understand it and that it needs to be made clearer.
  2. You also have an obligation to be as clear as possible in your communication. That is, try to be specific about your comments. List the page and line numbers that you’re referring to. Specify exactly what your problem with the text is- even if it’s that you don’t understand. If you can, suggest the kind of solution that you’d like to see in a revised manuscript.
  3. Before rejecting a paper think it over carefully. Would a revision reasonably be able to fix the problems? Did the paper annoy you for some reason? If so was that annoyance a significant flaw in the paper, or was it a pet peeve that doesn’t merit a harsh penalty? Rejection is part of the business and it’s really not unusual for papers to get rejected, just make sure you’re rejecting for the right reasons.
  4. Before accepting a paper think it over carefully. This paper will enter the scientific record as “peer reviewed”- which should mean something. The review will reflect on you personally, whether or not it is anonymous. If it’s not anonymous (some journals post the names of the reviewers and in some cases the reviews themselves) then everyone will be able to make their own judgement about whether you screwed up by accepting- are you good with that? If the review is anonymous the editor (who can frequently be someone well-known and/or influential) still knows who you are. Also, many sub-sub-sub-disciplines are small enough that the authors may be able to glean who you are from your review- they may even have suggested you as a reviewer. This is especially true if your review includes a comment like, “the authors neglected to mention the seminal work of McDermott, et al. (McDermott, et al. 2009, McDermott, et al. 2010).”
  5. Remember that it’s OK to say that you don’t know or that you aren’t an expert in a particular area. You can either communicate directly with the editor prior to completion of the review (if you feel that you really aren’t suited to provide a review at all) and request that you be removed from the review process or state where you might be a bit shaky in the comments to the editor (this is generally a separate text box on your review page that lets you communicate with the editor, but the authors don’t see it).
  6. Added (h/t Jessie Tenenbaum): It’s OK to decline a review because you are in a particularly busy period and just can’t devote the time it will require (e.g. when traveling, or when a grant is due). Do remember, though: we’re all busy, and we all rely on others agreeing to review OUR papers.
  7. Added (h/t Jessie Tenenbaum): If you must decline, recommendations of another qualified reviewer are GREATLY appreciated, especially in those sub-sub-sub-specialty areas.
  8. Added (h/t Jessie Tenenbaum): The reviewer’s role is ideally more coach than critic. It’s helpful to approach the review with the goal of helping the authors make it a better paper for publication some day- either in this submission or elsewhere.

Some general things to look out for

Sure, papers from different disciplines and sub-disciplines and sub-sub-disciplines require different kinds of review and have different red flags, but here are some things I think are fairly general to look out for in papers (see also Eight Red Flags in Bioinformatic Analysis).

  1. Are the main arguments made in the paper understandable? Is the data presented sufficient to be able to evaluate the claims of the paper? Is this data accessible in a sufficiently raw form, say as a supplement? For bioinformatics-type papers is the code available?
  2. Have the appropriate controls been done? For bioinformatics-type papers this usually amounts to answering the question: “Would similar results have been seen if we’d looked in an appropriately randomized dataset?”- possibly my most frequent criticism of these kinds of papers.
  3. Is language usage appropriate and clear? This can be in terms of language usage itself (say by non-native English speaking authors), consistency (same word used for same concept all the way through), and just general clarity. Your job as reviewer is not to proofread the paper- you should probably not take your time to correct every instance of misused language in the document. Generally it’s sufficient to state in the review that “language usage throughout the manuscript was unclear/inappropriate and needs to be carefully reviewed” but if you see repeated offenses you could mention them in the minor comments.
  4. Are conclusions appropriate for the results presented? I see many times (and get back as comments on my own papers) that the conclusions drawn from the results are too strong, that the results presented don’t support such strong conclusions, or sometimes that conclusions are drawn that don’t seem to match the data presented at all (or not well).
  5. What does the study tell you about the underlying biology? Does it shed significant light on an important question? Can you identify from the manuscript what that question is (this is a frequent problem I see- the gap being addressed is not clearly stated, or stated at all)? Evaluation of this question should vary depending on the focus of the journal- some journals do not (and should not) require groundbreaking biological advances.
  6. Are there replicates? That is, did they do the experiments more than once (more than twice, actually- it should be at least three times)? How were the replicates done? Are these technical replicates- essentially where some of the sample is split at some point in the processing and analyzed- or biological replicates- where individual and independent biological samples were taken (say from different patients, cultures, or animals) and processed and analyzed independently?
  7. Are the statistical approaches used to draw the conclusions appropriate and convincing? This is a place where knowing the limitations of the p-value comes in handy: for example, a comparison can have a highly significant p-value but a largely meaningless effect size (see the toy example after this list). It is also OK to state that you don’t have a sufficient understanding of the statistical methods used in the paper to provide an evaluation. You’re then kicking it back to the editor to make a decision or to get a more statistically-savvy reviewer to evaluate the manuscript.
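On point 7, here's a toy simulation (invented normal data, nothing from a real study) showing how a negligible effect can still carry a very impressive p-value once the sample gets large:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Two groups whose true means differ by a trivial 0.02 standard deviations
a = rng.normal(0.00, 1.0, size=100_000)
b = rng.normal(0.02, 1.0, size=100_000)

t, p = stats.ttest_ind(a, b)
d = (b.mean() - a.mean()) / np.sqrt((a.var(ddof=1) + b.var(ddof=1)) / 2)
# With n this large, p will usually land far below 0.05 while d stays ~0.02
print(f"p = {p:.1e}, Cohen's d = {d:.3f}")
```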

Conclusion

It’s important to take your role seriously. You, the one or two other reviewers, and the editor for the paper are the keepers of the scientific peer review flame. You will help make the decision on whether the work that is presented, which probably took a lot of time and effort to produce, is worthy of being published and distributed to the scientific community. If you’ve been on the receiving end of a review (and who hasn’t?) think about how you felt- did you complain about the reviewer not spending time on your paper, about them “not getting” it, about them doing a poor job that set you back months? Then don’t be that person. Finally, try to be on time with your reviews. The average time in review is long (I have an estimate based on my papers here) but it doesn’t need to be. The process of peer review can be very helpful for you, the reviewer. I find that it helps my writing a lot to see good and bad examples of scientific manuscripts, to see how different people present their work, and to think critically about the science.


Have laptop, will travel

One of the great things about being a purely computational researcher is that, nowadays, my office is pretty much wherever I want it to be. I’ve got my laptop, WiFi is omnipresent, and I have noise-canceling headphones for the serious business. There are lots of reasons I have to be at my office- meetings and increased ability to focus being primary. However, you don’t have to be purely computational to get a lot out of working in non-traditional locales. Writing is the place where we all (as researchers) can do this- writing manuscripts and grants being the biggest time sucks. Some of you will have the flexibility to do this during your actual work time; for others it might pertain mostly to the ‘extra’ work you do writing grants and papers.

So here is my random collection of thoughts on this topic.

Why take your work outside the standard work environment?

  1. Flexibility and efficient use of time. If you have your laptop with you, you can fit in writing wherever you are (see list below). This allows you to use your time well instead of standing around checking Facebook on your phone. Not all writing work is suited to short bits of time (probably no less than about 20-30 minutes at a stretch) but if you plan what to work on you can get a lot done this way. If you don’t have your laptop, a surprising amount of work can get done with just a pen and paper.
  2. Freedom from distraction. OK, a coffee shop can be a pretty distracting place, that’s a given. But sometimes being in your office can be pretty distracting too. People stop by to chat for a minute, phones ring, drawers need organizing, etc. If you can ignore the distractions outside your office (wherever you’re choosing to work) then this can be a productive way to go. Also, try working somewhere WITHOUT WiFi (it can be done)- and cut out the social media chatter.
  3. Creative stimulation. Changing your work environment drastically can give you a shot of creative energy. It can be refreshing to work outside at a park, or while enjoying a glass of your favorite beverage at a cafe or bar.

What to work on?

  1. Grants
  2. Manuscripts
  3. Reviewing papers/grants
  4. Catching up on answering emails
  5. Reading papers- no laptop required
  6. Planning and outlining- also no laptop required, use a pen and notebook

Where can you do this?

  1. Coffee shop. Everyone pretty much knows about this one. Can be distracting, but find a quiet corner and bring headphones. Also, try not to drink 15 double espressos while you’re there (not that I would have ANY experience with that)
  2. Bar/pub. These can be awesome places to work- probably not on a Friday or Saturday night, but other times. Many have WiFi and they have BEER! Also, try not to drink 8 beers while you’re there. Alcohol is actually a consideration since it can affect your motivation pretty severely. Ordering ONE beer and some food works OK for me, but certainly use your best judgement- and they will always have alternate non-alcoholic beverage options.

     The Mad Scientist enjoying a beer after a long day of meetings, about to do some grant writing at a McMenamin’s pub in Portland

  3. Public library. This is really just a no-brainer. No cost (though many libraries have coffee shops attached and allow you to bring covered cups in), free WiFi, lots of sitting areas, quiet atmosphere, surrounded by the smell of knowledge.
  4. Park. Working outside can be really pleasant in nice weather. If you’re lucky enough to have workable weather (not too hot, not too cold, not too windy or rainy) then find a table in the shade and settle in. I’ve never found this particularly effective myself, though the idea is wonderful- but I’m sure it could work for others.
  5. Doctor/dentist office, DMV, etc. This option is one I use quite a bit, but it only works for things that you can do a little bit on before being interrupted. I find that making to-do lists and outlines works well here. Reading background material can also work well.
  6. Car. Not while you’re driving! I mean if you’re sitting and waiting for something or someone this can be a good time too.
  7. Public transportation. When I was in Seattle I rode the commuter train in from Everett to work several times a week. A great place to work. An hour of uninterrupted time while beautiful countryside rolls by. Buses can work too, though not always for actual writing since often they bump and move too much for a laptop. Subways/metros also work well. Of course, this is pretty dependent on the density of people. It’s really hard to do anything productive when you have an elbow in your face and about 6 inches of standing room.
  8. Airplane/airport. So much wasted time in airports- which are great places to work if you find the right spots. Airplanes can be a bit problematic in terms of an actual laptop (I find I can do it if I type like a T-rex) but I bring papers to read and a notebook to do planning and write ideas. In airports try to find places where there aren’t many people- away from your departing gate if you have time. More chance of getting a power outlet and fewer distractions. If you’re really in need of an outlet try looking in places where other people aren’t going to be sitting (hallways and walkways) and sit on the floor- it can be done.

     *that’s me in the seat behind Rex, by the way.

  9. Hotel. Also in the traveling realm. Hotels can be excellent places to write. Free from a lot of the distractions and obligations of home and office. If you have extra time after a day at a conference or between sessions or before you catch your plane- use it. Many hotels are set up with desks, comfy chairs, outlets, coffee makers, and WiFi. When I travel to the east coast and my return flight is early I will frequently work through the night. Not for everyone, but I’m a night owl and I find it easier to do this (sometimes) than to sleep for a few hours then drag myself out of bed at 5 AM (3 AM my time) to get to the airport. Also, no danger of oversleeping – unless of course you accidentally crash. So if you do this make sure to arrange a wake up call and set an alarm for backup.
  10. Other locations. Be on the lookout for other opportunities. I have worked on a grant while pouring wine for a wine tasting at a friend’s house (not a wine-tasting party, mind you- this was a professional activity, so quite a bit of down time). That was pretty epic really but it still didn’t get my grant funded.


Academic Rejection Training

Following on my previous post about methods to deal with the inevitable, frequent, and necessary instances of academic rejection you’ll face in your career, I drew this comic to provide some helpful advice on ways to train for proposal writing. Since the review process generally takes months (well, the delay from the time of submission to the time that you find out is months- not the actual review itself), it’s good to work yourself up to this level slowly. You don’t want to sprain anything in the long haul getting to the proposal-rejection stage.

[Comic: Three Quick Ways]

15 great ways to fool yourself about your results

I’ve written before about how easy it is to fool yourself and some tips on how to avoid it for high-throughput data. Here is a non-exhaustive list of ways you too can join in the fun!

  1. Those results SHOULD be that good. Nearly perfect. It all makes sense.
  2. Our bioinformatics algorithm worked! We put input in and out came output! Yay! Publishing time.
  3. Hey, these are statistically significant results. I don’t need to care about how many different ways I tested to see if SOMETHING was significant about them.
  4. We only need three replicates to come to our conclusions. Really, it’s what everyone does.
  5. These results don’t look all THAT great, but the biological story is VERY compelling.
  6. A pilot study can yield solid conclusions, right?
  7. Biological replicates? Those are pretty much the same as technical replicates, right?
  8. Awesome! Our experiment eliminated one alternate hypothesis. That must mean our hypothesis is TRUE!
  9. Model parameters were chosen based on what produced reasonable output: therefore, they are biologically correct.
  10. The statistics on this comparison just aren’t working out right. If I adjust the background I’m comparing to I can get much better results. That’s legit, right?
  11. Repeating the experiment might spoil these good results I’ve got already.
  12. The goal is to get the p-value less than 0.05. End.Of.The.Line. (h/t Siouxsie Wiles)
  13. Who, me biased? Bias is for chumps and those not so highly trained in the sciences as an important researcher such as myself. (h/t Siouxsie Wiles)
  14. It doesn’t seem like the right method to use- but that’s the way they did it in this one important paper, so we’re all good. (h/t Siouxsie Wiles)
  15. Sure the results look surprising, and I apparently didn’t write down exactly what I did, and my memory of it is kinda fuzzy because I did the experiment six months ago, but I must’ve done it THIS way because that’s what would make the most sense.
  16. My PI told me to do this, so it’s the right thing to do. If I doubt that it’s better not to question it since that would make me look dumb.
  17. Don’t sweat the small details- I mean what’s the worst that could happen?

Want to AVOID doing this? Check out my previous post on ways to do robust data analysis and the BioStat Decision Tool from Siouxsie Wiles that will walk you through the process of choosing appropriate statistical analyses for your purposes! Yes, it is JUST THAT EASY!

Feel free to add to this list in the comments. I’m sure there’s a whole gold mine out there. Never a shortage of ways to fool yourself.


A word about balance

I’ve been reviewing machine learning papers lately and have seen a particular problem repeatedly. Essentially it’s a problem of how a machine learning algorithm is trained and evaluated for performance versus how it would actually be applied. I’ve seen this particular problem in a whole bunch of published papers too, so I thought I’d write a blog rant post about it. I’ve included a quick-and-dirty primer on machine learning approaches at the end of this post for those interested.

The problem is this: methods are often evaluated using an artificial balance of positive versus negative training examples, one that can artificially inflate estimates of performance over what would actually be obtained in a real world application.

I’ve seen lots of studies that use a balanced approach to training. That is, the number of positive examples is matched with the number of negative examples. The problem is that many times the number of negative examples in a ‘real world’ application is much larger than the number of positive examples- sometimes by orders of magnitude. The reason that is often given for choosing to use a balanced training set? That this provides better performance and that training on datasets with a real distribution of examples would not work well since any pattern in the features from the positive examples would be drowned out by the sheer number of negative examples. So essentially- that when we use a real ratio of positive to negative examples in our evaluation our method sucks. Hmmmmm……

This argument is partly true- some machine learning algorithms do perform very poorly with highly unbalanced datasets. Support Vector Machines (SVMs), though, and some other kinds of machine learning approaches seem to do pretty well. Some studies follow this initial balanced training step with an evaluation on a real-world set- that is, one with a ‘naturally’ occurring balance of positive and negative examples. This is a perfectly reasonable approach. However, too many studies don’t do this step, or perform a follow-on ‘validation’ on a dataset with more negative examples but still nowhere near the number that would be present in a real dataset. And importantly, the ‘bad’ studies report the performance results from the balanced (and thus artificial) dataset.

The issue here is that evaluation on a dataset with an even number of positive and negative examples can vastly overestimate performance by decreasing the number of false positive predictions that are made. Imagine that we have a training set with 50 positive examples and a matched set of 50 negative examples. The algorithm is trained on these examples and cross-validation (random division of the training set for evaluation purposes- see below) reveals that the algorithm predicts 40 of the positives to be positive (TP) and 48 of the negatives to be negative (TN). So it misclassifies two negative examples as positive, with scores that make them look as good as or better than the other TPs- which wouldn’t be too bad: the majority of positive predictions would still be true positives. Now imagine that the actual ratio of positive to negative examples in a real-world application is 1:50, that is, for every positive example there are 50 negative examples. What’s not done in these problem cases is extrapolating the performance of the algorithm to that real-world dataset. In that case you’d expect to see 100 false positive predictions- now outnumbering the true positive predictions and making the results a lot less confident than originally estimated. The example I use here is actually a generous one. I frequently deal with datasets (and review or read papers) where the ratios are 1:100 to 1:10,000, where this can substantially impact results.

So the evaluation of a machine learning method should involve a step where a naturally occurring ratio of positive and negative examples is represented. Though this natural ratio may not be clearly evident for some applications, it should be given a reasonable estimate. The performance of the method should be reported based on THIS evaluation, not the evaluation on the balanced set- since that is likely to be inflated from a little to a lot.
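To make the arithmetic in the example above concrete, here's a tiny sketch (just bookkeeping, not tied to any particular classifier) that extrapolates a balanced evaluation- 40/50 positives right, 48/50 negatives right- out to realistic class ratios:

```python
sensitivity = 40 / 50   # true positive rate from the balanced evaluation
specificity = 48 / 50   # true negative rate, i.e. a 4% false positive rate

def precision_at_ratio(n_pos, neg_per_pos):
    """Expected precision when each positive comes with neg_per_pos negatives."""
    tp = sensitivity * n_pos
    fp = (1 - specificity) * n_pos * neg_per_pos
    return tp / (tp + fp)

for ratio in (1, 50, 100, 10_000):
    print(f"1:{ratio:<6} expected precision = {precision_at_ratio(50, ratio):.3f}")
# 1:1 looks great (0.952); at 1:50 false positives already outnumber true
# positives (0.286); at 1:10,000 nearly every positive call is wrong (0.002)
```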

For those who are interested in real examples of this problem, I’ve got two example studies from one of my own areas of research- type III effector prediction in bacteria. In Gram-negative bacteria with type III secretion systems there are an unknown number of secreted effectors (proteins that are injected into host cells to effect virulence), but we estimate on the order of 50-100 for a genome like Salmonella Typhimurium, which has 4500 proteins total, so the ratio should be around 1:40 to 1:150 for most bacteria like this. In my own study on type III effector prediction I used a 1:120 ratio for evaluation for exactly this reason. A subsequent paper in this area chose to use a 1:2 ratio because “the number of non-T3S proteins was much larger than the number of positive proteins,…, to overcome the imbalance between positive and negative datasets.” If you’ve been paying attention, THAT is not a good reason- and I didn’t review that paper (though I’m not saying their conclusions are incorrect, since I haven’t closely evaluated their study).

  1. Samudrala R, Heffron F and McDermott JE. 2009. Accurate prediction of secreted substrates and identification of a conserved putative secretion signal for type III secretion systems. PLoS Pathogens 5(4):e1000375.
  2. Wang Y, Zhang Q, Sun MA, Guo D. 2011. High-accuracy prediction of bacterial type III secreted effectors based on position-specific amino acid composition profiles. Bioinformatics 27(6):777-84.

So the trick here is to not fool yourself, and in turn fool others. Make sure you’re being your own worst critic. Otherwise someone else will take up that job instead.

Quick and Dirty Primer on Machine Learning

Machine learning is an approach to pattern recognition that learns patterns from data. Often the pattern that is learned is a particular pattern of features- properties of the examples- that can separate one group of examples from another. A simple example would be trying to identify all the basketball players at an awards ceremony for football, basketball, and baseball players. You would start out by selecting some features, that is, player attributes, that you think might separate the groups: maybe hair color, length of shorts or pants in the uniform, height, and handedness of the player. Obviously these features would not be equally powerful at identifying basketball players, but a good algorithm will make the best use of the features it has. A machine learning algorithm could then look at all the examples- the positive examples (basketball players) and the negative examples (everyone else)- consider the values of the features in each group, and ideally find the best way to separate the two groups.

Generally, to evaluate the algorithm, all the examples are separated into a training set, used to learn the pattern, and a testing set, used to test how well the pattern works on an independent set. Cross-validation, a common method of evaluation, does this repeatedly, each time separating the larger group into training and testing sets by randomly selecting positive and negative examples to put into each set. Evaluation is very important since the performance of the method gives end users an idea of how well it will work for their real-world application, where they don’t know the answers already.

Performance measures vary, but for classification they generally involve comparing the predictions made by the algorithm with the known ‘labels’ of the examples- that is, whether the player is a basketball player or not. There are four categories of prediction: true positives (TP), where the algorithm predicts a basketball player and the example really is one; true negatives (TN), where it predicts not a basketball player and the example is not; false positives (FP), where it predicts a basketball player and the example is not; and false negatives (FN), where it predicts not a basketball player and the example actually is one.
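For the code-inclined, here's a toy version of the basketball example- invented features and numbers, with scikit-learn's SVC standing in for "a machine learning algorithm"- just to make the cross-validation and TP/TN/FP/FN bookkeeping concrete:

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import confusion_matrix

rng = np.random.default_rng(0)

# 30 basketball players (label 1) among 100 athletes, with two made-up features
y = np.array([1] * 30 + [0] * 70)
height = rng.normal(190, 8, size=100) + 12 * y   # players run taller on average
shorts = rng.normal(25, 4, size=100) + 6 * y     # and wear longer shorts
X = np.column_stack([height, shorts])

# Cross-validation: repeatedly split into training and testing sets, so each
# example is predicted by a model that never saw it during training
pred = cross_val_predict(SVC(), X, y, cv=5)
tn, fp, fn, tp = confusion_matrix(y, pred).ravel()
print(f"TP={tp} TN={tn} FP={fp} FN={fn}")
```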

Features (height and pant length) of examples (basketball players and non-basketball players) plotted against each other. Trying to classify based on either of the individual features won’t work well but a machine learning algorithm can provide a good separation. I’m showing something that an SVM might do here- but the basic idea is the same with other ML algorithms.

Presentation blues

It never seems to fail. The presentation is set up, the slides have been previewed, the system has been tested- but now that everyone’s actually IN the room to listen, something goes wrong. There’s a laundry list of things that can go wrong (see below) but this invariably causes delays, uncomfortable silence, fidgeting, and, if you’ve got a flexible audience at a conference or online, you’ll lose people- first slowly, then in droves. It is truly amazing the percentage of meetings where SOMETHING goes wrong with the presentation- even at well-organized, computational-leaning conferences.

This is a projector my grandfather, Gideon Kramer, designed. It uses a single carousel and dissolves between slides- extremely cool for its time.

Back in the day (i.e. days of my grad school experience) we weren’t ‘blessed’ with such things as computer projectors, Powerpoint, and web-enabled broadcast. We had actual slide projectors. Humming heat sinks with mechanical carousels of physical slides. You had to plan ahead for your talk- no last minute rearrangement of slides or addition of attributions before the talk. You had to plan out your talk, compose the slides, then have them made (when I started the presentation thing in earnest we at least could compose the slides on a computer.)

A visiting senior scientist told us before a talk that he would do a test for all his graduate students and post-docs prior to them giving a talk. He would take their prepared carousel full of carefully arranged slides and turn it upside down. If they hadn’t remembered to attach the retainer ring, the slides would all fall out and the poor student would be left scrambling and presumably panicking. I guess this was to teach them to pay attention to important details or something- to me it just sounded sadistic. In any case, his point was a good one: always be prepared for something going wrong. As the importance of the presentation rises, so too does your need to remember and apply this rule. Do not be caught with a blank screen and a blank stare at a job interview- EVEN IF IT’S NOT YOUR FAULT!

“THIS MAN… has a very large head”

To help I’ve prepared the following lists. The first is a (non-exhaustive) list of things that can go wrong- most of these I’ve seen or been a part of, the second is a list of things you can do to prepare yourself for these eventualities.

Things that can go wrong

  1. Technical issues with computer-projector connection. By far the most common problem I see. The computer doesn’t talk to the projector. It can happen when you switch computers to present from your own computer, for example. Or it can just happen. I know from experience that you can have several people test the connection ahead of time and things can still go wrong. The computer forgets its display connection, the cord comes loose, sometimes there are complicated control panels that have to be configured just right.
  2. Format issues with slides. Do you use Keynote or some other software that isn’t used as ubiquitously as, you know, PowerPoint? Then you might have problems. Switching the AV to use your computer can often lead to other problems (see point 1). Sometimes platform differences do cause problems (Mac to PC, or vice versa) but more often it’s old versus new formats that might prevent a presentation from being given.
  3. Web- or video- casting problems. I will say that I’ve almost never been involved in a video webcast that has gone without a hitch. Also, running a webinar is pretty tricky too (though it is getting easier.) This is a bad situation because you have people who are not in the same room with you getting antsy. Of course, the upside is that you can’t see their discomfort.
  4. Problems playing embedded videos. It’s just an issue. Embedded videos can work on one computer and then not on another. There are sometimes encoding issues and other times it just doesn’t seem to work. I sat in a presentation hall with 300 other people at a big conference and we waited 20 minutes (!) in the middle of a not-really-that-interesting talk so they could fix video playback. The video wasn’t all that great in the end. This is particularly insidious because it means stopping in the middle of a presentation to fix things. Ugly. Ugly. Ugly.
  5. Can’t find the file. You’ve seen it happen. The speaker thinks that their presentation has been loaded on the computer. Maybe they sent it to the conference organizers the previous week? Maybe they have a thumb drive that they’re sure they put the file on. This may be one of the more embarrassing problems. Sometimes it’s the senior researchers squinting over their glasses at the screen and repeatedly opening the wrong folders/drives (while everyone watches on the big screen), but nearly as often it’s younger, computer-savvy folks who really should know better.

What can you do to prepare yourself

  1. Know your talk. This is by far the most important point I’ll make. Know it inside and out and be able to give it even without your slides. This is especially important for things like job talks where a lot is on the line. It’s true that your interviewers will cut you some slack if there are technical issues that are clearly on their side. But you’ve missed a HUGE opportunity by standing around with a dumb look on your face. You could be giving your talk (or a portion of your talk) WITH NO SLIDES. It’s like MAGIC and it can be done for ANY talk. Sure, display of actual data will suffer, but if you have access to a white board you can sketch quick visual aids to give the idea. If you do this I guarantee no one will forget your talk. I still remember a talk I saw like this once, where the slide projector failed mid-talk and the speaker (a fellow grad student) kept on GIVING their talk. Practice important talks both with and without slides. And not just once- multiple times. It will pay off.
  2. Have backup copies. If you’ve sent your slide file ahead to be loaded by someone make sure that you also have a thumb drive in your pocket with the presentation too. Also- always have a backup format too. A PDF copy of your slides seems to work just fine and can be more universal. Of course, the PDF won’t display the animation of the stick figure and the associated completely unnecessary sound effects that accompany your slide transitions. But that’s really not a downside, is it?
  3. Do a pre-flight check. If you have the opportunity to view your slides as they will be presented then do this before your talk. A lot of times scheduled individual talks will include setup time in your agenda. Use it to load your slides and click through each one to make sure that it looks OK. Especially pay attention to embedded videos and anything else that might be problematic.
  4. Pre-flight video/web casts. If possible, test your webinar or video connection prior to your talk (or a talk you’re organizing). Do this with someone offsite, to make sure there aren’t issues with firewalls, and test this as well as possible. In my experience even a preflight check won’t guarantee that things will work. If you can, send a slide file to your destination ahead of time, or get one from the presenter if you’re organizing. That way, at the very least, you can follow along in real time. It works.
  5. Know what you’ll do/say if your presentation stops in the middle. You should never simply stop if your next slide doesn’t come up, if an animation or video you had doesn’t work right, or if the AV simply goes down in mid-sentence. This is a subpoint of my point 1 above: know your talk. If you know your talk you can keep on rolling and pause when you’re at a more convenient point. It does not
  6. Have a filler. This is one that I haven’t tried but seems like a good idea. Have in mind a short something that you can say that will fill time while the AV people (and generally 3-5 other interested parties from the audience who are ‘sure’ they know the problem and the fix) get you fixed up. This could be a short introduction to your talk (maybe then skip a couple of intro slides), an aside that is highly relevant to your audience, or even a short introduction to yourself and your work. I can see this could be dangerous because you don’t want it to sound like filler. So just take this as an untested idea.

Please let me know if you have comments or additions (or funny stories) in the comments below.

Journaling a Computational Biology Project: Part 4

Day 5 (link to my previous post in this series)

So the cluster was back online at 11:45 AM this morning, meaning that I could restart my false-discovery-rate estimation with more permutations. However, I decided to take a short detour to quickly implement a related idea that came up while I was coding up this one. It involves a similar data preprocessing step to this one, so I modularized my code (making the data preprocessing more independent from the rest of the code). After I ran the code I realized I had a big ole stinkin’ BUG that meant I wasn’t really preprocessing at all. This was bad: I should’ve seen this before, and it cost me a bunch of time on the cluster running fairly meaningless code (200 hours of CPU time or so). It was also very good: it means that the so-so results I was getting previously may have been because the algorithm wasn’t actually working correctly (but in a way that was impossible to see given the end results).

Eek! A BUG!

So, optimistically, I started the job again- but it’s going to run over night, so I won’t know the answer until tomorrow (and you won’t either). In any case I got the side-track idea working, though it’s still waiting on the addition of one thing to make it really interesting. Lesson for the day: check your code carefully! (this is only about the 350th time I’ve learned this lesson, believe me)

Journaling a Computational Biology Project: Part 3

Day 4 (link to previous post)

It figures that the week I decide to return to using the cluster (the PIC, in case you’re interested) is the week they have to shut it down for construction. So I ran no more permutations today- that’ll have to wait until next week.

Didn’t really do any other work on the paper or project today either- busy doing other things. So not much to report, actually. I did talk a bit about the results with my post-doc at our semi-weekly MSFAB (Mad Scientist Friday Afternoon Beer). We both agreed that the permutation test was a good idea and possibly the only way to get an estimate of real false discovery rates. Along these lines, as I reported yesterday, the first round of permutations returned some fairly significant results. These actually exceeded the Bonferroni-corrected p-values I was getting, which are supposed to tell you essentially the same thing. So it seems that in this case Bonferroni, generally a conservative multiple hypothesis correction, was not conservative enough. Good lesson to remember.
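For reference, Bonferroni correction is just scaling each raw p-value by the number of tests (capped at 1)- a sketch, not my actual analysis code:

```python
import numpy as np

def bonferroni(pvals):
    """Bonferroni correction: multiply each p-value by the number of tests,
    capping at 1. Controls the chance of even a single false positive."""
    p = np.asarray(pvals, dtype=float)
    return np.minimum(p * p.size, 1.0)

print(bonferroni([0.001, 0.01, 0.04]))  # -> [0.003 0.03 0.12]
```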

Journaling a computational biology project: Part 2

Day 3 (link to my previous entry)

Uh-oh- roadblock. Remember how I was saying this project was dirt simple?

It’s just THIS simple. This has to work- there’s no WAY it could fail.

This came much faster than I thought it would. I’ve got actual data and I have to figure out if there’s a story there. Or rather, where the story is. The results from my large-scale parallel run are interesting, but I’m not sure they clearly demonstrate how this approach is better than previous approaches. Also, I had to rerun the whole thing to get all the results (turns out I was only capturing about 1/5th of them) but the end problem was the same. The results are very significant, but not head and shoulders above previous results, and don’t really demonstrate what I was hoping they would. Strongly, anyway. Time for some thinkin’. Never as dirt simple as I think it will be to start with.

Down, down, down, down…

Anyway, pushing onwards doing permutations. The question here is how likely I would be to see the scores I’m getting by chance alone. So I permute the labels on my data and run the thing a few times with random labels. The permutation is done at the sample level- the data I’m using comes from observations under condition 1 and condition 2, and I have multiple observations from each condition. So to permute, I just randomize which observations I’m saying are from condition 1 and which from condition 2.

I’ve done the first couple of randomized runs and they’re actually coming up with some reasonably significant results. This means that I’ll have to compare the random scores with my real scores in order to establish a false discovery rate, which I can then use as a threshold for reporting.
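In code the idea looks roughly like this- a sketch assuming a per-feature `score_fn(data, labels)`, which is a stand-in for the actual method (whose details I've left out of these posts anyway):

```python
import numpy as np

rng = np.random.default_rng(0)

def permutation_fdr(data, labels, score_fn, real_scores, n_perm=10):
    """Estimate a false discovery rate for each real score by shuffling
    which observations get called condition 1 vs condition 2."""
    null = np.concatenate(
        [score_fn(data, rng.permutation(labels)) for _ in range(n_perm)]
    )
    thresholds = np.sort(real_scores)[::-1]
    fdr = np.array([
        # expected null discoveries per permutation / observed discoveries
        ((null >= t).sum() / n_perm) / max((real_scores >= t).sum(), 1)
        for t in thresholds
    ])
    return thresholds, fdr

# Report only scores above the loosest threshold where the FDR is acceptable
```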

I’ve also started to put things into a kind of outline for the paper. Here’s what I’ve got so far- I’ve taken the details of what I’m doing out for blogging purposes- but you get the idea:

Introduction

  1. General background about the problem we’re developing our method on
  2. Description of the previous algorithm, what it offers and what is the gap that our approach will fill
  3. Specific details about the data set we’re using
  4. Summary of what our approach is and what results we’ll be presenting in the paper

Results

  1. First apply the previous algorithm to our data (this hasn’t been done). Possibly validate on an external dataset
  2. Show how our algorithm improves results over previous
  3. Add in the extra idea we came up with that will also be a novel twist on the approach
  4. Show what kind of biological information can be derived from these new approaches. This is really open at this point since I’m not sure what I’ll get yet. But I’m preparing for it and thinking about it, so I’m writing it down.
  5. Validation on an external dataset (i.e. a different one from the one I’m using)- maybe. This might be difficult to impossible.