I’ve been reviewing machine learning papers lately and have seen a particular problem repeatedly. Essentially it’s a problem of how a machine learning algorithm is trained and evaluated for performance versus how it would actually be applied. I’ve seen this problem in a whole bunch of published papers too, so I thought I’d write a blog rant post about it. I’ve given a quick-and-dirty primer to machine learning approaches at the end of this post for those interested.
The problem is this: methods are often evaluated using an artificial balance of positive versus negative training examples, one that can artificially inflate estimates of performance over what would actually be obtained in a real world application.
I’ve seen lots of studies that use a balanced approach to training. That is, the number of positive examples is matched with the number of negative examples. The problem is that many times the number of negative examples in a ‘real world’ application is much larger than the number of positive examples- sometimes by orders of magnitude. The reason that is often given for choosing to use a balanced training set? That this provides better performance and that training on datasets with a real distribution of examples would not work well since any pattern in the features from the positive examples would be drowned out by the sheer number of negative examples. So essentially- that when we use a real ratio of positive to negative examples in our evaluation our method sucks. Hmmmmm……
This argument is partly true- some machine learning algorithms do perform very poorly with highly unbalanced datasets. Support Vector Machines (SVM) and some other kinds of machine learning approaches, though, seem to do pretty well. Some studies then follow this initial balanced training step with an evaluation on a real world set – that is, one with a ‘naturally’ occurring balance of positive and negative examples. This is a perfectly reasonable approach. However, too many studies don’t do this step, or perform a follow-on ‘validation’ on a dataset with more negative examples, but still nowhere near the number that would be present in a real dataset. And importantly- the ‘bad’ studies report the performance results from the balanced (and thus, artificial) dataset.
The issue here is that evaluation on a dataset with an even number of positive and negative examples can vastly overestimate performance by decreasing the number of false positive predictions that are made. Imagine that we have a training set with 50 positive examples and a matched number of 50 negative examples. The algorithm is trained on these examples and cross-validation (random division of the training set for evaluation purposes- see below) reveals that the algorithm predicts 40 of the positives to be positive (TP) and 48 of the negatives to be negative (TN). So it misclassifies two negative examples as positive (FP), with scores that make them look as good as or better than the TPs. That wouldn’t be too bad- the large majority of positive predictions (40 of 42) would be true positives. Now imagine that the actual ratio of positive to negative examples in a real world application was 1:50, that is, for every positive example there are 50 negative examples. What’s not done in these problem cases is extrapolating the performance of the algorithm to a real world dataset. With the same 50 positives there would be 2,500 negatives, and at the same false positive rate (2 in 50) you’d expect to see 100 false positive predictions- now outnumbering the 40 true positive predictions and making the results a lot less reliable than originally estimated. The example I use here is actually a generous one. I frequently deal with datasets (and review or read papers) where the ratios are 1:100 to 1:10,000, where this can substantially impact results.
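The extrapolation in the example above is simple arithmetic, and a short Python sketch makes it easy to redo for other ratios. The function name and its arguments are hypothetical, made up for illustration, and the sketch assumes that the true positive and false positive rates measured on the balanced set carry over unchanged to the real world set.

```python
def extrapolate_precision(tp, fn, tn, fp, negatives_per_positive):
    """Rescale a confusion matrix to a chosen negative:positive ratio.

    Hypothetical helper for illustration. Assumes the false positive
    rate seen in the evaluation set also holds at the new ratio.
    """
    n_pos = tp + fn                       # positives in the evaluation set
    n_neg = tn + fp                       # negatives in the evaluation set
    real_neg = n_pos * negatives_per_positive
    # Same false positive rate (fp / n_neg), applied to more negatives.
    expected_fp = fp * real_neg / n_neg
    # Precision: fraction of positive predictions that are true positives.
    precision = tp / (tp + expected_fp)
    return expected_fp, precision


# Numbers from the example: 40 TP, 10 FN, 48 TN, 2 FP.
balanced_fp, balanced_precision = extrapolate_precision(40, 10, 48, 2, 1)
real_fp, real_precision = extrapolate_precision(40, 10, 48, 2, 50)

print(f"balanced 1:1  -> {balanced_fp:.0f} FP, precision {balanced_precision:.2f}")
print(f"natural  1:50 -> {real_fp:.0f} FP, precision {real_precision:.2f}")
```

At 1:1 the precision is about 0.95, while at 1:50 it drops to about 0.29, even though nothing about the classifier itself has changed.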
So the evaluation of a machine learning method should involve a step where a naturally occurring ratio of positive and negative examples is represented. Though this natural ratio may not be clearly evident for some applications, a reasonable estimate should be made. The performance of the method should be reported based on THIS evaluation, not the evaluation on the balanced set- since that is likely to be inflated, anywhere from a little to a lot.
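One way to set up such an evaluation step is to sample held-out negatives until the estimated natural ratio is reached. This is only a sketch under stated assumptions: the function name, the protein identifiers and the pool sizes are all hypothetical, and it assumes none of the held-out examples were used for training.

```python
import random


def build_natural_ratio_test_set(positives, negatives,
                                 negatives_per_positive, seed=0):
    """Combine held-out positives with sampled negatives at a given ratio.

    Hypothetical helper for illustration. Returns (example, label)
    pairs with label 1 for positives and 0 for negatives.
    """
    needed = len(positives) * negatives_per_positive
    if needed > len(negatives):
        raise ValueError("not enough negative examples for this ratio")
    rng = random.Random(seed)             # fixed seed for repeatability
    sampled = rng.sample(negatives, needed)
    examples = [(x, 1) for x in positives] + [(x, 0) for x in sampled]
    rng.shuffle(examples)
    return examples


# Made-up identifiers: 10 held-out positives and a pool of 2,000 negatives.
positives = [f"effector_{i}" for i in range(10)]
negatives = [f"protein_{i}" for i in range(2000)]

# A 1:120 ratio gives 10 positives and 1,200 negatives.
test_set = build_natural_ratio_test_set(positives, negatives, 120)
print(len(test_set), sum(label for _, label in test_set))
```

Performance numbers computed on a set like this are the ones to report, with the ratio used stated alongside them.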
For those that are interested in real examples of this problem I’ve got two example studies from one of my own areas of research- type III effector prediction in bacteria. In Gram negative bacteria with type III secretion systems there are an unknown number of secreted effectors (proteins that are injected into host cells to effect virulence) but we estimate on the order of 50-100 for a genome like Salmonella Typhimurium, which has 4500 proteins total, so the ratio should be around 1:40 to 1:150 for most bacteria like this. In my own study on type III effector prediction I used a 1:120 ratio for evaluation for exactly this reason. A subsequent paper in this area was published that chose to use a 1:2 ratio because “the number of non-T3S proteins was much larger than the number of positive proteins,…, to overcome the imbalance between positive and negative datasets.” If you’ve been paying attention, THAT is not a good reason and I didn’t review that paper (though I’m not saying that their conclusions are incorrect since I haven’t closely evaluated their study).
- Samudrala R, Heffron F and McDermott JE. 2009. Accurate prediction of secreted substrates and identification of a conserved putative secretion signal for type III secretion systems. PLoS Pathogens 5(4):e1000375.
- Wang Y, Zhang Q, Sun MA and Guo D. 2011. High-accuracy prediction of bacterial type III secreted effectors based on position-specific amino acid composition profiles. Bioinformatics 27(6):777-784.
So the trick here is to not fool yourself, and in turn fool others. Make sure you’re being your own worst critic. Otherwise someone else will take up that job instead.
Quick and Dirty Primer on Machine Learning
Machine learning is an approach to pattern recognition that learns patterns from data. Often the pattern that is learned is a particular pattern of features, properties of the examples, that can distinguish one group of examples from another. A simple example would be to try to identify all the basketball players at an awards ceremony for football, basketball, and baseball players. You would start out by selecting some features, that is, player attributes, that you think might separate the groups out. You might select hair color, length of shorts or pants in the uniform, height, and handedness of the player as potential features. Obviously all these features would not be equally powerful at identifying basketball players, but a good algorithm will be able to make best use of the features.
A machine learning algorithm could then look at all the examples: the positive examples, basketball players; and the negative examples, everyone else. The algorithm would consider the values of the features in each group and ideally find the best way to separate the two groups. Generally to evaluate the algorithm all the examples are separated into a training set, to learn the pattern, and a testing set, to test how well the pattern works on an independent set. Cross-validation, a common method of evaluation, does this repeatedly, each time separating the larger group into training and testing sets by randomly selecting positive and negative examples to put into each set.
Evaluation is very important since the performance of the method will provide end users with an idea of how well the method will work for their real world application, where they don’t know the answers already. Performance measures vary but for classification they generally involve comparing predictions made by the algorithm with the known ‘labels’ of the examples- that is, whether the player is a basketball player or not.
There are four categories of prediction: true positives (TP), the algorithm predicts a basketball player where there is a real basketball player; true negatives (TN), the algorithm predicts not a basketball player when the example is not a basketball player; false positives (FP), the algorithm predicts a basketball player when the example is not; and false negatives (FN), the algorithm predicts not a basketball player when the example actually is.
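The four categories can be counted directly from a list of known labels and a list of predictions. The sketch below uses made-up labels and predictions (1 for basketball player, 0 for everyone else) and hypothetical function names, and it assumes every denominator is non-zero.

```python
def confusion_counts(labels, predictions):
    """Count TP, TN, FP and FN from paired labels and predictions."""
    counts = {"TP": 0, "TN": 0, "FP": 0, "FN": 0}
    for actual, predicted in zip(labels, predictions):
        if predicted == 1 and actual == 1:
            counts["TP"] += 1    # predicted player, really a player
        elif predicted == 0 and actual == 0:
            counts["TN"] += 1    # predicted non-player, really a non-player
        elif predicted == 1 and actual == 0:
            counts["FP"] += 1    # predicted player, really a non-player
        else:
            counts["FN"] += 1    # predicted non-player, really a player
    return counts


def summary_measures(c):
    """Common performance measures built from the four counts."""
    return {
        "sensitivity": c["TP"] / (c["TP"] + c["FN"]),  # positives found
        "specificity": c["TN"] / (c["TN"] + c["FP"]),  # negatives rejected
        "precision": c["TP"] / (c["TP"] + c["FP"]),    # predictions correct
    }


# Made-up data: 3 basketball players and 5 other athletes.
labels      = [1, 1, 1, 0, 0, 0, 0, 0]
predictions = [1, 1, 0, 1, 0, 0, 0, 0]

counts = confusion_counts(labels, predictions)
measures = summary_measures(counts)
print(counts)
print(measures)
```

Sensitivity and specificity are each computed within a single class, so they do not change with the ratio of positives to negatives. Precision mixes TP and FP, which is why it is the measure that suffers when the number of negatives grows.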
Features (height and pant length) of examples (basketball players and non-basketball players) plotted against each other. Trying to classify based on either of the individual features won’t work well but a machine learning algorithm can provide a good separation. I’m showing something that an SVM might do here- but the basic idea is the same with other ML algorithms.