So I just published a paper on predicting multi drug resistance transporters in the journal F1000 Research. This was part of my diabolical* plot (and here) to get grant money (*not really diabolical, but definitely risky, and hopefully clever). So what’s the paper about? Here’s my short explanation, hopefully aimed so that everyone can understand.
TL;DR version (since I wrote more than I thought I was going to)
Antibiotic resistance in bacteria is a rapidly growing health problem- if our existing antibiotics become useless against pathogens we’ve got a big problem. One of the mechanisms of resistance is that bacteria have transporters, proteins that pump out the antibiotics so they can’t kill the bacteria. There are many different kinds of these transporters and finding more of them will help us understand resistance mechanisms. We’ve used a method based on understanding written language to interpret the sequence of proteins (the order of building blocks used to build the protein) and predict a meaning from this- the meaning being the function of antibiotic transporter. We applied this approach to a large set of proteins from bacteria in the environment (a salty lake in Washington state in this case) because it’s known that these poorly understood bacteria have a lot of new proteins that can be transferred to human pathogens and give them superpowers (that is, antibiotic resistance).
(now the long version)
Antibiotic resistance in bacteria
This is a growing world health problem that you’ve probably heard about. Prior to the discovery of antibiotics bacterial infections were a very serious problem that we couldn’t do much about. Antibiotics changed all that, providing a very effective way to treat common and uncommon bacterial infections, and saving countless lives. The problem is that there are a limited number of different kinds of antibiotics that we have (that is, that have been discovered and are clinically effective without drastic side effects) and the prevalence of strains of common bacterial pathogens with resistance to one or more of these antibiotics is growing at an alarming rate. The world will be a very different place if we no longer have effective antibiotics (see this piece for a scary peek into what it’ll be like).
How does this happen? The driving force is Darwinian selection- survival of the fittest. Imagine that the pathogens are a herd of deer and that antibiotics are a wolf pack. The wolf pack easily kills off the slower deer, but leaves the fastest ones to live and reproduce, leading to faster offspring that are harder to kill. Also, the fast deer can pass off their speed to slow deer that are around, making them hard to kill.
Bacterial resistance to antibiotics works in a somewhat similar way. Bacteria can evolve, driven by natural selection, and they reproduce very quickly- but they have an even faster way to accomplish this adaptation than evolving new functions from the ground up. They can exchange genetic material, including the plans for resistance mechanisms (genes that code for resistance proteins) with other bacteria. And they can make these exchanges between bacteria of different species, so a resistant pathogen can pass off resistance to another pathogen, or an innocuous environmental bacteria can pass off a resistance gene to a pathogen making it resistant.
There are three main classes of resistance. First, the bacteria can develop resistance by altering the target of the antibiotic so that it can no longer kill. The ‘target’ in this case is often a protein that the bacteria uses to do some critical thing- and the antibiotic mucks it up so that bacteria die since they can’t accomplish that thing they need to do. Think of this like a disguise- the deer put on a nose and glasses and long coat ala Scooby Doo, and the wolves run right by without noticing. Second, the bacteria can produce an enzyme (a protein that alters small molecules in some way like sugars or drugs) that transforms the antibiotic into an ineffective form. Think of this like the deer using handcuffs to cuff the legs of the wolves together so they can’t run anymore, and thus can’t chase and kill the deer (which are the bacteria if you remember). Third, the bacteria can produce special transporter proteins that pump the antibiotic out of the inside of the cell (the bacterial cell) and away from the vital machinery that the antibiotic is targeting to kill the bacteria. Think of this like the possibility that deer engineers have developed portable wolf catapults. When a wolf gets too close it’s simply catapulted over the trees so it can’t do it’s evil business (in this case, actually good business because the wolves are the antibiotics, remember?)
The problem addressed in the paper
The problem we address in the paper is related to the third mechanism of resistance- the transporter proteins. There are a number of types of these transporters that can transport several or many different kinds of antibiotics at the same time- thus multi drug resistance transporters. Still, it’s likely that there are a lot of these kinds of proteins out there that we don’t recognize as such- in many cases you can’t just look at the sequence (see the section below) of the protein and figure out what it does.
The point of the paper is to develop a method that can detect these kinds of proteins and look for those beyond what we already know about. The long range view is that this will help us understand better how these kinds of proteins work and possibly suggest ways to block them (using novel antibiotics) to make existing antibiotics more effective again.
An interesting thing that has become clear in the last few years is that environmental bacteria have a large number of different resistance mechanisms to existing antibiotics (and probably to antibiotics we don’t even know about yet). And there are a LOT of environmental bacteria in just about every place on earth. Most of these we don’t know anything about. This has been called the “antibiotic resistome” meaning that it’s a vast reservoir of unknown potential for resistance that can be transferred to human pathogens. In the case of the second mechanism of resistance, the enzymes, these likely have evolved since bacteria in these environmental communities are undergoing constant warfare with each other- producing molecules (like antibiotics) that are designed to kill other species. In the case of the third resistance mechanism (the transporters) this could also be true, but these transporters seem to have a lot of other functions too- like ridding the bacteria of harmful molecules that might be in the environment like salts.
Linguistic-based sequence analysis
The paper uses an approach that was developed in linguistics (study of language) to analyze proteins. This works because the building blocks of proteins (see below) can be viewed as a kind of language, where different combinations of blocks in different orders can give rise to different meanings- that is, different functions for the protein.
The sequence of a protein refers to the fact that proteins are made up of long chains of amino acids. Amino acids are just building block molecules, and there are 20 different kinds that are commonly found in proteins. These 20 different kinds make up an alphabet, and the alphabet is used to “spell” the protein. The list of amino acid letters that represents the protein is its sequence. It’s relatively easy to get the sequences of proteins for many bacteria, but the problem of what these sequences actually do is very much an open one. Proteins with similar sequences often times do similar things. But there are some interesting exceptions to this that I can illustrate using actual letters and sentences.
The first is that similar sequences might have different meanings.
1) “When she looked at the pool Jala realized it was low.”
2) “When she looked at the pool Jala realized she was too slow.”
The second is that very different sentences might have similar meanings.
1) “When he looked at the pool Joe realized it was dirty.”
2) “The dirty pool caught Joe’s attention.”
(these probably aren’t the BEST sentences to illustrate this, if you have better suggestions please let me know)
The multi drug transporters have elements of both problems. There are large families of transporter proteins that are pretty similar in terms of protein sequence- but the proteins actually transport different things (like, non-antibiotic molecules, and at this point we can’t just look at the sequences and figure out what they transport for many examples. There are also several families of multi drug transporters that have pretty different sequences between families but all do essentially the same job of transporting several types of drugs.
Linguistics, and especially computational linguistics, has been focused on developing algorithms (computer code) to interpret language into meaning. The approach we use in the paper, called PILGram, does exactly this and has been applied to interpretation of natural (English) language for other projects. We just adapted it somewhat so that it would work on protein sequences. Then we trained the method (since the method learns by example) on a set of proteins where we know the answer- previously identified multi drug transporters. After this was trained and we evaluated how well it could do it’s intended job (that is, taking protein sequences and figuring out if they are multi drug transporters or not) we let it loose on a large set of proteins from bacteria in a very salty lake in northern Washington state called Hot Lake.
What we found
First we found that the linguistic-based method did pretty well on some protein sequence problems where we already knew what the answer was. These PROSITE patterns are from a database where scientists have spent a lot of effort figuring out protein motifs (like figures of speech in language that always mean the same thing) for a whole collection of different protein functions. PILGram was able to do pretty well (though not perfectly) at figuring out what those motifs were- even though we didn’t spend any time on looking through the protein sequences, which is what PROSITE did. So that was good.
We then showed that the method could predict multi drug resistance transporters, a set of proteins where a common motif isn’t known. Again, it does fairly well – not perfect but much better than existing ways of doing this. We evaluated how well it did by pretending we didn’t know the answers for a set of proteins when we actually do know the answer- this is called ‘holding out’ some of the data. The trained method (trained on the set of proteins we didn’t hold out) was then used to predict whether or not the held out proteins were multi drug transporters and we could evaluate how well they did by comparing with the real answers.
Finally, we found that the method identified a number of likely looking candidate multi drug transporters from the Hot Lake community proteins and we listed a few of these candidates.
The next step will be to look at these candidates in the lab and see if they actually are multi drug transporters or not. This step is called “validation”. If they are (or at least one or two are) then that’s good for the method- it says that the method can actually predict something useful. If not then we’ll have to refine the method further and try to do better (though a negative result in a limited validation doesn’t necessarily mean that the method doesn’t work). This step, along with a number of computational improvements to the method, is what I proposed in the grant I just submitted. So if I get the funding I get to do this fun stuff.