Using evolutionary conservation to help structure prediction

Message boards : Rosetta@home Science : Using evolutionary conservation to help structure prediction

Sarel
Message 30155 - Posted: 28 Oct 2006, 0:57:58 UTC

Hello,

My name is Sarel Fleishman, and I'm a new postdoc in the Baker group. My project deals with predicting the structure of large protein complexes. Chu has recently described the motivations and the approach for predicting structures of complexes, which you can follow at:
https://boinc.bakerlab.org/rosetta/forum_thread.php?id=2395

By comparing the sequences of proteins that have been identified in various organisms, one can find amino-acid residues that are evolutionarily more conserved than others. As a simple example, consider two proteins from human and from yeast that carry out the same biological process. If we find that a given amino-acid position is the same in the yeast and human sequences, we would consider it to be evolutionarily conserved. The various genome projects are generating very large numbers of such homologous sequences across different species, which can then be used to derive statistically robust estimates of the evolutionary conservation of amino-acid positions.
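
To make this concrete, here is a toy sketch (written just for this post; it is not the code that actually runs on your machines) of one common way to turn a set of aligned homologous sequences into a per-position conservation score, using the Shannon entropy of each alignment column. Columns where all the homologs carry the same amino acid come out as the most conserved.

# Toy illustration only: per-position conservation from an alignment.
from collections import Counter
from math import log2

def column_conservation(aligned_seqs):
    """Return a conservation score in [0, 1] for each alignment column."""
    scores = []
    for i in range(len(aligned_seqs[0])):
        column = [seq[i] for seq in aligned_seqs if seq[i] != '-']
        if not column:  # column contains only gaps
            scores.append(0.0)
            continue
        counts = Counter(column)
        total = sum(counts.values())
        entropy = -sum((n / total) * log2(n / total) for n in counts.values())
        # 20 amino acids -> the maximum possible column entropy is log2(20)
        scores.append(1.0 - entropy / log2(20))
    return scores

# Three made-up homologous sequences; the first column (M in all three)
# scores 1.0, while more variable columns score lower.
homologs = ["MKTAYIA", "MRTGYIA", "MKTCYLA"]
print(column_conservation(homologs))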

This type of evolutionary conservation has been shown to correlate with whether an amino-acid site is buried in the protein core or exposed to water. The reason buried amino-acid positions tend to be evolutionarily more conserved is that changing the identity of a position in the protein core is likely to disrupt the protein's stability and render it dysfunctional, whereas changing a position that is exposed to water is unlikely to harm the protein's function. Evolutionary conservation is therefore potentially useful for predicting the structures of individual proteins or protein complexes.

With the help of the BOINC users, I'm testing the hypothesis that amino-acid conservation could help Rosetta pick out conformations that are near the correct structure. The idea is to prefer conformations that place evolutionarily conserved amino-acid residues in the core of the protein. The huge computing power of BOINC is essential for testing this hypothesis: BOINC produces a very large number of conformations, allowing us to pick up even slight improvements in the predicted structures. Hopefully, following these improvements, it will be possible to identify the optimal way of incorporating evolutionary conservation into Rosetta.
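
For the curious, here is a toy sketch of what a score term along these lines could look like. It is an illustration I'm writing for this post, not the actual term used in these runs; the details of the real term (how burial is measured, the weighting) may well differ.

# Toy illustration only: reward conformations that bury conserved residues.
def conservation_environment_score(conservation, burial, weight=1.0):
    """
    conservation: per-residue conservation scores in [0, 1]
    burial:       per-residue burial fractions in [0, 1]
                  (1.0 = fully buried, 0.0 = fully exposed to solvent)
    Lower is better, following the convention that Rosetta minimizes its score.
    """
    penalty = 0.0
    for cons, bur in zip(conservation, burial):
        # A conserved residue (cons ~ 1) that ends up exposed (bur ~ 0)
        # contributes a large penalty; a conserved, buried residue
        # contributes almost nothing.
        penalty += cons * (1.0 - bur)
    return weight * penalty

# Two hypothetical conformations of the same three-residue stretch:
conservation = [0.9, 0.2, 0.8]
buried = [0.9, 0.1, 0.8]    # conserved residues buried -> low (good) score
exposed = [0.1, 0.9, 0.2]   # conserved residues exposed -> high (bad) score
print(conservation_environment_score(conservation, buried))
print(conservation_environment_score(conservation, exposed))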

If you'd like to follow these runs, their names typically carry the suffix _ENVFILE.
Michael G.R.
Message 30162 - Posted: 28 Oct 2006, 3:29:38 UTC

Hi Sarel,

A good part of this went over my head, but what I understood was very interesting. Thanks for sharing the science with us. Glad to be helping the experiment with my idle CPU cycles.
River~~
Message 30169 - Posted: 28 Oct 2006, 8:26:40 UTC - in response to Message 30155.  
Last modified: 28 Oct 2006, 8:36:27 UTC

Hi Sarel,

Yes, first let me join in the thanks for sharing the science.

May I also suggest you put the first paragraph of your post into a profile, and include in the profile a link back to this thread. Then when people see your name on these boards in future, they will be able to find out what you want us to know about you.

That is not my main reason for responding. You say

... I'm testing the hypothesis that amino-acid conservation could help Rosetta pick out conformations that are near the correct structure. The idea is to prefer conformations that place evolutionarily conserved amino acid residues in the core of the protein. The huge computing power of BOINC is essential for testing this hypothesis. ...



This reminds me of a worry I have, not just about your approach but about other approaches too.

Rosetta cannot check everything, so it goes for the "best bet" at various stages in the process in order to cut the search time. This seems to me a bit like the way a human chess master will not spend time looking at moves that lose the Queen in exchange for a pawn, or will prioritise moves that give more influence over the centre of the board even without yet having a clear idea of how that influence will be used.

This usually works, but it has a flaw. The grandmaster will spot that the Queen can be sacrificed to win the game a few moves later, where the mere master would have missed this outcome. This is how a "gambit" works - the eventual loser overlooks the downside of an apparently advantageous move.

Evolutionary conservation is another such heuristic. So I am being slightly unfair by putting this point to you alone: it applies equally to all the other heuristics being applied by your colleagues at Bakerlab. So I hope your colleagues will feel free to answer as well as / instead of yourself.

My worry is that these heuristics get built into future programs, and that most of the time they work fine, so people get to trust them. Then along comes an important divergence from the heuristic (in your case, an important divergence from evolutionary conservation that might be very exciting science) and the program will miss it.

It is worse than a Monte Carlo approach, which may merely be unlucky enough not to stumble across the right answer. What I am saying is that if someone uses a program based on this heuristic at a later date, and is applying the answers to matters of evolutionary conservation, then the program has been set up to ignore the very thing they are looking for.

There is an all too common pattern in human thinking that we see what we expect to see. Scientists work hard to exclude such patterns from their thinking, often unsuccessfully (e.g. Einstein worked for several years to take out of his theory the prediction that the universe is expanding -- he never for one moment thought it could be possible. He later described this as his greatest mistake.)

We are building this same fallibility into our thinking machines, and the danger for the future is that we may forget the blind spots we built in. If it is hard to see my own blind spots, it is even harder to look for them in a machine that I trust.

Aside: Of course, there is another side to this coin. Why do we see what we expect to see? Because often enough (in our evolutionary past) it has worked out OK and saved time. In a predator/prey evolutionary race, the edge given by faster thinking may be more of a benefit than the disadvantage the few times our ancestors got it wrong. We, just like Rosetta, take short cuts in our thinking not because they are always right, but because that gives the best odds.

But best odds are not certainty. At present lay-people expect a certainty from a computer result that they do not expect from a human expert. So it does worry me that we are building fallibility into programs that will be treated by many as "giving the objective answer".

River~~
Sarel
Message 30267 - Posted: 30 Oct 2006, 6:11:42 UTC

Thanks for your interest in this post! As I'm still new to the message boards, I hope that what I wrote has been clear; I'd be delighted to elaborate if necessary. I've also followed River's suggestion and added something to my profile, though it's still under construction.

River, your remarks are absolutely correct, and they touch on some of our biggest concerns in working out the details of how to incorporate this new score term into Rosetta. Briefly (and I hope I'm not misrepresenting), River's point is that adding a term based on what has been 'observed' to be correct for true protein structures could be dangerous, because it would bias future predictions towards our current prejudices. Significantly, the cases that counter our prejudices often turn out to be the most fascinating.

There is no clear-cut way to eliminate the problem River mentioned, but there are ways to minimize it. One is to test any modification to the score on as many disparate cases as possible (i.e., proteins with different sequences and local structures), and this is indeed what we are doing, with the crucial help of the BOINC platform. If many very different proteins behave well under the new scheme, then there is a good chance that it is generally applicable. More realistically, I expect we will find that some cases work well with this scheme whereas others do not, and hopefully we will be able to improve it, or at least to outline when it is likely to fail.

Another element that can help in minimizing such bias is the way in which the score is designed. If we used evolutionary conservation to eliminate decoys that do not meet a predefined criterion, we would probably have a big problem of missing structures that might excel in other respects, such as structural stability. Our approach is therefore to use conservation in much the same way as the other terms in the score function: as a contribution to the overall score, not as a filter. This way, if a given decoy structure appears incorrect from the standpoint of evolutionary conservation, but nevertheless seems to have all other structural features in place, it is still very likely to go on to the next stages of testing in Rosetta.
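
A toy sketch of this design choice, with made-up term names and weights, may help make the contrast with a hard filter clear:

# Illustration only: conservation as one weighted term, not a filter.
def total_score(decoy_terms, weights):
    """decoy_terms and weights are dicts keyed by score-term name."""
    return sum(weights[name] * value for name, value in decoy_terms.items())

weights = {"vdw": 1.0, "solvation": 0.65, "conservation_env": 0.4}

# A decoy that looks poor by conservation alone can still rank well
# overall if its other structural terms are favorable.
decoy = {"vdw": -12.3, "solvation": -4.1, "conservation_env": 3.0}
print(total_score(decoy, weights))

# The alternative we want to avoid: a hard filter that discards the decoy
# outright whenever the conservation term exceeds some cutoff.
def passes_hard_filter(decoy_terms, cutoff=2.0):
    return decoy_terms["conservation_env"] <= cutoff

print(passes_hard_filter(decoy))  # False -> this decoy would have been lost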

Ultimately, however, I think that these scoring schemes are an aid to human reasoning, but, due to these problems, should not replace human examination at the end of the prediction process. If a prediction appears to have been overly biased by any single criterion, this prediction should be revisited.
adrianxw
Message 30279 - Posted: 30 Oct 2006, 9:48:40 UTC

This reminds me of an exchange I had with David Baker last year, where I was suggesting that non-biologists look at the data. My point was along the same lines as the one made, somewhat more eruditely, by River. It started in this thread but wandered about for weeks across several topics.

Without the "baggage" of foreknowledge, we may be able to see the wood for the trees.

Welcome to the team Sarel.
Feet1st
Message 30293 - Posted: 30 Oct 2006, 15:42:40 UTC
Last modified: 30 Oct 2006, 15:45:33 UTC

I'm not a biophysics expert by any means, so bear with me, but this seems as good a place to ask as any.

I note that it seems to be comparatively "easy" to determine the amino acid sequence of a protein with existing technology. Is there any way to create a set of 10 (or 100) short proteins that fairly readily bind, rather indiscriminately, to other proteins? And then determine whether that binding has occurred?

My thought is this: if you see that 3 of your test set bind to the protein whose structure you are attempting to predict... then that should help you bias your predictions towards conformations that would cause those 3 to bind, and the other 7 not to. It is rather similar to the hydrophobic/hydrophilic attribute, isn't it? Essentially helping you determine which AAs will be on the "outside" of the fold, and which hidden within.

Basically, I'm wondering if it might be fairly easy to find or design short strands, and use them to do further physical study in a lab prior to virtual study on a computer.

It would be like determining some docking experimentally, and then using that information to help find the protein structure.
Sarel
Message 30354 - Posted: 31 Oct 2006, 5:43:31 UTC

There are some methods in molecular biology that allow one to do something very similar to what you're suggesting, and these methods are actually very useful. The first such method that comes to mind is the phage-display library, where short segments of protein are amplified if they bind to a target protein. You can read more about this and related methods on the following page: http://www.answers.com/topic/phage-display

You are right that data from such libraries could potentially be used to help structure prediction, but this is still quite far from today's computational capabilities. Such short fragments are likely to have very little intrinsic structure, so the problem of docking them onto a given protein is in effect very close to the problem of ab-initio structure prediction, with all of its degrees of freedom. So, in order to bias the results of an ab-initio structure prediction using such fragments, you would have to simultaneously fold your target protein and the fragments, and then try to dock them together.
