Message boards : Rosetta@home Science : Well, you said you wanted feedback ...
dgnuff Send message Joined: 1 Nov 05 Posts: 350 Credit: 24,773,605 RAC: 0 |
> My starting point is that I see a coverage pattern in Top Predictions that is always a cluster. I watch models where the same places are adjusted over and over again with other places ignored. Both, to my mind, are evidence of non-random behavior ...

I can't speak for the second of these, but the "cluster" in the predictions graphs presents absolutely zero hard evidence of non-random behavior. I know that this flies directly in the face of what you see on that graph, but once you understand what that graph shows, you'll see why.

The best way to describe the problem is like this. Suppose you had a bag containing 1,000,000 marbles, where 10 of them are marked with the number 1, 499,995 with the number 2, and the remaining 499,995 with the number 3. If you now pull marbles at random from the bag and plot the distribution (count of marbles vs. number on the marble), you'll get a graph that is very skewed: it will have a "cluster" at 2 and 3 and almost nothing at 1. This skewed distribution is not the result of a faulty random number generator; it's the result of a skewed distribution of numbers in the bag in the first place. Doing this, the only way you'd ever see a uniform output distribution is with about 333,333 marbles in the bag for each of the numbers 1, 2, and 3.

We're up against the same problem. We have a large space to work in (David Baker's 500-dimensional space) that contains an astronomical number of points (the marbles in the bag). Each of these has an energy and an RMSD associated with it, and we're simply plotting a graph of RMSD vs. energy for each point that we've managed to find. For your argument to hold up, you'd have to prove that the distribution of energy and RMSD across this 500D space is uniform, and all the evidence suggests it's not.
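dgnuff's marble analogy is easy to check empirically. The sketch below (illustrative only; the bag composition is taken from the post) draws from the skewed bag using a perfectly good uniform RNG and still produces the "cluster":

```python
import random
from collections import Counter

def draw_marbles(n_draws=100_000, seed=42):
    # The bag holds 1,000,000 marbles with a deliberately skewed
    # composition: 10 marked "1", 499,995 marked "2", 499,995 marked "3".
    rng = random.Random(seed)
    # Draw with replacement using the bag's weights; the RNG itself is
    # uniform -- only the bag's contents are skewed.
    values = rng.choices([1, 2, 3], weights=[10, 499_995, 499_995], k=n_draws)
    return Counter(values)

counts = draw_marbles()
# Marbles "2" and "3" dominate and "1" is vanishingly rare: a "cluster"
# produced by the underlying distribution, not by a faulty RNG.
```

The same good generator yields a wildly skewed histogram, which is exactly the point: the shape of the output reflects the shape of the input space.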
Ace Paradis Send message Joined: 4 Oct 05 Posts: 51 Credit: 96,906 RAC: 0 |
Wow, I have absolutely not even the slightest idea of what you're saying, but it does sound pretty interesting.
dgnuff Send message Joined: 1 Nov 05 Posts: 350 Credit: 24,773,605 RAC: 0 |
> Wow, I have absolutely not even the slightest idea of what you're saying, but it does sound pretty interesting.

I know. It does sound kind of goofy to talk about 500-dimensional spaces. I will now take exactly one try at explaining what this 500-dimensional space is all about. It's a measure of how many values in the problem can change independently.

Let's say you have a box. It has a width, a height, and a length. You can change all three of these measurements, and by doing so you change its volume. Since there are three independent things that can change, you're working in a three-dimensional space.

Now let's take a sheet of 20 lb bond paper. We can change only its height and width (go with me on this, people; I know it's really 3D), so there are two values that can change. As they do, you change the area of the sheet of paper. That's working in a two-dimensional space: two things that can change independently.

What we run into with Rosetta is twisting the bonds between atoms in a protein molecule. As David Baker has pointed out, there can be 500 individual bonds (or more) in a protein that can twist, and they can all twist independently of one another. Since we want to be able to manipulate all of these twist angles independently, we wind up in that "500-dimensional space". As a general rule, the number of dimensions in the space is a measure of the number of variables in the problem that can change independently.
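The "independently changing values" idea maps directly onto code: a conformation is just a point whose coordinates are the twist angles. A toy sketch (the angle count and the flat-list representation are illustrative, not Rosetta's actual data structures):

```python
import random

# Illustrative sketch only: model a "conformation" as a point in an
# N-dimensional space, one coordinate per independently twistable bond.
N_BONDS = 500  # David Baker's example: ~500 torsion angles

rng = random.Random(0)
# A single conformation is just a list of N_BONDS angles (in degrees).
conformation = [rng.uniform(-180.0, 180.0) for _ in range(N_BONDS)]

# Changing any one angle moves the point along one axis of the
# 500-dimensional space without touching the other 499 coordinates.
conformation[42] += 5.0
```

Each list index is one axis of the space, which is all "500-dimensional" really means here.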
Paul D. Buck Send message Joined: 17 Sep 05 Posts: 815 Credit: 1,812,737 RAC: 0 |
dgnuff, I did follow your argument ... :)

My counter is that reality was to be found in the corner, and we did not even get close. There were/are large areas within the result space that had no tick marks. I follow that the probabilities would result in a clustering, and I have no problem with that. However, there were traces in some areas but, to *MY* simple mind, way too much white space with no activity.

Two candidate explanations: non-random behavior, or a fatally flawed search algorithm. Which is easier to test and eliminate as a cause?

Drat, I had promised not to mention this again ...
halfmeg Send message Joined: 14 Dec 05 Posts: 7 Credit: 2,496 RAC: 0 |
Jack Schonbrun stated on Nov 1 05:

> In fact, I hope we can move on from the discussion of random number generation, and talk about what perhaps is the real issue: whether the non-randomness that we intentionally introduce to bias our search is doing what we want.

No matter what algorithm is utilized (perfect or impaired), unless there has been an analysis of the output of your specific coding (embedded in Rosetta) to determine randomness, you cannot assume it works as expected. If the non-randomness introduced is something of the sort:

RANDOM: a high-energy random point is flexed.
SUBSTITUTE: a pre-constructed segment array, ordered lowest-energy first, is joined at the random point.
REPEAT: SUBSTITUTE for x segments; the lowest-energy results of the previous step are kept.
DO: RANDOM until no high-energy segments remain.

then this sort of thing will not look like a random sampling but should favor the lower-energy portion of the graph. The completeness of the universe of low-energy segments available for substitution could keep the 'final' threshold levels from being plotted. This might account for a "walled off" target level.

Phil - then again, I might have misunderstood the method of directing the search
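halfmeg's RANDOM/SUBSTITUTE loop can be sketched with a toy energy function (everything here, the energy function and the fragment set, is invented for illustration and is not the actual Rosetta method):

```python
import random

def toy_energy(angles):
    # Toy stand-in for a real scoring function: energy is just the
    # sum of squared angles, so 0.0 is the lowest-energy value.
    return sum(a * a for a in angles)

def biased_search(angles, fragments, steps=200):
    angles = list(angles)
    for _ in range(steps):
        # RANDOM: pick the position contributing the most energy.
        i = max(range(len(angles)), key=lambda j: angles[j] * angles[j])
        # SUBSTITUTE / REPEAT: try each candidate fragment at position i,
        # keeping whichever yields the lowest total energy.
        best = min(fragments,
                   key=lambda f: toy_energy(angles[:i] + [f] + angles[i + 1:]))
        angles[i] = best
    return angles

start = [random.Random(7).uniform(-180, 180) for _ in range(20)]
fragments = [-10.0, -1.0, 0.0, 1.0, 10.0]  # pre-built low-energy candidates
result = biased_search(start, fragments)
# The loop keeps attacking the highest-energy position, so samples pile
# up at low energy rather than spreading uniformly over the space.
```

As halfmeg says, a search like this is non-random by design: the skew toward low energy is the intended bias, not an RNG defect.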
Jack Schonbrun Send message Joined: 1 Nov 05 Posts: 115 Credit: 5,954 RAC: 0 |
I guess I should clarify what I mean by "intentional non-randomness." What I was referring to is the fact that we do not want to sample conformational space evenly.

halfmeg, you are right that one way we avoid searching randomly through space is by rejecting changes to the configuration that raise the energy too much.

The other way in which the search is not random is that we do not try to move all parts of the protein chain uniformly. Largely based on evolutionary information, we bias each segment of the chain toward certain possible configurations. This is one reason why some parts of the chain appear to move much less than others: they are more tightly constrained by our prior information. So the appearance of non-randomness is actually what we expect from our algorithm.
proxima Send message Joined: 9 Dec 05 Posts: 44 Credit: 4,148,186 RAC: 0 |
Sorry to join this discussion half-way through; I only started crunching here yesterday, having left Find-a-Drug on Friday.

The talk of random-number generators is interesting, as a project of my own very nearly failed due to a bad random number generator (although it wasn't RAN3 or similar; it was the 'noddy' RNG built into the C compiler I was using, and that was my mistake). My application was a genetic algorithm, which makes very heavy use of the RNG in precise patterns (e.g. pulling off 1000 random numbers, then 3 more, then repeating: 1000, 3, and so on). I was getting very strange results, which indicated only some numbers were ever getting chosen in some parts of the sequence. Sometimes the "3 more" numbers were absolutely ALWAYS odd, no matter how long I let it run; the next time they would always be even, or always divisible by 5. Weird things like that. Because of how the GA worked, this broke things quite badly.

I spent ages looking for coding errors, but in the end someone suggested I change my RNG. I downloaded an implementation of the Mersenne Twister, which is fast, has an extremely long period, high 'dimensionality', and other desirable properties that I only half-understand (not being any kind of maths expert). Immediately, all such problems were solved, and I never saw that kind of skewing again. I don't know how the Mersenne Twister compares to RAN3, though.

Alver Valley Software Ltd - Contributing ALL our spare computing power to BOINC, 24x365.
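For what it's worth, Python's standard `random` module is itself an MT19937 (Mersenne Twister) implementation, so the parity artifact proxima describes is easy to test against it. This is a quick sanity check in the spirit of the post, not a rigorous RNG test suite:

```python
import random

# Reproduce the access pattern from the post: draw 1000 numbers, then
# 3 more, and repeat, then check whether the "3 more" values get stuck
# on one parity the way the broken compiler RNG did.
rng = random.Random(2024)
extras = []
for _ in range(100):              # 100 rounds of the 1000-then-3 pattern
    for _ in range(1000):
        rng.randrange(1_000_000)  # the bulk draws, discarded
    extras.extend(rng.randrange(1_000_000) for _ in range(3))

odd = sum(x % 2 for x in extras)
# With 300 draws we expect roughly half odd and half even -- nothing
# like the all-odd or all-even runs a weak generator produced.
```

A generator that passed this check could still be flawed in subtler ways, which is why batteries of statistical tests exist; but an all-odd or all-even streak like the one described would fail it immediately.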
©2024 University of Washington
https://www.bakerlab.org