Problems and Technical Issues with Rosetta@home

Author	Message
mikey Send message Joined: 5 Jan 06 Posts: 1898 Credit: 12,860,414 RAC: 3,555	Message 98187 - Posted: 17 Jul 2020, 23:02:09 UTC - in response to Message 98157. Any discussion of how much faster HD helps? I see an great difference in rac between my "old" sata and new SSD (with the same memory and cpu). It really depends on the Project and how much it rights to the disk and how often, the more it rights and the more often it does it the SSD drives will benefit. A Project like Rosetta should benefit more than a project with 30 minute tasks for example. I went to all ssd drives awhile back because I refuse to pay for a/v stuff on Boinc only machines and the free ones want me to reboot every couple of weeks or so, the bootup time is sooo much better than I have better rac because of it. ID: 98187 · Rating: 0 · rate: / Reply Quote

Mr P Hucker Send message Joined: 12 Aug 06 Posts: 1602 Credit: 13,010,866 RAC: 677	Message 98189 - Posted: 17 Jul 2020, 23:09:52 UTC - in response to Message 98187. Last modified: 17 Jul 2020, 23:10:22 UTC Any discussion of how much faster HD helps? I see an great difference in rac between my "old" sata and new SSD (with the same memory and cpu). It really depends on the Project and how much it rights to the disk and how often, the more it rights and the more often it does it the SSD drives will benefit. A Project like Rosetta should benefit more than a project with 30 minute tasks for example. I went to all ssd drives awhile back because I refuse to pay for a/v stuff on Boinc only machines and the free ones want me to reboot every couple of weeks or so, the bootup time is sooo much better than I have better rac because of it. Some of my Boinc machines have rotary drives, because I had them kicking about anyway. Some have an SSD because the rust spinners were annoyingly slow to reboot. The ones with normal drives seem to run just as fast once they're computing, but take ages to start some tasks, and if it's got 24 Theory tasks to start from LHC to start at once after a reboot, it can take 10 minutes. ID: 98189 · Rating: 0 · rate: / Reply Quote

Grant (SSSF) Send message Joined: 28 Mar 20 Posts: 1920 Credit: 18,534,891 RAC: 0	Message 98191 - Posted: 17 Jul 2020, 23:25:58 UTC - in response to Message 98156. Last modified: 17 Jul 2020, 23:30:44 UTC They're also taking up a massive chunk of CPU, despite me having set BOINC to use 30% CPU time. That's your problem. Limiting the amount of time doesn't limit the number of cores/threads in use. In fact what it does mean is that it will take you more than 3 times as long to process a Task than your Target CPU time. ie- 8 hour Tasks will take over 24 hours. CPU time should always be left at 100%. If you feel it's necessary to limit the number of cores/threads in use (since i paid for them i choose to use them all), set "Use at most 100 % of the CPUs" to less than 100% Edit- you also need to reduce your cache size as you are missing deadlines. If you have more than 1 project, there is no need for a cache at all. Even with just one project, a cache really isn't necessary unless that project has lots of extended down time. Grant Darwin NT ID: 98191 · Rating: 0 · rate: / Reply Quote

Grant (SSSF) Send message Joined: 28 Mar 20 Posts: 1920 Credit: 18,534,891 RAC: 0	Message 98192 - Posted: 17 Jul 2020, 23:27:47 UTC - in response to Message 98157. Any discussion of how much faster memory helps? About all I've been able to find so far us that at least for Rosetta@home, it does help. Any discussion of how much faster HD helps? I see an great difference in rac between my "old" sata and new SSD (with the same memory and cpu). Don't know why as storage performance has no impact at all on processing performance- unless your system is short of RAM and has been spending all of it's time previously swapping to the page file. Grant Darwin NT ID: 98192 · Rating: 0 · rate: / Reply Quote

Keith Myers Send message Joined: 29 Mar 20 Posts: 97 Credit: 343,893 RAC: 0	Message 98193 - Posted: 17 Jul 2020, 23:28:59 UTC Last modified: 17 Jul 2020, 23:30:50 UTC I'm getting a bit frustrated with the current running task. I have my settings to run for 4 hours. The current running task is going on 10+ hours now and is barely crawling up from 97% complete, ten minutes to go, 1/1000 of a percent at a time. The thing that irks me no end is that the task wrote exactly ONE checkpoint, 30 minutes or so after it was initially started and nothing since. I want to do some maintenance on the PC yet can't unless I want to throw away 10 hours of crunching on the task because it will restart from basically zero. Why the heck did the developers configure the task to only write one checkpoint? Arggh! I understand the 10+ hour watchdog will eventually kick in supposedly at the 14 hour mark of runtime. Question is will I have enough patience to wait it out. ID: 98193 · Rating: 0 · rate: / Reply Quote

Grant (SSSF) Send message Joined: 28 Mar 20 Posts: 1920 Credit: 18,534,891 RAC: 0	Message 98194 - Posted: 17 Jul 2020, 23:32:49 UTC - in response to Message 98193. I'm getting a bit frustrated with the current running task. I have my settings to run for 4 hours. The current running task is going on 10+ hours now and is barely crawling up from 97% complete, ten minutes to go, 1/1000 of a percent at a time. There have been a few Tasks that run longer than the set time till the watchdog timer kills them off. There were a lot when i first started here, for a while now there haven't been any. Just recently, there's been a small batch of those longer running Tasks again. Grant Darwin NT ID: 98194 · Rating: 0 · rate: / Reply Quote

Grant (SSSF) Send message Joined: 28 Mar 20 Posts: 1920 Credit: 18,534,891 RAC: 0	Message 98195 - Posted: 17 Jul 2020, 23:40:28 UTC - in response to Message 98172. Last modified: 17 Jul 2020, 23:42:11 UTC Typically, you get about 30% greater output using 100% of the cores than when using only 50%, even though each work unit runs faster in the latter case. Yep. When i started here i didn't have enough RAM to support all the cores/threads on both of my systems. On one system i turned of hyper-threading, the other i left it enabled. The system with hyperthreading produced way more work than the one with it turned off- more than 50%. It does depend a lot on the software being run- in some instances hyperthreading can reduce the amount of work done. In some cases the increase is just in line with the increase in cores/threads. In most cases the improvement tends to be from 30-60% In some case the output can be almost double. It really does depend a lot on the work being done. Grant Darwin NT ID: 98195 · Rating: 0 · rate: / Reply Quote

Brian Nixon Send message Joined: 12 Apr 20 Posts: 293 Credit: 8,432,366 RAC: 0	Message 98199 - Posted: 18 Jul 2020, 10:47:35 UTC - in response to Message 98193. Keith Myers wrote: I'm getting a bit frustrated with the current running task This one? Does indeed look like the watchdog stepped in: BOINC:: CPU time: 50522.6s, 36000s + 14400s[2020- 7-17 19: 6:41:] :: BOINC (Output like that does not normally appear in task results.) Tasks do occasionally ‘go rogue’, and checkpointing is known to be difficult and inconsistent in Rosetta. I don’t think there’s anything we can do about it other than leave them alone, switch off when we need to and hope not too much work is lost, or abort overrunning tasks. Once a task has overrun the timing predictions become meaningless; it seems progress asymptotically approaches 100%, and estimated remaining time never goes below 10 minutes. ID: 98199 · Rating: 0 · rate: / Reply Quote

Mr P Hucker Send message Joined: 12 Aug 06 Posts: 1602 Credit: 13,010,866 RAC: 677	Message 98208 - Posted: 18 Jul 2020, 17:37:40 UTC - in response to Message 98191. Last modified: 18 Jul 2020, 17:42:16 UTC Edit- you also need to reduce your cache size as you are missing deadlines. If you have more than 1 project, there is no need for a cache at all. Even with just one project, a cache really isn't necessary unless that project has lots of extended down time. Or the project has very small tasks that take 18 seconds to complete! (Milkyway) There have been a few Tasks that run longer than the set time till the watchdog timer kills them off. There were a lot when i first started here, for a while now there haven't been any. Just recently, there's been a small batch of those longer running Tasks again. Actually the watchdog doesn't kill them off, I had one run for 2.5 days. They always finish eventually, and with success, but the lack of checkpoints is a bit annoying - I lost quite a bit when Microsoft illegally restarted my property for a Windows update (which I've since disabled). ID: 98208 · Rating: 0 · rate: / Reply Quote

Brian Nixon Send message Joined: 12 Apr 20 Posts: 293 Credit: 8,432,366 RAC: 0	Message 98210 - Posted: 18 Jul 2020, 18:06:50 UTC - in response to Message 98157. boboviz wrote: Any discussion of how much faster HD helps? I see an great difference in rac between my "old" sata and new SSD (with the same memory and cpu). The only reason I can think of for the hard disk to make such a difference is that you are short on RAM and the system is spending a lot of time paging. From what I’ve seen, once the application and protein database are loaded, Rosetta itself uses the disk very little. It infrequently saves a few kilobytes of state/checkpoint/results data; nothing more. ID: 98210 · Rating: 0 · rate: / Reply Quote

Mr P Hucker Send message Joined: 12 Aug 06 Posts: 1602 Credit: 13,010,866 RAC: 677	Message 98211 - Posted: 18 Jul 2020, 18:12:45 UTC - in response to Message 98210. boboviz wrote: Any discussion of how much faster HD helps? I see an great difference in rac between my "old" sata and new SSD (with the same memory and cpu). The only reason I can think of for the hard disk to make such a difference is that you are short on RAM and the system is spending a lot of time paging. From what I’ve seen, once the application and protein database are loaded, Rosetta itself uses the disk very little. It infrequently saves a few kilobytes of state/checkpoint/results data; nothing more. Which will be buffered in RAM anyway and not hold up the program. Only if it needs to read data can it be stalled, even then if it's stuff it or another task from the same project has used recently, that will be cached in RAM. ID: 98211 · Rating: 0 · rate: / Reply Quote

Sid Celery Send message Joined: 11 Feb 08 Posts: 2538 Credit: 47,076,156 RAC: 29,077	Message 98223 - Posted: 19 Jul 2020, 3:19:43 UTC - in response to Message 98169. After the task outage last month I guess people re-prioritised other projects, understandably. I really don't understand why people do that. I have all my computers set to run at least two projects. If one goes wrong or runs out of work, it will run entirely the other one with no intervention from myself. When it's fixed, it'll go back to doing it at the proportion I've set (and in fact tries to make up lost ground by doing more of the one that was broken for a while. You could even have Rosetta at weight 1,000,000 and another project at 1. I know, but you've been here longer than me, so you ought to know that some people are... weird in their reasoning. There's no explaining that'll help ID: 98223 · Rating: 0 · rate: / Reply Quote

Sid Celery Send message Joined: 11 Feb 08 Posts: 2538 Credit: 47,076,156 RAC: 29,077	Message 98224 - Posted: 19 Jul 2020, 3:27:34 UTC - in response to Message 98199. Keith Myers wrote: I'm getting a bit frustrated with the current running task This one? Does indeed look like the watchdog stepped in: BOINC:: CPU time: 50522.6s, 36000s + 14400s[2020- 7-17 19: 6:41:] :: BOINC (Output like that does not normally appear in task results.) Tasks do occasionally ‘go rogue’, and checkpointing is known to be difficult and inconsistent in Rosetta. I don’t think there’s anything we can do about it other than leave them alone, switch off when we need to and hope not too much work is lost, or abort overrunning tasks. Once a task has overrun the timing predictions become meaningless; it seems progress asymptotically approaches 100%, and estimated remaining time never goes below 10 minutes. Slightly side-tracking. That task isn't available to view any more, but if it says "36000s + 14400s" that indicates the watchdog has now been set back to 4hrs rather than 10hrs. I wasn't aware that'd changed back as I haven't had a long-running task for a very long time ID: 98224 · Rating: 0 · rate: / Reply Quote

Brian Nixon Send message Joined: 12 Apr 20 Posts: 293 Credit: 8,432,366 RAC: 0	Message 98225 - Posted: 19 Jul 2020, 8:19:58 UTC - in response to Message 98224. if it says "36000s + 14400s" that indicates the watchdog has now been set back to 4hrs rather than 10hrs The 4 hours I took to be the run time preference (per this post); the 10 hours the watchdog (per this post). ID: 98225 · Rating: 0 · rate: / Reply Quote

Mr P Hucker Send message Joined: 12 Aug 06 Posts: 1602 Credit: 13,010,866 RAC: 677	Message 98229 - Posted: 19 Jul 2020, 18:25:23 UTC - in response to Message 98224. Slightly side-tracking. That task isn't available to view any more, but if it says "36000s + 14400s" that indicates the watchdog has now been set back to 4hrs rather than 10hrs. I wasn't aware that'd changed back as I haven't had a long-running task for a very long time I've got one running right now. 1 day, 5 hours, 40 minutes of CPU time: https://boinc.bakerlab.org/rosetta/workunit.php?wuid=1095559368 My wingman completed it in 13 hours, but so far I've taken 1 day, 5 hours, 40 minutes. The wingman's computer has an i5-6402P, which I've never heard of, but if it's a similar speed to an i5-6400, then it's a similar speed to my Xeon per core, so I'm not sure how he did it so quickly. How does winging work with Rosetta? Can't you end up with one guy doing more modules than another because his computer is faster? ID: 98229 · Rating: 0 · rate: / Reply Quote

Brian Nixon Send message Joined: 12 Apr 20 Posts: 293 Credit: 8,432,366 RAC: 0	Message 98230 - Posted: 19 Jul 2020, 18:42:59 UTC - in response to Message 98229. Last modified: 19 Jul 2020, 18:50:18 UTC How does winging work with Rosetta? It doesn’t. Tasks are typically not sent to more than one machine. Yours did probably only because its deadline has passed. If your machine does ever finish it, you will get the same credit as the other user. (Looking at the FLOPS: his machine is 30% faster than yours.) And yes: this is where BOINC’s credit model (designed for fixed work / variable time) breaks down on Rosetta (fixed time / variable work). (Explanation from Mod.Sense.) ID: 98230 · Rating: 0 · rate: / Reply Quote

Mr P Hucker Send message Joined: 12 Aug 06 Posts: 1602 Credit: 13,010,866 RAC: 677	Message 98231 - Posted: 19 Jul 2020, 18:48:19 UTC - in response to Message 98230. Last modified: 19 Jul 2020, 18:49:22 UTC How does winging work with Rosetta? It doesn’t. Tasks are typically not sent to more than one machine. Yours did probably only because its deadline has passed. If your machine does ever finish it, you will get the same credit as the other user. (Explanation from Mod.Sense.) Ok that answers one of my two questions, but.... how did he finish it so quickly? I can only assume his CPU, although similar in a benchmark, is faster at Rosetta. Back to the question you answered - I take it Rosetta is programmed such that it cannot send back a wrong result? Most projects have to check with at least one other person to make sure you got the answer right. ID: 98231 · Rating: 0 · rate: / Reply Quote

Brian Nixon Send message Joined: 12 Apr 20 Posts: 293 Credit: 8,432,366 RAC: 0	Message 98232 - Posted: 19 Jul 2020, 19:10:35 UTC - in response to Message 98231. I edited while you were replying… Looking at the stats: his machine is 30% faster at floating point ops, and 80% faster at integer ops, than yours. Using those numbers, yours should take somewhere between 17 and 25 hours. But that the task is still not finished after 30 hours suggests it’s not that simple… From what Mod.Sense wrote, Rosetta would rather have two machines doing two different tasks than both doing the same and comparing results to ensure they’re ‘right’. I’m not sure there’s really such a thing as a ‘wrong’ answer with Rosetta anyway, if the tasks are simply asking: “What if…?” Any results that look promising will be investigated further, and can be discarded if they turn out to be somehow erroneous. ID: 98232 · Rating: 0 · rate: / Reply Quote

Mr P Hucker Send message Joined: 12 Aug 06 Posts: 1602 Credit: 13,010,866 RAC: 677	Message 98233 - Posted: 19 Jul 2020, 21:27:55 UTC - in response to Message 98232. Last modified: 19 Jul 2020, 21:30:27 UTC Looking at the stats: his machine is 30% faster at floating point ops, and 80% faster at integer ops, than yours. Using those numbers, yours should take somewhere between 17 and 25 hours. But that the task is still not finished after 30 hours suggests it’s not that simple… Where did you get the data from? I usually compare using http://cpuboss.com/compare-cpus but that has not heard of his CPU. I tried searching for a few more comparison sites, but the ones that list his don't have benchmarks, they just list all the specs side by side. From what Mod.Sense wrote, Rosetta would rather have two machines doing two different tasks than both doing the same and comparing results to ensure they’re ‘right’. I’m not sure there’s really such a thing as a ‘wrong’ answer with Rosetta anyway, if the tasks are simply asking: “What if…?” Any results that look promising will be investigated further, and can be discarded if they turn out to be somehow erroneous. But if a computer makes a mistake it will miss what could be an interesting combination. There must be some kinda CRC check in the programming. Astrophysics projects use at least two machines, as the answer can be incorrect. And yes: this is where BOINC’s credit model (designed for fixed work / variable time) breaks down on Rosetta (fixed time / variable work). (Explanation from Mod.Sense.) It only breaks down when someone returns it too late. ID: 98233 · Rating: 0 · rate: / Reply Quote

Brian Nixon Send message Joined: 12 Apr 20 Posts: 293 Credit: 8,432,366 RAC: 0	Message 98234 - Posted: 19 Jul 2020, 21:59:47 UTC - in response to Message 98233. Where did you get the data from? I was just looking at the Measured floating point speed and Measured integer speed values on each Computer Details page, which come from the Whetstone and Dhrystone benchmarks that BOINC runs. ID: 98234 · Rating: 0 · rate: / Reply Quote