Problems and Technical Issues with Rosetta@home

Author	Message
Sid Celery Joined: 11 Feb 08 Posts: 2479 Credit: 46,506,558 RAC: 1,247
Message 93051 - Posted: 2 Apr 2020, 10:48:41 UTC - in response to Message 93046.
Quote: The first of the new tasks has just finished; it took 4 hours to run the 1 decoy for me, and these were definitely running under an hour previously. If you have your runtime set to 4 hours you won't really notice the difference in time, but I'm more concerned with the actual work being done by the program. If points are an accurate indication, then with 4.07 I was running at an average of 300pts per hour per core; this just-finished task has returned 300 points in 4 hours, which ties in with my thinking that they are not running efficiently. Is there a mod reading who can make a comment? Edit: there are 60 of these now finishing, so plenty to look at: https://boinc.bakerlab.org/rosetta/result.php?resultid=1138591491
Are you sure? Looks more like 75/core/hr in the past to me. Sometimes 50.
Also, new versions take a little while to get their scoring sorted out, iirc. Looks like it started at 150/4hrs and has risen to nearer 300 now. But this isn't my strong suit.
Anyway, I only chimed in because I'd be happy with 8 or 16 WUs atm. 11 now here on my 8-core but still nothing for my 2 4-core machines. 60 would be a dream.

Grant (SSSF) Joined: 28 Mar 20 Posts: 1896 Credit: 18,534,891 RAC: 0
Message 93052 - Posted: 2 Apr 2020, 10:49:10 UTC - in response to Message 93044.
Quote: Edit- Finally finished a few of these longer running Rosetta Minis and I've decided this isn't really a problem at all. While the Tasks take twice as long to process, they pay out 4 times more Credit than they usually do. I can live with that.
Well, it was nice while it lasted. Gone from 4 times as much down to 2 times as much, so back on par with Tasks that run for normal Target times.
Grant
Darwin NT

nastasache Joined: 24 Feb 07 Posts: 16 Credit: 171,383 RAC: 0
Message 93053 - Posted: 2 Apr 2020, 11:00:46 UTC - in response to Message 92687.
Thanks a lot, Robert.
I changed all to use 99% of RAM (was 90% as default and 50% for other), and 1% of swap. It looks like there are no out-of-memory errors for now, but memory usage stays as before. For 12 tasks, the total memory usage is about 6GB. It looks like R@H is using less memory per task than the maximum available to a 32-bit app. Here is a task with max mem usage:
Application: Rosetta 4.12
Name: 4dy3ga3h_jhr_design1_COVID-19_SAVE_ALL_OUT_903392_1
State: Running
Received: 2020-04-01 21:33:01
Report deadline: 2020-04-09 21:33:00
Estimated computation size: 80,000 GFLOPs
CPU time: 08:11:40
CPU time since checkpoint: 00:04:37
Elapsed time: 15:34:17
Estimated time remaining: 2d 05:56:33
Fraction done: 22.400%
Virtual memory size: 1.12 GB
Working set size: 1.14 GB
Directory: slots/2
Process ID: 14460
Progress rate: 2.520% per hour
Executable: rosetta_4.12_windows_intelx86.exe
Btw, a task takes about 2-3 days to finish, from an initial estimate of 4 hours; is that normal?
Iulian

strongboes Joined: 3 Mar 20 Posts: 27 Credit: 5,394,270 RAC: 0
Message 93054 - Posted: 2 Apr 2020, 11:10:49 UTC - in response to Message 93051.
See below. There are no 4.07 tasks left showing; there was 9000 yesterday, only 400 today. The mini was taking around an hour, but it gives an idea.
The 4.07 were averaging a 40 min runtime, with a rate of 1 credit for 11.5 secs of runtime on average. 3600/11.5 = 313
The last 4.12 is running at 1 credit for 59.95 seconds of runtime. 4.7x slower.
https://boinc.bakerlab.org/rosetta/results.php?hostid=3800945&offset=340&show_names=0&state=4&appid=
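
The credit-rate arithmetic in the post above can be checked with a few lines of Python. This is only a sketch: the two seconds-per-credit figures are taken from the post, and the helper name `credits_per_hour` is made up for illustration. Using just those two figures the ratio comes out at about 5.2, a little above the 4.7 quoted, which may have been based on different averages.

```python
def credits_per_hour(seconds_per_credit: float) -> float:
    """Convert 'seconds of runtime per credit' into credits earned per hour."""
    return 3600.0 / seconds_per_credit

# Figures quoted in the post (assumed to be per-core averages).
old_rate = credits_per_hour(11.5)    # Rosetta 4.07
new_rate = credits_per_hour(59.95)   # Rosetta 4.12

# How many times fewer credits per hour the newer tasks earn.
slowdown = old_rate / new_rate

print(f"4.07: {old_rate:.0f} credits/hour")
print(f"4.12: {new_rate:.0f} credits/hour")
print(f"ratio: {slowdown:.1f}x")
```

Credit per hour is only a proxy for work done, so a change in how credit is granted would shift these numbers without any change in science throughput.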

JoshuaScholar Joined: 26 Mar 20 Posts: 18 Credit: 232,183 RAC: 0
Message 93058 - Posted: 2 Apr 2020, 12:03:18 UTC Last modified: 2 Apr 2020, 12:08:04 UTC
I know this affects so few people that it won't matter much, but: I have an older 2-socket Xeon system (Sandy Bridge era E5-2690s). Let me tell you what DOESN'T work properly with the Windows client on my Windows 10 Pro setup:
1) NUMA. With two sockets, the most common way to run Windows is with each processor preferentially accessing the memory that's directly attached to it. This is called NUMA, and it's slightly faster. But with NUMA enabled, the client picks the proper number of threads as if it's going to use both sockets, but then it runs all of the threads on only ONE of the sockets.
2) Hyperthreading with NUMA off. [NUMA off is called "uniform memory access", by the way.] With NUMA off and hyperthreading enabled, the client creates the right number of threads for using both sockets, BUT it allocates both threads to the SAME hyperthread in each core. So each core has one empty hyperthread and one hyperthread shared by two threads.
So on this old 2-socket Xeon system running Windows 10 Pro, the only efficient way to run the BOINC client is to turn off NUMA and also turn off hyperthreading. Then it works properly.
On a machine this old, on a highly parallel workload, turning off hyperthreading is about a 20% throughput hit. On a newer processor it would be a greater hit. I'm not sure if there's any real hit to turning off NUMA, but it isn't a big one.
Josh Scholar

nastasache Joined: 24 Feb 07 Posts: 16 Credit: 171,383 RAC: 0
Message 93059 - Posted: 2 Apr 2020, 12:06:36 UTC
Hi, especially @Grant (SSSF).
Where am I going wrong? I need twice the time to finish tasks and get 50% of the GFLOPS on a similar i7-8700K CPU. Compare:
- https://boinc.bakerlab.org/rosetta/host_app_versions.php?hostid=3933928
- https://boinc.bakerlab.org/rosetta/host_app_versions.php?hostid=3914491
Thanks in advance.

robertmiles Joined: 16 Jun 08 Posts: 1251 Credit: 14,421,737 RAC: 0
Message 93061 - Posted: 2 Apr 2020, 12:21:07 UTC - in response to Message 93039.
strongboes, [snip]
Quote: I'm saying it doesn't look productive because the decoys are taking approximately 4 to 6 times longer to process. If you watch the graphics, it gets to a certain number of steps and then almost stops, taking 30-60 minutes for each additional step. Half last night, before I went to bed, stopped at step 24600, then took 30 mins to do step 24601, etc. So that's what I mean: it is taking 4-6 times longer to process the same work, so it appears. The latest batch, which are rb 04 01 20235 19963 ab t000 robetta cstwt..., are currently on 2 hours 49, 56% on first decoy. Looks like 5hrs to run. 4.07 was running very similar tasks under an hour.
You are assuming that each decoy does an equal amount of work, and that each step does an equal amount of work. I don't expect that to be true.
Generally, the first decoy is only for checking that your computer works correctly and is the same every time. The second decoy starts the useful work.

robertmiles Joined: 16 Jun 08 Posts: 1251 Credit: 14,421,737 RAC: 0
Message 93063 - Posted: 2 Apr 2020, 12:34:41 UTC
One thing to watch for when using CPUs with especially high numbers of cores: the bandwidth from the CPU to the memory may not be adequate to run all of the cores very well. This could leave each core in use waiting for access to memory most of the time. If so, it can be useful to reduce the number of cores BOINC is allowed to use and see if that speeds up the work enough to more than compensate for fewer cores in use.

JoshuaScholar Joined: 26 Mar 20 Posts: 18 Credit: 232,183 RAC: 0
Message 93066 - Posted: 2 Apr 2020, 12:42:22 UTC - in response to Message 93063.
That might be because of the bugs I noticed. Make sure that every thread is really allocated its own hyperthread, because BOINC doesn't leave it up to the OS.

strongboes Joined: 3 Mar 20 Posts: 27 Credit: 5,394,270 RAC: 0
Message 93071 - Posted: 2 Apr 2020, 12:48:24 UTC - in response to Message 93063.
Quote: One thing to watch for when using CPUs with especially high numbers of cores: the bandwidth from the CPU to the memory may not be adequate to run all of the cores very well. This could leave each core in use waiting for access to memory most of the time. If so, it can be useful to reduce the number of cores BOINC is allowed to use and see if that speeds up the work enough to more than compensate for fewer cores in use.
If you read previous posts you will see that I'm not hyperthreading and have a large L3 cache and RAM. I tried running just 10 cores also. It isn't that; they run roughly 4 times slower than 4.07 if they start with rb. It will be obvious soon enough.

JoshuaScholar Joined: 26 Mar 20 Posts: 18 Credit: 232,183 RAC: 0
Message 93072 - Posted: 2 Apr 2020, 12:51:00 UTC - in response to Message 93071. Last modified: 2 Apr 2020, 13:10:19 UTC
Oh, you're right. I just looked at my task list. Time per WU has jumped from 8 hours to 16 hours! The cores are running cooler than with the last version too, which suggests a bottleneck.
Note 2: I just noticed that the most recent few are fast again. Maybe there was just a run of WUs for a harder problem.

robertmiles Joined: 16 Jun 08 Posts: 1251 Credit: 14,421,737 RAC: 0
Message 93074 - Posted: 2 Apr 2020, 13:25:00 UTC
A typical cause here for harder problems is larger proteins.

Sid Celery Joined: 11 Feb 08 Posts: 2479 Credit: 46,506,558 RAC: 1,247
Message 93077 - Posted: 2 Apr 2020, 14:25:26 UTC - in response to Message 93054.
Quote: See below. There are no 4.07 tasks left showing; there was 9000 yesterday, only 400 today. The mini was taking around an hour, but it gives an idea. The 4.07 were averaging a 40 min runtime, with a rate of 1 credit for 11.5 secs of runtime on average. 3600/11.5 = 313. The last 4.12 is running at 1 credit for 59.95 seconds of runtime. 4.7x slower. https://boinc.bakerlab.org/rosetta/results.php?hostid=3800945&offset=340&show_names=0&state=4&appid=
I didn't look back that far earlier. What I notice now is that starting today, 2-Apr, the scoring for mini-Rosetta has plunged to 75/hr, down from 300/hr, and 4.12 are 300/4hr, so 75/hr too.
It looks like something has happened to <all> scoring from today, a step change down, but consistent between the two on validation. Very odd.

Sid Celery Joined: 11 Feb 08 Posts: 2479 Credit: 46,506,558 RAC: 1,247
Message 93079 - Posted: 2 Apr 2020, 14:40:50 UTC - in response to Message 93077.
Quote (from Message 93054): See below. There are no 4.07 tasks left showing; there was 9000 yesterday, only 400 today. The mini was taking around an hour, but it gives an idea. The 4.07 were averaging a 40 min runtime, with a rate of 1 credit for 11.5 secs of runtime on average. 3600/11.5 = 313. The last 4.12 is running at 1 credit for 59.95 seconds of runtime. 4.7x slower. https://boinc.bakerlab.org/rosetta/results.php?hostid=3800945&offset=340&show_names=0&state=4&appid=
Quote (from Message 93077): I didn't look back that far earlier. What I notice now is that starting today, 2-Apr, the scoring for mini-Rosetta has plunged to 75/hr, down from 300/hr, and 4.12 are 300/4hr, so 75/hr too. It looks like something has happened to <all> scoring from today, a step change down, but consistent between the two on validation. Very odd.
Oh, you're not going to like this... I've just checked my own PC to see how my dribble of tasks have performed on a mere FX8370.
1 Apr - Mini & 4.12 tasks around 45/hr, 280-340/8hr task. Better than I usually get tbh.
2 Apr - Mini only (4.12 not reported yet) 110-120/hr, 890-950/8hr task. Lol
Nothing I can say to that...

entity Joined: 8 May 18 Posts: 23 Credit: 10,249,932 RAC: 0
Message 93080 - Posted: 2 Apr 2020, 15:11:15 UTC - in response to Message 93072. Last modified: 2 Apr 2020, 15:13:23 UTC
Quote: Oh, you're right. I just looked at my task list. Time per WU has jumped from 8 hours to 16 hours! The cores are running cooler than with the last version too, which suggests a bottleneck. Note 2: I just noticed that the most recent few are fast again. Maybe there was just a run of WUs for a harder problem.
This is a known problem in Rosetta that the developers have acknowledged but probably haven't fixed yet. They indicated that it would take a major rewrite of the code. L3 cache tends to become over-utilized and the CPU waits for data to make the trip from main memory, hence the CPU runs cooler (more waiting). There was a post by a developer in another project that suggested limiting the number of tasks run concurrently. They indicated that each task uses about 4MB of L3 cache.
Concerning the run time, I noticed that the run parameters include something like cpu_seconds=57500. That is about 16 hours. They are ignoring the Target CPU runtime setting.
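
Both rules of thumb in the post above are simple arithmetic, sketched here in Python. The 4 MB-per-task figure and the `cpu_seconds=57500` value come from the post; the 16 MB L3 size and the function names are hypothetical, chosen only to show the calculation, and this is not a description of how BOINC or Rosetta schedule work.

```python
def max_tasks_for_l3(l3_cache_mb: float, per_task_mb: float = 4.0) -> int:
    """Rough cap on concurrent tasks so their combined footprint fits in L3.

    per_task_mb defaults to the ~4 MB per task mentioned in the post.
    Always returns at least 1 so that some work can still run.
    """
    return max(1, int(l3_cache_mb // per_task_mb))


def seconds_to_hours(seconds: float) -> float:
    """Convert a CPU-time limit in seconds to hours."""
    return seconds / 3600.0


# Hypothetical CPU with 16 MB of shared L3 cache (assumption for illustration).
print(max_tasks_for_l3(16))

# The cpu_seconds value quoted in the post, expressed in hours.
print(f"{seconds_to_hours(57500):.2f} hours")
```

If the cache estimate holds, a value like the one printed first is what a volunteer might try as the concurrent-task limit before measuring whether throughput actually improves.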

Stephen "Heretic" Joined: 2 Apr 20 Posts: 21 Credit: 11,028 RAC: 0
Message 93081 - Posted: 2 Apr 2020, 15:27:06 UTC - in response to Message 93040.
Quote: Hello, I have just joined this project but it seems there is no work to do at the moment. Is this a common state of affairs or have I struck a bad moment to join??
Quote: Work being done has increased by 500% over the last 2 and a bit weeks, so there's not much work available as demand is far exceeding supply. More work is meant to be coming, but apparently it takes quite a while to prepare it for release, so it will take a while before work production comes close to matching the present demand.
. . I'm guessing fellow refugees from S@H ... oh well, I'll just have to be patient ...
Stephen :(

Mod.Sense Volunteer moderator Joined: 22 Aug 06 Posts: 4018 Credit: 0 RAC: 0
Message 93083 - Posted: 2 Apr 2020, 15:37:56 UTC
I've tried to summarize the new work unit runtimes in a new thread. Please post concerns about "performance" of the new v4.12, or estimated time to completion, over there.
Rosetta Moderator: Mod.Sense

BetelgeuseFive Joined: 10 Aug 10 Posts: 4 Credit: 1,543,284 RAC: 0
Message 93084 - Posted: 2 Apr 2020, 16:23:02 UTC
I'm having a problem with 4.12 on Linux (CentOS 7). I found out my computer was doing nothing while there were plenty of tasks "Ready to start". I first rebooted the system, but this did not change anything. I enabled cpu_sched_debug in the event log, and the messages indicated it was trying to start v4.12 tasks, but nothing actually started. I suspended the v4.12 tasks, and other v4.08 tasks started immediately without any problems.
Any clues?
Thanks,
Tom

Mod.Sense Volunteer moderator Joined: 22 Aug 06 Posts: 4018 Credit: 0 RAC: 0
Message 93086 - Posted: 2 Apr 2020, 16:42:05 UTC - in response to Message 93084. Last modified: 2 Apr 2020, 16:51:56 UTC
How much memory have you allowed BOINC to use, when active? when idle?
Rosetta Moderator: Mod.Sense

BetelgeuseFive Joined: 10 Aug 10 Posts: 4 Credit: 1,543,284 RAC: 0
Message 93087 - Posted: 2 Apr 2020, 17:00:28 UTC - in response to Message 93086.
Quote: How much memory have you allowed BOINC to use, when active? when idle?
The system has 6 GB configured (running inside a VM). I just checked the settings; it has:
When in use, use at most 50%
When not in use, use at most 90%
That should have been plenty to start at least one task.
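
For reference, the memory limits in the post above translate into absolute budgets as sketched below. The 6 GB total and the 50%/90% settings come from the post. The 1.14 GB per-task figure is borrowed from the working set size reported earlier in this thread for a different (Windows) host, and is used here purely as an assumption; the memory a v4.12 task actually requires before BOINC will start it may be different. The function names are made up for illustration.

```python
def memory_budget_gb(total_gb: float, percent_allowed: float) -> float:
    """Memory BOINC may use, given total RAM and the 'use at most N%' setting."""
    return total_gb * percent_allowed / 100.0


def tasks_that_fit(budget_gb: float, per_task_gb: float) -> int:
    """How many tasks of a given size fit in the budget (rounded down)."""
    return int(budget_gb // per_task_gb)


TOTAL_GB = 6.0       # VM memory, from the post
PER_TASK_GB = 1.14   # assumption: working set size seen on another host

in_use_budget = memory_budget_gb(TOTAL_GB, 50)   # "When in use, use at most 50%"
idle_budget = memory_budget_gb(TOTAL_GB, 90)     # "When not in use, use at most 90%"

print(f"in use: {in_use_budget:.1f} GB, "
      f"room for {tasks_that_fit(in_use_budget, PER_TASK_GB)} task(s)")
print(f"idle:   {idle_budget:.1f} GB, "
      f"room for {tasks_that_fit(idle_budget, PER_TASK_GB)} task(s)")
```

Under these assumptions at least one task fits in either state, which is consistent with the poster's expectation and suggests the cause may lie elsewhere.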