Message boards : Number crunching : Problems and Technical Issues with Rosetta@home
Author | Message |
---|---|
Mr P Hucker Joined: 12 Aug 06 Posts: 1600 Credit: 11,716,372 RAC: 18,198 |
What is a decoy? All my machines complete a task in 7.5 to 8.5 hours. And they're not particularly fast machines, one is 12 years old. I've seen no errors in over a week. There must be a pattern here. Strange, all projects that have given me errors cause them well under the normal processing time. And some projects like LHC seem to have tasks that usually take 2 hours but can take 4 days, but still complete fine. |
Bryn Mawr Joined: 26 Dec 18 Posts: 389 Credit: 12,070,320 RAC: 12,300 |
Rosetta works differently to most in that it does as much processing as it can in the time allowed rather than takes as much time as the fixed amount of processing takes. To ensure that a rogue work unit does not lock out a core, it has a watchdog that closes out the WU if it is still running 4 hours beyond the time limit set. Because the WU closes down at the end of a decoy when it predicts that there is not time to process another before the deadline, this can only really happen if a single decoy runs for more than the 4 hours. I’d guess that this implies the decoy is in a loop, but I cannot say that for certain. |
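The behaviour described in the post above can be sketched as two small checks. This is only an illustration, not Rosetta's actual code: the function names, the use of the average decoy time as the prediction, and the 8-hour example target are assumptions. Only the 4-hour watchdog margin and the rule of stopping at the end of a decoy come from the post.

```python
WATCHDOG_MARGIN_S = 4 * 3600  # from the post: watchdog acts 4 hours past the limit


def should_start_another_decoy(elapsed_s, decoys_done, target_s):
    """Decide, at the end of a decoy, whether to begin the next one.

    Hypothetical prediction rule: the next decoy costs the average so far.
    """
    if decoys_done == 0:
        return elapsed_s < target_s
    predicted_s = elapsed_s / decoys_done
    return elapsed_s + predicted_s <= target_s


def watchdog_should_kill(elapsed_s, target_s):
    """The watchdog only steps in once the overrun exceeds the margin."""
    return elapsed_s > target_s + WATCHDOG_MARGIN_S


target = 8 * 3600  # assumed example target of 8 hours

# Three decoys took 7.5 hours, so a fourth is predicted not to fit.
print(should_start_another_decoy(7.5 * 3600, 3, target))  # False
# A single decoy still running at 12.5 hours trips the watchdog.
print(watchdog_should_kill(12.5 * 3600, target))  # True
```

Under this sketch a task normally ends near its target, and the watchdog matters only when one decoy alone overruns by more than the margin.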
Mr P Hucker Joined: 12 Aug 06 Posts: 1600 Credit: 11,716,372 RAC: 18,198 |
"Rosetta works differently to most in that it does as much processing as it can in the time allowed rather than takes as much time as the fixed amount of processing takes." My Rosetta WUs all seem to finish in a very precise timeframe of 7.5 hours (8.5 hours on the slower machines); I've seen no variation. Perhaps the limiter only takes effect occasionally. LHC has a huge variation: the theory tasks can take from 1.5 hours to 4 days! The percentage moves very slowly, as though it will take 4 days, but it seems to complete at a random point somewhere in there. Often at only "2% completed" it jumps to 100% and says it was successful. I guess it's looking for an answer somewhere in there and finds it early? |
Bryn Mawr Joined: 26 Dec 18 Posts: 389 Credit: 12,070,320 RAC: 12,300 |
"Rosetta works differently to most in that it does as much processing as it can in the time allowed rather than takes as much time as the fixed amount of processing takes." Pass, I’ve never looked at LHC so I wouldn’t know. |
Mr P Hucker Joined: 12 Aug 06 Posts: 1600 Credit: 11,716,372 RAC: 18,198 |
"Pass, I’ve never looked at LHC so I wouldn’t know." They've got Atlas tasks, which will run one WU on all your CPU cores at once. I want to get a Ryzen Threadripper to see if they'll give me a 64-core task :-) |
rlpm Joined: 23 Mar 20 Posts: 13 Credit: 84 RAC: 0 |
All WUs for Rosetta v4.07 i686-pc-linux-gnu on my 1st gen AppleTV running Linux (OSMC with all GUI etc. disabled) are failing for going over the RAM limit. See here. E.g.: "working set size > client RAM limit: 167.87MB > 167.55MB". Is there something wrong with how the working set size is matched to the amount of available RAM? Or can I limit to the Rosetta Mini application only? |
rlpm Joined: 23 Mar 20 Posts: 13 Credit: 84 RAC: 0 |
Looks like the same is happening for mini tasks, e.g. task 1132535295: "working set size > client RAM limit: 170.39MB > 167.55MB" |
Mod.Sense Volunteer moderator Joined: 22 Aug 06 Posts: 4018 Credit: 0 RAC: 0 |
rlpm, your host profile shows 256MB of memory. And the "mini" tasks require just as much memory as any others. They seem to have moved the documentation on minimum host requirements on the R@h website, so I'm not finding it at the moment. But the basic guideline is 1GB of memory per active CPU core. I might suggest that you attach the machine to World Community Grid. They have a number of bioscience projects running there, and generally can run in a smaller memory footprint. Rosetta Moderator: Mod.Sense |
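As a rough illustration of the 1GB-per-active-core guideline mentioned above, the following sketch estimates how many tasks a host could run at once. The function name, the 1024 MB default, and treating the guideline as a hard cutoff are assumptions made for the example; the project's real requirements may differ per task.

```python
def max_concurrent_tasks(total_ram_mb, cpu_cores, ram_per_task_mb=1024):
    """Estimate how many tasks fit under a per-core memory guideline.

    Hypothetical helper: the result is capped by both core count and RAM.
    """
    by_memory = total_ram_mb // ram_per_task_mb
    return max(0, min(cpu_cores, by_memory))


# The 256MB host from the post falls below the guideline for even one task.
print(max_concurrent_tasks(256, 1))   # 0
# A 2-core host with 4GB can keep both cores busy.
print(max_concurrent_tasks(4096, 2))  # 2
```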
rlpm Joined: 23 Mar 20 Posts: 13 Credit: 84 RAC: 0 |
Thanks Mod.Sense. It would be nice if BOINC automatically failed early, perhaps even at project attachment, if the host doesn't meet the minimum requirements for any app (RAM, disk, instruction set, OS). I already have my old 1st gen RasPis crunching on TN-Grid (gene sequencing) via BOINC, so I'll do the same with this AppleTV. |
bormolino Joined: 16 May 13 Posts: 4 Credit: 160,977 RAC: 0 |
The graphics of the Rosetta 4.07 WU for COVID-19 do not work. They show "Stage unknown" and "No shared mem" inside the graphics window. The graphics of the other WUs work without any problems. |
Sid Celery Joined: 11 Feb 08 Posts: 2115 Credit: 41,110,625 RAC: 19,736 |
I've seen the Rosetta stats for the number of new users who've come on board recently: basically quadrupled, with massive throughput, which is great. The number of in-progress tasks is similarly huge, well over a million, more than I can ever remember seeing. A little earlier this afternoon I saw my buffers were smaller than usual and noticed that a few calls for new tasks had brought none down. This is hardly surprising. Before I finally got to this page to mention the task shortage, more had come on stream, which is great. I guess all I'm saying is, especially with all the new users around, if there's an interruption in task supply in the coming days/weeks, we (more accurately, I) need to have a little patience and understanding. It's going to happen and it's surprising it hasn't happened already. Great job on keeping the tasks coming through - thanks. |
Shaky Jake Joined: 26 Mar 07 Posts: 2 Credit: 55,684 RAC: 0 |
I have an older desktop computer with a Pentium Duo CPU that is having a problem with the COVID-19 workunits. They are erroring out at about 2 min.
EXAMPLE:
Task 1134452442
Name 0ef4jx8h_jhr_design1_COVID-19_SAVE_ALL_OUT_903439_1_0
Workunit 1021756085
Created 27 Mar 2020, 9:12:21 UTC
Sent 27 Mar 2020, 9:38:35 UTC
Report deadline 4 Apr 2020, 9:38:35 UTC
Received 28 Mar 2020, 12:10:42 UTC
Server state Over
Outcome Computation error
Client state Compute error
Exit status 11 (0x0000000B) Unknown error code
Computer ID 3794680
Run time 2 min 15 sec
CPU time 1 min 59 sec
Validate state Invalid
Credit 0.00
Device peak FLOPS 1.87 GFLOPS
Application version Rosetta v4.08 x86_64-pc-linux-gnu
Stderr output
<core_client_version>7.2.42</core_client_version>
<![CDATA[
<message>
process got signal 11
</message>
<stderr_txt>
command: ../../projects/boinc.bakerlab.org_rosetta/rosetta_4.08_x86_64-pc-linux-gnu -run:protocol jd2_scripting -parser:protocol jhr_boinc.xml @flags -in:file:silent 0ef4jx8h_jhr_design1_COVID-19.silent -in:file:silent_struct_type binary -silent_gz -mute all -out:file:silent_struct_type binary -out:file:silent default.out -in:file:boinc_wu_zip 0ef4jx8h_jhr_design1_COVID-19.zip -nstruct 10000 -cpu_run_time 28800 -watchdog -boinc:max_nstruct 600 -checkpoint_interval 120 -database minirosetta_database -in::file::zip minirosetta_database.zip -boinc::watchdog -run::rng mt19937 -constant_seed -jran 3902678
Starting watchdog...
Watchdog active.
</stderr_txt>
]]>
I have seen a couple that did complete and were validated.
EXAMPLE:
Task 1133949909
Name 0gr1iv8s_jhr_design1_COVID-19_SAVE_ALL_OUT_903456_1_0
Workunit 1021309240
Created 26 Mar 2020, 20:05:44 UTC
Sent 26 Mar 2020, 20:22:20 UTC
Report deadline 3 Apr 2020, 20:22:20 UTC
Received 27 Mar 2020, 23:58:09 UTC
Server state Over
Outcome Success
Client state Done
Exit status 0 (0x00000000)
Computer ID 3794680
Run time 13 hours 53 min 23 sec
CPU time 10 hours 30 min 46 sec
Validate state Valid
Credit 222.11
Device peak FLOPS 1.87 GFLOPS
Application version Rosetta v4.07 i686-pc-linux-gnu
Stderr output
<core_client_version>7.2.42</core_client_version>
<![CDATA[
<stderr_txt>
command: ../../projects/boinc.bakerlab.org_rosetta/rosetta_4.07_i686-pc-linux-gnu -run:protocol jd2_scripting -parser:protocol jhr_boinc.xml @flags -in:file:silent 0gr1iv8s_jhr_design1_COVID-19.silent -in:file:silent_struct_type binary -silent_gz -mute all -out:file:silent_struct_type binary -out:file:silent default.out -in:file:boinc_wu_zip 0gr1iv8s_jhr_design1_COVID-19.zip -nstruct 10000 -cpu_run_time 28800 -watchdog -boinc:max_nstruct 600 -checkpoint_interval 120 -database minirosetta_database -in::file::zip minirosetta_database.zip -boinc::watchdog -run::rng mt19937 -constant_seed -jran 3546964
Starting watchdog...
Watchdog active.
======================================================
DONE :: 3 starting structures 37846.6 cpu seconds
This process generated 3 decoys from 3 attempts
======================================================
BOINC :: WS_max 9.36336e-97
BOINC :: Watchdog shutting down...
18:53:10 (26863): called boinc_finish(0)
</stderr_txt>
]]>
Should I stop using this computer for this project or let it continue? All of the other workunits appear to process with no problems. |
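Both command lines in the two task reports above include `-cpu_run_time 28800`. The sketch below pulls flag/value pairs out of an excerpt of that command line and converts the value to hours. Reading the flag as a target run time in seconds is an assumption based on its name, and `parse_flags` is a hypothetical helper, not part of Rosetta or BOINC.

```python
import shlex


def parse_flags(command):
    """Collect '-flag value' pairs; a flag with no value maps to None."""
    tokens = shlex.split(command)
    flags = {}
    i = 0
    while i < len(tokens):
        token = tokens[i]
        if token.startswith("-"):
            following = tokens[i + 1] if i + 1 < len(tokens) else None
            if following is not None and not following.startswith("-"):
                flags[token] = following
                i += 2
                continue
            flags[token] = None
        i += 1
    return flags


# Excerpt of the command line shown in the task reports.
excerpt = ("-nstruct 10000 -cpu_run_time 28800 -watchdog "
           "-boinc:max_nstruct 600 -checkpoint_interval 120")
flags = parse_flags(excerpt)
print(int(flags["-cpu_run_time"]) / 3600)  # 8.0
```

If that reading is right, the successful task's 37846.6 CPU seconds (about 10.5 hours) ran past an 8-hour target but stayed inside the 4-hour watchdog margin described earlier in the thread.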
IBM01902 Joined: 23 Mar 20 Posts: 3 Credit: 43,044 RAC: 0 |
I am seeing this too with older computers. I don't have any new ones. They seem to eventually find something they can work on, but there's nothing in the BOINC event log that's helpful. I will occasionally have a task that halts and waits for memory, but that's not the Computation Error result we're seeing. Glad it's not just me. |
rlpm Joined: 23 Mar 20 Posts: 13 Credit: 84 RAC: 0 |
"<message> process got signal 11 </message>" The process is crashing. More info, from the standard signal table: SIGSEGV, value 11, default action Core, "Invalid memory reference". The people with access to the code will have to look into it. I don't know whether there are any crash reports (stack traces, etc.) that you can pull to provide more information to them. |
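For anyone wanting to check a signal number from a task report themselves, Python's standard `signal` module can map the number to its name. This is just a lookup sketch; `describe_signal` is a name made up for the example, and the mapping is the one of the machine running the script.

```python
import signal


def describe_signal(number):
    """Return the symbolic name for a signal number on this platform."""
    try:
        return signal.Signals(number).name
    except ValueError:
        return f"unknown signal {number}"


# Signal 11, as reported in "process got signal 11".
print(describe_signal(11))
```

On Linux this prints `SIGSEGV`, matching the "Invalid memory reference" description quoted above.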
Mr P Hucker Joined: 12 Aug 06 Posts: 1600 Credit: 11,716,372 RAC: 18,198 |
"I am seeing this too with older computers. I don't have any new ones. They seem to eventually find something they can work on, but there's nothing in the BOINC event log that's helpful. I will occasionally have a task that halts and waits for memory, but that's not the Computation Error result we're seeing. Glad it's not just me." Working ok for me on all my computers. My oldest is an Intel Q8400 (about 10 years old). It's a pity you can't select which sub projects to run in the Rosetta preferences. Most projects allow you to pick which ones, so you can block the ones that don't work on your machines. I guess as long as some of them work, you should keep going. Sending one back with an error just means the server will try someone else. |
Mod.Sense Volunteer moderator Joined: 22 Aug 06 Posts: 4018 Credit: 0 RAC: 0 |
@Shaky Jake. I see you have two machines. It appears the one with 2 CPUs and 2GB of memory is where the errors are occurring the most (the other machine has 2 CPUs and 4GB). This is consistent with what I have gleaned from others as well. I believe the Project Team will be tagging the COVID tasks as requiring more memory in the coming days. This should help things run more smoothly going forward. Rosetta Moderator: Mod.Sense |
Shaky Jake Joined: 26 Mar 07 Posts: 2 Credit: 55,684 RAC: 0 |
I found the problem. I am short 0.1 GB of memory, so when 2 COVID-19 WUs try to run, one of them will fail due to lack of memory. I have ordered additional memory. Until it arrives I have set the computer to run only 1 WU at a time. Thanks, Mod.Sense. Everything seems to be running OK using only 1 core. I am going to upgrade to 4GB of memory. I think that will solve the problem. My other computer is a laptop with 2 cores and 4GB memory and it has had no problems. Shaky Jake |
rlpm Joined: 23 Mar 20 Posts: 13 Credit: 84 RAC: 0 |
The binaries should check that there's enough memory for the WU, both at process start time and at run time, by checking the results of malloc, etc. Since the process on your computer hit a segfault, it may have been due to a memory allocation failing but the software not checking the result of the allocation. There must be some checking in the 32-bit (for Linux) version of the Rosetta & Rosetta Mini binaries, since I've encountered this error message on an older box with only 256MB of memory: "working set size > client RAM limit: 180.00MB > 179.51MB". (But it would be nice to have the check happen ahead of time, before sending the WU to the computer.) |
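To see how narrowly these tasks miss the limit, the error line quoted in this thread can be parsed and the shortfall computed. The regular expression below is written against the message text exactly as it appears in the posts; whether every client version formats it the same way is an assumption, and `ram_shortfall_mb` is a hypothetical helper.

```python
import re

# Pattern for the message text as quoted in the thread.
LIMIT_PATTERN = re.compile(
    r"working set size > client RAM limit: ([\d.]+)MB > ([\d.]+)MB"
)


def ram_shortfall_mb(message):
    """Return how far the working set exceeds the limit, in MB, or None."""
    match = LIMIT_PATTERN.search(message)
    if match is None:
        return None
    working_set_mb = float(match.group(1))
    limit_mb = float(match.group(2))
    return round(working_set_mb - limit_mb, 2)


line = "working set size > client RAM limit: 180.00MB > 179.51MB"
print(ram_shortfall_mb(line))  # 0.49
```

The three messages quoted in the thread work out to shortfalls of under 3MB each, so the limit is being exceeded by a small margin rather than a large one.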
bormolino Joined: 16 May 13 Posts: 4 Credit: 160,977 RAC: 0 |
The graphics of the Rosetta 4.07 WU for COVID-19 do not work. They show "Stage unknown" and "No shared mem" inside the graphics window. The graphics of the other WUs work without any problems. |
EHM-1 Joined: 21 Mar 20 Posts: 23 Credit: 183,782 RAC: 0 |
Hello all- Longtime SETI@Home user here, new to Rosetta. Hope I'm posting in the right place; please advise me if not. I attached several days ago, and the screensaver was displaying what I would expect for processing until a couple days ago. Since at least yesterday morning (midday Mar 28 UT), the processing screen displays what I would call a blank template, with no indication that anything is being processed. See image below. Any ideas? Anyone else encountering this? I could find no mention of anything similar in the forums. Thanks in advance for any help. Eric PS- Just after posting, I now see that bormolino might be reporting the same issue just above my post. |
©2024 University of Washington
https://www.bakerlab.org