Message boards : Number crunching : Problems and Technical Issues with Rosetta@home
Author | Message |
---|---|
Grant (SSSF) Send message Joined: 28 Mar 20 Posts: 1671 Credit: 17,527,680 RAC: 23,122 |
Edit - although still no luck with Scheduler responses; it says it's down for maintenance. Still down. Grant Darwin NT |
MarkJ Send message Joined: 28 Mar 20 Posts: 72 Credit: 25,238,680 RAC: 0 |
app_config.xml in rosetta project directory
---------------------
You don't need the app tags if you are using a project_max_concurrent. It applies to the project as a whole, not to a particular app. You can simplify it to:

```
<app_config>
    <project_max_concurrent>3</project_max_concurrent>
</app_config>
```

BOINC blog |
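For contrast, limiting a single app is what the app tags are for. A minimal sketch, assuming the application name is rosetta; the name element must match the app name in your client_state.xml, so check that first:

```
<app_config>
    <app>
        <name>rosetta</name>
        <max_concurrent>3</max_concurrent>
    </app>
</app_config>
```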
tom Send message Joined: 29 Nov 08 Posts: 10 Credit: 6,044,733 RAC: 0 |
For some reason, I have been set to ONE work unit a day for quite a while now. After literally years of processing lots of work units trouble-free, I still don't understand why. AFAIK it started when BOINC switched to SSL, but since I can successfully connect to other sites over SSL, the switchover (yes, I switched too) shouldn't have nuked my ability to communicate with the project. And no, I don't see any errors in the event log, although I'm not very expert at looking through it. Currently running: BOINC 7.16.11, Mac OS X 10.7.5, Mac mini Server i7 |
Falconet Send message Joined: 9 Mar 09 Posts: 353 Credit: 1,222,776 RAC: 4,804 |
Deleted. |
Grant (SSSF) Send message Joined: 28 Mar 20 Posts: 1671 Credit: 17,527,680 RAC: 23,122 |
For some reason, I have been set to ONE work unit a day for quite a while now.
Because all you do is produce errors. If you want more than 1 Task per day, you need to start producing Valid work. Try detaching & re-attaching to the project - that will dump all your current work, but it will make the system re-download the science application. If you are still producing errors after that, then it is most likely a hardware issue - memory, power supply, or overheating (CPU, memory, or PSU) - or possibly an OS issue, but that's very, very unlikely, unless you recently did an update of some sort? Grant Darwin NT |
Brian Nixon Send message Joined: 12 Apr 20 Posts: 293 Credit: 8,432,366 RAC: 0 |
This is the same problem you reported five months ago? The server is limiting the amount of work it sends because your computer is returning so many errors. It can’t be related to SSL, since BOINC is successfully communicating with the server and able to download tasks. The other thing that changed around the same time was the update to application version 4.20. Your recent tasks have all failed within seconds of starting, which suggests there’s some kind of fundamental incompatibility between the application and your system. Are there any Mac OS experts here who can offer suggestions on how to diagnose that? You could try the Mac forum, but it’s pretty quiet in there… |
nikolce Send message Joined: 28 Apr 07 Posts: 2 Credit: 2,002,356 RAC: 0 |
Hi all, Can someone tell me if I should abort the below tasks? It's a bit annoying to find your CPU crunching nothing for two days. Thank you! |
Bryn Mawr Send message Joined: 26 Dec 18 Posts: 389 Credit: 12,070,320 RAC: 12,300 |
Hi all,
Have you any other Rosetta jobs running correctly alongside those? Can you see the file names of the WUs to see if they’re all the same type? Could you suspend (some of) those tasks and see if the replacement tasks run OK? That should tell you whether the problem is with those particular tasks or with your setup. |
Brian Nixon Send message Joined: 12 Apr 20 Posts: 293 Credit: 8,432,366 RAC: 0 |
Those look badly broken. They’re running but making no progress, which is why the estimated remaining time has grown so large. Kill them. Is this your Intel machine? It is returning nothing but errors lately. There seems to be something seriously wrong with it. |
nikolce Send message Joined: 28 Apr 07 Posts: 2 Credit: 2,002,356 RAC: 0 |
Thanks for bringing the errors to my attention. Apparently from that point on it had been returning nothing but errors. I caught it today because my RAC dropped a little. As recommended I killed the tasks, and the next tasks started to error within minutes. I restarted the PC and tested the CPU and memory with Prime95 on Smallest and Large FFTs for 15 minutes each (I know it should be longer), with no errors. Meanwhile the PC was not showing any signs of instability. I've resumed the project and the tasks have been doing fine for almost an hour now. I'll keep a close eye on it over the next couple of days. I thought I'd have to retire the old bugger. Thank you! |
Grant (SSSF) Send message Joined: 28 Mar 20 Posts: 1671 Credit: 17,527,680 RAC: 23,122 |
Thanks for bringing the errors to my attention. Apparently from that point on it had been returning nothing but errors. I caught it today because my RAC dropped a little.
If it occurs again, try rebooting before aborting the Tasks. Grant Darwin NT |
Brian Nixon Send message Joined: 12 Apr 20 Posts: 293 Credit: 8,432,366 RAC: 0 |
stub_cyc_target tasks completing in anywhere from under 2 hours to over 19 (against a default target of 8). (Not a problem; just an observation…) |
lazyacevw Send message Joined: 18 Mar 20 Posts: 12 Credit: 93,576,463 RAC: 0 |
Can anyone comment on the level of compression the tasks are sent with, and separately the level of compression that is applied before submitting completed workloads? Anyone know the type of compression? lzma2? bzip2?

The reason I ask is that beginning about 1 or 2 months ago, I started exceeding my 22 GB monthly data cap about 22 to 25 days into the month. I've been running pretty much the same batch of clients since April, and the internet connection is dedicated solely to R@H, so I can only assume that work units have become more complex and larger. If I don't have any data hiccups, I average around 150,000 credits. I've started to shut down a few clients in order to just stay under my data cap. I'm not sure if the usage is purely work units or if it is the ancillary files R@H downloads (like the database_357d5d93529_n_methyl files) that it uses to set up different variables and references.

I'm running dedicated Linux machines that have been pared down to preclude any unnecessary or ancillary data usage. I've even gone so far as to set the updater service to not check for updates on all of the machines. I'd really like to bring a few dozen more cores online, but I'm in a holding pattern until my data usage goes down.

When I run out of data, I have to tether my phone to each computer each day to do batch uploads and downloads. This isn't ideal because it's a little inconvenient having to plug into each computer, plus I'm sure it creates unnecessary network oscillations in the R@H distribution server. There have been a few days where I forgot to connect and kept breaking the 2-day crunch time limit. I'm sure that is inefficient for the project as a whole.

Would the devs consider upping the compression levels? It does slightly increase client and server overhead, but most computers are more than capable of spending an extra minute or two to accomplish more compression. It might help people bring more systems online. |
Brian Nixon Send message Joined: 12 Apr 20 Posts: 293 Credit: 8,432,366 RAC: 0 |
In my observation, most downloaded data files use Zip compression. I haven’t paid any attention to what gets uploaded, though I’ve seen gzip mentioned in some logs and file names. The big (500 MB Zip) database and the applications can go months between updates, so even though they’re relatively large they shouldn’t be affecting your recent usage. You can see a history of data download and upload totals in the file daily_xfer_history.xml in your BOINC data directory; you could analyse that to see how usage has changed over time. As a single data point, my current usage seems to be averaging around 4 MB per task. You could increase the target CPU run time in your project preferences to run each task for longer (and thus fewer tasks in total, requiring less overall data transfer, in any given period of time) while maintaining the same credit rate. |
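For anyone who wants to analyse that file, here is a minimal Python sketch. It assumes the usual daily_xfer_history.xml layout of <dx> records containing <when> (days since the Unix epoch) and <up>/<down> byte counts; that layout is an assumption, so verify it against your own file first:

```python
# Hedged sketch: summarise BOINC's daily_xfer_history.xml by month.
# Assumed layout: <dx> records with <when> (days since the Unix epoch)
# and <up>/<down> (bytes). Check your own file before relying on this.
import xml.etree.ElementTree as ET
from collections import defaultdict
from datetime import date, timedelta

tree = ET.parse('daily_xfer_history.xml')   # lives in the BOINC data directory
monthly = defaultdict(lambda: [0.0, 0.0])   # (year, month) -> [up, down] in bytes

for dx in tree.getroot().iter('dx'):
    day = date(1970, 1, 1) + timedelta(days=int(dx.findtext('when')))
    monthly[(day.year, day.month)][0] += float(dx.findtext('up'))
    monthly[(day.year, day.month)][1] += float(dx.findtext('down'))

for (y, m), (up, down) in sorted(monthly.items()):
    print(f"{y}-{m:02}: up {up/1e9:.2f} GB, down {down/1e9:.2f} GB")
```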
Grant (SSSF) Send message Joined: 28 Mar 20 Posts: 1671 Credit: 17,527,680 RAC: 23,122 |
You could increase the target CPU run time in your project preferences to run each task for longer (and thus fewer tasks in total, requiring less overall data transfer, in any given period of time) while maintaining the same credit rate.
Someone would need to try that out to see if it would make any significant difference; I suspect the difference would be minimal (if any). The downloads of Tasks and their support files are rather small in size. It's the returned result files that can be extremely large (I've noticed a couple over 900 MB in size). Running the Tasks for longer will result in a larger result file, so instead of returning 2 smaller files, you're returning one larger one. The only saving is fewer downloaded files - which, as I mentioned, are very small in comparison - resulting in little if any data transfer reduction.

The reason I ask is that beginning about 1 or 2 months ago, I started exceeding my 22 GB monthly data cap about 22 to 25 days into the month.
Which is around the time you brought some new systems online: a 4 core/thread, an 8c/t, and of course the 64c/t Threadripper system. They would all have a significant impact on the amount of results you return.

When I run out of data, I have to tether my phone to each computer each day to do batch uploads and downloads. This isn't ideal because it's a little inconvenient having to plug into each computer
Does your modem/router support WiFi? If so, just set up your phone as a Personal Hotspot. If not, put a WiFi dongle on one of the systems, connect that system to the phone, and enable internet sharing from that system for all of the others.

I'm running dedicated Linux machines that have been pared down to preclude any unnecessary or ancillary data usage. I've even gone so far as to set the updater service to not check for updates on all of the machines.
I suspect Linux has a similar option to Windows Update that allows systems to check for their updates on the local network before checking for them over the internet.

Would the devs consider upping the compression levels? It does slightly increase client and server overhead, but most computers are more than capable of spending an extra minute or two to accomplish more compression.
The problem is that you can only compress data up to a certain point, after which no further compression is possible. And spending 2, 3 or 4 times as long compressing (and uncompressing) the data for a 3-5% saving in file size is really not an option. Grant Darwin NT |
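The diminishing returns are easy to demonstrate yourself. A rough, self-contained sketch, with zlib standing in for whatever codec the project actually uses (that choice is an assumption):

```python
# Rough illustration of diminishing returns from higher compression levels.
# zlib is a stand-in here; the project's actual codec may differ.
import time
import zlib

# Synthetic, somewhat-compressible payload: this script's own bytes, repeated.
data = open(__file__, 'rb').read() * 2000

for level in (1, 6, 9):
    start = time.perf_counter()
    packed = zlib.compress(data, level)
    elapsed = time.perf_counter() - start
    print(f"level {level}: {len(packed)/len(data):6.1%} of original, {elapsed:.3f}s")
```

Typically the jump from level 1 to 6 buys most of the size reduction; level 9 costs noticeably more CPU time for a few percent more.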
MarkJ Send message Joined: 28 Mar 20 Posts: 72 Credit: 25,238,680 RAC: 0 |
You could put a proxy server on your network, and that can save on duplicate transfers. Unfortunately, since Rosetta switched to HTTPS it won’t help with project data files. I find the most benefit is for OS updates: the first machine downloads them, but subsequent requests come from the proxy server. It also works well with Einstein and their locality scheduler. BOINC blog |
lazyacevw Send message Joined: 18 Mar 20 Posts: 12 Credit: 93,576,463 RAC: 0 |
In my observation, most downloaded data files use Zip compression. I haven’t paid any attention to what gets uploaded, though I’ve seen gzip mentioned in some logs and file names.
Thanks. I too have seen gz and zip files in project slot directories, along with a lot of uncompressed files. It just isn't clear which files were received and decompressed, which were generated along the way, and which were created when all of the work is said and done. Essentially, I can only easily see the middle part of the data processing.

You can see a history of data download and upload totals in the file daily_xfer_history.xml in your BOINC data directory; you could analyse that to see how usage has changed over time.
I took a look at the file but I couldn't make much sense of it.

Which is around the time you brought some new systems online.
It does appear from my profile that I brought new systems online recently, but in actuality I just reinstalled the OS on most of my systems when R@H ran out of tasks. That TR has been running every day since early April.

Does your modem/router support WiFi? If so, just set up your phone as a Personal Hotspot.
I'm using switches connected to a wired personal hotspot. I tried using a USB-C to Ethernet adapter, but my Samsung S8 doesn't appear to support it. Internet sharing isn't as easy to set up as it is with Windows, but I will look into it. My clients do not have wireless adapters either. I might see about using a Windows laptop, having it connect wirelessly to my phone and then share its connection over LAN. So, I do have a few options...

You could put a proxy server on your network and that can save on duplicate transfers
I really should set up a pfSense/Squid box for the network - partly for extra security, but also for the caching feature. It would work as long as the Linux updater doesn't use HTTPS. I could probably also set up a local update mirror. Thanks for the tip! |
robertmiles Send message Joined: 16 Jun 08 Posts: 1232 Credit: 14,269,631 RAC: 3,846 |
A task running for over 12 hours so far, even though I've selected a run length of 8 hours: 3stub_cyc_target_1cwa_01152_14_extract_B_SAVE_ALL_OUT_1044879_311 The estimated time remaining is INCREASING, not decreasing. It is doing checkpoints WITHOUT ending the task. 26 seconds since the last one. Is something wrong with this task? Should I abort it? |
Speedy Send message Joined: 25 Sep 05 Posts: 163 Credit: 808,098 RAC: 0 |
A task running for over 12 hours so far, even though I've selected a run length of 8 hours:
Have you tried exiting BOINC and opening it again, or restarting your computer/laptop? If after a restart it starts back at, for example, 10 hours, let it run and see if it will finish within 12 hours. If it doesn't and keeps running past 13 hours, feel free to abort. Have a crunching good day!! |
Brian Nixon Send message Joined: 12 Apr 20 Posts: 293 Credit: 8,432,366 RAC: 0 |
I had some stub_cyc_target tasks complete in under 2 hours; some after more than 19. I aborted one that had been running for 1½ days, as even though it still seemed to be running, its progress percentage was increasing so slowly that it didn’t seem likely it would reach 100% in any reasonable amount of time. Maybe leave yours for a few more hours, and kill it if it gets to a full day without completing? Once a task has overrun, its remaining time estimate becomes meaningless, as BOINC has no way of knowing when it will finish. And sometimes tasks can get in a state where they are running but not reporting progress, so BOINC estimates progress using elapsed time towards a target perpetually 10 minutes in the future, meaning the value only asymptotically approaches 100%. |
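To make that last point concrete, a small illustration (not BOINC's actual code, just the arithmetic described above): if the estimated finish is treated as always 10 minutes (600 s) away, the implied fraction done is elapsed / (elapsed + 600), which climbs forever but never reaches 100%.

```python
# If completion is perpetually "600 s from now", reported progress
# only asymptotically approaches 100%.
for elapsed in (600, 3_600, 36_000, 360_000):
    fraction = elapsed / (elapsed + 600)
    print(f"{elapsed:>7} s elapsed -> {fraction:.4%} reported")
```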