Message boards : Number crunching : Rosetta Beta 6.00
Author | Message |
---|---|
[VENETO] boboviz Joined: 1 Dec 05 Posts: 1994 Credit: 9,623,704 RAC: 9,591 |
A lot of 6.03 errors... ERROR: Error in protocols::cyclic_peptide_predict::SimpleCycpepPredictApplication::set_up_n_to_c_cyclization_mover() function: residue 1 does not have a LOWER_CONNECT. |
Sid Celery Joined: 11 Feb 08 Posts: 2125 Credit: 41,228,659 RAC: 10,982 |
A lot of 6.03 errors... Yup. The queue, which was over 300k tasks when I saw it last night, seems to have been removed already, so noticed. Most tasks crash out within 20 seconds, but I've had a few run for several hours before crashing out. Also, the ones that do run don't seem to checkpoint, but I've got two (out of 40-50) that do and I'm hoping might even complete successfully. |
Sid Celery Joined: 11 Feb 08 Posts: 2125 Credit: 41,228,659 RAC: 10,982 |
A lot of 6.03 errors... And they did complete successfully, so not a complete waste of time (just mostly a waste of time) |
Conan Joined: 11 Oct 05 Posts: 150 Credit: 4,198,375 RAC: 1,147 |
Getting this error on the work units that did not fail in 29 seconds but ran for many, many hours (my preset is 6 hours): Task 1533688974 Name 7hal_NME_af2_hal_07_283_SAVE_ALL_OUT_2961446_35_0 Workunit 1365145536 Created 23 Aug 2023, 0:58:13 UTC Sent 23 Aug 2023, 1:35:20 UTC Report deadline 26 Aug 2023, 1:35:20 UTC Received 23 Aug 2023, 20:22:44 UTC Server state Over Outcome Computation error Client state Compute error Exit status 0 (0x00000000) Computer ID 1503952 Run time 12 hours 20 min 35 sec CPU time 12 hours 7 min 21 sec Validate state Invalid Credit 0.00 Device peak FLOPS 7.87 GFLOPS Application version Rosetta Beta v6.00 x86_64-pc-linux-gnu Peak working set size 363.05 MB Peak swap size 430.13 MB Peak disk usage 24.06 MB Stderr output <core_client_version>7.17.0</core_client_version> <![CDATA[ <stderr_txt> command: ../../projects/boinc.bakerlab.org_rosetta/rosetta_beta_6.00_x86_64-pc-linux-gnu @7hal_NME_af2_hal_07_283.flags -nstruct 10000 -cpu_run_time 28800 -boinc:max_nstruct 20000 -checkpoint_interval 120 -mute all -database minirosetta_database -in::file::zip minirosetta_database.zip -boinc::watchdog -boinc::cpu_run_timeout 36000 -run::rng mt19937 Using database: database_0f7f01a1b07/database ====================================================== DONE :: 1 starting structures 43641.1 cpu seconds This process generated 1 decoys from 1 attempts ====================================================== BOINC :: WS_max 0 BOINC :: Watchdog shutting down... 06:20:36 (24284): called boinc_finish(0) </stderr_txt> <message> upload failure: <file_xfer_error> <file_name>7hal_NME_af2_hal_07_283_SAVE_ALL_OUT_2961446_35_0_r1774002546_0</file_name> <error_code>-161 (not found)</error_code> </file_xfer_error> </message> ]]> I have many others like this. I have also had 2 that ran normally, for 6-odd hours, and completed without error. But only 2; the ones left on the computer have been running for over 10 hours already and are only at 60% or so. This is on Linux. Conan |
Sid Celery Joined: 11 Feb 08 Posts: 2125 Credit: 41,228,659 RAC: 10,982 |
Getting this error on the work units that did not fail in 29 seconds but ran for many, many hours (my preset is 6 hours) Yeah, this is exactly the pattern I saw on my Windows box too. Majority: error within 20 secs. Minority: fails to checkpoint for many hours, 1st model completes, filename error on upload. Exception: completes OK. Some rb tasks have just come down to be going on with, but very few. |
mrchips Joined: 11 Nov 09 Posts: 10 Credit: 14,697,477 RAC: 12,433 |
ALL mine have failed Outcome Computation error Client state Compute error Exit status 0 (0x00000000) Computer ID 6260865 Run time 10 hours 1 min CPU time 9 hours 55 min 13 sec Validate state Invalid 10 hours wasted. I will try to abort these when I see them.... |
Jean-David Beyer Joined: 2 Nov 05 Posts: 188 Credit: 6,431,332 RAC: 5,665 |
ALL mine have failed ALL seven of mine failed too. This one ran a long time. The others failed pretty fast: Task 1533723354 Name 7hal_NME_af2_hal_07_73_SAVE_ALL_OUT_2961577_97_0 Workunit 1365167789 Created 23 Aug 2023, 3:30:56 UTC Sent 23 Aug 2023, 3:59:17 UTC Report deadline 26 Aug 2023, 3:59:17 UTC Received 24 Aug 2023, 15:31:30 UTC Server state Over Outcome Computation error Client state Compute error Exit status 0 (0x00000000) Computer ID 5910575 Run time 3 hours 47 min 17 sec CPU time 3 hours 43 min 42 sec Validate state Invalid Credit 0.00 Device peak FLOPS 6.02 GFLOPS Application version Rosetta Beta v6.00 x86_64-pc-linux-gnu Peak working set size 353.82 MB Peak swap size 427.60 MB Peak disk usage 24.05 MB Stderr output <core_client_version>7.20.2</core_client_version> <![CDATA[ <stderr_txt> command: ../../projects/boinc.bakerlab.org_rosetta/rosetta_beta_6.00_x86_64-pc-linux-gnu @7hal_NME_af2_hal_07_73.flags -nstruct 10000 -cpu_run_time 28800 -boinc:max_nstruct 20000 -checkpoint_interval 120 -mute all -database minirosetta_database -in::file::zip minirosetta_database.zip -boinc::watchdog -boinc::cpu_run_timeout 36000 -run::rng mt19937 Using database: database_0f7f01a1b07/database ====================================================== DONE :: 1 starting structures 13422.1 cpu seconds This process generated 1 decoys from 1 attempts ====================================================== BOINC :: WS_max 0 BOINC :: Watchdog shutting down... 11:08:36 (3335130): called boinc_finish(0) </stderr_txt> <message> upload failure: <file_xfer_error> <file_name>7hal_NME_af2_hal_07_73_SAVE_ALL_OUT_2961577_97_0_r1168317456_0</file_name> <error_code>-161 (not found)</error_code> </file_xfer_error> </message> ]]> |
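For anyone wanting to check a batch of results for this same failure, here is a minimal Python sketch. It is not a tool the project provides: the function name `summarize` and the regular expressions are assumptions, built only from the stderr text quoted in the posts above, and the sample string is a whitespace-collapsed excerpt of that output.

```python
import re

# Excerpt in the shape of the stderr dumps quoted in this thread.
SAMPLE = (
    "DONE :: 1 starting structures 13422.1 cpu seconds "
    "This process generated 1 decoys from 1 attempts "
    "BOINC :: Watchdog shutting down... "
    "<message> upload failure: <file_xfer_error> "
    "<file_name>7hal_NME_af2_hal_07_73_SAVE_ALL_OUT_2961577_97_0_r1168317456_0</file_name> "
    "<error_code>-161 (not found)</error_code> </file_xfer_error> </message>"
)

# Hypothetical patterns, written to match the quoted text only.
CPU_SECONDS = re.compile(r"([\d.]+) cpu seconds")
DECOYS = re.compile(r"generated (\d+) decoys from (\d+) attempts")
UPLOAD_ERROR = re.compile(
    r"<file_name>([^<]+)</file_name>\s*<error_code>(-?\d+) \(([^)]*)\)</error_code>"
)


def summarize(stderr_text: str) -> dict:
    """Pull the run summary and any upload failure out of a task's stderr text."""
    summary = {"cpu_seconds": None, "decoys": None, "upload_error": None}

    cpu = CPU_SECONDS.search(stderr_text)
    if cpu:
        summary["cpu_seconds"] = float(cpu.group(1))

    decoys = DECOYS.search(stderr_text)
    if decoys:
        summary["decoys"] = int(decoys.group(1))

    failure = UPLOAD_ERROR.search(stderr_text)
    if failure:
        name, code, reason = failure.groups()
        summary["upload_error"] = {"file": name, "code": int(code), "reason": reason}

    return summary


print(summarize(SAMPLE))
```

A task that reports hours of CPU time, a single decoy, and an upload error of -161 fits the "minority" pattern described earlier in the thread.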
Sid Celery Joined: 11 Feb 08 Posts: 2125 Credit: 41,228,659 RAC: 10,982 |
ALL mine have failed If tasks fail quickly, fine. If tasks don't fail quickly, check the properties of the task first. - If it hasn't checkpointed, certainly delete it - invariably no good will come of it. - If it <has> checkpointed (a rarity) let it run. These odd few do seem to succeed based on my limited sample. If it turns out this advice is wrong, please do come back and correct me. |
[VENETO] boboviz Joined: 1 Dec 05 Posts: 1994 Credit: 9,623,704 RAC: 9,591 |
<message> All my failed WUs have a slightly different error code <message> |
[VENETO] boboviz Joined: 1 Dec 05 Posts: 1994 Credit: 9,623,704 RAC: 9,591 |
If tasks fail quickly, fine. After over 9 hours of running, all errored except 3 WUs, which were OK. P.S. I don't see the checkpoint argument in the properties |
.clair. Joined: 2 Jan 07 Posts: 274 Credit: 26,399,595 RAC: 0 |
I have not been paying much attention to Rosetta lately, so I didn't notice till just now that a broken `beta` has wasted 20 hours stuck in a loop |
Sid Celery Joined: 11 Feb 08 Posts: 2125 Credit: 41,228,659 RAC: 10,982 |
If tasks fail quickly, fine. 3 out of however many is better than I'm getting tbh. Only rarely do I see two good tasks at a time. Regarding checkpointing, select one task and click on Properties CPU time 00:40:29 CPU time since checkpoint 00:05:13 If they show the same amount of time after 15 minutes or so then it's not checkpointing at all, so abort it straight away. If they're different, like above, you'll be lucky, in my experience, and it will report correctly and give proper credit too. I know it's not great advice, but it's all I have to offer anyone |
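The rule of thumb in the post above can be written down as a short Python sketch. The function names and the 15-minute grace period default are illustrative assumptions taken from the wording of the post; the two time strings are the values as they appear in the task Properties dialog.

```python
def parse_hms(text: str) -> int:
    """Convert an HH:MM:SS string from the Properties dialog to seconds."""
    hours, minutes, seconds = (int(part) for part in text.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def has_checkpointed(cpu_time: str, cpu_since_checkpoint: str) -> bool:
    """A checkpoint exists if less CPU time has passed since it than in total."""
    return parse_hms(cpu_since_checkpoint) < parse_hms(cpu_time)


def should_abort(cpu_time: str, cpu_since_checkpoint: str,
                 grace_seconds: int = 15 * 60) -> bool:
    """Suggest aborting only after the grace period, and only if no checkpoint was written."""
    if parse_hms(cpu_time) < grace_seconds:
        return False  # too early to tell
    return not has_checkpointed(cpu_time, cpu_since_checkpoint)


# Values from the post above: the task has checkpointed, so let it run.
print(should_abort("00:40:29", "00:05:13"))
# Both counters equal after 40 minutes: no checkpoint, so abort.
print(should_abort("00:40:29", "00:40:29"))
```

As the post says, this is based on a limited sample, so treat the result as a hint, not a guarantee.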
Sid Celery Joined: 11 Feb 08 Posts: 2125 Credit: 41,228,659 RAC: 10,982 |
I have not been paying much attention to Rosetta lately, so I didn't notice till just now that a broken `beta` has wasted 20 hours stuck in a loop I don't believe it's the Rosetta beta app, but the family of "7hal" tasks that are coming through right now. Tasks, not the app. I've just grabbed a few Rosetta 4.20 "rb" tasks and all are running well, fwiw |
mmonnin Joined: 2 Jun 16 Posts: 59 Credit: 24,222,307 RAC: 83,030 |
I wish I had checked task credit before wasting 12 hours of run time per task on 32 completed tasks. I have now aborted many others that have not checkpointed. |
[VENETO] boboviz Joined: 1 Dec 05 Posts: 1994 Credit: 9,623,704 RAC: 9,591 |
I don't believe it's the Rosetta beta app, but the family of "7hal" tasks that are coming through right now. But this family didn't run on Ralph to test it before the production. So, usual waste of time and resources |
.clair. Joined: 2 Jan 07 Posts: 274 Credit: 26,399,595 RAC: 0 |
I have not been paying much attention to Rosetta lately, so I didn't notice till just now that a broken `beta` has wasted 20 hours stuck in a loop These Hal7000 tasks have got a mind of their own . . . 7hal_nme_af2_hal_07 Hmm . . wasn't there some computer that said "I'm sorry Dave, but I can't allow you to process that"? |
.clair. Joined: 2 Jan 07 Posts: 274 Credit: 26,399,595 RAC: 0 |
Oops, turns out that one was HAL9000. I just looked it up on wiki. I must have been thinking of HAL7600 that works with win7, and missed the edit hour |
Jean-David Beyer Joined: 2 Nov 05 Posts: 188 Credit: 6,431,332 RAC: 5,665 |
I don't believe it's the Rosetta beta app, but the family of "7hal" tasks that are coming through right now. You may be right. Most of my beta ones are running a long time before failing. Here is one. Of those I received lately, all have run quite a long time and they are all 7hal: Task 1534000991 Name 7hal_nme_af2_hal_07_313_SAVE_ALL_OUT_2961707_989_1 Workunit 1365344374 Created 25 Aug 2023, 1:52:11 UTC Sent 25 Aug 2023, 1:52:15 UTC Report deadline 28 Aug 2023, 1:52:15 UTC Received 26 Aug 2023, 12:18:43 UTC Server state Over Outcome Computation error Client state Compute error Exit status 0 (0x00000000) Computer ID 5910575 Run time 3 hours 20 min 58 sec CPU time 3 hours 19 min 31 sec Validate state Invalid Credit 0.00 Device peak FLOPS 6.02 GFLOPS Application version Rosetta Beta v6.00 x86_64-pc-linux-gnu Peak working set size 352.36 MB Peak swap size 426.14 MB Peak disk usage 24.05 MB Stderr output <core_client_version>7.20.2</core_client_version> <![CDATA[ <stderr_txt> command: ../../projects/boinc.bakerlab.org_rosetta/rosetta_beta_6.00_x86_64-pc-linux-gnu @7hal_nme_af2_hal_07_313.flags -nstruct 10000 -cpu_run_time 28800 -boinc:max_nstruct 20000 -checkpoint_interval 120 -mute all -database minirosetta_database -in::file::zip minirosetta_database.zip -boinc::watchdog -boinc::cpu_run_timeout 36000 -run::rng mt19937 Using database: database_0f7f01a1b07/database ====================================================== DONE :: 1 starting structures 11971.5 cpu seconds This process generated 1 decoys from 1 attempts ====================================================== BOINC :: WS_max 0 BOINC :: Watchdog shutting down... 07:40:01 (3501495): called boinc_finish(0) </stderr_txt> <message> upload failure: <file_xfer_error> <file_name>7hal_nme_af2_hal_07_313_SAVE_ALL_OUT_2961707_989_1_r394580450_0</file_name> <error_code>-161 (not found)</error_code> </file_xfer_error> </message> ]]> The 4.20 ones seem to run just fine.
Task 1533981934 Name rb_08_24_544889_539739_ab_t000__h002_robetta_IGNORE_THE_REST_04_12_2961726_20_0 Workunit 1365333581 Created 24 Aug 2023, 22:04:43 UTC Sent 24 Aug 2023, 22:34:04 UTC Report deadline 27 Aug 2023, 22:34:04 UTC Received 26 Aug 2023, 13:07:40 UTC Server state Over Outcome Success Client state Done Exit status 0 (0x00000000) Computer ID 5910575 Run time 7 hours 52 min 1 sec CPU time 7 hours 46 min 47 sec Validate state Valid Credit 423.06 Device peak FLOPS 6.02 GFLOPS Application version Rosetta v4.20 x86_64-pc-linux-gnu Peak working set size 988.96 MB Peak swap size 1,130.77 MB Peak disk usage 31.53 MB Stderr output <core_client_version>7.20.2</core_client_version> <![CDATA[ <stderr_txt> command: ../../projects/boinc.bakerlab.org_rosetta/rosetta_4.20_x86_64-pc-linux-gnu @rb_08_24_544889_539739_ab_t000__h002_robetta_FLAGS -in::file::fasta t000__h002.fasta -in:file:boinc_wu_zip rb_08_24_544889_539739_ab_t000__h002_robetta.zip -frag3 rb_08_24_544889_539739_ab_t000__h002_robetta.200.3mers.index.gz -fragA rb_08_24_544889_539739_ab_t000__h002_robetta.200.12mers.index.gz -fragB rb_08_24_544889_539739_ab_t000__h002_robetta.200.4mers.index.gz -nstruct 10000 -cpu_run_time 28800 -boinc:max_nstruct 20000 -checkpoint_interval 120 -mute all -database minirosetta_database -in::file::zip minirosetta_database.zip -boinc::watchdog -boinc::cpu_run_timeout 36000 -run::rng mt19937 -constant_seed -jran 1370977 Using database: database_357d5d93529_n_methyl/minirosetta_database ====================================================== DONE :: 1 starting structures 28007.6 cpu seconds This process generated 24 decoys from 24 attempts ====================================================== BOINC :: WS_max 1.01981e+09 09:07:16 (3491732): called boinc_finish(0) </stderr_txt> ]]> |
Sid Celery Joined: 11 Feb 08 Posts: 2125 Credit: 41,228,659 RAC: 10,982 |
I don't believe it's the Rosetta beta app, but the family of "7hal" tasks that are coming through right now. The pattern looks consistent as I look back over my task history. The proof will be if we get a different batch of Beta 6.03 tasks and they run ok. We wait in hope. |
Jeff Joined: 24 Jan 15 Posts: 4 Credit: 1,352,221 RAC: 744 |
I have been a participant in Rosetta@home for 8 years, and only rarely do my allocated tasks fail due to computation errors. Yet lately, all but one of about 20 of my tasks on the beta 6.03 app, with the 7hal prefix in the task name, have led to a 'computation error' message. Sometimes the error appears within a few moments of starting, but much more frequently it comes after running many times longer than the original estimated 'remaining time'. I want to process as many Rosetta tasks as I can, but a lot of my computation time is wasted by this problem. I expect this is also a problem for Rosetta@home, because allocated tasks are not successfully processed by users who also experience this problem. Does anyone know what accounts for this? Does anyone know how I can deal with this problem? |
©2024 University of Washington
https://www.bakerlab.org