Message boards : Number crunching : Problems and Technical Issues with Rosetta@home
Author | Message |
---|---|
wolfman1360 Joined: 18 Feb 17 Posts: 72 Credit: 18,450,036 RAC: 0 |
Several of these tasks are running for twice my set computation time and not checkpointing to boot. I hope I get some sort of credit for these. Thank you, this is super helpful and I will do so. I don't think some of these tasks are going to complete in time for the deadline without checkpointing. I'm going to try and keep the client running but they're also using pretty excessive amounts of RAM. I thought the quorum for each task (number of machines to complete) needed to be 1? Or do you mean others, apart from myself, also get this task, in case I don't complete it first? |
Kissagogo27 Joined: 31 Mar 20 Posts: 86 Credit: 2,889,169 RAC: 2,376 |
Hi, for me, with a run time setting of 12h, some of them finish in just 8h! |
robertmiles Joined: 16 Jun 08 Posts: 1232 Credit: 14,269,631 RAC: 2,588 |
[snip] Thank you, this is super helpful and I will do so. The usual quorum used to be 2, but has often been 1 lately. A quorum of 1 is adequate only for tasks for which some quick method of checking the output of the task is available. If the quorum is 2, the first two sets of task output files returned must agree closely enough before they are considered validated. If they don't agree closely enough, one more task is sent out to determine which of the first two tasks is correct enough to be validated. The purpose of the quorum is to check whether the task or tasks returned correct outputs, even if the task did not detect an error. Sometimes, a workunit with an error in its input files will give some credit if other tasks for that same workunit agree on detecting the error. Usually, the first group of tasks sent out has as many tasks as the quorum, so if the quorum is greater than 1, at least one other person will also get a task for that workunit. For each task that goes past its deadline, one more task for that workunit will be sent out. You have a head start on any task sent due to another task reaching its deadline, and therefore some chance of still returning it in time. If the tasks are using excessive amounts of RAM, you may need to tell BOINC to reduce the number of tasks it is allowed to run at the same time, so that the reduced number will fit in the amount of RAM you have available. I normally keep my computer running and doing BOINC work day and night, so it can handle tasks that go over 24 hours between checkpoints. |
robertmiles Joined: 16 Jun 08 Posts: 1232 Credit: 14,269,631 RAC: 2,588 |
Hi, for me, with a run time setting of 12h, some of them finish in just 8h! Typical if it finishes its list of possible decoys in 8 hours. Also expected if, at the end of a decoy, it calculates the time expected to do one more decoy and finds that it would put the total time too far past the time you set. |
Kissagogo27 Joined: 31 Mar 20 Posts: 86 Credit: 2,889,169 RAC: 2,376 |
ok, thks ;) |
Falconet Joined: 9 Mar 09 Posts: 353 Credit: 1,227,479 RAC: 3,710 |
Funny, I'm running "degrader" units at Rosetta@home and also "degrader" units at Ralph@home. 2 of the Rosetta@home units finished very early after 18 and 56 minutes, respectively. |
wolfman1360 Joined: 18 Feb 17 Posts: 72 Credit: 18,450,036 RAC: 0 |
[snip] Hi, I normally do too, on all but one. Of course that was the one that had these issues. The tasks ended up erroring out though they for some reason displayed a vast amount of credit, over 400. Thanks for the explanation. That clears things up. |
Grant (SSSF) Joined: 28 Mar 20 Posts: 1673 Credit: 17,589,473 RAC: 22,408 |
Is it just my memory playing up, or has the Total queued jobs dropped from over 4 million to just over 1.7 million over night? Grant Darwin NT |
[VENETO] boboviz Joined: 1 Dec 05 Posts: 1994 Credit: 9,523,781 RAC: 8,309 |
Funny, I'm running "degrader" units at Rosetta@home and also "degrader" units at Ralph@home. I have a lot of errors on "degrader" on Ralph with this message: ERROR: Error in core::conformation::Conformation::residue(): The sequence position requested was greater than the number of residues in the pose. |
Grant (SSSF) Joined: 28 Mar 20 Posts: 1673 Credit: 17,589,473 RAC: 22,408 |
Funny, I'm running "degrader" units at Rosetta@home and also "degrader" units at Ralph@home. I've got the same thing: it seems to be one particular group, _5nvx_, that crashes & burns in less than 2 minutes. All the others are crunching without issues (or if they do have issues, they take longer & Validate). Grant Darwin NT |
Falconet Joined: 9 Mar 09 Posts: 353 Credit: 1,227,479 RAC: 3,710 |
Is it just my memory playing up, or has the Total queued jobs dropped from over 4 million to just over 1.7 million over night? It did. It was at 3.899 last I saw. |
Grant (SSSF) Joined: 28 Mar 20 Posts: 1673 Credit: 17,589,473 RAC: 22,408 |
Is it just my memory playing up, or has the Total queued jobs dropped from over 4 million to just over 1.7 million over night? It did. It was at 3.899 last I saw. Phew. Nice to know I haven't completely lost it (yet). Grant Darwin NT |
lazyacevw Joined: 18 Mar 20 Posts: 12 Credit: 93,576,463 RAC: 0 |
Is it just my memory playing up, or has the Total queued jobs dropped from over 4 million to just over 1.7 million over night? Well, 371 of those were "relatively" quick task failures on my systems. I have a sneaky feeling I am going to run out of bandwidth a little sooner this month. |
Tomcat雄猫 Joined: 20 Dec 14 Posts: 180 Credit: 5,386,173 RAC: 0 |
Been getting lots of errors recently: degrader_site_5nvx_jhr_bcov3_SAVE_ALL_OUT_IGNORE_THE_REST_1co8fa9r_1729406_12_1 <core_client_version>7.16.11</core_client_version> <![CDATA[ <message> Incorrect function. (0x1) - exit code 1 (0x1)</message> <stderr_txt> command: projects/boinc.bakerlab.org_rosetta/rosetta_4.20_windows_x86_64.exe -run:protocol jd2_scripting -parser:protocol pdblite_boinc_120_10_tfirst--fuse--predictor_v13_degrader_boinc--fuse--tslp_design_v2_degrader_boinc.xml @degrader_site_5nvx_jhr_bcov_flags2 -in:file:silent degrader_site_5nvx_jhr_bcov3_SAVE_ALL_OUT_IGNORE_THE_REST_1co8fa9r.silent -in:file:silent_struct_type binary -silent_gz -mute all -silent_read_through_errors true -out:file:silent_struct_type binary -out:file:silent default.out -in:file:boinc_wu_zip degrader_site_5nvx_jhr_bcov3_SAVE_ALL_OUT_IGNORE_THE_REST_1co8fa9r.zip @degrader_site_5nvx_jhr_bcov3_SAVE_ALL_OUT_IGNORE_THE_REST_1co8fa9r.flags -nstruct 10000 -cpu_run_time 28800 -boinc:max_nstruct 20000 -checkpoint_interval 120 -database minirosetta_database -in::file::zip minirosetta_database.zip -boinc::watchdog -boinc::cpu_run_timeout 36000 -run::rng mt19937 -constant_seed -jran 3430233 Using database: database_357d5d93529_n_methylminirosetta_database ERROR: Error in core::conformation::Conformation::residue(): The sequence position requested was greater than the number of residues in the pose. ERROR:: Exit from: C:cygwin64homeboinc4.17Rosettamainsourcesrccore/conformation/Conformation.hh line: 508 BOINC:: Error reading and gzipping output datafile: default.out 06:16:21 (22764): called boinc_finish(1) </stderr_txt> ]]> degrader_site_5nvx_jhr_bcov3_SAVE_ALL_OUT_IGNORE_THE_REST_1mo7yf7k_1729668_18_1 <core_client_version>7.16.11</core_client_version> <![CDATA[ <message> Incorrect function. 
(0x1) - exit code 1 (0x1)</message> <stderr_txt> command: projects/boinc.bakerlab.org_rosetta/rosetta_4.20_windows_x86_64.exe -run:protocol jd2_scripting -parser:protocol pdblite_boinc_120_10_tfirst--fuse--predictor_v13_degrader_boinc--fuse--tslp_design_v2_degrader_boinc.xml @degrader_site_5nvx_jhr_bcov_flags2 -in:file:silent degrader_site_5nvx_jhr_bcov3_SAVE_ALL_OUT_IGNORE_THE_REST_1mo7yf7k.silent -in:file:silent_struct_type binary -silent_gz -mute all -silent_read_through_errors true -out:file:silent_struct_type binary -out:file:silent default.out -in:file:boinc_wu_zip degrader_site_5nvx_jhr_bcov3_SAVE_ALL_OUT_IGNORE_THE_REST_1mo7yf7k.zip @degrader_site_5nvx_jhr_bcov3_SAVE_ALL_OUT_IGNORE_THE_REST_1mo7yf7k.flags -nstruct 10000 -cpu_run_time 28800 -boinc:max_nstruct 20000 -checkpoint_interval 120 -database minirosetta_database -in::file::zip minirosetta_database.zip -boinc::watchdog -boinc::cpu_run_timeout 36000 -run::rng mt19937 -constant_seed -jran 2084867 Using database: database_357d5d93529_n_methylminirosetta_database ERROR: Error in core::conformation::Conformation::residue(): The sequence position requested was greater than the number of residues in the pose. ERROR:: Exit from: C:cygwin64homeboinc4.17Rosettamainsourcesrccore/conformation/Conformation.hh line: 508 BOINC:: Error reading and gzipping output datafile: default.out 07:41:32 (35104): called boinc_finish(1) </stderr_txt> ]]> degrader_site_5nvx_jhr_bcov3_SAVE_ALL_OUT_IGNORE_THE_REST_2sv7td9t_1730197_15_0 <core_client_version>7.16.11</core_client_version> <![CDATA[ <message> Incorrect function. 
(0x1) - exit code 1 (0x1)</message> <stderr_txt> command: projects/boinc.bakerlab.org_rosetta/rosetta_4.20_windows_x86_64.exe -run:protocol jd2_scripting -parser:protocol pdblite_boinc_120_10_tfirst--fuse--predictor_v13_degrader_boinc--fuse--tslp_design_v2_degrader_boinc.xml @degrader_site_5nvx_jhr_bcov_flags2 -in:file:silent degrader_site_5nvx_jhr_bcov3_SAVE_ALL_OUT_IGNORE_THE_REST_2sv7td9t.silent -in:file:silent_struct_type binary -silent_gz -mute all -silent_read_through_errors true -out:file:silent_struct_type binary -out:file:silent default.out -in:file:boinc_wu_zip degrader_site_5nvx_jhr_bcov3_SAVE_ALL_OUT_IGNORE_THE_REST_2sv7td9t.zip @degrader_site_5nvx_jhr_bcov3_SAVE_ALL_OUT_IGNORE_THE_REST_2sv7td9t.flags -nstruct 10000 -cpu_run_time 28800 -boinc:max_nstruct 20000 -checkpoint_interval 120 -database minirosetta_database -in::file::zip minirosetta_database.zip -boinc::watchdog -boinc::cpu_run_timeout 36000 -run::rng mt19937 -constant_seed -jran 2786048 Using database: database_357d5d93529_n_methylminirosetta_database ERROR: Error in core::conformation::Conformation::residue(): The sequence position requested was greater than the number of residues in the pose. ERROR:: Exit from: C:cygwin64homeboinc4.17Rosettamainsourcesrccore/conformation/Conformation.hh line: 508 BOINC:: Error reading and gzipping output datafile: default.out 09:22:19 (22736): called boinc_finish(1) </stderr_txt> ]]> degrader_site_5nvx_jhr_bcov3_SAVE_ALL_OUT_IGNORE_THE_REST_3qn1ob6y_1729329_16_0 <core_client_version>7.16.11</core_client_version> <![CDATA[ <message> Incorrect function. 
(0x1) - exit code 1 (0x1)</message> <stderr_txt> command: projects/boinc.bakerlab.org_rosetta/rosetta_4.20_windows_x86_64.exe -run:protocol jd2_scripting -parser:protocol pdblite_boinc_120_10_tfirst--fuse--predictor_v13_degrader_boinc--fuse--tslp_design_v2_degrader_boinc.xml @degrader_site_5nvx_jhr_bcov_flags2 -in:file:silent degrader_site_5nvx_jhr_bcov3_SAVE_ALL_OUT_IGNORE_THE_REST_3qn1ob6y.silent -in:file:silent_struct_type binary -silent_gz -mute all -silent_read_through_errors true -out:file:silent_struct_type binary -out:file:silent default.out -in:file:boinc_wu_zip degrader_site_5nvx_jhr_bcov3_SAVE_ALL_OUT_IGNORE_THE_REST_3qn1ob6y.zip @degrader_site_5nvx_jhr_bcov3_SAVE_ALL_OUT_IGNORE_THE_REST_3qn1ob6y.flags -nstruct 10000 -cpu_run_time 28800 -boinc:max_nstruct 20000 -checkpoint_interval 120 -database minirosetta_database -in::file::zip minirosetta_database.zip -boinc::watchdog -boinc::cpu_run_timeout 36000 -run::rng mt19937 -constant_seed -jran 3805639 Using database: database_357d5d93529_n_methylminirosetta_database ERROR: Error in core::conformation::Conformation::residue(): The sequence position requested was greater than the number of residues in the pose. ERROR:: Exit from: C:cygwin64homeboinc4.17Rosettamainsourcesrccore/conformation/Conformation.hh line: 508 BOINC:: Error reading and gzipping output datafile: default.out 16:22:42 (9952): called boinc_finish(1) </stderr_txt> ]]> degrader_site_5nvx_jhr_bcov3_SAVE_ALL_OUT_IGNORE_THE_REST_7ue1xx0j_1729914_20_0 <core_client_version>7.16.11</core_client_version> <![CDATA[ <message> Incorrect function. 
(0x1) - exit code 1 (0x1)</message> <stderr_txt> command: projects/boinc.bakerlab.org_rosetta/rosetta_4.20_windows_x86_64.exe -run:protocol jd2_scripting -parser:protocol pdblite_boinc_120_10_tfirst--fuse--predictor_v13_degrader_boinc--fuse--tslp_design_v2_degrader_boinc.xml @degrader_site_5nvx_jhr_bcov_flags2 -in:file:silent degrader_site_5nvx_jhr_bcov3_SAVE_ALL_OUT_IGNORE_THE_REST_7ue1xx0j.silent -in:file:silent_struct_type binary -silent_gz -mute all -silent_read_through_errors true -out:file:silent_struct_type binary -out:file:silent default.out -in:file:boinc_wu_zip degrader_site_5nvx_jhr_bcov3_SAVE_ALL_OUT_IGNORE_THE_REST_7ue1xx0j.zip @degrader_site_5nvx_jhr_bcov3_SAVE_ALL_OUT_IGNORE_THE_REST_7ue1xx0j.flags -nstruct 10000 -cpu_run_time 28800 -boinc:max_nstruct 20000 -checkpoint_interval 120 -database minirosetta_database -in::file::zip minirosetta_database.zip -boinc::watchdog -boinc::cpu_run_timeout 36000 -run::rng mt19937 -constant_seed -jran 1228105 Using database: database_357d5d93529_n_methylminirosetta_database ERROR: Error in core::conformation::Conformation::residue(): The sequence position requested was greater than the number of residues in the pose. ERROR:: Exit from: C:cygwin64homeboinc4.17Rosettamainsourcesrccore/conformation/Conformation.hh line: 508 BOINC:: Error reading and gzipping output datafile: default.out 23:28:52 (3568): called boinc_finish(1) </stderr_txt> ]]> I'm going to assume these are known problems and are being investigated. I remember seeing these over at the test project, Ralph@home. These "degrader_site_5nvx_jhr_bcov3_SAVE_ALL_OUT_IGNORE_THE_REST" sure aren't the most stable WUs I've seen. Has anyone seen a single "degrader_site_5nvx_jhr_bcov3_SAVE_ALL_OUT_IGNORE_THE_REST" that didn't crash and burn? |
Grant (SSSF) Joined: 28 Mar 20 Posts: 1673 Credit: 17,589,473 RAC: 22,408 |
I'm going to assume these are known problems and are being investigated. I remember seeing these over at the test project, Ralph@home. Whoever sent them out obviously didn't take notice of what was occurring at Ralph before they released them here, so no investigation; if there were, they would have all been cancelled ages ago. Not a single _5nvx_ has lasted more than a couple of minutes. A 100% failure rate. As it is, we're now out of work again, so the only Tasks we'll see for a while will be resends and the odd rb_ task. And you can bet many of the resends will be _5nvx_ and every last one of them will fail in minutes. Grant Darwin NT |
robertmiles Joined: 16 Jun 08 Posts: 1232 Credit: 14,269,631 RAC: 2,588 |
Been getting lots of errors recently: [snip] I've been seeing a lot of those lately, all in tasks with _5nvx_ in their names. Does that mean that whoever created that group of workunits needs to pay more attention in class? |
Falconet Joined: 9 Mar 09 Posts: 353 Credit: 1,227,479 RAC: 3,710 |
There's new work available and it looks like the 5nvx workunits. Let's see if these actually run. |
Sid Celery Joined: 11 Feb 08 Posts: 2117 Credit: 41,138,406 RAC: 16,288 |
Been getting lots of errors recently: Seems like the range error has been corrected and a whole pile are getting downloaded and running right now. Edit: Spoke too soon... (unknown error) - exit code 3221226356 (0xc0000374)</message> |
Falconet Joined: 9 Mar 09 Posts: 353 Credit: 1,227,479 RAC: 3,710 |
I have 4 running at between 11%-16% progress. Seems better than before but perhaps some are still going to error now. |
Sid Celery Joined: 11 Feb 08 Posts: 2117 Credit: 41,138,406 RAC: 16,288 |
I have 4 running at between 11%-16% progress. Seems better than before but perhaps some are still going to error now. I've got 14 running between 0 & 23%. A second has crashed out after 1h 58m - same error as above. No idea what it means. Edit: And a 3rd crashes at 1h 35m - same error again. Other tasks reaching 25%. |
©2024 University of Washington
https://www.bakerlab.org