Message boards : Number crunching : Problems and Technical Issues with Rosetta@home
Author | Message |
---|---|
Mr P Hucker Send message Joined: 12 Aug 06 Posts: 1600 Credit: 11,716,372 RAC: 18,198 |
I don't have moods; people who can't handle my facts or opinions have moods. They're usually American as they're quite soft over there. I just got banned from a forum for pointing out the fact that the average American IQ is only 98, whereas the UK is 100 and Japan is 106. [Double take] I made a good point? The 6.5GB problem goes away on an 8GB machine if you set it to use 100% memory. It never actually uses 100% since everything overestimates. I just changed my old Boinc-only machines [1] and Rosettas downloaded and ran This is actually a good point. |
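For anyone wanting to sanity-check the 6.5GB point on their own hardware, here is a rough sketch of the arithmetic. It assumes (my assumption, not something confirmed in this thread) that a task only fits when its memory bound is no larger than total RAM multiplied by the allowed memory percentage. The function names are made up for illustration, and real machines usually report a little less than their nominal RAM.

```python
def min_allowed_pct(task_bound_gb, total_ram_gb):
    """Smallest 'use at most N% of memory' setting under which the bound fits.

    Assumes the bound is compared against total RAM * allowed percentage.
    """
    return 100.0 * task_bound_gb / total_ram_gb


def fits(task_bound_gb, total_ram_gb, allowed_pct):
    """True if the task bound is within the allowed share of RAM."""
    return task_bound_gb <= total_ram_gb * allowed_pct / 100.0


# Figures from the post: a 6.5GB bound on a nominal 8GB machine.
print(f"minimum setting: {min_allowed_pct(6.5, 8.0):.2f}%")
for pct in (50, 75, 100):
    print(f"{pct}% allowed -> fits: {fits(6.5, 8.0, pct)}")
```

Under that assumption the bound needs a bit over 81% of a full 8GB, which would explain why raising the setting makes tasks start flowing again.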
Sid Celery Send message Joined: 11 Feb 08 Posts: 2115 Credit: 41,115,753 RAC: 19,563 |
I don't have moods; people who can't handle my facts or opinions have moods. They're usually American as they're quite soft over there. I just got banned from a forum for pointing out the fact that the average American IQ is only 98, whereas the UK is 100 and Japan is 106. [Double take] I made a good point? The 6.5GB problem goes away on an 8GB machine if you set it to use 100% memory. It never actually uses 100% since everything overestimates. I just changed my old Boinc-only machines [1] and Rosettas downloaded and ran This is actually a good point. You don't have moods?! Not only do you have moods, sometimes they're arsey - that is, more than one. Never mind, though. I wouldn't want you to get moody over my facts and opinions... lol Let's go back to you making a good point - then everyone's happy |
Sid Celery Send message Joined: 11 Feb 08 Posts: 2115 Credit: 41,115,753 RAC: 19,563 |
From Brian Nixon, 31 Mar I've had no issues with insufficient disk space or memory. This points to a misconfiguration of the new batch of work units, as it seems unlikely it would be the project’s intention to cut off a third of its capacity… Brian, I looked at my client_state.xml file and, as you speculated(?), those are the figures showing there. I've been in contact with Project admins and this was a deliberate change, not a misconfiguration. It's been looked at more closely and brought down to a figure nearer 4GB - hopefully we'll see the result of that soon. I note In Progress tasks are edging up, but let's see how that pans out. There was obviously a need for that change, but I don't know what it is. I've asked if a brief note can be posted to explain what they're working on that requires the increase. No idea when or if that will happen. But small victories - thanks for your pointer. Well spotted. I didn't appreciate the significance of it at the time you posted. |
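If anyone else wants to check the figures in their own client_state.xml, a minimal sketch along these lines should do it. It assumes each `<workunit>` entry carries `<rsc_memory_bound>` and `<rsc_disk_bound>` values in bytes. The sample XML, the task name and the byte counts below are invented for illustration, and a real file may need more forgiving parsing than this.

```python
import xml.etree.ElementTree as ET

# Hypothetical excerpt in the shape of client_state.xml; the task name
# and byte counts are made up for illustration only.
SAMPLE = """
<client_state>
  <workunit>
    <name>example_task_0001</name>
    <rsc_memory_bound>6500000000.000000</rsc_memory_bound>
    <rsc_disk_bound>8000000000.000000</rsc_disk_bound>
  </workunit>
</client_state>
"""

GB = 1e9  # decimal gigabytes


def resource_bounds(xml_text):
    """Return (name, memory_GB, disk_GB) for each <workunit> element."""
    root = ET.fromstring(xml_text)
    rows = []
    for wu in root.iter("workunit"):
        name = wu.findtext("name", default="?")
        mem = float(wu.findtext("rsc_memory_bound", default="0"))
        disk = float(wu.findtext("rsc_disk_bound", default="0"))
        rows.append((name, mem / GB, disk / GB))
    return rows


for name, mem_gb, disk_gb in resource_bounds(SAMPLE):
    print(f"{name}: memory bound {mem_gb:.2f} GB, disk bound {disk_gb:.2f} GB")
```

To run it against a real file, read the file's text and pass it to `resource_bounds` in place of `SAMPLE`.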
Jim1348 Send message Joined: 19 Jan 06 Posts: 881 Credit: 52,257,545 RAC: 0 |
I've asked if a brief note can be posted to explain what they're working on that requires the increase. That will be getting blood out of a turnip. It must be their policy not to comment. There is probably a good reason for it, but it is not entirely apparent to me what it is. |
Sid Celery Send message Joined: 11 Feb 08 Posts: 2115 Credit: 41,115,753 RAC: 19,563 |
I've asked if a brief note can be posted to explain what they're working on that requires the increase. That will be getting blood out of a turnip. It must be their policy not to comment. You've been here longer than me - I can't say anything... I speculated that the change might have been a test that got left in the defaults, so asked if it could revert to what it was. But it was a change for a reason, so while it could be fine-tuned it still couldn't go back all the way. When the project started working on SARS-CoV-2 there were some big changes in the size of tasks that didn't always go through successfully, but for all the errors it threw up for us they got significant results too. None of us have any idea what this change relates to, hence my request. If they tell us, it'll be understandable to everyone. I made the point that their technical posts always go down very well, so it's worth taking the time. Whether they do or not is out of our hands. We wait. |
Sid Celery Send message Joined: 11 Feb 08 Posts: 2115 Credit: 41,115,753 RAC: 19,563 |
From Brian Nixon, 31 Mar In addition, tasks with the names "miniprotein_relax8" and "_abinitio_1_abinitio_" have been deleted from the queue, along with another bad batch they had noticed before we informed them of these two. Hopefully we'll all see a lot fewer crashes than we have recently. I've regularly found my own PCs have rebooted overnight due to these faulty tasks. If any new ones arise, note the names and they can be looked into if they haven't already been noticed. |
mikey Send message Joined: 5 Jan 06 Posts: 1895 Credit: 9,118,186 RAC: 5,220 |
From Sid Celery 31 Mar, 9 Apr I've never considered that being the cause of a reboot before... hmmmmm, light bulb going off icon needed!!! |
Sid Celery Send message Joined: 11 Feb 08 Posts: 2115 Credit: 41,115,753 RAC: 19,563 |
From Sid Celery 9 Apr It could be a lot of things, but when I check the start of the Event log I'm finding something like 44 tasks uploaded and a few coming down, and online they all report Computation errors at that time. It may be different for others, but it's been taking out every task of mine, good or bad, and crashing the whole PC. If everything's good tomorrow morning, it'll be because the Server aborted all those tasks today. Let's see if I'm right. |
PorkyPies Send message Joined: 6 Apr 20 Posts: 45 Credit: 1,650,779 RAC: 0 |
I've been in contact with Project admins and this was a deliberate change, not a misconfiguration. I noticed the dud tasks have stopped coming down. Well done for getting them removed. I thought the increased memory and disk space requirement was deliberate. The project clearly thinks they'll have some work that needs that much memory and/or disk space. Pity for the machines that don't have more than 4GB, but I guess it can't be helped unless they want to split tasks into small or large types and have different queues of work. That would probably be a lot of work on the project side to implement for not much gain. I've taken my 4GB Pi4s out of my Pi cluster. MarksRpiCluster |
Grant (SSSF) Send message Joined: 28 Mar 20 Posts: 1671 Credit: 17,531,042 RAC: 22,700 |
And as I indicated with that link I posted, you are talking about decades for normal drives under normal usage conditions. SSD Endurance Experiment I've read many articles complaining that SSDs last nowhere near as long as HDDs. A few HDDs do fail unexpectedly, but SSDs wear out, because they have a finite number of writes. They cannot possibly last longer than that time. Just as some HDDs fail before their time, so too do some SSDs. For all of the articles that complain about SSD failures, there would be just as many about HDD failures. SSD vs HDD: Which One is More Reliable? But in terms of data security, evidence of flash wear appeared after 200TB of writes for TechReport’s Solid State Drives, when their Samsung 840 Series started logging reallocated sectors. As the only TLC candidate in the bunch, this drive was expected to show the first cracks. The 840 Series didn’t encounter actual problems until 300TB, when it failed a hash check during the setup for an unpowered data retention test. The drive went on to pass that test and continue writing, but it recorded a rash of uncorrectable errors around the same time. Uncorrectable errors can compromise data integrity and system stability, so I’d recommend taking drives out of service the moment they appear. I'll take an SSD over a HDD any day. Grant Darwin NT |
Sid Celery Send message Joined: 11 Feb 08 Posts: 2115 Credit: 41,115,753 RAC: 19,563 |
From Sid Celery 9 Apr Partly right. No reboot, but my entire cache is showing Computation errors, with a message in the Event log saying: 20/04/2021 10:05:11 | Rosetta@home | [error] Signature verification failed for database_357d5d93529_n_methyl.zip and a back-off from re-contacting the server for 24hrs. 2 steps forward, one step back... |
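For anyone sifting through a long Event log for this kind of message, here is a small sketch that splits a line in the format quoted above and works out when a 24-hour back-off would expire. It assumes the timestamp is day-first (as it must be for 20/04/2021) and that fields are separated by " | ". The function name is just for illustration.

```python
from datetime import datetime, timedelta

# The Event log line quoted in the post above.
LINE = ("20/04/2021 10:05:11 | Rosetta@home | [error] Signature verification "
        "failed for database_357d5d93529_n_methyl.zip")


def parse_event(line):
    """Split an Event log line into (timestamp, project, message)."""
    stamp, project, message = (part.strip() for part in line.split("|", 2))
    when = datetime.strptime(stamp, "%d/%m/%Y %H:%M:%S")
    return when, project, message


when, project, message = parse_event(LINE)
retry_at = when + timedelta(hours=24)  # the 24hr back-off mentioned above
print(project, "|", message)
print("earliest retry:", retry_at.strftime("%d/%m/%Y %H:%M:%S"))
```

For the line quoted, that puts the earliest retry at 10:05:11 on 21/04/2021, unless an update is forced by hand before then.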
Grant (SSSF) Send message Joined: 28 Mar 20 Posts: 1671 Credit: 17,531,042 RAC: 22,700 |
I'd back off any overclocks for memory & CPU and let things run at stock for a while. Some of the errors could be due to internet/AV issues, e.g.
<core_client_version>7.16.11</core_client_version> <![CDATA[ <message> app_version download error: couldn't get input files: <file_xfer_error> <file_name>database_357d5d93529_n_methyl.zip</file_name> <error_code>-120 (RSA key check failed for file)</error_code> <error_message>signature verification failed</error_message> </file_xfer_error> </message> ]]>
But the Tasks that are starting and then erroring out after a while, e.g.
<core_client_version>7.16.11</core_client_version> <![CDATA[ <message> (unknown error) - exit code 3221225477 (0xc0000005)</message> <stderr_txt> command: projects/boinc.bakerlab.org_rosetta/rosetta_4.20_windows_x86_64.exe -run:protocol jd2_scripting -parser:protocol fr_cart_fast.xml @fr_flags_bcov2 -in:file:silent miniprotein_relax9_SAVE_ALL_OUT_IGNORE_THE_REST_9fk2oh9e.silent -in:file:silent_struct_type binary -silent_gz -mute all -silent_read_through_errors true -out:file:silent_struct_type binary -out:file:silent default.out -in:file:boinc_wu_zip miniprotein_relax9_SAVE_ALL_OUT_IGNORE_THE_REST_9fk2oh9e.zip @miniprotein_relax9_SAVE_ALL_OUT_IGNORE_THE_REST_9fk2oh9e.flags -nstruct 10000 -cpu_run_time 28800 -boinc:max_nstruct 20000 -checkpoint_interval 120 -database minirosetta_database -in::file::zip minirosetta_database.zip -boinc::watchdog -boinc::cpu_run_timeout 36000 -run::rng mt19937 -constant_seed -jran 3225505 Using database: database_357d5d93529_n_methyl\minirosetta_database Unhandled Exception Detected... - Unhandled Exception Record - Reason: Access Violation (0xc0000005) at address 0x0000000000000004 Engaging BOINC Windows Runtime Debugger...
indicate some other issue.
I've had a couple of miniprotein_relax8_ error out after a while with a similar error message
<core_client_version>7.16.11</core_client_version> <![CDATA[ <message> (unknown error) - exit code 3221225477 (0xc0000005)</message> <stderr_txt> command: projects/boinc.bakerlab.org_rosetta/rosetta_4.20_windows_x86_64.exe -run:protocol jd2_scripting -parser:protocol fr_cart_fast.xml @fr_flags_bcov2 -in:file:silent miniprotein_relax8_SAVE_ALL_OUT_IGNORE_THE_REST_5mm6sc7p.silent -in:file:silent_struct_type binary -silent_gz -mute all -silent_read_through_errors true -out:file:silent_struct_type binary -out:file:silent default.out -in:file:boinc_wu_zip miniprotein_relax8_SAVE_ALL_OUT_IGNORE_THE_REST_5mm6sc7p.zip @miniprotein_relax8_SAVE_ALL_OUT_IGNORE_THE_REST_5mm6sc7p.flags -nstruct 10000 -cpu_run_time 28800 -boinc:max_nstruct 20000 -checkpoint_interval 120 -database minirosetta_database -in::file::zip minirosetta_database.zip -boinc::watchdog -boinc::cpu_run_timeout 36000 -run::rng mt19937 -constant_seed -jran 1040802 Using database: database_357d5d93529_n_methyl\minirosetta_database Unhandled Exception Detected... - Unhandled Exception Record - Reason: Access Violation (0xc0000005) at address 0x00007FF736388316 read attempt to address 0xFFFFFFFF Engaging BOINC Windows Runtime Debugger...
but 95% or more of them have completed without issue. And while a few pre_helical_bundles_round1_attempt1_ error out in seconds
<core_client_version>7.16.11</core_client_version> <![CDATA[ <message> Incorrect function. (0x1) - exit code 1 (0x1)</message> <stderr_txt> command: projects/boinc.bakerlab.org_rosetta/rosetta_4.20_windows_x86_64.exe -run:protocol jd2_scripting -parser:protocol pre_helix_boinc_v1.xml @helix_design.flags -in:file:silent pre_helical_bundles_round1_attempt1_SAVE_ALL_OUT_IGNORE_THE_REST_7tc3qf4n.silent -in:file:silent_struct_type binary -silent_gz -mute all -silent_read_through_errors true -out:file:silent_struct_type binary -out:file:silent default.out -in:file:boinc_wu_zip pre_helical_bundles_round1_attempt1_SAVE_ALL_OUT_IGNORE_THE_REST_7tc3qf4n.zip @pre_helical_bundles_round1_attempt1_SAVE_ALL_OUT_IGNORE_THE_REST_7tc3qf4n.flags -nstruct 10000 -cpu_run_time 28800 -boinc:max_nstruct 20000 -checkpoint_interval 120 -database minirosetta_database -in::file::zip minirosetta_database.zip -boinc::watchdog -boinc::cpu_run_timeout 36000 -run::rng mt19937 -constant_seed -jran 3386203 Using database: database_357d5d93529_n_methyl\minirosetta_database ERROR: [ERROR] Unable to open constraints file: d13b0a13bd57de6e8dc1565c1b82259f_0001.MSAcst ERROR:: Exit from: ..\..\..\src\core\scoring\constraints\ConstraintIO.cc line: 457 BOINC:: Error reading and gzipping output datafile: default.out 10:12:15 (5600): called boinc_finish(1) </stderr_txt> ]]>
But once again, the vast majority have completed ok. I've gone from over 150 errors to just 5. Grant Darwin NT |
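As a quick way of telling these failure types apart, the decimal exit code in the `<message>` line can be converted to hex and matched against the reasons shown in the dumps above. This is only a sketch: the label table contains just the two codes quoted in this thread, and the helper name is made up for illustration.

```python
import re

# Labels copied from the error text quoted in the posts above; this is
# not a complete list of BOINC or Windows codes.
LABELS = {
    0xC0000005: "Access Violation",
    0x1: "Incorrect function",
}

EXIT_RE = re.compile(r"exit code (\d+) \((0x[0-9a-fA-F]+)\)")


def describe_exit(message):
    """Pull the decimal exit code out of a <message> line and label it."""
    match = EXIT_RE.search(message)
    if match is None:
        return None
    code = int(match.group(1))
    return code, f"0x{code:08x}", LABELS.get(code, "unknown")


print(describe_exit("(unknown error) - exit code 3221225477 (0xc0000005)"))
print(describe_exit("Incorrect function. (0x1) - exit code 1 (0x1)"))
```

The first line decodes 3221225477 to 0xc0000005, the access violation seen in the miniprotein_relax tasks, while the second is the quick exit seen in the pre_helical_bundles tasks.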
Sid Celery Send message Joined: 11 Feb 08 Posts: 2115 Credit: 41,115,753 RAC: 19,563 |
From Brian Nixon, 31 Mar After 1 day (a very short amount of time) it appears I'm being too optimistic. Using the number of tasks In Progress as a proxy for how successful people are at downloading tasks: In March, the figure was 550k. When all the problems began, the figure dropped to around 318k - a loss of 41%. Today the figure is around 360k - loss reduced to 34.5%. Usually it's a good thing to have a large queue of tasks to run. A week ago this figure increased to over 20m tasks. After the 2 or 3 rogue task-types that were causing all the crashes were removed, this dropped to 19m. Now it seems like the change to RAM & Disk requirements will only take effect for new tasks added to the queue - the amounts showing in my client_state.xml are largely the same as before. It may take 7 or 8 weeks for 19m tasks in the current queue to be ploughed through to see the (slightly) lower resource demands. June 2021... This is me speculating after just 1 day. Hopefully I'm wrong and it's quicker than that. I'm working on the basis that "bad news early" is better than no news at all. |
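The percentages and the queue estimate above can be reproduced with a few lines. This sketch uses only the rounded figures quoted in the post, so the first percentage comes out slightly different from the 41% stated (presumably the exact In Progress counts differed a little). The tasks-per-day rate is simply what a 7 or 8 week drain of 19m tasks would imply, not a measured throughput.

```python
def loss_pct(baseline, current):
    """Percentage drop from baseline to current."""
    return 100.0 * (baseline - current) / baseline


# Rounded figures quoted in the post above.
MARCH = 550_000      # In Progress tasks in March
AT_WORST = 318_000   # when the problems began
TODAY = 360_000      # at the time of posting
QUEUE = 19_000_000   # tasks waiting in the queue

print(f"loss at worst: {loss_pct(MARCH, AT_WORST):.1f}%")
print(f"loss today:    {loss_pct(MARCH, TODAY):.1f}%")

# Throughput implied by the 7 or 8 week guess for clearing the queue.
for weeks in (7, 8):
    per_day = QUEUE / (weeks * 7)
    print(f"{weeks} weeks implies about {per_day:,.0f} tasks/day")
```

If the real completion rate turns out higher than the implied figure, the lower resource demands would show up sooner than June.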
Sid Celery Send message Joined: 11 Feb 08 Posts: 2115 Credit: 41,115,753 RAC: 19,563 |
I'd back off any overclocks for memory & CPU and let things run at stock for a while. Is this directed at me? If so, yes, I've assumed some of my problems are of my own making. I'm edging things down every couple of days and I've got a particular setting I'm looking to move down a lot the next chance I get. My temps are abnormally high atm, so I have to fix that. I've had a couple of miniprotein_relax8_ error out after a while with a similar error message Haven't all those tasks been aborted by the server now? And while a few pre_helical_bundles_round1_attempt1_ error out in seconds <core_client_version>7.16.11</core_client_version> <![CDATA[ <message> Incorrect function. (0x1) - exit code 1 (0x1)</message> <stderr_txt> command: projects/boinc.bakerlab.org_rosetta/rosetta_4.20_windows_x86_64.exe -run:protocol jd2_scripting -parser:protocol pre_helix_boinc_v1.xml @helix_design.flags -in:file:silent pre_helical_bundles_round1_attempt1_SAVE_ALL_OUT_IGNORE_THE_REST_7tc3qf4n.silent -in:file:silent_struct_type binary -silent_gz -mute all -silent_read_through_errors true -out:file:silent_struct_type binary -out:file:silent default.out -in:file:boinc_wu_zip pre_helical_bundles_round1_attempt1_SAVE_ALL_OUT_IGNORE_THE_REST_7tc3qf4n.zip @pre_helical_bundles_round1_attempt1_SAVE_ALL_OUT_IGNORE_THE_REST_7tc3qf4n.flags -nstruct 10000 -cpu_run_time 28800 -boinc:max_nstruct 20000 -checkpoint_interval 120 -database minirosetta_database -in::file::zip minirosetta_database.zip -boinc::watchdog -boinc::cpu_run_timeout 36000 -run::rng mt19937 -constant_seed -jran 3386203 Using database: database_357d5d93529_n_methyl\minirosetta_database ERROR: [ERROR] Unable to open constraints file: d13b0a13bd57de6e8dc1565c1b82259f_0001.MSAcst ERROR:: Exit from: ..\..\..\src\core\scoring\constraints\ConstraintIO.cc line: 457 BOINC:: Error reading and gzipping output datafile: default.out 10:12:15 (5600): called boinc_finish(1) </stderr_txt> ]]> I've reported that as well.
Some crash out within 20secs with a Computation error, while others stop short after 7 or 8mins but are validated as if nothing went wrong. But both report errors, which is weird. ERROR: [ERROR] Unable to open constraints file: e1096e175045f039d630a9b7543a561f_0001.MSAcst |
Mr P Hucker Send message Joined: 12 Aug 06 Posts: 1600 Credit: 11,716,372 RAC: 18,198 |
You don't have moods?! I'm a very calm person actually. The only mood I get in here is amused when people get upset over nothing. |
Mr P Hucker Send message Joined: 12 Aug 06 Posts: 1600 Credit: 11,716,372 RAC: 18,198 |
In addition, tasks with the names "miniprotein_relax8" and "_abinitio_1_abinitio_" have been deleted from the queue and another bad batch they noticed before we informed them of these two. That's odd, I've never had a computer crash due to a faulty task from any project. A whole machine going down from one program error, that's a Windows XP problem isn't it? |
Mr P Hucker Send message Joined: 12 Aug 06 Posts: 1600 Credit: 11,716,372 RAC: 18,198 |
The only reboots I've had are from that criminally auto-rebooting Windows 10. I've thwarted that though. My updates are "managed by my organisation" or so it thinks. From Sid Celery 31 Mar, 9 Apr |
Mr P Hucker Send message Joined: 12 Aug 06 Posts: 1600 Credit: 11,716,372 RAC: 18,198 |
Depends what you mean by normal. Mine has a security camera recording onto it, two graphics cards and a 24 core CPU doing Boinc, I record TV to it, .... I guess there are some people who just play solitaire and use email, those might last that long. And as I indicated with that link I posted, you are talking about decades for normal drives under normal usage conditions. SSD Endurance Experiment I've read many articles complaining that SSDs last nowhere near as long as HDDs. A few HDDs do fail unexpectedly, but SSDs wear out, because they have a finite number of writes. They cannot possibly last longer than that time. |
mikey Send message Joined: 5 Jan 06 Posts: 1895 Credit: 9,118,186 RAC: 5,220 |
From Sid Celery 31 Mar, 9 Apr That's funny... you actually thinking MS gives a crap about what YOU, or your organization, want to do with THEIR software. I hope it works for you, I really really do, but past history suggests MS just ups the priority of their updates and you get unwanted ones anyway because it serves their tracking needs. |
Sid Celery Send message Joined: 11 Feb 08 Posts: 2115 Credit: 41,115,753 RAC: 19,563 |
In addition, tasks with the names "miniprotein_relax8" and "_abinitio_1_abinitio_" have been deleted from the queue and another bad batch they noticed before we informed them of these two. That's odd, I've never had a computer crash due to a faulty task from any project. A whole machine going down from one program error, that's a Windows XP problem isn't it? It never did with my previous PC - and after the removal of these tasks it didn't happen last night either - but while those particular tasks were running and crashing, they took out every other task of any type and the whole PC with it. Maybe it's just me. Anyway, it seems to have stopped now |
©2024 University of Washington
https://www.bakerlab.org