Message boards : Number crunching : Problems and Technical Issues with Rosetta@home
Author | Message |
---|---|
tgbauer Joined: 5 Jan 06 Posts: 10 Credit: 100,068,428 RAC: 91,041 |
One of my systems (phenom ii x6 1065t) fails all Rosetta BETA 6 tasks yet is fine with Rosetta 4 tasks. I'm seeing similar with my older 64-bit system (Beta 6.06 tasks fail in 1 second without providing output, but all 4.20 tasks complete as expected; "Reset project" didn't help) " 27-Oct-2018 17:57:12 [---] Processor: 2 AuthenticAMD AMD Athlon(tm) 64 X2 Dual Core Processor 3800+ [Family 15 Model 75 Stepping 2] 27-Oct-2018 17:57:12 [---] Processor features: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt rdtscp lm 3dnowext 3dnow rep_good nopl pni cx16 lahf_lm cmp_legacy svm extapic cr8_legacy 3dnowprefetch vmmcall 27-Oct-2018 17:57:12 [---] OS: Linux: 4.4.0-138-generic " " Application Rosetta Beta 6.06 Name 8aahal_r_hal_8aa_3jp5416_d40_1_0001_1_SAVE_ALL_OUT_2999122_54 State Computation error Received Fri 01 Nov 2024 12:26:18 AM EDT Report deadline Sun 03 Nov 2024 11:26:18 PM EST Estimated computation size 80,000 GFLOPs CPU time 00:00:00 Elapsed time 00:00:01 Executable rosetta_beta_6.06_x86_64-pc-linux-gnu " For some reason I'm not able to grab stderr.txt in time. Is there something else to look at to find out why the tasks fail? |
Grant (SSSF) Joined: 28 Mar 20 Posts: 1664 Credit: 17,390,757 RAC: 24,406 |
Almost half a million waiting for Validation now. The boinc-process host has died yet again... Grant Darwin NT |
tgbauer Joined: 5 Jan 06 Posts: 10 Credit: 100,068,428 RAC: 91,041 |
https://boinc.bakerlab.org/rosetta/result.php?resultid=1587071539 <core_client_version>7.16.6</core_client_version> <![CDATA[ <message> process got signal 11</message> <stderr_txt> command: ../../projects/boinc.bakerlab.org_rosetta/rosetta_beta_6.06_x86_64-pc-linux-gnu @8aahal_r_hal_8aa_3jp5416_d40_1_0001_1.flags -nstruct 10000 -cpu_run_time 28800 -boinc:max_nstruct 20000 -checkpoint_interval 120 -mute all -database minirosetta_database -in::file::zip minirosetta_database.zip -boinc::watchdog -boinc::cpu_run_timeout 36000 -run::rng mt19937 Using database: database_f5ae1de8e1/database </stderr_txt> ]]> One of my systems (phenom ii x6 1065t) fails all Rosetta BETA 6 tasks yet is fine with Rosetta 4 tasks. |
Grant (SSSF) Joined: 28 Mar 20 Posts: 1664 Credit: 17,390,757 RAC: 24,406 |
From a previous thread: "Under Linux, signal 11 means that the program tried to execute something that was not marked as executable code. The project administrators should use the dump to determine where the program got the address of what it was trying to execute, and then trace backwards from there." Other than running the latest kernel and/or version of your distribution (or an earlier one if the latest ones have deprecated your older CPU) I can't think of anything else to try. Even if someone with a similar system running Windows checked whether that application has the same issue on the same hardware, they're no longer doing any development work on this application, so I don't see anything happening to resolve the issue. Grant Darwin NT |
tgbauer Joined: 5 Jan 06 Posts: 10 Credit: 100,068,428 RAC: 91,041 |
"From a previous thread: Under Linux, signal 11 means that the program tried to execute something that was not marked as executable code. The project administrators should use the dump to determine where the program got the address of what it was trying to execute, and then trace backwards from there. Other than running the latest kernel and/or version of your distribution (or an earlier one if the latest ones have deprecated your older CPU) I can't think of anything else to try." Looks like it might be the lack of SSSE issue that was around in 4.08: https://boinc.bakerlab.org/rosetta/forum_thread.php?id=13658&postid=92557#92557 |
Grant (SSSF) Joined: 28 Mar 20 Posts: 1664 Credit: 17,390,757 RAC: 24,406 |
"Looks like it might be the lack of SSSE issue that was around in 4.08: https://boinc.bakerlab.org/rosetta/forum_thread.php?id=13658&postid=92557#92557" Very possible. The Beta application was developed long after the Rosetta application, very possibly by a different developer & they decided SSSE instructions would be the minimum supported (so no support for CPUs a bit over 15 years old at the time of the Rosetta Beta application release). Grant Darwin NT |
Grant (SSSF) Joined: 28 Mar 20 Posts: 1664 Credit: 17,390,757 RAC: 24,406 |
I really wish they'd replace that boinc-process host (or at the very least restart it, yet again). Grant Darwin NT |
Grant (SSSF) Joined: 28 Mar 20 Posts: 1664 Credit: 17,390,757 RAC: 24,406 |
635k waiting for Validation and rising. Will we make it to 1 million? Grant Darwin NT |
Sid Celery Joined: 11 Feb 08 Posts: 2110 Credit: 40,988,019 RAC: 19,589 |
"635k waiting for Validation and rising. Will we make it to 1 million?" Currently showing 663,306 and I was going to suggest we keep some kind of tally to see how high we can get... ...except, I've just looked and all servers are now showing as running on the server status page, so let's see if that starts reducing or whether it's a false reading. The front page is still showing as frozen for some reason. On the plus side, this is a particularly consistent run of work over the last week or so. Let's see what kind of a points boost we all eventually get. All fun and games... Edit: I've just checked and across my whole team I have 120 tasks pending validation, but as I scrolled through there are definitely one or two tasks that've now received credit, so I think validation is definitely starting to work through the very long queue. Boinc-process lives |
Grant (SSSF) Joined: 28 Mar 20 Posts: 1664 Credit: 17,390,757 RAC: 24,406 |
"Boinc-process lives" Till the next time. It'd be nice if they got the main page Server Status info updating again, but if it's one or the other then it's better having the Validators running while there is work available. Grant Darwin NT |
Sid Celery Joined: 11 Feb 08 Posts: 2110 Credit: 40,988,019 RAC: 19,589 |
"Boinc-process lives" "Till the next time." Definitely. Waiting to validate has edged fractionally down to 662,008 on the Server Status page, but I'm definitely seeing more tasks than that validated - out of order for some reason but they all count. I've got a full cache, but I'm manually polling anyway to see my credits going up each time. These are our salad days (hours anyway) |
Grant (SSSF) Joined: 28 Mar 20 Posts: 1664 Credit: 17,390,757 RAC: 24,406 |
And the main page Server Status is updating again. The Validators are validating, but they're seriously struggling- the backlog isn't getting any bigger, but it's not getting any less either. Hopefully they'll start putting a dent in the backlog over the next few hours. Once that happens it shouldn't take long to then clear the backlog; but at present all they're doing is treading water. Grant Darwin NT |
Sid Celery Joined: 11 Feb 08 Posts: 2110 Credit: 40,988,019 RAC: 19,589 |
"The Validators are validating, but they're seriously struggling- the backlog isn't getting any bigger, but it's not getting any less either." This is true. Several hours later, the 663k backlog is now 659k. But my team's unvalidated tasks are up from 120 to 132. It seemed a fair few were being validated at the start, but now not many more have been since. If it takes 2 or 3 days to notice the entire server is down I'm not convinced anyone will notice at all that the validation backlog is barely reducing. It may take until new tasks run out or, more likely, for boinc-process to fail again, take another few days, then get re-restarted to improve matters... |
Grant (SSSF) Joined: 28 Mar 20 Posts: 1664 Credit: 17,390,757 RAC: 24,406 |
Now it's up to 683k. "The Validators are validating, but they're seriously struggling- the backlog isn't getting any bigger, but it's not getting any less either." I'm hoping that it's just a case of a messy crash of the server, and it's just rebuilding/verifying its storage. In which case it could take a day or so to complete, during which performance is significantly degraded. And once done, the backlog will clear like it usually does in an hour or 2. Or there is something still seriously wrong and the backlog will continue to grow slowly until the current batch of work runs out & the work being returned tapers off (or the server just crashes yet again, and the backlog climbs rapidly like before). Grant Darwin NT |
Sid Celery Joined: 11 Feb 08 Posts: 2110 Credit: 40,988,019 RAC: 19,589 |
"Now it's up to 683k." "The Validators are validating, but they're seriously struggling- the backlog isn't getting any bigger, but it's not getting any less either." While we're guessing, I now note that when I've uploaded completed tasks I'm not seeing any change in credits so, despite what the server status page shows, the continuing buildup to 699k is because validation has stopped altogether, not just slowed. While all servers show green/running I don't know what other trigger there'll be so someone notices, because it isn't even noticed when they're all red. We could be waiting some while. So, the new prediction game is: what will the validation backlog peak at? 1m? 1.2m? 1.5m? |
[VENETO] boboviz Joined: 1 Dec 05 Posts: 1989 Credit: 9,460,369 RAC: 12,264 |
"While all servers show green/running I don't know what other trigger there'll be so someone notices, because it isn't even noticed when they're all red." Maybe a solution is to stop the wus generator and stop download/upload until the validation queue is clear... |
Grant (SSSF) Joined: 28 Mar 20 Posts: 1664 Credit: 17,390,757 RAC: 24,406 |
Over the last 90 min, the Validator backlog has dropped by over 100k. Looks like it's dropping by around 35k per hour (when the Validators were down completely, the rate of increase was roughly 12k per hour). It's taken 16 hours since the Validators were restarted, but we're starting to get some significant falls in the backlog- and looking at my systems' pendings, they've actually started to drop too. *fingers crossed* "Maybe a solution is to stop the wus generator and stop download/upload until the validation queue is clear..." If you stop the return of completed work, you get a massive surge of returned results waiting on Validation when it's re-enabled (instead of 10k per hour you're looking at 100k or more per hour), and if the Validators are still not working properly, you get an instant backlog & log jam. Stopping new work from being sent would be the most effective method- as caches clear, the amount returned per hour tapers off. When work is re-enabled, the amount returned per hour gradually builds up again. No sudden massive surge. Grant Darwin NT |
OffDutyTaoist Joined: 10 Oct 06 Posts: 3 Credit: 1,973,131 RAC: 1,338 |
My Pixel 6 is having issues again with Rosetta v4.20 arm-android-linux-gnu. rb_10_30_639032_632668_ab_t000__robetta_cstwt_5.0_FT_IGNORE_THE_REST_04_05_2997716_402_1 rb_10_30_639032_632668_ab_t000__robetta_cstwt_5.0_FT_IGNORE_THE_REST_04_10_2997716_399_1 rb_10_30_639032_632668_ab_t000__robetta_cstwt_5.0_FT_IGNORE_THE_REST_07_12_2997716_399_1 All were running at the same time, got up to ~0.319% then started resetting my phone. I paused all of them and tried running them individually, and all three would do the same thing when run separately. I ended up having to abort all of them and suspend the project unless someone has an idea. |
Grant (SSSF) Joined: 28 Mar 20 Posts: 1664 Credit: 17,390,757 RAC: 24,406 |
You've managed to complete 2 of those Tasks on that phone, but it's taking 10.5 hours to do 7.5 hours of work, which indicates that the phone is busy doing other things while it's trying to process the Rosetta Tasks. It's possible the phone is overheating, although it should just throttle & not restart. Other than setting the phone to run only when it's not doing other things, or only while on the charger (although doing that you would have to change the Target CPU time to 4 hours or less to make sure to return them before the deadline), I'd say Rosetta just isn't a suitable project for that device. Grant Darwin NT |
Sid Celery Joined: 11 Feb 08 Posts: 2110 Credit: 40,988,019 RAC: 19,589 |
"Over the last 90 min, the Validator backlog has dropped by over 100k. Looks like it's dropping by around 35k per hour (when the Validators were down completely, the rate of increase was roughly 12k per hour)." I looked a few hours ago and my 132 pending had dropped to 80 and now I've arrived home it's already further down to just 31. Backlog down to 370k so it's all looking good now. My fears from yesterday have largely been allayed. |
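
For the Beta 6.06 failures on older AMD CPUs discussed earlier in this thread, one way to test the "missing instruction set" theory is to compare the CPU's feature flags against what the application is suspected to need. Below is a minimal Python sketch, not anything provided by the project. It assumes that "SSSE" in the posts means SSSE3 (shown as `ssse3` in /proc/cpuinfo on x86 Linux, while SSE3 is shown as `pni`) and that Beta 6.06 really requires it, which nobody in the thread confirmed. The function names `read_cpu_flags` and `missing_flags` are made up for illustration.

```python
def read_cpu_flags(path="/proc/cpuinfo"):
    """Return the feature flags of the first CPU listed, or "" if unavailable."""
    try:
        with open(path) as fh:
            for line in fh:
                # x86 Linux lists features on a line such as "flags : fpu vme ..."
                if line.startswith("flags"):
                    return line.split(":", 1)[1].strip()
    except OSError:
        pass
    return ""


def missing_flags(flags, required=("ssse3",)):
    """Return the required flags that do not appear in the space-separated list.

    The default requirement (ssse3) is an assumption taken from the thread's
    speculation, not a documented requirement of Rosetta Beta 6.06.
    """
    present = set(flags.split())
    return [name for name in required if name not in present]


if __name__ == "__main__":
    flags = read_cpu_flags()
    if not flags:
        print("Could not read CPU flags (not x86 Linux?)")
    else:
        gaps = missing_flags(flags)
        print("missing:", ", ".join(gaps) if gaps else "none")
```

Fed the "Processor features" list from the Athlon 64 X2 log earlier in the thread, `missing_flags` reports `ssse3` as missing: that list contains `pni` (SSE3) but no `ssse3`. That is consistent with the theory but does not prove it is the cause of the signal 11.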
©2024 University of Washington
https://www.bakerlab.org