Problems and Technical Issues with Rosetta@home

tgbauer
Joined: 5 Jan 06
Posts: 10
Credit: 100,080,044
RAC: 91,388
Message 109949 - Posted: 1 Nov 2024, 4:44:13 UTC - in response to Message 109930.  

One of my systems (phenom ii x6 1065t) fails all Rosetta BETA 6 tasks yet is fine with Rosetta 4 tasks.

It almost immediately fails the tasks.

I'm seeing similar with my older 64bit system (Beta 6.06 tasks fail in 1 second without providing output, but all 4.20 tasks complete as expected - "Reset project" didn't help)
"
27-Oct-2018 17:57:12 [---] Processor: 2 AuthenticAMD AMD Athlon(tm) 64 X2 Dual Core Processor 3800+ [Family 15 Model 75 Stepping 2]
27-Oct-2018 17:57:12 [---] Processor features: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt rdtscp lm 3dnowext 3dnow rep_good nopl pni cx16 lahf_lm cmp_legacy svm extapic cr8_legacy 3dnowprefetch vmmcall
27-Oct-2018 17:57:12 [---] OS: Linux: 4.4.0-138-generic
"

"
Application
Rosetta Beta 6.06
Name
8aahal_r_hal_8aa_3jp5416_d40_1_0001_1_SAVE_ALL_OUT_2999122_54
State
Computation error
Received
Fri 01 Nov 2024 12:26:18 AM EDT
Report deadline
Sun 03 Nov 2024 11:26:18 PM EST
Estimated computation size
80,000 GFLOPs
CPU time
00:00:00
Elapsed time
00:00:01
Executable
rosetta_beta_6.06_x86_64-pc-linux-gnu
"

For some reason I'm not able to grab stderr.txt in time. Is there something else I can look at to find out why the tasks are failing?

Grant (SSSF)
Joined: 28 Mar 20
Posts: 1664
Credit: 17,391,955
RAC: 24,348
Message 109951 - Posted: 1 Nov 2024, 5:46:01 UTC - in response to Message 109948.  

boinc-process host has died yet again...

Still down, but two batches of tasks issued and 1m+ queued up to process

Still down, 400k awaiting validation now, but also the front page info seems to have frozen - no update for @18hrs while the Server Status page still seems ok. For now
Almost half a million waiting for Validation now.
Grant
Darwin NT

tgbauer
Joined: 5 Jan 06
Posts: 10
Credit: 100,080,044
RAC: 91,388
Message 109952 - Posted: 1 Nov 2024, 7:25:52 UTC - in response to Message 109949.  

https://boinc.bakerlab.org/rosetta/result.php?resultid=1587071539

<core_client_version>7.16.6</core_client_version>
<![CDATA[
<message>
process got signal 11</message>
<stderr_txt>
command: ../../projects/boinc.bakerlab.org_rosetta/rosetta_beta_6.06_x86_64-pc-linux-gnu @8aahal_r_hal_8aa_3jp5416_d40_1_0001_1.flags -nstruct 10000 -cpu_run_time 28800 -boinc:max_nstruct 20000 -checkpoint_interval 120 -mute all -database minirosetta_database -in::file::zip minirosetta_database.zip -boinc::watchdog -boinc::cpu_run_timeout 36000 -run::rng mt19937
Using database: database_f5ae1de8e1/database

</stderr_txt>
]]>



Grant (SSSF)
Joined: 28 Mar 20
Posts: 1664
Credit: 17,391,955
RAC: 24,348
Message 109953 - Posted: 1 Nov 2024, 8:07:30 UTC - in response to Message 109952.  

From a previous thread
Under Linux, signal 11 means that the program tried to execute something that was not marked as executable code. The project administrators should use the dump to determine where the program got the address of what it was trying to execute, and then trace backwards from there.
Other than running the latest kernel and/or version of your distribution (or an earlier one if the latest ones have deprecated your older CPU), I can't think of anything else to try.
Even if someone with a similar system running Windows checked whether the application has the same issue on the same hardware, they're no longer doing any development work on this application, so I don't see anything happening to resolve the issue.
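For reference, the signal number reported in stderr ("process got signal 11") can be mapped to its symbolic name with Python's standard `signal` module. This is only a small sketch; the helper name `signal_name` is made up for illustration. On Linux, signal 11 is SIGSEGV (an invalid memory reference), while signal 4 is SIGILL (illegal instruction), so the number alone does not say which instruction or address was involved.

```python
import signal


def signal_name(number):
    """Return the symbolic name for a signal number, or a fallback string."""
    try:
        return signal.Signals(number).name
    except ValueError:
        # The number is not a signal known on this platform.
        return f"unknown signal {number}"


if __name__ == "__main__":
    # 11 is the value reported in the task's stderr output.
    print(signal_name(11))
    # 4 is shown for comparison only.
    print(signal_name(4))
```

Running this on the affected Linux host shows how the kernel names the signal that BOINC reported for the failed task.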
Grant
Darwin NT

tgbauer
Joined: 5 Jan 06
Posts: 10
Credit: 100,080,044
RAC: 91,388
Message 109954 - Posted: 1 Nov 2024, 8:21:22 UTC - in response to Message 109953.  

From a previous thread
Under Linux, signal 11 means that the program tried to execute something that was not marked as executable code. The project administrators should use the dump to determine where the program got the address of what it was trying to execute, and then trace backwards from there.
Other than running the latest kernel and/or version of your distribution (or an earlier one if the latest ones have deprecated your older CPU), I can't think of anything else to try.
Even if someone with a similar system running Windows checked whether the application has the same issue on the same hardware, they're no longer doing any development work on this application, so I don't see anything happening to resolve the issue.

Looks like it might be the lack of SSSE issue that was around in 4.08: https://boinc.bakerlab.org/rosetta/forum_thread.php?id=13658&postid=92557#92557

Grant (SSSF)
Joined: 28 Mar 20
Posts: 1664
Credit: 17,391,955
RAC: 24,348
Message 109955 - Posted: 1 Nov 2024, 10:24:47 UTC - in response to Message 109954.  

Looks like it might be the lack of SSSE issue that was around in 4.08: https://boinc.bakerlab.org/rosetta/forum_thread.php?id=13658&postid=92557#92557
Very possible.
The Beta application was developed long after the Rosetta application, very possibly by a different developer & they decided SSSE instructions would be the minimum supported (so no support for CPUs a bit over 15 years old at the time of the Rosetta Beta application release).
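One way to test this theory on a Linux host is to compare the CPU's advertised feature flags with the instruction sets the application is suspected to need. The sketch below is an illustration only. It assumes "SSSE" in the posts means SSSE3 (reported by Linux as the flag `ssse3`), it reuses the "Processor features" line posted earlier in this thread for the Athlon 64 X2, and the helper name `missing_flags` is made up for this example. Nothing here is confirmed as the actual requirement of Rosetta Beta 6.06.

```python
# Feature line copied from the BOINC startup log posted earlier in the thread.
ATHLON_FEATURES = (
    "fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 "
    "clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt rdtscp lm "
    "3dnowext 3dnow rep_good nopl pni cx16 lahf_lm cmp_legacy svm extapic "
    "cr8_legacy 3dnowprefetch vmmcall"
)


def missing_flags(features_line, required):
    """Return the required flags that do not appear in the feature line.

    Flags are compared as whole tokens, so "sse" does not match "sse2".
    """
    available = set(features_line.split())
    return [flag for flag in required if flag not in available]


if __name__ == "__main__":
    # Assumption: these are the flags of interest. Linux reports SSE3 as "pni".
    wanted = ["sse2", "pni", "ssse3"]
    print(missing_flags(ATHLON_FEATURES, wanted))
```

On a live system the same kind of string can be read from the `flags` line in /proc/cpuinfo instead of being pasted in by hand.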
Grant
Darwin NT

Grant (SSSF)
Joined: 28 Mar 20
Posts: 1664
Credit: 17,391,955
RAC: 24,348
Message 109956 - Posted: 1 Nov 2024, 10:26:09 UTC

I really wish they'd replace that boinc-process host (or at the very least restart it, yet again).
Grant
Darwin NT

Grant (SSSF)
Joined: 28 Mar 20
Posts: 1664
Credit: 17,391,955
RAC: 24,348
Message 109957 - Posted: 1 Nov 2024, 21:39:24 UTC

635k waiting for Validation and rising. Will we make it to 1 million?
Grant
Darwin NT

Sid Celery
Joined: 11 Feb 08
Posts: 2110
Credit: 40,989,749
RAC: 19,556
Message 109958 - Posted: 2 Nov 2024, 1:52:54 UTC - in response to Message 109957.  
Last modified: 2 Nov 2024, 1:57:51 UTC

635k waiting for Validation and rising. Will we make it to 1 million?

Currently showing 663,306 and I was going to suggest we keep some kind of tally to see how high we can get...

...except, I've just looked and all servers are now showing as running on the server status page, so let's see if that starts reducing or whether it's a false reading.
The front page is still showing as frozen for some reason.

On the plus side, this is a particularly consistent run of work over the last week or so. Let's see what kind of a points boost we all eventually get.

All fun and games...

Edit: I've just checked and across my whole team I have 120 tasks pending validation, but as I scrolled through there are definitely one or two tasks that've now received credit, so I think validation is definitely starting to work through the very long queue. Boinc-process lives

Grant (SSSF)
Joined: 28 Mar 20
Posts: 1664
Credit: 17,391,955
RAC: 24,348
Message 109959 - Posted: 2 Nov 2024, 2:16:37 UTC - in response to Message 109958.  

Boinc-process lives
Till the next time.

It'd be nice if they got the main page Server Status info updating again, but if it's one or the other then it's better having the Validators running while there is work available.
Grant
Darwin NT

Sid Celery
Joined: 11 Feb 08
Posts: 2110
Credit: 40,989,749
RAC: 19,556
Message 109960 - Posted: 2 Nov 2024, 2:54:03 UTC - in response to Message 109959.  

Boinc-process lives
Till the next time.

It'd be nice if they got the main page Server Status info updating again, but if it's one or the other then it's better having the Validators running while there is work available.

Definitely. Waiting to validate has edged fractionally down to 662,008 on the Server Status page, but I'm definitely seeing more tasks than that validated - out of order for some reason but they all count.
I've got a full cache, but I'm manually polling anyway to see my credits going up each time.
These are our salad days (hours anyway)

Grant (SSSF)
Joined: 28 Mar 20
Posts: 1664
Credit: 17,391,955
RAC: 24,348
Message 109961 - Posted: 2 Nov 2024, 5:21:00 UTC
Last modified: 2 Nov 2024, 5:21:20 UTC

And the main page Server Status is updating again.

The Validators are validating, but they're seriously struggling- the backlog isn't getting any bigger, but it's not getting any less either.
Hopefully they'll start putting a dent in the backlog over the next few hours. Once that happens it shouldn't take long to then clear the backlog; but at present all they're doing is treading water.
Grant
Darwin NT

Sid Celery
Joined: 11 Feb 08
Posts: 2110
Credit: 40,989,749
RAC: 19,556
Message 109962 - Posted: 2 Nov 2024, 7:35:55 UTC - in response to Message 109961.  

The Validators are validating, but they're seriously struggling- the backlog isn't getting any bigger, but it's not getting any less either.
Hopefully they'll start putting a dent in the backlog over the next few hours. Once that happens it shouldn't take long to then clear the backlog; but at present all they're doing is treading water.

This is true. Several hours later, the 663k backlog is now 659k.
But my team's unvalidated tasks are up from 120 to 132.
It seemed a fair few were being validated at the start, but now not many more have been since.
If it takes 2 or 3 days to notice the entire server is down I'm not convinced anyone will notice at all that the validation backlog is barely reducing.
It may take until new tasks run out or, more likely, for boinc-process to fail again, take another few days, then get re-restarted to improve matters...

Grant (SSSF)
Joined: 28 Mar 20
Posts: 1664
Credit: 17,391,955
RAC: 24,348
Message 109963 - Posted: 2 Nov 2024, 10:03:11 UTC - in response to Message 109962.  

The Validators are validating, but they're seriously struggling- the backlog isn't getting any bigger, but it's not getting any less either.
Hopefully they'll start putting a dent in the backlog over the next few hours. Once that happens it shouldn't take long to then clear the backlog; but at present all they're doing is treading water.

This is true. Several hours later, the 663k backlog is now 659k.
Now it's up to 683k.

I'm hoping that it's just a case of a messy crash of the server, and it's just re-building/verifying its storage. In which case it could take a day or so to complete, during which performance is significantly degraded. And once done, the backlog will clear like it usually does in an hour or 2.
Or there is something still seriously wrong and the backlog will continue to grow slowly until the current batch of work runs out & the work being returned tapers off (or the server just crashes yet again, and the backlog climbs rapidly like before).
Grant
Darwin NT

Sid Celery
Joined: 11 Feb 08
Posts: 2110
Credit: 40,989,749
RAC: 19,556
Message 109965 - Posted: 2 Nov 2024, 12:24:07 UTC - in response to Message 109963.  

The Validators are validating, but they're seriously struggling- the backlog isn't getting any bigger, but it's not getting any less either.
Hopefully they'll start putting a dent in the backlog over the next few hours. Once that happens it shouldn't take long to then clear the backlog; but at present all they're doing is treading water.

This is true. Several hours later, the 663k backlog is now 659k.
Now it's up to 683k.

I'm hoping that it's just a case of a messy crash of the server, and it's just re-building/verifying its storage. In which case it could take a day or so to complete, during which performance is significantly degraded. And once done, the backlog will clear like it usually does in an hour or 2.
Or there is something still seriously wrong and the backlog will continue to grow slowly until the current batch of work runs out & the work being returned tapers off (or the server just crashes yet again, and the backlog climbs rapidly like before).

While we're guessing, I now note that when I've uploaded completed tasks I'm not seeing any change in credits so, despite what the server status page shows, the continuing buildup to 699k is because validation has stopped altogether, not just slowed.
While all servers show green/running I don't know what other trigger there'll be so someone notices, because it isn't even noticed when they're all red.
We could be waiting some while.

So, the new prediction game is: what will the validation backlog peak at? 1m? 1.2m? 1.5m?

[VENETO] boboviz
Joined: 1 Dec 05
Posts: 1989
Credit: 9,464,560
RAC: 12,433
Message 109967 - Posted: 2 Nov 2024, 17:55:44 UTC - in response to Message 109965.  

While all servers show green/running I don't know what other trigger there'll be so someone notices, because it isn't even noticed when they're all red.
We could be waiting some while.

So, the new prediction game is: what will the validation backlog peak at? 1m? 1.2m? 1.5m?


Maybe a solution is to stop the wus generator and stop download/upload until the validation queue is clear...

Grant (SSSF)
Joined: 28 Mar 20
Posts: 1664
Credit: 17,391,955
RAC: 24,348
Message 109968 - Posted: 2 Nov 2024, 19:51:15 UTC
Last modified: 2 Nov 2024, 20:27:25 UTC

Over the last 90 min, the Validator backlog has dropped by over 100k. Looks like it's dropping by around 35k per hour (when the Validators were down completely, the rate of increase was roughly 12k per hour).
It's taken 16 hours since the Validators were restarted, but we're starting to get some significant falls in the backlog, and looking at my systems' pending tasks, they've actually started to drop too.
*fingers crossed*
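As a rough back-of-the-envelope check, the time to clear a backlog at a steady net rate is just the backlog divided by that rate. The sketch below uses the roughly 35k per hour figure from this post. The 600,000 starting backlog is an illustrative assumption (about 100k below the 699k mentioned earlier), not a number taken from the Server Status page, and the function name `hours_to_clear` is made up for this example.

```python
def hours_to_clear(backlog, net_drop_per_hour):
    """Estimate hours until the backlog reaches zero at a constant net drop rate."""
    if net_drop_per_hour <= 0:
        raise ValueError("backlog is not shrinking at this rate")
    return backlog / net_drop_per_hour


if __name__ == "__main__":
    # Assumed inputs: 600,000 tasks waiting, shrinking by 35,000 per hour.
    print(round(hours_to_clear(600_000, 35_000), 1))  # prints 17.1
```

The real rate will vary as work is returned and validated, so this is only an order-of-magnitude estimate.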



Maybe a solution is to stop the wus generator and stop download/upload until the validation queue is clear...
If you stop the return of completed work, you get a massive surge of returned results awaiting Validation when it's re-enabled (instead of 10k per hour you're looking at 100k or more per hour), and if the Validators are still not working properly, you get an instant backlog & log jam.
Stopping new work from being sent would be the most effective method- as caches clear then the amount returned per hour tapers off. When work is re-enabled, the returned per hour gradually builds up again. No sudden massive surge.
Grant
Darwin NT

OffDutyTaoist
Joined: 10 Oct 06
Posts: 3
Credit: 1,973,131
RAC: 1,338
Message 109969 - Posted: 2 Nov 2024, 22:52:27 UTC

My Pixel 6 is having issues again with Rosetta v4.20 arm-android-linux-gnu.

rb_10_30_639032_632668_ab_t000__robetta_cstwt_5.0_FT_IGNORE_THE_REST_04_05_2997716_402_1

rb_10_30_639032_632668_ab_t000__robetta_cstwt_5.0_FT_IGNORE_THE_REST_04_10_2997716_399_1

rb_10_30_639032_632668_ab_t000__robetta_cstwt_5.0_FT_IGNORE_THE_REST_07_12_2997716_399_1

All were running at the same time, got up to ~0.319% then started resetting my phone. I paused all of them and tried running them individually, and all three would do the same thing when run separately. I ended up having to abort all of them and suspend the project unless someone has an idea.

Grant (SSSF)
Joined: 28 Mar 20
Posts: 1664
Credit: 17,391,955
RAC: 24,348
Message 109970 - Posted: 2 Nov 2024, 23:14:07 UTC - in response to Message 109969.  

You've managed to complete 2 of those Tasks on that phone, but it's taking 10.5 hours to do 7.5 hours of work, which indicates that the phone is busy doing other things while it's trying to process the Rosetta Tasks.
It's possible the phone is overheating, although it should just throttle & not restart.
Other than setting the phone to run only when it's not doing other things, or only while on the charger (although then you would have to change the Target CPU time to 4 hours or less to make sure the Tasks are returned before the deadline), I'd say Rosetta just isn't a suitable project for that device.
Grant
Darwin NT

Sid Celery
Joined: 11 Feb 08
Posts: 2110
Credit: 40,989,749
RAC: 19,556
Message 109971 - Posted: 3 Nov 2024, 2:40:32 UTC - in response to Message 109968.  

Over the last 90 min, the Validator backlog has dropped by over 100k. Looks like it's dropping by around 35k per hour (when the Validators were down completely, the rate of increase was roughly 12k per hour).
It's taken 16 hours since the Validators were restarted, but we're starting to get some significant falls in the backlog, and looking at my systems' pending tasks, they've actually started to drop too.
*fingers crossed*

I looked a few hours ago and my 132 pending had dropped to 80 and now I've arrived home it's already further down to just 31.
Backlog down to 370k so it's all looking good now. My fears from yesterday have largely been allayed.

©2024 University of Washington
https://www.bakerlab.org