Problems and Technical Issues with Rosetta@home

Mr P Hucker
Joined: 12 Aug 06
Posts: 1600
Credit: 11,716,372
RAC: 18,198
Message 98835 - Posted: 7 Sep 2020, 23:49:30 UTC - in response to Message 98827.  

Even if I had an account, I doubt I'd be asked to log in for an inline image. Or does it stay logged in forever on your browser?


My Firefox browser will save logon information indefinitely as long as you use it every so often.


Yes, my Opera browser does that too, but all it does is fill in the password when you're asked for it. I don't think an inline image would pass the correct request through.
ID: 98835 · Rating: 0 · rate: Rate + / Rate - Report as offensive    Reply Quote
Mr P Hucker
Joined: 12 Aug 06
Posts: 1600
Credit: 11,716,372
RAC: 18,198
Message 98836 - Posted: 7 Sep 2020, 23:52:13 UTC - in response to Message 98829.  
Last modified: 7 Sep 2020, 23:52:40 UTC

Unless you run Milkyway on a GPU. Those have tasks that can take 30 seconds. And they refuse to fix the server (I've asked two successive project leaders and nothing gets fixed) - you cannot download new tasks if you're reporting completed tasks, so you need a big buffer (well 3 hours anyway).
If it were your only project, yes. If you're running more than one project, it's still not necessary even if one of the projects has issues with work allocation. Your other project will pick up work, and then BOINC will do extra for the first project when it can get work to balance out the debt between projects.


I run more than Milkyway and I need the buffer. Otherwise Boinc only ever asks MW for a couple of 30 second tasks, as that's all it needs to fill the buffer. Then it hits the problem of not getting any more until it's backed off for 10 minutes. So even if I've said half Einstein, half MW, it ends up only managing to run MW a tenth of the time.
ID: 98836 · Rating: 0 · rate: Rate + / Rate - Report as offensive    Reply Quote
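The "a tenth of the time" estimate in the post above can be checked with a little arithmetic. This is a rough sketch under a simplified model (the client fetches a batch, crunches it, then sits out one back-off before the next fetch); the function name and the model are illustrative assumptions, and the figures are the ones quoted in the post.

```python
def duty_cycle(tasks_per_fetch: int, task_seconds: float, backoff_seconds: float) -> float:
    """Fraction of wall-clock time spent on a project when every fetch
    is followed by one back-off period with no work from that project."""
    busy = tasks_per_fetch * task_seconds
    return busy / (busy + backoff_seconds)


# Figures from the post: a couple of 30-second tasks, then a 10-minute back-off.
small_buffer = duty_cycle(2, 30, 600)

# A 3-hour buffer of the same tasks (360 x 30 s) against the same back-off.
large_buffer = duty_cycle(360, 30, 600)

print(f"small buffer: {small_buffer:.1%}")
print(f"3-hour buffer: {large_buffer:.1%}")
```

Under this model the small buffer gives MW roughly 9% of the time, which matches the "tenth" in the post, while the 3-hour buffer makes the back-off a small fraction of each cycle.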
10esseetony

Joined: 24 Dec 11
Posts: 5
Credit: 23,602,985
RAC: 0
Message 98838 - Posted: 8 Sep 2020, 0:11:06 UTC - in response to Message 98820.  
Last modified: 8 Sep 2020, 0:18:59 UTC

You are correct, no, one shouldn't have to log in to see the image, now that you mention it. I'll just link to the thread, but beware, have your adblocker turned on: https://forums.anandtech.com/threads/recent-changes-in-projects.2500471/post-40275238

My tasks that timed out were not due to an inability to complete them, it was forgetfulness that I had 'temporarily' suspended Rosetta on that machine. ///insert forehead slap emoji here///

I would caution against having zero cache as you suggest....I pay too much for my energy bill to have my machines idle for ANY length of time (internet outage/server outage/server upgrade/home router locked up/etc etc). Rosetta has run dry many times and I do not check my machines but once daily.
ID: 98838 · Rating: 0 · rate: Rate + / Rate - Report as offensive    Reply Quote
Grant (SSSF)
Joined: 28 Mar 20
Posts: 1671
Credit: 17,527,680
RAC: 23,122
Message 98839 - Posted: 8 Sep 2020, 0:23:21 UTC - in response to Message 98836.  

Otherwise Boinc only ever asks MW for a couple of 30 second tasks, as that's all it needs to fill the buffer. Then it hits the problem of not getting any more until it's backed off for 10 minutes. So even if I've said half Einstein, half MW, it ends up only managing to run MW a tenth of the time.
Looks like it's been an issue forever.
J Stateson built a BOINC client to work around Milkyway's stuffed up server configuration.

Finally getting new tasks only seconds after running out. May not be worth the hassle.
Grant
Darwin NT
ID: 98839 · Rating: 0 · rate: Rate + / Rate - Report as offensive    Reply Quote
mikey
Joined: 5 Jan 06
Posts: 1895
Credit: 9,118,186
RAC: 6,004
Message 98840 - Posted: 8 Sep 2020, 0:29:32 UTC - in response to Message 98836.  

Peter Hucker wrote:
I run more than Milkyway and I need the buffer. Otherwise Boinc only ever asks MW for a couple of 30 second tasks, as that's all it needs to fill the buffer. Then it hits the problem of not getting any more until it's backed off for 10 minutes. So even if I've said half Einstein, half MW, it ends up only managing to run MW a tenth of the time.


MilkyWay needs us to run other projects' tasks that run more than 10 minutes, because that's the backoff the project requires: NO communication with MW for 10 minutes before it will send new GPU tasks. Personally I use PrimeGrid, as they have short tasks and respect the zero resource share. I run 1, maybe 2, PG tasks and then MW refills the cache and I am off and crunching them again. If the GPU is not the fastest, then Collatz will work as a zero resource share project too.

If you want to go outside the norm, a user made an alternative BOINC Manager at MilkyWay that handles the 10 minute backoff so that it's not a problem. I don't know how, but people that use it say it works.
ID: 98840 · Rating: 0 · rate: Rate + / Rate - Report as offensive    Reply Quote
Grant (SSSF)
Joined: 28 Mar 20
Posts: 1671
Credit: 17,527,680
RAC: 23,122
Message 98841 - Posted: 8 Sep 2020, 0:31:11 UTC - in response to Message 98838.  

I pay too much for my energy bill to have my machines idle for ANY length of time
?
If they are idle, the power they'd be using (unless they're really, really old systems), would be bugger all.



Rosetta has run dry many times and I do not check my machines but once daily.
Rosetta might have run out, but you are also doing work for over a dozen other projects. I can't see all those projects running out of work at the same time- so you'll do a bit more work for those projects, then a bit extra for Rosetta when it has work again.
Hence no need for a cache, let alone one more than a few hours or so.

If you have crappy internet, what's the longest usual outage? Set the cache for that. Even so, with the short deadlines with Rosetta, anything larger than a couple of days when running that many projects will result in some missed deadlines as the systems work out how to meet their Resource share settings.
Grant
Darwin NT
ID: 98841 · Rating: 0 · rate: Rate + / Rate - Report as offensive    Reply Quote
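The advice above (size the cache to the longest usual outage) comes down to a unit conversion, because BOINC expresses its work buffer in days. A minimal sketch, assuming the tag names `work_buf_min_days` and `work_buf_additional_days` as used in BOINC's `global_prefs_override.xml`; verify them against your client's documentation before relying on this. The helper names are made up for illustration, and the 3-hour input is just the buffer size mentioned earlier in the thread.

```python
def outage_to_days(outage_hours: float) -> float:
    """Convert an outage length in hours to the days unit BOINC buffers use."""
    return outage_hours / 24.0


def override_fragment(outage_hours: float, additional_days: float = 0.0) -> str:
    """Build an override fragment that buffers enough work to ride out the outage.

    Assumption: the tag names below match global_prefs_override.xml.
    """
    min_days = outage_to_days(outage_hours)
    return (
        "<global_preferences>\n"
        f"  <work_buf_min_days>{min_days:.3f}</work_buf_min_days>\n"
        f"  <work_buf_additional_days>{additional_days:.3f}</work_buf_additional_days>\n"
        "</global_preferences>\n"
    )


# Example input: a 3-hour worst-case outage.
print(override_fragment(3))
```

The sketch only prints the fragment; it deliberately does not write any file, since the location of the BOINC data directory differs between systems.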
hangint3n

Joined: 23 Mar 20
Posts: 8
Credit: 1,958,078
RAC: 0
Message 98845 - Posted: 8 Sep 2020, 0:55:15 UTC - in response to Message 98812.  

Just had a similar problem on my box. It froze the whole thing up.

===
hangint3n
ID: 98845 · Rating: 0 · rate: Rate + / Rate - Report as offensive    Reply Quote
Mr P Hucker
Joined: 12 Aug 06
Posts: 1600
Credit: 11,716,372
RAC: 18,198
Message 98853 - Posted: 8 Sep 2020, 2:00:03 UTC - in response to Message 98838.  
Last modified: 8 Sep 2020, 2:01:11 UTC

You are correct, no, one shouldn't have to log in to see the image, now that you mention it. I'll just link to the thread, but beware, have your adblocker turned on: https://forums.anandtech.com/threads/recent-changes-in-projects.2500471/post-40275238

My tasks that timed out were not due to an inability to complete them, it was forgetfulness that I had 'temporarily' suspended Rosetta on that machine. ///insert forehead slap emoji here///

I would caution against having zero cache as you suggest....I pay too much for my energy bill to have my machines idle for ANY length of time (internet outage/server outage/server upgrade/home router locked up/etc etc). Rosetta has run dry many times and I do not check my machines but once daily.


I can get to the forum with your link, but clicking the image requests me to log in. I don't have an account.

And I have 11 ad blockers, will that do? Not only do they block ads, but also youtube video ads, EU cookie notices, government coronavirus advice, and links to grass people off in forums that used a naughty word.

Electricity isn't wasted when the PCs are idle; they don't use much then.

I have all 6 machines displayed permanently on a monitor [1] in here, via Boinctasks. I spot immediately if one is playing up. The other 5 machines are in the garage where I can't hear the many fans, but usually I can sort stuff via Boinctasks or remote desktop.

[1] Correction, two monitors, one above the other. The list got too large with 5 GPUs and 66 cores.
ID: 98853 · Rating: 0 · rate: Rate + / Rate - Report as offensive    Reply Quote
Mr P Hucker
Joined: 12 Aug 06
Posts: 1600
Credit: 11,716,372
RAC: 18,198
Message 98854 - Posted: 8 Sep 2020, 2:02:09 UTC - in response to Message 98839.  

Otherwise Boinc only ever asks MW for a couple of 30 second tasks, as that's all it needs to fill the buffer. Then it hits the problem of not getting any more until it's backed off for 10 minutes. So even if I've said half Einstein, half MW, it ends up only managing to run MW a tenth of the time.
Looks like it's been an issue forever.
J Stateson built a BOINC client to work around Milkyway's stuffed up server configuration.

Finally getting new tasks only seconds after running out. May not be worth the hassle.


Yes I've been attacking that problem a lot.
ID: 98854 · Rating: 0 · rate: Rate + / Rate - Report as offensive    Reply Quote
Mr P Hucker
Joined: 12 Aug 06
Posts: 1600
Credit: 11,716,372
RAC: 18,198
Message 98855 - Posted: 8 Sep 2020, 2:04:18 UTC - in response to Message 98840.  

Peter Hucker wrote:
I run more than Milkyway and I need the buffer. Otherwise Boinc only ever asks MW for a couple of 30 second tasks, as that's all it needs to fill the buffer. Then it hits the problem of not getting any more until it's backed off for 10 minutes. So even if I've said half Einstein, half MW, it ends up only managing to run MW a tenth of the time.


MilkyWay needs us to run other projects' tasks that run more than 10 minutes, because that's the backoff the project requires: NO communication with MW for 10 minutes before it will send new GPU tasks. Personally I use PrimeGrid, as they have short tasks and respect the zero resource share. I run 1, maybe 2, PG tasks and then MW refills the cache and I am off and crunching them again. If the GPU is not the fastest, then Collatz will work as a zero resource share project too.

If you want to go outside the norm, a user made an alternative BOINC Manager at MilkyWay that handles the 10 minute backoff so that it's not a problem. I don't know how, but people that use it say it works.


The 10 minutes isn't enforced by MW servers. Boinc chooses to wait that long when it's denied the first time. If you do a manual update after about 2 minutes, it gets them. So presumably the modified Boinc just changes that setting. Or it could stop Boinc reporting tasks every time it contacts the server; that would work.
ID: 98855 · Rating: 0 · rate: Rate + / Rate - Report as offensive    Reply Quote
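The manual update described above can also be triggered from a script rather than the Update button. A hedged sketch: it only builds the command line, assuming the `boinccmd` tool that ships with BOINC is on the PATH and that its `--project <URL> update` operation does the same thing as a manual update. The URL is a placeholder (not the real MilkyWay address), and the 2-minute wait is simply the figure from the post.

```python
import shlex
from typing import List

# Placeholder only: substitute the project URL exactly as your own client lists it.
PROJECT_URL = "https://example.invalid/project/"

# Wait suggested in the post before retrying, in seconds.
RETRY_AFTER_SECONDS = 120


def update_command(project_url: str) -> List[str]:
    """Build the boinccmd invocation that asks the client to contact a project now."""
    return ["boinccmd", "--project", project_url, "update"]


cmd = update_command(PROJECT_URL)
print(f"after {RETRY_AFTER_SECONDS} s, run: {shlex.join(cmd)}")
# To actually issue the request: subprocess.run(cmd, check=True)
```

Whether this gets around the back-off on a given setup is untested here; it only automates what the post describes doing by hand.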
Grant (SSSF)
Joined: 28 Mar 20
Posts: 1671
Credit: 17,527,680
RAC: 23,122
Message 98861 - Posted: 8 Sep 2020, 2:29:23 UTC - in response to Message 98791.  
Last modified: 8 Sep 2020, 2:52:38 UTC

Problems and Technical Issues, eh? How about 41GB of RAM for ONE task? Name: ygG5REMC******1009391_1307_0
So far all of these reports of out of control Memory Tasks have been on Linux systems. Has anyone with a Windows system got one of the problem Tasks yet?


Edit-
Even if the RAM usage doesn't get out of control, it looks like they crash and burn anyway.

kp8RjDVk_fold_and_dock_SAVE_ALL_OUT_1009390_4701_0

              Outcome Computation error
         Client state Compute error
          Exit status 1 (0x00000001) Unknown error code
          Computer ID 3930525
             Run time 22 min 26 sec
             CPU time 21 min 53 sec
       Validate state Invalid
               Credit 0.00
    Device peak FLOPS 5.60 GFLOPS
  Application version Rosetta v4.20 x86_64-pc-linux-gnu
Peak working set size 617.72 MB
       Peak swap size 758.16 MB
      Peak disk usage 48.62 MB


Stderr output
<core_client_version>7.14.2</core_client_version>
<![CDATA[
<message>
process exited with code 1 (0x1, -255)</message>
<stderr_txt>
command: ../../projects/boinc.bakerlab.org_rosetta/rosetta_4.20_x86_64-pc-linux-gnu @kp8RjDVk_fold_and_dock_flags -silent_gz -mute all -out:file:silent default.out -in:file:boinc_wu_zip fold_and_dock_kp8RjDVk_data.zip -nstruct 10000 -cpu_run_time 28800 -boinc:max_nstruct 20000 -checkpoint_interval 120 -database minirosetta_database -in::file::zip minirosetta_database.zip -boinc::watchdog -boinc::cpu_run_timeout 36000 -run::rng mt19937 -constant_seed -jran 3868745
Using database: database_357d5d93529_n_methyl/minirosetta_database

ERROR: Error in core::kinematics::FoldTree::get_jump_that_builds_residue(): This residue is not the child of (built by) a jump!
ERROR:: Exit from: src/core/kinematics/FoldTree.cc line: 436
BOINC:: Error reading and gzipping output datafile: default.out
20:45:43 (4601): called boinc_finish(1)

</stderr_txt>
]]>



And an out of control RAM error Task,

kp8RjDVk_fold_and_dock_SAVE_ALL_OUT_1009390_893_0

              Outcome Computation error
         Client state Compute error
          Exit status 1 (0x00000001) Unknown error code
          Computer ID 3930525
             Run time 44 min 11 sec
             CPU time 44 min 11 sec
       Validate state Invalid
               Credit 24.00
    Device peak FLOPS 5.60 GFLOPS
  Application version Rosetta v4.20 x86_64-pc-linux-gnu
Peak working set size 19,307.60 MB
       Peak swap size 20,495.17 MB
      Peak disk usage 49.49 MB


Stderr output
<core_client_version>7.14.2</core_client_version>
<![CDATA[
<message>
process exited with code 1 (0x1, -255)</message>
<stderr_txt>
command: ../../projects/boinc.bakerlab.org_rosetta/rosetta_4.20_x86_64-pc-linux-gnu @kp8RjDVk_fold_and_dock_flags -silent_gz -mute all -out:file:silent default.out -in:file:boinc_wu_zip fold_and_dock_kp8RjDVk_data.zip -nstruct 10000 -cpu_run_time 28800 -boinc:max_nstruct 20000 -checkpoint_interval 120 -database minirosetta_database -in::file::zip minirosetta_database.zip -boinc::watchdog -boinc::cpu_run_timeout 36000 -run::rng mt19937 -constant_seed -jran 3872553
Using database: database_357d5d93529_n_methyl/minirosetta_database

ERROR: Error in core::kinematics::FoldTree::get_jump_that_builds_residue(): This residue is not the child of (built by) a jump!
ERROR:: Exit from: src/core/kinematics/FoldTree.cc line: 436
BOINC:: Error reading and gzipping output datafile: default.out
19:10:32 (4261): called boinc_finish(1)

</stderr_txt>
]]>

Grant
Darwin NT
ID: 98861 · Rating: 0 · rate: Rate + / Rate - Report as offensive    Reply Quote
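To spot the out-of-control tasks among many reports like the two above, the "Peak working set size" line can be pulled out and compared with a limit. A minimal sketch: the 4096 MB threshold and the function names are illustrative assumptions, and the two inputs are the values quoted in the post.

```python
import re
from typing import Optional

# Matches report lines such as "Peak working set size 19,307.60 MB".
PEAK_RE = re.compile(r"Peak working set size\s+([\d,]+(?:\.\d+)?)\s*MB")


def peak_working_set_mb(report: str) -> Optional[float]:
    """Return the peak working set in MB, or None if the line is absent."""
    match = PEAK_RE.search(report)
    if match is None:
        return None
    return float(match.group(1).replace(",", ""))


def is_runaway(report: str, limit_mb: float = 4096.0) -> bool:
    """Flag a task whose peak working set exceeds the (hypothetical) limit."""
    peak = peak_working_set_mb(report)
    return peak is not None and peak > limit_mb


reports = {
    "kp8RjDVk_fold_and_dock_SAVE_ALL_OUT_1009390_4701_0": "Peak working set size 617.72 MB",
    "kp8RjDVk_fold_and_dock_SAVE_ALL_OUT_1009390_893_0": "Peak working set size 19,307.60 MB",
}
for name, text in reports.items():
    print(name, peak_working_set_mb(text), is_runaway(text))
```

Only the second task is flagged. Both tasks failed with the same FoldTree error, so memory use alone does not separate the tasks that fail from those that do not.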
10esseetony

Joined: 24 Dec 11
Posts: 5
Credit: 23,602,985
RAC: 0
Message 98862 - Posted: 8 Sep 2020, 3:02:38 UTC - in response to Message 98841.  



.....Rosetta might have run out, but you are also doing work for over a dozen other projects....



LOL, I am just curious, where are you guys getting the info that I am running 12+ projects at once [presumably on a single computer]? Let me help you out: https://stats.free-dc.org/userbycpid/627a6be35f3dbebd60ed8b5cda8c0b95

I am currently in 'Summer' mode, only running 4 computers out of the 21 at my disposal. Well, running 5 if you want to count that poor old iMac in my daughter's room. My current projects are Universe, WCG, and Rosetta, all other points received today are from quorum 2 projects (wingmen double checking my work finally).

If I do run multiple projects on one machine, I prefer only 3 per computer, but I assure you they each will have their own client/manager running just one project each at a set percentage of CPU usage, and in no way are fighting with other projects for run time. If you would like to know how to do that, see this thread:
https://forums.anandtech.com/threads/multiple-boinc-clients-on-the-same-computer.2573424/



Now, back to the topic, good catch that the problem is (possibly) Linux only, and that they crash and burn anyway. I was curious to see the points on that one, but I'll go nuke it instead.
ID: 98862 · Rating: 0 · rate: Rate + / Rate - Report as offensive    Reply Quote
Grant (SSSF)
Joined: 28 Mar 20
Posts: 1671
Credit: 17,527,680
RAC: 23,122
Message 98864 - Posted: 8 Sep 2020, 3:56:06 UTC - in response to Message 98862.  
Last modified: 8 Sep 2020, 3:56:25 UTC

LOL, I am just curious, where are you guys getting the info that I am running 12+ projects at once
Click on a person's name & it shows what projects they are doing.


[presumably on a single computer]
Because that was the whole point of BOINC, one manager to let you run multiple projects. Whether you have 1 or 1,000 systems doing the work, you install BOINC, attach to the projects of your choice & then let it manage things according to your Resource share settings. If people choose to complicate things, it's their choice.




Now, back to the topic, good catch that the problem is (possibly) Linux only, and that they crash and burn anyway. I was curious to see the points on that one, but I'll go nuke it instead.
Hopefully over the next day or so we'll see some results from Windows machines showing whether they crash and burn as well (most likely), and whether some of the Work Units also have runaway memory usage issues.
Grant
Darwin NT
ID: 98864 · Rating: 0 · rate: Rate + / Rate - Report as offensive    Reply Quote
10esseetony

Joined: 24 Dec 11
Posts: 5
Credit: 23,602,985
RAC: 0
Message 98865 - Posted: 8 Sep 2020, 4:24:08 UTC - in response to Message 98864.  
Last modified: 8 Sep 2020, 4:43:19 UTC

Well, thanks to your findings, I have switched my allocated 8 of 32 threads of Ryzen under Linux to 10 threads of Haswell on Windows. Hopefully the issue is therefore solved (for me).....and then I downloaded 10+10 days of tasks! (J/K!!!!!)

Regarding resource share settings......I have Rosetta at 1 and WCG at 9999, and yet Rosetta still takes control and suspends WCG tasks during this transition between machines. I am glad the BOINC client works for you 100% as intended. Which I am sure you have tested. Meanwhile I'll simply continue to complicate things.

PS: Click on a person's name and it shows everything they have EVER done. You have some very nice systems, and I appreciate you donating your computers and your time and your money for citizen science research.
ID: 98865 · Rating: 0 · rate: Rate + / Rate - Report as offensive    Reply Quote
Grant (SSSF)
Joined: 28 Mar 20
Posts: 1671
Credit: 17,527,680
RAC: 23,122
Message 98866 - Posted: 8 Sep 2020, 5:59:18 UTC - in response to Message 98865.  

Regarding resource share settings......I have Rosetta at 1 and WCG at 9999, and yet Rosetta still takes control and suspends WCG tasks during this transition between machines.
Because you have effectively joined a new project with that system.
To date all the work has been on the existing project, so the new/increased computation resource project is now owed a debt before the work done actually matches up with your resource share settings.
And with the short deadlines for Rosetta, the long Task processing times, and the amount of work the system has just got, it needs to do what it has for Rosetta to meet those deadlines. Once that is done, it will then process mostly WCG until the debt then owed to it is met, then some more Rosetta, then more WCG etc, etc until it settles down to the work being processed at any given time being in accordance with your Resource share settings.

Resource share is something that balances out over the longer term, not just a matter of hours- and certainly not straight off the bat.
The fewer projects, the smaller the cache, the more cores & threads you have, the sooner the Resource share settings will be honoured (within a week, even within a few days in many cases). The fewer cores & threads, the larger the cache and the more projects you have, the longer it takes for your Resource share to be honoured (as in months, and as in many months if people then start trying to micro manage things).
Grant
Darwin NT
ID: 98866 · Rating: 0 · rate: Rate + / Rate - Report as offensive    Reply Quote
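The debt idea described above can be illustrated with a toy model. This is not BOINC's actual scheduler, only a sketch of the principle: each time slice goes to whichever project is furthest below its entitled share of all work done so far. The project names, the equal shares and the 100-slice head start are made-up values for illustration.

```python
from typing import Dict, List, Tuple


def simulate(shares: Dict[str, float], done: Dict[str, int],
             slices: int) -> Tuple[Dict[str, int], List[str]]:
    """Toy debt-based scheduler (illustrative only, not BOINC's algorithm)."""
    total_share = sum(shares.values())
    done = dict(done)  # copy so the caller's dict is left untouched
    history: List[str] = []
    for _ in range(slices):
        total_after = sum(done.values()) + 1  # work done once this slice finishes
        # Debt = work the project is entitled to minus work it has actually received.
        debt = {p: shares[p] / total_share * total_after - done[p] for p in shares}
        chosen = max(debt, key=lambda p: debt[p])
        done[chosen] += 1
        history.append(chosen)
    return done, history


# "old" already has 100 slices of work on this machine; "new" has just been added.
final, history = simulate({"old": 1, "new": 1}, {"old": 100, "new": 0}, 300)
print("slices given to 'new' in the first 100:", history[:100].count("new"))
print("totals after 300 slices:", final)
```

In this model the newly added project takes every slice until its debt is paid off, and only then do the two alternate. That is the same pattern as Rosetta pushing WCG aside right after the switch, though the real client also has to account for deadlines, cache size and core count.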
Grant (SSSF)
Joined: 28 Mar 20
Posts: 1671
Credit: 17,527,680
RAC: 23,122
Message 98916 - Posted: 9 Sep 2020, 1:13:00 UTC

Ah, we're back.
Forums/server info was all MIA for a while there due to the database being down/unavailable.
Grant
Darwin NT
ID: 98916 · Rating: 0 · rate: Rate + / Rate - Report as offensive    Reply Quote
Grant (SSSF)
Joined: 28 Mar 20
Posts: 1671
Credit: 17,527,680
RAC: 23,122
Message 98918 - Posted: 9 Sep 2020, 2:26:39 UTC - in response to Message 98916.  
Last modified: 9 Sep 2020, 2:28:58 UTC

Ah, we're back.
Forums/server info was all MIA for a while there due to the database being down/unavailable.


Now just getting random
Project is down
The project's database server is either down or ran out of connections at the moment. Please check back in a few minutes.
errors.
Grant
Darwin NT
ID: 98918 · Rating: 0 · rate: Rate + / Rate - Report as offensive    Reply Quote
Mr P Hucker
Joined: 12 Aug 06
Posts: 1600
Credit: 11,716,372
RAC: 18,198
Message 98938 - Posted: 9 Sep 2020, 21:17:03 UTC - in response to Message 98862.  


If I do run multiple projects on one machine, I prefer only 3 per computer, but I assure you they each will have their own client/manager running just one project each at a set percentage of CPU usage, and in no way are fighting with other projects for run time. If you would like to know how to do that, see this thread:
https://forums.anandtech.com/threads/multiple-boinc-clients-on-the-same-computer.2573424/


What's the advantage of a client per project?
ID: 98938 · Rating: 0 · rate: Rate + / Rate - Report as offensive    Reply Quote
Mr P Hucker
Joined: 12 Aug 06
Posts: 1600
Credit: 11,716,372
RAC: 18,198
Message 98939 - Posted: 9 Sep 2020, 21:20:12 UTC - in response to Message 98864.  

Because that was the whole point of BOINC, one manager to let you run multiple projects. Whether you have 1 or 1,000 systems doing the work, you install BOINC, attach to the projects of your choice & then let it manage things according to your Resource share settings. If people choose to complicate things, it's their choice.


It's a pity Boinc doesn't manage multiple computers and we have to use third party programs to do so. I use Boinctasks, and in fact I'd use it for a single machine too, because its display is 10 times better than Boinc's. For a start it colour codes running, queued, etc, and collapses a queue of 50 tasks into one line. The actual Boinc manager is unusable as an interface.
ID: 98939 · Rating: 0 · rate: Rate + / Rate - Report as offensive    Reply Quote



©2024 University of Washington
https://www.bakerlab.org