Problems and Technical Issues with Rosetta@home

Sid Celery

Joined: 11 Feb 08
Posts: 2115
Credit: 41,115,238
RAC: 19,699
Message 97915 - Posted: 5 Jul 2020, 3:17:45 UTC - in response to Message 97872.  

What's easiest is to set Rosetta at, say, 50%, WCG at 25% and some other project at 25% and let Boinc figure it out, which it will do over time. Just be sure to keep your cache sizes small so you don't run into deadline problems. With Rosetta's 3 day deadline, if you have 3 days of work, NO other projects will crunch because their deadlines will be further out than 3 days.
Where are these resource share settings hidden?
In your account, Rosetta@home preferences, Resource share.
The number there isn't a percentage. It makes up the ratio for the work to be done, together with the values from your other projects.

Yes, it isn't a %age.
I've seen someone's now pointed out where the setting is at WCG, but I could never find it before, so I just increased Rosetta to 2900. Amounts to the same thing.
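To make the "ratio, not a percentage" point concrete, here is a minimal sketch of how the share numbers turn into fractions of the work. The function name and the value of 100 for the other project are assumptions for illustration only; this is not BOINC's actual scheduler code.

```python
def share_fractions(shares):
    """Convert resource share numbers into each project's fraction of the total."""
    total = sum(shares.values())
    if total <= 0:
        raise ValueError("at least one project needs a positive resource share")
    return {name: value / total for name, value in shares.items()}


# Rosetta raised to 2900 while the other project is assumed to stay at 100.
fractions = share_fractions({"Rosetta": 2900, "WCG": 100})
for name, fraction in fractions.items():
    print(f"{name}: {fraction:.1%}")
```

Only the ratio matters, so 50/25/25 and 200/100/100 would give the same split.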

Sid Celery
Joined: 11 Feb 08
Posts: 2115
Credit: 41,115,238
RAC: 19,699
Message 97916 - Posted: 5 Jul 2020, 3:21:19 UTC - in response to Message 97898.  

I've noticed the same: Tasks are arriving with an 8 hour estimated completion time.
Setting is at 12 hours.

Definitely, yes.
During the last outage I increased my runtimes to 12hrs to eke my last few out, and they ran for 12hrs, but when new tasks came through, the unstarted ones still showed 8hrs.

I've reduced my runtimes back to 8hrs. Boinc has enough trouble with scheduling without me or Rosetta making it worse.

Stevie G
Joined: 15 Dec 18
Posts: 107
Credit: 822,669
RAC: 1,625
Message 97930 - Posted: 6 Jul 2020, 2:46:53 UTC - in response to Message 97866.  
Last modified: 6 Jul 2020, 2:47:37 UTC

[quote]8Gb RAM ought to be plenty for a 2-core processor.
Have you looked at the previous advice in this thread and compared to your own settings (even though the advice was for a different machine)? There should be plenty for you to consider.
Boinc <ought> to be able to give your other projects enough time to complete before their deadlines without you having to suspend them. The longer you can run without interfering, the better Boinc will be able to decide for you.[/quote]

For some reason, the computer shut down and was unresponsive for 48 hours. No action from the power button, hard drive, etc. Nada, nichts, zip.

Power cable was OK, I don't think there's an inline fuse, so I dunno. There's a reset button on the power supply, but I didn't mess with that and the button is not popped out. Overheat? Usually, that just results in a restart. To be safe, I just vacuumed out all the accumulated cat hair and dust. We had some thunderstorms here last night, so maybe there was a power interruption. But nothing else in the house was affected and this machine is on a UPS backup, which did not register any action. A deep mystery.

But when I just now turned it on, et Voila!! It awoke from its coma. Which is how I'm writing to you at this moment.{:>) No explanation for that, but I'll take it.

However, I've been out of business for more than two days, with deadlines rapidly approaching.

So I will take your suggestion under advisement and scrutinize my settings and preferences.

Thanks again.

SGaber

robertmiles
Joined: 16 Jun 08
Posts: 1232
Credit: 14,269,631
RAC: 3,846
Message 97931 - Posted: 6 Jul 2020, 2:57:11 UTC - in response to Message 97930.  
Last modified: 6 Jul 2020, 3:03:01 UTC

[snip]

For some reason, the computer shut down and was unresponsive for 48 hours. No action from the power button, hard drive, etc. Nada, nichts, zip.

The shutdown is typical after a momentary loss of power.

The UPS may have let its battery or batteries run too low, for example if its rating was too low for your computer. If so, it would eventually have recharged them once the computer had gone long enough without drawing power.

You may have needed to unplug it to keep it from being confused about whether it was still running.

Sid Celery
Joined: 11 Feb 08
Posts: 2115
Credit: 41,115,238
RAC: 19,699
Message 97955 - Posted: 7 Jul 2020, 22:58:56 UTC - in response to Message 97930.  

8Gb RAM ought to be plenty for a 2-core processor.
Have you looked at the previous advice in this thread and compared to your own settings (even though the advice was for a different machine)? There should be plenty for you to consider.
Boinc <ought> to be able to give your other projects enough time to complete before their deadlines without you having to suspend them. The longer you can run without interfering, the better Boinc will be able to decide for you.

For some reason, the computer shut down and was unresponsive for 48 hours. No action from the power button, hard drive, etc. Nada, nichts, zip.

Power cable was OK, I don't think there's an inline fuse, so I dunno. There's a reset button on the power supply, but I didn't mess with that and the button is not popped out. Overheat? Usually, that just results in a restart. To be safe, I just vacuumed out all the accumulated cat hair and dust. We had some thunderstorms here last night, so maybe there was a power interruption. But nothing else in the house was affected and this machine is on a UPS backup, which did not register any action. A deep mystery

But when I just now turned it on et Voila!! It awoke from its coma. Which is how I'm writing to you at this moment.{:>) No explanation for that, but I'll take it.

However, I've been out of business for more than two days, with deadlines rapidly approaching.

So I will take your suggestion under advisement and scrutinize my settings and preferences.

Thanks again.
SGaber

That's not a great sign. It's quite an old PC and must have done a lot of work in its time.
The best thing you've done is vacuum it out, because it sounds heat-related to me and you'll have helped it run cooler by getting rid of the junk, which will extend its remaining life.
I'm actually in much the same situation myself and considering what my next PC should be within my budget.

justsomeguy
Joined: 24 May 17
Posts: 1
Credit: 375,643
RAC: 0
Message 97961 - Posted: 8 Jul 2020, 15:43:42 UTC

Recently, I started seeing a lot of jobs completing with a status of "aborted by project". They were completed prior to the deadline, but it doesn't appear that I get any credit for them either.
Any ideas/thoughts on this?

Falconet
Joined: 9 Mar 09
Posts: 353
Credit: 1,222,776
RAC: 4,804
Message 97964 - Posted: 8 Jul 2020, 19:56:05 UTC
Last modified: 8 Jul 2020, 20:04:34 UTC

https://boinc.bakerlab.org/rosetta/workunit.php?wuid=1092259599


Both tasks errored out after just a few seconds. Slightly different error codes but the same "upload failure":

</stderr_txt>
<message>
upload failure: <file_xfer_error>
<file_name>TFSCAFFOLD0001_6_SAVE_ALL_OUT_IGNORE_THE_REST_0ub6wd0j_953357_1_1_r1180454695_0</file_name>
<error_code>-240(stat() failed)</error_code>
</file_xfer_error>
</message>
]]>


</stderr_txt>
<message>
upload failure: <file_xfer_error>
<file_name>TFSCAFFOLD0001_6_SAVE_ALL_OUT_IGNORE_THE_REST_0ub6wd0j_953357_1_0_r1298488601_0</file_name>
<error_code>-161 (not found)</error_code>
</file_xfer_error>

</message>
]]>


EDIT: Got another TFSCAFFOLD0001 WU, also errored after just a few seconds. Bad batch?
https://boinc.bakerlab.org/rosetta/result.php?resultid=1217236062
</stderr_txt>
<message>
upload failure: <file_xfer_error>
<file_name>TFSCAFFOLD0001_2_SAVE_ALL_OUT_IGNORE_THE_REST_1xl5lk3f_953353_2_0_r1523244009_0</file_name>
<error_code>-240 (stat() failed)</error_code>
</file_xfer_error>
</message>
]]>
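For anyone comparing several of these, here is a rough sketch that pulls the file name and error code out of pasted `<file_xfer_error>` fragments like the ones above. The regular expression is written only against the fragments shown in this post, so the format of other error reports is an assumption.

```python
import re

# Matches a <file_name> element followed by an <error_code> such as
# "-240(stat() failed)" or "-161 (not found)".
UPLOAD_ERROR = re.compile(
    r"<file_name>(?P<name>[^<]+)</file_name>\s*"
    r"<error_code>\s*(?P<code>-?\d+)\s*\((?P<reason>.*?)\)\s*</error_code>",
    re.DOTALL,
)


def parse_upload_failures(text):
    """Return (file_name, error_code, reason) for each upload failure in text."""
    return [
        (m["name"], int(m["code"]), m["reason"])
        for m in UPLOAD_ERROR.finditer(text)
    ]


sample = """
upload failure: <file_xfer_error>
<file_name>TFSCAFFOLD0001_6_SAVE_ALL_OUT_IGNORE_THE_REST_0ub6wd0j_953357_1_1_r1180454695_0</file_name>
<error_code>-240(stat() failed)</error_code>
</file_xfer_error>
upload failure: <file_xfer_error>
<file_name>TFSCAFFOLD0001_6_SAVE_ALL_OUT_IGNORE_THE_REST_0ub6wd0j_953357_1_0_r1298488601_0</file_name>
<error_code>-161 (not found)</error_code>
</file_xfer_error>
"""

failures = parse_upload_failures(sample)
for name, code, reason in failures:
    print(code, reason, name)
```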

robertmiles
Joined: 16 Jun 08
Posts: 1232
Credit: 14,269,631
RAC: 3,846
Message 97967 - Posted: 8 Jul 2020, 20:58:08 UTC - in response to Message 97961.  

Recently, I started seeing a lot of jobs completing with a status of "aborted by project". They were completed prior to the deadline, but it doesn't appear that I get any credit for them either.
Any ideas/thoughts on this?

That is usually done only if your computer has downloaded the tasks but not started on them yet, but it can also be done to tasks that have been started, or completed but not yet returned.

You may need to try harder to return completed tasks.

robertmiles
Joined: 16 Jun 08
Posts: 1232
Credit: 14,269,631
RAC: 3,846
Message 97968 - Posted: 8 Jul 2020, 21:01:44 UTC - in response to Message 97964.  

https://boinc.bakerlab.org/rosetta/workunit.php?wuid=1092259599


Both tasks errored out after just a few seconds. Slightly different error codes but the same "upload failure":

</stderr_txt>
<message>
upload failure: <file_xfer_error>
<file_name>TFSCAFFOLD0001_6_SAVE_ALL_OUT_IGNORE_THE_REST_0ub6wd0j_953357_1_1_r1180454695_0</file_name>
<error_code>-240(stat() failed)</error_code>
</file_xfer_error>
</message>
]]>

Very fast failures usually mean that not all of the expected output files were produced, and therefore those files were not available to upload.

Falconet
Joined: 9 Mar 09
Posts: 353
Credit: 1,222,776
RAC: 4,804
Message 97969 - Posted: 8 Jul 2020, 22:35:52 UTC - in response to Message 97968.  

https://boinc.bakerlab.org/rosetta/workunit.php?wuid=1092259599


Both tasks errored out after just a few seconds. Slightly different error codes but the same "upload failure":

</stderr_txt>
<message>
upload failure: <file_xfer_error>
<file_name>TFSCAFFOLD0001_6_SAVE_ALL_OUT_IGNORE_THE_REST_0ub6wd0j_953357_1_1_r1180454695_0</file_name>
<error_code>-240(stat() failed)</error_code>
</file_xfer_error>
</message>
]]>

Very fast failures usually mean that not all of the expected output files were produced, and therefore those files were not available to upload.


Well, 3rd TFSCAFFOLD task that errors out. Good thing they fail quickly, whatever is causing it.

Grant (SSSF)
Joined: 28 Mar 20
Posts: 1671
Credit: 17,526,840
RAC: 23,319
Message 97975 - Posted: 9 Jul 2020, 4:44:51 UTC - in response to Message 97961.  

Recently, I started seeing a lot of jobs completing with a status of "aborted by project". They were completed prior to the deadline, but it doesn't appear that I get any credit for them either.
Any ideas/thoughts on this?
The only similar errors I could find were "Cancelled by server", and none of them were cancelled before your system started to process them.
No work done, no Credit.
Grant
Darwin NT

Mr P Hucker
Joined: 12 Aug 06
Posts: 1600
Credit: 11,716,372
RAC: 18,198
Message 97986 - Posted: 9 Jul 2020, 20:01:36 UTC - in response to Message 97968.  
Last modified: 9 Jul 2020, 20:03:46 UTC

Very fast failures usually mean that not all of the expected output files were produced, and therefore those files were not available to upload.


How can it have got to the uploading stage if it's only just started?

Well, 3rd TFSCAFFOLD task that errors out. Good thing they fail quickly, whatever is causing it.


Hopefully the server gives up and only tries sending them to several people before putting them in a "fix this" box for the programmers. I have also noticed my Boinc client backing off and not trying to get Rosetta tasks if it's just had a few failures. Universe and LHC tasks coming in more often just now.

robertmiles
Joined: 16 Jun 08
Posts: 1232
Credit: 14,269,631
RAC: 3,846
Message 97989 - Posted: 9 Jul 2020, 22:19:51 UTC - in response to Message 97986.  
Last modified: 9 Jul 2020, 22:22:24 UTC

Very fast failures usually mean that not all of the expected output files were produced, and therefore those files were not available to upload.


How can it have got to the uploading stage if it's only just started?

I'd expect that only if an error occurred at some point where the error output wasn't going to either of the output log files the users are able to see, which seems to be what happened to most of the TFSCAFFOLD tasks my computer tried to run.

Well, 3rd TFSCAFFOLD task that errors out. Good thing they fail quickly, whatever is causing it.


Hopefully the server gives up and only tries sending them to several people before putting them in a "fix this" box for the programmers. I have also noticed my Boinc client backing off and not trying to get Rosetta tasks if it's just had a few failures. Universe and LHC tasks coming in more often just now.

I've found a thread for a moderator's attention, and asked the moderator to check this thread.

Those of this type that I've looked at were set to have the server give up on the workunit after two failed tasks.
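As a rough illustration of that "give up after two failed tasks" behaviour, a sketch might look like the following. The function and the `error_limit` parameter are hypothetical names for illustration, and the comparison used is an assumption based on what I saw, not the server's actual code.

```python
def server_gives_up(outcomes, error_limit=2):
    """Return True once the number of failed tasks reaches error_limit.

    error_limit is a hypothetical name, set to 2 to match the workunits
    described above.
    """
    failed = sum(1 for outcome in outcomes if outcome == "error")
    return failed >= error_limit


# One failure: the workunit would still be sent to another host.
print(server_gives_up(["error"]))
# Two failures: the workunit would be abandoned.
print(server_gives_up(["error", "error"]))
```

With a limit that low, a bad batch should only waste a few seconds on a couple of hosts per workunit.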

Falconet
Joined: 9 Mar 09
Posts: 353
Credit: 1,222,776
RAC: 4,804
Message 97991 - Posted: 9 Jul 2020, 22:36:29 UTC - in response to Message 97989.  

Very fast failures usually mean that not all of the expected output files were produced, and therefore those files were not available to upload.


How can it have got to the uploading stage if it's only just started?

I'd expect that only if an error occurred at some point where the error output wasn't going to either of the output log files the users are able to see, which seems to be what happened to most of the TFSCAFFOLD tasks my computer tried to run.

Well, 3rd TFSCAFFOLD task that errors out. Good thing they fail quickly, whatever is causing it.


Hopefully the server gives up and only tries sending them to several people before putting them in a "fix this" box for the programmers. I have also noticed my Boinc client backing off and not trying to get Rosetta tasks if it's just had a few failures. Universe and LHC tasks coming in more often just now.

I've found a thread for a moderator's attention, and asked the moderator to check this thread.

Those of this type that I've looked at were set to have the server give up on the workunit after two failed tasks.



I saw, thanks. Had 1 or 2 more fail but am now running 1 just fine, as others have reported.

Stevie G
Joined: 15 Dec 18
Posts: 107
Credit: 822,669
RAC: 1,625
Message 98037 - Posted: 13 Jul 2020, 3:34:16 UTC - in response to Message 97955.  
Last modified: 13 Jul 2020, 3:36:16 UTC


For some reason, the computer shut down and was unresponsive for 48 hours. No action from the power button, hard drive, etc. Nada, nichts, zip.


After a week of working interspersed with total shut-downs I finally solved the problem.

It was a faulty power supply. I installed a new heftier (600W) power supply and a more powerful fan.

The machine has been crunching non-stop for three days. YAAAYY!

The exhaust air is much cooler. According to the CoreTemp utility, the CPU is running between 42 and 54 degrees C. It's also quieter and apparently happier.

Now if Rosetta would send me some WUs, that would complete my week.

Thanks for your support and patience.

Cheers,
Steven Gaber
Oldsmar, FL

Sid Celery
Joined: 11 Feb 08
Posts: 2115
Credit: 41,115,238
RAC: 19,699
Message 98063 - Posted: 14 Jul 2020, 2:37:18 UTC - in response to Message 98037.  


For some reason, the computer shut down and was unresponsive for 48 hours. No action from the power button, hard drive, etc. Nada, nichts, zip.


After a week of working interspersed with total shut-downs I finally solved the problem.

It was a faulty power supply. I installed a new heftier (600W) power supply and a more powerful fan.

The machine has been crunching non-stop for three days. YAAAYY!

The exhaust air is much cooler. According to the CoreTemp utility, the CPU is running between 42 and 54 degrees C. It's also quieter and apparently happier.

Now if Rosetta would send me some WUs, that would complete my week.

Thanks for your support and patience.

Cheers,
Steven Gaber
Oldsmar, FL

Good news - it could've easily been something much more expensive.

Stevie G
Joined: 15 Dec 18
Posts: 107
Credit: 822,669
RAC: 1,625
Message 98066 - Posted: 14 Jul 2020, 6:20:41 UTC - in response to Message 98063.  


Good news - it could've easily been something much more expensive


Yes, but I think at that point, say a defective motherboard or CPU, I would have just gotten another computer. It's like trying to keep an old car running for another year.

This one was a barebones box that I filled with parts for around $550.

Now that it's working again, I think I will put another 8 GB of RAM in it.

There are some really inexpensive refurbished Dell and HP computers out there, starting at $200.

Anybody ever try one of those?

Steven Gaber
Oldsmar, FL

Brian Nixon
Joined: 12 Apr 20
Posts: 293
Credit: 8,432,366
RAC: 0
Message 98072 - Posted: 14 Jul 2020, 12:46:58 UTC

rgmjp tasks running way longer than 8 hours: 1220528042 · 1220528132 · 1220528339

I’ve got another couple still running after nearly 16 hours, and a few more in the pipeline…

Ray Murray
Joined: 22 Apr 20
Posts: 17
Credit: 270,864
RAC: 0
Message 98080 - Posted: 14 Jul 2020, 19:37:35 UTC - in response to Message 98072.  

This one was just a smidgeon over 23hrs. Not a problem for me as my hosts run 24/7 and I have "Switch between" set beyond 2 days (to allow an occasional long LHC virtual task to run to completion without interruption) but I don't know how it would have fared on a machine that only runs 8hrs a day or if it was switched out too many times.
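A back-of-envelope sketch of that worry, assuming the 3 day Rosetta deadline mentioned earlier in the thread, and assuming the task starts as soon as it is downloaded and has a core to itself whenever the machine is on (both simplifications; the function names are made up for illustration):

```python
def days_needed(cpu_hours, hours_on_per_day):
    """Calendar days needed to accumulate cpu_hours of run time."""
    if hours_on_per_day <= 0:
        raise ValueError("the machine must run for some time each day")
    return cpu_hours / hours_on_per_day


def finishes_in_time(cpu_hours, hours_on_per_day, deadline_days=3.0):
    """True if the task can finish within the deadline under these assumptions."""
    return days_needed(cpu_hours, hours_on_per_day) <= deadline_days


# A 23 hour task on machines that are on 24, 8 and 6 hours a day.
for hours_on in (24, 8, 6):
    needed = days_needed(23, hours_on)
    print(f"{hours_on}h/day: {needed:.2f} days, in time: {finishes_in_time(23, hours_on)}")
```

On those assumptions an 8 hour a day machine would only just scrape in, with no slack for anything else in its queue.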

robertmiles
Joined: 16 Jun 08
Posts: 1232
Credit: 14,269,631
RAC: 3,846
Message 98087 - Posted: 14 Jul 2020, 21:08:06 UTC

The rgmjp tasks appear to complete only one decoy. The first decoy is usually only a quick check to make sure that your computer is running properly, so does this mean that the usual first decoy is skipped for these, or does it mean that more decoys are done but without adding them to the decoy count?



©2024 University of Washington
https://www.bakerlab.org