Problems and Technical Issues with Rosetta@home

Message boards : Number crunching : Problems and Technical Issues with Rosetta@home

Mr P Hucker
Joined: 12 Aug 06
Posts: 1600
Credit: 11,716,372
RAC: 18,198
Message 101007 - Posted: 2 Apr 2021, 17:48:52 UTC - in response to Message 100993.  

Duplicate post deleted.
You'd think there'd be a delete button. Who designs these things?

There's a workaround. If you use the same way to mark it as a duplicate every time, the software will see it as multiple identical posts, and delete all but one of them.
Shouldn't it have already done that when the 2nd genuine one was posted?
ID: 101007 · Rating: 0 · rate: Rate + / Rate - Report as offensive    Reply Quote
Mr P Hucker
Joined: 12 Aug 06
Posts: 1600
Credit: 11,716,372
RAC: 18,198
Message 101008 - Posted: 2 Apr 2021, 17:50:21 UTC - in response to Message 100995.  

I have seen several periods of downtime where work units have not been deployed for days at a time.

For an individual host's circumstances it's fine if you have a specific reason

This kind of reminds me of the hoarding that takes place here (even prior to the pandemic). There's a supply problem, which leads to hoarding, which makes it worse.

Kind of remarkable that we have too much unused CPU time to go around.
Same happens in real life with toilet paper because of the plandemic. Some people are selfish idiots.
ID: 101008 · Rating: 0 · rate: Rate + / Rate - Report as offensive    Reply Quote
Mr P Hucker
Joined: 12 Aug 06
Posts: 1600
Credit: 11,716,372
RAC: 18,198
Message 101009 - Posted: 2 Apr 2021, 17:52:03 UTC - in response to Message 100996.  


Are you one of those pricks who said "made you look" in the school playground as a kid? If so, how's the broken nose?


Woah, dude, where did that come from? Over the use of an "at" symbol?

If you get spun up that hard, that fast over what I write, maybe the better solution is to stop reading my posts, okay?
No, because you did an "I know you are" variant saying I'd used @ when telling you not to use @.

Do people seriously say "dude"?

Anyway "prick" is a compliment, it means you have a big appendage.
ID: 101009 · Rating: 0 · rate: Rate + / Rate - Report as offensive    Reply Quote
Mr P Hucker
Joined: 12 Aug 06
Posts: 1600
Credit: 11,716,372
RAC: 18,198
Message 101010 - Posted: 2 Apr 2021, 17:53:29 UTC - in response to Message 100997.  


I will complete the units currently being processed but suspect that this project is not for me.


Don't take it personally. There's a three roll limit on toilet paper here because of some hoarders (not you). That's the rule. But best practice for the community at large is for folks to take less, if they can. If everybody does it, then there is more likely to be a ready supply available, including for you. It's something worth repeating, just so everyone is aware of it.
That limit doesn't work, you just buy from more shops at once. Not that I do that with toilet paper, but I do a similar thing to buy more paracetamol (painkiller) than you're "allowed" by the government.
ID: 101010 · Rating: 0 · rate: Rate + / Rate - Report as offensive    Reply Quote
Mr P Hucker
Joined: 12 Aug 06
Posts: 1600
Credit: 11,716,372
RAC: 18,198
Message 101011 - Posted: 2 Apr 2021, 17:55:07 UTC - in response to Message 100999.  

Say hello to two less hosts after they finish their current tasks, @Rosetta. I don't know if I have the time that's required to provide the space that is needed.
You’re not alone. Look at the recent results graphs – ‘tasks in progress’ has dropped by around 200,000 (a third)…
In the past it has taken several days for In progress numbers to get back to their pre-work shortage numbers. And that's without running out of work again only a few hours after new work started coming through (which occurred this time).
If we don't run out of work again over the next few days, we should see how things actually are by early next week.
A few days in and the impact of the mis-configured Work Units is becoming clearer. Looks like the amount of work being done has dropped by almost a third, and isn't showing any signs of recovering.
For all of the latest & greatest systems there are, there are an awful lot more older, much more resource-limited systems.

That means nothing. For example I might (manually or Boinc did it) download a load of work from another project when this one runs out. Now that has to be completed before it will get work from Rosetta again.
ID: 101011 · Rating: 0 · rate: Rate + / Rate - Report as offensive    Reply Quote
Bryn Mawr

Joined: 26 Dec 18
Posts: 389
Credit: 12,070,320
RAC: 12,300
Message 101012 - Posted: 2 Apr 2021, 18:15:20 UTC

This problem with tasks erroring out with computation error is now getting serious.

Up until now my attitude has been “it’s only a few seconds a task, no sweat” but because I’m running a very small cache with multiple projects it runs Rosetta on a one out, one in basis, so it gets one, errors it and then has to wait ages before it uploads the result and asks for another. Now it’s gone to the next level: because my last n tasks have all errored, it is extending the back off period to many hours before it will allow another request, and I’m almost to the point where Rosetta is no longer running on my main machine.

Does the panel know (a) how long these errors will continue (b) how many good tasks I need to return to get back into Rosetta’s good books?
ID: 101012 · Rating: 0 · rate: Rate + / Rate - Report as offensive    Reply Quote
Mr P Hucker
Joined: 12 Aug 06
Posts: 1600
Credit: 11,716,372
RAC: 18,198
Message 101013 - Posted: 2 Apr 2021, 18:23:38 UTC - in response to Message 101012.  

This problem with tasks erroring out with computation error is now getting serious.

Up until now my attitude has been “it’s only a few seconds a task, no sweat” but because I’m running a very small cache with multiple projects it runs Rosetta on a one out, one in basis, so it gets one, errors it and then has to wait ages before it uploads the result and asks for another. Now it’s gone to the next level: because my last n tasks have all errored, it is extending the back off period to many hours before it will allow another request, and I’m almost to the point where Rosetta is no longer running on my main machine.

Does the panel know (a) how long these errors will continue (b) how many good tasks I need to return to get back into Rosetta’s good books?

I do the same as you with a small buffer, but all that happens is Boinc builds up a Rosetta debt, and you'll end up doing more of them when they fix it.
ID: 101013 · Rating: 0 · rate: Rate + / Rate - Report as offensive    Reply Quote
Brian Nixon

Joined: 12 Apr 20
Posts: 293
Credit: 8,432,366
RAC: 0
Message 101014 - Posted: 2 Apr 2021, 19:18:32 UTC - in response to Message 101006.  

Bandwidth usage massively increased in March
This might be at least in part due to the current batch of work units suffering an unusually high failure rate, meaning you will be downloading a lot more tasks than normal in any given period. As an extreme example, your Threadripper has had over 300 failures in the last few days. As there’s no way to tell bad tasks from good before they’ve downloaded and started, there’s nothing we can do about it other than let them run their course (or stop running Rosetta until they’ve passed).

In BOINC Manager you can set a limit on the amount of data transferred in a given period. It’s not very sophisticated and only works per machine, so when you’ve got several the best you can do is set an allowance for each one as a proportion of your total limit based on the number of tasks you expect it to run. (And if you do set a limit you then need to keep an eye out for it being reached, at which point even small results files for completed tasks won’t be uploaded.)
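The proportional split described above can be sketched in a few lines of Python. This is only an illustration: the host names and expected task counts are made up, the 50,000 MB total is a hypothetical monthly cap, and `split_allowance` is a hypothetical helper, not part of BOINC.

```python
def split_allowance(total_mb, expected_tasks):
    """Split a total transfer allowance across hosts.

    Each host gets a share proportional to the number of tasks
    it is expected to run (all inputs here are assumptions).
    """
    total_tasks = sum(expected_tasks.values())
    if total_tasks <= 0:
        raise ValueError("expected_tasks must contain a positive total")
    return {host: total_mb * n / total_tasks
            for host, n in expected_tasks.items()}


# Hypothetical example: a 50,000 MB cap shared by three machines.
allowances = split_allowance(
    50_000, {"threadripper": 60, "desktop": 30, "laptop": 10}
)
for host, mb in allowances.items():
    print(f"{host}: {mb:.0f} MB")
```

Each resulting figure would then be entered by hand as that machine's own limit; the shares need revisiting whenever the mix of machines or their workloads changes.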

Bad tasks aside, one way to reduce the overall amount of network traffic while performing the same amount of work is to increase the target run time for tasks in your project preferences. Even though a longer run time might increase the upload size needed for each task (due to the greater number of results), that is often far outweighed by the saving in download size (which is fixed for each task, however long it runs for). The credit per hour is more or less the same whatever target run time you choose.
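A rough worked example of that trade-off, assuming the download size per task is fixed and the upload size grows linearly with run time. The function name `daily_traffic_mb` and every number below (192 CPU hours per day, 100 MB download per task, 0.1 MB of results per hour) are hypothetical, chosen only to show the shape of the saving.

```python
def daily_traffic_mb(cpu_hours_per_day, run_time_hours,
                     download_mb, upload_mb_per_hour):
    """Estimate daily network traffic for a given target run time.

    Assumes one fixed-size download per task and an upload that
    grows linearly with how long the task runs.
    """
    tasks_per_day = cpu_hours_per_day / run_time_hours
    download = tasks_per_day * download_mb
    upload = tasks_per_day * upload_mb_per_hour * run_time_hours
    return download + upload


# Hypothetical host: 8 cores busy all day = 192 CPU hours.
short_runs = daily_traffic_mb(192, 8, download_mb=100, upload_mb_per_hour=0.1)
long_runs = daily_traffic_mb(192, 24, download_mb=100, upload_mb_per_hour=0.1)
print(f"8 h target:  {short_runs:.1f} MB/day")
print(f"24 h target: {long_runs:.1f} MB/day")
```

Under these assumptions the total upload is the same in both cases, while the download total falls in proportion to the number of tasks, which is why the longer run time comes out well ahead.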
ID: 101014 · Rating: 0 · rate: Rate + / Rate - Report as offensive    Reply Quote
Brian Nixon

Joined: 12 Apr 20
Posts: 293
Credit: 8,432,366
RAC: 0
Message 101015 - Posted: 2 Apr 2021, 19:34:53 UTC - in response to Message 101012.  

Does the panel know (a) how long these errors will continue (b) how many good tasks I need to return to get back into Rosetta’s good books?
(a) With 1.1 million jobs in the queue and a completion rate around 280,000 per day, I’d estimate at least 4 days…

(b) At just shy of 500 max per day you still are in Rosetta’s good books, so number of tasks isn’t the issue. If it’s just backoff times you’re running into, either that’s set by the server and there’s nothing you can do about it, or you can try to force a connection by selecting Update on the Projects page.
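The estimate in (a) is simple division, sketched here on the assumption that no new jobs are queued and the completion rate stays constant (neither is likely to hold exactly, hence "at least"). `days_to_drain` is a hypothetical helper name.

```python
def days_to_drain(queued_jobs, completed_per_day):
    """Days needed to clear a queue at a constant completion rate."""
    if completed_per_day <= 0:
        raise ValueError("completed_per_day must be positive")
    return queued_jobs / completed_per_day


# Figures quoted in the post: 1.1 million queued, about 280,000 per day.
days = days_to_drain(1_100_000, 280_000)
print(f"{days:.1f} days")
```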
ID: 101015 · Rating: 0 · rate: Rate + / Rate - Report as offensive    Reply Quote
Bryn Mawr

Joined: 26 Dec 18
Posts: 389
Credit: 12,070,320
RAC: 12,300
Message 101016 - Posted: 2 Apr 2021, 19:41:35 UTC - in response to Message 101015.  

Does the panel know (a) how long these errors will continue (b) how many good tasks I need to return to get back into Rosetta’s good books?
(a) With 1.1 million jobs in the queue and a completion rate around 280,000 per day, I’d estimate at least 4 days…

(b) At just shy of 500 max per day you still are in Rosetta’s good books, so number of tasks isn’t the issue. If it’s just backoff times you’re running into, either that’s set by the server and there’s nothing you can do about it, or you can try to force a connection by selecting Update on the Projects page.


The back off time appears to be set by the server and is near doubling with each computation error that I’m returning :-(
ID: 101016 · Rating: 0 · rate: Rate + / Rate - Report as offensive    Reply Quote
robertmiles
Joined: 16 Jun 08
Posts: 1232
Credit: 14,269,631
RAC: 3,846
Message 101017 - Posted: 2 Apr 2021, 20:58:11 UTC - in response to Message 101007.  

Duplicate post deleted.
You'd think there'd be a delete button. Who designs these things?

There's a workaround. If you use the same way to mark it as a duplicate every time, the software will see it as multiple identical posts, and delete all but one of them.
Shouldn't it have already done that when the 2nd genuine one was posted?

Yes. But it's rather slow to happen.
ID: 101017 · Rating: 0 · rate: Rate + / Rate - Report as offensive    Reply Quote
robertmiles
Joined: 16 Jun 08
Posts: 1232
Credit: 14,269,631
RAC: 3,846
Message 101018 - Posted: 2 Apr 2021, 21:03:52 UTC - in response to Message 101014.  

Bandwidth usage massively increased in March
This might be at least in part due to the current batch of work units suffering an unusually high failure rate, meaning you will be downloading a lot more tasks than normal in any given period. As an extreme example, your Threadripper has had over 300 failures in the last few days. As there’s no way to tell bad tasks from good before they’ve downloaded and started, there’s nothing we can do about it other than let them run their course (or stop running Rosetta until they’ve passed).

[snip]
If anyone can get them to look at the log files to see why the errors are occurring, that might help. For the errors on my computer, they should quickly notice that something with "6mers" in its name is missing from the input files.
ID: 101018 · Rating: 0 · rate: Rate + / Rate - Report as offensive    Reply Quote
Grant (SSSF)
Joined: 28 Mar 20
Posts: 1671
Credit: 17,529,908
RAC: 22,862
Message 101019 - Posted: 2 Apr 2021, 22:35:45 UTC - in response to Message 101006.  
Last modified: 2 Apr 2021, 23:03:33 UTC

Has there been a significant project change which could be the cause of this increased usage or am I looking for another problem?
Hard to say.
In most cases the results returned to the Rosetta servers are around 200k-1MB. But they can be well over 1MB in some cases, depending on the type of Task being processed.
I'd suggest disabling BOINC network access for a while & see what the average result file size being returned is.

Edit- and as Brian mentioned, we have recently had a large batch of Tasks that error out quickly, and appear to still be moving through the system.


I would also check to see if there has been an increase in Windows update traffic; that is the only thing that causes regular spikes in my network bandwidth- also check your privacy settings as having these set loosely results in a lot of data being sent back to Microsoft & other companies. Also the occasional YouTube usage when I find some interesting videos can result in a huge spike in data usage.
But Brian's suggestion of the errored Tasks is the most likely cause.


This is unsustainable and I will either have to shell out for an expensive unlimited contract (because I have an Ultima connection at over 100mbps) or cut back on Rosetta work.
I'm guessing you don't have any real options when it comes to ISP? 50GB limit for a 100Mb connection is insane IMHO. Higher speed plans here come with high data caps.
50GB is something you used to get on a basic 25Mb/s starter plan- these days even 12Mb/s plans can have as much as 500MB data caps. 100Mb/s plans are 1TB caps or unlimited by default (of course we pay through the nose for those).
Grant
Darwin NT
ID: 101019 · Rating: 0 · rate: Rate + / Rate - Report as offensive    Reply Quote
Grant (SSSF)
Joined: 28 Mar 20
Posts: 1671
Credit: 17,529,908
RAC: 22,862
Message 101020 - Posted: 2 Apr 2021, 22:45:47 UTC - in response to Message 101012.  

...how many good tasks I need to return to get back into Rosetta’s good books?
It's not an issue.
Rosetta is set up to allow for such problems. Both of your systems are still good for plenty of Tasks each day- 491 on one, 502 on the other.

I can't remember the exact mechanism, but for example, for each Task that validates, your limit increases by 2 (it's actually more than that- there were times at Seti where people were down to 1 Task per 24hrs. Once they started returning valid Tasks again, within a few hours (depending on how fast they were returning valid work) their limits were back in the hundreds & even thousands of Tasks per 24 hours).



But it would be nice if the researchers would test their models a bit more before releasing them here. The odd error is OK, but when it's a case of the odd Task not being an error and all others erroring out it really is a bit silly.
Grant
Darwin NT
ID: 101020 · Rating: 0 · rate: Rate + / Rate - Report as offensive    Reply Quote
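A toy model of the per-host daily limit described in the post above, assuming a fixed gain of 2 per validated Task, a loss of 1 per error, and a hypothetical cap of 600. These values and the name `update_daily_quota` are assumptions for illustration only; as the post says, the real mechanism recovers faster than this and its exact rules aren't given in this thread.

```python
def update_daily_quota(quota, outcome, max_quota=600, min_quota=1,
                       gain=2, penalty=1):
    """Return the new daily task limit after one reported result.

    All parameters are hypothetical; this is not BOINC's real logic.
    """
    if outcome == "valid":
        return min(quota + gain, max_quota)
    if outcome == "error":
        return max(quota - penalty, min_quota)
    raise ValueError(f"unknown outcome: {outcome!r}")


# A host that has been pushed down to 1 Task per day...
quota = 1
# ...cannot drop any further on another error...
floor_check = update_daily_quota(quota, "error")
# ...and climbs back as validated results come in.
for _ in range(50):
    quota = update_daily_quota(quota, "valid")
print(floor_check, quota)
```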
Grant (SSSF)
Joined: 28 Mar 20
Posts: 1671
Credit: 17,529,908
RAC: 22,862
Message 101021 - Posted: 2 Apr 2021, 22:51:11 UTC - in response to Message 101016.  

The back off time appears to be set by the server and is near doubling with each computation error that I’m returning :-(
I haven't seen that occur myself (but most of my errors were returned while I wasn't here).
BOINC Manager backoffs are set by the Scheduler and usually only occur when there is a problem contacting the Scheduler. A successful Scheduler contact & it's reset to the default 30 seconds.
Returning errors should only result in a reduction in the number of Tasks per 24 hours for that host.
Grant
Darwin NT
ID: 101021 · Rating: 0 · rate: Rate + / Rate - Report as offensive    Reply Quote
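The behaviour discussed in the last few posts (a delay that roughly doubles on each failure and drops back to 30 seconds after a successful Scheduler contact) looks like ordinary exponential backoff. A minimal sketch, assuming an exact doubling factor and a hypothetical one-day cap; `next_backoff` is an illustrative name and this is not BOINC's actual code.

```python
def next_backoff(current_s, success, base_s=30.0, factor=2.0,
                 cap_s=86_400.0):
    """Return the next wait time in seconds.

    Resets to base_s on success, otherwise multiplies the current
    delay by factor, never exceeding cap_s (cap is an assumption).
    """
    if success:
        return base_s
    return min(current_s * factor, cap_s)


delay = 30.0
history = []
# Four failed attempts followed by one successful contact.
for ok in [False, False, False, False, True]:
    delay = next_backoff(delay, ok)
    history.append(delay)
print(history)  # [60.0, 120.0, 240.0, 480.0, 30.0]
```

Starting from 30 seconds, pure doubling needs around ten consecutive failures to pass the eight-hour mark, so a run of failures is enough to reach the "many hours" reported earlier in the thread if the delay really does double each time.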
Grant (SSSF)
Joined: 28 Mar 20
Posts: 1671
Credit: 17,529,908
RAC: 22,862
Message 101022 - Posted: 2 Apr 2021, 22:57:22 UTC - in response to Message 101011.  

That means nothing. For example I might (manually or Boinc did it) download a load of work from another project when this one runs out. Now that has to be completed before it will get work from Rosetta again.
It really is a shame you don't read all of what's posted before you feel the need to comment.
Grant
Darwin NT
ID: 101022 · Rating: 0 · rate: Rate + / Rate - Report as offensive    Reply Quote
DizzyD

Joined: 23 Nov 20
Posts: 6
Credit: 1,438,330
RAC: 0
Message 101023 - Posted: 3 Apr 2021, 0:59:59 UTC

Who is the guilty party submitting tasks that all "Error while computing"? I have 70 tasks on April 4th that have errored with no credit. My stats have dropped over 10% in the past day.
ID: 101023 · Rating: 0 · rate: Rate + / Rate - Report as offensive    Reply Quote
Grant (SSSF)
Joined: 28 Mar 20
Posts: 1671
Credit: 17,529,908
RAC: 22,862
Message 101024 - Posted: 3 Apr 2021, 1:01:15 UTC - in response to Message 101019.  

Has there been a significant project change which could be the cause of this increased usage or am I looking for another problem?
Hard to say.
In most cases the results returned to the Rosetta servers are around 200k-1MB. But they can be well over 1MB in some cases, depending on the type of Task being processed.
I'd suggest disabling BOINC network access for a while & see what the average result file size being returned is.

Edit- and as Brian mentioned, we have recently had a large batch of Tasks that error out quickly, and appear to still be moving through the system.


I would also check to see if there has been an increase in Windows update traffic; that is the only thing that causes regular spikes in my network bandwidth- also check your privacy settings as having these set loosely results in a lot of data being sent back to Microsoft & other companies. Also the occasional YouTube usage when I find some interesting videos can result in a huge spike in data usage.
But Brian's suggestion of the errored Tasks is the most likely cause.


This is unsustainable and I will either have to shell out for an expensive unlimited contract (because I have an Ultima connection at over 100mbps) or cut back on Rosetta work.
I'm guessing you don't have any real options when it comes to ISP? 50GB limit for a 100Mb connection is insane IMHO. Higher speed plans here come with high data caps.
50GB is something you used to get on a basic 25Mb/s starter plan- these days even 12Mb/s plans can have as much as 500MB data caps. 100Mb/s plans are 1TB caps or unlimited by default (of course we pay through the nose for those).



I just checked out my Data usage, and it is actually less than it has been- in the last reporting period there were some large deferred Windows updates so they would have skewed the figures.
Even so, my average usage is around 1GB per day. Since the 28/3 my usage is around only 731MB per day (to date).
Grant
Darwin NT
ID: 101024 · Rating: 0 · rate: Rate + / Rate - Report as offensive    Reply Quote
Grant (SSSF)
Joined: 28 Mar 20
Posts: 1671
Credit: 17,529,908
RAC: 22,862
Message 101025 - Posted: 3 Apr 2021, 1:04:35 UTC - in response to Message 101023.  

Who is the guilty party submitting tasks that all "Error while computing"? I have 70 tasks on April 4th that have errored with no credit.
That would be you.
Along with everyone else- as mentioned in several posts here & some other threads there is a current batch of work that is presently producing almost nothing but errors.

My stats have dropped over 10% in the past day.
Mine are still climbing, but that is after falling for 4 days straight due to the lack of work for a while, and the fact there is now a new batch of work and that it takes a while for granted Credit to stabilise.
Grant
Darwin NT
ID: 101025 · Rating: 0 · rate: Rate + / Rate - Report as offensive    Reply Quote
Falconet

Joined: 9 Mar 09
Posts: 353
Credit: 1,222,776
RAC: 4,349
Message 101026 - Posted: 3 Apr 2021, 4:10:27 UTC

Queued jobs dropped to 393,000 from over a million on the last update.

Looks like someone pulled some batches from circulation.
ID: 101026 · Rating: 0 · rate: Rate + / Rate - Report as offensive    Reply Quote



©2024 University of Washington
https://www.bakerlab.org