Message boards : Number crunching : Problems and Technical Issues with Rosetta@home
Author | Message |
---|---|
amgthis Send message Joined: 25 Mar 06 Posts: 81 Credit: 203,879,282 RAC: 0 |
Failed downloads. I, too, have seen many small downloads (~3 KB or so) hang or 'stall' at somewhere around 80-90% completion. Then they just sit there and seem to rob my limited bandwidth, impeding other upload and download traffic. I delete the stalled download, then refresh, and it gets replaced by a new one. Then I watch to make sure it downloads successfully. Sometimes stopping and restarting 'network access or activity' will let it resume, but usually it stalls out again. I think I've been noticing this for the last couple of weeks. The file names vary, but they are always small files of ~3 KB or so. When you have 20 boxes sharing a 7 Mbps DSL line, bandwidth can be sketchy under the best conditions. 8^( /Mike |
Trotador Send message Joined: 30 May 09 Posts: 108 Credit: 291,214,977 RAC: 3 |
Yes, same here, stalled downloads can only be fixed by manual intervention (abort or abort) and therefore a big pain to keep crunching the project. They require continuous attention, which is not sustainable. |
Mr P Hucker Send message Joined: 12 Aug 06 Posts: 1600 Credit: 11,716,372 RAC: 18,198 |
Quote: "Yes, same here, stalled downloads can only be fixed by manual intervention (abort or abort) and therefore a big pain to keep crunching the project. They require continuous attention, which is not sustainable." Just had one I can't fix. Usually aborting the download, then aborting the task, then reporting it, allows me to continue. But now Boinc is still saying: "Rosetta@home 16/02/2020 11:00:16 AM Not requesting tasks: some download is stalled". I'll try a fresh post on this here, and ask in the main Boinc forum why Boinc thinks something is still stalled which isn't. P.S. For some reason I'm not getting emailed when someone posts in this thread. Another problem! Works fine in forums of all other projects. Ah, a hidden preference defaulting to a daft way - why would I subscribe to a thread if I didn't want to be told? |
Bryn Mawr Send message Joined: 26 Dec 18 Posts: 389 Credit: 12,070,320 RAC: 12,300 |
When this has happened to me it has self corrected after about an hour - give it time and then go for another update and you should get some new tasks. |
Mr P Hucker Send message Joined: 12 Aug 06 Posts: 1600 Credit: 11,716,372 RAC: 18,198 |
Quote: "When this has happened to me it has self corrected after about an hour - give it time and then go for another update and you should get some new tasks." Do you mean completely self corrected, or self corrected after you aborted the task? If I don't abort the task, I've seen it still stuck after about 18 hours. It just keeps on retrying and failing to download about every 3 hours. |
Bryn Mawr Send message Joined: 26 Dec 18 Posts: 389 Credit: 12,070,320 RAC: 12,300 |
Quote: "When this has happened to me it has self corrected after about an hour - give it time and then go for another update and you should get some new tasks." I abort the transfer (not the task) and normally that is enough to allow downloads to restart when I do an update project. On the odd occasion, however, it has given the message you reported after the update. In that case I leave it an hour and redo the update; on all occasions so far the update has succeeded in bringing down new WUs. |
Mr P Hucker Send message Joined: 12 Aug 06 Posts: 1600 Credit: 11,716,372 RAC: 18,198 |
Quote: "When this has happened to me it has self corrected after about an hour - give it time and then go for another update and you should get some new tasks." Ok thanks, in the future I'll just abort then leave it alone. Although the next time it happens I'm going to try to gather technical info on the problem - see this thread over at Boinc ( https://boinc.berkeley.edu/dev/forum_thread.php?id=13435 ). I've been requested to: "1) if you see it happening, set <http_debug> in Event Log options, and retry the transfer - find out what's happening behind that 'transient HTTP error'. 2) make a careful and exact note of the file name in question. Cancel the download, and make sure it disappears from the transfers tab. Restart the client, and if the 'stalled download' message reappears, have a very careful 'read only' (no edits) peek inside client_state.xml - same folder. Find the reference (if any) to the file you cancelled, and post the whole of the <file> ... </file> section it's enclosed in." |
Sid Celery Send message Joined: 11 Feb 08 Posts: 2115 Credit: 41,110,095 RAC: 19,722 |
Quote: "When this has happened to me it has self corrected after about an hour - give it time and then go for another update and you should get some new tasks." For some reason, aborting the transfer before aborting the task didn't work. Aborting the task first, then the download, was always much more successful. |
Bryn Mawr Send message Joined: 26 Dec 18 Posts: 389 Credit: 12,070,320 RAC: 12,300 |
Odd. When I abort the transfer, the task disappears about 10 seconds later; I don’t need to abort it. I’m on 7.16. |
Mod.Sense Volunteer moderator Send message Joined: 22 Aug 06 Posts: 4018 Credit: 0 RAC: 0 |
That would be my expectation of the BOINC Manager. When one of the files required by a task fails to download, then the task is aborted. And there will be times when several of the tasks you are downloading depend upon the same file, and all of them abort when a file transfer fails. Rosetta Moderator: Mod.Sense |
Mr P Hucker Send message Joined: 12 Aug 06 Posts: 1600 Credit: 11,716,372 RAC: 18,198 |
It's a pity the Boinc manager doesn't seem to notice for hours. I get the message in the log "some download is stalled" (which prevents any more tasks getting downloaded from that project) up to an hour or so after I've cancelled both the download and the task. 1) Rosetta needs to stop the server stalling downloads. 2) Boinc needs to fix their program so it doesn't get upset just because 1 file failed, then fail to notice the user told it to give up. |
amgthis Send message Joined: 25 Mar 06 Posts: 81 Credit: 203,879,282 RAC: 0 |
I haven't seen the small file stall on d/l lately, but did have this one hang a couple of days ago, more than once: rb_02_20_16480_16303_ab_t000_h001_robetta.zip (3.06 KiB). I'm still not convinced it's a by-product of super slow DSL on my end. In any case I think it's cleared up. |
Mr P Hucker Send message Joined: 12 Aug 06 Posts: 1600 Credit: 11,716,372 RAC: 18,198 |
Quote: "I haven't seen the small file stall on d/l lately, but did have this one hang a ..." I doubt it was your connection. I have fibre here and got the same problem. But as you say, it's pretty much sorted somehow. Pity nobody admitted what they did wrong! |
Bryn Mawr Send message Joined: 26 Dec 18 Posts: 389 Credit: 12,070,320 RAC: 12,300 |
Quote: "I haven't seen the small file stall on d/l lately, but did have this one hang a ..." I had another stalled download today, but my main problem over the past few days has been jobs erroring out after 10 hours. Each job has a single decoy that was still running 4 hours after my 6 hour limit. So far there have been 11 such jobs over 2 machines, 3 on one machine yesterday alone, which is a quarter of that machine’s allocation for Rosetta: https://boinc.bakerlab.org/rosetta/result.php?resultid=1124985018 https://boinc.bakerlab.org/rosetta/result.php?resultid=1124985251 https://boinc.bakerlab.org/rosetta/result.php?resultid=1124985278 |
Mr P Hucker Send message Joined: 12 Aug 06 Posts: 1600 Credit: 11,716,372 RAC: 18,198 |
Quote: "I haven't seen the small file stall on d/l lately, but did have this one hang a ..." What is a decoy? All my machines complete a task in 7.5 to 8.5 hours. And they're not particularly fast machines, one is 12 years old. I've seen no errors in over a week. There must be a pattern here. |
Bryn Mawr Send message Joined: 26 Dec 18 Posts: 389 Credit: 12,070,320 RAC: 12,300 |
Quote: "I haven't seen the small file stall on d/l lately, but did have this one hang a ..." Within the stdout file the app reports the number of scenarios that have been processed in the time you make available. Normally, in the standard 8 hour window, you’ll process the data with maybe 40 different starting positions (these are known as decoys for some reason that escapes me). I set my processing window to 6 hours, and a normal work unit will process maybe 30 decoys. The work units that are erroring out have not finished the first run through the data after 10 hours (6 hour preference plus 4 hours allowed overrun), at which point the watchdog aborts the process. |
robertmiles Send message Joined: 16 Jun 08 Posts: 1232 Credit: 14,269,631 RAC: 3,846 |
Another small file download problem, which is currently blocking me from getting any more R@H tasks: 10v3nmgb_c14394_10mer_gb_000420_SAVE_ALL_OUT_896889_53_1 https://boinc.bakerlab.org/rosetta/result.php?resultid=1125377637 A wingmate timed out and therefore may have had the same problem: https://boinc.bakerlab.org/rosetta/workunit.php?wuid=1012342810 Relevant lines from the log: 3/4/2020 4:00:05 AM | Rosetta@home | Started download of 10v3nmgb_c14394_10mer_gb_000420.zip 3/4/2020 4:05:12 AM | Rosetta@home | Temporarily failed download of 10v3nmgb_c14394_10mer_gb_000420.zip: transient HTTP error 3/4/2020 4:05:12 AM | Rosetta@home | Backing off 04:50:13 on download of 10v3nmgb_c14394_10mer_gb_000420.zip 3/4/2020 4:05:13 AM | | Project communication failed: attempting access to reference site 3/4/2020 4:05:15 AM | | Internet access OK - project servers may be temporarily down. 3/4/2020 4:42:49 AM | Rosetta@home | Sending scheduler request: To report completed tasks. 3/4/2020 4:42:49 AM | Rosetta@home | Reporting 2 completed tasks 3/4/2020 4:42:49 AM | Rosetta@home | Not requesting tasks: some download is stalled 3/4/2020 4:42:51 AM | Rosetta@home | Scheduler request completed Does the file that fails to download even exist on the server? The expected size of the file is only 3.23 KB. Could the server have problems downloading files of a certain small size? DSL speed here is not especially high or low. Enough other BOINC projects are selected on this computer to keep it busy. |
Mr P Hucker Send message Joined: 12 Aug 06 Posts: 1600 Credit: 11,716,372 RAC: 18,198 |
Quote: "What is a decoy? All my machines complete a task in 7.5 to 8.5 hours. And they're not particularly fast machines, one is 12 years old. I've seen no errors in over a week. There must be a pattern here." Are you setting this limit or is it done by the server? I just let all projects take whatever time they feel necessary. |
Mr P Hucker Send message Joined: 12 Aug 06 Posts: 1600 Credit: 11,716,372 RAC: 18,198 |
Quote: "Another small file download problem, which is currently blocking me from getting any more R@H tasks:" It's always the little 3 kB files that get stuck for me. It suggests that a different server is producing those and is misbehaving, or that the files are corrupt for some reason. They don't even download using a web browser. |
Bryn Mawr Send message Joined: 26 Dec 18 Posts: 389 Credit: 12,070,320 RAC: 12,300 |
Quote: "What is a decoy? All my machines complete a task in 7.5 to 8.5 hours. And they're not particularly fast machines, one is 12 years old. I've seen no errors in over a week. There must be a pattern here." You’ll find it in your account > project preferences > target cpu time. It defaults to 8 hours, but after I had quite a few of these errors I dropped mine to 6 hours in the hope I’d waste slightly less processing time. |
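
For anyone who wants to check their own Event Log for the pattern robertmiles posted above, here is a minimal Python sketch. It assumes the log has been saved as plain text in the same `timestamp | project | message` layout as that excerpt, and that the messages are worded exactly as shown there; other client versions may differ. The function name `summarize_downloads` is made up for this example and is not part of BOINC.

```python
import re
from datetime import timedelta

# Message wording copied from the log excerpt in this thread; other BOINC
# client versions may word these lines differently (assumption).
FAILED_RE = re.compile(r"Temporarily failed download of (?P<file>\S+): (?P<reason>.+)")
BACKOFF_RE = re.compile(
    r"Backing off (?P<h>\d+):(?P<m>\d+):(?P<s>\d+) on download of (?P<file>\S+)"
)
STALLED_TEXT = "Not requesting tasks: some download is stalled"


def _entry(per_file, name):
    """Return the summary record for a file, creating it on first use."""
    return per_file.setdefault(name, {"failures": 0, "reasons": set(), "backoff": None})


def summarize_downloads(log_text):
    """Summarize failed downloads in 'timestamp | project | message' lines.

    Returns (per_file, stalled_seen). per_file maps each file name to its
    failure count, the failure reasons seen, and the most recent back-off.
    stalled_seen is True if the 'some download is stalled' message appears.
    """
    per_file = {}
    stalled_seen = False
    for line in log_text.splitlines():
        parts = [p.strip() for p in line.split("|", 2)]
        if len(parts) != 3:
            continue  # not a log line in the assumed three-column layout
        message = parts[2]
        if STALLED_TEXT in message:
            stalled_seen = True
        failed = FAILED_RE.match(message)
        backoff = BACKOFF_RE.match(message)
        if failed:
            record = _entry(per_file, failed["file"])
            record["failures"] += 1
            record["reasons"].add(failed["reason"])
        elif backoff:
            record = _entry(per_file, backoff["file"])
            record["backoff"] = timedelta(
                hours=int(backoff["h"]),
                minutes=int(backoff["m"]),
                seconds=int(backoff["s"]),
            )
    return per_file, stalled_seen
```

Pass it the saved log text and look at which file names keep failing and how long the back-off is; that gives exact names to report, as the BOINC forum asked for.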
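
The second step requested on the BOINC forum (a read-only look inside client_state.xml for the `<file> ... </file>` section of the cancelled download) could be done along these lines. This is only a sketch: it assumes the file parses as well-formed XML and that each `<file>` element has a `<name>` child holding the file name, which the thread does not confirm. `find_file_sections` is a hypothetical helper name. It only reads the text it is given and never writes anything back.

```python
import xml.etree.ElementTree as ET


def find_file_sections(client_state_xml, file_name):
    """Return the text of every <file> element whose <name> equals file_name.

    Assumes (not confirmed in the thread) that the file name is stored in a
    <name> child of each <file> element.
    """
    root = ET.fromstring(client_state_xml)
    sections = []
    for elem in root.iter("file"):
        if (elem.findtext("name") or "").strip() == file_name:
            # Serialize the whole element so it can be pasted into a post.
            sections.append(ET.tostring(elem, encoding="unicode").strip())
    return sections


# Example use (the path is an assumption; the file lives in the BOINC data folder):
# from pathlib import Path
# text = Path("client_state.xml").read_text(encoding="utf-8", errors="replace")
# for section in find_file_sections(text, "10v3nmgb_c14394_10mer_gb_000420.zip"):
#     print(section)
```

Working on a copy of client_state.xml rather than the live file is the safer choice, since the advice quoted in the thread was to make no edits.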
©2024 University of Washington
https://www.bakerlab.org