Message boards : Number crunching : Problems and Technical Issues with Rosetta@home
Author | Message |
---|---|
Grant (SSSF) Send message Joined: 28 Mar 20 Posts: 1671 Credit: 17,527,680 RAC: 23,122 |
Edit - although still no luck with Scheduler responses; it says it's down for maintenance. Still down. Grant Darwin NT |
MarkJ Send message Joined: 28 Mar 20 Posts: 72 Credit: 25,238,680 RAC: 0 |
app_config.xml in rosetta project directory
---------------------
You don't need the app tags if you are using a project_max_concurrent. It applies to the project as a whole, not to a particular app. You can simplify it to:

```
<app_config>
    <project_max_concurrent>3</project_max_concurrent>
</app_config>
```

BOINC blog |
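For contrast, limiting a single app is what the app tags are for. A minimal sketch, assuming the application name is rosetta; the name element must match the app name in your client_state.xml, so check that first:

```
<app_config>
    <app>
        <name>rosetta</name>
        <max_concurrent>3</max_concurrent>
    </app>
</app_config>
```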
tom Send message Joined: 29 Nov 08 Posts: 10 Credit: 6,044,733 RAC: 0 |
For some reason, I have been set to ONE work unit a day for quite a while now. After literally years of processing lots of work units trouble-free, I still don't understand why. AFAIK it started when BOINC switched to SSL, but since I can successfully connect to other sites over SSL, the switchover (yes, I switched too) shouldn't have nuked my ability to communicate with the project. And no, I don't see any errors in the event log, although I'm not very expert at looking through it. Currently running: BOINC 7.16.11, Mac OS X 10.7.5, Mac mini Server i7 |
Falconet Send message Joined: 9 Mar 09 Posts: 353 Credit: 1,222,776 RAC: 4,804 |
Deleted. |
Grant (SSSF) Send message Joined: 28 Mar 20 Posts: 1671 Credit: 17,527,680 RAC: 23,122 |
For some reason, I have been set to ONE work unit a day for quite a while now.
Because all you do is produce errors. If you want more than 1 Task per day, you need to start producing Valid work. Try detaching & re-attaching to the project - that will dump all your current work, but it will make the system re-download the science application. If you are still producing errors after that, then it is most likely a hardware issue - memory, power supply, or overheating (CPU, memory, or PSU) - or possibly an OS issue, but that's very, very unlikely, unless you recently did an update of some sort? Grant Darwin NT |
Brian Nixon Send message Joined: 12 Apr 20 Posts: 293 Credit: 8,432,366 RAC: 0 |
This is the same problem you reported five months ago? The server is limiting the amount of work it sends because your computer is returning so many errors. It can’t be related to SSL, since BOINC is successfully communicating with the server and able to download tasks. The other thing that changed around the same time was the update to application version 4.20. Your recent tasks have all failed within seconds of starting, which suggests there’s some kind of fundamental incompatibility between the application and your system. Are there any Mac OS experts here who can offer suggestions on how to diagnose that? You could try the Mac forum, but it’s pretty quiet in there… |
nikolce Send message Joined: 28 Apr 07 Posts: 2 Credit: 2,002,356 RAC: 0 |
Hi all, Can someone tell me if I should abort the below tasks? It's a bit annoying to find your CPU crunching nothing for two days. Thank you! |
Bryn Mawr Send message Joined: 26 Dec 18 Posts: 389 Credit: 12,070,320 RAC: 12,300 |
Hi all,
Have you any other Rosetta jobs running correctly alongside those? Can you see the file names of the WUs to see if they’re all the same type? Could you suspend (some of) those tasks and see if the replacement tasks run OK? That should tell you whether the problem is with those particular tasks or with your setup. |
Brian Nixon Send message Joined: 12 Apr 20 Posts: 293 Credit: 8,432,366 RAC: 0 |
Those look badly broken. They’re running but making no progress, which is why the estimated remaining time has grown so large. Kill them. Is this your Intel machine? It is returning nothing but errors lately. There seems to be something seriously wrong with it. |
nikolce Send message Joined: 28 Apr 07 Posts: 2 Credit: 2,002,356 RAC: 0 |
Thanks for bringing the errors to my attention. Apparently from that point on it had been returning nothing but errors. I caught it today because my RAC dropped a little. As recommended I killed the tasks, and the next tasks started to error within minutes. I restarted the PC and tested the CPU and memory with Prime95 on Smallest and Large FFTs for 15 minutes each (I know it should be longer), with no errors. Meanwhile the PC was not showing any signs of instability. I've resumed the project and the tasks have been doing fine for almost an hour now. I'll keep a close eye on it over the next couple of days. I thought I'd have to retire the old bugger. Thank you! |
Grant (SSSF) Send message Joined: 28 Mar 20 Posts: 1671 Credit: 17,527,680 RAC: 23,122 |
Thanks for bringing the errors to my attention. Apparently from that point on it had been returning nothing but errors. I caught it today because my RAC dropped a little.
If it occurs again, try rebooting before aborting the Tasks. Grant Darwin NT |
Brian Nixon Send message Joined: 12 Apr 20 Posts: 293 Credit: 8,432,366 RAC: 0 |
stub_cyc_target tasks completing in anywhere from under 2 hours to over 19 (against a default target of 8). (Not a problem; just an observation…) |
lazyacevw Send message Joined: 18 Mar 20 Posts: 12 Credit: 93,576,463 RAC: 0 |
Can anyone comment on the level of compression the tasks are sent with, and separately the level of compression that is applied before submitting completed workloads? Anyone know the type of compression? lzma2? bzip2?

The reason I ask is that beginning about 1 or 2 months ago, I started exceeding my 22 GB monthly data cap about 22 to 25 days into the month. I've been running pretty much the same batch of clients since April, and the internet connection is dedicated solely to R@H, so I can only assume that work units have become more complex and larger. If I don't have any data hiccups, I average around 150,000 credits. I've started to shut down a few clients in order to just stay under my data cap. I'm not sure if the usage is purely work units or if it is the ancillary files R@H downloads (like the database_357d5d93529_n_methyl files) that it uses to set up different variables and references.

I'm running dedicated Linux machines that have been pared down to preclude any unnecessary or ancillary data usage. I've even gone so far as to set the updater service to not check for updates on all of the machines. I'd really like to bring a few dozen more cores online, but I'm in a holding pattern until my data usage goes down.

When I run out of data, I have to tether my phone to each computer each day to do batch uploads and downloads. This isn't ideal because it's a little inconvenient having to plug into each computer, plus I'm sure it creates unnecessary network oscillations in the R@H distribution server. There have been a few days where I forgot to connect and kept breaking the 2-day crunch time limit. I'm sure that is inefficient for the project as a whole.

Would the devs consider upping the compression levels? It does slightly increase client and server overhead, but most computers are more than capable of spending an extra minute or two to accomplish more compression. It might help people bring more systems online. |
Brian Nixon Send message Joined: 12 Apr 20 Posts: 293 Credit: 8,432,366 RAC: 0 |
In my observation, most downloaded data files use Zip compression. I haven’t paid any attention to what gets uploaded, though I’ve seen gzip mentioned in some logs and file names. The big (500 MB Zip) database and the applications can go months between updates, so even though they’re relatively large they shouldn’t be affecting your recent usage. You can see a history of data download and upload totals in the file daily_xfer_history.xml in your BOINC data directory; you could analyse that to see how usage has changed over time. As a single data point, my current usage seems to be averaging around 4 MB per task. You could increase the target CPU run time in your project preferences to run each task for longer (and thus fewer tasks in total, requiring less overall data transfer, in any given period of time) while maintaining the same credit rate. |
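For anyone who wants to analyse that file, here is a minimal Python sketch. It assumes the usual daily_xfer_history.xml layout of <dx> records containing <when> (days since the Unix epoch) and <up>/<down> byte counts; that layout is an assumption, so verify it against your own file first:

```python
# Hedged sketch: summarise BOINC's daily_xfer_history.xml by month.
# Assumed layout: <dx> records with <when> (days since the Unix epoch)
# and <up>/<down> (bytes). Check your own file before relying on this.
import xml.etree.ElementTree as ET
from collections import defaultdict
from datetime import date, timedelta

tree = ET.parse('daily_xfer_history.xml')   # lives in the BOINC data directory
monthly = defaultdict(lambda: [0.0, 0.0])   # (year, month) -> [up, down] in bytes

for dx in tree.getroot().iter('dx'):
    day = date(1970, 1, 1) + timedelta(days=int(dx.findtext('when')))
    monthly[(day.year, day.month)][0] += float(dx.findtext('up'))
    monthly[(day.year, day.month)][1] += float(dx.findtext('down'))

for (y, m), (up, down) in sorted(monthly.items()):
    print(f"{y}-{m:02}: up {up/1e9:.2f} GB, down {down/1e9:.2f} GB")
```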
Grant (SSSF) Send message Joined: 28 Mar 20 Posts: 1671 Credit: 17,527,680 RAC: 23,122 |
You could increase the target CPU run time in your project preferences to run each task for longer (and thus fewer tasks in total, requiring less overall data transfer, in any given period of time) while maintaining the same credit rate.
Someone would need to try that out to see if it would make any significant difference; I suspect the difference would be minimal (if any). The downloads of Tasks and their support files are rather small in size. It's the returned result files that can be extremely large (I've noticed a couple over 900 MB in size). Running the Tasks for longer will result in a larger result file, so instead of returning 2 smaller files, you're returning one larger one. The only saving is fewer downloaded files - which, as I mentioned, are very small in comparison - resulting in little if any data transfer reduction.

The reason I ask is that beginning about 1 or 2 months ago, I started exceeding my 22 GB monthly data cap about 22 to 25 days into the month.
Which is around the time you brought some new systems online: a 4 core/thread, an 8c/t, and of course the 64c/t Threadripper system. They would all have a significant impact on the amount of results you return.

When I run out of data, I have to tether my phone to each computer each day to do batch uploads and downloads. This isn't ideal because it's a little inconvenient having to plug into each computer
Does your modem/router support WiFi? If so, just set up your phone as a Personal Hotspot. If not, put a WiFi dongle on one of the systems, connect that system to the phone, and enable internet sharing from that system for all of the others.

I'm running dedicated Linux machines that have been pared down to preclude any unnecessary or ancillary data usage. I've even gone so far as to set the updater service to not check for updates on all of the machines.
I suspect Linux has a similar option to Windows Update that allows systems to check for their updates on the local network before checking for them over the internet.

Would the devs consider upping the compression levels? It does slightly increase client and server overhead, but most computers are more than capable of spending an extra minute or two to accomplish more compression.
The problem is that you can only compress data up to a certain point, after which no further compression is possible. And spending 2, 3 or 4 times as long compressing (and uncompressing) the data for a 3-5% saving in file size is really not an option. Grant Darwin NT |
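The diminishing returns are easy to demonstrate yourself. A rough, self-contained sketch, with zlib standing in for whatever codec the project actually uses (that choice is an assumption):

```python
# Rough illustration of diminishing returns from higher compression levels.
# zlib is a stand-in here; the project's actual codec may differ.
import time
import zlib

# Synthetic, somewhat-compressible payload: this script's own bytes, repeated.
data = open(__file__, 'rb').read() * 2000

for level in (1, 6, 9):
    start = time.perf_counter()
    packed = zlib.compress(data, level)
    elapsed = time.perf_counter() - start
    print(f"level {level}: {len(packed)/len(data):6.1%} of original, {elapsed:.3f}s")
```

Typically the jump from level 1 to 6 buys most of the size reduction; level 9 costs noticeably more CPU time for a few percent more.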
MarkJ Send message Joined: 28 Mar 20 Posts: 72 Credit: 25,238,680 RAC: 0 |
You could put a proxy server on your network, and that can save on duplicate transfers. Unfortunately, since Rosetta switched to HTTPS it won’t help with project data files. I find the most benefit is for OS updates: the first machine downloads them, but subsequent requests come from the proxy server. It also works well with Einstein and their locality scheduler. BOINC blog |
lazyacevw Send message Joined: 18 Mar 20 Posts: 12 Credit: 93,576,463 RAC: 0 |
In my observation, most downloaded data files use Zip compression. I haven’t paid any attention to what gets uploaded, though I’ve seen gzip mentioned in some logs and file names.
Thanks. I too have seen gz and zip files in project slot directories, along with a lot of uncompressed files. It just isn't clear which files were received and decompressed, which were generated along the way, and which were created when all of the work is said and done. Essentially, I can only easily see the middle part of the data processing.

You can see a history of data download and upload totals in the file daily_xfer_history.xml in your BOINC data directory; you could analyse that to see how usage has changed over time.
I took a look at the file but I couldn't make much sense of it.

Which is around the time you brought some new systems online.
It does appear from my profile that I brought new systems online recently, but in actuality I just reinstalled the OS on most of my systems when R@H ran out of tasks. That TR has been running every day since early April.

Does your modem/router support WiFi? If so, just set up your phone as a Personal Hotspot.
I'm using switches connected to a wired personal hotspot. I tried using a USB-C to Ethernet adapter, but my Samsung S8 doesn't appear to support it. Internet sharing isn't as easy to set up as it is with Windows, but I will look into it. My clients do not have wireless adapters either. I might see about using a Windows laptop, having it connect wirelessly to my phone and then share its connection over LAN. So, I do have a few options...

You could put a proxy server on your network and that can save on duplicate transfers
I really should set up a pfSense/Squid box for the network - partly for extra security, but also for the caching feature. It would work as long as the Linux updater doesn't use HTTPS. I could probably also set up a local update mirror. Thanks for the tip! |
robertmiles Send message Joined: 16 Jun 08 Posts: 1232 Credit: 14,269,631 RAC: 3,846 |
A task running for over 12 hours so far, even though I've selected a run length of 8 hours: 3stub_cyc_target_1cwa_01152_14_extract_B_SAVE_ALL_OUT_1044879_311 The estimated time remaining is INCREASING, not decreasing. It is doing checkpoints WITHOUT ending the task. 26 seconds since the last one. Is something wrong with this task? Should I abort it? |
Speedy Send message Joined: 25 Sep 05 Posts: 163 Credit: 808,098 RAC: 0 |
A task running for over 12 hours so far, even though I've selected a run length of 8 hours:
Have you tried exiting BOINC and opening it again, or restarting your computer/laptop? If after a restart it starts back at, for example, 10 hours, let it run and see if it will finish within 12 hours. If it doesn't and keeps running past 13 hours, feel free to abort. Have a crunching good day!! |
Brian Nixon Send message Joined: 12 Apr 20 Posts: 293 Credit: 8,432,366 RAC: 0 |
I had some stub_cyc_target tasks complete in under 2 hours; some after more than 19. I aborted one that had been running for 1½ days, as even though it still seemed to be running, its progress percentage was increasing so slowly that it didn’t seem likely it would reach 100% in any reasonable amount of time. Maybe leave yours for a few more hours, and kill it if it gets to a full day without completing? Once a task has overrun, its remaining time estimate becomes meaningless, as BOINC has no way of knowing when it will finish. And sometimes tasks can get in a state where they are running but not reporting progress, so BOINC estimates progress using elapsed time towards a target perpetually 10 minutes in the future, meaning the value only asymptotically approaches 100%. |
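To make that last point concrete, a small illustration (not BOINC's actual code, just the arithmetic described above): if the estimated finish is treated as always 10 minutes (600 s) away, the implied fraction done is elapsed / (elapsed + 600), which climbs forever but never reaches 100%.

```python
# If completion is perpetually "600 s from now", reported progress
# only asymptotically approaches 100%.
for elapsed in (600, 3_600, 36_000, 360_000):
    fraction = elapsed / (elapsed + 600)
    print(f"{elapsed:>7} s elapsed -> {fraction:.4%} reported")
```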