Questions and Answers : Windows : Aborted work unit and memory usage
Author | Message |
---|---|
idb Send message Joined: 17 Sep 05 Posts: 2 Credit: 100,029 RAC: 0 |
Just a couple of observations... I had to abort one of the work units I downloaded this morning. It was running OK, reaching about 80% in around 5 hours, when I shut down BOINC (to check whether it was causing a general slowdown I was seeing). When I restarted BOINC the Rosetta w/u started up again, but the restart must have caused some error, as it was running very slowly. It took 3 hours or so to do another 10% and then appeared to get stuck. I aborted it after 8 hours. The elapsed time and time to go were also reset to 0 after the BOINC restart, although the % completed showed the correct value. Memory usage is a bit excessive! I've just started another w/u and it is currently using over 200 MB. I now think the general sluggishness I was seeing on my system (see above) was probably caused by memory swapping. BOINC 4.45, XP Home, 512 MB, P4 @ 3.2 GHz Ian |
Chris Marshall Send message Joined: 28 Sep 05 Posts: 1 Credit: 11,038 RAC: 0 |
I have also seen the memory issue on my system; I have a P4 2.8 GHz with HT enabled. Each WU is currently using 180 MB of virtual memory. Luckily I have 2 GB of RAM, so it is not affecting performance much, but that is still a lot of memory to be using. |
dgnuff Send message Joined: 1 Nov 05 Posts: 350 Credit: 24,773,605 RAC: 0 |
I have also seen the memory issue on my system; I have a P4 2.8 GHz with HT enabled. Each WU is currently using 180 MB of virtual memory. Luckily I have 2 GB of RAM, so it is not affecting performance much, but that is still a lot of memory to be using. Truth be told, 180 MB of virtual memory does not necessarily mean it's using that much physical memory. If you grab a copy of Process Explorer from Sysinternals, it can provide some useful insight into memory usage. Brief crash course on the subject: with virtual memory, there are two sizes you care about, the virtual size and the working set (WS) size. The virtual size (180 MB) is the total size of the image, code and data, but not all of that is resident in physical memory; some of it will be in the swap file. The working set size is the size of the portion that is actually resident in physical memory. That being said, Rosetta IS a little on the greedy side: typical WSS values range from 50 MB to 60 MB in my experience. Process Explorer will show the virtual size and working set size if you pull down the View menu, choose "Select Columns" (last entry), switch to the Process Performance tab, check Virtual Size and Working Set Size, and click <OK>. -- Edit -- Made the URL work -- |
Vester Send message Joined: 2 Nov 05 Posts: 258 Credit: 3,651,260 RAC: 775 |
Alex Nichol's Virtual Memory in Windows XP is good. I have two computers, one P4 and one AMD, running Windows XP, and each has 512 MB of RAM. Why run an outdated version of BOINC? Are you also running an optimized, older version of Rosetta? Running the latest client is important to the project, and I would expect that earlier revisions cannot handle more complex jobs. I haven't been here long, but I see no reason to dump a job as long as it is running. |
Garry Send message Joined: 20 Nov 05 Posts: 3 Credit: 1,326,602 RAC: 0 |
I saw a recommendation in the system requirements that Rosetta should be allowed to stay in memory when not running. And that the code does a checkpoint four times or so during each work unit. I suspected that if I didn't let it stay in memory, it would lose all work since the last checkpoint. I tested. The first time BOINC gave the processor to another experiment (and kicked Rosetta out of memory), Rosetta reported "Result 1n0u__abrelaxmode_random_length20_jitter02_omega_sim_aneal_bab100_12350_0 exited with zero status but no 'finished' file". CPU time and progress reported zero. Is it reasonable to assess that Rosetta didn't do a checkpoint during the time it ran on my machine? And that the time my machine contributed to Rosetta was lost? |
Frank Send message Joined: 29 Nov 05 Posts: 1 Credit: 2,256 RAC: 0 |
Alex Nichol's Virtual Memory in Windows XP is good. What if it says it is "running", but is making no progress? When I first signed onto Boinc a couple of weeks ago, I got a Rosetta work unit which said "CPU time 15 sec - 8hr to completion - running - 0% completed" for several days. I dumped it and got a new one because I had read in another forum that this solved someone else's problems with work units hanging. This one ran for a while but has said "CPU time5 hrs to completion - running - 30% completed" for 4 days now. In the "Messages" section, the only thing it says for Rosetta is "Pausing result 1dcj__abrelax_rand_len10_jit02_omega_sim_filters_47230_0 (removed from memory)" Fine - it's "pausing", but why? and how do I get it to start running? I don't see anything in previous messages to suggest an answer. I have no problem running Predictor sets - several have been run and turned in - and memory is also not a problem. |
R/B Send message Joined: 8 Dec 05 Posts: 195 Credit: 28,095 RAC: 0 |
I got a client error here. Was at 50% and it switched over to another project, as is normal and then the whole unit failed. I run boinc 5.2.1 on win xphome on athlon 3,000+ 12/8/2005 12:26:43 PM|rosetta@home|Unrecoverable error for result 1ogw__topology_sample_01515_0 ( - exit code -1073741819 (0xc0000005)) 'Topology sample' ? Is this some kind of calibration unit I did half of because I'm new? Curiously...credit is waiting to be granted under 'your account' when I click on it. But it says 'client error' next to it... <--------CONFUSED. Thanks for help in advance. Founder of BOINC GROUP - Objectivists - Philosophically minded rational data crunchers. |
Rebirther Send message Joined: 17 Sep 05 Posts: 116 Credit: 41,315 RAC: 0 |
I got a client error here. Was at 50% and it switched over to another project, as is normal and then the whole unit failed. I run boinc 5.2.1 on win xphome on athlon 3,000+ In your preferences you must set "Leave application in memory" to avoid this problem when BOINC switches between projects. Results that end in a client error will not be credited! |
R/B Send message Joined: 8 Dec 05 Posts: 195 Credit: 28,095 RAC: 0 |
Thanks, I had it set to leave in memory but forgot to click 'update' on my Boinc mgr.. I haven't slept in a while. I run a few other projects but am new to Rosetta as of today. Seemed like this athlon was humming along at 50% complete of that unit at 45 minutes into it. So I just put rosetta on my old 500 mhz machine right now....we'll see how fast my 2nd and older machine can get them done. It's trying to d/l 30 Rosetta units on this 2nd machine which is a 500mhz 256 Ram on dialup. Is this normal? It's set to a 3 day cache... Founder of BOINC GROUP - Objectivists - Philosophically minded rational data crunchers. |
R/B Send message Joined: 8 Dec 05 Posts: 195 Credit: 28,095 RAC: 0 |
Ahhh, I think I get it. Rosetta sends multiple packets that comprise each individual work unit. So it looked like I was downloading 20 or 30 when I was really just downloading 4 or 5. Founder of BOINC GROUP - Objectivists - Philosophically minded rational data crunchers. |
Gio Send message Joined: 5 Jan 06 Posts: 2 Credit: 18,139 RAC: 0 |
I get a lot of client errors, and I'm about to leave this project. I get 10% success only if I leave Rosetta running alone for hours, so I have to suspend all other projects. I want to give this project one last chance. As I read here, I need to set "Leave in memory", but I cannot find that option. Would anyone be so kind as to tell me exactly where to set it? Thanks Gio |
J D K Send message Joined: 23 Sep 05 Posts: 168 Credit: 101,266 RAC: 0 |
I get a lot of client errors, and I'm about to leave this project. I get 10% success only if I leave Rosetta running alone for hours, so I have to suspend all other projects. Go to your account and look under preferences; you will find it there... BOINC Wiki |
Gio Send message Joined: 5 Jan 06 Posts: 2 Credit: 18,139 RAC: 0 |
I get a lot of client errors, and I'm about to leave this project. I get 10% success only if I leave Rosetta running alone for hours, so I have to suspend all other projects. Thanks, I found it. (For some reason I did not find it before.) I hope everything is fixed now. |
verdy_p Send message Joined: 8 Feb 06 Posts: 5 Credit: 8,785 RAC: 0 |
I have also seen the memory issue on my system; I have a P4 2.8 GHz with HT enabled. Each WU is currently using 180 MB of virtual memory. Luckily I have 2 GB of RAM, so it is not affecting performance much, but that is still a lot of memory to be using. You're right. It's completely ridiculous to permanently dedicate so much virtual space on disk while Rosetta is suspended, whether because we are running another application or because time is shared with other BOINC projects. The main reason is a serious bug in Rosetta's interrupt handler, whose code is definitely not thread-safe and corrupts the main computing thread. I don't like leaving Rosetta in memory. It goes against the philosophy of BOINC projects, which should use ONLY idle computing resources. Rosetta should withdraw from BOINC until this bug is corrected (the bug may even produce false scientific results through data corruption, even when a work unit does not terminate abruptly with an unrecoverable error). Also: please save computing snapshots more often. When there's a failure, the work unit state should be recovered without too much CPU time lost, and the work unit should then progress far enough before the next snapshot to get past a single failure caused by an external event. Note that when BOINC is running as a screensaver, it may be interrupted very quickly, before any significant progress has been made, and such events may occur several times in rapid succession. This is not a failure but a common issue with screensavers, which are sometimes triggered when the user is just reading a document or has paused for a short time and does not want the screensaver to interrupt their work. To solve this problem, Rosetta should enter sleep mode for a short time when it gets paused, and if it has not been resumed after 2 minutes, it should save its computing state to disk and exit. 
If the computing thread is holding a lock on a critical data section, the interrupt handler may not be able to terminate the job immediately and should start a timer for a delayed retry, from 2 seconds up to 1 minute, in a loop. It MUST make every effort to exit the Rosetta process and free memory as fast as possible. For now, Rosetta just stinks, and abuses users' computing resources. |
teepsy Send message Joined: 5 Jan 06 Posts: 2 Credit: 4,926 RAC: 0 |
I've been running only Rosetta for a few days now, but something happened yesterday when it started aborting every unit I've been getting. Sometimes it aborts at 22 or so percent, but usually at about 77-78%. Any ideas? Am I the only one that has been having this problem the past two days? |
Moderator9 Volunteer moderator Send message Joined: 22 Jan 06 Posts: 1014 Credit: 0 RAC: 0 |
I've been running only Rosetta for a few days now, but something happened yesterday when it started aborting every unit I've been getting. Sometimes it aborts at 22 or so percent, but usually at about 77-78%. Any ideas? Am I the only one that has been having this problem the past two days? Teepsy, some of the errors are actually work units that the system thought were hung and so automatically aborted. The mechanism that does this is called the "watchdog". Some of the others may be a known BOINC issue. If you have not done so already, you should upgrade your BOINC software to version 5.4.9. That has fixed a lot of problems for people. You can get the software here. Moderator9 ROSETTA@home FAQ Moderator Contact |
teepsy Send message Joined: 5 Jan 06 Posts: 2 Credit: 4,926 RAC: 0 |
I've been running only Rosetta for a few days now....it aborts at 22 or so percent, but usually at about 77-78%. Thanks for the 5.4.9 - it looks much better now! However, I got one at 23% and another completely screwed up. Do you think I should just not do Rosetta for awhile? I hate to be messing up their data! Oh! When I go into my "results" page - it keeps telling me I have one pending, but I do not have anything working. I just noticed that some of my units are showing up as completed (and I have received credit) but others keep saying client error so, of course, I don't get credit. I really don't know what to do; I feel badly about this! |
dr.frank Send message Joined: 10 Apr 06 Posts: 1 Credit: 23,691 RAC: 0 |
I'm sorry to have to agree with verdy. I'm running 4 different applications, and NONE of the others poses any problems. Every time I get upstairs where I have a server running, the screen is blocked with a Rosetta screen and error message. So if this "keep in memory" setting is the only way to solve this, Rosetta is out the door, even if it has a pretty screensaver... Get your act together: you are getting a lot of calculation time for free, and the LEAST you can do is make your software work! Frank |
MG Send message Joined: 27 Nov 06 Posts: 1 Credit: 0 RAC: 0 |
I have also seen the memory issue on my system; I have a P4 2.8 GHz with HT enabled. Each WU is currently using 180 MB of virtual memory. Luckily I have 2 GB of RAM, so it is not affecting performance much, but that is still a lot of memory to be using. I've still got the same problem with rosetta@home, while all my other projects are running without any difficulties. There still seems to be no solution. So I'm out, too! |
Paul Massaria Send message Joined: 23 Nov 06 Posts: 1 Credit: 20,520 RAC: 0 |
I have also seen the memory issue on my system; I have a P4 2.8 GHz with HT enabled. Each WU is currently using 180 MB of virtual memory. Luckily I have 2 GB of RAM, so it is not affecting performance much, but that is still a lot of memory to be using. Is this the reason I get so many client errors on the results? |
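
To put numbers on dgnuff's point about virtual size versus working set size, here is a minimal Python sketch. The function name `memory_breakdown` and the 55 MB working set (the midpoint of the 50 MB to 60 MB range quoted in the thread) are assumptions for illustration. The "not resident" figure is only an upper bound on what sits in the swap file, since part of a process's virtual size can be reserved address space or file-backed pages.

```python
def memory_breakdown(virtual_mb, working_set_mb, physical_ram_mb):
    """Split a process's virtual size into resident and non-resident parts."""
    if working_set_mb > virtual_mb:
        raise ValueError("working set cannot exceed virtual size")
    return {
        # Portion actually held in physical memory right now.
        "resident_mb": working_set_mb,
        # Portion that is paged out, file-backed, or merely reserved.
        "not_resident_mb": virtual_mb - working_set_mb,
        # Fraction of installed RAM taken by the working set.
        "share_of_ram": working_set_mb / physical_ram_mb,
    }

# Thread figures: 180 MB virtual size, 512 MB of RAM.
# The 55 MB working set is an assumed midpoint, not a measured value.
example = memory_breakdown(180, 55, 512)
print(example)
```

With these inputs, 125 MB of the 180 MB is not resident, and the working set takes up a little over a tenth of a 512 MB machine. That is the figure that matters for swapping, not the 180 MB headline number.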
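
Garry's question about losing work since the last checkpoint can be illustrated with a small simulation. This is only a sketch of periodic checkpointing in general, not Rosetta's or BOINC's actual code. The names `run_work_unit` and `checkpoint.json`, and the 100-step work unit, are hypothetical.

```python
import json
import os
import tempfile

def run_work_unit(total_steps, checkpoint_every, state_path, stop_after=None):
    """Run steps, saving progress every `checkpoint_every` steps.

    `stop_after` simulates the task being removed from memory after that
    many steps in this session. Returns the step reached in memory.
    """
    # Resume from the last checkpoint if one exists, else start at zero.
    step = 0
    if os.path.exists(state_path):
        with open(state_path) as f:
            step = json.load(f)["step"]

    done_this_session = 0
    while step < total_steps:
        if stop_after is not None and done_this_session >= stop_after:
            break  # killed: work after the last checkpoint is lost
        step += 1  # stand-in for one unit of real computation
        done_this_session += 1
        if step % checkpoint_every == 0:
            with open(state_path, "w") as f:
                json.dump({"step": step}, f)
    return step

with tempfile.TemporaryDirectory() as d:
    path = os.path.join(d, "checkpoint.json")
    # Four checkpoints per work unit: every 25 steps out of 100.
    reached = run_work_unit(100, 25, path, stop_after=40)
    with open(path) as f:
        resumed_from = json.load(f)["step"]
    lost = reached - resumed_from          # steps that must be redone
    final = run_work_unit(100, 25, path)   # second session resumes and finishes
```

If the task is removed before the first checkpoint is written, no state file exists and the next session starts from zero. That would be consistent with the zero CPU time and progress Garry saw, although the thread does not confirm that this is what happened.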
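
The exit code in R/B's log, -1073741819 (0xc0000005), is the same 32-bit value shown once as a signed integer and once in hex. The sketch below does that conversion and looks the value up in a small table of Windows NTSTATUS codes. The helper name `exit_code_to_hex` and the two-entry table are illustrative assumptions. The table is not exhaustive.

```python
def exit_code_to_hex(signed_code):
    """Reinterpret a signed 32-bit exit code as unsigned and format as hex."""
    return "0x{:08x}".format(signed_code & 0xFFFFFFFF)

# Two well-known NTSTATUS values; many others exist.
KNOWN_STATUS = {
    0xC0000005: "STATUS_ACCESS_VIOLATION",
    0xC000013A: "STATUS_CONTROL_C_EXIT",
}

code = -1073741819  # value reported in the BOINC message log
hex_code = exit_code_to_hex(code)
status_name = KNOWN_STATUS.get(code & 0xFFFFFFFF, "unknown")
print(hex_code, status_name)
```

An access violation means the application touched memory it was not allowed to use. In other words, the science application crashed, so this was not a special "calibration" result.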
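
verdy_p's proposed pause policy (sleep briefly when paused, checkpoint and exit if not resumed within 2 minutes, retry the exit from 2 seconds up to 1 minute while the computing thread holds a lock) can be sketched as follows. This illustrates the proposal only and does not show how Rosetta or BOINC behaves. The function names are hypothetical, and doubling the retry delay is an assumption, since the post gives only the 2 second and 1 minute bounds.

```python
from itertools import islice

def action_when_paused(seconds_paused, grace_seconds=120):
    """Decide what a paused task should do under the proposed policy."""
    if seconds_paused < grace_seconds:
        return "sleep"               # stay in memory, use no CPU
    return "checkpoint_and_exit"     # save state to disk, then free memory

def retry_delays(first=2, cap=60):
    """Yield waits (seconds) between exit attempts while state is locked.

    Doubling is an assumption; the post states only the two bounds.
    """
    delay = first
    while True:
        yield min(delay, cap)
        delay = min(delay * 2, cap)

# First seven waits under the assumed doubling schedule.
delays = list(islice(retry_delays(), 7))
print(action_when_paused(30), action_when_paused(120), delays)
```

The grace period is what keeps a screensaver that flickers on and off from forcing a full save-and-exit cycle on every interruption.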
©2024 University of Washington
https://www.bakerlab.org