Questions and Answers : Windows : Frequent hung work units
Author | Message |
---|---|
Brock Jones Joined: 30 Dec 09 Posts: 6 Credit: 163,688 RAC: 0 |
I have two Windows machines I'm running BOINC and Rosetta@home on: one XP machine and one Win7 x64 machine. Both are running BOINC v.6.10.18. I'm frequently getting hung WUs on *both* machines. When the WUs are hung, I'll see no progress after running for > 15 hours and an ever-climbing 'To completion' time. I can abort the WUs in question and it will typically process several additional WUs, but will eventually (within a day or two) get hung again. When a task is hung, it's reported as running, I can see the process in the Windows task manager (minirosetta_2.03_windows_intelx86.exe on the XP box that I'm at right now) and it's 'using' memory, but it never uses any processor time. Meanwhile another WU running on the other core is taking its typical 50%. I've got one right now that says it's been running for 15.5 hours with 30.5 hours remaining. Edit: I should add that, while a work unit is hung, the screensaver does not work; it simply displays a completely blank black screen. Additionally, the hung work units do not always hang in the same place. They typically start processing normally and get hung up midway. A currently hung WU (job 16684_27_0) is stuck at 8.459%, hasn't made any progress in > 12 hours, and is currently using no CPU cycles. |
Mod.Sense Volunteer moderator Joined: 22 Aug 06 Posts: 4018 Credit: 0 RAC: 0 |
What is the status shown for the two active tasks? What shows in the messages around the time it stopped getting CPU time? I am thinking perhaps the BOINC Manager suspended the task to assure that the two combined do not exceed your memory usage preferences. The amount of memory used by a WU varies as it runs. And so at the random points in time where they both are hitting a peak, it sounds like it is crossing your preference. Note: I'm talking about the amount of memory BOINC is allowed to use, not the amount of memory on the machine. I guess that doesn't really explain 10+ hours though. Ideally, you would try suspending tasks and then restarting them first. This can often clear up any problem, and preserve the work you have completed. Does this seem to you like a new problem with the 2.03 version? Or has it been occurring longer than that? Any pattern in the names of the WUs that are hanging? Rosetta Moderator: Mod.Sense |
Brock Jones Joined: 30 Dec 09 Posts: 6 Credit: 163,688 RAC: 0 |
"What is the status shown for the two active tasks?" They display as 'Running'. "What shows in the messages around the time it stopped getting CPU time?" Nothing in there at all. It starts and then there are no further messages about it until I suspend or abort it. "I am thinking perhaps the BOINC Manager suspended the task to assure that the two combined do not exceed your memory usage preferences. The amount of memory used by a WU varies as it runs. And so at the random points in time where they both are hitting a peak, it sounds like it is crossing your preference. Note: I'm talking about the amount of memory BOINC is allowed to use, not the amount of memory on the machine." Both of the machines in question are set to allow up to 90% of memory when idle and 75% of swap space. Their memory usage doesn't appear to be anywhere near that high when they stall. Each machine has 4GB of physical memory and each process is using 180-300MB. Frequently, the next work unit to come along will use *more* memory (and more total memory between the two running WUs) and complete just fine. "Ideally, you would try suspending tasks and then restarting them first. This can often clear up any problem, and preserve the work you have completed." I actually tried that first -- suspending and resuming them had no impact. "Any pattern in the names of the WUs that are hanging?" Here are the most recent WUs from the XP machine that I've had to abort: homopt_nat.t312_.t312_.IGNORE_THE_REST.native_0001_0026.pdb.JOB_16681_27_0, homopt_nat.t322_.t322_.IGNORE_THE_REST.native_0001_0095.pdb.JOB_16684_27_0, homopt2b.t331_.t331_.IGNORE_THE_REST.S_00002_0000473_00069.pdb.JOB_16718_12_0 |
Brock Jones Joined: 30 Dec 09 Posts: 6 Credit: 163,688 RAC: 0 |
I have another one that appears to be currently stalled as well: homopt4.t293_.t293_.IGNORE_THE_REST.S_00002_0000001_0_0_00088.pdb_00008.pdb_00002.pdb.JOB_16810_2_0 It's only been running for 2.5 hours, but it's exactly the same presentation. It's showing as 'Running', taking 220MB of memory (task manager), and using no CPU time. The 'To completion' estimate just keeps on climbing - it's at 5 hours and climbing right now. |
Mike Tyka Joined: 20 Oct 05 Posts: 96 Credit: 2,190 RAC: 0 |
Those are my jobs. Seems like the display counter is not being updated during this protocol. This is a cosmetic problem though - the jobs are running just fine underneath and we're getting lots of good data back! I'll put a bug fix in the next version. Cheers, Mike http://beautifulproteins.blogspot.com/ http://www.miketyka.com/ |
Brock Jones Joined: 30 Dec 09 Posts: 6 Credit: 163,688 RAC: 0 |
"Those are my jobs. Seems like the display counter is not being updated during this protocol. This is a cosmetic problem though - the jobs are running just fine underneath and we're getting lots of good data back! I'll put a bug fix in the next version." It's *not* a display problem. When they hang, they use no processor time whatsoever and they *never* finish. I have one that has been going for 30 hours solid right now. Perhaps there is something specific about my configuration on these machines that is causing a problem, but once I get two hung WUs on a machine (both are dual-core machines), it's completely stopped at that point and will never process another WU - at least out to just over 40 hours of run time. |
Brock Jones Joined: 30 Dec 09 Posts: 6 Credit: 163,688 RAC: 0 |
I find it difficult to believe that more people aren't having this problem. This is happening on two totally fresh/vanilla installs of BOINC with Rosetta@home as the only running science app. |
Mike Tyka Joined: 20 Oct 05 Posts: 96 Credit: 2,190 RAC: 0 |
Hmm. We've not been able to reproduce this problem here, unfortunately. There is an update going out today (2.05) (it may have already gone out, in fact) that fixed a different issue to do with checkpointing. Is that version still giving you these troubles? Mike http://beautifulproteins.blogspot.com/ http://www.miketyka.com/ |
Brock Jones Joined: 30 Dec 09 Posts: 6 Credit: 163,688 RAC: 0 |
"Hmm. We've not been able to reproduce this problem here unfortunately." Nope -- these are running 2.03. I just had one fail with a 'Computation error' after 53 hours. I've got another one that's been running for 32 hours and predicts 50 hours remaining. |
Admin Joined: 13 Apr 07 Posts: 42 Credit: 260,782 RAC: 0 |
Brock, I had the same issue, but the 2.05 update seems to have fixed it. Abort the stuck WUs, as nothing else will happen other than a time increase. I had to abort 5-6 in a row due to this issue. Hope this helps ya. |
Admin Joined: 13 Apr 07 Posts: 42 Credit: 260,782 RAC: 0 |
Just had to abort a stuck homopt WU @ 40% on minirosetta 2.05. Doesn't look like whatever the issue is has been totally fixed yet. |
mdillenk Joined: 19 Feb 06 Posts: 8 Credit: 865,454 RAC: 0 |
"I find it difficult to believe that more people aren't having this problem. This is happening on two totally fresh/vanilla installs of BOINC with Rosetta@home as the only running science app." I'm having the exact same problem: jobs such as the ones listed below never finish. In the BOINC client they look frozen, and the job doesn't use any CPU. I would guess that between 5% and 10% of the jobs do this. I'm running the 64-bit BOINC client on Windows 7 64. Anybody else having problems like this, or know what may be wrong? t374__boinc_filtered_loopbuild_threading_cst_lb_tex_IGNORE_THE_REST_16900_5733_0, t365__boinc_filtered_loopbuild_threading_cst_all_tex_IGNORE_THE_REST_16902_5996_0, lr15clus_opt_.1eyv.1eyv.IGNORE_THE_REST.c.10.2.pdb.pdb.JOB_17448_1_0 |
banicki Joined: 7 Dec 05 Posts: 1 Credit: 7,990,466 RAC: 1,028 |
Me too! Brand-new Win7 x64 machine picks up units, starts them, and sometimes they just stop processing. They stop requesting or using CPU for long periods of time, like 24 hours, with no credit racked up. These are 4-hour units. Tasks that I aborted as suspected hung: lrmixclus2_opt_.1bq9.1bq9.SAVE_ALL_OUT_IGNORE_THE_REST.c.5.1.pdb.pdb.JOB_18226_2_0 aborted by user 3/2/2010 9:04:11 PM rosetta@home Computation for task lrmixclus2_opt_.1bq9.1bq9.SAVE_ALL_OUT_IGNORE_THE_REST.c.5.1.pdb.pdb.JOB_18226_2_0 finished 3/2/2010 9:04:27 PM rosetta@home task lrmixclus2_opt_.1o5u.1o5u.SAVE_ALL_OUT_IGNORE_THE_REST.c.12.3.pdb.pdb.JOB_18257_2_0 aborted by user 3/2/2010 9:04:29 PM rosetta@home Computation for task lrmixclus2_opt_.1o5u.1o5u.SAVE_ALL_OUT_IGNORE_THE_REST.c.12.3.pdb.pdb.JOB_18257_2_0 finished I'm running 6.10.18: 2/28/2010 5:22:23 PM Starting BOINC client version 6.10.18 for windows_x86_64 |
Mod.Sense Volunteer moderator Joined: 22 Aug 06 Posts: 4018 Credit: 0 RAC: 0 |
Please post these details to the thread for the appropriate version number on the Number Crunching board. mdillenk, please do the same, and post BOINC and Windows versions. Rosetta Moderator: Mod.Sense |
deesy58 Joined: 20 Apr 10 Posts: 75 Credit: 193,831 RAC: 0 |
Well, here it is, almost the end of April, and I am experiencing the exact same problem. My symptoms mirror those of Brock Jones. I am running XP-SP3 on a Pentium D with 2 Gigs RAM, and I have been plagued by this problem since about two days after I began using BOINC and processing Rosetta tasks. There appears to be nothing in the messages to offer any clues, and it takes a re-boot to fix the problem. Should we have to baby-sit these tasks in order to feel confident that they will complete? Is there a problem with Windows and BOINC? I left the FAH Project to contribute to Rosetta, but this project appears to be even less stable than the "new" FAH SMP2 client. Is there any cure for this problem? deesy |
Mod.Sense Volunteer moderator Joined: 22 Aug 06 Posts: 4018 Credit: 0 RAC: 0 |
Which Rosetta version are the problem tasks running? Any pattern in the naming of problem and successful tasks? Rosetta Moderator: Mod.Sense |
deesy58 Joined: 20 Apr 10 Posts: 75 Credit: 193,831 RAC: 0 |
"Which Rosetta version are the problem tasks running? Any pattern in the naming of problem and successful tasks?" I have started a new thread with additional information, but the Windows Task Manager shows MiniRosetta_2.11_windows_intelx86.exe as the running processes. No pattern that I could see. deesy |
Mod.Sense Volunteer moderator Joined: 22 Aug 06 Posts: 4018 Credit: 0 RAC: 0 |
Link to the new thread to discuss the specifics of your issue. Rosetta Moderator: Mod.Sense |
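
The symptom reported throughout this thread (a task that shows as 'Running' and holds memory, but whose CPU time stops advancing) can be checked mechanically rather than by watching Task Manager. The following is only a minimal sketch in Python, not anything BOINC or Rosetta provides: the `Sample` type, the `is_stalled` function, and the one-hour window and one-second threshold are all hypothetical, and it assumes you have some way of recording wall-clock time and cumulative CPU time for the process at intervals.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Sample:
    """One observation of a task (hypothetical record, not a BOINC structure)."""
    wall_s: float  # wall-clock seconds since the task started
    cpu_s: float   # cumulative CPU seconds the process has consumed


def is_stalled(samples: Sequence[Sample],
               window_s: float = 3600.0,
               min_cpu_delta_s: float = 1.0) -> bool:
    """Return True if CPU time grew by less than min_cpu_delta_s over at
    least window_s of wall-clock time, measured back from the latest sample.

    Returns False when there is not yet window_s of history to judge from.
    """
    if len(samples) < 2:
        return False
    latest = samples[-1]
    # Walk backwards to the most recent sample that is at least window_s old.
    for earlier in reversed(samples[:-1]):
        if latest.wall_s - earlier.wall_s >= window_s:
            return latest.cpu_s - earlier.cpu_s < min_cpu_delta_s
    return False


# Illustrative, made-up observations taken every 30-60 minutes.
healthy = [Sample(0, 0), Sample(1800, 1700), Sample(3600, 3400), Sample(7200, 6900)]
hung = [Sample(0, 0), Sample(1800, 1700), Sample(3600, 1710), Sample(7200, 1710)]

print("healthy stalled?", is_stalled(healthy))
print("hung stalled?", is_stalled(hung))
```

A check like this would distinguish the two explanations offered above: a purely cosmetic progress-counter problem would still show CPU time climbing, whereas the hangs described by the posters would show it flat for hours.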
©2024 University of Washington
https://www.bakerlab.org