Questions and Answers : Windows : Processing Ceases
Author | Message |
---|---|
deesy58 Send message Joined: 20 Apr 10 Posts: 75 Credit: 193,831 RAC: 0 |
I have a recurring problem that BOINC Support is not able to resolve, and I was advised to submit the problem to Rosetta. I have e-mail messages and log excerpts that I can PM, if anybody is interested. I have a dual-core machine with 2 Gigs RAM and Windows XP-SP3, so two MiniRosetta tasks are usually running at the same time. Seemingly at random, a task will just stop processing, and my CPU utilization will drop by 50%. If I do nothing, it is only a matter of time before the other task will also stop processing, and my CPUs will be at 99% System Idle. During these interruptions, the BOINC Manager tells me that the tasks are either "Running," or are "Running, high priority." The only way I can recover and restart the Rosetta tasks is to reboot my computer, and this is getting very frustrating. Is this a known problem? Is there a fix? deesy58 |
Mod.Sense Volunteer moderator Send message Joined: 22 Aug 06 Posts: 4018 Credit: 0 RAC: 0 |
It seems every month or two I will see a report like this. No root cause has yet been found. I forwarded your PM with log details to the Project Team for review. I am sure they are fully engaged with CASP beginning. The only pattern I've noticed is that some machines have the problem chronically, and others never see it at all. Do you have a feel for what % of tasks stop using CPU in this manner on your machine? Rosetta Moderator: Mod.Sense |
deesy58 Send message Joined: 20 Apr 10 Posts: 75 Credit: 193,831 RAC: 0 |
It seems every month or two I will see a report like this. No root cause has yet been found. I forwarded your PM with log details to the Project Team for review. I am sure they are fully engaged with CASP beginning. I have noticed this problem for several days, and it continued to worsen until yesterday when it appeared to have peaked. Two of the last Rosetta tasks have been particularly unstable. One completed last night after "crashing" about six or seven times. The other is still running after about 21.5 hours elapsed, but it has halted two or three times, too. I have also "aborted" at least two tasks. As of right now, both tasks have been running without interruption for about 12 hours. If I suspend the tasks, then attempt to resume processing on them, they will report that they are running, but will use no CPU resources to do so. Only a reboot will solve the problem. deesy |
deesy58 Send message Joined: 20 Apr 10 Posts: 75 Credit: 193,831 RAC: 0 |
It happened again during the night. When I went to bed, both tasks were processing fine. This morning, only one of them is using the CPU and advancing towards completion. The other is not using the CPU, and is not advancing towards completion, even though BOINC Manager reports that it is "Running, high priority." If this is the Rosetta software, it really needs to be fixed. If it is hardware, then the software does not seem to be responding appropriately to an error condition. Just halting without any sort of notification would seem to be a little less than consistent with good programming practice, and I am left with no clue as to where to look for the problem. This is really frustrating! deesy |
deesy58 Send message Joined: 20 Apr 10 Posts: 75 Credit: 193,831 RAC: 0 |
It just happened again. My leading task "froze" at 35.2% complete. I exited the BOINC Manager, and set the option to halt the science projects when doing so. Then I restarted BOINC Manager. Both tasks restarted, and I waited to see progress towards completion on the previously frozen task. I was quite surprised when the progress suddenly changed from 35.2% complete to 30.9% complete while I was looking at it. This is just weird! What happened to the other 4.3%? I now must keep the Windows Task Manager open at all times so that I can monitor CPU usage in order to detect when one or both of the Rosetta science tasks freezes. deesy |
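Watching Task Manager by hand amounts to checking whether a task's accumulated CPU time is still rising. A minimal sketch of that check, assuming the CPU-time samples have already been collected by some other means; the function name, the thresholds, and the sample numbers are all made up for illustration and are not part of BOINC or Rosetta:

```python
def is_stalled(cpu_seconds, window=3, min_delta=1.0):
    """Return True if cumulative CPU time rose by less than `min_delta`
    seconds over the last `window` sampling intervals.

    cpu_seconds: cumulative CPU-time readings for one task, oldest first.
    window and min_delta are illustrative defaults, not BOINC settings.
    """
    if len(cpu_seconds) < window + 1:
        return False  # not enough samples to judge yet
    recent = cpu_seconds[-(window + 1):]
    return recent[-1] - recent[0] < min_delta


# Hypothetical readings taken once a minute for two tasks that the
# manager reports as "Running".
healthy = [100.0, 158.0, 217.0, 276.0, 335.0]
frozen = [100.0, 158.0, 158.0, 158.0, 158.0]

print("healthy task stalled?", is_stalled(healthy))
print("frozen task stalled?", is_stalled(frozen))
```

Requiring several flat samples in a row, rather than one, avoids flagging a task that was only briefly preempted.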
deesy58 Send message Joined: 20 Apr 10 Posts: 75 Credit: 193,831 RAC: 0 |
The task finally finished after 13 hours of processing. I had to exit and restart BOINC three times today. I have put the BOINC Manager icon on my desktop so that it will be easier to access for restarts. :( deesy |
Mod.Sense Volunteer moderator Send message Joined: 22 Aug 06 Posts: 4018 Credit: 0 RAC: 0 |
It just happened again. My leading task "froze" at 35.2% complete. I exited the BOINC Manager, and set the option to halt the science projects when doing so. Then I restarted BOINC Manager. Both tasks restarted, and I waited to see progress towards completion on the previously frozen task. I was quite surprised when the progress suddenly changed from 35.2% complete to 30.9% complete while I was looking at it. This is just weird! What happened to the other 4.3%? Any time you end a task and it is removed from memory, some amount of completed work is lost. Work in progress is periodically saved by a process called checkpointing. Rosetta Moderator: Mod.Sense |
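The drop from 35.2% to 30.9% is what checkpointing would produce if the last save happened at 30.9%. A toy sketch of the idea, assuming a single saved progress value per task; the class and method names are hypothetical and this is not how Rosetta actually stores its checkpoints:

```python
from dataclasses import dataclass


@dataclass
class Task:
    progress: float = 0.0    # percent complete, held only in memory
    checkpoint: float = 0.0  # percent complete at the last save to disk

    def work(self, amount: float) -> None:
        """Advance in-memory progress by `amount` percent."""
        self.progress = min(100.0, self.progress + amount)

    def save_checkpoint(self) -> None:
        """Record the current progress as the restart point."""
        self.checkpoint = self.progress

    def restart(self) -> float:
        """Simulate removing the task from memory and reloading it.

        Returns the amount of progress lost since the last checkpoint.
        """
        lost = self.progress - self.checkpoint
        self.progress = self.checkpoint
        return lost


task = Task()
task.work(30.9)
task.save_checkpoint()   # last save before the freeze
task.work(4.3)           # work done after the save, never written out
lost = task.restart()    # exit and restart BOINC

print(f"resumed at {task.progress:.1f}%, lost {lost:.1f}%")
```

The more often a task checkpoints, the less is lost on each restart, at the cost of more disk writes.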
deesy58 Send message Joined: 20 Apr 10 Posts: 75 Credit: 193,831 RAC: 0 |
Any time you end a task and it is removed from memory, some amount of completed work is lost. Work in progress is periodically saved by a process called checkpointing. Okay. I understand that. What I do NOT understand, however, is why some Work Units just seem to "freeze" at some level of completion, and stop using computer resources (CPU). These events seem to be random. After more than 48 hours of continuous processing on both of my CPU cores, one of the two tasks being processed just "froze" at a level of 7.763% complete. The only way to resume processing on this WU was to exit BOINC, and then restart it. deesy |
Mod.Sense Volunteer moderator Send message Joined: 22 Aug 06 Posts: 4018 Credit: 0 RAC: 0 |
Tasks dropping off and no longer using CPU, even when BOINC Manager would indicate they should be (i.e. they aren't suspended for any reason nor preempted to run another project) is one quirk that hasn't been tracked down yet. Fortunately it seems to affect very few people and even within those machines, only a small number of tasks. I've been gathering all the hints I can to report to the Project Team. Rosetta Moderator: Mod.Sense |
deesy58 Send message Joined: 20 Apr 10 Posts: 75 Credit: 193,831 RAC: 0 |
Tasks dropping off and no longer using CPU, even when BOINC Manager would indicate they should be (i.e. they aren't suspended for any reason nor preempted to run another project) is one quirk that hasn't been tracked down yet. Fortunately it seems to affect very few people and even within those machines, only a small number of tasks. I've been gathering all the hints I can to report to the Project Team. Okay, perhaps I might be able to supply what could be a helpful hint. I have determined that these "freezes" appear to be WU-related. The "freeze" happened again last night, making it four times in a little more than 16 hours. Each time, it was the same WU that stopped: rb_05_02_122_331_rs_stg0_lrlx_t000_boincid_SAVE_ALL_OUT.IGNORE_THE_REST_B_20262_857_0 Elapsed: 14:01:17 Progress: 27.103% To Completion: 21:31:52 Perhaps this will help. Each time it freezes and I must restart BOINC to recover, I lose progress on the task, so it is going to take a lot longer to complete these types of WUs, and it requires constant attention to the Windows Task Manager to detect the "freezing." Perhaps these types of WUs are sufficiently different from others that the miniboinc_2.11 software can't process them reliably. Should I immediately abort each of these kinds of WUs when I see that they have been assigned? Do you think that might alleviate my problem? deesy |
deesy58 Send message Joined: 20 Apr 10 Posts: 75 Credit: 193,831 RAC: 0 |
Tasks dropping off and no longer using CPU, even when BOINC Manager would indicate they should be (i.e. they aren't suspended for any reason nor preempted to run another project) is one quirk that hasn't been tracked down yet. Fortunately it seems to affect very few people and even within those machines, only a small number of tasks. I've been gathering all the hints I can to report to the Project Team. After the offending WU "froze" for the sixth (and final) time about ten minutes ago, I aborted the task. A new, similar, WU began processing immediately. After an elapsed time of one minute and 21 seconds, it aborted itself with a "Computation Error" message. Now, it appears that one of the WUs being processed might be some sort of "test," since the word "test" appears in the name of the WU. From here on out, I will immediately abort all WUs that appear to be of the same type as those that appear unable to process on my machine. deesy |
Mod.Sense Volunteer moderator Send message Joined: 22 Aug 06 Posts: 4018 Credit: 0 RAC: 0 |
You have to do what works for you. Ideally, if you are around the machine and able to look in on it, you would let them run and see if you can confirm the theory that specific names consistently have problems. Or suspend them until a time when you are around the machine. If everyone had the same symptoms as you, then there would be a glaring lack of returned results for those tasks and a big red flag would have risen some time ago. So, I suspect you will find that most of them will run fine. And no, the word "test", or any other word in a task name has no relation to expected reliability in crunching. It is just a reference to this task's relationship to others in the study of the protein. Rosetta Moderator: Mod.Sense |
deesy58 Send message Joined: 20 Apr 10 Posts: 75 Credit: 193,831 RAC: 0 |
You have to do what works for you. Ideally, if you are around the machine and able to look in on it, you would let them run and see if you can confirm the theory that specific names consistently have problems. Or suspend them until a time when you are around the machine. If everyone had the same symptoms as you, then there would be a glaring lack of returned results for those tasks and a big red flag would have risen some time ago. So, I suspect you will find that most of them will run fine. Okay. I'll just abort any task that ever freezes in the future. If I collect the names of the "offending" tasks, do you want me to post them here? deesy |
Mod.Sense Volunteer moderator Send message Joined: 22 Aug 06 Posts: 4018 Credit: 0 RAC: 0 |
If I collect the names of the "offending" tasks, do you want me to post them here? Certainly, yes. Posting is even better than directly emailing them to me because it allows others to compare their own notes with yours and offer further information. Rosetta Moderator: Mod.Sense |
deesy58 Send message Joined: 20 Apr 10 Posts: 75 Credit: 193,831 RAC: 0 |
If I collect the names of the "offending" tasks, do you want me to post them here? Okay, Mod.Sense, I will do that. It is difficult for me to imagine that my machine regularly halts processing certain Work Units, but that nobody else experiences the problem. That seems extremely unlikely to me. Perhaps other contributors have become annoyed with the problem and either stopped contributing to Rosetta@Home entirely, or they simply abort the task whenever it happens (like I will now do). In 40 years of software development, our development teams always made an intense effort to locate and fix bugs, even if only a very few users were adversely affected. :-| deesy |
deesy58 Send message Joined: 20 Apr 10 Posts: 75 Credit: 193,831 RAC: 0 |
Okay! Here is another one. After processing without problems since May 5, 2010, the latest WU "crashed" a few minutes ago. Here is the name of the WU: rb_05_19_162_579_rs_stg0_lrlx_t000_casp9_SAVE_ALL_OUT.IGNORE_THE_REST_B_21112_2094_0 I will no longer attempt to restart these defective WUs, but will abort them as soon as I notice that processing has ceased. I hope that this information will assist in tracking down the root of the problem. deesy |
deesy58 Send message Joined: 20 Apr 10 Posts: 75 Credit: 193,831 RAC: 0 |
Here's one more: rs_stg0_lrlxjcst_t308_run6_SAVE_ALL_OUT_20984_304_0 This would be a lot easier, and perhaps more contributors would post these defective WUs, if we could copy the name to the clipboard, and then paste it. Either that, or come up with simpler WU names. deesy |
deesy58 Send message Joined: 20 Apr 10 Posts: 75 Credit: 193,831 RAC: 0 |
Here we go again. I wish these work units were more stable. rs_stg0_lrlxcst_T477_casp8_SAVE_ALL_OUT_20745_1622_0 Progress: 12.369% (aborted the task) deesy |
deesy58 Send message Joined: 20 Apr 10 Posts: 75 Credit: 193,831 RAC: 0 |
Well, here's another one: rs_stg0_lrlx_T389_casp8_SAVE_ALL_OUT_20772_2567_0 It seems a waste that these work units complete 10%-15% before crashing. This one quit processing in the middle of the night, again. BOINC points a finger at the project, and the project just shrugs. Is anybody watching? Does anybody care? Is this a normal occurrence? Should this information be posted elsewhere? WTF! deesy |
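For anyone comparing notes, the five WU names posted in this thread can be checked for shared name parts with a few lines of Python. This is only a sketch: splitting on underscores is an assumption about how the names are built, and a shared token proves nothing by itself unless it is missing from the names of tasks that ran without trouble.

```python
# The five work-unit names reported as freezing in this thread.
names = [
    "rb_05_02_122_331_rs_stg0_lrlx_t000_boincid_SAVE_ALL_OUT.IGNORE_THE_REST_B_20262_857_0",
    "rb_05_19_162_579_rs_stg0_lrlx_t000_casp9_SAVE_ALL_OUT.IGNORE_THE_REST_B_21112_2094_0",
    "rs_stg0_lrlxjcst_t308_run6_SAVE_ALL_OUT_20984_304_0",
    "rs_stg0_lrlxcst_T477_casp8_SAVE_ALL_OUT_20745_1622_0",
    "rs_stg0_lrlx_T389_casp8_SAVE_ALL_OUT_20772_2567_0",
]

# Split each name on "_" and keep only the tokens present in every name.
token_sets = [set(name.split("_")) for name in names]
common = set.intersection(*token_sets)

print("tokens shared by all reported names:", sorted(common))
```

Adding the names of a few tasks that completed normally and subtracting their tokens would show whether anything is specific to the failing ones.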
©2024 University of Washington
https://www.bakerlab.org