Questions and Answers : Windows : Rosetta WU's restart
Author | Message |
---|---|
Kim Schreiber Joined: 29 Mar 09 Posts: 2 Credit: 1,675,649 RAC: 0 |
Can anybody tell me why my Rosetta WUs start all over again after my computer has been turned off? I have to finish a WU if I don't want it to start from 0. Other projects' WUs continue from where they left off. |
Mod.Sense Volunteer moderator Joined: 22 Aug 06 Posts: 4018 Credit: 0 RAC: 0 |
Preserving the work done so far is called checkpointing. Different types of Rosetta tasks checkpoint at different intervals. Overall, most tasks checkpoint about every 15 minutes of runtime. If your computer is on, but perhaps set to run BOINC only when idle, I would suggest you also set your preference to leave tasks in memory while suspended. That way, even if you pop on and off your computer, the work done so far stays in memory until the task can run again and eventually reach a checkpoint. Rosetta Moderator: Mod.Sense |
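To make the idea of checkpointing concrete, here is a minimal Python sketch of a task that saves its progress after each unit of work and resumes from the saved state on restart. It is an illustration only, not how Rosetta or BOINC implement checkpointing; the names `load_checkpoint`, `save_checkpoint`, `run_task`, the JSON file format, and the `stop_after` parameter (which simulates a shutdown) are all hypothetical.

```python
import json
import os
import tempfile


def load_checkpoint(path):
    """Return the saved state, or a fresh state if no checkpoint exists."""
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return {"models_done": 0}


def save_checkpoint(path, state):
    """Write to a temp file, then rename it over the old checkpoint.

    The rename means an interruption during the write cannot leave a
    half-written checkpoint in place of the previous good one.
    """
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
    os.replace(tmp, path)


def run_task(path, total_models, stop_after=None):
    """Compute models one at a time, checkpointing after each one.

    stop_after (hypothetical) simulates a shutdown after that many
    models have been computed in the current session.
    """
    state = load_checkpoint(path)
    done_this_session = 0
    while state["models_done"] < total_models:
        if stop_after is not None and done_this_session >= stop_after:
            break
        # Placeholder for the real computation of one model.
        state["models_done"] += 1
        done_this_session += 1
        save_checkpoint(path, state)
    return state["models_done"]
```

In this sketch, any work done since the last call to `save_checkpoint` is what would be lost on a shutdown, which is why the interval between checkpoints matters.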
Mad_Max Joined: 31 Dec 09 Posts: 209 Credit: 25,992,337 RAC: 15,074 |
It seems that some Rosetta WUs do not checkpoint at all, or do it incorrectly. For example, with a WU named "ha_notyr.....", "CPU time at last checkpoint" still shows "-----" (none) after several hours of computing. If I restart (or shut down) the computer while this WU is running, all results are lost and computation starts from the beginning after the restart. Other WUs write checkpoint messages to the log but do not actually checkpoint. For example, my computer crunched a WU named "lr_mix..." (example URL: https://boinc.bakerlab.org/rosetta/result.php?resultid=309128812) for about 3 hours. BOINC Manager showed "CPU time at last checkpoint" correctly (only a few minutes less than the total CPU time), and "show graphics" showed that 38 models were already done. I then shut down the computer, and the next day, when the computer and BOINC/Rosetta started again, this WU restarted from 0% ("show graphics" also showed 0 models), so I aborted it. This is a significant problem for me, because I NEED to restart this computer from time to time, and it is always shut down while I sleep (because of the noise). Something similar also happens when BOINC switches automatically between projects while running several at once (for example R@H and E@H). What can be done about this (apart from giving up on Rosetta and moving to other projects)? P.S. Sorry for my English. I studied it only at basic school and have not had enough practice. |
Mod.Sense Volunteer moderator Joined: 22 Aug 06 Posts: 4018 Credit: 0 RAC: 0 |
Mad Max, while you are correct that there are still some types of work units that only checkpoint after a model is completed, I believe your main problem is patience. It takes a work unit a minute or so to really get restarted. So, I think you just aborted it before it had a chance to wake up and realize it had already completed the 38 models. Either way, if you would let it run to completion rather than aborting it, and then post to the appropriate version's thread on the Number Crunching board with a link to the WU, that would be valuable information for the Project Team to have to resolve the problem. I'm sure they see lots of odd results (and aborts), but without that observation and knowledge of what causes them, it is often difficult to understand what areas require correction. Rosetta Moderator: Mod.Sense |
Mad_Max Joined: 31 Dec 09 Posts: 209 Credit: 25,992,337 RAC: 15,074 |
I think I waited long enough: before I cancelled this WU it had time to calculate 2 more models, but they were 2 NEW models (the count started again from 0, 1, 2), with no sign of the 38 models calculated before the shutdown. In any case, there is still the question of the other WU type, "ha_notyr...". One of these is computing right now: BOINC Manager shows 77% progress and "show graphics" shows 297 calculated models, but the task has no checkpoints at all (as with the previous WUs of this type). Here is an example (screenshot): Is this the correct thread for the discussion: https://boinc.bakerlab.org/rosetta/forum_thread.php?id=5186 (if I use minirosetta 2.03)? Should I copy my "report" there? P.S. Is the screenshot above displayed properly? |
Mod.Sense Volunteer moderator Joined: 22 Aug 06 Posts: 4018 Credit: 0 RAC: 0 |
Yes, I see your screenshot just fine. Yes, that would be the thread to post to. It sounds like you have something interesting there. The task is definitely "awake" when it has completed the next model. You should be seeing a checkpoint saved at the end of every model. Are you familiar with setting options in the cc_config.xml file? I think there is a setting to debug the checkpointing. Perhaps an error was encountered when a checkpoint was attempted. Rosetta Moderator: Mod.Sense |
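As a rough illustration of the cc_config.xml suggestion, the Python sketch below writes a minimal configuration that turns on checkpoint logging. It assumes the `checkpoint_debug` log flag described in the BOINC client configuration documentation; the helper name `write_cc_config` is hypothetical, and the location of the BOINC data directory (where the file must go) varies by installation, so the path is left as a parameter. Note that this sketch overwrites any existing file at that path, so an existing cc_config.xml should be edited by hand instead.

```python
import xml.etree.ElementTree as ET

# Minimal client configuration enabling checkpoint logging.
# Assumption: <checkpoint_debug> is the relevant BOINC log flag.
CC_CONFIG = """<cc_config>
  <log_flags>
    <checkpoint_debug>1</checkpoint_debug>
  </log_flags>
</cc_config>
"""


def write_cc_config(path):
    """Write the configuration above to path (hypothetical helper).

    Overwrites any file already at that path.
    """
    # Parse first so a typo in the template fails here, not in the client.
    ET.fromstring(CC_CONFIG)
    with open(path, "w") as f:
        f.write(CC_CONFIG)
```

After the file is in place, the client typically needs to be restarted or told to re-read its config files before the extra checkpoint messages show up in the event log.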
Mod.Sense Volunteer moderator Joined: 22 Aug 06 Posts: 4018 Credit: 0 RAC: 0 |
Any time you turn off your PC, you should first completely shut down the BOINC Manager. That means right-click the icon and choose Exit. This ensures it has closed all of its files first. Is that what you've been doing? Rosetta Moderator: Mod.Sense |
Mad_Max Joined: 31 Dec 09 Posts: 209 Credit: 25,992,337 RAC: 15,074 |
To Mod.Sense: Yes, that is usually how I shut down (except when the computer hangs completely because of other processes running on it, or after a power failure). Moreover, by "restarts" I meant not only a hard reset of the computer, but also simply closing BOINC and starting it again. I did that several times on purpose to try to catch the problem, with the same results. No, so far I know nothing about working with the cc_config.xml file. For now my workaround is to set a small "CPU target time" (currently 2 hours instead of the 10 I used before) to keep the possible loss of useful calculations to a minimum. But I think that is only a partial solution and far from the best one... P.S. I have moved my "report" on the problem to the appropriate thread: https://boinc.bakerlab.org/rosetta/forum_thread.php?id=5186 so I suggest continuing the discussion there. |
©2024 University of Washington
https://www.bakerlab.org