Losing WU progress

The red spirit
Joined: 22 Nov 15
Posts: 10
Credit: 214,036
RAC: 141
Message 93489 - Posted: 5 Apr 2020, 12:21:30 UTC

Hello,
I have been running Rosetta@Home COVID-19 WUs on really low-end hardware. That computer has an AMD Turion X2 TL-60 inside and each WU takes 21 hours to complete. At home I don't feel comfortable leaving old hardware running overnight, so I have to shut down that PC after 13-15 hours. Sadly, it seems that after the shutdown I lost all progress on those two WUs and they simply started from scratch. I run the latest Linux Mint there. Is there anything I can do about this? Cosmology@Home WUs work just fine and save their progress.
ID: 93489 · Rating: 0
Mod.Sense
Volunteer moderator
Joined: 22 Aug 06
Posts: 4018
Credit: 0
RAC: 0
Message 93499 - Posted: 5 Apr 2020, 15:06:59 UTC

It sounds like perhaps the WU did not reach the completion of the first model. The end of a model is always a checkpoint, where work is saved. Some WUs have checkpoints within a model as well. Can you look at the properties of the WUs and see the time since their last checkpoint as compared to CPU time?
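On Linux, the same comparison can also be made from a terminal. The sketch below assumes that `boinccmd --get_tasks` prints per-task lines labelled `name:`, `checkpoint CPU time:` and `current CPU time:` (check the labels against your own client's output). `since_checkpoint` is a hypothetical helper name, not part of BOINC.

```shell
# Reads `boinccmd --get_tasks`-style text on stdin and prints, for each task,
# its name and the CPU seconds accumulated since the last checkpoint.
# ASSUMPTION: the field labels below match what your BOINC client prints.
since_checkpoint() {
    awk -F': ' '
        function emit() { if (name != "") printf "%s %d\n", name, cur - ckpt }
        /^ *name:/                { emit(); name = $2; cur = 0; ckpt = 0 }
        /^ *checkpoint CPU time:/ { ckpt = $2 }
        /^ *current CPU time:/    { cur = $2 }
        END                       { emit() }
    '
}
```

Usage would be `boinccmd --get_tasks | since_checkpoint`. A value equal to the task's whole CPU time means the task has never checkpointed.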
Rosetta Moderator: Mod.Sense
ID: 93499 · Rating: 0
The red spirit
Joined: 22 Nov 15
Posts: 10
Credit: 214,036
RAC: 141
Message 93506 - Posted: 5 Apr 2020, 15:48:37 UTC - in response to Message 93499.  

CPU time is equal to CPU time since checkpoint. Right now it's at 3:40:40. It seems that previous progress was completely discarded. Also it's weird that my other PC had working checkpoints.
ID: 93506 · Rating: 0
Mod.Sense
Volunteer moderator
Joined: 22 Aug 06
Posts: 4018
Credit: 0
RAC: 0
Message 93515 - Posted: 5 Apr 2020, 17:04:09 UTC

Different types of work units checkpoint at different times. The one thing they all have in common is that a checkpoint is taken at the end of a model.

Any time you shut down BOINC, the work since the last checkpoint is lost. The objective is that this would normally be less than 15 minutes of work. When the task restarts, it restores the checkpoint and continues from there.

In your case, unfortunately, it sounds like a checkpoint was never reached. Taking so long to reach a checkpoint is very unusual.

I can only suggest looking for the usual bottlenecks: available memory, the BOINC settings for how much memory can be used and whether tasks are left in memory when suspended, not using BOINC as a screensaver, not leaving the graphics display running, etc.

If the problem continues, rest assured that it will be detected and ended. The message says something like "too many restarts with no progress". But it takes 4 or 5 times. Since the project is out of work, there won't be others to replace them. If you can, it would be interesting to see how it looks if you leave the machine on overnight.
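To make the checkpoint behaviour described above concrete, here is a toy sketch. It is not how Rosetta is implemented; the function name, the checkpoint file and the "budget" argument are all made up for illustration. Progress is saved only when a model completes, and a restart resumes from whatever was last saved.

```shell
# Toy worker: each "model" is one unit of work, and the checkpoint file is
# rewritten only when a model completes. Hypothetical names throughout.
#   $1 = checkpoint file
#   $2 = total number of models
#   $3 = models to run before "shutting down" (default: run to completion)
run_models() {
    ckpt_file=$1
    total=$2
    budget=${3:-$2}
    completed=0
    # On restart, resume from the last saved checkpoint if there is one.
    if [ -f "$ckpt_file" ]; then
        completed=$(cat "$ckpt_file")
    fi
    ran=0
    while [ "$completed" -lt "$total" ] && [ "$ran" -lt "$budget" ]; do
        # ... the real computation for one model would go here ...
        completed=$((completed + 1))
        ran=$((ran + 1))
        echo "$completed" > "$ckpt_file"   # checkpoint at the end of a model
    done
    echo "$completed"
}
```

Running `run_models state.txt 5 2` and then `run_models state.txt 5` resumes at model 3. If the first model never finishes before shutdown, no file is written and the next run starts from zero, which matches the symptom in this thread.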
ID: 93515 · Rating: 0
The red spirit
Joined: 22 Nov 15
Posts: 10
Credit: 214,036
RAC: 141
Message 93525 - Posted: 5 Apr 2020, 17:39:27 UTC - in response to Message 93515.  

I will try to leave that computer running. I don't use it for anything else. It was once a DV6000 laptop; now it's a halftop. There's no screen, battery, keyboard or touchpad. The only plastic left is the part that holds everything together. It was also upgraded from a 1.8GHz Sempron to a 2GHz Turion X2, from 1GB of DDR2 to 4GB, and from an 80GB HDD to a 120GB SSD. So naturally I now use it as a "desktop", meaning that I plug in a keyboard/touchpad (Logitech K400), connect a monitor via VGA, and use an old Android phone as a WiFi adapter. Not having an internal display means that I can't see how to enter the BIOS, and the lack of an internal keyboard means that support for external ones is finicky. So I can't really access the BIOS anymore, and if I unplug the machine from power all settings are lost. I think that by default only 32MB is assigned to the GPU (GeForce 6150 Go), which might be problematic nowadays. I'm not sure whether that could be related to not saving progress. I'm also not sure about the virtualization and security settings (NX bit maybe). Hopefully this information helps a bit.
ID: 93525 · Rating: 0
The red spirit
Joined: 22 Nov 15
Posts: 10
Credit: 214,036
RAC: 141
Message 93576 - Posted: 5 Apr 2020, 23:41:21 UTC - in response to Message 93525.  

Small update:
The whole thing froze. The BOINC window in fullscreen froze and the Mint menus froze. The PC doesn't respond to the lost internet connection. Only the mouse cursor moves. It looks like it won't resolve itself, but I will keep it running and won't try to close the BOINC window. There's only 1 hour and 28 minutes of work left. That's so frustrating.
ID: 93576 · Rating: 0
The red spirit
Joined: 22 Nov 15
Posts: 10
Credit: 214,036
RAC: 141
Message 93617 - Posted: 6 Apr 2020, 8:48:00 UTC - in response to Message 93576.  

And I had to restart the machine after all. During the night it became completely unresponsive. After the restart both Rosetta WUs failed ("computation error"). However, it seems that both are 100% complete.
ID: 93617 · Rating: 0
The red spirit
Joined: 22 Nov 15
Posts: 10
Credit: 214,036
RAC: 141
Message 93675 - Posted: 6 Apr 2020, 20:38:22 UTC - in response to Message 93617.  

I'm really sorry for ruining those 2 WUs. After that incident I set Rosetta to not get any new tasks, but during that freeze it downloaded one WU and I noticed that it's 95% complete. There's not much left, so I will keep it going. Checkpoint functionality is again nonexistent on that WU; however, Cosmology's WU seems to have working checkpoints.
ID: 93675 · Rating: 0
The red spirit
Joined: 22 Nov 15
Posts: 10
Credit: 214,036
RAC: 141
Message 93681 - Posted: 6 Apr 2020, 21:22:43 UTC - in response to Message 93675.  

So, that one WU was crunched successfully. However, checkpoints didn't work.
ID: 93681 · Rating: 0
Feet1st
Joined: 30 Dec 05
Posts: 1755
Credit: 4,690,520
RAC: 0
Message 93682 - Posted: 6 Apr 2020, 21:24:45 UTC - in response to Message 93681.  

Can you give a link to the task?
Add this signature to your EMail:
Running Microsoft's "System Idle Process" will never help cure cancer, AIDS nor Alzheimer's. But running Rosetta@home just might!
https://boinc.bakerlab.org/rosetta/
ID: 93682 · Rating: 0
The red spirit
Joined: 22 Nov 15
Posts: 10
Credit: 214,036
RAC: 141
Message 93684 - Posted: 6 Apr 2020, 21:49:45 UTC - in response to Message 93682.  
Last modified: 6 Apr 2020, 21:54:05 UTC

Feet1st wrote: "Can you give a link to the task?"

Here they are (the failed ones):
https://boinc.bakerlab.org/rosetta/result.php?resultid=1139781657
https://boinc.bakerlab.org/rosetta/result.php?resultid=1139781737

And here's the successful one:
https://boinc.bakerlab.org/rosetta/result.php?resultid=1140448033
ID: 93684 · Rating: 0
bkil
Joined: 11 Jan 20
Posts: 97
Credit: 4,433,288
RAC: 0
Message 94008 - Posted: 9 Apr 2020, 22:02:00 UTC - in response to Message 93684.  

This has been mentioned before. The Rosetta i686 applications haven't been working correctly since last week. They only give 20 credits upon completion and only produce a single decoy. This also explains why checkpointing isn't working midway (I've observed this as well, by the way). You can correct this by working on 64-bit applications only, which you do by disabling
alt_platforms
as per the guide: https://boinc.berkeley.edu/wiki/Client_configuration
ID: 94008 · Rating: 0
bkil
Joined: 11 Jan 20
Posts: 97
Credit: 4,433,288
RAC: 0
Message 94010 - Posted: 9 Apr 2020, 22:17:08 UTC - in response to Message 93525.  

I also have a few theories about your freeze.


  • Double check your process priorities
  • The new Rosetta applications consume much more RAM than before; 1-1.5GB is not unusual. Close every other application during computation. Set up zram equal in size to your RAM and set it to deflate compression. Add a little disk swap as well, at most half your RAM. Review your BOINC user configuration regarding memory usage. Preferably reboot the machine weekly from cron.
  • Disable the "keep in memory when suspended" setting. Reduce your cache to 0.1 day mandatory + 0.1 extra and ensure that no other project is running (other than a 0-share backup project like WCG).
  • The new Rosetta applications copy, unpack and load a non-trivial amount of data from disk on startup and maybe periodically as well and if two threads start at the same time, you may easily have an I/O bottleneck. There may exist various solutions to this (tweaking Linux internals, or just using a zram-compressed ramdisk for the slots directory).
  • Thermal throttling may kick in -> log and monitor your temperatures and thermal events with an app or by a simple sysfs cat/while/sleep loop. Blow out dust and/or greasy debris, reapply thermal paste and fasten heat sinks tight. Consider undervolting. If this is still not enough, reduce CPU runtime %.
  • It may also be a hardware problem.

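For anyone wanting to try the zram and temperature-logging suggestions from the list above, here is a rough sketch. It assumes a Linux kernel with the zram module, that `deflate` is listed in `/sys/block/zram0/comp_algorithm` on your kernel, and that temperatures are exposed under `/sys/class/thermal`. The function names are made up for illustration, and `setup_zram` needs root.

```shell
# Print MemTotal in kB from /proc/meminfo-style text on stdin.
mem_total_kb() { awk '/^MemTotal:/ { print $2 }'; }

# Convert a sysfs reading in millidegrees Celsius to whole degrees.
millideg_to_c() { echo $(( $1 / 1000 )); }

# Create a zram swap device as large as RAM, using deflate (run as root).
# ASSUMPTION: loading the module creates /dev/zram0 and deflate is available.
setup_zram() {
    kb=$(mem_total_kb < /proc/meminfo)
    modprobe zram
    echo deflate > /sys/block/zram0/comp_algorithm  # must be set before disksize
    echo "${kb}K" > /sys/block/zram0/disksize
    mkswap /dev/zram0
    swapon -p 100 /dev/zram0
}

# Log every readable thermal zone once per interval (seconds, default 60).
log_temps() {
    interval=${1:-60}
    while :; do
        for zone in /sys/class/thermal/thermal_zone*/temp; do
            [ -r "$zone" ] || continue
            printf '%s %s %s C\n' "$(date '+%Y-%m-%d %H:%M:%S')" \
                "$zone" "$(millideg_to_c "$(cat "$zone")")"
        done
        sleep "$interval"
    done
}
```

Something like `log_temps 30 >> temps.log &` leaves a record that can be compared with the time of a freeze afterwards.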
ID: 94010 · Rating: 0
Mod.Sense
Volunteer moderator
Joined: 22 Aug 06
Posts: 4018
Credit: 0
RAC: 0
Message 94017 - Posted: 9 Apr 2020, 23:05:43 UTC - in response to Message 94008.  
Last modified: 10 Apr 2020, 13:23:35 UTC

So would the alt_platform look like this then??
<alt_platform>(Nope, I had it wrong)</alt_platform>

ID: 94017 · Rating: 0
bkil
Joined: 11 Jan 20
Posts: 97
Credit: 4,433,288
RAC: 0
Message 94049 - Posted: 10 Apr 2020, 9:58:01 UTC - in response to Message 94017.  
Last modified: 10 Apr 2020, 9:58:32 UTC

This is what /var/lib/boinc-client/cc_config.xml looks like over here:
<cc_config>
  <log_flags>
    <task>1</task>
    <file_xfer>1</file_xfer>
    <sched_ops>1</sched_ops>
  </log_flags>
  <options>
    <no_alt_platform>1</no_alt_platform>
  </options>
</cc_config>


After restarting the boinc daemon (systemctl restart boinc-client), this will completely remove the alt_platform line from your client_state.xml and all future scheduler requests.
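A quick way to confirm the change took effect is to look for the element in the state file after the restart. This is only a sketch: `has_alt_platform` is a hypothetical helper, and it assumes the entry appears as an `<alt_platform>` element in the file.

```shell
# Succeeds (exit status 0) if the given client_state.xml still contains an
# <alt_platform> element, and fails otherwise.
has_alt_platform() { grep -q '<alt_platform>' "$1"; }

# Typical use (as root), with the path and service name from the post above:
#   systemctl restart boinc-client
#   if has_alt_platform /var/lib/boinc-client/client_state.xml; then
#       echo "alt_platform still present"
#   else
#       echo "alt_platform removed"
#   fi
```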
ID: 94049 · Rating: 0

©2024 University of Washington
https://www.bakerlab.org