Watchdog not working too well

Message boards : Number crunching : Watchdog not working too well

Charles Dennett
Joined: 27 Sep 05
Posts: 102
Credit: 2,071,286
RAC: 0
Message 94622 - Posted: 16 Apr 2020, 17:40:25 UTC

I have a task with a 2-hour run time target. The watchdog should have stopped it at 6 CPU hours. Currently it is at over 21 hours of CPU time:

Application Rosetta 4.15
Name 12v1n_al_12mer_design_00240_010210_0001_SAVE_ALL_OUT_914331_72
State Running
Received Wed 15 Apr 2020 12:05:07 PM EDT
Report deadline Sat 18 Apr 2020 12:05:06 PM EDT
Estimated computation size 80,000 GFLOPs
CPU time 21:22:39
CPU time since checkpoint 21:22:39
Elapsed time 21:39:20
Estimated time remaining 00:10:07
Fraction done 99.226%
Virtual memory size 382.00 MB
Working set size 304.89 MB
Directory slots/3
Process ID 164535
Progress rate 4.680% per hour
Executable rosetta_4.15_x86_64-pc-linux-gnu
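
As a rough check of the figures above, here is a minimal sketch that computes how far the task has run past the expected watchdog cutoff. It assumes the cutoff is the run time target plus a fixed 4-hour grace period, which is only inferred from the 2-hour target and 6-hour cutoff mentioned in this post, not taken from the Rosetta source. The names `parse_hms`, `watchdog_overrun`, and `WATCHDOG_GRACE` are hypothetical.

```python
from datetime import timedelta

# Assumption: watchdog limit = run time target + 4 hours, inferred from the
# 2-hour target / 6-hour cutoff quoted in this post (not from Rosetta source).
WATCHDOG_GRACE = timedelta(hours=4)


def parse_hms(text: str) -> timedelta:
    """Parse an 'HH:MM:SS' value as shown in the BOINC task properties."""
    hours, minutes, seconds = (int(part) for part in text.split(":"))
    return timedelta(hours=hours, minutes=minutes, seconds=seconds)


def watchdog_overrun(cpu_time: str, target_hours: float) -> timedelta:
    """Return how far CPU time exceeds the assumed limit (zero if within it)."""
    limit = timedelta(hours=target_hours) + WATCHDOG_GRACE
    return max(parse_hms(cpu_time) - limit, timedelta(0))


if __name__ == "__main__":
    # CPU time and run time target taken from the task properties above.
    print(watchdog_overrun("21:22:39", target_hours=2))  # prints 15:22:39
```

With the numbers from this task, that is a little over 15 hours past the point where the watchdog would be expected to step in.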

Also note that it has not checkpointed yet either. Looking at the files in the slots/3 directory does show some current activity (the current time at my location is 13:34 as I type this):

ls -lart | tail
-rw-r--r--. 1 boinc boinc 0 Apr 15 15:50 rosetta_tmp.txt
-rw-r--r--. 1 boinc boinc 0 Apr 15 15:50 minirosetta_database.zip.is_extracted
-rw-rw-r--. 1 charlie charlie 0 Apr 16 06:57 stderrgfx.txt
-rw-rw-r--. 1 charlie charlie 14 Apr 16 06:57 gfx_info
-rw-r--r--. 1 boinc boinc 6175 Apr 16 11:28 init_data.xml
drwxrwx--x. 3 boinc boinc 20480 Apr 16 11:28 .
-rw-r--r--. 1 boinc boinc 9529 Apr 16 13:30 12v1n_al_12mer_design_00240_010210_0001_check.txt
-rw-r--r--. 1 boinc boinc 3589 Apr 16 13:30 rng.state.gz
-rw-rw----. 1 boinc boinc 25001680 Apr 16 13:33 boinc_rosetta_3
-rw-r--r--. 1 boinc boinc 8192 Apr 16 13:33 boinc_mmap_file
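
The same kind of activity check as the listing above can be scripted. This is only a sketch: the function name `recently_modified` and the 10-minute window are arbitrary choices for illustration, and the `slots/3` path is relative to the BOINC data directory, whose location varies by install.

```python
import os
import time
from pathlib import Path


def recently_modified(directory, within_seconds=600, now=None):
    """Return names of regular files modified in the last `within_seconds`, oldest first."""
    now = time.time() if now is None else now
    recent = []
    for entry in os.scandir(directory):
        if entry.is_file():
            mtime = entry.stat().st_mtime
            if now - mtime <= within_seconds:
                recent.append((mtime, entry.name))
    return [name for _, name in sorted(recent)]


if __name__ == "__main__":
    # "slots/3" is the slot directory named in this post; adjust the path
    # to point inside your own BOINC data directory.
    slot = Path("slots/3")
    if slot.is_dir():
        for name in recently_modified(slot):
            print(name)
```

Reading the slot directory may require running as a user with access to the boinc-owned files. Recent modification times only show that the process is still writing files, not that the task is making real progress.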

A tail of the 12v1n_al_12mer_design_00240_010210_0001_check.txt file shows this:

tail 12v1n_al_12mer_design_00240_010210_0001_check.txt
LAST 497 SUCCESS 0
LAST 498 SUCCESS 0
LAST 499 SUCCESS 0
LAST 500 SUCCESS 0
LAST 501 SUCCESS 0
LAST 502 SUCCESS 0
LAST 503 SUCCESS 0
LAST 504 SUCCESS 0
LAST 505 SUCCESS 0
LAST 506 SUCCESS 0

Here's a link to the task:
https://boinc.bakerlab.org/rosetta/result.php?resultid=1150908452

I'm going to let it run for a while just to see what happens.

-Charlie
Mod.Sense
Volunteer moderator

Joined: 22 Aug 06
Posts: 4018
Credit: 0
RAC: 0
Message 94625 - Posted: 16 Apr 2020, 17:58:04 UTC - in response to Message 94622.  

Charles Dennett wrote: "I'm going to let it run for a while just to see what happens."


You are much more curious than I :) I would blast it. But either way, please post an update when it reports back. I am curious too. Perhaps your dog (your profile photo) can help teach the R@h watchdog.

Have you verified the venue of the host as compared to the runtime preference for that venue? Have you been modifying the runtime preferences recently?
Rosetta Moderator: Mod.Sense
Charles Dennett
Joined: 27 Sep 05
Posts: 102
Credit: 2,071,286
RAC: 0
Message 94626 - Posted: 16 Apr 2020, 18:11:03 UTC - in response to Message 94625.  

I don't use a specific venue. A couple of weeks ago I raised the run time from 1 hour to 8 hours and ran that way for a while. I noticed my RAC started dropping, so several days ago I lowered it to 2 hours to see if by any chance it would make a difference (not that I expect it to). The task was received well after I did that and was preceded by a lot of tasks that ran successfully with the 2-hour CPU time. So I doubt that was the reason. Still, even with an 8-hour run time the watchdog would have aborted it after 12 hours.

Unfortunately, the dog in my profile is no longer with us. I'll have to get a picture of my new yellow lab. I ran R@H for a long time but stopped all distributed computing several years ago. I retired a year ago, and with the recent pandemic I jumped back in to do my part. R@H was always one of my favorites. Right now it's all I'm running across 3 systems/12 cores.

-Charlie
Charles Dennett
Joined: 27 Sep 05
Posts: 102
Credit: 2,071,286
RAC: 0
Message 94669 - Posted: 17 Apr 2020, 10:54:39 UTC - in response to Message 94626.  

After over a day and a half of CPU time, I've aborted the task.

-Charlie
Mod.Sense
Volunteer moderator

Joined: 22 Aug 06
Posts: 4018
Credit: 0
RAC: 0
Message 94680 - Posted: 17 Apr 2020, 13:48:46 UTC - in response to Message 94669.  

Thank you for posting. These "12v1n" tasks are now under discussion here.
Rosetta Moderator: Mod.Sense



©2024 University of Washington
https://www.bakerlab.org