Message boards : Number crunching : Linux Hung Machine
Author | Message |
---|---|
Buckeye4lf Joined: 29 Aug 08 Posts: 43 Credit: 8,512,904 RAC: 1,994 |
I am having issues with Rosetta jobs on my Linux machine. The machine randomly becomes unresponsive and I have to power cycle it in order to get it back. The power cycle obviously purges the issue but also any evidence of what caused it. I only have this issue when running Rosetta jobs. Has anyone else had this issue? I would appreciate some help. Thanks, Greg |
Bryn Mawr Joined: 26 Dec 18 Posts: 389 Credit: 12,070,320 RAC: 12,300 |
Quote: "I am having issues with Rosetta jobs on my Linux machine. The machine randomly becomes unresponsive and I have to power cycle it in order to get it back. The power cycle obviously purges the issue but also any evidence of what caused it. I only have this issue when running Rosetta jobs. Has anyone else had this issue? I would appreciate some help." Not something that I’ve experienced, but the evidence should still exist in /var/log/... |
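A minimal sketch of how one might search the saved logs for clues after a forced reboot. The pattern list is an assumed starting point built from common kernel message fragments, not an exhaustive one, and `find_suspicious` is a hypothetical helper name. Which file holds the kernel log differs between distros; `/var/log/kern.log` is the usual place on Ubuntu-based systems such as Mint, but treat that path as an assumption too.

```python
import re

# Assumed starter list of kernel-log fragments worth a look after a hang.
SUSPICIOUS = re.compile(
    r"oom-killer|Out of memory|soft lockup|Hardware Error|Machine check",
    re.IGNORECASE,
)


def find_suspicious(lines):
    """Return the log lines that match any of the patterns above."""
    return [line.rstrip("\n") for line in lines if SUSPICIOUS.search(line)]


# Possible usage (path is an assumption, adjust for your distro):
#   with open("/var/log/kern.log", errors="replace") as log:
#       for hit in find_suspicious(log):
#           print(hit)
```

A hard hang may leave nothing behind at all, so an empty result does not rule out a hardware problem.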
Buckeye4lf Joined: 29 Aug 08 Posts: 43 Credit: 8,512,904 RAC: 1,994 |
Quote: "I am having issues with Rosetta jobs on my Linux machine. The machine randomly becomes unresponsive and I have to power cycle it in order to get it back. The power cycle obviously purges the issue but also any evidence of what caused it. I only have this issue when running Rosetta jobs. Has anyone else had this issue? I would appreciate some help." I can check when I get home. Usually when I am forced to power cycle, the jobs all error out, which I suspect may mask the true issue. |
Charles Dennett Joined: 27 Sep 05 Posts: 102 Credit: 2,074,852 RAC: 207 |
Quote: "I am having issues with Rosetta jobs on my Linux machine. The machine randomly becomes unresponsive and I have to power cycle it in order to get it back. The power cycle obviously purges the issue but also any evidence of what caused it. I only have this issue when running Rosetta jobs. Has anyone else had this issue? I would appreciate some help." Possible memory/swap issues? Maybe the machine is starting to use a good amount of swap space? How much memory does the machine have? Charlie -Charlie |
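To check the swap theory, one option is to sample `/proc/meminfo` while the tasks are running. The sketch below only parses text in that file's format; the helper names `parse_meminfo` and `swap_used_fraction` are made up for illustration.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of integer values (mostly kB)."""
    values = {}
    for line in text.splitlines():
        key, _, rest = line.partition(":")
        fields = rest.split()
        if fields:
            values[key.strip()] = int(fields[0])
    return values


def swap_used_fraction(info):
    """Fraction of swap in use, or 0.0 when no swap is configured."""
    total = info.get("SwapTotal", 0)
    if total == 0:
        return 0.0
    return (total - info.get("SwapFree", 0)) / total


# Possible usage on a Linux host:
#   with open("/proc/meminfo") as f:
#       info = parse_meminfo(f.read())
#   print(info["MemAvailable"], swap_used_fraction(info))
```

A swap fraction that climbs steadily before each hang would support the memory explanation; a flat one points elsewhere.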
Charles Dennett Joined: 27 Sep 05 Posts: 102 Credit: 2,074,852 RAC: 207 |
Excuse the bogus signature. Back crunching after being away for a few years. Fixed it in my profile. Now to go fix it in my forum signature. <edit>Fixed</edit> |
Buckeye4lf Joined: 29 Aug 08 Posts: 43 Credit: 8,512,904 RAC: 1,994 |
Quote: "I am having issues with Rosetta jobs on my Linux machine. The machine randomly becomes unresponsive and I have to power cycle it in order to get it back. The power cycle obviously purges the issue but also any evidence of what caused it. I only have this issue when running Rosetta jobs. Has anyone else had this issue? I would appreciate some help." The machine has 128 GB of RAM and nothing other than BOINC is running when it hangs. I did reduce the default file swap from 75% to 50% this morning though. I will not know if the machine has hung until I get home from work. No other project is having issues with the current BOINC settings though... |
Mod.Sense Volunteer moderator Joined: 22 Aug 06 Posts: 4018 Credit: 0 RAC: 0 |
I see your Linux machine shows that it has 64 processors and 128GB of memory, and is running: Linux LinuxMint Linux Mint 19.3 Tricia [5.3.0-42-generic|libc 2.27 (Ubuntu GLIBC 2.27-3ubuntu1)]. Is the machine running a mix of BOINC projects? Is it running other types of work as well? With that many tasks running, it would be possible that one got to a point where it was using excessive memory. But I believe the BOINC core monitors that and insulates the rest of the system by making the task wait for memory or ending it. Just looking at a few of the failed tasks, their peak memory was about 1.2 GB. Hang conditions are always difficult. Have you seen this happen a few times? Is BOINC allowed to use most of that memory (CPU preferences)? What about the disk? Is BOINC allowed to use plenty of disk space (say 2GB per task)? I can only suggest using the settings to run on less than 100 percent of your CPUs and see if this helps. Rosetta Moderator: Mod.Sense |
Buckeye4lf Joined: 29 Aug 08 Posts: 43 Credit: 8,512,904 RAC: 1,994 |
Quote: "I see your Linux machine shows that it has 64 processors, and 128GB of memory, and is running:" It is running a mix of projects, but currently the machine is only loaded with Seti (GPU) and Rosetta (CPU). The hang occurs at least once a day and sometimes again within minutes of rebooting. The machine has 128GB of RAM. I currently have BOINC set to use 50% swap and up to 50GB of HD space; the HD itself is 2TB, so space should be no issue. I currently have Computation set at 85% for 90% of the time. In the past, when I ran Seti only, I set 90% and 100% time with no issues. You mentioned 2GB per job; should I increase the HD space beyond the 50GB already established? |
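The figures mentioned so far can be turned into a rough budget. This is only back-of-the-envelope arithmetic: it assumes every one of the 64 threads runs a Rosetta task, takes the roughly 1.2 GB peak seen on a few failed tasks as typical, and uses the moderator's "say 2GB per task" for disk. `budget` is a hypothetical helper.

```python
def budget(tasks, ram_per_task_gb, disk_per_task_gb):
    """Return (total RAM, total disk) in GB for a number of concurrent tasks."""
    return tasks * ram_per_task_gb, tasks * disk_per_task_gb


# Assumption: all 64 threads busy with Rosetta at the quoted per-task figures.
ram_gb, disk_gb = budget(tasks=64, ram_per_task_gb=1.2, disk_per_task_gb=2.0)
print(f"RAM needed:  about {ram_gb:.1f} GB of 128 GB installed")
print(f"Disk needed: about {disk_gb:.0f} GB against a 50 GB BOINC limit")
```

Under those assumptions RAM looks comfortable, while 64 tasks at 2 GB each would want more disk than the 50 GB allowance, so raising the disk limit seems a reasonable first change.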
dcdc Joined: 3 Nov 05 Posts: 1831 Credit: 119,488,239 RAC: 11,585 |
Are you sure it's not overheating? Rosetta might push the FPU or RAM harder than other projects. Can the machine handle a stress test like P95? D |
Buckeye4lf Joined: 29 Aug 08 Posts: 43 Credit: 8,512,904 RAC: 1,994 |
Quote: "Are you sure it's not overheating? Rosetta might push the FPU or RAM harder than other projects. Can the machine handle a stress test like P95?" Unsure. It is liquid cooled, but I am not sure how to test this theory. I can back off the percentages. |
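One way to test the overheating theory is to log temperatures while Rosetta (or a stress test such as the P95 run dcdc mentions) is loading the machine. The sketch below reads the kernel's hwmon sysfs files, which report millidegrees Celsius. Which chips and sensors show up is entirely machine dependent, and `read_temps` is a hypothetical helper, so treat this as a starting point rather than a finished monitor.

```python
from pathlib import Path


def read_temps(root="/sys/class/hwmon"):
    """Return {"hwmonN:<chip>/<sensor file>": degrees C} for each hwmon temperature input."""
    base = Path(root)
    temps = {}
    if not base.is_dir():
        return temps  # no hwmon tree here (for example, not a Linux host)
    for chip in sorted(base.glob("hwmon*")):
        name_file = chip / "name"
        chip_name = name_file.read_text().strip() if name_file.exists() else "unknown"
        for sensor in sorted(chip.glob("temp*_input")):
            try:
                # hwmon reports temperatures in millidegrees Celsius
                value = int(sensor.read_text()) / 1000.0
            except (OSError, ValueError):
                continue  # some sensors refuse to read; skip them
            temps[f"{chip.name}:{chip_name}/{sensor.name}"] = value
    return temps


for label, celsius in read_temps().items():
    print(f"{label}: {celsius:.1f} C")
```

Running this every few seconds and writing the output to a file would leave a record that survives the power cycle, which is the piece of evidence missing so far.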
Buckeye4lf Joined: 29 Aug 08 Posts: 43 Credit: 8,512,904 RAC: 1,994 |
I have not had the issue since yesterday. I was not a good engineer and changed numerous things all at once: I doubled the HD space available to 100GB; I decreased CPU utilization to 75% from 100%; I decreased CPU count to 75% from 90%; I reduced the file swap from 75% to 50%; and I suspended all Seti jobs from running on GPUs (I do not think this matters though). If I still do not see any issues by tomorrow, I will start to increase the CPU percentages, as I have been running between 90-100% on other projects. I am leaving quite a bit of computation power on the table by only running at 75%. Thanks everyone for your suggestions! |
Sid Celery Joined: 11 Feb 08 Posts: 2115 Credit: 41,107,172 RAC: 21,100 |
Quote: "I have not had the issue since yesterday, I was not a good engineer and changed numerous things all at once:" As a generalisation (because I'm no expert on this), whether things go well or not, increase CPU utilisation back to 100%. Having it lower, e.g. 75%, turns out to mean it runs at 100% for 75% of the time and 0% for 25% of the time, which isn't what you might expect. All that switching on and off can't help, so 100% utilisation might even remove a problem. If that works, look to increasing CPU count next. The other three changes look reasonable and are better choices already. |
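The on/off behaviour described here can be pictured with a toy duty-cycle pattern. This is purely illustrative and is not BOINC's actual scheduler; the slot length and the `duty_cycle` helper are hypothetical.

```python
def duty_cycle(percent, slots=20):
    """Return a run/pause pattern ("R" or "-") whose average load is `percent`."""
    pattern = []
    debt = 0.0  # accumulated run time owed to the task
    for _ in range(slots):
        debt += percent / 100.0
        if debt >= 1.0 - 1e-9:
            pattern.append("R")  # full speed for this slot
            debt -= 1.0
        else:
            pattern.append("-")  # task suspended for this slot
    return "".join(pattern)


print(duty_cycle(75))   # three of every four slots run at full load
print(duty_cycle(100))  # never pauses
```

The point of the picture is that a 75% setting never runs the cores at three-quarter speed; it alternates between full load and idle, so the hardware still sees peak load while a task is running.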
Buckeye4lf Joined: 29 Aug 08 Posts: 43 Credit: 8,512,904 RAC: 1,994 |
well crap, even with those changes I just hung my computer....... |
Buckeye4lf Joined: 29 Aug 08 Posts: 43 Credit: 8,512,904 RAC: 1,994 |
This issue is clearly a Rosetta one. I can run the current settings on other projects with no issues. If I remove the Rosetta jobs, the computer does not hang at all. |
dcdc Joined: 3 Nov 05 Posts: 1831 Credit: 119,488,239 RAC: 11,585 |
Does sound strange. I would try moving the BOINC data directory to another drive - can you do that? |
Mod.Sense Volunteer moderator Joined: 22 Aug 06 Posts: 4018 Credit: 0 RAC: 0 |
With 64 processor cores, how many threads is BOINC trying to run? Is it a hyperthreaded CPU? This would cause BOINC to attempt 128 active tasks, which would then make the 128GB of memory rather tight. (Actually, from a quick search, it looks like there are 32 physical cores, hyperthreaded to 64 active threads.) I would suggest bumping CPU utilization back to 100% as Sid Celery suggests (we've seen odd issues with <100% in the past), and dialing back the CPU count %. Maybe start at 50% and work your way up. Have you run any stress tests on the machine? CPU or memory tests? Sometimes R@h ends up being the first stress test a machine has seen. Also, have you checked for any updates to your Linux version? I'm not seeing others reporting hangs like this. So, what else could be unique about your machine (besides that it is such a BEAST of a machine! :)? Rosetta Moderator: Mod.Sense |
Buckeye4lf Joined: 29 Aug 08 Posts: 43 Credit: 8,512,904 RAC: 1,994 |
Quote: "With 64 processor cores, how many threads is BOINC trying to run? Is it a hyperthreaded CPU? This would cause BOINC to attempt 128 active tasks, which would then make the 128GB of memory rather tight. (actually a quick search, it looks like there are 32 physical cores, hyperthreaded to 64 active threads)." You are correct, it has 32 cores that are dual threaded, so I can run 64 CPU jobs at the same time. I have bumped CPU utilization to 100% and reduced CPU count to 50%. I am running the most recent Linux Mint; I thought about doing a reinstall but have not gone that far yet. I don't seem to have issues with other BOINC projects hanging, so I am not sure why Rosetta would be any different. I have not done any memory/stress tests. |
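The physical-versus-logical core question can also be answered from the machine itself rather than a web search. The sketch below counts both from text in the `/proc/cpuinfo` format used on x86 Linux, relying on the `processor`, `physical id` and `core id` fields; `count_cores` is a hypothetical helper name.

```python
def count_cores(cpuinfo_text):
    """Return (physical cores, logical CPUs) from /proc/cpuinfo-style text."""
    logical = 0
    cores = set()
    # /proc/cpuinfo separates one logical CPU from the next with a blank line.
    for block in cpuinfo_text.strip().split("\n\n"):
        fields = {}
        for line in block.splitlines():
            key, sep, value = line.partition(":")
            if sep:
                fields[key.strip()] = value.strip()
        if "processor" in fields:
            logical += 1
            # Hyperthread siblings share the same (package, core) pair.
            cores.add((fields.get("physical id"), fields.get("core id")))
    return len(cores), logical


# Possible usage on a Linux host:
#   with open("/proc/cpuinfo") as f:
#       print(count_cores(f.read()))
```

On the machine described in this thread one would expect 32 physical cores and 64 logical CPUs, which is the count BOINC's CPU percentage setting is applied to.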
Buckeye4lf Joined: 29 Aug 08 Posts: 43 Credit: 8,512,904 RAC: 1,994 |
Machine has not hung in last 24 hours. I backed off the number of CPU jobs to 70% instead of the 90% I had been running on other projects. Maybe I was just on the edge of unstable before and Rosetta was the project where I was getting issues. It seems to be more stable now, just less throughput. Has Rosetta ever considered GPU jobs? |
Grant (SSSF) Joined: 28 Mar 20 Posts: 1670 Credit: 17,519,240 RAC: 23,823 |
Quote: "I have bumped CPU utilization to 100% and reduced CPU count to 50%. I am running the most recent Linux Mint; I thought about doing a reinstall but have not gone that far yet. I don't seem to have issues with other BOINC projects hanging, so I am not sure why Rosetta would be any different. I have not done any memory/stress tests." The main difference between projects would appear to be memory usage. Even running all the others, you're not likely to be using much RAM. Running Rosetta, with all those threads & cores, along with other projects will result in RAM being used that probably doesn't normally get touched. Hence the system lockups. I'd suggest a thorough memtest session (you may or may not have a copy of memtest86+ with your distro). The other option is swapping RAM modules. Check exactly how much RAM is being used, then limit the number of Rosetta jobs so RAM in use (with other projects running) is just under the limit for only 2 modules on the motherboard; make sure you have them in the appropriate slots to maintain at least dual channel operation. Pull all other modules. Let the system run & see if there are any issues (how long does it usually take for a problem to occur?). If no problems, pull those modules and add others that have been removed. Run again. Do it till you get a failure. If no failure, add more modules and bump up the number of Rosetta jobs to near the memory limit. See how it goes. Repeat. Or run that intensive memtest session (although it could take most of a day for that amount of RAM). I notice you have several Win10 systems. If all are DDR4 and the same size, pull modules from the Win10 systems to put in the Threadripper, and put the Threadripper modules in the Win10 systems. Even if they don't error out, Win10 comes with its own memory tester, so you could use those systems to do the memory tests. Just keep track of which modules are where... Grant Darwin NT |
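For the module-swapping approach, the number of Rosetta jobs to allow in each round follows from simple arithmetic. The sketch below is hedged throughout: the 16 GB module size is a guess (the thread only states 128 GB in total), the 1.2 GB per task comes from the peak figure the moderator quoted, the reserve for the OS and other projects is arbitrary, and `max_tasks` is a hypothetical helper.

```python
import math


def max_tasks(modules, module_gb, per_task_gb, reserve_gb=2.0):
    """Largest task count that fits in the installed RAM minus a reserve."""
    usable_gb = modules * module_gb - reserve_gb
    return max(0, math.floor(usable_gb / per_task_gb))


# Assumed values: 2 modules of 16 GB each, 1.2 GB per task, 4 GB held back.
print(max_tasks(modules=2, module_gb=16, per_task_gb=1.2, reserve_gb=4.0))
```

Re-running the calculation each time modules are added keeps the load close to the installed memory, which is what exercises the modules under test.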
Mod.Sense Volunteer moderator Joined: 22 Aug 06 Posts: 4018 Credit: 0 RAC: 0 |
Quote: "Has Rosetta ever considered GPU jobs?" Yes, and it ends up becoming a rather contentious discussion. One of the developers did offer some insight last week. There are several older threads elsewhere about the topic as well. The bottom line is that GPUs are fantastic at doing lots of things, but many GPU enthusiasts do not understand that they are not general-purpose processors, or the coding effort required to get from one platform to the other. And many people have tried to follow up with me to further persuade me of the merits of GPU. Rest assured it does no good. My personal limited understanding of GPU, and of the coding effort required to migrate, does not affect the project at all. I am not on the Development Team; I am just an at-home moderator. The perspective, directly from a developer, is expressed well in the thread linked above. Rosetta Moderator: Mod.Sense |
©2024 University of Washington
https://www.bakerlab.org