Jwkeller - 11:59:38 AM PST - Thu, Feb 14th 2013
We are running NWChem 6.1.1 (Jan 2012) on CentOS-6.3-x86_64 using openmpi 1.5.4 on a 16-core node. I notice that after a multi-processor job is killed, the system still shows CPU activity and %idle does not return to 100% as it does after a completed job, according to the output of a sar command. These cores are then not available for subsequent multi-processor jobs. The only way I know to reclaim them is to restart the node.
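For reference, a minimal sar invocation along these lines shows the %idle figure described above; the sampling interval and count are arbitrary choices here, not taken from the post:

sar -u 1 5        # overall CPU utilization (including %idle), sampled every second, 5 samples
sar -P ALL 1 5    # the same, broken out per core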

Edoapra - 5:12:29 PM PST - Thu, Feb 14th 2013
What happens if you try to kill the left-over processes? Are they in "D" state?

Jwkeller - 10:07:34 PM PST - Thu, Feb 14th 2013
Running processes after job stopped
Quote (Edoapra, Feb 14th 4:12 pm): What happens if you try to kill the left-over processes? Are they in "D" state?
They are in the "R" state. Here is the "ps aux" output (keller6 is the NWChem user).
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
keller6 6001 99.8 0.2 1262432 140092 ? R 19:25 29:12 /usr/local/nwchem/bin/nwchem input.inp
keller6 6002 99.9 0.1 1261552 125244 ? R 19:25 29:13 /usr/local/nwchem/bin/nwchem input.inp
keller6 6003 99.9 0.1 1261540 128448 ? R 19:25 29:13 /usr/local/nwchem/bin/nwchem input.inp
keller6 6004 99.9 0.1 1261540 122332 ? R 19:25 29:13 /usr/local/nwchem/bin/nwchem input.inp
I can kill the process with "kill -9 6001"
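As a small convenience, the leftover ranks can also be listed and removed in one pass from the shell; the bracketed grep pattern keeps the grep process itself out of the listing, and the PIDs are the ones from the output above:

ps aux | grep '[n]wchem'        # list only the leftover nwchem processes
kill -9 6001 6002 6003 6004     # kill accepts several PIDs in one command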

Jwkeller - 2:55:04 AM PST - Fri, Feb 15th 2013
Quote (Jwkeller, Feb 14th 10:59 am): We are running NWChem 6.1.1 (Jan 2012) on CentOS-6.3-x86_64 using openmpi 1.5.4 on a 16-core node. I notice that after a multi-processor job is killed, the system still shows CPU activity and %idle does not return to 100% as it does after a completed job, according to the output of a sar command. These cores are then not available for subsequent multi-processor jobs. The only way I know to reclaim them is to restart the node.
Sorry, it's the NWChem 6.1.1 (Jan 2013) version, not 2012. JK

Edoapra - 12:53:33 PM PST - Fri, Feb 15th 2013
Quote (Jwkeller, Feb 14th 9:07 pm):
Quote (Edoapra, Feb 14th 4:12 pm): What happens if you try to kill the left-over processes? Are they in "D" state?
They are in the "R" state. Here is the "ps aux" output (keller6 is the NWChem user).
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
keller6 6001 99.8 0.2 1262432 140092 ? R 19:25 29:12 /usr/local/nwchem/bin/nwchem input.inp
keller6 6002 99.9 0.1 1261552 125244 ? R 19:25 29:13 /usr/local/nwchem/bin/nwchem input.inp
keller6 6003 99.9 0.1 1261540 128448 ? R 19:25 29:13 /usr/local/nwchem/bin/nwchem input.inp
keller6 6004 99.9 0.1 1261540 122332 ? R 19:25 29:13 /usr/local/nwchem/bin/nwchem input.inp
I can kill the process with "kill -9 6001"
Do you still need to reboot the cluster nodes after having manually killed the leftover processes?
Edo

Jwkeller - 5:55:37 PM PST - Fri, Feb 15th 2013
Quote (Edoapra): Do you still need to reboot the cluster nodes after having manually killed the leftover processes?
No - I tried several runs, and I can indeed recover full functionality of the node by finding the process numbers and then issuing "kill -9 [process number]" n times. Hopefully this can be done automatically, or prevented in the first place.
John K.
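A minimal sketch of that kind of automation, assuming the leftover processes run under the submitting user's account (pgrep and pkill are part of the standard procps tools on CentOS):

for pid in $(pgrep -u "$USER" nwchem); do   # find every nwchem process owned by this user
    kill -9 "$pid"                          # and force-kill it
done
# or, equivalently, in a single command:
pkill -9 -u "$USER" nwchem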

Edoapra - 10:22:48 AM PST - Mon, Feb 18th 2013
Quote (Jwkeller, Feb 15th 4:55 pm):
Quote (Edoapra): Do you still need to reboot the cluster nodes after having manually killed the leftover processes?
No - I tried several runs, and I can indeed recover full functionality of the node by finding the process numbers and then issuing "kill -9 [process number]" n times. Hopefully this can be done automatically, or prevented in the first place.
John K.
John,
Using killall you should be able to kill all the nwchem-associated processes. The command is
killall -9 nwchem
Another possibility is to use the OpenMPI orte-clean utility (a.k.a. ompi-clean):
http://www.open-mpi.org/doc/v1.4/man1/ompi-clean.1.php
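For example, both steps could be combined into a small node-cleanup script run after a killed job; this is only a sketch (whether it is run by hand or from a scheduler epilog is a site choice, and orte-clean is invoked here without options):

#!/bin/sh
# sketch: clean up a node after a killed NWChem/Open MPI job
killall -9 nwchem    # remove any nwchem ranks that survived the job kill
orte-clean           # ask Open MPI to clean up stale session files and daemons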

Jwkeller - 10:54:05 PM PST - Sat, Mar 2nd 2013
Thanks Edo - This is a problem in WebMO 12.1, which should insert a "scratch_dir" line when it creates the NWChem input file. Currently it dumps all the aoints and grid files into one directory and tries to copy them back to the user's directory on the WebMO server, rather than putting them in a separate directory that is deleted after the job finishes. Apparently this is fixed in v13 of WebMO Pro. But it is fairly easy to insert this line manually.
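For illustration, the directive in question is a single top-level line in the NWChem input file; the path below is only a placeholder and should point to scratch space that actually exists on the compute node:

scratch_dir /scratch/nwchem    # node-local directory for temporary files (aoints, grid files, ...)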