Jwkeller - 11:59:38 AM PST - Thu, Feb 14th 2013
We are running NWChem 6.1.1 (Jan 2012) on CentOS-6.3-x86_64 using openmpi 1.5.4 on a 16-core node. I notice that after a multi-processor job is killed, the system still shows CPU activity and %idle does not return to 100% as it does after a completed job, according to the output of a sar command. These cores are then not available for subsequent multi-processor jobs. The only way I know to reclaim them is to restart the node.
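For reference, a minimal sar invocation along these lines shows the %idle figure described above; the sampling interval and count are arbitrary choices here, not taken from the post:

sar -u 1 5        # overall CPU utilization (including %idle), sampled every second, 5 samples
sar -P ALL 1 5    # the same, broken out per core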

Edoapra - 5:12:29 PM PST - Thu, Feb 14th 2013
What happens if you try to kill the left-over processes? Are they in "D" state?

Jwkeller - 10:07:34 PM PST - Thu, Feb 14th 2013
Running processes after job stopped
Quote (Edoapra, Feb 14th 4:12 pm): What happens if you try to kill the left-over processes? Are they in "D" state?
They are in the "R" state. Here is the "ps aux" output (keller6 is the NWChem user).
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
keller6 6001 99.8 0.2 1262432 140092 ? R 19:25 29:12 /usr/local/nwchem/bin/nwchem input.inp
keller6 6002 99.9 0.1 1261552 125244 ? R 19:25 29:13 /usr/local/nwchem/bin/nwchem input.inp
keller6 6003 99.9 0.1 1261540 128448 ? R 19:25 29:13 /usr/local/nwchem/bin/nwchem input.inp
keller6 6004 99.9 0.1 1261540 122332 ? R 19:25 29:13 /usr/local/nwchem/bin/nwchem input.inp
I can kill the process with "kill -9 6001"
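As a small convenience, the leftover ranks can also be listed and removed in one pass from the shell; the bracketed grep pattern keeps the grep process itself out of the listing, and the PIDs are the ones from the output above:

ps aux | grep '[n]wchem'        # list only the leftover nwchem processes
kill -9 6001 6002 6003 6004     # kill accepts several PIDs in one command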

Jwkeller - 2:55:04 AM PST - Fri, Feb 15th 2013
Quote (Jwkeller, Feb 14th 10:59 am): We are running NWChem 6.1.1 (Jan 2012) on CentOS-6.3-x86_64 using openmpi 1.5.4 on a 16-core node. I notice that after a multi-processor job is killed, the system still shows CPU activity and %idle does not return to 100% as it does after a completed job, according to the output of a sar command. These cores are then not available for subsequent multi-processor jobs. The only way I know to reclaim them is to restart the node.
Sorry, it's the NWChem 6.1.1 (Jan 2013) version, not 2012. JK

Edoapra - 12:53:33 PM PST - Fri, Feb 15th 2013
Quote (Jwkeller, Feb 14th 9:07 pm):
Quote (Edoapra, Feb 14th 4:12 pm): What happens if you try to kill the left-over processes? Are they in "D" state?
They are in the "R" state. Here is the "ps aux" output (keller6 is the NWChem user).
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
keller6 6001 99.8 0.2 1262432 140092 ? R 19:25 29:12 /usr/local/nwchem/bin/nwchem input.inp
keller6 6002 99.9 0.1 1261552 125244 ? R 19:25 29:13 /usr/local/nwchem/bin/nwchem input.inp
keller6 6003 99.9 0.1 1261540 128448 ? R 19:25 29:13 /usr/local/nwchem/bin/nwchem input.inp
keller6 6004 99.9 0.1 1261540 122332 ? R 19:25 29:13 /usr/local/nwchem/bin/nwchem input.inp
I can kill the process with "kill -9 6001"
Do you still need to reboot the cluster nodes after having manually killed the leftover processes?
Edo

Jwkeller - 5:55:37 PM PST - Fri, Feb 15th 2013
Quote (Edoapra): Do you still need to reboot the cluster nodes after having manually killed the leftover processes?
No - I tried several runs, and I can indeed recover full functionality of the node by finding the process numbers and then issuing "kill -9 [process number]" n times. Hopefully this can be done automatically, or prevented in the first place.
John K.
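A minimal sketch of that kind of automation, assuming the leftover processes run under the submitting user's account (pgrep and pkill are part of the standard procps tools on CentOS):

for pid in $(pgrep -u "$USER" nwchem); do   # find every nwchem process owned by this user
    kill -9 "$pid"                          # and force-kill it
done
# or, equivalently, in a single command:
pkill -9 -u "$USER" nwchem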

Edoapra - 10:22:48 AM PST - Mon, Feb 18th 2013
Quote (Jwkeller, Feb 15th 4:55 pm):
Quote (Edoapra): Do you still need to reboot the cluster nodes after having manually killed the leftover processes?
No - I tried several runs, and I can indeed recover full functionality of the node by finding the process numbers and then issuing "kill -9 [process number]" n times. Hopefully this can be done automatically, or prevented in the first place.
John K.
John,
Using killall you should be able to kill all the nwchem-associated processes. The command is
killall -9 nwchem
Another possibility is to use the OpenMPI orte-clean utility (a.k.a. ompi-clean):
http://www.open-mpi.org/doc/v1.4/man1/ompi-clean.1.php
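For example, both steps could be combined into a small node-cleanup script run after a killed job; this is only a sketch (whether it is run by hand or from a scheduler epilog is a site choice, and orte-clean is invoked here without options):

#!/bin/sh
# sketch: clean up a node after a killed NWChem/Open MPI job
killall -9 nwchem    # remove any nwchem ranks that survived the job kill
orte-clean           # ask Open MPI to clean up stale session files and daemons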

Jwkeller - 10:54:05 PM PST - Sat, Mar 2nd 2013
Thanks Edo - This is a problem in WebMO 12.1, which should insert a "scratch_dir" line when it creates the NWChem input file. Currently it dumps all the aoints and grid files into one directory and tries to copy them back to the user's directory on the WebMO server, rather than putting them in a separate directory that is deleted after the job finishes. Apparently this is fixed in v13 of WebMO Pro. But it is fairly easy to insert this line manually.
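For illustration, the directive in question is a single top-level line in the NWChem input file; the path below is only a placeholder and should point to scratch space that actually exists on the compute node:

scratch_dir /scratch/nwchem    # node-local directory for temporary files (aoints, grid files, ...)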