Errors Running NWChem 6.3
Viewed 4541 times, with a total of 8 posts
Tpirojsi (Clicked A Few Times)
Threads 13, Posts 35
5:59:12 PM PDT - Wed, Oct 2nd 2013
Hi, I have recently installed NWChem 6.3 on a 720-core/1440-GB 64-bit Xeon cluster from Dell (60 nodes, each with two six-core CPUs and 24 GB of memory) with an InfiniBand network fabric. I used the following configuration to compile the code.
export LARGE_FILES=TRUE
export USE_NOFSCHECK=TRUE
export LIB_DEFINES="-DDFLT_TOT_MEM=16777216"
export TCGRSH=/usr/bin/ssh
export NWCHEM_TOP=/home/tpirojsi/nwchem-test-6.3
export NWCHEM_TARGET=LINUX64
export ARMCI_NETWORK=OPENIB
export IB_HOME=/usr
export IB_INCLUDE=$IB_HOME/include
export IB_LIB=$IB_HOME/lib64
export IB_LIB_NAME="-lrdmacm -libumad -libverbs -lpthread -lrt"
export MSG_COMMS=MPI
export CC=gcc
export FC=gfortran
export NWCHEM_MODULES="all"
export USE_MPI=y
export USE_MPIF=y
export USE_MPIF4=y
export MPI_LOC=/home/tpirojsi/MPI
export MPI_LIB=$MPI_LOC/lib
export MPI_INCLUDE=$MPI_LOC/include
export LIBMPI="-lmpi_f90 -lmpi_f77 -lmpi -ldl -Wl,--export-dynamic -lnsl -lutil"
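(For reference, after setting these variables the build itself was driven roughly as follows; this is only a sketch of the standard NWChem build steps, and the exact targets are described in the NWChem compilation notes.)
cd $NWCHEM_TOP/src
make nwchem_config NWCHEM_MODULES=all    # generate the configuration for the selected modules
make FC=gfortran CC=gcc                  # compile with the GNU compilers set above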
The code compiled and installed successfully. On a single node (12 CPUs) it runs without problems. However, when I run it on more than one node through an SGE-based submission script (a sketch of the script is at the end of this post), I get the following errors most of the time.
qrsh_starter: cannot change to directory /home/tpirojsi/test: No such file or directory
qrsh_starter: cannot change to directory /home/tpirojsi/test: No such file or directory
A daemon (pid 25141) died unexpectedly with status 1 while attempting
to launch so we are aborting.
There may be more information reported by the environment (see above).
This may be because the daemon was unable to find all the needed shared
libraries on the remote node. You may set your LD_LIBRARY_PATH to have the
location of the shared libraries on the remote nodes and this will
automatically be forwarded to the remote nodes.
OR
--------------------------------------------------------------------------
MPI_ABORT was invoked on rank 11 in communicator MPI COMMUNICATOR 4 DUP FROM 0
with errorcode 15.
NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.
mpirun noticed that process rank 0 with PID 24769 on node compute-0-37.local exited on signal 9 (Killed).
These are the two errors that occur most often; there have been only a few times I could successfully run a job on more than one node. I know the second error usually relates to communication between nodes, but I have no idea why it happens so frequently; it always appears right at the beginning, just after the job has been launched. Could this have something to do with the cluster itself, or with the way I compiled the code?
As for the first error, I have searched for answers online but still have not found a solution. The working directory did exist while the job was running, so I don't know why the error reports otherwise.
Also, is it possible that the code aborts when there is not enough memory/swap, for example because child processes from a previous job were never killed? I have seen the swap space on many nodes completely filled (last column below) even though the nodes were not in use.
HOSTNAME ARCH NCPU LOAD MEMTOT MEMUSE SWAPTO SWAPUS
global - - - - - - -
compute-0-0 lx26-amd64 12 0.04 23.6G 415.4M 1000.0M 0.0
compute-0-1 lx26-amd64 12 0.00 23.6G 412.6M 1000.0M 0.0
compute-0-10 lx26-amd64 12 0.00 23.6G 431.3M 996.2M 996.2M
compute-0-11 lx26-amd64 12 0.00 23.6G 538.2M 996.2M 0.0
compute-0-12 lx26-amd64 12 0.01 23.6G 524.1M 996.2M 0.0
compute-0-13 lx26-amd64 12 0.03 23.6G 416.2M 996.2M 0.0
compute-0-14 lx26-amd64 12 0.00 23.6G 426.4M 996.2M 996.2M
compute-0-15 lx26-amd64 12 0.00 23.6G 430.4M 996.2M 996.1M
compute-0-16 lx26-amd64 12 0.00 23.6G 494.1M 996.2M 0.0
compute-0-17 lx26-amd64 12 0.00 23.6G 425.0M 996.2M 0.0
compute-0-18 lx26-amd64 12 0.00 23.6G 477.4M 996.2M 995.9M
compute-0-19 lx26-amd64 12 0.00 23.6G 474.2M 996.2M 0.0
compute-0-2 lx26-amd64 12 0.01 23.6G 485.3M 996.2M 995.9M
If that is the case, is there any way to clear the memory/swap or reboot the nodes before starting a new job?
Any advice is greatly appreciated.
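For completeness, the submission script is roughly of the following form (a sketch with placeholder names; the parallel environment name, input file, and paths are site-specific):
#!/bin/bash
#$ -N nwchem-test              # job name (placeholder)
#$ -pe mpi 24                  # e.g. two 12-core nodes; the PE name is site-specific
#$ -cwd                        # start the job in the submission directory
#$ -V                          # export the submission environment to the job
cd $SGE_O_WORKDIR || exit 1    # make sure the working directory is reachable on the node
export LD_LIBRARY_PATH=$MPI_LOC/lib:$LD_LIBRARY_PATH
mpirun -np $NSLOTS $NWCHEM_TOP/bin/LINUX64/nwchem test.nw > test.out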
Edited On 6:18:18 PM PDT - Wed, Oct 2nd 2013 by Tpirojsi
Edoapra (Forum Admin, Forum Mod, bureaucrat, sysop; Forum Vet)
Threads 9, Posts 1461
3:20:16 PM PDT - Mon, Oct 7th 2013
Shared memory segments
Tee,
Your memory might be taken up by shared memory segments allocated by the failed NWChem runs.
You can check their existence with the command
ipcs -a
There is a script that can clean up all of these leftover segments. The script is shipped with every NWChem source tree as
$NWCHEM_TOP/src/tools/global/testing/ipcreset
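(If that script is not at hand, a minimal hand-rolled cleanup using the standard ipcs/ipcrm utilities looks roughly like this; it assumes the leftover segments are owned by your own user.)
# Remove shared memory segments and semaphore arrays owned by the current user.
for id in $(ipcs -m | awk -v u="$USER" '$3 == u {print $2}'); do ipcrm -m "$id"; done
for id in $(ipcs -s | awk -v u="$USER" '$3 == u {print $2}'); do ipcrm -s "$id"; done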
Cheers, Edo
Tpirojsi (Clicked A Few Times)
Threads 13, Posts 35
4:23:42 PM PDT - Mon, Oct 7th 2013
Quote: Edoapra, Oct 7th 10:20 pm
Tee,
Your memory might be taken up by shared memory segments allocated by the failed NWChem runs.
You can check their existence with the command
ipcs -a
There is a script that can clean up all of these leftover segments. The script is shipped with every NWChem source tree as
$NWCHEM_TOP/src/tools/global/testing/ipcreset
Cheers, Edo
Hi Edo,
Thank you for your reply. So far the only problem I still see is the qrsh_starter one, and I still don't know how it arises; it seems to occur randomly. I tried your suggestion, and this is what I got:
------ Shared Memory Segments --------
key shmid owner perms bytes nattch status
------ Semaphore Arrays --------
key semid owner perms nsems
------ Message Queues --------
key msqid owner perms used-bytes messages
So it doesn't seem that the memory is being used by failed NWChem runs, right?
Edoapra (Forum Admin, Forum Mod, bureaucrat, sysop; Forum Vet)
Threads 9, Posts 1461
5:22:42 PM PDT - Mon, Oct 7th 2013
Have you checked if the NWChem processes have been properly terminated?
Tpirojsi (Clicked A Few Times)
Threads 13, Posts 35
6:09:39 PM PDT - Mon, Oct 7th 2013
Quote: Edoapra, Oct 8th 12:22 am
Have you checked if the NWChem processes have been properly terminated?
Edo, when I ran 'qstat' I didn't see any processes running, so I think the job has been terminated. Is that what you meant? However, even while the job was running, the 'ipcs -a' command showed nothing either. Does that seem strange to you?
job-ID prior name user state submit/start at queue slots ja-task-ID
72312 0.50500 neb-0 tpirojsi r 10/07/2013 18:06:23 batch.q@compute-0-41.local 12
[tpirojsi@ccom-boom test]$ ipcs -a
------ Shared Memory Segments --------
key shmid owner perms bytes nattch status
------ Semaphore Arrays --------
key semid owner perms nsems
------ Message Queues --------
key msqid owner perms used-bytes messages
Edoapra (Forum Admin, Forum Mod, bureaucrat, sysop; Forum Vet)
Threads 9, Posts 1461
9:44:50 AM PDT - Tue, Oct 8th 2013
Set number of threads to 1
Did you set the environment variables OMP_NUM_THREADS and GOTO_NUM_THREADS to 1?
http://www.tacc.utexas.edu/tacc-projects/gotoblas2/faq
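In the job script that would be, for example (a bash sketch; the mpirun line is only a placeholder):
export OMP_NUM_THREADS=1     # keep OpenMP/BLAS from spawning extra threads per MPI process
export GOTO_NUM_THREADS=1    # same for GotoBLAS2
mpirun -np $NSLOTS nwchem test.nw > test.out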
Edoapra (Forum Admin, Forum Mod, bureaucrat, sysop; Forum Vet)
Threads 9, Posts 1461
9:46:11 AM PDT - Tue, Oct 8th 2013
Tee,
qstat does not check the running processes on the compute nodes.
In order to check the status of running processes (and for the ipcs output as well), you need to log in to the compute nodes (using ssh, for example).
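For example, from the head node (the node names below are only placeholders):
# Look for leftover nwchem processes and shared memory segments on each compute node.
for node in compute-0-37 compute-0-41; do
  echo "=== $node ==="
  ssh "$node" "ps -u $USER -o pid,cmd | grep '[n]wchem'; ipcs -m"
done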
Tpirojsi (Clicked A Few Times)
Threads 13, Posts 35
11:28:05 AM PDT - Tue, Oct 8th 2013
Quote: Edoapra, Oct 8th 4:46 pm
Tee,
qstat does not check the running processes on the compute nodes.
In order to check the status of running processes (and for the ipcs output as well), you need to log in to the compute nodes (using ssh, for example).
Oh! Thank you for shedding some light on this for me. You are absolutely correct. I logged in to the compute nodes, ran the 'ipcs -a' command, and did see some leftover shared memory segments. I used the tool provided in the NWChem package to clean them up, and the swap space came back to normal. It seems to have solved the qrsh_starter problem too!
I really appreciate your help.
Tee
Edited On 1:04:03 PM PDT - Tue, Oct 8th 2013 by Tpirojsi
Tpirojsi (Clicked A Few Times)
Threads 13, Posts 35
1:48:59 PM PDT - Tue, Oct 8th 2013
Quote: Tpirojsi, Oct 8th 6:28 pm
Quote: Edoapra, Oct 8th 4:46 pm
Tee,
qstat does not check the running processes on the compute nodes.
In order to check the status of running processes (and for the ipcs output as well), you need to log in to the compute nodes (using ssh, for example).
Oh! Thank you for shedding some light on this for me. You are absolutely correct. I logged in to the compute nodes, ran the 'ipcs -a' command, and did see some leftover shared memory segments. I used the tool provided in the NWChem package to clean them up, and the swap space came back to normal. It seems to have solved the qrsh_starter problem too!
I really appreciate your help.
Tee
Indeed, I still see some qrsh_starter errors, but very few. Do you have any idea what they might be related to?