From NWChem
6:13:37 AM PST - Tue, Feb 19th 2013
Dear all,
some runs of NWChem (versions 6.0 and 6.1.1) lead to a failed assertion on our Intel Nehalem cluster (roughly 2.8 GB of
memory per core is available) when running on 512 cores.
The message is as follows:
"0:Terminate signal was sent, status=: 15 (rank:0 hostname:jj20c79 pid:16917):ARMCI DASSERT fail. ../../ga-5-1/armci/src/common/signaltrap.c:SigTermHandler():472 cond:0"
We checked the memory consumption of the run, and it appears that more and more memory is consumed during the iteration steps without being deallocated. As our system does not allow memory swapping, the run crashes after a while. The user who drew this behavior to our attention assumes that the ARMCI driver does not clean up memory the way the Global Arrays toolkit expects. Might this be the case?
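For anyone who wants to check for the same growth, here is a minimal sketch of how the resident memory of one process can be sampled on a Linux node. It assumes /proc is available; the helper names rss_kb and watch_rss and the 10-second default interval are made up for illustration, and this is not necessarily how the measurement above was taken.

```shell
#!/bin/sh
# rss_kb PID: print the resident set size (VmRSS, in kB) of one process.
rss_kb() {
    awk '/^VmRSS:/ { print $2 }' "/proc/$1/status"
}

# watch_rss PID [INTERVAL]: print "epoch_seconds rss_kb" every INTERVAL
# seconds (default 10) until the process exits.
watch_rss() {
    pid=$1
    interval=${2:-10}
    while [ -r "/proc/$pid/status" ]; do
        echo "$(date +%s) $(rss_kb "$pid")"
        sleep "$interval"
    done
}
```

Running something like watch_rss "$(pgrep -n nwchem)" 10 > rss.log on a compute node (the process name nwchem is an assumption) gives a log whose second column should keep climbing across iterations if memory is not being released.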
The code is compiled and linked with the Intel compiler 12.1.4 (OS: SUSE Linux Enterprise Server 11 (x86_64), Kernel: 2.6.32.59-0.3-default).
The following settings were used:
export PYTHONHOME="/usr/bin/python"
export PYTHONVERSION="2.6"
export MSG_COMMS="MPI"
export LARGE_FILES="TRUE"
export ARMCI_NETWORK=OPENIB
export NWCHEM_MODULES="all"
export NWCHEM_TOP="/some_directory/nwchem-6.1.1"
export NWCHEM_TARGET="LINUX64"
export USE_MPI="y"
export USE_MPIF="y"
export USE_MPIF4="y"
export MPI_HOME=/usr/local/parastation/mpi2-intel-5.0.26-1/
export MPI_LIB=$MPI_HOME/lib
export MPI_INCLUDE=$MPI_HOME/include
export LIBMPI="-lmpich"
export PATH=$PATH:$MPI_HOME/bin
cd $NWCHEM_TOP/src
make FDEBUG="-g" CDEBUG="-g" FC="ifort" CC="icc" nwchem_config
make FDEBUG="-g" CDEBUG="-g" FC="ifort" CC="icc" >& make.log
cd $NWCHEM_TOP/src/util
make FDEBUG="-g" CDEBUG="-g" FC="ifort" CC="icc" version
make FDEBUG="-g" CDEBUG="-g" FC="ifort" CC="icc"
cd $NWCHEM_TOP/src
make FDEBUG="-g" CDEBUG="-g" FC="ifort" CC="icc" link
Using the GCC compiler suite instead also does not solve the problem.
Additionally, we tried setting the following environment variables, without success:
ARMCI_DEFAULT_SHMMAX=4096 (or 1024 or 2048)
MA_USE_ARMCI_MEM=1
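As a sketch of how these can be set, the variables are exported in the job script before the launcher is called, so that every rank sees them (assuming the launcher forwards the environment). The commented launch line is only a placeholder: the launcher name, its options, the binary path and the input file name are assumptions.

```shell
# One of the values tried above; 1024 and 2048 were tried as well.
export ARMCI_DEFAULT_SHMMAX=4096
export MA_USE_ARMCI_MEM=1

# Hypothetical launch line (launcher, binary path and input file are assumptions):
# mpiexec -np 512 "$NWCHEM_TOP/bin/LINUX64/nwchem" input.nw
```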
Has anyone else encountered the same problem?
Any help on this is highly appreciated.
Best regards
Alexander
Edoapra Forum:Admin, Forum:Mod, bureaucrat, sysop
10:53:36 AM PST - Tue, Feb 19th 2013
Alexander,
I think that the ARMCI developers might be better able to help with your GA/ARMCI problem.
The GA/ARMCI discussion group can be found at the following URL:
http://groups.google.com/group/hpctools