6.1.1 MPI build runs great, but only on 1 node
Viewed 18153 times, With a total of 24 Posts
Just Got Here | Threads 1 | Posts 4
9:29:01 AM PDT - Wed, Jul 18th 2012
NWChem 6.1.1 on SL6 Linux, built with gcc-4.4 and openmpi-1.4.3.
Here's what I did to build it:
export NWCHEM_TOP=$PWD
export NWCHEM_TARGET=LINUX64
export INSTALL_PREFIX=/opt/nwchem/6.1.1
export USE_MPI=y
export USE_MPIF=y
export USE_MPIF4=y
export LARGE_FILES=TRUE
export TCGRSH=/usr/bin/ssh
export LIBMPI="-lmpi_f90 -lmpi_f77 -lmpi -ldl -Wl,--export-dynamic -lnsl -lutil"
export MPI_LIB=/opt/openmpi/1.4.3/lib
export MPI_INCLUDE=/opt/openmpi/1.4.3/include
export FC=gfortran
export CC=gcc
cd $NWCHEM_TOP/src
make nwchem_config NWCHEM_MODULES=all
make
mkdir -p $INSTALL_PREFIX
mkdir -p $INSTALL_PREFIX/bin
mkdir -p $INSTALL_PREFIX/data
cp $NWCHEM_TOP/bin/${NWCHEM_TARGET}/nwchem $INSTALL_PREFIX/bin
chmod 755 $INSTALL_PREFIX/bin/nwchem
cp -r $NWCHEM_TOP/src/basis/libraries $INSTALL_PREFIX/data
cp -r $NWCHEM_TOP/src/data $INSTALL_PREFIX
cp -r $NWCHEM_TOP/src/nwpw/libraryps $INSTALL_PREFIX/data
Here's how I run it (using PBS Professional 11.2):
- !/bin/bash
- PBS -N nwchem
- PBS -l select=2:ncpus=8:mpiprocs=8:mem=8gb,walltime=00:30:00
- PBS -j oe
mpiexec -n 16 nwchem formaldehyde.scf.nwchem > formaldehyde.scf.out
But all 16 processes appear on only one of the 2 nodes allocated to this job. If I run on just 1 node, everything looks great, but with more than 1 node all of the processes double up on the "master" node.
Any ideas/comments/suggestions?
Thanks a lot!
Just Got Here | Threads 1 | Posts 4
9:31:13 AM PDT - Wed, Jul 18th 2012
Ooops! My PBS job script was autoformatted when I submitted. It should look like this:
#!/bin/bash
#PBS -N nwchem
#PBS -l select=2:ncpus=8:mpiprocs=8:mem=8gb,walltime=00:30:00
#PBS -j oe
mpiexec -n 16 nwchem formaldehyde.scf.nwchem > formaldehyde.scf.out
Just Got Here | Threads 1 | Posts 4
9:44:27 AM PDT - Wed, Jul 18th 2012
I posted this in the "Compiling NWChem" section because I suspect that this problem is associated with the way I built my executable.
Bert (Forum:Admin, Forum:Mod, NWChemDeveloper, bureaucrat, sysop)
Forum Vet | Threads 4 | Posts 597
12:07:27 PM PDT - Wed, Jul 18th 2012
This is not a build issue as far as I can see. It is the mpiexec command that starts the 16 nwchem processes on one node; nwchem itself has nothing to do with that. You may want to look at the mpiexec manual. For example, adding "-npernode 8" might give you what you need. Alternatively, you may want to use mpirun.
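A minimal sketch of the suggested change, assuming Open MPI's mpiexec/orterun (which accepts -npernode) and the same file names as in the job script above; adding -hostfile is an extra assumption for the case where Open MPI was not built with PBS support:
# spread the 16 ranks as 8 per node instead of packing them all onto the first host
mpiexec -n 16 -npernode 8 -hostfile $PBS_NODEFILE nwchem formaldehyde.scf.nwchem > formaldehyde.scf.out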
Bert
Quote: Chemogan, Jul 18th 5:31 pm: Ooops! My PBS job script was autoformatted when I submitted. It should look like this:
#!/bin/bash
#PBS -N nwchem
#PBS -l select=2:ncpus=8:mpiprocs=8:mem=8gb,walltime=00:30:00
#PBS -j oe
mpiexec -n 16 nwchem formaldehyde.scf.nwchem > formaldehyde.scf.out
Just Got Here | Threads 1 | Posts 4
2:24:03 PM PDT - Wed, Jul 18th 2012
Thanks Bert. Yeah, I'm getting the impression that I did build NWChem successfully, and that I'm just having some trouble with OpenMPI (I'm more accustomed to MPICH2).
Ah-ha! Yes that was it. Works now.
I added "-hostfile" and "-npernode" to my command (mpiexec is just a synonym for mpirun; they're both symbolic links to orterun):
mpiexec -n 16 -hostfile $PBS_NODEFILE -npernode 8 nwchem n2.mp2.ccsd.nwchem > n2.mp2.ccsd.out
Sorry for posting in the "Compiling" section. Perhaps this thread should be moved to the "Running" section, if that's possible.
Thanks so much for your help!
Gets Around | Threads 17 | Posts 72
6:56:19 AM PDT - Mon, Aug 20th 2012
Hi,
it seems this thread needs to be reopened.
NWChem 6.1.1 does not run across the nodes on my system either; NWChem 6.0 runs fine.
6.1.1 (and also the initial 6.1 release), when run across the nodes, crashes with:
argument 1 = ../nwchem.nw
-10000:armci_AcceptSockAll:timeout waiting for connection: 0
(rank:-10000 hostname:d071.dcsc.fysik.dtu.dk pid:20939):ARMCI DASSERT fail. ../../ga-5-1/armci/src/devices/sockets/sockets.c:armci_AcceptSockAll():673 cond:0
0:armci_rcv_data: read failed: -1
(rank:0 hostname:d071.dcsc.fysik.dtu.dk pid:20936):ARMCI DASSERT fail. ../../ga-5-1/armci/src/devices/sockets/dataserv.c:armci_ReadFromDirect():439 cond:0
-10002:armci_AcceptSockAll:timeout waiting for connection: 0
(rank:-10002 hostname:d031.dcsc.fysik.dtu.dk pid:22561):ARMCI DASSERT fail. ../../ga-5-1/armci/src/devices/sockets/sockets.c:armci_AcceptSockAll():673 cond:0
2:Child process terminated prematurely, status=: 256
(rank:2 hostname:d031.dcsc.fysik.dtu.dk pid:22558):ARMCI DASSERT fail. ../../ga-5-1/armci/src/common/signaltrap.c:SigChldHandler():178 cond:0
The http://www.nwchem-sw.org/images/Nwchem-6.1.1-src.2012-06-27.tar.gz is built against openmpi 1.3.3 with torque support, with the following script (irrelevant parts of the filesystem paths are replaced by ...) on CentOS 5, x86_64:
export NWCHEM_TOP=/.../nwchem-6.1.1-src
export NWCHEM_TARGET=LINUX64
export CC=gcc
export FC=gfortran
export LD_LIBRARY_PATH=/.../lib64
export USE_MPI=y
export USE_MPIF=y
export USE_MPIF4=y
export MPIEXEC=/.../bin/mpiexec
export MPI_LIB=/.../lib64
export MPI_INCLUDE=/.../include/
export LIBMPI='-L/.../lib64 -lmpi -lmpi_f90 -lmpi_f77'
export LARGE_FILES=TRUE
export USE_NOFSCHECK=TRUE
export TCGRSH=ssh
export PYTHONHOME=/usr
export PYTHONVERSION=2.4
export PYTHONLIBTYPE=a
export USE_PYTHON64=y
export HAS_BLAS=yes
export BLASOPT="-L/usr/lib64 -lblas -llapack"
make nwchem_config NWCHEM_MODULES="all python" 2>&1 | tee make_nwchem_config.log
make 64_to_32 2>&1 | tee make_64_to_32.log
make USE_64TO32=y 2>&1 | tee make.log
I run the following example (with mpiexec `which nwchem` nwchem.nw):
geometry noautoz noautosym
O 0.0 0.0 1.245956
O 0.0 0.0 0.0
end
basis spherical
* library cc-pvdz
end
dft
mult 3
xc xpbe96 cpbe96
smear 0.0
direct
noio
end
task dft energy
I have also tried to specify the PBS_NODEFILE explicitly with --hostfile ${PBS_NODEFILE}.
On the nodes, I see just one nwchem process per node running at 100% CPU; the other instances sit at 0% CPU load.
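For concreteness, the run described above looks something like this (a sketch combining the launch command from this post with the --hostfile variant that was tried; the output redirection and file name are assumptions):
# launch nwchem under Open MPI, explicitly passing the PBS node list
mpiexec --hostfile ${PBS_NODEFILE} `which nwchem` nwchem.nw > nwchem.out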
Edoapra (Forum:Admin, Forum:Mod, bureaucrat, sysop)
Forum Vet | Threads 7 | Posts 1267
10:54:07 AM PDT - Mon, Aug 20th 2012
Marcindulak,
Could you please post the following files?
$NWCHEM_TOP/src/tools/build/config/makefile.h
$NWCHEM_TOP/src/tools/build/armci/config/makefile.h
Please send the output of the following command, too:
mpiexec -V
It would also be useful to see the full error/output file from NWChem,
with the -v option passed to mpiexec.
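A minimal sketch of how this information could be gathered in one go, assuming the build environment from the earlier post ($NWCHEM_TOP set, nwchem.nw as input); the log file name is hypothetical:
# dump the two generated makefile.h files and the MPI launcher version
cat $NWCHEM_TOP/src/tools/build/config/makefile.h
cat $NWCHEM_TOP/src/tools/build/armci/config/makefile.h
mpiexec -V
# rerun the failing case with mpiexec verbosity, capturing stdout and stderr together
mpiexec -v `which nwchem` nwchem.nw > nwchem_verbose.log 2>&1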
Edoapra (Forum:Admin, Forum:Mod, bureaucrat, sysop)
Forum Vet | Threads 7 | Posts 1267
9:20:11 AM PDT - Tue, Aug 21st 2012
What Linux Distribution?
Marcindulak,
What Linux distribution & version are you using?
Edoapra (Forum:Admin, Forum:Mod, bureaucrat, sysop)
Forum Vet | Threads 7 | Posts 1267
11:07:49 AM PDT - Tue, Aug 21st 2012
BLAS size
Marcindulak,
The only problem I spotted so far (and it should not explain the inter-node problem) is that you are using BLAS (and maybe LAPACK)
from /usr/lib64. My guess is that this library uses 32-bit integers. If that is indeed the case, you would need to tell the tools
configuration about it by setting the following environment variables:
BLAS_SIZE=4
LAPACK_SIZE=4
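A minimal sketch of how this could be applied, assuming the variables are exported next to the existing BLASOPT setting and that the tools directory is then rebuilt (the rebuild step is an assumption based on the usual NWChem build flow, not something stated in this post):
# declare that the system BLAS/LAPACK in /usr/lib64 use 32-bit integers
export BLAS_SIZE=4
export LAPACK_SIZE=4
export BLASOPT="-L/usr/lib64 -lblas -llapack"
# rebuild the tools (GA/ARMCI) so the new integer sizes are picked up
cd $NWCHEM_TOP/src/tools
make clean
make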
Edoapra (Forum:Admin, Forum:Mod, bureaucrat, sysop)
Forum Vet | Threads 7 | Posts 1267
5:27:15 PM PDT - Tue, Aug 21st 2012
Marcindulak
I have managed to reproduce your problem.
However, I do not see any difference with 6.0 ... can you confirm that 6.0 works fine when using ARMCI_NETWORK=SOCKETS and using OpenMPI?
Cheers, Edo
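In case it helps, a minimal sketch of what that 6.0 check might look like at build time, assuming ARMCI_NETWORK is selected through the usual environment variable before configuring (whether SOCKETS is already the default on this system is an assumption):
# build 6.0 against the same Open MPI, explicitly selecting the sockets-based ARMCI network
export ARMCI_NETWORK=SOCKETS
cd $NWCHEM_TOP/src
make nwchem_config NWCHEM_MODULES="all python"
make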
Gets Around | Threads 17 | Posts 72
2:28:40 AM PDT - Wed, Aug 22nd 2012
I have compiled 6.1.1 with {BLAS,LAPACK}_SIZE=4, but it did not solve the MPI problem; the only visible change is getting --with-blas4="-L/usr/lib64 -lblas -llapack" in the make stages. As a side comment, shouldn't the LAPACK_LIB variable be set too, and not only BLASOPT?
I see the LAPACK_LIB variable is not mentioned at http://www.nwchem-sw.org/index.php/Compiling_NWChem
Without it, the output when setting only BLASOPT looks like:
--without-lapack --with-blas8=-L/usr/lib64 -lblas -llapack
The 6.0 version i use is this one:
http://download.opensuse.org/repositories/home:/marcindulak/CentOS_CentOS-5/
with the log available:
https://build.opensuse.org/package/live_build_log?arch=x86_64&package=nwchem&proje...
It does not look like nwchem 6.0 prints anything about ARMCI_NETWORK, and I haven't set it.
My impression is that the crashes across the nodes started around the time when I had to set
USE_MPIF4=y in order to compile nwchem.
Edoapra (Forum:Admin, Forum:Mod, bureaucrat, sysop)
Forum Vet | Threads 7 | Posts 1267
8:36:42 AM PDT - Wed, Aug 22nd 2012
Marcindulak,
Could you please send me the full stderr/stdout of a successful multinode run with 6.0?
Could you please add the following options to mpiexec/mpirun/orterun:
--mca btl_base_verbose 50 --mca btl_openib_verbose 1
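A minimal sketch of the full command with these options added, reusing the launch line from the earlier post (the log file name is hypothetical):
# rerun with Open MPI BTL debugging enabled, keeping stdout and stderr in one file
mpiexec --mca btl_base_verbose 50 --mca btl_openib_verbose 1 --hostfile ${PBS_NODEFILE} `which nwchem` nwchem.nw > nwchem_btl_debug.log 2>&1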
Thanks, Edo
Edited On 9:19:32 AM PDT - Wed, Aug 22nd 2012 by Edoapra
Edoapra (Forum:Admin, Forum:Mod, bureaucrat, sysop)
Forum Vet | Threads 7 | Posts 1267
11:20:22 AM PDT - Wed, Aug 22nd 2012
Please ignore previous post
Marcindulak,
Please ignore the previous post, since I have managed to reproduce your findings using the 6.0 and 6.1.1 binaries from your RPMs
(it took me a while to figure out the right openmpi orterun option to get things working, however ...).
More later, Edo
Edoapra (Forum:Admin, Forum:Mod, bureaucrat, sysop)
Forum Vet | Threads 7 | Posts 1267
11:29:42 AM PDT - Wed, Aug 22nd 2012
How to revert 6.1 back to the 6.0 behavior for the tools directory
Marcindulak,
The following recipe might work to fix your 6.1 issues (it worked for me).
It allows you to link with the same parallel tools used in 6.0.
cd $NWCHEM_TOP/src/tools
make FC=gfortran GA_DIR=ga-4-3 OLD_GA=y clean
make FC=gfortran GA_DIR=ga-4-3 OLD_GA=y
cd ..
make FC=gfortran link
Cheers, Edo
Gets Around | Threads 17 | Posts 72
11:53:28 PM PDT - Thu, Aug 23rd 2012
The following fails for me when linking BLAS; are the steps in the right order?
export NWCHEM_TOP=/.../nwchem-6.1.1-src
export NWCHEM_TARGET=LINUX64
export CC=gcc
export FC=gfortran
export LD_LIBRARY_PATH=/usr/lib64/openmpi/1.4-gcc/lib
export USE_MPI=y
export USE_MPIF=y
export USE_MPIF4=y
export MPIEXEC=/usr/lib64/openmpi/1.4-gcc/bin/mpiexec
export MPI_LIB=/usr/lib64/openmpi/1.4-gcc/lib
export MPI_INCLUDE=/usr/lib64/openmpi/1.4-gcc/include
export LIBMPI='-L/usr/lib64/openmpi/1.4-gcc/lib -lmpi -lmpi_f90 -lmpi_f77'
export LARGE_FILES=TRUE
export USE_NOFSCHECK=TRUE
export TCGRSH=ssh
export PYTHONHOME=/usr
export PYTHONVERSION=2.4
export PYTHONLIBTYPE=a
export USE_PYTHON64=y
export HAS_BLAS=yes
export BLASOPT="-L/usr/lib64 -lblas -llapack"
make nwchem_config NWCHEM_MODULES="all python" 2>&1 | tee make_nwchem_config.log
make 64_to_32 2>&1 | tee make_64_to_32.log
make USE_64TO32=y 2>&1 | tee make.log
cd $NWCHEM_TOP/src/tools
make FC=gfortran GA_DIR=ga-4-3 OLD_GA=y clean 2>&1 | tee ../make_ga_clean.log
make FC=gfortran GA_DIR=ga-4-3 OLD_GA=y 2>&1 | tee ../make_ga.log
cd ..
make FC=gfortran link 2>&1 | tee make_link.log
with:
/.../nwchem-6.1.1-src/src/task/task_bsse.F:1778: undefined reference to `ycopy_'
Please fix the problems with posting to the forum: I'm wasting about 5 minutes per post trying
to figure out, line by line, which characters are allowed and which are not.
This time I figured out that the single quote (as in "I'm") is not allowed.