having some problems with 6.1.1 and qlogic openmpi
Michael tn | 6:52:51 AM PDT - Wed, Aug 29th 2012
good day all,
we received a new cluster that is based on QLogic InfiniBand. i've spent a couple of days fiddling with the build and i'm still having issues. the system is CentOS 6.2 with gcc 4.4.6, QLogic fabric, and a QLogic Corp. IBA7322 QDR InfiniBand HCA (rev 02). i built openmpi this way:
./configure --prefix=/shared/openmpi-1.6.1/gcc --enable-static --without-tm --with-openib=/usr --with-psm=/usr CC=gcc CXX=g++ F77=gfortran FC=gfortran --enable-mpi-thread-multiple
my nwchem environment:
export NWCHEM_TOP=/shared/build/nwchem-6.1.1-src
export NWCHEM_MODULES="all"
export INSTALL_PREFIX=/shared/nwchem-6.1.1
export CC=gcc
export FC=gfortran
export MPI_INCLUDE=/shared/openmpi-1.6.1/gcc/include
export MPI_LIB=/shared/openmpi-1.6.1/gcc/lib
export LIBMPI="-lmpi_f90 -lmpi_f77 -lmpi -ldl -Wl,--export-dynamic -lnsl -lutil"
export USE_MPI=y
export USE_MPIF=y
export USE_MPIF4=y
export ARMCI_NETWORK=MPI-MT
export TARGET=LINUX64
export LARGE_FILES=TRUE
export NWCHEM_TARGET=LINUX64
export IB_HOME=/usr
export IB_INCLUDE=/usr/include
export IB_LIB=/usr/lib64
export IB_LIB_NAME="-libverbs -libumad -lpthread"
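the build itself was the usual sequence, something like this (sketched from the standard 6.x make targets, nothing exotic):
cd $NWCHEM_TOP/src
make nwchem_config NWCHEM_MODULES="all"
make FC=gfortran CC=gcc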
nwchem builds and runs on two nodes, but errors out after a couple of minutes with this:
12:Segmentation Violation error, status=: 11
(rank:12 hostname:node033 pid:2797):ARMCI DASSERT fail. ../../ga-5-1/armci/src/common/signaltrap.c:SigSegvHandler():310 cond:0
i looked at this thread,
http://www.nwchem-sw.org/index.php/Special:AWCforum/st/id435/#post_1562
removed the rpm builds of blas and relinked; same error.
ldd on the nwchem binary shows:
ldd nwchem
linux-vdso.so.1 => (0x00007fff50351000)
libmpi_f90.so.1 => /shared/openmpi-1.6.1/gcc/lib/libmpi_f90.so.1 (0x00007f20e6c37000)
libmpi_f77.so.1 => /shared/openmpi-1.6.1/gcc/lib/libmpi_f77.so.1 (0x00007f20e6a02000)
libmpi.so.1 => /shared/openmpi-1.6.1/gcc/lib/libmpi.so.1 (0x00007f20e643f000)
libdl.so.2 => /lib64/libdl.so.2 (0x0000003e19200000)
libnsl.so.1 => /lib64/libnsl.so.1 (0x0000003e24e00000)
libutil.so.1 => /lib64/libutil.so.1 (0x0000003e23a00000)
libgfortran.so.3 => /usr/lib64/libgfortran.so.3 (0x00007f20e6139000)
libm.so.6 => /lib64/libm.so.6 (0x0000003e19e00000)
libgcc_s.so.1 => /lib64/libgcc_s.so.1 (0x0000003e25e00000)
libc.so.6 => /lib64/libc.so.6 (0x0000003e19600000)
libpthread.so.0 => /lib64/libpthread.so.0 (0x0000003e19a00000)
libosmcomp.so.3 => /usr/lib64/libosmcomp.so.3 (0x00000033cd200000)
librdmacm.so.1 => /usr/lib64/librdmacm.so.1 (0x00000033cda00000)
libibverbs.so.1 => /usr/lib64/libibverbs.so.1 (0x00007f20e5f28000)
libpsm_infinipath.so.1 => /usr/lib64/libpsm_infinipath.so.1 (0x00007f20e5cd6000)
libnuma.so.1 => /usr/lib64/libnuma.so.1 (0x0000003e1aa00000)
librt.so.1 => /lib64/librt.so.1 (0x0000003e1a200000)
/lib64/ld-linux-x86-64.so.2 (0x0000003e18e00000)
libibumad.so.3 => /usr/lib64/libibumad.so.3 (0x00000033cd600000)
libinfinipath.so.4 => /usr/lib64/libinfinipath.so.4 (0x00007f20e5ac7000)
where else should i be looking? should i be building a local blas/etc.?
-- michael
Edoapra (Forum Admin) | 9:36:44 AM PDT - Wed, Aug 29th 2012
Michael,
Could you please post (or put on a website) the following:
1) the full output file
2) the full link step (that is, the result of "cd $NWCHEM_TOP/src; make FC=gfortran link")
Michael tn | 9:52:45 AM PDT - Wed, Aug 29th 2012
output file is here
http://pastebin.com/1WGsJ1RT
and the output from the relink:
[root@cmbcluster src]# make FC=gfortran link
make nwchem.o stubs.o
make[1]: warning: -jN forced in submake: disabling jobserver mode.
gfortran -fdefault-integer-8 -Wextra -Wuninitialized -g -O -I. -I/shared/build/nwchem-6.1.1-src/src/include -I/shared/build/nwchem-6.1.1-src/src/tools/install/include -DEXT_INT -DLINUX -DLINUX64 -DGFORTRAN -DCHKUNDFLW -DGCC4 -DGCC46 -DPARALLEL_DIAG -DCOMPILATION_DATE="'`date +%a_%b_%d_%H:%M:%S_%Y`'" -DCOMPILATION_DIR="'/shared/build/nwchem-6.1.1-src'" -DNWCHEM_BRANCH="'6.1.1'" -c -o nwchem.o nwchem.F
gfortran -fdefault-integer-8 -Wextra -Wuninitialized -g -O -I. -I/shared/build/nwchem-6.1.1-src/src/include -I/shared/build/nwchem-6.1.1-src/src/tools/install/include -DEXT_INT -DLINUX -DLINUX64 -DGFORTRAN -DCHKUNDFLW -DGCC4 -DGCC46 -DPARALLEL_DIAG -DCOMPILATION_DATE="'`date +%a_%b_%d_%H:%M:%S_%Y`'" -DCOMPILATION_DIR="'/shared/build/nwchem-6.1.1-src'" -DNWCHEM_BRANCH="'6.1.1'" -c -o stubs.o stubs.F
gfortran -L/shared/build/nwchem-6.1.1-src/lib/LINUX64 -L/shared/build/nwchem-6.1.1-src/src/tools/install/lib -o /shared/build/nwchem-6.1.1-src/bin/LINUX64/nwchem nwchem.o stubs.o -lnwctask -lccsd -lmcscf -lselci -lmp2 -lmoints -lstepper -ldriver -loptim -lnwdft -lgradients -lcphf -lesp -lddscf -ldangchang -lguess -lhessian -lvib -lnwcutil -lrimp2 -lproperty -lnwints -lprepar -lnwmd -lnwpw -lofpw -lpaw -lpspw -lband -lnwpwlib -lcafe -lspace -lanalyze -lqhop -lpfft -ldplot -ldrdy -lvscf -lqmmm -lqmd -letrans -lpspw -ltce -lbq -lcons -lperfm -ldntmc -lccca -lnwcutil -lga -lpeigs -lperfm -lcons -lbq -lnwcutil -llapack -lblas -L/shared/openmpi-1.6.1/gcc/lib -lmpi_f90 -lmpi_f77 -lmpi -ldl -Wl,--export-dynamic -lnsl -lutil
/usr/bin/ld: Warning: alignment 16 of symbol `cface_' in /shared/build/nwchem-6.1.1-src/lib/LINUX64/libstepper.a(stpr_face.o) is smaller than 32 in /shared/build/nwchem-6.1.1-src/lib/LINUX64/libstepper.a(stpr_partit.o)
thanks for having a look! :-)
Edoapra (Forum Admin) | 9:57:53 AM PDT - Wed, Aug 29th 2012
Michael,
Everything looks good so far.
Before moving to a more detailed analysis,
I would like to know if the simple $NWCHEM_TOP/src/nwchem.nw test works fine using more than one node.
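For example, something like this, where the hostfile is just a placeholder for your two-node machine file:
mpirun -np 24 -hostfile hosts.2nodes $NWCHEM_TOP/bin/LINUX64/nwchem $NWCHEM_TOP/src/nwchem.nw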
Cheers, Edo
Michael tn | 10:28:15 AM PDT - Wed, Aug 29th 2012
same error:
http://pastebin.com/sMAKgWZH
the failed run left 12 processes still running across the 2 nodes.
Edoapra (Forum Admin) | 11:03:13 AM PDT - Wed, Aug 29th 2012
Michael,
Does NWChem run on a single core?
Michael tn | 11:09:09 AM PDT - Wed, Aug 29th 2012
you mean non-MPI, i.e. a single process on one node, correct?
this runs to completion:
/shared/openmpi-1.6.1/gcc/bin/mpirun -n 1 /shared/nwchem-6.1.1/bin/nwchem /home/mgx/testing/nwchem.nw
http://pastebin.com/nst5ga8n
Edoapra (Forum Admin) | 11:24:35 AM PDT - Wed, Aug 29th 2012
What about multiple processes on the same node? E.g.
mpirun -np 2
Michael tn | 11:29:57 AM PDT - Wed, Aug 29th 2012
yup, fine, up to the number of cores (12).
/shared/openmpi-1.6.1/gcc/bin/mpirun -n 12 /shared/nwchem-6.1.1/bin/nwchem /home/mgx/testing/nwchem.nw
completes fine
Edoapra (Forum Admin) | 11:36:33 AM PDT - Wed, Aug 29th 2012
Michael,
1) Did you check from the ompi_info output that OpenMPI was correctly built for multi-threading?
"ompi_info | grep Thread" should show "Thread support: posix (mpi: yes, progress: no)"
2) Let's check if GA/ARMCI built correctly. Could you please post the following:
a) $NWCHEM_TOP/src/tools/build/config.log
b) $NWCHEM_TOP/src/tools/build/armci/config.log
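If the full logs are too long to post, a filtered view would also help; the grep keys below are just a guess at the relevant settings:
grep -E 'ARMCI|NETWORK|THREAD|psm' $NWCHEM_TOP/src/tools/build/armci/config.log | head -50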
Michael tn | 11:57:44 AM PDT - Wed, Aug 29th 2012
yup, here:
1)
[mgx@cmbcluster testing]$ ompi_info | grep -i Thread
Thread support: posix (MPI_THREAD_MULTIPLE: yes, progress: no)
FT Checkpoint support: no (checkpoint thread: no)
2)
a) http://pastebin.com/qwbLzJ32
b) http://pastebin.com/4A6UCuX8
Edoapra (Forum Admin) | 2:17:47 PM PDT - Wed, Aug 29th 2012
Michael,
Let's try what happens if we have OpenMPI use the slower ethernet device. Please add the following options to mpirun:
--mca btl tcp,self,sm --mca btl_tcp_if_include eth0
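Put together with the paths from earlier in the thread (the hostfile is again a placeholder), the full command would be something like:
/shared/openmpi-1.6.1/gcc/bin/mpirun --mca btl tcp,self,sm --mca btl_tcp_if_include eth0 -np 24 -hostfile hosts.2nodes /shared/nwchem-6.1.1/bin/nwchem /home/mgx/testing/nwchem.nw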
Michael tn | 2:40:34 PM PDT - Wed, Aug 29th 2012
nope :-\
full output is here:
http://pastebin.com/d48Dxqc4
12:Segmentation Violation error, status=: 11
(rank:12 hostname:node033 pid:10675):ARMCI DASSERT fail. ../../ga-5-1/armci/src/common/signaltrap.c:SigSegvHandler():310 cond:0
Last System Error Message from Task 12:: No such file or directory
^Cmpirun: killing job...
my namd2 build runs fine with this mpi, fwiw
Michael tn | 9:38:48 AM PDT - Thu, Aug 30th 2012
i rechecked my openmpi build and the examples hello_xxx and ring_xxx work as expected (a sketch of that check is below). do you think this is an mpi issue or a GA issue?
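the check was along these lines, assuming the examples directory in the openmpi source tree (that path and the hostfile are placeholders):
# requires mpicc from the new install in PATH
cd /shared/build/openmpi-1.6.1/examples && make hello_c ring_c
/shared/openmpi-1.6.1/gcc/bin/mpirun -np 4 -hostfile hosts.2nodes ./hello_c
/shared/openmpi-1.6.1/gcc/bin/mpirun -np 4 -hostfile hosts.2nodes ./ring_c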
--- michael
Edoapra (Forum Admin) | 11:35:59 AM PDT - Thu, Aug 30th 2012
Michael,
I think it is most likely a GA issue with the current source code. The next thing we could do is to start to debug the code where the segv occurs, but I am not sure we will get much out of it.
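If we did go that route, the first step would be a backtrace from the failing rank; a sketch (note that ARMCI installs its own SIGSEGV handler, so attaching gdb to a live rank is more reliable than waiting for a core file):
# on the failing node, attach to one of the nwchem ranks
gdb -p <pid>
(gdb) continue    # run until the SIGSEGV fires
(gdb) bt          # then grab the backtrace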
There is a new GA implementation on top of MPI in the works and it might be easier to install on the QLogic hardware. My suggestion for you would be to wait for this new GA to be released.
A major problem of this effort to port NWChem to QLogic is that we do not have access to it and,
as you can see, trying to help remotely is not always straightforward.
Let me know how you want to proceed.
Cheers, Edo
Michael tn | 11:48:49 AM PDT - Thu, Aug 30th 2012
that's fine, we can bide our time. are you still at ORNL? i could arrange access to the qlogic cluster if that would help.
we appreciate your help!
michael