Improper compilation causes memory error when running NWChem
3:02:48 AM PST - Wed, Mar 1st 2017 |
I am struggling to compile NWChem 6.6 on Ubuntu 16.10 with gfortran. Locales are set to German (LC_NUMERIC="de_DE.UTF-8"). When I follow the instructions given in "Documentation>Compiling NWChem>3.1 NWChem 6.6 on Ubuntu 14.04 (Trusty Tahr)", compilation finishes without error messages. As a simple test case I then use the example input file:
title "Nitrogen cc-pvdz SCF geometry optimization"
geometry
  n 0 0 0
  n 0 0 1.08
end
basis
  n library cc-pvdz
end
task scf optimize
My self-compiled version ends with an error:
from getmem: mem. needed= 248762 , mem. available= 209363
Error no. 1 in getmem memory overflow : call no., amount requested : 85 49790
0:texas: nerror called:Received an Error in Communication
When I compare the output to that of a pre-built version of NWChem, which shows no memory issue, I get the following diff
(side-by-side: pre-compiled version | source compiled version with error)
Memory information Memory information
------------------ ------------------
heap = 13107198 doubles = 100.0 Mbytes | heap = 13107200 doubles = 100.0 Mbytes
stack = 13107195 doubles = 100.0 Mbytes | stack = 13107197 doubles = 100.0 Mbytes
global = 26214400 doubles = 200.0 Mbytes (distinct from heap & stack) global = 26214400 doubles = 200.0 Mbytes (distinct from heap & stack)
total = 52428793 doubles = 400.0 Mbytes | total = 52428797 doubles = 400.0 Mbytes
verify = yes verify = yes
hardfail = no hardfail = no
Forming initial guess at 0.1s | Forming initial guess at 0.0s
Superposition of Atomic Density Guess Superposition of Atomic Density Guess
------------------------------------- -------------------------------------
Sum of atomic energies: -108.60004629 Sum of atomic energies: -108.60004629
| from getmem: mem. needed= 248762 , mem. available= 209363
Non-variational initial energy | ------------------------------------------------------------------------
------------------------------ | texas: nerror called 0
| ------------------------------------------------------------------------
Total energy = -109.172911 | ------------------------------------------------------------------------
1-e energy = -194.701220 | current input line :
2-e energy = 61.519341 | 9: task scf optimize
HOMO = -0.421673 | ------------------------------------------------------------------------
LUMO = 0.042733 | ------------------------------------------------------------------------
| An error occured while computing integrals
| ------------------------------------------------------------------------
Symmetry analysis of molecular orbitals - initial | For more information see the NWChem manual at http://www.nwchem-sw.org/index.php/NWChem_Documen
------------------------------------------------- |
| For further details see manual section:
!! scf_movecs_sym_adapt: 4 vectors were symmetry contaminated | --------------------------------------------------------------------------
| MPI_ABORT was invoked on rank 0 in communicator MPI COMMUNICATOR 3 DUP FROM 0
Symmetry fudging | with errorcode -1.
!! scf_movecs_sym_adapt: 4 vectors were symmetry contaminated | NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
| You may or may not see output from other processes, depending on
Numbering of irreducible representations: | exactly when Open MPI kills them.
| --------------------------------------------------------------------------
Since the pre-built version runs as expected, I assume the memory issue is related to compiling from source. Does anyone have an idea what might have gone wrong during compilation?
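For anyone who wants to reproduce a comparison like the one above, here is a minimal sketch. It assumes GNU diff (the -y, -W and --suppress-common-lines options); the function name and the file names are placeholders, not taken from the post.

```shell
# Compare two NWChem output files side by side, showing only lines that differ.
# Assumes GNU diff; the file names passed in are placeholders.
compare_outputs() {
    reference=$1   # output of the build that works
    candidate=$2   # output of the build that fails
    # -y: two columns, -W: total width, --suppress-common-lines: differences only.
    # diff exits with 1 when the files differ, which is not an error here.
    diff -y -W 200 --suppress-common-lines "$reference" "$candidate" || [ $? -eq 1 ]
}
```

Usage would be something like `compare_outputs prebuilt.out compiled.out | less`. Timing lines will always differ, so the interesting part is the first difference that is not a timing.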
11:54:35 AM PST - Wed, Mar 1st 2017 |
Have you applied the patches listed at
Could you please post the output of the following commands
gcc -v
gfortran -v
mpicc -v
mpif90 -v
env | grep MPI
env |grep BLAS
env | grep SCALA
env|grep USE_6
env|egrep NWC
head -25 $NWCHEM_TOP/src/tools/build/config.log
grep -i gemm $NWCHEM_TOP/src/nwdft/xc/xc_tabcd.F
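If it is easier to post everything in one go, the commands above can be wrapped so that all output lands in a single file. This is only a sketch: the function name and the default report file name are made up, and it assumes NWCHEM_TOP is set as in the build instructions.

```shell
# Collect the diagnostics requested above into one report file.
# The function name and default file name are illustrative only.
collect_build_info() {
    report=${1:-nwchem-build-info.txt}
    {
        for cmd in gcc gfortran mpicc mpif90; do
            echo "== $cmd -v =="
            "$cmd" -v 2>&1 || echo "($cmd not available)"
        done
        echo "== environment =="
        env | grep -E 'MPI|BLAS|SCALA|USE_6|NWC' || echo "(no matching variables)"
        echo "== config.log =="
        head -25 "${NWCHEM_TOP:-.}/src/tools/build/config.log" 2>&1 || true
        echo "== gemm calls in xc_tabcd.F =="
        grep -i gemm "${NWCHEM_TOP:-.}/src/nwdft/xc/xc_tabcd.F" 2>&1 || true
    } > "$report"
    echo "wrote $report"
}
```

Calling `collect_build_info` with no argument writes nwchem-build-info.txt in the current directory; missing tools or files are recorded in the report instead of stopping the run.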
10:25:06 AM PST - Fri, Mar 3rd 2017 |
Solved! Thank you very much! Your first suggestion did the trick!
Somehow I missed the patches completely. After installing every patch (some seemed to be already included in the source), the calculation with my self-compiled version finishes successfully without error messages. Due to lack of time, I didn't figure out which of the patches was responsible for the error message mentioned above, sorry.
Thank you again for pointing me in the right direction!
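Since some patches may already be included in the source, a dry run before each patch avoids applying one twice. The following is a rough sketch, not the official procedure: it assumes GNU patch, that the patch files have already been downloaded and unpacked into a local directory, and that they apply with -p0 from the source directory. The function name, the directory arguments and the strip level are all assumptions to check against the patch instructions you follow.

```shell
# Apply every *.patch file from patch_dir to src_dir, skipping patches that
# do not apply cleanly (for example because they are already in the source).
# Assumes GNU patch; -p0 is an assumption about how the patches were made.
apply_patches() {
    src_dir=$1     # e.g. "$NWCHEM_TOP/src"
    patch_dir=$2   # hypothetical directory holding the unpacked patches
    for p in "$patch_dir"/*.patch; do
        [ -e "$p" ] || continue
        # Dry run first: -N refuses patches that look already applied.
        if (cd "$src_dir" && patch -p0 -N -s --dry-run) < "$p" > /dev/null 2>&1; then
            (cd "$src_dir" && patch -p0 -N -s) < "$p" && echo "applied: $(basename "$p")"
        else
            echo "skipped (already applied or does not fit): $(basename "$p")"
        fi
    done
}
```

After patching, the affected directories still have to be recompiled and the binary relinked before the fix shows up in a run.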
5:45:43 PM PDT - Thu, Mar 23rd 2017 |
CentOS 7.3 NWChem 6.6 segmentation fault
I compiled NWC 6.6.revision.-src.2015-10-20 on an Intel system with 64 GB of memory, basically following the standard procedure, which has worked for other CentOS 7 systems. An nwchem binary was created in /bin/, which I usually take to indicate a successful compilation. However, when I try to run a test job, a segmentation fault is reported on the console. Any ideas what went wrong?
Here are the details:
OS and installed programs:
OS: CentOS-7.3-x86_64 7-3.1611.el7
openmpi.x86_64 1.10.0-10.el7
openmpi-develop.x86_64 1.10.0-10.el7
make.x86_64 3.82-21.el7
python.x86_64 2.7.5-39.el7_2
python-devel.x86_64 2.7.5-39.el7_2
gcc.x86_64 4.8.5-4.el7
gcc-c++.x86_64 4.8.5-4.el7
gcc-gfortran.x86_64 4.8.5-4.el7
perl.x86_64 4:5.16.3-286.el7
perl-libs.x86_64 4:5.16.3-286.el7
tcsh.x86_64 4:5.16.3-286.el7
openssh.x86_64 6.6.1pl-25.el7_2
openssh-clients.x86_64 6.6.1pl-25.el7_2
openblas.x86_64 0.2.19-3.el7
openblas-devel.x86_64 0.2.19-3.el7
openblas-openmp.x86_64 0.2.19-3.el7
openblas-openmp64.x86_64 0.2.19-3.el7
openblas-openmp64_.x86_64 0.2.19-3.el7
openblas-serial64.x86_64 0.2.19-3.el7
openblas-serial64_.x86_64 0.2.19-3.el7
openblas-threads.x86_64 0.2.19-3.el7
openblas-threads64.x86_64 0.2.19-3.el7
openblas-threads64_.x86_64 0.2.19-3.el7
scalapack-openmpi-devel.x86_64 2.0.2-15.el7
scalapack-common.x86_64 2.0.2-15.el7
blas.x86_64 3.4.2-5.el7
blas-devel.x86_64 3.4.2-5.el7
environment-modules.x86_64 3.2.10-10.el7
hwloc-libs.x86_64 1.7-5.el7
infinipath-psm.x86_64 3.3-0.g6f42cdb1bb8.2.el7
lapack.x86_64 3.4.2-5.el7
lapack-devel.x86_64 3.4.2-5.el7
libfabric.x86_64 1.1.0-2.el7
libpsm2.x86_64 0.7-4.el7
opensm-libs.x86_64 3.3.19-1.el7
elpa-openmpi.x86_64 2015.02.002-4.el7
elpa-openmpi-devel.x86_64 2015.02.002-4.el7
atlas.x86_64 3.10.1-10.el7
blacs-common.x86_64 2.0.2-15.el7
blacs-openmpi.x86_64 2.0.2-15.el7
compat-openmpi16.x86_64 1.6.4-10.el7
elpa-common.noarch 2015.02.002-4.el7
elpa-devel.noarch 2015.02.002-4.el7
libesmtp.x86_64 1.0.6-7.el7
Following patches were installed:
The following environment variables were set:
export USE_MPI=y
export LD_LIBRARY_PATH=/usr/lib64/openmpi/lib:$LD_LIBRARY_PATH
export PATH=/usr/lib64/openmpi/bin/:$PATH
export NWCHEM_MODULES="all"
export NWCHEM_TOP=/usr/local/nwchem-6.6
export BLAS_SIZE=4
export USE_64TO32=y
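A quick way to confirm that the shell doing the build actually sees these settings is to print them before running make. This is just a sketch; the function name is made up, and the list of variables to check is whatever you pass in.

```shell
# Print each named variable, and fail if any of them is unset or empty.
check_build_env() {
    missing=0
    for name in "$@"; do
        # Indirect lookup of the variable called $name (POSIX-compatible).
        eval "value=\${$name:-}"
        if [ -n "$value" ]; then
            echo "$name=$value"
        else
            echo "$name is not set"
            missing=$((missing + 1))
        fi
    done
    [ "$missing" -eq 0 ]
}
```

For the settings listed above this would be `check_build_env USE_MPI NWCHEM_MODULES NWCHEM_TOP BLAS_SIZE USE_64TO32`. The check is most useful when run in the same shell, immediately before make.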
The console command and reply were as follows:
mpirun -np 2 /usr/local/nwchem/bin/nwchem n2.in > n2-4.out
Program received signal SIGSEGV: Segmentation fault - invalid memory reference.
Backtrace for this error:
Program received signal SIGSEGV: Segmentation fault - invalid memory reference.
Backtrace for this error:
#0 0x7F8D4D4BE467
#1 0x7F8D4D4BEAAE
#2 0x7F8D4C7A924F
#0 0x7F35D21D7467
#1 0x7F35D21D7AAE
#2 0x7F35D14C224F
#3 0x2BCE6C0 in dcopy_
#3 0x2BCE6C0 in dcopy_
#4 0x2B310B3 in ycopy_
#4 0x2B310B3 in ycopy_
#5 0x9B6999 in pstat_init_ at pstat_init.F:32
#5 0x9B6999 in pstat_init_ at pstat_init.F:32
#6 0x406960 in MAIN__ at nwchem.F:204
#6 0x406960 in MAIN__ at nwchem.F:204
The output.out file contents:
corsair3.cns.uaf.edu.1197hfi_wait_for_device: The /dev/hfi1_0 device failed to appear after 15.0 seconds: Connection timed out
corsair3.cns.uaf.edu.1198hfi_wait_for_device: The /dev/hfi1_0 device failed to appear after 15.0 seconds: Connection timed out
argument 1 = n2.in
mpirun noticed that process rank 0 with PID 1197 on node corsair3 exited on signal 11 (Segmentation fault).
Thanks for any suggestions,
John Keller
11:56:33 AM PDT - Fri, Mar 24th 2017 |
minor correction
The nwchem executable was created - apparently normally - in the /usr/local/nwchem-6.6/bin/LINUX64 directory.
John K.
Edited On 11:57:41 AM PDT - Fri, Mar 24th 2017
3:49:56 PM PDT - Fri, Mar 24th 2017 |
Please send the output of the following commands
grep -i gemm $NWCHEM_TOP/src/nwdft/xc/xc_tabcd.F
ldd $NWCHEM_TOP/bin/LINUX64/nwchem
nm $NWCHEM_TOP/bin/LINUX64/nwchem|grep ygemm|head
5:07:36 PM PDT - Fri, Mar 24th 2017 |
[jkeller@corsair3 ~]$ grep -i gemm /usr/local/nwchem-6.6/src/nwdft/xc/xc_tabcd.F
call ygemm('T', 'N', nnia, nnja, nq, 1.d0, Bmat,
call ygemm('T', 'N', nnia, nnja, nq, 1.0d0, Emat,
call ygemm('T', 'N', nnia, nnja, nq, -1.d0, Bmat,
call ygemm('T', 'N', nnia, nnja, 3*nq,
call ygemm('T', 'N', nnia, nnja, 3*nq,
[jkeller@corsair3 ~]$ ldd /usr/local/nwchem-6.6/bin/LINUX64/nwchem
linux-vdso.so.1 => (0x00007fff2d18d000)
libmpi_usempi.so.5 => /usr/lib64/openmpi/lib/libmpi_usempi.so.5 (0x00007f14c81d3000)
libmpi_mpifh.so.12 => /usr/lib64/openmpi/lib/libmpi_mpifh.so.12 (0x00007f14c7f7d000)
libmpi.so.12 => /usr/lib64/openmpi/lib/libmpi.so.12 (0x00007f14c7c99000)
librt.so.1 => /lib64/librt.so.1 (0x00007f14c7a7d000)
libgfortran.so.3 => /lib64/libgfortran.so.3 (0x00007f14c7759000)
libm.so.6 => /lib64/libm.so.6 (0x00007f14c7457000)
libpthread.so.0 => /lib64/libpthread.so.0 (0x00007f14c723b000)
libgcc_s.so.1 => /lib64/libgcc_s.so.1 (0x00007f14c7024000)
libquadmath.so.0 => /lib64/libquadmath.so.0 (0x00007f14c6de8000)
libc.so.6 => /lib64/libc.so.6 (0x00007f14c6a27000)
libopen-rte.so.12 => /usr/lib64/openmpi/lib/libopen-rte.so.12 (0x00007f14c67a9000)
libopen-pal.so.13 => /usr/lib64/openmpi/lib/libopen-pal.so.13 (0x00007f14c6505000)
libdl.so.2 => /lib64/libdl.so.2 (0x00007f14c6301000)
libutil.so.1 => /lib64/libutil.so.1 (0x00007f14c60fd000)
libhwloc.so.5 => /lib64/libhwloc.so.5 (0x00007f14c5ec3000)
/lib64/ld-linux-x86-64.so.2 (0x00007f14c83d7000)
libnuma.so.1 => /lib64/libnuma.so.1 (0x00007f14c5cb6000)
libltdl.so.7 => /lib64/libltdl.so.7 (0x00007f14c5aac000)
[jkeller@corsair3 ~]$ nm /usr/local/nwchem-6.6/bin/LINUX64/nwchem|grep ygemm|head
0000000002b31100 T ygemm_
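To pick out the BLAS/LAPACK-related lines from a long ldd listing like the one above, the output can be piped through a small filter. This is a sketch only; the function name and the list of library name patterns are my own choice and may miss other vendors' libraries.

```shell
# Read `ldd` output on stdin and print the lines that name a BLAS/LAPACK-like
# shared library. The pattern list is illustrative, not exhaustive.
blas_libs_from_ldd() {
    if ! grep -Ei 'blas|lapack|atlas|mkl'; then
        echo "no external BLAS/LAPACK shared library in the list"
    fi
}
```

Usage: `ldd $NWCHEM_TOP/bin/LINUX64/nwchem | blas_libs_from_ldd`. For the listing above no such line appears, and the nm output shows ygemm_ defined inside the binary itself (type T).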
1:56:48 PM PDT - Tue, Mar 28th 2017 |
Do I need to send anything more to the Forum relating to this issue?
John Keller
5:58:48 PM PDT - Tue, Mar 28th 2017 |
John, to be honest with you, I am not quite sure what's going wrong in your installation on CentOS 7.3.
Please send me the output of the following
cd $NWCHEM_TOP/src/blas
make clean
1:12:35 AM PDT - Wed, Mar 29th 2017 |
Edo - I re-compiled as above, and now it works. (?) The only thing I did differently was add "make clean" before "make".
However, I am getting messages at the top of the .log file, "hfi_wait_for_device: The /dev/hfi1_0 device failed to appear after 15.0 seconds...", one per processor requested. According to some website, this is due to a bug in libfabric 1.3, which is included in CentOS 7.3. CentOS 7.2 has libfabric 1.1.
John K.
10:02:10 AM PDT - Wed, Mar 29th 2017 |
Thanks for your feedback.
I think I have found a work-around for the fifteen seconds hfi_wait_for_device issue. Please execute the following commands
mkdir -p $HOME/.openmpi
echo "mtl = psm" >> $HOME/.openmpi/mca-params.conf
This trick worked for my CentOS 7.3 installation.
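Note that `>>` appends, so running the two commands above more than once leaves duplicate lines in the file. A small sketch that adds the line only if it is not already there (the function name is made up; the file location is the per-user one used above):

```shell
# Add one "name = value" line to the per-user Open MPI parameter file,
# unless exactly that line is already present.
set_user_mca_param() {
    line=$1                               # e.g. "mtl = psm"
    conf_dir=${HOME}/.openmpi
    conf_file=$conf_dir/mca-params.conf
    mkdir -p "$conf_dir"
    touch "$conf_file"
    # -x: match the whole line, -F: treat the text literally.
    if ! grep -qxF -e "$line" "$conf_file"; then
        echo "$line" >> "$conf_file"
    fi
}
```

Usage: `set_user_mca_param "mtl = psm"`. Running it again leaves the file unchanged.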
12:34:16 PM PDT - Wed, Mar 29th 2017 |
Edo - Thanks. That works - but only for the user's account running NWC on that machine ("corsair3").
However, the lines are still there at the top of the .out file when WebMO runs the job on that machine. WebMO is running applications on this server supposedly under this user's account, but there must be some other way WebMO is running NWC.
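One thing that might be worth trying here: Open MPI also reads MCA parameters from environment variables named OMPI_MCA_<parameter>, so the setting does not have to come from a file in a particular home directory. Below is a minimal sketch of a wrapper; the function name is made up, and whether WebMO lets you change the command or environment it uses to start NWChem is an assumption I have not verified.

```shell
# Run a command with the "mtl = psm" setting passed through the environment
# instead of $HOME/.openmpi/mca-params.conf. The setting applies only to the
# command started here, not to the calling shell.
run_with_psm() {
    env OMPI_MCA_mtl=psm "$@"
}
```

Usage would look like `run_with_psm mpirun -np 2 $NWCHEM_TOP/bin/LINUX64/nwchem n2.in`. The same parameter can also be given directly on the command line as `mpirun --mca mtl psm ...`.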
John K.
2:39:27 PM PDT - Mon, Apr 17th 2017 |
Problem linking python
I'm having problems linking python into NWChem. I'm following the instructions for setting the environment variables. Here's a script for compiling:
#!/bin/bash
export USE_MPI="y"
export PYTHONVERSION="2.7"
export PYTHONHOME="/scr_haswell/swsides/opt/contrib-qmcpack/Python-2.7.13-sersh"
export PYTHONCONFIGDIR="config/../.."
export BLASOPT="-llapack -lblas"
export BLAS_SIZE="8"
export USE_ARUR="n"
export NWCHEM_TOP="/scr_haswell/swsides/directpkgs/nwchem-6.6"
echo ""
echo "-------------------------------------------------------------"
echo "Setup environment settings"
echo ""
echo "-------------------------------------------------------------"
echo ""
echo "Running make nwchem_config..."
echo ""
make nwchem_config NWCHEM_MODULES="all python" > test-config.log
sleep 2
echo ""
echo "Running make"
echo ""
make -j 8
The error is:
make nwchem.o stubs.o
make[1]: Entering directory `/scr_haswell/swsides/directpkgs/nwchem-6.6/src'
gfortran -m64 -ffast-math -Warray-bounds -fdefault-integer-8 -march=native -mtune=native -finline-functions -O2 -g -fno-aggressive-loop-optimizations -g -O -I. -I/scr_haswell/swsides/directpkgs/nwchem-6.6/src/include -I/scr_haswell/swsides/directpkgs/nwchem-6.6/src/tools/install/include -DGFORTRAN -DCHKUNDFLW -DGCC4 -DGCC46 -DEXT_INT -DLINUX -DLINUX64 -DPARALLEL_DIAG -DCOMPILATION_DATE="'`date +%a_%b_%d_%H:%M:%S_%Y`'" -DCOMPILATION_DIR="'/scr_haswell/swsides/directpkgs/nwchem-6.6'" -DNWCHEM_BRANCH="'6.6'" -c -o nwchem.o nwchem.F
gfortran -m64 -ffast-math -Warray-bounds -fdefault-integer-8 -march=native -mtune=native -finline-functions -O2 -g -fno-aggressive-loop-optimizations -g -O -I. -I/scr_haswell/swsides/directpkgs/nwchem-6.6/src/include -I/scr_haswell/swsides/directpkgs/nwchem-6.6/src/tools/install/include -DGFORTRAN -DCHKUNDFLW -DGCC4 -DGCC46 -DEXT_INT -DLINUX -DLINUX64 -DPARALLEL_DIAG -DCOMPILATION_DATE="'`date +%a_%b_%d_%H:%M:%S_%Y`'" -DCOMPILATION_DIR="'/scr_haswell/swsides/directpkgs/nwchem-6.6'" -DNWCHEM_BRANCH="'6.6'" -c -o stubs.o stubs.F
make[1]: Leaving directory `/scr_haswell/swsides/directpkgs/nwchem-6.6/src'
gfortran -Wl,--export-dynamic -L/scr_haswell/swsides/directpkgs/nwchem-6.6/lib/LINUX64 -L/scr_haswell/swsides/directpkgs/nwchem-6.6/src/tools/install/lib -o /scr_haswell/swsides/directpkgs/nwchem-6.6/bin/LINUX64/nwchem nwchem.o stubs.o -lnwctask -lccsd -lmcscf -lselci -lmp2 -lmoints -lstepper -ldriver -loptim -lnwdft -lgradients -lcphf -lesp -lddscf -ldangchang -lguess -lhessian -lvib -lnwcutil -lrimp2 -lproperty -lsolvation -lnwints -lprepar -lnwmd -lnwpw -lofpw -lpaw -lpspw -lband -lnwpwlib -lcafe -lspace -lanalyze -lqhop -lpfft -ldplot -lnwpython -ldrdy -lvscf -lqmmm -lqmd -letrans -lpspw -ltce -lbq -lmm -lcons -lperfm -ldntmc -lccca -lnwcutil -lga -larmci -lpeigs -lperfm -lcons -lbq -lnwcutil /scr_haswell/swsides/opt/contrib-qmcpack/Python-2.7.13-sersh/lib/python2.7/config/../../libpython2.7.so -llapack -lblas -lnwclapack -lnwcblas -L/scr_haswell/swsides/opt/contrib-qmcpack/mpich-3.1.4-shared/lib -lmpifort -lmpi -lrt -lm -lpthread -lnwcutil -lpython2.7 -lpthread -ldl -lutil -lm
/usr/bin/ld: cannot find -lpython2.7
collect2: error: ld returned 1 exit status
make: *** [all] Error 1
I can adjust the
line to
-L /scr_haswell/swsides/opt/contrib-qmcpack/Python-2.7.13-sersh/lib (where the shared lib actually is)
and this will work. But the logic of the make files seems to be broken here. Is there a fix or a patch? I have an automatic build system that needs to use the application's in-place build system, and I can't edit make files by hand.
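To confirm where the shared library actually lives without reading the link line by hand, a search under the Python prefix is enough. This is a sketch; the function name is made up, and the prefix and version are whatever you pass in.

```shell
# List the directories under a Python installation prefix that contain a
# libpython shared library for the given version.
find_libpython_dirs() {
    prefix=$1     # e.g. the value of PYTHONHOME
    version=$2    # e.g. 2.7
    find "$prefix" -name "libpython${version}*.so*" -exec dirname {} \; 2>/dev/null | sort -u
}
```

Usage: `find_libpython_dirs "$PYTHONHOME" 2.7`. As a possible way to avoid editing make files, GCC-based compilers also search the directories in the LIBRARY_PATH environment variable when resolving -l options, so exporting that directory in LIBRARY_PATH before running make may let -lpython2.7 resolve. I have not checked this against the NWChem make files, so treat it as an assumption.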
4:18:51 PM PDT - Mon, Apr 17th 2017 |
Could you unset the following env. variables and try to link again: PYTHONLIBTYPE, PYTHONCONFIGDIR
cd $NWCHEM_TOP/src
make link
