Memory problem on AIX
From NWChem
Viewed 5318 times, With a total of 13 Posts
|
Clicked A Few Times
Threads 2
Posts 9
|
|
4:20:38 AM PDT - Mon, Apr 16th 2012 |
|
Dear All,
I've compiled NWChem-6.0 on an IBM machine, and both the serial and parallel versions work, except
when I set the memory to a value greater than 2 GB.
Here is the compilation and system info. The machine has 32 IBM POWER6 cores and 256 GB of memory.
$ uname -a
AIX wcu02 3264193612 3 5 00C28FA44C00
$ xlf -qversion
IBM XL Fortran Enterprise Edition for AIX, V11.1
Version: 11.01.0000.0008
$ xlc -qversion
IBM XL C/C++ Enterprise Edition for AIX, V9.0
Version: 09.00.0000.0007
I compiled the code with gmake 3.8 after setting the following variables:
setenv NWCHEM_TOP /home/user/Source/nwchem-6.0
setenv NWCHEM_TARGET IBM
setenv LD_LIBRARY_PATH /usr/lpp/ppe.poe/lib
setenv INCLUDE /usr/include
setenv USE_MPI y
setenv LARGE_FILES TRUE
setenv MPI_LIB /usr/lpp/ppe.poe/lib
setenv MPI_INCLUDE "/usr/lpp/ppe.poe/include/thread -I/usr/lpp/ppe.poe/include"
setenv LIBMPI "-binitfini:poe_remote_main -lmpi_r -lvtd_r -lpthreads"
setenv NWCHEM_MODULES all
setenv HAS_BLAS TRUE
setenv BLASOPT "-lessl"
gmake nwchem_config
gmake >& make.log &
If needed I can also upload my make.log file.
Now, when I run the program with the memory increased above 2 GB, I get the following error:
MA fatal error: MA_sizeof: invalid nelem: -1988100096
which is the only thing in the output file except the list of arguments used to run the job (in my case only the input file name).
Does anyone know what could be the source of the problem and how to solve it?
(recompile with other variables set/define additional variables when running?)
Thanks in advance,
Lukasz
|
|
|
-
Edoapra Forum:Admin, Forum:Mod, bureaucrat, sysop
|
|
Forum Vet
Threads 7
Posts 1296
|
|
9:50:00 AM PDT - Mon, Apr 16th 2012 |
|
NWCHEM_TARGET=IBM is a 32-bit platform
|
Lukasz,
Since you compiled with NWCHEM_TARGET=IBM, you have generated a 32-bit executable that cannot address more than 2 GB of memory. To overcome this limit, you need to generate a 64-bit binary using NWCHEM_TARGET=IBM64.
Cheers, Edo
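A note on the number in the original error: the negative nelem is what a byte count slightly above 2 GB looks like once it is squeezed into a signed 32-bit integer. The sketch below is only an illustration in Python. The helper name as_int32 is hypothetical, and the 2200 MB request is an inference from the arithmetic, since the post does not say which value was used.

```python
MIB = 1024 * 1024

def as_int32(value: int) -> int:
    """Reinterpret an integer as a signed 32-bit value (two's complement wrap)."""
    return ((value + 2**31) % 2**32) - 2**31

# Largest value a signed 32-bit integer can hold: one byte short of 2 GB.
print(as_int32(2**31 - 1))

# Assumed request of 2200 MB (not stated in the post) expressed in bytes.
requested_bytes = 2200 * MIB
print(requested_bytes, "->", as_int32(requested_bytes))
```

Under that assumption the wrapped value matches the nelem reported by MA_sizeof, which is consistent with a 32-bit build.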
|
|
|
|
Clicked A Few Times
Threads 2
Posts 9
|
|
2:34:45 PM PDT - Mon, Apr 16th 2012 |
|
Dear Edo,
Thanks for your immediate response. I recompiled the code with NWCHEM_TARGET=IBM64
(after performing make realclean) and it compiled without problems. Now there seems to be another
issue. I tried to run one of the examples shipped with NWChem, namely
$NWCHEM_TOP/examples/rimp2/hf-scf.nwc and I got the following output:
argument 1 = hf-scf.nwc
Northwest Computational Chemistry Package (NWChem) 6.0
------------------------------------------------------
Environmental Molecular Sciences Laboratory
Pacific Northwest National Laboratory
Richland, WA 99352
Copyright (c) 1994-2010
Pacific Northwest National Laboratory
Battelle Memorial Institute
NWChem is an open-source computational chemistry package
distributed under the terms of the
Educational Community License (ECL) 2.0
A copy of the license is included with this distribution
in the LICENSE.TXT file
ACKNOWLEDGMENT
--------------
This software and its documentation were developed at the
EMSL at Pacific Northwest National Laboratory, a multiprogram
national laboratory, operated for the U.S. Department of Energy
by Battelle under Contract Number DE-AC05-76RL01830. Support
for this work was provided by the Department of Energy Office
of Biological and Environmental Research, Office of Basic
Energy Sciences, and the Office of Advanced Scientific Computing.
Job information
---------------
hostname = wcu02
program = ../../bin/IBM64/nwchem
date = Tue Apr 17 06:19:13 2012
compiled = Tue_Apr_17_06:07:58_2012
source = /wcu/w01/Source/nwchem-6.0
nwchem branch = 6.0
input = hf-scf.nwc
prefix = hf.
data base = ./hf.db
status = startup
nproc = 1
time left = -1s
Memory information
------------------
heap = 23107201 doubles = 176.3 Mbytes
stack = 23107201 doubles = 176.3 Mbytes
global = 46214400 doubles = 352.6 Mbytes (distinct from heap & stack)
total = 92428802 doubles = 705.2 Mbytes
verify = yes
hardfail = no
Directory information
---------------------
0 permanent = .
0 scratch = .
------------------------------------------------------------------------
util_set_rtdb_state: rtdb_put failed 911
------------------------------------------------------------------------
------------------------------------------------------------------------
current input line :
0: task scf
------------------------------------------------------------------------
------------------------------------------------------------------------
An error occured in the Runtime Database
------------------------------------------------------------------------
For more information see the NWChem manual at http://www.nwchem-sw.org/index.php/NWChem_Documentation
For further details see manual section:
rtdb_seq_put: put failed for "" in ./hf.db
Last System Error Message from Task 0:: A file or directory in the path name does not exist.
0:0: util_set_rtdb_state: rtdb_put failed:: 911
(rank:0 hostname:wcu02 pid:438646):ARMCI DASSERT fail. armci.c:ARMCI_Error():260 cond:0
ERROR: 0031-250 task 0: Terminated
I saw a post on this forum with a similar error but there was no solution given. Do you know
how this problem could be solved?
Thanks again for your time,
Cheers, Lukasz
|
|
|
-
Edoapra Forum:Admin, Forum:Mod, bureaucrat, sysop
|
|
Forum Vet
Threads 7
Posts 1296
|
|
4:15:25 PM PDT - Mon, Apr 16th 2012 |
|
Does any NWChem input file fail?
|
Lukasz,
Does any NWChem input file (e.g. $NWCHEM_TOP/src/nwchem.nw) fail with IBM64 or is this failure specific to hf-scf.nwc?
Thanks, Edo
|
|
|
|
Clicked A Few Times
Threads 2
Posts 9
|
|
4:56:47 PM PDT - Mon, Apr 16th 2012 |
|
I've tried 10 different jobs and in all cases I get the same error as in my previous post.
|
|
|
-
Edoapra Forum:Admin, Forum:Mod, bureaucrat, sysop
|
|
Forum Vet
Threads 7
Posts 1296
|
|
10:14:05 AM PDT - Tue, Apr 17th 2012 |
|
Please recompile the rtdb directory
|
Lukasz,
I suggest you use nwchem 6.1 (if possible), since it contains some IBM fixes (however, I don't see anything related to the rtdb problem you are seeing).
Could you please recompile the rtdb directory with no optimization and then relink?
Here are the instructions:
cd $NWCHEM_TOP/src/rtdb
make COPTIMIZE="-O0 -g"
cd ..
make FC=xlf link
Please let me know if this fixes the problem you are facing.
Cheers, Edo
|
|
|
|
Clicked A Few Times
Threads 2
Posts 9
|
|
10:57:27 AM PDT - Wed, Apr 18th 2012 |
|
Dear Edo,
I'm not using nwchem-6.1 because I cannot build the binary. Here are the last few lines from my make.log:
source='../ga-5-1/pario/elio/stat.c' object='pario/elio/stat.lo' libtool=yes \
DEPDIR=.deps depmode=aix /bin/sh ../ga-5-1/build-aux/depcomp \
/bin/sh ./libtool --tag=CC --mode=compile xlc -DHAVE_CONFIG_H -I. -I../ga-5-1 -I/usr/lpp/ppe.poe/include/thread -I/usr/lpp/ppe.poe/include -Ima -I../ga-5-1/ma -I../ga-5-1/LinAlg/lapack+blas -Iglobal/src -I../ga-5-1/global/src -I../ga-5-1/global/testing -I../ga-5-1/pario/dra -I../ga-5-1/pario/eaf -I../ga-5-1/pario/elio -I../ga-5-1/pario/sf -I../ga-5-1/armci/src/include -Iarmci/gaf2c -I../ga-5-1/armci/gaf2c -I../ga-5-1/armci/tcgmsg -c -o pario/elio/stat.lo ../ga-5-1/pario/elio/stat.c
libtool: compile: xlc -DHAVE_CONFIG_H -I. -I../ga-5-1 -I/usr/lpp/ppe.poe/include/thread -I/usr/lpp/ppe.poe/include -Ima -I../ga-5-1/ma -I../ga-5-1/LinAlg/lapack+blas -Iglobal/src -I../ga-5-1/global/src -I../ga-5-1/global/testing -I../ga-5-1/pario/dra -I../ga-5-1/pario/eaf -I../ga-5-1/pario/elio -I../ga-5-1/pario/sf -I../ga-5-1/armci/src/include -Iarmci/gaf2c -I../ga-5-1/armci/gaf2c -I../ga-5-1/armci/tcgmsg -c -M ../ga-5-1/pario/elio/stat.c -o pario/elio/stat.o
"../ga-5-1/pario/elio/stat.c", line 80.13: 1506-007 (S) "struct STATVFS" is undefined.
"../ga-5-1/pario/elio/stat.c", line 81.9: 1506-334 (S) Identifier bsize has already been defined on line 78 of "../ga-5-1/pario/elio/stat.c".
gmake[4]: *** [pario/elio/stat.lo] Error 1
gmake[3]: *** [all-recursive] Error 1
gmake[2]: *** [all] Error 2
gmake[1]: *** [build/.libs/libga.a] Error 1
gmake: *** [libraries] Error 1
But I think that should be addressed in a separate topic.
I followed your advice and recompiled rtdb in nwchem-6.0 without optimization. Now I can allocate up to 3400 mb;
allocating anything more results in an error:
argument 1 = RbYb_RSC_CCSDT_09.00.inp
MA error: MA_init: could not allocate 1835008208 bytes
------------------------------------------------------------------------
nwchem.F: ma_init failed (ga_uses_ma=F) 911
------------------------------------------------------------------------
------------------------------------------------------------------------
current input line :
0: memory total 3500 mb
------------------------------------------------------------------------
------------------------------------------------------------------------
------------------------------------------------------------------------
For more information see the NWChem manual at http://www.nwchem-sw.org/index.php/NWChem_Documentation
For further details see manual section:
0:0:nwchem.F: ma_init failed (ga_uses_ma=F):: 911
(rank:0 hostname:wcu02 pid:471870):ARMCI DASSERT fail. armci.c:ARMCI_Error():260 cond:0
Last System Error Message from Task 0:: There is not enough memory available now.
ERROR: 0031-250 task 0: Terminated
I'm sure I didn't exceed the available memory limit since at the time of running the job there
was 200 GB available:
Total Memory = 252672 mb
Memory = 252672 mb
FreeRealMemory = 194432 mb
I would be grateful for any suggestions.
Thanks, Lukasz
|
|
|
-
Edoapra Forum:Admin, Forum:Mod, bureaucrat, sysop
|
|
Forum Vet
Threads 7
Posts 1296
|
|
12:14:29 PM PDT - Wed, Apr 18th 2012 |
|
Lukasz,
I am quite clueless about why your memory allocation is failing under IBM64 (especially since
I have no access to an IBM64 platform and since things work OK under LINUX64).
Could you please try the following memory lines instead and tell me what happens?
Since NWChem uses local (a.k.a. MA) memory and global (GA) memory, I would like to see
what happens if you use a small amount of GA memory and keep increasing the MA memory instead.
So please try the following sequence (in separate input files, of course):
memory stack 1000 mb heap 300 mb global 250 mb
memory stack 1250 mb heap 300 mb global 250 mb
memory stack 1500 mb heap 300 mb global 250 mb
memory stack 1750 mb heap 300 mb global 250 mb
...
and so on by increasing the stack value until NWChem crashes.
Please let me know the outcome of this process, Edo
PS To fix the 6.1 tools compilation problem, you might want to use gcc instead of xlc as C compiler by setting CC=gcc
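If it helps, the sequence of memory lines can be generated rather than typed by hand. This is a minimal Python sketch, not part of NWChem. The function name memory_lines and its parameters are hypothetical, and the defaults simply reproduce the four lines listed above.

```python
def memory_lines(start_mb=1000, step_mb=250, count=4, heap_mb=300, global_mb=250):
    """Return NWChem memory directives with a growing stack and fixed heap/global."""
    return [
        f"memory stack {start_mb + i * step_mb} mb heap {heap_mb} mb global {global_mb} mb"
        for i in range(count)
    ]

# One directive per test input file; raise count to continue the sequence.
for line in memory_lines():
    print(line)
```

Each printed line goes into its own input file, and the first one that crashes marks the limit on local (stack plus heap) memory.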
|
|
|
|
Clicked A Few Times
Threads 2
Posts 9
|
|
4:21:17 PM PDT - Wed, Apr 18th 2012 |
|
Dear Edo,
Thanks again for your valuable advice. I ran the jobs you asked for, and the code already crashed for memory stack 1500 mb heap 300 mb global 250 mb, with a message similar to the previous one:
MA error: MA_init: could not allocate 1887437008 bytes
------------------------------------------------------------------------
nwchem.F: ma_init failed (ga_uses_ma=F) 911
------------------------------------------------------------------------
------------------------------------------------------------------------
current input line :
0: memory stack 1500 mb heap 300 mb global 250 mb
------------------------------------------------------------------------
------------------------------------------------------------------------
------------------------------------------------------------------------
For more information see the NWChem manual at http://www.nwchem-sw.org/index.php/NWChem_Documentation
For further details see manual section:
0:0:nwchem.F: ma_init failed (ga_uses_ma=F):: 911
(rank:0 hostname:wcu02 pid:356372):ARMCI DASSERT fail. armci.c:ARMCI_Error():260 cond:0
Last System Error Message from Task 0:: There is not enough memory available now.
ERROR: 0031-250 task 0: Terminated
Compiling nwchem-6.1 fails also with gcc (my gcc version is 4.2.0) when compiling the same
routine as previously, here's the last portion of the make.log
libtool: compile: gcc -DHAVE_CONFIG_H -I. -I../ga-5-1 -I/usr/lpp/ppe.poe/include/thread64 -I/usr/lpp/ppe.poe/include -Ima -I../ga-5-1/ma -I../ga-5-1/LinAlg/lapack+blas -Iglobal/src -I../ga-5-1/global/src -I../ga-5-1/global/testing -I../ga-5-1/pario/dra -I../ga-5-1/pario/eaf -I../ga-5-1/pario/elio -I../ga-5-1/pario/sf -I../ga-5-1/armci/src/include -Iarmci/gaf2c -I../ga-5-1/armci/gaf2c -I../ga-5-1/armci/tcgmsg -MT pario/elio/stat.lo -MD -MP -MF pario/elio/.deps/stat.Tpo -c ../ga-5-1/pario/elio/stat.c -o pario/elio/stat.o
../ga-5-1/pario/elio/stat.c: In function 'elio_stat':
../ga-5-1/pario/elio/stat.c:77: error: storage size of 'ufs_statfs' isn't known
gmake[4]: *** [pario/elio/stat.lo] Error 1
gmake[3]: *** [all-recursive] Error 1
gmake[2]: *** [all] Error 2
gmake[1]: *** [build/.libs/libga.a] Error 1
gmake: *** [libraries] Error 1
I already had trouble compiling other software on AIX in the past because the OS
lacks some of the standard linux headers, libs and commands, but I was always able to
figure out what is missing and install it. In this case I have no idea where the problem lies.
Anyway I would appreciate any piece of advice on how to get one of the versions of nwchem working.
Cheers, Lukasz
|
|
|
-
Edoapra Forum:Admin, Forum:Mod, bureaucrat, sysop
|
|
Forum Vet
Threads 7
Posts 1296
|
|
10:45:59 AM PDT - Fri, Apr 20th 2012 |
|
Lukasz,
The memory experiment showed that you cannot go beyond 1.8 GB of local memory, and I have no explanation for this, since we have not seen this problem on 64-bit Linux. Do you define ARMCI_NETWORK, by any chance?
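For reference, the byte counts in the two MA_init failures reported earlier in the thread decode as follows. This is a small Python sketch; the helper name split_mib is hypothetical, and reading the 208 leftover bytes as allocator bookkeeping is an assumption, not something the logs state.

```python
MIB = 1024 * 1024

def split_mib(nbytes: int) -> tuple[int, int]:
    """Split a byte count into whole MiB and leftover bytes."""
    return divmod(nbytes, MIB)

# "memory total 3500 mb": the request is half of the total, which matches the
# heap + stack share shown in the Memory information block of the earlier log.
print(split_mib(1835008208))

# "memory stack 1500 mb heap 300 mb global 250 mb": stack + heap.
print(split_mib(1887437008))

# Both failing requests are still below 2 GiB.
print(1835008208 < 2**31, 1887437008 < 2**31)
```

Both requests come out as stack plus heap in MiB with the same small remainder, so the failures concern local (MA) memory only, not the global (GA) part.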
As far as the nwchem-6.1 compilation problem is concerned,
you need to edit $NWCHEM_TOP/src/tools/ga-5-1/pario/elio/eliop.h
and add the following 3 lines just after line 42:
#else
# include <sys/statvfs.h>
# define STATVFS statvfs
Or, in patch format:
$ svn diff
Index: eliop.h
=======================================================
--- eliop.h (revision 9865)
+++ eliop.h (working copy)
@@ -40,6 +40,9 @@
# include <sys/vfs.h>
# define STATVFS statfs
# define NO_F_FRSIZE
+#else
+# include <sys/statvfs.h>
+# define STATVFS statvfs
#endif
#ifdef WIN32
|
|
|
|
Clicked A Few Times
Threads 2
Posts 9
|
|
5:03:38 PM PDT - Mon, Apr 23rd 2012 |
|
Edo,
The patch works fine. Now the 6.1 version compiles, but I have the same problem as with 6.0: I cannot allocate more than 1.8 GB of local memory. I don't define ARMCI_NETWORK.
Cheers, Lukasz
|
|
|
-
Edoapra Forum:Admin, Forum:Mod, bureaucrat, sysop
|
|
Forum Vet
Threads 7
Posts 1296
|
|
9:52:11 AM PDT - Tue, Apr 24th 2012 |
|
Lukasz
Have you ever managed to use 2 GB of memory (or more) with any other program on your AIX system?
What is the output of "ulimit -a"?
Cheers, Edo
|
|
|
-
Edoapra Forum:Admin, Forum:Mod, bureaucrat, sysop
|
|
Forum Vet
Threads 7
Posts 1296
|
|
10:01:07 AM PDT - Tue, Apr 24th 2012 |
|
Might have found the culprit
|
Lukasz
Please ignore the posting I made a few minutes ago, since I might have found the root cause of the problem.
The NWChem makefiles use a hardwired link option that limits the amount of memory to less than 2 GB (bmaxdata:0x80000000). In order to use, say, 8 GB you would need to set bmaxdata:0x200000000.
This is set at line 933 of $NWCHEM_TOP/src/config/makefile.h
The line should be changed from
LDOPTIONS += -bmaxstack:0x80000000 -bmaxdata:0x80000000 # needed because of bigtoc
to
LDOPTIONS += -bmaxstack:0x80000000 -bmaxdata:0x200000000 # needed because of bigtoc
You do not need to recompile nwchem for this; just re-link it by typing
make FC=xlf link
Let me know how it goes
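As a quick check of the hexadecimal values above, the limit for a given number of gigabytes can be computed instead of counted by hand. This is a Python sketch; the helper name maxdata_flag is hypothetical, and whether the linker accepts a particular value on a given AIX release is not something this thread confirms.

```python
GIB = 1024 ** 3

def maxdata_flag(gib: int) -> str:
    """Format a -bmaxdata link option for a limit of `gib` GiB."""
    return f"-bmaxdata:0x{gib * GIB:x}"

# The hardwired value in makefile.h corresponds to 2 GiB.
print(maxdata_flag(2))

# The replacement suggested above corresponds to 8 GiB.
print(maxdata_flag(8))
```

Other sizes follow the same pattern, for example maxdata_flag(16) for a 16 GiB limit.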
|
|
|
|
Clicked A Few Times
Threads 2
Posts 9
|
|
5:28:25 AM PDT - Wed, Apr 25th 2012 |
|
Edo,
It seems that this was the cause of the memory problem. After changing bmaxdata, all the
limitations on the memory are gone. I ran a few test jobs (serial and parallel) and the program works fine.
Thanks again for your kind help.
Cheers, Lukasz
|
|
|