CreateSharedRegion: kr_malloc Numerical result out of range
DBauer (Just Got Here; Threads 1, Posts 2)
7:57:15 AM PDT - Mon, May 21st 2012
Someone in the lab that I work in has been trying to run some calculations with NWChem and I've been trying to help him get started running it with MPI. It has been a bumpy ride, however. At first, we were getting errors about not being able to allocate a shared block of memory. SHMMAX was already plenty high (as large as all of the physical memory), so I created a swap file, started the calculations again, and waited.
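For reference, checking the shared-memory limit and adding the swap file looked roughly like the following (the 16G size is just a placeholder, not the actual value we used):
sysctl kernel.shmmax                                  # shared-memory segment limit, in bytes
sudo dd if=/dev/zero of=/swapfile bs=1M count=16384   # create a 16 GB swap file (placeholder size)
sudo chmod 600 /swapfile
sudo mkswap /swapfile
sudo swapon /swapfile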
Now they are crashing again, but this time with a much different error message. From what I can tell, it is another problem with allocating shared memory, but it seems like NWChem is passing an invalid (i.e., negative) number to the allocation function.
Here is the relevant error message:
0:CreateSharedRegion:kr_malloc failed KB=: -772361
(rank:0 hostname:vivaldi.chem.utk.edu pid:3128):ARMCI DASSERT fail. ../../ga-5-1/armci/src/memory/shmem.c:Create_Shared_Region():1188 cond:0
Last System Error Message from Task 0:: Numerical result out of range
application called MPI_Abort(comm=0x84000007, -772361) - process 0
rank 0 in job 2 vivaldi.chem.utk.edu_35229 caused collective abort of all ranks
exit status of rank 0: killed by signal 9
And just in case, here is a link to the full output file:
http://web.eecs.utk.edu/~dbauer3/nwchem/macrofe_full631f.out
Running under Ubuntu 11.04 with MPICH2.
Bert (NWChem Developer, Forum Admin; Forum Vet; Threads 5, Posts 598)
8:50:10 AM PDT - Mon, May 21st 2012
For one, you are requesting 22 Gbyte per processor. Assuming you are running on 8 cores per node, you are asking (potentially) for 176 Gbyte of memory for the calculation.
Remember, the memory keyword is per processor or process, not for the whole calculation! You should keep your memory allocation per process below (available memory per node) / (number of processes per node).
Given the size of the calculation I don't believe you need that much memory.
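As a rough illustration (the numbers here are hypothetical, not taken from your system): on a node with 64 Gbyte of RAM running 8 NWChem processes, each process should request well under 64/8 = 8 Gbyte, e.g. something like
memory total 7000 mb
which leaves some headroom for the OS and buffers.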
Bert
Quote: DBauer, May 21st 2:57 pm (see the original post above)
DBauer (Just Got Here; Threads 1, Posts 2)
9:13:37 AM PDT - Mon, May 21st 2012
Okay, that sounds like a problem with not reading the documentation quite close enough. In any case, I'll work with the chemist running the code to get more reasonable values and try running again. The system should have enough RAM + swap to get the job done.
Thanks for the quick answer.
EDIT:
I assume that if the calculation requires more memory than we allow it per process then it'll swap out to the hard drive or something to free up space.
Edited On 9:35:39 AM PDT - Mon, May 21st 2012 by DBauer

KarlB (Clicked A Few Times; Threads 0, Posts 5)
9:00:33 AM PDT - Thu, May 24th 2012
Continuing trouble with calculation.
Hello all,
I'm working with DBauer on the calculation he mentioned previously in this thread. We've also been working on a much larger cluster that has 96 GB of RAM per node in an effort to complete this calculation.
Yesterday I did a run on 16 cores with the full 96 GB of RAM split between them (6 GB per core). In NWChem I used the directive "memory total 6144 mb" to assign the memory to be used. This calculation failed at the same point all the previous ones have, with the error: Error Run 1
I then ran the same calculation in the same manner, this time removing the memory directive and allowing NWChem to assign the memory itself. This calculation also failed, with the error: Error Run 2
The cluster I was running on has a monitoring system that allows node performance to be reviewed. I checked it and found that in both calculations NWChem never used more than ~4 GB of RAM, which I find puzzling. The 4 GB threshold makes me suspicious of a 32-bit limit somewhere.
Link to Images: Images
Link to Input 1: Input run 1
Link to Input 2: Input run 2
Outputs can be provided if more clarification is needed.
Notes:
1st calculation ran from ~1pm-8pm
2nd calculation ran from ~10:30pm-5:30am
The calculation that was running before, until ~1pm, was a Raman calculation I did with 48 cores and "memory 1800 mb".
Edited On 9:03:47 AM PDT - Thu, May 24th 2012 by KarlB

Bert (NWChem Developer, Forum Admin; Forum Vet; Threads 5, Posts 598)
9:23:50 AM PDT - Thu, May 24th 2012
The default memory settings for NWChem are pretty small, 400mb I think (unless it was changed at compile time). So, simply without the memory keyword the calculation does not have enough memory to proceed.
Note, you may want to check how much memory the OS is taking. Fully loading the memory with NWChem will create problems. Swapping memory is not going to work well or at all.
Let's take the 48-core, 1800 mb case. This means that each processor is going to allocate 450 mb of local heap, 450 mb of local stack (these two are not the problem), and 900 mb of global shared memory. Now, on a single node this global memory is allocated in shared-memory segments: 900 mb * 48 cores means the code will potentially try to allocate a single segment of over 43 Gbyte of memory.
Let me try and run the input. I'll try 16 cores with 3500 mb.
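(Applying the same quarter/quarter/half split to that suggestion: "memory total 3500 mb" would give each process roughly 875 mb of heap, 875 mb of stack and 1750 mb of global memory, so 16 processes on a single node would ask for about 28 Gbyte of shared memory, which still has to fit under the node's shared-memory limit.)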
Bert
Quote: KarlB, May 24th 4:00 pm (see the original post above)
Bert (NWChem Developer, Forum Admin; Forum Vet; Threads 5, Posts 598)
11:43:18 AM PDT - Mon, Jun 4th 2012
Still working on this case, running it among the other calculations in the long queues on our system.
It's not a small calculation. I generally would try and run this on 128 cores or so.
Bert
Bert (NWChem Developer, Forum Admin; Forum Vet; Threads 5, Posts 598)
3:48:16 PM PDT - Wed, Jun 13th 2012
I was able to run this calculation on 32 processors with the memory keyword set as follows:
memory heap 100 mb stack 1000 mb global 2400 mb
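(That adds up to 100 + 1000 + 2400 = 3500 mb per process; the global portion is what goes into the node's shared-memory segments, so it is the 2400 mb times the number of processes per node that has to stay under the shared-memory limit.)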
Bert
Quote: Bert, Jun 4th 6:43 pm (see above)
KarlB (Clicked A Few Times; Threads 0, Posts 5)
12:16:33 PM PDT - Mon, Jun 25th 2012
Hey Bert,
Thank you for your time and trying to help me solve this problem.
I've tried it twice, in what I thought were good and then improved conditions, using your numbers, and it still doesn't seem to work for me. How did you arrive at those numbers, and what might be different between my system and yours that's causing the problem?
Let me know if you need the full files for any reason or more info of some sort.
(Job file1)
#$ -N Title
#$ -pe threads 26
#$ -l mem=3500M
#$ -q medium_chi
#$ -l proc_vendor=AMD
#$ -cwd
force coredumpsize 1
module load nwchem/6.1
/data/apps/openmpi/1.4.3-gcc/bin/mpirun nwchem macrofe_vibded.nw > macrofe_vibded.out
(/Job file1)
(input1)
start
echo
title "macrovib"
scratch_dir /somedirectories/.scratch
memory heap 100 mb stack 1000 mb global 2400 mb
geometry noautoz
Tons and Tons of Geo
end
charge 2
basis
* library 6-31g**
end
dft
xc b3lyp
iterations 1000
noio
direct
grid nodisk
end
driver
maxiter 1000
end
task dft freq
(/input1)
(result1)
...
2 2 0 0 -215.909422 -13516.764569 -13516.764569 26817.619717
2 1 1 0 -7.361729 9.715923 9.715923 -26.793575
2 1 0 1 0.102928 -0.219199 -0.219199 0.541326
2 0 2 0 -190.652630 -18119.815610 -18119.815610 36048.978590
2 0 1 1 0.084968 -0.339414 -0.339414 0.763796
2 0 0 2 -261.530067 -3482.457806 -3482.457806 6703.385545
0:0:ndai_get failed:: -1977
(rank:0 hostname:chi16 pid:23323):ARMCI DASSERT fail. ../../ga-5-1/armci/src/common/armci.c:ARMCI_Error():208 cond:0
(/result1)
************************************************************************
****************************** Run 2 ************************************
************************************************************************
(Job file2)
#$ -N macfe_ded
#$ -pe threads 24
#$ -l mem=4G
#$ -q medium_chi
#$ -l proc_vendor=AMD
#$ -cwd
force coredumpsize 1
module load nwchem/6.1
/data/apps/openmpi/1.4.3-gcc/bin/mpirun nwchem macrofe_vibded.nw > macrofe_vibded.out
(/job file2)
(input2)
Identical save...
memory heap 117 mb stack 1171 mb global 2808 mb
(/input2)
(result2)
...
2 2 0 0 -215.909422 -13516.764569 -13516.764569 26817.619717
2 1 1 0 -7.361729 9.715923 9.715923 -26.793575
2 1 0 1 0.102928 -0.219199 -0.219199 0.541326
2 0 2 0 -190.652630 -18119.815610 -18119.815610 36048.978590
2 0 1 1 0.084968 -0.339414 -0.339414 0.763796
2 0 0 2 -261.530067 -3482.457806 -3482.457806 6703.385545
0:0:ndai_get failed:: -1977
(rank:0 hostname:chi14 pid:14312):ARMCI DASSERT fail. ../../ga-5-1/armci/src/common/armci.c:ARMCI_Error():208 cond:0
(/result2)
Bert (NWChem Developer, Forum Admin; Forum Vet; Threads 5, Posts 598)
2:56:30 PM PDT - Mon, Jun 25th 2012
What are the error or warning messages in the error file (not the output file)? Any armci_set_mem_offset messages?
One thing I set is "setenv ARMCI_DEFAULT_SHMMAX 4096"
I was also running with 4 cores/node on our 8 core/node system (i.e. half filled).
Looks like you are running 26 or 24 processes respectively? I generally run "mpirun -np 32". We're running HP-MPI, not Open MPI, and we are definitely not running with threads.
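(Roughly, in your job-script terms that would look something like the sketch below. The parallel-environment name "mpi" is a placeholder, and the setenv line assumes a csh-style shell; under bash use "export ARMCI_DEFAULT_SHMMAX=4096" instead.)
#$ -N macfe_ded
#$ -pe mpi 32
#$ -cwd
# "mpi" above is a placeholder PE name; use whatever MPI parallel environment your cluster actually provides
setenv ARMCI_DEFAULT_SHMMAX 4096
module load nwchem/6.1
/data/apps/openmpi/1.4.3-gcc/bin/mpirun -np 32 nwchem macrofe_vibded.nw > macrofe_vibded.out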
Bert
Quote: KarlB, Jun 25th 7:16 pm (see the original post above)
KarlB (Clicked A Few Times; Threads 0, Posts 5)
9:50:54 PM PDT - Mon, Jun 25th 2012
Will look into the setenv ARMCI_DEFAULT_SHMMAX 4096 in the morning...
The cluster I'm currently running on has 48-core nodes with ~94 GB of RAM per node, so running a single node at 24 cores gives me a half-filled node as well (with full RAM usage).
It seems the error codes do indeed have something about armci memory usage...
(Run1 Error)
force: Command not found.
2: WARNING:armci_set_mem_offset: offset changed -204456148992 to -204454051840
Last System Error Message from Task 0:: Inappropriate ioctl for device
--------------------------------------------------------------------------
MPI_ABORT was invoked on rank 0 in communicator MPI COMMUNICATOR 4 DUP FROM 0
with errorcode -1977.
NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun has exited due to process rank 0 with PID 23323 on
node chi16 exiting without calling "finalize". This may
have caused other processes in the application to be
terminated by signals sent by mpirun (as reported here).
--------------------------------------------------------------------------
(/run1 error)
***************************
***********Run2***********
***************************
(run2 error)
force: Command not found.
23: WARNING:armci_set_mem_offset: offset changed 440439992320 to 440442089472
Last System Error Message from Task 0:: Inappropriate ioctl for device
--------------------------------------------------------------------------
MPI_ABORT was invoked on rank 0 in communicator MPI COMMUNICATOR 4 DUP FROM 0
with errorcode -1977.
NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun has exited due to process rank 0 with PID 14312 on
node chi14 exiting without calling "finalize". This may
have caused other processes in the application to be
terminated by signals sent by mpirun (as reported here).
--------------------------------------------------------------------------
(/run2 error)
Thank you for your assistance!
Karl
KarlB (Clicked A Few Times; Threads 0, Posts 5)
2:14:07 PM PDT - Tue, Jun 26th 2012
Hey Bert,
Please forgive my ignorance here, but where/how am I setting ARMCI_DEFAULT_SHMMAX? In the kernel directory like shmall/shmmax/shmmni?
Karl
Edited On 7:33:33 AM PDT - Wed, Jun 27th 2012 by KarlB

Bert (NWChem Developer, Forum Admin; Forum Vet; Threads 5, Posts 598)
2:27:47 PM PDT - Tue, Jun 26th 2012
As an environment variable. Has nothing to do with the kernel, but rather with GA and NWChem.
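For example (a minimal sketch; set it before the mpirun line in the job script so every NWChem process inherits it):
setenv ARMCI_DEFAULT_SHMMAX 4096     # csh/tcsh
export ARMCI_DEFAULT_SHMMAX=4096     # bash/sh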
Bert
Quote: KarlB, Jun 26th 9:14 pm (see above)
Edoapra (Forum Admin; Forum Vet; Threads 3, Posts 867)
11:47:34 AM PDT - Tue, Oct 2nd 2012
Karl,
You need to take care of both.
ARMCI_DEFAULT_SHMMAX (which is given in MB) has to be less than or equal to kernel.shmmax (which is given in bytes).
For example, if the value of kernel.shmmax is 4294967296, as in the example below,
ARMCI_DEFAULT_SHMMAX can be at most 4096 (4294967296 = 4096*1024*1024)
$ sysctl kernel.shmmax
kernel.shmmax = 4294967296
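(A rough sketch of checking and matching the two, assuming bash; note that raising kernel.shmmax requires root, so it is normally an admin task:)
sysctl kernel.shmmax                      # kernel limit, in bytes
# sysctl -w kernel.shmmax=4294967296      # admin-only: raise it, e.g. to 4 Gbyte
export ARMCI_DEFAULT_SHMMAX=4096          # ARMCI limit, in MB; keep it at or below the kernel value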
Cheers, Edo
Quote: KarlB, Jun 26th 1:14 pm (see above)