Is there a systematic way of finding out how much memory is needed?
From NWChem
Viewed 2194 times, With a total of 10 Posts
Just Got Here
Threads 1
Posts 4
2:04:13 AM PST - Tue, Nov 13th 2012 |
Dear nwchem users,
I'm using NWChem on an InfiniBand cluster and struggling with memory problems when doing TDDFT. The input is:
Title "dye2Nex"
Start dye2Nex
set fock:replicated logical .false.
permanent_dir /Data/Users/syesylevsky/QM/dye2/N
memory total 400 mb
echo
charge 0
geometry noautosym units angstrom
C 0.00000 0.00000 0.00000
C 1.36800 0.00000 0.00000
C -0.774000 1.26900 0.00000
C 0.0560000 2.48300 0.00300000
O 2.11800 1.15900 -0.00500000
O -0.652000 -1.20200 0.00400000
C 2.28500 3.49300 0.00900000
C 1.70000 4.74800 0.0160000
C 0.309000 4.88600 0.0130000
C -0.507000 3.76700 0.00500000
O -1.99700 1.27200 0.00200000
C 1.45200 2.36000 0.00300000
H -1.58400 -1.02300 0.0550000
H 3.37500 3.38100 0.00500000
H 2.33400 5.64100 0.0240000
H -0.135000 5.88700 0.0160000
H -1.59900 3.87300 -0.00100000
C 2.22300 -1.17800 -0.00300000
C 4.14100 -2.24800 0.313000
O 3.55400 -0.999000 0.414000
C 5.46200 -2.57000 0.622000
C 5.82700 -3.89700 0.443000
C 4.91700 -4.85600 -0.0240000
C 3.60400 -4.52700 -0.330000
C 1.97000 -2.48200 -0.356000
C 3.20900 -3.20300 -0.158000
H 6.16900 -1.81700 0.984000
H 5.25600 -5.89000 -0.149000
H 2.89600 -5.27800 -0.693000
H 1.03900 -2.91400 -0.717000
H 6.85300 -4.20500 0.672000
end
ecce_print ecce.out
basis "ao basis" spherical print
H library "3-21G"
O library "3-21G"
C library "3-21G"
END
dft
mult 1
XC b3lyp
iterations 5000
mulliken
direct
end
driver
default
maxiter 2000
end
tddft
nroots 3
target 1
end
task tddft optimize
When I'm running this I get the following error:
2: error ival=5
(rank:2 hostname:mesocomte87 pid:9679):ARMCI DASSERT fail.
../../ga-5-1/armci/src/devices/openib/openib.c:armci_call_data_server():2193
cond:(pdscr->status==IBV_WC_SUCCESS)
1: error ival=10
(rank:1 hostname:mesocomte65 pid:18582):ARMCI DASSERT fail.
../../ga-5-1/armci/src/devices/openib/openib.c:armci_send_complete():459
cond:(pdscr->status==IBV_WC_SUCCESS)
5: error ival=10
(rank:5 hostname:mesocomte19 pid:20956):ARMCI DASSERT fail.
../../ga-5-1/armci/src/devices/openib/openib.c:armci_call_data_server():2193
cond:(pdscr->status==IBV_WC_SUCCESS)
0:Terminate signal was sent, status=: 15
(rank:0 hostname:mesocomte21 pid:30562):ARMCI DASSERT fail.
../../ga-5-1/armci/src/common/signaltrap.c:SigTermHandler():472 cond:0
As was advised on this forum, I set
export ARMCI_DEFAULT_SHMMAX=2048
but this does not help. I spent a lot of time playing with different memory values and finally got it working with
memory stack 150 mb heap 50 mb global 200 mb
but this was blind guesswork, which I really don't want to do for every new system or basis set.
EDIT: it crashed after a few hours. I still can't get it running.
Is there a good systematic way of finding out how much memory a particular job needs to run normally in a parallel environment? Which diagnostic messages should I use for this?
Thank you very much in advance!
Semen
Edited On 5:34:33 AM PST - Tue, Nov 13th 2012 by Yesint
Clicked A Few Times
Threads 4
Posts 13
10:30:39 AM PST - Tue, Nov 13th 2012 |
I had some related problems recently. How much memory do you have on your system? Have you tried increasing the total memory drastically?
For an 8 processor job, I use:
memory total 22 gb
Just Got Here
Threads 1
Posts 4
11:31:26 PM PST - Tue, Nov 13th 2012 |
Quote:Andrew.yeung Nov 13th 9:30 am: I had some related problems recently. How much memory do you have on your system? Have you tried increasing the total memory drastically?
For an 8 processor job, I use:
memory total 22 gb
In principle I can ask for up to 12 GB per process, but then this job will stay in the queue forever (it will saturate the nodes completely and will get very low priority). My objective is to allocate just enough to get it running but keep the waiting time reasonable. My molecule is rather small and on 1 CPU it runs under 1 GB of memory, but I can't understand how to estimate memory consumption in parallel mode.
Bert Forum:Admin, Forum:Mod, NWChemDeveloper, bureaucrat, sysop
Forum Vet
Threads 5
Posts 598
2:20:38 PM PST - Wed, Nov 14th 2012 |
How memory allocation works in NWChem
Let's start at the beginning:
A. The memory keyword in the input specifies the memory per process, generally per processor and NOT per job.
Hence, if you tried to specify "memory total 22 gb" with 8 processors on one node, you are asking for 176 GB on that node to make this job run.
B. When you specify "memory total xxx mb", the amount xxx gets split up in 25% heap, 25% stack, and 50% global.
Heap: For most applications heap is not important and could be a much smaller block of memory. Generally we set this to 100 mb at most if we specify explicitly.
Stack: Effectively your local memory for each processor to use for the calculations.
Global: Memory used to store arrays that are globally accessible. Effectively it is one block of <size global> times <# of processors used on the node>, which can get very big.
C. Specifying memory explicitly, I recommend you use the format:
memory heap 100 mb stack 1000 mb global 2400 mb
The example here makes available 3500 mb, i.e. 3.5 GB per processor, and would require 3.5 GB times the # of processors running on the node to be physically available. You cannot use virtual memory. You also need to leave space for the OS, so we use the above example when we have 8 processors and 32 GB of memory per node.
D. How much memory does the calculation need? The amount and distribution of stack and global needed is strongly dependent on the application. Generally an equal distribution works fine to start with. The code will indicate if it runs out of local or global memory, and you can redistribute. For coupled cluster (TCE) calculations you will generally need more global than stack memory (above example is a TCE style input). Tiling is important for TCE, to reduce local memory requirements.
E. What about those pesky "ARMCI DASSERT fail" errors and ARMCI_DEFAULT_SHMMAX. On certain architectures ARMCI_DEFAULT_SHMMAX needs to be set to generate one big block of global memory per node (i.e. combine all the global memory pieces of each processor on a node into one big block) for faster access. Generally ARMCI_DEFAULT_SHMMAX should be set to <amount of global memory per process> times <# of processors used by calculation on node>. By the latter I mean the number of processors you are actually using. If you only use 4 on a node, the multiplier is only 4.
Hope this helps,
Bert
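The arithmetic in points A, C, and E can be sketched with a short shell snippet. The 8-process, 32 GB node matches Bert's example in point C; everything else follows from the memory directive (values are illustrative, not a recommendation for any particular cluster):

```shell
# Per-process memory from "memory heap 100 mb stack 1000 mb global 2400 mb"
HEAP_MB=100
STACK_MB=1000
GLOBAL_MB=2400
PROCS_PER_NODE=8          # processes actually running on the node (point E)

PER_PROC_MB=$((HEAP_MB + STACK_MB + GLOBAL_MB))   # 3500 MB per process (point A)
PER_NODE_MB=$((PER_PROC_MB * PROCS_PER_NODE))     # 28000 MB must fit in physical RAM (point C)
SHMMAX_MB=$((GLOBAL_MB * PROCS_PER_NODE))         # combined global block per node (point E)

echo "per process: ${PER_PROC_MB} MB"
echo "per node:    ${PER_NODE_MB} MB"
echo "ARMCI_DEFAULT_SHMMAX candidate: ${SHMMAX_MB} MB"
```

With 32 GB per node this leaves about 4 GB for the OS, consistent with point C; note that ARMCI may additionally impose its own upper limit on the SHMMAX value.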
Clicked A Few Times
Threads 4
Posts 13
4:27:06 PM PST - Wed, Nov 14th 2012 |
Thanks for correcting my mistake, Bert.
Is there a reason why you break up memory this way (heap 100 mb/stack 1000 mb/global 2400 mb), instead of the 25-25-50% by default?
Clicked A Few Times
Threads 2
Posts 8
1:54:40 AM PST - Fri, Nov 16th 2012 |
Hi Bert!
Thanks for your post, but I still have a question about the ARMCI_DEFAULT_SHMMAX.
Suppose I use 2 nodes with 16 cores each node, each core has 4GB memory, and I specify:
memory heap 100 mb stack 400 mb global 3200 mb
that is to say, 3200 MB × 16 = 51200 MB of global memory per node.
If I set
setenv ARMCI_DEFAULT_SHMMAX 51200
I get this warning:
incorrect ARMCI_DEFAULT_SHMMAX should be <1,8192>mb and 2^N Found=51200
incorrect ARMCI_DEFAULT_SHMMAX should be <1,8192>mb and 2^N Found=51200
Do you know what the problem is?
Thank you!
Quote:Bert Nov 14th 1:20 pm: Let's start at the beginning:
E. What about those pesky "ARMCI DASSERT fail" errors and ARMCI_DEFAULT_SHMMAX. On certain architectures ARMCI_DEFAULT_SHMMAX needs to be set to generate one big block of global memory per node (i.e. combine all the global memory pieces of each processor on a node into one big block) for faster access. Generally ARMCI_DEFAULT_SHMMAX should be set to <amount of global memory per process> times <# of processors used by calculation on node>. By the latter I mean the number of processors you are actually using. If you only use 4 on a node, the multiplier is only 4.
Hope this helps,
Bert
Bert Forum:Admin, Forum:Mod, NWChemDeveloper, bureaucrat, sysop
Forum Vet
Threads 5
Posts 598
3:56:05 PM PST - Sat, Nov 17th 2012 |
Simply because most codes do not use that much stack memory, so it would be wasted.
Bert
Quote:Andrew.yeung Nov 14th 11:27 pm: Thanks for correcting my mistake, Bert.
Is there a reason why you break up memory this way (heap 100 mb/stack 1000 mb/global 2400 mb), instead of the 25-25-50% by default?
Bert Forum:Admin, Forum:Mod, NWChemDeveloper, bureaucrat, sysop
Forum Vet
Threads 5
Posts 598
3:59:07 PM PST - Sat, Nov 17th 2012 |
Yes, the code right now has some internal limits. Hence, you cannot set it to more than 8000 mb, mainly because this was based on fewer cores per node. I'll look at having this updated and tested.
I would suggest you do not set the stack that small if you want to run coupled cluster calculations; it will be more expensive, as you are forced to use smaller blocks.
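The warning's two constraints (a value within <1,8192> MB that is also a power of two) can be sketched as follows. This is only an illustration of the rounding, not NWChem code; it picks the largest legal value at or below a requested size:

```shell
# Largest ARMCI_DEFAULT_SHMMAX that satisfies the warning: a power of two
# no larger than 8192 MB. The 51200 MB from the post above gets clamped to 8192.
REQUESTED_MB=51200
SHMMAX_MB=1
while [ $((SHMMAX_MB * 2)) -le "$REQUESTED_MB" ] && [ $((SHMMAX_MB * 2)) -le 8192 ]; do
  SHMMAX_MB=$((SHMMAX_MB * 2))
done
echo "legal ARMCI_DEFAULT_SHMMAX: ${SHMMAX_MB} MB"
```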
Bert
Quote:Psd Nov 16th 8:54 am: Hi Bert!
Thanks for your post, but I still have a question about the ARMCI_DEFAULT_SHMMAX.
Suppose I use 2 nodes with 16 cores each node, each core has 4GB memory, and I specify:
memory heap 100 mb stack 400 mb global 3200 mb
that is to say, 3200 MB × 16 = 51200 MB of global memory per node.
If I set
setenv ARMCI_DEFAULT_SHMMAX 51200
I get this warning:
incorrect ARMCI_DEFAULT_SHMMAX should be <1,8192>mb and 2^N Found=51200
incorrect ARMCI_DEFAULT_SHMMAX should be <1,8192>mb and 2^N Found=51200
Do you know what the problem is?
Thank you!
Quote:Bert Nov 14th 1:20 pm: Let's start at the beginning:
E. What about those pesky "ARMCI DASSERT fail" errors and ARMCI_DEFAULT_SHMMAX. On certain architectures ARMCI_DEFAULT_SHMMAX needs to be set to generate one big block of global memory per node (i.e. combine all the global memory pieces of each processor on a node into one big block) for faster access. Generally ARMCI_DEFAULT_SHMMAX should be set to <amount of global memory per process> times <# of processors used by calculation on node>. By the latter I mean the number of processors you are actually using. If you only use 4 on a node, the multiplier is only 4.
Hope this helps,
Bert
Just Got Here
Threads 1
Posts 4
1:36:16 AM PST - Sun, Nov 18th 2012 |
I understand things in theory, but in practice I still can't get it working. Currently I have
memory total 4000 mb
It runs for a few hours and then fails. The end of the log is the following:
Memory Information
------------------
Available GA space size is 524244319 doubles
Available MA space size is 65513497 doubles
Length of a trial vector is 9864
Algorithm : Incore multiple tensor contraction
Estimated peak GA usage is 182779852 doubles
Estimated peak MA usage is 6600 doubles
3 smallest eigenvalue differences (eV)
No. Spin Occ Vir Irrep E(Vir) E(Occ) E(Diff)
1 1 72 73 a -0.071 -0.208 3.744
2 1 71 73 a -0.071 -0.239 4.578
3 1 70 73 a -0.071 -0.245 4.747
Entering Davidson iterations
Restricted singlet excited states
Iter NTrls NConv DeltaV DeltaE Time
---- ------ ------ --------- --------- ---------
0: error ival=-1
(rank:0 hostname:mesocomte68 pid:30430):ARMCI DASSERT fail. ../../ga-5-1/armci/src/devices/openib/openib.c:armci_server_rdma_strided_to_contig():3239 cond:(rc==0)
As far as I can see from the Memory Information, I have a lot of free memory, but it still fails. Could you please tell me what's wrong? I wonder what armci_server_rdma_strided_to_contig() is...
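As a quick sanity check, the GA/MA sizes in the "Memory Information" block are reported in 8-byte doubles, so they convert back to megabytes for comparison with the "memory total 4000 mb" directive (the numbers below are taken from the log above):

```shell
# NWChem reports GA/MA space in doubles (8 bytes each); convert to MB.
GA_DOUBLES=524244319
MA_DOUBLES=65513497
PEAK_GA_DOUBLES=182779852

GA_MB=$((GA_DOUBLES * 8 / 1024 / 1024))            # available global memory
MA_MB=$((MA_DOUBLES * 8 / 1024 / 1024))            # available local (MA) memory
PEAK_GA_MB=$((PEAK_GA_DOUBLES * 8 / 1024 / 1024))  # estimated peak GA usage

echo "GA available: ${GA_MB} MB, MA available: ${MA_MB} MB, peak GA: ${PEAK_GA_MB} MB"
```

This gives roughly 3999 MB of GA available against an estimated peak of about 1394 MB, which supports the observation that the job is not actually short of memory.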
Bert Forum:Admin, Forum:Mod, NWChemDeveloper, bureaucrat, sysop
Forum Vet
Threads 5
Posts 598
8:09:01 AM PST - Sun, Nov 18th 2012 |
Not clear; this seems to be related to the system. I would try to reduce the memory footprint. The output suggests you do not need that much memory in the first place.
From the numbers it looks like you are running on 2 processor cores, with each core on a different node connected by IB? How many cores and how much memory do you have per node? You may be able to run this on a single node.
Bert
Quote:Yesint Nov 18th 8:36 am: I understand things in theory, but in practice I still can't get it working. Currently I have
memory total 4000 mb
It runs for a few hours and then fails. The end of the log is the following:
Memory Information
------------------
Available GA space size is 524244319 doubles
Available MA space size is 65513497 doubles
Length of a trial vector is 9864
Algorithm : Incore multiple tensor contraction
Estimated peak GA usage is 182779852 doubles
Estimated peak MA usage is 6600 doubles
3 smallest eigenvalue differences (eV)
No. Spin Occ Vir Irrep E(Vir) E(Occ) E(Diff)
1 1 72 73 a -0.071 -0.208 3.744
2 1 71 73 a -0.071 -0.239 4.578
3 1 70 73 a -0.071 -0.245 4.747
Entering Davidson iterations
Restricted singlet excited states
Iter NTrls NConv DeltaV DeltaE Time
---- ------ ------ --------- --------- ---------
0: error ival=-1
(rank:0 hostname:mesocomte68 pid:30430):ARMCI DASSERT fail. ../../ga-5-1/armci/src/devices/openib/openib.c:armci_server_rdma_strided_to_contig():3239 cond:(rc==0)
As far as I can see from the Memory Information, I have a lot of free memory, but it still fails. Could you please tell me what's wrong? I wonder what armci_server_rdma_strided_to_contig() is...
Just Got Here
Threads 1
Posts 4
2:55:37 AM PST - Sun, Nov 25th 2012 |
Quote:Bert Nov 18th 7:09 am: Not clear; this seems to be related to the system. I would try to reduce the memory footprint. The output suggests you do not need that much memory in the first place.
From the numbers it looks like you are running on 2 processor cores, with each core on a different node connected by IB? How many cores and how much memory do you have per node? You may be able to run this on a single node.
Bert
It runs over IB, one core per node. Each node has at least 12 GB of RAM. I'll try to put it on a single node, but this is not what we want to do normally.