Help with large CCSD(T) Calculation
From NWChem
Viewed 2359 times, With a total of 8 Posts
|
Just Got Here
Threads 1
Posts 4
|
|
6:47:14 PM PDT - Sat, Aug 17th 2013 |
|
Hi all. I'm working on some CCSD(T) calculations of CO2 dimers using aug-cc-pVQZ basis sets. I realize that this is a very large job. I've run a few of these calculations previously using Molpro (on XSEDE Blacklight), which, if I recall correctly (I don't have my notes on me), took about 20 hours on 16 cores and required ~256GB of memory.
I would like to try running these jobs with NWChem instead, but I'm having two problems: 1) tweaking the performance options and 2) my jobs dying due to a file-writing error.
First, here is my input file. I've not included the basis set specification, as it's a long copy/paste from BSEL:
title "co2 test"
#memory stack 9600 mb heap 800 mb global 4800 mb // tried this also, same error
memory stack 1500 mb heap 100 mb global 1400 mb
geometry
symmetry c1
C 2.12544 0.00000 0.00000
O 1.82852 -0.93172 -0.62769
O 2.42235 0.93172 0.62769
C -2.12544 0.00000 0.00000
O -1.20623 -0.32695 -0.63119
O -3.04465 0.32695 0.63119
end
basis
## *snip*
end
bsse
mon firstmonomer 1 2 3
mon secondmonomer 4 5 6
end
scf
singlet
rhf
end
tce
ccsd(t)
2eorb
io ga
tilesize 10 # also tried 15 and 20
end
task tce energy
I'm running the jobs on XSEDE Trestles, on 8 cores (mpirun_rsh) over 2 nodes (64GB mem/node), with the environment variable ARMCI_DEFAULT_SHMMAX=2048. I've also tried running without the variable set, with the same results.
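For reference, the memory directive applies to each process, so the footprint on a node grows with the number of processes placed there. Below is a rough sketch of that arithmetic for the two memory lines in the input above. It assumes 8 processes on each of the 2 nodes (the output further down reports 16 processes in total) and treats 1 mb as 2**20 bytes. Both are assumptions, and the helper names are hypothetical.

```python
MB = 2**20  # assumption: "mb" in the memory directive treated as 2**20 bytes


def per_node_bytes(stack_mb, heap_mb, global_mb, procs_per_node):
    """Memory requested on one node when every process uses the same directive."""
    return (stack_mb + heap_mb + global_mb) * MB * procs_per_node


def aggregate_global_bytes(global_mb, total_procs):
    """Global Arrays space summed over all processes."""
    return global_mb * MB * total_procs


NODE_RAM = 64 * 2**30   # 64GB per node from the post, taken here as GiB
PROCS_PER_NODE = 8      # assumption: 16 processes spread evenly over 2 nodes
N_NODES = 2

settings = {
    "active line": (1500, 100, 1400),
    "commented-out line": (9600, 800, 4800),
}

for label, (stack, heap, glob) in settings.items():
    node = per_node_bytes(stack, heap, glob, PROCS_PER_NODE)
    total_ga = aggregate_global_bytes(glob, N_NODES * PROCS_PER_NODE)
    verdict = "fits in" if node <= NODE_RAM else "exceeds"
    print(f"{label}: {node / 2**30:.1f} GiB per node ({verdict} node RAM), "
          f"{total_ga / 2**30:.1f} GiB of global memory in total")
```

Under these assumptions the commented-out setting asks for more memory than a node has, while the active one leaves most of each node unused.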
So now the results. The job runs for a while, and generates ~150GB of temp files before dying. I've pasted the relevant output below.
*snip*
General Information
-------------------
Number of processors : 16
Wavefunction type : Restricted Hartree-Fock
No. of electrons : 44
Alpha electrons : 22
Beta electrons : 22
No. of orbitals : 1234
Alpha orbitals : 617
Beta orbitals : 617
Alpha frozen cores : 0
Beta frozen cores : 0
Alpha frozen virtuals : 0
Beta frozen virtuals : 0
Spin multiplicity : singlet
Number of AO functions : 630
Number of AO shells : 120
Use of symmetry is : off
Symmetry adaption is : off
Schwarz screening : 0.10D-09
!! WARNING !! The number of MO is less than the number of AO
Correlation Information
-----------------------
Calculation type : Coupled-cluster singles & doubles w/ perturbation
Perturbative correction : (T)
Max iterations : 100
Residual threshold : 0.10D-06
DIIS level shift : 0.00D+00
CC-LR DIIS level shift : 0.00D+00
CC-IR DIIS level shift : 0.00D+00
Amplitude update : 5-th order DIIS
I/O scheme : Global Array Library
Memory Information
------------------
Available GA space size is ********** doubles
Available MA space size is 681563897 doubles
Maximum block size supplied by input
Maximum block size 20 doubles
tile_dim = 20
Block Spin Irrep Size Offset Alpha
-------------------------------------------------
1 alpha a 11 doubles 0 1
2 alpha a 11 doubles 11 2
3 beta a 11 doubles 22 1
4 beta a 11 doubles 33 2
5 alpha a 19 doubles 44 5
6 alpha a 20 doubles 63 6
7 alpha a 20 doubles 83 7
8 alpha a 20 doubles 103 8
9 alpha a 20 doubles 123 9
10 alpha a 20 doubles 143 10
11 alpha a 19 doubles 163 11
12 alpha a 20 doubles 182 12
13 alpha a 20 doubles 202 13
14 alpha a 20 doubles 222 14
15 alpha a 20 doubles 242 15
16 alpha a 20 doubles 262 16
17 alpha a 19 doubles 282 17
18 alpha a 20 doubles 301 18
19 alpha a 20 doubles 321 19
20 alpha a 20 doubles 341 20
21 alpha a 20 doubles 361 21
22 alpha a 20 doubles 381 22
23 alpha a 19 doubles 401 23
24 alpha a 20 doubles 420 24
25 alpha a 20 doubles 440 25
26 alpha a 20 doubles 460 26
27 alpha a 20 doubles 480 27
28 alpha a 20 doubles 500 28
29 alpha a 19 doubles 520 29
30 alpha a 20 doubles 539 30
31 alpha a 20 doubles 559 31
32 alpha a 20 doubles 579 32
33 alpha a 20 doubles 599 33
34 alpha a 20 doubles 619 34
35 beta a 19 doubles 639 5
36 beta a 20 doubles 658 6
37 beta a 20 doubles 678 7
38 beta a 20 doubles 698 8
39 beta a 20 doubles 718 9
40 beta a 20 doubles 738 10
41 beta a 19 doubles 758 11
42 beta a 20 doubles 777 12
43 beta a 20 doubles 797 13
44 beta a 20 doubles 817 14
45 beta a 20 doubles 837 15
46 beta a 20 doubles 857 16
47 beta a 19 doubles 877 17
48 beta a 20 doubles 896 18
49 beta a 20 doubles 916 19
50 beta a 20 doubles 936 20
51 beta a 20 doubles 956 21
52 beta a 20 doubles 976 22
53 beta a 19 doubles 996 23
54 beta a 20 doubles 1015 24
55 beta a 20 doubles 1035 25
56 beta a 20 doubles 1055 26
57 beta a 20 doubles 1075 27
58 beta a 20 doubles 1095 28
59 beta a 19 doubles 1115 29
60 beta a 20 doubles 1134 30
61 beta a 20 doubles 1154 31
62 beta a 20 doubles 1174 32
63 beta a 20 doubles 1194 33
64 beta a 20 doubles 1214 34
Global array virtual files algorithm will be used
Parallel file system coherency ......... OK
Integral file = ./co2.aoints.00
Record size in doubles = 65536 No. of integs per rec = 32766
Max. records in memory = 0 Max. records in file = ******
No. of bits per label = 16 No. of bits per value = 64
#quartets = 1.807D+07 #integrals = 1.013D+10 #direct = 0.0% #cached =100.0%
File balance: exchanges= 63 moved= 7630 time= 5.1
Fock matrix recomputed
1-e file size = 380689
1-e file name = ./co2.f1
Cpu & wall time / sec 137.1 183.2
4-electron integrals stored in orbital form
available GA memory 2516392056 bytes
createfile: failed ga_create size=*********
------------------------------------------------------------------------
------------------------------------------------------------------------
current input line :
0:
------------------------------------------------------------------------
------------------------------------------------------------------------
------------------------------------------------------------------------
For more information see the NWChem manual at
http://www.emsl.pnl.gov/docs/nwchem/nwchem.html
For further details see manual section:
14:14:createfile: failed ga_create size=:: 2137779302
(rank:14 hostname:trestles-2-4.local pid:10520):ARMCI DASSERT fail. armci.c:ARMCI_Error():260 cond:0
Last System Error Message from Task 14:: No such file or directory
application called MPI_Abort(MPI_COMM_WORLD, 0) - process 14
*snip* (the other 15 ranks print the same messages, interleaved, with available GA memory between 2516039248 and 2516392056 bytes; ranks 0-7 are on trestles-2-32.local and ranks 8-15 on trestles-2-4.local; the current input line is reported as "289: task tce energy")
Any help would be greatly appreciated. Thanks.
Keith McLaughlin
University of South Florida
|
|
|
-
Edoapra Forum:Admin, Forum:Mod, bureaucrat, sysop
|
|
Forum Vet
Threads 3
Posts 865
|
|
10:22:40 PM PDT - Fri, Aug 23rd 2013 |
|
Keith
The input you are using can be run with the TCE module if you increase the number of processors.
If you want to stick to 16 processors, you might want to switch to the older "CCSD" module, which
has smaller memory requirements.
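A back-of-the-envelope sketch of why the processor count matters, using the orbital counts printed in the log above (617 alpha orbitals, 22 alpha electrons, no frozen core). It only counts doubles for a spatial-orbital T2 array and for a full set of two-electron integrals in the orbital basis, with and without the 8-fold permutational symmetry of real integrals. It is an order-of-magnitude estimate under those assumptions, not a description of how the TCE module actually stores its tiles and intermediates, and the function names are hypothetical.

```python
import math

BYTES_PER_DOUBLE = 8
MB = 2**20            # assumption: "mb" treated as 2**20 bytes

n_orb = 617           # alpha orbitals, from the log
n_occ = 22            # alpha electrons, from the log (no frozen core)
n_virt = n_orb - n_occ


def gib(n_doubles):
    """Size of n_doubles double-precision numbers in GiB."""
    return n_doubles * BYTES_PER_DOUBLE / 2**30


def procs_needed(n_doubles, global_mb_per_proc):
    """Processes required to hold n_doubles in global memory alone."""
    per_proc_bytes = global_mb_per_proc * MB
    return math.ceil(n_doubles * BYTES_PER_DOUBLE / per_proc_bytes)


t2_doubles = n_occ**2 * n_virt**2   # doubles amplitudes, spatial orbitals
ints_full = n_orb**4                # all (pq|rs), no symmetry used
ints_sym = ints_full // 8           # roughly, with 8-fold symmetry

print(f"T2 amplitudes:          {gib(t2_doubles):8.1f} GiB")
print(f"integrals, no symmetry: {gib(ints_full):8.1f} GiB")
print(f"integrals, 8-fold sym.: {gib(ints_sym):8.1f} GiB")
print("processes needed at 1400 mb global each:",
      procs_needed(ints_sym, 1400))
```

Even the symmetry-reduced integral estimate is several times larger than the global memory that 16 processes provide at 1400 mb each, which is the sense in which more processors help.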
I have been trying to reproduce the behavior of the input you are using, and
I have come up with the two input files below.
Please keep in mind that, because of the different memory requirements
for the CCSD and the (T) parts, you will have to use two different input files:
start ccsd
title "CCSD input"
memory stack 800 mb heap 100 mb global 750 mb
geometry
C 2.12544 0.00000 0.00000
O 1.82852 -0.93172 -0.62769
O 2.42235 0.93172 0.62769
C -2.12544 0.00000 0.00000
O -1.20623 -0.32695 -0.63119
O -3.04465 0.32695 0.63119
end
basis
* library aug-cc-pvqz
end
scf
direct
thresh 1d-8
end
ccsd
diisbas 2
freeze atomic
nodisk
tol2e 1d-14
end
task ccsd
restart ccsd
title "CCSD(T) input"
memory stack 400 mb heap 100 mb global 950 mb
task ccsd(t)
|
Edited On 12:13:15 PM PST - Fri, Dec 26th 2014 by Edoapra
|
|
|
|
Just Got Here
Threads 1
Posts 4
|
|
11:24:20 PM PDT - Sat, Aug 24th 2013 |
|
Hi Edoapra, thanks for your reply.
You're right that the job runs if I request more cores. I'm now running on 64 cores, but I'm hitting a new error.
*snip*
Global array virtual files algorithm will be used
Parallel file system coherency ......... OK
Integral file = ./n2.aoints.00
Record size in doubles = 65536 No. of integs per rec = 32766
Max. records in memory = 1874 Max. records in file = ******
No. of bits per label = 16 No. of bits per value = 64
#quartets = 2.929D+06 #integrals = 1.446D+09 #direct = 0.0% #cached =100.0%
File balance: exchanges= 254 moved= 1805 time= 0.1
Fock matrix recomputed
1-e file size = 173056
1-e file name = ./n2.f1
Cpu & wall time / sec 12.3 15.4
4-electron integrals stored in orbital form
1: WARNING:armci_set_mem_offset: offset changed 0 to 26914816
33: WARNING:armci_set_mem_offset: offset changed 0 to 22720512
(rank:32 hostname:trestles-4-13.local pid:26022):ARMCI DASSERT fail. openib.c:armci_server_register_region():964 cond:(memhdl->memhndl!=((void *)0))
Last System Error Message from Task 32:: Cannot allocate memory
32:Segmentation Violation error, status=: 11
32:Segmentation Violation error, status=: 11
32:Segmentation Violation error, status=: 11
*snip*
Any suggestions?
|
Edited On 11:24:59 PM PDT - Sat, Aug 24th 2013 by Kmclaugh
|
|
|
-
Karol Forum:Admin, Forum:Mod, NWChemDeveloper, bureaucrat, sysop
|
|
Clicked A Few Times
Threads 1
Posts 31
|
|
2:56:22 PM PDT - Mon, Aug 26th 2013 |
|
Hi Keith,
You have two options:
1.) Run the TCE with other options for the 4-index transformation, which is currently causing problems. Instead of "2eorb", please use the sequence:
2eorb
2emet 13
If your job is still crashing, please use
2eorb
2emet 14
split 2
You may also make the "split" value bigger (for example "split 4", which means that the atomic 2-electron integrals will be divided into 4 batches, reducing the memory required to perform the 4-index transformation). The TCE code will also require more processors (according to my estimates, 128 or more should be fine). Please also use ARMCI_DEFAULT_SHMMAX=4096.
2.) Run the "old" spin-free version of CCSD(T) for the closed shell.
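To try the variants from option 1 one after another, a small helper can write out the tce block for each combination. This is a minimal sketch: `render_tce_block` is a hypothetical name, and it emits only keywords that already appear in this thread (ccsd(t), 2eorb, 2emet, split, io ga, tilesize). Keeping "io ga" and "tilesize" from the original input is an assumption, since the thread does not say whether they should change.

```python
def render_tce_block(twoemet=None, split=None, tilesize=None):
    """Return a tce input block built from keywords quoted in this thread."""
    lines = ["tce", "  ccsd(t)", "  2eorb"]
    if twoemet is not None:
        lines.append(f"  2emet {twoemet}")
    if split is not None:
        lines.append(f"  split {split}")
    lines.append("  io ga")  # assumption: kept from the original input
    if tilesize is not None:
        lines.append(f"  tilesize {tilesize}")
    lines.append("end")
    return "\n".join(lines)


# First suggestion, then the fallback, then the fallback with a larger split.
variants = (
    {"twoemet": 13},
    {"twoemet": 14, "split": 2},
    {"twoemet": 14, "split": 4},
)

for options in variants:
    print(render_tce_block(tilesize=20, **options))
    print()
```

Each printed block can be pasted in place of the tce section of the input, leaving the rest of the file unchanged.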
Best,
Karol
|
|
|
|
Just Got Here
Threads 1
Posts 4
|
|
4:10:01 PM PDT - Mon, Aug 26th 2013 |
|
Thanks for your help. I will try your suggestions.
|
|
|
|
Just Got Here
Threads 1
Posts 4
|
|
10:52:06 PM PDT - Sat, Aug 31st 2013 |
|
I'm still having some issues, but I have been able to get some smaller jobs to complete.
I don't quite understand the output. I'm trying to calculate the CBS-extrapolated interaction energy, but with my current input file the interaction energy is not printed. In Molpro I'd usually use the "dummy" command to get the BSSE-corrected interaction energy, but I'm not sure how to do this in NWChem. Please advise.
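For clarity, the quantity being asked for can be written out as a short sketch. The counterpoise (Boys-Bernardi) interaction energy is the dimer energy minus the two monomer energies, all computed in the full dimer basis. For the CBS step the thread does not say which scheme is meant, so the two-point X**-3 extrapolation of the correlation energy used below is an assumption. The function names are hypothetical and every number is a made-up placeholder, not a result from these calculations.

```python
HARTREE_TO_KCAL_MOL = 627.509474


def counterpoise_interaction(e_dimer, e_mon_a_ghost, e_mon_b_ghost):
    """Counterpoise interaction energy.

    All three energies must come from the full dimer basis, i.e. each
    monomer is computed with ghost functions on the partner's atoms.
    """
    return e_dimer - e_mon_a_ghost - e_mon_b_ghost


def cbs_two_point(e_small, x_small, e_large, x_large):
    """Two-point X**-3 extrapolation (assumed scheme) of a correlation energy.

    x_small and x_large are the cardinal numbers, e.g. 3 for TZ and 4 for QZ.
    """
    xs3, xl3 = x_small**3, x_large**3
    return (xl3 * e_large - xs3 * e_small) / (xl3 - xs3)


# Placeholder energies in hartree, for illustration only.
e_int_tz = counterpoise_interaction(-10.0020, -5.0000, -5.0000)
e_int_qz = counterpoise_interaction(-10.0022, -5.0000, -5.0000)

# Extrapolating the interaction energy itself is a simplification; the
# X**-3 form is normally applied to the correlation contribution only.
e_int_cbs = cbs_two_point(e_int_tz, 3, e_int_qz, 4)
print(f"interaction energy (CBS estimate): "
      f"{e_int_cbs * HARTREE_TO_KCAL_MOL:.3f} kcal/mol")
```

The three counterpoise energies have to be taken from separate dimer and ghost-atom monomer calculations at the same level of theory.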
|
|
|
-
Edoapra Forum:Admin, Forum:Mod, bureaucrat, sysop
|
|
Forum Vet
Threads 3
Posts 865
|
|
|
|
|
Clicked A Few Times
Threads 0
Posts 34
|
|
9:07:09 AM PST - Wed, Dec 10th 2014 |
|
Quote:Kmclaugh Aug 25th 6:24 am
(rank:32 hostname:trestles-4-13.local pid:26022):ARMCI DASSERT fail. openib.c:armci_server_register_region():964 cond:(memhdl->memhndl!=((void *)0))
Last System Error Message from Task 32:: Cannot allocate memory
32:Segmentation Violation error, status=: 11
32:Segmentation Violation error, status=: 11
32:Segmentation Violation error, status=: 11
Any suggestions?
ARMCI-MPI (wiki.mpich.org/armci-mpi/index.php/NWChem) eliminates all ARMCI-related segfaults on InfiniBand. If it doesn't run with ARMCI-MPI, you need more nodes, which is to say, you've exceeded the actual limit of memory. On the other hand, ARMCI-OPENIB segfaults for any number of reasons, many of which are not actually running out of memory.
Jeff
|
|
|
|
Clicked A Few Times
Threads 19
Posts 43
|
|
2:52:23 AM PST - Wed, Dec 24th 2014 |
|
Quote:Edoapra Aug 23rd 9:22 pm
Keith
The input you are using can be run with the TCE module if you increase the number of processors.
If you want to stick to 16 processors, you might want to switch to the older "CCSD" module that
has small memory requirements.
I have been trying to reproduce the behavior of the input you are using and
I have come up with the two input files below.
Please keep in mind that because of the different memory requirements
for CCSD and the (T) part, you will have to use two different input files
start ccsd
title "CCSD input"
memory stack 800 mb heap 100 mb global 750 mb
geometry
C 2.12544 0.00000 0.00000
O 1.82852 -0.93172 -0.62769
O 2.42235 0.93172 0.62769
C -2.12544 0.00000 0.00000
O -1.20623 -0.32695 -0.63119
O -3.04465 0.32695 0.63119
end
basis
* library aug-cc-pvqz
end
scf
direct
thresh 1d-8
end
ccsd
diisbas 2
freeze atomic
nodisk
tol2e 1d-14
end
task ccsd
restart ccsd
title "CCSD(T) input"
memory stack 400 mb heap 100 mb global 950 mb
task ccsd(t)
Hello Edoapra,
I just need to understand something.
How does this input give you the interaction energy between the two CO2 molecules?
Which part of the input defines that?
|
|
|