TCE restart on BlueGene/Q
From NWChem
8:57:39 PM PST - Sun, Jan 28th 2018
Dear users and developers,
I am trying to run NWChem version 6.8.1 on an IBM BG/Q cluster. I compiled the program with the following script:
#!/bin/bash
export NWCHEM_TOP=/home/i/ihamilto/ehlert/NWCHEM/omp/nwchem-6.8.1
export NWCHEM_TARGET=BGQ
export NWCHEM_MODULES="qm"
export ARMCI_NETWORK=MPI-TS
export USE_MPI=y
export USE_MPIF=y
export USE_OPENMP=y
export LARGE_FILES=TRUE
export MPI_INCLUDE=/bgsys/drivers/ppcfloor/comm/xl/include
ESSL="/opt/ibmmath/essl/5.1/lib64/libesslsmpbg.a"
export BLAS_SIZE=4
export USE_64TO32=y
LAPACK="/home/i/ihamilto/ehlert/NWCHEM/lapack-3.8.0/liblapack.a "
export BLASOPT="$LAPACK $ESSL -Wl,-zmuldefs -lxlsmp"
export DISABLE_GAMIRROR=y
The compilation runs without any issues (except for ccsd_trpdrv_omp.F, where I adapted the XLF compiler options).
Hartree-Fock and CCSD(T) calculations also run. However, when I try to save integrals and amplitudes, I run into problems. Here is my sample input file:
memory stack 1200 mb heap 500 mb global 2000 mb
geometry
H 1.0 1.0 0.0
H -1.0 1.0 0.0
symmetry c1
end
basis
* library def2-qzvpp
end
SCF
direct
end
TCE
ccsd(t)
2eorb
tilesize 16
END
set tce:save_integrals T T T T T
set tce:save_t T T T T
set tce:read_t T T T T
task tce energy
The last lines of the output look as follows:
Parallel file system coherency ......... OK
Saving 1-electron integrals now...
f1_restart_save filename: ./nwchem.f1_copy
f1_restart_save finished
Fock matrix recomputed
1-e file size = 4900
1-e file name = ./nwchem.f1
Cpu & wall time / sec 2.6 2.6
4-electron integrals stored in orbital form
v2 file size = 4556750
4-index algorithm nr. 1 is used
imaxsize = 30
imaxsize ichop = 0
v2int file size = 6592705
Cpu & wall time / sec 5.0 5.0
Saving 2-electron integrals now...
v2_restart_save filename: ./nwchem.v2_copy
hashn: addr 4 key 1
length 3
hashn: addr 4 key 2
hashn: addr 4 key 3
length 3
...
tce_hash_n: key not found 1
I have already tried some 2emet/2eorb variations, without any success. It is not at all clear to me where the problem is, especially because the one-electron integrals and the amplitudes are written properly. When I change that one line to:
set tce:save_integrals T F F F F
the program runs without an error (although that is not useful ^^).
I would be very thankful for any hint, advice, or solution.
Thanks in advance!
Christopher
Edoapra Forum:Admin, Forum:Mod, bureaucrat, sysop
11:31:56 AM PST - Mon, Jan 29th 2018
Please try the following input (and swap t/f for tce:readint/writeint and tce:readt/writet during the restart run)
start h2_tce
memory stack 1200 mb heap 500 mb global 2000 mb
geometry
H 1.0 1.0 0.0
H -1.0 1.0 0.0
symmetry c1
end
basis
* library def2-qzvpp
end
SCF
direct
end
TCE
ccsd(t)
2eorb
2emet 15
tilesize 16
END
set tce:writeint t
set tce:readint f
set tce:writet t
set tce:readt f
set tce:tceiop 2048
set tce:nts t
task tce energy
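The t/f swap for the restart run can be sketched with a small Python helper. This is an illustration only: the function name `swap_restart_flags` is hypothetical and not part of NWChem, and it assumes the four `set tce:...` read/write directives appear one per line, exactly as in the input above.

```python
# Hypothetical helper (not part of NWChem): flips t <-> f on the four
# TCE read/write directives so a first-run input becomes a restart input.
RESTART_KEYS = ("tce:writeint", "tce:readint", "tce:writet", "tce:readt")


def swap_restart_flags(input_text: str) -> str:
    """Return input_text with t/f toggled on the read/write directives."""
    flipped = {"t": "f", "f": "t"}
    out = []
    for line in input_text.splitlines():
        parts = line.split()
        # Only touch lines of the form: set <key> <t|f>
        if (len(parts) == 3 and parts[0].lower() == "set"
                and parts[1].lower() in RESTART_KEYS
                and parts[2].lower() in flipped):
            parts[2] = flipped[parts[2].lower()]
            line = " ".join(parts)
        out.append(line)
    return "\n".join(out)


first_run = """set tce:writeint t
set tce:readint f
set tce:writet t
set tce:readt f
set tce:tceiop 2048
set tce:nts t"""

restart_run = swap_restart_flags(first_run)
print(restart_run)
```

Only the four read/write directives are toggled; `tce:tceiop` and `tce:nts` pass through unchanged.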
9:30:07 AM PST - Tue, Jan 30th 2018
Hi Edoapra,
thank you for your answer. I tried the input and the integrals are written (I can see the files). However, the CCSD iterations look strange:
t2 file size = 7618
t2 file name = ./h2_tce.t2
t2 file handle = -996
CCSD iterations
---------------------------------------------------------
Iter Residuum Correlation Cpu Wall
---------------------------------------------------------
NEW TASK SCHEDULING
CCSD_T1_NTS --- OK
CCSD_T2_NTS --- OK
1 0.0000000000484 -0.0000000000404 3.4 3.4
-----------------------------------------------------------------
Iterations converged
CCSD correlation energy / hartree = -0.000000000040354
CCSD total energy / hartree = -0.926129209689597
On my workstation, however, it runs correctly. Do you have any idea what the problem might be?
Thanks again,
Christopher
Edoapra Forum:Admin, Forum:Mod, bureaucrat, sysop
12:56:01 PM PST - Tue, Jan 30th 2018
Christopher
Since we have not spent much time testing or optimizing NWChem on BlueGene/Q, my suggestion is to move to a different platform if you have the opportunity.
If you have to stick with BG/Q, my suggestions are:
1) Instead of TCE, try the CCSD module if you intend to study closed-shell molecules
https://github.com/nwchemgit/nwchem/wiki/CCSD
2) If you want to fix the TCE problems on BG/Q, try to find a baseline that works by
i) using a single process
ii) compiling without OpenMP
iii) recompiling TCE with no optimization, e.g.
make FOPTIMIZE="-O0 -g" FDEBUG="-O0 -g"
5:43:59 PM PST - Mon, Feb 12th 2018
Hi Edoapra,
thanks for your help. I finally managed to solve the problem by replacing all int types with long types in the tce/sort/tce_sort_4kg.c file. The whole TCE part was compiled with "-qintsize=8", but the C code used 32-bit integers, so the two integer sizes did not match. The code now runs with "2emet 15".
Maybe this info is useful for someone,
best,
Christopher
Edoapra Forum:Admin, Forum:Mod, bureaucrat, sysop
10:57:53 AM PST - Tue, Feb 13th 2018
Thank you very much for the bug report.
My guess is that the code is likely to work on other 64-bit CPUs since they are little-endian, while the PowerPC processor on BG/Q is big-endian, and that causes the long/int breakage.
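The effect suggested here can be illustrated outside NWChem. The sketch below is plain Python, not the TCE code: it packs a small value as an 8-byte integer (as Fortran compiled with "-qintsize=8" would store it) and then reads only the first 4 bytes at the same address (as C code declaring a 32-bit int would). The value 16 and the helper name are made up for the illustration.

```python
import struct


def read_first_4_bytes_as_int32(value: int, byte_order: str) -> int:
    """Pack value as a signed 64-bit integer, then reinterpret the first
    4 bytes as a signed 32-bit integer, using the same byte order.

    byte_order is '<' for little-endian or '>' for big-endian.
    """
    raw = struct.pack(byte_order + "q", value)          # 8-byte integer
    return struct.unpack(byte_order + "i", raw[:4])[0]  # 4-byte view


value = 16  # arbitrary small example value

little = read_first_4_bytes_as_int32(value, "<")
big = read_first_4_bytes_as_int32(value, ">")

print("little-endian 32-bit view:", little)  # low-order half comes first
print("big-endian 32-bit view:   ", big)     # high-order half comes first
```

With little-endian layout the small value survives the size mismatch by accident, while with big-endian layout the 32-bit view sees only the zero high-order half. That is consistent with the guess above, though it is not a confirmed diagnosis of the TCE failure.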
Edoapra Forum:Admin, Forum:Mod, bureaucrat, sysop
11:11:32 AM PST - Tue, Feb 13th 2018
Christopher
I have opened a github issue on this topic
https://github.com/nwchemgit/nwchem/issues/16
and pushed a fix (thanks to your suggestion) to the hotfix/release-6-8 and master branches.
If you have time to test this fix, your help is greatly appreciated.
To check out hotfix/release-6-8, please type
git clone -b hotfix/release-6-8 https://github.com/nwchemgit/nwchem.git nwchem-6.8.1
Then change NWCHEM_TOP to .../nwchem-6.8.1
6:49:44 PM PST - Tue, Feb 13th 2018
Ok,
the patch is approved!
However, regarding the initial question, I also had to remove the "USE_EAF" macro in "src/tce/tensor_read_write.F", and I modified ccsd_energy_loc.F, where I added these lines after line 180:
      if (write_ta .and.(mod(iter,save_interval).eq.0)) then
         if(nodezero) then
            write(LuOut,*) 'Saving Amplitudes now...'
         endif
         call util_file_name0('t1amp',.false.,.true.,filename,fldgts)
         unitn=79
         call write_tensor(filename,d_t1,size_t1,unitn)
         call util_file_name0('t2amp',.false.,.true.,filename,fldgts)
         unitn=80
         call write_tensor(filename,d_t2,size_t2,unitn)
         call ga_sync()
      endif
One can then use the following input:
set tce:writeint t
set tce:readint f
set tce:writet t
set tce:readt f
set tce:save_interval 10
set tce:tceiop 2048
set tce:nts t
to save the amplitudes every 10 iterations. I think that's useful.
best,
Christopher
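The save condition in the Fortran lines above, `mod(iter,save_interval).eq.0`, can be sketched in Python to show which iterations would write amplitudes. This is only an illustration of the modulo test: the function name and the iteration count of 35 are assumptions, and `write_ta` is taken to be true.

```python
def should_save(iteration: int, save_interval: int) -> bool:
    """Mirror of the Fortran test mod(iter, save_interval) .eq. 0."""
    return iteration % save_interval == 0


save_interval = 10   # value from 'set tce:save_interval 10'
max_iterations = 35  # arbitrary iteration count for the illustration

# Iterations at which the amplitudes would be written.
saved_at = [it for it in range(1, max_iterations + 1)
            if should_save(it, save_interval)]
print(saved_at)
```

With this condition alone, a run that converges in fewer than `save_interval` iterations never triggers this particular save.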
Edoapra Forum:Admin, Forum:Mod, bureaucrat, sysop
10:52:34 AM PST - Fri, Feb 16th 2018
Thanks for the feedback.
Could you please use the github issue option for this topic at
https://github.com/nwchemgit/nwchem/issues/16
Cheers, Edo
9:22:47 PM PST - Sat, Feb 17th 2018
Dear Dr. Edoapra,
Your input produces results with negligible differences using NWChem 6.8 on both Ubuntu 17.10 (repeated three times) and macOS High Sierra 10.13.2. The one unusual thing on both is that two different results are reported, for CCSD[T] and CCSD(T) respectively.
I have not applied your patch.
I have already put the log files on your GitHub topic.
The NWChem 6.6 manual clearly states that "The only platform for which restart may cause I/O problems is BlueGene, due to ratio of compute to I/O nodes (64 on BlueGene/P)".
Very Best Regards!
|
Edited On 7:17:28 PM PST - Sun, Feb 25th 2018 by Xiongyan21