QA tests and verification for nwchem 6.5 and hess/vib modules
From NWChem
Viewed 2875 times, With a total of 16 Posts
|
Clicked A Few Times
Threads 1
Posts 8
|
|
11:02:56 AM PDT - Fri, Sep 19th 2014 |
|
I've been compiling nwchem 6.5 (Nwchem-6.5.revision26243-src.2014-09-10) on a number of different platforms:
- CentOS 5.x with OpenMPI 1.6.4 and IB and gcc 4.4
- CentOS 5.x with OpenMPI 1.6.4 and IB and Intel compilers 2013 and 2013_SP1
- CentOS 6.x with MVAPICH2 1.9 and IB (Stampede at TACC) and gcc default
- CentOS 6.x with MVAPICH2 1.9 and IB (Stampede at TACC) and Intel compilers 2013
and I've noticed some failures with the following patterns:
On the CentOS 5.x platform, compiling with the Intel 2013 compilers results in errors in the properties module. Here is a diff from the QA tests:
http://pastebin.com/ivr85r5f
Most other tests completed successfully. The errors appear to be limited to the property module tests. GCC44 and Intel 2013_SP1 work fine on this platform.
On the CentOS 6.x platform, all QA tests appear to pass for both compilers when run on a single node with 16 processors. Running the QA tests on 3 nodes with 16 processors each results in the following frequency variations for both compilers:
http://pastebin.com/XFegynU3
It looks like there could be some bugs in the compiling/linking step for the hessian/vibration portions of the code.
Is there a way to selectively control the compiling flags for specific modules? What additional information is needed to address these issues?
Thanks.
|
Edited On 12:03:10 PM PDT - Fri, Sep 19th 2014 by Statics
|
|
|
-
Edoapra Forum:Admin, Forum:Mod, bureaucrat, sysop
|
|
Forum Vet
Threads 4
Posts 953
|
|
11:39:17 AM PDT - Fri, Sep 19th 2014 |
|
Statistics
Thank you very much for the feedback.
Could you please provide more details about your installation?
For example
Value of ARMCI_NETWORK variable
Detailed compiler version (output of ifort -V or gfortran -v)
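A minimal sketch of how the requested details could be gathered in one go. The function name `report_build_env` is hypothetical; only `ARMCI_NETWORK`, `ifort -V` and `gfortran -v` come from the request above. Compilers that are not on the PATH are reported as missing and do not abort the script.

```shell
#!/bin/sh
# Hypothetical helper: print the build details asked for above.
report_build_env() {
    # Value of ARMCI_NETWORK, or a marker if it is not set in this shell.
    echo "ARMCI_NETWORK=${ARMCI_NETWORK:-<unset>}"

    # Intel Fortran prints its version banner on stderr.
    if command -v ifort >/dev/null 2>&1; then
        ifort -V 2>&1 | head -n 2
    else
        echo "ifort: not found"
    fi

    # gfortran -v ends with a "gcc version ..." line on stderr.
    if command -v gfortran >/dev/null 2>&1; then
        gfortran -v 2>&1 | tail -n 1
    else
        echo "gfortran: not found"
    fi
}

report_build_env
```

Running it in the same shell session used for the build should show the values that were actually in effect at compile time.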
|
|
|
-
Edoapra Forum:Admin, Forum:Mod, bureaucrat, sysop
|
|
Forum Vet
Threads 4
Posts 953
|
|
11:48:32 AM PDT - Fri, Sep 19th 2014 |
|
Quote:Statics Sep 19th 10:02 am
Most other tests completed successfully. The errors look to be limited to the property module tests. GCC44 and Intel 2013_SP1 works file on this platform.
Do you mean: " ... GCC44 and Intel 2013_SP1 works fine on this platform.." ?
Are you stating that all the failures occur with Intel compilers 2013, while 2013_SP1 works fine?
I am not quite sure what you see with CentOS 6.x on TACC Stampede ... do you have any compiler version that seems to work?
Once again, for any case please provide a detailed compiler version.
Cheers, Edo
|
|
|
|
Clicked A Few Times
Threads 1
Posts 8
|
|
12:01:59 PM PDT - Fri, Sep 19th 2014 |
|
Quote:Edoapra Sep 19th 6:48 pm
Quote:Statics Sep 19th 10:02 am
Most other tests completed successfully. The errors look to be limited to the property module tests. GCC44 and Intel 2013_SP1 works file on this platform.
Do you mean: " ... GCC44 and Intel 2013_SP1 works fine on this platform.." ?
Are you stating that all the failures occur with Intel compilers 2013, while 2013_SP1 works fine?
I am not quite sure what you see with CentOS 6.x on TACC Stampede ... do you have any compiler version that seems to work?
Once again, for any case please provide a detailed compiler version.
Cheers, Edo
Yes, sorry for the typo, I meant fine. Here are the compiler versions:
CentOS 5.x:
Working:
gcc 4.4: gcc44 (GCC) 4.4.7 20120313 (Red Hat 4.4.7-1)
ifort 2013 SP1: ifort (IFORT) 14.0.2 20140120
Not working:
ifort 2013: ifort (IFORT) 13.0.1 20121010
CentOS 6.x: works on a single node, not working on multiple nodes. No compiler/MPI library combination has been able to successfully pass the multiple node frequency QA tests.
gcc (GCC) 4.4.7 20120313 (Red Hat 4.4.7-4)
ifort (IFORT) 13.1.0 20130121
|
Edited On 12:08:53 PM PDT - Fri, Sep 19th 2014 by Statics
|
|
|
|
Clicked A Few Times
Threads 1
Posts 8
|
|
12:02:46 PM PDT - Fri, Sep 19th 2014 |
|
Quote:Edoapra Sep 19th 6:39 pm
Statistics
Thank you very much for the feedback.
Could you please provide more details about your installation?
For example
Value of ARMCI_NETWORK variable
Detailed compiler version (output of ifort -V or gfortran -v)
ARMCI_NETWORK is set to OPENIB on all builds. Would it help to have the full build environment on pastebin?
|
|
|
-
Edoapra Forum:Admin, Forum:Mod, bureaucrat, sysop
|
|
Forum Vet
Threads 4
Posts 953
|
|
4:21:42 PM PDT - Fri, Sep 19th 2014 |
|
ARMCI_OPENIB_DEVICE=mlx4_0
|
By the way, when you run NWChem on Stampede, do you set the environment variable
ARMCI_OPENIB_DEVICE to mlx4_0?
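For reference, a short sketch of setting the variable in a job script before NWChem is launched. The value mlx4_0 is the one named in this thread for Stampede; the device name on other clusters may differ, so treat it as an assumption elsewhere.

```shell
# Assumption: mlx4_0 is the InfiniBand device on the compute nodes
# (the value named in this thread for Stampede).
export ARMCI_OPENIB_DEVICE=mlx4_0

# Echo the setting into the job log so it can be confirmed afterwards.
echo "ARMCI_OPENIB_DEVICE=${ARMCI_OPENIB_DEVICE}"
```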
|
Edited On 4:22:17 PM PDT - Fri, Sep 19th 2014 by Edoapra
|
|
|
-
Edoapra Forum:Admin, Forum:Mod, bureaucrat, sysop
|
|
Forum Vet
Threads 4
Posts 953
|
|
7:09:50 PM PDT - Fri, Sep 19th 2014 |
|
Please try the following
cd $NWCHEM_TOP/src/NWints/hondo
touch hnd_giaxyz.F
make FC=ifort FOPTIMIZE="-O0 -g" FDEBUG="-O0 -g"
cd ../..
make FC=ifort link
This should fix the prop_ch3f problem
|
Edited On 3:06:41 PM PDT - Sat, Sep 20th 2014 by Edoapra
|
|
|
|
Clicked A Few Times
Threads 1
Posts 8
|
|
9:58:44 PM PDT - Sat, Sep 20th 2014 |
|
Quote:Edoapra Sep 19th 11:21 pm
By the way, when you run nwchem on stampede do you set the environmental variable
ARMCI_OPENIB_DEVICE equal to mlx4_0?
Yes, I do set that variable.
|
|
|
|
Clicked A Few Times
Threads 1
Posts 8
|
|
11:21:57 AM PDT - Mon, Sep 22nd 2014 |
|
CentOS 5.x summary
|
I have some more data regarding the issues on CentOS 5.x. Hopefully this table will help:
CentOS 5.x with OpenMPI 1.6.4, Nwchem-6.5.revision26243-src.2014-09-10
| Compiler | Compiler version | Passes most tests | Failure description |
| gcc 4.4 | gcc44 (GCC) 4.4.7 20120313 (Red Hat 4.4.7-1) | Yes | N/A |
| ifort 2013 | ifort (IFORT) 13.0.1 20121010 | No | Wildly incorrect isotropic and anisotropy values (e.g. in prop_ch3f) |
| ifort 2013 w/ hondo fix | ifort (IFORT) 13.0.1 20121010 | Yes | N/A |
| ifort 2013 SP1 | ifort (IFORT) 14.0.2 20140120 | No | TCE jobs seg fault (e.g. tce_cr_eom_t_ch_rohf) |
Segmentation fault output from the ifort 2013 SP1 build:
=========================================
Excited-state calculation ( b2 symmetry)
=========================================
Dim. of EOMCC iter. space 500
2:Segmentation Violation error, status=: 11
(rank:2 hostname:fermi11 pid:10806):ARMCI DASSERT fail. ../../ga-5-3/armci/src/common/signaltrap.c:SigSegvHandler():310 cond:0
1:Segmentation Violation error, status=: 11
|
|
Edited On 11:36:06 AM PDT - Mon, Sep 22nd 2014 by Statics
|
|
|
|
Clicked A Few Times
Threads 1
Posts 8
|
|
11:55:11 AM PDT - Mon, Sep 22nd 2014 |
|
CentOS 6.x summary
|
I also have some more information regarding NWChem on CentOS 6.x (Stampede). It currently looks like the problem is related to the parallelization.
I was able to reproduce the problem with the hess_h2o QA test using the default compiled 6.3 version on the system. The error/symptom is identical to that observed with 6.5.
I tried a number of different MPI and Intel compiler versions, all with the same result; the failure appears to depend on the number and distribution of cores.
I'm still working through the scenarios. The original tests were run on 3 nodes with 16 ppn and failed with the odd frequency values. Runs using 24 cores fail as 3:ppn=8 or 4:ppn=6, but work as 2:ppn=12. However, 4:ppn=16 works, so it does not look like an upper-bound issue.
So it looks like there is a working version of 6.5 on the system; however, the accuracy of the frequency values depends on the parallelization.
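To make the pattern easier to see, here is a small sketch that tabulates the layouts mentioned in this thread together with their total core counts. The outcomes are copied from the posts (the single-node 16-processor run is from the first post); the variable and function names are hypothetical.

```shell
#!/bin/sh
# Each entry is nodes:ppn:outcome, as reported in this thread.
layouts="1:16:pass 3:16:fail 3:8:fail 4:6:fail 2:12:pass 4:16:pass"

summarize_layouts() {
    for entry in $layouts; do
        nodes=${entry%%:*}     # text before the first colon
        rest=${entry#*:}       # text after the first colon
        ppn=${rest%%:*}
        outcome=${rest#*:}
        echo "${nodes} nodes x ${ppn} ppn = $((nodes * ppn)) cores: ${outcome}"
    done
}

summarize_layouts
```

The three 24-core layouts give different outcomes, which is why the total core count alone does not seem to explain the failures.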
|
|
|
-
Edoapra Forum:Admin, Forum:Mod, bureaucrat, sysop
|
|
Forum Vet
Threads 4
Posts 953
|
|
2:16:51 PM PDT - Mon, Sep 22nd 2014 |
|
Quote:Statics Sep 22nd 10:21 am
I have some more data regarding the issues on CentOS 5.x. Hopefully this table will help:
CentOS 5.x with OpenMPI 1.6.4, Nwchem-6.5.revision26243-src.2014-09-10
| Compiler | Compiler version | Passes most tests | Failure description |
| gcc 4.4 | gcc44 (GCC) 4.4.7 20120313 (Red Hat 4.4.7-1) | Yes | N/A |
| ifort 2013 | ifort (IFORT) 13.0.1 20121010 | No | Wildly incorrect isotropic and anisotropy values (e.g. in prop_ch3f) |
| ifort 2013 w/ hondo fix | ifort (IFORT) 13.0.1 20121010 | Yes | N/A |
| ifort 2013 SP1 | ifort (IFORT) 14.0.2 20140120 | No | TCE jobs seg fault (e.g. tce_cr_eom_t_ch_rohf) |
Segmentation fault output from the ifort 2013 SP1 build:
=========================================
Excited-state calculation ( b2 symmetry)
=========================================
Dim. of EOMCC iter. space 500
2:Segmentation Violation error, status=: 11
(rank:2 hostname:fermi11 pid:10806):ARMCI DASSERT fail. ../../ga-5-3/armci/src/common/signaltrap.c:SigSegvHandler():310 cond:0
1:Segmentation Violation error, status=: 11
|
Thanks for the detailed report.
I was able to reproduce the segv with ifort 14.0.2.
|
|
|
-
Edoapra Forum:Admin, Forum:Mod, bureaucrat, sysop
|
|
Forum Vet
Threads 4
Posts 953
|
|
4:09:20 PM PDT - Mon, Sep 22nd 2014 |
|
Here is the fix for the Intel 14.0.2 SegV you reported
cd $NWCHEM_TOP/src
wget http://www.nwchem-sw.org/images/Hbar.patch.gz
gzip -d Hbar.patch
patch -p0 < Hbar.patch
cd tce
make FC=ifort
cd ..
make FC=ifort link
Thanks again for the detailed and useful bug report
|
Edited On 5:39:21 PM PDT - Mon, Sep 22nd 2014 by Edoapra
|
|
|
|
Clicked A Few Times
Threads 1
Posts 8
|
|
9:28:50 AM PDT - Tue, Sep 23rd 2014 |
|
Updated CentOS 5.x summary. Thanks for the patch. Intel 2013SP1 now passes most tests and is about 8% faster than the gcc44 version based on the wall clock time of the many small QA tests in md and qm-fast set.
| Compiler | Compiler version | Passes most tests | Failure description |
| gcc 4.4 | gcc44 (GCC) 4.4.7 20120313 (Red Hat 4.4.7-1) | Yes | N/A |
| ifort 2013 | ifort (IFORT) 13.0.1 20121010 | No | Wildly incorrect isotropic and anisotropy values (e.g. in prop_ch3f) |
| ifort 2013 w/ hondo fix | ifort (IFORT) 13.0.1 20121010 | Yes | N/A |
| ifort 2013 SP1 | ifort (IFORT) 14.0.2 20140120 | No | TCE jobs seg fault (e.g. tce_cr_eom_t_ch_rohf) |
| ifort 2013 SP1 w/ tce fix | ifort (IFORT) 14.0.2 20140120 | Yes | N/A |
Segmentation fault output from the unpatched ifort 2013 SP1 build:
=========================================
Excited-state calculation ( b2 symmetry)
=========================================
Dim. of EOMCC iter. space 500
2:Segmentation Violation error, status=: 11
(rank:2 hostname:fermi11 pid:10806):ARMCI DASSERT fail. ../../ga-5-3/armci/src/common/signaltrap.c:SigSegvHandler():310 cond:0
1:Segmentation Violation error, status=: 11
|
|
|
|
-
Edoapra Forum:Admin, Forum:Mod, bureaucrat, sysop
|
|
Forum Vet
Threads 4
Posts 953
|
|
|
|
|
Clicked A Few Times
Threads 1
Posts 8
|
|
2:27:59 PM PDT - Tue, Sep 23rd 2014 |
|
Yes, I'll look at that and the other patches available.
Thanks.
|
|
|
|
Clicked A Few Times
Threads 10
Posts 20
|
|
4:18:13 PM PST - Thu, Jan 1st 2015 |
|
Hello
I am trying to get the vibrational modes for the following system using DFT. I am using NWChem 6.5 and have not been able to obtain them. I am not sure whether the patch given on this page, or the one at http://www.nwchem-sw.org/index.php/Special:AWCforum/sp/id5149, would also apply to my system, and whether that is the only fix available.
echo
title "cluster"
memory total 2000 mb #I thought segmentation fault was due to some memory issues, so increase total from 400 to 2000mb.
geometry autosym
Pd -3.78493565 1.99241247 -0.78722757
Pd -3.14879170 -0.73451269 -0.78726064
Pd -2.51264775 -3.46143787 -0.78729369
Pd -1.74142116 3.90679187 -0.78723053
Pd -1.11988907 1.19421949 -0.88090579
Pd -0.47503225 -1.56660404 -0.88041514
Pd 0.16701070 -4.27398363 -0.78732972
Pd 0.93823729 3.09424611 -0.78726656
Pd 1.59464768 0.37329742 -0.88143236
Pd 2.21052519 -2.35960422 -0.78733268
Pd 3.61789574 2.28170036 -0.78730258
Pd 4.25403969 -0.44522482 -0.78733564
Pd -2.67964416 0.81256671 1.49905647
Pd -2.04350020 -1.91435845 1.49902340
Pd -0.63612966 2.72694612 1.49905350
Pd 0.00001429 0.00002095 1.49902044
Pd 0.63615824 -2.72690422 1.49898738
Pd 2.04352879 1.91440036 1.49901748
Pd 2.67967273 -0.81252482 1.49898442
C -0.00087599 -0.00094796 -2.10286302
O -0.00115233 -0.00065097 -3.29875807
end
basis "large" cartesian
Pd library lanl2dz_ecp file /usr/local/NWChem/data/libraries/
C library 6-311G** file /usr/local/NWChem/data/libraries/
O library 6-311G** file /usr/local/NWChem/data/libraries/
H library 6-311G** file /usr/local/NWChem/data/libraries/
end
ecp
Pd library lanl2dz_ecp file /usr/local/NWChem/data/libraries/
end
set geometry:actlist 5 6 9 20 21
set "ao basis" "large"
dft
vectors input dft-freq.movecs output dft-freq-2.movecs
iterations 500
direct
mult 1
XC xpbe96 cpbe96
convergence ncyds 1000 damp 70 ncydp 100 diis 16 #default diis also gives the same error
smear 0.001
end
DRIVER
XYZ a.xyz
MAXITER 500
END
freq
temp 1 298
end
task dft freq
I am always getting the following error:
stpr_wrt_fd_from_sq: overwrite of existing file:./cluster.hess
stpr_wrt_fd_dipole: overwrite of existing file./cluster.fd_ddipole
HESSIAN: the one electron contributions are done in 134.5s
14:Segmentation Violation error, status=: 11
(rank:14 hostname:hpc-6 pid:22410):ARMCI DASSERT fail. ../../ga-5-3/armci/src/common/signaltrap.c:SigSegvHandler():310 cond:0
rank 14 in job 10 hpc-6_59277 caused collective abort of all ranks
exit status of rank 14: return code 11
Thanks for any help.
|
Edited On 4:22:20 PM PST - Thu, Jan 1st 2015 by Nwchemy
|
|
|