From NWChem
3:29:23 AM PDT - Fri, Oct 11th 2013
Hi,
I am compiling NWChem 6.3 on the Cray XC30 "Sisu" at CSC, Finland. The executables build successfully, but I'm not sure they work correctly, because the very first test case I tried fails. The test case is the C240 buckyball from the Benchmarks section of this website. It ran fine with NWChem 6.1.1, but with version 6.3 it fails with the error message:
dft energy failed
0
This type of error is most commonly associated with calculations not reaching convergence criteria
Here is the batch job script for my test run:
#!/bin/sh
#SBATCH --job-name=c240_pbe0
#SBATCH --output=c240_pbe0.%J.out
#SBATCH --error=c240_pbe0.%J.err
#SBATCH --tasks=512
#SBATCH --tasks-per-node=16
#SBATCH --time=01:00:00
#SBATCH --partition=small
export HUGETLB_MORECORE=yes
export HUGETLB_DEFAULT_PAGE_SIZE=8M
#sed -i 's%scratch_dir /scratch%%g' Input_c240_pbe0.nw
aprun -n 512 -N 16 nwchem Input_c240_pbe0.nw
When compiling, I strictly followed the instructions given at:
http://www.nwchem-sw.org/index.php/Compiling_NWChem#How-to:_Cray_platforms
I got several warnings, but the executables were created successfully. The warnings looked like this:
libhugetlbfs [sisu-login5:9779]: WARNING: No mount point found for default hugepage size. Using first available mount point.
libhugetlbfs [sisu-login5:9779]: WARNING: Hugepage size 2097152 unavailable
During execution, I got more warnings and errors:
LIBDMAPP WARNING: invalid value 0 for queue_depth. See dmapp.h for valid options. Defaulting to DMAPP_QUEUE_DEFAULT_DEPTH
LIBDMAPP WARNING: invalid value 0 for queue_nelems. See dmapp.h for valid options. Defaulting to DMAPP_QUEUE_DEFAULT_NELEMS
libhugetlbfs [nid00080:28446]: WARNING: New heap segment map at 0xa0000000 failed: Cannot allocate memory
Rank 0 [Thu Oct 10 15:12:46 2013] [c0-0c1s4n0] application called MPI_Abort(comm=0x84000004, -1) - process 0
Program received signal SIGABRT: Process abort signal.
Any ideas how to fix the problem?
Edoapra Forum:Admin, Forum:Mod, bureaucrat, sysop
5:15:27 PM PDT - Mon, Oct 21st 2013
Petri,
I think there are two separate problems affecting your runs on the Cray XC30 at CSC:
1) You get plenty of warnings coming out of the libhugetlbfs library. I don't think these warnings are causing your run to fail (more later).
2) Your calculation fails (eventually calling MPI_Abort) since NWChem detects an error.
First, let's talk about point #2. The input for the c240 benchmark available from the NWChem website always causes NWChem to stop with an error. This happens because the input limits the number of SCF cycles to four (iterations 4), and the SCF never converges in four iterations. To avoid this failure, I have checked in a new version of the input file that does not generate a fatal error.
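To see whether a local copy of the input still carries the four-iteration cap, a quick search can help. This is only a sketch: the helper name show_iteration_limit is made up for illustration, and it assumes the cap is written as an "iterations N" directive somewhere on a line, as quoted above.

```shell
# Hypothetical helper (not part of NWChem): list every line of an input
# file that contains an "iterations <number>" directive, with line numbers.
show_iteration_limit() {
    grep -n -i -E '(^|[[:space:]])iterations[[:space:]]+[0-9]+' "$1"
}

# Example call, using the input file name from the batch script above:
#   show_iteration_limit Input_c240_pbe0.nw
```

If the search reports "iterations 4", the file is probably the older version of the benchmark input.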
As far as the issue with libhugetlbfs is concerned, there are a few suggestions I can pass to you.
If you set the environment variable HUGETLB_VERBOSE to zero, all the warning messages will vanish and, at the same time, the wall time of your jobs will decrease (most likely because each warning message takes quite a bit of time to write).
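In the batch script posted earlier in this thread, that would mean one extra export ahead of the aprun line. A minimal sketch, assuming the rest of the script stays as it was:

```shell
# Silence libhugetlbfs warnings, as suggested above (new line).
export HUGETLB_VERBOSE=0

# These two exports are unchanged from the original batch script.
export HUGETLB_MORECORE=yes
export HUGETLB_DEFAULT_PAGE_SIZE=8M
```

The variable has to be exported before the application is launched so that the compute-node processes inherit it.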
Another thing that seems to have improved performance and stability for my Cray XC30 runs is switching from the GA source code we distribute with NWChem 6.3 (which resides in the $NWCHEM_TOP/src/tools directory) to the modified version contributed by the Cray folks (available at https://github.com/ryanolson/ga/archive/cray.zip).
Please let me know if you need any help installing the modified tools source code.
Cheers, Edo
5:28:41 AM PST - Mon, Nov 18th 2013
Many thanks! I forgot to check updates on this message thread until now.
I followed your suggestions and the problem went away. Performance is good:
Platform: Cray XC30 Sisu
Application: NWChem 6.3
Data: C240 buckyball (nwchem-sw.org)
MPI tasks: 512
Physical CPU cores: 512
Hyperthreading: disabled
Metric: Time per step
Result: 73.1 s
With version 6.1.1 the same benchmark took 74.8 s, so version 6.3 is equally fast. (Differences of a few percent are not significant.)
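For reference, the gap between the two timings can be worked out as a percentage of the older one. The helper name percent_diff is just for illustration:

```shell
# Relative difference between two timings, as a percentage of the first.
percent_diff() {
    awk -v t_old="$1" -v t_new="$2" \
        'BEGIN { printf "%.1f\n", (t_old - t_new) / t_old * 100 }'
}

# 6.1.1 took 74.8 s per step, 6.3 took 73.1 s per step.
percent_diff 74.8 73.1   # prints 2.3
```

That is about 2.3 %, which falls within the "few percent" treated as noise above.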