From NWChem
5:44:36 PM - Tue, Nov 2nd 2010
Hello, for the last few weeks I have been trying to analyse an NWChem crash.
The input is the C240 Buckminster Fullerene calculation from the Benchmarks section of this site.
It runs on 32 nodes, each with two Xeon CPUs with hyperthreading enabled, so every compute
node offers 4 computational units. The interconnect is plain Gigabit Ethernet.
The first crashes occurred with a home-built binary compiled with O3 optimisation. I then rebuilt
with O2 optimisation, and the run stops at exactly the same spot; both binaries fail after a
computation of almost equal duration. Both builds use Intel MKL, so the next step is to remove MKL
and see what happens. The program is built with MPICH2 and the ifort compiler.
It seems that ARMCI is either misconfigured or somehow fails to communicate.
The significant error seems to be ARMCI DASSERT fail. armci.c:ARMCI_Error():260 cond:0
The "cond:0" part presumably means an asserted condition evaluated to false, but I still have not
dug into the code to find out what is actually being checked.
Here is an excerpt from the NWChem log.
dft energy failed 0
------------------------------------------------------------------------
------------------------------------------------------------------------
current input line :
278: task dft energy
------------------------------------------------------------------------
------------------------------------------------------------------------
This type of error is most commonly
associatated with calculations not reaching convergence criteria
------------------------------------------------------------------------
For more information see the NWChem manual at
http://www.emsl.pnl.gov/docs/nwchem/nwchem.html
For further details see manual section:
0:0:dft energy failed:: 0
(rank:0 hostname:j314.jotunn.rhi.hi.is pid:13071):ARMCI DASSERT fail. armci.c:ARMCI_Error():260 cond:0
Last System Error Message from Task 0:: Inappropriate ioctl for device
application called MPI_Abort(MPI_COMM_WORLD, 0) - process 0
I am working on testing some alternatives: eliminating MKL, eliminating an external BLAS altogether, and trying ATLAS with LAPACK.
Should I use the Intel C compiler instead of GNU CC?
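To make the alternatives concrete, here is a rough sketch of the build-variable combinations I plan to try, using the standard NWChem makefile environment variables (NWCHEM_TARGET, USE_MPI, BLASOPT). The install paths and the exact MKL/ATLAS link lines below are placeholders, not my actual locations; the correct MKL libraries depend on the MKL version.

```shell
# Common settings for all rebuilds (paths are examples, not my real install)
export NWCHEM_TOP=/opt/nwchem        # placeholder path
export NWCHEM_TARGET=LINUX64
export USE_MPI=y
export FC=ifort

# Variant 1: current build with Intel MKL (example link line; verify the
# library names against your MKL version's documentation)
export BLASOPT="-L${MKLROOT}/lib/intel64 -lmkl_intel_lp64 -lmkl_sequential -lmkl_core"

# Variant 2: no external BLAS at all -- with BLASOPT unset, NWChem falls
# back to its internal reference BLAS routines
unset BLASOPT

# Variant 3: ATLAS plus LAPACK (example path and library names)
export BLASOPT="-L/usr/lib/atlas -llapack -lf77blas -lcblas -latlas"

# Rebuild after changing the variant
cd "$NWCHEM_TOP/src" && make clean && make
```

The idea is to change only one thing per rebuild, so that if the crash disappears it can be pinned on the BLAS library rather than the compiler or ARMCI.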
Best regards, Anna Jonna.