My Community

Posted: **Thu Apr 22, 2010 2:24 pm**

I am trying to run a couple of NEB calculations using VASP 5.2 on a BG/P, and have run into a problem that has been reported before, but the fixes mentioned (increasing NONLR_S%IRMAX and NONLR_S%IRALLOC) haven't affected my problem.

The geometries are reasonable, and this error occurs before the first electronic iteration is completed. I am using PAW and PW-91 GGA potentials. Each image is a ~140-atom calculation with 2x2x2 k-points. I am trying to study the diffusion path of an LiBH4 vacancy, if that helps.

INCAR:
SYSTEM = Li4BN3H10 relax
ISTART = 0
ISMEAR = 0
SIGMA = .1
ISIF = 2
ISPIN = 2
PREC = MED
ENCUT = 520
IBRION = 2
NSW = 50
GGA = 91
NWRITE = 1
IALGO = 48
LREAL = A
NSIM = 2
NPAR = 1
LPLANE = .TRUE.
LCHARG = .FALSE.
LWAVE = .FALSE.
LSCALAPACK = .TRUE.
LSCALU = .TRUE.
MAGMOM = 144*0
IMAGES = 5
SPRING=-5

output:
running on 640 nodes
each image running on 128 nodes
distr: one band on 128 nodes, 1 groups
vasp.5.2.2 15Apr09 complex
POSCAR found type information on POSCAR B H Li N
01/POSCAR found : 4 types and 138 ions
scaLAPACK will be used
LDA part: xc-table for Ceperly-Alder, standard interpolation
00/POSCAR found : 4 types and 138 ions
06/POSCAR found : 4 types and 138 ions
POSCAR, INCAR and KPOINTS ok, starting setup
WARNING: small aliasing (wrap around) errors must be expected
internal ERROR RSPHER:running out of buffer 0 0 8 1 0
internal ERROR RSPHER:running out of buffer 0 0 8 1 0
internal ERROR RSPHER:running out of buffer 0 0 8 1 0
internal ERROR RSPHER:running out of buffer 0 0 8 1 0
<continues for number of nodes>

I also found if I set
LSCALAPACK = .FALSE.
LSCALU = .FALSE.

then the simulation will run for a few ionic steps, but at some point will hang up until the scheduler kills the job. The output of each image indicates that there may be some lag in one of the images and the other processes hang up waiting for it to complete. It seems to be a fairly repeatable error, and I've seen it also on a Cray XT5 using VASP 4.6.35 (the version modded for scalability by Paul Kent at ORNL).

Any ideas on what could be happening and how to fix it?

Posted: **Sat Apr 24, 2010 6:12 am**

Hi,

I'm having similar issues and I've been trying to narrow down the cause. So far it seems to be a parallelisation issue (as you say) associated with projection operator evaluation (real/recip. space) and system size.

What MPI software is running on the cluster and how many CPUs per node or in total?

Chris.

Posted: **Sun Apr 25, 2010 3:13 pm**

Try setting LREAL=FALSE

If that doesn't help try removing NPAR=1

Posted: **Wed Apr 28, 2010 1:11 am**

On the BG/P, there are 4 cores/node, and I am not entirely sure what flavor of MPI it is using. VASP was compiled with the MPI-enabled XLF compiler that is on the machine (built for its architecture).

Kraken has 12 cores/node and I am using whatever flavor of MPI is rolled into the PGI compilers (and also Cray specific, I believe).

I am trying it with LREAL=.FALSE. now on both machines, with a smaller number of processors to see if it works. Just going to 60 or 64 processors doesn't seem to help.

Posted: **Wed Apr 28, 2010 10:33 am**

Another possibility (bit of trial and error here): use LREAL=AUTO, NPAR=4

This article claims best results achieved with NPAR=NCPUS/16

http://www.hpcx.ac.uk/research/hpc/tech ... TR0414.pdf

Posted: **Wed Apr 28, 2010 2:23 pm**

I tried LREAL=.FALSE. on 640 processors on the BG/P, and ended up getting a bad electronic solution (after about a dozen electronic steps):

WARNING in EDDIAG: sub space matrix is not hermitian 1 -0.104E+04
pPOTRF_TRTRI, POTRF, INFO: 1

I'm checking with fewer processors, as the geometry should be OK. But I'm double checking that as well.

Posted: **Wed Apr 28, 2010 3:59 pm**

From what I can gather now, it seems to be a parallelisation issue associated with the MPI software or the way in which VASP was compiled (assuming we are dealing with the same issue when VASP reports ERROR RSPHER: running out of buffer). I don't think the problem has anything to do with VASP itself, so adjusting VASP parameters is merely a means of getting around the problem, not actually solving it.

That said, I think your best chance is to try different combinations of LREAL=TRUE/FALSE and NPAR=1-8 (or removed) and adjust the number of CPUs. I have had limited success through adjusting these parameters.

Posted: **Thu Apr 29, 2010 12:21 pm**

Hmm, while I won't rule out the MPI library being at fault, I find it a bit odd that 2 relatively stable HPC machines with completely different architectures and libraries are having what appears to be precisely the same problem. Given what I know of the use of VASP on large-scale parallel machines (i.e. that rather little work has been done on it, certainly compared to other codes), I am a little wary of giving the sources a pass.

However, seeing as how I won't be digging into the sources anytime soon, I don't think we have much choice in trying to hack a 'fix'.

Just for my own info - what kind of machine are you running on?

Posted: **Thu Apr 29, 2010 3:25 pm**

OK, on 64 cores per image, letting NPAR be the default (i.e. don't specify a value) and setting NSIM = 1 appears to work, at least for 10 iterations.

I'm now waiting for the results of a 128 core/image run.

Posted: **Fri Apr 30, 2010 3:06 pm**

OK, so it appears that if I run it out at 128 cores/image (on the BG/P) I am able to run for about 13 steps before something goes wrong.

What is interesting is that the failure mode seems to be that the OUTCAR indicates an ionic iteration has completed on one of the images (image 3 in my test) because the LOOP+ timing line is printed, but that is where it seems to hang and the OSZICAR isn't updated. So you end up with a significant (2-4 minute) timestamp difference between the OUTCAR and the OSZICAR: the OUTCAR seems to be further along than the OSZICAR.

I was able to see this same behavior in another test system (different NEB system) on the Cray XT5, Kraken using the earlier version of VASP.
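The timestamp symptom described above can be checked automatically. The following is only a minimal sketch, assuming the usual NEB layout of one numbered directory per image (e.g. 01 to 05 for IMAGES = 5) with OUTCAR and OSZICAR inside each. The function name `stale_images` and the 120-second threshold are illustrative choices (the threshold is taken from the 2-4 minute gap reported here), not anything provided by VASP.

```python
import os

def stale_images(root, image_dirs, threshold_s=120.0):
    """Return (directory, lag_seconds) for every image whose OUTCAR was
    modified more than threshold_s seconds after its OSZICAR."""
    flagged = []
    for d in image_dirs:
        outcar = os.path.join(root, d, "OUTCAR")
        oszicar = os.path.join(root, d, "OSZICAR")
        # Skip directories that do not hold both files (e.g. end points).
        if not (os.path.exists(outcar) and os.path.exists(oszicar)):
            continue
        lag = os.path.getmtime(outcar) - os.path.getmtime(oszicar)
        if lag > threshold_s:
            flagged.append((d, lag))
    return flagged

if __name__ == "__main__":
    # Hypothetical usage: interior images 01..05 of a 5-image run.
    images = [f"{i:02d}" for i in range(1, 6)]
    for name, lag in stale_images(".", images):
        print(f"image {name}: OUTCAR is {lag:.0f} s newer than OSZICAR")
```

A large lag on a single image while the job is still running would be consistent with the hang described in this post, but it is only a hint: a short lag is normal while an ionic step is in progress.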

Posted: **Fri Apr 30, 2010 3:52 pm**

Additional info:

In the OUTCAR for the image that seems to hang (regardless of architecture, system or version), the line 'reached required accuracy - stopping structural energy minimisation' gets printed before the LOOP+ timing line, while the other images seem to want to continue.

Perhaps a logic issue in the NEB implementation? It could be hard to trigger because it relies on a system where one image gets closer to the minimization criteria before the others.
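To see which images have printed the convergence message and which have not, the OUTCAR files can be scanned for it. This is a rough sketch, assuming one numbered directory per image; the helper name `converged_images` is hypothetical, and the search string is simply the start of the OUTCAR line quoted in this post.

```python
import os

# Start of the OUTCAR line quoted above.
MARKER = "reached required accuracy"

def converged_images(root, image_dirs):
    """Map each image directory to True if its OUTCAR contains MARKER,
    False if it does not; directories without an OUTCAR are omitted."""
    result = {}
    for d in image_dirs:
        path = os.path.join(root, d, "OUTCAR")
        if not os.path.exists(path):
            continue
        with open(path, errors="replace") as fh:
            result[d] = any(MARKER in line for line in fh)
    return result

if __name__ == "__main__":
    # Hypothetical usage: interior images 01..05 of a 5-image run.
    images = [f"{i:02d}" for i in range(1, 6)]
    for name, done in converged_images(".", images).items():
        print(f"image {name}: {'converged' if done else 'still running'}")
```

If exactly one image reports convergence while the others do not, that would match the situation described here, where one image stops and the rest appear to wait.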

Posted: **Fri Apr 30, 2010 6:49 pm**

I have a 5-image, 140-atom problem that reproduces this error and takes a bit less than an hour to run on 480 processors (but the processor count doesn't seem to matter). Is there a good place to send this along with a more detailed bug report?

Posted: **Fri Apr 30, 2010 6:50 pm**

To clarify: this reproduces the hang problem, not the out-of-buffer problem. That seems to be a separate issue.

Posted: **Sun May 23, 2010 4:06 am**

[quote="d-farrell2"]Just for my own info - what kind of machine are you running on?[/quote]

I don't have a great deal of knowledge of HPC, so I have just copied this from the cluster website:

Specifications of the NCI NF Oracle/Sun Constellation Cluster:

* a high density integrated system

* 1492 nodes in Sun X6275 blades, each containing:
o two quad-core 2.93GHz Intel Nehalem cpus with 6.4GTs QPI bus
o L1 cache (on chip): 32KB (I) + 32KB (D)
L2 cache (on chip): 256KB
L3 cache (on chip): 8MB per quad-core cpu
o 24Gbytes DDR3-1333 memory (48 nodes have 48 Gbytes)
o 24GB Flash DIMM for swap and some job scratch
o on-board QDR InfiniBand adapter

* Aggregate SPECfp_rate_base2006 of (compute nodes only) 250000 (AC was around 20000)
Peak theoretical performance of approximately 140TFlops.

* Total of 37TB of RAM on compute nodes

* 30 dual socket, quad-core Sun X4170 servers for Lustre fileserving

* Approx 800 TBytes of usable global storage using 52 Sun J4400 JBOD trays each with 24 1TB Seagate Enterprise SATA drives

* Four independent Sun DS648 Infiniband switches each with 432 QDR IB ports for both MPI and Lustre filesystem traffic:
Measured MPI Latency: < 2.0us
Measured MPI Bandwidth: > 2600MB/s per node

The system software used on the vayu cluster includes:

* CentOS 5.4 Linux distribution (based on RHEL5.4)

* the oneSIS cluster software management system

* the Lustre cluster file system:
o 104 x (8+2 RAID6 8TB) OSTs for /short
o 104 x (1+1 RAID1 520GB) OSTs for /home
o 104 x (1+1 RAID1 140GB) OSTs for /apps
o root filesystem also on Lustre

* the National Facility's variant of the OpenPBS batch queuing system

Posted: **Sun May 23, 2010 4:13 am**

I ran an NPAR optimisation test and found that if NCPU is too high and NPAR too low, the job will quit with ERROR RSPHER. On the other hand, if NCPU is too low and NPAR too high, the job will hang during iterations and cease to write any output.

There is an optimum value of NCPU/NPAR, which can be determined by running a series of 1 or 2 hour jobs with NPAR=1,2,4,8,16,32 for a given number of CPUs. NPAR must always be a factor of NCPU, and I believe the optimum is usually close to NCPU/16, although this is probably cluster dependent and possibly also dependent on the type of calculation being performed.
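The scan described above can be set up with a few lines of Python. This is only a sketch of the bookkeeping: it filters the trial NPAR values down to those that divide NCPU and picks the one nearest the NCPU/16 rule of thumb as a starting point. The function names are made up for illustration, and for an NEB run it is an assumption here that NCPU should be read as the number of cores per image.

```python
def npar_candidates(ncpu, values=(1, 2, 4, 8, 16, 32)):
    """Return the trial NPAR values that divide ncpu evenly."""
    return [n for n in values if ncpu % n == 0]

def closest_to_rule_of_thumb(ncpu, values=(1, 2, 4, 8, 16, 32), divisor=16):
    """Return the valid candidate nearest to ncpu / divisor.
    On a tie, the smaller candidate is returned."""
    target = ncpu / divisor
    return min(npar_candidates(ncpu, values), key=lambda n: abs(n - target))

if __name__ == "__main__":
    for ncpu in (64, 128):
        print(ncpu, npar_candidates(ncpu), closest_to_rule_of_thumb(ncpu))
```

The suggested value is only a first guess; the short timing jobs are still what decides which NPAR actually works on a given machine.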

5 Image NEB "internal ERROR RSPHER:running out of buffer" or hang problems
