Parallelization/MPI problem on Cray XC30/40 (hangs)
Posted: Wed Feb 24, 2016 4:36 pm
Hi,
Over the last few weeks I have encountered quite weird behaviour of VASP on a Cray XC30/XC40 machine (CSCS Daint or Dora). I was used to a normal OpenMPI cluster (CSCS Monch), and my self-compiled VASP version runs very nicely there. Then I switched to Dora and first compiled VASP 5.3.5 with intel/15.0.1.133, Intel FFT (cray-libsci unloaded) and cray-mpich/7.2.2, as suggested by Peter Larsson (https://www.nsc.liu.se/~pla/blog/2015/0 ... cray-xc40/). I did extensive testing and everything looked really good. So I went on to VASP 5.4.1 and compiled it with the same modules. Tests also look really good on a single node. One important test case for me is a relaxation with (S)GGA+U, and here I run into a problem as soon as I use more than one node (the number of cores does not matter): the first two relaxation steps run fine, but then everything goes bananas. Sometimes VASP does not find an electronic minimum, or, more often, it just hangs after the last iteration. The end of the OUTCAR looks like this:
----------------------------------------- Iteration 3( 14) ---------------------------------------
POTLOK: cpu time 0.0480: real time 0.0462
SETDIJ: cpu time 0.0040: real time 0.0055
EDDAV: cpu time 14.1769: real time 14.2087
DOS: cpu time 0.0240: real time 0.0214
The problem is that it freezes, and I have to watch my jobs the whole time to see if they are still working.
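To avoid watching the jobs by hand, something like the following could be run next to the job. This is only a minimal sketch, not part of VASP: it assumes that a hung run simply stops writing to the OUTCAR, and the function names, the 30-minute timeout and the 60-second polling interval are all made up for illustration. The timeout has to be longer than the slowest legitimate step between two writes.

```python
import os
import sys
import time


def seconds_since_update(path, now=None):
    """Return the seconds elapsed since `path` was last modified."""
    if now is None:
        now = time.time()
    return now - os.path.getmtime(path)


def is_stalled(path, timeout_s, now=None):
    """True if `path` has not been written to for more than `timeout_s` seconds."""
    return seconds_since_update(path, now) > timeout_s


if __name__ == "__main__":
    # Hypothetical defaults: watch ./OUTCAR, flag a stall after 30 minutes.
    outcar = sys.argv[1] if len(sys.argv) > 1 else "OUTCAR"
    timeout_s = float(sys.argv[2]) if len(sys.argv) > 2 else 1800.0

    # Poll once per minute until the file stops changing.
    while not is_stalled(outcar, timeout_s):
        time.sleep(60)

    print(f"{outcar} unchanged for more than {timeout_s:.0f} s, job may be hanging")
    sys.exit(1)
```

The non-zero exit code makes it easy to chain this with whatever command cancels the job on the batch system in use.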
I tried several setups for the compilation: an older MPICH version, an older Intel version, FFT from cray-libsci, the -DMPI_barrier_after_bcast flag, only -O1 optimization, and no AVX/AVX2 support. I also contacted the CSCS team. They can reproduce my problem, but have no clue what the cause could be. I'm using 5.4.1 with the latest patches. I attach an example makefile and the INCAR and POSCAR files.
I also tried changing NSIM, NCORE and KPAR, but still no luck. When I use more and more nodes the problem becomes even worse. Has anyone encountered a similar problem? With VASP 5.3.5, or on CSCS Monch or ETH Euler, the same calculation runs fine. With 5.4.1 on the Cray it always hangs at the same step, even when numerical parameters are changed. I also have a traceback for this error, but somehow I'm not allowed to upload txt files.
######## makefile ###########
# Precompiler options
CPP_OPTIONS= -DMPI -DHOST=\"CrayXC-Intel\" \
-DIFC \
-DCACHE_SIZE=32000 \
-DPGF90 \
-DscaLAPACK \
-Davoidalloc \
-DMPI_BLOCK=128000 \
-Duse_collective \
-DnoAugXCmeta \
-Duse_bse_te \
-Duse_shmem \
-Dtbdyn \
-DVASP2WANNIER90 \
-DMPI_barrier_after_bcast
CPP = fpp -f_com=no -free -w0 $*$(FUFFIX) $*$(SUFFIX) $(CPP_OPTIONS)
FC = ftn -I$(MKLROOT)/include/fftw -g -traceback -heap-arrays
FCL = ftn
FREE = -free -names lowercase
FFLAGS = -assume byterecl
OFLAG = -O1 -ip -xCORE-AVX2
#OFLAG = -O0 -g -traceback
OFLAG_IN = $(OFLAG)
DEBUG = -O0
BLAS = -mkl=cluster #sequential
LAPACK =
SCA = -lmkl_scalapack_lp64 -lmkl_blacs_intelmpi_lp64
OBJECTS = fftmpiw.o fftmpi_map.o fftw3d.o fft3dlib.o
INCS =
LLIBS = $(SCA) $(LAPACK) $(BLAS) \
/store/p504/ahampel/codes/dora/wannier90/1.2/wannier90-1.2/libwannier.a
OBJECTS_O2 += fftw3d.o fftmpi.o fftmpiw.o fft3dlib.o
OBJECTS_O1 += fft3dfurth.o mpi.o wave_mpi.o electron.o charge.o us.o
# For what used to be vasp.5.lib
CPP_LIB = $(CPP)
FC_LIB = $(FC)
CC_LIB = cc
CFLAGS_LIB = -O
FFLAGS_LIB = -O1
FREE_LIB = $(FREE)
OBJECTS_LIB= linpack_double.o getshmem.o
# Normally no need to change this
SRCDIR = ../../src
BINDIR = ../../bin
######## INCAR ###########
SYSTEM=lanio3-sggau-rel
#Startup
ICHARG=2
ISTART=0
# parameters that determine general accuracy
PREC = Accurate
ENCUT = 550
EDIFF = 1e-7
LWAVE = .FALSE.
# relaxation
EDIFFG = -0.001
ISIF = 3
NSW = 100
IBRION = 2
ISMEAR = 0
SIGMA = 0.001
NEDOS = 1001
# spin polarization
ISPIN = 2
MAGMOM = 4*0 4 4 4 4 12*0
LORBIT = 11
# parallelization & performance
IALGO=38
LPLANE = .TRUE.
NCORE=24
LSCALU = .FALSE.
NSIM = 2
KPAR = 2
# LSDA+U
LMAXMIX=4
LDAU = .TRUE.
LDAUTYPE= 1
LDAUL= -1 2 -1
LDAUU= 0 5 0
LDAUJ= 0 1 0
#### job output #######
Running on 8 nodes (nid00[701-707,712]), 24 tasks/node
running on 192 total cores
distrk: each k-point on 24 cores, 8 groups
distr: one band on 24 cores, 1 groups
using from now: INCAR
vasp.5.4.1 24Jun15 (build Jan 28 2016 14:53:33) complex
POSCAR found type information on POSCAR La Ni O
POSCAR found : 3 types and 20 ions
...
0.10177E+01 -0.20455E+01 94944 0.338E+01 0.138E+01
DAV: 2 -0.137266193277E+03 -0.33294E+00 -0.12622E+01119188 0.450E+01 0.173E+01
DAV: 3 -0.136069236957E+03 0.11970E+01 -0.26720E+00104420 0.259E+01 0.724E+00
DAV: 4 -0.135969111317E+03 0.10013E+00 -0.44404E-01127460 0.931E+00 0.350E+00
DAV: 5 -0.135941844668E+03 0.27267E-01 -0.21654E-01142004 0.220E+00 0.166E+00
DAV: 6 -0.135932672328E+03 0.91723E-02 -0.17165E-01141520 0.216E+00 0.162E+00
DAV: 7 -0.135930126912E+03 0.25454E-02 -0.30734E-02137968 0.142E+00 0.926E-01
DAV: 8 -0.135930465065E+03 -0.33815E-03 -0.33151E-02142880 0.606E-01 0.551E-01
DAV: 9 -0.135928787348E+03 0.16777E-02 -0.72248E-03155584 0.670E-01 0.228E-01
DAV: 10 -0.135929064014E+03 -0.27667E-03 -0.66241E-03153224 0.319E-01 0.162E-01
DAV: 11 -0.135928908660E+03 0.15535E-03 -0.81901E-04141224 0.164E-01 0.126E-01
DAV: 12 -0.135928873697E+03 0.34963E-04 -0.32048E-04141420 0.829E-02 0.717E-02
DAV: 13 -0.135928875002E+03 -0.13046E-05 -0.52199E-05 91544 0.427E-02
2 F= -.13592888E+03 E0= -.13592887E+03 d E =-.132833E-01 mag= 4.0000
trial-energy change: -0.013283 1 .order -0.014666 -0.069607 0.040276
step: 0.6156(harm= 0.6335) dis= 0.00388 next Energy= -135.936694 (dE=-0.211E-01)
bond charge predicted
N E dE d eps ncg rms rms(c)
DAV: 1 -0.136080861388E+03 -0.15199E+00 -0.30120E+00 95640 0.129E+01 0.465E+00
DAV: 2 -0.136098448919E+03 -0.17588E-01 -0.16609E+00120316 0.175E+01 0.655E+00
DAV: 3 -0.135954445365E+03 0.14400E+00 -0.35270E-01106132 0.925E+00 0.239E+00
DAV: 4 -0.135942004858E+03 0.12441E-01 -0.46751E-02127024 0.301E+00 0.900E-01
DAV: 5 -0.135937623995E+03 0.43809E-02 -0.37683E-02140576 0.838E-01 0.411E-01
DAV: 6 -0.135937041107E+03 0.58289E-03 -0.16542E-02122660 0.113E+00 0.438E-01
DAV: 7 -0.135936778213E+03 0.26289E-03 -0.34062E-03155024 0.354E-01 0.217E-01
DAV: 8 -0.135936701451E+03 0.76762E-04 -0.76129E-04127800 0.297E-01 0.662E-02
DAV: 9 -0.135936718409E+03 -0.16958E-04 -0.37154E-04137044 0.139E-01 0.264E-02
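For reference, the distrk/distr lines at the top of the job output follow from simple arithmetic on the rank count, KPAR and NCORE. The sketch below is my own helper, not a VASP routine; the function name and the strict divisibility checks are simplifications for illustration. Note that the log shows 8 k-point groups, which corresponds to KPAR = 8 and not to the KPAR = 2 in the INCAR listed above, so the output presumably comes from one of the runs where I varied KPAR.

```python
def vasp_layout(total_ranks, kpar, ncore):
    """Return (ranks per k-point group, number of band groups).

    Hypothetical helper mirroring the 'distrk' and 'distr' lines of the log:
    the ranks are first split into `kpar` k-point groups, and each group is
    then split into band groups of `ncore` ranks.
    """
    if total_ranks % kpar != 0:
        raise ValueError("total_ranks must be divisible by kpar")
    ranks_per_kpoint = total_ranks // kpar
    if ranks_per_kpoint % ncore != 0:
        raise ValueError("ranks per k-point group must be divisible by ncore")
    band_groups = ranks_per_kpoint // ncore
    return ranks_per_kpoint, band_groups


# 8 nodes x 24 tasks/node = 192 ranks, as in the job output.
print(vasp_layout(192, 8, 24))  # (24, 1): each k-point on 24 cores, 1 band group
print(vasp_layout(192, 2, 24))  # (96, 4): what KPAR = 2 from the INCAR would give
```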