
Segfault version 5.4.1 with MPI rank count not 1, 2, 4, or 8

Posted: Wed Oct 28, 2015 8:14 pm
by cchang
Hi,

I have run into the following strange behavior.
When I run a particular build of VASP 5.4.1 with 1, 2, 4, or 8 MPI ranks, a test job completes fine. With any other rank count, however, the code segfaults.
Debugging at 3 ranks shows the same stack trace on every rank:
[0-2] (mpigdb) bt
[0-2] #0  vhamil (wdes1=Cannot access memory at address 0x16260) at hamil.F:794
[0-2] #1  hamil::hamiltmu (wdes1=Cannot access memory at address 0x16260) at hamil.F:794
[0-2] #2  0x0000000000e13325 in david::eddav (hamiltonian=Cannot access memory at address 0x16260) at davidson.F:419
[0-2] #3  0x0000000000e3ae43 in elmin (hamiltonian=Cannot access memory at address 0x16260) at electron.F:418
[0-2] #4  0x00000000014c96e3 in vamp () at main.F:2994
[0-2] #5  0x000000000040ba6e in main ()

It seems like something is wrong in wdes1, but I can't tell what.
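So far I've only taken the backtrace; unless someone has a better idea, my next step was basic frame and variable inspection at the mpigdb prompt, roughly along these lines (not sure it gets past the "Cannot access memory" errors):

(mpigdb) frame 1
(mpigdb) info args
(mpigdb) info locals
(mpigdb) print wdes1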

Build: Intel MPI 5.0.3, Intel Fortran compiler 15.0.3, MKL 15.3.187, ScaLAPACK enabled
Test case: 4x4x2 Gamma-centered k-point mesh; 832 bands (automatically increased to 834 for the 3-rank case)
INCAR:
ISTART = 0
ICHARG = 2
ENCUT = 300
ISMEAR = 0
SIGMA = 0.01
LMAXMIX = 4
ADDGRID = .TRUE.
PREC = Accurate
NELM = 10
NELMIN = 3
EDIFF = 1E-5
LORBIT = 11
NBANDS = 832
LOPTICS = .TRUE.
LWAVE = .FALSE.
LCHARG = .FALSE.
LREAL = On

Would appreciate any pointers, including what to try next in GDB.

Thanks in advance,

Chris

Re: Segfault version 5.4.1 with MPI rank count not 1, 2, 4, or 8

Posted: Thu Oct 29, 2015 5:43 pm
by cchang
Weird cause, but I figured it out. The test was running on a new system that still had the default Linux stack-size limit (10240 kB). Setting the limit to unlimited fixed the problem.
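
For reference, the change amounts to raising the stack limit in the shell that launches the job; roughly what went into the job script was the following (the binary name and rank count here are just placeholders from this test):

ulimit -s              # check the current per-process stack limit (was 10240 kB here)
ulimit -s unlimited    # lift the limit before launching so the ranks inherit it
mpirun -np 3 ./vasp_std

On multi-node runs the limit may also need to be set in the shell startup files on the compute nodes, since remotely launched ranks don't necessarily inherit the submitting shell's limit.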

I guess the problem size may have been right at the boundary: the extra bands added to make NBANDS evenly divisible by the rank count (832 is not divisible by 3, so it was raised to 834 = 3 x 278) pushed the stack requirements over the edge. I'm still not sure why 16 ranks failed, though, since 832/16 = 52 exactly and no bands were added there; maybe parallel overhead.

Anyway, adding the solution here for posterity. Mischief Managed.