Collective abort of all ranks

Posted: Fri Mar 15, 2013 11:56 am
by gVallverdu
Good morning

For a large system (160 atoms), VASP crashes at the beginning of the calculation with the MPI errors shown below:

Code:

 running on   72 total cores
 distrk:  each k-point on   72 cores,    1 groups
 distr:  one band on   12 cores,    6 groups
 using from now: INCAR
 vasp.5.3.2 13Sep12 (build Nov 05 2012 17:04:05) complex

 POSCAR found type information on POSCAR  Li Co O  S
 POSCAR found :  4 types and     163 ions
 LDA part: xc-table for Ceperly-Alder, standard interpolation
 POSCAR, INCAR and KPOINTS ok, starting setup
 FFT: planning ...
 WAVECAR not read
 entering main loop
       N       E                     dE             d eps       ncg     rms          rms(c)
rank 65 in job 1  node252.cm.cluster_35718   caused collective abort of all ranks
  exit status of rank 65: killed by signal 9
rank 20 in job 1  node252.cm.cluster_35718   caused collective abort of all ranks
  exit status of rank 20: killed by signal 9
rank 18 in job 1  node252.cm.cluster_35718   caused collective abort of all ranks
  exit status of rank 18: killed by signal 9
rank 34 in job 1  node252.cm.cluster_35718   caused collective abort of all ranks
  exit status of rank 34: killed by signal 9
rank 33 in job 1  node252.cm.cluster_35718   caused collective abort of all ranks
  exit status of rank 33: killed by signal 9
rank 32 in job 1  node252.cm.cluster_35718   caused collective abort of all ranks
  exit status of rank 32: killed by signal 9
rank 46 in job 1  node252.cm.cluster_35718   caused collective abort of all ranks
  exit status of rank 46: killed by signal 9
rank 42 in job 1  node252.cm.cluster_35718   caused collective abort of all ranks

etc ...

Do you have any idea how I can fix this? I tried to compile VASP with the -Ddebug fpp flag, but it does not work; I get the following error:

Code:

fpp -f_com=no -free -w0 pawlhf.F pawlhf.f90  -DMPI  -DHOST=\"LinuxIFC\" -DIFC -DCACHE_SIZE=8000 -DPGF90 -Davoidalloc -DNGZhalf -DMPI_BLOCK=10000 -Duse_collective -DRPROMU_DGEMV  -DRACCMU_DGEMV -Ddebug
mpiifort  -FR -names lowercase -assume byterecl -m64 -warn nousage -g -traceback  -I/cm/shared/apps/intel/icsxe/2012.0.032/mkl/include/fftw  -c pawlhf.f90
pawlhf.F(1169): error #6404: This name does not have a type, and must have an explicit type.   [DFOCKAE]
      CALL DUMP_DLLMM( "ONE-CENTRE-CORRECTION AE",DFOCKAE, PP)
--------------------------------------------------^
pawlhf.F(1169): error #6634: The shape matching rules of actual arguments and dummy arguments have been violated.   [DFOCKAE]
      CALL DUMP_DLLMM( "ONE-CENTRE-CORRECTION AE",DFOCKAE, PP)
--------------------------------------------------^
compilation aborted for pawlhf.f90 (code 1)
make: *** [pawlhf.o] Erreur 1
I also got this error, which I temporarily fixed by renaming NY to NYY:

Code:

fpp -f_com=no -free -w0 xcspin.F xcspin.f90  -DMPI  -DHOST=\"LinuxIFC\" -DIFC -DCACHE_SIZE=8000 -DPGF90 -Davoidalloc -DNGZhalf -DMPI_BLOCK=10000 -Duse_collective -DRPROMU_DGEMV  -DRACCMU_DGEMV -Ddebug
mpiifort  -FR -names lowercase -assume byterecl -m64 -warn nousage -g -traceback -check bounds  -I/cm/shared/apps/intel/icsxe/2012.0.032/mkl/include/fftw  -c xcspin.f90
xcspin.F(1271): error #6423: This name has already been used as an external function name.   [NY]
      DO NY=1,GRIDC%NGY
---------^
compilation aborted for xcspin.f90 (code 1)
make: *** [xcspin.o] Erreur 1
Thanks

Collective abort of all ranks

Posted: Fri Mar 15, 2013 2:22 pm
by admin
please check that your stack size is not limited;
please note that we do not support questions concerning errors generated by executables built from code that has been modified by our colleagues; the code as distributed by us compiles without any errors with the Intel compiler.

Collective abort of all ranks

Posted: Fri Mar 15, 2013 5:32 pm
by gVallverdu
I will check for the stack size limits.

I did not make any source code modifications. Until today I had always compiled VASP successfully, but if I add the -Ddebug option to fpp I get the errors shown in my first post.

I use the following Intel compiler versions:
mpiifort for the Intel(R) MPI Library 4.0 Update 2 for Linux*
Copyright(C) 2003-2011, Intel Corporation. All rights reserved.
ifort version 12.1.0

and

mpiifort for the Intel(R) MPI Library 4.0 Update 2 for Linux*
Copyright(C) 2003-2011, Intel Corporation. All rights reserved.
Version 11.1

Both give the same error for the file xcspin.F (without any modification of the source code).

Collective abort of all ranks

Posted: Thu Mar 21, 2013 10:28 am
by gVallverdu
Hello

I tried to increase the stack size by adding the following lines to the job script:

ulimit -d hard
ulimit -f hard
ulimit -l hard
ulimit -m hard
ulimit -n hard
ulimit -s hard
ulimit -t hard
ulimit -u hard

But I still get the same error.

I do not know if it may help, but the job manager returns this code at the end: Exit_status=137
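That exit status can be decoded: batch systems report 128 plus the signal number, so 137 corresponds to signal 9 (SIGKILL), which on a cluster usually means the kernel's out-of-memory killer or an enforced memory limit terminated the job rather than a crash inside VASP itself. A small shell sketch (the status value is taken from the log above):

```shell
# Decode a batch system exit status of the form 128 + signal number.
# Exit_status=137 therefore means "killed by signal 9" (SIGKILL),
# matching the "killed by signal 9" lines in the MPI output.
status=137
sig=$((status - 128))
echo "killed by signal $sig ($(kill -l "$sig"))"
```

If this is the out-of-memory killer, something like `dmesg | grep -i oom` on the compute node (or asking the cluster admin to check) would confirm it.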

Thanks

Collective abort of all ranks

Posted: Thu Mar 21, 2013 3:57 pm
by admin
ulimit -s unlimited
is not a tag in vasp calculation. It is a line in your .bashrc file.
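In practice the line can go either in ~/.bashrc on the compute nodes or near the top of the submission script, before the MPI launch. A minimal sketch of a PBS/Torque script, assuming a generic mpirun launch line (the resource request and executable name are placeholders):

```shell
#!/bin/bash
#PBS -l nodes=6:ppn=12
# Raise the stack limit for this job before launching VASP.
# Put the same line in ~/.bashrc as well if your MPI starts
# remote processes through shells that do not run this script.
ulimit -s unlimited
cd "$PBS_O_WORKDIR"
mpirun vasp    # placeholder launch line
```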
http://cms.mpi.univie.ac.at/vasp-forum/ ... hp?2.12729

Collective abort of all ranks

Posted: Thu Mar 21, 2013 4:33 pm
by gVallverdu
Of course! When I said "job" I meant the PBS/Torque script.

ulimit -s unlimited does not fix the crash. I will ask the admin of the cluster and will come back if I have more information.