Collective abort of all ranks

Problems running VASP: crashes, internal errors, "wrong" results.


gVallverdu
Newbie
Posts: 13
Joined: Mon Jan 17, 2011 9:43 am
Location: Pau, France

#1 Post by gVallverdu » Fri Mar 15, 2013 11:56 am

Good morning

For a large system (160 atoms), VASP crashes at the beginning of the calculation with MPI errors (see below):

Code:

 running on   72 total cores
 distrk:  each k-point on   72 cores,    1 groups
 distr:  one band on   12 cores,    6 groups
 using from now: INCAR
 vasp.5.3.2 13Sep12 (build Nov 05 2012 17:04:05) complex

 POSCAR found type information on POSCAR  Li Co O  S
 POSCAR found :  4 types and     163 ions
 LDA part: xc-table for Ceperly-Alder, standard interpolation
 POSCAR, INCAR and KPOINTS ok, starting setup
 FFT: planning ...
 WAVECAR not read
 entering main loop
       N       E                     dE             d eps       ncg     rms          rms(c)
rank 65 in job 1  node252.cm.cluster_35718   caused collective abort of all ranks
  exit status of rank 65: killed by signal 9
rank 20 in job 1  node252.cm.cluster_35718   caused collective abort of all ranks
  exit status of rank 20: killed by signal 9
rank 18 in job 1  node252.cm.cluster_35718   caused collective abort of all ranks
  exit status of rank 18: killed by signal 9
rank 34 in job 1  node252.cm.cluster_35718   caused collective abort of all ranks
  exit status of rank 34: killed by signal 9
rank 33 in job 1  node252.cm.cluster_35718   caused collective abort of all ranks
  exit status of rank 33: killed by signal 9
rank 32 in job 1  node252.cm.cluster_35718   caused collective abort of all ranks
  exit status of rank 32: killed by signal 9
rank 46 in job 1  node252.cm.cluster_35718   caused collective abort of all ranks
  exit status of rank 46: killed by signal 9
rank 42 in job 1  node252.cm.cluster_35718   caused collective abort of all ranks

etc.

Do you have any idea how I can fix this? I tried to compile VASP with the -Ddebug fpp flag, but it does not work; I get the following error:

Code:

fpp -f_com=no -free -w0 pawlhf.F pawlhf.f90  -DMPI  -DHOST=\"LinuxIFC\" -DIFC -DCACHE_SIZE=8000 -DPGF90 -Davoidalloc -DNGZhalf -DMPI_BLOCK=10000 -Duse_collective -DRPROMU_DGEMV  -DRACCMU_DGEMV -Ddebug
mpiifort  -FR -names lowercase -assume byterecl -m64 -warn nousage -g -traceback  -I/cm/shared/apps/intel/icsxe/2012.0.032/mkl/include/fftw  -c pawlhf.f90
pawlhf.F(1169): error #6404: This name does not have a type, and must have an explicit type.   [DFOCKAE]
      CALL DUMP_DLLMM( "ONE-CENTRE-CORRECTION AE",DFOCKAE, PP)
--------------------------------------------------^
pawlhf.F(1169): error #6634: The shape matching rules of actual arguments and dummy arguments have been violated.   [DFOCKAE]
      CALL DUMP_DLLMM( "ONE-CENTRE-CORRECTION AE",DFOCKAE, PP)
--------------------------------------------------^
compilation aborted for pawlhf.f90 (code 1)
make: *** [pawlhf.o] Erreur 1

I also got the following error, which I worked around temporarily by renaming NY to NYY:

Code:

fpp -f_com=no -free -w0 xcspin.F xcspin.f90  -DMPI  -DHOST=\"LinuxIFC\" -DIFC -DCACHE_SIZE=8000 -DPGF90 -Davoidalloc -DNGZhalf -DMPI_BLOCK=10000 -Duse_collective -DRPROMU_DGEMV  -DRACCMU_DGEMV -Ddebug
mpiifort  -FR -names lowercase -assume byterecl -m64 -warn nousage -g -traceback -check bounds  -I/cm/shared/apps/intel/icsxe/2012.0.032/mkl/include/fftw  -c xcspin.f90
xcspin.F(1271): error #6423: This name has already been used as an external function name.   [NY]
      DO NY=1,GRIDC%NGY
---------^
compilation aborted for xcspin.f90 (code 1)
make: *** [xcspin.o] Erreur 1
Thanks

admin
Administrator
Posts: 2921
Joined: Tue Aug 03, 2004 8:18 am
License Nr.: 458

#2 Post by admin » Fri Mar 15, 2013 2:22 pm

Please check that your stack size is not limited.
Please note that we do not support questions about errors produced by executables built from code that has been modified by our colleagues. The code we distribute compiles without any errors with the Intel compiler.

#3 Post by gVallverdu » Fri Mar 15, 2013 5:32 pm

I will check for the stack size limits.

I did not modify the source code. Until today I had always compiled VASP successfully, but when I add the -Ddebug option to fpp I get the errors shown in my first post.

I use the following Intel compiler versions:
mpiifort for the Intel(R) MPI Library 4.0 Update 2 for Linux*
Copyright(C) 2003-2011, Intel Corporation. All rights reserved.
ifort version 12.1.0

and

mpiifort for the Intel(R) MPI Library 4.0 Update 2 for Linux*
Copyright(C) 2003-2011, Intel Corporation. All rights reserved.
Version 11.1

Both give the same error for the file xcspin.F (without any modification of the source code).

#4 Post by gVallverdu » Thu Mar 21, 2013 10:28 am

Hello

I tried to increase the stack size by adding the following lines to the job:

ulimit -d hard
ulimit -f hard
ulimit -l hard
ulimit -m hard
ulimit -n hard
ulimit -s hard
ulimit -t hard
ulimit -u hard

But I still get the same error.
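
One way to confirm which stack limit the job actually runs under is to print the soft and hard values from inside the job script before launching VASP. The following is only a minimal sketch, assuming the script is run by bash; the variable names are illustrative and not part of VASP or the batch system.

```shell
#!/bin/bash
# Record the soft and hard stack limits seen by this shell.
soft_stack=$(ulimit -Ss)
hard_stack=$(ulimit -Hs)
echo "stack limit before: soft=${soft_stack} hard=${hard_stack}"

# Try to lift the soft limit completely; if the hard limit forbids that,
# fall back to raising the soft limit up to the hard limit.
ulimit -Ss unlimited 2>/dev/null || ulimit -Ss "${hard_stack}"

new_soft=$(ulimit -Ss)
echo "stack limit after:  soft=${new_soft} hard=${hard_stack}"
```

These limits apply to the shell that runs the script and to the processes it starts directly. Whether MPI ranks started on other nodes inherit them depends on the launcher and the cluster configuration, so this check may need to be repeated on the compute nodes themselves.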

I do not know whether it helps, but the job manager returns this code at the end: Exit_status=137
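
For reference, an exit status above 128 conventionally means the process was terminated by a signal whose number is the status minus 128. A short sketch of that arithmetic, assuming a bash shell and using the value reported above:

```shell
#!/bin/bash
# Exit status reported by the job manager (value taken from the post).
exit_status=137

# Shell convention: status > 128 means "terminated by signal (status - 128)".
if [ "${exit_status}" -gt 128 ]; then
    signal_number=$((exit_status - 128))
    # kill -l <number> prints the signal name without the SIG prefix.
    signal_name=$(kill -l "${signal_number}")
    echo "exit status ${exit_status} -> signal ${signal_number} (SIG${signal_name})"
fi
```

Here 137 - 128 = 9, i.e. SIGKILL, which is consistent with the "killed by signal 9" lines in the MPI output. SIGKILL is sent from outside the program. One common source is the kernel or the batch system killing a process that exceeds a memory limit, although nothing in the output above confirms that this is the cause here.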

Thanks

#5 Post by admin » Thu Mar 21, 2013 3:57 pm

ulimit -s unlimited
is not a tag for a VASP calculation. It is a line that belongs in your .bashrc file.
http://cms.mpi.univie.ac.at/vasp-forum/ ... hp?2.12729

#6 Post by gVallverdu » Thu Mar 21, 2013 4:33 pm

Of course! When I said "job", I meant the PBS/Torque script.

ulimit -s unlimited does not fix the crash. I will ask the cluster administrator and come back if I have more information.