local QP operation err

Posted: Mon Dec 09, 2013 4:36 pm
by kuang
Hello everyone,

I have run into a problem while running DFPT calculations with VASP (IBRION=8). Some of my jobs crash with the following error:
mlx4: local QP operation err (QPN 003fb8, WQE index dce90000, vendor syndrome 77, opcode = 5e)
I also get this message from mpirun:
--------------------------------------------------------------------------
mpirun has exited due to process rank 20 with PID 27788 on
node tiger-r2c2n9 exiting improperly. There are two reasons this could occur:

1. this process did not call "init" before exiting, but others in
the job did. This can cause a job to hang indefinitely while it waits
for all processes to call "init". By rule, if one process calls "init",
then ALL processes must call "init" prior to termination.

2. this process called "init", but exited without calling "finalize".
By rule, all processes that call "init" MUST call "finalize" prior to
exiting or it will be considered an "abnormal termination"

This may have caused other processes in the application to be
terminated by signals sent by mpirun (as reported here).
--------------------------------------------------------------------------
The crash happens at random points, so it is very hard to predict. It also seems to depend on how many nodes and processors I use, and it is more likely to happen when I use many nodes.
I am using VASP.5.2.11 compiled with openmpi-1.6.3, and I run the code with the following command line:
$MPIRUN -np `wc -l \$PBS_NODEFILE | awk '{print \$1}'` --mca btl ^tcp --bind-to-socket $VASP > logfile
Has anyone else seen this problem? Thank you very much for your help!

Sincerely yours,

Kuang

Posted: Tue Dec 10, 2013 10:21 am
by alex
Hi kuang,

I'd guess your Mellanox card has some issues. Please check:
- does it always happen with the same node?
- does it happen with VASP only or with different software, too?
- have you (or your admin) tried to do a firmware update on your Mellanox cards?
- try some hardware checks on the card(s)
- maybe your cables are damaged
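
For the first question, one option is to tally the host names that mpirun reports across the output of the crashed jobs. This is only a minimal sketch. It assumes each job's output was saved to its own file and that the message is wrapped as in the post above, so the host name is the second word on a line beginning with "node". The function name count_failed_nodes is made up for illustration.

```shell
# Tally the nodes named in Open MPI's "exiting improperly" messages.
# Assumption: the message is wrapped as in the log quoted above, i.e.
#   "... process rank 20 with PID 27788 on"
#   "node tiger-r2c2n9 exiting improperly. ..."
# so the host name is field 2 of the line that starts with "node".
count_failed_nodes() {
    # Reads the files given as arguments (or stdin if none are given)
    # and prints "<count> <hostname>", most frequent node first.
    awk '/^node .* exiting improperly/ { count[$2]++ }
         END { for (n in count) print count[n], n }' "$@" \
        | sort -rn
}
```

Calling it as count_failed_nodes job1.log job2.log (hypothetical file names) prints one line per node with the crash count first. If a single node dominates, that points to one card or cable rather than to VASP. The mpirun message may be written to stderr, in which case it would be in the PBS error file rather than in logfile.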

Good luck!

alex

Posted: Wed Dec 11, 2013 11:02 pm
by kuang
OK, I will contact our cluster administrator to find out exactly what happened. Thank you very much!