
VASP run error under mpich2

Posted: Thu Jan 06, 2011 8:04 am
by Gu Chenjie
Hi all, I will try to explain my problem clearly.
I have two nodes, each with 12 cores, and the two nodes are connected by a gigabit switch.
Now I am testing the VASP examples on these nodes.
First I booted mpd on a single node, and all the examples ran well on each node individually.
Then I tried to run the examples on both nodes using 24 cores; after booting mpd on the two nodes, I got the following error.

Code: Select all

Fatal error in MPI_Waitall: Other MPI error, error stack:
MPI_Waitall(261)..................: MPI_Waitall(count=46, req_array=0x7fffeeca46a0, status_array=0x7fffeeca4760) failed
MPIDI_CH3I_Progress(150)..........: 
MPID_nem_mpich2_blocking_recv(948): 
MPID_nem_tcp_connpoll(1709).......: Communication error
rank 23 in job 1  node0_55860   caused collective abort of all ranks
  exit status of rank 23: killed by signal 9 
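For reference, "killed by signal 9" means the rank received SIGKILL, which a process cannot catch or ignore; on Linux this is commonly the kernel OOM killer terminating a process that ran out of memory, which is one reason to suspect a memory limit rather than the switch. The signal number can be confirmed from Python's standard library:

```python
import signal

# Signal 9 is SIGKILL: it cannot be caught or ignored, and on Linux
# it is the signal the kernel OOM killer sends when it terminates a
# process that exhausted memory.
print(int(signal.SIGKILL))        # 9 on Linux
```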
I am wondering whether this error comes from the low speed of the switch or from a memory limitation, although I have already set the stack and memory limits to unlimited.
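Since the unlimited stack/memory setting is part of the hypothesis, it may be worth verifying which limits the MPI processes actually inherit: a job started by mpd gets mpd's limits, not necessarily those of an interactive shell. A small sketch for checking this on a Linux node (run it under mpiexec to see what the ranks really get):

```python
import resource

# Print the (soft, hard) limits of the current process;
# resource.RLIM_INFINITY means "unlimited".  A process launched by
# mpd inherits mpd's limits, so "ulimit -s unlimited" in a login
# shell does not guarantee the MPI ranks see an unlimited stack.
for name in ("RLIMIT_STACK", "RLIMIT_AS", "RLIMIT_DATA"):
    soft, hard = resource.getrlimit(getattr(resource, name))
    show = lambda v: "unlimited" if v == resource.RLIM_INFINITY else str(v)
    print(f"{name}: soft={show(soft)} hard={show(hard)}")
```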
Thanks for your attention.

VASP run error under mpich2

Posted: Mon Jan 31, 2011 12:58 pm
by admin
Sorry, but this error is not related to VASP itself; it comes from an error in your MPI installation.

VASP run error under mpich2

Posted: Sun Feb 06, 2011 6:54 am
by Gu Chenjie
Hi, yes, today I found where the problem is.
As I mentioned, I have two nodes. If a job runs on a single node, there is no problem. However, if the job is assigned to both nodes, then no matter how many cores I use, the error appears as soon as there is any data exchange between the two nodes; most importantly, whether it appears depends on the size of the supercell.
Let's take handson1 (1_1_O_atom) as an example. The original POSCAR is as follows:

Code: Select all

O atom in a box
 1.0          ! universal scaling parameters
 8.0 0.0 0.0  ! lattice vector  a(1)
 0.0 8.0 0.0  ! lattice vector  a(2)
 0.0 0.0 8.0  ! lattice vector  a(3)
1             ! number of atoms
cart          ! positions in cartesian coordinates
 0 0 0

This job cannot run if the two nodes are used at the same time. However, if I change the size of the supercell to 4x4x4 (by setting the universal scaling parameter to 0.5), so that the new POSCAR is as follows:

Code: Select all

O atom in a box
 0.5          ! universal scaling parameters
 8.0 0.0 0.0  ! lattice vector  a(1)
 0.0 8.0 0.0  ! lattice vector  a(2)
 0.0 0.0 8.0  ! lattice vector  a(3)
1             ! number of atoms
cart          ! positions in cartesian coordinates
 0 0 0

Now it runs very well.
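For reference, the universal scaling parameter in POSCAR multiplies every lattice vector, so changing it from 1.0 to 0.5 turns the 8 Å box into a 4 Å box and shrinks the cell volume by a factor of 8, which in turn reduces the plane-wave basis and the memory each MPI rank needs. A quick sketch of the arithmetic (plain Python, not VASP code):

```python
# The universal scaling parameter multiplies every lattice vector.
def scale_lattice(scale, lattice):
    return [[scale * x for x in row] for row in lattice]

lattice = [[8.0, 0.0, 0.0],
           [0.0, 8.0, 0.0],
           [0.0, 0.0, 8.0]]

scaled = scale_lattice(0.5, lattice)
print(scaled[0][0])   # 4.0 : each 8 A edge becomes 4 A

# For this diagonal (cubic) cell the volume is the product of the
# diagonal entries, so it shrinks by scale**3 = 1/8.
volume = scaled[0][0] * scaled[1][1] * scaled[2][2]
print(volume)         # 64.0 instead of 512.0
```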
So now I think this must be a problem with my compilation, perhaps in the FFT library, or a hardware limitation such as the CPU stack or memory limit.
I hope you can give me some suggestions.

Code: Select all

HP BL460c:
CPU: 2 x X5670 per node
Memory: 24 GB per node
NIC: 2 x 10G
using a Beowulf cluster structure

Thanks a lot.