Frequent problem
Posted: Thu Jun 28, 2007 10:30 pm
Hi all. We get this segfault error quite frequently, for a variety of job sizes. This particular job ran on 4 nodes with 16 processors, and each node has 8 GB of RAM.
running on 16 nodes
each image running on 2 nodes
distr: one band on 1 nodes, 2 groups
vasp.4.6.19 08Dec03 complex
01/POSCAR found : 4 types and 15 ions
-----------------------------------------------------------------------------
| |
| ADVICE TO THIS USER RUNNING 'VASP/VAMP' (HEAR YOUR MASTER'S VOICE ...): |
| |
| You enforced a specific xc-type in the INCAR file, |
| a different type was found on the POTCAR file |
| I HOPE YOU KNOW, WHAT YOU ARE DOING |
| |
-----------------------------------------------------------------------------
LDA part: xc-table for Pade appr. of Perdew
00/POSCAR found : 4 types and 15 ions
09/POSCAR found : 4 types and 15 ions
POSCAR, INCAR and KPOINTS ok, starting setup
WARNING: wrap around errors must be expected
FFT: planning ...
reading WAVECAR
WARNING: random wavefunctions but no delay for mixing, default for NELMDL
entering main loop
N E dE d eps ncg rms rms(c)
RMM: 1 0.167129221272E+04 0.16713E+04 -0.38097E+04 780 0.105E+03
*******
1 F= -.62961368E+03 E0= -.62951081E+03 d E =-.629614E+03
rm_l_2_10283: p4_error: interrupt SIGx: 15
bm_list_10389: p4_error: listener select: -1
rm_l_4_10285: (972.683594) net_send: could not write to fd=6, errno = 9
rm_l_4_10285: p4_error: net_send write: -1
rm_l_3_10284: (972.683594) net_send: could not write to fd=6, errno = 9
rm_l_3_10284: p4_error: net_send write: -1
rm_l_4_10285: (972.683594) net_send: could not write to fd=5, errno = 104
rm_l_6_10015: (972.699219) net_send: could not write to fd=6, errno = 9
rm_l_6_10015: p4_error: net_send write: -1
rm_l_7_10017: (972.699219) net_send: could not write to fd=7, errno = 9
rm_l_7_10017: p4_error: net_send write: -1
rm_l_8_10272: (972.687500) net_send: could not write to fd=6, errno = 9
rm_l_8_10272: p4_error: net_send write: -1
rm_l_9_10275: (972.687500) net_send: could not write to fd=8, errno = 9
rm_l_9_10275: p4_error: net_send write: -1
The corresponding error file contains:
p4_error: latest msg from perror: Bad file descriptor
p4_error: latest msg from perror: Bad file descriptor
p4_error: latest msg from perror: Bad file descriptor
p4_error: latest msg from perror: Bad file descriptor
p4_error: latest msg from perror: Bad file descriptor
p4_error: latest msg from perror: Bad file descriptor
p4_error: latest msg from perror: Bad file descriptor
forrtl: error (78): process killed (SIGTERM)
mpiexec: Warning: tasks 0-9 died with signal 11 (Segmentation fault).
I also found the following after running dmesg, or in /var/log/messages:
vaspmpitst[10388]: segfault at 00007fff0e4f78a8 rip 00000000005f82f7 rsp 00007fff0e4f78b0 error 6
vaspmpitst[10386]: segfault at 00007fff0e4f78a8 rip 00000000005f82f7 rsp 00007fff0e4f78b0 error 6
vaspmpitst[10387]: segfault at 00007fff0e4f78a8 rip 00000000005f82f7 rsp 00007fff0e4f78b0 error 6
vaspmpitst[10371]: segfault at 00007fff0e4f78a8 rip 00000000005f82f7 rsp 00007fff0e4f78b0 error 6
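In case it helps anyone reading these kernel lines, here is a minimal sketch of how the fields can be decoded. It assumes the usual x86 page-fault error-code bits (bit 0: page was present, bit 1: access was a write, bit 2: fault came from user mode) and that the kernel prints the code in hex. The helper decode_segfault and its field names are hypothetical, written only for this post; they are not part of VASP or MPICH.

```python
import re

# One kernel log entry from the output above, joined onto a single line.
LINE = ("vaspmpitst[10388]: segfault at 00007fff0e4f78a8 "
        "rip 00000000005f82f7 rsp 00007fff0e4f78b0 error 6")

# Pattern for the "segfault at ... rip ... rsp ... error ..." format shown above.
PATTERN = re.compile(
    r"(?P<prog>\S+)\[(?P<pid>\d+)\]: segfault at (?P<addr>[0-9a-f]+) "
    r"rip (?P<rip>[0-9a-f]+) rsp (?P<rsp>[0-9a-f]+) error (?P<err>[0-9a-f]+)"
)


def decode_segfault(line):
    """Hypothetical helper: split a kernel segfault line into its fields."""
    m = PATTERN.search(line)
    if m is None:
        raise ValueError("not a segfault line in the expected format")
    addr = int(m.group("addr"), 16)   # faulting address
    rsp = int(m.group("rsp"), 16)     # stack pointer at the time of the fault
    err = int(m.group("err"), 16)     # page-fault error code (assumed hex)
    return {
        "program": m.group("prog"),
        "pid": int(m.group("pid")),
        "fault_addr": addr,
        "rsp": rsp,
        "rsp_minus_addr": rsp - addr,     # distance of the fault below rsp
        "page_present": bool(err & 1),    # bit 0
        "write": bool(err & 2),           # bit 1
        "user_mode": bool(err & 4),       # bit 2
    }


info = decode_segfault(LINE)
print(info)
```

Under those assumptions, the entries above describe a user-mode write to a page that was not mapped, at an address 8 bytes below the stack pointer. That may be worth keeping in mind when looking at stack-size limits, but it does not by itself identify the cause.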
Could the experts here tell me the possible causes of this kind of error and how to fix it?
Thank you so much.