I very much enjoy the new MLFF feature; however, I have fairly consistently run into a few issues when running with pre-existing ML_AB files.
When attempting to restart a run with a somewhat large ML_AB (in this particular case it contains 1475 structures, 6 elements, and at most 3124 basis configurations for any one element), VASP hangs after the SCF cycle of the first step has converged. The line reporting the temperature, energy, etc. is never written to OSZICAR or stdout. Once the positions and total forces have been written to OUTCAR and vasprun.xml, none of OUTCAR, stdout, vasprun.xml, ML_ABN, or ML_REG is updated again.
The ML_LOGFILE, after all of the initial boilerplate, shows this:
STATUS 0 learning 3 T F 0 0
SPRSC 0 1475 1475 Al 550 550 C 1893 1893 H 3124 3124 O 2952 2952 Zn 1237 1237 P 170 170
REGR 0 1 1 4.80520352E-01 2.70686887E-02 3.88934899E-13 6.60186318E+03
REGR 0 1 2 2.65297417E+00 2.06119355E-02 5.36422807E-14 5.00693884E+03
REGR 0 1 3 9.60078933E+00 1.83151329E-02 1.31711779E-14 4.43880639E+03
REGR 0 1 4 2.31145840E+01 1.73880995E-02 5.19382743E-15 4.20853094E+03
REGR 0 1 5 3.98908662E+01 1.70119314E-02 2.94443274E-15 4.11453918E+03
REGR 0 1 6 5.50733221E+01 1.68483116E-02 2.11220757E-15 4.07342537E+03
REGR 0 1 7 6.64584665E+01 1.67694861E-02 1.74217156E-15 4.05352463E+03
REGR 0 1 8 7.41236092E+01 1.67285109E-02 1.55819644E-15 4.04314007E+03
REGR 0 1 9 7.89711368E+01 1.67062112E-02 1.46059921E-15 4.03747233E+03
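In case it helps anyone cross-check the ML_AB size quoted above against the log, here is a minimal Python sketch that reads the SPRSC line. The column layout is an assumption inferred from the line itself, not from the documentation: a step index, two structure counts, then repeated (element, count, count) triples, with the last number of each triple taken as the current number of basis configurations. The function name parse_sprsc is just an illustrative helper.

```python
def parse_sprsc(line):
    """Parse one SPRSC line from ML_LOGFILE.

    Assumed layout (inferred, not documented here):
    SPRSC <step> <n_struct> <n_struct> then (<element> <n_basis> <n_basis>) repeated.
    """
    fields = line.split()
    if not fields or fields[0] != "SPRSC":
        raise ValueError("not an SPRSC line")
    step = int(fields[1])
    n_structures = int(fields[3])
    rest = fields[4:]
    if len(rest) % 3 != 0:
        raise ValueError("unexpected number of per-element fields")
    # Take the last count of each triple as the current basis-set size.
    per_element = {}
    for i in range(0, len(rest), 3):
        per_element[rest[i]] = int(rest[i + 2])
    return step, n_structures, per_element


line = ("SPRSC 0 1475 1475 Al 550 550 C 1893 1893 H 3124 3124 "
        "O 2952 2952 Zn 1237 1237 P 170 170")
step, n_structures, per_element = parse_sprsc(line)
largest = max(per_element, key=per_element.get)
print(f"step {step}: {n_structures} structures, {len(per_element)} elements")
print(f"largest basis set: {largest} with {per_element[largest]} configurations")
```

For the line above this gives 1475 structures, 6 elements, and H as the element with the largest basis set (3124), which matches the numbers I quoted.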
This is with a fresh build of VASP 6.4.2 compiled with gcc 11, OpenMPI 4.1.5, OpenBLAS, and the AMD-optimized ScaLAPACK and FFTW. The precompiler flags use_shmem, shmem_bcast_buffer, shmem_rproj, and sysv are all set. Slurm reports that 286 GB of memory was used, well within the 500 GB allocated to the job.
Any help with diagnosing this issue would be greatly appreciated!
Thank you very much,
Vivienne