MLFF Hangs Before First Ionic Step when ML_AB is Present


#1 Post by **vivienne_pelletier** » Wed Aug 23, 2023 12:05 am

Hi everyone,

I very much enjoy the new MLFF feature; however, I have fairly consistently run into a problem when running with pre-existing ML_AB files.
When I restart a run with a somewhat large ML_AB (in this particular case it contains 1475 structures, 6 elements, and at most 3124 basis configurations), VASP hangs after the SCF cycle of the first step has converged. The line reporting the temperature, energy, etc. is never written to the OSZICAR or stdout, and once the positions and total forces have been written to the OUTCAR and vasprun.xml, none of OUTCAR, stdout, vasprun.xml, ML_ABN, or ML_REG is updated again.
After all of the initial boilerplate, the ML_LOGFILE shows this:

STATUS                  0 learning   3      T      F         0         0
SPRSC                   0      1475      1475 Al       550       550 C       1893      1893 H       3124      3124 O       2952      2952 Zn      1237      1237 P        170       170
REGR                    0    1    1   4.80520352E-01   2.70686887E-02   3.88934899E-13   6.60186318E+03
REGR                    0    1    2   2.65297417E+00   2.06119355E-02   5.36422807E-14   5.00693884E+03
REGR                    0    1    3   9.60078933E+00   1.83151329E-02   1.31711779E-14   4.43880639E+03
REGR                    0    1    4   2.31145840E+01   1.73880995E-02   5.19382743E-15   4.20853094E+03
REGR                    0    1    5   3.98908662E+01   1.70119314E-02   2.94443274E-15   4.11453918E+03
REGR                    0    1    6   5.50733221E+01   1.68483116E-02   2.11220757E-15   4.07342537E+03
REGR                    0    1    7   6.64584665E+01   1.67694861E-02   1.74217156E-15   4.05352463E+03
REGR                    0    1    8   7.41236092E+01   1.67285109E-02   1.55819644E-15   4.04314007E+03
REGR                    0    1    9   7.89711368E+01   1.67062112E-02   1.46059921E-15   4.03747233E+03

This has happened multiple times upon cancelling and restarting this job, while the same INCAR & POSCAR without the ML_AB file runs without a problem. I left one instance running for 2 days before checking on it and seeing this, so I'm fairly certain nothing more was going to happen.

This is with a fresh compilation of VASP 6.4.2 using gcc 11, OpenMPI 4.1.5, OpenBLAS, and the AMD-optimized ScaLAPACK and FFTW. The precompiler flags use_shmem, shmem_bcast_buffer, shmem_rproj, and sysv are all set. Slurm reports that 286 GB of memory were used, well within the 500 GB provided to the job.

Any help with diagnosing this issue would be greatly appreciated!
Thank you very much,
Vivienne

#2 Post by **manuel_engel1** » Wed Aug 23, 2023 8:57 am

Hi Vivienne,

Thank you for the report. I forwarded this to an MLFF expert. However, it will probably take a few days to investigate. We'll report back as soon as we know what the issue is.

#3 Post by **ferenc_karsai** » Mon Aug 28, 2023 8:08 am

This could be a memory issue. I have seen several times that, when not enough memory is available, the code simply hangs at the allocate statement. The most important allocate statements have a status check, which should prevent this behaviour. However, we have no control over allocations made inside a ScaLAPACK call, and the memory can also run out in an array allocation that we do not check with a status variable. Are you running with ML_MODE=TRAIN? If so, then depending on your parameters, VASP can also use a lot of memory.

Please provide us with your input and output files (INCAR, POSCAR, POTCAR, KPOINTS, ML_AB); otherwise we can't do much more.

#4 Post by **vivienne_pelletier** » Mon Aug 28, 2023 9:03 pm

I've attached the inputs in an xz tar file.
Regarding the idea that this is a memory issue inside a ScaLAPACK call: on a different compilation of VASP, linked against different libraries, I have seen the output "xxmr2d:out of memory" when using a large enough ML_AB.

#5 Post by **ferenc_karsai** » Tue Aug 29, 2023 7:52 am

I will try to run it as soon as possible on our high memory machine (at the moment it is occupied).

In the meantime you could try to run a calculation with the following INCAR:

INCAR:
ML_LMLFF = .TRUE.
ML_MODE = REFITBAYESIAN

This will only do a refit, using most of the routines of the machine learning program but less memory than ML_MODE=REFIT, since it does not use the memory-intensive singular value decomposition.
If this runs through, then it is almost guaranteed that you are running into memory problems when using ML_MODE=TRAIN, since the electronic calculation also needs a lot of memory.

The "r2d out of memory" message that you sometimes see is most likely due to the ScaLAPACK singular value decomposition calls (PDGESVD), since that part calls the redistribution routine PXGEMR2D. This routine calls malloc, and if the size of the local array (the product of its two dimensions) is larger than 2^31, it runs into an integer overflow. In that case you can try to run on more cores until the local arrays are small enough.
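
To get a rough feeling for how many cores "small enough" might mean, the local array size can be estimated with a short script. This is only a sketch under stated assumptions: it assumes a standard 2D block-cyclic distribution, reproduces the counting logic of ScaLAPACK's NUMROC helper in Python, and compares the element count with the largest signed 32-bit integer. The matrix dimensions, the block size of 64, and the near-square process grid are hypothetical placeholders; the dimensions and layout that VASP actually passes to ScaLAPACK are not stated in this thread.

```python
import math

# Largest signed 32-bit integer; assumed here to be the overflow threshold.
INT32_MAX = 2**31 - 1


def numroc(n, nb, iproc, isrcproc, nprocs):
    """Rows or columns of a block-cyclically distributed dimension of
    length n (block size nb) owned by process iproc.
    Mirrors the counting logic of ScaLAPACK's NUMROC."""
    mydist = (nprocs + iproc - isrcproc) % nprocs
    nblocks = n // nb                      # number of full blocks
    count = (nblocks // nprocs) * nb       # every process gets at least this
    extra = nblocks % nprocs               # leftover full blocks
    if mydist < extra:
        count += nb                        # one additional full block
    elif mydist == extra:
        count += n % nb                    # the trailing partial block
    return count


def max_local_elements(m, n, nprow, npcol, mb=64, nb=64):
    """Element count of the largest local array for an m x n matrix.
    Process (0, 0) holds the largest share when the source process is 0."""
    return numroc(m, mb, 0, 0, nprow) * numroc(n, nb, 0, 0, npcol)


def smallest_safe_grid(m, n, max_procs, mb=64, nb=64):
    """Smallest process count (with a near-square nprow x npcol grid) whose
    largest local array stays at or below INT32_MAX elements, or None."""
    for p in range(1, max_procs + 1):
        nprow = math.isqrt(p)
        while p % nprow:                   # largest divisor of p <= sqrt(p)
            nprow -= 1
        npcol = p // nprow
        if max_local_elements(m, n, nprow, npcol, mb, nb) <= INT32_MAX:
            return p, nprow, npcol
    return None


if __name__ == "__main__":
    # Hypothetical 100000 x 100000 matrix, not taken from the actual run.
    print(smallest_safe_grid(100_000, 100_000, max_procs=64))
```

Note that this only addresses the integer overflow described above. It says nothing about whether the total memory per node is sufficient, which is a separate limit.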
