Stuck in 'Refit' Mode: VASP ML Force Field Issue with Triclinic Geometries
-
- Newbie
- Posts: 2
- Joined: Mon May 08, 2023 11:45 pm
Stuck in 'Refit' Mode: VASP ML Force Field Issue with Triclinic Geometries
I'm facing an issue with the 'Refit' mode of VASP's MLFF. I am aware of the known problem 'Incorrect MLFF fast-mode predictions for some triclinic geometries', but I have updated to the latest version and still see the same behavior.
In my current simulation the OSZICAR file stays empty, and the output file progresses only up to 'initializing machine learning' and then hangs indefinitely, regardless of the runtime I set.
I've also tried reducing the number of configurations in 'ML_AB' by half, since the current number is quite high, but this did not change the outcome.
I have also read other posts, but it seems that many people at least got some results after the simulation "stopped", so I wonder if this could be yet another memory allocation problem or something else that I could troubleshoot.
I've posted the necessary files for reference on the link below (too large to attach), and any input or guidance on resolving this persistent problem would be greatly appreciated.
Link: https://drive.google.com/drive/folders/ ... sp=sharing
-
- Global Moderator
- Posts: 460
- Joined: Mon Nov 04, 2019 12:44 pm
Re: Stuck in 'Refit' Mode: VASP ML Force Field Issue with Triclinic Geometries
So I've run your calculation on 64 cores. After a few hours I get the following:
"xxmr2d:out of memory"
This comes from inside the ScaLAPACK routines for SVD, which redistribute data internally. For that they allocate helper arrays with malloc. If the size of a helper array (which is unfortunately 1D) exceeds 2**31, the limit of a 4-byte integer, this error message appears. These arrays get smaller the more cores one uses, since they are distributed over the cores and each core only needs to allocate its own part.
So I reran the calculation with 128 cores and it went through fine.
You ran on 40 cores (I saw it from the OUTCAR) which is definitely not enough, but it's strange you don't get an error.
Please try the calculation with more cores, to be safe at least with 128.
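To make the scaling argument above concrete, here is a minimal Python sketch. It assumes the helper array on each core holds roughly the total matrix size divided evenly by the number of cores, and that the limit applies to the element count; the actual ScaLAPACK buffer sizes are not stated in this thread and may differ. The function names and the matrix dimensions are hypothetical and are not taken from the posted files.

```python
# Rough illustration only: assumes an even split of a dense matrix over all
# cores. The real ScaLAPACK helper-array sizes may differ from this estimate.

INT32_MAX = 2**31 - 1  # largest value a signed 4-byte integer can hold


def local_elements(n_rows, n_cols, n_procs):
    """Approximate number of matrix elements held by one core (ceiling division)."""
    total = n_rows * n_cols
    return -(-total // n_procs)


def min_procs_for_int32(n_rows, n_cols):
    """Smallest core count for which the per-core element count fits in int32."""
    total = n_rows * n_cols
    return max(1, -(-total // INT32_MAX))


if __name__ == "__main__":
    # Hypothetical matrix dimensions, chosen purely for illustration.
    n_rows, n_cols = 100_000, 2_000_000
    for n_procs in (40, 64, 128, 256):
        n_local = local_elements(n_rows, n_cols, n_procs)
        status = "ok" if n_local <= INT32_MAX else "exceeds 4-byte integer range"
        print(f"{n_procs:4d} cores: {n_local:>13,d} elements per core -> {status}")
    print("minimum cores under this estimate:", min_procs_for_int32(n_rows, n_cols))
```

With these made-up dimensions the per-core count only drops below the 4-byte limit somewhere between 64 and 128 cores, which mirrors the behavior described above but says nothing about the real threshold for any particular ML_AB file.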
-
- Newbie
- Posts: 15
- Joined: Fri Oct 20, 2023 1:13 pm
Re: Stuck in 'Refit' Mode: VASP ML Force Field Issue with Triclinic Geometries
This question and answer were very useful for me as well. I got the same problem when refitting a force field for a liquid-phase system. At first I was quite surprised, as I only used 650 GB of the nearly 1500 GB of memory available but still got this error message. However, I now understand that the issue is caused by the size of an individual array that has to be allocated rather than by the total amount of memory. I am testing the solution (using more cores) and will see how this works out. I must note, though, that it would be nice if this error could be avoided by adjusting the ML algorithm, as it forces me to use more high-memory nodes than I strictly need memory-wise. Using more cores just to get shorter arrays on each core does not feel like an appropriate long-term solution.
Regards,
Jelle
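As a rough illustration of why total memory is not the limiting factor here: the short Python sketch below computes how many bytes a single array occupies when its length reaches the 4-byte integer limit. It assumes 8-byte double-precision elements and that the limit applies to the element count, neither of which is confirmed in this thread; the function name is hypothetical.

```python
# Illustration only: assumes the limit is on the element count and that the
# elements are 8-byte doubles. Neither detail is confirmed in this thread.

INT32_MAX = 2**31 - 1   # largest value a signed 4-byte integer can hold
BYTES_PER_DOUBLE = 8


def bytes_at_index_limit(bytes_per_element=BYTES_PER_DOUBLE):
    """Memory footprint of one array whose length equals the int32 maximum."""
    return INT32_MAX * bytes_per_element


if __name__ == "__main__":
    limit_gib = bytes_at_index_limit() / 2**30
    print(f"one array at the 4-byte index limit: about {limit_gib:.1f} GiB")
```

Under these assumptions a single array hits the limit at about 16 GiB, far below the memory of a large node, so the error can appear even when most of the memory is still free.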
-
- Global Moderator
- Posts: 460
- Joined: Mon Nov 04, 2019 12:44 pm
Re: Stuck in 'Refit' Mode: VASP ML Force Field Issue with Triclinic Geometries
The clean fix for this will come when ScaLAPACK officially changes from 4-byte to 8-byte integers. That will completely solve the problem.
Until then there is not much we can do, since we absolutely need the parallel SVD solvers from scaLAPACK. It's also hard to know in advance when this problem occurs, so writing warnings is also not easy.
-
- Newbie
- Posts: 2
- Joined: Mon May 08, 2023 11:45 pm
Re: Stuck in 'Refit' Mode: VASP ML Force Field Issue with Triclinic Geometries
Hey everyone, thank you for the input on the problem. I've been trying to run on 128 cores as suggested, but I still have some problems with it. Could you recommend a compiler and MPI to try? I've tried builds made with the two module sets below:
FIRST:
module load intel/19.0.4
module load intel-mpi
module load intel-mkl
module load cuda
SECOND:
module load gcc/11.3.0
module load openmpi/4.1.4
module load hdf5/1.12.2
module load intel-oneapi/2021.3
module unload intel-oneapi-mpi/2021.3
Thanks,
-
- Global Moderator
- Posts: 460
- Joined: Mon Nov 04, 2019 12:44 pm
Re: Stuck in 'Refit' Mode: VASP ML Force Field Issue with Triclinic Geometries
I don't think it's a problem with the compilers. It rather depends on the size of your calculation. If you have a huge calculation, then 128 cores may also not be enough. So my suggestion is to try with more cores, maybe 256 or more, until the problem goes away.
If it still does not help, then try this toolchain:
Intel Fortran 22.0.1 with Intel MPI 21.5.0