ML_ISTART = 1 doesn't work with different element types - v6.4.1

Problems running VASP: crashes, internal errors, "wrong" results.


john_martirez1
Newbie
Posts: 3
Joined: Mon Jun 05, 2023 2:01 pm

ML_ISTART = 1 doesn't work with different element types - v6.4.1

#1 Post by john_martirez1 » Wed Jun 28, 2023 2:26 pm

My ML_AB file contains H and O, and I am training for a system with H, O, and C. The run just terminates early; not even a single SCF step is completed.
Everything is OK with ML_ISTART = 0.

ferenc_karsai
Global Moderator
Posts: 460
Joined: Mon Nov 04, 2019 12:44 pm

Re: ML_ISTART = 1 doesn't work with different element types - v6.4.1

#2 Post by ferenc_karsai » Wed Jun 28, 2023 2:35 pm

Please send all files necessary to run and check the calculation:
POSCAR, POTCAR, KPOINTS, INCAR, OUTCAR, ML_AB, ML_LOGFILE, and stdout.

john_martirez1
Newbie
Posts: 3
Joined: Mon Jun 05, 2023 2:01 pm

Re: ML_ISTART = 1 doesn't work with different element types - v6.4.1

#3 Post by john_martirez1 » Wed Jun 28, 2023 2:53 pm

Thanks for the quick reply. See the attached files.

john_martirez1
Newbie
Posts: 3
Joined: Mon Jun 05, 2023 2:01 pm

Re: ML_ISTART = 1 doesn't work with different element types - v6.4.1

#4 Post by john_martirez1 » Wed Jun 28, 2023 3:29 pm

I found the reason: there is a significant jump in memory requirements when going from ML_ISTART = 0 to ML_ISTART = 1.
I increased the memory per CPU to 9200 MB, and then it worked.

Hopefully the memory allocation can be improved in the future for ML_ISTART = 1? Or am I missing something?

ferenc_karsai
Global Moderator
Posts: 460
Joined: Mon Nov 04, 2019 12:44 pm

Re: ML_ISTART = 1 doesn't work with different element types - v6.4.1

#5 Post by ferenc_karsai » Mon Jul 03, 2023 9:25 am

It's hard to do anything about the memory allocation.
Here are some explanations:
At the moment we have to allocate the memory statically at the beginning, mainly because of the use of shared-memory MPI. We have seen several times that problems arise if shared-memory MPI needs to be reallocated, and I don't know if this will ever be solved for all compilers.

So how can the memory grow so much in your case?
1) New element types entered the calculation. We use multidimensional allocatable arrays in Fortran, so the local-reference dimension is allocated with the same maximum for all element types. Ideally one wants the same number of local reference configurations for every element type, but this is often hard to achieve for dopants, where only a few atoms from the training structures are available as local-reference candidates. In that case some memory is wasted; your case might belong to that category (see the sketch after this list).
2) It is a continuation run, and if you don't specify anything, then min(1500, NSW) is added on top of the already available data. Please see the documentation of ML_MB and ML_MCONF (https://www.vasp.at/wiki/index.php/ML_MB and https://www.vasp.at/wiki/index.php/ML_MCONF). This default has worked quite nicely so far, but if it turns out to be problematic for the majority of users, we will change it.
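
A minimal sketch of the allocation pattern described in point 1. This is not VASP source code; the array name, dimensions, and numbers are made up for illustration. The idea is that every element type reserves space for the maximum number of local reference configurations, so a dopant with only a few candidates still costs as much memory as the most abundant species.

Code:
program alloc_sketch
  implicit none
  ! Illustrative only: a rectangular allocatable array means every element
  ! type reserves space for the same maximum number of local references.
  integer, parameter :: ntypes  = 3      ! e.g. H, O and the newly added C
  integer, parameter :: max_lrc = 2000   ! max local references over all types (made-up value)
  integer, parameter :: ndesc   = 500    ! descriptor length per reference (made-up value)
  real(8), allocatable :: lrc(:,:,:)

  allocate(lrc(ndesc, max_lrc, ntypes))  ! size is independent of how many references C really has
  print '(A,F8.2,A)', 'allocated ~', ndesc*max_lrc*ntypes*8.0/1024.0**2, ' MB'
  deallocate(lrc)
end program alloc_sketch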

What can you do?
1) Check whether you compiled with shared-memory MPI ("-Duse_shmem").
2) Adjust ML_MB and ML_MCONF (see the INCAR sketch below).
3) Go to a larger number of compute nodes, since the design matrix, which needs the most memory, is distributed linearly over the number of cores.
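
As a concrete illustration of point 2, the array sizes can be capped explicitly in the INCAR of the continuation run. The tags are the ones documented above; the values here are purely illustrative and have to be chosen to fit your actual training data.

Code:
ML_LMLFF  = .TRUE.   ! machine-learned force field on
ML_ISTART = 1        ! continue on-the-fly training from the existing ML_AB
ML_MB     = 2000     ! max local reference configurations per element type (illustrative value)
ML_MCONF  = 1000     ! max number of training structures (illustrative value)

Lowering these two tags reduces the size of the statically allocated arrays, at the price of limiting how much new data can be added during the continuation run.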
