Training becomes slower when copying ML_ABN to ML_AB to continue training

Queries about input and output files, running specific calculations, etc.



suojiang_zhang1
Jr. Member
Posts: 71
Joined: Tue Nov 19, 2019 4:15 am

Training becomes slower when copying ML_ABN to ML_AB to continue training

#1 Post by suojiang_zhang1 » Sat Mar 29, 2025 9:53 am

Dear support,
Running MLFF training on the same computer, I found that training becomes slower after I copy the ML_ABN file from the first run to ML_AB and continue training.


marie-therese.huebsch
Full Member
Posts: 245
Joined: Tue Jan 19, 2021 12:01 am

Re: Training becomes slower when copying ML_ABN to ML_AB to continue training

#2 Post by marie-therese.huebsch » Mon Mar 31, 2025 10:08 am

Hi,

It is great that you are doing some testing. Could you clarify what exactly you are observing?

For reference, the ab-initio calculation should remain at the same computational cost in every MD step unless you changed some settings. During training, more and more local reference configurations are collected, and it indeed costs more computational effort to add, e.g., the 15th local reference configuration and update the design matrix than the 4th. However, entirely avoiding the addition of local reference configurations is not an option, since this is exactly what improves the force field. Restarting a training calculation, compared to running a single training calculation for longer, should not change the computational cost significantly (apart from the overhead of writing and reading files).
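As a toy illustration of why each added local reference configuration makes subsequent training steps pricier: this is a back-of-the-envelope cost model, not VASP's actual implementation, and the flop counts below are rough assumptions.

```python
# Toy cost model (not VASP source code): the on-the-fly MLFF refit
# solves a least-squares problem whose design matrix has one column per
# local reference configuration (basis function). With R training
# equations and B basis functions, forming the normal equations costs
# roughly R * B^2 flops and solving them roughly B^3, so every added
# basis function makes later refits more expensive.

def fit_cost_flops(rows: int, basis: int) -> int:
    """Rough flop estimate for one least-squares refit."""
    return rows * basis ** 2 + basis ** 3

# Doubling the basis roughly quadruples (or more) the refit cost:
ratio = fit_cost_flops(10000, 3000) / fit_cost_flops(10000, 1500)
print(round(ratio, 1))  # 4.5
```

The exact numbers are illustrative; the point is the superlinear growth in B.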

Do you have a question in connection with your observation?

Marie-Therese


suojiang_zhang1
Jr. Member
Posts: 71
Joined: Tue Nov 19, 2019 4:15 am

Re: Training becomes slower when copying ML_ABN to ML_AB to continue training

#3 Post by suojiang_zhang1 » Mon Apr 21, 2025 1:32 am

Hi,
Continued training becomes really slow when I copy ML_ABN to ML_AB.

My INCAR looks like:
ISMEAR = 0
SIGMA = 0.5
ISPIN = 1
ISYM = 0
LREAL = Auto
### MD part
IBRION = 0
MDALGO = 3
LANGEVIN_GAMMA = 10.0 10.0 10.0 10.0 10.0 10.0
LANGEVIN_GAMMA_L = 10.0
NSW = 10000
POTIM = 1.5
ISIF = 3
TEBEG = 200
TEEND = 500
PSTRESS = 0.001
PMASS = 100
POMASS = 12 8 14 32 16 19
RANDOM_SEED = 486686595 0 0
### Output
LWAVE = .FALSE.
LCHARG = .FALSE.
#NBLOCK = 10
#KBLOCK = 10
##############################
### MACHINE-LEARNING ###
################################
ML_LMLFF = .T.
ML_MODE = train
ML_DESC_TYPE = 1
ML_MCONF_NEW = 12
ML_CDOUB = 4
ML_CTIFOR = 0.02

I checked the ML_ABN file and found that the number of basis sets per atom type increased from 1500 (training from scratch) to 3000 after continuing training.
I suspect this increase in basis-set size is what makes the training so slow.
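If the growing basis is indeed the bottleneck, one knob worth checking (an assumption to verify against the VASP wiki, not something confirmed in this thread) is ML_MB, which caps the maximum number of local reference configurations (basis functions) per atom type; its default is 1500. A hedged INCAR sketch:

```
### Hedged sketch, not verified for this system: ML_MB limits the
### basis-set size per atom type, trading force-field accuracy
### against training cost and memory. Default is 1500.
ML_LMLFF = .T.
ML_MODE = train
ML_MB = 2000
```

Lowering or keeping ML_MB small keeps each refit cheaper at the price of a less flexible force field.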


suojiang_zhang1
Jr. Member
Posts: 71
Joined: Tue Nov 19, 2019 4:15 am

Re: Training becomes slower when copying ML_ABN to ML_AB to continue training

#4 Post by suojiang_zhang1 » Mon Apr 21, 2025 4:26 am

In addition, I find that ML_FFN is rewritten frequently, which takes quite some time. How can I set the rewriting frequency for ML_FFN?

