VASP 6.4 ML_ISTART=2 doesn't work

#1 Post by **xiliang_lian** » Fri Feb 17, 2023 3:38 pm

Hello,

I have learned that you released a new version of VASP with exciting new MLFF features. I really appreciate this and gave it a try today. However, I ran into trouble.

Whenever I train an MLFF, I am unable to apply the resulting model to a larger cell. The warning message I get is this:

Code:

Atom types present in ML_FF inconsistent with prediction-only run
(ML_ISTART = 2) with the given structure.

However, I am sure the atom types are correct. I also tried applying the force field to the same structure it was trained on and got the same failure. Unless I made a typo or some other mistake, I am afraid the problem might come from the code.

You will find my calculation setup in the attachment. I did the test with an example structure: NaF. I have three folders: run1 for training, run2 for applying to a large supercell, and run3 for applying it to the CONTCAR from run1. Can you please have a look?

Thanks a lot in advance.

Best,
Xiliang

#2 Post by **jonathan_lahnsteiner2** » Mon Feb 20, 2023 9:04 am

Dear xiliang_lian,

I checked your data files. There is a problem with your ML_FF files. The ML_FF file starts with an ASCII header, which still looks fine in your case. After the ASCII header, some variables are defined, for example the number of atom types. Those variables are all set to zero in your ML_FF file. This is why you receive the error that the atom types do not match. So somehow your ML_FF files seem to be corrupted.
When I repeat your calculations and generate new ML_FF files with VASP 6.4.0, everything works perfectly fine for me. I just copied your input files from your run1 folder and executed VASP.

So I wanted to ask if you could be so kind as to repeat the calculations to see if you still receive the same error with the latest VASP version. Could you please also let me know what operating system you are using and what compilers you used to build VASP? This would be very helpful for resolving the issue.

All the best,
Jonathan

#3 Post by **xiliang_lian** » Mon Feb 20, 2023 9:42 am

Dear Jonathan,

Thanks a lot for taking care of this problem.

Before I posted the problem, I actually had tried calculations of two systems and also two platforms. I ran the calculations again on both platforms just now. Unfortunately, the problem persisted.

Here is some information related to the system and compilers we use.

On the first platform, we have compiled the GPU version. The system is

Code:

LSB Version:	:core-4.1-amd64:core-4.1-noarch:cxx-4.1-amd64:cxx-4.1-noarch:desktop-4.1-amd64:desktop-4.1-noarch:languages-4.1-amd64:languages-4.1-noarch:printing-4.1-amd64:printing-4.1-noarch
Distributor ID:	RedHatEnterprise
Description:	Red Hat Enterprise Linux release 8.6 (Ootpa)
Release:	8.6
Codename:	Ootpa

and then we use the following dependencies

Code:

Loading requirement: nvidia-compilers/22.9 cuda/11.2 openmpi/4.1.1-cuda nccl/2.9.6-1-cuda
    intel-mkl/2020.4

On the second platform, we have compiled the CPU version. The system is

Code:

LSB Version:	n/a
Distributor ID:	SUSE
Description:	SUSE Linux Enterprise Server 12 SP3
Release:	12.3
Codename:	n/a

and I use the following modules for compilation:

Code:

 module load intel/intel-compilers-2021.3/2021.3
 module load intel/intel-mkl-2021.3/2021.3
 module load intel/intel-mpi/2021.3
 module load HDF5/1.12.2-intel2021

It is exactly the same error so I didn't put the files in the attachment this time. Please let me know if there is anything I can do to identify the problems. Thanks again.

Best regards,
Xiliang

#4 Post by **jonathan_lahnsteiner2** » Mon Feb 20, 2023 3:17 pm

Dear xiliang_lian,

We have now tried with the compilers that you mentioned in your post. Unfortunately, we are still not able to reproduce the error you have.

Is it possible for you to try the following workflows:

Step 1) Train an ML_FF as you do in your run1 folder.
Step 2) Take the created ML_ABN file, copy it to ML_AB, and retrain the force field with ML_ISTART=1, NSW=1, and ML_CTIFOR=1000000 (a large number).
Step 3) Take the ML_FFN file from Step 2, copy it to your run2 folder, and try to run the simulation again.

Then maybe try the same workflow again, but replace Step 2) with:
Take the created ML_ABN file, copy it to ML_AB, and retrain the force field with ML_ISTART=3, NSW=1, and ML_CTIFOR=1000000 (a large number).
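
The retraining step above could be staged with a small script. The following is only a minimal sketch: the helper names `set_incar_tags` and `prepare_refit` are hypothetical, and it assumes a simple INCAR with one `TAG = value` assignment per line and that POSCAR, POTCAR, and KPOINTS can be reused from the training folder.

```python
import shutil
from pathlib import Path


def set_incar_tags(incar_text, tags):
    """Return INCAR text with the given tags set.

    Assumes one 'TAG = value' assignment per line; lines that combine
    several tags with ';' are not handled by this sketch.
    """
    remaining = dict(tags)
    out = []
    for line in incar_text.splitlines():
        key = line.split("=", 1)[0].strip().upper() if "=" in line else None
        if key in remaining:
            # Replace the existing assignment with the requested value.
            out.append(f"{key} = {remaining.pop(key)}")
        else:
            out.append(line)
    # Append any tags that were not present in the original file.
    out.extend(f"{k} = {v}" for k, v in remaining.items())
    return "\n".join(out) + "\n"


def prepare_refit(train_dir, refit_dir, ml_istart=1):
    """Stage Step 2: ML_ABN becomes ML_AB, and the INCAR is adjusted."""
    train_dir, refit_dir = Path(train_dir), Path(refit_dir)
    refit_dir.mkdir(parents=True, exist_ok=True)
    shutil.copy(train_dir / "ML_ABN", refit_dir / "ML_AB")
    # Assumption: the remaining inputs are reused unchanged.
    for name in ("POSCAR", "POTCAR", "KPOINTS"):
        if (train_dir / name).exists():
            shutil.copy(train_dir / name, refit_dir / name)
    incar = (train_dir / "INCAR").read_text()
    new_tags = {"ML_ISTART": ml_istart, "NSW": 1, "ML_CTIFOR": 1000000}
    (refit_dir / "INCAR").write_text(set_incar_tags(incar, new_tags))
```

Calling `prepare_refit("run1", "refit", ml_istart=3)` would stage the alternative variant of Step 2.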

Do you still get the same error message with these approaches? Would you also be so kind as to upload all the ML_FF files for the Intel and Nvidia tests you mentioned? Please also send us the files of the two proposed workflows in case they do not work.
Thank you, and I am sorry for the inconvenience.

Best Jonathan

#5 Post by **xiliang_lian** » Mon Feb 20, 2023 4:49 pm

Dear Jonathan,

Yes. There are two attachments in this message. I put all the files there so it might be quite large.

In problems1.zip, I attached the previous calculations on both Nvidia and Intel.

In problems2.zip, you will find the calculations I made following the procedure in your last message (due to the size limit, I omitted the directory where I obtained the first ML_ABN; I can send it if you need it). It worked for me this time. Probably this will help to identify the source of the issue. If you look at the file "run.o1444776" in run3, you will see I still get quite a lot of warnings. Anyway, it worked. Still, it would be best to solve the underlying problem, since this workaround is very inconvenient.

Thanks again.

Xiliang

#6 Post by **andreas.singraber** » Tue Feb 21, 2023 3:15 pm

Dear Xiliang,

thanks for uploading the additional files; this helped us shed more light on the problem. As Jonathan already mentioned, your ML_FFN files are corrupted right after the end of the header (which is 4096 bytes long). Here you can see how this looks in comparison with a working file:

[Attached image: vimdiff.jpg]

As you can see, there are a lot of zeros instead of data right after the ASCII header (red line). After a while the file continues normally (line starting with 00001240:). Interestingly, the length of this all-zero block varies across the different ML_FFN files you sent us. We have never observed such behavior and were not able to reproduce the problem on any of our machines or with any of the compilers you mentioned.
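
A quick way to look for this symptom in one's own files is to count the zero bytes that directly follow the header. This is only a heuristic sketch: the 4096-byte header length is taken from this thread, the function name `zero_run_after_header` is made up for illustration, and nothing is assumed about the binary layout beyond that. A healthy file may legitimately contain a few zero bytes at that position, so only a long run is suspicious.

```python
HEADER_BYTES = 4096  # ASCII header length as stated in this thread


def zero_run_after_header(path, header_bytes=HEADER_BYTES, chunk=65536):
    """Count the consecutive zero bytes that immediately follow the header."""
    count = 0
    with open(path, "rb") as fh:
        fh.seek(header_bytes)
        while True:
            block = fh.read(chunk)
            if not block:
                # End of file reached while still inside the zero run.
                return count
            stripped = block.lstrip(b"\x00")
            count += len(block) - len(stripped)
            if stripped:
                # First non-zero byte found, so the run ends here.
                return count


if __name__ == "__main__":
    import sys
    for name in sys.argv[1:]:
        print(name, zero_run_after_header(name))
```

Comparing the count for a suspect ML_FFN against one from a run that is known to work gives a rough indication of whether the file shows the same all-zero block.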

However, the good news is that we were able to identify a part of the code that could be the culprit. We found that this file is not sufficiently protected against concurrent writes from different MPI tasks (only a single MPI rank should write at any one time). This did not show up in our tests, but some parallel file systems may get confused and react with broken file buffering. I have created a preliminary patch file for VASP 6.4.0 with a fix (please see the attachment). After extracting the patch file in your VASP root folder, you can apply the patch by executing:
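
For readers unfamiliar with this kind of problem, the general idea of a single-writer guard can be illustrated as follows. This is not the actual patch (the real code is Fortran in src/ml_ff_iohandle.F); it is a generic sketch in which the function name `write_header` is hypothetical and the rank is passed in as a plain integer instead of being queried from an MPI library.

```python
def write_header(path, header, rank, writer_rank=0):
    """Write the header from exactly one rank; return True if this rank wrote.

    In a real MPI program the rank would come from the communicator
    (for example MPI.COMM_WORLD.Get_rank() in mpi4py), and a barrier
    would follow so no other rank touches the file before the write ends.
    """
    if rank != writer_rank:
        # All other ranks skip the write entirely.
        return False
    with open(path, "wb") as fh:
        fh.write(header)
    return True
```

Without such a guard, several ranks may open and write the same region of the file at once, and the outcome then depends on how the file system orders and buffers those writes.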

Code:

patch src/ml_ff_iohandle.F < ml_ff_iohandle.patch

Then, please completely recompile VASP and try again to run your problematic training cases to create new ML_FFN files. There is no guarantee that this will fix the problem but at the moment it is our best guess. Please report back whether this patch works for you!

Thank you and sorry about the inconvenience!

All the best,
Andreas Singraber

#7 Post by **xiliang_lian** » Wed Feb 22, 2023 9:14 am

Dear Andreas,

Thank you very much. I will do the test today and come back to you as soon as I finish. What I forgot to tell you is that the problem only comes with the latest version. I have been using MLFF for some time and no problem has occurred before. Therefore, the problem might come along with the new features introduced with 6.4. If you have anything new, I will be happy to have a try.

Best regards,
Xiliang

#8 Post by **andreas.singraber** » Wed Feb 22, 2023 9:48 am

Dear Xiliang,

thank you for testing this! It makes sense that this only occurred with the latest release because there the ML_FFN ASCII header feature was added. The patch removes potential concurrent writes coming from the subroutine which writes the header.

Best,
Andreas

#9 Post by **szurlle** » Wed Feb 22, 2023 3:42 pm

Dear all,

Thank you for the discussion. I had the same problem as Xiliang has reported, and the problem was gone when vasp6.4 was re-compiled using the patch file Andreas provided.

Best,
Shuai

#10 Post by **andreas.singraber** » Wed Feb 22, 2023 4:45 pm

Hello Shuai,

thank you for your report, it is good to know that the fix works. We will put this information and the patch file also on the known issues page on the Wiki. Of course the fix will also be part of the next patch release of VASP.

I would like to stress one point that is not yet well documented on the Wiki but is very important for benefiting from the significant performance improvements that were introduced in prediction mode (ML_ISTART=2). Once you have finished training and would like to apply the force field in a long MD run, always refit with your last ML_AB and the new ML_MODE=refit tag. This automatically sets ML_ISTART=4 and ML_LFAST=.TRUE. and results in a new ML_FFN file which supports the fast execution mode (you can check whether it is enabled by looking for ML_LFAST in the ML_LOGFILE). The ML_FFN files resulting from normal training runs (ML_ISTART=0,1,3) do not allow for this fast prediction in MD runs. That is why we now recommend always refitting at the end of your training stage (and why we also asked Xiliang to do so).
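
The ML_LOGFILE check mentioned above can be done with any text search. As a minimal sketch (the helper name `find_tag_lines` is made up, and nothing is assumed about the exact layout of the log), the following simply returns every line that mentions the tag so it can be inspected by eye:

```python
def find_tag_lines(logfile, tag="ML_LFAST"):
    """Return all lines of the log file that mention the given tag."""
    with open(logfile, errors="replace") as fh:
        return [line.rstrip("\n") for line in fh if tag in line]


if __name__ == "__main__":
    for line in find_tag_lines("ML_LOGFILE"):
        print(line)
```

If no line is returned, or the reported value is not what the refit should have produced, the ML_FFN in that folder probably does not come from a refit run.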

I am sorry for the inconvenience and that the new features are not yet properly included on all the corresponding Wiki pages. We will update them in the coming days...

All the best,
Andreas Singraber

#11 Post by **xiliang_lian** » Fri Feb 24, 2023 8:16 am

Dear Andreas,

Thanks a lot for the work and also for your kind comments. As an additional confirmation, the patch also works for the GPU version.

I really appreciate your efforts in the last few days to help me with the issues.

Best,
Xiliang

#12 Post by **andreas.singraber** » Fri Feb 24, 2023 9:28 am

Dear Xiliang,

Thanks for reporting back, I am glad that the fix also worked for you! Please note that none of the machine learning functionality is parallelized via GPU yet, so you can only make use of the GPU for the ab initio calculations during on-the-fly training (ML_ISTART=0,1), but not yet for ML_ISTART=2,3,4. This is definitely on our TODO list, but for now only CPU parallelization is available. Fortunately, the CPU version for ML_ISTART=2 has become significantly faster with VASP 6.4.0 (we typically observed a factor of 20-100), so you should be able to reach simulation times of a few nanoseconds per day.
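
To relate the "nanoseconds per day" figure to something measurable, the MD throughput can be converted with simple arithmetic. The numbers in this sketch are purely hypothetical and are not benchmarks from this thread; the function name `ns_per_day` is also just illustrative.

```python
def ns_per_day(steps_per_second, timestep_fs):
    """Convert MD throughput into simulated nanoseconds per wall-clock day."""
    seconds_per_day = 86400
    fs_per_ns = 1e6
    return steps_per_second * seconds_per_day * timestep_fs / fs_per_ns


if __name__ == "__main__":
    # Hypothetical example: 25 MD steps per second with a 1 fs time step.
    print(ns_per_day(25, 1.0))  # 25 * 86400 * 1 fs = 2.16e6 fs, i.e. 2.16 ns
```

Read the other way around, with a 1 fs time step, roughly 11.6 steps per second are needed to reach one nanosecond per day.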

All the best,
Andreas Singraber

#13 Post by **xiliang_lian** » Fri Feb 24, 2023 9:39 am

Dear Andreas,

It is very kind of you to remind me of this. It is very important information because I was thinking of using GPU to accelerate the MLFF further. The efficiency boost with the new version is exciting. Looking forward to the GPU implementation of the MLFF module. Thanks again.

Best,
Xiliang
