Dear Xiliang,
Thanks for uploading the additional files; they helped us shed more light on the problem. As Jonathan already mentioned, your ML_FFN files are corrupted right after the end of the header (which is 4096 bytes long). Here you can see how this looks in comparison with a working file:
vimdiff.jpg
As you can see, there are a lot of zeros instead of data right after the ASCII header (red line). After a while the file continues normally (line starting with 00001240:). Interestingly, the length of this all-zero block varies across the different ML_FFN files you sent us. We have never observed such behavior before and were not able to reproduce the problem on any of our machines or with any of the compilers you mentioned.
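If you want to check other ML_FFN files for the same symptom without opening a hex viewer, a small script along the following lines may help. This is only a minimal sketch and not part of VASP: the function name zero_run_after_header is made up for illustration, and the only fact it takes from above is the 4096-byte header length. It counts how many consecutive zero bytes follow the header. Note that a healthy binary file may also begin with a few zero bytes, so what length counts as suspicious is an assumption you have to judge against a known working file.

```python
HEADER_BYTES = 4096  # header length of ML_FFN as stated in this post


def zero_run_after_header(path, header_bytes=HEADER_BYTES, chunk_size=65536):
    """Count consecutive zero bytes that directly follow the header.

    Hypothetical helper for illustration only; it is not a VASP tool.
    Returns 0 if the file ends at or before the end of the header.
    """
    count = 0
    with open(path, "rb") as f:
        f.seek(header_bytes)  # skip the ASCII header
        while True:
            chunk = f.read(chunk_size)
            if not chunk:  # end of file reached
                break
            stripped = chunk.lstrip(b"\x00")  # drop leading zero bytes
            count += len(chunk) - len(stripped)
            if stripped:  # first non-zero byte found, the run is over
                break
    return count
```

Calling zero_run_after_header("ML_FFN") on a working file and on a suspect file and comparing the two numbers should show whether a file has the long all-zero block described above.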
However, the good news is that we were able to identify a part of the code that could be the culprit. We found that there is not enough protection against concurrent writes to this file by different MPI tasks (only a single MPI rank should write at a time). This did not show up in our tests, but some parallel file systems may get confused and react with broken file buffering. I have created a preliminary patch file for VASP 6.4.0 with a fix (please see the attachment). After extracting the patch file in your VASP root folder, you can apply the patch by executing:
patch src/ml_ff_iohandle.F < ml_ff_iohandle.patch
Then, please completely recompile VASP and try again to run your problematic training cases to create new ML_FFN files. There is no guarantee that this will fix the problem, but at the moment it is our best guess. Please report back whether this patch works for you!
Thank you and sorry about the inconvenience!
All the best,
Andreas Singraber