Problems with vasp on hpc machines with high load on the file server

#1 Post by **emmi_gareis** » Thu Jan 18, 2024 1:27 pm

Dear VASP-Community,

During the last few weeks I ran into some problems when running VASP calculations. I am using VASP version 6.3.2.

I am investigating tungsten trioxide clusters, so I have mainly used the gamma-only version vasp_gam so far. However, as soon as many people are using the cluster and there is moderate or high load on the file server connected to the HPC machine, a large fraction of my calculations slow down significantly at a random point. Most of the time these jobs cannot finish within 24 h, even though they should. When I restart the same calculations during a period without high load on the file server, they finish within a few hours. Looking at the cluster cockpit, I noticed that for these jobs the Gflops drop to zero as soon as the problem occurs, while the CPU load stays constant. This fits the observation that the calculation just seems to stop and wait.
I attached a corresponding INCAR, POSCAR, submit.sh, vasp.log and the vaspout.h5 VASP output file to this post, together with the error message (.err) and the cluster-related output (.out). You can find these files in "random_slow_down.tar.gz" in the attachments. Since I observed this problem for many of my settings and configurations, I doubt that it is connected to the specific input. I additionally included the data about CPU load, memory and Gflops from the cluster cockpit in "random_slow_down.tar.gz"; I used 4 nodes, and each color corresponds to the data of one node. If any further data regarding the job is missing, please don't hesitate to contact me.
I observed this problem earlier, when I was still using VASP version 5.4.4.pl2, so I don't think it is a problem with the specific version either.
I also tried the standard version vasp_std instead of vasp_gam. With this version the sudden breakdown of Gflops did seem to become less likely, although it still occurred in some of my jobs. But using the standard version led to another problem. Almost all of my calculations run during high load on the file server ended at some point with the error message:
"internal error in: vhdf5.F at line: 394
HDF5 call in vhdf5.F:394 produced error: -1 "
This problem is also non-deterministic: running the same calculation again does not necessarily reproduce the error. I attached a corresponding INCAR, POSCAR, submit.sh and the VASP-related output, together with the whole error message (.err) and the cluster-related output (.out), in "error_in_hdf5.tar.gz". That archive also contains the data from the cluster cockpit. If you need more information about the job, I would be happy to send it to you. This error also occasionally occurs with vasp_gam, but it became more frequent with vasp_std.
Since these problems seem to be connected with I/O, is there a possibility to make it easier for VASP to store the output, especially the hdf5 file? Is there for example a possibility to choose the path where the vaspout.h5 will be stored?

Best regards,
Emmi Gareis

#2 Post by **alexey.tal** » Fri Jan 19, 2024 4:39 pm

Dear Emmi Gareis,

Thank you for your comprehensive report.
Both of the reported issues seem to be related to the filesystem, so can you describe in more detail what filesystem you are using?

The HDF5 issue could be due to the use of a network filesystem or a lack of storage space.
The performance issue on your HPC machine might be related to the limited bandwidth of the network filesystem. In this case you could try to run the calculations in a scratch or temporary folder that is not connected to the network, which should solve both problems.

Did you monitor the network usage during these "low performance" episodes?

emmi_gareis wrote: "Since these problems seem to be connected with I/O, is there a possibility to make it easier for VASP to store the output, especially the hdf5 file? Is there for example a possibility to choose the path where the vaspout.h5 will be stored?"

It is also possible to compile VASP without HDF5 to minimize I/O. Furthermore, there are a number of tags in VASP that allow users to write less output during the calculation, but this is not always practical.

#3 Post by **emmi_gareis** » Wed Feb 07, 2024 4:11 pm

Dear Alexey,

Thank you for your answer. I am running my simulations on an HPC cluster with several hundred available nodes and store the output data on the filesystem intended for I/O of HPC data. During my work on this topic I also switched from one cluster to another, even larger one, and therefore to a different filesystem, and I observed both errors on both HPC clusters.
I already tried recompiling VASP without HDF5 support, which solved the error connected to HDF5. But I am still observing these low-performance episodes on both HPC machines and consequently on two different filesystems.
Additionally, I already switched off the writing of wavefunctions and charge density completely, which unfortunately did not change how often the Gflops/performance breakdown occurs.
I also monitored the network usage, which gave no hints about the sudden breakdown of performance.

Best regards

#4 Post by **alexey.tal** » Fri Feb 23, 2024 3:09 pm

VASP needs to be able to use fast I/O. If the file server is indeed overloaded and VASP cannot read and write files, the performance is going to be strongly affected.

On the computers that I have used, users usually get access to a filesystem intended for permanent storage of files and a "scratch" filesystem, which is more performant and intended for running calculations. Have you asked the administrators of your computer if it is possible to get access to a file server with higher bandwidth?

Have you tried running calculations in a directory local to the node (not connected via network) during these high-load moments? If this issue has to do with the file server being overloaded, running the calculations in a local directory should help.
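To illustrate the idea of running in a node-local directory, here is a minimal Python sketch, not something that ships with VASP. The function name `run_in_local_scratch`, the list of input files and the scratch location are assumptions for illustration. It also assumes a single-node job; for multi-node runs the local directory would have to be set up on every node, which is site-specific.

```python
import shutil
import subprocess
import tempfile
from pathlib import Path

# Standard VASP input file names; files that are missing are skipped.
INPUT_FILES = ("INCAR", "POSCAR", "KPOINTS", "POTCAR")


def run_in_local_scratch(job_dir, scratch_root, command):
    """Run `command` in a temporary directory under `scratch_root`.

    Inputs are copied in, output files are copied back to `job_dir`,
    and the temporary directory is removed afterwards.
    """
    job_dir = Path(job_dir)
    work_dir = Path(tempfile.mkdtemp(prefix="vasp_", dir=scratch_root))
    try:
        for name in INPUT_FILES:
            src = job_dir / name
            if src.exists():
                shutil.copy2(src, work_dir / name)
        # All reads and writes of the run now go to the local directory.
        result = subprocess.run(command, cwd=work_dir)
        # Copy files back even after a failed run, so that logs survive.
        for item in work_dir.iterdir():
            if item.is_file():
                shutil.copy2(item, job_dir / item.name)
        return result.returncode
    finally:
        shutil.rmtree(work_dir, ignore_errors=True)
```

A call could look like `run_in_local_scratch(".", "/tmp", ["mpirun", "vasp_gam"])`, where the scratch path and the launcher command depend on the cluster. The network filesystem is then touched only at the start and at the end of the job.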

Also, you could minimize the output further by setting the tags NWRITE and KBLOCK, although completely eliminating I/O is not possible in VASP.
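As a sketch of how such output-reducing tags could be applied to an existing INCAR, the hypothetical Python helper below sets or replaces tag assignments. The tag values are illustrative assumptions, not recommendations: NWRITE controls the verbosity of the OUTCAR, and LWAVE and LCHARG switch off writing of the wavefunctions and the charge density. A KBLOCK entry could be added the same way once a suitable value has been chosen from the VASP documentation. The helper assumes one tag per line and drops any trailing comment on a line it replaces.

```python
def set_incar_tags(incar_text, tags):
    """Return INCAR text with `tags` set, replacing existing assignments."""
    remaining = dict(tags)
    lines = []
    for line in incar_text.splitlines():
        key = line.split("=", 1)[0].strip().upper()
        if "=" in line and key in remaining:
            # Overwrite the existing assignment of this tag.
            lines.append(f"{key} = {remaining.pop(key)}")
        else:
            lines.append(line)
    # Append tags that were not present in the original text.
    lines.extend(f"{key} = {value}" for key, value in remaining.items())
    return "\n".join(lines) + "\n"


# Illustrative values only (assumptions, check the VASP documentation).
LOW_IO_TAGS = {"NWRITE": "0", "LWAVE": ".FALSE.", "LCHARG": ".FALSE."}
```

The result of `set_incar_tags(open("INCAR").read(), LOW_IO_TAGS)` can be written back to the INCAR before the job is submitted.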

The performance of VASP can also be strongly affected if the calculation runs out of memory and starts using swap memory. Since you are describing that this issue occurs during high computational load, it could be a result of multiple jobs running on the same node and using up all the available RAM.
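One way to check the swap hypothesis is to log memory and swap usage on a compute node while the job runs. The following Python sketch assumes a Linux node that exposes `/proc/meminfo`; the function names are hypothetical.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of integers.

    Most fields are reported in kB; a few (such as huge page counts)
    are plain numbers.
    """
    values = {}
    for line in text.splitlines():
        name, _, rest = line.partition(":")
        fields = rest.split()
        if fields and fields[0].isdigit():
            values[name.strip()] = int(fields[0])
    return values


def swap_used_kb(meminfo):
    """Swap currently in use, in kB."""
    return meminfo["SwapTotal"] - meminfo["SwapFree"]


def read_node_memory(path="/proc/meminfo"):
    """Read and parse the memory statistics of the current node."""
    with open(path) as handle:
        return parse_meminfo(handle.read())
```

Calling `swap_used_kb(read_node_memory())` periodically during a slow episode would show whether swap usage grows at the moment the Gflops drop.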
