VASP 6.3.0 ACC OMP memory leak
Posted: Tue May 03, 2022 1:36 am
by Dankomaister
Hi,
I have compiled the ACC+OMP version of VASP 6.3.0 with Intel MKL
Code:
makefile.include.nvhpc_ompi_mkl_omp_acc
Using these compilers/libraries:
CUDA/11.4.4
NVHPC/22.3
OpenMPI/4.1.3
imkl/2021.4.0
HDF5/1.12.1
It runs, but unfortunately there is a huge memory leak (on the host side), as seen in the attached picture.
Any ideas what could cause this? I have tried playing around with different versions of the compilers/libraries but can't seem to solve this :/ Perhaps there is a bug?
I attached the makefile.include and one of the test systems for which I noticed the memory leak.
/Daniel
Re: VASP 6.3.0 ACC OMP memory leak
Posted: Mon May 09, 2022 2:13 pm
by andreas.singraber
Hello!
Thanks for reporting this memory leak. I have already tried to reproduce it on our machines. Unfortunately, I was not able to run exactly the same job because we do not have enough GPUs available. Also, I cannot use the OpenMP parallelization together with OpenACC at the moment. To be able to test at least a similar job, I modified the INCAR file (used the standard IALGO, a smaller ENCUT, and disabled the vdW-DF functionals). Even with this strongly modified setup I did get a (smaller) memory leak, although it is of course not clear whether it has the same origin as in your case. However, the memory leak I observed has its origin in libnvf.so, which indicates that it resides not in our code but somewhere in the NVIDIA libraries. We are now getting in contact with NVIDIA for further support.
Can you please tell me which GPUs you were using? Did you observe a memory leak also without additional OpenMP parallelization? I am a bit confused about the memory graph you posted. It would be good if I could estimate the amount of leaked memory from the graph but I am not sure it really belongs to the output files you prepared. The start/end time does not match and the increase of memory in the graph lasts over 3 hours while the runtime in OUTCAR indicates about an hour of execution time. Is it just from another (longer) run with identical settings?
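For estimating the amount of leaked memory from logged samples rather than by reading it off a graph, a least-squares slope is one simple option. A minimal sketch, assuming the memory usage was recorded as (time, resident memory) pairs; the function name `leak_rate` and the units are illustrative and not part of VASP or any NVIDIA tool:

```python
from typing import Sequence


def leak_rate(times_s: Sequence[float], rss_mib: Sequence[float]) -> float:
    """Return the least-squares slope of memory usage over time (MiB per second).

    times_s and rss_mib are hypothetical inputs: sample times in seconds and
    the resident memory measured at those times in MiB.
    """
    n = len(times_s)
    if n != len(rss_mib) or n < 2:
        raise ValueError("need at least two (time, memory) samples of equal length")

    mean_t = sum(times_s) / n
    mean_m = sum(rss_mib) / n

    # Sum of squared deviations of the sample times from their mean.
    var_t = sum((t - mean_t) ** 2 for t in times_s)
    if var_t == 0:
        raise ValueError("all samples have the same timestamp")

    # Sum of products of time and memory deviations.
    cov = sum((t - mean_t) * (m - mean_m) for t, m in zip(times_s, rss_mib))
    return cov / var_t


if __name__ == "__main__":
    # Made-up samples for illustration only: one sample per minute.
    rate = leak_rate([0.0, 60.0, 120.0], [1000.0, 1060.0, 1120.0])
    print(f"{rate * 3600:.0f} MiB per hour")
```

A slope near zero over a long run suggests stable memory usage. A persistently positive slope is consistent with a leak, although a warm-up phase that later levels off can look similar on short runs.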
Thank you!
All the best,
Andreas Singraber
Re: VASP 6.3.0 ACC OMP memory leak
Posted: Mon May 09, 2022 3:37 pm
by andreas.singraber
Hello again!
It seems our attempt to reproduce the memory leak was not yet successful. The memory increase in the modified setup I described in my last post turned out to stabilize after a while, so we probably simplified your example too much. We will continue our efforts, stay tuned...
Best,
Andreas
Re: VASP 6.3.0 ACC OMP memory leak
Posted: Tue May 10, 2022 5:12 am
by Dankomaister
Hi Andreas!
Thanks a lot for having a look at this!!!
I can confirm that the graph in my previous post is from the same system with just a longer runtime, as you suggested.
I have also done more tests on my side using the same system but a simplified makefile.include (attached) which includes only the necessary changes to compile on our system, which is a small local HPC with 20 GPU nodes each with two Nvidia K80 GPUs. This is the system I'm trying to compile the VASP ACC version for. But I have also seen this memory leak on a different system with Nvidia V100 GPUs running a different (bigger) calculation. So my guess is that this is not related to our specific GPUs or my specific calculation.
However, I did find out that the memory leak is related to the use of OpenMP parallelization.
When setting OMP_NUM_THREADS=1 the memory usage is stable compared to OMP_NUM_THREADS=8 as seen in the figure.
This and other problems I have experienced with NVHPC+MKL+OpenMP parallelization make me think it's related to OpenMP.
For example linking MKL as follows
Code:
-L${MKLROOT}/lib/intel64 -lmkl_scalapack_lp64 -lmkl_intel_lp64 -lmkl_intel_thread -lmkl_core -lmkl_blacs_openmpi_lp64 -liomp5 -lpthread -lm -ldl
results in a segmentation fault. But linking with the following
Code:
-L${MKLROOT}/lib/intel64 -lmkl_scalapack_lp64 -lmkl_intel_lp64 -lmkl_pgi_thread -lmkl_core -lmkl_blacs_openmpi_lp64 -pgf90libs -mp -lpthread -lm -ldl
or
Code:
-Mmkl -L${MKLROOT}/lib/intel64 -lmkl_scalapack_lp64 -lmkl_blacs_openmpi_lp64
works fine (except for the memory leak).
It would be very nice if we can solve this problem since running without OpenMP parallelization is very detrimental.
Performance is around 50% higher using OMP_NUM_THREADS=8 compared to OMP_NUM_THREADS=1.
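The comparison between OMP_NUM_THREADS=1 and OMP_NUM_THREADS=8 can be made by sampling the host-side resident memory of one MPI rank at regular intervals. A minimal sketch for Linux, assuming the process ID of a running rank is known; `parse_vmrss_kib` and `sample_rss` are hypothetical helper names, and the only external interface used is the `VmRSS` field of `/proc/<pid>/status`:

```python
import re
import time
from pathlib import Path
from typing import List, Optional, Tuple

# Matches the resident-set-size line of /proc/<pid>/status, e.g. "VmRSS:\t  1234 kB".
_VMRSS = re.compile(r"^VmRSS:\s+(\d+)\s+kB$", re.MULTILINE)


def parse_vmrss_kib(status_text: str) -> Optional[int]:
    """Extract the VmRSS value (in kB as reported by the kernel), or None if absent."""
    match = _VMRSS.search(status_text)
    return int(match.group(1)) if match else None


def sample_rss(pid: int, interval_s: float = 10.0, count: int = 6) -> List[Tuple[float, int]]:
    """Collect up to `count` (elapsed seconds, VmRSS) samples for process `pid`."""
    samples: List[Tuple[float, int]] = []
    start = time.monotonic()
    for i in range(count):
        try:
            text = Path(f"/proc/{pid}/status").read_text()
        except OSError:
            break  # the process has exited or is not readable
        rss = parse_vmrss_kib(text)
        if rss is None:
            break  # no VmRSS field, e.g. a kernel thread or a zombie
        samples.append((time.monotonic() - start, rss))
        if i < count - 1:
            time.sleep(interval_s)
    return samples
```

Running this once per thread setting against the same calculation gives two series that can be compared directly. This only covers host memory; device memory would need a separate tool.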
/Daniel
Re: VASP 6.3.0 ACC OMP memory leak
Posted: Tue May 24, 2022 8:27 am
by Dankomaister
Any updates on fixing this?
/Daniel
Re: VASP 6.3.0 ACC OMP memory leak
Posted: Tue May 24, 2022 11:49 am
by andreas.singraber
Hello Daniel!
I am sorry, but there is not really anything we can say yet... we are waiting for NVIDIA to have a look at it. In the meantime I tried on 2 GPUs with 8 threads per rank. I could see some leakage but not as massive as you reported. The origin of this is again __fort_gmalloc_without_abort in libnvf.so, but I am not so sure about this reporting...
Attachment: memleak.png
Please stay tuned...
Best,
Andreas
Re: VASP 6.3.0 ACC OMP memory leak
Posted: Fri Aug 05, 2022 2:53 pm
by Dankomaister
No updates on this?
Just wanted to say we still have this problem affecting our users.
/Daniel
Re: VASP 6.3.0 ACC OMP memory leak
Posted: Thu Aug 18, 2022 9:36 am
by andreas.singraber
Dear Daniel!
Unfortunately, NVIDIA has not replied yet; I have asked them again for their assistance.
I completely understand your frustration and I would like to try some more things until we hear back from NVIDIA. However, I would need a smaller reproducer case so I can try it on our limited hardware. Did you in the meantime observe this behavior also with smaller system sizes (fewer atoms)?
Thank you!
All the best,
Andreas Singraber
Re: VASP 6.3.0 ACC OMP memory leak
Posted: Tue Aug 23, 2022 8:47 am
by andreas.singraber
Dear Daniel!
We have received an update from NVIDIA: they found a similar behavior on 8x A100 GPUs with the initial memory consumption of 4.9 GB rising up to 5.5 GB (4 threads/rank) or 5.9 GB (8 threads/rank). However, they are not sure that OpenMP alone is to be blamed because there seems to be a slight increase also with a single thread per rank. Maybe the additional threads are only magnifying the problem. They will continue to investigate and also switch to V100 GPUs.
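To put NVIDIA's numbers in relative terms, the growth over the initial 4.9 GB can be computed directly. A small sketch using only the figures quoted above; the function name `relative_growth` is illustrative:

```python
def relative_growth(initial_gb: float, final_gb: float) -> float:
    """Return the memory increase as a percentage of the initial consumption."""
    return (final_gb - initial_gb) / initial_gb * 100.0


if __name__ == "__main__":
    initial = 4.9  # GB, initial memory consumption reported by NVIDIA
    for threads, final in ((4, 5.5), (8, 5.9)):
        growth = relative_growth(initial, final)
        print(f"{threads} threads/rank: +{growth:.1f}%")
    # prints:
    # 4 threads/rank: +12.2%
    # 8 threads/rank: +20.4%
```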
Best,
Andreas
Re: VASP 6.3.0 ACC OMP memory leak
Posted: Fri Sep 02, 2022 12:37 am
by Dankomaister
Hi Andreas,
Great to hear that Nvidia is finally looking into this.
Perhaps OpenMP alone is not to blame. I also see that when setting the number of OpenMP threads to 1 there is still a memory leak.
Hope this can be resolved soon.
/Daniel
Re: VASP 6.3.0 ACC OMP memory leak
Posted: Wed Nov 23, 2022 11:23 am
by henrique_miranda
More users are reporting a similar issue:
https://www.vasp.at/forum/viewtopic.php?f=7&t=18700
https://www.vasp.at/forum/viewtopic.php?f=4&t=18736
https://www.vasp.at/forum/viewtopic.php?f=3&t=18739
We believe these issues might have to do with a memory leak when compiling VASP with OpenMP support using the NV compiler (see the recently added entry in https://www.vasp.at/wiki/index.php/Known_issues).
Maybe you can try compiling the code without OpenMP and check if the issue persists.
Re: VASP 6.3.0 ACC OMP memory leak
Posted: Tue Nov 29, 2022 2:48 am
by Dankomaister
Hi,
Yes, compiling without OpenMP, or just setting OMP_NUM_THREADS=1, gets rid of the memory leak as I showed above.
However, this is not really a solution since the whole point is to run with OpenMP so as not to be performance-limited by the non-GPU-accelerated parts of the code.
I really hope this can be resolved soon as running VASP on GPU without OpenMP is too expensive in terms of core-h cost compared to a CPU job due to the loss in performance without OpenMP. At least on our system.
Re: VASP 6.3.0 ACC OMP memory leak
Posted: Fri Dec 02, 2022 2:46 pm
by henrique_miranda
Yes, we will include a workaround for this compiler issue in the next release of VASP, which is planned for this year or the beginning of next year.
Re: VASP 6.3.0 ACC OMP memory leak
Posted: Mon Jan 16, 2023 6:07 am
by Dankomaister
Okay, so you mentioned that the next release of VASP would come out last year or at the beginning of this year.
Do you have a more detailed timeline for when the next release of VASP will be?
We really need this to be fixed so that we can run longer calculations using the GPU version.
/Daniel
Re: VASP 6.3.0 ACC OMP memory leak
Posted: Mon Jan 16, 2023 8:21 am
by andreas.singraber
Hello Daniel!
We plan to release by the end of this month, but I cannot give you an exact date because this depends on the progress of final testing steps.
Best,
Andreas Singraber