
Execution Issue of VASP 6.4.1 with OpenACC and GPU

Posted: Mon Aug 07, 2023 5:18 am
by guorong_weng
Dear VASP folks,
I have recently been trying to compile VASP 6.4.1 with OpenACC support on our local GPU machine (8 NVIDIA RTX A5000 GPUs).
The compilation succeeds with the following makefile.include, using NVIDIA HPC SDK 23.7 and FFTW3 installed in the directories indicated below.

Code:

# Default precompiler options
CPP_OPTIONS = -DHOST=\"LinuxNV\" \
              -DMPI -DMPI_INPLACE -DMPI_BLOCK=8000 -Duse_collective \
              -DscaLAPACK \
              -DCACHE_SIZE=4000 \
              -Davoidalloc \
              -Dvasp6 \
              -Duse_bse_te \
              -Dtbdyn \
              -Dqd_emulate \
              -Dfock_dblbuf \
              -D_OPENMP \
              -D_OPENACC \
              -DUSENCCL -DUSENCCLP2P

CPP         = nvfortran -Mpreprocess -Mfree -Mextend -E $(CPP_OPTIONS) $*$(FUFFIX)  > $*$(SUFFIX)

# N.B.: you might need to change the cuda-version here
#       to one that comes with your NVIDIA-HPC SDK
FC          = mpif90 -acc -gpu=cc60,cc70,cc80,cuda12.2 -mp
FCL         = mpif90 -acc -gpu=cc60,cc70,cc80,cuda12.2 -mp -c++libs

FREE        = -Mfree -Mx,231,0x1

FFLAGS      = -Mbackslash -Mlarge_arrays

OFLAG       = -fast

DEBUG       = -Mfree -O0 -traceback

OBJECTS     = fftmpiw.o fftmpi_map.o fftw3d.o fft3dlib.o

LLIBS       = -cudalib=cublas,cusolver,cufft,nccl -cuda

# Redefine the standard list of O1 and O2 objects
SOURCE_O1  := pade_fit.o minimax_dependence.o
SOURCE_O2  := pead.o

# For what used to be vasp.5.lib
CPP_LIB     = $(CPP)
FC_LIB      = nvfortran
CC_LIB      = nvc -w
CFLAGS_LIB  = -O
FFLAGS_LIB  = -O1 -Mfixed
FREE_LIB    = $(FREE)

OBJECTS_LIB = linpack_double.o

# For the parser library
CXX_PARS    = nvc++ --no_warnings

##
## Customize as of this point! Of course you may change the preceding
## part of this file as well if you like, but it should rarely be
## necessary ...
##
# When compiling on the target machine itself, change this to the
# relevant target when cross-compiling for another architecture
VASP_TARGET_CPU ?= -tp host
FFLAGS     += $(VASP_TARGET_CPU)

# Specify your NV HPC-SDK installation (mandatory)
#... first try to set it automatically
#NVROOT      =$(shell which nvfortran | awk -F /compilers/bin/nvfortran '{ print $$1 }')

# If the above fails, then NVROOT needs to be set manually
NVHPC_PATH  ?= /home/gwen/libraries/nvhpc_2023_237_Linux_x86_64_cuda_12.2/hpc_sdk
NVVERSION   = 23.7
NVROOT      = $(NVHPC_PATH)/Linux_x86_64/$(NVVERSION)

## Improves performance when using NV HPC-SDK >=21.11 and CUDA >11.2
#OFLAG_IN   = -fast -Mwarperf
#SOURCE_IN  := nonlr.o

# Software emulation of quadruple precision (mandatory)
QD         ?= $(NVROOT)/compilers/extras/qd
LLIBS      += -L$(QD)/lib -lqdmod -lqd
INCS       += -I$(QD)/include/qd

# BLAS (mandatory)
BLAS        = -L$(NVROOT)/compilers/lib -lblas

# LAPACK (mandatory)
LAPACK      = -L$(NVROOT)/compilers/lib -llapack

# scaLAPACK (mandatory)
SCALAPACK   = -L$(NVROOT)/comm_libs/mpi/lib -Mscalapack

LLIBS      += $(SCALAPACK) $(LAPACK) $(BLAS)

# FFTW (mandatory)
FFTW_ROOT  ?= /home/gwen/libraries/fftw-3.3.10/fftw
LLIBS      += -L$(FFTW_ROOT)/lib -lfftw3 -lfftw3_omp
INCS       += -I$(FFTW_ROOT)/include
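
For reference, with this makefile.include placed in the root of the VASP 6.4.1 source tree, the build itself follows the standard VASP make targets. A minimal sketch (the -j8 core count is only an example):

Code:

make veryclean
# DEPS=1 enables a parallel build; vasp_std ends up in bin/
make DEPS=1 -j8 std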
The "LD_LIBRARY_PATH" is exported as follows:

Code:

/home/gwen/libraries/nvhpc_2023_237_Linux_x86_64_cuda_12.2/hpc_sdk/Linux_x86_64/23.7/cuda/12.2/lib64:/home/gwen/libraries/nvhpc_2023_237_Linux_x86_64_cuda_12.2/hpc_sdk/Linux_x86_64/23.7/cuda/12.2/nvvm/lib64:/home/gwen/libraries/nvhpc_2023_237_Linux_x86_64_cuda_12.2/hpc_sdk/Linux_x86_64/23.7/cuda/12.2/extras/Debugger/lib64:/home/gwen/libraries/nvhpc_2023_237_Linux_x86_64_cuda_12.2/hpc_sdk/Linux_x86_64/23.7/cuda/12.2/extras/CUPTI/lib64:/home/gwen/libraries/nvhpc_2023_237_Linux_x86_64_cuda_12.2/hpc_sdk/Linux_x86_64/23.7/math_libs/12.2/lib64:/home/gwen/libraries/nvhpc_2023_237_Linux_x86_64_cuda_12.2/hpc_sdk/Linux_x86_64/23.7/compilers/lib:/home/gwen/libraries/nvhpc_2023_237_Linux_x86_64_cuda_12.2/hpc_sdk/Linux_x86_64/23.7/math_libs/lib64:/home/gwen/libraries/nvhpc_2023_237_Linux_x86_64_cuda_12.2/hpc_sdk/Linux_x86_64/23.7/comm_libs/12.2/nccl/lib:/home/gwen/libraries/nvhpc_2023_237_Linux_x86_64_cuda_12.2/hpc_sdk/Linux_x86_64/23.7/comm_libs/mpi/lib:/home/gwen/libraries/nvhpc_2023_237_Linux_x86_64_cuda_12.2/hpc_sdk/Linux_x86_64/23.7/compilers/extras/qd/lib:/home/gwen/libraries/hdf5-1.14.1-2/hdf5_install/lib:/home/gwen/libraries/libfabric-main/fabric_install/lib
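
For readability, the same list can be written against a single SDK prefix instead of repeating the full install path; a sketch, assuming the prefix shown above:

Code:

NVROOT=/home/gwen/libraries/nvhpc_2023_237_Linux_x86_64_cuda_12.2/hpc_sdk/Linux_x86_64/23.7
export LD_LIBRARY_PATH=$NVROOT/cuda/12.2/lib64:$NVROOT/math_libs/12.2/lib64:$NVROOT/compilers/lib:$NVROOT/comm_libs/12.2/nccl/lib:$NVROOT/comm_libs/mpi/lib:$NVROOT/compilers/extras/qd/lib:$LD_LIBRARY_PATH
# plus the HDF5 and libfabric install directories listed above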
And the "ldd vasp_std" reads as follows:

Code:

linux-vdso.so.1 (0x00007ffec7542000)
	libqdmod.so.0 => /home/gwen/libraries/nvhpc_2023_237_Linux_x86_64_cuda_12.2/hpc_sdk/Linux_x86_64/23.7/compilers/extras/qd/lib/libqdmod.so.0 (0x00007fb326400000)
	libqd.so.0 => /home/gwen/libraries/nvhpc_2023_237_Linux_x86_64_cuda_12.2/hpc_sdk/Linux_x86_64/23.7/compilers/extras/qd/lib/libqd.so.0 (0x00007fb326000000)
	liblapack_lp64.so.0 => /home/gwen/libraries/nvhpc_2023_237_Linux_x86_64_cuda_12.2/hpc_sdk/Linux_x86_64/23.7/compilers/lib/liblapack_lp64.so.0 (0x00007fb325589000)
	libblas_lp64.so.0 => /home/gwen/libraries/nvhpc_2023_237_Linux_x86_64_cuda_12.2/hpc_sdk/Linux_x86_64/23.7/compilers/lib/libblas_lp64.so.0 (0x00007fb323738000)
	libmpi_usempif08.so.40 => /home/gwen/libraries/nvhpc_2023_237_Linux_x86_64_cuda_12.2/hpc_sdk/Linux_x86_64/23.7/comm_libs/mpi/lib/libmpi_usempif08.so.40 (0x00007fb323400000)
	libmpi_usempi_ignore_tkr.so.40 => /home/gwen/libraries/nvhpc_2023_237_Linux_x86_64_cuda_12.2/hpc_sdk/Linux_x86_64/23.7/comm_libs/mpi/lib/libmpi_usempi_ignore_tkr.so.40 (0x00007fb323000000)
	libmpi_mpifh.so.40 => /home/gwen/libraries/nvhpc_2023_237_Linux_x86_64_cuda_12.2/hpc_sdk/Linux_x86_64/23.7/comm_libs/mpi/lib/libmpi_mpifh.so.40 (0x00007fb322c00000)
	libmpi.so.40 => /home/gwen/libraries/nvhpc_2023_237_Linux_x86_64_cuda_12.2/hpc_sdk/Linux_x86_64/23.7/comm_libs/mpi/lib/libmpi.so.40 (0x00007fb322600000)
	libscalapack_lp64.so.2 => /home/gwen/libraries/nvhpc_2023_237_Linux_x86_64_cuda_12.2/hpc_sdk/Linux_x86_64/23.7/comm_libs/mpi/lib/libscalapack_lp64.so.2 (0x00007fb321f82000)
	libnvhpcwrapcufft.so => /home/gwen/libraries/nvhpc_2023_237_Linux_x86_64_cuda_12.2/hpc_sdk/Linux_x86_64/23.7/compilers/lib/libnvhpcwrapcufft.so (0x00007fb321c00000)
	libcufft.so.11 => /home/gwen/libraries/nvhpc_2023_237_Linux_x86_64_cuda_12.2/hpc_sdk/Linux_x86_64/23.7/math_libs/12.2/lib64/libcufft.so.11 (0x00007fb316e00000)
	libcusolver.so.11 => /home/gwen/libraries/nvhpc_2023_237_Linux_x86_64_cuda_12.2/hpc_sdk/Linux_x86_64/23.7/math_libs/12.2/lib64/libcusolver.so.11 (0x00007fb30fc00000)
	libcudaforwrapnccl.so => /home/gwen/libraries/nvhpc_2023_237_Linux_x86_64_cuda_12.2/hpc_sdk/Linux_x86_64/23.7/compilers/lib/libcudaforwrapnccl.so (0x00007fb30f800000)
	libnccl.so.2 => /home/gwen/libraries/nvhpc_2023_237_Linux_x86_64_cuda_12.2/hpc_sdk/Linux_x86_64/23.7/comm_libs/12.2/nccl/lib/libnccl.so.2 (0x00007fb2fe800000)
	libcublas.so.12 => /home/gwen/libraries/nvhpc_2023_237_Linux_x86_64_cuda_12.2/hpc_sdk/Linux_x86_64/23.7/math_libs/12.2/lib64/libcublas.so.12 (0x00007fb2f7e00000)
	libcublasLt.so.12 => /home/gwen/libraries/nvhpc_2023_237_Linux_x86_64_cuda_12.2/hpc_sdk/Linux_x86_64/23.7/math_libs/12.2/lib64/libcublasLt.so.12 (0x00007fb2d5e00000)
	libcudaforwrapblas.so => /home/gwen/libraries/nvhpc_2023_237_Linux_x86_64_cuda_12.2/hpc_sdk/Linux_x86_64/23.7/compilers/lib/libcudaforwrapblas.so (0x00007fb2d5a00000)
	libcudaforwrapblas117.so => /home/gwen/libraries/nvhpc_2023_237_Linux_x86_64_cuda_12.2/hpc_sdk/Linux_x86_64/23.7/compilers/lib/libcudaforwrapblas117.so (0x00007fb2d5600000)
	libcudart.so.12 => /home/gwen/libraries/nvhpc_2023_237_Linux_x86_64_cuda_12.2/hpc_sdk/Linux_x86_64/23.7/cuda/12.2/lib64/libcudart.so.12 (0x00007fb2d5200000)
	libcudafor_120.so => /home/gwen/libraries/nvhpc_2023_237_Linux_x86_64_cuda_12.2/hpc_sdk/Linux_x86_64/23.7/compilers/lib/libcudafor_120.so (0x00007fb2cf200000)
	libcudafor.so => /home/gwen/libraries/nvhpc_2023_237_Linux_x86_64_cuda_12.2/hpc_sdk/Linux_x86_64/23.7/compilers/lib/libcudafor.so (0x00007fb2cee00000)
	libacchost.so => /home/gwen/libraries/nvhpc_2023_237_Linux_x86_64_cuda_12.2/hpc_sdk/Linux_x86_64/23.7/compilers/lib/libacchost.so (0x00007fb2cea00000)
	libaccdevaux.so => /home/gwen/libraries/nvhpc_2023_237_Linux_x86_64_cuda_12.2/hpc_sdk/Linux_x86_64/23.7/compilers/lib/libaccdevaux.so (0x00007fb2ce600000)
	libacccuda.so => /home/gwen/libraries/nvhpc_2023_237_Linux_x86_64_cuda_12.2/hpc_sdk/Linux_x86_64/23.7/compilers/lib/libacccuda.so (0x00007fb2ce200000)
	libcudadevice.so => /home/gwen/libraries/nvhpc_2023_237_Linux_x86_64_cuda_12.2/hpc_sdk/Linux_x86_64/23.7/compilers/lib/libcudadevice.so (0x00007fb2cde00000)
	libcudafor2.so => /home/gwen/libraries/nvhpc_2023_237_Linux_x86_64_cuda_12.2/hpc_sdk/Linux_x86_64/23.7/compilers/lib/libcudafor2.so (0x00007fb2cda00000)
	libnvf.so => /home/gwen/libraries/nvhpc_2023_237_Linux_x86_64_cuda_12.2/hpc_sdk/Linux_x86_64/23.7/compilers/lib/libnvf.so (0x00007fb2cd200000)
	libnvhpcatm.so => /home/gwen/libraries/nvhpc_2023_237_Linux_x86_64_cuda_12.2/hpc_sdk/Linux_x86_64/23.7/compilers/lib/libnvhpcatm.so (0x00007fb2cce00000)
	libstdc++.so.6 => /lib/x86_64-linux-gnu/libstdc++.so.6 (0x00007fb2ccbd4000)
	libnvomp.so => /home/gwen/libraries/nvhpc_2023_237_Linux_x86_64_cuda_12.2/hpc_sdk/Linux_x86_64/23.7/compilers/lib/libnvomp.so (0x00007fb2cba00000)
	libnvcpumath.so => /home/gwen/libraries/nvhpc_2023_237_Linux_x86_64_cuda_12.2/hpc_sdk/Linux_x86_64/23.7/compilers/lib/libnvcpumath.so (0x00007fb2cb400000)
	libnvc.so => /home/gwen/libraries/nvhpc_2023_237_Linux_x86_64_cuda_12.2/hpc_sdk/Linux_x86_64/23.7/compilers/lib/libnvc.so (0x00007fb2cb000000)
	libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007fb2cadd8000)
	libgcc_s.so.1 => /lib/x86_64-linux-gnu/libgcc_s.so.1 (0x00007fb32663f000)
	libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00007fb326319000)
	libatomic.so.1 => /lib/x86_64-linux-gnu/libatomic.so.1 (0x00007fb326635000)
	libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00007fb326630000)
	libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x00007fb326629000)
	libopen-rte.so.40 => /home/gwen/libraries/nvhpc_2023_237_Linux_x86_64_cuda_12.2/hpc_sdk/Linux_x86_64/23.7/comm_libs/mpi/lib/libopen-rte.so.40 (0x00007fb2caa00000)
	libopen-pal.so.40 => /home/gwen/libraries/nvhpc_2023_237_Linux_x86_64_cuda_12.2/hpc_sdk/Linux_x86_64/23.7/comm_libs/mpi/lib/libopen-pal.so.40 (0x00007fb2ca400000)
	librdmacm.so.1 => /lib/x86_64-linux-gnu/librdmacm.so.1 (0x00007fb3262fa000)
	libibverbs.so.1 => /lib/x86_64-linux-gnu/libibverbs.so.1 (0x00007fb3262d7000)
	libnuma.so.1 => /home/gwen/libraries/nvhpc_2023_237_Linux_x86_64_cuda_12.2/hpc_sdk/Linux_x86_64/23.7/comm_libs/mpi/lib/libnuma.so.1 (0x00007fb2ca000000)
	libutil.so.1 => /lib/x86_64-linux-gnu/libutil.so.1 (0x00007fb326622000)
	libz.so.1 => /lib/x86_64-linux-gnu/libz.so.1 (0x00007fb3262bb000)
	librt.so.1 => /lib/x86_64-linux-gnu/librt.so.1 (0x00007fb3262b6000)
	/lib64/ld-linux-x86-64.so.2 (0x00007fb326683000)
	libnvJitLink.so.12 => /home/gwen/libraries/nvhpc_2023_237_Linux_x86_64_cuda_12.2/hpc_sdk/Linux_x86_64/23.7/cuda/12.2/lib64/libnvJitLink.so.12 (0x00007fb2c6c00000)
	libcusparse.so.12 => /home/gwen/libraries/nvhpc_2023_237_Linux_x86_64_cuda_12.2/hpc_sdk/Linux_x86_64/23.7/math_libs/12.2/lib64/libcusparse.so.12 (0x00007fb2b6e00000)
	libnl-3.so.200 => /lib/x86_64-linux-gnu/libnl-3.so.200 (0x00007fb326291000)
	libnl-route-3.so.200 => /lib/x86_64-linux-gnu/libnl-route-3.so.200 (0x00007fb3236b5000)
After installation, I export the following environment variables:

Code:

export CUDA_VISIBLE_DEVICES=0,1,2,3
export OMP_NUM_THREADS=1
export OMP_PLACES=cores
export OMP_PROC_BIND=close
export OMP_STACKSIZE=512m
 
and then launch "make test".

Immediately, the output shows the following error repeated in each test folder:

Code:

[[7276,1],0]: A high-performance Open MPI point-to-point messaging module
was unable to find any relevant network interfaces:

Module: OpenFabrics (openib)
  Host: lambda-scalar

Another transport will be used instead, although this may result in
lower performance.

NOTE: You can disable this warning by setting the MCA parameter
btl_base_warn_component_unused to 0.
--------------------------------------------------------------------------
 running    4 mpi-ranks, with    1 threads/rank, on    1 nodes
 distrk:  each k-point on    2 cores,    2 groups
 distr:  one band on    1 cores,    2 groups
 OpenACC runtime initialized ...    4 GPUs detected
 -----------------------------------------------------------------------------
|                     _     ____    _    _    _____     _                     |
|                    | |   |  _ \  | |  | |  / ____|   | |                    |
|                    | |   | |_) | | |  | | | |  __    | |                    |
|                    |_|   |  _ <  | |  | | | | |_ |   |_|                    |
|                     _    | |_) | | |__| | | |__| |    _                     |
|                    (_)   |____/   \____/   \_____|   (_)                    |
|                                                                             |
|     internal error in: mpi.F  at line: 898                                  |
|                                                                             |
|     M_init_nccl: Error in ncclCommInitRank                                  |
|                                                                             |
|     If you are not a developer, you should not encounter this problem.      |
|     Please submit a bug report.                                             |
|                                                                             |
 -----------------------------------------------------------------------------

 -----------------------------------------------------------------------------
|                     _     ____    _    _    _____     _                     |
|                    | |   |  _ \  | |  | |  / ____|   | |                    |
|                    | |   | |_) | | |  | | | |  __    | |                    |
|                    |_|   |  _ <  | |  | | | | |_ |   |_|                    |
|                     _    | |_) | | |__| | | |__| |    _                     |
|                    (_)   |____/   \____/   \_____|   (_)                    |
|                                                                             |
|     internal error in: mpi.F  at line: 898                                  |
|                                                                             |
|     M_init_nccl: Error in ncclCommInitRank                                  |
|                                                                             |
|     If you are not a developer, you should not encounter this problem.      |
|     Please submit a bug report.                                             |
|                                                                             |
 -----------------------------------------------------------------------------

[lambda-scalar:1552414] 3 more processes have sent help message help-mpi-btl-base.txt / btl:no-nics
[lambda-scalar:1552414] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
As a beginner with VASP, I have no clue how to resolve this error. I hope someone can help me out here. Thanks a lot.

Gwen

Re: Execution Issue of VASP 6.4.1 with OpenACC and GPU

Posted: Tue Aug 08, 2023 12:03 pm
by alexey.tal
Dear Gwen,
The message
[[7276,1],0]: A high-performance Open MPI point-to-point messaging module
was unable to find any relevant network interfaces:
indicates that the issue is in the communication network.
Does this calculation run on a single GPU?

Re: Execution Issue of VASP 6.4.1 with OpenACC and GPU

Posted: Tue Aug 08, 2023 10:51 pm
by guorong_weng
alexey.tal wrote: Tue Aug 08, 2023 12:03 pm Dear Gwen,
The message
[[7276,1],0]: A high-performance Open MPI point-to-point messaging module
was unable to find any relevant network interfaces:
indicates that the issue is in the communication network.
Does this calculation run on a single GPU?
Hi Alexey. By default, four GPUs are used for testing in the VASP package, so I set CUDA_VISIBLE_DEVICES to expose four GPUs.

Re: Execution Issue of VASP 6.4.1 with OpenACC and GPU

Posted: Wed Aug 09, 2023 10:21 am
by alexey.tal
You can run the tests with a single MPI rank by changing the number of ranks in testsuite/fast.conf, and then execute the tests by running the following command:
./runtest --fast fast.conf

Were you able to run the tests without GPUs?
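
A minimal sketch of that single-rank, single-GPU test run, assuming the tests are launched from the testsuite directory of the VASP source tree (the exact name of the rank setting inside fast.conf may differ):

Code:

cd testsuite
# expose a single GPU, matching the single MPI rank
export CUDA_VISIBLE_DEVICES=0
# after reducing the number of ranks in fast.conf to 1:
./runtest --fast fast.conf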

Re: Execution Issue of VASP 6.4.1 with OpenACC and GPU

Posted: Thu Aug 10, 2023 8:40 pm
by guorong_weng
alexey.tal wrote: Wed Aug 09, 2023 10:21 am You can run the tests with a single MPI rank by changing the number of ranks in testsuite/fast.conf, and then execute the tests by running the following command:
./runtest --fast fast.conf

Were you able to run the tests without GPUs?
Hi Alexey. The crash has been resolved by using all four libraries from Intel oneAPI instead. However, the following warning still persists:

Code:

[[7276,1],0]: A high-performance Open MPI point-to-point messaging module
was unable to find any relevant network interfaces:

Module: OpenFabrics (openib)
  Host: lambda-scalar

Another transport will be used instead, although this may result in
lower performance.

NOTE: You can disable this warning by setting the MCA parameter
btl_base_warn_component_unused to 0.
--------------------------------------------------------------------------
So far I can only suppress it by setting btl_base_warn_component_unused to 0, but I am worried about getting lower performance. Is there any way to resolve this properly? Thanks.
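
For reference, with Open MPI the parameter named in the warning can be set either on the mpirun command line or through the environment; a sketch for the 4-rank run used here:

Code:

# on the command line
mpirun -np 4 --mca btl_base_warn_component_unused 0 vasp_std
# or via the environment before launching
export OMPI_MCA_btl_base_warn_component_unused=0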

Re: Execution Issue of VASP 6.4.1 with OpenACC and GPU

Posted: Fri Aug 11, 2023 8:13 am
by alexey.tal
Do you have an InfiniBand connection? If not, you can manually choose the shared-memory communication fabric:

Code:

mpirun -np 4 -genv I_MPI_FABRICS=shm vasp_std 

Re: Execution Issue of VASP 6.4.1 with OpenACC and GPU

Posted: Fri Aug 11, 2023 6:01 pm
by guorong_weng
alexey.tal wrote: Fri Aug 11, 2023 8:13 am Do you have an InfiniBand connection? If not, you can manually choose the shared-memory communication fabric:

Code:

mpirun -np 4 -genv I_MPI_FABRICS=shm vasp_std
I believe this will finally resolve my problem.
Since I am not using Intel MPI but Open MPI from the NVIDIA HPC SDK, I am using the following command:

Code:

mpirun -np 4 --mca btl [fabric options] vasp_std
I am wondering which of the fabric options listed below works best for VASP:

Code:

MCA btl: self (MCA v2.1.0, API v3.0.0, Component v3.1.5)
MCA btl: openib (MCA v2.1.0, API v3.0.0, Component v3.1.5)
MCA btl: smcuda (MCA v2.1.0, API v3.0.0, Component v3.1.5)
MCA btl: tcp (MCA v2.1.0, API v3.0.0, Component v3.1.5)
MCA btl: vader (MCA v2.1.0, API v3.0.0, Component v3.1.5)
MCA fbtl: posix (MCA v2.1.0, API v2.0.0, Component v3.1.5)

Re: Execution Issue of VASP 6.4.1 with OpenACC and GPU

Posted: Mon Aug 14, 2023 8:37 am
by alexey.tal
As far as I understand, you are running this job on a single node with multiple GPUs and you do not use any inter-node communication, so you do not need openib or tcp; instead, you should specify one of the shared-memory options. I think you should be able to get the best performance with --mca btl self,vader. I do not know whether smcuda might give some advantage, so you may want to try that option as well.
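
Concretely, a sketch of the suggested launch line for the 4-rank, single-node case discussed above:

Code:

# shared-memory transports only (recommended starting point)
mpirun -np 4 --mca btl self,vader vasp_std
# variant using the CUDA-aware shared-memory component
mpirun -np 4 --mca btl self,smcuda vasp_std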