nvptx

On the hardware side, there is the hierarchy (fine to coarse):

  • thread

  • warp

  • thread block

  • streaming multiprocessor

All OpenMP and OpenACC levels are used, i.e.

  • OpenMP’s simd and OpenACC’s vector map to threads

  • OpenMP’s threads (‘parallel’) and OpenACC’s workers map to warps

  • OpenMP’s teams and OpenACC’s gangs use a thread pool whose size
    equals the number of teams or gangs, respectively.
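
For illustration, here is a minimal sketch (not taken from the manual; the
function name and clauses are illustrative) of how the levels of a combined
OpenMP construct line up with the hardware hierarchy described above:

    /* Each level of the combined construct maps to one hardware level:
       teams -> thread pool/gangs, parallel -> warps, simd -> threads.  */
    void
    saxpy (int n, float a, float *restrict x, float *restrict y)
    {
      #pragma omp target teams distribute parallel for simd \
                  map(to: x[0:n]) map(tofrom: y[0:n])
      for (int i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
    }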

The sizes used are:

  • The warp_size is always 32

  • CUDA kernel launched: dim={#teams,1,1}, blocks={#threads,warp_size,1}.
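
Putting the two together, a sketch (the clause values are illustrative) of
how explicit clauses determine the launch configuration under the scheme
above:

    /* With num_teams(4) and thread_limit(8), the scheme above should
       yield a CUDA launch of dim={4,1,1} and blocks={8,32,1}
       (warp_size = 32).  */
    void
    launch_example (void)
    {
      #pragma omp target teams num_teams(4) thread_limit(8)
      #pragma omp parallel
      {
        /* ... offloaded work ... */
      }
    }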

Additional information can be obtained by setting the environment variable GOMP_DEBUG=1 (very verbose; grep for ‘kernel.*launch’ to find the launch parameters).
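
For example, assuming the offloading program has been built as a.out (the
name is illustrative), the launch parameters can be extracted with:

    $ GOMP_DEBUG=1 ./a.out 2>&1 | grep 'kernel.*launch'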

GCC generates generic PTX ISA code, which is just-in-time compiled by CUDA; CUDA caches the JIT result in the user’s cache directory (see the CUDA documentation; this can be tuned via the environment variables CUDA_CACHE_{DISABLE,MAXSIZE,PATH}).

Note: While the generated PTX ISA code is generic, the -mptx= and -march= command-line options still affect the PTX ISA code used and, thus, the requirements on the CUDA version and hardware.
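
For instance, one might pass these options to the nvptx offload compiler as
follows (the chosen values sm_70 and 7.0 are illustrative; the accepted
values depend on the GCC version):

    $ gcc -fopenmp -foffload=nvptx-none \
          -foffload-options=nvptx-none='-march=sm_70 -mptx=7.0' prog.c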

Implementation remarks:

  • I/O within OpenMP target regions and OpenACC parallel/kernels
    regions is supported using the C library printf functions; the
    Fortran print/write statements are not yet supported.  (See the
    first example after this list.)

  • Compiling OpenMP code that contains ‘requires reverse_offload’
    requires at least -march=sm_35; compiling for -march=sm_30 is not
    supported.  (See the second example after this list.)
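
As an illustration of the first remark, a minimal sketch of device-side
I/O via the C library printf (the program text is illustrative, not from
the manual):

    #include <stdio.h>

    int
    main (void)
    {
      /* printf is supported inside target regions; the Fortran
         print/write statements are not yet.  */
      #pragma omp target
      printf ("Hello from the device\n");
      return 0;
    }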
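
As an illustration of the second remark, a sketch of code that contains
‘requires reverse_offload’ and therefore needs at least -march=sm_35 for
the nvptx offload compiler (the program text is illustrative; reverse
offload is exercised through OpenMP 5.0’s device(ancestor: 1) modifier):

    #include <stdio.h>

    #pragma omp requires reverse_offload

    int
    main (void)
    {
      #pragma omp target
      {
        /* Execute this nested region back on the host.  */
        #pragma omp target device(ancestor: 1)
        printf ("Running on the host at the device's request\n");
      }
      return 0;
    }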