nvptx
On the hardware side, there is the hierarchy (fine to coarse):
- thread
- warp
- thread block
- streaming multiprocessor
All OpenMP and OpenACC levels are used, i.e.
- OpenMP’s simd and OpenACC’s vector map to threads
- OpenMP’s threads (‘parallel’) and OpenACC’s workers map to warps
- OpenMP’s teams and OpenACC’s gang use a threadpool with the size of the
  number of teams or gangs, respectively.
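As an illustration of how these levels nest, here is a minimal C sketch using
only standard OpenMP directives (the function and variable names are made up
for the example):

    /* Per the mapping above: teams become thread blocks, the 'parallel'
       threads become warps, and simd lanes become the threads of a warp.  */
    void
    saxpy (int n, float a, const float *x, float *y)
    {
      #pragma omp target teams distribute parallel for simd \
                  map(to: x[0:n]) map(tofrom: y[0:n])
      for (int i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
    }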
The used sizes are:
- The warp_size is always 32.
- CUDA kernel launched: dim={#teams,1,1}, blocks={#threads,warp_size,1}.
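For example, the following sketch requests explicit sizes; under the launch
scheme above this would correspond to dim={4,1,1} and blocks={8,32,1},
although the runtime may choose other values:

    void
    sized_region (void)
    {
      /* 4 teams -> dim={4,1,1}; 8 OpenMP threads per team -> the first
         entry of blocks; warp_size (32) is the second entry.  */
      #pragma omp target teams num_teams(4)
      #pragma omp parallel num_threads(8)
      { /* ... work ... */ }
    }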
Additional information can be obtained by setting the environment variable
GOMP_DEBUG=1 (very verbose; grep for kernel.*launch for the launch
parameters).
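For instance, assuming the offloading program is ./a.out (the name is just an
example), the launch parameters can be filtered out like this; the debug
output is written to standard error:

    GOMP_DEBUG=1 ./a.out 2>&1 | grep 'kernel.*launch'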
GCC generates generic PTX ISA code, which is just-in-time compiled by CUDA;
CUDA caches the JIT result in the user’s directory (see the CUDA
documentation; this can be tuned by the environment variables
CUDA_CACHE_{DISABLE,MAXSIZE,PATH}).
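For example (the path and size are illustrative only), the cache can be
disabled for a single run, or moved and resized:

    CUDA_CACHE_DISABLE=1 ./a.out
    CUDA_CACHE_PATH=/tmp/ptx-cache CUDA_CACHE_MAXSIZE=1073741824 ./a.out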
Note: While PTX ISA is generic, the -mptx= and -march= command-line options
still affect the generated PTX ISA code and, thus, the requirements on the
CUDA version and hardware.
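As a sketch, such options can be passed to the offload compiler via
-foffload-options; the values chosen here are examples only and must match
what the installed GCC and CUDA support:

    gcc -fopenmp -foffload=nvptx-none \
        -foffload-options=nvptx-none=-march=sm_70 \
        -foffload-options=nvptx-none=-mptx=6.3 test.c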
Implementation remarks:
- I/O within OpenMP target regions and OpenACC parallel/kernels is supported
  using the C library printf functions; see the sketch after this list. Note
  that the Fortran print/write statements are not supported, yet.
- Compiling OpenMP code that contains requires reverse_offload requires at
  least -march=sm_35; compiling for -march=sm_30 is not supported.
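To illustrate the first remark, here is a minimal sketch of device-side
output using C printf, compiled with gcc -fopenmp (nothing GCC-specific is
assumed):

    #include <stdio.h>

    int
    main (void)
    {
      /* printf is usable inside the target region; Fortran print/write
         statements would not be (see the remark above).  */
      #pragma omp target
        printf ("Hello from the device\n");
      return 0;
    }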