AMD Radeon (GCN)

On the hardware side, there is the following hierarchy (fine to coarse):

work item (thread)
wavefront
workgroup
compute unit (CU)

All OpenMP and OpenACC levels are used, i.e.

OpenMP’s simd and OpenACC’s vector map to work items (threads)
OpenMP’s threads (‘parallel’) and OpenACC’s workers map
to wavefronts
OpenMP’s teams and OpenACC’s gangs use a threadpool with the
size of the number of teams or gangs, respectively.
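
As an illustration of the mapping above, the following is a minimal sketch of a loop that uses all three OpenMP levels in a single combined construct. The function name saxpy_target and its parameters are hypothetical and chosen only for this example. When compiled without offloading support, the loop simply runs on the host.

```c
#include <stddef.h>

/* Computes y[i] = a * x[i] + y[i] using all three OpenMP levels.
   Per the mapping described above, on a GCN device:
     teams    -> entries of the team threadpool
     parallel -> wavefronts
     simd     -> work items
   saxpy_target is a hypothetical name used only for this sketch.  */
void saxpy_target(size_t n, float a, const float *x, float *y)
{
  #pragma omp target teams distribute parallel for simd \
          map(to: x[0:n]) map(tofrom: y[0:n])
  for (size_t i = 0; i < n; i++)
    y[i] = a * x[i] + y[i];
}
```

The OpenACC counterpart would be a loop annotated with gang, worker and vector clauses inside a parallel region.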

The sizes used are:

Number of teams is the specified num_teams (OpenMP) or
num_gangs (OpenACC), or otherwise the number of CUs
Number of wavefronts is 4 for gfx900 and 16 otherwise;
num_threads (OpenMP) and num_workers (OpenACC) override this if smaller.
The wavefront has 102 scalars and 64 vectors
Number of work items is always 64
The hardware permits at most 40 workgroups per CU and
16 wavefronts per workgroup, up to a limit of 40 wavefronts in total per CU.
80 scalar registers and 24 vector registers in non-kernel functions
(the chosen procedure-calling API).
For the kernel itself: as many as register pressure demands (number of
teams and number of threads, scaled down if registers are exhausted)
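
The default sizing rules listed above can be restated as a small helper. This is only a sketch of the rules as written here, not the libgomp plugin code: the struct gcn_launch_dims and the function gcn_default_dims are hypothetical names, a value of 0 is assumed to mean "clause not specified", and the register-based scaling of the kernel is not modeled.

```c
#include <string.h>

/* Hypothetical container for the launch geometry; not a libgomp type.  */
struct gcn_launch_dims {
  int teams;       /* number of teams or gangs */
  int wavefronts;  /* wavefronts (OpenMP threads / OpenACC workers) */
  int work_items;  /* work items per wavefront */
};

/* arch:              device name such as "gfx900"
   num_cus:           number of compute units on the device
   requested_teams:   num_teams / num_gangs, or 0 if not specified
   requested_threads: num_threads / num_workers, or 0 if not specified  */
struct gcn_launch_dims
gcn_default_dims(const char *arch, int num_cus,
                 int requested_teams, int requested_threads)
{
  struct gcn_launch_dims d;

  /* Teams: the requested value, otherwise one per CU.  */
  d.teams = requested_teams > 0 ? requested_teams : num_cus;

  /* Wavefronts: 4 on gfx900 and 16 otherwise; a smaller request wins.  */
  d.wavefronts = strcmp(arch, "gfx900") == 0 ? 4 : 16;
  if (requested_threads > 0 && requested_threads < d.wavefronts)
    d.wavefronts = requested_threads;

  /* Work items: always 64.  */
  d.work_items = 64;
  return d;
}
```

For example, under these rules a call such as gcn_default_dims("gfx900", 64, 0, 32) keeps the default of 4 wavefronts, because the request of 32 is not smaller than the default.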

Implementation remark:

I/O within OpenMP target regions and OpenACC parallel/kernels is supported
using the C library printf functions and the Fortran print / write statements.
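
A minimal sketch of such I/O from C, assuming a compiler built with offloading support: the function name print_from_target is hypothetical. Without an available device, the region falls back to the host and prints from there.

```c
#include <stdio.h>

/* Prints one line per iteration from inside an OpenMP target region
   and returns the number of iterations executed.
   print_from_target is a hypothetical name used only for this sketch.  */
int print_from_target(int n)
{
  int count = 0;

  #pragma omp target map(tofrom: count)
  for (int i = 0; i < n; i++) {
    printf("iteration %d\n", i);
    count++;
  }
  return count;
}
```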