AMD Radeon (GCN)
On the hardware side, there is the hierarchy (fine to coarse):
- work item (thread)
- wavefront
- work group
- compute unit (CU)
All OpenMP and OpenACC levels are used, i.e.
- OpenMP’s simd and OpenACC’s vector map to work items (thread)
- OpenMP’s threads (‘parallel’) and OpenACC’s workers map to wavefronts
- OpenMP’s teams and OpenACC’s gangs use a thread pool with the size of the number of teams or gangs, respectively.
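As an illustration of how the three levels appear in source code, the following minimal sketch uses a combined OpenMP construct. The function name `saxpy` and its parameters are hypothetical and chosen only for this example; the comments restate the mapping listed above rather than describe any guaranteed scheduling.

```c
#include <stddef.h>

/* Computes y[i] = a * x[i] + y[i].
   Hypothetical example function; when built without OpenMP offloading
   support the pragma is ignored and the loop runs on the host.  */
void saxpy(size_t n, float a, const float *x, float *y)
{
  /* teams    -> thread pool sized by the number of teams
     parallel -> wavefronts
     simd     -> work items of a wavefront  */
  #pragma omp target teams distribute parallel for simd \
          map(to: x[0:n]) map(tofrom: y[0:n])
  for (size_t i = 0; i < n; i++)
    y[i] = a * x[i] + y[i];
}
```

An OpenACC version would express the same levels with the gang, worker and vector clauses on a parallel loop.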
The sizes used are:
- Number of teams is the specified num_teams (OpenMP) or num_gangs (OpenACC) or otherwise the number of CU
- Number of wavefronts is 4 for gfx900 and 16 otherwise; num_threads (OpenMP) and num_workers (OpenACC) override this if smaller.
- The wavefront has 102 scalars and 64 vectors
- Number of work items is always 64
- The hardware permits at most 40 work groups per CU and 16 wavefronts per work group, up to a limit of 40 wavefronts in total per CU.
- 80 scalar registers and 24 vector registers in non-kernel functions
(the chosen procedure-calling API).
- For the kernel itself: as many as register pressure demands (number of
teams and number of threads, scaled down if registers are exhausted)
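The sizing rules listed above can be restated as a small model. This is only a sketch: the function names are hypothetical and are not part of any runtime API, a request of 0 stands for "clause not specified", and the model ignores register pressure and any further caps the runtime may apply.

```c
/* Work items per wavefront; fixed according to the list above.  */
enum { GCN_WORKITEMS_PER_WAVEFRONT = 64 };

/* Number of teams: the requested num_teams/num_gangs if given (> 0),
   otherwise the number of compute units.  Hypothetical helper.  */
unsigned gcn_default_teams(unsigned requested, unsigned num_cu)
{
  return requested > 0 ? requested : num_cu;
}

/* Wavefronts: 4 on gfx900 and 16 otherwise; a num_threads/num_workers
   request overrides the default only if it is smaller.  Hypothetical
   helper.  */
unsigned gcn_default_wavefronts(unsigned requested, int is_gfx900)
{
  unsigned dflt = is_gfx900 ? 4u : 16u;
  return (requested > 0 && requested < dflt) ? requested : dflt;
}

/* Upper bound on work groups resident on one CU, using only the stated
   limits: 40 work groups per CU, 16 wavefronts per work group and
   40 wavefronts per CU.  Returns 0 for an invalid wavefront count.  */
unsigned gcn_max_resident_workgroups(unsigned wavefronts_per_group)
{
  if (wavefronts_per_group == 0 || wavefronts_per_group > 16)
    return 0;
  unsigned by_wavefronts = 40u / wavefronts_per_group;
  return by_wavefronts < 40u ? by_wavefronts : 40u;
}
```

For instance, under these limits a work group of 16 wavefronts leaves room for at most two such work groups per CU, since a third would exceed the 40-wavefront total.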
Implementation remark:
- I/O within OpenMP target regions and OpenACC parallel/kernels is supported
using the C library printf functions and the Fortran print/write statements.
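A minimal sketch of such I/O from C follows. The function name `report_from_target` is hypothetical; when the code is built without offloading support, the pragma is ignored and the printf call simply runs on the host.

```c
#include <stdio.h>

/* Doubles VALUE inside a target region, prints the result from there,
   and returns it to the caller.  Hypothetical example function.  */
int report_from_target(int value)
{
  int doubled = 0;

  /* VALUE is a scalar and is firstprivate in the target region by
     default; DOUBLED is mapped back to the host explicitly.  */
  #pragma omp target map(tofrom: doubled)
  {
    doubled = 2 * value;
    printf("in target region: %d\n", doubled);
  }

  return doubled;
}
```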