AMD Radeon (GCN)#

On the hardware side, there is the hierarchy (fine to coarse):

  • work item (thread)

  • wavefront

  • work group

  • compute unite (CU)

All OpenMP and OpenACC levels are used, i.e.

  • OpenMP’s simd and OpenACC’s vector map to work items (thread)

  • OpenMP’s threads (‘parallel’) and OpenACC’s workers map

    to wavefronts

  • OpenMP’s teams and OpenACC’s gang use a threadpool with the

    size of the number of teams or gangs, respectively.

The used sizes are

  • Number of teams is the specified num_teams (OpenMP) or

    num_gangs (OpenACC) or otherwise the number of CU

  • Number of wavefronts is 4 for gfx900 and 16 otherwise;

    num_threads (OpenMP) and num_workers (OpenACC) overrides this if smaller.

  • The wavefront has 102 scalars and 64 vectors

  • Number of workitems is always 64

  • The hardware permits maximally 40 workgroups/CU and

    16 wavefronts/workgroup up to a limit of 40 wavefronts in total per CU.

  • 80 scalars registers and 24 vector registers in non-kernel functions

    (the chosen procedure-calling API).

  • For the kernel itself: as many as register pressure demands (number of

    teams and number of threads, scaled down if registers are exhausted)

The implementation remark:

  • I/O within OpenMP target regions and OpenACC parallel/kernels is supported

    using the C library printf functions and the Fortran print / write statements.