Implementation Status and Implementation-Defined Behavior
We’re implementing the OpenACC Profiling Interface as defined by the OpenACC 2.6 specification. We’re clarifying some aspects here as implementation-defined behavior, while they’re still under discussion within the OpenACC Technical Committee.
This implementation is tuned to keep the performance impact as low as possible for the (very common) case that the Profiling Interface is not enabled. This matters because the Profiling Interface affects all the hot code paths (in the target code, not in the offloaded code). Users of the OpenACC Profiling Interface can be expected to understand that performance will be impacted to some degree once the Profiling Interface is enabled: for example, because the runtime (libgomp) calls into a third-party library for every event that has been registered.
We’re not yet accounting for the fact that OpenACC events may occur during event processing. We just handle one case specially, as required by CUDA 9.0 nvprof: acc_get_device_type (see “acc_get_device_type – Get type of device accelerator to be used”) may be called from acc_ev_device_init_start and acc_ev_device_init_end callbacks.
We’re not yet implementing initialization via an acc_register_library function that is either statically linked in, or dynamically loaded via LD_PRELOAD.
Initialization via acc_register_library functions dynamically loaded via the ACC_PROFLIB environment variable does work, as does directly calling acc_prof_register, acc_prof_unregister, acc_prof_lookup.
As there are currently no inquiry functions defined, calls to acc_prof_lookup will always return NULL.
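As a rough illustration of the ACC_PROFLIB route, a profiling library might look like the following minimal sketch. It assumes GCC’s acc_prof.h header is available; the names count_event, report, n_compute_starts, and n_compute_ends are hypothetical and not part of any API. The lookup argument is ignored, since no inquiry functions are defined, and the summary is printed via atexit because callbacks for acc_ev_runtime_shutdown are not yet invoked by this implementation.

```c
/* Sketch of a profiling library meant to be loaded via ACC_PROFLIB.
   Only the acc_* identifiers come from <acc_prof.h>; every other
   name here is a hypothetical choice made for illustration.  */
#include <acc_prof.h>
#include <stdio.h>
#include <stdlib.h>

static int n_compute_starts;
static int n_compute_ends;

/* Callback matching the acc_prof_callback signature.  */
static void
count_event (acc_prof_info *prof_info, acc_event_info *event_info,
             acc_api_info *api_info)
{
  (void) event_info;
  (void) api_info;
  switch (prof_info->event_type)
    {
    case acc_ev_compute_construct_start:
      ++n_compute_starts;
      break;
    case acc_ev_compute_construct_end:
      ++n_compute_ends;
      break;
    default:
      break;
    }
}

/* Print a summary at process exit.  */
static void
report (void)
{
  fprintf (stderr, "compute constructs: %d started, %d ended\n",
           n_compute_starts, n_compute_ends);
}

/* Entry point that the runtime looks up in the library named by
   ACC_PROFLIB.  UNREG and LOOKUP are deliberately unused here.  */
void
acc_register_library (acc_prof_reg reg, acc_prof_reg unreg,
                      acc_prof_lookup_func lookup)
{
  (void) unreg;
  (void) lookup;
  reg (acc_ev_compute_construct_start, count_event, acc_reg);
  reg (acc_ev_compute_construct_end, count_event, acc_reg);
  atexit (report);
}
```

Assuming the file is built as a shared object (for example with gcc -fPIC -shared), the resulting library would be named in ACC_PROFLIB when running an OpenACC program. The exact build flags depend on the toolchain and are not prescribed here.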
There aren’t separate start and stop events defined for the event types acc_ev_create, acc_ev_delete, acc_ev_alloc, acc_ev_free. It’s not clear if these should be triggered before or after the actual device-specific call is made. We trigger them after.
Remarks about data provided to callbacks:
- acc_prof_info.event_type
It’s not clear if for nested event callbacks (for example, acc_ev_enqueue_launch_start as part of a parent compute construct), this should be set for the nested event (acc_ev_enqueue_launch_start), or if the value of the parent construct should remain (acc_ev_compute_construct_start). In this implementation, the value will generally correspond to the innermost nested event type.
- acc_prof_info.device_type
For acc_ev_compute_construct_start, and in presence of an if clause with false argument, this will still refer to the offloading device type. It’s not clear if that’s the expected behavior.
Complementary to the item before, for acc_ev_compute_construct_end, this is set to acc_device_host in presence of an if clause with false argument. It’s not clear if that’s the expected behavior.
- acc_prof_info.thread_id
Always -1; not yet implemented.
- acc_prof_info.async
Not yet implemented correctly for acc_ev_compute_construct_start.
In a compute construct, for host-fallback execution/acc_device_host it will always be acc_async_sync. It’s not clear if that’s the expected behavior.
For acc_ev_device_init_start and acc_ev_device_init_end, it will always be acc_async_sync. It’s not clear if that’s the expected behavior.
- acc_prof_info.async_queue
libgomp does not limit the number of asynchronous queues. This will always have the same value as acc_prof_info.async.
- acc_prof_info.src_file
Always NULL; not yet implemented.
- acc_prof_info.func_name
Always NULL; not yet implemented.
- acc_prof_info.line_no
Always -1; not yet implemented.
- acc_prof_info.end_line_no
Always -1; not yet implemented.
- acc_prof_info.func_line_no
Always -1; not yet implemented.
- acc_prof_info.func_end_line_no
Always -1; not yet implemented.
- acc_event_info.event_type, acc_event_info.*.event_type
Relating to acc_prof_info.event_type discussed above, in this implementation, this will always be the same value as acc_prof_info.event_type.
- acc_event_info.*.parent_construct
Will be acc_construct_parallel for all OpenACC compute constructs as well as many OpenACC Runtime API calls; should be the one matching the actual construct, or acc_construct_runtime_api, respectively.
Will be acc_construct_enter_data or acc_construct_exit_data when processing variable mappings specified in OpenACC declare directives; should be acc_construct_declare.
For implicit acc_ev_device_init_start, acc_ev_device_init_end, and explicit as well as implicit acc_ev_alloc, acc_ev_free, acc_ev_enqueue_upload_start, acc_ev_enqueue_upload_end, acc_ev_enqueue_download_start, and acc_ev_enqueue_download_end, will be acc_construct_parallel; should reflect the real parent construct.
- acc_event_info.*.implicit
For acc_ev_alloc, acc_ev_free, acc_ev_enqueue_upload_start, acc_ev_enqueue_upload_end, acc_ev_enqueue_download_start, and acc_ev_enqueue_download_end, this currently will be 1 also for explicit usage.
- acc_event_info.data_event.var_name
Always NULL; not yet implemented.
- acc_event_info.data_event.host_ptr
For acc_ev_alloc and acc_ev_free, this is always NULL.
- typedef union acc_api_info
… as printed in 5.2.3. Third Argument: API-Specific Information. This should obviously be typedef struct acc_api_info.
- acc_api_info.device_api
Possibly not yet implemented correctly for acc_ev_compute_construct_start, acc_ev_device_init_start, acc_ev_device_init_end: will always be acc_device_api_none for these event types. For acc_ev_enter_data_start, it will be acc_device_api_none in some cases.
- acc_api_info.device_type
Always the same as acc_prof_info.device_type.
- acc_api_info.vendor
Always -1; not yet implemented.
- acc_api_info.device_handle
Always NULL; not yet implemented.
- acc_api_info.context_handle
Always NULL; not yet implemented.
- acc_api_info.async_handle
Always NULL; not yet implemented.
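Because several of the fields listed above are currently fixed at NULL or -1, a callback that prints source locations has to tolerate those values. The following is a minimal sketch of one way to do so; format_location and its placeholder strings are hypothetical and not part of the Profiling Interface. It takes plain values rather than an acc_prof_info pointer so that it depends only on the C standard library.

```c
#include <stdio.h>

/* Hypothetical helper: write a source location into BUF, substituting
   placeholders for values that are not available (NULL strings, a
   negative line number).  Returns the result of snprintf.  */
static int
format_location (char *buf, size_t size, const char *src_file,
                 const char *func_name, int line_no)
{
  const char *file = src_file != NULL ? src_file : "<unknown file>";
  const char *func = func_name != NULL ? func_name : "<unknown function>";

  /* Omit the line number entirely when it is not available.  */
  if (line_no < 0)
    return snprintf (buf, size, "%s: in %s", file, func);
  return snprintf (buf, size, "%s:%d: in %s", file, line_no, func);
}
```

Inside a callback, the arguments would be taken from the first callback parameter, for example its src_file, func_name, and line_no members. With the current implementation, that always yields the placeholder form.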
Remarks about certain event types:
- acc_ev_device_init_start, acc_ev_device_init_end
When a compute construct triggers implicit acc_ev_device_init_start and acc_ev_device_init_end events, they currently aren’t nested within the corresponding acc_ev_compute_construct_start and acc_ev_compute_construct_end, but are instead observed before acc_ev_compute_construct_start. It’s not clear what to do: the standard asks us to provide a lot of details to the acc_ev_compute_construct_start callback, without (implicitly) initializing a device before?
Callbacks for these event types will not be invoked for calls to the acc_set_device_type and acc_set_device_num functions. It’s not clear if they should be.
- acc_ev_enter_data_start, acc_ev_enter_data_end, acc_ev_exit_data_start, acc_ev_exit_data_end
Callbacks for these event types will also be invoked for OpenACC host_data constructs. It’s not clear if they should be.
Callbacks for these event types will also be invoked when processing variable mappings specified in OpenACC declare directives. It’s not clear if they should be.
Callbacks for the following event types will be invoked, but dispatch and information provided therein have not yet been thoroughly reviewed:
acc_ev_alloc
acc_ev_free
acc_ev_update_start, acc_ev_update_end
acc_ev_enqueue_upload_start, acc_ev_enqueue_upload_end
acc_ev_enqueue_download_start, acc_ev_enqueue_download_end
During device initialization, and finalization, respectively, callbacks for the following event types will not yet be invoked:
acc_ev_alloc
acc_ev_free
Callbacks for the following event types have not yet been implemented, so currently won’t be invoked:
acc_ev_device_shutdown_start, acc_ev_device_shutdown_end
acc_ev_runtime_shutdown
acc_ev_create, acc_ev_delete
acc_ev_wait_start, acc_ev_wait_end
For the following runtime library functions, not all expected callbacks will be invoked (mostly concerning implicit device initialization):
acc_get_num_devices
acc_set_device_type
acc_get_device_type
acc_set_device_num
acc_get_device_num
acc_init
acc_shutdown
Aside from implicit device initialization, for the following runtime library functions, no callbacks will be invoked for shared-memory offloading devices (it’s not clear if they should be):
acc_malloc
acc_free
acc_copyin, acc_present_or_copyin, acc_copyin_async
acc_create, acc_present_or_create, acc_create_async
acc_copyout, acc_copyout_async, acc_copyout_finalize, acc_copyout_finalize_async
acc_delete, acc_delete_async, acc_delete_finalize, acc_delete_finalize_async
acc_update_device, acc_update_device_async
acc_update_self, acc_update_self_async
acc_map_data, acc_unmap_data
acc_memcpy_to_device, acc_memcpy_to_device_async
acc_memcpy_from_device, acc_memcpy_from_device_async