Implementation Status and Implementation-Defined Behavior
We’re implementing the OpenACC Profiling Interface as defined by the OpenACC 2.6 specification. We’re clarifying some aspects here as implementation-defined behavior, while they’re still under discussion within the OpenACC Technical Committee.
This implementation is tuned to keep the performance impact as low as possible for the (very common) case that the Profiling Interface is not enabled. This matters because the Profiling Interface affects all the hot code paths (in the target code, not in the offloaded code). Users of the OpenACC Profiling Interface can be expected to understand that performance will be impacted to some degree once the Profiling Interface has been enabled: for example, because the runtime (libgomp) calls into a third-party library for every event that has been registered.
We’re not yet accounting for the fact that OpenACC events may
occur during event processing.
We just handle one case specially, as required by CUDA 9.0
nvprof: acc_get_device_type
(see acc_get_device_type – Get type of device accelerator to be used.) may be called from
acc_ev_device_init_start and acc_ev_device_init_end
callbacks.
We’re not yet implementing initialization via an
acc_register_library function that is either statically linked
in, or dynamically loaded via LD_PRELOAD.
Initialization via acc_register_library functions dynamically
loaded via the ACC_PROFLIB environment variable does work, as
does directly calling acc_prof_register,
acc_prof_unregister, acc_prof_lookup.
As currently there are no inquiry functions defined, calls to
acc_prof_lookup will always return NULL.
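As a rough illustration of the ACC_PROFLIB path described above, the following sketch shows what a minimal profiling library might look like. It assumes that GCC’s acc_prof.h header is available; the callback name on_device_init, the counter device_init_events, and the file names in the comments are hypothetical and chosen only for this example.

```c
/* Sketch of a profiling library meant to be loaded via ACC_PROFLIB.
   One possible way to build it as a shared object (file names are
   hypothetical):
     gcc -shared -fPIC -o libaccprof-sketch.so accprof-sketch.c  */
#include <stdio.h>
#include <acc_prof.h>

/* Hypothetical counter, incremented once per observed event.  */
static int device_init_events;

/* Callback matching the acc_prof_callback signature.  */
static void
on_device_init (acc_prof_info *prof_info, acc_event_info *event_info,
                acc_api_info *api_info)
{
  (void) event_info;
  (void) api_info;
  ++device_init_events;
  fprintf (stderr, "device init %s (event #%d)\n",
           prof_info->event_type == acc_ev_device_init_start
           ? "start" : "end",
           device_init_events);
}

/* Entry point that the runtime calls after loading the library.  */
void
acc_register_library (acc_prof_reg reg, acc_prof_reg unreg,
                      acc_prof_lookup_func lookup)
{
  (void) unreg;
  /* No inquiry functions are defined, so lookup is left unused.  */
  (void) lookup;
  reg (acc_ev_device_init_start, on_device_init, acc_reg);
  reg (acc_ev_device_init_end, on_device_init, acc_reg);
}
```

With such a library built, running an OpenACC program with ACC_PROFLIB set to the path of the shared object should cause the callback to fire whenever a device gets initialized. The registration goes through the reg function pointer, so the library itself needs no direct reference to acc_prof_register.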
There aren’t separate start, stop events defined for the
event types acc_ev_create, acc_ev_delete,
acc_ev_alloc, acc_ev_free. It’s not clear if these
should be triggered before or after the actual device-specific call is
made. We trigger them after.
Remarks about data provided to callbacks:
- acc_prof_info.event_type
It’s not clear if for nested event callbacks (for example, acc_ev_enqueue_launch_start as part of a parent compute construct), this should be set for the nested event (acc_ev_enqueue_launch_start), or if the value of the parent construct should remain (acc_ev_compute_construct_start). In this implementation, the value will generally correspond to the innermost nested event type.
- acc_prof_info.device_type
For acc_ev_compute_construct_start, and in presence of an if clause with false argument, this will still refer to the offloading device type. It’s not clear if that’s the expected behavior.
Complementary to the item before, for acc_ev_compute_construct_end, this is set to acc_device_host in presence of an if clause with false argument. It’s not clear if that’s the expected behavior.
- acc_prof_info.thread_id
Always -1; not yet implemented.
- acc_prof_info.async
Not yet implemented correctly for acc_ev_compute_construct_start.
In a compute construct, for host-fallback execution/acc_device_host it will always be acc_async_sync. It’s not clear if that’s the expected behavior.
For acc_ev_device_init_start and acc_ev_device_init_end, it will always be acc_async_sync. It’s not clear if that’s the expected behavior.
- acc_prof_info.async_queue
The number of asynchronous queues in libgomp is not limited. This will always have the same value as acc_prof_info.async.
- acc_prof_info.src_file
Always NULL; not yet implemented.
- acc_prof_info.func_name
Always NULL; not yet implemented.
- acc_prof_info.line_no
Always -1; not yet implemented.
- acc_prof_info.end_line_no
Always -1; not yet implemented.
- acc_prof_info.func_line_no
Always -1; not yet implemented.
- acc_prof_info.func_end_line_no
Always -1; not yet implemented.
- acc_event_info.event_type, acc_event_info.*.event_type
Relating to acc_prof_info.event_type discussed above, in this implementation, this will always be the same value as acc_prof_info.event_type.
- acc_event_info.*.parent_construct
Will be acc_construct_parallel for all OpenACC compute constructs as well as many OpenACC Runtime API calls; should be the one matching the actual construct, or acc_construct_runtime_api, respectively.
Will be acc_construct_enter_data or acc_construct_exit_data when processing variable mappings specified in OpenACC declare directives; should be acc_construct_declare.
For implicit acc_ev_device_init_start, acc_ev_device_init_end, and explicit as well as implicit acc_ev_alloc, acc_ev_free, acc_ev_enqueue_upload_start, acc_ev_enqueue_upload_end, acc_ev_enqueue_download_start, and acc_ev_enqueue_download_end, will be acc_construct_parallel; should reflect the real parent construct.
- acc_event_info.*.implicit
For acc_ev_alloc, acc_ev_free, acc_ev_enqueue_upload_start, acc_ev_enqueue_upload_end, acc_ev_enqueue_download_start, and acc_ev_enqueue_download_end, this currently will be 1 also for explicit usage.
- acc_event_info.data_event.var_name
Always NULL; not yet implemented.
- acc_event_info.data_event.host_ptr
For acc_ev_alloc and acc_ev_free, this is always NULL.
- typedef union acc_api_info
… as printed in 5.2.3. Third Argument: API-Specific Information. This should obviously be typedef struct acc_api_info.
- acc_api_info.device_api
Possibly not yet implemented correctly for acc_ev_compute_construct_start, acc_ev_device_init_start, acc_ev_device_init_end: will always be acc_device_api_none for these event types. For acc_ev_enter_data_start, it will be acc_device_api_none in some cases.
- acc_api_info.device_type
Always the same as acc_prof_info.device_type.
- acc_api_info.vendor
Always -1; not yet implemented.
- acc_api_info.device_handle
Always NULL; not yet implemented.
- acc_api_info.context_handle
Always NULL; not yet implemented.
- acc_api_info.async_handle
Always NULL; not yet implemented.
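Because several of the fields listed above currently carry placeholder values (NULL for src_file and func_name, -1 for line_no), a callback that logs source locations probably wants to guard against them. The following is a minimal sketch of such a guard; the helper name format_location and the placeholder strings are hypothetical, and the helper takes plain arguments instead of an acc_prof_info pointer so that it can be compiled without acc_prof.h.

```c
#include <stddef.h>
#include <stdio.h>

/* Format a source location for logging, treating NULL and negative
   values as "not provided".  Returns whatever snprintf returns.  */
static int
format_location (char *buf, size_t size, const char *src_file,
                 const char *func_name, int line_no)
{
  const char *file = src_file ? src_file : "<unknown file>";
  const char *func = func_name ? func_name : "<unknown function>";

  /* Omit the line number if the runtime did not provide one.  */
  if (line_no < 0)
    return snprintf (buf, size, "%s: in %s", file, func);
  return snprintf (buf, size, "%s:%d: in %s", file, line_no, func);
}
```

Inside a callback, this could be called as format_location (buf, sizeof buf, prof_info->src_file, prof_info->func_name, prof_info->line_no). With the placeholder values described above, the result would then be the "unknown" form.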
Remarks about certain event types:
- acc_ev_device_init_start, acc_ev_device_init_end
When a compute construct triggers implicit acc_ev_device_init_start and acc_ev_device_init_end events, they currently aren’t nested within the corresponding acc_ev_compute_construct_start and acc_ev_compute_construct_end, but they’re currently observed before acc_ev_compute_construct_start. It’s not clear what to do: the standard asks us to provide a lot of details to the acc_ev_compute_construct_start callback, without (implicitly) initializing a device before?
Callbacks for these event types will not be invoked for calls to the acc_set_device_type and acc_set_device_num functions. It’s not clear if they should be.
- acc_ev_enter_data_start, acc_ev_enter_data_end, acc_ev_exit_data_start, acc_ev_exit_data_end
Callbacks for these event types will also be invoked for OpenACC host_data constructs. It’s not clear if they should be.
Callbacks for these event types will also be invoked when processing variable mappings specified in OpenACC declare directives. It’s not clear if they should be.
Callbacks for the following event types will be invoked, but dispatch and information provided therein have not yet been thoroughly reviewed:
- acc_ev_alloc
- acc_ev_free
- acc_ev_update_start, acc_ev_update_end
- acc_ev_enqueue_upload_start, acc_ev_enqueue_upload_end
- acc_ev_enqueue_download_start, acc_ev_enqueue_download_end
During device initialization, and finalization, respectively, callbacks for the following event types will not yet be invoked:
- acc_ev_alloc
- acc_ev_free
Callbacks for the following event types have not yet been implemented, so currently won’t be invoked:
- acc_ev_device_shutdown_start, acc_ev_device_shutdown_end
- acc_ev_runtime_shutdown
- acc_ev_create, acc_ev_delete
- acc_ev_wait_start, acc_ev_wait_end
For the following runtime library functions, not all expected callbacks will be invoked (mostly concerning implicit device initialization):
- acc_get_num_devices
- acc_set_device_type
- acc_get_device_type
- acc_set_device_num
- acc_get_device_num
- acc_init
- acc_shutdown
Aside from implicit device initialization, for the following runtime library functions, no callbacks will be invoked for shared-memory offloading devices (it’s not clear if they should be):
- acc_malloc
- acc_free
- acc_copyin, acc_present_or_copyin, acc_copyin_async
- acc_create, acc_present_or_create, acc_create_async
- acc_copyout, acc_copyout_async, acc_copyout_finalize, acc_copyout_finalize_async
- acc_delete, acc_delete_async, acc_delete_finalize, acc_delete_finalize_async
- acc_update_device, acc_update_device_async
- acc_update_self, acc_update_self_async
- acc_map_data, acc_unmap_data
- acc_memcpy_to_device, acc_memcpy_to_device_async
- acc_memcpy_from_device, acc_memcpy_from_device_async