Implementation Status and Implementation-Defined Behavior#

We’re implementing the OpenACC Profiling Interface as defined by the OpenACC 2.6 specification. We’re clarifying some aspects here as implementation-defined behavior, while they’re still under discussion within the OpenACC Technical Committee.

This implementation is tuned to keep the performance impact as low as possible for the (very common) case that the Profiling Interface is not enabled. This is relevant, as the Profiling Interface affects all the hot code paths (in the target code, not in the offloaded code). Users of the OpenACC Profiling Interface can be expected to understand that performance will be impacted to some degree once the Profiling Interface has gotten enabled: for example, because of the runtime (libgomp) calling into a third-party library for every event that has been registered.

We’re not yet accounting for the fact that OpenACC events may occur during event processing. We just handle one case specially, as required by CUDA 9.0 nvprof, that acc_get_device_type (acc_get_device_type – Get type of device accelerator to be used.)) may be called from acc_ev_device_init_start, acc_ev_device_init_end callbacks.

We’re not yet implementing initialization via a acc_register_library function that is either statically linked in, or dynamically via LD_PRELOAD. Initialization via acc_register_library functions dynamically loaded via the ACC_PROFLIB environment variable does work, as does directly calling acc_prof_register, acc_prof_unregister, acc_prof_lookup.

As currently there are no inquiry functions defined, calls to acc_prof_lookup will always return NULL.

There aren’t separate start, stop events defined for the event types acc_ev_create, acc_ev_delete, acc_ev_alloc, acc_ev_free. It’s not clear if these should be triggered before or after the actual device-specific call is made. We trigger them after.

Remarks about data provided to callbacks:

acc_prof_info.event_type

It’s not clear if for nested event callbacks (for example, acc_ev_enqueue_launch_start as part of a parent compute construct), this should be set for the nested event (acc_ev_enqueue_launch_start), or if the value of the parent construct should remain (acc_ev_compute_construct_start). In this implementation, the value will generally correspond to the innermost nested event type.

acc_prof_info.device_type
  • For acc_ev_compute_construct_start, and in presence of an if clause with false argument, this will still refer to the offloading device type. It’s not clear if that’s the expected behavior.

  • Complementary to the item before, for acc_ev_compute_construct_end, this is set to acc_device_host in presence of an if clause with false argument. It’s not clear if that’s the expected behavior.

acc_prof_info.thread_id

Always -1 ; not yet implemented.

acc_prof_info.async
  • Not yet implemented correctly for acc_ev_compute_construct_start.

  • In a compute construct, for host-fallback execution/ acc_device_host it will always be acc_async_sync. It’s not clear if that’s the expected behavior.

  • For acc_ev_device_init_start and acc_ev_device_init_end, it will always be acc_async_sync. It’s not clear if that’s the expected behavior.

acc_prof_info.async_queue

There is no limited number of asynchronous queues in libgomp. This will always have the same value as acc_prof_info.async.

acc_prof_info.src_file

Always NULL ; not yet implemented.

acc_prof_info.func_name

Always NULL ; not yet implemented.

acc_prof_info.line_no

Always -1 ; not yet implemented.

acc_prof_info.end_line_no

Always -1 ; not yet implemented.

acc_prof_info.func_line_no

Always -1 ; not yet implemented.

acc_prof_info.func_end_line_no

Always -1 ; not yet implemented.

acc_event_info.event_type, acc_event_info.*.event_type

Relating to acc_prof_info.event_type discussed above, in this implementation, this will always be the same value as acc_prof_info.event_type.

acc_event_info.*.parent_construct
  • Will be acc_construct_parallel for all OpenACC compute constructs as well as many OpenACC Runtime API calls; should be the one matching the actual construct, or acc_construct_runtime_api, respectively.

  • Will be acc_construct_enter_data or acc_construct_exit_data when processing variable mappings specified in OpenACC declare directives; should be acc_construct_declare.

  • For implicit acc_ev_device_init_start, acc_ev_device_init_end, and explicit as well as implicit acc_ev_alloc, acc_ev_free, acc_ev_enqueue_upload_start, acc_ev_enqueue_upload_end, acc_ev_enqueue_download_start, and acc_ev_enqueue_download_end, will be acc_construct_parallel ; should reflect the real parent construct.

acc_event_info.*.implicit

For acc_ev_alloc, acc_ev_free, acc_ev_enqueue_upload_start, acc_ev_enqueue_upload_end, acc_ev_enqueue_download_start, and acc_ev_enqueue_download_end, this currently will be 1 also for explicit usage.

acc_event_info.data_event.var_name

Always NULL ; not yet implemented.

acc_event_info.data_event.host_ptr

For acc_ev_alloc, and acc_ev_free, this is always NULL.

typedef union acc_api_info

… as printed in 5.2.3. Third Argument: API-Specific Information. This should obviously be typedef struct acc_api_info.

acc_api_info.device_api

Possibly not yet implemented correctly for acc_ev_compute_construct_start, acc_ev_device_init_start, acc_ev_device_init_end : will always be acc_device_api_none for these event types. For acc_ev_enter_data_start, it will be acc_device_api_none in some cases.

acc_api_info.device_type

Always the same as acc_prof_info.device_type.

acc_api_info.vendor

Always -1 ; not yet implemented.

acc_api_info.device_handle

Always NULL ; not yet implemented.

acc_api_info.context_handle

Always NULL ; not yet implemented.

acc_api_info.async_handle

Always NULL ; not yet implemented.

Remarks about certain event types:

acc_ev_device_init_start, acc_ev_device_init_end
  • When a compute construct triggers implicit acc_ev_device_init_start and acc_ev_device_init_end events, they currently aren’t nested within the corresponding acc_ev_compute_construct_start and acc_ev_compute_construct_end, but they’re currently observed before acc_ev_compute_construct_start. It’s not clear what to do: the standard asks us provide a lot of details to the acc_ev_compute_construct_start callback, without (implicitly) initializing a device before?

  • Callbacks for these event types will not be invoked for calls to the acc_set_device_type and acc_set_device_num functions. It’s not clear if they should be.

acc_ev_enter_data_start, acc_ev_enter_data_end, acc_ev_exit_data_start, acc_ev_exit_data_end
  • Callbacks for these event types will also be invoked for OpenACC host_data constructs. It’s not clear if they should be.

  • Callbacks for these event types will also be invoked when processing variable mappings specified in OpenACC declare directives. It’s not clear if they should be.

Callbacks for the following event types will be invoked, but dispatch and information provided therein has not yet been thoroughly reviewed:

  • acc_ev_alloc

  • acc_ev_free

  • acc_ev_update_start, acc_ev_update_end

  • acc_ev_enqueue_upload_start, acc_ev_enqueue_upload_end

  • acc_ev_enqueue_download_start, acc_ev_enqueue_download_end

During device initialization, and finalization, respectively, callbacks for the following event types will not yet be invoked:

  • acc_ev_alloc

  • acc_ev_free

Callbacks for the following event types have not yet been implemented, so currently won’t be invoked:

  • acc_ev_device_shutdown_start, acc_ev_device_shutdown_end

  • acc_ev_runtime_shutdown

  • acc_ev_create, acc_ev_delete

  • acc_ev_wait_start, acc_ev_wait_end

For the following runtime library functions, not all expected callbacks will be invoked (mostly concerning implicit device initialization):

  • acc_get_num_devices

  • acc_set_device_type

  • acc_get_device_type

  • acc_set_device_num

  • acc_get_device_num

  • acc_init

  • acc_shutdown

Aside from implicit device initialization, for the following runtime library functions, no callbacks will be invoked for shared-memory offloading devices (it’s not clear if they should be):

  • acc_malloc

  • acc_free

  • acc_copyin, acc_present_or_copyin, acc_copyin_async

  • acc_create, acc_present_or_create, acc_create_async

  • acc_copyout, acc_copyout_async, acc_copyout_finalize, acc_copyout_finalize_async

  • acc_delete, acc_delete_async, acc_delete_finalize, acc_delete_finalize_async

  • acc_update_device, acc_update_device_async

  • acc_update_self, acc_update_self_async

  • acc_map_data, acc_unmap_data

  • acc_memcpy_to_device, acc_memcpy_to_device_async

  • acc_memcpy_from_device, acc_memcpy_from_device_async