blob: a71d0ec9902703a457958faf89ff9df7341373b0 [file]
// Copyright 2018-2024 The Khronos Group Inc.
// SPDX-License-Identifier: CC-BY-4.0
include::{generated}/meta/{refprefix}cl_khr_command_buffer.txt[]
// *Revision*::
// 0.9.4
// *Extension and Version Dependencies*::
// This extension requires OpenCL 1.2 or later.
// Buffering of SVM commands requires OpenCL 2.0 or later.
=== Other Extension Metadata
*Last Modified Date*::
2023-03-31
*IP Status*::
No known IP claims.
*Contributors*::
- Ewan Crawford, Codeplay Software Ltd.
- Gordon Brown, Codeplay Software Ltd.
- Kenneth Benzie, Codeplay Software Ltd.
- Alastair Murray, Codeplay Software Ltd.
- Jack Frankland, Codeplay Software Ltd.
- Balaji Calidas, Qualcomm Technologies Inc.
- Joshua Kelly, Qualcomm Technologies, Inc.
- Kevin Petit, Arm Ltd.
- Aharon Abramson, Intel.
- Ben Ashbaugh, Intel.
- Boaz Ouriel, Intel.
- Chris Gearing, Intel.
- Pekka Jääskeläinen, Tampere University and Intel
- Jan Solanti, Tampere University
- Nikhil Joshi, NVIDIA
- James Price, Google
- Brice Videau, Argonne National Laboratory
=== Description
`cl_khr_command_buffer` adds the ability to record and replay buffers of
OpenCL commands.
Command-buffers enable a reduction in overhead when enqueuing the same
workload multiple times. By separating the command-queue setup from dispatch,
the ability to replay a set of previously created commands is introduced.
Device-side _cl_sync_point_khr_ synchronization-points can be used within
command-buffers to define command dependencies. This allows the commands of a
command-buffer to execute out-of-order on a single <<compatible, compatible>>
command-queue. The command-buffer itself has no inherent in-order/out-of-order
property, this ordering is inferred from the command-queue used on command
recording. Out-of-order enqueues without event dependencies of both regular
commands, such as {clEnqueueFillBuffer}, and command-buffers are allowed to
execute concurrently, and it is up to the user to express any dependencies using
events.
The command-queues a command-buffer will be executed on can be set on replay via
parameters to {clEnqueueCommandBufferKHR}, provided they are
<<compatible, compatible>> with the command-queues used on command-buffer
recording.
==== Background
On embedded devices where building a command stream accounts for a significant
expenditure of resources and where workloads are often required to be pipelined,
a solution that minimizes driver overhead can significantly improve the
utilization of accelerators by removing a bottleneck in repeated command stream
generation.
An additional motivator is lowering task execution latency, as devices can be
kept occupied with work by repeated submissions, without having to wait on
the host to construct commands again for a similar workload.
==== Rationale
The command-buffer abstraction over the generation of command streams is a
proven approach which facilitates a significant reduction in driver overhead in
existing real-world applications with repetitive pipelined workloads which are
built on top of Vulkan, DirectX 12, and Metal.
A primary goal is for a command-buffer to avoid any interaction with
application code after being enqueued until all recorded commands have
completed. As such, any command which maps or migrates memory objects; reads
or writes memory objects; or enqueues a native kernel, is not available for
command-buffer recording. Finally commands recorded into a command buffer do
not wait for or return event objects, these are instead replaced with
device-side synchronization-point identifiers which enable out-of-order
execution when enqueued on <<compatible, compatible>> command-queues.
Adding new entry-points for individual commands, rather than recording existing
command-queue APIs with begin/end markers was a design decision made for the
following reasons:
* Individually specified entry points makes it clearer to the user what's
supported, as opposed to adding a large number of error conditions
throughout the specification with all the restrictions.
* Prevents code forking in existing entry points for the implementer, as
otherwise separate paths in each entry point need to be maintained for both
the recording and normal cases.
* Allows the definition of a new device-side synchronization primitive rather
than overloading {cl_event_TYPE}. As use of {cl_event_TYPE} in individual commands
allows host interaction from callback and user-events, as well as introducing
complexities when a command-buffer is enqueued multiple times regarding
profiling and execution status.
* New entry points facilitate returning handles to individual commands, allowing
those commands to be modified between enqueues of the command buffer. Not all
command handles are used in this extension, but providing them facilitates
other extensions layered on top to take advantage of them to provide additional
mutable functionality.
==== Simultaneous Use
The optional simultaneous use capability was added to the extension so that
vendors can support pipelined workflows, where command-buffers are repeatedly
enqueued without blocking in user code. However, simultaneous use may result in
command-buffers being more expensive to enqueue than in a sequential model, so
the capability is optional to enable optimizations on command-buffer recording.
=== Interactions With Other Extensions
The introduction of the command-buffer abstraction enables functionality
beyond what the `cl_khr_command_buffer` extension currently provides, i.e.
the recording of immutable commands to a single queue which can then be
executed without commands synchronizing outside the command-buffer. Extra
functionality expanding on this is provided as layered extensions on top of
`cl_khr_command_buffer`. The layered extensions that currently exist are:
* `<<cl_khr_command_buffer_multi_device>>`
* `<<cl_khr_command_buffer_mutable_dispatch>>`
Having `cl_khr_command_buffer` as a minimal base specification means that the
API defines mechanisms for functionality that is not enabled by this extension,
these are described in the following sub-sections. `cl_khr_command_buffer` will
retain its provisional extension status until other layered extensions are
released, as these may reveal modifications needed to the base specification to
support their intended use cases.
==== ND-range Kernel Command Properties
The {clCommandNDRangeKernelKHR} entry-point defines a `properties` parameter of
new type {cl_ndrange_kernel_command_properties_khr_TYPE}. No properties are defined
in `cl_khr_command_buffer`, but the parameter enables layered extensions like
`<<cl_khr_command_buffer_mutable_dispatch>>` to define properties that inform
the characteristics of the kernel command.
==== Command Handles
All command recording entry-points define a {cl_mutable_command_khr_TYPE} output
parameter which provides a handle to the specific command being recorded. Use of
these output handles is not enabled by the `cl_khr_command_buffer` extension,
but the handles allow individual commands in a command-buffer to be
referenced by the user.
Use of these handles is enabled in `<<cl_khr_command_buffer_mutable_dispatch>>`
to give the capability for an application to use the handles to modify commands
between enqueues of a command-buffer.
==== List of Queues
Only a single command-queue can be associated with a command-buffer in the
`cl_khr_command_buffer` extension, but the API is designed so that the layered
`<<cl_khr_command_buffer_multi_device>>` extension can relax this constraint
to allow commands to be recorded across multiple queues in the same
command-buffer, providing replay of heterogeneous task graphs.
Using multiple queue functionality will result in an error without
`<<cl_khr_command_buffer_multi_device>>` to relax usage of the following API
features:
* When a command-buffer is created the API enables passing a list of queues
that the command-buffer will record commands to. Only a single queue is
permitted in `cl_khr_command_buffer`.
* Individual command recording entry-points define a {cl_command_queue_TYPE}
parameter for which of the queues set on command-buffer creation that command
should be record to. This must be passed as NULL in `cl_khr_command_buffer`.
* {clEnqueueCommandBufferKHR} takes a list of queues for command-buffer execution,
correspond to those set on creation. Only a single queue is permitted in
`cl_khr_command_buffer`.
// The 'New ...' section can be auto-generated
=== New Commands
* {clCreateCommandBufferKHR}
* {clRetainCommandBufferKHR}
* {clReleaseCommandBufferKHR}
* {clFinalizeCommandBufferKHR}
* {clEnqueueCommandBufferKHR}
* {clCommandBarrierWithWaitListKHR}
* {clCommandCopyBufferKHR}
* {clCommandCopyBufferRectKHR}
* {clCommandCopyBufferToImageKHR}
* {clCommandCopyImageKHR}
* {clCommandCopyImageToBufferKHR}
* {clCommandFillBufferKHR}
* {clCommandFillImageKHR}
* {clCommandNDRangeKernelKHR}
* {clGetCommandBufferInfoKHR}
* The following SVM entry points are supported only with at least OpenCL 2.0,
and starting from version 0.9.4 of this extension
** {clCommandSVMMemcpyKHR}
** {clCommandSVMMemFillKHR}
=== New Types
* {cl_device_command_buffer_capabilities_khr_TYPE}
* {cl_command_buffer_khr_TYPE}
* {cl_sync_point_khr_TYPE}
* {cl_command_buffer_info_khr_TYPE}
* {cl_command_buffer_state_khr_TYPE}
* {cl_command_buffer_properties_khr_TYPE}
* {cl_command_buffer_flags_khr_TYPE}
* {cl_ndrange_kernel_command_properties_khr_TYPE}
* {cl_mutable_command_khr_TYPE}
=== New Enums
* {cl_device_info_TYPE}
** {CL_DEVICE_COMMAND_BUFFER_CAPABILITIES_KHR}
** {CL_DEVICE_COMMAND_BUFFER_REQUIRED_QUEUE_PROPERTIES_KHR}
* {cl_device_command_buffer_capabilities_khr_TYPE}
** {CL_COMMAND_BUFFER_CAPABILITY_KERNEL_PRINTF_KHR}
** {CL_COMMAND_BUFFER_CAPABILITY_DEVICE_SIDE_ENQUEUE_KHR}
** {CL_COMMAND_BUFFER_CAPABILITY_SIMULTANEOUS_USE_KHR}
** {CL_COMMAND_BUFFER_CAPABILITY_OUT_OF_ORDER_KHR}
* {cl_command_buffer_properties_khr_TYPE}
** {CL_COMMAND_BUFFER_FLAGS_KHR}
* {cl_command_buffer_flags_khr_TYPE}
** {CL_COMMAND_BUFFER_SIMULTANEOUS_USE_KHR}
* {cl_command_buffer_info_khr_TYPE}
** {CL_COMMAND_BUFFER_QUEUES_KHR}
** {CL_COMMAND_BUFFER_NUM_QUEUES_KHR}
** {CL_COMMAND_BUFFER_REFERENCE_COUNT_KHR}
** {CL_COMMAND_BUFFER_STATE_KHR}
** {CL_COMMAND_BUFFER_PROPERTIES_ARRAY_KHR}
** {CL_COMMAND_BUFFER_CONTEXT_KHR}
* {cl_command_buffer_state_khr_TYPE}
** {CL_COMMAND_BUFFER_STATE_RECORDING_KHR}
** {CL_COMMAND_BUFFER_STATE_EXECUTABLE_KHR}
** {CL_COMMAND_BUFFER_STATE_PENDING_KHR}
* {cl_command_type_TYPE}
** {CL_COMMAND_COMMAND_BUFFER_KHR}
* New Error Codes
** {CL_INVALID_COMMAND_BUFFER_KHR}
** {CL_INVALID_SYNC_POINT_WAIT_LIST_KHR}
** {CL_INCOMPATIBLE_COMMAND_QUEUE_KHR}
=== Sample Code
[source]
----
#define CL_CHECK(ERROR) \
if (ERROR) { \
std::cerr << "OpenCL error: " << ERROR << "\n"; \
return ERROR; \
}
int main() {
cl_platform_id platform;
CL_CHECK(clGetPlatformIDs(1, &platform, nullptr));
cl_device_id device;
CL_CHECK(clGetDeviceIDs(platform, CL_DEVICE_TYPE_ALL, 1, &device, nullptr));
cl_int error;
cl_context context =
clCreateContext(nullptr, 1, &device, nullptr, nullptr, &error);
CL_CHECK(error);
const char* code = R"OpenCLC(
kernel void vector_addition(global int* tile1, global int* tile2,
global int* res) {
size_t index = get_global_id(0);
res[index] = tile1[index] + tile2[index];
}
)OpenCLC";
const size_t length = std::strlen(code);
cl_program program =
clCreateProgramWithSource(context, 1, &code, &length, &error);
CL_CHECK(error);
CL_CHECK(clBuildProgram(program, 1, &device, nullptr, nullptr, nullptr));
cl_kernel kernel = clCreateKernel(program, "vector_addition", &error);
CL_CHECK(error);
constexpr size_t frame_count = 60;
constexpr size_t frame_elements = 1024;
constexpr size_t frame_size = frame_elements * sizeof(cl_int);
constexpr size_t tile_count = 16;
constexpr size_t tile_elements = frame_elements / tile_count;
constexpr size_t tile_size = tile_elements * sizeof(cl_int);
cl_mem buffer_tile1 =
clCreateBuffer(context, CL_MEM_READ_ONLY, tile_size, nullptr, &error);
CL_CHECK(error);
cl_mem buffer_tile2 =
clCreateBuffer(context, CL_MEM_READ_ONLY, tile_size, nullptr, &error);
CL_CHECK(error);
cl_mem buffer_res =
clCreateBuffer(context, CL_MEM_WRITE_ONLY, tile_size, nullptr, &error);
CL_CHECK(error);
CL_CHECK(clSetKernelArg(kernel, 0, sizeof(buffer_tile1), &buffer_tile1));
CL_CHECK(clSetKernelArg(kernel, 1, sizeof(buffer_tile2), &buffer_tile2));
CL_CHECK(clSetKernelArg(kernel, 2, sizeof(buffer_res), &buffer_res));
cl_command_queue command_queue =
clCreateCommandQueue(context, device,
CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE, &error);
CL_CHECK(error);
cl_command_buffer_khr command_buffer =
clCreateCommandBufferKHR(1, &command_queue, nullptr, &error);
CL_CHECK(error);
cl_mem buffer_src1 =
clCreateBuffer(context, CL_MEM_READ_ONLY, frame_size, nullptr, &error);
CL_CHECK(error);
cl_mem buffer_src2 =
clCreateBuffer(context, CL_MEM_READ_ONLY, frame_size, nullptr, &error);
CL_CHECK(error);
cl_mem buffer_dst =
clCreateBuffer(context, CL_MEM_WRITE_ONLY, frame_size, nullptr, &error);
CL_CHECK(error);
cl_sync_point_khr tile_sync_point = 0;
for (size_t tile_index = 0; tile_index < tile_count; tile_index++) {
std::array<cl_sync_point_khr, 2> copy_sync_points;
CL_CHECK(clCommandCopyBufferKHR(command_buffer,
command_queue, buffer_src1, buffer_tile1, tile_index * tile_size, 0,
tile_size, tile_sync_point ? 1 : 0,
tile_sync_point ? &tile_sync_point : nullptr, &copy_sync_points[0]),
nullptr);
CL_CHECK(clCommandCopyBufferKHR(command_buffer,
command_queue, buffer_src2, buffer_tile2, tile_index * tile_size, 0,
tile_size, tile_sync_point ? 1 : 0,
tile_sync_point ? &tile_sync_point : nullptr, &copy_sync_points[1]),
nullptr);
cl_sync_point_khr nd_sync_point;
CL_CHECK(clCommandNDRangeKernelKHR(command_buffer,
command_queue, nullptr, kernel, 1, nullptr, &tile_elements, nullptr,
copy_sync_points.size(), copy_sync_points.data(), &nd_sync_point,
nullptr));
CL_CHECK(clCommandCopyBufferKHR(command_buffer,
command_queue, buffer_res, buffer_dst, 0, tile_index * tile_size,
tile_size, 1, &nd_sync_point, &tile_sync_point, nullptr));
}
CL_CHECK(clFinalizeCommandBufferKHR(command_buffer));
std::random_device random_device;
std::mt19937 random_engine{random_device()};
std::uniform_int_distribution<cl_int> random_distribution{
0, std::numeric_limits<cl_int>::max() / 2};
auto random_generator = [&]() { return random_distribution(random_engine); };
for (size_t frame_index = 0; frame_index < frame_count; frame_index++) {
std::array<cl_event, 2> write_src_events;
std::vector<cl_int> src1(frame_elements);
std::generate(src1.begin(), src1.end(), random_generator);
CL_CHECK(clEnqueueWriteBuffer(command_queue, buffer_src1, CL_FALSE, 0,
frame_size, src1.data(), 0, nullptr,
&write_src_events[0]));
std::vector<cl_int> src2(frame_elements);
std::generate(src2.begin(), src2.end(), random_generator);
CL_CHECK(clEnqueueWriteBuffer(command_queue, buffer_src2, CL_FALSE, 0,
frame_size, src2.data(), 0, nullptr,
&write_src_events[1]));
CL_CHECK(clEnqueueCommandBufferKHR(0, NULL, command_buffer, 2,
write_src_events.data(), nullptr));
CL_CHECK(clFinish(command_queue));
CL_CHECK(clReleaseEvent(write_src_event[0]));
CL_CHECK(clReleaseEvent(write_src_event[1]));
}
CL_CHECK(clReleaseCommandBufferKHR(command_buffer));
CL_CHECK(clReleaseCommandQueue(command_queue));
CL_CHECK(clReleaseMemObject(buffer_src1));
CL_CHECK(clReleaseMemObject(buffer_src2));
CL_CHECK(clReleaseMemObject(buffer_dst));
CL_CHECK(clReleaseMemObject(buffer_tile1));
CL_CHECK(clReleaseMemObject(buffer_tile2));
CL_CHECK(clReleaseMemObject(buffer_res));
CL_CHECK(clReleaseKernel(kernel));
CL_CHECK(clReleaseProgram(program));
CL_CHECK(clReleaseContext(context));
return 0;
}
----
=== Issues
. Introduce a `clCloneCommandBufferKHR` entry-point for cloning a
command-buffer.
+
--
*UNRESOLVED*
--
. Enable detached command-buffer execution, where command-buffers are executed
on their own internal queue to prevent locking user created queues for the
duration of their execution.
+
--
*UNRESOLVED*
--
include::provisional_notice.asciidoc[]
=== Version History
* Revision 0.9.0, 2021-11-10
** First assigned version (provisional).
* 0.9.1, 2022-08-24
** Specify an error if a command-buffer is finalized multiple times
(provisional).
* 0.9.2, 2023-03-31
** Introduce context query {CL_COMMAND_BUFFER_CONTEXT_KHR} (provisional).
* 0.9.3, 2023-04-04
** Remove Invalid command-buffer state (provisional).
* 0.9.4, 2023-05-11
** Add clCommandSVMMemcpyKHR and clCommandSVMMemFillKHR command entries
(provisional).