blob: ae479e9b7785afb99d1810976b84bf84f238313a [file]
// Copyright 2017-2024 The Khronos Group. This work is licensed under a
// Creative Commons Attribution 4.0 International License; see
// http://creativecommons.org/licenses/by/4.0/
[[cl_khr_subgroups]]
== Sub-groups
This section describes the *cl_khr_subgroups* extension.
This extension adds support for implementation-controlled groups of work items, known as sub-groups.
Sub-groups behave similarly to work-groups and have their own sets of built-ins and synchronization primitives.
Sub-groups within a work-group are independent, may make forward progress with respect to each other, and may map to optimized hardware structures where that makes sense.
Sub-groups were promoted to a core feature in OpenCL 2.1, however note that:
* The sub-group OpenCL C built-in functions described by this extension must still be accessed as an OpenCL C extension in OpenCL 2.1.
* Sub-group independent forward progress is an optional device property in OpenCL 2.1, see {CL_DEVICE_SUB_GROUP_INDEPENDENT_FORWARD_PROGRESS}.
=== General Information
==== Version History
[cols="1,1,3",options="header",]
|====
| *Date* | *Version* | *Description*
| 2020-04-21 | 1.0.0 | First assigned version.
|====
[[cl_khr_subgroups-additions-to-chapter-3-of-the-opencl-2.0-specification]]
=== Additions to Chapter 3 of the OpenCL 2.0 Specification
[[cl_khr_subgroups-additions-to-section-3.2-execution-model]]
=== Additions to section 3.2 -- Execution Model
Within a work-group work-items may be divided into sub-groups.
The mapping of work-items to sub-groups is implementation-defined and may be queried at runtime.
While sub-groups may be used in multi-dimensional work-groups, each sub-group is 1-dimensional and any given work-item may query which sub-group it is a member of.
Work items are mapped into sub-groups through a combination of compile-time decisions and the parameters of the dispatch.
The mapping to sub-groups is invariant for the duration of a kernel’s execution, across dispatches of a given kernel with the same launch parameters, and from one work-group to another within the dispatch (excluding the trailing edge work-groups in the presence of non-uniform work-group sizes).
In addition, all sub-groups within a work-group will be the same size, apart from the sub-group with the maximum index which may be smaller if the size of the work-group is not evenly divisible by the size of the sub-group.
Sub-groups execute concurrently within a given work-group and make independent forward progress with respect to each other even in the absence of work-group barrier operations.
Sub-groups are able to internally synchronize using barrier operations without synchronizing with each other.
In the degenerate case, with the extension enabled, a single sub-group must be supported for each work-group.
In this situation all sub-group scope functions alias their work-group level equivalents.
[[cl_khr_subgroups-additions-to-chapter-5-of-the-opencl-2.0-specification]]
=== Additions to Chapter 5 of the OpenCL 2.0 Specification
The function
include::{generated}/api/protos/clGetKernelSubGroupInfoKHR.txt[]
returns information about the kernel object.
_kernel_ specifies the kernel object being queried.
_device_ identifies a specific device in the list of devices associated with
_kernel_.
The list of devices is the list of devices in the OpenCL context that is
associated with _kernel_.
If the list of devices associated with _kernel_ is a single device, _device_
can be a `NULL` value.
_param_name_ specifies the information to query.
The list of supported _param_name_ types and the information returned in
_param_value_ by {clGetKernelSubGroupInfoKHR} is described in the
<<cl_khr_subgroups-kernel-sub-group-info-table,Kernel Object Sub-group Queries>> table.
_input_value_size_ is used to specify the size in bytes of memory pointed to
by _input_value_.
This size must be == size of input type as described in the table below.
_input_value_ is a pointer to memory where the appropriate parameterization
of the query is passed from.
If _input_value_ is `NULL`, it is ignored.
_param_value_ is a pointer to memory where the appropriate result being
queried is returned.
If _param_value_ is `NULL`, it is ignored.
_param_value_size_ is used to specify the size in bytes of memory pointed to
by _param_value_.
This size must be {geq} size of return type as described in the
<<cl_khr_subgroups-kernel-sub-group-info-table,Kernel Object Sub-group Queries>> table.
_param_value_size_ret_ returns the actual size in bytes of data being
queried by _param_name_.
If _param_value_size_ret_ is `NULL`, it is ignored.
[[cl_khr_subgroups-kernel-sub-group-info-table]]
.List of supported param_names by {clGetKernelSubGroupInfoKHR}
[width="100%",cols="<25%,<25%,<25%,<25%",options="header"]
|====
| Kernel Sub-group Info | Input Type | Return Type | Description
| {CL_KERNEL_MAX_SUB_GROUP_SIZE_FOR_NDRANGE_KHR}
| {size_t_TYPE}*
| {size_t_TYPE}
| Returns the maximum sub-group size for this kernel.
All sub-groups must be the same size, while the last sub-group in
any work-group (i.e. the sub-group with the maximum index) could
be the same or smaller size.
The _input_value_ must be an array of size_t values
corresponding to the local work size parameter of the intended
dispatch.
The number of dimensions in the ND-range will be inferred from
the value specified for _input_value_size_.
| {CL_KERNEL_SUB_GROUP_COUNT_FOR_NDRANGE_KHR}
| {size_t_TYPE}*
| {size_t_TYPE}
| Returns the number of sub-groups that will be present in each
work-group for a given local work size.
All workgroups, apart from the last work-group in each dimension
in the presence of non-uniform work-group sizes, will have the
same number of sub-groups.
The _input_value_ must be an array of size_t values
corresponding to the local work size parameter of the intended
dispatch.
The number of dimensions in the ND-range will be inferred from
the value specified for _input_value_size_.
|====
{clGetKernelSubGroupInfoKHR} returns {CL_SUCCESS} if the function is executed
successfully.
Otherwise, it returns one of the following errors:
* {CL_INVALID_DEVICE} if _device_ is not in the list of devices associated
with _kernel_ or if _device_ is `NULL` but there is more than one device
associated with _kernel_.
* {CL_INVALID_VALUE} if _param_name_ is not valid, or if size in bytes
specified by _param_value_size_ is < size of return type as described in
the <<cl_khr_subgroups-kernel-sub-group-info-table,Kernel Object Sub-group Queries>> table
and _param_value_ is not `NULL`.
* {CL_INVALID_VALUE} if _param_name_ is
{CL_KERNEL_MAX_SUB_GROUP_SIZE_FOR_NDRANGE_KHR} and the size in bytes specified by
_input_value_size_ is not valid or if _input_value_ is `NULL`.
* {CL_INVALID_KERNEL} if _kernel_ is a not a valid kernel object.
* {CL_OUT_OF_RESOURCES} if there is a failure to allocate resources required
by the OpenCL implementation on the device.
* {CL_OUT_OF_HOST_MEMORY} if there is a failure to allocate resources
required by the OpenCL implementation on the host.
[[cl_khr_subgroups-additions-to-chapter-6-of-the-opencl-2.0-specification]]
=== Additions to Chapter 6 of the OpenCL 2.0 C Specification
[[cl_khr_subgroups-additions-to-section-6.13.1-work-item-functions]]
==== Additions to section 6.13.1 -- Work Item Functions
[cols="a,",options="header",]
|====
| *Function*
| *Description*
| uint *get_sub_group_size* ()
| Returns the number of work items in the sub-group.
This value is no more than the maximum sub-group size and is
implementation-defined based on a combination of the compiled kernel and
the dispatch dimensions.
This will be a constant value for the lifetime of the sub-group.
| uint *get_max_sub_group_size* ()
| Returns the maximum size of a sub-group within the dispatch.
This value will be invariant for a given set of dispatch dimensions and a
kernel object compiled for a given device.
| uint *get_num_sub_groups* ()
| Returns the number of sub-groups that the current work-group is divided
into.
This number will be constant for the duration of a work-group's execution.
If the kernel is executed with a non-uniform work-group size
(i.e. the global_work_size values specified to {clEnqueueNDRangeKernel}
are not evenly divisible by the local_work_size values for any dimension,
calls to this built-in from some work-groups may return different values
than calls to this built-in from other work-groups.
| uint *get_enqueued_num_sub_groups* ()
| Returns the same value as that returned by *get_num_sub_groups* if the
kernel is executed with a uniform work-group size.
If the kernel is executed with a non-uniform work-group size, returns the
number of sub-groups in each of the work-groups that make up the uniform
region of the global range.
| uint *get_sub_group_id* ()
| *get_sub_group_id* returns the sub-group ID which is a number from 0 ..
*get_num_sub_groups*() - 1.
For {clEnqueueTask}, this returns 0.
| uint *get_sub_group_local_id* ()
| Returns the unique work item ID within the current sub-group.
The mapping from *get_local_id*(__dimindx__) to *get_sub_group_local_id*
will be invariant for the lifetime of the work-group.
|====
[[cl_khr_subgroups-additions-to-section-6.13.8-synchronization-functions]]
==== Additions to section 6.13.8 -- Synchronization Functions
[cols="3,7",options="header",]
|====
| *Function*
| *Description*
| void **sub_group_barrier** ( +
cl_mem_fence_flags _flags_)
void **sub_group_barrier** ( +
cl_mem_fence_flags _flags_, memory_scope _scope_)
| All work items in a sub-group executing the kernel on a processor must
execute this function before any are allowed to continue execution beyond
the sub-group barrier.
This function must be encountered by all work items in a sub-group
executing the kernel.
These rules apply to ND-ranges implemented with uniform and non-uniform
work-groups.
If *sub_group_barrier* is inside a conditional statement, then all work
items within the sub-group must enter the conditional if any work item in
the sub-group enters the conditional statement and executes the
sub_group_barrier.
If *sub_group_barrier* is inside a loop, all work items within the sub-group
must execute the sub_group_barrier for each iteration of the loop before
any are allowed to continue execution beyond the sub_group_barrier.
The *sub_group_barrier* function also queues a memory fence (reads and
writes) to ensure correct ordering of memory operations to local or global
memory.
The flags argument specifies the memory address space and can be set to a
combination of the following values:
CLK_LOCAL_MEM_FENCE - The *sub_group_barrier* function will either flush
any variables stored in local memory or queue a memory fence to ensure
correct ordering of memory operations to local memory.
CLK_GLOBAL_MEM_FENCE -- The *sub_group_barrier* function will queue a
memory fence to ensure correct ordering of memory operations to global
memory.
This can be useful when work items, for example, write to buffer objects
and then want to read the updated data from these buffer objects.
CLK_IMAGE_MEM_FENCE -- The *sub_group_barrier* function will queue a memory
fence to ensure correct ordering of memory operations to image objects.
This can be useful when work items, for example, write to image objects
and then want to read the updated data from these image objects.
|====
[[cl_khr_subgroups-additions-to-section-6.13.11-atomic-functions]]
==== Additions to section 6.13.11 -- Atomic Functions
Add the following new value to the enumerated type `memory_scope` defined in
_section 6.13.11.4_.
----
memory_scope_sub_group
----
The `memory_scope_sub_group` specifies that the memory ordering constraints
given by `memory_order` apply to work items in a sub-group.
This memory scope can be used when performing atomic operations to global or
local memory.
[[cl_khr_subgroups-add-a-new-section-6.13.X-sub-group-functions]]
==== Add a new section 6.13.X -- Sub-group Functions
The table below describes OpenCL C programming language built-in functions that operate on a sub-group level.
These built-in functions must be encountered by all work items in the sub-group executing the kernel.
For the functions below, the generic type name `gentype` may be the one of the supported built-in scalar data types `int`, `uint`, `long`, `ulong`, `float`, `double` (if double precision is supported), or `half` (if half precision is supported).
[cols=",",options="header",]
|====
| *Function*
| *Description*
| int *sub_group_all* (int _predicate_)
| Evaluates _predicate_ for all work items in the sub-group and returns a
non-zero value if _predicate_ evaluates to non-zero for all work items in
the sub-group.
| int *sub_group_any* (int _predicate_)
| Evaluates _predicate_ for all work items in the sub-group and returns a
non-zero value if _predicate_ evaluates to non-zero for any work items in
the sub-group.
| gentype *sub_group_broadcast* ( +
gentype _x_, uint _sub_group_local_id_)
| Broadcast the value of _x_ for work item identified by
_sub_group_local_id_ (value returned by *get_sub_group_local_id*) to all
work items in the sub-group.
_sub_group_local_id_ must be the same value for all work items in the
sub-group.
| gentype *sub_group_reduce_<op>* ( +
gentype _x_)
| Return result of reduction operation specified by *<op>* for all values of
_x_ specified by work items in a sub-group.
| gentype *sub_group_scan_exclusive_<op>* ( +
gentype _x_)
| Do an exclusive scan operation specified by *<op>* of all values specified
by work items in a sub-group.
The scan results are returned for each work item.
The scan order is defined by increasing sub-group local ID within the
sub-group.
| gentype *sub_group_scan_inclusive_<op>* ( +
gentype _x_)
| Do an inclusive scan operation specified by *<op>* of all values specified
by work items in a sub-group.
The scan results are returned for each work item.
The scan order is defined by increasing sub-group local ID within the
sub-group.
|====
The *<op>* in *sub_group_reduce_<op>*, *sub_group_scan_inclusive_<op>* and *sub_group_scan_exclusive_<op>* defines the operator and can be *add*, *min* or *max*.
The exclusive scan operation takes a binary operator *op* with an identity I and _n_ (where _n_ is the size of the sub-group) elements [a~0~, a~1~, ... a~n-1~] and returns [I, a~0~, (a~0~ *op* a~1~), ... (a~0~ *op* a~1~ *op* ... *op* a~n-2~)].
The inclusive scan operation takes a binary operator *op* with _n_ (where _n_ is the size of the sub-group) elements [a~0~, a~1~, ... a~n-1~] and returns [a~0~, (a~0~ *op* a~1~), ... (a~0~ *op* a~1~ *op* ... *op* a~n-1~)].
If *op* = *add*, the identity I is 0.
If *op* = *min*, the identity I is `INT_MAX`, `UINT_MAX`, `LONG_MAX`, `ULONG_MAX`, for `int`, `uint`, `long`, `ulong` types and is `+INF` for
floating-point types.
Similarly if *op* = max, the identity I is `INT_MIN`, 0, `LONG_MIN`, 0 and `-INF`.
[NOTE]
====
The order of floating-point operations is not guaranteed for the *sub_group_reduce_<op>*, *sub_group_scan_inclusive_<op>* and *sub_group_scan_exclusive_<op>* built-in functions that operate on `half`, `float` and `double` data types.
The order of these floating-point operations is also non-deterministic for a given sub-group.
====
[[cl_khr_subgroups-additions-to-section-6.13.16-pipe-functions]]
==== Additions to section 6.13.16 -- Pipe Functions
The OpenCL C programming language implements the following built-in pipe
functions that operate at a sub-group level.
These built-in functions must be encountered by all work items in a sub-group
executing the kernel with the same argument values; otherwise the behavior
is undefined.
We use the generic type name `gentype` to indicate the built-in OpenCL C
scalar or vector integer or floating-point data types or any user defined
type built from these scalar and vector data types can be used as the type
for the arguments to the pipe functions listed in _table 6.29_.
[cols=",",options="header",]
|====
| *Function*
| *Description*
| reserve_id_t *sub_group_reserve_read_pipe* ( +
read_only pipe gentype _pipe_, +
uint _num_packets_)
reserve_id_t *sub_group_reserve_write_pipe* ( +
write_only pipe gentype _pipe_, +
uint _num_packets_)
| Reserve _num_packets_ entries for reading from or writing to _pipe_.
Returns a valid non-zero reservation ID if the reservation is successful
and 0 otherwise.
The reserved pipe entries are referred to by indices that go from 0 ...
_num_packets_ - 1.
| void *sub_group_commit_read_pipe* ( +
read_only pipe gentype _pipe_, +
reserve_id_t _reserve_id_)
void *sub_group_commit_write_pipe* ( +
write_only pipe gentype _pipe_, +
reserve_id_t _reserve_id_)
| Indicates that all reads and writes to _num_packets_ associated with
reservation _reserve_id_ are completed.
|====
Note: Reservations made by a sub-group are ordered in the pipe as they are
ordered in the program.
Reservations made by different sub-groups that belong to the same work-group
can be ordered using sub-group synchronization.
The order of sub-group based reservations that belong to different work
groups is implementation-defined.
[[cl_khr_subgroups-additions-to-section-6.13.17.6-enqueuing-kernels-kernel-query-functions]]
==== Additions to section 6.13.17.6 -- Enqueuing Kernels (Kernel Query Functions)
[cols="5,4",options="header",]
|====
| *Built-in Function*
| *Description*
| uint *get_kernel_sub_group_count_for_ndrange* ( +
const ndrange_t _ndrange_, +
void (^block)(void));
uint *get_kernel_sub_group_count_for_ndrange* ( +
const ndrange_t _ndrange_, +
void (^block)(local void *, ...));
| Returns the number of sub-groups in each work-group of the dispatch (except
for the last in cases where the global size does not divide cleanly into
work-groups) given the combination of the passed ndrange and block.
_block_ specifies the block to be enqueued.
| uint *get_kernel_max_sub_group_size_for_ndrange* ( +
const ndrange_t _ndrange_, +
void (^block)(void)); +
uint *get_kernel_max_sub_group_size_for_ndrange* ( +
const ndrange_t _ndrange_, +
void (^block)(local void *, ...));
| Returns the maximum sub-group size for a block.
|====