| // Copyright 2017-2024 The Khronos Group. This work is licensed under a |
| // Creative Commons Attribution 4.0 International License; see |
| // http://creativecommons.org/licenses/by/4.0/ |
| |
| [[cl_khr_subgroups]] |
| == Sub-groups |
| |
| This section describes the *cl_khr_subgroups* extension. |
| |
| This extension adds support for implementation-controlled groups of work items, known as sub-groups. |
| Sub-groups behave similarly to work-groups and have their own sets of built-ins and synchronization primitives. |
| Sub-groups within a work-group are independent, may make forward progress with respect to each other, and may map to optimized hardware structures where that makes sense. |
| |
| Sub-groups were promoted to a core feature in OpenCL 2.1, however note that: |
| |
| * The sub-group OpenCL C built-in functions described by this extension must still be accessed as an OpenCL C extension in OpenCL 2.1. |
| * Sub-group independent forward progress is an optional device property in OpenCL 2.1, see {CL_DEVICE_SUB_GROUP_INDEPENDENT_FORWARD_PROGRESS}. |
| |
| === General Information |
| |
| ==== Version History |
| |
| [cols="1,1,3",options="header",] |
| |==== |
| | *Date* | *Version* | *Description* |
| | 2020-04-21 | 1.0.0 | First assigned version. |
| |==== |
| |
| [[cl_khr_subgroups-additions-to-chapter-3-of-the-opencl-2.0-specification]] |
| === Additions to Chapter 3 of the OpenCL 2.0 Specification |
| |
| [[cl_khr_subgroups-additions-to-section-3.2-execution-model]] |
| === Additions to section 3.2 -- Execution Model |
| |
| Within a work-group work-items may be divided into sub-groups. |
| The mapping of work-items to sub-groups is implementation-defined and may be queried at runtime. |
| While sub-groups may be used in multi-dimensional work-groups, each sub-group is 1-dimensional and any given work-item may query which sub-group it is a member of. |
| |
| Work items are mapped into sub-groups through a combination of compile-time decisions and the parameters of the dispatch. |
| The mapping to sub-groups is invariant for the duration of a kernel’s execution, across dispatches of a given kernel with the same launch parameters, and from one work-group to another within the dispatch (excluding the trailing edge work-groups in the presence of non-uniform work-group sizes). |
| In addition, all sub-groups within a work-group will be the same size, apart from the sub-group with the maximum index which may be smaller if the size of the work-group is not evenly divisible by the size of the sub-group. |
| |
| Sub-groups execute concurrently within a given work-group and make independent forward progress with respect to each other even in the absence of work-group barrier operations. |
| Sub-groups are able to internally synchronize using barrier operations without synchronizing with each other. |
| |
| In the degenerate case, with the extension enabled, a single sub-group must be supported for each work-group. |
| In this situation all sub-group scope functions alias their work-group level equivalents. |
| |
| [[cl_khr_subgroups-additions-to-chapter-5-of-the-opencl-2.0-specification]] |
| === Additions to Chapter 5 of the OpenCL 2.0 Specification |
| |
| The function |
| |
| include::{generated}/api/protos/clGetKernelSubGroupInfoKHR.txt[] |
| |
| returns information about the kernel object. |
| |
| _kernel_ specifies the kernel object being queried. |
| |
| _device_ identifies a specific device in the list of devices associated with |
| _kernel_. |
| The list of devices is the list of devices in the OpenCL context that is |
| associated with _kernel_. |
| If the list of devices associated with _kernel_ is a single device, _device_ |
| can be a `NULL` value. |
| |
| _param_name_ specifies the information to query. |
| The list of supported _param_name_ types and the information returned in |
| _param_value_ by {clGetKernelSubGroupInfoKHR} is described in the |
| <<cl_khr_subgroups-kernel-sub-group-info-table,Kernel Object Sub-group Queries>> table. |
| |
| _input_value_size_ is used to specify the size in bytes of memory pointed to |
| by _input_value_. |
| This size must be == size of input type as described in the table below. |
| |
| _input_value_ is a pointer to memory where the appropriate parameterization |
| of the query is passed from. |
| If _input_value_ is `NULL`, it is ignored. |
| |
| _param_value_ is a pointer to memory where the appropriate result being |
| queried is returned. |
| If _param_value_ is `NULL`, it is ignored. |
| |
| _param_value_size_ is used to specify the size in bytes of memory pointed to |
| by _param_value_. |
| This size must be {geq} size of return type as described in the |
| <<cl_khr_subgroups-kernel-sub-group-info-table,Kernel Object Sub-group Queries>> table. |
| |
| _param_value_size_ret_ returns the actual size in bytes of data being |
| queried by _param_name_. |
| If _param_value_size_ret_ is `NULL`, it is ignored. |
| |
| [[cl_khr_subgroups-kernel-sub-group-info-table]] |
| .List of supported param_names by {clGetKernelSubGroupInfoKHR} |
| [width="100%",cols="<25%,<25%,<25%,<25%",options="header"] |
| |==== |
| | Kernel Sub-group Info | Input Type | Return Type | Description |
| | {CL_KERNEL_MAX_SUB_GROUP_SIZE_FOR_NDRANGE_KHR} |
| | {size_t_TYPE}* |
| | {size_t_TYPE} |
| | Returns the maximum sub-group size for this kernel. |
| All sub-groups must be the same size, while the last sub-group in |
| any work-group (i.e. the sub-group with the maximum index) could |
| be the same or smaller size. |
| |
| The _input_value_ must be an array of size_t values |
| corresponding to the local work size parameter of the intended |
| dispatch. |
| The number of dimensions in the ND-range will be inferred from |
| the value specified for _input_value_size_. |
| | {CL_KERNEL_SUB_GROUP_COUNT_FOR_NDRANGE_KHR} |
| | {size_t_TYPE}* |
| | {size_t_TYPE} |
| | Returns the number of sub-groups that will be present in each |
| work-group for a given local work size. |
| All workgroups, apart from the last work-group in each dimension |
| in the presence of non-uniform work-group sizes, will have the |
| same number of sub-groups. |
| |
| The _input_value_ must be an array of size_t values |
| corresponding to the local work size parameter of the intended |
| dispatch. |
| The number of dimensions in the ND-range will be inferred from |
| the value specified for _input_value_size_. |
| |==== |
| |
| {clGetKernelSubGroupInfoKHR} returns {CL_SUCCESS} if the function is executed |
| successfully. |
| Otherwise, it returns one of the following errors: |
| |
| * {CL_INVALID_DEVICE} if _device_ is not in the list of devices associated |
| with _kernel_ or if _device_ is `NULL` but there is more than one device |
| associated with _kernel_. |
| * {CL_INVALID_VALUE} if _param_name_ is not valid, or if size in bytes |
| specified by _param_value_size_ is < size of return type as described in |
| the <<cl_khr_subgroups-kernel-sub-group-info-table,Kernel Object Sub-group Queries>> table |
| and _param_value_ is not `NULL`. |
| * {CL_INVALID_VALUE} if _param_name_ is |
| {CL_KERNEL_MAX_SUB_GROUP_SIZE_FOR_NDRANGE_KHR} and the size in bytes specified by |
| _input_value_size_ is not valid or if _input_value_ is `NULL`. |
| * {CL_INVALID_KERNEL} if _kernel_ is a not a valid kernel object. |
| * {CL_OUT_OF_RESOURCES} if there is a failure to allocate resources required |
| by the OpenCL implementation on the device. |
| * {CL_OUT_OF_HOST_MEMORY} if there is a failure to allocate resources |
| required by the OpenCL implementation on the host. |
| |
| [[cl_khr_subgroups-additions-to-chapter-6-of-the-opencl-2.0-specification]] |
| === Additions to Chapter 6 of the OpenCL 2.0 C Specification |
| |
| [[cl_khr_subgroups-additions-to-section-6.13.1-work-item-functions]] |
| ==== Additions to section 6.13.1 -- Work Item Functions |
| |
| [cols="a,",options="header",] |
| |==== |
| | *Function* |
| | *Description* |
| |
| | uint *get_sub_group_size* () |
| | Returns the number of work items in the sub-group. |
| This value is no more than the maximum sub-group size and is |
| implementation-defined based on a combination of the compiled kernel and |
| the dispatch dimensions. |
| This will be a constant value for the lifetime of the sub-group. |
| |
| | uint *get_max_sub_group_size* () |
| | Returns the maximum size of a sub-group within the dispatch. |
| This value will be invariant for a given set of dispatch dimensions and a |
| kernel object compiled for a given device. |
| |
| | uint *get_num_sub_groups* () |
| | Returns the number of sub-groups that the current work-group is divided |
| into. |
| |
| This number will be constant for the duration of a work-group's execution. |
| If the kernel is executed with a non-uniform work-group size |
| (i.e. the global_work_size values specified to {clEnqueueNDRangeKernel} |
| are not evenly divisible by the local_work_size values for any dimension, |
| calls to this built-in from some work-groups may return different values |
| than calls to this built-in from other work-groups. |
| |
| | uint *get_enqueued_num_sub_groups* () |
| | Returns the same value as that returned by *get_num_sub_groups* if the |
| kernel is executed with a uniform work-group size. |
| |
| If the kernel is executed with a non-uniform work-group size, returns the |
| number of sub-groups in each of the work-groups that make up the uniform |
| region of the global range. |
| |
| | uint *get_sub_group_id* () |
| | *get_sub_group_id* returns the sub-group ID which is a number from 0 .. |
| *get_num_sub_groups*() - 1. |
| |
| For {clEnqueueTask}, this returns 0. |
| |
| | uint *get_sub_group_local_id* () |
| | Returns the unique work item ID within the current sub-group. |
| The mapping from *get_local_id*(__dimindx__) to *get_sub_group_local_id* |
| will be invariant for the lifetime of the work-group. |
| |
| |==== |
| |
| [[cl_khr_subgroups-additions-to-section-6.13.8-synchronization-functions]] |
| ==== Additions to section 6.13.8 -- Synchronization Functions |
| |
| [cols="3,7",options="header",] |
| |==== |
| | *Function* |
| | *Description* |
| |
| | void **sub_group_barrier** ( + |
| cl_mem_fence_flags _flags_) |
| |
| void **sub_group_barrier** ( + |
| cl_mem_fence_flags _flags_, memory_scope _scope_) |
| |
| | All work items in a sub-group executing the kernel on a processor must |
| execute this function before any are allowed to continue execution beyond |
| the sub-group barrier. |
| This function must be encountered by all work items in a sub-group |
| executing the kernel. |
| These rules apply to ND-ranges implemented with uniform and non-uniform |
| work-groups. |
| |
| If *sub_group_barrier* is inside a conditional statement, then all work |
| items within the sub-group must enter the conditional if any work item in |
| the sub-group enters the conditional statement and executes the |
| sub_group_barrier. |
| |
| If *sub_group_barrier* is inside a loop, all work items within the sub-group |
| must execute the sub_group_barrier for each iteration of the loop before |
| any are allowed to continue execution beyond the sub_group_barrier. |
| |
| The *sub_group_barrier* function also queues a memory fence (reads and |
| writes) to ensure correct ordering of memory operations to local or global |
| memory. |
| |
| The flags argument specifies the memory address space and can be set to a |
| combination of the following values: |
| |
| CLK_LOCAL_MEM_FENCE - The *sub_group_barrier* function will either flush |
| any variables stored in local memory or queue a memory fence to ensure |
| correct ordering of memory operations to local memory. |
| |
| CLK_GLOBAL_MEM_FENCE -- The *sub_group_barrier* function will queue a |
| memory fence to ensure correct ordering of memory operations to global |
| memory. |
| This can be useful when work items, for example, write to buffer objects |
| and then want to read the updated data from these buffer objects. |
| |
| CLK_IMAGE_MEM_FENCE -- The *sub_group_barrier* function will queue a memory |
| fence to ensure correct ordering of memory operations to image objects. |
| This can be useful when work items, for example, write to image objects |
| and then want to read the updated data from these image objects. |
| |
| |==== |
| |
| [[cl_khr_subgroups-additions-to-section-6.13.11-atomic-functions]] |
| ==== Additions to section 6.13.11 -- Atomic Functions |
| |
| Add the following new value to the enumerated type `memory_scope` defined in |
| _section 6.13.11.4_. |
| |
| ---- |
| memory_scope_sub_group |
| ---- |
| |
| The `memory_scope_sub_group` specifies that the memory ordering constraints |
| given by `memory_order` apply to work items in a sub-group. |
| This memory scope can be used when performing atomic operations to global or |
| local memory. |
| |
| [[cl_khr_subgroups-add-a-new-section-6.13.X-sub-group-functions]] |
| ==== Add a new section 6.13.X -- Sub-group Functions |
| |
| The table below describes OpenCL C programming language built-in functions that operate on a sub-group level. |
| These built-in functions must be encountered by all work items in the sub-group executing the kernel. |
| For the functions below, the generic type name `gentype` may be the one of the supported built-in scalar data types `int`, `uint`, `long`, `ulong`, `float`, `double` (if double precision is supported), or `half` (if half precision is supported). |
| |
| [cols=",",options="header",] |
| |==== |
| | *Function* |
| | *Description* |
| |
| | int *sub_group_all* (int _predicate_) |
| | Evaluates _predicate_ for all work items in the sub-group and returns a |
| non-zero value if _predicate_ evaluates to non-zero for all work items in |
| the sub-group. |
| |
| | int *sub_group_any* (int _predicate_) |
| | Evaluates _predicate_ for all work items in the sub-group and returns a |
| non-zero value if _predicate_ evaluates to non-zero for any work items in |
| the sub-group. |
| |
| | gentype *sub_group_broadcast* ( + |
| gentype _x_, uint _sub_group_local_id_) |
| | Broadcast the value of _x_ for work item identified by |
| _sub_group_local_id_ (value returned by *get_sub_group_local_id*) to all |
| work items in the sub-group. |
| |
| _sub_group_local_id_ must be the same value for all work items in the |
| sub-group. |
| |
| | gentype *sub_group_reduce_<op>* ( + |
| gentype _x_) |
| | Return result of reduction operation specified by *<op>* for all values of |
| _x_ specified by work items in a sub-group. |
| |
| | gentype *sub_group_scan_exclusive_<op>* ( + |
| gentype _x_) |
| | Do an exclusive scan operation specified by *<op>* of all values specified |
| by work items in a sub-group. |
| The scan results are returned for each work item. |
| |
| The scan order is defined by increasing sub-group local ID within the |
| sub-group. |
| |
| | gentype *sub_group_scan_inclusive_<op>* ( + |
| gentype _x_) |
| | Do an inclusive scan operation specified by *<op>* of all values specified |
| by work items in a sub-group. |
| The scan results are returned for each work item. |
| |
| The scan order is defined by increasing sub-group local ID within the |
| sub-group. |
| |
| |==== |
| |
| The *<op>* in *sub_group_reduce_<op>*, *sub_group_scan_inclusive_<op>* and *sub_group_scan_exclusive_<op>* defines the operator and can be *add*, *min* or *max*. |
| |
| The exclusive scan operation takes a binary operator *op* with an identity I and _n_ (where _n_ is the size of the sub-group) elements [a~0~, a~1~, ... a~n-1~] and returns [I, a~0~, (a~0~ *op* a~1~), ... (a~0~ *op* a~1~ *op* ... *op* a~n-2~)]. |
| |
| The inclusive scan operation takes a binary operator *op* with _n_ (where _n_ is the size of the sub-group) elements [a~0~, a~1~, ... a~n-1~] and returns [a~0~, (a~0~ *op* a~1~), ... (a~0~ *op* a~1~ *op* ... *op* a~n-1~)]. |
| |
| If *op* = *add*, the identity I is 0. |
| If *op* = *min*, the identity I is `INT_MAX`, `UINT_MAX`, `LONG_MAX`, `ULONG_MAX`, for `int`, `uint`, `long`, `ulong` types and is `+INF` for |
| floating-point types. |
| Similarly if *op* = max, the identity I is `INT_MIN`, 0, `LONG_MIN`, 0 and `-INF`. |
| |
| [NOTE] |
| ==== |
| The order of floating-point operations is not guaranteed for the *sub_group_reduce_<op>*, *sub_group_scan_inclusive_<op>* and *sub_group_scan_exclusive_<op>* built-in functions that operate on `half`, `float` and `double` data types. |
| The order of these floating-point operations is also non-deterministic for a given sub-group. |
| ==== |
| |
| [[cl_khr_subgroups-additions-to-section-6.13.16-pipe-functions]] |
| ==== Additions to section 6.13.16 -- Pipe Functions |
| |
| The OpenCL C programming language implements the following built-in pipe |
| functions that operate at a sub-group level. |
| These built-in functions must be encountered by all work items in a sub-group |
| executing the kernel with the same argument values; otherwise the behavior |
| is undefined. |
| We use the generic type name `gentype` to indicate the built-in OpenCL C |
| scalar or vector integer or floating-point data types or any user defined |
| type built from these scalar and vector data types can be used as the type |
| for the arguments to the pipe functions listed in _table 6.29_. |
| |
| [cols=",",options="header",] |
| |==== |
| | *Function* |
| | *Description* |
| |
| | reserve_id_t *sub_group_reserve_read_pipe* ( + |
| read_only pipe gentype _pipe_, + |
| uint _num_packets_) |
| |
| reserve_id_t *sub_group_reserve_write_pipe* ( + |
| write_only pipe gentype _pipe_, + |
| uint _num_packets_) |
| | Reserve _num_packets_ entries for reading from or writing to _pipe_. |
| Returns a valid non-zero reservation ID if the reservation is successful |
| and 0 otherwise. |
| |
| The reserved pipe entries are referred to by indices that go from 0 ... |
| _num_packets_ - 1. |
| |
| | void *sub_group_commit_read_pipe* ( + |
| read_only pipe gentype _pipe_, + |
| reserve_id_t _reserve_id_) |
| |
| void *sub_group_commit_write_pipe* ( + |
| write_only pipe gentype _pipe_, + |
| reserve_id_t _reserve_id_) |
| | Indicates that all reads and writes to _num_packets_ associated with |
| reservation _reserve_id_ are completed. |
| |
| |==== |
| |
| Note: Reservations made by a sub-group are ordered in the pipe as they are |
| ordered in the program. |
| Reservations made by different sub-groups that belong to the same work-group |
| can be ordered using sub-group synchronization. |
| The order of sub-group based reservations that belong to different work |
| groups is implementation-defined. |
| |
| [[cl_khr_subgroups-additions-to-section-6.13.17.6-enqueuing-kernels-kernel-query-functions]] |
| ==== Additions to section 6.13.17.6 -- Enqueuing Kernels (Kernel Query Functions) |
| |
| [cols="5,4",options="header",] |
| |==== |
| | *Built-in Function* |
| | *Description* |
| |
| | uint *get_kernel_sub_group_count_for_ndrange* ( + |
| const ndrange_t _ndrange_, + |
| void (^block)(void)); |
| |
| uint *get_kernel_sub_group_count_for_ndrange* ( + |
| const ndrange_t _ndrange_, + |
| void (^block)(local void *, ...)); |
| | Returns the number of sub-groups in each work-group of the dispatch (except |
| for the last in cases where the global size does not divide cleanly into |
| work-groups) given the combination of the passed ndrange and block. |
| |
| _block_ specifies the block to be enqueued. |
| |
| | uint *get_kernel_max_sub_group_size_for_ndrange* ( + |
| const ndrange_t _ndrange_, + |
| void (^block)(void)); + |
| |
| uint *get_kernel_max_sub_group_size_for_ndrange* ( + |
| const ndrange_t _ndrange_, + |
| void (^block)(local void *, ...)); |
| | Returns the maximum sub-group size for a block. |
| |
| |==== |