ext/cl_khr_subgroups.asciidoc - external/github.com/KhronosGroup/OpenCL-Docs - Git at Google

 // Copyright 2017-2024 The Khronos Group. This work is licensed under a
 // Creative Commons Attribution 4.0 International License; see
 // http://creativecommons.org/licenses/by/4.0/

 [[cl_khr_subgroups]]
 == Sub-groups

 This section describes the *cl_khr_subgroups* extension.

 This extension adds support for implementation-controlled groups of work items, known as sub-groups.
 Sub-groups behave similarly to work-groups and have their own sets of built-ins and synchronization primitives.
 Sub-groups within a work-group are independent, may make forward progress with respect to each other, and may map to optimized hardware structures where that makes sense.

 Sub-groups were promoted to a core feature in OpenCL 2.1, however note that:

 * The sub-group OpenCL C built-in functions described by this extension must still be accessed as an OpenCL C extension in OpenCL 2.1.
 * Sub-group independent forward progress is an optional device property in OpenCL 2.1, see {CL_DEVICE_SUB_GROUP_INDEPENDENT_FORWARD_PROGRESS}.

 === General Information

 ==== Version History

 [cols="1,1,3",options="header",]
 |====
 | *Date*     | *Version* | *Description*
 | 2020-04-21 | 1.0.0     | First assigned version.
 |====

 [[cl_khr_subgroups-additions-to-chapter-3-of-the-opencl-2.0-specification]]
 === Additions to Chapter 3 of the OpenCL 2.0 Specification

 [[cl_khr_subgroups-additions-to-section-3.2-execution-model]]
 === Additions to section 3.2 -- Execution Model

 Within a work-group work-items may be divided into sub-groups.
 The mapping of work-items to sub-groups is implementation-defined and may be queried at runtime.
 While sub-groups may be used in multi-dimensional work-groups, each sub-group is 1-dimensional and any given work-item may query which sub-group it is a member of.

 Work items are mapped into sub-groups through a combination of compile-time decisions and the parameters of the dispatch.
 The mapping to sub-groups is invariant for the duration of a kernel’s execution, across dispatches of a given kernel with the same launch parameters, and from one work-group to another within the dispatch (excluding the trailing edge work-groups in the presence of non-uniform work-group sizes).
 In addition, all sub-groups within a work-group will be the same size, apart from the sub-group with the maximum index which may be smaller if the size of the work-group is not evenly divisible by the size of the sub-group.

 Sub-groups execute concurrently within a given work-group and make independent forward progress with respect to each other even in the absence of work-group barrier operations.
 Sub-groups are able to internally synchronize using barrier operations without synchronizing with each other.

 In the degenerate case, with the extension enabled, a single sub-group must be supported for each work-group.
 In this situation all sub-group scope functions alias their work-group level equivalents.

 [[cl_khr_subgroups-additions-to-chapter-5-of-the-opencl-2.0-specification]]
 === Additions to Chapter 5 of the OpenCL 2.0 Specification

 The function

 include::{generated}/api/protos/clGetKernelSubGroupInfoKHR.txt[]

 returns information about the kernel object.

 _kernel_ specifies the kernel object being queried.

 _device_ identifies a specific device in the list of devices associated with
 _kernel_.
 The list of devices is the list of devices in the OpenCL context that is
 associated with _kernel_.
 If the list of devices associated with _kernel_ is a single device, _device_
 can be a `NULL` value.

 _param_name_ specifies the information to query.
 The list of supported _param_name_ types and the information returned in
 _param_value_ by {clGetKernelSubGroupInfoKHR} is described in the
 <<cl_khr_subgroups-kernel-sub-group-info-table,Kernel Object Sub-group Queries>> table.

 _input_value_size_ is used to specify the size in bytes of memory pointed to
 by _input_value_.
 This size must be == size of input type as described in the table below.

 _input_value_ is a pointer to memory where the appropriate parameterization
 of the query is passed from.
 If _input_value_ is `NULL`, it is ignored.

 _param_value_ is a pointer to memory where the appropriate result being
 queried is returned.
 If _param_value_ is `NULL`, it is ignored.

 _param_value_size_ is used to specify the size in bytes of memory pointed to
 by _param_value_.
 This size must be {geq} size of return type as described in the
 <<cl_khr_subgroups-kernel-sub-group-info-table,Kernel Object Sub-group Queries>> table.

 _param_value_size_ret_ returns the actual size in bytes of data being
 queried by _param_name_.
 If _param_value_size_ret_ is `NULL`, it is ignored.

 [[cl_khr_subgroups-kernel-sub-group-info-table]]
 .List of supported param_names by {clGetKernelSubGroupInfoKHR}
 [width="100%",cols="<25%,<25%,<25%,<25%",options="header"]
 |====
 | Kernel Sub-group Info | Input Type | Return Type | Description
 | {CL_KERNEL_MAX_SUB_GROUP_SIZE_FOR_NDRANGE_KHR}
   | {size_t_TYPE}*
       | {size_t_TYPE}
           | Returns the maximum sub-group size for this kernel.
             All sub-groups must be the same size, while the last sub-group in
             any work-group (i.e. the sub-group with the maximum index) could
             be the same or smaller size.

             The _input_value_ must be an array of size_t values
             corresponding to the local work size parameter of the intended
             dispatch.
             The number of dimensions in the ND-range will be inferred from
             the value specified for _input_value_size_.
 | {CL_KERNEL_SUB_GROUP_COUNT_FOR_NDRANGE_KHR}
   | {size_t_TYPE}*
       | {size_t_TYPE}
           | Returns the number of sub-groups that will be present in each
             work-group for a given local work size.
             All workgroups, apart from the last work-group in each dimension
             in the presence of non-uniform work-group sizes, will have the
             same number of sub-groups.

             The _input_value_ must be an array of size_t values
             corresponding to the local work size parameter of the intended
             dispatch.
             The number of dimensions in the ND-range will be inferred from
             the value specified for _input_value_size_.
 |====

 {clGetKernelSubGroupInfoKHR} returns {CL_SUCCESS} if the function is executed
 successfully.
 Otherwise, it returns one of the following errors:

   * {CL_INVALID_DEVICE} if _device_ is not in the list of devices associated
     with _kernel_ or if _device_ is `NULL` but there is more than one device
     associated with _kernel_.
   * {CL_INVALID_VALUE} if _param_name_ is not valid, or if size in bytes
     specified by _param_value_size_ is < size of return type as described in
     the <<cl_khr_subgroups-kernel-sub-group-info-table,Kernel Object Sub-group Queries>> table
     and _param_value_ is not `NULL`.
   * {CL_INVALID_VALUE} if _param_name_ is
     {CL_KERNEL_MAX_SUB_GROUP_SIZE_FOR_NDRANGE_KHR} and the size in bytes specified by
     _input_value_size_ is not valid or if _input_value_ is `NULL`.
   * {CL_INVALID_KERNEL} if _kernel_ is a not a valid kernel object.
   * {CL_OUT_OF_RESOURCES} if there is a failure to allocate resources required
     by the OpenCL implementation on the device.
   * {CL_OUT_OF_HOST_MEMORY} if there is a failure to allocate resources
     required by the OpenCL implementation on the host.

 [[cl_khr_subgroups-additions-to-chapter-6-of-the-opencl-2.0-specification]]
 === Additions to Chapter 6 of the OpenCL 2.0 C Specification

 [[cl_khr_subgroups-additions-to-section-6.13.1-work-item-functions]]
 ==== Additions to section 6.13.1 -- Work Item Functions

 [cols="a,",options="header",]
 |====
 | *Function*
 | *Description*

 | uint *get_sub_group_size* ()
 | Returns the number of work items in the sub-group.
   This value is no more than the maximum sub-group size and is
   implementation-defined based on a combination of the compiled kernel and
   the dispatch dimensions.
   This will be a constant value for the lifetime of the sub-group.

 | uint *get_max_sub_group_size* ()
 | Returns the maximum size of a sub-group within the dispatch.
   This value will be invariant for a given set of dispatch dimensions and a
   kernel object compiled for a given device.

 | uint *get_num_sub_groups* ()
 | Returns the number of sub-groups that the current work-group is divided
   into.

   This number will be constant for the duration of a work-group's execution.
   If the kernel is executed with a non-uniform work-group size
   (i.e. the global_work_size values specified to {clEnqueueNDRangeKernel}
   are not evenly divisible by the local_work_size values for any dimension,
   calls to this built-in from some work-groups may return different values
   than calls to this built-in from other work-groups.

 | uint *get_enqueued_num_sub_groups* ()
 | Returns the same value as that returned by *get_num_sub_groups* if the
   kernel is executed with a uniform work-group size.

   If the kernel is executed with a non-uniform work-group size, returns the
   number of sub-groups in each of the work-groups that make up the uniform
   region of the global range.

 | uint *get_sub_group_id* ()
 | *get_sub_group_id* returns the sub-group ID which is a number from 0 ..
   *get_num_sub_groups*() - 1.

   For {clEnqueueTask}, this returns 0.

 | uint *get_sub_group_local_id* ()
 | Returns the unique work item ID within the current sub-group.
   The mapping from *get_local_id*(__dimindx__) to *get_sub_group_local_id*
   will be invariant for the lifetime of the work-group.

 |====

 [[cl_khr_subgroups-additions-to-section-6.13.8-synchronization-functions]]
 ==== Additions to section 6.13.8 -- Synchronization Functions

 [cols="3,7",options="header",]
 |====
 | *Function*
 | *Description*

 | void **sub_group_barrier** ( +
   cl_mem_fence_flags _flags_)

   void **sub_group_barrier** ( +
   cl_mem_fence_flags _flags_, memory_scope _scope_)

 | All work items in a sub-group executing the kernel on a processor must
   execute this function before any are allowed to continue execution beyond
   the sub-group barrier.
   This function must be encountered by all work items in a sub-group
   executing the kernel.
   These rules apply to ND-ranges implemented with uniform and non-uniform
   work-groups.

   If *sub_group_barrier* is inside a conditional statement, then all work
   items within the sub-group must enter the conditional if any work item in
   the sub-group enters the conditional statement and executes the
   sub_group_barrier.

   If *sub_group_barrier* is inside a loop, all work items within the sub-group
   must execute the sub_group_barrier for each iteration of the loop before
   any are allowed to continue execution beyond the sub_group_barrier.

   The *sub_group_barrier* function also queues a memory fence (reads and
   writes) to ensure correct ordering of memory operations to local or global
   memory.

   The flags argument specifies the memory address space and can be set to a
   combination of the following values:

   CLK_LOCAL_MEM_FENCE - The *sub_group_barrier* function will either flush
   any variables stored in local memory or queue a memory fence to ensure
   correct ordering of memory operations to local memory.

   CLK_GLOBAL_MEM_FENCE -- The *sub_group_barrier* function will queue a
   memory fence to ensure correct ordering of memory operations to global
   memory.
   This can be useful when work items, for example, write to buffer objects
   and then want to read the updated data from these buffer objects.

   CLK_IMAGE_MEM_FENCE -- The *sub_group_barrier* function will queue a memory
   fence to ensure correct ordering of memory operations to image objects.
   This can be useful when work items, for example, write to image objects
   and then want to read the updated data from these image objects.

 |====

 [[cl_khr_subgroups-additions-to-section-6.13.11-atomic-functions]]
 ==== Additions to section 6.13.11 -- Atomic Functions

 Add the following new value to the enumerated type `memory_scope` defined in
 _section 6.13.11.4_.

 ----
 memory_scope_sub_group
 ----

 The `memory_scope_sub_group` specifies that the memory ordering constraints
 given by `memory_order` apply to work items in a sub-group.
 This memory scope can be used when performing atomic operations to global or
 local memory.

 [[cl_khr_subgroups-add-a-new-section-6.13.X-sub-group-functions]]
 ==== Add a new section 6.13.X -- Sub-group Functions

 The table below describes OpenCL C programming language built-in functions that operate on a sub-group level.
 These built-in functions must be encountered by all work items in the sub-group executing the kernel.
 For the functions below, the generic type name `gentype` may be the one of the supported built-in scalar data types `int`, `uint`, `long`, `ulong`, `float`, `double` (if double precision is supported), or `half` (if half precision is supported).

 [cols=",",options="header",]
 |====
 | *Function*
 | *Description*

 | int *sub_group_all* (int _predicate_)
 | Evaluates _predicate_ for all work items in the sub-group and returns a
   non-zero value if _predicate_ evaluates to non-zero for all work items in
   the sub-group.

 | int *sub_group_any* (int _predicate_)
 | Evaluates _predicate_ for all work items in the sub-group and returns a
   non-zero value if _predicate_ evaluates to non-zero for any work items in
   the sub-group.

 | gentype *sub_group_broadcast* ( +
   gentype _x_, uint _sub_group_local_id_)
 | Broadcast the value of _x_ for work item identified by
   _sub_group_local_id_ (value returned by *get_sub_group_local_id*) to all
   work items in the sub-group.

   _sub_group_local_id_ must be the same value for all work items in the
   sub-group.

 | gentype *sub_group_reduce_<op>* ( +
   gentype _x_)
 | Return result of reduction operation specified by *<op>* for all values of
   _x_ specified by work items in a sub-group.

 | gentype *sub_group_scan_exclusive_<op>* ( +
   gentype _x_)
 | Do an exclusive scan operation specified by *<op>* of all values specified
   by work items in a sub-group.
   The scan results are returned for each work item.

   The scan order is defined by increasing sub-group local ID within the
   sub-group.

 | gentype *sub_group_scan_inclusive_<op>* ( +
   gentype _x_)
 | Do an inclusive scan operation specified by *<op>* of all values specified
   by work items in a sub-group.
   The scan results are returned for each work item.

   The scan order is defined by increasing sub-group local ID within the
   sub-group.

 |====

 The *<op>* in *sub_group_reduce_<op>*, *sub_group_scan_inclusive_<op>* and *sub_group_scan_exclusive_<op>* defines the operator and can be *add*, *min* or *max*.

 The exclusive scan operation takes a binary operator *op* with an identity I and _n_ (where _n_ is the size of the sub-group) elements [a~0~, a~1~, ... a~n-1~] and returns [I, a~0~, (a~0~ *op* a~1~), ... (a~0~ *op* a~1~ *op* ... *op* a~n-2~)].

 The inclusive scan operation takes a binary operator *op* with _n_ (where _n_ is the size of the sub-group) elements [a~0~, a~1~, ... a~n-1~] and returns [a~0~, (a~0~ *op* a~1~), ... (a~0~ *op* a~1~ *op* ... *op* a~n-1~)].

 If *op* = *add*, the identity I is 0.
 If *op* = *min*, the identity I is `INT_MAX`, `UINT_MAX`, `LONG_MAX`, `ULONG_MAX`, for `int`, `uint`, `long`, `ulong` types and is `+INF` for
 floating-point types.
 Similarly if *op* = max, the identity I is `INT_MIN`, 0, `LONG_MIN`, 0 and `-INF`.

 [NOTE]
 ====
 The order of floating-point operations is not guaranteed for the *sub_group_reduce_<op>*, *sub_group_scan_inclusive_<op>* and *sub_group_scan_exclusive_<op>* built-in functions that operate on `half`, `float` and `double` data types.
 The order of these floating-point operations is also non-deterministic for a given sub-group.
 ====

 [[cl_khr_subgroups-additions-to-section-6.13.16-pipe-functions]]
 ==== Additions to section 6.13.16 -- Pipe Functions

 The OpenCL C programming language implements the following built-in pipe
 functions that operate at a sub-group level.
 These built-in functions must be encountered by all work items in a sub-group
 executing the kernel with the same argument values; otherwise the behavior
 is undefined.
 We use the generic type name `gentype` to indicate the built-in OpenCL C
 scalar or vector integer or floating-point data types or any user defined
 type built from these scalar and vector data types can be used as the type
 for the arguments to the pipe functions listed in _table 6.29_.

 [cols=",",options="header",]
 |====
 | *Function*
 | *Description*

 | reserve_id_t *sub_group_reserve_read_pipe* ( +
   read_only pipe gentype _pipe_, +
   uint _num_packets_)

   reserve_id_t *sub_group_reserve_write_pipe* ( +
   write_only pipe gentype _pipe_, +
   uint _num_packets_)
 | Reserve _num_packets_ entries for reading from or writing to _pipe_.
   Returns a valid non-zero reservation ID if the reservation is successful
   and 0 otherwise.

   The reserved pipe entries are referred to by indices that go from 0 ...
   _num_packets_ - 1.

 | void *sub_group_commit_read_pipe* ( +
   read_only pipe gentype _pipe_, +
   reserve_id_t _reserve_id_)

   void *sub_group_commit_write_pipe* ( +
   write_only pipe gentype _pipe_, +
   reserve_id_t _reserve_id_)
 | Indicates that all reads and writes to _num_packets_ associated with
   reservation _reserve_id_ are completed.

 |====

 Note: Reservations made by a sub-group are ordered in the pipe as they are
 ordered in the program.
 Reservations made by different sub-groups that belong to the same work-group
 can be ordered using sub-group synchronization.
 The order of sub-group based reservations that belong to different work
 groups is implementation-defined.

 [[cl_khr_subgroups-additions-to-section-6.13.17.6-enqueuing-kernels-kernel-query-functions]]
 ==== Additions to section 6.13.17.6 -- Enqueuing Kernels (Kernel Query Functions)

 [cols="5,4",options="header",]
 |====
 | *Built-in Function*
 | *Description*

 | uint *get_kernel_sub_group_count_for_ndrange* ( +
   const ndrange_t _ndrange_, +
   void (^block)(void));

   uint *get_kernel_sub_group_count_for_ndrange* ( +
   const ndrange_t _ndrange_, +
   void (^block)(local void *, ...));
 | Returns the number of sub-groups in each work-group of the dispatch (except
   for the last in cases where the global size does not divide cleanly into
   work-groups) given the combination of the passed ndrange and block.

   _block_ specifies the block to be enqueued.

 | uint *get_kernel_max_sub_group_size_for_ndrange* ( +
   const ndrange_t _ndrange_, +
   void (^block)(void)); +

   uint *get_kernel_max_sub_group_size_for_ndrange* ( +
   const ndrange_t _ndrange_, +
   void (^block)(local void *, ...));
 | Returns the maximum sub-group size for a block.

 |====
	// Copyright 2017-2024 The Khronos Group. This work is licensed under a
	// Creative Commons Attribution 4.0 International License; see
	// http://creativecommons.org/licenses/by/4.0/

	[[cl_khr_subgroups]]
	== Sub-groups

	This section describes the cl_khr_subgroups extension.

	This extension adds support for implementation-controlled groups of work items, known as sub-groups.
	Sub-groups behave similarly to work-groups and have their own sets of built-ins and synchronization primitives.
	Sub-groups within a work-group are independent, may make forward progress with respect to each other, and may map to optimized hardware structures where that makes sense.

	Sub-groups were promoted to a core feature in OpenCL 2.1, however note that:

	* The sub-group OpenCL C built-in functions described by this extension must still be accessed as an OpenCL C extension in OpenCL 2.1.
	* Sub-group independent forward progress is an optional device property in OpenCL 2.1, see {CL_DEVICE_SUB_GROUP_INDEPENDENT_FORWARD_PROGRESS}.

	=== General Information

	==== Version History

	[cols="1,1,3",options="header",]
	\|====
	\| Date \| Version \| Description
	\| 2020-04-21 \| 1.0.0 \| First assigned version.
	\|====

	[[cl_khr_subgroups-additions-to-chapter-3-of-the-opencl-2.0-specification]]
	=== Additions to Chapter 3 of the OpenCL 2.0 Specification

	[[cl_khr_subgroups-additions-to-section-3.2-execution-model]]
	=== Additions to section 3.2 -- Execution Model

	Within a work-group work-items may be divided into sub-groups.
	The mapping of work-items to sub-groups is implementation-defined and may be queried at runtime.
	While sub-groups may be used in multi-dimensional work-groups, each sub-group is 1-dimensional and any given work-item may query which sub-group it is a member of.

	Work items are mapped into sub-groups through a combination of compile-time decisions and the parameters of the dispatch.
	The mapping to sub-groups is invariant for the duration of a kernel’s execution, across dispatches of a given kernel with the same launch parameters, and from one work-group to another within the dispatch (excluding the trailing edge work-groups in the presence of non-uniform work-group sizes).
	In addition, all sub-groups within a work-group will be the same size, apart from the sub-group with the maximum index which may be smaller if the size of the work-group is not evenly divisible by the size of the sub-group.

	Sub-groups execute concurrently within a given work-group and make independent forward progress with respect to each other even in the absence of work-group barrier operations.
	Sub-groups are able to internally synchronize using barrier operations without synchronizing with each other.

	In the degenerate case, with the extension enabled, a single sub-group must be supported for each work-group.
	In this situation all sub-group scope functions alias their work-group level equivalents.

	[[cl_khr_subgroups-additions-to-chapter-5-of-the-opencl-2.0-specification]]
	=== Additions to Chapter 5 of the OpenCL 2.0 Specification

	The function

	include::{generated}/api/protos/clGetKernelSubGroupInfoKHR.txt[]

	returns information about the kernel object.

	_kernel_ specifies the kernel object being queried.

	_device_ identifies a specific device in the list of devices associated with
	_kernel_.
	The list of devices is the list of devices in the OpenCL context that is
	associated with _kernel_.
	If the list of devices associated with _kernel_ is a single device, _device_
	can be a `NULL` value.

	_param_name_ specifies the information to query.
	The list of supported _param_name_ types and the information returned in
	_param_value_ by {clGetKernelSubGroupInfoKHR} is described in the
	<<cl_khr_subgroups-kernel-sub-group-info-table,Kernel Object Sub-group Queries>> table.

	_input_value_size_ is used to specify the size in bytes of memory pointed to
	by _input_value_.
	This size must be == size of input type as described in the table below.

	_input_value_ is a pointer to memory where the appropriate parameterization
	of the query is passed from.
	If _input_value_ is `NULL`, it is ignored.

	_param_value_ is a pointer to memory where the appropriate result being
	queried is returned.
	If _param_value_ is `NULL`, it is ignored.

	_param_value_size_ is used to specify the size in bytes of memory pointed to
	by _param_value_.
	This size must be {geq} size of return type as described in the
	<<cl_khr_subgroups-kernel-sub-group-info-table,Kernel Object Sub-group Queries>> table.

	_param_value_size_ret_ returns the actual size in bytes of data being
	queried by _param_name_.
	If _param_value_size_ret_ is `NULL`, it is ignored.

	[[cl_khr_subgroups-kernel-sub-group-info-table]]
	.List of supported param_names by {clGetKernelSubGroupInfoKHR}
	[width="100%",cols="<25%,<25%,<25%,<25%",options="header"]
	\|====
	\| Kernel Sub-group Info \| Input Type \| Return Type \| Description
	\| {CL_KERNEL_MAX_SUB_GROUP_SIZE_FOR_NDRANGE_KHR}
	\| {size_t_TYPE}*
	\| {size_t_TYPE}
	\| Returns the maximum sub-group size for this kernel.
	All sub-groups must be the same size, while the last sub-group in
	any work-group (i.e. the sub-group with the maximum index) could
	be the same or smaller size.

	The _input_value_ must be an array of size_t values
	corresponding to the local work size parameter of the intended
	dispatch.
	The number of dimensions in the ND-range will be inferred from
	the value specified for _input_value_size_.
	\| {CL_KERNEL_SUB_GROUP_COUNT_FOR_NDRANGE_KHR}
	\| {size_t_TYPE}*
	\| {size_t_TYPE}
	\| Returns the number of sub-groups that will be present in each
	work-group for a given local work size.
	All workgroups, apart from the last work-group in each dimension
	in the presence of non-uniform work-group sizes, will have the
	same number of sub-groups.

	The _input_value_ must be an array of size_t values
	corresponding to the local work size parameter of the intended
	dispatch.
	The number of dimensions in the ND-range will be inferred from
	the value specified for _input_value_size_.
	\|====

	{clGetKernelSubGroupInfoKHR} returns {CL_SUCCESS} if the function is executed
	successfully.
	Otherwise, it returns one of the following errors:

	* {CL_INVALID_DEVICE} if _device_ is not in the list of devices associated
	with _kernel_ or if _device_ is `NULL` but there is more than one device
	associated with _kernel_.
	* {CL_INVALID_VALUE} if _param_name_ is not valid, or if size in bytes
	specified by _param_value_size_ is < size of return type as described in
	the <<cl_khr_subgroups-kernel-sub-group-info-table,Kernel Object Sub-group Queries>> table
	and _param_value_ is not `NULL`.
	* {CL_INVALID_VALUE} if _param_name_ is
	{CL_KERNEL_MAX_SUB_GROUP_SIZE_FOR_NDRANGE_KHR} and the size in bytes specified by
	_input_value_size_ is not valid or if _input_value_ is `NULL`.
	* {CL_INVALID_KERNEL} if _kernel_ is a not a valid kernel object.
	* {CL_OUT_OF_RESOURCES} if there is a failure to allocate resources required
	by the OpenCL implementation on the device.
	* {CL_OUT_OF_HOST_MEMORY} if there is a failure to allocate resources
	required by the OpenCL implementation on the host.

	[[cl_khr_subgroups-additions-to-chapter-6-of-the-opencl-2.0-specification]]
	=== Additions to Chapter 6 of the OpenCL 2.0 C Specification

	[[cl_khr_subgroups-additions-to-section-6.13.1-work-item-functions]]
	==== Additions to section 6.13.1 -- Work Item Functions

	[cols="a,",options="header",]
	\|====
	\| Function
	\| Description

	\| uint get_sub_group_size ()
	\| Returns the number of work items in the sub-group.
	This value is no more than the maximum sub-group size and is
	implementation-defined based on a combination of the compiled kernel and
	the dispatch dimensions.
	This will be a constant value for the lifetime of the sub-group.

	\| uint get_max_sub_group_size ()
	\| Returns the maximum size of a sub-group within the dispatch.
	This value will be invariant for a given set of dispatch dimensions and a
	kernel object compiled for a given device.

	\| uint get_num_sub_groups ()
	\| Returns the number of sub-groups that the current work-group is divided
	into.

	This number will be constant for the duration of a work-group's execution.
	If the kernel is executed with a non-uniform work-group size
	(i.e. the global_work_size values specified to {clEnqueueNDRangeKernel}
	are not evenly divisible by the local_work_size values for any dimension,
	calls to this built-in from some work-groups may return different values
	than calls to this built-in from other work-groups.

	\| uint get_enqueued_num_sub_groups ()
	\| Returns the same value as that returned by get_num_sub_groups if the
	kernel is executed with a uniform work-group size.

	If the kernel is executed with a non-uniform work-group size, returns the
	number of sub-groups in each of the work-groups that make up the uniform
	region of the global range.

	\| uint get_sub_group_id ()
	\| get_sub_group_id returns the sub-group ID which is a number from 0 ..
	get_num_sub_groups() - 1.

	For {clEnqueueTask}, this returns 0.

	\| uint get_sub_group_local_id ()
	\| Returns the unique work item ID within the current sub-group.
	The mapping from get_local_id(__dimindx__) to get_sub_group_local_id
	will be invariant for the lifetime of the work-group.

	\|====

	[[cl_khr_subgroups-additions-to-section-6.13.8-synchronization-functions]]
	==== Additions to section 6.13.8 -- Synchronization Functions

	[cols="3,7",options="header",]
	\|====
	\| Function
	\| Description

	\| void sub_group_barrier ( +
	cl_mem_fence_flags _flags_)

	void sub_group_barrier ( +
	cl_mem_fence_flags _flags_, memory_scope _scope_)

	\| All work items in a sub-group executing the kernel on a processor must
	execute this function before any are allowed to continue execution beyond
	the sub-group barrier.
	This function must be encountered by all work items in a sub-group
	executing the kernel.
	These rules apply to ND-ranges implemented with uniform and non-uniform
	work-groups.

	If sub_group_barrier is inside a conditional statement, then all work
	items within the sub-group must enter the conditional if any work item in
	the sub-group enters the conditional statement and executes the
	sub_group_barrier.

	If sub_group_barrier is inside a loop, all work items within the sub-group
	must execute the sub_group_barrier for each iteration of the loop before
	any are allowed to continue execution beyond the sub_group_barrier.

	The sub_group_barrier function also queues a memory fence (reads and
	writes) to ensure correct ordering of memory operations to local or global
	memory.

	The flags argument specifies the memory address space and can be set to a
	combination of the following values:

	CLK_LOCAL_MEM_FENCE - The sub_group_barrier function will either flush
	any variables stored in local memory or queue a memory fence to ensure
	correct ordering of memory operations to local memory.

	CLK_GLOBAL_MEM_FENCE -- The sub_group_barrier function will queue a
	memory fence to ensure correct ordering of memory operations to global
	memory.
	This can be useful when work items, for example, write to buffer objects
	and then want to read the updated data from these buffer objects.

	CLK_IMAGE_MEM_FENCE -- The sub_group_barrier function will queue a memory
	fence to ensure correct ordering of memory operations to image objects.
	This can be useful when work items, for example, write to image objects
	and then want to read the updated data from these image objects.

	\|====

	[[cl_khr_subgroups-additions-to-section-6.13.11-atomic-functions]]
	==== Additions to section 6.13.11 -- Atomic Functions

	Add the following new value to the enumerated type `memory_scope` defined in
	_section 6.13.11.4_.

	----
	memory_scope_sub_group
	----

	The `memory_scope_sub_group` specifies that the memory ordering constraints
	given by `memory_order` apply to work items in a sub-group.
	This memory scope can be used when performing atomic operations to global or
	local memory.

	[[cl_khr_subgroups-add-a-new-section-6.13.X-sub-group-functions]]
	==== Add a new section 6.13.X -- Sub-group Functions

	The table below describes OpenCL C programming language built-in functions that operate on a sub-group level.
	These built-in functions must be encountered by all work items in the sub-group executing the kernel.
	For the functions below, the generic type name `gentype` may be the one of the supported built-in scalar data types `int`, `uint`, `long`, `ulong`, `float`, `double` (if double precision is supported), or `half` (if half precision is supported).

	[cols=",",options="header",]
	\|====
	\| Function
	\| Description

	\| int sub_group_all (int _predicate_)
	\| Evaluates _predicate_ for all work items in the sub-group and returns a
	non-zero value if _predicate_ evaluates to non-zero for all work items in
	the sub-group.

	\| int sub_group_any (int _predicate_)
	\| Evaluates _predicate_ for all work items in the sub-group and returns a
	non-zero value if _predicate_ evaluates to non-zero for any work items in
	the sub-group.

	\| gentype sub_group_broadcast ( +
	gentype _x_, uint _sub_group_local_id_)
	\| Broadcast the value of _x_ for work item identified by
	_sub_group_local_id_ (value returned by get_sub_group_local_id) to all
	work items in the sub-group.

	_sub_group_local_id_ must be the same value for all work items in the
	sub-group.

	\| gentype sub_group_reduce_<op> ( +
	gentype _x_)
	\| Return result of reduction operation specified by <op> for all values of
	_x_ specified by work items in a sub-group.

	\| gentype sub_group_scan_exclusive_<op> ( +
	gentype _x_)
	\| Do an exclusive scan operation specified by <op> of all values specified
	by work items in a sub-group.
	The scan results are returned for each work item.

	The scan order is defined by increasing sub-group local ID within the
	sub-group.

	\| gentype sub_group_scan_inclusive_<op> ( +
	gentype _x_)
	\| Do an inclusive scan operation specified by <op> of all values specified
	by work items in a sub-group.
	The scan results are returned for each work item.

	The scan order is defined by increasing sub-group local ID within the
	sub-group.

	\|====

	The <op> in sub_group_reduce_<op>, sub_group_scan_inclusive_<op> and sub_group_scan_exclusive_<op> defines the operator and can be add, min or max.

	The exclusive scan operation takes a binary operator op with an identity I and _n_ (where _n_ is the size of the sub-group) elements [a~0~, a~1~, ... a~n-1~] and returns [I, a~0~, (a~0~ op a~1~), ... (a~0~ op a~1~ op ... op a~n-2~)].

	The inclusive scan operation takes a binary operator op with _n_ (where _n_ is the size of the sub-group) elements [a~0~, a~1~, ... a~n-1~] and returns [a~0~, (a~0~ op a~1~), ... (a~0~ op a~1~ op ... op a~n-1~)].

	If op = add, the identity I is 0.
	If op = min, the identity I is `INT_MAX`, `UINT_MAX`, `LONG_MAX`, `ULONG_MAX`, for `int`, `uint`, `long`, `ulong` types and is `+INF` for
	floating-point types.
	Similarly if op = max, the identity I is `INT_MIN`, 0, `LONG_MIN`, 0 and `-INF`.

	[NOTE]
	====
	The order of floating-point operations is not guaranteed for the sub_group_reduce_<op>, sub_group_scan_inclusive_<op> and sub_group_scan_exclusive_<op> built-in functions that operate on `half`, `float` and `double` data types.
	The order of these floating-point operations is also non-deterministic for a given sub-group.
	====

	[[cl_khr_subgroups-additions-to-section-6.13.16-pipe-functions]]
	==== Additions to section 6.13.16 -- Pipe Functions

	The OpenCL C programming language implements the following built-in pipe
	functions that operate at a sub-group level.
	These built-in functions must be encountered by all work items in a sub-group
	executing the kernel with the same argument values; otherwise the behavior
	is undefined.
	We use the generic type name `gentype` to indicate the built-in OpenCL C
	scalar or vector integer or floating-point data types or any user defined
	type built from these scalar and vector data types can be used as the type
	for the arguments to the pipe functions listed in _table 6.29_.

	[cols=",",options="header",]
	\|====
	\| Function
	\| Description

	\| reserve_id_t sub_group_reserve_read_pipe ( +
	read_only pipe gentype _pipe_, +
	uint _num_packets_)

	reserve_id_t sub_group_reserve_write_pipe ( +
	write_only pipe gentype _pipe_, +
	uint _num_packets_)
	\| Reserve _num_packets_ entries for reading from or writing to _pipe_.
	Returns a valid non-zero reservation ID if the reservation is successful
	and 0 otherwise.

	The reserved pipe entries are referred to by indices that go from 0 ...
	_num_packets_ - 1.

	\| void sub_group_commit_read_pipe ( +
	read_only pipe gentype _pipe_, +
	reserve_id_t _reserve_id_)

	void sub_group_commit_write_pipe ( +
	write_only pipe gentype _pipe_, +
	reserve_id_t _reserve_id_)
	\| Indicates that all reads and writes to _num_packets_ associated with
	reservation _reserve_id_ are completed.

	\|====

	Note: Reservations made by a sub-group are ordered in the pipe as they are
	ordered in the program.
	Reservations made by different sub-groups that belong to the same work-group
	can be ordered using sub-group synchronization.
	The order of sub-group based reservations that belong to different work
	groups is implementation-defined.

	[[cl_khr_subgroups-additions-to-section-6.13.17.6-enqueuing-kernels-kernel-query-functions]]
	==== Additions to section 6.13.17.6 -- Enqueuing Kernels (Kernel Query Functions)

	[cols="5,4",options="header",]
	\|====
	\| Built-in Function
	\| Description

	\| uint get_kernel_sub_group_count_for_ndrange ( +
	const ndrange_t _ndrange_, +
	void (^block)(void));

	uint get_kernel_sub_group_count_for_ndrange ( +
	const ndrange_t _ndrange_, +
	void (^block)(local void *, ...));
	\| Returns the number of sub-groups in each work-group of the dispatch (except
	for the last in cases where the global size does not divide cleanly into
	work-groups) given the combination of the passed ndrange and block.

	_block_ specifies the block to be enqueued.

	\| uint get_kernel_max_sub_group_size_for_ndrange ( +
	const ndrange_t _ndrange_, +
	void (^block)(void)); +

	uint get_kernel_max_sub_group_size_for_ndrange ( +
	const ndrange_t _ndrange_, +
	void (^block)(local void *, ...));
	\| Returns the maximum sub-group size for a block.

	\|====