// Copyright 2017-2020 The Khronos Group. This work is licensed under a
// Creative Commons Attribution 4.0 International License; see
// http://creativecommons.org/licenses/by/4.0/
= The OpenCL Architecture
*OpenCL* is an open industry standard for programming a heterogeneous
collection of CPUs, GPUs and other discrete computing devices organized into
a single platform.
It is more than a language.
OpenCL is a framework for parallel programming and includes a language, API,
libraries and a runtime system to support software development.
Using OpenCL, for example, a programmer can write general purpose programs
that execute on GPUs without the need to map their algorithms onto a 3D
graphics API such as OpenGL or DirectX.
The target of OpenCL is expert programmers wanting to write portable yet
efficient code.
This includes library writers, middleware vendors, and performance oriented
application programmers.
Therefore OpenCL provides a low-level hardware abstraction plus a framework
to support programming, and many details of the underlying hardware are
exposed.
To describe the core ideas behind OpenCL, we will use a hierarchy of models:
* Platform Model
* Memory Model
* Execution Model
* Programming Model
== Platform Model
The <<platform-model-image, Platform model>> for OpenCL is defined below.
The model consists of a *host* connected to one or more *OpenCL devices*.
An OpenCL device is divided into one or more *compute units* (CUs) which are
further divided into one or more *processing elements* (PEs).
Computations on a device occur within the processing elements.
An OpenCL application is implemented as both host code and device kernel
code.
The host code portion of an OpenCL application runs on a host processor
according to the models native to the host platform.
The OpenCL application host code submits the kernel code as commands from
the host to OpenCL devices.
An OpenCL device executes the commands' computation on the processing
elements within the device.
An OpenCL device has considerable latitude on how computations are mapped
onto the device's processing elements.
When the processing elements within a compute unit execute the same sequence
of statements, the control flow is said to be _converged_.
Hardware optimized for executing a single stream of instructions over
multiple processing elements is well suited to converged control flows.
When the control flow varies from one processing element to another, it is
said to be _diverged_.
While a kernel always begins execution with a converged control flow, due to
branching statements within a kernel, converged and diverged control flows
may occur within a single kernel.
This provides a great deal of flexibility in the algorithms that can be
implemented with OpenCL.
[[platform-model-image]]
image::images/platform_model.png[align="center", title="Platform Model ... one host plus one or more compute devices each with one or more compute units composed of one or more processing elements."]
Programmers may provide programs in the form of OpenCL C source strings,
the SPIR-V intermediate language, or as implementation-defined binary objects.
An OpenCL platform provides a compiler to translate programs of these
forms into executable program objects.
The device code compiler may be _online_ or _offline_.
An _online_ _compiler_ is available during host program execution using
standard APIs.
An _offline compiler_ is invoked outside of host program control, using
platform-specific methods.
The OpenCL runtime allows developers to obtain a previously compiled device
program executable, and to load and execute it.
OpenCL defines two kinds of platform profiles: a _full profile_ and a
reduced-functionality _embedded profile_.
A full profile platform must provide an online compiler for all its devices.
An embedded platform may provide an online compiler, but is not required to
do so.
A device may expose special purpose functionality as a _built-in kernel_.
The platform provides APIs for enumerating and invoking the built-in
kernels offered by a device, but otherwise does not define their
construction or semantics.
A _custom device_ supports only built-in kernels, and cannot be programmed
via a kernel language.
NOTE: Built-in kernels and custom devices are <<unified-spec, missing before>>
version 1.2.
All device types support the OpenCL execution model, the OpenCL memory
model, and the APIs used in OpenCL to manage devices.
The platform model is an abstraction describing how OpenCL views the
hardware.
The relationship between the elements of the platform model and the hardware
in a system may be a fixed property of a device or it may be a dynamic
feature of a program dependent on how a compiler optimizes code to best
utilize physical hardware.
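The hierarchy described above, one host connected to devices that contain compute units made of processing elements, can be sketched as plain data structures. This is purely an illustration of the platform model; the names and the device configuration below are invented, not part of any OpenCL API:

```python
from dataclasses import dataclass

@dataclass
class ComputeUnit:
    # Each compute unit contains one or more processing elements.
    num_processing_elements: int

@dataclass
class Device:
    # An OpenCL device is divided into one or more compute units.
    name: str
    compute_units: list

@dataclass
class Platform:
    # The platform model: one host connected to one or more OpenCL devices.
    host: str
    devices: list

# A hypothetical platform: one host, one GPU with 4 CUs of 64 PEs each.
gpu = Device("hypothetical-gpu", [ComputeUnit(64) for _ in range(4)])
platform = Platform(host="host-cpu", devices=[gpu])

total_pes = sum(cu.num_processing_elements
                for dev in platform.devices
                for cu in dev.compute_units)
print(total_pes)  # 256 processing elements in total
```

In a real program these numbers would come from platform queries rather than being fixed in source.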
== Execution Model
The OpenCL execution model is defined in terms of two distinct units of
execution: *kernels* that execute on one or more OpenCL devices and a *host
program* that executes on the host.
With regard to OpenCL, the kernels are where the "work" associated with a
computation occurs.
This work occurs through *work-items* that execute in groups
(*work-groups*).
A kernel executes within a well-defined context managed by the host.
The context defines the environment within which kernels execute.
It includes the following resources:
* *Devices*: One or more devices exposed by the OpenCL platform.
* *Kernel Objects*: The OpenCL functions with their associated argument
values that run on OpenCL devices.
* *Program Objects*: The program source and executable that implement the
kernels.
* *Memory Objects*: Variables visible to the host and the OpenCL devices.
Instances of kernels operate on these objects as they execute.
The host program uses the OpenCL API to create and manage the context.
Functions from the OpenCL API enable the host to interact with a device
through a _command-queue_.
Each command-queue is associated with a single device.
The commands placed into the command-queue fall into one of three types:
* *Kernel-enqueue commands*: Enqueue a kernel for execution on a device.
* *Memory commands*: Transfer data between the host and device memory,
between memory objects, or map and unmap memory objects from the host
address space.
* *Synchronization commands*: Explicit synchronization points that define
order constraints between commands.
In addition to commands submitted from the host command-queue, a kernel
running on a device can enqueue commands to a device-side command queue.
This results in _child kernels_ enqueued by a kernel executing on a device
(the _parent kernel_).
Regardless of whether the command-queue resides on the host or a device,
each command passes through six states.
. *Queued*: The command is enqueued to a command-queue.
A command may reside in the queue until it is flushed either explicitly
(a call to {clFlush}) or implicitly by some other command.
. *Submitted*: The command is flushed from the command-queue and submitted
for execution on the device.
Once flushed from the command-queue, a command will execute after any
prerequisites for execution are met.
. *Ready*: All prerequisites constraining execution of a command have been
met.
The command, or for a kernel-enqueue command the collection of work
groups associated with a command, is placed in a device work-pool from
which it is scheduled for execution.
. *Running*: Execution of the command starts.
For the case of a kernel-enqueue command, one or more work-groups
associated with the command start to execute.
. *Ended*: Execution of a command ends.
When a Kernel-enqueue command ends, all of the work-groups associated
with that command have finished their execution.
_Immediate side effects_, i.e. those associated with the kernel but not
necessarily with its child kernels, are visible to other units of
execution.
These side effects include updates to values in global memory.
. *Complete*: The command and its child commands have finished execution
and the status of the event object, if any, associated with the command
is set to {CL_COMPLETE}.
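The linear progression through these six states can be sketched as follows. This is a conceptual illustration only; an implementation need not expose command states this way, and the state names are taken from the list above rather than from any API:

```python
# The six states a command passes through, in order.
STATES = ["Queued", "Submitted", "Ready", "Running", "Ended", "Complete"]

def next_state(state):
    """Advance a command to the next state in the linear progression."""
    i = STATES.index(state)
    if i == len(STATES) - 1:
        raise ValueError("command is already Complete")
    return STATES[i + 1]

# Walk one command through its full lifetime.
state = "Queued"
while state != "Complete":
    state = next_state(state)
print(state)  # Complete
```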
The <<profiled-states-image, execution states and the transitions between
them>> are summarized below.
These states and the concept of a device work-pool are conceptual elements
of the execution model.
An implementation of OpenCL has considerable freedom in how these are
exposed to a program.
Five of the transitions, however, are directly observable through a
profiling interface.
These <<profiled-states-image, profiled states>> are shown below.
[[profiled-states-image]]
image::images/profiled_states.png[align="center", title="The states and transitions between states defined in the OpenCL execution model. A subset of these transitions is exposed through the <<profiling-operations, profiling interface>>."]
Commands communicate their status through _Event objects_.
Successful completion is indicated by setting the event status associated
with a command to {CL_COMPLETE}.
Unsuccessful completion results in abnormal termination of the command which
is indicated by setting the event status to a negative value.
In this case, the command-queue associated with the abnormally terminated
command and all other command-queues in the same context may no longer be
available and their behavior is implementation defined.
A command submitted to a device will not launch until prerequisites that
constrain the order of commands have been resolved.
These prerequisites have three sources:
* They may arise from commands submitted to a command-queue that constrain
the order in which commands are launched.
For example, commands that follow a command queue barrier will not
launch until all commands prior to the barrier are complete.
* The second source of prerequisites is dependencies between commands
expressed through events.
A command may include an optional list of events.
The command will wait and not launch until all the events in the list
are in the state {CL_COMPLETE}.
By this mechanism, event objects define order constraints between
commands and coordinate execution between the host and one or more
devices.
* The third source of prerequisites can be the presence of non-trivial C
initializers or {cpp} constructors for program scope global variables.
In this case, the OpenCL C/{cpp} compiler shall generate program
initialization kernels that perform C initialization or {cpp}
construction.
These kernels must be executed by OpenCL runtime on a device before any
kernel from the same program can be executed on the same device.
The ND-range for any program initialization kernel is (1,1,1).
When multiple programs are linked together, the order of execution of
program initialization kernels that belong to different programs is
undefined.
Program clean up may result in the execution of one or more program clean up
kernels by the OpenCL runtime.
This is due to the presence of non-trivial {cpp} destructors for
program scope variables.
The ND-range for executing any program clean up kernel is (1,1,1).
The order of execution of clean up kernels from different programs (that are
linked together) is undefined.
NOTE: Program initialization and clean-up kernels are <<unified-spec,
missing before>> version 2.2.
Note that C initializers, {cpp} constructors, or {cpp} destructors for program
scope variables cannot use pointers to coarse grain and fine grain SVM
allocations.
A command may be submitted to a device and yet have no visible side effects
outside of waiting on and satisfying event dependences.
Examples include markers, kernels executed over ranges containing no
work-items, or copy operations with zero sizes.
Such commands may pass directly from the _ready_ state to the _ended_ state.
Command execution can be blocking or non-blocking.
Consider a sequence of OpenCL commands.
For blocking commands, the OpenCL API functions that enqueue commands do not
return until the command has completed.
Alternatively, OpenCL functions that enqueue non-blocking commands return
immediately and require that a programmer defines dependencies between
enqueued commands to ensure that enqueued commands are not launched before
needed resources are available.
In both cases, the actual execution of the command may occur asynchronously
with execution of the host program.
Commands within a single command-queue execute relative to each other in one
of two modes:
* *In-order Execution*: Commands and any side effects associated with
commands appear to the OpenCL application as if they execute in the same
order they are enqueued to a command-queue.
* *Out-of-order Execution*: Commands execute in any order constrained only
by explicit synchronization points (e.g. through command queue barriers)
or explicit dependencies on events.
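The out-of-order rule, any command may launch once its event dependencies are satisfied, can be illustrated with a toy scheduler. This is a sketch of the ordering constraint only, not of the OpenCL API; each command is represented as a name plus an event wait list:

```python
def launch_order(commands):
    """Launch commands in any order consistent with their event wait
    lists -- a sketch of out-of-order execution constrained by events.
    Each command is a (name, wait_list) pair."""
    completed, order = set(), []
    pending = list(commands)
    while pending:
        # Any command whose wait list is fully satisfied may launch next.
        ready = [c for c in pending if all(w in completed for w in c[1])]
        assert ready, "dependency cycle: no command can launch"
        cmd = ready[0]          # an implementation may pick any ready command
        pending.remove(cmd)
        order.append(cmd[0])
        completed.add(cmd[0])
    return order

# C waits on events from A and B; A and B may launch in either order.
order = launch_order([("A", []), ("B", []), ("C", ["A", "B"])])
print(order)  # 'C' is always last
```

An in-order queue is the special case in which each command implicitly depends on every command enqueued before it.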
Multiple command-queues can be present within a single context.
Multiple command-queues execute commands independently.
Event objects visible to the host program can be used to define
synchronization points between commands in multiple command queues.
If such synchronization points are established between commands in multiple
command-queues, an implementation must assure that the command-queues
progress concurrently and correctly account for the dependencies established
by the synchronization points.
For a detailed explanation of synchronization points, see the execution model
<<execution-model-sync, Synchronization>> section.
The core of the OpenCL execution model is defined by how the kernels
execute.
When a kernel-enqueue command submits a kernel for execution, an index space
is defined.
The kernel, the argument values associated with the arguments to the kernel,
and the parameters that define the index space define a _kernel-instance_.
When a kernel-instance executes on a device, the kernel function executes
for each point in the defined index space.
Each of these executing kernel functions is called a _work-item_.
The work-items associated with a given kernel-instance are managed by the
device in groups called _work-groups_.
These work-groups define a coarse-grained decomposition of the index space.
Work-groups are further divided into _sub-groups_, which provide an
additional level of control over execution.
NOTE: Sub-groups are <<unified-spec, missing before>> version 2.1.
Work-items have a global ID based on their coordinates within the index
space.
They can also be defined in terms of their work-group and the local ID
within a work-group.
The details of this mapping are described in the following section.
=== Mapping work-items onto an NDRange
The index space supported by OpenCL is called an NDRange.
An NDRange is an N-dimensional index space, where N is one, two or three.
The NDRange is decomposed into work-groups forming blocks that cover the
index space.
An NDRange is defined by three integer arrays of length N:
* The extent of the index space (or global size) in each dimension.
* An offset index F indicating the initial value of the indices in each
dimension (zero by default).
* The size of a work-group (local size) in each dimension.
Each work-item's global ID is an N-dimensional tuple.
The global ID components are values in the range from F to F plus the
number of elements in that dimension minus one.
Unless a kernel comes from a source that disallows it, e.g. OpenCL C 1.x or
using `-cl-uniform-work-group-size`, the size of work-groups in
an NDRange (the local size) need not be the same for all work-groups.
In this case, any single dimension for which the global size is not
divisible by the local size will be partitioned into two regions.
One region will have work-groups that have the same number of work-items as
was specified for that dimension by the programmer (the local size).
The other region will have work-groups with less than the number of work
items specified by the local size parameter in that dimension (the
_remainder work-groups_).
Work-group sizes could be non-uniform in multiple dimensions, potentially
producing work-groups of up to 4 different sizes in a 2D range and 8
different sizes in a 3D range.
NOTE: Non-uniform work-group sizes are <<unified-spec, missing before>> version
2.0.
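The partitioning described above can be enumerated directly: each dimension in which the local size does not evenly divide the global size contributes a full-size region and a remainder region, and a work-group's size is one choice per dimension. A small sketch of this rule, with example sizes chosen arbitrarily:

```python
from itertools import product

def workgroup_sizes(global_size, local_size):
    """Return the set of distinct work-group sizes produced when each
    dimension splits into full work-groups plus an optional remainder."""
    per_dim = []
    for g, l in zip(global_size, local_size):
        sizes = [l]
        if g % l:                  # remainder region in this dimension
            sizes.append(g % l)
        per_dim.append(sizes)
    # A work-group's size is one choice of extent per dimension.
    return {s for s in product(*per_dim)}

# 2D example: a 10x7 global range with a requested 4x4 local size.
sizes = workgroup_sizes((10, 7), (4, 4))
print(sorted(sizes))  # [(2, 3), (2, 4), (4, 3), (4, 4)] -- 4 distinct sizes
```

With both dimensions non-uniform, the four combinations above realize the maximum of 4 different sizes in a 2D range; a 3D range analogously yields up to 8.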
Each work-item is assigned to a work-group and given a local ID to represent
its position within the work-group.
A work-item's local ID is an N-dimensional tuple with components in the
range from zero to the size of the work-group in that dimension minus one.
Work-groups are assigned IDs similarly.
The number of work-groups in each dimension is not directly defined but is
inferred from the local and global NDRanges provided when a kernel-instance
is enqueued.
A work-group's ID is an N-dimensional tuple with components in the range
from zero to the ceiling of the global size in that dimension divided by the
local size in the same dimension, minus one.
As a result, the combination of a work-group ID and the local-ID within a
work-group uniquely defines a work-item.
Each work-item is identifiable in two ways; in terms of a global index, and
in terms of a work-group index plus a local index within a work-group.
For example, consider the <<index-space-image, 2-dimensional index space>>
shown below.
We input the index space for the work-items (G~x~, G~y~), the size of each
work-group (S~x~, S~y~) and the global ID offset (F~x~, F~y~).
The global indices define a G~x~ by G~y~ index space where the total number
of work-items is the product of G~x~ and G~y~.
The local indices define an S~x~ by S~y~ index space where the number of
work-items in a single work-group is the product of S~x~ and S~y~.
Given the size of each work-group and the total number of work-items we can
compute the number of work-groups.
A 2-dimensional index space is used to uniquely identify a work-group.
Each work-item is identified by its global ID (_g_~x~, _g_~y~) or by the
combination of the work-group ID (_w_~x~, _w_~y~), the size of each
work-group (S~x~,S~y~) and the local ID (s~x~, s~y~) inside the work-group
such that
[none]
* (g~x~, g~y~) = (w~x~ {times} S~x~ + s~x~ + F~x~, w~y~ {times} S~y~ + s~y~ + F~y~)
The number of work-groups can be computed as:
[none]
* (W~x~, W~y~) = (ceil(G~x~ / S~x~), ceil(G~y~ / S~y~))
Given a global ID and the work-group size, the work-group ID for a work-item
is computed as:
[none]
* (w~x~, w~y~) = ( (g~x~ - s~x~ - F~x~) / S~x~, (g~y~ - s~y~ - F~y~) / S~y~ )
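The formulas above are inverses of one another; the following sketch checks them numerically in Python, using the names G, S, F and W from the text (the concrete values are arbitrary examples):

```python
from math import ceil

def global_id(w, s, S, F):
    """Global ID from work-group ID w, local ID s, local size S, offset F:
    g = w*S + s + F, per dimension."""
    return tuple(wi * Si + si + Fi for wi, si, Si, Fi in zip(w, s, S, F))

def workgroup_and_local_id(g, S, F):
    """Recover the (work-group ID, local ID) pair from a global ID."""
    s = tuple((gi - Fi) % Si for gi, Fi, Si in zip(g, F, S))
    w = tuple((gi - si - Fi) // Si for gi, si, Fi, Si in zip(g, s, F, S))
    return w, s

G, S, F = (12, 8), (4, 4), (0, 0)       # global size, local size, offset
W = tuple(ceil(Gi / Si) for Gi, Si in zip(G, S))
print(W)  # (3, 2) work-groups

g = global_id((2, 1), (3, 0), S, F)
print(g)  # (11, 4)
assert workgroup_and_local_id(g, S, F) == ((2, 1), (3, 0))
```

The round-trip assertion reflects the statement above that a work-group ID plus a local ID uniquely identifies a work-item.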
[[index-space-image]]
image::images/index_space.jpg[align="center", title="An example of an NDRange index space showing work-items, their global IDs and their mapping onto the pair of work-group and local IDs. In this case, we assume that in each dimension, the size of the work-group evenly divides the global NDRange size (i.e. all work-groups have the same size) and that the offset is equal to zero."]
Within a work-group work-items may be divided into sub-groups.
The mapping of work-items to sub-groups is implementation-defined and may be
queried at runtime.
While sub-groups may be used in multi-dimensional work-groups, each
sub-group is 1-dimensional and any given work-item may query which sub-group
it is a member of.
NOTE: Sub-groups are <<unified-spec, missing before>> version 2.1.
Work-items are mapped into sub-groups through a combination of compile-time
decisions and the parameters of the dispatch.
The mapping to sub-groups is invariant for the duration of a kernel's
execution, across dispatches of a given kernel with the same work-group
dimensions, between dispatches and query operations consistent with the
dispatch parameterization, and from one work-group to another within the
dispatch (excluding the trailing edge work-groups in the presence of
non-uniform work-group sizes).
In addition, all sub-groups within a work-group will be the same size, apart
from the sub-group with the maximum index which may be smaller if the size
of the work-group is not evenly divisible by the size of the sub-groups.
In the degenerate case, a single sub-group must be supported for each
work-group.
In this situation all sub-group scope functions are equivalent to their
work-group level equivalents.
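The sizing rule just stated, all sub-groups equal except possibly the one with the maximum index, can be sketched as a simple partition of a linearized work-group. The sub-group size itself is implementation-defined; the value 8 below is only an assumed example:

```python
def subgroup_sizes(work_group_size, sub_group_size):
    """Partition a linearized work-group into sub-groups: every sub-group
    has the same size except possibly the last, which may be smaller when
    the work-group size is not evenly divisible by the sub-group size."""
    full, rem = divmod(work_group_size, sub_group_size)
    sizes = [sub_group_size] * full
    if rem:
        sizes.append(rem)       # the maximum-index sub-group is smaller
    return sizes

print(subgroup_sizes(20, 8))    # [8, 8, 4]
print(subgroup_sizes(16, 16))   # [16] -- degenerate case: one sub-group
```

In the degenerate single-sub-group case the sub-group spans the whole work-group, which is why sub-group scope functions then behave like their work-group equivalents.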
=== Execution of kernel-instances
The work carried out by an OpenCL program occurs through the execution of
kernel-instances on compute devices.
To understand the details of OpenCL's execution model, we need to consider
how a kernel object moves from the kernel-enqueue command, into a
command-queue, executes on a device, and completes.
A kernel object is defined as a function within the program object and a
collection of arguments connecting the kernel to a set of argument values.
The host program enqueues a kernel object to the command queue along with
the NDRange and the work-group decomposition.
These define a _kernel-instance_.
In addition, an optional set of events may be defined when the kernel is
enqueued.
The events associated with a particular kernel-instance are used to
constrain when the kernel-instance is launched with respect to other
commands in the queue or to commands in other queues within the same
context.
A kernel-instance is submitted to a device.
For an in-order command queue, the kernel instances appear to launch and
then execute in that same order; where we use the term _appear_ to emphasize
that when there are no dependencies between commands and hence differences
in the order that commands execute cannot be observed in a program, an
implementation can reorder commands even in an in-order command queue.
For an out-of-order command-queue, kernel-instances wait to be launched
until:
* Synchronization commands enqueued prior to the kernel-instance are
satisfied.
* Each of the events in an optional event list defined when the
kernel-instance was enqueued are set to {CL_COMPLETE}.
Once these conditions are met, the kernel-instance is launched and the
work-groups associated with the kernel-instance are placed into a pool of
ready to execute work-groups.
This pool is called a _work-pool_.
The work-pool may be implemented in any manner as long as it assures that
work-groups placed in the pool will eventually execute.
The device schedules work-groups from the work-pool for execution on the
compute units of the device.
The kernel-enqueue command is complete when all work-groups associated with
the kernel-instance end their execution, updates to global memory associated
with a command are visible globally, and the device signals successful
completion by setting the event associated with the kernel-enqueue command
to {CL_COMPLETE}.
While a command-queue is associated with only one device, a single device
may be associated with multiple command-queues all feeding into the single
work-pool.
A device may also be associated with command queues associated with
different contexts within the same platform, again all feeding into the
single work-pool.
The device will pull work-groups from the work-pool and execute them on one
or several compute units in any order; possibly interleaving execution of
work-groups from multiple commands.
A conforming implementation may choose to serialize the work-groups so a
correct algorithm cannot assume that work-groups will execute in parallel.
There is no safe and portable way to synchronize across the independent
execution of work-groups since once in the work-pool, they can execute in
any order.
The work-items within a single sub-group execute concurrently but not
necessarily in parallel (i.e. they are not guaranteed to make independent
forward progress).
Therefore, only high-level synchronization constructs (e.g. sub-group
functions such as barriers) that apply to all the work-items in a sub-group
are well defined and included in OpenCL.
NOTE: Sub-groups are <<unified-spec, missing before>> version 2.1.
Sub-groups execute concurrently within a given work-group and with
appropriate device support (see <<platform-querying-devices, Querying
Devices>>), may make independent forward progress with respect to each
other, with respect to host threads and with respect to any entities
external to the OpenCL system but running on an OpenCL device, even in the
absence of work-group barrier operations.
In this situation, sub-groups are able to internally synchronize using
barrier operations without synchronizing with each other and may perform
operations that rely on runtime dependencies on operations other sub-groups
perform.
The work-items within a single work-group execute concurrently but are only
guaranteed to make independent progress in the presence of sub-groups and
device support.
In the absence of this capability, only high-level synchronization
constructs (e.g. work-group functions such as barriers) that apply to all
the work-items in a work-group are well defined and included in OpenCL for
synchronization within the work-group.
In the absence of synchronization functions (e.g. a barrier), work-items
within a sub-group may be serialized.
In the presence of sub-group functions, work-items within a sub-group may
be serialized before any given sub-group function, between dynamically
encountered pairs of sub-group functions, and between a sub-group function
and the end of the kernel.
In the absence of independent forward progress of constituent sub-groups,
work-items within a work-group may be serialized before, after or between
work-group synchronization functions.
[[device-side-enqueue]]
=== Device-side enqueue
NOTE: Device-side enqueue is <<unified-spec, missing before>> version 2.0.
Algorithms may need to generate additional work as they execute.
In many cases, this additional work cannot be determined statically; so the
work associated with a kernel only emerges at runtime as the kernel-instance
executes.
This capability could be implemented in logic running within the host
program, but involvement of the host may add significant overhead and/or
complexity to the application control flow.
A more efficient approach would be to nest kernel-enqueue commands from
inside other kernels.
This *nested parallelism* can be realized by supporting the enqueuing of
kernels on a device without direct involvement by the host program;
so-called *device-side enqueue*.
Device-side kernel-enqueue commands are similar to host-side kernel-enqueue
commands.
The kernel executing on a device (the *parent kernel*) enqueues a
kernel-instance (the *child kernel*) to a device-side command queue.
This is an out-of-order command-queue and follows the same behavior as the
out-of-order command-queues exposed to the host program.
Commands enqueued to a device side command-queue generate and use events to
enforce order constraints just as for the command-queue on the host.
These events, however, are only visible to the parent kernel running on the
device.
When these prerequisite events take on the value {CL_COMPLETE}, the
work-groups associated with the child kernel are launched into the device's
work-pool.
The device then schedules them for execution on the compute units of the
device.
Child and parent kernels execute asynchronously.
However, a parent will not indicate that it is complete by setting its event
to {CL_COMPLETE} until all child kernels have ended execution and have
signaled completion by setting any associated events to the value
{CL_COMPLETE}.
Should any child kernel complete with an event status set to a negative
value (i.e. abnormally terminate), the parent kernel will abnormally
terminate and propagate the child's negative event value as the value of the
parent's event.
If there are multiple children that have an event status set to a negative
value, the selection of which child's negative event value is propagated is
implementation-defined.
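The completion rule for device-side enqueue, a parent is not complete until all of its children are, and a child's abnormal termination propagates to the parent, can be sketched as follows. The status values are illustrative: in OpenCL, {CL_COMPLETE} is success and any negative event status indicates abnormal termination, but the specific negative value used below is arbitrary:

```python
CL_COMPLETE = 0     # success; negative values mean abnormal termination

def parent_status(own_status, child_statuses):
    """A parent kernel's final event status: its own result unless some
    child terminated abnormally, in which case one child's negative status
    propagates.  Which child's status is chosen is implementation-defined;
    this sketch simply takes the first."""
    for status in child_statuses:
        if status < 0:
            return status
    return own_status

print(parent_status(CL_COMPLETE, [CL_COMPLETE, CL_COMPLETE]))  # 0
print(parent_status(CL_COMPLETE, [CL_COMPLETE, -1]))           # -1
```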
[[execution-model-sync]]
=== Synchronization
Synchronization refers to mechanisms that constrain the order of execution
between two or more units of execution.
Consider the following three domains of synchronization in OpenCL:
* Work-group synchronization: Constraints on the order of execution for
work-items in a single work-group
* Sub-group synchronization: Constraints on the order of execution for
work-items in a single sub-group.
Note: Sub-groups are <<unified-spec, missing before>> version 2.1
* Command synchronization: Constraints on the order of commands launched
for execution
Synchronization across all work-items within a single work-group is carried
out using a _work-group function_.
These functions carry out collective operations across all the work-items in
a work-group.
Available collective operations are: barrier, reduction, broadcast, prefix
sum, and evaluation of a predicate.
A work-group function must occur within a converged control flow; i.e. all
work-items in the work-group must encounter precisely the same work-group
function.
For example, if a work-group function occurs within a loop, the work-items
must encounter the same work-group function in the same loop iterations.
All the work-items of a work-group must execute the work-group function and
complete reads and writes to memory before any are allowed to continue
execution beyond the work-group function.
Work-group functions that apply between work-groups are not provided in
OpenCL since OpenCL does not define forward-progress or ordering relations
between work-groups, hence collective synchronization operations are not
well defined.
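The collective operations listed above can be modelled as pure functions over the values held by the work-items of one work-group. This is a sequential sketch of their semantics only, not of how an implementation executes them, and the function names are invented for illustration:

```python
from itertools import accumulate

def wg_reduce(values):
    """Reduction: every work-item receives the combined (summed) value."""
    return sum(values)

def wg_broadcast(values, src):
    """Broadcast: every work-item receives the value held by item src."""
    return values[src]

def wg_scan_inclusive(values):
    """Inclusive prefix sum: work-item i receives sum(values[0..i])."""
    return list(accumulate(values))

def wg_any(predicates):
    """Predicate evaluation: true if the predicate holds for any item."""
    return any(predicates)

vals = [3, 1, 4, 1, 5]              # one value per work-item
print(wg_reduce(vals))              # 14
print(wg_broadcast(vals, 2))        # 4
print(wg_scan_inclusive(vals))      # [3, 4, 8, 9, 14]
print(wg_any(v > 4 for v in vals))  # True
```

Because these are collective, every work-item in the work-group must reach the same call site with converged control flow, which is exactly the requirement stated above.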
Synchronization across all work-items within a single sub-group is carried
out using a _sub-group function_.
These functions carry out collective operations across all the work-items in
a sub-group.
Available collective operations are: barrier, reduction, broadcast, prefix
sum, and evaluation of a predicate.
A sub-group function must occur within a converged control flow; i.e. all
work-items in the sub-group must encounter precisely the same sub-group
function.
For example, if a sub-group function occurs within a loop, the work-items
must encounter the same sub-group function in the same loop iterations.
All the work-items of a sub-group must execute the sub-group function and
complete reads and writes to memory before any are allowed to continue
execution beyond the sub-group function.
Synchronization between sub-groups must either be performed using work-group
functions, or through memory operations.
Synchronizing sub-groups through memory operations should be done carefully,
as forward progress of sub-groups relative to each other is only optionally
supported by OpenCL implementations.
Command synchronization is defined in terms of distinct *synchronization
points*.
The synchronization points occur between commands in host command-queues and
between commands in device-side command-queues.
The synchronization points defined in OpenCL include:
* *Launching a command:* A kernel-instance is launched onto a device after
all events that kernel is waiting-on have been set to {CL_COMPLETE}.
* *Ending a command:* Child kernels may be enqueued such that they wait
for the parent kernel to reach the _end_ state before they can be
launched.
In this case, the ending of the parent command defines a synchronization
point.
* *Completion of a command:* A kernel-instance is complete after all of
the work-groups in the kernel and all of its child kernels have
completed.
This is signaled to the host, a parent kernel or other kernels within
command queues by setting the value of the event associated with a
kernel to {CL_COMPLETE}.
* *Blocking Commands:* A blocking command defines a synchronization point
between the unit of execution that calls the blocking API function and
the enqueued command reaching the complete state.
* *Command-queue barrier:* The command-queue barrier ensures that all
previously enqueued commands have completed before subsequently enqueued
commands can be launched.
* {clFinish}: This function blocks until all previously enqueued commands
in the command queue have completed after which {clFinish} defines a
synchronization point and the {clFinish} function returns.
A synchronization point between a pair of commands (A and B) assures that
the results of command A happen-before command B is launched.
This requires that any updates to memory from command A complete and are
made available to other commands before the synchronization point completes.
Likewise, this requires that command B waits until after the synchronization
point before loading values from global memory.
The concept of a synchronization point works in a similar fashion for
commands, such as a barrier, that apply to two sets of commands.
All the commands prior to the barrier must complete and make their results
available to following commands.
Furthermore, any commands following the barrier must wait for the commands
prior to the barrier to complete before loading values and continuing their
execution.
These _happens-before_ relationships are a fundamental part of the OpenCL 2.x
memory model.
When applied at the level of commands, they are straightforward to define at
a language level in terms of ordering relationships between different
commands.
Ordering memory operations inside different commands, however, requires
rules more complex than can be captured by the high level concept of a
synchronization point.
These rules are described in detail in <<memory-ordering-rules, Memory
Ordering Rules>>.
=== Categories of Kernels
The OpenCL execution model supports three types of kernels:
* *OpenCL kernels* are managed by the OpenCL API as kernel objects
associated with kernel functions within program objects.
OpenCL program objects are created and built using OpenCL APIs.
The OpenCL API includes functions to query the kernel languages and
intermediate languages that may be used to create OpenCL program
objects for a device.
* *Native kernels* are accessed through a host function pointer.
Native kernels are queued for execution along with OpenCL kernels on a
device and share memory objects with OpenCL kernels.
For example, these native kernels could be functions defined in
application code or exported from a library.
The ability to execute native kernels is optional within OpenCL and the
semantics of native kernels are implementation-defined.
The OpenCL API includes functions to query capabilities of a device
to determine if this capability is supported.
* *Built-in kernels* are tied to a particular device and are not built at
runtime from source code in a program object.
The common use of built-in kernels is to expose fixed-function hardware
or firmware associated with a particular OpenCL device or custom device.
The semantics of a built-in kernel may be defined outside of OpenCL and
hence are implementation defined.
Note: Built-in kernels are <<unified-spec, missing before>> version 1.2.
All three types of kernels are manipulated through the OpenCL command queues
and must conform to the synchronization points defined in the OpenCL
execution model.
== Memory Model
The OpenCL memory model describes the structure, contents, and behavior of
the memory exposed by an OpenCL platform as an OpenCL program runs.
The model allows a programmer to reason about values in memory as the host
program and multiple kernel-instances execute.
An OpenCL program defines a context that includes a host, one or more
devices, command-queues, and memory exposed within the context.
Consider the units of execution involved with such a program.
The host program runs as one or more host threads managed by the operating
system running on the host (the details of which are defined outside of
OpenCL).
There may be multiple devices in a single context which all have access to
memory objects defined by OpenCL.
On a single device, multiple work-groups may execute in parallel with
potentially overlapping updates to memory.
Finally, within a single work-group, multiple work-items concurrently
execute, once again with potentially overlapping updates to memory.
The memory model must precisely define how the values in memory as seen from
each of these units of execution interact so a programmer can reason about
the correctness of OpenCL programs.
We define the memory model in four parts.
* Memory regions: The distinct memories visible to the host and the
devices that share a context.
* Memory objects: The objects defined by the OpenCL API and their
management by the host and devices.
* Shared Virtual Memory: A virtual address space exposed to both the host
and the devices within a context.
Note: SVM is <<unified-spec, missing before>> version 2.0.
* Consistency Model: Rules that define which values are observed when
multiple units of execution load data from memory plus the atomic/fence
operations that constrain the order of memory operations and define
synchronization relationships.
=== Fundamental Memory Regions
Memory in OpenCL is divided into two parts.
* *Host Memory:* The memory directly available to the host.
The detailed behavior of host memory is defined outside of OpenCL.
Memory objects move between the Host and the devices through functions
within the OpenCL API or through a shared virtual memory interface.
* *Device Memory:* Memory directly available to kernels executing on
OpenCL devices.
Device memory consists of four named address spaces or _memory regions_:
* *Global Memory:* This memory region permits read/write access to all
work-items in all work-groups running on any device within a context.
Work-items can read from or write to any element of a memory object.
Reads and writes to global memory may be cached depending on the
capabilities of the device.
* *Constant Memory*: A region of global memory that remains constant
during the execution of a kernel-instance.
The host allocates and initializes memory objects placed into constant
memory.
* *Local Memory*: A memory region local to a work-group.
This memory region can be used to allocate variables that are shared by
all work-items in that work-group.
* *Private Memory*: A region of memory private to a work-item.
Variables defined in one work-item's private memory are not visible to
another work-item.
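These four regions correspond to address-space qualifiers in the OpenCL C
kernel language. A minimal illustrative kernel (the kernel name and logic
are hypothetical) touching each region might look like:

```c
kernel void regions_demo(global float *out,        // global memory
                         constant float *coeffs,   // constant memory
                         local float *scratch)     // local memory
{
    size_t gid = get_global_id(0);
    size_t lid = get_local_id(0);
    float tmp = coeffs[0];         // tmp lives in private memory

    scratch[lid] = tmp * 2.0f;     // shared within the work-group
    barrier(CLK_LOCAL_MEM_FENCE);  // make scratch visible to the group

    out[gid] = scratch[lid];
}
```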
The <<memory-regions-image, memory regions>> and their relationship to the
OpenCL Platform model are summarized below.
Local and private memories are always associated with a particular device.
The global and constant memories, however, are shared between all devices
within a given context.
An OpenCL device may include a cache to support efficient access to these
shared memories.
To understand memory in OpenCL, it is important to appreciate the
relationships between these named address spaces.
The four named address spaces available to a device are disjoint, meaning
they do not overlap.
This is a logical relationship, however, and an implementation may choose to
let these disjoint named address spaces share physical memory.
Programmers often need functions callable from kernels where the pointers
manipulated by those functions can point to multiple named address spaces.
This saves a programmer from the error-prone and wasteful practice of
creating multiple copies of functions: one for each named address space.
Therefore the global, local and private address spaces belong to a single
_generic address space_.
This is closely modeled after the concept of a generic address space used in
the embedded C standard (ISO/IEC 9899:1999).
Since they all belong to a single generic address space, the following
properties are supported for pointers to named address spaces in device
memory:
* A pointer to the generic address space can be cast to a pointer to a
global, local or private address space.
* A pointer to a global, local or private address space can be cast to a
pointer to the generic address space.
* A pointer to a global, local or private address space can be implicitly
converted to a pointer to the generic address space, but the converse is
not allowed.
The constant address space is disjoint from the generic address space.
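As a sketch in the OpenCL C 2.x kernel language (the helper name is
hypothetical), a single function can accept pointers into any of these
address spaces because an unqualified pointer parameter is a pointer to the
generic address space:

```c
// One generic-pointer function serves global, local and private data.
float sum3(float *p)
{
    return p[0] + p[1] + p[2];
}

kernel void use_sum3(global float *g, local float *l)
{
    float priv[3] = {1.0f, 2.0f, 3.0f};

    // Implicit named-to-generic conversion happens at each call:
    float a = sum3(g);
    float b = sum3(l);
    float c = sum3(priv);

    g[get_global_id(0)] = a + b + c;
}
```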
NOTE: The generic address space is <<unified-spec, missing before>> version
2.0.
The addresses of memory associated with memory objects in Global memory are
not preserved between kernel instances, between a device and the host, and
between devices.
In this regard global memory acts as a global pool of memory objects rather
than an address space.
This restriction is relaxed when shared virtual memory (SVM) is used.
NOTE: Shared virtual memory is <<unified-spec, missing before>> version 2.0.
SVM causes addresses to be meaningful between the host and all of the
devices within a context hence supporting the use of pointer based data
structures in OpenCL kernels.
It logically extends a portion of the global memory into the host address
space giving work-items access to the host address space.
On platforms with hardware support for a shared address space between the
host and one or more devices, SVM may also provide a more efficient way to
share data between devices and the host.
Details about SVM are presented in <<shared-virtual-memory, Shared Virtual
Memory>>.
[[memory-regions-image]]
image::images/memory_regions.png[align="center", title="The named address spaces exposed in an OpenCL Platform. Global and Constant memories are shared between the one or more devices within a context, while local and private memories are associated with a single device. Each device may include an optional cache to support efficient access to their view of the global and constant address spaces."]
A programmer may use the features of the <<memory-consistency-model, memory
consistency model>> to manage safe access to global memory from multiple
work-items potentially running on one or more devices.
In addition, when using shared virtual memory (SVM), the memory consistency
model may also be used to ensure that host threads safely access memory
locations in the shared memory region.
=== Memory Objects
The contents of global memory are _memory objects_.
A memory object is a handle to a reference counted region of global memory.
Memory objects use the OpenCL type _cl_mem_ and fall into three distinct
classes.
* *Buffer*: A memory object stored as a block of contiguous memory and
used as a general purpose object to hold data used in an OpenCL program.
The types of the values within a buffer may be any of the built in types
(such as int, float), vector types, or user-defined structures.
The buffer can be manipulated through pointers much as one would with
any block of memory in C.
* *Image*: An image memory object holds one, two or three dimensional
images.
The formats are based on the standard image formats used in graphics
applications.
An image is an opaque data structure managed by functions defined in the
OpenCL API.
To optimize the manipulation of images stored in the texture memories
found in many GPUs, OpenCL kernels have traditionally been disallowed
from both reading and writing a single image.
In OpenCL 2.0, however, we have relaxed this restriction by providing
synchronization and fence operations that let programmers properly
synchronize their code to safely allow a kernel to read and write a
single image.
* *Pipe*: The _pipe_ memory object conceptually is an ordered sequence of
data items.
A pipe has two endpoints: a write endpoint into which data items are
inserted, and a read endpoint from which data items are removed.
At any one time, only one kernel instance may write into a pipe, and
only one kernel instance may read from a pipe.
To support the producer-consumer design pattern, one kernel instance
connects to the write endpoint (the producer) while another kernel
instance connects to the read endpoint (the consumer).
Note: The _pipe_ memory object is <<unified-spec, missing before>>
version 2.0.
Memory objects are allocated by host APIs.
The host program can provide the runtime with a pointer to a block of
contiguous memory to hold the memory object when the object is created
({CL_MEM_USE_HOST_PTR}).
Alternatively, the physical memory can be managed by the OpenCL runtime and
not be directly accessible to the host program.
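As a hedged host-side sketch of these two allocation styles (assuming a
valid context `ctx`; error handling omitted):

```c
cl_int err;

// Runtime-managed allocation: physical memory owned by the runtime.
cl_mem dev_buf = clCreateBuffer(ctx, CL_MEM_READ_WRITE,
                                1024 * sizeof(float), NULL, &err);

// Use an existing block of host memory (CL_MEM_USE_HOST_PTR).
float host_data[1024];
cl_mem host_buf = clCreateBuffer(ctx,
                                 CL_MEM_READ_ONLY | CL_MEM_USE_HOST_PTR,
                                 sizeof(host_data), host_data, &err);
```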
Allocation and access to memory objects within the different memory regions
varies between the host and work-items running on a device.
This is summarized in the <<memory-regions-table, Memory Regions>> table,
which describes whether the kernel or the host can allocate from a memory
region, the type of allocation (static at compile time vs.
dynamic at runtime) and the type of access allowed (i.e. whether the kernel
or the host can read and/or write to a memory region).
[[memory-regions-table]]
.Memory Regions
[cols="2,2,3,3,3,3",options="header"]
|====
| | | Global | Constant | Local | Private
.2+| *Host*
| Allocation
| Dynamic
| Dynamic
| Dynamic
| None
| Access
| Read/Write to Buffers and Images, but not Pipes
| Read/Write
| None
| None
.2+| *Kernel*
| Allocation
| Static (program scope variables)
| Static (program scope variables)
| Static for parent kernel,
Dynamic for child kernels
| Static
| Access
| Read/Write
| Read-only
| Read/Write,
No access to child kernel memory
| Read/Write
|====
The <<memory-regions-table, Memory Regions>> table shows the different
memory regions in OpenCL and how memory objects are allocated and accessed
by the host and by an executing instance of a kernel.
For kernels, we distinguish between the behavior of local memory
for a parent kernel and its child kernels.
Once allocated, a memory object is made available to kernel-instances
running on one or more devices.
In addition to <<shared-virtual-memory, Shared Virtual Memory>>, there are
three basic ways to manage the contents of buffers between the host and
devices.
* *Read/Write/Fill commands*: The data associated with a memory object is
explicitly read and written between the host and global memory regions
using commands enqueued to an OpenCL command queue.
Note: Fill commands are <<unified-spec, missing before>> version 1.2.
* *Map/Unmap commands*: Data from the memory object is mapped into a
contiguous block of memory accessed through a host accessible pointer.
The host program enqueues a _map_ command on a block of a memory object
before that block can be safely manipulated by the host program.
When the host program is finished working with the block of memory, the
host program enqueues an _unmap_ command to allow a kernel-instance to
safely read and/or write the buffer.
* *Copy commands:* The data associated with a memory object is copied
between two buffers, each of which may reside either on the host or on
the device.
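A hedged sketch of the map/unmap pattern from host code (assuming `queue`
and `buf` exist; error handling omitted):

```c
cl_int err;

// Map: after this blocking call returns, the host owns the mapped block.
float *ptr = (float *)clEnqueueMapBuffer(queue, buf, CL_TRUE, CL_MAP_WRITE,
                                         0, 1024 * sizeof(float),
                                         0, NULL, NULL, &err);
for (int i = 0; i < 1024; ++i)
    ptr[i] = (float)i;             // safe host access while mapped

// Unmap: hands the block back so kernel-instances can use it safely.
clEnqueueUnmapMemObject(queue, buf, ptr, 0, NULL, NULL);
```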
With Read/Write/Map commands, the operations can be blocking or
non-blocking.
The OpenCL function call for a blocking memory transfer returns once the
command (memory transfer) has completed. At this point the associated memory
resources on the host can be safely reused, and subsequent operations on the
host are guaranteed that the transfer has already completed.
For a non-blocking memory transfer, the OpenCL function call returns as soon
as the command is enqueued.
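For instance (a sketch; `queue`, `buf`, `size`, and `result` assumed to
exist), the same read can be issued either way:

```c
// Blocking: safe to use `result` as soon as the call returns.
clEnqueueReadBuffer(queue, buf, CL_TRUE, 0, size, result, 0, NULL, NULL);

// Non-blocking: returns immediately; `result` must not be touched
// until the returned event reaches CL_COMPLETE.
cl_event done;
clEnqueueReadBuffer(queue, buf, CL_FALSE, 0, size, result, 0, NULL, &done);
clWaitForEvents(1, &done);
```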
Memory objects are bound to a context and hence can appear in multiple
kernel-instances running on more than one physical device.
The OpenCL platform must support a large range of hardware platforms
including systems that do not support a single shared address space in
hardware; hence the ways memory objects can be shared between
kernel-instances are restricted.
The basic principle is that multiple read operations on memory objects from
multiple kernel-instances that overlap in time are allowed, but mixing
overlapping reads and writes into the same memory objects from different
kernel instances is only allowed when fine grained synchronization is used
with <<shared-virtual-memory, Shared Virtual Memory>>.
When global memory is manipulated by multiple kernel-instances running on
multiple devices, the OpenCL runtime system must manage the association of
memory objects with a given device.
In most cases the OpenCL runtime will implicitly associate a memory object
with a device.
A kernel instance is naturally associated with the command queue to which
the kernel was submitted.
Since a command-queue can only access a single device, the queue uniquely
defines which device is involved with any given kernel-instance; hence
defining a clear association between memory objects, kernel-instances and
devices.
Programmers may anticipate these associations in their programs and
explicitly manage association of memory objects with devices in order to
improve performance.
[[shared-virtual-memory]]
=== Shared Virtual Memory
IMPORTANT: Shared virtual memory is <<unified-spec, missing before>>
version 2.0.
OpenCL extends the global memory region into the host memory region through
a shared virtual memory (SVM) mechanism.
There are three types of SVM in OpenCL:
* *Coarse-Grained buffer SVM*: Sharing occurs at the granularity of
regions of OpenCL buffer memory objects.
Consistency is enforced at synchronization points and with map/unmap
commands to drive updates between the host and the device.
This form of SVM is similar to non-SVM use of memory; however, it lets
kernel-instances share pointer-based data structures (such as
linked-lists) with the host program.
Program scope global variables are treated as per-device coarse-grained
SVM for addressing and sharing purposes.
* *Fine-Grained buffer SVM*: Sharing occurs at the granularity of
individual loads/stores into bytes within OpenCL buffer memory objects.
Loads and stores may be cached.
This means consistency is guaranteed at synchronization points.
If the optional OpenCL atomics are supported, they can be used to
provide fine-grained control of memory consistency.
* *Fine-Grained system SVM*: Sharing occurs at the granularity of
individual loads/stores into bytes occurring anywhere within the host
memory.
Loads and stores may be cached so consistency is guaranteed at
synchronization points.
If the optional OpenCL atomics are supported, they can be used to
provide fine-grained control of memory consistency.
[[svm-summary-table]]
.A summary of shared virtual memory (SVM) options in OpenCL
[width="100%",cols="^,^,^,^,^",options="header"]
|====
| | Granularity of sharing | Memory Allocation | Mechanisms to enforce Consistency | Explicit updates between host and device
| Non-SVM buffers
| OpenCL Memory objects(buffer)
| {clCreateBuffer} +
{clCreateBufferWithProperties}
| Host synchronization points on the same device or between devices.
| yes, through Map and Unmap commands.
| Coarse-Grained buffer SVM
| OpenCL Memory objects (buffer)
| {clSVMAlloc}
| Host synchronization points between devices
| yes, through Map and Unmap commands.
| Fine-Grained buffer SVM
| Bytes within OpenCL Memory objects (buffer)
| {clSVMAlloc}
| Synchronization points plus atomics (if supported)
| No
| Fine-Grained system SVM
| Bytes within Host memory (system)
| Host memory allocation mechanisms (e.g. malloc)
| Synchronization points plus atomics (if supported)
| No
|====
Coarse-Grained buffer SVM is required in the core OpenCL specification.
The two finer grained approaches are optional features in OpenCL.
The various SVM mechanisms to access host memory from the work-items
associated with a kernel instance are <<svm-summary-table, summarized
above>>.
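A hedged sketch of coarse-grained buffer SVM from host code (assuming a
context `ctx`, a queue `queue`, and a kernel object `kernel` exist; error
handling omitted). Host access is bracketed by map/unmap commands, and the
same pointer value is meaningful on both host and device:

```c
// Allocate a coarse-grained SVM buffer shared with the devices in ctx.
float *svm = (float *)clSVMAlloc(ctx, CL_MEM_READ_WRITE,
                                 1024 * sizeof(float), 0);

// Host writes must be bracketed by map/unmap for coarse-grained SVM.
clEnqueueSVMMap(queue, CL_TRUE, CL_MAP_WRITE, svm,
                1024 * sizeof(float), 0, NULL, NULL);
for (int i = 0; i < 1024; ++i)
    svm[i] = 0.0f;
clEnqueueSVMUnmap(queue, svm, 0, NULL, NULL);

// Pass the same pointer to a kernel instance.
clSetKernelArgSVMPointer(kernel, 0, svm);

// Later, when no command is using the allocation:
clSVMFree(ctx, svm);
```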
=== Memory Consistency Model for OpenCL 1.x
IMPORTANT: This memory consistency model is <<unified-spec, deprecated
by>> version 2.0.
OpenCL 1.x uses a relaxed consistency memory model; i.e. the state of memory
visible to a work-item is not guaranteed to be consistent across the collection
of work-items at all times.
Within a work-item, memory has load/store consistency.
Local memory is consistent across work-items in a single work-group at a
work-group barrier.
Global memory is consistent across work-items in a single work-group at a
work-group barrier, but there are no guarantees of memory consistency between
different work-groups executing a kernel.
Memory consistency for memory objects shared between enqueued commands is
enforced at a synchronization point.
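For example (an illustrative OpenCL 1.x kernel), the work-group barrier is
the point at which local memory writes become visible to the rest of the
work-group:

```c
__kernel void reverse_in_group(__global float *data, __local float *tmp)
{
    size_t lid = get_local_id(0);
    size_t lsz = get_local_size(0);
    size_t gid = get_global_id(0);

    tmp[lid] = data[gid];
    // Local memory is consistent across the work-group only here:
    barrier(CLK_LOCAL_MEM_FENCE);
    data[gid] = tmp[lsz - 1 - lid];
}
```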
[[memory-consistency-model]]
=== Memory Consistency Model for OpenCL 2.x
IMPORTANT: This memory consistency model is <<unified-spec, missing
before>> version 2.0.
The OpenCL 2.x memory model tells programmers what they can expect from an
OpenCL 2.x implementation; which memory operations are guaranteed to happen in
which order and which memory values each read operation will return.
The memory model tells compiler writers which restrictions they must follow
when implementing compiler optimizations; which variables they can cache in
registers and when they can move reads or writes around a barrier or atomic
operation.
The memory model also tells hardware designers about limitations on hardware
optimizations; for example, when they must flush or invalidate hardware
caches.
The memory consistency model in OpenCL 2.x is based on the memory model from
the ISO C11 programming language.
To help make the presentation more precise and self-contained, we include
modified paragraphs taken verbatim from the ISO C11 international standard.
When a paragraph is taken or modified from the C11 standard, it is
identified as such along with its original location in the <<iso-c11,C11
standard>>.
For programmers, the most intuitive model is the _sequential consistency_
memory model.
Sequential consistency interleaves the steps executed by each of the units
of execution.
Each access to a memory location sees the last assignment to that location
in that interleaving.
While sequential consistency is relatively straightforward for a programmer
to reason about, implementing sequential consistency is expensive.
Therefore, OpenCL 2.x implements a relaxed memory consistency model; i.e. it is
possible to write programs where the loads from memory violate sequential
consistency.
Fortunately, if a program does not contain any races and if the program only
uses atomic operations that utilize the sequentially consistent memory order
(the default memory ordering for OpenCL 2.x), OpenCL programs appear to execute
with sequential consistency.
Programmers can, to some degree, control how the memory model is relaxed by
choosing the memory order for synchronization operations.
The precise semantics of synchronization and the memory orders are formally
defined in <<memory-ordering-rules, Memory Ordering Rules>>.
Here, we give a high level description of how these memory orders apply to
atomic operations on atomic objects shared between units of execution.
OpenCL 2.x memory_order choices are based on those from the ISO C11 standard
memory model.
They are specified in certain OpenCL functions through the following
enumeration constants:
* *memory_order_relaxed*: implies no order constraints.
This memory order can be used safely to increment counters that are
concurrently incremented, but it doesn't guarantee anything about the
ordering with respect to operations to other memory locations.
It can also be used, for example, to do ticket allocation and by expert
programmers implementing lock-free algorithms.
* *memory_order_acquire*: A synchronization operation (fence or atomic)
that has acquire semantics "acquires" side-effects from a release
operation that synchronises with it: if an acquire synchronises with a
release, the acquiring unit of execution will see all side-effects
preceding that release (and possibly subsequent side-effects). As part
of carefully-designed protocols, programmers can use an "acquire" to
safely observe the work of another unit of execution.
* *memory_order_release*: A synchronization operation (fence or atomic
operation) that has release semantics "releases" side effects to an
acquire operation that synchronises with it.
All side effects that precede the release are included in the release.
As part of carefully-designed protocols, programmers can use a "release"
to make changes made in one unit of execution visible to other units of
execution.
NOTE: In general, an acquire is not required to synchronise with any
particular release.
However, synchronisation can be forced by certain executions.
See the description of <<memory-ordering-fence, Fence Operations>> for
detailed rules for when synchronisation must occur.
* *memory_order_acq_rel*: A synchronization operation with acquire-release
semantics has the properties of both the acquire and release memory
orders.
It is typically used to order read-modify-write operations.
* *memory_order_seq_cst*: The loads and stores of each unit of execution
appear to execute in program (i.e., sequenced-before) order, and the
loads and stores from different units of execution appear to be simply
interleaved.
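Since the OpenCL 2.x memory orders mirror those of ISO C11, the acquire and
release orders can be illustrated in plain C11 with `<stdatomic.h>`. The
sketch below (function and variable names are this document's invention, not
spec API) shows the API shapes; in real use the release and acquire sides
run in different units of execution:

```c
#include <stdatomic.h>

static int payload;             // plain, non-atomic data
static atomic_int flag = 0;     // synchronization variable
static atomic_int hits = 0;     // statistics counter

/* memory_order_relaxed: fine for a concurrently incremented counter,
 * but implies no ordering with respect to other locations. */
void count_hit(void)
{
    atomic_fetch_add_explicit(&hits, 1, memory_order_relaxed);
}

int hit_count(void)
{
    return atomic_load_explicit(&hits, memory_order_relaxed);
}

/* Writer: store the payload, then "release" it through the flag. */
void publish(int value)
{
    payload = value;                                   /* ordinary store */
    atomic_store_explicit(&flag, 1, memory_order_release);
}

/* Reader: "acquire" the flag; if it is set, the acquire synchronises
 * with the release and the payload store is visible. */
int try_consume(void)
{
    if (atomic_load_explicit(&flag, memory_order_acquire) == 1)
        return payload;         /* sees everything before the release */
    return -1;                  /* not yet published */
}
```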
Regardless of which memory_order is specified, resolving constraints on
memory operations across a heterogeneous platform adds considerable overhead
to the execution of a program.
An OpenCL platform may be able to optimize certain operations that depend on
the features of the memory consistency model by restricting the scope of the
memory operations.
Distinct memory scopes are defined by the values of the memory_scope
enumeration constant:
* *memory_scope_work_item*: memory-ordering constraints only apply within
the work-item footnote:[{fn-image-mem-fence}].
* *memory_scope_sub_group*: memory-ordering constraints only apply within
the sub-group.
* *memory_scope_work_group*: memory-ordering constraints only apply to
work-items executing within a single work-group.
* *memory_scope_device*: memory-ordering constraints only apply to
work-items executing on a single device.
* *memory_scope_all_svm_devices*: memory-ordering constraints apply to
work-items executing across multiple devices and (when using SVM) the
host.
A release performed with *memory_scope_all_svm_devices* to a buffer that
does not have the {CL_MEM_SVM_ATOMICS} flag set will commit to at least
*memory_scope_device* visibility, with full synchronization of the
buffer at a queue synchronization point (e.g. an OpenCL event).
These memory scopes define a hierarchy of visibilities when analyzing the
ordering constraints of memory operations.
For example if a programmer knows that a sequence of memory operations will
only be associated with a collection of work-items from a single work-group
(and hence will run on a single device), the implementation is spared the
overhead of managing the memory orders across other devices within the same
context.
This can substantially reduce overhead in a program.
All memory scopes are valid when used on global memory or local memory.
For local memory, all visibility is constrained to within a given work-group
and scopes wider than *memory_scope_work_group* carry no additional meaning.
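For example (an illustrative OpenCL C 2.x kernel; the kernel is this
document's invention), an atomic counter whose ordering only needs to be
observed within a work-group can say so, sparing the implementation the
cost of device-wide ordering:

```c
kernel void tally(global int *result, local atomic_int *group_count)
{
    if (get_local_id(0) == 0)
        atomic_store_explicit(group_count, 0,
                              memory_order_relaxed,
                              memory_scope_work_group);
    work_group_barrier(CLK_LOCAL_MEM_FENCE);

    // Ordering only needs to hold inside this work-group:
    atomic_fetch_add_explicit(group_count, 1,
                              memory_order_acq_rel,
                              memory_scope_work_group);
    work_group_barrier(CLK_LOCAL_MEM_FENCE);

    if (get_local_id(0) == 0)
        result[get_group_id(0)] =
            atomic_load_explicit(group_count,
                                 memory_order_relaxed,
                                 memory_scope_work_group);
}
```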
In the following subsections (leading up to <<opencl-framework, OpenCL
Framework>>), we will explain the synchronization constructs and detailed
rules needed to use the OpenCL 2.x relaxed memory model.
It is important to appreciate, however, that many programs do not benefit
from relaxed memory models.
Even expert programmers have a difficult time using atomics and fences to
write correct programs with relaxed memory models.
A large number of OpenCL programs can be written using a simplified memory
model.
This is accomplished by following these guidelines:
* Write programs that manage safe sharing of global memory objects through
the synchronization points defined by the command queues.
* Restrict low level synchronization inside work-groups to the work-group
functions such as barrier.
* If you want sequential consistency behavior with system allocations or
fine-grain SVM buffers with atomics support, use only
*memory_order_seq_cst* operations with the scope
*memory_scope_all_svm_devices*.
* If you want sequential consistency behavior when not using system
allocations or fine-grain SVM buffers with atomics support, use only
*memory_order_seq_cst* operations with the scope *memory_scope_device*
or *memory_scope_all_svm_devices*.
* Ensure your program has no races.
If these guidelines are followed in your OpenCL programs, you can skip the
detailed rules behind the relaxed memory models and go directly to
<<opencl-framework, OpenCL Framework>>.
=== Overview of atomic and fence operations
OpenCL 2.x has a number of _synchronization operations_ that are used to define
memory order constraints in a program.
They play a special role in controlling how memory operations in one unit of
execution (such as a work-item or, when using SVM, a host thread) are made
visible to another.
There are two types of synchronization operations in OpenCL: _atomic
operations_ and _fences_.
Atomic operations are indivisible.
They either occur completely or not at all.
These operations are used to order memory operations between units of
execution and hence they are parameterized with the memory_order and
memory_scope parameters defined by the OpenCL memory consistency model.
The atomic operations for OpenCL kernel languages are similar to the
corresponding operations defined by the C11 standard.
The OpenCL 2.x atomic operations apply to variables of an atomic type (a
subset of those in the C11 standard) including atomic versions of the int,
uint, long, ulong, float, double, half, intptr_t, uintptr_t, size_t, and
ptrdiff_t types.
However, support for some of these atomic types depends on support for the
corresponding regular types.
An atomic operation on one or more memory locations is either an acquire
operation, a release operation, or both an acquire and release operation.
An atomic operation without an associated memory location is a fence and can
be either an acquire fence, a release fence, or both an acquire and release
fence.
In addition, there are relaxed atomic operations, which do not have
synchronization properties, and atomic read-modify-write operations, which
have special characteristics.
<<iso-c11,[C11 standard, Section 5.1.2.4, paragraph 5, modified.]>>
The orders *memory_order_acquire* (used for reads), *memory_order_release*
(used for writes), and *memory_order_acq_rel* (used for read-modify-write
operations) are used for simple communication between units of execution
using shared variables.
Informally, executing a *memory_order_release* on an atomic object A makes
all previous side effects visible to any unit of execution that later
executes a *memory_order_acquire* on A.
The orders *memory_order_acquire*, *memory_order_release*, and
*memory_order_acq_rel* do not provide sequential consistency for race-free
programs because they will not ensure that atomic stores followed by atomic
loads become visible to other threads in that order.
[[atomic-fence-orders]]
The fence operation is atomic_work_item_fence, which includes a memory_order
argument as well as the memory_scope and cl_mem_fence_flags arguments.
Depending on the memory_order argument, this operation:
* has no effects, if *memory_order_relaxed*;
* is an acquire fence, if *memory_order_acquire*;
* is a release fence, if *memory_order_release*;
* is both an acquire fence and a release fence, if *memory_order_acq_rel*;
* is a sequentially-consistent fence with both acquire and release
semantics, if *memory_order_seq_cst*.
If specified, the cl_mem_fence_flags argument must be `CLK_IMAGE_MEM_FENCE`,
`CLK_GLOBAL_MEM_FENCE`, `CLK_LOCAL_MEM_FENCE`, or `CLK_GLOBAL_MEM_FENCE |
CLK_LOCAL_MEM_FENCE`.
The `atomic_work_item_fence(CLK_IMAGE_MEM_FENCE, ...)` built-in function must be
used to make sure that sampler-less writes are visible to later reads by the
same work-item.
Without use of the atomic_work_item_fence function, write-read coherence on
image objects is not guaranteed: if a work-item reads from an image to which
it has previously written without an intervening atomic_work_item_fence, it
is not guaranteed that those previous writes are visible to the work-item.
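An illustrative kernel fragment (the kernel is this document's invention)
showing the required fence between a sampler-less write and a later read of
the same image by the same work-item:

```c
kernel void update_pixel(read_write image2d_t img)
{
    int2 pos = (int2)(get_global_id(0), get_global_id(1));
    float4 v = read_imagef(img, pos);      // sampler-less read

    write_imagef(img, pos, v * 0.5f);

    // Required so the write above is visible to the read below
    // within this same work-item:
    atomic_work_item_fence(CLK_IMAGE_MEM_FENCE,
                           memory_order_acq_rel,
                           memory_scope_work_item);

    float4 check = read_imagef(img, pos);
    write_imagef(img, pos, check);
}
```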
The synchronization operations in OpenCL 2.x can be parameterized by a
memory_scope.
Memory scopes control the extent that an atomic operation or fence is
visible with respect to the memory model.
These memory scopes may be used when performing atomic operations and fences
on global memory and local memory.
When used on global memory, visibility is bounded by the capabilities of
that memory.
When used on a fine-grained non-atomic SVM buffer, a coarse-grained SVM
buffer, or a non-SVM buffer, operations parameterized with
*memory_scope_all_svm_devices* will behave as if they were parameterized
with *memory_scope_device*.
When used on local memory, visibility is bounded by the work-group and, as a
result, memory_scope with wider visibility than *memory_scope_work_group*
will be reduced to *memory_scope_work_group*.
Two actions *A* and *B* are defined to have an inclusive scope if they have
the same scope *P* such that:
* *P* is *memory_scope_sub_group* and *A* and *B* are executed by
work-items within the same sub-group.
* *P* is *memory_scope_work_group* and *A* and *B* are executed by
work-items within the same work-group.
* *P* is *memory_scope_device* and *A* and *B* are executed by work-items
on the same device when *A* and *B* apply to an SVM allocation or *A*
and *B* are executed by work-items in the same kernel or one of its
children when *A* and *B* apply to a {cl_mem_TYPE} buffer.
* *P* is *memory_scope_all_svm_devices* if *A* and *B* are executed by
host threads or by work-items on one or more devices that can share SVM
memory with each other and the host process.
[[memory-ordering-rules]]
=== Memory Ordering Rules
Fundamentally, the issue in a memory model is to understand the orderings in
time of modifications to objects in memory.
Modifying an object or calling a function that modifies an object are side
effects, i.e. changes in the state of the execution environment.
Evaluation of an expression in general includes both value computations and
initiation of side effects.
Value computation for an lvalue expression includes determining the identity
of the designated object.
<<iso-c11,[C11 standard, Section 5.1.2.3, paragraph 2, modified.]>>
We assume that the OpenCL kernel language and host programming languages
have a sequenced-before relation between the evaluations executed by a
single unit of execution.
This sequenced-before relation is an asymmetric, transitive, pair-wise
relation between those evaluations, which induces a partial order among
them.
Given any two evaluations *A* and *B*, if *A* is sequenced-before *B*, then
the execution of *A* shall precede the execution of *B*.
(Conversely, if *A* is sequenced-before *B*, then *B* is sequenced-after
*A*.) If *A* is not sequenced-before or sequenced-after *B*, then *A* and
*B* are unsequenced.
Evaluations *A* and *B* are indeterminately sequenced when *A* is either
sequenced-before or sequenced-after *B*, but it is unspecified which.
<<iso-c11,[C11 standard, Section 5.1.2.3, paragraph 3, modified.]>>
NOTE: Sequenced-before is a partial order of the operations executed by a
single unit of execution (e.g. a host thread or work-item).
It generally corresponds to the source program order of those operations, and
is partial because of the unspecified argument evaluation order of the OpenCL C
kernel language.
In an OpenCL kernel language, the value of an object visible to a work-item
W at a particular point is the initial value of the object, a value stored
in the object by W, or a value stored in the object by another work-item or
host thread, according to the rules below.
Depending on details of the host programming language, the value of an
object visible to a host thread may also be the value stored in that object
by another work-item or host thread.
<<iso-c11,[C11 standard, Section 5.1.2.4, paragraph 2, modified.]>>
Two expression evaluations conflict if one of them modifies a memory
location and the other one reads or modifies the same memory location.
<<iso-c11,[C11 standard, Section 5.1.2.4, paragraph 4.]>>
All modifications to a particular atomic object *M* occur in some particular
total order, called the modification order of *M*.
If *A* and *B* are modifications of an atomic object *M*, and *A*
happens-before *B* (where the happens-before relation is defined below),
then *A* shall precede *B* in the modification order of *M*.
Note that the modification order of an atomic object *M* is independent of
whether *M* is in local or global memory.
<<iso-c11,[C11 standard, Section 5.1.2.4, paragraph 7, modified.]>>
A release sequence begins with a release operation *A* on an atomic object
*M* and is the maximal contiguous sub-sequence of side effects in the
modification order of *M*, where the first operation is *A* and every
subsequent operation either is performed by the same work-item or host
thread that performed the release or is an atomic read-modify-write
operation.
<<iso-c11,[C11 standard, Section 5.1.2.4, paragraph 10, modified.]>>
OpenCL's local and global memories are disjoint.
Kernels may access both kinds of memory while host threads may only access
global memory.
Furthermore, the _flags_ argument of OpenCL's work_group_barrier function
specifies which memory operations the function will make visible: these
memory operations can be, for example, just the ones to local memory, or the
ones to global memory, or both.
Since the visibility of memory operations can be specified for local memory
separately from global memory, we define two related but independent
relations, _global-synchronizes-with_ and _local-synchronizes-with_.
Certain operations on global memory may global-synchronize-with other
operations performed by another work-item or host thread.
An example is a release atomic operation in one work-item that
global-synchronizes-with an acquire atomic operation in a second work-item.
Similarly, certain atomic operations on local objects in kernels can
local-synchronize-with other atomic operations on those local objects.
<<iso-c11,[C11 standard, Section 5.1.2.4, paragraph 11, modified.]>>
We define two separate happens-before relations: global-happens-before and
local-happens-before.
A global memory action *A* global-happens-before a global memory action *B*
if
* *A* is sequenced before *B*, or
* *A* global-synchronizes-with *B*, or
* For some global memory action *C*, *A* global-happens-before *C* and *C*
global-happens-before *B*.
A local memory action *A* local-happens-before a local memory action *B* if
* *A* is sequenced before *B*, or
* *A* local-synchronizes-with *B*, or
* For some local memory action *C*, *A* local-happens-before *C* and *C*
local-happens-before *B*.
An OpenCL 2.x implementation shall ensure that no program execution
demonstrates a cycle in either the local-happens-before relation or the
global-happens-before relation.
NOTE: The global- and local-happens-before relations are critical to
defining what values are read and when data races occur.
The global-happens-before relation, for example, defines what global memory
operations definitely happen before what other global memory operations.
If an operation *A* global-happens-before operation *B* then *A* must occur
before *B*; in particular, any write done by *A* will be visible to *B*.
The local-happens-before relation has similar properties for local memory.
Programmers can use the local- and global-happens-before relations to reason
about the order of program actions.
A visible side effect *A* on a global object *M* with respect to a value
computation *B* of *M* satisfies the conditions:
* *A* global-happens-before *B*, and
* there is no other side effect *X* to *M* such that *A*
global-happens-before *X* and *X* global-happens-before *B*.
We define visible side effects for local objects *M* similarly.
The value of a non-atomic scalar object *M*, as determined by evaluation
*B*, shall be the value stored by the visible side effect *A*.
<<iso-c11,[C11 standard, Section 5.1.2.4, paragraph 19, modified.]>>
The execution of a program contains a data race if it contains two
conflicting actions *A* and *B* in different units of execution, and
* (1) at least one of *A* or *B* is not atomic, or *A* and *B* do not have
inclusive memory scope, and
* (2) the actions are global actions unordered by the
global-happens-before relation or are local actions unordered by the
local-happens-before relation.
Any such data race results in undefined behavior.
<<iso-c11,[C11 standard, Section 5.1.2.4, paragraph 25, modified.]>>
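As an illustration, the following host-side C11 sketch shows two units of
execution whose accesses conflict but do not constitute a data race, because
every access to the shared location is atomic. (C11's `<stdatomic.h>` is used
here as an analogue of the OpenCL kernel-language atomics, which are modeled
on it; this is an illustration, not OpenCL API code.)

[source,c]
----
/* Race-free concurrent counter: both threads modify the same location,
   but every access is atomic, so the conflicting actions are not a data
   race even though the relaxed increments are unordered by happens-before. */
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>

#define INCREMENTS 100000

static atomic_int counter = 0;      /* analogue of an OpenCL atomic_int */

static void *increment(void *arg) {
    (void)arg;
    for (int i = 0; i < INCREMENTS; ++i)
        /* atomic, hence indivisible; relaxed, hence imposing no ordering */
        atomic_fetch_add_explicit(&counter, 1, memory_order_relaxed);
    return NULL;
}

int main(void) {
    pthread_t t1, t2;
    pthread_create(&t1, NULL, increment, NULL);
    pthread_create(&t2, NULL, increment, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    /* No increment is lost: the accesses conflict but do not race. */
    assert(atomic_load(&counter) == 2 * INCREMENTS);
    return 0;
}
----

Had `counter` been a plain `int`, the same program would contain a data race
and its behavior would be undefined.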
We also define the visible sequence of side effects on local and global
atomic objects.
The remaining paragraphs of this subsection define this sequence for a
global atomic object *M*; the visible sequence of side effects for a local
atomic object is defined similarly by using the local-happens-before
relation.
The visible sequence of side effects on a global atomic object *M*, with
respect to a value computation *B* of *M*, is a maximal contiguous
sub-sequence of side effects in the modification order of *M*, where the
first side effect is visible with respect to *B*, and for every side effect,
it is not the case that *B* global-happens-before it.
The value of *M*, as determined by evaluation *B*, shall be the value stored
by some operation in the visible sequence of *M* with respect to *B*.
<<iso-c11,[C11 standard, Section 5.1.2.4, paragraph 22, modified.]>>
If an operation *A* that modifies an atomic object *M* global-happens-before
an operation *B* that modifies *M*, then *A* shall be earlier than *B* in
the modification order of *M*.
This requirement is known as write-write coherence.
If a value computation *A* of an atomic object *M* global-happens-before a
value computation *B* of *M*, and *A* takes its value from a side effect *X*
on *M*, then the value computed by *B* shall either equal the value stored
by *X*, or be the value stored by a side effect *Y* on *M*, where *Y*
follows *X* in the modification order of *M*.
This requirement is known as read-read coherence.
<<iso-c11,[C11 standard, Section 5.1.2.4, paragraph 22, modified.]>>
If a value computation *A* of an atomic object *M* global-happens-before an
operation *B* on *M*, then *A* shall take its value from a side effect *X*
on *M*, where *X* precedes *B* in the modification order of *M*.
This requirement is known as read-write coherence.
If a side effect *X* on an atomic object *M* global-happens-before a value
computation *B* of *M*, then the evaluation *B* shall take its value from
*X* or from a side effect *Y* that follows *X* in the modification order of
*M*.
This requirement is known as write-read coherence.
==== Atomic Operations
This and following sections describe how different program actions in kernel
C code and the host program contribute to the local- and
global-happens-before relations.
This section discusses ordering rules for OpenCL 2.x atomic operations.
<<device-side-enqueue, Device-side enqueue>> defines the enumerated type
memory_order.
* For *memory_order_relaxed*, no operation orders memory.
* For *memory_order_release*, *memory_order_acq_rel*, and
*memory_order_seq_cst*, a store operation performs a release operation
on the affected memory location.
* For *memory_order_acquire*, *memory_order_acq_rel*, and
*memory_order_seq_cst*, a load operation performs an acquire operation
on the affected memory location.
<<iso-c11,[C11 standard, Section 7.17.3, paragraphs 2-4, modified.]>>
Certain built-in functions synchronize with other built-in functions
performed by another unit of execution.
This is true for pairs of release and acquire operations under specific
circumstances.
An atomic operation *A* that performs a release operation on a global object
*M* global-synchronizes-with an atomic operation *B* that performs an
acquire operation on *M* and reads a value written by any side effect in the
release sequence headed by *A*.
A similar rule holds for atomic operations on objects in local memory: an
atomic operation *A* that performs a release operation on a local object *M*
local-synchronizes-with an atomic operation *B* that performs an acquire
operation on *M* and reads a value written by any side effect in the release
sequence headed by *A*.
<<iso-c11,[C11 standard, Section 5.1.2.4, paragraph 11, modified.]>>
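This release/acquire pairing is the classic message-passing idiom: a plain
write is published by a release store and received by an acquire load. The
sketch below is a host-side C11 analogue (the kernel-language
`atomic_store_explicit` and `atomic_load_explicit` built-ins behave the same
way on global or local atomic objects):

[source,c]
----
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>

static int payload;             /* ordinary, non-atomic data            */
static atomic_int flag = 0;     /* guards the publication of payload    */

static void *producer(void *arg) {
    (void)arg;
    payload = 42;                                           /* plain write  */
    atomic_store_explicit(&flag, 1, memory_order_release);  /* release op A */
    return NULL;
}

static void *consumer(void *arg) {
    /* Spin until the acquire load reads the value written by the release
       store; A then synchronizes-with this load B. */
    while (atomic_load_explicit(&flag, memory_order_acquire) == 0)
        ;
    /* The plain write happens-before this read, so 42 is guaranteed. */
    *(int *)arg = payload;
    return NULL;
}

int main(void) {
    pthread_t p, c;
    int seen = 0;
    pthread_create(&c, NULL, consumer, &seen);
    pthread_create(&p, NULL, producer, NULL);
    pthread_join(p, NULL);
    pthread_join(c, NULL);
    assert(seen == 42);
    return 0;
}
----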
NOTE: Atomic operations specifying *memory_order_relaxed* are relaxed only
with respect to memory ordering.
Implementations must still guarantee that any given atomic access to a
particular atomic object be indivisible with respect to all other atomic
accesses to that object.
There shall exist a single total order *S* for all *memory_order_seq_cst*
operations that is consistent with the modification orders for all affected
locations, as well as the appropriate global-happens-before and
local-happens-before orders for those locations, such that each
*memory_order_seq_cst* operation *B* that loads a value from an atomic object
*M* in global or local memory observes one of the following values:
* the result of the last modification *A* of *M* that precedes *B* in *S*,
if it exists, or
* if *A* exists, the result of some modification of *M* in the visible
sequence of side effects with respect to *B* that is not
*memory_order_seq_cst* and that does not happen before *A*, or
* if *A* does not exist, the result of some modification of *M* in the
visible sequence of side effects with respect to *B* that is not
*memory_order_seq_cst*.
<<iso-c11,[C11 standard, Section 7.17.3, paragraph 6, modified.]>>
Let *X* and *Y* be two *memory_order_seq_cst* operations.
If *X* local-synchronizes-with or global-synchronizes-with *Y*, then *X* both
local-synchronizes-with *Y* and global-synchronizes-with *Y*.
If the total order *S* exists, the following rules hold:
* For an atomic operation *B* that reads the value of an atomic object
*M*, if there is a *memory_order_seq_cst* fence *X* sequenced-before
*B*, then *B* observes either the last *memory_order_seq_cst*
modification of *M* preceding *X* in the total order *S* or a later
modification of *M* in its modification order.
<<iso-c11,[C11 standard, Section 7.17.3, paragraph 9.]>>
* For atomic operations *A* and *B* on an atomic object *M*, where *A*
modifies *M* and *B* takes its value, if there is a
*memory_order_seq_cst* fence *X* such that *A* is sequenced-before *X*
and *B* follows *X* in *S*, then *B* observes either the effects of *A*
or a later modification of *M* in its modification order.
<<iso-c11,[C11 standard, Section 7.17.3, paragraph 10.]>>
* For atomic operations *A* and *B* on an atomic object *M*, where *A*
modifies *M* and *B* takes its value, if there are
*memory_order_seq_cst* fences *X* and *Y* such that *A* is
sequenced-before *X*, *Y* is sequenced-before *B*, and *X* precedes *Y*
in *S*, then *B* observes either the effects of *A* or a later
modification of *M* in its modification order.
<<iso-c11,[C11 standard, Section 7.17.3, paragraph 11.]>>
* For atomic operations *A* and *B* on an atomic object *M*, if there are
*memory_order_seq_cst* fences *X* and *Y* such that *A* is
sequenced-before *X*, *Y* is sequenced-before *B*, and *X* precedes *Y*
in *S*, then *B* occurs later than *A* in the modification order of *M*.
NOTE: *memory_order_seq_cst* ensures sequential consistency only for a
program that is (1) free of data races, and (2) exclusively uses
*memory_order_seq_cst* synchronization operations.
Any use of weaker ordering will invalidate this guarantee unless extreme
care is used.
In particular, *memory_order_seq_cst* fences ensure a total order only for
the fences themselves.
Fences cannot, in general, be used to restore sequential consistency for
atomic operations with weaker ordering specifications.
Atomic read-modify-write operations shall always read the last value (in
the modification order) stored before the write associated with the
read-modify-write operation.
<<iso-c11,[C11 standard, Section 7.17.3, paragraph 12.]>>
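For example (a single-threaded C11 sketch; the OpenCL
`atomic_fetch_add_explicit` built-in behaves identically), the read half of a
read-modify-write observes the value immediately preceding its own write in
the modification order:

[source,c]
----
#include <assert.h>
#include <stdatomic.h>

int main(void) {
    atomic_int v;
    atomic_init(&v, 5);
    /* The RMW reads the last value stored before its own write:
       it returns 5 and then stores 8. */
    int prev = atomic_fetch_add_explicit(&v, 3, memory_order_relaxed);
    assert(prev == 5);
    assert(atomic_load(&v) == 8);
    return 0;
}
----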
[underline]#Implementations should ensure that no "out-of-thin-air" values
are computed that circularly depend on their own computation.#
Note: Under the rules described above, and independently of the previously
footnoted {cpp} issue, it is known that _x == y == 42_ is a valid final state
in the following problematic example:
[source,c]
----
global atomic_int x = ATOMIC_VAR_INIT(0);
local atomic_int y = ATOMIC_VAR_INIT(0);
unit_of_execution_1:
... [execution not reading or writing x or y, leading up to:]
int t = atomic_load_explicit(&y, memory_order_acquire);
atomic_store_explicit(&x, t, memory_order_release);
unit_of_execution_2:
... [execution not reading or writing x or y, leading up to:]
int t = atomic_load_explicit(&x, memory_order_acquire);
atomic_store_explicit(&y, t, memory_order_release);
----
This is not useful behavior and implementations should not exploit this
phenomenon.
It should be expected that in the future this may be disallowed by
appropriate updates to the memory model description by the OpenCL committee.
Implementations should make atomic stores visible to atomic loads within a
reasonable amount of time.
<<iso-c11,[C11 standard, Section 7.17.3, paragraph 16.]>>
As long as the following conditions are met, a host program sharing SVM memory
with a kernel executing on one or more OpenCL 2.x devices may use atomic and
synchronization operations to ensure that its assignments, and those of the
kernel, are visible to each other:
. Either fine-grained buffer or fine-grained system SVM must be used to
share memory.
While coarse-grained buffer SVM allocations may support atomic
operations, visibility on these allocations is not guaranteed except at
map and unmap operations.
. The optional OpenCL 2.x SVM atomic-controlled visibility specified by
provision of the {CL_MEM_SVM_ATOMICS} flag must be supported by the device
and the flag provided to the SVM buffer on allocation.
. The host atomic and synchronization operations must be compatible with
those of an OpenCL kernel language.
This requires that the size and representation of the data types that
the host atomic operations act on be consistent with the OpenCL kernel
language atomic types.
If these conditions are met, the host operations will apply at
all_svm_devices scope.
[[memory-ordering-fence]]
==== Fence Operations
This section describes how the OpenCL 2.x fence operations contribute to the
local- and global-happens-before relations.
Earlier, we introduced synchronization primitives called fences.
Fences can utilize the acquire memory_order, release memory_order, or both.
A fence with acquire semantics is called an acquire fence; a fence with
release semantics is called a release fence. The <<atomic-fence-orders,
overview of atomic and fence operations>> section describes the memory orders
that result in acquire and release fences.
A global release fence *A* global-synchronizes-with a global acquire fence
*B* if there exist atomic operations *X* and *Y*, both operating on some
global atomic object *M*, such that *A* is sequenced-before *X*, *X*
modifies *M*, *Y* is sequenced-before *B*, *Y* reads the value written by
*X* or a value written by any side effect in the hypothetical release
sequence *X* would head if it were a release operation, and that the scopes
of *A*, *B* are inclusive.
<<iso-c11,[C11 standard, Section 7.17.4, paragraph 2, modified.]>>
A global release fence *A* global-synchronizes-with an atomic operation *B*
that performs an acquire operation on a global atomic object *M* if there
exists an atomic operation *X* such that *A* is sequenced-before *X*, *X*
modifies *M*, *B* reads the value written by *X* or a value written by any
side effect in the hypothetical release sequence *X* would head if it were a
release operation, and the scopes of *A* and *B* are inclusive.
<<iso-c11,[C11 standard, Section 7.17.4, paragraph 3, modified.]>>
An atomic operation *A* that is a release operation on a global atomic
object *M* global-synchronizes-with a global acquire fence *B* if there
exists some atomic operation *X* on *M* such that *X* is sequenced-before
*B* and reads the value written by *A* or a value written by any side effect
in the release sequence headed by *A*, and the scopes of *A* and *B* are
inclusive.
<<iso-c11,[C11 standard, Section 7.17.4, paragraph 4, modified.]>>
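The first of these rules is the fence-to-fence form of message passing: the
data transfer itself may use relaxed atomics, with all ordering supplied by
the two fences. A host-side C11 sketch using `atomic_thread_fence`, the
analogue of atomic_work_item_fence:

[source,c]
----
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>

static int payload;
static atomic_int ready = 0;

static void *writer(void *arg) {
    (void)arg;
    payload = 7;                                             /* plain write */
    atomic_thread_fence(memory_order_release);               /* fence A     */
    atomic_store_explicit(&ready, 1, memory_order_relaxed);  /* X           */
    return NULL;
}

static void *reader(void *arg) {
    /* Y: a relaxed load that reads the value written by X */
    while (atomic_load_explicit(&ready, memory_order_relaxed) == 0)
        ;
    atomic_thread_fence(memory_order_acquire);               /* fence B     */
    /* A synchronizes-with B, so the plain write to payload is visible. */
    *(int *)arg = payload;
    return NULL;
}

int main(void) {
    pthread_t w, r;
    int seen = 0;
    pthread_create(&r, NULL, reader, &seen);
    pthread_create(&w, NULL, writer, NULL);
    pthread_join(w, NULL);
    pthread_join(r, NULL);
    assert(seen == 7);
    return 0;
}
----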
A local release fence *A* local-synchronizes-with a local acquire fence *B*
if there exist atomic operations *X* and *Y*, both operating on some local
atomic object *M*, such that *A* is sequenced-before *X*, *X* modifies *M*,
*Y* is sequenced-before *B*, and *Y* reads the value written by *X* or a
value written by any side effect in the hypothetical release sequence *X*
would head if it were a release operation, and the scopes of *A* and *B* are
inclusive.
<<iso-c11,[C11 standard, Section 7.17.4, paragraph 2, modified.]>>
A local release fence *A* local-synchronizes-with an atomic operation *B*
that performs an acquire operation on a local atomic object *M* if there
exists an atomic operation *X* such that *A* is sequenced-before *X*, *X*
modifies *M*, and *B* reads the value written by *X* or a value written by
any side effect in the hypothetical release sequence *X* would head if it
were a release operation, and the scopes of *A* and *B* are inclusive.
<<iso-c11,[C11 standard, Section 7.17.4, paragraph 3, modified.]>>
An atomic operation *A* that is a release operation on a local atomic object
*M* local-synchronizes-with a local acquire fence *B* if there exists some
atomic operation *X* on *M* such that *X* is sequenced-before *B* and reads
the value written by *A* or a value written by any side effect in the
release sequence headed by *A*, and the scopes of *A* and *B* are inclusive.
<<iso-c11,[C11 standard, Section 7.17.4, paragraph 4, modified.]>>
Let *X* and *Y* be two work-item fences that each have both the
`CLK_GLOBAL_MEM_FENCE` and `CLK_LOCAL_MEM_FENCE` flags set.
*X* global-synchronizes-with *Y* and *X* local-synchronizes-with *Y* if the
conditions required for *X* to global-synchronize-with *Y* are met, the
conditions required for *X* to local-synchronize-with *Y* are met, or both
sets of conditions are met.
==== Work-group Functions
The OpenCL kernel execution model includes collective operations across the
work-items within a single work-group.
These are called work-group functions, and include functions such as
barriers, scans, reductions, and broadcasts.
We will first discuss the work-group barrier function.
Other work-group functions are discussed afterwards.
The barrier function provides a mechanism for a kernel to synchronize the
work-items within a single work-group: informally, each work-item of the
work-group must execute the barrier before any are allowed to proceed.
It also orders memory operations to a specified combination of one or more
address spaces such as local memory or global memory, in a similar manner to
a fence.
To precisely specify the memory ordering semantics for barrier, we need to
distinguish between a dynamic and a static instance of the call to a
barrier.
A call to a barrier can appear in a loop, for example, and each execution of
the same static barrier call results in a new dynamic instance of the
barrier that will independently synchronize a work-group's work-items.
A work-item executing a dynamic instance of a barrier results in two
operations, both fences, that are called the entry and exit fences.
These fences obey all the rules for fences specified elsewhere in this
chapter as well as the following:
* The entry fence is a release fence with the same flags and scope as
requested for the barrier.
* The exit fence is an acquire fence with the same flags and scope as
requested for the barrier.
* For each work-item the entry fence is sequenced before the exit fence.
* If the flags have `CLK_GLOBAL_MEM_FENCE` set then for each work-item the
entry fence global-synchronizes-with the exit fence of all other
work-items in the same work-group.
* If the flags have `CLK_LOCAL_MEM_FENCE` set then for each work-item the
entry fence local-synchronizes-with the exit fence of all other
work-items in the same work-group.
Other work-group functions include such functions as scans, reductions,
and broadcasts, and are described in the kernel language and IL specifications.
The use of these work-group functions implies sequenced-before relationships
between statements within the execution of a single work-item in order to
satisfy data dependencies.
For example, a work-item that provides a value to a work-group function must
behave as if it generates that value before beginning execution of that
work-group function.
Furthermore, the programmer must ensure that all work-items in a work-group
execute the same work-group function call site, or dynamic work-group
function instance.
==== Sub-group Functions
NOTE: Sub-group functions are <<unified-spec, missing before>> version 2.1.
Also see extension *cl_khr_subgroups*.
The OpenCL kernel execution model includes collective operations across the
work-items within a single sub-group.
These are called sub-group functions.
We will first discuss the sub-group barrier.
Other sub-group functions are discussed afterwards.
The barrier function provides a mechanism for a kernel to synchronize the
work-items within a single sub-group: informally, each work-item of the
sub-group must execute the barrier before any are allowed to proceed.
It also orders memory operations to a specified combination of one or more
address spaces such as local memory or global memory, in a similar manner to
a fence.
To precisely specify the memory ordering semantics for barrier, we need to
distinguish between a dynamic and a static instance of the call to a
barrier.
A call to a barrier can appear in a loop, for example, and each execution of
the same static barrier call results in a new dynamic instance of the
barrier that will independently synchronize a sub-group's work-items.
A work-item executing a dynamic instance of a barrier results in two
operations, both fences, that are called the entry and exit fences.
These fences obey all the rules for fences specified elsewhere in this
chapter as well as the following:
* The entry fence is a release fence with the same flags and scope as
requested for the barrier.
* The exit fence is an acquire fence with the same flags and scope as
requested for the barrier.
* For each work-item the entry fence is sequenced before the exit fence.
* If the flags have `CLK_GLOBAL_MEM_FENCE` set then for each work-item the
entry fence global-synchronizes-with the exit fence of all other
work-items in the same sub-group.
* If the flags have `CLK_LOCAL_MEM_FENCE` set then for each work-item the
entry fence local-synchronizes-with the exit fence of all other
work-items in the same sub-group.
Other sub-group functions include such functions as scans, reductions,
and broadcasts, and are described in the kernel languages and IL specifications.
The use of these sub-group functions implies sequenced-before relationships
between statements within the execution of a single work-item in order to
satisfy data dependencies.
For example, a work-item that provides a value to a sub-group function must
behave as if it generates that value before beginning execution of that
sub-group function.
Furthermore, the programmer must ensure that all work-items in a sub-group
execute the same sub-group function call site, or dynamic sub-group
function instance.
==== Host-side and Device-side Commands
This section describes how the OpenCL API functions associated with
command-queues contribute to happens-before relations.
There are two types of command-queues and associated API functions in OpenCL
2.x: _host command-queues_ and _device command-queues_.
The interaction of these command-queues with the memory model is for the
most part equivalent.
In a few cases, the rules apply only to the host command-queue.
We will indicate these special cases by specifically denoting the host
command-queue in the memory ordering rule.
SVM memory consistency in such instances is implied only with respect to
synchronizing host commands.
Memory ordering rules in this section apply to all memory objects (buffers,
images and pipes) as well as to SVM allocations where no earlier, and more
fine-grained, rules apply.
In the remainder of this section, we assume that each command *C* enqueued
onto a command-queue has an associated event object *E* that signals its
execution status, regardless of whether *E* was returned to the unit of
execution that enqueued *C*.
We also distinguish between the API function call that enqueues a command
*C* and creates an event *E*, the execution of *C*, and the completion of
*C* (which marks the event *E* as complete).
The ordering and synchronization rules for API commands are defined as
follows:
. If an API function call *X* enqueues a command *C*, then *X*
global-synchronizes-with *C*.
For example, a host API function to enqueue a kernel
global-synchronizes-with the start of that kernel-instance's execution,
so that memory updates sequenced-before the enqueue kernel function call
will global-happen-before any kernel reads or writes to those same
memory locations.
For a device-side enqueue, global memory updates sequenced before *X*
happen-before reads or writes by *C* to those memory locations only in
the case of fine-grained SVM.
. If *E* is an event upon which a command *C* waits, then *E*
global-synchronizes-with *C*.
In particular, if *C* waits on an event *E* that is tracking the
execution status of the command *C1*, then memory operations done by
*C1* will global-happen-before memory operations done by *C*.
As an example, assume we have an OpenCL program using coarse-grain SVM
sharing that enqueues a kernel to a host command-queue to manipulate the
contents of a region of a buffer that the host thread then accesses
after the kernel completes.
To do this, the host thread can call {clEnqueueMapBuffer} to enqueue a
blocking-mode map command to map that buffer region, specifying that the
map command must wait on an event signaling the kernel's completion.
When {clEnqueueMapBuffer} returns, any memory operations performed by
the kernel to that buffer region will global-happen-before subsequent
memory operations made by the host thread.
. If a command *C* has an event *E* that signals its completion, then *C*
global-synchronizes-with *E*.
. For a command *C* enqueued to a host-side command queue, if *C* has an
event *E* that signals its completion, then *E* global-synchronizes-with
an API call *X* that waits on *E*.
For example, if a host thread or kernel-instance calls the
wait-for-events function on *E* (e.g. the {clWaitForEvents} function
called from a host thread), then *E* global-synchronizes-with that
wait-for-events function call.
. If commands *C* and *C1* are enqueued in that sequence onto an in-order
command-queue, then the event (including the event implied between *C*
and *C1* due to the in-order queue) signaling *C*'s completion
global-synchronizes-with *C1*.
Note that in OpenCL 2.x, only a host command-queue can be configured as
an in-order queue.
. If an API call enqueues a marker command *C* with an empty list of
events upon which *C* should wait, then the events of all commands
enqueued prior to *C* in the command-queue global-synchronize-with *C*.
. If a host API call enqueues a command-queue barrier command *C* with an
empty list of events on which *C* should wait, then the events of all
commands enqueued prior to *C* in the command-queue
global-synchronize-with *C*.
In addition, the event signaling the completion of *C*
global-synchronizes-with all commands enqueued after *C* in the
command-queue.
. If a host thread executes a {clFinish} call *X*, then the events of all
commands enqueued prior to *X* in the command-queue
global-synchronize-with *X*.
. The start of a kernel-instance *K* global-synchronizes-with all
operations in the work-items of *K*.
Note that this includes the execution of any atomic operations by the
work-items in a program using fine-grain SVM.
. All operations of all work-items of a kernel-instance *K*
global-synchronize-with the event signaling the completion of *K*.
Note that this also includes the execution of any atomic operations by
the work-items in a program using fine-grain SVM.
. If a callback procedure *P* is registered on an event *E*, then *E*
global-synchronizes-with all operations of *P*.
Note that callback procedures are only defined for commands within host
command-queues.
. If *C* is a command that waits for an event *E*'s completion, and API
function call *X* sets the status of the user event *E* to
{CL_COMPLETE} (for example, from a host thread using a
{clSetUserEventStatus} function), then *X* global-synchronizes-with *C*.
. If a device enqueues a command *C* with the
  `CLK_ENQUEUE_FLAGS_WAIT_KERNEL` flag, then the end state of the parent
  kernel-instance global-synchronizes-with *C*.
. If a work-group enqueues a command *C* with the
  `CLK_ENQUEUE_FLAGS_WAIT_WORK_GROUP` flag, then the end state of the
  work-group global-synchronizes-with *C*.
When using an out-of-order command queue, a wait on an event, a marker
command, or a command-queue barrier command can be used to ensure the
correct ordering of dependent commands.
In those cases, the event wait, marker, or barrier command provides the
necessary global-synchronizes-with relation.
In this situation:
* access to shared locations or disjoint locations in a single {cl_mem_TYPE}
object when using atomic operations from different kernel instances
enqueued from the host, where one or more of the atomic operations is
a write, is implementation-defined and correct behavior is not guaranteed
except at synchronization points.
* access to shared locations or disjoint locations in a single {cl_mem_TYPE}
object when using atomic operations from different kernel instances
consisting of a parent kernel and any number of child kernels enqueued
by that kernel is guaranteed under the memory ordering rules described
earlier in this section.
* access to shared locations or disjoint locations in a single program
scope global variable, coarse-grained SVM allocation or fine-grained SVM
allocation when using atomic operations from different kernel instances
enqueued from the host to a single device is guaranteed under the memory
ordering rules described earlier in this section.
If fine-grain SVM is used but without support for the OpenCL 2.x atomic
operations, then the host and devices can concurrently read the same memory
locations and can concurrently update non-overlapping memory regions, but
attempts to update the same memory locations are undefined.
Memory consistency is guaranteed at the OpenCL synchronization points
without the need for calls to {clEnqueueMapBuffer} and
{clEnqueueUnmapMemObject}.
For fine-grained SVM buffers it is guaranteed that at synchronization points
only values written by the kernel will be updated.
No writes to fine-grained SVM buffers can be introduced that were not in the
original program.
In the remainder of this section, we discuss a few points regarding the
ordering rules for commands within a host command-queue.
NOTE: In OpenCL 1.x, a synchronization point is a kernel-instance or host
program location where the contents of memory visible to different
work-items or command-queue commands are the same.
The OpenCL 1.x specifications also state that waiting on an event and a
command-queue barrier are synchronization points between commands in
command-queues.
Four of the rules listed above (2, 4, 7, and 8) cover these OpenCL
synchronization points.
A map operation ({clEnqueueMapBuffer} or {clEnqueueMapImage}) performed on a
non-SVM buffer or a coarse-grained SVM buffer is allowed to overwrite the
entire target region with the latest runtime view of the data as seen by the
command with which the map operation synchronizes, whether the values were
written by the executing kernels or not.
Any values that were changed within this region by another kernel or host
thread while the kernel synchronizing with the map operation was executing
may be overwritten by the map operation.
Access to non-SVM {cl_mem_TYPE} buffers and coarse-grained SVM allocations is
ordered at synchronization points between host commands.
In the presence of an out-of-order command queue or a set of command queues
mapped to the same device, multiple kernel instances may execute
concurrently on the same device.
[[opencl-framework]]
== The OpenCL Framework
The OpenCL framework allows applications to use a host and one or more
OpenCL devices as a single heterogeneous parallel computer system.
The framework contains the following components:
* *OpenCL Platform layer*: The platform layer allows the host program to
discover OpenCL devices and their capabilities and to create contexts.
* *OpenCL Runtime*: The runtime allows the host program to manipulate
contexts once they have been created.
* *OpenCL Compiler*: The OpenCL compiler creates program executables that
contain OpenCL kernels.
The OpenCL compiler may build program executables from OpenCL C source
strings, the SPIR-V intermediate language, or device-specific program
binary objects, depending on the capabilities of a device.
Other kernel languages or intermediate languages may be supported by
some implementations.
=== Mixed Version Support
NOTE: Mixed version support is <<unified-spec, missing before>> version 1.1.
OpenCL supports devices with different capabilities under a single platform.
This includes devices which conform to different versions of the OpenCL
specification.
There are three version identifiers to consider for an OpenCL system: the
platform version, the version of a device, and the version(s) of the kernel
language or IL supported on a device.
The platform version indicates the version of the OpenCL runtime that is
supported.
This includes all of the APIs that the host can use to interact with
resources exposed by the OpenCL runtime; including contexts, memory objects,
devices, and command queues.
The device version is an indication of the device's capabilities separate
from the runtime and compiler as represented by the device info returned by
{clGetDeviceInfo}.
Examples of attributes associated with the device version are resource
limits (e.g., minimum size of local memory per compute unit) and extended
functionality (e.g., list of supported KHR extensions).
The version returned corresponds to the highest version of the OpenCL
specification for which the device is conformant, but is not higher than the
platform version.
The language version for a device represents the OpenCL programming language
features a developer can assume are supported on a given device.
The version reported is the highest version of the language supported.
=== Backwards Compatibility
Backwards compatibility is an important goal for the OpenCL standard.
Backwards compatibility is expected such that a device will consume earlier
versions of the OpenCL C programming language and the SPIR-V intermediate
language, with the following minimum requirements:
* An OpenCL 1.x device must support at least one 1.x version of the OpenCL C programming language.
* An OpenCL 2.0 device must support all the requirements of an OpenCL 1.2 device in addition to the OpenCL C 2.0 programming language.
If multiple language versions are supported, the compiler defaults to using the OpenCL C 1.2 language version.
To utilize the OpenCL 2.0 Kernel programming language, a programmer must specifically pass the appropriate compiler build option (`-cl-std=CL2.0`).
The language version must not be higher than the platform version, but may exceed the <<opencl-c-version, device version>>.
* An OpenCL 2.1 device must support all the requirements of an OpenCL 2.0 device in addition to the SPIR-V intermediate language at version 1.0 or above.
Intermediate language versioning is encoded as part of the binary object and no flags are required to be passed to the compiler.
* An OpenCL 2.2 device must support all the requirements of an OpenCL 2.0 device in addition to the SPIR-V intermediate language at version 1.2 or above.
Intermediate language versioning is encoded as a part of the binary object and no flags are required to be passed to the compiler.
* OpenCL 3.0 is designed to enable any OpenCL implementation supporting OpenCL 1.2 or newer to easily support and transition to OpenCL 3.0, by making many features in OpenCL 2.0, 2.1, or 2.2 optional.
This means that OpenCL 3.0 is backwards compatible with OpenCL 1.2, but is not necessarily backwards compatible with OpenCL 2.0, 2.1, or 2.2.
+
An OpenCL 3.0 platform must implement all OpenCL 3.0 APIs, but some APIs may return an error code unconditionally when a feature is not supported by any devices in the platform.
Whenever a feature is optional, it will be paired with a query to determine whether the feature is supported.
The queries will enable correctly written applications to selectively use all optional features without generating any OpenCL errors, if desired.
+
OpenCL 3.0 also adds a new version of the OpenCL C programming language, which makes many features in OpenCL C 2.0 optional.
The new version of OpenCL C is backwards compatible with OpenCL C 1.2, but is not backwards compatible with OpenCL C 2.0.
The new version of OpenCL C must be explicitly requested via the `-cl-std=` build option, otherwise a program will continue to be compiled using the highest OpenCL C 1.x language version supported for the device.
+
Whenever an OpenCL C feature is optional in the new version of the OpenCL C programming language, it will be paired with a feature macro, such as `+__opencl_c_feature_name+`, and a corresponding API query.
If a feature macro is defined then the feature is supported by the OpenCL C compiler, otherwise the optional feature is not supported.
In order to allow future versions of OpenCL to support new types of
devices, minor releases of OpenCL may add new profiles where some
features that are currently required for all OpenCL devices become
optional.
All features that are required for an OpenCL profile will also be
required for that profile in subsequent minor releases of OpenCL,
thereby guaranteeing backwards compatibility for applications
targeting specific profiles.
It is therefore strongly recommended that applications
<<CL_DEVICE_PROFILE,query the profile>> supported by the OpenCL device
they are running on in order to remain robust to future changes.
=== Versioning
The OpenCL specification is regularly updated with bug fixes and clarifications.
Occasionally new functionality is added to the core and extensions. In order to
indicate to developers how and when these changes are made to the specification,
and to provide a way to identify each set of changes, the OpenCL API, C language,
intermediate languages and extensions maintain a version number. Built-in kernels
are also versioned.
==== Versions
A version number comprises three logical fields:
* The _major_ version indicates a significant change. Backwards compatibility may
break across major versions.
* The _minor_ version indicates the addition of new functionality with backwards
compatibility for any existing profiles.
* The _patch_ version indicates bug fixes, clarifications and general improvements.
Version numbers are represented using the {cl_version_TYPE} type that is an alias for
a 32-bit integer. The fields are packed as follows:
* The _major_ version is a 10-bit integer packed into bits 31-22.
* The _minor_ version is a 10-bit integer packed into bits 21-12.
* The _patch_ version is a 12-bit integer packed into bits 11-0.
This enables versions to be ordered using standard C/C++ operators.
A number of convenience macros are provided by the OpenCL Headers to make
working with version numbers easier.
`CL_VERSION_MAJOR` extracts the _major_ version from a packed {cl_version_TYPE}. +
`CL_VERSION_MINOR` extracts the _minor_ version from a packed {cl_version_TYPE}. +
`CL_VERSION_PATCH` extracts the _patch_ version from a packed {cl_version_TYPE}. +
`CL_MAKE_VERSION` returns a packed {cl_version_TYPE} from a _major_, _minor_ and
_patch_ version.
These are defined as follows:
[source,c]
----
typedef cl_uint cl_version;
#define CL_VERSION_MAJOR_BITS (10)
#define CL_VERSION_MINOR_BITS (10)
#define CL_VERSION_PATCH_BITS (12)
#define CL_VERSION_MAJOR_MASK ((1 << CL_VERSION_MAJOR_BITS) - 1)
#define CL_VERSION_MINOR_MASK ((1 << CL_VERSION_MINOR_BITS) - 1)
#define CL_VERSION_PATCH_MASK ((1 << CL_VERSION_PATCH_BITS) - 1)
#define CL_VERSION_MAJOR(version) \
    ((version) >> (CL_VERSION_MINOR_BITS + CL_VERSION_PATCH_BITS))
#define CL_VERSION_MINOR(version) \
    (((version) >> CL_VERSION_PATCH_BITS) & CL_VERSION_MINOR_MASK)
#define CL_VERSION_PATCH(version) ((version) & CL_VERSION_PATCH_MASK)
#define CL_MAKE_VERSION(major, minor, patch) \
    ((((major) & CL_VERSION_MAJOR_MASK) << \
      (CL_VERSION_MINOR_BITS + CL_VERSION_PATCH_BITS)) | \
     (((minor) & CL_VERSION_MINOR_MASK) << CL_VERSION_PATCH_BITS) | \
     ((patch) & CL_VERSION_PATCH_MASK))
----
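As a non-normative illustration, the macros above can be exercised as
follows; the definitions are repeated from the OpenCL Headers so the sketch
is self-contained, and the specific version values are only examples.
Because _major_ occupies the most significant bits, followed by _minor_ and
then _patch_, packed versions order correctly under the ordinary integer
comparison operators:

[source,c]
----
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Repeated from the OpenCL Headers (see above) for self-containment;
 * a real application would include <CL/cl.h> instead. */
typedef uint32_t cl_version;

#define CL_VERSION_MAJOR_BITS (10)
#define CL_VERSION_MINOR_BITS (10)
#define CL_VERSION_PATCH_BITS (12)

#define CL_VERSION_MAJOR_MASK ((1 << CL_VERSION_MAJOR_BITS) - 1)
#define CL_VERSION_MINOR_MASK ((1 << CL_VERSION_MINOR_BITS) - 1)
#define CL_VERSION_PATCH_MASK ((1 << CL_VERSION_PATCH_BITS) - 1)

#define CL_VERSION_MAJOR(version) \
    ((version) >> (CL_VERSION_MINOR_BITS + CL_VERSION_PATCH_BITS))
#define CL_VERSION_MINOR(version) \
    (((version) >> CL_VERSION_PATCH_BITS) & CL_VERSION_MINOR_MASK)
#define CL_VERSION_PATCH(version) ((version) & CL_VERSION_PATCH_MASK)

#define CL_MAKE_VERSION(major, minor, patch) \
    ((((major) & CL_VERSION_MAJOR_MASK) << \
      (CL_VERSION_MINOR_BITS + CL_VERSION_PATCH_BITS)) | \
     (((minor) & CL_VERSION_MINOR_MASK) << CL_VERSION_PATCH_BITS) | \
     ((patch) & CL_VERSION_PATCH_MASK))

int main(void) {
    cl_version v30 = CL_MAKE_VERSION(3, 0, 0);   /* example versions */
    cl_version v22 = CL_MAKE_VERSION(2, 2, 10);

    /* Unpacking recovers the original fields. */
    assert(CL_VERSION_MAJOR(v30) == 3);
    assert(CL_VERSION_MINOR(v30) == 0);
    assert(CL_VERSION_PATCH(v22) == 10);

    /* The field layout makes packed versions comparable directly. */
    assert(v22 < v30);

    printf("%u.%u.%u\n", CL_VERSION_MAJOR(v22),
           CL_VERSION_MINOR(v22), CL_VERSION_PATCH(v22));
    /* prints 2.2.10 */
    return 0;
}
----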
==== Version name pairing
It is sometimes necessary to associate a version to an entity it applies to
(e.g. extension or built-in kernel). This is done using a dedicated
{cl_name_version_TYPE} structure, defined as follows:
include::{generated}/api/structs/cl_name_version.txt[]
The `name` field is an array of `CL_NAME_VERSION_MAX_NAME_SIZE` bytes used as
storage for a NUL-terminated string whose maximum length is therefore
`CL_NAME_VERSION_MAX_NAME_SIZE - 1`.
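As a non-normative sketch, a host application can treat the `name` field
directly as a C string thanks to the guaranteed NUL termination.
The declarations below mirror the OpenCL Headers (where
`CL_NAME_VERSION_MAX_NAME_SIZE` is 64) for self-containment, and the
extension name and version shown are illustrative only:

[source,c]
----
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Mirrors the OpenCL Headers for self-containment; a real application
 * would include <CL/cl.h> instead. */
typedef uint32_t cl_version;
#define CL_NAME_VERSION_MAX_NAME_SIZE 64

typedef struct {
    cl_version version;
    char name[CL_NAME_VERSION_MAX_NAME_SIZE];
} cl_name_version;

int main(void) {
    /* An entry as a runtime might report it, e.g. for an extension;
     * the values are made up for illustration. */
    cl_name_version entry = { (1u << 22), "cl_khr_fp64" };  /* version 1.0.0 */

    /* The name is NUL-terminated within the array, so its length is at
     * most CL_NAME_VERSION_MAX_NAME_SIZE - 1 and string functions are
     * safe to use on it. */
    assert(strlen(entry.name) <= CL_NAME_VERSION_MAX_NAME_SIZE - 1);

    printf("%s %u.%u.%u\n", entry.name,
           entry.version >> 22,            /* major: top 10 bits   */
           (entry.version >> 12) & 0x3FF,  /* minor: next 10 bits  */
           entry.version & 0xFFF);         /* patch: low 12 bits   */
    /* prints cl_khr_fp64 1.0.0 */
    return 0;
}
----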