// Copyright 2016-2020 The Khronos Group. This work is licensed under a
// Creative Commons Attribution 4.0 International License; see
// http://creativecommons.org/licenses/by/4.0/
[appendix]
= Portability
OpenCL is designed to be portable to other architectures and hardware
designs.
OpenCL has at its core a C99-based programming language and follows
rules based on that heritage.
Floating-point arithmetic is based on the *IEEE-754* and *IEEE-754-2008*
standards.
The memory objects, pointer qualifiers and weakly ordered memory are
designed to provide maximum compatibility with discrete memory architectures
implemented by OpenCL devices.
Command-queues and barriers allow for synchronization between the host and
OpenCL devices.
The design, capabilities and limitations of OpenCL are very much a
reflection of the capabilities of underlying hardware.
Unfortunately, there are a number of areas where idiosyncrasies of one
hardware platform may allow it to do some things that do not work on
another.
By virtue of the rich operating system resident on the CPU, on some
implementations the kernels executing on a CPU may be able to call out to
system services whereas the same calls on the GPU will likely fail for now.
Since there is some advantage to having these services available for
debugging purposes, implementations can use the OpenCL extension mechanism
to implement these services.
Likewise, the heterogeneity of computing architectures might mean that a
particular loop construct might execute at an acceptable speed on the CPU
but very poorly on a GPU, for example.
CPUs are generally designed to perform well on latency-sensitive,
single-threaded tasks, whereas common GPUs may exhibit much longer
latencies, potentially by orders of magnitude.
Developers interested in writing portable code may need to test their
software on a variety of hardware designs to make sure that key algorithms
are structured in a way that works well across all of them.
We suggest favoring more work-items over fewer.
It is anticipated that over the coming months and years experience will
produce a set of best practices that will help foster a uniformly favorable
experience on a diversity of computing devices.
Of somewhat more concern is the topic of endianness.
Since a majority of devices supported by the initial implementation of
OpenCL are little-endian, developers need to make sure that their kernels
are tested on both big-endian and little-endian devices to ensure source
compatibility with OpenCL devices now and in the future.
The endian attribute qualifier is supported by the SPIR-V IL to allow
developers to specify whether the data uses the endianness of the host or
the OpenCL device.
This allows the OpenCL compiler to do appropriate endian-conversion on load
and store operations from or to this data.
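
As a rough illustration of what such an endian conversion amounts to, the following host-side C sketch loads and stores a 32-bit value using an explicitly chosen byte order, independent of the byte order of the machine running it.
The names `byte_order`, `load_u32` and `store_u32` are hypothetical helpers introduced for this sketch; they are not part of OpenCL or SPIR-V.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helpers for illustration only; not part of any OpenCL API. */
typedef enum { ORDER_BIG, ORDER_LITTLE } byte_order;

/* Read a 32-bit value stored at p in the given byte order. */
uint32_t load_u32(const unsigned char *p, byte_order order)
{
    assert(p != NULL);
    if (order == ORDER_BIG)
        return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
               ((uint32_t)p[2] << 8)  |  (uint32_t)p[3];
    return ((uint32_t)p[3] << 24) | ((uint32_t)p[2] << 16) |
           ((uint32_t)p[1] << 8)  |  (uint32_t)p[0];
}

/* Write v to p in the given byte order. */
void store_u32(unsigned char *p, uint32_t v, byte_order order)
{
    assert(p != NULL);
    for (int i = 0; i < 4; ++i) {
        /* Big-endian puts the most significant byte first. */
        int shift = (order == ORDER_BIG) ? 8 * (3 - i) : 8 * i;
        p[i] = (unsigned char)(v >> shift);
    }
}
```

Because the helpers work byte by byte, they give the same result on a big-endian and on a little-endian host; reading a value with the opposite order from the one used to store it yields the byte-reversed value.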
We also describe how endianness can leak into an implementation causing
kernels to produce unintended results:
When a big-endian vector machine (e.g. AltiVec, CELL SPE) loads a vector,
the order of the data is retained.
That is, both the order of the bytes within each element and the order of the
elements in the vector are the same as in memory.
When a little-endian vector machine (e.g. SSE) loads a vector, the order of
the data in register (where all the work is done) is reversed.
*Both* the order of the bytes within each element and the order of the
elements with respect to one another in the vector are reversed.
Memory:
uint4 a =
[width="100%",cols="<25%,<25%,<25%,<25%",]
|====
| 0x00010203 | 0x04050607 | 0x08090A0B | 0x0C0D0E0F
|====
In register (big-endian):
uint4 a =
[width="100%",cols="<25%,<25%,<25%,<25%",]
|====
| 0x00010203 | 0x04050607 | 0x08090A0B | 0x0C0D0E0F
|====
In register (little-endian):
uint4 a =
[width="100%",cols="<25%,<25%,<25%,<25%",]
|====
| 0x0F0E0D0C | 0x0B0A0908 | 0x07060504 | 0x03020100
|====
This allows little-endian machines to use a single vector load to load
little-endian data, regardless of how large each piece of data is in the
vector.
That is, the transformation is equally valid whether that vector was a
`uchar16` or a `ulong2`.
Of course, as is well known, little-endian machines
actually footnote:[{fn-endianness}] store their data in reverse byte order to
compensate for the little-endian storage format of the array elements:
Memory (big-endian):
uint4 a =
[width="100%",cols="<25%,<25%,<25%,<25%",]
|====
| 0x00010203 | 0x04050607 | 0x08090A0B | 0x0C0D0E0F
|====
Memory (little-endian):
uint4 a =
[width="100%",cols="<25%,<25%,<25%,<25%",]
|====
| 0x03020100 | 0x07060504 | 0x0B0A0908 | 0x0F0E0D0C
|====
Once that data is loaded into a vector, we end up with this:
In register (big-endian):
uint4 a =
[width="100%",cols="<25%,<25%,<25%,<25%",]
|====
| 0x00010203 | 0x04050607 | 0x08090A0B | 0x0C0D0E0F
|====
In register (little-endian):
uint4 a =
[width="100%",cols="<25%,<25%,<25%,<25%",]
|====
| 0x0C0D0E0F | 0x08090A0B | 0x04050607 | 0x00010203
|====
That is, in the process of correcting the endianness of the bytes within
each element, the machine ends up reversing the order in which the elements
appear in the vector with respect to each other.
0x00010203 appears at the left of the big-endian vector and at the right of
the little-endian vector.
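
The two steps described above (little-endian storage of each element, followed by a vector load that reverses all 16 bytes) can be modeled with a small host-side C sketch.
This is only a model under stated assumptions: the register is represented as an array of 16 bytes with index 0 at the left, as drawn in the tables, and the names `store_uint4_le`, `vector_load_le` and `register_lane` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define VEC_BYTES 16

/* Store four 32-bit elements to memory in little-endian byte order. */
void store_uint4_le(unsigned char mem[VEC_BYTES], const uint32_t elems[4])
{
    for (size_t e = 0; e < 4; ++e)
        for (size_t b = 0; b < 4; ++b)
            mem[4 * e + b] = (unsigned char)(elems[e] >> (8 * b));
}

/* Model a little-endian vector load: all 16 bytes are reversed.
 * reg[0] is the leftmost byte of the register as drawn in the tables. */
void vector_load_le(unsigned char reg[VEC_BYTES],
                    const unsigned char mem[VEC_BYTES])
{
    for (size_t i = 0; i < VEC_BYTES; ++i)
        reg[i] = mem[VEC_BYTES - 1 - i];
}

/* Read 32-bit lane `lane` (0 = leftmost) of the register image,
 * most significant byte first. */
uint32_t register_lane(const unsigned char reg[VEC_BYTES], size_t lane)
{
    assert(lane < 4);
    const unsigned char *p = reg + 4 * lane;
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8)  |  (uint32_t)p[3];
}
```

With the elements `0x00010203`, `0x04050607`, `0x08090A0B` and `0x0C0D0E0F`, the modeled register holds `0x0C0D0E0F` in its leftmost lane and `0x00010203` in its rightmost lane, matching the little-endian register table.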
When the host and device have different endianness, the developer must
ensure that kernel argument values are processed correctly.
The implementation may or may not automatically convert endianness of kernel
arguments.
Developers should consult vendor documentation for guidance on how to handle
kernel arguments in these situations.
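
The byte order of a device can be queried with *clGetDeviceInfo* using `CL_DEVICE_ENDIAN_LITTLE`.
The following C sketch shows one way a host program might prepare a scalar kernel argument when the implementation does not convert it automatically.
To keep the sketch runnable without an OpenCL runtime, the `device_is_little` parameter stands in for the result of that query; `host_is_little_endian`, `reverse_bytes` and `prepare_scalar_arg` are hypothetical helper names, and whether any manual conversion is needed at all depends on the implementation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Return 1 if the machine running this code is little-endian, 0 otherwise. */
int host_is_little_endian(void)
{
    const uint32_t probe = 1u;
    unsigned char first_byte;
    memcpy(&first_byte, &probe, 1);
    return first_byte == 1;
}

/* Reverse the byte order of an object of `size` bytes in place. */
void reverse_bytes(void *data, size_t size)
{
    assert(data != NULL);
    unsigned char *bytes = (unsigned char *)data;
    for (size_t i = 0; i < size / 2; ++i) {
        unsigned char tmp = bytes[i];
        bytes[i] = bytes[size - 1 - i];
        bytes[size - 1 - i] = tmp;
    }
}

/* Hypothetical helper: byte-swap a scalar argument only when the host and
 * device byte orders differ. `device_is_little` is assumed to come from a
 * CL_DEVICE_ENDIAN_LITTLE query. */
void prepare_scalar_arg(void *arg, size_t size, int device_is_little)
{
    if (host_is_little_endian() != (device_is_little != 0))
        reverse_bytes(arg, size);
}
```

This applies to a single scalar only; a structure or vector argument would need each member or element handled separately.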
OpenCL provides a consistent programming model across architectures by
numbering elements according to their order in memory.
Concepts such as `even`/`odd` and `high`/`low` follow accordingly.
Once the data is loaded into registers, we find that element 0 is at the
left of the big-endian vector and element 0 is at the right of the
little-endian vector:
[source,c]
----
float x[4];
float4 v = vload4( 0, x );
----
Big-endian:
[source,c]
----
v contains { x[0], x[1], x[2], x[3] }
----
Little-endian:
[source,c]
----
v contains { x[3], x[2], x[1], x[0] }
----
The compiler is aware that this swap occurs and references elements
accordingly.
So long as we refer to them by a numeric index such as `.s0123456789abcdef`
or by descriptors such as `.xyzw`, `.hi`, `.lo`, `.even` and `.odd`,
everything works transparently.
Any ordering reversal is undone when the data is stored back to memory.
The developer should be able to work with a big-endian programming model and
ignore the element ordering problem in the vector ... for most problems.
This mechanism relies on a consistent element numbering.
Once we change the numbering system, for example by conversion-free casting
(using ``as_type_``__n__) of a vector to another vector of the same size but
with a different number of elements, we get different results on different
implementations depending on whether the system is big-endian,
little-endian, or indeed has no vector unit at all.
(Thus, the behavior of bitcasts to vectors of different numbers of elements
is implementation-defined, see section 6.4.4 of the OpenCL C specification.)
An example follows:
[source,c]
----
float x[4] = { 0.0f, 1.0f, 2.0f, 3.0f };
float4 v = vload4( 0, x );
uint4 y = as_uint4(v);       // legal, portable
ushort8 z = as_ushort8(v);   // legal, not portable
                             // element size changed
----
Big-endian:
[source,c]
----
v contains { 0.0f, 1.0f, 2.0f, 3.0f }
y contains { 0x00000000, 0x3f800000,
             0x40000000, 0x40400000 }
z contains { 0x0000, 0x0000, 0x3f80, 0x0000,
             0x4000, 0x0000, 0x4040, 0x0000 }
z.z is 0x3f80
----
Little-endian:
[source,c]
----
v contains { 3.0f, 2.0f, 1.0f, 0.0f }
y contains { 0x40400000, 0x40000000,
             0x3f800000, 0x00000000 }
z contains { 0x4040, 0x0000, 0x4000, 0x0000,
             0x3f80, 0x0000, 0x0000, 0x0000 }
z.z is 0
----
Here, the value in `z.z` is not the same between big- and little-endian
vector machines.
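
The difference in `z.z` can be reproduced on a host with a small C model.
This sketch assumes that elements are numbered by their order in memory, as described earlier, and it models only the byte layout, not any particular device; `float_bits` and `ushort8_element` are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Bit patterns of 0.0f, 1.0f, 2.0f and 3.0f in the binary32 format. */
static const uint32_t float_bits[4] = {
    0x00000000u, 0x3f800000u, 0x40000000u, 0x40400000u
};

/* Model as_ushort8(v): lay the four 32-bit elements out in memory using the
 * given byte order, then read 16-bit element `index` back using the same
 * byte order. */
uint16_t ushort8_element(int big_endian, size_t index)
{
    unsigned char mem[16];
    assert(index < 8);
    for (size_t e = 0; e < 4; ++e) {
        for (size_t b = 0; b < 4; ++b) {
            size_t shift = big_endian ? 8 * (3 - b) : 8 * b;
            mem[4 * e + b] = (unsigned char)(float_bits[e] >> shift);
        }
    }
    const unsigned char *p = mem + 2 * index;
    return big_endian ? (uint16_t)((p[0] << 8) | p[1])
                      : (uint16_t)((p[1] << 8) | p[0]);
}
```

In this model element 2 (`z.z`) is `0x3f80` for the big-endian layout and `0` for the little-endian layout, where `0x3f80` appears as element 3 instead.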
OpenCL could have made it illegal to do a conversion-free cast that changes
the number of elements in the name of portability.
However, while OpenCL provides a common set of operators drawing from the
set that are typically found on vector machines, it cannot provide access
to everything every ISA may offer in a consistent, uniform, portable manner.
Many vector ISAs provide special purpose instructions that greatly
accelerate specific operations such as DCT, SAD, or 3D geometry.
It is not intended for OpenCL to be so heavy-handed that time-critical,
performance-sensitive algorithms cannot be written by knowledgeable
developers to perform at near peak performance.
Developers willing to throw away portability should be able to use the
platform-specific instructions in their code.
For this reason, OpenCL is designed to allow traditional vector C language
programming extensions, such as the AltiVec C Programming Interface or the
Intel C programming interfaces (such as those found in emmintrin.h) to be
used directly in OpenCL with OpenCL data types as an extension to OpenCL.
As these interfaces rely on the ability to do conversion-free casts that
change the number of elements in the vector to function properly, OpenCL
allows them too.
As a general rule, any operation that operates on vector types in segments
that are not the same size as the vector element size may break on other
hardware with different endianness or different vector architecture.
Examples might include:
* Combining two ``uchar8``'s containing high and low bytes of a ushort, to
make a `ushort8` using `.even` and `.odd` operators (please use
*upsample()* for this)
* Any bitcast that changes the number of elements in the vector.
(Operations on the new type are non-portable.)
* Swizzle operations that change the order of data using chunk sizes that
are not the same as the element size
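
For the first case in the list above, the portable alternative is *upsample()*, which combines the high and low halves arithmetically rather than by byte position.
The following C sketch models its element-wise behavior for `uchar` inputs and a `ushort` result; `upsample_u8` is a hypothetical name used only for this illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Element-wise model of upsample() for uchar inputs:
 * result[i] = ((ushort)hi[i] << 8) | lo[i]. */
void upsample_u8(uint16_t *result, const uint8_t *hi, const uint8_t *lo,
                 size_t count)
{
    assert(result != NULL && hi != NULL && lo != NULL);
    for (size_t i = 0; i < count; ++i)
        result[i] = (uint16_t)(((uint16_t)hi[i] << 8) | lo[i]);
}
```

Because the result is computed from values and not from the position of bytes in memory, it does not depend on the endianness of the machine.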
Examples of operations that are portable:
* Combining two ``uint8``'s to make a `uint16` using `.even` and `.odd`
operators, for example to interleave left and right audio streams.
* Any bitcast that does not change the number of elements (e.g. `(float4)
uint4`; we define the storage format for floating-point types)
* Swizzle operations that swizzle elements of the same size as the
elements of the vector.
OpenCL has made some additions to C to make application behavior more
dependable than C.
Most notably in a few cases OpenCL defines the behavior of some operations
that are undefined in C99:
* OpenCL provides `convert_` operators for conversion between all types.
C99 does not define what happens when a floating-point type is converted
to integer type and the floating-point value lies outside the
representable range of the integer type after rounding.
When the `_sat` variant of the conversion is used, the float shall be
converted to the nearest representable integer value.
Similarly, OpenCL also makes recommendations about what should happen
with NaN.
Hardware manufacturers that provide the saturated conversion in hardware
may use the saturated conversion hardware for both the saturated and
non-saturated versions of the OpenCL `convert_` operator.
OpenCL does not define what happens for the non-saturated conversions
when floating-point operands are outside the range representable by the
integer type after rounding.
* The format of `half`, `float`, and `double` types is defined to be the
binary16, binary32 and binary64 formats in the IEEE-754-2008 standard.
(The latter two are identical to the single and double formats of the
earlier IEEE-754 standard.) You
may depend on the positioning and meaning of the bits in these types.
* OpenCL defines behavior for oversized shift values.
Shift operations that shift greater than or equal to the number of bits
in the first operand reduce the shift value modulo the number of bits in
the element.
For example, if we shift an `int4` left by `33` bits, OpenCL treats this
as shift left by `33%32 = 1` bit.
* A number of edge cases for math library functions are more rigorously
defined than in C99.
Please see _section 7.5_ of the OpenCL C specification.
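
The saturated conversion and oversized shift rules in the list above can be modeled in plain C.
This is a minimal sketch under stated assumptions: in-range values are truncated toward zero, NaN is mapped to 0 (our reading of the recommendation; consult the OpenCL C specification for the normative text), and unsigned arithmetic is used for the shift so that the C code itself has no undefined behavior.
`convert_int_sat_model` and `shl_model` are hypothetical names, not OpenCL built-ins.

```c
#include <assert.h>
#include <stdint.h>

/* Model of a saturated float-to-int conversion. */
int32_t convert_int_sat_model(float x)
{
    if (x != x)                  /* NaN compares unequal to itself */
        return 0;                /* assumption: NaN maps to 0 */
    if (x >= 2147483648.0f)      /* 2^31 and above do not fit in int32_t */
        return INT32_MAX;
    if (x <= -2147483648.0f)
        return INT32_MIN;
    return (int32_t)x;           /* in range: C truncates toward zero */
}

/* Model of a left shift on an element that is `bits` wide (8, 16 or 32):
 * the shift amount is reduced modulo the element width. */
uint32_t shl_model(uint32_t value, uint32_t amount, unsigned bits)
{
    assert(bits == 8 || bits == 16 || bits == 32);
    uint32_t mask = (bits == 32) ? 0xFFFFFFFFu : ((1u << bits) - 1u);
    return (value << (amount % bits)) & mask;
}
```

In this model, shifting a 32-bit element left by `33` behaves as a shift by `1`, and converting `1e20f` saturates to the largest `int` value instead of producing an undefined result.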