 // Copyright 2016-2020 The Khronos Group. This work is licensed under a
 // Creative Commons Attribution 4.0 International License; see
 // http://creativecommons.org/licenses/by/4.0/

 [appendix]
 = Portability

 OpenCL is designed to be portable to other architectures and hardware
 designs.
 At its core, OpenCL uses a C99-based programming language and follows
 rules based on that heritage.
 Floating-point arithmetic is based on the *IEEE-754* and *IEEE-754-2008*
 standards.
 The memory objects, pointer qualifiers and weakly ordered memory are
 designed to provide maximum compatibility with discrete memory architectures
 implemented by OpenCL devices.
 Command-queues and barriers allow for synchronization between the host and
 OpenCL devices.
 The design, capabilities and limitations of OpenCL are very much a
 reflection of the capabilities of underlying hardware.

 Unfortunately, there are a number of areas where idiosyncrasies of one
 hardware platform may allow it to do some things that do not work on
 another.
 By virtue of the rich operating system resident on the CPU, on some
 implementations the kernels executing on a CPU may be able to call out to
 system services whereas the same calls on the GPU will likely fail for now.
 Since there is some advantage to having these services available for
 debugging purposes, implementations can use the OpenCL extension mechanism
 to implement these services.

 Likewise, the heterogeneity of computing architectures might mean that a
 particular loop construct might execute at an acceptable speed on the CPU
 but very poorly on a GPU, for example.
 CPUs are generally designed to work well on latency-sensitive,
 single-threaded tasks, whereas common GPUs may encounter extremely long
 latencies, potentially orders of magnitude worse.
 Developers interested in writing portable code may need to test their
 software on a diversity of hardware designs to make sure that key algorithms
 are structured in a way that works well on a diversity of hardware.
 We suggest favoring more work-items over fewer.
 It is anticipated that over the coming months and years experience will
 produce a set of best practices that will help foster a uniformly favorable
 experience on a diversity of computing devices.

 Of somewhat more concern is the topic of endianness.
 Since a majority of devices supported by the initial implementation of
 OpenCL are little-endian, developers need to make sure that their kernels
 are tested on both big-endian and little-endian devices to ensure source
 compatibility with OpenCL devices now and in the future.
 The endian attribute qualifier is supported by the SPIR-V IL to allow
 developers to specify whether the data uses the endianness of the host or
 the OpenCL device.
 This allows the OpenCL compiler to do appropriate endian-conversion on load
 and store operations from or to this data.

 We also describe how endianness can leak into an implementation causing
 kernels to produce unintended results:

 When a big-endian vector machine (e.g. AltiVec, CELL SPE) loads a vector,
 the order of the data is retained.
 That is, both the order of the bytes within each element and the order of
 the elements in the vector are the same as in memory.
 When a little-endian vector machine (e.g. SSE) loads a vector, the order of
 the data in register (where all the work is done) is reversed.
 *Both* the order of the bytes within each element and the order of the
 elements with respect to one another in the vector are reversed.

 Memory:

 uint4 a =

 [width="100%",cols="<25%,<25%,<25%,<25%",]
 |====
 | 0x00010203 | 0x04050607 | 0x08090A0B | 0x0C0D0E0F
 |====


 In register (big-endian):

 uint4 a =

 [width="100%",cols="<25%,<25%,<25%,<25%",]
 |====
 | 0x00010203 | 0x04050607 | 0x08090A0B | 0x0C0D0E0F
 |====

 In register (little-endian):

 uint4 a =

 [width="100%",cols="<25%,<25%,<25%,<25%",]
 |====
 | 0x0F0E0D0C | 0x0B0A0908 | 0x07060504 | 0x03020100
 |====

 This allows little-endian machines to use a single vector load to load
 little-endian data, regardless of how large each piece of data is in the
 vector.
 That is, the transformation is equally valid whether that vector is a
 `uchar16` or a `ulong2`.
 Of course, as is well known, little-endian machines
 actually footnote:[{fn-endianness}] store their data in reverse byte order to
 compensate for the little-endian storage format of the array elements:

 Memory (big-endian):

 uint4 a =

 [width="100%",cols="<25%,<25%,<25%,<25%",]
 |====
 | 0x00010203 | 0x04050607 | 0x08090A0B | 0x0C0D0E0F
 |====

 Memory (little-endian):

 uint4 a =

 [width="100%",cols="<25%,<25%,<25%,<25%",]
 |====
 | 0x03020100 | 0x07060504 | 0x0B0A0908 | 0x0F0E0D0C
 |====

 Once that data is loaded into a vector, we end up with this:


 In register (big-endian):

 uint4 a =

 [width="100%",cols="<25%,<25%,<25%,<25%",]
 |====
 | 0x00010203 | 0x04050607 | 0x08090A0B | 0x0C0D0E0F
 |====

 In register (little-endian):

 uint4 a =

 [width="100%",cols="<25%,<25%,<25%,<25%",]
 |====
 | 0x0C0D0E0F | 0x08090A0B | 0x04050607 | 0x00010203
 |====

 That is, in the process of correcting the endianness of the bytes within
 each element, the machine ends up reversing the order in which the elements
 appear in the vector with respect to each other.
 0x00010203 appears at the left of the big-endian vector and at the right of
 the little-endian vector.
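The tables above can be reproduced with a small host-side model in plain C.
This is only a sketch: the helper names `store_le_uint4`, `load_reversed16` and `read_words` are hypothetical, and the "register" is simply a byte array read from left to right, not real vector hardware.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Write four 32-bit values into a 16-byte buffer in little-endian byte
   order, independent of the endianness of the machine running this code. */
void store_le_uint4(const uint32_t v[4], uint8_t mem[16])
{
    assert(v != NULL && mem != NULL);
    for (size_t i = 0; i < 4; ++i)
        for (size_t b = 0; b < 4; ++b)
            mem[4 * i + b] = (uint8_t)(v[i] >> (8 * b));
}

/* Model of a little-endian vector load: all 16 bytes appear in the
   register image in the reverse of their memory order. */
void load_reversed16(const uint8_t mem[16], uint8_t reg[16])
{
    assert(mem != NULL && reg != NULL);
    for (size_t i = 0; i < 16; ++i)
        reg[i] = mem[15 - i];
}

/* Read the register image from left to right as four 32-bit words. */
void read_words(const uint8_t reg[16], uint32_t out[4])
{
    assert(reg != NULL && out != NULL);
    for (size_t i = 0; i < 4; ++i)
        out[i] = ((uint32_t)reg[4 * i]     << 24) |
                 ((uint32_t)reg[4 * i + 1] << 16) |
                 ((uint32_t)reg[4 * i + 2] <<  8) |
                  (uint32_t)reg[4 * i + 3];
}
```

Storing `{ 0x00010203, 0x04050607, 0x08090A0B, 0x0C0D0E0F }` with `store_le_uint4`, reversing with `load_reversed16` and reading back with `read_words` yields words with the correct byte order inside each element, but with the elements in reverse order.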

 When the host and device have different endianness, the developer must
 ensure that kernel argument values are processed correctly.
 The implementation may or may not automatically convert endianness of kernel
 arguments.
 Developers should consult vendor documentation for guidance on how to handle
 kernel arguments in these situations.
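Where vendor documentation indicates that the implementation does not convert kernel arguments, the host may need to byte-swap scalar arguments itself.
The following is a minimal sketch under that assumption.
The helpers `host_is_little_endian`, `swap_bytes` and `to_device_order_u32` are hypothetical names, and the `device_is_little_endian` flag is assumed to come from querying `CL_DEVICE_ENDIAN_LITTLE` with *clGetDeviceInfo*.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Returns 1 on a little-endian host, 0 on a big-endian host. */
int host_is_little_endian(void)
{
    const uint16_t probe = 1;
    return *(const uint8_t *)&probe == 1;
}

/* Reverse the bytes of one scalar of `size` bytes in place. */
void swap_bytes(void *scalar, size_t size)
{
    assert(scalar != NULL);
    uint8_t *p = scalar;
    for (size_t i = 0; i < size / 2; ++i) {
        uint8_t tmp = p[i];
        p[i] = p[size - 1 - i];
        p[size - 1 - i] = tmp;
    }
}

/* Hypothetical helper: return `value` in the byte order of the device.
   Only needed if the implementation does NOT convert arguments itself. */
uint32_t to_device_order_u32(uint32_t value, int device_is_little_endian)
{
    if (host_is_little_endian() != (device_is_little_endian != 0))
        swap_bytes(&value, sizeof value);
    return value;
}
```

Vector arguments would need the swap applied to each element separately, since only the bytes within an element are affected by endianness.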

 OpenCL provides a consistent programming model across architectures by
 numbering elements according to their order in memory.
 Concepts such as `even`/`odd` and `high`/`low` follow accordingly.
 Once the data is loaded into registers, we find that element 0 is at the
 left of the big-endian vector and element 0 is at the right of the
 little-endian vector:

 [source,c]
 ----
 float x[4];
 float4 v = vload4( 0, x );
 ----

 Big-endian:

 [source,c]
 ----
 v contains { x[0], x[1], x[2], x[3] }
 ----

 Little-endian:

 [source,c]
 ----
 v contains { x[3], x[2], x[1], x[0] }
 ----

 The compiler is aware that this swap occurs and references elements
 accordingly.
 So long as we refer to them by a numeric index such as `.s0123456789abcdef`
 or by descriptors such as `.xyzw`, `.hi`, `.lo`, `.even` and `.odd`,
 everything works transparently.
 Any ordering reversal is undone when the data is stored back to memory.
 The developer should be able to work with a big-endian programming model and
 ignore the element ordering problem in the vector ... for most problems.
 This mechanism depends on consistent element numbering.
 Once we change the numbering system, for example by conversion-free casting
 (using ``as_type_``__n__) a vector to another vector of the same size but a
 different number of elements, we get different results on different
 implementations depending on whether the system is big-endian,
 little-endian, or indeed has no vector unit at all.
 (Thus, the behavior of bitcasts to vectors of different numbers of elements
 is implementation-defined; see section 6.4.4 of the OpenCL C specification.)

 An example follows:

 [source,c]
 ----
 float x[4] = { 0.0f, 1.0f, 2.0f, 3.0f };
 float4 v = vload4( 0, x );
 uint4 y = as_uint4(v);      // legal, portable
 ushort8 z = as_ushort8(v);  // legal, not portable
                             // element size changed
 ----


 Big-endian:

 [source,c]
 ----
 v contains { 0.0f, 1.0f, 2.0f, 3.0f }
 y contains { 0x00000000, 0x3f800000,
              0x40000000, 0x40400000 }
 z contains { 0x0000, 0x0000, 0x3f80, 0x0000,
              0x4000, 0x0000, 0x4040, 0x0000 }
 z.z is 0x3f80
 ----

 Little-endian:

 [source,c]
 ----
 v contains { 3.0f, 2.0f, 1.0f, 0.0f }
 y contains { 0x40400000, 0x40000000,
              0x3f800000, 0x00000000 }
 z contains { 0x4040, 0x0000, 0x4000, 0x0000,
              0x3f80, 0x0000, 0x0000, 0x0000 }
 z.z is 0
 ----

 Here, the value in `z.z` is not the same between big- and little-endian
 vector machines.
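The same effect can be observed on an ordinary host with plain C, using `memcpy` as a stand-in for the conversion-free cast.
This is a sketch, not OpenCL C: the function names are hypothetical, elements are numbered in memory order, and `float` is assumed to be the 32-bit IEEE-754 binary32 format.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Returns 1 on a little-endian host, 0 on a big-endian host. */
int host_is_little_endian(void)
{
    const uint16_t probe = 1;
    return *(const uint8_t *)&probe == 1;
}

/* Same number of elements: models as_uint4(v).
   Each element keeps its own bits, so the result is portable. */
void bits_as_uint4(const float v[4], uint32_t y[4])
{
    assert(sizeof(float) == sizeof(uint32_t)); /* assumption of this sketch */
    memcpy(y, v, 4 * sizeof(uint32_t));
}

/* Different number of elements: models as_ushort8(v).
   Which half of each float lands in which element depends on byte order. */
void bits_as_ushort8(const float v[4], uint16_t z[8])
{
    assert(sizeof(float) == 2 * sizeof(uint16_t)); /* assumption of this sketch */
    memcpy(z, v, 8 * sizeof(uint16_t));
}
```

For `{ 0.0f, 1.0f, 2.0f, 3.0f }`, the second element of the `uint4` view is `0x3f800000` on every host, whereas the third element of the `ushort8` view is `0x3f80` on a big-endian host and `0` on a little-endian host, matching the values of `z.z` listed above.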

 OpenCL could have made it illegal to do a conversion-free cast that changes
 the number of elements in the name of portability.
 However, while OpenCL provides a common set of operators drawing from the
 set that are typically found on vector machines, it cannot provide access
 to everything every ISA may offer in a consistent, uniform, portable manner.
 Many vector ISAs provide special-purpose instructions that greatly
 accelerate specific operations such as DCT, SAD, or 3D geometry.
 It is not intended for OpenCL to be so heavy-handed that time-critical,
 performance-sensitive algorithms cannot be written by knowledgeable
 developers to perform at near-peak performance.
 Developers willing to throw away portability should be able to use the
 platform-specific instructions in their code.
 For this reason, OpenCL is designed to allow traditional vector C language
 programming extensions, such as the AltiVec C Programming Interface or the
 Intel C programming interfaces (such as those found in emmintrin.h) to be
 used directly in OpenCL with OpenCL data types as an extension to OpenCL.
 As these interfaces rely on the ability to do conversion-free casts that
 change the number of elements in the vector to function properly, OpenCL
 allows them too.

 As a general rule, any operation that operates on vector types in segments
 that are not the same size as the vector element size may break on other
 hardware with different endianness or different vector architecture.

 Examples might include:

   * Combining two ``uchar8``'s containing high and low bytes of a ushort, to
     make a `ushort8` using `.even` and `.odd` operators (please use
     *upsample()* for this)
   * Any bitcast that changes the number of elements in the vector.
     (Operations on the new type are non-portable.)
   * Swizzle operations that change the order of data using chunk sizes that
     are not the same as the element size
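For the first item, the arithmetic performed by *upsample()* on `uchar` inputs can be modelled in plain C as below.
This is a host-side sketch with hypothetical function names.
Because the result is computed from element values rather than from the byte layout, it does not depend on endianness.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Scalar model of upsample(uchar hi, uchar lo):
   the result is ((ushort)hi << 8) | lo. */
uint16_t upsample_u8(uint8_t hi, uint8_t lo)
{
    return (uint16_t)(((uint16_t)hi << 8) | lo);
}

/* Element-wise model for vectors of n elements, e.g. n = 8 for
   two uchar8 inputs producing one ushort8. */
void upsample_u8_n(const uint8_t *hi, const uint8_t *lo,
                   uint16_t *out, size_t n)
{
    assert(hi != NULL && lo != NULL && out != NULL);
    for (size_t i = 0; i < n; ++i)
        out[i] = upsample_u8(hi[i], lo[i]);
}
```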

 Examples of operations that are portable:

   * Combining two ``uint8``'s to make a `uchar16` using `.even` and `.odd`
     operators.
     For example to interleave left and right audio streams.
   * Any bitcast that does not change the number of elements (e.g. `(float4)
     uint4`), because we define the storage format for floating-point types.
   * Swizzle operations that swizzle elements of the same size as the
     elements of the vector.

 OpenCL has made some additions to C to make application behavior more
 dependable than C.
 Most notably in a few cases OpenCL defines the behavior of some operations
 that are undefined in C99:

   * OpenCL provides `convert_` operators for conversion between all types.
     C99 does not define what happens when a floating-point type is converted
     to integer type and the floating-point value lies outside the
     representable range of the integer type after rounding.
     When the `_sat` variant of the conversion is used, the float shall be
     converted to the nearest representable integer value.
     Similarly, OpenCL also makes recommendations about what should happen
     with NaN.
     Hardware manufacturers that provide the saturated conversion in hardware
     may use the saturated conversion hardware for both the saturated and
     non-saturated versions of the OpenCL `convert_` operator.
     OpenCL does not define what happens for the non-saturated conversions
     when floating-point operands are outside the range representable by the
     integer type after rounding.
   * The format of `half`, `float`, and `double` types is defined to be the
     binary16, binary32 and binary64 formats in the draft IEEE-754 standard.
     (The latter two are identical to the existing IEEE-754 standard.) You
     may depend on the positioning and meaning of the bits in these types.
   * OpenCL defines behavior for oversized shift values.
     Shift operations that shift greater than or equal to the number of bits
     in the first operand reduce the shift value modulo the number of bits in
     the element.
     For example, if we shift an `int4` left by `33` bits, OpenCL treats this
     as shift left by `33%32 = 1` bit.
   * A number of edge cases for math library functions are more rigorously
     defined than in C99.
     Please see _section 7.5_ of the OpenCL C specification.
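Two of the behaviors listed above, saturated conversion and oversized shifts, can be modelled on the host in plain C.
This is a sketch with hypothetical function names, not the implementation of any OpenCL compiler.
It assumes a `float` to `int` conversion that rounds toward zero, and it assumes that NaN maps to 0, which is a choice made for this sketch.

```c
#include <assert.h>
#include <math.h>
#include <stddef.h>
#include <stdint.h>

/* Model of a saturated float-to-int conversion: out-of-range inputs clamp
   to the nearest representable integer instead of being undefined. */
int32_t convert_int_sat_model(float f)
{
    if (isnan(f))
        return 0;               /* assumption of this sketch: NaN maps to 0 */
    if (f >= 2147483648.0f)     /* 2^31, the first value above INT32_MAX */
        return INT32_MAX;
    if (f <= -2147483648.0f)    /* -2^31 and below clamp to INT32_MIN */
        return INT32_MIN;
    return (int32_t)f;          /* in range: C truncates toward zero */
}

/* Model of a left shift on int: the shift count is reduced modulo 32.
   The shift is done on the unsigned representation to avoid undefined
   behavior in C; converting back to int32_t is implementation-defined
   in C and wraps on common compilers. */
int32_t shl_int_model(int32_t x, uint32_t n)
{
    return (int32_t)((uint32_t)x << (n & 31u));
}

/* Element-wise model for an int4 shifted by one scalar count. */
void shl_int4_model(const int32_t x[4], uint32_t n, int32_t out[4])
{
    assert(x != NULL && out != NULL);
    for (size_t i = 0; i < 4; ++i)
        out[i] = shl_int_model(x[i], n);
}
```

With this model, shifting `1` left by `33` gives `2`, the same as shifting by `1`, and converting `3.0e9f` gives `INT32_MAX`.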