| # API synopsis / quick reference |
| |
| [[_TOC_]] |
| |
| ## Usage modes |
| |
| A project or module using SIMD can choose to integrate into its callers via |
| static or dynamic dispatch. Examples of both are provided in examples/. |
| |
| Static dispatch means choosing the single CPU target at compile-time. This does |
| not require any setup nor per-call overhead. |
| |
| Dynamic dispatch means generating implementations for multiple targets and |
| choosing the best available at runtime. Uses the same source code as static, |
| plus `#define HWY_TARGET_INCLUDE` and `#include |
| "third_party/highway/hwy/foreach_target.h"`. |
| |
| ## Headers |
| |
| The public headers are: |
| |
| * hwy/highway.h: main header, included from source AND/OR header files that |
| use vector types. Note that including in headers may increase compile time, |
| but allows declaring functions implemented out of line. |
| |
| * hwy/base.h: included from headers that only need compiler/platform-dependent |
| definitions (e.g. `HWY_ALIGN_MAX` and/or `kMaxVectorSize`) without the full |
| highway.h. |
| |
| * hwy/foreach_target.h: re-includes the translation unit (specified by |
| `HWY_TARGET_INCLUDE`) once per enabled target to generate code from the same |
| source code. highway.h must still be included, either before or after. |
| |
| * hwy/aligned_allocator.h: defines functions for allocating memory with |
| alignment suitable for `Load`/`Store`. |
| |
| * hwy/cache_control.h: defines stand-alone functions to control caching (e.g. |
| prefetching) and memory barriers, independent of actual SIMD. |
| |
| SIMD implementations must be preceded and followed by the following: |
| ``` |
| #include "hwy/highway.h" |
| HWY_BEFORE_NAMESPACE(); // at file scope |
| namespace project { // optional |
| namespace HWY_NAMESPACE { |
| |
| // implementation |
| |
| // NOLINTNEXTLINE(google-readability-namespace-comments) |
| } // namespace HWY_NAMESPACE |
| } // namespace project - optional |
| HWY_AFTER_NAMESPACE(); |
| ``` |
| |
| ## Preprocessor macros |
| |
| * `HWY_ALIGN`: Ensures an array is aligned and suitable for Load()/Store() |
| functions. Example: `HWY_ALIGN float lanes[4];` Note that arrays are mainly |
| useful for 128-bit SIMD, or `LoadDup128`; otherwise, use dynamic allocation. |
| |
| * `HWY_ALIGN_MAX`: As `HWY_ALIGN`, but aligns to an upper bound suitable for |
| all targets on this platform. Use this for caller of SIMD modules, e.g. for |
| arrays used as arguments. |
| |
| ## Vector and descriptor types |
| |
| SIMD vectors consist of one or more 'lanes' of the same built-in type `T = |
| uint##_t, int##_t, float or double` for `## = 8, 16, 32, 64`. Highway provides |
| vectors with `N <= kMaxVectorSize / sizeof(T)` lanes, where `N` is a power of |
| two. |
| |
| Platforms such as x86 support multiple vector types, and other platforms require |
| that vectors are built-in types. Thus the Highway API consists of overloaded |
| functions selected via a zero-sized tag parameter `d` of type `D = Simd<T, N>`. |
| These are typically constructed using aliases: |
| |
| * `const HWY_FULL(T) d;` chooses the maximum N for the current target; |
| * `const HWY_CAPPED(T, N) d;` for up to `N` lanes. |
| |
| The type `T` may be accessed as D::T (prefixed with typename if D is a template |
| argument). |
| |
| There are three possibilities for the template parameter `N`: |
| 1. Equal to the hardware vector width. This is the most common case, e.g. when |
| using `HWY_FULL(T)` on a target with compile-time constant vectors. |
| |
| 1. Less than the hardware vector width. This is the result of a compile-time |
| decision by the user, i.e. using `HWY_CAPPED(T, N)` to limit the number of |
| lanes, even when the hardware vector width could be greater. |
| |
| 1. Greater or equal to the hardware vector width, e.g. when the hardware vector |
| width is not known at compile-time. User code should not rely on `N` |
| actually being an upper bound, because variable vectors can be large! |
| |
| In all cases, `Lanes(d)` returns the actual number of lanes, i.e. the amount by |
| which to advance loop counters. `MaxLanes(d)` returns the `N` from `Simd<T, N>`, |
| which is NOT necessarily the actual vector size (see above) and some compilers |
| are not able to interpret it as constexpr. Instead of `MaxLanes`, prefer to use |
| alternatives, e.g. `Rebind` or `aligned_allocator.h` for dynamic allocation of |
| `Lanes(d)` elements. |
| |
| Note that case 3 does not imply the API will use more than one native vector. |
| Highway is designed to map a user-specified vector to a single |
| (possibly partial) vector. By discouraging user-specified `N`, we improve |
| performance portability (e.g. by reducing spills to memory for platforms that |
| have smaller vectors than the developer expected). |
| |
| To construct vectors, call factory functions (see "Initialization" below) with |
| a tag parameter `d`. |
| |
| Local variables typically use auto for type deduction. If `d` is |
| `HWY_FULL(int32_t)`, users may instead use the full-width vector alias `I32xN` |
| (or `U16xN`, `F64xN` etc.) to document the types used. |
| |
| For function arguments, it is often sufficient to return the same type as the |
| argument: `template<class V> V Squared(V v) { return v * v; }`. Otherwise, use |
| the alias `Vec<D>`. |
| |
| Note that Highway functions reside in `hwy::HWY_NAMESPACE`, whereas user-defined |
| functions reside in `project::[nested]::HWY_NAMESPACE`. Because all Highway |
| functions generally take either a `Simd` or vector argument, which are also |
| defined in namespace `hwy`, they will typically be found via Argument-Dependent |
| Lookup and namespace qualifiers are not necessary. As an exception, Highway |
| functions that are templates (e.g. because they require a compile-time argument |
| such as a lane index or shift count) require a using-declaration such as |
| `using hwy::HWY_NAMESPACE::ShiftLeft`. |
| |
| ## Operations |
| |
| In the following, the argument or return type `V` denotes a vector with `N` |
| lanes. Operations limited to certain vector types begin with a constraint of the |
| form `V`: `{u,i,f}{8/16/32/64}` to denote unsigned/signed/floating-point types, |
| possibly with the specified size in bits of `T`. |
| |
| ### Initialization |
| |
| * <code>V **Zero**(D)</code>: returns N-lane vector with all bits set to 0. |
| * <code>V **Set**(D, T)</code>: returns N-lane vector with all lanes equal to |
| the given value of type `T`. |
| * <code>V **Undefined**(D)</code>: returns uninitialized N-lane vector, e.g. |
| for use as an output parameter. |
| * <code>V **Iota**(D, T)</code>: returns N-lane vector where the lane with |
| index `i` has the given value of type `T` plus `i`. The least significant |
| lane has index 0. This is useful in tests for detecting lane-crossing bugs. |
| * <code>V **SignBit**(D, T)</code>: returns N-lane vector with all lanes set |
| to a value whose representation has only the most-significant bit set. |
| |
| ### Arithmetic |
| |
| * <code>V **operator+**(V a, V b)</code>: returns `a[i] + b[i]` (mod 2^bits). |
| * <code>V **operator-**(V a, V b)</code>: returns `a[i] - b[i]` (mod 2^bits). |
| |
| * `V`: `ui8/16` \ |
| <code>V **SaturatedAdd**(V a, V b)</code> returns `a[i] + b[i]` saturated to |
| the minimum/maximum representable value. |
| |
| * `V`: `ui8/16` \ |
| <code>V **SaturatedSub**(V a, V b)</code> returns `a[i] - b[i]` saturated to |
| the minimum/maximum representable value. |
| |
| * `V`: `u8/16` \ |
| <code>V **AverageRound**(V a, V b)</code> returns `(a[i] + b[i] + 1) / 2`. |
| |
| * `V`: `i8/16/32`, `f` \ |
| <code>V **Abs**(V a)</code> returns the absolute value of `a[i]`; for |
| integers, `LimitsMin()` maps to `LimitsMax() + 1`. |
| |
| * `V`: `ui8/16/32`, `f` \ |
| <code>V **Min**(V a, V b)</code>: returns `min(a[i], b[i])`. |
| |
| * `V`: `ui8/16/32`, `f` \ |
| <code>V **Max**(V a, V b)</code>: returns `max(a[i], b[i])`. |
| |
| * `V`: `ui8/16/32`, `f` \ |
| <code>V **Clamp**(V a, V lo, V hi)</code>: returns `a[i]` clamped to |
| `[lo[i], hi[i]]`. |
| |
| * `V`: `f` \ |
| <code>V **operator/**(V a, V b)</code>: returns `a[i] / b[i]` in each lane. |
| |
| * `V`: `f` \ |
| <code>V **Sqrt**(V a)</code>: returns `sqrt(a[i])`. |
| |
| * `V`: `f32` \ |
| <code>V **ApproximateReciprocalSqrt**(V a)</code>: returns an approximation |
| of `1.0 / sqrt(a[i])`. `sqrt(a) ~= ApproximateReciprocalSqrt(a) * a`. x86 |
| and PPC provide 12-bit approximations but the error on ARM is closer to 1%. |
| |
| * `V`: `f32` \ |
| <code>V **ApproximateReciprocal**(V a)</code>: returns an approximation of |
| `1.0 / a[i]`. |
| |
| * `V`: `f32` \ |
| <code>V **AbsDiff**(V a, V b)</code>: returns `|a[i] - b[i]|` in each lane. |
| |
| * `V`: `if` \ |
| <code>V **Neg**(V a)</code>: returns `-a[i]`. |
| |
| #### Multiply |
| |
| * `V`: `ui16/32` \ |
| <code>V <b>operator*</b>(V a, V b)</code>: returns the lower half of |
| `a[i] * b[i]` in each lane. |
| |
| * `V`: `f` \ |
| <code>V <b>operator*</b>(V a, V b)</code>: returns `a[i] * b[i]` in each |
| lane. |
| |
| * `V`: `i16` \ |
| <code>V **MulHigh**(V a, V b)</code>: returns the upper half of |
| `a[i] * b[i]` in each lane. |
| |
| * `V`: `ui32` \ |
| <code>V **MulEven**(V a, V b)</code>: returns double-wide result of |
| `a[i] * b[i]` for every even `i`, in lanes `i` (lower) and `i + 1` (upper). |
| |
| #### Fused multiply-add |
| |
| When implemented using special instructions, these functions are more precise |
| and faster than separate multiplication followed by addition. The `*Sub` |
| variants are somewhat slower on ARM; it is preferable to replace them with |
| `MulAdd` using a negated constant. |
| |
| * `V`: `f` \ |
| <code>V **MulAdd**(V a, V b, V c)</code>: returns `a[i] * b[i] + c[i]`. |
| |
| * `V`: `f` \ |
| <code>V **NegMulAdd**(V a, V b, V c)</code>: returns `-a[i] * b[i] + c[i]`. |
| |
| * `V`: `f` \ |
| <code>V **MulSub**(V a, V b, V c)</code>: returns `a[i] * b[i] - c[i]`. |
| |
| * `V`: `f` \ |
| <code>V **NegMulSub**(V a, V b, V c)</code>: returns |
| `-a[i] * b[i] - c[i]`. |
| |
| #### Shifts |
| |
| **Note**: it is generally fastest to shift by a compile-time constant number of |
| bits. ARM requires the count be less than the lane size. |
| |
| * `V`: `ui16/32/64` \ |
| <code>V **ShiftLeft**<int>(V a)</code> returns `a[i] <<` a |
| compile-time constant count. |
| |
| * `V`: `u16/32/64`, `i16/32` \ |
| <code>V **ShiftRight**<int>(V a)</code> returns `a[i] >>` a |
| compile-time constant count. Inserts zero or sign bit(s) depending on `V`. |
| |
| **Note**: Vectors must be `HWY_CAPPED(T, HWY_VARIABLE_SHIFT_LANES(T))`: |
| |
| * `V`: `ui32/64` \ |
| <code>V **operator<<**(V a, V b)</code> returns `a[i] << b[i]`, which is |
| zero when `b[i] >= sizeof(T)*8`. |
| |
| * `V`: `u32/64`, `i32` \ |
| <code>V **operator>>**(V a, V b)</code> returns `a[i] >> b[i]`, which is |
| zero when `b[i] >= sizeof(T)*8`. Inserts zero or sign bit(s). |
| |
| **Note**: the following are only provided if `HWY_VARIABLE_SHIFT_LANES(T) == 1`: |
| |
| * `V`: `ui16/32/64` \ |
| <code>V **ShiftLeftSame**(V a, int bits)</code> returns `a[i] << bits`. |
| |
| * `V`: `u16/32/64`, `i16/32` \ |
| <code>V **ShiftRightSame**(V a, int bits)</code> returns `a[i] >> bits`. |
| Inserts 0 or sign bit(s). |
| |
| #### Floating-point rounding |
| |
| * `V`: `f` \ |
| <code>V **Round**(V a)</code>: returns `a[i]` rounded towards the nearest |
| integer, with ties to even. |
| |
| * `V`: `f` \ |
| <code>V **Trunc**(V a)</code>: returns `a[i]` rounded towards zero |
| (truncate). |
| |
| * `V`: `f` \ |
| <code>V **Ceil**(V a)</code>: returns `a[i]` rounded towards positive |
| infinity (ceiling). |
| |
| * `V`: `f` \ |
| <code>V **Floor**(V a)</code>: returns `a[i]` rounded towards negative |
| infinity. |
| |
| ### Logical |
| |
| These operate on individual bits within each lane. |
| |
| * `V`: `ui` \ |
| <code>V **operator&**(V a, V b)</code>: returns `a[i] & b[i]`. |
| |
| * `V`: `ui` \ |
| <code>V **operator|**(V a, V b)</code>: returns `a[i] | b[i]`. |
| |
| * `V`: `ui` \ |
| <code>V **operator^**(V a, V b)</code>: returns `a[i] ^ b[i]`. |
| |
| For floating-point types, builtin operators are not always available, so |
| non-operator functions (also available for integers) must be used: |
| |
| * <code>V **And**(V a, V b)</code>: returns `a[i] & b[i]`. |
| |
| * <code>V **Or**(V a, V b)</code>: returns `a[i] | b[i]`. |
| |
| * <code>V **Xor**(V a, V b)</code>: returns `a[i] ^ b[i]`. |
| |
| * <code>V **AndNot**(V a, V b)</code>: returns `~a[i] & b[i]`. |
| |
| Special functions for floating-point types: |
| |
| * `V`: `f` \ |
| <code>V **CopySign**(V a, V b)</code>: returns the number with the magnitude |
| of `a` and sign of `b`. |
| |
| * `V`: `f` \ |
| <code>V **CopySignToAbs**(V a, V b)</code>: as above, but potentially |
| slightly more efficient; requires the first argument to be non-negative. |
| |
| ### Masks |
| |
| Let `M` denote a mask capable of storing true/false for each lane. |
| |
| * <code>M **MaskFromVec**(V v)</code>: returns false in lane `i` if `v[i] == |
| 0`, or true if `v[i]` has all bits set. |
| |
| * <code>V **VecFromMask**(M m)</code>: returns 0 in lane `i` if `m[i] == |
| false`, otherwise all bits set. |
| |
| * <code>V **IfThenElse**(M mask, V yes, V no)</code>: returns `mask[i] ? |
| yes[i] : no[i]`. |
| |
| * <code>V **IfThenElseZero**(M mask, V yes)</code>: returns `mask[i] ? |
| yes[i] : 0`. |
| |
| * <code>V **IfThenZeroElse**(M mask, V no)</code>: returns `mask[i] ? 0 : |
| no[i]`. |
| |
| * <code>V **ZeroIfNegative**(V v)</code>: returns `v[i] < 0 ? 0 : v[i]`. |
| |
| * <code>bool **AllTrue**(M m)</code>: returns whether all `m[i]` are true. |
| |
| * <code>bool **AllFalse**(M m)</code>: returns whether all `m[i]` are false. |
| |
| * <code>uint64_t **BitsFromMask**(M m)</code>: returns `sum{1 << i}` for all |
| indices `i` where `m[i]` is true. |
| |
| * <code>size_t **CountTrue**(M m)</code>: returns how many of `m[i]` are true |
| [0, N]. This is typically more expensive than AllTrue/False. |
| |
| * `V`: `uif32,uif64` \ |
| <code>V **Compress**(V v, M m)</code>: returns `r` such that `r[n]` is |
| `v[i]`, with `i` the n-th lane index (starting from 0) where `m[i]` is true. |
| Compacts lanes whose mask is set into the lower lanes; upper lanes are |
| undefined. |
| |
| * `V`: `uif32,uif64` \ |
| <code>size_t **CompressStore**(V v, M m, D, T* aligned)</code>: writes lanes |
| whose mask is set into `aligned`, starting from lane 0. Returns |
| `CountTrue(m)`, the number of valid lanes. All subsequent lanes may be |
| overwritten! Alignment ensures inactive lanes will not cause faults. |
| |
| ### Comparisons |
| |
| These return a mask (see above) indicating whether the condition is true. |
| |
| * <code>M **operator==**(V a, V b)</code>: returns `a[i] == b[i]`. |
| |
| * `V`: `if` \ |
| <code>M **operator<**(V a, V b)</code>: returns `a[i] < b[i]`. |
| * `V`: `if` \ |
| <code>M **operator>**(V a, V b)</code>: returns `a[i] > b[i]`. |
| |
| * `V`: `f` \ |
| <code>M **operator<=**(V a, V b)</code>: returns `a[i] <= b[i]`. |
| * `V`: `f` \ |
| <code>M **operator>=**(V a, V b)</code>: returns `a[i] >= b[i]`. |
| |
| * `V`: `ui` \ |
| <code>M **TestBit**(V v, V bit)</code>: returns `(v[i] & bit[i]) == bit[i]`. |
| `bit[i]` must have exactly one bit set. |
| |
| ### Memory |
| |
| Memory operands are little-endian, otherwise their order would depend on the |
| lane configuration. Pointers are the addresses of `N` consecutive `T` values, |
| either naturally-aligned (`aligned`) or possibly unaligned (`p`). |
| |
| #### Load |
| |
| * <code>Vec<D> **Load**(D, const T* aligned)</code>: returns |
| `aligned[i]`. |
| * <code>Vec<D> **LoadU**(D, const T* p)</code>: returns `p[i]`. |
| |
| * <code>Vec<D> **LoadDup128**(D, const T* p)</code>: returns one 128-bit |
| block loaded from `p` and broadcasted into all 128-bit block\[s\]. This |
| enables a specialized `U32FromU8` that avoids a 3-cycle overhead on |
| AVX2/AVX-512. This may be faster than broadcasting single values, and is |
| more convenient than preparing constants for the maximum vector length. |
| |
| #### Gather |
| |
| **Note**: Vectors must be `HWY_CAPPED(T, HWY_GATHER_LANES(T))`: |
| |
| * `V`,`VI`: (`uif32,i32`), (`uif64,i64`) \ |
| <code>Vec<D> **GatherOffset**(D, const T* base, VI offsets)</code>. |
| Returns elements of base selected by signed/possibly repeated *byte* |
| `offsets[i]`. |
| |
| * `V`,`VI`: (`uif32,i32`), (`uif64,i64`) \ |
| <code>Vec<D> **GatherIndex**(D, const T* base, VI indices)</code>. |
| Returns vector of `base[indices[i]]`. Indices are signed and need not be |
| unique. |
| |
| #### Store |
| |
| * <code>void **Store**(Vec<D> a, D, T* aligned)</code>: copies `a[i]` |
| into `aligned[i]`, which must be naturally aligned. Writes exactly N * |
| sizeof(T) bytes. |
| * <code>void **StoreU**(Vec<D> a, D, T* p)</code>: as Store, but without |
| the alignment requirement. |
| |
| ### Cache control |
| |
| All functions except Stream are defined in cache_control.h. |
| |
| * <code>void **Stream**(Vec<D> a, D, const T* aligned)</code>: copies |
| `a[i]` into `aligned[i]` with non-temporal hint on x86 (for good |
| performance, call for all consecutive vectors within the same cache line). |
| (Over)writes a multiple of HWY_STREAM_MULTIPLE bytes. |
| |
| * <code>void **LoadFence**()</code>: delays subsequent loads until prior loads |
| are visible. Also a full fence on Intel CPUs. No effect on non-x86. |
| |
| * <code>void **StoreFence**()</code>: ensures previous non-temporal stores are |
| visible. No effect on non-x86. |
| |
| * <code>void **FlushCacheline**(const void* p)</code>: invalidates and flushes |
| the cache line containing "p". No effect on non-x86. |
| |
| * <code>void **Prefetch**(const T* p)</code>: begins loading the cache line |
| containing "p". |
| |
| ### Type conversion |
| |
| * <code>Vec<D> **BitCast**(D, V)</code>: returns the bits of `V` |
| reinterpreted as type `Vec<D>`. |
| |
| * `V`,`D`: (`u8,i16`), (`u8,i32`), (`u16,i32`), (`i8,i16`), (`i8,i32`), |
| (`i16,i32`), (`f32,f64`) \ |
| <code>Vec<D> **PromoteTo**(D, V part)</code>: returns `part[i]` zero- |
| or sign-extended to the wider `D::T` type. |
| |
| * `V`,`D`: `i32,f64` \ |
| <code>Vec<D> **PromoteTo**(D, V part)</code>: returns `part[i]` |
| converted to 64-bit floating point. |
| |
| * `V`,`D`: (`u8,u32`) \ |
| <code>Vec<D> **U32FromU8**(V)</code>: special-case `u8` to `u32` |
| conversion when all blocks of `V` are identical, e.g. from `LoadDup128`. |
| |
| * `V`,`D`: (`u32,u8`) \ |
| <code>Vec<D> **U8FromU32**(V)</code>: special-case `u32` to `u8` |
| conversion when all lanes of `V` are already clamped to `[0, 256)`. |
| |
| * `V`,`D`: (`i16,i8`), (`i32,i8`), (`i32,i16`), (`i16,u8`), (`i32,u8`), |
| (`i32,u16`), (`f64,f32`) \ |
| <code>Vec<D> **DemoteTo**(D, V a)</code>: returns `a[i]` after packing |
| with signed/unsigned saturation, i.e. a vector with narrower type `D::T`. |
| |
| * `V`,`D`: `f64,i32` \ |
| <code>Vec<D> **DemoteTo**(D, V a)</code>: rounds floating point |
| towards zero and converts the value to 32-bit integers. |
| |
| * `V`,`D`: (`i32`,`f32`), (`i64`,`f64`) \ |
| <code>Vec<D> **ConvertTo**(D, V)</code>: converts an integer value to |
| same-sized floating point. |
| |
| * `V`,`D`: (`f32`,`i32`), (`f64`,`i64`) \ |
| <code>Vec<D> **ConvertTo**(D, V)</code>: rounds floating point towards |
| zero and converts the value to same-sized integer. |
| |
| * `V`: `f32`; `Ret`: `i32` \ |
| <code>Ret **NearestInt**(V a)</code>: returns the integer nearest to `a[i]`. |
| |
| ### Swizzle |
| |
| * <code>T **GetLane**(V)</code>: returns lane 0 within `V`. This is useful |
| for extracting `SumOfLanes` results. |
| |
| * <code>V2 **Upper/LowerHalf**(V)</code>: returns upper or lower half of |
| the vector `V`. |
| |
| * <code>V **ZeroExtendVector**(V2)</code>: returns vector whose UpperHalf is |
| zero and whose LowerHalf is the argument. |
| |
| * <code>V **Combine**(V2, V2)</code>: returns vector whose UpperHalf is |
| the first argument and whose LowerHalf is the second argument. |
| |
| * <code>V **OddEven**(V a, V b)</code>: returns a vector whose odd lanes are |
| taken from `a` and the even lanes from `b`. |
| |
| **Note**: if vectors are larger than 128 bits, the following operations split |
| their operands into independently processed 128-bit *blocks*. |
| |
| * `V`: `ui16/32/64`, `f` \ |
| <code>V **Broadcast**<int i>(V)</code>: returns individual *blocks*, |
| each with lanes set to `input_block[i]`, `i = [0, 16/sizeof(T))`. |
| |
| * `Ret`: double-width `u/i`; `V`: `u8/16/32`, `i8/16/32` \ |
| <code>Ret **ZipLower**(V a, V b)</code>: returns the same bits as |
| `InterleaveLower`, except that `Ret` is a vector with double-width lanes |
| (required in order to use this operation with scalars). |
| |
| * `V`: `ui` \ |
| <code>V **TableLookupBytes**(V bytes, V from)</code>: returns |
| `bytes[from[i]]`. Uses byte lanes regardless of the actual vector types. |
| Results are undefined if `from[i] >= HWY_MIN(vector size, 16)`. |
| |
| **Note**: the following are only available for full vectors (`N` > 1), and split |
| their operands into independently processed 128-bit *blocks*: |
| |
| * `Ret`: double-width u/i; `V`: `u8/16/32`, `i8/16/32` \ |
| <code>Ret **ZipUpper**(V a, V b)</code>: returns the same bits as |
| `InterleaveUpper`, except that `Ret` is a vector with double-width lanes |
| (required in order to use this operation with scalars). |
| |
| * `V`: `ui` \ |
| <code>V **ShiftLeftBytes**<int>(V)</code>: returns the result of |
| shifting independent *blocks* left by `int` bytes \[1, 15\]. |
| |
| * `V`: \ |
| <code>V **ShiftLeftLanes**<int>(V)</code>: returns the result of |
| shifting independent *blocks* left by `int` lanes. |
| |
| * `V`: `ui` \ |
| <code>V **ShiftRightBytes**<int>(V)</code>: returns the result of |
| shifting independent *blocks* right by `int` bytes \[1, 15\]. |
| |
| * `V`: \ |
| <code>V **ShiftRightLanes**<int>(V)</code>: returns the result of |
| shifting independent *blocks* right by `int` lanes. |
| |
| * `V`: \ |
| <code>V **CombineShiftRightBytes**<int>(V hi, V lo)</code>: returns a |
| vector of *blocks* each the result of shifting two concatenated *blocks* |
| `hi[i] || lo[i]` right by `int` bytes \[1, 16). |
| |
| * `V`: \ |
| <code>V **CombineShiftRightLanes**<int>(V hi, V lo)</code>: returns a |
| vector of *blocks* each the result of shifting two concatenated *blocks* |
| `hi[i] || lo[i]` right by `int` lanes \[1, 16/sizeof(T)). |
| |
| * `V`: `uif32` \ |
| <code>V **Shuffle2301**(V)</code>: returns *blocks* with 32-bit halves |
| swapped inside 64-bit halves. |
| |
| * `V`: `uif32` \ |
| <code>V **Shuffle1032**(V)</code>: returns *blocks* with 64-bit halves |
| swapped. |
| |
| * `V`: `uif64` \ |
| <code>V **Shuffle01**(V)</code>: returns *blocks* with 64-bit halves |
| swapped. |
| |
| * `V`: `uif32` \ |
| <code>V **Shuffle0321**(V)</code>: returns *blocks* rotated right (toward |
| the lower end) by 32 bits. |
| |
| * `V`: `uif32` \ |
| <code>V **Shuffle2103**(V)</code>: returns *blocks* rotated left (toward the |
| upper end) by 32 bits. |
| |
| * `V`: `uif32` \ |
| <code>V **Shuffle0123**(V)</code>: returns *blocks* with lanes in reverse |
| order. |
| |
| * <code>V **InterleaveLower**(V a, V b)</code>: returns *blocks* with |
| alternating lanes from the lower halves of `a` and `b` (`a[0]` in the |
| least-significant lane). |
| |
| * <code>V **InterleaveUpper**(V a, V b)</code>: returns *blocks* with |
| alternating lanes from the upper halves of `a` and `b` (`a[N/2]` in the |
| least-significant lane). |
| |
| **Note**: the following operations cross block boundaries, which is typically |
| more expensive on AVX2/AVX-512 than within-block operations. |
| |
| * <code>V **ConcatLowerLower**(V hi, V lo)</code>: returns the concatenation |
| of the lower halves of `hi` and `lo` without splitting into blocks. |
| |
| * <code>V **ConcatUpperUpper**(V hi, V lo)</code>: returns the concatenation |
| of the upper halves of `hi` and `lo` without splitting into blocks. |
| |
| * <code>V **ConcatLowerUpper**(V hi, V lo)</code>: returns the inner half of |
| the concatenation of `hi` and `lo` without splitting into blocks. Useful for |
| swapping the two blocks in 256-bit vectors. |
| |
| * <code>V **ConcatUpperLower**(V hi, V lo)</code>: returns the outer quarters |
| of the concatenation of `hi` and `lo` without splitting into blocks. Unlike |
| the other variants, this does not incur a block-crossing penalty on AVX2. |
| |
| * `V`: `uif32` \ |
| <code>V **TableLookupLanes**(V a, VI)</code> returns a vector of |
| `a[indices[i]]`, where `VI` is from `SetTableIndices(D, &indices[0])`. |
| |
| * <code>VI **SetTableIndices**(D, int* idx)</code> prepares for |
| `TableLookupLanes` with lane indices `idx = [0, N)` (need not be unique). |
| |
| ### Reductions |
| |
| **Note**: the following are only available for full vectors (including scalar). |
| These 'reduce' all lanes to a single result. This result is broadcasted to all |
| lanes at no extra cost; you can use `GetLane` to obtain the value. |
| |
| Being a horizontal operation (across lanes of the same vector), these are slower |
| than normal SIMD operations and are typically used outside critical loops. |
| |
| * `V`: `u8`; `Ret`: `u64` \ |
| <code>Ret **SumsOfU8x8**(V)</code>: returns the sums of 8 consecutive |
| bytes in each 64-bit lane. |
| |
| * `V`: `uif32/64` \ |
| <code>V **SumOfLanes**(V v)</code>: returns the sum of all lanes in |
| each lane. |
| * `V`: `uif32/64` \ |
| <code>V **MinOfLanes**(V v)</code>: returns the minimum-valued lane in |
| each lane. |
| * `V`: `uif32/64` \ |
| <code>V **MaxOfLanes**(V v)</code>: returns the maximum-valued lane in |
| each lane. |
| |
| ## Advanced macros |
| |
| Let `Target` denote an instruction set: `SCALAR/SSE4/AVX2/AVX3/PPC8/NEON/WASM`. |
| Targets are only used if enabled (i.e. not broken nor disabled). Baseline means |
| the compiler is allowed to generate such instructions (implying the target CPU |
| would have to support them). |
| |
| * `HWY_Target=##` are powers of two uniquely identifying `Target`. |
| |
| * `HWY_STATIC_TARGET` is the best enabled baseline `HWY_Target`, and matches |
| `HWY_TARGET` in static dispatch mode. This is useful even in dynamic |
| dispatch mode for deducing and printing the compiler flags. |
| |
| * `HWY_TARGETS` indicates which targets to generate for dynamic dispatch, and |
| which headers to include. It is determined by configuration macros and |
| always includes `HWY_STATIC_TARGET`. |
| |
| * `HWY_SUPPORTED_TARGETS` is the set of targets available at runtime. Expands |
| to a literal if only a single target is enabled, or SupportedTargets(). |
| |
| * `HWY_TARGET`: which `HWY_Target` is currently being compiled. This is |
| initially identical to `HWY_STATIC_TARGET` and remains so in static dispatch |
| mode. For dynamic dispatch, this changes before each re-inclusion and |
| finally reverts to `HWY_STATIC_TARGET`. Can be used in `#if` expressions to |
| provide an alternative to functions which are not supported by HWY_SCALAR. |
| |
| * `HWY_LANES(T)`: how many lanes of type `T` in a full vector (>= 1). Used by |
| HWY_FULL/CAPPED. Note: cannot be used in #if because it uses sizeof. |
| |
| * `HWY_IDE` is 0 except when parsed by IDEs; adding it to conditions such as |
| `#if HWY_TARGET != HWY_SCALAR || HWY_IDE` avoids code appearing greyed out. |
| |
| The following signal capabilities and expand to 1 or 0. |
| * `HWY_CAP_INTEGER64`: support for 64-bit signed/unsigned integer lanes. |
| * `HWY_CAP_FLOAT64`: support for double-precision floating-point lanes. |
| * `HWY_CAP_GE256`: the current target supports vectors of >= 256 bits. |
| * `HWY_CAP_GE512`: the current target supports vectors of >= 512 bits. |
| |
| The following indicate the maximum number of lanes for certain operations. For |
| targets that support the feature/operation, the macro evaluates to |
| `HWY_LANES(T)`, otherwise 1. Using `HWY_CAPPED(T, HWY_GATHER_LANES(T))` |
| generates the best possible code (or scalar fallback) from the same source code. |
| * `HWY_GATHER_LANES(T)`: supports GatherIndex/Offset. |
| * `HWY_VARIABLE_SHIFT_LANES(T)`: supports per-lane shift amounts (v1 << v2). |
| |
| As above, but the feature implies the type so there is no T parameter: |
| * `HWY_COMPARE64_LANES`: 64-bit signed integer comparisons. |
| * `HWY_MINMAX64_LANES`: 64-bit signed/unsigned integer min/max. |
| |
| ## Detecting supported targets |
| |
| `SupportedTargets()` returns a cached (initialized on-demand) bitfield of the |
| targets supported on the current CPU, detected using CPUID on x86 or equivalent. |
| This may include targets that are not in `HWY_TARGETS`, and vice versa. If |
| there is no overlap the binary will likely crash. This can only happen if: |
| |
| * the specified baseline is not supported by the current CPU, which |
| contradicts the definition of baseline, so the configuration is invalid; or |
| * the baseline does not include the enabled/attainable target(s), which are |
| also not supported by the current CPU, and baseline targets (in particular |
| `HWY_SCALAR`) were explicitly disabled. |
| |
| ## Advanced configuration macros |
| |
| The following macros govern which targets to generate. Unless specified |
| otherwise, they may be defined per translation unit, e.g. to disable >128 bit |
| vectors in modules that do not benefit from them (if bandwidth-limited or only |
| called occasionally). This is safe because `HWY_TARGETS` always includes at |
| least one baseline target which `HWY_EXPORT` can use. |
| |
| * `HWY_DISABLE_CACHE_CONTROL` makes the cache-control functions no-ops. |
| * `HWY_DISABLE_BMI2_FMA` prevents emitting BMI/BMI2/FMA instructions. This |
| allows using AVX2 in VMs that do not support the other instructions, but |
| only if defined for all translation units. |
| |
| The following `*_TARGETS` are zero or more `HWY_Target` bits and can be defined |
| as an expression, e.g. `-DHWY_DISABLED_TARGETS=(HWY_SSE4|HWY_AVX3)`. |
| |
| * `HWY_BROKEN_TARGETS` defaults to a blocklist of known compiler bugs. |
| Defining to 0 disables the blocklist. |
| |
| * `HWY_DISABLED_TARGETS` defaults to zero. This allows explicitly disabling |
| targets without interfering with the blocklist. |
| |
| * `HWY_BASELINE_TARGETS` defaults to the set whose predefined macros are |
| defined (i.e. those for which the corresponding flag, e.g. -mavx2, was |
| passed to the compiler). If specified, this should be the same for all |
| translation units, otherwise the safety check in SupportedTargets (that all |
| enabled baseline targets are supported) may be inaccurate. |
| |
| Zero or one of the following macros may be defined to replace the default |
| policy for selecting `HWY_TARGETS`: |
| |
| * `HWY_COMPILE_ONLY_SCALAR` selects only `HWY_SCALAR`, which disables SIMD. |
| * `HWY_COMPILE_ONLY_STATIC` selects only `HWY_STATIC_TARGET`, which |
| effectively disables dynamic dispatch. |
| * `HWY_COMPILE_ALL_ATTAINABLE` selects all attainable targets (i.e. enabled |
| and permitted by the compiler, independently of autovectorization), which |
| maximizes coverage in tests. |
| |
| If none are defined, the default is to select all attainable targets except any |
| non-best baseline (typically `HWY_SCALAR`), which reduces code size. |
| |
| ## Compiler support |
| |
| Clang and GCC require e.g. -mavx2 flags in order to use SIMD intrinsics. |
| However, this enables AVX2 instructions in the entire translation unit, which |
| may violate the one-definition rule and cause crashes. Instead, we use |
| target-specific attributes introduced via #pragma. Function using SIMD must |
| reside between `HWY_BEFORE_NAMESPACE` and `HWY_AFTER_NAMESPACE`. |
| |
| Immediates (compile-time constants) are specified as template arguments to avoid |
| constant-propagation issues with Clang on ARM. |
| |
| ## Type traits |
| |
| * `IsFloat<T>()` returns true if the `T` is a floating-point type. |
| * `IsSigned<T>()` returns true if the `T` is a signed or floating-point type. |
| * `LimitsMin/Max<T>()` return the smallest/largest value representable in `T`. |
| * `SizeTag<N>` is an empty struct, used to select overloaded functions |
| appropriate for `N` bytes. |
| |
| ## Memory allocation |
| |
| `AllocateAligned<T>(items)` returns a unique pointer to newly allocated memory |
| for `items` elements of POD type `T`. The start address is aligned as required |
| by `Load/Store`. Furthermore, successive allocations are not congruent modulo a |
| platform-specific alignment. This helps prevent false dependencies or cache |
| conflicts. The memory allocation is analogous to using `malloc()` and `free()` |
| with a `std::unique_ptr` since the returned items are *not* initialized or |
| default constructed and it is released using `FreeAlignedBytes()` without |
| calling `~T()`. |
| |
| `MakeUniqueAligned<T>(Args&&... args)` creates a single object in newly |
| allocated aligned memory as above but constructed passing the `args` argument to |
| `T`'s constructor and returning a unique pointer to it. This is analogous to |
| using `std::make_unique` with `new` but for aligned memory since the object is |
| constructed and later destructed when the unique pointer is deleted. Typically |
| this type `T` is a struct containing multiple members with `HWY_ALIGN` or |
| `HWY_ALIGN_MAX`, or arrays whose lengths are known to be a multiple of the |
| vector size. |
| |
| `MakeUniqueAlignedArray<T>(size_t items, Args&&... args)` creates an array of |
| objects in newly allocated aligned memory as above and constructs every element |
| of the new array using the passed constructor parameters, returning a unique |
| pointer to the array. Note that only the first element is guaranteed to be |
| aligned to the vector size; because there is no padding between elements, |
| the alignment of the remaining elements depends on the size of `T`. |