bug fix to encode_arm64.s: some registers overwritten in memmove call

In encode_arm64.s, encodeBlock, two of the registers added during the port from
amd64 were not saved or restored for the memmove call. Instead of saving them,
just recalculate their values. Additionally, I made a few small changes to
improve things since I've learned a bit more about ARMv8 assembly.
 - The CMP instruction accepts an immediate as the first argument
 - use LDP/STP instead of SIMD instructions

The change to use the load-pair and store-pair instructions instead of the SIMD
instructions results in some modest performance improvements as meastured on
Neoverse N1 (Graviton 2).

name              old time/op    new time/op    delta
WordsDecode1e1-2    25.9ns ± 1%    26.1ns ± 1%  +0.66%  (p=0.005 n=10+10)
WordsDecode1e2-2     107ns ± 0%     105ns ± 0%  -1.87%  (p=0.000 n=10+10)
WordsDecode1e3-2     953ns ± 0%     901ns ± 0%  -5.50%  (p=0.000 n=10+10)
WordsDecode1e4-2    10.6µs ± 0%     9.9µs ± 2%  -6.60%  (p=0.000 n=7+10)
WordsDecode1e5-2     170µs ± 1%     164µs ± 1%  -3.12%  (p=0.000 n=10+9)
WordsDecode1e6-2    1.71ms ± 0%    1.66ms ± 0%  -2.98%  (p=0.000 n=10+10)
WordsEncode1e1-2    22.0ns ± 1%    21.9ns ± 1%  -0.67%  (p=0.006 n=8+10)
WordsEncode1e2-2     248ns ± 0%     245ns ± 0%  -1.21%  (p=0.002 n=8+10)
WordsEncode1e3-2    2.50µs ± 0%    2.49µs ± 0%    ~     (p=0.103 n=10+9)
WordsEncode1e4-2    27.8µs ± 3%    28.0µs ± 2%    ~     (p=0.075 n=10+10)
WordsEncode1e5-2     339µs ± 0%     343µs ± 0%  +1.18%  (p=0.000 n=9+10)
WordsEncode1e6-2    3.39ms ± 0%    3.42ms ± 0%  +0.94%  (p=0.000 n=10+10)
RandomEncode-2      74.8µs ± 1%    77.1µs ± 1%  +3.16%  (p=0.000 n=10+10)
_UFlat0-2           68.8µs ± 1%    66.4µs ± 2%  -3.54%  (p=0.000 n=10+10)
_UFlat1-2            770µs ± 0%     740µs ± 1%  -3.93%  (p=0.000 n=10+10)
_UFlat2-2           6.57µs ± 0%    6.55µs ± 0%  -0.25%  (p=0.000 n=8+10)
_UFlat3-2            183ns ± 0%     178ns ± 1%  -2.84%  (p=0.000 n=9+10)
_UFlat4-2           9.76µs ± 1%    9.56µs ± 0%  -2.07%  (p=0.000 n=10+9)
_UFlat5-2            301µs ± 0%     293µs ± 0%  -2.67%  (p=0.000 n=9+10)
_UFlat6-2            280µs ± 1%     267µs ± 1%  -4.63%  (p=0.000 n=10+10)
_UFlat7-2            241µs ± 0%     230µs ± 1%  -4.68%  (p=0.000 n=9+10)
_UFlat8-2            745µs ± 0%     715µs ± 1%  -4.11%  (p=0.000 n=10+10)
_UFlat9-2           1.01ms ± 0%    0.96ms ± 0%  -4.60%  (p=0.000 n=10+10)
_UFlat10-2          62.3µs ± 1%    59.3µs ± 1%  -4.72%  (p=0.000 n=10+9)
_UFlat11-2           258µs ± 0%     252µs ± 1%  -2.56%  (p=0.000 n=10+10)
_ZFlat0-2            135µs ± 1%     132µs ± 1%  -1.88%  (p=0.000 n=10+8)
_ZFlat1-2           1.76ms ± 0%    1.74ms ± 0%  -1.00%  (p=0.000 n=9+9)
_ZFlat2-2           9.54µs ± 0%    9.84µs ± 5%  +3.18%  (p=0.000 n=10+10)
_ZFlat3-2            449ns ± 0%     447ns ± 0%  -0.38%  (p=0.000 n=10+9)
_ZFlat4-2           15.6µs ± 0%    16.0µs ± 4%    ~     (p=0.118 n=9+10)
_ZFlat5-2            560µs ± 1%     555µs ± 1%  -0.89%  (p=0.000 n=9+9)
_ZFlat6-2            531µs ± 0%     534µs ± 0%  +0.64%  (p=0.000 n=10+10)
_ZFlat7-2            466µs ± 0%     468µs ± 0%  +0.32%  (p=0.003 n=10+10)
_ZFlat8-2           1.42ms ± 0%    1.42ms ± 0%  +0.43%  (p=0.000 n=10+10)
_ZFlat9-2           1.93ms ± 0%    1.94ms ± 0%  +0.44%  (p=0.000 n=10+10)
_ZFlat10-2           120µs ± 0%     121µs ± 3%    ~     (p=0.436 n=9+9)
_ZFlat11-2           433µs ± 0%     437µs ± 0%  +1.03%  (p=0.000 n=10+10)
ExtendMatch-2       9.77µs ± 0%    9.76µs ± 0%  -0.13%  (p=0.050 n=10+10)

As measured on Cortex-A53 (Raspberry Pi 3)

name              old time/op    new time/op    delta
WordsDecode1e1-4     152ns ± 2%     151ns ± 0%    ~     (p=0.536 n=10+8)
WordsDecode1e2-4     639ns ± 0%     617ns ± 0%  -3.54%  (p=0.000 n=9+8)
WordsDecode1e3-4    6.74µs ± 2%    6.35µs ± 0%  -5.75%  (p=0.000 n=10+9)
WordsDecode1e4-4    66.7µs ± 0%    63.5µs ± 0%  -4.69%  (p=0.000 n=9+9)
WordsDecode1e5-4     715µs ± 0%     684µs ± 0%  -4.38%  (p=0.000 n=8+8)
WordsDecode1e6-4    6.87ms ± 2%    6.53ms ± 1%  -4.99%  (p=0.000 n=10+9)
WordsEncode1e1-4     127ns ± 2%     126ns ± 0%    ~     (p=0.065 n=10+9)
WordsEncode1e2-4    1.58µs ± 0%    1.57µs ± 0%  -0.99%  (p=0.000 n=8+8)
WordsEncode1e3-4    15.1µs ± 0%    14.9µs ± 0%  -1.46%  (p=0.000 n=9+8)
WordsEncode1e4-4     148µs ± 0%     148µs ± 4%    ~     (p=0.497 n=9+10)
WordsEncode1e5-4    1.54ms ± 0%    1.54ms ± 0%  +0.12%  (p=0.012 n=10+8)
WordsEncode1e6-4    14.4ms ± 0%    14.4ms ± 1%  -0.47%  (p=0.015 n=9+8)
RandomEncode-4      1.13ms ± 1%    1.13ms ± 1%    ~     (p=0.529 n=10+10)
_UFlat0-4            294µs ± 0%     288µs ± 1%  -2.08%  (p=0.000 n=9+9)
_UFlat1-4           3.05ms ± 1%    2.98ms ± 1%  -2.22%  (p=0.000 n=9+9)
_UFlat2-4           37.3µs ± 0%    37.4µs ± 1%    ~     (p=0.093 n=8+9)
_UFlat3-4            909ns ± 0%     914ns ± 2%    ~     (p=0.526 n=8+10)
_UFlat4-4           58.7µs ± 0%    58.1µs ± 0%  -1.09%  (p=0.000 n=8+10)
_UFlat5-4           1.22ms ± 0%    1.19ms ± 1%  -2.14%  (p=0.000 n=8+8)
_UFlat6-4           1.03ms ± 0%    0.99ms ± 0%  -3.28%  (p=0.000 n=9+8)
_UFlat7-4            895µs ± 0%     861µs ± 0%  -3.79%  (p=0.000 n=8+8)
_UFlat8-4           2.83ms ± 0%    2.75ms ± 0%  -2.88%  (p=0.000 n=7+8)
_UFlat9-4           3.85ms ± 1%    3.73ms ± 1%  -3.03%  (p=0.000 n=8+9)
_UFlat10-4           286µs ± 0%     282µs ± 0%  -1.59%  (p=0.000 n=9+9)
_UFlat11-4          1.06ms ± 0%    1.02ms ± 0%  -3.58%  (p=0.000 n=8+9)
_ZFlat0-4            620µs ± 0%     620µs ± 1%    ~     (p=0.963 n=9+8)
_ZFlat1-4           9.49ms ± 1%    9.67ms ± 3%  +1.87%  (p=0.000 n=9+10)
_ZFlat2-4           61.8µs ± 0%    62.3µs ± 3%    ~     (p=0.829 n=8+10)
_ZFlat3-4           2.80µs ± 1%    2.79µs ± 0%  -0.55%  (p=0.000 n=8+8)
_ZFlat4-4            108µs ± 0%     109µs ± 0%  +0.55%  (p=0.000 n=10+8)
_ZFlat5-4           2.59ms ± 2%    2.58ms ± 1%    ~     (p=0.274 n=10+8)
_ZFlat6-4           2.39ms ± 3%    2.40ms ± 1%    ~     (p=0.631 n=10+10)
_ZFlat7-4           2.11ms ± 0%    2.08ms ± 1%  -1.23%  (p=0.000 n=10+9)
_ZFlat8-4           6.86ms ± 0%    6.92ms ± 1%  +0.78%  (p=0.000 n=9+8)
_ZFlat9-4           9.42ms ± 0%    9.40ms ± 1%    ~     (p=0.606 n=8+9)
_ZFlat10-4           620µs ± 1%     621µs ± 4%    ~     (p=0.173 n=8+10)
_ZFlat11-4          1.94ms ± 0%    1.93ms ± 0%  -0.52%  (p=0.001 n=9+8)
ExtendMatch-4       69.3µs ± 2%    69.2µs ± 0%    ~     (p=0.515 n=10+8)
2 files changed