When optimizing some VectorMask-related APIs, we found an optimization opportunity related to the `cpy (immediate, zeroing)` instruction [1]. Implementing the functionality of this instruction with the `cpy (immediate, merging)` instruction [2] leads to better performance.

Currently the `cpy (imm, zeroing)` instruction is used in code generated by `VectorStoreMaskNode` and `VectorReinterpretNode`. This optimization potentially benefits all vector APIs that generate these two IR nodes, such as `VectorMask.intoArray()` and `VectorMask.toLong()`.

Microbenchmarks show this change brings a performance uplift ranging from **11%** to **33%**, depending on the specific operation and data types.

The specific changes in this PR:

1. Achieve the functionality of the `cpy (imm, zeroing)` instruction with the `movi + cpy (imm, merging)` instructions in the assembler:

```
cpy z17.d, p1/z, #1 =>

movi v17.2d, #0 // this instruction is zero cost
cpy z17.d, p1/m, #1
```

2. Add a new option `PreferSVEMergingModeCPY` to indicate whether to apply this optimization or not.
   - This option belongs to the Arch product category.
   - The default value is true on Neoverse-V1/V2, where the improvement has been confirmed, and false on other CPUs.
   - When its value is true, the change is applied.

3. Add a jtreg test to verify the behavior of this option.

This PR was tested on aarch64 and x86 machines with different configurations, and all tests passed.
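To illustrate why the `movi + cpy (imm, merging)` sequence can stand in for `cpy (imm, zeroing)`, here is a minimal scalar sketch of the lane semantics. It is not HotSpot code: the class `CpyEmulation` and its methods are hypothetical names introduced only for this illustration, and it assumes a 128-bit vector holding two 64-bit lanes, matching the `.d` example above.

```java
import java.util.Arrays;

// Hypothetical scalar model of the SVE lane semantics; not part of the JDK.
public class CpyEmulation {
    // Models `cpy zd.d, pg/z, #imm`: active lanes receive imm, inactive lanes become 0.
    public static long[] cpyZeroing(boolean[] pred, long imm) {
        long[] dst = new long[pred.length];
        for (int i = 0; i < pred.length; i++) {
            dst[i] = pred[i] ? imm : 0L;
        }
        return dst;
    }

    // Models `cpy zd.d, pg/m, #imm`: active lanes receive imm, inactive lanes keep their old value.
    public static void cpyMerging(long[] dst, boolean[] pred, long imm) {
        for (int i = 0; i < pred.length; i++) {
            if (pred[i]) {
                dst[i] = imm;
            }
        }
    }

    // Models `movi vd.2d, #0` followed by the merging copy.
    // staleDst stands for whatever the destination register held before.
    public static long[] moviThenCpyMerging(long[] staleDst, boolean[] pred, long imm) {
        long[] dst = staleDst.clone();
        Arrays.fill(dst, 0L);        // movi: clear every lane first
        cpyMerging(dst, pred, imm);  // cpy (merging): overwrite active lanes only
        return dst;
    }

    public static void main(String[] args) {
        boolean[] p1 = {true, false};       // lane 0 active, lane 1 inactive
        long[] stale = {0xDEADL, 0xBEEFL};  // arbitrary previous register contents
        long[] zeroing = cpyZeroing(p1, 1L);
        long[] merged = moviThenCpyMerging(stale, p1, 1L);
        System.out.println(Arrays.toString(zeroing) + " " + Arrays.toString(merged)
                + " " + Arrays.equals(zeroing, merged));
        // prints "[1, 0] [1, 0] true"
    }
}
```

Because the destination is cleared before the merging copy, the inactive lanes end up zero regardless of the register's previous contents, which is exactly what the zeroing form produces.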
JMH benchmarks:

On a Nvidia Grace (Neoverse-V2) machine with 128-bit SVE2:

```
Benchmark           Unit      size     Before     Error      After     Error  Uplift
byteIndexInRange    ops/ms    7.00  471816.15   1125.96  473237.77   1593.92    1.00
byteIndexInRange    ops/ms  256.00  149654.21    416.57  149259.95    116.59    1.00
byteIndexInRange    ops/ms  259.00  177850.31    991.13  179785.19   1110.07    1.01
byteIndexInRange    ops/ms  512.00  133393.26    167.26  133484.61    281.83    1.00
doubleIndexInRange  ops/ms    7.00  302176.39   12848.8  299813.02     37.76    0.99
doubleIndexInRange  ops/ms  256.00   47831.93     56.70   46708.70     56.11    0.98
doubleIndexInRange  ops/ms  259.00   11550.02     27.95   15333.50     10.40    1.33
doubleIndexInRange  ops/ms  512.00   23687.76     61.65   23996.08     69.52    1.01
floatIndexInRange   ops/ms    7.00  412195.79    124.71  411770.23     78.73    1.00
floatIndexInRange   ops/ms  256.00   84479.98     70.69   84237.31     70.15    1.00
floatIndexInRange   ops/ms  259.00   22585.65     80.07   28296.21      7.98    1.25
floatIndexInRange   ops/ms  512.00   46902.99     51.60   46686.68     66.01    1.00
intIndexInRange     ops/ms    7.00  413411.70     50.59  420684.66    253.55    1.02
intIndexInRange     ops/ms  256.00   84652.41    191.45   86758.74    193.66    1.02
intIndexInRange     ops/ms  259.00   61825.20    291.71   62037.58   2355.43    1.00
intIndexInRange     ops/ms  512.00   46754.89    149.72   46972.06     40.13    1.00
longIndexInRange    ops/ms    7.00  329385.10    3292.7  318538.75   11103.9    0.97
longIndexInRange    ops/ms  256.00   46910.36     53.41   46927.82    138.29    1.00
longIndexInRange    ops/ms  259.00   33126.45   3210.07   32245.59   1347.58    0.97
longIndexInRange    ops/ms  512.00   23931.64    215.55   23805.65    312.39    0.99
shortIndexInRange   ops/ms    7.00  479265.67   1055.89  468452.89    433.15    0.98
shortIndexInRange   ops/ms  256.00  138657.38    317.72  138695.29    505.69    1.00
shortIndexInRange   ops/ms  259.00  113353.87    913.13  108912.75   1125.60    0.96
shortIndexInRange   ops/ms  512.00   84652.74    171.37   84447.01     91.99    1.00
```

On an AWS Graviton3 (Neoverse-V1) machine with 128-bit SVE1:

```
Benchmark           Unit      size     Before     Error      After     Error  Uplift
byteIndexInRange    ops/ms    7.00  320073.86    669.91  318557.87   1285.42    1.00
byteIndexInRange    ops/ms  256.00  119246.71     43.13  120658.01     28.27    1.01
byteIndexInRange    ops/ms  259.00  137664.23   12001.6  150378.59     70.41    1.09
byteIndexInRange    ops/ms  512.00   97187.13     18.60   95356.43     78.60    0.98
doubleIndexInRange  ops/ms    7.00  291076.68    603.08  287383.75    518.59    0.99
doubleIndexInRange  ops/ms  256.00   57473.11    123.34   61559.58    687.21    1.07
doubleIndexInRange  ops/ms  259.00   19396.73     40.03   22046.65      8.66    1.14
doubleIndexInRange  ops/ms  512.00   33619.28     33.58   34715.40    157.72    1.03
floatIndexInRange   ops/ms    7.00  317295.18    627.76  303857.78    465.78    0.96
floatIndexInRange   ops/ms  256.00   91734.27    183.61   91851.31    394.35    1.00
floatIndexInRange   ops/ms  259.00   38103.12    129.44   42237.38     92.17    1.11
floatIndexInRange   ops/ms  512.00   57219.58    366.00   57769.07    264.71    1.01
intIndexInRange     ops/ms    7.00  317063.25    830.81  304289.56    541.12    0.96
intIndexInRange     ops/ms  256.00   91535.60    315.36   98143.40    142.44    1.07
intIndexInRange     ops/ms  259.00   73827.89    472.28   73781.80     21.53    1.00
intIndexInRange     ops/ms  512.00   57552.09     20.19   62348.87     37.45    1.08
longIndexInRange    ops/ms    7.00  301886.14    381.89  301636.82    184.80    1.00
longIndexInRange    ops/ms  256.00   62246.77     69.29   62093.75     88.72    1.00
longIndexInRange    ops/ms  259.00   40642.36    861.47   41566.43    256.04    1.02
longIndexInRange    ops/ms  512.00   34850.70    154.39   34884.42    149.17    1.00
shortIndexInRange   ops/ms    7.00  318133.03    593.20  313469.12    528.73    0.99
shortIndexInRange   ops/ms  256.00  105019.58     21.38  105014.90     21.81    1.00
shortIndexInRange   ops/ms  259.00  116235.93   1985.27  118697.74     48.41    1.02
shortIndexInRange   ops/ms  512.00   91981.84    166.84   91874.82     78.28    1.00
```

[1] https://developer.arm.com/documentation/ddi0602/2025-06/SVE-Instructions/CPY--immediate--zeroing---Copy-signed-integer-immediate-to-vector-elements--zeroing--?lang=en
[2] https://developer.arm.com/documentation/ddi0602/2025-12/SVE-Instructions/CPY--immediate--merging---Copy-signed-integer-immediate-to-vector-elements--merging--?lang=en