Optimize AES/ECB implementation using full-message intrinsic stub and parallel RoundKey addition

    • Type: Enhancement
    • Resolution: Fixed
    • Priority: P4
    • 27
    • Affects Version/s: None
    • Component/s: security-libs
    • None
    • master
    • generic
    • generic

      Created on behalf of wuxinyang@hygon.cn

      The current implementation of AES in ECB mode still uses a per-block intrinsic approach with loop invocation, incurring superfluous invocations and context-switching overhead. We suggest introducing a full plaintext/ciphertext intrinsic stub and further optimizing it with parallel RoundKey addition.

      ===========================
      Dear Security group and members,

      Hello,

      I recently submitted a PR that introduces a parallel intrinsic implementation for AES/ECB operations, aiming to replace the current per-block processing approach and improve performance for multi-block encryption/decryption.

      This work is motivated by several performance limitations in the existing AES/ECB implementation (except for AVX-512 support):

         1. *Excessive stub call overhead* - each 16-byte block triggers a separate
            intrinsic call, leading to high invocation frequency.
         2. *Limited instruction-level parallelism* - serialized block processing
            does not fully utilize available ILP.
         3. *Redundant setup and teardown* - encryption state is repeatedly
            initialized for every block.
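
      The difference between the two call shapes can be illustrated at the Java API level. This is only an analogy, not the HotSpot stub itself: the class and method names (`EcbCallShape`, `encryptPerBlock`, `encryptWhole`) are hypothetical, and the sketch assumes a 16-byte key and an input length that is a multiple of 16. It shows that one call per block and one call for the whole buffer produce the same ECB ciphertext, so the number of invocations is the only thing that changes.

```java
import javax.crypto.Cipher;
import javax.crypto.spec.SecretKeySpec;
import java.security.GeneralSecurityException;
import java.util.Arrays;

public class EcbCallShape {
    static final int BLOCK = 16; // AES block size in bytes

    private static Cipher newEncryptor(byte[] key) throws GeneralSecurityException {
        Cipher c = Cipher.getInstance("AES/ECB/NoPadding");
        c.init(Cipher.ENCRYPT_MODE, new SecretKeySpec(key, "AES"));
        return c;
    }

    // Per-block shape: one cipher call for every 16-byte block.
    static byte[] encryptPerBlock(byte[] key, byte[] plain) {
        try {
            Cipher c = newEncryptor(key);
            byte[] out = new byte[plain.length];
            for (int off = 0; off < plain.length; off += BLOCK) {
                c.doFinal(plain, off, BLOCK, out, off);
            }
            return out;
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException(e);
        }
    }

    // Whole-message shape: a single call covers all blocks.
    static byte[] encryptWhole(byte[] key, byte[] plain) {
        try {
            return newEncryptor(key).doFinal(plain);
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        byte[] key = new byte[16];   // illustrative all-zero key, not for real use
        byte[] plain = new byte[64]; // four blocks
        for (int i = 0; i < plain.length; i++) plain[i] = (byte) i;
        boolean same = Arrays.equals(encryptPerBlock(key, plain), encryptWhole(key, plain));
        System.out.println("identical ciphertext: " + same);
    }
}
```

      Because ECB encrypts each block independently, the blocks can be handed over in one batch without changing the result, which is what makes a full-message stub possible in the first place.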

      Summary of changes

         - Added a parallel AES intrinsic implementation to process multiple blocks
           in a single native call.
         - Reduced intrinsic invocation overhead.
         - Improved utilization of instruction-level parallelism.
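
      As a rough illustration of what parallel RoundKey addition means, the sketch below XORs one round key into several independent block states per loop iteration. It is plain Java and not the submitted stub: the names (`RoundKeyXor`, `addRoundKeySerial`, `addRoundKeyBy4`), the two-longs-per-block layout, and the batch size of four are all assumptions made for this example. The real intrinsic would do the equivalent in vector registers.

```java
public class RoundKeyXor {
    // Each AES block state is stored as two consecutive longs (128 bits),
    // so state.length must be even.

    // Serial shape: one block per loop iteration.
    static void addRoundKeySerial(long[] state, long rkHi, long rkLo) {
        for (int b = 0; b < state.length; b += 2) {
            state[b] ^= rkHi;
            state[b + 1] ^= rkLo;
        }
    }

    // Batched shape: four independent blocks per iteration. The XORs have
    // no data dependencies on each other, so they can overlap in hardware.
    static void addRoundKeyBy4(long[] state, long rkHi, long rkLo) {
        int b = 0;
        for (; b + 8 <= state.length; b += 8) {
            state[b]     ^= rkHi; state[b + 1] ^= rkLo;
            state[b + 2] ^= rkHi; state[b + 3] ^= rkLo;
            state[b + 4] ^= rkHi; state[b + 5] ^= rkLo;
            state[b + 6] ^= rkHi; state[b + 7] ^= rkLo;
        }
        // Tail: fewer than four blocks left.
        for (; b < state.length; b += 2) {
            state[b] ^= rkHi;
            state[b + 1] ^= rkLo;
        }
    }

    public static void main(String[] args) {
        long[] a = new long[22]; // 11 blocks: two batches of four plus a tail of three
        for (int i = 0; i < a.length; i++) a[i] = i * 0x9E3779B97F4A7C15L;
        long[] b = a.clone();
        addRoundKeySerial(a, 0x0123456789ABCDEFL, 0xFEDCBA9876543210L);
        addRoundKeyBy4(b, 0x0123456789ABCDEFL, 0xFEDCBA9876543210L);
        System.out.println("same result: " + java.util.Arrays.equals(a, b));
    }
}
```

      Both shapes compute the same values; the batched form only changes how much independent work is exposed per iteration.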

      Performance results (JMH)

      Test platform: Intel(R) Core(TM) i9-14900HX

      OpenJDK 17 baseline:

      Benchmark     Mode  Cnt      Score     Error  Units
      AesTest.test  avgt    5  13334.163 ± 220.891  ns/op

      With optimized implementation:

      Benchmark     Mode  Cnt      Score     Error  Units
      AesTest.test  avgt    5  10391.371 ±  94.966  ns/op

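      This shows approximately *28.3% performance improvement*, measured as the ratio of the baseline score to the optimized score (about 1.28x). Expressed as time saved, the average time per operation drops by about 22.1%.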
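
      The relative change between the two JMH scores above can be checked with a few lines of arithmetic. The helper names below (`SpeedupCheck`, `speedupPercent`, `timeReductionPercent`) are made up for this example; the inputs are the two ns/op scores from the tables, and the error margins are ignored.

```java
import java.util.Locale;

public class SpeedupCheck {
    // Throughput gain: how much more work fits in the same time.
    static double speedupPercent(double baselineNs, double optimizedNs) {
        return (baselineNs / optimizedNs - 1.0) * 100.0;
    }

    // Latency gain: how much less time one operation takes.
    static double timeReductionPercent(double baselineNs, double optimizedNs) {
        return (1.0 - optimizedNs / baselineNs) * 100.0;
    }

    public static void main(String[] args) {
        double baseline = 13334.163;  // ns/op, OpenJDK 17 baseline
        double optimized = 10391.371; // ns/op, with the optimized implementation
        System.out.println(String.format(Locale.ROOT, "speedup: %.1f%%",
                speedupPercent(baseline, optimized)));        // speedup: 28.3%
        System.out.println(String.format(Locale.ROOT, "time reduction: %.1f%%",
                timeReductionPercent(baseline, optimized)));  // time reduction: 22.1%
    }
}
```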

      I would greatly appreciate your feedback on:

         - The design of the parallel intrinsic approach
         - Any potential correctness or portability concerns
         - Suggestions for further optimization or alignment with HotSpot intrinsic
           conventions

      JBS Issue: https://bugs.openjdk.org/browse/JDK-8376164 - This issue tracks the performance improvement of AES/ECB operations by introducing a parallel intrinsic to reduce per-block overhead and enhance throughput.

      I am very happy to revise or extend the patch based on your guidance.
      Thank you for your time and for maintaining such a great platform.

      Best regards,
      Xinyang Wu

            Assignee:
            Sendao Yan
            Reporter:
            Sendao Yan
            Votes:
            0
            Watchers:
            5