JDK-8264762

ByteBuffer.byteOrder(BIG_ENDIAN).asXBuffer.put(Xarray) and ByteBuffer.byteOrder(nativeOrder()).asXBuffer.put(Xarray) are slow


Details

    • b21
    • x86_64
    • windows_10

    Description

      ADDITIONAL SYSTEM INFORMATION :
      Java 17-ea+13 (and Java 15+36), Windows 10 x64

      A DESCRIPTION OF THE PROBLEM :
      The task is to serialize an array of floats to a byte array as fast as possible (this time with byte order = BIG_ENDIAN), see https://www.reddit.com/r/java/comments/m4b9f6/ for the full context. The fastest available options are using a ByteBuffer and/or a VarHandle. However, one specific shape is unexpectedly slow:

      ```java
      // Fast! Good!
      @Benchmark
      public byte[] byteBufferBigEndian() {
          ByteBuffer byteBuffer = ByteBuffer.allocate(byteSize);
          byteBuffer.asFloatBuffer().put(floats);
          return byteBuffer.array();
      }

      // Slow!
      @Benchmark
      public byte[] byteBufferBigEndianSwapMemoryCopy() {
          ByteBuffer byteBuffer = ByteBuffer.allocate(byteSize);
          // The wrap() forces usage of Unsafe.copySwapMemory(), which is about half as fast as the other variant:
          byteBuffer.asFloatBuffer().put(FloatBuffer.wrap(floats));
          return byteBuffer.array();
      }
      ```

      Calling `.put(Xarray)` directly, the more natural approach, is fast. The alternative `put(FloatBuffer.wrap(floats))` is much slower, even though the source is the exact same array of floats, because it goes through `Unsafe.copySwapMemory` under the hood, which on my system performs much worse.
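
To make the comparison concrete, here is a minimal self-contained sketch of the two call shapes outside JMH. The class and method names are hypothetical and not part of the benchmark; only the two `put(...)` calls come from the report. It shows that both shapes produce byte-for-byte identical output, so the difference described above is only in the copy path taken internally.

```java
import java.nio.ByteBuffer;
import java.nio.FloatBuffer;
import java.util.Arrays;

// Hypothetical demo class; only the two put(...) call shapes come from the report.
final class BigEndianPutDemo {

    // put(float[]) on a big-endian view (a fresh ByteBuffer defaults to BIG_ENDIAN).
    static byte[] viaArray(float[] floats) {
        ByteBuffer byteBuffer = ByteBuffer.allocate(floats.length * Float.BYTES);
        byteBuffer.asFloatBuffer().put(floats);
        return byteBuffer.array();
    }

    // put(FloatBuffer) with the very same array wrapped in a heap FloatBuffer.
    static byte[] viaWrappedBuffer(float[] floats) {
        ByteBuffer byteBuffer = ByteBuffer.allocate(floats.length * Float.BYTES);
        byteBuffer.asFloatBuffer().put(FloatBuffer.wrap(floats));
        return byteBuffer.array();
    }

    public static void main(String[] args) {
        float[] floats = {1.0f, -2.5f, 3.25f};
        // Both variants encode each float as 4 big-endian bytes.
        System.out.println(Arrays.equals(viaArray(floats), viaWrappedBuffer(floats)));
    }
}
```

Since the results are identical, either shape can be substituted for the other without changing the serialized data.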

      ```java
      // Unusable because the foreign-memory API is still an incubator module (jdk.incubator.foreign).
      @Benchmark
      public byte[] memorySegment() {
          try (MemorySegment segment = MemorySegment.ofArray(floats)) {
              return segment.toByteArray();
          }
      }

      // Slow!
      @Benchmark
      public byte[] byteBufferNativeOrder() {
          ByteBuffer byteBuffer = ByteBuffer.allocate(byteSize).order(ByteOrder.nativeOrder());
          byteBuffer.asFloatBuffer().put(floats);
          return byteBuffer.array();
      }

      // Fast!
      @Benchmark
      public byte[] byteBufferNativeOrderMemoryCopy() {
          ByteBuffer byteBuffer = ByteBuffer.allocate(byteSize).order(ByteOrder.nativeOrder());
          // The wrap() forces usage of Unsafe.copyMemory() which is twice as fast as the other variant:
          byteBuffer.asFloatBuffer().put(FloatBuffer.wrap(floats));
          return byteBuffer.array();
      }
      ```

      Here the situation is reversed: the more natural `.put(Xarray)` call is much slower than the less obvious alternative `put(FloatBuffer.wrap(floats))`, because the former is missing a bulk approach while the latter uses `Unsafe.copyMemory` under the hood.
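
Taken together, the two observations suggest a possible workaround on affected builds: pick the call shape according to whether the requested byte order matches the platform's. The helper below is only a sketch under that assumption. `FloatSerializer` and `toBytes` are hypothetical names, and the choice of branch is based solely on the measurements in this report, not on any documented guarantee.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.FloatBuffer;

// Hypothetical helper; the branch choice mirrors the timings reported in this issue.
final class FloatSerializer {

    static byte[] toBytes(float[] floats, ByteOrder order) {
        ByteBuffer byteBuffer = ByteBuffer.allocate(floats.length * Float.BYTES).order(order);
        FloatBuffer view = byteBuffer.asFloatBuffer();
        if (order == ByteOrder.nativeOrder()) {
            // Native order: the wrapped-buffer variant was measured faster here.
            view.put(FloatBuffer.wrap(floats));
        } else {
            // Non-native order: the plain array variant was measured faster here.
            view.put(floats);
        }
        return byteBuffer.array();
    }

    public static void main(String[] args) {
        byte[] big = toBytes(new float[] {1.0f}, ByteOrder.BIG_ENDIAN);
        byte[] little = toBytes(new float[] {1.0f}, ByteOrder.LITTLE_ENDIAN);
        System.out.printf("%02X %02X%n", big[0] & 0xFF, little[3] & 0xFF);
    }
}
```

Both branches yield the same bytes for a given order, so the helper only changes which internal copy path is exercised.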

      STEPS TO FOLLOW TO REPRODUCE THE PROBLEM :
      Run https://gitlab.com/janecekpetr/benchmarks/-/blob/master/src/main/java/com/gitlab/janecekpetr/benchmarks/FloatSerializationBenchmark.java
      by
      1. cloning the repo
      2. mvn verify
      3. java -jar target/benchmarks.jar FloatSerializationBenchmark

      EXPECTED VERSUS ACTUAL BEHAVIOR :
      EXPECTED -
      I expect byteBuffer.asFloatBuffer().put(FloatBuffer.wrap(floats)) to perform just as fast as byteBuffer.asFloatBuffer().put(floats).
      ACTUAL -
      Benchmark                          (size)   Mode  Cnt        Score       Error  Units
      byteBufferBigEndian                  2048  thrpt    5   800522,219 ± 69499,093  ops/s
      byteBufferBigEndianSwapMemoryCopy    2048  thrpt    5   371907,046 ±  9270,812  ops/s
      byteBufferNativeOrder                2048  thrpt    5   756516,722 ± 33633,399  ops/s
      byteBufferNativeOrderMemoryCopy      2048  thrpt    5  1208847,781 ± 67935,938  ops/s
      dataOutputStream                     2048  thrpt    5    99949,822 ± 17233,752  ops/s
      kryoLikeUnsafe                       2048  thrpt    5  1248879,311 ± 26843,663  ops/s
      manualUnpacking                      2048  thrpt    5   181612,250 ± 21232,457  ops/s
      objectOutputStream                   2048  thrpt    5   102348,095 ±  4135,803  ops/s
      varHandleBigEndian                   2048  thrpt    5   726448,503 ± 13138,903  ops/s
      varHandleNativeOrder                 2048  thrpt    5   698638,620 ± 20742,939  ops/s
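
For a quick sense of scale, the relevant score pairs can be turned into ratios. The snippet below is just worked arithmetic on the numbers above, with decimal commas converted to points; `ThroughputRatios` is a hypothetical name. It gives roughly 2.15x for the big-endian pair and roughly 1.60x for the native-order pair.

```java
import java.util.Locale;

// Hypothetical helper: plain arithmetic on the scores reported above.
final class ThroughputRatios {

    // How many times higher the faster throughput is than the slower one.
    static double ratio(double fasterOpsPerSec, double slowerOpsPerSec) {
        return fasterOpsPerSec / slowerOpsPerSec;
    }

    public static void main(String[] args) {
        // byteBufferBigEndian vs. byteBufferBigEndianSwapMemoryCopy
        double bigEndian = ratio(800522.219, 371907.046);
        // byteBufferNativeOrderMemoryCopy vs. byteBufferNativeOrder
        double nativeOrder = ratio(1208847.781, 756516.722);
        System.out.println(String.format(Locale.ROOT, "%.2f %.2f", bigEndian, nativeOrder));
    }
}
```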

      ---------- BEGIN SOURCE ----------
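      import java.nio.ByteBuffer;
      import java.nio.ByteOrder;
      import java.nio.FloatBuffer;
      import java.util.concurrent.ThreadLocalRandom;
      import java.util.concurrent.TimeUnit;

      import org.openjdk.jmh.annotations.Benchmark;
      import org.openjdk.jmh.annotations.Fork;
      import org.openjdk.jmh.annotations.Measurement;
      import org.openjdk.jmh.annotations.Param;
      import org.openjdk.jmh.annotations.Scope;
      import org.openjdk.jmh.annotations.Setup;
      import org.openjdk.jmh.annotations.State;
      import org.openjdk.jmh.annotations.Warmup;

      @Fork(1)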
      @Warmup(iterations = 3, time = 3, timeUnit = TimeUnit.SECONDS)
      @Measurement(iterations = 5, time = 6, timeUnit = TimeUnit.SECONDS)
      @State(Scope.Thread)
      public class FloatSerializationBenchmark {

          @Param({/*"8", "32", "128", "512",*/ "2048"})
          private int size;
          private float[] floats;
          private int byteSize;

          @Setup
          public void setup() {
              floats = new float[size];
              ThreadLocalRandom random = ThreadLocalRandom.current();
              for (int i = 0; i < floats.length; i++) {
                  floats[i] = random.nextFloat();
              }

              byteSize = size * Float.BYTES;
          }

          @Benchmark
          public byte[] byteBufferBigEndian() {
              ByteBuffer byteBuffer = ByteBuffer.allocate(byteSize).order(ByteOrder.BIG_ENDIAN);
              byteBuffer.asFloatBuffer().put(floats);
              return byteBuffer.array();
          }
          
          @Benchmark
          public byte[] byteBufferBigEndianSwapMemoryCopy() {
              ByteBuffer byteBuffer = ByteBuffer.allocate(byteSize).order(ByteOrder.BIG_ENDIAN);
              // The wrap() forces usage of Unsafe.copySwapMemory(), which is about half as fast as the other variant:
              byteBuffer.asFloatBuffer().put(FloatBuffer.wrap(floats));
              return byteBuffer.array();
          }

      }
      ---------- END SOURCE ----------

      FREQUENCY : always


            People

              bpb Brian Burkhalter
              webbuggrp Webbug Group