Use more SIMD registers when auto vectorizing and loop unrolling


    • x86_64
    • linux_oracle_6.0

      ADDITIONAL SYSTEM INFORMATION :
      Red Hat Enterprise Linux 8.8 (Ootpa)
      openjdk 17.0.7 2023-04-18 LTS
      openjdk 21-ea 2023-09-19

      A DESCRIPTION OF THE PROBLEM :
      When running a simple loop like:

      for (int i = 0; i < 512; i++) {
           result[i] = a[i] + b[i];
       }

      Where a, b, and result are arrays of size 64 filled with random values.
      The JIT produces the following code for the main part of the loop, and similar code for the other parts (tested with various array sizes):

                   L0002: vmovdqu32 0x10(%rax,%r11,4),%zmm0
      0x00007f0c604ff4db: vaddps 0x10(%rdi,%r11,4),%zmm0,%zmm0
      0x00007f0c604ff4e6: vmovdqu32 %zmm0,0x10(%rdx,%r11,4)
      0x00007f0c604ff4f1: vmovdqu32 0x50(%rax,%r11,4),%zmm0
      0x00007f0c604ff4fc: vaddps 0x50(%rdi,%r11,4),%zmm0,%zmm0
      0x00007f0c604ff507: vmovdqu32 %zmm0,0x50(%rdx,%r11,4)
      0x00007f0c604ff512: vmovdqu32 0x90(%rax,%r11,4),%zmm0
      0x00007f0c604ff51d: vaddps 0x90(%rdi,%r11,4),%zmm0,%zmm0
      0x00007f0c604ff528: vmovdqu32 %zmm0,0x90(%rdx,%r11,4)
      0x00007f0c604ff533: vmovdqu32 0xd0(%rax,%r11,4),%zmm0
      0x00007f0c604ff53e: vaddps 0xd0(%rdi,%r11,4),%zmm0,%zmm0
      0x00007f0c604ff549: vmovdqu32 %zmm0,0xd0(%rdx,%r11,4) ;*fastore {reexecute=0 rethrow=0 return_oop=0}

      As you can tell, it uses only one SIMD register (zmm0), while plenty more are available on modern CPUs, which means it is not taking full advantage of instruction-level parallelism.
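
      A minimal reproducer sketch for the loop described above. The class name, method name, seed, and iteration count are hypothetical and not taken from the original report; float[] is assumed because the listing uses vaddps and fastore, and the array length is set to the loop bound of 512 so the loop stays in range.

```java
import java.util.Random;

// Hypothetical reproducer: names and constants below are illustrative assumptions.
class VectorAddRepro {
    // Assumed array length, chosen to match the loop bound in the report.
    static final int SIZE = 512;

    // The loop from the report, isolated in its own method so the JIT
    // compiles it as a separate unit that is easy to find in the output.
    static void add(float[] a, float[] b, float[] result) {
        for (int i = 0; i < SIZE; i++) {
            result[i] = a[i] + b[i];
        }
    }

    public static void main(String[] args) {
        Random rnd = new Random(42); // fixed seed, arbitrary choice
        float[] a = new float[SIZE];
        float[] b = new float[SIZE];
        float[] result = new float[SIZE];
        for (int i = 0; i < SIZE; i++) {
            a[i] = rnd.nextFloat();
            b[i] = rnd.nextFloat();
        }

        // Call add() many times so the method becomes hot enough to be
        // compiled by C2; the iteration count is an arbitrary assumption.
        float checksum = 0f;
        for (int iter = 0; iter < 100_000; iter++) {
            add(a, b, result);
            checksum += result[iter % SIZE];
        }
        // Print a value derived from the results so the work is observable.
        System.out.println(checksum);
    }
}
```

      To look at the generated code, one option is to run it with -XX:+UnlockDiagnosticVMOptions -XX:CompileCommand=print,VectorAddRepro::add, which needs the hsdis disassembler plugin to be available to the JVM. The exact instructions emitted will depend on the JDK build and the CPU.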

            Assignee:
            Unassigned
            Reporter:
            Praveen Narayanaswamy
            Votes:
            0
            Watchers:
            5
