C2 SuperWord and Vector API: vector algorithms test and benchmark

XMLWordPrintable

      This is an exploratory task.

      There are many "algorithms" that can benefit from vectorization:
      - fill
      - populate_index (aka iota)
      - reduce (sum, min, and, ... even polynomial reduction = hash)
      - map
      - copy
      - reverse (copy with shuffle)
      - find (predicate based on single element)
      - scan (prefix sum, min scan, ...)
      - filter (using compress)
      - segmented scans (list of strings -> list of hashes)
      - run-length encoding and decoding (replicate)

      In the Linear Algebra / Machine Learning area, there may be:
      - matrix multiplication
      - sparse matrix multiplication
      - softmax

      [~jrose] Has some ideas here:
      https://cr.openjdk.org/~jrose/pres/202202-VectorTopics.pdf

      Here some links on Replicate:
      https://www.dyalog.com/blog/2018/06/expanding-bits-in-shrinking-time/
      https://mlochbaum.github.io/BQN/implementation/primitive/replicate.html#boolean-compress

      [~vlivanov] Mentioned Daniel Lemire's work, there are lots of ideas here:
      http://0x80.pl/notesen.html
      https://lemire.me/blog/

      [~pminborg] Suggested compare for mismatch and simple pattern matching. What I could imagine:
      - Case-insensitive string compare
      - match all or any, possibly with and without early exit

      [~pminborg] Also mentioned sort, which I had forgot to mention earlier. We can do:
      - quicksort or radix sort. They use filter/compress with multiple passes.
      - Per also suggested sorting in a block first. I'd need to think about how to create a larger sort from that.
      Per mentioned this link: https://tweedegolf.nl/en/blog/79/sorting-with-simd

      [~pminborg] also brought up the idea of SIMD-fying object/record access, initializing fields and doing checks on objects and their fields.
      This would require deep changes in the JVM, and so it is out of scope for this experiment. But: we could consider using the Vector API to sketch these kinds of accesses, to get a sense of how much we could improve performance using SIMD. Some Ideas I have here:
      - Have a "layout" of repeating "records" (e.g. 4 ints, or 2 floats, or 1 long and 2 ints, add some null bits etc) ... then do operations with those "fake objects". Modify them, reduce over fields, filter, ...
      - I could even use ints/longs as "fake oops", use them to jump to other records and "dereference fields". That would lead to some gather/scatter experiments. Would be an interesting "sketch" to see if we could ever vectorize oop accesses ... though GC barriers will make things even more difficult if we were ever to implement this in the JVM for real objects.

      [~gfrost] Suggested this video:
      I Want a Good Parallel Language, by Raph Levien (BALISP)
      https://www.youtube.com/watch?v=0-eViUyPwso

      There are also lots of parsing tasks that can be vectorized.

      The goal is to implement many of these, to test them reasonably, check for vectorization with IR tests, and also create JMH benchmarks.

      We can compare the performance of different implementations, and also compare with scalar and auto vectorized performance.

            Assignee:
            Emanuel Peter
            Reporter:
            Emanuel Peter
            Votes:
            0 Vote for this issue
            Watchers:
            2 Start watching this issue

              Created:
              Updated: