Please find the context in the following mail chain.
On Mar 3, 2025, at 1:03 PM, Paul Sandoz <paul.sandoz@oracle.com> wrote:
Ok, thanks!
Paul.
On Mar 3, 2025, at 11:25 AM, Bhateja, Jatin <jatin.bhateja@intel.com> wrote:
Hi Paul,
The C snippet you showed makes use of a two-vector permutation instruction; our recently added Vector.selectFrom [1] API may be useful here.
I will do some experimentation and get back.
Best Regards,
Jatin.
[1] https://download.java.net/java/early_access/jdk24/docs/api/jdk.incubator.vector/jdk/incubator/vector/ByteVector.html#selectFrom(jdk.incubator.vector.Vector,jdk.incubator.vector.Vector)
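As an illustration of the suggestion above, here is a minimal scalar sketch of how a two-vector selectFrom could express the slice discussed in this thread. It models the lane semantics described in the linked Javadoc on plain byte arrays and does not call the incubator API; the class and helper names (SelectFromModel, sliceIndexes) are hypothetical.

```java
public class SelectFromModel {
    // Scalar model of indexes.selectFrom(v1, v2): each index is wrapped to
    // [0, 2n-1]; values below n pick from v1, the rest pick from v2.
    static byte[] selectFrom(byte[] indexes, byte[] v1, byte[] v2) {
        int n = v1.length;
        byte[] r = new byte[n];
        for (int i = 0; i < n; i++) {
            int w = Math.floorMod(indexes[i], 2 * n);
            r[i] = w < n ? v1[w] : v2[w - n];
        }
        return r;
    }

    // Constant index vector that would reproduce a.slice(origin, b) when used
    // as indexes.selectFrom(a, b). Assumes origin + n - 1 fits in a byte.
    static byte[] sliceIndexes(int n, int origin) {
        byte[] idx = new byte[n];
        for (int i = 0; i < n; i++) {
            idx[i] = (byte) (origin + i);
        }
        return idx;
    }

    public static void main(String[] args) {
        int n = 64;
        byte[] prev = new byte[n];
        byte[] input = new byte[n];
        for (int i = 0; i < n; i++) {
            prev[i] = (byte) i;          // lanes 0..63
            input[i] = (byte) (n + i);   // lanes 64..127
        }
        byte[] prev2 = selectFrom(sliceIndexes(n, n - 2), prev, input);
        System.out.println(prev2[0] + " " + prev2[1] + " " + prev2[2]);
    }
}
```

With a constant origin the index vector is itself a constant, so in principle the whole slice reduces to one table lookup over the two inputs.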
From: Paul Sandoz <paul.sandoz@oracle.com>
Sent: Tuesday, March 4, 2025 12:11 AM
To: Ian Graves <ian.graves@oracle.com>
Cc: Bhateja, Jatin <jatin.bhateja@intel.com>; Viswanathan, Sandhya <sandhya.viswanathan@intel.com>
Subject: Re: Vector API Performance: UTF-8 Validation Example
Thanks Ian.
AFAICT the Java algorithm derived from the paper is slightly different from the algorithm implemented in C, as if the algorithm evolved? If so we should recognize that, but clearly those algorithmic differences should result in larger differences in generated instructions.
It would be good to identify in the Java perfasm results the improvements due to the optimization of selectFrom.
I think you have rightly identified the current main bottleneck, which is the code gen of slice; it's not currently an intrinsic [1].
ByteVector prev2 = prevInputBlock.slice(species.length() - 2, input);
Vs.
static inline __m512i avx512_push_last_2bytes_of_a_to_b(__m512i a, __m512i b) {
    __m512i indexes = _mm512_set_epi64(0x3D3C3B3A39383736, 0x3534333231302F2E,
                                       0x2D2C2B2A29282726, 0x2524232221201F1E,
                                       0x1D1C1B1A19181716, 0x1514131211100F0E,
                                       0x0D0C0B0A09080706, 0x0504030201007F7E);
    return _mm512_permutex2var_epi8(b, indexes, a);
}
Or
static inline __m256i push_last_2bytes_of_a_to_b(__m256i a, __m256i b) {
    return _mm256_alignr_epi8(b, _mm256_permute2x128_si256(a, b, 0x21), 14);
}
If the slice offset is a constant I believe there is some opportunity for us to generate similar code.
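To make the correspondence concrete, the following sketch decodes the constant index table from the AVX-512 snippet above and checks that it describes the same lane movement as slice(length - 2, input). It is plain scalar Java, not generated code; the class name and helpers are hypothetical, and it assumes the usual reading of _mm512_permutex2var_epi8(b, indexes, a), where bit 6 of each index selects the third operand (here a) and the low 6 bits select the byte.

```java
public class PermuteIndexDecode {
    // The eight qwords passed to _mm512_set_epi64, listed lowest qword first
    // (the intrinsic takes them highest first).
    static final long[] QWORDS = {
        0x0504030201007F7EL, 0x0D0C0B0A09080706L,
        0x1514131211100F0EL, 0x1D1C1B1A19181716L,
        0x2524232221201F1EL, 0x2D2C2B2A29282726L,
        0x3534333231302F2EL, 0x3D3C3B3A39383736L
    };

    // Unpack the 64 byte-sized indexes in lane order.
    static int[] decode() {
        int[] idx = new int[64];
        for (int i = 0; i < 64; i++) {
            idx[i] = (int) ((QWORDS[i / 8] >>> (8 * (i % 8))) & 0xFF);
        }
        return idx;
    }

    // Describe where a result lane comes from: bit 6 set means operand a.
    static String source(int index) {
        return ((index & 0x40) != 0 ? "a[" : "b[") + (index & 0x3F) + "]";
    }

    public static void main(String[] args) {
        int[] idx = decode();
        for (int i : new int[] {0, 1, 2, 63}) {
            System.out.println("lane " + i + " <- " + source(idx[i]));
        }
    }
}
```

Lanes 0 and 1 come from a[62] and a[63], and lanes 2 to 63 come from b[0] to b[61], which is the lane layout of a.slice(62, b) for a 64-lane byte vector.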
Paul.
[1]
    final
    @ForceInline
    ByteVector sliceTemplate(int origin, Vector<Byte> v1) {
        ByteVector that = (ByteVector) v1;
        that.check(this);
        Objects.checkIndex(origin, length() + 1);
        ByteVector iotaVector = (ByteVector) iotaShuffle().toBitsVector();
        ByteVector filter = broadcast((byte)(length() - origin));
        VectorMask<Byte> blendMask = iotaVector.compare(VectorOperators.LT, filter);
        AbstractShuffle<Byte> iota = iotaShuffle(origin, 1, true);
        return that.rearrange(iota).blend(this.rearrange(iota), blendMask);
    }
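For readers less familiar with the template above, here is a scalar sketch of what it computes: two single-vector rearranges with a wrapping iota shuffle, followed by a blend. It is a model on byte arrays, not the JDK implementation; the class and method names are hypothetical. It also compares the result against the direct definition of slice.

```java
import java.util.Objects;

public class SliceTemplateModel {
    // Scalar model of v.rearrange(shuffle).
    static byte[] rearrange(byte[] v, int[] shuffle) {
        byte[] r = new byte[v.length];
        for (int i = 0; i < v.length; i++) {
            r[i] = v[shuffle[i]];
        }
        return r;
    }

    // Mirrors the steps of sliceTemplate: compare, wrapping iota, two
    // rearranges, then a blend that takes "self" where lane < n - origin.
    static byte[] sliceTemplate(byte[] self, byte[] that, int origin) {
        int n = self.length;
        Objects.checkIndex(origin, n + 1);
        int[] iota = new int[n];
        for (int i = 0; i < n; i++) {
            iota[i] = (origin + i) % n;   // models iotaShuffle(origin, 1, true)
        }
        byte[] fromThat = rearrange(that, iota);
        byte[] fromSelf = rearrange(self, iota);
        byte[] r = new byte[n];
        for (int i = 0; i < n; i++) {
            r[i] = i < n - origin ? fromSelf[i] : fromThat[i];
        }
        return r;
    }

    // Direct definition: lanes origin..n-1 of self, then lanes of that.
    static byte[] sliceDirect(byte[] self, byte[] that, int origin) {
        int n = self.length;
        byte[] r = new byte[n];
        for (int i = 0; i < n; i++) {
            int j = i + origin;
            r[i] = j < n ? self[j] : that[j - n];
        }
        return r;
    }

    public static void main(String[] args) {
        byte[] a = new byte[64];
        byte[] b = new byte[64];
        for (int i = 0; i < 64; i++) {
            a[i] = (byte) i;
            b[i] = (byte) (64 + i);
        }
        System.out.println(java.util.Arrays.equals(
            sliceTemplate(a, b, 62), sliceDirect(a, b, 62)));
    }
}
```

The model makes the cost visible: the template performs a compare, two rearranges and a blend, whereas the C versions quoted earlier need one or two instructions when the origin is constant.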
On Feb 28, 2025, at 3:17 PM, Ian Graves <ian.graves@oracle.com> wrote:
Greetings All!
Hello! I’ve been doing some work on the Oracle side experimenting with the performance of an example vectorized workload from [1] and [2], UTF-8 validation. I’ve been running some perf benchmarks comparing [1] to [2], with some minor modifications, to see how much I could close the gap between the optimized C version and the Java Vector version. Some of my findings are attached in a rough draft that I intend to share to the list, but wanted to run by you all first. The write-up is in Markdown.
It seems that I’m observing some bloated code generation around slices, and perhaps some other spots in the code, that results in much larger hot segments. It seems like a possibility for some optimization work around here, but I think you all may be a better judge of this than me on the specifics. I’m more than happy to dig in further, but at this point it probably makes sense to share some findings (attached).
I’m going to keep at this write-up for a little bit, but feel free to read the draft and offer any opinions or thoughts you may have on ways this could go forward.
Thanks!
Ian Graves
[1] https://github.com/lemire/fastvalidate-utf-8
[2] https://github.com/AugustNagro/utf8.java
<utf8-blending.md>
- relates to
  - JDK-8303762 [vectorapi] Intrinsification of Vector.slice (Open)
  - JDK-8341102 Add element type information to vector types (Open)
  - JDK-8342662 C2: Add new phase for backend-specific lowering (Open)