Currently, the stock JVM generates vzeroupper at the end of a stub and at the end of a C2-compiled method only if the code contains vector instructions wider than 128 bits. In a C2-compiled method, such instructions can come either from auto-vectorization or from inlined intrinsics. The clear_upper_avx() flag marks that the method uses wider vectors. vzeroupper is generated in the method epilog when that marker is set for the method compilation, or when max_vector_size(), as set by auto-vectorization, is greater than 128 bits.
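To make the epilog rule concrete, here is a minimal sketch of the decision as described above. It is a standalone model, not HotSpot source: the struct CompiledMethodInfo, the function needs_epilog_vzeroupper and the constant kXmmBytes are hypothetical names, and storing the vector size in bytes is an assumption of the sketch.

```cpp
// Standalone model of the epilog decision described in the text.
// All names here are hypothetical; this is not HotSpot code.

struct CompiledMethodInfo {
    bool clear_upper_avx;  // marker: method was flagged as using wider vectors
    int  max_vector_size;  // largest auto-vectorized vector, in bytes (assumed unit)
};

constexpr int kXmmBytes = 16;  // 128 bits

// vzeroupper goes into the epilog if the marker is set, or if
// auto-vectorization produced vectors wider than 128 bits.
constexpr bool needs_epilog_vzeroupper(CompiledMethodInfo m) {
    return m.clear_upper_avx || m.max_vector_size > kXmmBytes;
}

// Compile-time checks of the three interesting cases.
static_assert(needs_epilog_vzeroupper(CompiledMethodInfo{true, 0}),
              "marker alone is enough");
static_assert(needs_epilog_vzeroupper(CompiledMethodInfo{false, 32}),
              "256-bit auto-vectorized code needs it");
static_assert(!needs_epilog_vzeroupper(CompiledMethodInfo{false, 16}),
              "128-bit only code does not");
```

Note that the model only covers the method's own code. A method for which this predicate is false can still call a stub that uses wider vectors.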
JITed code needs vzeroupper (because it is called from the Interpreter) if it, or an intrinsic stub it calls, uses avx512 instructions. We don't generate vzeroupper in the JITed code epilog if the code does not use avx512. But an intrinsic stub it calls may use avx512, so we delegate vzeroupper generation to the stub.
All this is fine, but there is duplication: the code may have avx512 instructions, and it may call several intrinsic stubs that have them too. Currently, I think, we don't check for such duplication, but we may need to do that to improve performance.
Also, vzeroupper is unconditionally inserted before every CallLeaf/CallLeafNoFP (but not before CallLeafVector), irrespective of which stub is being called. So, when it comes to C2, two vzeroupper instructions are issued. We need to understand whether the vzeroupper at a call is required, or whether it is unnecessary and could be eliminated.
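The duplication can be illustrated with a small counting model. This is only a sketch under the assumptions stated in this thread: one vzeroupper is emitted before every CallLeaf/CallLeafNoFP, none before CallLeafVector, and a stub emits its own vzeroupper at its end only when it uses vectors wider than 128 bits. The enum CallKind and both functions are hypothetical names, not HotSpot identifiers.

```cpp
// Standalone counting model for vzeroupper around one leaf call.
// All names here are hypothetical; this is not HotSpot code.

enum class CallKind { Leaf, LeafNoFP, LeafVector };

// Per the text: inserted before CallLeaf/CallLeafNoFP, not before CallLeafVector.
constexpr bool vzeroupper_before_call(CallKind k) {
    return k != CallKind::LeafVector;
}

// Number of vzeroupper instructions executed for one call:
// the one before the call (if any) plus the one at the end of the stub (if any).
constexpr int vzeroupper_count_for_call(CallKind k, bool stub_uses_wide_vectors) {
    return (vzeroupper_before_call(k) ? 1 : 0) + (stub_uses_wide_vectors ? 1 : 0);
}

// A CallLeaf into a stub that uses wide vectors executes two of them.
static_assert(vzeroupper_count_for_call(CallKind::Leaf, true) == 2,
              "one before the call, one in the stub");
// A CallLeaf into a stub without wide vectors still pays for one.
static_assert(vzeroupper_count_for_call(CallKind::LeafNoFP, false) == 1,
              "unconditional vzeroupper before the call");
```

In this model, the case worth investigating is the second one: the call-site vzeroupper is issued even when the stub never touches the upper vector bits.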