Type: Bug
Resolution: Not an Issue
Priority: P4
Fix Version/s: None
Affects Version/s: 8, 11, 17, 20, 21
Component/s: core-libs
Labels:

Subcomponent: java.text
CPU: generic
OS: generic

ADDITIONAL SYSTEM INFORMATION :
macOS Ventura/Windows 10/ Java v8

A DESCRIPTION OF THE PROBLEM :
When a string containing certain non-ASCII characters is passed through the code below
Normalizer.normalize(input, Normalizer.Form.NFC);
where "input" is any such string, the normalizer changes the underlying UTF-8 byte sequence, so the result appears to be a different character altogether.
Ex. a person's name in Arabic, وَلِيِّدْ-ألطَآئِيّ, begins with the UTF-8 bytes
"D9 88 D9 8E D9 84 D9 90 D9 8A D9 91 D9 90"
After passing through the above line of code,
"Normalizer.normalize(input, Normalizer.Form.NFC);", which is defined in java.text and called from the java.net library here
https://hg.openjdk.org/jdk8/jdk8/jdk/file/687fd7c7986d/src/share/classes/java/net/URI.java#l2723
where it encodes the string, the result begins with the UTF-8 bytes
"D9 88 D9 8E D9 84 D9 90 D9 8A D9 90 D9 91"

Comparing the last 4 bytes before and after, the final two code points (D9 91 and D9 90, i.e. U+0651 and U+0650) are swapped, so the output is not identical to the original string.
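The swap described above can be reproduced in isolation. The following is a minimal sketch, not part of the original report: the class name NfcReorderDemo and its helper methods are made up for illustration, and the three-character input (yeh, shadda, kasra) is just the tail of the byte sequence quoted above. The reordering is consistent with Unicode canonical ordering, where U+0650 (combining class 32) sorts before U+0651 (combining class 33).

```java
import java.nio.charset.StandardCharsets;
import java.text.Normalizer;

// Hypothetical demo class; the names are illustrative only.
final class NfcReorderDemo {

    // Render the UTF-8 bytes of a string as space-separated hex pairs.
    static String hexUtf8(String s) {
        StringBuilder sb = new StringBuilder();
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    // Same call as in the report.
    static String nfc(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFC);
    }

    public static void main(String[] args) {
        // U+064A (yeh), U+0651 (shadda), U+0650 (kasra)
        String input = "\u064A\u0651\u0650";
        System.out.println(hexUtf8(input));      // D9 8A D9 91 D9 90
        System.out.println(hexUtf8(nfc(input))); // D9 8A D9 90 D9 91
    }
}
```

Only the order of the two combining marks changes; no code point is added, removed, or replaced.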

Another character that changes under the above line of code is ć when represented by the UTF-8 bytes 63 CC 81 (the letter c followed by U+0301 COMBINING ACUTE ACCENT).
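For the ć case, a short sketch of what NFC does to the bytes 63 CC 81 may help. The class name ComposeDemo and the helper codePoints are assumptions introduced here for illustration; only Normalizer.normalize comes from the report. NFC composes the base letter and the combining accent into the single precomposed code point U+0107.

```java
import java.text.Normalizer;
import java.util.stream.Collectors;

// Hypothetical demo class; the names are illustrative only.
final class ComposeDemo {

    // List the code points of a string in U+XXXX notation.
    static String codePoints(String s) {
        return s.codePoints()
                .mapToObj(cp -> String.format("U+%04X", cp))
                .collect(Collectors.joining(" "));
    }

    public static void main(String[] args) {
        String decomposed = "c\u0301"; // UTF-8 bytes 63 CC 81
        String composed = Normalizer.normalize(decomposed, Normalizer.Form.NFC);

        System.out.println(codePoints(decomposed));      // U+0063 U+0301
        System.out.println(codePoints(composed));        // U+0107
        System.out.println(decomposed.equals(composed)); // false
    }
}
```

Both forms render as the same glyph, but String.equals compares UTF-16 code units and therefore reports them as different.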

STEPS TO FOLLOW TO REPRODUCE THE PROBLEM :
1. Have any string that contains one of these characters (either ć or Arabic letters).
2. Write code to pass that string through "Normalizer.normalize(input, Normalizer.Form.NFC);". The java.net library calls it at https://hg.openjdk.org/jdk8/jdk8/jdk/file/687fd7c7986d/src/share/classes/java/net/URI.java#l2723
3. Capture the output of the code.
4. Compare the original string with the output of the above code and see whether they are the same.
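The java.net path mentioned in step 2 can also be exercised directly. This sketch assumes that the linked line sits in the private encoding helper that URI.toASCIIString() goes through; the class name UriNormalizationDemo and the host example.com are placeholders, not taken from the report.

```java
import java.net.URI;

// Hypothetical demo class; example.com is a placeholder host.
final class UriNormalizationDemo {

    // Parse the string as a URI and return its US-ASCII form,
    // in which non-ASCII characters are percent-encoded as UTF-8.
    static String toAscii(String uri) {
        return URI.create(uri).toASCIIString();
    }

    public static void main(String[] args) {
        // Decomposed input: c followed by U+0301 (UTF-8 bytes 63 CC 81).
        System.out.println(toAscii("http://example.com/c\u0301"));
        // Precomposed input: U+0107 (UTF-8 bytes C4 87).
        System.out.println(toAscii("http://example.com/\u0107"));
        // Both lines print http://example.com/%C4%87
    }
}
```

The percent-encoded output reflects the NFC form rather than the bytes of the original input, which is the behaviour the steps above are meant to surface.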

EXPECTED VERSUS ACTUAL BEHAVIOR :
EXPECTED -
The input string and output should be equivalent when represented in UTF-8 form
ACTUAL -
The input string and output are not equivalent when represented in UTF-8 form
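Byte-for-byte comparison and canonical equivalence are different tests, which may explain the "Not an Issue" resolution. As a hedged sketch (the class CanonicalEquality and the method canonicallyEqual are hypothetical names, not JDK API), two strings can be compared for canonical equivalence by normalizing both sides to the same form first:

```java
import java.text.Normalizer;

// Hypothetical helper class; not part of the JDK.
final class CanonicalEquality {

    // True when both strings have the same NFC form, that is, when they
    // are canonically equivalent even if their code units differ.
    static boolean canonicallyEqual(String a, String b) {
        String na = Normalizer.normalize(a, Normalizer.Form.NFC);
        String nb = Normalizer.normalize(b, Normalizer.Form.NFC);
        return na.equals(nb);
    }

    public static void main(String[] args) {
        String original = "\u064A\u0651\u0650"; // yeh, shadda, kasra
        String normalized = Normalizer.normalize(original, Normalizer.Form.NFC);

        System.out.println(original.equals(normalized));           // false
        System.out.println(canonicallyEqual(original, normalized)); // true
        System.out.println(canonicallyEqual("c\u0301", "\u0107"));  // true
    }
}
```

Under this comparison the input and the normalized output of both examples in the description are treated as the same text.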

---------- BEGIN SOURCE ----------
import java.text.Normalizer;
public class MyClass {
    public static void main(String[] args) {
      String s = "وَلِيِّدْ-ألطَآئِيّ";
      String ns = Normalizer.normalize(s, Normalizer.Form.NFC);
      boolean isEqualString = s.equals(ns);
      System.out.println("Output: " + ns + ", equal: " + isEqualString);
    }
}

---------- END SOURCE ----------

FREQUENCY : always

MyClass.java
0.3 kB
2023-05-08 23:23

Assignee: Naoto Sato

Reporter: Webbug Group

Votes: 0

Watchers: 3

Created: 2023-05-08 12:10

Updated: 2024-10-09 13:34

Resolved: 2023-05-09 11:20
