Details
-
Bug
-
Status: Resolved
-
P4
-
Resolution: Not an Issue
-
8, 11, 17, 20, 21
-
None
-
generic
-
generic
Description
ADDITIONAL SYSTEM INFORMATION :
macOS Ventura / Windows 10 / Java 8
A DESCRIPTION OF THE PROBLEM :
Some strings containing combining characters change when passed through the call below
Normalizer.normalize(input, Normalizer.Form.NFC);
where "input" is any string containing such a character. The normalizer changes the underlying code point sequence, and therefore the UTF-8 bytes, so the result no longer matches the original string.
Ex. a person's name in Arabic, وَلِيِّدْ-ألطَآئِيّ, begins with the UTF-8 bytes
"D9 88 D9 8E D9 84 D9 90 D9 8A D9 91 D9 90"
After passing through the above line of code
"Normalizer.normalize(input, Normalizer.Form.NFC);" (a java.text.Normalizer method, called from java.net.URI here
https://hg.openjdk.org/jdk8/jdk8/jdk/file/687fd7c7986d/src/share/classes/java/net/URI.java#l2723
where it encodes the string), the result begins with the UTF-8 bytes
"D9 88 D9 8E D9 84 D9 90 D9 8A D9 90 D9 91"
Comparing the last four bytes (two code points) before and after, D9 91 (U+0651) and D9 90 (U+0650) are swapped, so the output differs from the original string.
Another character that changes under the same call is ć when represented by the UTF-8 bytes 63 CC 81 (c followed by U+0301 COMBINING ACUTE ACCENT).
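The byte sequences quoted above can be checked with a short program. This is a minimal sketch, not part of the original report: the class name NfcBytes and the helper utf8Hex are hypothetical, and the Arabic input is only the seven code points that correspond to the quoted bytes, not the full name.

```java
import java.nio.charset.StandardCharsets;
import java.text.Normalizer;

// Hypothetical helper class for illustration; not part of the JDK or the report.
final class NfcBytes {
    // Render the UTF-8 encoding of a string as space-separated hex bytes.
    static String utf8Hex(String s) {
        StringBuilder sb = new StringBuilder();
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // waw, fatha, lam, kasra, yeh, shadda, kasra: the code points behind the quoted bytes
        String arabic = "\u0648\u064E\u0644\u0650\u064A\u0651\u0650";
        // c followed by U+0301 COMBINING ACUTE ACCENT
        String latin = "c\u0301";

        System.out.println(utf8Hex(arabic));
        // D9 88 D9 8E D9 84 D9 90 D9 8A D9 91 D9 90
        System.out.println(utf8Hex(Normalizer.normalize(arabic, Normalizer.Form.NFC)));
        // D9 88 D9 8E D9 84 D9 90 D9 8A D9 90 D9 91

        System.out.println(utf8Hex(latin));
        // 63 CC 81
        System.out.println(utf8Hex(Normalizer.normalize(latin, Normalizer.Form.NFC)));
        // C4 87
    }
}
```

In the Arabic case NFC only reorders the two trailing combining marks; in the Latin case it composes the base letter and the accent into the single code point U+0107.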
STEPS TO FOLLOW TO REPRODUCE THE PROBLEM :
1. Take any string that contains one of these characters (either ć written as c plus a combining accent, or Arabic letters with combining marks).
2. Pass that string through "Normalizer.normalize(input, Normalizer.Form.NFC);". The same call is made in java.net.URI at https://hg.openjdk.org/jdk8/jdk8/jdk/file/687fd7c7986d/src/share/classes/java/net/URI.java#l2723
3. Capture the output of the call.
4. Compare the original string with the output and check whether they are the same.
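The steps above can be sketched as follows, listing code points rather than bytes so that any reordering is easy to see. This is an illustrative sketch under stated assumptions: the class name ReproSteps and the helper codePoints are hypothetical, and the input is a three-code-point excerpt (yeh, shadda, kasra), not the full name from the report.

```java
import java.text.Normalizer;
import java.util.stream.Collectors;

// Hypothetical class for illustration; not part of the JDK or the report.
final class ReproSteps {
    // Format each code point as U+XXXX, separated by spaces.
    static String codePoints(String s) {
        return s.codePoints()
                .mapToObj(cp -> String.format("U+%04X", cp))
                .collect(Collectors.joining(" "));
    }

    public static void main(String[] args) {
        // Step 1: a string with combining marks (yeh, shadda, kasra)
        String input = "\u064A\u0651\u0650";

        // Step 2: pass it through the normalizer
        String output = Normalizer.normalize(input, Normalizer.Form.NFC);

        // Step 3: capture both sequences
        System.out.println("input : " + codePoints(input));   // U+064A U+0651 U+0650
        System.out.println("output: " + codePoints(output));  // U+064A U+0650 U+0651

        // Step 4: compare; the input is not already in NFC, so the strings differ
        System.out.println("already NFC: " + Normalizer.isNormalized(input, Normalizer.Form.NFC));
        System.out.println("equal: " + input.equals(output));
    }
}
```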
EXPECTED VERSUS ACTUAL BEHAVIOR :
EXPECTED -
The input string and the output should be identical when encoded as UTF-8
ACTUAL -
The input string and the output are not identical when encoded as UTF-8
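For context on the expectation above: String.equals compares code units one by one, whereas Unicode treats a string and its NFC form as canonically equivalent. The sketch below shows one way to compare two strings under that equivalence by normalizing both sides first. It is an assumption-labelled illustration, not the stated rationale for the resolution; the class name CanonicalCompare and the helper canonicallyEqual are hypothetical.

```java
import java.text.Normalizer;

// Hypothetical class for illustration; not part of the JDK or the report.
final class CanonicalCompare {
    // Two strings are treated as equal here if their NFC forms match.
    static boolean canonicallyEqual(String a, String b) {
        return Normalizer.normalize(a, Normalizer.Form.NFC)
                .equals(Normalizer.normalize(b, Normalizer.Form.NFC));
    }

    public static void main(String[] args) {
        String decomposed = "c\u0301"; // c + COMBINING ACUTE ACCENT
        String composed = "\u0107";    // precomposed ć

        System.out.println(decomposed.equals(composed));            // false
        System.out.println(canonicallyEqual(decomposed, composed)); // true

        String original = "\u064A\u0651\u0650"; // yeh, shadda, kasra
        String reordered = "\u064A\u0650\u0651"; // yeh, kasra, shadda

        System.out.println(original.equals(reordered));            // false
        System.out.println(canonicallyEqual(original, reordered)); // true
    }
}
```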
---------- BEGIN SOURCE ----------
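import java.text.Normalizer;

public class MyClass {
    public static void main(String[] args) {
        // Arabic name containing combining marks
        String s = "وَلِيِّدْ-ألطَآئِيّ";
        // Normalize to NFC, the same call java.net.URI makes when encoding
        String ns = Normalizer.normalize(s, Normalizer.Form.NFC);
        // Compare the original with the normalized result
        boolean isEqualString = s.equals(ns);
        System.out.println("Output: " + ns + ", equal: " + isEqualString);
    }
}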
---------- END SOURCE ----------
FREQUENCY : always