JDK-8307686

Normalization produces incorrect encoding for special characters

Details

    • Type: Bug
    • Status: Resolved
    • Priority: P4
    • Resolution: Not an Issue
    • Affects Version/s: 8, 11, 17, 20, 21
    • Fix Version/s: None
    • Component/s: core-libs
    • CPU: generic
    • OS: generic

    Description

      ADDITIONAL SYSTEM INFORMATION :
      macOS Ventura / Windows 10 / Java 8

      A DESCRIPTION OF THE PROBLEM :
      When a string containing certain special characters is passed through the code below,
      Normalizer.normalize(input, Normalizer.Form.NFC);
      where "input" is any string containing such a character, the normalizer changes the underlying UTF-8 encoding, which results in a different character sequence altogether.
      Example: the Arabic name وَلِيِّدْ-ألطَآئِيّ begins with the UTF-8 bytes
      "D9 88 D9 8E D9 84 D9 90 D9 8A D9 91 D9 90"
      After passing through the call
      "Normalizer.normalize(input, Normalizer.Form.NFC);", which java.net.URI uses here
      https://hg.openjdk.org/jdk8/jdk8/jdk/file/687fd7c7986d/src/share/classes/java/net/URI.java#l2723
      when it encodes the string, the UTF-8 bytes become
      "D9 88 D9 8E D9 84 D9 90 D9 8A D9 90 D9 91"

      Comparing the last 4 bytes before and after shows that the final two code points (D9 91 and D9 90) are swapped, so the output string differs from the original.
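
The byte sequences quoted above can be reproduced with a short program. This is a minimal sketch, not part of the original report: the class name NfcReorderDemo and the helper utf8Hex are hypothetical, and the input covers only the seven code points whose bytes are listed. The swap affects U+0651 (ARABIC SHADDA, canonical combining class 33) and U+0650 (ARABIC KASRA, class 32); NFC applies canonical ordering, which places the mark with the lower class first.

```java
import java.nio.charset.StandardCharsets;
import java.text.Normalizer;

class NfcReorderDemo {
    // Hypothetical helper: render a string's UTF-8 bytes as space-separated hex.
    static String utf8Hex(String s) {
        StringBuilder sb = new StringBuilder();
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // The seven code points whose bytes are quoted in the report,
        // written as escapes so the mark order is unambiguous.
        String input = "\u0648\u064E\u0644\u0650\u064A\u0651\u0650";
        String nfc = Normalizer.normalize(input, Normalizer.Form.NFC);

        System.out.println("before: " + utf8Hex(input));
        System.out.println("after:  " + utf8Hex(nfc));
        System.out.println("String.equals: " + input.equals(nfc));
    }
}
```

Only the order of the last two combining marks changes; both marks still attach to the same base letter.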

      Another special character that changes under the same call is ć, represented by the UTF-8 bytes 63 CC 81.
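
For the ć case, the bytes 63 CC 81 are the letter c (U+0063) followed by COMBINING ACUTE ACCENT (U+0301). The sketch below, with a hypothetical class name and helper, shows what NFC does to that pair: it composes the two code points into the single precomposed character U+0107.

```java
import java.text.Normalizer;

class NfcComposeDemo {
    // Hypothetical helper: list the code points of a string as U+XXXX values.
    static String codePoints(String s) {
        StringBuilder sb = new StringBuilder();
        s.codePoints().forEach(cp -> {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("U+%04X", cp));
        });
        return sb.toString();
    }

    public static void main(String[] args) {
        // Base letter c followed by a combining acute accent.
        String decomposed = "c\u0301";
        String nfc = Normalizer.normalize(decomposed, Normalizer.Form.NFC);

        // Prints: U+0063 U+0301 -> U+0107
        System.out.println(codePoints(decomposed) + " -> " + codePoints(nfc));
    }
}
```

In UTF-8 the composed form is C4 87, so the byte sequence changes even though both forms represent the same letter.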

      STEPS TO FOLLOW TO REPRODUCE THE PROBLEM :
      1. Take any string that contains one of these special characters (either ć or Arabic letters).
      2. Pass that string through "Normalizer.normalize(input, Normalizer.Form.NFC);". java.net.URI uses this call at https://hg.openjdk.org/jdk8/jdk8/jdk/file/687fd7c7986d/src/share/classes/java/net/URI.java#l2723
      3. Capture the output.
      4. Compare the original string with the output and check whether they are the same.
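
Step 2 refers to the normalization inside java.net.URI. A minimal sketch of how that path can be reached from application code, assuming URI.toASCIIString() is the entry point that leads to the linked encode routine; the class name, the helper asciiForm, and the example.com address are placeholders, not taken from the report.

```java
import java.net.URI;

class UriNormalizationDemo {
    // Hypothetical helper: parse the given URI text and return its US-ASCII form.
    static String asciiForm(String uriText) {
        return URI.create(uriText).toASCIIString();
    }

    public static void main(String[] args) {
        // The path holds c followed by a combining acute accent (UTF-8 bytes 63 CC 81).
        String ascii = asciiForm("http://example.com/c\u0301");

        // The escaped octets correspond to the NFC form of the path,
        // not to the bytes of the original string.
        System.out.println(ascii);
    }
}
```

The escaped output can then be compared against the UTF-8 bytes of the original string, as in steps 3 and 4.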

      EXPECTED VERSUS ACTUAL BEHAVIOR :
      EXPECTED -
      The input string and the output should be equivalent when represented in UTF-8 form
      ACTUAL -
      The input string and the output are not equivalent when represented in UTF-8 form
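
Whether two strings are "equivalent" depends on the comparison used. String.equals compares UTF-16 code units, so it reports a difference whenever normalization reorders or composes marks. The sketch below compares the strings after bringing both to the same normalization form, which is one common way to test canonical equivalence. The helper canonicallyEqual is a hypothetical name, not a JDK method.

```java
import java.text.Normalizer;

class CanonicalEqualityDemo {
    // Hypothetical helper: compare two strings after normalizing both to NFC.
    static boolean canonicallyEqual(String a, String b) {
        String na = Normalizer.normalize(a, Normalizer.Form.NFC);
        String nb = Normalizer.normalize(b, Normalizer.Form.NFC);
        return na.equals(nb);
    }

    public static void main(String[] args) {
        // Arabic yeh followed by shadda and then kasra, as in the reported name.
        String input = "\u064A\u0651\u0650";
        String nfc = Normalizer.normalize(input, Normalizer.Form.NFC);

        System.out.println("already NFC:       " + Normalizer.isNormalized(input, Normalizer.Form.NFC));
        System.out.println("String.equals:     " + input.equals(nfc));
        System.out.println("canonically equal: " + canonicallyEqual(input, nfc));
    }
}
```

Under this comparison the input and the normalized output count as the same text, even though their UTF-8 bytes differ.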

      ---------- BEGIN SOURCE ----------
      import java.text.Normalizer;
      public class MyClass {
          public static void main(String[] args) {
            String s = "وَلِيِّدْ-ألطَآئِيّ";
            String ns = Normalizer.normalize(s, Normalizer.Form.NFC);
            boolean isEqualString = s.equals(ns);
            System.out.println("Output: " + ns + ", equal: " + isEqualString);
          }
      }

      ---------- END SOURCE ----------

      FREQUENCY : always


          People

            Assignee: Naoto Sato (naoto)
            Reporter: Webbug Group (webbuggrp)