Summary
Change the behavior of MessageFormat.toPattern() so that any unquoted curly braces in subformat pattern strings are quoted. This fixes a bug in which the returned pattern string could not be used to recreate the original MessageFormat.
Problem
The MessageFormat class has a constructor MessageFormat(String pattern) and a method String toPattern(). Although it's not explicitly documented as such, the strong implication is that if you take a "round trip" from a MessageFormat to its pattern string and then back to a new MessageFormat, what you get will behave equivalently to the original.
Moreover, because format pattern strings are commonly extracted and used by i18n tools and processes, it's important that this "round trip" process be sound.
With other Format classes, and in most MessageFormat cases, the round trip is sound. However, there is a corner case where it fails: when a subformat pattern contains unquoted { and } characters.
Explanation: MessageFormat.toPattern() builds its pattern string by concatenating the plain text segments, with any special characters therein suitably quoted, and the subformat pattern strings from each subformat, nested inside subformat syntax that looks like {0,fmttype,XXX}. The subformat patterns XXX, which come from the subformats' toPattern() methods, are assumed to have any plain text characters that are special to that subformat already quoted, and so the XXX strings are appended as-is.
The problem is that the set of characters special in the subformat pattern string is not necessarily equal to the set of characters special in a MessageFormat pattern string. So any characters in the latter set but not the former might not be quoted in XXX, and if not, they can cause the resulting MessageFormat pattern string to be parsed differently. In particular, the { and } characters, which are special to MessageFormat, are vulnerable.
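The mismatch between the two sets of special characters can be seen with DecimalFormat alone. The following is a minimal sketch, not part of the change itself; the class name SubformatQuotingDemo and the helper decimalPattern are hypothetical names chosen for illustration.

```java
import java.text.DecimalFormat;
import java.text.DecimalFormatSymbols;
import java.util.Locale;

public class SubformatQuotingDemo {

    // Hypothetical helper: parse a DecimalFormat pattern and return what
    // DecimalFormat.toPattern() reports for it.
    static String decimalPattern(String pattern) {
        DecimalFormatSymbols symbols = DecimalFormatSymbols.getInstance(Locale.ROOT);
        return new DecimalFormat(pattern, symbols).toPattern();
    }

    public static void main(String[] args) {
        // '#' is special to DecimalFormat, so a literal '#' in the prefix
        // has to stay quoted in the reported pattern.
        System.out.println(decimalPattern("'#'0.##"));

        // '}' is not special to DecimalFormat, so the quotes around the
        // prefix are dropped and the brace comes back unquoted.
        System.out.println(decimalPattern("':} '#.##"));
    }
}
```

The second pattern comes back with a bare } in it, which is harmless to DecimalFormat but changes the meaning once the string is embedded in a MessageFormat pattern.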
Here's a demonstration using JShell:
| Welcome to JShell -- Version 17.0.9
| For an introduction type: /help intro
jshell> import java.text.*
jshell> var fmt1 = new MessageFormat("{0,number,':} '#.##}")
fmt1 ==> java.text.MessageFormat@0
jshell> fmt1.format(new Object[] {1.359})
$3 ==> ":} 1.36"
jshell> fmt1.toPattern()
$4 ==> "{0,number,:} #0.##}"
jshell> var fmt2 = new MessageFormat($4)
fmt2 ==> java.text.MessageFormat@db3adf3c
jshell> fmt2.format(new Object[] {1.359})
$6 ==> ":1 #0.##}"
jshell> fmt2.toPattern()
$7 ==> "{0,number,:#} #0.##}"
The "round trip" operation has resulted in the string #0.##} being moved out of the number subformat and into the following plain text segment of the MessageFormat.
After this change, the result of fmt2.toPattern() will be the same as fmt1.toPattern() and will have the extra curly brace quoted, i.e., "{0,number,:'}' #0.##}".
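The same check can be written as a small standalone program. This is a sketch only: the class RoundTripCheck and the method roundTripPreservesOutput are hypothetical names, and comparing formatted output for one set of arguments is a weaker test than full semantic equivalence. The locale is fixed to Locale.ROOT so the decimal separator does not depend on the machine's default locale.

```java
import java.text.MessageFormat;
import java.util.Locale;

public class RoundTripCheck {

    // Hypothetical helper: rebuild a MessageFormat from its own toPattern()
    // output and report whether both instances format the arguments alike.
    static boolean roundTripPreservesOutput(String pattern, Object... args) {
        MessageFormat original = new MessageFormat(pattern, Locale.ROOT);
        MessageFormat rebuilt = new MessageFormat(original.toPattern(), Locale.ROOT);
        return original.format(args).equals(rebuilt.format(args));
    }

    public static void main(String[] args) {
        // No curly braces inside the subformat pattern.
        System.out.println(roundTripPreservesOutput("{0,number,#.##}", 1.359));

        // The corner case from the JShell session: a quoted '}' in the
        // number subformat. The result depends on whether the JDK in use
        // includes this change.
        System.out.println(roundTripPreservesOutput("{0,number,':} '#.##}", 1.359));
    }
}
```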
Solution
The solution is to add quoting where needed so that a "round trip" operation recreates the original even if there are unquoted curly brace characters in a subformat pattern string.
Specifically, MessageFormat needs to analyze each subformat pattern string to determine which characters are quoted and which are not, and then quote any unquoted curly brace characters it finds. This process must leave any already-quoted characters alone.
How do we know it is safe to quote the unquoted curly brace characters in the subformat patterns? If curly braces are not special to the subformat, then quoting them clearly does no harm, so we only need to worry about subformat patterns where curly braces are special. But the only subformat pattern strings supported by MessageFormat are for DecimalFormat, SimpleDateFormat, and ChoiceFormat, and curly braces are not special for any of these classes, so we're good (see Notes).
To minimize the effect of this change, the output of MessageFormat.toPattern() should only change if additional quoting was actually required. This will be a small subset of pattern strings, because it is rare for curly braces to appear in XXX subformat patterns.
More precisely:
MessageFormat.toPattern() will now analyze each subformat pattern string before adding it to the overall pattern string it is constructing.
- If the subformat pattern string contains no unquoted { or } characters, then MessageFormat will behave exactly as before.
- If the subformat pattern string contains any unquoted { or } characters, those characters will be newly quoted.
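The two cases above can be sketched as a standalone helper. This is an illustrative sketch and not necessarily how the JDK implements the change; the class BraceQuotingSketch and the method quoteUnquotedBraces are hypothetical names. It assumes the usual quoting rules, where a single quote toggles quoting and a doubled quote stands for one literal quote.

```java
import java.util.BitSet;

public class BraceQuotingSketch {

    // Hypothetical helper: return the subformat pattern with every unquoted
    // '{' or '}' quoted, leaving already-quoted text as it was.
    static String quoteUnquotedBraces(String pattern) {
        StringBuilder literals = new StringBuilder(); // literal characters, quotes stripped
        BitSet mustQuote = new BitSet();              // which literals belong inside quotes
        boolean inQuote = false;
        boolean changeNeeded = false;

        // Pass 1: work out which characters are quoted and which braces are not.
        for (int i = 0; i < pattern.length(); i++) {
            char ch = pattern.charAt(i);
            if (ch == '\'') {
                if (i + 1 < pattern.length() && pattern.charAt(i + 1) == '\'') {
                    literals.append('\''); // doubled quote: one literal quote
                    i++;
                } else {
                    inQuote = !inQuote;    // single quote: toggle quoting
                }
                continue;
            }
            boolean brace = ch == '{' || ch == '}';
            changeNeeded |= brace && !inQuote;
            if (inQuote || brace) {
                mustQuote.set(literals.length());
            }
            literals.append(ch);
        }

        // Case 1: no unquoted braces, so return the input exactly as before.
        if (!changeNeeded) {
            return pattern;
        }

        // Case 2: rebuild the pattern, merging adjacent quoted runs so that
        // a closing and an opening quote never meet and read as a doubled quote.
        StringBuilder out = new StringBuilder();
        boolean outQuoted = false;
        for (int i = 0; i < literals.length(); i++) {
            char ch = literals.charAt(i);
            if (ch == '\'') {
                out.append("''");          // a doubled quote works in either state
                continue;
            }
            if (mustQuote.get(i) != outQuoted) {
                out.append('\'');
                outQuoted = !outQuoted;
            }
            out.append(ch);
        }
        if (outQuoted) {
            out.append('\'');
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(quoteUnquotedBraces(":} #0.##"));  // prints :'}' #0.##
        System.out.println(quoteUnquotedBraces("':} '#.##")); // prints ':} '#.## (unchanged)
    }
}
```

Collecting the literal characters first, rather than inserting quotes during the scan, avoids turning input such as 'a'} into 'a''}', where the doubled quote would be read as a literal quote character.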
Specification
--- a/src/java.base/share/classes/java/text/MessageFormat.java
+++ b/src/java.base/share/classes/java/text/MessageFormat.java
@@ -553,6 +553,11 @@ private void applyPatternImpl(String pattern) {
* The string is constructed from internal information and therefore
* does not necessarily equal the previously applied pattern.
*
+ * @implSpec The implementation in {@link MessageFormat} returns a
+ * string that, when passed to a {@code MessageFormat()} constructor
+ * or {@link #applyPattern applyPattern()}, produces an instance that
+ * is semantically equivalent to this instance.
+ *
* @return a pattern representing the current state of the message format
*/
public String toPattern() {
Notes
- Although curly brace characters are not special in ChoiceFormat pattern strings, there is some special logic in MessageFormat that clouds this question: after MessageFormat evaluates a ChoiceFormat subformat, if the resulting string contains an opening curly brace, then a new MessageFormat is created from that string and evaluated, and the result of that MessageFormat operation replaces the original ChoiceFormat result. In particular, this can happen recursively to an arbitrary depth. This all causes a lot of confusion when looking at a single ChoiceFormat subformat pattern string within a MessageFormat, because it's hard for a human to keep track of all the steps that occur during parsing and evaluation, especially when there are multiple levels of nesting. Regardless, this special logic has nothing to do with how subformat pattern strings within MessageFormat pattern strings should be quoted - it only affects how those subformats' outputs are interpreted at "run time".
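The re-evaluation of a ChoiceFormat result can be seen with a pattern like the one used in the MessageFormat class documentation. This is a minimal sketch; the class ChoiceRecursionDemo and the method describe are hypothetical names chosen for illustration.

```java
import java.text.MessageFormat;
import java.util.Locale;

public class ChoiceRecursionDemo {

    // The third choice contains a nested {0,number,integer}. ChoiceFormat
    // returns that text verbatim; MessageFormat then sees the '{' in the
    // result and evaluates it as a new MessageFormat with the same arguments.
    static final String PATTERN =
        "There {0,choice,0#are no files|1#is one file|1<are {0,number,integer} files}.";

    // Hypothetical helper: format the message for a given file count.
    static String describe(int count) {
        return new MessageFormat(PATTERN, Locale.US).format(new Object[] { count });
    }

    public static void main(String[] args) {
        System.out.println(describe(0)); // prints There are no files.
        System.out.println(describe(1)); // prints There is one file.
        System.out.println(describe(5)); // prints There are 5 files.
    }
}
```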
- csr of JDK-8323699 MessageFormat.toPattern() generates non-equivalent MessageFormat pattern (Closed)