Loading...

Type: Enhancement
Resolution: Fixed
Priority: P4
Fix Version/s: 5.0
Affects Version/s: 1.4.1
Component/s: core-libs
Labels:
- webbug

Subcomponent:
java.net
Resolved In Build:
tiger
CPU:

x86
OS:

windows_2000

Name: nt126004 Date: 02/21/2003

FULL PRODUCT VERSION :
java version "1.4.1_01"
Java(TM) 2 Runtime Environment, Standard Edition (build 1.4.1_01-b01)
Java HotSpot(TM) Client VM (build 1.4.1_01-b01, mixed mode)

FULL OPERATING SYSTEM VERSION : Microsoft Windows 2000
[Version 5.00.2195]

ADDITIONAL OPERATING SYSTEMS : All

EXTRA RELEVANT SYSTEM CONFIGURATION :
Problem applies for all versions of the JDK.

A DESCRIPTION OF THE PROBLEM :
The class, java.net.URLDecoder, performs extremly poorly -
to the point of unusability. For example, it takes roughly
14 seconds to decode a string of approximately 500K
characters on a Pentium 4 workstation.

The performance problem stems from URLDecoder's repeated
allocation of an internal temporary buffer used for
character encoding. By changing URLDecoder's implementation
to allocate the buffer once, the encoding time dropped from
14 seconds to approximately 200 milliseconds.

As an aside, URLDecoder allocates the StringBuffer uses as
output with a default buffer size. The StringBuffer should
be allocated with a buffer that is at least as large as
1/3rd the size of the input string, based on simple math.

STEPS TO FOLLOW TO REPRODUCE THE PROBLEM :
1. Execute java.net.URLDecoder.decode()

EXPECTED VERSUS ACTUAL BEHAVIOR :
Expected results: Approximately linear performance results
with respect to the length of the input string and the
number of encoded characters.

Actual Results: Geometrically explosive decoding times,
depending upon the length of the input string and the number
of encoded characters.

REPRODUCIBILITY :
This bug can be reproduced always.

---------- BEGIN SOURCE ----------
The following source code is a modified version of Sun's implementation of
java.net.URLDecoder. It implements the two specified changes with the resulting
performance improvement. Note: This is not a version of URLDecoder optimized to
the Nth degree - It only contains 2 very simple changes which result in a 100x
performance improvement over the existing URLDecoder.

/*
* A slightly modified version of Sun's implementation of java.net.URLDecoder.
* The original copyright follows.
*
* @author Toby Reyelts
*
*/

/*
* @(#)URLDecoder.java 1.20 01/12/03
*
* Copyright 2002 Sun Microsystems, Inc. All rights reserved.
* SUN PROPRIETARY/CONFIDENTIAL. Use is subject to license terms.
*/

package test;

import java.io.*;

/**
* Utility class for HTML form decoding. This class contains static methods
* for decoding a String from the <CODE>application/x-www-form-urlencoded</CODE>
* MIME format.
* 
* To conversion process is the reverse of that used by the URLEncoder class. It
is assumed
* that all characters in the encoded string are one of the following:
* "<code>a</code>" through "<code>z</code>",
* "<code>A</code>" through "<code>Z</code>",
* "<code>0</code>" through "<code>9</code>", and
* "<code>-</code>", "<code>_</code>",
* "<code>.</code>", and "<code>*</code>". The
* character "<code>%</code>" is allowed but is interpreted
* as the start of a special escaped sequence.
* 
* The following rules are applied in the conversion:
* 
* <ul>
* <li>The alphanumeric characters "<code>a</code>" through
* "<code>z</code>", "<code>A</code>" through
* "<code>Z</code>" and "<code>0</code>"
* through "<code>9</code>" remain the same.
* <li>The special characters "<code>.</code>",
* "<code>-</code>", "<code>*</code>", and
* "<code>_</code>" remain the same.
* <li>The plus sign "<code>+</code>" is converted into a
* space character "<code> </code>" .
* <li>A sequence of the form "<code>%xy</code>" will be
* treated as representing a byte where xy is the two-digit
* hexadecimal representation of the 8 bits. Then, all substrings
* that contain one or more of these byte sequences consecutively
* will be replaced by the character(s) whose encoding would result
* in those consecutive bytes.
* The encoding scheme used to decode these characters may be specified,
* or if unspecified, the default encoding of the platform will be used.
* </ul>
* 
* There are two possible ways in which this decoder could deal with
* illegal strings. It could either leave illegal characters alone or
* it could throw an <tt>{@link java.lang.IllegalArgumentException}</tt>.
* Which approach the decoder takes is left to the
* implementation.
*
* @author Mark Chamness
* @author Michael McCloskey
* @version 1.20, 12/03/01
* @since 1.2
*/

public class FastUrlDecoder {

 // The platform default encoding
 // Can't use this, since it's private to java.net. Of course,
 // this can be changed back for the real java.net.URLDecoder.
 // static String dfltEncName = java.net.URLEncoder.dfltEncName;
 static String dfltEncName = null;

 /**
 * Decodes a <code>x-www-form-urlencoded</code> string.
 * The platform's default encoding is used to determine what characters
 * are represented by any consecutive sequences of the form
 * "<code>%xy</code>".
 * @param s the <code>String</code> to decode
 * @deprecated The resulting string may vary depending on the platform's
 * default encoding. Instead, use the decode(String,String) method
 * to specify the encoding.
 * @return the newly decoded <code>String</code>
 */
 public static String decode(String s) {

 String str = null;

 try {
 str = decode(s, dfltEncName);
 }
 catch ( UnsupportedEncodingException e ) {
 // The system should always have the platform default
 }

 return str;
 }

 /**
 * Decodes a <code>application/x-www-form-urlencoded</code> string using a
specific
 * encoding scheme.
 * The supplied encoding is used to determine
 * what characters are represented by any consecutive sequences of the
 * form "<code>%xy</code>".
 * 
 * Note: The <a href=
 * "http://www.w3.org/TR/html40/appendix/notes.html#non-ascii-chars">
 * World Wide Web Consortium Recommendation</a> states that
 * UTF-8 should be used. Not doing so may introduce
 * incompatibilites.
 *
 * @param s the <code>String</code> to decode
 * @param enc The name of a supported
 * <a href="../lang/package-summary.html#charenc">character
 * encoding</a>.
 * @return the newly decoded <code>String</code>
 * @exception UnsupportedEncodingException
 * If the named encoding is not supported
 * @see URLEncoder#encode(java.lang.String, java.lang.String)
 */
 public static String decode( String s, String enc ) throws
UnsupportedEncodingException {

 boolean needToChange = false;

 /* Modification 1 - The original URLDecoder allocates a StringBuffer with a
default
 * buffer size. For the composition of large strings, constant reallocation of
 * the internal buffer in StringBuffer is expensive.
 *
 * In the case of url decoding, the smallest decoded string would require
 * at least 1/3rd the space of the original string. We can allocate
 * a buffer the length of <code>s</code> optimistically, or to be conservative,
 * we could allocate the buffer as something like numChars / 3 or numChars / 2.
 */
 int numChars = s.length();
 // StringBuffer sb = new StringBuffer();
 StringBuffer sb = new StringBuffer( numChars );

 int i = 0;

/** Commented out temporarily, while this is not java.net.URLDecoder
 if ( enc.length() == 0 ) {
 throw new UnsupportedEncodingException ("URLDecoder: empty string enc
parameter");
 }
*/

 /* Modification 2 - In the original URLDecoder, <code>bytes</code> was local
 * to the try block for <code>case '%'</code>. Every single time a '%' was
encountered,
 * URLDecoder allocated an entirely new buffer 1/3rd the size of the rest
 * of the string to be decoded. That is an entirely wasteful allocation of
 * heap space. For example, given a string of length 500K with approximately
 * 1K encoded characters, the URLDecoder would allocate something on the
 * order of 100M of memory.
 *
 * The simple fix, used here, is to only allocate the buffer once during the
 * execution of the decode. The first time it is lazily allocated, it will
be large
 * enough for all ensuing encounters with any encoded characters.
 *
 */
 byte[] bytes = null;

 while ( i < numChars ) {

 char c = s.charAt( i );

 switch ( c ) {

 case '+':
 sb.append(' ');
 i++;
 needToChange = true;
 break;

 case '%':

 /*
 * Starting with this instance of %, process all
 * consecutive substrings of the form %xy. Each
 * substring %xy will yield a byte. Convert all
 * consecutive bytes obtained this way to whatever
 * character(s) they represent in the provided
 * encoding.
 */

 try {

 // (numChars-i)/3 is an upper bound for the number
 // of remaining bytes

 /* Lazily allocated - and only once in the new version of URLDecoder
 */
 if ( bytes == null ) {
 bytes = new byte[ ( numChars - i ) / 3 ];
 }

 int pos = 0;

 while ( ((i+2) < numChars) && (c=='%')) {

 bytes[pos++] = (byte)Integer.parseInt(s.substring(i+1,i+3),16);
 i+= 3;

 if ( i < numChars ) {
c = s.charAt(i);
 }
 }

 // A trailing, incomplete byte encoding such as
 // "%x" will cause an exception to be thrown
 if ((i < numChars) && (c=='%')) {
 throw new IllegalArgumentException( "URLDecoder: Incomplete
trailing escape (%) pattern");
 }

 if ( enc == null ) {
 sb.append(new String(bytes, 0, pos ));
 }
 else {
 sb.append(new String(bytes, 0, pos, enc));
 }
 }
 catch (NumberFormatException e) {
 throw new IllegalArgumentException( "URLDecoder: Illegal hex
characters in escape (%) pattern - " + e.getMessage());
 }

needToChange = true;
 break;

 default:
 sb.append(c);
 i++;
 break;
 }
 }

 return (needToChange? sb.toString() : s);
 }
}

---------- END SOURCE ----------

CUSTOMER WORKAROUND :
Rewrite a new url decoder from scratch.
(Review ID: 181588)
======================================================================

Details

Description

Attachments

Activity

People

Dates