-
Bug
-
Resolution: Fixed
-
P2
-
1.1.1
-
1.2alpha2
-
sparc
-
solaris_2.5.1
-
Not verified
Enclosed find a description of a bug that we have found in the code that
changes a bytecode to its quick variant. This is a subtle bug that is
hard to find and may cause unexpected behavior.
Through reading the implementation of the JVM, I and Alain Azagury, a
colleague here at Haifa Research Lab, believe that we have
found a bug in the way that the JVM changes the getfield and putfield
bytecodes to their quick variants. (We also suspect a similar bug in
the conversion of invocation bytecodes to their quick variants, but we
have not yet had time to check it.) The bug is a subtle bug in
synchronization. In particular, the code synchronizes all threads
that try to update the bytecode to the quick variant, but it does not
synchronize the threads that execute (e.g., read) the bytecode.
The bug is due to hidden assumptions made by the programmers of the
JVM that the order in which a thread performs assignment statements
and the order in which those assignments are viewed as occurring by
another thread are the same. Given today's optimizing compilers and
weak consistency models used in multi-processor architectures, this
assumption may not be true.
The nature of this bug makes it very hard to find, hard to repeat, and
potentially hard to know that it occurred; yet, it could allow a
program to read or write the wrong place in memory.
Detailed Description
--------------------
(For the sake of simplicity we will describe the bug for getfield and
getfield-quick; the same bug occurs for putfield and putfield-quick.)
The original getfield bytecode consists of an opcode byte for getfield
followed by two index bytes, which together serve as an index into the
constant table. The getfield_quick bytecode consists of an opcode
byte for getfield_quick followed by a single offset byte; the third
byte (e.g., the former second index byte) is ignored when
getfield_quick is executed.
To convert a getfield bytecode to its quick variant, the JVM uses the
index to access the constant table and resolve the constant it finds
there to an offset. This is the offset that is put back into the code
when the getfield opcode is replaced by the getfield_quick.
When the JVM encounters a getfield bytecode in the Java instruction
stream, it always converts it to its quick variant and then it
executes the quick variant. Subsequent executions of the instruction
will also execute the quick variant.
Execution flow for the getfield bytecode including conversion process:
---------------------------------------------------------------------
1. The JVM reads the getfield opcode and dispatches to the
appropriate code.
2. The JVM obtains the CODE_LOCK lock.
3. The JVM rereads the opcode from the instruction stream and also
reads the index bytes.
4. The JVM releases CODE_LOCK.
5. The JVM checks that the reread opcode is the same as the original
opcode. If not, it goes back and re-executes the instruction from the
beginning.
6. The JVM uses the index bytes into the constant pool in order to
find the class and the offset (In Java terminology this is called
constant resolution).
7. The JVM re-obtains the CODE_LOCK lock.
8. It rechecks the opcode in the instruction stream again. If the
opcode has changed, it releases the lock and goes back and re-executes
the instruction from the beginning.
9. The JVM replaces the first index byte in the instruction stream by
the offset.
10. The JVM replaces the getfield opcode by the getfield_quick opcode.
11. The JVM releases the lock.
12. The JVM executes the quick opcode.
Execution flow for the getfield_quick opcode:
---------------------------------------------
1. The JVM reads the getfield_quick opcode and dispatches to the
appropriate code.
2. The JVM reads the offset parameter from the instruction stream.
3. The JVM uses the offset to read the appropriate field from the
object.
Notice that execution of the getfield_quick opcode does not obtain any
lock.
Possible interleavings of threads:
----------------------------------
If one thread is in the process of converting a getfield bytecode to a
getfield_quick and another thread executes the same bytecode
instruction simultaneously, there can be a race condition that is not
totally closed by the CODE_LOCK mechanism. There are three scenarios:
1. The second thread sees the old getfield. In this case, it also
enters the flow for getfield and attempts to obtain CODE_LOCK. When
the CODE_LOCK is granted, it will discover that the bytecode has
changed; thus, it will release the lock and re-execute the instruction
as a getfield_quick. Thus, there is no problem.
2. The second thread sees both the new getfield_quick and the new
offset byte. In this case, there is no problem.
3. The second thread sees the new getfield_quick and the old index
byte. This can happen if a compiler reorders steps 9 and 10 of the
getfield execution flow; it can also happen in a multi-processor,
which employs weakly consistent storage access ordering, such as the
PowerPC and the PentiumPro.
A compiler can reorder code
---------------------------
We checked with the compiler experts in Haifa and verified that modern
optimizing compilers can change the order of store instructions. They
can do this when they do not detect any dependency between the two
instructions. The code for the getfield execution flow is written in
a way such that the compiler cannot detect a dependency between step
9, which stores the offset into the bytecode stream, and step 10, which
stores the getfield_quick.
In this case there is no good reason for a compiler to change the
order of the store instructions. However, there is also no indication
from the programmer to indicate that the order should not be changed.
Appropriate use of the C "volatile" modifier in the variable
declarations could force the compiler not to change the ordering.
A multi-processor can execute load and stores out of order
----------------------------------------------------------
We checked with the hardware and VLSI experts in Haifa and verified
that the PowerPC uses weakly consistent storage access ordering. Here
is an example to explain this concept:
processor 1 processor 2
----------- -----------
std r1,A ld r3,B
std r2,B ld r4,A
Assume that the old value of A is a0 and the new value is a1;
similarly, for B, b0 and b1. According to weak consistency, processor
2 might see any of the following <r3, r4> pairs: <b0, a0>, <b0, a1>,
<b1, a0>, <b1, a1>.
If we compare the above scenario to the getfield execution, the two
stores of process 1 correspond to steps 9 and 10. The two loads of
process 2 correspond to steps 1 and 2 of the getfield_quick execution.
Thus, a thread executing a getfield_quick opcode can see the new
getfield_quick opcode and the old index belonging to the original
getfield instruction. This can occur even if the compiler does not
reorder the store instructions.
This race condition can be closed by inserting PowerPC "execution
synchronizing" instructions, e.g., sync, in appropriate places in the
code.
The bug occurs on other architectures including Sun's Sparc architecture.
1) The bug exists on two levels: (1) the compiler level for both
uniprocessors and multiprocessors and (2) the architecture level for
multiprocessors.
Compiler level:
If an optimizing compiler can tell that two store statements always
write to two distinct locations of memory, then the compiler can reorder
those statements. This is true for any computing platform, i.e., any
combination of hardware and operating system. So the bug exists at the
compiler level for all platforms on which the JVM runs.
Architecture level:
An architecture for a multiprocessor specifies a memory coherence model.
For performance reasons, most modern multiprocessor architectures
specify weak coherence models. In a weak model, if one processor makes
consecutive stores to two locations, other processors are not guaranteed
to see those stores in the same order they were made by the first
processor.
The Sparc, PowerPC, and Alpha all specify weak memory coherence models. So
the synchronization bug could occur on all three of these architectures and
probably on others.
changes a bytecode to its quick variant. This is a subtle bug that is
hard to find and may cause unexpected behavior.
Through reading the implementation of the JVM, I and Alain Azagury, a
colleague here at Haifa Research Lab, believe that we have
found a bug in the way that the JVM changes the getfield and putfield
bytecodes to their quick variants. (We also suspect a similar bug in
the conversion of invocation bytecodes to their quick variants, but we
have not yet had time to check it.) The bug is a subtle bug in
synchronization. In particular, the code synchronizes all threads
that try to update the bytecode to the quick variant, but it does not
synchronize the threads that execute (e.g., read) the bytecode.
The bug is due to hidden assumptions made by the programmers of the
JVM that the order in which a thread performs assignment statements
and the order in which those assignments are viewed as occurring by
another thread are the same. Given today's optimizing compilers and
weak consistency models used in multi-processor architectures, this
assumption may not be true.
The nature of this bug makes it very hard to find, hard to repeat, and
potentially hard to know that it occurred; yet, it could allow a
program to read or write the wrong place in memory.
Detailed Description
--------------------
(For the sake of simplicity we will describe the bug for getfield and
getfield-quick; the same bug occurs for putfield and putfield-quick.)
The original getfield bytecode consists of an opcode byte for getfield
followed by two index bytes, which together serve as an index into the
constant table. The getfield_quick bytecode consists of an opcode
byte for getfield_quick followed by a single offset byte; the third
byte (e.g., the former second index byte) is ignored when
getfield_quick is executed.
To convert a getfield bytecode to its quick variant, the JVM uses the
index to access the constant table and resolve the constant it finds
there to an offset. This is the offset that is put back into the code
when the getfield opcode is replaced by the getfield_quick.
When the JVM encounters a getfield bytecode in the Java instruction
stream, it always converts it to its quick variant and then it
executes the quick variant. Subsequent executions of the instruction
will also execute the quick variant.
Execution flow for the getfield bytecode including conversion process:
---------------------------------------------------------------------
1. The JVM reads the getfield opcode and dispatches to the
appropriate code.
2. The JVM obtains the CODE_LOCK lock.
3. The JVM rereads the opcode from the instruction stream and also
reads the index bytes.
4. The JVM releases CODE_LOCK.
5. The JVM checks that the reread opcode is the same as the original
opcode. If not, it goes back and re-executes the instruction from the
beginning.
6. The JVM uses the index bytes into the constant pool in order to
find the class and the offset (In Java terminology this is called
constant resolution).
7. The JVM re-obtains the CODE_LOCK lock.
8. It rechecks the opcode in the instruction stream again. If the
opcode has changed, it releases the lock and goes back and re-executes
the instruction from the beginning.
9. The JVM replaces the first index byte in the instruction stream by
the offset.
10. The JVM replaces the getfield opcode by the getfield_quick opcode.
11. The JVM releases the lock.
12. The JVM executes the quick opcode.
Execution flow for the getfield_quick opcode:
---------------------------------------------
1. The JVM reads the getfield_quick opcode and dispatches to the
appropriate code.
2. The JVM reads the offset parameter from the instruction stream.
3. The JVM uses the offset to read the appropriate field from the
object.
Notice that execution of the getfield_quick opcode does not obtain any
lock.
Possible interleavings of threads:
----------------------------------
If one thread is in the process of converting a getfield bytecode to a
getfield_quick and another thread executes the same bytecode
instruction simultaneously, there can be a race condition that is not
totally closed by the CODE_LOCK mechanism. There are three scenarios:
1. The second thread sees the old getfield. In this case, it also
enters the flow for getfield and attempts to obtain CODE_LOCK. When
the CODE_LOCK is granted, it will discover that the bytecode has
changed; thus, it will release the lock and re-execute the instruction
as a getfield_quick. Thus, there is no problem.
2. The second thread sees both the new getfield_quick and the new
offset byte. In this case, there is no problem.
3. The second thread sees the new getfield_quick and the old index
byte. This can happen if a compiler reorders steps 9 and 10 of the
getfield execution flow; it can also happen in a multi-processor,
which employs weakly consistent storage access ordering, such as the
PowerPC and the PentiumPro.
A compiler can reorder code
---------------------------
We checked with the compiler experts in Haifa and verified that modern
optimizing compilers can change the order of store instructions. They
can do this when they do not detect any dependency between the two
instructions. The code for the getfield execution flow is written in
a way such that the compiler cannot detect a dependency between step
9, which stores the offset into the bytecode stream, and step 10, which
stores the getfield_quick.
In this case there is no good reason for a compiler to change the
order of the store instructions. However, there is also no indication
from the programmer to indicate that the order should not be changed.
Appropriate use of the C "volatile" modifier in the variable
declarations could force the compiler not to change the ordering.
A multi-processor can execute load and stores out of order
----------------------------------------------------------
We checked with the hardware and VLSI experts in Haifa and verified
that the PowerPC uses weakly consistent storage access ordering. Here
is an example to explain this concept:
processor 1 processor 2
----------- -----------
std r1,A ld r3,B
std r2,B ld r4,A
Assume that the old value of A is a0 and the new value is a1;
similarly, for B, b0 and b1. According to weak consistency, processor
2 might see any of the following <r3, r4> pairs: <b0, a0>, <b0, a1>,
<b1, a0>, <b1, a1>.
If we compare the above scenario to the getfield execution, the two
stores of process 1 correspond to steps 9 and 10. The two loads of
process 2 correspond to steps 1 and 2 of the getfield_quick execution.
Thus, a thread executing a getfield_quick opcode can see the new
getfield_quick opcode and the old index belonging to the original
getfield instruction. This can occur even if the compiler does not
reorder the store instructions.
This race condition can be closed by inserting PowerPC "execution
synchronizing" instructions, e.g., sync, in appropriate places in the
code.
The bug occurs on other architectures including Sun's Sparc architecture.
1) The bug exists on two levels: (1) the compiler level for both
uniprocessors and multiprocessors and (2) the architecture level for
multiprocessors.
Compiler level:
If an optimizing compiler can tell that two store statements always
write to two distinct locations of memory, then the compiler can reorder
those statements. This is true for any computing platform, i.e., any
combination of hardware and operating system. So the bug exists at the
compiler level for all platforms on which the JVM runs.
Architecture level:
An architecture for a multiprocessor specifies a memory coherence model.
For performance reasons, most modern multiprocessor architectures
specify weak coherence models. In a weak model, if one processor makes
consecutive stores to two locations, other processors are not guaranteed
to see those stores in the same order they were made by the first
processor.
The Sparc, PowerPC, and Alpha all specify weak memory coherence models. So
the synchronization bug could occur on all three of these architectures and
probably on others.