Sequenced Collections Compatibility Analysis

Stuart W. Marks
2022-09-20

Introduction

This is an analysis of the potential compatibility impact of the Sequenced Collections proposal. The proposal introduces new interfaces and new methods into the Java collections framework type hierarchy, some with covariant overrides, and it also introduces covariant overrides into existing interfaces. The collections framework classes and interfaces are widely used and subclassed, which increases the risk of incompatibilities that might arise when modifying the framework.

This analysis considers the following potential areas of incompatibility:

  1. Class Name Collisions

  2. Method Conflicts

  3. Covariant Overrides and the reversed Method

  4. Covariant Overrides of SequencedMap View Collection Methods

  5. Behavior Changes Caused by Introduction of New Methods

Methodology

This analysis uses two main techniques for assessing incompatibility issues: a corpus search and the "nine-case" technique for trying various combinations of old and new versions of an application, a library, and the JDK.

Corpus Searching

This part of the analysis uses a set of searches through a large corpus of Java code. The corpus consists of about 115,650 jar files downloaded from Maven Central. There are a total of about 29 million classes in the corpus.

There are, however, not 29 million unique classes in the corpus. There is a considerable amount of duplication among certain classes. Some popular libraries (such as Google Guava, Google Protobuf, and Apache Commons) have been copied and "shaded" into other jar files hundreds of times. ("Shading" is the process of renaming and rewriting the class files of a library so that it cannot conflict with other versions of the same library on the classpath. The name comes from the Maven Shade plugin.) Thus, a single issue that might occur within a popular library might occur hundreds of times in the corpus. However, it can be mitigated by fixing it just once in the library's source code.

The Nine-Case Technique

This part of the analysis uses a prototype JDK that includes an implementation of the proposal, a library that contains a class that might be the victim of an incompatibility, and an application that uses that library. Various versions of each component are used:

The baseline case is JDKo + LIBo + APPo. The assumption is that this is a working configuration in the field today.

Assessing compatibility impact requires running on JDKn. This leads to nine combinations, with each of the three different versions of the library and application:

  1. LIBo + APPo
  2. LIBo + APPr
  3. LIBo + APPn
  4. LIBr + APPo
  5. LIBr + APPr
  6. LIBr + APPn
  7. LIBn + APPo
  8. LIBn + APPr
  9. LIBn + APPn

Some of these cases might seem unlikely, but it's important to test all combinations. In particular it might not seem reasonable to run a "new" version of the application using the new JDK APIs with an "old" version of a library binary. In fact this case likely arises frequently in practice. A typical application build system based on maven will recompile the application against a new JDK, but it might bring in library binaries from Maven Central that were compiled against older JDKs. Even though the application is being recompiled, it relies on library binary compatibility to run on newer JDKs.

In updating the application to use new APIs, particular attention has been paid to calling various APIs through variables of different static types. This figures prominently in the compatibility analysis. This is important, as differences in static types do occur in practice. Consider the following:

    var list1 = new MySpecialList<Node>();
    List<Node> list2 = new MySpecialList<>();

    list1.method();
    list2.method();

The two calls to method generate different bytecode at the call site, and depending upon the exact combination of application, library, and JDK in use, they might end up calling different methods.

Areas of Incompatibility

This section surveys different issues that can arise when evolving an existing API.

Class Name Collisions

The proposal introduces three new types:

    java.util.SequencedCollection
    java.util.SequencedSet
    java.util.SequencedMap

There are no classes in the corpus with matching simple names of SequencedCollection, SequencedSet, or SequencedMap. It's possible that some class outside the corpus has a matching simple name. A matching simple name would potentially cause a source incompatibility. However, this is easily mitigated through the use of import statements and fully qualified class names.

Method Conflicts

A newly introduced method in an interface could clash with an existing method defined in a class that implements that interface. Methods with the same name do not necessarily cause an incompatibility; the number of arguments, their types, and the return types also play a role. There are several sub-cases to consider when assessing method conflicts:

Not all of the issues apply to this proposal. However, they are listed here because they were considered and determined not to be applicable. (This list should also be useful for future compatibility analyses.)

To analyze potential method conflicts with this proposal, the corpus was searched for methods with the following names, matching parameter types (if any), and with any return type.

    reversed()
    addFirst(Object)            // (1)
    addLast(Object)             // (1)
    getFirst()                  // (1)
    getLast()                   // (1)
    removeFirst(Object)         // (1)
    removeLast(Object)          // (1)
    firstKey()                  // (2)
    lastKey()                   // (2)
    firstEntry()                // (3)
    lastEntry()                 // (3)
    pollFirstEntry()            // (3)
    pollLastEntry()             // (3)
    putFirst(Object, Object)
    putLast(Object, Object)
    sequencedKeySet()
    sequencedValues()
    sequencedEntrySet()

    // (1) - promoted from Deque to SequencedCollection
    // (2) - promoted from SortedMap to SequencedMap
    // (3) - promoted from NavigableMap to SequencedMap

Where a method is listed as being "promoted," it is moved up the interface hierarchy while preserving its exact name, parameter types, and return type. Covariant overrides are not introduced for these methods.

The search was limited to collection-related classes, where a direct inheritance conflict might occur. The search also included matching names in any interface (even those not obviously related to collections) because of the possibility of those interfaces being mixed into a single implementation class.

There were a large number of matches for {add,get,remove}{First,Last} methods. These were mostly for classes that implement Deque. In some cases the class implemented Deque with a concrete parameter type, e.g.

    class MyDeque implements Deque<Node>

In this case the compiler will generate bridge methods that provide proper overloads for the Deque methods. This proposal should not affect any of these cases.

Similarly, matching methods firstKey and lastKey appear in implementations of SortedMap. This is also not a compatibility problem.

There were about ten matches for various methods and signatures with conflicting return types. Most of these were in individual libraries that don't appear to be in widespread use, or they appeared to be in internal implementation classes.

Of particular note are some classes in the Apache Commons Collections library. This library is quite popular and has been copied and shaded into around 300 jar files and is in common use. The org.apache.commons.collections4.list package has the following classes:

    AbstractLinkedList
    CursorableLinkedList
    CursorableSubList
    NodeCachingLinkedList

All of them have the following methods:

    boolean addFirst(Object)
    boolean addLast(Object)

This is clearly a return type conflict. Since the Commons Collections libraries are widely used, additional scrutiny is warranted here.

Analysis was performed using the nine-case technique. The results of this analysis showed that:

However, the library itself suffers a source incompatibility: it cannot be recompiled on the new JDK because of the method return type conflict. It's not clear whether it's possible to mitigate this in a compatible fashion. If the library were recompiled and changed so that its methods match the JDK's, this would cause an incompatibility between the library and its clients.

This is indeed a potentially difficult incompatibility, but it's unclear whether it is a problem in practice. Even though Apache Commons Collections is widely used and copied, it doesn't follow that these particular classes are used frequently. To gauge how frequently these classes are used, we used the SourceGraph tool to look for uses of these classes' addFirst and addLast methods.

The result was that there were only a handful of actual uses of these methods. There were many matches in copies of the Apache Commons Collections library itself. There were also quite a number of matches in tests for this library. There were still more matches in several copies of CodeQL, a vulnerability analysis tool. Apparently this tool generates tests based on API signatures. After setting these aside, there were only a handful of actual cases where other applications or libraries used these methods -- and in these cases, they didn't use the return value.

Thus, while changing the signatures of these methods from boolean to void is strictly an incompatible change, it's likely that such a to Apache Commons Collections will have a small impact.

The name "reversed" does appear fairly frequently in the classes searched in the corpus. It often occurs in unrelated classes where it poses no problems. A particularly prominent example is in sub-interfaces of the JDK Comparator class, which has a reversed method. There are many Comparator implementations in the corpus that inherit or override this method. It seems quite unlikely, though, that a class would be both a Collection and a Comparator.

Covariant Overrides and the reversed Method

The proposal introduces not only a new reversed() method, but it also introduces covariant overrides of this method in both new and existing collection interfaces. Specifically, the proposal introduces this new method:

    SequencedCollection<E>:
        /* abstract */ SequencedCollection reversed();

and this covariant override on this new interface:

    SequencedSet<E>:
        default SequencedSet<E> reversed() { ... }

and also these covariant overrides on existing interfaces:

    List<E>:
        default List<E> reversed() { ... }

    Deque<E>:
        default Deque<E> reversed() { ... }

    SortedSet<E>:
        default SortedSet<E> reversed() { ... }

and finally into the following existing class:

    LinkedHashSet<E>:
        SequencedSet<E> reversed() { ... }

Note that reversed is abstract on SequencedCollection, an interface being retrofitted into an existing interface hierarchy. This is somewhat unusual, but it's permissible, because all classes and interfaces that descend from it provide implementations.

The reversed method is new in this proposal, and as discussed above, we expect direct inheritance conflicts to be rare. However, given that there are different covariant overrides in different parts of the hierarchy, it is possible for a class to inherit conflicting method definitions from different interfaces.

This can give rise to conflicts because of differing return types. Even if the return types don't differ, a conflict can also occur if more than one default method definition occurs in the superinterface graph (as discussed above).

This case could arise if a class inherits from different parts of the collection framework hierarchy. This is quite rare, but it does occur in practice.

The number of classes and interfaces in the corpus that inherit from multiple different parts of the collections hierarchy is quite small. There are several combinations that don't give rise to conflicts, because only one of the kinds falls under the SequencedCollection sub-hierarchy. Common examples involve mixtures of List+Queue, Set+Queue, and List+Set. (Indeed, there are multiple cases where a class implements both List and Set. It's well understood that it is not possible to write a class that correctly implements both the Set and List interfaces, however, this doesn't prevent people from doing it.)

The only cases found in the corpus where a potential conflict can occur is with List+Deque. Indeed, in the JDK itself, LinkedList implements both List and Deque. (There is also an internal class inside AWT that implements List and Deque.)

We used the nine-case technique to analyze the compatibility implications of the covariant overrides of reversed in this proposal.

Case 1, old binaries of applications and libraries, is binary compatible, because there is no possibility of them attempting to invoke any new default methods. (Assuming there are no method inheritance conflicts.)

All other cases also work fine, with the exception of cases 3, 4, 5, and 6.

Case 3 illustrates both source and binary compatibility issues. The library code has a class declared something like this:

    public class ListDeque<E> implements List<E>, Deque<E> { ... }

A modified application (APPn) might reasonably want to reverse a ListDeque. This works for individual List and Deque implementations, because default implementations of the reversed method are provided. However, suppose the application does this:

    ListDeque<String> ld = new ListDeque<>(...);
    ld.reversed();

This results in a compile-time error:

    error: reference to reversed is ambiguous
    both method reversed() in Deque and method reversed() in List match

This isn't a terrible issue, since the application is being modified and recompiled anyway. The mitigation is to adjust the static type of the receiver, either via a cast or by changing the type of the variable, prior to calling reversed, to either List or Deque as discussed below.

Note that if the application adjusts the static type of the receiver to SequencedCollection, this will fail at runtime. Thus if we have

    SequencedCollection<String> ld = new ListDeque<>(...);
    ld.reversed();

The result is

    AbstractMethodError: Receiver class lib.ListDeque does not define or inherit an
    implementation of the resolved method 'abstract java.util.SequencedCollection
    reversed()' of interface java.util.SequencedCollection.

The problem is that the static type information at the point of call refers to SequencedCollection::reversed -- which is an abstract method. The receiver is an instance of ListDeque, which has no implementation of reversed in itself or inherited from a superclass. The JVM thus searches for default methods among the superinterfaces and finds two default methods -- one from List and one from Deque. This is illegal and so causes the JVM to throw a linkage error.

Developers might find this surprising, but again it's not difficult to mitigate. Since the library has not been recompiled, it is not possible to get a reversed view of a ListDeque whose type is ListDeque. However, it's possible to get a reversed view whose type is either List or Deque, by changing the static type of the receiver:

    List<String> ld = new ListDeque<>(...);
    ld.reversed();

Alternatively, the static type could be established by a cast:

    ListDeque<String> ld = new ListDeque<>(...);
    ((Deque<String>)ld).reversed();

Given that List as enhanced by this proposal is mostly equivalent to Deque in power, choosing one or the other static type is not a great loss.

Cases 4, 5, and 6 don't work at all; the library cannot be recompiled unchanged, because ListDeque inherits conflicting definitions of the reversed method that must be resolved by the programmer. Given the class

    public class ListDeque<E> implements List<E>, Deque<E> { ... }

Recompiling it produces the error:

    error: types Deque<E> and List<E> are incompatible;
    public class ListDeque<E> implements List<E>, Deque<E> {
    both define reversed(), but with unrelated return types

This is somewhat subtle to mitigate, but there are a couple different approaches.

One approach that's available to third party code is to refactor one of the interfaces into a nested class, making it available as a "view" collection, instead of having the class implement both interfaces simultaneously. For example, one could rewrite this class as follows:

    public class ListDeque<E> implements List<E> {
        class DequeView implements Deque<E> {
            // Deque method implementations that
            // delegate to the outer class
        }

        public Deque<E> asDeque() {
            return new DequeView();
        }
    }

Callers would then need to be updated to request the Deque view as necessary. This is fairly simple, as both classes (outer and inner) would inherit their respective default implementations for the reversed method.

Unfortunately this technique is not available to the JDK, because of constraints on API evolution. Third party code with similar constraints might also need to pursue an alternative.

One suitable alternative involves returning a type that implements both List and Deque. This could be a new interface, or it could be a subclass of the class itself. For example, something like the following was done with LinkedList in the JDK:

    public class LinkedList<E> implements List<E>, Deque<E> {

        public LinkedList<E> reversed() {
            return new ReverseOrderLinkedListView<>(
                this, super.reversed(), Deque.super.reversed());
        }

        static class ReverseOrderLinkedListView<E> extends LinkedList<E> {
            final LinkedList<E> list;
            final List<E> rlist;
            final Deque<E> rdeque;

            ReverseOrderLinkedListView(LinkedList<E> list, List<E> rlist, Deque<E> rdeque) {
                this.list = list;
                this.rlist = rlist;
                this.rdeque = rdeque;
            }

            public LinkedList<E> reversed() { return list; }

            public boolean add(E e) {
                return rlist.add(e);
            }

            public E pollFirst() {
                return rdeque.pollFirst();
            }

            // ...
        }
    }

The implementation of ReverseOrderLinkedListView is somewhat tedious, as it requires essentially all of the methods of the parent class to be overridden. The technique shown here relies on the default reversed-view implementations of both List and Deque and provides them to the nested class, which then delegates to one or the other as necessary. This is a moderate amount of work, but it's more tedious than difficult.

Implementation assistance for collections has historically been provided in the form of abstract classes (such as AbstractCollection) which required overriding only a few methods to be implemented in order to provide a fully functioning implementation. One could imagine a similar approach (perhaps AbstractSequencedCollection) that might require only a descending iterator to be implemented, beyond what's already required by AbstractCollection. A limitation of the abstract class approach is that it's difficult to retrofit into an existing class hierarchy. An alternative would be to provide a static functions that could be called by an implementation class instead of requiring inheritance. These functions might include creating a reverse-ordered view from an existing List or Deque, using the same technique as those interfaces' default implementations of the reversed method.

Additional possibilities for mitigation are to provide documentation covering this case or to publish articles that describe how one might perform this retrofitting.

It's important to note that the case of mixed collection kinds such as List+Deque is overall quite rare. Out of the 29 million classes in the corpus, there are about 80,000 collection-related classes and interfaces (where "collection-related" is something that inherits from a collection class or interface). Of these 80,000, there are only about ten (10) that implement both List and Deque. None of these appear in any popular library.

It's of course possible that there are additional List+Deque implementations that don't occur in the corpus. But it seems likely that the overall number of such implementations is quite small.

Summary of analysis of covariant overrides for reversed:

Covariant Overrides of SequencedMap View Collection Methods

An earlier version of this proposal included introduction of covariant overrides into the SequencedMap interface. Based on the results of the compatibility analysis, we've moved away from this approach and are pursuing an alternative. However, the analysis is interesting and useful and is included here to help inform future API evolution.

The covariant overrides that were proposed were as follows:

    SequencedMap<K,V>:
        default SequencedSet<K> keySet() { ... }
        default SequencedCollection<V> values() { ... }
        default SequencedSet<Map.Entry<K,V>> entrySet() { ... }

SequencedMap is extended in the JDK by LinkedHashMap and SortedMap. These override methods on the Map interface that return Set, Collection, and Set, respectively.

This design does two things:

The introduction of covariant overrides in a superclass sets up conditions for a source incompatibility, similar to cases 4, 5, and 6 in the analysis of reversed above. Unlike the reversed case, which is a new method, there are existing subclasses that override these methods. These subclasses cannot be recompiled, as they now have illegal "contravariant" overrides.

A more insidious problem is set up by the combination of covariant overrides and default implementations. This was revealed by using the nine-case analysis technique. Suppose that, in a library, there is a subclass MyLinkedHashMap, which provides customized implementations of keySet, values, and entrySet views. These have return types of Set, Collection, and Set, according to the definitions of these methods on the original Map interface. Suppose further that there is an application using MyLinkedHashMap. Combinations of old, recompiled, and new versions of the library and application were tested.

The most interesting issue is revealed by case 2, where the unchanged source code of an application is recompiled on a new JDK and is run on a new JDK with an old library binary. As before, this situation can occur in practice, if library binaries from a repository such as Maven Central are used.

First, consider the baseline case where the application and library are compiled on an old JDK (without SequencedCollections). The compilation environment is as follows:

    LinkedHashMap<K,V>:
        Set<K> keySet() 

    MyLinkedHashMap<K,V> extends LinkedHashMap<K,V>:
        Set<K> keySet() // provides its own keySet and doesn't call super

    App:
        LinkedHashMap<K,V> lhm = new MyLinkedHashMap<>();
        lhm.keySet()

Note carefully the static type of the receiver in the invocation of the keySet method. The app's bytecode at the call site of keySet looks like this:

    invokevirtual LinkedHashMap.keySet:()Ljava/util/Set;

When the app is executed, this call site will be linked to the overriding method MyLinkedHashMap::keySet. This is presumably the behavior intended to be provided by the overriding method.

In the new JDK, the compilation environment looks like this:

    LinkedHashMap:
        SequencedSet<K> keySet() // covariant override

    MyLinkedHashMap<K,V> extends LinkedHashMap<K,V>:
        Set<K> keySet() // provides its own keySet and doesn't call super

    App:
        LinkedHashMap<K,V> lhm = new MyLinkedHashMap<>();
        lhm.keySet()

(There is also a bridge method compiled into LinkedHashMap, because this is a covariant override of the Map::keySet method. However, the bridge method is not relevant to this example.)

When the app is recompiled in this environment, the bytecode at the call site looks like this:

    invokevirtual LinkedHashMap.keySet:()Ljava/util/SequencedSet;

Note that the return type of the call site's Methodref differs after recompilation. When the app is run, this will link to and call LinkedHashMap::keySet and not MyLinkedHashMap::keySet, because the latter no longer overrides the former, at least, not the version whose return type is SequencedSet.

Recompiling existing source code against the modified class library records different information at the call site, resulting in linkage to a different method at runtime. This is a silent behavior change introduced by recompilation of the same source code. It could lead to unexpected behavior or inexplicable errors downstream. This is clearly a case that should be avoided.

In retrospect, introducing a covariant override of a method that might have existing overrides is semantically dubious. It effectively invalidates any existing overrides. The covariant override introduces additional postconditions (in this case, returning a SequencedSet instead of a plain Set) which cannot possibly be fulfilled by any existing override.

Based on this analysis, we have revised the design not to include covariant overrides on the SequencedMap interface. Instead, the keySet, values, and entrySet methods will be left unchanged, and new methods will be introduced to provide sequenced return types for those views:

    SequencedMap<K,V>:
        default SequencedSet<K> sequencedKeySet() { ... }
        default SequencedCollection<V> sequencedValues() { ... }
        default SequencedSet<Map.Entry<K,V>> sequencedEntrySet() { ... }

The general conclusion is that one should avoid introducing covariant overrides into a class that is likely to have existing subclasses and overridden methods. The source compatibility is perhaps fairly obvious. However, the more insidious issue is the silent change in behavior that could occur if an application is recompiled without changing its source code. This by itself is reason enough to avoid this technique.

Behavior Changes Caused by Introduction of New Methods

Addition of new methods still leaves open the possibility for potentially surprising behavior that might be counter to the expectations of the subclass. This is typical when new functionality is added to a superclass. (This is known as the fragile base class problem.) This applies to the addition of the sequencedX view methods mentioned above, the reversed method, and indeed any new method introduced into a superinterface or superclass.

These situations are covered in case 3 of the analysis, where an old library binary is used with a modified and enhanced application on a new JDK. Suppose for example that the MyLinkedHashMap subclass wants to provide customized behavior of preventing a certain key-value mapping from being removed. To do this, it would override a several mutator methods (including remove, computeIfPresent, etc.) to provide the customized behavior. The subclass would also need to override the view-providing methods and return implementations that also implement this customized behavior. However, the actual mappings would still be stored in the LinkedHashMap superclass fields.

Consider what would happen if the application were to do this:

    LinkedHashMap<K,V> lhm = new MyLinkedHashMap<>();
    var entry = lhm.pollFirstEntry();

Since this is an existing (old) library binary, there is no override of the pollFirstEntry method, so the LinkedHashMap implementation will be called. If the special key-value mapping happens to be the first entry, it will be removed.

This is perhaps unfortunate but is unavoidable in the design of this subclass, which attempts to modify the behavior of a superclass by overriding the right set of methods. If new methods are added to the superclass, they can bypass overrides that the subclass uses to enforce additional invariants.

A preferable design for modifying the behavior of concrete classes is to use composition instead of inheritance. (See Bloch, Effective Java, 3rd edition, (c) 2018 Pearson Education. Item 18: "Favor composition over inheritance.")

Nonetheless, the subclass-and-override technique does occur in practice, and it's likely that some breakage will result from this change. Similar issues (here is one example) did occur in Java 8 when the stream default method was introduced. There was a transition period where people fixed their libraries to be compatible with Java 8.

Introducing new methods into existing interfaces is not without risk. However, this sort of change has been made in the past, and in general most developers accept that the risks of doing so are acceptable.

Summary

The nine-case analysis technique has proven useful on several counts. Beyond the fairly obvious issues of source compatibility, the nine-case analysis showed some subtle issues with behavioral compatibility of adding covariant overrides in the presence of existing overrides. These were sufficiently severe to cause us to modify the design. In addition, the nine-case analysis showed that covariant overrides for newly introduced methods works well with existing binaries.

Corpus-based analysis is also helpful to assess compatibility issues. While a large number of incompatibility issues are possible, the corpus analysis shows that many of them either simply do not occur in practice or are quite rare. The corpus of course does not contain all Java source code in the world, but we believe it is a sufficiently representative sample to be able to extrapolate from it. An incompatibility that occurs only rarely in the corpus is also likely to be rare outside the corpus.

In an ecosystem as large as Java's, it is inevitable that some code somewhere will be broken by any change that is close to the core of the system. This analysis shows that incompatibilities should be fairly infrequent, and when they do occur, it is usually possible to mitigate them without undue effort. There is of course a judgment call to be made here; however, we believe that the current design provides sufficient value to outweigh the incompatibility issues that we have discovered.