Uploaded image for project: 'JDK'
  1. JDK
  2. JDK-8357674

Classifier API to Map Finite Sets to Indexes

XMLWordPrintable

    • Icon: JEP JEP
    • Resolution: Unresolved
    • Icon: P4 P4
    • None
    • core-libs
    • None
    • john Rose
    • Feature
    • Open
    • JDK

      ROUGH DRAFT

      Summary

      Improve performance for querying constant tables, executing switches, and performing other classification tasks over fixed sets of alternative values.

      Goals

      • Speed up many search tasks which currently rely on table lookup, binary search, switches, or decision trees.

      • Provide a basis for continual improvement over switch-statement execution speed, with no recompilation.

      Motivation

      It is a very common operation to dispatch between several alternative values in a fixed finite set (of some agreed-upon type T). The output of the dispatch is an index, an agreed-upon value of some other type I (usually int).

      We may call this a classification problem, and a specific algorithm for a specific set of alternative values is a classifier.

      Logically speaking, the resulting index is an integer value, although it may be "dressed up" as an enumeration or other token. For any fixed set of alternative values, the set of possible indexes correspond to a range 0..N-1, where N is the number of alternative values. Some sentinel value like -1 (or any negative number) is reserved to signal the presence of a value (of the agreed-upon type T) not in the given set.

      Every programmer has written hand-coded solutions to classification problems, many times. Sometimes speed is not required, and in that case, a linear search, over a clearly marked list or table, is a way to express clearly what is needed.

      But when speed is required, linear search is not good enough. A binary search sometimes works, or some clever table lookup based on hashes or radixes, or a Java switch statement or expression might be applicable. The choice between these options (and more!) depends on the details of T and on the specific alternative values being discriminated.

      In general, it is difficult to know which tactic to use, to make a good classifier for a given classification problem. What's worse, the best tactics (such as perfect hashing) are impossible to maintain. If you spend an hour finding a perfect hash code for some particular input set, in a year some maintainer might have to take days figuring out what you did. As a result, only code for the most demanding uses is hand-optimized to full speed, accepting the higher maintainability costs.

      Sometimes the set of values is variable over time, in which case a mutable list or map data structure is fine way to keep track of the set of alternative values, and their indexes. The downside of such a structure is its very flexibility: There is no clear point at which the alternative value set is complete, and so it is difficult to know when to expend extra effort to organize the alternative values in a form (such as binary search or perfect hashing) which makes the dispatch operation as fast as possible.

      In Java, simple classifier objects can be discerned hiding among the lists:

      interface Classifier<T> { int indexOf(T value); }
      Classifier<String> cfr1 = List.of("no", "yes")::indexOf;
      System.out.println(cfr1.indexOf("yes"));   //=> 1
      System.out.println(cfr1.indexOf("no"));    //=> 0
      System.out.println(cfr1.indexOf("maybe")); //=> -1

      But they are seldom used in this way, since List::indexOf performs a linear search. If linear search is the only option, there is no benefit to calling out a special Classifier interface.

      There is also a classifier hiding in the Enum API, specifically for the set of values in each enumeration:

      enum Suit { CLUB, SPADE, HEART, DIAMOND }
      Classifier<Suit> ecfr = Suit::ordinal;
      System.out.println(ecfr.indexOf(Suit.HEART)); //=> 2

      This one is very fast, but it only works if you care about the whole enum, and in the enum's given order. Note that this is an example of a classifier which is total over its input type T.

      But, in general, there is no simple style for writing performant and maintainable code that can solve many kinds of classification problems. The typical classification problem is a unique snowflake, to be solved by human ingenuity alone.

      Description

      We introduce the interface Classifier<T> and associated factories and bootstrap methods. As part of the interface contract, instances can be optimized for performance that will be routinely superior to linear search, as good as most hand-coding techniques, and often better.

      By using JDK classifiers, coders can solve their classification problems at full speed, without worrying about surfacing the difficult detail of the optimal algorithm. The JIT and JDK can be co-engineered to ensure that each classifier uses an internal algorithm that is efficient on the specific platform the JVM is running on.

      The above "yes/no" classifier might be requested as follows:

      var cfr2 = Classifier.of(String.class, List.of("no", "yes"));
      System.out.println(cfr2.indexOf("yes")); => 1
      System.out.println(cfr2.indexOf("no")); => 0
      System.out.println(cfr2.indexOf("maybe")); => -1

      Although the two-value example here can tolerate linear search, the classifier might be internally reorganized to have a faster implementation, depending on the precise details of the alternative value set. For example:

      var cfr3 = (Classifier<String>) s ->
        switch (s.length()) {
        case 2 -> (s.equals("no") ? 0 : -1);
        case 3 -> (s.equals("yes") ? 1 : -1);
        default -> -1;
        };

      An explicit switch over an enum value can be coded like this:

      private static final Classifier<Suit>
        ECLS = EnumClassifier.of(Suit.class, Suit.HEART, Suit.CLUB);
      …
      switch (ECLS.indexOf(e)) {
      case 0 -> foo();  //case HEART
      case 1 -> bar();  //case CLUB
      }

      The classifier is created just once, as a static field or other constant. Any one-time costs paid to optimize its internals may be amortized over many uses. Since the classifier is created after the enum class is loaded, any table inside classifier is correct, and there is no possible failure due to separate recompilation of the enumeration. This JEP may include bootstrap methods for indy-based translation of enum switches.

      (There are a few more examples in this file of demo code.)

      Such reorganizations will no longer have to be done by hand, in forms that may be difficult to maintain. With the classifier API, the JDK (and the JIT as it cooperates with the JDK) manage those details transparently.

      The Classifier Zoo

      Besides the types T (and I which is int), there are a few other interesting degrees of freedom in a classifier API. First, if the user demands a specific numbering for the cases, then the alternative values must be presented in the form of a list. This is useful when the number corresponds to an external API of some sort, or to a hard-coded switch statement.

      But if the user does not have a strong opinion about which values should go with which indexes, the value are presented as a set, which is naturally unordered. This produces a result similar to a perfect hash table: Each value gets an index, but it is up to the internal algorithm which value goes in the first table slot, etc. This will be expressed, if needed, as an API point like Classifier::ofSet. This is equivalent to a perfect hash table construction, and may benefit from an additional parameter which permits gaps in the index set (semi-perfect hashes, which are easier to derive).

      It also makes sense, sometimes, to use the keys of a map as the source of alternative values, in those special cases where the user wishes two values to map to one index. If two keys map to the same hash table value, then a map-based classifier would take care to assign the same index to both keys. Such a map-based classifier can be immediately used to create a constant map whose performance and footprint may surpass a regular map. The indexes produced would be specified by the values (not keys) of the table, either as a list or as a set. This will be expressed, if needed, as an API point like Classifier::ofMap, with optional parameters to control how the map values are to be indexed.

      Another degree of freedom is the amount of effort to put into finding an efficient internal organization of the classification logic. Should hash codes be used, and even perfect hashing? Is there some component of the value type (T) that happens to be different across all the alternatives, so we can switch on it? Should we generate static tables? Should there be a log-depth decision tree?

      The answer depends on which set of alternative values is being classified (and the type T), and also on how much effort the user is willing to spend on finding a good algorithm. When executing a switch statement in Java source code, it is likely that pausing a moment (at link time) is a useful investment in future performance. But if the value set is ephemeral, with only a few queries in the future, there is no benefit to added optimization time. Some advanced classifier algorithms (such as those known to require perfect hashing) may have an optional "effort" parameter, which convey's the user's wishes.

      There may be specialized classifier factories that set out to use specific techniques. However, the generic API point Classifier::of, as shown above, is preferable for most usage, since it is free to use whichever internal technique seems best, for the specific value set and hardware platform.

      There are bootstrap methods to implement Java's increasing variety of switches, so that as classifier technology matures, the bootstrap methods can be upgraded in place. In this way, old switches can gain newly improved performance, with no recompilation of Java sources.

      Some classifiers may be given invocation counters or return value statistics, so they can reoptimize themselves, transparently, after statistics are gathered. Switches backed by such classifiers would outperform today's switches in many cases.

      In order to make classifiers easier to compose, or enhance their transparency for other purposes, a more informative sub-interface may perhaps be defined:

      interface Classifier<T> { int indexOf(T value); }
      interface ClassifierInfo<T> { Class<T> valueClass(); List<T> values(); }
      interface InformativeClassifier<T> extends Classifier<T> {
        ClassifierInfo<T> classifierInfo();
      }

      (Because of the connection to bootstrap methods, this JEP is initially placed in java.lang.invoke. As a data structure related to list, some or all of it may belong in java.util.)

            Unassigned Unassigned
            jrose John Rose
            John Rose John Rose
            Votes:
            1 Vote for this issue
            Watchers:
            2 Start watching this issue

              Created:
              Updated: