Uploaded image for project: 'JDK'
  1. JDK
  2. JDK-8318599

HttpURLConnection cache issues leading to crashes in JGSS w/ native GSS introduced by 8303809

XMLWordPrintable

    • b26
    • Not verified

        # Crashes

        We recently upgrade to OpenJDK 17.0.8.1 and started observing crashes
        resulting from double-frees via `gss_delete_sec_context()`.

        Adding `-Djdk.spnego.cache=false` to our java invocations stops the
        crashes. We believe this is due to a race condition that has long been
        in `HttpURLConnection` that was mostly harmless before but which with
        the addition of this commit:

           16843770b5a 8303809: Dispose context in SPNEGO NegotiatorImpl

        became a use-after-free / double-free bug, leading to crashes.

        This happens with stock 17.0.8.1+1 from Adoptium.

        Attached is a reproducer, Test.java, which you can run like this:

           $ # credendials are kinit user@jgssbug.twosigma.com, password password
           $ #
           $ "$JAVA_HOME/bin/java" \
                     -Dsun.security.jgss.native=true \
                     -Dsun.security.jgss.lib=/usr/lib/libgssapi_krb5.so \
                     Test.java https://kerberos.club/tmp/jgssbug.txt

        It doesn't crash every time, but it crashes often.

        A typical Java stack trace upon crashing looks like:

        j sun.security.jgss.wrapper.GSSLibStub.deleteContext(J)J+0 java.security.jgss@17.0.8.1
        j sun.security.jgss.wrapper.NativeGSSContext.dispose()V+76 java.security.jgss@17.0.8.1
        j sun.security.jgss.GSSContextImpl.dispose()V+16 java.security.jgss@17.0.8.1
        j sun.net.www.protocol.http.spnego.NegotiatorImpl.disposeContext()V+11 java.security.jgss@17.0.8.1
        J 4456 c1 sun.net.www.protocol.http.NegotiateAuthentication.disposeContext()V java.base@17.0.8.1 (24 bytes) @ 0x00007f909dc33834 [0x00007f909dc337c0+0x0000000
        000000074]
        j sun.net.www.protocol.http.HttpURLConnection.getInputStream0()Ljava/io/InputStream;+1923 java.base@17.0.8.1
        j sun.net.www.protocol.http.HttpURLConnection.getInputStream()Ljava/io/InputStream;+62 java.base@17.0.8.1
        j sun.net.www.protocol.https.HttpsURLConnectionImpl.getInputStream()Ljava/io/InputStream;+4 java.base@17.0.8.1
        j Test.lambda$main$0()V+16
        j Test$$Lambda$207+0x00007f9020144208.run()V+0
        j java.lang.Thread.run()V+11 java.base@17.0.8.1
        v ~StubRoutines::call_stub

        On the C side the crash typically happens in the allocator, typically in
        `free()`:

        #0 __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
        #1 0x00007fb6e1185535 in __GI_abort () at abort.c:79
        #2 0x00007fb6e11dc648 in __libc_message (action=action@entry=do_abort, fmt=fmt@entry=0x7fb6e12e62a0 "%s\n") at ../sysdeps/posix/libc_fatal.c:181
        #3 0x00007fb6e11e2d6a in malloc_printerr (str=str@entry=0x7fb6e12e444e "free(): invalid pointer") at malloc.c:5359
        #4 0x00007fb6e11e459c in _int_free (av=<optimized out>, p=<optimized out>, have_lock=<optimized out>) at malloc.c:4180
        #5 0x00007fb582ba5b6d in krb5_free_address (context=context@entry=0x7fb5d80cbf60, val=0x7fb5d8000080) at kfree.c:62
        #6 0x00007fb582b93d9e in krb5_auth_con_free (context=context@entry=0x7fb5d80cbf60, auth_context=0x7fb5d80778a0) at auth_con.c:60
        #7 0x00007fb6440a3eec in krb5_gss_delete_sec_context (minor_status=0x7fb586868504, context_handle=0x7fb5d80e22d0, output_token=<optimized out>)
           at delete_sec_context.c:77
        #8 0x00007fb6440964b0 in gssint_delete_internal_sec_context (minor_status=0x7fb586868504, mech_type=<optimized out>,
           internal_ctx=internal_ctx@entry=0x7fb5d80e22d0, output_token=0x0) at g_glue.c:606
        #9 0x00007fb6440946dd in gss_delete_sec_context (minor_status=<optimized out>, context_handle=0x7fb5d80d2be8, output_token=<optimized out>)
           at g_delete_sec_context.c:91
        #10 0x00007fb6440b8dfe in spnego_gss_delete_sec_context (minor_status=<optimized out>, context_handle=0x7fb5d807bd70, output_token=<optimized out>)
           at spnego_mech.c:2161
        #11 0x00007fb6440964b0 in gssint_delete_internal_sec_context (minor_status=0x7fb586868504, mech_type=<optimized out>,
           internal_ctx=internal_ctx@entry=0x7fb5d807bd70, output_token=0x0) at g_glue.c:606
        #12 0x00007fb6440946dd in gss_delete_sec_context (minor_status=<optimized out>, context_handle=0x7fb586868508, output_token=<optimized out>)
           at g_delete_sec_context.c:91
        #13 0x00007fb6dc0559ff in Java_sun_security_jgss_wrapper_GSSLibStub_deleteContext () from /home/nico/tmp/jdk-17.0.8.1+1/lib/libj2gss.so
        #14 0x00007fb6c89a653a in ?? ()
        #15 0x0000000000000000 in ?? ()

        We've seen other variations, but always they can be traced to
        `HttpURLConnection`.

        (We do maintain local patches to OpenJDK JGSS code as well. But that's
        a story for another time. We do want to try again to upstream those
        patches. Because we maintain such patches we developed an interim fix
        where we use `synchronized (lock)` in the JGSS `dispose()` methods and
        also `Reference.reachabilityFence(this)`, which worked to prevent the
        crashes, but that was before we learned that `-Djdk.spnego.cache=false`
        also works. In any case, stock, unpatched OpenJDK 17.0.8.1 evinces this
        crasher.)

        # Root cause

        What 8303809 did was add calls to `dispose()` on `GSSContext` objects in
        NegotiatorImpl.java in a new method called `disposeContext()` that all
        HTTP authentication mechanisms now implement and which 8303809 also
        added.

        It took us a while to realize that the use of the `AuthCache` in
        `HttpURLConnection` implies concurrent access to the cached contexts via
        concurrent HTTP requests to the same origin.

        Concurrent access to and disposal of those context objects then leads to
        use-after-free / double-free errors when using sun.security.jgss.native.

        There's really several issues going on here:

        - The `dispose()` methods in 17.x in the JGSS native C GSS bindings
          clases are dangerous, as are finalizers generally (and these
          `dispose()` methods are called from finalizers).

          This is fixed in later OpenJDK releases by making use of `Cleaner`,
          so there's not much more to say about this here.

        - The `AuthCache` in `src/java.base/share/classes/sun/net/www/protocol/http/`
          implies concurrency, and it's not clear that concurrency is allowable
          in all cases.

        - In the case of RFC 4559 "Negotiate" and GSS concurrency is very much
          not permitted. The GSS APIs can and should be thread-safe, but it is
          still an error to invoke `initSecContext()` concurrently on the same
          `GSSContext`!

          GSS-API `initSecContext()`/`acceptSecContext()` are strictly
          synchronous and serial.

        - The `AuthCache` in `src/java.base/share/classes/sun/net/www/protocol/http/`
          is likely a misfeature altogether, at least as far as sharing GSS-
          like HTTP authentication mechanism contexts goes, and probably for
          all other HTTP auth methods too.

          It's not ok to take a reply token from a `WWW-Authenticate:` response
          header and consume it together with a "context" created by a
          unrelated request [to the same origin].

          Every `WWW-Authenticate:`/`Authorization:` token must be matched to
          the same [serial train of] request[s].

          Thus if we have an RFC 4559 Negotiate exchange requiring 5 tokens, so
          something like:

          0: request..
          1: 401 response w/ WWW-Authenticate: Negotiate
          2: request w/ Authorization: Negotiate ... (continue needed)
          3: 200 response w/ WWW-Authenticate: Negotiate ... (continue needed)
          4: request w/ Authorization: Negotiate ... (continue needed)
          5: 200 response w/ WWW-Authenticate: Negotiate ... (complete)

          it would not be permissible for one of those reply tokens to be fed
          to another request's GSSContext, nor would it be OK to have more than
          one serial set of requests share the same GSSContext.

        - The foregoing also applies to SCRAM, DIGEST-MD5, and just about any
          and all HTTP authentication methods that I know. I believe that
          there is no value in caching HTTP authentication mechanism contexts
          on the client side.

          HTTP Authentication mechanisms like SCRAM and DIGEST-MD5 are
          three-message mechanisms when counting the initial challenge:

          0: request...
          1: 401 response w/ WWW-Authenticate: DIGEST-MD5 <challenge>
          2: request w/ Authorization: DIGEST-MD5 ...
          3: 200 w/ WWW-Authenticate: DIGEST-MD5 ... (complete)

          There is no case in which caching and sharing a SCRAM or DIGEST-MD5
          authentication context is useful because there is only one request/
          response in which there is a context needed at all: between 2 and 3
          inclusive in the above diagram.

          Perhaps a single challenge can be shared with multiple subsequent
          requests that will optimistically use SCRAM or DIGEST-MD5 knowing
          that one response demanded it, but one should probably not do this,
          though I could find nothing in RFC 5802 and 7677.

        - Negotiate is really a bearer token system, albeit with the
          possibility of using channel bindings. Meaning we never use the
          resulting GSSContext to encrypt or sign any part of the HTTP requests
          and responses authenticated with that context. Therefore there's
          never any point in sharing a GSSContext in HTTP, not even once it's
          fully-established -- there is no point in doing anything other than
          disposing of a fully-established GSSContext in HTTP/Negotiate.

          (Attempts to develop standards for encrypting/signing of HTTP
          requests and responses at the HTTP layer rather than in TLS have
          historically failed. If one attempt to standardize such a thing
          succeeds, _then_ there will be something worth sharing in a cache
          with care to prevent use-after-dispose.)

        - Because of the preceding, in HTTP/Negotiate it is desirable to "free"
          any GSSContext immediately after it is complete.

          This is presumably how you came to add the `disposeContext()` method:
          because those cached contexts otherwise sit there using lots of
          memory for no reason; someone must have noticed. However, rather
          than have an explicit `disposeContext()` these contexts should just
          immediately be disposed of when complete or failed.

        # Possible fixes

        - Backport the use of `Cleaner` in the JGSS `dispose()` methods.

          This would not be very satisfying because the `HttpURLConnection`
          `AuthCache` would remain buggy, but at least the JVM wouldn't crash.

        - Default `jdk.spnego.cache` to `false`.

        - Rethink or remove the `AuthCache` in `HttpURLConnection`.

        - Something else?

              michaelm Michael McMahon
              weijun Weijun Wang
              Votes:
              0 Vote for this issue
              Watchers:
              11 Start watching this issue

                Created:
                Updated:
                Resolved: