JDK-4345509

ServerSocket.accept method gets an I/O exception thrown, retry causes hang


    • sparc
    • solaris_2.6


      The current recommended patch cluster is loaded and the exact Java
      version is 1.1.7_08 with native threads.

      We are working to reproduce the problem with a smaller example, and
      hope to have one soon.

      Here's the customer's description of the problem:

      The servers are Java applications receiving RMI calls from a
      Servlet. The problem occurs in the socket handling underneath the
      RMI runtime code, but I believe it is a socket problem, not an RMI
      problem.

      The specific situation is that a thread blocked on a
      ServerSocket.accept method occasionally gets an I/O exception
      "interrupted system call" thrown. If we turn around and retry the
      accept method on the socket, it blocks as expected, but about 50% of
      the time it will no longer accept any new connections. Clients will
      connect to the server and block on I/O, but the server accept method
      never returns, so the system appears hung to these new connections.
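
      The retry behaviour described above can be sketched roughly as
      follows. This is only an illustration, not the customer's code: the
      class name RetryingAccept, the method acceptWithRetry and the
      maxRetries parameter are hypothetical, and the sketch uses current
      Java syntax rather than what was available in 1.1.7. Per the report,
      on the affected JVM a retry like this blocks as expected but may
      never return.

```java
import java.io.IOException;
import java.net.InetAddress;
import java.net.ServerSocket;
import java.net.Socket;

// Hypothetical helper: retries accept() a bounded number of times
// when it fails with an I/O exception such as "interrupted system call".
class RetryingAccept {

    static Socket acceptWithRetry(ServerSocket server, int maxRetries) throws IOException {
        int failures = 0;
        while (true) {
            try {
                return server.accept();
            } catch (IOException e) {
                // Give up if the socket was closed or the retry budget is used up.
                if (server.isClosed() || ++failures > maxRetries) {
                    throw e;
                }
                System.err.println("accept failed (" + e.getMessage() + "), retry " + failures);
            }
        }
    }

    // Small loopback demo: one client connects, the server accepts it.
    public static void main(String[] args) throws IOException {
        InetAddress loopback = InetAddress.getLoopbackAddress();
        try (ServerSocket server = new ServerSocket(0, 50, loopback);
             Socket client = new Socket(loopback, server.getLocalPort());
             Socket accepted = acceptWithRetry(server, 2)) {
            System.out.println("accepted connection: " + accepted.isConnected());
        }
    }
}
```

      Bounding the retries keeps a persistently failing socket from
      spinning forever, but it does nothing for the case in this report,
      where the retried accept call blocks without raising another
      exception.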

      An interesting side item that may or may not be related to this hang
      is that after the server processes are killed, the OS has been
      observed to have sockets left in a CLOSE_WAIT state that lasts
      indefinitely. The passive close that should wipe them out in a few
      minutes or hours never occurs, even days later. A system reboot is
      the only thing that cleans this up, which is especially nasty because
      they are tying up port numbers used by our application servers.

      This customer is using a version of our product, Windchill, which is
      running in a 1.1.7 native threads JVM and uses Oracle 8.0 OCI JDBC
      drivers (type II, native code), and is running on Solaris 2.6. We
      had a similar problem over a year ago on HP/UX and determined that it
      was occasionally occurring when the Oracle OCI drivers opened new
      database connections. That was solved by configuring the server to
      use a fixed size connection pool so it wouldn't need to open new ones
      during its normal lifetime. We didn't see it again until now.
      This particular site has a customization that makes heavy use of JNDI
      to access a Netscape directory server using LDAP, but is otherwise
      unexceptional. We have other Solaris sites running the same
      configuration that haven't reported this problem yet, but this is
      probably the busiest one.

      I've looked through the bug parade and seen several reports of
      problems with socket methods throwing "interrupted system call"
      exceptions, but I haven't been able to determine if the problem has
      really been fixed in a production JVM release. Further, none of them
      described the situation where retrying the accept method leaves the
      server hung. The version of Windchill they are running supports Java
      2, and they will eventually upgrade for performance reasons, but I'd
      like to be able to say definitively whether it will also solve the
      server hanging problem.

      We have a wrapper class around the real server socket for logging and
      exception recovery, so it's easy to control the response to these
      exceptions. This logging is how we discovered that the hangs
      correspond to the server socket throwing this exception. Our code
      was retrying a couple of times to work around an early Windows/NT
      socket bug where clients that connect and die before the server
      accept call completes could raise a connection reset exception from
      the accept method. That was on NT, but now on Solaris the only
      working technique to prevent this hang is to shut down the server
      when its accept call fails. Luckily, Windchill can have multiple
      servers load balancing and they can fail and be restarted
      automatically by a server manager process. Most calls in progress
      will be retried against another server. This was simply defensive
      programming against bugs or memory leaks in native code (Oracle) or
      unstable JVMs. It allows the customer to stay in production, but
      it's not a very clean solution to have servers committing suicide
      every few hours. There was a total of 22 shutdowns yesterday. And
      we still have the lost port numbers waiting for a reboot (not typical
      of a Solaris enterprise server).
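
      A wrapper of the kind described above, which logs the failure and
      then lets the server shut down instead of retrying, might look
      roughly like this. It is a sketch under assumptions, not the
      Windchill implementation: the class name LoggingServerSocket and the
      onAcceptFailure callback are hypothetical, and the callback stands
      in for whatever triggers the orderly shutdown that the server
      manager process then answers with a restart.

```java
import java.io.IOException;
import java.net.ServerSocket;
import java.net.Socket;
import java.util.logging.Level;
import java.util.logging.Logger;

// Hypothetical wrapper: logs any accept() failure, notifies a callback
// (for example, to begin server shutdown), then rethrows the exception.
class LoggingServerSocket extends ServerSocket {
    private static final Logger LOG = Logger.getLogger("LoggingServerSocket");

    private final Runnable onAcceptFailure;

    LoggingServerSocket(int port, Runnable onAcceptFailure) throws IOException {
        super(port);
        this.onAcceptFailure = onAcceptFailure;
    }

    @Override
    public Socket accept() throws IOException {
        try {
            return super.accept();
        } catch (IOException e) {
            // No retry here: the report found retrying could leave the server hung.
            LOG.log(Level.WARNING, "accept failed on port " + getLocalPort(), e);
            onAcceptFailure.run();
            throw e;
        }
    }
}
```

      A caller could pass a callback that sets a shutdown flag or exits
      the process, leaving restart to an external supervisor.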

            jccollet Jean-Christophe Collet (Inactive)
            duke J. Duke