JDK-6496038

JMXMP connector, failed to send a connectionLost notification


    • team
    • CPU: generic, sparc
    • OS: solaris_8, solaris_9, solaris_10

        Here is the info from the client:

        About a year ago, we had a case about a missing connection-lost notification: when the logical host for the destination address of the connection failed over from one node of the remote cluster to another, the connection loss was detected but no connection-lost notification was generated. You later provided a module that would help locate the cause, or even fix the problem. However, we were not able to reproduce it, so it remained a mystery.

        We ran into the same problem again. The clusters are sabre1 (phys-sabre-1, phys-sabre-2) and sabre2 (phys-sabre-3, phys-sabre-4). Here are some highlights -

        On cluster sabre1, logical host "sabre1" failed over from phys-sabre-1 to phys-sabre-2 -
        Nov 14 09:45:52 phys-sabre-2 Cluster.RGM.rgmd: [ID 922363 daemon.notice] resource geo-clustername status msg on node phys-sabre-1 change to <LogicalHostname offline.>
        Nov 14 09:45:53 phys-sabre-2 Cluster.RGM.rgmd: [ID 922363 daemon.notice] resource geo-clustername status msg on node phys-sabre-2 change to <LogicalHostname online.>

        The connection loss was detected by cluster sabre2 (the system clocks of the two clusters differ by about 11-12 minutes) -
        Nov 14 09:57:36 phys-sabre-3 cacao[26093]: [ID 702911 daemon.warning] ClientCommunicatorAdmin.Checker-run : Failed to check the connection: java.io.InterruptedIOException: Waiting response timeout: 30000
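
        The 30-second "Waiting response timeout" above comes from the connector client's periodic connection checker. As a minimal sketch, assuming Sun's JMX Remote implementation and its non-standard jmx.remote.x.* client properties (the property names and values below are assumptions to verify against the deployed version, not something confirmed by this report), the checker period and response timeout could be tuned through the environment map passed at connect time:

        import java.util.HashMap;
        import java.util.Map;
        import javax.management.remote.JMXConnector;
        import javax.management.remote.JMXConnectorFactory;
        import javax.management.remote.JMXServiceURL;

        public class CheckerTuning {
            public static void main(String[] args) throws Exception {
                Map<String, Object> env = new HashMap<String, Object>();
                // Non-standard "jmx.remote.x" client properties assumed to be understood
                // by Sun's JMX Remote implementation: how often the client checker pings
                // the server, and how long it waits for a response before treating the
                // connection as lost.
                env.put("jmx.remote.x.client.connection.check.period", Long.valueOf(60000));
                env.put("jmx.remote.x.request.waiting.timeout", Long.valueOf(30000));

                // Placeholder JMXMP address; the failing setup connects through the
                // logical host "sabre1". JMXMP needs the JMX Remote optional package
                // (jmxremote_optional.jar) on the classpath.
                JMXServiceURL url = new JMXServiceURL("service:jmx:jmxmp://sabre1:9999");
                JMXConnector connector = JMXConnectorFactory.connect(url, env);
                System.out.println("Connected: " + connector.getConnectionId());
                connector.close();
            }
        }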

        Is there any way to check whether a JMXConnectionNotification was generated? Odyssey had subscribed to the notification, but did not seem to receive any.
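
        For reference, connection-state notifications (jmx.remote.connection.opened, .closed, .failed, .notifs.lost) are emitted by the JMXConnector itself rather than by any remote MBean, so a listener has to be registered on the connector with addConnectionNotificationListener. A minimal sketch of such a logger (the JMXMP address and port are placeholders, not taken from this deployment):

        import javax.management.Notification;
        import javax.management.NotificationListener;
        import javax.management.remote.JMXConnectionNotification;
        import javax.management.remote.JMXConnector;
        import javax.management.remote.JMXConnectorFactory;
        import javax.management.remote.JMXServiceURL;

        public class ConnectionNotificationLogger {
            public static void main(String[] args) throws Exception {
                // Placeholder JMXMP address; requires the JMX Remote optional package.
                JMXServiceURL url = new JMXServiceURL("service:jmx:jmxmp://sabre1:9999");
                JMXConnector connector = JMXConnectorFactory.connect(url, null);

                // Every connection-state change (opened, closed, failed, notifs lost)
                // arrives here as a JMXConnectionNotification.
                NotificationListener listener = new NotificationListener() {
                    public void handleNotification(Notification n, Object handback) {
                        if (n instanceof JMXConnectionNotification) {
                            JMXConnectionNotification cn = (JMXConnectionNotification) n;
                            System.out.println(cn.getTimeStamp() + " " + cn.getType()
                                    + " connectionId=" + cn.getConnectionId());
                        }
                    }
                };
                connector.addConnectionNotificationListener(listener, null, null);

                // Keep the client alive long enough to observe a connection-lost event.
                Thread.sleep(Long.MAX_VALUE);
            }
        }

        Running a logger like this on the client side would show whether the connector ever emitted a connection-lost notification that was then dropped downstream, or never emitted one at all.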

        Before the connection loss was detected, phys-sabre-3 appeared to be having failures with the network adapter hosting sabre2, the logical host used to connect to cluster sabre1.

        Nov 14 09:57:16 phys-sabre-3 in.mpathd[80]: [ID 215189 daemon.error] The link has gone down on eri0
        Nov 14 09:57:16 phys-sabre-3 in.routed[353]: [ID 238047 daemon.warning] interface eri0 to 10.6.173.85 turned off
        Nov 14 09:57:16 phys-sabre-3 eri: [ID 786680 kern.notice] SUNW,eri0 : No response from Ethernet network : Link down -- cable problem?
        Nov 14 09:57:18 phys-sabre-3 in.mpathd[80]: [ID 820239 daemon.error] The link has come up on eri0
        Nov 14 09:57:18 phys-sabre-3 eri: [ID 786680 kern.notice] SUNW,eri0 : 100 Mbps full duplex link up
        Nov 14 09:57:18 phys-sabre-3 in.routed[353]: [ID 300549 daemon.warning] interface eri0 to 10.6.173.85 restored

        Nov 14 09:57:32 phys-sabre-3 in.mpathd[80]: [ID 299542 daemon.error] NIC repair detected on eri0 of group sc_ipmp0

        This happened repeatedly.

        Cluster sabre2 also periodically made connections to cluster sabre1; these connections returned successfully despite the logical-host failover on cluster sabre1 and the unstable local network adapter.

        Would you be able to help us? The clusters are still in this state. Root password is "fu_bar".

        Let me know if you need more information.

              Assignee: sjiang Shanliang Jiang (Inactive)
              Reporter: sjiang Shanliang Jiang (Inactive)