IBM Global Services is implementing a Web solution for AT&T.
They are using WebLogic servers running on two Sun E5500 machines
(Solaris 2.6). The Web traffic used to be load balanced using Cisco
Local Directors. WebLogic 4.1.5 is using JDK 1.1.7_08A.
They have replaced the Cisco Local Directors with Alteon switches.
Alteon switches use heartbeats to sense the availability of the
WebLogic servers. For each heartbeat the switch makes a "crunched" TCP
connection to the WebLogic port, similar in form to Transactional TCP.
The switch sends a SYN packet and expects a SYN+ACK. Then it sends an
ACK+FIN (piggybacking the FIN on the ACK packet) and expects a FIN+ACK.
So the entire heartbeat is done with 4 packets instead of the usual 6
packets of a TCP connection open and close (SYN, SYN+ACK, ACK, FIN,
FIN+ACK, ACK). This happens once every 2 seconds.
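For anyone who wants to reproduce a similar health check without the
switch, the following is a minimal sketch of a connect-then-close probe.
It is an approximation only: the class and method names are hypothetical,
it uses modern Java syntax (not JDK 1.1), and whether the FIN is actually
piggybacked on the ACK is decided by the operating system's TCP stack,
not by this code.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.InetSocketAddress;
import java.net.ServerSocket;
import java.net.Socket;

// Hypothetical helper, not part of WebLogic or the Alteon firmware.
final class HeartbeatProbe {

    /** Returns true if a TCP connection to host:port completes in time. */
    static boolean probe(String host, int port, int timeoutMillis) {
        try (Socket s = new Socket()) {
            // connect() returns once SYN / SYN+ACK / ACK has completed.
            s.connect(new InetSocketAddress(host, port), timeoutMillis);
            return true; // leaving the try block closes the socket (sends FIN)
        } catch (IOException e) {
            return false; // refused, timed out, or unreachable
        }
    }

    /** Demo: probes a throwaway listener on the loopback interface. */
    static boolean probeLocalListener() {
        try (ServerSocket listener = new ServerSocket(0)) {
            return probe("127.0.0.1", listener.getLocalPort(), 1000);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("listener reachable: " + probeLocalListener());
    }
}
```

Running a probe like this in a loop against the WebLogic port while
watching snoop output would show whether the server answers the FIN.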
Here IBM Global Services has encountered a problem. WebLogic Server,
when using JDK 1.1.7_08A, does not respond to the piggybacked FIN packet.
Because the Alteon switch has not received a FIN for the previous
heartbeat, it first sends a RST on the previous socket address before
starting the next heartbeat. This RST creates an IDLE socket on the Sun
machine. Once the number of these IDLE sockets reaches 1023, the process
is starved of file descriptors and "dies".
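The behavior the switch expects can be sketched from the application
side as follows. This is an illustration only, not the WebLogic or JDK
implementation, and the report does not establish why JDK 1.1.7_08A
fails to answer the FIN. The class and method names are hypothetical and
the syntax is modern Java. The point is that when the peer's FIN (or
RST) arrives, the server must close its own socket, which sends its FIN
and releases the file descriptor.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.net.ServerSocket;
import java.net.Socket;

// Hypothetical illustration of closing a socket once the peer has closed.
final class CloseOnEofServer {

    /** Accepts the given number of clients and closes each after EOF or reset. */
    static int serve(ServerSocket listener, int connections) throws IOException {
        int closed = 0;
        for (int i = 0; i < connections; i++) {
            try (Socket client = listener.accept()) {
                InputStream in = client.getInputStream();
                try {
                    // read() returns -1 once the peer's FIN has been received.
                    while (in.read() != -1) { /* discard any data */ }
                } catch (IOException resetByPeer) {
                    // A RST surfaces here; the socket still has to be closed.
                }
            } // try-with-resources closes the socket and frees the descriptor
            closed++;
        }
        return closed;
    }

    /** Demo: sends connect-then-close probes to a local listener. */
    static int demo(int probes) {
        try (ServerSocket listener = new ServerSocket(0)) {
            int port = listener.getLocalPort();
            for (int i = 0; i < probes; i++) {
                new Socket("127.0.0.1", port).close(); // open, then FIN
            }
            return serve(listener, probes);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("sockets closed: " + demo(5));
    }
}
```

If the close step is skipped, every heartbeat leaves one descriptor
behind, which matches the IDLE socket buildup described above.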
When JDK 1.3 is used instead of 1.1.7_08A for WebLogic, everything
works. We see a FIN response from the Sun server to the piggybacked FIN
from the Alteon switch, and the problem does not occur.
Other products running on these machines (iPlanet's LDAP database and
Apache web server) do not have this problem. All of them respond with a
FIN.
The customer cannot use 1.3 because it would take them 4 months to
rewrite their product to be compatible with 1.3.
The customer needs a "fix" for this RFE, so that 1.1.7_08A knows how to
deal with a piggybacked FIN.
ES Analysis:
The following files can be found at
/net/cores.ebay/cores/62455860
weblogic_j11_pkts -> snoop when using Java 1.1.8_13
weblogic_j13_pkts -> snoop when using Java 1.3
truss_13          -> truss when using Java 1.3
truss_118         -> truss when using Java 1.1.8
netstat_j118      -> netstat -an when using Java 1.1.8
INFO:
The server's IP address is 10.1.4.4, listening on port 8030. There are
two switches doing the health checks: 10.1.4.253 and 10.1.4.254.
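As a rough estimate of how quickly the descriptors run out, the
arithmetic can be sketched as below. It rests on two assumptions that
the report does not state outright: that each of the two switches sends
its own heartbeat every 2 seconds, and that every heartbeat leaks
exactly one descriptor. The class and method names are hypothetical.

```java
// Hypothetical back-of-the-envelope calculation, not measured data.
final class FdExhaustionEstimate {

    /** Seconds until descriptorLimit sockets have leaked. */
    static double secondsToExhaust(int descriptorLimit, int switches,
                                   double intervalSeconds) {
        // Assumption: one leaked socket per heartbeat per switch.
        double leaksPerSecond = switches / intervalSeconds;
        return descriptorLimit / leaksPerSecond;
    }

    public static void main(String[] args) {
        double seconds = secondsToExhaust(1023, 2, 2.0);
        System.out.printf("%.0f s (about %.0f min)%n", seconds, seconds / 60.0);
    }
}
```

Under those assumptions the server would reach 1023 IDLE sockets in
about 1023 seconds, or roughly 17 minutes after startup.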
The following is the output from the failed Weblogic server:
java.lang.OutOfMemoryError
at sun.rmi.transport.tcp.TCPTransport.newListener(Compiled Code)
at sun.rmi.transport.tcp.TCPTransport.run(Compiled Code)
at java.lang.Thread.run(Compiled Code)