Sunday, March 30, 2014

Socket read timeout issue -- A Pattern with GC Activity



There may be several patterns for socket read timeout issue from client to the server, but this is one of the patterns I want to share:

Pattern A Description:
As we know, GC will make the world stopped(different GC Collector will have different behavior: https://www.cubrid.org/blog/3826410 and https://www.cubrid.org/blog/3826519)

When the “world stopped”, the JBoss(Tomcat) will stop responding any application threads execution as well as accept any coming connections except for the GC threads doing its own cleaning job…

Meanwhile, if Apache web server intents to establish a connection with JBoss(Tomcat) by AJP protocol, it will easily get the 200 seconds socket timeout issue(we defined by workers.properties), and looking at mod_jk.log, you will find the Error logs there:
PS: we set 1*apache and 1*Jboss on the same host, i am borrowing cubrid's nice picture, but we are using worker instead of prefork MPM :

 

Reproduce this scenario:
  Reproduce steps for this socket_timeout issue:

  •  Kick off load testing
  • Manually trigger the Full GC by jvusialVM and Jmeter Tree view will comes out the Socket read timeout Error


Jmeter_log:
GET https://hostname/help/services/popUp?nodeDesc=param1
Request Headers:
Connection: keep-alive
User-Agent:  Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/33.0.1750.154 Safari/537.36 perfheader=4xlr3puk

Apache_Access_log:
10.80.8.59 - [28/Mar/2014:03:24:38 +0000] "GET /help/services/popUp?nodeDesc=param1 HTTP/1.1" 200 58 "-" "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/33.0.1750.154 Safari/537.36 perfheader=4xlr3puk" + requestTimeMicroS=200606264 xforwarded=10.80.8.59

Mod_jk_log: (PS: the timestamp is 200 seconds after the request sends out)
[Fri Mar 28 03:27:58OURCE 2014] [20266:1183357248] [info] ajp_connection_tcp_get_message::jk_ajp_common.c (1274): (worker1) can't receive the response header message from tomcat, network problems or tomcat (127.0.0.1:8009) is down (errno=11)
[Fri Mar 28 03:27:58OURCE 2014] [20266:1183357248] [error] ajp_get_reply::jk_ajp_common.c (2118): (worker1) Tomcat is down or refused connection. No response has been sent to the client (yet)
[Fri Mar 28 03:27:58OURCE 2014] [20266:1183357248] [info] ajp_service::jk_ajp_common.c (2607): (worker1) sending request to tomcat failed (recoverable),  (attempt=1)
[Fri Mar 28 03:27:59OURCE 2014] worker1 hostname 200.610982

Current Solution to reduce such a timeout issue:
1.       JVM tuning, leverage the CMS GC collector to reduce the GC timing (making the stop world timing as shorter as possible, and GC less frequently)
2.       Enable Cping/Cpong in workers.properties if there is using AJP between Apache and Jboss (detect the broken pipe and avoid the handshake failure in advance)
3.       For Jboss4 role particularly, we need to get rid of the TCP CLOSE_WAIT connection problem which is introduced by Ping mode by Replacing AJP processor with JbossWeb Native Connector(http://www.jboss.org/jbossweb/downloads/jboss-native-2-0-10)  
4.      A mod_jk bug reported by apache, replacing socket_timeout by socket_connect_timeout and activate ping mode with proper timeouts (https://issues.apache.org/bugzilla/show_bug.cgi?id=49468)socket_connect_timeout is to specify the TCP connect phase timeout from Apache to JBoss with AJP protocol

      why setting Cping and Cpong important in workers.properties:
No CPing/CPong set
The CPing/CPong property in mod_jk is the most important worker property setting, allowing mod_jk to test and detect faulty connections. Not setting this parameter can lead to bad connections not being detected as quickly which can lead to web requests behaving as if 'hung'. (https://issues.apache.org/bugzilla/show_bug.cgi?id=49468)

Next Step:
  •      Find some typical socket timeout cases on PROD
  •      Compare different configurations/Settings during local PE test to see the effect
  •      Test and Learn...