There may be several patterns for socket read timeout issue from client to the server, but this is one of the patterns I want to share:
Pattern
A Description:
As we know, GC will make the world
stopped(different GC Collector will have different behavior: https://www.cubrid.org/blog/3826410 and https://www.cubrid.org/blog/3826519)
When the “world stopped”, the JBoss(Tomcat)
will stop responding any application threads execution as well as accept any
coming connections except for the GC threads doing its own cleaning job…
Meanwhile, if Apache web server intents to
establish a connection with JBoss(Tomcat) by AJP protocol, it will easily get
the 200 seconds socket timeout issue(we defined by workers.properties), and
looking at mod_jk.log, you will find the Error logs there:
PS: we set 1*apache and 1*Jboss on the same host, i am borrowing cubrid's nice picture, but we are using worker instead of prefork MPM :
Reproduce
this scenario:
Reproduce steps for this
socket_timeout issue:
- Kick off load testing
- Manually trigger the Full GC by jvusialVM and Jmeter Tree view will comes out the Socket read timeout Error
Jmeter_log:
GET https://hostname/help/services/popUp?nodeDesc=param1Request Headers:Connection: keep-aliveUser-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/33.0.1750.154 Safari/537.36 perfheader=4xlr3puk
Apache_Access_log:
10.80.8.59 - [28/Mar/2014:03:24:38 +0000] "GET /help/services/popUp?nodeDesc=param1 HTTP/1.1" 200 58 "-" "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/33.0.1750.154 Safari/537.36 perfheader=4xlr3puk" + requestTimeMicroS=200606264 xforwarded=10.80.8.59
Mod_jk_log:
(PS: the timestamp is 200 seconds after the request sends out)
[Fri Mar 28 03:27:58OURCE 2014] [20266:1183357248] [info] ajp_connection_tcp_get_message::jk_ajp_common.c (1274): (worker1) can't receive the response header message from tomcat, network problems or tomcat (127.0.0.1:8009) is down (errno=11)[Fri Mar 28 03:27:58OURCE 2014] [20266:1183357248] [error] ajp_get_reply::jk_ajp_common.c (2118): (worker1) Tomcat is down or refused connection. No response has been sent to the client (yet)[Fri Mar 28 03:27:58OURCE 2014] [20266:1183357248] [info] ajp_service::jk_ajp_common.c (2607): (worker1) sending request to tomcat failed (recoverable), (attempt=1)[Fri Mar 28 03:27:59OURCE 2014] worker1 hostname 200.610982
Current
Solution to reduce such a timeout issue:
1.
JVM tuning, leverage the CMS GC
collector to reduce the GC timing (making the stop world timing as shorter as
possible, and GC less frequently)
2.
Enable Cping/Cpong in
workers.properties if there is using AJP between Apache and Jboss (detect the
broken pipe and avoid the handshake failure in advance)
3.
For Jboss4 role particularly, we need to get rid of the TCP
CLOSE_WAIT connection problem which is introduced by Ping mode by Replacing AJP
processor with JbossWeb Native Connector(http://www.jboss.org/jbossweb/downloads/jboss-native-2-0-10)
4. A mod_jk bug reported by apache, replacing
socket_timeout by socket_connect_timeout and activate ping mode with
proper timeouts
(https://issues.apache.org/bugzilla/show_bug.cgi?id=49468)socket_connect_timeout is to specify the TCP connect phase timeout from Apache to JBoss with AJP protocol
PS:
read carefully with this page:
http://tomcat.apache.org/connectors-doc/generic_howto/timeouts.html
PS:
read carefully with this page:
http://tomcat.apache.org/connectors-doc/generic_howto/timeouts.html
why setting Cping and Cpong important in
workers.properties:
No CPing/CPong set
The CPing/CPong property
in mod_jk is the most important worker property setting, allowing mod_jk to
test and detect faulty connections. Not setting this parameter can lead to bad
connections not being detected as quickly which can lead to web requests
behaving as if 'hung'. (https://issues.apache.org/bugzilla/show_bug.cgi?id=49468)
Next
Step:
- Find some typical socket timeout cases on PROD
- Compare different configurations/Settings during local PE test to see the effect
- Test and Learn...