Sunday, March 30, 2014

Socket read timeout issue -- A Pattern with GC Activity



There may be several patterns behind socket read timeouts between client and server. This is one of them that I want to share:

Pattern A Description:
As we know, GC stops the world (different GC collectors behave differently, see https://www.cubrid.org/blog/3826410 and https://www.cubrid.org/blog/3826519).

While the world is stopped, JBoss (Tomcat) stops executing all application threads and cannot accept incoming connections; only the GC threads keep running to do their own cleanup job…

Meanwhile, if the Apache web server tries to establish a connection to JBoss (Tomcat) over the AJP protocol, it can easily hit the 200-second socket timeout (which we defined in workers.properties). Looking at mod_jk.log, you will find the error logs there.
PS: we run 1 Apache and 1 JBoss on the same host. I am borrowing cubrid's nice picture, but we use the worker MPM instead of prefork:

 

Reproduce this scenario:
  Steps to reproduce this socket_timeout issue:

  •  Kick off load testing
  •  Manually trigger a Full GC from jvisualvm; the JMeter tree view will then show the socket read timeout error


Jmeter_log:
GET https://hostname/help/services/popUp?nodeDesc=param1
Request Headers:
Connection: keep-alive
User-Agent:  Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/33.0.1750.154 Safari/537.36 perfheader=4xlr3puk

Apache_Access_log:
10.80.8.59 - [28/Mar/2014:03:24:38 +0000] "GET /help/services/popUp?nodeDesc=param1 HTTP/1.1" 200 58 "-" "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/33.0.1750.154 Safari/537.36 perfheader=4xlr3puk" + requestTimeMicroS=200606264 xforwarded=10.80.8.59

Mod_jk_log: (PS: the timestamp is 200 seconds after the request was sent)
[Fri Mar 28 03:27:58 2014] [20266:1183357248] [info] ajp_connection_tcp_get_message::jk_ajp_common.c (1274): (worker1) can't receive the response header message from tomcat, network problems or tomcat (127.0.0.1:8009) is down (errno=11)
[Fri Mar 28 03:27:58 2014] [20266:1183357248] [error] ajp_get_reply::jk_ajp_common.c (2118): (worker1) Tomcat is down or refused connection. No response has been sent to the client (yet)
[Fri Mar 28 03:27:58 2014] [20266:1183357248] [info] ajp_service::jk_ajp_common.c (2607): (worker1) sending request to tomcat failed (recoverable),  (attempt=1)
[Fri Mar 28 03:27:59 2014] worker1 hostname 200.610982
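
To spot this pattern in a larger access log, a small Ruby sketch like the one below may help. It assumes the custom requestTimeMicroS=<n> field shown in the Apache access log above; the 200-second threshold mirrors our socket timeout, and the helper names are hypothetical.

```ruby
# Sketch: flag access-log entries whose request time reached the socket timeout.
# Assumes the custom "requestTimeMicroS=<n>" field from the access log above.
# The threshold constant and helper names are hypothetical.
SOCKET_TIMEOUT_SECONDS = 200

# Returns the request time in seconds, or nil when the field is missing.
def request_time_seconds(line)
  match = line.match(/requestTimeMicroS=(\d+)/)
  match && match[1].to_i / 1_000_000.0
end

# True when the request took at least as long as the socket timeout.
def timed_out?(line, timeout = SOCKET_TIMEOUT_SECONDS)
  seconds = request_time_seconds(line)
  !seconds.nil? && seconds >= timeout
end

# Shortened copy of the access-log line above (User-Agent trimmed).
sample = '10.80.8.59 - [28/Mar/2014:03:24:38 +0000] ' \
         '"GET /help/services/popUp?nodeDesc=param1 HTTP/1.1" 200 58 "-" ' \
         '"Mozilla/5.0" + requestTimeMicroS=200606264 xforwarded=10.80.8.59'

puts request_time_seconds(sample)
puts timed_out?(sample)
```

To scan a whole file, something like File.foreach(path).select { |line| timed_out?(line) } should do, where path is your own access log location.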

Current solutions to reduce such timeouts:
1.       JVM tuning: use the CMS GC collector to reduce GC time (make the stop-the-world pauses as short as possible, and make GC less frequent)
2.       Enable CPing/CPong in workers.properties if AJP is used between Apache and JBoss (detects broken pipes and avoids the handshake failure in advance)
3.       For the JBoss 4 role in particular, get rid of the TCP CLOSE_WAIT connection problem introduced by ping mode by replacing the AJP processor with the JBossWeb Native Connector (http://www.jboss.org/jbossweb/downloads/jboss-native-2-0-10)
4.       Work around a mod_jk bug reported to Apache by replacing socket_timeout with socket_connect_timeout and activating ping mode with proper timeouts (https://issues.apache.org/bugzilla/show_bug.cgi?id=49468). socket_connect_timeout specifies the timeout of the TCP connect phase from Apache to JBoss over the AJP protocol

Why setting CPing/CPong in workers.properties is important:
No CPing/CPong set
The CPing/CPong property in mod_jk is the most important worker property setting, allowing mod_jk to test and detect faulty connections. Not setting this parameter can lead to bad connections not being detected as quickly which can lead to web requests behaving as if 'hung'. (https://issues.apache.org/bugzilla/show_bug.cgi?id=49468)
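
As a rough illustration, the Ruby sketch below checks a workers.properties text for the settings discussed above. ping_mode, ping_timeout and socket_connect_timeout are mod_jk worker directives; the sample values and the helper names are assumptions for illustration only, not our production configuration.

```ruby
# Sketch: report which ping/connect-timeout settings a worker is missing.
# The sample values below are illustrative assumptions, not a recommendation.
RECOMMENDED_KEYS = %w[ping_mode ping_timeout socket_connect_timeout].freeze

# Parses "key=value" lines into a Hash, skipping blank lines and comments.
def parse_properties(text)
  text.each_line.with_object({}) do |line, props|
    line = line.strip
    next if line.empty? || line.start_with?('#')
    key, value = line.split('=', 2)
    props[key.strip] = value.to_s.strip
  end
end

# Returns the recommended keys that are not set for the given worker.
def missing_settings(props, worker)
  RECOMMENDED_KEYS.reject { |key| props.key?("worker.#{worker}.#{key}") }
end

sample = <<~PROPS
  worker.list=worker1
  worker.worker1.type=ajp13
  worker.worker1.host=127.0.0.1
  worker.worker1.port=8009
  # Illustrative values only; tune them for your environment.
  worker.worker1.ping_mode=A
  worker.worker1.ping_timeout=10000
PROPS

props = parse_properties(sample)
puts missing_settings(props, 'worker1').inspect
```

Here the sample leaves out socket_connect_timeout on purpose, so the check reports it as missing.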

Next Step:
  •      Find some typical socket timeout cases on PROD
  •      Compare different configurations/Settings during local PE test to see the effect
  •      Test and Learn...

    Sunday, February 09, 2014

    Lessons learned during one of my recent projects

    1.       Prepare a good plan and a clear target in advance, and make sure we share a common purpose

    2.       Understand the envs/software/settings (Design, Mechanical sympathy, VMware DRS, Gateway Throttling Replication mechanism), and do some research before doing any testing

    3.       Keep tests simple and straightforward; complex ones make things complex and make it hard to narrow down the problem

    4.       Clean up the env

    5.       Capture a repeatable, accurate and detailed baseline first. Do not rush into any optimization before a repeatable baseline is captured

    6.       Get full real-time monitoring, even for the load generators

    7.       Make changes one by one, not all at once; do not mix things up, and keep a checklist of all the changes we have made

    8.       Logs are not free: turn DEBUG logs off if nobody looks at them; if we really need them, move them to INFO

    9.       Check the disk space when you run multiple rounds and write tons of logs; otherwise response time can slow down suddenly

    10.   If you use VMware for PE tests, visibility into the VMX hosts and knowing how your hosts are distributed is critical to the tests (resource utilization (CPU, IO, Mem) should be balanced and should not touch the ceiling of any host). Disable the DRS feature for the PE cluster, since it is not good for getting repeatable data; we can still optimize the hosts based on what we observe by monitoring host utilization

    11.   Do not blindly trust open source; dig into it when you are using it intensively!!

    Sunday, September 15, 2013

    LDAP sample in Ruby Code

     require 'net/ldap'  
     class LDAPConn  
          HOST = "ldap.yourcorp.dev"  
          # 636 is the LDAPS (SSL) port; pass it as an Integer  
          PORT = 636  
          USERNAME = "TestLDAP"  
          PASSWORD = "abcd1234"  
          TREEBASE = "OU=UserAccts,DC=yourdomain,DC=com"  
          def LDAPConn.getLDAP  
               ldap = Net::LDAP.new(:host => HOST,  
                 :port => PORT,  
                 :auth => {  
                      :method => :simple,  
                      :username => USERNAME,  
                      :password => PASSWORD  
                      },  
                 :encryption => :simple_tls)  
               return ldap  
          end  
          def LDAPConn.validateLogin(email, password)  
               ldap = LDAPConn.getLDAP  
               filter = Net::LDAP::Filter.eq("mail", email)  
               result = ldap.bind_as(:base => TREEBASE, :filter => filter, :password => password)  
               return result  
          end  
     end  
    

    Wednesday, August 07, 2013

    How to do a multipart file upload in ruby

    gem install httpclient

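     require 'httpclient'  
     require 'uri'  
     # filename, curyear, tagname and uri are placeholders; define them before running  
     # Do multipart file upload with POST  
     httpreq = HTTPClient.new  
     # File.open returns the block's value, so res holds the response after the file is closed  
     res = File.open(filename) do |file|  
       # A File object in the body makes httpclient send multipart/form-data  
       body = { 'year' => curyear.to_s, 'tag' => tagname.to_s, 'uploadfile' => file }  
       httpreq.post(URI(uri), body)  
     end  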
    

    Wednesday, July 31, 2013

    ActiveMQ Jmeter Test Plan on my github

    Recently, I have been testing the capacity and performance of our message queue, which is ActiveMQ. So I put a sample JMeter test plan on my GitHub, in case someone wants to use it as a reference:

    https://github.com/joychester/ActiveMQJMeterTestPlan

    This reminds me of an issue I reported against the JMeter JMS Sampler 3 years ago :) Time flies!!
    https://issues.apache.org/bugzilla/show_bug.cgi?id=49111

    Wednesday, May 01, 2013

    "PetGym" is on my GitHub now -- Automate your JMeter tests

    JMeter PluginsCMD provides the capability to run JMeter tests automatically, with many cool reports generated: http://code.google.com/p/jmeter-plugins/wiki/JMeterPluginsCMD

    I am writing a Ruby program to meet my basic requirements during my performance tests.
    Please check it out if you want to give it a try: https://github.com/joychester/PetGym

    Thursday, April 18, 2013

    FastMole is on my GitHub now!

    FastMole, one of my projects for continuous page performance testing, is on my GitHub now: https://github.com/joychester/FastMole

    Tuesday, April 09, 2013

    My New Life, My Baby Girl!!

    My lovely baby girl came into my life 2 weeks ago!! My life is going to change from now on :-)

    Generate Load by adding reasonable think time

    I used to generate load by launching a small number of VUs without any think time during performance testing. The pro is that you can trigger a very high load/throughput without creating many concurrent threads; the con is that you cannot generate a consistent, stable load for comparison once you make some tuning or changes. Recently, I discussed this with my teammate ZhouZhou, and we figured out how to generate a consistent load by adding a reasonable think time between requests (Little's Law helps with the calculation here). Here is the diagram we made to show how much load/throughput we can generate/get:
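
The Little's Law calculation mentioned above can be sketched in Ruby as follows. This is a minimal sketch assuming a closed workload, where N = X * (R + Z); the numbers in the usage lines are made up for illustration and are not from our tests.

```ruby
# Little's Law for a closed load test: N = X * (R + Z)
#   N = concurrent virtual users
#   X = throughput in requests per second
#   R = average response time in seconds
#   Z = think time in seconds
# All sample numbers below are hypothetical.

# Virtual users needed to sustain a target throughput.
def required_users(throughput, response_time, think_time)
  # round first to absorb floating-point noise, then round up to whole users
  (throughput * (response_time + think_time)).round(9).ceil
end

# Think time each user needs so that a fixed user count yields the target throughput.
def think_time_for(users, throughput, response_time)
  users / throughput.to_f - response_time
end

puts required_users(50, 0.5, 1.5)   # 50 req/s with 0.5 s response + 1.5 s think time
puts think_time_for(100, 50, 0.5)   # think time for 100 users to produce 50 req/s
```

With a think time in place, the throughput depends much less on response time fluctuations, which is what makes the load repeatable between runs.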

    Thursday, August 30, 2012

    Monitor your load generator client when the server CPU% is underutilized

    Recently, I was using a virtual box (VMware / 2 cores / 8GB) to conduct performance/load testing. As JMeter generated more and more load, the server's CPU% did not seem able to go up...
    However, I noticed that my virtual box client's CPU% had reached 90%+. After switching to a more powerful physical machine, everything was fine: the server-side CPU% went up with the increasing load. I did not often monitor the load generator client during my previous tests, and it usually worked fine because it was a physical machine.
    You have to monitor your load generator machine as well when you find the server CPU% is underutilized (it stays the same even under higher user load) while the server response time keeps increasing. Be especially cautious if you are using a virtual box to do performance testing!!