Sunday, December 28, 2014

Automated WebPageTest using "snowboard"

I have pushed my project "snowboard" to GitHub. Check it out to see whether it is helpful for your daily synthetic front-end performance tests:
https://github.com/joychester/snowboard

Thanks to WebPageTest, you can now request your own API key from: http://www.webpagetest.org/getkey.php

You are free to write your own dashboard, or to store all the results in MongoDB, PostgreSQL, etc. for page trending and further analysis. You can also define your own perceived page load time from the filmstrip, an existing feature that helps redefine page load time for highly dynamic web pages.
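As a rough illustration of what an automated run looks like, here is a minimal sketch that builds a test request URL for the WebPageTest HTTP API. It assumes the public instance and the `runtest.php` endpoint with the `url`, `k` (API key), `f`, `runs` and `location` query parameters. The helper name `build_runtest_url` and the placeholder key are my own, and this is not the code used inside snowboard.

```python
from urllib.parse import urlencode

# Assumption: the public WebPageTest instance; a private instance would differ.
WPT_HOST = "https://www.webpagetest.org"


def build_runtest_url(page_url, api_key, runs=3, location=None):
    """Build a runtest.php URL that asks WebPageTest for a JSON response."""
    params = {"url": page_url, "k": api_key, "f": "json", "runs": runs}
    if location:
        # Optional test location; omitted when not given.
        params["location"] = location
    return f"{WPT_HOST}/runtest.php?{urlencode(params)}"


if __name__ == "__main__":
    # YOUR_API_KEY is a placeholder, not a real key.
    print(build_runtest_url("https://example.com/", "YOUR_API_KEY"))
```

Fetching that URL with any HTTP client should return a JSON document describing the submitted test, which a dashboard or a database loader can then poll and store.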


Monday, December 15, 2014

HTTP1.0 and HTTP1.1 Performance with KeepAlive enabled


A recent misconfiguration in Apache's ssl.conf gave me the chance to compare HTTP1.1 and HTTP1.0 performance with KeepAlive on. In fact, the misconfiguration had been there for years...

Pic1: HTTP1.1 performance over time with KeepAlive on, stable and fast:













Pic2: HTTP1.0 performance over time with KeepAlive on, up and down:













Current settings in ssl.conf, which make every IE user agent get HTTP 1.0 as the response protocol:

SetEnvIf User-Agent ".*MSIE.*" \
         nokeepalive ssl-unclean-shutdown \
         downgrade-1.0 force-response-1.0

To fix the issue, apply the workaround only to IE 1-6, which may have the problem, instead of to every IE user agent (this is said to be fixed already in the latest Apache version):

SetEnvIf User-Agent ".*MSIE [1-6].*" \
         nokeepalive ssl-unclean-shutdown \
         downgrade-1.0 force-response-1.0
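Before rolling out a pattern like this, it can help to check which User-Agent strings it actually catches. The sketch below assumes that SetEnvIf performs an unanchored regex match, which Python's `re.search` only approximates. The sample User-Agent strings and the stricter pattern `MSIE [1-6]\.` are my own illustrations, not part of the original config.

```python
import re

LOOSE = re.compile(r".*MSIE [1-6].*")  # the pattern from the config above
STRICT = re.compile(r"MSIE [1-6]\.")   # assumption: also require the dot after the major version

# Illustrative User-Agent strings (not taken from real traffic).
SAMPLES = {
    "IE6": "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)",
    "IE8": "Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.1; Trident/4.0)",
    "IE10": "Mozilla/5.0 (compatible; MSIE 10.0; Windows NT 6.1; Trident/6.0)",
}


def downgraded(pattern, user_agent):
    """True if this User-Agent would get the HTTP 1.0 / nokeepalive treatment."""
    return pattern.search(user_agent) is not None


if __name__ == "__main__":
    for name, ua in SAMPLES.items():
        print(name, downgraded(LOOSE, ua), downgraded(STRICT, ua))
```

One thing this makes visible: "MSIE 10.0" starts with "MSIE 1", so the loose pattern still matches IE 10, while the stricter variant does not. Whether that matters depends on the browsers you still serve.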

PS: I also tested with KeepAlive turned off: the response times of HTTP1.0 and HTTP1.1 are similar, but 3-4 times slower than with KeepAlive on, due to the extra handshakes.

Monday, November 10, 2014

Tweaking your load generator machine if you are using the Windows platform


I have been doing this for a while. Recently, someone came to me with the same questions, so here is the story:
Sometimes you may notice that performance results on the Windows platform show about 200ms of added latency compared with previous results. That may be due to the following reason (AKA Nagle's algorithm):

http://en.wikipedia.org/wiki/Nagle%27s_algorithm

To fix the problem, enable TCP no-delay on the client side (TCP_NODELAY):

http://www.justanswer.com/computer/3du1a-rid-200ms-delay-tcp-ip-ack-windows.html
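If the load generator is your own code rather than an off-the-shelf tool, the same effect can usually be had per socket instead of system-wide. Here is a minimal sketch using the standard `socket` module. The helper name `make_nodelay_socket` is hypothetical, and this is a different route from the system-level tweak described in the link above.

```python
import socket


def make_nodelay_socket():
    """Create a TCP socket with Nagle's algorithm disabled (TCP_NODELAY)."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    # With TCP_NODELAY set, small writes are sent immediately instead of
    # being held back while earlier data is still unacknowledged.
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
    return sock


if __name__ == "__main__":
    s = make_nodelay_socket()
    # A non-zero value means the option is enabled on this socket.
    print(s.getsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY) != 0)
    s.close()
```

Most load testing tools expose an equivalent switch in their own settings, so check there first before writing any code.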

Meanwhile, you may also want to increase your dynamic TCP/UDP port range to support more concurrent requests and avoid port exhaustion:
https://docs.microsoft.com/en-us/windows/client-management/troubleshoot-tcpip-port-exhaust

Check the IPv4 TCP dynamic port range:
PS C:\WINDOWS\system32> netsh int ipv4 show dynamicport tcp 

Set the IPv4 TCP dynamic port range (needs administrator rights):
PS C:\WINDOWS\system32> netsh int ipv4 set dynamicport tcp start=10000 num=20000
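To get a feeling for what `start=10000 num=20000` buys you, here is a small back-of-the-envelope sketch. It assumes that each closed connection keeps its local port unavailable for the whole TIME_WAIT period. The 120-second value is only an illustrative assumption, so check the actual setting on your machine; both function names are hypothetical.

```python
def dynamic_port_range(start, num):
    """Return (first, last) port of the range set by `start=` and `num=`."""
    return start, start + num - 1


def max_new_connections_per_sec(num_ports, time_wait_seconds):
    """Rough ceiling on new outbound connections per second.

    Assumes every closed connection holds its local port for the whole
    TIME_WAIT period, so at most num_ports connections can be opened
    within any time_wait_seconds window.
    """
    return num_ports / time_wait_seconds


if __name__ == "__main__":
    print(dynamic_port_range(10000, 20000))
    # 120 seconds is an assumed TIME_WAIT, not a measured value.
    print(round(max_new_connections_per_sec(20000, 120), 1))
```

If the load test opens a fresh connection for every request, this ceiling is reached much sooner than with keep-alive connections.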

P.S. If delayed TCP ACK is configured on the server side, you may consider disabling it with the TCP_QUICKACK socket option, since it can add another 200ms delay before ACKs are sent to clients.

Sunday, March 30, 2014

Socket read timeout issue -- A Pattern with GC Activity



There may be several patterns behind socket read timeouts between client and server; this is one of the patterns I want to share:

Pattern A Description:
As we know, GC stops the world (different GC collectors behave differently: https://www.cubrid.org/blog/3826410 and https://www.cubrid.org/blog/3826519).

When the “world is stopped”, JBoss (Tomcat) stops executing all application threads and stops accepting incoming connections; only the GC threads keep running to do their cleaning job…

Meanwhile, if the Apache web server intends to establish a connection with JBoss (Tomcat) over the AJP protocol, it will easily hit the 200-second socket timeout (which we defined in workers.properties), and you will find the error logs in mod_jk.log.
PS: we run 1 Apache and 1 JBoss on the same host. I am borrowing cubrid's nice picture, but we are using the worker MPM instead of prefork:

 

Reproduce this scenario:
  Steps to reproduce this socket_timeout issue:

  • Kick off load testing
  • Manually trigger a Full GC with jvisualvm; the JMeter tree view will then show the socket read timeout error


Jmeter_log:
GET https://hostname/help/services/popUp?nodeDesc=param1
Request Headers:
Connection: keep-alive
User-Agent:  Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/33.0.1750.154 Safari/537.36 perfheader=4xlr3puk

Apache_Access_log:
10.80.8.59 - [28/Mar/2014:03:24:38 +0000] "GET /help/services/popUp?nodeDesc=param1 HTTP/1.1" 200 58 "-" "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/33.0.1750.154 Safari/537.36 perfheader=4xlr3puk" + requestTimeMicroS=200606264 xforwarded=10.80.8.59

Mod_jk_log: (PS: the timestamp is 200 seconds after the request was sent)
[Fri Mar 28 03:27:58 2014] [20266:1183357248] [info] ajp_connection_tcp_get_message::jk_ajp_common.c (1274): (worker1) can't receive the response header message from tomcat, network problems or tomcat (127.0.0.1:8009) is down (errno=11)
[Fri Mar 28 03:27:58 2014] [20266:1183357248] [error] ajp_get_reply::jk_ajp_common.c (2118): (worker1) Tomcat is down or refused connection. No response has been sent to the client (yet)
[Fri Mar 28 03:27:58 2014] [20266:1183357248] [info] ajp_service::jk_ajp_common.c (2607): (worker1) sending request to tomcat failed (recoverable),  (attempt=1)
[Fri Mar 28 03:27:59 2014] worker1 hostname 200.610982
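The link between the two logs can be checked with a little arithmetic. The sketch below parses a shortened copy of the access-log line above. It assumes that the bracketed timestamp is the time the request was received and that `requestTimeMicroS` is the time taken in microseconds; the function name `request_window` is hypothetical.

```python
import re
from datetime import datetime, timedelta

# Shortened copy of the Apache access-log line above (User-Agent dropped).
ACCESS_LINE = (
    '10.80.8.59 - [28/Mar/2014:03:24:38 +0000] '
    '"GET /help/services/popUp?nodeDesc=param1 HTTP/1.1" 200 58 '
    'requestTimeMicroS=200606264'
)


def request_window(line):
    """Return (start, end, duration_seconds) parsed from one access-log line."""
    stamp = re.search(r"\[([^\]]+)\]", line).group(1)
    micros = int(re.search(r"requestTimeMicroS=(\d+)", line).group(1))
    start = datetime.strptime(stamp, "%d/%b/%Y:%H:%M:%S %z")
    end = start + timedelta(microseconds=micros)
    return start, end, micros / 1_000_000


if __name__ == "__main__":
    start, end, seconds = request_window(ACCESS_LINE)
    print(start.time(), end.time(), seconds)
```

Under those assumptions the request ends at 03:27:58, the same second in which mod_jk logs the failure, and its duration of roughly 200.6 seconds lines up with the 200-second socket timeout.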

Current solutions to reduce this timeout issue:
1.       JVM tuning: use the CMS collector to reduce GC pauses (make the stop-the-world time as short as possible, and GC less frequently)
2.       Enable CPing/CPong in workers.properties if AJP is used between Apache and JBoss (detects broken connections in advance and avoids the handshake failure)
3.       For the JBoss 4 role in particular, we need to get rid of the TCP CLOSE_WAIT connection problem introduced by ping mode, by replacing the AJP processor with the JBossWeb Native Connector (http://www.jboss.org/jbossweb/downloads/jboss-native-2-0-10)
4.       Work around a mod_jk bug reported to Apache: replace socket_timeout with socket_connect_timeout and activate ping mode with proper timeouts (https://issues.apache.org/bugzilla/show_bug.cgi?id=49468). socket_connect_timeout specifies the timeout of the TCP connect phase from Apache to JBoss over the AJP protocol

      Why setting CPing/CPong in workers.properties is important:
No CPing/CPong set
The CPing/CPong property in mod_jk is the most important worker property setting, allowing mod_jk to test and detect faulty connections. Not setting this parameter can lead to bad connections not being detected as quickly which can lead to web requests behaving as if 'hung'. (https://issues.apache.org/bugzilla/show_bug.cgi?id=49468)
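A quick way to audit existing hosts for these settings is a small lint over workers.properties. The sketch below is only an illustration. It assumes the mod_jk directive names `ping_mode`, `socket_connect_timeout` and `socket_timeout` under the usual `worker.<name>.` prefix. The sample file content is hypothetical, and so are the function names.

```python
def parse_properties(text):
    """Parse simple key=value lines, skipping blanks and # comments."""
    props = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()
    return props


def check_worker(props, worker):
    """Return a list of warnings for one worker definition."""
    prefix = f"worker.{worker}."
    warnings = []
    if prefix + "ping_mode" not in props:
        warnings.append("ping_mode not set: no CPing/CPong probing configured")
    if prefix + "socket_connect_timeout" not in props:
        warnings.append("socket_connect_timeout not set")
    if prefix + "socket_timeout" in props:
        warnings.append("socket_timeout is set: consider socket_connect_timeout plus ping mode")
    return warnings


# Hypothetical sample, not the real file from our hosts.
SAMPLE = """
worker.list=worker1
worker.worker1.type=ajp13
worker.worker1.host=127.0.0.1
worker.worker1.port=8009
worker.worker1.socket_timeout=200
"""

if __name__ == "__main__":
    for warning in check_worker(parse_properties(SAMPLE), "worker1"):
        print(warning)
```

The checks mirror items 2 and 4 in the list above; which timeout values are right still has to come from testing.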

Next Step:
  •      Find some typical socket timeout cases on PROD
  •      Compare different configurations/Settings during local PE test to see the effect
  •      Test and Learn...

    Sunday, February 09, 2014

    Lessons learned during one of my recent projects

    1.       Prepare a good plan and a clear target in advance, and make sure we share a common purpose

    2.       Understand the envs/software/settings (Design, Mechanical sympathy, VMware DRS, Gateway Throttling Replication mechanism), and do some research before doing any testing

    3.       Keep tests simple and straightforward; complex ones make things complex and make it hard to narrow down the problem

    4.       Clean up the env

    5.       Get a repeatable, accurate and detailed baseline first. Do not rush into any optimization before a repeatable baseline is captured

    6.       Get full real-time monitoring, even on the load generators

    7.       Make changes one by one, not all at once; do not mix things up, and keep a checklist of all the changes we have made

    8.       Logs are not free: turn DEBUG logs off if nobody looks at them; if we really need them, move them to INFO

    9.       Check the disk space when you run multiple rounds and write tons of logs; otherwise response time will suddenly slow down

    10.   For anyone using VMware for PE tests, visibility into the VMX host and knowing how your hosts are distributed is critical (resource utilization (CPU, IO, Mem) should be balanced and not hitting the ceiling of any host). Disable the DRS feature for the PE cluster: it is not good for getting repeatable data, but we can optimize the hosts based on what we observe by monitoring host utilization

    11.   Do not trust open source blindly; dig into it when you are using it intensively!!
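    For lesson 9, a pre-flight check before each test round is cheap to add. Here is a minimal sketch using the standard library's `shutil.disk_usage`; the function name, the path and the 5 GB threshold are arbitrary assumptions for illustration.

```python
import shutil


def enough_disk(path, min_free_gb):
    """Return (ok, free_gb) for the filesystem that contains `path`."""
    usage = shutil.disk_usage(path)
    free_gb = usage.free / 1024**3
    return free_gb >= min_free_gb, free_gb


if __name__ == "__main__":
    # "." and 5 GB are placeholders; point this at the log directory.
    ok, free_gb = enough_disk(".", 5)
    print(f"free: {free_gb:.1f} GB, ok to start: {ok}")
```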