Thursday, January 08, 2009

How monitoring data helps Performance analysis

This post focuses on how monitoring different kinds of data during a performance test helps the analysis. It is only a draft and a brainstorming list; all the cases are ones we met during the performance test and tuning phases :)

This post does NOT cover the two categories below.

* LoadRunner/JMeter test techniques and result analysis (90th-percentile response time, throughput, etc.)
* Regular profiling (Web/App profiling, DB profiling, JProfiler, etc)

Total CPU of each server

* Case 1: "X Application" calls a common search service. The search server CPU was low while the "X Application" server CPU was extremely high, so we realized the bottleneck was on the "X Application" side.
* Case 2: Web server CPU was high but app server CPU was low. Later we found that the gzip function on the web server was taking a lot of CPU, so we reduced the compression level to solve it. In the same way you can make a quick judgment about which tier is the bottleneck: DB, app, web, etc.
* Case 3: If the DB server CPU is very low but SQL Profiler shows the SQL is slow, then possibly there is a DB lock or too much data is being fetched.
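
The comparison in Cases 1 and 2 can be sketched in a few lines. This is only an illustration: the tier names, the sample values and the 80% threshold are assumptions, not numbers from our tests, and the samples would come from whatever monitoring tool you already use.

```python
from statistics import mean

def busiest_tier(cpu_samples, high=80.0):
    """Return (tier, averages); tier is None when no tier averages above `high`.

    cpu_samples maps a tier name to a list of total-CPU percentages
    collected during the test. The threshold is a hypothetical default.
    """
    averages = {tier: mean(values) for tier, values in cpu_samples.items()}
    tier = max(averages, key=averages.get)
    if averages[tier] < high:
        return None, averages
    return tier, averages

# Hypothetical samples shaped like Case 1: search is idle, the app is saturated.
samples = {"web": [20, 25, 22], "app": [95, 90, 97], "search": [10, 15, 12]}
tier, averages = busiest_tier(samples)
print(tier, averages)
```

If no tier stands out, as in Case 3, CPU alone will not tell you much, and locks or data volume are the next things to look at.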

CPU / Disk read-write / Memory / GDI Objects of each process, especially the Java process

* Case 1: Rtvscan.exe is busy. Obviously a virus scan is running.
* Case 2: csrss.exe CPU is very high. This is the Windows system process (Client/Server Runtime Subsystem) that handles console windows. Later we found that in the performance environment the app server console was not minimized and a lot of log output was being written to it. Painting the console took a lot of CPU and slowed down the response time.

Check server log

* You might find a lot of debug messages in it, which indicates the log level is set incorrectly. Usually you need to double-check the log level after the performance environment deployment is complete.
* See if there are any errors in the log
o If there are a lot of errors, the response time is meaningless and the errors should be resolved first.
o If there are only a few, we still should not let them go, even if the results look great.
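
A quick way to do both checks is to count log lines per level. The sketch below assumes log4j-style level names (DEBUG, INFO, WARN, ERROR, FATAL) appearing as whole words in each line; the sample lines are made up, and your log format may need a different pattern.

```python
import re
from collections import Counter

# Assumed level names; adjust to the logging framework in use.
LEVEL_RE = re.compile(r"\b(DEBUG|INFO|WARN|ERROR|FATAL)\b")

def count_levels(lines):
    """Count how many lines carry each log level (first level found per line)."""
    counts = Counter()
    for line in lines:
        match = LEVEL_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts

# Hypothetical log lines for illustration only.
sample = [
    "2009-01-08 10:00:01 DEBUG SearchClient - building query",
    "2009-01-08 10:00:01 DEBUG SearchClient - sending request",
    "2009-01-08 10:00:02 ERROR MailSender - connection refused",
]
print(count_levels(sample))
```

A large DEBUG count points to a wrong log level, and any non-zero ERROR count is worth reading before trusting the response times.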

File system

* If your application uses the file system for temporary data transfer and is designed to delete the files after processing completes, you need to check whether any files were not cleaned up.
o If so, that indicates a problem in the temp file cleanup, or an exception that prevented the cleanup from happening.
o Also, if files accumulate, it can hurt performance greatly! File reads and writes become very slow when there are too many files/folders under the same folder.
o A design review is needed at this point.
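
One possible way to check for leftovers after a test round is to walk the temp area and report folders that hold more files than expected. This is a minimal sketch; the root path and the limit of 1000 files are assumptions you would replace with your own values.

```python
import os

def crowded_dirs(root, limit=1000):
    """Return [(directory, file_count)] for directories with more than `limit` files."""
    crowded = []
    for dirpath, _dirnames, filenames in os.walk(root):
        if len(filenames) > limit:
            crowded.append((dirpath, len(filenames)))
    return crowded

if __name__ == "__main__":
    # "." is only a placeholder for the application's temp directory.
    for path, count in crowded_dirs(".", limit=1000):
        print(f"{count} files left in {path}")
```

Running it before and after a test round shows whether the cleanup is keeping up with the load.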

Message Queue (check if it queues up)

* Case 1: We once found that after each test round there were 10000 unprocessed messages, and it turned out that sending email had an issue.
* Case 2: While "X Application" was calling the search service, we found that after a few minutes several hundred messages had queued up in ActiveMQ, which blocked the communication between "X Application" and the search service.
* Also check how many messages are left in the DLQ (dead letter queue). Any there indicate your test was not as successful as it looked from JMeter/LoadRunner.
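
The "does it queue up" check boils down to looking at a series of queue depth readings. Below is a small sketch, assuming you already sample the depth at regular intervals (for example from the broker's admin console); the function name and the report fields are hypothetical.

```python
def queue_report(depths, dlq_count=0):
    """Summarize queue depth samples taken during and after a test round.

    depths: queue sizes in sampling order; dlq_count: messages in the DLQ.
    "growing" means the depth never dropped and ended higher than it started.
    """
    growing = (
        len(depths) >= 2
        and all(later >= earlier for earlier, later in zip(depths, depths[1:]))
        and depths[-1] > depths[0]
    )
    return {
        "growing": growing,
        "left_over": depths[-1] if depths else 0,
        "dead_lettered": dlq_count,
    }

# Hypothetical readings shaped like Case 2: the backlog keeps climbing.
print(queue_report([0, 120, 300, 450], dlq_count=3))
```

A steadily growing depth means the consumer cannot keep up, and a non-zero "left_over" or "dead_lettered" value after the round means some requests never really completed.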

JVM Memory usage

* If the memory in use after each full GC is higher than after the previous full GC, there is possibly a memory leak. (There are plenty of documents describing how to tune and debug GC problems.)
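
The rule above can be written as a simple check over the heap sizes recorded right after each full GC. This sketch assumes you have already extracted those numbers (for example from the GC log); the minimum of three samples is an arbitrary assumption to avoid judging from too little data.

```python
def leak_suspected(post_full_gc_mb, min_samples=3):
    """True if heap usage after every full GC is higher than after the previous one."""
    if len(post_full_gc_mb) < min_samples:
        return False  # not enough full GCs to say anything
    return all(
        later > earlier
        for earlier, later in zip(post_full_gc_mb, post_full_gc_mb[1:])
    )

# Hypothetical post-full-GC heap sizes in MB.
print(leak_suspected([210, 245, 280, 330]))  # steadily climbing
print(leak_suspected([210, 205, 212, 208]))  # roughly flat
```

A positive result is only a hint that a leak is possible, so the next step is still a heap dump or a profiler.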

Cache hit rate

* If you see that the cache hit rate is very high during the test, you had better question whether the test case is designed realistically and whether it hides issues.

Connection pool status check (web server connections, Tomcat connections, DB connections)

Actually we don't have a real case of finding an issue by monitoring the connection pool status, but we do double-check the connection pool size settings after deployment is complete.

All comments and feedback are welcome. :P (Neil wrote this, I am his assistant and first reviewer, haha)
