Thursday, January 08, 2009

Analyzing thread dump practice

With this message I would like to share some real case that we successfully identified performance bottleneck through analyzing threa dump and correlate with our code.

The basic idea is, when the performance testing is running, get a snapshot of all threads several times(usually I would like to take 5-7 times at a randomly interval), see what they are busy at.

* You might see most threads have similar stacktrace and are pending on same method
o Maybe most of them are waiting for getting lock – there is thread lock happened.(waiting for monitor entry)
o Maybe that method is really time consuming so there is high chance most threads are working on that.(Runnable and high CPU utilization)

App profiling log may also help to breakdown time and find which method cause most time but with thread dump you will be able to look into the 3rd party libraries at low level API.

To generate the thread dump(except using Ctrl+break under console mode for Windows):

* http://www.jboss.org/community/docs/DOC-9804
* http://www.jboss.org/community/docs/DOC-12300
* https://visualvm.dev.java.net/threads.html
* Jconsole,Mbean

You can easily find some articles on internet how to analyze them and here is a few

* http://java.sun.com/developer/technicalArticles/Programming/Stacktrace/
* http://www.0xcafefeed.com/2004/06/of-thread-dumps-and-stack-traces/
* http://www.myloadtest.com/java-thread-dump/

Analyzing thread dump actually needs some knowledge but the real cases I would show below, is kind of “idiot”, you will see…
Case 1 – solved issue caused by BeanUtil.copyProperties() and log.trace()

During performance test we found view vendor summary is very slow. This page call search web service to get data, search server CPU is low but "X application" server CPU is high.

Take a look at the thread dump, we find most threads are doing VendorAdapter.unmarshal() and locked each other at javax.management.modelmbean.DescriptorSupport.clone().

Read little more code, we find

* VendorAdapter.unmarshal() used BeanUtil.copyProperties() in a loop
* BeanUtils.copyProperties() triggers log.trace() without checking isTraceEnabled().
* Then it triggered our customized logger and finally triggered config.getVersion() which caused thread lock.

Solution

* In XXXLog4JLogger.trace() add a check isTraceEnabled() before call super.trace(this.constructLogData(message));
Result
* Response time decrease from 6.2 seconds to 3.9 seconds

Later, with same approach we find BeanUtil.copyProperties() caused another thread lock at MappedPropertyDescriptor.getPublicDeclaredMethods so we complete give up BeanUtil.copyProperties().
With this change, response time decrease further.

Total improvement is from 6.2 seconds to 2.8 seconds under expected load.
Case 2 – identify issue caused by call to ActiveMQ

This happens on search page.

While testing with 50 users, we found after 5 minutes CPU usage of X application app and search app will dramatically decrease (from 90% to less than 50%). Meanwhile server response time become very slow.

By taking thread dump on X application app console we find by that time half of the http request hander thread is at com.XXX.commonsearch.client.util.SearchJMSQueueConnectionUtil.sendObjMsg, while the rest is waiting.
Justin also found there are around 400 messages pending on ActiveMQ

Then we quickly made a temporarily change to the code to remove that call to ActiveMQ and test again. The issue is gone.
This allows X application team continue with other testing and tuning while search team investigate the solution.

Case 3 – confirmed the issue caused by creating dispatch object for web service call

Again, this happens with vendor detail page while call search service.
From the profiling log we find most time of the page consumed in calling web service but when we directly test web service using jmeter, it is fast. so that indicate client side which call web service might have problem.

By looking at the thread dump we realized most time / CPU consumed on read WSDL file and create dispatch object for web service call, every time.

Well, thread dump analyze does not really contribute to the problem solving this time. our dev guys figured out a solution by reading code which is used to call the web service– as the dispatch object is actually thread-safe after investigation, we can cache it.

No comments:

Post a Comment