Wednesday, February 07, 2024

Performance Culture

What is a Culture:

"culture" is a versatile term that can refer to the shared characteristics and behaviors of a group of people, whether at the societal, organizational, biological, or artistic level.

What is a Performance Culture:

  • Share the same perf vision/goal
  • Share the same perf language (basic concepts/terms/rules)
  • Share the same perf process
  • Share the same perf methodology
  • Share the same perf toolkit
  • Teamwork
  • Continuously measure, monitor, and improve; never be satisfied
  • The team starts looking at data/numbers, questioning and discussing performance topics

Sunday, February 04, 2024

Response Time Distributions Vs. Performance Metrics

  • Right-Skewed Distribution
    P90: 4s
    P50: 1s
    AVG: 1.88s

  • Normal Distribution
    P90: 4s
    P50: 3s
    AVG: 3s

  • Left-Skewed Distribution
    P90: 5s
    P50: 4s
    AVG: 4.1s

  • Bimodal Distribution
    P90: 8s
    P50: 3s
    AVG: 3.9s
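
To make the contrast concrete, here is a rough sketch (plain JavaScript, not a k6 script; the sample values are made up) of how P50/P90 and the average are computed, and why a long tail drags the average away from the median:

function percentile(samples, p) {
  const sorted = [...samples].sort((a, b) => a - b);
  const idx = Math.ceil((p / 100) * sorted.length) - 1; // nearest-rank method
  return sorted[Math.max(0, idx)];
}

function average(samples) {
  return samples.reduce((sum, v) => sum + v, 0) / samples.length;
}

// Hypothetical right-skewed samples (seconds): most requests are fast, a few are very slow
const rightSkewed = [0.8, 0.9, 1.0, 1.0, 1.1, 1.2, 1.5, 2.5, 4.0, 4.8];
console.log(percentile(rightSkewed, 50)); // ~1s  (the median stays low)
console.log(percentile(rightSkewed, 90)); // 4s
console.log(average(rightSkewed));        // 1.88s (the tail drags the average up)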


Thursday, January 18, 2024

Random Quotes

  • "Everything which is measurable needs to be measured, properly!"
  • “In terms of performance issues, it may appear that the application is simply running slowly, but more often it indicates something might be implemented wrong!”
  • "There is no mystery or magic behind slowness; there must be (a) reason(s) you see or do not see (yet)!"
  • "Isolation, isolation, isolation!"
  • "We are not concerned with all samples as it can be overwhelming and sometimes misleading. Our primary focus is solely on improving TP90 for each individual product or app."
  • "Care about the Facts! The world is a certain way, and now using facts as the baseline, we can try to figure out how to make things better."
  • "Keep the momentum and complete the loop"
  • "Do not mix everything together! Tuning one thing at a time, and measure it separately. "
  • "Build Visualization is to help me understand the systems and share with others"
  • "If you can not decide how many samples you gonna look at for your analysis, choose 5,  and ignore the min/max ones, it will give you 93.75% to locate representative range of all samples"
  • "Fixing high-traffic but poor-performing areas(so called high cost) of your site will help lift your overall metrics, since the total resources are scarce and sometimes they are competing with each other"
  • "Any performance or load tests gives you the confidence rather than concrete numbers, even if you already model 'perfectly' and conduct it carefully"
  • "Bigger Hardware is not always faster, but it can usually handle more load"
  • "The more data you accumulate, the more obvious the facts will become" 
  • "Performance numbers under high Error% is meaningless, fix the error first!"
  • "If you run the ETL job, you monitor the site! lol"
  • "The Capacity issue always leads to the slowness, however, the slowness is not always caused by the Capacity. So they are relatives but not the same topic. But try to fix (all) performance issues first before you evaluate your capacity!"
  • "API design should always keep applications in mind with a limited boundary, otherwise it is useless or abused"
  • "If you do not understand it, do not use it!" – My 10 year old daughter told me...
  • "Don't blame yourself, it's not entirely your fault, it's just that the reality is different from what you thought or expect.."
  • "The correct measurement is not only able to find the issue, but also prevent the issue from persisting"
  • "There is no solution, there are only trade-offs!" – Thomas Sowell 
  • "The incomprehensible should cause suspicion rather than admiration." – N. Wirth
  • "Go where the symptom is , and look at it... At the ACTUAL symptom" – Cary Millsap
  • "It is not enough just to make tools available to developers; those developers also have to have THE WILL and THE SKILL to use them" – Cary Millsap
  • "It's not data or intuition, it's data and intuition" – Ivy Ross
  • "Site Speed isn't something you fix once and walk away from" – SpeedCurve
  • "We need simplifications. They include methodologies and effects models, both implicit and explicit." -- Gerald M. Weinberg, And we need understand principles too
  • "Increasing variability makes the utilization and latency graph get worse faster"
  • The Data-Information-Knowledge-Wisdom (DIKW) pyramid, also known as the DIKW hierarchy, is a model that represents the relationship between data, information, knowledge, and wisdom – Wikipedia, "DIKW pyramid"
  • "The better the job you have done in finding a path for yourself, the more boring and predictable your life is going to be" -- Jerry Seinfeld
  • "Unpredictable is one of the most challenging things in Performance Engineering!"
  • "The great performance number looks flat and smooth, the bad performance number looks fluctuation. The slow ones may looks flat and smooth too, I also call it great since they are usually easier to tune faster!"
  • “Take performance seriously, but there is a good way to do performance improvement, and that needs a disciplined measured process!” – Martin Fowler
  • "If you do not care about performance, the slowness will knock at your door someday, sooner or later"

Tuesday, January 04, 2022

API Performance Testing in k6 during the development phase

 

General Goal -> Finding performance bottlenecks and regressions by simply...

  • Running an API-level test
  • Measuring the key performance indicators
  • Analyzing the performance results and trends
  • Isolating the external dependencies if needed (focus on your own code rather than anything out of your control)

In this wiki, we will adopt k6.io as the performance/load testing tool, which is easy to set up and run locally. Meanwhile, we will create a complete monitoring system to visualize your test results as well as the essential JVM performance metrics. In terms of isolating external dependencies, we will create a Docker-based mock service, so that we can control the pace and customize the response body to simulate different scenarios with minimum effort.

If you want to conduct scenario-based performance testing against an integration env such as staging, I would recommend JMeter; it is a comprehensive and more mature tool, but it is out of this wiki's scope. We are also not going to talk about stress, soak, or capacity tests, since they need a more standard (production-mimicking or equally scaled) env and a different test strategy, a thoughtful plan, and a focus on what we want to achieve through various experiments. The good thing is that once you understand the basics of performance testing, you will easily gain a better understanding of the other types of tests.

 

What I talk about when I talk about performance

My daily life with the Performance Engineering Cycle:

Performance is a generic term; it is difficult to give this word a concrete definition from a single perspective. Performance issues can be caused by one or many factors. You may spend lots of time finding the right piece(s) or clues, or even use educated guesses to isolate the factors, prove your findings, and resolve the issues. That's why performance issues are always hard, and why some nerds are so obsessed with troubleshooting performance problems.

 

Why Local Performance test? (AKA, Unit test for API Performance)

The local environment is a great treasure (seriously, any project that cannot easily be set up in a local environment should be retired)!! It's where we should be coding our load test scripts and from where we should initiate our load tests. Meanwhile, when I say "Local", I am not only referring to your own desktop or laptop, but to any environment you fully control and can easily manage and change without any impact on others.

Pros:

  • Easy to control
  • Flexible to manage your dependencies
  • Easy to set up, and testing is cheap

Cons:

  • Hardware Spec limitation
  • Hard to compare with previous baseline
  • Difficult to simulate complex scenarios

 

K6 Local Env Setup:

To install k6 on macOS, simply run the following command:

brew install k6

If you are using another OS to run the tests, please refer to this link

 

Create a Simple API Test script using javascript and k6 lib:

API-level performance testing is supposed to be simple and straightforward, so developers can run it easily and often whenever they make any changes.


 

k6.io adopts JavaScript as its scripting language and Go as its backbone. For detailed usage of k6.io, you can start with the k6 documentation

In general, a k6 test script contains at least the following blocks:

  • import the used libs
  • define global const variables
  • define customized metrics/checks
  • define test running configs
  • init code function, run just once for all VUs, e.g. to deal with data parameterization (optional)
  • VU test code function, the scenario/steps for each VU
  • teardown code function, run just once for all VUs before ending/shutting down the test (optional)

To simplify what I mentioned above, we will use the following API test script as a template, which provides the essential elements and components to run a local perf test; for example, name your test script sample_script.js:

 

import http from 'k6/http';
import { check, sleep } from 'k6';
import { Rate } from 'k6/metrics';
 
const SLEEP_DURATION = 0.2;
const PROTOCOL = "https"
const HOST_NAME = "test-api.k6.io";
 
//Define custom metrics
let successRate = new Rate("check_success_rate");
 
//Test running configs
export let options = {
  discardResponseBodies: false,
  userAgent: 'MyK6UserAgentString/1.0',
  scenarios:{
    http_get_api_3RPS: {
      executor: 'constant-arrival-rate', // use the open model instead of the closed model
      rate: 3, // 3 RPS
      timeUnit: '1s',
      duration: '30s',
      preAllocatedVUs: 5,
      maxVUs: 15,
      startTime: '0s', // first stage starts immediately
    },
    http_get_api_5RPS: {
      executor: 'constant-arrival-rate', // use the open model instead of the closed model
      rate: 5, // 5 RPS
      timeUnit: '1s',
      duration: '30s',
      preAllocatedVUs: 5,
      maxVUs: 15,
      startTime: '31s', // second stage starts after the first one ends
    },
  },
  thresholds: {
    http_req_duration: ['p(90) < 250'],
    'check_success_rate': [{
      threshold: 'rate > 0.95',
      abortOnFail: true,
      delayAbortEval: '15s'}],
  }
}
 
//Init code
export function setup() {
  console.log("Init Testing..." + new Date().toLocaleString());
  return Date.now();
}
 
//VU test code
export default function() {
  // Send out the API
  const response = http.get(`${PROTOCOL}://${HOST_NAME}/public/crocodiles/?format=json`, {
    cookies: { my_cookie: "123456" },
    headers: { 'X-MyHeader': "apitest" },
    timeout: "15s",
    compression: "gzip, deflate, br",
    tags: {name: 'APINAME--GET'},
  });
 
  // Assert the response
  const checkResp = check(response, { // multiple checks can be combined into one assertion group
    "response code is 200": (resp) => resp.status === 200,
    "content is present": (resp) => resp.body.includes("Bert"),
  });
 
  successRate.add(checkResp);
 
  // Simulate the think time
  sleep(Math.random() * SLEEP_DURATION);
}
 
//TearDown code
export function teardown(data) {
  console.log(`Test duration: ${ Date.now()- data }ms`);
}


During the scripting phase, we prefer to do data parameterization, so that we can avoid caches and simulate real-world scenarios. The following are typical methods we can use to deal with this: https://k6.io/docs/examples/data-parameterization/ or you can refer to a sample script I wrote in my git repo
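
As a minimal sketch of the SharedArray approach from the k6 docs (the users.json file and its fields here are made up for illustration):

import http from 'k6/http';
import { SharedArray } from 'k6/data';

// Load the test data once in the init context and share it (read-only) across all VUs
const users = new SharedArray('users', function () {
  return JSON.parse(open('./users.json')); // e.g. [{"username": "u1"}, {"username": "u2"}]
});

export default function () {
  // Pick a random record per iteration to vary the request and avoid cache hits
  const user = users[Math.floor(Math.random() * users.length)];
  http.get(`https://test-api.k6.io/public/crocodiles/?format=json&u=${user.username}`);
}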

For some use cases, the target API needs another API's output as its input; this is called correlation. For example, we can extract data from a previous API's response body and compose it as the input parameter to the API we want to measure most. k6 has the option to parse the response body and grab what you need for further steps (make sure you have the running config discardResponseBodies: false). More examples of correlation: https://k6.io/docs/examples/correlation-and-dynamic-data/
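
For example, a minimal correlation sketch against the same demo API used above (extracting an id from one response and reusing it in the next request) might look like this:

import http from 'k6/http';
import { check } from 'k6';

export default function () {
  // First call: list endpoint; keep the body so we can parse it (discardResponseBodies: false)
  const listResp = http.get('https://test-api.k6.io/public/crocodiles/?format=json');
  const firstId = listResp.json()[0].id; // extract a dynamic value from the previous response

  // Second call: feed the extracted id into the API we actually want to measure
  const detailResp = http.get(`https://test-api.k6.io/public/crocodiles/${firstId}/?format=json`);
  check(detailResp, { 'detail response is 200': (r) => r.status === 200 });
}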

Recommendation: in local performance testing, we should avoid as many dependencies as possible, using mock services or generated "fake data" to remove them. Focus on your own code and design first!

To run your test script locally once it is prepared, execute the following CLI after cd-ing to your test script folder. Usually you start with a smoke test to make sure your script has no errors or unexpected results:

k6 run sample_script.js
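
For the smoke run, one simple option (a sketch, assuming you keep the same VU code) is to temporarily swap the scenarios block for a single-VU, single-iteration config so that script errors surface without generating real load:

// Smoke-test options sketch: one VU, one iteration, just to verify the script itself
export let options = {
  vus: 1,
  iterations: 1,
};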

Once the script is ready for load testing, you can tweak the test running configs in the script, or overwrite some critical configs through the CLI to meet your load target.
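
One way to overwrite a config from the CLI is through environment variables: pass them with k6's --env flag and read them via __ENV in the script, with a sensible fallback (the variable name and host below are just examples):

// CLI: k6 run --env HOST_NAME=staging-api.example.com sample_script.js
const HOST_NAME = __ENV.HOST_NAME || "test-api.k6.io"; // falls back to the default host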

After all, we want to smell our own API and gain confidence before submitting commits, then go to prod and monitor the API with something (smile)

Some typical use case examples: https://k6.io/docs/examples/

K6 API documentations: https://k6.io/docs/javascript-api/

 

Test result visualization: 

I prefer to use InfluxDB + Grafana to store and visualize your test results over time, so you can easily notice the changes and when things start to go wrong, and easily compare from time to time.

Install InfluxDB on macOS. Currently k6 does not support InfluxDB 2.0, so we will stick with InfluxDB 1.8 until they officially add 2.0 support:

brew install influxdb@1

Start the InfluxDB instance locally (background mode); it listens on port 8086 by default for exchanging data:

brew services start influxdb@1
or
nohup /usr/local/opt/influxdb@1/bin/influxd &

To run the k6 test and store the test data in the local InfluxDB instance, use the following command; it will create the "myk6db" database automatically:

k6 run --out influxdb=http://localhost:8086/myk6db sample_script.js

Install Grafana on macOS:

brew install grafana

Start Grafana service:

brew services start grafana

Access your local Grafana page at http://localhost:3000/ and enter admin for both the username and password.

Next, add the InfluxDB myk6db datasource and create your own dashboard to visualize the k6 test results:

(If you would like to add a Grafana panel plugin to build a fancier dashboard, you can download the plugin folder and drop it into the Grafana plugin folder: /usr/local/var/lib/grafana/plugins/)

I have defined a basic k6 test Grafana dashboard for anyone to import as a quick start; feel free to download it from my github repo
P.S. I highly recommend running a baseline before you make changes, rather than comparing against an out-of-date "baseline"; things can change since it is a local env.

Monitoring Your (Java) application:


Monitoring a local Java process is easy nowadays. I recommend using Java Mission Control (JMC) and Flight Recorder (JFR), which are developed by the Oracle Java team. You can download the latest version of JMC separately from here, along with instructions on how to start JMC. The other option you may want to choose is VisualVM, one of my previous favorite JVM monitoring tools.

Configure your Java application with the correct VM options; just make sure you copy the same JVM options currently in use in production for your own role.

If it is a newly developed application, you can configure the options yourself or use the following simple template to get started; if GC overhead is the bottleneck, you have to revisit and tune it. If GC throughput is over 99.5% (which means GC pauses take less than 0.5% of your whole test duration, e.g. less than 3 seconds of total GC pause time in a 10-minute test), you normally do not need to bother with JVM options. Keep the JVM options minimal, and make sure you fully understand the impact before you add one.

G1 GC is my general recommendation if you are on JDK 8+; however, if you are on JDK 11+ (ZGC) or JDK 12+ (Shenandoah), you may compare the newly added GC collectors against G1 GC. Assume you have at least 16GB RAM on your local machine and you wish to size your Java heap at 4GB:


-Xms4096m
-Xmx4096m
-Xss256k
-XX:+UseG1GC
-XX:MaxGCPauseMillis=200
-XX:+DisableExplicitGC
-XX:+UseStringDeduplication
-XX:+ParallelRefProcEnabled
-XX:MaxMetaspaceSize=512m
-Djava.rmi.server.hostname=192.168.0.xxx

P.S. The -Djava.rmi.server.hostname VM option needs to be added to your Java application to let JMC or VisualVM connect to this host; otherwise, you may see the following error when trying to connect to the JMX server:

...
Caused by: java.rmi.ConnectException: Connection refused to host: <Some_else_IP>; nested exception is:
    java.net.ConnectException: Operation timed out (Connection timed out)
...

Pay attention: if you are connected to a VPN, you might have a separate IP address to connect to; run the following command on your local machine:

% ifconfig | grep "inet "


It will show you the IP address(es) you could use; if you cannot decide which one to use, try both until it connects.

How to use JMC to monitor, or JFR to profile and analyze, your Java application is out of this wiki's scope; please find out here. For JFR, you need to add additional VM options to enable it. Please make sure you do not enable the JFR VM options in a production env, since on older Oracle JDKs it needs an additional commercial license and it adds some overhead to your services; or use OpenJDK JMC and JFR for free (you need OpenJDK 11+).
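
As a rough sketch (assuming OpenJDK 11+, where JFR is free to use), starting a recording directly from the VM options looks like this; adjust the duration and filename to your needs, and open the resulting .jfr file in JMC afterwards:

-XX:StartFlightRecording=duration=120s,filename=local_test.jfr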

The Key Java performance metrics you need to pay attention to:

  • JAVA CPU%
  • Machine CPU%
  • Heap Memory Usage/Footprint
  • Non-Heap Memory Usage/Footprint
  • GC throughput, GC timings and GC Frequency
  • Java Threads count/trend
  • JDBC Connection Stats
  • System-level performance metrics (collected separately; optional for local testing)

P.S. I highly recommend saving the key JMX metrics to InfluxDB during local testing, so you get a historical point of view and can compare how things change over time. You can use jmxtrans together with jmxtrans-output-influxdb to export important JVM metrics to InfluxDB and visualize them in Grafana.

  • Install jmxtrans on macOS: 

    brew install jmxtrans
  • Instrument the JVM options to expose the JMX port: 

    -Dcom.sun.management.jmxremote.port=9426
    -Dcom.sun.management.jmxremote.authenticate=false
    -Dcom.sun.management.jmxremote.ssl=false
  • Instrument the JVM options to define the hostname for connecting to the JMX server: 

    -Djava.rmi.server.hostname=<Local_IP_Address>
  • Define the jmxtrans configuration file; for example, save it as "~/Tools/k6_Loadtest/jmxtrans_config/jmxconfig.json": 

    {
      "servers": [
        {
          "port": "9426",
          "host": "<Local_IP_Address>",
          "runPeriodSeconds": "10",
          "queries": [
            {
              "obj": "java.lang:type=Memory",
              "attr": [
                "HeapMemoryUsage",
                "NonHeapMemoryUsage"
              ],
              "resultAlias": "jvmMemory",
              "outputWriters": [
                {
                  "@class": "com.googlecode.jmxtrans.model.output.InfluxDbWriterFactory",
                  "username": "admin",
                  "password": "admin",
                  "database": "jmxDB",
                  "tags": {
                    "application": "demoApp"
                  }
                }
              ]
            }
          ]
        }
      ]
    }
  • Start jmxtrans process: 

    /usr/local/opt/jmxtrans/bin/jmxtrans ~/Tools/k6_Loadtest/jmxtrans_config/jmxconfig.json

    By default, jmxtrans collects the JMX metrics defined in the JSON config file once per minute. For production monitoring that is good enough, but for local performance testing we had better adjust it to one collection every 10 seconds for more granularity (the runPeriodSeconds setting above). Once it is set up, it is time to create a Grafana dashboard with the JMX metrics alongside the k6.io test data. It helps a ton to better understand your tests and the application under load.

Create a mock services (Optional but highly recommend):

Having your external services mocked is quite helpful; it will make your life much easier:

  • saves time finding a workable and stable environment;
  • lets you focus on your own code;
  • makes test results more predictable and repeatable.

Since you are working in a local env you can fully manage, it is your choice to use an external mock service (such as Mockoon) or just comment out some of the code to make your test work. But I would suggest trying to simulate the remote connection as much as possible, since it helps mimic the threads, memory usage, and network connections of real use cases.

In this section, I will create a dummy mock service using Docker/Golang and the Caddy HTTP server in order to simulate different REST API HTTP methods, payloads, and response times.


 
The sample code is in my github repo for reference

Preparations:

How to build and run dummy-mock service:

  • clone the github repo into your local
  • cd /path/to/target/dir/with/dockerfile
  • define your own response-GET.json and response-POST.json file
  • docker build . -t dummymock
  • docker images (to verify the image was built)
  • docker run -d --rm -p 9091:80/tcp dummymock
  • /path/to/caddydir/caddy start (note: make sure current dir has predefined Caddyfile, so caddy will auto load the config file)
  • Use Postman or curl to try the mock service with your HTTP method plus the customized duration you expect the mock service to simulate, for example: http://localhost:8020/?duration=200

Note:

  • If you want to support the https protocol, you can dig into the Caddy documentation and config to enable it
  • By default, it does not support many JSON response payloads, but if you would like more, it is easy to extend by adding to the source (main.go) and rebuilding it

Fine Tuning your OS (Optional):

Make sure your desktop or laptop is not the bottleneck while running your performance test. If it is, you may consider fine-tuning your OS first; if nothing works, you may consider adopting dedicated load generators to help with the test, though you lose some flexibility, so it is a trade-off. Do remember: focus on your code first, no one cares if you do not even care.

 

Install xk6, the k6 extension modules (Optional):

  • make sure you have Go installed
  • You can download binaries that are already compiled for your platform
  • Extract it to a local directory and go to that directory
  • If you are using macOS, right-click and open with Terminal to grant the permission to run xk6 on your local machine
  • Select the xk6 extensions you want to try; for example, to run your k6 test with CSV parser functionality: https://github.com/szkiba/xk6-csv
  • run the command to build your k6 with the extensions you selected: xk6 build --with github.com/szkiba/xk6-csv
  • it will generate a new k6 binary in the same folder; run the test with the following command:

    ./k6 run test.js