Performance QA practice in MySQL followed

  1. Continous performance regression testing using automated tool - Scalabilty and single threaded test
  2.  Release Performance for server - On platform Linux and Windows 
  3. Pre-push performance testing for Woklog/POC for major feature. Triage and find out whether non-functional test is required or not.
  4. Benchmarking like MySQL Heatwave (standalone or HA) vs RDS vs Aurora etc .
  5. Impact analysis of mysqld config changes .
  6. Third party libray , like openssl , jemalloc/tcmalloc , impact 
  7.  Compiler options change , devtoolset changes etc 
  8. Some adhoc performance assement like MySQL REST Service , Inbound replication in Oracle cloud 
 Main motto: 
  1.  catch regression/gain early. 
  2.  Run test and find out and no assumption based on documentation 

 That's the reason community/commerical MySQL had most robust performance. So , Performance QA make sense always. In next section , I will try to deep dive into stuff one-by-one


Continous performance regression testing using automated tool - Scalabilty and single threaded test
------------------------------------------------------------------------------------------------------------------------
First let me explain why it's important and the cost of doing it later , during release
* Release in MySQL happens quarterly and now the release window is shrunk so you do not have time to run all performance tests on Linux and windows platform. This means , there is a chance that you may miss identifying any regression seeped through either bug fix or feature push.
Release testing is generally meant for sanity testing in a narrow window of time.
Now, if you depend on the customer to tell the issue when it's deployed on production, then think about the cost of productivity loss.
I have seen many product companies do have continuous performance regression tests running.
Triaging and simulating customer issue to run continuously also very important.
When you are designing continuous performance regression then categories - scalability testing and single query testing.
We designed some tests with frequency of daily like many memory boundary scalability test and some weekly like io bound or single threaded test.
This approach helped to find many performance bugs.

Recently, I have come across Nyrkio.com which does the same , continuous change performance which you can setup against your nighlty build in GitHub.

Pre-push Performance Validation & NFR Governance
------------------------------------------------------------------------------------------------------------------------
Worklog in MySQL can be treated as feature request . In MySQL as good QA practice 2 sections were kept - FR ( functional requirement) and NFR ( non-functional requirement) . if it's code refracting the NFR can be bypassed but if there is new feature or change in default config of mysqld, then performance assessment were required.
Many times, we were doing traige of feature and making note in WL about NFR, if found to be missing. Some developers were proactive and many not in marking feature for NFR testing.
We handled NFR two ways -
1. We used few selected our daily/weekly tests to test WL to make sure that when this WL is pushed , its not goin to regress our continous running change performance tests.
2. Sometimes , we customized or wrote new test cases to handle it

Performance Bug Triage & Continuous Improvement
------------------------------------------------------------------------------------------------------------------------
Many customer bugs are vital source of gauging performance issue and area to address in the code which might be there for long but could not be picked because of various reason.

New feature release is important but addressing existing bugs or triaging bugs related to performance are also important.

A practice being introduced to triage , simulate and automate which was very good initiative and many bugs related to performance were fixed and our performance coverage also increased.


This proactive cycle of addressing technical debt while shipping new features is what maintained MySQL's industry-standard stability over the years.

Release Performance of MySQL - On Linux and Windows
------------------------------------------------------------------------------------------------------------------------

In MySQL release window is less like 1 month and when we get build , we run our subset of tests to just do sanity test.
Now, the question is why we have to run release test if we are covering most of the cases in daily/weekly scalability or single-threaded test.
This is required becuase we wanted to be sure that at the end of the day when customer gets build, its upto quality .
Other reason is that after code freeze and till we get build , somes changes by any means should not deteriorate the performance.
During release cycle ,If any perfromance bug , we find because of any chance, its missed during daily/weekly testing , we try to get fix before release. Performance related bug fixes take time because we have to iterate many cycles.
Perfromance test on windows is required altough customer base of mysql on windows is not as high as linux. Some bug fixes of innodb could impact performance on windows and we have seen this , where we found bug and raised it but it was not fixed and when escalted , it was given priority and fixed.


Benchmarking like MySQL Database Service (Standalone and HA) vs RDS vs Aurora
------------------------------------------------------------------------------------------------------------------------

* We benchmarked using sysbench's various tests like, point_select, connect/disconnect, OLTP_RO, OLTP_RW, update_key, update_nokey, insert, write-only , tpcc for memory-bound ( data fits into buffer pool) and io-bound ( data more than buffer-pool)
These were our targeted comparison:
*MySQL (standalone service) vs MySQL HA service ( 3 node group replication) vs RDS vs Aurora
* This with varying shape like 2 core, 4 core , 8 core/16 core , 32 core etc
* Block volume or storage selected such a way to match IOPS

In this -
* First , we did the comparison of default settings of MDS (MySQL database service) vs RDS (default setting) vs Aurora (default service)
* We used grafana monitoring for MDS and Cloudwatch for RDS/Aurora
* Note: MDS is using cloud rpm which is PGO build whereas RDS is using community version
* We tried to compare like how much tps we get, if mysql is installed on similar hardware like MDS in terms of cpu and block storage
* MDS use - Hardware-assisted (SR-IOV) networking and more difference about PV vs SRV*IO , you can can check in oci doc. But, this makes huge difference
* We did failover latency and switchover latency check between MDS (HA) vs Aurora to gauge where we stand. This latency check we did for read and write also.

Crux, is that exhaustive performance assessment done for fine tuning of MDS

Impact analysis of mysqld config changes
------------------------------------------------------------------------------------------------------------------------

In our continuous performance change test (daily/weekly), we were keeping minimal changes, considering our test strategy is for cpu bound or io bound for standalone server or replication server ( core replication and group replication)
Now, any default value changes were either assessed or caught in our continuous performance change tests running daily+weekly.
If you look at release pages, you will find all changes so ideally before upgrading the server build the impact should be assessed by proper performance testing and analysis of impact
I have seen many innodb , replication , optimizer etc related changes.

In summary, if you are using MySQL and trying to upgrade then thub rule is to look into release notes and check if there is any change in default value of config variable or not.
Pay attention on bug fixes because you have to override some default values of the variable

My friend Jeb of Mariadb foundation has written tool, available at GitHub GitHub.com/MariaDB/TAF , which helps to do various kind of tests like sysbench, tpcc and tpch to find out the change in different runs like change between different release and change in config of any.

Compiler options change , devtoolset change
------------------------------------------------------------------------------------------------------------------------


In MySQL we have evaluated most of the changes like -03 vs -02, pgo vs non-pgo for most of read, read-write, write only, tpcc , connect/disconnect for memory bound and io bound on 4 core vm or 64 core vm to do full assessment because many times test reveals whether hypothesis works on not.

Let me share some example of pgo vs non-pgo
1. pgo performance depends on the training so I case of MySQL mtr test case is used to train. The good practice is to keep training same .
2. pgo helps in most of read only like sysbench point-select or oltp_ro but it's not helpful in case of write only or tpcc kind of load. In 9.7 community version it's being put but I need to check what training being put in or it is left to dev to put training to build.

If you change c++ framework like c14 or something else, better to test and verify whether it's helping or not

Some adhoc performance assement like MySQL REST Service , Inbound replication in Oracle cloud
------------------------------------------------------------------------------------------------------------------------

MySQL rest service , called MARS
- in this h2load tool used to find static page vs sysbench' point-select like query with static value or random value in where clause through rest service vs sysbench point-select sql query

In read-replica through nlb being assessed for point-select, connect/disconnect, ro and rw being tested to see linear scalability when we keep adding read replica and where it hits the limit with sufficient network bandwidth.
Connect/Disconnect test exposes network connection related issues and limit of your network load balancer.
In MySQL, mle feature testing like if you run javascript function vs same function through sql what's overhead . Heap usage of graal vm in your mle engine and how to keep track.

* Always check overhead of new functionality vs sql for the same work

Comments

Popular posts from this blog

MySQL- Crash Recovery Performance Assessment

PGO in MySQL community 9.7

Mariadb Performance : Openssl and wolfssl