MySQL- Crash Recovery Performance Assessment

Why we need to assess the performance of   crash recovery in MySQ and here is my explanation:

 It’s not just WLs or bug fixes — any change in redo format or code changes around it, flushing behavior, or recovery concurrency can silently slow things down and if you have huge volume of data to recover than it could be painful.


MySQL errorlog is the best place to look into details like whats happening during recovery and which phase of recovery is time consuming.

Basically , I measured the recovery  timing  - from “Starting crash recovery” to “Apply batch completed , in MySQLD error log.

Refer this about recovery in Innodb here: https://dev.mysql.com/doc/refman/8.4/en/innodb-recovery.html#innodb-crash-recovery

Let me share some of my hack to measure the time taken for crash-recovery experiment.

1.  I used `innodb_checkpoint_disabled` which is strictly a testing mechanism. It forces maximum redo accumulation and doesn’t reflect real‑world recovery .

  innodb_checkpoint_disabled , works only with debug build.

2. I did not want to use debug build which is heavy and slow. So, I  hacked the code to create release build with  'innodb_checkpoint_disabled' 

3. Earlier, used to be UNIV_PERF_DEBUG preprocessor for this innodb_checkpoint_disabled and now it's under UNIV_DEBUG  ( meant for debug build).  UNIV_PERF_DEBUG , is provided in Innodb for debugging certain stuffs and it creates release build.

I  faced too much  build issue when I tried to put UNIV_PERF_DEBUG and compile to have release build and that’s the reason, I   removed UNIV_DEBUG around innodb_checkpoint_disabled and related to it.

4.  I used latest git hub of mysql-server : https://github.com/mysql/mysql-server and commit: 447eb26e094b444a88c532028647e48228c3c04f

The following files changed: ha_innodb.cc, row0mysql.h, srv0srv.h, row0mysql.cc, srv0srv.cc, because I had to remove #ifdef UNIV_DEBUG related to innodb_checkpoint_disabled .

Refer git-diff

5. Compiled  this way:

cmake -S "$SRC"  -DBUILD_CONFIG=mysql_release  -DWITH_UNIT_TESTS=0

cmake --build "$BUILD_GEN" -- -j3

cpack -G TGZ


Test methodology  followed:

1.  I initialize and  started server on my laptop-   6 core AMD Ryzen ubuntu 24 with 2 GB BP

2.  I populated  sysbench's data like 8 tbl x 500000=4 million rows ~ 1 GB data

3. set innodb_fast_shutdown = 0  to have slow shutdown to flush all dirty pages.

4. SHOW ENGINE INNODB STATUS;set global innodb_checkpoint_disabled=1;  

       This way I  dumped  all log related data like "log flushed upto" "Last checkpoint at " and disabled checkpoint also.

5. Then I  executed sysbench’s oltp_write_only.lua   test with 10 threads  with --events=100000 means  a total of 100k txn executed by all 10 threads.

 I  saved the output of SHOW ENGINE INNODB STATUS to see last checkpoint logged.

it  gave details like this:

LOG

---

Log capacity                 2147483648

Log capacity used            2147483648

Log sequence number          1417738834

Log buffer assigned up to    1417738834

Log buffer completed up to   1417738834

Log written up to            1417738834

Log flushed up to            1417738834

Added dirty pages up to      1417738834

Pages flushed up to          1417738834

Last checkpoint at           1242793196

Log minimum file id is       18

Log maximum file id is       21

65047 log i/o's done, 1250.48 log i/o's/second

 

6. After the test run ,  I  killed mysqld and restarted  mysqld. I looked into mysqld_error.log to check the timing  and other details related to checkpoint

Example:

2026-05-18T06:43:15.515038Z 1 [Note] [MY-012552] [InnoDB] Starting crash recovery.

2026-05-18T06:43:15.601590Z 1 [Note] [MY-013086] [InnoDB] Starting to parse redo log at lsn = 1242827287, whereas checkpoint_lsn = 1242827668 and start_lsn = 1242827264

.............................

2026-05-18T06:43:17.010779Z 1 [Note] [MY-012532] [InnoDB] Applying a batch of 65638 redo log records ...

....................

2026-05-18T06:43:18.202786Z 1 [Note] [MY-012533] [InnoDB] 100%

2026-05-18T06:43:18.245367Z 1 [Note] [MY-012535] [InnoDB] Apply batch completed!


So start time of crash recovery is : 2026-05-18T06:43:15.515038Z 

End Time: 2026-05-18T06:43:18.245367Z

After crash "SHOW ENGINE INNODB STATUS"  details like below looked into:

LOG

---

Log capacity                 2147483648

Log capacity used            2147483648

Log sequence number          1417794793

Log buffer assigned up to    1417794793

Log buffer completed up to   1417794793

Log written up to            1417794793

Log flushed up to            1417794793

Added dirty pages up to      1417794793

Pages flushed up to          1417794793

Last checkpoint at           1417794793

Log minimum file id is       18

Log maximum file id is       21

2693 log i/o's done, 141.74 log i/o's/second


Summary:

    1.  I am  timing crash recovery in error log , from “Starting crash recovery” to “Apply batch completed.  

    2. I  used innodb_checkpoint_disabled=1 , just before sysbench's DML and kept  the number of DML operation same like 100k to have deterministic way to test and measure crash-recovery timing.

    3. Ran 3 times to  see, how many red-logs records applied , by referring error log of mysqld , after crash recovery

    4. I used release build of mysqld.

    5. I kept higher redo log size , innodb_redo_log_capacity=20G so that 100k txn gets fit into red-log and it does not overrun. It keeps the test redo‑bound and avoids checkpoint interference will make the method easier for others to reproduce.

    6.  Hint: You can assess the recovery of undo by modifying lua script where you  use begin and do not use end  for the commit.

You can refer my rough setup details at github: https://github.com/amroo76/mysql-crash-recovery-performance





Comments

Popular posts from this blog

PGO in MySQL community 9.7

Mariadb Performance : Openssl and wolfssl