Small Datum

Friday, January 2, 2026

Common prefix skipping, adaptive sort

The patent expired for US7680791B2. I invented this while at Oracle and it landed in 10gR2 with claims of ~5X better performance vs the previous sort algorithm used by Oracle. I hope for an open-source implementation one day. The patent has a good description of the algorithm and is much easier to read than your typical patent. Thankfully the IP lawyer made good use of the functional and design docs that I wrote.

The patent is for a new in-memory sort algorithm that needs a name. Features include:

common prefix skipping

skips comparing the prefix of key bytes when possible

adaptive

switches between quicksort and most-significant digit radix sort

key substring caching

reduces CPU cache misses by caching the next few bytes of the key

produces results before sort is done

sorted output can be produced (to the rest of the query, or spilled to disk for an external sort) before the sort is finished.
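
The last feature, producing output before the sort is done, can be illustrated with a small Python sketch. This is not the patented algorithm or Oracle's code. It is a plain three-way quicksort written as a generator, and the name lazy_quicksort is made up for this example. Because it always finishes the partition with the smallest keys first, a consumer can start reading sorted keys while the larger partitions are still unsorted.

```python
from typing import Iterator, List, Tuple


def lazy_quicksort(keys: List[bytes]) -> Iterator[bytes]:
    """Yield keys in sorted order, emitting the smallest keys first.

    Hypothetical sketch: larger partitions stay unsorted on the stack
    until the consumer asks for more output.
    """
    # Each stack entry is (already_sorted, partition). The top of the
    # stack always holds the partition with the smallest remaining keys.
    stack: List[Tuple[bool, List[bytes]]] = [(False, list(keys))]
    while stack:
        is_sorted, part = stack.pop()
        if is_sorted or len(part) <= 1:
            yield from part
            continue
        pivot = part[len(part) // 2]
        less = [k for k in part if k < pivot]
        equal = [k for k in part if k == pivot]
        greater = [k for k in part if k > pivot]
        # Push in reverse order so that 'less' is handled first.
        stack.append((False, greater))
        stack.append((True, equal))
        stack.append((False, less))
```

Calling next() on the generator returns the smallest key after sorting only the partitions that lead to it, which is the property that lets sorted rows flow to the rest of a query or to a spill file early.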

How it came to be

From 2000 to 2005 I worked on query processing for Oracle. I am not sure why I started on this effort and it wasn't suggested by my bosses or peers. But the Sort Benchmark contest was active and I had more time to read technical papers. Perhaps I was inspired by the Alphasort paper.

While the Sort Benchmark advanced the state of the art in sort algorithms, it also encouraged algorithms that were great for benchmarks (focus on short keys with uniform distribution). But keys sorted by a DBMS are often much larger than 8 bytes and adjacent rows often have long common prefixes in their keys.

So I thought about this while falling asleep and after many nights realized that with a divide and conquer sort, as the algorithm descends into subpartitions of the data, the common prefixes of the keys in each subpartition were likely to grow:

were the algorithm able to remember the length of the common prefix as it descends then it can skip the common prefix during comparisons to save on CPU overhead
were the algorithm able to learn when the length of the common prefix grows then it can switch from quicksort to most-significant digit (MSD) radix sort using the next byte beyond the common prefix and then switch back to quicksort after doing that
the algorithm can cache bytes from the key in an array, like Alphasort. But unlike Alphasort as it descends it can cache the next few bytes it will need to compare rather than only caching the first few bytes of the key. This provides much better memory system behavior (fewer cache misses).
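
The first two ideas above can be sketched in Python. This is an illustration under stated assumptions, not the implementation from the patent or from Oracle. The names compare_from, radix_step, sort_keys and the RADIX_MIN cutoff are hypothetical, the pivot is simply the middle key, and key substring caching is left out. Each comparison reports where the two keys first differ, so every partition learns a lower bound on its common prefix for free. When that bound grows and the partition is large enough, one MSD radix pass is done on the next byte before returning to quicksort.

```python
from typing import List, Tuple

RADIX_MIN = 16  # hypothetical cutoff: smaller partitions stay with quicksort


def compare_from(a: bytes, b: bytes, skip: int) -> Tuple[int, int]:
    """Compare a and b, which are known to agree on their first `skip` bytes.

    Returns (order, diff) where order is <0, 0 or >0 and diff is the
    length of the common prefix of a and b.
    """
    n = min(len(a), len(b))
    i = skip  # common prefix skipping: never look at bytes before `skip`
    while i < n and a[i] == b[i]:
        i += 1
    if i < n:
        return (-1 if a[i] < b[i] else 1), i
    # One key is a prefix of the other, or they are equal: shorter sorts first.
    return (len(a) > len(b)) - (len(a) < len(b)), i


def radix_step(keys: List[bytes], skip: int) -> List[bytes]:
    """One MSD radix pass on the byte at offset `skip`, then back to quicksort."""
    ended: List[bytes] = []  # keys that end exactly at the common prefix
    buckets: List[List[bytes]] = [[] for _ in range(256)]
    for k in keys:
        if len(k) == skip:
            ended.append(k)
        else:
            buckets[k[skip]].append(k)
    out = ended
    for bucket in buckets:
        # Every key in a bucket now shares skip + 1 bytes.
        out.extend(sort_keys(bucket, skip + 1, grew=False))
    return out


def sort_keys(keys: List[bytes], skip: int = 0, grew: bool = False) -> List[bytes]:
    """Sort keys that all share their first `skip` bytes."""
    if len(keys) <= 1:
        return list(keys)
    if grew and len(keys) >= RADIX_MIN:
        return radix_step(keys, skip)
    pivot = keys[len(keys) // 2]
    less: List[bytes] = []
    equal: List[bytes] = []
    greater: List[bytes] = []
    less_skip = greater_skip = None
    for k in keys:
        order, diff = compare_from(k, pivot, skip)
        if order < 0:
            less.append(k)
            less_skip = diff if less_skip is None else min(less_skip, diff)
        elif order > 0:
            greater.append(k)
            greater_skip = diff if greater_skip is None else min(greater_skip, diff)
        else:
            equal.append(k)
    # Keys in a partition share at least min(diff) bytes with the pivot,
    # and therefore with each other.
    less_skip = skip if less_skip is None else less_skip
    greater_skip = skip if greater_skip is None else greater_skip
    return (
        sort_keys(less, less_skip, grew=less_skip > skip)
        + equal
        + sort_keys(greater, greater_skip, grew=greater_skip > skip)
    )
```

With keys such as b"user:..." the first partition step already finds that the shared prefix is at least 5 bytes, so later comparisons start at byte 5 rather than byte 0. A real implementation would sort in place and would not recurse this way, but the bookkeeping for the prefix length is the same idea.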

Early implementation

This might have been in 2003 before we were able to access work computers from home. I needed to get results that would convince management this was worth doing. I started my proof-of-concept on an old PowerPC based Mac I had at home that found a second life after I installed Yellow Dog Linux on it.

After some iteration I had good results on the PowerPC. So I brought my source code into work and repeated the test on other CPUs that I could find. On my desk I had a Sun workstation and a Windows PC with a 6 year old Pentium 3 CPU (600MHz, 128kb L2 cache). Elsewhere I had access to a new Sun server with a 900MHz UltraSPARC IV (or IV+) CPU and an HP server with a PA RISC CPU.

I also implemented other state of the art algorithms including Alphasort along with the old sort algorithm used by Oracle. From testing I learned:

my new sort was much faster than other algorithms when keys were larger than 8 bytes
my new sort was faster on my old Pentium 3 CPU than on the Sun UltraSPARC IV

The first was great news for me, the second was less than great news for Sun shareholders. I never learned why that UltraSPARC IV performance was lousy. It might have been latency to the caches.

Real implementation

Once I had great results, it was time for the functional and design specification reviews. I remember two issues:

the old sort was stable, the new sort was not

I don't remember how this concern was addressed

the new sort has a bad, but unlikely, worst-case

The problem here is the worst-case when quicksort picks the worst pivot every time it selects a pivot. The new sort wasn't naive, it used the median from a sample of keys each time to select a pivot (the sample size might have been 5). So I did the math to estimate the risk. Given that the numbers are big and probabilities are small I needed a library or tool that supported arbitrary-precision arithmetic and ended up using a Scheme implementation. The speedup in most cases justified the risk in a few cases.
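
The kind of estimate described above can be reproduced with exact rational arithmetic. The sketch below is not the original calculation. It assumes the sample is drawn independently and uniformly from the keys, it calls a pivot bad when it lands in the lowest or highest 10% of the keys, and the function name p_median_in_tail is made up for this example. Python's fractions module stands in for the Scheme implementation.

```python
from fractions import Fraction
from math import comb


def p_median_in_tail(sample_size: int, tail: Fraction) -> Fraction:
    """Probability that the median of `sample_size` independent uniform
    samples falls within the lowest `tail` fraction of the keys.

    Assumes an odd sample size and sampling with replacement.
    """
    need = sample_size // 2 + 1  # samples that must land in the tail
    return sum(
        (
            comb(sample_size, i) * tail**i * (1 - tail) ** (sample_size - i)
            for i in range(need, sample_size + 1)
        ),
        Fraction(0),
    )


def p_bad_streak(sample_size: int, tail: Fraction, levels: int) -> Fraction:
    """Probability that the pivot lands in either tail at each of
    `levels` consecutive partition steps (steps assumed independent)."""
    p_bad = 2 * p_median_in_tail(sample_size, tail)
    return p_bad**levels
```

For a sample of 5 and a 10% tail, p_median_in_tail(5, Fraction(1, 10)) is a small exact fraction, and raising the two-sided value to the power of the recursion depth shows how quickly the risk of a long run of bad pivots shrinks. Floating point would lose these values to rounding much sooner than exact fractions do.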

And once I had this implemented within the Oracle DBMS I was able to compare it with the old sort. The new sort was often about 5 times faster than the old sort. I then compared it with SyncSort. I don't remember whether they had a DeWitt Clause so I won't share the results but I will say that the new sort in Oracle looked great in comparison.

The End

The new sort landed in 10gR2 and was featured in a white-paper. I also got a short email from Larry Ellison thanking me for the work. A promotion or bonus would have to wait as you had to play the long-game in your career at Oracle. And that was all the motivation I needed to leave Oracle -- first for a startup, and then to Google and Facebook.

After leaving Oracle, much of my time was spent on making MySQL better. Great open-source DBMS, like MySQL and PostgreSQL, were not good for Oracle's new license revenue. Oracle is a better DBMS, but not everyone needs it or can afford it.

Tuesday, December 30, 2025

Performance for RocksDB 9.8 through 10.10 on 8-core and 48-core servers

This post has results for RocksDB performance using db_bench on 8-core and 48-core servers. I previously shared results for RocksDB performance using gcc and clang and then for RocksDB on a small Arm server.

tl;dr

RocksDB is boring, there are few performance regressions.
There was a regression in write-heavy workloads with RocksDB 10.6.2. See bug 13996 for details. That has been fixed.
I will repeat tests in a few weeks

Software

I used RocksDB versions 9.8 through 10.10.

I compiled each version with clang version 18.3.1 with link-time optimization enabled (LTO). The build command line was:

flags=( DISABLE_WARNING_AS_ERROR=1 DEBUG_LEVEL=0 V=1 VERBOSE=1 )

# for clang+LTO
AR=llvm-ar-18 RANLIB=llvm-ranlib-18 CC=clang CXX=clang++ \
make "${flags[@]}" static_lib db_bench

Hardware

I used servers with 8 and 48 cores, both run Ubuntu 22.04:

8-core

Ryzen 7 (AMD) CPU with 8 cores and 32G of RAM.
storage is one NVMe SSD with discard enabled and ext-4
benchmarks are run with 1 client, 20M KV pairs for byrx and 400M KV pairs for iobuf and iodir

48-core

an ax162s from Hetzner with an AMD EPYC 9454P 48-Core Processor with SMT disabled, 128G of RAM
storage is 2 SSDs with RAID 1 (3.8T each) and ext-4.
benchmarks are run with 36 clients, 200M KV pairs for byrx and 2B KV pairs for iobuf and iodir

Benchmark

Overviews on how I use db_bench are here and here.

Most benchmark steps were run for 1800 seconds and all used the LRU block cache. I try to use Hyperclock on large servers but forgot that this time.

Tests were run for three workloads:

byrx - database cached by RocksDB
iobuf - database is larger than RAM and RocksDB used buffered IO
iodir - database is larger than RAM and RocksDB used O_DIRECT

The benchmark steps that I focus on are:

fillseq

load RocksDB in key order with 1 thread

revrangeww, fwdrangeww

do reverse or forward range queries with a rate-limited writer. Report performance for the range queries

readww

do point queries with a rate-limited writer. Report performance for the point queries.

overwrite

overwrite (via Put) random keys and wait for compaction to stop at test end

Relative QPS

Many of the tables below (inlined and via URL) show the relative QPS which is:
(QPS for my version / QPS for RocksDB 9.8)

The base version is RocksDB 9.8. When the relative QPS is > 1.0 then my version is faster than RocksDB 9.8. When it is < 1.0 then there might be a performance regression or there might just be noise.
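
As a small illustration of the relative QPS arithmetic, here is a Python sketch. The function name relative_qps and any numbers passed to it are hypothetical and are not results from these tests.

```python
from typing import Dict


def relative_qps(qps_by_version: Dict[str, float], base: str) -> Dict[str, float]:
    """Divide the QPS for each version by the QPS for the base version."""
    base_qps = qps_by_version[base]
    return {version: qps / base_qps for version, qps in qps_by_version.items()}
```

A call such as relative_qps({"9.8": 100000.0, "10.10": 95000.0}, base="9.8") gives 1.0 for the base version and a value below 1.0 for a version that is slower, which is the signal that something might be a regression or might be noise.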

The spreadsheet with numbers and charts is here. Performance summaries are here.

Results: cached database (byrx)

From 1 client on the 8-core server

Results are stable except for the overwrite test, where there might be a regression. After repeating this test 2 more times I think that is noise, caused by the base case (the result from 9.8) being an outlier. I will revisit this.

From 36 clients on the 48-core server

Results are stable

Results: IO-bound with buffered IO (iobuf)

From 1 client on the 8-core server

Results are stable except for the overwrite test where there might be a large improvement. But I wonder if this is from noise in the result for the base case from RocksDB 9.8, just as there might be noise in the cached (byrx) results.
The regression in fillseq with 10.6.2 is from bug 13996

From 36 clients on the 48-core server

Results are stable except for the overwrite test where there might be a large improvement. But I wonder if this is from noise in the result for the base case from RocksDB 9.8, just as there might be noise in the cached (byrx) results.
The regression in fillseq with 10.6.2 is from bug 13996

Results: IO-bound with O_DIRECT (iodir)

From 1 client on the 8-core server

Results are stable
The regression in fillseq with 10.6.2 is from bug 13996

From 36 clients on the 48-core server

Results are stable
The regression in fillseq with 10.6.2 is from bug 13996

Monday, December 29, 2025

IO-bound sysbench vs Postgres on a 48-core server

This has results for an IO-bound sysbench benchmark on a 48-core server for Postgres versions 12 through 18. Results from a CPU-bound sysbench benchmark on the 48-core server are here.

tl;dr - for Postgres 18.1 relative to 12.22

QPS for IO-bound point-query tests is similar while there is a large improvement for the one CPU-bound test (hot-points)
QPS for range queries without aggregation is similar
QPS for range queries with aggregation is between 1.05X and 1.25X larger in 18.1
QPS for writes shows that there might be a few large regressions in 18.1

tl;dr - for Postgres 18.1 using different values for the io_method option

for tests that do long range queries without aggregation

the best QPS is from io_method=io_uring
the second best QPS is from io_method=worker with a large value for io_workers

for tests that do long range queries with aggregation

when using io_method=worker a larger value for io_workers hurt QPS in contrast to the result for range queries without aggregation
for most tests the best QPS is from io_method=io_uring

Builds, configuration and hardware

I compiled Postgres from source for versions 12.22, 13.23, 14.20, 15.15, 16.10, 16.11, 17.6, 17.7, 18.0 and 18.1.

I used a 48-core server from Hetzner

an ax162s with an AMD EPYC 9454P 48-Core Processor with SMT disabled
2 Intel D7-P5520 NVMe storage devices with RAID 1 (3.8T each) using ext4
128G RAM
Ubuntu 22.04 running the non-HWE kernel (5.5.0-118-generic)

Configuration files for the big server

the config file is named conf.diff.cx10a_c32r128 (x10a_c32r128) and is here for versions 12, 13, 14, 15, 16 and 17.
for Postgres 18 I used

conf.diff.cx10b_c32r128 (x10b_c32r128)

uses io_method=sync and is similar to the config used for versions 12 through 17.

conf.diff.cx10c_c32r128 (x10c_c32r128)

uses io_method=worker and io_workers is not set

conf.diff.cx10cw8_c32r128 (x10cw8_c32r128)

uses io_method=worker and io_workers=8

conf.diff.cx10cw16_c32r128 (x10cw16_c32r128)

uses io_method=worker and io_workers=16

conf.diff.cx10cw32_c32r128 (x10cw32_c32r128)

uses io_method=worker and io_workers=32

conf.diff.cx10d_c32r128 (x10d_c32r128)

uses io_method=io_uring

Benchmark

I used sysbench and my usage is explained here. I now run 32 of the 42 microbenchmarks listed in that blog post. Most test only one type of SQL statement. Unlike the CPU-bound benchmark, the database here is larger than memory.

The read-heavy microbenchmarks are run for 600 seconds and the write-heavy for 900 seconds. The benchmark is run with 40 clients and 8 tables with 250M rows per table. With 250M rows per table this is IO-bound. I normally use 10M rows per table for CPU-bound workloads.

The purpose is to search for regressions from new CPU overhead and mutex contention. I use the small server with low concurrency to find regressions from new CPU overheads and then larger servers with high concurrency to find regressions from new CPU overheads and mutex contention.

Results

The microbenchmarks are split into 4 groups -- 1 for point queries, 2 for range queries, 1 for writes. For the range query microbenchmarks, part 1 has queries without aggregation while part 2 has queries with aggregation.

I provide charts below with relative QPS. The relative QPS is the following:

(QPS for some version) / (QPS for base version)

When the relative QPS is > 1 then some version is faster than base version. When it is < 1 then there might be a regression. When the relative QPS is 1.2 then some version is about 20% faster than base version.

I provide two comparisons and each uses a different base version. They are:

base version is Postgres 12.22

compare 12.22, 13.23, 14.20, 15.15, 16.11, 17.7 and 18.1
the goal for this is to see how performance changes over time
per-test results from vmstat and iostat are here

base version is Postgres 18.1

compare 18.1 using the x10b_c32r128, x10c_c32r128, x10cw8_c32r128, x10cw16_c32r128, x10cw32_c32r128 and x10d_c32r128 configs
the goal for this is to understand the impact of the io_method option
per-test results from vmstat and iostat are here

The per-test results from vmstat and iostat can help to explain why something is faster or slower because it shows how much HW is used per request, including CPU overhead per operation (cpu/o) and context switches per operation (cs/o) which are often a proxy for mutex contention.
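
The per-operation metrics are rates from vmstat and iostat divided by QPS. A rough Python sketch of that normalization follows. The Sample class and per_op function are hypothetical names, and the exact columns and scaling used for cpu/o are assumptions rather than a description of the scripts behind these results.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class Sample:
    qps: float      # queries per second reported by the benchmark client
    cs: float       # vmstat "cs": context switches per second
    cpu_pct: float  # vmstat "us" + "sy": percent of CPU time in use (assumed)
    r_s: float      # iostat "r/s": storage reads per second
    rkb_s: float    # iostat "rkB/s": KB read from storage per second


def per_op(s: Sample) -> Dict[str, float]:
    """Normalize hardware rates by throughput to get cost per operation."""
    return {
        "cpu/o": s.cpu_pct / s.qps,
        "cs/o": s.cs / s.qps,
        "r/o": s.r_s / s.qps,
        "rKB/o": s.rkb_s / s.qps,
    }


def ratio(new: Sample, base: Sample) -> Dict[str, float]:
    """Per-operation cost of `new` relative to `base`, e.g. 1.2 means 1.2X larger."""
    new_cost, base_cost = per_op(new), per_op(base)
    return {name: new_cost[name] / base_cost[name] for name in new_cost}
```

Dividing by QPS matters because a faster version does more work per second, so raw rates can rise even when the cost per query falls. The ratio helper is what a statement like "cs/o is 1.2X larger" refers to.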

The spreadsheet and charts are here and in some cases are easier to read than the charts below. Converting the Google Sheets charts to PNG files does the wrong thing for some of the test names listed at the bottom of the charts below.

Results: Postgres 12.22 through 18.1

All charts except the first have the y-axis start at 0.7 rather than 0.0 to improve readability.

There are two charts for point queries. The second truncates the y-axis to improve readability.

a large improvement for the hot-points test arrives in 17.x. While most tests are IO-bound, this test is CPU-bound because all queries fetch the same N rows.
for other tests there are small changes, both improvements and regressions, and the regressions are too small to investigate

For range queries without aggregation:

QPS for Postgres 18.1 is within 5% of 12.22, sometimes better and sometimes worse
for Postgres 17.7 there might be a large regression on the scan test and that also occurs with 17.6 (not shown). But the scan test can be prone to variance, especially with Postgres and I don't expect to spend time debugging this. Note that the config I use for 18.1 here uses io_method=sync which is similar to what Postgres uses in releases prior to 18.x. From the vmstat and iostat metrics what I see is:

a small reduction in CPU overhead (cpu/o) in 18.1
a large reduction in the context switch rate (cs/o) in 18.1
small reductions in read IO (r/o and rKB/o) in 18.1

For range queries with aggregation:

QPS for 18.1 is between 1.05X and 1.25X better than for 12.22

For write-heavy tests

there might be large regressions for several tests: read-write, update-zipf and write-only. The read-write tests do all of the writes done by write-only and then add read-only statements.
from the vmstat and iostat results for the read-write tests I see

CPU (cpu/o) is up by ~1.2X in PG 16.x through 18.x
storage reads per query (r/o) have been increasing from PG 16.x through 18.x and are up by ~1.1X in PG 18.1
storage KB read per query (rKB/o) started to increase in PG 16.1 and is 1.44X and 1.16X larger in PG 18.x

from the vmstat and iostat results for the update-zipf test

results are similar to the read-write tests above

from the vmstat and iostat results for the write-only test

results are similar to the read-write tests above

Results: Postgres 18.1 and io_method

For point queries

results are similar for all configurations and this is expected

For range queries without aggregation

there are two charts, the y-axis is truncated in the second to improve readability
all configs get similar QPS for all tests except scan
for the scan test

the x10c_c32r128 config has the worst result. This is expected given there are 40 concurrent connections and it used the default for io_workers (=3)
QPS improves for io_method=worker with larger values for io_workers
io_method=io_uring has the best QPS (the x10d_c32r128 config)

For range queries with aggregation

when using io_method=worker a larger value for io_workers hurt QPS in contrast to the result for range queries without aggregation
io_method=io_uring gets the best QPS on all tests except for the read-only tests with range=10 and 10,000. There isn't an obvious problem based on the vmstat and iostat results. From the r_await column in iostat output (not shown) the differences are mostly explained by a change in IO latency. Perhaps variance in storage latency is the issue.

For writes

the best QPS occurs with the x10b_c32r128 config (io_method=sync). I am not sure if that option matters here and perhaps there is too much noise in the results.

Saturday, December 20, 2025

IO-bound sysbench vs MySQL on a 48-core server

This has results for an IO-bound sysbench benchmark on a 48-core server for MySQL versions 5.6 through 9.5. Results from a CPU-bound sysbench benchmark on the 48-core server are here.

tl;dr

the regressions here on read-only tests are smaller than on the CPU bound workload, but when they occur are from new CPU overheads
the large improvements here on write-heavy tests are similar to the CPU bound workload

Builds, configuration and hardware

I compiled MySQL from source for versions 5.6.51, 5.7.44, 8.0.43, 8.0.44, 8.4.6, 8.4.7, 9.4.0 and 9.5.0.

The server is:

an ax162s from Hetzner with an AMD EPYC 9454P 48-Core Processor with SMT disabled
2 Intel D7-P5520 NVMe storage devices with RAID 1 (3.8T each) using ext4
128G RAM
Ubuntu 22.04 running the non-HWE kernel (5.5.0-118-generic)

The config files are here: 5.6.51, 5.7.44, 8.0.4x, 8.4.x, 9.x.0

Benchmark

The purpose is to search for regressions from new CPU overhead, mutex contention and IO stress.

Results

The microbenchmarks are split into 4 groups -- 1 for point queries, 2 for range queries, 1 for writes. For the range query microbenchmarks, part 1 has queries that don't do aggregation while part 2 has queries that do aggregation.

I provide charts below with relative QPS. The relative QPS is the following:

(QPS for some version) / (QPS for MySQL 5.6.51)

When the relative QPS is > 1 then some version is faster than MySQL 5.6.51. When it is < 1 then there might be a regression. When the relative QPS is 1.2 then some version is about 20% faster than MySQL 5.6.51.

Values from iostat and vmstat divided by QPS are here. These can help to explain why something is faster or slower because it shows how much HW is used per request, including CPU overhead per operation (cpu/o) and context switches per operation (cs/o) which are often a proxy for mutex contention.

The spreadsheet and charts are here and in some cases are easier to read than the charts below.

Results: point queries

This has two charts. The y-axis is truncated on the second chart to improve readability for all tests but hot-points which is a positive outlier.

Summary:

the improvement for hot-points is similar to the CPU-bound results
the regressions here for the IO-bound tests are smaller than for the CPU-bound results
the regression in point-query is from new CPU overhead, see cpu/o here which is 1.37X larger in 9.5.0 vs 5.6.51
the regression in points-covered-si is from new CPU overhead, see cpu/o here which is 1.24X larger in 9.5.0 vs 5.6.51. This test is CPU-bound, the queries don't do IO because the secondary indexes are cached.

Results: range queries without aggregation

Summary:

the regressions here for the IO-bound tests are smaller than for the CPU-bound results, except for the scan test
the regressions in scan are from new CPU overhead, see cpu/o here, which is 1.38X larger in 9.5.0 vs 5.6.51

Results: range queries with aggregation

Summary:

the regressions here for the IO-bound tests are smaller than for the CPU-bound results
the regressions in read-only-count are from new CPU overhead, see cpu/o here, which is 1.25X larger in 9.5.0 vs 5.6.51

Results: writes

Summary:

the improvements here for the IO-bound tests are similar to the CPU-bound results
the largest improvement, for the update-index test, is from using less CPU, fewer context switches, less read IO and less write IO per operation -- see cpu/o, cs/o, rKB/o and wKB/o here

Wednesday, December 17, 2025

Performance regressions in MySQL 8.4 and 9.x with sysbench

I have been claiming that I don't find significant performance regressions in MySQL 8.4 and 9.x when I use sysbench. I need to change that claim. There are regressions for write-heavy tests. They are larger for tests with more concurrency and larger when gtid support is enabled.

By gtid support is enabled I mean that these options are set to ON:

gtid_mode
enforce_gtid_consistency

Both of these are ON by default in MySQL 9.5.0 and were OFF by default in earlier releases. I just learned about the performance impact from these and in future tests I will probably repeat tests with them set to ON and OFF.

This blog post has results from the write-heavy tests with sysbench for MySQL 8.0, 8.4, 9.4 and 9.5 to explain my claims above.

tl;dr

Regressions are most likely and larger on the insert test
There are regressions for write-heavy workloads in MySQL 8.4 and 9.x

Throughput is typically 15% less in MySQL 9.5 than in 8.0 for tests with 16 clients on the 24-core/2-socket server
Throughput is typically 5% less in MySQL 9.5 than 8.0 for tests with 40 clients on the 48-core server

The regressions are larger when gtid_mode and enforce_gtid_consistency are set to ON

Throughput is typically 5% to 10% less with the -gtid configs vs the -nogtid configs with 40 clients on the 48-core server. But this is less of an issue on other servers.
There are significant increases in CPU, context switch rates and KB written to storage for the -gtid configs relative to the same MySQL version using the -nogtid configs

Regressions might be larger for the insert and update-inlist tests because they have larger transactions relative to other write-heavy tests. Performance regressions are correlated with increases in CPU, context switches and KB written to storage per transaction.

What changed?

I use diff to compare the output from SHOW GLOBAL VARIABLES when I build new releases and from that it is obvious that the default value for gtid_mode and enforce_gtid_consistency changed in MySQL 9.5 but I didn't appreciate the impact from that change.
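
The same comparison can be done with a few lines of Python instead of diff. This sketch assumes the output of SHOW GLOBAL VARIABLES was saved as tab-separated text, one variable per line, as the mysql client prints it in batch mode. The function names parse_variables and changed_variables are made up for this example.

```python
from typing import Dict, Optional, Tuple


def parse_variables(text: str) -> Dict[str, str]:
    """Parse tab-separated 'name<TAB>value' lines, skipping the header row."""
    variables: Dict[str, str] = {}
    for line in text.splitlines():
        name, sep, value = line.partition("\t")
        if sep and name != "Variable_name":
            variables[name] = value
    return variables


def changed_variables(
    old: Dict[str, str], new: Dict[str, str]
) -> Dict[str, Tuple[Optional[str], Optional[str]]]:
    """Return {name: (old_value, new_value)} for added, removed or changed variables."""
    return {
        name: (old.get(name), new.get(name))
        for name in sorted(old.keys() | new.keys())
        if old.get(name) != new.get(name)
    }
```

Run against dumps from two releases, this lists only the variables whose value differs, with None marking a variable that exists in just one release. A change such as gtid_mode going from OFF to ON shows up as a single entry that is easy to spot.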

Builds, configuration and hardware

I compiled MySQL from source for versions 8.0.44, 8.4.6, 8.4.7, 9.4.0 and 9.5.0.

The versions that I tested are named:

8.0.44-nogtid

MySQL 8.0.44 with gtid_mode and enforce_gtid_consistency =OFF

8.0.44-gtid

MySQL 8.0.44 with gtid_mode and enforce_gtid_consistency =ON

8.4.7-nogtid

MySQL 8.4.7 with gtid_mode and enforce_gtid_consistency =OFF

8.4.7-gtid

MySQL 8.4.7 with gtid_mode and enforce_gtid_consistency =ON

9.4.0-nogtid

MySQL 9.4.0 with gtid_mode and enforce_gtid_consistency =OFF

9.4.0-gtid

MySQL 9.4.0 with gtid_mode and enforce_gtid_consistency =ON

9.5.0-nogtid

MySQL 9.5.0 with gtid_mode and enforce_gtid_consistency =OFF

9.5.0-gtid

MySQL 9.5.0 with gtid_mode and enforce_gtid_consistency =ON

The servers are:

8-core

The server is an ASUS ExpertCenter PN53 with an AMD Ryzen 7 7735HS CPU, 8 cores, SMT disabled, 32G of RAM. Storage is one NVMe device for the database using ext-4 with discard enabled. The OS is Ubuntu 24.04.
my.cnf for the -nogtid configs are here for 8.0, 8.4, 9.4, 9.5
my.cnf for the -gtid configs are here for 8.0, 8.4, 9.4, 9.5
The benchmark is run with 1 thread, 1 table and 50M rows per table

24-core

The server is a SuperMicro SuperWorkstation 7049A-T with 2 sockets, 12 cores/socket, 64G RAM, one m.2 SSD (2TB, ext4 with discard enabled). The OS is Ubuntu 24.04. The CPUs are Intel Xeon Silver 4214R CPU @ 2.40GHz.
my.cnf for the -nogtid configs are here for 8.0, 8.4, 9.4, 9.5
my.cnf for the -gtid configs are here for 8.0, 8.4, 9.4, 9.5
The benchmark is run with 16 threads, 8 tables and 10M rows per table

48-core

The server is an ax162s from Hetzner with an AMD EPYC 9454P 48-Core Processor with SMT disabled and 128G of RAM. Storage is 2 Intel D7-P5520 NVMe devices with RAID 1 (3.8T each) using ext4. The OS is Ubuntu 22.04 running the non-HWE kernel (5.5.0-118-generic).
my.cnf for the -nogtid configs are here for 8.0, 8.4, 9.4, 9.5
my.cnf for the -gtid configs are here for 8.0, 8.4, 9.4, 9.5
The benchmark is run with 40 threads, 8 tables and 10M rows per table

Benchmark

The read-heavy microbenchmarks are run for 600 seconds and the write-heavy for 900 seconds.

The purpose is to search for regressions from new CPU overhead and mutex contention. The workload is cached -- there should be no read IO but will be some write IO.

Results

The microbenchmarks are split into 4 groups -- 1 for point queries, 2 for range queries, 1 for writes. Here I only share results from a subset of the write-heavy tests.

I provide charts below with relative QPS. The relative QPS is the following:

(QPS for some version) / (QPS for MySQL 8.0.44)

When the relative QPS is > 1 then some version is faster than MySQL 8.0.44. When it is < 1 then there might be a regression. When the relative QPS is 1.2 then some version is about 20% faster than MySQL 8.0.44.

Values from iostat and vmstat divided by QPS are here for the 8-core, 24-core and 48-core servers. These can help to explain why something is faster or slower because it shows how much HW is used per request, including CPU overhead per operation (cpu/o) and context switches per operation (cs/o) which are often a proxy for mutex contention.

The spreadsheet and charts are here and in some cases are easier to read than the charts below. The y-axis doesn't start at 0 to improve readability.

Results: 8-core

Summary

For many tests there are small regressions from 8.0 to 8.4 and 8.4 to 9.x
There are small improvements (~5%) for the -gtid configs vs the -nogtid result for update-index
There is a small regression (~5%) for the -gtid configs vs the -nogtid result for insert
There are small regressions (~1%) for the -gtid configs vs the -nogtid result for other tests

From vmstat metrics for the insert test where perf decreases with the 9.5.0-gtid result

CPU per operation (cpu/o) increases by 1.10X with the -gtid config
Context switches per operation (cs/o) increase by 1.45X with the -gtid config
KB written to storage per commit (wKB/o) increases by 1.16X with the -gtid config

From vmstat metrics for the update-index test where perf increases with the 9.5.0-gtid result

CPU per operation (cpu/o) decreases by ~3% with the -gtid config
Context switches per operation (cs/o) decrease by ~2% with the -gtid config
KB written to storage per commit (wKB/o) decreases by ~3% with the -gtid config
This result is odd. I might try to reproduce it in the future

Results: 24-core

Summary

For many tests there are regressions from 8.0 to 8.4 and 8.4 to 9.x and throughput is typically 15% less in 9.5.0 than 8.0.44
There are large regressions in 9.4 and 9.5 for update-inlist
There is usually a small regression (~5%) for the -gtid configs vs the -nogtid result

From vmstat metrics for the insert test comparing 9.5.0-gtid with 9.5.0-nogtid

Throughput is 1.15X larger in 9.5.0-nogtid
CPU per operation (cpu/o) is 1.15X larger in 9.5.0-gtid
Context switches per operation (cs/o) are 1.23X larger in 9.5.0-gtid
KB written to storage per commit (wKB/o) is 1.24X larger in 9.5.0-gtid

From vmstat metrics for the update-inlist test comparing both 9.5.0-gtid and 9.5.0-nogtid with 8.0.44-nogtid

The problems here look different than most other tests as the regressions in 9.4 and 9.5 are similar for the -gtid and -nogtid configs. If I have time I will get flamegraphs and PMP output. The server here has two sockets and can suffer more from false-sharing and real contention on cache lines.
Throughput is 1.43X larger in 8.0.44-nogtid
CPU per operation (cpu/o) is 1.05X larger in 8.0.44-nogtid
Context switches per operation (cs/o) are 1.18X larger in 8.0.44-nogtid
KB written to storage per commit (wKB/o) is ~1.12X larger in 9.5.0

Results: 48-core

Summary

For many tests there are regressions from 8.0 to 8.4
For some tests there are regressions from 8.4 to 9.x
There is usually a large regression for the -gtid configs vs the -nogtid result and the worst case occurs on the insert test

From vmstat metrics for the insert test comparing 9.5.0-gtid with 9.5.0-nogtid

Throughput is 1.17X larger in 9.5.0-nogtid
CPU per operation (cpu/o) is 1.13X larger in 9.5.0-gtid
Context switches per operation (cs/o) are 1.26X larger in 9.5.0-gtid
KB written to storage per commit (wKB/o) is 1.24X larger in 9.5.0-gtid

Thursday, December 11, 2025

Sysbench for MySQL 5.6 through 9.5 on a 2-socket, 24-core server

This has results for the sysbench benchmark on a 2-socket, 24-core server. A post with results from 8-core and 32-core servers is here.

tl;dr

old bad news - there were many large regressions from 5.6 to 5.7 to 8.0
new bad news - there are some new regressions after MySQL 8.0

Normally I claim that there are few regressions after MySQL 8.0 but that isn't the case here. I also see regressions after MySQL 8.0 on the other larger servers that I use, but that topic will be explained in another post.

Builds, configuration and hardware

I compiled MySQL from source for versions 5.6.51, 5.7.44, 8.0.43, 8.0.44, 8.4.6, 8.4.7, 9.4.0 and 9.5.0.

The server is a SuperMicro SuperWorkstation 7049A-T with 2 sockets, 12 cores/socket, 64G RAM, one m.2 SSD (2TB, ext4 with discard enabled). The OS is Ubuntu 24.04. The CPUs are Intel Xeon Silver 4214R CPU @ 2.40GHz.

The config files are here for 5.6, 5.7, 8.0, 8.4 and 9.x.

Benchmark

The read-heavy microbenchmarks are run for 600 seconds and the write-heavy for 900 seconds. The benchmark is run with 16 clients and 8 tables with 10M rows per table.

The purpose is to search for regressions from new CPU overhead and mutex contention. The workload is cached -- there should be no read IO but will be some write IO.

Results

I provide charts below with relative QPS. The relative QPS is the following:

(QPS for some version) / (QPS for base version)

When the relative QPS is > 1 then some version is faster than the base version. When it is < 1 then there might be a regression. When the relative QPS is 1.2 then some version is about 20% faster than the base version.

I present two sets of charts. One set uses MySQL 5.6.51 as the base version which is my standard practice. The other uses MySQL 8.0.44 as the base version to show what changed after MySQL 8.0.

Results: point queries

Summary

from 5.6 to 5.7 there are big improvements for 5 tests, no changes for 2 tests and small regressions for 2 tests
from 5.7 to 8.0 there are big regressions for all tests
from 8.0 to 9.5 performance is stable
for 9.5 the common result is ~20% less throughput vs 5.6

Using vmstat from the hot-points test to understand the performance changes (see here)

context switch rate (cs/o) is stable, mutex contention hasn't changed
CPU per query (cpu/o) drops by 35% from 5.6 to 5.7
CPU per query (cpu/o) grows by 23% from 5.7 to 8.0
CPU per query (cpu/o) is stable from 8.0 through 9.5

Results: range queries without aggregation

Summary

from 5.6 to 5.7 throughput drops by 10% to 15%
from 5.7 to 8.0 throughput drops by about 15%
from 8.0 to 9.5 throughput is stable
for 9.5 the common result is ~30% less throughput vs 5.6

Using vmstat from the scan test to understand the performance changes (see here)

context switch rates are low and can be ignored
CPU per query (cpu/o) grows by 11% from 5.6 to 5.7
CPU per query (cpu/o) grows by 15% from 5.7 to 8.0
CPU per query (cpu/o) is stable from 8.0 through 9.5
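
The two increases compound rather than add. A small sketch of the arithmetic, using the 11% and 15% changes listed above (the helper name is made up for illustration):

```python
def compound(changes_pct):
    """Combine successive percent changes into one overall percent change."""
    factor = 1.0
    for pct in changes_pct:
        factor *= 1.0 + pct / 100.0
    return (factor - 1.0) * 100.0


# 5.6 -> 5.7 is +11% and 5.7 -> 8.0 is +15% CPU per query for the scan test.
overall = compound([11, 15])
print(f"CPU per query from 5.6 to 8.0: +{overall:.0f}%")  # prints "+28%"
```

So CPU per query for the scan test is about 28% larger in 8.0 than in 5.6 (1.11 * 1.15 = 1.2765).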

Results: range queries with aggregation

Summary

from 5.6 to 5.7 there are big improvements for 2 tests, no changes for 1 test and regressions for 5 tests
from 5.7 to 8.0 there are regressions for all tests
from 8.0 through 9.5 performance is stable
for 9.5 the common result is ~25% less throughput vs 5.6

Using vmstat from the read-only-count test to understand the performance changes (see here)

context switch rates are similar
CPU per query (cpu/o) grows by 16% from 5.6 to 5.7
CPU per query (cpu/o) grows by 15% from 5.7 to 8.0
CPU per query (cpu/o) is stable from 8.0 through 9.5

Results: writes

Summary

from 5.6 to 5.7 there are big improvements for 9 tests and no changes for 1 test
from 5.7 to 8.0 there are regressions for all tests
from 8.4 to 9.x there are regressions for 8 tests and no change for 2 tests
for 9.5 vs 5.6: 5 are slower in 9.5, 3 are similar and 2 are faster in 9.5

Using vmstat from the insert test to understand the performance changes (see here)

in 5.7, CPU per insert drops by 30% while context switch rates are stable vs 5.6
in 8.0, CPU per insert grows by 36% while context switch rates are stable vs 5.7
in 9.5, CPU per insert grows by 3% while context switch rates grow by 23% vs 8.4

The first chart doesn't truncate the y-axis to show the big improvement for update-index but that makes it hard to see the smaller changes on the other tests.

This chart truncates the y-axis to make it easier to see changes on tests other than update-index.

Wednesday, December 10, 2025

The insert benchmark on a small server : MySQL 5.6 through 9.5

This has results for MySQL versions 5.6 through 9.5 with the Insert Benchmark on a small server. Results for Postgres on the same hardware are here.

tl;dr

good news - there are no large regressions after MySQL 8.0
bad news - there are many large regressions from 5.6 to 5.7 to 8.0

Builds, configuration and hardware

I compiled MySQL from source for versions 5.6.51, 5.7.44, 8.0.43, 8.0.44, 8.4.6, 8.4.7, 9.4.0 and 9.5.0.

The server is an ASUS ExpertCenter PN53 with an AMD Ryzen 7 7735HS CPU, 8 cores, SMT disabled, 32G of RAM. Storage is one NVMe device for the database using ext4 with discard enabled. The OS is Ubuntu 24.04. More details on it are here.

The config files are here: 5.6.51, 5.7.44, 8.0.4x, 8.4.x, 9.x.0.

The Benchmark

The benchmark is explained here and is run with 1 client and 1 table. I repeated it with two workloads:

cached - the values for X, Y, Z are 30M, 40M, 10M
IO-bound - the values for X, Y, Z are 800M, 4M, 1M

The point query (qp100, qp500, qp1000) and range query (qr100, qr500, qr1000) steps are run for 1800 seconds each.

The benchmark steps are:

l.i0

insert X rows per table in PK order. The table has a PK index but no secondary indexes. There is one connection per client.

l.x

create 3 secondary indexes per table. There is one connection per client.

l.i1

use 2 connections/client. One inserts Y rows per table and the other does deletes at the same rate as the inserts. Each transaction modifies 50 rows (big transactions). This step is run for a fixed number of inserts, so the run time varies depending on the insert rate.

l.i2

like l.i1 but each transaction modifies 5 rows (small transactions) and Z rows are inserted and deleted per table.
Wait for S seconds after the step finishes to reduce variance during the read-write benchmark steps that follow. The value of S is a function of the table size.

qr100

use 3 connections/client. One does range queries and performance is reported for this. The second does 100 inserts/s and the third does 100 deletes/s. The second and third are less busy than the first. The range queries use covering secondary indexes. If the target insert rate is not sustained then that is considered to be an SLA failure. If the target insert rate is sustained then the step does the same number of inserts for all systems tested. This step is frequently not IO-bound for the IO-bound workload.

qp100

like qr100 except uses point queries on the PK index

qr500

like qr100 but the insert and delete rates are increased from 100/s to 500/s

qp500

like qp100 but the insert and delete rates are increased from 100/s to 500/s

qr1000

like qr100 but the insert and delete rates are increased from 100/s to 1000/s

qp1000

like qp100 but the insert and delete rates are increased from 100/s to 1000/s
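
The SLA rule for the read-write steps can be sketched as a check on the achieved background insert rate. This is a minimal illustration and not the benchmark client's code: the function name and the tolerance parameter are assumptions, and the insert counts are made up.

```python
def sla_met(inserts_done: int, elapsed_seconds: float,
            target_rate: float, tolerance: float = 0.01) -> bool:
    """True when the achieved insert rate sustained the target rate.

    tolerance is a hypothetical allowance (1% by default) for timing jitter.
    """
    achieved = inserts_done / elapsed_seconds
    return achieved >= target_rate * (1.0 - tolerance)


# A step that runs for 1800 seconds at 100 inserts/s should do 180,000 inserts.
print(sla_met(180_000, 1800.0, target_rate=100.0))  # True
print(sla_met(150_000, 1800.0, target_rate=100.0))  # False, about 83 inserts/s
```

When the target is sustained, every system does the same number of background inserts, so the query rates can be compared directly.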

Results: overview

The performance reports are here for:

Cached

IO-bound

The summary sections from the performance reports have 3 tables. The first shows absolute throughput for each DBMS tested and benchmark step. The second has throughput relative to the version from the first row of the table. The third shows the background insert rate for benchmark steps with background inserts. The second table makes it easy to see how performance changes over time. The third table makes it easy to see which DBMS+configs failed to meet the SLA.

Below I use relative QPS to explain how performance changes. It is: (QPS for $me / QPS for $base) where $me is the result for some version and $base is the result from MySQL 5.6.51.

When relative QPS is > 1.0 then performance improved over time. When it is < 1.0 then there are regressions. The Q in relative QPS measures:

insert/s for l.i0, l.i1, l.i2
indexed rows/s for l.x
range queries/s for qr100, qr500, qr1000
point queries/s for qp100, qp500, qp1000

Below I use colors to highlight the relative QPS values with yellow for regressions and blue for improvements.

Results: cached

Performance summaries are here for all versions and latest versions. I focus on the latest versions.

Below I use colors to highlight the relative QPS values with yellow for regressions and blue for improvements. There are large regressions from new CPU overheads.

the load step (l.i0) is ~1.7X faster for 5.6.51 vs 8.4.7 (relative QPS is 0.59)
the create index step (l.x) is more than 2X faster for 8.4.7 vs 5.6.51
the first write-only step (l.i1) has similar throughput for 5.6.51 and 8.4.7
the second write-only step (l.i2) is 14% slower in 8.4.7 vs 5.6.51
the range-query steps (qr*) are ~30% slower in 8.4.7 vs 5.6.51
the point-query steps (qp*) are ~39% slower in 8.4.7 vs 5.6.51

dbms	l.i0	l.x	l.i1	l.i2	qr100	qp100	qr500	qp500	qr1000	qp1000
5.6.51	1.00	1.00	1.00	1.00	1.00	1.00	1.00	1.00	1.00	1.00
5.7.44	0.91	1.53	1.16	1.09	0.83	0.83	0.83	0.84	0.83	0.83
8.0.44	0.60	2.42	1.05	0.87	0.69	0.62	0.70	0.62	0.70	0.62
8.4.7	0.59	2.54	1.04	0.86	0.68	0.61	0.68	0.61	0.67	0.60
9.4.0	0.59	2.57	1.03	0.86	0.69	0.62	0.69	0.62	0.70	0.61
9.5.0	0.59	2.61	1.05	0.85	0.69	0.62	0.69	0.62	0.69	0.62
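
The relative QPS values in the table above can be restated as "N% slower" or "NX faster". A short sketch of that conversion, using the 8.4.7 row for the first six steps (the helper names are made up for illustration):

```python
def pct_slower(rel: float) -> float:
    """Percent less throughput than the base version, for rel < 1."""
    return (1.0 - rel) * 100.0


def base_speedup(rel: float) -> float:
    """How many times faster the base version is, for rel < 1."""
    return 1.0 / rel


# Relative QPS for 8.4.7 vs 5.6.51, copied from the cached table above.
row_847 = {"l.i0": 0.59, "l.x": 2.54, "l.i1": 1.04,
           "l.i2": 0.86, "qr100": 0.68, "qp100": 0.61}

for step, rel in row_847.items():
    if rel < 1.0:
        print(f"{step}: 8.4.7 is {pct_slower(rel):.0f}% slower, "
              f"5.6.51 is {base_speedup(rel):.2f}X faster")
    else:
        print(f"{step}: 8.4.7 is {rel:.2f}X faster than 5.6.51")
```

The two phrasings are not interchangeable. A relative QPS of 0.59 means 8.4.7 is 41% slower, while 5.6.51 is about 1.69X faster because 1 / 0.59 is about 1.69.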

Results: IO-bound

Performance summaries are here for all versions and latest versions. I focus on the latest versions.

Below I use colors to highlight the relative QPS values with yellow for regressions and blue for improvements. There are large regressions from new CPU overheads.

the load step (l.i0) is ~1.7X faster for 5.6.51 vs 8.4.7 (relative QPS is 0.60)
the create index step (l.x) is more than 2X faster for 8.4.7 vs 5.6.51
the first write-only step (l.i1) is 1.54X faster for 8.4.7 vs 5.6.51
the second write-only step (l.i2) is 1.82X faster for 8.4.7 vs 5.6.51
the range-query steps (qr*) are ~20% slower in 8.4.7 vs 5.6.51
the point-query steps (qp100, qp500, qp1000) are 13% slower, 2% slower and 17% faster in 8.4.7 vs 5.6.51

dbms	l.i0	l.x	l.i1	l.i2	qr100	qp100	qr500	qp500	qr1000	qp1000
5.6.51	1.00	1.00	1.00	1.00	1.00	1.00	1.00	1.00	1.00	1.00
5.7.44	0.91	1.42	1.52	1.78	0.84	0.92	0.87	0.97	0.93	1.17
8.0.44	0.62	2.58	1.56	1.81	0.76	0.88	0.79	0.99	0.85	1.18
8.4.7	0.60	2.65	1.54	1.82	0.74	0.87	0.77	0.98	0.82	1.17
9.4.0	0.61	2.68	1.52	1.76	0.75	0.86	0.80	0.97	0.85	1.16
9.5.0	0.60	2.75	1.53	1.73	0.75	0.87	0.79	0.97	0.84	1.17