
Monday, December 29, 2025

IO-bound sysbench vs Postgres on a 48-core server

This has results for an IO-bound sysbench benchmark on a 48-core server for Postgres versions 12 through 18. Results from a CPU-bound sysbench benchmark on the 48-core server are here.

tl;dr - for Postgres 18.1 relative to 12.22

  • QPS for IO-bound point-query tests is similar while there is a large improvement for the one CPU-bound test (hot-points)
  • QPS for range queries without aggregation is similar
  • QPS for range queries with aggregation is between 1.05X and 1.25X larger in 18.1
  • QPS for writes shows there might be a few large regressions in 18.1

tl;dr - for Postgres 18.1 using different values for the io_method option
  • for tests that do long range queries without aggregation
    • the best QPS is from io_method=io_uring
    • the second best QPS is from io_method=worker with a large value for io_workers
  • for tests that do long range queries with aggregation
    • when using io_method=worker a larger value for io_workers hurts QPS, in contrast to the result for range queries without aggregation
    • for most tests the best QPS is from io_method=io_uring

Builds, configuration and hardware

I compiled Postgres from source for versions 12.22, 13.23, 14.20, 15.15, 16.10, 16.11, 17.6, 17.7, 18.0 and 18.1.

I used a 48-core server from Hetzner
  • an ax162s with an AMD EPYC 9454P 48-Core Processor with SMT disabled
  • 2 Intel D7-P5520 NVMe storage devices with RAID 1 (3.8T each) using ext4
  • 128G RAM
  • Ubuntu 22.04 running the non-HWE kernel (5.15.0-118-generic)
Configuration files for the big server are here.

Benchmark

I used sysbench and my usage is explained here. I now run 32 of the 42 microbenchmarks listed in that blog post. Most test only one type of SQL statement. Benchmarks are run with a database that is larger than memory.

The read-heavy microbenchmarks are run for 600 seconds and the write-heavy for 900 seconds. The benchmark is run with 40 clients and 8 tables with 250M rows per table. With 250M rows per table this is IO-bound. I normally use 10M rows per table for CPU-bound workloads.
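To make that concrete, below is a minimal sketch of what one read-heavy step could look like. It uses the bundled oltp_read_only Lua script as a stand-in for the custom microbenchmarks linked above, and the connection options and database name (sbtest) are assumptions; the client count, table count, row count and duration match the description above.

# Hedged sketch of one read-heavy sysbench step (not the author's wrapper scripts).
# Assumes sysbench is installed and a Postgres database named sbtest already exists.
import subprocess

ARGS = [
    "sysbench", "oltp_read_only",            # stand-in for the custom microbenchmarks
    "--db-driver=pgsql",
    "--pgsql-user=postgres", "--pgsql-db=sbtest",
    "--tables=8", "--table-size=250000000",  # 8 tables x 250M rows -> IO-bound
    "--threads=40",                          # 40 clients
    "--time=600",                            # read-heavy steps run for 600 seconds
    "run",                                   # run "prepare" first to load the tables
]
subprocess.run(ARGS, check=True)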

The purpose is to search for regressions from new CPU overhead and mutex contention. I use the small server with low concurrency to find regressions from new CPU overheads and then larger servers with high concurrency to find regressions from new CPU overheads and mutex contention.

Results

The microbenchmarks are split into 4 groups -- 1 for point queries, 2 for range queries, 1 for writes. For the range query microbenchmarks, part 1 has queries without aggregation while part 2 has queries with aggregation. 

I provide charts below with relative QPS. The relative QPS is the following:
(QPS for some version) / (QPS for base version)
When the relative QPS is > 1 then some version is faster than the base version. When it is < 1 then there might be a regression. When the relative QPS is 1.2 then some version is about 20% faster than the base version.
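As a quick worked example with made-up numbers (not measured results):

# Relative QPS as used in the charts: (QPS for some version) / (QPS for base version).
def relative_qps(qps_some_version: float, qps_base_version: float) -> float:
    return qps_some_version / qps_base_version

print(relative_qps(1200.0, 1000.0))  # 1.2  -> about 20% faster than the base version
print(relative_qps(950.0, 1000.0))   # 0.95 -> a possible regression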

I provide two comparisons and each uses a different base version. They are:
  • base version is Postgres 12.22
    • compare 12.22, 13.23, 14.20, 15.15, 16.11, 17.7 and 18.1
    • the goal for this is to see how performance changes over time
    • per-test results from vmstat and iostat are here
  • base version is Postgres 18.1
    • compare 18.1 using the x10b_c32r128, x10c_c32r128, x10cw8_c32r128, x10cw16_c32r128, x10cw32_c32r128 and x10d_c32r128 configs (see the sketch after this list)
    • the goal for this is to understand the impact of the io_method option
    • per-test results from vmstat and iostat are here
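The sketch below shows how the configs in the second comparison might differ. This is a hedged reconstruction rather than the actual conf.diff files: x10b, x10c and x10d use io_method=sync, worker and io_uring (per the results that follow), io_workers=3 is the default mentioned below, and the io_workers values for the x10cw* configs are assumptions based on the config names.

# Hedged sketch (not the actual conf.diff files) of the io_method-related settings
# implied by the config names. io_method and io_workers are Postgres 18 options.
PG18_IO_CONFIGS = {
    "x10b_c32r128":    {"io_method": "sync"},                     # behaves like pre-18 releases
    "x10c_c32r128":    {"io_method": "worker", "io_workers": 3},  # default worker count
    "x10cw8_c32r128":  {"io_method": "worker", "io_workers": 8},
    "x10cw16_c32r128": {"io_method": "worker", "io_workers": 16},
    "x10cw32_c32r128": {"io_method": "worker", "io_workers": 32},
    "x10d_c32r128":    {"io_method": "io_uring"},
}

def as_conf_lines(settings: dict) -> str:
    # Render settings as postgresql.conf style "name = value" lines.
    return "\n".join(f"{name} = {value}" for name, value in settings.items())

for config, settings in PG18_IO_CONFIGS.items():
    print(f"# {config}\n{as_conf_lines(settings)}\n")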
The per-test results from vmstat and iostat can help to explain why something is faster or slower because they show how much HW is used per request, including CPU overhead per operation (cpu/o) and context switches per operation (cs/o), which are often a proxy for mutex contention.

The spreadsheet and charts are here and in some cases are easier to read than the charts below. Converting the Google Sheets charts to PNG files does the wrong thing for some of the test names listed at the bottom of the charts below.

Results: Postgres 12.22 through 18.1

All charts except the first have the y-axis start at 0.7 rather than 0.0 to improve readability.

There are two charts for point queries. The second truncates the y-axis to improve readability.
  • a large improvement for the hot-points test arrives in 17.x. While most tests are IO-bound, this test is CPU-bound because all queries fetch the same N rows.
  • for other tests there are small changes, both improvements and regressions, and the regressions are too small to investigate
For range queries without aggregation:
  • QPS for Postgres 18.1 is within 5% of 12.22, sometimes better and sometimes worse
  • for Postgres 17.7 there might be a large regression on the scan test and that also occurs with 17.6 (not shown). But the scan test can be prone to variance, especially with Postgres, and I don't expect to spend time debugging this. Note that the config I use for 18.1 here uses io_method=sync which is similar to what Postgres uses in releases prior to 18.x. From the vmstat and iostat metrics what I see is:
    • a small reduction in CPU overhead (cpu/o) in 18.1
    • a large reduction in the context switch rate (cs/o) in 18.1
    • small reductions in read IO (r/o and rKB/o) in 18.1
For range queries with aggregation:
  • QPS for 18.1 is between 1.05X and 1.25X better than for 12.22
For write-heavy tests
  • there might be large regressions for several tests: read-write, update-zipf and write-only. The read-write tests do all of the writes done by write-only and then add read-only statements.
  • from the vmstat and iostat results for the read-write tests I see
    • CPU (cpu/o) is up by ~1.2X in PG 16.x through 18.x
    • storage reads per query (r/o) have been increasing from PG 16.x through 18.x and are up by ~1.1X in PG 18.1
    • storage KB read per query (rKB/o) increases starting in PG 16.x and is between 1.16X and 1.44X larger in PG 18.x
  • from the vmstat and iostat results for the update-zipf test
    • results are similar to the read-write tests above
  • from the vmstat and iostat results for the write-only test
    • results are similar to the read-write tests above
Results: Postgres 18.1 and io_method

For point queries
  • results are similar for all configurations and this is expected
For range queries without aggregation
  • there are two charts, the y-axis is truncated in the second to improve readability
  • all configs get similar QPS for all tests except scan
  • for the scan test
    • the x10c_c32r128 config has the worst result. This is expected given there are 40 concurrent connections and it used the default for io_workers (=3)
    • QPS improves for io_method=worker with larger values for io_workers
    • io_method=io_uring has the best QPS (the x10d_c32r128 config)
For range queries with aggregation
  • when using io_method=worker a larger value for io_workers hurts QPS, in contrast to the result for range queries without aggregation
  • io_method=io_uring gets the best QPS on all tests except for the read-only tests with range=10 and 10,000. There isn't an obvious problem based on the vmstat and iostat results. From the r_await column in iostat output (not shown) the differences are mostly explained by a change in IO latency. Perhaps variance in storage latency is the issue.
For writes
  • the best QPS occurs with the x10b_c32r128 config (io_method=sync). I am not sure if that option matters here and perhaps there is too much noise in the results.

Saturday, December 20, 2025

IO-bound sysbench vs MySQL on a 48-core server

This has results for an IO-bound sysbench benchmark on a 48-core server for MySQL versions 5.6 through 9.5. Results from a CPU-bound sysbench benchmark on the 48-core server are here.

tl;dr

  • the regressions here on read-only tests are smaller than on the CPU-bound workload, but when they occur they are from new CPU overheads
  • the large improvements here on write-heavy tests are similar to the CPU-bound workload

Builds, configuration and hardware

I compiled MySQL from source for versions 5.6.51, 5.7.44, 8.0.43, 8.0.44, 8.4.6, 8.4.7, 9.4.0 and 9.5.0.

The server is:
  • an ax162s from Hetzner with an AMD EPYC 9454P 48-Core Processor with SMT disabled
  • 2 Intel D7-P5520 NVMe storage devices with RAID 1 (3.8T each) using ext4
  • 128G RAM
  • Ubuntu 22.04 running the non-HWE kernel (5.15.0-118-generic)
The config files are here for 5.6.51, 5.7.44, 8.0.4x, 8.4.x and 9.x.0.

Benchmark

I used sysbench and my usage is explained here. I now run 32 of the 42 microbenchmarks listed in that blog post. Most test only one type of SQL statement. Benchmarks are run with a database that is larger than memory.

The read-heavy microbenchmarks are run for 600 seconds and the write-heavy for 900 seconds. The benchmark is run with 40 clients and 8 tables with 250M rows per table. With 250M rows per table this is IO-bound. I normally use 10M rows per table for CPU-bound workloads.

The purpose is to search for regressions from new CPU overhead, mutex contention and IO stress. 

Results

The microbenchmarks are split into 4 groups -- 1 for point queries, 2 for range queries, 1 for writes. For the range query microbenchmarks, part 1 has queries that don't do aggregation while part 2 has queries that do aggregation. 

I provide charts below with relative QPS. The relative QPS is the following:
(QPS for some version) / (QPS for MySQL 5.6.51)
When the relative QPS is > 1 then some version is faster than MySQL 5.6.51.  When it is < 1 then there might be a regression. When the relative QPS is 1.2 then some version is about 20% faster than MySQL 5.6.51.

Values from iostat and vmstat divided by QPS are here. These can help to explain why something is faster or slower because they show how much HW is used per request, including CPU overhead per operation (cpu/o) and context switches per operation (cs/o), which are often a proxy for mutex contention.
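As a rough illustration, the per-operation values can be derived by dividing average vmstat rates by QPS. The sketch below is an approximation and the column handling is an assumption; the scripts used for these posts may normalize differently.

# Hedged sketch: derive per-operation HW metrics from average vmstat rates and QPS.
# Here cpu/o is (user + system CPU busy %) / QPS and cs/o is context switches / QPS.
def per_operation(vmstat_avg: dict, qps: float) -> dict:
    cpu_busy_pct = vmstat_avg["us"] + vmstat_avg["sy"]  # average CPU busy percentage
    return {
        "cpu/o": cpu_busy_pct / qps,     # CPU overhead per operation
        "cs/o": vmstat_avg["cs"] / qps,  # context switches per operation
    }

# Hypothetical averages over a benchmark step, not measured results:
print(per_operation({"us": 30.0, "sy": 10.0, "cs": 250000.0}, qps=50000.0))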

The spreadsheet and charts are here and in some cases are easier to read than the charts below.

Results: point queries

This has two charts. The y-axis is truncated on the second chart to improve readability for all tests but hot-points which is a positive outlier.

Summary:
  • the improvement for hot-points is similar to the CPU-bound results
  • the regressions here for the IO-bound tests are smaller than for the CPU-bound results
  • the regression in point-query is from new CPU overhead, see cpu/o here which is 1.37X larger in 9.5.0 vs 5.6.51
  • the regression in points-covered-si is from new CPU overhead, see cpu/o here which is 1.24X larger in 9.5.0 vs 5.6.51. This test is CPU-bound, the queries don't do IO because the secondary indexes are cached.
Results: range queries without aggregation

Summary:
  • the regressions here for the IO-bound tests are smaller than for the CPU-bound results, except for the scan test
  • the regressions in scan are from new CPU overhead, see cpu/o here, which is 1.38X larger in 9.5.0 vs 5.6.51
Results: range queries with aggregation

Summary:
  • the regressions here for the IO-bound tests are smaller than for the CPU-bound results
  • the regressions in read-only-count are from new CPU overhead, see cpu/o here, which is 1.25X larger in 9.5.0 vs 5.6.51
Results: writes

Summary:
  • the improvements here for the IO-bound tests are similar to the CPU-bound results
  • the largest improvement, for the update-index test, is from using less CPU, fewer context switches, less read IO and less write IO per operation -- see cpu/o, cs/o, rKB/o and wKB/o here

Wednesday, December 17, 2025

Performance regressions in MySQL 8.4 and 9.x with sysbench

I have been claiming that I don't find significant performance regressions in MySQL 8.4 and 9.x when I use sysbench. I need to change that claim. There are regressions for write-heavy tests; they are larger for tests with more concurrency and larger when gtid support is enabled.

By gtid support is enabled I mean that these options are set to ON:
  • gtid_mode
  • enforce_gtid_consistency
Both of these are ON by default in MySQL 9.5.0 and were OFF by default in earlier releases. I just learned about the performance impact from these and in future tests I will probably repeat tests with them set to ON and OFF.

This blog post has results from the write-heavy tests with sysbench for MySQL 8.0, 8.4, 9.4 and 9.5 to explain my claims above.

tl;dr

  • Regressions are most likely and larger on the insert test
  • There are regressions for write-heavy workloads in MySQL 8.4 and 9.x
    • Throughput is typically 15% less in MySQL 9.5 than in 8.0 for tests with 16 clients on the 24-core/2-socket server
    • Throughput is typically 5% less in MySQL 9.5 than 8.0 for tests with 40 clients on the 48-core server
  • The regressions are larger when gtid_mode and enforce_gtid_consistency are set to ON
    • Throughput is typically 5% to 10% less with the -gtid configs vs the -nogtid configs with 40 clients on the 48-core server. But this is less of an issue on other servers.
    • There are significant increases in CPU, context switch rates and KB written to storage for the -gtid configs relative to the same MySQL version using the -nogtid configs
  • Regressions might be larger for the insert and update-inlist tests because they have larger transactions relative to other write-heavy tests. Performance regressions are correlated with increases in CPU, context switches and KB written to storage per transaction.
What changed?

I use diff to compare the output from SHOW GLOBAL VARIABLES when I build new releases, and from that it is obvious that the default values for gtid_mode and enforce_gtid_consistency changed in MySQL 9.5, but I didn't appreciate the impact of that change.
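A minimal sketch of that workflow, assuming the mysql CLI is available and each build listens on a local socket (the socket paths and file names are placeholders):

# Hedged sketch: dump SHOW GLOBAL VARIABLES per build, then diff the files to spot
# changed defaults such as gtid_mode and enforce_gtid_consistency.
import subprocess

def dump_global_variables(socket_path: str, out_file: str) -> None:
    with open(out_file, "w") as f:
        subprocess.run(
            ["mysql", "-S", socket_path, "-e", "SHOW GLOBAL VARIABLES"],
            stdout=f, check=True,
        )

dump_global_variables("/tmp/mysql-8044.sock", "vars.8.0.44")
dump_global_variables("/tmp/mysql-950.sock", "vars.9.5.0")
subprocess.run(["diff", "vars.8.0.44", "vars.9.5.0"])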

Builds, configuration and hardware

I compiled MySQL from source for versions 8.0.44, 8.4.6, 8.4.7, 9.4.0 and 9.5.0.

The versions that I tested are named:
  • 8.0.44-nogtid
    • MySQL 8.0.44 with gtid_mode and enforce_gtid_consistency =OFF
  • 8.0.44-gtid
    • MySQL 8.0.44 with gtid_mode and enforce_gtid_consistency =ON
  • 8.4.7-nogtid
    • MySQL 8.4.7 with gtid_mode and enforce_gtid_consistency =OFF
  • 8.4.7-gtid
    • MySQL 8.4.7 with gtid_mode and enforce_gtid_consistency =ON
  • 9.4.0-nogtid
    • MySQL 9.4.0 with gtid_mode and enforce_gtid_consistency =OFF
  • 9.4.0-gtid
    • MySQL 9.4.0 with gtid_mode and enforce_gtid_consistency =ON
  • 9.5.0-nogtid
    • MySQL 9.5.0 with gtid_mode and enforce_gtid_consistency =OFF
  • 9.5.0-gtid
    • MySQL 9.5.0 with gtid_mode and enforce_gtid_consistency =ON

The servers are:
  • 8-core
    • The server is an ASUS ExpertCenter PN53 with an AMD Ryzen 7 7735HS CPU, 8 cores, SMT disabled, 32G of RAM. Storage is one NVMe device for the database using ext-4 with discard enabled. The OS is Ubuntu 24.04.
    • my.cnf for the -nogtid configs are here for 8.0, 8.4, 9.4, 9.5
    • my.cnf for the -gtid configs are here for 8.0, 8.4, 9.4, 9.5
    • The benchmark is run with 1 thread, 1 table and 50M rows per table
  • 24-core
    • The server is a SuperMicro SuperWorkstation 7049A-T with 2 sockets, 12 cores/socket, 64G RAM, one m.2 SSD (2TB,  ext4 with discard enabled). The OS is Ubuntu 24.04. The CPUs are Intel Xeon Silver 4214R CPU @ 2.40GHz.
    • my.cnf for the -nogtid configs are here for 8.0, 8.4, 9.4, 9.5
    • my.cnf for the -gtid configs are here for 8.0, 8.4, 9.4, 9.5
    • The benchmark is run with 16 threads, 8 tables and 10M rows per table
  • 48-core
    • The server is an ax162s from Hetzner with an AMD EPYC 9454P 48-Core Processor with SMT disabled and 128G of RAM. Storage is 2 Intel D7-P5520 NVMe devices with RAID 1 (3.8T each) using ext4. The OS is Ubuntu 22.04 running the non-HWE kernel (5.15.0-118-generic).
    • my.cnf for the -nogtid configs are here for 8.0, 8.4, 9.4, 9.5
    • my.cnf for the -gtid configs are here for 8.0, 8.4, 9.4, 9.5
    • The benchmark is run with 40 threads, 8 tables and 10M rows per table
Benchmark

I used sysbench and my usage is explained here. I now run 32 of the 42 microbenchmarks listed in that blog post. Most test only one type of SQL statement. Benchmarks are run with the database cached by InnoDB. While I ran all of the tests, I only share results from a subset of the write-heavy tests.

The read-heavy microbenchmarks are run for 600 seconds and the write-heavy for 900 seconds. 

The purpose is to search for regressions from new CPU overhead and mutex contention. The workload is cached -- there should be no read IO but there will be some write IO.

Results

The microbenchmarks are split into 4 groups -- 1 for point queries, 2 for range queries, 1 for writes. Here I only share results from a subset of the write-heavy tests.

I provide charts below with relative QPS. The relative QPS is the following:
(QPS for some version) / (QPS for MySQL 8.0.44)
When the relative QPS is > 1 then some version is faster than MySQL 8.0.44.  When it is < 1 then there might be a regression. When the relative QPS is 1.2 then some version is about 20% faster than MySQL 8.0.44.

Values from iostat and vmstat divided by QPS are here for the 8-core, 24-core and 48-core servers. These can help to explain why something is faster or slower because they show how much HW is used per request, including CPU overhead per operation (cpu/o) and context switches per operation (cs/o), which are often a proxy for mutex contention.

The spreadsheet and charts are here and in some cases are easier to read than the charts below. The y-axis doesn't start at 0 to improve readability.

Results: 8-core

Summary
  • For many tests there are small regressions from 8.0 to 8.4 and 8.4 to 9.x
  • There are small improvements (~5%) for the -gtid configs vs the -nogtid result for update-index
  • There is a small regression (~5%) for the -gtid configs vs the -nogtid result for insert
  • There are small regressions (~1%) for the -gtid configs vs the -nogtid result for other tests
From vmstat metrics for the insert test where perf decreases with the 9.5.0-gtid result
  • CPU per operation (cpu/o) increases by 1.10X with the -gtid config
  • Context switches per operation (cs/o) increases by 1.45X with the -gtid config
  • KB written to storage per commit (wKB/o) increases by 1.16X with the -gtid config
From vmstat metrics for the update-index test where perf increases with the 9.5.0-gtid result
  • CPU per operation (cpu/o) decreases by ~3% with the -gtid config
  • Context switches per operation (cs/o) decrease by ~2% with the -gtid config
  • KB written to storage per commit (wKB/o) decreases by ~3% with the -gtid config
  • This result is odd. I might try to reproduce it in the future
Results: 24-core

Summary
  • For many tests there are regressions from 8.0 to 8.4 and 8.4 to 9.x and throughput is typically 15% less in 9.5.0 than 8.0.44
  • There are large regressions in 9.4 and 9.5 for update-inlist
  • There is usually a small regression (~5%) for the -gtid configs vs the -nogtid result
From vmstat metrics for the insert test comparing 9.5.0-gtid with 9.5.0-nogtid
  • Throughput is 1.15X larger in 9.5.0-nogtid
  • CPU per operation (cpu/o) is 1.15X larger in 9.5.0-gtid
  • Context switches per operation (cs/o) are 1.23X larger in 9.5.0-gtid
  • KB written to storage per commit (wKB/o) is 1.24X larger in 9.5.0-gtid
From vmstat metrics for the update-inlist test comparing both 9.5.0-gtid and 9.5.0-nogtid with 8.0.44-nogtid
  • The problems here look different than most other tests as the regressions in 9.4 and 9.5 are similar for the -gtid and -nogtid configs. If I have time I will get flamegraphs and PMP output. The server here has two sockets and can suffer more from false-sharing and real contention on cache lines.
  • Throughput is 1.43X larger in 8.0.44-nogtid
  • CPU per operation (cpu/o) is 1.05X larger in 8.0.44-nogtid
  • Context switches per operation (cs/o) are 1.18X larger in 8.0.44-nogtid
  • KB written to storage per commit (wKB/o) is ~1.12X larger in 9.5.0
Results: 48-core

Summary
  • For many tests there are regressions from 8.0 to 8.4
  • For some tests there are regressions from 8.4 to 9.x
  • There is usually a large regression for the -gtid configs vs the -nogtid result and the worst case occurs on the insert test
From vmstat metrics for the insert test comparing 9.5.0-gtid with 9.5.0-nogtid
  • Throughput is 1.17X larger in 9.5.0-nogtid
  • CPU per operation (cpu/o) is 1.13X larger in 9.5.0-gtid
  • Context switches per operation (cs/o) are 1.26X larger in 9.5.0-gtid
  • KB written to storage per commit (wKB/o) is 1.24X larger in 9.5.0-gtid

Thursday, December 11, 2025

Sysbench for MySQL 5.6 through 9.5 on a 2-socket, 24-core server

This has results for the sysbench benchmark on a 2-socket, 24-core server. A post with results from 8-core and 32-core servers is here.

tl;dr

  • old bad news - there were many large regressions from 5.6 to 5.7 to 8.0
  • new bad news - there are some new regressions after MySQL 8.0
Normally I claim that there are few regressions after MySQL 8.0 but that isn't the case here. I also see regressions after MySQL 8.0 on the other larger servers that I use, but that topic will be explained in another post.

Builds, configuration and hardware

I compiled MySQL from source for versions 5.6.51, 5.7.44, 8.0.43, 8.0.44, 8.4.6, 8.4.7, 9.4.0 and 9.5.0.

The server is a SuperMicro SuperWorkstation 7049A-T with 2 sockets, 12 cores/socket, 64G RAM, one m.2 SSD (2TB,  ext4 with discard enabled). The OS is Ubuntu 24.04. The CPUs are Intel Xeon Silver 4214R CPU @ 2.40GHz.

The config files are here for 5.6, 5.7, 8.0, 8.4 and 9.x.

Benchmark

I used sysbench and my usage is explained here. I now run 32 of the 42 microbenchmarks listed in that blog post. Most test only one type of SQL statement. Benchmarks are run with the database cached by InnoDB.

The read-heavy microbenchmarks are run for 600 seconds and the write-heavy for 900 seconds. The benchmark is run with 16 clients and 8 tables with 10M rows per table. 

The purpose is to search for regressions from new CPU overhead and mutex contention. The workload is cached -- there should be no read IO but there will be some write IO.

Results

The microbenchmarks are split into 4 groups -- 1 for point queries, 2 for range queries, 1 for writes. For the range query microbenchmarks, part 1 has queries that don't do aggregation while part 2 has queries that do aggregation. 

I provide charts below with relative QPS. The relative QPS is the following:
(QPS for some version) / (QPS for base version)
When the relative QPS is > 1 then some version is faster than the base version.  When it is < 1 then there might be a regression. When the relative QPS is 1.2 then some version is about 20% faster than the base version.

I present two sets of charts. One set uses MySQL 5.6.51 as the base version which is my standard practice. The other uses MySQL 8.0.44 as the base version to show the changes that arrive after MySQL 8.0.

Values from iostat and vmstat divided by QPS are here. These can help to explain why something is faster or slower because they show how much HW is used per request, including CPU overhead per operation (cpu/o) and context switches per operation (cs/o), which are often a proxy for mutex contention.

The spreadsheet and charts are here and in some cases are easier to read than the charts below. Converting the Google Sheets charts to PNG files does the wrong thing for some of the test names listed at the bottom of the charts below.

Results: point queries

Summary
  • from 5.6 to 5.7 there are big improvements for 5 tests, no changes for 2 tests and small regressions for 2 tests
  • from 5.7 to 8.0 there are big regressions for all tests
  • from 8.0 to 9.5 performance is stable
  • for 9.5 the common result is ~20% less throughput vs 5.6
Using vmstat from the hot-points test to understand the performance changes (see here)
  • context switch rate (cs/o) is stable, mutex contention hasn't changed
  • CPU per query (cpu/o) drops by 35% from 5.6 to 5.7
  • CPU per query (cpu/o) grows by 23% from 5.7 to 8.0
  • CPU per query (cpu/o) is stable from 8.0 through 9.5
Results: range queries without aggregation

Summary
  • from 5.6 to 5.7 throughput drops by 10% to 15%
  • from 5.7 to 8.0 throughput drops by about 15%
  • from 8.0 to 9.5 throughput is stable
  • for 9.5 the common result is ~30% less throughput vs 5.6
Using vmstat from the scan test to understand the performance changes (see here)
  • context switch rates are low and can be ignored
  • CPU per query (cpu/o) grows by 11% from 5.6 to 5.7
  • CPU per query (cpu/o) grows by 15% from 5.7 to 8.0
  • CPU per query (cpu/o) is stable from 8.0 through 9.5
Results: range queries with aggregation

Summary
  • from 5.6 to 5.7 there are big improvements for 2 tests, no changes for 1 test and regressions for 5 tests
  • from 5.7 to 8.0 there are regressions for all tests
  • from 8.0 through 9.5 performance is stable
  • for 9.5 the common result is ~25% less throughput vs 5.6
Using vmstat from the read-only-count test to understand the performance changes (see here)
  • context switch rates are similar
  • CPU per query (cpu/o) grows by 16% from 5.6 to 5.7
  • CPU per query (cpu/o) grows by 15% from 5.7 to 8.0
  • CPU per query (cpu/o) is stable from 8.0 through 9.5
Results: writes

Summary
  • from 5.6 to 5.7 there are big improvements for 9 tests and no changes for 1 test
  • from 5.7 to 8.0 there are regressions for all tests
  • from 8.4 to 9.x there are regressions for 8 tests and no change for 2 tests
  • for 9.5 vs 5.6: 5 are slower in 9.5, 3 are similar and 2 are faster in 9.5
Using vmstat from the insert test to understand the performance changes (see here)
  • in 5.7, CPU per insert drops by 30% while context switch rates are stable vs 5.6
  • in 8.0, CPU per insert grows by 36% while context switch rates are stable vs 5.7
  • in 9.5, CPU per insert grows by 3% while context switch rates grow by 23% vs 8.4
The first chart doesn't truncate the y-axis to show the big improvement for update-index but that makes it hard to see the smaller changes on the other tests.
This chart truncates the y-axis to make it easier to see changes on tests other than update-index.


Monday, October 20, 2025

Determine how much concurrency to use on a benchmark for small, medium and large servers

What I describe here works for me given my goal, which is to find performance regressions. A benchmark run at low concurrency is used to find regressions from CPU overhead. A benchmark run at high concurrency is used to find regressions from mutex contention. A benchmark run at medium concurrency might help find both.

My informal way for classifying servers by size is:

  • small - has less than 10 cores
  • medium - has between 10 and 20 cores
  • large - has more than 20 cores
How much concurrency?

I almost always co-locate benchmark clients and the DBMS on the same server. This comes at a cost (less CPU and RAM is available for the DBMS) and might have odd artifacts because clients in the real world are usually not co-located. But it has benefits that matter to me. First, I don't worry about variance from changes in network latency. Second, this is much easier to set up.

I try to not oversubscribe the CPU when I run a benchmark. For benchmarks where there are few waits for reads from or writes to storage, I limit the number of benchmark users so that the concurrent connection count is less than the number of CPU cores (cores, not vCPUs) and I almost always use servers with Intel Hyperthreads and AMD SMT disabled. I do this because DBMS performance suffers when the CPU is oversubscribed and back when I was closer to production we did our best to avoid that state.

Even for benchmarks that have some benchmark steps where the workload will have IO waits, I will still limit the amount of concurrency unless all benchmark steps that I measure will have IO waits.

Assuming a benchmark is composed of a sequence of steps (at minimum: load, query) then I consider the number of concurrent connections per benchmark user. For sysbench, the number of concurrent connections is the same as the number of users, although sysbench uses the --threads argument to set the number of users. I am just getting started with TPROC-C via HammerDB and that appears to be like sysbench with one concurrent connection per virtual user (VU).

For the Insert Benchmark the number of concurrent connections is 2X the number of users on the l.i1 and l.i2 steps and then 3X the number of users on the range-query read-write steps (qr*) and the point-query read-write steps (qp*). And whether or not there are IO-waits for these users is complicated, so I tend to configure the benchmark so that the number of users is no more than half the number of CPU cores.

Finally, I usually set the benchmark concurrency level to be less than the number of CPU cores because I want to leave some cores for the DBMS to do the important background work, which is mostly MVCC garbage collection -- MyRocks compaction, InnoDB purge and dirty page writeback, Postgres vacuum.
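A small sketch of those rules of thumb. The per-user connection counts come from the description above, while the number of reserved cores is an illustrative assumption rather than a fixed recipe.

# Hedged sketch of the concurrency sizing described above: keep the total number of
# concurrent connections below the CPU core count and reserve some cores for DBMS
# background work (compaction, purge, dirty page writeback, vacuum).
def benchmark_users(cores: int, connections_per_user: int, reserved_cores: int = 8) -> int:
    usable_cores = max(cores - reserved_cores, 1)
    return max(usable_cores // connections_per_user, 1)

print(benchmark_users(48, connections_per_user=1))  # sysbench: one connection per user
print(benchmark_users(48, connections_per_user=3))  # Insert Benchmark qr*/qp* steps: 3 per user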

Monday, October 13, 2025

Postgres 18.0 vs sysbench on a 32-core server

This is yet another great result for Postgres 18.0 vs sysbench. This time I used a 32-core server. Results for a 24-core server are here. The goal for this benchmark is to check for regressions from new CPU overhead and mutex contention.

I repeated the benchmark twice because I had some uncertainty about platform variance (HW and SW) on the first run.

tl;dr, from Postgres 17.6 to 18.0

  • There might be regressions from 17.6 to 18.0 but they are small (usually <= 3%)

tl;dr, from Postgres 12.22 through 18.0

  • the hot-points test is almost 2X faster starting in 17.6
  • scan is ~1.2X faster starting in 14.19
  • all write tests are much faster starting in 17.6
Builds, configuration and hardware

I compiled Postgres from source for versions 12.22, 13.22, 14.19, 15.14, 16.10, 17.6, and 18.0.

The server is a Dell Precision 7865 Tower Workstation with 1 socket, 128G RAM and an AMD Ryzen Threadripper PRO 5975WX with 32-Cores. The OS is Ubuntu 24.04 and storage is a 2TB m.2 SSD with ext-4 and discard enabled.

Prior to 18.0, the configuration file was named conf.diff.cx10a_c32r128 and is here for 12.22, 13.22, 14.19, 15.14, 16.10 and 17.6.

For 18.0 I tried 3 configuration files:
  • x10b which uses io_method=sync
  • x10c which uses io_method=worker
  • x10d which uses io_method=io_uring

Benchmark

I used sysbench and my usage is explained here. To save time I only run 32 of the 42 microbenchmarks and most test only one type of SQL statement. Benchmarks are run with the database cached by Postgres.

The read-heavy microbenchmarks run for 600 seconds and the write-heavy for 900 seconds.

The benchmark is run with 24 clients and 8 tables with 10M rows per table. The purpose is to search for regressions from new CPU overhead and mutex contention.

I ran the benchmark twice. In the first run, there were several weeks between getting results for the older Postgres releases and Postgres 18.0, so I am less certain about variance from the hardware and software. One concern is changes in daily temperature because I don't have a climate-controlled server room. Another concern is changes from updating my OS install.

In the second run, all results were collected within 7 days and I am less concerned about variance there.

Results

The microbenchmarks are split into 4 groups -- 1 for point queries, 2 for range queries, 1 for writes. For the range query microbenchmarks, part 1 has queries that don't do aggregation while part 2 has queries that do aggregation. 

I provide charts below with relative QPS. The relative QPS is the following:
(QPS for some version) / (QPS for base version)
When the relative QPS is > 1 then some version is faster than the base version. When it is < 1 then there might be a regression. Values from iostat and vmstat divided by QPS are also provided here. These can help to explain why something is faster or slower because they show how much HW is used per request.

I present results for:
  • versions 12 through 18 using 12.22 as the base version
  • versions 17.6 and 18.0 using 17.6 as the base version
Results: Postgres 17.6 and 18.0

All files are here.

Results per microbenchmark from vmstat and iostat are here for the first and second run.

Some comments:
  • 18.0 looks better relative to 17.6 in the second run and I explain my uncertainty about the first run above
  • But I am skeptical about the great result for 18.0 on the full scan test (scan_range=100) in the second run. That might be variance induced by vacuum.
  • There might be regressions from 17.6 to 18.0 but they are small (usually <= 3%)
  • The small regression in read-only_range=10 might be from new optimizer overhead, because it doesn't reproduce when the length of the range query is increased -- see read-only_range=100 and read-only_range=10000.
Relative to: 17.6
col-1 : 18.0 with the x10b config that uses io_method=sync
col-2 : 18.0 with the x10c config that uses io_method=worker
col-3 : 18.0 with the x10d config that uses io_method=io_uring

col-1   col-2   col-3   point queries, first run
0.97    0.99    0.94    hot-points_range=100
0.97    0.98    0.96    point-query_range=100
1.00    0.99    0.99    points-covered-pk_range=100
0.99    1.00    1.00    points-covered-si_range=100
0.98    0.99    0.98    points-notcovered-pk_range=100
0.99    0.99    0.99    points-notcovered-si_range=100
1.00    1.00    0.99    random-points_range=1000
0.98    0.98    0.98    random-points_range=100
0.99    0.98    0.99    random-points_range=10

col-1   col-2   col-3   point queries, second run
0.98    1.00    0.99    hot-points_range=100
1.00    1.00    0.99    point-query_range=100
1.01    1.01    1.01    points-covered-pk_range=100
1.00    1.01    1.00    points-covered-si_range=100
1.00    0.98    1.00    points-notcovered-pk_range=100
1.00    1.00    1.01    points-notcovered-si_range=100
1.00    1.01    1.01    random-points_range=1000
1.00    0.99    1.01    random-points_range=100
0.99    0.99    1.00    random-points_range=10

col-1   col-2   col-3   range queries without aggregation, first run
0.97    0.98    0.95    range-covered-pk_range=100
0.97    0.97    0.94    range-covered-si_range=100
0.98    0.98    0.97    range-notcovered-pk_range=100
0.99    0.99    0.98    range-notcovered-si_range=100
0.97    0.99    0.96    scan_range=100

col-1   col-2   col-3   range queries without aggregation, second run
0.99    0.99    0.98    range-covered-pk_range=100
0.99    0.99    0.99    range-covered-si_range=100
0.98    0.99    0.98    range-notcovered-pk_range=100
0.99    1.00    1.00    range-notcovered-si_range=100
1.24    1.24    1.22    scan_range=100

col-1   col-2   col-3   range queries with aggregation, first run
0.99    1.00    1.00    read-only-count_range=1000
1.01    1.01    1.01    read-only-distinct_range=1000
1.01    1.01    1.00    read-only-order_range=1000
1.04    1.04    1.04    read-only_range=10000
0.99    0.99    0.98    read-only_range=100
0.97    0.98    0.97    read-only_range=10
0.99    0.98    0.98    read-only-simple_range=1000
0.99    0.99    0.99    read-only-sum_range=1000

col-1   col-2   col-3   range queries with aggregation, second run
0.99    1.00    1.00    read-only-count_range=1000
1.01    1.01    1.00    read-only-distinct_range=1000
0.99    0.99    1.00    read-only-order_range=1000
1.02    1.03    1.03    read-only_range=10000
0.99    0.99    0.99    read-only_range=100
0.99    0.99    0.98    read-only_range=10
0.99    1.00    1.01    read-only-simple_range=1000
1.00    1.00    1.00    read-only-sum_range=1000

col-1   col-2   col-3   writes, first run
0.99    0.98    0.96    delete_range=100
0.99    0.96    0.98    insert_range=100
1.00    0.99    0.98    read-write_range=100
0.99    0.98    0.98    read-write_range=10
1.00    0.99    1.00    update-index_range=100
1.03    0.95    1.01    update-inlist_range=100
0.99    0.99    1.00    update-nonindex_range=100
1.00    1.00    1.01    update-one_range=100
0.98    0.99    1.00    update-zipf_range=100
0.97    0.97    0.99    write-only_range=10000

col-1   col-2   col-3   writes, second run
0.97    0.97    0.98    delete_range=100
0.99    0.99    1.00    insert_range=100
0.99    0.99    0.98    read-write_range=100
0.98    0.98    0.98    read-write_range=10
0.97    0.98    0.97    update-index_range=100
0.98    0.99    1.04    update-inlist_range=100
0.98    0.99    0.98    update-nonindex_range=100
0.99    0.99    0.98    update-one_range=100
0.98    0.99    0.98    update-zipf_range=100
0.99    0.97    0.95    write-only_range=10000

Results: Postgres 12 to 18

All files are here.

Results per microbenchmark from vmstat and iostat are here for the first and second run.
The data below with a larger font is here.

Some comments:
  • the hot-points test is almost 2X faster starting in 17.6
  • scan is ~1.2X faster starting in 14.19
  • all write tests are much faster starting in 17.6
Relative to: 12.22
col-1 : 13.22
col-2 : 14.19
col-3 : 15.14
col-4 : 16.10
col-5 : 17.6
col-6 : 18.0 with the x10b config
col-7 : 18.0 with the x10c config
col-8 : 18.0 with the x10d config

col-1   col-2   col-3   col-4   col-5   col-6   col-7   col-8   point queries, first run
1.02    1.00    1.01    1.00    1.94    1.87    1.91    1.82    hot-points_range=100
1.01    1.02    1.02    1.00    1.02    0.99    1.00    0.98    point-query_range=100
1.02    1.02    1.01    1.03    1.01    1.01    1.00    1.00    points-covered-pk_range=100
1.01    1.04    1.03    1.05    1.03    1.02    1.03    1.03    points-covered-si_range=100
1.01    1.01    1.01    1.02    1.02    1.00    1.00    1.00    points-notcovered-pk_range=100
1.00    1.03    1.02    1.03    1.02    1.01    1.01    1.02    points-notcovered-si_range=100
1.01    1.02    1.02    1.03    1.00    1.00    1.00    0.99    random-points_range=1000
1.01    1.02    1.02    1.02    1.02    1.00    1.00    1.00    random-points_range=100
1.02    1.03    1.02    1.02    1.01    1.00    1.00    1.00    random-points_range=10

col-1   col-2   col-3   col-4   col-5   col-6   col-7   col-8   point queries, second run
1.00    0.98    0.99    1.00    1.94    1.90    1.93    1.92    hot-points_range=100
1.00    1.01    1.02    1.03    1.03    1.02    1.02    1.02    point-query_range=100
1.02    1.01    1.00    1.04    0.99    1.00    1.00    0.99    points-covered-pk_range=100
1.01    1.04    1.03    1.07    1.03    1.03    1.05    1.04    points-covered-si_range=100
1.01    1.02    1.03    1.04    1.01    1.00    0.99    1.01    points-notcovered-pk_range=100
1.02    1.05    1.05    1.05    1.03    1.03    1.03    1.04    points-notcovered-si_range=100
1.01    1.02    1.03    1.03    0.99    0.99    1.00    1.00    random-points_range=1000
1.02    1.02    1.03    1.04    1.01    1.01    1.00    1.01    random-points_range=100
1.02    1.02    1.02    1.03    1.02    1.01    1.01    1.02    random-points_range=10

col-1   col-2   col-3   col-4   col-5   col-6   col-7   col-8   range queries without aggregation, first run
1.00    1.02    1.02    1.01    1.00    0.97    0.98    0.95    range-covered-pk_range=100
1.00    1.02    1.02    1.01    1.00    0.97    0.97    0.94    range-covered-si_range=100
1.01    1.00    1.00    1.00    0.99    0.97    0.97    0.97    range-notcovered-pk_range=100
0.99    1.00    1.00    0.99    1.01    1.00    1.00    0.99    range-notcovered-si_range=100
0.98    1.24    1.11    1.13    1.16    1.12    1.14    1.11    scan_range=100

col-1   col-2   col-3   col-4   col-5   col-6   col-7   col-8   range queries without aggregation, second run
1.01    1.02    1.02    1.02    1.01    1.00    1.00    0.99    range-covered-pk_range=100
1.01    1.03    1.02    1.02    1.01    1.00    1.01    1.00    range-covered-si_range=100
1.00    0.99    1.00    1.00    0.99    0.97    0.98    0.98    range-notcovered-pk_range=100
1.00    1.00    1.00    0.98    1.01    1.00    1.01    1.01    range-notcovered-si_range=100
1.00    1.27    1.15    1.15    0.97    1.20    1.20    1.18    scan_range=100

col-1   col-2   col-3   col-4   col-5   col-6   col-7   col-8   range queries with aggregation, first run
1.02    1.00    1.00    1.01    0.97    0.96    0.97    0.97    read-only-count_range=1000
1.00    1.00    1.02    1.02    0.98    0.99    0.99    0.99    read-only-distinct_range=1000
1.01    1.00    1.03    1.03    1.00    1.01    1.01    1.01    read-only-order_range=1000
1.00    0.98    1.00    1.06    0.95    0.99    0.99    0.99    read-only_range=10000
1.00    1.00    1.00    1.00    1.00    0.98    0.98    0.98    read-only_range=100
1.00    1.01    1.01    1.00    1.01    0.98    0.99    0.98    read-only_range=10
1.01    1.00    1.02    1.01    1.00    0.99    0.98    0.98    read-only-simple_range=1000
1.00    1.00    1.01    1.00    0.99    0.98    0.98    0.98    read-only-sum_range=1000

col-1   col-2   col-3   col-4   col-5   col-6   col-7   col-8   range queries with aggregation, second run
1.03    1.02    1.02    1.03    0.97    0.97    0.97    0.98    read-only-count_range=1000
1.00    0.99    1.02    1.02    0.98    0.99    0.99    0.99    read-only-distinct_range=1000
1.00    0.99    1.02    1.04    1.02    1.01    1.01    1.02    read-only-order_range=1000
1.01    1.03    1.03    1.06    0.97    0.99    0.99    0.99    read-only_range=10000
0.99    1.00    1.00    1.01    1.00    0.99    0.99    0.99    read-only_range=100
0.99    1.00    1.00    1.00    1.01    0.99    1.00    0.99    read-only_range=10
1.00    0.99    1.01    1.00    0.99    0.98    0.98    0.99    read-only-simple_range=1000
1.00    1.00    1.01    1.01    0.99    0.98    0.98    0.98    read-only-sum_range=1000

col-1   col-2   col-3   col-4   col-5   col-6   col-7   col-8   writes, first run
1.00    1.08    1.08    1.05    1.25    1.24    1.23    1.20    delete_range=100
1.01    1.05    1.04    1.03    1.07    1.06    1.02    1.05    insert_range=100
1.00    1.06    1.07    1.07    1.10    1.09    1.08    1.07    read-write_range=100
1.00    1.07    1.08    1.07    1.13    1.13    1.11    1.11    read-write_range=10
0.99    1.04    1.04    0.90    1.43    1.43    1.41    1.43    update-index_range=100
1.00    1.09    1.08    1.08    1.11    1.15    1.06    1.12    update-inlist_range=100
1.00    1.05    1.05    1.04    1.35    1.34    1.34    1.35    update-nonindex_range=100
1.02    0.95    0.96    0.93    1.19    1.19    1.19    1.20    update-one_range=100
1.00    1.05    1.08    1.07    1.23    1.21    1.22    1.23    update-zipf_range=100
1.01    1.06    1.05    1.01    1.25    1.22    1.20    1.24    write-only_range=10000

col-1   col-2   col-3   col-4   col-5   col-6   col-7   col-8   writes, second run
1.00    1.06    1.07    1.07    1.26    1.23    1.23    1.24    delete_range=100
1.03    1.07    1.05    1.05    1.09    1.07    1.08    1.09    insert_range=100
1.01    1.07    1.08    1.07    1.11    1.10    1.10    1.09    read-write_range=100
0.99    1.04    1.06    1.07    1.13    1.11    1.11    1.12    read-write_range=10
0.99    1.02    1.04    0.87    1.44    1.40    1.41    1.40    update-index_range=100
1.00    1.11    1.12    1.09    1.17    1.14    1.16    1.22    update-inlist_range=100
1.01    1.04    1.06    1.03    1.36    1.33    1.35    1.34    update-nonindex_range=100
1.01    0.95    0.98    0.94    1.22    1.21    1.21    1.20    update-one_range=100
0.99    1.05    1.07    1.07    1.24    1.21    1.22    1.21    update-zipf_range=100
1.02    1.06    1.06    1.02    1.27    1.25    1.23    1.21    write-only_range=10000

The patent expired for US7680791B2 . I invented this while at Oracle and it landed in 10gR2  with claims of ~5X better performance vs the pr...