I/O performance regression on NVMes under same bridge (dual port nvme)

Bug #2115738 reported by Ioanna Alifieraki
Affects                  Status        Importance  Assigned to             Milestone
linux (Ubuntu)           In Progress   Undecided   Massimiliano Pellizzer
linux (Ubuntu) Oracular  Won't Fix     Undecided   Unassigned
linux (Ubuntu) Plucky    Fix Released  Medium      Massimiliano Pellizzer
linux (Ubuntu) Questing  In Progress   Undecided   Massimiliano Pellizzer

Bug Description

[ Impact ]

iommu/vt-d: Optimize iotlb_sync_map for non-caching/non-RWBF modes

The iotlb_sync_map iommu op allows drivers to perform necessary cache
flushes when new mappings are established. For the Intel IOMMU driver,
this callback specifically serves two purposes (see the sketch after the list):

- To flush caches when a second-stage page table is attached to a device
  whose iommu is operating in caching mode (CAP_REG.CM==1).
- To explicitly flush internal write buffers to ensure updates to memory-
  resident remapping structures are visible to hardware (CAP_REG.RWBF==1).
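
For reference, CM and RWBF are single bits in the VT-d capability register (RWBF is bit 4, CM is bit 7, per the VT-d specification). The following is a simplified standalone C sketch of these two checks, not the driver source (the Intel IOMMU driver defines similar cap_rwbf()/cap_caching_mode() helpers):

/*
 * Standalone sketch of the two CAP_REG checks described above.
 * Bit positions follow the VT-d specification: RWBF is bit 4, CM is bit 7.
 */
#include <stdbool.h>
#include <stdint.h>

static inline bool cap_rwbf(uint64_t cap)          /* write-buffer flushing required */
{
	return (cap >> 4) & 1;
}

static inline bool cap_caching_mode(uint64_t cap)  /* caching mode enabled */
{
	return (cap >> 7) & 1;
}

/* iotlb_sync_map only has real work to do when either condition holds. */
static inline bool sync_map_required(uint64_t cap)
{
	return cap_caching_mode(cap) || cap_rwbf(cap);
}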

However, in scenarios where neither caching mode nor the RWBF flag is
active, the cache_tag_flush_range_np() helper, which is called in the
iotlb_sync_map path, effectively becomes a no-op.

Despite being a no-op, cache_tag_flush_range_np() involves iterating
through all cache tags of the IOMMUs attached to the domain, protected
by a spinlock. This unnecessary execution path introduces overhead,
leading to a measurable I/O performance regression. On systems with NVMe
devices under the same bridge, throughput was observed to drop from
~6150 MiB/s to ~4985 MiB/s.

Introduce a flag in the dmar_domain structure. This flag is only set
when iotlb_sync_map is required (i.e., when CM or RWBF is set), and
cache_tag_flush_range_np() is called only for domains where the flag is
set. Once set, the flag is immutable, given that there won't be mixed
configurations in real-world scenarios where some IOMMUs in a system
operate in caching mode while others do not. In theory, the immutability
of this flag does not impact functionality.
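
A minimal illustrative sketch of this approach in plain C (the structure and field names below are simplified assumptions for illustration, not the exact upstream identifiers):

#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct dmar_domain_sketch {
	bool iotlb_sync_map_needed;  /* set once at attach time, never cleared */
	/* ... page tables, cache-tag list, spinlock, etc. ... */
};

/* Evaluated when an IOMMU is attached to the domain. */
static void domain_attach_iommu_sketch(struct dmar_domain_sketch *domain,
				       uint64_t iommu_cap)
{
	/* CM (bit 7) or RWBF (bit 4) set => flushes on map are required. */
	if (((iommu_cap >> 7) & 1) || ((iommu_cap >> 4) & 1))
		domain->iotlb_sync_map_needed = true;
}

/* The iotlb_sync_map path. */
static void iotlb_sync_map_sketch(struct dmar_domain_sketch *domain,
				  uint64_t iova, size_t size)
{
	if (!domain->iotlb_sync_map_needed)
		return;  /* fast path: skip the spinlocked cache-tag walk */

	/*
	 * Slow path: iterate the domain's cache tags under the spinlock and
	 * issue the required IOTLB / write-buffer flushes (omitted here).
	 */
	(void)iova;
	(void)size;
}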

[ Fix ]

Backport the following commits to Plucky:
- 12724ce3fe1a iommu/vt-d: Optimize iotlb_sync_map for non-caching/non-RWBF modes
- b9434ba97c44 iommu/vt-d: Split intel_iommu_domain_alloc_paging_flags()
- b33125296b50 iommu/vt-d: Create unique domain ops for each stage
- 0fa6f0893466 iommu/vt-d: Split intel_iommu_enforce_cache_coherency()
- 85cfaacc9937 iommu/vt-d: Split paging_domain_compatible()
- cee686775f9c iommu/vt-d: Make iotlb_sync_map a static property of dmar_domain

[ Test Plan ]

Run fio against two NVMe devices under the same PCI bridge (dual-port NVMe):

$ sudo fio --readwrite=randread --blocksize=4k --iodepth=32 --numjobs=8 --time_based --runtime=40 --ioengine=libaio --direct=1 --group_reporting --new_group --name=job1 --filename=/dev/nvmeXnY --new_group --name=job2 --filename=/dev/nvmeWnZ

Verify that the throughput reached with the two NVMe devices under the same bridge matches what would have been reached if they were not under the same bridge.

[ Regression Potential ]

This fix affects the Intel IOMMU (VT-d) driver. A flaw in the fix could
cause required IOTLB cache or write buffer flushes to be incorrectly
omitted when attaching devices to a domain. This could result in memory
remapping structures not being visible to the hardware in configurations
that actually require synchronization. As a consequence, devices
performing DMA may exhibit data corruption, access violations, or
inconsistent behavior due to stale or incomplete translations being used
by the hardware.

---

[Description]
A performance regression has been reported when running fio against two NVMe devices under the same PCI bridge (dual-port NVMe).
The issue was initially reported against the 6.11 HWE kernel for Noble.
The performance regression was introduced in the 6.10 upstream kernel and is still present in 6.16 (build at commit e540341508ce2f6e27810106253d5).
Bisection pointed to commit 129dab6e1286 ("iommu/vt-d: Use cache_tag_flush_range_np() in iotlb_sync_map").

In our tests we observe ~6150 MiB/s when the NVMe devices are on different bridges and ~4985 MiB/s when they are under the same bridge.

Before the offending commit we observe ~6150 MiB/s, regardless of NVMe device placement.

[Test Case]

We can reproduce the issue on GCP on a Z3 metal instance type (z3-highmem-192-highlssd-metal) [1].

You need to have 2 NVMe devices under the same bridge, e.g.:

# nvme list -v
...
Device SN MN FR TxPort Address Slot Subsystem Namespaces
-------- -------------------- ---------------------------------------- -------- ------ -------------- ------ ------------ ----------------
nvme0 nvme_card-pd nvme_card-pd (null) pcie 0000:05:00.1 nvme-subsys0 nvme0n1
nvme1 3DE4D285C21A7C001.0 nvme_card 00000000 pcie 0000:3d:00.0 nvme-subsys1 nvme1n1
nvme10 3DE4D285C21A7C001.1 nvme_card 00000000 pcie 0000:3d:00.1 nvme-subsys10 nvme10n1
nvme11 3DE4D285C2027C000.0 nvme_card 00000000 pcie 0000:3e:00.0 nvme-subsys11 nvme11n1
nvme12 3DE4D285C2027C000.1 nvme_card 00000000 pcie 0000:3e:00.1 nvme-subsys12 nvme12n1
nvme2 3DE4D285C2368C001.0 nvme_card 00000000 pcie 0000:b7:00.0 nvme-subsys2 nvme2n1
nvme3 3DE4D285C22A74001.0 nvme_card 00000000 pcie 0000:86:00.0 nvme-subsys3 nvme3n1
nvme4 3DE4D285C22A74001.1 nvme_card 00000000 pcie 0000:86:00.1 nvme-subsys4 nvme4n1
nvme5 3DE4D285C2368C001.1 nvme_card 00000000 pcie 0000:b7:00.1 nvme-subsys5 nvme5n1
nvme6 3DE4D285C21274000.0 nvme_card 00000000 pcie 0000:87:00.0 nvme-subsys6 nvme6n1
nvme7 3DE4D285C21094000.0 nvme_card 00000000 pcie 0000:b8:00.0 nvme-subsys7 nvme7n1
nvme8 3DE4D285C21274000.1 nvme_card 00000000 pcie 0000:87:00.1 nvme-subsys8 nvme8n1
nvme9 3DE4D285C21094000.1 nvme_card 00000000 pcie 0000:b8:00.1 nvme-subsys9 nvme9n1

...

For the output above, drives nvme1n1 and nvme10n1 are under the same bridge, and judging by the SN it appears to be a dual-port NVMe.
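
One way to confirm that two PCI functions share an upstream bridge is to compare the parent component of their resolved sysfs paths. The standalone C helper below is a hypothetical sketch (it is not part of nvme-cli or any tool referenced in this bug); it takes two PCI addresses as printed by "nvme list -v" (e.g. 0000:3d:00.0 and 0000:3d:00.1) and reports whether they sit under the same upstream bridge or root port. Alternatively, "lspci -t" (or lstopo, as used in the verification below) shows the same topology graphically.

/*
 * Hypothetical helper (sketch, not an existing tool): given two PCI
 * addresses as printed by `nvme list -v`, resolve their sysfs entries and
 * compare the parent path component, i.e. the upstream bridge/root port.
 */
#include <libgen.h>
#include <limits.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

static int upstream_of(const char *bdf, char *out, size_t len)
{
	char link[PATH_MAX], real[PATH_MAX];

	snprintf(link, sizeof(link), "/sys/bus/pci/devices/%s", bdf);
	if (!realpath(link, real))     /* -> /sys/devices/pci.../<bridge>/<bdf> */
		return -1;
	snprintf(out, len, "%s", dirname(real));  /* drop the endpoint, keep its parent */
	return 0;
}

int main(int argc, char **argv)
{
	char a[PATH_MAX], b[PATH_MAX];

	if (argc != 3) {
		fprintf(stderr, "usage: %s <bdf1> <bdf2>  e.g. 0000:3d:00.0 0000:3d:00.1\n",
			argv[0]);
		return 2;
	}
	if (upstream_of(argv[1], a, sizeof(a)) || upstream_of(argv[2], b, sizeof(b))) {
		perror("realpath");
		return 2;
	}
	printf("%s\n", strcmp(a, b) == 0 ? "same upstream bridge"
					 : "different upstream bridges");
	return strcmp(a, b) == 0 ? 0 : 1;
}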

- Under the same bridge
Run fio against nvme1n1 and nvme10n1 and observe ~4897 MiB/s after a short initial spike at ~6150 MiB/s.

# sudo fio --readwrite=randread --blocksize=4k --iodepth=32 --numjobs=8 --time_based --runtime=40 --ioengine=libaio --direct=1 --group_reporting --new_group --name=job1 --filename=/dev/nvme1n1 --new_group --name=job2 --filename=/dev/nvme10n1
...
Jobs: 16 (f=16): [r(16)][100.0%][r=4897MiB/s][r=1254k IOPS][eta 00m:00s]
...

- Under different bridges
Run fio against nvme1n1 and nvme11n1 and observe ~6153 MiB/s:

# sudo fio --readwrite=randread --blocksize=4k --iodepth=32 --numjobs=8 --time_based --runtime=40 --ioengine=libaio --direct=1 --group_reporting --new_group --name=job1 --filename=/dev/nvme1n1 --new_group --name=job2 --filename=/dev/nvme11n1
...
Jobs: 16 (f=16): [r(16)][100.0%][r=6153MiB/s][r=1575k IOPS][eta 00m:00s]
...

** So far we haven't been able to reproduce it on another machine, but we suspect it will be reproducible on any machine with a dual-port NVMe.

[Other]

Spreadsheet [2] contains profiling data for different kernel versions, showing a consistent performance difference between them.

Offending commit : https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=129dab6e1286525fe5baed860d3dfcd9c6b4b327

The issue has been reported upstream [3].

[1] https://cloud.google.com/compute/docs/storage-optimized-machines#z3_machine_types
[2] https://docs.google.com/spreadsheets/d/19F0Vvgz0ztFpDX4E37E_o8JYrJ04iYJz-1cqU-j4Umk/edit?gid=1544333169#gid=1544333169
[3] https://lore.kernel<email address hidden>/


tags: added: kernel-daily-bug
Revision history for this message
Ural Tunaboyu (uralt) wrote :

Ubuntu 24.10 (Oracular Oriole) has reached end of life, so this bug will not be fixed for that specific release.

Changed in linux (Ubuntu Oracular):
status: New → Won't Fix
Changed in linux (Ubuntu Questing):
assignee: nobody → Massimiliano Pellizzer (mpellizzer)
Changed in linux (Ubuntu Plucky):
assignee: nobody → Massimiliano Pellizzer (mpellizzer)
status: New → Confirmed
Changed in linux (Ubuntu Questing):
status: New → Confirmed
description: updated
Changed in linux (Ubuntu Plucky):
status: Confirmed → In Progress
Changed in linux (Ubuntu Questing):
status: Confirmed → In Progress
Revision history for this message
Massimiliano Pellizzer (mpellizzer) wrote :

The upstream patch will most probably be included in Linux 6.17, which is the kernel version Questing will be released with.

Sent the patch for Plucky to KTML:
- https://lists.ubuntu.com/archives/kernel-team/2025-July/161401.html

Revision history for this message
Massimiliano Pellizzer (mpellizzer) wrote :

I NACKed the patch sent to the mailing list, since it introduces a regression addressed by:
- https://<email address hidden>/
I will wait for the fix to be accepted upstream before sending the fixed patchset again.

description: updated
Revision history for this message
Massimiliano Pellizzer (mpellizzer) wrote :
description: updated
Stefan Bader (smb)
Changed in linux (Ubuntu Plucky):
importance: Undecided → Medium
status: In Progress → Fix Committed
Revision history for this message
Ubuntu Kernel Bot (ubuntu-kernel-bot) wrote :

This bug is awaiting verification that the linux/6.14.0-32.32 kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-plucky-linux' to 'verification-done-plucky-linux'. If the problem still exists, change the tag 'verification-needed-plucky-linux' to 'verification-failed-plucky-linux'.

If verification is not done by 5 working days from today, this fix will be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation how to enable and use -proposed. Thank you!

tags: added: kernel-spammed-plucky-linux-v2 verification-needed-plucky-linux
Revision history for this message
Bryan Fraschetti (bryanfraschetti) wrote :

Verification
============

Performance results without the patch:
--------------------------------------

Ubuntu Release:
lsb_release -rc
Release: 25.04
Codename: plucky

Kernel in use:
uname -srvm
Linux 6.14.0-1015-gcp #16-Ubuntu SMP Tue Aug 19 00:02:17 UTC 2025 x86_64

nvme topology:
sudo lstopo | grep -A 4 NVMExp
          PCI 05:00.1 (NVMExp)
            Block(Disk) "nvme0n1"
      HostBridge
        PCI 6b:00.0 (Co-Processor)
      HostBridge
--
          PCI 3d:00.0 (NVMExp)
            Block(Disk) "nvme1n1"
          PCI 3d:00.1 (NVMExp)
            Block(Disk) "nvme10n1"
        PCIBridge
          PCI 3e:00.0 (NVMExp)
            Block(Disk) "nvme11n1"
          PCI 3e:00.1 (NVMExp)
            Block(Disk) "nvme12n1"
  Package L#1 + L3 L#1 (105MB)
    Group0 L#2
      NUMANode L#2 (P#2 378GB)
--
          PCI 86:00.0 (NVMExp)
            Block(Disk) "nvme2n1"
          PCI 86:00.1 (NVMExp)
            Block(Disk) "nvme4n1"
        PCIBridge
          PCI 87:00.0 (NVMExp)
            Block(Disk) "nvme7n1"
          PCI 87:00.1 (NVMExp)
            Block(Disk) "nvme9n1"
      HostBridge
        PCI e8:00.0 (Co-Processor)
      HostBridge
--
          PCI b7:00.0 (NVMExp)
            Block(Disk) "nvme3n1"
          PCI b7:00.1 (NVMExp)
            Block(Disk) "nvme5n1"
        PCIBridge
          PCI b8:00.0 (NVMExp)
            Block(Disk) "nvme6n1"
          PCI b8:00.1 (NVMExp)
            Block(Disk) "nvme8n1"
...

- From the topology we see that nvme1n1 and nvme3n1 are under different bridges. Run the fio benchmark against these NVMe devices and observe an aggregate read performance of 6152 MiB/s:

sudo fio --readwrite=randread --blocksize=4k --iodepth=32 --numjobs=8 --time_based --runtime=40 --ioengine=libaio --direct=1 --group_reporting --new_group --name=job1 --filename=/dev/nvme1n1 --new_group --name=job2 --filename=/dev/nvme3n1
...
Jobs: 16 (f=16): [r(16)][100.0%][r=6152MiB/s][r=1575k IOPS][eta 00m:00s]
job1: (groupid=0, jobs=8): err= 0: pid=12326: Wed Sep 3 15:12:38 2025
  read: IOPS=787k, BW=3073MiB/s (3222MB/s)(120GiB/40001msec)
...
job2: (groupid=1, jobs=8): err= 0: pid=12334: Wed Sep 3 15:12:38 2025
  read: IOPS=787k, BW=3073MiB/s (3222MB/s)(120GiB/40001msec)
...
Run status group 0 (all jobs):
   READ: bw=3073MiB/s (3222MB/s), 3073MiB/s-3073MiB/s (3222MB/s-3222MB/s), io=120GiB (129GB), run=40001-40001msec
Run status group 1 (all jobs):
   READ: bw=3073MiB/s (3222MB/s), 3073MiB/s-3073MiB/s (3222MB/s-3222MB/s), io=120GiB (129GB), run=40001-40001msec
Disk stats (read/write):
  nvme1n1: ios=31460219/0, sectors=251681752/0, merge=0/0, ticks=10023925/0, in_queue=10023925, util=99.46%
  nvme3n1: ios=31460291/0, sectors=251682328/0, merge=0/0, ticks=10039463/0, in_queue=10039463, util=99.49%

- We see that nvme1n1 is under the same bridge as nvme10n1. Run the same benchmark against these two NVMe devices and note the degraded performance (4947 MiB/s), despite an initial burst at around the same ~6150 MiB/s:

sudo fio --readwrite=randread --blocksize=4k --iodepth=32 --numjobs=8 --time_based --runtime=40 --ioengine=libaio --direct=1 --group_reporting --new_group --name=job1 --filename=/dev/nvme1n1 --new_group --name=job2 --filename=/dev/nv...


tags: added: verification-done-plucky-linux
removed: verification-needed-plucky-linux
Revision history for this message
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package linux - 6.14.0-32.32

---------------
linux (6.14.0-32.32) plucky; urgency=medium

  * plucky/linux: 6.14.0-32.32 -proposed tracker (LP: #2121653)

  * Packaging resync (LP: #1786013)
    - [Packaging] debian.master/dkms-versions -- update from kernel-versions
      (main/2025.08.11)

  * Pytorch reports incorrect GPU memory causing "HIP Out of Memory" errors
    (LP: #2120454)
    - drm/amdkfd: add a new flag to manage where VRAM allocations go
    - drm/amdkfd: use GTT for VRAM on APUs only if GTT is larger

  * nvme no longer detected on boot after upgrade to 6.8.0-60 (LP: #2111521)
    - SAUCE: PCI: Disable RRS polling for Intel SSDPE2KX020T8 nvme

  * kernel panic when reloading apparmor 5.0.0 profiles (LP: #2120233)
    - SAUCE: apparmor5.0.0 [59/53]: apparmor: prevent profile->disconnected
      double free in aa_free_profile

  * [SRU] Add support for ALC1708 codec on TRBL platform (LP: #2116247)
    - ASoC: Intel: soc-acpi-intel-lnl-match: add rt1320_l12_rt714_l0 support

  * [SRU] Add waiting latency for USB port resume (LP: #2115478)
    - usb: hub: fix detection of high tier USB3 devices behind suspended hubs
    - usb: hub: Fix flushing and scheduling of delayed work that tunes runtime
      pm
    - usb: hub: Fix flushing of delayed work used for post resume purposes

  * minimal kernel lacks modules for blk disk in arm64 openstack environments
    where config_drive is required (LP: #2118499)
    - [Config] Enable SYM53C8XX_2 on arm64

  * Support xe2_hpg (LP: #2116175)
    - drm/xe/xe2_hpg: Add PCI IDs for xe2_hpg
    - drm/xe/xe2_hpg: Define additional Xe2_HPG GMD_ID
    - drm/xe/xe2_hpg: Add set of workarounds
    - drm/xe/xe2hpg: Add Wa_16025250150

  * drm/xe: Lite restore breaks fdinfo drm-cycles-rcs reporting (LP: #2119526)
    - drm/xe: Add WA BB to capture active context utilization
    - drm/xe/lrc: Use a temporary buffer for WA BB

  * No IP Address assigned after hot-plugging Ethernet cable on HP Platform
    (LP: #2115393)
    - Revert "e1000e: change k1 configuration on MTP and later platforms"

  * I/O performance regression on NVMes under same bridge (dual port nvme)
    (LP: #2115738)
    - iommu/vt-d: Optimize iotlb_sync_map for non-caching/non-RWBF modes
    - iommu/vt-d: Split intel_iommu_domain_alloc_paging_flags()
    - iommu/vt-d: Create unique domain ops for each stage
    - iommu/vt-d: Split intel_iommu_enforce_cache_coherency()
    - iommu/vt-d: Split paging_domain_compatible()
    - iommu/vt-d: Make iotlb_sync_map a static property of dmar_domain

  * BPF header file in wrong location (LP: #2118965)
    - [Packaging] Install bpf header to correct location

  * Internal microphone not working on ASUS VivoBook with Realtek ALC256
    (Ubuntu 24.04 + kernel 6.15) (LP: #2112330)
    - ALSA: hda/realtek: Fix built-in mic on ASUS VivoBook X513EA

  * Documentation update for [Ubuntu25.04] "virsh attach-interface" requires
    a reboot to reflect the attached interfaces on the guest (LP: #2111231)
    - powerpc/pseries/dlpar: Search DRC index from ibm, drc-indexes for IO add

  * Plucky update: upstream stable patchset 2025-08-06 (LP: #2119603)
    - tools/x86/kcpuid: Fix e...

Changed in linux (Ubuntu Plucky):
status: Fix Committed → Fix Released
Revision history for this message
Ubuntu Kernel Bot (ubuntu-kernel-bot) wrote :

This bug is awaiting verification that the linux-nvidia-6.14/6.14.0-1011.11 kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-noble-linux-nvidia-6.14' to 'verification-done-noble-linux-nvidia-6.14'. If the problem still exists, change the tag 'verification-needed-noble-linux-nvidia-6.14' to 'verification-failed-noble-linux-nvidia-6.14'.

If verification is not done by 5 working days from today, this fix will be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation how to enable and use -proposed. Thank you!

tags: added: kernel-spammed-noble-linux-nvidia-6.14-v2 verification-needed-noble-linux-nvidia-6.14
tags: added: verification-done-noble-linux-nvidia-6.14
removed: verification-needed-noble-linux-nvidia-6.14
Revision history for this message
Ubuntu Kernel Bot (ubuntu-kernel-bot) wrote :

This bug is awaiting verification that the linux-azure-nvidia-6.14/6.14.0-1006.6 kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-noble-linux-azure-nvidia-6.14' to 'verification-done-noble-linux-azure-nvidia-6.14'. If the problem still exists, change the tag 'verification-needed-noble-linux-azure-nvidia-6.14' to 'verification-failed-noble-linux-azure-nvidia-6.14'.

If verification is not done by 5 working days from today, this fix will be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation how to enable and use -proposed. Thank you!

tags: added: kernel-spammed-noble-linux-azure-nvidia-6.14-v2 verification-needed-noble-linux-azure-nvidia-6.14
tags: added: verification-done-noble-linux-azure-nvidia-6.14
removed: verification-needed-noble-linux-azure-nvidia-6.14
Revision history for this message
Ubuntu Kernel Bot (ubuntu-kernel-bot) wrote :

This bug is awaiting verification that the linux-intel/6.14.0-1008.8 kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-plucky-linux-intel' to 'verification-done-plucky-linux-intel'. If the problem still exists, change the tag 'verification-needed-plucky-linux-intel' to 'verification-failed-plucky-linux-intel'.

If verification is not done by 5 working days from today, this fix will be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation how to enable and use -proposed. Thank you!

tags: added: kernel-spammed-plucky-linux-intel-v2 verification-needed-plucky-linux-intel