aboutsummaryrefslogtreecommitdiff
path: root/drivers/nvme
AgeCommit message (Collapse)AuthorFilesLines
11 hoursMerge tag 'for-7.2/block-20260615' of ↵Linus Torvalds15-144/+661
git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux Pull block updates from Jens Axboe: - NVMe pull request via Keith: - Per-controller admin and IO timeout sysfs attributes, and letting the block layer set request timeouts (Maurizio, Maximilian) - Multipath passthrough iostats, and PCI P2PDMA enablement for multipath devices (Keith, Kiran) - A new diag sysfs attribute group exporting per-controller counters (retries, multipath failover, error counters, requeue and failure counts, reset and reconnect events) (Nilay) - FDP configuration validation and bounds check fixes (liuxixin) - Various nvmet fixes, including a pre-auth out-of-bounds read in the Discovery Get Log Page handler, auth payload bounds validation, and tcp error-path leak fixes (Bryam, Tianchu, Geliang) - nvme-tcp lockdep and workqueue fixes (Shin'ichiro, Kuniyuki, Eric) - Assorted other fixes and cleanups (John, Yao, Chao, Mateusz, Achkinazi, Wentao) - MD pull request via Yu Kuai: - raid1/raid10 fixes for a deadlock in the read error recovery path, error-path detection and bio accounting with cloned bios, and an nr_pending leak in the REQ_ATOMIC bad-block error path (Abd-Alrhman) - PCI P2PDMA propagation from member devices to the RAID device (Kiran) - dm-raid bio requeue fix, and various smaller fixes and cleanups (Benjamin, Chen, Li, Thorsten) - Enable Clang lock context analysis for the block layer, with the accompanying annotations across queue limits, the blk_holder_ops callbacks, crypto, cgroup, iocost, kyber and mq-deadline (Bart) - Block status code infrastructure work: a tagged status table, a str_to_blk_op() helper, a bio_endio_status() helper, and on top of that a new configurable block-layer error injection facility (Christoph) - DRBD netlink rework, replacing the genl_magic machinery with explicit netlink serialization and moving the DRBD UAPI headers to include/uapi/linux/ (Christoph Böhmwalder) - bvec improvements: a bvec_folio() helper and making the bvec_iter helpers proper inline functions (Willy, Christoph) - ublk cleanups and a canceling-flag fix for the disk-not-allocated case (Caleb, Ming) - Partition handling fixes: bound the AIX pp_count scan, fix an of_node refcount leak, and replace __get_free_page() with kmalloc() (Bryam, Wentao, Mike) - Convert numa_node to int in blk_mq_hw_ctx and ->init_request, and add WQ_PERCPU to the block workqueue users (Mateusz, Marco) - Block statistics and tracing: propagate in-flight to the whole disk on partition IO, export passthrough stats, and a new block_rq_tag_wait tracepoint (Tang, Keith, Aaron) - A round of removals, unexports and cleanups across bio, direct-io and the bvec helpers (Christoph) - Various driver fixes (mtip32xx use-after-free, rbd snap_count validation and strscpy conversion, nbd socket lockdep reclassify, virtio-blk zone report clamp, floppy) and a batch of MAINTAINERS email/list updates (Coly, Li, Yu, Christoph Böhmwalder) - Other little fixes and cleanups all over * tag 'for-7.2/block-20260615' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux: (117 commits) MAINTAINERS: Update Coly Li's email address block: check bio split for unaligned bvec nbd: Reclassify sockets to avoid lockdep circular dependency block: add configurable error injection block: add a str_to_blk_op helper block: add a "tag" for block status codes block: add a macro to initialize the status table floppy: Drop unused pnp driver data block: propagate in_flight to whole disk on partition I/O virtio-blk: clamp zone report to the report buffer capacity block: optimize I/O merge hot path with unlikely() hints drivers/block/rbd: Use strscpy() to copy strings into arrays partitions: aix: bound the pp_count scan to the ppe array block: Enable lock context analysis block/mq-deadline: Make the lock context annotations compatible with Clang block/Kyber: Make the lock context annotations compatible with Clang block/blk-mq-debugfs: Improve lock context annotations block/blk-iocost: Inline iocg_lock() and iocg_unlock() block/blk-iocost: Split ioc_rqos_throttle() block/crypto: Annotate the crypto functions ...
11 daysMerge tag 'nvme-7.2-2026-06-04' of git://git.infradead.org/nvme into ↵Jens Axboe14-119/+631
for-7.2/block Pull NVMe updates from Keith: "- Per-controller timeouts - Multipath telemetry - Namespace format validation - Various other fixes" * tag 'nvme-7.2-2026-06-04' of git://git.infradead.org/nvme: (34 commits) nvme: export controller reconnect event count via sysfs nvme: export controller reset event count via sysfs nvme: export I/O failure count when no path is available via sysfs nvme: export I/O requeue count when no path is usable via sysfs nvme: export command error counters via sysfs nvme: export multipath failover count via sysfs nvme: export command retry count via sysfs nvme: add diag attribute group under sysfs nvme-tcp: lockdep: use dynamic lockdep keys per socket instance nvme-tcp: move nvme_tcp_reclassify_socket() nvme: validate FDP configuration descriptor sizes nvmet-auth: validate reply message payload bounds against transfer length nvme: refresh multipath head zoned limits from path limits nvme: fix FDP fdpcidx bounds check nvme-tcp: Use WQ_PERCPU explicitly if wq_unbound is false. nvmet: fix pre-auth out-of-bounds heap read in Discovery Get Log Page nvme-multipath: set BIO_REMAPPED on bios remapped to per-path namespace disks nvme-multipath: require exact iopolicy names for module parameter nvme-multipath: pass NS head to nvme_mpath_revalidate_paths() nvme-pci: fix out-of-bounds access in nvme_setup_descriptor_pools ...
12 daysnvme: export controller reconnect event count via sysfsNilay Shroff5-0/+44
When an NVMe-oF link goes down, the driver attempts to recover the connection by repeatedly reconnecting to the remote controller at configured intervals. A maximum number of reconnect attempts is also configured, after which recovery stops and the controller is removed if the connection cannot be re-established. The driver maintains a counter, nr_reconnects, which is incremented on each reconnect attempt. However if in case the reconnect is successful then this counter reset to zero. Moreover, currently, this counter is only reported via kernel log messages and is not exposed to userspace. Since dmesg is a circular buffer, this information may be lost over time. So introduce a new accumulator which accumulates nr_reconnect attempts and also expose this accumulator per-fabric ctrl via a new sysfs attribute reconnect_count, under diag attribute grroup to provide persistent visibility into the number of reconnect attempts made by the host. This information can help users diagnose unstable links or connectivity issues. Furthermore, this sysfs attribute is also writable so user may reset it to zero, if needed. The reconnect_count can also be consumed by monitoring tools such as nvme-top to improve controller-level observability. Tested-by: Venkat Rao Bagalkote <venkat88@linux.ibm.com> Signed-off-by: Nilay Shroff <nilay@linux.ibm.com> Signed-off-by: Keith Busch <kbusch@kernel.org>
12 daysnvme: export controller reset event count via sysfsNilay Shroff3-0/+29
The NVMe controller transitions into the RESETTING state during error recovery, link instability, firmware activation, or when a reset is explicitly triggered by the user. Expose a per-ctrl sysfs attribute reset_count, under diag attribute group to provide visibility into these RESETTING state transitions. Observing the frequency of reset events can help users identify issues such as PCIe errors or unstable fabric links. This counter is also writable thus allowing user to reset its value, if needed. This counter can also be consumed by monitoring tools such as nvme-top to improve controller-level observability. Tested-by: Venkat Rao Bagalkote <venkat88@linux.ibm.com> Signed-off-by: Nilay Shroff <nilay@linux.ibm.com> Signed-off-by: Keith Busch <kbusch@kernel.org>
12 daysnvme: export I/O failure count when no path is available via sysfsNilay Shroff3-0/+37
When I/O is submitted to the NVMe namespace head and no available path can handle the request, the driver fails the I/O immediately. Currently, such failures are only reported via kernel log messages, which may be lost over time since dmesg is a circular buffer. Add a new ns-head sysfs counter io_fail_no_available_path_count, under diag attribute group to expose the number of I/Os that failed due to the absence of an available path. This provides persistent visibility into path-related I/O failures and can help users diagnose the cause of I/O errors. This counter is also writable and so user may reset its value, if needed. This counter can also be consumed by monitoring tools such as nvme-top. Tested-by: Venkat Rao Bagalkote <venkat88@linux.ibm.com> Signed-off-by: Nilay Shroff <nilay@linux.ibm.com> Signed-off-by: Keith Busch <kbusch@kernel.org>
12 daysnvme: export I/O requeue count when no path is usable via sysfsNilay Shroff3-0/+37
When the NVMe namespace head determines that there is no currently available path to handle I/O (for example, while a controller is resetting/connecting or due to a transient link failure), incoming I/Os are added to the requeue list. Currently, there is no visibility into how many I/Os have been requeued in this situation. Add a new ns-head sysfs counter io_requeue_no_usable_path_count, under diag attribute group to expose the number of I/Os that were requeued due to the absence of an available path. This counter is also writable thus allowing user to reset it, if needed. This statistic can help users understand I/O slowdowns or stalls caused by temporary path unavailability, and can be consumed by monitoring tools such as nvme-top for real-time observability. Tested-by: Venkat Rao Bagalkote <venkat88@linux.ibm.com> Signed-off-by: Nilay Shroff <nilay@linux.ibm.com> Signed-off-by: Keith Busch <kbusch@kernel.org>
12 daysnvme: export command error counters via sysfsNilay Shroff3-1/+77
When an NVMe command completes with an error status, the driver logs the error to the kernel log. However, these messages may be lost or overwritten over time since dmesg is a circular buffer. Expose per-path and ctrl sysfs attribute command_error_count, under diag attribute group to provide persistent visibility into error occurrences. This allows users to observe the total number of commands that have failed on a given path over time, which can be useful for diagnosing path health and stability. This attribute is both readable and writable thus allowing user to reset these counters. These counters can also be consumed by observability tools such as nvme-top to provide additional insight into NVMe error behavior. Tested-by: Venkat Rao Bagalkote <venkat88@linux.ibm.com> Signed-off-by: Nilay Shroff <nilay@linux.ibm.com> Signed-off-by: Keith Busch <kbusch@kernel.org>
12 daysnvme: export multipath failover count via sysfsNilay Shroff3-1/+38
When an NVMe command completes with a path-specific error, the NVMe driver may retry the command on an alternate controller or path if one is available. These failover events indicate that I/O was redirected away from the original path. Currently, the number of times requests are failed over to another available path is not visible to userspace. Exposing this information can be useful for diagnosing path health and stability. Export per-path sysfs attribute "multipath_failover_count" under diag attribute group. This attribute is both readable and writable and thus allowing user to reset the counter. This counter can be consumed by monitoring tools such as nvme-top to help identify paths that consistently trigger failovers under load. Tested-by: Venkat Rao Bagalkote <venkat88@linux.ibm.com> Signed-off-by: Nilay Shroff <nilay@linux.ibm.com> Signed-off-by: Keith Busch <kbusch@kernel.org>
12 daysnvme: export command retry count via sysfsNilay Shroff3-0/+38
When Advanced Command Retry Enable (ACRE) is configured, a controller may interrupt command execution and return a completion status indicating command interrupted with the DNR bit cleared. In this case, the driver retries the command based on the Command Retry Delay (CRD) value provided in the completion status. Currently, these command retries are handled entirely within the NVMe driver and are not visible to userspace. As a result, there is no observability into retry behavior, which can be a useful diagnostic signal. Expose a per-namespace sysfs attribute command_retries_count, under diag attribute group to provide visibility into retry activity. This information can help identify controller-side congestion under load and enables comparison across paths in multipath setups (for example, detecting cases where one path experiences significantly more retries than another under identical workloads). This exported metric is intended for diagnostics and monitoring tools such as nvme-top, and does not change command retry behavior. A new sysfs attribute named "command_retries_count" is added for this purpose. This attribute is both readable as well as writable. So user could reset this counter if needed. Tested-by: Venkat Rao Bagalkote <venkat88@linux.ibm.com> Signed-off-by: Nilay Shroff <nilay@linux.ibm.com> Signed-off-by: Keith Busch <kbusch@kernel.org>
12 daysnvme: add diag attribute group under sysfsNilay Shroff3-0/+37
Add a new diag attribute group under: /sys/class/nvme/<ctrl>/ /sys/block/<nvme-path-dev>/ /sys/block/<ns-head-dev>/ This new sysfs attribute group will be used to organize NVMe diagnostic and telemetry-related counters under it. Tested-by: Venkat Rao Bagalkote <venkat88@linux.ibm.com> Signed-off-by: Nilay Shroff <nilay@linux.ibm.com> Signed-off-by: Keith Busch <kbusch@kernel.org>
12 daysnvme-tcp: lockdep: use dynamic lockdep keys per socket instanceShin'ichiro Kawasaki1-12/+26
When NVMe-TCP controller setup and teardown are repeated with lockdep enabled, lockdep reports false positives WARN for the following locks: 1) &q->elevator_lock : IO scheduler change context 2) &q->q_usage_counter(io) : SCSI disk probe context 3) fs_reclaim : CPU hotplug bring-up context 4) cpu_hotplug_lock : socket establishment context 5) sk_lock-AF_INET-NVME : MQ sched dispatch context for the socket 6) set->srcu : NVMe controller delete context The lockdep WARN was observed by running blktests test case nvme/005 for tcp transport on v7.1-rc1 kernel with a patch. Refer to the Link tag for the details of the WARN. This is a false positive because lockdep confuses lock 4) (socket establishment) with lock 5) (socket in use) for different socket instances. The locks belong to different sockets, but lockdep treats them as the same due to shared static lockdep keys. Fix this by using dynamically allocated lockdep keys per socket instance instead of static keys nvme_tcp_sk_key[] and nvme_tcp_slock_key[]. Add nvme_tcp_sk_key and nvme_tcp_slock_key fields to struct nvme_tcp_queue and pass them to sock_lock_init_class_and_name() for proper lockdep tracking. Change the argument of nvme_tcp_reclassify_socket() from 'struct socket *' to 'struct nvme_tcp_queue *' to pass both the socket and the keys. Add CONFIG_DEBUG_LOCK_ALLOC guards to nvme_tcp_alloc_queue() and nvme_tcp_free_queue() to register and unregister the dynamic keys. Additionally, move nvme_tcp_reclassify_socket() inside these guards since it's only needed when lockdep is enabled. Link: https://lore.kernel.org/linux-nvme/afB5syZbUrppgsDQ@shinmob/ Suggested-by: Nilay Shroff <nilay@linux.ibm.com> Reviewed-by: Nilay Shroff <nilay@linux.ibm.com> Signed-off-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com> Signed-off-by: Keith Busch <kbusch@kernel.org>
13 daysnvme-tcp: move nvme_tcp_reclassify_socket()Shin'ichiro Kawasaki1-38/+38
Move nvme_tcp_reclassify_socket() in tcp.c after the struct nvme_tcp_queue definition. This is preparation for adding a reference to struct nvme_tcp_queue in the function, which would otherwise cause a compile failure due to the struct being defined after the function. Move the entire CONFIG_DEBUG_LOCK_ALLOC block along with the function to maintain the code organization. Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Nilay Shroff <nilay@linux.ibm.com> Signed-off-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com> Signed-off-by: Keith Busch <kbusch@kernel.org>
13 daysnvme: validate FDP configuration descriptor sizesliuxixin1-4/+6
Validate descriptor sizes while walking the FDP configurations log so dsze == 0 or a descriptor past the log end cannot cause unbounded iteration or reads past the buffer. Reviewed-by: Nitesh Shetty <nj.shetty@samsung.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: liuxixin <gliuxen@gmail.com> Signed-off-by: Keith Busch <kbusch@kernel.org>
13 daysnvmet-auth: validate reply message payload bounds against transfer lengthTianchu Chen1-3/+12
nvmet_auth_reply() accesses the variable-length rval[] array using attacker-controlled hl (hash length) and dhvlen (DH value length) fields without verifying they fit within the allocated buffer of tl bytes. A malicious NVMe-oF initiator can craft a DHCHAP_REPLY message with a small transfer length but large hl/dhvlen values, causing out-of-bounds heap reads when the target processes the DH public key (rval + 2*hl) or performs the host response memcmp. With DH authentication configured, the OOB pointer is passed directly to sg_init_one() and read by crypto_kpp_compute_shared_secret(), reaching up to 526 bytes past the buffer. This is exploitable pre-authentication. Add bounds validation ensuring sizeof(*data) + 2*hl + dhvlen <= tl before any access to the variable-length fields. Discovered by Atuin - Automated Vulnerability Discovery Engine. Fixes: db1312dd9548 ("nvmet: implement basic In-Band Authentication") Cc: stable@vger.kernel.org Reviewed-by: Hannes Reinecke <hare@kernel.org> Signed-off-by: Tianchu Chen <flynnnchen@tencent.com> Signed-off-by: Keith Busch <kbusch@kernel.org>
2026-06-02nvme: refresh multipath head zoned limits from path limitsYao Sang1-0/+10
queue_limits_stack_bdev() updates the multipath head limits from the path queue, but it does not propagate max_open_zones or max_active_zones. As a result, a zoned multipath namespace head can keep stale 0/0 values even after a ready path reports finite zoned resource limits. When refreshing the head limits in nvme_update_ns_info(), stack the zoned resource limits directly after stacking the path queue limits. Use min_not_zero() so the block layer's 0 value keeps its "no limit" meaning while finite limits are combined conservatively. This avoids advertising "no limit" on the multipath head while keeping the zoned-limit handling local to the NVMe multipath update path. Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Yao Sang <sangyao@kylinos.cn> Signed-off-by: Keith Busch <kbusch@kernel.org>
2026-06-02nvme: fix FDP fdpcidx bounds checkliuxixin1-1/+1
The fdpcidx bounds check sets n = NUMFDPC + 1 but used > instead of >=, incorrectly accepting fdp_idx when it equals n (i.e. NUMFDPC + 1). Fixes: 30b5f20bb2dd ("nvme: register fdp parameters with the block layer") Reviewed-by: Nitesh Shetty <nj.shetty@samsung.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: liuxixin <gliuxen@gmail.com> Signed-off-by: Keith Busch <kbusch@kernel.org>
2026-06-02nvme-tcp: Use WQ_PERCPU explicitly if wq_unbound is false.Kuniyuki Iwashima1-0/+2
Since commit 21c05ca88a54 ("workqueue: Add warnings and ensure one among WQ_PERCPU or WQ_UNBOUND is present"), we must explicitly set WQ_PERCPU or WQ_UNBOUND when creating workqueue. nvme_tcp_init_module() sets WQ_UNBOUND when the module param wq_unbound is set, but otherwise, WQ_PERCPU is missing, triggering the warning below: workqueue: nvme_tcp_wq is using neither WQ_PERCPU or WQ_UNBOUND. Setting WQ_PERCPU. WARNING: kernel/workqueue.c:5856 at __alloc_workqueue+0x1d02/0x2070 kernel/workqueue.c:5855, CPU#0: swapper/0/1 Let's set WQ_PERCPU if wq_unbound is false. Reported-by: syzbot+d078cba4418e65f61984@syzkaller.appspotmail.com Closes: https://lore.kernel.org/all/6a1a9a86.323e8352.141b09.0001.GAE@google.com/ Tested-by: Venkat Rao Bagalkote <venkat88@linux.ibm.com> Reviewed-by: Nilay Shroff <nilay@linux.ibm.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com> Signed-off-by: Keith Busch <kbusch@kernel.org>
2026-06-02nvmet: fix pre-auth out-of-bounds heap read in Discovery Get Log PageBryam Vargas1-1/+22
nvmet_execute_disc_get_log_page() validates only the dword alignment of the host-supplied Log Page Offset (lpo). The 64-bit offset is then added to a small kzalloc'd buffer that holds the discovery log page and the result is passed straight to nvmet_copy_to_sgl(), which memcpy()s data_len bytes out to the host with no source-side bound check: u64 offset = nvmet_get_log_page_offset(req->cmd); /* 64-bit host */ size_t data_len = nvmet_get_log_page_len(req->cmd); /* 32-bit host */ ... if (offset & 0x3) { ... } /* only check */ ... alloc_len = sizeof(*hdr) + entry_size * discovery_log_entries(req); buffer = kzalloc(alloc_len, GFP_KERNEL); ... status = nvmet_copy_to_sgl(req, 0, buffer + offset, data_len); The Discovery controller is unauthenticated -- nvmet_host_allowed() returns true unconditionally for the discovery subsystem -- so the call is reachable pre-authentication by any TCP/RDMA/FC peer that can reach the nvmet target. With a discovery log page of ~1 KiB, an attacker requesting up to 4 KiB starting at offset == alloc_len reads the next slab page out and gets its content returned over the fabric (an empirical run on a default nvmet-tcp loopback target leaked 81 canonical kernel pointers in one Get Log Page response). Pointing the offset at unmapped kernel memory faults the in-kernel memcpy and crashes (or panics, on panic_on_oops=1) the target host instead. The attacker-controlled source-side offset pattern "nvmet_copy_to_sgl(req, 0, buffer + ATTACKER_OFFSET, ...)" is unique to nvmet_execute_disc_get_log_page in the entire nvmet codebase: every other Get Log Page handler in admin-cmd.c either ignores lpo (and silently starts every response at offset 0) or tracks a local destination offset with a fixed source pointer. Validate the host-supplied offset against the log page size, cap the copy length to what is actually available, and zero-fill any remainder of the host transfer buffer. The zero-fill matches the existing short-response pattern in nvmet_execute_get_log_changed_ns() (admin-cmd.c) and prevents leaking transport SGL contents when the host asks for more bytes than the log page contains. Fixes: a07b4970f464 ("nvmet: add a generic NVMe target") Cc: stable@vger.kernel.org Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Bryam Vargas <hexlabsecurity@proton.me> Signed-off-by: Keith Busch <kbusch@kernel.org>
2026-06-02nvme-multipath: set BIO_REMAPPED on bios remapped to per-path namespace disksAchkinazi, Igor1-0/+6
When nvme_ns_head_submit_bio() remaps a bio from the multipath head to a per-path namespace, bio_set_dev() clears BIO_REMAPPED. The remapped bio is then resubmitted through submit_bio_noacct() which calls bio_check_eod() because BIO_REMAPPED is not set. This races with nvme_ns_remove() which zeroes the per-path capacity before synchronize_srcu(): CPU 0 (IO submission) --------------------- srcu_read_lock() nvme_find_path() -> ns [NVME_NS_READY is set] CPU 1 (namespace removal) ------------------------- clear_bit(NVME_NS_READY) set_capacity(ns->disk, 0) synchronize_srcu() <- blocks CPU 0 (IO submission) --------------------- bio_set_dev(bio, ns->disk->part0) [clears BIO_REMAPPED] submit_bio_noacct(bio) -> bio_check_eod() sees capacity=0 -> bio fails with IO error The SRCU read lock prevents synchronize_srcu() from completing, but does not prevent set_capacity(0) from executing. The bio fails the EOD check before it reaches the NVMe driver, so nvme_failover_req() never gets a chance to redirect it to another path of multipath. IO errors are reported to the application despite another path being available. On older kernels (before commit 0b64682e78f7 "block: skip unnecessary checks for split bio"), the same race was also reachable through split remainders resubmitted via submit_bio_noacct(). Fix this by setting BIO_REMAPPED after bio_set_dev() in nvme_ns_head_submit_bio(). This skips bio_check_eod() on the per-path device; the EOD check already passed on the multipath head. NVMe per-path namespace devices are always whole disks (bd_partno=0), so the blk_partition_remap() skip also gated by BIO_REMAPPED is a no-op. The flag does not persist across failover and cannot go stale if the namespace geometry changes between attempts: nvme_failover_req() calls bio_set_dev() to redirect the bio back to the multipath head, which clears BIO_REMAPPED. When nvme_requeue_work() resubmits through submit_bio_noacct(), bio_check_eod() runs normally against the current capacity. Same approach as commit 3a905c37c351 ("block: skip bio_check_eod for partition-remapped bios"). Fixes: a7c7f7b2b641 ("nvme: use bio_set_dev to assign ->bi_bdev") Cc: stable@vger.kernel.org Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Igor Achkinazi <igor.achkinazi@dell.com> Signed-off-by: Keith Busch <kbusch@kernel.org>
2026-06-02nvme-multipath: require exact iopolicy names for module parameterliyouhong1-16/+24
The iopolicy module parameter uses strncmp prefix matching, so values like "numax" are accepted as "numa". The per-subsystem sysfs attribute already requires an exact match via sysfs_streq(). Parse both through a shared helper so invalid values are rejected consistently. Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: liyouhong <liyouhong@kylinos.cn> Signed-off-by: Keith Busch <kbusch@kernel.org>
2026-06-02nvme-multipath: pass NS head to nvme_mpath_revalidate_paths()John Garry3-5/+5
In nvme_mpath_revalidate_paths(), we are passed a NS pointer and use that to lookup the NS head and then use that same NS pointer as an iter variable. It makes more sense pass the NS head and use a local variable for the NS iter. Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: John Garry <john.g.garry@oracle.com> Signed-off-by: Keith Busch <kbusch@kernel.org>
2026-05-28Merge tag 'net-7.1-rc6' of ↵Linus Torvalds1-1/+1
git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net Pull networking fixes from Paolo Abeni: "This is again significantly bigger than the same point into the previous cycle, but at least smaller than last week. I'm not aware of any pending regression for the current cycle. Including fixes from netfilter. Current release - regressions: - netfilter: walk fib6_siblings under RCU Previous releases - regressions: - netlink: fix sending unassigned nsid after assigned one - bridge: fix sleep in atomic context in netlink path - sched: fix ethx:ingress -> ethy:egress -> ethx:ingress mirred loop - ipv4: fix net->ipv4.sysctl_local_reserved_ports UaF - eth: tun: free page on short-frame rejection in tun_xdp_one() Previous releases - always broken: - skbuff: fix missing zerocopy reference in pskb_carve helpers - handshake: drain pending requests at net namespace exit - ethtool: - rss: avoid modifying the RSS context response - module: avoid leaking a netdev ref on module flash errors - coalesce: cap profile updates at NET_DIM_PARAMS_NUM_PROFILES - netfilter: fix dst corruption in same register operation - nfc: hci: fix out-of-bounds read in HCP header parsing - ipv6: exthdrs: refresh nh pointer after ipv6_hop_jumbo() - eth: - vti: use ip6_tnl.net in vti6_changelink(). - vxlan: do not reuse cached ip_hdr() value after skb_tunnel_check_pmtu()" * tag 'net-7.1-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (94 commits) dpll: zl3073x: make frequency monitor a per-device attribute dpll: zl3073x: use __dpll_device_change_ntf() and remove change_work dpll: export __dpll_device_change_ntf() for use under dpll_lock net/handshake: Drain pending requests at net namespace exit net/handshake: Verify file-reference balance in submit paths net/handshake: Close the submit-side sock_hold race net/handshake: hand off the pinned file reference to accept_doit net/handshake: Take a long-lived file reference at submit net/handshake: Pass negative errno through handshake_complete() nvme-tcp: store negative errno in queue->tls_err net/handshake: Use spin_lock_bh for hn_lock net: skbuff: fix missing zerocopy reference in pskb_carve helpers net: hibmcge: move dma_rmb() after dma_sync_single_for_cpu() in RX path net: hibmcge: disable Relaxed Ordering to fix RX packet corruption selftests/tc-testing: Add netem test case exercising loops selftests/tc-testing: Add mirred test cases exercising loops net/sched: act_mirred: Fix return code in early mirred redirect error paths net/sched: act_mirred: Fix blockcast recursion bypass leading to stack overflow net/sched: Fix ethx:ingress -> ethy:egress -> ethx:ingress mirred loop net/sched: fix packet loop on netem when duplicate is on ...
2026-05-28nvme: add support multipath passthrough iostatsKeith Busch2-1/+13
Don't skip the io accounting for passthrough commands if the user enabled tracking these. Reviewed-by: Nilay Shroff <nilay@linux.ibm.com> Reviewed-by: Nitesh Shetty <nj.shetty@samsung.com> Signed-off-by: Keith Busch <kbusch@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://patch.msgid.link/20260528010041.1533124-3-kbusch@meta.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-05-28nvme-tcp: cleanup nvme_tcp_init_iterChristoph Hellwig1-17/+10
Split the two init cases based on code in the zloop driver. This simplifies the code and makes it easier to follow. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Ming Lei <tom.leiming@gmail.com> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Reviewed-by: Keith Busch <kbusch@kernel.org> Link: https://patch.msgid.link/20260527151043.2349900-3-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-05-28nvme-tcp: store negative errno in queue->tls_errChuck Lever1-1/+1
nvme_tcp_tls_done() assigns queue->tls_err in three branches. The ENOKEY lookup failure and the EOPNOTSUPP initializer both store negative errnos. The third branch, reached when the handshake layer reports a non-zero status, stores -status. The handshake layer delivers status to the consumer callback as a negative errno; the other in-tree consumers -- xs_tls_handshake_done() and the nvmet target callback -- treat their status argument that way. The extra negation in nvme_tcp_tls_done() flips the sign, leaving tls_err as a positive value (for instance, +EIO), which nvme_tcp_start_tls() then returns to its caller. Drop the extra negation so queue->tls_err uniformly carries a negative errno on failure. Fixes: be8e82caa685 ("nvme-tcp: enable TLS handshake upcall") Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Reviewed-by: Hannes Reinecke <hare@kernel.org> Reviewed-by: Alistair Francis <alistair.francis@wdc.com> Link: https://patch.msgid.link/20260525-handshake-file-pin-v3-2-66c616906ead@oracle.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2026-05-27nvme-pci: fix out-of-bounds access in nvme_setup_descriptor_poolsMateusz Nowicki1-2/+7
nvme_setup_descriptor_pools() indexes dev->descriptor_pools[] using the numa_node forwarded from hctx->numa_node by its single caller, nvme_init_hctx_common(). On a non-NUMA kernel hctx->numa_node is NUMA_NO_NODE (-1). Because the parameter was declared 'unsigned', the value becomes UINT_MAX and the index walks off the array (sized to nr_node_ids), faulting during nvme_alloc_ns() and leaving the namespace without a /dev node. Reproduces on any NVMe controller probed by a CONFIG_NUMA=n kernel: BUG: unable to handle page fault for address: ffff889101603d38 RIP: 0010:nvme_init_hctx_common+0x5a/0x190 [nvme] Call Trace: nvme_init_hctx+0x10/0x20 [nvme] nvme_alloc_ns+0x9e/0xa10 [nvme_core] nvme_scan_ns+0x301/0x3b0 [nvme_core] nvme_scan_ns_async+0x23/0x30 [nvme_core] Switch the parameter to int and fall back to node 0 when it is NUMA_NO_NODE; node 0 is always present. Fixes: d977506f8863 ("nvme-pci: make PRP list DMA pools per-NUMA-node") Link: https://lore.kernel.org/r/20260309062840.2937858-2-iam@sung-woo.kim Reported-by: Sung-woo Kim <iam@sung-woo.kim> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Mateusz Nowicki <mateusz.nowicki@posteo.net> Signed-off-by: Keith Busch <kbusch@kernel.org>
2026-05-27nvme: target: rdma: fix ndev refcount leak on queue connectWentao Liang1-2/+4
nvmet_rdma_queue_connect() calls nvmet_rdma_find_get_device() which acquires a reference on the returned ndev via kref_get(). On the path where the host queue backlog is exceeded and the function returns NVME_SC_CONNECT_CTRL_BUSY, reference of ndev is not released, leaking the kref. Fix this by adding a goto to the existing put_device label before the early return. Fixes: 31deaeb11ba7 ("nvmet-rdma: avoid circular locking dependency on install_queue()") Cc: stable@vger.kernel.org Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Wentao Liang <vulab@iscas.ac.cn> Signed-off-by: Keith Busch <kbusch@kernel.org>
2026-05-27nvme-multipath: fix flex array size in struct nvme_ns_headNilay Shroff1-1/+1
struct nvme_ns_head contains a flexible array member, current_path[], which is indexed using the NUMA node ID: head->current_path[numa_node_id()] The structure is currently allocated as: size = sizeof(struct nvme_ns_head) + (num_possible_nodes() * sizeof(struct nvme_ns *)); head = kzalloc(size, GFP_KERNEL); This allocation assumes that NUMA node IDs are sequential and densely packed from 0 .. num_possible_nodes() - 1. While this assumption holds on many systems, it is not always true on some architectures such as powerpc. On some powerpc systems, NUMA node IDs can be sparse. For example: NUMA: NUMA node(s): 6 NUMA node0 CPU(s): 80-159 NUMA node8 CPU(s): 0-79 NUMA node252 CPU(s): NUMA node253 CPU(s): NUMA node254 CPU(s): NUMA node255 CPU(s): That is, the possible/online NUMA node IDs are: 0, 8, 252, 253, 254, 255 In this case: num_possible_nodes() = 6 So memory is allocated for only 6 entries in current_path[]. However, the array is later indexed using the actual NUMA node ID. As a result, accesses such as: head->current_path[8] or head->current_path[252] goes out of bounds, leading to the following KASAN splat: ================================================================== BUG: KASAN: slab-out-of-bounds in nvme_mpath_revalidate_paths+0x22c/0x290 [nvme_core] Write of size 8 at addr c00020003bda35b8 by task kworker/u641:2/1997 CPU: 1 UID: 0 PID: 1997 Comm: kworker/u641:2 Not tainted 7.1.0-rc5-dirty #14 PREEMPT(lazy) Hardware name: 8335-GTH POWER9 0x4e1202 opal:skiboot-v6.5.3-35-g1851b2a06 PowerNV Workqueue: async async_run_entry_fn Call Trace: [c000200037fa7510] [c0000000021c23d4] dump_stack_lvl+0x88/0xdc (unreliable) [c000200037fa7540] [c0000000009fda90] print_report+0x22c/0x67c [c000200037fa7630] [c0000000009fd508] kasan_report+0x108/0x220 [c000200037fa7740] [c0000000009fff48] __asan_store8+0xe8/0x120 [c000200037fa7760] [c008000018e76474] nvme_mpath_revalidate_paths+0x22c/0x290 [nvme_core] [c000200037fa7800] [c008000018e6556c] nvme_update_ns_info+0x4a4/0x5e0 [nvme_core] [c000200037fa7a50] [c008000018e66270] nvme_alloc_ns+0x6d8/0x1a70 [nvme_core] [c000200037fa7c20] [c008000018e679fc] nvme_scan_ns+0x3f4/0x630 [nvme_core] [c000200037fa7d10] [c00000000031f22c] async_run_entry_fn+0x9c/0x3a0 [c000200037fa7db0] [c0000000002fa544] process_one_work+0x414/0xa10 [c000200037fa7ec0] [c0000000002fbf00] worker_thread+0x320/0x640 [c000200037fa7f80] [c00000000030d0f8] kthread+0x278/0x290 [c000200037fa7fe0] [c00000000000ded8] start_kernel_thread+0x14/0x18 Allocated by task 1997 on cpu 1 at 35.928317s: The buggy address belongs to the object at c00020003bda3000 which belongs to the cache kmalloc-rnd-15-2k of size 2048 The buggy address is located 16 bytes to the right of allocated 1448-byte region [c00020003bda3000, c00020003bda35a8) The buggy address belongs to the physical page: Memory state around the buggy address: c00020003bda3480: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 c00020003bda3500: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 >c00020003bda3580: 00 00 00 00 00 fc fc fc fc fc fc fc fc fc fc fc ^ c00020003bda3600: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc c00020003bda3680: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc ================================================================== Fix this by allocating the flexible array using nr_node_ids instead of num_possible_nodes(). Since nr_node_ids represents the maximum possible NUMA node IDs, indexing current_path[] using numa_node_id() becomes safe even on systems with sparse node IDs. Fixes: f333444708f8 ("nvme: take node locality into account when selecting a path") Tested-by: Mukesh Kumar Chaurasiya (IBM) <mkchauras@gmail.com> Reviewed-by: Mukesh Kumar Chaurasiya (IBM) <mkchauras@gmail.com> Reviewed-by: Hannes Reinecke <hare@kernel.org> Reviewed-by: John Garry <john.g.garry@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Nilay Shroff <nilay@linux.ibm.com> Signed-off-by: Keith Busch <kbusch@kernel.org>
2026-05-27nvme: use DEFINE_SIMPLE_SYSFS_GROUP_VISIBLE for multipath_sysfsJohn Garry1-8/+1
Use DEFINE_SIMPLE_SYSFS_GROUP_VISIBLE instead of DEFINE_SYSFS_GROUP_VISIBLE, which means that we can drop multipath_sysfs_attr_visible(). Incidentally, multipath_sysfs_attr_visible() should have returned a umode_t. This idea was suggested by Ben Marzinski elsewhere. Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: John Garry <john.g.garry@oracle.com> Signed-off-by: Keith Busch <kbusch@kernel.org>
2026-05-27nvmet-tcp: check return value of nvmet_tcp_set_queue_sockGeliang Tang1-2/+3
The return value of nvmet_tcp_set_queue_sock() is currently ignored in nvmet_tcp_tls_handshake_done(). If it fails (e.g., due to the socket not being in TCP_ESTABLISHED state), the socket callbacks will not be properly set, leading to queue and socket leakage. Fix this by capturing the return value and calling nvmet_tcp_schedule_release_queue() on failure to ensure proper cleanup. Fixes: 675b453e0241 ("nvmet-tcp: enable TLS handshake upcall") Reviewed-by: Hannes Reinecke <hare@kernel.org> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Signed-off-by: Geliang Tang <tanggeliang@kylinos.cn> Signed-off-by: Keith Busch <kbusch@kernel.org>
2026-05-27nvmet-tcp: fix page fragment cache leak in error pathGeliang Tang1-0/+6
In nvmet_tcp_alloc_queue(), when a connection is closed during the allocation process (e.g., nvmet_tcp_set_queue_sock() returns -ENOTCONN), the error handling jumps to out_destroy_sq and then to out_ida_remove without draining the page fragment cache. Although nvmet_tcp_free_cmd() is called in some error paths to release individual page fragments, the underlying page cache reference held by queue->pf_cache is never released. The first allocation using pf_cache is the call to nvmet_tcp_alloc_cmd() for queue->connect, which happens after ida_alloc() returns successfully. This results in a page leak each time a connection fails during allocation, which could lead to memory exhaustion over time if connections are repeatedly opened and closed. Fix this by calling page_frag_cache_drain() before freeing the queue structure in the out_ida_remove label. Fixes: 872d26a391da ("nvmet-tcp: add NVMe over TCP target driver") Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Geliang Tang <tanggeliang@kylinos.cn> Signed-off-by: Keith Busch <kbusch@kernel.org>
2026-05-27nvme-core: fix unsigned comparison warning in nvme_wait_freeze_timeoutMaurizio Lombardi1-1/+1
The timeout variable in nvme_wait_freeze_timeout() is an unsigned type. Checking if it is <= 0 triggers a compiler warning because an unsigned variable can never be negative. Fix this warning by changing the type to long. Reported-by: kernel test robot <lkp@intel.com> Reported-by: Dan Carpenter <error27@gmail.com> Closes: https://lore.kernel.org/r/202605211257.STzj2Ujv-lkp@intel.com/ Fixes: 23b6d2cbf75f ("nvme: remove redundant timeout argument from nvme_wait_freeze_timeout") Signed-off-by: Maurizio Lombardi <mlombard@redhat.com> Signed-off-by: Keith Busch <kbusch@kernel.org>
2026-05-26nvme-multipath: enable PCI P2PDMA for multipath devicesKiran Kumar Modukuri1-1/+1
NVMe multipath does not expose BLK_FEAT_PCI_P2PDMA on the head disk even when all underlying controllers support it. Set BLK_FEAT_PCI_P2PDMA unconditionally in nvme_mpath_alloc_disk() alongside the other features. nvme_update_ns_info_block() already calls queue_limits_stack_bdev() to stack each path's limits onto the head disk, which routes through blk_stack_limits(). The core now clears BLK_FEAT_PCI_P2PDMA automatically if any path (e.g., FC) does not support it, consistent with how BLK_FEAT_NOWAIT and BLK_FEAT_POLL are handled. Tested-by: Pranjal Shrivastava <praan@google.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Nitesh Shetty <nj.shetty@samsung.com> Signed-off-by: Kiran Kumar Modukuri <kmodukuri@nvidia.com> Signed-off-by: Chaitanya Kulkarni <kch@nvidia.com> Tested=by: Pranjal Shrivastava <praan@google.com> Link: https://patch.msgid.link/20260513185153.95552-4-kch@nvidia.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-05-26block: switch numa_node to int in blk_mq_hw_ctx and init_requestMateusz Nowicki6-6/+6
numa_node in blk_mq_hw_ctx and the matching argument of blk_mq_ops::init_request can be NUMA_NO_NODE (-1). Declared as unsigned int, NUMA_NO_NODE becomes UINT_MAX and walks off nvme_dev::descriptor_pools[] on CONFIG_NUMA=n [1]. Switch the field and the callback prototype to int and update all in-tree init_request implementations. No functional change: cpu_to_node(), kmalloc_node() and blk_alloc_flush_queue() already take int. Link: https://lore.kernel.org/linux-nvme/20260522150628.399288-1-mateusz.nowicki@posteo.net/ [1] Link: https://lore.kernel.org/linux-nvme/20260309062840.2937858-2-iam@sung-woo.kim/ Suggested-by: Caleb Sander Mateos <csander@purestorage.com> Suggested-by: Sung-woo Kim <iam@sung-woo.kim> Signed-off-by: Mateusz Nowicki <mateusz.nowicki@posteo.net> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://patch.msgid.link/20260523125210.272274-1-mateusz.nowicki@posteo.net Signed-off-by: Jens Axboe <axboe@kernel.dk>
2026-05-21nvme-pci: fix dma mapping leak on data setup errorKeith Busch1-3/+28
We're leaking the initial DMA mapping during iteration if we fail to allocate the tracking descriptor for both PRP and SGL. Unmap the iterator directly; we can't use the existing unmap helper because it depends on the tracking descriptor being successfully allocated, so a new one for an in-use iterator is provided. The mappings were also leaking when the driver detects an invalid bio_vec when mapping PRPs, so fix that too. Fixes: b8b7570a7ec87 ("nvme-pci: fix dma unmapping when using PRPs and not using the IOVA mapping") Fixes: 7ce3c1dd78fca ("nvme-pci: convert the data mapping to blk_rq_dma_map") Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Keith Busch <kbusch@kernel.org>
2026-05-21nvme-pci: fix dma_vecs leak on p2p memoryKeith Busch1-1/+2
We don't unmap P2P memory, so we don't need to track it. The dma_vec allocation was getting leaked on the completion. Fixes: b8b7570a7ec87 ("nvme-pci: fix dma unmapping when using PRPs and not using the IOVA mapping") Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Keith Busch <kbusch@kernel.org>
2026-05-20nvme: core: reject invalid LBA data size from Identify NamespaceChao Shi1-1/+11
nvme_update_ns_info_block() trusts id->lbaf[lbaf].ds from the controller and assigns it directly to ns->head->lba_shift without bounds checking. nvme_lba_to_sect() then does: return lba << (head->lba_shift - SECTOR_SHIFT); When called with lba = le64_to_cpu(id->nsze) to compute the device capacity, an attacker-controlled controller can choose ds < 9 or a combination of (ds, nsze) that makes the left shift overflow sector_t. The former is a C undefined behaviour that UBSAN reports as a BUG; the latter silently yields a bogus capacity that the block layer then trusts for bounds checking. Validate ds against SECTOR_SHIFT and use check_shl_overflow() to compute capacity so that any (ds, nsze) combination that would overflow sector_t is rejected. The namespace is skipped with -ENODEV instead of crashing the kernel. This is reachable by a malicious NVMe device, a buggy firmware, or an attacker-controlled NVMe-oF target. The check is performed before queue_limits_start_update() and blk_mq_freeze_queue(), so the error path is a plain `goto out` with no cleanup needed. Stack trace (UBSAN, ds < 9 variant): RIP: nvme_lba_to_sect drivers/nvme/host/nvme.h:699 [inline] RIP: nvme_update_ns_info_block.cold+0x5/0x7 Call Trace: nvme_update_ns_info+0x175/0xd90 drivers/nvme/host/core.c:2467 nvme_validate_ns drivers/nvme/host/core.c:4299 [inline] nvme_scan_ns drivers/nvme/host/core.c:4350 nvme_scan_ns_async+0xa5/0xe0 drivers/nvme/host/core.c:4383 async_run_entry_fn process_one_work worker_thread kthread Found by Syzkaller. Acked-by: Sungwoo Kim <iam@sung-woo.kim> Acked-by: Dave Tian <daveti@purdue.edu> Acked-by: Weidong Zhu <weizhu@fiu.edu> Signed-off-by: Chao Shi <coshi036@gmail.com> Signed-off-by: Keith Busch <kbusch@kernel.org>
2026-05-20