git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux
Pull block updates from Jens Axboe:
- NVMe pull request via Keith:
    - Per-controller admin and IO timeout sysfs attributes, and
      letting the block layer set request timeouts (Maurizio,
      Maximilian)
    - Multipath passthrough iostats, and PCI P2PDMA enablement for
      multipath devices (Keith, Kiran)
    - A new diag sysfs attribute group exporting per-controller
      counters (retries, multipath failover, error counters, requeue
      and failure counts, reset and reconnect events) (Nilay)
    - FDP configuration validation and bounds check fixes (liuxixin)
    - Various nvmet fixes, including a pre-auth out-of-bounds read in
      the Discovery Get Log Page handler, auth payload bounds
      validation, and tcp error-path leak fixes (Bryam, Tianchu,
      Geliang)
    - nvme-tcp lockdep and workqueue fixes (Shin'ichiro, Kuniyuki,
      Eric)
    - Assorted other fixes and cleanups (John, Yao, Chao, Mateusz,
      Achkinazi, Wentao)
- MD pull request via Yu Kuai:
    - raid1/raid10 fixes for a deadlock in the read error recovery
      path, error-path detection and bio accounting with cloned bios,
      and an nr_pending leak in the REQ_ATOMIC bad-block error path
      (Abd-Alrhman)
    - PCI P2PDMA propagation from member devices to the RAID device
      (Kiran)
    - dm-raid bio requeue fix, and various smaller fixes and cleanups
      (Benjamin, Chen, Li, Thorsten)
- Enable Clang lock context analysis for the block layer, with the
accompanying annotations across queue limits, the blk_holder_ops
callbacks, crypto, cgroup, iocost, kyber and mq-deadline (Bart)
- Block status code infrastructure work: a tagged status table, a
str_to_blk_op() helper, a bio_endio_status() helper, and on top of
that a new configurable block-layer error injection facility
(Christoph)
- DRBD netlink rework, replacing the genl_magic machinery with explicit
netlink serialization and moving the DRBD UAPI headers to
include/uapi/linux/ (Christoph Böhmwalder)
- bvec improvements: a bvec_folio() helper and making the bvec_iter
helpers proper inline functions (Willy, Christoph)
- ublk cleanups and a canceling-flag fix for the disk-not-allocated
case (Caleb, Ming)
- Partition handling fixes: bound the AIX pp_count scan, fix an of_node
refcount leak, and replace __get_free_page() with kmalloc() (Bryam,
Wentao, Mike)
- Convert numa_node to int in blk_mq_hw_ctx and ->init_request, and add
WQ_PERCPU to the block workqueue users (Mateusz, Marco)
- Block statistics and tracing: propagate in-flight to the whole disk
on partition IO, export passthrough stats, and a new
block_rq_tag_wait tracepoint (Tang, Keith, Aaron)
- A round of removals, unexports and cleanups across bio, direct-io and
the bvec helpers (Christoph)
- Various driver fixes (mtip32xx use-after-free, rbd snap_count
validation and strscpy conversion, nbd socket lockdep reclassify,
virtio-blk zone report clamp, floppy) and a batch of MAINTAINERS
email/list updates (Coly, Li, Yu, Christoph Böhmwalder)
- Other little fixes and cleanups all over
* tag 'for-7.2/block-20260615' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux: (117 commits)
MAINTAINERS: Update Coly Li's email address
block: check bio split for unaligned bvec
nbd: Reclassify sockets to avoid lockdep circular dependency
block: add configurable error injection
block: add a str_to_blk_op helper
block: add a "tag" for block status codes
block: add a macro to initialize the status table
floppy: Drop unused pnp driver data
block: propagate in_flight to whole disk on partition I/O
virtio-blk: clamp zone report to the report buffer capacity
block: optimize I/O merge hot path with unlikely() hints
drivers/block/rbd: Use strscpy() to copy strings into arrays
partitions: aix: bound the pp_count scan to the ppe array
block: Enable lock context analysis
block/mq-deadline: Make the lock context annotations compatible with Clang
block/Kyber: Make the lock context annotations compatible with Clang
block/blk-mq-debugfs: Improve lock context annotations
block/blk-iocost: Inline iocg_lock() and iocg_unlock()
block/blk-iocost: Split ioc_rqos_throttle()
block/crypto: Annotate the crypto functions
...
|
|
for-7.2/block
Pull NVMe updates from Keith:
"- Per-controller timeouts
- Multipath telemetry
- Namespace format validation
- Various other fixes"
* tag 'nvme-7.2-2026-06-04' of git://git.infradead.org/nvme: (34 commits)
nvme: export controller reconnect event count via sysfs
nvme: export controller reset event count via sysfs
nvme: export I/O failure count when no path is available via sysfs
nvme: export I/O requeue count when no path is usable via sysfs
nvme: export command error counters via sysfs
nvme: export multipath failover count via sysfs
nvme: export command retry count via sysfs
nvme: add diag attribute group under sysfs
nvme-tcp: lockdep: use dynamic lockdep keys per socket instance
nvme-tcp: move nvme_tcp_reclassify_socket()
nvme: validate FDP configuration descriptor sizes
nvmet-auth: validate reply message payload bounds against transfer length
nvme: refresh multipath head zoned limits from path limits
nvme: fix FDP fdpcidx bounds check
nvme-tcp: Use WQ_PERCPU explicitly if wq_unbound is false.
nvmet: fix pre-auth out-of-bounds heap read in Discovery Get Log Page
nvme-multipath: set BIO_REMAPPED on bios remapped to per-path namespace disks
nvme-multipath: require exact iopolicy names for module parameter
nvme-multipath: pass NS head to nvme_mpath_revalidate_paths()
nvme-pci: fix out-of-bounds access in nvme_setup_descriptor_pools
...
|
|
When an NVMe-oF link goes down, the driver attempts to recover the
connection by repeatedly reconnecting to the remote controller at
configured intervals. A maximum number of reconnect attempts is also
configured, after which recovery stops and the controller is removed
if the connection cannot be re-established.
The driver maintains a counter, nr_reconnects, which is incremented on
each reconnect attempt. However, when a reconnect succeeds, this counter
is reset to zero. Moreover, the counter is currently reported only via
kernel log messages and is not exposed to userspace. Since dmesg is a
circular buffer, this information may be lost over time.
So introduce a new accumulator that accumulates reconnect attempts,
and expose it per fabrics controller via a new sysfs attribute,
reconnect_count, under the diag attribute group to provide persistent
visibility into the number of reconnect attempts made by the host. This
information can help users diagnose unstable links or connectivity
issues. The sysfs attribute is also writable, so the user may reset it
to zero if needed.
The reconnect_count can also be consumed by monitoring tools such as
nvme-top to improve controller-level observability.
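The difference between the per-episode counter and the new accumulator can be sketched in plain C. This is only an illustration under stated assumptions: struct ctrl_counters and the three helper names are hypothetical stand-ins, not the driver's data structures, and a write is modeled as simply clearing the total.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for per-controller state; not the driver's struct. */
struct ctrl_counters {
	unsigned int nr_reconnects;	/* per-episode, cleared on success */
	uint64_t reconnect_count;	/* running total, kept across successes */
};

/* Called once per reconnect attempt: both counters advance. */
static void note_reconnect_attempt(struct ctrl_counters *c)
{
	assert(c);
	c->nr_reconnects++;
	c->reconnect_count++;
}

/* Called when a reconnect succeeds: only the per-episode counter is cleared. */
static void note_reconnect_success(struct ctrl_counters *c)
{
	assert(c);
	c->nr_reconnects = 0;
}

/* Models a user write that resets the accumulated total to zero. */
static void reset_reconnect_count(struct ctrl_counters *c)
{
	assert(c);
	c->reconnect_count = 0;
}
```

After two attempts followed by a success, nr_reconnects is back at zero while reconnect_count still reads two, which is the information that would otherwise only exist in the kernel log.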
Tested-by: Venkat Rao Bagalkote <venkat88@linux.ibm.com>
Signed-off-by: Nilay Shroff <nilay@linux.ibm.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
|
|
The NVMe controller transitions into the RESETTING state during error
recovery, link instability, firmware activation, or when a reset is
explicitly triggered by the user.
Expose a per-ctrl sysfs attribute, reset_count, under the diag attribute
group to provide visibility into these RESETTING state transitions.
Observing the frequency of reset events can help users identify issues
such as PCIe errors or unstable fabric links. This counter is also
writable, allowing the user to reset its value if needed.
This counter can also be consumed by monitoring tools such as nvme-top
to improve controller-level observability.
Tested-by: Venkat Rao Bagalkote <venkat88@linux.ibm.com>
Signed-off-by: Nilay Shroff <nilay@linux.ibm.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
|
|
When I/O is submitted to the NVMe namespace head and no available path
can handle the request, the driver fails the I/O immediately. Currently,
such failures are only reported via kernel log messages, which may be
lost over time since dmesg is a circular buffer.
Add a new ns-head sysfs counter, io_fail_no_available_path_count, under
the diag attribute group to expose the number of I/Os that failed due to
the absence of an available path. This provides persistent visibility
into path-related I/O failures and can help users diagnose the cause of
I/O errors. This counter is also writable, so the user may reset its
value if needed.
This counter can also be consumed by monitoring tools such as nvme-top.
Tested-by: Venkat Rao Bagalkote <venkat88@linux.ibm.com>
Signed-off-by: Nilay Shroff <nilay@linux.ibm.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
|
|
When the NVMe namespace head determines that there is no currently
available path to handle I/O (for example, while a controller is
resetting/connecting or due to a transient link failure), incoming
I/Os are added to the requeue list.
Currently, there is no visibility into how many I/Os have been requeued
in this situation. Add a new ns-head sysfs counter,
io_requeue_no_usable_path_count, under the diag attribute group to
expose the number of I/Os that were requeued due to the absence of an
available path. This counter is also writable, allowing the user to
reset it if needed.
This statistic can help users understand I/O slowdowns or stalls caused
by temporary path unavailability, and can be consumed by monitoring
tools such as nvme-top for real-time observability.
Tested-by: Venkat Rao Bagalkote <venkat88@linux.ibm.com>
Signed-off-by: Nilay Shroff <nilay@linux.ibm.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
|
|
When an NVMe command completes with an error status, the driver
logs the error to the kernel log. However, these messages may be
lost or overwritten over time since dmesg is a circular buffer.
Expose a per-path and per-ctrl sysfs attribute, command_error_count,
under the diag attribute group to provide persistent visibility into
error occurrences. This allows users to observe the total number of
commands that have failed on a given path over time, which can be useful
for diagnosing path health and stability.
This attribute is both readable and writable, allowing the user to reset
these counters. The counters can also be consumed by observability
tools such as nvme-top to provide additional insight into NVMe error
behavior.
Tested-by: Venkat Rao Bagalkote <venkat88@linux.ibm.com>
Signed-off-by: Nilay Shroff <nilay@linux.ibm.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
|
|
When an NVMe command completes with a path-specific error, the NVMe
driver may retry the command on an alternate controller or path if one
is available. These failover events indicate that I/O was redirected
away from the original path.
Currently, the number of times requests are failed over to another
available path is not visible to userspace. Exposing this information
can be useful for diagnosing path health and stability.
Export a per-path sysfs attribute, "multipath_failover_count", under the
diag attribute group. This attribute is both readable and writable,
allowing the user to reset the counter. The counter can be consumed by
monitoring tools such as nvme-top to help identify paths that
consistently trigger failovers under load.
Tested-by: Venkat Rao Bagalkote <venkat88@linux.ibm.com>
Signed-off-by: Nilay Shroff <nilay@linux.ibm.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
|
|
When Advanced Command Retry Enable (ACRE) is configured, a controller
may interrupt command execution and return a completion status
indicating command interrupted with the DNR bit cleared. In this case,
the driver retries the command based on the Command Retry Delay (CRD)
value provided in the completion status.
Currently, these command retries are handled entirely within the NVMe
driver and are not visible to userspace. As a result, there is no
observability into retry behavior, which can be a useful diagnostic
signal.
Expose a per-namespace sysfs attribute, command_retries_count, under
the diag attribute group to provide visibility into retry activity. This
information can help identify controller-side congestion under load
and enables comparison across paths in multipath setups (for example,
detecting cases where one path experiences significantly more retries
than another under identical workloads).
This exported metric is intended for diagnostics and monitoring tools
such as nvme-top, and does not change command retry behavior. The
attribute is both readable and writable, so the user can reset the
counter if needed.
Tested-by: Venkat Rao Bagalkote <venkat88@linux.ibm.com>
Signed-off-by: Nilay Shroff <nilay@linux.ibm.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
|
|
Add a new diag attribute group under:
/sys/class/nvme/<ctrl>/
/sys/block/<nvme-path-dev>/
/sys/block/<ns-head-dev>/
This new sysfs attribute group will be used to organize NVMe diagnostic
and telemetry-related counters under it.
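A userspace reader for these counters might look like the sketch below. It assumes the group shows up as a "diag" subdirectory of each listed location (the usual sysfs layout for a named attribute group) and that each counter is a single decimal number; diag_attr_path() and read_counter() are hypothetical helper names, and nvme0 in the usage note is just an example controller.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Build "<base>/diag/<attr>". Returns 0 on success, -1 on a bad name or
 * if the result does not fit in buf. */
static int diag_attr_path(char *buf, size_t len, const char *base,
			  const char *attr)
{
	int n;

	assert(buf && base && attr);
	if (attr[0] == '\0' || strchr(attr, '/'))
		return -1;	/* attribute names are single path components */
	n = snprintf(buf, len, "%s/diag/%s", base, attr);
	return (n < 0 || (size_t)n >= len) ? -1 : 0;
}

/* Read one unsigned decimal counter from a sysfs-style text file.
 * Returns 0 on success, -1 if the file is missing or unparsable. */
static int read_counter(const char *path, unsigned long long *val)
{
	FILE *f;
	int ok;

	assert(path && val);
	f = fopen(path, "r");
	if (!f)
		return -1;
	ok = fscanf(f, "%llu", val) == 1;
	fclose(f);
	return ok ? 0 : -1;
}
```

A monitoring tool could call diag_attr_path(buf, sizeof(buf), "/sys/class/nvme/nvme0", "reset_count") and pass the result to read_counter(), treating a -1 return as "attribute not present on this kernel".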
Tested-by: Venkat Rao Bagalkote <venkat88@linux.ibm.com>
Signed-off-by: Nilay Shroff <nilay@linux.ibm.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
|
|
When NVMe-TCP controller setup and teardown are repeated with lockdep
enabled, lockdep reports a false-positive WARN for the following locks:
1) &q->elevator_lock : IO scheduler change context
2) &q->q_usage_counter(io) : SCSI disk probe context
3) fs_reclaim : CPU hotplug bring-up context
4) cpu_hotplug_lock : socket establishment context
5) sk_lock-AF_INET-NVME : MQ sched dispatch context for the socket
6) set->srcu : NVMe controller delete context
The lockdep WARN was observed by running blktests test case nvme/005 for
tcp transport on v7.1-rc1 kernel with a patch. Refer to the Link tag for
the details of the WARN.
This is a false positive because lockdep confuses lock 4) (socket
establishment) with lock 5) (socket in use) for different socket
instances. The locks belong to different sockets, but lockdep treats
them as the same due to shared static lockdep keys.
Fix this by using dynamically allocated lockdep keys per socket instance
instead of static keys nvme_tcp_sk_key[] and nvme_tcp_slock_key[]. Add
nvme_tcp_sk_key and nvme_tcp_slock_key fields to struct nvme_tcp_queue
and pass them to sock_lock_init_class_and_name() for proper lockdep
tracking. Change the argument of nvme_tcp_reclassify_socket() from
'struct socket *' to 'struct nvme_tcp_queue *' to pass both the socket
and the keys. Add CONFIG_DEBUG_LOCK_ALLOC guards to nvme_tcp_alloc_queue()
and nvme_tcp_free_queue() to register and unregister the dynamic keys.
Additionally, move nvme_tcp_reclassify_socket() inside these guards since
it's only needed when lockdep is enabled.
Link: https://lore.kernel.org/linux-nvme/afB5syZbUrppgsDQ@shinmob/
Suggested-by: Nilay Shroff <nilay@linux.ibm.com>
Reviewed-by: Nilay Shroff <nilay@linux.ibm.com>
Signed-off-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
|
|
Move nvme_tcp_reclassify_socket() in tcp.c after the struct
nvme_tcp_queue definition. This is preparation for adding a reference
to struct nvme_tcp_queue in the function, which would otherwise cause a
compile failure due to the struct being defined after the function.
Move the entire CONFIG_DEBUG_LOCK_ALLOC block along with the function
to maintain the code organization.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Nilay Shroff <nilay@linux.ibm.com>
Signed-off-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
|
|
Validate descriptor sizes while walking the FDP configurations log so
dsze == 0 or a descriptor past the log end cannot cause unbounded
iteration or reads past the buffer.
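The two failure modes named above (a zero size and a descriptor running past the end of the log) can be illustrated with a simplified walker. The layout here is an assumption for the sketch only: each descriptor starts with a 16-bit little-endian size that covers the whole descriptor, standing in for dsze; the real FDP configurations log has more fields, and walk_descriptors() is a hypothetical name.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Walk `count` variable-size descriptors stored back to back in
 * buf[0..len). Returns 0 if every descriptor is well formed, -1 otherwise.
 */
static int walk_descriptors(const uint8_t *buf, size_t len, unsigned int count)
{
	size_t off = 0;

	assert(buf);
	for (unsigned int i = 0; i < count; i++) {
		size_t dsze;

		if (len - off < 2)
			return -1;	/* no room left for the size field */
		dsze = (size_t)buf[off] | ((size_t)buf[off + 1] << 8);
		if (dsze < 2)
			return -1;	/* includes dsze == 0: would never advance */
		if (dsze > len - off)
			return -1;	/* descriptor extends past the log end */
		off += dsze;		/* off <= len still holds here */
	}
	return 0;
}
```

Because every accepted size is at least 2 and at most the remaining length, the offset strictly increases and never leaves the buffer, so the loop is bounded by both `count` and `len`.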
Reviewed-by: Nitesh Shetty <nj.shetty@samsung.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: liuxixin <gliuxen@gmail.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
|
|
nvmet_auth_reply() accesses the variable-length rval[] array using
attacker-controlled hl (hash length) and dhvlen (DH value length) fields
without verifying they fit within the allocated buffer of tl bytes.
A malicious NVMe-oF initiator can craft a DHCHAP_REPLY message with a
small transfer length but large hl/dhvlen values, causing out-of-bounds
heap reads when the target processes the DH public key (rval + 2*hl) or
performs the host response memcmp.
With DH authentication configured, the OOB pointer is passed directly to
sg_init_one() and read by crypto_kpp_compute_shared_secret(), reaching
up to 526 bytes past the buffer. This is exploitable pre-authentication.
Add bounds validation ensuring sizeof(*data) + 2*hl + dhvlen <= tl before
any access to the variable-length fields.
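The inequality above can be written as a small overflow-safe predicate. This is a sketch, not the patch itself: reply_fits() is a hypothetical name, hdr_len stands in for sizeof(*data), and the field widths (8-bit hl, 16-bit dhvlen, 32-bit tl) are assumptions chosen for the illustration.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Return true when the fixed header plus the variable-length fields fit in
 * the transfer length, i.e. hdr_len + 2*hl + dhvlen <= tl. The sum is done
 * in 64 bits so it cannot wrap for any combination of inputs.
 */
static bool reply_fits(uint32_t hdr_len, uint8_t hl, uint16_t dhvlen,
		       uint32_t tl)
{
	uint64_t need = (uint64_t)hdr_len + 2 * (uint64_t)hl + dhvlen;

	return need <= tl;
}
```

The point of doing the check before touching rval[] is that every later offset, such as rval + 2*hl, is then known to lie inside the tl-byte buffer.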
Discovered by Atuin - Automated Vulnerability Discovery Engine.
Fixes: db1312dd9548 ("nvmet: implement basic In-Band Authentication")
Cc: stable@vger.kernel.org
Reviewed-by: Hannes Reinecke <hare@kernel.org>
Signed-off-by: Tianchu Chen <flynnnchen@tencent.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
|
|
queue_limits_stack_bdev() updates the multipath head limits from the
path queue, but it does not propagate max_open_zones or
max_active_zones. As a result, a zoned multipath namespace head can
keep stale 0/0 values even after a ready path reports finite zoned
resource limits.
When refreshing the head limits in nvme_update_ns_info(), stack the
zoned resource limits directly after stacking the path queue limits.
Use min_not_zero() so the block layer's 0 value keeps its "no limit"
meaning while finite limits are combined conservatively.
This avoids advertising "no limit" on the multipath head while keeping
the zoned-limit handling local to the NVMe multipath update path.
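The stacking rule can be sketched in userspace C. min_not_zero_u() below is a function-form rendering of the idea behind the kernel's min_not_zero() macro, and struct zone_limits with stack_zone_limits() are hypothetical stand-ins for the relevant queue limit fields, not the block layer's types.

```c
#include <assert.h>

/* Smaller of two limits, where 0 means "no limit". */
static unsigned int min_not_zero_u(unsigned int a, unsigned int b)
{
	if (a == 0)
		return b;
	if (b == 0)
		return a;
	return a < b ? a : b;
}

/* Hypothetical stand-in for the two zoned resource limits. */
struct zone_limits {
	unsigned int max_open_zones;
	unsigned int max_active_zones;
};

/* Fold one path's limits into the head's limits. */
static void stack_zone_limits(struct zone_limits *head,
			      const struct zone_limits *path)
{
	assert(head && path);
	head->max_open_zones =
		min_not_zero_u(head->max_open_zones, path->max_open_zones);
	head->max_active_zones =
		min_not_zero_u(head->max_active_zones, path->max_active_zones);
}
```

Starting from a head with 0/0, stacking a path that reports 14/14 yields 14/14 rather than leaving the stale "no limit" values in place, and a later path with a smaller finite limit lowers the head further.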
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Yao Sang <sangyao@kylinos.cn>
Signed-off-by: Keith Busch <kbusch@kernel.org>
|
|
The fdpcidx bounds check sets n = NUMFDPC + 1 but used > instead of >=,
incorrectly accepting fdp_idx when it equals n (i.e. NUMFDPC + 1).
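The off-by-one can be shown side by side in a few lines. Both functions are hypothetical illustrations of the two comparisons, not the driver code; numfdpc is taken as a 16-bit value here as an assumption.

```c
#include <stdbool.h>
#include <stdint.h>

/* Old comparison: rejects only fdp_idx > n, so fdp_idx == n slips through. */
static bool fdp_idx_ok_old(unsigned int fdp_idx, uint16_t numfdpc)
{
	unsigned int n = (unsigned int)numfdpc + 1;

	return !(fdp_idx > n);
}

/* Fixed comparison: rejects fdp_idx >= n, leaving indices 0 .. n - 1. */
static bool fdp_idx_ok_new(unsigned int fdp_idx, uint16_t numfdpc)
{
	unsigned int n = (unsigned int)numfdpc + 1;

	return !(fdp_idx >= n);
}
```

With numfdpc = 3 there are n = 4 configurations, so index 4 is the first invalid one; the old form accepts it and the fixed form does not.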
Fixes: 30b5f20bb2dd ("nvme: register fdp parameters with the block layer")
Reviewed-by: Nitesh Shetty <nj.shetty@samsung.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: liuxixin <gliuxen@gmail.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
|
|
Since commit 21c05ca88a54 ("workqueue: Add warnings and ensure
one among WQ_PERCPU or WQ_UNBOUND is present"), we must explicitly
set WQ_PERCPU or WQ_UNBOUND when creating a workqueue.
nvme_tcp_init_module() sets WQ_UNBOUND when the module param
wq_unbound is set, but otherwise, WQ_PERCPU is missing, triggering
the warning below:
workqueue: nvme_tcp_wq is using neither WQ_PERCPU or WQ_UNBOUND. Setting WQ_PERCPU.
WARNING: kernel/workqueue.c:5856 at __alloc_workqueue+0x1d02/0x2070 kernel/workqueue.c:5855, CPU#0: swapper/0/1
Let's set WQ_PERCPU if wq_unbound is false.
Reported-by: syzbot+d078cba4418e65f61984@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/all/6a1a9a86.323e8352.141b09.0001.GAE@google.com/
Tested-by: Venkat Rao Bagalkote <venkat88@linux.ibm.com>
Reviewed-by: Nilay Shroff <nilay@linux.ibm.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
|
|
nvmet_execute_disc_get_log_page() validates only the dword alignment
of the host-supplied Log Page Offset (lpo). The 64-bit offset is then
added to a small kzalloc'd buffer that holds the discovery log page
and the result is passed straight to nvmet_copy_to_sgl(), which
memcpy()s data_len bytes out to the host with no source-side bound
check:
    u64 offset = nvmet_get_log_page_offset(req->cmd); /* 64-bit host */
    size_t data_len = nvmet_get_log_page_len(req->cmd); /* 32-bit host */
    ...
    if (offset & 0x3) { ... } /* only check */
    ...
    alloc_len = sizeof(*hdr) + entry_size * discovery_log_entries(req);
    buffer = kzalloc(alloc_len, GFP_KERNEL);
    ...
    status = nvmet_copy_to_sgl(req, 0, buffer + offset, data_len);
The Discovery controller is unauthenticated -- nvmet_host_allowed()
returns true unconditionally for the discovery subsystem -- so the call
is reachable pre-authentication by any TCP/RDMA/FC peer that can reach
the nvmet target. With a discovery log page of ~1 KiB, an attacker
requesting up to 4 KiB starting at offset == alloc_len reads the next
slab page out and gets its content returned over the fabric (an
empirical run on a default nvmet-tcp loopback target leaked 81
canonical kernel pointers in one Get Log Page response). Pointing the
offset at unmapped kernel memory faults the in-kernel memcpy and
crashes (or panics, on panic_on_oops=1) the target host instead.
The attacker-controlled source-side offset pattern
"nvmet_copy_to_sgl(req, 0, buffer + ATTACKER_OFFSET, ...)" is unique
to nvmet_execute_disc_get_log_page in the entire nvmet codebase: every
other Get Log Page handler in admin-cmd.c either ignores lpo (and
silently starts every response at offset 0) or tracks a local
destination offset with a fixed source pointer.
Validate the host-supplied offset against the log page size, cap the
copy length to what is actually available, and zero-fill any remainder
of the host transfer buffer. The zero-fill matches the existing
short-response pattern in nvmet_execute_get_log_changed_ns()
(admin-cmd.c) and prevents leaking transport SGL contents when the
host asks for more bytes than the log page contains.
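The three steps (validate the offset, cap the copy, zero-fill the rest) can be sketched on flat buffers. This is a userspace illustration, not the nvmet change: copy_log_page() is a hypothetical name, the destination is a plain array rather than an SGL, and treating offset == log_len as valid (yielding an all-zero response) is an assumption of the sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Fill dst[0..data_len) from log[offset..log_len), zero-filling whatever
 * the log cannot supply. Returns -1 if offset lies beyond the log,
 * otherwise the number of log bytes copied.
 */
static long copy_log_page(uint8_t *dst, size_t data_len,
			  const uint8_t *log, size_t log_len, uint64_t offset)
{
	size_t avail, n;

	assert(dst && log);
	if (offset > log_len)
		return -1;			/* source offset out of range */
	avail = log_len - (size_t)offset;
	n = data_len < avail ? data_len : avail;	/* cap to what exists */
	memcpy(dst, log + offset, n);
	memset(dst + n, 0, data_len - n);	/* never expose stale bytes */
	return (long)n;
}
```

Comparing the 64-bit offset against log_len before any pointer arithmetic is what keeps `log + offset` inside the allocation; the cap and zero-fill then bound the read and the bytes the host sees.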
Fixes: a07b4970f464 ("nvmet: add a generic NVMe target")
Cc: stable@vger.kernel.org
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Bryam Vargas <hexlabsecurity@proton.me>
Signed-off-by: Keith Busch <kbusch@kernel.org>
|
|
When nvme_ns_head_submit_bio() remaps a bio from the multipath head to a
per-path namespace, bio_set_dev() clears BIO_REMAPPED. The remapped bio
is then resubmitted through submit_bio_noacct() which calls
bio_check_eod() because BIO_REMAPPED is not set.
This races with nvme_ns_remove() which zeroes the per-path capacity
before synchronize_srcu():
CPU 0 (IO submission)
---------------------
srcu_read_lock()
nvme_find_path() -> ns
[NVME_NS_READY is set]
CPU 1 (namespace removal)
-------------------------
clear_bit(NVME_NS_READY)
set_capacity(ns->disk, 0)
synchronize_srcu() <- blocks
CPU 0 (IO submission)
---------------------
bio_set_dev(bio, ns->disk->part0)
[clears BIO_REMAPPED]
submit_bio_noacct(bio)
-> bio_check_eod() sees capacity=0
-> bio fails with IO error
The SRCU read lock prevents synchronize_srcu() from completing, but does
not prevent set_capacity(0) from executing. The bio fails the EOD check
before it reaches the NVMe driver, so nvme_failover_req() never gets a
chance to redirect it to another multipath path. IO errors are
reported to the application despite another path being available.
On older kernels (before commit 0b64682e78f7 "block: skip unnecessary
checks for split bio"), the same race was also reachable through split
remainders resubmitted via submit_bio_noacct().
Fix this by setting BIO_REMAPPED after bio_set_dev() in
nvme_ns_head_submit_bio(). This skips bio_check_eod() on the per-path
device; the EOD check already passed on the multipath head.
NVMe per-path namespace devices are always whole disks (bd_partno=0), so
the blk_partition_remap() skip also gated by BIO_REMAPPED is a no-op.
The flag does not persist across failover and cannot go stale if the
namespace geometry changes between attempts: nvme_failover_req() calls
bio_set_dev() to redirect the bio back to the multipath head, which
clears BIO_REMAPPED. When nvme_requeue_work() resubmits through
submit_bio_noacct(), bio_check_eod() runs normally against the current
capacity.
Same approach as commit 3a905c37c351 ("block: skip bio_check_eod for
partition-remapped bios").
Fixes: a7c7f7b2b641 ("nvme: use bio_set_dev to assign ->bi_bdev")
Cc: stable@vger.kernel.org
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Igor Achkinazi <igor.achkinazi@dell.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
|
|
The iopolicy module parameter uses strncmp prefix matching, so values
like "numax" are accepted as "numa". The per-subsystem sysfs attribute
already requires an exact match via sysfs_streq(). Parse both through
a shared helper so invalid values are rejected consistently.
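The behavioral difference between prefix and exact matching can be sketched as follows. Everything here is illustrative: the policy name table, parse_prefix(), parse_exact(), and streq_nl() are hypothetical, and streq_nl() only approximates the "trailing newline is not significant" behavior of sysfs_streq().

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Illustrative name table; the index doubles as the policy id. */
static const char *const policy_names[] = {
	"numa", "round-robin", "queue-depth",
};
#define NR_POLICIES (sizeof(policy_names) / sizeof(policy_names[0]))

/* Exact comparison that ignores one trailing newline on either side. */
static int streq_nl(const char *a, const char *b)
{
	size_t la = strlen(a), lb = strlen(b);

	if (la && a[la - 1] == '\n')
		la--;
	if (lb && b[lb - 1] == '\n')
		lb--;
	return la == lb && memcmp(a, b, la) == 0;
}

/* Old style: strncmp() over the policy name length accepts any suffix. */
static int parse_prefix(const char *val)
{
	assert(val);
	for (size_t i = 0; i < NR_POLICIES; i++)
		if (!strncmp(val, policy_names[i], strlen(policy_names[i])))
			return (int)i;
	return -1;
}

/* Shared-helper style: the whole value must match a policy name. */
static int parse_exact(const char *val)
{
	assert(val);
	for (size_t i = 0; i < NR_POLICIES; i++)
		if (streq_nl(val, policy_names[i]))
			return (int)i;
	return -1;
}
```

With these definitions "numax" resolves to the first policy under prefix matching but is rejected by the exact parser, while "numa\n" (as written by echo into sysfs) is still accepted.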
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: liyouhong <liyouhong@kylinos.cn>
Signed-off-by: Keith Busch <kbusch@kernel.org>
|
|
In nvme_mpath_revalidate_paths(), we are passed an NS pointer, use it to
look up the NS head, and then reuse that same NS pointer as an iterator
variable. It makes more sense to pass the NS head and use a local
variable for the NS iterator.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: John Garry <john.g.garry@oracle.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Pull networking fixes from Paolo Abeni:
"This is again significantly bigger than the same point into the
previous cycle, but at least smaller than last week.
I'm not aware of any pending regression for the current cycle.
Including fixes from netfilter.
Current release - regressions:
- netfilter: walk fib6_siblings under RCU
Previous releases - regressions:
- netlink: fix sending unassigned nsid after assigned one
- bridge: fix sleep in atomic context in netlink path
- sched: fix ethx:ingress -> ethy:egress -> ethx:ingress mirred loop
- ipv4: fix net->ipv4.sysctl_local_reserved_ports UaF
- eth: tun: free page on short-frame rejection in tun_xdp_one()
Previous releases - always broken:
- skbuff: fix missing zerocopy reference in pskb_carve helpers
- handshake: drain pending requests at net namespace exit
- ethtool:
    - rss: avoid modifying the RSS context response
    - module: avoid leaking a netdev ref on module flash errors
    - coalesce: cap profile updates at NET_DIM_PARAMS_NUM_PROFILES
- netfilter: fix dst corruption in same register operation
- nfc: hci: fix out-of-bounds read in HCP header parsing
- ipv6: exthdrs: refresh nh pointer after ipv6_hop_jumbo()
- eth:
    - vti: use ip6_tnl.net in vti6_changelink().
    - vxlan: do not reuse cached ip_hdr() value after
      skb_tunnel_check_pmtu()"
* tag 'net-7.1-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (94 commits)
dpll: zl3073x: make frequency monitor a per-device attribute
dpll: zl3073x: use __dpll_device_change_ntf() and remove change_work
dpll: export __dpll_device_change_ntf() for use under dpll_lock
net/handshake: Drain pending requests at net namespace exit
net/handshake: Verify file-reference balance in submit paths
net/handshake: Close the submit-side sock_hold race
net/handshake: hand off the pinned file reference to accept_doit
net/handshake: Take a long-lived file reference at submit
net/handshake: Pass negative errno through handshake_complete()
nvme-tcp: store negative errno in queue->tls_err
net/handshake: Use spin_lock_bh for hn_lock
net: skbuff: fix missing zerocopy reference in pskb_carve helpers
net: hibmcge: move dma_rmb() after dma_sync_single_for_cpu() in RX path
net: hibmcge: disable Relaxed Ordering to fix RX packet corruption
selftests/tc-testing: Add netem test case exercising loops
selftests/tc-testing: Add mirred test cases exercising loops
net/sched: act_mirred: Fix return code in early mirred redirect error paths
net/sched: act_mirred: Fix blockcast recursion bypass leading to stack overflow
net/sched: Fix ethx:ingress -> ethy:egress -> ethx:ingress mirred loop
net/sched: fix packet loop on netem when duplicate is on
...
|
|
Don't skip the io accounting for passthrough commands if the user
enabled tracking these.
Reviewed-by: Nilay Shroff <nilay@linux.ibm.com>
Reviewed-by: Nitesh Shetty <nj.shetty@samsung.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://patch.msgid.link/20260528010041.1533124-3-kbusch@meta.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Split the two init cases based on code in the zloop driver. This
simplifies the code and makes it easier to follow.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Ming Lei <tom.leiming@gmail.com>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Link: https://patch.msgid.link/20260527151043.2349900-3-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
nvme_tcp_tls_done() assigns queue->tls_err in three branches. The
ENOKEY lookup failure and the EOPNOTSUPP initializer both store
negative errnos. The third branch, reached when the handshake
layer reports a non-zero status, stores -status.
The handshake layer delivers status to the consumer callback as a
negative errno; the other in-tree consumers --
xs_tls_handshake_done() and the nvmet target callback -- treat
their status argument that way. The extra negation in
nvme_tcp_tls_done() flips the sign, leaving tls_err as a positive
value (for instance, +EIO), which nvme_tcp_start_tls() then
returns to its caller.
Drop the extra negation so queue->tls_err uniformly carries a
negative errno on failure.
Fixes: be8e82caa685 ("nvme-tcp: enable TLS handshake upcall")
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Hannes Reinecke <hare@kernel.org>
Reviewed-by: Alistair Francis <alistair.francis@wdc.com>
Link: https://patch.msgid.link/20260525-handshake-file-pin-v3-2-66c616906ead@oracle.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
|
nvme_setup_descriptor_pools() indexes dev->descriptor_pools[] using the
numa_node forwarded from hctx->numa_node by its single caller,
nvme_init_hctx_common(). On a non-NUMA kernel hctx->numa_node is
NUMA_NO_NODE (-1). Because the parameter was declared 'unsigned', the
value becomes UINT_MAX and the index walks off the array (sized to
nr_node_ids), faulting during nvme_alloc_ns() and leaving the namespace
without a /dev node.
Reproduces on any NVMe controller probed by a CONFIG_NUMA=n kernel:
BUG: unable to handle page fault for address: ffff889101603d38
RIP: 0010:nvme_init_hctx_common+0x5a/0x190 [nvme]
Call Trace:
nvme_init_hctx+0x10/0x20 [nvme]
nvme_alloc_ns+0x9e/0xa10 [nvme_core]
nvme_scan_ns+0x301/0x3b0 [nvme_core]
nvme_scan_ns_async+0x23/0x30 [nvme_core]
Switch the parameter to int and fall back to node 0 when it is
NUMA_NO_NODE; node 0 is always present.
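The signedness problem and the fallback can be reproduced in a few lines of standalone C. NO_NODE mirrors the -1 value given above for NUMA_NO_NODE; index_as_unsigned() and pool_index() are hypothetical helpers for the illustration, not the driver's functions.

```c
#include <assert.h>
#include <limits.h>

#define NO_NODE (-1)	/* same value the text gives for NUMA_NO_NODE */

/* Old signature: an 'unsigned' parameter silently converts -1 to UINT_MAX. */
static unsigned int index_as_unsigned(unsigned int numa_node)
{
	return numa_node;
}

/* Fixed shape: take an int and fall back to node 0 when no node is set. */
static int pool_index(int numa_node)
{
	assert(numa_node >= NO_NODE);	/* other negatives are not expected */
	return numa_node == NO_NODE ? 0 : numa_node;
}
```

Passing NO_NODE to the unsigned variant yields UINT_MAX, far beyond any array sized to the node count, whereas pool_index() maps it to index 0 and leaves real node IDs unchanged.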
Fixes: d977506f8863 ("nvme-pci: make PRP list DMA pools per-NUMA-node")
Link: https://lore.kernel.org/r/20260309062840.2937858-2-iam@sung-woo.kim
Reported-by: Sung-woo Kim <iam@sung-woo.kim>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Mateusz Nowicki <mateusz.nowicki@posteo.net>
Signed-off-by: Keith Busch <kbusch@kernel.org>
|
|
nvmet_rdma_queue_connect() calls nvmet_rdma_find_get_device(), which
acquires a reference on the returned ndev via kref_get(). On the path
where the host queue backlog is exceeded and the function returns
NVME_SC_CONNECT_CTRL_BUSY, the reference on ndev is not released,
leaking the kref.
Fix this by adding a goto to the existing put_device label before the
early return.
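The shape of the fix is the usual goto-based unwind, sketched here with a plain integer refcount. All names (struct dev, find_get(), put(), queue_connect(), the backlog parameters) are hypothetical; the sketch only mirrors the control flow described above, not the nvmet-rdma code.

```c
#include <assert.h>

/* Hypothetical refcounted device. */
struct dev {
	int refs;
};

static struct dev *find_get(struct dev *d)
{
	assert(d);
	d->refs++;		/* stands in for kref_get() */
	return d;
}

static void put(struct dev *d)
{
	assert(d && d->refs > 0);
	d->refs--;		/* stands in for dropping the kref */
}

/* Returns 0 on success (reference kept by the queue), -1 when busy. */
static int queue_connect(struct dev *d, int backlog, int max_backlog)
{
	struct dev *ndev = find_get(d);
	int ret;

	if (backlog > max_backlog) {
		ret = -1;
		goto put_device;	/* previously a bare early return */
	}
	return 0;

put_device:
	put(ndev);
	return ret;
}
```

Routing the busy case through the shared label means the reference count after a failed connect equals the count before the call, which is the invariant the leak violated.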
Fixes: 31deaeb11ba7 ("nvmet-rdma: avoid circular locking dependency on install_queue()")
Cc: stable@vger.kernel.org
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Wentao Liang <vulab@iscas.ac.cn>
Signed-off-by: Keith Busch <kbusch@kernel.org>
|
|
struct nvme_ns_head contains a flexible array member, current_path[],
which is indexed using the NUMA node ID:
    head->current_path[numa_node_id()]
The structure is currently allocated as:
    size = sizeof(struct nvme_ns_head) +
           (num_possible_nodes() * sizeof(struct nvme_ns *));
    head = kzalloc(size, GFP_KERNEL);
This allocation assumes that NUMA node IDs are sequential and densely
packed from 0 .. num_possible_nodes() - 1. While this assumption holds
on many systems, it is not always true on some architectures such as
powerpc.
On some powerpc systems, NUMA node IDs can be sparse. For example:
NUMA:
NUMA node(s): 6
NUMA node0 CPU(s): 80-159
NUMA node8 CPU(s): 0-79
NUMA node252 CPU(s):
NUMA node253 CPU(s):
NUMA node254 CPU(s):
NUMA node255 CPU(s):
That is, the possible/online NUMA node IDs are: 0, 8, 252, 253, 254, 255
In this case: num_possible_nodes() = 6
So memory is allocated for only 6 entries in current_path[]. However,
the array is later indexed using the actual NUMA node ID. As a result,
accesses such as:
head->current_path[8] or
head->current_path[252]
go out of bounds, leading to the following KASAN splat:
==================================================================
BUG: KASAN: slab-out-of-bounds in nvme_mpath_revalidate_paths+0x22c/0x290 [nvme_core]
Write of size 8 at addr c00020003bda35b8 by task kworker/u641:2/1997
CPU: 1 UID: 0 PID: 1997 Comm: kworker/u641:2 Not tainted 7.1.0-rc5-dirty #14 PREEMPT(lazy)
Hardware name: 8335-GTH POWER9 0x4e1202 opal:skiboot-v6.5.3-35-g1851b2a06 PowerNV
Workqueue: async async_run_entry_fn
Call Trace:
[c000200037fa7510] [c0000000021c23d4] dump_stack_lvl+0x88/0xdc (unreliable)
[c000200037fa7540] [c0000000009fda90] print_report+0x22c/0x67c
[c000200037fa7630] [c0000000009fd508] kasan_report+0x108/0x220
[c000200037fa7740] [c0000000009fff48] __asan_store8+0xe8/0x120
[c000200037fa7760] [c008000018e76474] nvme_mpath_revalidate_paths+0x22c/0x290 [nvme_core]
[c000200037fa7800] [c008000018e6556c] nvme_update_ns_info+0x4a4/0x5e0 [nvme_core]
[c000200037fa7a50] [c008000018e66270] nvme_alloc_ns+0x6d8/0x1a70 [nvme_core]
[c000200037fa7c20] [c008000018e679fc] nvme_scan_ns+0x3f4/0x630 [nvme_core]
[c000200037fa7d10] [c00000000031f22c] async_run_entry_fn+0x9c/0x3a0
[c000200037fa7db0] [c0000000002fa544] process_one_work+0x414/0xa10
[c000200037fa7ec0] [c0000000002fbf00] worker_thread+0x320/0x640
[c000200037fa7f80] [c00000000030d0f8] kthread+0x278/0x290
[c000200037fa7fe0] [c00000000000ded8] start_kernel_thread+0x14/0x18
Allocated by task 1997 on cpu 1 at 35.928317s:
The buggy address belongs to the object at c00020003bda3000
which belongs to the cache kmalloc-rnd-15-2k of size 2048
The buggy address is located 16 bytes to the right of
allocated 1448-byte region [c00020003bda3000, c00020003bda35a8)
The buggy address belongs to the physical page:
Memory state around the buggy address:
c00020003bda3480: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
c00020003bda3500: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
>c00020003bda3580: 00 00 00 00 00 fc fc fc fc fc fc fc fc fc fc fc
^
c00020003bda3600: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
c00020003bda3680: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
==================================================================
Fix this by allocating the flexible array using nr_node_ids instead
of num_possible_nodes(). Since nr_node_ids is one more than the highest
possible NUMA node ID, indexing current_path[] using numa_node_id()
becomes safe even on systems with sparse node IDs.
Fixes: f333444708f8 ("nvme: take node locality into account when selecting a path")
Tested-by: Mukesh Kumar Chaurasiya (IBM) <mkchauras@gmail.com>
Reviewed-by: Mukesh Kumar Chaurasiya (IBM) <mkchauras@gmail.com>
Reviewed-by: Hannes Reinecke <hare@kernel.org>
Reviewed-by: John Garry <john.g.garry@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Nilay Shroff <nilay@linux.ibm.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
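The difference between the two sizes can be reproduced in user space
with the node IDs from the example above. This is a sketch only: the
demo_* names are hypothetical, demo_count_nodes() plays the role of
num_possible_nodes() and demo_nr_node_ids() the role of nr_node_ids.

```c
#include <stddef.h>
#include <stdlib.h>

struct demo_head {
	size_t nr_slots;
	void *current_path[];	/* flexible array indexed by node ID */
};

/* Analogue of num_possible_nodes(): how many node IDs exist. */
size_t demo_count_nodes(const int *ids, size_t n)
{
	(void)ids;
	return n;
}

/* Analogue of nr_node_ids: highest node ID plus one. */
size_t demo_nr_node_ids(const int *ids, size_t n)
{
	int max = -1;

	for (size_t i = 0; i < n; i++)
		if (ids[i] > max)
			max = ids[i];
	return (size_t)(max + 1);
}

struct demo_head *demo_alloc_head(size_t nr_slots)
{
	struct demo_head *head;

	head = calloc(1, sizeof(*head) +
			 nr_slots * sizeof(head->current_path[0]));
	if (head)
		head->nr_slots = nr_slots;
	return head;
}

/* Would current_path[node] stay inside the allocation? */
int demo_slot_valid(const struct demo_head *head, int node)
{
	return node >= 0 && (size_t)node < head->nr_slots;
}
```

For the IDs 0, 8, 252, 253, 254, 255 the count is 6 while the highest
ID plus one is 256, so only the second size covers every index that
can be used.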
|
|
Use DEFINE_SIMPLE_SYSFS_GROUP_VISIBLE instead of
DEFINE_SYSFS_GROUP_VISIBLE, which means that we can drop
multipath_sysfs_attr_visible().
Incidentally, multipath_sysfs_attr_visible() should have returned a
umode_t.
This idea was suggested by Ben Marzinski elsewhere.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: John Garry <john.g.garry@oracle.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
|
|
The return value of nvmet_tcp_set_queue_sock() is currently ignored in
nvmet_tcp_tls_handshake_done(). If it fails (e.g., due to the socket
not being in TCP_ESTABLISHED state), the socket callbacks will not be
properly set, leading to queue and socket leakage.
Fix this by capturing the return value and calling
nvmet_tcp_schedule_release_queue() on failure to ensure proper cleanup.
Fixes: 675b453e0241 ("nvmet-tcp: enable TLS handshake upcall")
Reviewed-by: Hannes Reinecke <hare@kernel.org>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Signed-off-by: Geliang Tang <tanggeliang@kylinos.cn>
Signed-off-by: Keith Busch <kbusch@kernel.org>
|
|
In nvmet_tcp_alloc_queue(), when a connection is closed during the
allocation process (e.g., nvmet_tcp_set_queue_sock() returns -ENOTCONN),
the error handling jumps to out_destroy_sq and then to out_ida_remove
without draining the page fragment cache.
Although nvmet_tcp_free_cmd() is called in some error paths to release
individual page fragments, the underlying page cache reference held by
queue->pf_cache is never released. The first allocation using pf_cache
is the call to nvmet_tcp_alloc_cmd() for queue->connect, which happens
after ida_alloc() returns successfully. This results in a page leak each
time a connection fails during allocation, which could lead to memory
exhaustion over time if connections are repeatedly opened and closed.
Fix this by calling page_frag_cache_drain() before freeing the queue
structure in the out_ida_remove label.
Fixes: 872d26a391da ("nvmet-tcp: add NVMe over TCP target driver")
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Geliang Tang <tanggeliang@kylinos.cn>
Signed-off-by: Keith Busch <kbusch@kernel.org>
|
|
The timeout variable in nvme_wait_freeze_timeout() is an unsigned type.
Checking if it is <= 0 triggers a compiler warning because an unsigned
variable can never be negative.
Fix this warning by changing the type to long.
Reported-by: kernel test robot <lkp@intel.com>
Reported-by: Dan Carpenter <error27@gmail.com>
Closes: https://lore.kernel.org/r/202605211257.STzj2Ujv-lkp@intel.com/
Fixes: 23b6d2cbf75f ("nvme: remove redundant timeout argument from nvme_wait_freeze_timeout")
Signed-off-by: Maurizio Lombardi <mlombard@redhat.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
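Why the type matters can be shown with a short user-space sketch. The
two helpers below are hypothetical and are not taken from the driver;
they only contrast what an "expired" check can see with a signed and
an unsigned timeout.

```c
/* Signed: zero and negative values are both caught. */
int demo_expired_signed(long timeout)
{
	return timeout <= 0;
}

/*
 * Unsigned: a negative value has already wrapped to a large positive
 * one, so the only case "<= 0" can ever match is zero.
 */
int demo_expired_unsigned(unsigned long timeout)
{
	return timeout == 0;
}
```

Passing -1 to the signed helper reports expiry, while the same value
converted to unsigned long does not.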
|
|
NVMe multipath does not expose BLK_FEAT_PCI_P2PDMA on the head disk
even when all underlying controllers support it.
Set BLK_FEAT_PCI_P2PDMA unconditionally in nvme_mpath_alloc_disk()
alongside the other features. nvme_update_ns_info_block() already
calls queue_limits_stack_bdev() to stack each path's limits onto the
head disk, which routes through blk_stack_limits(). The core now
clears BLK_FEAT_PCI_P2PDMA automatically if any path (e.g., FC) does
not support it, consistent with how BLK_FEAT_NOWAIT and BLK_FEAT_POLL
are handled.
Tested-by: Pranjal Shrivastava <praan@google.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Nitesh Shetty <nj.shetty@samsung.com>
Signed-off-by: Kiran Kumar Modukuri <kmodukuri@nvidia.com>
Signed-off-by: Chaitanya Kulkarni <kch@nvidia.com>
Tested-by: Pranjal Shrivastava <praan@google.com>
Link: https://patch.msgid.link/20260513185153.95552-4-kch@nvidia.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
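The set-then-stack behaviour described above can be sketched as
follows. The flag values and demo_stack_features() are hypothetical
and do not match the kernel's BLK_FEAT_* definitions or the body of
blk_stack_limits(); the sketch only shows the "cleared if any path
lacks it" rule.

```c
/* Hypothetical flag values; the kernel's BLK_FEAT_* bits differ. */
#define DEMO_FEAT_NOWAIT	(1u << 0)
#define DEMO_FEAT_POLL		(1u << 1)
#define DEMO_FEAT_PCI_P2PDMA	(1u << 2)

/*
 * Stack one path's features onto the head: every all-or-nothing flag
 * that the path lacks is cleared on the head.
 */
unsigned int demo_stack_features(unsigned int top, unsigned int bottom)
{
	const unsigned int all_or_nothing = DEMO_FEAT_NOWAIT |
					    DEMO_FEAT_POLL |
					    DEMO_FEAT_PCI_P2PDMA;

	return top & ~(all_or_nothing & ~bottom);
}
```

Starting with the flag set on the head and stacking each path in turn
leaves it set only when every path advertises it.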
|
|
numa_node in blk_mq_hw_ctx and the matching argument of
blk_mq_ops::init_request can be NUMA_NO_NODE (-1). Declared as
unsigned int, NUMA_NO_NODE becomes UINT_MAX and walks off
nvme_dev::descriptor_pools[] on CONFIG_NUMA=n [1].
Switch the field and the callback prototype to int and update all
in-tree init_request implementations. No functional change:
cpu_to_node(), kmalloc_node() and blk_alloc_flush_queue() already
take int.
Link: https://lore.kernel.org/linux-nvme/20260522150628.399288-1-mateusz.nowicki@posteo.net/ [1]
Link: https://lore.kernel.org/linux-nvme/20260309062840.2937858-2-iam@sung-woo.kim/
Suggested-by: Caleb Sander Mateos <csander@purestorage.com>
Suggested-by: Sung-woo Kim <iam@sung-woo.kim>
Signed-off-by: Mateusz Nowicki <mateusz.nowicki@posteo.net>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://patch.msgid.link/20260523125210.272274-1-mateusz.nowicki@posteo.net
Signed-off-by: Jens Axboe <axboe@kernel.dk>
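The conversion at the heart of the bug is easy to reproduce in user
space. The helper names below are hypothetical; only NUMA_NO_NODE and
its value (-1) come from the kernel.

```c
#include <limits.h>

#define NUMA_NO_NODE (-1)	/* same value the kernel uses */

/* What an "unsigned int numa_node" parameter actually receives. */
unsigned int demo_node_as_unsigned(int node)
{
	return (unsigned int)node;
}

/* With a signed parameter the sentinel remains detectable. */
int demo_node_is_unset(int node)
{
	return node == NUMA_NO_NODE;
}
```

demo_node_as_unsigned(NUMA_NO_NODE) yields UINT_MAX, which is the
value that ends up being used as an array index when the field is
declared unsigned.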
|
|
We're leaking the initial DMA mapping during iteration if we fail to
allocate the tracking descriptor for both PRP and SGL. Unmap the
iterator directly; we can't use the existing unmap helper because it
depends on the tracking descriptor being successfully allocated, so a
new one for an in-use iterator is provided.
The mappings were also leaking when the driver detects an invalid
bio_vec when mapping PRPs, so fix that too.
Fixes: b8b7570a7ec87 ("nvme-pci: fix dma unmapping when using PRPs and not using the IOVA mapping")
Fixes: 7ce3c1dd78fca ("nvme-pci: convert the data mapping to blk_rq_dma_map")
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
|
|
We don't unmap P2P memory, so we don't need to track it. The dma_vec
allocation was getting leaked on the completion.
Fixes: b8b7570a7ec87 ("nvme-pci: fix dma unmapping when using PRPs and not using the IOVA mapping")
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
|
|
nvme_update_ns_info_block() trusts id->lbaf[lbaf].ds from the
controller and assigns it directly to ns->head->lba_shift without
bounds checking. nvme_lba_to_sect() then does:
return lba << (head->lba_shift - SECTOR_SHIFT);
When called with lba = le64_to_cpu(id->nsze) to compute the device
capacity, an attacker-controlled controller can choose ds < 9 or a
combination of (ds, nsze) that makes the left shift overflow
sector_t. The former is undefined behaviour in C, which UBSAN
reports as a BUG; the latter silently yields a bogus capacity that
the block layer then trusts for bounds checking.
Validate ds against SECTOR_SHIFT and use check_shl_overflow() to
compute capacity so that any (ds, nsze) combination that would
overflow sector_t is rejected. The namespace is skipped with
-ENODEV instead of crashing the kernel. This is reachable by a
malicious NVMe device, a buggy firmware, or an attacker-controlled
NVMe-oF target.
The check is performed before queue_limits_start_update() and
blk_mq_freeze_queue(), so the error path is a plain `goto out` with
no cleanup needed.
Stack trace (UBSAN, ds < 9 variant):
RIP: nvme_lba_to_sect drivers/nvme/host/nvme.h:699 [inline]
RIP: nvme_update_ns_info_block.cold+0x5/0x7
Call Trace:
nvme_update_ns_info+0x175/0xd90 drivers/nvme/host/core.c:2467
nvme_validate_ns drivers/nvme/host/core.c:4299 [inline]
nvme_scan_ns drivers/nvme/host/core.c:4350
nvme_scan_ns_async+0xa5/0xe0 drivers/nvme/host/core.c:4383
async_run_entry_fn
process_one_work
worker_thread
kthread
Found by Syzkaller.
Acked-by: Sungwoo Kim <iam@sung-woo.kim>
Acked-by: Dave Tian <daveti@purdue.edu>
Acked-by: Weidong Zhu <weizhu@fiu.edu>
Signed-off-by: Chao Shi <coshi036@gmail.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
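The two checks can be sketched in user-space C. This is not the
patch: demo_capacity() is a hypothetical helper, sector_t is assumed
to be 64 bits wide, and a hand-written test stands in for the
kernel's check_shl_overflow().

```c
#include <stdbool.h>
#include <stdint.h>

#define SECTOR_SHIFT 9

/*
 * Convert a namespace size in logical blocks to 512-byte sectors.
 * Returns false when ds is below SECTOR_SHIFT or when the shifted
 * value would not fit in 64 bits.
 */
bool demo_capacity(uint64_t nsze, unsigned int ds, uint64_t *sectors)
{
	unsigned int shift;

	if (ds < SECTOR_SHIFT)
		return false;		/* would be a negative shift */

	shift = ds - SECTOR_SHIFT;
	if (shift >= 64)
		return false;		/* shift wider than the type */
	if (shift && (nsze >> (64 - shift)) != 0)
		return false;		/* high bits would be lost */

	*sectors = nsze << shift;
	return true;
}
```

A caller would skip the namespace when the helper returns false
rather than use the resulting capacity.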
|