aboutsummaryrefslogtreecommitdiff
path: root/drivers/nvme/host/tcp.c
AgeCommit message (Collapse)AuthorFilesLines
2026-05-28nvme-tcp: store negative errno in queue->tls_errChuck Lever1-1/+1
nvme_tcp_tls_done() assigns queue->tls_err in three branches. The ENOKEY lookup failure and the EOPNOTSUPP initializer both store negative errnos. The third branch, reached when the handshake layer reports a non-zero status, stores -status. The handshake layer delivers status to the consumer callback as a negative errno; the other in-tree consumers -- xs_tls_handshake_done() and the nvmet target callback -- treat their status argument that way. The extra negation in nvme_tcp_tls_done() flips the sign, leaving tls_err as a positive value (for instance, +EIO), which nvme_tcp_start_tls() then returns to its caller. Drop the extra negation so queue->tls_err uniformly carries a negative errno on failure. Fixes: be8e82caa685 ("nvme-tcp: enable TLS handshake upcall") Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Reviewed-by: Hannes Reinecke <hare@kernel.org> Reviewed-by: Alistair Francis <alistair.francis@wdc.com> Link: https://patch.msgid.link/20260525-handshake-file-pin-v3-2-66c616906ead@oracle.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2026-04-22nvme-tcp: teardown circular locking fixesChaitanya Kulkarni1-7/+21
When a controller reset is triggered via sysfs (by writing to /sys/class/nvme/<nvmedev>/reset_controller), the reset work tears down and re-establishes all queues. The socket release using fput() defers the actual cleanup to task_work delayed_fput workqueue. This deferred cleanup can race with the subsequent queue re-allocation during reset, potentially leading to use-after-free or resource conflicts. Replace fput() with __fput_sync() to ensure synchronous socket release, guaranteeing that all socket resources are fully cleaned up before the function returns. This prevents races during controller reset where new queue setup may begin before the old socket is fully released. * Call chain during reset: nvme_reset_ctrl_work() -> nvme_tcp_teardown_ctrl() -> nvme_tcp_teardown_io_queues() -> nvme_tcp_free_io_queues() -> nvme_tcp_free_queue() <-- fput() -> __fput_sync() -> nvme_tcp_teardown_admin_queue() -> nvme_tcp_free_admin_queue() -> nvme_tcp_free_queue() <-- fput() -> __fput_sync() -> nvme_tcp_setup_ctrl() <-- race with deferred fput memalloc_noreclaim_save() sets PF_MEMALLOC which is intended for tasks performing memory reclaim work that need reserve access. While PF_MEMALLOC prevents the task from entering direct reclaim (causing __need_reclaim() to return false), it does not strip __GFP_IO from gfp flags. The allocator can therefore still trigger writeback I/O when __GFP_IO remains set, which is unsafe when the caller holds block layer locks. Switch to memalloc_noio_save() which sets PF_MEMALLOC_NOIO. This causes current_gfp_context() to strip __GFP_IO|__GFP_FS from every allocation in the scope, making it safe to allocate memory while holding elevator_lock and set->srcu. * The issue can be reproduced using blktests: nvme_trtype=tcp ./check nvme/005 blktests (master) # nvme_trtype=tcp ./check nvme/005 nvme/005 (tr=tcp) (reset local loopback target) [failed] runtime 0.725s ... 0.798s something found in dmesg: [ 108.473940] run blktests nvme/005 at 2025-11-22 16:12:20 [...] ... (See '/root/blktests/results/nodev_tr_tcp/nvme/005.dmesg' for the entire message) blktests (master) # cat /root/blktests/results/nodev_tr_tcp/nvme/005.dmesg [ 108.473940] run blktests nvme/005 at 2025-11-22 16:12:20 [ 108.526983] loop0: detected capacity change from 0 to 2097152 [ 108.555606] nvmet: adding nsid 1 to subsystem blktests-subsystem-1 [ 108.572531] nvmet_tcp: enabling port 0 (127.0.0.1:4420) [ 108.613061] nvmet: Created nvm controller 1 for subsystem blktests-subsystem-1 for NQN nqn.2014-08.org.nvmexpress:uuid:0f01fb42-9f7f-4856-b0b3-51e60b8de349. [ 108.616832] nvme nvme0: creating 48 I/O queues. [ 108.630791] nvme nvme0: mapped 48/0/0 default/read/poll queues. [ 108.661892] nvme nvme0: new ctrl: NQN "blktests-subsystem-1", addr 127.0.0.1:4420, hostnqn: nqn.2014-08.org.nvmexpress:uuid:0f01fb42-9f7f-4856-b0b3-51e60b8de349 [ 108.746639] nvmet: Created nvm controller 2 for subsystem blktests-subsystem-1 for NQN nqn.2014-08.org.nvmexpress:uuid:0f01fb42-9f7f-4856-b0b3-51e60b8de349. [ 108.748466] nvme nvme0: creating 48 I/O queues. [ 108.802984] nvme nvme0: mapped 48/0/0 default/read/poll queues. [ 108.829983] nvme nvme0: Removing ctrl: NQN "blktests-subsystem-1" [ 108.854288] block nvme0n1: no available path - failing I/O [ 108.854344] block nvme0n1: no available path - failing I/O [ 108.854373] Buffer I/O error on dev nvme0n1, logical block 1, async page read [ 108.891693] ====================================================== [ 108.895912] WARNING: possible circular locking dependency detected [ 108.900184] 6.17.0nvme+ #3 Tainted: G N [ 108.903913] ------------------------------------------------------ [ 108.908171] nvme/2734 is trying to acquire lock: [ 108.911957] ffff88810210e610 (set->srcu){.+.+}-{0:0}, at: __synchronize_srcu+0x17/0x170 [ 108.917587] but task is already holding lock: [ 108.921570] ffff88813abea198 (&q->elevator_lock){+.+.}-{4:4}, at: elevator_change+0xa8/0x1c0 [ 108.927361] which lock already depends on the new lock. [ 108.933018] the existing dependency chain (in reverse order) is: [ 108.938223] -> #4 (&q->elevator_lock){+.+.}-{4:4}: [ 108.942988] __mutex_lock+0xa2/0x1150 [ 108.945873] elevator_change+0xa8/0x1c0 [ 108.948925] elv_iosched_store+0xdf/0x140 [ 108.952043] kernfs_fop_write_iter+0x16a/0x220 [ 108.955367] vfs_write+0x378/0x520 [ 108.957598] ksys_write+0x67/0xe0 [ 108.959721] do_syscall_64+0x76/0xbb0 [ 108.962052] entry_SYSCALL_64_after_hwframe+0x76/0x7e [ 108.965145] -> #3 (&q->q_usage_counter(io)){++++}-{0:0}: [ 108.968923] blk_alloc_queue+0x30e/0x350 [ 108.972117] blk_mq_alloc_queue+0x61/0xd0 [ 108.974677] scsi_alloc_sdev+0x2a0/0x3e0 [ 108.977092] scsi_probe_and_add_lun+0x1bd/0x430 [ 108.979921] __scsi_add_device+0x109/0x120 [ 108.982504] ata_scsi_scan_host+0x97/0x1c0 [ 108.984365] async_run_entry_fn+0x2d/0x130 [ 108.986109] process_one_work+0x20e/0x630 [ 108.987830] worker_thread+0x184/0x330 [ 108.989473] kthread+0x10a/0x250 [ 108.990852] ret_from_fork+0x297/0x300 [ 108.992491] ret_from_fork_asm+0x1a/0x30 [ 108.994159] -> #2 (fs_reclaim){+.+.}-{0:0}: [ 108.996320] fs_reclaim_acquire+0x99/0xd0 [ 108.998058] kmem_cache_alloc_node_noprof+0x4e/0x3c0 [ 109.000123] __alloc_skb+0x15f/0x190 [ 109.002195] tcp_send_active_reset+0x3f/0x1e0 [ 109.004038] tcp_disconnect+0x50b/0x720 [ 109.005695] __tcp_close+0x2b8/0x4b0 [ 109.007227] tcp_close+0x20/0x80 [ 109.008663] inet_release+0x31/0x60 [ 109.010175] __sock_release+0x3a/0xc0 [ 109.011778] sock_close+0x14/0x20 [ 109.013263] __fput+0xee/0x2c0 [ 109.014673] delayed_fput+0x31/0x50 [ 109.016183] process_one_work+0x20e/0x630 [ 109.017897] worker_thread+0x184/0x330 [ 109.019543] kthread+0x10a/0x250 [ 109.020929] ret_from_fork+0x297/0x300 [ 109.022565] ret_from_fork_asm+0x1a/0x30 [ 109.024194] -> #1 (sk_lock-AF_INET-NVME){+.+.}-{0:0}: [ 109.026634] lock_sock_nested+0x2e/0x70 [ 109.028251] tcp_sendmsg+0x1a/0x40 [ 109.029783] sock_sendmsg+0xed/0x110 [ 109.031321] nvme_tcp_try_send_cmd_pdu+0x13e/0x260 [nvme_tcp] [ 109.034263] nvme_tcp_try_send+0xb3/0x330 [nvme_tcp] [ 109.036375] nvme_tcp_queue_rq+0x342/0x3d0 [nvme_tcp] [ 109.038528] blk_mq_dispatch_rq_list+0x297/0x800 [ 109.040448] __blk_mq_sched_dispatch_requests+0x3db/0x5f0 [ 109.042677] blk_mq_sched_dispatch_requests+0x29/0x70 [ 109.044787] blk_mq_run_work_fn+0x76/0x1b0 [ 109.046535] process_one_work+0x20e/0x630 [ 109.048245] worker_thread+0x184/0x330 [ 109.049890] kthread+0x10a/0x250 [ 109.051331] ret_from_fork+0x297/0x300 [ 109.053024] ret_from_fork_asm+0x1a/0x30 [ 109.054740] -> #0 (set->srcu){.+.+}-{0:0}: [ 109.056850] __lock_acquire+0x1468/0x2210 [ 109.058614] lock_sync+0xa5/0x110 [ 109.060048] __synchronize_srcu+0x49/0x170 [ 109.061802] elevator_switch+0xc9/0x330 [ 109.063950] elevator_change+0x128/0x1c0 [ 109.065675] elevator_set_none+0x4c/0x90 [ 109.067316] blk_unregister_queue+0xa8/0x110 [ 109.069165] __del_gendisk+0x14e/0x3c0 [ 109.070824] del_gendisk+0x75/0xa0 [ 109.072328] nvme_ns_remove+0xf2/0x230 [nvme_core] [ 109.074365] nvme_remove_namespaces+0xf2/0x150 [nvme_core] [ 109.076652] nvme_do_delete_ctrl+0x71/0x90 [nvme_core] [ 109.078775] nvme_delete_ctrl_sync+0x3b/0x50 [nvme_core] [ 109.081009] nvme_sysfs_delete+0x34/0x40 [nvme_core] [ 109.083082] kernfs_fop_write_iter+0x16a/0x220 [ 109.085009] vfs_write+0x378/0x520 [ 109.086539] ksys_write+0x67/0xe0 [ 109.087982] do_syscall_64+0x76/0xbb0 [ 109.089577] entry_SYSCALL_64_after_hwframe+0x76/0x7e [ 109.091665] other info that might help us debug this: [ 109.095478] Chain exists of: set->srcu --> &q->q_usage_counter(io) --> &q->elevator_lock [ 109.099544] Possible unsafe locking scenario: [ 109.101708] CPU0 CPU1 [ 109.103402] ---- ---- [ 109.105103] lock(&q->elevator_lock); [ 109.106530] lock(&q->q_usage_counter(io)); [ 109.109022] lock(&q->elevator_lock); [ 109.111391] sync(set->srcu); [ 109.112586] *** DEADLOCK *** [ 109.114772] 5 locks held by nvme/2734: [ 109.116189] #0: ffff888101925410 (sb_writers#4){.+.+}-{0:0}, at: ksys_write+0x67/0xe0 [ 109.119143] #1: ffff88817a914e88 (&of->mutex#2){+.+.}-{4:4}, at: kernfs_fop_write_iter+0x10f/0x220 [ 109.123141] #2: ffff8881046313f8 (kn->active#185){++++}-{0:0}, at: sysfs_remove_file_self+0x26/0x50 [ 109.126543] #3: ffff88810470e1d0 (&set->update_nr_hwq_lock){++++}-{4:4}, at: del_gendisk+0x6d/0xa0 [ 109.129891] #4: ffff88813abea198 (&q->elevator_lock){+.+.}-{4:4}, at: elevator_change+0xa8/0x1c0 [ 109.133149] stack backtrace: [ 109.134817] CPU: 6 UID: 0 PID: 2734 Comm: nvme Tainted: G N 6.17.0nvme+ #3 PREEMPT(voluntary) [ 109.134819] Tainted: [N]=TEST [ 109.134820] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.3-0-ga6ed6b701f0a-prebuilt.qemu.org 04/01/2014 [ 109.134821] Call Trace: [ 109.134823] <TASK> [ 109.134824] dump_stack_lvl+0x75/0xb0 [ 109.134828] print_circular_bug+0x26a/0x330 [ 109.134831] check_noncircular+0x12f/0x150 [ 109.134834] __lock_acquire+0x1468/0x2210 [ 109.134837] ? __synchronize_srcu+0x17/0x170 [ 109.134838] lock_sync+0xa5/0x110 [ 109.134840] ? __synchronize_srcu+0x17/0x170 [ 109.134842] __synchronize_srcu+0x49/0x170 [ 109.134843] ? mark_held_locks+0x49/0x80 [ 109.134845] ? _raw_spin_unlock_irqrestore+0x2d/0x60 [ 109.134847] ? kvm_clock_get_cycles+0x14/0x30 [ 109.134853] ? ktime_get_mono_fast_ns+0x36/0xb0 [ 109.134858] elevator_switch+0xc9/0x330 [ 109.134860] elevator_change+0x128/0x1c0 [ 109.134862] ? kernfs_put.part.0+0x86/0x290 [ 109.134864] elevator_set_none+0x4c/0x90 [ 109.134866] blk_unregister_queue+0xa8/0x110 [ 109.134868] __del_gendisk+0x14e/0x3c0 [ 109.134870] del_gendisk+0x75/0xa0 [ 109.134872] nvme_ns_remove+0xf2/0x230 [nvme_core] [ 109.134879] nvme_remove_namespaces+0xf2/0x150 [nvme_core] [ 109.134887] nvme_do_delete_ctrl+0x71/0x90 [nvme_core] [ 109.134893] nvme_delete_ctrl_sync+0x3b/0x50 [nvme_core] [ 109.134899] nvme_sysfs_delete+0x34/0x40 [nvme_core] [ 109.134905] kernfs_fop_write_iter+0x16a/0x220 [ 109.134908] vfs_write+0x378/0x520 [ 109.134911] ksys_write+0x67/0xe0 [ 109.134913] do_syscall_64+0x76/0xbb0 [ 109.134915] entry_SYSCALL_64_after_hwframe+0x76/0x7e [ 109.134916] RIP: 0033:0x7fd68a737317 [ 109.134917] Code: 0d 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b7 0f 1f 00 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 51 c3 48 83 ec 28 48 89 54 24 18 48 89 74 24 [ 109.134919] RSP: 002b:00007ffded1546d8 EFLAGS: 00000246 ORIG_RAX: 0000000000000001 [ 109.134920] RAX: ffffffffffffffda RBX: 000000000054f7e0 RCX: 00007fd68a737317 [ 109.134921] RDX: 0000000000000001 RSI: 00007fd68a855719 RDI: 0000000000000003 [ 109.134921] RBP: 0000000000000003 R08: 0000000030407850 R09: 00007fd68a7cd4e0 [ 109.134922] R10: 00007fd68a65b130 R11: 0000000000000246 R12: 00007fd68a855719 [ 109.134923] R13: 00000000304074c0 R14: 00000000304074c0 R15: 0000000030408660 [ 109.134926] </TASK> [ 109.962756] Key type psk unregistered Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Chaitanya Kulkarni <kch@nvidia.com> Signed-off-by: Keith Busch <kbusch@kernel.org>
2026-04-08nvme: add missing MODULE_ALIAS for fabrics transportsGeliang Tang1-0/+1
The generic fabrics layer uses request_module("nvme-%s", opts->transport) to auto-load transport modules. Currently, the nvme-tcp, nvme-rdma, and nvme-fc modules lack MODULE_ALIAS entries for these names, which prevents the kernel from automatically finding and loading them when requested. Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Geliang Tang <tanggeliang@kylinos.cn> Signed-off-by: Keith Busch <kbusch@kernel.org>
2026-03-06Merge tag 'block-7.0-20260305' of ↵Linus Torvalds1-2/+3
git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux Pull block fixes from Jens Axboe: - NVMe pull request via Keith: - Improve quirk visibility and configurability (Maurizio) - Fix runtime user modification to queue setup (Keith) - Fix multipath leak on try_module_get failure (Keith) - Ignore ambiguous spec definitions for better atomics support (John) - Fix admin queue leak on controller reset (Ming) - Fix large allocation in persistent reservation read keys (Sungwoo Kim) - Fix fcloop callback handling (Justin) - Securely free DHCHAP secrets (Daniel) - Various cleanups and typo fixes (John, Wilfred) - Avoid a circular lock dependency issue in the sysfs nr_requests or scheduler store handling - Fix a circular lock dependency with the pcpu mutex and the queue freeze lock - Cleanup for bio_copy_kern(), using __bio_add_page() rather than the bio_add_page(), as adding a page here cannot fail. The exiting code had broken cleanup for the error condition, so make it clear that the error condition cannot happen - Fix for a __this_cpu_read() in preemptible context splat * tag 'block-7.0-20260305' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux: block: use trylock to avoid lockdep circular dependency in sysfs nvme: fix memory allocation in nvme_pr_read_keys() block: use __bio_add_page in bio_copy_kern block: break pcpu_alloc_mutex dependency on freeze_lock blktrace: fix __this_cpu_read/write in preemptible context nvme-multipath: fix leak on try_module_get failure nvmet-fcloop: Check remoteport port_state before calling done callback nvme-pci: do not try to add queue maps at runtime nvme-pci: cap queue creation to used queues nvme-pci: ensure we're polling a polled queue nvme: fix memory leak in quirks_param_set() nvme: correct comment about nvme_ns_remove() nvme: stop setting namespace gendisk device driver data nvme: add support for dynamic quirk configuration via module parameter nvme: fix admin queue leak on controller reset nvme-fabrics: use kfree_sensitive() for DHCHAP secrets nvme: stop using AWUPF nvme: expose active quirks in sysfs nvme/host: fixup some typos
2026-02-21Convert more 'alloc_obj' cases to default GFP_KERNEL argumentsLinus Torvalds1-2/+1
This converts some of the visually simpler cases that have been split over multiple lines. I only did the ones that are easy to verify the resulting diff by having just that final GFP_KERNEL argument on the next line. Somebody should probably do a proper coccinelle script for this, but for me the trivial script actually resulted in an assertion failure in the middle of the script. I probably had made it a bit _too_ trivial. So after fighting that far a while I decided to just do some of the syntactically simpler cases with variations of the previous 'sed' scripts. The more syntactically complex multi-line cases would mostly really want whitespace cleanup anyway. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2026-02-21Convert 'alloc_obj' family to use the new default GFP_KERNEL argumentLinus Torvalds1-3/+3
This was done entirely with mindless brute force, using git grep -l '\<k[vmz]*alloc_objs*(.*, GFP_KERNEL)' | xargs sed -i 's/\(alloc_objs*(.*\), GFP_KERNEL)/\1)/' to convert the new alloc_obj() users that had a simple GFP_KERNEL argument to just drop that argument. Note that due to the extreme simplicity of the scripting, any slightly more complex cases spread over multiple lines would not be triggered: they definitely exist, but this covers the vast bulk of the cases, and the resulting diff is also then easier to check automatically. For the same reason the 'flex' versions will be done as a separate conversion. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2026-02-21treewide: Replace kmalloc with kmalloc_obj for non-scalar typesKees Cook1-5/+5
This is the result of running the Coccinelle script from scripts/coccinelle/api/kmalloc_objs.cocci. The script is designed to avoid scalar types (which need careful case-by-case checking), and instead replace kmalloc-family calls that allocate struct or union object instances: Single allocations: kmalloc(sizeof(TYPE), ...) are replaced with: kmalloc_obj(TYPE, ...) Array allocations: kmalloc_array(COUNT, sizeof(TYPE), ...) are replaced with: kmalloc_objs(TYPE, COUNT, ...) Flex array allocations: kmalloc(struct_size(PTR, FAM, COUNT), ...) are replaced with: kmalloc_flex(*PTR, FAM, COUNT, ...) (where TYPE may also be *VAR) The resulting allocations no longer return "void *", instead returning "TYPE *". Signed-off-by: Kees Cook <kees@kernel.org>
2026-01-13nvme/host: fixup some typosWilfred Mallawa1-2/+3
Fix up some minor typos in the nvme host driver and a comment style to conform to the standard kernel style. Signed-off-by: Wilfred Mallawa <wilfred.mallawa@wdc.com> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Keith Busch <kbusch@kernel.org>
2025-12-03Merge tag 'for-6.19/block-20251201' of ↵Linus Torvalds1-0/+1
git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux Pull block updates from Jens Axboe: - Fix head insertion for mq-deadline, a regression from when priority support was added - Series simplifying and improving the ublk user copy code - Various ublk related cleanups - Fixup REQ_NOWAIT handling in loop/zloop, clearing NOWAIT when the request is punted to a thread for handling - Merge and then later revert loop dio nowait support, as it ended up causing excessive stack usage for when the inline issue code needs to dip back into the full file system code - Improve auto integrity code, making it less deadlock prone - Speedup polled IO handling, but manually managing the hctx lookups - Fixes for blk-throttle for SSD devices - Small series with fixes for the S390 dasd driver - Add support for caching zones, avoiding unnecessary report zone queries - MD pull requests via Yu: - fix null-ptr-dereference regression for dm-raid0 - fix IO hang for raid5 when array is broken with IO inflight - remove legacy 1s delay to speed up system shutdown - change maintainer's email address - data can be lost if array is created with different lbs devices, fix this problem and record lbs of the array in metadata - fix rcu protection for md_thread - fix mddev kobject lifetime regression - enable atomic writes for md-linear - some cleanups - bcache updates via Coly - remove useless discard and cache device code - improve usage of per-cpu workqueues - Reorganize the IO scheduler switching code, fixing some lockdep reports as well - Improve the block layer P2P DMA support - Add support to the block tracing code for zoned devices - Segment calculation improves, and memory alignment flexibility improvements - Set of prep and cleanups patches for ublk batching support. The actual batching hasn't been added yet, but helps shrink down the workload of getting that patchset ready for 6.20 - Fix for how the ps3 block driver handles segments offsets - Improve how block plugging handles batch tag allocations - nbd fixes for use-after-free of the configuration on device clear/put - Set of improvements and fixes for zloop - Add Damien as maintainer of the block zoned device code handling - Various other fixes and cleanups * tag 'for-6.19/block-20251201' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux: (162 commits) block/rnbd: correct all kernel-doc complaints blk-mq: use queue_hctx in blk_mq_map_queue_type md: remove legacy 1s delay in md_notify_reboot md/raid5: fix IO hang when array is broken with IO inflight md: warn about updating super block failure md/raid0: fix NULL pointer dereference in create_strip_zones() for dm-raid sbitmap: fix all kernel-doc warnings ublk: add helper of __ublk_fetch() ublk: pass const pointer to ublk_queue_is_zoned() ublk: refactor auto buffer register in ublk_dispatch_req() ublk: add `union ublk_io_buf` with improved naming ublk: add parameter `struct io_uring_cmd *` to ublk_prep_auto_buf_reg() kfifo: add kfifo_alloc_node() helper for NUMA awareness blk-mq: fix potential uaf for 'queue_hw_ctx' blk-mq: use array manage hctx map instead of xarray ublk: prevent invalid access with DEBUG s390/dasd: Use scnprintf() instead of sprintf() s390/dasd: Move device name formatting into separate function s390/dasd: Remove unnecessary debugfs_create() return checks s390/dasd: Fix gendisk parent after copy pair swap ...
2025-11-06nvme: remove virtual boundary for sgl capable devicesKeith Busch1-0/+1
The nvme virtual boundary is only required for the PRP format. Devices that can use SGL for DMA don't need it for IO queues. Drop reporting it for such devices; rdma fabrics controllers will continue to use the limit as they currently don't report any boundary requirements, but tcp and fc never needed it in the first place so they get to report no virtual boundary. Applications may continue to align to the same virtual boundaries for optimization purposes if they want, and the driver will continue to decide whether to use the PRP format the same as before if the IO allows it. Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Keith Busch <kbusch@kernel.org> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-11-04net: Convert proto_ops connect() callbacks to use sockaddr_unsizedKees Cook1-1/+1
Update all struct proto_ops connect() callback function prototypes from "struct sockaddr *" to "struct sockaddr_unsized *" to avoid lying to the compiler about object sizes. Calls into struct proto handlers gain casts that will be removed in the struct proto conversion patch. No binary changes expected. Signed-off-by: Kees Cook <kees@kernel.org> Link: https://patch.msgid.link/20251104002617.2752303-3-kees@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-11-04net: Convert proto_ops bind() callbacks to use sockaddr_unsizedKees Cook1-1/+1
Update all struct proto_ops bind() callback function prototypes from "struct sockaddr *" to "struct sockaddr_unsized *" to avoid lying to the compiler about object sizes. Calls into struct proto handlers gain casts that will be removed in the struct proto conversion patch. No binary changes expected. Signed-off-by: Kees Cook <kees@kernel.org> Link: https://patch.msgid.link/20251104002617.2752303-2-kees@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-10-16nvme/tcp: handle tls partially sent records in write_space()Wilfred Mallawa1-0/+3
With TLS enabled, records that are encrypted and appended to TLS TX list can fail to see a retry if the underlying TCP socket is busy, for example, hitting an EAGAIN from tcp_sendmsg_locked(). This is not known to the NVMe TCP driver, as the TLS layer successfully generated a record. Typically, the TLS write_space() callback would ensure such records are retried, but in the NVMe TCP Host driver, write_space() invokes nvme_tcp_write_space(). This causes a partially sent record in the TLS TX list to timeout after not being retried. This patch fixes the above by calling queue->write_space(), which calls into the TLS layer to retry any pending records. Fixes: be8e82caa685 ("nvme-tcp: enable TLS handshake upcall") Signed-off-by: Wilfred Mallawa <wilfred.mallawa@wdc.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Keith Busch <kbusch@kernel.org>
2025-09-15nvme-tcp: send only permitted commands for secure concatMartin George1-0/+3
In addition to sending permitted commands such as connect/auth over the initial unencrypted admin connection as part of secure channel concatenation, the host also sends commands such as Property Get and Identify on the same. This is a spec violation leading to secure concat failures. Fix this by ensuring these additional commands are avoided on this connection. Fixes: 104d0e2f6222 ("nvme-fabrics: reset admin connection for secure concatenation") Signed-off-by: Martin George <marting@netapp.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Keith Busch <kbusch@kernel.org>
2025-07-31nvme: fix various comment typosBjorn Helgaas1-1/+1
Fix typos in comments. Signed-off-by: Bjorn Helgaas <bhelgaas@google.com> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
2025-07-17nvme-tcp: log TLS handshake failures at error levelMaurizio Lombardi1-3/+8
Update the nvme_tcp_start_tls() function to use dev_err() instead of dev_dbg() when a TLS error is detected. This ensures that handshake failures are visible by default, aiding in debugging. Signed-off-by: Maurizio Lombardi <mlombard@redhat.com> Reviewed-by: Laurence Oberman <loberman@redhat.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Christoph Hellwig <hch@lst.de>
2025-06-06Merge tag 'block-6.16-20250606' of git://git.kernel.dk/linuxLinus Torvalds1-3/+21
Pull more block updates from Jens Axboe: - NVMe pull request via Christoph: - TCP error handling fix (Shin'ichiro Kawasaki) - TCP I/O stall handling fixes (Hannes Reinecke) - fix command limits status code (Keith Busch) - support vectored buffers also for passthrough (Pavel Begunkov) - spelling fixes (Yi Zhang) - MD pull request via Yu: - fix REQ_RAHEAD and REQ_NOWAIT IO err handling for raid1/10 - fix max_write_behind setting for dm-raid - some minor cleanups - Integrity data direction fix and cleanup - bcache NULL pointer fix - Fix for loop missing write start/end handling - Decouple hardware queues and IO threads in ublk - Slew of ublk selftests additions and updates * tag 'block-6.16-20250606' of git://git.kernel.dk/linux: (29 commits) nvme: spelling fixes nvme-tcp: fix I/O stalls on congested sockets nvme-tcp: sanitize request list handling nvme-tcp: remove tag set when second admin queue config fails nvme: enable vectored registered bufs for passthrough cmds nvme: fix implicit bool to flags conversion nvme: fix command limits status code selftests: ublk: kublk: improve behavior on init failure block: flip iter directions in blk_rq_integrity_map_user() block: drop direction param from bio_integrity_copy_user() selftests: ublk: cover PER_IO_DAEMON in more stress tests Documentation: ublk: document UBLK_F_PER_IO_DAEMON selftests: ublk: add stress test for per io daemons selftests: ublk: add functional test for per io daemons selftests: ublk: kublk: decouple ublk_queues from ublk server threads selftests: ublk: kublk: move per-thread data out of ublk_queue selftests: ublk: kublk: lift queue initialization out of thread selftests: ublk: kublk: tie sqe allocation to io instead of queue selftests: ublk: kublk: plumb q_id in io_uring user_data ublk: have a per-io daemon instead of a per-queue daemon ...
2025-06-04nvme-tcp: fix I/O stalls on congested socketsHannes Reinecke1-1/+6
When the socket is busy processing nvme_tcp_try_recv() might return -EAGAIN, but this doesn't automatically imply that the sending side is blocked, too. So check if there are pending requests once nvme_tcp_try_recv() returns -EAGAIN and continue with the sending loop to avoid I/O stalls. Signed-off-by: Hannes Reinecke <hare@kernel.org> Acked-by: Chris Leech <cleech@redhat.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Christoph Hellwig <hch@lst.de>
2025-06-04nvme-tcp: sanitize request list handlingHannes Reinecke1-1/+14
Validate the request in nvme_tcp_handle_r2t() to ensure it's not part of any list, otherwise a malicious R2T PDU might inject a loop in request list processing. Signed-off-by: Hannes Reinecke <hare@kernel.org> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Christoph Hellwig <hch@lst.de>
2025-06-04nvme-tcp: remove tag set when second admin queue config failsShin'ichiro Kawasaki1-1/+1
Commit 104d0e2f6222 ("nvme-fabrics: reset admin connection for secure concatenation") modified nvme_tcp_setup_ctrl() to call nvme_tcp_configure_admin_queue() twice. The first call prepares for DH-CHAP negotitation, and the second call is required for secure concatenation. However, this change triggered BUG KASAN slab-use-after- free in blk_mq_queue_tag_busy_iter(). This BUG can be recreated by repeating the blktests test case nvme/063 a few times [1]. When the BUG happens, nvme_tcp_create_ctrl() fails in the call chain below: nvme_tcp_create_ctrl() nvme_tcp_alloc_ctrl() new=true ... Alloc nvme_tcp_ctrl and admin_tag_set nvme_tcp_setup_ctrl() new=true nvme_tcp_configure_admin_queue() new=true ... Succeed nvme_alloc_admin_tag_set() ... Alloc the tag set for admin_tag_set nvme_stop_keep_alive() nvme_tcp_teardown_admin_queue() remove=false nvme_tcp_configure_admin_queue() new=false nvme_tcp_alloc_admin_queue() ... Fail, but do not call nvme_remove_admin_tag_set() nvme_uninit_ctrl() nvme_put_ctrl() ... Free up the nvme_tcp_ctrl and admin_tag_set The first call of nvme_tcp_configure_admin_queue() succeeds with new=true argument. The second call fails with new=false argument. This second call does not call nvme_remove_admin_tag_set() on failure, due to the new=false argument. Then the admin tag set is not removed. However, nvme_tcp_create_ctrl() assumes that nvme_tcp_setup_ctrl() would call nvme_remove_admin_tag_set(). Then it frees up struct nvme_tcp_ctrl which has admin_tag_set field. Later on, the timeout handler accesses the admin_tag_set field and causes the BUG KASAN slab-use-after-free. To not leave the admin tag set, call nvme_remove_admin_tag_set() when the second nvme_tcp_configure_admin_queue() call fails. Do not return from nvme_tcp_setup_ctrl() on failure. Instead, jump to "destroy_admin" go-to label to call nvme_tcp_teardown_admin_queue() which calls nvme_remove_admin_tag_set(). Fixes: 104d0e2f6222 ("nvme-fabrics: reset admin connection for secure concatenation") Cc: stable@vger.kernel.org Link: https://lore.kernel.org/linux-nvme/6mhxskdlbo6fk6hotsffvwriauurqky33dfb3s44mqtr5dsxmf@gywwmnyh3twm/ [1] Signed-off-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Christoph Hellwig <hch@lst.de>
2025-05-28Merge tag 'net-next-6.16' of ↵Linus Torvalds1-84/+40
git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next Pull networking updates from Paolo Abeni: "Core: - Implement the Device Memory TCP transmit path, allowing zero-copy data transmission on top of TCP from e.g. GPU memory to the wire. - Move all the IPv6 routing tables management outside the RTNL scope, under its own lock and RCU. The route control path is now 3x times faster. - Convert queue related netlink ops to instance lock, reducing again the scope of the RTNL lock. This improves the control plane scalability. - Refactor the software crc32c implementation, removing unneeded abstraction layers and improving significantly the related micro-benchmarks. - Optimize the GRO engine for UDP-tunneled traffic, for a 10% performance improvement in related stream tests. - Cover more per-CPU storage with local nested BH locking; this is a prep work to remove the current per-CPU lock in local_bh_disable() on PREMPT_RT. - Introduce and use nlmsg_payload helper, combining buffer bounds verification with accessing payload carried by netlink messages. Netfilter: - Rewrite the procfs conntrack table implementation, improving considerably the dump performance. A lot of user-space tools still use this interface. - Implement support for wildcard netdevice in netdev basechain and flowtables. - Integrate conntrack information into nft trace infrastructure. - Export set count and backend name to userspace, for better introspection. BPF: - BPF qdisc support: BPF-qdisc can be implemented with BPF struct_ops programs and can be controlled in similar way to traditional qdiscs using the "tc qdisc" command. - Refactor the UDP socket iterator, addressing long standing issues WRT duplicate hits or missed sockets. Protocols: - Improve TCP receive buffer auto-tuning and increase the default upper bound for the receive buffer; overall this improves the single flow maximum thoughput on 200Gbs link by over 60%. - Add AFS GSSAPI security class to AF_RXRPC; it provides transport security for connections to the AFS fileserver and VL server. - Improve TCP multipath routing, so that the sources address always matches the nexthop device. - Introduce SO_PASSRIGHTS for AF_UNIX, to allow disabling SCM_RIGHTS, and thus preventing DoS caused by passing around problematic FDs. - Retire DCCP socket. DCCP only receives updates for bugs, and major distros disable it by default. Its removal allows for better organisation of TCP fields to reduce the number of cache lines hit in the fast path. - Extend TCP drop-reason support to cover PAWS checks. Driver API: - Reorganize PTP ioctl flag support to require an explicit opt-in for the drivers, avoiding the problem of drivers not rejecting new unsupported flags. - Converted several device drivers to timestamping APIs. - Introduce per-PHY ethtool dump helpers, improving the support for dump operations targeting PHYs. Tests and tooling: - Add support for classic netlink in user space C codegen, so that ynl-c can now read, create and modify links, routes addresses and qdisc layer configuration. - Add ynl sub-types for binary attributes, allowing ynl-c to output known struct instead of raw binary data, clarifying the classic netlink output. - Extend MPTCP selftests to improve the code-coverage. - Add tests for XDP tail adjustment in AF_XDP. New hardware / drivers: - OpenVPN virtual driver: offload OpenVPN data channels processing to the kernel-space, increasing the data transfer throughput WRT the user-space implementation. - Renesas glue driver for the gigabit ethernet RZ/V2H(P) SoC. - Broadcom asp-v3.0 ethernet driver. - AMD Renoir ethernet device. - ReakTek MT9888 2.5G ethernet PHY driver. - Aeonsemi 10G C45 PHYs driver. Drivers: - Ethernet high-speed NICs: - nVidia/Mellanox (mlx5): - refactor the steering table handling to significantly reduce the amount of memory used - add support for complex matches in H/W flow steering - improve flow streeing error handling - convert to netdev instance locking - Intel (100G, ice, igb, ixgbe, idpf): - ice: add switchdev support for LLDP traffic over VF - ixgbe: add firmware manipulation and regions devlink support - igb: introduce support for frame transmission premption - igb: adds persistent NAPI configuration - idpf: introduce RDMA support - idpf: add initial PTP support - Meta (fbnic): - extend hardware stats coverage - add devlink dev flash support - Broadcom (bnxt): - add support for RX-side device memory TCP - Wangxun (txgbe): - implement support for udp tunnel offload - complete PTP and SRIOV support for AML 25G/10G devices - Ethernet NICs embedded and virtual: - Google (gve): - add device memory TCP TX support - Amazon (ena): - support persistent per-NAPI config - Airoha: - add H/W support for L2 traffic offload - add per flow stats for flow offloading - RealTek (rtl8211): add support for WoL magic packet - Synopsys (stmmac): - dwmac-socfpga 1000BaseX support - add Loongson-2K3000 support - introduce support for hardware-accelerated VLAN stripping - Broadcom (bcmgenet): - expose more H/W stats - Freescale (enetc, dpaa2-eth): - enetc: add MAC filter, VLAN filter RSS and loopback support - dpaa2-eth: convert to H/W timestamping APIs - vxlan: convert FDB table to rhashtable, for better scalabilty - veth: apply qdisc backpressure on full ring to reduce TX drops - Ethernet switches: - Microchip (kzZ88x3): add ETS scheduler support - Ethernet PHYs: - RealTek (rtl8211): - add support for WoL magic packet - add support for PHY LEDs - CAN: - Adds RZ/G3E CANFD support to the rcar_canfd driver. - Preparatory work for CAN-XL support. - Add self-tests framework with support for CAN physical interfaces. - WiFi: - mac80211: - scan improvements with multi-link operation (MLO) - Qualcomm (ath12k): - enable AHB support for IPQ5332 - add monitor interface support to QCN9274 - add multi-link operation support to WCN7850 - add 802.11d scan offload support to WCN7850 - monitor mode for WCN7850, better 6 GHz regulatory - Qualcomm (ath11k): - restore hibernation support - MediaTek (mt76): - WiFi-7 improvements - implement support for mt7990 - Intel (iwlwifi): - enhanced multi-link single-radio (EMLSR) support on 5 GHz links - rework device configuration - RealTek (rtw88): - improve throughput for RTL8814AU - RealTek (rtw89): - add multi-link operation support - STA/P2P concurrency improvements - support different SAR configs by antenna - Bluetooth: - introduce HCI Driver protocol - btintel_pcie: do not generate coredump for diagnostic events - btusb: add HCI Drv commands for configuring altsetting - btusb: add RTL8851BE device 0x0bda:0xb850 - btusb: add new VID/PID 13d3/3584 for MT7922 - btusb: add new VID/PID 13d3/3630 and 13d3/3613 for MT7925 - btnxpuart: implement host-wakeup feature" * tag 'net-next-6.16' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (1611 commits) selftests/bpf: Fix bpf selftest build warning selftests: netfilter: Fix skip of wildcard interface test net: phy: mscc: Stop clearing the the UDPv4 checksum for L2 frames net: openvswitch: Fix the dead loop of MPLS parse calipso: Don't call calipso functions for AF_INET sk. selftests/tc-testing: Add a test for HFSC eltree double add with reentrant enqueue behaviour on netem net_sched: hfsc: Address reentrant enqueue adding class to eltree twice octeontx2-pf: QOS: Refactor TC_HTB_LEAF_DEL_LAST callback octeontx2-pf: QOS: Perform cache sync on send queue teardown net: mana: Add support for Multi Vports on Bare metal net: devmem: ncdevmem: remove unused variable net: devmem: ksft: upgrade rx test to send 1K data net: devmem: ksft: add 5 tuple FS support net: devmem: ksft: add exit_wait to make rx test pass net: devmem: ksft: add ipv4 support net: devmem: preserve sockc_err page_pool: fix ugly page_pool formatting net: devmem: move list_add to net_devmem_bind_dmabuf. selftests: netfilter: nft_queue.sh: include file transfer duration in log message net: phy: mscc: Fix memory leak when using one step timestamping ...
2025-05-21nvme-tcp: use crc32c() and skb_copy_and_crc32c_datagram_iter()Eric Biggers1-84/+40
Now that the crc32c() library function directly takes advantage of architecture-specific optimizations and there also now exists a function skb_copy_and_crc32c_datagram_iter(), it is unnecessary to go through the crypto_ahash API. Just use those functions. This is much simpler, and it also improves performance due to eliminating the crypto API overhead. Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Eric Biggers <ebiggers@google.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Link: https://patch.msgid.link/20250519175012.36581-10-ebiggers@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-05-20nvme-tcp: open-code nvme_tcp_queue_request() for R2THannes Reinecke1-5/+7
When handling an R2T PDU we short-circuit nvme_tcp_queue_request() as we should not attempt to send consecutive PDUs. So open-code nvme_tcp_queue_request() for R2T and drop the last argument. Signed-off-by: Hannes Reinecke <hare@kernel.org> Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
2025-05-20nvme-tcp: remove redundant check to ctrl->optsHannes Reinecke1-1/+1
When checking for secure concatenation we have already validated that 'ctrl->opts' is set, so we can remove this check. Reported-by: Dan Carpenter <dan.carpenter@linaro.org> Signed-off-by: Hannes Reinecke <hare@kernel.org> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
2025-04-30nvme-tcp: fix premature queue removal and I/O failoverMichael Liang1-2/+29
This patch addresses a data corruption issue observed in nvme-tcp during testing. In an NVMe native multipath setup, when an I/O timeout occurs, all inflight I/Os are canceled almost immediately after the kernel socket is shut down. These canceled I/Os are reported as host path errors, triggering a failover that succeeds on a different path. However, at this point, the original I/O may still be outstanding in the host's network transmission path (e.g., the NIC’s TX queue). From the user-space app's perspective, the buffer associated with the I/O is considered completed since they're acked on the different path and may be reused for new I/O requests. Because nvme-tcp enables zero-copy by default in the transmission path, this can lead to corrupted data being sent to the original target, ultimately causing data corruption. We can reproduce this data corruption by injecting delay on one path and triggering i/o timeout. To prevent this issue, this change ensures that all inflight transmissions are fully completed from host's perspective before returning from queue stop. To handle concurrent I/O timeout from multiple namespaces under the same controller, always wait in queue stop regardless of queue's state. This aligns with the behavior of queue stopping in other NVMe fabric transports. Fixes: 3f2304f8c6d6 ("nvme-tcp: add NVMe over TCP host driver") Signed-off-by: Michael Liang <mliang@purestorage.com> Reviewed-by: Mohamed Khalfella <mkhalfella@purestorage.com> Reviewed-by: Randy Jennings <randyj@purestorage.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Christoph Hellwig <hch@lst.de>
2025-04-09nvme-tcp: fix use-after-free of netns by kernel TCP socket.Kuniyuki Iwashima1-0/+2
Commit 1be52169c348 ("nvme-tcp: fix selinux denied when calling sock_sendmsg") converted sock_create() in nvme_tcp_alloc_queue() to sock_create_kern(). sock_create_kern() creates a kernel socket, which does not hold a reference to netns. If the code does not manage the netns lifetime properly, use-after-free could happen. Also, TCP kernel socket with sk_net_refcnt 0 has a socket leak problem: it remains FIN_WAIT_1 if it misses FIN after close() because tcp_close() stops all timers. To fix such problems, let's hold netns ref by sk_net_refcnt_upgrade(). We had the same issue in CIFS, SMC, etc, and applied the same solution, see commit ef7134c7fc48 ("smb: client: Fix use-after-free of network namespace.") and commit 9744d2bf1976 ("smc: Fix use-after-free in tcp_write_timer_handler()."). Fixes: 1be52169c348 ("nvme-tcp: fix selinux denied when calling sock_sendmsg") Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
2025-03-26Merge tag 'for-6.15/block-20250322' of git://git.kernel.dk/linuxLinus Torvalds1-8/+59
Pull block updates from Jens Axboe: - Fixes for integrity handling - NVMe pull request via Keith: - Secure concatenation for TCP transport (Hannes) - Multipath sysfs visibility (Nilay) - Various cleanups (Qasim, Baruch, Wang, Chen, Mike, Damien, Li) - Correct use of 64-bit BARs for pci-epf target (Niklas) - Socket fix for selinux when used in containers (Peijie) - MD pull request via Yu: - fix recovery can preempt resync (Li Nan) - fix md-bitmap IO limit (Su Yue) - fix raid10 discard with REQ_NOWAIT (Xiao Ni) - fix raid1 memory leak (Zheng Qixing) - fix mddev uaf (Yu Kuai) - fix raid1,raid10 IO flags (Yu Kuai) - some refactor and cleanup (Yu Kuai) - Series cleaning up and fixing bugs in the bad block handling code - Improve support for write failure simulation in null_blk - Various lock ordering fixes - Fixes for locking for debugfs attributes - Various ublk related fixes and improvements - Cleanups for blk-rq-qos wait handling - blk-throttle fixes - Fixes for loop dio and sync handling - Fixes and cleanups for the auto-PI code - Block side support for hardware encryption keys in blk-crypto - Various cleanups and fixes * tag 'for-6.15/block-20250322' of git://git.kernel.dk/linux: (105 commits) nvmet: replace max(a, min(b, c)) by clamp(val, lo, hi) nvme-tcp: fix selinux denied when calling sock_sendmsg nvmet: pci-epf: Always configure BAR0 as 64-bit nvmet: Remove duplicate uuid_copy nvme: zns: Simplify nvme_zone_parse_entry() nvmet: pci-epf: Remove redundant 'flush_workqueue()' calls nvmet-fc: Remove unused functions nvme-pci: remove stale comment nvme-fc: Utilise min3() to simplify queue count calculation nvme-multipath: Add visibility for queue-depth io-policy nvme-multipath: Add visibility for numa io-policy nvme-multipath: Add visibility for round-robin io-policy nvmet: add tls_concat and tls_key debugfs entries nvmet-tcp: support secure channel concatenation nvmet: Add 'sq' argument to alloc_ctrl_args nvme-fabrics: reset admin connection for secure concatenation nvme-tcp: request secure channel concatenation nvme-keyring: add nvme_tls_psk_refresh() nvme: add nvme_auth_derive_tls_psk() nvme: add nvme_auth_generate_digest() ...
2025-03-20nvme-tcp: fix selinux denied when calling sock_sendmsgPeijie Shao1-1/+2
In a SELinux enabled kernel, socket_create() initializes the security label of the socket using the security label of the calling process, this typically works well. However, in a containerized environment like Kubernetes, problem arises when a privileged container(domain spc_t) connects to an NVMe target and mounts the NVMe as persistent storage for unprivileged containers(domain container_t). This is because the container_t domain cannot access resources labeled with spc_t, resulting in socket_sendmsg returning -EACCES. The solution is to use socket_create_kern() instead of socket_create(), which labels the socket context to kernel_t. Access control will then be handled by the VFS layer rather than the socket itself. Signed-off-by: Peijie Shao <shaopeijie@cestc.cn> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Keith Busch <kbusch@kernel.org>
2025-03-20nvme-fabrics: reset admin connection for secure concatenationHannes Reinecke1-0/+10