aboutsummaryrefslogtreecommitdiff
path: root/drivers/block
AgeCommit message (Collapse)AuthorFilesLines
5 daysMerge tag 'block-6.19-20260109' of ↵Linus Torvalds2-27/+55
git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux Pull block fixes from Jens Axboe: - Kill unlikely checks for blk-rq-qos. These checks are really all-or-nothing, either the branch is taken all the time, or it's not. Depending on the configuration, either one of those cases may be true. Just remove the annotation - Fix for merging bios with different app tags set - Fix for a recently introduced slowdown due to RCU synchronization - Fix for a status change on loop while it's in use, and then a later fix for that fix - Fix for the async partition scanning in ublk * tag 'block-6.19-20260109' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux: ublk: fix use-after-free in ublk_partition_scan_work blk-mq: avoid stall during boot due to synchronize_rcu_expedited loop: add missing bd_abort_claiming in loop_set_status block: don't merge bios with different app_tags blk-rq-qos: Remove unlikely() hints from QoS checks loop: don't change loop device under exclusive opener in loop_set_status
5 daysublk: fix use-after-free in ublk_partition_scan_workMing Lei1-15/+22
A race condition exists between the async partition scan work and device teardown that can lead to a use-after-free of ub->ub_disk: 1. ublk_ctrl_start_dev() schedules partition_scan_work after add_disk() 2. ublk_stop_dev() calls ublk_stop_dev_unlocked() which does: - del_gendisk(ub->ub_disk) - ublk_detach_disk() sets ub->ub_disk = NULL - put_disk() which may free the disk 3. The worker ublk_partition_scan_work() then dereferences ub->ub_disk leading to UAF Fix this by using ublk_get_disk()/ublk_put_disk() in the worker to hold a reference to the disk during the partition scan. The spinlock in ublk_get_disk() synchronizes with ublk_detach_disk() ensuring the worker either gets a valid reference or sees NULL and exits early. Also change flush_work() to cancel_work_sync() to avoid running the partition scan work unnecessarily when the disk is already detached. Fixes: 7fc4da6a304b ("ublk: scan partition in async way") Reported-by: Ruikai Peng <ruikai@pwno.io> Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
7 daysloop: add missing bd_abort_claiming in loop_set_statusTetsuo Handa1-1/+3
Commit 08e136ebd193 ("loop: don't change loop device under exclusive opener in loop_set_status") forgot to call bd_abort_claiming() when mutex_lock_killable() failed. Fixes: 08e136ebd193 ("loop: don't change loop device under exclusive opener in loop_set_status") Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Signed-off-by: Jens Axboe <axboe@kernel.dk>
8 daysloop: don't change loop device under exclusive opener in loop_set_statusRaphael Pinsonneault-Thibeault1-11/+30
loop_set_status() is allowed to change the loop device while there are other openers of the device, even exclusive ones. In this case, it causes a KASAN: slab-out-of-bounds Read in ext4_search_dir(), since when looking for an entry in an inlined directory, e_value_offs is changed underneath the filesystem by loop_set_status(). Fix the problem by forbidding loop_set_status() from modifying the loop device while there are exclusive openers of the device. This is similar to the fix in loop_configure() by commit 33ec3e53e7b1 ("loop: Don't change loop device under exclusive opener") alongside commit ecbe6bc0003b ("block: use bd_prepare_to_claim directly in the loop driver"). Reported-by: syzbot+3ee481e21fd75e14c397@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=3ee481e21fd75e14c397 Tested-by: syzbot+3ee481e21fd75e14c397@syzkaller.appspotmail.com Tested-by: Yongpeng Yang <yangyongpeng@xiaomi.com> Signed-off-by: Raphael Pinsonneault-Thibeault <rpthibeault@gmail.com> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Jens Axboe <axboe@kernel.dk>
12 daysMerge tag 'block-6.19-20260102' of ↵Linus Torvalds1-3/+32
git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux Pull block fixes from Jens Axboe: - Scan partition tables asynchronously for ublk, similarly to how nvme does it. This avoids potential deadlocks, which is why nvme does it that way too. Includes a set of selftests as well. - MD pull request via Yu: - Fix null-pointer dereference in raid5 sysfs group_thread_cnt store (Tuo Li) - Fix possible mempool corruption during raid1 raid_disks update via sysfs (FengWei Shih) - Fix logical_block_size configuration being overwritten during super_1_validate() (Li Nan) - Fix forward incompatibility with configurable logical block size: arrays assembled on new kernels could not be assembled on older kernels (v6.18 and before) due to non-zero reserved pad rejection (Li Nan) - Fix static checker warning about iterator not incremented (Li Nan) - Skip CPU offlining notifications on unmapped hardware queues - bfq-iosched block stats fix - Fix outdated comment in bfq-iosched * tag 'block-6.19-20260102' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux: block, bfq: update outdated comment blk-mq: skip CPU offline notify on unmapped hctx selftests/ublk: fix Makefile to rebuild on header changes selftests/ublk: add test for async partition scan ublk: scan partition in async way block,bfq: fix aux stat accumulation destination md: Fix forward incompatibility from configurable logical block size md: Fix logical_block_size configuration being overwritten md: suspend array while updating raid_disks via sysfs md/raid5: fix possible null-pointer dereferences in raid5_store_group_thread_cnt() md: Fix static checker warning in analyze_sbs
2025-12-28ublk: scan partition in async wayMing Lei1-3/+32
Implement async partition scan to avoid IO hang when reading partition tables. Similar to nvme_partition_scan_work(), partition scanning is deferred to a work queue to prevent deadlocks. When partition scan happens synchronously during add_disk(), IO errors can cause the partition scan to wait while holding ub->mutex, which can deadlock with other operations that need the mutex. Changes: - Add partition_scan_work to ublk_device structure - Implement ublk_partition_scan_work() to perform async scan - Always suppress sync partition scan during add_disk() - Schedule async work after add_disk() for trusted daemons - Add flush_work() in ublk_stop_dev() before grabbing ub->mutex Reviewed-by: Caleb Sander Mateos <csander@purestorage.com> Reported-by: Yoav Cohen <yoav@nvidia.com> Closes: https://lore.kernel.org/linux-block/DM4PR12MB63280C5637917C071C2F0D65A9A8A@DM4PR12MB6328.namprd12.prod.outlook.com/ Fixes: 71f28f3136af ("ublk_drv: add io_uring based userspace block driver") Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-12-26Merge tag 'block-6.19-20251226' of ↵Linus Torvalds2-3/+2
git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux Pull block fixes from Jens Axboe: - Fix for a signedness issue introduced in this kernel release for rnbd - Fix up user copy references for ublk when the server exits * tag 'block-6.19-20251226' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux: block: rnbd-clt: Fix signedness bug in init_dev() ublk: clean up user copy references on ublk server exit
2025-12-20block: rnbd-clt: Fix signedness bug in init_dev()Dan Carpenter1-1/+1
The "dev->clt_device_id" variable is set using ida_alloc_max() which returns an int and in particular it returns negative error codes. Change the type from u32 to int to fix the error checking. Fixes: c9b5645fd8ca ("block: rnbd-clt: Fix leaked ID in init_dev()") Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-12-20ublk: clean up user copy references on ublk server exitCaleb Sander Mateos1-2/+1
If a ublk server process releases a ublk char device file, any requests dispatched to the ublk server but not yet completed will retain a ref value of UBLK_REFCOUNT_INIT. Before commit e63d2228ef83 ("ublk: simplify aborting ublk request"), __ublk_fail_req() would decrement the reference count before completing the failed request. However, that commit optimized __ublk_fail_req() to call __ublk_complete_rq() directly without decrementing the request reference count. The leaked reference count incorrectly allows user copy and zero copy operations on the completed ublk request. It also triggers the WARN_ON_ONCE(refcount_read(&io->ref)) warnings in ublk_queue_reinit() and ublk_deinit_queue(). Commit c5c5eb24ed61 ("ublk: avoid ublk_io_release() called after ublk char dev is closed") already fixed the issue for ublk devices using UBLK_F_SUPPORT_ZERO_COPY or UBLK_F_AUTO_BUF_REG. However, the reference count leak also affects UBLK_F_USER_COPY, the other reference-counted data copy mode. Fix the condition in ublk_check_and_reset_active_ref() to include all reference-counted data copy modes. This ensures that any ublk requests still owned by the ublk server when it exits have their reference counts reset to 0. Signed-off-by: Caleb Sander Mateos <csander@purestorage.com> Fixes: e63d2228ef83 ("ublk: simplify aborting ublk request") Reviewed-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-12-20Merge tag 'block-6.19-20251218' of ↵Linus Torvalds4-22/+53
git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux Pull block fixes from Jens Axboe: - ublk selftests for missing coverage - two fixes for the block integrity code - fix for the newly added newly added PR read keys ioctl, limiting the memory that can be allocated - work around for a deadlock that can occur with ublk, where partition scanning ends up recursing back into file closure, which needs the same mutex grabbed. Not the prettiest thing in the world, but an acceptable work-around until we can eliminate the reliance on disk->open_mutex for this - fix for a race between enabling writeback throttling and new IO submissions - move a bit of bio flag handling code. No changes, but needed for a patchset for a future kernel - fix for an init time id leak failure in rnbd - loop/zloop state check fix * tag 'block-6.19-20251218' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux: block: validate interval_exp integrity limit block: validate pi_offset integrity limit block: rnbd-clt: Fix leaked ID in init_dev() ublk: fix deadlock when reading partition table block: add allocation size check in blkdev_pr_read_keys() Documentation: admin-guide: blockdev: replace zone_capacity with zone_capacity_mb when creating devices zloop: use READ_ONCE() to read lo->lo_state in queue_rq path loop: use READ_ONCE() to read lo->lo_state without locking block: fix race between wbt_enable_default and IO submission selftests: ublk: add user copy test cases selftests: ublk: add support for user copy to kublk selftests: ublk: forbid multiple data copy modes selftests: ublk: don't share backing files between ublk servers selftests: ublk: use auto_zc for PER_IO_DAEMON tests in stress_04 selftests: ublk: fix fio arguments in run_io_and_recover() selftests: ublk: remove unused ios map in seq_io.bt selftests: ublk: correct last_rw map type in seq_io.bt selftests: ublk: fix overflow in ublk_queue_auto_zc_fallback() block: move around bio flagging helpers
2025-12-18block: rnbd-clt: Fix leaked ID in init_dev()Thomas Fourier1-5/+8
If kstrdup() fails in init_dev(), then the newly allocated ID is lost. Fixes: 64e8a6ece1a5 ("block/rnbd-clt: Dynamically alloc buffer for pathname & blk_symlink_name") Signed-off-by: Thomas Fourier <fourier.thomas@gmail.com> Acked-by: Jack Wang <jinpu.wang@ionos.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-12-17ublk: fix deadlock when reading partition tableMing Lei1-4/+28
When one process(such as udev) opens ublk block device (e.g., to read the partition table via bdev_open()), a deadlock[1] can occur: 1. bdev_open() grabs disk->open_mutex 2. The process issues read I/O to ublk backend to read partition table 3. In __ublk_complete_rq(), blk_update_request() or blk_mq_end_request() runs bio->bi_end_io() callbacks 4. If this triggers fput() on file descriptor of ublk block device, the work may be deferred to current task's task work (see fput() implementation) 5. This eventually calls blkdev_release() from the same context 6. blkdev_release() tries to grab disk->open_mutex again 7. Deadlock: same task waiting for a mutex it already holds The fix is to run blk_update_request() and blk_mq_end_request() with bottom halves disabled. This forces blkdev_release() to run in kernel work-queue context instead of current task work context, and allows ublk server to make forward progress, and avoids the deadlock. Fixes: 71f28f3136af ("ublk_drv: add io_uring based userspace block driver") Link: https://github.com/ublk-org/ublksrv/issues/170 [1] Signed-off-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Caleb Sander Mateos <csander@purestorage.com> [axboe: rewrite comment in ublk] Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-12-15zloop: use READ_ONCE() to read lo->lo_state in queue_rq pathYongpeng Yang1-4/+4
In the queue_rq path, zlo->state is accessed without locking, and direct access may read stale data. This patch uses READ_ONCE() to read zlo->state and data_race() to silence code checkers, and changes all assignments to use WRITE_ONCE(). Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Yongpeng Yang <yangyongpeng@xiaomi.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-12-15loop: use READ_ONCE() to read lo->lo_state without lockingYongpeng Yang1-9/+13
When lo->lo_mutex is not held, direct access may read stale data. This patch uses READ_ONCE() to read lo->lo_state and data_race() to silence code checkers, and changes all assignments to use WRITE_ONCE(). Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Signed-off-by: Yongpeng Yang <yangyongpeng@xiaomi.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-12-14Merge tag 'ceph-for-6.19-rc1' of https://github.com/ceph/ceph-clientLinus Torvalds1-3/+0
Pull ceph updates from Ilya Dryomov: "We have a patch that adds an initial set of tracepoints to the MDS client from Max, a fix that hardens osdmap parsing code from myself (marked for stable) and a few assorted fixups" * tag 'ceph-for-6.19-rc1' of https://github.com/ceph/ceph-client: rbd: stop selecting CRC32, CRYPTO, and CRYPTO_AES ceph: stop selecting CRC32, CRYPTO, and CRYPTO_AES libceph: make decode_pool() more resilient against corrupted osdmaps libceph: Amend checking to fix `make W=1` build breakage ceph: Amend checking to fix `make W=1` build breakage ceph: add trace points to the MDS client libceph: fix log output race condition in OSD client
2025-12-12Merge tag 'block-6.19-20251211' of ↵Linus Torvalds1-7/+21
git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux Pull block fixes from Jens Axboe: - Always initialize DMA state, fixing a potentially nasty issue on the block side - btrfs zoned write fix with cached zone reports - Fix corruption issues in bcache with chained bio's, and further make it clear that the chained IO handler is simply a marker, it's not code meant to be executed - Kill old code dealing with synchronous IO polling in the block layer, that has been dead for a long time. Only async polling is supported these days - Fix a lockdep issue in tag_set management, moving it to RCU - Fix an issue with ublks bio_vec iteration - Don't unconditionally enforce blocking issue of ublk control commands, allow some of them with non-blocking issue as they do not block * tag 'block-6.19-20251211' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux: blk-mq-dma: always initialize dma state blk-mq: delete task running check in blk_hctx_poll() block: fix cached zone reports on devices with native zone append block: Use RCU in blk_mq_[un]quiesce_tagset() instead of set->tag_list_lock ublk: don't mutate struct bio_vec in iteration block: prohibit calls to bio_chain_endio bcache: fix improper use of bi_end_io ublk: allow non-blocking ctrl cmds in IO_URING_F_NONBLOCK issue
2025-12-10rbd: stop selecting CRC32, CRYPTO, and CRYPTO_AESIlya Dryomov1-3/+0
None of the RBD code directly requires CRC32, CRYPTO, or CRYPTO_AES. These options are needed by CEPH_LIB code and they are selected there directly. Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Reviewed-by: Dongsheng Yang <dongsheng.yang@linux.dev>
2025-12-09ublk: don't mutate struct bio_vec in iterationCaleb Sander Mateos1-6/+6
__bio_for_each_segment() uses the returned struct bio_vec's bv_len field to advance the struct bvec_iter at the end of each loop iteration. So it's incorrect to modify it during the loop. Don't assign to bv_len (or bv_offset, for that matter) in ublk_copy_user_pages(). Signed-off-by: Caleb Sander Mateos <csander@purestorage.com> Fixes: e87d66ab27ac ("ublk: use rq_for_each_segment() for user copy") Reviewed-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-12-09Merge tag 'block-6.19-20251208' of ↵Linus Torvalds2-6/+4
git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux Pull block updates from Jens Axboe: "Followup set of fixes and updates for block for the 6.19 merge window. NVMe had some late minute debates which lead to dropping some patches from that tree, which is why the initial PR didn't have NVMe included. It's here now. This pull request contains: - NVMe pull request via Keith: - Subsystem usage cleanups (Max) - Endpoint device fixes (Shin'ichiro) - Debug statements (Gerd) - FC fabrics cleanups and fixes (Daniel) - Consistent alloc API usages (Israel) - Code comment updates (Chu) - Authentication retry fix (Justin) - Fix a memory leak in the discard ioctl code, if the task is being interrupted by a signal at just the wrong time - Zoned write plugging fixes - Add ioctls for for persistent reservations - Enable per-cpu bio caching by default - Various little fixes and tweaks" * tag 'block-6.19-20251208' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux: (27 commits) nvme-fabrics: add ENOKEY to no retry criteria for authentication failures nvme-auth: use kvfree() for memory allocated with kvcalloc() nvmet-tcp: use kvcalloc for commands array nvmet-rdma: use kvcalloc for commands and responses arrays nvme: fix typo error in nvme target nvmet-fc: use pr_* print macros instead of dev_* nvmet-fcloop: remove unused lsdir member. nvmet-fcloop: check all request and response have been processed nvme-fc: check all request and response have been processed block: fix memory leak in __blkdev_issue_zero_pages block: fix comment for op_is_zone_mgmt() to include RESET_ALL block: Clear BLK_ZONE_WPLUG_PLUGGED when aborting plugged BIOs blk-mq: Abort suspend when wakeup events are pending blk-mq: add blk_rq_nr_bvec() helper block: add IOC_PR_READ_RESERVATION ioctl block: add IOC_PR_READ_KEYS ioctl nvme: reject invalid pr_read_keys() num_keys values scsi: sd: reject invalid pr_read_keys() num_keys values block: enable per-cpu bio cache by default block: use bio_alloc_bioset for passthru IO by default ...
2025-12-08ublk: allow non-blocking ctrl cmds in IO_URING_F_NONBLOCK issueCaleb Sander Mateos1-1/+15
Handling most of the ublksrv_ctrl_cmd opcodes require locking a mutex, so ublk_ctrl_uring_cmd() bails out with EAGAIN when called with the IO_URING_F_NONBLOCK issue flag. However, several opcodes can be handled without blocking: - UBLK_CMD_GET_QUEUE_AFFINITY - UBLK_CMD_GET_DEV_INFO - UBLK_CMD_GET_DEV_INFO2 - UBLK_U_CMD_GET_FEATURES Handle these opcodes synchronously instead of returning EAGAIN so io_uring doesn't need to issue the command via the worker thread pool. Signed-off-by: Caleb Sander Mateos <csander@purestorage.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-12-05Merge tag 'mm-stable-2025-12-03-21-26' of ↵Linus Torvalds2-109/+376
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull MM updates from Andrew Morton: "__vmalloc()/kvmalloc() and no-block support" (Uladzislau Rezki) Rework the vmalloc() code to support non-blocking allocations (GFP_ATOIC, GFP_NOWAIT) "ksm: fix exec/fork inheritance" (xu xin) Fix a rare case where the KSM MMF_VM_MERGE_ANY prctl state is not inherited across fork/exec "mm/zswap: misc cleanup of code and documentations" (SeongJae Park) Some light maintenance work on the zswap code "mm/page_owner: add debugfs files 'show_handles' and 'show_stacks_handles'" (Mauricio Faria de Oliveira) Enhance the /sys/kernel/debug/page_owner debug feature by adding unique identifiers to differentiate the various stack traces so that userspace monitoring tools can better match stack traces over time "mm/page_alloc: pcp->batch cleanups" (Joshua Hahn) Minor alterations to the page allocator's per-cpu-pages feature "Improve UFFDIO_MOVE scalability by removing anon_vma lock" (Lokesh Gidra) Address a scalability issue in userfaultfd's UFFDIO_MOVE operation "kasan: cleanups for kasan_enabled() checks" (Sabyrzhan Tasbolatov) "drivers/base/node: fold node register and unregister functions" (Donet Tom) Clean up the NUMA node handling code a little "mm: some optimizations for prot numa" (Kefeng Wang) Cleanups and small optimizations to the NUMA allocation hinting code "mm/page_alloc: Batch callers of free_pcppages_bulk" (Joshua Hahn) Address long lock hold times at boot on large machines. These were causing (harmless) softlockup warnings "optimize the logic for handling dirty file folios during reclaim" (Baolin Wang) Remove some now-unnecessary work from page reclaim "mm/damon: allow DAMOS auto-tuned for per-memcg per-node memory usage" (SeongJae Park) Enhance the DAMOS auto-tuning feature "mm/damon: fixes for address alignment issues in DAMON_LRU_SORT and DAMON_RECLAIM" (Quanmin Yan) Fix DAMON_LRU_SORT and DAMON_RECLAIM with certain userspace configuration "expand mmap_prepare functionality, port more users" (Lorenzo Stoakes) Enhance the new(ish) file_operations.mmap_prepare() method and port additional callsites from the old ->mmap() over to ->mmap_prepare() "Fix stale IOTLB entries for kernel address space" (Lu Baolu) Fix a bug (and possible security issue on non-x86) in the IOMMU code. In some situations the IOMMU could be left hanging onto a stale kernel pagetable entry "mm/huge_memory: cleanup __split_unmapped_folio()" (Wei Yang) Clean up and optimize the folio splitting code "mm, swap: misc cleanup and bugfix" (Kairui Song) Some cleanups and a minor fix in the swap discard code "mm/damon: misc documentation fixups" (SeongJae Park) "mm/damon: support pin-point targets removal" (SeongJae Park) Permit userspace to remove a specific monitoring target in the middle of the current targets list "mm: MISC follow-up patches for linux/pgalloc.h" (Harry Yoo) A couple of cleanups related to mm header file inclusion "mm/swapfile.c: select swap devices of default priority round robin" (Baoquan He) improve the selection of swap devices for NUMA machines "mm: Convert memory block states (MEM_*) macros to enums" (Israel Batista) Change the memory block labels from macros to enums so they will appear in kernel debug info "ksm: perform a range-walk to jump over holes in break_ksm" (Pedro Demarchi Gomes) Address an inefficiency when KSM unmerges an address range "mm/damon/tests: fix memory bugs in kunit tests" (SeongJae Park) Fix leaks and unhandled malloc() failures in DAMON userspace unit tests "some cleanups for pageout()" (Baolin Wang) Clean up a couple of minor things in the page scanner's writeback-for-eviction code "mm/hugetlb: refactor sysfs/sysctl interfaces" (Hui Zhu) Move hugetlb's sysfs/sysctl handling code into a new file "introduce VM_MAYBE_GUARD and make it sticky" (Lorenzo Stoakes) Make the VMA guard regions available in /proc/pid/smaps and improves the mergeability of guarded VMAs "mm: perform guard region install/remove under VMA lock" (Lorenzo Stoakes) Reduce mmap lock contention for callers performing VMA guard region operations "vma_start_write_killable" (Matthew Wilcox) Start work on permitting applications to be killed when they are waiting on a read_lock on the VMA lock "mm/damon/tests: add more tests for online parameters commit" (SeongJae Park) Add additional userspace testing of DAMON's "commit" feature "mm/damon: misc cleanups" (SeongJae Park) "make VM_SOFTDIRTY a sticky VMA flag" (Lorenzo Stoakes) Address the possible loss of a VMA's VM_SOFTDIRTY flag when that VMA is merged with another "mm: support device-private THP" (Balbir Singh) Introduce support for Transparent Huge Page (THP) migration in zone device-private memory "Optimize folio split in memory failure" (Zi Yan) "mm/huge_memory: Define split_type and consolidate split support checks" (Wei Yang) Some more cleanups in the folio splitting code "mm: remove is_swap_[pte, pmd]() + non-swap entries, introduce leaf entries" (Lorenzo Stoakes) Clean up our handling of pagetable leaf entries by introducing the concept of 'software leaf entries', of type softleaf_t "reparent the THP split queue" (Muchun Song) Reparent the THP split queue to its parent memcg. This is in preparation for addressing the long-standing "dying memcg" problem, wherein dead memcg's linger for too long, consuming memory resources "unify PMD scan results and remove redundant cleanup" (Wei Yang) A little cleanup in the hugepage collapse code "zram: introduce writeback bio batching" (Sergey Senozhatsky) Improve zram writeback efficiency by introducing batched bio writeback support "memcg: cleanup the memcg stats interfaces" (Shakeel Butt) Clean up our handling of the interrupt safety of some memcg stats "make vmalloc gfp flags usage more apparent" (Vishal Moola) Clean up vmalloc's handling of incoming GFP flags "mm: Add soft-dirty and uffd-wp support for RISC-V" (Chunyan Zhang) Teach soft dirty and userfaultfd write protect tracking to use RISC-V's Svrsw60t59b extension "mm: swap: small fixes and comment cleanups" (Youngjun Park) Fix a small bug and clean up some of the swap code "initial work on making VMA flags a bitmap" (Lorenzo Stoakes) Start work on converting the vma struct's flags to a bitmap, so we stop running out of them, especially on 32-bit "mm/swapfile: fix and cleanup swap list iterations" (Youngjun Park) Address a possible bug in the swap discard code and clean things up a little [ This merge also reverts commit ebb9aeb980e5 ("vfio/nvgrace-gpu: register device memory for poison handling") because it looks broken to me, I've asked for clarification - Linus ] * tag 'mm-stable-2025-12-03-21-26' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (321 commits) mm: fix vma_start_write_killable() signal handling mm/swapfile: use plist_for_each_entry in __folio_throttle_swaprate mm/swapfile: fix list iteration when next node is removed during discard fs/proc/task_mmu.c: fix make_uffd_wp_huge_pte() huge pte handling mm/kfence: add reboot notifier to disable KFENCE on shutdown memcg: remove inc/dec_lruvec_kmem_state helpers selftests/mm/uffd: initialize char variable to Null mm: fix DEBUG_RODATA_TEST indentation in Kconfig mm: introduce VMA flags bitmap type tools/testing/vma: eliminate dependency on vma->__vm_flags mm: simplify and rename mm flags function for clarity mm: declare VMA flags by bit zram: fix a spelling mistake mm/page_alloc: optimize lowmem_reserve max lookup using its semantic monotonicity mm/vmscan: skip increasing kswapd_failures when reclaim was boosted pagemap: update BUDDY flag documentation mm: swap: remove scan_swap_map_slots() references from comments mm: swap: change swap_alloc_slow() to void mm, swap: remove redundant comment for read_swap_cache_async mm, swap: use SWP_SOLIDSTATE to determine if swap is rotational ...
2025-12-04blk-mq: add blk_rq_nr_bvec() helperChaitanya Kulkarni2-6/+4
Add a new helper function blk_rq_nr_bvec() that returns the number of bvecs in a request. This count represents the number of iterations rq_for_each_bvec() would perform on a request. Drivers need to pre-allocate bvec arrays before iterating through a request's bvecs. Currently, they manually count bvecs using rq_for_each_bvec() in a loop, which is repetitive. The new helper centralizes this logic. This pattern exists in loop and zloop drivers, where multi-bio requests require copying bvecs into a contiguous array before creating an iov_iter for file operations. Update loop and zloop drivers to use the new helper, eliminating duplicate code. This patch also provides a clear API to avoid any potential misuse of blk_nr_phys_segments() for calculating the bvecs since, one bvec can have more than one segments and use of blk_nr_phys_segments() can lead to extra memory allocation :- [ 6155.673749] nullb_bio: 128K bio as ONE bvec: sector=0, size=131072 [ 6155.673846] null_blk: #### null_handle_data_transfer:1375 [ 6155.673850] null_blk: nr_bvec=1 blk_rq_nr_phys_segments=2 [ 6155.674263] null_blk: #### null_handle_data_transfer:1375 [ 6155.674267] null_blk: nr_bvec=1 blk_rq_nr_phys_segments=1 Reviewed-by: Niklas Cassel <cassel@kernel.org> Signed-off-by: Chaitanya Kulkarni <ckulkarnilinux@gmail.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-12-03Merge tag 'for-6.19/block-20251201' of ↵Linus Torvalds14-290/+421
git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux Pull block updates from Jens Axboe: - Fix head insertion for mq-deadline, a regression from when priority support was added - Series simplifying and improving the ublk user copy code - Various ublk related cleanups - Fixup REQ_NOWAIT handling in loop/zloop, clearing NOWAIT when the request is punted to a thread for handling - Merge and then later revert loop dio nowait support, as it ended up causing excessive stack usage for when the inline issue code needs to dip back into the full file system code - Improve auto integrity code, making it less deadlock prone - Speedup polled IO handling, but manually managing the hctx lookups - Fixes for blk-throttle for SSD devices - Small series with fixes for the S390 dasd driver - Add support for caching zones, avoiding unnecessary report zone queries - MD pull requests via Yu: - fix null-ptr-dereference regression for dm-raid0 - fix IO hang for raid5 when array is broken with IO inflight - remove legacy 1s delay to speed up system shutdown - change maintainer's email address - data can be lost if array is created with different lbs devices, fix this problem and record lbs of the array in metadata - fix rcu protection for md_thread - fix mddev kobject lifetime regression - enable atomic writes for md-linear - some cleanups - bcache updates via Coly - remove useless discard and cache device code - improve usage of per-cpu workqueues - Reorganize the IO scheduler switching code, fixing some lockdep reports as well - Improve the block layer P2P DMA support - Add support to the block tracing code for zoned devices - Segment calculation improves, and memory alignment flexibility improvements - Set of prep and cleanups patches for ublk batching support. The actual batching hasn't been added yet, but helps shrink down the workload of getting that patchset ready for 6.20 - Fix for how the ps3 block driver handles segments offsets - Improve how block plugging handles batch tag allocations - nbd fixes for use-after-free of the configuration on device clear/put - Set of improvements and fixes for zloop - Add Damien as maintainer of the block zoned device code handling - Various other fixes and cleanups * tag 'for-6.19/block-20251201' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux: (162 commits) block/rnbd: correct all kernel-doc complaints blk-mq: use queue_hctx in blk_mq_map_queue_type md: remove legacy 1s delay in md_notify_reboot md/raid5: fix IO hang when array is broken with IO inflight md: warn about updating super block failure md/raid0: fix NULL pointer dereference in create_strip_zones() for dm-raid sbitmap: fix all kernel-doc warnings ublk: add helper of __ublk_fetch() ublk: pass const pointer to ublk_queue_is_zoned() ublk: refactor auto buffer register in ublk_dispatch_req() ublk: add `union ublk_io_buf` with improved naming ublk: add parameter `struct io_uring_cmd *` to ublk_prep_auto_buf_reg() kfifo: add kfifo_alloc_node() helper for NUMA awareness blk-mq: fix potential uaf for 'queue_hw_ctx' blk-mq: use array manage hctx map instead of xarray ublk: prevent invalid access with DEBUG s390/dasd: Use scnprintf() instead of sprintf() s390/dasd: Move device name formatting into separate function s390/dasd: Remove unnecessary debugfs_create() return checks s390/dasd: Fix gendisk parent after copy pair swap ...
2025-12-03Merge tag 'for-6.19/io_uring-20251201' of ↵Linus Torvalds1-11/+11
git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux Pull io_uring updates from Jens Axboe: - Unify how task_work cancelations are detected, placing it in the task_work running state rather than needing to check the task state - Series cleaning up and moving the cancelation code to where it belongs, in cancel.c - Cleanup of waitid and futex argument handling - Add support for mixed sized SQEs. 6.18 added support for mixed sized CQEs, improving flexibility and efficiency of workloads that need big CQEs. This adds similar support for SQEs, where the occasional need for a 128b SQE doesn't necessitate having all SQEs be 128b in size - Introduce zcrx and SQ/CQ layout queries. The former returns what zcrx features are available. And both return the ring size information to help with allocation size calculation for user provided rings like IORING_SETUP_NO_MMAP and IORING_MEM_REGION_TYPE_USER - Zcrx updates for 6.19. It includes a bunch of small patches, IORING_REGISTER_ZCRX_CTRL and RQ flushing and David's work on sharing zcrx b/w multiple io_uring instances - Series cleaning up ring initializations, notable deduplicating ring size and offset calculations. It also moves most of the checking before doing any allocations, making the code simpler - Add support for getsockname and getpeername, which is mostly a trivial hookup after a bit of refactoring on the networking side - Various fixes and cleanups * tag 'for-6.19/io_uring-20251201' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux: (68 commits) io_uring: Introduce getsockname io_uring cmd socket: Split out a getsockname helper for io_uring socket: Unify getsockname and getpeername implementation io_uring/query: drop unused io_handle_query_entry() ctx arg io_uring/kbuf: remove obsolete buf_nr_pages and update comments io_uring/register: use correct location for io_rings_layout io_uring/zcrx: share an ifq between rings io_uring/zcrx: add io_fill_zcrx_offsets() io_uring/zcrx: export zcrx via a file io_uring/zcrx: move io_zcrx_scrub() and dependencies up io_uring/zcrx: count zcrx users io_uring/zcrx: add sync refill queue flushing io_uring/zcrx: introduce IORING_REGISTER_ZCRX_CTRL io_uring/zcrx: elide passing msg flags io_uring/zcrx: use folio_nr_pages() instead of shift operation io_uring/zcrx: convert to use netmem_desc io_uring/query: introduce rings info query io_uring/query: introduce zcrx query io_uring: move cq/sq user offset init around io_uring: pre-calculate scq layout ...
2025-12-03Merge tag 'net-next-6.19' of ↵Linus Torvalds1-3/+3
git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next Pull networking updates from Jakub Kicinski: "Core & protocols: - Replace busylock at the Tx queuing layer with a lockless list. Resulting in a 300% (4x) improvement on heavy TX workloads, sending twice the number of packets per second, for half the cpu cycles. - Allow constantly busy flows to migrate to a more suitable CPU/NIC queue. Normally we perform queue re-selection when flow comes out of idle, but under extreme circumstances the flows may be constantly busy. Add sysctl to allow periodic rehashing even if it'd risk packet reordering. - Optimize the NAPI skb cache, make it larger, use it in more paths. - Attempt returning Tx skbs to the originating CPU (like we already did for Rx skbs). - Various data structure layout and prefetch optimizations from Eric. - Remove ktime_get() from the recvmsg() fast path, ktime_get() is sadly quite expensive on recent AMD machines. - Extend threaded NAPI polling to allow the kthread busy poll for packets. - Make MPTCP use Rx backlog processing. This lowers the lock pressure, improving the Rx performance. - Support memcg accounting of MPTCP socket memory. - Allow admin to opt sockets out of global protocol memory accounting (using a sysctl or BPF-based policy). The global limits are a poor fit for modern container workloads, where limits are imposed using cgroups. - Improve heuristics for when to kick off AF_UNIX garbage collection. - Allow users to control TCP SACK compression, and default to 33% of RTT. - Add tcp_rcvbuf_low_rtt sysctl to let datacenter users avoid unnecessarily aggressive rcvbuf growth and overshot when the connection RTT is low. - Preserve skb metadata space across skb_push / skb_pull operations. - Support for IPIP encapsulation in the nftables flowtable offload. - Support appending IP interface information to ICMP messages (RFC 5837). - Support setting max record size in TLS (RFC 8449). - Remove taking rtnl_lock from RTM_GETNEIGHTBL and RTM_SETNEIGHTBL. - Use a dedicated lock (and RCU) in MPLS, instead of rtnl_lock. - Let users configure the number of write buffers in SMC. - Add new struct sockaddr_unsized for sockaddr of unknown length, from Kees. - Some conversions away from the crypto_ahash API, from Eric Biggers. - Some preparations for slimming down struct page. - YAML Netlink protocol spec for WireGuard. - Add a tool on top of YAML Netlink specs/lib for reporting commonly computed derived statistics and summarized system state. Driver API: - Add CAN XL support to the CAN Netlink interface. - Add uAPI for reporting PHY Mean Square Error (MSE) diagnostics, as defined by the OPEN Alliance's "Advanced diagnostic features for 100BASE-T1 automotive Ethernet PHYs" specification. - Add DPLL phase-adjust-gran pin attribute (and implement it in zl3073x). - Refactor xfrm_input lock to reduce contention when NIC offloads IPsec and performs RSS. - Add info to devlink params whether the current setting is the default or a user override. Allow resetting back to default. - Add standard device stats for PSP crypto offload. - Leverage DSA frame broadcast to implement simple HSR frame duplication for a lot of switches without dedicated HSR offload. - Add uAPI defines for 1.6Tbps link modes. Device drivers: - Add Motorcomm YT921x gigabit Ethernet switch support. - Add MUCSE driver for N500/N210 1GbE NIC series. - Convert drivers to support dedicated ops for timestamping control, and away from the direct IOCTL handling. While at it support GET operations for PHY timestamping. - Add (and convert most drivers to) a dedicated ethtool callback for reading the Rx ring count. - Significant refactoring efforts in the STMMAC driver, which supports Synopsys turn-key MAC IP integrated into a ton of SoCs. - Ethernet high-speed NICs: - Broadcom (bnxt): - support PPS in/out on all pins - Intel (100G, ice, idpf): - ice: implement standard ethtool and timestamping stats - i40e: support setting the max number of MAC addresses per VF - iavf: support RSS of GTP tunnels for 5G and LTE deployments - nVidia/Mellanox (mlx5): - reduce downtime on interface reconfiguration - disable being an XDP redirect target by default (same as other drivers) to avoid wasting resources if feature is unused - Meta (fbnic): - add support for Linux-managed PCS on 25G, 50G, and 100G links - Wangxun: - support Rx descriptor merge, and Tx head writeback - support Rx coalescing offload - support 25G SPF and 40G QSFP modules - Ethernet virtual: - Google (gve): - allow ethtool to configure rx_buf_len - implement XDP HW RX Timestamping support for DQ descriptor format - Microsoft vNIC (mana): - support HW link state events - handle hardware recovery events when probing the device - Ethernet NICs consumer, and embedded: - usbnet: add support for Byte Queue Limits (BQL) - AMD (amd-xgbe): - add device selftests - NXP (enetc): - add i.MX94 support - Broadcom integrated MACs (bcmgenet, bcmasp): - bcmasp: add support for PHY-based Wake-on-LAN - Broadcom switches (b53): - support port isolation - support BCM5389/97/98 and BCM63XX ARL formats - Lantiq/MaxLinear switches: - support bridge FDB entries on the CPU port - use regmap for register access - allow user to enable/disable learning - support Energy Efficient Ethernet - support configuring RMII clock delays - add tagging driver for MaxLinear GSW1xx switches - Synopsys (stmmac): - support using the HW clock in free running mode - add Eswin EIC7700 support - add Rockchip RK3506 support - add Altera Agilex5 support - Cadence (macb): - cleanup and consolidate descriptor and DMA address handling - add EyeQ5 support - TI: - icssg-prueth: support AF_XDP - Airoha access points: - add missing Ethernet stats and link state callback - add AN7583 support - support out-of-order Tx completion processing - Power over Ethernet: - pd692x0: preserve PSE configuration across reboots - add support for TPS23881B devices - Ethernet PHYs: - Open Alliance OATC14 10BASE-T1S PHY cable diagnostic support - Support 50G SerDes and 100G interfaces in Linux-managed PHYs - micrel: - support for non PTP SKUs of lan8814 - enable in-band auto-negotiation on lan8814 - realtek: - cable testing support on RTL8224 - interrupt support on RTL8221B - motorcomm: support for PHY LEDs on YT853 - microchip: support for LAN867X Rev.D0 PHYs w/ SQI and cable diag - mscc: support for PHY LED control - CAN drivers: - m_can: add support for optional reset and system wake up - remove can_change_mtu() obsoleted by core handling - mcp251xfd: support GPIO controller functionality - Bluetooth: - add initial support for PASTa - WiFi: - split ieee80211.h file, it's way too big - improvements in VHT radiotap reporting, S1G, Channel Switch Announcement handling, rate tracking in mesh networks - improve multi-radio monitor mode support, and add a cfg80211 debugfs interface for it - HT action frame handling on 6 GHz - initial chanctx work towards NAN - MU-MIMO sniffer improvements - WiFi drivers: - RealTek (rtw89): - support USB devices RTL8852AU and RTL8852CU - initial work for RTL8922DE - improved injection support - Intel: - iwlwifi: new sniffer API support - MediaTek (mt76): - WED support for >32-bit DMA - airoha NPU support - regdomain improvements - continued WiFi7/MLO work - Qualcomm/Atheros: - ath10k: factory test support - ath11k: TX power insertion support - ath12k: BSS color change support - ath12k: statistics improvements - brcmfmac: Acer A1 840 tablet quirk - rtl8xxxu: 40 MHz connection fixes/support" * tag 'net-next-6.19' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (1381 commits) net: page_pool: sanitise allocation order net: page pool: xa init with destroy on pp init net/mlx5e: Support XDP target xmit with dummy program net/mlx5e: Update XDP features in switch channels selftests/tc-testing: Test CAKE scheduler when enqueue drops packets net/sched: sch_cake: Fix incorrect qlen reduction in cake_drop wireguard: netlink: generate netlink code wireguard: uapi: generate header with ynl-gen wireguard: uapi: move flag enums wireguard: uapi: move enum wg_cmd wireguard: netlink: add YNL specification selftests: drv-net: Fix tolerance calculation in devlink_rate_tc_bw.py selftests: drv-net: Fix and clarify TC bandwidth split in devlink_rate_tc_bw.py selftests: drv-net: Set shell=True for sysfs writes in devlink_rate_tc_bw.py selftests: drv-net: Use Iperf3Runner in devlink_rate_tc_bw.py selftests: drv-net: introduce Iperf3Runner for measurement use cases selftests: drv-net: Add devlink_rate_tc_bw.py to TEST_PROGS net: ps3_gelic_net: Use napi_alloc_skb() and napi_gro_receive() Documentation: net: dsa: mention simple HSR offload helpers Documentation: net: dsa: mention availability of RedBox ...
2025-12-03Merge tag 'rust-6.19' of ↵Linus Torvalds1-4/+5
git://git.kernel.org/pub/scm/linux/kernel/git/ojeda/linux Pull Rust updates from Miguel Ojeda: "Toolchain and infrastructure: - Add support for 'syn'. Syn is a parsing library for parsing a stream of Rust tokens into a syntax tree of Rust source code. Currently this library is geared toward use in Rust procedural macros, but contains some APIs that may be useful more generally. 'syn' allows us to greatly simplify writing complex macros such as 'pin-init' (Benno has already prepared the 'syn'-based version). We will use it in the 'macros' crate too. 'syn' is the most downloaded Rust crate (according to crates.io), and it is also used by the Rust compiler itself. While the amount of code is substantial, there should not be many updates needed for these crates, and even if there are, they should not be too big, e.g. +7k -3k lines across the 3 crates in the last year. 'syn' requires two smaller dependencies: 'quote' and 'proc-macro2'. I only modified their code to remove a third dependency ('unicode-ident') and to add the SPDX identifiers. The code can be easily verified to exactly match upstream with the provided scripts. They are all licensed under "Apache-2.0 OR MIT", like the other vendored 'alloc' crate we had for a while. Please see the merge commit with the cover letter for more context. - Allow 'unreachable_pub' and 'clippy::disallowed_names' for doctests. Examples (i.e. doctests) may want to do things like show public items and use names such as 'foo'. Nevertheless, we still try to keep examples as close to real code as possible (this is part of why running Clippy on doctests is important for us, e.g. for safety comments, which userspace Rust does not support yet but we are stricter). 'kernel' crate: - Replace our custom 'CStr' type with 'core::ffi::CStr'. Using the standard library type reduces our custom code footprint, and we retain needed custom functionality through an extension trait and a new 'fmt!' macro which replaces the previous 'core' import. This started in 6.17 and continued in 6.18, and we finally land the replacement now. This required quite some stamina from Tamir, who split the changes in steps to prepare for the flag day change here. - Replace 'kernel::c_str!' with C string literals. C string literals were added in Rust 1.77, which produce '&CStr's (the 'core' one), so now we can write: c"hi" instead of: c_str!("hi") - Add 'num' module for numerical features. It includes the 'Integer' trait, implemented for all primitive integer types. It also includes the 'Bounded' integer wrapping type: an integer value that requires only the 'N' least significant bits of the wrapped type to be encoded: // An unsigned 8-bit integer, of which only the 4 LSBs are used. let v = Bounded::<u8, 4>::new::<15>(); assert_eq!(v.get(), 15); 'Bounded' is useful to e.g. enforce guarantees when working with bitfields that have an arbitrary number of bits. Values can also be constructed from simple non-constant expressions or, for more complex ones, validated at runtime. 'Bounded' also comes with comparison and arithmetic operations (with both their backing type and other 'Bounded's with a compatible backing type), casts to change the backing type, extending/shrinking and infallible/fallible conversions from/to primitives as applicable. - 'rbtree' module: add immutable cursor ('Cursor'). It enables to use just an immutable tree reference where appropriate. The existing fully-featured mutable cursor is renamed to 'CursorMut'. kallsyms: - Fix wrong "big" kernel symbol type read from procfs. 'pin-init' crate: - A couple minor fixes (Benno asked me to pick these patches up for him this cycle). Documentation: - Quick Start guide: add Debian 13 (Trixie). Debian Stable is now able to build Linux, since Debian 13 (released 2025-08-09) packages Rust 1.85.0, which is recent enough. We are planning to propose that the minimum supported Rust version in Linux follows Debian Stable releases, with Debian 13 being the first one we upgrade to, i.e. Rust 1.85. MAINTAINERS: - Add entry for the new 'num' module. - Remove Alex as Rust maintainer: he hasn't had the time to contribute for a few years now, so it is a no-op change in practice. And a few other cleanups and improvements" * tag 'rust-6.19' of git://git.kernel.org/pub/scm/linux/kernel/git/ojeda/linux: (53 commits) rust: macros: support `proc-macro2`, `quote` and `syn` rust: syn: enable support in kbuild rust: syn: add `README.md` rust: syn: remove `unicode-ident