| Age | Commit message | Author | Files | Lines |
|
zram_read_page() picks the sync or async backing device read path based on
whether the parent bio is NULL. zram_bvec_write_partial() passes its
parent bio down, so for ZRAM_WB slots the read is dispatched
asynchronously and zram_read_page() returns 0 while the bio is still in
flight. The caller then runs memcpy_from_bvec(), zram_write_page() and
__free_page() on the buffer, leaving the async read to write into a freed
page.
zram_bvec_read_partial() was switched to NULL in commit 4e3c87b9421d
("zram: fix synchronous reads") for the same reason; the write_partial
counterpart was missed.
Link: https://lore.kernel.org/20260528-zram-v3-1-cab86eef8764@gmail.com
Fixes: 8e654f8fbff5 ("zram: read page from backing device")
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Signed-off-by: Cunlong Li <shenxiaogll@gmail.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Yisheng Xie <xieyisheng1@huawei.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull misc fixes from Andrew Morton:
"13 hotfixes. 9 are for MM. 9 are cc:stable and the remaining 4 address
post-7.1 issues or aren't considered suitable for backporting.
All patches are singletons - please see the individual changelogs for
details"
* tag 'mm-hotfixes-stable-2026-05-25-16-22' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm:
Revert "mm: introduce a new page type for page pool in page type"
mm/vmalloc: do not trigger BUG() on BH disabled context
MAINTAINERS, mailmap: change email for Eugen Hristev
mm/migrate_device: fix pgtable leak in migrate_vma_insert_huge_pmd_page
kernel/fork: validate exit_signal in kernel_clone()
mm: memcontrol: propagate NMI slab stats to memcg vmstats
mm/damon/sysfs-schemes: delete tried region in regions_rmdirs()
mm/rmap: initialize nr_pages to 1 at loop start in try_to_unmap_one
zram: fix use-after-free in zram_writeback_endio
memfd: deny writeable mappings when implying SEAL_WRITE
ipc: limit next_id allocation to the valid ID range
Revert "mm/hugetlbfs: update hugetlbfs to use mmap_prepare"
MAINTAINERS: .mailmap: update after GEHC spin-off
|
|
A crash was observed in zram_writeback_endio due to a NULL pointer
dereference in wake_up. The root cause is a race condition between the
bio completion handler (zram_writeback_endio) and the writeback task.
In zram_writeback_endio, wake_up() is called on &wb_ctl->done_wait after
releasing wb_ctl->done_lock. This creates a race window where the
writeback task can see num_inflight become 0, return, and free wb_ctl
before zram_writeback_endio calls wake_up().
CPU 0 (zram_writeback_endio)                  CPU 1 (writeback_store)
============================                  ============================
                                              zram_writeback_slots
                                                zram_submit_wb_request
                                                zram_submit_wb_request
                                                wait_event(wb_ctl->done_wait)
spin_lock(&wb_ctl->done_lock);
list_add(&req->entry, &wb_ctl->done_reqs);
spin_unlock(&wb_ctl->done_lock);
wake_up(&wb_ctl->done_wait);
                                              zram_complete_done_reqs
spin_lock(&wb_ctl->done_lock);
list_add(&req->entry, &wb_ctl->done_reqs);
spin_unlock(&wb_ctl->done_lock);
                                                while (num_inflight > 0)
                                                  spin_lock(&wb_ctl->done_lock);
                                                  list_del(&req->entry);
                                                  spin_unlock(&wb_ctl->done_lock);
                                                  // num_inflight becomes 0
                                                  atomic_dec(num_inflight);
                                              // Leave zram_writeback_slots
                                              // Free wb_ctl
                                              release_wb_ctl(wb_ctl);
// UAF crash!
wake_up(&wb_ctl->done_wait);
This patch fixes this race by using RCU. By protecting wb_ctl with
rcu_read_lock() in zram_writeback_endio and using kfree_rcu() to free it,
we ensure that wb_ctl remains valid during the execution of
zram_writeback_endio.
Link: https://lore.kernel.org/20260512074918.2606208-1-richardycc@google.com
Fixes: f405066a1f0d ("zram: introduce writeback bio batching")
Signed-off-by: Richard Chang <richardycc@google.com>
Suggested-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Suggested-by: Minchan Kim <minchan@kernel.org>
Acked-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Acked-by: Minchan Kim <minchan@kernel.org>
Cc: Brian Geffon <bgeffon@google.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Martin Liu <liumartin@google.com>
Cc: wang wei <a929244872@163.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Given how rbd_lock_add_request() and rbd_img_exclusive_lock() are
written, lock_dwork may be (re)queued more often than is actually needed:
for example in case a new I/O request comes in while we are in the
middle of rbd_acquire_lock() on behalf of another I/O request. This is
expected and, with rbd_release_lock() preemptively canceling lock_dwork,
is benign under normal operation.
A more problematic example is maybe_kick_acquire():
if (have_requests || delayed_work_pending(&rbd_dev->lock_dwork)) {
dout("%s rbd_dev %p kicking lock_dwork\n", __func__, rbd_dev);
mod_delayed_work(rbd_dev->task_wq, &rbd_dev->lock_dwork, 0);
}
It's not unrealistic for lock_dwork to get canceled right after
delayed_work_pending() returns true and for mod_delayed_work() to
requeue it right there anyway. This is a classic TOCTOU race.
When it comes to unmapping the image, there is an implicit assumption
of no self-initiated exclusive lock activity past the point of return
from rbd_dev_image_unlock() which unlocks the lock if it happens to be
held. This unlock is assumed to be final and lock_dwork (as well as
all other exclusive lock tasks, really) isn't expected to get queued
again. However, lock_dwork is canceled only in cancel_tasks_sync()
(i.e. later in the unmap sequence) and, on top of that, the cancellation
can in effect be nullified by maybe_kick_acquire(). This may result
in rbd_acquire_lock() executing after rbd_dev_device_release() and
rbd_dev_image_release() run and free and/or reset a bunch of things.
One of the possible failure modes then is a violated
rbd_assert(rbd_image_format_valid(rbd_dev->image_format));
in rbd_dev_header_info() which is called via rbd_dev_refresh() from
rbd_post_acquire_action().
Redo exclusive lock task draining to provide saner semantics and try
to meet the assumptions around rbd_dev_image_unlock().
Cc: stable@vger.kernel.org
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Viacheslav Dubeyko <Slava.Dubeyko@ibm.com>
|
|
blk_validate_limits() requires max_hw_sectors >= PAGE_SECTORS and fires
a WARN_ON_ONCE if this invariant is violated. ublk_validate_params()
only checked the upper bound of max_sectors against max_io_buf_bytes,
allowing userspace to pass small values (including zero) that trigger
the warning when blk_mq_alloc_disk() is called from
ublk_ctrl_start_dev().
Before 494ea040bcb5, ublk used blk_queue_max_hw_sectors() which silently
clamped small values up to PAGE_SECTORS. The conversion to passing
queue_limits directly to blk_mq_alloc_disk() lost that clamping and now
hits blk_validate_limits()'s WARN_ON_ONCE instead.
Validate that max_sectors is at least PAGE_SECTORS in
ublk_validate_params() so invalid values are rejected early with
-EINVAL instead of reaching the block layer.
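
A minimal userspace sketch of the two-sided bound described above. The
names (sketch_check_max_sectors, the SKETCH_* constants) are hypothetical
and the 4 KiB page size is an assumption; this is not the driver's code.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Assumed values for illustration: 4 KiB pages, 512-byte sectors. */
#define SKETCH_PAGE_SIZE     4096u
#define SKETCH_SECTOR_SIZE   512u
#define SKETCH_PAGE_SECTORS  (SKETCH_PAGE_SIZE / SKETCH_SECTOR_SIZE)

/*
 * Hypothetical stand-in for the max_sectors part of parameter validation.
 * max_io_buf_bytes is the device's I/O buffer size in bytes.
 */
static int sketch_check_max_sectors(uint32_t max_sectors,
				    uint32_t max_io_buf_bytes)
{
	/* Existing upper bound: one request must fit in the I/O buffer. */
	if (max_sectors > max_io_buf_bytes / SKETCH_SECTOR_SIZE)
		return -EINVAL;
	/* Added lower bound: at least one page worth of sectors. */
	if (max_sectors < SKETCH_PAGE_SECTORS)
		return -EINVAL;
	return 0;
}
```

With these assumed constants, a max_sectors of 0 or 7 is rejected and 8
is the smallest accepted value.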
Fixes: 494ea040bcb5 ("ublk: pass queue_limits to blk_mq_alloc_disk")
Signed-off-by: Ming Lei <tom.leiming@gmail.com>
Link: https://patch.msgid.link/20260510144843.769031-1-tom.leiming@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
When ublk_reset_ch_dev() clears io->cmd via ublk_queue_reinit()
concurrently with ublk_cancel_cmd(), ublk_cancel_cmd() can read a
stale pointer and pass it to io_uring_cmd_done(), causing a
use-after-free.
Fix by synchronizing the two paths with ubq->cancel_lock:
- ublk_cancel_cmd(): read and clear io->cmd under cancel_lock,
then call io_uring_cmd_done() on the saved local copy outside
the lock.
- ublk_reset_ch_dev(): hold cancel_lock across ublk_queue_reinit()
so that io->cmd and io->flags are cleared atomically with respect
to ublk_cancel_cmd().
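
The locking pattern above can be modelled in userspace roughly as
follows. This is only a sketch: a pthread mutex stands in for
cancel_lock, setting a flag stands in for io_uring_cmd_done(), and every
sketch_* name is hypothetical.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Hypothetical userspace model; none of these names are the driver's. */
struct sketch_cmd {
	int completed;
};

struct sketch_io {
	pthread_mutex_t cancel_lock;
	struct sketch_cmd *cmd;	/* models io->cmd */
	unsigned int flags;	/* models io->flags */
};

/* Cancel path: claim the pointer under the lock, complete it outside. */
static int sketch_cancel(struct sketch_io *io)
{
	struct sketch_cmd *cmd;

	pthread_mutex_lock(&io->cancel_lock);
	cmd = io->cmd;
	io->cmd = NULL;
	pthread_mutex_unlock(&io->cancel_lock);

	if (!cmd)
		return 0;	/* nothing left to cancel */
	cmd->completed = 1;	/* stands in for io_uring_cmd_done() */
	return 1;
}

/* Reset path: clear both fields while holding the same lock. */
static void sketch_reset(struct sketch_io *io)
{
	pthread_mutex_lock(&io->cancel_lock);
	io->cmd = NULL;
	io->flags = 0;
	pthread_mutex_unlock(&io->cancel_lock);
}
```

Because the cancel path only completes the pointer it claimed while
holding the lock, a concurrent reset cannot leave it with a stale value.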
Fixes: 216c8f5ef0f2 ("ublk: replace monitor with cancelable uring_cmd")
Signed-off-by: Ming Lei <tom.leiming@gmail.com>
Link: https://patch.msgid.link/20260508123746.242018-1-tom.leiming@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
ublk_validate_params() checks logical_bs_shift is within
[9, PAGE_SHIFT] but has no upper bound for physical_bs_shift,
io_min_shift, or io_opt_shift. A malicious userspace can set any
of these to a large value (e.g., 44), causing undefined behavior
from `1 << shift` in ublk_ctrl_start_dev() since the result is
stored in 32-bit unsigned int.
Cap all three at ilog2(SZ_256M) (28). 256M is big enough to cover
all practical block sizes, and originates from the maximum physical
block size possible in NVMe (lba_size * (1 + npwg), where npwg is
16-bit).
Also zero out ub->params with memset() when copy_from_user() fails
or ublk_validate_params() returns error, so that no stale or partial
params survive for a subsequent START_DEV to consume.
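
As an illustration of the cap, here is a small userspace sketch. The
helper names are hypothetical; only the limit of 28, i.e.
ilog2(SZ_256M), comes from the text above.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* ilog2(SZ_256M): 256 MiB == 1 << 28, the cap named above. */
#define SKETCH_MAX_SHIFT 28u

/* Hypothetical helper: reject a shift that exceeds the cap. */
static int sketch_check_shift(uint8_t shift)
{
	return shift > SKETCH_MAX_SHIFT ? -EINVAL : 0;
}

/*
 * Convert a shift to bytes. Only call this after sketch_check_shift()
 * succeeded: shifting a 32-bit value by 32 or more is undefined.
 */
static uint32_t sketch_shift_to_bytes(uint8_t shift)
{
	return UINT32_C(1) << shift;
}
```

A shift of 44, as in the example above, is rejected before any shift
expression is evaluated.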
Fixes: 71f28f3136af ("ublk_drv: add io_uring based userspace block driver")
Signed-off-by: Ming Lei <tom.leiming@gmail.com>
Link: https://patch.msgid.link/20260506082238.22363-1-tom.leiming@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
When ublk_ch_uring_cmd_cb() runs as fallback task work (e.g., because
the submitting task is exiting), the command should not be issued as
current is a kworker, not the daemon task. This can cause io->task
to capture the wrong task in __ublk_fetch(), leading to a task
mismatch warning in ublk_uring_cmd_cancel_fn().
Check tw.cancel and return -ECANCELED instead of issuing the command
from fallback context.
Fixes: 3421c7f68bba ("ublk: make sure io cmd handled in submitter task context")
Signed-off-by: Ming Lei <tom.leiming@gmail.com>
Link: https://patch.msgid.link/20260501112312.947327-1-tom.leiming@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux
Pull block fixes from Jens Axboe:
- Series for zloop, fixing a variety of issues
- t10-pi code cleanup
- Fix for a merge window regression with the bio memory allocation mask
- Fix for a merge window regression in ublk, caused by an issue with
the maple tree iteration code at teardown
- ublk self tests additions
- Zoned device pgmap fixes
- Various little cleanups and fixes
* tag 'block-7.1-20260424' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux: (21 commits)
Revert "floppy: fix reference leak on platform_device_register() failure"
ublk: avoid unpinning pages under maple tree spinlock
ublk: refactor common helper ublk_shmem_remove_ranges()
ublk: fix maple tree lockdep warning in ublk_buf_cleanup
selftests: ublk: add ublk auto integrity test
selftests: ublk: enable test_integrity_02.sh on fio 3.42
selftests: ublk: remove unused argument to _cleanup
block: only restrict bio allocation gfp mask asked to block
block/blk-throttle: Add WQ_PERCPU to alloc_workqueue users
block: Add WQ_PERCPU to alloc_workqueue users
block: relax pgmap check in bio_add_page for compatible zone device pages
block: add pgmap check to biovec_phys_mergeable
floppy: fix reference leak on platform_device_register() failure
ublk: use unchecked copy helpers for bio page data
t10-pi: reduce ref tag code duplication
zloop: remove irq-safe locking
zloop: factor out zloop_mark_{full,empty} helpers
zloop: set RQF_QUIET when completing requests on deleted devices
zloop: improve the unaligned write pointer warning
zloop: use vfs_truncate
...
|
|
Pull ceph updates from Ilya Dryomov:
"We have a series from Alex which extends CephFS client metrics with
support for per-subvolume data I/O performance and latency tracking
(metadata operations aren't included) and a good variety of fixes and
cleanups across RBD and CephFS"
* tag 'ceph-for-7.1-rc1' of https://github.com/ceph/ceph-client:
ceph: add subvolume metrics collection and reporting
ceph: parse subvolume_id from InodeStat v9 and store in inode
ceph: handle InodeStat v8 versioned field in reply parsing
libceph: Fix slab-out-of-bounds access in auth message processing
rbd: fix null-ptr-deref when device_add_disk() fails
crush: cleanup in crush_do_rule() method
ceph: clear s_cap_reconnect when ceph_pagelist_encode_32() fails
ceph: only d_add() negative dentries when they are unhashed
libceph: update outdated comment in ceph_sock_write_space()
libceph: Remove obsolete session key alignment logic
ceph: fix num_ops off-by-one when crypto allocation fails
libceph: Prevent potential null-ptr-deref in ceph_handle_auth_reply()
|
|
This reverts commit e784f2ea0b4fd0e7b70028ff8218f22456c5dcf8.
Jiri says the patch is buggy, and it looks like he is right, so revert
it for now.
Link: https://lore.kernel.org/linux-block/897f442d-4e04-4b70-b716-38fd10b8af36@kernel.org/
Reported-by: Jiri Slaby <jirislaby@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
ublk_shmem_remove_ranges() calls unpin_user_pages() while holding the
maple tree spinlock (mas_lock). Although unpin_user_pages() is safe in
atomic context, holding the spinlock across potentially many page
unpinning operations is not ideal.
Split into __ublk_shmem_remove_ranges() which erases up to 64 ranges
under mas_lock, collecting base_pfn and nr_pages into a temporary
xarray. Then drop the lock and unpin pages outside spinlock context.
ublk_shmem_remove_ranges() loops until all matching ranges are
processed.
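
The batching scheme can be sketched in userspace as below. This is a
model under stated assumptions, not the driver's code: a flat array
replaces the maple tree, a pthread mutex replaces mas_lock, a local
array replaces the temporary xarray, and a counter replaces
unpin_user_pages(). All sketch_* names are hypothetical. A negative
buf_index is taken to mean "match every range".

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

#define SKETCH_BATCH 64	/* batch size taken from the description above */

/* Hypothetical model of one registered range. */
struct sketch_range {
	int buf_index;
	unsigned long base_pfn;
	unsigned long nr_pages;
	bool present;		/* still in the table */
};

struct sketch_table {
	pthread_mutex_t lock;	/* stands in for mas_lock */
	struct sketch_range *ranges;
	size_t nr;
	unsigned long unpinned;	/* pages released so far */
};

/* Erase up to SKETCH_BATCH matching ranges while holding the lock. */
static size_t sketch_collect(struct sketch_table *t, int buf_index,
			     struct sketch_range *batch)
{
	size_t n = 0;

	pthread_mutex_lock(&t->lock);
	for (size_t i = 0; i < t->nr && n < SKETCH_BATCH; i++) {
		struct sketch_range *r = &t->ranges[i];

		if (!r->present)
			continue;
		if (buf_index >= 0 && r->buf_index != buf_index)
			continue;
		r->present = false;	/* erase from the table */
		batch[n++] = *r;	/* remember what to unpin */
	}
	pthread_mutex_unlock(&t->lock);
	return n;
}

/* Loop until no matching range is left; unpin outside the lock. */
static void sketch_remove_ranges(struct sketch_table *t, int buf_index)
{
	struct sketch_range batch[SKETCH_BATCH];
	size_t n;

	while ((n = sketch_collect(t, buf_index, batch)) > 0)
		for (size_t i = 0; i < n; i++)
			t->unpinned += batch[i].nr_pages; /* models unpinning */
}
```

The lock hold time is bounded by the batch size, while the potentially
long unpinning work runs with the lock dropped.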
Signed-off-by: Ming Lei <tom.leiming@gmail.com>
Link: https://patch.msgid.link/20260423033058.2805135-4-tom.leiming@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Extract the shared walk+erase+unpin+kfree loop into
ublk_shmem_remove_ranges(). When buf_index >= 0, only ranges matching
that index are removed; when buf_index < 0, all ranges are removed.
Also extract ublk_unpin_range_pages() to share the page unpinning
loop.
Convert both __ublk_ctrl_unreg_buf() and ublk_buf_cleanup() to use
the new helper.
Signed-off-by: Ming Lei <tom.leiming@gmail.com>
Link: https://patch.msgid.link/20260423033058.2805135-3-tom.leiming@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
ublk_buf_cleanup() iterates the maple tree with mas_for_each()
without holding mas_lock, triggering a lockdep splat on
CONFIG_PROVE_RCU kernels since mas_find() internally uses
rcu_dereference_check() which requires either RCU or the tree lock.
Fix by holding mas_lock around the iteration, and call mas_erase()
before freeing each range to avoid dangling pointers in the tree.
Fixes: 5e864438e285 ("ublk: replace xarray with IDA for shmem buffer index allocation")
Reported-by: Jens Axboe <axboe@kernel.dk>
Closes: https://lore.kernel.org/linux-block/0349d72d-dff8-4f9f-b448-919fa5ae96da@kernel.dk/
Signed-off-by: Ming Lei <tom.leiming@gmail.com>
Link: https://patch.msgid.link/20260423033058.2805135-2-tom.leiming@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
do_rbd_add() publishes the device with device_add() before calling
device_add_disk(). If device_add_disk() fails after device_add()
succeeds, the error path calls rbd_free_disk() directly and then later
falls through to rbd_dev_device_release(), which calls rbd_free_disk()
again. This double teardown can leave blk-mq cleanup operating on
invalid state and trigger a null-ptr-deref in
__blk_mq_free_map_and_rqs(), reached from blk_mq_free_tag_set().
Fix this by following the normal remove ordering: call device_del()
before rbd_dev_device_release() when device_add_disk() fails after
device_add(). That keeps the teardown sequence consistent and avoids
re-entering disk cleanup through the wrong path.
The bug was first flagged by an experimental analysis tool we are
developing for kernel memory-management bugs while analyzing
v6.13-rc1. The tool is still under development and is not yet publicly
available.
We reproduced the bug on v7.0 with a real Ceph backend and a QEMU x86_64
guest booted with KASAN and CONFIG_FAILSLAB enabled. The reproducer
confines failslab injections to the __add_disk() range and injects
fail-nth while mapping an RBD image through
/sys/bus/rbd/add_single_major.
On the unpatched kernel, fail-nth=4 reliably triggered the fault:
Oops: general protection fault, probably for non-canonical address 0xdffffc0000000000: 0000 [#1] SMP KASAN NOPTI
KASAN: null-ptr-deref in range [0x0000000000000000-0x0000000000000007]
CPU: 0 UID: 0 PID: 273 Comm: bash Not tainted 7.0.0-01247-gd60bc1401583 #6 PREEMPT(lazy)
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.15.0-1 04/01/2014
RIP: 0010:__blk_mq_free_map_and_rqs+0x8c/0x240
Code: 00 00 48 8b 6b 60 41 89 f4 49 c1 e4 03 4c 01 e5 45 85 ed 0f 85 0a 01 00 00 48 b8 00 00 00 00 00 fc ff df 48 89 e9 48 c1 e9 03 <80> 3c 01 00 0f 85 31 01 00 00 4c 8b 6d 00 4d 85 ed 0f 84 e2 00 00
RSP: 0018:ff1100000ab0fac8 EFLAGS: 00000246
RAX: dffffc0000000000 RBX: ff1100000c4806a0 RCX: 0000000000000000
RDX: 0000000000000002 RSI: 0000000000000000 RDI: ff1100000c4806f4
RBP: 0000000000000000 R08: 0000000000000001 R09: ffe21c000189001b
R10: ff1100000c4800df R11: ff1100006cf37be0 R12: 0000000000000000
R13: 0000000000000000 R14: ff1100000c480700 R15: ff1100000c480004
FS: 00007f0fbe8fe740(0000) GS:ff110000e5851000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007fe53473b2e0 CR3: 0000000012eef000 CR4: 00000000007516f0
PKRU: 55555554
Call Trace:
<TASK>
blk_mq_free_tag_set+0x77/0x460
do_rbd_add+0x1446/0x2b80
? __pfx_do_rbd_add+0x10/0x10
? lock_acquire+0x18c/0x300
? find_held_lock+0x2b/0x80
? sysfs_file_kobj+0xb6/0x1b0
? __pfx_sysfs_kf_write+0x10/0x10
kernfs_fop_write_iter+0x2f4/0x4a0
vfs_write+0x98e/0x1000
? expand_files+0x51f/0x850
? __pfx_vfs_write+0x10/0x10
ksys_write+0xf2/0x1d0
? __pfx_ksys_write+0x10/0x10
do_syscall_64+0x115/0x690
entry_SYSCALL_64_after_hwframe+0x77/0x7f
RIP: 0033:0x7f0fbea15907
Code: 10 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b7 0f 1f 00 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 51 c3 48 83 ec 28 48 89 54 24 18 48 89 74 24
RSP: 002b:00007ffe22346ea8 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
RAX: ffffffffffffffda RBX: 0000000000000058 RCX: 00007f0fbea15907
RDX: 0000000000000058 RSI: 0000563ace6c0ef0 RDI: 0000000000000001
RBP: 0000563ace6c0ef0 R08: 0000563ace6c0ef0 R09: 6b6435726d694141
R10: 5250337279762f78 R11: 0000000000000246 R12: 0000000000000058
R13: 00007f0fbeb1c780 R14: ff1100000c480700 R15: ff1100000c480004
</TASK>
With this fix applied, rerunning the reproducer over fail-nth=1..256
yields no KASAN reports.
[ idryomov: rename err_out_device_del -> err_out_device ]
Cc: stable@vger.kernel.org
Fixes: 27c97abc30e2 ("rbd: add add_disk() error handling")
Signed-off-by: Zilin Guan <zilin@seu.edu.cn>
Signed-off-by: Dawei Feng <dawei.feng@seu.edu.cn>
Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull more MM updates from Andrew Morton:
- "Eliminate Dying Memory Cgroup" (Qi Zheng and Muchun Song)
Address the longstanding "dying memcg problem": a situation wherein a
no-longer-used memory control group will hang around for an extended
period, pointlessly consuming memory
- "fix unexpected type conversions and potential overflows" (Qi Zheng)
Fix a couple of potential 32-bit/64-bit issues which were identified
during review of the "Eliminate Dying Memory Cgroup" series
- "kho: history: track previous kernel version and kexec boot count"
(Breno Leitao)
Use Kexec Handover (KHO) to pass the previous kernel's version string
and the number of kexec reboots since the last cold boot to the next
kernel, and print it at boot time
- "liveupdate: prevent double preservation" (Pasha Tatashin)
Teach LUO to avoid managing the same file across different active
sessions
- "liveupdate: Fix module unloading and unregister API" (Pasha
Tatashin)
Address an issue with how LUO handles module reference counting and
unregistration during module unloading
- "zswap pool per-CPU acomp_ctx simplifications" (Kanchana Sridhar)
Simplify and clean up the zswap crypto compression handling and
improve the lifecycle management of zswap pool's per-CPU acomp_ctx
resources
- "mm/damon/core: fix damon_call()/damos_walk() vs kdmond exit race"
(SeongJae Park)
Address unlikely but possible leaks and deadlocks in damon_call() and
damos_walk()
- "mm/damon/core: validate damos_quota_goal->nid" (SeongJae Park)
Fix a couple of root-only wild pointer dereferences
- "Docs/admin-guide/mm/damon: warn commit_inputs vs other params race"
(SeongJae Park)
Update the DAMON documentation to warn operators about potential
races which can occur if the commit_inputs parameter is altered at
the wrong time
- "Minor hmm_test fixes and cleanups" (Alistair Popple)
Bugfixes and a cleanup for the HMM kernel selftests
- "Modify memfd_luo code" (Chenghao Duan)
Cleanups, simplifications and speedups to the memfd_luo code
- "mm, kvm: allow uffd support in guest_memfd" (Mike Rapoport)
Support for userfaultfd in guest_memfd
- "selftests/mm: skip several tests when thp is not available" (Chunyu
Hu)
Fix several issues in the selftests code which were causing breakage
when the tests were run on CONFIG_THP=n kernels
- "mm/mprotect: micro-optimization work" (Pedro Falcato)
A couple of nice speedups for mprotect()
- "MAINTAINERS: update KHO and LIVE UPDATE entries" (Pratyush Yadav)
Document upcoming changes in the maintenance of KHO, LUO, memfd_luo,
kexec, crash, kdump and probably other kexec-based things - they are
being moved out of mm.git and into a new git tree
* tag 'mm-stable-2026-04-18-02-14' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (121 commits)
MAINTAINERS: add page cache reviewer
mm/vmscan: avoid false-positive -Wuninitialized warning
MAINTAINERS: update Dave's kdump reviewer email address
MAINTAINERS: drop include/linux/liveupdate from LIVE UPDATE
MAINTAINERS: drop include/linux/kho/abi/ from KHO
MAINTAINERS: update KHO and LIVE UPDATE maintainers
MAINTAINERS: update kexec/kdump maintainers entries
mm/migrate_device: remove dead migration entry check in migrate_vma_collect_huge_pmd()
selftests: mm: skip charge_reserved_hugetlb without killall
userfaultfd: allow registration of ranges below mmap_min_addr
mm/vmstat: fix vmstat_shepherd double-scheduling vmstat_update
mm/hugetlb: fix early boot crash on parameters without '=' separator
zram: reject unrecognized type= values in recompress_store()
docs: proc: document ProtectionKey in smaps
mm/mprotect: special-case small folios when applying permissions
mm/mprotect: move softleaf code out of the main function
mm: remove '!root_reclaim' checking in should_abort_scan()
mm/sparse: fix comment for section map alignment
mm/page_io: use sio->len for PSWPIN accounting in sio_read_complete()
selftests/mm: transhuge_stress: skip the test when thp not available
...
|
|
recompress_store() parses the type= parameter with three if statements
checking for "idle", "huge", and "huge_idle". An unrecognized value
silently falls through with mode left at 0, causing the recompression pass
to run with no slot filter — processing all slots instead of the
intended subset.
Add a !mode check after the type parsing block to return -EINVAL for
unrecognized values, consistent with the function's other parameter
validation.
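
A minimal sketch of the parsing shape described above, assuming
hypothetical flag values and a hypothetical helper name
(sketch_parse_type); the driver's real constants and surrounding code
differ.

```c
#include <assert.h>
#include <errno.h>
#include <string.h>

/* Hypothetical flag values; the driver's real constants may differ. */
#define SKETCH_IDLE		(1 << 0)
#define SKETCH_HUGE		(1 << 1)
#define SKETCH_HUGE_IDLE	(1 << 2)

/* Parse a type= value; *mode stays 0 when no known value matches. */
static int sketch_parse_type(const char *val, int *mode)
{
	*mode = 0;
	if (!strcmp(val, "idle"))
		*mode = SKETCH_IDLE;
	if (!strcmp(val, "huge"))
		*mode = SKETCH_HUGE;
	if (!strcmp(val, "huge_idle"))
		*mode = SKETCH_HUGE_IDLE;

	/* The added check: no branch matched, so reject the value. */
	if (!*mode)
		return -EINVAL;
	return 0;
}
```

Without the final check, a misspelled value would return success with
mode still 0, which is the silent fall-through described above.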
Link: https://lore.kernel.org/20260407153027.42425-1-astellman@stellman-greene.com
Signed-off-by: Andrew Stellman <astellman@stellman-greene.com>
Suggested-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Reviewed-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Minchan Kim <minchan@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
As reported by Qu Wenruo and Avinesh Kumar, the following
getconf PAGESIZE
65536
blkdiscard -p 4k /dev/zram0
takes literally forever to complete. zram doesn't support partial
discards and just returns immediately w/o doing any discard work in such
cases. The problem is that we forget to endio on our way out, so
blkdiscard sleeps forever in submit_bio_wait(). Fix this by jumping to
end_bio label, which does bio_endio().
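
The control flow can be modelled in userspace as follows. This is a
sketch only: struct sketch_bio and sketch_bio_discard() are
hypothetical, the ended flag stands in for bio_endio(), and counting
whole pages stands in for the real discard work.

```c
#include <assert.h>

/* Hypothetical model of a discard request. */
struct sketch_bio {
	unsigned long offset;		/* byte offset of the request */
	unsigned long len;		/* length in bytes */
	int ended;			/* set where bio_endio() would run */
	unsigned long discarded_pages;	/* whole pages discarded */
};

static void sketch_bio_discard(struct sketch_bio *bio,
			       unsigned long page_size)
{
	unsigned long offset = bio->offset % page_size;
	unsigned long n = bio->len;

	if (offset) {
		/* Request ends inside the first page: nothing to discard. */
		if (n <= page_size - offset)
			goto end_bio;	/* the bug was returning here */
		n -= page_size - offset;
	}
	bio->discarded_pages = n / page_size;
end_bio:
	bio->ended = 1;			/* always complete the request */
}
```

With a 64 KiB page size, a 4 KiB discard at offset 4096 discards nothing
but is still completed, so a waiter is always woken.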
Link: https://lore.kernel.org/20260331074255.777019-1-senozhatsky@chromium.org
Fixes: 0120dd6e4e20 ("zram: make zram_bio_discard more self-contained")
Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Reported-by: Qu Wenruo <wqu@suse.com>
Closes: https://lore.kernel.org/linux-block/92361cd3-fb8b-482e-bc89-15ff1acb9a59@suse.com
Tested-by: Qu Wenruo <wqu@suse.com>
Reported-by: Avinesh Kumar <avinesh.kumar@suse.com>
Closes: https://bugzilla.suse.com/show_bug.cgi?id=1256530
Reviewed-by: Christoph Hellwig <hch@lst.de>
Cc: Brian Geffon <bgeffon@google.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Minchan Kim <minchan@kernel.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
When platform_device_register() fails in do_floppy_init(), the embedded
struct device in floppy_device[drive] has already been initialized by
device_initialize(), but the failure path jumps to out_remove_drives
without dropping the device reference for the current drive.
Previously registered floppy devices are cleaned up in out_remove_drives,
but the device for the drive that fails registration is not, leading to
a reference leak.
The issue was identified by a static analysis tool I developed and
confirmed by manual review. Fix this by calling put_device() for the
current floppy device before jumping to the common cleanup path.
Fixes: 94fd0db7bfb4a ("[PATCH] Floppy: Add cmos attribute to floppy driver")
Cc: stable@vger.kernel.org
Signed-off-by: Guangshuo Li <lgs201920130244@gmail.com>
Link: https://patch.msgid.link/20260415145708.3331818-1-lgs201920130244@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Bio pages may originate from slab caches that lack a usercopy region
(e.g. jbd2 frozen metadata buffers allocated via jbd2_alloc()).
When CONFIG_HARDENED_USERCOPY is enabled, copy_to_iter() calls
check_copy_size() which rejects these slab pages, triggering a
kernel BUG in usercopy_abort().
This is a false positive: the data is ordinary block I/O content —
the same data the loop driver writes to its backing file via
vfs_iter_write(). The bvec length is always trusted, so the size
check in check_copy_size() is not needed either.
Switch to _copy_to_iter()/_copy_from_iter() which skip the
check_copy_size() wrapper while the underlying copy_to_user()
remains unchanged.
Acked-by: Caleb Sander Mateos <csander@purestorage.com>
Fixes: 2299ceec364e ("ublk: use copy_{to,from}_iter() for user copy")
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://patch.msgid.link/20260415230246.808176-1-tom.leiming@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull MM updates from Andrew Morton:
- "maple_tree: Replace big node with maple copy" (Liam Howlett)
Mainly preparatory work for ongoing development but it does reduce
stack usage and is an improvement.
- "mm, swap: swap table phase III: remove swap_map" (Kairui Song)
Offers memory savings by removing the static swap_map. It also yields
some CPU savings and implements several cleanups.
- "mm: memfd_luo: preserve file seals" (Pratyush Yadav)
Add file seal preservation to LUO's memfd code
- "mm: zswap: add per-memcg stat for incompressible pages" (Jiayuan
Chen)
Additional userspace stats reporting for zswap
- "arch, mm: consolidate empty_zero_page" (Mike Rapoport)
Some cleanups for our handling of ZERO_PAGE() and zero_pfn
- "mm/kmemleak: Improve scan_should_stop() implementation" (Zhongqiu
Han)
A robustness improvement and some cleanups in the kmemleak code
- "Improve khugepaged scan logic" (Vernon Yang)
Improve khugepaged scan logic and reduce CPU consumption by
prioritizing scanning tasks that access memory frequently
- "Make KHO Stateless" (Jason Miu)
Simplify Kexec Handover by transitioning KHO from an xarray-based
metadata tracking system with serialization to a radix tree data
structure that can be passed directly to the next kernel
- "mm: vmscan: add PID and cgroup ID to vmscan tracepoints" (Thomas
Ballasi and Steven Rostedt)
Enhance vmscan's tracepointing
- "mm: arch/shstk: Common shadow stack mapping helper and
VM_NOHUGEPAGE" (Catalin Marinas)
Cleanup for the shadow stack code: remove per-arch code in favour of
a generic implementation
- "Fix KASAN support for KHO restored vmalloc regions" (Pasha Tatashin)
Fix a WARN() which can be emitted when KHO restores a vmalloc area
- "mm: Remove stray references to pagevec" (Tal Zussman)
Several cleanups, mainly updating references to "struct pagevec",
which became folio_batch three years ago
- "mm: Eliminate fake head pages from vmemmap optimization" (Kiryl
Shutsemau)
Simplify the HugeTLB vmemmap optimization (HVO) by changing how tail
pages encode their relationship to the head page
- "mm/damon/core: improve DAMOS quota efficiency for core layer
filters" (SeongJae Park)
Improve two problematic behaviors of DAMOS that make it less
efficient when core layer filters are used
- "mm/damon: strictly respect min_nr_regions" (SeongJae Park)
Improve DAMON usability by extending the treatment of the
min_nr_regions user-settable parameter
- "mm/page_alloc: pcp locking cleanup" (Vlastimil Babka)
The proper fix for a previously hotfixed SMP=n issue. Code
simplifications and cleanups ensued
- "mm: cleanups around unmapping / zapping" (David Hildenbrand)
A bunch of cleanups around unmapping and zapping. Mostly
simplifications, code movements, documentation and renaming of
zapping functions
- "support batched checking of the young flag for MGLRU" (Baolin Wang)
Batched checking of the young flag for MGLRU. It's part cleanups; one
benchmark shows large performance benefits for arm64
- "memcg: obj stock and slab stat caching cleanups" (Johannes Weiner)
memcg cleanup and robustness improvements
- "Allow order zero pages in page reporting" (Yuvraj Sakshith)
Enhance free page reporting - it presently and undesirably excludes
order-0 pages when reporting free memory.
- "mm: vma flag tweaks" (Lorenzo Stoakes)
Cleanup work following from the recent conversion of the VMA flags to
a bitmap
- "mm/damon: add optional debugging-purpose sanity checks" (SeongJae
Park)
Add some more developer-facing debug checks into DAMON core
- "mm/damon: test and document power-of-2 min_region_sz requirement"
(SeongJae Park)
Add a DAMON kunit test and make some adjustments to the
addr_unit parameter handling
- "mm/damon/core: make passed_sample_intervals comparisons
overflow-safe" (SeongJae Park)
Fix a hard-to-hit time overflow issue in DAMON core
- "mm/damon: improve/fixup/update ratio calculation, test and
documentation" (SeongJae Park)
A batch of misc/minor improvements and fixups for DAMON
- "mm: move vma_(kernel|mmu)_pagesize() out of hugetlb.c" (David
Hildenbrand)
Fix a possible issue with dax-device when CONFIG_HUGETLB=n. Some code
movement was required.
- "zram: recompression cleanups and tweaks" (Sergey Senozhatsky)
A somewhat random mix of fixups, recompression cleanups and
improvements in the zram code
- "mm/damon: support multiple goal-based quota tuning algorithms"
(SeongJae Park)
Extend DAMOS quotas goal auto-tuning to support multiple tuning
algorithms that users can select
- "mm: thp: reduce unnecessary start_stop_khugepaged()" (Breno Leitao)
Fix the khugepaged sysfs handling so we no longer spam the logs with
reams of junk when starting/stopping khugepaged
- "mm: improve map count checks" (Lorenzo Stoakes)
Provide some cleanups and slight fixes in the mremap, mmap and vma
code
- "mm/damon: support addr_unit on default monitoring targets for
modules" (SeongJae Park)
Extend the use of DAMON core's addr_unit tunable
- "mm: khugepaged cleanups and mTHP prerequisites" (Nico Pache)
Cleanups to khugepaged, which also serve as a base for Nico's planned
khugepaged mTHP support
- "mm: memory hot(un)plug and SPARSEMEM cleanups" (David Hildenbrand)
Code movement and cleanups in the memhotplug and sparsemem code
- "mm: remove CONFIG_ARCH_ENABLE_MEMORY_HOTREMOVE and cleanup
CONFIG_MIGRATION" (David Hildenbrand)
Rationalize some memhotplug Kconfig support
- "change young flag check functions to return bool" (Baolin Wang)
Cleanups to change all young flag check functions to return bool
- "mm/damon/sysfs: fix memory leak and NULL dereference issues" (Josh
Law and SeongJae Park)
Fix a few potential DAMON bugs
- "mm/vma: convert vm_flags_t to vma_flags_t in vma code" (Lorenzo
Stoakes)
Convert a lot of the existing use of the legacy vm_flags_t data type
to the new vma_flags_t type which replaces it. Mainly in the vma
code.
- "mm: expand mmap_prepare functionality and usage" (Lorenzo Stoakes)
Expand the mmap_prepare functionality, which is intended to replace
the deprecated f_op->mmap hook which has been the source of bugs and
security issues for some time. Cleanups, documentation, extension of
mmap_prepare into filesystem drivers
- "mm/huge_memory: refactor zap_huge_pmd()" (Lorenzo Stoakes)
Simplify and clean up zap_huge_pmd(). Additional cleanups around
vm_normal_folio_pmd() and the softleaf functionality are performed.
* tag 'mm-stable-2026-04-13-21-45' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (369 commits)
mm: fix deferred split queue races during migration
mm/khugepaged: fix issue with tracking lock
mm/huge_memory: add and use has_deposited_pgtable()
mm/huge_memory: add and use normal_or_softleaf_folio_pmd()
mm: add softleaf_is_valid_pmd_entry(), pmd_to_softleaf_folio()
mm/huge_memory: separate out the folio part of zap_huge_pmd()
mm/huge_memory: use mm instead of tlb->mm
mm/huge_memory: remove unnecessary sanity checks
mm/huge_memory: deduplicate zap deposited table call
mm/huge_memory: remove unnecessary VM_BUG_ON_PAGE()
mm/huge_memory: add a common exit path to zap_huge_pmd()
mm/huge_memory: handle buggy PMD entry in zap_huge_pmd()
mm/huge_memory: have zap_huge_pmd return a boolean, add kdoc
mm/huge: avoid big else branch in zap_huge_pmd()
mm/huge_memory: simplify vma_is_special_huge()
mm: on remap assert that input range within the proposed VMA
mm: add mmap_action_map_kernel_pages[_full]()
uio: replace deprecated mmap hook with mmap_prepare in uio_info
drivers: hv: vmbus: replace deprecated mmap hook with mmap_prepare
mm: allow handling of stacked mmap_prepare hooks in more drivers
...
|
|
All of zloop runs in user context, so drop the irq-safe locking.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://patch.msgid.link/20260414081811.549755-7-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Move a few chunks of duplicated code into helpers.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://patch.msgid.link/20260414081811.549755-6-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Reduce the dmesg spam for tests that involve device deletion.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://patch.msgid.link/20260414081811.549755-5-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Use the IS_ALIGNED helper and avoid extra conversions, and tell the
user what the unaligned size is.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://patch.msgid.link/20260414081811.549755-4-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
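The alignment check described in this commit can be sketched in user-space C as follows. This is an illustration under stated assumptions, not the zloop code: the macro is a simplified form of the kernel's IS_ALIGNED(), and the helper name `check_file_size()`, the 512-byte alignment and the message text are all hypothetical.

```c
#include <stdint.h>
#include <stdio.h>

/* Simplified form of the kernel's IS_ALIGNED(); 'a' must be a power of two. */
#define IS_ALIGNED(x, a) (((x) & ((a) - 1)) == 0)

/* Assumption: sizes must be a multiple of a 512-byte sector. */
#define SECTOR_SIZE 512ULL

/*
 * Hypothetical check: return 0 if 'size' is sector aligned, -1 otherwise.
 * On failure the offending size is reported so the user can see it.
 */
static int check_file_size(uint64_t size)
{
	if (!IS_ALIGNED(size, SECTOR_SIZE)) {
		fprintf(stderr, "file size %llu is not sector aligned\n",
			(unsigned long long)size);
		return -1;
	}
	return 0;
}
```

Testing the byte count directly avoids converting it to sectors and back just to detect a remainder.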
|
|
While vfs_truncate does various extra checks that we don't really need,
it is always better to use a VFS helper rather than open coding the
logic.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://patch.msgid.link/20260414081811.549755-3-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
The write pointer is absolute and in sector units, so we need to
convert it to a relative byte address first.
Fixes: c505448748f7 ("zloop: forget write cache on force removal")
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://patch.msgid.link/20260414081811.549755-2-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
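The unit conversion described in this commit can be sketched in user-space C as follows. This is an illustration only: the helper name `wp_to_zone_byte_offset()` and its parameters are hypothetical and are not taken from the zloop driver. Only the 512-byte sector convention (a shift of 9) is the kernel's.

```c
#include <stdint.h>

/* Kernel block-layer convention: one sector is 512 bytes. */
#define SECTOR_SHIFT 9

/*
 * Hypothetical helper: turn an absolute write pointer (in sectors,
 * counted from the start of the device) into a byte offset relative
 * to the start of the zone that contains it.
 */
static uint64_t wp_to_zone_byte_offset(uint64_t wp_sector,
				       uint64_t zone_start_sector)
{
	/* Make the pointer zone relative first, then convert sectors to bytes. */
	return (wp_sector - zone_start_sector) << SECTOR_SHIFT;
}
```

Using the absolute sector value directly as a byte offset would be wrong on both counts: it has the wrong origin and the wrong unit.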
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 cleanups from Ingo Molnar:
- Consolidate AMD and Hygon cases in parse_topology() (Wei Wang)
- asm constraints cleanups in __iowrite32_copy() (Uros Bizjak)
- Drop AMD Extended Interrupt LVT macros (Naveen N Rao)
- Don't use REALLY_SLOW_IO for delays (Juergen Gross)
- paravirt cleanups (Juergen Gross)
- FPU code cleanups (Borislav Petkov)
- split-lock handling code cleanups (Borislav Petkov, Ronan Pigott)
* tag 'x86-cleanups-2026-04-13' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/fpu: Correct the comment explaining what xfeatures_in_use() does
x86/split_lock: Don't warn about unknown split_lock_detect parameter
x86/fpu: Correct misspelled xfeaures_to_write local var
x86/apic: Drop AMD Extended Interrupt LVT macros
x86/cpu/topology: Consolidate AMD and Hygon cases in parse_topology()
block/floppy: Don't use REALLY_SLOW_IO for delays
x86/paravirt: Replace io_delay() hook with a bool
x86/irqflags: Preemptively move include paravirt.h directive where it belongs
x86/split_lock: Restructure the unwieldy switch-case in sld_state_show()
x86/local: Remove trailing semicolon from _ASM_XADD in local_add_return()
x86/asm: Use inout "+" asm constraint modifiers in __iowrite32_copy()
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux
Pull block updates from Jens Axboe:
- Add shared memory zero-copy I/O support for ublk, bypassing per-I/O
copies between kernel and userspace by matching registered buffer
PFNs at I/O time. Includes selftests.
- Refactor bio integrity to support filesystem initiated integrity
operations and arbitrary buffer alignment.
- Clean up bio allocation, splitting bio_alloc_bioset() into clear fast
and slow paths. Add bio_await() and bio_submit_or_kill() helpers,
unify synchronous bi_end_io callbacks.
- Fix zone write plug refcount handling and plug removal races. Add
support for serializing zone writes at QD=1 for rotational zoned
devices, yielding significant throughput improvements.
- Add SED-OPAL ioctls for Single User Mode management and a STACK_RESET
command.
- Add io_uring passthrough (uring_cmd) support to the BSG layer.
- Replace pp_buf in partition scanning with struct seq_buf.
- zloop improvements and cleanups.
- drbd genl cleanup, switching to pre_doit/post_doit.
- NVMe pull request via Keith:
- Fabrics authentication updates
- Enhanced block queue limits support
- Workqueue usage updates
- A new write zeroes device quirk
- Tagset cleanup fix for loop device
- MD pull requests via Yu Kuai:
- Fix raid5 soft lockup in retry_aligned_read()
- Fix raid10 deadlock with check operation and nowait requests
- Fix raid1 overlapping writes on writemostly disks
- Fix sysfs deadlock on array_state=clear
- Proactive RAID-5 parity building with llbitmap, with
write_zeroes_unmap optimization for initial sync
- Fix llbitmap barrier ordering, rdev skipping, and bitmap_ops
version mismatch fallback
- Fix bcache use-after-free and uninitialized closure
- Validate raid5 journal metadata payload size
- Various cleanups
- Various other fixes, improvements, and cleanups
* tag 'for-7.1/block-20260411' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux: (146 commits)
ublk: fix tautological comparison warning in ublk_ctrl_reg_buf
scsi: bsg: fix buffer overflow in scsi_bsg_uring_cmd()
block: refactor blkdev_zone_mgmt_ioctl
MAINTAINERS: update ublk driver maintainer email
Documentation: ublk: address review comments for SHMEM_ZC docs
ublk: allow buffer registration before device is started
ublk: replace xarray with IDA for shmem buffer index allocation
ublk: simplify PFN range loop in __ublk_ctrl_reg_buf
ublk: verify all pages in multi-page bvec fall within registered range
ublk: widen ublk_shmem_buf_reg.len to __u64 for 4GB buffer support
xfs: use bio_await in xfs_zone_gc_reset_sync
block: add a bio_submit_or_kill helper
block: factor out a bio_await helper
block: unify the synchronous bi_end_io callbacks
xfs: fix number of GC bvecs
selftests/ublk: add read-only buffer registration test
selftests/ublk: add filesystem fio verify test for shmem_zc
selftests/ublk: add hugetlbfs shmem_zc test for loop target
selftests/ublk: add shared memory zero-copy test
selftests/ublk: add UBLK_F_SHMEM_ZC support for loop target
...
|
|
On 32-bit architectures, 'unsigned long size' can never exceed
UBLK_SHMEM_BUF_SIZE_MAX (1ULL << 32), causing a tautological
comparison warning. Validate buf_reg.len (__u64) directly before
using it, and consolidate all input validation into a single check.
Also remove the unnecessary local variables 'addr' and 'size' since
buf_reg.addr and buf_reg.len can be used directly.
Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202604101952.3NOzqnu9-lkp@intel.com/
Fixes: 23b3b6f0b584 ("ublk: widen ublk_shmem_buf_reg.len to __u64 for 4GB buffer support")
Signed-off-by: Ming Lei <tom.leiming@gmail.com>
Link: https://patch.msgid.link/20260410124136.3983429-1-tom.leiming@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
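The idea of validating the 64-bit length before it is narrowed can be sketched in user-space C as follows. This is a minimal illustration, not the ublk code: `struct buf_reg`, `buf_reg_valid()` and the page-alignment checks are assumptions made for the example. Only the 1ULL << 32 limit comes from the commit message.

```c
#include <stdbool.h>
#include <stdint.h>

/* Limit quoted in the commit message. */
#define BUF_SIZE_MAX (1ULL << 32)

/* Assumption for this sketch: 4 KiB pages. */
#define SKETCH_PAGE_SIZE 4096ULL

/* Hypothetical stand-in for the user-supplied registration request. */
struct buf_reg {
	uint64_t addr;
	uint64_t len;
};

/*
 * All input validation in one expression. The length is compared as a
 * 64-bit value, so the check stays meaningful even where 'unsigned long'
 * is 32 bits wide and could never exceed the limit.
 */
static bool buf_reg_valid(const struct buf_reg *r)
{
	return r->len != 0 &&
	       r->len <= BUF_SIZE_MAX &&
	       (r->addr % SKETCH_PAGE_SIZE) == 0 &&
	       (r->len % SKETCH_PAGE_SIZE) == 0;
}
```

Had the length first been copied into a 32-bit `unsigned long`, the comparison against the limit would always pass, which is what the compiler warning points out.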
|
|
Before START_DEV, there is no disk, no queue, no I/O dispatch, so
the maple tree can be safely modified under ub->mutex alone without
freezing the queue.
Add ublk_lock_buf_tree()/ublk_unlock_buf_tree() helpers that take
ub->mutex first, then freeze the queue if device is started. This
ordering (mutex -> freeze) is safe because ublk_stop_dev_unlocked()
already holds ub->mutex when calling del_gendisk() which freezes
the queue.
Suggested-by: Caleb Sander Mateos <csander@purestorage.com>
Signed-off-by: Ming Lei <tom.leiming@gmail.com>
Link: https://patch.msgid.link/20260409133020.3780098-6-tom.leiming@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Remove struct ublk_buf which only contained nr_pages that was never
read after registration. Use IDA for pure index allocation instead
of xarray. Make __ublk_ctrl_unreg_buf() return int so the caller
can detect invalid index without a separate lookup.
Simplify ublk_buf_cleanup() to walk the maple tree directly and
unpin all pages in one pass, instead of iterating the xarray by
buffer index.
Suggested-by: Caleb Sander Mateos <csander@purestorage.com>
Signed-off-by: Ming Lei <tom.leiming@gmail.com>
Link: https://patch.msgid.link/20260409133020.3780098-5-tom.leiming@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Use the for-loop increment instead of a manual `i++` past the last
page, and fix the mtree_insert_range end key accordingly.
Suggested-by: Caleb Sander Mateos <csander@purestorage.com>
Signed-off-by: Ming Lei <tom.leiming@gmail.com>
Link: https://patch.msgid.link/20260409133020.3780098-4-tom.leiming@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
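One way to picture the loop shape and the inclusive end key is the user-space sketch below. It is an assumption-laden illustration, not the driver's loop: `struct pfn_range` and `collect_pfn_ranges()` are hypothetical, and the coalescing of contiguous PFNs is assumed for the example. The one fact relied on is that maple tree ranges, as passed to mtree_insert_range(), have an inclusive last index.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical range record; 'last' is inclusive, like a maple tree range. */
struct pfn_range {
	uint64_t first;
	uint64_t last;
};

/*
 * Coalesce runs of contiguous PFNs into inclusive ranges.
 * 'out' must have room for 'n' entries. Returns the number of ranges.
 */
static size_t collect_pfn_ranges(const uint64_t *pfns, size_t n,
				 struct pfn_range *out)
{
	size_t nr = 0;
	size_t start, i;

	/* The outer increment reuses 'i'; there is no manual bump in the body. */
	for (start = 0; start < n; start = i) {
		/* Advance 'i' while the next PFN continues the run. */
		for (i = start + 1; i < n && pfns[i] == pfns[i - 1] + 1; i++)
			;
		/* 'i' is one past the run, so the inclusive end is at i - 1. */
		out[nr].first = pfns[start];
		out[nr].last = pfns[i - 1];
		nr++;
	}
	return nr;
}
```

Because the inner loop leaves `i` one past the run, the end key is read at `i - 1`. Mixing that up with an extra manual increment is the kind of off-by-one the commit message refers to.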
|
|
rq_for_each_bvec() yields multi-page bvecs where bv_page is only the
first page. ublk |