aboutsummaryrefslogtreecommitdiff
path: root/mm
AgeCommit message (Collapse)AuthorFilesLines
2026-04-27mm/hugetlb_cma: round up per_node before logging itSang-Heon Jeon1-0/+1
When the user requests a total hugetlb CMA size without per-node specification, hugetlb_cma_reserve() computes per_node from hugetlb_cma_size and the number of nodes that have memory per_node = DIV_ROUND_UP(hugetlb_cma_size, nodes_weight(hugetlb_bootmem_nodes)); The reservation loop later computes size = round_up(min(per_node, hugetlb_cma_size - reserved), PAGE_SIZE << order); So the actually reserved per_node size is multiple of (PAGE_SIZE << order), but the logged per_node is not rounded up, so it may be smaller than the actual reserved size. For example, as the existing comment describes, if a 3 GB area is requested on a machine with 4 NUMA nodes that have memory, 1 GB is allocated on the first three nodes, but the printed log is hugetlb_cma: reserve 3072 MiB, up to 768 MiB per node Round per_node up to (PAGE_SIZE << order) before logging so that the printed log always matches the actual reserved size. No functional change to the actual reservation size, as the following case analysis shows 1. remaining (hugetlb_cma_size - reserved) >= rounded per_node - AS-IS: min() picks unrounded per_node; round_up() returns rounded per_node - TO-BE: min() picks rounded per_node; round_up() returns rounded per_node (no-op) 2. remaining < unrounded per_node - AS-IS: min() picks remaining; round_up() returns round_up(remaining) - TO-BE: min() picks remaining; round_up() returns round_up(remaining) 3. unrounded per_node <= remaining < rounded per_node - AS-IS: min() picks unrounded per_node; round_up() returns rounded per_node - TO-BE: min() picks remaining; round_up() returns round_up(remaining) equals rounded per_node Link: https://lore.kernel.org/20260422143353.852257-1-ekffu200098@gmail.com Fixes: cf11e85fc08c ("mm: hugetlb: optionally allocate gigantic hugepages using cma") # 5.7 Signed-off-by: Sang-Heon Jeon <ekffu200098@gmail.com> Reviewed-by: Muchun Song <muchun.song@linux.dev> Cc: David Hildenbrand <david@kernel.org> Cc: Oscar Salvador <osalvador@suse.de> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-04-27mm/vma: do not try to unmap a VMA if mmap_prepare() invoked from mmap()Lorenzo Stoakes2-10/+19
The mmap_prepare hook functionality includes the ability to invoke mmap_prepare() from the mmap() hook of existing 'stacked' drivers, that is ones which are capable of calling the mmap hooks of other drivers/file systems (e.g. overlayfs, shm). As part of the mmap_prepare action functionality, we deal with errors by unmapping the VMA should one arise. This works in the usual mmap_prepare case, as we invoke this action at the last moment, when the VMA is established in the maple tree. However, the mmap() hook passes a not-fully-established VMA pointer to the caller (which is the motivation behind the mmap_prepare() work), which is detached. So attempting to unmap a VMA in this state will be problematic, with the most obvious symptom being a warning in vma_mark_detached(), because the VMA is already detached. It's also unncessary - the mmap() handler will clean up the VMA on error. So to fix this issue, this patch propagates whether or not an mmap action is being completed via the compatibility layer or directly. If the former, then we do not attempt VMA cleanup, if the latter, then we do. This patch also updates the userland VMA tests to reflect the change. Link: https://lore.kernel.org/20260421102150.189982-1-ljs@kernel.org Fixes: ac0a3fc9c07d ("mm: add ability to take further action in vm_area_desc") Signed-off-by: Lorenzo Stoakes <ljs@kernel.org> Reported-by: syzbot+db390288d141a1dccf96@syzkaller.appspotmail.com Closes: https://lore.kernel.org/all/69e69734.050a0220.24bfd3.0027.GAE@google.com/ Cc: David Hildenbrand <david@kernel.org> Cc: Jann Horn <jannh@google.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Pedro Falcato <pfalcato@suse.de> Cc: Suren Baghdasaryan <surenb@google.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-04-27mm: start background writeback based on per-wb threshold for strictlimit BDIsJoanne Koong1-10/+6
The proactive nr_dirty > gdtc->bg_thresh check in balance_dirty_pages() only checks the global dirty threshold to start background writeback while the writer is still free-running, but for strictlimit BDIs (eg fuse), the per-wb dirty count can exceed the per-wb background threshold while the global threshold is not yet exceeded, so background writeback for this case never gets proactively started. Add a per-wb threshold check for strictlimit BDIs so that background writeback is started when wb_dirty exceeds wb_bg_thresh, which drains dirty pages before the writer hits the throttle wall, matching the proactive behavior that the global check provides for non-strictlimit BDIs. fio runs on fuse show about a 3-4% improvement in perf for buffered writes: fio --name=writeback_test --ioengine=psync --rw=write --bs=128k \ --size=2G --numjobs=4 --ramp_time=10 --runtime=20 \ --time_based --group_reporting=1 --direct=0 Link: https://lore.kernel.org/20260326234629.840938-2-joannelkoong@gmail.com Signed-off-by: Joanne Koong <joannelkoong@gmail.com> Reviewed-by: Jan Kara <jack@suse.cz> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Miklos Szeredi <miklos@szeredi.hu> Cc: Christoph Hellwig <hch@infradead.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-04-27vmalloc: fix buffer overflow in vrealloc_node_align()Marco Elver1-1/+1
Commit 4c5d3365882d ("mm/vmalloc: allow to set node and align in vrealloc") added the ability to force a new allocation if the current pointer is on the wrong NUMA node, or if an alignment constraint is not met, even if the user is shrinking the allocation. On this path (need_realloc), the code allocates a new object of 'size' bytes and then memcpy()s 'old_size' bytes into it. If the request is to shrink the object (size < old_size), this results in an out-of-bounds write on the new buffer. Fix this by bounding the copy length by the new allocation size. Link: https://lore.kernel.org/20260420114805.3572606-2-elver@google.com Fixes: 4c5d3365882d ("mm/vmalloc: allow to set node and align in vrealloc") Signed-off-by: Marco Elver <elver@google.com> Reported-by: Harry Yoo (Oracle) <harry@kernel.org> Reviewed-by: Uladzislau Rezki (Sony) <urezki@gmail.com> Acked-by: Vlastimil Babka (SUSE) <vbabka@kernel.org> Reviewed-by: Harry Yoo (Oracle) <harry@kernel.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-04-27mm/slab: return NULL early from kmalloc_nolock() in NMI on UPHarry Yoo (Oracle)1-0/+4
On UP kernels (!CONFIG_SMP), spin_trylock() is a no-op that unconditionally succeeds even when the lock is already held. As a result, kmalloc_nolock() called from NMI context can re-enter the slab allocator and acquire n->list_lock that the interrupted context is already holding, corrupting slab state. With CONFIG_DEBUG_SPINLOCK on UP, the following BUG is triggered with the slub_kunit test module: BUG: spinlock trylock failure on UP on CPU#0, kunit_try_catch/243 [...] Call Trace: <NMI> dump_stack_lvl+0x3f/0x60 do_raw_spin_trylock+0x41/0x50 _raw_spin_trylock+0x24/0x50 get_from_partial_node+0x120/0x4d0 ___slab_alloc+0x8a/0x4c0 kmalloc_nolock_noprof+0x164/0x310 [...] </NMI> Fix this by returning NULL early when invoked from NMI on a UP kernel. Link: https://lore.kernel.org/linux-mm/ad_cqe51pvr1WaDg@hyeyoo Cc: stable@vger.kernel.org Fixes: af92793e52c3 ("slab: Introduce kmalloc_nolock() and kfree_nolock().") Signed-off-by: Harry Yoo (Oracle) <harry@kernel.org> Link: https://patch.msgid.link/20260427-nolock-api-fix-v2-2-a6b83a92d9a4@kernel.org Signed-off-by: Vlastimil Babka (SUSE) <vbabka@kernel.org>
2026-04-27mm/page_alloc: return NULL early from alloc_frozen_pages_nolock() in NMI on UPHarry Yoo (Oracle)1-0/+5
On UP kernels (!CONFIG_SMP), spin_trylock() is a no-op that unconditionally succeeds even when the lock is already held. As a result, alloc_frozen_pages_nolock() called from NMI context can re-enter rmqueue() and acquire the zone lock that the interrupted context is already holding, corrupting the freelists. With CONFIG_DEBUG_SPINLOCK on UP, the following BUG is triggered with the slub_kunit test module: BUG: spinlock trylock failure on UP on CPU#0, kunit_try_catch/243 [...] Call Trace: <NMI> dump_stack_lvl+0x3f/0x60 do_raw_spin_trylock+0x41/0x50 _raw_spin_trylock+0x24/0x50 rmqueue.isra.0+0x2a9/0xa70 get_page_from_freelist+0xeb/0x450 alloc_frozen_pages_nolock_noprof+0x111/0x1e0 allocate_slab+0x42a/0x500 ___slab_alloc+0xa7/0x4c0 kmalloc_nolock_noprof+0x164/0x310 [...] </NMI> Fix this by returning NULL early when invoked from NMI on a UP kernel. Link: https://lore.kernel.org/linux-mm/ad_cqe51pvr1WaDg@hyeyoo Cc: stable@vger.kernel.org Fixes: d7242af86434 ("mm: Introduce alloc_frozen_pages_nolock()") Signed-off-by: Harry Yoo (Oracle) <harry@kernel.org> Link: https://patch.msgid.link/20260427-nolock-api-fix-v2-1-a6b83a92d9a4@kernel.org Signed-off-by: Vlastimil Babka (SUSE) <vbabka@kernel.org>
2026-04-24Merge tag 'slab-for-7.1-fix' of ↵Linus Torvalds1-12/+12
git://git.kernel.org/pub/scm/linux/kernel/git/vbabka/slab Pull slab fix from Vlastimil Babka: - A stable fix for k(v)ealloc() where reallocating on a different node or shrinking the object can result in either losing the original data or a buffer overflow (Marco Elver) * tag 'slab-for-7.1-fix' of git://git.kernel.org/pub/scm/linux/kernel/git/vbabka/slab: slub: fix data loss and overflow in krealloc()
2026-04-22Merge tag 's390-7.1-1' of ↵Linus Torvalds1-9/+6
git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux Pull s390 updates from Vasily Gorbik: - Add support for CONFIG_PAGE_TABLE_CHECK and enable it in debug_defconfig. s390 can only tell user from kernel PTEs via the mm, so mm_struct is now passed into pxx_user_accessible_page() callbacks - Expose the PCI function UID as an arch-specific slot attribute in sysfs so a function can be identified by its user-defined id while still in standby. Introduces a generic ARCH_PCI_SLOT_GROUPS hook in drivers/pci/slot.c - Refresh s390 PCI documentation to reflect current behavior and cover previously undocumented sysfs attributes - zcrypt device driver cleanup series: consistent field types, clearer variable naming, a kernel-doc warning fix, and a comment explaining the intentional synchronize_rcu() in pkey_handler_register() - Provide an s390 arch_raw_cpu_ptr() that avoids the detour via get_lowcore() using alternatives, shrinking defconfig by ~27 kB - Guard identity-base randomization with kaslr_enabled() so nokaslr keeps the identity mapping at 0 even with RANDOMIZE_IDENTITY_BASE=y - Build S390_MODULES_SANITY_TEST as a module only by requiring KUNIT && m, since built-in would not exercise module loading - Remove the permanently commented-out HMCDRV_DEV_CLASS create_class() code in the hmcdrv driver - Drop stale ident_map_size extern conflicting with asm/page.h * tag 's390-7.1-1' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux: s390/zcrypt: Fix warning about wrong kernel doc comment PCI: s390: Expose the UID as an arch specific PCI slot attribute docs: s390/pci: Improve and update PCI documentation s390/pkey: Add comment about synchronize_rcu() to pkey base s390/hmcdrv: Remove commented out code s390/zcrypt: Slight rework on the agent_id field s390/zcrypt: Explicitly use a card variable in _zcrypt_send_cprb s390/zcrypt: Rework MKVP fields and handling s390/zcrypt: Make apfs a real unsigned int field s390/zcrypt: Rework domain processing within zcrypt device driver s390/zcrypt: Move inline function rng_type6cprb_msgx from header to code s390/percpu: Provide arch_raw_cpu_ptr() s390: Enable page table check for debug_defconfig s390/pgtable: Add s390 support for page table check s390/pgtable: Use set_pmd_bit() to invalidate PMD entry mm/page_table_check: Pass mm_struct to pxx_user_accessible_page() s390/boot: Respect kaslr_enabled() for identity randomization s390/Kconfig: Make modules sanity test a module-only option s390/setup: Drop stale ident_map_size declaration
2026-04-20Merge tag 'uml-for-7.1-rc1' of ↵Linus Torvalds1-0/+1
git://git.kernel.org/pub/scm/linux/kernel/git/uml/linux Pull uml updates from Johannes Berg: "Mostly cleanups and small things, notably: - musl libc compatibility - vDSO installation fix - TLB sync race fix for recent SMP support - build fix for 32-bit with Clang 20/21" * tag 'uml-for-7.1-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/uml/linux: um: Disable GCOV_PROFILE_ALL on 32-bit UML with Clang 20/21 um: drivers: call kernel_strrchr() explicitly in cow_user.c um: Replace strncpy() with strnlen()+memcpy_and_pad() in strncpy_chunk_from_user() x86/um: fix vDSO installation um: Remove CONFIG_FRAME_WARN from x86_64_defconfig um: Fix pte_read() and pte_exec() for kernel mappings um: Fix potential race condition in TLB sync um: time-travel: clean up kernel-doc warnings um: avoid struct sigcontext redefinition with musl um: fix address-of CMSG_DATA() rvalue in stub
2026-04-19Merge tag 'mm-hotfixes-stable-2026-04-19-00-14' of ↵Linus Torvalds8-12/+27
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull MM fixes from Andrew Morton: "7 hotfixes. 6 are cc:stable and all are for MM. Please see the individual changelogs for details" * tag 'mm-hotfixes-stable-2026-04-19-00-14' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: mm/damon/core: disallow non-power of two min_region_sz on damon_start() mm/vmalloc: take vmap_purge_lock in shrinker mm: call ->free_folio() directly in folio_unmap_invalidate() mm: blk-cgroup: fix use-after-free in cgwb_release_workfn() mm/zone_device: do not touch device folio after calling ->folio_free() mm/damon/core: disallow time-quota setting zero esz mm/mempolicy: fix weighted interleave auto sysfs name
2026-04-19Merge tag 'mm-stable-2026-04-18-02-14' of ↵Linus Torvalds30-1000/+1626
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull more MM updates from Andrew Morton: - "Eliminate Dying Memory Cgroup" (Qi Zheng and Muchun Song) Address the longstanding "dying memcg problem". A situation wherein a no-longer-used memory control group will hang around for an extended period pointlessly consuming memory - "fix unexpected type conversions and potential overflows" (Qi Zheng) Fix a couple of potential 32-bit/64-bit issues which were identified during review of the "Eliminate Dying Memory Cgroup" series - "kho: history: track previous kernel version and kexec boot count" (Breno Leitao) Use Kexec Handover (KHO) to pass the previous kernel's version string and the number of kexec reboots since the last cold boot to the next kernel, and print it at boot time - "liveupdate: prevent double preservation" (Pasha Tatashin) Teach LUO to avoid managing the same file across different active sessions - "liveupdate: Fix module unloading and unregister API" (Pasha Tatashin) Address an issue with how LUO handles module reference counting and unregistration during module unloading - "zswap pool per-CPU acomp_ctx simplifications" (Kanchana Sridhar) Simplify and clean up the zswap crypto compression handling and improve the lifecycle management of zswap pool's per-CPU acomp_ctx resources - "mm/damon/core: fix damon_call()/damos_walk() vs kdmond exit race" (SeongJae Park) Address unlikely but possible leaks and deadlocks in damon_call() and damon_walk() - "mm/damon/core: validate damos_quota_goal->nid" (SeongJae Park) Fix a couple of root-only wild pointer dereferences - "Docs/admin-guide/mm/damon: warn commit_inputs vs other params race" (SeongJae Park) Update the DAMON documentation to warn operators about potential races which can occur if the commit_inputs parameter is altered at the wrong time - "Minor hmm_test fixes and cleanups" (Alistair Popple) Bugfixes and a cleanup for the HMM kernel selftests - "Modify memfd_luo code" (Chenghao Duan) Cleanups, simplifications and speedups to the memfd_lou code - "mm, kvm: allow uffd support in guest_memfd" (Mike Rapoport) Support for userfaultfd in guest_memfd - "selftests/mm: skip several tests when thp is not available" (Chunyu Hu) Fix several issues in the selftests code which were causing breakage when the tests were run on CONFIG_THP=n kernels - "mm/mprotect: micro-optimization work" (Pedro Falcato) A couple of nice speedups for mprotect() - "MAINTAINERS: update KHO and LIVE UPDATE entries" (Pratyush Yadav) Document upcoming changes in the maintenance of KHO, LUO, memfd_luo, kexec, crash, kdump and probably other kexec-based things - they are being moved out of mm.git and into a new git tree * tag 'mm-stable-2026-04-18-02-14' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (121 commits) MAINTAINERS: add page cache reviewer mm/vmscan: avoid false-positive -Wuninitialized warning MAINTAINERS: update Dave's kdump reviewer email address MAINTAINERS: drop include/linux/liveupdate from LIVE UPDATE MAINTAINERS: drop include/linux/kho/abi/ from KHO MAINTAINERS: update KHO and LIVE UPDATE maintainers MAINTAINERS: update kexec/kdump maintainers entries mm/migrate_device: remove dead migration entry check in migrate_vma_collect_huge_pmd() selftests: mm: skip charge_reserved_hugetlb without killall userfaultfd: allow registration of ranges below mmap_min_addr mm/vmstat: fix vmstat_shepherd double-scheduling vmstat_update mm/hugetlb: fix early boot crash on parameters without '=' separator zram: reject unrecognized type= values in recompress_store() docs: proc: document ProtectionKey in smaps mm/mprotect: special-case small folios when applying permissions mm/mprotect: move softleaf code out of the main function mm: remove '!root_reclaim' checking in should_abort_scan() mm/sparse: fix comment for section map alignment mm/page_io: use sio->len for PSWPIN accounting in sio_read_complete() selftests/mm: transhuge_stress: skip the test when thp not available ...
2026-04-18mm/damon/core: disallow non-power of two min_region_sz on damon_start()SeongJae Park1-0/+5
Commit d8f867fa0825 ("mm/damon: add damon_ctx->min_sz_region") introduced a bug that allows unaligned DAMON region address ranges. Commit c80f46ac228b ("mm/damon/core: disallow non-power of two min_region_sz") fixed it, but only for damon_commit_ctx() use case. Still, DAMON sysfs interface can emit non-power of two min_region_sz via damon_start(). Fix the path by adding the is_power_of_2() check on damon_start(). The issue was discovered by sashiko [1]. Link: https://lore.kernel.org/20260411213638.77768-1-sj@kernel.org Link: https://lore.kernel.org/20260403155530.64647-1-sj@kernel.org [1] Fixes: d8f867fa0825 ("mm/damon: add damon_ctx->min_sz_region") Signed-off-by: SeongJae Park <sj@kernel.org> Cc: <stable@vger.kernel.org> # 6.18.x Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-04-18mm/vmalloc: take vmap_purge_lock in shrinkerUladzislau Rezki (Sony)1-0/+1
decay_va_pool_node() can be invoked concurrently from two paths: __purge_vmap_area_lazy() when pools are being purged, and the shrinker via vmap_node_shrink_scan(). However, decay_va_pool_node() is not safe to run concurrently, and the shrinker path currently lacks serialization, leading to races and possible leaks. Protect decay_va_pool_node() by taking vmap_purge_lock in the shrinker path to ensure serialization with purge users. Link: https://lore.kernel.org/20260413192646.14683-1-urezki@gmail.com Fixes: 7679ba6b36db ("mm: vmalloc: add a shrinker to drain vmap pools") Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com> Reviewed-by: Baoquan He <baoquan.he@linux.dev> Cc: chenyichong <chenyichong@uniontech.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-04-18mm: call ->free_folio() directly in folio_unmap_invalidate()Matthew Wilcox (Oracle)3-3/+7
We can only call filemap_free_folio() if we have a reference to (or hold a lock on) the mapping. Otherwise, we've already removed the folio from the mapping so it no longer pins the mapping and the mapping can be removed, causing a use-after-free when accessing mapping->a_ops. Follow the same pattern as __remove_mapping() and load the free_folio function pointer before dropping the lock on the mapping. That lets us make filemap_free_folio() static as this was the only caller outside filemap.c. Link: https://lore.kernel.org/20260413184314.3419945-1-willy@infradead.org Fixes: fb7d3bc41493 ("mm/filemap: drop streaming/uncached pages when writeback completes") Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reported-by: Google Big Sleep <big-sleep-vuln-reports+bigsleep-501448199@google.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Jan Kara <jack@suse.cz> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-04-18mm: blk-cgroup: fix use-after-free in cgwb_release_workfn()Breno Leitao1-2/+3
cgwb_release_workfn() calls css_put(wb->blkcg_css) and then later accesses wb->blkcg_css again via blkcg_unpin_online(). If css_put() drops the last reference, the blkcg can be freed asynchronously (css_free_rwork_fn -> blkcg_css_free -> kfree) before blkcg_unpin_online() dereferences the pointer to access blkcg->online_pin, resulting in a use-after-free: BUG: KASAN: slab-use-after-free in blkcg_unpin_online (./include/linux/instrumented.h:112 ./include/linux/atomic/atomic-instrumented.h:400 ./include/linux/refcount.h:389 ./include/linux/refcount.h:432 ./include/linux/refcount.h:450 block/blk-cgroup.c:1367) Write of size 4 at addr ff11000117aa6160 by task kworker/71:1/531 Workqueue: cgwb_release cgwb_release_workfn Call Trace: <TASK> blkcg_unpin_online (./include/linux/instrumented.h:112 ./include/linux/atomic/atomic-instrumented.h:400 ./include/linux/refcount.h:389 ./include/linux/refcount.h:432 ./include/linux/refcount.h:450 block/blk-cgroup.c:1367) cgwb_release_workfn (mm/backing-dev.c:629) process_scheduled_works (kernel/workqueue.c:3278 kernel/workqueue.c:3385) Freed by task 1016: kfree (./include/linux/kasan.h:235 mm/slub.c:2689 mm/slub.c:6246 mm/slub.c:6561) css_free_rwork_fn (kernel/cgroup/cgroup.c:5542) process_scheduled_works (kernel/workqueue.c:3302 kernel/workqueue.c:3385) ** Stack based on commit 66672af7a095 ("Add linux-next specific files for 20260410") I am seeing this crash sporadically in Meta fleet across multiple kernel versions. A full reproducer is available at: https://github.com/leitao/debug/blob/main/reproducers/repro_blkcg_uaf.sh (The race window is narrow. To make it easily reproducible, inject a msleep(100) between css_put() and blkcg_unpin_online() in cgwb_release_workfn(). With that delay and a KASAN-enabled kernel, the reproducer triggers the splat reliably in less than a second.) Fix this by moving blkcg_unpin_online() before css_put(), so the cgwb's CSS reference keeps the blkcg alive while blkcg_unpin_online() accesses it. Link: https://lore.kernel.org/20260413-blkcg-v1-1-35b72622d16c@debian.org Fixes: 59b57717fff8 ("blkcg: delay blkg destruction until after writeback has finished") Signed-off-by: Breno Leitao <leitao@debian.org> Reviewed-by: Dennis Zhou <dennis@kernel.org> Reviewed-by: Shakeel Butt <shakeel.butt@linux.dev> Cc: David Hildenbrand <david@kernel.org> Cc: Jens Axboe <axboe@kernel.dk> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Josef Bacik <josef@toxicpanda.com> Cc: JP Kobryn <inwardvessel@gmail.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Lorenzo Stoakes (Oracle) <ljs@kernel.org> Cc: Martin KaFai Lau <martin.lau@linux.dev> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Tejun Heo <tj@kernel.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-04-18mm/zone_device: do not touch device folio after calling ->folio_free()Matthew Brost1-1/+1
The contents of a device folio can immediately change after calling ->folio_free(), as the folio may be reallocated by a driver with a different order. Instead of touching the folio again to extract the pgmap, use the local stack variable when calling percpu_ref_put_many(). Link: https://lore.kernel.org/20260410230346.4009855-1-matthew.brost@intel.com Fixes: d245f9b4ab80 ("mm/zone_device: support large zone device private folios") Signed-off-by: Matthew Brost <matthew.brost@intel.com> Reviewed-by: Balbir Singh <balbirs@nvidia.com> Reviewed-by: Vishal Moola <vishal.moola@gmail.com> Reviewed-by: Alistair Popple <apopple@nvidia.com> Cc: David Hildenbrand <david@kernel.org> Cc: Oscar Salvador <osalvador@suse.de> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-04-18mm/damon/core: disallow time-quota setting zero eszSeongJae Park1-3/+5
When the throughput of a DAMOS scheme is very slow, DAMOS time quota can make the effective size quota smaller than damon_ctx->min_region_sz. In the case, damos_apply_scheme() will skip applying the action, because the action is tried at region level, which requires >=min_region_sz size. That is, the quota is effectively exceeded for the quota charge window. Because no action will be applied, the total_charged_sz and total_charged_ns are also not updated. damos_set_effective_quota() will try to update the effective size quota before starting the next charge window. However, because the total_charged_sz and total_charged_ns have not updated, the throughput and effective size quota are also not changed. Since effective size quota can only be decreased, other effective size quota update factors including DAMOS quota goals and size quota cannot make any change, either. As a result, the scheme is unexpectedly deactivated until the user notices and mitigates the situation. The users can mitigate this situation by changing the time quota online or re-install the scheme. While the mitigation is somewhat straightforward, finding the situation would be challenging, because DAMON is not providing good observabilities for that. Even if such observability is provided, doing the additional monitoring and the mitigation is somewhat cumbersome and not aligned to the intention of the time quota. The time quota was intended to help reduce the user's administration overhead. Fix the problem by setting time quota-modified effective size quota be at least min_region_sz always. The issue was discovered [1] by sashiko. Link: https://lore.kernel.org/20260407003153.79589-1-sj@kernel.org Link: https://lore.kernel.org/20260405192504.110014-1-sj@kernel.org [1] Fixes: 1cd243030059 ("mm/damon/schemes: implement time quota") Signed-off-by: SeongJae Park <sj@kernel.org> Cc: <stable@vger.kernel.org> # 5.16.x Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-04-18mm/mempolicy: fix weighted interleave auto sysfs nameJoshua Hahn1-3/+5
The __ATTR macro is a utility that makes defining kobj_attributes easier by stringfying the name, verifying the mode, and setting the show/store fields in a single initializer. It takes a raw token as the first value, rather than a string, so that __ATTR family macros like __ATTR_RW can token-paste it for inferring the _show / _store function names. Commit e341f9c3c841 ("mm/mempolicy: Weighted Interleave Auto-tuning") used the __ATTR macro to define the "auto" sysfs for weighted interleave. A few months later, commit 2fb6915fa22d ("compiler_types.h: add "auto" as a macro for "__auto_type"") introduced a #define macro which expanded auto into __auto_type. This led to the "auto" token passed into __ATTR to be expanded out into __auto_type, and the sysfs entry to be displayed as __auto_type as well. Expand out the __ATTR macro and directly pass a string "auto" instead of the raw token 'auto' to prevent it from being expanded out. Also bypass the VERIFY_OCTAL_PERMISSIONS check by triple checking that 0664 is indeed the intended permissions for this sysfs file. Before: $ ls /sys/kernel/mm/mempolicy/weighted_interleave __auto_type node0 After: $ ls /sys/kernel/mm/mempolicy/weighted_interleave/ auto node0 Link: https://lore.kernel.org/20260407141415.3080960-1-joshua.hahnjy@gmail.com Fixes: 2fb6915fa22d ("compiler_types.h: add "auto" as a macro for "__auto_type"") Signed-off-by: Joshua Hahn <joshua.hahnjy@gmail.com> Reviewed-by: Gregory Price <gourry@gourry.net> Reviewed-by: Rakie Kim <rakie.kim@sk.com> Acked-by: David Hildenbrand (Arm) <david@kernel.org> Acked-by: Zi Yan <ziy@nvidia.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Byungchul Park <byungchul@sk.com> Cc: "Huang, Ying" <ying.huang@linux.alibaba.com> Cc: Matthew Brost <matthew.brost@intel.com> Cc: Rakie Kim <rakie.kim@sk.com> Cc: Ying Huang <ying.huang@linux.alibaba.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-04-18Merge tag 'memblock-v7.1-rc1' of ↵Linus Torvalds5-126/+190
git://git.kernel.org/pub/scm/linux/kernel/git/rppt/memblock Pull memblock updates from Mike Rapoport: - improve debuggability of reserve_mem kernel parameter handling with print outs in case of a failure and debugfs info showing what was actually reserved - Make memblock_free_late() and free_reserved_area() use the same core logic for freeing the memory to buddy and ensure it takes care of updating memblock arrays when ARCH_KEEP_MEMBLOCK is enabled. * tag 'memblock-v7.1-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rppt/memblock: x86/alternative: delay freeing of smp_locks section memblock: warn when freeing reserved memory before memory map is initialized memblock, treewide: make memblock_free() handle late freeing memblock: make free_reserved_area() update memblock if ARCH_KEEP_MEMBLOCK=y memblock: extract page freeing from free_reserved_area() into a helper memblock: make free_reserved_area() more robust mm: move free_reserved_area() to mm/memblock.c powerpc: opal-core: pair alloc_pages_exact() with free_pages_exact() powerpc: fadump: pair alloc_pages_exact() with free_pages_exact() memblock: reserve_mem: fix end caclulation in reserve_mem_release_by_name() memblock: move reserve_bootmem_range() to memblock.c and make it static memblock: Add reserve_mem debugfs info memblock: Print out errors on reserve_mem parser
2026-04-18mm/vmscan: avoid false-positive -Wuninitialized warningArnd Bergmann1-2/+2
When the -fsanitize=bounds sanitizer is enabled, gcc-16 sometimes runs into a corner case in the read_ctrl_pos() pos function, where it sees possible undefined behavior from the 'tier' index overflowing, presumably in the case that this was called with a negative tier: In function 'get_tier_idx', inlined from 'isolate_folios' at mm/vmscan.c:4671:14: mm/vmscan.c: In function 'isolate_folios': mm/vmscan.c:4645:29: error: 'pv.refaulted' is used uninitialized [-Werror=uninitialized] Part of the problem seems to be that read_ctrl_pos() has unusual calling conventions since commit 37a260870f2c ("mm/mglru: rework type selection") where passing MAX_NR_TIERS makes it accumulate all tiers but passing a smaller positive number makes it read a single tier instead. Shut up the warning by adding a fake initialization to the two instances of this variable that can run into that corner case. Link: https://lore.kernel.org/all/CAJHvVcjtFW86o5FoQC8MMEXCHAC0FviggaQsd5EmiCHP+1fBpg@mail.gmail.com/ Link: https://lore.kernel.org/20260414065206.3236176-1-arnd@kernel.org Signed-off-by: Arnd Bergmann <arnd@arndb.de> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Barry Song <baohua@kernel.org> Cc: David Hildenbrand <david@kernel.org> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kairui Song <kasong@tencent.com> Cc: Koichiro Den <koichiro.den@canonical.com> Cc: Lorenzo Stoakes <ljs@kernel.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Qi Zheng <zhengqi.arch@bytedance.com> Cc: Shakeel Butt <shakeel.butt@linux.dev> Cc: Wei Xu <weixugc@google.com> Cc: Yuanchu Xie <yuanchu@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-04-18mm/migrate_device: remove dead migration entry check in ↵Davidlohr Bueso1-6/+0
migrate_vma_collect_huge_pmd() The softleaf_is_migration() check is unreachable as entries that are not device_private are filtered out. Similarly, the PTE-level equivalent in migrate_vma_collect_pmd() skips migration entries. This dead branch also contained a double spin_unlock(ptl) bug. Link: https://lore.kernel.org/20260212014611.416695-1-dave@stgolabs.net Fixes: a30b48bf1b244 ("mm/migrate_device: implement THP migration of zone device pages") Signed-off-by: Davidlohr Bueso <dave@stgolabs.net> Suggested-by: Matthew Brost <matthew.brost@intel.com> Reviewed-by: Alistair Popple <apopple@nvidia.com> Acked-by: Balbir Singh <balbirs@nvidia.com> Acked-by: David Hildenbrand (Arm) <david@kernel.org> Cc: Byungchul Park <byungchul@sk.com> Cc: Gregory Price <gourry@gourry.net> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Joshua Hahn <joshua.hahnjy@gmail.com> Cc: Mathew Brost <matthew.brost@intel.com> Cc: Rakie Kim <rakie.kim@sk.com> Cc: Ying Huang <ying.huang@linux.alibaba.com> Cc: Zi Yan <ziy@nvidia.com> Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-04-18mm/vmstat: fix vmstat_shepherd double-scheduling vmstat_updateBreno Leitao1-1/+1
vmstat_shepherd uses delayed_work_pending() to check whether vmstat_update is already scheduled for a given CPU before queuing it. However, delayed_work_pending() only tests WORK_STRUCT_PENDING_BIT, which is cleared the moment a worker thread picks up the work to execute it. This means that while vmstat_update is actively running on a CPU, delayed_work_pending() returns false. If need_update() also returns true at that point (per-cpu counters not yet zeroed mid-flush), the shepherd queues a second invocation with delay=0, causing vmstat_update to run again immediately after finishing. On a 72-CPU system this race is readily observable: before the fix, many CPUs show invocation gaps well below 500 jiffies (the minimum round_jiffies_relative() can produce), with the most extreme cases reaching 0 jiffies—vmstat_update called twice within the same jiffy. Fix this by replacing delayed_work_pending() with work_busy(), which returns non-zero for both WORK_BUSY_PENDING (timer armed or work queued) and WORK_BUSY_RUNNING (work currently executing). The shepherd now correctly skips a CPU in all busy states. After the fix, all sub-jiffy and most sub-100-jiffie gaps disappear. The remaining early invocations have gaps in the 700–999 jiffie range, attributable to round_jiffies_relative() aligning to a nearer jiffie-second boundary rather than to this race. Each spurious vmstat_update invocation has a measurable side effect: refresh_cpu_vm_stats() calls decay_pcp_high() for every zone, which drains idle per-CPU pages back to the buddy allocator via free_pcppages_bulk(), taking the zone spinlock each time. Eliminating the double-scheduling therefore reduces zone lock contention directly. On a 72-CPU stress-ng workload measured with perf lock contention: free_pcppages_bulk contention count: ~55% reduction free_pcppages_bulk total wait time: ~57% reduction free_pcppages_bulk max wait time: ~47% reduction Note: work_busy() is inherently racy—between the check and the subsequent queue_delayed_work_on() call, vmstat_update can finish execution, leaving the work neither pending nor running. In that narrow window the shepherd can still queue a second invocation. After the fix, this residual race is rare and produces only occasional small gaps, a significant improvement over the systematic double-scheduling seen with delayed_work_pending(). Link: https://lore.kernel.org/20260409-vmstat-v2-1-e9d9a6db08ad@debian.org Fixes: 7b8da4c7f07774 ("vmstat: get rid of the ugly cpu_stat_off variable") Signed-off-by: Breno Leitao <leitao@debian.org> Reviewed-by: Vlastimil Babka (SUSE) <vbabka@kernel.org> Acked-by: Michal Hocko <mhocko@suse.com> Reviewed-by: Dmitry Ilvokhin <d@ilvokhin.com> Cc: Christoph Lameter <cl@linux.com> Cc: David Hildenbrand <david@kernel.org> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Lorenzo Stoakes <ljs@kernel.org> Cc: Mike Rapoport <rppt@kernel.org> Cc: Shakeel Butt <shakeel.butt@linux.dev> Cc: Suren Baghdasaryan <surenb@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-04-18mm/hugetlb: fix early boot crash on parameters without '=' separatorThorsten Blum1-0/+3
If hugepages, hugepagesz, or default_hugepagesz are specified on the kernel command line without the '=' separator, early parameter parsing passes NULL to hugetlb_add_param(), which dereferences it in strlen() and can crash the system during early boot. Reject NULL values in hugetlb_add_param() and return -EINVAL instead. Link: https://lore.kernel.org/20260409105437.108686-4-thorsten.blum@linux.dev Fixes: 5b47c02967ab ("mm/hugetlb: convert cmdline parameters from setup to early") Signed-off-by: Thorsten Blum <thorsten.blum@linux.dev> Reviewed-by: Muchun Song <muchun.song@linux.dev> Cc: David Hildenbrand <david@kernel.org> Cc: Frank van der Linden <fvdl@google.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-04-18mm/mprotect: special-case small folios when applying permissionsPedro Falcato1-34/+57
The common order-0 case is important enough to want its own branch, and avoids the hairy, large loop logic that the CPU does not seem to handle particularly well. While at it, encourage the compiler to inline batch PTE logic and resolve constant branches by adding __always_inline strategically. Link: https://lore.kernel.org/20260402141628.3367596-3-pfalcato@suse.de Signed-off-by: Pedro Falcato <pfalcato@suse.de> Suggested-by: David Hildenbrand (Arm) <david@kernel.org> Reviewed-by: Lorenzo Stoakes (Oracle) <ljs@kernel.org> Tested-by: Luke Yang <luyang@redhat.com> Reviewed-by: Vlastimil Babka (SUSE) <vbabka@kernel.org> Cc: Dev Jain <dev.jain@arm.com> Cc: Jann Horn <jannh@google.com> Cc: Jiri Hladky <jhladky@redhat.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-04-18mm/mprotect: move softleaf code out of the main functionPedro Falcato1-60/+67
Patch series "mm/mprotect: micro-optimization work", v3. Micro-optimize the change_protection functionality and the change_pte_range() routine. This set of functions works in an incredibly tight loop, and even small inefficiencies are incredibly evident when spun hundreds, thousands or hundreds of thousands of times. There was an attempt to keep the batching functionality as much as possible, which introduced some part of the slowness, but not all of it. Removing it for !arm64 architectures would speed mprotect() up even further, but could easily pessimize cases where large folios are mapped (which is not as rare as it seems, particularly when it comes to the page cache these days). The micro-benchmark used for the tests was [0] (usable using google/benchmark and g++ -O2 -lbenchmark repro.cpp) This resulted in the following (first entry is baseline): --------------------------------------------------------- Benchmark Time CPU Iterations --------------------------------------------------------- mprotect_bench 85967 ns 85967 ns 6935 mprotect_bench 70684 ns 70684 ns 9887 After the patchset we can observe an ~18% speedup in mprotect. Wonderful for the elusive mprotect-based workloads! Testing & more ideas welcome. I suspect there is plenty of improvement possible but it would require more time than what I have on my hands right now. The entire inlined function (which inlines into change_protection()) is gigantic - I'm not surprised this is so finnicky. Note: per my profiling, the next _big_ bottleneck here is modify_prot_start_ptes, exactly on the xchg() done by x86. ptep_get_and_clear() is _expensive_. I don't think there's a properly safe way to go about it since we do depend on the D bit quite a lot. This might not be such an issue on other architectures. Luke Yang reported [1]: : On average, we see improvements ranging from a minimum of 5% to a : maximum of 55%, with most improvements showing around a 25% speed up in : the libmicro/mprot_tw4m micro benchmark. This patch (of 2): Move softleaf change_pte_range code into a separate function. This makes the change_pte_range() function a good bit smaller, and lessens cognitive load when reading through the function. Link: https://lore.kernel.org/20260402141628.3367596-1-pfalcato@suse.de Link: https://lore.kernel.org/20260402141628.3367596-2-pfalcato@suse.de Link: https://lore.kernel.org/all/aY8-XuFZ7zCvXulB@luyang-thinkpadp1gen7.toromso.csb/ Link: https://gist.github.com/heatd/1450d273005aba91fa5744f44dfcd933 [0] Link: https://lore.kernel.org/CAL2CeBxT4jtJ+LxYb6=BNxNMGinpgD_HYH5gGxOP-45Q2OncqQ@mail.gmail.com [1] Signed-off-by: Pedro Falcato <pfalcato@suse.de> Reviewed-by: Lorenzo Stoakes (Oracle) <ljs@kernel.org> Acked-by: David Hildenbrand (Arm) <david@kernel.org> Tested-by: Luke Yang <luyang@redhat.com> Reviewed-by: Vlastimil Babka (SUSE) <vbabka@kernel.org> Cc: Dev Jain <dev.jain@arm.com> Cc: Jann Horn <jannh@google.com> Cc: Jiri Hladky <jhladky@redhat.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-04-18mm: remove '!root_reclaim' checking in should_abort_scan()Zhaoyang Huang1-4/+0
Android systems usually use memory.reclaim interface to implement user space memory management which expects that the requested reclaim target and actually reclaimed amount memory are not diverging by too much. With the current MGRLU implementation there is, however, no bail out when the reclaim target is reached and this could lead to an excessive reclaim that scales with the reclaim hierarchy size.For example, we can get a nr_reclaimed=394/nr_to_reclaim=32 proactive reclaim under a common 1-N cgroup hierarchy. This defect arose from the goal of keeping fairness among memcgs that is, for try_to_free_mem_cgroup_pages -> shrink_node_memcgs -> shrink_lruvec -> lru_gen_shrink_lruvec -> try_to_shrink_lruvec, the !root_reclaim(sc) check was there for reclaim fairness, which was necessary before commit b82b530740b9 ("mm: vmscan: restore incremental cgroup iteration") because the fairness depended on attempted proportional reclaim from every memcg under the target memcg. However after commit b82b530740b9 there is no longer a need to visit every memcg to ensure fairness. Let's have try_to_shrink_lruvec bail out when the nr_reclaimed achieved. Link: https://lore.kernel.org/20260318011558.1696310-1-zhaoyang.huang@unisoc.com Link: https://lore.kernel.org/20260212032111.408865-1-zhaoyang.huang@unisoc.com Signed-off-by: Zhaoyang Huang <zhaoyang.huang@unisoc.com> Suggested-by: T.J.Mercier <tjmercier@google.com> Reviewed-by: T.J. Mercier <tjmercier@google.com> Acked-by: Shakeel Butt <shakeel.butt@linux.dev> Acked-by: Qi Zheng <qi.zheng@linux.dev> Reviewed-by: Barry Song <baohua@kernel.org> Reviewed-by: Kairui Song <kasong@tencent.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Rik van Riel <riel@surriel.com> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Yu Zhao <yuzhao@google.com> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: Yuanchu Xie <yuanchu@google.com> Cc: Wei Xu <weixugc@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-04-18mm/page_io: use sio->len for PSWPIN accounting in sio_read_complete()David Carlier1-1/+1
sio_read_complete() uses sio->pages to account global PSWPIN vm events, but sio->pages tracks the number of bvec entries (folios), not base pages. While large folios cannot currently reach this path (SWP_FS_OPS and SWP_SYNCHRONOUS_IO are mutually exclusive, and mTHP swap-in allocation is gated on SWP_SYNCHRONOUS_IO), the accounting is semantically inconsistent with the per-memcg path which correctly uses folio_nr_pages(). Use sio->len >> PAGE_SHIFT instead, which gives the correct base page count since sio->len is accumulated via folio_size(folio). Link: https://lore.kernel.org/20260402061408.36119-1-devnexen@gmail.com Signed-off-by: David Carlier <devnexen@gmail.com> Acked-by: David Hildenbrand (Arm) <david@kernel.org> Cc: Baoquan He <bhe@redhat.com> Cc: Chris Li <chrisl@kernel.org> Cc: Kairui Song <kasong@tencent.com> Cc: Kemeng Shi <shikemeng@huaweicloud.com> Cc: NeilBrown <neil@brown.name> Cc: Nhat Pham <nphamcs@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-04-18userfaultfd: mfill_atomic(): remove retry logicMike Rapoport (Microsoft)1-27/+0
Since __mfill_atomic_pte() handles the retry for both anonymous and shmem, there is no need to retry copying the date from the userspace in the loop in mfill_atomic(). Drop the retry logic from mfill_atomic(). [rppt@kernel.org: remove safety measure of not returning ENOENT from _copy] Link: https://lore.kernel.org/ac5zcDUY8CFHr6Lw@kernel.org Link: https://lore.kernel.org/20260402041156.1377214-12-rppt@kernel.org Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Andrei Vagin <avagin@google.com> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: David Hildenbrand (Arm) <david@kernel.org> Cc: Harry Yoo <harry.yoo@oracle.com> Cc: Harry Yoo (Oracle) <harry@kernel.org> Cc: Hugh Dickins <hughd@google.com> Cc: James Houghton <jthoughton@google.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Lorenzo Stoakes (Oracle) <ljs@kernel.org> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: Nikita Kalyazin <kalyazin@amazon.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Peter Xu <peterx@redhat.com> Cc: Sean Christopherson <seanjc@google.com> Cc: Shuah Khan <shuah@kernel.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: David Carlier <devnexen@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-04-18shmem, userfaultfd: implement shmem uffd operations using vm_uffd_opsMike Rapoport (Microsoft)2-136/+92
Add filemap_add() and filemap_remove() methods to vm_uffd_ops and use them in __mfill_atomic_pte() to add shmem folios to page cache and remove them in case of error. Implement these methods in shmem along with vm_uffd_ops->alloc_folio() and drop shmem_mfill_atomic_pte(). Since userfaultfd now does not reference any functions from shmem, drop include if linux/shmem_fs.h from mm/userfaultfd.c mfill_atomic_install_pte() is not used anywhere outside of mm/userfaultfd, make it static. Link: https://lore.kernel.org/20260402041156.1377214-11-rppt@kernel.org Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Reviewed-by: James Houghton <jthoughton@google.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Andrei Vagin <avagin@google.com> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: David Hildenbrand (Arm) <david@kernel.org> Cc: Harry Yoo <harry.yoo@oracle.com> Cc: Harry Yoo (Oracle) <harry@kernel.org> Cc: Hugh Dickins <hughd@google.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Lorenzo Stoakes (Oracle) <ljs@kernel.org> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: Nikita Kalyazin <kalyazin@amazon.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Peter Xu <peterx@redhat.com> Cc: Sean Christopherson <seanjc@google.com> Cc: Shuah Khan <shuah@kernel.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: David Carlier <devnexen@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-04-18userfaultfd: introduce vm_uffd_ops->alloc_folio()Mike Rapoport (Microsoft)1-44/+48
and use it to refactor mfill_atomic_pte_zeroed_folio() and mfill_atomic_pte_copy(). mfill_atomic_pte_zeroed_folio() and mfill_atomic_pte_copy() perform almost identical actions: * allocate a folio