From 8613803cf5d532316aa886f17066c5e5968ea21e Mon Sep 17 00:00:00 2001 From: Chengkaitao Date: Thu, 23 Apr 2026 18:14:41 +0800 Subject: mm: convert vmemmap_p?d_populate() to static functions Since the vmemmap_p?d_populate functions are unused outside the mm subsystem, we can remove their external declarations and convert them to static functions. Link: https://lore.kernel.org/20260423101441.7089-1-kaitao.cheng@linux.dev Signed-off-by: Chengkaitao Acked-by: David Hildenbrand (arm) Acked-by: Mike Rapoport (Microsoft) Acked-by: Oscar Salvador Cc: David Hildenbrand Cc: Liam Howlett Cc: Lorenzo Stoakes (Oracle) Cc: Michal Hocko Cc: Suren Baghdasaryan Signed-off-by: Andrew Morton --- include/linux/mm.h | 7 ------- 1 file changed, 7 deletions(-) (limited to 'include') diff --git a/include/linux/mm.h b/include/linux/mm.h index 06bbe9eba636..e3b6112a8d79 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -4860,13 +4860,6 @@ unsigned long section_map_size(void); struct page * __populate_section_memmap(unsigned long pfn, unsigned long nr_pages, int nid, struct vmem_altmap *altmap, struct dev_pagemap *pgmap); -pgd_t *vmemmap_pgd_populate(unsigned long addr, int node); -p4d_t *vmemmap_p4d_populate(pgd_t *pgd, unsigned long addr, int node); -pud_t *vmemmap_pud_populate(p4d_t *p4d, unsigned long addr, int node); -pmd_t *vmemmap_pmd_populate(pud_t *pud, unsigned long addr, int node); -pte_t *vmemmap_pte_populate(pmd_t *pmd, unsigned long addr, int node, - struct vmem_altmap *altmap, unsigned long ptpfn, - unsigned long flags); void *vmemmap_alloc_block(unsigned long size, int node); struct vmem_altmap; void *vmemmap_alloc_block_buf(unsigned long size, int node, -- cgit v1.2.3 From 4221aadd720bef7df1268391d6eb1ea1f0476b38 Mon Sep 17 00:00:00 2001 From: Bunyod Suvonov Date: Thu, 23 Apr 2026 18:37:53 +0800 Subject: mm/vmscan: add balance_pgdat begin/end tracepoints Vmscan has six main reclaim entry points: try_to_free_pages() for direct reclaim, 
try_to_free_mem_cgroup_pages() for memcg reclaim, mem_cgroup_shrink_node() for memcg soft limit reclaim, node_reclaim() for node reclaim, shrink_all_memory() for hibernation reclaim, and balance_pgdat() for kswapd reclaim. All of them, except for shrink_all_memory() and balance_pgdat(), already have begin/end tracepoints. This makes it harder to trace which reclaim path is responsible for memory reclaim activity, because kswapd reclaim cannot be identified as cleanly as other reclaim entry points, even though it is the main background reclaim path under memory pressure. There may be no need to trace shrink_all_memory() as it is primarily used during hibernation. So this patch adds the missing tracepoint pair for balance_pgdat(). The begin tracepoint records the node id, requested reclaim order, and the requested classzone bound (highest_zoneidx). The end tracepoint records the node id, the reclaim order that balance_pgdat() finished with, the requested classzone bound, and nr_reclaimed. Together, they show the requested reclaim order and classzone bound, whether reclaim fell back to a lower order, and how much reclaim work was done. The end tracepoint also records highest_zoneidx even though it does not change within a balance_pgdat() invocation. This keeps the end event self-contained, so users can analyze reclaim results directly from end events without depending on begin/end correlation, which is less convenient when tracing is filtered or records are dropped. It also makes it straightforward to relate nr_reclaimed and the final reclaim order to the requested classzone bound. 
Link: https://lore.kernel.org/20260424031418.174597-1-b.suvonov@sjtu.edu.cn Link: https://lore.kernel.org/20260423103753.546582-1-b.suvonov@sjtu.edu.cn Signed-off-by: Bunyod Suvonov Acked-by: Shakeel Butt Cc: Axel Rasmussen Cc: Barry Song Cc: David Hildenbrand Cc: Johannes Weiner Cc: Kairui Song Cc: Lorenzo Stoakes Cc: Masami Hiramatsu Cc: Mathieu Desnoyers Cc: Michal Hocko Cc: Qi Zheng Cc: Steven Rostedt Cc: Wei Xu Cc: Yuanchu Xie Signed-off-by: Andrew Morton --- include/trace/events/vmscan.h | 52 +++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 52 insertions(+) (limited to 'include') diff --git a/include/trace/events/vmscan.h b/include/trace/events/vmscan.h index 4445a8d9218d..b4bf7b8def1f 100644 --- a/include/trace/events/vmscan.h +++ b/include/trace/events/vmscan.h @@ -96,6 +96,58 @@ TRACE_EVENT(mm_vmscan_kswapd_wake, __entry->order) ); +TRACE_EVENT(mm_vmscan_balance_pgdat_begin, + + TP_PROTO(int nid, int order, int highest_zoneidx), + + TP_ARGS(nid, order, highest_zoneidx), + + TP_STRUCT__entry( + __field(int, nid) + __field(int, order) + __field(int, highest_zoneidx) + ), + + TP_fast_assign( + __entry->nid = nid; + __entry->order = order; + __entry->highest_zoneidx = highest_zoneidx; + ), + + TP_printk("nid=%d order=%d highest_zoneidx=%-8s", + __entry->nid, + __entry->order, + __print_symbolic(__entry->highest_zoneidx, ZONE_TYPE)) +); + +TRACE_EVENT(mm_vmscan_balance_pgdat_end, + + TP_PROTO(int nid, int order, int highest_zoneidx, + unsigned long nr_reclaimed), + + TP_ARGS(nid, order, highest_zoneidx, nr_reclaimed), + + TP_STRUCT__entry( + __field(int, nid) + __field(int, order) + __field(int, highest_zoneidx) + __field(unsigned long, nr_reclaimed) + ), + + TP_fast_assign( + __entry->nid = nid; + __entry->order = order; + __entry->highest_zoneidx = highest_zoneidx; + __entry->nr_reclaimed = nr_reclaimed; + ), + + TP_printk("nid=%d order=%d highest_zoneidx=%-8s nr_reclaimed=%lu", + __entry->nid, + __entry->order, + 
__print_symbolic(__entry->highest_zoneidx, ZONE_TYPE), + __entry->nr_reclaimed) +); + TRACE_EVENT(mm_vmscan_wakeup_kswapd, TP_PROTO(int nid, int zid, int order, gfp_t gfp_flags), -- cgit v1.2.3 From 4aa4abf1f14bd6d0748b7d35a803cc2376a8e20b Mon Sep 17 00:00:00 2001 From: Ryan Roberts Date: Wed, 1 Apr 2026 11:16:19 +0100 Subject: mm/page_alloc: optimize free_contig_range() Patch series "mm: Free contiguous order-0 pages efficiently", v6. A recent change to vmalloc caused some performance benchmark regressions (see [1]). I'm attempting to fix that (and at the same time significantly improve beyond the baseline) by freeing a contiguous set of order-0 pages as a batch. At the same time I observed that free_contig_range() was essentially doing the same thing as vfree() so I've fixed it there too. While at it, optimize the __free_contig_frozen_range() as well. Check that the contiguous range falls in the same section. If they aren't enabled, the if conditions get optimized out by the compiler as memdesc_section() returns 0. See num_pages_contiguous() for more details about it. This patch (of 3): Decompose the range of order-0 pages to be freed into the set of largest possible power-of-2 size and aligned chunks and free them to the pcp or buddy. This improves on the previous approach which freed each order-0 page individually in a loop. Testing shows performance to be improved by more than 10x in some cases. Since each page is order-0, we must decrement each page's reference count individually and only consider the page for freeing as part of a high order chunk if the reference count goes to zero. Additionally free_pages_prepare() must be called for each individual order-0 page too, so that the struct page state and global accounting state can be appropriately managed. But once this is done, the resulting high order chunks can be freed as a unit to the pcp or buddy. 
This significantly speeds up the free operation but also has the side benefit that high order blocks are added to the pcp instead of each page ending up on the pcp order-0 list; memory remains more readily available in high orders. vmalloc will shortly become a user of this new optimized free_contig_range() since it aggressively allocates high order non-compound pages, but then calls split_page() to end up with contiguous order-0 pages. These can now be freed much more efficiently. The execution time of the following function was measured in a server class arm64 machine: static int page_alloc_high_order_test(void) { unsigned int order = HPAGE_PMD_ORDER; struct page *page; int i; for (i = 0; i < 100000; i++) { page = alloc_pages(GFP_KERNEL, order); if (!page) return -1; split_page(page, order); free_contig_range(page_to_pfn(page), 1UL << order); } return 0; } Execution time before: 4097358 usec Execution time after: 729831 usec Perf trace before: 99.63% 0.00% kthreadd [kernel.kallsyms] [.] 
kthread | ---kthread 0xffffb33c12a26af8 | |--98.13%--0xffffb33c12a26060 | | | |--97.37%--free_contig_range | | | | | |--94.93%--___free_pages | | | | | | | |--55.42%--__free_frozen_pages | | | | | | | | | --43.20%--free_frozen_page_commit | | | | | | | | | --35.37%--_raw_spin_unlock_irqrestore | | | | | | | |--11.53%--_raw_spin_trylock | | | | | | | |--8.19%--__preempt_count_dec_and_test | | | | | | | |--5.64%--_raw_spin_unlock | | | | | | | |--2.37%--__get_pfnblock_flags_mask.isra.0 | | | | | | | --1.07%--free_frozen_page_commit | | | | | --1.54%--__free_frozen_pages | | | --0.77%--___free_pages | --0.98%--0xffffb33c12a26078 alloc_pages_noprof Perf trace after: 8.42% 2.90% kthreadd [kernel.kallsyms] [k] __free_contig_range | |--5.52%--__free_contig_range | | | |--5.00%--free_prepared_contig_range | | | | | |--1.43%--__free_frozen_pages | | | | | | | --0.51%--free_frozen_page_commit | | | | | |--1.08%--_raw_spin_trylock | | | | | --0.89%--_raw_spin_unlock | | | --0.52%--free_pages_prepare | --2.90%--ret_from_fork kthread 0xffffae1c12abeaf8 0xffffae1c12abe7a0 | --2.69%--vfree __free_contig_range Link: https://lore.kernel.org/20260401101634.2868165-1-usama.anjum@arm.com Link: https://lore.kernel.org/20260401101634.2868165-2-usama.anjum@arm.com Link: https://lore.kernel.org/all/66919a28-bc81-49c9-b68f-dd7c73395a0d@arm.com [1] Signed-off-by: Ryan Roberts Co-developed-by: Muhammad Usama Anjum Signed-off-by: Muhammad Usama Anjum Acked-by: David Hildenbrand (Arm) Acked-by: Vlastimil Babka (SUSE) Reviewed-by: Zi Yan Cc: Brendan Jackman Cc: David Sterba Cc: Johannes Weiner Cc: Liam Howlett Cc: Lorenzo Stoakes Cc: Michal Hocko Cc: Mike Rapoport Cc: Nick Terrell Cc: Suren Baghdasaryan Cc: "Uladzislau Rezki (Sony)" Cc: Vishal Moola (Oracle) Signed-off-by: Andrew Morton --- include/linux/gfp.h | 2 ++ 1 file changed, 2 insertions(+) (limited to 'include') diff --git a/include/linux/gfp.h b/include/linux/gfp.h index 51ef13ed756e..87259e309dee 100644 --- a/include/linux/gfp.h +++ 
b/include/linux/gfp.h @@ -467,6 +467,8 @@ void free_contig_frozen_range(unsigned long pfn, unsigned long nr_pages); void free_contig_range(unsigned long pfn, unsigned long nr_pages); #endif +void __free_contig_range(unsigned long pfn, unsigned long nr_pages); + DEFINE_FREE(free_page, void *, free_page((unsigned long)_T)) #endif /* __LINUX_GFP_H */ -- cgit v1.2.3 From 60ced5818f64ac356620d1ad3e0d473c457dbf5b Mon Sep 17 00:00:00 2001 From: Ryan Roberts Date: Wed, 1 Apr 2026 11:16:20 +0100 Subject: vmalloc: optimize vfree with free_pages_bulk() Whenever vmalloc allocates high order pages (e.g. for a huge mapping) it must immediately split_page() to order-0 so that it remains compatible with users that want to access the underlying struct page. Commit a06157804399 ("mm/vmalloc: request large order pages from buddy allocator") recently made it much more likely for vmalloc to allocate high order pages which are subsequently split to order-0. Unfortunately this had the side effect of causing performance regressions for tight vmalloc/vfree loops (e.g. test_vmalloc.ko benchmarks). See Closes: tag. This happens because the high order pages must be gotten from the buddy but then because they are split to order-0, when they are freed they are freed to the order-0 pcp. Previously allocation was for order-0 pages so they were recycled from the pcp. It would be preferable if when vmalloc allocates an (e.g.) order-3 page that it also frees that order-3 page to the order-3 pcp, then the regression could be removed. So let's do exactly that; update stats separately first as coalescing is hard to do correctly without complexity. Use free_pages_bulk() which uses the new __free_contig_range() API to batch-free contiguous ranges of pfns. This not only removes the regression, but significantly improves performance of vfree beyond the baseline. A selection of test_vmalloc benchmarks running on arm64 server class system. mm-new is the baseline. 
Commit a06157804399 ("mm/vmalloc: request large order pages from buddy
allocator") was added in v6.19-rc1 where we see regressions. Then with
this change performance is much better. (>0 is faster, <0 is slower,
(R)/(I) = statistically significant Regression/Improvement):

+-----------------+----------------------------------------------------------+-------------------+--------------------+
| Benchmark       | Result Class                                             | mm-new            | this series        |
+=================+==========================================================+===================+====================+
| micromm/vmalloc | fix_align_alloc_test: p:1, h:0, l:500000 (usec)          |        1331843.33 |         (I) 67.17% |
|                 | fix_size_alloc_test: p:1, h:0, l:500000 (usec)           |         415907.33 |             -5.14% |
|                 | fix_size_alloc_test: p:4, h:0, l:500000 (usec)           |         755448.00 |         (I) 53.55% |
|                 | fix_size_alloc_test: p:16, h:0, l:500000 (usec)          |        1591331.33 |         (I) 57.26% |
|                 | fix_size_alloc_test: p:16, h:1, l:500000 (usec)          |        1594345.67 |         (I) 68.46% |
|                 | fix_size_alloc_test: p:64, h:0, l:100000 (usec)          |        1071826.00 |         (I) 79.27% |
|                 | fix_size_alloc_test: p:64, h:1, l:100000 (usec)          |        1018385.00 |         (I) 84.17% |
|                 | fix_size_alloc_test: p:256, h:0, l:100000 (usec)         |        3970899.67 |         (I) 77.01% |
|                 | fix_size_alloc_test: p:256, h:1, l:100000 (usec)         |        3821788.67 |         (I) 89.44% |
|                 | fix_size_alloc_test: p:512, h:0, l:100000 (usec)         |        7795968.00 |         (I) 82.67% |
|                 | fix_size_alloc_test: p:512, h:1, l:100000 (usec)         |        6530169.67 |        (I) 118.09% |
|                 | full_fit_alloc_test: p:1, h:0, l:500000 (usec)           |         626808.33 |             -0.98% |
|                 | kvfree_rcu_1_arg_vmalloc_test: p:1, h:0, l:500000 (usec) |         532145.67 |             -1.68% |
|                 | kvfree_rcu_2_arg_vmalloc_test: p:1, h:0, l:500000 (usec) |         537032.67 |             -0.96% |
|                 | long_busy_list_alloc_test: p:1, h:0, l:500000 (usec)     |        8805069.00 |         (I) 74.58% |
|                 | pcpu_alloc_test: p:1, h:0, l:500000 (usec)               |         500824.67 |              4.35% |
|                 | random_size_align_alloc_test: p:1, h:0, l:500000 (usec)  |        1637554.67 |         (I) 76.99% |
|                 | random_size_alloc_test: p:1, h:0, l:500000 (usec)        |        4556288.67 |         (I) 72.23% |
|                 | vm_map_ram_test: p:1, h:0, l:500000 (usec)               |         107371.00 |             -0.70% |
+-----------------+----------------------------------------------------------+-------------------+--------------------+

Link: https://lore.kernel.org/20260401101634.2868165-3-usama.anjum@arm.com
Fixes: a06157804399 ("mm/vmalloc: request large order pages from buddy allocator")
Closes: https://lore.kernel.org/all/66919a28-bc81-49c9-b68f-dd7c73395a0d@arm.com/
Signed-off-by: Ryan Roberts
Co-developed-by: Muhammad Usama Anjum
Signed-off-by: Muhammad Usama Anjum
Acked-by: Vlastimil Babka (SUSE)
Acked-by: Zi Yan
Acked-by: David Hildenbrand (Arm)
Reviewed-by: Uladzislau Rezki (Sony)
Cc: Brendan Jackman
Cc: David Sterba
Cc: Johannes Weiner
Cc: Liam Howlett
Cc: Lorenzo Stoakes
Cc: Michal Hocko
Cc: Mike Rapoport
Cc: Nick Terrell
Cc: Suren Baghdasaryan
Cc: Vishal Moola (Oracle)
Signed-off-by: Andrew Morton
---
 include/linux/gfp.h | 2 ++
 1 file changed, 2 insertions(+)

(limited to 'include')

diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index 87259e309dee..cdf95a9f0b87 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -239,6 +239,8 @@ unsigned long alloc_pages_bulk_noprof(gfp_t gfp, int preferred_nid,
 				struct page **page_array);
 #define __alloc_pages_bulk(...)			alloc_hooks(alloc_pages_bulk_noprof(__VA_ARGS__))
 
+void free_pages_bulk(struct page **page_array, unsigned long nr_pages);
+
 unsigned long alloc_pages_bulk_mempolicy_noprof(gfp_t gfp,
 				unsigned long nr_pages,
 				struct page **page_array);
-- 
cgit v1.2.3

From 9138e27a3bc380cd88475546688f23d5eda1ad23 Mon Sep 17 00:00:00 2001
From: Ravi Jonnalagadda
Date: Mon, 27 Apr 2026 20:05:20 -0700
Subject: mm/damon: add node_eligible_mem_bp goal metric

Background and Motivation
=========================

In heterogeneous memory systems, controlling memory distribution across
NUMA nodes is essential for performance optimization.
This patch enables system-wide page distribution with target-state goals such as "maintain 60% of scheme-eligible memory on DRAM" using PA-mode DAMON schemes. Rather than using absolute thresholds, this metric tracks the ratio of memory that matches each scheme's access pattern filters on a target node, enabling the quota system to automatically adjust migration aggressiveness to maintain the desired distribution. What This Metric Measures ========================= node_eligible_mem_bp: scheme_eligible_bytes_on_node / total_scheme_eligible_bytes * 10000 Two-Scheme Setup for Hot Page Distribution ========================================== For maintaining 60% of hot memory on DRAM (node 0) and 40% on CXL (node 1): PULL scheme: migrate_hot to node 0 goal: node_eligible_mem_bp, nid=0, target=6000 addr filter: node 1 address range (only migrate FROM CXL) "Move hot pages to DRAM if less than 60% of hot data is in DRAM" PUSH scheme: migrate_hot to node 1 goal: node_eligible_mem_bp, nid=1, target=4000 addr filter: node 0 address range (only migrate FROM DRAM) "Move hot pages to CXL if less than 40% of hot data is in CXL" Each scheme independently measures its own eligible memory and adjusts its quota to achieve its target ratio. The schemes work in concert through DAMON's unified monitoring context, with the quota autotuner balancing their relative aggressiveness. Implementation Details ====================== The implementation adds a new quota goal metric type DAMOS_QUOTA_NODE_ELIGIBLE_MEM_BP to the existing DAMOS quota goal framework. When this metric is configured for a scheme: 1. During each quota adjustment cycle, damos_get_node_eligible_mem_bp() is called to calculate the current memory distribution. 2. The function iterates through all regions that match the scheme's access pattern (via __damos_valid_target()) and calculates: - Total eligible bytes across all nodes - Eligible bytes specifically on the target node (goal->nid) 3. 
For each eligible region, damos_calc_eligible_bytes() walks through the physical address range, using damon_get_folio() to look up each folio and determine its NUMA node via folio_nid(). 4. Large folios are handled by calculating the exact overlap between the region boundaries and folio boundaries, ensuring accurate byte counts even when regions partially span folios. 5. The ratio (node_eligible / total_eligible * 10000) is returned as basis points, which the quota autotuner uses to adjust the scheme's effective quota size (esz). The implementation requires CONFIG_DAMON_PADDR since damon_get_folio() is only available for physical address space monitoring. Testing Results =============== Functionally tested on a two-node heterogeneous memory system with DRAM (node 0) and CXL memory (node 1). A PUSH+PULL scheme configuration using migrate_hot actions was used to reach a target hot memory ratio between the two tiers. With the TEMPORAL tuner, the system converges quickly to the target distribution. The tuner drives esz to maximum when under goal and to zero once the goal is met, forming a simple on/off feedback loop that stabilizes at the desired ratio. With the CONSIST tuner, the scheme still converges but more slowly, as it migrates and then throttles itself based on quota feedback. The time to reach the goal varies depending on workload intensity. Note: This metric works with both TEMPORAL and CONSIST goal tuners. 
Link: https://lore.kernel.org/20260428030520.701-1-ravis.opensrc@gmail.com Signed-off-by: Ravi Jonnalagadda Suggested-by: SeongJae Park Reviewed-by: SeongJae Park Cc: Honggyu Kim Cc: Jonathan Corbet Cc: Yunjeong Mun Signed-off-by: Andrew Morton --- include/linux/damon.h | 3 +++ 1 file changed, 3 insertions(+) (limited to 'include') diff --git a/include/linux/damon.h b/include/linux/damon.h index f2cdb7c3f5e6..986b8c902585 100644 --- a/include/linux/damon.h +++ b/include/linux/damon.h @@ -159,6 +159,8 @@ enum damos_action { * @DAMOS_QUOTA_NODE_MEMCG_FREE_BP: MemFree ratio of a node for a cgroup. * @DAMOS_QUOTA_ACTIVE_MEM_BP: Active to total LRU memory ratio. * @DAMOS_QUOTA_INACTIVE_MEM_BP: Inactive to total LRU memory ratio. + * @DAMOS_QUOTA_NODE_ELIGIBLE_MEM_BP: Scheme-eligible memory ratio of a + * node in basis points (0-10000). * @NR_DAMOS_QUOTA_GOAL_METRICS: Number of DAMOS quota goal metrics. * * Metrics equal to larger than @NR_DAMOS_QUOTA_GOAL_METRICS are unsupported. @@ -172,6 +174,7 @@ enum damos_quota_goal_metric { DAMOS_QUOTA_NODE_MEMCG_FREE_BP, DAMOS_QUOTA_ACTIVE_MEM_BP, DAMOS_QUOTA_INACTIVE_MEM_BP, + DAMOS_QUOTA_NODE_ELIGIBLE_MEM_BP, NR_DAMOS_QUOTA_GOAL_METRICS, }; -- cgit v1.2.3 From 4ee4fb3214a8aadf5e8d253f8a34b76baff7f37d Mon Sep 17 00:00:00 2001 From: SeongJae Park Date: Mon, 27 Apr 2026 18:33:52 -0700 Subject: mm/damon/core: introduce failed region quota charge ratio DAMOS quota is charged to all DAMOS action application attempted memory, regardless of how much of the memory the action was successful and failed. This makes understanding quota behavior without DAMOS stat but only with end level metrics (e.g., increased amount of free memory for DAMOS_PAGEOUT action) difficult. Also, charging action-failed memory same as action-successful memory is somewhat unfair, as successful action application will induce more overhead in most cases. Introduce DAMON core API for setting the charge ratio for such action-failed memory. 
It allows API callers to specify the ratio in a flexible way, by setting the numerator and the denominator. Link: https://lore.kernel.org/20260428013402.115171-4-sj@kernel.org Signed-off-by: SeongJae Park Cc: Brendan Higgins Cc: David Hildenbrand Cc: Jonathan Corbet Cc: Liam R. Howlett Cc: Lorenzo Stoakes Cc: Michal Hocko Cc: Mike Rapoport Cc: Shuah Khan Cc: Suren Baghdasaryan Cc: Vlastimil Babka Signed-off-by: Andrew Morton --- include/linux/damon.h | 9 +++++++++ 1 file changed, 9 insertions(+) (limited to 'include') diff --git a/include/linux/damon.h b/include/linux/damon.h index 986b8c902585..2bb43910e22e 100644 --- a/include/linux/damon.h +++ b/include/linux/damon.h @@ -236,6 +236,8 @@ enum damos_quota_goal_tuner { * @goals: Head of quota tuning goals (&damos_quota_goal) list. * @goal_tuner: Goal-based @esz tuning algorithm to use. * @esz: Effective size quota in bytes. + * @fail_charge_num: Failed regions charge rate numerator. + * @fail_charge_denom: Failed regions charge rate denominator. * * @weight_sz: Weight of the region's size for prioritization. * @weight_nr_accesses: Weight of the region's nr_accesses for prioritization. @@ -265,6 +267,10 @@ enum damos_quota_goal_tuner { * * The resulting effective size quota in bytes is set to @esz. * + * For DAMOS action applying failed amount of regions, charging those same to + * those that the action has successfully applied may be unfair. For the + * reason, 'the size * @fail_charge_num / @fail_charge_denom' is charged. + * * For selecting regions within the quota, DAMON prioritizes current scheme's * target memory regions using the &struct damon_operations->get_scheme_score. 
* You could customize the prioritization logic by setting &weight_sz, @@ -279,6 +285,9 @@ struct damos_quota { enum damos_quota_goal_tuner goal_tuner; unsigned long esz; + unsigned int fail_charge_num; + unsigned int fail_charge_denom; + unsigned int weight_sz; unsigned int weight_nr_accesses; unsigned int weight_age; -- cgit v1.2.3 From ffe55393137c01aa01940b528afcea8c5a108ed7 Mon Sep 17 00:00:00 2001 From: Muchun Song Date: Fri, 10 Apr 2026 17:24:19 +0800 Subject: mm/sparse: remove sparse buffer pre-allocation mechanism Commit 9bdac9142407 ("sparsemem: Put mem map for one node together.") introduced a mechanism to pre-allocate a large memory block to hold all memmaps for a NUMA node upfront. However, the original commit message did not clearly state the actual benefits or the necessity of explicitly pre-allocating a single chunk for all memmap areas of a given node. One of the concerns about removing this pre-allocation is that the subsequent per-section memmap allocations could become scattered around, and might turn too many memory blocks/sections into an "un-offlinable" state. However, tests show that even without the explicit node-wide pre-allocation, memblock still allocates memory closely and back-to-back. When tracing vmemmap_set_pmd allocations, the physical chunks allocated by memblock are strictly adjacent to each other in a single contiguous physical range (mapped top-down). Because they are packed tightly together naturally, they will at most consume or pollute the exact same number of memory blocks as the explicit pre-allocation did. Another concern is the boot performance impact of calling memmap_alloc() multiple times compared to one large node-wide allocation. Tests on a 256GB VM showed that memmap allocation time increased from 199,555 ns to 741,292 ns. Even though it is 3.7x slower, on a 1TB machine, the entire memory allocation time would only take a few milliseconds. This boot performance difference is completely negligible. 
Since no negative impact on memory offlining behavior or noticeable boot performance regression was found, this patch proposes removing the explicit node-wide memmap pre-allocation mechanism to reduce the maintenance burden. Link: https://lore.kernel.org/20260410092419.2446420-1-songmuchun@bytedance.com Signed-off-by: Muchun Song Acked-by: Mike Rapoport (Microsoft) Acked-by: David Hildenbrand (Arm) Cc: Liam Howlett Cc: Lorenzo Stoakes Cc: Michal Hocko Cc: Suren Baghdasaryan Cc: Vlastimil Babka Signed-off-by: Andrew Morton --- include/linux/mm.h | 1 - 1 file changed, 1 deletion(-) (limited to 'include') diff --git a/include/linux/mm.h b/include/linux/mm.h index e3b6112a8d79..8a0078a4dc78 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -4855,7 +4855,6 @@ static inline void print_vma_addr(char *prefix, unsigned long rip) } #endif -void *sparse_buffer_alloc(unsigned long size); unsigned long section_map_size(void); struct page * __populate_section_memmap(unsigned long pfn, unsigned long nr_pages, int nid, struct vmem_altmap *altmap, -- cgit v1.2.3 From f2a950170f7a78761c2b2e5e535716fb0f8c0813 Mon Sep 17 00:00:00 2001 From: "JP Kobryn (Meta)" Date: Mon, 6 Apr 2026 12:50:14 -0700 Subject: mm/vmpressure: skip socket pressure for costly order reclaim When reclaim is triggered by high order allocations on a fragmented system, vmpressure() can report poor reclaim efficiency even though the system has plenty of free memory. This is because many pages are scanned, but few are found to actually reclaim - the pages are actively in use and don't need to be freed. The resulting scan:reclaim ratio causes vmpressure() to assert socket pressure, throttling TCP throughput unnecessarily. Costly order allocations (above PAGE_ALLOC_COSTLY_ORDER) rely heavily on compaction to succeed, so poor reclaim efficiency at these orders does not necessarily indicate memory pressure. 
The kernel already treats this order as the boundary where reclaim is no longer expected to succeed and compaction may take over. Make vmpressure() order-aware through an additional parameter sourced from scan_control at existing call sites. Socket pressure is now only asserted when order <= PAGE_ALLOC_COSTLY_ORDER. Memcg reclaim is unaffected since try_to_free_mem_cgroup_pages() always uses order 0, which passes the filter unconditionally. Similarly, vmpressure_prio() now passes order 0 internally when calling vmpressure(), ensuring critical pressure from low reclaim priority is not suppressed by the order filter. The patch was motivated by a case of impacted net throughput in production. On one affected host, the memory state at the time showed ~15GB available, zero cgroup pressure, and the following buddyinfo state: Order FreePages 0: 133,970 1: 29,230 2: 17,351 3: 18,984 7+: 0 Using bpf, it was found that 94% of vmpressure calls on this host were from order-7 kswapd reclaim. TCP minimum recv window is rcv_ssthresh:19712. 
Before patch: 723 out of 3,843 (19%) TCP connections stuck at minimum recv window After live-patching and ~30min elapsed: 0 out of 3,470 TCP connections stuck at minimum recv window Link: https://lore.kernel.org/20260406195014.112521-1-jp.kobryn@linux.dev Signed-off-by: JP Kobryn (Meta) Reviewed-by: Rik van Riel Acked-by: Johannes Weiner Acked-by: Shakeel Butt Acked-by: Jakub Kicinski Reviewed-by: Barry Song Acked-by: Vlastimil Babka (SUSE) Cc: Axel Rasmussen Cc: David Hildenbrand Cc: Eric Dumazet Cc: Kairui Song Cc: Liam Howlett Cc: Lorenzo Stoakes Cc: Matthew Wilcox (Oracle) Cc: Michal Hocko Cc: Mike Rapoport Cc: Qi Zheng Cc: Suren Baghdasaryan Cc: Wei Xu Cc: Yuanchu Xie Signed-off-by: Andrew Morton --- include/linux/vmpressure.h | 9 +++++---- 1 file changed, 5 insertions(+), 4 deletions(-) (limited to 'include') diff --git a/include/linux/vmpressure.h b/include/linux/vmpressure.h index 6a2f51ebbfd3..faecd5522401 100644 --- a/include/linux/vmpressure.h +++ b/include/linux/vmpressure.h @@ -30,8 +30,8 @@ struct vmpressure { struct mem_cgroup; #ifdef CONFIG_MEMCG -extern void vmpressure(gfp_t gfp, struct mem_cgroup *memcg, bool tree, - unsigned long scanned, unsigned long reclaimed); +void vmpressure(gfp_t gfp, int order, struct mem_cgroup *memcg, bool tree, + unsigned long scanned, unsigned long reclaimed); extern void vmpressure_prio(gfp_t gfp, struct mem_cgroup *memcg, int prio); extern void vmpressure_init(struct vmpressure *vmpr); @@ -44,8 +44,9 @@ extern int vmpressure_register_event(struct mem_cgroup *memcg, extern void vmpressure_unregister_event(struct mem_cgroup *memcg, struct eventfd_ctx *eventfd); #else -static inline void vmpressure(gfp_t gfp, struct mem_cgroup *memcg, bool tree, - unsigned long scanned, unsigned long reclaimed) {} +static inline void vmpressure(gfp_t gfp, int order, struct mem_cgroup *memcg, + bool tree, unsigned long scanned, + unsigned long reclaimed) {} static inline void vmpressure_prio(gfp_t gfp, struct mem_cgroup *memcg, int 
prio) {} #endif /* CONFIG_MEMCG */ -- cgit v1.2.3 From 3bbc54dd1b62f1a4b218c70aafbeceeba7c90c5d Mon Sep 17 00:00:00 2001 From: Muchun Song Date: Tue, 28 Apr 2026 16:18:52 +0800 Subject: mm/sparse-vmemmap: pass @pgmap argument to memory deactivation paths Currently, the memory hot-remove call chain -- arch_remove_memory(), __remove_pages(), sparse_remove_section() and section_deactivate() -- does not carry the struct dev_pagemap pointer. This prevents the lower levels from knowing whether the section was originally populated with vmemmap optimizations (e.g., DAX with vmemmap optimization enabled). Without this information, we cannot call vmemmap_can_optimize() to determine if the vmemmap pages were optimized. As a result, the vmemmap page accounting during teardown will mistakenly assume a non-optimized allocation, leading to incorrect memmap statistics. To lay the groundwork for fixing the vmemmap page accounting, we need to pass the @pgmap pointer down to the deactivation location. Plumb the @pgmap argument through the APIs of arch_remove_memory(), __remove_pages() and sparse_remove_section(), mirroring the corresponding *_activate() paths. Link: https://lore.kernel.org/20260428081855.1249045-4-songmuchun@bytedance.com Signed-off-by: Muchun Song Acked-by: Mike Rapoport (Microsoft) Reviewed-by: Oscar Salvador Acked-by: David Hildenbrand (Arm) Acked-by: Liam R. 
Howlett
Cc: "Aneesh Kumar K.V"
Cc: Joao Martins
Cc: Lorenzo Stoakes
Cc: Madhavan Srinivasan
Cc: Michael Ellerman
Cc: Michal Hocko
Cc: Nicholas Piggin
Cc: Suren Baghdasaryan
Cc: Vlastimil Babka
Signed-off-by: Andrew Morton
---
 include/linux/memory_hotplug.h | 8 +++++---
 1 file changed, 5 insertions(+), 3 deletions(-)

(limited to 'include')

diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
index 815e908c4135..7c9d66729c60 100644
--- a/include/linux/memory_hotplug.h
+++ b/include/linux/memory_hotplug.h
@@ -135,9 +135,10 @@ static inline bool movable_node_is_enabled(void)
 	return movable_node_enabled;
 }
 
-extern void arch_remove_memory(u64 start, u64 size, struct vmem_altmap *altmap);
+extern void arch_remove_memory(u64 start, u64 size, struct vmem_altmap *altmap,
+			       struct dev_pagemap *pgmap);
 extern void __remove_pages(unsigned long start_pfn, unsigned long nr_pages,
-			   struct vmem_altmap *altmap);
+			   struct vmem_altmap *altmap, struct dev_pagemap *pgmap);
 
 /* reasonably generic interface to expand the physical pages */
 extern int __add_pages(int nid, unsigned long start_pfn, unsigned long nr_pages,
@@ -307,7 +308,8 @@ extern int sparse_add_section(int nid, unsigned long pfn,
 		unsigned long nr_pages, struct vmem_altmap *altmap,
 		struct dev_pagemap *pgmap);
 extern void sparse_remove_section(unsigned long pfn, unsigned long nr_pages,
-				  struct vmem_altmap *altmap);
+				  struct vmem_altmap *altmap,
+				  struct dev_pagemap *pgmap);
 extern struct zone *zone_for_pfn_range(enum mmop online_type,
 		int nid, struct memory_group *group, unsigned long start_pfn,
 		unsigned long nr_pages);
-- 
cgit v1.2.3

From 58996503b631adc6a268a42f4624a34513c16199 Mon Sep 17 00:00:00 2001
From: Asier Gutierrez
Date: Sun, 26 Apr 2026 16:16:17 -0700
Subject: mm/damon: support MADV_COLLAPSE via DAMOS_COLLAPSE scheme action

This patch set introduces a new action: DAMOS_COLLAPSE.
For DAMOS_HUGEPAGE and DAMOS_NOHUGEPAGE to work, khugepaged must be running, since these actions rely on hugepage_madvise to add a new slot. khugepaged should pick up this slot and eventually collapse (or not, if we are using DAMOS_NOHUGEPAGE) the pages. If THP is not enabled, khugepaged will not be running, and therefore no collapse will happen.

DAMOS_COLLAPSE eventually calls madvise_collapse, which collapses the address range synchronously.

In cases where there is a large VMA (databases, for example), DAMOS_COLLAPSE allows us to collapse only the hot region, and not the entire VMA.

This new action may be required to support autotuning with hugepage as a goal[1].

=========
Benchmarks:
=========

MySQL
=====

Tests were performed on an ARM physical server with MariaDB 10.5 and sysbench. A read-only benchmark was performed with Gaussian row hitting, which follows a normal distribution.

T n, D h: THP set to never, DAMON action set to hugepage
T m, D h: THP set to madvise, DAMON action set to hugepage
T n, D c: THP set to never, DAMON action set to collapse

Memory consumption. Lower is better.

+------------------+----------+----------+----------+
|                  | T n, D h | T m, D h | T n, D c |
+------------------+----------+----------+----------+
| Total memory use | 2.13     | 2.20     | 2.20     |
| Huge pages       | 0        | 1.3      | 1.27     |
+------------------+----------+----------+----------+

Performance in TPS (Transactions Per Second). Higher is better.

T n, D h: 18225.58
T m, D h: 18252.93
T n, D c: 18270.21

Performance counter

I got the number of L1 D/I TLB accesses and the number of D/I TLB accesses that triggered a page walk. I divided the second by the first to get the percentage of page walks per TLB access. The lower the better.

+---------------+--------------+--------------+--------------+
|               | T n, D h     | T m, D h     | T n, D c     |
+---------------+--------------+--------------+--------------+
| L1 DTLB       | 127248242753 | 125431020479 | 125327001821 |
| L1 ITLB       | 80332558619  | 79346759071  | 79298139590  |
| DTLB walk     | 75011087     | 52800418     | 55895794     |
| ITLB walk     | 71577076     | 71505137     | 67262140     |
| DTLB % misses | 0.058948623  | 0.042095183  | 0.044599961  |
| ITLB % misses | 0.089100954  | 0.090117275  | 0.084821839  |
+---------------+--------------+--------------+--------------+

Masim
=====

I used masim with the "demo" configuration, but changed the times to 100 seconds for the initial phase and 50 seconds for the rest of the phases.

Memory consumption:

+------------------+----------+----------+----------+
|                  | T n, D h | T m, D h | T n, D c |
+------------------+----------+----------+----------+
| Total memory use | 2.38 GB  | 2.36 GB  | 2.37 GB  |
| Huge pages       | 0        | 190 MB   | 188 MB   |
+------------------+----------+----------+----------+

Performance:

THP never, DAMOS_HUGEPAGE
initial phase: 40,491 accesses/msec, 100001 msecs run
low phase 0: 39,658 accesses/msec, 50002 msecs run
high phase 0: 41,678 accesses/msec, 50000 msecs run
low phase 1: 39,625 accesses/msec, 50003 msecs run
high phase 1: 41,658 accesses/msec, 50002 msecs run
low phase 2: 39,642 accesses/msec, 50002 msecs run
high phase 2: 41,640 accesses/msec, 50001 msecs run

THP madvise, DAMOS_HUGEPAGE
initial phase: 51,977 accesses/msec, 100000 msecs run
low phase 0: 86,953 accesses/msec, 50000 msecs run
high phase 0: 94,812 accesses/msec, 50000 msecs run
low phase 1: 101,017 accesses/msec, 50000 msecs run
high phase 1: 94,841 accesses/msec, 50000 msecs run
low phase 2: 100,993 accesses/msec, 50000 msecs run
high phase 2: 94,791 accesses/msec, 50001 msecs run

THP never, DAMOS_COLLAPSE
initial phase: 93,678 accesses/msec, 100001 msecs run
low phase 0: 101,475 accesses/msec, 50000 msecs run
high phase 0: 98,589 accesses/msec, 50000 msecs run
low phase 1: 101,531 accesses/msec, 50001 msecs run
high phase 1: 98,506 accesses/msec, 50001 msecs run
low phase 2: 101,458 accesses/msec, 50001 msecs run
high phase 2: 98,555 accesses/msec, 50000 msecs run

Memory consumption dynamic (how quickly collapses occur):

The table shows how many huge pages are allocated at each point in time, given in seconds.

+----+----------+----------+
|    | T m, D h | T n, D c |
+----+----------+----------+
| 5  | 32       | 188      |
| 10 | 48       | 188      |
| 15 | 64       | 188      |
| 20 | 96       | 188      |
| 30 | 112      | 188      |
| 35 | 144      | 188      |
| 40 | 160      | 188      |
| 45 | 190      | 188      |
| 50 | 190      | 188      |
| 55 | 190      | 188      |
| 60 | 190      | 188      |
+----+----------+----------+

=========

- We can see that the DAMOS "hugepage" action works only when THP is set to madvise. The "collapse" action works even when THP is set to never.
- Performance for the "collapse" action is slightly lower than for the "hugepage" action with THP madvise. This is due to the fact that collapses occur synchronously. With "hugepage" they may occur during page faults.
- Memory consumption is slightly lower for "collapse" than for "hugepage" with THP madvise. This is because khugepaged collapses all VMAs, while the "collapse" action only collapses the VMAs in the hot region.
- There is an improvement in TLB utilization when collapses through the "hugepage" or "collapse" actions are triggered. The number of TLB misses is lower.
- The "collapse" action is performed synchronously, which means that page collapses happen earlier and more rapidly. This can be useful or not, depending on the scenario.
- The "hugepage" action may trigger a VMA split in some scenarios, since it needs to change the flag of the VMA to THP enabled. This may lead to additional overhead.

The collapse action just adds a new option to choose the correct system balance.
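For readers unfamiliar with MADV_COLLAPSE, the following user-space sketch shows the kind of request that a synchronous collapse of a hot range amounts to. It is not code from this patch. The helper names hot_range_align() and collapse_hot_range(), the 2 MiB HPAGE_SZ value, and the choice to shrink the range inward to huge-page boundaries are all assumptions made for illustration; the fallback value 25 for MADV_COLLAPSE is the one in the generic Linux UAPI headers.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>

#ifndef MADV_COLLAPSE
#define MADV_COLLAPSE 25 /* generic Linux UAPI value, for older libc headers */
#endif

/* Assumed PMD huge page size; 2 MiB is common but architecture dependent. */
#define HPAGE_SZ (2UL << 20)

/*
 * Hypothetical helper: shrink [*start, *end) inward to HPAGE_SZ boundaries.
 * Returns 1 if a non-empty aligned span remains, 0 otherwise.
 */
int hot_range_align(uintptr_t *start, uintptr_t *end)
{
	uintptr_t s, e;

	assert(start && end);
	s = (*start + HPAGE_SZ - 1) & ~(HPAGE_SZ - 1);	/* round up */
	e = *end & ~(HPAGE_SZ - 1);			/* round down */
	if (s >= e)
		return 0;
	*start = s;
	*end = e;
	return 1;
}

/*
 * Hypothetical helper: ask the kernel to collapse only the aligned part of
 * a hot range, synchronously. Returns the madvise() result, or 0 when the
 * range does not contain a full huge-page-sized span.
 */
int collapse_hot_range(void *addr, size_t len)
{
	uintptr_t s = (uintptr_t)addr, e = s + len;

	if (!hot_range_align(&s, &e))
		return 0;
	return madvise((void *)s, e - s, MADV_COLLAPSE);
}
```

Whether the madvise() call succeeds depends on the running kernel and its THP configuration, so callers of such a helper would need to check the return value and errno.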
Link: https://lore.kernel.org/20260426231619.107231-5-sj@kernel.org Link: https://lore.kernel.org/damon/20260313000816.79933-1-sj@kernel.org/ [1] Signed-off-by: Asier Gutierrez Signed-off-by: SeongJae Park Reviewed-by: SeongJae Park Cc: Cheng-Han Wu Cc: David Hildenbrand Cc: Jonathan Corbet Cc: Liam R. Howlett Cc: Liew Rui Yan Cc: Lorenzo Stoakes Cc: Michal Hocko Cc: Mike Rapoport Cc: Shuah Khan Cc: Suren Baghdasaryan Cc: Vlastimil Babka Signed-off-by: Andrew Morton --- include/linux/damon.h | 2 ++ 1 file changed, 2 insertions(+) (limited to 'include') diff --git a/include/linux/damon.h b/include/linux/damon.h index 2bb43910e22e..d3a231275c23 100644 --- a/include/linux/damon.h +++ b/include/linux/damon.h @@ -121,6 +121,7 @@ struct damon_target { * @DAMOS_PAGEOUT: Reclaim the region. * @DAMOS_HUGEPAGE: Call ``madvise()`` for the region with MADV_HUGEPAGE. * @DAMOS_NOHUGEPAGE: Call ``madvise()`` for the region with MADV_NOHUGEPAGE. + * @DAMOS_COLLAPSE: Call ``madvise()`` for the region with MADV_COLLAPSE. * @DAMOS_LRU_PRIO: Prioritize the region on its LRU lists. * @DAMOS_LRU_DEPRIO: Deprioritize the region on its LRU lists. * @DAMOS_MIGRATE_HOT: Migrate the regions prioritizing warmer regions. @@ -140,6 +141,7 @@ enum damos_action { DAMOS_PAGEOUT, DAMOS_HUGEPAGE, DAMOS_NOHUGEPAGE, + DAMOS_COLLAPSE, DAMOS_LRU_PRIO, DAMOS_LRU_DEPRIO, DAMOS_MIGRATE_HOT, -- cgit v1.2.3 From 90f01f5d6ba57d93363289b3247314b7fd5e8d49 Mon Sep 17 00:00:00 2001 From: "David Hildenbrand (Arm)" Date: Mon, 27 Apr 2026 13:43:16 +0200 Subject: mm: remove page_mapped() Let's replace the last user of page_mapped() by folio_mapped() so we can get rid of page_mapped(). Replace the remaining occurrences of page_mapped() in rmap documentation by folio_mapped(). 
Link: https://lore.kernel.org/20260427-page_mapped-v1-3-e89c3592c74c@kernel.org Signed-off-by: David Hildenbrand (Arm) Reviewed-by: Matthew Wilcox (Oracle) Cc: Alexei Starovoitov Cc: Andrii Nakryiko Cc: Eduard Zingerman Cc: Harry Yoo Cc: Jann Horn Cc: Jiri Olsa Cc: John Paul Adrian Glaubitz Cc: Kumar Kartikeya Dwivedi Cc: Liam R. Howlett Cc: Lorenzo Stoakes Cc: Martin KaFai Lau Cc: Michal Hocko Cc: Mike Rapoport Cc: Rich Felker Cc: Rik van Riel Cc: Song Liu Cc: Suren Baghdasaryan Cc: Vlastimil Babka Cc: Yonghong Song Cc: Yoshinori Sato Signed-off-by: Andrew Morton --- include/linux/mm.h | 10 ---------- 1 file changed, 10 deletions(-) (limited to 'include') diff --git a/include/linux/mm.h b/include/linux/mm.h index 8a0078a4dc78..9cedc5e75aa9 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -1888,16 +1888,6 @@ static inline bool folio_mapped(const struct folio *folio) return folio_mapcount(folio) >= 1; } -/* - * Return true if this page is mapped into pagetables. - * For compound page it returns true if any sub-page of compound page is mapped, - * even if this particular sub-page is not itself mapped by any PTE or PMD. - */ -static inline bool page_mapped(const struct page *page) -{ - return folio_mapped(page_folio(page)); -} - static inline struct page *virt_to_head_page(const void *x) { struct page *page = virt_to_page(x); -- cgit v1.2.3 From 7b32f64bc512b40b268776c5ac4d354b325b3197 Mon Sep 17 00:00:00 2001 From: Frederick Mayle Date: Sun, 26 Apr 2026 20:01:47 -0700 Subject: mm: limit filemap_fault readahead to VMA boundaries When a file mapping covers a strict subset of a file, an access to the mapping can trigger readahead of file pages outside the mapped region. Readahead is meant to prefetch pages likely to be accessed soon, but these pages aren't accessible via the same means, so it is fair to say we don't have a good indicator they'll be accessed soon.
Take an ELF file for example: an access to the end of a program's read-only segment isn't a sign that nearby file contents will be accessed next (they are likely to be mapped discontiguously, or not at all). The pressure from loading these pages into the cache can evict more useful pages. To improve the behavior, make three changes: * Introduce a new readahead_control field, max_index, as a hard limit on the readahead. The existing file_ra_state->size can't be used as a limit, it is more of a hint and can be increased by various heuristics. * Set readahead_control->max_index to the end of the VMA in all of the readahead paths that can be triggered from a fault on a file mapping (both "sync" and "async" readahead). * Limit the read-around range start to the VMA's start. Note that these changes only affect readahead triggered in the context of a fault, they do not affect readahead triggered by read syscalls. If a user mixes the two types of accesses, the behavior is expected to be the following: if a fault causes readahead and places a PG_readahead marker and then a read(2) syscall hits the PG_readahead marker, the resulting async readahead *will not* be limited to the VMA end. Conversely, if a read(2) syscall places a PG_readahead marker and then a fault hits the marker, the async readahead *will* be limited to the VMA end. There is an edge case that the above motivation glosses over: A single file mapping might be backed by multiple VMAs. For example, a whole file could be mapped RW, then part of the mapping made RO using mprotect. This patch would hurt performance of a sequential faulted read of such a mapping, the degree depending on how fragmented the VMAs are. A usage pattern like that is likely rare and already suffering from sub-optimal performance because, e.g., the fragmented VMAs limit the fault-around, so each VMA boundary in a sequential faulted read would cause a minor fault. Still, this patch would make it worse. 
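The clamping described by the three changes above can be pictured with a small stand-alone model. This is only a sketch and not the code of this patch: struct ra_window and ra_clamp_to_vma() are hypothetical names, pgoff_t is re-declared locally so the snippet compiles outside the kernel, and both window ends are treated as inclusive page indexes, matching the "inclusive" note on _max_index.

```c
#include <assert.h>

typedef unsigned long pgoff_t;

/* Hypothetical container for an inclusive page-index window. */
struct ra_window {
	pgoff_t first;
	pgoff_t last;
};

/*
 * Clamp the window a fault would like to read to the page indexes covered
 * by the faulting VMA. vma_first limits the read-around start and vma_last
 * plays the role of the hard upper limit. The caller is assumed to pass a
 * window that contains the faulting index, which lies inside the VMA.
 */
struct ra_window ra_clamp_to_vma(struct ra_window want,
				 pgoff_t vma_first, pgoff_t vma_last)
{
	assert(vma_first <= vma_last);
	if (want.first < vma_first)
		want.first = vma_first;
	if (want.last > vma_last)
		want.last = vma_last;
	return want;
}
```

In this model a read(2)-triggered readahead corresponds to calling the helper with vma_last set to ULONG_MAX, which leaves the upper end of the window untouched.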
See a previous discussion of this topic at [1]. Tested by mapping and reading a small subset of a large file, then using the cachestat syscall to verify the number of cached pages didn't exceed the mapping size. In practical scenarios, the effect depends on the specific file and usage. Sometimes there is no effect at all, but, for some ELF files in Android, we see ~20% fewer pages pulled into the cache. A comprehensive performance evaluation hasn't been done, but, in addition to the anecdotal memory savings mentioned above, a benchmark was run with fio 3.38, showing neutral looking results: /data/local/tmp/fio --version fio --name=mmap_test --ioengine=mmap --rw=read --bs=4k \ --offset=1G --size=1G --filesize=3G --numjobs=1 \ --filename=testfile.bin Before: 4366.6 MiB/s (avg of 3459, 4592, 4613, 4697, 4472) After: 4444.0 MiB/s (avg of 4633, 4655, 4511, 4571, 3850) +1.7% Same, with --ioengine=mmap --rw=randread Before: 445.6 MiB/s (avg of 446, 447, 442, 452, 441) After: 447.0 MiB/s (avg of 447, 446, 446, 451, 445) +0.3% Same, with --ioengine=psync --rw=read Before: 3086.6 MiB/s (avg of 3122, 3094, 3066, 3094, 3057) After: 3084.6 MiB/s (avg of 3039, 3103, 3103, 3084, 3094) -0.06% Same, with --ioengine=psync --rw=randread Before: 2226.4 MiB/s (avg of 2256, 2183, 2207, 2265, 2221) After: 2231.4 MiB/s (avg of 2236, 2241, 2236, 2193, 2251) +0.2% Link: https://lore.kernel.org/20260427030148.653228-1-fmayle@google.com Link: https://lore.kernel.org/all/ivnv2crd3et76p2nx7oszuqhzzah756oecn5yuykzqfkqzoygw@yvnlkhjjssoz/ [1] Signed-off-by: Frederick Mayle Reviewed-by: Jan Kara Reviewed-by: Kalesh Singh Cc: David Hildenbrand Cc: Lorenzo Stoakes Cc: Matthew Wilcox Cc: Suren Baghdasaryan Signed-off-by: Andrew Morton --- include/linux/pagemap.h | 2 ++ 1 file changed, 2 insertions(+) (limited to 'include') diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h index 31a848485ad9..1f50991b43e3 100644 --- a/include/linux/pagemap.h +++ b/include/linux/pagemap.h @@ -1350,6
+1350,7 @@ struct readahead_control { struct file_ra_state *ra; /* private: use the readahead_* accessors instead */ pgoff_t _index; + pgoff_t _max_index; /* limit readahead to _max_index, inclusive */ unsigned int _nr_pages; unsigned int _batch_count; bool dropbehind; @@ -1363,6 +1364,7 @@ struct readahead_control { .mapping = m, \ .ra = r, \ ._index = i, \ + ._max_index = ULONG_MAX, \ } #define VM_READAHEAD_PAGES (SZ_128K / PAGE_SIZE) -- cgit v1.2.3 From 3b9e3cc0405b422db884054ea2417b7b85220c56 Mon Sep 17 00:00:00 2001 From: SeongJae Park Date: Mon, 27 Apr 2026 08:12:20 -0700 Subject: mm/damon/core: introduce damon_ctx->paused Patch series "mm/damon: let DAMON be paused and resumed", v2. DAMON utilizes a few mechanisms that enhance itself over time. Self-training mechanisms such as adaptive regions adjustment, goal-based DAMOS quota auto-tuning and monitoring intervals auto-tuning are such examples. It also adds access frequency stability information (age) to the monitoring results, which makes it enhanced over time. Sometimes users have to stop DAMON. In this case, the DAMON internal state that was enhanced over the course of the last execution is simply lost. A restarted DAMON has to train itself and enhance its output from scratch. This makes DAMON less useful in such cases. Three such use cases are introduced below. Investigation of DAMON. It is best to do the investigation online, especially when it is a production environment. DAMON therefore provides features for such online investigations, including DAMOS stats, monitoring result snapshot exposure, and multiple tracepoints. When those are insufficient, and there are additional clues that could be interfered with by DAMON, users have to temporarily stop DAMON to collect the additional clues. It is not very useful since many of DAMON's internal clues are gone when DAMON is stopped. The loss of the monitoring results that improved over time is also problematic, especially in production environments.
Monitoring of workloads that have different user-known phases. For example, in Android, applications are known to have very different access patterns and behaviors when they are running in the foreground and in the background. It can therefore be useful to separate monitoring of apps based on whether they are running in the foreground or in the background. Having two DAMON threads per application that are paused and resumed on the app's foreground/background switches can be useful for the purpose. But such pause/resume of the execution is not supported. Tests of DAMON. A few DAMON selftests are using drgn to dump the internal DAMON status. The tests check whether the dumped status is the same as what the test code expected. Because DAMON keeps running and modifying its internal status, there are chances of data races that can cause false test results. Stopping DAMON can avoid the race. But, since the internal state of DAMON is dropped, the test coverage will be limited. Let DAMON execution be paused and resumed without loss of the internal state, to overcome these limitations. For this, introduce a new DAMON context parameter, namely 'pause'. API callers can update it while the context is running, using the online parameters update functions (damon_commit_ctx() and damon_call()). Once it is set, the kdamond_fn() main loop does only limited work, excluding the monitoring and DAMOS work, and sleeps for a sampling interval between rounds of that work. The limited work includes handling of the online parameters update. Hence users can unset the 'pause' parameter again. Once it is unset, the kdamond_fn() main loop does all the work again (resumed). Under the paused state, it also does stop condition checks and handles them, so that paused DAMON can also be stopped if needed. Expose the feature to the user space via DAMON sysfs interface. Also, update existing drgn-based tests to test and use the feature.
Tests ===== I confirmed the feature functionality using real time tracing ('perf trace' or 'trace-cmd stream') of damon:damon_aggregated DAMON tracepoint. By pausing and resuming the DAMON execution, I was able to see the trace stop and continue as expected. Note that the pause feature support is added to DAMON user-space tool (damo) after v3.1.9. Users can use '--pause_ctx' command line option of damo for that, and I actually used it for my test. The extended drgn-based selftests are also testing a part of the functionality. Patches Sequence ================ Patch 1 introduces the new core API for the pause feature. Patch 2 extends DAMON sysfs interface for the new parameter. Patches 3-5 update design, usage and ABI documents for the new sysfs file, respectively. The following five patches are for tests. Patch 6 implements a new kunit test for the pause parameter online commitment. Patches 7 and 8 extend DAMON selftest helpers to support the new feature. Patch 9 extends selftest to test the commitment of the feature. Finally, patch 10 updates existing selftest to be safe from the race condition using the pause/resume feature. This patch (of 10): DAMON supports only start and stop of the execution. When it is stopped, the internal data that it self-trained is lost. It will be useful if the execution can be paused and resumed with the previous self-trained data. Introduce per-context API parameter, 'paused', for the purpose. The parameter can be set and unset while DAMON is running and paused, using the online parameters commit helper functions (damon_commit_ctx() and damon_call()). Once 'paused' is set, the kdamond_fn() main loop does only limited work, sleeping for a sampling interval between rounds of that work. The limited work includes the handling of the online parameters update, so that users can unset the 'pause' and resume the execution when they want. It also keeps checking DAMON stop conditions and handling them, so that DAMON can be stopped while paused if needed.
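The paused-loop behavior described above can be modeled in a few lines of plain C. This is a toy model and not the kdamond_fn() implementation: struct ctx_model, its counters, and ctx_model_step() are hypothetical names, the sampling-interval sleep is left as a comment, and the order of the stop check relative to the parameter-update handling is an assumption.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the parts of the context this sketch needs. */
struct ctx_model {
	bool pause;				/* models the new parameter */
	bool stop;				/* models a met stop condition */
	unsigned long nr_full_rounds;		/* monitoring + DAMOS rounds */
	unsigned long nr_paused_rounds;		/* limited-work rounds */
};

/*
 * Run one main-loop iteration of the model. Returns false when the loop
 * should exit. Stop conditions are honored in both states, and online
 * parameter updates (not modeled) would be handled in both states too,
 * which is what lets a caller clear 'pause' again.
 */
bool ctx_model_step(struct ctx_model *c)
{
	if (c->stop)
		return false;
	if (c->pause) {
		/* a real loop would sleep for one sampling interval here */
		c->nr_paused_rounds++;
		return true;
	}
	c->nr_full_rounds++;	/* monitoring and DAMOS work happen here */
	return true;
}
```

The point of the model is that the counters and any other state in the structure survive the paused rounds untouched, so the full rounds continue from where they left off once 'pause' is cleared.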
Link: https://lore.kernel.org/20260427151231.113429-1-sj@kernel.org Link: https://lore.kernel.org/20260427151231.113429-2-sj@kernel.org Signed-off-by: SeongJae Park Cc: Brendan Higgins Cc: David Hildenbrand Cc: Jonathan Corbet Cc: Liam R. Howlett Cc: Lorenzo Stoakes Cc: Michal Hocko Cc: Mike Rapoport Cc: Shuah Khan Cc: Suren Baghdasaryan Cc: Vlastimil Babka Signed-off-by: Andrew Morton --- include/linux/damon.h | 2 ++ 1 file changed, 2 insertions(+) (limited to 'include') diff --git a/include/linux/damon.h b/include/linux/damon.h index d3a231275c23..f2370a3a4a9a 100644 --- a/include/linux/damon.h +++ b/include/linux/damon.h @@ -801,6 +801,7 @@ struct damon_attrs { * @ops: Set of monitoring operations for given use cases. * @addr_unit: Scale factor for core to ops address conversion. * @min_region_sz: Minimum region size. + * @pause: Pause kdamond main loop. * @adaptive_targets: Head of monitoring targets (&damon_target) list. * @schemes: Head of schemes (&damos) list. */ @@ -854,6 +855,7 @@ struct damon_ctx { struct damon_operations ops; unsigned long addr_unit; unsigned long min_region_sz; + bool pause; struct list_head adaptive_targets; struct list_head schemes; -- cgit v1.2.3 From b56ca146a2b2750172f91f6db960a37a1a546efd Mon Sep 17 00:00:00 2001 From: Muhammad Usama Anjum Date: Wed, 29 Apr 2026 15:57:02 +0530 Subject: vmalloc: add __GFP_SKIP_KASAN support MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Patch series "kasan: hw_tags: Disable tagging for stack and page-tables", v4. Stacks and page tables are always accessed with the match-all tag, so assigning a new random tag every time at allocation and setting invalid tag at deallocation time, just adds overhead without improving the detection. With __GFP_SKIP_KASAN the page keeps its poison tag and KASAN_TAG_KERNEL (match-all tag) is stored in the page flags while keeping the poison tag in the hardware. 
The benefit is that the 256 tag-setting instructions per 4 kB page aren't needed at allocation and deallocation time. Thus match-all pointers still work, while non-match tags (other than poison tag) still fault. __GFP_SKIP_KASAN only skips for KASAN_HW_TAGS mode, so coverage is unchanged. Benchmark: The benchmark has two modes. In thread mode, the child process forks and creates N threads. In pgtable mode, the parent maps and faults a specified memory size and then forks repeatedly with children exiting immediately. Thread benchmark: 2000 iterations, 2000 threads: 2.575 s → 2.229 s (~13.4% faster) The pgtable samples: - 2048 MB, 2000 iters 19.08 s → 17.62 s (~7.6% faster) This patch (of 3): For allocations that will be accessed only with match-all pointers (e.g., kernel stacks), setting tags is wasted work. If the caller already set __GFP_SKIP_KASAN, skip tag setting of vmalloc pages. Before this patch, __GFP_SKIP_KASAN wasn't being used with vmalloc APIs, so it wasn't being checked. Now it is checked and acted upon. Other KASAN modes are unchanged because __GFP_SKIP_KASAN is ignored for them in the page allocator, and in vmalloc too we ignore this flag for them. This is a preparatory patch for optimizing kernel stack allocations. Link: https://lore.kernel.org/20260429102704.680174-1-dev.jain@arm.com Link: https://lore.kernel.org/20260429102704.680174-2-dev.jain@arm.com Signed-off-by: Muhammad Usama Anjum Co-developed-by: Ryan Roberts Signed-off-by: Ryan Roberts Co-developed-by: Dev Jain Signed-off-by: Dev Jain Reviewed-by: Catalin Marinas Cc: Arnd Bergmann Cc: Ben Segall Cc: David Hildenbrand Cc: Dietmar Eggemann Cc: Ingo Molnar Cc: Juri Lelli Cc: Kees Cook Cc: K Prateek Nayak Cc: Liam Howlett Cc: Lorenzo Stoakes Cc: Mathieu Desnoyers Cc: Mel Gorman Cc: Michal Hocko Cc: Mike Rapoport Cc: Peter Zijlstra Cc: Steven Rostedt Cc: Suren Baghdasaryan Cc: "Uladzislau Rezki (Sony)" Cc: Valentin Schneider