From 945578fee2ec17bebdec067371214d3cbed48822 Mon Sep 17 00:00:00 2001
From: Kairui Song
Date: Sun, 17 May 2026 23:39:45 +0800
Subject: mm/memcg, swap: tidy up cgroup v1 memsw swap helpers

The cgroup v1 swap helpers always operate on swap cache folios whose swap
entry is stable: the folio is locked and in the swap cache. There is no
need to pass the swap entry or page count as separate parameters when they
can be derived from the folio itself.

Simplify the redundant parameters and add sanity checks to document the
required preconditions.

Also rename memcg1_swapout to __memcg1_swapout to indicate it requires
special calling context: the folio must be isolated and dying, and the
call must be made with interrupts disabled.

No functional change.

Link: https://lore.kernel.org/20260517-swap-table-p4-v5-6-88ae43e064c7@tencent.com
Signed-off-by: Kairui Song
Acked-by: Chris Li
Cc: Baolin Wang
Cc: Baoquan He
Cc: Barry Song
Cc: Chengming Zhou
Cc: David Hildenbrand
Cc: Hugh Dickins
Cc: Johannes Weiner
Cc: Kemeng Shi
Cc: Lorenzo Stoakes
Cc: Muchun Song
Cc: Nhat Pham
Cc: Roman Gushchin
Cc: Shakeel Butt
Cc: Youngjun Park
Cc: Zi Yan
Signed-off-by: Andrew Morton
---
 include/linux/memcontrol.h | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

(limited to 'include/linux/memcontrol.h')

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index dc3fa687759b..7d08128de1fd 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -1899,8 +1899,8 @@ static inline void mem_cgroup_exit_user_fault(void)
 	current->in_user_fault = 0;
 }
 
-void memcg1_swapout(struct folio *folio, swp_entry_t entry);
-void memcg1_swapin(swp_entry_t entry, unsigned int nr_pages);
+void __memcg1_swapout(struct folio *folio);
+void memcg1_swapin(struct folio *folio);
 
 #else /* CONFIG_MEMCG_V1 */
 static inline
@@ -1929,11 +1929,11 @@ static inline void mem_cgroup_exit_user_fault(void)
 {
 }
 
-static inline void memcg1_swapout(struct folio *folio, swp_entry_t entry)
+static inline void __memcg1_swapout(struct folio *folio)
 {
 }
 
-static inline void memcg1_swapin(swp_entry_t entry, unsigned int nr_pages)
+static inline void memcg1_swapin(struct folio *folio)
 {
 }
 
--
cgit v1.2.3

From bc34e87a51d9e51d398ef6d8c2c35cf1a4ff38b9 Mon Sep 17 00:00:00 2001
From: Kairui Song
Date: Sun, 17 May 2026 23:39:47 +0800
Subject: mm, swap: delay and unify memcg lookup and charging for swapin

Instead of checking the cgroup private ID during page table walk in
swap_pte_batch(), move the memcg lookup into __swap_cache_add_check()
under the cluster lock.

The first pre-alloc check is speculative and skips the memcg check since
the post-alloc stable check ensures all slots covered by the folio belong
to the same memcg. It is very rare for contiguous and aligned entries
across a contiguous region of a page table of the same process or shmem
mapping to belong to different memcgs.

This also prepares for recording the memcg info in the cluster's table.

Also make the order check and fallback more compact.

There should be no user-observable behavior change.

Link: https://lore.kernel.org/20260517-swap-table-p4-v5-8-88ae43e064c7@tencent.com
Signed-off-by: Kairui Song
Acked-by: Chris Li
Cc: Baolin Wang
Cc: Baoquan He
Cc: Barry Song
Cc: Chengming Zhou
Cc: David Hildenbrand
Cc: Hugh Dickins
Cc: Johannes Weiner
Cc: Kemeng Shi
Cc: Lorenzo Stoakes
Cc: Muchun Song
Cc: Nhat Pham
Cc: Roman Gushchin
Cc: Shakeel Butt
Cc: Youngjun Park
Cc: Zi Yan
Signed-off-by: Andrew Morton
---
 include/linux/memcontrol.h | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

(limited to 'include/linux/memcontrol.h')

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 7d08128de1fd..a013f37f24aa 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -646,8 +646,8 @@ static inline int mem_cgroup_charge(struct folio *folio, struct mm_struct *mm,
 
 int mem_cgroup_charge_hugetlb(struct folio* folio, gfp_t gfp);
 
-int mem_cgroup_swapin_charge_folio(struct folio *folio, struct mm_struct *mm,
-				  gfp_t gfp, swp_entry_t entry);
+int mem_cgroup_swapin_charge_folio(struct folio *folio, unsigned short id,
+				   struct mm_struct *mm, gfp_t gfp);
 
 void __mem_cgroup_uncharge(struct folio *folio);
 
@@ -1137,7 +1137,7 @@ static inline int mem_cgroup_charge_hugetlb(struct folio* folio, gfp_t gfp)
 }
 
 static inline int mem_cgroup_swapin_charge_folio(struct folio *folio,
-			struct mm_struct *mm, gfp_t gfp, swp_entry_t entry)
+			unsigned short id, struct mm_struct *mm, gfp_t gfp)
 {
 	return 0;
 }
--
cgit v1.2.3

From b197d41462c2076bc88c79fead7f400e48881c19 Mon Sep 17 00:00:00 2001
From: Kairui Song
Date: Sun, 17 May 2026 23:39:49 +0800
Subject: mm/memcg, swap: store cgroup id in cluster table directly

Drop the usage of the swap_cgroup_ctrl, and use the dynamic cluster table
instead.

The per-cluster memcg table is 1024 / 512 bytes on most archs, and does
not need RCU protection: the cgroup data is only read and written under
the cluster lock.
That keeps things simple, lets the allocation use plain kmalloc with
immediate kfree (no deferred free), and keeps fragmentation acceptable.

[akpm@linux-foundation.org: memcgv1: don't compile swap functions when CONFIG_SWAP=n]
Link: https://lore.kernel.org/202605281711.bSeZlErK-lkp@intel.com
[akpm@linux-foundation.org: fix CONFIG_SWAP=n build]
Link: https://lore.kernel.org/20260517-swap-table-p4-v5-10-88ae43e064c7@tencent.com
Signed-off-by: Kairui Song
Acked-by: Chris Li
Cc: Baolin Wang
Cc: Baoquan He
Cc: Barry Song
Cc: Chengming Zhou
Cc: David Hildenbrand
Cc: Hugh Dickins
Cc: Johannes Weiner
Cc: Kemeng Shi
Cc: Lorenzo Stoakes
Cc: Muchun Song
Cc: Nhat Pham
Cc: Roman Gushchin
Cc: Shakeel Butt
Cc: Youngjun Park
Cc: Zi Yan
Signed-off-by: Andrew Morton
---
 include/linux/memcontrol.h | 19 +++++++++++++------
 1 file changed, 13 insertions(+), 6 deletions(-)

(limited to 'include/linux/memcontrol.h')

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index a013f37f24aa..8f2662db166b 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -29,6 +29,7 @@ struct obj_cgroup;
 struct page;
 struct mm_struct;
 struct kmem_cache;
+struct swap_cluster_info;
 
 /* Cgroup-specific page state, on top of universal node page state */
 enum memcg_stat_item {
@@ -1899,9 +1900,6 @@ static inline void mem_cgroup_exit_user_fault(void)
 	current->in_user_fault = 0;
 }
 
-void __memcg1_swapout(struct folio *folio);
-void memcg1_swapin(struct folio *folio);
-
 #else /* CONFIG_MEMCG_V1 */
 static inline
 unsigned long memcg1_soft_limit_reclaim(pg_data_t *pgdat, int order,
@@ -1929,14 +1927,23 @@ static inline void mem_cgroup_exit_user_fault(void)
 {
 }
 
-static inline void __memcg1_swapout(struct folio *folio)
+#endif /* CONFIG_MEMCG_V1 */
+
+#if defined(CONFIG_MEMCG_V1) && defined(CONFIG_SWAP)
+
+void __memcg1_swapout(struct folio *folio, struct swap_cluster_info *ci);
+void memcg1_swapin(struct folio *folio);
+
+#else
+
+static inline void __memcg1_swapout(struct folio *folio,
+				    struct swap_cluster_info *ci)
 {
 }
 
 static inline void memcg1_swapin(struct folio *folio)
 {
 }
-
-#endif /* CONFIG_MEMCG_V1 */
+#endif
 
 #endif /* _LINUX_MEMCONTROL_H */
--
cgit v1.2.3

From fafaeceb89a5e2e856ff04c2cacb6cae4a2ecb67 Mon Sep 17 00:00:00 2001
From: Johannes Weiner
Date: Wed, 27 May 2026 16:45:16 -0400
Subject: mm: switch deferred split shrinker to list_lru

The deferred split queue handles cgroups in a suboptimal fashion. The
queue is per-NUMA node or per-cgroup, not the intersection. That means on
a cgrouped system, a node-restricted allocation entering reclaim can end
up splitting large pages on other nodes:

	alloc/unmap
	  deferred_split_folio()
	    list_add_tail(memcg->split_queue)
	    set_shrinker_bit(memcg, node, deferred_shrinker_id)

	for_each_zone_zonelist_nodemask(restricted_nodes)
	  mem_cgroup_iter()
	    shrink_slab(node, memcg)
	      shrink_slab_memcg(node, memcg)
	        if test_shrinker_bit(memcg, node, deferred_shrinker_id)
	          deferred_split_scan()
	            walks memcg->split_queue

The shrinker bit adds an imperfect guard rail. As soon as the cgroup has
a single large page on the node of interest, all large pages owned by that
memcg, including those on other nodes, will be split.

list_lru properly sets up per-node, per-cgroup lists. As a bonus, it
streamlines a lot of the list operations and reclaim walks. It's used
widely by other major shrinkers already. Convert the deferred split queue
as well.

The list_lru per-memcg heads are instantiated on demand when the first
object of interest is allocated for a cgroup, by calling
folio_memcg_alloc_deferred(). Add calls to where splittable pages are
created: anon faults, swapin faults, khugepaged collapse.

These calls create all possible node heads for the cgroup at once, so the
migration code (between nodes) doesn't need any special care.

[akpm@linux-foundation.org: fix build with CONFIG_TRANSPARENT_HUGEPAGE=n]
Link: https://lore.kernel.org/202605281620.lc3rtkBm-lkp@intel.com
[hannes@cmpxchg.org: fix cgroup.memory=nokmem handling]
Link: https://lore.kernel.org/ah9PGv12mqai84ES@cmpxchg.org
Link: https://lore.kernel.org/20260527204757.2544958-10-hannes@cmpxchg.org
Signed-off-by: Johannes Weiner
Reported-by: Mikhail Zaslonko
Tested-by: Mikhail Zaslonko
Acked-by: Shakeel Butt
Reviewed-by: Lorenzo Stoakes (Oracle)
Acked-by: Usama Arif
Reviewed-by: Kairui Song
Cc: Baolin Wang
Cc: Barry Song
Cc: Dave Chinner
Cc: David Hildenbrand (Arm)
Cc: Dev Jain
Cc: Lance Yang
Cc: Liam R. Howlett
Cc: Michal Hocko
Cc: Muchun Song
Cc: Nico Pache
Cc: Roman Gushchin
Cc: Ryan Roberts
Cc: Vasily Gorbik
Cc: Vlastimil Babka
Cc: Zi Yan
Cc: kernel test robot
Signed-off-by: Andrew Morton
---
 include/linux/memcontrol.h | 4 ----
 1 file changed, 4 deletions(-)

(limited to 'include/linux/memcontrol.h')

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 8f2662db166b..e1f46a0016fc 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -278,10 +278,6 @@ struct mem_cgroup {
 	struct memcg_cgwb_frn cgwb_frn[MEMCG_CGWB_FRN_CNT];
 #endif
 
-#ifdef CONFIG_TRANSPARENT_HUGEPAGE
-	struct deferred_split deferred_split_queue;
-#endif
-
 #ifdef CONFIG_LRU_GEN_WALKS_MMU
 	/* per-memcg mm_struct list */
 	struct lru_gen_mm_list mm_list;
--
cgit v1.2.3