aboutsummaryrefslogtreecommitdiff
path: root/kernel/cgroup/cpuset.c
AgeCommit message (Collapse)AuthorFilesLines
2026-01-15kernel: cgroup: Add SPDX-License-Identifier linesTim Bird1-4/+1
Add GPL-2.0 SPDX license id lines to a few old files, replacing the reference to the COPYING file. The COPYING file at the time of creation of these files (2007 and 2005) was GPL-v2.0, with an additional clause indicating that only v2 applied. Signed-off-by: Tim Bird <tim.bird@sony.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-12-18cpuset: fix warning when disabling remote partitionChen Ridong1-5/+16
A warning was triggered as follows: WARNING: kernel/cgroup/cpuset.c:1651 at remote_partition_disable+0xf7/0x110 RIP: 0010:remote_partition_disable+0xf7/0x110 RSP: 0018:ffffc90001947d88 EFLAGS: 00000206 RAX: 0000000000007fff RBX: ffff888103b6e000 RCX: 0000000000006f40 RDX: 0000000000006f00 RSI: ffffc90001947da8 RDI: ffff888103b6e000 RBP: ffff888103b6e000 R08: 0000000000000000 R09: 0000000000000000 R10: 0000000000000001 R11: ffff88810b2e2728 R12: ffffc90001947da8 R13: 0000000000000000 R14: ffffc90001947da8 R15: ffff8881081f1c00 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007f55c8bbe0b2 CR3: 000000010b14c000 CR4: 00000000000006f0 Call Trace: <TASK> update_prstate+0x2d3/0x580 cpuset_partition_write+0x94/0xf0 kernfs_fop_write_iter+0x147/0x200 vfs_write+0x35d/0x500 ksys_write+0x66/0xe0 do_syscall_64+0x6b/0x390 entry_SYSCALL_64_after_hwframe+0x4b/0x53 RIP: 0033:0x7f55c8cd4887 Reproduction steps (on a 16-CPU machine): # cd /sys/fs/cgroup/ # mkdir A1 # echo +cpuset > A1/cgroup.subtree_control # echo "0-14" > A1/cpuset.cpus.exclusive # mkdir A1/A2 # echo "0-14" > A1/A2/cpuset.cpus.exclusive # echo "root" > A1/A2/cpuset.cpus.partition # echo 0 > /sys/devices/system/cpu/cpu15/online # echo member > A1/A2/cpuset.cpus.partition When CPU 15 is offlined, subpartitions_cpus gets cleared because no CPUs remain available for the top_cpuset, forcing partitions to share CPUs with the top_cpuset. In this scenario, disabling the remote partition triggers a warning stating that effective_xcpus is not a subset of subpartitions_cpus. Partitions should be invalidated in this case to inform users that the partition is now invalid(cpus are shared with top_cpuset). To fix this issue: 1. Only emit the warning only if subpartitions_cpus is not empty and the effective_xcpus is not a subset of subpartitions_cpus. 2. During the CPU hotplug process, invalidate partitions if subpartitions_cpus is empty. Fixes: f62a5d39368e ("cgroup/cpuset: Remove remote_partition_check() & make update_cpumasks_hier() handle remote partition") Signed-off-by: Chen Ridong <chenridong@huawei.com> Reviewed-by: Waiman Long <longman@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-12-03Merge tag 'cgroup-for-6.19' of ↵Linus Torvalds1-136/+221
git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup Pull cgroup updates from Tejun Heo: - Defer task cgroup unlink until after the dying task's final context switch so that controllers see the cgroup properly populated until the task is truly gone - cpuset cleanups and simplifications. Enforce that domain isolated CPUs stay in root or isolated partitions and fail if isolated+nohz_full would leave no housekeeping CPU. Fix sched/deadline root domain handling during CPU hot-unplug and race for tasks in attaching cpusets - Misc fixes including memory reclaim protection documentation and selftest KTAP conformance * tag 'cgroup-for-6.19' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: (21 commits) cpuset: Treat cpusets in attaching as populated sched/deadline: Walk up cpuset hierarchy to decide root domain when hot-unplug cgroup/cpuset: Introduce cpuset_cpus_allowed_locked() docs: cgroup: No special handling of unpopulated memcgs docs: cgroup: Note about sibling relative reclaim protection docs: cgroup: Explain reclaim protection target selftests/cgroup: conform test to KTAP format output cpuset: remove need_rebuild_sched_domains cpuset: remove global remote_children list cpuset: simplify node setting on error cgroup: include missing header for struct irq_work cgroup: Fix sleeping from invalid context warning on PREEMPT_RT cgroup/cpuset: Globally track isolated_cpus update cgroup/cpuset: Ensure domain isolated CPUs stay in root or isolated partition cgroup/cpuset: Move up prstate_housekeeping_conflict() helper cgroup/cpuset: Fail if isolated and nohz_full don't leave any housekeeping cgroup/cpuset: Rename update_unbound_workqueue_cpumask() to update_isolation_cpumasks() cgroup: Defer task cgroup unlink until after the task is done switching out cgroup: Move dying_tasks cleanup from cgroup_task_release() to cgroup_task_free() cgroup: Rename cgroup lifecycle hooks to cgroup_task_*() ...
2025-12-02Merge tag 'timers-core-2025-11-30' of ↵Linus Torvalds1-6/+9
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull timer core updates from Thomas Gleixner: - Prevent a thundering herd problem when the timekeeper CPU is delayed and a large number of CPUs compete to acquire jiffies_lock to do the update. Limit it to one CPU with a separate "uncontended" atomic variable. - A set of improvements for the timer migration mechanism: - Support imbalanced NUMA trees correctly - Support dynamic exclusion of CPUs from the migrator duty to allow the cpuset/isolation mechanism to exclude them from handling timers of remote idle CPUs - The usual small updates, cleanups and enhancements * tag 'timers-core-2025-11-30' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: timers/migration: Exclude isolated cpus from hierarchy cpumask: Add initialiser to use cleanup helpers sched/isolation: Force housekeeping if isolcpus and nohz_full don't leave any cgroup/cpuset: Rename update_unbound_workqueue_cpumask() to update_isolation_cpumasks() timers/migration: Use scoped_guard on available flag set/clear timers/migration: Add mask for CPUs available in the hierarchy timers/migration: Rename 'online' bit to 'available' selftests/timers/nanosleep: Add tests for return of remaining time selftests/timers: Clean up kernel version check in posix_timers time: Fix a few typos in time[r] related code comments time: tick-oneshot: Add missing Return and parameter descriptions to kernel-doc hrtimer: Store time as ktime_t in restart block timers/migration: Remove dead code handling idle CPU checking for remote timers timers/migration: Remove unused "cpu" parameter from tmigr_get_group() timers/migration: Assert that hotplug preparing CPU is part of stable active hierarchy timers/migration: Fix imbalanced NUMA trees timers/migration: Remove locking on group connection timers/migration: Convert "while" loops to use "for" tick/sched: Limit non-timekeeper CPUs calling jiffies update
2025-11-20cpuset: Treat cpusets in attaching as populatedChen Ridong1-8/+27
Currently, the check for whether a partition is populated does not account for tasks in the cpuset of attaching. This is a corner case that can leave a task stuck in a partition with no effective CPUs. The race condition occurs as follows: cpu0 cpu1 //cpuset A with cpu N migrate task p to A cpuset_can_attach // with effective cpus // check ok // cpuset_mutex is not held // clear cpuset.cpus.exclusive // making effective cpus empty update_exclusive_cpumask // tasks_nocpu_error check ok // empty effective cpus, partition valid cpuset_attach ... // task p stays in A, with non-effective cpus. To fix this issue, this patch introduces cs_is_populated, which considers tasks in the attaching cpuset. This new helper is used in validate_change and partition_is_populated. Fixes: e2d59900d936 ("cgroup/cpuset: Allow no-task partition to have empty cpuset.cpus.effective") Signed-off-by: Chen Ridong <chenridong@huawei.com> Reviewed-by: Waiman Long <longman@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-11-20timers/migration: Exclude isolated cpus from hierarchyGabriele Monaco1-0/+3
The timer migration mechanism allows active CPUs to pull timers from idle ones to improve the overall idle time. This is however undesired when CPU intensive workloads run on isolated cores, as the algorithm would move the timers from housekeeping to isolated cores, negatively affecting the isolation. Exclude isolated cores from the timer migration algorithm, extend the concept of unavailable cores, currently used for offline ones, to isolated ones: * A core is unavailable if isolated or offline; * A core is available if non isolated and online; A core is considered unavailable as isolated if it belongs to: * the isolcpus (domain) list * an isolated cpuset Except if it is: * in the nohz_full list (already idle for the hierarchy) * the nohz timekeeper core (must be available to handle global timers) CPUs are added to the hierarchy during late boot, excluding isolated ones, the hierarchy is also adapted when the cpuset isolation changes. Due to how the timer migration algorithm works, any CPU part of the hierarchy can have their global timers pulled by remote CPUs and have to pull remote timers, only skipping pulling remote timers would break the logic. For this reason, prevent isolated CPUs from pulling remote global timers, but also the other way around: any global timer started on an isolated CPU will run there. This does not break the concept of isolation (global timers don't come from outside the CPU) and, if considered inappropriate, can usually be mitigated with other isolation techniques (e.g. IRQ pinning). This effect was noticed on a 128 cores machine running oslat on the isolated cores (1-31,33-63,65-95,97-127). The tool monopolises CPUs, and the CPU with lowest count in a timer migration hierarchy (here 1 and 65) appears as always active and continuously pulls global timers, from the housekeeping CPUs. This ends up moving driver work (e.g. delayed work) to isolated CPUs and causes latency spikes: before the change: # oslat -c 1-31,33-63,65-95,97-127 -D 62s ... Maximum: 1203 10 3 4 ... 5 (us) after the change: # oslat -c 1-31,33-63,65-95,97-127 -D 62s ... Maximum: 10 4 3 4 3 ... 5 (us) The same behaviour was observed on a machine with as few as 20 cores / 40 threads with isocpus set to: 1-9,11-39 with rtla-osnoise-top. Signed-off-by: Gabriele Monaco <gmonaco@redhat.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Tested-by: John B. Wyatt IV <jwyatt@redhat.com> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Frederic Weisbecker <frederic@kernel.org> Link: https://patch.msgid.link/20251120145653.296659-8-gmonaco@redhat.com
2025-11-20cgroup/cpuset: Rename update_unbound_workqueue_cpumask() to ↵Gabriele Monaco1-6/+6
update_isolation_cpumasks() update_unbound_workqueue_cpumask() updates unbound workqueues settings when there's a change in isolated CPUs, but it can be used for other subsystems requiring updated when isolated CPUs change. Generalise the name to update_isolation_cpumasks() to prepare for other functions unrelated to workqueues to be called in that spot. [longman: Change the function name to update_isolation_cpumasks()] Signed-off-by: Gabriele Monaco <gmonaco@redhat.com> Signed-off-by: Waiman Long <longman@redhat.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Chen Ridong <chenridong@huaweicloud.com> Acked-by: Frederic Weisbecker <frederic@kernel.org> Acked-by: Waiman Long <longman@redhat.com> Link: https://patch.msgid.link/20251120145653.296659-5-gmonaco@redhat.com
2025-11-20cgroup/cpuset: Introduce cpuset_cpus_allowed_locked()Pingfan Liu1-15/+36
cpuset_cpus_allowed() uses a reader lock that is sleepable under RT, which means it cannot be called inside raw_spin_lock_t context. Introduce a new cpuset_cpus_allowed_locked() helper that performs the same function as cpuset_cpus_allowed() except that the caller must have acquired the cpuset_mutex so that no further locking will be needed. Suggested-by: Waiman Long <longman@redhat.com> Signed-off-by: Pingfan Liu <piliu@redhat.com> Cc: Waiman Long <longman@redhat.com> Cc: Tejun Heo <tj@kernel.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Koutný <mkoutny@suse.com> Cc: linux-kernel@vger.kernel.org To: cgroups@vger.kernel.org Reviewed-by: Chen Ridong <chenridong@huawei.com> Reviewed-by: Waiman Long <longman@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-11-11cpuset: remove need_rebuild_sched_domainsChen Ridong1-5/+1
Previously, update_cpumasks_hier() used need_rebuild_sched_domains to decide whether to invoke rebuild_sched_domains_locked(). Now that rebuild_sched_domains_locked() only sets force_rebuild, the flag is redundant. Hence, remove it. Signed-off-by: Chen Ridong <chenridong@huawei.com> Reviewed-by: Waiman Long <longman@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-11-11cpuset: remove global remote_children listChen Ridong1-9/+4
The remote_children list is used to track all remote partitions attached to a cpuset. However, it serves no other purpose. Using a boolean flag to indicate whether a cpuset is a remote partition is a more direct approach, making remote_children unnecessary. This patch replaces the list with a remote_partition flag in the cpuset structure and removes remote_children entirely. Signed-off-by: Chen Ridong <chenridong@huawei.com> Reviewed-by: Waiman Long <longman@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-11-11cpuset: simplify node setting on errorChen Ridong1-12/+9
There is no need to jump to the 'done' label upon failure, as no cleanup is required. Return the error code directly instead. Signed-off-by: Chen Ridong <chenridong@huawei.com> Reviewed-by: Waiman Long <longman@redhat.com> Reviewed-by: Michal Koutný <mkoutny@suse.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-11-05cgroup/cpuset: Globally track isolated_cpus updateWaiman Long1-38/+35
The current cpuset code passes a local isolcpus_updated flag around in a number of functions to determine if external isolation related cpumasks like wq_unbound_cpumask should be updated. It is a bit cumbersome and makes the code more complex. Simplify the code by using a global boolean flag "isolated_cpus_updating" to track this. This flag will be set in isolated_cpus_update() and cleared in update_isolation_cpumasks(). No functional change is expected. Signed-off-by: Waiman Long <longman@redhat.com> Reviewed-by: Chen Ridong <chenridong@huawei.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-11-05cgroup/cpuset: Ensure domain isolated CPUs stay in root or isolated partitionWaiman Long1-4/+6
Commit 4a74e418881f ("cgroup/cpuset: Check partition conflict with housekeeping setup") is supposed to ensure that domain isolated CPUs designated by the "isolcpus" boot command line option stay either in root partition or in isolated partitions. However, the required check wasn't implemented when a remote partition was created or when an existing partition changed type from "root" to "isolated". Even though this is a relatively minor issue, we still need to add the required prstate_housekeeping_conflict() call in the right places to ensure that the rule is strictly followed. The following steps can be used to reproduce the problem before this fix. # fmt -1 /proc/cmdline | grep isolcpus isolcpus=9 # cd /sys/fs/cgroup/ # echo +cpuset > cgroup.subtree_control # mkdir test # echo 9 > test/cpuset.cpus # echo isolated > test/cpuset.cpus.partition # cat test/cpuset.cpus.partition isolated # cat test/cpuset.cpus.effective 9 # echo root > test/cpuset.cpus.partition # cat test/cpuset.cpus.effective 9 # cat test/cpuset.cpus.partition root With this fix, the last few steps will become: # echo root > test/cpuset.cpus.partition # cat test/cpuset.cpus.effective 0-8,10-95 # cat test/cpuset.cpus.partition root invalid (partition config conflicts with housekeeping setup) Reported-by: Chen Ridong <chenridong@huawei.com> Signed-off-by: Waiman Long <longman@redhat.com> Reviewed-by: Chen Ridong <chenridong@huawei.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-11-05cgroup/cpuset: Move up prstate_housekeeping_conflict() helperWaiman Long1-20/+20
Move up the prstate_housekeeping_conflict() helper so that it can be used in remote partition code. Signed-off-by: Waiman Long <longman@redhat.com> Reviewed-by: Chen Ridong <chenridong@huawei.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-11-05cgroup/cpuset: Fail if isolated and nohz_full don't leave any housekeepingWaiman Long1-1/+73
Currently the user can set up isolated cpus via cpuset and nohz_full in such a way that leaves no housekeeping CPU (i.e. no CPU that is neither domain isolated nor nohz full). This can be a problem for other subsystems (e.g. the timer wheel imgration). Prevent this configuration by blocking any assignation that would cause the union of domain isolated cpus and nohz_full to covers all CPUs. [longman: Remove isolated_cpus_should_update() and rewrite the checking in update_prstate() and update_parent_effective_cpumask()] Originally-by: Gabriele Monaco <gmonaco@redhat.com> Signed-off-by: Waiman Long <longman@redhat.com> Reviewed-by: Chen Ridong <chenridong@huawei.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-11-05cgroup/cpuset: Rename update_unbound_workqueue_cpumask() to ↵Gabriele Monaco1-6/+6
update_isolation_cpumasks() update_unbound_workqueue_cpumask() updates unbound workqueues settings when there's a change in isolated CPUs, but it can be used for other subsystems requiring updated when isolated CPUs change. Generalise the name to update_isolation_cpumasks() to prepare for other functions unrelated to workqueues to be called in that spot. [longman: Change the function name to update_isolation_cpumasks()] Acked-by: Frederic Weisbecker <frederic@kernel.org> Acked-by: Waiman Long <longman@redhat.com> Signed-off-by: Gabriele Monaco <gmonaco@redhat.com> Signed-off-by: Waiman Long <longman@redhat.com> Reviewed-by: Chen Ridong <chenridong@huaweicloud.com> Reviewed-by: Chen Ridong <chenridong@huawei.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-10-20cgroup/cpuset: Don't track # of local child partitionsWaiman Long1-28/+13
The cpuset structure has a nr_subparts field which tracks the number of child local partitions underneath a particular cpuset. Right now, nr_subparts is only used in partition_is_populated() to avoid iteration of child cpusets if the condition is right. So by always performing the child iteration, we can avoid tracking the number of child partitions and simplify the code a bit. Signed-off-by: Waiman Long <longman@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-10-16sched: Rename do_set_cpus_allowed()Peter Zijlstra1-1/+1
Hopefully saner naming. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Juri Lelli <juri.lelli@redhat.com> Acked-by: Tejun Heo <tj@kernel.org> Acked-by: Vincent Guittot <vincent.guittot@linaro.org>
2025-09-22cpuset: remove is_prs_invalid helperChen Ridong1-8/+3
The is_prs_invalid helper function is redundant as it serves a similar purpose to is_partition_invalid. It can be fully replaced by the existing is_partition_invalid function, so this patch removes the is_prs_invalid helper. Signed-off-by: Chen Ridong <chenridong@huawei.com> Acked-by: Waiman Long <longman@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-09-22cpuset: remove impossible warning in update_parent_effective_cpumaskChen Ridong1-1/+0
If the parent is not a valid partition, an error will be returned before any partition update command is processed. This means the WARN_ON_ONCE(!is_partition_valid(parent)) can never be triggered, so it is safe to remove. Signed-off-by: Chen Ridong <chenridong@huawei.com> Acked-by: Waiman Long <longman@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-09-22cpuset: remove redundant special case for null input in node mask updateChen Ridong1-14/+8
The nodelist_parse function already handles empty nodemask input appropriately, making it unnecessary to handle this case separately during the node mask update process. Signed-off-by: Chen Ridong <chenridong@huawei.com> Reviewed-by: Waiman Long <longman@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-09-19cpuset: fix missing error return in update_cpumaskChen Ridong1-1/+2
The commit c6366739804f ("cpuset: refactor cpus_allowed_validate_change") inadvertently removed the error return when cpus_allowed_validate_change() fails. This patch restores the proper error handling by returning retval when the validation check fails. Fixes: c6366739804f ("cpuset: refactor cpus_allowed_validate_change") Signed-off-by: Chen Ridong <chenridong@huawei.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-09-19cpuset: Use new excpus for nocpu error check when enabling root partitionChen Ridong1-5/+1
A previous patch fixed a bug where new_prs should be assigned before checking housekeeping conflicts. This patch addresses another potential issue: the nocpu error check currently uses the xcpus which is not updated. Although no issue has been observed so far, the check should be performed using the new effective exclusive cpus. The comment has been removed because the function returns an error if nocpu checking fails, which is unrelated to the parent. Signed-off-by: Chen Ridong <chenridong@huawei.com> Reviewed-by: Waiman Long <longman@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-09-19cpuset: fix failure to enable isolated partition when containing isolcpusChen Ridong1-1/+1
The 'isolcpus' parameter specified at boot time can be assigned to an isolated partition. While it is valid put the 'isolcpus' in an isolated partition, attempting to change a member cpuset to an isolated partition will fail if the cpuset contains any 'isolcpus'. For example, the system boots with 'isolcpus=9', and the following configuration works correctly: # cd /sys/fs/cgroup/ # mkdir test # echo 1 > test/cpuset.cpus # echo isolated > test/cpuset.cpus.partition # cat test/cpuset.cpus.partition isolated # echo 9 > test/cpuset.cpus # cat test/cpuset.cpus.partition isolated # cat test/cpuset.cpus 9 However, the following steps to convert a member cpuset to an isolated partition will fail: # cd /sys/fs/cgroup/ # mkdir test # echo 9 > test/cpuset.cpus # echo isolated > test/cpuset.cpus.partition # cat test/cpuset.cpus.partition isolated invalid (partition config conflicts with housekeeping setup) The issue occurs because the new partition state (new_prs) is used for validation against housekeeping constraints before it has been properly updated. To resolve this, move the assignment of new_prs before the housekeeping validation check when enabling a root partition. Fixes: 4a74e418881f ("cgroup/cpuset: Check partition conflict with housekeeping setup") Signed-off-by: Chen Ridong <chenridong@huawei.com> Reviewed-by: Waiman Long <longman@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-09-17cpuset: use partition_cpus_change for setting exclusive cpusChen Ridong1-27/+2
A previous patch has introduced a new helper function partition_cpus_change(). Now replace the exclusive cpus setting logic with this helper function. Signed-off-by: Chen Ridong <chenridong@huawei.com> Reviewed-by: Waiman Long <longman@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-09-17cpuset: use parse_cpulist for setting cpus.exclusiveChen Ridong1-16/+9
Previous patches made parse_cpulist handle empty cpu mask input. Now use this helper for exclusive cpus setting. Also, compute_trialcs_xcpus can be called with empty cpus and handles it correctly. Signed-off-by: Chen Ridong <chenridong@huawei.com> Reviewed-by: Waiman Long <longman@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-09-17cpuset: introduce partition_cpus_changeChen Ridong1-26/+38
Introduce the partition_cpus_change function to handle both regular CPU set updates and exclusive CPU modifications, either of which may trigger partition state changes. This generalized function will also be utilized for exclusive CPU updates in subsequent patches. With the introduction of compute_trialcs_excpus in a previous patch, the trialcs->effective_xcpus field is now consistently computed and maintained. Consequently, the legacy logic which assigned **trialcs->allowed_cpus to a local 'xcpus' variable** when trialcs->effective_xcpus was empty has been removed. This removal is safe because when trialcs is not a partition member, trialcs->effective_xcpus is now correctly populated with the intersection of user_xcpus and the parent's effective_xcpus. This calculation inherently covers the scenario previously handled by the removed code. Signed-off-by: Chen Ridong <chenridong@huawei.com> Reviewed-by: Waiman Long <longman@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-09-17cpuset: refactor cpus_allowed_validate_changeChen Ridong1-39/+45
Refactor cpus_allowed_validate_change to handle the special case where cpuset.cpus can be set even when violating partition sibling CPU exclusivity rules. This differs from the general validation logic in validate_change. Add a wrapper function to properly handle this exceptional case. The trialcs->prs_err field is cleared before performing validation checks for both CPU changes and partition errors. If cpus_allowed_validate_change fails its validation, trialcs->prs_err is set to PERR_NOTEXCL. If partition validation fails, the specific error code returned by validate_partition is assigned to trialcs->prs_err. With the partition validation status now directly available through trialcs->prs_err, the local boolean variable 'invalidate' becomes redundant and can be safely removed. Signed-off-by: Chen Ridong <chenridong@huawei.com> Reviewed-by: Waiman Long <longman@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-09-17cpuset: refactor out validate_partitionChen Ridong1-12/+36
Refactor the validate_partition function to handle cpuset partition validation when modifying cpuset.cpus. This refactoring also makes the function reusable for handling cpuset.cpus.exclusive updates in subsequent patches. Signed-off-by: Chen Ridong <chenridong@huawei.com> Reviewed-by: Waiman Long <longman@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-09-17cpuset: introduce cpus_excl_conflict and mems_excl_conflict helpersChen Ridong1-30/+44
This patch adds cpus_excl_conflict() and mems_excl_conflict() helper functions to improve code readability and maintainability. The exclusive conflict checking follows these rules: 1. If either cpuset has the 'exclusive' flag set, their user_xcpus must not have any overlap. 2. If neither cpuset has the 'exclusive' flag set, their 'cpuset.cpus.exclusive' (only for v2) values must not intersect. 3. The 'cpuset.cpus' of one cpuset must not form a subset of another cpuset's 'cpuset.cpus.exclusive'. Signed-off-by: Chen Ridong <chenridong@huawei.com> Reviewed-by: Waiman Long <longman@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-09-17cpuset: refactor CPU mask buffer parsing logicChen Ridong1-29/+30
The current implementation contains redundant handling for empty mask inputs, as cpulist_parse() already properly handles these cases. This refactoring introduces a new helper function parse_cpuset_cpulist() to consolidate CPU list parsing logic and eliminate special-case checks for empty inputs. Additionally, the effective_xcpus computation for trial cpusets has been simplified. Rather than computing effective_xcpus only when exclusive_cpus is set or when the cpuset forms a valid partition, we now recalculate it on every cpuset.cpus update. This approach ensures consistency and allows removal of redundant effective_xcpus logic in subsequent patches. The trial cpuset's effective_xcpus calculation follows two distinct cases: 1. For member cpusets: effective_xcpus is determined by the intersection of cpuset->exclusive_cpus and the parent's effective_xcpus. 2. For non-member cpusets: effective_xcpus is derived from the intersection of user_xcpus and the parent's effective_xcpus. Signed-off-by: Chen Ridong <chenridong@huawei.com> Reviewed-by: Waiman Long <longman@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-09-17cpuset: Refactor exclusive CPU mask computation logicChen Ridong1-38/+65
The current compute_effective_exclusive_cpumask function handles multiple scenarios with different input parameters, making the code difficult to follow. This patch refactors it into two separate functions: compute_excpus and compute_trialcs_excpus. The compute_excpus function calculates the exclusive CPU mask for a given input and excludes exclusive CPUs from sibling cpusets when cs's exclusive_cpus is not explicitly set. The compute_trialcs_excpus function specifically handles exclusive CPU computation for trial cpusets used during CPU mask configuration updates, and always excludes exclusive CPUs from sibling cpusets. This refactoring significantly improves code readability and clarity, making it explicit which function to call for each use case and what parameters should be provided. Signed-off-by: Chen Ridong <chenridong@huawei.com> Reviewed-by: Waiman Long <longman@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-09-17cpuset: change return type of is_partition_[in]valid to boolChen Ridong1-2/+2
The functions is_partition_valid() and is_partition_invalid() logically return boolean values, but were previously declared with return type 'int'. This patch changes their return type to 'bool' to better reflect their semantic meaning and improve type safety. Signed-off-by: Chen Ridong <chenridong@huawei.com> Reviewed-by: Waiman Long <longman@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-09-17cpuset: remove unused assignment to trialcs->partition_root_stateChen Ridong1-2/+0
The trialcs->partition_root_state field is not used during the configuration of 'cpuset.cpus' or 'cpuset.cpus.exclusive'. Therefore, the assignment of values to this field can be safely removed. Signed-off-by: Chen Ridong <chenridong@huawei.com> Reviewed-by: Waiman Long <longman@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-09-17cpuset: move the root cpuset write check earlierChen Ridong1-13/+4
The 'cpus' or 'mems' lists of the top_cpuset cannot be modified. This check can be moved before acquiring any locks as a common code block to improve efficiency and maintainability. Signed-off-by: Chen Ridong <chenridong@huawei.com> Reviewed-by: Waiman Long <longman@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-09-16cgroup/cpuset: Remove redundant rcu_read_lock/unlock() in spin_lockpengdonglin1-6/+0
Since commit a8bb74acd8efe ("rcu: Consolidate RCU-sched update-side function definitions") there is no difference between rcu_read_lock(), rcu_read_lock_bh() and rcu_read_lock_sched() in terms of RCU read section and the relevant grace period. That means that spin_lock(), which implies rcu_read_lock_sched(), also implies rcu_read_lock(). There is no need no explicitly start a RCU read section if one has already been started implicitly by spin_lock(). Simplify the code and remove the inner rcu_read_lock() invocation. Cc: Waiman Long <longman@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Waiman Long <longman@redhat.com> Signed-off-by: pengdonglin <pengdonglin@xiaomi.com> Signed-off-by: pengdonglin <dolinux.peng@gmail.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-09-04cpuset: Defer flushing of the cpuset_migrate_mm_wq to task_workChuyi Zhou1-5/+24
Now in cpuset_attach(), we need to synchronously wait for flush_workqueue to complete. The execution time of flushing cpuset_migrate_mm_wq depends on the amount of mm migration initiated by cpusets at that time. When the cpuset.mems of a cgroup occupying a large amount of memory is modified, it may trigger extensive mm migration, causing cpuset_attach() to block on flush_workqueue for an extended period. This could be dangerous because cpuset_attach() is within the critical section of cgroup_mutex, which may ultimately cause all cgroup-related operations in the system to be blocked. This patch attempts to defer the flush_workqueue() operation until returning to userspace using the task_work which is originally proposed by tejun[1], so that flush happens after cgroup_mutex is dropped. That way we maintain the operation synchronicity while avoiding bothering anyone else. [1]: https://lore.kernel.org/cgroups/ZgMFPMjZRZCsq9Q-@slm.duckdns.org/T/#m117f606fa24f66f0823a60f211b36f24bd9e1883 Originally-by: Tejun Heo <tj@kernel.org> Signed-off-by: Chuyi Zhou <zhouchuyi@bytedance.com> Reviewed-by: Waiman Long <longman@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-09-04cpuset: Don't always flush cpuset_migrate_mm_wq in cpuset_write_resmaskChuyi Zhou1-1/+2
It is unnecessary to always wait for the flush operation of cpuset_migrate_mm_wq to complete in cpuset_write_resmask, as modifying cpuset.cpus or cpuset.exclusive does not trigger mm migrations. The flush_workqueue can be executed only when cpuset.mems is modified. Signed-off-by: Chuyi Zhou <zhouchuyi@bytedance.com> Reviewed-by: Michal Koutný <mkoutny@suse.com> Reviewed-by: Waiman Long <longman@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-09-03cgroup/cpuset: Prevent NULL pointer access in free_tmpmasks()Waiman Long1-0/+3
Commit 5806b3d05165 ("cpuset: decouple tmpmasks and cpumasks freeing in cgroup") separates out the freeing of tmpmasks into a new free_tmpmask() helper but removes the NULL pointer check in the process. Unfortunately a NULL pointer can be passed to free_tmpmasks() in cpuset_handle_hotplug() if cpuset v1 is active. This can cause segmentation fault and crash the kernel. Fix that by adding the NULL pointer check to free_tmpmasks(). Fixes: 5806b3d05165 ("cpuset: decouple tmpmasks and cpumasks freeing in cgroup") Reported-by: Ashay Jaiswal <quic_ashayj@quicinc.com> Closes: https://lore.kernel.org/lkml/20250902-cpuset-free-on-condition-v1-1-f46ffab53eac@quicinc.com/ Signed-off-by: Waiman Long <longman@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-08-25cpuset: add helpers for cpus read and cpuset_mutex locksChen Ridong1-26/+34
cpuset: add helpers for cpus_read_lock and cpuset_mutex locks. Replace repetitive locking patterns with new helpers: - cpuset_full_lock() - cpuset_full_unlock() This makes the code cleaner and ensures consistent lock ordering. Signed-off-by: Chen Ridong <chenridong@huawei.com> Reviewed-by: Waiman Long <longman@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-08-25cpuset: separate tmpmasks and cpuset allocation logicChen Ridong1-58/+69
The original alloc_cpumasks() served dual purposes: allocating cpumasks for both temporary masks (tmpmasks) and cpuset structures. This patch: 1. Decouples these allocation paths for better code clarity 2. Introduces dedicated alloc_tmpmasks() and dup_or_alloc_cpuset() functions 3. Maintains symmetric pairing: - alloc_tmpmasks() ↔ free_tmpmasks() - dup_or_alloc_cpuset() ↔ free_cpuset() Signed-off-by: Chen Ridong <chenridong@huawei.com> Reviewed-by: Waiman Long <longman@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-08-25cpuset: decouple tmpmasks and cpumasks freeing in cgroupChen Ridong1-19/+13
Currently, free_cpumasks() can free both tmpmasks and cpumasks of a cpuset (cs). However, these two operations are not logically coupled. To improve code clarity: 1. Move cpumask freeing to free_cpuset() 2. Rename free_cpumasks() to free_tmpmasks() This change enforces the single responsibility principle. Signed-off-by: Chen Ridong <chenridong@huawei.com> Reviewed-by: Waiman Long <longman@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-08-13cpuset: remove redundant CS_ONLINE flagChen Ridong1-3/+1
The CS_ONLINE flag was introduced prior to the CSS_ONLINE flag in the cpuset subsystem. Currently, the flag setting sequence is as follows: 1. cpuset_css_online() sets CS_ONLINE 2. css->flags gets CSS_ONLINE set ... 3. cgroup->kill_css sets CSS_DYING 4. cpuset_css_offline() clears CS_ONLINE 5. css->flags clears CSS_ONLINE The is_cpuset_online() check currently occurs between steps 1 and 3. However, it would be equally safe to perform this check between steps 2 and 3, as CSS_ONLINE provides the same synchronization guarantee as CS_ONLINE. Since CS_ONLINE is redundant with CSS_ONLINE and provides no additional synchronization benefits, we can safely remove it to simplify the code. Signed-off-by: Chen Ridong <chenridong@huawei.com> Acked-by: Waiman Long <longman@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-08-09cgroup/cpuset: Remove the unnecessary css_get/put() in cpuset_partition_write()Waiman Long1-2/+0
The css_get/put() calls in cpuset_partition_write() are unnecessary as an active reference of the kernfs node will be taken which will prevent its removal and guarantee the existence of the css. Only the online check is needed. Signed-off-by: Waiman Long <longman@redhat.com> Reviewed-by: Michal Koutný <mkoutny@suse.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-08-09cgroup/cpuset: Fix a partition error with CPU hotplugWaiman Long1-3/+4
It was found during testing that an invalid leaf partition with an empty effective exclusive CPU list can become a valid empty partition with no CPU afer an offline/online operation of an unrelated CPU. An empty partition root is allowed in the special case that it has no task in its cgroup and has distributed out all its CPUs to its child partitions. That is certainly not the case here. The problem is in the cpumask_subsets() test in the hotplug case (update with no new mask) of update_parent_effective_cpumask() as it also returns true if the effective exclusive CPU list is empty. Fix that by addding the cpumask_empty() test to root out this exception case. Also add the cpumask_empty() test in cpuset_hotplug_update_tasks() to avoid calling update_parent_effective_cpumask() for this special case. Fixes: 0c7f293efc87 ("cgroup/cpuset: Add cpuset.cpus.exclusive.effective for v2") Signed-off-by: Waiman Long <longman@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-08-09cgroup/cpuset: Use static_branch_enable_cpuslocked() on ↵Waiman Long1-1/+1
cpusets_insane_config_key The following lockdep splat was observed. [ 812.359086] ============================================ [ 812.359089] WARNING: possible recursive locking detected [ 812.359097] -------------------------------------------- [ 812.359100] runtest.sh/30042 is trying to acquire lock: [ 812.359105] ffffffffa7f27420 (cpu_hotplug_lock){++++}-{0:0}, at: static_key_enable+0xe/0x20 [ 812.359131] [ 812.359131] but task is already holding lock: [ 812.359134] ffffffffa7f27420 (cpu_hotplug_lock){++++}-{0:0}, at: cpuset_write_resmask+0x98/0xa70 : [ 812.359267] Call Trace: [ 812.359272] <TASK> [ 812.359367] cpus_read_lock+0x3c/0xe0 [ 812.359382] static_key_enable+0xe/0x20 [ 812.359389] check_insane_mems_config.part.0+0x11/0x30 [ 812.359398] cpuset_write_resmask+0x9f2/0xa70 [ 812.359411] cgroup_file_write+0x1c7/0x660 [ 812.359467] kernfs_fop_write_iter+0x358/0x530 [ 812.359479] vfs_write+0xabe/0x1250 [ 812.359529] ksys_write+0xf9/0x1d0 [ 812.359558] do_syscall_64+0x5f/0xe0 Since commit d74b27d63a8b ("cgroup/cpuset: Change cpuset_rwsem and hotplug lock order"), the ordering of cpu hotplug lock and cpuset_mutex had been reversed. That patch correctly used the cpuslocked version of the static branch API to enable cpusets_pre_enable_key and cpusets_enabled_key, but it didn't do the same for cpusets_insane_config_key. The cpusets_insane_config_key can be enabled in the check_insane_mems_config() which is called from update_nodemask() or cpuset_hotplug_update_tasks() with both cpu hotplug lock and cpuset_mutex held. Deadlock can happen with a pending hotplug event that tries to acquire the cpu hotplug write lock which will block further cpus_read_lock() attempt from check_insane_mems_config(). Fix that by switching to use static_branch_enable_cpuslocked(). Fixes: d74b27d63a8b ("cgroup/cpuset: Change cpuset_rwsem and hotplug lock order") Signed-off-by: Waiman Long <longman@redhat.com> Reviewed-by: Juri Lelli <juri.lelli@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-07-13kernel,cpuset: use node-notifier instead of memory-notifierOscar Salvador1-1/+1
cpuset is only concerned when a numa node changes its memory state, as it needs to know the current numa nodes with memory to keep an updated mems_allowed mask. So stop using the memory notifier and use the new numa node notifer instead. Link: https://lkml.kernel.org/r/20250616135158.450136-9-osalvador@suse.de Signed-off-by: Oscar Salvador <osalvador@suse.de> Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com> Reviewed-by: Harry Yoo <harry.yoo@oracle.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: David Hildenbrand <david@redhat.com> Cc: Hyeonggon Yoo <42.hyeyoo@gmail.com> Cc: Rakie Kim <rakie.kim@sk.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-05-31Merge tag 'mm-stable-2025-05-31-14-50' of ↵Linus Torvalds1-2/+38
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull MM updates from Andrew Morton: - "Add folio_mk_pte()" from Matthew Wilcox simplifies the act of creating a pte which addresses the first page in a folio and reduces the amount of plumbing which architecture must implement to provide this. - "Misc folio patches for 6.16" from Matthew Wilcox is a shower of largely unrelated folio infrastructure changes which clean things up and better prepare us for future work. - "memory,x86,acpi: hotplug memory alignment advisement" from Gregory Price adds early-init code to prevent x86 from leaving physical memory unused when physical address regions are not aligned to memory block size. - "mm/compaction: allow more aggressive proactive compaction" from Michal