path: root/kernel/sched
Age  Commit message  Author  Files  Lines
2026-03-06  sched_ext: Separate bypass dispatch enabling from bypass depth tracking  Tejun Heo  2  -10/+69
The bypass_depth field tracks nesting of bypass operations but is also used to determine whether the bypass dispatch path should be active. With hierarchical scheduling, child schedulers may need to activate their parent's bypass dispatch path without affecting the parent's bypass_depth, requiring separation of these concerns. Add bypass_dsp_enable_depth and bypass_dsp_claim to independently control bypass dispatch path activation. The new enable_bypass_dsp() and disable_bypass_dsp() functions manage this state with proper claim semantics to prevent races. The bypass dispatch path now only activates when bypass_dsp_enabled() returns true, which checks the new enable_depth counter. The disable operation is carefully ordered after all tasks are moved out of bypass DSQs to ensure they are drained before the dispatch path is disabled. During scheduler teardown, disable_bypass_dsp() is called explicitly to ensure cleanup even if bypass mode was never entered normally. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Andrea Righi <arighi@nvidia.com>
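Illustrative sketch (not the kernel code): the separation described above can be modeled with two independent counters. The struct and the model_* helpers are hypothetical stand-ins; the claim handling behind bypass_dsp_claim is not described in enough detail here to reproduce, so it is left out.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the relevant fields of struct scx_sched. */
struct model_sched {
	int bypass_depth;            /* nesting of bypass operations */
	int bypass_dsp_enable_depth; /* users that need the bypass dispatch path */
};

/* The dispatch path is active purely based on the enable counter. */
static bool model_bypass_dsp_enabled(const struct model_sched *sch)
{
	return sch->bypass_dsp_enable_depth > 0;
}

static void model_enable_bypass_dsp(struct model_sched *sch)
{
	sch->bypass_dsp_enable_depth++;
}

static void model_disable_bypass_dsp(struct model_sched *sch)
{
	/* Tolerate a disable without a prior enable, e.g. during teardown. */
	if (sch->bypass_dsp_enable_depth > 0)
		sch->bypass_dsp_enable_depth--;
}
```

With this split, a child can turn on its parent's bypass dispatch path while the parent's bypass_depth stays at zero.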
2026-03-06  sched_ext: When calling ops.dispatch() @prev must be on the same scx_sched  Tejun Heo  1  -2/+3
The @prev parameter passed into ops.dispatch() is expected to be on the same sched. Passing in @prev which isn't on the sched can spuriously trigger failures that can kill the scheduler. Pass in @prev iff it's on the same sched. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Andrea Righi <arighi@nvidia.com>
2026-03-06  sched_ext: Factor out scx_dispatch_sched()  Tejun Heo  1  -58/+65
In preparation of multiple scheduler support, factor out scx_dispatch_sched() from balance_one(). The function boundary makes remembering $prev_on_scx and $prev_on_rq less useful. Open code $prev_on_scx in balance_one() and $prev_on_rq in both balance_one() and scx_dispatch_sched(). No functional changes. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Andrea Righi <arighi@nvidia.com>
2026-03-06  sched_ext: Prepare bypass mode for hierarchical operation  Tejun Heo  1  -22/+63
Bypass mode is used to simplify enable and disable paths and guarantee forward progress when something goes wrong. When enabled, all tasks skip BPF scheduling and fall back to simple in-kernel FIFO scheduling. While this global behavior can be used as-is when dealing with sub-scheds, that would allow any sub-sched instance to affect the whole system in a significantly disruptive manner. Make bypass state hierarchical by propagating it to descendants and updating per-cpu flags accordingly. This allows an scx_sched to bypass if itself or any of its ancestors are in bypass mode. However, this doesn't make the actual bypass enqueue and dispatch paths hierarchical yet. That will be done in later patches. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Andrea Righi <arighi@nvidia.com>
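Illustrative sketch (not the kernel code): the rule "an scx_sched bypasses if it or any of its ancestors is in bypass mode" can be expressed as a walk up the hierarchy. The patch instead propagates state down to descendants and caches it in per-cpu flags; the walk below only computes the same predicate. struct model_node and model_bypassing() are hypothetical names.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical scheduler node in the hierarchy. */
struct model_node {
	struct model_node *parent; /* NULL for the root scheduler */
	int bypass_depth;          /* > 0 while this node itself is bypassed */
};

/* True if @sch or any of its ancestors is in bypass mode. */
static bool model_bypassing(const struct model_node *sch)
{
	for (; sch; sch = sch->parent)
		if (sch->bypass_depth > 0)
			return true;
	return false;
}
```

Bypassing a child therefore never affects its parent or siblings, while bypassing the root affects every scheduler below it.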
2026-03-06  sched_ext: Move bypass state into scx_sched  Tejun Heo  4  -81/+80
In preparation of multiple scheduler support, make bypass state per-scx_sched. Move scx_bypass_depth, bypass_timestamp and bypass_lb_timer from globals into scx_sched. Move SCX_RQ_BYPASSING from rq to scx_sched_pcpu as SCX_SCHED_PCPU_BYPASSING. scx_bypass() now takes @sch and scx_rq_bypassing(rq) is replaced with scx_bypassing(sch, cpu). All callers updated. scx_bypassed_for_enable existed to balance the global scx_bypass_depth when enable failed. Now that bypass_depth is per-scheduler, the counter is destroyed along with the scheduler on enable failure. Remove scx_bypassed_for_enable. As all tasks currently use the root scheduler, there's no observable behavior change. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Andrea Righi <arighi@nvidia.com>
2026-03-06  sched_ext: Move bypass_dsq into scx_sched_pcpu  Tejun Heo  3  -26/+29
To support bypass mode for sub-schedulers, move bypass_dsq from struct scx_rq to struct scx_sched_pcpu. Add bypass_dsq() helper. Move bypass_dsq initialization from init_sched_ext_class() to scx_alloc_and_attach_sched(). bypass_lb_cpu() now takes a CPU number instead of rq pointer. All callers updated. No behavior change as all tasks use the root scheduler. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Andrea Righi <arighi@nvidia.com>
2026-03-06  sched_ext: Move aborting flag to per-scheduler field  Tejun Heo  2  -7/+4
The abort state was tracked in the global scx_aborting flag which was used to break out of potential live-lock scenarios when an error occurs. With hierarchical scheduling, each scheduler instance must track its own abort state independently so that an aborting scheduler doesn't interfere with others. Move the aborting flag into struct scx_sched and update all access sites. The early initialization check in scx_root_enable() that warned about residual aborting state is no longer needed as each scheduler instance now starts with a clean state. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Andrea Righi <arighi@nvidia.com>
2026-03-06  sched_ext: Move default slice to per-scheduler field  Tejun Heo  2  -6/+9
The default time slice was stored in the global scx_slice_dfl variable which was dynamically modified when entering and exiting bypass mode. With hierarchical scheduling, each scheduler instance needs its own default slice configuration so that bypass operations on one scheduler don't affect others. Move slice_dfl into struct scx_sched and update all access sites. The bypass logic now modifies the root scheduler's slice_dfl. At task initialization in init_scx_entity(), use the SCX_SLICE_DFL constant directly since the task may not yet be associated with a specific scheduler. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Andrea Righi <arighi@nvidia.com>
2026-03-06  sched_ext: Make scx_prio_less() handle multiple schedulers  Tejun Heo  1  -3/+4
Call ops.core_sched_before() iff both tasks belong to the same scx_sched. Otherwise, use timestamp based ordering. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Andrea Righi <arighi@nvidia.com>
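Illustrative sketch (not the kernel code): the decision above amounts to "ask the scheduler's own callback only when both tasks share a scheduler, otherwise compare timestamps". All names below are hypothetical, and the sketch uses its own "runs before" convention rather than the kernel's prio_less convention.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical task model: owning scheduler id plus a timestamp. */
struct model_core_task {
	int sched_id;
	uint64_t core_sched_at;
};

typedef bool (*model_before_fn)(const struct model_core_task *a,
				const struct model_core_task *b);

/* Example callback policy: prefer the task with the newer timestamp. */
static bool model_newest_first(const struct model_core_task *a,
			       const struct model_core_task *b)
{
	return a->core_sched_at > b->core_sched_at;
}

/* Returns true if @a should run before @b. */
static bool model_runs_before(const struct model_core_task *a,
			      const struct model_core_task *b,
			      model_before_fn before)
{
	/* The callback is only meaningful within a single scheduler. */
	if (before && a->sched_id == b->sched_id)
		return before(a, b);

	/* Fallback: wraparound-safe "a's timestamp is earlier than b's". */
	return (int64_t)(a->core_sched_at - b->core_sched_at) < 0;
}
```

Two tasks owned by different schedulers never reach the callback, so neither scheduler's policy is applied to a task it does not own.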
2026-03-06  sched_ext: Refactor task init/exit helpers  Tejun Heo  1  -23/+45

- Add the @sch parameter to scx_init_task() and drop @tg as it can be obtained from @p. Separate out __scx_init_task() which does everything except for the task state transition.
- Add the @sch parameter to scx_enable_task(). Separate out __scx_enable_task() which does everything except for the task state transition.
- Add the @sch parameter to scx_disable_task().
- Rename scx_exit_task() to scx_disable_and_exit_task() and separate out __scx_disable_and_exit_task() which does everything except for the task state transition.

While some task state transitions are relocated, no meaningful behavior changes are expected.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
2026-03-06  sched_ext: scx_dsq_move() should validate the task belongs to the right scheduler  Tejun Heo  1  -1/+6
scx_bpf_dsq_move[_vtime]() calls scx_dsq_move() to move a task from one DSQ to another. However, @p doesn't necessarily have to come from the containing iteration and can thus be a task which belongs to another scx_sched. Verify that @p is on the same scx_sched as the DSQ being iterated. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Andrea Righi <arighi@nvidia.com>
2026-03-06  sched_ext: Enforce scheduler ownership when updating slice and dsq_vtime  Tejun Heo  1  -9/+32
scx_bpf_task_set_slice() and scx_bpf_task_set_dsq_vtime() now verify that the calling scheduler has authority over the task before allowing updates. This prevents schedulers from modifying tasks that don't belong to them in hierarchical scheduling configurations. Direct writes to p->scx.slice and p->scx.dsq_vtime are deprecated and now trigger warnings. They will be disallowed in a future release. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Andrea Righi <arighi@nvidia.com>
2026-03-06  sched_ext: Enforce scheduling authority in dispatch and select_cpu operations  Tejun Heo  3  -0/+49

Add checks to enforce scheduling authority boundaries when multiple schedulers are present:

1. In scx_dsq_insert_preamble() and the dispatch retry path, ignore attempts to insert tasks that the scheduler doesn't own, counting them via SCX_EV_INSERT_NOT_OWNED. As BPF schedulers are allowed to ignore dequeues, such attempts can occur legitimately during sub-scheduler enabling when tasks move between schedulers. The counter helps distinguish normal cases from scheduler bugs.

2. For scx_bpf_dsq_insert_vtime() and scx_bpf_select_cpu_and(), error out when sub-schedulers are attached. These functions lack the aux__prog parameter needed to identify the calling scheduler, so they cannot be used safely with multiple schedulers. BPF programs should use the arg-wrapped versions (__scx_bpf_dsq_insert_vtime() and __scx_bpf_select_cpu_and()) instead.

These checks ensure that with multiple concurrent schedulers, scheduler identity can be properly determined and unauthorized task operations are prevented or tracked.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
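Illustrative sketch (not the kernel code): the first check, "ignore inserts of tasks the scheduler does not own and count them", might look roughly like this. The structs and model_insert_allowed() are hypothetical; only the idea of a not-owned event counter comes from the commit message.

```c
#include <stdbool.h>

/* Hypothetical scheduler with an event counter for rejected inserts. */
struct model_owner_sched {
	int id;
	unsigned long ev_insert_not_owned;
};

/* Hypothetical task that records which scheduler currently owns it. */
struct model_owned_task {
	int owner_id;
};

/*
 * Returns true if @sch may insert @p. A foreign task is silently
 * ignored and counted rather than treated as a fatal error, since
 * stale inserts can be legitimate while tasks move between schedulers.
 */
static bool model_insert_allowed(struct model_owner_sched *sch,
				 const struct model_owned_task *p)
{
	if (p->owner_id != sch->id) {
		sch->ev_insert_not_owned++;
		return false;
	}
	return true;
}
```

A counter that keeps growing outside of scheduler enable/disable windows would then point at a scheduler bug rather than a benign race.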
2026-03-06  sched_ext: Introduce scx_prog_sched()  Tejun Heo  3  -102/+189
In preparation for multiple scheduler support, introduce scx_prog_sched() accessor which returns the scx_sched instance associated with a BPF program. The association is determined via the special KF_IMPLICIT_ARGS kfunc parameter, which provides access to bpf_prog_aux. This aux can be used to retrieve the struct_ops (sched_ext_ops) that the program is associated with, and from there, the corresponding scx_sched instance. For compatibility, when ops.sub_attach is not implemented (older schedulers without sub-scheduler support), unassociated programs fall back to scx_root. A warning is logged once per scheduler for such programs. As scx_root is still the only scheduler, this shouldn't introduce user-visible behavior changes. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Andrea Righi <arighi@nvidia.com>
2026-03-06  sched_ext: Introduce scx_task_sched[_rcu]()  Tejun Heo  2  -24/+98
In preparation of multiple scheduler support, add p->scx.sched which points to the scx_sched instance that the task is scheduled by, which is currently always scx_root. Add scx_task_sched[_rcu]() accessors which return the associated scx_sched of the specified task and replace the raw scx_root dereferences with it where applicable. scx_task_on_sched() is also added to test whether a given task is on the specified sched. As scx_root is still the only scheduler, this shouldn't introduce user-visible behavior changes. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Andrea Righi <arighi@nvidia.com>
2026-03-06  sched_ext: Introduce cgroup sub-sched support  Tejun Heo  2  -34/+565

A system often runs multiple workloads, especially in multi-tenant server environments where a system is split into partitions servicing separate more-or-less independent workloads each requiring an application-specific scheduler. To support such and other use cases, sched_ext is in the process of growing multiple scheduler support.

When partitioning a system in terms of CPUs for such use cases, an oft-taken approach is hard partitioning the system using cpuset. While it would be possible to tie sched_ext multiple scheduler support to cpuset partitions, such an approach would have fundamental limitations stemming from the lack of dynamism and flexibility. Users often don't care which specific CPUs are assigned to which workload and want to take advantage of optimizations which are enabled by running workloads on a larger machine - e.g. opportunistic over-commit, improving latency critical workload characteristics while maintaining bandwidth fairness, employing control mechanisms based on different criteria than on-CPU time for e.g. flexible memory bandwidth isolation, packing similar parts from different workloads on same L3s to improve cache efficiency, and so on.

As this sort of dynamic behavior is impossible or difficult to implement with hard partitioning, sched_ext is implementing cgroup sub-sched support where schedulers can be attached to the cgroup hierarchy and a parent scheduler is responsible for controlling the CPUs that each child can use at any given moment. This makes CPU distribution dynamically controlled by BPF allowing high flexibility.

This patch adds the skeletal sched_ext cgroup sub-sched support:

- sched_ext_ops.sub_cgroup_id and .sub_attach/detach() are added. Non-zero sub_cgroup_id indicates that the scheduler is to be attached to the identified cgroup. A sub-sched is attached to the cgroup iff the nearest ancestor scheduler implements .sub_attach() and grants the attachment. Max nesting depth is limited by SCX_SUB_MAX_DEPTH.
- When a scheduler exits, all its descendant schedulers are exited together. Also, cgroup.scx_sched is added, which points to the effective scheduler instance for the cgroup. This is updated on scheduler init/exit and inherited on cgroup online. When a cgroup is offlined, the attached scheduler is automatically exited.
- Sub-sched support is gated on CONFIG_EXT_SUB_SCHED which is automatically enabled if both SCX and cgroups are enabled. Sub-sched support is not tied to the CPU controller but rather the cgroup hierarchy itself. This is intentional as the support for cpu.weight and cpu.max based resource control is orthogonal to sub-sched support. Note that CONFIG_CGROUPS around cgroup subtree iteration support for scx_task_iter is replaced with CONFIG_EXT_SUB_SCHED for consistency.
- This allows loading sub-scheds, and most framework operations such as propagating disable down the hierarchy work. However, sub-scheds are not operational yet and all tasks stay with the root sched. This will serve as the basis for building up full sub-sched support.
- DSQs point to the scx_sched they belong to.
- scx_qmap is updated to allow attachment of sub-scheds and to also serve as a sub-sched.
- scx_is_descendant() is added but not yet used in this patch. It is used by later changes in the series and placed here as this is where the function belongs.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
2026-03-06  sched_ext: Reorganize enable/disable path for multi-scheduler support  Tejun Heo  1  -35/+43
In preparation for multiple scheduler support, reorganize the enable and disable paths to make scheduler instances explicit. Extract scx_root_disable() from scx_disable_workfn(). Rename scx_enable_workfn() to scx_root_enable_workfn(). Change scx_disable() to take @sch parameter and only queue disable_work if scx_claim_exit() succeeds for consistency. Move exit_kind validation into scx_claim_exit(). The sysrq handler now prints a message when no scheduler is loaded. These changes don't materially affect user-visible behavior. v2: Keep scx_enable() name as-is and only rename the workfn to scx_root_enable_workfn(). Change scx_enable() return type to s32. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Andrea Righi <arighi@nvidia.com>
2026-03-06  sched_ext: Update p->scx.disallow warning in scx_init_task()  Tejun Heo  1  -4/+4

- Always trigger the warning if p->scx.disallow is set for fork inits. There is no reason to set it during forks.
- Flip the positions of if/else arms to ease adding error conditions.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
2026-03-06  sched_ext: Add @kargs to scx_fork()  Tejun Heo  3  -4/+4
Make sched_cgroup_fork() pass @kargs to scx_fork(). This will be used to determine @p's cgroup for cgroup sub-sched support. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Andrea Righi <arighi@nvidia.com> Cc: Peter Zijlstra <peterz@infradead.org>
2026-03-06  sched_ext: Implement cgroup subtree iteration for scx_task_iter  Tejun Heo  1  -6/+58

For the planned cgroup sub-scheduler support, enable/disable operations are going to be subtree specific and iterating all tasks in the system for those operations can be unnecessarily expensive and disruptive. cgroup already has mechanisms to perform subtree task iterations. Implement cgroup subtree iteration for scx_task_iter:

- Add optional @cgrp to scx_task_iter_start() which enables cgroup subtree iteration.
- Make scx_task_iter use css_next_descendant_pre() and css_task_iter to iterate all tasks in the cgroup subtree.
- Update all existing callers to pass NULL to maintain current behavior.

The two iteration mechanisms are independent and duplicate. It's likely that scx_tasks can be removed in favor of always using cgroup iteration if CONFIG_SCHED_CLASS_EXT depends on CONFIG_CGROUPS.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
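Illustrative sketch (not the kernel code): a pre-order walk confined to one subtree, in the spirit of css_next_descendant_pre(), can be written over a plain parent/child/sibling tree. struct model_cgrp and both helpers are hypothetical, and locking and css_task_iter are not modeled; each node just carries a task count.

```c
#include <stddef.h>

/* Hypothetical cgroup node: tree links plus a per-cgroup task count. */
struct model_cgrp {
	struct model_cgrp *parent;
	struct model_cgrp *first_child;
	struct model_cgrp *next_sibling;
	int ntasks;
};

/*
 * Pre-order successor of @pos inside the subtree rooted at @root.
 * Pass NULL as @pos to start; returns NULL when the walk is done.
 */
static struct model_cgrp *model_next_descendant_pre(struct model_cgrp *pos,
						    struct model_cgrp *root)
{
	if (!pos)
		return root;
	if (pos->first_child)
		return pos->first_child;

	/* No children: climb until a sibling exists, never past @root. */
	while (pos != root) {
		if (pos->next_sibling)
			return pos->next_sibling;
		pos = pos->parent;
	}
	return NULL;
}

/* Visit every cgroup in the subtree and sum up its tasks. */
static int model_count_subtree_tasks(struct model_cgrp *root)
{
	int total = 0;

	for (struct model_cgrp *pos = model_next_descendant_pre(NULL, root);
	     pos; pos = model_next_descendant_pre(pos, root))
		total += pos->ntasks;
	return total;
}
```

The walk only touches cgroups under @root, which is the property that makes subtree-specific enable/disable cheaper than iterating every task in the system.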
2026-03-06  Merge branch 'for-7.0-fixes' into for-7.1  Tejun Heo  7  -85/+315
To prepare for hierarchical scheduling patchset which will cause multiple conflicts otherwise. Signed-off-by: Tejun Heo <tj@kernel.org>
2026-03-06  sched_ext: Use READ_ONCE() for scx_slice_bypass_us in scx_bypass()  David Carlier  1  -1/+1

Commit 0927780c90ce ("sched_ext: Use READ_ONCE() for lock-free reads of module param variables") annotated the plain reads of scx_slice_bypass_us and scx_bypass_lb_intv_us in bypass_lb_cpu(), but missed a third site in scx_bypass():

  WRITE_ONCE(scx_slice_dfl, scx_slice_bypass_us * NSEC_PER_USEC);

scx_slice_bypass_us is a module parameter writable via sysfs in process context through set_slice_us() -> param_set_uint_minmax(), which performs a plain store without holding bypass_lock. scx_bypass() reads the variable under bypass_lock, but since the writer does not take that lock, the two accesses are concurrent.

WRITE_ONCE() only applies volatile semantics to the store of scx_slice_dfl -- the val expression containing scx_slice_bypass_us is evaluated as a plain read, providing no protection against concurrent writes.

Wrap the read with READ_ONCE() to complete the annotation started by commit 0927780c90ce and make the access KCSAN-clean, consistent with the existing READ_ONCE(scx_slice_bypass_us) in bypass_lb_cpu().

Signed-off-by: David Carlier <devnexen@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2026-03-06  sched/headers: Inline raw_spin_rq_unlock()  Xie Yuanbin  2  -8/+6
raw_spin_rq_unlock() is short, and is called in some hot code paths such as finish_lock_switch(). Inline raw_spin_rq_unlock() to micro-optimize performance a bit. Signed-off-by: Xie Yuanbin <qq570070308@gmail.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://patch.msgid.link/20260216164950.147617-3-qq570070308@gmail.com
2026-03-06  Merge branch 'linus' into sched/core, to resolve conflicts  Ingo Molnar  4  -25/+87
Conflicts: kernel/sched/ext.c Signed-off-by: Ingo Molnar <mingo@kernel.org>
2026-03-06  sched/hrtick: Mark hrtick_clear() as always used  Ingo Molnar  1  -1/+1

This recent commit:

  96d1610e0b20b ("sched: Optimize hrtimer handling")

introduced a new build warning when !CONFIG_HOTPLUG_CPU while SCHED_HRTIMERS=y [ == HIGH_RES_TIMERS=y ]:

  /tip.testing/kernel/sched/core.c:882:13: warning: ‘hrtick_clear’ defined but not used [-Wunused-function]

Mark this helper function as always-used, instead of complicating the code with another obscure #ifdef.

Fixes: 96d1610e0b20b ("sched: Optimize hrtimer handling")
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: https://patch.msgid.link/177245077226.1647592.1821545206171336606.tip-bot2@tip-bot2
2026-03-05  sched_ext: Document task ownership state machine  Andrea Righi  1  -16/+98
The task ownership state machine in sched_ext is quite hard to follow from the code alone. The interaction of ownership states, memory ordering rules and cross-CPU "lock dancing" makes the overall model subtle. Extend the documentation next to scx_ops_state to provide a more structured and self-contained description of the state transitions and their synchronization rules. The new reference should make the code easier to reason about and maintain and can help future contributors understand the overall task-ownership workflow. Signed-off-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2026-03-05  sched_ext: Use READ_ONCE() for lock-free reads of module param variables  zhidao su  1  -2/+2

bypass_lb_cpu() reads scx_bypass_lb_intv_us and scx_slice_bypass_us without holding any lock, in timer callback context where module parameter writes via sysfs can happen concurrently:

  min_delta_us = scx_bypass_lb_intv_us / SCX_BYPASS_LB_MIN_DELTA_DIV;
                 ^^^^^^^^^^^^^^^^^^^^
                 plain read -- KCSAN data race

  if (delta < DIV_ROUND_UP(min_delta_us, scx_slice_bypass_us))
                                         ^^^^^^^^^^^^^^^^^
                                         plain read -- KCSAN data race

scx_bypass_lb_intv_us already uses READ_ONCE() in scx_bypass_lb_timerfn() and scx_bypass() for its other lock-free read sites, leaving bypass_lb_cpu() inconsistent. scx_slice_bypass_us has the same lock-free access pattern in the same function.

Fix both plain reads by using READ_ONCE() to complete the concurrent access annotation and make the code KCSAN-clean.

Signed-off-by: zhidao su <suzhidao@xiaomi.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2026-03-04  sched_ext: Use WRITE_ONCE() for the write side of dsq->seq update  zhidao su  1  -1/+1

bpf_iter_scx_dsq_new() reads dsq->seq via READ_ONCE() without holding any lock, making dsq->seq a lock-free concurrently accessed variable. However, dispatch_enqueue(), the sole writer of dsq->seq, uses a plain increment without the matching WRITE_ONCE() on the write side:

  dsq->seq++;
  ^^^^^^^^^^^
  plain write -- KCSAN data race

The KCSAN documentation requires that if one accessor uses READ_ONCE() or WRITE_ONCE() on a variable to annotate lock-free access, all other accesses must also use the appropriate accessor. A plain write leaves the pair incomplete and will trigger KCSAN warnings.

Fix by using WRITE_ONCE() for the write side of the update:

  WRITE_ONCE(dsq->seq, dsq->seq + 1);

This is consistent with bpf_iter_scx_dsq_new() and makes the concurrent access annotation complete and KCSAN-clean.

Signed-off-by: zhidao su <suzhidao@xiaomi.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2026-03-04  sched/deadline: Fix missing ENQUEUE_REPLENISH during PI de-boosting  Juri Lelli  1  -0/+30

Running stress-ng --schedpolicy 0 on an RT kernel on a big machine might lead to the following WARNINGs (edited).

  sched: DL de-boosted task PID 22725: REPLENISH flag missing

  WARNING: CPU: 93 PID: 0 at kernel/sched/deadline.c:239 dequeue_task_dl+0x15c/0x1f8
  ... (running_bw underflow)
  Call trace:
   dequeue_task_dl+0x15c/0x1f8 (P)
   dequeue_task+0x80/0x168
   deactivate_task+0x24/0x50
   push_dl_task+0x264/0x2e0
   dl_task_timer+0x1b0/0x228
   __hrtimer_run_queues+0x188/0x378
   hrtimer_interrupt+0xfc/0x260
   ...

The problem is that when a SCHED_DEADLINE task (lock holder) is changed to a lower priority class via sched_setscheduler(), it may fail to properly inherit the parameters of potential DEADLINE donors if it didn't already inherit them in the past (shorter deadline than donor's at that time). This might lead to bandwidth accounting corruption, as enqueue_task_dl() won't recognize the lock holder as boosted.

The scenario occurs when:

1. A DEADLINE task (donor) blocks on a PI mutex held by another DEADLINE task (holder), but the holder doesn't inherit parameters (e.g., it already has a shorter deadline)
2. sched_setscheduler() changes the holder from DEADLINE to a lower class while still holding the mutex
3. The holder should now inherit DEADLINE parameters from the donor and be enqueued with ENQUEUE_REPLENISH, but this doesn't happen

Fix the issue by introducing __setscheduler_dl_pi(), which detects when a DEADLINE (proper or boosted) task gets setscheduled to a lower priority class. In that case, the function makes the task inherit the DEADLINE parameters of the donor (pi_se) and sets the ENQUEUE_REPLENISH flag to ensure proper bandwidth accounting during the next enqueue operation.

Fixes: 2279f540ea7d ("sched/deadline: Fix priority inheritance with multiple scheduling classes")
Reported-by: Bruno Goncalves <bgoncalv@redhat.com>
Signed-off-by: Juri Lelli <juri.lelli@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20260302-upstream-fix-deadline-piboost-b4-v3-1-6ba32184a9e0@redhat.com
2026-03-03  Merge tag 'cgroup-for-7.0-rc2-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup  Linus Torvalds  1  -3/+1

Pull cgroup fixes from Tejun Heo:

- Fix circular locking dependency in cpuset partition code by deferring housekeeping_update() calls to a workqueue instead of calling them directly under cpus_read_lock
- Fix null-ptr-deref in rebuild_sched_domains_cpuslocked() when generate_sched_domains() returns NULL due to kmalloc failure
- Fix incorrect cpuset behavior for effective_xcpus in partition_xcpus_del() and cpuset_update_tasks_cpumask() in update_cpumasks_hier()
- Fix race between task migration and cgroup iteration

* tag 'cgroup-for-7.0-rc2-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
  cgroup/cpuset: fix null-ptr-deref in rebuild_sched_domains_cpuslocked
  cgroup/cpuset: Call housekeeping_update() without holding cpus_read_lock
  cgroup/cpuset: Defer housekeeping_update() calls from CPU hotplug to workqueue
  cgroup/cpuset: Move housekeeping_update()/rebuild_sched_domains() together
  kselftest/cgroup: Simplify test_cpuset_prs.sh by removing "S+" command
  cgroup/cpuset: Set isolated_cpus_updating only if isolated_cpus is changed
  cgroup/cpuset: Clarify exclusion rules for cpuset internal variables
  cgroup/cpuset: Fix incorrect use of cpuset_update_tasks_cpumask() in update_cpumasks_hier()
  cgroup/cpuset: Fix incorrect change to effective_xcpus in partition_xcpus_del()
  cgroup: fix race between task migration and iteration
2026-03-03  Merge tag 'sched_ext-for-7.0-rc2-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext  Linus Torvalds  3  -22/+86

Pull sched_ext fixes from Tejun Heo:

- Fix starvation of scx_enable() under fair-class saturation by offloading the enable path to an RT kthread
- Fix out-of-bounds access in idle mask initialization on systems with non-contiguous NUMA node IDs
- Fix a preemption window during scheduler exit and a refcount underflow in cgroup init error path
- Fix SCX_EFLAG_INITIALIZED being a no-op flag
- Add READ_ONCE() annotations for KCSAN-clean lockless accesses and replace naked scx_root dereferences with container_of() in kobject callbacks
- Tooling and selftest fixes: compilation issues with clang 17, strtoul() misuse, unused options cleanup, and Kconfig sync

* tag 'sched_ext-for-7.0-rc2-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext:
  sched_ext: Fix starvation of scx_enable() under fair-class saturation
  sched_ext: Remove redundant css_put() in scx_cgroup_init()
  selftests/sched_ext: Fix peek_dsq.bpf.c compile error for clang 17
  selftests/sched_ext: Add -fms-extensions to bpf build flags
  tools/sched_ext: Add -fms-extensions to bpf build flags
  sched_ext: Use READ_ONCE() for plain reads of scx_watchdog_timeout
  sched_ext: Replace naked scx_root dereferences in kobject callbacks
  sched_ext: Use READ_ONCE() for the read side of dsq->nr update
  tools/sched_ext: fix strtoul() misuse in scx_hotplug_seq()
  sched_ext: Fix SCX_EFLAG_INITIALIZED being a no-op flag
  sched_ext: Fix out-of-bounds access in scx_idle_init_masks()
  sched_ext: Disable preemption between scx_claim_exit() and kicking helper work
  tools/sched_ext: Add Kconfig to sync with upstream
  tools/sched_ext: Sync README.md Kconfig with upstream scx
  selftests/sched_ext: Remove duplicated unistd.h include in rt_stall.c
  tools/sched_ext: scx_sdt: Remove unused '-f' option
  tools/sched_ext: scx_central: Remove unused '-p' option
  selftests/sched_ext: Fix unused-result warning for read()
  selftests/sched_ext: Abort test loop on signal
2026-03-03  sched_ext: Fix starvation of scx_enable() under fair-class saturation  Tejun Heo  1  -10/+56
During scx_enable(), the READY -> ENABLED task switching loop changes the calling thread's sched_class from fair to ext. Since fair has higher priority than ext, saturating fair-class workloads can indefinitely starve the enable thread, hanging the system. This was introduced when the enable path switched from preempt_disable() to scx_bypass() which doesn't protect against fair-class starvation. Note that the original preempt_disable() protection wasn't complete either - in partial switch modes, the calling thread could still be starved after preempt_enable() as it may have been switched to ext class. Fix it by offloading the enable body to a dedicated system-wide RT (SCHED_FIFO) kthread which cannot be starved by either fair or ext class tasks. scx_enable() lazily creates the kthread on first use and passes the ops pointer through a struct scx_enable_cmd containing the kthread_work, then synchronously waits for completion. The workfn runs on a different kthread from sch->helper (which runs disable_work), so it can safely flush disable_work on the error path without deadlock. Fixes: 8c2090c504e9 ("sched_ext: Initialize in bypass mode") Cc: stable@vger.kernel.org # v6.12+ Signed-off-by: Tejun Heo <tj@kernel.org>
2026-03-03  sched_ext: Remove redundant css_put() in scx_cgroup_init()  Cheng-Yang Chou  1  -1/+0
The iterator css_for_each_descendant_pre() walks the cgroup hierarchy under cgroup_lock(). It does not increment the reference counts on yielded css structs. According to the cgroup documentation, css_put() should only be used to release a reference obtained via css_get() or css_tryget_online(). Since the iterator does not use either of these to acquire a reference, calling css_put() in the error path of scx_cgroup_init() causes a refcount underflow. Remove the unbalanced css_put() to prevent a potential Use-After-Free (UAF) vulnerability. Fixes: 819513666966 ("sched_ext: Add cgroup support") Cc: stable@vger.kernel.org # v6.12+ Signed-off-by: Cheng-Yang Chou <yphbchou0911@gmail.com> Reviewed-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Tejun Heo <tj@kernel.org>
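Illustrative sketch (not the kernel code): the underflow follows from a simple get/put balance rule. The toy counter below is hypothetical and has none of the real css refcounting machinery; it only shows why a put without a matching get releases a reference that belongs to someone else.

```c
#include <stdbool.h>

/* Hypothetical refcounted object standing in for a css. */
struct model_css {
	int refcnt;
};

static void model_css_get(struct model_css *css)
{
	css->refcnt++;
}

/* Returns true when the last reference is dropped (object would be freed). */
static bool model_css_put(struct model_css *css)
{
	return --css->refcnt == 0;
}

/*
 * Visit an object handed out by an iterator that takes no reference.
 * The pointer is only borrowed, so no path here may call
 * model_css_put(), including the error path.
 */
static int model_init_borrowed(struct model_css *css, bool fail)
{
	(void)css; /* borrowed: nothing to release */
	return fail ? -1 : 0;
}
```

An extra put on a borrowed pointer drops the owner's reference, so the object can be freed while the owner still uses it, which is the use-after-free risk the commit message describes.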
2026-03-02  sched_ext: Use READ_ONCE() for plain reads of scx_watchdog_timeout  zhidao su  1  -3/+3

scx_watchdog_timeout is written with WRITE_ONCE() in scx_enable():

  WRITE_ONCE(scx_watchdog_timeout, timeout);

However, three read-side accesses use plain reads without the matching READ_ONCE():

  /* check_rq_for_timeouts() - L2824 */
  last_runnable + scx_watchdog_timeout

  /* scx_watchdog_workfn() - L2852 */
  scx_watchdog_timeout / 2

  /* scx_enable() - L5179 */
  scx_watchdog_timeout / 2

The KCSAN documentation requires that if one accessor uses WRITE_ONCE() to annotate lock-free access, all other accesses must also use the appropriate accessor. Plain reads alongside WRITE_ONCE() leave the pair incomplete and can trigger KCSAN warnings.

Note that scx_tick() already uses the correct READ_ONCE() annotation:

  last_check + READ_ONCE(scx_watchdog_timeout)

Fix the three remaining plain reads to match, making all accesses to scx_watchdog_timeout consistently annotated and KCSAN-clean.

Signed-off-by: zhidao su <suzhidao@xiaomi.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2026-03-02  sched_ext: Replace naked scx_root dereferences in kobject callbacks  zhidao su  1  -2/+6

scx_attr_ops_show() and scx_uevent() access scx_root->ops.name directly. This is problematic for two reasons:

1. The file-level comment explicitly identifies naked scx_root dereferences as a temporary measure that needs to be replaced with proper per-instance access.

2. scx_attr_events_show(), the neighboring sysfs show function in the same group, already uses the correct pattern:

     struct scx_sched *sch = container_of(kobj, struct scx_sched, kobj);

   Having inconsistent access patterns in the same sysfs/uevent group is error-prone.

The kobject embedded in struct scx_sched is initialized as:

  kobject_init_and_add(&sch->kobj, &scx_ktype, NULL, "root");

so container_of(kobj, struct scx_sched, kobj) correctly retrieves the owning scx_sched instance in both callbacks.

Replace the naked scx_root dereferences with container_of()-based access, consistent with scx_attr_events_show() and in preparation for proper multi-instance scx_sched support.

Signed-off-by: zhidao su <suzhidao@xiaomi.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2026-03-02sched_ext: Use READ_ONCE() for the read side of dsq->nr updatezhidao su1-2/+6
scx_bpf_dsq_nr_queued() reads dsq->nr via READ_ONCE() without holding any lock, making dsq->nr a lock-free concurrently accessed variable. However, dsq_mod_nr(), the sole writer of dsq->nr, only uses WRITE_ONCE() on the write side without the matching READ_ONCE() on the read side: WRITE_ONCE(dsq->nr, dsq->nr + delta); ^^^^^^^ plain read -- KCSAN data race The KCSAN documentation requires that if one accessor uses READ_ONCE() or WRITE_ONCE() on a variable to annotate lock-free access, all other accesses must also use the appropriate accessor. A plain read on the right-hand side of WRITE_ONCE() leaves the pair incomplete and will trigger KCSAN warnings. Fix by using READ_ONCE() for the read side of the update: WRITE_ONCE(dsq->nr, READ_ONCE(dsq->nr) + delta); This is consistent with scx_bpf_dsq_nr_queued() and makes the concurrent access annotation complete and KCSAN-clean. Signed-off-by: zhidao su <suzhidao@xiaomi.com> Signed-off-by: Tejun Heo <tj@kernel.org>
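The paired annotation can be sketched in userspace as follows. The READ_ONCE() and WRITE_ONCE() macros here are simplified approximations built on volatile casts (they assume a compiler that supports __typeof__, such as GCC or Clang) and are not the kernel definitions; struct dsq is reduced to the one field discussed, and dsq_nr_queued() is a hypothetical stand-in for the lock-free reader.

```c
/* Simplified approximations of the kernel macros, for illustration only. */
#define READ_ONCE(x)		(*(const volatile __typeof__(x) *)&(x))
#define WRITE_ONCE(x, val)	(*(volatile __typeof__(x) *)&(x) = (val))

/* Reduced stand-in for the dispatch queue: only the counter is kept. */
struct dsq {
	unsigned int nr;
};

/* Writer: both the load and the store of dsq->nr are annotated. */
static void dsq_mod_nr(struct dsq *dsq, int delta)
{
	WRITE_ONCE(dsq->nr, READ_ONCE(dsq->nr) + delta);
}

/* Lock-free reader: uses the matching READ_ONCE(). */
static unsigned int dsq_nr_queued(struct dsq *dsq)
{
	return READ_ONCE(dsq->nr);
}
```

The annotations do not make the read-modify-write atomic. They only mark each access as intentionally lock-free, which is what the message above asks for.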
2026-02-27sched: Default enable HRTICK when deferred rearming is enabledPeter Zijlstra1-0/+5
The deferred rearm of the clock event device after an interrupt and other hrtimer optimizations now allow enabling HRTICK for generic entry architectures. This decouples preemption from CONFIG_HZ, leaving only the periodic load-balancer and various accounting things relying on the tick. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Thomas Gleixner <tglx@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://patch.msgid.link/20260224163431.937531564@kernel.org
2026-02-27sched/core: Prepare for deferred hrtimer rearmingPeter Zijlstra1-0/+6
The hrtimer interrupt expires timers and at the end of the interrupt it rearms the clockevent device for the next expiring timer. That's obviously correct, but in the case that an expired timer sets NEED_RESCHED the return from interrupt ends up in schedule(). If HRTICK is enabled then schedule() will modify the hrtick timer, which causes another reprogramming of the hardware. That can be avoided by deferring the rearming to the return from interrupt path and if the return results in an immediate schedule() invocation then it can be deferred until the end of schedule(), which avoids multiple rearms and re-evaluation of the timer wheel. Add the rearm checks to the existing sched_hrtick_enter/exit() functions, which already handle the batched rearm of the hrtick timer. For now this is just placing empty stubs at the right places which are all optimized out by the compiler until the guard condition becomes true. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Thomas Gleixner <tglx@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://patch.msgid.link/20260224163431.208580085@kernel.org
2026-02-27sched/hrtick: Mark hrtick timer LAZY_REARMPeter Zijlstra1-1/+2
The hrtick timer is frequently rearmed before expiry and most of the time the new expiry is past the armed one. As this happens on every context switch it becomes expensive with scheduling-heavy workloads, especially in virtual machines, as the "hardware" reprogramming implies a VM exit. hrtimers now provide a lazy rearm mode flag which skips the reprogramming if: 1) The timer was the first expiring timer before the rearm 2) The new expiry time is farther out than the armed time This avoids a massive amount of reprogramming operations of the hrtick timer for the price of eventually taking the already armed interrupt for nothing. Mark the hrtick timer accordingly. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Thomas Gleixner <tglx@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://patch.msgid.link/20260224163429.475409346@kernel.org
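The two conditions listed above can be read as a single predicate. The sketch below assumes both must hold together for the reprogramming to be skipped; the function name and parameters are hypothetical and do not correspond to the hrtimer core's actual interface.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical predicate: returns true when a lazy rearm may leave the
 * clock event device untouched.
 *
 * @was_first_expiring: the timer was the first expiring timer before the rearm
 * @armed_ns:           expiry time currently programmed, in nanoseconds
 * @new_expiry_ns:      expiry time requested by the rearm, in nanoseconds
 */
static bool lazy_rearm_skips_reprogram(bool was_first_expiring,
				       int64_t armed_ns,
				       int64_t new_expiry_ns)
{
	return was_first_expiring && new_expiry_ns > armed_ns;
}
```

When the predicate is true, the already armed interrupt still fires at the old, earlier time, which is the "taken for nothing" cost the message mentions.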
2026-02-27sched/hrtick: Avoid tiny hrtick rearmsThomas Gleixner1-5/+19
Tiny adjustments to the hrtick expiry time below 5 microseconds are just causing extra work for no real value. Filter them out when restarting the hrtick. Signed-off-by: Thomas Gleixner <tglx@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://patch.msgid.link/20260224163429.340593047@kernel.org
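A minimal sketch of such a filter, assuming the comparison is made on the absolute difference between the armed and the requested expiry. The constant name and the helper are hypothetical; only the 5 microsecond threshold comes from the message above.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical name for the 5 microsecond threshold, in nanoseconds. */
#define SKETCH_HRTICK_MIN_DELTA_NS	5000

/*
 * Returns true when moving the expiry from @armed_ns to @new_ns is a
 * large enough change to be worth restarting the timer.
 */
static bool hrtick_change_worth_rearm(int64_t armed_ns, int64_t new_ns)
{
	int64_t diff = new_ns - armed_ns;

	if (diff < 0)
		diff = -diff;

	return diff >= SKETCH_HRTICK_MIN_DELTA_NS;
}
```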
2026-02-27sched: Optimize hrtimer handlingThomas Gleixner2-9/+50
schedule() provides several mechanisms to update the hrtick timer: 1) When the next task is picked 2) When the balance callbacks are invoked before rq::lock is released Each of them can result in a first expiring timer and cause a reprogram of the clock event device. Solve this by deferring the rearm to the end of schedule() right before releasing rq::lock by setting a flag on entry which tells hrtick_start() to cache the runtime constraint in rq::hrtick_delay without touching the timer itself. Right before releasing rq::lock evaluate the flags and either rearm or cancel the hrtick timer. Signed-off-by: Thomas Gleixner <tglx@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://patch.msgid.link/20260224163429.273068659@kernel.org
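The deferral scheme described above can be modelled as a small state machine. This is a userspace sketch under stated assumptions: struct rq_sketch, its fields other than hrtick_delay, and all sketch_* functions are hypothetical, and the "device" is just a counter recording how often it would have been touched.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of the per-runqueue hrtick state. */
struct rq_sketch {
	bool defer;			/* set on schedule() entry */
	bool delay_valid;		/* a start request was cached */
	int64_t hrtick_delay;		/* cached runtime constraint, in ns */
	unsigned int hw_programs;	/* times the timer was (re)armed */
	unsigned int hw_cancels;	/* times the timer was cancelled */
};

/* On entry: start caching requests instead of touching the timer. */
static void sketch_hrtick_enter(struct rq_sketch *rq)
{
	rq->defer = true;
	rq->delay_valid = false;
}

/* May be called several times within one schedule() invocation. */
static void sketch_hrtick_start(struct rq_sketch *rq, int64_t delay_ns)
{
	rq->hrtick_delay = delay_ns;
	if (rq->defer) {
		rq->delay_valid = true;	/* cache only, no reprogramming */
		return;
	}
	rq->hw_programs++;		/* not deferring: arm immediately */
}

/* Right before the lock would be released: rearm once, or cancel. */
static void sketch_hrtick_exit(struct rq_sketch *rq)
{
	rq->defer = false;
	if (rq->delay_valid)
		rq->hw_programs++;
	else
		rq->hw_cancels++;
	rq->delay_valid = false;
}
```

However many start requests arrive between enter and exit, the timer is touched at most once, and the last cached delay is the one that takes effect.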
2026-02-27sched: Use hrtimer_highres_enabled()Thomas Gleixner1-28/+9
Use the static branch based variant and thereby avoid following three pointers. Signed-off-by: Thomas Gleixner <tglx@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://patch.msgid.link/20260224163429.203610956@kernel.org
2026-02-27sched: Avoid ktime_get() indirectionThomas Gleixner2-3/+2
The clock of the hrtick and deadline timers is known to be CLOCK_MONOTONIC. No point in looking it up via hrtimer_cb_get_time(). Just use ktime_get() directly. Signed-off-by: Thomas Gleixner <tglx@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://patch.msgid.link/20260224163429.001511662@kernel.org
2026-02-27sched/fair: Make hrtick resched hardPeter Zijlstra (Intel)1-1/+1
Since the tick causes hard preemption, the hrtick should too. Letting the hrtick do lazy preemption completely defeats the purpose, since it will then still be delayed until an old tick and be dependent on CONFIG_HZ. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Thomas Gleixner <tglx@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://patch.msgid.link/20260224163428.933894105@kernel.org
2026-02-27sched/fair: Simplify hrtick_update()Peter Zijlstra (Intel)2-8/+8
hrtick_update() was needed when the slice depended on nr_running, all that code is gone. All that remains is starting the hrtick when nr_running becomes more than 1. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Thomas Gleixner <tglx@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://patch.msgid.link/20260224163428.866374835@kernel.org
2026-02-27sched/eevdf: Fix HRTICK durationPeter Zijlstra1-14/+27
The nominal duration for an EEVDF task to run is until its deadline, at which point the deadline is moved ahead and a new task selection is done. Try to predict the time 'lost' to higher scheduling classes. Since this is an estimate, the timer can be either early or late. In case it is early, task_tick_fair() will take the !need_resched() path and restart the timer. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Thomas Gleixner <tglx@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Juri Lelli <juri.lelli@redhat.com> Link: https://patch.msgid.link/20260224163428.798198874@kernel.org
2026-02-26sched_ext: Fix SCX_EFLAG_INITIALIZED being a no-op flagDavid Carlier1-1/+1
SCX_EFLAG_INITIALIZED is the sole member of enum scx_exit_flags with no explicit value, so the compiler assigns it 0. This makes the bitwise OR in scx_ops_init() a no-op: sch->exit_info->flags |= SCX_EFLAG_INITIALIZED; /* |= 0 */ As a result, BPF schedulers cannot distinguish whether ops.init() completed successfully by inspecting exit_info->flags. Assign the value 1LLU << 0 so the flag is actually set. Fixes: f3aec2adce8d ("sched_ext: Add SCX_EFLAG_INITIALIZED to indicate successful ops.init()") Signed-off-by: David Carlier <devnexen@gmail.com> Signed-off-by: Tejun Heo <tj@kernel.org>
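The effect of the missing initializer can be reproduced in a few lines. The enum and function names below are hypothetical stand-ins chosen so that the broken and fixed variants can sit side by side; they are not the kernel definitions.

```c
#include <stdint.h>

/* Sole enumerator without an explicit value: the compiler assigns 0. */
enum broken_exit_flags {
	BROKEN_EFLAG_INITIALIZED,
};

/* Explicit bit position, as in the fix described above. */
enum fixed_exit_flags {
	FIXED_EFLAG_INITIALIZED = 1LLU << 0,
};

/* Mirrors the `flags |= FLAG` statement from the message. */
static uint64_t mark_initialized(uint64_t flags, uint64_t flag)
{
	return flags | flag;
}
```

With the broken enum the OR leaves the flags word unchanged, so a reader of the flags cannot tell a successful init from one that never ran.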
2026-02-25sched_ext: Fix out-of-bounds access in scx_idle_init_masks()David Carlier1-2/+2
scx_idle_node_masks is allocated with num_possible_nodes() elements but indexed by NUMA node IDs via for_each_node(). On systems with non-contiguous NUMA node numbering (e.g. nodes 0 and 4), node IDs can exceed the array size, causing out-of-bounds memory corruption. Use nr_node_ids instead, which represents the maximum node ID range and is the correct size for arrays indexed by node ID. Fixes: 7c60329e3521 ("sched_ext: Add NUMA-awareness to the default idle selection policy") Signed-off-by: David Carlier <devnexen@gmail.com> Signed-off-by: Tejun Heo <tj@kernel.org>
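The sizing mismatch can be illustrated with a bitmask of possible node IDs. Both helpers below are hypothetical userspace stand-ins: one counts the possible nodes, the way num_possible_nodes() does, and the other returns the highest possible node ID plus one, which is the quantity nr_node_ids represents.

```c
#include <stdint.h>

/* Number of possible nodes: a population count of the mask. */
static unsigned int count_possible_nodes(uint64_t possible_mask)
{
	unsigned int n = 0;

	for (; possible_mask; possible_mask &= possible_mask - 1)
		n++;
	return n;
}

/* Size needed for an array indexed by node ID: highest ID + 1. */
static unsigned int node_id_range(uint64_t possible_mask)
{
	unsigned int range = 0;

	for (unsigned int id = 0; id < 64; id++)
		if (possible_mask & (UINT64_C(1) << id))
			range = id + 1;
	return range;
}
```

For the example from the message (nodes 0 and 4), the count is 2 while the ID range is 5, so an array of 2 elements indexed with node ID 4 is accessed out of bounds.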
2026-02-25sched/deadline: Add reporting of runtime left & abs deadline to sched_getattr() for DEADLINE tasksTommaso Cucinotta3-9/+28
The SCHED_DEADLINE scheduler allows reading the statically configured run-time, deadline, and period parameters through the sched_getattr() system call. However, there is no immediate way to access, from user space, the current parameters used within the scheduler: the instantaneous runtime left in the current cycle, as well as the current absolute deadline. The `flags' sched_getattr() parameter, so far mandated to contain zero, now supports the SCHED_GETATTR_FLAG_DL_DYNAMIC=1 flag, to request retrieval of the leftover runtime and absolute deadline, converted to a CLOCK_MONOTONIC reference, instead of the statically configured parameters. This