path: root/tools/sched_ext
2026-04-21  tools/sched_ext: scx_qmap: Silence task_ctx lookup miss  (Tejun Heo; 1 file, -18/+6)
scx_fork() dispatches ops.init_task to exactly one scheduler - the one owning the forking task's cgroup. A task forked inside a sub-scheduler's cgroup is init'd into the sub only; the root scheduler has no task_ctx entry for it. When that task later appears as @prev in the root's qmap_dispatch() (or flows through core-sched comparison via task_qdist), the bpf_task_storage_get() legitimately misses. qmap treated those misses as fatal via scx_bpf_error("task_ctx lookup failed") and aborted the scheduler as soon as the first cross-sched task hit the root.

Drop the error at the sites where the miss is legitimate: lookup_task_ctx() (helper; callers already check for NULL), qmap_dispatch()'s @prev branch (bookkeeping only), task_qdist() (returns 0, which makes the comparison a no-op), and qmap_select_cpu() (returns prev_cpu as a no-op fallback instead of -ESRCH). The existing scx_bpf_error() was a paranoid guard from the pre-sub-sched world where every task was owned by the one and only scheduler.

v2: qmap_select_cpu() returns prev_cpu on NULL instead of -ESRCH, so the root scheduler doesn't error on cross-sched tasks that pass through it (Andrea Righi).

Fixes: 4f8b122848db ("sched_ext: Add basic building blocks for nested sub-scheduler dispatching")
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
Reviewed-by: Zhao Mengmeng <zhaomengmeng@kylinos.cn>
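For illustration, a minimal sketch of the pattern described above (not the actual scx_qmap diff; it assumes the usual scx BPF includes, a task-storage map named task_ctx_stor, and an illustrative task_ctx layout):

    struct task_ctx {
            int force_local;        /* illustrative field */
    };

    struct {
            __uint(type, BPF_MAP_TYPE_TASK_STORAGE);
            __uint(map_flags, BPF_F_NO_PREALLOC);
            __type(key, int);
            __type(value, struct task_ctx);
    } task_ctx_stor SEC(".maps");

    /* NULL is a legitimate result for tasks owned by a sub-scheduler. */
    static struct task_ctx *lookup_task_ctx(struct task_struct *p)
    {
            return bpf_task_storage_get(&task_ctx_stor, p, 0, 0);
    }

    s32 BPF_STRUCT_OPS(qmap_select_cpu, struct task_struct *p,
                       s32 prev_cpu, u64 wake_flags)
    {
            struct task_ctx *tctx = lookup_task_ctx(p);

            /* Cross-sched task: fall back to prev_cpu instead of erroring. */
            if (!tctx)
                    return prev_cpu;

            /* ... normal CPU selection using tctx ... */
            return prev_cpu;
    }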
2026-04-13  tools/sched_ext: Add explicit cast from void* in RESIZE_ARRAY()  (Kuba Piecuch; 1 file, -0/+1)
This fixes the following compilation error when using the header from C++ code:

  error: assigning to 'struct scx_flux__data_uei_dump *' from incompatible type 'void *'

Signed-off-by: Kuba Piecuch <jpiecuch@google.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
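For reference, the shape of the macro after the change looks roughly like this (a sketch of the pattern, not the verbatim header; the explicit cast of bpf_map__initial_value()'s void * return is what makes it compile as C++):

    #define RESIZE_ARRAY(__skel, elfsec, arr, n)                                \
            do {                                                                \
                    size_t __sz;                                                \
                    bpf_map__set_value_size((__skel)->maps.elfsec##_##arr,      \
                            sizeof((__skel)->elfsec->arr[0]) * (n));            \
                    (__skel)->elfsec->arr =                                     \
                            (typeof((__skel)->elfsec->arr)) /* the new cast */  \
                            bpf_map__initial_value(                             \
                                    (__skel)->maps.elfsec##_##arr, &__sz);      \
            } while (0)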
2026-04-13  sched_ext: Make string params of __ENUM_set() const  (Kuba Piecuch; 1 file, -1/+1)
A small change to improve type safety/const correctness. __COMPAT_read_enum() already has const string parameters. It fixes a warning when using the header in C++ code:

  error: ISO C++11 does not allow conversion from string literal to 'char *' [-Werror,-Wwritable-strings]

That's because string literals have type char[N] in C and const char[N] in C++.

Signed-off-by: Kuba Piecuch <jpiecuch@google.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2026-04-13  tools/sched_ext: Kick home CPU for stranded tasks in scx_qmap  (Tejun Heo; 1 file, -0/+40)
scx_qmap uses global BPF queue maps (BPF_MAP_TYPE_QUEUE) that any CPU's ops.dispatch() can pop from. When a CPU pops a task that can't run on it (e.g. a pinned per-CPU kthread), it inserts the task into SHARED_DSQ. consume_dispatch_q() then skips the task due to affinity mismatch, leaving it stranded until some CPU in its allowed mask calls ops.dispatch(). This doesn't cause indefinite stalls -- the periodic tick keeps firing (can_stop_idle_tick() returns false when softirq is pending) -- but can cause noticeable scheduling delays.

After inserting to SHARED_DSQ, kick the task's home CPU if this CPU can't run it.

There's a small race window where the home CPU can enter idle before the kick lands -- if a per-CPU kthread like ksoftirqd is the stranded task, this can trigger a "NOHZ tick-stop error" warning. The kick arrives shortly after and the home CPU drains the task.

Rather than fully eliminating the warning by routing pinned tasks to local or global DSQs, the current code keeps them going through the normal BPF queue path and documents the race and the resulting warning in detail. scx_qmap is an example scheduler and having tasks go through the usual dispatch path is useful for testing. The detailed comment also serves as a reference for other schedulers that may encounter similar warnings.

Reviewed-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
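A minimal sketch of the kick (the real scx_qmap change also carries the long explanatory comment; the helper name here is made up):

    /*
     * Called right after @p is inserted into SHARED_DSQ from ops.dispatch()
     * on @this_cpu. If this CPU can't run @p, consume_dispatch_q() will skip
     * it, so nudge @p's home CPU to drain the task instead of leaving it
     * stranded until the next tick.
     */
    static void kick_home_cpu(struct task_struct *p, s32 this_cpu)
    {
            if (!bpf_cpumask_test_cpu(this_cpu, p->cpus_ptr))
                    scx_bpf_kick_cpu(scx_bpf_task_cpu(p), 0);
    }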
2026-04-06  tools/sched_ext: Fix off-by-one in scx_sdt payload zeroing  (Cheng-Yang Chou; 1 file, -1/+2)
scx_alloc_free_idx() zeroes the payload of a freed arena allocation one word at a time. The loop bound was alloc->pool.elem_size / 8, but elem_size includes sizeof(struct sdt_data) (the 8-byte union sdt_id header). This caused the loop to write one extra u64 past the allocation, corrupting the tid field of the adjacent pool element.

Fix the loop bound to (elem_size - sizeof(struct sdt_data)) / 8 so only the payload portion is zeroed.

Test plan:

- Add a temporary sanity check in scx_task_free() before the free call:

    if (mval->data->tid.idx != mval->tid.idx)
            scx_bpf_error("tid corruption: arena=%d storage=%d",
                          mval->data->tid.idx, (int)mval->tid.idx);

- stress-ng --fork 100 -t 10 & sudo ./build/bin/scx_sdt

Without this fix, running scx_sdt under fork-heavy load triggers the corruption error. With the fix applied, the same workload completes without error.

Fixes: 36929ebd17ae ("tools/sched_ext: add arena based scheduler")
Signed-off-by: Cheng-Yang Chou <yphbchou0911@gmail.com>
Reviewed-by: Emil Tsalapatis <emil@etsalapatis.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
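The arithmetic behind the fix, as a standalone sketch (struct layout simplified from scx_sdt's arena allocator; the sub-fields of union sdt_id are illustrative):

    union sdt_id {
            __s64           val;
            struct {
                    __s32   idx;    /* illustrative sub-fields */
                    __s32   gen;
            };
    };

    struct sdt_data {
            union sdt_id    tid;    /* 8-byte header */
            __u64           payload[];
    };

    /* elem_size includes the header; zero only the payload words. */
    static void zero_payload(struct sdt_data *data, __u64 elem_size)
    {
            __u64 i, nwords = (elem_size - sizeof(struct sdt_data)) / 8;

            for (i = 0; i < nwords; i++)
                    data->payload[i] = 0;
    }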
2026-03-27  scx_central: Defer timer start to central dispatch to fix init error  (Zhao Mengmeng; 2 files, -44/+42)
scx_central currently assumes that ops.init() runs on the selected central CPU and aborts otherwise. This is no longer true, as ops.init() is invoked from the scx_enable_helper thread, which can run on any CPU. As a result, sched_setaffinity() from userspace doesn't work, causing scx_central to fail when loading with:

  [ 1985.319942] sched_ext: central: scx_central.bpf.c:314: init from non-central CPU
  [ 1985.320317]  scx_exit+0xa3/0xd0
  [ 1985.320535]  scx_bpf_error_bstr+0xbd/0x220
  [ 1985.320840]  bpf_prog_3a445a8163fa8149_central_init+0x103/0x1ba
  [ 1985.321073]  bpf__sched_ext_ops_init+0x40/0xa8
  [ 1985.321286]  scx_root_enable_workfn+0x507/0x1650
  [ 1985.321461]  kthread_worker_fn+0x260/0x940
  [ 1985.321745]  kthread+0x303/0x3e0
  [ 1985.321901]  ret_from_fork+0x589/0x7d0
  [ 1985.322065]  ret_from_fork_asm+0x1a/0x30

  DEBUG DUMP
  ===================================================================
  central: root scx_enable_help[134] triggered exit kind 1025:
    scx_bpf_error (scx_central.bpf.c:314: init from non-central CPU)

Fix this by:

- Defer bpf_timer_start() to the first dispatch on the central CPU.
- Initialize the BPF timer in central_init() and kick the central CPU to guarantee entering the dispatch path on the central CPU immediately.
- Remove the unnecessary sched_setaffinity() call in userspace.

Suggested-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Zhao Mengmeng <zhaomengmeng@kylinos.cn>
Signed-off-by: Tejun Heo <tj@kernel.org>
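A sketch of the deferral (names loosely follow scx_central; the flag name, map lookup, and constants here are illustrative rather than the exact patch):

    static bool central_timer_armed;        /* illustrative flag */

    void BPF_STRUCT_OPS(central_dispatch, s32 cpu, struct task_struct *prev)
    {
            if (cpu == central_cpu && !central_timer_armed) {
                    struct central_timer *timer;
                    u32 key = 0;

                    /*
                     * The timer was bpf_timer_init()'d in central_init();
                     * arm it on the first dispatch that actually runs on the
                     * central CPU so it is pinned there.
                     */
                    timer = bpf_map_lookup_elem(&central_timer, &key);
                    if (timer &&
                        !bpf_timer_start(&timer->timer, TIMER_INTERVAL_NS,
                                         BPF_F_TIMER_CPU_PIN))
                            central_timer_armed = true;
            }

            /* ... regular central dispatch ... */
    }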
2026-03-26  tools/sched_ext: Remove redundant SCX_ENQ_IMMED compat definition  (Tejun Heo; 1 file, -5/+0)
compat.bpf.h defined a fallback SCX_ENQ_IMMED macro using __COMPAT_ENUM_OR_ZERO(). After 6bf36c68b0a2 ("tools/sched_ext: Regenerate autogen enum headers") added SCX_ENQ_IMMED to the autogen headers, including both triggers -Wmacro-redefined warnings. The autogen definition through const volatile __weak already resolves to 0 on older kernels, providing the same backward compatibility. Remove the now-redundant compat fallback. Fixes: 6bf36c68b0a2 ("tools/sched_ext: Regenerate autogen enum headers") Link: https://lore.kernel.org/r/20260326100313.338388-1-zhaomzhao@126.com Reported-by: Zhao Mengmeng <zhaomengmeng@kylinos.cn> Signed-off-by: Tejun Heo <tj@kernel.org>
2026-03-25  tools/sched_ext: scx_pair: fix pair_ctx indexing for CPU pairs  (Zhao Mengmeng; 1 file, -2/+12)
scx_pair sizes pair_ctx to nr_cpu_ids / 2, so valid pair_ctx keys are dense pair indexes in the range [0, nr_cpu_ids / 2). However, the userspace setup code stores pair_id as the first CPU number in each pair. On an 8-CPU system with "-S 1", that produces pair IDs 0, 2, 4 and 6 for pairs [0,1], [2,3], [4,5] and [6,7]. CPUs in the latter half then look up pair_ctx with out-of-range keys and the BPF scheduler aborts with:

  EXIT: scx_bpf_error (scx_pair.bpf.c:328: failed to lookup pairc and in_pair_mask for cpu[5])

Assign pair_id using a dense pair counter instead so that each CPU pair maps to a valid pair_ctx entry. Also reject odd CPU configurations, as scx_pair requires all CPUs to be paired.

Fixes: f0262b102c7c ("tools/sched_ext: add scx_pair scheduler")
Signed-off-by: Zhao Mengmeng <zhaomengmeng@kylinos.cn>
Signed-off-by: Tejun Heo <tj@kernel.org>
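A self-contained sketch of the dense-counter idea (the real setup loop in scx_pair.c also fills pair_cpu and in_pair_idx arrays; this only shows how the dense pair index is derived):

    /*
     * Map each CPU to a dense pair index in [0, nr_cpu_ids / 2).
     * With stride 1 and 8 CPUs: CPUs 0/1 -> 0, 2/3 -> 1, 4/5 -> 2, 6/7 -> 3.
     */
    static void assign_pair_ids(int *pair_id, int nr_cpu_ids, int stride)
    {
            int i, j, next_id = 0;

            for (i = 0; i < nr_cpu_ids; i++)
                    pair_id[i] = -1;

            for (i = 0; i < nr_cpu_ids; i++) {
                    if (pair_id[i] >= 0)
                            continue;
                    j = (i + stride) % nr_cpu_ids;
                    pair_id[i] = pair_id[j] = next_id++;
            }
    }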
2026-03-25  tools/sched_ext: Regenerate autogen enum headers  (Cheng-Yang Chou; 3 files, -0/+22)
Regenerate enum_defs.autogen.h, enums.autogen.h and enums.autogen.bpf.h using the upstream scripts [1][2] to sync with recent kernel enum additions. [1] https://github.com/sched-ext/scx/blob/main/scripts/gen_enum_defs.py [2] https://github.com/sched-ext/scx/blob/main/scripts/gen_enums.py Signed-off-by: Cheng-Yang Chou <yphbchou0911@gmail.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2026-03-23  tools/sched_ext: Add scx_bpf_sub_dispatch() compat wrapper  (Cheng-Yang Chou; 2 files, -1/+13)
Add a transparent compatibility wrapper for the scx_bpf_sub_dispatch() kfunc in compat.bpf.h. This allows BPF schedulers using the sub-sched dispatch feature to build and run on older kernels that lack the kfunc. To avoid requiring code changes in individual schedulers, the transparent wrapper pattern is used instead of a __COMPAT prefix. The kfunc is declared with a ___compat suffix, while the static inline wrapper retains the original scx_bpf_sub_dispatch() name. When the kfunc is unavailable, the wrapper safely falls back to returning false. This is acceptable because the dispatch path cannot do anything useful without underlying sub-sched support anyway. Tested scx_qmap on v6.14 successfully. Signed-off-by: Cheng-Yang Chou <yphbchou0911@gmail.com> Signed-off-by: Tejun Heo <tj@kernel.org>
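The transparent-wrapper pattern looks roughly like this (the kfunc's real parameter list is not shown in this log, so the u64 sub_cgroup_id signature below is a placeholder):

    /*
     * Kernel-side kfunc, declared with a ___compat suffix so libbpf resolves
     * it to scx_bpf_sub_dispatch() when present and leaves it as a weak,
     * unresolved ksym on older kernels.
     */
    bool scx_bpf_sub_dispatch___compat(u64 sub_cgroup_id) __ksym __weak;

    static inline bool scx_bpf_sub_dispatch(u64 sub_cgroup_id)
    {
            if (bpf_ksym_exists(scx_bpf_sub_dispatch___compat))
                    return scx_bpf_sub_dispatch___compat(sub_cgroup_id);

            /* No sub-sched support in this kernel; nothing to dispatch. */
            return false;
    }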
2026-03-22  tools/sched_ext: Add compat handling for sub-scheduler ops  (Andrea Righi; 1 file, -0/+16)
Extend SCX_OPS_OPEN() with compatibility handling for ops.sub_attach() and ops.sub_detach(), allowing scx C schedulers with sub-scheduler support to run both on kernels that have it and on kernels that don't.

Cc: Cheng-Yang Chou <yphbchou0911@gmail.com>
Fixes: ebeca1f930ea ("sched_ext: Introduce cgroup sub-sched support")
Signed-off-by: Andrea Righi <arighi@nvidia.com>
Reviewed-by: Cheng-Yang Chou <yphbchou0911@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2026-03-21  tools/sched_ext: Update stale scx_ops_error() comment in fcg_cgroup_move()  (Ke Zhao; 1 file, -1/+1)
The function scx_ops_error() was dropped, but the comment here is left pointing to the old name. Update to be consistent with current API. Signed-off-by: Ke Zhao <ke.zhao.kernel@gmail.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2026-03-17  sched_ext: Fix typos in comments  (zhidao su; 1 file, -1/+1)
Fix five typos across three files:

- kernel/sched/ext.c: 'monotically' -> 'monotonically' (line 55)
- kernel/sched/ext.c: 'used by to check' -> 'used to check' (line 56)
- kernel/sched/ext.c: 'hardlockdup' -> 'hardlockup' (line 3881)
- kernel/sched/ext_idle.c: 'don't perfectly overlaps' -> 'don't perfectly overlap' (line 371)
- tools/sched_ext/scx_flatcg.bpf.c: 'shaer' -> 'share' (line 21)

Signed-off-by: zhidao su <suzhidao@xiaomi.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2026-03-14  sched_ext: Update demo schedulers and selftests to use scx_bpf_task_set_dsq_vtime()  (Cheng-Yang Chou; 2 files, -7/+11)
Direct writes to p->scx.dsq_vtime are deprecated in favor of scx_bpf_task_set_dsq_vtime(). Update scx_simple, scx_flatcg, and the select_cpu_vtime selftest to use the new kfunc with scale_by_task_weight_inverse().

Signed-off-by: Cheng-Yang Chou <yphbchou0911@gmail.com>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2026-03-13  sched_ext: Add SCX_OPS_ALWAYS_ENQ_IMMED ops flag  (Tejun Heo; 3 files, -41/+39)
SCX_ENQ_IMMED makes enqueue to local DSQs succeed only if the task can start running immediately. Otherwise, the task is re-enqueued through ops.enqueue(). This provides tighter control but requires specifying the flag on every insertion.

Add SCX_OPS_ALWAYS_ENQ_IMMED ops flag. When set, SCX_ENQ_IMMED is automatically applied to all local DSQ enqueues including through scx_bpf_dsq_move_to_local().

scx_qmap is updated with -I option to test the feature and -F option for IMMED stress testing which forces every Nth enqueue to a busy local DSQ.

v2:
- Cover scx_bpf_dsq_move_to_local() path (now has enq_flags via ___v2).
- scx_qmap: Remove sched_switch and cpu_release handlers (superseded by kernel-side wakeup_preempt_scx()). Add -F for IMMED stress testing.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
2026-03-13  sched_ext: Add enq_flags to scx_bpf_dsq_move_to_local()  (Tejun Heo; 7 files, -15/+20)
scx_bpf_dsq_move_to_local() moves a task from a non-local DSQ to the current CPU's local DSQ. This is an indirect way of dispatching to a local DSQ and should support enq_flags like direct dispatches do - e.g. SCX_ENQ_HEAD for head-of-queue insertion and SCX_ENQ_IMMED for immediate execution guarantees. Add scx_bpf_dsq_move_to_local___v2() with an enq_flags parameter. The original becomes a v1 compat wrapper passing 0. The compat macro is updated to a three-level chain: v2 (7.1+) -> v1 (current) -> scx_bpf_consume (pre-rename). All in-tree BPF schedulers are updated to pass 0. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Andrea Righi <arighi@nvidia.com>
2026-03-13  sched_ext: Implement SCX_ENQ_IMMED  (Tejun Heo; 1 file, -0/+5)
Add SCX_ENQ_IMMED enqueue flag for local DSQ insertions. Once a task is dispatched with IMMED, it either gets on the CPU immediately and stays on it, or gets reenqueued back to the BPF scheduler. It will never linger on a local DSQ behind other tasks or on a CPU taken by a higher-priority class.

rq_is_open() uses rq->next_class to determine whether the rq is available, and wakeup_preempt_scx() triggers reenqueue when a higher-priority class task arrives. These capture all higher class preemptions. Combined with reenqueue points in the dispatch path, all cases where an IMMED task would not execute immediately are covered.

SCX_TASK_IMMED persists in p->scx.flags until the next fresh enqueue, so the guarantee survives SAVE/RESTORE cycles. If preempted while running, put_prev_task_scx() reenqueues through ops.enqueue() with SCX_TASK_REENQ_PREEMPTED instead of silently placing the task back on the local DSQ.

This enables tighter scheduling latency control by preventing tasks from piling up on local DSQs. It also enables opportunistic CPU sharing across sub-schedulers - without this, a sub-scheduler can stuff the local DSQ of a shared CPU, making it difficult for others to use.

v2:
- Rewrite is_curr_done() as rq_is_open() using rq->next_class and implement wakeup_preempt_scx() to achieve complete coverage of all cases where IMMED tasks could get stranded.
- Track IMMED persistently in p->scx.flags and reenqueue preempted-while-running tasks through ops.enqueue().
- Bound deferred reenq cycles (SCX_REENQ_LOCAL_MAX_REPEAT).
- Misc renames, documentation.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
2026-03-11  sched_ext: Fix incomplete help text usage strings  (Cheng-Yang Chou; 5 files, -5/+5)
Several demo schedulers and the selftest runner had usage strings that omitted options which are actually supported:

- scx_central: add missing [-v]
- scx_pair: add missing [-v]
- scx_qmap: add missing [-S] and [-H]
- scx_userland: add missing [-v]
- scx_sdt: remove [-f] which no longer exists
- runner.c: add missing [-s], [-l], [-q]; drop [-h] which none of the other sched_ext tools list in their usage lines

Suggested-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Cheng-Yang Chou <yphbchou0911@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2026-03-09  sched_ext: remove SCX_OPS_HAS_CGROUP_WEIGHT  (Zhao Mengmeng; 2 files, -2/+1)
While running scx_flatcg, dmesg prints "SCX_OPS_HAS_CGROUP_WEIGHT is deprecated and a noop". In the code, SCX_OPS_HAS_CGROUP_WEIGHT has been marked as deprecated and scheduled for removal in 6.18. Now it's time to do it.

Signed-off-by: Zhao Mengmeng <zhaomengmeng@kylinos.cn>
Signed-off-by: Tejun Heo <tj@kernel.org>
2026-03-07  tools/sched_ext/include: Regenerate enum_defs.autogen.h  (Tejun Heo; 1 file, -12/+37)
Regenerate enum_defs.autogen.h from the current vmlinux.h to pick up new SCX enums added in the for-7.1 cycle. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Andrea Righi <arighi@nvidia.com>
2026-03-07  tools/sched_ext/include: Add libbpf version guard for assoc_struct_ops  (Tejun Heo; 1 file, -6/+29)
Extract the inline bpf_program__assoc_struct_ops() call in SCX_OPS_LOAD() into a __scx_ops_assoc_prog() helper and wrap it with a libbpf >= 1.7 version guard. bpf_program__assoc_struct_ops() was added in libbpf 1.7; the guard provides a no-op fallback for older versions. Add the <bpf/libbpf.h> include needed by the helper, and fix "assumming" typo in a nearby comment. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Andrea Righi <arighi@nvidia.com>
2026-03-07  tools/sched_ext/include: Add __COMPAT_HAS_scx_bpf_select_cpu_and macro  (Tejun Heo; 1 file, -0/+8)
scx_bpf_select_cpu_and() is now an inline wrapper so bpf_ksym_exists(scx_bpf_select_cpu_and) no longer works. Add __COMPAT_HAS_scx_bpf_select_cpu_and macro that checks for either the struct args type (new) or the compat ksym (old) to test availability. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Andrea Righi <arighi@nvidia.com>
2026-03-07  tools/sched_ext/include: Add missing helpers to common.bpf.h  (Tejun Heo; 1 file, -0/+277)
Sync several helpers from the scx repo:

- bpf_cgroup_acquire() ksym declaration
- __sink() macro for hiding values from verifier precision tracking
- ctzll() count-trailing-zeros implementation
- get_prandom_u64() helper
- scx_clock_task/pelt/virt/irq() clock helpers with get_current_rq()

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Andrea Righi <arighi@nvidia.com>
2026-03-07  tools/sched_ext/include: Sync bpf_arena_common.bpf.h with scx repo  (Tejun Heo; 1 file, -2/+6)
Sync the following changes from the scx repo:

- Guard the __arena define with #ifndef to avoid redefinition when the attribute is already defined by another header.
- Add bpf_arena_reserve_pages() and bpf_arena_mapping_nr_pages() ksym declarations.
- Rename TEST to SCX_BPF_UNITTEST to avoid collision with generic TEST macros in other projects.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Andrea Righi <arighi@nvidia.com>
2026-03-07  tools/sched_ext/include: Remove dead sdt_task_defs.h guard from common.h  (Tejun Heo; 1 file, -4/+0)
The __has_include guard for sdt_task_defs.h is vestigial — the only remaining content is the bpf_arena_common.h include which is available unconditionally. Remove the dead guard. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Andrea Righi <arighi@nvidia.com>
2026-03-07  sched_ext: Implement scx_bpf_dsq_reenq() for user DSQs  (Tejun Heo; 1 file, -2/+55)
scx_bpf_dsq_reenq() currently only supports local DSQs. Extend it to support user-defined DSQs by adding a deferred re-enqueue mechanism similar to the local DSQ handling. Add per-cpu deferred_reenq_user_node/flags to scx_dsq_pcpu and deferred_reenq_users list to scx_rq. When scx_bpf_dsq_reenq() is called on a user DSQ, the DSQ's per-cpu node is added to the current rq's deferred list. process_deferred_reenq_users() then iterates the DSQ using the cursor helpers and re-enqueues each task. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Andrea Righi <arighi@nvidia.com>
2026-03-07  sched_ext: Introduce scx_bpf_dsq_reenq() for remote local DSQ reenqueue  (Tejun Heo; 3 files, -4/+33)
scx_bpf_reenqueue_local() can only trigger re-enqueue of the current CPU's local DSQ. Introduce scx_bpf_dsq_reenq() which takes a DSQ ID and can target any local DSQ including remote CPUs via SCX_DSQ_LOCAL_ON | cpu. This will be expanded to support user DSQs by future changes. scx_bpf_reenqueue_local() is reimplemented as a simple wrapper around scx_bpf_dsq_reenq(SCX_DSQ_LOCAL, 0) and may be deprecated in the future. Update compat.bpf.h with a compatibility shim and scx_qmap to test the new functionality. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Andrea Righi <arighi@nvidia.com>
2026-03-06  sched_ext: Add basic building blocks for nested sub-scheduler dispatching  (Tejun Heo; 2 files, -1/+37)
This is an early-stage partial implementation that demonstrates the core building blocks for nested sub-scheduler dispatching. While significant work remains in the enqueue path and other areas, this patch establishes the fundamental mechanisms needed for hierarchical scheduler operation.

The key building blocks introduced include:

- Private stack support for ops.dispatch() to prevent stack overflow when walking down nested schedulers during dispatch operations
- scx_bpf_sub_dispatch() kfunc that allows parent schedulers to trigger dispatch operations on their direct child schedulers
- Proper parent-child relationship validation to ensure dispatch requests are only made to legitimate child schedulers
- Updated scx_dispatch_sched() to handle both nested and non-nested invocations with appropriate kf_mask handling

The qmap scheduler is updated to demonstrate the functionality by calling scx_bpf_sub_dispatch() on registered child schedulers when it has no tasks in its own queues.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
2026-03-06  sched_ext: Introduce scx_prog_sched()  (Tejun Heo; 1 file, -0/+10)
In preparation for multiple scheduler support, introduce scx_prog_sched() accessor which returns the scx_sched instance associated with a BPF program. The association is determined via the special KF_IMPLICIT_ARGS kfunc parameter, which provides access to bpf_prog_aux. This aux can be used to retrieve the struct_ops (sched_ext_ops) that the program is associated with, and from there, the corresponding scx_sched instance. For compatibility, when ops.sub_attach is not implemented (older schedulers without sub-scheduler support), unassociated programs fall back to scx_root. A warning is logged once per scheduler for such programs. As scx_root is still the only scheduler, this shouldn't introduce user-visible behavior changes. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Andrea Righi <arighi@nvidia.com>
2026-03-06  sched_ext: Introduce cgroup sub-sched support  (Tejun Heo; 2 files, -2/+20)
A system often runs multiple workloads, especially in multi-tenant server environments where a system is split into partitions servicing separate more-or-less independent workloads, each requiring an application-specific scheduler. To support such and other use cases, sched_ext is in the process of growing multiple scheduler support.

When partitioning a system in terms of CPUs for such use cases, an oft-taken approach is hard partitioning the system using cpuset. While it would be possible to tie sched_ext multiple scheduler support to cpuset partitions, such an approach would have fundamental limitations stemming from the lack of dynamism and flexibility. Users often don't care which specific CPUs are assigned to which workload and want to take advantage of optimizations which are enabled by running workloads on a larger machine - e.g. opportunistic over-commit, improving latency critical workload characteristics while maintaining bandwidth fairness, employing control mechanisms based on different criteria than on-CPU time for e.g. flexible memory bandwidth isolation, packing similar parts from different workloads on same L3s to improve cache efficiency, and so on.

As these sorts of dynamic behaviors are impossible or difficult to implement with hard partitioning, sched_ext is implementing cgroup sub-sched support where schedulers can be attached to the cgroup hierarchy and a parent scheduler is responsible for controlling the CPUs that each child can use at any given moment. This makes CPU distribution dynamically controlled by BPF, allowing high flexibility.

This patch adds the skeletal sched_ext cgroup sub-sched support:

- sched_ext_ops.sub_cgroup_id and .sub_attach/detach() are added. Non-zero sub_cgroup_id indicates that the scheduler is to be attached to the identified cgroup. A sub-sched is attached to the cgroup iff the nearest ancestor scheduler implements .sub_attach() and grants the attachment. Max nesting depth is limited by SCX_SUB_MAX_DEPTH.
- When a scheduler exits, all its descendant schedulers are exited together. Also, cgroup.scx_sched is added, which points to the effective scheduler instance for the cgroup. This is updated on scheduler init/exit and inherited on cgroup online. When a cgroup is offlined, the attached scheduler is automatically exited.
- Sub-sched support is gated on CONFIG_EXT_SUB_SCHED which is automatically enabled if both SCX and cgroups are enabled. Sub-sched support is not tied to the CPU controller but rather the cgroup hierarchy itself. This is intentional as the support for cpu.weight and cpu.max based resource control is orthogonal to sub-sched support. Note that CONFIG_CGROUPS around cgroup subtree iteration support for scx_task_iter is replaced with CONFIG_EXT_SUB_SCHED for consistency.
- This allows loading sub-scheds and most framework operations such as propagating disable down the hierarchy work. However, sub-scheds are not operational yet and all tasks stay with the root sched. This will serve as the basis for building up full sub-sched support.
- DSQs point to the scx_sched they belong to.
- scx_qmap is updated to allow attachment of sub-scheds and also serving as sub-scheds.
- scx_is_descendant() is added but not yet used in this patch. It is used by later changes in the series and placed here as this is where the function belongs.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
2026-03-06  Merge branch 'for-7.0-fixes' into for-7.1  (Tejun Heo; 6 files, -10/+70)
To prepare for hierarchical scheduling patchset which will cause multiple conflicts otherwise. Signed-off-by: Tejun Heo <tj@kernel.org>
2026-03-02  tools/sched_ext: Add -fms-extensions to bpf build flags  (Zhao Mengmeng; 1 file, -0/+2)
Similar to commit 835a50753579 ("selftests/bpf: Add -fms-extensions to bpf build flags") and commit 639f58a0f480 ("bpftool: Fix build warnings due to MS extensions"), the kernel is now built with -fms-extensions, therefore the generated vmlinux.h contains types like:

  struct aes_key {
          struct aes_enckey;
          union aes_invkey_arch inv_k;
  };

  struct ns_common {
          ...
          union {
                  struct ns_tree;
                  struct callback_head ns_rcu;
          };
  };

which raise warnings like the one below when building the scx schedulers:

  tools/sched_ext/build/include/vmlinux.h:50533:3: warning: declaration does not declare anything [-Wmissing-declarations]
  50533 |   struct ns_tree;
        |   ^

Fix it by using the -fms-extensions and -Wno-microsoft-anon-tag flags to build bpf programs that #include "vmlinux.h".

Signed-off-by: Zhao Mengmeng <zhaomengmeng@kylinos.cn>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2026-02-27  tools/sched_ext: fix strtoul() misuse in scx_hotplug_seq()  (David Carlier; 1 file, -2/+5)
scx_hotplug_seq() uses strtoul() but validates the result with a negative check (val < 0), which can never be true for an unsigned return value. Use the endptr mechanism to verify the entire string was consumed, and check errno == ERANGE for overflow detection. Signed-off-by: David Carlier <devnexen@gmail.com> Signed-off-by: Tejun Heo <tj@kernel.org>
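The described validation pattern, as a self-contained sketch (scx_hotplug_seq() reads the sequence number from sysfs; this only shows the parsing side, with a made-up helper name):

    #include <errno.h>
    #include <stdlib.h>

    static unsigned long parse_hotplug_seq(const char *buf)
    {
            char *endptr;
            unsigned long val;

            errno = 0;
            val = strtoul(buf, &endptr, 10);

            /* reject empty input and trailing garbage */
            if (endptr == buf || (*endptr && *endptr != '\n'))
                    return 0;
            /* reject out-of-range values */
            if (errno == ERANGE)
                    return 0;

            return val;
    }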
2026-02-24  tools/sched_ext: Add Kconfig to sync with upstream  (Cheng-Yang Chou; 1 file, -0/+61)
Add the missing Kconfig file to tools/sched_ext/ as referenced in the README. Ref: https://github.com/sched-ext/scx/blob/main/kernel.config Signed-off-by: Cheng-Yang Chou <yphbchou0911@gmail.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2026-02-24  tools/sched_ext: Sync README.md Kconfig with upstream scx  (Cheng-Yang Chou; 1 file, -6/+0)
Sync the documentation with the upstream scx repository to reflect the current recommended configuration. Ref: https://github.com/sched-ext/scx/blob/main/README.md#build--install Signed-off-by: Cheng-Yang Chou <yphbchou0911@gmail.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2026-02-23  sched_ext: Fix ops.dequeue() semantics  (Andrea Righi; 3 files, -0/+4)
Currently, ops.dequeue() is only invoked when the sched_ext core knows that a task resides in BPF-managed data structures, which causes it to miss scheduling property change events. In addition, ops.dequeue() callbacks are completely skipped when tasks are dispatched to non-local DSQs from ops.select_cpu(). As a result, BPF schedulers cannot reliably track task state.

Fix this by guaranteeing that each task entering the BPF scheduler's custody triggers exactly one ops.dequeue() call when it leaves that custody, whether the exit is due to a dispatch (regular or via a core scheduling pick) or to a scheduling property change (e.g. sched_setaffinity(), sched_setscheduler(), set_user_nice(), NUMA balancing, etc.).

BPF scheduler custody concept: a task is considered to be in the BPF scheduler's custody when the scheduler is responsible for managing its lifecycle. This includes tasks dispatched to user-created DSQs or stored in the BPF scheduler's internal data structures from ops.enqueue(). Custody ends when the task is dispatched to a terminal DSQ (such as the local DSQ or %SCX_DSQ_GLOBAL), selected by core scheduling, or removed due to a property change. Tasks directly dispatched to terminal DSQs bypass the BPF scheduler entirely and are never in its custody.

Terminal DSQs include:

- Local DSQs (%SCX_DSQ_LOCAL or %SCX_DSQ_LOCAL_ON): per-CPU queues where tasks go directly to execution.
- Global DSQ (%SCX_DSQ_GLOBAL): the built-in fallback queue where the BPF scheduler is considered "done" with the task.

As a result, ops.dequeue() is not invoked for tasks directly dispatched to terminal DSQs.

To identify dequeues triggered by scheduling property changes, introduce the new ops.dequeue() flag %SCX_DEQ_SCHED_CHANGE: when this flag is set, the dequeue was caused by a scheduling property change.

New ops.dequeue() semantics: ops.dequeue() is invoked exactly once when the task leaves the BPF scheduler's custody, in one of the following cases:

  a) regular dispatch: a task dispatched to a user DSQ or stored in internal BPF data structures is moved to a terminal DSQ (ops.dequeue() called without any special flags set),
  b) core scheduling dispatch: core-sched picks the task before dispatch (ops.dequeue() called with the %SCX_DEQ_CORE_SCHED_EXEC flag set),
  c) property change: task properties are modified before dispatch (ops.dequeue() called with the %SCX_DEQ_SCHED_CHANGE flag set).

This allows BPF schedulers to:

- reliably track task ownership and lifecycle,
- maintain accurate accounting of managed tasks,
- update internal state when tasks change properties.

Cc: Tejun Heo <tj@kernel.org>
Cc: Emil Tsalapatis <emil@etsalapatis.com>
Cc: Kuba Piecuch <jpiecuch@google.com>
Signed-off-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2026-02-23  tools/sched_ext: scx_sdt: Remove unused '-f' option  (Cheng-Yang Chou; 1 file, -1/+1)
The '-f' option is defined in getopt() but not handled in the switch statement or documented in the help text. Providing '-f' currently triggers the default error path. Remove it to sync the optstring with the actual implementation. Signed-off-by: Cheng-Yang Chou <yphbchou0911@gmail.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2026-02-23  tools/sched_ext: scx_central: Remove unused '-p' option  (Cheng-Yang Chou; 1 file, -1/+1)
The '-p' option is defined in getopt() but not handled in the switch statement or documented in the help text. Providing '-p' currently triggers the default error path. Remove it to sync the optstring with the actual implementation. Signed-off-by: Cheng-Yang Chou <yphbchou0911@gmail.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2026-02-20  tools/sched_ext: fix getopt not re-parsed on restart  (David Carlier; 6 files, -0/+6)
After goto restart, optind retains its advanced position from the previous getopt loop, causing getopt() to immediately return -1. This silently drops all command-line options on the restarted skeleton. Reset optind to 1 at the restart label so options are re-parsed. Affected schedulers: scx_simple, scx_central, scx_flatcg, scx_pair, scx_sdt, scx_cpu0. Signed-off-by: David Carlier <devnexen@gmail.com> Signed-off-by: Tejun Heo <tj@kernel.org>
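The fix boils down to one line per scheduler; a minimal illustration of the restart path (the option string and restart condition are placeholders, not any particular scheduler's):

    #include <unistd.h>

    int main(int argc, char **argv)
    {
            int opt;

    restart:
            /*
             * getopt() state persists across the goto; without this reset the
             * re-parse loop immediately sees -1 and all options are dropped.
             */
            optind = 1;
            while ((opt = getopt(argc, argv, "v")) != -1) {
                    /* ... handle options ... */
            }

            /* ... open, load and run the skeleton ... */
            if (0 /* e.g. should_restart() after a hotplug event */)
                    goto restart;
            return 0;
    }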
2026-02-20  tools/sched_ext: scx_userland: fix data races on shared counters  (David Carlier; 1 file, -13/+13)
The stats thread reads nr_vruntime_enqueues, nr_vruntime_dispatches, nr_vruntime_failed, and nr_curr_enqueued concurrently with the main thread writing them, with no synchronization. Use __atomic builtins with relaxed ordering for all accesses to these counters to eliminate the data races. Only display accuracy is affected, not scheduling correctness. Signed-off-by: David Carlier <devnexen@gmail.com> Signed-off-by: Tejun Heo <tj@kernel.org>
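A sketch of the counter accesses after the change (one counter shown; the reader/writer helper names are illustrative, the __atomic builtins and relaxed ordering are the point):

    #include <linux/types.h>

    static __u64 nr_vruntime_enqueues;

    /* writer: main scheduling thread */
    static void account_enqueue(void)
    {
            __atomic_fetch_add(&nr_vruntime_enqueues, 1, __ATOMIC_RELAXED);
    }

    /* reader: stats thread */
    static __u64 read_enqueues(void)
    {
            return __atomic_load_n(&nr_vruntime_enqueues, __ATOMIC_RELAXED);
    }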
2026-02-18  tools/sched_ext: scx_pair: fix stride == 0 crash on single-CPU systems  (David Carlier; 1 file, -1/+6)
nr_cpu_ids / 2 produces stride 0 on a single-CPU system, which later causes SCX_BUG_ON(i == j) to fire. Validate stride after option parsing to also catch invalid user-supplied values via -S. Signed-off-by: David Carlier <devnexen@gmail.com> Signed-off-by: Tejun Heo <tj@kernel.org>
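A sketch of the post-parse validation (helper name and message text are illustrative):

    #include <stdio.h>

    /* Reject stride values that can't form valid CPU pairs. */
    static int validate_stride(unsigned int stride, unsigned int nr_cpu_ids)
    {
            if (stride == 0 || stride >= nr_cpu_ids) {
                    fprintf(stderr, "invalid stride %u for %u CPUs\n",
                            stride, nr_cpu_ids);
                    return -1;
            }
            return 0;
    }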
2026-02-18  tools/sched_ext: scx_central: fix CPU_SET and skeleton leak on early exit  (David Carlier; 1 file, -1/+2)
Use CPU_SET_S() instead of CPU_SET() on the dynamically allocated cpuset to avoid a potential out-of-bounds write when nr_cpu_ids exceeds CPU_SETSIZE. Also destroy the skeleton before returning on invalid central CPU ID to prevent a resource leak. Signed-off-by: David Carlier <devnexen@gmail.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2026-02-16  tools/sched_ext: scx_userland: fix stale data on restart  (David Carlier; 1 file, -0/+8)
Reset all counters, tasks and vruntime_head list on restart. Signed-off-by: David Carlier <devnexen@gmail.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2026-02-16  tools/sched_ext: scx_flatcg: fix potential stack overflow from VLA in fcg_read_stats  (David Carlier; 1 file, -4/+9)
fcg_read_stats() had a VLA allocating 21 * nr_cpus * 8 bytes on the stack, risking stack overflow on large CPU counts (nr_cpus can be up to 512). Fix by using a single heap allocation with the correct size, reusing it across all stat indices, and freeing it at the end.

Signed-off-by: David Carlier <devnexen@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
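The general shape of the change (function and parameter names are approximations, not scx_flatcg's exact code; one heap buffer is reused for every per-CPU stat index instead of a VLA):

    #include <stdlib.h>
    #include <linux/types.h>
    #include <bpf/bpf.h>

    static int read_stats(int map_fd, int nr_stats, int nr_cpus, __u64 *stats)
    {
            __u64 *cnts = calloc(nr_cpus, sizeof(*cnts));
            int idx, cpu, ret = 0;

            if (!cnts)
                    return -1;

            for (idx = 0; idx < nr_stats; idx++) {
                    if (bpf_map_lookup_elem(map_fd, &idx, cnts)) {
                            ret = -1;
                            break;
                    }
                    stats[idx] = 0;
                    for (cpu = 0; cpu < nr_cpus; cpu++)
                            stats[idx] += cnts[cpu];
            }

            free(cnts);
            return ret;
    }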
2026-02-12  tools/sched_ext: scx_userland: fix restart and stats thread lifecycle bugs  (David Carlier; 1 file, -2/+3)
Fix three issues in scx_userland's restart path:

- exit_req is not reset on restart, causing sched_main_loop() to exit immediately without doing any scheduling work.
- stats_printer thread handle is local to spawn_stats_thread(), making it impossible to join from main(). Promote it to file scope.
- The stats thread continues reading skel->bss after the skeleton is destroyed on restart, causing a use-after-free. Join the stats thread before destroying the skeleton to ensure it has exited.

Signed-off-by: David Carlier <devnexen@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2026-02-12  tools/sched_ext: scx_central: fix sched_setaffinity() call with the set size  (David Carlier; 1 file, -2/+4)
The cpu set is dynamically allocated for nr_cpu_ids using CPU_ALLOC(), so the size passed to sched_setaffinity() should be CPU_ALLOC_SIZE() rather than sizeof(cpu_set_t). Valgrind flagged this as accessing unaddressable bytes past the allocation. Signed-off-by: David Carlier <devnexen@gmail.com> Signed-off-by: Tejun Heo <tj@kernel.org>
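A sketch of the dynamic-set usage (the helper name is made up; the point is that CPU_ALLOC_SIZE(), CPU_ZERO_S()/CPU_SET_S() and sched_setaffinity() all have to agree on the allocation size):

    #define _GNU_SOURCE
    #include <sched.h>

    /* Pin the calling thread to @cpu using a set sized for nr_cpu_ids. */
    static int pin_to_cpu(unsigned int nr_cpu_ids, int cpu)
    {
            size_t setsz = CPU_ALLOC_SIZE(nr_cpu_ids);
            cpu_set_t *cpuset = CPU_ALLOC(nr_cpu_ids);
            int ret;

            if (!cpuset)
                    return -1;
            CPU_ZERO_S(setsz, cpuset);
            CPU_SET_S(cpu, setsz, cpuset);
            /* pass the real allocation size, not sizeof(cpu_set_t) */
            ret = sched_setaffinity(0, setsz, cpuset);
            CPU_FREE(cpuset);
            return ret;
    }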
2026-02-12  tools/sched_ext: scx_flatcg: zero-initialize stats counter array  (David Carlier; 1 file, -0/+1)
The local cnts array in read_stats() is not initialized before being accumulated into per-CPU stats, which may lead to reading garbage values. Zero it out with memset alongside the existing stats array initialization. Signed-off-by: David Carlier <devnexen@gmail.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2026-02-11  Merge tag 'sched_ext-for-6.20' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext  (Linus Torvalds; 15 files, -8/+2567)
Pull sched_ext updates from Tejun Heo:

- Move C example schedulers back from the external scx repo to tools/sched_ext as the authoritative source. scx_userland and scx_pair are returning while scx_sdt (BPF arena-based task data management) is new. These schedulers will be dropped from the external repo.

- Improve error reporting by adding scx_bpf_error() calls when DSQ creation fails across all in-tree schedulers

- Avoid redundant irq_work_queue() calls in destroy_dsq() by only queueing when llist_add() indicates an empty list

- Fix flaky init_enable_count selftest by properly synchronizing pre-forked children using a pipe instead of sleep()

* tag 'sched_ext-for-6.20' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext:
  selftests/sched_ext: Fix init_enable_count flakiness
  tools/sched_ext: Fix data header access during free in scx_sdt
  tools/sched_ext: Add error logging for dsq creation failures in remaining schedulers
  tools/sched_ext: add arena based scheduler
  tools/sched_ext: add scx_pair scheduler
  tools/sched_ext: add scx_userland scheduler
  sched_ext: Add error logging for dsq creation failures
  sched_ext: Avoid multiple irq_work_queue() calls in destroy_dsq()
2026-02-02  tools/sched_ext: Fix data header access during free in scx_sdt  (Emil Tsalapatis; 1 file, -1/+1)
Fix a pointer arithmetic error in scx_sdt during freeing that causes the allocator to use the wrong memory address for the allocation's data header. Fixes: 36929ebd17ae ("tools/sched_ext: add arena based scheduler") Signed-off-by: Emil Tsalapatis <emil@etsalapatis.com> Acked-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2026-02-01  tools/sched_ext: Add error logging for dsq creation failures in remaining schedulers  (zhidao su; 4 files, -4/+34)
Add scx_bpf_error() calls when scx_bpf_create_dsq() fails in the remaining schedulers to improve debuggability:

- scx_simple.bpf.c: simple_init()
- scx_sdt.bpf.c: sdt_init()
- scx_cpu0.bpf.c: cpu0_init()
- scx_flatcg.bpf.c: fcg_init()

This follows the same pattern established in commit 2f8d489897ae ("sched_ext: Add error logging for dsq creation failures") for other schedulers and ensures consistent error reporting across all schedulers.

Signed-off-by: zhidao su <suzhidao@xiaomi.com>
Signed-off-by: Tejun Heo <tj@kernel.org>