| Age | Commit message (Collapse) | Author | Files | Lines |
|
This fixes the following compilation error when using the header from
C++ code:
error: assigning to 'struct scx_flux__data_uei_dump *' from
incompatible type 'void *'
Signed-off-by: Kuba Piecuch <jpiecuch@google.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
|
|
A small change to improve type safety/const correctness.
__COMPAT_read_enum() already has const string parameters.
It fixes a warning when using the header in C++ code:
error: ISO C++11 does not allow conversion from string literal
to 'char *' [-Werror,-Wwritable-strings]
That's because string literals have type char[N] in C and
const char[N] in C++.
Signed-off-by: Kuba Piecuch <jpiecuch@google.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
|
|
compat.bpf.h defined a fallback SCX_ENQ_IMMED macro using
__COMPAT_ENUM_OR_ZERO(). After 6bf36c68b0a2 ("tools/sched_ext:
Regenerate autogen enum headers") added SCX_ENQ_IMMED to the autogen
headers, including both triggers -Wmacro-redefined warnings.
The autogen definition through const volatile __weak already resolves to
0 on older kernels, providing the same backward compatibility. Remove
the now-redundant compat fallback.
Fixes: 6bf36c68b0a2 ("tools/sched_ext: Regenerate autogen enum headers")
Link: https://lore.kernel.org/r/20260326100313.338388-1-zhaomzhao@126.com
Reported-by: Zhao Mengmeng <zhaomengmeng@kylinos.cn>
Signed-off-by: Tejun Heo <tj@kernel.org>
|
|
Regenerate enum_defs.autogen.h, enums.autogen.h and enums.autogen.bpf.h
using the upstream scripts [1][2] to sync with recent kernel enum
additions.
[1] https://github.com/sched-ext/scx/blob/main/scripts/gen_enum_defs.py
[2] https://github.com/sched-ext/scx/blob/main/scripts/gen_enums.py
Signed-off-by: Cheng-Yang Chou <yphbchou0911@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
|
|
Add a transparent compatibility wrapper for the scx_bpf_sub_dispatch()
kfunc in compat.bpf.h. This allows BPF schedulers using the sub-sched
dispatch feature to build and run on older kernels that lack the kfunc.
To avoid requiring code changes in individual schedulers, the
transparent wrapper pattern is used instead of a __COMPAT prefix. The
kfunc is declared with a ___compat suffix, while the static inline
wrapper retains the original scx_bpf_sub_dispatch() name.
When the kfunc is unavailable, the wrapper safely falls back to
returning false. This is acceptable because the dispatch path cannot
do anything useful without underlying sub-sched support anyway.
Tested scx_qmap on v6.14 successfully.
Signed-off-by: Cheng-Yang Chou <yphbchou0911@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
|
|
Extend SCX_OPS_OPEN() with compatibility handling for ops.sub_attach()
and ops.sub_detach(), allowing scx C schedulers with sub-scheduler
support to run on kernels both with and without its support.
Cc: Cheng-Yang Chou <yphbchou0911@gmail.com>
Fixes: ebeca1f930ea ("sched_ext: Introduce cgroup sub-sched support")
Signed-off-by: Andrea Righi <arighi@nvidia.com>
Reviewed-by: Cheng-Yang Chou <yphbchou0911@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
|
|
SCX_ENQ_IMMED makes enqueue to local DSQs succeed only if the task can start
running immediately. Otherwise, the task is re-enqueued through ops.enqueue().
This provides tighter control but requires specifying the flag on every
insertion.
Add SCX_OPS_ALWAYS_ENQ_IMMED ops flag. When set, SCX_ENQ_IMMED is
automatically applied to all local DSQ enqueues including through
scx_bpf_dsq_move_to_local().
scx_qmap is updated with -I option to test the feature and -F option for
IMMED stress testing which forces every Nth enqueue to a busy local DSQ.
v2: - Cover scx_bpf_dsq_move_to_local() path (now has enq_flags via ___v2).
- scx_qmap: Remove sched_switch and cpu_release handlers (superseded by
kernel-side wakeup_preempt_scx()). Add -F for IMMED stress testing.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
|
|
scx_bpf_dsq_move_to_local() moves a task from a non-local DSQ to the
current CPU's local DSQ. This is an indirect way of dispatching to a local
DSQ and should support enq_flags like direct dispatches do - e.g.
SCX_ENQ_HEAD for head-of-queue insertion and SCX_ENQ_IMMED for immediate
execution guarantees.
Add scx_bpf_dsq_move_to_local___v2() with an enq_flags parameter. The
original becomes a v1 compat wrapper passing 0. The compat macro is updated
to a three-level chain: v2 (7.1+) -> v1 (current) -> scx_bpf_consume
(pre-rename). All in-tree BPF schedulers are updated to pass 0.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
|
|
Add SCX_ENQ_IMMED enqueue flag for local DSQ insertions. Once a task is
dispatched with IMMED, it either gets on the CPU immediately and stays on it,
or gets reenqueued back to the BPF scheduler. It will never linger on a local
DSQ behind other tasks or on a CPU taken by a higher-priority class.
rq_is_open() uses rq->next_class to determine whether the rq is available,
and wakeup_preempt_scx() triggers reenqueue when a higher-priority class task
arrives. These capture all higher class preemptions. Combined with reenqueue
points in the dispatch path, all cases where an IMMED task would not execute
immediately are covered.
SCX_TASK_IMMED persists in p->scx.flags until the next fresh enqueue, so the
guarantee survives SAVE/RESTORE cycles. If preempted while running,
put_prev_task_scx() reenqueues through ops.enqueue() with
SCX_TASK_REENQ_PREEMPTED instead of silently placing the task back on the
local DSQ.
This enables tighter scheduling latency control by preventing tasks from
piling up on local DSQs. It also enables opportunistic CPU sharing across
sub-schedulers - without this, a sub-scheduler can stuff the local DSQ of a
shared CPU, making it difficult for others to use.
v2: - Rewrite is_curr_done() as rq_is_open() using rq->next_class and
implement wakeup_preempt_scx() to achieve complete coverage of all
cases where IMMED tasks could get stranded.
- Track IMMED persistently in p->scx.flags and reenqueue
preempted-while-running tasks through ops.enqueue().
- Bound deferred reenq cycles (SCX_REENQ_LOCAL_MAX_REPEAT).
- Misc renames, documentation.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
|
|
While running scx_flatcg, dmesg prints "SCX_OPS_HAS_CGROUP_WEIGHT is
deprecated and a noop", in code, SCX_OPS_HAS_CGROUP_WEIGHT has been
marked as DEPRECATED, and will be removed on 6.18. Now it's time to do it.
Signed-off-by: Zhao Mengmeng <zhaomengmeng@kylinos.cn>
Signed-off-by: Tejun Heo <tj@kernel.org>
|
|
Regenerate enum_defs.autogen.h from the current vmlinux.h to pick up
new SCX enums added in the for-7.1 cycle.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Andrea Righi <arighi@nvidia.com>
|
|
Extract the inline bpf_program__assoc_struct_ops() call in SCX_OPS_LOAD()
into a __scx_ops_assoc_prog() helper and wrap it with a libbpf >= 1.7
version guard. bpf_program__assoc_struct_ops() was added in libbpf 1.7;
the guard provides a no-op fallback for older versions. Add the
<bpf/libbpf.h> include needed by the helper, and fix "assumming" typo in
a nearby comment.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Andrea Righi <arighi@nvidia.com>
|
|
scx_bpf_select_cpu_and() is now an inline wrapper so
bpf_ksym_exists(scx_bpf_select_cpu_and) no longer works. Add
__COMPAT_HAS_scx_bpf_select_cpu_and macro that checks for either the
struct args type (new) or the compat ksym (old) to test availability.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Andrea Righi <arighi@nvidia.com>
|
|
Sync several helpers from the scx repo:
- bpf_cgroup_acquire() ksym declaration
- __sink() macro for hiding values from verifier precision tracking
- ctzll() count-trailing-zeros implementation
- get_prandom_u64() helper
- scx_clock_task/pelt/virt/irq() clock helpers with get_current_rq()
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Andrea Righi <arighi@nvidia.com>
|
|
Sync the following changes from the scx repo:
- Guard __arena define with #ifndef to avoid redefinition when the
attribute is already defined by another header.
- Add bpf_arena_reserve_pages() and bpf_arena_mapping_nr_pages() ksym
declarations.
- Rename TEST to SCX_BPF_UNITTEST to avoid collision with generic TEST
macros in other projects.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Andrea Righi <arighi@nvidia.com>
|
|
The __has_include guard for sdt_task_defs.h is vestigial — the only
remaining content is the bpf_arena_common.h include which is available
unconditionally. Remove the dead guard.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Andrea Righi <arighi@nvidia.com>
|
|
scx_bpf_reenqueue_local() can only trigger re-enqueue of the current CPU's
local DSQ. Introduce scx_bpf_dsq_reenq() which takes a DSQ ID and can target
any local DSQ including remote CPUs via SCX_DSQ_LOCAL_ON | cpu. This will be
expanded to support user DSQs by future changes.
scx_bpf_reenqueue_local() is reimplemented as a simple wrapper around
scx_bpf_dsq_reenq(SCX_DSQ_LOCAL, 0) and may be deprecated in the future.
Update compat.bpf.h with a compatibility shim and scx_qmap to test the new
functionality.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
|
|
This is an early-stage partial implementation that demonstrates the core
building blocks for nested sub-scheduler dispatching. While significant
work remains in the enqueue path and other areas, this patch establishes
the fundamental mechanisms needed for hierarchical scheduler operation.
The key building blocks introduced include:
- Private stack support for ops.dispatch() to prevent stack overflow when
walking down nested schedulers during dispatch operations
- scx_bpf_sub_dispatch() kfunc that allows parent schedulers to trigger
dispatch operations on their direct child schedulers
- Proper parent-child relationship validation to ensure dispatch requests
are only made to legitimate child schedulers
- Updated scx_dispatch_sched() to handle both nested and non-nested
invocations with appropriate kf_mask handling
The qmap scheduler is updated to demonstrate the functionality by calling
scx_bpf_sub_dispatch() on registered child schedulers when it has no
tasks in its own queues.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
|
|
In preparation for multiple scheduler support, introduce scx_prog_sched()
accessor which returns the scx_sched instance associated with a BPF program.
The association is determined via the special KF_IMPLICIT_ARGS kfunc
parameter, which provides access to bpf_prog_aux. This aux can be used to
retrieve the struct_ops (sched_ext_ops) that the program is associated with,
and from there, the corresponding scx_sched instance.
For compatibility, when ops.sub_attach is not implemented (older schedulers
without sub-scheduler support), unassociated programs fall back to scx_root.
A warning is logged once per scheduler for such programs.
As scx_root is still the only scheduler, this shouldn't introduce
user-visible behavior changes.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
|
|
To prepare for hierarchical scheduling patchset which will cause multiple
conflicts otherwise.
Signed-off-by: Tejun Heo <tj@kernel.org>
|
|
scx_hotplug_seq() uses strtoul() but validates the result with a
negative check (val < 0), which can never be true for an unsigned
return value.
Use the endptr mechanism to verify the entire string was consumed,
and check errno == ERANGE for overflow detection.
Signed-off-by: David Carlier <devnexen@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
|
|
Currently, ops.dequeue() is only invoked when the sched_ext core knows
that a task resides in BPF-managed data structures, which causes it to
miss scheduling property change events. In addition, ops.dequeue()
callbacks are completely skipped when tasks are dispatched to non-local
DSQs from ops.select_cpu(). As a result, BPF schedulers cannot reliably
track task state.
Fix this by guaranteeing that each task entering the BPF scheduler's
custody triggers exactly one ops.dequeue() call when it leaves that
custody, whether the exit is due to a dispatch (regular or via a core
scheduling pick) or to a scheduling property change (e.g.
sched_setaffinity(), sched_setscheduler(), set_user_nice(), NUMA
balancing, etc.).
BPF scheduler custody concept: a task is considered to be in the BPF
scheduler's custody when the scheduler is responsible for managing its
lifecycle. This includes tasks dispatched to user-created DSQs or stored
in the BPF scheduler's internal data structures from ops.enqueue().
Custody ends when the task is dispatched to a terminal DSQ (such as the
local DSQ or %SCX_DSQ_GLOBAL), selected by core scheduling, or removed
due to a property change.
Tasks directly dispatched to terminal DSQs bypass the BPF scheduler
entirely and are never in its custody. Terminal DSQs include:
- Local DSQs (%SCX_DSQ_LOCAL or %SCX_DSQ_LOCAL_ON): per-CPU queues
where tasks go directly to execution.
- Global DSQ (%SCX_DSQ_GLOBAL): the built-in fallback queue where the
BPF scheduler is considered "done" with the task.
As a result, ops.dequeue() is not invoked for tasks directly dispatched
to terminal DSQs.
To identify dequeues triggered by scheduling property changes, introduce
the new ops.dequeue() flag %SCX_DEQ_SCHED_CHANGE: when this flag is set,
the dequeue was caused by a scheduling property change.
New ops.dequeue() semantics:
- ops.dequeue() is invoked exactly once when the task leaves the BPF
scheduler's custody, in one of the following cases:
a) regular dispatch: a task dispatched to a user DSQ or stored in
internal BPF data structures is moved to a terminal DSQ
(ops.dequeue() called without any special flags set),
b) core scheduling dispatch: core-sched picks task before dispatch
(ops.dequeue() called with %SCX_DEQ_CORE_SCHED_EXEC flag set),
c) property change: task properties modified before dispatch,
(ops.dequeue() called with %SCX_DEQ_SCHED_CHANGE flag set).
This allows BPF schedulers to:
- reliably track task ownership and lifecycle,
- maintain accurate accounting of managed tasks,
- update internal state when tasks change properties.
Cc: Tejun Heo <tj@kernel.org>
Cc: Emil Tsalapatis <emil@etsalapatis.com>
Cc: Kuba Piecuch <jpiecuch@google.com>
Signed-off-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
|
|
Commit 111a79800aed ("tools/sched_ext: Strip compatibility macros for
cgroup and dispatch APIs") removed the compat layer for v6.12-v6.13 kfunc
renaming, but v6.12 is the current LTS kernel and will remain supported
through 2026. Restore backward compatibility so schedulers built with v6.19+
headers can run on v6.12 kernels.
The restored compat differs from the original in two ways:
1. Uses ___new/___old suffixes instead of ___compat for clarity. The new
macros check for v6.13+ names (scx_bpf_dsq_move*), fall back to v6.12
names (scx_bpf_dispatch_from_dsq*, scx_bpf_consume), then return safe
no-ops for missing symbols.
2. Integrates with the args-struct-packing changes added in c0d630ba347c
("sched_ext: Wrap kfunc args in struct to prepare for aux__prog").
scx_bpf_dsq_insert_vtime() now tries __scx_bpf_dsq_insert_vtime (args
struct), then scx_bpf_dsq_insert_vtime___compat (v6.13-v6.18), then
scx_bpf_dispatch_vtime___compat (pre-v6.13).
Forward compatibility is not restored - binaries built against v6.13 or
earlier headers won't run on v6.19+ kernels, as the old kfunc names are not
exported. This is acceptable since the priority is new binaries running on
older kernels.
Also add missing compat checks for ops.cgroup_set_bandwidth() (added v6.17)
and ops.cgroup_set_idle() (added v6.19). These need to be NULLed out in
userspace on older kernels.
Reported-by: Andrea Righi <arighi@nvidia.com>
Acked-and-tested-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
|
|
The ops.cpu_acquire/release() callbacks miss events under multiple conditions.
There are two distinct task dispatch gaps that can cause cpu_released flag
desynchronization:
1. balance-to-pick_task gap: This is what was originally reported. balance_scx()
can enqueue a task, but during consume_remote_task() when the rq lock is
released, a higher priority task can be enqueued and ultimately picked while
cpu_released remains false. This gap is closeable via RETRY_TASK handling.
2. ttwu-to-pick_task gap: ttwu() can directly dispatch a task to a CPU's local
DSQ. By the time the sched path runs on the target CPU, higher class tasks may
already be queued. In such cases, nothing on sched_ext side will be invoked,
and the only solution would be a hook invoked regardless of sched class, which
isn't desirable.
Rather than adding invasive core hooks, BPF schedulers can use generic BPF
mechanisms like tracepoints. From SCX scheduler's perspective, this is congruent
with other mechanisms it already uses and doesn't add further friction.
The main use case for cpu_release() was calling scx_bpf_reenqueue_local() when
a CPU gets preempted by a higher priority scheduling class. However, the old
scx_bpf_reenqueue_local() could only be called from cpu_release() context.
Add a new version of scx_bpf_reenqueue_local() that can be called from any
context by deferring the actual re-enqueue operation. This eliminates the need
for cpu_acquire/release() ops entirely. Schedulers can now use standard BPF
mechanisms like the sched_switch tracepoint to detect and handle CPU preemption.
Update scx_qmap to demonstrate the new approach using sched_switch instead of
cpu_release, with compat support for older kernels. Mark cpu_acquire/release()
as deprecated. The old scx_bpf_reenqueue_local() variant will be removed in
v6.23.
Reported-by: Wen-Fang Liu <liuwenfang@honor.com>
Link: https://lore.kernel.org/all/8d64c74118c6440f81bcf5a4ac6b9f00@honor.com/
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Tejun Heo <tj@kernel.org>
|
|
2dbbdeda77a6 ("sched_ext: Fix scx_bpf_dsq_insert() backward binary
compatibility") renamed the new bool-returning variant to scx_bpf_dsq_insert___v2
in the kernel. However, libbpf currently only strips ___SUFFIX on the BPF side,
not on kernel symbols, so the compat wrapper couldn't match the kernel kfunc and
would always fall back to the old variant even when the new one was available.
Add an extra ___compat suffix as a workaround - libbpf strips one suffix on the
BPF side leaving ___v2, which then matches the kernel kfunc directly. In the
future when libbpf strips all suffixes on both sides, all suffixes can be
dropped.
Fixes: 2dbbdeda77a6 ("sched_ext: Fix scx_bpf_dsq_insert() backward binary compatibility")
Cc: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Tejun Heo <tj@kernel.org>
|
|
cded46d97159 ("sched_ext: Make scx_bpf_dsq_insert*() return bool")
introduced a new bool-returning scx_bpf_dsq_insert() and renamed the old
void-returning version to scx_bpf_dsq_insert___compat, with the expectation
that libbpf would match old binaries to the ___compat variant, maintaining
backward binary compatibility. However, while libbpf ignores ___suffix on
the BPF side when matching symbols, it doesn't do so for kernel-side symbols.
Old binaries compiled with the original scx_bpf_dsq_insert() could no longer
resolve the symbol.
Fix by reversing the naming: Keep scx_bpf_dsq_insert() as the old
void-returning interface and add ___v2 to the new bool-returning version.
This allows old binaries to continue working while new code can use the
___v2 variant. Once libbpf is updated to ignore kernel-side ___SUFFIX, the
___v2 suffix can be dropped when the compat interface is removed.
v2: Use ___v2 instead of ___new.
Fixes: cded46d97159 ("sched_ext: Make scx_bpf_dsq_insert*() return bool")
Signed-off-by: Tejun Heo <tj@kernel.org>
|
|
The builtin DSQ queue data structures are meant to be used by a wide
range of different sched_ext schedulers with different demands on these
data structures. They might be per-cpu with low-contention, or
high-contention shared queues. Unfortunately, DSQs have a coarse-grained
lock around the whole data structure. Without going all the way to a
lock-free, more scalable implementation, a small step we can take to
reduce lock contention is to allow a lockless, small-fixed-cost peek at
the head of the queue.
This change allows certain custom SCX schedulers to cheaply peek at
queues, e.g. during load balancing, before locking them. But it
represents a few extra memory operations to update the pointer each
time the DSQ is modified, including a memory barrier on ARM so the write
appears correctly ordered.
This commit adds a first_task pointer field which is updated
atomically when the DSQ is modified, and allows any thread to peek at
the head of the queue without holding the lock.
Signed-off-by: Ryan Newton <newton@meta.com>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
Reviewed-by: Christian Loehle <christian.loehle@arm.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
|
|
for sub-scheduler authority checks. Add compat wrappers which fall back to
direct p->scx field writes on older kernels.
Suggested-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
|
|
In preparation for hierarchical schedulers, change scx_bpf_dsq_insert() and
scx_bpf_dsq_insert_vtime() to return bool instead of void. With
sub-schedulers, there will be no reliable way to guarantee a task is still
owned by the sub-scheduler at insertion time (e.g., the task may have been
migrated to another scheduler). The bool return value will enable
sub-schedulers to detect and gracefully handle insertion failures.
For the root scheduler, insertion failures will continue to trigger scheduler
abort via scx_error(), so existing code doesn't need to check the return
value. Backward compatibility is maintained through compat wrappers.
Also update scx_bpf_dsq_move() documentation to clarify that it can return
false for sub-schedulers when @dsq_id points to a disallowed local DSQ.
Reviewed-by: Changwoo Min <changwoo@igalia.com>
Reviewed-by: Emil Tsalapatis <emil@etsalapatis.com>
Acked-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
|
|
scx_bpf_dsq_insert_vtime() and scx_bpf_select_cpu_and() currently have 5
parameters. An upcoming change will add aux__prog parameter which will exceed
BPF's 5 argument limit.
Prepare by adding new kfuncs __scx_bpf_dsq_insert_vtime() and
__scx_bpf_select_cpu_and() that take args structs. The existing kfuncs are
kept as compatibility wrappers. BPF programs use inline wrappers that detect
kernel API version via bpf_core_type_exists() and use the new struct-based
kfuncs when available, falling back to compat kfuncs otherwise. This allows
BPF programs to work with both old and new kernels.
Reviewed-by: Emil Tsalapatis <emil@etsalapatis.com>
Acked-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
|
|
With the planned hierarchical scheduler support, sub-schedulers will need to
be verified for authority before being allowed to modify task->scx.slice and
task->scx.dsq_vtime. Add scx_bpf_task_set_slice() and
scx_bpf_task_set_dsq_vtime() which will perform the necessary permission
checks.
Root schedulers can still directly write to these fields, so this doesn't
affect existing schedulers.
Reviewed-by: Emil Tsalapatis <emil@etsalapatis.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
|
|
Enough time has passed since the introduction of scx_bpf_task_cgroup() and
the scx_bpf_dispatch* -> scx_bpf_dsq* kfunc renaming. Strip the compatibility
macros.
Acked-by: Changwoo Min <changwoo@igalia.com>
Acked-by: Andrea Righi <arighi@nvidia.com>
Reviewed-by: Emil Tsalapatis <emil@etsalapatis.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
|
|
Introduce a compatibility helper that allows BPF schedulers to use
scx_bpf_cpu_curr() on older kernels.
Fixes: 20b158094a1ad ("sched_ext: Introduce scx_bpf_cpu_curr()")
Cc: Christian Loehle <christian.loehle@arm.com>
Signed-off-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
|
|
Provide scx_bpf_cpu_curr() as a way for scx schedulers to check the curr
task of a remote rq without assuming its lock is held.
Many scx schedulers make use of scx_bpf_cpu_rq() to check a remote curr
(e.g. to see if it should be preempted). This is problematic because
scx_bpf_cpu_rq() provides access to all fields of struct rq, most of
which aren't safe to use without holding the associated rq lock.
Signed-off-by: Christian Loehle <christian.loehle@arm.com>
Acked-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
|
|
Most fields in scx_bpf_cpu_rq() assume that its rq_lock is held.
Furthermore they become meaningless without rq lock, too.
Make a safer version of scx_bpf_cpu_rq() that only returns a rq
if we hold rq lock of that rq.
Also mark the new scx_bpf_locked_rq() as returning NULL as
scx_bpf_cpu_rq() should've been too.
Signed-off-by: Christian Loehle <christian.loehle@arm.com>
Acked-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
|
|
Receive tools/sched_ext updates form https://github.com/sched-ext/scx to
sync userspace bits:
- basic BPF arena allocator abstractions,
- additional process flags definitions,
- fixed is_migration_disabled() helper,
- separate out user_exit_info BPF and user space code.
This also fixes the following warning when building the selftests:
tools/sched_ext/include/scx/common.bpf.h:550:9: warning: 'likely' macro redefined [-Wmacro-redefined]
550 | #define likely(x) __builtin_expect(!!(x), 1)
| ^
Co-developed-by: Cheng-Yang Chou <yphbchou0911@gmail.com>
Signed-off-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
|
|
Provide a new kfunc, scx_bpf_select_cpu_and(), that can be used to apply
the built-in idle CPU selection policy to a subset of allowed CPU.
This new helper is basically an extension of scx_bpf_select_cpu_dfl().
However, when an idle CPU can't be found, it returns a negative value
instead of @prev_cpu, aligning its behavior more closely with
scx_bpf_pick_idle_cpu().
It also accepts %SCX_PICK_IDLE_* flags, which can be used to enforce
strict selection to @prev_cpu's node (%SCX_PICK_IDLE_IN_NODE), or to
request only a full-idle SMT core (%SCX_PICK_IDLE_CORE), while applying
the built-in selection logic.
With this helper, BPF schedulers can apply the built-in idle CPU
selection policy restricted to any arbitrary subset of CPUs.
Example usage
=============
Possible usage in ops.select_cpu():
s32 BPF_STRUCT_OPS(foo_select_cpu, struct task_struct *p,
s32 prev_cpu, u64 wake_flags)
{
const struct cpumask *cpus = task_allowed_cpus(p) ?: p->cpus_ptr;
s32 cpu;
cpu = scx_bpf_select_cpu_and(p, prev_cpu, wake_flags, cpus, 0);
if (cpu >= 0) {
scx_bpf_dsq_insert(p, SCX_DSQ_LOCAL, SCX_SLICE_DFL, 0);
return cpu;
}
return prev_cpu;
}
Results
=======
Load distribution on a 4 sockets, 4 cores per socket system, simulated
using virtme-ng, running a modified version of scx_bpfland that uses
scx_bpf_select_cpu_and() with 0xff00 as the allowed subset of CPUs:
$ vng --cpu 16,sockets=4,cores=4,threads=1
...
$ stress-ng -c 16
...
$ htop
...
0[ 0.0%] 8[||||||||||||||||||||||||100.0%]
1[ 0.0%] 9[||||||||||||||||||||||||100.0%]
2[ 0.0%] 10[||||||||||||||||||||||||100.0%]
3[ 0.0%] 11[||||||||||||||||||||||||100.0%]
4[ 0.0%] 12[||||||||||||||||||||||||100.0%]
5[ 0.0%] 13[||||||||||||||||||||||||100.0%]
6[ 0.0%] 14[||||||||||||||||||||||||100.0%]
7[ 0.0%] 15[||||||||||||||||||||||||100.0%]
With scx_bpf_select_cpu_dfl() tasks would be distributed evenly across
all the available CPUs.
Signed-off-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
|
|
Synchronize with https://github.com/sched-ext/scx at dc44584874f0 ("kernel:
Synchronize with kernel tools/sched_ext").
- READ/WRITE_ONCE() is made more proper and READA_ONCE_ARENA() is dropped.
- scale_by_task_weight[_inverse]() helpers added.
- Enum defs expanded to cover more and new enums.
- Don't trigger fatal error when some enums are missing from kernel BTF.
Signed-off-by: Tejun Heo <tj@kernel.org>
|
|
Pull for-6.14-fixes to receive:
9360dfe4cbd6 ("sched_ext: Validate prev_cpu in scx_bpf_select_cpu_dfl()")
which conflicts with:
337d1b354a29 ("sched_ext: Move built-in idle CPU selection policy to a separate file")
Signed-off-by: Tejun Heo <tj@kernel.org>
|
|
Introduce __COMPAT_scx_bpf_events() to use scx_bpf_events() in a
compatible way also with kernels that don't provide this kfunc.
This also fixes the following error with scx_qmap when running on a
kernel that does not provide scx_bpf_events():
; scx_bpf_events(&events, sizeof(events)); @ scx_qmap.bpf.c:777
318: (b7) r2 = 72 ; R2_w=72 async_cb
319: <invalid kfunc call>
kfunc 'scx_bpf_events' is referenced but wasn't resolved
Fixes: 9865f31d852a4 ("sched_ext: Add scx_bpf_events() and scx_read_event() for BPF schedulers")
Signed-off-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
|
|
Make all the SCX_OPS_* and SCX_PICK_IDLE_* flags available to the
user-space part of the schedulers via the compat interface.
This allows schedulers / selftests to set all the ops flags in
user-space, rather than having them split between BPF and user-space.
Signed-off-by: Andrea Righi <arighi@nvidia.com>
Acked-by: Changwoo Min <changwoo@igalia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
|
|
Similarly to scx_bpf_nr_cpu_ids(), introduce a new kfunc
scx_bpf_nr_node_ids() to expose the maximum number of NUMA nodes in the
system.
BPF schedulers can use this information together with the new node-aware
kfuncs, for example to create per-node DSQs, validate node IDs, etc.
Signed-off-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
|
|
Introduce a new kfunc to retrieve the node associated to a CPU:
int scx_bpf_cpu_node(s32 cpu)
Add the following kfuncs to provide BPF schedulers direct access to
per-node idle cpumasks information:
const struct cpumask *scx_bpf_get_idle_cpumask_node(int node)
const struct cpumask *scx_bpf_get_idle_smtmask_node(int node)
s32 scx_bpf_pick_idle_cpu_node(const cpumask_t *cpus_allowed,
int node, u64 flags)
s32 scx_bpf_pick_any_cpu_node(const cpumask_t *cpus_allowed,
int node, u64 flags)
Moreover, trigger an scx error when any of the non-node aware idle CPU
kfuncs are used when SCX_OPS_BUILTIN_IDLE_PER_NODE is enabled.
Cc: Yury Norov [NVIDIA] <yury.norov@gmail.com>
Signed-off-by: Andrea Righi <arighi@nvidia.com>
Reviewed-by: Yury Norov [NVIDIA] <yury.norov@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
|
|
Using a single global idle mask can lead to inefficiencies and a lot of
stress on the cache coherency protocol on large systems with multiple
NUMA nodes, since all the CPUs can create a really intense read/write
activity on the single global cpumask.
Therefore, split the global cpumask into multiple per-NUMA node cpumasks
to improve scalability and performance on large systems.
The concept is that each cpumask will track only the idle CPUs within
its corresponding NUMA node, treating CPUs in other NUMA nodes as busy.
In this way concurrent access to the idle cpumask will be restricted
within each NUMA node.
The split of multiple per-node idle cpumasks can be controlled using the
SCX_OPS_BUILTIN_IDLE_PER_NODE flag.
By default SCX_OPS_BUILTIN_IDLE_PER_NODE is not enabled and a global
host-wide idle cpumask is used, maintaining the previous behavior.
NOTE: if a scheduler explicitly enables the per-node idle cpumasks (via
SCX_OPS_BUILTIN_IDLE_PER_NODE), scx_bpf_get_idle_cpu/smtmask() will
trigger an scx error, since there are no system-wide cpumasks.
= Test =
Hardware:
- System: DGX B200
- CPUs: 224 SMT threads (112 physical cores)
- Processor: INTEL(R) XEON(R) PLATINUM 8570
- 2 NUMA nodes
Scheduler:
- scx_simple [1] (so that we can focus at the built-in idle selection
policy and not at the scheduling policy itself)
Test:
- Run a parallel kernel build `make -j $(nproc)` and measure the average
elapsed time over 10 runs:
avg time | stdev
---------+------
before: 52.431s | 2.895
after: 50.342s | 2.895
= Conclusion =
Splitting the global cpumask into multiple per-NUMA cpumasks helped to
achieve a speedup of approximately +4% with this particular architecture
and test case.
The same test on a DGX-1 (40 physical cores, Intel Xeon E5-2698 v4 @
2.20GHz, 2 NUMA nodes) shows a speedup of around 1.5-3%.
On smaller systems, I haven't noticed any measurable regressions or
improvements with the same test (parallel kernel build) and scheduler
(scx_simple).
Moreover, with a modified scx_bpfland that uses the new NUMA-aware APIs
I observed an additional +2-2.5% performance improvement with the same
test.
[1] https://github.com/sched-ext/scx/blob/main/scheds/c/scx_simple.bpf.c
Cc: Yury Norov [NVIDIA] <yury.norov@gmail.com>
Signed-off-by: Andrea Righi <arighi@nvidia.com>
Reviewed-by: Yury Norov [NVIDIA] <yury.norov@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
|
|
Add the new scheduler flag SCX_OPS_BUILTIN_IDLE_PER_NODE, which allows
BPF schedulers to select between using a global flat idle cpumask or
multiple per-node cpumasks.
This only introduces the flag and the mechanism to enable/disable this
feature without affecting any scheduling behavior.
Cc: Yury Norov [NVIDIA] <yury.norov@gmail.com>
Signed-off-by: Andrea Righi <arighi@nvidia.com>
Reviewed-by: Yury Norov [NVIDIA] <yury.norov@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
|
|
Synchronize with https://github.com/sched-ext/scx at d384453984a0 ("kernel:
Sync at ad3b301aa05a ("sched_ext: Provides a sysfs 'events' to expose core
event counters")").
Signed-off-by: Tejun Heo <tj@kernel.org>
|
|
Now BPF only supports bpf_list_push_{front,back}_impl kfunc, not bpf_list_
push_{front,back}.
This patch fix this issue. Without this patch, if we use bpf_list kfunc
in scx, the BPF verifier would complain:
libbpf: extern (func ksym) 'bpf_list_push_back': not found in kernel or
module BTFs
libbpf: failed to load object 'scx_foo'
libbpf: failed to load BPF skeleton 'scx_foo': -EINVAL
With this patch, the bpf list kfunc will work as expected.
Signed-off-by: Chuyi Zhou <zhouchuyi@bytedance.com>
Fixes: 2a52ca7c98960 ("sched_ext: Add scx_simple and scx_example_qmap example schedulers")
Signed-off-by: Tejun Heo <tj@kernel.org>
|
|
Add where the script is located to the comment lines of the header file.
This helps anyone re-generate the header file if required.
Note that this is a sync from the PR [1] in the scx repo.
[1] https://github.com/sched-ext/scx/pull/1322
Signed-off-by: Changwoo Min <changwoo@igalia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
|
|
This provides compatible testing of SCX_ENQ_CPU_SELECTED.
More specifically, it handles two cases:
1. a BPF scheduler is compiled against vmlinux.h where
SCX_ENQ_CPU_SELECTED is defined, but it runs on a kernel that does not
have SCX_ENQ_CPU_SELECTED. In this case, the test result of
'enq_flags & SCX_ENQ_CPU_SELECTED' will always be false. That test result
is semantically incorrect because the kernel before SCX_ENQ_CPU_SELECTED
has never skipped select_task_rq_scx(), so the result should be true.
2. a BPF scheduler is compiling against vmlinux.h where
SCX_ENQ_CPU_SELECTED is not defined. In this case, directly using
SCX_ENQ_CPU_SELECTED causes compilation errors.
To hide such complexity, introduce __COMPAT_is_enq_cpu_selected(),
which checks if SCX_ENQ_CPU_SELECTED exists in runtime using BPF CO-RE.
This consists of three parts:
1. Add enum_defs.autogen.h, which has macros (HAVE_{enum name}) denoting
whether SCX enums are defined in the vmlinux.h or not.
2. Implement __COMPAT_is_enq_cpu_selected(), which provide the test of
SCX_ENQ_CPU_SELECTED in a compatible way.
3. Use __COMPAT_is_enq_cpu_selected() in scx_qmap.
Note that this is a sync of the relevant PR [1] in the scx repo.
[1] https://github.com/sched-ext/scx/pull/1314
Signed-off-by: Changwoo Min <changwoo@igalia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
|
|
Pull to receive:
- 2fa0fbeb69ed ("sched_ext: Implement auto local dispatching of migration disabled tasks")
- 32966821574c ("sched_ext: Fix migration disabled handling in targeted dispatches")
as planned for-6.15 changes depend on them (e.g. adding event counter for
implicit migration disabled task handling).
|