aboutsummaryrefslogtreecommitdiff
path: root/kernel
AgeCommit message (Collapse)AuthorFilesLines
2026-03-08locking/semaphore: Remove the list_head from struct semaphoreMatthew Wilcox (Oracle)1-10/+31
Instead of embedding a list_head in struct semaphore, store a pointer to the first waiter. The list of waiters remains a doubly linked list so we can efficiently add to the tail of the list and remove from the front (or middle) of the list. Some of the list manipulation becomes more complicated, but it's a reasonable tradeoff on the slow paths to shrink data structures which embed a semaphore. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://patch.msgid.link/20260305195545.3707590-3-willy@infradead.org
2026-03-08locking/rwsem: Remove the list_head from struct rw_semaphoreMatthew Wilcox (Oracle)1-32/+58
Instead of embedding a list_head in struct rw_semaphore, store a pointer to the first waiter. The list of waiters remains a doubly linked list so we can efficiently add to the tail of the list, remove from the front (or middle) of the list. Some of the list manipulation becomes more complicated, but it's a reasonable tradeoff on the slow paths to shrink some core data structures like struct inode. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://patch.msgid.link/20260305195545.3707590-2-willy@infradead.org
2026-03-07Revert "sched_ext: Use READ_ONCE() for the read side of dsq->nr update"Tejun Heo1-6/+2
This reverts commit 9adfcef334bf9c6ef68eaecfca5f45d18614efe0. dsq->nr is protected by dsq->lock and reading while holding the lock doesn't constitute a racy read. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: zhidao su <suzhidao@xiaomi.com>
2026-03-07Merge tag 'timers-urgent-2026-03-08' of ↵Linus Torvalds1-2/+4
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull timer fix from Ingo Molnar: "Make clock_adjtime() syscall timex validation slightly more permissive for auxiliary clocks, to not reject syscalls based on the status field that do not try to modify the status field. This makes the ABI behavior in clock_adjtime() consistent with CLOCK_REALTIME" * tag 'timers-urgent-2026-03-08' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: timekeeping: Fix timex status validation for auxiliary clocks
2026-03-07Merge tag 'sched-urgent-2026-03-08' of ↵Linus Torvalds1-0/+30
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull scheduler fix from Ingo Molnar: "Fix a DL scheduler bug that may corrupt internal metrics during PI and setscheduler() syscalls, resulting in kernel warnings and misbehavior. Found during stress-testing" * tag 'sched-urgent-2026-03-08' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: sched/deadline: Fix missing ENQUEUE_REPLENISH during PI de-boosting
2026-03-07Merge tag 'bpf-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpfLinus Torvalds3-6/+37
Pull bpf fixes from Alexei Starovoitov: - Fix u32/s32 bounds when ranges cross min/max boundary (Eduard Zingerman) - Fix precision backtracking with linked registers (Eduard Zingerman) - Fix linker flags detection for resolve_btfids (Ihor Solodrai) - Fix race in update_ftrace_direct_add/del (Jiri Olsa) - Fix UAF in bpf_trampoline_link_cgroup_shim (Lang Xu) * tag 'bpf-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf: resolve_btfids: Fix linker flags detection selftests/bpf: add reproducer for spurious precision propagation through calls bpf: collect only live registers in linked regs Revert "selftests/bpf: Update reg_bound range refinement logic" selftests/bpf: test refining u32/s32 bounds when ranges cross min/max boundary bpf: Fix u32/s32 bounds when ranges cross min/max boundary bpf: Fix a UAF issue in bpf_trampoline_link_cgroup_shim ftrace: Add missing ftrace_lock to update_ftrace_direct_add/del
2026-03-07sched_ext: Fix scx_bpf_reenqueue_local() silently reenqueuing nothingCheng-Yang Chou1-1/+1
ffa7ae0724e4 ("sched_ext: Add reenq_flags plumbing to scx_bpf_dsq_reenq()") introduced task_should_reenq() as a filter inside reenq_local(), requiring SCX_REENQ_ANY to be set in order to match any task. scx_bpf_dsq_reenq() handles this correctly by converting a bare reenq_flags=0 to SCX_REENQ_ANY, but scx_bpf_reenqueue_local() was not updated and continued to call reenq_local() with 0, causing it to silently reenqueue zero tasks. Fix by passing SCX_REENQ_ANY directly. Fixes: ffa7ae0724e4 ("sched_ext: Add reenq_flags plumbing to scx_bpf_dsq_reenq()") Signed-off-by: Cheng-Yang Chou <yphbchou0911@gmail.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2026-03-07Merge tag 'trace-v7.0-rc2-2' of ↵Linus Torvalds3-4/+11
git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace Pull tracing fixes from Steven Rostedt: - Fix possible NULL pointer dereference in trace_data_alloc() On the trace_data_alloc() error path, it can call trigger_data_free() with a NULL pointer. This used to be a kfree() but was changed to trigger_data_free() to clean up any partial initialization. The issue is that trigger_data_free() does not expect a NULL pointer. Have trigger_data_free() return safely on NULL pointer. - Fix multiple events on the command line and bootconfig If multiple events are enabled on the command line separately and not grouped, only the last event gets enabled. That is: trace_event=sched_switch trace_event=sched_waking will only enable sched_waking whereas: trace_event=sched_switch,sched_waking will enable both. The bootconfig makes it even worse as the second way is the more common method. The issue is that a temporary buffer is used to store the events to enable later in boot. Each time the cmdline callback is called, it overwrites what was previously there. Have the callback append the next value (delimited by a comma) if the temporary buffer already has content. - Fix command line trace_buffer_size if >= 2G The logic to allocate the trace buffer uses "int" for the size parameter in the command line code causing overflow issues if more that 2G is specified. * tag 'trace-v7.0-rc2-2' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace: tracing: Fix trace_buf_size= cmdline parameter with sizes >= 2G tracing: Fix enabling multiple events on the kernel command line and bootconfig tracing: Add NULL pointer check to trigger_data_free()
2026-03-07sched_ext: Add SCX_TASK_REENQ_REASON flagsTejun Heo2-10/+25
SCX_ENQ_REENQ indicates that a task is being re-enqueued but doesn't tell the BPF scheduler why. Add SCX_TASK_REENQ_REASON flags using bits 12-13 of p->scx.flags to communicate the reason during ops.enqueue(): - NONE: Not being reenqueued - KFUNC: Reenqueued by scx_bpf_dsq_reenq() and friends More reasons will be added. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Andrea Righi <arighi@nvidia.com>
2026-03-07sched_ext: Simplify task state handlingTejun Heo1-10/+9
Task states (NONE, INIT, READY, ENABLED) were defined in a separate enum with unshifted values and then shifted when stored in scx_entity.flags. Simplify by defining them as pre-shifted values directly in scx_ent_flags and removing the separate scx_task_state enum. This removes the need for shifting when reading/writing state values. scx_get_task_state() now returns the masked flags value directly. scx_set_task_state() accepts the pre-shifted state value. scx_dump_task() shifts down for display to maintain readable output. No functional changes. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Andrea Righi <arighi@nvidia.com>
2026-03-07sched_ext: Optimize schedule_dsq_reenq() with lockless fast pathTejun Heo1-8/+36
schedule_dsq_reenq() always acquires deferred_reenq_lock to queue a reenqueue request. Add a lockless fast-path to skip lock acquisition when the request is already pending with the required flags set. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Andrea Righi <arighi@nvidia.com>
2026-03-07sched_ext: Implement scx_bpf_dsq_reenq() for user DSQsTejun Heo2-0/+129
scx_bpf_dsq_reenq() currently only supports local DSQs. Extend it to support user-defined DSQs by adding a deferred re-enqueue mechanism similar to the local DSQ handling. Add per-cpu deferred_reenq_user_node/flags to scx_dsq_pcpu and deferred_reenq_users list to scx_rq. When scx_bpf_dsq_reenq() is called on a user DSQ, the DSQ's per-cpu node is added to the current rq's deferred list. process_deferred_reenq_users() then iterates the DSQ using the cursor helpers and re-enqueues each task. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Andrea Righi <arighi@nvidia.com>
2026-03-07sched_ext: Factor out nldsq_cursor_next_task() and nldsq_cursor_lost_task()Tejun Heo1-55/+99
Factor out cursor-based DSQ iteration from bpf_iter_scx_dsq_next() into nldsq_cursor_next_task() and the task-lost check from scx_dsq_move() into nldsq_cursor_lost_task() to prepare for reuse. As ->priv is only used to record dsq->seq for cursors, update INIT_DSQ_LIST_CURSOR() to take the DSQ pointer and set ->priv from dsq->seq so that users don't have to read it manually. Move scx_dsq_iter_flags enum earlier so nldsq_cursor_next_task() can use SCX_DSQ_ITER_REV. bypass_lb_cpu() now sets cursor.priv to dsq->seq but doesn't use it. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Andrea Righi <arighi@nvidia.com>
2026-03-07sched_ext: Add per-CPU data to DSQsTejun Heo1-15/+72
Add per-CPU data structure to dispatch queues. Each DSQ now has a percpu scx_dsq_pcpu which contains a back-pointer to the DSQ. This will be used by future changes to implement per-CPU reenqueue tracking for user DSQs. init_dsq() now allocates the percpu data and can fail, so it returns an error code. All callers are updated to handle failures. exit_dsq() is added to free the percpu data and is called from all DSQ cleanup paths. In scx_bpf_create_dsq(), init_dsq() is called before rcu_read_lock() since alloc_percpu() requires GFP_KERNEL context, and dsq->sched is set afterwards. v2: Fix err_free_pcpu to only exit_dsq() initialized bypass DSQs (Andrea Righi). Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Andrea Righi <arighi@nvidia.com>
2026-03-07sched_ext: Add reenq_flags plumbing to scx_bpf_dsq_reenq()Tejun Heo2-5/+38
Add infrastructure to pass flags through the deferred reenqueue path. reenq_local() now takes a reenq_flags parameter, and scx_sched_pcpu gains a deferred_reenq_local_flags field to accumulate flags from multiple scx_bpf_dsq_reenq() calls before processing. No flags are defined yet. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Andrea Righi <arighi@nvidia.com>
2026-03-07sched_ext: Introduce scx_bpf_dsq_reenq() for remote local DSQ reenqueueTejun Heo1-45/+73
scx_bpf_reenqueue_local() can only trigger re-enqueue of the current CPU's local DSQ. Introduce scx_bpf_dsq_reenq() which takes a DSQ ID and can target any local DSQ including remote CPUs via SCX_DSQ_LOCAL_ON | cpu. This will be expanded to support user DSQs by future changes. scx_bpf_reenqueue_local() is reimplemented as a simple wrapper around scx_bpf_dsq_reenq(SCX_DSQ_LOCAL, 0) and may be deprecated in the future. Update compat.bpf.h with a compatibility shim and scx_qmap to test the new functionality. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Andrea Righi <arighi@nvidia.com>
2026-03-07sched_ext: Wrap deferred_reenq_local_node into a structTejun Heo2-10/+18
Wrap the deferred_reenq_local_node list_head into struct scx_deferred_reenq_local. More fields will be added and this allows using a shorthand pointer to access them. No functional change. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Andrea Righi <arighi@nvidia.com>
2026-03-07sched_ext: Convert deferred_reenq_locals from llist to regular listTejun Heo3-21/+41
The deferred reenqueue local mechanism uses an llist (lockless list) for collecting schedulers that need their local DSQs re-enqueued. Convert to a regular list protected by a raw_spinlock. The llist was used for its lockless properties, but the upcoming changes to support remote reenqueue require more complex list operations that are difficult to implement correctly with lockless data structures. A spinlock- protected regular list provides the necessary flexibility. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Andrea Righi <arighi@nvidia.com>
2026-03-07sched_ext: Relocate run_deferred() and its calleesTejun Heo1-94/+92
Previously, both process_ddsp_deferred_locals() and reenq_local() required forward declarations. Reorganize so that only run_deferred() needs to be declared. Both callees are grouped right before run_deferred() for better locality. This reduces forward declaration clutter and will ease adding more to the run_deferred() path. No functional changes. v2: Also relocate process_ddsp_deferred_locals() next to run_deferred() (Daniel Jordan). Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Emil Tsalapatis <emil@etsalapatis.com> Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com> Reviewed-by: Andrea Righi <arighi@nvidia.com>
2026-03-07sched_ext: Change find_global_dsq() to take CPU number instead of taskTejun Heo1-17/+15
Change find_global_dsq() to take a CPU number directly instead of a task pointer. This prepares for callers where the CPU is available but the task is not. No functional changes. v2: Rename tcpu to cpu in find_global_dsq() (Emil Tsalapatis). Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Emil Tsalapatis <emil@etsalapatis.com> Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com> Reviewed-by: Andrea Righi <arighi@nvidia.com>
2026-03-07sched_ext: Factor out pnode allocation and deallocation into helpersTejun Heo1-9/+23
Extract pnode allocation and deallocation logic into alloc_pnode() and free_pnode() helpers. This simplifies scx_alloc_and_add_sched() and prepares for adding more per-node initialization and cleanup in subsequent patches. No functional changes. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Emil Tsalapatis <emil@etsalapatis.com> Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com> Reviewed-by: Andrea Righi <arighi@nvidia.com>
2026-03-07sched_ext: Wrap global DSQs in per-node structureTejun Heo2-17/+21
Global DSQs are currently stored as an array of scx_dispatch_q pointers, one per NUMA node. To allow adding more per-node data structures, wrap the global DSQ in scx_sched_pnode and replace global_dsqs with pnode array. NUMA-aware allocation is maintained. No functional changes. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Emil Tsalapatis <emil@etsalapatis.com> Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com> Reviewed-by: Andrea Righi <arighi@nvidia.com>
2026-03-07sched_ext: Relocate scx_bpf_task_cgroup() and its BTF_ID to the end of kfunc ↵Tejun Heo1-39/+39
section Move scx_bpf_task_cgroup() kfunc definition and its BTF_ID entry to the end of the kfunc section before __bpf_kfunc_end_defs() for cleaner code organization. No functional changes. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Emil Tsalapatis <emil@etsalapatis.com> Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com> Reviewed-by: Andrea Righi <arighi@nvidia.com>
2026-03-07sched_ext: Pass full dequeue flags to ops.quiescent()Andrea Righi1-10/+10
ops.quiescent() is invoked with the same deq_flags as ops.dequeue(), so the BPF scheduler is able to distinguish sleep vs property changes in both callbacks. However, dequeue_task_scx() receives deq_flags as an int from the sched_class interface, so SCX flags above bit 32 (%SCX_DEQ_SCHED_CHANGE) are truncated. ops_dequeue() reconstructs the full u64 for ops.dequeue(), but ops.quiescent() is still called with the original int and can never see %SCX_DEQ_SCHED_CHANGE. Fix this by constructing the full u64 deq_flags in dequeue_task_scx() (renaming the int parameter to core_deq_flags) and passing the complete flags to both ops_dequeue() and ops.quiescent(). Fixes: ebf1ccff79c4 ("sched_ext: Fix ops.dequeue() semantics") Signed-off-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2026-03-07Merge branch 'for-7.0-fixes' into for-7.1Tejun Heo1-3/+2
Pull in 57ccf5ccdc56 ("sched_ext: Fix enqueue_task_scx() truncation of upper enqueue flags") which conflicts with ebf1ccff79c4 ("sched_ext: Fix ops.dequeue() semantics"). Signed-off-by: Tejun Heo <tj@kernel.org> # Conflicts: # kernel/sched/ext.c
2026-03-07sched_ext: Fix enqueue_task_scx() truncation of upper enqueue flagsTejun Heo1-3/+2
enqueue_task_scx() takes int enq_flags from the sched_class interface. SCX enqueue flags starting at bit 32 (SCX_ENQ_PREEMPT and above) are silently truncated when passed through activate_task(). extra_enq_flags was added as a workaround - storing high bits in rq->scx.extra_enq_flags and OR-ing them back in enqueue_task_scx(). However, the OR target is still the int parameter, so the high bits are lost anyway. The current impact is limited as the only affected flag is SCX_ENQ_PREEMPT which is informational to the BPF scheduler - its loss means the scheduler doesn't know about preemption but doesn't cause incorrect behavior. Fix by renaming the int parameter to core_enq_flags and introducing a u64 enq_flags local that merges both sources. All downstream functions already take u64 enq_flags. Fixes: f0e1a0643a59 ("sched_ext: Implement BPF extensible scheduler class") Cc: stable@vger.kernel.org # v6.12+ Acked-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2026-03-06bpf: collect only live registers in linked regsEduard Zingerman1-3/+10
Fix an inconsistency between func_states_equal() and collect_linked_regs(): - regsafe() uses check_ids() to verify that cached and current states have identical register id mapping. - func_states_equal() calls regsafe() only for registers computed as live by compute_live_registers(). - clean_live_states() is supposed to remove dead registers from cached states, but it can skip states belonging to an iterator-based loop. - collect_linked_regs() collects all registers sharing the same id, ignoring the marks computed by compute_live_registers(). Linked registers are stored in the state's jump history. - backtrack_insn() marks all linked registers for an instruction as precise whenever one of the linked registers is precise. The above might lead to a scenario: - There is an instruction I with register rY known to be dead at I. - Instruction I is reached via two paths: first A, then B. - On path A: - There is an id link between registers rX and rY. - Checkpoint C is created at I. - Linked register set {rX, rY} is saved to the jump history. - rX is marked as precise at I, causing both rX and rY to be marked precise at C. - On path B: - There is no id link between registers rX and rY, otherwise register states are sub-states of those in C. - Because rY is dead at I, check_ids() returns true. - Current state is considered equal to checkpoint C, propagate_precision() propagates spurious precision mark for register rY along the path B. - Depending on a program, this might hit verifier_bug() in the backtrack_insn(), e.g. if rY ∈ [r1..r5] and backtrack_insn() spots a function call. The reproducer program is in the next patch. This was hit by sched_ext scx_lavd scheduler code. Changes in tests: - verifier_scalar_ids.c selftests need modification to preserve some registers as live for __msg() checks. - exceptions_assert.c adjusted to match changes in the verifier log, R0 is dead after conditional instruction and thus does not get range. - precise.c adjusted to match changes in the verifier log, register r9 is dead after comparison and it's range is not important for test. Reported-by: Emil Tsalapatis <emil@etsalapatis.com> Fixes: 0fb3cf6110a5 ("bpf: use register liveness information for func_states_equal") Signed-off-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/r/20260306-linked-regs-and-propagate-precision-v1-1-18e859be570d@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2026-03-07padata: Remove cpu online check from cpu add and removalChuyi Zhou1-18/+8
During the CPU offline process, the dying CPU is cleared from the cpu_online_mask in takedown_cpu(). After this step, various CPUHP_*_DEAD callbacks are executed to perform cleanup jobs for the dead CPU, so this cpu online check in padata_cpu_dead() is unnecessary. Similarly, when executing padata_cpu_online() during the CPUHP_AP_ONLINE_DYN phase, the CPU has already been set in the cpu_online_mask, the action even occurs earlier than the CPUHP_AP_ONLINE_IDLE stage. Remove this unnecessary cpu online check in __padata_add_cpu() and __padata_remove_cpu(). Signed-off-by: Chuyi Zhou <zhouchuyi@bytedance.com> Acked-by: Daniel Jordan <daniel.m.jordan@oracle.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2026-03-06tracing: Fix trace_buf_size= cmdline parameter with sizes >= 2GCalvin Owens1-3/+3
Some of the sizing logic through tracer_alloc_buffers() uses int internally, causing unexpected behavior if the user passes a value that does not fit in an int (on my x86 machine, the result is uselessly tiny buffers). Fix by plumbing the parameter's real type (unsigned long) through to the ring buffer allocation functions, which already use unsigned long. It has always been possible to create larger ring buffers via the sysfs interface: this only affects the cmdline parameter. Cc: stable@vger.kernel.org Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Link: https://patch.msgid.link/bff42a4288aada08bdf74da3f5b67a2c28b761f8.1772852067.git.calvin@wbinvd.org Fixes: 73c5162aa362 ("tracing: keep ring buffer to minimum size till used") Signed-off-by: Calvin Owens <calvin@wbinvd.org> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2026-03-06bpf: Fix u32/s32 bounds when ranges cross min/max boundaryEduard Zingerman1-0/+24
Same as in __reg64_deduce_bounds(), refine s32/u32 ranges in __reg32_deduce_bounds() in the following situations: - s32 range crosses U32_MAX/0 boundary, positive part of the s32 range overlaps with u32 range: 0 U32_MAX | [xxxxxxxxxxxxxx u32 range xxxxxxxxxxxxxx] | |----------------------------|----------------------------| |xxxxx s32 range xxxxxxxxx] [xxxxxxx| 0 S32_MAX S32_MIN -1 - s32 range crosses U32_MAX/0 boundary, negative part of the s32 range overlaps with u32 range: 0 U32_MAX | [xxxxxxxxxxxxxx u32 range xxxxxxxxxxxxxx] | |----------------------------|----------------------------| |xxxxxxxxx] [xxxxxxxxxxxx s32 range | 0 S32_MAX S32_MIN -1 - No refinement if ranges overlap in two intervals. This helps for e.g. consider the following program: call %[bpf_get_prandom_u32]; w0 &= 0xffffffff; if w0 < 0x3 goto 1f; // on fall-through u32 range [3..U32_MAX] if w0 s> 0x1 goto 1f; // on fall-through s32 range [S32_MIN..1] if w0 s< 0x0 goto 1f; // range can be narrowed to [S32_MIN..-1] r10 = 0; 1: ...; The reg_bounds.c selftest is updated to incorporate identical logic, refinement based on non-overflowing range halves: ((x ∩ [0, smax]) ∩ (y ∩ [0, smax])) ∪ ((x ∩ [smin,-1]) ∩ (y ∩ [smin,-1])) Reported-by: Andrea Righi <arighi@nvidia.com> Reported-by: Emil Tsalapatis <emil@etsalapatis.com> Closes: https://lore.kernel.org/bpf/aakqucg4vcujVwif@gpd4/T/ Reviewed-by: Emil Tsalapatis <emil@etsalapatis.com> Acked-by: Shung-Hsi Yu <shung-hsi.yu@suse.com> Signed-off-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/r/20260306-bpf-32-bit-range-overflow-v3-1-f7f67e060a6b@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2026-03-06cgroup: Don't expose dead tasks in cgroupSebastian Andrzej Siewior1-0/+6
Once a task exits it has its state set to TASK_DEAD and then it is removed from the cgroup it belonged to. The last step happens on the task gets out of its last schedule() invocation and is delayed on PREEMPT_RT due to locking constraints. As a result it is possible to receive a pid via waitpid() of a task which is still listed in cgroup.procs for the cgroup it belonged to. This is something that systemd does not expect and as a result it waits for its exit until a time out occurs. This can also be reproduced on !PREEMPT_RT kernel with a significant delay in do_exit() after exit_notify(). Hide the task from the output which have PF_EXITING set which is done before the parent is notified. Keeping zombies with live threads shouldn't break anything (suggested by Tejun). Reported-by: Bert Karwatzki <spasswolf@web.de> Closes: https://lore.kernel.org/all/20260219164648.3014-1-spasswolf@web.de/ Tested-by: Bert Karwatzki <spasswolf@web.de> Fixes: 9311e6c29b34 ("cgroup: Fix sleeping from invalid context warning on PREEMPT_RT") Cc: stable@vger.kernel.org # v6.19+ Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Tejun Heo <tj@kernel.org>
2026-03-06tracing: Fix enabling multiple events on the kernel command line and bootconfigAndrei-Alexandru Tachici1-1/+5
Multiple events can be enabled on the kernel command line via a comma separator. But if the are specified one at a time, then only the last event is enabled. This is because the event names are saved in a temporary buffer, and each call by the init cmdline code will reset that buffer. This also affects names in the boot config file, as it may call the callback multiple times with an example of: kernel.trace_event = ":mod:rproc_qcom_common", ":mod:qrtr", ":mod:qcom_aoss" Change the cmdline callback function to append a comma and the next value if the temporary buffer already has content. Cc: stable@vger.kernel.org Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Link: https://patch.msgid.link/20260302-trace-events-allow-multiple-modules-v1-1-ce4436e37fb8@oss.qualcomm.com Signed-off-by: Andrei-Alexandru Tachici <andrei-alexandru.tachici@oss.qualcomm.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2026-03-06tracing: Add NULL pointer check to trigger_data_free()Guenter Roeck1-0/+3
If trigger_data_alloc() fails and returns NULL, event_hist_trigger_parse() jumps to the out_free error path. While kfree() safely handles a NULL pointer, trigger_data_free() does not. This causes a NULL pointer dereference in trigger_data_free() when evaluating data->cmd_ops->set_filter. Fix the problem by adding a NULL pointer check to trigger_data_free(). The problem was found by an experimental code review agent based on gemini-3.1-pro while reviewing backports into v6.18.y. Cc: Miaoqian Lin <linmq006@gmail.com> Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Steven Rostedt (Google) <rostedt@goodmis.org> Link: https://patch.msgid.link/20260305193339.2810953-1-linux@roeck-us.net Fixes: 0550069cc25f ("tracing: Properly process error handling in event_hist_trigger_parse()") Assisted-by: Gemini:gemini-3.1-pro Signed-off-by: Guenter Roeck <linux@roeck-us.net> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2026-03-06sched_ext: Add basic building blocks for nested sub-scheduler dispatchingTejun Heo2-15/+108
This is an early-stage partial implementation that demonstrates the core building blocks for nested sub-scheduler dispatching. While significant work remains in the enqueue path and other areas, this patch establishes the fundamental mechanisms needed for hierarchical scheduler operation. The key building blocks introduced include: - Private stack support for ops.dispatch() to prevent stack overflow when walking down nested schedulers during dispatch operations - scx_bpf_sub_dispatch() kfunc that allows parent schedulers to trigger dispatch operations on their direct child schedulers - Proper parent-child relationship validation to ensure dispatch requests are only made to legitimate child schedulers - Updated scx_dispatch_sched() to handle both nested and non-nested invocations with appropriate kf_mask handling The qmap scheduler is updated to demonstrate the functionality by calling scx_bpf_sub_dispatch() on registered child schedulers when it has no tasks in its own queues. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Andrea Righi <arighi@nvidia.com>
2026-03-06sched_ext: Add rhashtable lookup for sub-schedulersTejun Heo2-5/+47
Add rhashtable-based lookup for sub-schedulers indexed by cgroup_id to enable efficient scheduler discovery in preparation for multiple scheduler support. The hash table allows quick lookup of the appropriate scheduler instance when processing tasks from different cgroups. This extends scx_link_sched() to register sub-schedulers in the hash table and scx_unlink_sched() to remove them. A new scx_find_sub_sched() function provides the lookup interface. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Andrea Righi <arighi@nvidia.com>
2026-03-06sched_ext: Factor out scx_link_sched() and scx_unlink_sched()Tejun Heo1-22/+31
Factor out scx_link_sched() and scx_unlink_sched() functions to reduce code duplication in the scheduler enable/disable paths. No functional change. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Andrea Righi <arighi@nvidia.com>
2026-03-06sched_ext: Make scx_bpf_reenqueue_local() sub-sched awareTejun Heo3-14/+62
scx_bpf_reenqueue_local() currently re-enqueues all tasks on the local DSQ regardless of which sub-scheduler owns them. With multiple sub-schedulers, each should only re-enqueue tasks it owns or are owned by its descendants. Replace the per-rq boolean flag with a lock-free linked list to track per-scheduler reenqueue requests. Filter tasks in reenq_local() using hierarchical ownership checks and block deferrals during bypass to prevent use on dead schedulers. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Andrea Righi <arighi@nvidia.com>
2026-03-06sched_ext: Add scx_sched back pointer to scx_sched_pcpuTejun Heo2-0/+6
Add a back pointer from scx_sched_pcpu to scx_sched. This will be used by the next patch to make scx_bpf_reenqueue_local() sub-sched aware. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Andrea Righi <arighi@nvidia.com>
2026-03-06sched_ext: Implement cgroup sub-sched enabling and disablingTejun Heo1-6/+279
The preceding changes implemented the framework to support cgroup sub-scheds and updated scheduling paths and kfuncs so that they have minimal but working support for sub-scheds. However, actual sub-sched enabling/disabling hasn't been implemented yet and all tasks stayed on scx_root. Implement cgroup sub-sched enabling and disabling to actually activate sub-scheds: - Both enable and disable operations bypass only the tasks in the subtree of the child being enabled or disabled to limit disruptions. - When enabling, all candidate tasks are first initialized for the child sched. Once that succeeds, the tasks are exited for the parent and then switched over to the child. This adds a bit of complication but guarantees that child scheduler failures are always contained. - Disabling works the same way in the other direction. However, when the parent may fail to initialize a task, disabling is propagated up to the parent. While this means that a parent sched fail due to a child sched event, the failure can only originate from the parent itself (its ops.init_task()). The only effect a malfunctioning child can have on the parent is attempting to move the tasks back to the parent. After this change, although not all the necessary mechanisms are in place yet, sub-scheds can take control of their tasks and schedule them. v2: Fix missing scx_cgroup_unlock()/percpu_up_write() in abort path (Cheng-Yang Chou). Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Andrea Righi <arighi@nvidia.com>
2026-03-06sched_ext: Support dumping multiple schedulers and add scheduler identificationTejun Heo1-11/+43
Extend scx_dump_state() to support multiple schedulers and improve task identification in dumps. The function now takes a specific scheduler to dump and can optionally filter tasks by scheduler. scx_dump_task() now displays which scheduler each task belongs to, using "*" to mark tasks owned by the scheduler being dumped. Sub-schedulers are identified with their level and cgroup ID. The SysRq-D handler now iterates through all active schedulers under scx_sched_lock and dumps each one separately. For SysRq-D dumps, only tasks owned by each scheduler are dumped to avoid redundancy since all schedulers are being dumped. Error-triggered dumps continue to dump all tasks since only that specific scheduler is being dumped. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Andrea Righi <arighi@nvidia.com>
2026-03-06sched_ext: Convert scx_dump_state() spinlock to raw spinlockTejun Heo1-5/+2
The scx_dump_state() function uses a regular spinlock to serialize access. In a subsequent patch, this function will be called while holding scx_sched_lock, which is a raw spinlock, creating a lock nesting violation. Convert the dump_lock to a raw spinlock and use the guard macro for cleaner lock management. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Andrea Righi <arighi@nvidia.com>
2026-03-06sched_ext: Make watchdog sub-sched awareTejun Heo2-25/+56
Currently, the watchdog checks all tasks as if they are all on scx_root. Move scx_watchdog_timeout inside scx_sched and make check_rq_for_timeouts() use the timeout from the scx_sched associated with each task. refresh_watchdog() is added, which determines the timer interval as half of the shortest watchdog timeouts of all scheds and arms or disarms it as necessary. Every scx_sched instance has equivalent or better detection latency while sharing the same timer. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Andrea Righi <arighi@nvidia.com>
2026-03-06sched_ext: Move scx_dsp_ctx and scx_dsp_max_batch into scx_schedTejun Heo2-40/+34
scx_dsp_ctx and scx_dsp_max_batch are global variables used in the dispatch path. In prepration for multiple scheduler support, move the former into scx_sched_pcpu and the latter into scx_sched. No user-visible behavior changes intended. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Andrea Righi <arighi@nvidia.com>
2026-03-06sched_ext: Dispatch from all scx_sched instancesTejun Heo1-4/+8
The cgroup sub-sched support involves invasive changes to many areas of sched_ext. The overall scaffolding is now in place and the next step is implementing sub-sched enable/disable. To enable partial testing and verification, update balance_one() to dispatch from all scx_sched instances until it finds a task to run. This should keep scheduling working when sub-scheds are enabled with tasks on them. This will be replaced by BPF-driven hierarchical operation. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Andrea Righi <arighi@nvidia.com>
2026-03-06sched_ext: Implement hierarchical bypass modeTejun Heo2-7/+101
When a sub-scheduler enters bypass mode, its tasks must be scheduled by an ancestor to guarantee forward progress. Tasks from bypassing descendants are queued in the bypass DSQs of the nearest non-bypassing ancestor, or the root scheduler if all ancestors are bypassing. This requires coordination between bypassing schedulers and their hosts. Add bypass_enq_target_dsq() to find the correct bypass DSQ by walking up the hierarchy until reaching a non-bypassing ancestor. When a sub-scheduler starts bypassing, all its runnable tasks are re-enqueued after scx_bypassing() is set, ensuring proper migration to ancestor bypass DSQs. Update scx_dispatch_sched() to handle hosting bypassed descendants. When a scheduler is not bypassing but has bypassing descendants, it must schedule both its own tasks and bypassed descendant tasks. A simple policy is implemented where every Nth dispatch attempt (SCX_BYPASS_HOST_NTH=2) consumes from the bypass DSQ. A fallback consumption is also added at the end of dispatch to ensure bypassed tasks make progress even when normal scheduling is idle. Update enable_bypass_dsp() and disable_bypass_dsp() to increment bypass_dsp_enable_depth on both the bypassing scheduler and its parent host, ensuring both can detect that bypass dispatch is active through bypass_dsp_enabled(). Add SCX_EV_SUB_BYPASS_DISPATCH event counter to track scheduling of bypassed descendant tasks. v2: Fix comment typos (Andrea). Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Andrea Righi <arighi@nvidia.com>
2026-03-06sched_ext: Separate bypass dispatch enabling from bypass depth trackingTejun Heo2-10/+69
The bypass_depth field tracks nesting of bypass operations but is also used to determine whether the bypass dispatch path should be active. With hierarchical scheduling, child schedulers may need to activate their parent's bypass dispatch path without affecting the parent's bypass_depth, requiring separation of these concerns. Add bypass_dsp_enable_depth and bypass_dsp_claim to independently control bypass dispatch path activation. The new enable_bypass_dsp() and disable_bypass_dsp() functions manage this state with proper claim semantics to prevent races. The bypass dispatch path now only activates when bypass_dsp_enabled() returns true, which checks the new enable_depth counter. The disable operation is carefully ordered after all tasks are moved out of bypass DSQs to ensure they are drained before the dispatch path is disabled. During scheduler teardown, disable_bypass_dsp() is called explicitly to ensure cleanup even if bypass mode was never entered normally. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Andrea Righi <arighi@nvidia.com>
2026-03-06sched_ext: When calling ops.dispatch() @prev must be on the same scx_schedTejun Heo1-2/+3
The @prev parameter passed into ops.dispatch() is expected to be on the same sched. Passing in @prev which isn't on the sched can spuriously trigger failures that can kill the scheduler. Pass in @prev iff it's on the same sched. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Andrea Righi <arighi@nvidia.com>
2026-03-06sched_ext: Factor out scx_dispatch_sched()Tejun Heo1-58/+65
In preparation of multiple scheduler support, factor out scx_dispatch_sched() from balance_one(). The function boundary makes remembering $prev_on_scx and $prev_on_rq less useful. Open code $prev_on_scx in balance_one() and $prev_on_rq in both balance_one() and scx_dispatch_sched(). No functional changes. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Andrea Righi <arighi@nvidia.com>
2026-03-06sched_ext: Prepare bypass mode for hierarchical operationTejun Heo1-22/+63
Bypass mode is used to simplify enable and disable paths and guarantee forward progress when something goes wrong. When enabled, all tasks skip BPF scheduling and fall back to simple in-kernel FIFO scheduling. While this global behavior can be used as-is when dealing with sub-scheds, that would allow any sub-sched instance to affect the whole system in a significantly disruptive manner. Make bypass state hierarchical by propagating it to descendants and updating per-cpu flags accordingly. This allows an scx_sched to bypass if itself or any of its ancestors are in bypass mode. However, this doesn't make the actual bypass enqueue and dispatch paths hierarchical yet. That will be done in later patches. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Andrea Righi <arighi@nvidia.com>
2026-03-06sched_ext: Move bypass state into scx_schedTejun Heo4-81/+80
In preparation of multiple scheduler support, make bypass state per-scx_sched. Move scx_bypass_depth, bypass_timestamp and bypass_lb_timer from globals into scx_sched. Move SCX_RQ_BYPASSING from rq to scx_sched_pcpu as SCX_SCHED_PCPU_BYPASSING. scx_bypass() now takes @sch and scx_rq_bypassing(rq) is replaced with scx_bypassing(sch, cpu). All callers updated. scx_bypassed_for_enable existed to balance the global scx_bypass_depth when enable failed. Now that bypass_depth is per-scheduler, the counter is destroyed along with the scheduler on enable failure. Remove scx_bypassed_for_enable. As all tasks currently use the root scheduler, there's no observable behavior change. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Andrea Righi <arighi@nvidia.com>