path: root/kernel
Age  Commit message  Author  Files  Lines
2025-11-12sched_ext: Exit dispatch and move operations immediately when abortingTejun Heo1-44/+18
62dcbab8b0ef ("sched_ext: Avoid live-locking bypass mode switching") introduced the breather mechanism to inject delays during bypass mode switching. It maintains operation semantics unchanged while reducing lock contention to avoid live-locks on large NUMA systems. However, the breather only activates when exiting the scheduler, so there's no need to maintain operation semantics. Simplify by exiting dispatch and move operations immediately when scx_aborting is set. In consume_dispatch_q(), break out of the task iteration loop. In scx_dsq_move(), return early before acquiring locks. This also fixes cases the breather mechanism cannot handle. When a large system has many runnable threads affinitized to different CPU subsets and the BPF scheduler places them all into a single DSQ, many CPUs can scan the DSQ concurrently for tasks they can run. This can cause DSQ and RQ locks to be held for extended periods, leading to various failure modes. The breather cannot solve this because once in the consume loop, there's no exit. The new mechanism fixes this by exiting the loop immediately. The bypass DSQ is exempted to ensure the bypass mechanism itself can make progress. v2: Use READ_ONCE() when reading scx_aborting (Andrea Righi). Reported-by: Dan Schatzberg <schatzberg.dan@gmail.com> Reviewed-by: Dan Schatzberg <schatzberg.dan@gmail.com> Cc: Andrea Righi <arighi@nvidia.com> Cc: Emil Tsalapatis <etsal@meta.com> Signed-off-by: Tejun Heo <tj@kernel.org>
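The early-exit behavior described above can be illustrated outside the kernel. The following is a minimal userspace sketch, not the actual sched_ext code: `aborting`, `struct task`, and `consume_queue()` are hypothetical stand-ins for scx_aborting, the queued tasks, and consume_dispatch_q(), and a relaxed atomic load plays the role of READ_ONCE().

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the kernel's scx_aborting flag. */
static atomic_bool aborting;

/* Hypothetical task: runnable only on one CPU. */
struct task {
	int allowed_cpu;
};

/*
 * Scan a queue for the first task that can run on @cpu.
 * Returns the task index, or -1 if nothing was found or the scan was
 * abandoned because an abort is in progress. A bypass queue is exempt
 * from the abort check so that bypass itself keeps making progress.
 */
static int consume_queue(const struct task *q, size_t n, int cpu,
			 bool is_bypass)
{
	for (size_t i = 0; i < n; i++) {
		/* Re-read the flag on every iteration, as READ_ONCE() would. */
		if (!is_bypass &&
		    atomic_load_explicit(&aborting, memory_order_relaxed))
			return -1;

		if (q[i].allowed_cpu == cpu)
			return (int)i;
	}
	return -1;
}
```

Checking the flag inside the loop, rather than only before entering it, is what lets a CPU that is already scanning a long queue leave promptly once the abort starts.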
2025-11-12sched_ext: Simplify breather mechanism with scx_aborting flagTejun Heo1-29/+25
The breather mechanism was introduced in 62dcbab8b0ef ("sched_ext: Avoid live-locking bypass mode switching") and e32c260195e6 ("sched_ext: Enable the ops breather and eject BPF scheduler on softlockup") to prevent live-locks by injecting delays when CPUs are trapped in dispatch paths. Currently, it uses scx_breather_depth (atomic_t) and scx_in_softlockup (unsigned long) with separate increment/decrement and cleanup operations. The breather is only activated when aborting, so tie it directly to the exit mechanism. Replace both variables with scx_aborting flag set when exit is claimed and cleared after bypass is enabled. Introduce scx_claim_exit() to consolidate exit_kind claiming and breather enablement. This eliminates scx_clear_softlockup() and simplifies scx_softlockup() and scx_bypass(). The breather mechanism will be replaced by a different abort mechanism in a future patch. This simplification prepares for that change. Reviewed-by: Dan Schatzberg <schatzberg.dan@gmail.com> Reviewed-by: Emil Tsalapatis <emil@etsalapatis.com> Acked-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-11-12sched_ext: Use per-CPU DSQs instead of per-node global DSQs in bypass modeTejun Heo2-3/+14
Bypass mode routes tasks through fallback dispatch queues. Originally a single global DSQ, b7b3b2dbae73 ("sched_ext: Split the global DSQ per NUMA node") changed this to per-node DSQs to resolve NUMA-related livelocks. Dan Schatzberg found per-node DSQs can still livelock when many threads are pinned to different small CPU subsets: each CPU must scan many incompatible tasks to find runnable ones, causing severe contention with high CPU counts. Switch to per-CPU bypass DSQs. Each task queues on its current CPU. Default idle CPU selection and direct dispatch handle most cases well. This introduces a failure mode when tasks concentrate on one CPU in over-saturated systems. If the BPF scheduler severely skews placement before triggering bypass, that CPU's queue may be too long to drain, causing RCU stalls. A load balancer in a future patch will address this. The bypass DSQ is separate from local DSQ to enable load balancing: local DSQs use rq locks, preventing efficient scanning and transfer across CPUs, especially problematic when systems are already contended. v2: Clarified why bypass DSQ is separate from local DSQ (Andrea Righi). Reported-by: Dan Schatzberg <schatzberg.dan@gmail.com> Reviewed-by: Dan Schatzberg <schatzberg.dan@gmail.com> Reviewed-by: Andrea Righi <arighi@nvidia.com> Reviewed-by: Emil Tsalapatis <emil@etsalapatis.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-11-12sched_ext: Refactor do_enqueue_task() local and global DSQ pathsTejun Heo1-9/+12
The local and global DSQ enqueue paths in do_enqueue_task() share the same slice refill logic. Factor out the common code into a shared enqueue label. This makes adding new enqueue cases easier. No functional changes. Reviewed-by: Andrea Righi <arighi@nvidia.com> Reviewed-by: Emil Tsalapatis <emil@etsalapatis.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-11-12sched_ext: Use shorter slice in bypass modeTejun Heo1-3/+31
There have been reported cases of bypass mode not making forward progress fast enough. The 20ms default slice is unnecessarily long for bypass mode where the primary goal is ensuring all tasks can make forward progress. Introduce SCX_SLICE_BYPASS set to 5ms and make the scheduler automatically switch to it when entering bypass mode. Also make the bypass slice value tunable through the slice_bypass_us module parameter (adjustable between 100us and 100ms) to make it easier to test whether slice durations are a factor in problem cases. v3: Use READ_ONCE/WRITE_ONCE for scx_slice_dfl access (Dan). v2: Removed slice_dfl_us module parameter. Fixed typos (Andrea). Reviewed-by: Emil Tsalapatis <emil@etsalapatis.com> Reviewed-by: Andrea Righi <arighi@nvidia.com> Cc: Dan Schatzberg <schatzberg.dan@gmail.com> Signed-off-by: Tejun Heo <tj@kernel.org>
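As a rough illustration of the tunable range described above, here is a small sketch of validating a bypass slice given in microseconds. The 5ms default and the 100us to 100ms bounds come from the commit message; the names `slice_bypass_set()` and the `SLICE_BYPASS_*` macros are hypothetical, and rejecting out-of-range values (rather than clamping them) is an assumption.

```c
#include <stdint.h>

#define SLICE_BYPASS_DFL_US	5000ULL		/* 5ms default */
#define SLICE_BYPASS_MIN_US	100ULL		/* 100us lower bound */
#define SLICE_BYPASS_MAX_US	100000ULL	/* 100ms upper bound */
#define NSEC_PER_USEC_DEMO	1000ULL

/*
 * Validate @us and, if it is within bounds, store the slice in
 * nanoseconds in @slice_ns. Returns 0 on success and -1 if @us is out
 * of range (assumed behavior; the real parameter handler may differ).
 */
static int slice_bypass_set(uint64_t us, uint64_t *slice_ns)
{
	if (us < SLICE_BYPASS_MIN_US || us > SLICE_BYPASS_MAX_US)
		return -1;

	*slice_ns = us * NSEC_PER_USEC_DEMO;
	return 0;
}
```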
2025-11-12sched_ext: Fix unsafe locking in the scx_dump_state()Zqiang1-2/+2
On kernels built with CONFIG_PREEMPT_RT=y, the dump_lock is converted to a sleepable spinlock that does not disable IRQs, so the following scenario occurs:

inconsistent {IN-HARDIRQ-W} -> {HARDIRQ-ON-W} usage.
irq_work/0/27 [HC0[0]:SC0[0]:HE1:SE1] takes:
(&rq->__lock){?...}-{2:2}, at: raw_spin_rq_lock_nested+0x2b/0x40
{IN-HARDIRQ-W} state was registered at:
  lock_acquire+0x1e1/0x510
  _raw_spin_lock_nested+0x42/0x80
  raw_spin_rq_lock_nested+0x2b/0x40
  sched_tick+0xae/0x7b0
  update_process_times+0x14c/0x1b0
  tick_periodic+0x62/0x1f0
  tick_handle_periodic+0x48/0xf0
  timer_interrupt+0x55/0x80
  __handle_irq_event_percpu+0x20a/0x5c0
  handle_irq_event_percpu+0x18/0xc0
  handle_irq_event+0xb5/0x150
  handle_level_irq+0x220/0x460
  __common_interrupt+0xa2/0x1e0
  common_interrupt+0xb0/0xd0
  asm_common_interrupt+0x2b/0x40
  _raw_spin_unlock_irqrestore+0x45/0x80
  __setup_irq+0xc34/0x1a30
  request_threaded_irq+0x214/0x2f0
  hpet_time_init+0x3e/0x60
  x86_late_time_init+0x5b/0xb0
  start_kernel+0x308/0x410
  x86_64_start_reservations+0x1c/0x30
  x86_64_start_kernel+0x96/0xa0
  common_startup_64+0x13e/0x148

other info that might help us debug this:
 Possible unsafe locking scenario:

       CPU0
       ----
  lock(&rq->__lock);
  <Interrupt>
    lock(&rq->__lock);

 *** DEADLOCK ***

stack backtrace:
CPU: 0 UID: 0 PID: 27 Comm: irq_work/0
Call Trace:
 <TASK>
 dump_stack_lvl+0x8c/0xd0
 dump_stack+0x14/0x20
 print_usage_bug+0x42e/0x690
 mark_lock.part.44+0x867/0xa70
 ? __pfx_mark_lock.part.44+0x10/0x10
 ? string_nocheck+0x19c/0x310
 ? number+0x739/0x9f0
 ? __pfx_string_nocheck+0x10/0x10
 ? __pfx_check_pointer+0x10/0x10
 ? kvm_sched_clock_read+0x15/0x30
 ? sched_clock_noinstr+0xd/0x20
 ? local_clock_noinstr+0x1c/0xe0
 __lock_acquire+0xc4b/0x62b0
 ? __pfx_format_decode+0x10/0x10
 ? __pfx_string+0x10/0x10
 ? __pfx___lock_acquire+0x10/0x10
 ? __pfx_vsnprintf+0x10/0x10
 lock_acquire+0x1e1/0x510
 ? raw_spin_rq_lock_nested+0x2b/0x40
 ? __pfx_lock_acquire+0x10/0x10
 ? dump_line+0x12e/0x270
 ? raw_spin_rq_lock_nested+0x20/0x40
 _raw_spin_lock_nested+0x42/0x80
 ? raw_spin_rq_lock_nested+0x2b/0x40
 raw_spin_rq_lock_nested+0x2b/0x40
 scx_dump_state+0x3b3/0x1270
 ? finish_task_switch+0x27e/0x840
 scx_ops_error_irq_workfn+0x67/0x80
 irq_work_single+0x113/0x260
 irq_work_run_list.part.3+0x44/0x70
 run_irq_workd+0x6b/0x90
 ? __pfx_run_irq_workd+0x10/0x10
 smpboot_thread_fn+0x529/0x870
 ? __pfx_smpboot_thread_fn+0x10/0x10
 kthread+0x305/0x3f0
 ? __pfx_kthread+0x10/0x10
 ret_from_fork+0x40/0x70
 ? __pfx_kthread+0x10/0x10
 ret_from_fork_asm+0x1a/0x30
 </TASK>

This commit therefore uses rq_lock_irqsave/irqrestore() to replace rq_lock/unlock() in scx_dump_state().

Fixes: 07814a9439a3 ("sched_ext: Print debug dump after an error exit")
Signed-off-by: Zqiang <qiang.zhang@linux.dev>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-11-12tracing: Have function tracer define options per instanceSteven Rostedt1-5/+5
Currently the function tracer's options are saved via a global mask when it should be per instance. Use the new infrastructure to define a "default_flags" field in the tracer structure that is used for the top level instance as well as new ones.

Currently the global mask causes confusion:

  # cd /sys/kernel/tracing
  # mkdir instances/foo
  # echo function > instances/foo/current_tracer
  # echo 1 > options/func-args
  # echo function > current_tracer
  # cat trace
  [..]
  <idle>-0 [005] d..3. 1050.656187: rcu_needs_cpu() <-tick_nohz_next_event
  <idle>-0 [005] d..3. 1050.656188: get_next_timer_interrupt(basej=0x10002dbad, basem=0xf45fd7d300) <-tick_nohz_next_event
  <idle>-0 [005] d..3. 1050.656189: _raw_spin_lock(lock=0xffff8944bdf5de80) <-__get_next_timer_interrupt
  <idle>-0 [005] d..4. 1050.656190: do_raw_spin_lock(lock=0xffff8944bdf5de80) <-__get_next_timer_interrupt
  <idle>-0 [005] d..4. 1050.656191: _raw_spin_lock_nested(lock=0xffff8944bdf5f140, subclass=1) <-__get_next_timer_interrupt
  # cat instances/foo/options/func-args
  1
  # cat instances/foo/trace
  [..]
  kworker/4:1-88 [004] ...1. 298.127735: next_zone <-refresh_cpu_vm_stats
  kworker/4:1-88 [004] ...1. 298.127736: first_online_pgdat <-refresh_cpu_vm_stats
  kworker/4:1-88 [004] ...1. 298.127738: next_online_pgdat <-refresh_cpu_vm_stats
  kworker/4:1-88 [004] ...1. 298.127739: fold_diff <-refresh_cpu_vm_stats
  kworker/4:1-88 [004] ...1. 298.127741: round_jiffies_relative <-vmstat_update
  [..]

The above shows that updating the "func-args" option at the top level instance also updates the "func-args" option in the instance but because the update is only done by the instance that gets changed (as it should), it's confusing to see that the option is already set in the other instance.

Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Link: https://patch.msgid.link/20251111232429.470883736@kernel.org
Fixes: f20a580627f43 ("ftrace: Allow instances to use function tracing")
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-11-12tracing: Have tracer option be instance specificSteven Rostedt2-74/+186
Tracers can add specific options to modify them. This logic was added before instances were created and the tracer flags were global variables. After instances were created where a tracer may exist in more than one instance, the flags were not updated from being global into instance specific. This causes confusion with these options. For example, the function tracer has an option to enable function arguments:

  # cd /sys/kernel/tracing
  # mkdir instances/foo
  # echo function > instances/foo/current_tracer
  # echo 1 > options/func-args
  # echo function > current_tracer
  # cat trace
  [..]
  <idle>-0 [005] d..3. 1050.656187: rcu_needs_cpu() <-tick_nohz_next_event
  <idle>-0 [005] d..3. 1050.656188: get_next_timer_interrupt(basej=0x10002dbad, basem=0xf45fd7d300) <-tick_nohz_next_event
  <idle>-0 [005] d..3. 1050.656189: _raw_spin_lock(lock=0xffff8944bdf5de80) <-__get_next_timer_interrupt
  <idle>-0 [005] d..4. 1050.656190: do_raw_spin_lock(lock=0xffff8944bdf5de80) <-__get_next_timer_interrupt
  <idle>-0 [005] d..4. 1050.656191: _raw_spin_lock_nested(lock=0xffff8944bdf5f140, subclass=1) <-__get_next_timer_interrupt
  # cat instances/foo/options/func-args
  1
  # cat instances/foo/trace
  [..]
  kworker/4:1-88 [004] ...1. 298.127735: next_zone <-refresh_cpu_vm_stats
  kworker/4:1-88 [004] ...1. 298.127736: first_online_pgdat <-refresh_cpu_vm_stats
  kworker/4:1-88 [004] ...1. 298.127738: next_online_pgdat <-refresh_cpu_vm_stats
  kworker/4:1-88 [004] ...1. 298.127739: fold_diff <-refresh_cpu_vm_stats
  kworker/4:1-88 [004] ...1. 298.127741: round_jiffies_relative <-vmstat_update
  [..]

The above shows that setting "func-args" in the top level instance also set it in the instance "foo", but since the interface of the trace flags is per instance, the update didn't take effect in the "foo" instance.

Update the infrastructure to allow tracers to add a "default_flags" field in the tracer structure that can be set instead of "flags" which will make the flags per instance. If a tracer needs to keep the flags global (like blktrace), keeping the "flags" field set will keep the old behavior.

This does not update function or the function graph tracers. That will be handled later.

Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Link: https://patch.msgid.link/20251111232429.305317942@kernel.org
Fixes: f20a580627f43 ("ftrace: Allow instances to use function tracing")
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-11-12power: always freeze efivarfsChristian Brauner2-8/+4
The efivarfs filesystems must always be frozen and thawed to resync variable state. Make it so. Link: https://patch.msgid.link/20251105-vorbild-zutreffen-fe00d1dd98db@brauner Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-11cpuset: remove need_rebuild_sched_domainsChen Ridong1-5/+1
Previously, update_cpumasks_hier() used need_rebuild_sched_domains to decide whether to invoke rebuild_sched_domains_locked(). Now that rebuild_sched_domains_locked() only sets force_rebuild, the flag is redundant. Hence, remove it. Signed-off-by: Chen Ridong <chenridong@huawei.com> Reviewed-by: Waiman Long <longman@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-11-11cpuset: remove global remote_children listChen Ridong2-12/+11
The remote_children list is used to track all remote partitions attached to a cpuset. However, it serves no other purpose. Using a boolean flag to indicate whether a cpuset is a remote partition is a more direct approach, making remote_children unnecessary. This patch replaces the list with a remote_partition flag in the cpuset structure and removes remote_children entirely. Signed-off-by: Chen Ridong <chenridong@huawei.com> Reviewed-by: Waiman Long <longman@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-11-11cpuset: simplify node setting on errorChen Ridong1-12/+9
There is no need to jump to the 'done' label upon failure, as no cleanup is required. Return the error code directly instead. Signed-off-by: Chen Ridong <chenridong@huawei.com> Reviewed-by: Waiman Long <longman@redhat.com> Reviewed-by: Michal Koutný <mkoutny@suse.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-11-11cgroup: include missing header for struct irq_workBert Karwatzki1-0/+1
To compile cgroup.c with PREEMPT_RT=y, include the header that declares struct irq_work. Fixes: 9311e6c29b34 ("cgroup: Fix sleeping from invalid context warning on PREEMPT_RT") Signed-off-by: Bert Karwatzki <spasswolf@web.de> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-11-11sched/deadline: Minor cleanup in select_task_rq_dl()Shrikanth Hegde1-2/+1
select_task_rq_dl() contains only one goto statement, and there is no need for it. No functional changes. Signed-off-by: Shrikanth Hegde <sshegde@linux.ibm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Juri Lelli <juri.lelli@redhat.com> Link: https://patch.msgid.link/20251014100342.978936-2-sshegde@linux.ibm.com
2025-11-11sched/deadline: Use cpumask_weight_and() in dl_bw_cpusShrikanth Hegde1-10/+1
When cpumask_subset(a,b) holds, cpumask_weight(a) is the same as cpumask_weight_and(a,b).

A for_each_cpu_and(a,b) loop that only counts CPUs can likewise be replaced by cpumask_weight_and(a,b).

No functional change. It could save a few cycles since cpumask_weight_and() would be more efficient, and it needs one less stack variable.

Signed-off-by: Shrikanth Hegde <sshegde@linux.ibm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Juri Lelli <juri.lelli@redhat.com>
Link: https://patch.msgid.link/20251014100342.978936-3-sshegde@linux.ibm.com
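The equivalence claimed above can be checked with plain 64-bit masks. This is a userspace sketch only: `weight_loop()` and `weight_and()` are hypothetical helpers that mimic a counting for_each_cpu_and() loop and cpumask_weight_and() respectively, and a `uint64_t` stands in for struct cpumask.

```c
#include <stdint.h>

/* Count set bits in @x by clearing the lowest set bit each round. */
static unsigned int weight(uint64_t x)
{
	unsigned int n = 0;

	while (x) {
		x &= x - 1;
		n++;
	}
	return n;
}

/* Mimics a for_each_cpu_and(a, b) loop that only increments a counter. */
static unsigned int weight_loop(uint64_t a, uint64_t b)
{
	unsigned int cpus = 0;

	for (int cpu = 0; cpu < 64; cpu++)
		if ((a >> cpu) & (b >> cpu) & 1)
			cpus++;
	return cpus;
}

/* Mimics cpumask_weight_and(a, b): the weight of the intersection. */
static unsigned int weight_and(uint64_t a, uint64_t b)
{
	return weight(a & b);
}
```

When `a` is a subset of `b`, `a & b == a`, so `weight(a)` and `weight_and(a, b)` agree, which is why both branches of the old code collapse into one call.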
2025-11-11sched/deadline: Document dl_serverPeter Zijlstra1-0/+194
Place the notes that resulted from going through the dl_server code in a comment. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
2025-11-11tracing: fprobe: use ftrace if CONFIG_DYNAMIC_FTRACE_WITH_ARGSMenglong Dong1-10/+22
Currently, ftrace is used for the fprobe only if fp->exit_handler does not exist and CONFIG_DYNAMIC_FTRACE_WITH_REGS is enabled. However, CONFIG_DYNAMIC_FTRACE_WITH_REGS is not supported by some architectures, such as arm. What the fprobe needs is the function arguments, so ftrace can also be used for the fprobe if CONFIG_DYNAMIC_FTRACE_WITH_ARGS is enabled. Therefore, use ftrace if either CONFIG_DYNAMIC_FTRACE_WITH_REGS or CONFIG_DYNAMIC_FTRACE_WITH_ARGS is enabled. Link: https://lore.kernel.org/all/20251103063434.47388-1-dongml2@chinatelecom.cn/ Signed-off-by: Menglong Dong <dongml2@chinatelecom.cn> Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
2025-11-11tracing: fprobe: optimization for entry only caseMenglong Dong1-9/+119
Currently, fgraph is used for the fprobe even if only the entry needs to be traced. However, ftrace performs better than fgraph, and ftrace_ops can be used for this case. With this change, the performance of kprobe-multi increases from 54M/s to 69M/s.

Before this commit:

  $ ./benchs/run_bench_trigger.sh kprobe-multi
  kprobe-multi : 54.663 ± 0.493M/s

After this commit:

  $ ./benchs/run_bench_trigger.sh kprobe-multi
  kprobe-multi : 69.447 ± 0.143M/s

Mitigations were disabled during the benchmark runs above.

Link: https://lore.kernel.org/all/20251015083238.2374294-2-dongml2@chinatelecom.cn/
Signed-off-by: Menglong Dong <dongml2@chinatelecom.cn>
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
2025-11-11tracing: fprobe: Fix to init fprobe_ip_table earlierMasami Hiramatsu (Google)1-1/+1
Since the fprobe_ip_table is used from module unloading in the failure path of load_module(), it must be initialized earlier than late_initcall(). Otherwise, fprobe_module_callback() will use an uninitialized spinlock of fprobe_ip_table. Initialize fprobe_ip_table in core_initcall, which is the same timing as ftrace. Link: https://lore.kernel.org/all/175939434403.3665022.13030530757238556332.stgit@mhiramat.tok.corp.google.com/ Reported-by: kernel test robot <oliver.sang@intel.com> Closes: https://lore.kernel.org/oe-lkp/202509301440.be4b3631-lkp@intel.com Fixes: e5a4cc28a052 ("tracing: fprobe: use rhltable for fprobe_ip_table") Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Reviewed-by: Menglong Dong <menglong8.dong@gmail.com>
2025-11-11rv: Add explicit lockdep context for reactorsThomas Weißschuh1-0/+4
Reactors can be called from any context through tracepoints. When developing reactors care needs to be taken to only call APIs which are safe. As the tracepoints used during testing may not actually be called from restrictive contexts lockdep may not be helpful. Add explicit overrides to help lockdep find invalid code patterns. The usage of LD_WAIT_FREE will trigger lockdep warnings in the panic reactor. These are indeed valid warnings but they are out of scope for RV and will instead be fixed by the printk subsystem. Signed-off-by: Thomas Weißschuh <thomas.weissschuh@linutronix.de> Acked-by: Nam Cao <namcao@linutronix.de> Link: https://lore.kernel.org/r/20251014-rv-lockdep-v1-3-0b9e51919ea8@linutronix.de Signed-off-by: Gabriele Monaco <gmonaco@redhat.com>
2025-11-11rv: Make rv_reacting_on() staticThomas Weißschuh1-1/+1
There are no external users left. Signed-off-by: Thomas Weißschuh <thomas.weissschuh@linutronix.de> Link: https://lore.kernel.org/r/20251014-rv-lockdep-v1-2-0b9e51919ea8@linutronix.de Signed-off-by: Gabriele Monaco <gmonaco@redhat.com>
2025-11-11rv: Pass va_list to reactorsThomas Weißschuh3-11/+17
The only thing the reactors can do with the passed in varargs is to convert it into a va_list. Do that in a central helper instead. It simplifies the reactors, removes some hairy macro-generated code and introduces a convenient hook point to modify reactor behavior. Signed-off-by: Thomas Weißschuh <thomas.weissschuh@linutronix.de> Link: https://lore.kernel.org/r/20251014-rv-lockdep-v1-1-0b9e51919ea8@linutronix.de Signed-off-by: Gabriele Monaco <gmonaco@redhat.com>
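The idea of converting varargs to a va_list in one central helper can be sketched in userspace as follows. This is not the RV code: `reactor_fn`, `react()`, `buffer_reactor()`, and `last_msg` are hypothetical names chosen for illustration, and vsnprintf() stands in for whatever a real reactor does with the message.

```c
#include <stdarg.h>
#include <stdio.h>

/* A reactor only ever sees a format string and a ready-made va_list. */
typedef void (*reactor_fn)(const char *fmt, va_list args);

/* Demo sink so the effect of a reaction can be observed. */
static char last_msg[128];

/* Example reactor: formats the message into the demo buffer. */
static void buffer_reactor(const char *fmt, va_list args)
{
	vsnprintf(last_msg, sizeof(last_msg), fmt, args);
}

/*
 * Central helper: the only place where "..." is turned into a va_list.
 * It is also a natural hook point for behavior shared by all reactors.
 */
static void react(reactor_fn fn, const char *fmt, ...)
{
	va_list args;

	va_start(args, fmt);
	fn(fmt, args);
	va_end(args);
}
```

A call such as `react(buffer_reactor, "monitor %s: event %d", "wip", 3)` then reaches the reactor with the arguments already packed, so individual reactors need no variadic boilerplate of their own.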
2025-11-11sched/deadline: Fix dl_server stop conditionPeter Zijlstra1-2/+38
Gabriele reported that the dl_server doesn't stop as expected. The problem was found to be the fact that idle time and fair runtime are treated equally. Both will count towards dl_server runtime and push the activation forwards when it is in the zero-laxity wait state.

Notably:

  dl_server_update_idle()
    update_curr_dl_se()
      if (dl_defer && dl_throttled && dl_runtime_exceeded())
        hrtimer_try_to_cancel();        // stop timer
        replenish_dl_new_period()
          deadline = now + dl_deadline; // fwd period
          runtime = dl_runtime;
        start_dl_timer();               // restart timer

And while we do want idle time accounted towards the *current* activation of the dl_server -- after all, a fair task could have run if we had any -- we don't necessarily want idle time to cause or push forward an activation.

Introduce dl_defer_idle to make this distinction. It is set once idle time has pushed the activation forward; once set, idle time is only allowed to consume runtime but not to push the activation. This will then cause dl_server_timer() to fire, which will stop the dl_server.

Any non-idle time accounting during this phase will clear dl_defer_idle, so only a full period of idle will cause the dl_server to stop.

Reported-by: Gabriele Monaco <gmonaco@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20251101000057.GA2184199@noisy.programming.kicks-ass.net
2025-11-11sched/deadline: Fix dl_server time accountingPeter Zijlstra4-35/+33
The dl_server time accounting code is a little odd. The normal scheduler pattern is to update curr before doing something, such that the old state is fully accounted before changing state. Notably, the dl_server_timer() needs to propagate the current time accounting since the current task could be run by the dl_server and thus this can affect dl_se->runtime. Similarly for dl_server_start(). And since the (deferred) dl_server wants idle time accounted, rework sched_idle_class time accounting to be more like all the others. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://patch.msgid.link/20251020141130.GJ3245006@noisy.programming.kicks-ass.net
2025-11-11sched/core: Remove double update_rq_clock() in __set_cpus_allowed_ptr_locked()Hao Jia1-2/+0
Since commit d4c64207b88a ("sched: Cleanup the sched_change NOCLOCK usage"), update_rq_clock() is called in do_set_cpus_allowed() -> sched_change_begin() to update the rq clock. This results in a duplicate call to update_rq_clock() in __set_cpus_allowed_ptr_locked(). While holding the rq lock and before calling do_set_cpus_allowed(), there is nothing that depends on an updated rq_clock. Therefore, remove the redundant update_rq_clock() in __set_cpus_allowed_ptr_locked() to avoid the warning about double rq clock updates. Fixes: d4c64207b88a ("sched: Cleanup the sched_change NOCLOCK usage") Signed-off-by: Hao Jia <jiahao1@lixiang.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: K Prateek Nayak <kprateek.nayak@amd.com> Link: https://patch.msgid.link/20251029093655.31252-1-jiahao.kernel@gmail.com
2025-11-11sched/eevdf: Fix min_vruntime vs avg_vruntimePeter Zijlstra3-95/+31
Basically, from the constraint that the sum of lag is zero, you can infer that the 0-lag point is the weighted average of the individual vruntime, which is what we're trying to compute:

        \Sum w_i * v_i
  avg = --------------
           \Sum w_i

Now, since vruntime takes the whole u64 (worse, it wraps), this multiplication term in the numerator is not something we can compute; instead we do the min_vruntime (v0 henceforth) thing like:

  v_i = (v_i - v0) + v0

This does two things:
 - it keeps the key: (v_i - v0) 'small';
 - it creates a relative 0-point in the modular space.

If you do that substitution and work it all out, you end up with:

        \Sum w_i * (v_i - v0)
  avg = --------------------- + v0
              \Sum w_i

Since you cannot very well track a ratio like that (and not suffer terrible numerical problems) we simply track the numerator and denominator individually and only perform the division when strictly needed.

Notably, the numerator lives in cfs_rq->avg_vruntime and the denominator lives in cfs_rq->avg_load.

The one extra 'funny' is that these numbers track the entities in the tree, and current is typically outside of the tree, so avg_vruntime() adds current when needed before doing the division.

(vruntime_eligible() elides the division by cross-wise multiplication)

Anyway, as mentioned above, we currently use the CFS era min_vruntime for this purpose. However, this thing can only move forward, while the above avg can in fact move backward (when a non-eligible task leaves, the average becomes smaller). This can cause trouble when through happenstance (or construction) these values drift far enough apart to wreck the game.

Replace cfs_rq::min_vruntime with cfs_rq::zero_vruntime which is kept near/at avg_vruntime, following its motion.

The down-side is that this requires computing the avg more often.

Fixes: 147f3efaa241 ("sched/fair: Implement an EEVDF-like scheduling policy")
Reported-by: Zicheng Qu <quzicheng@huawei.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20251106111741.GC4068168@noisy.programming.kicks-ass.net
Cc: stable@vger.kernel.org
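The relative-key formula from the commit message above can be tried out in a few lines of userspace C. This is a sketch under simplifying assumptions, not the kernel's avg_vruntime(): `struct entity` and `avg_vruntime_demo()` are hypothetical, every entity is assumed to be in the sum (no special handling of current), and the division truncates toward zero instead of applying the kernel's rounding.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical entity: a wrapping u64 vruntime and a positive weight. */
struct entity {
	uint64_t vruntime;
	int64_t weight;
};

/*
 * Weighted average vruntime computed relative to the zero point @v0:
 *
 *   avg = \Sum w_i * (v_i - v0) / \Sum w_i + v0
 *
 * The unsigned subtraction wraps modulo 2^64, and reinterpreting the
 * result as signed gives a small key as long as v_i stays near v0.
 * (The unsigned-to-signed conversion relies on two's complement, which
 * mainstream compilers provide.)
 */
static uint64_t avg_vruntime_demo(const struct entity *e, size_t n,
				  uint64_t v0)
{
	int64_t num = 0;	/* numerator:   \Sum w_i * (v_i - v0) */
	int64_t den = 0;	/* denominator: \Sum w_i */

	for (size_t i = 0; i < n; i++) {
		int64_t key = (int64_t)(e[i].vruntime - v0);

		num += key * e[i].weight;
		den += e[i].weight;
	}

	if (den == 0)
		return v0;

	/* Divide only at the end, then shift back into absolute time. */
	return v0 + (uint64_t)(num / den);
}
```

Because only differences from `v0` enter the sum, the result is the same whether or not the absolute vruntime values have wrapped past zero, provided `v0` stays close to the entities it describes.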
2025-11-11sched/core: Add comment explaining force-idle vruntime snapshotsPeter Zijlstra1-0/+181
I always end up having to re-read these emails every time I look at this code. And a future patch is going to change this story a little. This means it is past time to stick them in a comment so it can be modified and stay current. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20200506143506.GH5298@hirez.programming.kicks-ass.net Link: https://lkml.kernel.org/r/20200515103844.GG2978@hirez.programming.kicks-ass.net Link: https://patch.msgid.link/20251106111603.GB4068168@noisy.programming.kicks-ass.net
2025-11-11sched/core: Optimize core cookie matching checkFernand Sieber1-1/+4
Early return true if the core cookie matches. This avoids the SMT mask loop to check for an idle core, which might be more expensive on wide platforms. Signed-off-by: Fernand Sieber <sieberf@amazon.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: K Prateek Nayak <kprateek.nayak@amd.com> Reviewed-by: Madadi Vineeth Reddy <vineethr@linux.ibm.com> Link: https://patch.msgid.link/20251105152538.470586-1-sieberf@amazon.com
2025-11-11sched/proxy: Yield the donor taskFernand Sieber5-7/+8
When executing a task in proxy context, handle yields as if they were requested by the donor task. This matches the traditional PI semantics of yield() as well. This avoids scenarios such as the proxy task yielding, pick next task selecting the same previously blocked donor, running the proxy task again, and so on. Reported-by: kernel test robot <oliver.sang@intel.com> Closes: https://lore.kernel.org/oe-lkp/202510211205.1e0f5223-lkp@intel.com Suggested-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Fernand Sieber <sieberf@amazon.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://patch.msgid.link/20251106104022.195157-1-sieberf@amazon.com
2025-11-11ns: drop custom reference count initialization for initial namespacesChristian Brauner4-4/+4
Initial namespaces don't modify their reference count anymore. They remain fixed at one so drop the custom refcount initializations. Link: https://patch.msgid.link/20251110-work-namespace-nstree-fixes-v1-16-e8a9264e0fb9@kernel.org Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-11pid: rely on common reference count behaviorChristian Brauner1-1/+1
Now that we changed the generic reference counting mechanism for all namespaces to never manipulate reference counts of initial namespaces we can drop the special handling for pid namespaces. Link: https://patch.msgid.link/20251110-work-namespace-nstree-fixes-v1-15-e8a9264e0fb9@kernel.org Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-11ns: rename is_initial_namespace()Christian Brauner1-1/+1
Rename is_initial_namespace() to ns_init_inum() and make it symmetrical with the ns id variant. Link: https://patch.msgid.link/20251110-work-namespace-nstree-fixes-v1-9-e8a9264e0fb9@kernel.org Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-11nstree: use guards for ns_tree_lockChristian Brauner1-7/+13
Make use of the guard infrastructure for ns_tree_lock. Link: https://patch.msgid.link/20251110-work-namespace-nstree-fixes-v1-7-e8a9264e0fb9@kernel.org Signed-off-by: Christian Brauner <brauner@kernel.org>
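For readers unfamiliar with guards, the general technique is scope-based cleanup built on the GCC/Clang `cleanup` variable attribute. The sketch below shows that technique in userspace and is not the kernel's guard() implementation: `struct demo_lock`, `DEMO_GUARD()`, and `tree_lookup_demo()` are hypothetical, and a counter stands in for a real lock.

```c
/* Hypothetical lock: a counter is enough to observe acquire/release. */
struct demo_lock {
	int held;
};

static void demo_lock_acquire(struct demo_lock *l)
{
	l->held++;
}

/* Cleanup handler: receives a pointer to the guard variable itself. */
static void demo_lock_release(struct demo_lock **lp)
{
	(*lp)->held--;
}

/*
 * Acquire @lockp now and release it automatically when the enclosing
 * scope is left, on every return path. Relies on the GCC/Clang
 * "cleanup" attribute.
 */
#define DEMO_GUARD(lockp)						\
	struct demo_lock *demo_guard_					\
		__attribute__((cleanup(demo_lock_release), unused)) =	\
		(demo_lock_acquire(lockp), (lockp))

/* Returns -1 for an invalid key, else the lock depth seen while held. */
static int tree_lookup_demo(struct demo_lock *l, int key)
{
	DEMO_GUARD(l);

	if (key < 0)
		return -1;	/* early return: lock is still released */

	return l->held;		/* evaluated before the cleanup runs */
}
```

The benefit is that early returns no longer need a matching unlock or a goto to an unlock label, which is typically why such conversions add a few lines but remove error-path bookkeeping.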
2025-11-11nstree: simplify owner list iterationChristian Brauner1-3/+7
Make use of list_for_each_entry_from_rcu(). Link: https://patch.msgid.link/20251110-work-namespace-nstree-fixes-v1-6-e8a9264e0fb9@kernel.org Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-11nstree: switch to new structuresChristian Brauner2-131/+81
Switch the nstree management to the new combined structures. Link: https://patch.msgid.link/20251110-work-namespace-nstree-fixes-v1-5-e8a9264e0fb9@kernel.org Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-11nstree: add helper to operate on struct ns_tree_{node,root}Christian Brauner1-0/+85
Add helpers that work on the combined rbtree and rculist. This will make the code a lot more manageable and legible. Link: https://patch.msgid.link/20251110-work-namespace-nstree-fixes-v1-4-e8a9264e0fb9@kernel.org Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-11Merge branch 'kbuild-6.19.fms.extension'Christian Brauner9-14/+28
Bring in the shared branch with the kbuild tree to enable '-fms-extensions' for 6.19. Further namespace cleanup work requires this extension. Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-10tracing: Report wrong dynamic event commandMasami Hiramatsu (Google)1-2/+9
Report wrong dynamic event type in the command via error_log.

-----
 # echo "z hoge" > /sys/kernel/tracing/dynamic_events
 sh: write error: Invalid argument

 # cat /sys/kernel/tracing/error_log
 [ 22.977022] dynevent: error: No matching dynamic event type
   Command: z hoge
            ^
-----

Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://patch.msgid.link/176278970056.343441.10528135217342926645.stgit@devnote2
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-11-10tracing: Use switch statement instead of ifs in set_tracer_flag()Steven Rostedt1-15/+23
The "mask" passed in to set_tracer_flag() has a single bit set. The function then checks if the mask is equal to one of the option masks and performs the appropriate function associated with that option. Instead of having a bunch of "if ()" statements, use a "switch ()" statement to make it cleaner and a bit more optimal. No functional changes. Cc: Mark Rutland <mark.rutland@arm.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Andrew Morton <akpm@linux-foundation.org> Link: https://patch.msgid.link/20251106003501.890298562@kernel.org Reviewed-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
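The pattern described above, dispatching on a mask that has exactly one bit set, can be sketched as follows. The option names and actions here are hypothetical placeholders, not the real trace option flags.

```c
/* Hypothetical single-bit option masks. */
enum {
	OPT_OVERWRITE	= 1u << 0,
	OPT_RECORD_CMD	= 1u << 1,
	OPT_PRINTK	= 1u << 2,
};

/* Hypothetical per-option actions. */
enum action {
	ACT_NONE,
	ACT_RESIZE,
	ACT_RECORD,
	ACT_PRINTK,
};

/*
 * Because @mask carries exactly one bit, each case label matches one
 * option and the compiler is free to pick an efficient dispatch,
 * instead of evaluating a chain of "if (mask == ...)" tests in order.
 */
static enum action action_for_flag(unsigned int mask)
{
	switch (mask) {
	case OPT_OVERWRITE:
		return ACT_RESIZE;
	case OPT_RECORD_CMD:
		return ACT_RECORD;
	case OPT_PRINTK:
		return ACT_PRINTK;
	default:
		return ACT_NONE;	/* option needs no extra handling */
	}
}
```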
2025-11-10tracing: Exit out immediately after update_marker_trace()Steven Rostedt1-1/+4
The call to update_marker_trace() in set_tracer_flag() performs the update to the tr->trace_flags. There's no reason to perform it again after it is called. Return immediately instead. Cc: Mark Rutland <mark.rutland@arm.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Andrew Morton <akpm@linux-foundation.org> Link: https://patch.msgid.link/20251106003501.726406870@kernel.org Reviewed-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-11-10tracing: Have add_tracer_options() error pass up to callersSteven Rostedt1-21/+34
The function add_tracer_options() can fail, but its failure is currently ignored. Pass the status of add_tracer_options() up to the code that adds a new tracer as well as to instance creation. Have the instance creation fail if add_tracer_options() fails. Only print a warning for the top level instance, as is done for other failures. Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Andrew Morton <akpm@linux-foundation.org> Link: https://patch.msgid.link/20251105161935.375299297@kernel.org Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-11-10tracing: Remove dummy options and flagsSteven Rostedt1-32/+16
When a tracer does not define its own flags, dummy options and flags are used so that the values are always valid. There are not many locations that reference these values, so having dummy versions just complicates the code. Remove the dummy values and just check for NULL when appropriate. Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Andrew Morton <akpm@linux-foundation.org> Link: https://patch.msgid.link/20251105161935.206093132@kernel.org Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-11-10tracing: Hide __NR_utimensat and __NR_mq_timedsend when not definedSteven Rostedt1-0/+4
Some architectures (riscv-32) do not define __NR_utimensat and __NR_mq_timedsend, and fail to build when they are used. Hide them in "ifdef"s. Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Link: https://patch.msgid.link/20251104205310.00a1db9a@batman.local.home Reported-by: kernel test robot <lkp@intel.com> Closes: https://lore.kernel.org/oe-kbuild-all/202511031239.ZigDcWzY-lkp@intel.com/ Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
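The guard pattern can be sketched as below. This is a user-space illustration under stated assumptions: the `is_listed_syscall()` helper is hypothetical, and `<sys/syscall.h>` is used here only as a convenient source of the syscall number macros; the kernel code obtains them differently.

```c
#include <sys/syscall.h>	/* assumed user-space source of the __NR_* macros */

/*
 * Hypothetical helper: report whether nr is one of the two syscalls.
 * Each case label is compiled only when the architecture defines the
 * corresponding macro, so the function still builds where one or both
 * are missing.
 */
static int is_listed_syscall(long nr)
{
	switch (nr) {
#ifdef __NR_utimensat
	case __NR_utimensat:
		return 1;
#endif
#ifdef __NR_mq_timedsend
	case __NR_mq_timedsend:
		return 1;
#endif
	default:
		return 0;
	}
}
```

On an architecture that lacks both macros, the switch collapses to the default case and the function always returns 0.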
2025-11-10bpf: Export necessary symbols for modules with struct_opsD. Wythe2-0/+3
Export three symbols needed to implement struct_ops in a tristate subsystem: bpf_struct_ops_get() and bpf_struct_ops_put(), which the inline functions bpf_try_module_get() and bpf_module_put() call conditionally to hold or release a struct_ops refcnt, and bpf_obj_name_cpy(), which copies an object name from one buffer to another with validity checks. Signed-off-by: D. Wythe <alibuda@linux.alibaba.com> Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org> Link: https://patch.msgid.link/20251107035632.115950-2-alibuda@linux.alibaba.com
2025-11-10bpf: Unclone skb head on bpf_dynptr_write to skb metadataJakub Sitnicki1-4/+2
Currently bpf_dynptr_from_skb_meta() marks the dynptr as read-only when the skb is cloned, preventing writes to metadata. Remove this restriction and unclone the skb head on bpf_dynptr_write() to metadata, now that the metadata is preserved during uncloning. This makes metadata dynptr consistent with skb dynptr, allowing writes regardless of whether the skb is cloned. Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com> Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org> Link: https://patch.msgid.link/20251105-skb-meta-rx-path-v4-3-5ceb08a9b37b@cloudflare.com
2025-11-10workqueue: Remove unused assert_rcu_or_wq_mutex_or_pool_mutexzhang jiao1-6/+0
assert_rcu_or_wq_mutex_or_pool_mutex is never referenced in the code. Just remove it. Signed-off-by: zhang jiao <zhangjiao2@cmss.chinamobile.com> Reviewed-by: Lai Jiangshan <jiangshanlai@gmail.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-11-10printk_ringbuffer: Create a helper function to decide whether more space is neededPetr Mladek1-4/+28
The decision whether more space is needed is tricky in the printk ring buffer code: 1. The given lpos values might overflow. A subtraction must be used instead of a simple "lower than" check. 2. Another CPU might reuse the space in the meantime. This can be detected when the result of the subtraction is bigger than DATA_SIZE(data_ring). 3. There is exactly enough space when the result of the subtraction is zero. But more space is needed when the result is exactly DATA_SIZE(data_ring). Add a helper function to make sure that the check is done correctly in all situations. It also helps to make the code consistent and better documented. Suggested-by: John Ogness <john.ogness@linutronix.de> Link: https://lore.kernel.org/r/87tsz7iea2.fsf@jogness.linutronix.de Reviewed-by: John Ogness <john.ogness@linutronix.de> Link: https://patch.msgid.link/20251107194720.1231457-3-pmladek@suse.com [pmladek@suse.com: Updated wording as suggested by John] Signed-off-by: Petr Mladek <pmladek@suse.com>
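The three rules above can be folded into a single unsigned comparison. The following is a minimal sketch, not the helper added by the patch: the function name, its parameters, and the use of a plain `ring_size` argument in place of DATA_SIZE(data_ring) are assumptions made for illustration.

```c
#include <stdbool.h>

/*
 * Hypothetical stand-in for the helper described above. ring_size plays
 * the role of DATA_SIZE(data_ring); tail_lpos and needed_lpos are
 * free-running logical positions that may wrap around the unsigned long
 * range.
 */
static bool need_more_space(unsigned long ring_size,
			    unsigned long tail_lpos,
			    unsigned long needed_lpos)
{
	/* Unsigned subtraction stays correct across lpos overflow (rule 1). */
	unsigned long diff = needed_lpos - tail_lpos;

	/*
	 * diff == 0           exactly enough space, nothing to do (rule 3)
	 * 1 <= diff <= size   more space is needed (rule 3)
	 * diff > size         the space was reused by another CPU (rule 2)
	 *
	 * Subtracting 1 wraps the zero case to the largest unsigned value,
	 * so one comparison rejects both the zero and the "too big" case.
	 */
	return diff - 1 < ring_size;
}
```

With a 4096-byte ring, a difference of 0 or 4097 yields false while a difference of 4096 yields true, and the result is unchanged when `needed_lpos` has wrapped past zero and `tail_lpos` has not.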
2025-11-10printk_ringbuffer: Fix check of valid data size when blk_lpos overflowsPetr Mladek1-3/+6
The commit 67e1b0052f6bb8 ("printk_ringbuffer: don't needlessly wrap data blocks around") allows to use the last 4 bytes of the ring buffer. But the check for the @data_size was not properly updated in get_data(). It fails when "blk_lpos->next" overflows to "0". In this case: + is_blk_wrapped(data_ring, blk_lpos->begin, blk_lpos->next) returns "false" because it checks "blk_lpos->next - 1". + "blk_lpos->begin < blk_lpos->next" fails because "blk_lpos->next" is already 0. + is_blk_wrapped(data_ring, blk_lpos->begin + DATA_SIZE(data_ring), blk_lpos->next) returns "false" because "begin_lpos" is from the next wrap but "next_lpos - 1" is from the previous one. As a result, get_data() triggers the WARN_ON_ONCE() for "Illegal block description", for example: [ 216.317316][ T7652] loop0: detected capacity change from 0 to 16 ** 1 printk messages dropped ** [ 216.327750][ T7652] ------------[ cut here ]------------ [ 216.327789][ T7652] WARNING: kernel/printk/printk_ringbuffer.c:1278 at get_data+0x48a/0x840, CPU#1: syz.0.585/7652 [ 216.327848][ T7652] Modules linked in: [ 216.327907][ T7652] CPU: 1 UID: 0 PID: 7652 Comm: syz.0.585 Not tainted syzkaller #0 PREEMPT(full) [ 216.327933][ T7652] Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 10/02/2025 [ 216.327953][ T7652] RIP: 0010:get_data+0x48a/0x840 [ 216.327986][ T7652] Code: 83 c4 f8 48 b8 00 00 00 00 00 fc ff df 41 0f b6 04 07 84 c0 0f 85 ee 01 00 00 44 89 65 00 49 83 c5 08 eb 13 e8 a7 19 1f 00 90 <0f> 0b 90 eb 05 e8 9c 19 1f 00 45 31 ed 4c 89 e8 48 83 c4 28 5b 41 [ 216.328007][ T7652] RSP: 0018:ffffc900035170e0 EFLAGS: 00010293 [ 216.328029][ T7652] RAX: ffffffff81a1eee9 RBX: 00003fffffffffff RCX: ffff888033255b80 [ 216.328048][ T7652] RDX: 0000000000000000 RSI: 00003fffffffffff RDI: 0000000000000000 [ 216.328063][ T7652] RBP: 0000000000000012 R08: 0000000000000e55 R09: 000000325e213cc7 [ 216.328079][ T7652] R10: 000000325e213cc7 R11: 00001de4c2000037 R12: 0000000000000012 [ 216.328095][ 
T7652] R13: 0000000000000000 R14: ffffc90003517228 R15: 1ffffffff1bca646 [ 216.328111][ T7652] FS: 00007f44eb8da6c0(0000) GS:ffff888125fda000(0000) knlGS:0000000000000000 [ 216.328131][ T7652] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 216.328147][ T7652] CR2: 00007f44ea9722e0 CR3: 0000000066344000 CR4: 00000000003526f0 [ 216.328168][ T7652] Call Trace: [ 216.328178][ T7652] <TASK> [ 216.328199][ T7652] _prb_read_valid+0x672/0xa90 [ 216.328328][ T7652] ? desc_read+0x1b8/0x3f0 [ 216.328381][ T7652] ? __pfx__prb_read_valid+0x10/0x10 [ 216.328422][ T7652] ? panic_on_this_cpu+0x32/0x40 [ 216.328450][ T7652] prb_read_valid+0x3c/0x60 [ 216.328482][ T7652] printk_get_next_message+0x15c/0x7b0 [ 216.328526][ T7652] ? __pfx_printk_get_next_message+0x10/0x10 [ 216.328561][ T7652] ? __lock_acquire+0xab9/0xd20 [ 216.328595][ T7652] ? console_flush_all+0x131/0xb10 [ 216.328621][ T7652] ? console_flush_all+0x478/0xb10 [ 216.328648][ T7652] console_flush_all+0x4cc/0xb10 [ 216.328673][ T7652] ? console_flush_all+0x131/0xb10 [ 216.328704][ T7652] ? __pfx_console_flush_all+0x10/0x10 [ 216.328748][ T7652] ? is_printk_cpu_sync_owner+0x32/0x40 [ 216.328781][ T7652] console_unlock+0xbb/0x190 [ 216.328815][ T7652] ? __pfx___down_trylock_console_sem+0x10/0x10 [ 216.328853][ T7652] ? __pfx_console_unlock+0x10/0x10 [ 216.328899][ T7652] vprintk_emit+0x4c5/0x590 [ 216.328935][ T7652] ? __pfx_vprintk_emit+0x10/0x10 [ 216.328993][ T7652] _printk+0xcf/0x120 [ 216.329028][ T7652] ? __pfx__printk+0x10/0x10 [ 216.329051][ T7652] ? kernfs_get+0x5a/0x90 [ 216.329090][ T7652] _erofs_printk+0x349/0x410 [ 216.329130][ T7652] ? __pfx__erofs_printk+0x10/0x10 [ 216.329161][ T7652] ? __raw_spin_lock_init+0x45/0x100 [ 216.329186][ T7652] ? __init_swait_queue_head+0xa9/0x150 [ 216.329231][ T7652] erofs_fc_fill_super+0x1591/0x1b20 [ 216.329285][ T7652] ? __pfx_erofs_fc_fill_super+0x10/0x10 [ 216.329324][ T7652] ? sb_set_blocksize+0x104/0x180 [ 216.329356][ T7652] ? 
setup_bdev_super+0x4c1/0x5b0 [ 216.329385][ T7652] get_tree_bdev_flags+0x40e/0x4d0 [ 216.329410][ T7652] ? __pfx_erofs_fc_fill_super+0x10/0x10 [ 216.329444][ T7652] ? __pfx_get_tree_bdev_flags+0x10/0x10 [ 216.329483][ T7652] vfs_get_tree+0x92/0x2b0 [ 216.329512][ T7652] do_new_mount+0x302/0xa10 [ 216.329537][ T7652] ? apparmor_capable+0x137/0x1b0 [ 216.329576][ T7652] ? __pfx_do_new_mount+0x10/0x10 [ 216.329605][ T7652] ? ns_capable+0x8a/0xf0 [ 216.329637][ T7652] ? kmem_cache_free+0x19b/0x690 [ 216.329682][ T7652] __se_sys_mount+0x313/0x410 [ 216.329717][ T7652] ? __pfx___se_sys_mount+0x10/0x10 [ 216.329836][ T7652] ? do_syscall_64+0xbe/0xfa0 [ 216.329869][ T7652] ? __x64_sys_mount+0x20/0xc0 [ 216.329901][ T7652] do_syscall_64+0xfa/0xfa0 [ 216.329932][ T7652] ? lockdep_hardirqs_on+0x9c/0x150 [ 216.329964][ T7652] ? entry_SYSCALL_64_after_hwframe+0x77/0x7f [ 216.329988][ T7652] ? clear_bhb_loop+0x60/0xb0 [ 216.330017][ T7652] entry_SYSCALL_64_after_hwframe+0x77/0x7f [ 216.330040][ T7652] RIP: 0033: