path: root/kernel
Age    Commit message    Author    Files    Lines
2025-11-15crash: fix crashkernel resource shrinkSourabh Jain1-1/+1
When crashkernel is configured with a high reservation, shrinking its value below the low crashkernel reservation causes two issues: 1. Invalid crashkernel resource objects 2. Kernel crash if crashkernel shrinking is done twice For example, with crashkernel=200M,high, the kernel reserves 200MB of high memory and some default low memory (say 256MB). The reservation appears as: cat /proc/iomem | grep -i crash af000000-beffffff : Crash kernel 433000000-43f7fffff : Crash kernel If crashkernel is then shrunk to 50MB (echo 52428800 > /sys/kernel/kexec_crash_size), /proc/iomem still shows 256MB reserved: af000000-beffffff : Crash kernel Instead, it should show 50MB: af000000-b21fffff : Crash kernel Further shrinking crashkernel to 40MB causes a kernel crash with the following trace (x86): BUG: kernel NULL pointer dereference, address: 0000000000000038 PGD 0 P4D 0 Oops: 0000 [#1] PREEMPT SMP NOPTI <snip...> Call Trace: <TASK> ? __die_body.cold+0x19/0x27 ? page_fault_oops+0x15a/0x2f0 ? search_module_extables+0x19/0x60 ? search_bpf_extables+0x5f/0x80 ? exc_page_fault+0x7e/0x180 ? asm_exc_page_fault+0x26/0x30 ? __release_resource+0xd/0xb0 release_resource+0x26/0x40 __crash_shrink_memory+0xe5/0x110 crash_shrink_memory+0x12a/0x190 kexec_crash_size_store+0x41/0x80 kernfs_fop_write_iter+0x141/0x1f0 vfs_write+0x294/0x460 ksys_write+0x6d/0xf0 <snip...> This happens because __crash_shrink_memory()/kernel/crash_core.c incorrectly updates the crashk_res resource object even when crashk_low_res should be updated. Fix this by ensuring the correct crashkernel resource object is updated when shrinking crashkernel memory. Link: https://lkml.kernel.org/r/20251101193741.289252-1-sourabhjain@linux.ibm.com Fixes: 16c6006af4d4 ("kexec: enable kexec_crash_size to support two crash kernel regions") Signed-off-by: Sourabh Jain <sourabhjain@linux.ibm.com> Acked-by: Baoquan He <bhe@redhat.com> Cc: Zhen Lei <thunder.leizhen@huawei.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
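A minimal sketch of the selection logic described above, assuming the low and high reservations are plain struct resource objects; the helper name crash_res_to_shrink() and the size-based test are illustrative, not the actual patch:

#include <linux/ioport.h>

/* Pick which of the two crashkernel resources a shrink request applies to. */
static struct resource *crash_res_to_shrink(struct resource *high,
                                            struct resource *low,
                                            resource_size_t new_size)
{
        /*
         * Assumption for illustration: a request that still fits inside the
         * low reservation shrinks the low resource, otherwise the high one.
         */
        if (low->parent && new_size <= resource_size(low))
                return low;
        return high;
}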
2025-11-15Merge tag 'timers-urgent-2025-11-15' of ↵Linus Torvalds1-6/+6
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull timer fix from Ingo Molnar: "Fix a memory leak in the posix timer creation logic" * tag 'timers-urgent-2025-11-15' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: posix-timers: Plug potential memory leak in do_timer_create()
2025-11-14bpf: don't skip other information if xlated_prog_insns is skippedAltgelt, Max (Nextron)1-11/+11
If xlated_prog_insns should not be exposed, other information (such as func_info) still can and should be filled in. Therefore, instead of directly terminating in this case, continue with the normal flow. Signed-off-by: Max Altgelt <max.altgelt@nextron-systems.com> Link: https://lore.kernel.org/r/efd00fcec5e3e247af551632726e2a90c105fbd8.camel@nextron-systems.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-11-14bpf: verifier: Move desc->imm setup to sort_kfunc_descs_by_imm_off()Puranjay Mohan1-19/+35
Metadata about a kfunc call is added to the kfunc_tab in add_kfunc_call() but the call instruction itself could get removed by opt_remove_dead_code() later if it is not reachable. If the call instruction is removed, specialize_kfunc() is never called for it and the desc->imm in the kfunc_tab is never initialized for this kfunc call. In this case, sort_kfunc_descs_by_imm_off(env->prog); in do_misc_fixups() doesn't sort the table correctly. This is a problem for s390 as its JIT uses this table to find the addresses for kfuncs, and if this table is not sorted properly, JIT may fail to find addresses for valid kfunc calls. This was exposed by: commit d869d56ca848 ("bpf: verifier: refactor kfunc specialization") as before this commit, desc->imm was initialised in add_kfunc_call() which happens before dead code elimination. Move desc->imm setup down to sort_kfunc_descs_by_imm_off(), this fixes the problem and also saves us from having the same logic in add_kfunc_call() and specialize_kfunc(). Suggested-by: Eduard Zingerman <eddyz87@gmail.com> Signed-off-by: Puranjay Mohan <puranjay@kernel.org> Acked-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/r/20251114154023.12801-1-puranjay@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-11-14Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf after 6.18-rc5+Alexei Starovoitov19-111/+243
Cross-merge BPF and other fixes after downstream PR. Minor conflict in kernel/bpf/helpers.c Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-11-14Merge tag 'bpf-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpfLinus Torvalds5-40/+72
Pull bpf fixes from Alexei Starovoitov: - Fix interaction between livepatch and BPF fexit programs (Song Liu) With Steven and Masami acks. - Fix stack ORC unwind from BPF kprobe_multi (Jiri Olsa) With Steven and Masami acks. - Fix out of bounds access in widen_imprecise_scalars() in the verifier (Eduard Zingerman) - Fix conflicts between MPTCP and BPF sockmap (Jiayuan Chen) - Fix net_sched storage collision with BPF data_meta/data_end (Eric Dumazet) - Add _impl suffix to BPF kfuncs with implicit args to avoid breaking them in bpf-next when KF_IMPLICIT_ARGS is added (Mykyta Yatsenko) * tag 'bpf-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf: selftests/bpf: Test widen_imprecise_scalars() with different stack depth bpf: account for current allocated stack depth in widen_imprecise_scalars() bpf: Add bpf_prog_run_data_pointers() selftests/bpf: Add mptcp test with sockmap mptcp: Fix proto fallback detection with BPF mptcp: Disallow MPTCP subflows from sockmap selftests/bpf: Add stacktrace ips test for raw_tp selftests/bpf: Add stacktrace ips test for kprobe_multi/kretprobe_multi x86/fgraph,bpf: Fix stack ORC unwind from kprobe_multi return probe Revert "perf/x86: Always store regs->ip in perf_callchain_kernel()" bpf: add _impl suffix for bpf_stream_vprintk() kfunc bpf:add _impl suffix for bpf_task_work_schedule* kfuncs selftests/bpf: Add tests for livepatch + bpf trampoline ftrace: bpf: Fix IPMODIFY + DIRECT in modify_ftrace_direct() ftrace: Fix BPF fexit with livepatch
2025-11-14bpf: Handle return value of ftrace_set_filter_ip in register_fentryMenglong Dong1-1/+3
The error returned by ftrace_set_filter_ip() in register_fentry() is not handled properly. Just fix it. Fixes: 00963a2e75a8 ("bpf: Support bpf_trampoline on functions with IPMODIFY (e.g. livepatch)") Signed-off-by: Menglong Dong <dongml2@chinatelecom.cn> Acked-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20251110120705.1553694-1-dongml2@chinatelecom.cn Signed-off-by: Alexei Starovoitov <ast@kernel.org>
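A minimal sketch of the error-propagation pattern the fix applies; the helper name and the follow-up register_ftrace_function() call are illustrative, the real change lives in register_fentry():

#include <linux/ftrace.h>

static int set_fentry_filter(struct ftrace_ops *ops, unsigned long ip)
{
        int ret;

        /* Propagate the error instead of silently ignoring it. */
        ret = ftrace_set_filter_ip(ops, ip, 0, 0);
        if (ret)
                return ret;

        return register_ftrace_function(ops);
}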
2025-11-14bpf: Add missing checks to avoid verbose verifier logEduard Zingerman1-4/+6
There are a few places where log level is not checked before calling "verbose()". This forces programs working only at BPF_LOG_LEVEL_STATS (e.g. veristat) to allocate unnecessarily large log buffers. Add missing checks. Reported-by: Emil Tsalapatis <emil@etsalapatis.com> Signed-off-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/r/20251114200542.912386-1-eddyz87@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
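A minimal sketch of the pattern being added; the message text and helper name are illustrative, while verbose() and BPF_LOG_LEVEL2 are the verifier's existing log helper and level bit:

#include <linux/bpf_verifier.h>

static void log_note(struct bpf_verifier_env *env, int insn_idx)
{
        /*
         * Skip log formatting entirely unless a verbose level was requested,
         * so BPF_LOG_LEVEL_STATS-only users don't need large log buffers.
         */
        if (env->log.level & BPF_LOG_LEVEL2)
                verbose(env, "processing insn %d\n", insn_idx);
}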
2025-11-14sched_ext: Pass locked CPU parameter to scx_hardlockup() and add docsTejun Heo2-4/+23
With the buddy lockup detector, smp_processor_id() returns the detecting CPU, not the locked CPU, making scx_hardlockup()'s printouts confusing. Pass the locked CPU number from watchdog_hardlockup_check() as a parameter instead. Also add kerneldoc comments to handle_lockup(), scx_hardlockup(), and scx_rcu_cpu_stall() documenting their return value semantics. Suggested-by: Doug Anderson <dianders@chromium.org> Reviewed-by: Douglas Anderson <dianders@chromium.org> Acked-by: Andrea Righi <arighi@nvidia.com> Reviewed-by: Emil Tsalapatis <emil@etsalapatis.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-11-14bpf: Prevent nesting overflow in bpf_try_get_buffersSahil Chandna1-0/+3
bpf_try_get_buffers() returns one of multiple per-CPU buffers based on a per-CPU nesting counter. This mechanism expects that buffers are not endlessly acquired before being returned. migrate_disable() ensures that a task remains on the same CPU, but it does not prevent the task from being preempted by another task on that CPU. Without disabled preemption, a task may be preempted while holding a buffer, allowing another task to run on same CPU and acquire an additional buffer. Several such preemptions can cause the per-CPU nest counter to exceed MAX_BPRINTF_NEST_LEVEL and trigger the warning in bpf_try_get_buffers(). Adding preempt_disable()/preempt_enable() around buffer acquisition and release prevents this task preemption and preserves the intended bounded nesting behavior. Reported-by: syzbot+b0cff308140f79a9c4cb@syzkaller.appspotmail.com Closes: https://lore.kernel.org/all/68f6a4c8.050a0220.1be48.0011.GAE@google.com/ Fixes: 4223bf833c849 ("bpf: Remove preempt_disable in bpf_try_get_buffers") Suggested-by: Yonghong Song <yonghong.song@linux.dev> Reviewed-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Sahil Chandna <chandna.sahil@gmail.com> Link: https://lore.kernel.org/r/20251114064922.11650-1-chandna.sahil@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
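A minimal sketch of the bounded per-CPU nesting pattern the fix restores; the names (nest_bufs, MAX_NEST, try_get_buf) and buffer size are illustrative, not the bpf_try_get_buffers() internals:

#include <linux/percpu.h>
#include <linux/preempt.h>

#define MAX_NEST 3

struct nest_bufs {
        char buf[MAX_NEST][512];
        int  level;
};
static DEFINE_PER_CPU(struct nest_bufs, nest_bufs);

static char *try_get_buf(void)
{
        struct nest_bufs *b;

        preempt_disable();              /* no same-CPU preemption while a buffer is held */
        b = this_cpu_ptr(&nest_bufs);
        if (b->level >= MAX_NEST) {
                preempt_enable();
                return NULL;            /* nesting bound would be exceeded */
        }
        return b->buf[b->level++];
}

static void put_buf(void)
{
        this_cpu_ptr(&nest_bufs)->level--;
        preempt_enable();
}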
2025-11-14time: Fix a few typos in time[r] related code commentsJianyun Gao2-2/+2
Signed-off-by: Jianyun Gao <jianyungao89@gmail.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://patch.msgid.link/20250927093411.1509275-1-jianyungao89@gmail.com
2025-11-14tracing: Convert function graph set_flags() to use a switch() statementSteven Rostedt1-5/+7
Currently the set_flags() of the function graph tracer has a bunch of: if (bit == FLAG1) { [..] } if (bit == FLAG2) { [..] } To clean it up a bit, convert it over to a switch statement. Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Andrew Morton <akpm@linux-foundation.org> Link: https://patch.msgid.link/20251114192319.117123664@kernel.org Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-11-14tracing: Have function graph tracer option sleep-time be per instanceSteven Rostedt4-23/+60
Currently the function graph tracer option to ignore time spent while a task is sleeping is global, even though the interface is per-instance. Changing the value in one instance will affect the results of another instance that is also running the function graph tracer. This can lead to confusing results. Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Andrew Morton <akpm@linux-foundation.org> Link: https://patch.msgid.link/20251114192318.950255167@kernel.org Fixes: c132be2c4fcc1 ("function_graph: Have the instances use their own ftrace_ops for filtering") Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-11-14tracing: Move graph-time out of function graph optionsSteven Rostedt3-14/+23
The option "graph-time" affects the function profiler when it is using the function graph infrastructure. It has nothing to do with the function graph tracer itself. The option only affects the global function profiler and does nothing to the function graph tracer. Move it out of the function graph tracer options and make it a global option that is only available at the top level instance. Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Andrew Morton <akpm@linux-foundation.org> Link: https://patch.msgid.link/20251114192318.781711154@kernel.org Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-11-14tracing: Have function graph tracer option funcgraph-irqs be per instanceSteven Rostedt1-10/+31
Currently the option to trace interrupts in the function graph tracer is global when the interface is per-instance. Changing the value in one instance will affect the results of another instance that is also running the function graph tracer. This can lead to confusing results. Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Andrew Morton <akpm@linux-foundation.org> Link: https://patch.msgid.link/20251114192318.613867934@kernel.org Fixes: c132be2c4fcc1 ("function_graph: Have the instances use their own ftrace_ops for filtering") Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-11-14time: tick-oneshot: Add missing Return and parameter descriptions to kernel-docSunday Adelodun1-1/+19
Several functions in kernel/time/tick-oneshot.c are missing parameter and return value descriptions in their kernel-doc comments. This causes warnings during doc generation. Update the kernel-doc blocks to include detailed @param and Return: descriptions for better clarity and to fix kernel-doc warnings. No functional code changes are made. Signed-off-by: Sunday Adelodun <adelodunolaoluwa@yahoo.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://patch.msgid.link/20251106113938.34693-3-adelodunolaoluwa@yahoo.com
2025-11-14bpf: account for current allocated stack depth in widen_imprecise_scalars()Eduard Zingerman1-2/+4
The usage pattern for widen_imprecise_scalars() looks as follows: prev_st = find_prev_entry(env, ...); queued_st = push_stack(...); widen_imprecise_scalars(env, prev_st, queued_st); Where prev_st is an ancestor of the queued_st in the explored states tree. This ancestor is not guaranteed to have the same allocated stack depth as queued_st. E.g. in the following case: def main(): for i in 1..2: foo(i) // same callsite, different param def foo(i): if i == 1: use 128 bytes of stack iterator based loop Here, for the second 'foo' call prev_st->allocated_stack is 128, while queued_st->allocated_stack is much smaller. widen_imprecise_scalars() needs to take this into account and avoid accessing bpf_verifier_state->frame[*]->stack out of bounds. Fixes: 2793a8b015f7 ("bpf: exact states comparison for iterator convergence checks") Reported-by: Emil Tsalapatis <emil@etsalapatis.com> Signed-off-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/r/20251114025730.772723-1-eddyz87@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
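A minimal sketch of the bound the fix enforces, using loosely verifier-shaped names; not the exact hunk:

#include <linux/bpf_verifier.h>
#include <linux/minmax.h>

static void widen_stack_slots(struct bpf_func_state *prev,
                              struct bpf_func_state *cur)
{
        /*
         * Walk only the slots that exist in *both* states, so the shallower
         * state's stack[] array is never indexed out of bounds.
         */
        int i, slots = min(prev->allocated_stack, cur->allocated_stack) / BPF_REG_SIZE;

        for (i = 0; i < slots; i++) {
                /* widen cur->stack[i].spilled_ptr based on prev->stack[i] */
        }
}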
2025-11-14PM: suspend: Make pm_test delay interruptible by wakeup eventsRiwen Lu1-1/+5
Modify the suspend_test() function to allow the test delay to be interrupted by wakeup events. This improves the responsiveness of the system during suspend testing when wakeup events occur, allowing the suspend process to proceed without waiting for the full test delay to complete when wakeup events are detected. Additionally, using msleep() instead of mdelay() avoids potential soft lockup "CPU stuck" issues when long test delays are configured. Co-developed-by: xiongxin <xiongxin@kylinos.cn> Signed-off-by: xiongxin <xiongxin@kylinos.cn> Signed-off-by: Riwen Lu <luriwen@kylinos.cn> [ rjw: Changelog edits ] Link: https://patch.msgid.link/20251113012638.1362013-1-luriwen@kylinos.cn Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
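A minimal sketch of the interruptible-delay idea: sleep in small chunks and bail out early once a wakeup event is pending, instead of one long mdelay(). The helper name and the 100 ms chunk size are illustrative:

#include <linux/delay.h>
#include <linux/suspend.h>

static void test_delay_interruptible(unsigned int seconds)
{
        unsigned int i;

        for (i = 0; i < seconds * 10; i++) {
                if (pm_wakeup_pending())        /* a wakeup source fired: stop waiting */
                        break;
                msleep(100);                    /* sleeps, so no soft-lockup splats */
        }
}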
2025-11-14posix-timers: Plug potential memory leak in do_timer_create()Eslam Khafagy1-6/+6
When posix timer creation is set to allocate a given timer ID and the access to the user space value faults, the function terminates without freeing the already allocated posix timer structure. Move the allocation after the user space access to cure that. [ tglx: Massaged change log ] Fixes: ec2d0c04624b3 ("posix-timers: Provide a mechanism to allocate a given timer ID") Reported-by: syzbot+9c47ad18f978d4394986@syzkaller.appspotmail.com Suggested-by: Cyrill Gorcunov <gorcunov@gmail.com> Signed-off-by: Eslam Khafagy <eslam.medhat1993@gmail.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Frederic Weisbecker <frederic@kernel.org> Link: https://patch.msgid.link/20251114122739.994326-1-eslam.medhat1993@gmail.com Closes: https://lore.kernel.org/all/69155df4.a70a0220.3124cb.0017.GAE@google.com/T/
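A minimal sketch of the ordering fix: touch the user-space value first and only allocate once that access has succeeded, so a fault cannot leak the allocation. Names (struct my_timer, create_obj) are illustrative, not the posix-timer internals:

#include <linux/errno.h>
#include <linux/slab.h>
#include <linux/uaccess.h>

struct my_timer { int id; };

static int create_obj(int __user *uid, struct my_timer **out)
{
        struct my_timer *t;
        int id;

        if (get_user(id, uid))          /* may fault: do it before kzalloc() */
                return -EFAULT;

        t = kzalloc(sizeof(*t), GFP_KERNEL);
        if (!t)
                return -ENOMEM;

        t->id = id;
        *out = t;
        return 0;
}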
2025-11-14hrtimer: Store time as ktime_t in restart blockThomas Weißschuh2-4/+4
The hrtimer core uses ktime_t to represent times, use that also for the restart block. CPU timers internally use nanoseconds instead of ktime_t but use the same restart block, so use the correct accessors for those. Signed-off-by: Thomas Weißschuh <thomas.weissschuh@linutronix.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://patch.msgid.link/20251110-restart-block-expiration-v1-3-5d39cc93df4f@linutronix.de
2025-11-14futex: Store time as ktime_t in restart blockThomas Weißschuh1-5/+4
The futex core uses ktime_t to represent times, use that also for the restart block. This allows the simplification of the accessors. Signed-off-by: Thomas Weißschuh <thomas.weissschuh@linutronix.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://patch.msgid.link/20251110-restart-block-expiration-v1-2-5d39cc93df4f@linutronix.de
2025-11-14nstree: fix kernel-doc comments for internal functionsKriish Sharma1-2/+3
Documentation build reported: Warning: kernel/nstree.c:325 function parameter 'ns_tree' not described in '__ns_tree_adjoined_rcu' Warning: kernel/nstree.c:325 expecting prototype for ns_tree_adjoined_rcu(). Prototype was for __ns_tree_adjoined_rcu() instead Warning: kernel/nstree.c:353 expecting prototype for ns_tree_gen_id(). Prototype was for __ns_tree_gen_id() instead The kernel-doc comments for `__ns_tree_adjoined_rcu()` and `__ns_tree_gen_id()` had mismatched function names and a missing parameter description. This patch updates the function names in the kernel-doc headers and adds the missing `@ns_tree` parameter description for `__ns_tree_adjoined_rcu()`. Fixes: 885fc8ac0a4d ("nstree: make iterator generic") Reported-by: kernel test robot <lkp@intel.com> Closes: https://lore.kernel.org/oe-kbuild-all/202511061542.0LO7xKs8-lkp@intel.com Signed-off-by: Kriish Sharma <kriish.sharma2006@gmail.com> Link: https://patch.msgid.link/20251111112533.2254432-1-kriish.sharma2006@gmail.com Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-14nsproxy: fix free_nsproxy() and simplify create_new_namespaces()Christian Brauner1-16/+20
Make it possible to handle NULL being passed to the reference count helpers instead of forcing the caller to handle this. Afterwards we can nicely allow a cleanup guard to handle nsproxy freeing. Active reference count handling is not done in nsproxy_free() but rather in free_nsproxy() as nsproxy_free() is also called from setns() failure paths where a new nsproxy has been prepared but has not been marked as active via switch_task_namespaces(). Link: https://lore.kernel.org/690bfb9e.050a0220.2e3c35.0013.GAE@google.com Link: https://patch.msgid.link/20251111-sakralbau-guthaben-7dcc277d337f@brauner Fixes: 3c9820d5c64a ("ns: add active reference count") Reported-by: syzbot+0b2e79f91ff6579bfa5b@syzkaller.appspotmail.com Reported-by: syzbot+0a8655a80e189278487e@syzkaller.appspotmail.com Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-14padata: remove __padata_list_init()Tetsuo Handa1-8/+4
syzbot reports a possible deadlock due to sharing a lock_class_key between padata_init_squeues() and padata_init_reorder_list(). This is a false positive, as these callers initialize different objects. Unshare the lock_class_key by embedding __padata_list_init() into these callers. Reported-by: syzbot+bd936ccd4339cea66e6b@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=bd936ccd4339cea66e6b Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2025-11-14syscore: Pass context data to callbacksThierry Reding6-23/+69
Several drivers can benefit from registering per-instance data along with the syscore operations. To achieve this, move the modifiable fields out of the syscore_ops structure and into a separate struct syscore that can be registered with the framework. Add a void * driver data field for drivers to store contextual data that will be passed to the syscore ops. Acked-by: Rafael J. Wysocki (Intel) <rafael@kernel.org> Signed-off-by: Thierry Reding <treding@nvidia.com>
2025-11-13Merge tag 'pm-6.18-rc6' of ↵Linus Torvalds1-9/+13
git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm Pull power management fixes from Rafael Wysocki: "These fix issues related to the handling of compressed hibernation images and a recent intel_pstate driver regression: - Fix issues related to using inadequate data types and incorrect use of atomic variables in the compressed hibernation images handling code that were introduced during the 6.9 development cycle (Mario Limonciello) - Move a X86_FEATURE_IDA check from turbo_is_disabled() to the places where a new value for MSR_IA32_PERF_CTL is computed in intel_pstate to address a regression preventing users from enabling turbo frequencies post-boot (Srinivas Pandruvada)" * tag 'pm-6.18-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: cpufreq: intel_pstate: Check IDA only before MSR_IA32_PERF_CTL writes PM: hibernate: Fix style issues in save_compressed_image() PM: hibernate: Use atomic64_t for compressed_size variable PM: hibernate: Emit an error when image writing fails
2025-11-13Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski13-62/+158
Cross-merge networking fixes after downstream PR (net-6.18-rc6). No conflicts, adjacent changes in: drivers/net/phy/micrel.c 96a9178a29a6 ("net: phy: micrel: lan8814 fix reset of the QSGMII interface") 61b7ade9ba8c ("net: phy: micrel: Add support for non PTP SKUs for lan8814") and a trivial one in tools/testing/selftests/drivers/net/Makefile. Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-11-13trace/pid_list: optimize pid_list->lock contentionYongliang Gao2-9/+22
When the system has many cores and task switching is frequent, setting set_ftrace_pid can cause frequent pid_list->lock contention and high system sys usage. For example, in a 288-core VM environment, we observed 267 CPUs experiencing contention on pid_list->lock, with stack traces showing: #4 [ffffa6226fb4bc70] native_queued_spin_lock_slowpath at ffffffff99cd4b7e #5 [ffffa6226fb4bc90] _raw_spin_lock_irqsave at ffffffff99cd3e36 #6 [ffffa6226fb4bca0] trace_pid_list_is_set at ffffffff99267554 #7 [ffffa6226fb4bcc0] trace_ignore_this_task at ffffffff9925c288 #8 [ffffa6226fb4bcd8] ftrace_filter_pid_sched_switch_probe at ffffffff99246efe #9 [ffffa6226fb4bcf0] __schedule at ffffffff99ccd161 Replaces the existing spinlock with a seqlock to allow concurrent readers, while maintaining write exclusivity. Link: https://patch.msgid.link/20251113000252.1058144-1-leonylgao@gmail.com Reviewed-by: Huang Cun <cunhuang@tencent.com> Signed-off-by: Yongliang Gao <leonylgao@tencent.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
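A minimal sketch of the spinlock-to-seqlock conversion: readers retry on a sequence change and never contend with each other, only writers serialize. Structure and function names are illustrative, not the pid_list internals:

#include <linux/seqlock.h>
#include <linux/types.h>

static DEFINE_SEQLOCK(pid_seq);
static bool pid_set;

static bool pid_is_set(void)
{
        unsigned int seq;
        bool ret;

        do {
                seq = read_seqbegin(&pid_seq);  /* lockless read side */
                ret = pid_set;
        } while (read_seqretry(&pid_seq, seq));

        return ret;
}

static void pid_mark_set(bool val)
{
        write_seqlock(&pid_seq);                /* writers still exclude each other */
        pid_set = val;
        write_sequnlock(&pid_seq);
}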
2025-11-13tracing: Have function graph tracer define options per instanceSteven Rostedt1-7/+11
Currently the function graph tracer's options are saved via a global mask when it should be per instance. Use the new infrastructure to define a "default_flags" field in the tracer structure that is used for the top level instance as well as new ones. Currently the global mask causes confusion: # cd /sys/kernel/tracing # mkdir instances/foo # echo function_graph > instances/foo/current_tracer # echo 1 > options/funcgraph-args # echo function_graph > current_tracer # cat trace [..] 2) | _raw_spin_lock_irq(lock=0xffff96b97dea16c0) { 2) 0.422 us | do_raw_spin_lock(lock=0xffff96b97dea16c0); 7) | rcu_sched_clock_irq(user=0) { 2) 1.478 us | } 7) 0.758 us | rcu_is_cpu_rrupt_from_idle(); 2) 0.647 us | enqueue_hrtimer(timer=0xffff96b97dea2058, base=0xffff96b97dea1740, mode=0); # cat instances/foo/options/funcgraph-args 1 # cat instances/foo/trace [..] 4) | __x64_sys_read() { 4) | ksys_read() { 4) 0.755 us | fdget_pos(); 4) | vfs_read() { 4) | rw_verify_area() { 4) | security_file_permission() { 4) | apparmor_file_permission() { 4) | common_file_perm() { 4) | aa_file_perm() { 4) | rcu_read_lock_held() { [..] The above shows that updating the "funcgraph-args" option at the top level instance also updates the "funcgraph-args" option in the instance but because the update is only done by the instance that gets changed (as it should), it's confusing to see that the option is already set in the other instance. Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Andrew Morton <akpm@linux-foundation.org> Link: https://patch.msgid.link/20251111232429.641030027@kernel.org Fixes: c132be2c4fcc1 ("function_graph: Have the instances use their own ftrace_ops for filtering") Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-11-13Merge branch 'pm-sleep'Rafael J. Wysocki1-9/+13
Merge fixes for issues related to the handling of compressed hibernation images that were introduced during the 6.9 development cycle. * pm-sleep: PM: hibernate: Fix style issues in save_compressed_image() PM: hibernate: Use atomic64_t for compressed_size variable PM: hibernate: Emit an error when image writing fails
2025-11-13sched_ext: Fix possible deadlock in the deferred_irq_workfn()Zqiang1-1/+1
For PREEMPT_RT=y kernels, deferred_irq_workfn() is executed in the per-CPU irq_work/* task context with interrupts enabled. If the rq returned by container_of() is the current CPU's rq, the following scenario may occur: lock(&rq->__lock); <Interrupt> lock(&rq->__lock); Use IRQ_WORK_INIT_HARD() instead of init_irq_work() to initialize rq->scx.deferred_irq_work, so that deferred_irq_workfn() is always invoked in hard-irq context. Signed-off-by: Zqiang <qiang.zhang@linux.dev> Signed-off-by: Tejun Heo <tj@kernel.org>
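A minimal sketch of the init change; the work item and callback names are illustrative. IRQ_WORK_INIT_HARD() marks the work so that on PREEMPT_RT it still runs in hard-irq context rather than in the per-CPU irq_work kthread:

#include <linux/irq_work.h>

static void deferred_workfn(struct irq_work *w)
{
        /* runs with IRQs disabled, even on PREEMPT_RT */
}

static struct irq_work deferred_work = IRQ_WORK_INIT_HARD(deferred_workfn);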
2025-11-13bpf: Free special fields when update [lru_,]percpu_hash mapsLeon Hwang1-2/+8
As [lru_,]percpu_hash maps support BPF_KPTR_{REF,PERCPU}, missing calls to 'bpf_obj_free_fields()' in 'pcpu_copy_value()' could cause the memory referenced by BPF_KPTR_{REF,PERCPU} fields to be held until the map gets freed. Fix this by calling 'bpf_obj_free_fields()' after 'copy_map_value[,_long]()' in 'pcpu_copy_value()'. Fixes: 65334e64a493 ("bpf: Support kptrs in percpu hashmap and percpu LRU hashmap") Signed-off-by: Leon Hwang <leon.hwang@linux.dev> Acked-by: Yonghong Song <yonghong.song@linux.dev> Link: https://lore.kernel.org/r/20251105151407.12723-2-leon.hwang@linux.dev Signed-off-by: Alexei Starovoitov <ast@kernel.org>
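A minimal sketch of the fix's shape, echoing the sentence above: after copying a new value over an existing per-CPU slot, release the special fields the overwritten value may still reference. The helper name update_slot() is illustrative, not the pcpu_copy_value() hunk:

#include <linux/bpf.h>

static void update_slot(struct bpf_map *map, void *slot, void *new_val)
{
        copy_map_value(map, slot, new_val);
        /* Drop memory still referenced by BPF_KPTR_{REF,PERCPU} fields. */
        bpf_obj_free_fields(map->record, slot);
}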
2025-11-13Merge tag 'v6.18-rc5' into objtool/core, to pick up fixesIngo Molnar26-95/+261
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2025-11-12bpf: Adjust return value for queue destruction in rqspinlockKumar Kartikeya Dwivedi1-1/+1
Return -ETIMEDOUT whenever non-head waiters are signalled by head, and fix oversight in commit 7bd6e5ce5be6 ("rqspinlock: Disable queue destruction for deadlocks"). We no longer signal on deadlocks. Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Reviewed-by: Amery Hung <ameryhung@gmail.com> Link: https://lore.kernel.org/r/20251111013827.1853484-1-memxor@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-11-12sched_ext: Update comments replacing breather with aborting mechanismAndrea Righi1-4/+4
Commit 5ebec443fb96a ("sched_ext: Exit dispatch and move operations immediately when aborting") replaced the breather mechanism with the scx_aborting flag. Update comments removing references to the breather mechanism to avoid confusion. Signed-off-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-11-12sched/ext: convert scx_tasks_lock to raw spinlockEmil Tsalapatis1-8/+8
Update scx_tasks_lock so that it is safe to lock/unlock in a non-sleepable context on PREEMPT_RT kernels. scx_tasks_lock is a (non-raw) spinlock used to protect the list of tasks under SCX. This list is updated from finish_task_switch(), which cannot sleep. Regular spinlocks can be locked in such a context on non-RT kernels, but are sleepable when CONFIG_PREEMPT_RT=y. Convert scx_tasks_lock into a raw spinlock, which is not sleepable even on RT kernels. Sample backtrace: <TASK> dump_stack_lvl+0x83/0xa0 __might_resched+0x14a/0x200 rt_spin_lock+0x61/0x1c0 ? sched_ext_dead+0x2d/0xf0 ? lock_release+0xc6/0x280 sched_ext_dead+0x2d/0xf0 ? srso_alias_return_thunk+0x5/0xfbef5 finish_task_switch.isra.0+0x254/0x360 __schedule+0x584/0x11d0 ? srso_alias_return_thunk+0x5/0xfbef5 ? srso_alias_return_thunk+0x5/0xfbef5 ? tick_nohz_idle_exit+0x7e/0x120 schedule_idle+0x23/0x40 cpu_startup_entry+0x29/0x30 start_secondary+0xf8/0x100 common_startup_64+0x13e/0x148 </TASK> Signed-off-by: Emil Tsalapatis <emil@etsalapatis.com> Signed-off-by: Tejun Heo <tj@kernel.org>
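A minimal sketch of the conversion: a raw_spinlock_t stays a real spinning lock on PREEMPT_RT, so it can be taken from non-sleepable context such as finish_task_switch(). Names are illustrative:

#include <linux/list.h>
#include <linux/spinlock.h>

static DEFINE_RAW_SPINLOCK(tasks_lock);
static LIST_HEAD(tasks_list);

static void tasks_add(struct list_head *node)
{
        unsigned long flags;

        raw_spin_lock_irqsave(&tasks_lock, flags);      /* never sleeps, even on RT */
        list_add_tail(node, &tasks_list);
        raw_spin_unlock_irqrestore(&tasks_lock, flags);
}

static void tasks_del(struct list_head *node)
{
        unsigned long flags;

        raw_spin_lock_irqsave(&tasks_lock, flags);
        list_del_init(node);
        raw_spin_unlock_irqrestore(&tasks_lock, flags);
}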
2025-11-12release_task: kill unnecessary rcu_read_lock() around dec_rlimit_ucounts()Oleg Nesterov1-3/+1
rcu_read_lock() was added to shut RCU-lockdep up when this code used __task_cred()->rcu_dereference(), but after the commit 21d1c5e386bc ("Reimplement RLIMIT_NPROC on top of ucounts") it is no longer needed: task_ucounts()->task_cred_xxx() takes rcu_read_lock() itself. NOTE: task_ucounts() returns the pointer to another rcu-protected data, struct ucounts. So it should either be used when task->real_cred and thus task->real_cred->ucounts is stable (release_task, copy_process, copy_creds), or it should be called under rcu_read_lock(). In both cases it is pointless to take rcu_read_lock() to read the cred->ucounts pointer. Link: https://lkml.kernel.org/r/20251026143140.GA22463@redhat.com Signed-off-by: Oleg Nesterov <oleg@redhat.com> Acked-by: Alexey Gladkov <legion@kernel.org> Cc: David Howells <dhowells@redhat.com> Cc: Mateusz Guzik <mjguzik@gmail.com> Cc: "Paul E . McKenney" <paulmck@kernel.org> Cc: Kees Cook <kees@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-11-12taint/module: remove unnecessary taint_flag.module fieldPetr Pavlu2-26/+22
The TAINT_RANDSTRUCT and TAINT_FWCTL flags are mistakenly set in the taint_flags table as per-module flags. While this can be trivially corrected, the issue can be avoided altogether by removing the taint_flag.module field. This is possible because, since commit 7fd8329ba502 ("taint/module: Clean up global and module taint flags handling") in 2016, the handling of module taint flags has been fully generic. Specifically, module_flags_taint() can print all flags, and the required output buffer size is properly defined in terms of TAINT_FLAGS_COUNT. The actual per-module flags are always those added to module.taints by calls to add_taint_module(). Link: https://lkml.kernel.org/r/20251022082938.26670-1-petr.pavlu@suse.com Signed-off-by: Petr Pavlu <petr.pavlu@suse.com> Acked-by: Petr Mladek <pmladek@suse.com> Reviewed-by: Randy Dunlap <rdunlap@infradead.org> Cc: Aaron Tomlin <atomlin@atomlin.com> Cc: Luis Chamberalin <mcgrof@kernel.org> Cc: Petr Pavlu <petr.pavlu@suse.com> Cc: Sami Tolvanen <samitolvanen@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-11-12taint: add reminder about updating docs and scriptsRandy Dunlap1-0/+6
Sometimes people update taint-related pieces of the kernel without updating the supporting documentation or scripts. Add a reminder to do this. Link: https://lkml.kernel.org/r/20251015221626.1126156-1-rdunlap@infradead.org Signed-off-by: Randy Dunlap <rdunlap@infradead.org> Suggested-by: Jason Gunthorpe <jgg@nvidia.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Cc: David Gow <davidgow@google.com> Cc: Jonathan Corbet <corbet@lwn.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-11-12crash: let architecture decide crash memory export to iomem_resourceSourabh Jain1-0/+3
With the generic crashkernel reservation, the kernel emits the following warning on powerpc: WARNING: CPU: 0 PID: 1 at arch/powerpc/mm/mem.c:341 add_system_ram_resources+0xfc/0x180 Modules linked in: CPU: 0 UID: 0 PID: 1 Comm: swapper/0 Not tainted 6.17.0-auto-12607-g5472d60c129f #1 VOLUNTARY Hardware name: IBM,9080-HEX Power11 (architected) 0x820200 0xf000007 of:IBM,FW1110.01 (NH1110_069) hv:phyp pSeries NIP: c00000000201de3c LR: c00000000201de34 CTR: 0000000000000000 REGS: c000000127cef8a0 TRAP: 0700 Not tainted (6.17.0-auto-12607-g5472d60c129f) MSR: 8000000002029033 <SF,VEC,EE,ME,IR,DR,RI,LE> CR: 84000840 XER: 20040010 CFAR: c00000000017eed0 IRQMASK: 0 GPR00: c00000000201de34 c000000127cefb40 c0000000016a8100 0000000000000001 GPR04: c00000012005aa00 0000000020000000 c000000002b705c8 0000000000000000 GPR08: 000000007fffffff fffffffffffffff0 c000000002db8100 000000011fffffff GPR12: c00000000201dd40 c000000002ff0000 c0000000000112bc 0000000000000000 GPR16: 0000000000000000 0000000000000000 0000000000000000 0000000000000000 GPR20: 0000000000000000 0000000000000000 0000000000000000 c0000000015a3808 GPR24: c00000000200468c c000000001699888 0000000000000106 c0000000020d1950 GPR28: c0000000014683f8 0000000081000200 c0000000015c1868 c000000002b9f710 NIP [c00000000201de3c] add_system_ram_resources+0xfc/0x180 LR [c00000000201de34] add_system_ram_resources+0xf4/0x180 Call Trace: add_system_ram_resources+0xf4/0x180 (unreliable) do_one_initcall+0x60/0x36c do_initcalls+0x120/0x220 kernel_init_freeable+0x23c/0x390 kernel_init+0x34/0x26c ret_from_kernel_user_thread+0x14/0x1c This warning occurs due to a conflict between crashkernel and System RAM iomem resources. The generic crashkernel reservation adds the crashkernel memory range to /proc/iomem during early initialization. Later, all memblock ranges are added to /proc/iomem as System RAM. If the crashkernel region overlaps with any memblock range, it causes a conflict while adding those memblock regions as iomem resources, triggering the above warning. The conflicting memblock regions are then omitted from /proc/iomem. For example, if the following crashkernel region is added to /proc/iomem: 20000000-11fffffff : Crash kernel then the following System RAM memblock regions fail to be inserted: 00000000-7fffffff : System RAM 80000000-257fffffff : System RAM Fix this by not adding the crashkernel memory to /proc/iomem on powerpc. Introduce an architecture hook to let each architecture decide whether to export the crashkernel region to /proc/iomem. For more info, check out commit c40dd2f766440 ("powerpc: Add System RAM to /proc/iomem") and commit bce074bdbc36 ("powerpc: insert System RAM resource to prevent crashkernel conflict") Note: Before switching to the generic crashkernel reservation, powerpc never exported the crashkernel region to /proc/iomem. Link: https://lkml.kernel.org/r/20251016142831.144515-1-sourabhjain@linux.ibm.com Fixes: e3185ee438c2 ("powerpc/crash: use generic crashkernel reservation")
Signed-off-by: Sourabh Jain <sourabhjain@linux.ibm.com> Reported-by: Venkat Rao Bagalkote <venkat88@linux.ibm.com> Closes: https://lore.kernel.org/all/90937fe0-2e76-4c82-b27e-7b8a7fe3ac69@linux.ibm.com/ Tested-by: Venkat Rao Bagalkote <venkat88@linux.ibm.com> Cc: Baoquan he <bhe@redhat.com> Cc: Hari Bathini <hbathini@linux.ibm.com> Cc: Madhavan Srinivasan <maddy@linux.ibm.com> Cc: Mahesh Salgaonkar <mahesh@linux.ibm.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Ritesh Harjani (IBM) <ritesh.list@gmail.com> Cc: Vivek Goyal <vgoyal@redhat.com> Cc: Dave Young <dyoung@redhat.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
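A minimal sketch of the hook shape described above; the hook name arch_crash_export_to_iomem() is hypothetical, not necessarily what the patch introduces. The generic __weak default keeps today's behavior, and an architecture such as powerpc can provide its own non-weak definition returning false to opt out:

#include <linux/compiler.h>
#include <linux/ioport.h>
#include <linux/types.h>

/* Generic default: keep exporting the crashkernel range to /proc/iomem. */
bool __weak arch_crash_export_to_iomem(void)
{
        return true;
}

static void crash_insert_iomem(struct resource *crashk)
{
        if (arch_crash_export_to_iomem())
                insert_resource(&iomem_resource, crashk);
}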
2025-11-12hung_task: panic when there are more than N hung tasks at the same timeLi RongQing2-6/+11
The hung_task_panic sysctl is currently a blunt instrument: it's all or nothing. Panicking on a single hung task can be an overreaction to a transient glitch. A more reliable indicator of a systemic problem is when multiple tasks hang simultaneously. Extend hung_task_panic to accept an integer threshold, allowing the kernel to panic only when N hung tasks are detected in a single scan. This provides finer control to distinguish between isolated incidents and system-wide failures. The accepted values are: - 0: Don't panic (unchanged) - 1: Panic on the first hung task (unchanged) - N > 1: Panic after N hung tasks are detected in a single scan The original behavior is preserved for values 0 and 1, maintaining full backward compatibility. [lance.yang@linux.dev: new changelog] Link: https://lkml.kernel.org/r/20251015063615.2632-1-lirongqing@baidu.com Signed-off-by: Li RongQing <lirongqing@baidu.com> Reviewed-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Reviewed-by: Lance Yang <lance.yang@linux.dev> Tested-by: Lance Yang <lance.yang@linux.dev> Acked-by: Andrew Jeffery <andrew@codeconstruct.com.au> [aspeed_g5_defconfig] Cc: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: David Hildenbrand <david@redhat.com> Cc: Florian Wesphal <fw@strlen.de> Cc: Jakub Kacinski <kuba@kernel.org> Cc: Jason A. Donenfeld <jason@zx2c4.com> Cc: Joel Granados <joel.granados@kernel.org> Cc: Joel Stanley <joel@jms.id.au> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Kees Cook <kees@kernel.org> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: "Paul E . McKenney" <paulmck@kernel.org> Cc: Pawan Gupta <pawan.kumar.gupta@linux.intel.com> Cc: Petr Mladek <pmladek@suse.com> Cc: Phil Auld <pauld@redhat.com> Cc: Randy Dunlap <rdunlap@infradead.org> Cc: Russell King <linux@armlinux.org.uk> Cc: Shuah Khan <shuah@kernel.org> Cc: Simon Horman <horms@kernel.org> Cc: Stanislav Fomichev <sdf@fomichev.me> Cc: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
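A minimal sketch of the threshold semantics: count the hung tasks found in one scan and panic only when the count reaches the configured value, so 0 and 1 keep their old meaning. Variable and function names are illustrative:

#include <linux/panic.h>

static unsigned int hung_task_panic_threshold; /* 0 = off, 1 = old behavior */

static void check_scan_result(unsigned int hung_in_this_scan)
{
        if (hung_task_panic_threshold &&
            hung_in_this_scan >= hung_task_panic_threshold)
                panic("hung_task: %u tasks blocked in one scan\n",
                      hung_in_this_scan);
}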
2025-11-12scs: fix a wrong parameter in __scs_magicZhichi Lin1-1/+1
__scs_magic() needs a 'void *' argument, but a 'struct task_struct *' is given. 'task_scs(tsk)' is the starting address of the task's shadow call stack, and '__scs_magic(task_scs(tsk))' is the end address of the task's shadow call stack. The call here should be '__scs_magic(task_scs(tsk))'. The user-visible effect of this bug is that when CONFIG_DEBUG_STACK_USAGE is enabled, the shadow call stack usage checking function (scs_check_usage) would scan an incorrect memory range. This could lead to: 1. **Inaccurate stack usage reporting**: The function would calculate wrong usage statistics for the shadow call stack, potentially showing incorrect values in kmsg. 2. **Potential kernel crash**: If the value of __scs_magic(tsk) is greater than that of __scs_magic(task_scs(tsk)), the for loop may access unmapped memory, potentially causing a kernel panic. However, this scenario is unlikely because task_struct is allocated via the slab allocator (which typically returns lower addresses), while the shadow call stack returned by task_scs(tsk) is allocated via vmalloc (which typically returns higher addresses). Since this is purely a debugging feature (CONFIG_DEBUG_STACK_USAGE), normal production systems should not be affected. The bug only impacts developers and testers who are actively debugging stack usage with this configuration enabled. Link: https://lkml.kernel.org/r/20251011082222.12965-1-zhichi.lin@vivo.com Fixes: 5bbaf9d1fcb9 ("scs: Add support for stack usage debugging") Signed-off-by: Jiyuan Xie <xiejiyuan@vivo.com> Signed-off-by: Zhichi Lin <zhichi.lin@vivo.com> Reviewed-by: Sami Tolvanen <samitolvanen@google.com> Acked-by: Will Deacon <will@kernel.org> Cc: Andrey Konovalov <andreyknvl@gmail.com> Cc: Kees Cook <keescook@chromium.org> Cc: Marco Elver <elver@google.com> Cc: Will Deacon <will@kernel.org> Cc: Yee Lee <yee.lee@mediatek.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
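A minimal sketch of the corrected bound, assuming CONFIG_SHADOW_CALL_STACK; the function name is illustrative. The end pointer comes from __scs_magic() applied to the stack base, not to the task pointer:

#include <linux/sched.h>
#include <linux/scs.h>

static void scs_usage_sketch(struct task_struct *tsk)
{
        unsigned long *p, *end = __scs_magic(task_scs(tsk));

        for (p = task_scs(tsk); p < end; p++) {
                /* record the highest slot that has been written to */
        }
}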
2025-11-12kexec_core: remove superfluous page offset handling in segment loadingJustinien Bouron1-12/+3
During kexec_segment loading, when copying the content of the segment (i.e. kexec_segment::kbuf or kexec_segment::buf) to its associated pages, kimage_load_{cma,normal,crash}_segment handle the case where the physical address of the segment is not page aligned, e.g. in kimage_load_normal_segment: page = kimage_alloc_page(image, GFP_HIGHUSER, maddr); // ... ptr = kmap_local_page(page); // ... ptr += maddr & ~PAGE_MASK; mchunk = min_t(size_t, mbytes, PAGE_SIZE - (maddr & ~PAGE_MASK)); // ^^^^ Non page-aligned segments handled here ^^^ // ... if (image->file_mode) memcpy(ptr, kbuf, uchunk); else result = copy_from_user(ptr, buf, uchunk); (similar logic is present in kimage_load_{cma,crash}_segment). This is actually not needed because, prior to their loading, all kexec_segments first go through a vetting step in `sanity_check_segment_list`, which rejects any segment that is not page-aligned: for (i = 0; i < nr_segments; i++) { unsigned long mstart, mend; mstart = image->segment[i].mem; mend = mstart + image->segment[i].memsz; // ... if ((mstart & ~PAGE_MASK) || (mend & ~PAGE_MASK)) return -EADDRNOTAVAIL; // ... } In case `sanity_check_segment_list` finds a non-page-aligned segment, the whole kexec load is aborted and no segment is loaded. This means that `kimage_load_{cma,normal,crash}_segment` never actually have to handle non page-aligned segments and `(maddr & ~PAGE_MASK) == 0` is always true no matter if the segment is coming from a file (i.e. `kexec_file_load` syscall), from a user-space buffer (i.e. `kexec_load` syscall) or created by the kernel through `kexec_add_buffer`. In the latter case, `kexec_add_buffer` actually enforces the page alignment: /* Ensure minimum alignment needed for segments. */ kbuf->memsz = ALIGN(kbuf->memsz, PAGE_SIZE); kbuf->buf_align = max(kbuf->buf_align, PAGE_SIZE); [jbouron@amazon.com: v3] Link: https://lkml.kernel.org/r/20251024155009.39502-1-jbouron@amazon.com Link: https://lkml.kernel.org/r/20250929160220.47616-1-jbouron@amazon.com Signed-off-by: Justinien Bouron <jbouron@amazon.com> Reviewed-by: Gunnar Kudrjavets <gunnarku@amazon.com> Reviewed-by: Andy Shevchenko <andriy.shevchenko@intel.com> Acked-by: Baoquan He <bhe@redhat.com> Cc: Alexander Graf <graf@amazon.com> Cc: Marcos Paulo de Souza <mpdesouza@suse.com> Cc: Mario Limonciello <mario.limonciello@amd.com> Cc: Petr Mladek <pmladek@suse.com> Cc: Yan Zhao <yan.y.zhao@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-11-12sched_ext: Implement load balancer for bypass modeTejun Heo2-3/+242
In bypass mode, tasks are queued on per-CPU bypass DSQs. While this works well in most cases, there is a failure mode where a BPF scheduler can skew task placement severely before triggering bypass in highly over-saturated systems. If most tasks end up concentrated on a few CPUs, those CPUs can accumulate queues that are too long to drain in a reasonable time, leading to RCU stalls and hung tasks. Implement a simple timer-based load balancer that redistributes tasks across CPUs within each NUMA node. The balancer runs periodically (default 500ms, tunable via bypass_lb_intv_us module parameter) and moves tasks from overloaded CPUs to underloaded ones. When moving tasks between bypass DSQs, the load balancer holds nested DSQ locks to avoid dropping and reacquiring the donor DSQ lock on each iteration, as donor DSQs can be very long and highly contended. Add the SCX_ENQ_NESTED flag and use raw_spin_lock_nested() in dispatch_enqueue() to support this. The load balancer timer function reads scx_bypass_depth locklessly to check whether