path: root/kernel
Age  Commit message  [Author, Files, Lines]
2025-11-01  tracing: fprobe: use rhltable for fprobe_ip_table  [Menglong Dong, 1 file, -66/+91]
Currently, every kernel function hooked by an fprobe is added to the hash table "fprobe_ip_table". Its key is the function address and its value is a "struct fprobe_hlist_node". The table has a fixed number of buckets, FPROBE_IP_TABLE_SIZE, which is 256, which means the lookup overhead grows linearly once more than 256 functions are hooked by fprobes. When we try to hook all kernel functions, the overhead becomes huge. Therefore, replace the hash table with an rhltable, which resizes itself with the number of entries, to reduce the overhead. Link: https://lore.kernel.org/all/20250819031825.55653-1-dongml2@chinatelecom.cn/ Signed-off-by: Menglong Dong <dongml2@chinatelecom.cn> Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
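For readers unfamiliar with rhltable, a minimal sketch of declaring and querying an address-keyed rhltable is shown below; the structure, variable, and function names are illustrative, not the actual fprobe code.

  #include <linux/rhashtable.h>

  struct ip_node {
          unsigned long addr;             /* hooked function address (the key) */
          struct rhlist_head hlist;       /* rhltable linkage, allows duplicate keys */
  };

  static const struct rhashtable_params ip_rht_params = {
          .head_offset            = offsetof(struct ip_node, hlist),
          .key_offset             = offsetof(struct ip_node, addr),
          .key_len                = sizeof_field(struct ip_node, addr),
          .automatic_shrinking    = true,
  };

  /* rhltable_init(&ip_table, &ip_rht_params) is called once at setup time. */
  static struct rhltable ip_table;

  static bool ip_table_has(unsigned long addr)
  {
          struct rhlist_head *head, *pos;
          struct ip_node *node;
          bool found = false;

          rcu_read_lock();
          /* The table resizes with the number of entries, so this stays O(1). */
          head = rhltable_lookup(&ip_table, &addr, ip_rht_params);
          rhl_for_each_entry_rcu(node, pos, head, hlist) {
                  /* every node on this list is hooked at @addr */
                  if (node->addr == addr)
                          found = true;
          }
          rcu_read_unlock();
          return found;
  }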
2025-10-31  Merge back system sleep material for 6.19  [Rafael J. Wysocki, 5 files, -29/+117]
2025-10-31  nstree: simplify return  [Christian Brauner, 1 file, -5/+0]
node_to_ns() checks for NULL and the assert isn't really helpful and will have to be dropped later anyway. Link: https://patch.msgid.link/20251029-work-namespace-nstree-listns-v4-7-2e6f823ebdc0@kernel.org Tested-by: syzbot@syzkaller.appspotmail.com Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-10-31  cgroup: add cgroup namespace to tree after owner is set  [Christian Brauner, 1 file, -1/+1]
Otherwise we trip VFS_WARN_ON_ONCE() in __ns_tree_add_raw(). Link: https://patch.msgid.link/20251029-work-namespace-nstree-listns-v4-6-2e6f823ebdc0@kernel.org Fixes: 7c6059398533 ("cgroup: support ns lookup") Tested-by: syzbot@syzkaller.appspotmail.com Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-10-30  Merge tag 'pm-6.18-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm  [Linus Torvalds, 4 files, -10/+18]
Pull power management fixes from Rafael Wysocki: "These fix three regressions, two recent ones and one introduced during the 6.17 development cycle:

 - Add an exit latency check to the menu cpuidle governor in the case when it considers using a real idle state instead of a polling one, to address a performance regression (Rafael Wysocki)
 - Revert an attempted cleanup of a system suspend code path that introduced a regression elsewhere (Samuel Wu)
 - Allow pm_restrict_gfp_mask() to be called multiple times in a row and adjust pm_restore_gfp_mask() accordingly, to avoid having to play nasty games with these calls during hibernation (Rafael Wysocki)"

* tag 'pm-6.18-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
  PM: sleep: Allow pm_restrict_gfp_mask() stacking
  cpuidle: governors: menu: Select polling state in some more cases
  Revert "PM: sleep: Make pm_wakeup_clear() call more clear"
2025-10-30  Merge branches 'pm-cpuidle' and 'pm-sleep'  [Rafael J. Wysocki, 4 files, -10/+18]
Merge a cpuidle fix and two fixes related to system sleep for 6.18-rc4:

 - Add an exit latency check to the menu cpuidle governor in the case when it considers using a real idle state instead of a polling one, to address a performance regression (Rafael Wysocki)
 - Revert an attempted cleanup of a system suspend code path that introduced a regression elsewhere (Samuel Wu)
 - Allow pm_restrict_gfp_mask() to be called multiple times in a row and adjust pm_restore_gfp_mask() accordingly, to avoid having to play nasty games with these calls during hibernation (Rafael Wysocki)

* pm-cpuidle:
  cpuidle: governors: menu: Select polling state in some more cases
* pm-sleep:
  PM: sleep: Allow pm_restrict_gfp_mask() stacking
  Revert "PM: sleep: Make pm_wakeup_clear() call more clear"
2025-10-30  freezer: Clarify that only cgroup1 freezer uses PM freezer  [Tejun Heo, 2 files, -2/+2]
The cgroup1 freezer piggybacks on the PM freezer, which inadvertently allowed userspace to produce uninterruptible tasks at will. To avoid the issue, the cgroup2 freezer switched to a separate job-control based mechanism. While this happened a long time ago, the code and comments haven't been updated, making it confusing to people who aren't familiar with the history. Rename cgroup_freezing() to cgroup1_freezing() and update the comments on top of freezing() and frozen() to clarify that the cgroup2 freezer isn't covered by the PM freezer mechanism. Signed-off-by: Tejun Heo <tj@kernel.org> Suggested-by: Qu Wenruo <wqu@suse.com> Link: https://patch.msgid.link/aPZ3q6Hm865NicBC@slm.duckdns.org Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2025-10-30  PM: hibernate: add sysfs interface for hibernate_compression_threads  [Xueqin Luo, 1 file, -0/+38]
Add a sysfs attribute `/sys/power/hibernate_compression_threads` to allow runtime configuration of the number of threads used for compressing and decompressing hibernation images. The new sysfs interface enables dynamic adjustment at runtime:

  # cat /sys/power/hibernate_compression_threads
  3
  # echo 4 > /sys/power/hibernate_compression_threads

This change provides greater flexibility for debugging and performance tuning of hibernation without requiring a reboot. Signed-off-by: Xueqin Luo <luoxueqin@kylinos.cn> Link: https://patch.msgid.link/c68c62f97fabf32507b8794ad8c16cd22ee656ac.1761046167.git.luoxueqin@kylinos.cn Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
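As a rough illustration (not the actual patch), a read/write attribute under /sys/power/ usually boils down to a show/store pair along these lines; the variable and function names below are made up for the sketch and assume <linux/kobject.h>, <linux/sysfs.h> and <linux/kstrtox.h>.

  static unsigned int hib_compress_threads = 3;   /* assumed default */

  static ssize_t hibernate_compression_threads_show(struct kobject *kobj,
                          struct kobj_attribute *attr, char *buf)
  {
          return sysfs_emit(buf, "%u\n", hib_compress_threads);
  }

  static ssize_t hibernate_compression_threads_store(struct kobject *kobj,
                          struct kobj_attribute *attr,
                          const char *buf, size_t n)
  {
          unsigned int val;

          if (kstrtouint(buf, 10, &val) || !val)
                  return -EINVAL;

          hib_compress_threads = val;
          return n;
  }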
2025-10-30  PM: hibernate: make compression threads configurable  [Xueqin Luo, 1 file, -4/+21]
The number of compression/decompression threads has a direct impact on hibernate image generation and resume latency. Using more threads can reduce overall resume time, but on systems with fewer CPU cores it may also introduce contention and reduce efficiency. Performance was evaluated on an 8-core ARM system, averaged over 10 runs:

  Threads  Hibernate(s)  Resume(s)
  --------------------------------
  3        12.14         18.86
  4        12.28         17.48
  5        11.09         16.77
  6        11.08         16.44

With 5–6 threads, resume latency improves by approximately 12% compared to the default 3-thread configuration, with negligible impact on hibernate time. Introduce a new kernel parameter `hibernate_compression_threads=` that allows users and integrators to tune the number of compression/decompression threads at boot. This provides a way to balance performance and CPU utilization across a wide range of hardware without recompiling the kernel. Signed-off-by: Xueqin Luo <luoxueqin@kylinos.cn> Link: https://patch.msgid.link/f24b3ca6416e230a515a154ed4c121d72a7e05a6.1761046167.git.luoxueqin@kylinos.cn Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
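A hedged sketch of how such a boot parameter is commonly wired up (the real patch may use a different helper; the names below are illustrative):

  static unsigned int hib_compress_threads = 3;   /* assumed built-in default */

  /* Parse "hibernate_compression_threads=<n>" from the kernel command line. */
  static int __init hib_compress_threads_setup(char *str)
  {
          unsigned int val;

          if (kstrtouint(str, 10, &val) || !val)
                  return 0;       /* malformed value: keep the default */

          hib_compress_threads = val;
          return 1;               /* parameter consumed */
  }
  __setup("hibernate_compression_threads=", hib_compress_threads_setup);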
2025-10-30  PM: hibernate: dynamically allocate crc->unc_len/unc for configurable threads  [Xueqin Luo, 1 file, -14/+44]
Convert crc->unc_len and crc->unc from fixed-size arrays to dynamically allocated arrays, sized according to the actual number of threads selected at runtime. This removes the fixed limit imposed by CMP_THREADS. Signed-off-by: Xueqin Luo <luoxueqin@kylinos.cn> Link: https://patch.msgid.link/b5db63bb95729482d2649b12d3a11cb7547b7fcc.1761046167.git.luoxueqin@kylinos.cn Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2025-10-30  printk/nbcon: Release nbcon consoles ownership in atomic flush after each emitted record  [Petr Mladek, 1 file, -4/+5]
printk() tries to flush messages with NBCON_PRIO_EMERGENCY on nbcon consoles immediately. It might take seconds to flush all pending lines on slow serial consoles. Note that there might be hundreds of messages, for example:

  [ 3.771531][ T1] pci 0000:3e:08.1: [8086:324
  ** replaying previous printk message **
  [ 3.771531][ T1] pci 0000:3e:08.1: [8086:3246] type 00 class 0x088000 PCIe Root Complex Integrated Endpoint
  [ ... more than 2000 lines, about 200kB messages ... ]
  [ 3.837752][ T1] pci 0000:20:01.0: Adding to iommu group 18
  [ 3.837851][ T
  ** replaying previous printk message **
  [ 3.837851][ T1] pci 0000:20:03.0: Adding to iommu group 19
  [ 3.837946][ T1] pci 0000:20:05.0: Adding to iommu group 20
  [ ... more than 500 messages for iommu groups 21-590 ...]
  [ 3.912932][ T1] pci 0000:f6:00.1: Adding to iommu group 591
  [ 3.913070][ T1] pci 0000:f6:00.2: Adding to iommu group 592
  [ 3.913243][ T1] DMAR: Intel(R) Virtualization Technology for Directed I/O
  [ 3.913245][ T1] PCI-DMA: Using software bounce buffering for IO (SWIOTLB)
  [ 3.913245][ T1] software IO TLB: mapped [mem 0x000000004f000000-0x0000000053000000] (64MB)
  [ 3.913324][ T1] RAPL PMU: API unit is 2^-32 Joules, 3 fixed counters, 655360 ms ovfl timer
  [ 3.913325][ T1] RAPL PMU: hw unit of domain package 2^-14 Joules
  [ 3.913326][ T1] RAPL PMU: hw unit of domain dram 2^-14 Joules
  [ 3.913327][ T1] RAPL PMU: hw unit of domain psys 2^-0 Joules
  [ 3.933486][ T1] ------------[ cut here ]------------
  [ 3.933488][ T1] WARNING: CPU: 2 PID: 1 at arch/x86/events/intel/uncore.c:1156 uncore_pci_pmu_register+0x15e/0x180
  [ 3.930291][ C0] watchdog: Watchdog detected hard LOCKUP on cpu 0
  [ 3.930291][ C0] Kernel panic - not syncing: Hard LOCKUP
  [...]
  [ 3.930291][ C0] CPU: 0 UID: 0 PID: 18 Comm: pr/ttyS0 Not tainted...
  [...]
  [ 3.930291][ C0] RIP: 0010:nbcon_reacquire_nobuf+0x11/0x50
  [ 3.930291][ C0] Call Trace:
  [...]
  [ 3.930291][ C0] <TASK>
  [ 3.930291][ C0] serial8250_console_write+0x16d/0x5c0
  [ 3.930291][ C0] nbcon_emit_next_record+0x22c/0x250
  [ 3.930291][ C0] nbcon_emit_one+0x93/0xe0
  [ 3.930291][ C0] nbcon_kthread_func+0x13c/0x1c0

There are two visible takeovers of the console ownership:

 - The 1st one is triggered by the "WARNING: CPU: 2 PID: 1 at arch/x86/..." line printed with NBCON_PRIO_EMERGENCY.
 - The 2nd one is triggered by the "Kernel panic - not syncing: Hard LOCKUP" line printed with NBCON_PRIO_PANIC.

There are more than 2500 lines, at about 240kB, emitted between the takeover and the 1st "WARNING" line in the emergency context. This amount of pending messages had to be flushed by nbcon_atomic_flush_pending() when WARN() printed its first line. The atomic flush held the nbcon console context for so long that it triggered a hard lockup on the CPU running the printk kthread "pr/ttyS0". The kthread needed to reacquire the console ownership to restore the original serial port state in serial8250_console_write(). Prevent the hardlockup by releasing the nbcon console ownership after each emitted record. Note that __nbcon_atomic_flush_pending_con() used to hold the console ownership all the time because that blocked the printk kthread. Otherwise the kthread tried to flush the messages in parallel, which caused repeated takeovers and more replayed messages. It is no longer a problem because the repeated takeovers are blocked by the counter of emergency contexts, see nbcon_cpu_emergency_cnt. Link: https://lore.kernel.org/all/aNQO-zl3k1l4ENfy@pathway.suse.cz Reviewed-by: Andrew Murray <amurray@thegoodpenguin.co.uk> Reviewed-by: John Ogness <john.ogness@linutronix.de> Link: https://patch.msgid.link/20250926124912.243464-4-pmladek@suse.com Signed-off-by: Petr Mladek <pmladek@suse.com>
2025-10-30  printk/nbcon/panic: Allow printk kthread to sleep when the system is in panic  [Petr Mladek, 1 file, -2/+4]
The printk kthread might be running when a panic is in progress, but it is no longer able to acquire the console ownership. Prevent the desperate attempts to acquire the ownership and allow sleeping in panic. This makes it behave the same as when any CPU is in an emergency context. Reviewed-by: Andrew Murray <amurray@thegoodpenguin.co.uk> Reviewed-by: John Ogness <john.ogness@linutronix.de> Link: https://patch.msgid.link/20250926124912.243464-3-pmladek@suse.com [pmladek@suse.com: Rebased on top of 6.18-rc1 (panic_in_progress() moved to panic.c)] Signed-off-by: Petr Mladek <pmladek@suse.com>
2025-10-30  printk/nbcon: Block printk kthreads when any CPU is in an emergency context  [Petr Mladek, 1 file, -1/+37]
In emergency contexts, printk() tries to flush messages directly even on nbcon consoles. And it is allowed to take over the console ownership and interrupt the printk kthread in the middle of a message. Only one takeover and one repeated message should be enough in most situations. The first emergency message flushes the backlog and the printk kthreads get to sleep. Subsequent emergency messages are flushed directly and printk() does not wake up the kthreads. However, the single takeover is not guaranteed. Any printk() in normal context on another CPU could wake up the kthreads. Or a new emergency message might be added before the kthreads get to sleep. Note that the interrupted .write_thread() callbacks usually have to call nbcon_reacquire_nobuf() and restore the original device setting before checking for pending messages. The risk of repeated takeovers will be even bigger because __nbcon_atomic_flush_pending_con() is going to release the console ownership after each emitted record, which is needed to prevent hardlockup reports on other CPUs that are busy waiting for the context ownership, for example in nbcon_reacquire_nobuf() or __uart_port_nbcon_acquire(). The repeated takeovers break the output, for example:

  [ 5042.650211][ T2220] Call Trace:
  [ 5042.6511
  ** replaying previous printk message **
  [ 5042.651192][ T2220] <TASK>
  [ 5042.652160][ T2220] kunit_run_
  ** replaying previous printk message **
  [ 5042.652160][ T2220] kunit_run_tests+0x72/0x90
  [ 5042.653340][ T22
  ** replaying previous printk message **
  [ 5042.653340][ T2220] ? srso_alias_return_thunk+0x5/0xfbef5
  [ 5042.654628][ T2220] ? stack_trace_save+0x4d/0x70
  [ 5042.6553
  ** replaying previous printk message **
  [ 5042.655394][ T2220] ? srso_alias_return_thunk+0x5/0xfbef5
  [ 5042.656713][ T2220] ? save_trace+0x5b/0x180

A more robust solution is to block the printk kthread entirely whenever *any* CPU enters an emergency context. This ensures that critical messages can be flushed without contention from the normal, non-atomic printing path. Link: https://lore.kernel.org/all/aNQO-zl3k1l4ENfy@pathway.suse.cz Reviewed-by: Andrew Murray <amurray@thegoodpenguin.co.uk> Reviewed-by: John Ogness <john.ogness@linutronix.de> Link: https://patch.msgid.link/20250926124912.243464-2-pmladek@suse.com [pmladek@suse.com: Added changes proposed by John Ogness] Signed-off-by: Petr Mladek <pmladek@suse.com>
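Reduced to its essentials, the gating idea might look like the sketch below. It is illustrative only: the kernel's actual bookkeeping (nbcon_cpu_emergency_cnt, mentioned in the previous entry) is more involved, and the helper names are made up.

  #include <linux/atomic.h>

  /* Count of CPUs currently in emergency context. */
  static atomic_t emergency_cpus = ATOMIC_INIT(0);

  static void example_enter_emergency(void)
  {
          atomic_inc(&emergency_cpus);
  }

  static void example_exit_emergency(void)
  {
          atomic_dec(&emergency_cpus);
  }

  static bool example_kthread_may_print(void)
  {
          /* Block the non-atomic kthread path while any CPU flushes directly. */
          return atomic_read(&emergency_cpus) == 0;
  }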
2025-10-29  bpf: Use kmalloc_nolock() in bpf streams  [Puranjay Mohan, 1 file, -151/+8]
BPF stream kfuncs need to be non-sleeping as they can be called from programs running in any context; this requires a way to allocate memory from any context. Currently, this is done by a custom per-CPU NMI-safe bump allocation mechanism, backed by the alloc_pages_nolock() and free_pages_nolock() primitives. Now that the kmalloc_nolock() and kfree_nolock() primitives are available, the custom allocator can be removed in favor of these. Signed-off-by: Puranjay Mohan <puranjay@kernel.org> Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20251023161448.4263-1-puranjay@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-10-29  rqspinlock: Disable queue destruction for deadlocks  [Kumar Kartikeya Dwivedi, 1 file, -0/+8]
Disable propagation and unwinding of the waiter queue in case the head waiter detects a deadlock condition, but keep it enabled in case of the timeout fallback. Currently, when the head waiter experiences an AA deadlock, it will signal all its successors in the queue to exit with an error. This is not ideal for cases where the same lock is held in contexts which can cause errors in an unrestricted fashion (e.g., BPF programs, or kernel paths invoked through BPF programs), and core kernel logic which is written in a correct fashion and does not expect deadlocks. The same reasoning can be extended to ABBA situations. Depending on the actual runtime schedule, one or both of the head waiters involved in an ABBA situation can detect and exit directly without terminating their waiter queue. If the ABBA situation manifests again, the waiters will keep exiting until progress can be made, or a timeout is triggered in case of more complicated locking dependencies. We still preserve the queue destruction in case of timeouts, as either the locking dependencies are too complex to be captured by the AA and ABBA heuristics, or the owner is perpetually stuck. As such, it would be unwise to continue to apply the timeout for each new head waiter without terminating the queue, since we may end up waiting for more than 250 ms in aggregate with all participants in the locking transaction. The patch itself is fairly simple; we can simply signal our successor to become the next head waiter, and leave the queue without attempting to acquire the lock. With this change, the behavior for waiters in case of deadlocks experienced by a predecessor changes. It is guaranteed that call sites will no longer receive errors if the predecessors encounter deadlocks and the successors do not participate in one. This should lower the failure rate for waiters that are not doing improper locking operations but were simply unlucky enough to queue behind a misbehaving waiter. However, timeouts are still a possibility, hence they must be accounted for, so users cannot rely upon errors not occurring at all. Suggested-by: Amery Hung <ameryhung@gmail.com> Suggested-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20251029181828.231529-2-memxor@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-10-29  PM: sleep: Allow pm_restrict_gfp_mask() stacking  [Rafael J. Wysocki, 2 files, -9/+17]
Allow pm_restrict_gfp_mask() to be called many times in a row to avoid issues with calling dpm_suspend_start() when the GFP mask has already been restricted. Only the first invocation of pm_restrict_gfp_mask() will actually restrict the GFP mask, and the subsequent calls will warn if there is a mismatch between the expected allowed GFP mask and the actual one. Moreover, if pm_restrict_gfp_mask() is called many times in a row, pm_restore_gfp_mask() needs to be called a matching number of times in a row to actually restore the GFP mask. Calling it when the GFP mask has not been restricted will cause it to warn. This is necessary for the GFP mask restriction starting in hibernation_snapshot() to continue throughout the entire hibernation flow until it completes or is aborted (either by a wakeup event or by an error). Fixes: 449c9c02537a1 ("PM: hibernate: Restrict GFP mask in hibernation_snapshot()") Fixes: 469d80a3712c ("PM: hibernate: Fix hybrid-sleep") Reported-by: Askar Safin <safinaskar@gmail.com> Closes: https://lore.kernel.org/linux-pm/20251025050812.421905-1-safinaskar@gmail.com/ Link: https://lore.kernel.org/linux-pm/20251028111730.2261404-1-safinaskar@gmail.com/ Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Reviewed-by: Mario Limonciello (AMD) <superm1@kernel.org> Tested-by: Mario Limonciello (AMD) <superm1@kernel.org> Cc: 6.16+ <stable@vger.kernel.org> # 6.16+ Link: https://patch.msgid.link/5935682.DvuYhMxLoT@rafael.j.wysocki
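Conceptually the stacking behaves like a depth counter: only the outermost restrict and the last matching restore touch the mask. A sketch of that semantic follows; it is not the mm/PM code, and the variables stand in for the real gfp_allowed_mask handling.

  #include <linux/gfp.h>
  #include <linux/bug.h>

  static unsigned int gfp_restrict_depth;
  static gfp_t allowed_mask = GFP_KERNEL;       /* stand-in for gfp_allowed_mask */
  static gfp_t saved_mask;

  static void example_restrict_gfp_mask(void)
  {
          if (gfp_restrict_depth++ == 0) {
                  saved_mask = allowed_mask;
                  allowed_mask &= ~(__GFP_IO | __GFP_FS);       /* actually restrict */
          } else {
                  /* Nested call: only check that the mask is what we expect. */
                  WARN_ON(allowed_mask != (saved_mask & ~(__GFP_IO | __GFP_FS)));
          }
  }

  static void example_restore_gfp_mask(void)
  {
          if (WARN_ON(gfp_restrict_depth == 0))
                  return;                         /* unbalanced restore */
          if (--gfp_restrict_depth == 0)
                  allowed_mask = saved_mask;      /* actually restore */
  }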
2025-10-29  sched_ext: Allow scx_bpf_reenqueue_local() to be called from anywhere  [Tejun Heo, 2 files, -0/+32]
The ops.cpu_acquire/release() callbacks miss events under multiple conditions. There are two distinct task dispatch gaps that can cause cpu_released flag desynchronization:

1. balance-to-pick_task gap: This is what was originally reported. balance_scx() can enqueue a task, but during consume_remote_task() when the rq lock is released, a higher priority task can be enqueued and ultimately picked while cpu_released remains false. This gap is closeable via RETRY_TASK handling.

2. ttwu-to-pick_task gap: ttwu() can directly dispatch a task to a CPU's local DSQ. By the time the sched path runs on the target CPU, higher class tasks may already be queued. In such cases, nothing on the sched_ext side will be invoked, and the only solution would be a hook invoked regardless of sched class, which isn't desirable.

Rather than adding invasive core hooks, BPF schedulers can use generic BPF mechanisms like tracepoints. From an SCX scheduler's perspective, this is congruent with other mechanisms it already uses and doesn't add further friction. The main use case for cpu_release() was calling scx_bpf_reenqueue_local() when a CPU gets preempted by a higher priority scheduling class. However, the old scx_bpf_reenqueue_local() could only be called from cpu_release() context. Add a new version of scx_bpf_reenqueue_local() that can be called from any context by deferring the actual re-enqueue operation. This eliminates the need for cpu_acquire/release() ops entirely. Schedulers can now use standard BPF mechanisms like the sched_switch tracepoint to detect and handle CPU preemption. Update scx_qmap to demonstrate the new approach using sched_switch instead of cpu_release, with compat support for older kernels. Mark cpu_acquire/release() as deprecated. The old scx_bpf_reenqueue_local() variant will be removed in v6.23. Reported-by: Wen-Fang Liu <liuwenfang@honor.com> Link: https://lore.kernel.org/all/8d64c74118c6440f81bcf5a4ac6b9f00@honor.com/ Cc: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Tejun Heo <tj@kernel.org>
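A BPF-side sketch of the new usage pattern described above. Only scx_bpf_reenqueue_local() is the kfunc from the patch; the tracepoint attachment and the policy check are illustrative (the scx_qmap example in the tree is the authoritative reference), and the snippet assumes vmlinux.h plus the usual SCX BPF headers.

  SEC("tp_btf/sched_switch")
  int BPF_PROG(handle_switch, bool preempt, struct task_struct *prev,
               struct task_struct *next)
  {
          /*
           * If the CPU is being handed to a task outside SCHED_EXT (i.e. a
           * higher-priority sched class preempted us), push the tasks queued
           * on this CPU's local DSQ back so other CPUs can pick them up.
           */
          if (next->policy != SCHED_EXT)
                  scx_bpf_reenqueue_local();
          return 0;
  }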
2025-10-29  sched_ext: Factor out reenq_local() from scx_bpf_reenqueue_local()  [Tejun Heo, 1 file, -21/+29]
Factor out the core re-enqueue logic from scx_bpf_reenqueue_local() into a new reenq_local() helper function. scx_bpf_reenqueue_local() now handles the BPF kfunc checks and calls reenq_local() to perform the actual work. This is a prep patch to allow reenq_local() to be called from other contexts. Reviewed-by: Emil Tsalapatis <emil@etsalapatis.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-10-29  sched_ext: Split schedule_deferred() into locked and unlocked variants  [Tejun Heo, 1 file, -9/+24]
schedule_deferred() currently requires the rq lock to be held so that it can use scheduler hooks for efficiency when available. However, there are cases where deferred actions need to be scheduled from contexts that don't hold the rq lock. Split into schedule_deferred() which can be called from any context and just queues irq_work, and schedule_deferred_locked() which requires the rq lock and can optimize by using scheduler hooks when available. Update the existing call site to use the _locked variant. Reviewed-by: Emil Tsalapatis <emil@etsalapatis.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-10-29  Merge branch 'for-6.18-fixes' into for-6.19  [Tejun Heo, 1 file, -1/+1]
Pull to receive f4fa7c25f632 ("sched_ext: Fix use of uninitialized variable in scx_bpf_cpuperf_set()") which conflicts with planned sub-sched changes.
2025-10-29  sched_ext: Fix use of uninitialized variable in scx_bpf_cpuperf_set()  [Andrea Righi, 1 file, -1/+1]
scx_bpf_cpuperf_set() has a typo where it dereferences the local variable @sch, instead of the global @scx_root pointer. Fix by dereferencing the correct variable. Fixes: 956f2b11a8a4f ("sched_ext: Drop kf_cpu_valid()") Signed-off-by: Andrea Righi <arighi@nvidia.com> Reviewed-by: Christian Loehle <christian.loehle@arm.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-10-29  printk_legacy_map: use LD_WAIT_CONFIG instead of LD_WAIT_SLEEP  [Oleg Nesterov, 1 file, -10/+7]
printk_legacy_map is used to hide lock nesting violations caused by legacy drivers, but it is using the wrong override type. LD_WAIT_SLEEP is for always-sleeping lock types such as mutex_t. LD_WAIT_CONFIG is for lock types that spin normally but become sleeping locks on PREEMPT_RT, such as spinlock_t. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Reviewed-by: Petr Mladek <pmladek@suse.com> Reviewed-by: John Ogness <john.ogness@linutronix.de> Reviewed-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Link: https://patch.msgid.link/20251026150726.GA23223@redhat.com [pmladek@suse.com: Fixed indentation.] Signed-off-by: Petr Mladek <pmladek@suse.com>
2025-10-29  PM: EM: Add to em_pd_list only when no failure  [Peng Fan, 1 file, -1/+4]
When em_create_perf_table() returns failure, pd is freed, so dev->em_pd is not valid. Accessing dev->em_pd->node will then trigger a kernel panic in em_dev_register_pd_no_update(). So return early if 'ret' is non-zero. Kernel dump:

  cpu cpu0: EM: invalid power: 0
  Unable to handle kernel NULL pointer dereference at virtual address 0000000000000008
  Mem abort info:
  pc : em_dev_register_pd_no_update+0xb4/0x79c
  lr : em_dev_register_pd_no_update+0x9c/0x79c
  Call trace:
   em_dev_register_pd_no_update+0xb4/0x79c (P)
   em_dev_register_perf_domain+0x18/0x58
   scmi_cpufreq_register_em+0x84/0xb8
   cpufreq_online+0x48c/0xb74
   cpufreq_add_dev+0x80/0x98
   subsys_interface_register+0x100/0x11c
   cpufreq_register_driver+0x158/0x278
   scmi_cpufreq_probe+0x1f8/0x2e0
   scmi_dev_probe+0x28/0x3c
   really_probe+0xbc/0x29c
   __driver_probe_device+0x78/0x12c
   driver_probe_device+0x3c/0x15c
   __device_attach_driver+0xb8/0x134
   bus_for_each_drv+0x84/0xe4

Fixes: cbe5aeedecc7 ("PM: EM: Assign a unique ID when creating a performance domain") Signed-off-by: Peng Fan <peng.fan@nxp.com> Reviewed-by: Changwoo Min <changwoo@igalia.com> Reviewed-by: Lukasz Luba <lukasz.luba@arm.com> Link: https://patch.msgid.link/20251028-fix-energy-v1-1-ab854fd6a97c@nxp.com Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2025-10-29  perf: Support deferred user unwind  [Peter Zijlstra, 3 files, -5/+91]
Add support for deferred userspace unwind to perf. Perf currently relies on in-place stack unwinding, from NMI context and all that. This moves the userspace part of the unwind to right before the return-to-userspace. This has two distinct benefits; the biggest is that it moves the unwind to a faultable context: it becomes possible to fault in debug info (.eh_frame, SFrame etc.) that might not otherwise be readily available. And secondly, it de-duplicates the user callchain where multiple samples happen during the same kernel entry. To facilitate this the perf interface is extended with a new record type, PERF_RECORD_CALLCHAIN_DEFERRED, and two new attribute flags:

 - perf_event_attr::defer_callchain - to request the user unwind be deferred
 - perf_event_attr::defer_output - to request PERF_RECORD_CALLCHAIN_DEFERRED records

The existing PERF_RECORD_SAMPLE callchain section gets a new context type, PERF_CONTEXT_USER_DEFERRED, after which comes a single entry denoting the 'cookie' of the deferred callchain that should be attached here, matching the 'cookie' field of the above mentioned PERF_RECORD_CALLCHAIN_DEFERRED. The 'defer_callchain' flag is expected on all events with PERF_SAMPLE_CALLCHAIN. The 'defer_output' flag is expected on the event responsible for collecting side-band events (like mmap, comm etc.). Setting 'defer_output' on multiple events will get you duplicated PERF_RECORD_CALLCHAIN_DEFERRED records. Based on earlier patches by Josh and Steven. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://patch.msgid.link/20251023150002.GR4067720@noisy.programming.kicks-ass.net
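Based purely on the interface described above, a consumer might request the deferred behaviour along these lines. This is a sketch: it assumes a <linux/perf_event.h> that already carries the two new bitfields, and the record handling is only outlined in comments.

  /* perf_event_attr set up for deferred user callchains. */
  struct perf_event_attr attr = {
          .size            = sizeof(attr),
          .type            = PERF_TYPE_HARDWARE,
          .config          = PERF_COUNT_HW_CPU_CYCLES,
          .sample_type     = PERF_SAMPLE_CALLCHAIN,
          .defer_callchain = 1,   /* defer the user part of the unwind */
  };

  /* On the one event that collects side-band data (mmap, comm, ...):
   *     attr.defer_output = 1;   -> emits PERF_RECORD_CALLCHAIN_DEFERRED
   *
   * In PERF_RECORD_SAMPLE, the user portion of the callchain is replaced by
   * PERF_CONTEXT_USER_DEFERRED followed by a cookie; the matching
   * PERF_RECORD_CALLCHAIN_DEFERRED record carries the same cookie, so the
   * tool stitches the two together at post-processing time. */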
2025-10-29  unwind_user/x86: Teach FP unwind about start of function  [Peter Zijlstra, 1 file, -9/+30]
When userspace is interrupted at the start of a function, before it gets a chance to complete the frame, the unwind will miss one caller. X86 has a uprobe-specific fixup for this; add bits to the generic unwinder to support it. Suggested-by: Jens Remus <jremus@linux.ibm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://patch.msgid.link/20251024145156.GM4068168@noisy.programming.kicks-ass.net
2025-10-29  unwind: Implement compat fp unwind  [Peter Zijlstra, 1 file, -11/+29]
It is important to be able to unwind compat tasks too. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://patch.msgid.link/20250924080119.613695709@infradead.org
2025-10-29  unwind: Simplify unwind_user_next_fp() alignment check  [Peter Zijlstra, 1 file, -3/+1]
2^log_2(n) == n Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org> Link: https://patch.msgid.link/20250924080119.497867836@infradead.org
2025-10-29  unwind: Make unwind_task_info::unwind_mask consistent  [Peter Zijlstra, 1 file, -8/+9]
The unwind_task_info::unwind_mask was manipulated using a mixture of:

 - regular store
 - WRITE_ONCE()
 - try_cmpxchg()
 - set_bit()
 - atomic_long_*()

Clean up and make it consistently atomic_long_t. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://patch.msgid.link/20250924080119.384384486@infradead.org
2025-10-29  unwind: Simplify unwind_user_faultable()  [Peter Zijlstra, 1 file, -9/+6]
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://patch.msgid.link/20250924080119.271671514@infradead.org
2025-10-29  unwind: Clarify calling context  [Peter Zijlstra, 1 file, -0/+2]
The get_cookie() function hard relies on IRQs being disabled, but this isn't immediately obvious when reading the function. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Steven Rostedt (Google) <rostedt@goodmis.org> Link: https://patch.msgid.link/20250924080119.122507632@infradead.org
2025-10-29  unwind: Fix unwind_deferred_request() vs NMI  [Peter Zijlstra, 1 file, -3/+7]
task_work_add(TWA_RESUME) isn't NMI-safe; use TWA_NMI_CURRENT when called from NMI context. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org> Link: https://patch.msgid.link/20250924080119.005422353@infradead.org
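The fix described above amounts to picking the notify mode by context. A minimal sketch (the wrapper name is illustrative; task_work_add(), in_nmi() and the TWA_* modes are the real kernel interfaces):

  #include <linux/task_work.h>
  #include <linux/preempt.h>

  static int example_queue_unwind_work(struct callback_head *work)
  {
          /* TWA_RESUME is not NMI-safe; TWA_NMI_CURRENT is meant for NMI. */
          return task_work_add(current, work,
                               in_nmi() ? TWA_NMI_CURRENT : TWA_RESUME);
  }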
2025-10-29  unwind: Add comment to unwind_deferred_task_exit()  [Peter Zijlstra, 1 file, -1/+6]
Explain why unwind_deferred_task_exit() exists and what its constraints are. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org> Link: https://patch.msgid.link/20250924080118.893367437@infradead.org
2025-10-29  task_work: Fix NMI race condition  [Peter Zijlstra, 1 file, -1/+7]
  __schedule() // disable irqs
    <NMI>
    task_work_add(current, work, TWA_NMI_CURRENT);
    </NMI>
    // current = next;
  // enable irqs
  <IRQ>
  task_work_set_notify_irq()
    test_and_set_tsk_thread_flag(current, TIF_NOTIFY_RESUME); // wrong task!
  </IRQ>
  // original task skips task work on its next return to user (or exit!)

Fixes: 466e4d801cd4 ("task_work: Add TWA_NMI_CURRENT as an additional notify mode.") Reported-by: Josh Poimboeuf <jpoimboe@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org> Link: https://patch.msgid.link/20250924080118.425949403@infradead.org
2025-10-29  dma-mapping: remove unused map_page callback  [Leon Romanovsky, 2 files, -19/+1]
After conversion of arch code to use physical address mapping, there are no users of .map_page() and .unmap_page() callbacks, so let's remove them. Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Signed-off-by: Leon Romanovsky <leonro@nvidia.com> Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com> Link: https://lore.kernel.org/r/20251015-remove-map-page-v5-14-3bbfe3a25cdf@kernel.org
2025-10-29  dma-mapping: remove unused mapping resource callbacks  [Leon Romanovsky, 1 file, -12/+4]
After ARM and XEN conversions to use physical addresses for the mapping, there are no in-kernel users for map_resource/unmap_resource callbacks, so remove them. Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Signed-off-by: Leon Romanovsky <leonro@nvidia.com> Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com> Link: https://lore.kernel.org/r/20251015-remove-map-page-v5-6-3bbfe3a25cdf@kernel.org
2025-10-29  dma-mapping: convert dummy ops to physical address mapping  [Leon Romanovsky, 1 file, -7/+6]
Change dma_dummy_map_page and dma_dummy_unmap_page routines to accept physical address and rename them. Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Signed-off-by: Leon Romanovsky <leonro@nvidia.com> Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com> Link: https://lore.kernel.org/r/20251015-remove-map-page-v5-2-3bbfe3a25cdf@kernel.org
2025-10-29  dma-mapping: prepare dma_map_ops to conversion to physical address  [Leon Romanovsky, 2 files, -2/+14]
Add new .map_phys() and .unmap_phys() callbacks to dma_map_ops as a preparation to replace .map_page() and .unmap_page() respectively. Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Signed-off-by: Leon Romanovsky <leonro@nvidia.com> Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com> Link: https://lore.kernel.org/r/20251015-remove-map-page-v5-1-3bbfe3a25cdf@kernel.org
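For orientation, the shape of the new callbacks can be inferred from the existing page-based ones; the prototypes below are an assumption for illustration only, and the authoritative definitions live in include/linux/dma-map-ops.h.

  #include <linux/dma-mapping.h>

  /* Assumed shape of the new dma_map_ops members (illustrative). */
  struct example_phys_ops {
          dma_addr_t (*map_phys)(struct device *dev, phys_addr_t phys,
                                 size_t size, enum dma_data_direction dir,
                                 unsigned long attrs);
          void (*unmap_phys)(struct device *dev, dma_addr_t dma_handle,
                             size_t size, enum dma_data_direction dir,
                             unsigned long attrs);
  };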
2025-10-29  tools/dma: move dma_map_benchmark from selftests to tools/dma  [Qinxin Xia, 1 file, -1/+1]
dma_map_benchmark is a standalone developer tool rather than an automated selftest. It has no pass/fail criteria, expects manual invocation, and is built as a normal userspace binary. Move it to tools/dma/ and add a minimal Makefile. Suggested-by: Marek Szyprowski <m.szyprowski@samsung.com> Suggested-by: Barry Song <baohua@kernel.org> Signed-off-by: Qinxin Xia <xiaqinxin@huawei.com> Acked-by: Barry Song <baohua@kernel.org> Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com> Link: https://lore.kernel.org/r/20251028120900.2265511-3-xiaqinxin@huawei.com
2025-10-29  Merge branch 'linus/master' into sched/core, to resolve conflict  [Peter Zijlstra, 17 files, -61/+196]
Conflicts: kernel/sched/ext.c Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Ingo Molnar <mingo@kernel.org>
2025-10-28  tracing: Have persistent ring buffer print syscalls normally  [Steven Rostedt, 1 file, -4/+23]
The persistent ring buffer from a previous boot has to be careful printing events, as the print formats of random events can have pointers to strings and such that are not available. Ftrace static events (like the function tracer event) are stable and are printed normally. System call event formats are also stable. Allow them to be printed normally as well. Instead of:

  <...>-1 [005] ...1. 57.240405: sys_enter_waitid: __syscall_nr=0xf7 (247) which=0x1 (1) upid=0x499 (1177) infop=0x7ffd5294d690 (140725988939408) options=0x5 (5) ru=0x0 (0)
  <...>-1 [005] ...1. 57.240433: sys_exit_waitid: __syscall_nr=0xf7 (247) ret=0x0 (0)
  <...>-1 [005] ...1. 57.240437: sys_enter_rt_sigprocmask: __syscall_nr=0xe (14) how=0x2 (2) nset=0x7ffd5294d7c0 (140725988939712) oset=0x0 (0) sigsetsize=0x8 (8)
  <...>-1 [005] ...1. 57.240438: sys_exit_rt_sigprocmask: __syscall_nr=0xe (14) ret=0x0 (0)
  <...>-1 [005] ...1. 57.240442: sys_enter_close: __syscall_nr=0x3 (3) fd=0x4 (4)
  <...>-1 [005] ...1. 57.240463: sys_exit_close: __syscall_nr=0x3 (3) ret=0x0 (0)
  <...>-1 [005] ...1. 57.240485: sys_enter_openat: __syscall_nr=0x101 (257) dfd=0xffffffffffdfff9c (-2097252) filename=(0xffff8b81639ca01c) flags=0x80000 (524288) mode=0x0 (0) __filename_val=/run/systemd/reboot-param
  <...>-1 [005] ...1. 57.240555: sys_exit_openat: __syscall_nr=0x101 (257) ret=0xffffffffffdffffe (-2097154)
  <...>-1 [005] ...1. 57.240571: sys_enter_openat: __syscall_nr=0x101 (257) dfd=0xffffffffffdfff9c (-2097252) filename=(0xffff8b81639ca01c) flags=0x80000 (524288) mode=0x0 (0) __filename_val=/run/systemd/reboot-param
  <...>-1 [005] ...1. 57.240620: sys_exit_openat: __syscall_nr=0x101 (257) ret=0xffffffffffdffffe (-2097154)
  <...>-1 [005] ...1. 57.240629: sys_enter_writev: __syscall_nr=0x14 (20) fd=0x3 (3) vec=0x7ffd5294ce50 (140725988937296) vlen=0x7 (7)
  <...>-1 [005] ...1. 57.242281: sys_exit_writev: __syscall_nr=0x14 (20) ret=0x24 (36)
  <...>-1 [005] ...1. 57.242286: sys_enter_reboot: __syscall_nr=0xa9 (169) magic1=0xfee1dead (4276215469) magic2=0x28121969 (672274793) cmd=0x1234567 (19088743) arg=0x0 (0)

Have:

  <...>-1 [000] ...1. 91.446011: sys_waitid(which: 1, upid: 0x4d2, infop: 0x7ffdccdadfd0, options: 5, ru: 0)
  <...>-1 [000] ...1. 91.446042: sys_waitid -> 0x0
  <...>-1 [000] ...1. 91.446045: sys_rt_sigprocmask(how: 2, nset: 0x7ffdccdae100, oset: 0, sigsetsize: 8)
  <...>-1 [000] ...1. 91.446047: sys_rt_sigprocmask -> 0x0
  <...>-1 [000] ...1. 91.446051: sys_close(fd: 4)
  <...>-1 [000] ...1. 91.446073: sys_close -> 0x0
  <...>-1 [000] ...1. 91.446095: sys_openat(dfd: 18446744073709551516, filename: 139732544945794 "/run/systemd/reboot-param", flags: O_RDONLY|O_CLOEXEC)
  <...>-1 [000] ...1. 91.446165: sys_openat -> 0xfffffffffffffffe
  <...>-1 [000] ...1. 91.446182: sys_openat(dfd: 18446744073709551516, filename: 139732544945794 "/run/systemd/reboot-param", flags: O_RDONLY|O_CLOEXEC)
  <...>-1 [000] ...1. 91.446233: sys_openat -> 0xfffffffffffffffe
  <...>-1 [000] ...1. 91.446242: sys_writev(fd: 3, vec: 0x7ffdccdad790, vlen: 7)
  <...>-1 [000] ...1. 91.447877: sys_writev -> 0x24
  <...>-1 [000] ...1. 91.447883: sys_reboot(magic1: 0xfee1dead, magic2: 0x28121969, cmd: 0x1234567, arg: 0)

Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Takaya Saeki <takayas@google.com> Cc: Tom Zanussi <zanussi@kernel.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ian Rogers <irogers@google.com> Cc: Douglas Raillard <douglas.raillard@arm.com> Cc: Arnaldo Carvalho de Melo <acme@kernel.org> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: Ingo Molnar <mingo@redhat.com> Link: https://lore.kernel.org/20251028231149.097404581@kernel.org Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-10-28  tracing: Check for printable characters when printing field dyn strings  [Steven Rostedt, 1 file, -2/+25]
When the "fields" option is enabled, it prints each trace event field based on its type. But a dynamic array and a dynamic string can both have a "char *" type. Printing it as a string can cause escape characters to be printed and mess up the output of the trace. For dynamic strings, test if there are any non-printable characters, and if so, print both the string with the non printable characters as '.', and the print the hex value of the array. Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Takaya Saeki <takayas@google.com> Cc: Tom Zanussi <zanussi@kernel.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ian Rogers <irogers@google.com> Cc: Douglas Raillard <douglas.raillard@arm.com> Cc: Arnaldo Carvalho de Melo <acme@kernel.org> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: Ingo Molnar <mingo@redhat.com> Link: https://lore.kernel.org/20251028231148.929243047@kernel.org Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-10-28  tracing: Add parsing of flags to the sys_enter_openat trace event  [Steven Rostedt, 1 file, -10/+182]
Add some logic to give the openat system call trace event a bit more human readable information:

  syscalls:sys_enter_openat: dfd: 0xffffff9c, filename: 0x7f0053dc121c "/etc/ld.so.cache", flags: O_RDONLY|O_CLOEXEC, mode: 0000

The above is output from "perf script" and now shows the flags used by the openat system call. Since the output from tracing is generated in the kernel, it can also omit the mode field when it is not used (when flags does not contain O_CREAT|O_TMPFILE):

  touch-1185 [002] ...1. 1291.690154: sys_openat(dfd: 4294967196, filename: 139785545139344 "/usr/lib/locale/locale-archive", flags: O_RDONLY|O_CLOEXEC)
  touch-1185 [002] ...1. 1291.690504: sys_openat(dfd: 18446744073709551516, filename: 140733603151330 "/tmp/x", flags: O_WRONLY|O_CREAT|O_NOCTTY|O_NONBLOCK, mode: 0666)

As system calls have a fixed ABI, their trace events can be extended. This currently only updates the openat system call, but others may be extended in the future. Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Takaya Saeki <takayas@google.com> Cc: Tom Zanussi <zanussi@kernel.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ian Rogers <irogers@google.com> Cc: Douglas Raillard <douglas.raillard@arm.com> Cc: Arnaldo Carvalho de Melo <acme@kernel.org> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: Ingo Molnar <mingo@redhat.com> Link: https://lore.kernel.org/20251028231148.763161484@kernel.org Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
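Turning flag bits into names in trace output is conventionally done with the __print_flags() helper; a sketch of what an open-flags decoder can look like (illustrative only, the patch's actual implementation is not implied):

  /* Illustrative open(2) flag decoding for trace output. */
  #define example_show_open_flags(flags)                          \
          __print_flags(flags, "|",                               \
                  { O_WRONLY,     "O_WRONLY"      },              \
                  { O_RDWR,       "O_RDWR"        },              \
                  { O_CREAT,      "O_CREAT"       },              \
                  { O_EXCL,       "O_EXCL"        },              \
                  { O_NOCTTY,     "O_NOCTTY"      },              \
                  { O_TRUNC,      "O_TRUNC"       },              \
                  { O_APPEND,     "O_APPEND"      },              \
                  { O_NONBLOCK,   "O_NONBLOCK"    },              \
                  { O_DIRECTORY,  "O_DIRECTORY"   },              \
                  { O_CLOEXEC,    "O_CLOEXEC"     })

  /* Note: O_RDONLY is 0, so it has to be special-cased when no bits are set. */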
2025-10-28  tracing: Show printable characters in syscall arrays  [Steven Rostedt, 1 file, -0/+21]
When displaying the contents of the user space data passed to the kernel, instead of just showing the array values, also print any printable content. Instead of just:

  bash-1113 [003] ..... 3433.290654: sys_write(fd: 2, buf: 0x555a8deeddb0 (72:6f:6f:74:40:64:65:62:69:61:6e:2d:78:38:36:2d:36:34:3a:7e:23:20), count: 0x16)

display:

  bash-1113 [003] ..... 3433.290654: sys_write(fd: 2, buf: 0x555a8deeddb0 (72:6f:6f:74:40:64:65:62:69:61:6e:2d:78:38:36:2d:36:34:3a:7e:23:20) "root@debian-x86-64:~# ", count: 0x16)

This only affects tracing and does not affect perf, as this only updates the output generated by the kernel. The output from perf is produced in user space. This may change with an update to libtraceevent, which would then give perf this output as well. Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Takaya Saeki <takayas@google.com> Cc: Tom Zanussi <zanussi@kernel.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ian Rogers <irogers@google.com> Cc: Douglas Raillard <douglas.raillard@arm.com> Cc: Arnaldo Carvalho de Melo <acme@kernel.org> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: Ingo Molnar <mingo@redhat.com> Link: https://lore.kernel.org/20251028231148.429422865@kernel.org Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
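Rendering the buffer both ways comes down to two passes over the bytes; a sketch with illustrative names (trace_seq_*() and isprint() are the real kernel helpers):

  #include <linux/ctype.h>
  #include <linux/trace_seq.h>

  /* Print "aa:bb:cc..." followed by a quoted printable rendering. */
  static void example_print_user_buf(struct trace_seq *s,
                                     const unsigned char *buf, int len)
  {
          int i;

          for (i = 0; i < len; i++)
                  trace_seq_printf(s, "%s%02x", i ? ":" : "", buf[i]);

          trace_seq_puts(s, " \"");
          for (i = 0; i < len; i++)
                  trace_seq_putc(s, isprint(buf[i]) ? buf[i] : '.');
          trace_seq_putc(s, '"');
  }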
2025-10-28  tracing: Add a config and syscall_user_buf_size file to limit amount written  [Steven Rostedt, 4 files, -22/+97]
When a system call trace event copies data from user space addresses into the ring buffer, it can copy up to 511 bytes. This can waste precious ring buffer space if the user isn't interested in the output. Add a new file "syscall_user_buf_size" that gets initialized to a new config CONFIG_SYSCALL_BUF_SIZE_DEFAULT, which defaults to 63. The config is also used to limit how much perf can read from user space. Also lower the maximum down to 165, as this isn't meant to record everything that a system call may be passing through to the kernel; 165 is more than enough. The reason for 165 is that adding one byte for the nul terminator, as well as possibly needing to append the "..." string, turns it into 170 bytes. As this needs to save up to 3 arguments, and 3 * 170 is 510, it fits nicely in 512 bytes (a power of 2). Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Takaya Saeki <takayas@google.com> Cc: Tom Zanussi <zanussi@kernel.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ian Rogers <irogers@google.com> Cc: Douglas Raillard <douglas.raillard@arm.com> Cc: Arnaldo Carvalho de Melo <acme@kernel.org> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: Ingo Molnar <mingo@redhat.com> Link: https://lore.kernel.org/20251028231148.260068913@kernel.org Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-10-28  tracing: Allow syscall trace events to re