Age | Commit message | Author | Files | Lines
2026-03-27 | kernel/crash: remove inclusion of crypto/sha1.h | Eric Biggers | 3 | -6/+0
Several files related to kernel crash dumps include crypto/sha1.h but never use any of its functionality. Remove these includes so that these files don't unnecessarily come up in searches for which kernel code is still using the obsolete SHA-1 algorithm. Link: https://lkml.kernel.org/r/20260314204243.45001-1-ebiggers@kernel.org Signed-off-by: Eric Biggers <ebiggers@kernel.org> Cc: Baoquan He <bhe@redhat.com> Cc: Dave Young <dyoung@redhat.com> Cc: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-03-27 | hung_task: explicitly report I/O wait state in log output | Aaron Tomlin | 1 | -2/+3
Currently, the hung task reporting mechanism indiscriminately labels all TASK_UNINTERRUPTIBLE (D) tasks as "blocked", irrespective of whether they are awaiting I/O completion or kernel locking primitives. This ambiguity compels system administrators to manually inspect stack traces to discern whether the delay stems from an I/O wait (typically indicative of hardware or filesystem anomalies) or software contention. Such detailed analysis is not always immediately accessible to system administrators or support engineers. To address this, this patch utilises the existing in_iowait field within struct task_struct to augment the failure report. If the task is blocked due to I/O (e.g., via io_schedule_prepare()), the log message is updated to explicitly state "blocked in I/O wait". Examples: - Standard Block: "INFO: task bash:123 blocked for more than 120 seconds". - I/O Block: "INFO: task dd:456 blocked in I/O wait for more than 120 seconds". Theoretically, concurrent executions of io_schedule_finish() could result in a race condition where the read flag does not precisely correlate with the subsequently printed backtrace. However, this limitation is deemed acceptable in practice. The entire reporting mechanism is inherently racy by design; nevertheless, it remains highly reliable in the vast majority of cases, particularly because it primarily captures protracted stalls. Consequently, introducing additional synchronisation to mitigate this minor inaccuracy would be entirely disproportionate to the situation. Link: https://lkml.kernel.org/r/20260303221324.4106917-1-atomlin@atomlin.com Signed-off-by: Aaron Tomlin <atomlin@atomlin.com> Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Reviewed-by: Petr Mladek <pmladek@suse.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Lance Yang <lance.yang@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
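The two example messages above can be reproduced with a small userspace sketch. This is not the kernel code: `struct task_info` and `format_hung_report()` are hypothetical names, and the only thing borrowed from the commit message is the idea of choosing the wording from an `in_iowait` flag.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-in for the few task fields the report needs. */
struct task_info {
    const char *comm;  /* task name */
    int pid;
    bool in_iowait;    /* mirrors the in_iowait field named in the commit */
};

/* Format one report line; returns the value reported by snprintf(). */
int format_hung_report(char *buf, size_t size,
                       const struct task_info *t, unsigned long secs)
{
    assert(buf && t && size > 0);
    return snprintf(buf, size,
                    "INFO: task %s:%d blocked%s for more than %lu seconds",
                    t->comm, t->pid,
                    t->in_iowait ? " in I/O wait" : "", secs);
}
```

The flag is read once and only changes the wording, which matches the commit's point that no extra synchronisation is added for it.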
2026-03-27 | hung_task: increment the global counter immediately | Petr Mladek | 1 | -15/+8
A recent change made it possible to reset the global counter of hung tasks through the sysctl interface. A potential race with the regular check was solved by updating the global counter only once, at the end of the check.

However, the hung task check can take a significant amount of time, particularly when task information is being dumped to slow serial consoles. Some users monitor this global counter to trigger immediate migration of critical containers. Delaying the increment until the full check completes postpones these high-priority rescue operations.

Update the global counter as soon as a hung task is detected. Since the value is read asynchronously, a relaxed atomic operation is sufficient.

Link: https://lkml.kernel.org/r/20260303203031.4097316-4-atomlin@atomlin.com
Signed-off-by: Petr Mladek <pmladek@suse.com>
Signed-off-by: Aaron Tomlin <atomlin@atomlin.com>
Reported-by: Lance Yang <lance.yang@linux.dev>
Closes: https://lore.kernel.org/r/f239e00f-4282-408d-b172-0f9885f4b01b@linux.dev
Reviewed-by: Aaron Tomlin <atomlin@atomlin.com>
Reviewed-by: Lance Yang <lance.yang@linux.dev>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Joel Granados <joel.granados@kernel.org>
Cc: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
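A rough userspace analogue of "increment immediately with a relaxed atomic", written with C11 atomics rather than the kernel's atomic_long_t API. The function names are hypothetical and only illustrate the ordering choice.

```c
#include <assert.h>
#include <stdatomic.h>

/* Bump the counter the moment one hung task is found, before any slow
 * reporting. Relaxed ordering: readers only need an eventually visible
 * value, not ordering against other memory accesses. */
void note_hung_task(atomic_long *counter)
{
    assert(counter);
    atomic_fetch_add_explicit(counter, 1, memory_order_relaxed);
}

/* Asynchronous reader, e.g. a monitoring path. */
long read_hung_count(atomic_long *counter)
{
    assert(counter);
    return atomic_load_explicit(counter, memory_order_relaxed);
}
```

Because each detection is a single atomic add, a concurrent reset to zero cannot be overwritten by a stale total. It can only be followed by later increments.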
2026-03-27 | hung_task: enable runtime reset of hung_task_detect_count | Aaron Tomlin | 1 | -7/+51
Currently, the hung_task_detect_count sysctl provides a cumulative count of hung tasks since boot. In long-running, high-availability environments, this counter may lose its utility if it cannot be reset once an incident has been resolved. Furthermore, the previous implementation relied upon implicit ordering, which could not strictly guarantee that diagnostic metadata published by one CPU was visible to the panic logic on another.

This patch introduces the capability to reset the detection count by writing "0" to the hung_task_detect_count sysctl. The proc_handler logic has been updated to validate this input and atomically reset the counter.

The synchronisation of sysctl_hung_task_detect_count relies upon a transactional model to ensure the integrity of the detection counter against concurrent resets from userspace. The application of atomic_long_read_acquire() and atomic_long_cmpxchg_release() is correct and provides the following guarantees:

1. Prevention of Load-Store Reordering via Acquire Semantics

   By utilising atomic_long_read_acquire() to snapshot the counter before initiating the task traversal, we establish a strict memory barrier. This prevents the compiler or hardware from reordering the initial load to a point later in the scan. Without this "acquire" barrier, a delayed load could potentially read a "0" value resulting from a userspace reset that occurred mid-scan. This would lead to the subsequent cmpxchg succeeding erroneously, thereby overwriting the user's reset with stale increment data.

2. Atomicity of the "Commit" Phase via Release Semantics

   The atomic_long_cmpxchg_release() serves as the transaction's commit point. The "release" barrier ensures that all diagnostic recordings and task-state observations made during the scan are globally visible before the counter is incremented.

3. Race Condition Resolution

   This pairing effectively detects any "out-of-band" reset of the counter. If sysctl_hung_task_detect_count is modified via the procfs interface during the scan, the final cmpxchg will detect the discrepancy between the current value and the "acquire" snapshot. Consequently, the update will fail, ensuring that a reset command from the administrator is prioritised over a scan that may have been invalidated by that very reset.

Link: https://lkml.kernel.org/r/20260303203031.4097316-3-atomlin@atomlin.com
Signed-off-by: Aaron Tomlin <atomlin@atomlin.com>
Reviewed-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Reviewed-by: Joel Granados <joel.granados@kernel.org>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Lance Yang <lance.yang@linux.dev>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
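The snapshot-then-commit model described in this message can be sketched in userspace with C11 atomics. This is an illustration under assumptions, not the kernel implementation: `scan_begin()`, `scan_commit()` and `reset_counter()` are hypothetical names, and C11 `memory_order_acquire`/`memory_order_release` stand in for atomic_long_read_acquire() and atomic_long_cmpxchg_release().

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Snapshot the counter before the scan starts (acquire load). */
long scan_begin(atomic_long *counter)
{
    assert(counter);
    return atomic_load_explicit(counter, memory_order_acquire);
}

/* Commit the number of hung tasks found in this scan. The exchange only
 * succeeds if the counter still holds the snapshot value, so a value
 * changed by a reset that happened mid-scan is not overwritten. */
bool scan_commit(atomic_long *counter, long snapshot, long found)
{
    long expected = snapshot;

    assert(counter);
    return atomic_compare_exchange_strong_explicit(
        counter, &expected, snapshot + found,
        memory_order_release, memory_order_relaxed);
}

/* Stand-in for the administrator writing "0" to the sysctl. */
void reset_counter(atomic_long *counter)
{
    assert(counter);
    atomic_store_explicit(counter, 0, memory_order_release);
}
```

In this sketch a failed commit simply drops the scan's result, which corresponds to the message's statement that the administrator's reset takes priority over the scan.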
2026-03-27 | hung_task: refactor detection logic and atomicise detection count | Aaron Tomlin | 1 | -25/+33
Patch series "hung_task: Provide runtime reset interface for hung task detector", v9. This series introduces the ability to reset /proc/sys/kernel/hung_task_detect_count. Writing a "0" value to this file atomically resets the counter of detected hung tasks. This functionality provides system administrators with the means to clear the cumulative diagnostic history following incident resolution, thereby simplifying subsequent monitoring without necessitating a system restart. This patch (of 3): The check_hung_task() function currently conflates two distinct responsibilities: validating whether a task is hung and handling the subsequent reporting (printing warnings, triggering panics, or tracepoints). This patch refactors the logic by introducing hung_task_info(), a function dedicated solely to reporting. The actual detection check, task_is_hung(), is hoisted into the primary loop within check_hung_uninterruptible_tasks(). This separation clearly decouples the mechanism of detection from the policy of reporting. Furthermore, to facilitate future support for concurrent hung task detection, the global sysctl_hung_task_detect_count variable is converted from unsigned long to atomic_long_t. Consequently, the counting logic is updated to accumulate the number of hung tasks locally (this_round_count) during the iteration. The global counter is then updated atomically via atomic_long_cmpxchg_relaxed() once the loop concludes, rather than incrementally during the scan. These changes are strictly preparatory and introduce no functional change to the system's runtime behaviour. 
Link: https://lkml.kernel.org/r/20260303203031.4097316-1-atomlin@atomlin.com Link: https://lkml.kernel.org/r/20260303203031.4097316-2-atomlin@atomlin.com Signed-off-by: Aaron Tomlin <atomlin@atomlin.com> Reviewed-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Reviewed-by: Petr Mladek <pmladek@suse.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Lance Yang <lance.yang@linux.dev> Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Joel Granados <joel.granados@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-03-27 | crash_dump: use sysfs_emit in sysfs show functions | Thorsten Blum | 1 | -4/+5
Replace sprintf() with sysfs_emit() in sysfs show functions. sysfs_emit() is preferred for formatting sysfs output because it provides safer bounds checking. No functional changes. Link: https://lkml.kernel.org/r/20260301125106.911980-2-thorsten.blum@linux.dev Signed-off-by: Thorsten Blum <thorsten.blum@linux.dev> Acked-by: Baoquan He <bhe@redhat.com> Cc: Dave Young <dyoung@redhat.com> Cc: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
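To illustrate why a bounded formatter is safer than sprintf(), here is a userspace sketch. `bounded_emit()` is a hypothetical helper, not sysfs_emit() itself, and `EMIT_BUF_SIZE` is an assumed stand-in for the page-sized buffer that sysfs show functions receive.

```c
#include <assert.h>
#include <stdarg.h>
#include <stdio.h>

#define EMIT_BUF_SIZE 4096 /* assumption: stands in for PAGE_SIZE */

/* Format into buf without ever writing past EMIT_BUF_SIZE bytes.
 * Returns the number of characters stored, excluding the NUL. */
int bounded_emit(char *buf, const char *fmt, ...)
{
    va_list args;
    int len;

    assert(buf && fmt);
    va_start(args, fmt);
    len = vsnprintf(buf, EMIT_BUF_SIZE, fmt, args);
    va_end(args);

    if (len < 0)
        return 0;
    /* vsnprintf() reports the untruncated length; clamp it to what
     * actually fits in the buffer. */
    if (len >= EMIT_BUF_SIZE)
        len = EMIT_BUF_SIZE - 1;
    return len;
}
```

With sprintf() an oversized value would run past the end of the buffer. Here the output is truncated and the returned length never exceeds the buffer.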
2026-03-27 | pid: document the PIDNS_ADDING checks in alloc_pid() and copy_process() | Oleg Nesterov | 2 | -1/+10
Both copy_process() and alloc_pid() do the same PIDNS_ADDING check. The reasons for these checks, and the fact that both are necessary, are not immediately obvious. Add the comments. Link: https://lkml.kernel.org/r/aaGIRElc78U4Er42@redhat.com Signed-off-by: Oleg Nesterov <oleg@redhat.com> Cc: Adrian Reber <areber@redhat.com> Cc: Aleksa Sarai <cyphar@cyphar.com> Cc: Alexander Mikhalitsyn <alexander@mihalicyn.com> Cc: Andrei Vagin <avagin@gmail.com> Cc: Christian Brauner <brauner@kernel.org> Cc: David Hildenbrand <david@kernel.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jan Kara <jack@suse.cz> Cc: Juri Lelli <juri.lelli@redhat.com> Cc: Kees Cook <kees@kernel.org> Cc: Kirill Tkhai <tkhai@ya.ru> Cc: Pavel Tikhomirov <ptikhomirov@virtuozzo.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Shuah Khan <shuah@kernel.org> Cc: Vincent Guittot <vincent.guittot@linaro.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-03-27 | pid: make sub-init creation retryable | Oleg Nesterov | 1 | -7/+6
Patch series "pid: make sub-init creation retryable".

This patch (of 2):

Currently we allow only one attempt to create init in a new namespace. If the first fork() fails after alloc_pid() succeeds, free_pid() clears PIDNS_ADDING and thus disables further PID allocations.

Nowadays this looks like an unnecessary limitation. The original reason to handle "case PIDNS_ADDING" in free_pid() is gone, most probably after commit 69879c01a0c3 ("proc: Remove the now unnecessary internal mount of proc").

Change free_pid() to keep ns->pid_allocated == PIDNS_ADDING, and change alloc_pid() to reset the cursor early, right after taking pidmap_lock.

Test-case:

    #define _GNU_SOURCE
    #include <linux/sched.h>
    #include <sys/syscall.h>
    #include <sys/wait.h>
    #include <assert.h>
    #include <sched.h>
    #include <errno.h>
    #include <unistd.h>     /* syscall(), getpid() */

    int main(void)
    {
        struct clone_args args = {
            .exit_signal = SIGCHLD,
            .flags = CLONE_PIDFD,
            .pidfd = 0,     /* NULL pidfd pointer: the first clone3() must fail */
        };
        unsigned long pidfd;
        int pid;

        assert(unshare(CLONE_NEWPID) == 0);

        /* First attempt to create init fails with EFAULT. */
        pid = syscall(__NR_clone3, &args, sizeof(args));
        assert(pid == -1 && errno == EFAULT);

        /* Second attempt, with a valid pidfd pointer, must now succeed. */
        args.pidfd = (unsigned long)&pidfd;
        pid = syscall(__NR_clone3, &args, sizeof(args));
        if (pid)
            assert(pid > 0 && wait(NULL) == pid);
        else
            assert(getpid() == 1);

        return 0;
    }

Link: https://lkml.kernel.org/r/aaGHu3ixbw9Y7kFj@redhat.com
Link: https://lkml.kernel.org/r/aaGIHa7vGdwhEc_D@redhat.com
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Andrei Vagin <avagin@gmail.com>
Cc: Adrian Reber <areber@redhat.com>
Cc: Aleksa Sarai <cyphar@cyphar.com>
Cc: Alexander Mikhalitsyn <alexander@mihalicyn.com>
Cc: Christian Brauner <brauner@kernel.org>
Cc: David Hildenbrand <david@kernel.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Kees Cook <kees@kernel.org>
Cc: Kirill Tkhai <tkhai@ya.ru>
Cc: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-03-27 | crash_dump: fix typo in function name read_key_from_user_keying | Thorsten Blum | 1 | -2/+2
The function read_key_from_user_keying() is missing an 'r' in its name. Fix the typo by renaming it to read_key_from_user_keyring(). Link: https://lkml.kernel.org/r/20260227230422.859423-1-thorsten.blum@linux.dev Signed-off-by: Thorsten Blum <thorsten.blum@linux.dev> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Acked-by: Baoquan He <bhe@redhat.com> Cc: Dave Young <dyoung@redhat.com> Cc: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-03-27 | crash_dump: remove redundant less-than-zero check | Thorsten Blum | 1 | -1/+1
'key_count' is an 'unsigned int' and cannot be less than zero. Remove the redundant condition. Link: https://lkml.kernel.org/r/20260228085136.861971-2-thorsten.blum@linux.dev Signed-off-by: Thorsten Blum <thorsten.blum@linux.dev> Cc: Baoquan He <bhe@redhat.com> Cc: Dave Young <dyoung@redhat.com> Cc: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-03-27 | fork: replace simple_strtoul with kstrtoul in coredump_filter_setup | Thorsten Blum | 1 | -5/+6
Replace simple_strtoul() with the recommended kstrtoul() for parsing the 'coredump_filter=' boot parameter. Check the return value of kstrtoul() and reject invalid values. This adds error handling while preserving behavior for existing values, and removes use of the deprecated simple_strtoul() helper. The current code silently sets 'default_dump_filter = 0' if parsing fails, instead of leaving the default value (MMF_DUMP_FILTER_DEFAULT) unchanged. Rename the static variable 'default_dump_filter' to 'coredump_filter' since it does not necessarily contain the default value and the current name can be misleading. Link: https://lkml.kernel.org/r/20251215142152.4082-2-thorsten.blum@linux.dev Signed-off-by: Thorsten Blum <thorsten.blum@linux.dev> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Ben Segall <bsegall@google.com> Cc: David Hildenbrand <david@kernel.org> Cc: Dietmar Eggemann <dietmar.eggemann@arm.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Juri Lelli <juri.lelli@redhat.com> Cc: Kees Cook <kees@kernel.org> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Lorenzo Stoakes (Oracle) <ljs@kernel.org> Cc: Mel Gorman <mgorman@suse.de> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Valentin Schneider <vschneid@redhat.com> Cc: Vincent Guittot <vincent.guittot@linaro.org> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
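The "reject invalid values and leave the previous value unchanged" behaviour can be sketched in userspace with strtoul(). This is an analogue of the idea only: `parse_filter()` is a hypothetical name, and its acceptance rules are stricter than plain strtoul() but are not claimed to match kstrtoul() exactly.

```c
#include <assert.h>
#include <ctype.h>
#include <errno.h>
#include <stdlib.h>

/* Parse an unsigned long from s. On success store it in *out and return 0.
 * On failure return a negative errno value and leave *out untouched. */
int parse_filter(const char *s, unsigned long *out)
{
    char *end;
    unsigned long val;

    assert(s && out);

    /* strtoul() would silently accept leading whitespace and '-'. */
    if (!isdigit((unsigned char)*s) && *s != '+')
        return -EINVAL;

    errno = 0;
    val = strtoul(s, &end, 0);
    if (errno == ERANGE)
        return -ERANGE;
    if (end == s || *end != '\0')
        return -EINVAL; /* no digits, or trailing garbage */

    *out = val;
    return 0;
}
```

A caller that initialises its variable to a default and ignores failed parses keeps that default, which is the behaviour the commit describes for invalid 'coredump_filter=' values.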
2026-03-27 | complete_signal: kill always-true "core_state || !SIGNAL_GROUP_EXIT" check | Oleg Nesterov | 1 | -3/+1
The "(signal->core_state || !(signal->flags & SIGNAL_GROUP_EXIT))" check in complete_signal() is not obvious at all, and in fact it only adds unnecessary confusion: this condition is always true.

prepare_signal() does:

    if (signal->flags & SIGNAL_GROUP_EXIT) {
        if (signal->core_state)
            return sig == SIGKILL;
        /*
         * The process is in the middle of dying, drop the signal.
         */
        return false;
    }

This means that "!signal->core_state && (signal->flags & SIGNAL_GROUP_EXIT)" in complete_signal() is never possible. If SIGNAL_GROUP_EXIT is set, prepare_signal() can only return true if signal->core_state is not NULL.

Link: https://lkml.kernel.org/r/aZsfkDhnqJ4s1oTs@redhat.com
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Kees Cook <kees@kernel.org>
Cc: Mateusz Guzik <mjguzik@gmail.com>
Cc: Deepanshu Kartikey <kartikey406@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
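The argument can be checked by brute force over the three booleans involved. The sketch below models only the quoted fragment of prepare_signal(); everything else that function does is collapsed into "return true", which is an assumption made for illustration. It also assumes the state does not change between prepare_signal() and complete_signal().

```c
#include <assert.h>
#include <stdbool.h>

/* Model of the quoted prepare_signal() fragment. */
static bool prepare_signal_model(bool group_exit, bool core_state,
                                 bool is_sigkill)
{
    if (group_exit) {
        if (core_state)
            return is_sigkill;
        return false; /* process is dying, drop the signal */
    }
    return true; /* simplification: all other drop reasons ignored */
}

/* The condition that the commit removes from complete_signal(). */
static bool removed_check(bool group_exit, bool core_state)
{
    return core_state || !group_exit;
}

/* True if removed_check() holds in every state that the model of
 * prepare_signal() lets through. */
bool removed_check_is_always_true(void)
{
    for (int ge = 0; ge <= 1; ge++)
        for (int cs = 0; cs <= 1; cs++)
            for (int kill = 0; kill <= 1; kill++) {
                if (!prepare_signal_model(ge, cs, kill))
                    continue;
                if (!removed_check(ge, cs))
                    return false;
            }
    return true;
}
```

The only state that would make the check false is "group exit set, no core_state", and the model never lets a signal through in that state.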
2026-03-27 | exit: kill unnecessary thread_group_leader() checks in exit_notify() and do_notify_parent() | Oleg Nesterov | 2 | -7/+4
thread_group_empty(tsk) is only possible if tsk is a group leader, and thread_group_empty() already does the thread_group_leader() check. So it makes no sense to check "thread_group_leader() && thread_group_empty()"; thread_group_empty() alone is enough.

Link: https://lkml.kernel.org/r/aZsfeegKZPZZszJh@redhat.com
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Mateusz Guzik <mjguzik@gmail.com>
Cc: Kees Cook <kees@kernel.org>
Cc: Deepanshu Kartikey <kartikey406@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-03-27 | kernel/panic: mark init_taint_buf as __initdata and panic instead of warning in alloc_taint_buf() | Rio | 1 | -7/+3
However, there is a convention of assuming that __init-time allocations cannot fail: if a kmalloc() were to fail at this point, the kernel would be hopelessly messed up anyway.

So simply panic() if that kmalloc() fails, and make the 350-byte buffer __initdata.

Link: https://lkml.kernel.org/r/20260223035914.4033-1-rioo.tsukatsukii@gmail.com
Signed-off-by: Rio <rioo.tsukatsukii@gmail.com>
Cc: Joel Granados <joel.granados@kernel.org>
Cc: Petr Mladek <pmladek@suse.com>
Cc: Wang Jinchao <wangjinchao600@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-03-27 | kernel/panic: allocate taint string buffer dynamically | Rio | 1 | -5/+46
The buffer used to hold the taint string is statically allocated, which requires updating whenever a new taint flag is added. Instead, allocate the exact required length at boot once the allocator is available in an init function. The allocation sums the string lengths in taint_flags[], along with space for separators and formatting. print_tainted() is switched to use this dynamically allocated buffer. If allocation fails, print_tainted() warns about the failure and continues to use the original static buffer as a fallback. Link: https://lkml.kernel.org/r/20260222140804.22225-1-rioo.tsukatsukii@gmail.com Signed-off-by: Rio <rioo.tsukatsukii@gmail.com> Cc: Joel Granados <joel.granados@kernel.org> Cc: Petr Mladek <pmladek@suse.com> Cc: Wang Jinchao <wangjinchao600@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-03-27 | kernel/panic: increase buffer size for verbose taint logging | Rio | 1 | -2/+6
The verbose 'Tainted: ...' string in print_tainted_seq can total to 327 characters while the buffer defined in _print_tainted is 320 bytes. Increase its size to 350 characters to hold all flags, along with some headroom. [akpm@linux-foundation.org: fix spello, add comment] Link: https://lkml.kernel.org/r/20260220151500.13585-1-rioo.tsukatsukii@gmail.com Signed-off-by: Rio <rioo.tsukatsukii@gmail.com> Cc: Joel Granados <joel.granados@kernel.org> Cc: Petr Mladek <pmladek@suse.com> Cc: Wang Jinchao <wangjinchao600@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-03-27 | unshare: fix nsproxy leak in ksys_unshare() on set_cred_ucounts() failure | Michal Grzedzicki | 1 | -4/+7
When set_cred_ucounts() fails in ksys_unshare() new_nsproxy is leaked. Let's call put_nsproxy() if that happens. Link: https://lkml.kernel.org/r/20260213193959.2556730-1-mge@meta.com Fixes: 905ae01c4ae2 ("Add a reference to ucounts for each cred") Signed-off-by: Michal Grzedzicki <mge@meta.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Alexey Gladkov (Intel) <legion@kernel.org> Cc: Ben Segall <bsegall@google.com> Cc: David Hildenbrand <david@kernel.org> Cc: Dietmar Eggemann <dietmar.eggemann@arm.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Juri Lelli <juri.lelli@redhat.com> Cc: Kees Cook <kees@kernel.org> Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com> Cc: Lorenzo Stoakes (Oracle) <ljs@kernel.org> Cc: Mel Gorman <mgorman@suse.de> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Valentin Schneider <vschneid@redhat.com> Cc: Vincent Guittot <vincent.guittot@linaro.org> Cc: Vlastimil Babka <vbabka@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-03-27 | Merge tag 'sysctl-7.00-fixes-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/sysctl/sysctl | Linus Torvalds | 1 | -1/+1
Pull sysctl fix from Joel Granados:

 "Fix uninitialized variable error when writing to a sysctl bitmap

  Removed the possibility of returning an unjustified -EINVAL when writing to a sysctl bitmap"

* tag 'sysctl-7.00-fixes-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/sysctl/sysctl:
  sysctl: fix uninitialized variable in proc_do_large_bitmap
2026-03-27 | tracing: Fix potential deadlock in cpu hotplug with osnoise | Luo Haiyang | 1 | -5/+5
The following sequence may lead to a deadlock during CPU hotplug:

    task1           task2           task3
    -----           -----           -----

    mutex_lock(&interface_lock)

                    [CPU GOING OFFLINE]

                    cpus_write_lock();
                    osnoise_cpu_die();
                      kthread_stop(task3);
                        wait_for_completion();

                                    osnoise_sleep();
                                      mutex_lock(&interface_lock);

    cpus_read_lock();

    [DEAD LOCK]

Fix this by swapping the order of cpus_read_lock() and mutex_lock(&interface_lock).

Cc: stable@vger.kernel.org
Cc: <mathieu.desnoyers@efficios.com>
Cc: <zhang.run@zte.com.cn>
Cc: <yang.tao172@zte.com.cn>
Cc: <ran.xiaokai@zte.com.cn>
Fixes: bce29ac9ce0bb ("trace: Add osnoise tracer")
Link: https://patch.msgid.link/20260326141953414bVSj33dAYktqp9Oiyizq8@zte.com.cn
Reviewed-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Luo Haiyang <luo.haiyang@zte.com.cn>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2026-03-27 | sched_ext: Document why built-in DSQs are unsupported sources in scx_bpf_dsq_move_to_local() | Cheng-Yang Chou | 1 | -1/+9
Add a comment explaining the design intent behind rejecting built-in DSQs (%SCX_DSQ_GLOBAL and %SCX_DSQ_LOCAL*) as sources. Local DSQs support reenqueueing but the BPF scheduler cannot directly iterate or move tasks from them. %SCX_DSQ_GLOBAL is similar but also doesn't support reenqueueing because it maps to multiple per-node DSQs, making the scope difficult to define.

Also annotate @dsq_id to make clear it must be a user-created DSQ.

Suggested-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Cheng-Yang Chou <yphbchou0911@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2026-03-27 | printk_ringbuffer: Add sanity check for 0-size data | John Ogness | 1 | -2/+5
get_data() has a sanity check for regular data blocks to ensure at least space for the ID exists. But a regular block should also have at least 1 byte of data (otherwise it would be data-less instead of regular). Expand the get_data() block size sanity check to additionally expect at least 1 byte of data. Signed-off-by: John Ogness <john.ogness@linutronix.de> Reviewed-by: Petr Mladek <pmladek@suse.com> Tested-by: Petr Mladek <pmladek@suse.com> Link: https://patch.msgid.link/20260326133809.8045-2-john.ogness@linutronix.de Signed-off-by: Petr Mladek <pmladek@suse.com>
2026-03-27 | printk_ringbuffer: Fix get_data() size sanity check | John Ogness | 1 | -4/+4
Commit cc3bad11de6e ("printk_ringbuffer: Fix check of valid data size when blk_lpos overflows") added sanity checking to get_data() to avoid returning data of illegal sizes (too large or too small). It uses the helper function data_check_size() for the check. However, data_check_size() expects the size of the data, not the size of the data block. get_data() is providing the size of the data block. This means that if the data size (text_buf_size) is at or near the maximum legal size: sizeof(prb_data_block) + text_buf_size == DATA_SIZE(data_ring) / 2 data_check_size() will report failure because it adds sizeof(prb_data_block) to the provided size. The sanity check in get_data() is counting the data block header twice. The result is that the reader fails to read the legal record. Since get_data() subtracts the data block header size before returning, move the sanity check to after the subtraction. Luckily printk() is not vulnerable to this problem because truncate_msg() limits printk-messages to 1/4 of the ringbuffer. Indeed, by adjusting the printk_ringbuffer KUnit test, which does not use printk() and its truncate_msg() check, it is easy to see that the reader fails and the WARN_ON is triggered. Fixes: cc3bad11de6e ("printk_ringbuffer: Fix check of valid data size when blk_lpos overflows") Signed-off-by: John Ogness <john.ogness@linutronix.de> Reviewed-by: Petr Mladek <pmladek@suse.com> Tested-by: Petr Mladek <pmladek@suse.com> Link: https://patch.msgid.link/20260326133809.8045-1-john.ogness@linutronix.de Signed-off-by: Petr Mladek <pmladek@suse.com>
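The double counting of the block header can be shown with plain arithmetic. The sketch below is a simplified model: `RING_SIZE` and `BLK_HEADER` are made-up values, the function names other than data_check_size() are hypothetical, and the alignment handling of the real code is left out.

```c
#include <assert.h>
#include <stdbool.h>

#define RING_SIZE  1024u /* hypothetical data ring size */
#define BLK_HEADER 8u    /* stands in for sizeof(prb_data_block) */

/* Simplified data_check_size(): expects the DATA size and adds the
 * block header itself before comparing against half the ring. */
static bool data_check_size(unsigned int data_size)
{
    return data_size + BLK_HEADER <= RING_SIZE / 2;
}

/* Behaviour before the fix: the BLOCK size is passed in, so the
 * header is counted twice. */
bool get_data_check_old(unsigned int blk_size)
{
    return data_check_size(blk_size);
}

/* Behaviour after the fix: subtract the header first, then check. */
bool get_data_check_fixed(unsigned int blk_size)
{
    assert(blk_size >= BLK_HEADER);
    return data_check_size(blk_size - BLK_HEADER);
}
```

With these numbers a block of exactly 512 bytes (8 header + 504 data) is legal, yet the old check evaluates 512 + 8 <= 512 and rejects it, while the fixed check evaluates 504 + 8 <= 512 and accepts it.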
2026-03-27 | bpf: classify block device hooks appropriately | Christian Brauner | 1 | -0/+4
A bunch of new hooks for managing block devices were added a while ago but they weren't actually appropriately classified.

* bpf_lsm_bdev_alloc() is called when the inode for the block device is allocated. This happens from a sleepable context so mark the function as sleepable. When this function is called the memory for the block device storage embedded into the inode is zeroed. That block device cannot be meaningfully referenced or interacted with at this point. So mark it as untrusted for now.

* bpf_lsm_bdev_free() is called when the inode for the block device is freed. A bunch of memory associated with the block device has already been freed and there are dangling pointers in there. So mark it as untrusted. It cannot be meaningfully referenced or interacted with anymore. It is also called from sb->s_op->free_inode:: which means it runs in rcu context (most of the time). So leave it as non-sleepable.

* bpf_lsm_bdev_setintegrity() is called when a dm-verity device is instantiated (glossing over details for simplicity of the commit message). The block device is very much alive so it remains a trusted hook. It's also called with device mapper's suspend lock held and so the hook is able to sleep so mark it sleepable.

Signed-off-by: Christian Brauner <brauner@kernel.org>
Link: https://lore.kernel.org/r/20260326-work-bpf-bdev-v2-1-5e3c58963987@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2026-03-27 | PCI: Align head space better | Ilpo Järvinen | 1 | -1/+1
When a bridge window contains big and small resource(s), the small resource(s) may not amount to half of the size of the big resource, which is what would allow calculate_head_align() to shrink the head alignment. This results in always placing the small resource(s) after the big resource.

In general, it would be good to be able to place the small resource(s) before the big resource to achieve better utilization of the address space. In the cases where the large resource can only fit at the end of the window, it is even required.

However, carrying the information over from pbus_size_mem() and calculate_head_align() to __pci_assign_resource() and pcibios_align_resource() is not easy with the current data structures.

A somewhat hacky way to move the non-aligning tail part to the head is possible within pcibios_align_resource(). The free space between the start of the free space span and the aligned start address can be compared with the non-aligning remainder of the size. If the free space is larger than the remainder, placing the remainder before the start address is possible. This relocation should generally work, because PCI resources consist only of power-of-2 atoms.

Various arch requirements may still need to override the relocation, so the relocation is only applied selectively in such cases.

Closes: https://bugzilla.kernel.org/show_bug.cgi?id=221205
Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Tested-by: Xifer <xiferdev@gmail.com>
Link: https://patch.msgid.link/20260324165633.4583-10-ilpo.jarvinen@linux.intel.com
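The comparison between the free head space and the non-aligning remainder can be worked through numerically. The sketch below is an assumption-laden illustration: `place_window()` is a hypothetical function with an invented signature, not pcibios_align_resource(), and it ignores the arch-specific overrides the message mentions.

```c
#include <assert.h>
#include <stdint.h>

static uint64_t align_up(uint64_t x, uint64_t align)
{
    return (x + align - 1) & ~(align - 1);
}

/* Pick a start address for a window of `size` bytes whose largest
 * resource needs `align` alignment, given free space from free_start. */
uint64_t place_window(uint64_t free_start, uint64_t size, uint64_t align)
{
    uint64_t aligned, gap, remainder;

    assert(align && (align & (align - 1)) == 0); /* power of two */

    aligned = align_up(free_start, align);
    gap = aligned - free_start;     /* free space before aligned start */
    remainder = size & (align - 1); /* non-aligning part of the size */

    /* Strictly larger, following the wording of the commit message. */
    if (gap > remainder)
        return aligned - remainder;
    return aligned;
}
```

For a 64 KiB resource plus a 4 KiB one (size 0x11000, alignment 0x10000) and free space starting at 0x1000, the gap is 0xf000 and the remainder is 0x1000, so the window can start at 0xf000: the small part sits just below the aligned address and the big part still starts at 0x10000.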
2026-03-27 | resource: Rename 'tmp' variable to 'full_avail' | Ilpo Järvinen | 1 | -14/+14
__find_resource_space() has a variable called 'tmp'. Rename it to 'full_avail' to better indicate its purpose.

Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Tested-by: Xifer <xiferdev@gmail.com>
Link: https://patch.msgid.link/20260324165633.4583-4-ilpo.jarvinen@linux.intel.com
2026-03-27 | resource: Pass full extent of empty space to resource_alignf callback | Ilpo Järvinen | 1 | -1/+2
__find_resource_space() calculates the full extent of empty space but only passes the aligned space to the resource_alignf callback. In some situations, the callback may choose to take advantage of the free space before the requested alignment.

Pass the full extent of the calculated empty space to the resource_alignf callback as an additional parameter.

Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Tested-by: Xifer <xiferdev@gmail.com>
Link: https://patch.msgid.link/20260324165633.4583-3-ilpo.jarvinen@linux.intel.com
2026-03-27 | Merge back earlier material related to system sleep for 7.1 | Rafael J. Wysocki | 1 | -2/+5
2026-03-27 | Merge branch 'dt-reserved-mem-cleanups' into dma-mapping-for-next | Marek Szyprowski | 3 | -47/+77
Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com>
2026-03-26 | btf: Support kernel parsing of BTF with layout info | Alan Maguire | 1 | -4/+56
Validate layout if present, but because the kernel must be strict in what it accepts, reject BTF with unsupported kinds, even if they are in the layout information. Signed-off-by: Alan Maguire <alan.maguire@oracle.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20260326145444.2076244-8-alan.maguire@oracle.com
2026-03-26 | resource: Add __resource_contains_unbound() for internal contains checks | Ilpo Järvinen | 1 | -2/+2
__find_resource_space() currently uses resource_contains() but for tentative resources that are not yet crafted into the resource tree. As resource_contains() checks that IORESOURCE_UNSET is not set for either of the resources, the caller has to hack around this problem by clearing the IORESOURCE_UNSET flag (essentially lying to resource_contains()). Instead of the hack, introduce __resource_contains_unbound() for cases like this. Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com> Tested-by: Xifer <xiferdev@gmail.com> Link: https://patch.msgid.link/20260324165633.4583-2-ilpo.jarvinen@linux.intel.com
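The difference between the two containment checks can be sketched as follows. The struct, the flag value and both function names are hypothetical userspace stand-ins; the model covers only the IORESOURCE_UNSET behaviour described above and omits any other checks the real helpers perform.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define RES_UNSET 0x1u /* hypothetical bit standing in for IORESOURCE_UNSET */

struct res {
    uint64_t start;
    uint64_t end; /* inclusive */
    unsigned int flags;
};

static bool range_contains(const struct res *outer, const struct res *inner)
{
    return outer->start <= inner->start && outer->end >= inner->end;
}

/* Model of the flag-checking variant: refuses resources marked unset. */
bool res_contains(const struct res *outer, const struct res *inner)
{
    assert(outer && inner);
    if ((outer->flags | inner->flags) & RES_UNSET)
        return false;
    return range_contains(outer, inner);
}

/* Model of the "unbound" variant: compares the ranges only, so callers
 * with tentative resources no longer need to clear the flag first. */
bool res_contains_unbound(const struct res *outer, const struct res *inner)
{
    assert(outer && inner);
    return range_contains(outer, inner);
}
```

A tentative resource that is still flagged as unset fails the first check even when its range fits, which is the situation the caller previously worked around by clearing the flag.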
2026-03-26 | Merge tag 'pm-7.0-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm | Linus Torvalds | 2 | -1/+12
Pull power management fixes from Rafael Wysocki:

 "These fix two cpufreq issues, one in the core and one in the conservative governor, and two issues related to system sleep:

   - Restore the cpufreq core behavior changed inadvertently during the 6.19 development cycle to call cpufreq_frequency_table_cpuinfo() for cpufreq policies getting re-initialized which ensures that policy->max and policy->cpuinfo_max_freq will be valid going forward (Viresh Kumar)

   - Adjust the cached requested frequency in the conservative cpufreq governor on policy limits changes to prevent it from becoming stale in some cases (Viresh Kumar)

   - Prevent pm_restore_gfp_mask() from triggering a WARN_ON() in some code paths in which it is legitimately called without invoking pm_restrict_gfp_mask() previously (Youngjun Park)

   - Update snapshot_write_finalize() to take trailing zero pages into account properly which prevents user space restore from failing subsequently in some cases (Alberto Garcia)"

* tag 'pm-7.0-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
  PM: sleep: Drop spurious WARN_ON() from pm_restore_gfp_mask()
  PM: hibernate: Drain trailing zero pages on userspace restore
  cpufreq: conservative: Reset requested_freq on limits change
  cpufreq: Don't skip cpufreq_frequency_table_cpuinfo()
2026-03-26of: reserved_mem: replace CMA quirks by generic methodsMarek Szyprowski1-19/+51
Add optional reserved memory callbacks to perform region verification and early fixup, then move all CMA related code in of_reserved_mem.c to them. Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com> Link: https://patch.msgid.link/20260325090023.3175348-5-m.szyprowski@samsung.com Signed-off-by: Rob Herring (Arm) <robh@kernel.org>
2026-03-26of: reserved_mem: switch to ops based OF_DECLARE()Marek Szyprowski3-20/+22
Move init function from OF_DECLARE() argument to the given reserved memory region ops structure and then pass that structure to the OF_DECLARE() initializer. This node_init callback is mandatory for the reserved mem driver. Such change makes it possible in the future to add more functions called by the generic code before given memory region is initialized and rmem object is created. Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com> Link: https://patch.msgid.link/20260325090023.3175348-4-m.szyprowski@samsung.com Signed-off-by: Rob Herring (Arm) <robh@kernel.org>
2026-03-26of: reserved_mem: use -ENODEV instead of -ENOENTMarek Szyprowski2-2/+2
When given reserved memory region doesn't really support given node, return -ENODEV instead of -ENOENT. Then fix __reserved_mem_init_node() function to properly propagate error code different from -ENODEV instead of silently ignoring it. Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com> Link: https://patch.msgid.link/20260325090023.3175348-3-m.szyprowski@samsung.com Signed-off-by: Rob Herring (Arm) <robh@kernel.org>
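The error-handling rule described here can be sketched as a small user-space model. All names (`demo_init_fn`, `demo_init_node()`, the three sample handlers) are hypothetical, and the value returned when no handler claims the node is an assumption, since the commit message does not state it.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical init callback: returns 0, -ENODEV ("not my node"), or another error. */
typedef int (*demo_init_fn)(int node);

/*
 * Try each candidate handler for a node. -ENODEV means "this handler does
 * not support the node", so keep looking; any other result, success or a
 * real failure, is returned to the caller instead of being swallowed.
 */
static int demo_init_node(const demo_init_fn *handlers, size_t count, int node)
{
	for (size_t i = 0; i < count; i++) {
		int ret = handlers[i](node);

		if (ret != -ENODEV)
			return ret;
	}
	return -ENODEV;	/* assumption: nobody claimed the node */
}

/* Sample handlers used only to exercise the loop above. */
static int demo_not_mine(int node) { (void)node; return -ENODEV; }
static int demo_ok(int node)       { (void)node; return 0; }
static int demo_broken(int node)   { (void)node; return -EINVAL; }
```

Keeping -ENODEV as the only "skip me" value means a handler that fails for a genuine reason, such as -EINVAL, stops the search and surfaces the error.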
2026-03-26of: reserved_mem: remove fdt node from the structureMarek Szyprowski3-8/+4
FDT node is not needed for anything besides the initialization, so it can be simply passed as an argument to the reserved memory region init function. Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com> Link: https://patch.msgid.link/20260325090023.3175348-2-m.szyprowski@samsung.com Signed-off-by: Rob Herring (Arm) <robh@kernel.org>
2026-03-26smp: Use system_percpu_wq instead of system_wqMarco Crivellari1-1/+1
When a caller enqueues a work item using schedule_delayed_work() the used wq is "system_wq" (per-cpu wq) while queue_delayed_work() uses WORK_CPU_UNBOUND (used when no target CPU is specified). The same applies to schedule_work() that is using system_wq and queue_work(), which again makes use of WORK_CPU_UNBOUND. This lack of consistency cannot be addressed without refactoring the API. Continue the effort to refactor workqueue APIs, which began with the introduction of new workqueues and a new alloc_workqueue() flag in: commit 128ea9f6ccfb ("workqueue: Add system_percpu_wq and system_dfl_wq") commit 930c2ea566af ("workqueue: Add new WQ_PERCPU flag") and switch smp_call_on_cpu() to use system_percpu_wq because system_wq is going away once the ongoing workqueue restructuring is done. Suggested-by: Tejun Heo <tj@kernel.org> Signed-off-by: Marco Crivellari <marco.crivellari@suse.com> Signed-off-by: Thomas Gleixner <tglx@kernel.org> Reviewed-by: Frederic Weisbecker <frederic@kernel.org> Link: https://patch.msgid.link/20251110170332.319314-1-marco.crivellari@suse.com
2026-03-26Merge tag 'dma-mapping-7.0-2026-03-25' of ↵Linus Torvalds4-9/+34
git://git.kernel.org/pub/scm/linux/kernel/git/mszyprowski/linux

Pull dma-mapping fixes from Marek Szyprowski:

 "A set of fixes for DMA-mapping subsystem, which resolve false-positive warnings from KMSAN and DMA-API debug (Shigeru Yoshida and Leon Romanovsky) as well as a simple build fix (Miguel Ojeda)"

* tag 'dma-mapping-7.0-2026-03-25' of git://git.kernel.org/pub/scm/linux/kernel/git/mszyprowski/linux:
  dma-mapping: add missing `inline` for `dma_free_attrs`
  mm/hmm: Indicate that HMM requires DMA coherency
  RDMA/umem: Tell DMA mapping that UMEM requires coherency
  iommu/dma: add support for DMA_ATTR_REQUIRE_COHERENT attribute
  dma-direct: prevent SWIOTLB path when DMA_ATTR_REQUIRE_COHERENT is set
  dma-mapping: Introduce DMA require coherency attribute
  dma-mapping: Clarify valid conditions for CPU cache line overlap
  dma-mapping: handle DMA_ATTR_CPU_CACHE_CLEAN in trace output
  dma-debug: Allow multiple invocations of overlapping entries
  dma: swiotlb: add KMSAN annotations to swiotlb_bounce()
2026-03-26futex: Fix UaF between futex_key_to_node_opt() and vma_replace_policy()Hao-Yu Yang1-1/+1
During futex_key_to_node_opt() execution, vma->vm_policy is read under speculative mmap lock and RCU. Concurrently, mbind() may call vma_replace_policy() which frees the old mempolicy immediately via kmem_cache_free(). This creates a race where __futex_key_to_node() dereferences a freed mempolicy pointer, causing a use-after-free read of mpol->mode.

[ 151.412631] BUG: KASAN: slab-use-after-free in __futex_key_to_node (kernel/futex/core.c:349)
[ 151.414046] Read of size 2 at addr ffff888001c49634 by task e/87
[ 151.415969] Call Trace:
[ 151.416732] __asan_load2 (mm/kasan/generic.c:271)
[ 151.416777] __futex_key_to_node (kernel/futex/core.c:349)
[ 151.416822] get_futex_key (kernel/futex/core.c:374 kernel/futex/core.c:386 kernel/futex/core.c:593)

Fix by adding rcu to __mpol_put().

Fixes: c042c505210d ("futex: Implement FUTEX2_MPOL")
Reported-by: Hao-Yu Yang <naup96721@gmail.com>
Suggested-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Hao-Yu Yang <naup96721@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Acked-by: David Hildenbrand (Arm) <david@kernel.org>
Link: https://patch.msgid.link/20260324174418.GB1850007@noisy.programming.kicks-ass.net
2026-03-26futex: Require sys_futex_requeue() to have identical flagsPeter Zijlstra1-0/+8
Nicholas reported that his LLM found it was possible to create a UaF when sys_futex_requeue() is used with different flags. The initial motivation for allowing different flags was the variable sized futex, but since that hasn't been merged (yet), simply mandate the flags are identical, as is the case for the old style sys_futex() requeue operations. Fixes: 0f4b5f972216 ("futex: Add sys_futex_requeue()") Reported-by: Nicholas Carlini <npc@anthropic.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
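The rule "both operands must carry identical flags" amounts to a single comparison made before any requeue work starts. The following is a hedged user-space model only: `struct demo_waiter` and `demo_requeue_check()` are hypothetical names, and the -EINVAL return value is an assumption because the commit message does not name the error code.

```c
#include <errno.h>

/* Hypothetical model of one requeue operand; not the kernel's layout. */
struct demo_waiter {
	unsigned int flags;
	unsigned int *uaddr;
};

/* Reject a requeue whose source and target were declared with different flags. */
static int demo_requeue_check(const struct demo_waiter *src,
			      const struct demo_waiter *dst)
{
	if (src->flags != dst->flags)
		return -EINVAL;	/* assumption: error code not given in the commit message */
	return 0;
}
```

Rejecting the mismatch up front keeps the later requeue path from ever seeing two futexes that disagree about how their keys should be interpreted.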
2026-03-26timens: Remove dependency on the vDSOThomas Weißschuh4-5/+35
Previously, missing time namespace support in the vDSO meant that time namespaces needed to be disabled globally. This was expressed in a hard dependency on the generic vDSO library. This also meant that architectures without any vDSO or only a stub vDSO could not enable time namespaces. Now that all architectures using a real vDSO are using the generic library, that dependency is not necessary anymore. Remove the dependency and let all architectures enable time namespaces. Signed-off-by: Thomas Weißschuh <thomas.weissschuh@linutronix.de> Signed-off-by: Thomas Gleixner <tglx@kernel.org> Link: https://patch.msgid.link/20260326-vdso-timens-decoupling-v2-2-c82693a7775f@linutronix.de
2026-03-26vdso/timens: Move functions to new fileThomas Weißschuh4-121/+164
As a preparation of the untangling of time namespaces and the vDSO, move the glue functions between those subsystems into a new file. While at it, switch the mutex lock and mmap_read_lock() in the vDSO namespace code to guard(). Signed-off-by: Thomas Weißschuh <thomas.weissschuh@linutronix.de> Signed-off-by: Thomas Gleixner <tglx@kernel.org> Link: https://patch.msgid.link/20260326-vdso-timens-decoupling-v2-1-c82693a7775f@linutronix.de
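For readers unfamiliar with scope-based locking, the idea behind guard() can be approximated in user space with the GCC/Clang `cleanup` variable attribute. This is a sketch of the general technique, not the kernel's implementation: `DEMO_GUARD`, `demo_unlock()`, `demo_bump()`, and the pthread mutex are all hypothetical stand-ins.

```c
#include <pthread.h>

static pthread_mutex_t demo_lock = PTHREAD_MUTEX_INITIALIZER;
static int demo_counter;

/* Cleanup handler: receives a pointer to the guard variable going out of scope. */
static void demo_unlock(pthread_mutex_t **m)
{
	pthread_mutex_unlock(*m);
}

/* Hypothetical guard macro: lock now, unlock automatically at scope exit. */
#define DEMO_GUARD(m) \
	pthread_mutex_t *demo_guard_ \
		__attribute__((cleanup(demo_unlock), unused)) = \
		(pthread_mutex_lock(m), (m))

/* Increment the counter up to a limit; every return path drops the lock. */
static int demo_bump(int limit)
{
	DEMO_GUARD(&demo_lock);

	if (demo_counter >= limit)
		return -1;	/* early return: the lock is still released */

	return ++demo_counter;
}
```

The benefit is that early returns no longer need a matching unlock or a goto-based exit label, which is the usual motivation for converting lock/unlock pairs to this style.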
2026-03-26tracing: Move snapshot code out of trace.c and into trace_snapshot.cSteven Rostedt4-1177/+1188
The trace.c file was a dumping ground for most tracing code. Start organizing it better by moving various functions out into their own files. Move all the snapshot code, including the max trace code into its own trace_snapshot.c file. Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Link: https://patch.msgid.link/20260324140145.36352d6a@gandalf.local.home Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2026-03-26nsproxy: Add FOR_EACH_NS_TYPE() X-macro and CLONE_NS_ALLMickaël Salaün2-13/+7
Introduce the FOR_EACH_NS_TYPE(X) macro as the single source of truth for the set of (struct type, CLONE_NEW* flag) pairs that define Linux namespace types. Currently, the list of CLONE_NEW* flags is duplicated inline in multiple call sites and would need another copy in each new consumer. This makes it easy to miss one when a new namespace type is added.

Derive two things from the X-macro:

- CLONE_NS_ALL: Bitmask of all known CLONE_NEW* flags, usable as a validity mask or iteration bound.

- ns_common_type(): Rewritten to use the X-macro via a leading-comma _Generic pattern, so the struct-to-flag mapping stays in sync with the flag set automatically.

Replace the inline flag enumerations in copy_namespaces(), unshare_nsproxy_namespaces(), check_setns_flags(), and ksys_unshare() with CLONE_NS_ALL.

When a new namespace type is added, only FOR_EACH_NS_TYPE needs to be updated; CLONE_NS_ALL, ns_common_type(), and all the call sites pick up the change automatically.

Cc: Christian Brauner <brauner@kernel.org>
Cc: Günther Noack <gnoack@google.com>
Signed-off-by: Mickaël Salaün <mic@digikod.net>
Link: https://patch.msgid.link/20260312100444.2609563-4-mic@digikod.net
Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
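The X-macro technique described above can be illustrated with a small standalone C11 sketch. Everything prefixed `demo_`/`DEMO_` is hypothetical: the struct types, the flag values, and the macro names are invented for illustration and do not reproduce the kernel's definitions.

```c
/* Hypothetical namespace structs and flag values, for illustration only. */
struct demo_mnt_ns { int id; };
struct demo_net_ns { int id; };
struct demo_pid_ns { int id; };

#define DEMO_NEWMNT 0x1u
#define DEMO_NEWNET 0x2u
#define DEMO_NEWPID 0x4u

/* Single source of truth: one X(struct type, flag) entry per namespace type. */
#define DEMO_FOR_EACH_NS_TYPE(X) \
	X(struct demo_mnt_ns, DEMO_NEWMNT) \
	X(struct demo_net_ns, DEMO_NEWNET) \
	X(struct demo_pid_ns, DEMO_NEWPID)

/* Derived mask: expands to (0u | DEMO_NEWMNT | DEMO_NEWNET | DEMO_NEWPID). */
#define DEMO_OR_FLAG(type, flag) | (flag)
#define DEMO_NS_ALL (0u DEMO_FOR_EACH_NS_TYPE(DEMO_OR_FLAG))

/*
 * Leading-comma pattern: each entry expands to ", type *: (flag)", so the
 * association list follows the controlling expression without a trailing
 * comma problem.
 */
#define DEMO_GENERIC_ENTRY(type, flag) , type *: (flag)
#define demo_ns_type(ptr) \
	_Generic((ptr) DEMO_FOR_EACH_NS_TYPE(DEMO_GENERIC_ENTRY))
```

Adding a fourth `X(...)` line to `DEMO_FOR_EACH_NS_TYPE` would extend both the mask and the type-to-flag mapping without touching any consumer, which is the maintenance property the commit is after.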
2026-03-26kernel: acct: fix duplicate word in commenthaoyu.lu1-1/+1
Fix the duplicate word "kernel" in the comment on line 247. Signed-off-by: haoyu.lu <hechushiguitu666@gmail.com> Link: https://patch.msgid.link/20260326055628.10773-1-hechushiguitu666@gmail.com Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Christian Brauner <brauner@kernel.org>
2026-03-26kernel: Use trace_call__##name() at guarded tracepoint call sitesVineeth Pillai (Google)3-3/+3
Replace trace_foo() with the new trace_call__foo() at sites already guarded by trace_foo_enabled(), avoiding a redundant static_branch_unlikely() re-evaluation inside the tracepoint. trace_call__foo() calls the tracepoint callbacks directly without utilizing the static branch again. Cc: David Vernet <void@manifault.com> Cc: Andrea Righi <arighi@nvidia.com> Cc: Changwoo Min <changwoo@igalia.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Juri Lelli <juri.lelli@redhat.com> Cc: Vincent Guittot <vincent.guittot@linaro.org> Cc: Dietmar Eggemann <dietmar.eggemann@arm.com> Cc: Ben Segall <bsegall@google.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Valentin Schneider <vschneid@redhat.com> Cc: Thomas Gleixner <tglx@kernel.org> Cc: