path: root/kernel/trace
2025-06-02  ftrace: Fix UAF when lookup kallsym after ftrace disabled  [Ye Bin, 1 file, -3/+7]
The following issue happens with a buggy module:

  BUG: unable to handle page fault for address: ffffffffc05d0218
  PGD 1bd66f067 P4D 1bd66f067 PUD 1bd671067 PMD 101808067 PTE 0
  Oops: Oops: 0000 [#1] SMP KASAN PTI
  Tainted: [O]=OOT_MODULE, [E]=UNSIGNED_MODULE
  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS
  RIP: 0010:sized_strscpy+0x81/0x2f0
  RSP: 0018:ffff88812d76fa08 EFLAGS: 00010246
  RAX: 0000000000000000 RBX: ffffffffc0601010 RCX: dffffc0000000000
  RDX: 0000000000000038 RSI: dffffc0000000000 RDI: ffff88812608da2d
  RBP: 8080808080808080 R08: ffff88812608da2d R09: ffff88812608da68
  R10: ffff88812608d82d R11: ffff88812608d810 R12: 0000000000000038
  R13: ffff88812608da2d R14: ffffffffc05d0218 R15: fefefefefefefeff
  FS:  00007fef552de740(0000) GS:ffff8884251c7000(0000) knlGS:0000000000000000
  CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  CR2: ffffffffc05d0218 CR3: 00000001146f0000 CR4: 00000000000006f0
  DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
  DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
  Call Trace:
   <TASK>
   ftrace_mod_get_kallsym+0x1ac/0x590
   update_iter_mod+0x239/0x5b0
   s_next+0x5b/0xa0
   seq_read_iter+0x8c9/0x1070
   seq_read+0x249/0x3b0
   proc_reg_read+0x1b0/0x280
   vfs_read+0x17f/0x920
   ksys_read+0xf3/0x1c0
   do_syscall_64+0x5f/0x2e0
   entry_SYSCALL_64_after_hwframe+0x76/0x7e

The above issue may happen as follows:

  (1) Add kprobe tracepoint;
  (2) insmod test.ko;
  (3) Module triggers ftrace disabled;
  (4) rmmod test.ko;
  (5) cat /proc/kallsyms; --> Will trigger UAF as test.ko already removed;

  ftrace_mod_get_kallsym()
  ...
  strscpy(module_name, mod_map->mod->name, MODULE_NAME_LEN);
  ...

The problem occurs when a module triggers an issue with ftrace and ftrace_disabled is set. ftrace_disabled is set when an anomaly is discovered, and to prevent any more damage, ftrace stops all text modification. The issue here is that setting ftrace_disabled stops more than just the text modification.

When a module is loaded, its init functions can also be traced. Because kallsyms deletes the init functions after a module has loaded, ftrace saves them when the module is loaded and function tracing is enabled. This allows the output of the function trace to show the init function names instead of just their raw memory addresses.

When a module is removed, ftrace_release_mod() is called, and if ftrace_disabled is set, it just returns without doing anything more. The problem here is that it leaves the mod_list still around, and if kallsyms is then read, it will call into this code and access the module memory that has already been freed when it executes:

  strscpy(module_name, mod_map->mod->name, MODULE_NAME_LEN);

where the "mod" no longer exists, triggering a UAF bug.

Link: https://lore.kernel.org/all/20250523135452.626d8dcd@gandalf.local.home/
Cc: stable@vger.kernel.org
Fixes: aba4b5c22cba ("ftrace: Save module init functions kallsyms symbols for tracing")
Link: https://lore.kernel.org/20250529111955.2349189-2-yebin@huaweicloud.com
Signed-off-by: Ye Bin <yebin10@huawei.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
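A minimal sketch of the shape of the fix described above, assuming the mod_map cleanup in ftrace_release_mod() is moved in front of the ftrace_disabled early return; the helper names (ftrace_mod_maps, ftrace_free_mod_map()) follow existing ftrace.c conventions but are reconstructed here, not quoted from the patch:

```c
/* Sketch: free the saved kallsyms mapping even when ftrace is disabled */
void ftrace_release_mod(struct module *mod)
{
	struct ftrace_mod_map *mod_map;
	struct ftrace_mod_map *n;

	mutex_lock(&ftrace_lock);

	/*
	 * The mod_map entries point into module memory that is about to be
	 * freed, so they must be removed even if ftrace_disabled is set;
	 * otherwise a later /proc/kallsyms read dereferences freed memory
	 * (the UAF above).
	 */
	list_for_each_entry_safe(mod_map, n, &ftrace_mod_maps, list) {
		if (mod_map->mod == mod) {
			list_del_rcu(&mod_map->list);
			call_rcu(&mod_map->rcu, ftrace_free_mod_map);
			break;
		}
	}

	if (ftrace_disabled)
		goto out_unlock;

	/* ... existing dyn_ftrace record cleanup ... */
 out_unlock:
	mutex_unlock(&ftrace_lock);
}
```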
2025-05-30  Merge tag 'trace-ringbuffer-v6.16' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace  [Linus Torvalds, 2 files, -105/+134]
Pull ring-buffer updates from Steven Rostedt:

- Allow the persistent ring buffer to be memory mapped

  In the last merge window there were issues with the implementation of mapping the persistent ring buffer, because it was assumed that the persistent memory was just physical memory without being part of the kernel virtual address space. But this was incorrect, and the persistent ring buffer can be mapped the same way as the allocated ring buffer is mapped.

  The metadata for the persistent ring buffer is different from that of the normal ring buffer, and the organization of mapping it to user space is a little different. Make the updates needed to the metadata to allow the persistent ring buffer to be mapped to user space.

- Fix cpus_read_lock() with buffer->mutex and cpu_buffer->mapping_lock

  Mapping the ring buffer to user space uses the cpu_buffer->mapping_lock. The buffer->mutex can be taken when the mapping_lock is held, giving the locking order of: cpu_buffer->mapping_lock -->> buffer->mutex. But there also exist the orderings:

    buffer->mutex -->> cpus_read_lock()
    mm->mmap_lock -->> cpu_buffer->mapping_lock
    cpus_read_lock() -->> mm->mmap_lock

  causing a circular chain of:

    cpu_buffer->mapping_lock -->> buffer->mutex -->> cpus_read_lock() -->> mm->mmap_lock -->> cpu_buffer->mapping_lock

  Moving the cpus_read_lock() outside the buffer->mutex, giving cpus_read_lock() -->> buffer->mutex, breaks the deadlock chain.

- Do not trigger WARN_ON() for commit overrun

  When the ring buffer is user space mapped and there's a "commit overrun" (where an interrupt preempted an event, and then added so many events it filled the buffer, having to drop events when it hit the preempted event), a WARN_ON() was triggered if this was read via a memory mapped buffer.

  This is due to "missed events" being non-zero when the reader page ended up as the commit page. The idea was, if the writer is on the reader page, there's only one page that has been written to and there should be no missed events. But if a commit overrun is done where the writer is off the commit page and looped around to the commit page causing missed events, it is possible that the reader page is the commit page with missed events.

  Instead of triggering a WARN_ON() when the reader page is the commit page with missed events, trigger it when the reader page is the tail_page with missed events. That's because the writer is always on the tail_page if an event was interrupted (which holds the commit event) and continues off the commit page.

- Reset the persistent buffer if it is fully consumed

  On boot up, if the user fully consumes the last boot buffer of the persistent buffer and reboots without enabling it, there will still be events in the buffer which can cause confusion. Instead, reset the buffer when it is fully consumed, so that the data is not read again.

- Clean up some goto out jumps

  There are a few cases where the code jumps to the "out:" label that simply returns a value. There used to be more work done at those labels, but now that they simply return a value, use a return instead of jumping to a label.

- Use guard() to simplify some of the code

  Add guard() around some locking instead of jumping to a label to do the unlocking.

- Use free() to simplify some of the code

  Use free(kfree) on variables that will get freed on error, and use return_ptr() to return the variable when it's not freed. There's one instance where free(kfree) simplifies the code on a temp variable that was allocated just for the function's use.

* tag 'trace-ringbuffer-v6.16' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
  ring-buffer: Simplify functions with __free(kfree) to free allocations
  ring-buffer: Make ring_buffer_{un}map() simpler with guard(mutex)
  ring-buffer: Simplify ring_buffer_read_page() with guard()
  ring-buffer: Simplify reset_disabled_cpu_buffer() with use of guard()
  ring-buffer: Remove jump to out label in ring_buffer_swap_cpu()
  ring-buffer: Removed unnecessary if() goto out where out is the next line
  tracing: Reset last-boot buffers when reading out all cpu buffers
  ring-buffer: Allow reserve_mem persistent ring buffers to be mmapped
  ring-buffer: Do not trigger WARN_ON() due to a commit_overrun
  ring-buffer: Move cpus_read_lock() outside of buffer->mutex
2025-05-30  Merge tag 'pull-automount' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs  [Linus Torvalds, 1 file, -4/+15]
Pull automount updates from Al Viro:
"Automount wart removal

A bunch of odd boilerplate gone from instances - the reason for those was the need to protect the yet-to-be-attached mount from mark_mounts_for_expiry() deciding to take it out. But that's easy to detect and take care of in mark_mounts_for_expiry() itself; no need to have every instance simulate the mount being busy by grabbing an extra reference to it, with finish_automount() undoing that once it attaches that mount.

Should've done it that way from the very beginning... This is a flag-day change; thankfully there are very few instances.

vfs_submount() is gone - its sole remaining user (trace_automount) had been switched to saner primitives"

* tag 'pull-automount' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  kill vfs_submount()
  saner calling conventions for ->d_automount()
2025-05-29  Merge tag 'trace-v6.16' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace  [Linus Torvalds, 20 files, -260/+574]
Pull tracing updates from Steven Rostedt:

- Have module addresses get updated in the persistent ring buffer

  The addresses of the modules from the previous boot are saved in the persistent ring buffer. If the same modules are loaded and an address in the old buffer points to an address that was both saved in the persistent ring buffer and is loaded in memory, shift the address to point to the address that is loaded in memory in the trace event.

- Print function names for irqs off and preempt off callsites

  When ignoring the print fmt of a trace event and just printing the fields directly, have the fields for preempt off and irqs off events still show the function name (via kallsyms) instead of just showing the raw address.

- Clean ups of the histogram code

  The histogram functions saved over 800 bytes on the stack to process events as they come in. Instead, create per-cpu buffers that can hold this information and have a separate location for each context level (thread, softirq, IRQ and NMI). Also add some more comments to the code.

- Add "common_comm" field for histograms

  Add "common_comm" that uses the current->comm as a field in an event histogram and acts like any of the other fields of the event.

- Show "subops" in the enabled_functions file

  When the function graph infrastructure is used, a subsystem has a "subops" that it attaches its callback function to. Instead of the enabled_functions just showing a function calling the function that calls the subops functions, also show the subops functions that will get called for that function too.

- Add "copy_trace_marker" option to instances

  There are cases where an instance is created for tooling to write into, but the old tooling has the top level instance hardcoded into the application. New tools want to consume the data from an instance and not the top level buffer. By adding a copy_trace_marker option, whenever the top instance trace_marker is written into, a copy of it is also written into the instance with this option set. This allows new tools to read what old tools are writing into the top buffer.

  If this option is cleared by the top instance, then what is written into the trace_marker is not written into the top instance. This is a way to redirect the trace_marker writes into another instance.

- Have tracepoints created by DECLARE_TRACE() use trace_<name>_tp()

  If a tracepoint is created by DECLARE_TRACE() instead of TRACE_EVENT(), then it will not be exposed via tracefs. Currently there's no way to differentiate in the kernel the tracepoint functions between those that are exposed via tracefs or not. A calling convention has been made manually to append a "_tp" suffix for events created by DECLARE_TRACE(). Instead of doing this manually, force it so that all DECLARE_TRACE() events have this notation.

- Use __string() for task->comm in some sched events

  Instead of hardcoding the comm to be TASK_COMM_LEN in some of the scheduler events, use __string(), which makes it dynamic. Note: if these events are parsed by user space, they may break, and the event may have to be converted back to the hardcoded size.

- Have function graph "depth" be unsigned to the user

  Internally to the kernel, the "depth" field of the function graph event is signed due to -1 being used for end of boundary. What actually gets recorded in the event itself is zero or positive. Reflect this to user space by showing "depth" as unsigned int and be consistent across all events.

- Allow an arbitrarily long CPU string to osnoise_cpus_write()

  The filtering of which CPUs to write to can exceed 256 bytes. If a machine has 256 CPUs and the filter is to filter every other CPU, the write would take a string larger than 256 bytes. Instead of using a fixed-size 256-byte buffer on the stack, allocate it to handle what is passed in.

- Stop having ftrace check the per-cpu data "disabled" flag

  The "disabled" flag in the data structure passed to most ftrace functions is checked to know if tracing has been disabled or not. This flag was added back in 2008, before the ring buffer had its own way to disable tracing. The "disabled" flag is now not always set when needed, and the ring buffer flag should be used in all locations where the disabled check is needed. Since the "disabled" flag is redundant and incorrect, stop using it. Fix up some locations that use the "disabled" flag to use the ring buffer info.

- Use new tracer_tracing_disable/enable() instead of the data->disabled flag

  There are a few cases that set the data->disabled flag to stop tracing, but this flag is not consistently used. It is also an on/off switch, where if a function sets it and calls another function that sets it, the called function may incorrectly enable it. Use new tracer_tracing_disable() and tracer_tracing_enable() functions that use a counter and can be nested. These use the ring buffer flags, which are always checked, making the disabling more consistent (see the sketch after this entry).

- Save the trace clock in the persistent ring buffer

  Save what clock was used for tracing in the persistent ring buffer and set it back to that clock after a reboot.

- Remove unused reference to a per CPU data pointer in mmiotrace functions

- Remove unused buffer_page field from trace_array_cpu structure

- Remove more strncpy() instances

- Other minor clean ups and fixes

* tag 'trace-v6.16' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace: (36 commits)
  tracing: Fix compilation warning on arm32
  tracing: Record trace_clock and recover when reboot
  tracing/sched: Use __string() instead of fixed lengths for task->comm
  tracepoint: Have tracepoints created with DECLARE_TRACE() have _tp suffix
  tracing: Cleanup upper_empty() in pid_list
  tracing: Allow the top level trace_marker to write into another instances
  tracing: Add a helper function to handle the dereference arg in verifier
  tracing: Remove unnecessary "goto out" that simply returns ret is trigger code
  tracing: Fix error handling in event_trigger_parse()
  tracing: Rename event_trigger_alloc() to trigger_data_alloc()
  tracing: Replace deprecated strncpy() with strscpy() for stack_trace_filter_buf
  tracing: Remove unused buffer_page field from trace_array_cpu structure
  tracing: Use atomic_inc_return() for updating "disabled" counter in irqsoff tracer
  tracing: Convert the per CPU "disabled" counter to local from atomic
  tracing: branch: Use trace_tracing_is_on_cpu() instead of "disabled" field
  ring-buffer: Add ring_buffer_record_is_on_cpu()
  tracing: Do not use per CPU array_buffer.data->disabled for cpumask
  ftrace: Do not disabled function graph based on "disabled" field
  tracing: kdb: Use tracer_tracing_on/off() instead of setting per CPU disabled
  tracing: Use tracer_tracing_disable() instead of "disabled" field for ftrace_dump_one()
  ...
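A minimal sketch of the nestable disable mentioned above, assuming tracer_tracing_disable()/enable() are thin wrappers over the ring buffer's existing counter-based ring_buffer_record_disable()/enable(); the wrapper bodies are an illustration, not the exact patch:

```c
/* Sketch: nestable tracing disable built on the ring buffer counter */
void tracer_tracing_disable(struct trace_array *tr)
{
	if (WARN_ON_ONCE(!tr->array_buffer.buffer))
		return;
	/* atomically increments a "record disabled" counter, so it nests */
	ring_buffer_record_disable(tr->array_buffer.buffer);
}

void tracer_tracing_enable(struct trace_array *tr)
{
	if (WARN_ON_ONCE(!tr->array_buffer.buffer))
		return;
	/* decrements the counter; recording resumes when it reaches zero */
	ring_buffer_record_enable(tr->array_buffer.buffer);
}
```

Unlike a plain on/off flag, a caller that disables tracing and then calls another function that also disables and re-enables it cannot accidentally turn tracing back on early.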
2025-05-29  ring-buffer: Simplify functions with __free(kfree) to free allocations  [Steven Rostedt, 1 file, -18/+9]
The function rb_allocate_pages() allocates cpu_buffer and on error needs to free it. It has a single return. Use __free(kfree) and return directly on errors, and have the return use return_ptr(cpu_buffer).

The function alloc_buffer() allocates buffer and on error needs to free it. It has a single return. Use __free(kfree) and return directly on errors, and have the return use return_ptr(buffer).

The function __rb_map_vma() allocates a temporary array "pages". Have it use __free() and not worry about freeing it when returning.

Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://lore.kernel.org/20250527143144.6edc4625@gandalf.local.home
Reviewed-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
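For readers unfamiliar with the scope-based cleanup helpers from <linux/cleanup.h>, a minimal sketch of the pattern this commit applies; the function and struct names here are illustrative, not the ring-buffer code itself:

```c
#include <linux/cleanup.h>
#include <linux/slab.h>

struct widget { int id; };

/* Sketch: allocation owned by __free(kfree) until explicitly handed out */
static struct widget *widget_alloc(int id)
{
	struct widget *w __free(kfree) = kzalloc(sizeof(*w), GFP_KERNEL);

	if (!w)
		return NULL;

	if (id < 0)
		return NULL;	/* early return: w is kfree()d automatically */

	w->id = id;
	return_ptr(w);		/* transfers ownership; no automatic kfree */
}
```

Every error path becomes a plain `return`, and the single success path uses return_ptr() to disarm the cleanup, which is exactly the shape rb_allocate_pages() and alloc_buffer() take after this change.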
2025-05-29  ring-buffer: Make ring_buffer_{un}map() simpler with guard(mutex)  [Steven Rostedt, 1 file, -24/+11]
Convert the taking of the buffer->mutex and the cpu_buffer->mapping_lock over to guard(mutex) and simplify the ring_buffer_map() and ring_buffer_unmap() functions.

Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Vincent Donnefort <vdonnefort@google.com>
Link: https://lore.kernel.org/20250527122009.267efb72@gandalf.local.home
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
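A minimal sketch of the guard(mutex) conversion pattern, using an illustrative function rather than the actual ring_buffer_map() body:

```c
#include <linux/cleanup.h>
#include <linux/mutex.h>

static DEFINE_MUTEX(map_lock);

/* Before: explicit unlock on every exit path */
static int do_map_old(int cpu)
{
	int err = 0;

	mutex_lock(&map_lock);
	if (cpu < 0) {
		err = -EINVAL;
		goto out;
	}
	/* ... mapping work ... */
 out:
	mutex_unlock(&map_lock);
	return err;
}

/* After: guard() drops the mutex automatically at every return */
static int do_map_new(int cpu)
{
	guard(mutex)(&map_lock);

	if (cpu < 0)
		return -EINVAL;
	/* ... mapping work ... */
	return 0;
}
```

The two commits that follow apply the same idea with the guard(raw_spinlock_irqsave) flavor, which also restores interrupt state on scope exit.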
2025-05-29  ring-buffer: Simplify ring_buffer_read_page() with guard()  [Steven Rostedt, 1 file, -17/+11]
The function ring_buffer_read_page() had two gotos. One was simply returning "ret" and the other was unlocking the reader_lock. There's no reason to use goto to simply return the "ret" variable. Instead just return the value. The jump to the unlocking of the reader_lock can be replaced by guard(raw_spinlock_irqsave)(&cpu_buffer->reader_lock).

With these two changes the "ret" variable is no longer used and can be removed. The return value on non-error is what was read and is stored in the "read" variable.

Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://lore.kernel.org/20250527145216.0187cf36@gandalf.local.home
Reviewed-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-05-29  ring-buffer: Simplify reset_disabled_cpu_buffer() with use of guard()  [Steven Rostedt, 1 file, -7/+2]
Use guard(raw_spinlock_irqsave)() in reset_disabled_cpu_buffer() to simplify the locking.

Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://lore.kernel.org/20250527144623.77a9cc47@gandalf.local.home
Reviewed-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-05-29  ring-buffer: Remove jump to out label in ring_buffer_swap_cpu()  [Steven Rostedt, 1 file, -14/+9]
The function ring_buffer_swap_cpu() has a bunch of jumps to the label out that simply returns "ret". There's no reason to jump to a label that simply returns a value. Just return directly from there.

This goes back to almost the beginning, when commit 8aabee573dff ("ring-buffer: remove unneeded get_online_cpus") was introduced. That commit removed a put_online_cpus() from that label, but never updated all the jumps to it that now no longer needed to do anything but return a value.

Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://lore.kernel.org/20250527145753.6b45d840@gandalf.local.home
Reviewed-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-05-29  ring-buffer: Removed unnecessary if() goto out where out is the next line  [Steven Rostedt, 1 file, -4/+1]
In the function ring_buffer_discard_commit() there's an if statement that jumps to the next line:

	if (rb_try_to_discard(cpu_buffer, event))
		goto out;
 out:

This was caused by the change that modified the way timestamps were taken in interrupt context: it removed the code between the if statement and the goto, but failed to update the conditional logic.

Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://lore.kernel.org/20250527155116.227f35be@gandalf.local.home
Fixes: a389d86f7fd0 ("ring-buffer: Have nested events still record running time stamp")
Reviewed-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
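A sketch of what the cleanup amounts to: since both branches fall through to the same place, the conditional jump can simply be dropped. The code after the label is paraphrased from ring_buffer.c conventions, not quoted from the patch:

```c
/* Before: the goto targets the very next statement */
if (rb_try_to_discard(cpu_buffer, event))
	goto out;
 out:
	rb_end_commit(cpu_buffer);

/* After: same behavior, no dead conditional jump */
rb_try_to_discard(cpu_buffer, event);
rb_end_commit(cpu_buffer);
```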
2025-05-29  tracing: Reset last-boot buffers when reading out all cpu buffers  [Masami Hiramatsu (Google), 1 file, -0/+22]
Reset the last-boot ring buffers when read() reads out all cpu buffers through trace_pipe/trace_pipe_raw. This prevents ftrace from rewinding the ring buffer read pointer on the next boot.

Note that this resets only when all per-cpu buffers are empty and are read via the read(2) syscall. For example, if you read only one of the per-cpu trace_pipe files, it does not reset. Also, reading the buffer via the splice(2) syscall does not reset it, because some data remains in the reader (the last) page.

Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://lore.kernel.org/174792929202.496143.8184644221859580999.stgit@mhiramat.tok.corp.google.com
Suggested-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-05-29  ring-buffer: Allow reserve_mem persistent ring buffers to be mmapped  [Steven Rostedt, 2 files, -8/+45]
When the persistent ring buffer is created from the memory returned by reserve_mem, there is nothing prohibiting it from being memory mapped to user space. The memory is the same as the pages allocated by alloc_page(). The way the memory is managed by the ring buffer code is slightly different, though, and needs to be addressed.

The persistent memory uses the page->id for its own purpose, whereas the user mmap buffer currently uses that for the subbuf array mapped to user space. If the buffer is a persistent buffer, use the page index into that buffer as the identifier instead of the page->id.

That is, for a persistent buffer, the page->id represents the position of the page in the buffer's linked list. ->id == 0 means it is the reader page. When a reader page is swapped, the new reader page's ->id gets zero, and the old reader page gets the ->id of the page that it swapped with.

In the user space mapping, the ->id is the index of where the page was mapped in user space, and it does not change while it is mapped.

Since the persistent buffer is fixed in its location, the index of where a page is in the memory range can be used as the "id" to put in the meta page array, and it can be mapped in the same order to user space as it is in the persistent memory.

A new rb_page_id() helper function is used to get and set the id, depending on whether the page is a normal memory allocated buffer or a physical memory mapped buffer (see the sketch after this entry).

Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Vincent Donnefort <vdonnefort@google.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Jann Horn <jannh@google.com>
Link: https://lore.kernel.org/20250401203332.246646011@goodmis.org
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
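A minimal sketch of what an rb_page_id() helper along these lines could look like; the rb_meta_subbuf_idx() name and the ring_meta test are assumptions modeled on the surrounding ring_buffer.c code, not quotes from the patch:

```c
/*
 * Sketch: for a persistent (physically mapped) buffer, the stable index
 * of the sub-buffer within the memory range is used as the id; for a
 * normal buffer, the id lives in page->id and is (re)assigned here.
 */
static int rb_page_id(struct ring_buffer_per_cpu *cpu_buffer,
		      struct buffer_page *bpage, int id)
{
	if (cpu_buffer->ring_meta)
		id = rb_meta_subbuf_idx(cpu_buffer->ring_meta, bpage->page);
	else
		bpage->id = id;

	return id;
}
```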
2025-05-29  ring-buffer: Do not trigger WARN_ON() due to a commit_overrun  [Steven Rostedt, 1 file, -8/+18]
When reading a memory mapped buffer, the reader page is just swapped out with the last page written in the write buffer. If the reader page is the same as the commit buffer (the buffer that is currently being written to), it was assumed that it should never have missed events. If it does, it triggers a WARN_ON_ONCE().

But there just happens to be one scenario where this can legitimately happen. That is on a commit_overrun. A commit overrun is when an interrupt preempts an event being written to the buffer and then the interrupt adds so many new events that it fills and wraps the buffer back to the commit. Any new events would then be dropped and be reported as "missed_events".

In this case, the next page to read is the commit buffer, and after the swap of the reader page, the reader page will be the commit buffer, but this time there will be missed events, and this triggers the following warning:

  ------------[ cut here ]------------
  WARNING: CPU: 2 PID: 1127 at kernel/trace/ring_buffer.c:7357 ring_buffer_map_get_reader+0x49a/0x780
  Modules linked in: kvm_intel kvm irqbypass
  CPU: 2 UID: 0 PID: 1127 Comm: trace-cmd Not tainted 6.15.0-rc7-test-00004-g478bc2824b45-dirty #564 PREEMPT
  Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-debian-1.16.3-2 04/01/2014
  RIP: 0010:ring_buffer_map_get_reader+0x49a/0x780
  Code: 00 00 00 48 89 fe 48 c1 ee 03 80 3c 2e 00 0f 85 ec 01 00 00 4d 3b a6 a8 00 00 00 0f 85 8a fd ff ff 48 85 c0 0f 84 55 fe ff ff <0f> 0b e9 4e fe ff ff be 08 00 00 00 4c 89 54 24 58 48 89 54 24 50
  RSP: 0018:ffff888121787dc0 EFLAGS: 00010002
  RAX: 00000000000006a2 RBX: ffff888100062800 RCX: ffffffff8190cb49
  RDX: ffff888126934c00 RSI: 1ffff11020200a15 RDI: ffff8881010050a8
  RBP: dffffc0000000000 R08: 0000000000000000 R09: ffffed1024d26982
  R10: ffff888126934c17 R11: ffff8881010050a8 R12: ffff888126934c00
  R13: ffff8881010050b8 R14: ffff888101005000 R15: ffff888126930008
  FS:  00007f95c8cd7540(0000) GS:ffff8882b576e000(0000) knlGS:0000000000000000
  CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  CR2: 00007f95c8de4dc0 CR3: 0000000128452002 CR4: 0000000000172ef0
  Call Trace:
   <TASK>
   ? __pfx_ring_buffer_map_get_reader+0x10/0x10
   tracing_buffers_ioctl+0x283/0x370
   __x64_sys_ioctl+0x134/0x190
   do_syscall_64+0x79/0x1c0
   entry_SYSCALL_64_after_hwframe+0x76/0x7e
  RIP: 0033:0x7f95c8de48db
  Code: 00 48 89 44 24 18 31 c0 48 8d 44 24 60 c7 04 24 10 00 00 00 48 89 44 24 08 48 8d 44 24 20 48 89 44 24 10 b8 10 00 00 00 0f 05 <89> c2 3d 00 f0 ff ff 77 1c 48 8b 44 24 18 64 48 2b 04 25 28 00 00
  RSP: 002b:00007ffe037ba110 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
  RAX: ffffffffffffffda RBX: 00007ffe037bb2b0 RCX: 00007f95c8de48db
  RDX: 0000000000000000 RSI: 0000000000005220 RDI: 0000000000000006
  RBP: 00007ffe037ba180 R08: 0000000000000000 R09: 0000000000000000
  R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
  R13: 00007ffe037bb6f8 R14: 00007f95c9065000 R15: 00005575c7492c90
   </TASK>
  irq event stamp: 5080
  hardirqs last enabled at (5079): [<ffffffff83e0adb0>] _raw_spin_unlock_irqrestore+0x50/0x70
  hardirqs last disabled at (5080): [<ffffffff83e0aa83>] _raw_spin_lock_irqsave+0x63/0x70
  softirqs last enabled at (4182): [<ffffffff81516122>] handle_softirqs+0x552/0x710
  softirqs last disabled at (4159): [<ffffffff815163f7>] __irq_exit_rcu+0x107/0x210
  ---[ end trace 0000000000000000 ]---

The above was triggered by running on a kernel with both lockdep and KASAN as well as kmemleak enabled, and executing the following command:

  # perf record -o perf-test.dat -a -- trace-cmd record --nosplice -e all -p function hackbench 50

With perf interjecting a lot of interrupts and trace-cmd enabling all events as well as function tracing, with lockdep, KASAN and kmemleak enabled, it could cause an interrupt preempting an event being written to add enough events to wrap the buffer. trace-cmd was modified to have --nosplice use mmap instead of reading the buffer.

The way to differentiate this case from the normal case, where there's only one page written to and the swap of the reader page received that one page (which is the commit page), is to check if the tail page is on the reader page. The difference between the commit page and the tail page is that the tail page is where new writes go to, and the commit page holds the first write that hasn't been committed yet. In the case of an interrupt preempting the write of an event and filling the buffer, it would move the tail page but not the commit page.

Have the warning only trigger if the tail page is also on the reader page, and also print out the number of events dropped by a commit overrun, as that can not yet be safely added to the page, so that the reader can see there were events dropped.

Cc: stable@vger.kernel.org
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Vincent Donnefort <vdonnefort@google.com>
Link: https://lore.kernel.org/20250528121555.2066527e@gandalf.local.home
Fixes: fe832be05a8ee ("ring-buffer: Have mmapped ring buffer keep track of missed events")
Reviewed-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
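A sketch of the adjusted warning condition described above; the field accesses are reconstructed from ring_buffer.c conventions, not copied from the patch:

```c
/*
 * Sketch: missed events are legitimate when a commit overrun left the
 * commit page (which becomes the reader page after the swap) behind the
 * tail page. Only a reader page that is also the tail page should never
 * have missed events.
 */
if (missed_events) {
	if (cpu_buffer->reader_page == READ_ONCE(cpu_buffer->tail_page)) {
		WARN_ON_ONCE(1);	/* the writer really is on this page */
	} else {
		/* commit overrun: report the drop instead of warning */
		pr_info("Ring buffer [%d] commit overrun, %lu missed events\n",
			cpu_buffer->cpu, missed_events);
	}
}
```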
2025-05-28  Merge tag 'bpf-next-6.16' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next  [Linus Torvalds, 2 files, -111/+212]
Pull bpf updates from Alexei Starovoitov:

- Fix and improve BTF deduplication of identical BTF types (Alan Maguire and Andrii Nakryiko)
- Support up to 12 arguments in BPF trampoline on arm64 (Xu Kuohai and Alexis Lothoré)
- Support load-acquire and store-release instructions in BPF JIT on riscv64 (Andrea Parri)
- Fix uninitialized values in BPF_{CORE,PROBE}_READ macros (Anton Protopopov)
- Streamline allowed helpers across program types (Feng Yang)
- Support atomic update for hashtab of BPF maps (Hou Tao)
- Implement json output for BPF helpers (Ihor Solodrai)
- Several s390 JIT fixes (Ilya Leoshkevich)
- Various sockmap fixes (Jiayuan Chen)
- Support mmap of vmlinux BTF data (Lorenz Bauer)
- Support BPF rbtree traversal and list peeking (Martin KaFai Lau)
- Tests for sockmap/sockhash redirection (Michal Luczaj)
- Introduce kfuncs for memory reads into dynptrs (Mykyta Yatsenko)
- Add support for dma-buf iterators in BPF (T.J. Mercier)
- The verifier support for __bpf_trap() (Yonghong Song)

* tag 'bpf-next-6.16' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (135 commits)
  bpf, arm64: Remove unused-but-set function and variable.
  selftests/bpf: Add tests with stack ptr register in conditional jmp
  bpf: Do not include stack ptr register in precision backtracking bookkeeping
  selftests/bpf: enable many-args tests for arm64
  bpf, arm64: Support up to 12 function arguments
  bpf: Check rcu_read_lock_trace_held() in bpf_map_lookup_percpu_elem()
  bpf: Avoid __bpf_prog_ret0_warn when jit fails
  bpftool: Add support for custom BTF path in prog load/loadall
  selftests/bpf: Add unit tests with __bpf_trap() kfunc
  bpf: Warn with __bpf_trap() kfunc maybe due to uninitialized variable
  bpf: Remove special_kfunc_set from verifier
  selftests/bpf: Add test for open coded dmabuf_iter
  selftests/bpf: Add test for dmabuf_iter
  bpf: Add open coded dmabuf iterator
  bpf: Add dmabuf iterator
  dma-buf: Rename debugfs symbols
  bpf: Fix error return value in bpf_copy_from_user_dynptr
  libbpf: Use mmap to parse vmlinux BTF from sysfs
  selftests: bpf: Add a test for mmapable vmlinux BTF
  btf: Allow mmap of vmlinux btf
  ...
2025-05-28  tracing: Fix compilation warning on arm32  [Pan Taixi, 1 file, -1/+1]
On arm32, size_t is defined to be unsigned int, while PAGE_SIZE is unsigned long. This hence triggers a compilation warning, as min() asserts the type of two operands to be equal. Casting PAGE_SIZE to size_t solves this issue and works on other target architectures as well.

Compilation warning details:

  kernel/trace/trace.c: In function 'tracing_splice_read_pipe':
  ./include/linux/minmax.h:20:28: warning: comparison of distinct pointer types lacks a cast
    (!!(sizeof((typeof(x) *)1 == (typeof(y) *)1)))
                              ^
  ./include/linux/minmax.h:26:4: note: in expansion of macro '__typecheck'
    (__typecheck(x, y) && __no_side_effects(x, y))
     ^~~~~~~~~~~
  ...
  kernel/trace/trace.c:6771:8: note: in expansion of macro 'min'
    min((size_t)trace_seq_used(&iter->seq),
    ^~~

Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/20250526013731.1198030-1-pantaixi@huaweicloud.com
Fixes: f5178c41bb43 ("tracing: Fix oob write in trace_seq_to_buffer()")
Reviewed-by: Jeongjun Park <aha310510@gmail.com>
Signed-off-by: Pan Taixi <pantaixi@huaweicloud.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
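In other words, the change amounts to a cast so that both min() operands share a type; shown here as a sketch of the pattern, not the exact diff:

```c
/* Both operands are size_t now, so min()'s __typecheck passes on arm32,
 * where size_t is unsigned int but PAGE_SIZE expands to unsigned long. */
size_t len = min((size_t)trace_seq_used(&iter->seq), (size_t)PAGE_SIZE);
```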
2025-05-27  Merge tag 'sysctl-6.16-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/sysctl/sysctl  [Linus Torvalds, 2 files, -2/+56]
Pull sysctl updates from Joel Granados:

- Move kern_table members out of kernel/sysctl.c

  Moved a subset (tracing, panic, signal, stack_tracer and sparc) out of the kern_table array. The goal is for kern_table to only have sysctl elements. All this increases modularity by placing the ctl_tables closer to where they are used, while reducing the chances of merge conflicts in kernel/sysctl.c.

- Fixed sysctl unit test panic by relocating it to selftests

* tag 'sysctl-6.16-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/sysctl/sysctl:
  sysctl: Close test ctl_headers with a for loop
  sysctl: call sysctl tests with a for loop
  sysctl: Add 0012 to test the u8 range check
  sysctl: move u8 register test to lib/test_sysctl.c
  sparc: mv sparc sysctls into their own file under arch/sparc/kernel
  stack_tracer: move sysctl registration to kernel/trace/trace_stack.c
  tracing: Move trace sysctls into trace.c
  signal: Move signal ctl tables into signal.c
  panic: Move panic ctl tables into panic.c
2025-05-27  ring-buffer: Move cpus_read_lock() outside of buffer->mutex  [Steven Rostedt, 1 file, -5/+6]
Running a modified trace-cmd record --nosplice, where '--nosplice' makes it mmap the ring buffer instead of splicing it, caused the following lockdep splat:

  ======================================================
  WARNING: possible circular locking dependency detected
  6.15.0-rc7-test-00002-gfb7d03d8a82f #551 Not tainted
  ------------------------------------------------------
  trace-cmd/1113 is trying to acquire lock:
  ffff888100062888 (&buffer->mutex){+.+.}-{4:4}, at: ring_buffer_map+0x11c/0xe70

  but task is already holding lock:
  ffff888100a5f9f8 (&cpu_buffer->mapping_lock){+.+.}-{4:4}, at: ring_buffer_map+0xcf/0xe70

  which lock already depends on the new lock.

  the existing dependency chain (in reverse order) is:

  -> #5 (&cpu_buffer->mapping_lock){+.+.}-{4:4}:
         __mutex_lock+0x192/0x18c0
         ring_buffer_map+0xcf/0xe70
         tracing_buffers_mmap+0x1c4/0x3b0
         __mmap_region+0xd8d/0x1f70
         do_mmap+0x9d7/0x1010
         vm_mmap_pgoff+0x20b/0x390
         ksys_mmap_pgoff+0x2e9/0x440
         do_syscall_64+0x79/0x1c0
         entry_SYSCALL_64_after_hwframe+0x76/0x7e

  -> #4 (&mm->mmap_lock){++++}-{4:4}:
         __might_fault+0xa5/0x110
         _copy_to_user+0x22/0x80
         _perf_ioctl+0x61b/0x1b70
         perf_ioctl+0x62/0x90
         __x64_sys_ioctl+0x134/0x190
         do_syscall_64+0x79/0x1c0
         entry_SYSCALL_64_after_hwframe+0x76/0x7e

  -> #3 (&cpuctx_mutex){+.+.}-{4:4}:
         __mutex_lock+0x192/0x18c0
         perf_event_init_cpu+0x325/0x7c0
         perf_event_init+0x52a/0x5b0
         start_kernel+0x263/0x3e0
         x86_64_start_reservations+0x24/0x30
         x86_64_start_kernel+0x95/0xa0
         common_startup_64+0x13e/0x141

  -> #2 (pmus_lock){+.+.}-{4:4}:
         __mutex_lock+0x192/0x18c0
         perf_event_init_cpu+0xb7/0x7c0
         cpuhp_invoke_callback+0x2c0/0x1030
         __cpuhp_invoke_callback_range+0xbf/0x1f0
         _cpu_up+0x2e7/0x690
         cpu_up+0x117/0x170
         cpuhp_bringup_mask+0xd5/0x120
         bringup_nonboot_cpus+0x13d/0x170
         smp_init+0x2b/0xf0
         kernel_init_freeable+0x441/0x6d0
         kernel_init+0x1e/0x160
         ret_from_fork+0x34/0x70
         ret_from_fork_asm+0x1a/0x30

  -> #1 (cpu_hotplug_lock){++++}-{0:0}:
         cpus_read_lock+0x2a/0xd0
         ring_buffer_resize+0x610/0x14e0
         __tracing_resize_ring_buffer.part.0+0x42/0x120
         tracing_set_tracer+0x7bd/0xa80
         tracing_set_trace_write+0x132/0x1e0
         vfs_write+0x21c/0xe80
         ksys_write+0xf9/0x1c0
         do_syscall_64+0x79/0x1c0
         entry_SYSCALL_64_after_hwframe+0x76/0x7e

  -> #0 (&buffer->mutex){+.+.}-{4:4}:
         __lock_acquire+0x1405/0x2210
         lock_acquire+0x174/0x310
         __mutex_lock+0x192/0x18c0
         ring_buffer_map+0x11c/0xe70
         tracing_buffers_mmap+0x1c4/0x3b0
         __mmap_region+0xd8d/0x1f70
         do_mmap+0x9d7/0x1010
         vm_mmap_pgoff+0x20b/0x390
         ksys_mmap_pgoff+0x2e9/0x440
         do_syscall_64+0x79/0x1c0
         entry_SYSCALL_64_after_hwframe+0x76/0x7e

  other info that might help us debug this:

  Chain exists of:
    &buffer->mutex --> &mm->mmap_lock --> &cpu_buffer->mapping_lock

  Possible unsafe locking scenario:

        CPU0                          CPU1
        ----                          ----
   lock(&cpu_buffer->mapping_lock);
                                      lock(&mm->mmap_lock);
                                      lock(&cpu_buffer->mapping_lock);
   lock(&buffer->mutex);

   *** DEADLOCK ***

  2 locks held by trace-cmd/1113:
   #0: ffff888106b847e0 (&mm->mmap_lock){++++}-{4:4}, at: vm_mmap_pgoff+0x192/0x390
   #1: ffff888100a5f9f8 (&cpu_buffer->mapping_lock){+.+.}-{4:4}, at: ring_buffer_map+0xcf/0xe70

  stack backtrace:
  CPU: 5 UID: 0 PID: 1113 Comm: trace-cmd Not tainted 6.15.0-rc7-test-00002-gfb7d03d8a82f #551 PREEMPT
  Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-debian-1.16.3-2 04/01/2014
  Call Trace:
   <TASK>
   dump_stack_lvl+0x6e/0xa0
   print_circular_bug.cold+0x178/0x1be
   check_noncircular+0x146/0x160
   __lock_acquire+0x1405/0x2210
   lock_acquire+0x174/0x310
   ? ring_buffer_map+0x11c/0xe70
   ? ring_buffer_map+0x11c/0xe70
   ? __mutex_lock+0x169/0x18c0
   __mutex_lock+0x192/0x18c0
   ? ring_buffer_map+0x11c/0xe70
   ? ring_buffer_map+0x11c/0xe70
   ? function_trace_call+0x296/0x370
   ? __pfx___mutex_lock+0x10/0x10
   ? __pfx_function_trace_call+0x10/0x10
   ? __pfx___mutex_lock+0x10/0x10
   ? _raw_spin_unlock+0x2d/0x50
   ? ring_buffer_map+0x11c/0xe70
   ? ring_buffer_map+0x11c/0xe70
   ? __mutex_lock+0x5/0x18c0
   ring_buffer_map+0x11c/0xe70
   ? do_raw_spin_lock+0x12d/0x270
   ? find_held_lock+0x2b/0x80
   ? _raw_spin_unlock+0x2d/0x50
   ? rcu_is_watching+0x15/0xb0
   ? _raw_spin_unlock+0x2d/0x50
   ? trace_preempt_on+0xd0/0x110
   tracing_buffers_mmap+0x1c4/0x3b0
   __mmap_region+0xd8d/0x1f70
   ? ring_buffer_lock_reserve+0x99/0xff0
   ? __pfx___mmap_region+0x10/0x10
   ? ring_buffer_lock_reserve+0x99/0xff0
   ? __pfx_ring_buffer_lock_reserve+0x10/0x10
   ? __pfx_ring_buffer_lock_reserve+0x10/0x10
   ? bpf_lsm_mmap_addr+0x4/0x10
   ? security_mmap_addr+0x46/0xd0
   ? lock_is_held_type+0xd9/0x130
   do_mmap+0x9d7/0x1010
   ? 0xffffffffc0370095
   ? __pfx_do_mmap+0x10/0x10
   vm_mmap_pgoff+0x20b/0x390
   ? __pfx_vm_mmap_pgoff+0x10/0x10
   ? 0xffffffffc0370095
   ksys_mmap_pgoff+0x2e9/0x440
   do_syscall_64+0x79/0x1c0
   entry_SYSCALL_64_after_hwframe+0x76/0x7e
  RIP: 0033:0x7fb0963a7de2
  Code: 00 00 00 0f 1f 44 00 00 41 f7 c1 ff 0f 00 00 75 27 55 89 cd 53 48 89 fb 48 85 ff 74 3b 41 89 ea 48 89 df b8 09 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 76 5b 5d c3 0f 1f 00 48 8b 05 e1 9f 0d 00 64
  RSP: 002b:00007ffdcc8fb878 EFLAGS: 00000246 ORIG_RAX: 0000000000000009
  RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007fb0963a7de2
  RDX: 0000000000000001 RSI: 0000000000001000 RDI: 0000000000000000
  RBP: 0000000000000001 R08: 0000000000000006 R09: 0000000000000000
  R10: 0000000000000001 R11: 0000000000000246 R12: 0000000000000000
  R13: 00007ffdcc8fbe68 R14: 00007fb096628000 R15: 00005633e01a5c90
   </TASK>

The issue is that cpus_read_lock() is taken within buffer->mutex. The memory mapped pages are taken with the mmap_lock held. The buffer->mutex is taken within the cpu_buffer->mapping_lock. There's quite a chain with all these locks, where the deadlock can be fixed by moving the cpus_read_lock() outside the taking of the buffer->mutex.

Cc: stable@vger.kernel.org
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Vincent Donnefort <vdonnefort@google.com>
Link: https://lore.kernel.org/20250527105820.0f45d045@gandalf.local.home
Fixes: 117c39200d9d7 ("ring-buffer: Introducing ring-buffer mapping functions")
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
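A sketch of the reordering the fix describes; the work done under the locks is elided, and only the acquisition order matters:

```c
/* Before: cpus_read_lock() nested inside buffer->mutex, which creates
 * the buffer->mutex -->> cpu_hotplug_lock edge closing the cycle. */
mutex_lock(&buffer->mutex);
cpus_read_lock();
/* ... resize/map work ... */
cpus_read_unlock();
mutex_unlock(&buffer->mutex);

/* After: cpus_read_lock() taken first, so the only ordering is
 * cpu_hotplug_lock -->> buffer->mutex, breaking the circular chain. */
cpus_read_lock();
mutex_lock(&buffer->mutex);
/* ... resize/map work ... */
mutex_unlock(&buffer->mutex);
cpus_read_unlock();
```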
2025-05-26  Merge tag 'for-6.16/block-20250523' of git://git.kernel.dk/linux  [Linus Torvalds, 1 file, -9/+2]
Pull block updates from Jens Axboe:

- ublk updates:
    - Add support for updating the size of a ublk instance
    - Zero-copy improvements
    - Auto-registering of buffers for zero-copy
    - Series simplifying and improving GET_DATA and request lookup
    - Series adding quiesce support
    - Lots of selftests additions
    - Various cleanups

- NVMe updates via Christoph:
    - add per-node DMA pools and use them for PRP/SGL allocations (Caleb Sander Mateos, Keith Busch)
    - nvme-fcloop refcounting fixes (Daniel Wagner)
    - support delayed removal of the multipath node and optionally support the multipath node for private namespaces (Nilay Shroff)
    - support shared CQs in the PCI endpoint target code (Wilfred Mallawa)
    - support admin-queue only authentication (Hannes Reinecke)
    - use the crc32c library instead of the crypto API (Eric Biggers)
    - misc cleanups (Christoph Hellwig, Marcelo Moreira, Hannes Reinecke, Leon Romanovsky, Gustavo A. R. Silva)

- MD updates via Yu:
    - Fix that normal IO can be starved by sync IO, found by mkfs on newly created large raid5, with some clean up patches for bdev inflight counters

- Clean up brd, getting rid of atomic kmaps and bvec poking
- Add loop driver specifically for zoned IO testing
- Eliminate blk-rq-qos calls with a static key, if not enabled
- Improve hctx locking for when a plug has IO for multiple queues pending
- Remove block layer bouncing support, which in turn means we can remove the per-node bounce stat as well
- Improve blk-throttle support
- Improve delay support for blk-throttle
- Improve brd discard support
- Unify IO scheduler switching. This should also fix a bunch of lockdep warnings we've been seeing, after enabling lockdep support for queue freezing/unfreezing
- Add support for block write streams via FDP (flexible data placement) on NVMe
- Add a bunch of block helpers, facilitating the removal of a bunch of duplicated boilerplate code
- Remove obsolete BLK_MQ pci and virtio Kconfig options
- Add atomic/untorn write support to blktrace
- Various little cleanups and fixes

* tag 'for-6.16/block-20250523' of git://git.kernel.dk/linux: (186 commits)
  selftests: ublk: add test for UBLK_F_QUIESCE
  ublk: add feature UBLK_F_QUIESCE
  selftests: ublk: add test case for UBLK_U_CMD_UPDATE_SIZE
  traceevent/block: Add REQ_ATOMIC flag to block trace events
  ublk: run auto buf unregisgering in same io_ring_ctx with registering
  io_uring: add helper io_uring_cmd_ctx_handle()
  ublk: remove io argument from ublk_auto_buf_reg_fallback()
  ublk: handle ublk_set_auto_buf_reg() failure correctly in ublk_fetch()
  selftests: ublk: add test for covering UBLK_AUTO_BUF_REG_FALLBACK
  selftests: ublk: support UBLK_F_AUTO_BUF_REG
  ublk: support UBLK_AUTO_BUF_REG_FALLBACK
  ublk: register buffer to local io_uring with provided buf index via UBLK_F_AUTO_BUF_REG
  ublk: prepare for supporting to register request buffer automatically
  ublk: convert to refcount_t
  selftests: ublk: make IO & device removal test more stressful
  nvme: rename nvme_mpath_shutdown_disk to nvme_mpath_remove_disk
  nvme: introduce multipath_always_on module param
  nvme-multipath: introduce delayed removal of the multipath head node
  nvme-pci: derive and better document max segments limits
  nvme-pci: use struct_size for allocation struct nvme_dev
  ...
2025-05-23  bpf: Fix error return value in bpf_copy_from_user_dynptr  [Mykyta Yatsenko, 1 file, -2/+6]
On error, copy_from_user returns the number of bytes not copied to the destination, but the current implementation of copy_user_data_sleepable does not handle that correctly and returns it as the error value, which may confuse users expecting a meaningful negative error value.

Fixes: a498ee7576de ("bpf: Implement dynptr copy kfuncs")
Reported-by: Dan Carpenter <dan.carpenter@linaro.org>
Signed-off-by: Mykyta Yatsenko <yatsenko@meta.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20250523181705.261585-1-mykyta.yatsenko5@gmail.com
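A minimal sketch of the convention at issue, with an illustrative helper name rather than the bpf code itself:

```c
#include <linux/uaccess.h>

/* Sketch: translate copy_from_user()'s "bytes not copied" into -EFAULT */
static long copy_user_data(void *dst, const void __user *src, size_t size)
{
	/*
	 * copy_from_user() returns 0 on success, or the number of bytes
	 * that could NOT be copied. That positive count is not an errno
	 * and must not be passed back to the caller as one.
	 */
	if (copy_from_user(dst, src, size))
		return -EFAULT;

	return 0;
}
```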
2025-05-23  traceevent/block: Add REQ_ATOMIC flag to block trace events  [Ritesh Harjani (IBM), 1 file, -0/+2]
Filesystems like XFS can implement atomic write I/O using either REQ_ATOMIC flag set in the bio or via CoW operation. It will be useful if we have a flag in trace events to distinguish between the two. This patch adds char 'U' (Untorn writes) to rwbs field of the trace events if REQ_ATOMIC flag is set in the bio.

  <W/ REQ_ATOMIC>
  =================
  xfs_io-4238 [009] ..... 4148.126843: block_rq_issue: 259,0 WFSU 16384 () 768 + 32 none,0,0 [xfs_io]
  <idle>-0 [009] d.h1. 4148.129864: block_rq_complete: 259,0 WFSU () 768 + 32 none,0,0 [0]

  <W/O REQ_ATOMIC>
  ===============
  xfs_io-4237 [010] ..... 4143.325616: block_rq_issue: 259,0 WS 16384 () 768 + 32 none,0,0 [xfs_io]
  <idle>-0 [010] d.H1. 4143.329138: block_rq_complete: 259,0 WS () 768 + 32 none,0,0 [0]

Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
Reviewed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
Link: https://lore.kernel.org/r/44317cb2ec4588f6a2c1501a96684e6a1196e8ba.1747921498.git.ritesh.list@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
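The rwbs decoding lives in blk_fill_rwbs(); a sketch of the kind of two-line addition described, with the exact placement inside the function assumed rather than quoted from the patch:

```c
/* Sketch: flag untorn (atomic) writes with 'U' in the rwbs string */
void blk_fill_rwbs(char *rwbs, blk_opf_t opf)
{
	int i = 0;

	/* ... existing decoding of the W/R/D/F/S flag characters elided ... */

	if (opf & REQ_ATOMIC)
		rwbs[i++] = 'U';	/* untorn write */

	rwbs[i] = '\0';
}
```

That is why the WFSU/WS difference shows up in the block_rq_issue and block_rq_complete output above.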
2025-05-22  bpf: Revert "bpf: remove unnecessary rcu_read_{lock,unlock}() in multi-uprobe attach logic"  [Di Shen, 1 file, -0/+2]
multi-uprobe attach logic" This reverts commit 4a8f635a6054. Althought get_pid_task() internally already calls rcu_read_lock() and rcu_read_unlock(), the find_vpid() was not. The documentation for find_vpid() clearly states: "Must be called with the tasklist_lock or rcu_read_lock() held." Add proper rcu_read_lock/unlock() to protect the find_vpid(). Fixes: 4a8f635a6054 ("bpf: remove unnecessary rcu_read_{lock,unlock}() in multi-uprobe attach logic") Reported-by: Xuewen Yan <xuewen.yan@unisoc.com> Signed-off-by: Di Shen <di.shen@unisoc.com> Acked-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/r/20250520054943.5002-1-xuewen.yan@unisoc.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-05-14  Merge tag 'trace-v6.15-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace  [Linus Torvalds, 3 files, -9/+7]
Pull tracing fixes from Steven Rostedt:

- Fix sample code that uses trace_array_printk()

  The sample code for in-kernel use of trace_array (which creates an instance for use within the kernel) shows how to use trace_array_printk(), which writes into the created instance, but it used trace_printk_init_buffers(). That function is for initializing normal trace_printk() and produces the NOTICE banner, which is not needed for use of trace_array_printk(). The function to initialize that is trace_array_init_printk(), which takes the created trace array instance as a parameter. Update the sample code to reflect the proper usage.

- Fix preemption count output for stacktrace event

  The tracing buffer shows the preempt count level when an event executes. Because writing the event itself disables preemption, this needs to be accounted for when recording. The stacktrace event did not account for this, so the output of the stacktrace event showed preemption was disabled while the event that triggered the stacktrace showed preemption enabled, and this leads to confusion. Account for preemption being disabled for the stacktrace event. The same happened for stack traces triggered by the function tracer.

- Fix persistent ring buffer when trace_pipe is used

  The ring buffer swaps the reader page with the next page to read from the write buffer when trace_pipe is used. If there's only a page of data in the ring buffer, this swap will cause the "commit" pointer (last data written) to be on the reader page. If more data is written to the buffer, it is added to the reader page until it falls off back into the write buffer.

  If the system reboots and the commit pointer is still on the reader page, even if new data was written, the persistent buffer validator will miss finding the commit pointer, because it only checks the write buffer and does not check the reader page. This causes the validator to fail the validation and clear the buffer, where the new data is lost.

  There was a check for this, but it checked the "head pointer", which was incorrect, because the "head pointer" always stays on the write buffer and is the next page to swap out for the reader page. Fix the logic to catch this case and allow the user to still read the data after reboot.

* tag 'trace-v6.15-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
  ring-buffer: Fix persistent buffer when commit page is the reader page
  ftrace: Fix preemption accounting for stacktrace filter command
  ftrace: Fix preemption accounting for stacktrace trigger command
  tracing: samples: Initialize trace_array_printk() with the correct function
2025-05-14  ring-buffer: Fix persistent buffer when commit page is the reader page  [Steven Rostedt, 1 file, -3/+5]
The ring buffer is made up of sub-buffers (sometimes called pages, as they are by default PAGE_SIZE). It has the following "pages":

  "tail page"   - the page that the next write will write to
  "head page"   - the page that the reader will swap the reader page with
  "reader page" - belongs to the reader; the reader swaps the head page out of the ring buffer into it, so that it does not race with the writer

The writer may end up on the "reader page" if the ring buffer hasn't written more than one page, where the "tail page" and the "head page" are the same.

The persistent ring buffer has meta data that points to where these pages exist, so on reboot it can re-create the pointers to the cpu_buffer descriptor. But when the commit page is on the reader page, the logic is incorrect. The check to see if the commit page is on the reader page tested whether the head page was the reader page, which would never happen, as the head page is always in the ring buffer. The correct check is to test whether the commit page is on the reader page. If that's the case, it can exit out early, as the commit page is only on the reader page when there's only one page of data in the buffer. There's no reason to iterate the ring buffer pages to find the "commit page", as it is already found.

To trigger this bug:

  # echo 1 > /sys/kernel/tracing/instances/boot_mapped/events/syscalls/sys_enter_fchownat/enable
  # touch /tmp/x
  # chown sshd /tmp/x
  # reboot

On boot up, the dmesg will have:

  Ring buffer meta [0] is from previous boot!
  Ring buffer meta [1] is from previous boot!
  Ring buffer meta [2] is from previous boot!
  Ring buffer meta [3] is from previous boot!
  Ring buffer meta [4] commit page not found
  Ring buffer meta [5] is from previous boot!
  Ring buffer meta [6] is from previous boot!
  Ring buffer meta [7] is from previous boot!

where the buffer on CPU 4 had a "commit page not found" error; that buffer is cleared and reset, causing the output to be empty and the data lost.

When it works correctly, it has:

  # cat /sys/kernel/tracing/instances/boot_mapped/trace_pipe
  <...>-1137 [004] ..... 998.205323: sys_enter_fchownat: __syscall_nr=0x104 (260) dfd=0xffffff9c (4294967196) filename=(0xffffc90000a0002c) user=0x3e8 (1000) group=0xffffffff (4294967295) flag=0x0 (0

Cc: stable@vger.kernel.org
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://lore.kernel.org/20250513115032.3e0b97f7@gandalf.local.home
Fixes: 5f3b6e839f3ce ("ring-buffer: Validate boot range memory events")
Reported-by: Tasos Sahanidis <tasos@tasossah.com>
Tested-by: Tasos Sahanidis <tasos@tasossah.com>
Reviewed-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
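A sketch of the corrected early-exit check described above; the meta/page field names are assumptions modeled on ring_buffer.c, not the literal diff:

```c
/*
 * Sketch: while validating the persistent buffer, check the reader page
 * for the saved commit pointer before walking the write buffer. The
 * commit page sits on the reader page only when the buffer holds a
 * single page of data, so the page walk can be skipped entirely.
 */
if ((unsigned long)cpu_buffer->reader_page->page == meta->commit_buffer) {
	cpu_buffer->commit_page = cpu_buffer->reader_page;
	return true;	/* commit page found; nothing more to iterate */
}
```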