path: root/kernel
2025-10-22  blktrace: handle BLKTRACESETUP2 ioctl  (Johannes Thumshirn, 1 file, +36/-0)

Handle the BLKTRACESETUP2 ioctl, requesting an extended version of the blktrace protocol from user-space.

Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
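As a rough sketch of what the dispatch could look like in blk_trace_ioctl(); the blk_trace_setup2() handler name is an assumption, not necessarily what the patch uses:

	/* Hypothetical shape of the ioctl dispatch. */
	switch (cmd) {
	case BLKTRACESETUP:
		ret = blk_trace_setup(q, b, bdev->bd_dev, bdev, arg);
		break;
	case BLKTRACESETUP2:
		/* Same flow, but copies the v2 setup struct carrying
		 * the widened 64-bit act_mask. */
		ret = blk_trace_setup2(q, b, bdev->bd_dev, bdev, arg);
		break;
	default:
		ret = -ENOTTY;
		break;
	}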
2025-10-22  blktrace: trace zone write plugging operations  (Johannes Thumshirn, 1 file, +39/-0)

Trace zone write plugging operations on block devices. As tracing of zoned block commands needs the upper 32bit of the widened 64bit action, only add traces to blktrace if user-space has requested version 2 of the blktrace protocol.

Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-10-22  blktrace: expose ZONE APPEND completions to blktrace  (Johannes Thumshirn, 1 file, +21/-0)

Expose ZONE APPEND completions as a block trace completion action to blktrace. As tracing of zoned block commands needs the upper 32bit of the widened 64bit action, only add traces to blktrace if user-space has requested version 2 of the blktrace protocol.

Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-10-22  blktrace: add block trace commands for zone operations  (Johannes Thumshirn, 1 file, +25/-4)

Add block trace commands for zone operations. These commands can only be handled with version 2 of the blktrace protocol. For version 1, warn if a command that does not fit into the 16 bits reserved for the command in this version is passed in.

Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-10-22  blktrace: move ftrace blk_io_tracer to blk_io_trace2  (Johannes Thumshirn, 1 file, +8/-8)

Move ftrace's blk_io_tracer to the new blk_io_trace2 infrastructure.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-10-22  blktrace: move trace_note to blk_io_trace2  (Johannes Thumshirn, 1 file, +7/-7)

Move trace_note() to the new blk_io_trace2 infrastructure.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-10-22  blktrace: differentiate between blk_io_trace versions  (Johannes Thumshirn, 1 file, +57/-1)

Differentiate between blk_io_trace and blk_io_trace2 when relaying to user-space, depending on which version has been requested by the blktrace utility.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-10-22  blktrace: add definitions for struct blk_io_trace2  (Johannes Thumshirn, 1 file, +1/-0)

Add definitions for the extended version of the blktrace protocol using a wider action type to be able to record new actions in the kernel.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-10-22  blktrace: pass blk_user_trace2 to setup functions  (Johannes Thumshirn, 1 file, +22/-9)

Pass struct blk_user_trace_setup2 to blktrace_setup_finalize(). This prepares for the incoming extension of the blktrace protocol with a 64bit act_mask.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-10-22  blktrace: add definitions for blk_user_trace_setup2  (Johannes Thumshirn, 1 file, +3/-0)

Add definitions for a version 2 of the blk_user_trace_setup ioctl. This new ioctl will enable a different struct layout of the binary data passed to user-space when using a new version of the blktrace utility requesting the new struct layout.

Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
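For illustration, a v2 setup structure might mirror the v1 layout with a widened action mask. Only the 64-bit act_mask is stated by this series; the remaining fields below are an assumption based on struct blk_user_trace_setup:

	struct blk_user_trace_setup2 {
		char name[BLKTRACE_BDEV_SIZE];	/* output base name */
		__u64 act_mask;			/* widened from __u16 in v1 */
		__u32 buf_size;			/* sub-buffer size */
		__u32 buf_nr;			/* number of sub-buffers */
		__u64 start_lba;
		__u64 end_lba;
		__u32 pid;
	};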
2025-10-22  blktrace: split do_blk_trace_setup into two functions  (Johannes Thumshirn, 1 file, +56/-38)

Split do_blk_trace_setup() into two functions to prepare for an incoming new BLKTRACESETUP2 ioctl(2), which can receive extended parameters from user-space. Also move the size verification logic to the callers in preparation for using a new internal structure later.

Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-10-22  blktrace: change the internal action to 64bit  (Johannes Thumshirn, 1 file, +19/-19)

Change the internal use of the action in blktrace to 64bit, although for now only the lower 32 bits will be used. With the upcoming version 2 of the blktrace user-space protocol the upper 32 bits will also be utilized.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-10-22  blktrace: untangle if/else sequence in __blk_add_trace  (Johannes Thumshirn, 1 file, +11/-2)

Untangle the if/else sequence setting the trace action in __blk_add_trace() and turn it into a switch statement for better extensibility.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
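A sketch of the shape of the change, assuming the chain in __blk_add_trace() selects the trace action from the request operation (the exact cases may differ):

	/* Before: chained conditionals, awkward to extend */
	if (op == REQ_OP_DISCARD)
		what |= BLK_TC_ACT(BLK_TC_DISCARD);
	else if (op == REQ_OP_FLUSH)
		what |= BLK_TC_ACT(BLK_TC_FLUSH);

	/* After: a switch leaves room for new (e.g. zoned) operations */
	switch (op) {
	case REQ_OP_DISCARD:
		what |= BLK_TC_ACT(BLK_TC_DISCARD);
		break;
	case REQ_OP_FLUSH:
		what |= BLK_TC_ACT(BLK_TC_FLUSH);
		break;
	default:
		break;
	}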
2025-10-22  blktrace: split out relaying a blktrace event  (Johannes Thumshirn, 1 file, +32/-28)

Split out the code relaying a blktrace event to user-space using relayfs. This enables adding a second version supporting a new version of the protocol.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-10-22  blktrace: factor out recording a blktrace event  (Johannes Thumshirn, 1 file, +49/-40)

Factor out the recording of a blktrace event into its own function, deduplicating the code. This also enables recording different versions of the blktrace protocol later on.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-10-22  blktrace: only calculate trace length once  (Johannes Thumshirn, 1 file, +8/-6)

De-duplicate the calculation of the trace length instead of doing it twice, once for calling trace_buffer_lock_reserve() and once for calling relay_reserve().

Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
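In other words, something along these lines (a sketch of the idea, not the exact patch; the record is a header plus PDU and cgid payload):

	/* Compute the record length once and reuse it for both
	 * reservation paths instead of recomputing it at each call. */
	const size_t trace_len = sizeof(*t) + pdu_len + cgid_len;

	/* ftrace path */
	event = trace_buffer_lock_reserve(buffer, TRACE_BLK, trace_len, trace_ctx);
	/* relayfs path */
	t = relay_reserve(bt->rchan, trace_len);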
2025-10-22  sched/fair: Start a cfs_rq on throttled hierarchy with PELT clock throttled  (K Prateek Nayak, 1 file, +12/-0)

Matteo reported hitting the assert_list_leaf_cfs_rq() warning from enqueue_task_fair() post commit fe8d238e646e ("sched/fair: Propagate load for throttled cfs_rq") which transitioned to using the cfs_rq_pelt_clock_throttled() check for leaf cfs_rq insertions in propagate_entity_cfs_rq().

The "cfs_rq->pelt_clock_throttled" flag is used to indicate if the hierarchy has its PELT frozen. If a cfs_rq's PELT is marked frozen, all its descendants should have their PELT frozen too, or weird things can happen as a result of children accumulating PELT signals when the parents have their PELT clock stopped. Another side effect of this is the loss of integrity of the leaf cfs_rq list.

As debugged by Aaron, consider the following hierarchy:

	     root(#)
	     /    \
	   A(#)   B(*)
	           |
	           C  <--- new cgroup
	           |
	           D  <--- new cgroup

	# - Already on leaf cfs_rq list
	* - Throttled with PELT frozen

The newly created cgroups don't have their "pelt_clock_throttled" signal synced with cgroup B. Next, the following series of events occurs:

1. online_fair_sched_group() for cgroup D will call propagate_entity_cfs_rq(). (The same can happen if a throttled task is moved to cgroup C and enqueue_task_fair() returns early.)

   propagate_entity_cfs_rq() adds the cfs_rq of cgroup C to "rq->tmp_alone_branch", since its PELT clock is not marked throttled and the cfs_rq of cgroup B is not on the list.

   The cfs_rq of cgroup B is skipped since its PELT is throttled. The root cfs_rq already exists on the list, leading to list_add_leaf_cfs_rq() returning early.

   The cfs_rq of cgroup C is left dangling on "rq->tmp_alone_branch".

2. A new task wakes up on cgroup A. Since the whole hierarchy is already on the leaf cfs_rq list, list_add_leaf_cfs_rq() keeps returning early without any modifications to "rq->tmp_alone_branch".

   The final assert_list_leaf_cfs_rq() in enqueue_task_fair() sees the dangling reference to cgroup C's cfs_rq in "rq->tmp_alone_branch".

   !!! Splat !!!

Syncing the "pelt_clock_throttled" indicator with the parent cfs_rq is not enough, since the new cfs_rq is not yet enqueued on the hierarchy. A dequeue on another subtree of the throttled hierarchy can freeze the PELT clock for the parent hierarchy without setting the indicators for this newly added cfs_rq which was never enqueued.

Since there are no tasks on the new hierarchy, start a cfs_rq on a throttled hierarchy with its PELT clock throttled. The first enqueue, or the distribution (whichever happens first), will unfreeze the PELT clock and queue the cfs_rq on the leaf cfs_rq list.

While at it, add an assert_list_leaf_cfs_rq() in propagate_entity_cfs_rq() to catch such cases in the future.

Closes: https://lore.kernel.org/lkml/58a587d694f33c2ea487c700b0d046fa@codethink.co.uk/
Fixes: e1fad12dcb66 ("sched/fair: Switch to task based throttle model")
Reported-by: Matteo Martelli <matteo.martelli@codethink.co.uk>
Suggested-by: Aaron Lu <ziqianlu@bytedance.com>
Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Aaron Lu <ziqianlu@bytedance.com>
Tested-by: Aaron Lu <ziqianlu@bytedance.com>
Tested-by: Matteo Martelli <matteo.martelli@codethink.co.uk>
Link: https://patch.msgid.link/20251021053522.37583-1-kprateek.nayak@amd.com
2025-10-22  printk_ringbuffer: don't needlessly wrap data blocks around  (Daniil Tatianin, 1 file, +20/-8)

Previously, data blocks that perfectly fit the data ring buffer would get wrapped around to the beginning for no reason, since the calculated offset of the next data block would belong to the next wrap. Since this offset is not actually part of the data block, but rather the offset of where the next data block is going to start, there is no reason to include it when deciding whether the current block fits the buffer.

Signed-off-by: Daniil Tatianin <d-tatianin@yandex-team.ru>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Tested-by: Petr Mladek <pmladek@suse.com>
Reviewed-by: John Ogness <john.ogness@linutronix.de>
Tested-by: John Ogness <john.ogness@linutronix.de>
Link: https://patch.msgid.link/20250905144152.9137-2-d-tatianin@yandex-team.ru
[pmladek@suse.com: Updated indentation.]
Signed-off-by: Petr Mladek <pmladek@suse.com>
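The boundary condition can be illustrated with plain offsets; this is a simplified model, not the ring buffer's actual lpos macros:

	/* 'off' is the block's start offset within the current wrap of a
	 * data ring of 'buf_size' bytes. */
	if (off + size > buf_size) {
		/* The block itself crosses the end: wrap to offset 0. */
	}
	/* off + size == buf_size fits exactly. The *next* block would
	 * start at the wrap point, but that offset is not part of this
	 * block, so no wraparound is needed. An effectively '>='-style
	 * check here is what previously wasted the tail of the buffer. */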
2025-10-21  sched_ext: Fix scx_bpf_dsq_insert() backward binary compatibility  (Tejun Heo, 1 file, +6/-6)

cded46d97159 ("sched_ext: Make scx_bpf_dsq_insert*() return bool") introduced a new bool-returning scx_bpf_dsq_insert() and renamed the old void-returning version to scx_bpf_dsq_insert___compat, with the expectation that libbpf would match old binaries to the ___compat variant, maintaining backward binary compatibility. However, while libbpf ignores ___suffix on the BPF side when matching symbols, it doesn't do so for kernel-side symbols. Old binaries compiled with the original scx_bpf_dsq_insert() could no longer resolve the symbol.

Fix by reversing the naming: keep scx_bpf_dsq_insert() as the old void-returning interface and add ___v2 to the new bool-returning version. This allows old binaries to continue working while new code can use the ___v2 variant. Once libbpf is updated to ignore kernel-side ___SUFFIX, the ___v2 suffix can be dropped when the compat interface is removed.

v2: Use ___v2 instead of ___new.

Fixes: cded46d97159 ("sched_ext: Make scx_bpf_dsq_insert*() return bool")
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-10-21  bpf: make bpf_insn_successors() return a pointer  (Anton Protopopov, 2 files, +64/-32)

The bpf_insn_successors() function is used to return the successors of a BPF instruction. So far, an instruction could have 0, 1 or 2 successors. Prepare the verifier code for the introduction of instructions with more than 2 successors (namely, indirect jumps). To do this, introduce a new struct, struct bpf_iarray, containing an array of BPF instruction indexes, and make bpf_insn_successors() return a pointer of that type. The storage is allocated in env->succ, which holds an array of size 2 shared by all instructions.

Signed-off-by: Anton Protopopov <a.s.protopopov@gmail.com>
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20251019202145.3944697-10-a.s.protopopov@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
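Conceptually, the new container looks something like this; the field names are an assumption based on the description:

	/* A counted array of instruction indexes. */
	struct bpf_iarray {
		int cnt;	/* number of valid successors */
		u32 items[];	/* instruction indexes of the successors */
	};

	/* bpf_insn_successors() then returns a struct bpf_iarray *,
	 * pointing into env->succ (sized for 2 entries) for ordinary
	 * instructions, leaving room for larger arrays (indirect jumps)
	 * later. */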
2025-10-21  bpf: generalize and export map_get_next_key for arrays  (Anton Protopopov, 1 file, +9/-10)

The kernel/bpf/array.c file defines the array_map_get_next_key() function, which finds the next key for array maps. It doesn't actually use any map fields besides the generic max_entries field. Generalize it and export it as bpf_array_get_next_key() so that it can be re-used by other array-like maps.

Signed-off-by: Anton Protopopov <a.s.protopopov@gmail.com>
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20251019202145.3944697-4-a.s.protopopov@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
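Since only max_entries is consulted, the generalized helper can take the generic struct bpf_map directly. A sketch of what it plausibly looks like, following the existing array map logic:

	/* Next key for a map keyed by indexes 0 .. max_entries-1. */
	int bpf_array_get_next_key(struct bpf_map *map, void *key, void *next_key)
	{
		u32 index = key ? *(u32 *)key : U32_MAX;
		u32 *next = next_key;

		if (index >= map->max_entries) {
			*next = 0;		/* invalid key: start at the first slot */
			return 0;
		}
		if (index == map->max_entries - 1)
			return -ENOENT;		/* no key after the last slot */
		*next = index + 1;
		return 0;
	}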
2025-10-21  bpf: save the start of functions in bpf_prog_aux  (Anton Protopopov, 1 file, +1/-0)

Introduce a new subprog_start field in bpf_prog_aux. This field may be used by JIT compilers wanting to know the real absolute xlated offset of the function being jitted. The func_info[func_id] may have served this purpose, but func_info may be NULL, so JIT compilers can't rely on it.

Signed-off-by: Anton Protopopov <a.s.protopopov@gmail.com>
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20251019202145.3944697-3-a.s.protopopov@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-10-21  bpf: fix the return value of push_stack  (Anton Protopopov, 1 file, +40/-40)

In [1] Eduard mentioned that on push_stack() failure the verifier code should return -ENOMEM instead of -EFAULT. After checking the other call sites, I found that the code inconsistently returns either -ENOMEM or -EFAULT. This patch unifies the return values for the push_stack() (and the similar push_async_cb()) functions such that error codes are always assigned properly.

[1] https://lore.kernel.org/bpf/20250615085943.3871208-1-a.s.protopopov@gmail.com

Signed-off-by: Anton Protopopov <a.s.protopopov@gmail.com>
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20251019202145.3944697-2-a.s.protopopov@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
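The unified pattern at the call sites is roughly the following (a sketch; the exact error propagation in the patch may differ):

	/* push_stack() fails only on allocation failure, so callers
	 * should report -ENOMEM rather than the catch-all -EFAULT. */
	other_branch = push_stack(env, target_insn_idx, cur_insn_idx, false);
	if (!other_branch)
		return -ENOMEM;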
2025-10-21  bpf: Sync pending IRQ work before freeing ring buffer  (Noorain Eqbal, 1 file, +2/-0)

Fix a race where irq_work can be queued in bpf_ringbuf_commit() but the ring buffer is freed before the work executes.

In the syzbot reproducer, a BPF program attached to sched_switch triggers bpf_ringbuf_commit(), queuing an irq_work. If the ring buffer is freed before this work executes, the irq_work thread may access freed memory.

Calling `irq_work_sync(&rb->work)` ensures that all pending irq_work completes before freeing the buffer.

Fixes: 457f44363a88 ("bpf: Implement BPF ring buffer and verifier support for it")
Reported-by: syzbot+2617fc732430968b45d2@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=2617fc732430968b45d2
Tested-by: syzbot+2617fc732430968b45d2@syzkaller.appspotmail.com
Signed-off-by: Noorain Eqbal <nooraineqbal@gmail.com>
Link: https://lore.kernel.org/r/20251020180301.103366-1-nooraineqbal@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
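The shape of the fix in the free path, sketched under the assumption that rb->work is the irq_work queued by bpf_ringbuf_commit():

	static void bpf_ringbuf_free(struct bpf_ringbuf *rb)
	{
		/* Wait for any irq_work queued by bpf_ringbuf_commit()
		 * to finish before the backing pages go away. */
		irq_work_sync(&rb->work);

		/* ... then unmap and free the ring buffer area as before. */
	}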
2025-10-21  bpf: Clarify get_outer_instance() handling in propagate_to_outer_instance()  (Shardul Bankar, 1 file, +2/-0)

propagate_to_outer_instance() calls get_outer_instance() and uses the returned pointer to reset and commit stack write marks. Under normal conditions, update_instance() guarantees that an outer instance exists, so get_outer_instance() cannot return an ERR_PTR. However, explicitly checking for IS_ERR(outer_instance) makes this code more robust and self-documenting. It reduces cognitive load when reading the control flow and silences potential false-positive reports from static analysis or automated tooling.

No functional change intended.

Signed-off-by: Shardul Bankar <shardulsb08@gmail.com>
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20251021080849.860072-1-shardulsb08@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-10-21  seqlock: Change thread_group_cputime() to use scoped_seqlock_read()  (Oleg Nesterov, 1 file, +5/-15)

To simplify the code and make it more readable. While at it, change thread_group_cputime() to use __for_each_thread(sig).

[peterz: update to new interface]

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
2025-10-21  locking/spinlock/debug: Fix data-race in do_raw_write_lock  (Alexander Sverdlin, 1 file, +2/-2)

KCSAN reports:

  BUG: KCSAN: data-race in do_raw_write_lock / do_raw_write_lock

  write (marked) to 0xffff800009cf504c of 4 bytes by task 1102 on cpu 1:
   do_raw_write_lock+0x120/0x204
   _raw_write_lock_irq
   do_exit
   call_usermodehelper_exec_async
   ret_from_fork

  read to 0xffff800009cf504c of 4 bytes by task 1103 on cpu 0:
   do_raw_write_lock+0x88/0x204
   _raw_write_lock_irq
   do_exit
   call_usermodehelper_exec_async
   ret_from_fork

  value changed: 0xffffffff -> 0x00000001

  Reported by Kernel Concurrency Sanitizer on:
  CPU: 0 PID: 1103 Comm: kworker/u4:1 6.1.111

Commit 1a365e822372 ("locking/spinlock/debug: Fix various data races") has addressed most of these races, but seems to be not consistent/not complete. From do_raw_write_lock(), only the debug_write_lock_after() part has been converted to WRITE_ONCE(), but not the debug_write_lock_before() part. Do it now.

Fixes: 1a365e822372 ("locking/spinlock/debug: Fix various data races")
Reported-by: Adrian Freihofer <adrian.freihofer@siemens.com>
Signed-off-by: Alexander Sverdlin <alexander.sverdlin@siemens.com>
Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Paul E. McKenney <paulmck@kernel.org>
Acked-by: Waiman Long <longman@redhat.com>
Cc: stable@vger.kernel.org
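A sketch of the shape of the fix, assuming the racy accesses are the owner bookkeeping reads in debug_write_lock_before() that race with the marked stores in debug_write_lock_after() (the exact lines changed may differ):

	static inline void debug_write_lock_before(rwlock_t *lock)
	{
		RWLOCK_BUG_ON(lock->magic != RWLOCK_MAGIC, lock, "bad magic");
		/* Mark the racy reads to pair with the WRITE_ONCE() stores
		 * in debug_write_lock_after(): */
		RWLOCK_BUG_ON(READ_ONCE(lock->owner) == current, lock, "recursion");
		RWLOCK_BUG_ON(READ_ONCE(lock->owner_cpu) == raw_smp_processor_id(),
			      lock, "cpu recursion");
	}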
2025-10-20  Merge tag 'cgroup-for-6.18-rc2-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup  (Linus Torvalds, 1 file, +1/-1)

Pull cgroup fixes from Tejun Heo:

 - Fix seqcount lockdep assertion failure in cgroup freezer on PREEMPT_RT. Plain seqcount_t expects preemption disabled, but PREEMPT_RT spinlocks don't disable preemption. Switch to seqcount_spinlock_t to properly associate css_set_lock with the freeze timing seqcount.

 - Misc changes including kernel-doc warning fix for misc_res_type enum and improved selftest diagnostics.

* tag 'cgroup-for-6.18-rc2-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
  cgroup/misc: fix misc_res_type kernel-doc warning
  selftests: cgroup: Use values_close_report in test_cpu
  selftests: cgroup: add values_close_report helper
  cgroup: Fix seqcount lockdep assertion in cgroup freezer
2025-10-20  PM: hibernate: Rework message printing in swsusp_save()  (Rafael J. Wysocki, 1 file, +6/-7)

The messages printed by swsusp_save() are basically only useful for debug, so printing them every time a hibernation image is created at the "info" log level is not particularly useful. Also, printing a message on a failing memory allocation is redundant.

Use pm_deferred_pr_dbg() for printing those messages so they will only be printed when requested; the "deferred" variant is used because this code runs in a deeply atomic context (one CPU with interrupts off, no functional devices). Also drop the useless message printed when a memory allocation fails.

While at it, extend one of the messages in question so it is less cryptic.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
[ rjw: Dropped a useless colon at the end of one of the messages ]
Link: https://patch.msgid.link/10750389.nUPlyArG6x@rafael.j.wysocki
2025-10-20  genirq/msi: Slightly simplify msi_domain_alloc()  (Christophe JAILLET, 1 file, +1/-1)

The return value of irq_find_mapping() is only tested, not used for anything else. Replace it with irq_resolve_mapping(), which is used internally by irq_find_mapping() and allows a simple boolean decision.

Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://patch.msgid.link/1ce680114cdb8d40b072c54d7f015696a540e5a6.1760863194.git.christophe.jaillet@wanadoo.fr
2025-10-20  timekeeping: Fix aux clocks sysfs initialization loop bound  (Haofeng Li, 1 file, +1/-1)

The loop in tk_aux_sysfs_init() uses `i <= MAX_AUX_CLOCKS` as the termination condition, which results in 9 iterations (i = 0 to 8) when MAX_AUX_CLOCKS is defined as 8. However, the kernel is designed to support only up to 8 auxiliary clocks. This off-by-one error causes the creation of a 9th sysfs entry that exceeds the intended auxiliary clock range.

Fix the loop bound to use `i < MAX_AUX_CLOCKS` to ensure exactly 8 auxiliary clock entries are created, matching the design specification.

Fixes: 7b95663a3d96 ("timekeeping: Provide interface to control auxiliary clocks")
Signed-off-by: Haofeng Li <lihaofeng@kylinos.cn>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://patch.msgid.link/tencent_2376993D9FC06A3616A4F981B3DE1C599607@qq.com
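The classic off-by-one correction, sketched (tk_aux_clock_add() is a hypothetical per-clock helper standing in for the sysfs entry creation):

	/* Before: runs for i = 0..MAX_AUX_CLOCKS, creating 9 entries. */
	for (i = 0; i <= MAX_AUX_CLOCKS; i++)
		tk_aux_clock_add(i);

	/* After: runs for i = 0..MAX_AUX_CLOCKS-1, exactly 8 entries. */
	for (i = 0; i < MAX_AUX_CLOCKS; i++)
		tk_aux_clock_add(i);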
2025-10-20  cgroup/cpuset: Don't track # of local child partitions  (Waiman Long, 2 files, +13/-31)

The cpuset structure has a nr_subparts field which tracks the number of child local partitions underneath a particular cpuset. Right now, nr_subparts is only used in partition_is_populated() to avoid iteration of child cpusets if the condition is right. So by always performing the child iteration, we can avoid tracking the number of child partitions and simplify the code a bit.

Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-10-20  rv: Make rtapp/pagefault monitor depend on CONFIG_MMU  (Nam Cao, 1 file, +1/-0)

There is no page fault without an MMU. Compiling the rtapp/pagefault monitor without CONFIG_MMU fails, as the page fault tracepoints' definitions are not available. Make the rtapp/pagefault monitor depend on CONFIG_MMU.

Fixes: 9162620eb604 ("rv: Add rtapp_pagefault monitor")
Signed-off-by: Nam Cao <namcao@linutronix.de>
Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202509260455.6Z9Vkty4-lkp@intel.com/
Cc: stable@vger.kernel.org
Reviewed-by: Gabriele Monaco <gmonaco@redhat.com>
Link: https://lore.kernel.org/r/20251002082317.973839-1-namcao@linutronix.de
Signed-off-by: Gabriele Monaco <gmonaco@redhat.com>
2025-10-20  rv: Fully convert enabled_monitors to use list_head as iterator  (Nam Cao, 1 file, +6/-6)

The callbacks in enabled_monitors_seq_ops are inconsistent. Some treat the iterator as struct rv_monitor *, while others treat the iterator as struct list_head *. This causes a wrong type cast and crashes the system, as reported by Nathan.

Convert everything to use struct list_head * as the iterator. This also makes enabled_monitors consistent with available_monitors.

Fixes: de090d1ccae1 ("rv: Fix wrong type cast in enabled_monitors_next()")
Reported-by: Nathan Chancellor <nathan@kernel.org>
Closes: https://lore.kernel.org/linux-trace-kernel/20250923002004.GA2836051@ax162/
Signed-off-by: Nam Cao <namcao@linutronix.de>
Cc: stable@vger.kernel.org
Reviewed-by: Gabriele Monaco <gmonaco@redhat.com>
Link: https://lore.kernel.org/r/20251002082235.973099-1-namcao@linutronix.de
Signed-off-by: Gabriele Monaco <gmonaco@redhat.com>
2025-10-19  Merge tag 'sched_urgent_for_v6.18_rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip  (Linus Torvalds, 3 files, +18/-13)

Pull scheduler fixes from Borislav Petkov:

 - Make sure the check for lost pelt idle time is done unconditionally to have correct lost idle time accounting

 - Stop the deadline server task before a CPU goes offline

* tag 'sched_urgent_for_v6.18_rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  sched/fair: Fix pelt lost idle time detection
  sched/deadline: Stop dl_server before CPU goes offline
2025-10-19  Merge tag 'perf_urgent_for_v6.18_rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip  (Linus Torvalds, 2 files, +7/-7)

Pull perf fixes from Borislav Petkov:

 - Make sure perf reporting works correctly in setups using overlayfs or FUSE

 - Move the uprobe optimization to a better location logically

* tag 'perf_urgent_for_v6.18_rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  perf/core: Fix MMAP2 event device with backing files
  perf/core: Fix MMAP event path names with backing files
  perf/core: Fix address filter match with backing files
  uprobe: Move arch_uprobe_optimize right after handlers execution
2025-10-18  bpf: mark vma->{vm_mm,vm_file} as __safe_trusted_or_null  (Yafang Shao, 1 file, +6/-0)

The vma->vm_mm might be NULL and it can be accessed outside of RCU. Thus, we can mark it as trusted_or_null. With this change, BPF helpers can safely access vma->vm_mm to retrieve the associated mm_struct from the VMA. Then we can make policy decisions from the VMA.

The "trusted" annotation enables direct access to vma->vm_mm within kfuncs marked with KF_TRUSTED_ARGS or KF_RCU, such as bpf_task_get_cgroup1() and bpf_task_under_cgroup(). Conversely, "null" enforcement requires all callsites using vma->vm_mm to perform NULL checks.

The lsm selftest must be modified because it directly accesses vma->vm_mm without a NULL pointer check; otherwise it will break due to this change.

For the VMA based THP policy, the use case is as follows:

	@mm = @vma->vm_mm; // vm_area_struct::vm_mm is trusted or null
	if (!@mm)
		return;
	bpf_rcu_read_lock(); // rcu lock must be held to dereference the owner
	@owner = @mm->owner; // mm_struct::owner is rcu trusted or null
	if (!@owner)
		goto out;
	@cgroup1 = bpf_task_get_cgroup1(@owner, MEMCG_HIERARCHY_ID);
	/* make the decision based on the @cgroup1 attribute */
	bpf_cgroup_release(@cgroup1); // release the associated cgroup
out:
	bpf_rcu_read_unlock();

PSI memory information can be obtained from the associated cgroup to inform policy decisions. Since upstream PSI support is currently limited to cgroup v2, the following example demonstrates a cgroup v2 implementation:

	@owner = @mm->owner;
	if (@owner) {
		// @ancestor_cgid is user-configured
		@ancestor = bpf_cgroup_from_id(@ancestor_cgid);
		if (bpf_task_under_cgroup(@owner, @ancestor)) {
			@psi_group = @ancestor->psi;
			/* Extract PSI metrics from @psi_group and
			 * implement policy logic based on the values */
		}
	}

The vma::vm_file can also be marked with __safe_trusted_or_null. No additional selftests are required since vma->vm_file and vma->vm_mm are already validated in the existing selftest suite.

Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
Acked-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com>
Link: https://lore.kernel.org/r/20251016063929.13830-3-laoar.shao@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-10-18  bpf: mark mm->owner as __safe_rcu_or_null  (Yafang Shao, 1 file, +3/-0)

When CONFIG_MEMCG is enabled, we can access mm->owner under RCU. The owner can be NULL. With this change, BPF helpers can safely access mm->owner to retrieve the associated task from the mm. We can then make policy decisions based on the task's attributes.

The typical use case is as follows:

	bpf_rcu_read_lock(); // rcu lock must be held for rcu trusted field
	@owner = @mm->owner; // mm_struct::owner is rcu trusted or null
	if (!@owner)
		goto out;
	/* Do something based on the task attribute */
out:
	bpf_rcu_read_unlock();

Suggested-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
Acked-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Link: https://lore.kernel.org/r/20251016063929.13830-2-laoar.shao@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-10-18  Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf at 6.18-rc2  (Alexei Starovoitov, 3 files, +32/-12)

Cross-merge BPF and other fixes after downstream PR. No conflicts.

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-10-18  sched_ext: Allow forcibly picking an scx task  (Andrea Righi, 1 file, +16/-2)

Refactor pick_task_scx(), adding a new argument to forcibly pick a SCHED_EXT task, ignoring any higher-priority sched class activity. This refactoring prepares the code for future scenarios, e.g., allowing the ext dl_server to force a SCHED_EXT task selection.

No functional changes.

Suggested-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Andrea Righi <arighi@nvidia.com>
Reviewed-by: Emil Tsalapatis <emil@etsalapatis.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-10-18  dma: contiguous: Reserve default CMA heap  (Maxime Ripard, 1 file, +6/-0)

The CMA code, in addition to the reserved-memory regions in the device tree, will also register a default CMA region if the device tree doesn't provide any, with its size and position coming from either the kernel command line or the configuration. Let's register that region as well, so that a heap can be created for it.

Reviewed-by: T.J. Mercier <tjmercier@google.com>
Signed-off-by: Maxime Ripard <mripard@kernel.org>
Acked-by: Marek Szyprowski <m.szyprowski@samsung.com>
Signed-off-by: Sumit Semwal <sumit.semwal@linaro.org>
Link: https://lore.kernel.org/r/20251013-dma-buf-ecc-heap-v8-4-04ce150ea3d9@kernel.org
2025-10-18  dma: contiguous: Register reusable CMA regions at boot  (Maxime Ripard, 1 file, +5/-0)

In order to create a CMA dma-buf heap instance for each CMA heap region in the system, we need to collect all of them during boot. They are created from two main sources: the reserved-memory regions in the device tree, and the default CMA region created from the configuration or command line parameters, if no default region is provided in the device tree.

Let's collect all the device-tree defined CMA regions flagged as reusable.

Reviewed-by: T.J. Mercier <tjmercier@google.com>
Signed-off-by: Maxime Ripard <mripard@kernel.org>
Acked-by: Marek Szyprowski <m.szyprowski@samsung.com>
Signed-off-by: Sumit Semwal <sumit.semwal@linaro.org>
Link: https://lore.kernel.org/r/20251013-dma-buf-ecc-heap-v8-3-04ce150ea3d9@kernel.org
2025-10-18  PM: console: Fix memory allocation error handling in pm_vt_switch_required()  (Malaya Kumar Rout, 1 file, +6/-2)

The pm_vt_switch_required() function fails silently when memory allocation fails, offering no indication to callers that the operation was unsuccessful. This behavior prevents drivers from handling allocation errors correctly or implementing retry mechanisms. By ensuring that failures are reported back to the caller, drivers can make informed decisions, improve robustness, and avoid unexpected behavior during critical power management operations.

Change the function signature to return an integer error code and modify the implementation to return -ENOMEM when kmalloc() fails. Update both the function declaration and the inline stub in include/linux/pm.h to maintain consistency across CONFIG_VT_CONSOLE_SLEEP configurations.

The function now returns:

 - 0 on success (including when updating existing entries)
 - -ENOMEM when memory allocation fails

This change improves error reporting without breaking existing callers, as the current callers in drivers/video/fbdev/core/fbmem.c already ignore the return value, making this a backward-compatible improvement.

Reviewed-by: Lyude Paul <lyude@redhat.com>
Signed-off-by: Malaya Kumar Rout <mrout@redhat.com>
Reviewed-by: Dhruva Gole <d-gole@ti.com>
Link: https://patch.msgid.link/20251013193028.89570-1-mrout@redhat.com
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
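Based on the description, the updated function plausibly looks like this; a sketch that follows the list and lock names in the existing kernel/power/console.c code, not necessarily the exact patch:

	int pm_vt_switch_required(struct device *dev, bool required)
	{
		struct pm_vt_switch *entry;
		int ret = 0;

		mutex_lock(&vt_switch_mutex);
		list_for_each_entry(entry, &pm_vt_switch_list, head) {
			if (entry->dev == dev) {
				entry->required = required;	/* update: still success */
				goto out;
			}
		}

		entry = kmalloc(sizeof(*entry), GFP_KERNEL);
		if (!entry) {
			ret = -ENOMEM;		/* previously failed silently */
			goto out;
		}

		entry->required = required;
		entry->dev = dev;
		list_add(&entry->head, &pm_vt_switch_list);
	out:
		mutex_unlock(&vt_switch_mutex);
		return ret;
	}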
2025-10-16  sched_ext: Merge branch 'sched/core' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip into for-6.19  (Tejun Heo, 15 files, +602/-553)

Pull in tip/sched/core to receive:

  50653216e4ff ("sched: Add support to pick functions to take rf")
  4c95380701f5 ("sched/ext: Fold balance_scx() into pick_task_scx()")

which will enable clean integration of DL server support among other things.

This conflicts with the following from sched_ext/for-6.18-fixes:

  a8ad873113d3 ("sched_ext: defer queue_balance_callback() until after ops.dispatch")

which adds maybe_queue_balance_callback() to balance_scx(), which is removed by 50653216e4ff. Resolve by moving the invocation to pick_task_scx() in the equivalent location.

Signed-off-by: Tejun Heo <tj@kernel.org>
2025-10-16  sched_ext: Merge branch 'for-6.18-fixes' into for-6.19  (Tejun Heo, 1 file, +6/-6)

Pull sched_ext/for-6.18-fixes to sync trees and receive:

  05e63305c85c ("sched_ext: Fix scx_kick_pseqs corruption on concurrent scheduler loads")

to avoid conflicts with planned cgroup sub-sched support.

Signed-off-by: Tejun Heo <tj@kernel.org>
2025-10-16  sched_ext: fix flag check for deferred callbacks  (Emil Tsalapatis, 1 file, +1/-1)

When scheduling the deferred balance callbacks, check SCX_RQ_BAL_CB_PENDING instead of SCX_RQ_BAL_PENDING. This way schedule_deferred() properly tests whether there is already a pending request for queue_balance_callback() to be invoked at the end of .balance().

Fixes: a8ad873113d3 ("sched_ext: defer queue_balance_callback() until after ops.dispatch")
Signed-off-by: Emil Tsalapatis <emil@etsalapatis.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2025-10-16  bpf: Fix memory leak in __lookup_instance error path  (Shardul Bankar, 1 file, +3/-1)

When __lookup_instance() allocates a func_instance structure but fails to allocate the must_write_set array, it returns an error without freeing the previously allocated func_instance. This causes a memory leak of 192 bytes (sizeof(struct func_instance)) each time this error path is triggered.

Fix by freeing 'result' on must_write_set allocation failure.

Fixes: b3698c356ad9 ("bpf: callchain sensitive stack liveness tracking using CFG")
Reported-by: BPF Runtime Fuzzer (BRF)
Signed-off-by: Shardul Bankar <shardulsb08@gmail.com>
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://patch.msgid.link/20251016063330.4107547-1-shardulsb08@gmail.com
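The fixed error path, sketched (the allocation flavors are assumptions; the point is freeing the first allocation when the second fails):

	result = kvzalloc(sizeof(*result), GFP_KERNEL_ACCOUNT);
	if (!result)
		return ERR_PTR(-ENOMEM);

	result->must_write_set = kvcalloc(insn_cnt, sizeof(*result->must_write_set),
					  GFP_KERNEL_ACCOUNT);
	if (!result->must_write_set) {
		kvfree(result);		/* was leaked before this fix */
		return ERR_PTR(-ENOMEM);
	}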
2025-10-16  sched/ext: Fold balance_scx() into pick_task_scx()  (Peter Zijlstra, 3 files, +12/-80)

With pick_task() having an rf argument, it is possible to do the lock-break there and get rid of the weird balance/pick_task hack.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Tejun Heo <tj@kernel.org>
2025-10-16  sched: Add support to pick functions to take rf  (Joel Fernandes, 8 files, +46/-41)

Some pick functions, like the internal pick_next_task_fair(), already take rf, but some others don't. We need this for scx's server pick function. Prepare for this by having pick functions accept it.

[peterz:
 - added RETRY_TASK handling
 - removed pick_next_task_fair indirection]

Signed-off-by: Joel Fernandes <joelagnelf@nvidia.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Tejun Heo <tj@kernel.org>
2025-10-16  sched: Detect per-class runqueue changes  (Peter Zijlstra, 8 files, +42/-2)

Have enqueue/dequeue set a per-class bit in rq->queue_mask. This then enables easy tracking of which runqueues are modified over a lock-break.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Tejun Heo <tj@kernel.org>