| Age | Commit message (Collapse) | Author | Files | Lines |
|
git://git.kernel.org/pub/scm/linux/kernel/git/mszyprowski/linux
Pull dma-mapping fix from Marek Szyprowski:
"Three more fixes for the DMA-mapping code, related to PCI P2PDMA, DMA
debug and DMA link ranges API (Li RongQing and Jason Gunthorpe)"
* tag 'dma-mapping-7.1-2026-06-11' of git://git.kernel.org/pub/scm/linux/kernel/git/mszyprowski/linux:
iommu/dma: Do not try to iommu_map a 0 length region in swiotlb
dma-debug: fix physical address retrieval in debug_dma_sync_sg_for_device
dma-mapping: direct: fix missing mapping for THRU_HOST_BRIDGE segments
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace
Pull runtime verifier fixes from Steven Rostedt:
- Fix reset ordering on per-task destruction
Reset the task before dropping the slot instead of after, which was
causing out-of-bounds memory accesses.
- Fix HA monitor synchronization and cleanup
Ensure synchronous cleanup for HA monitors by running timer callbacks
in RCU read-side critical sections and using synchronize_rcu() during
destruction.
- Avoid armed timers after tasks exit
Add automatic cleanup for per-task HA monitors to prevent timers from
firing after task exit.
- Fix memory ordering for DA/HA monitors
Fix race conditions during monitor start by using release-acquire
semantics for the monitoring flag.
- Fix initialization for DA/HA monitors
Ensure monitor initialization does not rely on potentially corrupted
state such as the monitoring flag, which is not reset by all monitor
types and may have an unknown state in monitors reusing the storage
(per-task).
- Fix memory safety in per-task and per-object monitors
Prevent use-after-free and out-of-bounds access by synchronizing with
in-flight tracepoint probes using tracepoint_synchronize_unregister()
before freeing monitor storage or releasing task slots.
- Adjust monitors for preemptible tracepoints
Fix monitors that relied on tracepoints disabling preemption.
Explicitly disable task migration when per-CPU monitors handle events
to avoid accessing the wrong state and update the opid monitor logic.
- Fix incorrect __user specifier usage
Remove __user from a non-pointer variable in the extract_params()
helper.
- Fix bugs in the rv tool
Ensure strings are NUL-terminated, fix substring matching in monitor
searches, and improve cleanup and exit status handling.
- Fix several bugs in rvgen
Fix LTL literal stringification, subparsers' options handling, and
suffix stripping in dot2k.
* tag 'trace-rv-v7.1-rc6-2' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
verification/rvgen: Fix ltl2k writing True as a literal
verification/rvgen: Fix options shared among commands
verification/rvgen: Fix suffix strip in dot2k
tools/rv: Fix cleanup after failed trace setup
tools/rv: Fix substring match when listing container monitors
tools/rv: Fix substring match bug in monitor name search
tools/rv: Ensure monitor name and desc are NUL-terminated
rv: Use 0 to check preemption enabled in opid
rv: Prevent task migration while handling per-CPU events
rv: Ensure synchronous cleanup for HA monitors
rv: Add automatic cleanup handlers for per-task HA monitors
rv: Do not rely on clean monitor when initialising HA
rv: Fix monitor start ordering and memory ordering for monitoring flag
rv: Ensure all pending probes terminate on per-obj monitor destroy
rv: Prevent in-flight per-task handlers from using invalid slots
rv: Reset per-task DA monitors before releasing the slot
rv: Fix __user specifier usage in extract_params()
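The release-acquire pairing mentioned for the monitoring flag can be sketched in userspace C11. This is only an illustration under assumptions: `struct monitor`, `monitor_start()` and `monitor_handle_event()` are hypothetical names, and C11 atomics stand in for the kernel's own release/acquire primitives.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical monitor storage; not the kernel's monitor structure. */
struct monitor {
    int curr_state;          /* plain data, published through the flag */
    atomic_bool monitoring;  /* controls visibility of curr_state */
};

/* Start path: initialise the state first, then publish it with release. */
static void monitor_start(struct monitor *m, int initial_state)
{
    m->curr_state = initial_state;
    atomic_store_explicit(&m->monitoring, true, memory_order_release);
}

/*
 * Event path: an acquire load that observes true also guarantees that the
 * earlier write to curr_state is visible to this thread.
 * Returns the state, or -1 when the monitor is not running.
 */
static int monitor_handle_event(struct monitor *m)
{
    if (!atomic_load_explicit(&m->monitoring, memory_order_acquire))
        return -1;
    return m->curr_state;
}
```

With a relaxed store instead, a handler on another CPU could in principle see the flag set while still reading stale state, which is the kind of start race the summary describes.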
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull timer fixes from Ingo Molnar:
- Fix the arch_inlined_clockevent_set_next_coupled() prototype in the
!CONFIG_GENERIC_CLOCKEVENTS_MIN_ADJUST case (Naveen Kumar Chaudhary)
- Fix an off-by-1 bug in the sys_settimeofday() usecs validation code
(Naveen Kumar Chaudhary)
- Mark vdso_k_*_data pointers as __ro_after_init (Thomas Weißschuh)
- Fix livelock race in tmigr_handle_remote_up() (Amit Matityahu)
* tag 'timers-urgent-2026-06-07' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
timers/migration: Fix livelock in tmigr_handle_remote_up()
vdso/datastore: Mark vdso_k_*_data pointers as __ro_after_init
time: Fix off-by-one in settimeofday() usec validation
clockevents: Fix duplicate type specifier in stub function parameter
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull locking fixes from Ingo Molnar:
- Fix a NULL pointer dereference bug in the FUTEX_CMP_REQUEUE_PI
code (Ji'an Zhou)
- Fix a NULL pointer dereference bug in the rtmutex code (Davidlohr
Bueso)
* tag 'locking-urgent-2026-06-07' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
locking/rtmutex: Skip remove_waiter() when waiter is not enqueued
futex/requeue: Prevent NULL pointer dereference in remove_waiter() on self-deadlock
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull vfs fixes from Christian Brauner:
- Fix error handling in ovl_cache_get()
- Tighten access checks for exited tasks in pidfd_getfd()
- Fix selftests leak in __wait_for_test()
- Limit FUSE_NOTIFY_RETRIEVE to uptodate folios
- Reject fuse_notify() pagecache ops on directories
- Clear JOBCTL_PENDING_MASK for caller in zap_other_threads()
- Fix failure to unlock in nfsd4_create_file()
- Fix pointer arithmetic in qnx6 directory iteration
- Fix UAF due to unlocked ->mnt_ns read in may_decode_fh()
- Avoid potential null folio->mapping deref during iomap error
reporting
* tag 'vfs-7.1-rc7.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
iomap: avoid potential null folio->mapping deref during error reporting
fhandle: fix UAF due to unlocked ->mnt_ns read in may_decode_fh()
fs/qnx6: fix pointer arithmetic in directory iteration
VFS: fix possible failure to unlock in nfsd4_create_file()
signal: clear JOBCTL_PENDING_MASK for caller in zap_other_threads()
fuse: reject fuse_notify() pagecache ops on directories
fuse: limit FUSE_NOTIFY_RETRIEVE to uptodate folios
selftests: harness: fix pidfd leak in __wait_for_test
pidfd: refuse access to tasks that have started exiting harder
ovl: keep err zero after successful ovl_cache_get()
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace
Pull tracing/probes fix from Masami Hiramatsu:
"Fix the eprobe event parser to point to the error position correctly"
* tag 'probes-fixes-v7.1-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
tracing/probes: Point the error offset correctly for eprobe argument error
|
|
tmigr_handle_remote_cpu() skips timer_expire_remote() when cpu ==
smp_processor_id(), assuming the local softirq path already handled this
CPU's timers.
This assumption is wrong because jiffies can advance after the handling of
the CPU's global timers in run_timer_base(BASE_GLOBAL) and before
tmigr_handle_remote() evaluates the expiry times.
As a consequence, a timer which expires after the CPU local timer wheel
has advanced, and is therefore seen as expired only in the remote handling,
is ignored: its callback is never invoked and it is never removed from the
timer wheel.
What's worse is that fetch_next_timer_interrupt_remote() keeps reporting it
as expired, and the event is re-queued with expires == now on each
iteration. The goto-again loop spins indefinitely.
Fix this by calling timer_expire_remote() unconditionally. That's minimal
overhead for the common case as __run_timer_base() returns immediately if
there is nothing to expire in the local wheel.
[ tglx: Amend change log and add a comment ]
Fixes: 7ee988770326 ("timers: Implement the hierarchical pull model")
Reported-by: Alon Kariv <alonka@amazon.com>
Signed-off-by: Amit Matityahu <amitmat@amazon.com>
Signed-off-by: Thomas Gleixner <tglx@kernel.org>
Cc: stable@vger.kernel.org
Link: https://patch.msgid.link/20260603170139.33628-1-amitmat@amazon.com
|
|
syzbot triggered the following splat in remove_waiter() via
FUTEX_CMP_REQUEUE_PI:
KASAN: null-ptr-deref in range [0x0000000000000a88-0x0000000000000a8f]
class_raw_spinlock_constructor
remove_waiter+0x159/0x1200 kernel/locking/rtmutex.c:1561
rt_mutex_start_proxy_lock+0x103/0x120
futex_requeue+0x10e4/0x20d0
__x64_sys_futex+0x34f/0x4d0
task_blocks_on_rt_mutex() does not arm the waiter upon deadlock detection,
leaving waiter->task NULL, and 3bfdc63936dd ("rtmutex: Use waiter::task instead
of current in remove_waiter()") made this fatal.
Furthermore, rt_mutex_start_proxy_lock() should not call remove_waiter()
after successfully grabbing the rtmutex. 1a1fb985f2e2 ("futex: Handle early deadlock
return correctly") moved the remove_waiter() call out of __rt_mutex_start_proxy_lock()
(where 'ret' was only ever 0 or < 0) into the wrapper. Tighten this check to
account for try_to_take_rt_mutex().
Fixes: 3bfdc63936dd ("rtmutex: Use waiter::task instead of current in remove_waiter()")
Reported-by: syzbot+78147abe6c524f183ee9@syzkaller.appspotmail.com
Signed-off-by: Davidlohr Bueso <dave@stgolabs.net>
Signed-off-by: Thomas Gleixner <tglx@kernel.org>
Cc: stable@vger.kernel.org
Closes: https://lore.kernel.org/all/69f114ac.050a0220.ac8b.0003.GAE@google.com/
Link: https://patch.msgid.link/20260507112913.1019537-1-dave@stgolabs.net
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup
Pull cgroup fixes from Tejun Heo:
"One cpuset fix and a maintenance update, both low-risk:
- Fix cpuset partition CPU accounting under sibling CPU exclusion
that could produce wrong CPU assignments and trigger
scheduling-domain warnings. Includes selftests.
- Update an email address in MAINTAINERS"
* tag 'cgroup-for-7.1-rc6-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
cgroup/cpuset: Change Ridong's email
cgroup/cpuset: Add test cases for sibling CPU exclusion on partition update
cgroup/cpuset: Use effective_xcpus in partcmd_update add/del mask calculation
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext
Pull sched_ext fixes from Tejun Heo:
"Two low-risk fixes:
- Drop a spurious warning that can fire during cgroup migration while
a sched_ext scheduler is loaded
- Fix a drgn-based debug script that broke after scheduler state
moved into a per-scheduler struct"
* tag 'sched_ext-for-7.1-rc6-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext:
sched_ext: Don't warn on NULL cgrp_moving_from in scx_cgroup_move_task()
tools/sched_ext: Fix scx_show_state per-scheduler state reads
|
|
In debug_dma_sync_sg_for_device(), when iterating over a scatterlist,
the debug entry population mistakenly uses the head of the scatterlist
'sg' to fetch the physical address via sg_phys(), instead of using the
current iterator variable 's'.
This causes dma-debug to track the physical address of the very first
scatterlist entry for all subsequent entries in the list.
Fix this by passing the correct loop iterator 's' to sg_phys().
Fixes: 9d4f645a1fd49ee ("dma-debug: store a phys_addr_t in struct dma_debug_entry")
Signed-off-by: Li RongQing <lirongqing@baidu.com>
Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com>
Link: https://lore.kernel.org/r/20260603123708.1665-1-lirongqing@baidu.com
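The head-versus-iterator mistake described in this commit can be reproduced in a small userspace model. This is a sketch, not the kernel code: `struct sg_entry` and both `record_phys_*()` helpers are hypothetical stand-ins for the scatterlist and the dma-debug bookkeeping.

```c
#include <stddef.h>
#include <stdint.h>

/* Simplified stand-in for a scatterlist entry: one physical address each. */
struct sg_entry {
    uint64_t phys;
};

/* Buggy pattern: every iteration reads the list head 'sg'. */
static void record_phys_buggy(const struct sg_entry *sg, size_t nents,
                              uint64_t *out)
{
    for (size_t i = 0; i < nents; i++)
        out[i] = sg->phys;
}

/* Fixed pattern: every iteration reads the loop cursor 's'. */
static void record_phys_fixed(const struct sg_entry *sg, size_t nents,
                              uint64_t *out)
{
    const struct sg_entry *s = sg;

    for (size_t i = 0; i < nents; i++, s++)
        out[i] = s->phys;
}
```

For a three-entry list, the buggy variant records the first entry's address three times, while the fixed variant records each entry's own address.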
|
|
Tracepoint handlers no longer run with preemption disabled by default
since a46023d5616 ("tracing: Guard __DECLARE_TRACE() use of
__DO_TRACE_CALL() with SRCU-fast"), so the opid monitor should now treat
a preemption count of 1 as preemption disabled.
Change the rule for preempt_off to preempt > 0.
Reviewed-by: Nam Cao <namcao@linutronix.de>
Link: https://lore.kernel.org/r/20260601153840.124372-11-gmonaco@redhat.com
Signed-off-by: Gabriele Monaco <gmonaco@redhat.com>
|
|
Hybrid automata monitors may start timers. Depending on the model, these
may remain active on an exiting task and cause false positives or even
access freed memory.
Add an enable/disable hook in the HA code, currently only populated by
the per-task handler for registration and deregistration.
The handler hooks into the sched_process_exit event and ensures the timer
is stopped for every exiting task. It is enabled automatically but
may be disabled, for instance if the monitor uses the event for another
purpose (in which case the monitor must still ensure timers are stopped).
Fixes: f5587d1b6ec9 ("rv: Add Hybrid Automata monitor type")
Reviewed-by: Nam Cao <namcao@linutronix.de>
Link: https://lore.kernel.org/r/20260601153840.124372-8-gmonaco@redhat.com
Signed-off-by: Gabriele Monaco <gmonaco@redhat.com>
|
|
The attribute variables extracted from syscalls in the helper are both
defined with the __user specifier although only the actual pointer to
user data should be marked.
Remove the __user specifier from attr.
Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202604150820.Ny143u6X-lkp@intel.com
Fixes: b133207deb72 ("rv: Add nomiss deadline monitor")
Reviewed-by: Wen Yang <wen.yang@linux.dev>
Reviewed-by: Nam Cao <namcao@linutronix.de>
Link: https://lore.kernel.org/r/20260601153840.124372-2-gmonaco@redhat.com
Signed-off-by: Gabriele Monaco <gmonaco@redhat.com>
|
|
In dma_direct_map_sg(), the case PCI_P2PDMA_MAP_THRU_HOST_BRIDGE
incorrectly used 'break' instead of falling through to MAP_NONE.
As a result, segments traversing the host bridge skipped the required
dma_direct_map_phys() call entirely, leaving sg->dma_address
uninitialized and leading to DMA failures. Fix this by using
'fallthrough;'.
Fixes: a25e7962db0d79 ("PCI/P2PDMA: Refactor the p2pdma mapping helpers")
Reviewed-by: Logan Gunthorpe <logang@deltatee.com>
Signed-off-by: Li RongQing <lirongqing@baidu.com>
Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com>
Link: https://lore.kernel.org/r/20260603013723.2439-1-lirongqing@baidu.com
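The effect of 'break' versus falling through can be shown with a reduced switch statement. The sketch below is a model under assumptions: the enum, `struct segment`, `map_phys()` and the identity mapping are hypothetical and only mirror the control flow described in the commit text.

```c
#include <stdbool.h>

/* Hypothetical mapping types modelled on the commit text. */
enum map_type { MAP_NONE, MAP_THRU_HOST_BRIDGE };

struct segment {
    unsigned long phys;
    unsigned long dma_address;
    bool mapped;
};

/* Stand-in for the real mapping call; assumes an identity mapping. */
static void map_phys(struct segment *seg)
{
    seg->dma_address = seg->phys;
    seg->mapped = true;
}

/* Buggy control flow: the host bridge case leaves the switch unmapped. */
static void map_segment_buggy(struct segment *seg, enum map_type type)
{
    switch (type) {
    case MAP_THRU_HOST_BRIDGE:
        break;
    case MAP_NONE:
        map_phys(seg);
        break;
    }
}

/* Fixed control flow: the host bridge case shares the MAP_NONE path. */
static void map_segment_fixed(struct segment *seg, enum map_type type)
{
    switch (type) {
    case MAP_THRU_HOST_BRIDGE:
        /* fall through */
    case MAP_NONE:
        map_phys(seg);
        break;
    }
}
```

In the buggy variant a host bridge segment keeps whatever `dma_address` it started with, which corresponds to the uninitialized address the commit message reports.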
|
|
A WARN fires when systemd's user manager writes "+cpu +memory +pids" to
its own subtree_control while a sched_ext scheduler is loaded:
WARNING: at kernel/sched/ext.c:3227 scx_cgroup_move_task+0xa8/0xb0
scx_cgroup_move_task+0xa8/0xb0
sched_move_task+0x134/0x290
cpu_cgroup_attach+0x39/0x70
cgroup_migrate_execute+0x37d/0x450
cgroup_update_dfl_csses+0x1e3/0x270
cgroup_subtree_control_write+0x3e7/0x440
scx_cgroup_can_attach() arms cgrp_moving_from only when a task's cpu
cgroup changes. It can still be NULL when scx_cgroup_move_task() runs,
through this sequence:
Step Result
--------------------------------- ----------------------------------
1. cpu enabled on cgroup G cpu css = A
2. cpu toggled off then on for G A killed, B created (same cgroup)
3. an exiting task keeps A alive migration skips it, A now stale
4. +memory migrates G stale A vs current B pulls cpu in
5. cpu attach runs for all tasks hits a live, cpu-unchanged task
6. scx_cgroup_move_task() on it cgrp_moving_from NULL -> WARN
The mismatch is that scx_cgroup_can_attach() keys on cgroup identity
while migration drives the move on css identity, so a NULL cgrp_moving_from
here is a legitimate css-only migration, not a missing prep.
The call is already gated on cgrp_moving_from, so just drop the warning.
ops.cgroup_prep_move() and ops.cgroup_move() stay paired.
Fixes: 819513666966 ("sched_ext: Add cgroup support")
Cc: stable@vger.kernel.org # v6.12+
Reported-by: Matt Fleming <mfleming@cloudflare.com>
Closes: https://lore.kernel.org/all/20260601124156.2205704-1-mfleming@cloudflare.com/
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
|
|
self-deadlock
When FUTEX_CMP_REQUEUE_PI requeues a non-top waiter that already owns the
target PI futex, task_blocks_on_rt_mutex() returns -EDEADLK before setting
waiter->task.
The subsequent remove_waiter() in rt_mutex_start_proxy_lock() dereferences
the NULL waiter->task, causing a kernel crash.
Add a self-deadlock check for non-top waiters before calling
rt_mutex_start_proxy_lock(), analogous to the top-waiter check in
futex_lock_pi_atomic().
Fixes: 3bfdc63936dd4773109b7b8c280c0f3b5ae7d349 ("rtmutex: Use waiter::task instead of current in remove_waiter()")
Signed-off-by: Ji'an Zhou <eilaimemedsnaimel@gmail.com>
Signed-off-by: Thomas Gleixner <tglx@kernel.org>
Cc: stable@vger.kernel.org
|
|
The validation check uses '>' instead of '>=' when comparing tv_usec
against USEC_PER_SEC, allowing the value 1000000 through. After
conversion to nanoseconds (*= 1000), this produces tv_nsec ==
NSEC_PER_SEC, violating the timespec invariant that tv_nsec must be
less than NSEC_PER_SEC.
Use '>=' to reject tv_usec values that are not in the valid range of
0 to 999999.
Fixes: 5e0fb1b57bea ("y2038: time: avoid timespec usage in settimeofday()")
Signed-off-by: Naveen Kumar Chaudhary <naveen.osdev@gmail.com>
Signed-off-by: Thomas Gleixner <tglx@kernel.org>
Acked-by: John Stultz <jstultz@google.com>
Link: https://patch.msgid.link/4rikk44zew3s6577dugmx4jyblz7o5c57niuap6ct3td5yfm6w@gh7pcumg7qor
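The boundary case can be checked with a few lines of standalone C. This is a sketch of the comparison only: the helper names are hypothetical, and the lower-bound check on negative values is an assumption added to make the example complete.

```c
#include <stdbool.h>

#define USEC_PER_SEC 1000000LL
#define NSEC_PER_SEC 1000000000LL

/* Old check: '>' lets tv_usec == USEC_PER_SEC through. */
static bool usec_valid_buggy(long long tv_usec)
{
    return !(tv_usec > USEC_PER_SEC || tv_usec < 0);
}

/* New check: '>=' restricts tv_usec to 0..999999. */
static bool usec_valid_fixed(long long tv_usec)
{
    return !(tv_usec >= USEC_PER_SEC || tv_usec < 0);
}

/* Conversion applied after validation (*= 1000). */
static long long usec_to_nsec(long long tv_usec)
{
    return tv_usec * 1000;
}
```

With the old check, 1000000 passes validation and converts to exactly NSEC_PER_SEC, which is outside the valid tv_nsec range.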
|
|
The stub for arch_inlined_clockevent_set_next_coupled() has 'u64 u64
cycles' in its parameter list. Since u64 is a typedef, the compiler
parses the second 'u64' as the parameter name, making 'cycles' an
unused token. Remove the duplicate so the parameter is correctly named.
Fixes: 89f951a1e8ad ("clockevents: Provide support for clocksource coupled comparators")
Signed-off-by: Naveen Kumar Chaudhary <naveen.osdev@gmail.com>
Signed-off-by: Thomas Gleixner <tglx@kernel.org>
Link: https://patch.msgid.link/7tostpvxzdn6tobmyow63a5rweatls5kux3scqp2vzhe7mv6uq@ecr746b4hyhf
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/liveupdate/linux
Pull liveupdate fixes from Mike Rapoport:
"Two kexec handover regression fixes:
- fix order calculation for kho_unpreserve_pages() to make sure
that the order calculation in kho_unpreserve_pages() matches the
order calculation in kho_preserve_pages().
- fix math in calculation of KHO_TREE_MAX_DEPTH to make it work with
16KB pages"
* tag 'liveupdate-fixes-2026-05-30' of git://git.kernel.org/pub/scm/linux/kernel/git/liveupdate/linux:
kho: fix order calculation for kho_unpreserve_pages()
kho: fix KHO_TREE_MAX_DEPTH for non-4KB page sizes
|
|
Point the error offset correctly for an eprobe argument error.
In the cleanup commit 1b8b0cd754cd ("tracing/probes: Move event parameter
fetching code to common parser"), a misguided backward-compatibility measure,
meant to keep conforming to the test specifications, set the error location
to 0 when a non-existent formal parameter was specified for an eprobe.
Instead, both the test and the implementation should be corrected
to point to the correct error position.
Link: https://lore.kernel.org/all/177967567399.209006.1451571244515632097.stgit@devnote2/
Fixes: 1b8b0cd754cd ("tracing/probes: Move event parameter fetching code to common parser")
Cc: stable@vger.kernel.org
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Reviewed-by: Steven Rostedt <rostedt@goodmis.org>
|
|
When sibling CPU exclusion occurs, a partition's user_xcpus may contain
CPUs that were never actually granted to it. These CPUs are present in
user_xcpus(cs) but not in cs->effective_xcpus.
The partcmd_update path in update_parent_effective_cpumask() uses
user_xcpus(cs) (via the local variable xcpus) to compute the addmask
(CPUs to return to parent) and delmask (CPUs to request from parent).
This is incorrect:
1) When newmask removes a CPU that was previously excluded by a
sibling, addmask incorrectly includes that CPU and tries to return
it to the parent even though the partition never actually owned it,
causing CPU overlap with sibling partitions and triggering warnings
in generate_sched_domains().
2) When newmask adds a previously excluded CPU that is now available,
delmask fails to request it from the parent because user_xcpus(cs)
already includes it.
Fix this by using cs->effective_xcpus instead of user_xcpus(cs) in all
partcmd_update paths that calculate addmask or delmask, including the
PERR_NOCPUS error handling paths.
Reproducers:
Example 1 - Removing a sibling-excluded CPU incorrectly returns it:
# cd /sys/fs/cgroup
# echo "0-1" > a1/cpuset.cpus
# echo "root" > a1/cpuset.cpus.partition
# echo "0-2" > b1/cpuset.cpus
# echo "root" > b1/cpuset.cpus.partition
# echo "2" > b1/cpuset.cpus
# cat cpuset.cpus.effective
# Actual: 0-1,3 Expected: 3
Example 2 - Expanding to a previously excluded CPU fails to request it:
# cd /sys/fs/cgroup
# echo "0-1" > a1/cpuset.cpus
# echo "root" > a1/cpuset.cpus.partition
# echo "0-2" > b1/cpuset.cpus
# echo "root" > b1/cpuset.cpus.partition
# echo "member" > a1/cpuset.cpus.partition
# echo "1-2" > b1/cpuset.cpus
# cat cpuset.cpus.effective
# Actual: 0-1,3 Expected: 0,3
Fixes: 2a3602030d80 ("cgroup/cpuset: Don't invalidate sibling partitions on cpuset.cpus conflict")
Cc: stable@vger.kernel.org # v7.0+
Suggested-by: Zhang Guopeng <zhangguopeng@kylinos.cn>
Signed-off-by: Sun Shaojie <sunshaojie@kylinos.cn>
Reviewed-by: Waiman Long <longman@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
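The two failure modes can be modelled with plain bitmasks. This is a deliberately reduced sketch: `cpu_bits`, `calc_addmask()` and `calc_delmask()` are hypothetical, and the real partcmd_update code takes more inputs than shown here. The masks assume the state from the reproducers, where b1 requested CPUs 0-2 but was only granted CPU 2.

```c
#include <stdint.h>

typedef uint32_t cpu_bits;           /* bit n set means CPU n */

#define CPU(n) ((cpu_bits)1u << (n))

/* CPUs to return to the parent: held before, absent from the new mask. */
static cpu_bits calc_addmask(cpu_bits held, cpu_bits newmask)
{
    return held & ~newmask;
}

/* CPUs to request from the parent: in the new mask, not yet held. */
static cpu_bits calc_delmask(cpu_bits held, cpu_bits newmask)
{
    return newmask & ~held;
}
```

Passing the requested mask (0-2) as `held` returns CPUs 0-1 in Example 1 and requests nothing in Example 2. Passing the granted mask (2) returns nothing in Example 1 and requests CPU 1 in Example 2, which matches the expected results listed in the reproducers.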
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull misc fixes from Andrew Morton:
"13 hotfixes. 9 are for MM. 9 are cc:stable and the remaining 4 address
post-7.1 issues or aren't considered suitable for backporting.
All patches are singletons - please see the individual changelogs for
details"
* tag 'mm-hotfixes-stable-2026-05-25-16-22' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm:
Revert "mm: introduce a new page type for page pool in page type"
mm/vmalloc: do not trigger BUG() on BH disabled context
MAINTAINERS, mailmap: change email for Eugen Hristev
mm/migrate_device: fix pgtable leak in migrate_vma_insert_huge_pmd_page
kernel/fork: validate exit_signal in kernel_clone()
mm: memcontrol: propagate NMI slab stats to memcg vmstats
mm/damon/sysfs-schemes: delete tried region in regions_rmdirs()
mm/rmap: initialize nr_pages to 1 at loop start in try_to_unmap_one
zram: fix use-after-free in zram_writeback_endio
memfd: deny writeable mappings when implying SEAL_WRITE
ipc: limit next_id allocation to the valid ID range
Revert "mm/hugetlbfs: update hugetlbfs to use mmap_prepare"
MAINTAINERS: .mailmap: update after GEHC spin-off
|
|
Commit 91e74fa8b1bc ("kho: make sure preservations do not span multiple
NUMA nodes") made sure preservations from kho_preserve_pages() do not
span multiple NUMA nodes. If they do, the order is reduced and tried
again.
The same logic was not implemented for kho_unpreserve_pages(). This can
result in unpreserve calculating a different order than preserve, and
thus not actually unpreserving the pages.
Fix this by moving the order calculation logic to
__kho_preserve_pages_order() and use it from both preserve and
unpreserve paths.
Move __kho_unpreserve() down to avoid having a forward declaration. Its
users are further down in the file anyway. The move also groups
all the page-level preservation and unpreservation functions together. This
unfortunately makes the diff hard to read, but the main change in
__kho_unpreserve() is to call __kho_preserve_pages_order() instead of
open-coding the order calculation.
Fixes: 91e74fa8b1bc ("kho: make sure preservations do not span multiple NUMA nodes")
Cc: stable@vger.kernel.org
Signed-off-by: Pratyush Yadav (Google) <pratyush@kernel.org>
Reviewed-by: Samiullah Khawaja <skhawaja@google.com>
Reviewed-by: Pasha Tatashin <pasha.tatashin@soleen.com>
Link: https://patch.msgid.link/20260519133332.2498092-1-pratyush@kernel.org
Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
|
|
Pull bpf fixes from Alexei Starovoitov:
- Fix bpf_throw() and global subprog combination (Kumar Kartikeya
Dwivedi)
- Fix out of bounds access in BPF interpreter (Yazhou Tang)
- Fix potential out of bounds access in inner per-cpu array map
(Guannan Wang)
- Reject NULL data/sig in bpf_verify_pkcs7_signature (KP Singh)
* tag 'bpf-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf:
libbpf: fix off-by-one in emit_signature_match jump offset
bpf: Reject NULL data/sig in bpf_verify_pkcs7_signature
selftests/bpf: Cover global subprog exception leaks
bpf: Check global subprog exception paths
bpf: make bpf_session_is_return() reference optional
bpf: Use array_map_meta_equal for percpu array inner map replacement
selftests/bpf: Add test for large offset bpf-to-bpf call
bpf: Fix s16 truncation for large bpf-to-bpf call offsets
bpf: Fix out-of-bounds read in bpf_patch_call_args()
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext
Pull sched_ext fixes from Tejun Heo:
- Spurious WARN in ops_dequeue() racing with concurrent dispatch
- Self-deadlock between scheduler disable and a concurrent sub-sched
enable
* tag 'sched_ext-for-7.1-rc4-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext:
sched_ext: Fix spurious WARN on stale ops_state in ops_dequeue()
sched_ext: Fix deadlock between scx_root_disable() and concurrent forks
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup
Pull cgroup fixes from Tejun Heo:
"Two rstat fixes:
- Out-of-bounds access in the css_rstat_updated() BPF kfunc when
called with an unchecked user-supplied cpu
- Over-strict NMI guard after the recent switch to try_cmpxchg left
sparc and ppc64 unable to queue rstat updates from NMI"
* tag 'cgroup-for-7.1-rc4-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
cgroup: rstat: relax NMI guard after switch to try_cmpxchg
cgroup/rstat: validate cpu before css_rstat_cpu() access
|
|
When a multi-threaded process receives a stop signal (e.g., SIGSTOP),
do_signal_stop() sets JOBCTL_STOP_PENDING and JOBCTL_STOP_CONSUME on all
threads and sets signal->group_stop_count to the number of threads. If
one of the threads concurrently calls execve(), de_thread() invokes
zap_other_threads() to kill all other threads. zap_other_threads()
aborts the pending group stop by resetting signal->group_stop_count to 0
and clears the JOBCTL_PENDING_MASK for all other threads. However, it
fails to clear the job control flags for the calling thread.
When execve() completes, the calling thread returns to user mode and
checks for pending signals. Seeing the stale JOBCTL_STOP_PENDING flag,
it calls do_signal_stop(), which invokes task_participate_group_stop().
Since JOBCTL_STOP_CONSUME is still set, it attempts to decrement the
already-zero signal->group_stop_count, triggering a warning:
sig->group_stop_count == 0
WARNING: CPU: 1 PID: 6475 at kernel/signal.c:373
task_participate_group_stop+0x215/0x2d0
Call Trace:
<TASK>
do_signal_stop+0x3be/0x5c0 kernel/signal.c:2619
get_signal+0xa8c/0x1330 kernel/signal.c:2884
arch_do_signal_or_restart+0xbc/0x840 arch/x86/kernel/signal.c:337
exit_to_user_mode_loop+0x8c/0x4d0 kernel/entry/common.c:98
do_syscall_64+0x33e/0xf80 arch/x86/entry/syscall_64.c:100
entry_SYSCALL_64_after_hwframe+0x77/0x7f
</TASK>
Fix this race condition by clearing the JOBCTL_PENDING_MASK for the
calling thread in zap_other_threads(), ensuring it does not retain any
stale job control state after the thread group is destroyed. This aligns
with other functions that tear down a thread group and abort group
stops, such as zap_process() and complete_signal(), which correctly
clear these flags for all threads including the current one.
Fixes: 39efa3ef3a37 ("signal: Use GROUP_STOP_PENDING to stop once for a single group stop")
Assisted-by: Gemini:gemini-3.1-pro-preview Gemini:gemini-3-flash-preview syzbot
Reported-by: syzbot+b109633ea805cac54a61@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=b109633ea805cac54a61
Link: https://syzkaller.appspot.com/ai_job?id=d70208cc-862b-4fe3-bf02-3031e10cd0b3
Signed-off-by: Aleksandr Nogikh <nogikh@google.com>
Link: https://patch.msgid.link/20260521142240.2973022-1-nogikh@google.com
Signed-off-by: Christian Brauner (Amutable) <brauner@kernel.org>
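The stale-flag sequence can be traced in a toy model of a thread group. Everything here is an assumption made for illustration: the `MODEL_*` bit values, `struct thread_group` and the three helpers are hypothetical and much simpler than the kernel's job control code.

```c
#include <stdbool.h>
#include <stddef.h>

/* Illustrative flag values; the kernel uses different bits. */
#define MODEL_STOP_PENDING (1u << 0)
#define MODEL_STOP_CONSUME (1u << 1)
#define MODEL_PENDING_MASK (MODEL_STOP_PENDING | MODEL_STOP_CONSUME)
#define MODEL_MAX_THREADS  4

struct thread_group {
    size_t nthreads;
    unsigned int group_stop_count;
    unsigned int jobctl[MODEL_MAX_THREADS];
};

/* Stop signal arrives: flag every thread and count them. */
static void start_group_stop(struct thread_group *g)
{
    for (size_t i = 0; i < g->nthreads; i++)
        g->jobctl[i] |= MODEL_PENDING_MASK;
    g->group_stop_count = (unsigned int)g->nthreads;
}

/* Abort the group stop; optionally clear the caller's flags as well. */
static void zap_other_threads_model(struct thread_group *g, size_t caller,
                                    bool clear_caller)
{
    g->group_stop_count = 0;
    for (size_t i = 0; i < g->nthreads; i++) {
        if (i == caller && !clear_caller)
            continue;
        g->jobctl[i] &= ~MODEL_PENDING_MASK;
    }
}

/* True when thread t would try to decrement an already-zero count. */
static bool stop_would_warn(const struct thread_group *g, size_t t)
{
    if (!(g->jobctl[t] & MODEL_STOP_PENDING))
        return false;
    return (g->jobctl[t] & MODEL_STOP_CONSUME) && g->group_stop_count == 0;
}
```

Calling `zap_other_threads_model()` with `clear_caller` set to false reproduces the warning condition for the calling thread, and setting it to true removes it.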
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/mszyprowski/linux
Pull dma-mapping fixes from Marek Szyprowski:
"Two minor updates for the DMA-mapping code, mainly fixing some rare
corner cases (Petr Tesarik, Jianpeng Chang)"
* tag 'dma-mapping-7.1-2026-05-22' of git://git.kernel.org/pub/scm/linux/kernel/git/mszyprowski/linux:
dma-mapping: move dma_map_resource() sanity check into debug code
dma-direct: fix use of max_pfn
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace
Pull tracing fixes from Steven Rostedt:
- Avoid NULL return from hist_field_name()
The return value of hist_field_name() is passed directly to strcat(),
which does not handle a NULL pointer. Return a zero-length string when
the size is greater than the limit.
This is used only to output already created histograms and no field
currently is greater than the limit. But it should still not return
NULL.
- Do not call map->ops->elt_free() on allocation failure
When elt_alloc() fails, it should not call the map->ops->elt_free()
function if it exists, as that function may not be able to handle the
free on allocation failures. The ->elt_free() should only be called
when elt_alloc() succeeds.
* tag 'trace-v7.1-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
tracing: Do not call map->ops->elt_free() if elt_alloc() fails
tracing: Avoid NULL return from hist_field_name() on truncation
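The empty-string convention from the first item can be sketched as follows. The limit value and both helper names are hypothetical; the point is only that a caller feeding the result to strcat() stays safe when the name is over the limit.

```c
#include <stddef.h>
#include <string.h>

#define NAME_LIMIT 16  /* hypothetical limit for this sketch */

/* Return a zero-length string, never NULL, when the name is too long. */
static const char *field_name(const char *name)
{
    return strlen(name) >= NAME_LIMIT ? "" : name;
}

/* strcat() requires a valid NUL-terminated source string. */
static void append_field(char *buf, size_t size, const char *name)
{
    const char *s = field_name(name);

    if (strlen(buf) + strlen(s) < size)
        strcat(buf, s);
}
```

An over-long name then appends nothing instead of passing NULL to strcat(), which would be undefined behaviour.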
|
|
When a child process exits, it sends exit_signal to its parent via
do_notify_parent(). The clone() syscall constructs exit_signal as:
(lower_32_bits(clone_flags) & CSIGNAL)
CSIGNAL is 0xff, so values in the range 65-255 are possible. However,
valid_signal() only accepts signals up to _NSIG (64 on x86_64). A
non-zero non-valid exit_signal acts the same as exit_signal == 0: the
parent process is not signaled when the child terminates.
The syzkaller reproducer triggers this by calling clone() with flags=0x80,
resulting in exit_signal = (0x80 & CSIGNAL) = 128, which exceeds _NSIG and
is not a valid signal.
The v1 of this patch added the check only in the clone() syscall handler,
which is incomplete. kernel_clone() has other callers such as
sys_ia32_clone() which would remain unprotected. Move the check to
kernel_clone() to cover all callers.
Since the valid_signal() check is now in kernel_clone() and covers all
callers including clone3(), the same check in copy_clone_args_from_user()
becomes redundant and is removed. The higher 32bits check for clone3() is
kept as it is clone3() specific.
Note that this is a user-visible change: previously, passing an invalid
exit_signal to clone() was silently accepted. The man page for clone()
does not document any defined behavior for invalid exit_signal values, so
rejecting them with -EINVAL is the correct behavior. It is unlikely that
any sane application relies on passing an invalid exit_signal.
[oleg@redhat.com: the comment above kernel_clone() should be updated]
Link: https://lore.kernel.org/abwvgU17W8wuW2-J@redhat.com
Link: https://lore.kernel.org/20260316151956.563558-1-kartikey406@gmail.com
Fixes: 3f2c788a1314 ("fork: prevent accidental access to clone3 features")
Signed-off-by: Deepanshu Kartikey <Kartikey406@gmail.com>
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Reported-by: syzbot+bbe6b99feefc3a0842de@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=bbe6b99feefc3a0842de
Tested-by: syzbot+bbe6b99feefc3a0842de@syzkaller.appspotmail.com
Link: https://lore.kernel.org/all/20260307064202.353405-1-kartikey406@gmail.com/T/ [v1]
Link: https://lore.kernel.org/all/20260316104536.558108-1-kartikey406@gmail.com/T/ [v2]
Acked-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Ben Segall <bsegall@google.com>
Cc: Christian Brauner <brauner@kernel.org>
Cc: David Hildenbrand <david@kernel.org>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Kees Cook <kees@kernel.org>
Cc: Liam Howlett <liam@infradead.org>
Cc: Lorenzo Stoakes (Oracle) <ljs@kernel.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Valentin Schneider <vschneid@redhat.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Vlastimil Babka <vbabka@kernel.org>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
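The arithmetic behind the reproducer can be worked through in standalone C. This is a sketch, not the kernel implementation: the helper names, `CSIGNAL_MASK` and `MODEL_NSIG` are hypothetical and take their values (0xff and 64) from the text above.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

#define CSIGNAL_MASK 0xffu  /* value of CSIGNAL per the commit text */
#define MODEL_NSIG   64     /* _NSIG on x86_64 per the commit text */

/* exit_signal is the low byte of the lower 32 bits of clone_flags. */
static int exit_signal_from_flags(uint64_t clone_flags)
{
    return (int)((uint32_t)clone_flags & CSIGNAL_MASK);
}

/* Accept signal numbers up to the limit; 0 means "no signal". */
static bool signal_is_valid(int sig)
{
    return sig >= 0 && sig <= MODEL_NSIG;
}

/* Reject out-of-range exit signals with -EINVAL. */
static int check_exit_signal(uint64_t clone_flags)
{
    return signal_is_valid(exit_signal_from_flags(clone_flags)) ? 0 : -EINVAL;
}
```

With flags=0x80 the extracted exit_signal is 128, above the limit of 64, so the check fails, while a typical value such as 17 passes.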
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace
Pull ring-buffer fixes from Steven Rostedt:
- Fix reporting MISSED EVENTS in trace iterator
When the "trace" file is read with tracing enabled and the writer
passes the iterator reader, the iterator resets, sets a
"missed_events" flag and continues. The tracing output checks for
missed events and, if there are any, prints "[LOST EVENTS]" to let
the user know events were dropped.
But the missed_events flag was cleared as soon as the tracing system
queried the ring buffer iterator about missed events. This was
premature: the ring buffer is per CPU, and the tracing code reads all
the CPU buffers and checks each one for missed events as it does so.
If the CPU iterator that had missed events isn't the one printed
next, the LOST EVENTS notice is never shown.
Clear the missed_events flag when the iterator moves to the next
event and not when the missed_events flag is queried. Also clear it
on reset.
- Flush and stop the persistent ring buffer on panic
On panic the persistent ring buffer is used to debug what caused the
panic. But on some architectures, it requires flushing the memory
from cache, otherwise, the ring buffer persistent memory may not have
the last events and this could also cause the ring buffer to be
corrupted on the next boot.
- Fix nr_subbufs initialization in simple_ring_buffer_init_mm
The nr_subbufs field in the remote simple ring buffer metadata is
initialized too early and gets cleared later on, leaving it at zero
instead of the actual number of sub-buffers.
- Fix unload_page for simple_ring_buffer init rollback
On error, the pages already loaded need to be unloaded. unload_page()
expects the value returned by load_page(): page = load_page(va);
unload_page(page). But the code was calling unload_page(va) instead.
- Create output file from cmd_check_undefined
The check for undefined symbols looks for the *.o.checked file and
skips the work if it exists. But the *.o.checked file was never
created, so every build redid the check even when it had already been
done.
* tag 'trace-ringbuffer-v7.1-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
tracing: Create output file from cmd_check_undefined
tracing: Fix unload_page for simple_ring_buffer init rollback
tracing: Fix nr_subbufs initialization in simple_ring_buffer_init_mm()
ring-buffer: Flush and stop persistent ring buffer on panic
ring-buffer: Fix reporting of missed events in iterator
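
The missed-events fix summarized in the pull message above comes down to where the flag is cleared. The following is a minimal sketch under stated assumptions: struct sketch_iter and its helpers are hypothetical and far simpler than the kernel's ring buffer iterator. It only shows the query being side-effect free while advance and reset consume the flag.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, heavily simplified per-CPU iterator; not the kernel structure. */
struct sketch_iter {
	size_t pos;          /* index of the next event to hand out */
	bool missed_events;  /* set when the writer overran this reader */
};

/* Writer lapped the reader: restart and remember that events were lost. */
static void sketch_iter_overrun(struct sketch_iter *it)
{
	it->pos = 0;
	it->missed_events = true;
}

/* Querying has no side effect: every CPU is polled before one is printed. */
static bool sketch_iter_dropped(const struct sketch_iter *it)
{
	return it->missed_events;
}

/* The flag is consumed only when the reader actually moves to the next event. */
static void sketch_iter_advance(struct sketch_iter *it)
{
	it->pos++;
	it->missed_events = false;
}

/* An explicit reset also clears any stale flag. */
static void sketch_iter_reset(struct sketch_iter *it)
{
	it->pos = 0;
	it->missed_events = false;
}
```

With this split, polling a CPU's iterator any number of times leaves the flag set until that CPU's event is actually consumed, so the notice cannot be dropped by a query alone.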
|
|
ops_dequeue() can race with finish_dispatch() and spuriously trigger the
"queued task must be in BPF scheduler's custody" warning.
ops_dequeue() snapshots p->scx.ops_state via atomic_long_read_acquire()
and then, in the SCX_OPSS_QUEUED arm, asserts that SCX_TASK_IN_CUSTODY
is set. The two reads are not atomic w.r.t. a concurrent
finish_dispatch() running on another CPU:
  CPU 1                                     CPU 2
  =====                                     =====
                                            dequeue_task_scx()
                                              ops_dequeue()
                                                opss = read_acquire(ops_state)
                                                     = SCX_OPSS_QUEUED
  finish_dispatch()
    cmpxchg ops_state:
      SCX_OPSS_QUEUED -> SCX_OPSS_DISPATCHING [succeeds]
    dispatch_enqueue(SCX_DSQ_GLOBAL,
                     SCX_ENQ_CLEAR_OPSS)
      call_task_dequeue()
        p->scx.flags &= ~SCX_TASK_IN_CUSTODY
                                                WARN_ON_ONCE(!(p->scx.flags &
                                                    SCX_TASK_IN_CUSTODY))
                                                /* opss is stale: QUEUED,
                                                 * but task already claimed */
      set_release(ops_state, SCX_OPSS_NONE)
The race has been observed via two distinct call chains: the most common
goes through sched_setaffinity(), a rarer variant through
sched_change_begin().
For SCX_DSQ_GLOBAL / SCX_DSQ_BYPASS, dispatch_enqueue() clears
SCX_TASK_IN_CUSTODY before clearing ops_state to SCX_OPSS_NONE
(intentional, to avoid concurrent non-atomic RMW of p->scx.flags against
ops_dequeue()). The window between those two writes is exactly what
ops_dequeue() observes as "QUEUED without custody".
The observed state is not actually inconsistent; it just means CPU 1 has
already claimed the task and the QUEUED value held by CPU 2 is stale.
Re-read ops_state in that case; the next read is guaranteed to return
SCX_OPSS_DISPATCHING or SCX_OPSS_NONE, both of which exit the switch
cleanly. The retry is bounded: once IN_CUSTODY is cleared, ops_state has
already advanced past QUEUED for this dispatch cycle, and a fresh QUEUED
would require re-enqueue under p's rq lock, which CPU 2 holds.
Changes in v2:
- Use READ_ONCE() for p->scx.flags to ensure fresh reads and prevent
compiler reordering in the lockless path
- Add cpu_relax() to reduce power consumption and improve performance
during the spin-wait
- Use unlikely() to optimize branch prediction for the common case
- Expand the in-code comment to document the race condition and
bounded retry guarantee
Fixes: ebf1ccff79c4 ("sched_ext: Fix ops.dequeue() semantics")
Suggested-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Samuele Mariotti <smariotti@disroot.org>
Signed-off-by: Paolo Valente <paolo.valente@unimore.it>
Signed-off-by: Tejun Heo <tj@kernel.org>
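
The re-read described in the sched_ext commit above can be sketched in portable C11 atomics. This is not the kernel code: the struct, the enum values and the function names are hypothetical, the dequeue path is reduced to a three-way verdict, and the cpu_relax() and READ_ONCE() mentioned in the changelog have no direct equivalent here.

```c
#include <stdatomic.h>

/* Hypothetical stand-ins for the states and the flag named in the commit message. */
enum sketch_opss { SKETCH_NONE, SKETCH_QUEUED, SKETCH_DISPATCHING };
#define SKETCH_IN_CUSTODY 0x1u

enum sketch_verdict { SKETCH_OWNED, SKETCH_GONE, SKETCH_RETRY };

struct sketch_task {
	_Atomic long ops_state;
	_Atomic unsigned int flags;
};

/* Judge one snapshot of ops_state against the current custody flag. */
static enum sketch_verdict sketch_classify(struct sketch_task *p, long opss)
{
	if (opss != SKETCH_QUEUED)
		return SKETCH_GONE;    /* NONE or DISPATCHING: nothing to take back */

	if (atomic_load_explicit(&p->flags, memory_order_relaxed) & SKETCH_IN_CUSTODY)
		return SKETCH_OWNED;   /* consistent: queued and still in custody */

	return SKETCH_RETRY;       /* QUEUED without custody: the snapshot is stale */
}

/* Dequeue side: take a fresh snapshot whenever the previous one was stale. */
static enum sketch_verdict sketch_dequeue(struct sketch_task *p)
{
	enum sketch_verdict v;

	do {
		long opss = atomic_load_explicit(&p->ops_state, memory_order_acquire);

		v = sketch_classify(p, opss);
	} while (v == SKETCH_RETRY);

	return v;
}
```

Splitting the check from the loop makes the stale case visible: classifying an old QUEUED snapshot after the dispatcher has cleared custody yields SKETCH_RETRY, while a fresh read of ops_state resolves to SKETCH_GONE.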
|
|
In paths where tracing_map_elt_alloc() failed to allocate objects,
the map->ops->elt_alloc() call was never successful. In this case,
map->ops->elt_free() should not be called.
Link: https://sashiko.dev/#/patchset/20260520223101.34710-1-rosenp%40gmail.com
Cc: stable@vger.kernel.org
Cc: Tom Zanussi <tom.zanussi@linux.intel.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Rosen Penev <rosenp@gmail.com>
Reported-by: Sashiko <sashiko-bot@kernel.org>
Fixes: 2734b629525a ("tracing: Add per-element variable support to tracing_map")
Link: https://patch.msgid.link/177933895460.108746.5396070821443932634.stgit@devnote2
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
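
The rule in the tracing_map commit above, that elt_free() must only run after elt_alloc() succeeded, can be sketched as follows. The structure, the private_ready marker and every function name are assumptions made for illustration; the actual patch may track this differently.

```c
#include <stdlib.h>

/* Hypothetical miniature of an element whose private data comes from an alloc hook. */
struct sketch_elt {
	void *fields;
	void *private_data;
	int private_ready;   /* set only after the alloc hook succeeded */
};

static int sketch_elt_free_calls;  /* counts free-hook invocations */

/* Stand-in for map->ops->elt_alloc(). */
static int sketch_ops_elt_alloc(struct sketch_elt *elt)
{
	elt->private_data = malloc(16);
	return elt->private_data ? 0 : -1;
}

/* Stand-in for map->ops->elt_free(). */
static void sketch_ops_elt_free(struct sketch_elt *elt)
{
	sketch_elt_free_calls++;
	free(elt->private_data);
}

static void sketch_elt_destroy(struct sketch_elt *elt)
{
	if (!elt)
		return;
	if (elt->private_ready)          /* skip the hook if alloc never succeeded */
		sketch_ops_elt_free(elt);
	free(elt->fields);
	free(elt);
}

/* fail_fields simulates an allocation failure before the alloc hook runs. */
static struct sketch_elt *sketch_elt_create(int fail_fields)
{
	struct sketch_elt *elt = calloc(1, sizeof(*elt));

	if (!elt)
		return NULL;

	elt->fields = fail_fields ? NULL : malloc(32);
	if (!elt->fields)
		goto err;                    /* alloc hook has not run yet */

	if (sketch_ops_elt_alloc(elt))
		goto err;
	elt->private_ready = 1;
	return elt;

err:
	sketch_elt_destroy(elt);
	return NULL;
}
```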
|
|
As the output file is currently never created, the check will run every
time, even if the inputs have not changed.
Create an empty output file which allows make to skip the execution when
it is not necessary.
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Vincent Donnefort <vdonnefort@google.com>
Cc: Marc Zyngier <maz@kernel.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Link: https://patch.msgid.link/20260520-tracing-ringbuffer-check-v1-1-d979cfab1338@weissschuh.net
Fixes: 1211907ac0b5 ("tracing: Generate undef symbols allowlist for simple_ring_buffer")
Fixes: 58b4bd18390e ("tracing: Adjust cmd_check_undefined to show unexpected undefined symbols")
Reviewed-by: Nathan Chancellor <nathan@kernel.org>
Tested-by: Nathan Chancellor <nathan@kernel.org>
Signed-off-by: Thomas Weißschuh <linux@weissschuh.net>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
|
|
The unload_page callback expects the return value of load_page() as its
argument: ret = load_page(va); unload_page(ret). Fix the rollback code in
simple_ring_buffer_init_mm() where the descriptor's VA is used instead
of the loaded page address.
Link: https://patch.msgid.link/20260512141614.1759430-1-vdonnefort@google.com
Fixes: 635923081c79 ("tracing: load/unload page callbacks for simple_ring_buffer")
Signed-off-by: Vincent Donnefort <vdonnefort@google.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
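
The rollback rule from the unload_page commit above can be shown with a short sketch. The callbacks here are hypothetical stand-ins (load_page is modelled as adding a fixed offset, and a zero VA simulates a load failure); the point is only that the rollback hands unload_page the values load_page returned, not the descriptor VAs.

```c
#include <stddef.h>

#define SKETCH_OFFSET 0x1000UL  /* assumed: loaded address differs from the VA */
#define SKETCH_MAX    8

static unsigned long sketch_unloaded[SKETCH_MAX];  /* log of unload_page() arguments */
static size_t sketch_nr_unloaded;

/* Stand-in for the load_page callback; a zero VA simulates a failure. */
static unsigned long sketch_load_page(unsigned long va)
{
	return va ? va + SKETCH_OFFSET : 0;
}

/* Stand-in for the unload_page callback; records what it was given. */
static void sketch_unload_page(unsigned long page)
{
	if (sketch_nr_unloaded < SKETCH_MAX)
		sketch_unloaded[sketch_nr_unloaded++] = page;
}

/* Load n pages; on failure, unload the ones already loaded in reverse order. */
static int sketch_load_all(const unsigned long *vas, size_t n, unsigned long *pages)
{
	size_t i;

	for (i = 0; i < n; i++) {
		pages[i] = sketch_load_page(vas[i]);
		if (!pages[i])
			goto rollback;
	}
	return 0;

rollback:
	while (i--)
		sketch_unload_page(pages[i]);  /* the loaded address, not vas[i] */
	return -1;
}
```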
|
|
nr_subbufs in the ring buffer metadata is always initialized to zero
because it is assigned from cpu_buffer->nr_pages before the page
initialization loop has run. While nr_subbufs is not currently read
by the kernel, it should reflect the actual buffer geometry in the
meta page for correctness.
Move the assignment after the page loop so that cpu_buffer->nr_pages
holds the final count.
Link: https://patch.msgid.link/20260512135420.99194-1-devnexen@gmail.com
Fixes: 34e5b958bdad ("tracing: Introduce simple_ring_buffer")
Reviewed-by: Vincent Donnefort <vdonnefort@google.com>
Assisted-by: Claude:claude-opus-4-7
Signed-off-by: David Carlier <devnexen@gmail.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
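
The ordering issue in the nr_subbufs commit above is easy to reproduce in miniature. The structures below are hypothetical and stripped down to the two counters involved; the sketch only shows the assignment placed after the loop that establishes nr_pages.

```c
#include <stddef.h>

/* Hypothetical reduced versions of the meta page and the per-CPU buffer. */
struct sketch_meta {
	size_t nr_subbufs;
};

struct sketch_cpu_buffer {
	size_t nr_pages;
	struct sketch_meta meta;
};

static void sketch_init(struct sketch_cpu_buffer *cb, size_t nr_requested)
{
	cb->nr_pages = 0;

	/* Copying nr_pages into the meta page here would always record 0. */

	for (size_t i = 0; i < nr_requested; i++)
		cb->nr_pages++;              /* stands in for initializing one page */

	/* Assign only once the loop has produced the final count. */
	cb->meta.nr_subbufs = cb->nr_pages;
}
```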
|
|
On real hardware, panic and machine reboot may not flush hardware cache
to memory. This means the persistent ring buffer, which relies on a
coherent state of memory, may not have its events written to the buffer
and they may be lost. Moreover, there may be inconsistency with the
counters which are used for validation of the integrity of the
persistent ring buffer which may cause all data to be discarded.
To avoid this issue, stop recording of the ring buffer on panic and
flush the cache of the ring buffer's memory.
Fixes: e645535a954a ("tracing: Add option to use memmapped memory for trace boot instance")
Cc: stable@vger.kernel.org
Cc: Will Deacon <will@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Ian Rogers <irogers@google.com>
Link: https://patch.msgid.link/177751969602.2136606.12031934362587643488.stgit@mhiramat.tok.corp.google.com
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Acked-by: Catalin Marinas <catalin.marinas@arm.com>
Acked-by: Geert Uytterhoeven <geert@linux-m68k.org>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
|
|
When tracing is active while reading the trace file, if the iterator
reading the buffer detects that the writer has passed the iterator head,
it will reset and set a "missed events" flag. This flag is passed to the
output processing to show the user that events were missed:
CPU:4 [LOST EVENTS]
The problem is that the |