| Age | Commit message (Collapse) | Author | Files | Lines |
|
Boris Brezillion needs the gem lru fixes 379e8f1ca5e9 ("drm/gem: Make
the GEM LRU lock part of drm_device") backmerged for drm-misc-next.
That also means we need to sort out the rename conflict in panthor with
the fixup patch from Boris from drm-tip.
Signed-off-by: Simona Vetter <simona.vetter@ffwll.ch>
|
|
Controlled by IOMMU drivers, ATS can be enabled "on demand", when a given
PASID on a device is attached to an I/O page table. This is working, even
when a device has no translation on its RID (i.e., RID is IOMMU bypassed).
However, certain PCIe devices require non-PASID ATS on their RID even when
the RID is IOMMU bypassed. Call this "ATS always on" in IOMMU term.
For example, CXL spec r4.0 notes in sec 3.2.5.13 Memory Type on CXL.cache:
"To source requests on CXL.cache, devices need to get the Host Physical
Address (HPA) from the Host by means of an ATS request on CXL.io."
In other words, the CXL.cache capability requires ATS; otherwise, it can't
access host physical memory.
Introduce a new pci_ats_required() helper for the IOMMU driver to scan a
PCI device and shift ATS policies between "on demand" and "always on".
Add the support for CXL.cache devices first. Pre-CXL devices will be added
in quirks.c file.
Note that pci_ats_required() validates against pci_ats_supported(), so we
ensure that untrusted devices (e.g. external ports) will not be always on.
This maintains the existing ATS security policy regarding potential side-
channel attacks via ATS.
Cc: linux-cxl@vger.kernel.org
Suggested-by: Vikram Sethi <vsethi@nvidia.com>
Suggested-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Tested-by: Nirmoy Das <nirmoyd@nvidia.com>
Acked-by: Nirmoy Das <nirmoyd@nvidia.com>
Reviewed-by: Dave Jiang <dave.jiang@intel.com>
Acked-by: Bjorn Helgaas <bhelgaas@google.com>
Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
Reviewed-by: Yi Liu <yi.l.liu@intel.com>
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
|
|
gcc-16 has gained some more advanced inter-procedual optimization
techniques that enable it to inline the dummy_tlb_add_page() and
dummy_tlb_flush() function pointers into a specialized version of
__arm_v7s_unmap:
WARNING: modpost: vmlinux: section mismatch in reference: __arm_v7s_unmap+0x2cc (section: .text) -> dummy_tlb_add_page (section: .init.text)
ERROR: modpost: Section mismatches detected.
>From what I can tell, the transformation is correct, as this is only
called when __arm_v7s_unmap() is called from arm_v7s_do_selftests(),
which is also __init. Since __arm_v7s_unmap() however is not __init,
gcc cannot inline the inner function calls directly.
In debug_objects_selftest(), the same thing happens. Both the
caller and the leaf function are __init, but the IPA pulls
it into a non-init one:
WARNING: modpost: vmlinux: section mismatch in reference: lookup_object_or_alloc+0x7c (section: .text.lookup_object_or_alloc) -> is_static_object (section: .init.text)
Marking the affected functions as not "__init" would reliably avoid this
issue but is not a good solution because it removes an otherwise correct
annotation. I tried marking the functions as 'noinline', but that ended
up not covering all the affected configurations.
With some more experimenting, I found that marking these functions as
__attribute__((noipa)) is both logical and reliable.
In order to keep the syntax readable, add a custom macro for this in
include/linux/compiler_attributes.h next to other related macros and
use it to annotate both files.
Link: https://lore.kernel.org/all/abRB6g-48ZX6Yl2r@willie-the-truck/
Cc: Will Deacon <will@kernel.org>
Cc: Thomas Gleixner <tglx@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Miguel Ojeda <ojeda@kernel.org>
Cc: linux-kbuild@vger.kernel.org
Cc: stable@vger.kernel.org
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Will Deacon <will@kernel.org>
Acked-by: Thomas Gleixner <tglx@kernel.org>
Acked-by: Miguel Ojeda <ojeda@kernel.org>
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
|
|
This is more consistent with what commit 7efa84b5cdd6 ("compiler-gcc.h:
Introduce __diag_GCC_all") did for GCC.
Link: https://patch.msgid.link/20260517-bump-minimum-supported-llvm-version-to-17-v2-16-b3b8cda46bdd@kernel.org
Signed-off-by: Nathan Chancellor <nathan@kernel.org>
|
|
Now that the minimum supported version of LLVM for building the kernel
has been raised to 17.0.1, the redefinition of __cleanup with
__maybe_unused added to it is unnecessary because the referenced LLVM
change is present in all supported LLVM versions. Drop it.
Link: https://patch.msgid.link/20260517-bump-minimum-supported-llvm-version-to-17-v2-15-b3b8cda46bdd@kernel.org
Signed-off-by: Nathan Chancellor <nathan@kernel.org>
|
|
When the VMBus driver is not part of the kernel (CONFIG_HYPERV_VMBUS=n),
the MSHV root driver fails to link:
ERROR: modpost: "hv_vmbus_exists" [drivers/hv/mshv_root.ko] undefined!
Fix this while meeting these requirements:
* It must be possible to include the MSHV root driver without the
VMBus driver. In such case, the MSHV root driver can be built-in
to the kernel image, or it can be built as a separate module.
* If both the MSHV root driver and the VMBus driver are present, the
MSHV root driver and VMBus driver can both be built-in, or they can
both be separate modules. Or the MSHV root driver can be a module
while the VMBus driver can be built-in, but the reverse is
disallowed. Regardless of the build choices, the VMBus driver must
be loaded before the MSHV driver in order for the SynIC to be
managed properly (see comments in the MSHV SynIC code).
The fix has two parts:
* Add a Kconfig entry for MSHV_ROOT to depend on HYPERV_VMBUS if
HYPERV_VMBUS is present. The entry disallows MSHV_ROOT being
built-in when HYPERV_VMBUS is a module, but without requiring that
HYPERV_VMBUS be built.
* Add a stub implementation of hv_vmbus_exists() for when the
VMBus driver is not present so that the MSHV root driver has
no module dependency on VMBus. When the VMBus driver *is*
present, the module dependency ensures that the VMBus driver
loads first when both are built as modules.
Existing code ensures that the VMBus driver loads first if it is
built-in. The VMBus driver uses subsys_initcall(), which is
initcall level 4. The MSHV root driver uses module_init(), which
becomes device_init() when built-in, and device_init() is
initcall level 6.
Reported-by: Arnd Bergmann <arnd@arndb.de>
Closes: https://lore.kernel.org/all/20260520074044.923728-1-arnd@kernel.org/
Signed-off-by: Michael Kelley <mhklinux@outlook.com>
Acked-by: Arnd Bergmann <arnd@arndb.de>
Reviewed-by: Jork Loeser <jloeser@linux.microsoft.com>
Reviewed-by: Hardik Garg <hargar@linux.microsoft.com>
Signed-off-by: Wei Liu <wei.liu@kernel.org>
|
|
Move the post_unbind_rust callback before devres_release_all() in
device_unbind_cleanup().
With drvdata() removed, the driver's bus device private data is only
accessible by the owning driver itself. It is hence safe to drop the
driver's bus device private data before devres actions are released.
This reordering is the key enabler for Higher-Ranked Lifetime Types
(HRT) in Rust device drivers -- it allows driver structs to hold direct
references to devres-managed resources, because the bus device private
data (and with it all such references) is guaranteed to be dropped while
the underlying devres resources are still alive.
Without this change, devres resources would be freed first, leaving the
driver's bus device private data with dangling references during its
destructor.
Reviewed-by: Alexandre Courbot <acourbot@nvidia.com>
Reviewed-by: Gary Guo <gary@garyguo.net>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Link: https://patch.msgid.link/20260525202921.124698-6-dakr@kernel.org
Signed-off-by: Danilo Krummrich <dakr@kernel.org>
|
|
https://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf-next
Florian Westphal says:
====================
netfilter: updates for net-next
The following patchset contains Netfilter fixes and small enhancements:
1) Disable 32-bit x_tables compatibility (32bit binaries on 64bit
kernel) interface in user namespaces. This is 'last warning'
before this is removed for good.
2) Add a configuration toggle for netfilter GCOV profiling. Provide
dedicated toggles for ipset and ipvs.
3) Remove modular support for nfnetlink and restrict it to built-in only.
From Pablo Neira Ayuso.
4) Use per-rule hash initval in nf_conncount. This avoids unecessary
lock contention with short keys (e.g. conntrack zones) in different
namespaces.
5) Use nf_ct_exp_net() in ctnetlink expectation dumps.
From Pratham Gupta.
6) Remove a dead conditional in nft_set_rbtree.
7) Fix conntrack helper policy updates to apply per-class values correctly.
From David Carlier.
8) Fix an off-by-one OOB read in nf_conntrack_irc:parse_dcc(). Use strict
less-than comparison in the newline search loop to respect the
exclusive-end pointer convention. From Muhammad Bilal.
9) Fix typos in nf_conntrack_proto_tcp comments. From Avinash Duduskar.
10) Restore performance optimization in nft_set_pipapo_avx2 by passing
the next map index. Refactor lookup logic for clarity and add a
DEBUG_NET check to document this.
11) Avoid (harmless) u16 overflow in nf_conntrack_ftp when parsing FTP PORT
and EPRT commands. Ignore commands where single octet exceeds 255.
From Giuseppe Caruso.
Patch 12, which removes incorrect (and obviously unused) code from
nft_byteorder was kept back to avoid a net -> net-next merge conflict.
* tag 'nf-next-26-05-25' of https://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf-next:
netfilter: nf_conntrack_ftp: avoid u16 overflows
netfilter: nft_set_pipapo_avx2: restore performance optimization
netfilter: nf_conntrack_proto_tcp: fix typos in comments
netfilter: nf_conntrack_irc: fix parse_dcc() off-by-one OOB read
netfilter: nfnl_cthelper: apply per-class values when updating policies
netfilter: nft_set_rbtree: remove dead conditional
netfilter: ctnetlink: use nf_ct_exp_net() in expectation dump
netfilter: nf_conncount: use per-rule hash initval
netfilter: allow nfnetlink built-in only
netfilter: add option for GCOV profiling
netfilter: x_tables: disable 32bit compat interface in user namespaces
====================
Link: https://patch.msgid.link/20260525182924.28456-1-fw@strlen.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Most of the lls source files are missing SPDX-License-Identifier
lines. Add appropriate IDs to these files, and remove other license
info from the header. In once case, leave the existing id line
and just remove the license reference text.
Signed-off-by: Tim Bird <tim.bird@sony.com>
Link: https://patch.msgid.link/20260522225508.24006-1-tim.bird@sony.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Address checkpatch.pl warning below, across the audit subsystem:
WARNING: Prefer 'unsigned int' to bare use of 'unsigned'
Minor cleanup, no functional changes.
Signed-off-by: Ricardo Robaina <rrobaina@redhat.com>
Signed-off-by: Paul Moore <paul@paul-moore.com>
|
|
Change "succesfully" to "successfully" in the kerneldoc
comment of call_once().
Signed-off-by: Jiun Jeong <jiun.jeong.cs@gmail.com>
Link: https://patch.msgid.link/20260501144413.49419-1-jiun.jeong.cs@gmail.com
[sean: don't scope to KVM, massage changelog]
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Commit abb30460bda2 ("block: mark bio_wouldblock_error() bio with
BIO_QUIET") added this to suppress buffer_head warnings, but neither
when this commit was added nor now any buffer_head using code actually
ever sets REQ_NOWAIT which can lead to BLK_STS_AGAIN.
Remove the special handling for now. If we ever plan to use REQ_NOWAIT
for buffer_head based I/O we're better off handling BLK_STS_AGAIN in
the completion handler as it actually needs to retry the I/O as well.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://patch.msgid.link/20260518063336.507369-3-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
numa_node in blk_mq_hw_ctx and the matching argument of
blk_mq_ops::init_request can be NUMA_NO_NODE (-1). Declared as
unsigned int, NUMA_NO_NODE becomes UINT_MAX and walks off
nvme_dev::descriptor_pools[] on CONFIG_NUMA=n [1].
Switch the field and the callback prototype to int and update all
in-tree init_request implementations. No functional change:
cpu_to_node(), kmalloc_node() and blk_alloc_flush_queue() already
take int.
Link: https://lore.kernel.org/linux-nvme/20260522150628.399288-1-mateusz.nowicki@posteo.net/ [1]
Link: https://lore.kernel.org/linux-nvme/20260309062840.2937858-2-iam@sung-woo.kim/
Suggested-by: Caleb Sander Mateos <csander@purestorage.com>
Suggested-by: Sung-woo Kim <iam@sung-woo.kim>
Signed-off-by: Mateusz Nowicki <mateusz.nowicki@posteo.net>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://patch.msgid.link/20260523125210.272274-1-mateusz.nowicki@posteo.net
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
When populating a guest_memfd instance with the initial CPUID data for an
SNP guest, acquire a writable pin on the source page as KVM will write back
the "correct" CPUID information if the userspace provided data is rejected
by trusted firmware. Because KVM writes to the source page using a kernel
mapping, pinning for read could result in KVM clobbering read-only memory.
Note, well-behaved VMMs are unlikely to be affected, as CPUID information
is almost always dynamically generated by userspace, i.e. it's unlikely for
the CPUID information to be backed by a read-only mapping.
Fixes: 2a62345b30529 ("KVM: guest_memfd: GUP source pages prior to populating guest memory")
Cc: stable@vger.kernel.org
Signed-off-by: Ackerley Tng <ackerleytng@google.com>
Link: https://patch.msgid.link/20260522-fix-sev-gmem-post-populate-v2-1-3f196bfad5a1@google.com
[sean: rewrite shortlog and changelog, tag for stable@]
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull misc fixes from Andrew Morton:
"13 hotfixes. 9 are for MM. 9 are cc:stable and the remaining 4 address
post-7.1 issues or aren't considered suitable for backporting.
All patches are singletons - please see the individual changelogs for
details"
* tag 'mm-hotfixes-stable-2026-05-25-16-22' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm:
Revert "mm: introduce a new page type for page pool in page type"
mm/vmalloc: do not trigger BUG() on BH disabled context
MAINTAINERS, mailmap: change email for Eugen Hristev
mm/migrate_device: fix pgtable leak in migrate_vma_insert_huge_pmd_page
kernel/fork: validate exit_signal in kernel_clone()
mm: memcontrol: propagate NMI slab stats to memcg vmstats
mm/damon/sysfs-schemes: delete tried region in regions_rmdirs()
mm/rmap: initialize nr_pages to 1 at loop start in try_to_unmap_one
zram: fix use-after-free in zram_writeback_endio
memfd: deny writeable mappings when implying SEAL_WRITE
ipc: limit next_id allocation to the valid ID range
Revert "mm/hugetlbfs: update hugetlbfs to use mmap_prepare"
MAINTAINERS: .mailmap: update after GEHC spin-off
|
|
The last user was removed in commit aea12071d6fc
("power/supply: Drop obsolete JZ4740 driver") and replaced by
a self-contained IIO-based driver. No file includes this header.
Assisted-by: Claude:claude-opus-4-6
Signed-off-by: Costa Shulyupin <costa.shul@redhat.com>
Signed-off-by: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
|
|
The last user was the JZ4740 MFD ADC driver, removed in commit
ff71266aa490 ("mfd: Drop obsolete JZ4740 driver") and replaced
by a self-contained IIO driver. No file includes or references
this header.
Assisted-by: Claude:claude-opus-4-6
Signed-off-by: Costa Shulyupin <costa.shul@redhat.com>
Signed-off-by: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
|
|
Prepare for a smarter iterator for /proc/interrupts so that the next
interrupt descriptor can be cached after lookup.
Signed-off-by: Thomas Gleixner <tglx@kernel.org>
Tested-by: Michael Kelley <mhklinux@outlook.com>
Reviewed-by: Dmitry Ilvokhin <d@ilvokhin.com>
Link: https://patch.msgid.link/20260517194931.917415190@kernel.org
|
|
show_interrupts() evaluates a boatload of conditions to establish whether
it should expose an interrupt in /proc/interrupts or not.
That can be simplified by caching the condition in an internal status flag,
which is updated when one of the relevant inputs changes.
The irq_desc::kstat_irq check is dropped because visible interrupt
descriptors always have a valid pointer.
As a result the number of instructions and branches for reading
/proc/interrupts is reduced significantly.
Signed-off-by: Thomas Gleixner <tglx@kernel.org>
Tested-by: Michael Kelley <mhklinux@outlook.com>
Reviewed-by: Dmitry Ilvokhin <d@ilvokhin.com>
Reviewed-by: Radu Rendec <radu@rendec.net>
Link: https://patch.msgid.link/20260517194931.680943749@kernel.org
|
|
Interrupts which are not marked per CPU increment not only the per CPU
statistics, but also the accumulation counter irq_desc::tot_count.
Change the counter to type unsigned long so it does not produce sporadic
zeros due to wrap arounds on 64-bit machines and do a quick check for non
per CPU interrupts. If the counter is zero, then simply emit a full set of
zero strings. That spares the evaluation of the per CPU counters completely
for interrupts with zero events.
Signed-off-by: Thomas Gleixner <tglx@kernel.org>
Tested-by: Michael Kelley <mhklinux@outlook.com>
Reviewed-by: Dmitry Ilvokhin <d@ilvokhin.com>
Reviewed-by: Radu Rendec <radu@rendec.net>
Link: https://patch.msgid.link/20260517194931.115522199@kernel.org
|
|
A large portion of interrupt count entries are zero. There is no point in
formatting the zero value as it is way cheeper to just emit a constant
string.
Collect the number of consecutive zero counts and emit them in one go
before a non-zero count and at the end of the line.
Signed-off-by: Thomas Gleixner <tglx@kernel.org>
Tested-by: Michael Kelley <mhklinux@outlook.com>
Reviewed-by: Dmitry Ilvokhin <d@ilvokhin.com>
Reviewed-by: Radu Rendec <radu@rendec.net>
Reviewed-by: Shrikanth Hegde <sshegde@linux.ibm.com>
Link: https://patch.msgid.link/20260517194931.034728540@kernel.org
|
|
Add SM_THERMAL_CALIB_READ to the secure monitor command enum and
introduce meson_sm_get_thermal_calib() to allow drivers to retrieve
thermal sensor calibration data through the firmware interface.
Signed-off-by: Ronald Claveau <linux-kernel-dev@aliel.fr>
Signed-off-by: Daniel Lezcano <daniel.lezcano@kernel.org>
Reviewed-by: Neil Armstrong <neil.armstrong@linaro.org>
Link: https://patch.msgid.link/20260424-add-thermal-t7-vim4-v5-2-9040ca36afe2@aliel.fr
|
|
exec_mmap() installs the new mm and then tears the old one down while
still holding exec_update_lock for writing -- and with cred_guard_mutex
held all the way to setup_new_exec():
setmax_mm_hiwater_rss(&tsk->signal->maxrss, old_mm);
mm_update_next_owner(old_mm);
mmput(old_mm);
Neither lock is needed for this. exec_update_lock only exists to make the
mm swap atomic with the later commit_creds(), so that permission-checking
readers (proc, ptrace, the futex robust list, perf, kcmp, mm_access())
never observe the new mm together with the old credentials. Those readers
all operate on task->mm, i.e. the new mm after the swap; none looks at the
detached old mm, its ->owner or signal->maxrss. cred_guard_mutex guards
credential calculation and is equally irrelevant here.
The cost is real: __mmput() runs exit_mmap() over the entire old address
space and can block in exit_aio() waiting for in-flight AIO, all while
holding exec_update_lock for writing and cred_guard_mutex. For execve() of
a large process this blocks ptrace_attach() and every exec_update_lock
reader for the duration of the teardown.
Stash the old mm in bprm->old_mm and release it from setup_new_exec()
after both locks are dropped. setup_new_exec() still runs before
setup_arg_pages() and the segment mappings, so the old address space is
freed before the new one is populated and peak memory is unchanged. The
ordering constraints are kept: old_mm's mmap_lock is still dropped in
exec_mmap() before mm_update_next_owner() (required since commit
31a78f23bac0 ("mm owner: fix race between swapoff and exit")), and
mm_update_next_owner() still precedes mmput(); both run in the execing
task's context, as mm_update_next_owner() requires.
If exec swaps the mm but fails before setup_new_exec() runs the old mm
would leak, so add a backstop in free_bprm(). The lazy-tlb case
(old_mm == NULL, e.g. kernel_execve()) has no address space to
free and is left in exec_mmap().
Link: https://patch.msgid.link/20260522-work-exit_mm-v1-1-bd32d5a560bb@kernel.org
Signed-off-by: Christian Brauner (Amutable) <brauner@kernel.org>
|
|
The dumpable flag captured at execve() is consulted by
__ptrace_may_access() and several /proc owner / visibility checks.
It lives on mm_struct today, which exit_mm() clears from the task
long before the task itself is reaped.
exec_state is anchored to the execve() that established the current
privilege domain. CLONE_VM siblings refcount-share the parent's
exec_state via copy_exec_state(); non-CLONE_VM clones allocate a
fresh exec_state inheriting the parent's dumpable mode and user_ns
reference via task_exec_state_copy(). execve() allocates a fresh
instance (via alloc_task_exec_state() in begin_new_exec()) and
installs it under task_lock + exec_update_lock with
task_exec_state_replace(). init_task uses a static instance.
The dumpable mode now lives on task->exec_state->dumpable.
task->mm->flags no longer carries dumpability; MMF_DUMPABLE_MASK is
removed, but MMF_DUMPABLE_BITS is reserved so MMF_DUMP_FILTER_* bit
positions remain stable for the /proc/<pid>/coredump_filter ABI. The
task->user_dumpable cache bit and its assignment in exit_mm() are
removed; readers go through get_dumpable(task) directly.
coredump_params gains a snapshot field cprm.dumpable, populated from
get_dumpable(current) at vfs_coredump() entry, replacing the previous
__get_dumpable(cprm->mm_flags) consumers in fs/coredump.c and
fs/pidfs.c.
The user namespace recorded at execve() is consulted by
__ptrace_may_access() and by /proc/PID/* owner derivation. Move the
captured user_ns onto task_exec_state, which stays attached to the task
past exit_mm() and across exit_files().
bprm grows a user_ns field staged in bprm_mm_init() with the caller's
user_ns, narrowed by would_dump() to the closest privileged ancestor,
and consumed by exec_mmap() via alloc_task_exec_state(bprm->user_ns).
free_bprm() releases the staging reference.
mm_struct loses ->user_ns entirely. Initializers in init-mm, efi_mm,
and the implicit one in mm_init()/dup_mm()/mm_alloc() are removed;
__mmdrop() drops the matching put_user_ns(). The kthread_use_mm()
WARN_ON_ONCE(!mm->user_ns) is no longer meaningful and goes too.
Reviewed-by: Jann Horn <jannh@google.com>
Link: https://patch.msgid.link/20260520-work-task_exec_state-v3-4-69f895bc1385@kernel.org
Signed-off-by: Christian Brauner (Amutable) <brauner@kernel.org>
|
|
Add a helper that encapsulates all of the logic for checking ptrace
access and remove open-coded versions in follow-up patches.
Reviewed-by: Jann Horn <jannh@google.com>
Reviewed-by: David Hildenbrand (arm) <david@kernel.org>
Link: https://patch.msgid.link/20260520-work-task_exec_state-v3-3-69f895bc1385@kernel.org
Signed-off-by: Christian Brauner (Amutable) <brauner@kernel.org>
|
|
Introduce struct task_exec_state, a per-task RCU-protected structure
that holds the dumpable mode and the user namespace and stays attached
to the task for its full lifetime.
task_exec_state_rcu() is the canonical reader: asserts RCU or
task_lock is held, WARNs on a NULL state, returns the
rcu_dereference()'d pointer.
Reviewed-by: Jann Horn <jannh@google.com>
Link: https://patch.msgid.link/20260520-work-task_exec_state-v3-2-69f895bc1385@kernel.org
Signed-off-by: Christian Brauner (Amutable) <brauner@kernel.org>
|
|
Replace the SUID_DUMP_DISABLE/USER/ROOT preprocessor constants with
enum task_dumpable. Numeric values are preserved (kernel.suid_dumpable
sysctl and prctl(PR_SET_DUMPABLE) ABI), so this is a pure rename with
no behavioral change.
Subsequent commits relocate dumpability onto a per-task structure
where the enum type will allow stronger type-checking on the new API.
Reviewed-by: Jann Horn <jannh@google.com>
Reviewed-by: David Hildenbrand (arm) <david@kernel.org>
Link: https://patch.msgid.link/20260520-work-task_exec_state-v3-1-69f895bc1385@kernel.org
Signed-off-by: Christian Brauner (Amutable) <brauner@kernel.org>
|
|
KHO_TREE_MAX_DEPTH is calculated as:
DIV_ROUND_UP(KHO_ORDER_0_LOG2 - KHO_BITMAP_SIZE_LOG2,
KHO_TABLE_SIZE_LOG2) + 1
For systems with 16KB pages (e.g. arm64 with CONFIG_ARM64_16K_PAGES=y or
LoongArch), this gives a depth of 4. Since levels are 0 based, with
depth = 4 the effective top level is 3 and the top-level shift at bit 39.
PAGE_SHIFT = 14
KHO_BITMAP_SIZE_LOG2 = PAGE_SHIFT + 3 = 17
KHO_TABLE_SIZE_LOG2 = log(2; (1 << PAGE_SHIFT) / 8) = 11
shift = ((3 - 1) * KHO_TABLE_SIZE_LOG2) + KHO_BITMAP_SIZE_LOG2 = 39
The order-0 bit sits at bit 50 (KHO_ORDER_0_LOG2 = 64 - PAGE_SHIFT =
50). When inserting or reading a key, the index extracted at the top
level is:
(1 << 50) >> 39 = 2048
2048 is exactly the table size (PAGE_SIZE / sizeof(phys_addr_t) = 2048
for 16KB pages), so it wraps to 0, aliasing the order bit to index 0
and losing it silently.
On the second kernel, kho_radix_decode_key() sees a key without the
order bit, calls fls64() on the wrong bit, computes a wrong order and
thus a garbage physical address. phys_to_page() of that address faults
in kho_preserved_memory_reserve(), causing a kernel panic early in boot.
Fix by adding +1 to the DIV_ROUND_UP numerator so the formula accounts
for the order bit itself, giving depth 5 for 16KB pages. The top-level
shift becomes 50, and (1 << 50) >> 50 = 1, which is nonzero and
unambiguous. For 4KB and 64KB page sizes the depth is unchanged.
Link: https://patch.msgid.link/20260509024415.33190-1-dongtai.guo@linux.dev
Fixes: 3f2ad90060f6 ("kho: adopt radix tree for preserved memory tracking")
Tested-by: Kexin Liu <liukexin@kylinos.cn>
Signed-off-by: George Guo <guodongtai@kylinos.cn>
Reviewed-by: Pasha Tatashin <pasha.tatashin@soleen.com>
[rppt: added actual math to the changelog]
Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
|
|
Add nr_guest_files in per-HART local config to represent the number of
guest files available on a particular HART whereas the nr_guest_files
in the global config represents the number of guest files available
across all HARTs.
This allows KVM RISC-V to use nr_guest_files from per-HART local
config for asymmetric big.Little systems.
Signed-off-by: Guo Ren (Alibaba DAMO Academy) <guoren@kernel.org>
Acked-by: Thomas Gleixner <tglx@kernel.org>
Signed-off-by: Anup Patel <anup.patel@oss.qualcomm.com>
Link: https://lore.kernel.org/r/20260525094945.3721783-2-anup.patel@oss.qualcomm.com
Signed-off-by: Anup Patel <anup@brainfault.org>
|
|
Add MLX5_SPF to enum mlx5_func_type so SPFs get their own page counter,
and add the corresponding WARN check at page cleanup. Wait for SPF pages
to be reclaimed during ECPF teardown, alongside the existing host PF and
VF page waits.
SPF page requests are always identified by vhca_id, so the legacy
func_id_to_type() path is not reached for satellite PFs.
Signed-off-by: Moshe Shemesh <moshe@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/20260521110843.367329-13-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next into for-7.2
|
|
Replace the open-coded resource table iteration loop in
rproc_handle_resources() with the rsc_table_for_each_entry() helper.
The remoteproc-specific dispatch logic (vendor resource handling via
rproc_handle_rsc(), RSC_LAST bounds check, handler table lookup) is
moved into a local callback rproc_handle_rsc_entry(), keeping the
iteration mechanics in one canonical place.
The callback receives the payload offset within the table so that
handlers which write back into the resource table (e.g.
rproc_handle_carveout() recording a dynamically allocated address via
rsc_offset) continue to work correctly.
No functional change.
Signed-off-by: Mukesh Ojha <mukesh.ojha@oss.qualcomm.com>
Link: https://lore.kernel.org/r/20260506050107.1985033-3-mukesh.ojha@oss.qualcomm.com
|
|
The resource table data structure has traditionally been associated with
the remoteproc framework, where the resource table is included as a
section within the remote processor firmware binary. However, it is also
possible to obtain the resource table through other means—such as from a
reserved memory region populated by the boot firmware, statically
maintained driver data, or via a secure SMC call—when it is not embedded
in the firmware.
There are multiple Qualcomm remote processors (e.g., Venus, Iris, GPU,
etc.) in the upstream kernel that do not use the remoteproc framework to
manage their lifecycle for various reasons.
When Linux is running at EL2, similar to the Qualcomm PAS driver
(qcom_q6v5_pas.c), client drivers for subsystems like video and GPU may
also want to use the resource table SMC call to retrieve and map
resources before they are used by the remote processor.
In such cases, the resource table data structure is no longer tightly
coupled with the remoteproc headers. Client drivers that do not use the
remoteproc framework should still be able to parse the resource table
obtained through alternative means. Therefore, there is a need to
decouple the resource table definitions from the remoteproc headers.
Signed-off-by: Mukesh Ojha <mukesh.ojha@oss.qualcomm.com>
Link: https://lore.kernel.org/r/20260506050107.1985033-2-mukesh.ojha@oss.qualcomm.com
|
|
Tejun Heo says:
====================
This makes BPF arena memory directly dereferenceable from kernel code
(struct_ops callbacks, kfuncs). Each arena gets a per-arena scratch page
that an arch fault hook installs into empty PTEs on kernel-side faults,
after KFENCE. The faulting instruction retries and the violation is reported
through the program's BPF stream.
v4:
- Patch 1: note that the strict-zero cmpxchg is narrower than pte_none() in
inline comments on both x86 and arm64. (Andrea)
- Patch 2: stub bpf_arena_handle_page_fault() for !CONFIG_BPF_SYSCALL via a
new include/linux/bpf_defs.h. (lkp)
- Patch 7: scx_arena_alloc() retries via a loop instead of a single retry on
pool growth. (Andrea)
- Picked up Reviewed-by tags from Emil and Andrea.
v3: https://lore.kernel.org/r/20260520235052.4180316-1-tj@kernel.org
v2: https://lore.kernel.org/r/20260517211232.1670594-1-tj@kernel.org
v1 (RFC): https://lore.kernel.org/r/20260427105109.2554518-1-tj@kernel.org
Motivation
----------
sched_ext's ops_cid.set_cmask() hands the BPF scheduler a struct scx_cmask
*. The kernel translates a kernel cpumask to a cmask, but it had no way to
write into the arena, so the cmask lived in kernel memory and was passed as
a trusted pointer. BPF cmask helpers all operate on arena cmasks though, so
the BPF side had to word-by-word probe-read the kernel cmask into an arena
cmask via cmask_copy_from_kernel() before any helper could touch it. It
works, but is clumsy.
The shape isn't unique to set_cmask. Sub-scheduler support is on the way and
more sched_ext callbacks will want to pass structured data to BPF. Anywhere
a kfunc or struct_ops callback wants to hand a struct to a BPF program,
arena residence is the natural answer.
Approach
--------
Each arena gets a per-arena scratch page. Arenas stay sparsely mapped as
today - PTEs are populated only for allocated pages. A new arch fault hook
(bpf_arena_handle_page_fault) is wired into x86 page_fault_oops() and arm64
__do_kernel_fault(), after KFENCE. When a kernel-side access faults inside
an arena's kern_vm range, the helper walks the stack to find the BPF program
responsible, range-checks the fault address against prog->aux->arena, and
atomically installs the scratch page into the empty PTE via the new
ptep_try_set() wrapper. The kernel instruction retries and reads/writes the
scratch page. Free paths and map destruction treat scratch as non-owned.
Real allocation refuses to overwrite scratch (apply_range_set_cb returns
-EBUSY). A scratched address stays dead until map destroy, since its
presence means the BPF program has already malfunctioned.
The mechanism is default behavior - no UAPI flag.
What this preserves
-------------------
All the debugging properties of today's sparse-PTE design are preserved:
* BPF programs still fault on unmapped arena accesses. The fault semantics
(instruction retry with rdst = 0) and the violation report through
bpf_streams are unchanged for prog-side accesses.
* The first kernel-side touch of an unmapped address is reported via
bpf_streams the same way as a prog-side fault, with the stack walk
attributing it to the originating prog.
* User-side fault on a never-scratched address still lazy-allocates a real
page (or returns SIGSEGV under BPF_F_SEGV_ON_FAULT). User-side fault on a
scratched address SIGSEGVs.
What changes for the kernel-side caller is just that an unmapped deref no
longer oopses - it retries through the scratch page and emits a violation
report. The same shape today's BPF instruction faults have.
Patches 1-2 (atomic PTE install + arena scratch-page recovery)
--------------------------------------------------------------
mm: Add ptep_try_set() for lockless empty-slot installs
bpf: Recover arena kernel faults with scratch page
Patches 3-5 (helpers used by struct_ops registration)
-----------------------------------------------------
bpf: Add sleepable variant of bpf_arena_alloc_pages for kernel callers
bpf: Add bpf_struct_ops_for_each_prog()
bpf/arena: Add bpf_arena_map_kern_vm_start() and bpf_prog_arena()
====================
Link: https://lore.kernel.org/bpf/20260522172219.1423324-1-tj@kernel.org/
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
Add support for dynamically discovering and exposing Configuration
Security Unit (CSU) registers through sysfs. Leverage the existing
PM_QUERY_DATA API to discover available registers at runtime, making
the interface flexible and maintainable.
Key features:
- Dynamic register discovery using PM_QUERY_DATA API
* PM_QID_GET_NODE_COUNT: Query number of available registers
* PM_QID_GET_NODE_NAME: Query register names by index
- Automatic sysfs attribute creation under csu_registers/ group
- Read operations via existing IOCTL_READ_REG API
- Write operations via existing IOCTL_MASK_WRITE_REG API
The sysfs interface is created at:
/sys/devices/platform/firmware:zynqmp-firmware/csu_registers/
Currently supported registers include:
- multiboot (CSU_MULTI_BOOT)
- idcode (CSU_IDCODE, read-only)
- pcap-status (CSU_PCAP_STATUS, read-only)
The dynamic discovery approach allows firmware to control which
registers are exposed without requiring kernel changes, improving
maintainability and security.
The firmware does not currently expose per-register access mode
information, so the kernel cannot distinguish read-only registers
from read-write ones at discovery time. All discovered registers are
therefore created with sysfs mode 0644, and the firmware is
responsible for rejecting writes to registers it treats as read-only
(for example idcode and pcap-status); that error is propagated back
to userspace from the store callback. If a per-register access-mode
query is added to the firmware in the future, sysfs permissions can
be tightened to match.
CSU register discovery is an optional feature: on firmware that lacks
support for PM_QID_GET_NODE_COUNT or PM_QID_GET_NODE_NAME, the probe
returns gracefully without exposing any sysfs entries. To keep the
memory footprint minimal on that path, partial devm allocations made
during discovery are explicitly released on failure so that no memory
lingers until device unbind when the feature is unavailable.
Signed-off-by: Ronak Jain <ronak.jain@amd.com>
Signed-off-by: Michal Simek <michal.simek@amd.com>
Link: https://lore.kernel.org/r/20260520093654.3303917-3-ronak.jain@amd.com
|
|
Cross-merge BPF and other fixes after downstream PR.
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
Troy Mitchell <troy.mitchell@linux.spacemit.com> says:
On some SoCs (e.g. SpacemiT K3), multiple I2S controllers share the
same physical BCLK. When one controller is already streaming, the
others must use hw_params that result in the same BCLK rate, otherwise
the shared clock would be reconfigured and corrupt the active stream.
This series adds framework-level support for this constraint:
Patch 1 adds the dt-bindings for the spacemit,k3-i2s compatible.
The K3 SoC uses the same I2S IP as K1 but requires additional clocks:
a dedicated sysclk_div, along with c_sysclk and c_bclk which are
shared across multiple I2S controllers.
Patch 2 adds a DEFINE_GUARD wrapping snd_soc_card_mutex_lock() and
snd_soc_card_mutex_unlock() so that scope-based locking picks up the
SND_SOC_CARD_CLASS_RUNTIME lockdep subclass.
Patch 3 adds the constraint logic in soc-pcm.c. During PCM open,
every DAI that has a bclk clock pointer gets a hw_rule registered
unconditionally. The rule callback runs at hw_refine time: it scans
the card for an active peer sharing the same physical BCLK (via
clk_is_match()) that has already completed hw_params, then constrains
the current stream's rate to match the established BCLK rate. The
first DAI to complete hw_params is unconstrained; subsequent DAIs
must match. Two modes are supported:
- Default (I2S): BCLK = rate * channels * sample_bits. The rule
derives the valid rate range from the current channel and
sample_bits intervals.
- Explicit ratio (TDM): if the driver sets dai->bclk_ratio
(e.g. slots * slot_width), the rule computes the single valid
rate as active_bclk_rate / bclk_ratio.
This series was prompted by review feedback on the SpacemiT K3 I2S
series, where a vendor-specific fixed-sample-rate property was rejected
in favor of a generic framework solution:
https://lore.kernel.org/all/afFqgF6ZRwYdfUmL@sirena.co.uk/
Link: https://patch.msgid.link/20260522-i2s-same-blk-v4-0-a71a86faaa20@linux.spacemit.com
|
|
Signed-off-by: Takashi Iwai <tiwai@suse.de>
|
|
We need the driver-core fixes in here as well to build on top of.
Signed-off-by: Danilo Krummrich <dakr@kernel.org>
|
|
Neil Armstrong <neil.armstrong@linaro.org> says:
Add support for the SG Micro SGM3804 Single Inductor Dual Output
Buck/Boost Converter used to power LCD panels a provide positive
and negative power rails with configurable voltage and active
discharge function for each output.
The SGM3804 is powered by the enable GPIO pins inputs and only
supports I2C write messages.
In order to add flexibility and simplify the driver, the
regmap cache is enabled and populated with default values
since we can't write registers when the 2 GPIOs are down.
This regulator is used to provide vsn and vsn power to the
Ayaneo Pocket S2 dual-DSI LCD panel.
Link: https://patch.msgid.link/20260522-topic-sm8650-ayaneo-pocket-s2-sgm3804-v5-0-bd6b1c300ecc@linaro.org
|
|
This feature is required to use 32bit arp/ip/ip6/ebtables binaries on
64bit kernels. I don't think there are many users left.
Support has been a compile-time option since 2021 and defaults to off
since 2023.
The XTABLES_COMPAT config option is already off in many distributions
including Debian and Fedora.
Give a few more months before complete removal but disable support in
user namespaces already.
Assisted-by: Claude:claude-sonnet-4-6
Signed-off-by: Florian Westphal <fw@strlen.de>
|
|
Pull bpf fixes from Alexei Starovoitov:
- Fix bpf_throw() and global subprog combination (Kumar Kartikeya
Dwivedi)
- Fix out of bounds access in BPF interpreter (Yazhou Tang)
- Fix potential out of bounds access in inner per-cpu array map
(Guannan Wang)
- Reject NULL data/sig in bpf_verify_pkcs7_signature (KP Singh)
* tag 'bpf-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf:
libbpf: fix off-by-one in emit_signature_match jump offset
bpf: Reject NULL data/sig in bpf_verify_pkcs7_signature
selftests/bpf: Cover global subprog exception leaks
bpf: Check global subprog exception paths
bpf: make bpf_session_is_return() reference optional
bpf: Use array_map_meta_equal for percpu array inner map replacement
selftests/bpf: Add test for large offset bpf-to-bpf call
bpf: Fix s16 truncation for large bpf-to-bpf call offsets
bpf: Fix out-of-bounds read in bpf_patch_call_args()
|
|
This commit documents the rcu_access_pointer() use case for |