aboutsummaryrefslogtreecommitdiff
path: root/kernel/bpf
AgeCommit message (Collapse)AuthorFilesLines
8 daysMerge tag 'mm-stable-2026-04-13-21-45' of ↵Linus Torvalds1-2/+1
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull MM updates from Andrew Morton: - "maple_tree: Replace big node with maple copy" (Liam Howlett) Mainly prepararatory work for ongoing development but it does reduce stack usage and is an improvement. - "mm, swap: swap table phase III: remove swap_map" (Kairui Song) Offers memory savings by removing the static swap_map. It also yields some CPU savings and implements several cleanups. - "mm: memfd_luo: preserve file seals" (Pratyush Yadav) File seal preservation to LUO's memfd code - "mm: zswap: add per-memcg stat for incompressible pages" (Jiayuan Chen) Additional userspace stats reportng to zswap - "arch, mm: consolidate empty_zero_page" (Mike Rapoport) Some cleanups for our handling of ZERO_PAGE() and zero_pfn - "mm/kmemleak: Improve scan_should_stop() implementation" (Zhongqiu Han) A robustness improvement and some cleanups in the kmemleak code - "Improve khugepaged scan logic" (Vernon Yang) Improve khugepaged scan logic and reduce CPU consumption by prioritizing scanning tasks that access memory frequently - "Make KHO Stateless" (Jason Miu) Simplify Kexec Handover by transitioning KHO from an xarray-based metadata tracking system with serialization to a radix tree data structure that can be passed directly to the next kernel - "mm: vmscan: add PID and cgroup ID to vmscan tracepoints" (Thomas Ballasi and Steven Rostedt) Enhance vmscan's tracepointing - "mm: arch/shstk: Common shadow stack mapping helper and VM_NOHUGEPAGE" (Catalin Marinas) Cleanup for the shadow stack code: remove per-arch code in favour of a generic implementation - "Fix KASAN support for KHO restored vmalloc regions" (Pasha Tatashin) Fix a WARN() which can be emitted the KHO restores a vmalloc area - "mm: Remove stray references to pagevec" (Tal Zussman) Several cleanups, mainly udpating references to "struct pagevec", which became folio_batch three years ago - "mm: Eliminate fake head pages from vmemmap optimization" (Kiryl Shutsemau) Simplify the HugeTLB vmemmap optimization (HVO) by changing how tail pages encode their relationship to the head page - "mm/damon/core: improve DAMOS quota efficiency for core layer filters" (SeongJae Park) Improve two problematic behaviors of DAMOS that makes it less efficient when core layer filters are used - "mm/damon: strictly respect min_nr_regions" (SeongJae Park) Improve DAMON usability by extending the treatment of the min_nr_regions user-settable parameter - "mm/page_alloc: pcp locking cleanup" (Vlastimil Babka) The proper fix for a previously hotfixed SMP=n issue. Code simplifications and cleanups ensued - "mm: cleanups around unmapping / zapping" (David Hildenbrand) A bunch of cleanups around unmapping and zapping. Mostly simplifications, code movements, documentation and renaming of zapping functions - "support batched checking of the young flag for MGLRU" (Baolin Wang) Batched checking of the young flag for MGLRU. It's part cleanups; one benchmark shows large performance benefits for arm64 - "memcg: obj stock and slab stat caching cleanups" (Johannes Weiner) memcg cleanup and robustness improvements - "Allow order zero pages in page reporting" (Yuvraj Sakshith) Enhance free page reporting - it is presently and undesirably order-0 pages when reporting free memory. - "mm: vma flag tweaks" (Lorenzo Stoakes) Cleanup work following from the recent conversion of the VMA flags to a bitmap - "mm/damon: add optional debugging-purpose sanity checks" (SeongJae Park) Add some more developer-facing debug checks into DAMON core - "mm/damon: test and document power-of-2 min_region_sz requirement" (SeongJae Park) An additional DAMON kunit test and makes some adjustments to the addr_unit parameter handling - "mm/damon/core: make passed_sample_intervals comparisons overflow-safe" (SeongJae Park) Fix a hard-to-hit time overflow issue in DAMON core - "mm/damon: improve/fixup/update ratio calculation, test and documentation" (SeongJae Park) A batch of misc/minor improvements and fixups for DAMON - "mm: move vma_(kernel|mmu)_pagesize() out of hugetlb.c" (David Hildenbrand) Fix a possible issue with dax-device when CONFIG_HUGETLB=n. Some code movement was required. - "zram: recompression cleanups and tweaks" (Sergey Senozhatsky) A somewhat random mix of fixups, recompression cleanups and improvements in the zram code - "mm/damon: support multiple goal-based quota tuning algorithms" (SeongJae Park) Extend DAMOS quotas goal auto-tuning to support multiple tuning algorithms that users can select - "mm: thp: reduce unnecessary start_stop_khugepaged()" (Breno Leitao) Fix the khugpaged sysfs handling so we no longer spam the logs with reams of junk when starting/stopping khugepaged - "mm: improve map count checks" (Lorenzo Stoakes) Provide some cleanups and slight fixes in the mremap, mmap and vma code - "mm/damon: support addr_unit on default monitoring targets for modules" (SeongJae Park) Extend the use of DAMON core's addr_unit tunable - "mm: khugepaged cleanups and mTHP prerequisites" (Nico Pache) Cleanups to khugepaged and is a base for Nico's planned khugepaged mTHP support - "mm: memory hot(un)plug and SPARSEMEM cleanups" (David Hildenbrand) Code movement and cleanups in the memhotplug and sparsemem code - "mm: remove CONFIG_ARCH_ENABLE_MEMORY_HOTREMOVE and cleanup CONFIG_MIGRATION" (David Hildenbrand) Rationalize some memhotplug Kconfig support - "change young flag check functions to return bool" (Baolin Wang) Cleanups to change all young flag check functions to return bool - "mm/damon/sysfs: fix memory leak and NULL dereference issues" (Josh Law and SeongJae Park) Fix a few potential DAMON bugs - "mm/vma: convert vm_flags_t to vma_flags_t in vma code" (Lorenzo Stoakes) Convert a lot of the existing use of the legacy vm_flags_t data type to the new vma_flags_t type which replaces it. Mainly in the vma code. - "mm: expand mmap_prepare functionality and usage" (Lorenzo Stoakes) Expand the mmap_prepare functionality, which is intended to replace the deprecated f_op->mmap hook which has been the source of bugs and security issues for some time. Cleanups, documentation, extension of mmap_prepare into filesystem drivers - "mm/huge_memory: refactor zap_huge_pmd()" (Lorenzo Stoakes) Simplify and clean up zap_huge_pmd(). Additional cleanups around vm_normal_folio_pmd() and the softleaf functionality are performed. * tag 'mm-stable-2026-04-13-21-45' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (369 commits) mm: fix deferred split queue races during migration mm/khugepaged: fix issue with tracking lock mm/huge_memory: add and use has_deposited_pgtable() mm/huge_memory: add and use normal_or_softleaf_folio_pmd() mm: add softleaf_is_valid_pmd_entry(), pmd_to_softleaf_folio() mm/huge_memory: separate out the folio part of zap_huge_pmd() mm/huge_memory: use mm instead of tlb->mm mm/huge_memory: remove unnecessary sanity checks mm/huge_memory: deduplicate zap deposited table call mm/huge_memory: remove unnecessary VM_BUG_ON_PAGE() mm/huge_memory: add a common exit path to zap_huge_pmd() mm/huge_memory: handle buggy PMD entry in zap_huge_pmd() mm/huge_memory: have zap_huge_pmd return a boolean, add kdoc mm/huge: avoid big else branch in zap_huge_pmd() mm/huge_memory: simplify vma_is_specal_huge() mm: on remap assert that input range within the proposed VMA mm: add mmap_action_map_kernel_pages[_full]() uio: replace deprecated mmap hook with mmap_prepare in uio_info drivers: hv: vmbus: replace deprecated mmap hook with mmap_prepare mm: allow handling of stacked mmap_prepare hooks in more drivers ...
11 daysMerge patch series "bpf: Fix OOB in pcpu_init_value and add a test"Alexei Starovoitov1-1/+1
xulang <xulang@uniontech.com> says: ==================== Fix OOB read when copying element from a BPF_MAP_TYPE_CGROUP_STORAGE map to another pcpu map with the same value_size that is not rounded up to 8 bytes, and add a test case to reproduce the issue. The root cause is that pcpu_init_value() uses copy_map_value_long() which rounds up the copy size to 8 bytes, but CGROUP_STORAGE map values are not 8-byte aligned (e.g., 4-byte). This causes a 4-byte OOB read when the copy is performed. ==================== Link: https://lore.kernel.org/r/7653EEEC2BAB17DF+20260402073948.2185396-1-xulang@uniontech.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
11 daysbpf: Fix OOB in pcpu_init_valueLang Xu1-1/+1
An out-of-bounds read occurs when copying element from a BPF_MAP_TYPE_CGROUP_STORAGE map to another pcpu map with the same value_size that is not rounded up to 8 bytes. The issue happens when: 1. A CGROUP_STORAGE map is created with value_size not aligned to 8 bytes (e.g., 4 bytes) 2. A pcpu map is created with the same value_size (e.g., 4 bytes) 3. Update element in 2 with data in 1 pcpu_init_value assumes that all sources are rounded up to 8 bytes, and invokes copy_map_value_long to make a data copy, However, the assumption doesn't stand since there are some cases where the source may not be rounded up to 8 bytes, e.g., CGROUP_STORAGE, skb->data. the verifier verifies exactly the size that the source claims, not the size rounded up to 8 bytes by kernel, an OOB happens when the source has only 4 bytes while the copy size(4) is rounded up to 8. Fixes: d3bec0138bfb ("bpf: Zero-fill re-used per-cpu map element") Reported-by: Kaiyan Mei <kaiyanm@hust.edu.cn> Closes: https://lore.kernel.org/all/14e6c70c.6c121.19c0399d948.Coremail.kaiyanm@hust.edu.cn/ Link: https://lore.kernel.org/r/420FEEDDC768A4BE+20260402074236.2187154-1-xulang@uniontech.com Signed-off-by: Lang Xu <xulang@uniontech.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
11 daysbpf: Allow instructions with arena source and non-arena dest registersEmil Tsalapatis1-3/+11
The compiler sometimes stores the result of a PTR_TO_ARENA and SCALAR operation into the scalar register rather than the pointer register. Relax the verifier to allow operations between a source arena register and a destination non-arena register, marking the destination's value as a PTR_TO_ARENA. Signed-off-by: Emil Tsalapatis <emil@etsalapatis.com> Acked-by: Song Liu <song@kernel.org> Fixes: 6082b6c328b5 ("bpf: Recognize addr_space_cast instruction in the verifier.") Link: https://lore.kernel.org/r/20260412174546.18684-2-emil@etsalapatis.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
11 daysbpf: add missing fsession to the verifier logMenglong Dong1-5/+5
The fsession attach type is missed in the verifier log in check_get_func_ip(), bpf_check_attach_target() and check_attach_btf_id(). Update them to make the verifier log proper. Meanwhile, update the corresponding selftests. Acked-by: Leon Hwang <leon.hwang@linux.dev> Signed-off-by: Menglong Dong <dongml2@chinatelecom.cn> Link: https://lore.kernel.org/r/20260412060346.142007-2-dongml2@chinatelecom.cn Signed-off-by: Alexei Starovoitov <ast@kernel.org>
11 daysbpf: Move BTF checking logic into check_btf.cAlexei Starovoitov3-458/+466
BTF validation logic is independent from the main verifier. Move it into check_btf.c Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/r/20260412152936.54262-7-alexei.starovoitov@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
11 daysbpf: Move backtracking logic to backtrack.cAlexei Starovoitov3-945/+939
Move precision propagation and backtracking logic to backtrack.c to reduce verifier.c size. No functional changes. Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/r/20260412152936.54262-6-alexei.starovoitov@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
11 daysbpf: Move state equivalence logic to states.cAlexei Starovoitov3-1746/+1698
verifier.c is huge. Move is_state_visited() to states.c, so that all state equivalence logic is in one file. Mechanical move. No functional changes. Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/r/20260412152936.54262-5-alexei.starovoitov@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
11 daysbpf: Move check_cfg() into cfg.cAlexei Starovoitov3-996/+904
verifier.c is huge. Move check_cfg(), compute_postorder(), compute_scc() into cfg.c Mechanical move. No functional changes. Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/r/20260412152936.54262-4-alexei.starovoitov@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
11 daysbpf: Move compute_insn_live_regs() into liveness.cAlexei Starovoitov2-249/+248
verifier.c is huge. Move compute_insn_live_regs() into liveness.c. Mechanical move. No functional changes. Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/r/20260412152936.54262-3-alexei.starovoitov@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
11 daysbpf: Move fixup/post-processing logic from verifier.c into fixups.cAlexei Starovoitov3-2725/+2688
verifier.c is huge. Split fixup/post-processing logic that runs after the verifier accepted the program into fixups.c. Mechanical move. No functional changes. Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/r/20260412152936.54262-2-alexei.starovoitov@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
11 daysbpf: Simplify do_check_insn()Alexei Starovoitov1-57/+46
Move env->insn_idx++ to the caller, so that most of check_*() calls in do_check_insn() tail call into the next helper. Link: https://lore.kernel.org/r/20260411230001.71664-1-alexei.starovoitov@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
11 daysbpf: Move checks for reserved fields out of the main passAlexei Starovoitov1-147/+200
Check reserved fields of each insn once in a prepass instead of repeatedly rechecking them during the main verifier pass. Link: https://lore.kernel.org/r/20260411200932.41797-1-alexei.starovoitov@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
12 daysbpf: Delete unused variableAlexei Starovoitov1-3/+1
'cnt' is set, but not used. Delete it. Reported-by: kernel test robot <lkp@intel.com> Closes: https://lore.kernel.org/oe-kbuild-all/202604111401.eqzyF2kx-lkp@intel.com/ Fixes: 2c167d91775b ("bpf: change logging scheme for live stack analysis") Link: https://lore.kernel.org/r/20260411141447.45932-1-alexei.starovoitov@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
12 daysbpf: Remove gfp_flags plumbing from bpf_local_storage_update()Amery Hung5-51/+18
Remove the check that rejects sleepable BPF programs from doing BPF_ANY/BPF_EXIST updates on local storage. This restriction was added in commit b00fa38a9c1c ("bpf: Enable non-atomic allocations in local storage") because kzalloc(GFP_KERNEL) could sleep inside local_storage->lock. This is no longer a concern: all local storage allocations now use kmalloc_nolock() which never sleeps. In addition, since kmalloc_nolock() only accepts __GFP_ACCOUNT, __GFP_ZERO and __GFP_NO_OBJ_EXT, the gfp_flags parameter plumbing from bpf_*_storage_get() to bpf_local_storage_update() becomes dead code. Remove gfp_flags from bpf_selem_alloc(), bpf_local_storage_alloc() and bpf_local_storage_update(). Drop the hidden 5th argument from bpf_*_storage_get helpers, and remove the verifier patching that injected GFP_KERNEL/GFP_ATOMIC into the fifth argument. Signed-off-by: Amery Hung <ameryhung@gmail.com> Link: https://lore.kernel.org/r/20260411015419.114016-4-ameryhung@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
12 daysbpf: Use kmalloc_nolock() universally in local storageAmery Hung4-118/+18
Switch to kmalloc_nolock() universally in local storage. Socket local storage didn't move to kmalloc_nolock() when BPF memory allocator was replaced by it for performance reasons. Now that kfree_rcu() supports freeing memory allocated by kmalloc_nolock(), we can move the remaining local storages to use kmalloc_nolock() and cleanup the cluttered free paths. Use kfree() instead of kfree_nolock() in bpf_selem_free_trace_rcu() and bpf_local_storage_free_trace_rcu(). Both callbacks run in process context where spinning is allowed, so kfree_nolock() is unnecessary. Benchmark: ./bench -p 1 local-storage-create --storage-type socket \ --batch-size {16,32,64} The benchmark is a microbenchmark stress-testing how fast local storage can be created. There is no measurable throughput change for socket local storage after switching from kzalloc() to kmalloc_nolock(). Socket local storage batch creation speed diff --------------- ---- ------------------ ---- Baseline 16 433.9 ± 0.6 k/s 32 434.3 ± 1.4 k/s 64 434.2 ± 0.7 k/s After 16 439.0 ± 1.9 k/s +1.2% 32 437.3 ± 2.0 k/s +0.7% 64 435.8 ± 2.5k/s +0.4% Also worth noting that the baseline got a 5% throughput boost when sheaf replaces percpu partial slab recently [0]. [0] https://lore.kernel.org/bpf/20260123-sheaves-for-all-v4-0-041323d506f7@suse.cz/ Signed-off-by: Amery Hung <ameryhung@gmail.com> Link: https://lore.kernel.org/r/20260411015419.114016-3-ameryhung@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
12 daysbpf: Enforce regsafe base id consistency for BPF_ADD_CONST scalarsDaniel Borkmann1-1/+16
When regsafe() compares two scalar registers that both carry BPF_ADD_CONST, check_scalar_ids() maps their full compound id (aka base | BPF_ADD_CONST flag) as one idmap entry. However, it never verifies that the underlying base ids, that is, with the flag stripped are consistent with existing idmap mappings. This allows construction of two verifier states where the old state has R3 = R2 + 10 (both sharing base id A) while the current state has R3 = R4 + 10 (base id C, unrelated to R2). The idmap creates two independent entries: A->B (for R2) and A|flag->C|flag (for R3), without catching that A->C conflicts with A->B. State pruning then incorrectly succeeds. Fix this by additionally verifying base ID mapping consistency whenever BPF_ADD_CONST is set: after mapping the compound ids, also invoke check_ids() on the base IDs (flag bits stripped). This ensures that if A was already mapped to B from comparing the source register, any ADD_CONST derivative must also derive from B, not an unrelated C. Fixes: 98d7ca374ba4 ("bpf: Track delta between "linked" registers.") Reported-by: STAR Labs SG <info@starlabs.sg> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/r/20260410232651.559778-1-daniel@iogearbox.net Signed-off-by: Alexei Starovoitov <ast@kernel.org>
12 daysbpf: poison dead stack slotsAlexei Starovoitov2-27/+67
As a sanity check poison stack slots that stack liveness determined to be dead, so that any read from such slots will cause program rejection. If stack liveness logic is incorrect the poison can cause valid program to be rejected, but it also will prevent unsafe program to be accepted. Allow global subprogs "read" poisoned stack slots. The static stack liveness determined that subprog doesn't read certain stack slots, but sizeof(arg_type) based global subprog validation isn't accurate enough to know which slots will actually be read by the callee, so it needs to check full sizeof(arg_type) at the caller. Signed-off-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/r/20260410-patch-set-v4-14-5d4eecb343db@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
12 daysbpf: change logging scheme for live stack analysisEduard Zingerman1-76/+165
Instead of breadcrumbs like: (d2,cs15) frame 0 insn 18 +live -16 (d2,cs15) frame 0 insn 17 +live -16 Print final accumulated stack use/def data per-func_instance per-instruction. printed func_instance's are ordered by callsite and depth. For example: stack use/def subprog#0 shared_instance_must_write_overwrite (d0,cs0): 0: (b7) r1 = 1 1: (7b) *(u64 *)(r10 -8) = r1 ; def: fp0-8 2: (7b) *(u64 *)(r10 -16) = r1 ; def: fp0-16 3: (bf) r1 = r10 4: (07) r1 += -8 5: (bf) r2 = r10 6: (07) r2 += -16 7: (85) call pc+7 ; use: fp0-8 fp0-16 8: (bf) r1 = r10 9: (07) r1 += -16 10: (bf) r2 = r10 11: (07) r2 += -8 12: (85) call pc+2 ; use: fp0-8 fp0-16 13: (b7) r0 = 0 14: (95) exit stack use/def subprog#1 forwarding_rw (d1,cs7): 15: (85) call pc+1 ; use: fp0-8 fp0-16 16: (95) exit stack use/def subprog#1 forwarding_rw (d1,cs12): 15: (85) call pc+1 ; use: fp0-8 fp0-16 16: (95) exit stack use/def subprog#2 write_first_read_second (d2,cs15): 17: (7a) *(u64 *)(r1 +0) = 42 18: (79) r0 = *(u64 *)(r2 +0) ; use: fp0-8 fp0-16 19: (95) exit For groups of three or more consecutive stack slots, abbreviate as follows: 25: (85) call bpf_loop#181 ; use: fp2-8..-512 fp1-8..-512 fp0-8..-512 Signed-off-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/r/20260410-patch-set-v4-10-5d4eecb343db@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
12 daysbpf: simplify liveness to use (callsite, depth) keyed func_instancesEduard Zingerman2-547/+254
Rework func_instance identification and remove the dynamic liveness API, completing the transition to fully static stack liveness analysis. Replace callchain-based func_instance keys with (callsite, depth) pairs. The full callchain (all ancestor callsites) is no longer part of the hash key; only the immediate callsite and the call depth matter. This does not lose precision in practice and simplifies the data structure significantly: struct callchain is removed entirely, func_instance stores just callsite, depth. Drop must_write_acc propagation. Previously, must_write marks were accumulated across successors and propagated to the caller via propagate_to_outer_instance(). Instead, callee entry liveness (live_before at subprog start) is pulled directly back to the caller's callsite in analyze_subprog() after each callee returns. Since (callsite, depth) instances are shared across different call chains that invoke the same subprog at the same depth, must_write marks from one call may be stale for another. To handle this, analyze_subprog() records into a fresh_instance() when the instance was already visited (must_write_initialized), then merge_instances() combines the results: may_read is unioned, must_write is intersected. This ensures only slots written on ALL paths through all call sites are marked as guaranteed writes. This replaces commit_stack_write_marks() logic. Skip recursive descent into callees that receive no FP-derived arguments (has_fp_args() check). This is needed because global subprogram calls can push depth beyond MAX_CALL_FRAMES (max depth is 64 for global calls but only 8 frames are accommodated for FP passing). It also handles the case where a callback subprog cannot be determined by argument tracking: such callbacks will be processed by analyze_subprog() at depth 0 independently. Update lookup_instance() (used by is_live_before queries) to search for the func_instance with maximal depth at the corresponding callsite, walking depth downward from frameno to 0. This accounts for the fact that instance depth no longer corresponds 1:1 to bpf_verifier_state->curframe, since skipped non-FP calls create gaps. Remove the dynamic public liveness API from verifier.c: - bpf_mark_stack_{read,write}(), bpf_reset/commit_stack_write_marks() - bpf_update_live_stack(), bpf_reset_live_stack_callchain() - All call sites in check_stack_{read,write}_fixed_off(), check_stack_range_initialized(), mark_stack_slot_obj_read(), mark/unmark_stack_slots_{dynptr,iter,irq_flag}() - The per-instruction write mark accumulation in do_check() - The bpf_update_live_stack() call in prepare_func_exit() mark_stack_read() and mark_stack_write() become static functions in liveness.c, called only from the static analysis pass. The func_instance->updated and must_write_dropped flags are removed. Remove spis_single_slot(), spis_one_bit() helpers from bpf_verifier.h as they are no longer used. Signed-off-by: Eduard Zingerman <eddyz87@gmail.com> Tested-by: Paul Chaignon <paul.chaignon@gmail.com> Link: https://lore.kernel.org/r/20260410-patch-set-v4-9-5d4eecb343db@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
12 daysbpf: record arg tracking results in bpf_liveness masksEduard Zingerman2-62/+245
After arg tracking reaches a fixed point, perform a single linear scan over the converged at_in[] state and translate each memory access into liveness read/write masks on the func_instance: - Load/store instructions: FP-derived pointer's frame and offset(s) are converted to half-slot masks targeting per_frame_masks->{may_read,must_write} - Helper/kfunc calls: record_call_access() queries bpf_helper_stack_access_bytes() / bpf_kfunc_stack_access_bytes() for each FP-derived argument to determine access size and direction. Unknown access size (S64_MIN) conservatively marks all slots from fp_off to fp+0 as read. - Imprecise pointers (frame == ARG_IMPRECISE): conservatively mark all slots in every frame covered by the pointer's frame bitmask as fully read. - Static subprog calls with unresolved arguments: conservatively mark all frames as fully read. Instead of a call to clean_live_states(), start cleaning the current state continuously as registers and stack become dead since the static analysis provides complete liveness information. This makes clean_live_states() and bpf_verifier_state->cleaned unnecessary. Signed-off-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/r/20260410-patch-set-v4-8-5d4eecb343db@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
12 daysbpf: introduce forward arg-tracking dataflow analysisEduard Zingerman2-0/+1064
The analysis is a basis for static liveness tracking mechanism introduced by the next two commits. A forward fixed-point analysis that tracks which frame's FP each register value is derived from, and at what byte offset. This is needed because a callee can receive a pointer to its caller's stack frame (e.g. r1 = fp-16 at the call site), then do *(u64 *)(r1 + 0) inside the callee — a cross-frame stack access that the callee's local liveness must attribute to the caller's stack. Each register holds an arg_track value from a three-level lattice: - Precise {frame=N, off=[o1,o2,...]} — known frame index and up to 4 concrete byte offsets - Offset-imprecise {frame=N, off_cnt=0} — known frame, unknown offset - Fully-imprecise {frame=ARG_IMPRECISE, mask=bitmask} — unknown frame, mask says which frames might be involved At CFG merge points the lattice moves toward imprecision (same frame+offset stays precise, same frame different offsets merges offset sets or becomes offset-imprecise, different frames become fully-imprecise with OR'd bitmask). The analysis also tracks spills/fills to the callee's own stack (at_stack_in/out), so FP derived values spilled and reloaded. This pass is run recursively per call site: when subprog A calls B with specific FP-derived arguments, B is re-analyzed with those entry args. The recursion follows analyze_subprog -> compute_subprog_args -> (for each call insn) -> analyze_subprog. Subprogs that receive no FP-derived args are skipped during recursion and analyzed independently at depth 0. Signed-off-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/r/20260410-patch-set-v4-7-5d4eecb343db@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
12 daysbpf: prepare liveness internal API for static analysis passEduard Zingerman1-25/+23
Move the `updated` check and reset from bpf_update_live_stack() into update_instance() itself, so callers outside the main loop can reuse it. Similarly, move write_insn_idx assignment out of reset_stack_write_marks() into its public caller, and thread insn_idx as a parameter to commit_stack_write_marks() instead of reading it from liveness->write_insn_idx. Drop the unused `env` parameter from alloc_frame_masks() and mark_stack_read(). Signed-off-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/r/20260410-patch-set-v4-6-5d4eecb343db@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
12 daysbpf: 4-byte precise clean_verifier_stateEduard Zingerman2-23/+85
Migrate clean_verifier_state() and its liveness queries from 8-byte SPI granularity to 4-byte half-slot granularity. In __clean_func_state(), each SPI is cleaned in two independent halves: - half_spi 2*i (lo): slot_type[0..3] - half_spi 2*i+1 (hi): slot_type[4..7] Slot types STACK_DYNPTR, STACK_ITER and STACK_IRQ_FLAG are never cleaned, as their slot type markers are required by destroy_if_dynptr_stack_slot(), is_iter_reg_valid_uninit() and is_irq_flag_reg_valid_uninit() for correctness. When only the hi half is dead, spilled_ptr metadata is destroyed and the lo half's STACK_SPILL bytes are downgraded to STACK_MISC or STACK_ZERO. When only the lo half is dead, spilled_ptr is preserved because the hi half may still need it for state comparison. Signed-off-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/r/20260410-patch-set-v4-5-5d4eecb343db@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
12 daysbpf: make liveness.c track stack with 4-byte granularityEduard Zingerman2-64/+113
Convert liveness bitmask type from u64 to spis_t, doubling the number of trackable stack slots from 64 to 128 to support 4-byte granularity. Each 8-byte SPI now maps to two consecutive 4-byte sub-slots in the bitmask: spi*2 half and spi*2+1 half. In verifier.c, check_stack_write_fixed_off() now reports 4-byte aligned writes of 4-byte writes as half-slot marks and 8-byte aligned 8-byte writes as two slots. Similar logic applied in check_stack_read_fixed_off(). Queries (is_live_before) are not yet migrated to half-slot granularity. Signed-off-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/r/20260410-patch-set-v4-4-5d4eecb343db@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
12 daysbpf: save subprogram name in bpf_subprog_infoEduard Zingerman1-0/+1
Subprogram name can be computed from function info and BTF, but it is convenient to have the name readily available for logging purposes. Update comment saying that bpf_subprog_info->start has to be the first field, this is no longer true, relevant sites access .start field by it's name. Signed-off-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/r/20260410-patch-set-v4-2-5d4eecb343db@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
12 daysbpf: share several utility functions as internal APIEduard Zingerman2-7/+7
Namely: - bpf_subprog_is_global - bpf_vlog_alignment Acked-by: Mykyta Yatsenko <yatsenko@meta.com> Signed-off-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/r/20260410-patch-set-v4-1-5d4eecb343db@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
13 daysbpf: Fix RCU stall in bpf_fd_array_map_clear()Sechang Lim1-1/+3
Add a missing cond_resched() in bpf_fd_array_map_clear() loop. For PROG_ARRAY maps with many entries this loop calls prog_array_map_poke_run() per entry which can be expensive, and without yielding this can cause RCU stalls under load: rcu: Stack dump where RCU GP kthread last ran: CPU: 0 UID: 0 PID: 30932 Comm: kworker/0:2 Not tainted 6.14.0-13195-g967e8def1100 #2 PREEMPT(undef) Workqueue: events prog_array_map_clear_deferred RIP: 0010:write_comp_data+0x38/0x90 kernel/kcov.c:246 Call Trace: <TASK> prog_array_map_poke_run+0x77/0x380 kernel/bpf/arraymap.c:1096 __fd_array_map_delete_elem+0x197/0x310 kernel/bpf/arraymap.c:925 bpf_fd_array_map_clear kernel/bpf/arraymap.c:1000 [inline] prog_array_map_clear_deferred+0x119/0x1b0 kernel/bpf/arraymap.c:1141 process_one_work+0x898/0x19d0 kernel/workqueue.c:3238 process_scheduled_works kernel/workqueue.c:3319 [inline] worker_thread+0x770/0x10b0 kernel/workqueue.c:3400 kthread+0x465/0x880 kernel/kthread.c:464 ret_from_fork+0x4d/0x80 arch/x86/kernel/process.c:153 ret_from_fork_asm+0x19/0x30 arch/x86/entry/entry_64.S:245 </TASK> Reviewed-by: Sun Jian <sun.jian.kdev@gmail.com> Fixes: da765a2f5993 ("bpf: Add poke dependency tracking for prog array maps") Signed-off-by: Sechang Lim <rhkrqnwk98@gmail.com> Link: https://lore.kernel.org/r/20260407103823.3942156-1-rhkrqnwk98@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
13 daysbpf: return VMA snapshot from task_vma iteratorPuranjay Mohan1-12/+30
Holding the per-VMA lock across the BPF program body creates a lock ordering problem when helpers acquire locks that depend on mmap_lock: vm_lock -> i_rwsem -> mmap_lock -> vm_lock Snapshot the VMA under the per-VMA lock in _next() via memcpy(), then drop the lock before returning. The BPF program accesses only the snapshot. The verifier only trusts vm_mm and vm_file pointers (see BTF_TYPE_SAFE_TRUSTED_OR_NULL in verifier.c). vm_file is reference- counted with get_file() under the lock and released via fput() on the next iteration or in _destroy(). vm_mm is already correct because lock_vma_under_rcu() verifies vma->vm_mm == mm. All other pointers are left as-is by memcpy() since the verifier treats them as untrusted. Fixes: 4ac454682158 ("bpf: Introduce task_vma open-coded iterator kfuncs") Signed-off-by: Puranjay Mohan <puranjay@kernel.org> Acked-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Mykyta Yatsenko <yatsenko@meta.com> Link: https://lore.kernel.org/r/20260408154539.3832150-4-puranjay@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
13 daysbpf: switch task_vma iterator from mmap_lock to per-VMA locksPuranjay Mohan1-18/+73
The open-coded task_vma iterator holds mmap_lock for the entire duration of iteration, increasing contention on this highly contended lock. Switch to per-VMA locking. Find the next VMA via an RCU-protected maple tree walk and lock it with lock_vma_under_rcu(). lock_next_vma() is not used because its fallback takes mmap_read_lock(), and the iterator must work in non-sleepable contexts. lock_vma_under_rcu() is a point lookup (mas_walk) that finds the VMA containing a given address but cannot iterate across gaps. An RCU-protected vma_next() walk (mas_find) first locates the next VMA's vm_start to pass to lock_vma_under_rcu(). Between the RCU walk and the lock, the VMA may be removed, shrunk, or write-locked. On failure, advance past it using vm_end from the RCU walk. Because the VMA slab is SLAB_TYPESAFE_BY_RCU, vm_end may be stale; fall back to PAGE_SIZE advancement when it does not make forward progress. Concurrent VMA insertions at addresses already passed by the iterator are not detected. CONFIG_PER_VMA_LOCK is required; return -EOPNOTSUPP without it. Signed-off-by: Puranjay Mohan <puranjay@kernel.org> Link: https://lore.kernel.org/r/20260408154539.3832150-3-puranjay@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
13 daysbpf: fix mm lifecycle in open-coded task_vma iteratorPuranjay Mohan1-3/+51
The open-coded task_vma iterator reads task->mm locklessly and acquires mmap_read_trylock() but never calls mmget(). If the task exits concurrently, the mm_struct can be freed as it is not SLAB_TYPESAFE_BY_RCU, resulting in a use-after-free. Safely read task->mm with a trylock on alloc_lock and acquire an mm reference. Drop the reference via bpf_iter_mmput_async() in _destroy() and error paths. bpf_iter_mmput_async() is a local wrapper around mmput_async() with a fallback to mmput() on !CONFIG_MMU. Reject irqs-disabled contexts (including NMI) up front. Operations used by _next() and _destroy() (mmap_read_unlock, bpf_iter_mmput_async) take spinlocks with IRQs disabled (pool->lock, pi_lock). Running from NMI or from a tracepoint that fires with those locks held could deadlock. A trylock on alloc_lock is used instead of the blocking task_lock() (get_task_mm) to avoid a deadlock when a softirq BPF program iterates a task that already holds its alloc_lock on the same CPU. Fixes: 4ac454682158 ("bpf: Introduce task_vma open-coded iterator kfuncs") Signed-off-by: Puranjay Mohan <puranjay@kernel.org> Link: https://lore.kernel.org/r/20260408154539.3832150-2-puranjay@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
14 daysbpf: Fix use-after-free in offloaded map/prog info fillJiayuan Chen1-6/+4
When querying info for an offloaded BPF map or program, bpf_map_offload_info_fill_ns() and bpf_prog_offload_info_fill_ns() obtain the network namespace with get_net(dev_net(offmap->netdev)). However, the associated netdev's netns may be racing with teardown during netns destruction. If the netns refcount has already reached 0, get_net() performs a refcount_t increment on 0, triggering: refcount_t: addition on 0; use-after-free. Although rtnl_lock and bpf_devs_lock ensure the netdev pointer remains valid, they cannot prevent the netns refcount from reaching zero. Fix this by using maybe_get_net() instead of get_net(). maybe_get_net() uses refcount_inc_not_zero() and returns NULL if the refcount is already zero, which causes ns_get_path_cb() to fail and the caller to return -ENOENT -- the correct behavior when the netns is being destroyed. Fixes: 675fc275a3a2d ("bpf: offload: report device information for offloaded programs") Fixes: 52775b33bb507 ("bpf: offload: report device information about offloaded maps") Reported-by: Yinhao Hu <dddddd@hust.edu.cn> Reported-by: Kaiyan Mei <M202472210@hust.edu.cn> Reviewed-by: Dongliang Mu <dzm91@hust.edu.cn> Closes: https://lore.kernel.org/bpf/f0aa3678-79c9-47ae-9e8c-02a3d1df160a@hust.edu.cn/ Signed-off-by: Jiayuan Chen <jiayuan.chen@linux.dev> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/r/20260409023733.168050-1-jiayuan.chen@linux.dev Signed-off-by: Alexei Starovoitov <ast@kernel.org>
14 daysbpf: Drop pkt_end markers on arithmetic to prevent is_pkt_ptr_branch_takenDaniel Borkmann1-8/+22
When a pkt pointer acquires AT_PKT_END or BEYOND_PKT_END range from a comparison, and then, known-constant arithmetic is performed, adjust_ptr_min_max_vals() copies the stale range via dst_reg->raw = ptr_reg->raw without clearing the negative reg->range sentinel values. This lets is_pkt_ptr_branch_taken() choose one branch direction and skip going through the other. Fix this by clearing negative pkt range values (that is, AT_PKT_END and BEYOND_PKT_END) after arithmetic on pkt pointers. This ensures is_pkt_ptr_branch_taken() returns unknown and both branches are properly verified. Fixes: 6d94e741a8ff ("bpf: Support for pointers beyond pkt_end.") Reported-by: STAR Labs SG <info@starlabs.sg> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/r/20260409155016.536608-1-daniel@iogearbox.net Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2026-04-08bpf: Remove static qualifier from local subprog pointerDaniel Borkmann1-2/+2
The local subprog pointer in create_jt() and visit_abnormal_return_insn() was declared static. It is unconditionally assigned via bpf_find_containing_subprog() before every use. Thus, the static qualifier serves no purpose and rather creates confusion. Just remove it. Fixes: e40f5a6bf88a ("bpf: correct stack liveness for tail calls") Fixes: 493d9e0d6083 ("bpf, x86: add support for indirect jumps") Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Anton Protopopov <a.s.protopopov@gmail.com> Link: https://lore.kernel.org/r/20260408191242.526279-3-daniel@iogearbox.net Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2026-04-08bpf: Fix ld_{abs,ind} failure path analysis in subprogsDaniel Borkmann1-2/+31
Usage of ld_{abs,ind} instructions got extended into subprogs some time ago via commit 09b28d76eac4 ("bpf: Add abnormal return checks."). These are only allowed in subprograms when the latter are BTF annotated and have scalar return types. The code generator in bpf_gen_ld_abs() has an abnormal exit path (r0=0 + exit) from legacy cBPF times. While the enforcement is on scalar return types, the verifier must also simulate the path of abnormal exit if the packet data load via ld_{abs,ind} failed. This is currently not the case. Fix it by having the verifier simulate both success and failure paths, and extend it in similar ways as we do for tail calls. The success path (r0=unknown, continue to next insn) is pushed onto stack for later validation and the r0=0 and return to the caller is done on the fall-through side. Fixes: 09b28d76eac4 ("bpf: Add abnormal return checks.") Reported-by: STAR Labs SG <info@starlabs.sg> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/r/20260408191242.526279-2-daniel@iogearbox.net Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2026-04-08bpf: Propagate error from visit_tailcall_insnDaniel Borkmann1-2/+5
Commit e40f5a6bf88a ("bpf: correct stack liveness for tail calls") added visit_tailcall_insn() but did not check its return value. Fixes: e40f5a6bf88a ("bpf: correct stack liveness for tail calls") Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/r/20260408191242.526279-1-daniel@iogearbox.net Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2026-04-08bpf: Make find_linfo widely availableKumar Kartikeya Dwivedi2-42/+38
Move find_linfo() as bpf_find_linfo() into core.c to allow for its use in the verifier in subsequent patches. Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Acked-by: Mykyta Yatsenko <yatsenko@meta.com> Link: https://lore.kernel.org/r/20260408021359.3786905-4-memxor@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2026-04-08bpf: Extract bpf_get_linfo_file_lineKumar Kartikeya Dwivedi1-8/+21
Extract bpf_get_linfo_file_line as its own function so that the logic to obtain the file, line, and line number for a given program can be shared in subsequent patches. Reviewed-by: Puranjay Mohan <puranjay@kernel.org> Acked-by: Mykyta Yatsenko <yatsenko@meta.com> Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20260408021359.3786905-3-memxor@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2026-04-07bpf: Allow overwriting referenced dynptr when refcnt > 1Amery Hung1-2/+21
The verifier currently does not allow overwriting a referenced dynptr's stack slot to prevent resource leak. This is because referenced dynptr holds additional resources that requires calling specific helpers to release. This limitation can be relaxed when there are multiple copies of the same dynptr. Whether it is the orignial dynptr or one of its clones, as long as there exists at least one other dynptr with the same ref_obj_id (to be used to release the reference), its stack slot should be allowed to be overwritten. Suggested-by: Andrii Nakryiko <andrii@kernel.org> Signed-off-by: Amery Hung <ameryhung@gmail.com> Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20260406150548.1354271-2-ameryhung@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2026-04-07bpf: Clear delta when clearing reg id for non-{add,sub} opsDaniel Borkmann1-28/+28
When a non-{add,sub} alu op such as xor is performed on a scalar register that previously had a BPF_ADD_CONST delta, the else path in adjust_reg_min_max_vals() only clears dst_reg->id but leaves dst_reg->delta unchanged. This stale delta can propagate via assign_scalar_id_before_mov() when the register is later used in a mov. It gets a fresh id but keeps the stale delta from the old (now-cleared) BPF_ADD_CONST. This stale delta can later propagate leading to a verifier-vs- runtime value mismatch. The clear_id label already correctly clears both delta and id. Make the else path consistent by also zeroing the delta when id is cleared. More generally, this introduces a helper clear_scalar_id() which internally takes care of zeroing. There are various other locations in the verifier where only the id is cleared. By using the helper we catch all current and future locations. Fixes: 98d7ca374ba4 ("bpf: Track delta between "linked" registers.") Reported-by: STAR Labs SG <info@starlabs.sg> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/r/20260407192421.508817-2-daniel@iogearbox.net Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2026-04-07bpf: Fix linked reg delta tracking when src_reg == dst_regDaniel Borkmann1-1/+2
Consider the case of rX += rX where src_reg and dst_reg are pointers to the same bpf_reg_state in adjust_reg_min_max_vals(). The latter first modifies the dst_reg in-place, and later in the delta tracking, the subsequent is_reg_const(src_reg)/reg_const_value(src_reg) reads the post-{add,sub} value instead of the original source. This is problematic since it sets an incorrect delta, which sync_linked_regs() then propagates to linked registers, thus creating a verifier-vs-runtime mismatch. Fix it by just skipping this corner case. Fixes: 98d7ca374ba4 ("bpf: Track delta between "linked" registers.") Reported-by: STAR Labs SG <info@starlabs.sg> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/r/20260407192421.508817-1-daniel@iogearbox.net Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2026-04-07bpf: Retire rcu_trace_implies_rcu_gp()Kumar Kartikeya Dwivedi4-48/+19
RCU Tasks Trace grace period implies RCU grace period, and this guarantee is expected to remain in the future. Only BPF is the user of this predicate, hence retire the API and clean up all in-tree users. RCU Tasks Trace is now implemented on SRCU-fast and its grace period mechanism always has at least one call to synchronize_rcu() as it is required for SRCU-fast's correctness (it replaces the smp_mb() that SRCU-fast readers skip). So, RCU-tt GP will always imply RCU GP. Reviewed-by: Puranjay Mohan <puranjay@kernel.org> Reviewed-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20260407162234.785270-1-memxor@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2026-04-07bpf: Drop task_to_inode and inet_conn_established from lsm sleepable hooksJiayuan Chen1-3/+0
bpf_lsm_task_to_inode() is called under rcu_read_lock() and bpf_lsm_inet_conn_established() is called from softirq context, so neither hook can be used by sleepable LSM programs. Fixes: 423f16108c9d8 ("bpf: Augment the set of sleepable LSM hooks") Reported-by: Quan Sun <2022090917019@std.uestc.edu.cn> Reported-by: Yinhao Hu <dddddd@hust.edu.cn> Reported-by: Kaiyan Mei <M202472210@hust.edu.cn> Reported-by: Dongliang Mu <dzm91@hust.edu.cn> Closes: https://lore.kernel.org/bpf/3ab69731-24d1-431a-a351-452aafaaf2a5@std.uestc.edu.cn/T/#u Signed-off-by: Jiayuan Chen <jiayuan.chen@linux.dev> Link: https://lore.kernel.org/r/20260407122334.344072-1-jiayuan.chen@linux.dev Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2026-04-06bpf: Do not ignore offsets for loads from insn_arraysAnton Protopopov1-0/+20
When a pointer to PTR_TO_INSN is dereferenced, the offset field of the BPF_LDX_MEM instruction can be nonzero. Patch the verifier to not ignore this field. Reported-by: Jiyong Yang <ksur673@gmail.com> Fixes: 493d9e0d6083 ("bpf, x86: add support for indirect jumps") Signed-off-by: Anton Protopopov <a.s.protopopov@gmail.com> Link: https://lore.kernel.org/r/20260406160141.36943-2-a.s.protopopov@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2026-04-06bpf: Avoid -Wflex-array-members-not-at-end warningsGustavo A. R. Silva1-5/+7
Apparently, struct bpf_empty_prog_array exists entirely to populate a single element of "items" in a global variable. "null_prog" is only used during the initializer. None of this is needed; globals will be correctly sized with an array initializer of a flexible-array member. So, remove struct bpf_empty_prog_array and adjust the rest of the code, accordingly. With these changes, fix the following warnings: ./include/linux/bpf.h:2369:31: warning: structure containing a flexible array member is not at the end of another structure [-Wflex-array-member-not-at-end] Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org> Acked-by: Mykyta Yatsenko <yatsenko@meta.com> Link: https://lore.kernel.org/r/acr7Whmn0br3xeBP@kspp Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2026-04-06bpf: Enable unaligned accesses for syscall ctxKumar Kartikeya Dwivedi1-2/+1
Don't reject usage of fixed unaligned offsets for syscall ctx. Tests will be added in later commits. Unaligned offsets already work for variable offsets. Reviewed-by: Emil Tsalapatis <emil@etsalapatis.com> Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20260406194403.1649608-3-memxor@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2026-04-06bpf: Support variable offsets for syscall PTR_TO_CTXKumar Kartikeya Dwivedi1-38/+65
Allow accessing PTR_TO_CTX with variable offsets in syscall programs. Fixed offsets are already enabled for all program types that do not convert their ctx accesses, since the changes we made in the commit de6c7d99f898 ("bpf: Relax fixed offset check for PTR_TO_CTX"). Note that we also lift the restriction on passing syscall context into helpers, which was not permitted before, and passing modified syscall context into kfuncs. The structure of check_mem_access can be mostly shared and preserved, but we must use check_mem_region_access to correctly verify access with variable offsets. The check made in check_helper_mem_access is hardened to only allow PTR_TO_CTX for syscall programs to be passed in as helper memory. This was the original intention of the existing code anyway, and it makes little sense for other program types' context to be utilized as a memory buffer. In case a convincing example presents itself in the future, this check can be relaxed further. We also no longer use the last-byte access to simulate helper memory access, but instead go through check_mem_region_access. Since this no longer updates our max_ctx_offset, we must do so manually, to keep track of the maximum offset at which the program ctx may be accessed. Take care to ensure that when arg_type is ARG_PTR_TO_CTX, we do not relax any fixed or variable offset constraints around PTR_TO_CTX even in syscall programs, and require them to be passed unmodified. There are several reasons why this is necessary. First, if we pass a modified ctx, then the global subprog's accesses will not update the max_ctx_offset to its true maximum offset, and can lead to out of bounds accesses. Second, tail called program (or extension program replacing global subprog) where their max_ctx_offset exceeds the program they are being called from can also cause issues. For the latter, unmodified PTR_TO_CTX is the first requirement for the fix, the second is ensuring max_ctx_offset >= the program they are being called from, which has to be a separate change not made in this commit. All in all, we can hint using arg_type when we expect ARG_PTR_TO_CTX and make our relaxation around offsets conditional on it. Drop coverage of syscall tests from verifier_ctx.c temporarily for negative cases until they are updated in subsequent commits. Reviewed-by: Emil Tsalapatis <emil@etsalapatis.com> Acked-by: Puranjay Mohan <puranjay@kernel.org> Acked-by: Mykyta Yatsenko <yatsenko@meta.com> Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20260406194403.1649608-2-memxor@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2026-04-05bpf: Fix stale offload->prog pointer after constant blindingMingTao Huang1-0/+2
When a dev-bound-only BPF program (BPF_F_XDP_DEV_BOUND_ONLY) undergoes JIT compilation with constant blinding enabled (bpf_jit_harden >= 2), bpf_jit_blind_constants() clones the program. The original prog is then freed in bpf_jit_prog_release_other(), which updates aux->prog to point to the surviving clone, but fails to update offload->prog. This leaves offload->prog pointing to the freed original program. When the network namespace is subsequently destroyed, cleanup_net() triggers bpf_dev_bound_netdev_unregister(), which iterates ondev->progs and calls __bpf_prog_offload_destroy(offload->prog). Accessing the freed prog causes a page fault: BUG: unable to handle page fault for address: ffffc900085f1038 Workqueue: netns cleanup_net RIP: 0010:__bpf_prog_offload_destroy+0xc/0x80 Call Trace: __bpf_offload_dev_netdev_unregister+0x257/0x350 bpf_dev_bound_netdev_unregister+0x4a/0x90 unregister_netdevice_many_notify+0x2a2/0x660 ... cleanup_net+0x21a/0x320 The test sequence that triggers this reliably is: 1. Set net.core.bpf_jit_harden=2 (echo 2 > /proc/sys/net/core/bpf_jit_harden) 2. Run xdp_metadata selftest, which creates a dev-bound-only XDP program on a veth inside a netns (./test_progs -t xdp_metadata) 3. cleanup_net -> page fault in __bpf_prog_offload_destroy Dev-bound-only programs are unique in that they have an offload structure but go through the normal JIT path instead of bpf_prog_offload_compile(). This means they are subject to constant blinding's prog clone-and-replace, while also having offload->prog that must stay in sync. Fix this by updating offload->prog in bpf_jit_prog_release_other(), alongside the existing aux->prog update. Both are back-pointers to the prog that must be kept in sync when the prog is replaced. Fixes: 2b3486bc2d23 ("bpf: Introduce device-bound XDP programs") Signed-off-by: MingTao Huang <mintaohuang@tencent.com> Link: https://lore.kernel.org/r/tencent_BCF692F45859CCE6C22B7B0B64827947D406@qq.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2026-04-05bpf: fix end-of-list detection in cgroup_storage_get_next_key()Weiming Shi1-1/+1
list_next_entry() never returns NULL -- when the current element is the last entry it wraps to the list head via container_of(). The subsequent NULL check is therefore dead code and get_next_key() never returns -ENOENT for the last element, instead reading storage->key from a bogus pointer that aliases internal map fields and copying the result to userspace. Replace it with list_entry_is_head() so the function correctly returns -ENOENT when there are no more entries. Fixes: de9cbbaadba5 ("bpf: introduce cgroup storage maps") Reported-by: Xiang Mei <xmei5@asu.edu> Signed-off-by: Weiming Shi <bestswngs@gmail.com> Reviewed-by: Sun Jian <sun.jian.kdev@gmail.com> Acked-by: Paul Chaignon <paul.chaignon@gmail.com> Link: https://lore.kernel.org/r/20260403132951.43533-2-bestswngs@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2026-04-05bpf: Use copy_map_value_locked() in alloc_htab_elem() for BPF_F_LOCKMykyta Yatsenko1-0/+4
When a BPF_F_LOCK update races with a concurrent delete, the freed element can be immediately recycled by alloc_htab_elem(). The fast path in htab_map_update_elem() performs a lockless lookup and then calls copy_map_value_locked() under the element's spin_lock. If alloc_htab_elem() recycles the same memory, it overwrites the value with plain copy_map_value(), without taking the spin_lock, causing torn writes. Use copy_map_value_locked() when BPF_F_LOCK is set so the new element's value is written under the embedded spin_lock, serializing against any stale lock holders. Fixes: 96049f3afd50 ("bpf: introduce BPF_F_LOCK flag") Reported-by: Aaron Esau <aaron1esau@gmail.com> Closes: https://lore.kernel.org/all/CADucPGRvSRpkneb94dPP08YkOHgNgBnskTK6myUag_Mkjimihg@mail.gmail.com/ Signed-off-by: Mykyta Yatsenko <yatsenko@meta.com> Link: https://lore.kernel.org/r/20260401-bpf_map_torn_writes-v1-1-782d071c55e7@meta.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>