| Age | Commit message (Collapse) | Author | Files | Lines |
|
next buddy should not prevent shorter slice preemption. Don't take buddy
into account when checking if shorter slice entity can preempt and clear it
if the entity with a shorter slice can preempt current.
Test on snapdragon rb5:
hackbench -T -p -l 16000000 -g 2 1> /dev/null &
hackbench runs in cgroup /test-A
cyclictest -t 1 -i 2777 -D 63 --policy=fair --mlock -h 20000 -q
cyclictest runs in cgroup /test-B
tip/sched/core tip/sched/core +this patch
cyclictest slice (ms) (default)2.8 8 8
hackbench slice (ms) (default)2.8 20 20
Total Samples | 22679 22595 22686
Average (us) | 84 94(-12%) 59( 37%)
Median (P50) (us) | 56 56( 0%) 56( 0%)
90th Percentile (us) | 64 65(- 2%) 63( 3%)
99th Percentile (us) | 1047 1273(-22%) 74( 94%)
99.9th Percentile (us) | 2431 4751(-95%) 663( 86%)
Maximum (us) | 4694 8655(-84%) 3934( 55%)
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20260410132321.2897789-1-vincent.guittot@linaro.org
|
|
'for-next/ttbr-macros-cleanup', 'for-next/kselftest', 'for-next/feat_lsui', 'for-next/mpam', 'for-next/hotplug-batched-tlbi', 'for-next/bbml2-fixes', 'for-next/sysreg', 'for-next/generic-entry' and 'for-next/acpi', remote-tracking branches 'arm64/for-next/perf' and 'arm64/for-next/read-once' into for-next/core
* arm64/for-next/perf:
: Perf updates
perf/arm-cmn: Fix resource_size_t printk specifier in arm_cmn_init_dtc()
perf/arm-cmn: Fix incorrect error check for devm_ioremap()
perf: add NVIDIA Tegra410 C2C PMU
perf: add NVIDIA Tegra410 CPU Memory Latency PMU
perf/arm_cspmu: nvidia: Add Tegra410 PCIE-TGT PMU
perf/arm_cspmu: nvidia: Add Tegra410 PCIE PMU
perf/arm_cspmu: Add arm_cspmu_acpi_dev_get
perf/arm_cspmu: nvidia: Add Tegra410 UCF PMU
perf/arm_cspmu: nvidia: Rename doc to Tegra241
perf/arm-cmn: Stop claiming entire iomem region
arm64: cpufeature: Use pmuv3_implemented() function
arm64: cpufeature: Make PMUVer and PerfMon unsigned
KVM: arm64: Read PMUVer as unsigned
* arm64/for-next/read-once:
: Fixes for __READ_ONCE() with CONFIG_LTO=y
arm64, compiler-context-analysis: Permit alias analysis through __READ_ONCE() with CONFIG_LTO=y
arm64: Optimize __READ_ONCE() with CONFIG_LTO=y
* for-next/misc:
: Miscellaneous cleanups/fixes
arm64: rsi: use linear-map alias for realm config buffer
arm64: Kconfig: fix duplicate word in CMDLINE help text
arm64: mte: Skip TFSR_EL1 checks and barriers in synchronous tag check mode
arm64/hwcap: Generate the KERNEL_HWCAP_ definitions for the hwcaps
arm64: kexec: Remove duplicate allocation for trans_pgd
arm64: mm: Use generic enum pgtable_level
arm64: scs: Remove redundant save/restore of SCS SP on entry to/from EL0
arm64: remove ARCH_INLINE_*
* for-next/tlbflush:
: Refactor the arm64 TLB invalidation API and implementation
arm64: mm: __ptep_set_access_flags must hint correct TTL
arm64: mm: Provide level hint for flush_tlb_page()
arm64: mm: Wrap flush_tlb_page() around __do_flush_tlb_range()
arm64: mm: More flags for __flush_tlb_range()
arm64: mm: Refactor __flush_tlb_range() to take flags
arm64: mm: Refactor flush_tlb_page() to use __tlbi_level_asid()
arm64: mm: Simplify __flush_tlb_range_limit_excess()
arm64: mm: Simplify __TLBI_RANGE_NUM() macro
arm64: mm: Re-implement the __flush_tlb_range_op macro in C
arm64: mm: Inline __TLBI_VADDR_RANGE() into __tlbi_range()
arm64: mm: Push __TLBI_VADDR() into __tlbi_level()
arm64: mm: Implicitly invalidate user ASID based on TLBI operation
arm64: mm: Introduce a C wrapper for by-range TLB invalidation
arm64: mm: Re-implement the __tlbi_level macro as a C function
* for-next/ttbr-macros-cleanup:
: Cleanups of the TTBR1_* macros
arm64/mm: Directly use TTBRx_EL1_CnP
arm64/mm: Directly use TTBRx_EL1_ASID_MASK
arm64/mm: Describe TTBR1_BADDR_4852_OFFSET
* for-next/kselftest:
: arm64 kselftest updates
selftests/arm64: Implement cmpbr_sigill() to hwcap test
* for-next/feat_lsui:
: Futex support using FEAT_LSUI instructions to avoid toggling PAN
arm64: armv8_deprecated: Disable swp emulation when FEAT_LSUI present
arm64: Kconfig: Add support for LSUI
KVM: arm64: Use CAST instruction for swapping guest descriptor
arm64: futex: Support futex with FEAT_LSUI
arm64: futex: Refactor futex atomic operation
KVM: arm64: kselftest: set_id_regs: Add test for FEAT_LSUI
KVM: arm64: Expose FEAT_LSUI to guests
arm64: cpufeature: Add FEAT_LSUI
* for-next/mpam: (40 commits)
: Expose MPAM to user-space via resctrl:
: - Add architecture context-switch and hiding of the feature from KVM.
: - Add interface to allow MPAM to be exposed to user-space using resctrl.
: - Add errata workaoround for some existing platforms.
: - Add documentation for using MPAM and what shape of platforms can use resctrl
arm64: mpam: Add initial MPAM documentation
arm_mpam: Quirk CMN-650's CSU NRDY behaviour
arm_mpam: Add workaround for T241-MPAM-6
arm_mpam: Add workaround for T241-MPAM-4
arm_mpam: Add workaround for T241-MPAM-1
arm_mpam: Add quirk framework
arm_mpam: resctrl: Call resctrl_init() on platforms that can support resctrl
arm64: mpam: Select ARCH_HAS_CPU_RESCTRL
arm_mpam: resctrl: Add empty definitions for assorted resctrl functions
arm_mpam: resctrl: Update the rmid reallocation limit
arm_mpam: resctrl: Add resctrl_arch_rmid_read()
arm_mpam: resctrl: Allow resctrl to allocate monitors
arm_mpam: resctrl: Add support for csu counters
arm_mpam: resctrl: Add monitor initialisation and domain boilerplate
arm_mpam: resctrl: Add kunit test for control format conversions
arm_mpam: resctrl: Add support for 'MB' resource
arm_mpam: resctrl: Wait for cacheinfo to be ready
arm_mpam: resctrl: Add rmid index helpers
arm_mpam: resctrl: Convert to/from MPAMs fixed-point formats
arm_mpam: resctrl: Hide CDP emulation behind CONFIG_EXPERT
...
* for-next/hotplug-batched-tlbi:
: arm64/mm: Enable batched TLB flush in unmap_hotplug_range()
arm64/mm: Reject memory removal that splits a kernel leaf mapping
arm64/mm: Enable batched TLB flush in unmap_hotplug_range()
* for-next/bbml2-fixes:
: Fixes for realm guest and BBML2_NOABORT
arm64: mm: Remove pmd_sect() and pud_sect()
arm64: mm: Handle invalid large leaf mappings correctly
arm64: mm: Fix rodata=full block mapping support for realm guests
* for-next/sysreg:
: arm64 sysreg updates
arm64/sysreg: Update ID_AA64SMFR0_EL1 description to DDI0601 2025-12
arm64/sysreg: Update ID_AA64ZFR0_EL1 description to DDI0601 2025-12
arm64/sysreg: Update ID_AA64FPFR0_EL1 description to DDI0601 2025-12
arm64/sysreg: Update ID_AA64ISAR2_EL1 description to DDI0601 2025-12
arm64/sysreg: Update ID_AA64ISAR0_EL1 description to DDI0601 2025-12
arm64/sysreg: Update SMIDR_EL1 to DDI0601 2025-06
* for-next/generic-entry:
: More arm64 refactoring towards using the generic entry code
arm64: Check DAIF (and PMR) at task-switch time
arm64: entry: Use split preemption logic
arm64: entry: Use irqentry_{enter_from,exit_to}_kernel_mode()
arm64: entry: Consistently prefix arm64-specific wrappers
arm64: entry: Don't preempt with SError or Debug masked
entry: Split preemption from irqentry_exit_to_kernel_mode()
entry: Split kernel mode logic from irqentry_{enter,exit}()
entry: Move irqentry_enter() prototype later
entry: Remove local_irq_{enable,disable}_exit_to_user()
entry: Fix stale comment for irqentry_enter()
* for-next/acpi:
: arm64 ACPI updates
ACPI: AGDI: fix missing newline in error message
|
|
Merge cpuidle updates, OPP (operating performance points) library
updates, and updates related to system suspend and hibernation for
7.1-rc1:
- Refine stopped tick handling in the menu cpuidle governor and
rearrange stopped tick handling in the teo cpuidle governor (Rafael
Wysocki)
- Add Panther Lake C-states table to the intel_idle driver (Artem
Bityutskiy)
- Clean up dead dependencies on CPU_IDLE in Kconfig (Julian Braha)
- Simplify cpuidle_register_device() with guard() (Huisong Li)
- Use performance level if available to distinguish between rates in
OPP debugfs (Manivannan Sadhasivam)
- Fix scoped_guard in dev_pm_opp_xlate_required_opp() (Viresh Kumar)
- Return -ENODATA if the snapshot image is not loaded (Alberto Garcia)
- Remove inclusion of crypto/hash.h from hibernate_64.c on x86 (Eric
Biggers)
* pm-cpuidle:
cpuidle: Simplify cpuidle_register_device() with guard()
cpuidle: clean up dead dependencies on CPU_IDLE in Kconfig
intel_idle: Add Panther Lake C-states table
cpuidle: governors: teo: Rearrange stopped tick handling
cpuidle: governors: menu: Refine stopped tick handling
* pm-opp:
OPP: Move break out of scoped_guard in dev_pm_opp_xlate_required_opp()
OPP: debugfs: Use performance level if available to distinguish between rates
* pm-sleep:
PM: hibernate: return -ENODATA if the snapshot image is not loaded
PM: hibernate: x86: Remove inclusion of crypto/hash.h
|
|
Merge cpufreq updates for 7.1-rc1:
- Update qcom-hw DT bindings to include Eliza hardware (Abel Vesa)
- Update cpufreq-dt-platdev blocklist (Faruque Ansari)
- Minor updates to driver and dt-bindings for Tegra (Thierry Reding,
Rosen Penev)
- Add MAINTAINERS entry for CPPC driver (Viresh Kumar)
- Add support for new features: CPPC performance priority, Dynamic EPP,
Raw EPP, and new unit tests for them to amd-pstate (Gautham Shenoy,
Mario Limonciello)
- Fix sysfs files being present when HW missing and broken/outdated
documentation in the amd-pstate driver (Ninad Naik, Gautham Shenoy)
- Pass the policy to cpufreq_driver->adjust_perf() to avoid using
cpufreq_cpu_get() in the .adjust_perf() callback in amd-pstate which
leads to a scheduling-while-atomic bug (K Prateek Nayak)
- Clean up dead code in Kconfig for cpufreq (Julian Braha)
- Remove max_freq_req update for pre-existing cpufreq policy and add a
boost_freq_req QoS request to save the boost constraint instead of
overwriting the last scaling_max_freq constraint (Pierre Gondois)
- Embed cpufreq QoS freq_req objects in cpufreq policy so they all
are allocated in one go along with the policy to simplify lifetime
rules and avoid error handling issues (Viresh Kumar)
- Use DMI max speed when CPPC is unavailable in the acpi-cpufreq
scaling driver (Henry Tseng)
- Switch policy_is_shared() in cpufreq to using cpumask_nth() instead
of cpumask_weight() because the former is more efficient (Yury Norov)
- Use sysfs_emit() in sysfs show functions for cpufreq governor
attributes (Thorsten Blum)
- Update intel_pstate to stop returning an error when "off" is written
to its status sysfs attribute while the driver is already off (Fabio
De Francesco)
- Include current frequency in the debug message printed by
__cpufreq_driver_target() (Pengjie Zhang)
* pm-cpufreq: (38 commits)
cpufreq/amd-pstate: Add POWER_SUPPLY select for dynamic EPP
MAINTAINERS: amd-pstate: Step down as maintainer, add Prateek as reviewer
cpufreq: Pass the policy to cpufreq_driver->adjust_perf()
cpufreq/amd-pstate: Pass the policy to amd_pstate_update()
cpufreq/amd-pstate-ut: Add a unit test for raw EPP
cpufreq/amd-pstate: Add support for raw EPP writes
cpufreq/amd-pstate: Add support for platform profile class
cpufreq/amd-pstate: add kernel command line to override dynamic epp
cpufreq/amd-pstate: Add dynamic energy performance preference
Documentation: amd-pstate: fix dead links in the reference section
cpufreq/amd-pstate: Cache the max frequency in cpudata
Documentation/amd-pstate: Add documentation for amd_pstate_floor_{freq,count}
Documentation/amd-pstate: List amd_pstate_prefcore_ranking sysfs file
Documentation/amd-pstate: List amd_pstate_hw_prefcore sysfs file
amd-pstate-ut: Add a testcase to validate the visibility of driver attributes
amd-pstate-ut: Add module parameter to select testcases
amd-pstate: Introduce a tracepoint trace_amd_pstate_cppc_req2()
amd-pstate: Add sysfs support for floor_freq and floor_count
amd-pstate: Add support for CPPC_REQ2 and FLOOR_PERF
x86/cpufeatures: Add AMD CPPC Performance Priority feature.
...
|
|
The format string says "device %p ... rdma cgroup %p" but the arguments
were passed as (cg, device), printing them in the wrong order.
Signed-off-by: cuitao <cuitao@kylinos.cn>
Signed-off-by: Tejun Heo <tj@kernel.org>
|
|
When querying info for an offloaded BPF map or program,
bpf_map_offload_info_fill_ns() and bpf_prog_offload_info_fill_ns()
obtain the network namespace with get_net(dev_net(offmap->netdev)).
However, the associated netdev's netns may be racing with teardown
during netns destruction. If the netns refcount has already reached 0,
get_net() performs a refcount_t increment on 0, triggering:
refcount_t: addition on 0; use-after-free.
Although rtnl_lock and bpf_devs_lock ensure the netdev pointer remains
valid, they cannot prevent the netns refcount from reaching zero.
Fix this by using maybe_get_net() instead of get_net(). maybe_get_net()
uses refcount_inc_not_zero() and returns NULL if the refcount is already
zero, which causes ns_get_path_cb() to fail and the caller to return
-ENOENT -- the correct behavior when the netns is being destroyed.
Fixes: 675fc275a3a2d ("bpf: offload: report device information for offloaded programs")
Fixes: 52775b33bb507 ("bpf: offload: report device information about offloaded maps")
Reported-by: Yinhao Hu <dddddd@hust.edu.cn>
Reported-by: Kaiyan Mei <M202472210@hust.edu.cn>
Reviewed-by: Dongliang Mu <dzm91@hust.edu.cn>
Closes: https://lore.kernel.org/bpf/f0aa3678-79c9-47ae-9e8c-02a3d1df160a@hust.edu.cn/
Signed-off-by: Jiayuan Chen <jiayuan.chen@linux.dev>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/r/20260409023733.168050-1-jiayuan.chen@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
When a pkt pointer acquires AT_PKT_END or BEYOND_PKT_END range from
a comparison, and then, known-constant arithmetic is performed,
adjust_ptr_min_max_vals() copies the stale range via dst_reg->raw =
ptr_reg->raw without clearing the negative reg->range sentinel values.
This lets is_pkt_ptr_branch_taken() choose one branch direction and
skip going through the other. Fix this by clearing negative pkt range
values (that is, AT_PKT_END and BEYOND_PKT_END) after arithmetic on
pkt pointers. This ensures is_pkt_ptr_branch_taken() returns unknown
and both branches are properly verified.
Fixes: 6d94e741a8ff ("bpf: Support for pointers beyond pkt_end.")
Reported-by: STAR Labs SG <info@starlabs.sg>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/r/20260409155016.536608-1-daniel@iogearbox.net
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/mszyprowski/linux
Pull dma-mapping fix from Marek Szyprowski:
"A fix for DMA-mapping subsystem, which hides annoying, false-positive
warnings from DMA-API debug on coherent platforms like x86_64 (Mikhail
Gavrilov)"
* tag 'dma-mapping-7.0-2026-04-09' of git://git.kernel.org/pub/scm/linux/kernel/git/mszyprowski/linux:
dma-debug: suppress cacheline overlap warning when arch has no DMA alignment requirement
|
|
The local subprog pointer in create_jt() and visit_abnormal_return_insn()
was declared static.
It is unconditionally assigned via bpf_find_containing_subprog() before
every use. Thus, the static qualifier serves no purpose and rather creates
confusion. Just remove it.
Fixes: e40f5a6bf88a ("bpf: correct stack liveness for tail calls")
Fixes: 493d9e0d6083 ("bpf, x86: add support for indirect jumps")
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Anton Protopopov <a.s.protopopov@gmail.com>
Link: https://lore.kernel.org/r/20260408191242.526279-3-daniel@iogearbox.net
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
Usage of ld_{abs,ind} instructions got extended into subprogs some time
ago via commit 09b28d76eac4 ("bpf: Add abnormal return checks."). These
are only allowed in subprograms when the latter are BTF annotated and
have scalar return types.
The code generator in bpf_gen_ld_abs() has an abnormal exit path (r0=0 +
exit) from legacy cBPF times. While the enforcement is on scalar return
types, the verifier must also simulate the path of abnormal exit if the
packet data load via ld_{abs,ind} failed.
This is currently not the case. Fix it by having the verifier simulate
both success and failure paths, and extend it in similar ways as we do
for tail calls. The success path (r0=unknown, continue to next insn) is
pushed onto stack for later validation and the r0=0 and return to the
caller is done on the fall-through side.
Fixes: 09b28d76eac4 ("bpf: Add abnormal return checks.")
Reported-by: STAR Labs SG <info@starlabs.sg>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/r/20260408191242.526279-2-daniel@iogearbox.net
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
Commit e40f5a6bf88a ("bpf: correct stack liveness for tail calls") added
visit_tailcall_insn() but did not check its return value.
Fixes: e40f5a6bf88a ("bpf: correct stack liveness for tail calls")
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/r/20260408191242.526279-1-daniel@iogearbox.net
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
Move find_linfo() as bpf_find_linfo() into core.c to allow for its use
in the verifier in subsequent patches.
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Acked-by: Mykyta Yatsenko <yatsenko@meta.com>
Link: https://lore.kernel.org/r/20260408021359.3786905-4-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
Extract bpf_get_linfo_file_line as its own function so that the logic to
obtain the file, line, and line number for a given program can be shared
in subsequent patches.
Reviewed-by: Puranjay Mohan <puranjay@kernel.org>
Acked-by: Mykyta Yatsenko <yatsenko@meta.com>
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20260408021359.3786905-3-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
* kvm-arm64/hyp-tracing: (40 commits)
: .
: EL2 tracing support, adding both 'remote' ring-buffer
: infrastructure and the tracing itself, courtesy of
: Vincent Donnefort. From the cover letter:
:
: "The growing set of features supported by the hypervisor in protected
: mode necessitates debugging and profiling tools. Tracefs is the
: ideal candidate for this task:
:
: * It is simple to use and to script.
:
: * It is supported by various tools, from the trace-cmd CLI to the
: Android web-based perfetto.
:
: * The ring-buffer, where are stored trace events consists of linked
: pages, making it an ideal structure for sharing between kernel and
: hypervisor.
:
: This series first introduces a new generic way of creating remote events and
: remote buffers. Then it adds support to the pKVM hypervisor."
: .
tracing: selftests: Extend hotplug testing for trace remotes
tracing: Non-consuming read for trace remotes with an offline CPU
tracing: Adjust cmd_check_undefined to show unexpected undefined symbols
tracing: Restore accidentally removed SPDX tag
KVM: arm64: avoid unused-variable warning
tracing: Generate undef symbols allowlist for simple_ring_buffer
KVM: arm64: tracing: add ftrace dependency
tracing: add more symbols to whitelist
tracing: Update undefined symbols allow list for simple_ring_buffer
KVM: arm64: Fix out-of-tree build for nVHE/pKVM tracing
tracing: selftests: Add hypervisor trace remote tests
KVM: arm64: Add selftest event support to nVHE/pKVM hyp
KVM: arm64: Add hyp_enter/hyp_exit events to nVHE/pKVM hyp
KVM: arm64: Add event support to the nVHE/pKVM hyp and trace remote
KVM: arm64: Add trace reset to the nVHE/pKVM hyp
KVM: arm64: Sync boot clock with the nVHE/pKVM hyp
KVM: arm64: Add trace remote for the nVHE/pKVM hyp
KVM: arm64: Add tracing capability for the nVHE/pKVM hyp
KVM: arm64: Support unaligned fixmap in the pKVM hyp
KVM: arm64: Initialise hyp_nr_cpus for nVHE hyp
...
Signed-off-by: Marc Zyngier <maz@kernel.org>
|
|
Replace raw READ_ONCE() dereferences of pgtable entries with corresponding
standard page table accessors pxdp_get() in perf_get_pgtable_size(). These
accessors default to READ_ONCE() on platforms that don't override them. So
there is no functional change on such platforms.
However arm64 platform is being extended to support 128 bit page tables via
a new architecture feature i.e FEAT_D128 in which case READ_ONCE() will not
provide required single copy atomic access for 128 bit page table entries.
Although pxdp_get() accessors can later be overridden on arm64 platform to
extend required single copy atomicity support on 128 bit entries.
Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20260227062744.2215491-1-anshuman.khandual@arm.com
|
|
The commit 5f6bd380c7bdb ("sched/rt: Remove default bandwidth control")
and followup changes made a few of the functions unnecessary, drop them
for simplicity.
Signed-off-by: Michal Koutný <mkoutny@suse.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20260323-sched-rert_groups-v3-3-1e7d5ed6b249@suse.com
|
|
The sched_rt_global_constraints() function is a remnant that used to set
up global RT throttling but that is no more since commit 5f6bd380c7bdb
("sched/rt: Remove default bandwidth control") and the function ended up
only doing schedulability check.
Move the check into the validation function where it fits better.
(The order of validations sched_dl_global_validate() and
sched_rt_global_validate() shouldn't matter.)
Signed-off-by: Michal Koutný <mkoutny@suse.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20260323-sched-rert_groups-v3-2-1e7d5ed6b249@suse.com
|
|
The warning from the commit 87f1fb77d87a6 ("sched: Add RT_GROUP WARN
checks for non-root task_groups") is wrong -- it assumes that only
task_groups with rt_rq are traversed, however, the schedulability check
would iterate all task_groups even when rt_group_sched=0 is disabled at
boot time but some non-root task_groups exist.
The schedulability check is supposed to validate:
a) that children don't overcommit its parent,
b) no RT task group overcommits global RT limit.
but with rt_group_sched=0 there is no (non-trivial) hierarchy of RT groups,
therefore skip the validation altogether. Otherwise, writes to the
global sched_rt_runtime_us knob will be rejected with incorrect
validation error.
This fix is immaterial with CONFIG_RT_GROUP_SCHED=n.
Fixes: 87f1fb77d87a6 ("sched: Add RT_GROUP WARN checks for non-root task_groups")
Signed-off-by: Michal Koutný <mkoutny@suse.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20260323-sched-rert_groups-v3-1-1e7d5ed6b249@suse.com
|
|
John noted that commit 115135422562 ("sched/deadline: Fix 'stuck' dl_server")
unfixed the issue from commit a3a70caf7906 ("sched/deadline: Fix dl_server
behaviour").
The issue in commit 115135422562 was for wakeups of the server after the
deadline; in which case you *have* to start a new period. The case for
a3a70caf7906 is wakeups before the deadline.
Now, because the server is effectively running a least-laxity policy, it means
that any wakeup during the runnable phase means dl_entity_overflow() will be
true. This means we need to adjust the runtime to allow it to still run until
the existing deadline expires.
Use the revised wakeup rule for dl_defer entities.
Fixes: 115135422562 ("sched/deadline: Fix 'stuck' dl_server")
Reported-by: John Stultz <jstultz@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Juri Lelli <juri.lelli@redhat.com>
Tested-by: John Stultz <jstultz@google.com>
Link: https://patch.msgid.link/20260404102244.GB22575@noisy.programming.kicks-ass.net
|
|
The generic irqentry code has entry/exit functions specifically for
exceptions taken from user mode, but doesn't have entry/exit functions
specifically for exceptions taken from kernel mode.
It would be helpful to have separate entry/exit functions specifically
for exceptions taken from kernel mode. This would make the structure of
the entry code more consistent, and would make it easier for
architectures to manage logic specific to exceptions taken from kernel
mode.
Move the logic specific to kernel mode out of irqentry_enter() and
irqentry_exit() into new irqentry_enter_from_kernel_mode() and
irqentry_exit_to_kernel_mode() functions. These are marked
__always_inline and placed in irq-entry-common.h, as with
irqentry_enter_from_user_mode() and irqentry_exit_to_user_mode(), so
that they can be inlined into architecture-specific wrappers. The
existing out-of-line irqentry_enter() and irqentry_exit() functions
retained as callers of the new functions.
The lockdep assertion from irqentry_exit() is moved into
irqentry_exit_to_user_mode() and irqentry_exit_to_kernel_mode(). This
was previously missing from irqentry_exit_to_user_mode() when called
directly, and any new lockdep assertion failure relating from this
change is a latent bug.
Aside from the lockdep change noted above, there should be no functional
change as a result of this change.
[ tglx: Updated kernel doc ]
Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Signed-off-by: Thomas Gleixner <tglx@kernel.org>
Reviewed-by: Jinjie Ruan <ruanjinjie@huawei.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20260407131650.3813777-5-mark.rutland@arm.com
|
|
local_irq_enable_exit_to_user() and local_irq_disable_exit_to_user() are
never overridden by architecture code, and are always equivalent to
local_irq_enable() and local_irq_disable().
These functions were added on the assumption that arm64 would override
them to manage 'DAIF' exception masking, as described by Thomas Gleixner
in these threads:
https://lore.kernel.org/all/20190919150809.340471236@linutronix.de/
https://lore.kernel.org/all/alpine.DEB.2.21.1910240119090.1852@nanos.tec.linutronix.de/
In practice arm64 did not need to override either. Prior to moving to
the generic irqentry code, arm64's management of DAIF was reworked in
commit:
97d935faacde ("arm64: Unmask Debug + SError in do_notify_resume()")
Since that commit, arm64 only masks interrupts during the 'prepare' step
when returning to user mode, and masks other DAIF exceptions later.
Within arm64_exit_to_user_mode(), the arm64 entry code is as follows:
local_irq_disable();
exit_to_user_mode_prepare_legacy(regs);
local_daif_mask();
mte_check_tfsr_exit();
exit_to_user_mode();
Remove the unnecessary local_irq_enable_exit_to_user() and
local_irq_disable_exit_to_user() functions.
Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Signed-off-by: Thomas Gleixner <tglx@kernel.org>
Reviewed-by: Jinjie Ruan <ruanjinjie@huawei.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20260407131650.3813777-3-mark.rutland@arm.com
|
|
The verifier currently does not allow overwriting a referenced dynptr's
stack slot to prevent resource leak. This is because referenced dynptr
holds additional resources that requires calling specific helpers to
release. This limitation can be relaxed when there are multiple copies
of the same dynptr. Whether it is the orignial dynptr or one of its
clones, as long as there exists at least one other dynptr with the same
ref_obj_id (to be used to release the reference), its stack slot should
be allowed to be overwritten.
Suggested-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Amery Hung <ameryhung@gmail.com>
Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20260406150548.1354271-2-ameryhung@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
When a non-{add,sub} alu op such as xor is performed on a scalar
register that previously had a BPF_ADD_CONST delta, the else path
in adjust_reg_min_max_vals() only clears dst_reg->id but leaves
dst_reg->delta unchanged.
This stale delta can propagate via assign_scalar_id_before_mov()
when the register is later used in a mov. It gets a fresh id but
keeps the stale delta from the old (now-cleared) BPF_ADD_CONST.
This stale delta can later propagate leading to a verifier-vs-
runtime value mismatch.
The clear_id label already correctly clears both delta and id.
Make the else path consistent by also zeroing the delta when id
is cleared. More generally, this introduces a helper clear_scalar_id()
which internally takes care of zeroing. There are various other
locations in the verifier where only the id is cleared. By using
the helper we catch all current and future locations.
Fixes: 98d7ca374ba4 ("bpf: Track delta between "linked" registers.")
Reported-by: STAR Labs SG <info@starlabs.sg>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/r/20260407192421.508817-2-daniel@iogearbox.net
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
Consider the case of rX += rX where src_reg and dst_reg are pointers to
the same bpf_reg_state in adjust_reg_min_max_vals(). The latter first
modifies the dst_reg in-place, and later in the delta tracking, the
subsequent is_reg_const(src_reg)/reg_const_value(src_reg) reads the
post-{add,sub} value instead of the original source.
This is problematic since it sets an incorrect delta, which sync_linked_regs()
then propagates to linked registers, thus creating a verifier-vs-runtime
mismatch. Fix it by just skipping this corner case.
Fixes: 98d7ca374ba4 ("bpf: Track delta between "linked" registers.")
Reported-by: STAR Labs SG <info@starlabs.sg>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/r/20260407192421.508817-1-daniel@iogearbox.net
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
When an unqualified kprobe target exists in both vmlinux and a loaded
module, number_of_same_symbols() returns a count greater than 1,
causing kprobe attachment to fail with -EADDRNOTAVAIL even though the
vmlinux symbol is unambiguous.
When no module qualifier is given and the symbol is found in vmlinux,
return the vmlinux-only count without scanning loaded modules. This
preserves the existing behavior for all other cases:
- Symbol only in a module: vmlinux count is 0, falls through to module
scan as before.
- Symbol qualified with MOD:SYM: mod != NULL, unchanged path.
- Symbol ambiguous within vmlinux itself: count > 1 is returned as-is.
Fixes: 926fe783c8a6 ("tracing/kprobes: Fix symbol counting logic by looking at modules as well")
Fixes: 9d8616034f16 ("tracing/kprobes: Add symbol counting check when module loads")
Suggested-by: Ihor Solodrai <ihor.solodrai@linux.dev>
Acked-by: Jiri Olsa <jolsa@kernel.org>
Acked-by: Ihor Solodrai <ihor.solodrai@linux.dev>
Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@crowdstrike.com>
Link: https://lore.kernel.org/r/20260407203912.1787502-2-andrey.grodzovsky@crowdstrike.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
RCU Tasks Trace grace period implies RCU grace period, and this
guarantee is expected to remain in the future. Only BPF is the user of
this predicate, hence retire the API and clean up all in-tree users.
RCU Tasks Trace is now implemented on SRCU-fast and its grace period
mechanism always has at least one call to synchronize_rcu() as it is
required for SRCU-fast's correctness (it replaces the smp_mb() that
SRCU-fast readers skip). So, RCU-tt GP will always imply RCU GP.
Reviewed-by: Puranjay Mohan <puranjay@kernel.org>
Reviewed-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20260407162234.785270-1-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
use NR_STD_WORKER_POOLS for irq_work_fns[] array definition.
NR_STD_WORKER_POOLS is also 2, but better to use MACRO.
Initialization loop for_each_bh_worker_pool() also uses same MACRO.
Signed-off-by: Maninder Singh <maninder1.s@samsung.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull misc fixes from Andrew Morton:
"Eight hotfixes. All are cc:stable and seven are for MM.
All are singletons - please see the changelogs for details"
* tag 'mm-hotfixes-stable-2026-04-06-15-27' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm:
ocfs2: fix out-of-bounds write in ocfs2_write_end_inline
mm/damon/stat: deallocate damon_call() failure leaking damon_ctx
mm/vma: fix memory leak in __mmap_region()
mm/memory_hotplug: maintain N_NORMAL_MEMORY during hotplug
mm/damon/sysfs: dealloc repeat_call_control if damon_call() fails
mm: reinstate unconditional writeback start in balance_dirty_pages()
liveupdate: propagate file deserialization failures
mm: filemap: fix nr_pages calculation overflow in filemap_map_pages()
|
|
In alarmtimer_suspend(), timerqueue_getnext() is called under
base->lock, but next->expires is read after the lock is released.
This is safe because suspend freezes all relevant task contexts,
but reading the node while holding the lock makes the code easier
to reason about and not worry about a theoretical UAF.
Signed-off-by: Zhan Xusheng <zhanxusheng@xiaomi.com>
Signed-off-by: Thomas Gleixner <tglx@kernel.org>
Link: https://patch.msgid.link/20260407143627.19405-1-zhanxusheng@xiaomi.com
|
|
bpf_lsm_task_to_inode() is called under rcu_read_lock() and
bpf_lsm_inet_conn_established() is called from softirq context, so
neither hook can be used by sleepable LSM programs.
Fixes: 423f16108c9d8 ("bpf: Augment the set of sleepable LSM hooks")
Reported-by: Quan Sun <2022090917019@std.uestc.edu.cn>
Reported-by: Yinhao Hu <dddddd@hust.edu.cn>
Reported-by: Kaiyan Mei <M202472210@hust.edu.cn>
Reported-by: Dongliang Mu <dzm91@hust.edu.cn>
Closes: https://lore.kernel.org/bpf/3ab69731-24d1-431a-a351-452aafaaf2a5@std.uestc.edu.cn/T/#u
Signed-off-by: Jiayuan Chen <jiayuan.chen@linux.dev>
Link: https://lore.kernel.org/r/20260407122334.344072-1-jiayuan.chen@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
Commit 56534673cea7f ("tick/nohz: Optimize check_tick_dependency() with
early return") added a fast path that returns !val when the tick_stop
tracepoint is disabled.
This is inverted: the slow path returns true when a dependency IS found
(val != 0), but !val returns true when val is zero (no dependency). The
result is that can_stop_full_tick() sees "dependency found" when there are
none, and the tick never stops on nohz_full CPUs.
Fix this by returning !!val instead of !val, matching the slow-path semantics.
Fixes: 56534673cea7f ("tick/nohz: Optimize check_tick_dependency() with early return")
Signed-off-by: Josh Snyder <josh@code406.com>
Signed-off-by: Thomas Gleixner <tglx@kernel.org>
Assisted-by: Claude:claude-opus-4-6
Link: https://patch.msgid.link/20260402-fix-idle-tick2-v1-1-eecb589649d3@code406.com
|
|
Here is one scenario which was triggered when running:
stress-ng --yield=32 -t 10000000s&
while true; do perf bench sched messaging -p -t -l 100000 -g 16; done
on a 256CPUs machine after about an hour into the run:
__enqeue_entity: entity_key(-141245081754) weight(90891264) overflow_mul(5608800059305154560) vlag(57498) delayed?(0)
cfs_rq: zero_vruntime(3809707759657809) sum_w_vruntime(0) sum_weight(0) nr_queued(1)
cfs_rq->curr: entity_key(0) vruntime(3809707759657809) deadline(3809723966988476) weight(37)
The above comes from __enqueue_entity() after a place_entity(). Breaking
this down:
vlag_initial = 57498
vlag = (57498 * (37 + 90891264)) / 37 = 141,245,081,754
vruntime = 3809707759657809 - 141245081754 = 3,809,566,514,576,055
entity_key(se, cfs_rq) = -141,245,081,754
Now, multiplying the entity_key with its own weight results to
5,608,800,059,305,154,560 (same as what overflow_mul() suggests) but
in Python, without overflow, this would be: -1,2837,944,014,404,397,056
Avoid the overflow (without doing the division for avg_vruntime()), by moving
zero_vruntime to the new entity when it is heavier.
Fixes: 4823725d9d1d ("sched/fair: Increase weight bits for avg_vruntime")
Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com>
[peterz: suggested 'weight > load' condition]
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20260407120052.GG3738010@noisy.programming.kicks-ass.net
|
|
to_ratio() computes BW_SHIFT-scaled bandwidth ratios from u64 period and
runtime values, but it returns unsigned long. tg_rt_schedulable() also
stores the current group limit and the accumulated child sum in unsigned
long.
On 32-bit builds, large bandwidth ratios can be truncated and the RT
group sum can wrap when enough siblings are present. That can let an
overcommitted RT hierarchy pass the schedulability check, and it also
narrows the helper result for other callers.
Return u64 from to_ratio() and use u64 for the RT group totals so
bandwidth ratios are preserved and compared at full width on both 32-bit
and 64-bit builds.
Fixes: b40b2e8eb521 ("sched: rt: multi level group constraints")
Assisted-by: Codex:GPT-5
Signed-off-by: Joseph Salisbury <joseph.salisbury@oracle.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: stable@vger.kernel.org
Link: https://patch.msgid.link/20260403210014.2713404-1-joseph.salisbury@oracle.com
|
|
When a pointer to PTR_TO_INSN is dereferenced, the offset field
of the BPF_LDX_MEM instruction can be nonzero. Patch the verifier
to not ignore this field.
Reported-by: Jiyong Yang <ksur673@gmail.com>
Fixes: 493d9e0d6083 ("bpf, x86: add support for indirect jumps")
Signed-off-by: Anton Protopopov <a.s.protopopov@gmail.com>
Link: https://lore.kernel.org/r/20260406160141.36943-2-a.s.protopopov@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
Apparently, struct bpf_empty_prog_array exists entirely to populate a
single element of "items" in a global variable. "null_prog" is only
used during the initializer.
None of this is needed; globals will be correctly sized with an array
initializer of a flexible-array member.
So, remove struct bpf_empty_prog_array and adjust the rest of the code,
accordingly.
With these changes, fix the following warnings:
./include/linux/bpf.h:2369:31: warning: structure containing a flexible
array member is not at the end of another structure [-Wflex-array-member-not-at-end]
Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
Acked-by: Mykyta Yatsenko <yatsenko@meta.com>
Link: https://lore.kernel.org/r/acr7Whmn0br3xeBP@kspp
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
Don't reject usage of fixed unaligned offsets for syscall ctx. Tests
will be added in later commits. Unaligned offsets already work for
variable offsets.
Reviewed-by: Emil Tsalapatis <emil@etsalapatis.com>
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20260406194403.1649608-3-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
Allow accessing PTR_TO_CTX with variable offsets in syscall programs.
Fixed offsets are already enabled for all program types that do not
convert their ctx accesses, since the changes we made in the commit
de6c7d99f898 ("bpf: Relax fixed offset check for PTR_TO_CTX"). Note
that we also lift the restriction on passing syscall context into
helpers, which was not permitted before, and passing modified syscall
context into kfuncs.
The structure of check_mem_access can be mostly shared and preserved,
but we must use check_mem_region_access to correctly verify access with
variable offsets.
The check made in check_helper_mem_access is hardened to only allow
PTR_TO_CTX for syscall programs to be passed in as helper memory. This
was the original intention of the existing code anyway, and it makes
little sense for other program types' context to be utilized as a memory
buffer. In case a convincing example presents itself in the future, this
check can be relaxed further.
We also no longer use the last-byte access to simulate helper memory
access, but instead go through check_mem_region_access. Since this no
longer updates our max_ctx_offset, we must do so manually, to keep track
of the maximum offset at which the program ctx may be accessed.
Take care to ensure that when arg_type is ARG_PTR_TO_CTX, we do not
relax any fixed or variable offset constraints around PTR_TO_CTX even in
syscall programs, and require them to be passed unmodified. There are
several reasons why this is necessary. First, if we pass a modified ctx,
then the global subprog's accesses will not update the max_ctx_offset to
its true maximum offset, and can lead to out of bounds accesses. Second,
tail called program (or extension program replacing global subprog) where
their max_ctx_offset exceeds the program they are being called from can
also cause issues. For the latter, unmodified PTR_TO_CTX is the first
requirement for the fix, the second is ensuring max_ctx_offset >= the
program they are being called from, which has to be a separate change
not made in this commit.
All in all, we can hint using arg_type when we expect ARG_PTR_TO_CTX and
make our relaxation around offsets conditional on it.
Drop coverage of syscall tests from verifier_ctx.c temporarily for
negative cases until they are updated in subsequent commits.
Reviewed-by: Emil Tsalapatis <emil@etsalapatis.com>
Acked-by: Puranjay Mohan <puranjay@kernel.org>
Acked-by: Mykyta Yatsenko <yatsenko@meta.com>
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20260406194403.1649608-2-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
luo_session_deserialize() ignored the return value from
luo_file_deserialize(). As a result, a session could be left partially
restored even though the /dev/liveupdate open path treats deserialization
failures as fatal.
Propagate the error so a failed file deserialization aborts session
deserialization instead of silently continuing.
Link: https://lkml.kernel.org/r/20260325044608.8407-1-leotimmins1974@gmail.com
Link: https://lkml.kernel.org/r/20260325044608.8407-2-leotimmins1974@gmail.com
Fixes: 16cec0d26521 ("liveupdate: luo_session: add ioctls for file preservation")
Signed-off-by: Leo Timmins <leotimmins1974@gmail.com>
Reviewed-by: Pasha Tatashin <pasha.tatashin@soleen.com>
Reviewed-by: Pratyush Yadav <pratyush@kernel.org>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
When a dev-bound-only BPF program (BPF_F_XDP_DEV_BOUND_ONLY) undergoes
JIT compilation with constant blinding enabled (bp |