aboutsummaryrefslogtreecommitdiff
AgeCommit message (Collapse)AuthorFilesLines
2026-01-30KVM: x86: Add x2APIC "features" to control EOI broadcast suppressionKhushit Shah7-15/+127
Add two flags for KVM_CAP_X2APIC_API to allow userspace to control support for Suppress EOI Broadcasts when using a split IRQCHIP (I/O APIC emulated by userspace), which KVM completely mishandles. When x2APIC support was first added, KVM incorrectly advertised and "enabled" Suppress EOI Broadcast, without fully supporting the I/O APIC side of the equation, i.e. without adding directed EOI to KVM's in-kernel I/O APIC. That flaw was carried over to split IRQCHIP support, i.e. KVM advertised support for Suppress EOI Broadcasts irrespective of whether or not the userspace I/O APIC implementation supported directed EOIs. Even worse, KVM didn't actually suppress EOI broadcasts, i.e. userspace VMMs without support for directed EOI came to rely on the "spurious" broadcasts. KVM "fixed" the in-kernel I/O APIC implementation by completely disabling support for Suppress EOI Broadcasts in commit 0bcc3fb95b97 ("KVM: lapic: stop advertising DIRECTED_EOI when in-kernel IOAPIC is in use"), but didn't do anything to remedy userspace I/O APIC implementations. KVM's bogus handling of Suppress EOI Broadcast is problematic when the guest relies on interrupts being masked in the I/O APIC until well after the initial local APIC EOI. E.g. Windows with Credential Guard enabled handles interrupts in the following order: 1. Interrupt for L2 arrives. 2. L1 APIC EOIs the interrupt. 3. L1 resumes L2 and injects the interrupt. 4. L2 EOIs after servicing. 5. L1 performs the I/O APIC EOI. Because KVM EOIs the I/O APIC at step #2, the guest can get an interrupt storm, e.g. if the IRQ line is still asserted and userspace reacts to the EOI by re-injecting the IRQ, because the guest doesn't de-assert the line until step #4, and doesn't expect the interrupt to be re-enabled until step #5. Unfortunately, simply "fixing" the bug isn't an option, as KVM has no way of knowing if the userspace I/O APIC supports directed EOIs, i.e. suppressing EOI broadcasts would result in interrupts being stuck masked in the userspace I/O APIC due to step #5 being ignored by userspace. And fully disabling support for Suppress EOI Broadcast is also undesirable, as picking up the fix would require a guest reboot, *and* more importantly would change the virtual CPU model exposed to the guest without any buy-in from userspace. Add KVM_X2APIC_ENABLE_SUPPRESS_EOI_BROADCAST and KVM_X2APIC_DISABLE_SUPPRESS_EOI_BROADCAST flags to allow userspace to explicitly enable or disable support for Suppress EOI Broadcasts. This gives userspace control over the virtual CPU model exposed to the guest, as KVM should never have enabled support for Suppress EOI Broadcast without userspace opt-in. Not setting either flag will result in legacy quirky behavior for backward compatibility. Disallow fully enabling SUPPRESS_EOI_BROADCAST when using an in-kernel I/O APIC, as KVM's history/support is just as tragic. E.g. it's not clear that commit c806a6ad35bf ("KVM: x86: call irq notifiers with directed EOI") was entirely correct, i.e. it may have simply papered over the lack of Directed EOI emulation in the I/O APIC. Note, Suppress EOI Broadcasts is defined only in Intel's SDM, not in AMD's APM. But the bit is writable on some AMD CPUs, e.g. Turin, and KVM's ABI is to support Directed EOI (KVM's name) irrespective of guest CPU vendor. Fixes: 7543a635aa09 ("KVM: x86: Add KVM exit for IOAPIC EOIs") Closes: https://lore.kernel.org/kvm/7D497EF1-607D-4D37-98E7-DAF95F099342@nutanix.com Cc: stable@vger.kernel.org Suggested-by: David Woodhouse <dwmw2@infradead.org> Signed-off-by: Khushit Shah <khushit.shah@nutanix.com> Link: https://patch.msgid.link/20260123125657.3384063-1-khushit.shah@nutanix.com [sean: clean up minor formatting goofs and fix a comment typo] Co-developed-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Sean Christopherson <seanjc@google.com>
2026-01-30KVM: x86: Harden against unexpected adjustments to kvm_cpu_capsSean Christopherson4-5/+25
Add a flag to track when KVM is actively configuring its CPU caps, and WARN if a cap is set or cleared if KVM isn't in its configuration stage. Modifying CPU caps after {svm,vmx}_set_cpu_caps() can be fatal to KVM, as vendor setup code expects the CPU caps to be frozen at that point, e.g. will do additional configuration based on the caps. Rename kvm_set_cpu_caps() to kvm_initialize_cpu_caps() to pair with the new "finalize", and to make it more obvious that KVM's CPU caps aren't fully configured within the function. Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com> Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com> Link: https://patch.msgid.link/20260128014310.3255561-3-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2026-01-30KVM: VMX: Print out "bad" offsets+value on VMCS config mismatchSean Christopherson1-1/+16
When kvm-intel.ko refuses to load due to a mismatched VMCS config, print all mismatching offsets+values to make it easier to debug goofs during development, and to make it at least feasible to triage failures that occur during production. E.g. if a physical core is flaky or is running with the "wrong" microcode patch loaded, then a CPU can get a legitimate mismatch even without KVM bugs. Print the mismatches as 32-bit values as a compromise between hand coding every field (to provide precise information) and printing individual bytes (requires more effort to deduce the mismatch bit(s)). All fields in the VMCS config are either 32-bit or 64-bit values, i.e. in many cases, printing 32-bit values will be 100% precise, and in the others it's close enough, especially when considering that MSR values are split into EDX:EAX anyways. E.g. on mismatch CET entry/exit controls, KVM will print: kvm_intel: VMCS config on CPU 0 doesn't match reference config: Offset 76 REF = 0x107fffff, CPU0 = 0x007fffff, mismatch = 0x10000000 Offset 84 REF = 0x0010f3ff, CPU0 = 0x0000f3ff, mismatch = 0x00100000 Opportunistically tweak the wording on the initial error message to say "mismatch" instead of "inconsistent", as the VMCS config itself isn't inconsistent, and the wording conflates the cross-CPU compatibility check with the error_on_inconsistent_vmcs_config knob that treats inconsistent VMCS configurations as errors (e.g. if a CPU supports CET entry controls but no CET exit controls). Cc: Jim Mattson <jmattson@google.com> Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com> Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com> Link: https://patch.msgid.link/20260128014310.3255561-4-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2026-01-30KVM: x86: Explicitly configure supported XSS from {svm,vmx}_set_cpu_caps()Sean Christopherson4-13/+23
Explicitly configure KVM's supported XSS as part of each vendor's setup flow to fix a bug where clearing SHSTK and IBT in kvm_cpu_caps, e.g. due to lack of CET XFEATURE support, makes kvm-intel.ko unloadable when nested VMX is enabled, i.e. when nested=1. The late clearing results in nested_vmx_setup_{entry,exit}_ctls() clearing VM_{ENTRY,EXIT}_LOAD_CET_STATE when nested_vmx_setup_ctls_msrs() runs during the CPU compatibility checks, ultimately leading to a mismatched VMCS config due to the reference config having the CET bits set, but every CPU's "local" config having the bits cleared. Note, kvm_caps.supported_{xcr0,xss} are unconditionally initialized by kvm_x86_vendor_init(), before calling into vendor code, and not referenced between ops->hardware_setup() and their current/old location. Fixes: 69cc3e886582 ("KVM: x86: Add XSS support for CET_KERNEL and CET_USER") Cc: stable@vger.kernel.org Cc: Mathias Krause <minipli@grsecurity.net> Cc: John Allen <john.allen@amd.com> Cc: Rick Edgecombe <rick.p.edgecombe@intel.com> Cc: Chao Gao <chao.gao@intel.com> Cc: Binbin Wu <binbin.wu@linux.intel.com> Cc: Xiaoyao Li <xiaoyao.li@intel.com> Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com> Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com> Link: https://patch.msgid.link/20260128014310.3255561-2-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2026-01-30Merge tag 'block-6.19-20260130' of ↵Linus Torvalds4-8/+6
git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux Pull block fixes from Jens Axboe: - Fix for an accounting leak in bcache that's been there forever, and a related dead code removal - Revert of a fix for rnbd that went into this series, but depends on other changes that are staged for 7.0 - NVMe pull request via Keith: - TCP target completion race condition fix (Ming) - DMA descriptor cleanup fix (Roger) * tag 'block-6.19-20260130' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux: bcache: fix I/O accounting leak in detached_dev_do_request bcache: remove dead code in detached_dev_do_request nvme-pci: DMA unmap the correct regions in nvme_free_sgls Revert "rnbd-clt: fix refcount underflow in device unmap path" nvmet: fix race in nvmet_bio_done() leading to NULL pointer dereference
2026-01-30Merge tag 'dma-mapping-6.19-2026-01-30' of ↵Linus Torvalds4-9/+42
git://git.kernel.org/pub/scm/linux/kernel/git/mszyprowski/linux Pull dma-mapping fixes from Marek Szyprowski: - important fix for ARM 32-bit based systems using cma= kernel parameter (Oreoluwa Babatunde) - a fix for the corner case of the DMA atomic pool based allocations (Sai Sree Kartheek Adivi) * tag 'dma-mapping-6.19-2026-01-30' of git://git.kernel.org/pub/scm/linux/kernel/git/mszyprowski/linux: dma/pool: distinguish between missing and exhausted atomic pools of: reserved_mem: Allow reserved_mem framework detect "cma=" kernel param
2026-01-30tick/nohz: Optimize check_tick_dependency() with early returnIonut Nechita (Sunlight Linux)1-0/+3
There is no point in iterating through individual tick dependency bits when the tick_stop tracepoint is disabled, which is the common case. When the trace point is disabled, return immediately based on the atomic value being zero or non-zero, skipping the per-bit evaluation. This optimization improves the hot path performance of tick dependency checks across all contexts (idle and non-idle), not just nohz_full CPUs. Suggested-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Ionut Nechita (Sunlight Linux) <sunlightlinux@gmail.com> Signed-off-by: Thomas Gleixner <tglx@kernel.org> Link: https://patch.msgid.link/20260128074558.15433-3-sunlightlinux@gmail.com
2026-01-30selftests/bpf: Make bpf get_preempt_count() work for v6.14+ kernelsChangwoo Min1-2/+20
Recent x86 kernels export __preempt_count as a ksym, while some old kernels between v6.1 and v6.14 expose the preemption counter via pcpu_hot.preempt_count. The existing selftest helper unconditionally dereferenced __preempt_count, which breaks BPF program loading on such old kernels. Make the x86 preemption count lookup version-agnostic by: - Marking __preempt_count and pcpu_hot as weak ksyms. - Introducing a BTF-described pcpu_hot___local layout with preserve_access_index. - Selecting the appropriate access path at runtime using ksym availability and bpf_ksym_exists() and bpf_core_field_exists(). This allows a single BPF binary to run correctly across kernel versions (e.g., v6.18 vs. v6.13) without relying on compile-time version checks. Signed-off-by: Changwoo Min <changwoo@igalia.com> Link: https://lore.kernel.org/r/20260130021843.154885-1-changwoo@igalia.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2026-01-30Merge branch 'bpf-tail-calls-in-sleepable-programs'Alexei Starovoitov5-1/+122
Jiri Olsa says: ==================== this patchset allows sleepable programs to use tail calls. At the moment we need to have separate sleepable uprobe program to retrieve user space data and pass it to complex program with tail calls. It'd be great if the program with tail calls could be sleepable and do the data retrieval directly. ==================== Link: https://patch.msgid.link/20260130081208.1130204-1-jolsa@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2026-01-30selftests/bpf: Add test for sleepable program tailcallsJiri Olsa2-0/+117
Adding test that makes sure we can't mix sleepable and non-sleepable bpf programs in the BPF_MAP_TYPE_PROG_ARRAY map and that we can do tail call in the sleepable program. Signed-off-by: Jiri Olsa <jolsa@kernel.org> Link: https://lore.kernel.org/r/20260130081208.1130204-3-jolsa@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2026-01-30bpf: Allow sleepable programs to use tail callsJiri Olsa3-1/+5
Allowing sleepable programs to use tail calls. Making sure we can't mix sleepable and non-sleepable bpf programs in tail call map (BPF_MAP_TYPE_PROG_ARRAY) and allowing it to be used in sleepable programs. Sleepable programs can be preempted and sleep which might bring new source of race conditions, but both direct and indirect tail calls should not be affected. Direct tail calls work by patching direct jump to callee into bpf caller program, so no problem there. We atomically switch from nop to jump instruction. Indirect tail call reads the callee from the map and then jumps to it. The callee bpf program can't disappear (be released) from the caller, because it is executed under rcu lock (rcu_read_lock_trace). Signed-off-by: Jiri Olsa <jolsa@kernel.org> Acked-by: Leon Hwang <leon.hwang@linux.dev> Link: https://lore.kernel.org/r/20260130081208.1130204-2-jolsa@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2026-01-30dt-bindings: power: supply: google,goldfish-battery: Convert to DT schemaKuan-Wei Chiu2-17/+41
Convert the Android Goldfish Battery binding to DT schema format. Move the file to the power/supply directory to match the subsystem. Update the example node name to 'battery' to comply with generic node naming standards. Signed-off-by: Kuan-Wei Chiu <visitorckw@gmail.com> Reviewed-by: Krzysztof Kozlowski <krzysztof.kozlowski@oss.qualcomm.com> Link: https://patch.msgid.link/20260113092602.3197681-5-visitorckw@gmail.com Signed-off-by: Sebastian Reichel <sebastian.reichel@collabora.com>
2026-01-30power: supply: qcom_battmgr: Recognize "LiP" as lithium-polymerVal Packett1-1/+2
On the Dell Latitude 7455, the firmware uses "LiP" with a lowercase 'i' for the battery chemistry type, but only all-uppercase "LIP" was being recognized. Add the CamelCase variant to the check to fix the "Unknown battery technology" warning. Fixes: 202ac22b8e2e ("power: supply: qcom_battmgr: Add lithium-polymer entry") Signed-off-by: Val Packett <val@packett.cool> Reviewed-by: Konrad Dybcio <konrad.dybcio@oss.qualcomm.com> Reviewed-by: Dmitry Baryshkov <dmitry.baryshkov@oss.qualcomm.com> Link: https://patch.msgid.link/20260120235831.479038-1-val@packett.cool Signed-off-by: Sebastian Reichel <sebastian.reichel@collabora.com>
2026-01-30Merge tag 'gpio-fixes-for-v6.19-rc8' of ↵Linus Torvalds8-34/+52
git://git.kernel.org/pub/scm/linux/kernel/git/brgl/linux Pull gpio fixes from Bartosz Golaszewski: "Over the last week I received quite an unexpected (for rc7) number of fixes but they are all pretty small and mostly limited to drivers: - don't call into pinctrl when setting direction in gpio-rockchip as it's not needed and may trigger locking context errors - change spinlock to raw_spinlock in gpio-sprd - fix a use-after-free bug in gpio-virtuser - don't register a driver from another driver's probe() in gpio-omap - fix int width problems in GPIO ACPI code - fix interrupt-to-pin mapping in gpio-brcmstb - mask interrupts in irq shutdown in gpio-pca953x" * tag 'gpio-fixes-for-v6.19-rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/brgl/linux: gpiolib: acpi: Fix potential out-of-boundary left shift gpio: brcmstb: correct hwirq to bank map gpio: omap: do not register driver in probe() gpio: pca953x: mask interrupts in irq shutdown gpio: virtuser: fix UAF in configfs release path gpiolib: acpi: use BIT_ULL() for u64 mask in address space handler gpio: sprd: Change sprd_gpio lock to raw_spin_lock gpio: rockchip: Stop calling pinctrl for set_direction
2026-01-30power: supply: wm97xx: Use devm_power_supply_register()Waqar Hameed1-12/+4
Instead of handling the registration manually, use the automatic `devres` variant `devm_power_supply_register()`. This is less error prone and cleaner. Signed-off-by: Waqar Hameed <waqar.hameed@axis.com> Link: https://patch.msgid.link/b0b366d302f0605c8555dd68ed32973959f133bb.1769158280.git.waqar.hameed@axis.com Signed-off-by: Sebastian Reichel <sebastian.reichel@collabora.com>
2026-01-30power: supply: wm97xx: Use devm_kcalloc()Waqar Hameed1-11/+5
Instead of handling the memory allocation manually, use the automatic `devres` variant `devm_kcalloc()`. This is less error prone and eliminates the `goto`-path. Signed-off-by: Waqar Hameed <waqar.hameed@axis.com> Link: https://patch.msgid.link/5ac93f5b0fa417bb5d9e93b9302a18f2c04d4077.1769158280.git.waqar.hameed@axis.com Signed-off-by: Sebastian Reichel <sebastian.reichel@collabora.com>
2026-01-30power: supply: pm8916_lbc: Fix use-after-free for extcon in IRQ handlerWaqar Hameed1-5/+5
Using the `devm_` variant for requesting IRQ _before_ the `devm_` variant for allocating/registering the `extcon` handle, means that the `extcon` handle will be deallocated/unregistered _before_ the interrupt handler (since `devm_` naturally deallocates in reverse allocation order). This means that during removal, there is a race condition where an interrupt can fire just _after_ the `extcon` handle has been freed, *but* just _before_ the corresponding unregistration of the IRQ handler has run. This will lead to the IRQ handler calling `extcon_set_state_sync()` with a freed `extcon` handle. Which usually crashes the system or otherwise silently corrupts the memory... Fix this racy use-after-free by making sure the IRQ is requested _after_ the registration of the `extcon` handle. Fixes: f8d7a3d21160 ("power: supply: Add driver for pm8916 lbc") Signed-off-by: Waqar Hameed <waqar.hameed@axis.com> Reviewed-by: Nikita Travkin <nikita@trvn.ru> Link: https://patch.msgid.link/e2a4cd2fcd42b6cd97d856c17c097289a2aed393.1769163273.git.waqar.hameed@axis.com Signed-off-by: Sebastian Reichel <sebastian.reichel@collabora.com>
2026-01-30accel/amdxdna: Fix memory leak in amdxdna_ubuf_mapZishun Yi1-2/+8
The amdxdna_ubuf_map() function allocates memory for sg and internal sg table structures, but it fails to free them if subsequent operations (sg_alloc_table_from_pages or dma_map_sgtable) fail. Fixes: bd72d4acda10 ("accel/amdxdna: Support user space allocated buffer") Signed-off-by: Zishun Yi <zishun.yi.dev@gmail.com> Reviewed-by: Lizhi Hou <lizhi.hou@amd.com> Reviewed-by: Min Ma <mamin506@gmail.com> Signed-off-by: Lizhi Hou <lizhi.hou@amd.com> Link: https://patch.msgid.link/20260129171022.68578-1-zishun.yi.dev@gmail.com
2026-01-30accel/amdxdna: Stop job scheduling across aie2_release_resource()Lizhi Hou1-0/+6
Running jobs on a hardware context while it is in the process of releasing resources can lead to use-after-free and crashes. Fix this by stopping job scheduling before calling aie2_release_resource() and restarting it after the release completes. Additionally, aie2_sched_job_run() now checks whether the hardware context is still active. Fixes: 4fd6ca90fc7f ("accel/amdxdna: Refactor hardware context destroy routine") Reviewed-by: Mario Limonciello (AMD) <superm1@kernel.org> Signed-off-by: Lizhi Hou <lizhi.hou@amd.com> Link: https://patch.msgid.link/20260130003255.2083255-1-lizhi.hou@amd.com
2026-01-30accel/amdxdna: Hold mm structure across iommu_sva_unbind_device()Lizhi Hou2-0/+4
Some tests trigger a crash in iommu_sva_unbind_device() due to accessing iommu_mm after the associated mm structure has been freed. Fix this by taking an explicit reference to the mm structure after successfully binding the device, and releasing it only after the device is unbound. This ensures the mm remains valid for the entire SVA bind/unbind lifetime. Fixes: be462c97b7df ("accel/amdxdna: Add hardware context") Reviewed-by: Mario Limonciello (AMD) <superm1@kernel.org> Signed-off-by: Lizhi Hou <lizhi.hou@amd.com> Link: https://patch.msgid.link/20260128002356.1858122-1-lizhi.hou@amd.com
2026-01-30power: reset: tdx-ec-poweroff: fix restartEmanuele Ghidoli1-0/+19
During testing, restart occasionally failed on Toradex modules. The issue was traced to an interaction between the EC-based reset/poweroff handler and the PSCI restart handler. While the embedded controller is resetting or powering off the module, the PSCI code may still be invoked, triggering an I2C transaction to the PMIC. This can leave the PMIC I2C in a frozen state. Add a delay after issuing the EC reset or power-off command to give the controller time to complete the operation and avoid falling back to another restart/poweroff provider. Also print an error message if sending the command to the embedded controller fails. Fixes: 18672fe12367 ("power: reset: add Toradex Embedded Controller") Cc: stable@vger.kernel.org Signed-off-by: Emanuele Ghidoli <emanuele.ghidoli@toradex.com> Reviewed-by: Francesco Dolcini <francesco.dolcini@toradex.com> Link: https://patch.msgid.link/20260130071208.1184239-1-ghidoliemanuele@gmail.com Signed-off-by: Sebastian Reichel <sebastian.reichel@collabora.com>
2026-01-30PM: EM: Documentation: Fix bug in example code snippetPatrick Little1-2/+2
A semicolon was mistakenly placed at the end of 'if' statements. If example is copied as-is, it would lead to the subsequent return being executed unconditionally, which is incorrect, and the rest of the function would never be reached. Signed-off-by: Patrick Little <plittle@gmail.com> Acked-by: Randy Dunlap <rdunlap@infradead.org> [ rjw: Subject adjustment ] Link: https://patch.msgid.link/20260128-documentation-fix-grammar-v1-2-39238dc471f9@gmail.com Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2026-01-30Documentation: Fix typos in energy model documentationPatrick Little2-11/+11
Fix typos in documentation related to energy model management. Signed-off-by: Patrick Little <plittle@gmail.com> Acked-by: Randy Dunlap <rdunlap@infradead.org> [ rjw: Subject and changelog edits ] Link: https://patch.msgid.link/20260128-documentation-fix-grammar-v1-1-39238dc471f9@gmail.com Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2026-01-30cpuidle: governors: teo: Refine intercepts-based idle state lookupRafael J. Wysocki1-7/+43
There are cases in which decisions made by the teo governor are arguably overly conservative. For instance, suppose that there are 4 idle states and the values of the intercepts metric for the first 3 of them are 400, 250, and 251, respectively. If the total sum computed in teo_update() is 1000, the governor will select idle state 1 (provided that all idle states are enabled and the scheduler tick has not been stopped) although arguably idle state 0 would be a better choice because the likelihood of getting an idle duration below the target residency of idle state 1 is greater than the likelihood of getting an idle duration between the target residency of idle state 1 and the target residency of idle state 2. To address this, refine the candidate idle state lookup based on intercepts to start at the state with the maximum intercepts metric, below the deepest enabled one, to avoid the cases in which the search may stop before reaching that state. Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Reviewed-by: Christian Loehle <christian.loehle@arm.com> [ rjw: Fixed typo "intercetps" in new comments (3 places) ] Link: https://patch.msgid.link/2417298.ElGaqSPkdT@rafael.j.wysocki Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2026-01-30cpuidle: governors: teo: Adjust the classification of wakeup eventsRafael J. Wysocki1-11/+14
If differences between target residency values of adjacent idle states of a given CPU are relatively large, the corresponding idle state bins used by the teo governors are large either and the rule by which hits are distinguished from intercepts is inaccurate. Namely, by that rule, a wakeup event is classified as a hit if the sleep length (the time till the closest timer other than the tick) and the measured idle duration, adjusted for the entered idle state exit latency, fall into the same idle state bin. However, if that bin is large enough, the actual difference between the sleep length and the measured idle duration may be significant. It may in fact be significantly greater than the analogous difference for an event where the sleep length and the measured idle duration fall into different bins. For this reason, amend the rule in question with a check that will only allow a wakeup event to be counted as a hit if the sleep length is less than the "raw" measured idle duration (which means that the wakeup appears to have occurred after the anticipated timer event). Otherwise, the event will be counted as an intercept. Also update the documentation part explaining the difference between "hits" and "intercepts" to take the above change into account. Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Reviewed-by: Christian Loehle <christian.loehle@arm.com> Link: https://patch.msgid.link/5093379.31r3eYUQgx@rafael.j.wysocki
2026-01-30PCI/ACPI: Restrict program_hpx_type2() to AER bitsHåkon Bugge3-38/+27
Previously program_hpx_type2() applied PCIe settings unconditionally, which could incorrectly change bits like Extended Tag Field Enable and Enable Relaxed Ordering. When _HPX was added to ACPI r3.0, the intent of the PCIe Setting Record (Type 2) in sec 6.2.7.3 was to configure AER registers when the OS does not own the AER Capability: The PCI Express setting record contains ... [the AER] Uncorrectable Error Mask, Uncorrectable Error Severity, Correctable Error Mask ... to be used when configuring registers in the Advanced Error Reporting Extended Capability Structure ... OSPM [1] will only evaluate _HPX with Setting Record – Type 2 if OSPM is not controlling the PCI Express Advanced Error Reporting capability. ACPI r3.0b, sec 6.2.7.3, added more AER registers, including registers in the PCIe Capability with AER-related bits, and the restriction that the OS use this only when it owns PCIe native hotplug: ... when configuring PCI Express registers in the Advanced Error Reporting Extended Capability Structure *or PCI Express Capability Structure* ... An OS that has assumed ownership of native hot plug but does not ... have ownership of the AER register set must use ... the Type 2 record to program the AER registers ... However, since the Type 2 record also includes register bits that have functions other than AER, the OS must ignore values ... that are not applicable. Restrict program_hpx_type2() to only the intended purpose: - Apply settings only when OS owns PCIe native hotplug but not AER, - Only touch the AER-related bits (Error Reporting Enables) in Device Control - Don't touch Link Control at all, since nothing there seems AER-related, but log _HPX settings for debugging purposes Note that Read Completion Boundary is now configured elsewhere, since it is unrelated to _HPX. [1] Operating System-directed configuration and Power Management Fixes: 40abb96c51bb ("[PATCH] pciehp: Fix programming hotplug parameters") Signed-off-by: Håkon Bugge <haakon.bugge@oracle.com> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com> Link: https://patch.msgid.link/20260129175237.727059-3-haakon.bugge@oracle.com
2026-01-30PCI: Initialize RCB from pci_configure_device()Håkon Bugge1-0/+32
Commit e42010d8207f ("PCI: Set Read Completion Boundary to 128 iff Root Port supports it (_HPX)") worked around a bogus _HPX type 2 record, which caused program_hpx_type2() to set the RCB in an endpoint even though the Root Port did not have the RCB bit set. e42010d8207f fixed that by setting the RCB in the endpoint only when it was set in the Root Port. In retrospect, program_hpx_type2() is intended for AER-related settings, and the RCB should be configured elsewhere so it doesn't depend on the presence or contents of an _HPX record. Explicitly program the RCB from pci_configure_device() so it matches the Root Port's RCB. The Root Port may not be visible to virtualized guests; in that case, leave RCB alone. Fixes: e42010d8207f ("PCI: Set Read Completion Boundary to 128 iff Root Port supports it (_HPX)") Signed-off-by: Håkon Bugge <haakon.bugge@oracle.com> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com> Link: https://patch.msgid.link/20260129175237.727059-2-haakon.bugge@oracle.com
2026-01-30PCI: dwc: ep: Fix resizable BAR support for multi-PF configurationsAksh Garg1-16/+32
The resizable BAR support added by the commit 3a3d4cabe681 ("PCI: dwc: ep: Allow EPF drivers to configure the size of Resizable BARs") incorrectly configures the resizable BARs only for the first Physical Function (PF0) in EP mode. The resizable BAR configuration functions use generic dw_pcie_*_dbi() operations instead of physical function specific dw_pcie_ep_*_dbi() operations. This causes resizable BAR configuration to always target PF0 regardless of the requested function number. Additionally, dw_pcie_ep_init_non_sticky_registers() only initializes resizable BAR registers for PF0, leaving other PFs unconfigured during the execution of this function. Fix this by using physical function specific configuration space access operations throughout the resizable BAR code path and initializing registers for all the physical functions that support resizable BARs. Fixes: 3a3d4cabe681 ("PCI: dwc: ep: Allow EPF drivers to configure the size of Resizable BARs") Signed-off-by: Aksh Garg <a-garg7@ti.com> [mani: added stable tag] Signed-off-by: Manivannan Sadhasivam <mani@kernel.org> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com> Reviewed-by: Niklas Cassel <cassel@kernel.org> Cc: stable@vger.kernel.org Link: https://patch.msgid.link/20260130115516.515082-2-a-garg7@ti.com
2026-01-30PCI: endpoint: pci-epf-test: Allow overriding default BAR sizesNiklas Cassel2-2/+116
Add bar{0,1,2,3,4,5}_size attributes in configfs, so that the user is not restricted to run pci-epf-test with the hardcoded BAR size values defined in pci-epf-test.c. This code is shamelessly more or less copy pasted from pci-epf-vntb.c Signed-off-by: Niklas Cassel <cassel@kernel.org> Signed-off-by: Manivannan Sadhasivam <mani@kernel.org> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com> Tested-by: Koichiro Den <den@valinux.co.jp> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Frank Li <Frank.Li@nxp.com> Link: https://patch.msgid.link/20260130113038.2143947-2-cassel@kernel.org
2026-01-30i40e: drop udp_tunnel_get_rx_info() call from i40e_open()Mohammad Heib1-1/+0
The i40e driver calls udp_tunnel_get_rx_info() during i40e_open(). This is redundant because UDP tunnel RX offload state is preserved across device down/up cycles. The udp_tunnel core handles synchronization automatically when required. Furthermore, recent changes in the udp_tunnel infrastructure require querying RX info while holding the udp_tunnel lock. Calling it directly from the ndo_open path violates this requirement, triggering the following lockdep warning: Call Trace: <TASK> ? __udp_tunnel_nic_assert_locked+0x39/0x40 [udp_tunnel] i40e_open+0x135/0x14f [i40e] __dev_open+0x121/0x2e0 __dev_change_flags+0x227/0x270 dev_change_flags+0x3d/0xb0 devinet_ioctl+0x56f/0x860 sock_do_ioctl+0x7b/0x130 __x64_sys_ioctl+0x91/0xd0 do_syscall_64+0x90/0x170 ... </TASK> Remove the redundant and unsafe call to udp_tunnel_get_rx_info() from i40e_open() resolve the locking violation. Fixes: 1ead7501094c ("udp_tunnel: remove rtnl_lock dependency") Signed-off-by: Mohammad Heib <mheib@redhat.com> Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com> Reviewed-by: Paul Menzel <pmenzel@molgen.mpg.de> Tested-by: Rinitha S <sx.rinitha@intel.com> (A Contingent worker at Intel) Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2026-01-30ice: drop udp_tunnel_get_rx_info() call from ndo_open()Mohammad Heib1-3/+0
The ice driver calls udp_tunnel_get_rx_info() during ice_open_internal(). This is redundant because UDP tunnel RX offload state is preserved across device down/up cycles. The udp_tunnel core handles synchronization automatically when required. Furthermore, recent changes in the udp_tunnel infrastructure require querying RX info while holding the udp_tunnel lock. Calling it directly from the ndo_open path violates this requirement, triggering the following lockdep warning: Call Trace: <TASK> ice_open_internal+0x253/0x350 [ice] __udp_tunnel_nic_assert_locked+0x86/0xb0 [udp_tunnel] __dev_open+0x2f5/0x880 __dev_change_flags+0x44c/0x660 netif_change_flags+0x80/0x160 devinet_ioctl+0xd21/0x15f0 inet_ioctl+0x311/0x350 sock_ioctl+0x114/0x220 __x64_sys_ioctl+0x131/0x1a0 ... </TASK> Remove the redundant and unsafe call to udp_tunnel_get_rx_info() from ice_open_internal() to resolve the locking violation Fixes: 1ead7501094c ("udp_tunnel: remove rtnl_lock dependency") Signed-off-by: Mohammad Heib <mheib@redhat.com> Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com> Tested-by: Rinitha S <sx.rinitha@intel.com> (A Contingent worker at Intel) Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2026-01-30ice: Fix PTP NULL pointer dereference during VSI rebuildAaron Ma3-5/+29
Fix race condition where PTP periodic work runs while VSI is being rebuilt, accessing NULL vsi->rx_rings. The sequence was: 1. ice_ptp_prepare_for_reset() cancels PTP work 2. ice_ptp_rebuild() immediately queues PTP work 3. VSI rebuild happens AFTER ice_ptp_rebuild() 4. PTP work runs and accesses NULL vsi->rx_rings Fix: Keep PTP work cancelled during rebuild, only queue it after VSI rebuild completes in ice_rebuild(). Added ice_ptp_queue_work() helper function to encapsulate the logic for queuing PTP work, ensuring it's only queued when PTP is supported and the state is ICE_PTP_READY. Error log: [ 121.392544] ice 0000:60:00.1: PTP reset successful [ 121.392692] BUG: kernel NULL pointer dereference, address: 0000000000000000 [ 121.392712] #PF: supervisor read access in kernel mode [ 121.392720] #PF: error_code(0x0000) - not-present page [ 121.392727] PGD 0 [ 121.392734] Oops: Oops: 0000 [#1] SMP NOPTI [ 121.392746] CPU: 8 UID: 0 PID: 1005 Comm: ice-ptp-0000:60 Tainted: G S 6.19.0-rc6+ #4 PREEMPT(voluntary) [ 121.392761] Tainted: [S]=CPU_OUT_OF_SPEC [ 121.392773] RIP: 0010:ice_ptp_update_cached_phctime+0xbf/0x150 [ice] [ 121.393042] Call Trace: [ 121.393047] <TASK> [ 121.393055] ice_ptp_periodic_work+0x69/0x180 [ice] [ 121.393202] kthread_worker_fn+0xa2/0x260 [ 121.393216] ? __pfx_ice_ptp_periodic_work+0x10/0x10 [ice] [ 121.393359] ? __pfx_kthread_worker_fn+0x10/0x10 [ 121.393371] kthread+0x10d/0x230 [ 121.393382] ? __pfx_kthread+0x10/0x10 [ 121.393393] ret_from_fork+0x273/0x2b0 [ 121.393407] ? __pfx_kthread+0x10/0x10 [ 121.393417] ret_from_fork_asm+0x1a/0x30 [ 121.393432] </TASK> Fixes: 803bef817807d ("ice: factor out ice_ptp_rebuild_owner()") Signed-off-by: Aaron Ma <aaron.ma@canonical.com> Tested-by: Sunitha Mekala <sunithax.d.mekala@intel.com> (A Contingent worker at Intel) Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2026-01-30ice: PTP: fix missing timestamps on E825 hardwareJacob Keller3-78/+103
The E825 hardware currently has each PF handle the PFINT_TSYN_TX cause of the miscellaneous OICR interrupt vector. The actual interrupt cause underlying this is shared by all ports on the same quad: ┌─────────────────────────────────┐ │ │ │ ┌────┐ ┌────┐ ┌────┐ ┌────┐ │ │ │PF 0│ │PF 1│ │PF 2│ │PF 3│ │ │ └────┘ └────┘ └────┘ └────┘ │ │ │ └────────────────▲────────────────┘ │ │ ┌────────────────┼────────────────┐ │ PHY QUAD │ └───▲────────▲────────▲────────▲──┘ │ │ │ │ ┌───┼──┐ ┌───┴──┐ ┌───┼──┐ ┌───┼──┐ │Port 0│ │Port 1│ │Port 2│ │Port 3│ └──────┘ └──────┘ └──────┘ └──────┘ If multiple PFs issue Tx timestamp requests near simultaneously, it is possible that the correct PF will not be interrupted and will miss its timestamp. Understanding why is somewhat complex. Consider the following sequence of events: CPU 0: Send Tx packet on PF 0 ... PF 0 enqueues packet with Tx request CPU 1, PF1: ... Send Tx packet on PF1 ... PF 1 enqueues packet with Tx request HW: PHY Port 0 sends packet PHY raises Tx timestamp event interrupt MAC raises each PF interrupt CPU 0, PF0: CPU 1, PF1: ice_misc_intr() checks for Tx timestamps ice_misc_intr() checks for Tx timestamp Sees packet ready bit set Sees nothing available ... Exits ... ... HW: PHY port 1 sends packet PHY interrupt ignored because not all packet timestamps read yet. ... Read timestamp, report to stack Because the interrupt event is shared for all ports on the same quad, the PHY will not raise a new interrupt for any PF until all timestamps are read. In the example above, the second timestamp comes in for port 1 before the timestamp from port 0 is read. At this point, there is no longer an interrupt thread running that will read the timestamps, because each PF has checked and found that there was no work to do. Applications such as ptp4l will timeout after waiting a few milliseconds. Eventually, the watchdog service task will re-check for all quads and notice that there are outstanding timestamps, and issue a software interrupt to recover. However, by this point it is far too late, and applications have already failed. All of this occurs because of the underlying hardware behavior. The PHY cannot raise a new interrupt signal until all outstanding timestamps have been read. As a first step to fix this, switch the E825C hardware to the ICE_PTP_TX_INTERRUPT_ALL mode. In this mode, only the clock owner PF will respond to the PFINT_TSYN_TX cause. Other PFs disable this cause and will not wake. In this mode, the clock owner will iterate over all ports and handle timestamps for each connected port. This matches the E822 behavior, and is a necessary but insufficient step to resolve the missing timestamps. Even with use of the ICE_PTP_TX_INTERRUPT_ALL mode, we still sometimes miss a timestamp event. The ice_ptp_tx_tstamp_owner() does re-check the ready bitmap, but does so before re-enabling the OICR interrupt vector. It also only checks the ready bitmap, but not the software Tx timestamp tracker. To avoid risk of losing a timestamp, refactor the logic to check both the software Tx timestamp tracker bitmap *and* the hardware ready bitmap. Additionally, do this outside of ice_ptp_process_ts() after we have already re-enabled the OICR interrupt. Remove the checks from the ice_ptp_tx_tstamp(), ice_ptp_tx_tstamp_owner(), and the ice_ptp_process_ts() functions. This results in ice_ptp_tx_tstamp() being nothing more than a wrapper around ice_ptp_process_tx_tstamp() so we can remove it. Add the ice_ptp_tx_tstamps_pending() function which returns a boolean indicating if there are any pending Tx timestamps. First, check the software timestamp tracker bitmap. In ICE_PTP_TX_INTERRUPT_ALL mode, check *all* ports software trackers. If a tracker has outstanding timestamp requests, return true. Additionally, check the PHY ready bitmap to confirm if the PHY indicates any outstanding timestamps. In the ice_misc_thread_fn(), call ice_ptp_tx_tstamps_pending() just before returning from the IRQ thread handler. If it returns true, write to PFINT_OICR to trigger a PFINT_OICR_TSYN_TX_M software interrupt. This will force the handler to interrupt again and complete the work even if the PHY hardware did not interrupt for any reason. This results in the following new flow for handling Tx timestamps: 1) send Tx packet 2) PHY captures timestamp 3) PHY triggers MAC interrupt 4) clock owner executes ice_misc_intr() with PFINT_OICR_TSYN_TX flag set 5) ice_ptp_ts_irq() returns IRQ_WAKE_THREAD 7) The interrupt thread wakes up and kernel calls ice_misc_intr_thread_fn() 8) ice_ptp_process_ts() is called to handle any outstanding timestamps 9) ice_irq_dynamic_ena() is called to re-enable the OICR hardware interrupt cause 10) ice_ptp_tx_tstamps_pending() is called to check if we missed any more outstanding timestamps, checking both software and hardware indicators. With this change, it should no longer be possible for new timestamps to come in such a way that we lose an interrupt. If a timestamp comes in before the ice_ptp_tx_tstamps_pending() call, it will be noticed by at least one of the software bitmap check or the hardware bitmap check. If the timestamp comes in *after* this check, it should cause a timestamp interrupt as we have already read all timestamps from the PHY and the OICR vector has been re-enabled. Fixes: 7cab44f1c35f ("ice: Introduce ETH56G PHY model for E825C products") Signed-off-by: Jacob Keller <jacob.e.keller@intel.com> Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com> Reviewed-by: Przemyslaw Korba <przemyslaw.korba@intel.com> Tested-by: Vitaly Grinberg <vgrinber@redhat.com> Tested-by: Sunitha Mekala <sunithax.d.mekala@intel.com> (A Contingent worker at Intel) Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2026-01-30ice: fix missing TX timestamps interrupts on E825 devicesGrzegorz Nitka1-1/+4
Modify PTP (Precision Time Protocol) configuration on link down flow. Previously, PHY_REG_TX_OFFSET_READY register was cleared in such case. This register is used to determine if the timestamp is valid or not on the hardware side. However, there is a possibility that there is still the packet in the HW queue which originally was supposed to be timestamped but the link is already down and given register is cleared. This potentially might lead to the situation in which that 'delayed' packet's timestamp is treated as invalid one when the link is up again. This in turn leads to the situation in which the driver is not able to effectively clean timestamp memory and interrupt configuration. From the hardware perspective, that 'old' interrupt was not handled properly and even if new timestamp packets are processed, no new interrupts is generated. As a result, providing timestamps to the user applications (like ptp4l) is not possible. The solution for this problem is implemented at the driver level rather than the firmware, and maintains the tx_ready bit high, even during link down events. This avoids entering a potential inconsistent state between the driver and the timestamp hardware. Testing hints: - run PTP traffic at higher rate (like 16 PTP messages per second) - observe ptp4l behaviour at the client side in the following conditions: a) trigger link toggle events. It needs to be physiscal link down/up events b) link speed change In all above cases, PTP processing at ptp4l application should resume always. In failure case, the following permanent error message in ptp4l log was observed: controller-0 ptp4l: err [6175.116] ptp4l-legacy timed out while polling for tx timestamp Fixes: 7cab44f1c35f ("ice: Introduce ETH56G PHY model for E825C products") Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com> Signed-off-by: Grzegorz Nitka <grzegorz.nitka@intel.com> Tested-by: Sunitha Mekala <sunithax.d.mekala@intel.com> (A Contingent worker at Intel) Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2026-01-30f2fs: fix incomplete block usage in compact SSA summariesDaeho Jeong1-4/+4
In a previous commit, a bug was introduced where compact SSA summaries failed to utilize the entire block space in non-4KB block size configurations, leading to inefficient space management. This patch fixes the calculation logic to ensure that compact SSA summaries can fully occupy the block regardless of the block size. Reported-by: Chris Mason <clm@meta.com> Fixes: e48e16f3e37f ("f2fs: support non-4KB block size without packed_ssa feature") Signed-off-by: Daeho Jeong <daehojeong@google.com> Reviewed-by: Chao Yu <chao@kernel.org> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2026-01-30NFS: Merge CONFIG_NFS_V4_1 with CONFIG_NFS_V4Anna Schumaker25-368/+35
Compiling the NFSv4 module without any minorversion support doesn't make much sense, so this patch sets NFS v4.1 as the default, always enabled NFS version allowing us to replace all the CONFIG_NFS_V4_1s scattered throughout the code with CONFIG_NFS_V4. Signed-off-by: Anna Schumaker <anna.schumaker@oracle.com>
2026-01-30NFS: Add a way to disable NFS v4.0 via KConfigAnna Schumaker8-5/+47
I introduce NFS4_MIN_MINOR_VERSION as a parallel to NFS4_MAX_MINOR_VERSION to check if NFS v4.0 has been compiled in and return an appropriate error if not. Signed-off-by: Anna Schumaker <anna.schumaker@oracle.com>
2026-01-30NFS: Move sequence slot operations into minorversion operationsAnna Schumaker4-72/+70
At the same time, I move the NFS v4.0 functions into nfs40proc.c to keep v4.0 features together in their own files. Signed-off-by: Anna Schumaker <anna.schumaker@oracle.com>
2026-01-30NFS: Pass a struct nfs_client to nfs4_init_sequence()Anna Schumaker4-69/+87
No functional change in this patch. This just makes the next patch where I introduce "sequence slot operations" simpler. Signed-off-by: Anna Schumaker <anna.schumaker@oracle.com>
2026-01-30NFS: Move NFS v4.0 pathdown recovery into nfs40client.cAnna Schumaker3-22/+25
Signed-off-by: Anna Schumaker <anna.schumaker@oracle.com>
2026-01-30