aboutsummaryrefslogtreecommitdiff
path: root/arch/x86
AgeCommit message (Collapse)AuthorFilesLines
2 daysMerge tag 'x86-urgent-2026-05-31' of ↵Linus Torvalds10-29/+36
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 fixes from Ingo Molnar: - Make the clearcpuid= boot parameter less prominent and warn about its dangers & caveats (Borislav Petkov) - Do not access the (new) PLATFORM_ID MSR when running as a guest (Borislav Petkov) - x86 ftrace: Relocate %rip-relative percpu refs in dynamic trampolines, to fix crash when using such trampolines (Alexis Lothoré) - Fix x86-64 CFI build error (Peter Zijlstra) - Revert FPU signal return magic number check optimization, because it broke CRIU and gVisor in certain FPU configurations (Andrei Vagin) * tag 'x86-urgent-2026-05-31' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: Revert "x86/fpu: Refine and simplify the magic number check during signal return" x86/kvm/vmx: Fix x86_64 CFI build x86/ftrace: Relocate %rip-relative percpu refs in dynamic trampolines x86/microcode: Do not access MSR_IA32_PLATFORM_ID when running as a guest Documentation/arch/x86: Hide clearcpuid=
4 daysRevert "x86/fpu: Refine and simplify the magic number check during signal ↵Andrei Vagin1-3/+8
return" This reverts dc8aa31a7ac2 ("x86/fpu: Refine and simplify the magic number check during signal return"). The aforementioned commit broke applications that construct signal frames in userspace (such as CRIU and gVisor) if the frame's xstate size is smaller than the kernel's fpstate->user_size. Furthermore, this introduces a critical issue for checkpoint/restore tools like CRIU. If a process is checkpointed while inside a signal handler, its stack contains a signal frame formatted according to the source host's xstate capabilities. If that process is later restored on a destination host with larger xstate capabilities (e.g., a newer CPU with more features enabled, resulting in a larger fpstate->user_size), the kernel will look for FP_XSTATE_MAGIC2 at the destination host's larger user_size offset instead of the offset encoded in the frame's fx_sw->xstate_size. This causes the magic2 check to fail, forcing sigreturn to silently fall back to "FX-only" mode. Upon return from the signal handler, the process's extended state is reset to initial values instead of being restored, leading to silent data corruption. The aforementioned commit cited d877550eaf2d ("x86/fpu: Stop relying on userspace for info to fault in xsave buffer") as justification to stop relying on userspace for the magic number check. However, these two changes are fundamentally different. The last one only changed how much memory the kernel ensures is paged-in before running XRSTOR to prevent an infinite loop. It did not change the signal frame format or how the layout is validated. Reverting this change restores the use of fx_sw->xstate_size for locating magic2 and restores the necessary sanity checks, ensuring that the signal frame remains self-describing and portable. [ bp: Massage commit message. ] Fixes: dc8aa31a7ac2 ("x86/fpu: Refine and simplify the magic number check during signal return") Signed-off-by: Andrei Vagin <avagin@google.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Acked-by: Chang S. Bae <chang.seok.bae@intel.com> Cc: stable@vger.kernel.org Link: https://lore.kernel.org/all/20260429000623.3356606-1-avagin@google.com
4 daysMerge commit 'kvm-psc-for-7.1' into HEADPaolo Bonzini1-26/+65
4 daysKVM: SEV: Use READ_ONCE() when reading entries/indices from PSC bufferSean Christopherson1-6/+6
Use READ_ONCE() when reading entries/indices from the guest-accessible Page State Change buffer to defend against TOCTOU bugs. Don't bother with READ_ONCE()/WRITE_ONCE() for cases where KVM is writing (and not consuming the result!), as the guest isn't supposed to touch the buffer while it's being processed. I.e. using READ_ONCE() is all about protecting against misbehaving guests. Fixes: 9b54e248d264 ("KVM: SEV: Add support to handle Page State Change VMGEXIT") Cc: stable@vger.kernel.org Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Message-ID: <20260501202250.2115252-11-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
4 daysKVM: SEV: Check PSC request indices against the actual size of the bufferSean Christopherson1-2/+15
When processing Page State Change (PSC) requests, validate the PSC buffer against the effective size of the scratch area, which could be less than the maximum size if the guest provided a pointer that isn't exactly at the start of the GHCB shared buffer. Fixes: 9b54e248d264 ("KVM: SEV: Add support to handle Page State Change VMGEXIT") Cc: stable@vger.kernel.org Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com> Reviewed-by: Michael Roth <michael.roth@amd.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Message-ID: <20260501202250.2115252-10-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
4 daysKVM: SEV: Don't explicitly pass PSC buffer to snp_begin_psc()Sean Christopherson1-5/+6
Stop explicitly passing the PSC buffer to snp_begin_psc(): it *must* be the scratch area. This will allow fixing a variety of bugs without further complicating the code. No functional change intended. Cc: stable@vger.kernel.org Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com> Reviewed-by: Michael Roth <michael.roth@amd.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Message-ID: <20260501202250.2115252-9-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
4 daysKVM: SEV: WARN if KVM attempts to setup scratch area with min_len==0Sean Christopherson1-0/+3
Now that all paths in KVM properly validate the length needed for the scratch area, and are guaranteed to pass in a non-zero length, WARN if KVM attempts to configured the scratch area with min_len==0 to guard against future bugs. Cc: stable@vger.kernel.org Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com> Reviewed-by: Michael Roth <michael.roth@amd.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Message-ID: <20260501202250.2115252-8-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
4 daysKVM: SEV: Compute the correct max length of the in-GHCB scratch areaSean Christopherson1-9/+10
When setting the length of the GHCB scratch area, and the area is in the GHCB shared buffer, set the effective length of the scratch area to the max possible size given the start of the guest-provided pointer, and the end of the shared buffer. The code was "fine" when first introduced, as KVM doesn't consult the length of the buffer when emulating MMIO, because the passed in @len always specifies the *max* size required. But for PSC requests, the incoming @len is just the minimum length (to process the header), and KVM needs to know the full size of the scratch area to avoid buffer overflows (spoiler alert). Opportunistically rename @len => @min_len to better reflect its role. Fixes: 9b54e248d264 ("KVM: SEV: Add support to handle Page State Change VMGEXIT") Cc: stable@vger.kernel.org Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com> Reviewed-by: Michael Roth <michael.roth@amd.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Message-ID: <20260501202250.2115252-7-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
4 daysKVM: SEV: Use the size of the PSC header as the minimum size for PSC requestsSean Christopherson1-1/+1
When handling a Page State Change (PSC) #VMGEXIT use the size of the PSC header as the minimum size for the scratch area. Per the GHCB spec, PSC requests do NOT provide the length, i.e. using control->exit_info_2 for the length is completely made up behavior. The existing code "works", e.g. even though Linux-as-a-guest always passes '0', because KVM doesn't do anything with the length when the request is in the GHCB's shared buffer. Use the header as the min length. Once the header is retrieved, KVM can use the specified indices to compute the full size of the request. Fixes: 9b54e248d264 ("KVM: SEV: Add support to handle Page State Change VMGEXIT") Cc: stable@vger.kernel.org Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com> Reviewed-by: Michael Roth <michael.roth@amd.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Message-ID: <20260501202250.2115252-6-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
4 daysKVM: SEV: Ignore Port I/O requests of length '0'Sean Christopherson1-0/+8
Explicitly ignore Port I/O requests of length '0' (or count '0'), so that setting up the software scratch area (and other code) doesn't have to worry about underflowing the length, and to allow for WARNing on trying to configure the scratch area with len==0. Fixes: 291bd20d5d88 ("KVM: SVM: Add initial support for a VMGEXIT VMEXIT") Cc: stable@vger.kernel.org Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Message-ID: <20260501202250.2115252-5-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
4 daysKVM: SEV: Reject MMIO requests larger than 8 bytes with GHCB v2+Sean Christopherson1-0/+5
When using GHCB v2+, reject MMIO requests that are larger than 8 bytes. Per the GHCB spec: SW_EXITINFO2 must be less than or equal to 0x7fffffff for version 1 and less than or equal to 0x8 for all other versions. Fixes: 4af663c2f64a ("KVM: SEV: Allow per-guest configuration of GHCB protocol version") Cc: stable@vger.kernel.org Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Message-ID: <20260501202250.2115252-4-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
4 daysKVM: SEV: Ignore MMIO requests of length '0'Sean Christopherson1-3/+7
Explicitly ignore MMIO requests of length '0', so that setting up the software scratch area (and other code) doesn't have to worry about underflowing the length, and to allow for special casing '0' in the future. Fixes: 8f423a80d299 ("KVM: SVM: Support MMIO for an SEV-ES guest") Cc: stable@vger.kernel.org Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Message-ID: <20260501202250.2115252-3-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
4 daysKVM: SEV: Require in-GHCB scratch area if GHCB v2+ is in useMichael Roth1-0/+4
As per the GHCB spec, when using GHCB v2+ require the software scratch area to reside in the GHCB's shared buffer. Note, things like Page State Change (PSC) requests _rely_ on this behavior, as the guest can't provide a length when making the request, i.e. the size of the guest payload is bounded by the size of the shared buffer. Failure to force usage of the GHCB, and a slew of other flaws, lets a malicious SNP guest corrupt host kernel heap memory, and leak host heap layout information. setup_vmgexit_scratch() allocates a buffer via kvzalloc(exit_info_2), where exit_info_2 is guest-controlled. With exit_info_2=24, this yields a 24-byte allocation in kmalloc-cg-32 (32-byte slab objects). The buffer holds an 8-byte psc_hdr followed by 8-byte psc_entry structs, so only entries[0] and entries[1] are in-bounds. snp_begin_psc() validates end_entry against VMGEXIT_PSC_MAX_COUNT (253) but NOT against the actual buffer size: idx_end = hdr->end_entry; if (idx_end >= VMGEXIT_PSC_MAX_COUNT) { // checks 253, not buffer snp_complete_psc(svm, ...); return 1; } for (idx = idx_start; idx <= idx_end; idx++) { entry_start = entries[idx]; // OOB when idx >= 2 The guest sets end_entry=10+, causing the host to iterate entries[2+] which are OOB into adjacent slab objects. For each OOB entry: - The host reads 8 bytes (OOB READ / info leak oracle) - If the data passes PSC validation, __snp_complete_one_psc() writes cur_page = 1 or 512 into the entry (OOB WRITE, sev.c:3806) - If validation fails, the error response reveals whether adjacent memory is zero vs non-zero (information disclosure to guest) The guest controls allocation size (exit_info_2), entry range (cur_entry/end_entry), and can fire unlimited VMGEXITs to repeatedly hit different slab positions. By exploiting the variety of bugs, a malicious SEV-SNP guest can: - OOB read adjacent kmalloc-cg-32 objects (heap layout disclosure) - OOB write cur_page bits into adjacent objects (heap corruption) - Trigger use-after-free conditions across VMGEXITs E.g. with KASAN enabled, a single insmod of the PoC guest module produces 73 KASAN reports: BUG: KASAN: slab-out-of-bounds in snp_begin_psc+0x126/0x890 Read of size 8 at addr ffff888219ffb5e0 by task qemu-system-x86/2199 BUG: KASAN: slab-out-of-bounds in snp_begin_psc+0x468/0x890 Write of size 8 at addr ffff888351566648 by task qemu-system-x86/2199 The buggy address belongs to the object at ffff888XXXXXXXXX which belongs to the cache kmalloc-cg-32 of size 32 The buggy address is located N bytes to the right of allocated 32-byte region [ffff888XXXXXXXXX, ffff888XXXXXXXXX) Breakdown: 62 slab-out-of-bounds (reads + writes past allocation) 7 slab-use-after-free 4 use-after-free All credit to Stan for the wonderful description and reproducer! Reported-by: Stan Shaw <shawstan96@gmail.com> Cc: Michael Roth <michael.roth@amd.com> Cc: Tom Lendacky <thomas.lendacky@amd.com> Cc: Peter Gonda <pgonda@google.com> Cc: Jacky Li <jackyli@google.com> Fixes: 4af663c2f64a ("KVM: SEV: Allow per-guest configuration of GHCB protocol version") Cc: stable@vger.kernel.org Signed-off-by: Michael Roth <michael.roth@amd.com> [sean: write changelog] Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Message-ID: <20260501202250.2115252-2-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
4 daysMerge tag 'kvm-x86-fixes-7.1-rc6' of https://github.com/kvm-x86/linux into HEADPaolo Bonzini4-10/+54
KVM x86 fixes for 7.1-rcN - Include the kernel's linux/mman.h in KVM selftests to ensure MADV_COLLAPSE is defined, as older libc versions may not provide it. - Include execinfo.h if and only if KVM selftests are building against glibc, and provide a test_dump_stack() for non-glibc builds. - Fudge around an RCU splat in the emegerncy reboot code that is technically a legitimate flaw, but in practice is a non-issue and fixing the flaw, e.g. by adding locking, would incur meaningful risk, i.e. do more harm than good. - Rate-limit global clock updates once again (but without delayed work), as KVM was subtly relying on the old rate-limiting for NPT correction to guard against "update storms" when running without a master clock on systems with overcommitted CPUs. - Fix a brown paper bag goof where KVM checked if ERAPS is "dirty" instead of marking it dirty when emulating INVPCID. - Flush the TLB when transitioning from xAVIC => x2AVIC to ensure the CPU TLB doesn't contain AVIC-tagged entries for the APIC base GPA.
6 daysx86/kvm/vmx: Fix x86_64 CFI buildPeter Zijlstra3-11/+5
It was missed that idt_do_interrupt_irqoff() gets compiled on x84_64; this is a problem for CFI builds because it includes an unadorned indirect call. It is however completely dead code. Rework things to not emit this function at all. Fixes: 0701c9e17bd9 ("x86/kvm/vmx: Move IRQ/NMI dispatch from KVM into x86 core") Reported-by: Nathan Chancellor <nathan@kernel.org> Reported-by: Calvin Owens <calvin@wbinvd.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Nathan Chancellor <nathan@kernel.org> Link: https://patch.msgid.link/20260526090631.GA4149641@noisy.programming.kicks-ass.net
6 daysx86/ftrace: Relocate %rip-relative percpu refs in dynamic trampolinesAlexis Lothoré (eBPF Foundation)1-0/+7
With CONFIG_CALL_DEPTH_TRACKING enabled on an x86 retbleed-affected platform (eg: Skylake), with retbleed=stuff, registering a dynamic ftrace trampoline crashes on the first call into the traced function: BUG: unable to handle page fault for address: ffff88817ae18880 #PF: supervisor write access in kernel mode #PF: error_code(0x0002) - not-present page PGD 4b53067 P4D 4b53067 PUD 0 Oops: Oops: 0002 [#1] SMP PTI CPU: 3 UID: 0 PID: 187 Comm: usleep Not tainted 7.0.10 #243 PREEMPT(full) Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Arch Linux 1.17.0-2-2 04/01/2014 Code: 24 78 00 00 00 00 48 89 ea 48 89 54 24 20 48 8b b4 24 b8 00 00 00 48 8b bc 24 b0 00 00 00 48 89 bc 24 80 00 00 00 48 83 ef 05 <65> 48 c1 3d 1f a8 b6 02 05 48 8b 15 f6 00 00 00 4c 89 3c 24 4c 89 Call Trace: <TASK> ? find_held_lock ? exc_page_fault ? lock_release ? __x64_sys_clock_nanosleep ? lockdep_hardirqs_on_prepare ? trace_hardirqs_on __x64_sys_clock_nanosleep do_syscall_64 ? exc_page_fault ? call_depth_return_thunk entry_SYSCALL_64_after_hwframe ... Kernel panic - not syncing: Fatal exception This small reproducer allows to easily trigger the crash: # echo 'p __x64_sys_clock_nanosleep' > /sys/kernel/tracing/kprobe_events # echo 1 > /sys/kernel/tracing/events/kprobes/p___x64_sys_clock_nanosleep_0/enable # usleep 1 Monitoring the crash under GDB points to the exact instruction in charge of incrementing the call depth: sarq $5, %gs:__x86_call_depth(%rip) This instruction matches the one inserted by the ftrace_regs_caller from ftrace_64.S. This emitted code was likely working fine until the introduction of 59bec00ace28 ("x86/percpu: Introduce %rip-relative addressing to PER_CPU_VAR()"): it has made the call depth accounting addressing relative to $rip, instead of being based on an absolute address. As this code exact location depends on where the trampoline lives in memory, the corresponding displacement needs to be adjusted at runtime to actually correctly find the per-cpu __x86_call_depth value, otherwise the targeted address is wrong, leading to the page fault seen above. Fix the %rip-relative displacement of the copied CALL_DEPTH_ACCOUNT instruction (from ftrace_regs_caller) by calling text_poke_apply_relocation(), as it is done for example by the x86 BPF JIT compiler through x86_call_depth_emit_accounting(). This corrects both CALL_DEPTH_ACCOUNT slots, in ftrace_caller and ftrace_regs_caller. [ bp: Massage. ] Fixes: 59bec00ace28 ("x86/percpu: Introduce %rip-relative addressing to PER_CPU_VAR()") Signed-off-by: Alexis Lothoré (eBPF Foundation) <alexis.lothore@bootlin.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Steven Rostedt <rostedt@goodmis.org> Cc: <stable@kernel.org> Link: https://patch.msgid.link/20260527-fix_call_depth_in_trampoline-v1-1-1c1abc8ae310@bootlin.com
7 daysx86/microcode: Do not access MSR_IA32_PLATFORM_ID when running as a guestBorislav Petkov5-15/+16
Patch in Fixes: causes the usual: unchecked MSR access error: RDMSR from 0x17 at ... (intel_get_platform_id) Call Trace: early_init_intel early_cpu_init setup_arch _printk start_kernel x86_64_start_reservations x86_64_start_kernel common_startup_64 because the kernel is booted in a guest. In order to avoid it, this MSR access needs to be prevented when running virtualized. That is usually done by checking X86_FEATURE_HYPERVISOR but for this particular case it is too early yet. The platform ID needs to be read as early as when microcode is loaded on the BSP: load_ucode_bsp ... -> get_microcode_blob ... -> intel_find_matching_signature and by that time, CPUID leafs haven't been parsed yet. The microcode loader already has logic to check early whether the kernel is running virtualized so make that globally available to arch/x86/. The query whether running virtualized is getting more and more prominent in recent times so might as well make it an arch-global var which the rest of the code can use. Fixes: d8630b67ca1ed ("x86/cpu: Add platform ID to CPU info structure") Reported-by: Vishal Verma <vishal.l.verma@intel.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com> Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com> Tested-by: Binbin Wu <binbin.wu@linux.intel.com> Link: https://lore.kernel.org/all/20260430020953.1405535-1-binbin.wu@linux.intel.com
9 daysMerge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvmLinus Torvalds2-6/+8
Pull kvm fixes from Paolo Bonzini: "arm64: - Fix ITS EventID sanitisation when restoring an interrupt translation table. - Fix PPI memory leak when failing to initialise a vcpu. - Correctly return an error when the validation of a hypervisor trace descriptor fails, and limit this validation to protected mode only. RISC-V: - Fix invalid HVA warning in steal-time recording - Return SBI_ERR_FAILURE to guest upon OOM in pmu_event_info() and pmu_snapshot_set_shmem() - Fix NULL pointer dereference in SBI v0.1 SEND_IPI handler - Fix sign extension of value for MMIO loads s390: - Fix bugs in vSIE (nested virtualization) and UCONTROL, caused by the page table rewrite. x86: - Apply erratum #1235 workaround (disable AVIC IPI virtualization) on Hygon Family 18h, just like on AMD Family 17h. - When KVM_CAP_X86_APIC_BUS_CYCLES_NS is queried on a specific VM, return the VM's configured APIC bus frequency instead of the default. This is less confusing (read: not wrong) and makes it easier to fill in CPUID information that communicates the APIC bus frequency to the guest. Selftests: - Do not include glibc-internal <bits/endian.h>; it worked by chance and broke building KVM selftests with musl" * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: KVM: SVM: Disable AVIC IPI virtualization on Hygon Family 18h (erratum #1235) KVM: selftests: Verify that KVM returns the configured APIC cycle length KVM: x86: Return the VM's configured APIC bus frequency when queried KVM: selftests: elf: Include <endian.h> instead of <bits/endian.h> KVM: s390: Properly reset zero bit in PGSTE KVM: s390: vsie: Fix redundant rmap entries KVM: s390: vsie: Fix unshadowing logic KVM: s390: Fix leaking kvm_s390_mmu_cache in case of errors KVM: s390: vsie: Fix memory leak when unshadowing KVM: arm64: Fix nVHE/pKVM hyp tracing error on invalid desc KVM: arm64: vgic: Free private_irqs when init fails after allocation KVM: arm64: vgic-its: Reject restored DTE with out-of-range num_eventid_bits RISC-V: KVM: Fix sign extension for MMIO loads RISC-V: KVM: Fix NULL pointer dereference in SBI v0.1 SEND_IPI handler riscv: kvm: return SBI_ERR_FAILURE for pmu_event_info() when OOM riscv: kvm: return SBI_ERR_FAILURE for pmu_snapshot_set_shmem() when OOM RISC-V: KVM: Fix invalid HVA warning in steal-time recording
9 daysMerge tag 'x86-urgent-2026-05-24' of ↵Linus Torvalds14-69/+134
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 fixes from Ingo Molnar: - On SEV guests, handle set_memory_{encrypted,decrypted}() failures more conservatively by assuming that all affected pages are unencrypted (Carlos López) - Disable broadcast TLB flush when PCID is disabled (Tom Lendacky) - Fix VMX vs. hrtimer_rearm_deferred() regression (Peter Zijlstra) - Move IRQ/NMI dispatch code from KVM into x86 core, to prepare for a KVM x2apic fix (Peter Zijlstra) - Fix incorrect munmap() size on map_vdso() failure (Guilherme Giacomo Simoes) * tag 'x86-urgent-2026-05-24' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: virt: sev-guest: Explicitly leak pages in unknown state x86/mm: Disable broadcast TLB flush when PCID is disabled x86/kvm/vmx: Fix VMX vs hrtimer_rearm_deferred() x86/kvm/vmx: Move IRQ/NMI dispatch from KVM into x86 core x86/vdso: Fix incorrect size in munmap() on map_vdso() failure
11 daysKVM: SVM: Disable AVIC IPI virtualization on Hygon Family 18h (erratum #1235)Tina Zhang1-5/+7
Hygon Family 18h CPUs are derived from AMD Family 17h (Zen1) silicon and share the same erratum #1235: hardware may read a stale IsRunning=1 bit during ICR write emulation and silently fail to generate an AVIC_IPI_FAILURE_TARGET_NOT_RUNNING VM-Exit on the sending vCPU. The absence of the VM-Exit causes KVM to miss the required wakeup of blocking target vCPUs, leading to hung vCPUs and unbounded delays in guest execution. Extend the existing AMD Family 17h erratum #1235 workaround to also cover Hygon Family 18h. With IPI virtualization disabled, KVM never sets IsRunning=1 in the Physical ID table, so every non-self IPI generates a VM-Exit and is correctly emulated. Fixes: 8de4a1c8164e ("KVM: SVM: Disable (x2)AVIC IPI virtualization if CPU has erratum #1235") Cc: <stable@vger.kernel.org> Signed-off-by: Tina Zhang <zhang_wei@open-hieco.net> Message-ID: <20260522040014.3380201-1-zhang_wei@open-hieco.net>
11 daysKVM: x86: Return the VM's configured APIC bus frequency when queriedSean Christopherson1-1/+1
When KVM_CAP_X86_APIC_BUS_CYCLES_NS is queried on a specific VM, return the VM's configured APIC bus frequency, not KVM's default. Aside from the fact that returning the default frequency is blatantly wrong if userspace has changed the frequency, returning the configured frequency means userspace can blindly trust the result, e.g. when filling PV CPUID information that communicates the APIC bus frequency to the guest. Fixes: 6fef518594bc ("KVM: x86: Add a capability to configure bus frequency for APIC timer") Reported-by: David Woodhouse <dwmw2@infradead.org> Closes: https://lore.kernel.org/all/ab84153e33fbe7c25667f595c56b310d4d5a93ef.camel@infradead.org Signed-off-by: Sean Christopherson <seanjc@google.com> Message-ID: <20260522173526.3539407-2-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
12 daysKVM: SVM: Flush the current TLB when transitioning from xAVIC => x2AVICSean Christopherson1-6/+29
Flush the current TLB when xAVIC *or* x2AVIC is activated, as KVM is (apparently) responsible for purging TLB entries when transitioning from xAVIC to x2AVIC. The APM says a whole lot of nothing about TLB flushing with respect to (x2)AVIC, but empirical data strongly suggests hardware also does a whole lot of nothing. Failure to flush the TLB when enabling x2AVIC can lead to guest accesses to the APIC base address getting incorrectly redirected to the virtual APIC page. The flaw most visibly manifests as failures in KVM-Unit-Test's verify_disabled_apic_mmio() testcase when x2APIC is enabled (though for reasons unknown, the test only reliably fails with EFI builds). Fixes: 0ccf3e7cb95a ("KVM: SVM: Flush the "current" TLB when activating AVIC") Fixes: 4d1d7942e36a ("KVM: SVM: Introduce logic to (de)activate x2AVIC mode") Cc: stable@vger.kernel.org Cc: Naveen N Rao (AMD) <naveen@kernel.org> Link: https://patch.msgid.link/20260515171536.1841645-1-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
12 daysKVM: x86: Fix ERAPS RAP clear on INVPCID single-context invalidationEmily Ehlert1-1/+1
Use kvm_register_mark_dirty() instead of kvm_register_is_dirty() to actually mark VCPU_EXREG_ERAPS as dirty when emulating INVPCID_TYPE_SINGLE_CTXT. kvm_register_is_dirty() is a read-only predicate whose return value is discarded, making the call a no-op. Without this fix, a single-context INVPCID will not trigger a RAP clear on the next VMRUN, breaking the ERAPS security guarantee. Fixes: db5e82496492 ("KVM: SVM: Virtualize and advertise support for ERAPS") Signed-off-by: Emily Ehlert <ehemily@amazon.de> Link: https://patch.msgid.link/20260518135956.82569-1-ehemily@amazon.de Signed-off-by: Sean Christopherson <seanjc@google.com>
13 daysring-buffer: Flush and stop persistent ring buffer on panicMasami Hiramatsu (Google)1-0/+1
On real hardware, panic and machine reboot may not flush hardware cache to memory. This means the persistent ring buffer, which relies on a coherent state of memory, may not have its events written to the buffer and they may be lost. Moreover, there may be inconsistency with the counters which are used for validation of the integrity of the persistent ring buffer which may cause all data to be discarded. To avoid this issue, stop recording of the ring buffer on panic and flush the cache of the ring buffer's memory. Fixes: e645535a954a ("tracing: Add option to use memmapped memory for trace boot instance") Cc: stable@vger.kernel.org Cc: Will Deacon <will@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Ian Rogers <irogers@google.com> Link: https://patch.msgid.link/177751969602.2136606.12031934362587643488.stgit@mhiramat.tok.corp.google.com Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Acked-by: Catalin Marinas <catalin.marinas@arm.com> Acked-by: Geert Uytterhoeven <geert@linux-m68k.org> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
13 daysx86/mm: Disable broadcast TLB flush when PCID is disabledTom Lendacky1-0/+1
Booting with "nopcid" clears X86_FEATURE_PCID and keeps CR4.PCIDE from being set to one. On AMD CPUs that support INVLPGB, broadcast TLB flushing remains enabled. There are two checks that decide whether the global ASID code runs, mm_global_asid() and consider_global_asid(), that key off of the X86_FEATURE_INVLPGB feature. Once an mm becomes active on more than three CPUs, consider_global_asid() assigns it a global ASID, after which flush_tlb_mm_range() takes the broadcast_tlb_flush() path using a non-zero PCID. Issuing an INVLPGB with a non-zero PCID while CR4.PCIDE is not set results in a #GP: Oops: general protection fault, kernel NULL pointer dereference 0x1: 0000 [#1] SMP NOPTI CPU: 158 UID: 0 PID: 3119 Comm: snap Not tainted 7.1.0-rc3 #1 PREEMPT(full) Hardware name: ... RIP: 0010:broadcast_tlb_flush Code: ... 89 da 48 83 c8 07 <0f> 01 fe eb 08 cc cc cc ... Call Trace: <TASK> flush_tlb_mm_range ptep_clear_flush wp_page_copy ? _raw_spin_unlock __handle_mm_fault handle_mm_fault do_user_addr_fault exc_page_fault asm_exc_page_fault All processors that support broadcast TLB invalidation also have PCID support, so it is only the "nopcid" scenario that is of concern. In this situation just disable the broadcast TLB support using the CPUID dependency support by making X86_FEATURE_INVLPGB dependent on X86_FEATURE_PCID. [ bp: Massage commit message. ] Fixes: 4afeb0ed1753 ("x86/mm: Enable broadcast TLB invalidation for multi-threaded processes") Suggested-by: Dave Hansen <dave.hansen@intel.com> Assisted-by: Claude:claude-opus-4.7 Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Acked-by: Rik van Riel <riel@surriel.com> Cc: <stable@kernel.org> Link: https://patch.msgid.link/b915acfd63e8b2a094fdeb8dc608738072518764.1779296450.git.thomas.lendacky@amd.com
2026-05-19x86/kvm/vmx: Fix VMX vs hrtimer_rearm_deferred()Peter Zijlstra1-0/+13
Vishal reported that KVM unit test 'x2apic' started failing after commit 0e98eb14814e ("entry: Prepare for deferred hrtimer rearming"). The reason is that KVM/VMX is injecting interrupts while it has interrupts disabled, for a context that will enable interrupts, this means that regs->flags.X86_EFLAGS_IF == 0 and irqentry_exit() will not do the right thing. Notably, irqentry_exit() must not call hrtimer_rearm_deferred() when the return context does not have IF set, because this will cause problems vs NMIs. Therefore, fix up the state after the injection. Fixes: 0e98eb14814e ("entry: Prepare for deferred hrtimer rearming") Reported-by: "Verma, Vishal L" <vishal.l.verma@intel.com> Suggested-by: Thomas Gleixner <tglx@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Thomas Gleixner <tglx@kernel.org> Tested-by: "Verma, Vishal L" <vishal.l.verma@intel.com> Tested-by: David Woodhouse <dwmw@amazon.co.uk> Tested-by: Zhao Liu <zhao1.liu@intel.com> Tested-by: Sean Christopherson <seanjc@google.com> Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com> Link: https://patch.msgid.link/20260423155936.957351833@infradead.org Closes: https://lore.kernel.org/r/70cd3e97fbb796e2eb2ff8cd4b7614ada05a5f24.camel%40intel.com
2026-05-19x86/kvm/vmx: Move IRQ/NMI dispatch from KVM into x86 corePeter Zijlstra12-68/+119
Move the VMX interrupt dispatch magic into the x86 core code. This isolates KVM from the FRED/IDT decisions and reduces the amount of EXPORT_SYMBOL_FOR_KVM(). Suggested-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Thomas Gleixner <tglx@kernel.org> Tested-by: "Verma, Vishal L" <vishal.l.verma@intel.com> Tested-by: Zhao Liu <zhao1.liu@intel.com> Tested-by: Zhao Liu <zhao1.liu@intel.com> Tested-by: Sean Christopherson <seanjc@google.com> Reviewed-by: Binbin Wu <binbin.wu@linxu.intel.com> Acked-by: Sean Christopherson <seanjc@google.com> Link: https://patch.msgid.link/20260508091829.GO3126523@noisy.programming.kicks-ass.net
2026-05-19x86/vdso: Fix incorrect size in munmap() on map_vdso() failureGuilherme Giacomo Simoes1-1/+1
In map_vdso(), if a failure occurs during the installation of the VVAR mappings, the error path attempts to clean up previously allocated mappings using do_munmap(). However, the cleanup for the VVAR mapping is incorrectly using image->size (the size of the vDSO text) instead of the actual size allocated for the VVAR area. Replace the incorrect do_munmap() image->size parameter with the constant VDSO_NR_PAGES * PAGE_SIZE. Ensure the unmap size exactly matches the size used during the vdso_install_vvar_mapping() phase to provide a symmetrical and complete teardown of the memory region. Fixes: e93d2521b27f ("x86/vdso: Split virtual clock pages into dedicated mapping") Signed-off-by: Guilherme Giacomo Simoes <trintaeoitogc@gmail.com> Signed-off-by: Thomas Gleixner <tglx@kernel.org> Reviewed-by: Thomas Weißschuh <thomas.weissschuh@linutronix.de> Link: https://patch.msgid.link/20260503191609.551817-1-trintaeoitogc@gmail.com
2026-05-17Merge tag 'x86-urgent-2026-05-17' of ↵Linus Torvalds1-0/+8
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 fix from Ingo Molnar: - Fix x86 boot crash for non-kjump kexecs (David Woodhouse) * tag 'x86-urgent-2026-05-17' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/kexec: Push kjump return address even for non-kjump kexec
2026-05-17Merge tag 'ras-urgent-2026-05-17' of ↵Linus Torvalds1-28/+5
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull MCE fix from Ingo Molnar: - Fix an MCE polling interval adjustment regression (Borislav Petkov) * tag 'ras-urgent-2026-05-17' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/mce: Restore MCA polling interval halving
2026-05-15Merge tag 'for-linus-7.1b-rc4-tag' of ↵Linus Torvalds2-3/+7
git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip Pull xen fixes from Juergen Gross: - one simple cleanup - a fix for a corner case when running as Xen PV dom0 - a fix of a regression for Xen PV guests, introduced in 7.0 * tag 'for-linus-7.1b-rc4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip: x86/xen: Tolerate nested XEN_LAZY_MMU entering/leaving x86/xen: Fix xen_e820_swap_entry_with_ram() xen/arm: Replace __ASSEMBLY__ with __ASSEMBLER__ in interface.h
2026-05-14Merge tag 'acpi-7.1-rc4' of ↵Linus Torvalds1-3/+3
git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm Pull ACPI support fixes from Rafael Wysocki: "These fix several platform drivers that use the ACPI companion of the given platform device without checking its presence, which may lead to a NULL pointer dereference or other kind of malfunction if the driver is forced to match a device without an ACPI companion via driver override, and restore debug log level for some messages in the ACPI CPPC library: - Check ACPI_COMPANION() against NULL during probe in several core ACPI device drivers (Rafael Wysocki) - Restore log level of messages in amd_set_max_freq_ratio() (Mario Limonciello)" * tag 'acpi-7.1-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: ACPI: PAD: xen: Check ACPI_COMPANION() against NULL ACPI: driver: Check ACPI_COMPANION() against NULL during probe Revert "ACPI: CPPC: Adjust debug messages in amd_set_max_freq_ratio() to warn"
2026-05-14Merge branch 'acpi-cppc'Rafael J. Wysocki1-3/+3
Merge a revert of an ACPI CPPC commit that increased the log level of some debug messages which turned out to be a bad idea: - Restore log level of messages in amd_set_max_freq_ratio() (Mario Limonciello) * acpi-cppc: Revert "ACPI: CPPC: Adjust debug messages in amd_set_max_freq_ratio() to warn"
2026-05-14x86/xen: Tolerate nested XEN_LAZY_MMU entering/leavingJuergen Gross1-2/+6
With the support of nested lazy mmu sections it can happen that arch_enter_lazy_mmu_mode() is being called twice without a call of arch_leave_lazy_mmu_mode() in between, as the lazy_mmu_*() helpers are not disabling preemption when checking for nested lazy mmu sections. This is a problem when running as a Xen PV guest, as xen_enter_lazy_mmu() and xen_leave_lazy_mmu() don't tolerate this case. Fix that in xen_enter_lazy_mmu() and xen_leave_lazy_mmu() in order not to hurt all other lazy mmu mode users. Fixes: 291b3abed657 ("x86/xen: use lazy_mmu_state when context-switching") Tested-by: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com> Signed-off-by: Juergen Gross <jgross@suse.com> Message-ID: <20260508143933.493013-1-jgross@suse.com>
2026-05-14x86/xen: Fix xen_e820_swap_entry_with_ram()Juergen Gross1-1/+1
When swapping a not page-aligned E820 map entry with RAM, the start address of the modified entry is calculated wrong (the offset into the page is subtracted instead of being added to the page address). Fixes: be35d91c8880 ("xen: tolerate ACPI NVS memory overlapping with Xen allocated memory") Reported-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Signed-off-by: Juergen Gross <jgross@suse.com> Message-ID: <20260505102417.208138-1-jgross@suse.com>
2026-05-13Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvmLinus Torvalds7-37/+66
Pull kvm fixes from Paolo Bonzini: "arm64: - Add the pKVM side of the workaround for ARM's erratum 4193714, provided that the EL3 firmware does its part of the job. KVM will refuse to initialise otherwise - Correctly handle 52bit VAs for guest EL2 stage-1 translations when running under NV with E2H==0 - Correctly deal with permission faults in guest_memfd memslots - Fix the steal-time selftest after the infrastructure was reworked - Make sure the host cannot pass a non-sensical clock update to the EL2 tracing infrastructure - Appoint Steffen Eiden as a reviewer in anticipation of the KVM/s390 ability to run arm64 guests, which will inevitably lead to arm64 code being directly used on s390 - Make sure that EL2 is configured with both exception entry and exit being Context Synchronization Events - Handle the current vcpu being NULL on EL2 panic - Fix the selftest_vcpu memcache being empty at the point of donation or sharing - Check that the memcache has enough capacity before engaging on the share/donate path - Fix __deactivate_fgt() to use its parameter rather than a variable in the macro context s390: - Fix array overrun with large amounts of PCI devices x86: - Never use L0's PAUSE loop exiting while L2 is running, since it's unlikely that a nested guest will help solving the hypervisor's spinlock contention - Fix emulation of MOVNTDQA - Fix typo in Xen hypercall tracepoint - Add back an optimization that was left behind when recently fixing a bug - Add module parameter to disable CET, whose implementation seems to have issues. For now it remains enabled by default Generic: - Reject offset causing an unsigned overflow in kvm_reset_dirty_gfn() Documentation: - Update stale links Selftests: - Fix guest_memfd_test with host page size > guest page size" * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (22 commits) KVM: VMX: introduce module parameter to disable CET KVM: x86: Swap the dst and src operand for MOVNTDQA KVM: x86: use again the flush argument of __link_shadow_page() KVM: selftests: Ensure gmem file sizes are multiple of host page size Documentation: kvm: update links in the references section of AMD Memory Encryption KVM: nSVM: Never use L0's PAUSE loop exiting while L2 is running KVM: x86: Fix Xen hypercall tracepoint argument assignment KVM: Reject wrapped offset in kvm_reset_dirty_gfn() KVM: arm64: Pre-check vcpu memcache for host->guest donate KVM: arm64: Pre-check vcpu memcache for host->guest share KVM: arm64: Seed pkvm_ownership_selftest vcpu memcache KVM: arm64: Fix __deactivate_fgt macro parameter typo KVM: arm64: Guard against NULL vcpu on VHE hyp panic path KVM: arm64: Make EL2 exception entry and exit context-synchronization events MAINTAINERS: Add Steffen as reviewer for KVM/arm64 KVM: arm64: Remove potential UB on nvhe tracing clock update KVM: selftests: arm64: Fix steal_time test after UAPI refactoring KVM: arm64: Handle permission faults with guest_memfd KVM: arm64: nv: Consider the DS bit when translating TCR_EL2 KVM: arm64: Work around C1-Pro erratum 4193714 for protected guests ...
2026-05-13KVM: x86: Rate-limit global clock updates on vCPU loadLei Chen2-2/+10
commit 446fcce2a52b ("Revert "x86: kvm: rate-limit global clock updates"") dropped the rate limiting for KVM_REQ_GLOBAL_CLOCK_UPDATE. As a result, kvm_arch_vcpu_load() can queue global clock update requests every time a vCPU is scheduled when the master clock is disabled or when the vCPU is loaded for the first time. Restore the throttling with a per-VM ratelimit state and gate KVM_REQ_GLOBAL_CLOCK_UPDATE through __ratelimit(), so frequent vCPU scheduling does not generate a steady stream of redundant clock update requests. Fixes: 446fcce2a52b ("Revert "x86: kvm: rate-limit global clock updates"") Signed-off-by: Lei Chen <lei.chen@smartx.com> Reported-by: Jaroslav Pulchart <jaroslav.pulchart@gooddata.com> Closes: https://lore.kernel.org/all/CAK8fFZ5gY8_Mw2A=iZVFNVKQNrXQzVsn-HTd+Me9K6ZfmdgA+Q@mail.gmail.com/ Link: https://patch.msgid.link/20260409142226.2581-1-lei.chen@smartx.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2026-05-13x86/virt: Silence RCU lockdep splat in emergency virt callback pathMikhail Gavrilov1-1/+14
x86_virt_invoke_kvm_emergency_callback() reaches rcu_dereference() through machine_crash_shutdown() with IRQs disabled but with RCU not necessarily watching the crashing CPU, which triggers a suspicious RCU usage splat on debug kernels (CONFIG_PROVE_RCU=y) during panic/kdump: WARNING: suspicious RCU usage arch/x86/virt/hw.c:52 suspicious rcu_dereference_check() usage! rcu_scheduler_active = 2, debug_locks = 1 1 lock held by tee/11119: #0: ffff8881fa32c440 (sb_writers#3){.+.+}-{0:0}, at: ksys_write Call Trace: <TASK> dump_stack_lvl+0x84/0xd0 lockdep_rcu_suspicious.cold+0x37/0x8f x86_virt_invoke_kvm_emergency_callback+0x5f/0x70 x86_svm_emergency_disable_virtualization_cpu+0x2a/0x30 x86_virt_emergency_disable_virtualization_cpu+0x6b/0x90 native_machine_crash_shutdown+0x72/0x170 __crash_kexec+0x137/0x280 panic+0xce/0xd0 sysrq_handle_crash+0x1f/0x20 __handle_sysrq.cold+0x192/0x335 write_sysrq_trigger+0x8c/0xc0 proc_reg_write+0x1c3/0x3c0 vfs_write+0x1d0/0xf80 ksys_write+0x116/0x250 do_syscall_64+0x11c/0x1480 entry_SYSCALL_64_after_hwframe+0x76/0x7e </TASK> A truly correct fix is non-trivial: the RCU usage genuinely is wrong in panic context (RCU may ignore the crashing CPU during synchronization), and a concurrent KVM module unload could in principle race with the callback read; see commit 2baa33a8ddd6 ("KVM: x86: Leave user-return notifier registered on reboot/shutdown") which notes that nothing prevents module unload during panic/reboot. However, the alternatives are worse: - smp_store_release()/smp_load_acquire() handles ordering but not liveness; the kernel still needs to keep the module text alive while the callback is in flight. - Taking a lock in the panic path is risky — any lock could be held by a CPU that has already been NMI'd to a halt. Use rcu_dereference_raw() to silence the splat and accept the vanishingly small remaining race. Panic context inherently cannot guarantee complete correctness; the goal here is to keep debug builds quiet on the kdump path so the splat doesn't obscure the actual kernel state being captured. Reproducible on a debug kernel (CONFIG_PROVE_LOCKING=y, CONFIG_PROVE_RCU=y) with kvm_amd or kvm_intel loaded by triggering kdump: echo c > /proc/sysrq-trigger Suggested-by: Sean Christopherson <seanjc@google.com> Fixes: 428afac5a8ea ("KVM: x86: Move bulk of emergency virtualizaton logic to virt subsystem") Signed-off-by: Mikhail Gavrilov <mikhail.v.gavrilov@gmail.com> Acked-by: Sean Christopherson <seanjc@google.com> Link: https://patch.msgid.link/20260504235435.90957-1-mikhail.v.gavrilov@gmail.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2026-05-13x86/mce: Restore MCA polling interval halvingBorislav Petkov (AMD)1-28/+5
RongQing reported that the MCA polling interval doesn't halve when an error gets logged. It was traced down to the commit in Fixes:, because: mce_timer_fn() |-> mce_poll_banks() |-> machine_check_poll() |-> mce_log() which will queue the work and return. Now, back in mce_timer_fn(): /* * Alert userspace if needed. If we