aboutsummaryrefslogtreecommitdiff
path: root/arch/x86/virt
AgeCommit message (Collapse)AuthorFilesLines
2026-05-13x86/virt: Silence RCU lockdep splat in emergency virt callback pathMikhail Gavrilov1-1/+14
x86_virt_invoke_kvm_emergency_callback() reaches rcu_dereference() through machine_crash_shutdown() with IRQs disabled but with RCU not necessarily watching the crashing CPU, which triggers a suspicious RCU usage splat on debug kernels (CONFIG_PROVE_RCU=y) during panic/kdump: WARNING: suspicious RCU usage arch/x86/virt/hw.c:52 suspicious rcu_dereference_check() usage! rcu_scheduler_active = 2, debug_locks = 1 1 lock held by tee/11119: #0: ffff8881fa32c440 (sb_writers#3){.+.+}-{0:0}, at: ksys_write Call Trace: <TASK> dump_stack_lvl+0x84/0xd0 lockdep_rcu_suspicious.cold+0x37/0x8f x86_virt_invoke_kvm_emergency_callback+0x5f/0x70 x86_svm_emergency_disable_virtualization_cpu+0x2a/0x30 x86_virt_emergency_disable_virtualization_cpu+0x6b/0x90 native_machine_crash_shutdown+0x72/0x170 __crash_kexec+0x137/0x280 panic+0xce/0xd0 sysrq_handle_crash+0x1f/0x20 __handle_sysrq.cold+0x192/0x335 write_sysrq_trigger+0x8c/0xc0 proc_reg_write+0x1c3/0x3c0 vfs_write+0x1d0/0xf80 ksys_write+0x116/0x250 do_syscall_64+0x11c/0x1480 entry_SYSCALL_64_after_hwframe+0x76/0x7e </TASK> A truly correct fix is non-trivial: the RCU usage genuinely is wrong in panic context (RCU may ignore the crashing CPU during synchronization), and a concurrent KVM module unload could in principle race with the callback read; see commit 2baa33a8ddd6 ("KVM: x86: Leave user-return notifier registered on reboot/shutdown") which notes that nothing prevents module unload during panic/reboot. However, the alternatives are worse: - smp_store_release()/smp_load_acquire() handles ordering but not liveness; the kernel still needs to keep the module text alive while the callback is in flight. - Taking a lock in the panic path is risky — any lock could be held by a CPU that has already been NMI'd to a halt. Use rcu_dereference_raw() to silence the splat and accept the vanishingly small remaining race. Panic context inherently cannot guarantee complete correctness; the goal here is to keep debug builds quiet on the kdump path so the splat doesn't obscure the actual kernel state being captured. Reproducible on a debug kernel (CONFIG_PROVE_LOCKING=y, CONFIG_PROVE_RCU=y) with kvm_amd or kvm_intel loaded by triggering kdump: echo c > /proc/sysrq-trigger Suggested-by: Sean Christopherson <seanjc@google.com> Fixes: 428afac5a8ea ("KVM: x86: Move bulk of emergency virtualizaton logic to virt subsystem") Signed-off-by: Mikhail Gavrilov <mikhail.v.gavrilov@gmail.com> Acked-by: Sean Christopherson <seanjc@google.com> Link: https://patch.msgid.link/20260504235435.90957-1-mikhail.v.gavrilov@gmail.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2026-04-17Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvmLinus Torvalds5-148/+554
Pull kvm updates from Paolo Bonzini: "Arm: - Add support for tracing in the standalone EL2 hypervisor code, which should help both debugging and performance analysis. This uses the new infrastructure for 'remote' trace buffers that can be exposed by non-kernel entities such as firmware, and which came through the tracing tree - Add support for GICv5 Per Processor Interrupts (PPIs), as the starting point for supporting the new GIC architecture in KVM - Finally add support for pKVM protected guests, where pages are unmapped from the host as they are faulted into the guest and can be shared back from the guest using pKVM hypercalls. Protected guests are created using a new machine type identifier. As the elusive guestmem has not yet delivered on its promises, anonymous memory is also supported This is only a first step towards full isolation from the host; for example, the CPU register state and DMA accesses are not yet isolated. Because this does not really yet bring fully what it promises, it is hidden behind CONFIG_ARM_PKVM_GUEST + 'kvm-arm.mode=protected', and also triggers TAINT_USER when a VM is created. Caveat emptor - Rework the dreaded user_mem_abort() function to make it more maintainable, reducing the amount of state being exposed to the various helpers and rendering a substantial amount of state immutable - Expand the Stage-2 page table dumper to support NV shadow page tables on a per-VM basis - Tidy up the pKVM PSCI proxy code to be slightly less hard to follow - Fix both SPE and TRBE in non-VHE configurations so that they do not generate spurious, out of context table walks that ultimately lead to very bad HW lockups - A small set of patches fixing the Stage-2 MMU freeing in error cases - Tighten-up accepted SMC immediate value to be only #0 for host SMCCC calls - The usual cleanups and other selftest churn LoongArch: - Use CSR_CRMD_PLV for kvm_arch_vcpu_in_kernel() - Add DMSINTC irqchip in kernel support RISC-V: - Fix steal time shared memory alignment checks - Fix vector context allocation leak - Fix array out-of-bounds in pmu_ctr_read() and pmu_fw_ctr_read_hi() - Fix double-free of sdata in kvm_pmu_clear_snapshot_area() - Fix integer overflow in kvm_pmu_validate_counter_mask() - Fix shift-out-of-bounds in make_xfence_request() - Fix lost write protection on huge pages during dirty logging - Split huge pages during fault handling for dirty logging - Skip CSR restore if VCPU is reloaded on the same core - Implement kvm_arch_has_default_irqchip() for KVM selftests - Factored-out ISA checks into separate sources - Added hideleg to struct kvm_vcpu_config - Factored-out VCPU config into separate sources - Support configuration of per-VM HGATP mode from KVM user space s390: - Support for ESA (31-bit) guests inside nested hypervisors - Remove restriction on memslot alignment, which is not needed anymore with the new gmap code - Fix LPSW/E to update the bear (which of course is the breaking event address register) x86: - Shut up various UBSAN warnings on reading module parameter before they were initialized - Don't zero-allocate page tables that are used for splitting hugepages in the TDP MMU, as KVM is guaranteed to set all SPTEs in the page table and thus write all bytes - As an optimization, bail early when trying to unsync 4KiB mappings if the target gfn can just be mapped with a 2MiB hugepage x86 generic: - Copy single-chunk MMIO write values into struct kvm_vcpu (more precisely struct kvm_mmio_fragment) to fix use-after-free stack bugs where KVM would dereference stack pointer after an exit to userspace - Clean up and comment the emulated MMIO code to try to make it easier to maintain (not necessarily "easy", but "easier") - Move VMXON+VMXOFF and EFER.SVME toggling out of KVM (not *all* of VMX and SVM enabling) as it is needed for trusted I/O - Advertise support for AVX512 Bit Matrix Multiply (BMM) instructions - Immediately fail the build if a required #define is missing in one of KVM's headers that is included multiple times - Reject SET_GUEST_DEBUG with -EBUSY if there's an already injected exception, mostly to prevent syzkaller from abusing the uAPI to trigger WARNs, but also because it can help prevent userspace from unintentionally crashing the VM - Exempt SMM from CPUID faulting on Intel, as per the spec - Misc hardening and cleanup changes x86 (AMD): - Fix and optimize IRQ window inhibit handling for AVIC; make it per-vCPU so that KVM doesn't prematurely re-enable AVIC if multiple vCPUs have to-be-injected IRQs - Clean up and optimize the OSVW handling, avoiding a bug in which KVM would overwrite state when enabling virtualization on multiple CPUs in parallel. This should not be a problem because OSVW should usually be the same for all CPUs - Drop a WARN in KVM_MEMORY_ENCRYPT_REG_REGION where KVM complains about a "too large" size based purely on user input - Clean up and harden the pinning code for KVM_MEMORY_ENCRYPT_REG_REGION - Disallow synchronizing a VMSA of an already-launched/encrypted vCPU, as doing so for an SNP guest will crash the host due to an RMP violation page fault - Overhaul KVM's APIs for detecting SEV+ guests so that VM-scoped queries are required to hold kvm->lock, and enforce it by lockdep. Fix various bugs where sev_guest() was not ensured to be stable for the whole duration of a function or ioctl - Convert a pile of kvm->lock SEV code to guard() - Play nicer with userspace that does not enable KVM_CAP_EXCEPTION_PAYLOAD, for which KVM needs to set CR2 and DR6 as a response to ioctls such as KVM_GET_VCPU_EVENTS (even if the payload would end up in EXITINFO2 rather than CR2, for example). Only set CR2 and DR6 when consumption of the payload is imminent, but on the other hand force delivery of the payload in all paths where userspace retrieves CR2 or DR6 - Use vcpu->arch.cr2 when updating vmcb12's CR2 on nested #VMEXIT instead of vmcb02->save.cr2. The value is out of sync after a save/restore or after a #PF is injected into L2 - Fix a class of nSVM bugs where some fields written by the CPU are not synchronized from vmcb02 to cached vmcb12 after VMRUN, and so are not up-to-date when saved by KVM_GET_NESTED_STATE - Fix a class of bugs where the ordering between KVM_SET_NESTED_STATE and KVM_SET_{S}REGS could cause vmcb02 to be incorrectly initialized after save+restore - Add a variety of missing nSVM consistency checks - Fix several bugs where KVM failed to correctly update VMCB fields on nested #VMEXIT - Fix several bugs where KVM failed to correctly synthesize #UD or #GP for SVM-related instructions - Add support for save+restore of virtualized LBRs (on SVM) - Refactor various helpers and macros to improve clarity and (hopefully) make the code easier to maintain - Aggressively sanitize fields when copying from vmcb12, to guard against unintentionally allowing L1 to utilize yet-to-be-defined features - Fix several bugs where KVM botched rAX legality checks when emulating SVM instructions. There are remaining issues in that KVM doesn't handle size prefix overrides for 64-bit guests - Fail emulation of VMRUN/VMLOAD/VMSAVE if mapping vmcb12 fails instead of somewhat arbitrarily synthesizing #GP (i.e. don't double down on AMD's architectural but sketchy behavior of generating #GP for "unsupported" addresses) - Cache all used vmcb12 fields to further harden against TOCTOU bugs x86 (Intel): - Drop obsolete branch hint prefixes from the VMX instruction macros - Use ASM_INPUT_RM() in __vmcs_writel() to coerce clang into using a register input when appropriate - Code cleanups guest_memfd: - Don't mark guest_memfd folios as accessed, as guest_memfd doesn't support reclaim, the memory is unevictable, and there is no storage to write back to LoongArch selftests: - Add KVM PMU test cases s390 selftests: - Enable more memory selftests x86 selftests: - Add support for Hygon CPUs in KVM selftests - Fix a bug in the MSR test where it would get false failures on AMD/Hygon CPUs with exactly one of RDPID or RDTSCP - Add an MADV_COLLAPSE testcase for guest_memfd as a regression test for a bug where the kernel would attempt to collapse guest_memfd folios against KVM's will" * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (373 commits) KVM: x86: use inlines instead of macros for is_sev_*guest x86/virt: Treat SVM as unsupported when running as an SEV+ guest KVM: SEV: Goto an existing error label if charging misc_cg for an ASID fails KVM: SVM: Move lock-protected allocation of SEV ASID into a separate helper KVM: SEV: use mutex guard in snp_handle_guest_req() KVM: SEV: use mutex guard in sev_mem_enc_unregister_region() KVM: SEV: use mutex guard in sev_mem_enc_ioctl() KVM: SEV: use mutex guard in snp_launch_update() KVM: SEV: Assert that kvm->lock is held when querying SEV+ support KVM: SEV: Document that checking for SEV+ guests when reclaiming memory is "safe" KVM: SEV: Hide "struct kvm_sev_info" behind CONFIG_KVM_AMD_SEV=y KVM: SEV: WARN on unhandled VM type when initializing VM KVM: LoongArch: selftests: Add PMU overflow interrupt test KVM: LoongArch: selftests: Add basic PMU event counting test KVM: LoongArch: selftests: Add cpucfg read/write helpers LoongArch: KVM: Add DMSINTC inject msi to vCPU LoongArch: KVM: Add DMSINTC device support LoongArch: KVM: Make vcpu_is_preempted() as a macro rather than function LoongArch: KVM: Move host CSR_GSTAT save and restore in context switch LoongArch: KVM: Move host CSR_EENTRY save and restore in context switch ...
2026-04-14Merge tag 'x86_sev_for_v7.1_rc1' of ↵Linus Torvalds1-74/+89
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 SEV updates from Borislav Petkov: - Change the SEV host code handling of when SNP gets enabled in order to allow the machine to claim SNP-related resources only when SNP guests are really going to be launched. The user requests this by loading the ccp module and thus it controls when SNP initialization is done So export an API which module code can call and do the necessary SNP setup only when really needed - Drop an unnecessary write-back and invalidate operation that was being performed too early, since the ccp driver already issues its own at the correct point in the initialization sequence - Drop the hotplug callbacks for enabling SNP on newly onlined CPUs, which were both architecturally unsound (the firmware rejects initialization if any CPU lacks the required configuration) and buggy (the MFDM SYSCFG MSR bit was not being set) - Code refactoring and cleanups to accomplish the above * tag 'x86_sev_for_v7.1_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: crypto/ccp: Update HV_FIXED page states to allow freeing of memory crypto/ccp: Implement SNP x86 shutdown x86/sev, crypto/ccp: Move HSAVE_PA setup to arch/x86/ x86/sev, crypto/ccp: Move SNP init to ccp driver x86/sev: Create snp_shutdown() x86/sev: Create snp_prepare() x86/sev: Create a function to clear/zero the RMP x86/sev: Rename SNP_FEATURES_PRESENT to SNP_FEATURES_IMPL x86/virt/sev: Keep the RMP table bookkeeping area mapped x86/virt/sev: Drop WBINVD before setting MSR_AMD64_SYSCFG_SNP_EN x86/virt/sev: Drop support for SNP hotplug
2026-04-09x86/virt: Treat SVM as unsupported when running as an SEV+ guestSean Christopherson1-1/+2
When running as an SEV+ guest, treat SVM as unsupported even if CPUID (and other reporting, e.g. MSRs) enumerate support for SVM, as KVM doesn't support nested virtualization within an SEV VM (KVM would need to explicitly share all VMCBs and other assets with the untrusted host), let alone running nested VMs within SEV-ES+ guests (e.g. emulating VMLOAD, VMSAVE, and VMRUN all require access to guest register state). And outside of KVM, there is no in-tree user of SVM enabling. Arguably, the hypervisor/VMM (e.g. QEMU) should clear SVM from guest CPUID for SEV VMs, especially for SEV-ES+, but super duper technically, it's feasible to run nested VMs in SEV+ guests (with many caveats). More importantly, Linux-as-a-guest has played nice with SVM being advertised to SEV+ guests for a long time. Treating SVM as unsupported fixes a regression where a clean shutdown of an SEV-ES+ guest degrades into an abrupt termination. Due to a gnarly virtualization hole in SEV-ES (the architecture), where EFER must NOT be intercepted by the hypervisor (because the untrusted hypervisor can't set e.g. EFER.LME on behalf o the guest), the _host's_ EFER.SVME is visible to the guest. Because EFER.SVME must be always '1' while in guest mode, Linux-the-guest sees EFER.SVME=1 even when _its_ EFER.SVME is '0', thinks it has enabled virtualization, and ultimately can cause x86_svm_emergency_disable_virtualization_cpu() to execute STGI to ensure GIF is enabled. Executing STGI _should_ be fine, except Linux is a also wee bit paranoid when running as an SEV-ES guest. Because L0 sees EFER.SVME=0 for the guest, a well-behaved L0 hypervisor will intercept STGI (to inject #UD), and thus generate a #VC on the STGI. Which, again, should be fine. Unfortunately, vc_check_opcode_bytes() fails to account for STGI and other SVM instructions, throws a fatal error, and triggers a termination request. In a perfect world, the #VC handler would be more forgiving of unknown intercepts, especially when the #VC happened on an instruction with exception fixup. For now, just fix the immediate regression. Fixes: 428afac5a8ea ("KVM: x86: Move bulk of emergency virtualizaton logic to virt subsystem") Reported-by: Srikanth Aithal <sraithal@amd.com> Closes: https://lore.kernel.org/all/c820e242-9f3a-4210-b414-19d11b022404@amd.com Link: https://patch.msgid.link/20260409191341.1932853-1-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2026-03-29x86/sev, crypto/ccp: Move HSAVE_PA setup to arch/x86/Tycho Andersen (AMD)1-0/+8
Now that there is snp_prepare() that indicates when the CCP driver wants to prepare the architecture for SNP_INIT(_EX), move this architecture-specific bit of code to a more sensible place. Signed-off-by: Tycho Andersen (AMD) <tycho@kernel.org> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com> Link: https://patch.msgid.link/20260324161301.1353976-6-tycho@kernel.org
2026-03-29x86/sev, crypto/ccp: Move SNP init to ccp driverTycho Andersen (AMD)1-2/+0
Use the new snp_prepare() to initialize SNP from the ccp driver instead of at boot time. This means that SNP is not enabled unless it is really going to be used (i.e. kvm_amd loads the ccp driver automatically). Signed-off-by: Tycho Andersen (AMD) <tycho@kernel.org> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com> Link: https://patch.msgid.link/20260324161301.1353976-5-tycho@kernel.org
2026-03-29x86/sev: Create snp_shutdown()Tycho Andersen (AMD)1-3/+19
After SNP_SHUTDOWN, two things should be done: 1. clear the RMP table 2. disable MFDM to prevent the FW_WARN in k8_check_syscfg_dram_mod_en() in the event of a kexec Create and export to the CCP driver a function that does them. Also change the MFDM helper to allow for disabling the bit, since the SNP x86 shutdown path needs to disable MFDM. The comment for k8_check_syscfg_dram_mod_en() notes, the "BIOS" is supposed clear it, or the kernel in the case of module unload and shutdown followed by kexec. Signed-off-by: Tycho Andersen (AMD) <tycho@kernel.org> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com> Link: https://patch.msgid.link/20260324161301.1353976-4-tycho@kernel.org
2026-03-28x86/sev: Create snp_prepare()Tycho Andersen (AMD)1-19/+30
In preparation for delayed SNP initialization, create a function snp_prepare() that does the necessary architecture setup. Export this function for the ccp module to allow it to do the setup as necessary. Introduce a cpu_read_lock/unlock() wrapper around the MFDM and SNP enable. While CPU hotplug is not supported, this makes sure that the bit setting happens on the same set of CPUs in both cases. This improvement was suggested by Sashiko: https://sashiko.dev/#/patchset/20260324161301.1353976-1-tycho%40kernel.org Also move {mfd,snp}_enable() out of the __init section, since these will be called later. Signed-off-by: Tycho Andersen (AMD) <tycho@kernel.org> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com> Link: https://patch.msgid.link/20260326161110.1764303-3-tycho@kernel.org
2026-03-28x86/sev: Create a function to clear/zero the RMPTom Lendacky1-14/+27
In preparation for delayed SNP initialization and disablement on shutdown, create a function, clear_rmp(), that clears the RMP bookkeeping area and the RMP entries. Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com> Signed-off-by: Tycho Andersen (AMD) <tycho@kernel.org> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Link: https://patch.msgid.link/20260324161301.1353976-2-tycho@kernel.org
2026-03-09x86/virt/sev: Keep the RMP table bookkeeping area mappedTom Lendacky1-23/+17
In preparation for delayed SNP initialization and disablement on shutdown, the RMP will need to be cleared each time SNP is disabled. Maintain the mapping to the RMP bookkeeping area to avoid mapping and unmapping it each time and any possible errors that may arise from that. Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com> Signed-off-by: Tycho Andersen (AMD) <tycho@kernel.org> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Link: https://patch.msgid.link/20260309180053.2389118-4-tycho@kernel.org
2026-03-09x86/virt/sev: Drop WBINVD before setting MSR_AMD64_SYSCFG_SNP_ENTycho Andersen (AMD)1-3/+0
WBINVD is required before SNP_INIT(_EX), but not before setting MSR_AMD64_SYSCFG_SNP_EN, since the ccp driver already does its own WBINVD before SNP_INIT (and this one would be too early for that anyway...). Signed-off-by: Tycho Andersen (AMD) <tycho@kernel.org> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com> Link: https://patch.msgid.link/20260309180053.2389118-3-tycho@kernel.org
2026-03-09x86/virt/sev: Drop support for SNP hotplugTycho Andersen (AMD)1-27/+5
During an SNP_INIT(_EX), the SEV firmware checks that all CPUs have the SNP syscfg bit set, and fails if they do not. As such, it does not make sense to have offline CPUs: the firmware will fail initialization because of the offlined ones that the kernel did not initialize. Further, there is a bug: during SNP_INIT(_EX) the firmware requires the MFDM syscfg bit to be set in addition to having SNP enabled, which the previous hotplug code did not do. Since k8_check_syscfg_dram_mod_en() enforces this be cleared, hotplug wouldn't work. Signed-off-by: Tycho Andersen (AMD) <tycho@kernel.org> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com> Link: https://patch.msgid.link/20260309180053.2389118-2-tycho@kernel.org
2026-03-04x86/virt/tdx: Use ida_is_empty() to detect if any TDs may be runningSean Christopherson1-13/+4
Drop nr_configured_hkid and instead use ida_is_empty() to detect if any HKIDs have been allocated/configured. Suggested-by: Dan Williams <dan.j.williams@intel.com> Reviewed-by: Dan Williams <dan.j.williams@intel.com> Reviewed-by: Chao Gao <chao.gao@intel.com> Tested-by: Chao Gao <chao.gao@intel.com> Tested-by: Sagi Shahar <sagis@google.com> Link: https://patch.msgid.link/20260214012702.2368778-15-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2026-03-04x86/virt/tdx: KVM: Consolidate TDX CPU hotplug handlingChao Gao1-3/+46
The core kernel registers a CPU hotplug callback to do VMX and TDX init and deinit while KVM registers a separate CPU offline callback to block offlining the last online CPU in a socket. Splitting TDX-related CPU hotplug handling across two components is odd and adds unnecessary complexity. Consolidate TDX-related CPU hotplug handling by integrating KVM's tdx_offline_cpu() to the one in the core kernel. Also move nr_configured_hkid to the core kernel because tdx_offline_cpu() references it. Since HKID allocation and free are handled in the core kernel, it's more natural to track used HKIDs there. Reviewed-by: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Chao Gao <chao.gao@intel.com> Tested-by: Chao Gao <chao.gao@intel.com> Tested-by: Sagi Shahar <sagis@google.com> Link: https://patch.msgid.link/20260214012702.2368778-14-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2026-03-04x86/virt/tdx: Tag a pile of functions as __init, and globals as __ro_after_initSean Christopherson2-63/+66
Now that TDX-Module initialization is done during subsys init, tag all related functions as __init, and relevant data as __ro_after_init. Reviewed-by: Dan Williams <dan.j.williams@intel.com> Reviewed-by: Chao Gao <chao.gao@intel.com> Tested-by: Chao Gao <chao.gao@intel.com> Tested-by: Sagi Shahar <sagis@google.com> Link: https://patch.msgid.link/20260214012702.2368778-13-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2026-03-04KVM: x86/tdx: Do VMXON and TDX-Module initialization during subsys initSean Christopherson2-78/+92
Now that VMXON can be done without bouncing through KVM, do TDX-Module initialization during subsys init (specifically before module_init() so that it runs before KVM when both are built-in). Aside from the obvious benefits of separating core TDX code from KVM, this will allow tagging a pile of TDX functions and globals as being __init and __ro_after_init. Reviewed-by: Dan Williams <dan.j.williams@intel.com> Reviewed-by: Chao Gao <chao.gao@intel.com> Acked-by: Dave Hansen <dave.hansen@linux.intel.com> Tested-by: Chao Gao <chao.gao@intel.com> Tested-by: Sagi Shahar <sagis@google.com> Link: https://patch.msgid.link/20260214012702.2368778-12-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2026-03-04x86/virt/tdx: Drop the outdated requirement that TDX be enabled in IRQ contextSean Christopherson1-8/+1
Remove TDX's outdated requirement that per-CPU enabling be done via IPI function call, which was a stale artifact leftover from early versions of the TDX enablement series. The requirement that IRQs be disabled should have been dropped as part of the revamped series that relied on a the KVM rework to enable VMX at module load. In other words, the kernel's "requirement" was never a requirement at all, but instead a reflection of how KVM enabled VMX (via IPI callback) when the TDX subsystem code was merged. Note, accessing per-CPU information is safe even without disabling IRQs, as tdx_online_cpu() is invoked via a cpuhp callback, i.e. from a per-CPU thread. Link: https://lore.kernel.org/all/ZyJOiPQnBz31qLZ7@google.com Tested-by: Chao Gao <chao.gao@intel.com> Reviewed-by: Dan Williams <dan.j.williams@intel.com> Tested-by: Sagi Shahar <sagis@google.com> Link: https://patch.msgid.link/20260214012702.2368778-11-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2026-03-04x86/virt: Add refcounting of VMX/SVM usage to support multiple in-kernel usersSean Christopherson1-17/+47
Implement a per-CPU refcounting scheme so that "users" of hardware virtualization, e.g. KVM and the future TDX code, can co-exist without pulling the rug out from under each other. E.g. if KVM were to disable VMX on module unload or when the last KVM VM was destroyed, SEAMCALLs from the TDX subsystem would #UD and panic the kernel. Disable preemption in the get/put APIs to ensure virtualization is fully enabled/disabled before returning to the caller. E.g. if the task were preempted after a 0=>1 transition, the new task would see a 1=>2 and thus return without enabling virtualization. Explicitly disable preemption instead of requiring the caller to do so, because the need to disable preemption is an artifact of the implementation. E.g. from KVM's perspective there is no _need_ to disable preemption as KVM guarantees the pCPU on which it is running is stable (but preemption is enabled). Opportunistically abstract away SVM vs. VMX in the public APIs by using X86_FEATURE_{SVM,VMX} to communicate what technology the caller wants to enable and use. Cc: Xu Yilun <yilun.xu@linux.intel.com> Reviewed-by: Chao Gao <chao.gao@intel.com> Tested-by: Chao Gao <chao.gao@intel.com> Reviewed-by: Dan Williams <dan.j.williams@intel.com> Tested-by: Sagi Shahar <sagis@google.com> Link: https://patch.msgid.link/20260214012702.2368778-10-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2026-03-04KVM: x86: Move bulk of emergency virtualizaton logic to virt subsystemSean Christopherson1-8/+115
Move the majority of the code related to disabling hardware virtualization in emergency from KVM into the virt subsystem so that virt can take full ownership of the state of SVM/VMX. This will allow refcounting usage of SVM/VMX so that KVM and the TDX subsystem can enable VMX without stomping on each other. To route the emergency callback to the "right" vendor code, add to avoid mixing vendor and generic code, implement a x86_virt_ops structure to track the emergency callback, along with the SVM vs. VMX (vs. "none") feature that is active. To avoid having to choose between SVM and VMX, simply refuse to enable either if both are somehow supported. No known CPU supports both SVM and VMX, and it's comically unlikely such a CPU will ever exist. Leave KVM's clearing of loaded VMCSes and MSR_VM_HSAVE_PA in KVM, via a callback explicitly scoped to KVM. Loading VMCSes and saving/restoring host state are firmly tied to running VMs, and thus are (a) KVM's responsibility and (b) operations that are still exclusively reserved for KVM (as far as in-tree code is concerned). I.e. the contract being established is that non-KVM subsystems can utilize virtualization, but for all intents and purposes cannot act as full-blown hypervisors. Reviewed-by: Chao Gao <chao.gao@intel.com> Tested-by: Chao Gao <chao.gao@intel.com> Reviewed-by: Dan Williams <dan.j.williams@intel.com> Tested-by: Sagi Shahar <sagis@google.com> Link: https://patch.msgid.link/20260214012702.2368778-9-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2026-03-04KVM: SVM: Move core EFER.SVME enablement to kernelSean Christopherson1-0/+53
Move the innermost EFER.SVME logic out of KVM and into to core x86 to land the SVM support alongside VMX support. This will allow providing a more unified API from the kernel to KVM, and will allow moving the bulk of the emergency disabling insanity out of KVM without having a weird split between kernel and KVM for SVM vs. VMX. No functional change intended. Tested-by: Chao Gao <chao.gao@intel.com> Reviewed-by: Dan Williams <dan.j.williams@intel.com> Tested-by: Sagi Shahar <sagis@google.com> Link: https://patch.msgid.link/20260214012702.2368778-8-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2026-03-04KVM: VMX: Move core VMXON enablement to kernelSean Christopherson1-2/+83
Move the innermost VMXON+VMXOFF logic out of KVM and into to core x86 so that TDX can (eventually) force VMXON without having to rely on KVM being loaded, e.g. to do SEAMCALLs during initialization. Opportunistically update the comment regarding emergency disabling via NMI to clarify that virt_rebooting will be set by _another_ emergency callback, i.e. that virt_rebooting doesn't need to be set before VMCLEAR, only before _this_ invocation does VMXOFF. Acked-by: Dave Hansen <dave.hansen@linux.intel.com> Tested-by: Chao Gao <chao.gao@intel.com> Reviewed-by: Dan Williams <dan.j.williams@intel.com> Tested-by: Sagi Shahar <sagis@google.com> Link: https://patch.msgid.link/20260214012702.2368778-7-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2026-03-04x86/virt: Force-clear X86_FEATURE_VMX if configuring root VMCS failsSean Christopherson1-2/+12
If allocating and configuring a root VMCS fails, clear X86_FEATURE_VMX in all CPUs so that KVM doesn't need to manually check root_vmcs. As added bonuses, clearing VMX will reflect that VMX is unusable in /proc/cpuinfo, and will avoid a futile auto-probe of kvm-intel.ko. WARN if allocating a root VMCS page fails, e.g. to help users figure out why VMX is broken in the unlikely scenario something goes sideways during boot (and because the allocation should succeed unless there's a kernel bug). Tweak KVM's error message to suggest checking kernel logs if VMX is unsupported (in addition to checking BIOS). Tested-by: Chao Gao <chao.gao@intel.com> Reviewed-by: Dan Williams <dan.j.williams@intel.com> Tested-by: Sagi Shahar <sagis@google.com> Link: https://patch.msgid.link/20260214012702.2368778-6-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2026-03-04KVM: VMX: Unconditionally allocate root VMCSes during boot CPU bringupSean Christopherson1-0/+71
Allocate the root VMCS (misleading called "vmxarea" and "kvm_area" in KVM) for each possible CPU during early boot CPU bringup, before early TDX initialization, so that TDX can eventually do VMXON on-demand (to make SEAMCALLs) without needing to load kvm-intel.ko. Allocate the pages early on, e.g. instead of trying to do so on-demand, to avoid having to juggle allocation failures at runtime. Opportunistically rename the per-CPU pointers to better reflect the role of the VMCS. Use Intel's "root VMCS" terminology, e.g. from various VMCS patents[1][2] and older SDMs, not the more opaque "VMXON region" used in recent versions of the SDM. While it's possible the VMCS passed to VMXON no longer serves as _the_ root VMCS on modern CPUs, it is still in effect a "root mode VMCS", as described in the patents. Link: https://patentimages.storage.googleapis.com/c7/e4/32/d7a7def5580667/WO2013101191A1.pdf [1] Link: https://patentimages.storage.googleapis.com/13/f6/8d/1361fab8c33373/US20080163205A1.pdf [2] Tested-by: Chao Gao <chao.gao@intel.com> Reviewed-by: Dan Williams <dan.j.williams@intel.com> Tested-by: Sagi Shahar <sagis@google.com> Link: https://patch.msgid.link/20260214012702.2368778-5-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2026-03-04KVM: x86: Move "kvm_rebooting" to kernel as "virt_rebooting"Sean Christopherson2-0/+9
Move "kvm_rebooting" to the kernel, exported for KVM, as one of many steps towards extracting the innermost VMXON and EFER.SVME management logic out of KVM and into to core x86. For lack of a better name, call the new file "hw.c", to yield "virt hardware" when combined with its parent directory. No functional change intended. Tested-by: Chao Gao <chao.gao@intel.com> Reviewed-by: Dan Williams <dan.j.williams@intel.com> Tested-by: Sagi Shahar <sagis@google.com> Link: https://patch.msgid.link/20260214012702.2368778-4-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2026-02-25x86/virt/tdx: Print TDX module version during initVishal Verma1-0/+6
It is useful to print the TDX module version in dmesg logs. This is currently the only way to determine the module version from the host. It also creates a record for any future problems being investigated. This was also requested in [1]. Include the version in the log messages during init, e.g.: virt/tdx: TDX module version: 1.5.24 virt/tdx: 1034220 KB allocated for PAMT virt/tdx: module initialized Print the version in get_tdx_sys_info(), right after the version metadata is read, which makes it available even if there are subsequent initialization failures. Based on a patch by Kai Huang <kai.huang@intel.com> [2] Signed-off-by: Vishal Verma <vishal.l.verma@intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Chao Gao <chao.gao@intel.com> Reviewed-by: Tony Lindgren <tony.lindgren@linux.intel.com> Reviewed-by: Kiryl Shutsemau <kas@kernel.org> Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com> Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com> Reviewed-by: Rick Edgecombe <rick.p.edgecombe@intel.com> Reviewed-by: Kai Huang <kai.huang@intel.com> Link: https://lore.kernel.org/all/CAGtprH8eXwi-TcH2+-Fo5YdbEwGmgLBh9ggcDvd6N=bsKEJ_WQ@mail.gmail.com/ # [1] Link: https://lore.kernel.org/all/6b5553756f56a8e3222bfc36d0bdb3e5192137b7.1731318868.git.kai.huang@intel.com # [2] Link: https://patch.msgid.link/20260109-tdx_print_module_version-v2-2-e10e4ca5b450@intel.com
2026-02-25x86/virt/tdx: Retrieve TDX module versionChao Gao1-0/+16
Each TDX module has several bits of metadata about which specific TDX module it is. The primary bit of info is the version, which has an x.y.z format. These represent the major version, minor version, and update version respectively. Knowing the running TDX Module version is valuable for bug reporting and debugging. Note that the module does expose other pieces of version-related metadata, such as build number and date. Those aren't retrieved for now, that can be added if needed in the future. Retrieve the TDX Module version using the existing metadata reading interface. Later changes will expose this information. The metadata reading interfaces have existed for quite some time, so this will work with older versions of the TDX module as well - i.e. this isn't a new interface. As a side note, the global metadata reading code was originally set up to be auto-generated from a JSON definition [1]. However, later [2] this was found to be unsustainable, and the autogeneration approach was dropped in favor of just manually adding fields as needed (e.g. as in this patch). Signed-off-by: Chao Gao <chao.gao@intel.com> Signed-off-by: Vishal Verma <vishal.l.verma@intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Kiryl Shutsemau <kas@kernel.org> Reviewed-by: Rick Edgecombe <rick.p.edgecombe@intel.com> Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com> Reviewed-by: Tony Lindgren <tony.lindgren@linux.intel.com> Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com> Reviewed-by: Kai Huang <kai.huang@intel.com> Link: https://lore.kernel.org/kvm/CABgObfYXUxqQV_FoxKjC8U3t5DnyM45nz5DpTxYZv2x_uFK_Kw@mail.gmail.com/ # [1] Link: https://lore.kernel.org/all/1e7bcbad-eb26-44b7-97ca-88ab53467212@intel.com/ # [2] Link: https://patch.msgid.link/20260109-tdx_print_module_version-v2-1-e10e4ca5b450@intel.com
2026-02-21Convert 'alloc_obj' family to use the new default GFP_KERNEL argumentLinus Torvalds2-2/+2
This was done entirely with mindless brute force, using git grep -l '\<k[vmz]*alloc_objs*(.*, GFP_KERNEL)' | xargs sed -i 's/\(alloc_objs*(.*\), GFP_KERNEL)/\1)/' to convert the new alloc_obj() users that had a simple GFP_KERNEL argument to just drop that argument. Note that due to the extreme simplicity of the scripting, any slightly more complex cases spread over multiple lines would not be triggered: they definitely exist, but this covers the vast bulk of the cases, and the resulting diff is also then easier to check automatically. For the same reason the 'flex' versions will be done as a separate conversion. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2026-02-21treewide: Replace kmalloc with kmalloc_obj for non-scalar typesKees Cook2-2/+2
This is the result of running the Coccinelle script from scripts/coccinelle/api/kmalloc_objs.cocci. The script is designed to avoid scalar types (which need careful case-by-case checking), and instead replace kmalloc-family calls that allocate struct or union object instances: Single allocations: kmalloc(sizeof(TYPE), ...) are replaced with: kmalloc_obj(TYPE, ...) Array allocations: kmalloc_array(COUNT, sizeof(TYPE), ...) are replaced with: kmalloc_objs(TYPE, COUNT, ...) Flex array allocations: kmalloc(struct_size(PTR, FAM, COUNT), ...) are replaced with: kmalloc_flex(*PTR, FAM, COUNT, ...) (where TYPE may also be *VAR) The resulting allocations no longer return "void *", instead returning "TYPE *". Signed-off-by: Kees Cook <kees@kernel.org>
2025-11-12x86: Restrict KVM-induced symbol exports to KVM modules where obvious/possibleSean Christopherson1-34/+35
Extend KVM's export macro framework to provide EXPORT_SYMBOL_FOR_KVM(), and use the helper macro to export symbols for KVM throughout x86 if and only if KVM will build one or more modules, and only for those modules. To avoid unnecessary exports when CONFIG_KVM=m but kvm.ko will not be built (because no vendor modules are selected), let arch code #define EXPORT_SYMBOL_FOR_KVM to suppress/override the exports. Note, the set of symbols to restrict to KVM was generated by manual search and audit; any "misses" are due to human error, not some grand plan. Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Acked-by: Kai Huang <kai.huang@intel.com> Tested-by: Kai Huang <kai.huang@intel.com> Link: https://patch.msgid.link/20251112173944.1380633-5-seanjc%40google.com
2025-10-04Merge tag 'x86_tdx_for_6.18-rc1' of ↵Linus Torvalds1-32/+48
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 TDX updates from Dave Hansen: "The biggest change here is making TDX and kexec play nicely together. Before this, the memory encryption hardware (which doesn't respect cache coherency) could write back old cachelines on top of data in the new kernel, so kexec and TDX were made mutually exclusive. This removes the limitation. There is also some work to tighten up a hardware bug workaround and some MAINTAINERS updates. - Make TDX and kexec work together - Skip TDX bug workaround when the bug is not present - Update maintainers entries" * tag 'x86_tdx_for_6.18-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/virt/tdx: Use precalculated TDVPR page physical address KVM/TDX: Explicitly do WBINVD when no more TDX SEAMCALLs x86/virt/tdx: Update the kexec section in the TDX documentation x86/virt/tdx: Remove the !KEXEC_CORE dependency x86/kexec: Disable kexec/kdump on platforms with TDX partial write erratum x86/virt/tdx: Mark memory cache state incoherent when making SEAMCALL x86/sme: Use percpu boolean to control WBINVD during kexec x86/kexec: Consolidate relocate_kernel() function parameters x86/tdx: Skip clearing reclaimed pages unless X86_BUG_TDX_PW_MCE is present x86/tdx: Tidy reset_pamt functions x86/tdx: Eliminate duplicate code in tdx_clear_page() MAINTAINERS: Add KVM mail list to the TDX entry MAINTAINERS: Add Rick Edgecombe as a TDX reviewer MAINTAINERS: Update the file list in the TDX entry.
2025-09-17x86/sev: Add new dump_rmp parameter to snp_leak_pages() APIAshish Kalra1-3/+4
When leaking certain page types, such as Hypervisor Fixed (HV_FIXED) pages, it does not make sense to dump RMP contents for the 2MB range of the page(s) being leaked. In the case of HV_FIXED pages, this is not an error situation where the surrounding 2MB page RMP entries can provide debug information. Add new __snp_leak_pages() API with dump_rmp bool parameter to support continue adding pages to the snp_leaked_pages_list but not issue dump_rmpentry(). Make snp_leak_pages() a wrapper for the common case which also allows existing users to continue to dump RMP entries. Suggested-by: Thomas Lendacky <Thomas.Lendacky@amd.com> Suggested-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Ashish Kalra <ashish.kalra@amd.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com> Acked-by: Borislav Petkov (AMD) <bp@alien8.de> Link: https://lore.kernel.org/cover.1758057691.git.ashish.kalra@amd.com
2025-09-11x86/virt/tdx: Use precalculated TDVPR page physical addressKai Huang1-13/+8
All of the x86 KVM guest types (VMX, SEV and TDX) do some special context tracking when entering guests. This means that the actual guest entry sequence must be noinstr. Part of entering a TDX guest is passing a physical address to the TDX module. Right now, that physical address is stored as a 'struct page' and converted to a physical address at guest entry. That page=>phys conversion can be complicated, can vary greatly based on kernel config, and it is definitely _not_ a noinstr path today. There have been a number of tinkering approaches to try and fix this up, but they all fall down due to some part of the page=>phys conversion infrastructure not being noinstr friendly. Precalculate the page=>phys conversion and store it in the existing 'tdx_vp' structure. Use the new field at every site that needs a tdvpr physical address. Remove the now redundant tdx_tdvpr_pa(). Remove the __flatten remnant from the tinkering. Note that only one user of the new field is actually noinstr. All others can use page_to_phys(). But, they might as well save the effort since there is a pre-calculated value sitting there for them. [ dhansen: rewrite all the text ] Signed-off-by: Kai Huang <kai.huang@intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Kiryl Shutsemau <kas@kernel.org> Tested-by: Farrah Chen <farrah.chen@intel.com>
2025-09-05KVM/TDX: Explicitly do WBINVD when no more TDX SEAMCALLsKai Huang1-0/+19
On TDX platforms, during kexec, the kernel needs to make sure there are no dirty cachelines of TDX private memory before booting to the new kernel to avoid silent memory corruption to the new kernel. To do this, the kernel has a percpu boolean to indicate whether the cache of a CPU may be in incoherent state. During kexec, namely in stop_this_cpu(), the kernel does WBINVD if that percpu boolean is true. TDX turns on that percpu boolean on a CPU when the kernel does SEAMCALL, Thus making sure the cache will be flushed during kexec. However, kexec has a race condition that, while remaining extremely rare, would be more likely in the presence of a relatively long operation such as WBINVD. In particular, the kexec-ing CPU invokes native_stop_other_cpus() to stop all remote CPUs before booting to the new kernel. native_stop_other_cpus() then sends a REBOOT vector IPI to remote CPUs and waits for them to stop; if that times out, it also sends NMIs to the still-alive CPUs and waits again for them to stop. If the race happens, kexec proceeds before all CPUs have processed the NMI and stopped[1], and the system hangs. But after tdx_disable_virtualization_cpu(), no more TDX activity can happen on this cpu. When kexec is enabled, flush the cache explicitly at that point; this moves the WBINVD to an earlier stage than stop_this_cpus(), avoiding a possibly lengthy operation at a time where it could cause this race. [1] https://lore.kernel.org/kvm/b963fcd60abe26c7ec5dc20b42f1a2ebbcc72397.1750934177.git.kai.huang@intel.com/ [Make the new function a stub for !CONFIG_KEXEC_CORE. - Paolo] Signed-off-by: Kai Huang <kai.huang@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Acked-by: Paolo Bonzini <pbonzini@redhat.com> Tested-by: Farrah Chen <farrah.chen@intel.com> Link: https://lore.kernel.org/all/20250901160930.1785244-8-pbonzini%40redhat.com
2025-09-05x86/virt/tdx: Mark memory cache state incoherent when making SEAMCALLKai Huang1-2/+2
On TDX platforms, dirty cacheline aliases with and without encryption bits can coexist, and the cpu can flush them back to memory in random order. During kexec, the caches must be flushed before jumping to the new kernel otherwise the dirty cachelines could silently corrupt the memory used by the new kernel due to different encryption property. A percpu boolean is used to mark whether the cache of a given CPU may be in an incoherent state, and the kexec performs WBINVD on the CPUs with that boolean turned on. For TDX, only the TDX module or the TDX guests can generate dirty cachelines of TDX private memory, i.e., they are only generated when the kernel does a SEAMCALL. Set that boolean when the kernel does SEAMCALL so that kexec can flush the cache correctly. The kernel provides both the __seamcall*() assembly functions and the seamcall*() wrapper ones which additionally handle running out of entropy error in a loop. Most of the SEAMCALLs are called using the seamcall*(), except TDH.VP.ENTER and TDH.PHYMEM.PAGE.RDMD which are called using __seamcall*() variant directly. To cover the two special cases, add a new __seamcall_dirty_cache() helper w