aboutsummaryrefslogtreecommitdiff
path: root/arch/arm64/include/asm/kvm_host.h
AgeCommit message (Collapse)AuthorFilesLines
2026-05-29KVM: arm64: Correctly cap ZCR_EL2 provided by a guest hypervisorMark Brown1-1/+1
ZCR_EL2 can be updated by a VHE guest hypervisor either using ZCR_EL2 (which traps) or ZCR_EL1 (which does not trap). KVM handles both in different way: - on ZCR_EL2 trap, ZCR_EL2.LEN is immediately capped at the VM's own VL limit. This has the potential to break existing SW that relies on the full LEN field to be stateful. - on ZCR_EL1 access, we do absolutely nothing. On restoring the SVE context for an L2 guest, we directly restore the guest hypervisor's view of ZCR_EL2 into the physical ZCR_EL2. If the guest's view of the register was updated using the ZCR_EL2 accessor, the value has already been sanitised (with the caveat mentioned above). But if the guest used ZCR_EL1, the raw value is written into the HW, and the L2 guest can now access VLs that it shouldn't. Fix all the above by moving the VL capping to the restore points, ensuring that: - the HW is always programmed with a capped value, irrespective of the accessor being used, - the ZCR_EL2.LEN field is always completely stateful, irrespective of the accessor being used. Additionally, move ZCR_EL2 to be a sanitised register, ensuring that only the LEN field is actually stateful. This requires some creative construction of the RES0 mask, as the sysreg generation script does not yet generate RAZ/WI fields. Fixes: b3d29a823099 ("KVM: arm64: nv: Handle ZCR_EL2 traps") Signed-off-by: Mark Brown <broonie@kernel.org> Cc: stable@vger.kernel.org Link: https://patch.msgid.link/20260529-kvm-arm64-fix-zcr-len-nv-v2-1-86cad51992bd@kernel.org [maz: rewrote commit message, tidy up access_zcr_el2()] Signed-off-by: Marc Zyngier <maz@kernel.org>
2026-04-24KVM: arm64: Fix kvm_vcpu_initialized() macro parameterFuad Tabba1-1/+1
The macro is defined with parameter 'v' but the body references the literal token 'vcpu' instead, causing it to silently operate on whatever 'vcpu' resolves to in the caller's scope rather than the value passed by the caller. All current call sites happen to use a variable named 'vcpu', so the bug is latent. Fixes: e016333745c7 ("KVM: arm64: Only reset vCPU-scoped feature ID regs once") Signed-off-by: Fuad Tabba <tabba@google.com> Link: https://patch.msgid.link/20260424084908.370776-5-tabba@google.com Signed-off-by: Marc Zyngier <maz@kernel.org> Cc: stable@vger.kernel.org
2026-04-18KVM: arm64: pkvm: Adopt MARKER() to define host hypercall rangesMarc Zyngier1-3/+0
The EL2 code defines ranges of host hypercalls that are either enabled at boot-time only, used by [nh]VHE KVM, or reserved to pKVM. The way these ranges are delineated is error prone, as the enum symbols defining the limits are expressed in terms of actual function symbols. This means that should a new function be added, special care must be taken to also update the limit symbol. Improve this by reusing the mechanism introduced for the vcpu_sysreg enum, which uses a MARKER() macro and some extra trickery to make the limit symbol standalone. Crucially, the limit symbol has the same value as the *following* symbol. The handle_host_hcall() function is then updated to make use of the new limit definitions and get rid of the brittle default upper limit. This allows for some more strict checks at build time, and the removal of an comparison at run time. Tested-by: Fuad Tabba <tabba@google.com> Reviewed-by: Fuad Tabba <tabba@google.com> Link: https://patch.msgid.link/20260414160528.2218858-1-maz@kernel.org Signed-off-by: Marc Zyngier <maz@kernel.org>
2026-04-08Merge branch kvm-arm64/vgic-fixes-7.1 into kvmarm-master/nextMarc Zyngier1-8/+1
* kvm-arm64/vgic-fixes-7.1: : . : FIrst pass at fixing a number of vgic-v5 bugs that were found : after the merge of the initial series. : . KVM: arm64: Advertise ID_AA64PFR2_EL1.GCIE KVM: arm64: vgic-v5: Fold PPI state for all exposed PPIs KVM: arm64: set_id_regs: Allow GICv3 support to be set at runtime KVM: arm64: Don't advertises GICv3 in ID_PFR1_EL1 if AArch32 isn't supported KVM: arm64: Correctly plumb ID_AA64PFR2_EL1 into pkvm idreg handling KVM: arm64: Move GICv5 timer PPI validation into timer_irqs_are_valid() KVM: arm64: Remove evaluation of timer state in kvm_cpu_has_pending_timer() KVM: arm64: Kill arch_timer_context::direct field KVM: arm64: vgic-v5: Correctly set dist->ready once initialised KVM: arm64: vgic-v5: Make the effective priority mask a strict limit KVM: arm64: vgic-v5: Cast vgic_apr to u32 to avoid undefined behaviours KVM: arm64: vgic-v5: Transfer edge pending state to ICH_PPI_PENDRx_EL2 KVM: arm64: vgic-v5: Hold config_lock while finalizing GICv5 PPIs KVM: arm64: Account for RESx bits in __compute_fgt() KVM: arm64: Fix writeable mask for ID_AA64PFR2_EL1 arm64: Fix field references for ICH_PPI_DVIR[01]_EL2 KVM: arm64: Don't skip per-vcpu NV initialisation KVM: arm64: vgic: Don't reset cpuif/redist addresses at finalize time Signed-off-by: Marc Zyngier <maz@kernel.org>
2026-04-08Merge branch kvm-arm64/pkvm-protected-guest into kvmarm-master/nextMarc Zyngier1-1/+8
* kvm-arm64/pkvm-protected-guest: (41 commits) : . : pKVM support for protected guests, implementing the very long : awaited support for anonymous memory, as the elusive guestmem : has failed to deliver on its promises despite a multi-year : effort. Patches courtesy of Will Deacon. From the initial cover : letter: : : "[...] this patch series implements support for protected guest : memory with pKVM, where pages are unmapped from the host as they are : faulted into the guest and can be shared back from the guest using pKVM : hypercalls. Protected guests are created using a new machine type : identifier and can be booted to a shell using the kvmtool patches : available at [2], which finally means that we are able to test the pVM : logic in pKVM. Since this is an incremental step towards full isolation : from the host (for example, the CPU register state and DMA accesses are : not yet isolated), creating a pVM requires a developer Kconfig option to : be enabled in addition to booting with 'kvm-arm.mode=protected' and : results in a kernel taint." : . KVM: arm64: Don't hold 'vm_table_lock' across guest page reclaim KVM: arm64: Allow get_pkvm_hyp_vm() to take a reference to a dying VM KVM: arm64: Prevent teardown finalisation of referenced 'hyp_vm' drivers/virt: pkvm: Add Kconfig dependency on DMA_RESTRICTED_POOL KVM: arm64: Rename PKVM_PAGE_STATE_MASK KVM: arm64: Extend pKVM page ownership selftests to cover guest hvcs KVM: arm64: Extend pKVM page ownership selftests to cover forced reclaim KVM: arm64: Register 'selftest_vm' in the VM table KVM: arm64: Extend pKVM page ownership selftests to cover guest donation KVM: arm64: Add some initial documentation for pKVM KVM: arm64: Allow userspace to create protected VMs when pKVM is enabled KVM: arm64: Implement the MEM_UNSHARE hypercall for protected VMs KVM: arm64: Implement the MEM_SHARE hypercall for protected VMs KVM: arm64: Add hvc handler at EL2 for hypercalls from protected VMs KVM: arm64: Return -EFAULT from VCPU_RUN on access to a poisoned pte KVM: arm64: Reclaim faulting page from pKVM in spurious fault handler KVM: arm64: Introduce hypercall to force reclaim of a protected page KVM: arm64: Annotate guest donations with handle and gfn in host stage-2 KVM: arm64: Change 'pkvm_handle_t' to u16 KVM: arm64: Introduce host_stage2_set_owner_metadata_locked() ... Signed-off-by: Marc Zyngier <maz@kernel.org>
2026-04-08Merge branch kvm-arm64/spe-trbe-nvhe into kvmarm-master/nextMarc Zyngier1-0/+2
* kvm-arm64/spe-trbe-nvhe: : . : Fix SPE and TRBE nVHE world switch which can otherwise result in : pretty bad behaviours, as they have the nasty habit of performing : out of context speculative page table walks. : : Patches courtesy of Will Deacon. : . KVM: arm64: Don't pass host_debug_state to BRBE world-switch routines KVM: arm64: Disable SPE Profiling Buffer when running in guest context KVM: arm64: Disable TRBE Trace Buffer Unit when running in guest context Signed-off-by: Marc Zyngier <maz@kernel.org>
2026-04-08Merge branch kvm-arm64/nv-s2-debugfs into kvmarm-master/nextMarc Zyngier1-0/+9
* kvm-arm64/nv-s2-debugfs: : . : Expand the stage-2 ptdump infrastructure to be able to display : the content of the shadow s2 tables generated by nested virt. : : Patches courtesy of Wei-Lin Chang. : . KVM: arm64: ptdump: Initialize parser_state before pgtable walk KVM: arm64: nv: Expose shadow page tables in debugfs KVM: arm64: ptdump: Make KVM ptdump code s2 mmu aware Signed-off-by: Marc Zyngier <maz@kernel.org>
2026-04-08Merge branch kvm-arm64/vgic-v5-ppi into kvmarm-master/nextMarc Zyngier1-0/+34
* kvm-arm64/vgic-v5-ppi: (40 commits) : . : Add initial GICv5 support for KVM guests, only adding PPI support : for the time being. Patches courtesy of Sascha Bischoff. : : From the cover letter: : : "This is v7 of the patch series to add the virtual GICv5 [1] device : (vgic_v5). Only PPIs are supported by this initial series, and the : vgic_v5 implementation is restricted to the CPU interface, : only. Further patch series are to follow in due course, and will add : support for SPIs, LPIs, the GICv5 IRS, and the GICv5 ITS." : . KVM: arm64: selftests: Add no-vgic-v5 selftest KVM: arm64: selftests: Introduce a minimal GICv5 PPI selftest KVM: arm64: gic-v5: Communicate userspace-driveable PPIs via a UAPI Documentation: KVM: Introduce documentation for VGICv5 KVM: arm64: gic-v5: Probe for GICv5 device KVM: arm64: gic-v5: Set ICH_VCTLR_EL2.En on boot KVM: arm64: gic-v5: Introduce kvm_arm_vgic_v5_ops and register them KVM: arm64: gic-v5: Hide FEAT_GCIE from NV GICv5 guests KVM: arm64: gic: Hide GICv5 for protected guests KVM: arm64: gic-v5: Mandate architected PPI for PMU emulation on GICv5 KVM: arm64: gic-v5: Enlighten arch timer for GICv5 irqchip/gic-v5: Introduce minimal irq_set_type() for PPIs KVM: arm64: gic-v5: Initialise ID and priority bits when resetting vcpu KVM: arm64: gic-v5: Create and initialise vgic_v5 KVM: arm64: gic-v5: Support GICv5 interrupts with KVM_IRQ_LINE KVM: arm64: gic-v5: Implement direct injection of PPIs KVM: arm64: Introduce set_direct_injection irq_op KVM: arm64: gic-v5: Trap and mask guest ICC_PPI_ENABLERx_EL1 writes KVM: arm64: gic-v5: Check for pending PPIs KVM: arm64: gic-v5: Clear TWI if single task running ... Signed-off-by: Marc Zyngier <maz@kernel.org>
2026-04-08Merge branch kvm-arm64/hyp-tracing into kvmarm-master/nextMarc Zyngier1-0/+3
* kvm-arm64/hyp-tracing: (40 commits) : . : EL2 tracing support, adding both 'remote' ring-buffer : infrastructure and the tracing itself, courtesy of : Vincent Donnefort. From the cover letter: : : "The growing set of features supported by the hypervisor in protected : mode necessitates debugging and profiling tools. Tracefs is the : ideal candidate for this task: : : * It is simple to use and to script. : : * It is supported by various tools, from the trace-cmd CLI to the : Android web-based perfetto. : : * The ring-buffer, where are stored trace events consists of linked : pages, making it an ideal structure for sharing between kernel and : hypervisor. : : This series first introduces a new generic way of creating remote events and : remote buffers. Then it adds support to the pKVM hypervisor." : . tracing: selftests: Extend hotplug testing for trace remotes tracing: Non-consuming read for trace remotes with an offline CPU tracing: Adjust cmd_check_undefined to show unexpected undefined symbols tracing: Restore accidentally removed SPDX tag KVM: arm64: avoid unused-variable warning tracing: Generate undef symbols allowlist for simple_ring_buffer KVM: arm64: tracing: add ftrace dependency tracing: add more symbols to whitelist tracing: Update undefined symbols allow list for simple_ring_buffer KVM: arm64: Fix out-of-tree build for nVHE/pKVM tracing tracing: selftests: Add hypervisor trace remote tests KVM: arm64: Add selftest event support to nVHE/pKVM hyp KVM: arm64: Add hyp_enter/hyp_exit events to nVHE/pKVM hyp KVM: arm64: Add event support to the nVHE/pKVM hyp and trace remote KVM: arm64: Add trace reset to the nVHE/pKVM hyp KVM: arm64: Sync boot clock with the nVHE/pKVM hyp KVM: arm64: Add trace remote for the nVHE/pKVM hyp KVM: arm64: Add tracing capability for the nVHE/pKVM hyp KVM: arm64: Support unaligned fixmap in the pKVM hyp KVM: arm64: Initialise hyp_nr_cpus for nVHE hyp ... Signed-off-by: Marc Zyngier <maz@kernel.org>
2026-04-01KVM: arm64: vgic-v5: Fold PPI state for all exposed PPIsSascha Bischoff1-8/+1
GICv5 supports up to 128 PPIs, which would introduce a large amount of overhead if all of them were actively tracked. Rather than keeping track of all 128 potential PPIs, we instead only consider the set of architected PPIs (the first 64). Moreover, we further reduce that set by only exposing a subset of the PPIs to a guest. In practice, this means that only 4 PPIs are typically exposed to a guest - the SW_PPI, PMUIRQ, and the timers. When folding the PPI state, changed bits in the active or pending were used to choose which state to sync back. However, this breaks badly for Edge interrupts when exiting the guest before it has consumed the edge. There is no change in pending state detected, and the edge is lost forever. Given the reduced set of PPIs exposed to the guest, and the issues around tracking the edges, drop the tracking of changed state, and instead iterate over the limited subset of PPIs exposed to the guest directly. This change drops the second copy of the PPI pending state used for detecting edges in the pending state, and reworks vgic_v5_fold_ppi_state() to iterate over the VM's PPI mask instead. Signed-off-by: Sascha Bischoff <sascha.bischoff@arm.com> Link: https://patch.msgid.link/20260401162152.932243-1-sascha.bischoff@arm.com Signed-off-by: Marc Zyngier <maz@kernel.org>
2026-03-30KVM: arm64: Change 'pkvm_handle_t' to u16Will Deacon1-1/+1
'pkvm_handle_t' doesn't need to be a 32-bit type and subsequent patches will rely on it being no more than 16 bits so that it can be encoded into a pte annotation. Change 'pkvm_handle_t' to a u16 and add a compile-type check that the maximum handle fits into the reduced type. Reviewed-by: Fuad Tabba <tabba@google.com> Tested-by: Fuad Tabba <tabba@google.com> Tested-by: Mostafa Saleh <smostafa@google.com> Signed-off-by: Will Deacon <will@kernel.org> Link: https://patch.msgid.link/20260330144841.26181-24-will@kernel.org Signed-off-by: Marc Zyngier <maz@kernel.org>
2026-03-30KVM: arm64: Split teardown hypercall into two phasesWill Deacon1-0/+7
In preparation for reclaiming protected guest VM pages from the host during teardown, split the current 'pkvm_teardown_vm' hypercall into separate 'start' and 'finalise' calls. The 'pkvm_start_teardown_vm' hypercall puts the VM into a new 'is_dying' state, which is a point of no return past which no vCPU of the pVM is allowed to run any more. Once in this new state, 'pkvm_finalize_teardown_vm' can be used to reclaim meta-data and page-table pages from the VM. A subsequent patch will add support for reclaiming the individual guest memory pages. Reviewed-by: Fuad Tabba <tabba@google.com> Tested-by: Fuad Tabba <tabba@google.com> Tested-by: Mostafa Saleh <smostafa@google.com> Co-developed-by: Quentin Perret <qperret@google.com> Signed-off-by: Quentin Perret <qperret@google.com> Signed-off-by: Will Deacon <will@kernel.org> Link: https://patch.msgid.link/20260330144841.26181-12-will@kernel.org Signed-off-by: Marc Zyngier <maz@kernel.org>
2026-03-28KVM: arm64: Disable SPE Profiling Buffer when running in guest contextWill Deacon1-0/+1
The nVHE world-switch code relies on zeroing PMSCR_EL1 to disable profiling data generation in guest context when SPE is in use by the host. Unfortunately, this may leave PMBLIMITR_EL1.E set and consequently we can end up running in guest/hypervisor context with the Profiling Buffer enabled. The current "known issues" document for Rev M.a of the Arm ARM states that this can lead to speculative, out-of-context translations: | 2.18 D23136: | | When the Profiling Buffer is enabled, profiling is not stopped, and | Discard mode is not enabled, the Statistical Profiling Unit might | cause speculative translations for the owning translation regime, | including when the owning translation regime is out-of-context. In a similar fashion to TRBE, ensure that the Profiling Buffer is disabled during the nVHE world switch before we start messing with the stage-2 MMU and trap configuration. Cc: Marc Zyngier <maz@kernel.org> Cc: Oliver Upton <oupton@kernel.org> Cc: James Clark <james.clark@linaro.org> Cc: Leo Yan <leo.yan@arm.com> Cc: Suzuki K Poulose <suzuki.poulose@arm.com> Cc: Fuad Tabba <tabba@google.com> Cc: Alexandru Elisei <alexandru.elisei@arm.com> Reviewed-by: Alexandru Elisei <alexandru.elisei@arm.com> Reviewed-by: Fuad Tabba <tabba@google.com> Tested-by: Alexandru Elisei <alexandru.elisei@arm.com> Tested-by: Fuad Tabba <tabba@google.com> Fixes: f85279b4bd48 ("arm64: KVM: Save/restore the host SPE state when entering/leaving a VM") Signed-off-by: Will Deacon <will@kernel.org> Link: https://patch.msgid.link/20260327130047.21065-3-will@kernel.org Signed-off-by: Marc Zyngier <maz@kernel.org>
2026-03-28KVM: arm64: Disable TRBE Trace Buffer Unit when running in guest contextWill Deacon1-0/+1
The nVHE world-switch code relies on zeroing TRFCR_EL1 to disable trace generation in guest context when self-hosted TRBE is in use by the host. Per D3.2.1 ("Controls to prohibit trace at Exception levels"), clearing TRFCR_EL1 means that trace generation is prohibited at EL1 and EL0 but per R_YCHKJ the Trace Buffer Unit will still be enabled if TRBLIMITR_EL1.E is set. R_SJFRQ goes on to state that, when enabled, the Trace Buffer Unit can perform address translation for the "owning exception level" even when it is out of context. Consequently, we can end up in a state where TRBE performs speculative page-table walks for a host VA/IPA in guest/hypervisor context depending on the value of MDCR_EL2.E2TB, which changes over world-switch. The potential result appears to be a heady mixture of SErrors, data corruption and hardware lockups. Extend the TRBE world-switch code to clear TRBLIMITR_EL1.E after draining the buffer, restoring the register on return to the host. This unfortunately means we need to tackle CPU errata #2064142 and #2038923 which add additional synchronisation requirements around manipulations of the limit register. Hopefully this doesn't need to be fast. Cc: Marc Zyngier <maz@kernel.org> Cc: Oliver Upton <oupton@kernel.org> Cc: James Clark <james.clark@linaro.org> Cc: Leo Yan <leo.yan@arm.com> Cc: Suzuki K Poulose <suzuki.poulose@arm.com> Cc: Fuad Tabba <tabba@google.com> Cc: Alexandru Elisei <alexandru.elisei@arm.com> Tested-by: Leo Yan <leo.yan@arm.com> Tested-by: Fuad Tabba <tabba@google.com> Reviewed-by: Suzuki K Poulose <suzuki.poulose@arm.com> Reviewed-by: Fuad Tabba <tabba@google.com> Fixes: a1319260bf62 ("arm64: KVM: Enable access to TRBE support for host") Signed-off-by: Will Deacon <will@kernel.org> Link: https://patch.msgid.link/20260327130047.21065-2-will@kernel.org Signed-off-by: Marc Zyngier <maz@kernel.org>
2026-03-23KVM: arm64: nv: Expose shadow page tables in debugfsWei-Lin Chang1-0/+9
Exposing shadow page tables in debugfs improves the debugability and testability of NV. With this patch a new directory "nested" is created for each VM created if the host is NV capable. Within the directory each valid s2 mmu will have its shadow page table exposed as a readable file with the file name formatted as 0x<vttbr>-0x<vtcr>-s2-{en,dis}abled. The creation and removal of the files happen at the points when an s2 mmu becomes valid, or the context it represents change. In the future the "nested" directory can also hold other NV related information. This is gated behind CONFIG_PTDUMP_STAGE2_DEBUGFS. Suggested-by: Marc Zyngier <maz@kernel.org> Reviewed-by: Sebastian Ene <sebastianene@google.com> Signed-off-by: Wei-Lin Chang <weilin.chang@arm.com> Reviewed-by: Joey Gouly <joey.gouly@arm.com> Link: https://patch.msgid.link/20260317182638.1592507-3-weilin.chang@arm.com [maz: minor refactor, full 16 chars addresses] Signed-off-by: Marc Zyngier <maz@kernel.org>
2026-03-19KVM: arm64: gic-v5: Trap and mask guest ICC_PPI_ENABLERx_EL1 writesSascha Bischoff1-1/+0
A guest should not be able to detect if a PPI that is not exposed to the guest is implemented or not. Avoid the guest enabling any PPIs that are not implemented as far as the guest is concerned by trapping and masking writes to the two ICC_PPI_ENABLERx_EL1 registers. When a guest writes these registers, the write is masked with the set of PPIs actually exposed to the guest, and the state is written back to KVM's shadow state. As there is now no way for the guest to change the PPI enable state without it being trapped, saving of the PPI Enable state is dropped from guest exit. Reads for the above registers are not masked. When the guest is running and reads from the above registers, it is presented with what KVM provides in the ICH_PPI_ENABLERx_EL2 registers, which is the masked version of what the guest last wrote. Signed-off-by: Sascha Bischoff <sascha.bischoff@arm.com> Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com> Link: https://patch.msgid.link/20260319154937.3619520-25-sascha.bischoff@arm.com Signed-off-by: Marc Zyngier <maz@kernel.org>
2026-03-19KVM: arm64: gic-v5: Add vgic-v5 save/restore hyp interfaceSascha Bischoff1-0/+16
Introduce the following hyp functions to save/restore GICv5 state: * __vgic_v5_save_apr() * __vgic_v5_restore_vmcr_apr() * __vgic_v5_save_ppi_state() - no hypercall required * __vgic_v5_restore_ppi_state() - no hypercall required * __vgic_v5_save_state() - no hypercall required * __vgic_v5_restore_state() - no hypercall required Note that the functions tagged as not requiring hypercalls are always called directly from the same context. They are either called via the vgic_save_state()/vgic_restore_state() path when running with VHE, or via __hyp_vgic_save_state()/__hyp_vgic_restore_state() otherwise. This mimics how vgic_v3_save_state()/vgic_v3_restore_state() are implemented. Overall, the state of the following registers is saved/restored: * ICC_ICSR_EL1 * ICH_APR_EL2 * ICH_PPI_ACTIVERx_EL2 * ICH_PPI_DVIRx_EL2 * ICH_PPI_ENABLERx_EL2 * ICH_PPI_PENDRx_EL2 * ICH_PPI_PRIORITYRx_EL2 * ICH_VMCR_EL2 All of these are saved/restored to/from the KVM vgic_v5 CPUIF shadow state, with the exception of the PPI active, pending, and enable state. The pending state is saved and restored from kvm_host_data as any changes here need to be tracked and propagated back to the vgic_irq shadow structures (coming in a future commit). Therefore, an entry and an exit copy is required. The active and enable state is restored from the vgic_v5 CPUIF, but is saved to kvm_host_data. Again, this needs to by synced back into the shadow data structures. The ICSR must be save/restored as this register is shared between host and guest. Therefore, to avoid leaking host state to the guest, this must be saved and restored. Moreover, as this can by used by the host at any time, it must be save/restored eagerly. Note: the host state is not preserved as the host should only use this register when preemption is disabled. As with GICv3, the VMCR is eagerly saved as this is required when checking if interrupts can be injected or not, and therefore impacts things such as WFI. As part of restoring the ICH_VMCR_EL2 and ICH_APR_EL2, GICv3-compat mode is also disabled by setting the ICH_VCTLR_EL2.V3 bit to 0. The correspoinding GICv3-compat mode enable is part of the VMCR & APR restore for a GICv3 guest as it only takes effect when actually running a guest. Co-authored-by: Timothy Hayes <timothy.hayes@arm.com> Signed-off-by: Timothy Hayes <timothy.hayes@arm.com> Signed-off-by: Sascha Bischoff <sascha.bischoff@arm.com> Link: https://patch.msgid.link/20260319154937.3619520-17-sascha.bischoff@arm.com Signed-off-by: Marc Zyngier <maz@kernel.org>
2026-03-19KVM: arm64: gic-v5: Support GICv5 FGTs & FGUsSascha Bischoff1-0/+19
Extend the existing FGT/FGU infrastructure to include the GICv5 trap registers (ICH_HFGRTR_EL2, ICH_HFGWTR_EL2, ICH_HFGITR_EL2). This involves mapping the trap registers and their bits to the corresponding feature that introduces them (FEAT_GCIE for all, in this case), and mapping each trap bit to the system register/instruction controlled by it. As of this change, none of the GICv5 instructions or register accesses are being trapped. Signed-off-by: Sascha Bischoff <sascha.bischoff@arm.com> Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com> Link: https://patch.msgid.link/20260319154937.3619520-14-sascha.bischoff@arm.com Signed-off-by: Marc Zyngier <maz@kernel.org>
2026-03-11KVM: arm64: Add hyp_enter/hyp_exit events to nVHE/pKVM hypVincent Donnefort1-0/+3
The hyp_enter and hyp_exit events are logged by the hypervisor any time it is entered and exited. Signed-off-by: Vincent Donnefort <vdonnefort@google.com> Link: https://patch.msgid.link/20260309162516.2623589-29-vdonnefort@google.com Signed-off-by: Marc Zyngier <maz@kernel.org>
2026-03-07KVM: arm64: vgic: Pick EOIcount deactivations from AP-list tailMarc Zyngier1-0/+3
Valentine reports that their guests fail to boot correctly, losing interrupts, and indicates that the wrong interrupt gets deactivated. What happens here is that if the maintenance interrupt is slow enough to kick us out of the guest, extra interrupts can be activated from the LRs. We then exit and proceed to handle EOIcount deactivations, picking active interrupts from the AP list. But we start from the top of the list, potentially deactivating interrupts that were in the LRs, while EOIcount only denotes deactivation of interrupts that are not present in an LR. Solve this by tracking the last interrupt that made it in the LRs, and start the EOIcount deactivation walk *after* that interrupt. Since this only makes sense while the vcpu is loaded, stash this in the per-CPU host state. Huge thanks to Valentine for doing all the detective work and providing an initial patch. Fixes: 3cfd59f81e0f3 ("KVM: arm64: GICv3: Handle LR overflow when EOImode==0") Fixes: 281c6c06e2a7b ("KVM: arm64: GICv2: Handle LR overflow when EOImode==0") Reported-by: Valentine Burley <valentine.burley@collabora.com> Tested-by: Valentine Burley <valentine.burley@collabora.com> Signed-off-by: Marc Zyngier <maz@kernel.org> Link: https://lore.kernel.org/r/20260307115955.369455-1-valentine.burley@collabora.com Link: https://patch.msgid.link/20260307191151.3781182-1-maz@kernel.org Cc: stable@vger.kernel.org
2026-02-13KVM: arm64: Optimise away S1POE handling when not supported by hostFuad Tabba1-1/+2
Although ID register sanitisation prevents guests from seeing the feature, adding this check to the helper allows the compiler to entirely eliminate S1POE-specific code paths (such as context switching POR_EL1) when the host kernel is compiled without support (CONFIG_ARM64_POE is disabled). This aligns with the pattern used for other optional features like SVE (kvm_has_sve()) and FPMR (kvm_has_fpmr()), ensuring no POE logic if the host lacks support, regardless of the guest configuration state. Signed-off-by: Fuad Tabba <tabba@google.com> Link: https://patch.msgid.link/20260213143815.1732675-3-tabba@google.com Signed-off-by: Marc Zyngier <maz@kernel.org>
2026-02-05Merge branch kvm-arm64/misc-6.20 into kvmarm-master/nextMarc Zyngier1-6/+6
* kvm-arm64/misc-6.20: : . : Misc KVM/arm64 changes for 6.20 : : - Trivial FPSIMD cleanups : : - Calculate hyp VA size only once, avoiding potential mapping issues when : VA bits is smaller than expected : : - Silence sparse warning for the HYP stack base : : - Fix error checking when handling FFA_VERSION : : - Add missing trap configuration for DBGWCR15_EL1 : : - Don't try to deal with nested S2 when NV isn't enabled for a guest : : - Various spelling fixes : . KVM: arm64: nv: Avoid NV stage-2 code when NV is not supported KVM: arm64: Fix various comments KVM: arm64: nv: Add trap config for DBGWCR<15>_EL1 KVM: arm64: Fix error checking for FFA_VERSION KVM: arm64: Fix missing <asm/stackpage/nvhe.h> include KVM: arm64: Calculate hyp VA size only once KVM: arm64: Remove ISB after writing FPEXC32_EL2 KVM: arm64: Shuffle KVM_HOST_DATA_FLAG_* indices KVM: arm64: Fix comment in fpsimd_lazy_switch_to_host() Signed-off-by: Marc Zyngier <maz@kernel.org>
2026-02-05Merge branch kvm-arm64/resx into kvmarm-master/nextMarc Zyngier1-6/+32
* kvm-arm64/resx: : . : Add infrastructure to deal with the full gamut of RESx bits : for NV. As a result, it is now possible to have the expected : semantics for some bits such as SCTLR_EL2.SPAN. : . KVM: arm64: Add debugfs file dumping computed RESx values KVM: arm64: Add sanitisation to SCTLR_EL2 KVM: arm64: Remove all traces of HCR_EL2.MIOCNCE KVM: arm64: Remove all traces of FEAT_TME KVM: arm64: Simplify handling of full register invalid constraint KVM: arm64: Get rid of FIXED_VALUE altogether KVM: arm64: Simplify handling of HCR_EL2.E2H RESx KVM: arm64: Move RESx into individual register descriptors KVM: arm64: Add RES1_WHEN_E2Hx constraints as configuration flags KVM: arm64: Add REQUIRES_E2H1 constraint as configuration flags KVM: arm64: Simplify FIXED_VALUE handling KVM: arm64: Convert HCR_EL2.RW to AS_RES1 KVM: arm64: Correctly handle SCTLR_EL1 RES1 bits for unsupported features KVM: arm64: Allow RES1 bits to be inferred from configuration KVM: arm64: Inherit RESx bits from FGT register descriptors KVM: arm64: Extend unified RESx handling to runtime sanitisation KVM: arm64: Introduce data structure tracking both RES0 and RES1 bits KVM: arm64: Introduce standalone FGU computing primitive KVM: arm64: Remove duplicate configuration for SCTLR_EL1.{EE,E0E} arm64: Convert SCTLR_EL2 to sysreg infrastructure Signed-off-by: Marc Zyngier <maz@kernel.org>
2026-02-05Merge branch kvm-arm64/debugfs-fixes into kvmarm-master/nextMarc Zyngier1-3/+0
* kvm-arm64/debugfs-fixes: : . : Cleanup of the debugfs iterator, which are way more complicated : than they ought to be, courtesy of Fuad Tabba. From the cover letter: : : "This series refactors the debugfs implementations for `idregs` and : `vgic-state` to use standard `seq_file` iterator patterns. : : The existing implementations relied on storing iterator state within : global VM structures (`kvm_arch` and `vgic_dist`). This approach : prevented concurrent reads of the debugfs files (returning -EBUSY) and : created improper dependencies between transient file operations and : long-lived VM state." : . KVM: arm64: Use standard seq_file iterator for vgic-debug debugfs KVM: arm64: Reimplement vgic-debug XArray iteration KVM: arm64: Use standard seq_file iterator for idregs debugfs Signed-off-by: Marc Zyngier <maz@kernel.org>
2026-02-05KVM: arm64: Add sanitisation to SCTLR_EL2Marc Zyngier1-1/+1
Sanitise SCTLR_EL2 the usual way. The most important aspect of this is that we benefit from SCTLR_EL2.SPAN being RES1 when HCR_EL2.E2H==0. Reviewed-by: Fuad Tabba <tabba@google.com> Tested-by: Fuad Tabba <tabba@google.com> Link: https://patch.msgid.link/20260202184329.2724080-20-maz@kernel.org Signed-off-by: Marc Zyngier <maz@kernel.org>
2026-02-05KVM: arm64: Extend unified RESx handling to runtime sanitisationMarc Zyngier1-0/+15
Add a new helper to retrieve the RESx values for a given system register, and use it for the runtime sanitisation. This results in slightly better code generation for a fairly hot path in the hypervisor, and additionally covers all sanitised registers in all conditions, not just the VNCR-based ones. Reviewed-by: Fuad Tabba <tabba@google.com> Tested-by: Fuad Tabba <tabba@google.com> Link: https://patch.msgid.link/20260202184329.2724080-6-maz@kernel.org Signed-off-by: Marc Zyngier <maz@kernel.org>
2026-02-05KVM: arm64: Introduce data structure tracking both RES0 and RES1 bitsMarc Zyngier1-5/+16
We have so far mostly tracked RES0 bits, but only made a few attempts at being just as strict for RES1 bits (probably because they are both rarer and harder to handle). Start scratching the surface by introducing a data structure tracking RES0 and RES1 bits at the same time. Note that contrary to the usual idiom, this structure is mostly passed around by value -- the ABI handles it nicely, and the resulting code is much nicer. Reviewed-by: Fuad Tabba <tabba@google.com> Tested-by: Fuad Tabba <tabba@google.com> Link: https://patch.msgid.link/20260202184329.2724080-5-maz@kernel.org Signed-off-by: Marc Zyngier <maz@kernel.org>
2026-02-02KVM: arm64: Use standard seq_file iterator for idregs debugfsFuad Tabba1-3/+0
The current implementation uses `idreg_debugfs_iter` in `struct kvm_arch` to track the sequence position. This effectively makes the iterator shared across all open file descriptors for the VM. This approach has significant drawbacks: - It enforces mutual exclusion, preventing concurrent reads of the debugfs file (returning -EBUSY). - It relies on storing transient iterator state in the long-lived VM structure (`kvm_arch`). - The use of `u8` for the iterator index imposes an implicit limit of 255 registers. While not currently exceeded, this is fragile against future architectural growth. Switching to `loff_t` eliminates this overflow risk. Refactor the implementation to use the standard `seq_file` iterator. Instead of storing state in `kvm_arch`, rely on the `pos` argument passed to the `start` and `next` callbacks, which tracks the logical index specific to the file descriptor. This change enables concurrent access and eliminates the `idreg_debugfs_iter` field from `struct kvm_arch`. Signed-off-by: Fuad Tabba <tabba@google.com> Link: https://patch.msgid.link/20260202085721.3954942-2-tabba@google.com Signed-off-by: Marc Zyngier <maz@kernel.org>
2026-01-30KVM: arm64: Fix various commentsZenghui Yu (Huawei)1-1/+1
Use tab instead of whitespaces, as well as 2 minor typo fixes. Signed-off-by: Zenghui Yu (Huawei) <zenghui.yu@linux.dev> Link: https://patch.msgid.link/20260128075208.23024-1-zenghui.yu@linux.dev Signed-off-by: Marc Zyngier <maz@kernel.org>
2026-01-23Merge branch kvm-arm64/pkvm-features-6.20 into kvmarm-master/nextMarc Zyngier1-0/+2
* kvm-arm64/pkvm-features-6.20: : . : pKVM guest feature trapping fixes, courtesy of Fuad Tabba. : . KVM: arm64: Prevent host from managing timer offsets for protected VMs KVM: arm64: Check whether a VM IOCTL is allowed in pKVM KVM: arm64: Track KVM IOCTLs and their associated KVM caps KVM: arm64: Do not allow KVM_CAP_ARM_MTE for any guest in pKVM KVM: arm64: Include VM type when checking VM capabilities in pKVM KVM: arm64: Introduce helper to calculate fault IPA offset KVM: arm64: Fix MTE flag initialization for protected VMs KVM: arm64: Fix Trace Buffer trap polarity for protected VMs KVM: arm64: Fix Trace Buffer trapping for protected VMs Signed-off-by: Marc Zyngier <maz@kernel.org>
2026-01-15KVM: arm64: Track KVM IOCTLs and their associated KVM capsFuad Tabba1-0/+2
Track KVM IOCTLs (VM IOCTLs for now), and the associated KVM capability that enables that IOCTL. Add a function that performs the lookup. This will be used by CoCo VM Hypervisors (e.g., pKVM) to determine whether a particular KVM IOCTL is allowed for its VMs. Suggested-by: Oliver Upton <oupton@kernel.org> Signed-off-by: Fuad Tabba <tabba@google.com> [maz: don't expose KVM_CAP_BASIC to userspace, and rely on NR_VCPUS as a proxy for this] Link: https://patch.msgid.link/20251211104710.151771-8-tabba@google.com Signed-off-by: Marc Zyngier <maz@kernel.org>
2026-01-15KVM: arm64: Account for RES1 bits in DECLARE_FEAT_MAP() and coMarc Zyngier1-0/+1
None of the registers we manage in the feature dependency infrastructure so far has any RES1 bit. This is about to change, as VTCR_EL2 has its bit 31 being RES1. In order to not fail the consistency checks by not describing a bit, add RES1 bits to the set of immutable bits. This requires some extra surgery for the FGT handling, as we now need to track RES1 bits there as well. There are no RES1 FGT bits *yet*. Watch this space. Reviewed-by: Fuad Tabba <tabba@google.com> Tested-by: Sascha Bischoff <sascha.bischoff@arm.com> Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com> Tested-by: Fuad Tabba <tabba@google.com> Link: https://patch.msgid.link/20251210173024.561160-5-maz@kernel.org Signed-off-by: Marc Zyngier <maz@kernel.org>
2026-01-07KVM: arm64: Shuffle KVM_HOST_DATA_FLAG_* indicesMark Rutland1-5/+5
There's a gap in the KVM_HOST_DATA_FLAG_* indices since the removal of KVM_HOST_DATA_FLAG_HOST_SVE_ENABLED and KVM_HOST_DATA_FLAG_HOST_SME_ENABLED in commits: * 459f059be702 ("KVM: arm64: Remove VHE host restore of CPACR_EL1.ZEN") * 407a99c4654e ("KVM: arm64: Remove VHE host restore of CPACR_EL1.SMEN") Shuffle the indices to remove the gap, as Oliver requested at the time of the removals: https://lore.kernel.org/linux-arm-kernel/Z6qC4qn47ONfDCSH@linux.dev/ There should be no functional change as a result of this patch. Signed-off-by: Mark Rutland <mark.rutland@arm.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Fuad Tabba <tabba@google.com> Cc: Marc Zyngier <maz@kernel.org> Cc: Mark Brown <broonie@kernel.org> Cc: Oliver Upton <oliver.upton@linux.dev> Cc: Will Deacon <will@kernel.org> Tested-by: Fuad Tabba <tabba@google.com> Reviewed-by: Fuad Tabba <tabba@google.com> Reviewed-by: Mark Brown <broonie@kernel.org> Link: https://patch.msgid.link/20260106173707.3292074-3-mark.rutland@arm.com Signed-off-by: Marc Zyngier <maz@kernel.org>
2025-12-01Merge branch 'kvm-arm64/vgic-lr-overflow' into kvmarm/nextOliver Upton1-0/+1
* kvm-arm64/vgic-lr-overflow: (50 commits) : Support for VGIC LR overflows, courtesy of Marc Zyngier : : Address deficiencies in KVM's GIC emulation when a vCPU has more active : IRQs than can be represented in the VGIC list registers. Sort the AP : list to prioritize inactive and pending IRQs, potentially spilling : active IRQs outside of the LRs. : : Handle deactivation of IRQs outside of the LRs for both EOImode=0/1, : which involves special consideration for SPIs being deactivated from a : different vCPU than the one that acked it. KVM: arm64: Convert ICH_HCR_EL2_TDIR cap to EARLY_LOCAL_CPU_FEATURE KVM: arm64: selftests: vgic_irq: Add timer deactivation test KVM: arm64: selftests: vgic_irq: Add Group-0 enable test KVM: arm64: selftests: vgic_irq: Add asymmetric SPI deaectivation test KVM: arm64: selftests: vgic_irq: Perform EOImode==1 deactivation in ack order KVM: arm64: selftests: vgic_irq: Remove LR-bound limitation KVM: arm64: selftests: vgic_irq: Exclude timer-controlled interrupts KVM: arm64: selftests: vgic_irq: Change configuration before enabling interrupt KVM: arm64: selftests: vgic_irq: Fix GUEST_ASSERT_IAR_EMPTY() helper KVM: arm64: selftests: gic_v3: Disable Group-0 interrupts by default KVM: arm64: selftests: gic_v3: Add irq group setting helper KVM: arm64: GICv2: Always trap GICV_DIR register KVM: arm64: GICv2: Handle deactivation via GICV_DIR traps KVM: arm64: GICv2: Handle LR overflow when EOImode==0 KVM: arm64: GICv3: Force exit to sync ICH_HCR_EL2.En KVM: arm64: GICv3: nv: Plug L1 LR sync into deactivation primitive KVM: arm64: GICv3: nv: Resync LRs/VMCR/HCR early for better MI emulation KVM: arm64: GICv3: Avoid broadcast kick on CPUs lacking TDIR KVM: arm64: GICv3: Handle in-LR deactivation when possible KVM: arm64: GICv3: Add SPI tracking to handle asymmetric deactivation ... Signed-off-by: Oliver Upton <oupton@kernel.org>
2025-11-24KVM: arm64: GICv3: Handle deactivation via ICV_DIR_EL1 trapsMarc Zyngier1-0/+1
Deactivation via ICV_DIR_EL1 is both relatively straightforward (we have the interrupt that needs deactivation) and really awkward. The main issue is that the interrupt may either be in an LR on another CPU, or ourside of any LR. In the former case, we process the deactivation is if ot was a write to GICD_CACTIVERn, which is already implemented as a big hammer IPI'ing all vcpus. In the latter case, we just perform a normal deactivation, similar to what we do for EOImode==0. Another annoying aspect is that we need to tell the CPU owning the interrupt that its ap_list needs laudering. We use a brand new vcpu request to that effect. Note that this doesn't address deactivation via the GICV MMIO view, which will be taken care of in a later change. Tested-by: Fuad Tabba <tabba@google.com> Signed-off-by: Marc Zyngier <maz@kernel.org> Tested-by: Mark Brown <broonie@kernel.org> Link: https://msgid.link/20251120172540.2267180-29-maz@kernel.org Signed-off-by: Oliver Upton <oupton@kernel.org>
2025-11-12KVM: arm64: VM exit to userspace to handle SEAJiaqi Yan1-0/+2
When APEI fails to handle a stage-2 synchronous external abort (SEA), today KVM injects an asynchronous SError to the VCPU then resumes it, which usually results in unpleasant guest kernel panic. One major situation of guest SEA is when vCPU consumes recoverable uncorrected memory error (UER). Although SError and guest kernel panic effectively stops the propagation of corrupted memory, guest may re-use the corrupted memory if auto-rebooted; in worse case, guest boot may run into poisoned memory. So there is room to recover from an UER in a more graceful manner. Alternatively KVM can redirect the synchronous SEA event to VMM to - Reduce blast radius if possible. VMM can inject a SEA to VCPU via KVM's existing KVM_SET_VCPU_EVENTS API. If the memory poison consumption or fault is not from guest kernel, blast radius can be limited to the triggering thread in guest userspace, so VM can keep running. - Allow VMM to protect from future memory poison consumption by unmapping the page from stage-2, or to interrupt guest of the poisoned page so guest kernel can unmap it from stage-1 page table. - Allow VMM to track SEA events that VM customers care about, to restart VM when certain number of distinct poison events have happened, to provide observability to customers in log management UI. Introduce an userspace-visible feature to enable VMM handle SEA: - KVM_CAP_ARM_SEA_TO_USER. As the alternative fallback behavior when host APEI fails to claim a SEA, userspace can opt in this new capability to let KVM exit to userspace during SEA if it is not owned by host. - KVM_EXIT_ARM_SEA. A new exit reason is introduced for this. KVM fills kvm_run.arm_sea with as much as possible information about the SEA, enabling VMM to emulate SEA to guest by itself. - Sanitized ESR_EL2. The general rule is to keep only the bits useful for userspace and relevant to guest memory. - Flags indicating if faulting guest physical address is valid. - Faulting guest physical and virtual addresses if valid. Signed-off-by: Jiaqi Yan <jiaqiyan@google.com> Co-developed-by: Oliver Upton <oliver.upton@linux.dev> Signed-off-by: Oliver Upton <oliver.upton@linux.dev> Link: https://msgid.link/20251013185903.1372553-2-jiaqiyan@google.com Signed-off-by: Oliver Upton <oupton@kernel.org>
2025-10-13KVM: arm64: Compute per-vCPU FGTs at vcpu_load()Oliver Upton1-0/+50
To date KVM has used the fine-grained traps for the sake of UNDEF enforcement (so-called FGUs), meaning the constituent parts could be computed on a per-VM basis and folded into the effective value when programmed. Prepare for traps changing based on the vCPU context by computing the whole mess of them at vcpu_load(). Aggressively inline all the helpers to preserve the build-time checks that were there before. Signed-off-by: Oliver Upton <oliver.upton@linux.dev> Reviewed-by: Joey Gouly <joey.gouly@arm.com> Signed-off-by: Marc Zyngier <maz@kernel.org>
2025-09-30Merge tag 'kvmarm-6.18' of ↵Paolo Bonzini1-2/+3
git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm into HEAD KVM/arm64 updates for 6.18 - Add support for FF-A 1.2 as the secure memory conduit for pKVM, allowing more registers to be used as part of the message payload. - Change the way pKVM allocates its VM handles, making sure that the privileged hypervisor is never tricked into using uninitialised data. - Speed up MMIO range registration by avoiding unnecessary RCU synchronisation, which results in VMs starting much quicker. - Add the dump of the instruction stream when panic-ing in the EL2 payload, just like the rest of the kernel has always done. This will hopefully help debugging non-VHE setups. - Add 52bit PA support to the stage-1 page-table walker, and make use of it to populate the fault level reported to the guest on failing to translate a stage-1 walk. - Add NV support to the GICv3-on-GICv5 emulation code, ensuring feature parity for guests, irrespective of the host platform. - Fix some really ugly architecture problems when dealing with debug in a nested VM. This has some bad performance impacts, but is at least correct. - Add enough infrastructur