| Age | Commit message (Collapse) | Author | Files | Lines |
|
ZCR_EL2 can be updated by a VHE guest hypervisor either using ZCR_EL2
(which traps) or ZCR_EL1 (which does not trap). KVM handles both in
different way:
- on ZCR_EL2 trap, ZCR_EL2.LEN is immediately capped at the VM's own
VL limit. This has the potential to break existing SW that relies
on the full LEN field to be stateful.
- on ZCR_EL1 access, we do absolutely nothing.
On restoring the SVE context for an L2 guest, we directly restore the
guest hypervisor's view of ZCR_EL2 into the physical ZCR_EL2. If the
guest's view of the register was updated using the ZCR_EL2 accessor,
the value has already been sanitised (with the caveat mentioned above).
But if the guest used ZCR_EL1, the raw value is written into the HW,
and the L2 guest can now access VLs that it shouldn't.
Fix all the above by moving the VL capping to the restore points,
ensuring that:
- the HW is always programmed with a capped value, irrespective of
the accessor being used,
- the ZCR_EL2.LEN field is always completely stateful, irrespective
of the accessor being used.
Additionally, move ZCR_EL2 to be a sanitised register, ensuring that
only the LEN field is actually stateful. This requires some creative
construction of the RES0 mask, as the sysreg generation script does
not yet generate RAZ/WI fields.
Fixes: b3d29a823099 ("KVM: arm64: nv: Handle ZCR_EL2 traps")
Signed-off-by: Mark Brown <broonie@kernel.org>
Cc: stable@vger.kernel.org
Link: https://patch.msgid.link/20260529-kvm-arm64-fix-zcr-len-nv-v2-1-86cad51992bd@kernel.org
[maz: rewrote commit message, tidy up access_zcr_el2()]
Signed-off-by: Marc Zyngier <maz@kernel.org>
|
|
The macro is defined with parameter 'v' but the body references the
literal token 'vcpu' instead, causing it to silently operate on whatever
'vcpu' resolves to in the caller's scope rather than the value passed by
the caller. All current call sites happen to use a variable named 'vcpu',
so the bug is latent.
Fixes: e016333745c7 ("KVM: arm64: Only reset vCPU-scoped feature ID regs once")
Signed-off-by: Fuad Tabba <tabba@google.com>
Link: https://patch.msgid.link/20260424084908.370776-5-tabba@google.com
Signed-off-by: Marc Zyngier <maz@kernel.org>
Cc: stable@vger.kernel.org
|
|
The EL2 code defines ranges of host hypercalls that are either
enabled at boot-time only, used by [nh]VHE KVM, or reserved to pKVM.
The way these ranges are delineated is error prone, as the enum symbols
defining the limits are expressed in terms of actual function symbols.
This means that should a new function be added, special care must be
taken to also update the limit symbol.
Improve this by reusing the mechanism introduced for the vcpu_sysreg
enum, which uses a MARKER() macro and some extra trickery to make
the limit symbol standalone. Crucially, the limit symbol has the
same value as the *following* symbol.
The handle_host_hcall() function is then updated to make use of
the new limit definitions and get rid of the brittle default
upper limit. This allows for some more strict checks at build
time, and the removal of an comparison at run time.
Tested-by: Fuad Tabba <tabba@google.com>
Reviewed-by: Fuad Tabba <tabba@google.com>
Link: https://patch.msgid.link/20260414160528.2218858-1-maz@kernel.org
Signed-off-by: Marc Zyngier <maz@kernel.org>
|
|
* kvm-arm64/vgic-fixes-7.1:
: .
: FIrst pass at fixing a number of vgic-v5 bugs that were found
: after the merge of the initial series.
: .
KVM: arm64: Advertise ID_AA64PFR2_EL1.GCIE
KVM: arm64: vgic-v5: Fold PPI state for all exposed PPIs
KVM: arm64: set_id_regs: Allow GICv3 support to be set at runtime
KVM: arm64: Don't advertises GICv3 in ID_PFR1_EL1 if AArch32 isn't supported
KVM: arm64: Correctly plumb ID_AA64PFR2_EL1 into pkvm idreg handling
KVM: arm64: Move GICv5 timer PPI validation into timer_irqs_are_valid()
KVM: arm64: Remove evaluation of timer state in kvm_cpu_has_pending_timer()
KVM: arm64: Kill arch_timer_context::direct field
KVM: arm64: vgic-v5: Correctly set dist->ready once initialised
KVM: arm64: vgic-v5: Make the effective priority mask a strict limit
KVM: arm64: vgic-v5: Cast vgic_apr to u32 to avoid undefined behaviours
KVM: arm64: vgic-v5: Transfer edge pending state to ICH_PPI_PENDRx_EL2
KVM: arm64: vgic-v5: Hold config_lock while finalizing GICv5 PPIs
KVM: arm64: Account for RESx bits in __compute_fgt()
KVM: arm64: Fix writeable mask for ID_AA64PFR2_EL1
arm64: Fix field references for ICH_PPI_DVIR[01]_EL2
KVM: arm64: Don't skip per-vcpu NV initialisation
KVM: arm64: vgic: Don't reset cpuif/redist addresses at finalize time
Signed-off-by: Marc Zyngier <maz@kernel.org>
|
|
* kvm-arm64/pkvm-protected-guest: (41 commits)
: .
: pKVM support for protected guests, implementing the very long
: awaited support for anonymous memory, as the elusive guestmem
: has failed to deliver on its promises despite a multi-year
: effort. Patches courtesy of Will Deacon. From the initial cover
: letter:
:
: "[...] this patch series implements support for protected guest
: memory with pKVM, where pages are unmapped from the host as they are
: faulted into the guest and can be shared back from the guest using pKVM
: hypercalls. Protected guests are created using a new machine type
: identifier and can be booted to a shell using the kvmtool patches
: available at [2], which finally means that we are able to test the pVM
: logic in pKVM. Since this is an incremental step towards full isolation
: from the host (for example, the CPU register state and DMA accesses are
: not yet isolated), creating a pVM requires a developer Kconfig option to
: be enabled in addition to booting with 'kvm-arm.mode=protected' and
: results in a kernel taint."
: .
KVM: arm64: Don't hold 'vm_table_lock' across guest page reclaim
KVM: arm64: Allow get_pkvm_hyp_vm() to take a reference to a dying VM
KVM: arm64: Prevent teardown finalisation of referenced 'hyp_vm'
drivers/virt: pkvm: Add Kconfig dependency on DMA_RESTRICTED_POOL
KVM: arm64: Rename PKVM_PAGE_STATE_MASK
KVM: arm64: Extend pKVM page ownership selftests to cover guest hvcs
KVM: arm64: Extend pKVM page ownership selftests to cover forced reclaim
KVM: arm64: Register 'selftest_vm' in the VM table
KVM: arm64: Extend pKVM page ownership selftests to cover guest donation
KVM: arm64: Add some initial documentation for pKVM
KVM: arm64: Allow userspace to create protected VMs when pKVM is enabled
KVM: arm64: Implement the MEM_UNSHARE hypercall for protected VMs
KVM: arm64: Implement the MEM_SHARE hypercall for protected VMs
KVM: arm64: Add hvc handler at EL2 for hypercalls from protected VMs
KVM: arm64: Return -EFAULT from VCPU_RUN on access to a poisoned pte
KVM: arm64: Reclaim faulting page from pKVM in spurious fault handler
KVM: arm64: Introduce hypercall to force reclaim of a protected page
KVM: arm64: Annotate guest donations with handle and gfn in host stage-2
KVM: arm64: Change 'pkvm_handle_t' to u16
KVM: arm64: Introduce host_stage2_set_owner_metadata_locked()
...
Signed-off-by: Marc Zyngier <maz@kernel.org>
|
|
* kvm-arm64/spe-trbe-nvhe:
: .
: Fix SPE and TRBE nVHE world switch which can otherwise result in
: pretty bad behaviours, as they have the nasty habit of performing
: out of context speculative page table walks.
:
: Patches courtesy of Will Deacon.
: .
KVM: arm64: Don't pass host_debug_state to BRBE world-switch routines
KVM: arm64: Disable SPE Profiling Buffer when running in guest context
KVM: arm64: Disable TRBE Trace Buffer Unit when running in guest context
Signed-off-by: Marc Zyngier <maz@kernel.org>
|
|
* kvm-arm64/nv-s2-debugfs:
: .
: Expand the stage-2 ptdump infrastructure to be able to display
: the content of the shadow s2 tables generated by nested virt.
:
: Patches courtesy of Wei-Lin Chang.
: .
KVM: arm64: ptdump: Initialize parser_state before pgtable walk
KVM: arm64: nv: Expose shadow page tables in debugfs
KVM: arm64: ptdump: Make KVM ptdump code s2 mmu aware
Signed-off-by: Marc Zyngier <maz@kernel.org>
|
|
* kvm-arm64/vgic-v5-ppi: (40 commits)
: .
: Add initial GICv5 support for KVM guests, only adding PPI support
: for the time being. Patches courtesy of Sascha Bischoff.
:
: From the cover letter:
:
: "This is v7 of the patch series to add the virtual GICv5 [1] device
: (vgic_v5). Only PPIs are supported by this initial series, and the
: vgic_v5 implementation is restricted to the CPU interface,
: only. Further patch series are to follow in due course, and will add
: support for SPIs, LPIs, the GICv5 IRS, and the GICv5 ITS."
: .
KVM: arm64: selftests: Add no-vgic-v5 selftest
KVM: arm64: selftests: Introduce a minimal GICv5 PPI selftest
KVM: arm64: gic-v5: Communicate userspace-driveable PPIs via a UAPI
Documentation: KVM: Introduce documentation for VGICv5
KVM: arm64: gic-v5: Probe for GICv5 device
KVM: arm64: gic-v5: Set ICH_VCTLR_EL2.En on boot
KVM: arm64: gic-v5: Introduce kvm_arm_vgic_v5_ops and register them
KVM: arm64: gic-v5: Hide FEAT_GCIE from NV GICv5 guests
KVM: arm64: gic: Hide GICv5 for protected guests
KVM: arm64: gic-v5: Mandate architected PPI for PMU emulation on GICv5
KVM: arm64: gic-v5: Enlighten arch timer for GICv5
irqchip/gic-v5: Introduce minimal irq_set_type() for PPIs
KVM: arm64: gic-v5: Initialise ID and priority bits when resetting vcpu
KVM: arm64: gic-v5: Create and initialise vgic_v5
KVM: arm64: gic-v5: Support GICv5 interrupts with KVM_IRQ_LINE
KVM: arm64: gic-v5: Implement direct injection of PPIs
KVM: arm64: Introduce set_direct_injection irq_op
KVM: arm64: gic-v5: Trap and mask guest ICC_PPI_ENABLERx_EL1 writes
KVM: arm64: gic-v5: Check for pending PPIs
KVM: arm64: gic-v5: Clear TWI if single task running
...
Signed-off-by: Marc Zyngier <maz@kernel.org>
|
|
* kvm-arm64/hyp-tracing: (40 commits)
: .
: EL2 tracing support, adding both 'remote' ring-buffer
: infrastructure and the tracing itself, courtesy of
: Vincent Donnefort. From the cover letter:
:
: "The growing set of features supported by the hypervisor in protected
: mode necessitates debugging and profiling tools. Tracefs is the
: ideal candidate for this task:
:
: * It is simple to use and to script.
:
: * It is supported by various tools, from the trace-cmd CLI to the
: Android web-based perfetto.
:
: * The ring-buffer, where are stored trace events consists of linked
: pages, making it an ideal structure for sharing between kernel and
: hypervisor.
:
: This series first introduces a new generic way of creating remote events and
: remote buffers. Then it adds support to the pKVM hypervisor."
: .
tracing: selftests: Extend hotplug testing for trace remotes
tracing: Non-consuming read for trace remotes with an offline CPU
tracing: Adjust cmd_check_undefined to show unexpected undefined symbols
tracing: Restore accidentally removed SPDX tag
KVM: arm64: avoid unused-variable warning
tracing: Generate undef symbols allowlist for simple_ring_buffer
KVM: arm64: tracing: add ftrace dependency
tracing: add more symbols to whitelist
tracing: Update undefined symbols allow list for simple_ring_buffer
KVM: arm64: Fix out-of-tree build for nVHE/pKVM tracing
tracing: selftests: Add hypervisor trace remote tests
KVM: arm64: Add selftest event support to nVHE/pKVM hyp
KVM: arm64: Add hyp_enter/hyp_exit events to nVHE/pKVM hyp
KVM: arm64: Add event support to the nVHE/pKVM hyp and trace remote
KVM: arm64: Add trace reset to the nVHE/pKVM hyp
KVM: arm64: Sync boot clock with the nVHE/pKVM hyp
KVM: arm64: Add trace remote for the nVHE/pKVM hyp
KVM: arm64: Add tracing capability for the nVHE/pKVM hyp
KVM: arm64: Support unaligned fixmap in the pKVM hyp
KVM: arm64: Initialise hyp_nr_cpus for nVHE hyp
...
Signed-off-by: Marc Zyngier <maz@kernel.org>
|
|
GICv5 supports up to 128 PPIs, which would introduce a large amount of
overhead if all of them were actively tracked. Rather than keeping
track of all 128 potential PPIs, we instead only consider the set of
architected PPIs (the first 64). Moreover, we further reduce that set
by only exposing a subset of the PPIs to a guest. In practice, this
means that only 4 PPIs are typically exposed to a guest - the SW_PPI,
PMUIRQ, and the timers.
When folding the PPI state, changed bits in the active or pending were
used to choose which state to sync back. However, this breaks badly
for Edge interrupts when exiting the guest before it has consumed the
edge. There is no change in pending state detected, and the edge is
lost forever.
Given the reduced set of PPIs exposed to the guest, and the issues
around tracking the edges, drop the tracking of changed state, and
instead iterate over the limited subset of PPIs exposed to the guest
directly.
This change drops the second copy of the PPI pending state used for
detecting edges in the pending state, and reworks
vgic_v5_fold_ppi_state() to iterate over the VM's PPI mask instead.
Signed-off-by: Sascha Bischoff <sascha.bischoff@arm.com>
Link: https://patch.msgid.link/20260401162152.932243-1-sascha.bischoff@arm.com
Signed-off-by: Marc Zyngier <maz@kernel.org>
|
|
'pkvm_handle_t' doesn't need to be a 32-bit type and subsequent patches
will rely on it being no more than 16 bits so that it can be encoded
into a pte annotation.
Change 'pkvm_handle_t' to a u16 and add a compile-type check that the
maximum handle fits into the reduced type.
Reviewed-by: Fuad Tabba <tabba@google.com>
Tested-by: Fuad Tabba <tabba@google.com>
Tested-by: Mostafa Saleh <smostafa@google.com>
Signed-off-by: Will Deacon <will@kernel.org>
Link: https://patch.msgid.link/20260330144841.26181-24-will@kernel.org
Signed-off-by: Marc Zyngier <maz@kernel.org>
|
|
In preparation for reclaiming protected guest VM pages from the host
during teardown, split the current 'pkvm_teardown_vm' hypercall into
separate 'start' and 'finalise' calls.
The 'pkvm_start_teardown_vm' hypercall puts the VM into a new 'is_dying'
state, which is a point of no return past which no vCPU of the pVM is
allowed to run any more. Once in this new state,
'pkvm_finalize_teardown_vm' can be used to reclaim meta-data and
page-table pages from the VM. A subsequent patch will add support for
reclaiming the individual guest memory pages.
Reviewed-by: Fuad Tabba <tabba@google.com>
Tested-by: Fuad Tabba <tabba@google.com>
Tested-by: Mostafa Saleh <smostafa@google.com>
Co-developed-by: Quentin Perret <qperret@google.com>
Signed-off-by: Quentin Perret <qperret@google.com>
Signed-off-by: Will Deacon <will@kernel.org>
Link: https://patch.msgid.link/20260330144841.26181-12-will@kernel.org
Signed-off-by: Marc Zyngier <maz@kernel.org>
|
|
The nVHE world-switch code relies on zeroing PMSCR_EL1 to disable
profiling data generation in guest context when SPE is in use by the
host.
Unfortunately, this may leave PMBLIMITR_EL1.E set and consequently we
can end up running in guest/hypervisor context with the Profiling Buffer
enabled. The current "known issues" document for Rev M.a of the Arm ARM
states that this can lead to speculative, out-of-context translations:
| 2.18 D23136:
|
| When the Profiling Buffer is enabled, profiling is not stopped, and
| Discard mode is not enabled, the Statistical Profiling Unit might
| cause speculative translations for the owning translation regime,
| including when the owning translation regime is out-of-context.
In a similar fashion to TRBE, ensure that the Profiling Buffer is
disabled during the nVHE world switch before we start messing with the
stage-2 MMU and trap configuration.
Cc: Marc Zyngier <maz@kernel.org>
Cc: Oliver Upton <oupton@kernel.org>
Cc: James Clark <james.clark@linaro.org>
Cc: Leo Yan <leo.yan@arm.com>
Cc: Suzuki K Poulose <suzuki.poulose@arm.com>
Cc: Fuad Tabba <tabba@google.com>
Cc: Alexandru Elisei <alexandru.elisei@arm.com>
Reviewed-by: Alexandru Elisei <alexandru.elisei@arm.com>
Reviewed-by: Fuad Tabba <tabba@google.com>
Tested-by: Alexandru Elisei <alexandru.elisei@arm.com>
Tested-by: Fuad Tabba <tabba@google.com>
Fixes: f85279b4bd48 ("arm64: KVM: Save/restore the host SPE state when entering/leaving a VM")
Signed-off-by: Will Deacon <will@kernel.org>
Link: https://patch.msgid.link/20260327130047.21065-3-will@kernel.org
Signed-off-by: Marc Zyngier <maz@kernel.org>
|
|
The nVHE world-switch code relies on zeroing TRFCR_EL1 to disable trace
generation in guest context when self-hosted TRBE is in use by the host.
Per D3.2.1 ("Controls to prohibit trace at Exception levels"), clearing
TRFCR_EL1 means that trace generation is prohibited at EL1 and EL0 but
per R_YCHKJ the Trace Buffer Unit will still be enabled if
TRBLIMITR_EL1.E is set. R_SJFRQ goes on to state that, when enabled, the
Trace Buffer Unit can perform address translation for the "owning
exception level" even when it is out of context.
Consequently, we can end up in a state where TRBE performs speculative
page-table walks for a host VA/IPA in guest/hypervisor context depending
on the value of MDCR_EL2.E2TB, which changes over world-switch. The
potential result appears to be a heady mixture of SErrors, data
corruption and hardware lockups.
Extend the TRBE world-switch code to clear TRBLIMITR_EL1.E after
draining the buffer, restoring the register on return to the host. This
unfortunately means we need to tackle CPU errata #2064142 and #2038923
which add additional synchronisation requirements around manipulations
of the limit register. Hopefully this doesn't need to be fast.
Cc: Marc Zyngier <maz@kernel.org>
Cc: Oliver Upton <oupton@kernel.org>
Cc: James Clark <james.clark@linaro.org>
Cc: Leo Yan <leo.yan@arm.com>
Cc: Suzuki K Poulose <suzuki.poulose@arm.com>
Cc: Fuad Tabba <tabba@google.com>
Cc: Alexandru Elisei <alexandru.elisei@arm.com>
Tested-by: Leo Yan <leo.yan@arm.com>
Tested-by: Fuad Tabba <tabba@google.com>
Reviewed-by: Suzuki K Poulose <suzuki.poulose@arm.com>
Reviewed-by: Fuad Tabba <tabba@google.com>
Fixes: a1319260bf62 ("arm64: KVM: Enable access to TRBE support for host")
Signed-off-by: Will Deacon <will@kernel.org>
Link: https://patch.msgid.link/20260327130047.21065-2-will@kernel.org
Signed-off-by: Marc Zyngier <maz@kernel.org>
|
|
Exposing shadow page tables in debugfs improves the debugability and
testability of NV. With this patch a new directory "nested" is created
for each VM created if the host is NV capable. Within the directory each
valid s2 mmu will have its shadow page table exposed as a readable file
with the file name formatted as 0x<vttbr>-0x<vtcr>-s2-{en,dis}abled. The
creation and removal of the files happen at the points when an s2 mmu
becomes valid, or the context it represents change. In the future the
"nested" directory can also hold other NV related information.
This is gated behind CONFIG_PTDUMP_STAGE2_DEBUGFS.
Suggested-by: Marc Zyngier <maz@kernel.org>
Reviewed-by: Sebastian Ene <sebastianene@google.com>
Signed-off-by: Wei-Lin Chang <weilin.chang@arm.com>
Reviewed-by: Joey Gouly <joey.gouly@arm.com>
Link: https://patch.msgid.link/20260317182638.1592507-3-weilin.chang@arm.com
[maz: minor refactor, full 16 chars addresses]
Signed-off-by: Marc Zyngier <maz@kernel.org>
|
|
A guest should not be able to detect if a PPI that is not exposed to
the guest is implemented or not. Avoid the guest enabling any PPIs
that are not implemented as far as the guest is concerned by trapping
and masking writes to the two ICC_PPI_ENABLERx_EL1 registers.
When a guest writes these registers, the write is masked with the set
of PPIs actually exposed to the guest, and the state is written back
to KVM's shadow state. As there is now no way for the guest to change
the PPI enable state without it being trapped, saving of the PPI
Enable state is dropped from guest exit.
Reads for the above registers are not masked. When the guest is
running and reads from the above registers, it is presented with what
KVM provides in the ICH_PPI_ENABLERx_EL2 registers, which is the
masked version of what the guest last wrote.
Signed-off-by: Sascha Bischoff <sascha.bischoff@arm.com>
Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com>
Link: https://patch.msgid.link/20260319154937.3619520-25-sascha.bischoff@arm.com
Signed-off-by: Marc Zyngier <maz@kernel.org>
|
|
Introduce the following hyp functions to save/restore GICv5 state:
* __vgic_v5_save_apr()
* __vgic_v5_restore_vmcr_apr()
* __vgic_v5_save_ppi_state() - no hypercall required
* __vgic_v5_restore_ppi_state() - no hypercall required
* __vgic_v5_save_state() - no hypercall required
* __vgic_v5_restore_state() - no hypercall required
Note that the functions tagged as not requiring hypercalls are always
called directly from the same context. They are either called via the
vgic_save_state()/vgic_restore_state() path when running with VHE, or
via __hyp_vgic_save_state()/__hyp_vgic_restore_state() otherwise. This
mimics how vgic_v3_save_state()/vgic_v3_restore_state() are
implemented.
Overall, the state of the following registers is saved/restored:
* ICC_ICSR_EL1
* ICH_APR_EL2
* ICH_PPI_ACTIVERx_EL2
* ICH_PPI_DVIRx_EL2
* ICH_PPI_ENABLERx_EL2
* ICH_PPI_PENDRx_EL2
* ICH_PPI_PRIORITYRx_EL2
* ICH_VMCR_EL2
All of these are saved/restored to/from the KVM vgic_v5 CPUIF shadow
state, with the exception of the PPI active, pending, and enable
state. The pending state is saved and restored from kvm_host_data as
any changes here need to be tracked and propagated back to the
vgic_irq shadow structures (coming in a future commit). Therefore, an
entry and an exit copy is required. The active and enable state is
restored from the vgic_v5 CPUIF, but is saved to kvm_host_data. Again,
this needs to by synced back into the shadow data structures.
The ICSR must be save/restored as this register is shared between host
and guest. Therefore, to avoid leaking host state to the guest, this
must be saved and restored. Moreover, as this can by used by the host
at any time, it must be save/restored eagerly. Note: the host state is
not preserved as the host should only use this register when
preemption is disabled.
As with GICv3, the VMCR is eagerly saved as this is required when
checking if interrupts can be injected or not, and therefore impacts
things such as WFI.
As part of restoring the ICH_VMCR_EL2 and ICH_APR_EL2, GICv3-compat
mode is also disabled by setting the ICH_VCTLR_EL2.V3 bit to 0. The
correspoinding GICv3-compat mode enable is part of the VMCR & APR
restore for a GICv3 guest as it only takes effect when actually
running a guest.
Co-authored-by: Timothy Hayes <timothy.hayes@arm.com>
Signed-off-by: Timothy Hayes <timothy.hayes@arm.com>
Signed-off-by: Sascha Bischoff <sascha.bischoff@arm.com>
Link: https://patch.msgid.link/20260319154937.3619520-17-sascha.bischoff@arm.com
Signed-off-by: Marc Zyngier <maz@kernel.org>
|
|
Extend the existing FGT/FGU infrastructure to include the GICv5 trap
registers (ICH_HFGRTR_EL2, ICH_HFGWTR_EL2, ICH_HFGITR_EL2). This
involves mapping the trap registers and their bits to the
corresponding feature that introduces them (FEAT_GCIE for all, in this
case), and mapping each trap bit to the system register/instruction
controlled by it.
As of this change, none of the GICv5 instructions or register accesses
are being trapped.
Signed-off-by: Sascha Bischoff <sascha.bischoff@arm.com>
Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com>
Link: https://patch.msgid.link/20260319154937.3619520-14-sascha.bischoff@arm.com
Signed-off-by: Marc Zyngier <maz@kernel.org>
|
|
The hyp_enter and hyp_exit events are logged by the hypervisor any time
it is entered and exited.
Signed-off-by: Vincent Donnefort <vdonnefort@google.com>
Link: https://patch.msgid.link/20260309162516.2623589-29-vdonnefort@google.com
Signed-off-by: Marc Zyngier <maz@kernel.org>
|
|
Valentine reports that their guests fail to boot correctly, losing
interrupts, and indicates that the wrong interrupt gets deactivated.
What happens here is that if the maintenance interrupt is slow enough
to kick us out of the guest, extra interrupts can be activated from
the LRs. We then exit and proceed to handle EOIcount deactivations,
picking active interrupts from the AP list. But we start from the
top of the list, potentially deactivating interrupts that were in
the LRs, while EOIcount only denotes deactivation of interrupts that
are not present in an LR.
Solve this by tracking the last interrupt that made it in the LRs,
and start the EOIcount deactivation walk *after* that interrupt.
Since this only makes sense while the vcpu is loaded, stash this
in the per-CPU host state.
Huge thanks to Valentine for doing all the detective work and
providing an initial patch.
Fixes: 3cfd59f81e0f3 ("KVM: arm64: GICv3: Handle LR overflow when EOImode==0")
Fixes: 281c6c06e2a7b ("KVM: arm64: GICv2: Handle LR overflow when EOImode==0")
Reported-by: Valentine Burley <valentine.burley@collabora.com>
Tested-by: Valentine Burley <valentine.burley@collabora.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20260307115955.369455-1-valentine.burley@collabora.com
Link: https://patch.msgid.link/20260307191151.3781182-1-maz@kernel.org
Cc: stable@vger.kernel.org
|
|
Although ID register sanitisation prevents guests from seeing the
feature, adding this check to the helper allows the compiler to entirely
eliminate S1POE-specific code paths (such as context switching POR_EL1)
when the host kernel is compiled without support (CONFIG_ARM64_POE is
disabled).
This aligns with the pattern used for other optional features like SVE
(kvm_has_sve()) and FPMR (kvm_has_fpmr()), ensuring no POE logic if the
host lacks support, regardless of the guest configuration state.
Signed-off-by: Fuad Tabba <tabba@google.com>
Link: https://patch.msgid.link/20260213143815.1732675-3-tabba@google.com
Signed-off-by: Marc Zyngier <maz@kernel.org>
|
|
* kvm-arm64/misc-6.20:
: .
: Misc KVM/arm64 changes for 6.20
:
: - Trivial FPSIMD cleanups
:
: - Calculate hyp VA size only once, avoiding potential mapping issues when
: VA bits is smaller than expected
:
: - Silence sparse warning for the HYP stack base
:
: - Fix error checking when handling FFA_VERSION
:
: - Add missing trap configuration for DBGWCR15_EL1
:
: - Don't try to deal with nested S2 when NV isn't enabled for a guest
:
: - Various spelling fixes
: .
KVM: arm64: nv: Avoid NV stage-2 code when NV is not supported
KVM: arm64: Fix various comments
KVM: arm64: nv: Add trap config for DBGWCR<15>_EL1
KVM: arm64: Fix error checking for FFA_VERSION
KVM: arm64: Fix missing <asm/stackpage/nvhe.h> include
KVM: arm64: Calculate hyp VA size only once
KVM: arm64: Remove ISB after writing FPEXC32_EL2
KVM: arm64: Shuffle KVM_HOST_DATA_FLAG_* indices
KVM: arm64: Fix comment in fpsimd_lazy_switch_to_host()
Signed-off-by: Marc Zyngier <maz@kernel.org>
|
|
* kvm-arm64/resx:
: .
: Add infrastructure to deal with the full gamut of RESx bits
: for NV. As a result, it is now possible to have the expected
: semantics for some bits such as SCTLR_EL2.SPAN.
: .
KVM: arm64: Add debugfs file dumping computed RESx values
KVM: arm64: Add sanitisation to SCTLR_EL2
KVM: arm64: Remove all traces of HCR_EL2.MIOCNCE
KVM: arm64: Remove all traces of FEAT_TME
KVM: arm64: Simplify handling of full register invalid constraint
KVM: arm64: Get rid of FIXED_VALUE altogether
KVM: arm64: Simplify handling of HCR_EL2.E2H RESx
KVM: arm64: Move RESx into individual register descriptors
KVM: arm64: Add RES1_WHEN_E2Hx constraints as configuration flags
KVM: arm64: Add REQUIRES_E2H1 constraint as configuration flags
KVM: arm64: Simplify FIXED_VALUE handling
KVM: arm64: Convert HCR_EL2.RW to AS_RES1
KVM: arm64: Correctly handle SCTLR_EL1 RES1 bits for unsupported features
KVM: arm64: Allow RES1 bits to be inferred from configuration
KVM: arm64: Inherit RESx bits from FGT register descriptors
KVM: arm64: Extend unified RESx handling to runtime sanitisation
KVM: arm64: Introduce data structure tracking both RES0 and RES1 bits
KVM: arm64: Introduce standalone FGU computing primitive
KVM: arm64: Remove duplicate configuration for SCTLR_EL1.{EE,E0E}
arm64: Convert SCTLR_EL2 to sysreg infrastructure
Signed-off-by: Marc Zyngier <maz@kernel.org>
|
|
* kvm-arm64/debugfs-fixes:
: .
: Cleanup of the debugfs iterator, which are way more complicated
: than they ought to be, courtesy of Fuad Tabba. From the cover letter:
:
: "This series refactors the debugfs implementations for `idregs` and
: `vgic-state` to use standard `seq_file` iterator patterns.
:
: The existing implementations relied on storing iterator state within
: global VM structures (`kvm_arch` and `vgic_dist`). This approach
: prevented concurrent reads of the debugfs files (returning -EBUSY) and
: created improper dependencies between transient file operations and
: long-lived VM state."
: .
KVM: arm64: Use standard seq_file iterator for vgic-debug debugfs
KVM: arm64: Reimplement vgic-debug XArray iteration
KVM: arm64: Use standard seq_file iterator for idregs debugfs
Signed-off-by: Marc Zyngier <maz@kernel.org>
|
|
Sanitise SCTLR_EL2 the usual way. The most important aspect of
this is that we benefit from SCTLR_EL2.SPAN being RES1 when
HCR_EL2.E2H==0.
Reviewed-by: Fuad Tabba <tabba@google.com>
Tested-by: Fuad Tabba <tabba@google.com>
Link: https://patch.msgid.link/20260202184329.2724080-20-maz@kernel.org
Signed-off-by: Marc Zyngier <maz@kernel.org>
|
|
Add a new helper to retrieve the RESx values for a given system
register, and use it for the runtime sanitisation.
This results in slightly better code generation for a fairly hot
path in the hypervisor, and additionally covers all sanitised
registers in all conditions, not just the VNCR-based ones.
Reviewed-by: Fuad Tabba <tabba@google.com>
Tested-by: Fuad Tabba <tabba@google.com>
Link: https://patch.msgid.link/20260202184329.2724080-6-maz@kernel.org
Signed-off-by: Marc Zyngier <maz@kernel.org>
|
|
We have so far mostly tracked RES0 bits, but only made a few attempts
at being just as strict for RES1 bits (probably because they are both
rarer and harder to handle).
Start scratching the surface by introducing a data structure tracking
RES0 and RES1 bits at the same time.
Note that contrary to the usual idiom, this structure is mostly passed
around by value -- the ABI handles it nicely, and the resulting code is
much nicer.
Reviewed-by: Fuad Tabba <tabba@google.com>
Tested-by: Fuad Tabba <tabba@google.com>
Link: https://patch.msgid.link/20260202184329.2724080-5-maz@kernel.org
Signed-off-by: Marc Zyngier <maz@kernel.org>
|
|
The current implementation uses `idreg_debugfs_iter` in `struct
kvm_arch` to track the sequence position. This effectively makes the
iterator shared across all open file descriptors for the VM.
This approach has significant drawbacks:
- It enforces mutual exclusion, preventing concurrent reads of the
debugfs file (returning -EBUSY).
- It relies on storing transient iterator state in the long-lived VM
structure (`kvm_arch`).
- The use of `u8` for the iterator index imposes an implicit limit of
255 registers. While not currently exceeded, this is fragile against
future architectural growth. Switching to `loff_t` eliminates this
overflow risk.
Refactor the implementation to use the standard `seq_file` iterator.
Instead of storing state in `kvm_arch`, rely on the `pos` argument
passed to the `start` and `next` callbacks, which tracks the logical
index specific to the file descriptor.
This change enables concurrent access and eliminates the
`idreg_debugfs_iter` field from `struct kvm_arch`.
Signed-off-by: Fuad Tabba <tabba@google.com>
Link: https://patch.msgid.link/20260202085721.3954942-2-tabba@google.com
Signed-off-by: Marc Zyngier <maz@kernel.org>
|
|
Use tab instead of whitespaces, as well as 2 minor typo fixes.
Signed-off-by: Zenghui Yu (Huawei) <zenghui.yu@linux.dev>
Link: https://patch.msgid.link/20260128075208.23024-1-zenghui.yu@linux.dev
Signed-off-by: Marc Zyngier <maz@kernel.org>
|
|
* kvm-arm64/pkvm-features-6.20:
: .
: pKVM guest feature trapping fixes, courtesy of Fuad Tabba.
: .
KVM: arm64: Prevent host from managing timer offsets for protected VMs
KVM: arm64: Check whether a VM IOCTL is allowed in pKVM
KVM: arm64: Track KVM IOCTLs and their associated KVM caps
KVM: arm64: Do not allow KVM_CAP_ARM_MTE for any guest in pKVM
KVM: arm64: Include VM type when checking VM capabilities in pKVM
KVM: arm64: Introduce helper to calculate fault IPA offset
KVM: arm64: Fix MTE flag initialization for protected VMs
KVM: arm64: Fix Trace Buffer trap polarity for protected VMs
KVM: arm64: Fix Trace Buffer trapping for protected VMs
Signed-off-by: Marc Zyngier <maz@kernel.org>
|
|
Track KVM IOCTLs (VM IOCTLs for now), and the associated KVM capability
that enables that IOCTL. Add a function that performs the lookup.
This will be used by CoCo VM Hypervisors (e.g., pKVM) to determine
whether a particular KVM IOCTL is allowed for its VMs.
Suggested-by: Oliver Upton <oupton@kernel.org>
Signed-off-by: Fuad Tabba <tabba@google.com>
[maz: don't expose KVM_CAP_BASIC to userspace, and rely on NR_VCPUS
as a proxy for this]
Link: https://patch.msgid.link/20251211104710.151771-8-tabba@google.com
Signed-off-by: Marc Zyngier <maz@kernel.org>
|
|
None of the registers we manage in the feature dependency infrastructure
so far has any RES1 bit. This is about to change, as VTCR_EL2 has
its bit 31 being RES1.
In order to not fail the consistency checks by not describing a bit,
add RES1 bits to the set of immutable bits. This requires some extra
surgery for the FGT handling, as we now need to track RES1 bits there
as well.
There are no RES1 FGT bits *yet*. Watch this space.
Reviewed-by: Fuad Tabba <tabba@google.com>
Tested-by: Sascha Bischoff <sascha.bischoff@arm.com>
Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com>
Tested-by: Fuad Tabba <tabba@google.com>
Link: https://patch.msgid.link/20251210173024.561160-5-maz@kernel.org
Signed-off-by: Marc Zyngier <maz@kernel.org>
|
|
There's a gap in the KVM_HOST_DATA_FLAG_* indices since the removal of
KVM_HOST_DATA_FLAG_HOST_SVE_ENABLED and
KVM_HOST_DATA_FLAG_HOST_SME_ENABLED in commits:
* 459f059be702 ("KVM: arm64: Remove VHE host restore of CPACR_EL1.ZEN")
* 407a99c4654e ("KVM: arm64: Remove VHE host restore of CPACR_EL1.SMEN")
Shuffle the indices to remove the gap, as Oliver requested at the time of
the removals:
https://lore.kernel.org/linux-arm-kernel/Z6qC4qn47ONfDCSH@linux.dev/
There should be no functional change as a result of this patch.
Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Fuad Tabba <tabba@google.com>
Cc: Marc Zyngier <maz@kernel.org>
Cc: Mark Brown <broonie@kernel.org>
Cc: Oliver Upton <oliver.upton@linux.dev>
Cc: Will Deacon <will@kernel.org>
Tested-by: Fuad Tabba <tabba@google.com>
Reviewed-by: Fuad Tabba <tabba@google.com>
Reviewed-by: Mark Brown <broonie@kernel.org>
Link: https://patch.msgid.link/20260106173707.3292074-3-mark.rutland@arm.com
Signed-off-by: Marc Zyngier <maz@kernel.org>
|
|
* kvm-arm64/vgic-lr-overflow: (50 commits)
: Support for VGIC LR overflows, courtesy of Marc Zyngier
:
: Address deficiencies in KVM's GIC emulation when a vCPU has more active
: IRQs than can be represented in the VGIC list registers. Sort the AP
: list to prioritize inactive and pending IRQs, potentially spilling
: active IRQs outside of the LRs.
:
: Handle deactivation of IRQs outside of the LRs for both EOImode=0/1,
: which involves special consideration for SPIs being deactivated from a
: different vCPU than the one that acked it.
KVM: arm64: Convert ICH_HCR_EL2_TDIR cap to EARLY_LOCAL_CPU_FEATURE
KVM: arm64: selftests: vgic_irq: Add timer deactivation test
KVM: arm64: selftests: vgic_irq: Add Group-0 enable test
KVM: arm64: selftests: vgic_irq: Add asymmetric SPI deaectivation test
KVM: arm64: selftests: vgic_irq: Perform EOImode==1 deactivation in ack order
KVM: arm64: selftests: vgic_irq: Remove LR-bound limitation
KVM: arm64: selftests: vgic_irq: Exclude timer-controlled interrupts
KVM: arm64: selftests: vgic_irq: Change configuration before enabling interrupt
KVM: arm64: selftests: vgic_irq: Fix GUEST_ASSERT_IAR_EMPTY() helper
KVM: arm64: selftests: gic_v3: Disable Group-0 interrupts by default
KVM: arm64: selftests: gic_v3: Add irq group setting helper
KVM: arm64: GICv2: Always trap GICV_DIR register
KVM: arm64: GICv2: Handle deactivation via GICV_DIR traps
KVM: arm64: GICv2: Handle LR overflow when EOImode==0
KVM: arm64: GICv3: Force exit to sync ICH_HCR_EL2.En
KVM: arm64: GICv3: nv: Plug L1 LR sync into deactivation primitive
KVM: arm64: GICv3: nv: Resync LRs/VMCR/HCR early for better MI emulation
KVM: arm64: GICv3: Avoid broadcast kick on CPUs lacking TDIR
KVM: arm64: GICv3: Handle in-LR deactivation when possible
KVM: arm64: GICv3: Add SPI tracking to handle asymmetric deactivation
...
Signed-off-by: Oliver Upton <oupton@kernel.org>
|
|
Deactivation via ICV_DIR_EL1 is both relatively straightforward
(we have the interrupt that needs deactivation) and really awkward.
The main issue is that the interrupt may either be in an LR on
another CPU, or ourside of any LR.
In the former case, we process the deactivation is if ot was
a write to GICD_CACTIVERn, which is already implemented as a big
hammer IPI'ing all vcpus. In the latter case, we just perform
a normal deactivation, similar to what we do for EOImode==0.
Another annoying aspect is that we need to tell the CPU owning
the interrupt that its ap_list needs laudering. We use a brand new
vcpu request to that effect.
Note that this doesn't address deactivation via the GICV MMIO view,
which will be taken care of in a later change.
Tested-by: Fuad Tabba <tabba@google.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Tested-by: Mark Brown <broonie@kernel.org>
Link: https://msgid.link/20251120172540.2267180-29-maz@kernel.org
Signed-off-by: Oliver Upton <oupton@kernel.org>
|
|
When APEI fails to handle a stage-2 synchronous external abort (SEA),
today KVM injects an asynchronous SError to the VCPU then resumes it,
which usually results in unpleasant guest kernel panic.
One major situation of guest SEA is when vCPU consumes recoverable
uncorrected memory error (UER). Although SError and guest kernel panic
effectively stops the propagation of corrupted memory, guest may
re-use the corrupted memory if auto-rebooted; in worse case, guest
boot may run into poisoned memory. So there is room to recover from
an UER in a more graceful manner.
Alternatively KVM can redirect the synchronous SEA event to VMM to
- Reduce blast radius if possible. VMM can inject a SEA to VCPU via
KVM's existing KVM_SET_VCPU_EVENTS API. If the memory poison
consumption or fault is not from guest kernel, blast radius can be
limited to the triggering thread in guest userspace, so VM can
keep running.
- Allow VMM to protect from future memory poison consumption by
unmapping the page from stage-2, or to interrupt guest of the
poisoned page so guest kernel can unmap it from stage-1 page table.
- Allow VMM to track SEA events that VM customers care about, to restart
VM when certain number of distinct poison events have happened,
to provide observability to customers in log management UI.
Introduce an userspace-visible feature to enable VMM handle SEA:
- KVM_CAP_ARM_SEA_TO_USER. As the alternative fallback behavior
when host APEI fails to claim a SEA, userspace can opt in this new
capability to let KVM exit to userspace during SEA if it is not
owned by host.
- KVM_EXIT_ARM_SEA. A new exit reason is introduced for this.
KVM fills kvm_run.arm_sea with as much as possible information about
the SEA, enabling VMM to emulate SEA to guest by itself.
- Sanitized ESR_EL2. The general rule is to keep only the bits
useful for userspace and relevant to guest memory.
- Flags indicating if faulting guest physical address is valid.
- Faulting guest physical and virtual addresses if valid.
Signed-off-by: Jiaqi Yan <jiaqiyan@google.com>
Co-developed-by: Oliver Upton <oliver.upton@linux.dev>
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
Link: https://msgid.link/20251013185903.1372553-2-jiaqiyan@google.com
Signed-off-by: Oliver Upton <oupton@kernel.org>
|
|
To date KVM has used the fine-grained traps for the sake of UNDEF
enforcement (so-called FGUs), meaning the constituent parts could be
computed on a per-VM basis and folded into the effective value when
programmed.
Prepare for traps changing based on the vCPU context by computing the
whole mess of them at vcpu_load(). Aggressively inline all the helpers
to preserve the build-time checks that were there before.
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
Reviewed-by: Joey Gouly <joey.gouly@arm.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm into HEAD
KVM/arm64 updates for 6.18
- Add support for FF-A 1.2 as the secure memory conduit for pKVM,
allowing more registers to be used as part of the message payload.
- Change the way pKVM allocates its VM handles, making sure that the
privileged hypervisor is never tricked into using uninitialised
data.
- Speed up MMIO range registration by avoiding unnecessary RCU
synchronisation, which results in VMs starting much quicker.
- Add the dump of the instruction stream when panic-ing in the EL2
payload, just like the rest of the kernel has always done. This will
hopefully help debugging non-VHE setups.
- Add 52bit PA support to the stage-1 page-table walker, and make use
of it to populate the fault level reported to the guest on failing
to translate a stage-1 walk.
- Add NV support to the GICv3-on-GICv5 emulation code, ensuring
feature parity for guests, irrespective of the host platform.
- Fix some really ugly architecture problems when dealing with debug
in a nested VM. This has some bad performance impacts, but is at
least correct.
- Add enough infrastructur |