aboutsummaryrefslogtreecommitdiff
path: root/drivers/hv
AgeCommit message (Collapse)AuthorFilesLines
2025-11-15Drivers: hv: Functions for setting up and tearing down the paravisor SynICRoman Kisel1-12/+126
The confidential VMBus runs with the paravisor SynIC and requires configuring it with the paravisor. Add the functions for configuring the paravisor SynIC. Update overall SynIC initialization logic to initialize the SynIC if it is present. Finally, break out SynIC interrupt enable/disable code into separate functions so that SynIC interrupts can be enabled or disabled via the paravisor instead of the hypervisor if the paravisor SynIC is present. Signed-off-by: Roman Kisel <romank@linux.microsoft.com> Reviewed-by: Michael Kelley <mhklinux@outlook.com> Signed-off-by: Wei Liu <wei.liu@kernel.org>
2025-11-15Drivers: hv: Rename the SynIC enable and disable routinesRoman Kisel4-11/+12
The confidential VMBus requires support for the both hypervisor facing SynIC and the paravisor one. Rename the functions that enable and disable SynIC with the hypervisor. No functional changes. Signed-off-by: Roman Kisel <romank@linux.microsoft.com> Reviewed-by: Tianyu Lan <tiala@microsoft.com> Reviewed-by: Michael Kelley <mhklinux@outlook.com> Signed-off-by: Wei Liu <wei.liu@kernel.org>
2025-11-15Drivers: hv: Check message and event pages for non-NULL before iounmap()Roman Kisel1-4/+8
It might happen that some hyp SynIC pages aren't allocated. Check for that and only then call iounmap(). Signed-off-by: Roman Kisel <romank@linux.microsoft.com> Reviewed-by: Tianyu Lan <tiala@microsoft.com> Reviewed-by: Michael Kelley <mhklinux@outlook.com> Signed-off-by: Wei Liu <wei.liu@kernel.org>
2025-11-15Drivers: hv: remove stale commentRoman Kisel1-5/+1
The comment about the x2v shim is ancient and long since incorrect. Remove the incorrect comment. Signed-off-by: Roman Kisel <romank@linux.microsoft.com> Reviewed-by: Tianyu Lan <tiala@microsoft.com> Reviewed-by: Michael Kelley <mhklinux@outlook.com> Signed-off-by: Wei Liu <wei.liu@kernel.org>
2025-11-15Drivers: hv: Post messages through the confidential VMBus if availableRoman Kisel1-1/+10
When the confidential VMBus is available, the guest should post messages to the paravisor. Update hv_post_message() to post messages to the paravisor rather than through GHCB or TD calls. Signed-off-by: Roman Kisel <romank@linux.microsoft.com> Reviewed-by: Tianyu Lan <tiala@microsoft.com> Reviewed-by: Michael Kelley <mhklinux@outlook.com> Signed-off-by: Wei Liu <wei.liu@kernel.org>
2025-11-15Drivers: hv: Allocate the paravisor SynIC pages when requiredRoman Kisel2-90/+112
Confidential VMBus requires interacting with two SynICs -- one provided by the host hypervisor, and one provided by the paravisor. Each SynIC requires its own message and event pages. Refactor and extend the existing code to add allocating and freeing the message and event pages for the paravisor SynIC when it is present. Signed-off-by: Roman Kisel <romank@linux.microsoft.com> Reviewed-by: Tianyu Lan <tiala@microsoft.com> Reviewed-by: Michael Kelley <mhklinux@outlook.com> Signed-off-by: Wei Liu <wei.liu@kernel.org>
2025-11-15Drivers: hv: Rename fields for SynIC message and event pagesRoman Kisel6-45/+45
Confidential VMBus requires interacting with two SynICs -- one provided by the host hypervisor, and one provided by the paravisor. Each SynIC requires its own message and event pages. Rename the existing host-accessible SynIC message and event pages with the "hyp_" prefix to clearly distinguish them from the paravisor ones. The field name is also changed in mshv_root.* for consistency. No functional changes. Signed-off-by: Roman Kisel <romank@linux.microsoft.com> Reviewed-by: Tianyu Lan <tiala@microsoft.com> Reviewed-by: Michael Kelley <mhklinux@outlook.com> Signed-off-by: Wei Liu <wei.liu@kernel.org>
2025-11-15arch/x86: mshyperv: Trap on access for some synthetic MSRsRoman Kisel1-0/+5
hv_set_non_nested_msr() has special handling for SINT MSRs when a paravisor is present. In addition to updating the MSR on the host, the mirror MSR in the paravisor is updated, including with the proxy bit. But with Confidential VMBus, the proxy bit must not be used, so add a special case to skip it. Signed-off-by: Roman Kisel <romank@linux.microsoft.com> Reviewed-by: Alok Tiwari <alok.a.tiwari@oracle.com> Reviewed-by: Tianyu Lan <tiala@microsoft.com> Reviewed-by: Michael Kelley <mhklinux@outlook.com> Signed-off-by: Wei Liu <wei.liu@kernel.org>
2025-11-15arch: hyperv: Get/set SynIC synth.registers via paravisorRoman Kisel2-0/+55
The existing Hyper-V wrappers for getting and setting MSRs are hv_get/set_msr(). Via hv_get/set_non_nested_msr(), they detect when running in a CoCo VM with a paravisor, and use the TDX or SNP guest-host communication protocol to bypass the paravisor and go directly to the host hypervisor for SynIC MSRs. The "set" function also implements the required special handling for the SINT MSRs. Provide functions that allow manipulating the SynIC registers through the paravisor. Move vmbus_signal_eom() to a more appropriate location (which also avoids breaking KVM). Signed-off-by: Roman Kisel <romank@linux.microsoft.com> Reviewed-by: Alok Tiwari <alok.a.tiwari@oracle.com> Reviewed-by: Michael Kelley <mhklinux@outlook.com> Signed-off-by: Wei Liu <wei.liu@kernel.org>
2025-11-15Drivers: hv: VMBus protocol version 6.0Roman Kisel2-0/+14
The confidential VMBus is supported starting from the protocol version 6.0 onwards. Provide the required definitions. No functional changes. Signed-off-by: Roman Kisel <romank@linux.microsoft.com> Reviewed-by: Alok Tiwari <alok.a.tiwari@oracle.com> Reviewed-by: Michael Kelley <mhklinux@outlook.com> Signed-off-by: Wei Liu <wei.liu@kernel.org>
2025-11-15drivers: hv: Allow vmbus message synic interrupt injected from Hyper-VTianyu Lan2-0/+7
When Secure AVIC is enabled, VMBus driver should call x2apic Secure AVIC interface to allow Hyper-V to inject VMBus message interrupt. Reviewed-by: Michael Kelley <mhklinux@outlook.com> Reviewed-by: Neeraj Upadhyay <Neeraj.Upadhyay@amd.com> Signed-off-by: Tianyu Lan <tiala@microsoft.com> Signed-off-by: Wei Liu <wei.liu@kernel.org>
2025-11-15mshv: Fix deposit memory in MSHV_ROOT_HVCALLNuno Das Neves1-28/+30
When the MSHV_ROOT_HVCALL ioctl is executing a hypercall, and gets HV_STATUS_INSUFFICIENT_MEMORY, it deposits memory and then returns -EAGAIN to userspace. The expectation is that the VMM will retry. However, some VMM code in the wild doesn't do this and simply fails. Rather than force the VMM to retry, change the ioctl to deposit memory on demand and immediately retry the hypercall as is done with all the other hypercall helper functions. In addition to making the ioctl easier to use, removing the need for multiple syscalls improves performance. There is a complication: unlike the other hypercall helper functions, in MSHV_ROOT_HVCALL the input is opaque to the kernel. This is problematic for rep hypercalls, because the next part of the input list can't be copied on each loop after depositing pages (this was the original reason for returning -EAGAIN in this case). Introduce hv_do_rep_hypercall_ex(), which adds a 'rep_start' parameter. This solves the issue, allowing the deposit loop in MSHV_ROOT_HVCALL to restart a rep hypercall after depositing pages partway through. Fixes: 621191d709b1 ("Drivers: hv: Introduce mshv_root module to expose /dev/mshv to VMMs") Signed-off-by: Nuno Das Neves <nunodasneves@linux.microsoft.com> Reviewed-by: Michael Kelley <mhklinux@outlook.com> Signed-off-by: Wei Liu <wei.liu@kernel.org>
2025-11-15mshv: Fix VpRootDispatchThreadBlocked valueNuno Das Neves1-1/+1
This value in the VP stats page is used to track if the VP can be dispatched for execution when there are no fast interrupts injected. The original value of 201 was used in a version of the hypervisor which did not ship. It was subsequently changed to 202 so that is the correct value. Signed-off-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com> Signed-off-by: Nuno Das Neves <nunodasneves@linux.microsoft.com> Signed-off-by: Wei Liu <wei.liu@kernel.org>
2025-11-14syscore: Pass context data to callbacksThierry Reding1-5/+9
Several drivers can benefit from registering per-instance data along with the syscore operations. To achieve this, move the modifiable fields out of the syscore_ops structure and into a separate struct syscore that can be registered with the framework. Add a void * driver data field for drivers to store contextual data that will be passed to the syscore ops. Acked-by: Rafael J. Wysocki (Intel) <rafael@kernel.org> Signed-off-by: Thierry Reding <treding@nvidia.com>
2025-11-04rseq, virt: Retrigger RSEQ after vcpu_run()Thomas Gleixner1-0/+3
Hypervisors invoke resume_user_mode_work() before entering the guest, which clears TIF_NOTIFY_RESUME. The @regs argument is NULL as there is no user space context available to them, so the rseq notify handler skips inspecting the critical section, but updates the CPU/MM CID values unconditionally so that the eventual pending rseq event is not lost on the way to user space. This is a pointless exercise as the task might be rescheduled before actually returning to user space and it creates unnecessary work in the vcpu_run() loops. It's way more efficient to ignore that invocation based on @regs == NULL and let the hypervisors re-raise TIF_NOTIFY_RESUME after returning from the vcpu_run() loop before returning from the ioctl(). This ensures that a pending RSEQ update is not lost and the IDs are updated before returning to user space. Once the RSEQ handling is decoupled from TIF_NOTIFY_RESUME, this turns into a NOOP. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Acked-by: Sean Christopherson <seanjc@google.com> Link: https://patch.msgid.link/20251027084306.399495855@linutronix.de
2025-10-01Drivers: hv: Make CONFIG_HYPERV boolMukesh Rathor2-2/+2
With CONFIG_HYPERV and CONFIG_HYPERV_VMBUS separated, change CONFIG_HYPERV to bool from tristate. CONFIG_HYPERV now becomes the core Hyper-V hypervisor support, such as hypercalls, clocks/timers, Confidential Computing setup, PCI passthru, etc. that doesn't involve VMBus or VMBus devices. Signed-off-by: Mukesh Rathor <mrathor@linux.microsoft.com> Signed-off-by: Wei Liu <wei.liu@kernel.org>
2025-10-01Drivers: hv: Add CONFIG_HYPERV_VMBUS optionMukesh Rathor2-3/+10
At present VMBus driver is hinged off of CONFIG_HYPERV which entails lot of builtin code and encompasses too much. It's not always clear what depends on builtin hv code and what depends on VMBus. Setting CONFIG_HYPERV as a module and fudging the Makefile to switch to builtin adds even more confusion. VMBus is an independent module and should have its own config option. Also, there are scenarios like baremetal dom0/root where support is built in with CONFIG_HYPERV but without VMBus. Lastly, there are more features coming down that use CONFIG_HYPERV and add more dependencies on it. So, create a fine grained HYPERV_VMBUS option and update Kconfigs for dependency on VMBus. Signed-off-by: Mukesh Rathor <mrathor@linux.microsoft.com> Acked-by: Bjorn Helgaas <bhelgaas@google.com> # drivers/pci Signed-off-by: Wei Liu <wei.liu@kernel.org>
2025-09-30Drivers: hv: vmbus: Fix typos in vmbus_drv.cAlok Tiwari1-2/+2
Fix two minor typos in vmbus_drv.c: - Correct "reponsible" -> "responsible" in a comment. - Add missing newline in pr_err() message ("channeln" -> "channel\n"). These are cosmetic changes only and do not affect functionality. Signed-off-by: Alok Tiwari <alok.a.tiwari@oracle.com> Reviewed-by: Michael Kelley <mhklinux@outlook.com> Signed-off-by: Wei Liu <wei.liu@kernel.org>
2025-09-30Drivers: hv: vmbus: Fix sysfs output format for ring buffer indexAlok Tiwari1-2/+2
The sysfs attributes out_read_index and out_write_index in vmbus_drv.c currently use %d to print outbound.current_read_index and outbound.current_write_index. These fields are u32 values, so printing them with %d (signed) is not logically correct. Update the format specifier to %u to correctly match their type. No functional change, only fixes the sysfs output format. Signed-off-by: Alok Tiwari <alok.a.tiwari@oracle.com> Reviewed-by: Michael Kelley <mhklinux@outlook.com> Signed-off-by: Wei Liu <wei.liu@kernel.org>
2025-09-30Drivers: hv: vmbus: Clean up sscanf format specifier in target_cpu_store()Alok Tiwari1-1/+1
The target_cpu_store() function parses the target CPU from the sysfs buffer using sscanf(). The format string currently uses "%uu", which is redundant. The compiler ignores the extra "u", so there is no incorrect parsing at runtime. Update the format string to use "%u" for clarity and consistency. Signed-off-by: Alok Tiwari <alok.a.tiwari@oracle.com> Reviewed-by: Michael Kelley <mhklinux@outlook.com> Signed-off-by: Wei Liu <wei.liu@kernel.org>
2025-09-30x86/hyperv: Switch to msi_create_parent_irq_domain()Nam Cao1-0/+1
Move away from the legacy MSI domain setup, switch to use msi_create_parent_irq_domain(). While doing the conversion, I noticed that hv_irq_compose_msi_msg() is doing more than it is supposed to (composing message content). The interrupt allocation bits should be moved into hv_msi_domain_alloc(). However, I have no hardware to test this change, therefore I leave a TODO note. Signed-off-by: Nam Cao <namcao@linutronix.de> Acked-by: Thomas Gleixner <tglx@linutronix.de> Tested-by: Nuno Das Neves <nunodasneves@linux.microsoft.com> Reviewed-by: Nuno Das Neves <nunodasneves@linux.microsoft.com> Signed-off-by: Wei Liu <wei.liu@kernel.org>
2025-09-30mshv: Use common "entry virt" APIs to do work in root before running guestSean Christopherson4-50/+7
Use the kernel's common "entry virt" APIs to handle pending work prior to (re)entering guest mode, now that the virt APIs don't have a superfluous dependency on KVM. No functional change intended. Signed-off-by: Sean Christopherson <seanjc@google.com> Tested-by: Nuno Das Neves <nunodasneves@linux.microsoft.com> Reviewed-by: Nuno Das Neves <nunodasneves@linux.microsoft.com> Signed-off-by: Wei Liu <wei.liu@kernel.org>
2025-09-30mshv: Handle NEED_RESCHED_LAZY before transferring to guestSean Christopherson2-2/+3
Check for NEED_RESCHED_LAZY, not just NEED_RESCHED, prior to transferring control to a guest. Failure to check for lazy resched can unnecessarily delay rescheduling until the next tick when using a lazy preemption model. Note, ideally both the checking and processing of TIF bits would be handled in common code, to avoid having to keep three separate paths synchronized, but defer such cleanups to the future to keep the fix as standalone as possible. Cc: Nuno Das Neves <nunodasneves@linux.microsoft.com> Cc: Mukesh R <mrathor@linux.microsoft.com> Fixes: 621191d709b1 ("Drivers: hv: Introduce mshv_root module to expose /dev/mshv to VMMs") Signed-off-by: Sean Christopherson <seanjc@google.com> Tested-by: Nuno Das Neves <nunodasneves@linux.microsoft.com> Reviewed-by: Nuno Das Neves <nunodasneves@linux.microsoft.com> Signed-off-by: Wei Liu <wei.liu@kernel.org>
2025-09-08Drivers: hv: Simplify data structures for VMBus channel close messageMichael Kelley1-1/+1
struct vmbus_close_msg is used for sending the VMBus channel close message. It contains a struct vmbus_channel_msginfo, which has a flex array member at the end. The latter's presence in the middle of struct vmbus_close_msg causes warnings when built with -Wflex-array-member-not-at-end. But the struct vmbus_channel_msginfo is unused because the Hyper-V host does not send a response to the channel close message. So remove the struct vmbus_channel_msginfo. Then, since the only remaining field is struct vmbus_channel_close_channel, also remove the containing struct vmbus_close_msg and directly use struct vmbus_channel_close_channel. Besides eliminating unnecessary complexity, these changes resolve the -Wflex-array-member-not-at-end warnings. Signed-off-by: Michael Kelley <mhklinux@outlook.com> Reviewed-by: Tianyu Lan <tiala@microsoft.com> Signed-off-by: Wei Liu <wei.liu@kernel.org>
2025-09-08Drivers: hv: util: Cosmetic changes for hv_utils_transport.cAbhishek Tiwari1-5/+5
Fix issues reported by checkpatch.pl script for hv_utils_transport.c file - Update pr_warn() calls to use __func__ for consistent logging context. - else should follow close brace '}' No functional changes intended. Signed-off-by: Abhishek Tiwari <abhitiwari@linux.microsoft.com> Reviewed-by: Naman Jain <namjain@linux.microsoft.com> Signed-off-by: Wei Liu <wei.liu@kernel.org>
2025-09-08mshv: Add support for a new parent partition configurationNuno Das Neves2-22/+26
Detect booting as an "L1VH" partition. This is a new scenario very similar to root partition where the mshv_root driver can be used to create and manage guest partitions. It mostly works the same as root partition, but there are some differences in how various features are handled. hv_l1vh_partition() is introduced to handle these cases. Add hv_parent_partition() which returns true for either case, replacing some hv_root_partition() checks. Signed-off-by: Nuno Das Neves <nunodasneves@linux.microsoft.com> Acked-by: Wei Liu <wei.liu@kernel.org> Reviewed-by: Michael Kelley <mhklinux@outlook.com> Signed-off-by: Wei Liu <wei.liu@kernel.org>
2025-07-30Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvmLinus Torvalds1-0/+8
Pull kvm updates from Paolo Bonzini: "ARM: - Host driver for GICv5, the next generation interrupt controller for arm64, including support for interrupt routing, MSIs, interrupt translation and wired interrupts - Use FEAT_GCIE_LEGACY on GICv5 systems to virtualize GICv3 VMs on GICv5 hardware, leveraging the legacy VGIC interface - Userspace control of the 'nASSGIcap' GICv3 feature, allowing userspace to disable support for SGIs w/o an active state on hardware that previously advertised it unconditionally - Map supporting endpoints with cacheable memory attributes on systems with FEAT_S2FWB and DIC where KVM no longer needs to perform cache maintenance on the address range - Nested support for FEAT_RAS and FEAT_DoubleFault2, allowing the guest hypervisor to inject external aborts into an L2 VM and take traps of masked external aborts to the hypervisor - Convert more system register sanitization to the config-driven implementation - Fixes to the visibility of EL2 registers, namely making VGICv3 system registers accessible through the VGIC device instead of the ONE_REG vCPU ioctls - Various cleanups and minor fixes LoongArch: - Add stat information for in-kernel irqchip - Add tracepoints for CPUCFG and CSR emulation exits - Enhance in-kernel irqchip emulation - Various cleanups RISC-V: - Enable ring-based dirty memory tracking - Improve perf kvm stat to report interrupt events - Delegate illegal instruction trap to VS-mode - MMU improvements related to upcoming nested virtualization s390x - Fixes x86: - Add CONFIG_KVM_IOAPIC for x86 to allow disabling support for I/O APIC, PIC, and PIT emulation at compile time - Share device posted IRQ code between SVM and VMX and harden it against bugs and runtime errors - Use vcpu_idx, not vcpu_id, for GA log tag/metadata, to make lookups O(1) instead of O(n) - For MMIO stale data mitigation, track whether or not a vCPU has access to (host) MMIO based on whether the page tables have MMIO pfns mapped; using VFIO is prone to false negatives - Rework the MSR interception code so that the SVM and VMX APIs are more or less identical - Recalculate all MSR intercepts from scratch on MSR filter changes, instead of maintaining shadow bitmaps - Advertise support for LKGS (Load Kernel GS base), a new instruction that's loosely related to FRED, but is supported and enumerated independently - Fix a user-triggerable WARN that syzkaller found by setting the vCPU in INIT_RECEIVED state (aka wait-for-SIPI), and then putting the vCPU into VMX Root Mode (post-VMXON). Trying to detect every possible path leading to architecturally forbidden states is hard and even risks breaking userspace (if it goes from valid to valid state but passes through invalid states), so just wait until KVM_RUN to detect that the vCPU state isn't allowed - Add KVM_X86_DISABLE_EXITS_APERFMPERF to allow disabling interception of APERF/MPERF reads, so that a "properly" configured VM can access APERF/MPERF. This has many caveats (APERF/MPERF cannot be zeroed on vCPU creation or saved/restored on suspend and resume, or preserved over thread migration let alone VM migration) but can be useful whenever you're interested in letting Linux guests see the effective physical CPU frequency in /proc/cpuinfo - Reject KVM_SET_TSC_KHZ for vm file descriptors if vCPUs have been created, as there's no known use case for changing the default frequency for other VM types and it goes counter to the very reason why the ioctl was added to the vm file descriptor. And also, there would be no way to make it work for confidential VMs with a "secure" TSC, so kill two birds with one stone - Dynamically allocation the shadow MMU's hashed page list, and defer allocating the hashed list until it's actually needed (the TDP MMU doesn't use the list) - Extract many of KVM's helpers for accessing architectural local APIC state to common x86 so that they can be shared by guest-side code for Secure AVIC - Various cleanups and fixes x86 (Intel): - Preserve the host's DEBUGCTL.FREEZE_IN_SMM when running the guest. Failure to honor FREEZE_IN_SMM can leak host state into guests - Explicitly check vmcs12.GUEST_DEBUGCTL on nested VM-Enter to prevent L1 from running L2 with features that KVM doesn't support, e.g. BTF x86 (AMD): - WARN and reject loading kvm-amd.ko instead of panicking the kernel if the nested SVM MSRPM offsets tracker can't handle an MSR (which is pretty much a static condition and therefore should never happen, but still) - Fix a variety of flaws and bugs in the AVIC device posted IRQ code - Inhibit AVIC if a vCPU's ID is too big (relative to what hardware supports) instead of rejecting vCPU creation - Extend enable_ipiv module param support to SVM, by simply leaving IsRunning clear in the vCPU's physical ID table entry - Disable IPI virtualization, via enable_ipiv, if the CPU is affected by erratum #1235, to allow (safely) enabling AVIC on such CPUs - Request GA Log interrupts if and only if the target vCPU is blocking, i.e. only if KVM needs a notification in order to wake the vCPU - Intercept SPEC_CTRL on AMD if the MSR shouldn't exist according to the vCPU's CPUID model - Accept any SNP policy that is accepted by the firmware with respect to SMT and single-socket restrictions. An incompatible policy doesn't put the kernel at risk in any way, so there's no reason for KVM to care - Drop a superfluous WBINVD (on all CPUs!) when destroying a VM and use WBNOINVD instead of WBINVD when possible for SEV cache maintenance - When reclaiming memory from an SEV guest, only do cache flushes on CPUs that have ever run a vCPU for the guest, i.e. don't flush the caches for CPUs that can't possibly have cache lines with dirty, encrypted data Generic: - Rework irqbypass to track/match producers and consumers via an xarray instead of a linked list. Using a linked list leads to O(n^2) insertion times, which is hugely problematic for use cases that create large numbers of VMs. Such use cases typically don't actually use irqbypass, but eliminating the pointless registration is a future problem to solve as it likely requires new uAPI - Track irqbypass's "token" as "struct eventfd_ctx *" instead of a "void *", to avoid making a simple concept unnecessarily difficult to understand - Decouple device posted IRQs from VFIO device assignment, as binding a VM to a VFIO group is not a requirement for enabling device posted IRQs - Clean up and document/comment the irqfd assignment code - Disallow binding multiple irqfds to an eventfd with a priority waiter, i.e. ensure an eventfd is bound to at most one irqfd through the entire host, and add a selftest to verify eventfd:irqfd bindings are globally unique - Add a tracepoint for KVM_SET_MEMORY_ATTRIBUTES to help debug issues related to private <=> shared memory conversions - Drop guest_memfd's .getattr() implementation as the VFS layer will call generic_fillattr() if inode_operations.getattr is NULL - Fix issues with dirty ring harvesting where KVM doesn't bound the processing of entries in any way, which allows userspace to keep KVM in a tight loop indefinitely - Kill off kvm_arch_{start,end}_assignment() and x86's associated tracking, now that KVM no longer uses assigned_device_count as a heuristic for either irqbypass usage or MDS mitigation Selftests: - Fix a comment typo - Verify KVM is loaded when getting any KVM module param so that attempting to run a selftest without kvm.ko loaded results in a SKIP message about KVM not being loaded/enabled (versus some random parameter not existing) - Skip tests that hit EACCES when attempting to access a file, and print a "Root required?" help message. In most cases, the test just needs to be run with elevated permissions" * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (340 commits) Documentation: KVM: Use unordered list for pre-init VGIC registers RISC-V: KVM: Avoid re-acquiring memslot in kvm_riscv_gstage_map() RISC-V: KVM: Use find_vma_intersection() to search for intersecting VMAs RISC-V: perf/kvm: Add reporting of interrupt events RISC-V: KVM: Enable ring-based dirty memory tracking RISC-V: KVM: Fix inclusion of Smnpm in the guest ISA bitmap RISC-V: KVM: Delegate illegal instruction fault to VS mode RISC-V: KVM: Pass VMID as parameter to kvm_riscv_hfence_xyz() APIs RISC-V: KVM: Factor-out g-stage page table management RISC-V: KVM: Add vmid field to struct kvm_riscv_hfence RISC-V: KVM: Introduce struct kvm_gstage_mapping RISC-V: KVM: Factor-out MMU related declarations into separate headers RISC-V: KVM: Use ncsr_xyz() in kvm_riscv_vcpu_trap_redirect() RISC-V: KVM: Implement kvm_arch_flush_remote_tlbs_range() RISC-V: KVM: Don't flush TLB when PTE is unchanged RISC-V: KVM: Replace KVM_REQ_HFENCE_GVMA_VMID_ALL with KVM_REQ_TLB_FLUSH RISC-V: KVM: Rename and move kvm_riscv_local_tlb_sanitize() RISC-V: KVM: Drop the return value of kvm_riscv_vcpu_aia_init() RISC-V: KVM: Check kvm_riscv_vcpu_alloc_vector_context() return value KVM: arm64: selftests: Add FEAT_RAS EL2 registers to get-reg-list ...
2025-07-29Merge tag 'kvm-x86-irqs-6.17' of https://github.com/kvm-x86/linux into HEADPaolo Bonzini1-0/+8
KVM IRQ changes for 6.17 - Rework irqbypass to track/match producers and consumers via an xarray instead of a linked list. Using a linked list leads to O(n^2) insertion times, which is hugely problematic for use cases that create large numbers of VMs. Such use cases typically don't actually use irqbypass, but eliminating the pointless registration is a future problem to solve as it likely requires new uAPI. - Track irqbypass's "token" as "struct eventfd_ctx *" instead of a "void *", to avoid making a simple concept unnecessarily difficult to understand. - Add CONFIG_KVM_IOAPIC for x86 to allow disabling support for I/O APIC, PIC, and PIT emulation at compile time. - Drop x86's irq_comm.c, and move a pile of IRQ related code into irq.c. - Fix a variety of flaws and bugs in the AVIC device posted IRQ code. - Inhibited AVIC if a vCPU's ID is too big (relative to what hardware supports) instead of rejecting vCPU creation. - Extend enable_ipiv module param support to SVM, by simply leaving IsRunning clear in the vCPU's physical ID table entry. - Disable IPI virtualization, via enable_ipiv, if the CPU is affected by erratum #1235, to allow (safely) enabling AVIC on such CPUs. - Dedup x86's device posted IRQ code, as the vast majority of functionality can be shared verbatime between SVM and VMX. - Harden the device posted IRQ code against bugs and runtime errors. - Use vcpu_idx, not vcpu_id, for GA log tag/metadata, to make lookups O(1) instead of O(n). - Generate GA Log interrupts if and only if the target vCPU is blocking, i.e. only if KVM needs a notification in order to wake the vCPU. - Decouple device posted IRQs from VFIO device assignment, as binding a VM to a VFIO group is not a requirement for enabling device posted IRQs. - Clean up and document/comment the irqfd assignment code. - Disallow binding multiple irqfds to an eventfd with a priority waiter, i.e. ensure an eventfd is bound to at most one irqfd through the entire host, and add a selftest to verify eventfd:irqfd bindings are globally unique.
2025-07-28Merge tag 'pull-fd' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfsLinus Torvalds1-11/+3
Pull CLASS(fd) update from Al Viro: "A missing bit of commit 66635b077624 ("assorted variants of irqfd setup: convert to CLASS(fd)") from a year ago. mshv_eventfd would've been covered by that, but it had forked slightly before that series and got merged into mainline later" * tag 'pull-fd' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: mshv_eventfd: convert to CLASS(fd)
2025-07-16mshv_eventfd: convert to CLASS(fd)Al Viro1-11/+3
similar to 66635b077624 ("assorted variants of irqfd setup: convert to CLASS(fd)") a year ago... Acked-by: Wei Liu <wei.liu@kernel.org> Reviewed-by: Nuno Das Neves <nunodasneves@linux.microsoft.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2025-07-15Drivers: hv: Use nested hypercall for post message and signal eventNuno Das Neves2-3/+8
When running nested, these hypercalls must be sent to the L0 hypervisor or VMBus will fail. Remove hv_do_nested_hypercall() and hv_do_fast_nested_hypercall8() altogether and open-code these cases, since there are only 2 and all they do is add the nested bit. Signed-off-by: Nuno Das Neves <nunodasneves@linux.microsoft.com> Reviewed-by: Roman Kisel <romank@linux.microsoft.com> Reviewed-by: Michael Kelley <mhklinux@outlook.com> Link: https://lore.kernel.org/r/1752261532-7225-2-git-send-email-nunodasneves@linux.microsoft.com Signed-off-by: Wei Liu <wei.liu@kernel.org> Message-ID: <1752261532-7225-2-git-send-email-nunodasneves@linux.microsoft.com>
2025-07-09Drivers: hv: Fix warnings for missing export.h header inclusionNaman Jain6-0/+6
Fix below warning in Hyper-V drivers that comes when kernel is compiled with W=1 option. Include export.h in driver files to fix it. * warning: EXPORT_SYMBOL() is used, but #include <linux/export.h> is missing Signed-off-by: Naman Jain <namjain@linux.microsoft.com> Reviewed-by: Nuno Das Neves <nunodasneves@linux.microsoft.com> Reviewed-by: Saurabh Sengar <ssengar@linux.microsoft.com> Link: https://lore.kernel.org/r/20250611100459.92900-2-namjain@linux.microsoft.com Signed-off-by: Wei Liu <wei.liu@kernel.org> Message-ID: <20250611100459.92900-2-namjain@linux.microsoft.com>
2025-07-09Drivers: hv: Fix the check for HYPERVISOR_CALLBACK_VECTORNaman Jain1-4/+5
__is_defined(HYPERVISOR_CALLBACK_VECTOR) would return 1, only if HYPERVISOR_CALLBACK_VECTOR macro is defined as 1. However its value is 0xf3 and this leads to __is_defined() returning 0. The expectation was to just check whether this MACRO is defined or not and get 1 if it's defined. Replace __is_defined with #ifdef blocks instead to fix it. Fixes: 1dc5df133b98 ("Drivers: hv: vmbus: Get the IRQ number from DeviceTree") Cc: stable@kernel.org Signed-off-by: Naman Jain <namjain@linux.microsoft.com> Reviewed-by: Roman Kisel <romank@linux.microsoft.com> Link: https://lore.kernel.org/r/20250707084322.1763-1-namjain@linux.microsoft.com Signed-off-by: Wei Liu <wei.liu@kernel.org> Message-ID: <20250707084322.1763-1-namjain@linux.microsoft.com>
2025-07-09Drivers: hv: Select CONFIG_SYSFB only if EFI is enabledMichael Kelley1-1/+1
Commit 96959283a58d ("Drivers: hv: Always select CONFIG_SYSFB for Hyper-V guests") selects CONFIG_SYSFB for Hyper-V guests so that screen_info is available to the VMBus driver to get the location of the framebuffer in Generation 2 VMs. However, if CONFIG_HYPERV is enabled but CONFIG_EFI is not, a kernel link error results in ARM64 builds because screen_info is provided by the EFI firmware interface. While configuring an ARM64 Hyper-V guest without EFI isn't useful since EFI is required to boot, the configuration is still possible and the link error should be prevented. Fix this by making the selection of CONFIG_SYSFB conditional on CONFIG_EFI being defined. For Generation 1 VMs on x86/x64, which don't use EFI, the additional condition is OK because such VMs get the framebuffer information via a mechanism that doesn't use screen_info. Fixes: 96959283a58d ("Drivers: hv: Always select CONFIG_SYSFB for Hyper-V guests") Reported-by: Arnd Bergmann <arnd@arndb.de> Closes: https://lore.kernel.org/linux-hyperv/20250610091810.2638058-1-arnd@kernel.org/ Reported-by: kernel test robot <lkp@intel.com> Closes: https://lore.kernel.org/oe-kbuild-all/202506080820.1wmkQufc-lkp@intel.com/ Signed-off-by: Michael Kelley <mhklinux@outlook.com> Link: https://lore.kernel.org/stable/20250613230059.380483-1-mhklinux%40outlook.com Reviewed-by: Roman Kisel <romank@linux.microsoft.com> Link: https://lore.kernel.org/r/20250613230059.380483-1-mhklinux@outlook.com Signed-off-by: Wei Liu <wei.liu@kernel.org> Message-ID: <20250613230059.380483-1-mhklinux@outlook.com>
2025-06-23sched/wait: Drop WQ_FLAG_EXCLUSIVE from add_wait_queue_priority()Sean Christopherson1-0/+8
Drop the setting of WQ_FLAG_EXCLUSIVE from add_wait_queue_priority() and instead have callers manually add the flag prior to adding their structure to the queue. Blindly setting WQ_FLAG_EXCLUSIVE is flawed, as the nature of exclusive, priority waiters means that only the first waiter added will ever receive notifications. Pushing the flawed behavior to callers will allow fixing the problem one hypervisor at a time (KVM added the flawed API, and then KVM's code was copy+pasted nearly verbatim by Xen and Hyper-V), and will also allow for adding an API that provides true exclusivity, i.e. that guarantees at most one priority waiter is in the queue. Opportunistically add a comment in Hyper-V to call out the mess. Xen privcmd's irqfd_wakefup() doesn't actually operate in exclusive mode, i.e. can be "fixed" simply by dropping WQ_FLAG_EXCLUSIVE. And KVM is primed to switch to the aforementioned fully exclusive API, i.e. won't be carrying the flawed code for long. No functional change intended. Tested-by: K Prateek Nayak <kprateek.nayak@amd.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20250522235223.3178519-7-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-03Merge tag 'hyperv-next-signed-20250602' of ↵Linus Torvalds4-61/+140
git://git.kernel.org/pub/scm/linux/kernel/git/hyperv/linux Pull hyperv updates from Wei Liu: - Support for Virtual Trust Level (VTL) on arm64 (Roman Kisel) - Fixes for Hyper-V UIO driver (Long Li) - Fixes for Hyper-V PCI driver (Michael Kelley) - Select CONFIG_SYSFB for Hyper-V guests (Michael Kelley) - Documentation updates for Hyper-V VMBus (Michael Kelley) - Enhance logging for hv_kvp_daemon (Shradha Gupta) * tag 'hyperv-next-signed-20250602' of git://git.kernel.org/pub/scm/linux/kernel/git/hyperv/linux: (23 commits) Drivers: hv: Always select CONFIG_SYSFB for Hyper-V guests Drivers: hv: vmbus: Add comments about races with "channels" sysfs dir Documentation: hyperv: Update VMBus doc with new features and info PCI: hv: Remove unnecessary flex array in struct pci_packet Drivers: hv: Remove hv_alloc/free_* helpers Drivers: hv: Use kzalloc for panic page allocation uio_hv_generic: Align ring size to system page uio_hv_generic: Use correct size for interrupt and monitor pages Drivers: hv: Allocate interrupt and monitor pages aligned to system page boundary arch/x86: Provide the CPU number in the wakeup AP callback x86/hyperv: Fix APIC ID and VP index confusion in hv_snp_boot_ap() PCI: hv: Get vPCI MSI IRQ domain from DeviceTree ACPI: irq: Introduce acpi_get_gsi_dispatcher() Drivers: hv: vmbus: Introduce hv_get_vmbus_root_device() Drivers: hv: vmbus: Get the IRQ number from DeviceTree dt-bindings: microsoft,vmbus: Add interrupt and DMA coherence properties arm64, x86: hyperv: Report the VTL the system boots in arm64: hyperv: Initialize the Virtual Trust Level field Drivers: hv: Provide arch-neutral implementation of get_vtl() Drivers: hv: Enable VTL mode for arm64 ...
2025-05-29Merge tag 'driver-core-6.16-rc1' of ↵Linus Torvalds1-1/+1
git://git.kernel.org/pub/scm/linux/kernel/git/driver-core/driver-core Pull driver core updates from Greg KH: "Here are the driver core / kernfs changes for 6.16-rc1. Not a huge number of changes this development cycle, here's the summary of what is included in here: - kernfs locking tweaks, pushing some global locks down into a per-fs image lock - rust driver core and pci device bindings added for new features. - sysfs const work for bin_attributes. The final churn of switching away from and removing the transitional struct members, "read_new", "write_new" and "bin_attrs_new" will come after the merge window to avoid unnecesary merge conflicts. - auxbus device creation helpers added - fauxbus fix for creating sysfs files after the probe completed properly - other tiny updates for driver core things. All of these have been in linux-next for over a week with no reported issues" * tag 'driver-core-6.16-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/driver-core/driver-core: kernfs: Relax constraint in draining guard Documentation: embargoed-hardware-issues.rst: Remove myself drivers: hv: fix up const issue with vmbus_chan_bin_attrs firmware_loader: use SHA-256 library API instead of crypto_shash API docs: debugfs: do not recommend debugfs_remove_recursive PM: wakeup: Do not expose 4 device wakeup source APIs kernfs: switch global kernfs_rename_lock to per-fs lock kernfs: switch global kernfs_idr_lock to per-fs lock driver core: auxiliary bus: Fix IS_ERR() vs NULL mixup in __devm_auxiliary_device_create() sysfs: constify attribute_group::bin_attrs sysfs: constify bin_attribute argument of bin_attribute::read/write() software node: Correct a OOB check in software_node_get_reference_args() devres: simplify devm_kstrdup() using devm_kmemdup() platform: replace magic number with macro PLATFORM_DEVID_NONE component: do not try to unbind unbound components driver core: auxiliary bus: add device creation helpers driver core: faux: Add sysfs groups after probing
2025-05-23Drivers: hv: Always select CONFIG_SYSFB for Hyper-V guestsMichael Kelley1-0/+1
The Hyper-V host provides guest VMs with a range of MMIO addresses that guest VMBus drivers can use. The VMBus driver in Linux manages that MMIO space, and allocates portions to drivers upon request. As part of managing that MMIO space in a Generation 2 VM, the VMBus driver must reserve the portion of the MMIO space that Hyper-V has designated for the synthetic frame buffer, and not allocate this space to VMBus drivers other than graphics framebuffer drivers. The synthetic frame buffer MMIO area is described by the screen_info data structure that is passed to the Linux kernel at boot time, so the VMBus driver must access screen_info for Generation 2 VMs. (In Generation 1 VMs, the framebuffer MMIO space is communicated to the guest via a PCI pseudo-device, and access to screen_info is not needed.) In commit a07b50d80ab6 ("hyperv: avoid dependency on screen_info") the VMBus driver's access to screen_info is restricted to when CONFIG_SYSFB is enabled. CONFIG_SYSFB is typically enabled in kernels built for Hyper-V by virtue of having at least one of CONFIG_FB_EFI, CONFIG_FB_VESA, or CONFIG_SYSFB_SIMPLEFB enabled, so the restriction doesn't usually affect anything. But it's valid to have none of these enabled, in which case CONFIG_SYSFB is not enabled, and the VMBus driver is unable to properly reserve the framebuffer MMIO space for graphics framebuffer drivers. The framebuffer MMIO space may be assigned to some other VMBus driver, with undefined results. As an example, if a VM is using a PCI pass-thru NVMe controller to host the OS disk, the PCI NVMe controller is probed before any graphics devices, and the NVMe controller is assigned a portion of the framebuffer MMIO space. Hyper-V reports an error to Linux during the probe, and the OS disk fails to get setup. Then Linux fails to boot in the VM. Fix this by having CONFIG_HYPERV always select SYSFB. Then the VMBus driver in a Gen 2 VM can always reserve the MMIO space for the graphics framebuffer driver, and prevent the undefined behavior. But don't select SYSFB when building for HYPERV_VTL_MODE as VTLs other than VTL 0 don't have a framebuffer and aren't subject to the issue. Adding SYSFB in such cases is harmless, but would increase the image size for no purpose. Fixes: a07b50d80ab6 ("hyperv: avoid dependency on screen_info") Signed-off-by: Michael Kelley <mhklinux@outlook.com> Reviewed-by: Saurabh Sengar <ssengar@linux.microsoft.com> Link: https://lore.kernel.org/stable/20250520040143.6964-1-mhklinux%40outlook.com Link: https://lore.kernel.org/r/20250520040143.6964-1-mhklinux@outlook.com Signed-off-by: Wei Liu <wei.liu@kernel.org> Message-ID: <20250520040143.6964-1-mhklinux@outlook.com>
2025-05-23Drivers: hv: vmbus: Add comments about races with "channels" sysfs dirMichael Kelley1-2/+40
The VMBus driver code has some inherent races in the creation of the "channels" sysfs subdirectory and its per-channel numbered subdirectories. These races have not generally been recognized or understood. Add some comments to call them out. No code changes. Signed-off-by: Michael Kelley <mhklinux@outlook.com> Link: https://lore.kernel.org/r/20250514225508.52629-1-mhklinux@outlook.com Signed-off-by: Wei Liu <wei.liu@kernel.org> Message-ID: <20250514225508.52629-1-mhklinux@outlook.com>
2025-05-23Drivers: hv: Remove hv_alloc/free_* helpersLong Li1-39/+0
There are no users for those functions, remove them. Signed-off-by: Long Li <longli@microsoft.com> Reviewed-by: Michael Kelley <mhklinux@outlook.com> Link: https://lore.kernel.org/r/1746492997-4599-6-git-send-email-longli@linuxonhyperv.com Signed-off-by: Wei Liu <wei.liu@kernel.org> Message-ID: <1746492997-4599-6-git-send-email-longli@linuxonhyperv.com>
2025-05-23Drivers: hv: Use kzalloc for panic page allocationLong Li1-3/+3
To prepare for removal of hv_alloc_* and hv_free* functions, use kzalloc/kfree directly for panic reporting page. Signed-off-by: Long Li <longli@microsoft.com> Reviewed-by: Michael Kelley <mhklinux@outlook.com> Link: https://lore.kernel.org/r/1746492997-4599-5-git-send-email-longli@linuxonhyperv.com Signed-off-by: Wei Liu <wei.liu@kernel.org> Message-ID: <1746492997-4599-5-git-send-email-longli@linuxonhyperv.com>
2025-05-23Drivers: hv: Allocate interrupt and monitor pages aligned to system page ↵Long Li1-6/+17
boundary There are use cases that interrupt and monitor pages are mapped to user-mode through UIO, so they need to be system page aligned. Some Hyper-V allocation APIs introduced earlier broke those requirements. Fix this by using page allocation functions directly for interrupt and monitor pages. Cc: stable@vger.kernel.org Fixes: ca48739e59df ("Drivers: hv: vmbus: Move Hyper-V page allocator to arch neutral code") Signed-off-by: Long Li <longli@microsoft.com> Reviewed-by: Michael Kelley <mhklinux@outlook.com> Link: https://lore.kernel.org/r/1746492997-4599-2-git-send-email-longli@linuxonhyperv.com Signed-off-by: Wei Liu <wei.liu@kernel.org> Message-ID: <1746492997-4599-2-git-send-email-longli@linuxonhyperv.com>
2025-05-23Drivers: hv: vmbus: Introduce hv_get_vmbus_root_device()Roman Kisel1-8/+15
The ARM64 PCI code for hyperv needs to know the VMBus root device, and it is private. Provide a func