12 Sep, 2024

1 commit

  • commit 6fd28941447bf2c8ca0f26fda612a1cabc41663f upstream.

    Rescind offer handling relies on rescind callbacks for some of the
    resources cleanup, if they are registered. It does not unregister
    vmbus device for the primary channel closure, when callback is
    registered. Without it, next onoffer does not come, rescind flag
    remains set and device goes to unusable state.

    Add logic to unregister vmbus for the primary channel in rescind callback
    to ensure channel removal and relid release, and to ensure that next
    onoffer can be received and handled properly.

    Cc: stable@vger.kernel.org
    Fixes: ca3cda6fcf1e ("uio_hv_generic: add rescind support")
    Signed-off-by: Naman Jain
    Reviewed-by: Saurabh Sengar
    Link: https://lore.kernel.org/r/20240829071312.1595-3-namjain@linux.microsoft.com
    Signed-off-by: Greg Kroah-Hartman

    Naman Jain
     

17 May, 2024

3 commits

  • [ Upstream commit 30d18df6567be09c1433e81993e35e3da573ac48 ]

    In CoCo VMs it is possible for the untrusted host to cause
    set_memory_encrypted() or set_memory_decrypted() to fail such that an
    error is returned and the resulting memory is shared. Callers need to
    take care to handle these errors to avoid returning decrypted (shared)
    memory to the page allocator, which could lead to functional or security
    issues.

    The VMBus ring buffer code could free decrypted/shared pages if
    set_memory_decrypted() fails. Check the decrypted field in the struct
    vmbus_gpadl for the ring buffers to decide whether to free the memory.

    Signed-off-by: Michael Kelley
    Reviewed-by: Kuppuswamy Sathyanarayanan
    Acked-by: Kirill A. Shutemov
    Link: https://lore.kernel.org/r/20240311161558.1310-6-mhklinux@outlook.com
    Signed-off-by: Wei Liu
    Signed-off-by: Sasha Levin

    Michael Kelley
     
  • [ Upstream commit 211f514ebf1ef5de37b1cf6df9d28a56cfd242ca ]

    In CoCo VMs it is possible for the untrusted host to cause
    set_memory_encrypted() or set_memory_decrypted() to fail such that an
    error is returned and the resulting memory is shared. Callers need to
    take care to handle these errors to avoid returning decrypted (shared)
    memory to the page allocator, which could lead to functional or security
    issues.

    In order to make sure callers of vmbus_establish_gpadl() and
    vmbus_teardown_gpadl() don't return decrypted/shared pages to
    allocators, add a field in struct vmbus_gpadl to keep track of the
    decryption status of the buffers. This will allow the callers to
    know if they should free or leak the pages.
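
    The free-or-leak decision can be illustrated with a minimal user-space
    sketch. The struct and function names below are hypothetical stand-ins,
    not the kernel's struct vmbus_gpadl or its helpers; only the idea of
    consulting a decrypted flag before freeing comes from the text above.

```c
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical stand-in for struct vmbus_gpadl; only the flag matters here. */
struct gpadl_sketch {
    void *buf;       /* buffer backing the GPADL */
    bool decrypted;  /* true while the buffer may still be shared with the host */
};

/*
 * Release the buffer only if it is known to be private (encrypted) again.
 * Returns true if the memory was freed, false if it was deliberately leaked.
 */
static bool gpadl_sketch_release(struct gpadl_sketch *g)
{
    if (g->decrypted) {
        /* Leak: never hand shared memory back to the allocator. */
        g->buf = NULL;
        return false;
    }
    free(g->buf);
    g->buf = NULL;
    return true;
}
```

    Leaking costs memory, but it keeps a page the host can still read or
    write from being reused for private data.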

    Signed-off-by: Rick Edgecombe
    Signed-off-by: Michael Kelley
    Reviewed-by: Kuppuswamy Sathyanarayanan
    Acked-by: Kirill A. Shutemov
    Link: https://lore.kernel.org/r/20240311161558.1310-3-mhklinux@outlook.com
    Signed-off-by: Wei Liu
    Signed-off-by: Sasha Levin

    Rick Edgecombe
     
  • [ Upstream commit 03f5a999adba062456c8c818a683beb1b498983a ]

    In CoCo VMs it is possible for the untrusted host to cause
    set_memory_encrypted() or set_memory_decrypted() to fail such that an
    error is returned and the resulting memory is shared. Callers need to
    take care to handle these errors to avoid returning decrypted (shared)
    memory to the page allocator, which could lead to functional or security
    issues.

    VMBus code could free decrypted pages if set_memory_encrypted()/decrypted()
    fails. Leak the pages if this happens.

    Signed-off-by: Rick Edgecombe
    Signed-off-by: Michael Kelley
    Reviewed-by: Kuppuswamy Sathyanarayanan
    Acked-by: Kirill A. Shutemov
    Link: https://lore.kernel.org/r/20240311161558.1310-2-mhklinux@outlook.com
    Signed-off-by: Wei Liu
    Signed-off-by: Sasha Levin

    Rick Edgecombe
     

27 Mar, 2024

1 commit

  • [ Upstream commit 2b4b90e053a29057fb05ba81acce26bddce8d404 ]

    Currently, the secondary CPUs in Hyper-V VTL context lack support for
    parallel startup. Therefore, relying on the single initial_stack fetched
    from the current task structure suffices for all vCPUs.

    However, common initial_stack risks stack corruption when parallel startup
    is enabled. In order to facilitate parallel startup, use the initial_stack
    from the per CPU idle thread instead of the current task.

    Fixes: 3be1bc2fe9d2 ("x86/hyperv: VTL support for Hyper-V")
    Signed-off-by: Saurabh Sengar
    Reviewed-by: Michael Kelley
    Link: https://lore.kernel.org/r/1709452896-13342-1-git-send-email-ssengar@linux.microsoft.com
    Signed-off-by: Wei Liu
    Signed-off-by: Sasha Levin

    Saurabh Sengar
     

05 Sep, 2023

1 commit

  • …rnel/git/hyperv/linux

    Pull hyperv updates from Wei Liu:

    - Support for SEV-SNP guests on Hyper-V (Tianyu Lan)

    - Support for TDX guests on Hyper-V (Dexuan Cui)

    - Use SBRM API in Hyper-V balloon driver (Mitchell Levy)

    - Avoid dereferencing ACPI root object handle in VMBus driver (Maciej
    Szmigiero)

    - A few miscellaneous fixes (Jiapeng Chong, Nathan Chancellor, Saurabh
    Sengar)

    * tag 'hyperv-next-signed-20230902' of git://git.kernel.org/pub/scm/linux/kernel/git/hyperv/linux: (24 commits)
    x86/hyperv: Remove duplicate include
    x86/hyperv: Move the code in ivm.c around to avoid unnecessary ifdef's
    x86/hyperv: Remove hv_isolation_type_en_snp
    x86/hyperv: Use TDX GHCI to access some MSRs in a TDX VM with the paravisor
    Drivers: hv: vmbus: Bring the post_msg_page back for TDX VMs with the paravisor
    x86/hyperv: Introduce a global variable hyperv_paravisor_present
    Drivers: hv: vmbus: Support >64 VPs for a fully enlightened TDX/SNP VM
    x86/hyperv: Fix serial console interrupts for fully enlightened TDX guests
    Drivers: hv: vmbus: Support fully enlightened TDX guests
    x86/hyperv: Support hypercalls for fully enlightened TDX guests
    x86/hyperv: Add hv_isolation_type_tdx() to detect TDX guests
    x86/hyperv: Fix undefined reference to isolation_type_en_snp without CONFIG_HYPERV
    x86/hyperv: Add missing 'inline' to hv_snp_boot_ap() stub
    hv: hyperv.h: Replace one-element array with flexible-array member
    Drivers: hv: vmbus: Don't dereference ACPI root object handle
    x86/hyperv: Add hyperv-specific handling for VMMCALL under SEV-ES
    x86/hyperv: Add smp support for SEV-SNP guest
    clocksource: hyper-v: Mark hyperv tsc page unencrypted in sev-snp enlightened guest
    x86/hyperv: Use vmmcall to implement Hyper-V hypercall in sev-snp enlightened guest
    drivers: hv: Mark percpu hvcall input arg page unencrypted in SEV-SNP enlightened guest
    ...

    Linus Torvalds
     

25 Aug, 2023

7 commits

  • In ms_hyperv_init_platform(), do not distinguish between a SNP VM with
    the paravisor and a SNP VM without the paravisor.

    Replace hv_isolation_type_en_snp() with
    !ms_hyperv.paravisor_present && hv_isolation_type_snp().

    The hv_isolation_type_en_snp() in drivers/hv/hv.c and
    drivers/hv/hv_common.c can be changed to hv_isolation_type_snp() since
    we know !ms_hyperv.paravisor_present is true there.

    Signed-off-by: Dexuan Cui
    Reviewed-by: Michael Kelley
    Reviewed-by: Tianyu Lan
    Signed-off-by: Wei Liu
    Link: https://lore.kernel.org/r/20230824080712.30327-10-decui@microsoft.com

    Dexuan Cui
     
  • The post_msg_page was removed in
    commit 9a6b1a170ca8 ("Drivers: hv: vmbus: Remove the per-CPU post_msg_page")

    However, it turns out that we need to bring it back, but only for a TDX VM
    with the paravisor: in such a VM, the hyperv_pcpu_input_arg is not decrypted,
    but the HVCALL_POST_MESSAGE in such a VM needs a decrypted page as the
    hypercall input page: see the comments in hyperv_init() for a detailed
    explanation.

    Except for HVCALL_POST_MESSAGE and HVCALL_SIGNAL_EVENT, the other hypercalls
    in a TDX VM with the paravisor still use hv_hypercall_pg and must use the
    hyperv_pcpu_input_arg (which is encrypted in such a VM), when a hypercall
    input page is used.

    Signed-off-by: Dexuan Cui
    Reviewed-by: Tianyu Lan
    Reviewed-by: Michael Kelley
    Signed-off-by: Wei Liu
    Link: https://lore.kernel.org/r/20230824080712.30327-8-decui@microsoft.com

    Dexuan Cui
     
  • The new variable hyperv_paravisor_present is set only when the VM
    is a SNP/TDX VM with the paravisor running: see ms_hyperv_init_platform().

    We introduce hyperv_paravisor_present because we can not use
    ms_hyperv.paravisor_present in arch/x86/include/asm/mshyperv.h:

    struct ms_hyperv_info is defined in include/asm-generic/mshyperv.h, which
    is included at the end of arch/x86/include/asm/mshyperv.h, but at the
    beginning of arch/x86/include/asm/mshyperv.h, we would already need to use
    struct ms_hyperv_info in hv_do_hypercall().

    We use hyperv_paravisor_present only in include/asm-generic/mshyperv.h,
    and use ms_hyperv.paravisor_present elsewhere. In the future, we'll
    introduce a hypercall function structure for different VM types, and
    at boot time, the right function pointers would be written into the
    structure so that runtime testing of TDX vs. SNP vs. normal will be
    avoided and hyperv_paravisor_present will no longer be needed.

    Call hv_vtom_init() when it's a VBS VM or when ms_hyperv.paravisor_present
    is true, i.e. the VM is a SNP VM or TDX VM with the paravisor.

    Enhance hv_vtom_init() for a TDX VM with the paravisor.

    In hv_common_cpu_init(), don't decrypt the hyperv_pcpu_input_arg
    for a TDX VM with the paravisor, just like we don't decrypt the page
    for a SNP VM with the paravisor.

    Signed-off-by: Dexuan Cui
    Reviewed-by: Tianyu Lan
    Reviewed-by: Michael Kelley
    Signed-off-by: Wei Liu
    Link: https://lore.kernel.org/r/20230824080712.30327-7-decui@microsoft.com

    Dexuan Cui
     
  • Don't set *this_cpu_ptr(hyperv_pcpu_input_arg) before the function
    set_memory_decrypted() returns, otherwise we run into this tricky issue:

    For a fully enlightened TDX/SNP VM, in hv_common_cpu_init(),
    *this_cpu_ptr(hyperv_pcpu_input_arg) is an encrypted page before
    the set_memory_decrypted() returns.

    When such a VM has more than 64 VPs, if the hyperv_pcpu_input_arg is not
    NULL, hv_common_cpu_init() -> set_memory_decrypted() -> ... ->
    cpa_flush() -> on_each_cpu() -> ... -> hv_send_ipi_mask() -> ... ->
    __send_ipi_mask_ex() tries to call hv_do_rep_hypercall() with the
    hyperv_pcpu_input_arg as the hypercall input page, which must be a
    decrypted page in such a VM, but the page is still encrypted at this
    point, and a fatal fault is triggered.

    Fix the issue by setting *this_cpu_ptr(hyperv_pcpu_input_arg) after
    set_memory_decrypted(): if the hyperv_pcpu_input_arg is NULL,
    __send_ipi_mask_ex() returns HV_STATUS_INVALID_PARAMETER immediately,
    and hv_send_ipi_mask() falls back to orig_apic.send_IPI_mask(),
    which can use x2apic_send_IPI_all(), which may be slightly slower than
    the hypercall but still works correctly in such a VM.
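
    The ordering problem can be modelled in a few lines of user-space C.
    Everything below is a hypothetical simulation (the names, the counters
    and the single global pointer are assumptions made for illustration);
    it only mirrors the sequence described above, where decrypting the page
    triggers an IPI before the decryption has finished.

```c
#include <stdbool.h>
#include <stddef.h>

struct page_sketch { bool decrypted; };

static struct page_sketch *input_arg;  /* models *this_cpu_ptr(hyperv_pcpu_input_arg) */
static int faults;     /* hypercalls attempted with an encrypted input page */
static int fallbacks;  /* IPIs sent through the fallback path */

/* Models the IPI path: use the hypercall page if published, else fall back. */
static void send_ipi_sketch(void)
{
    if (!input_arg) {
        fallbacks++;           /* corresponds to the orig_apic fallback */
        return;
    }
    if (!input_arg->decrypted)
        faults++;              /* encrypted page used as hypercall input */
}

/* Models set_memory_decrypted(): it flushes via an IPI before it returns. */
static void decrypt_sketch(struct page_sketch *p)
{
    send_ipi_sketch();
    p->decrypted = true;
}

/* Buggy order: publish first, decrypt second. */
static void cpu_init_buggy(struct page_sketch *p)
{
    input_arg = p;
    decrypt_sketch(p);
}

/* Fixed order: decrypt first, publish only once the page is usable. */
static void cpu_init_fixed(struct page_sketch *p)
{
    decrypt_sketch(p);
    input_arg = p;
}
```

    With the buggy order the simulated IPI sees a published but still
    encrypted page; with the fixed order it sees NULL and takes the
    fallback path instead.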

    Reviewed-by: Michael Kelley
    Reviewed-by: Tianyu Lan
    Signed-off-by: Dexuan Cui
    Signed-off-by: Wei Liu
    Link: https://lore.kernel.org/r/20230824080712.30327-6-decui@microsoft.com

    Dexuan Cui
     
  • Add Hyper-V specific code so that a fully enlightened TDX guest (i.e.
    without the paravisor) can run on Hyper-V:
    Don't use hv_vp_assist_page. Use GHCI instead.
    Don't try to use the unsupported HV_REGISTER_CRASH_CTL.
    Don't trust (use) Hyper-V's TLB-flushing hypercalls.
    Don't use lazy EOI.
    Share the SynIC Event/Message pages with the hypervisor.
    Don't use the Hyper-V TSC page for now, because non-trivial work is
    required to share the page with the hypervisor.

    Reviewed-by: Michael Kelley
    Signed-off-by: Dexuan Cui
    Signed-off-by: Wei Liu
    Link: https://lore.kernel.org/r/20230824080712.30327-4-decui@microsoft.com

    Dexuan Cui
     
  • A fully enlightened TDX guest on Hyper-V (i.e. without the paravisor) only
    uses the GHCI call rather than hv_hypercall_pg. Do not initialize
    hypercall_pg for such a guest.

    In hv_common_cpu_init(), the hyperv_pcpu_input_arg page needs to be
    decrypted in such a guest.

    Reviewed-by: Kuppuswamy Sathyanarayanan
    Reviewed-by: Michael Kelley
    Reviewed-by: Tianyu Lan
    Signed-off-by: Dexuan Cui
    Signed-off-by: Wei Liu
    Link: https://lore.kernel.org/r/20230824080712.30327-3-decui@microsoft.com

    Dexuan Cui
     
  • No logic change to SNP/VBS guests.

    hv_isolation_type_tdx() will be used to instruct a TDX guest on Hyper-V to
    do some TDX-specific operations, e.g. for a fully enlightened TDX guest
    (i.e. without the paravisor), hv_do_hypercall() should use
    __tdx_hypercall() and such a guest on Hyper-V should handle the Hyper-V
    Event/Message/Monitor pages specially.

    Reviewed-by: Michael Kelley
    Reviewed-by: Tianyu Lan
    Signed-off-by: Dexuan Cui
    Signed-off-by: Wei Liu
    Link: https://lore.kernel.org/r/20230824080712.30327-2-decui@microsoft.com

    Dexuan Cui
     

22 Aug, 2023

4 commits

  • Since the commit referenced in the Fixes: tag below the VMBus client driver
    is walking the ACPI namespace up from the VMBus ACPI device to the ACPI
    namespace root object trying to find Hyper-V MMIO ranges.

    However, if it is not able to find them it ends up trying to walk resources of
    the ACPI namespace root object itself.
    This object has all-ones handle, which causes a NULL pointer dereference
    in the ACPI code (from dereferencing this pointer with an offset).

    This in turn causes an oops on boot with VMBus host implementations that do
    not provide Hyper-V MMIO ranges in their VMBus ACPI device or its
    ancestors.
    The QEMU VMBus implementation is an example of such an implementation.

    I guess providing these ranges is optional, since all tested Windows
    versions seem to be able to use VMBus devices without them.

    Fix this by explicitly terminating the lookup at the ACPI namespace root
    object.
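
    A minimal sketch of an ancestor walk that terminates explicitly at the
    root is shown below. The node type and function are hypothetical and do
    not use the ACPI API; they only illustrate stopping the lookup before
    the root object is ever inspected.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical namespace node; not the ACPI data structures. */
struct node_sketch {
    struct node_sketch *parent;  /* NULL only for the root */
    bool has_mmio_ranges;
};

/*
 * Walk from 'dev' towards the root looking for a node with MMIO ranges.
 * The root itself is never inspected: the loop stops when it reaches it.
 */
static struct node_sketch *find_mmio_ancestor(struct node_sketch *dev,
                                              struct node_sketch *root)
{
    for (struct node_sketch *n = dev; n && n != root; n = n->parent) {
        if (n->has_mmio_ranges)
            return n;
    }
    return NULL;  /* not found; caller continues without the ranges */
}
```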

    Note that Linux guests under KVM/QEMU do not use the Hyper-V PV interface
    by default - they only do so if the KVM PV interface is missing or
    disabled.

    Example stack trace of such oops:
    [ 3.710827] ? __die+0x1f/0x60
    [ 3.715030] ? page_fault_oops+0x159/0x460
    [ 3.716008] ? exc_page_fault+0x73/0x170
    [ 3.716959] ? asm_exc_page_fault+0x22/0x30
    [ 3.717957] ? acpi_ns_lookup+0x7a/0x4b0
    [ 3.718898] ? acpi_ns_internalize_name+0x79/0xc0
    [ 3.720018] acpi_ns_get_node_unlocked+0xb5/0xe0
    [ 3.721120] ? acpi_ns_check_object_type+0xfe/0x200
    [ 3.722285] ? acpi_rs_convert_aml_to_resource+0x37/0x6e0
    [ 3.723559] ? down_timeout+0x3a/0x60
    [ 3.724455] ? acpi_ns_get_node+0x3a/0x60
    [ 3.725412] acpi_ns_get_node+0x3a/0x60
    [ 3.726335] acpi_ns_evaluate+0x1c3/0x2c0
    [ 3.727295] acpi_ut_evaluate_object+0x64/0x1b0
    [ 3.728400] acpi_rs_get_method_data+0x2b/0x70
    [ 3.729476] ? vmbus_platform_driver_probe+0x1d0/0x1d0 [hv_vmbus]
    [ 3.730940] ? vmbus_platform_driver_probe+0x1d0/0x1d0 [hv_vmbus]
    [ 3.732411] acpi_walk_resources+0x78/0xd0
    [ 3.733398] vmbus_platform_driver_probe+0x9f/0x1d0 [hv_vmbus]
    [ 3.734802] platform_probe+0x3d/0x90
    [ 3.735684] really_probe+0x19b/0x400
    [ 3.736570] ? __device_attach_driver+0x100/0x100
    [ 3.737697] __driver_probe_device+0x78/0x160
    [ 3.738746] driver_probe_device+0x1f/0x90
    [ 3.739743] __driver_attach+0xc2/0x1b0
    [ 3.740671] bus_for_each_dev+0x70/0xc0
    [ 3.741601] bus_add_driver+0x10e/0x210
    [ 3.742527] driver_register+0x55/0xf0
    [ 3.744412] ? 0xffffffffc039a000
    [ 3.745207] hv_acpi_init+0x3c/0x1000 [hv_vmbus]

    Fixes: 7f163a6fd957 ("drivers:hv: Modify hv_vmbus to search for all MMIO ranges available.")
    Signed-off-by: Maciej S. Szmigiero
    Reviewed-by: Michael Kelley
    Signed-off-by: Wei Liu
    Link: https://lore.kernel.org/r/fd8e64ceeecfd1d95ff49021080cf699e88dbbde.1691606267.git.maciej.szmigiero@oracle.com

    Maciej S. Szmigiero
     
  • Hypervisor needs to access input arg, VMBus synic event and
    message pages. Mark these pages unencrypted in the SEV-SNP
    guest and free them only if they have been marked encrypted
    successfully.

    Reviewed-by: Dexuan Cui
    Reviewed-by: Michael Kelley
    Signed-off-by: Tianyu Lan
    Signed-off-by: Wei Liu
    Link: https://lore.kernel.org/r/20230818102919.1318039-5-ltykernel@gmail.com

    Tianyu Lan
     
  • SEV-SNP guests on Hyper-V can run at multiple Virtual Trust
    Levels (VTL). During boot, get the VTL at which we're running
    using the GET_VP_REGISTERs hypercall, and save the value
    for future use. Then during VMBus initialization, set the VTL
    with the saved value as required in the VMBus init message.

    Reviewed-by: Dexuan Cui
    Reviewed-by: Michael Kelley
    Signed-off-by: Tianyu Lan
    Signed-off-by: Wei Liu
    Link: https://lore.kernel.org/r/20230818102919.1318039-3-ltykernel@gmail.com

    Tianyu Lan
     
  • Introduce static key isolation_type_en_snp for enlightened
    sev-snp guest check.

    Reviewed-by: Dexuan Cui
    Reviewed-by: Michael Kelley
    Signed-off-by: Tianyu Lan
    Signed-off-by: Wei Liu
    Link: https://lore.kernel.org/r/20230818102919.1318039-2-ltykernel@gmail.com

    Tianyu Lan
     

12 Aug, 2023

1 commit

  • This patch is intended as a proof-of-concept for the new SBRM
    machinery[1]. For some brief background, the idea behind SBRM is using
    the __cleanup__ attribute to automatically unlock locks (or otherwise
    release resources) when they go out of scope, similar to C++ style RAII.
    This promises some benefits such as making code simpler (particularly
    where you have lots of goto fail; type constructs) as well as reducing
    the surface area for certain kinds of bugs.

    The changes in this patch should not result in any difference in how the
    code actually runs (i.e., it's purely an exercise in this new syntax
    sugar). In one instance SBRM was not appropriate, so I left that part
    alone, but all other locking/unlocking is handled automatically in this
    patch.
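
    For readers unfamiliar with the underlying mechanism, here is a small
    user-space sketch of scope-based release built on the cleanup attribute
    (a GCC and Clang extension). The toy lock and the GUARD_SKETCH macro are
    hypothetical names invented for this example; they are not the kernel's
    guard helpers.

```c
#include <stdbool.h>

/* Toy lock so the example needs no threading library. */
struct lock_sketch { bool held; };

static void lock_sketch_acquire(struct lock_sketch *l) { l->held = true; }

/* Cleanup handlers receive a pointer to the annotated variable. */
static void lock_sketch_release(struct lock_sketch **lp)
{
    if (*lp)
        (*lp)->held = false;
}

/* Declares a guard variable whose cleanup runs when the scope ends. */
#define GUARD_SKETCH(name, lock) \
    struct lock_sketch *name __attribute__((cleanup(lock_sketch_release))) = (lock); \
    lock_sketch_acquire(name)

static int counter;

/* Every return path drops the lock without an explicit unlock or goto. */
static int bump_if_below(struct lock_sketch *l, int limit)
{
    GUARD_SKETCH(guard, l);

    if (counter >= limit)
        return -1;   /* early return: cleanup still runs */
    counter++;
    return 0;
}
```

    Both the early return and the normal return leave the lock released,
    which is the property that removes the need for goto-based unlock
    paths.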

    [1] https://lore.kernel.org/all/20230626125726.GU4253@hirez.programming.kicks-ass.net/

    Suggested-by: Boqun Feng
    Signed-off-by: "Mitchell Levy (Microsoft)"
    Reviewed-by: Boqun Feng
    Signed-off-by: Wei Liu
    Link: https://lore.kernel.org/r/20230807-sbrm-hyperv-v2-1-9d2ac15305bd@gmail.com

    Mitchell Levy
     

29 Jun, 2023

2 commits

  • Several places in code for Hyper-V reference the
    per-CPU variable hyperv_pcpu_input_arg. Older code uses a multi-line
    sequence to reference the variable, and usually includes a cast.
    Newer code does a much simpler direct assignment. The latter is
    preferable as the complexity of the older code is unnecessary.

    Update older code to use the simpler direct assignment.

    Signed-off-by: Nischala Yelchuri
    Link: https://lore.kernel.org/r/1687286438-9421-1-git-send-email-niyelchu@linux.microsoft.com
    Signed-off-by: Wei Liu

    Nischala Yelchuri
     
  • Currently hv_free_hyperv_page() takes an unsigned long argument, which
    is inconsistent with the void * return value from the corresponding
    hv_alloc_hyperv_page() function and variants. This creates unnecessary
    extra casting.

    Change the hv_free_hyperv_page() argument type to void *.
    Also remove redundant casts from invocations of
    hv_alloc_hyperv_page() and variants.

    Signed-off-by: Kameron Carr
    Reviewed-by: Nuno Das Neves
    Reviewed-by: Dexuan Cui
    Link: https://lore.kernel.org/r/1687558189-19734-1-git-send-email-kameroncarr@linux.microsoft.com
    Signed-off-by: Wei Liu

    Kameron Carr
     

18 Jun, 2023

1 commit

  • These commits

    a494aef23dfc ("PCI: hv: Replace retarget_msi_interrupt_params with hyperv_pcpu_input_arg")
    2c6ba4216844 ("PCI: hv: Enable PCI pass-thru devices in Confidential VMs")

    update the Hyper-V virtual PCI driver to use the hyperv_pcpu_input_arg
    because that memory will be correctly marked as decrypted or encrypted
    for all VM types (CoCo or normal). But problems ensue when CPUs in the
    VM go online or offline after virtual PCI devices have been configured.

    When a CPU is brought online, the hyperv_pcpu_input_arg for that CPU is
    initialized by hv_cpu_init() running under state CPUHP_AP_ONLINE_DYN.
    But this state occurs after state CPUHP_AP_IRQ_AFFINITY_ONLINE, which
    may call the virtual PCI driver and fault trying to use the as yet
    uninitialized hyperv_pcpu_input_arg. A similar problem occurs in a CoCo
    VM if the MMIO read and write hypercalls are used from state
    CPUHP_AP_IRQ_AFFINITY_ONLINE.

    When a CPU is taken offline, IRQs may be reassigned in state
    CPUHP_TEARDOWN_CPU. Again, the virtual PCI driver may fault trying to
    use the hyperv_pcpu_input_arg that has already been freed by a
    higher state.

    Fix the onlining problem by adding state CPUHP_AP_HYPERV_ONLINE
    immediately after CPUHP_AP_ONLINE_IDLE (similar to CPUHP_AP_KVM_ONLINE)
    and before CPUHP_AP_IRQ_AFFINITY_ONLINE. Use this new state for
    Hyper-V initialization so that hyperv_pcpu_input_arg is allocated
    early enough.

    Fix the offlining problem by not freeing hyperv_pcpu_input_arg when
    a CPU goes offline. Retain the allocated memory, and reuse it if
    the CPU comes back online later.
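
    The offlining fix amounts to "allocate once, never free on offline".
    A rough user-space sketch of that pattern follows; the array, the
    callbacks and the 4096-byte size are assumptions for illustration and
    do not reproduce the kernel's CPU hotplug callbacks.

```c
#include <stdlib.h>

#define NR_CPUS_SKETCH 4

static void *input_arg[NR_CPUS_SKETCH];  /* models hyperv_pcpu_input_arg */
static int allocations;

/* Online callback: allocate only the first time this CPU comes up. */
static int cpu_online_sketch(int cpu)
{
    if (!input_arg[cpu]) {
        input_arg[cpu] = calloc(1, 4096);
        if (!input_arg[cpu])
            return -1;
        allocations++;
    }
    return 0;
}

/* Offline callback: deliberately keeps the buffer for later users. */
static void cpu_offline_sketch(int cpu)
{
    (void)cpu;
}
```

    Because the buffer outlives the offline step, code that runs later in
    the teardown sequence can still use it safely.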

    Signed-off-by: Michael Kelley
    Reviewed-by: Vitaly Kuznetsov
    Acked-by: Borislav Petkov (AMD)
    Reviewed-by: Dexuan Cui
    Link: https://lore.kernel.org/r/1684862062-51576-1-git-send-email-mikelley@microsoft.com
    Signed-off-by: Wei Liu

    Michael Kelley
     

24 May, 2023

1 commit

  • vmbus_wait_for_unload() may be called in the panic path after other
    CPUs are stopped. vmbus_wait_for_unload() currently loops through
    online CPUs looking for the UNLOAD response message. But the values of
    CONFIG_KEXEC_CORE and crash_kexec_post_notifiers affect the path used
    to stop the other CPUs, and in one of the paths the stopped CPUs
    are removed from cpu_online_mask. This removal happens in both
    x86/x64 and arm64 architectures. In such a case, vmbus_wait_for_unload()
    only checks the panic'ing CPU, and misses the UNLOAD response message
    except when the panic'ing CPU is CPU 0. vmbus_wait_for_unload()
    eventually times out, but only after waiting 100 seconds.

    Fix this by looping through *present* CPUs in vmbus_wait_for_unload().
    The cpu_present_mask is not modified by stopping the other CPUs in the
    panic path, nor should it be.

    Also, in a CoCo VM the synic_message_page is not allocated in
    hv_synic_alloc(), but is set and cleared in hv_synic_enable_regs()
    and hv_synic_disable_regs() such that it is set only when the CPU is
    online. If not all present CPUs are online when vmbus_wait_for_unload()
    is called, the synic_message_page might be NULL. Add a check for this.
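
    The difference between scanning online and present CPUs, together with
    the NULL check, can be sketched as follows. The bitmask representation,
    array and function name are hypothetical simplifications, not the
    kernel's cpumask API.

```c
#include <stdbool.h>
#include <stddef.h>

#define NR_CPUS_SKETCH 4

struct msg_page_sketch { bool unload_response; };

/* Per-CPU message pages; NULL models a CPU whose page was never set up. */
static struct msg_page_sketch *msg_page[NR_CPUS_SKETCH];

/*
 * Scan every CPU in 'mask' (bit n = CPU n) for the UNLOAD response.
 * Returns the CPU that holds it, or -1 if none does.
 */
static int find_unload_response(unsigned int mask)
{
    for (int cpu = 0; cpu < NR_CPUS_SKETCH; cpu++) {
        if (!(mask & (1u << cpu)))
            continue;
        if (!msg_page[cpu])      /* page may be NULL for an offline CPU */
            continue;
        if (msg_page[cpu]->unload_response)
            return cpu;
    }
    return -1;
}
```

    If the response sits on CPU 0 while only the panicking CPU remains in
    the online mask, a scan of the online mask misses it and a scan of the
    present mask finds it.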

    Fixes: cd95aad55793 ("Drivers: hv: vmbus: handle various crash scenarios")
    Cc: stable@vger.kernel.org
    Reported-by: John Starks
    Signed-off-by: Michael Kelley
    Reviewed-by: Vitaly Kuznetsov
    Link: https://lore.kernel.org/r/1684422832-38476-1-git-send-email-mikelley@microsoft.com
    Signed-off-by: Wei Liu

    Michael Kelley
     

09 May, 2023

1 commit

  • Commit 572086325ce9 ("Drivers: hv: vmbus: Cleanup synic memory free path")
    says "Any memory allocations that succeeded will be freed when the caller
    cleans up by calling hv_synic_free()", but if the get_zeroed_page() in
    hv_synic_alloc() fails, currently hv_synic_free() is not really called
    in vmbus_bus_init(), consequently there will be a memory leak, e.g.
    hv_context.hv_numa_map is not freed in the error path. Fix this by
    updating the goto labels.
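
    As a generic illustration of this class of bug, the toy function below
    shows an error path that jumps to a label which also unwinds the
    earlier allocation. It is a hypothetical user-space sketch, not the
    vmbus_bus_init() code, and the names are invented for the example.

```c
#include <stdlib.h>

static int freed_map;  /* counts cleanups of the first allocation */

/* fail_second simulates the later allocation failing. */
static int bus_init_sketch(int fail_second)
{
    void *numa_map, *page;
    int ret = 0;

    numa_map = malloc(64);
    if (!numa_map)
        return -1;

    page = fail_second ? NULL : malloc(4096);
    if (!page) {
        ret = -1;
        goto err_free_map;   /* must unwind the earlier allocation too */
    }

    free(page);              /* success path of this toy also cleans up */
err_free_map:
    free(numa_map);
    freed_map++;
    return ret;
}
```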

    Cc: stable@kernel.org
    Signed-off-by: Dexuan Cui
    Fixes: 4df4cb9e99f8 ("x86/hyperv: Initialize clockevents earlier in CPU onlining")
    Reviewed-by: Michael Kelley
    Link: https://lore.kernel.org/r/20230504224155.10484-1-decui@microsoft.com
    Signed-off-by: Wei Liu

    Dexuan Cui
     

28 Apr, 2023

3 commits

  • …rnel/git/hyperv/linux

    Pull hyperv updates from Wei Liu:

    - PCI passthrough for Hyper-V confidential VMs (Michael Kelley)

    - Hyper-V VTL mode support (Saurabh Sengar)

    - Move panic report initialization code earlier (Long Li)

    - Various improvements and bug fixes (Dexuan Cui and Michael Kelley)

    * tag 'hyperv-next-signed-20230424' of git://git.kernel.org/pub/scm/linux/kernel/git/hyperv/linux: (22 commits)
    PCI: hv: Replace retarget_msi_interrupt_params with hyperv_pcpu_input_arg
    Drivers: hv: move panic report code from vmbus to hv early init code
    x86/hyperv: VTL support for Hyper-V
    Drivers: hv: Kconfig: Add HYPERV_VTL_MODE
    x86/hyperv: Make hv_get_nmi_reason public
    x86/hyperv: Add VTL specific structs and hypercalls
    x86/init: Make get/set_rtc_noop() public
    x86/hyperv: Exclude lazy TLB mode CPUs from enlightened TLB flushes
    x86/hyperv: Add callback filter to cpumask_to_vpset()
    Drivers: hv: vmbus: Remove the per-CPU post_msg_page
    clocksource: hyper-v: make sure Invariant-TSC is used if it is available
    PCI: hv: Enable PCI pass-thru devices in Confidential VMs
    Drivers: hv: Don't remap addresses that are above shared_gpa_boundary
    hv_netvsc: Remove second mapping of send and recv buffers
    Drivers: hv: vmbus: Remove second way of mapping ring buffers
    Drivers: hv: vmbus: Remove second mapping of VMBus monitor pages
    swiotlb: Remove bounce buffer remapping for Hyper-V
    Driver: VMBus: Add Devicetree support
    dt-bindings: bus: Add Hyper-V VMBus
    Drivers: hv: vmbus: Convert acpi_device to more generic platform_device
    ...

    Linus Torvalds
     
  • Pull sysctl updates from Luis Chamberlain:
    "This only does a few sysctl moves from the kernel/sysctl.c file, the
    rest of the work has been put towards deprecating two API calls which
    incur recursion and prevent us from simplifying the registration
    process / saving memory per move. Most of the changes have been
    soaking on linux-next since v6.3-rc3.

    I've slowed down the kernel/sysctl.c moves due to Matthew Wilcox's
    feedback that we should see if we could *save* memory with these moves
    instead of incurring more memory. We currently incur more memory since
    when we move a sysctl from kernel/sysctl.c out to its own file we end
    up having to add a new empty sysctl used to register it. To achieve
    saving memory we want to allow sysctls to be passed without requiring
    the end element being empty, and just have our registration process
    rely on ARRAY_SIZE(). Without this, supporting both styles of sysctls
    would make the sysctl registration pretty brittle, hard to read and
    maintain as can be seen from Meng Tang's efforts to do just this [0].
    Fortunately, in order to use ARRAY_SIZE() for all sysctl registrations
    also implies doing the work to deprecate two API calls which use
    recursion in order to support sysctl declarations with subdirectories.

    And so during this development cycle quite a bit of effort went into
    this deprecation effort. I've annotated the following two APIs as
    deprecated and in a few kernel releases we should be good to remove
    them:

    - register_sysctl_table()
    - register_sysctl_paths()

    During this merge window we should be able to deprecate and unexport
    register_sysctl_paths(), we can probably do that towards the end of
    this merge window.

    Deprecating register_sysctl_table() will take a bit more time but this
    pull request goes with a few examples of how to do this.

    As it turns out each of the conversions to move away from either of
    these two API calls *also* saves memory. And so long term, all these
    changes *will* prove to have saved a bit of memory on boot.

    The way I see it then is if we remove a user of one deprecated call, it
    gives us enough savings to move one kernel/sysctl.c out from the
    generic arrays as we end up with about the same amount of bytes.

    Since deprecating register_sysctl_table() and register_sysctl_paths()
    does not require maintainer coordination except the final unexport
    you'll see quite a bit of these changes from other pull requests, I've
    just kept the stragglers after rc3"

    Link: https://lkml.kernel.org/r/ZAD+cpbrqlc5vmry@bombadil.infradead.org [0]

    * tag 'sysctl-6.4-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux: (29 commits)
    fs: fix sysctls.c built
    mm: compaction: remove incorrect #ifdef checks
    mm: compaction: move compaction sysctl to its own file
    mm: memory-failure: Move memory failure sysctls to its own file
    arm: simplify two-level sysctl registration for ctl_isa_vars
    ia64: simplify one-level sysctl registration for kdump_ctl_table
    utsname: simplify one-level sysctl registration for uts_kern_table
    ntfs: simplfy one-level sysctl registration for ntfs_sysctls
    coda: simplify one-level sysctl registration for coda_table
    fs/cachefiles: simplify one-level sysctl registration for cachefiles_sysctls
    xfs: simplify two-level sysctl registration for xfs_table
    nfs: simplify two-level sysctl registration for nfs_cb_sysctls
    nfs: simplify two-level sysctl registration for nfs4_cb_sysctls
    lockd: simplify two-level sysctl registration for nlm_sysctls
    proc_sysctl: enhance documentation
    xen: simplify sysctl registration for balloon
    md: simplify sysctl registration
    hv: simplify sysctl registration
    scsi: simplify sysctl registration with register_sysctl()
    csky: simplify alignment sysctl registration
    ...

    Linus Torvalds
     
  • Pull driver core updates from Greg KH:
    "Here is the large set of driver core changes for 6.4-rc1.

    Once again, a busy development cycle, with lots of changes happening
    in the driver core in the quest to be able to move "struct bus" and
    "struct class" into read-only memory, a task now complete with these
    changes.

    This will make the future rust interactions with the driver core more
    "provably correct" as well as providing more obvious lifetime rules
    for all busses and classes in the kernel.

    The changes required for this did touch many individual classes and
    busses as many callbacks were changed to take const * parameters
    instead. All of these changes have been submitted to the various
    subsystem maintainers, giving them plenty of time to review, and most
    of them actually did so.

    Other than those changes, included in here are a small set of other
    things:

    - kobject logging improvements

    - cacheinfo improvements and updates

    - obligatory fw_devlink updates and fixes

    - documentation updates

    - device property cleanups and const * changes

    - firmware loader dependency fixes.

    All of these have been in linux-next for a while with no reported
    problems"

    * tag 'driver-core-6.4-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core: (120 commits)
    device property: make device_property functions take const device *
    driver core: update comments in device_rename()
    driver core: Don't require dynamic_debug for initcall_debug probe timing
    firmware_loader: rework crypto dependencies
    firmware_loader: Strip off \n from customized path
    zram: fix up permission for the hot_add sysfs file
    cacheinfo: Add use_arch[|_cache]_info field/function
    arch_topology: Remove early cacheinfo error message if -ENOENT
    cacheinfo: Check cache properties are present in DT
    cacheinfo: Check sib_leaf in cache_leaves_are_shared()
    cacheinfo: Allow early level detection when DT/ACPI info is missing/broken
    cacheinfo: Add arm64 early level initializer implementation
    cacheinfo: Add arch specific early level initializer
    tty: make tty_class a static const structure
    driver core: class: remove struct class_interface * from callbacks
    driver core: class: mark the struct class in struct class_interface constant
    driver core: class: make class_register() take a const *
    driver core: class: mark class_release() as taking a const *
    driver core: remove incorrect comment for device_create*
    MIPS: vpe-cmp: remove module owner pointer from struct class usage.
    ...

    Linus Torvalds
     

26 Apr, 2023

1 commit

  • Pull x86 SEV updates from Borislav Petkov:

    - Add the necessary glue so that the kernel can run as a confidential
    SEV-SNP vTOM guest on Hyper-V. A vTOM guest basically splits the
    address space in two parts: encrypted and unencrypted. The use case
    is running unmodified guests on the Hyper-V confidential computing
    hypervisor

    - Double-buffer messages between the guest and the hardware PSP device
    so that no partial buffers are copied back and forth, which prevents
    potential message integrity and leak attacks

    - Name the return value the sev-guest driver returns when the hw PSP
    device hasn't been called, explicitly

    - Cleanups

    * tag 'x86_sev_for_v6.4_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    x86/hyperv: Change vTOM handling to use standard coco mechanisms
    init: Call mem_encrypt_init() after Hyper-V hypercall init is done
    x86/mm: Handle decryption/re-encryption of bss_decrypted consistently
    Drivers: hv: Explicitly request decrypted in vmap_pfn() calls
    x86/hyperv: Reorder code to facilitate future work
    x86/ioremap: Add hypervisor callback for private MMIO mapping in coco VM
    x86/sev: Change snp_guest_issue_request()'s fw_err argument
    virt/coco/sev-guest: Double-buffer messages
    crypto: ccp: Get rid of __sev_platform_init_locked()'s local function pointer
    crypto: ccp - Name -1 return value as SEV_RET_NO_FW_CALL

    Linus Torvalds
     

21 Apr, 2023

1 commit

  • The panic reporting code was added in commit 81b18bce48af
    ("Drivers: HV: Send one page worth of kmsg dump over Hyper-V during panic")

    It was added to the vmbus driver. The panic reporting has no dependence
    on vmbus, and can be enabled at an earlier boot time when Hyper-V is
    initialized.

    This patch moves the panic reporting code out of vmbus. There are no
    functional changes. While moving the code, also refactor some cleanup
    functions into hv_kmsg_dump_unregister().

    Signed-off-by: Long Li
    Reviewed-by: Michael Kelley
    Link: https://lore.kernel.org/r/1682030946-6372-1-git-send-email-longli@linuxonhyperv.com
    Signed-off-by: Wei Liu

    Long Li
     

19 Apr, 2023

1 commit


18 Apr, 2023

9 commits

  • The post_msg_page was introduced in 2014 in
    commit b29ef3546aec ("Drivers: hv: vmbus: Cleanup hv_post_message()")

    Commit 68bb7bfb7985 ("X86/Hyper-V: Enable IPI enlightenments") introduced
    the hyperv_pcpu_input_arg in 2018, which can be used in hv_post_message().

    Remove post_msg_page to simplify the code a little bit.

    Signed-off-by: Dexuan Cui
    Reviewed-by: Jinank Jain
    Link: https://lore.kernel.org/r/20230408213441.15472-1-decui@microsoft.com
    Signed-off-by: Wei Liu

    Dexuan Cui
     
  • For PCI pass-thru devices in a Confidential VM, Hyper-V requires
    that PCI config space be accessed via hypercalls. In normal VMs,
    config space accesses are trapped to the Hyper-V host and emulated.
    But in a confidential VM, the host can't access guest memory to
    decode the instruction for emulation, so an explicit hypercall must
    be used.

    Add functions to make the new MMIO read and MMIO write hypercalls.
    Update the PCI config space access functions to use the hypercalls
    when such use is indicated by Hyper-V flags. Also, set the flag to
    allow the Hyper-V PCI driver to be loaded and used in a Confidential
    VM (a.k.a., "Isolation VM"). The driver has previously been hardened
    against a malicious Hyper-V host[1].

    [1] https://lore.kernel.org/all/20220511223207.3386-2-parri.andrea@gmail.com/

    Co-developed-by: Dexuan Cui
    Signed-off-by: Dexuan Cui
    Signed-off-by: Michael Kelley
    Reviewed-by: Boqun Feng
    Reviewed-by: Haiyang Zhang
    Link: https://lore.kernel.org/r/1679838727-87310-13-git-send-email-mikelley@microsoft.com
    Signed-off-by: Wei Liu

    Michael Kelley
     
  • With the vTOM bit now treated as a protection flag and not part of
    the physical address, avoid remapping physical addresses with vTOM set
    since technically such addresses aren't valid. Use ioremap_cache()
    instead of memremap() to ensure that the mapping provides decrypted
    access, which will correctly set the vTOM bit as a protection flag.

    While this change is not required for correctness with the current
    implementation of memremap(), for general code hygiene it's better to
    not depend on the mapping functions doing something reasonable with
    a physical address that is out-of-range.

    While here, fix typos in two error messages.

    Signed-off-by: Michael Kelley
    Reviewed-by: Tianyu Lan
    Link: https://lore.kernel.org/r/1679838727-87310-12-git-send-email-mikelley@microsoft.com
    Signed-off-by: Wei Liu

    Michael Kelley
     
  • With changes to how Hyper-V guest VMs flip memory between private
    (encrypted) and shared (decrypted), creating a second kernel virtual
    mapping for shared memory is no longer necessary. Everything needed
    for the transition to shared is handled by set_memory_decrypted().

    As such, remove the code to create and manage the second
    mapping for the pre-allocated send and recv buffers. This mapping
    is the last user of hv_map_memory()/hv_unmap_memory(), so delete
    these functions as well. Finally, hv_map_memory() is the last
    user of vmap_pfn() in Hyper-V guest code, so remove the Kconfig
    selection of VMAP_PFN.

    Signed-off-by: Michael Kelley
    Reviewed-by: Tianyu Lan
    Link: https://lore.kernel.org/r/1679838727-87310-11-git-send-email-mikelley@microsoft.com
    Signed-off-by: Wei Liu

    Michael Kelley
     
  • With changes to how Hyper-V guest VMs flip memory between private
    (encrypted) and shared (decrypted), it's no longer necessary to
    have separate code paths for mapping VMBus ring buffers for
    normal VMs and for Confidential VMs.

    As such, remove the code path that uses vmap_pfn(), and set
    the protection flags argument to vmap() to account for the
    difference between normal and Confidential VMs.

    Signed-off-by: Michael Kelley
    Reviewed-by: Tianyu Lan
    Link: https://lore.kernel.org/r/1679838727-87310-10-git-send-email-mikelley@microsoft.com
    Signed-off-by: Wei Liu

    Michael Kelley
     
  • With changes to how Hyper-V guest VMs flip memory between private
    (encrypted) and shared (decrypted), creating a second kernel virtual
    mapping for shared memory is no longer necessary. Everything needed
    for the transition to shared is handled by set_memory_decrypted().

    As such, remove the code to create and manage the second
    mapping for VMBus monitor pages. Because set_memory_decrypted()
    and set_memory_encrypted() are no-ops in normal VMs, it's
    not even necessary to test for being in a Confidential VM
    (a.k.a., "Isolation VM").
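
    The "no-op in normal VMs" property described above can be illustrated
    with a small user-space mock. This is only a sketch under stated
    assumptions: struct sketch_page, sketch_is_confidential_vm and
    sketch_set_memory_decrypted() are hypothetical stand-ins, not the
    kernel's set_memory_decrypted() implementation.

    ```c
    #include <assert.h>
    #include <stdbool.h>
    #include <stddef.h>

    /* Hypothetical model of a page: only tracks whether it is shared. */
    struct sketch_page {
        bool shared;
    };

    /* Hypothetical global: true when running as a Confidential VM. */
    static bool sketch_is_confidential_vm;

    /*
     * Mock of the behaviour the commit message relies on: in a normal VM
     * the call succeeds without changing anything, so callers can invoke
     * it unconditionally instead of testing for a Confidential VM first.
     */
    static int sketch_set_memory_decrypted(struct sketch_page *pages, size_t n)
    {
        assert(pages != NULL);

        if (!sketch_is_confidential_vm)
            return 0; /* no-op in a normal VM */

        for (size_t i = 0; i < n; i++)
            pages[i].shared = true;
        return 0;
    }
    ```

    A caller in this model simply calls sketch_set_memory_decrypted() on
    the monitor pages and checks the return value, with no separate code
    path for Confidential VMs.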

    Signed-off-by: Michael Kelley
    Reviewed-by: Tianyu Lan
    Link: https://lore.kernel.org/r/1679838727-87310-9-git-send-email-mikelley@microsoft.com
    Signed-off-by: Wei Liu

    Michael Kelley
     
  • Merge the following 6 patches from tip/x86/sev, which are taken from
    Michael Kelley's series [0]. The rest of Michael's series depends on
    them.

    x86/hyperv: Change vTOM handling to use standard coco mechanisms
    init: Call mem_encrypt_init() after Hyper-V hypercall init is done
    x86/mm: Handle decryption/re-encryption of bss_decrypted consistently
    Drivers: hv: Explicitly request decrypted in vmap_pfn() calls
    x86/hyperv: Reorder code to facilitate future work
    x86/ioremap: Add hypervisor callback for private MMIO mapping in coco VM

    [0] https://lore.kernel.org/linux-hyperv/1679838727-87310-1-git-send-email-mikelley@microsoft.com/

    Wei Liu
     
  • Update the driver to support Devicetree boot in addition to ACPI.
    At present the Devicetree parsing only provides the mmio region info
    and is not an exact copy of the ACPI parsing. This is sufficient to
    cater to all the current Devicetree use cases for VMBus.

    Currently Devicetree is supported only for x86 systems.

    Signed-off-by: Saurabh Sengar
    Reviewed-by: Michael Kelley
    Link: https://lore.kernel.org/r/1679298460-11855-6-git-send-email-ssengar@linux.microsoft.com
    Signed-off-by: Wei Liu

    Saurabh Sengar
     
  • VMBus driver code currently has a direct dependency on ACPI and struct
    acpi_device. As a staging step toward optionally configuring based on
    Devicetree instead of ACPI, use a more generic platform device to reduce
    the dependency on ACPI where possible, though the dependency on ACPI
    is not completely removed. Also rename the function vmbus_acpi_remove()
    to the more generic vmbus_mmio_remove().

    Signed-off-by: Saurabh Sengar
    Reviewed-by: Michael Kelley
    Link: https://lore.kernel.org/r/1679298460-11855-4-git-send-email-ssengar@linux.microsoft.com
    Signed-off-by: Wei Liu

    Saurabh Sengar
     

14 Apr, 2023

1 commit


27 Mar, 2023

1 commit

  • Hyper-V guests on AMD SEV-SNP hardware have the option of using the
    "virtual Top Of Memory" (vTOM) feature specified by the SEV-SNP
    architecture. With vTOM, shared vs. private memory accesses are
    controlled by splitting the guest physical address space into two
    halves.

    vTOM is the dividing line where the uppermost bit of the physical
    address space is set; e.g., with 47 bits of guest physical address
    space, vTOM is 0x400000000000 (bit 46 is set). Guest physical memory is
    accessible at two parallel physical addresses -- one below vTOM and one
    above vTOM. Accesses below vTOM are private (encrypted) while accesses
    above vTOM are shared (decrypted). In this sense, vTOM is like the
    GPA.SHARED bit in Intel TDX.
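
    The address arithmetic in the example above can be checked with a few
    lines of plain C. This is an illustrative sketch only; the helper names
    (vtom_from_gpa_bits() and friends) are hypothetical and do not exist in
    the kernel.

    ```c
    #include <assert.h>
    #include <stdint.h>

    /* vTOM is the value with only the uppermost guest physical address bit set. */
    static inline uint64_t vtom_from_gpa_bits(unsigned int gpa_bits)
    {
        assert(gpa_bits >= 1 && gpa_bits <= 64);
        return UINT64_C(1) << (gpa_bits - 1);
    }

    /* Shared (decrypted) alias: the same offset, but above vTOM. */
    static inline uint64_t gpa_shared_alias(uint64_t gpa, uint64_t vtom)
    {
        return gpa | vtom;
    }

    /* Private (encrypted) alias: the same offset, but below vTOM. */
    static inline uint64_t gpa_private_alias(uint64_t gpa, uint64_t vtom)
    {
        return gpa & ~vtom;
    }

    /* An access is shared when the vTOM bit is set in the address. */
    static inline int gpa_is_shared(uint64_t gpa, uint64_t vtom)
    {
        return (gpa & vtom) != 0;
    }
    ```

    With 47 bits of guest physical address space, vtom_from_gpa_bits(47)
    yields 0x400000000000, and the two aliases of any page differ only in
    bit 46.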

    Support for Hyper-V guests using vTOM was added to the Linux kernel in
    two patch sets[1][2]. This support treats the vTOM bit as part of
    the physical address. For accessing shared (decrypted) memory, these
    patch sets create a second kernel virtual mapping that maps to physical
    addresses above vTOM.

    A better approach is to treat the vTOM bit as a protection flag, not
    as part of the physical address. This new approach is like the approach
    for the GPA.SHARED bit in Intel TDX. Rather than creating a second kernel
    virtual mapping, the existing mapping is updated using recently added
    coco mechanisms.

    When memory is changed between private and shared using
    set_memory_decrypted() and set_memory_encrypted(), the PTEs for the
    existing kernel mapping are changed to add or remove the vTOM bit in the
    guest physical address, just as with TDX. The hypercalls to change the
    memory status on the host side are made using the existing callback
    mechanism. Everything just works, with a minor tweak to map the IO-APIC
    to use private accesses.

    To accomplish the switch in approach, the following must be done:

    * Update Hyper-V initialization to set the cc_mask based on vTOM
    and do other coco initialization.

    * Update physical_mask so the vTOM bit is no longer treated as part
    of the physical address

    * Remove CC_VENDOR_HYPERV and merge the associated vTOM functionality
    under CC_VENDOR_AMD. Update cc_mkenc() and cc_mkdec() to set/clear
    the vTOM bit as a protection flag.

    * Code already exists to make hypercalls to inform Hyper-V about pages
    changing between shared and private. Update this code to run as a
    callback from __set_memory_enc_pgtable().

    * Remove the Hyper-V special case from __set_memory_enc_dec()

    * Remove the Hyper-V specific call to swiotlb_update_mem_attributes()
    since mem_encrypt_init() will now do it.

    * Add a Hyper-V specific implementation of the is_private_mmio()
    callback that returns true for the IO-APIC and vTPM MMIO addresses
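
    The effect of treating the vTOM bit as a protection flag can be
    modelled in user-space C as follows. This is a rough sketch, not the
    kernel code: every sketch_* name is hypothetical, and the model only
    mirrors the behaviour stated above (the mask is derived from vTOM, the
    bit is set for decrypted and cleared for encrypted, and the bit is
    excluded from the physical address mask).

    ```c
    #include <assert.h>
    #include <stdint.h>

    /* Hypothetical stand-ins for cc_mask and physical_mask. */
    static uint64_t sketch_cc_mask;       /* the vTOM bit, used as a flag */
    static uint64_t sketch_physical_mask; /* address bits below the vTOM bit */

    /* Derive both masks from the guest physical address width. */
    static void sketch_coco_init(unsigned int gpa_bits)
    {
        assert(gpa_bits >= 2 && gpa_bits <= 64);
        sketch_cc_mask = UINT64_C(1) << (gpa_bits - 1);
        sketch_physical_mask = sketch_cc_mask - 1;
    }

    /* With vTOM, encrypted (private) means the bit is clear ... */
    static uint64_t sketch_cc_mkenc(uint64_t val)
    {
        return val & ~sketch_cc_mask;
    }

    /* ... and decrypted (shared) means the bit is set. */
    static uint64_t sketch_cc_mkdec(uint64_t val)
    {
        return val | sketch_cc_mask;
    }

    /* The physical address never includes the vTOM bit. */
    static uint64_t sketch_phys_addr(uint64_t val)
    {
        return val & sketch_physical_mask;
    }
    ```

    In this model, flipping a mapping between private and shared changes
    only the flag bit; sketch_phys_addr() returns the same address in both
    states, which is why no second kernel virtual mapping is needed.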

    [1] https://lore.kernel.org/all/20211025122116.264793-1-ltykernel@gmail.com/
    [2] https://lore.kernel.org/all/20211213071407.314309-1-ltykernel@gmail.com/

    [ bp: Touchups. ]

    Signed-off-by: Michael Kelley
    Signed-off-by: Borislav Petkov (AMD)
    Link: https://lore.kernel.org/r/1679838727-87310-7-git-send-email-mikelley@microsoft.com

    Michael Kelley