04 Aug, 2021

1 commit

  • KVM creates a debugfs directory for each VM in order to store statistics
    about the virtual machine. The directory name is built from the process
    pid and a VM fd. While generally unique, it is possible to keep a
    file descriptor alive in a way that causes duplicate directories, which
    manifests as these messages:

    [ 471.846235] debugfs: Directory '20245-4' with parent 'kvm' already present!

    Even though this should not happen in practice, it is more or less
    expected in the case of KVM for testcases that call KVM_CREATE_VM and
    close the resulting file descriptor repeatedly and in parallel.

    When this happens, debugfs_create_dir() returns an error but
    kvm_create_vm_debugfs() goes on to allocate stat data structs which are
    later leaked. The slow memory leak was spotted by syzkaller, where it
    caused OOM reports.

    Since the issue only affects debugfs, do a lookup before calling
    debugfs_create_dir, so that the message is downgraded and rate-limited.
    While at it, ensure kvm->debugfs_dentry is NULL rather than an error
    if it is not created. This fixes kvm_destroy_vm_debugfs, which was not
    checking IS_ERR_OR_NULL correctly.

    Cc: stable@vger.kernel.org
    Fixes: 536a6f88c49d ("KVM: Create debugfs dir and stat files for each VM")
    Reported-by: Alexey Kardashevskiy
    Suggested-by: Greg Kroah-Hartman
    Acked-by: Greg Kroah-Hartman
    Signed-off-by: Paolo Bonzini

    Paolo Bonzini
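
    A minimal sketch of the lookup-before-create approach described in the
    entry above, assuming kvm_main.c's naming (kvm_debugfs_dir, ITOA_MAX_LEN)
    and the standard debugfs_lookup()/debugfs_create_dir() API; the exact
    messages and error paths in the real patch may differ:

        static int kvm_create_vm_debugfs(struct kvm *kvm, int fd)
        {
            char dir_name[ITOA_MAX_LEN * 2];
            struct dentry *dent;

            snprintf(dir_name, sizeof(dir_name), "%d-%d",
                     task_pid_nr(current), fd);

            /*
             * A stale fd can keep an old VM (and its directory) alive, so
             * probe for a duplicate first instead of letting
             * debugfs_create_dir() print its non-ratelimited error.
             */
            dent = debugfs_lookup(dir_name, kvm_debugfs_dir);
            if (dent) {
                pr_warn_ratelimited("KVM: debugfs: duplicate directory %s\n",
                                    dir_name);
                dput(dent);
                return 0;   /* kvm->debugfs_dentry stays NULL */
            }

            dent = debugfs_create_dir(dir_name, kvm_debugfs_dir);
            if (IS_ERR(dent))
                return 0;   /* keep NULL rather than an ERR_PTR */
            kvm->debugfs_dentry = dent;

            /* ... allocate stat data only once the directory exists ... */
            return 0;
        }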
     

28 Jul, 2021

2 commits

  • The arguments to the KVM_CLEAR_DIRTY_LOG ioctl include a pointer,
    therefore it needs a compat ioctl implementation. Otherwise,
    32-bit userspace fails to invoke it on 64-bit kernels; for x86
    it might work fine by chance if the padding is zero, but not
    on big-endian architectures.

    Reported-by: Thomas Sattler
    Cc: stable@vger.kernel.org
    Fixes: 2a31b9db1535 ("kvm: introduce manual dirty log reprotect")
    Reviewed-by: Peter Xu
    Signed-off-by: Paolo Bonzini

    Paolo Bonzini
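
    A rough sketch of what the compat path needs to do, using the usual
    compat_uptr_t/compat_ptr() conversions; the helper name is hypothetical,
    and the real patch wires this into the VM fd's compat ioctl handler:

        /* 32-bit userspace layout: the bitmap pointer is 32 bits wide. */
        struct compat_kvm_clear_dirty_log {
            __u32 slot;
            __u32 num_pages;
            __u64 first_page;
            union {
                compat_uptr_t dirty_bitmap;
                __u64 padding2;
            };
        };

        /* Hypothetical helper invoked from kvm_vm_compat_ioctl(). */
        static long kvm_vm_compat_clear_dirty_log(struct kvm *kvm,
                                                  void __user *argp)
        {
            struct compat_kvm_clear_dirty_log compat_log;
            struct kvm_clear_dirty_log log;

            if (copy_from_user(&compat_log, argp, sizeof(compat_log)))
                return -EFAULT;

            log.slot         = compat_log.slot;
            log.num_pages    = compat_log.num_pages;
            log.first_page   = compat_log.first_page;
            log.dirty_bitmap = compat_ptr(compat_log.dirty_bitmap);

            return kvm_vm_ioctl_clear_dirty_log(kvm, &log);
        }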
     
  • SMT siblings share caches and other hardware resources, so busy halt
    polling on one thread degrades the performance of its sibling when the
    sibling is doing useful work.

    Sean Christopherson suggested the following:

    "Rather than disallowing halt-polling entirely, on x86 it should be
    sufficient to simply have the hardware thread yield to its sibling(s)
    via PAUSE. It probably won't get back all performance, but I would
    expect it to be close.
    This compiles on all KVM architectures, and AFAICT the intended usage
    of cpu_relax() is identical for all architectures."

    Suggested-by: Sean Christopherson
    Signed-off-by: Li RongQing
    Message-Id:
    Signed-off-by: Paolo Bonzini

    Li RongQing
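
    A sketch of where the yield lands in the generic halt-polling loop,
    excerpted and simplified from kvm_vcpu_block(); the surrounding stat
    bookkeeping is omitted:

        ktime_t cur, stop = ktime_add_ns(ktime_get(), vcpu->halt_poll_ns);

        do {
            if (kvm_vcpu_check_block(vcpu) < 0)
                break;      /* wake condition: stop polling */
            /*
             * Yield the hardware thread to its SMT sibling (PAUSE on
             * x86) instead of spinning at full speed.
             */
            cpu_relax();
            cur = ktime_get();
        } while (single_task_running() && !need_resched() &&
                 ktime_before(cur, stop));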
     

16 Jul, 2021

1 commit

  • Pull kvm fixes from Paolo Bonzini:

    - Allow again loading KVM on 32-bit non-PAE builds

    - Fixes for host SMIs on AMD

    - Fixes for guest SMIs on AMD

    - Fixes for selftests on s390 and ARM

    - Fix memory leak

    - Enforce no-instrumentation area on vmentry when hardware breakpoints
    are in use.

    * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (25 commits)
    KVM: selftests: smm_test: Test SMM enter from L2
    KVM: nSVM: Restore nested control upon leaving SMM
    KVM: nSVM: Fix L1 state corruption upon return from SMM
    KVM: nSVM: Introduce svm_copy_vmrun_state()
    KVM: nSVM: Check that VM_HSAVE_PA MSR was set before VMRUN
    KVM: nSVM: Check the value written to MSR_VM_HSAVE_PA
    KVM: SVM: Fix sev_pin_memory() error checks in SEV migration utilities
    KVM: SVM: Return -EFAULT if copy_to_user() for SEV mig packet header fails
    KVM: SVM: add module param to control the #SMI interception
    KVM: SVM: remove INIT intercept handler
    KVM: SVM: #SMI interception must not skip the instruction
    KVM: VMX: Remove vmx_msr_index from vmx.h
    KVM: X86: Disable hardware breakpoints unconditionally before kvm_x86->run()
    KVM: selftests: Address extra memslot parameters in vm_vaddr_alloc
    kvm: debugfs: fix memory leak in kvm_create_vm_debugfs
    KVM: x86/pmu: Clear anythread deprecated bit when 0xa leaf is unsupported on the SVM
    KVM: mmio: Fix use-after-free Read in kvm_vm_ioctl_unregister_coalesced_mmio
    KVM: SVM: Revert clearing of C-bit on GPA in #NPF handler
    KVM: x86/mmu: Do not apply HPA (memory encryption) mask to GPAs
    KVM: x86: Use kernel's x86_phys_bits to handle reduced MAXPHYADDR
    ...

    Linus Torvalds
     

15 Jul, 2021

3 commits

  • In commit bc9e9e672df9 ("KVM: debugfs: Reuse binary stats descriptors")
    the loop for filling debugfs_stat_data was copy-pasted twice, but in the
    second loop the pointers are stored over the pointers allocated in the
    first loop. All this causes is a memory leak; fix it (see the sketch
    after this entry).

    Fixes: bc9e9e672df9 ("KVM: debugfs: Reuse binary stats descriptors")
    Signed-off-by: Pavel Skripkin
    Reviewed-by: Jing Zhang
    Message-Id:
    Reviewed-by: Jing Zhang
    Signed-off-by: Paolo Bonzini

    Pavel Skripkin
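
    A simplified sketch of the fixed indexing, with the per-descriptor setup
    and debugfs file creation elided; the point is only that the second loop
    must land past the slots used by the first:

        int i;

        for (i = 0; i < kvm_vm_stats_header.num_desc; i++) {
            struct kvm_stat_data *stat_data =
                kzalloc(sizeof(*stat_data), GFP_KERNEL_ACCOUNT);

            if (!stat_data)
                return -ENOMEM;
            kvm->debugfs_stat_data[i] = stat_data;
        }

        for (i = 0; i < kvm_vcpu_stats_header.num_desc; i++) {
            struct kvm_stat_data *stat_data =
                kzalloc(sizeof(*stat_data), GFP_KERNEL_ACCOUNT);

            if (!stat_data)
                return -ENOMEM;
            /*
             * The copy-pasted loop stored this at [i], overwriting (and
             * leaking) the pointers from the first loop.
             */
            kvm->debugfs_stat_data[i + kvm_vm_stats_header.num_desc] =
                stat_data;
        }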
     
  • BUG: KASAN: use-after-free in kvm_vm_ioctl_unregister_coalesced_mmio+0x7c/0x1ec arch/arm64/kvm/../../../virt/kvm/coalesced_mmio.c:183
    Read of size 8 at addr ffff0000c03a2500 by task syz-executor083/4269

    CPU: 5 PID: 4269 Comm: syz-executor083 Not tainted 5.10.0 #7
    Hardware name: linux,dummy-virt (DT)
    Call trace:
    dump_backtrace+0x0/0x2d0 arch/arm64/kernel/stacktrace.c:132
    show_stack+0x28/0x34 arch/arm64/kernel/stacktrace.c:196
    __dump_stack lib/dump_stack.c:77 [inline]
    dump_stack+0x110/0x164 lib/dump_stack.c:118
    print_address_description+0x78/0x5c8 mm/kasan/report.c:385
    __kasan_report mm/kasan/report.c:545 [inline]
    kasan_report+0x148/0x1e4 mm/kasan/report.c:562
    check_memory_region_inline mm/kasan/generic.c:183 [inline]
    __asan_load8+0xb4/0xbc mm/kasan/generic.c:252
    kvm_vm_ioctl_unregister_coalesced_mmio+0x7c/0x1ec arch/arm64/kvm/../../../virt/kvm/coalesced_mmio.c:183
    kvm_vm_ioctl+0xe30/0x14c4 arch/arm64/kvm/../../../virt/kvm/kvm_main.c:3755
    vfs_ioctl fs/ioctl.c:48 [inline]
    __do_sys_ioctl fs/ioctl.c:753 [inline]
    __se_sys_ioctl fs/ioctl.c:739 [inline]
    __arm64_sys_ioctl+0xf88/0x131c fs/ioctl.c:739
    __invoke_syscall arch/arm64/kernel/syscall.c:36 [inline]
    invoke_syscall arch/arm64/kernel/syscall.c:48 [inline]
    el0_svc_common arch/arm64/kernel/syscall.c:158 [inline]
    do_el0_svc+0x120/0x290 arch/arm64/kernel/syscall.c:220
    el0_svc+0x1c/0x28 arch/arm64/kernel/entry-common.c:367
    el0_sync_handler+0x98/0x170 arch/arm64/kernel/entry-common.c:383
    el0_sync+0x140/0x180 arch/arm64/kernel/entry.S:670

    Allocated by task 4269:
    stack_trace_save+0x80/0xb8 kernel/stacktrace.c:121
    kasan_save_stack mm/kasan/common.c:48 [inline]
    kasan_set_track mm/kasan/common.c:56 [inline]
    __kasan_kmalloc+0xdc/0x120 mm/kasan/common.c:461
    kasan_kmalloc+0xc/0x14 mm/kasan/common.c:475
    kmem_cache_alloc_trace include/linux/slab.h:450 [inline]
    kmalloc include/linux/slab.h:552 [inline]
    kzalloc include/linux/slab.h:664 [inline]
    kvm_vm_ioctl_register_coalesced_mmio+0x78/0x1cc arch/arm64/kvm/../../../virt/kvm/coalesced_mmio.c:146
    kvm_vm_ioctl+0x7e8/0x14c4 arch/arm64/kvm/../../../virt/kvm/kvm_main.c:3746
    vfs_ioctl fs/ioctl.c:48 [inline]
    __do_sys_ioctl fs/ioctl.c:753 [inline]
    __se_sys_ioctl fs/ioctl.c:739 [inline]
    __arm64_sys_ioctl+0xf88/0x131c fs/ioctl.c:739
    __invoke_syscall arch/arm64/kernel/syscall.c:36 [inline]
    invoke_syscall arch/arm64/kernel/syscall.c:48 [inline]
    el0_svc_common arch/arm64/kernel/syscall.c:158 [inline]
    do_el0_svc+0x120/0x290 arch/arm64/kernel/syscall.c:220
    el0_svc+0x1c/0x28 arch/arm64/kernel/entry-common.c:367
    el0_sync_handler+0x98/0x170 arch/arm64/kernel/entry-common.c:383
    el0_sync+0x140/0x180 arch/arm64/kernel/entry.S:670

    Freed by task 4269:
    stack_trace_save+0x80/0xb8 kernel/stacktrace.c:121
    kasan_save_stack mm/kasan/common.c:48 [inline]
    kasan_set_track+0x38/0x6c mm/kasan/common.c:56
    kasan_set_free_info+0x20/0x40 mm/kasan/generic.c:355
    __kasan_slab_free+0x124/0x150 mm/kasan/common.c:422
    kasan_slab_free+0x10/0x1c mm/kasan/common.c:431
    slab_free_hook mm/slub.c:1544 [inline]
    slab_free_freelist_hook mm/slub.c:1577 [inline]
    slab_free mm/slub.c:3142 [inline]
    kfree+0x104/0x38c mm/slub.c:4124
    coalesced_mmio_destructor+0x94/0xa4 arch/arm64/kvm/../../../virt/kvm/coalesced_mmio.c:102
    kvm_iodevice_destructor include/kvm/iodev.h:61 [inline]
    kvm_io_bus_unregister_dev+0x248/0x280 arch/arm64/kvm/../../../virt/kvm/kvm_main.c:4374
    kvm_vm_ioctl_unregister_coalesced_mmio+0x158/0x1ec arch/arm64/kvm/../../../virt/kvm/coalesced_mmio.c:186
    kvm_vm_ioctl+0xe30/0x14c4 arch/arm64/kvm/../../../virt/kvm/kvm_main.c:3755
    vfs_ioctl fs/ioctl.c:48 [inline]
    __do_sys_ioctl fs/ioctl.c:753 [inline]
    __se_sys_ioctl fs/ioctl.c:739 [inline]
    __arm64_sys_ioctl+0xf88/0x131c fs/ioctl.c:739
    __invoke_syscall arch/arm64/kernel/syscall.c:36 [inline]
    invoke_syscall arch/arm64/kernel/syscall.c:48 [inline]
    el0_svc_common arch/arm64/kernel/syscall.c:158 [inline]
    do_el0_svc+0x120/0x290 arch/arm64/kernel/syscall.c:220
    el0_svc+0x1c/0x28 arch/arm64/kernel/entry-common.c:367
    el0_sync_handler+0x98/0x170 arch/arm64/kernel/entry-common.c:383
    el0_sync+0x140/0x180 arch/arm64/kernel/entry.S:670

    If kvm_io_bus_unregister_dev() returns -ENOMEM, kvm_iodevice_destructor()
    has already been called inside that function to delete
    'struct kvm_coalesced_mmio_dev *dev' from the list and free the device;
    calling kvm_iodevice_destructor() again leads to the use-after-free shown
    above.

    Check the return value of kvm_io_bus_unregister_dev() and only call
    kvm_iodevice_destructor() if the return value is 0 (see the sketch after
    this entry).

    Cc: Paolo Bonzini
    Cc: kvm@vger.kernel.org
    Reported-by: Hulk Robot
    Signed-off-by: Kefeng Wang
    Message-Id:
    Cc: stable@vger.kernel.org
    Fixes: 5d3c4c79384a ("KVM: Stop looking for coalesced MMIO zones if the bus is destroyed", 2021-04-20)
    Signed-off-by: Paolo Bonzini

    Kefeng Wang
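
    A sketch of the fixed unregister path; it is simplified, and the helper
    and field names follow virt/kvm/coalesced_mmio.c but should be treated as
    illustrative:

        int kvm_vm_ioctl_unregister_coalesced_mmio(struct kvm *kvm,
                                struct kvm_coalesced_mmio_zone *zone)
        {
            struct kvm_coalesced_mmio_dev *dev, *tmp;
            int r;

            mutex_lock(&kvm->slots_lock);

            list_for_each_entry_safe(dev, tmp, &kvm->coalesced_zones, list) {
                if (zone->pio == dev->zone.pio &&
                    coalesced_mmio_in_range(dev, zone->addr, zone->size)) {
                    r = kvm_io_bus_unregister_dev(kvm,
                            zone->pio ? KVM_PIO_BUS : KVM_MMIO_BUS,
                            &dev->dev);
                    /*
                     * On failure the bus and every device on it, including
                     * this one, were already destroyed; freeing the device
                     * again is the use-after-free reported above.
                     */
                    if (r)
                        break;
                    kvm_iodevice_destructor(&dev->dev);
                }
            }

            mutex_unlock(&kvm->slots_lock);
            return 0;
        }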
     
  • …git/kvms390/linux into HEAD

    KVM: selftests: Fixes

    - provide memory model for IBM z196 and zEC12
    - do not require 64GB of memory

    Paolo Bonzini
     

30 Jun, 2021

2 commits

  • Merge misc updates from Andrew Morton:
    "191 patches.

    Subsystems affected by this patch series: kthread, ia64, scripts,
    ntfs, squashfs, ocfs2, kernel/watchdog, and mm (gup, pagealloc, slab,
    slub, kmemleak, dax, debug, pagecache, gup, swap, memcg, pagemap,
    mprotect, bootmem, dma, tracing, vmalloc, kasan, initialization,
    pagealloc, and memory-failure)"

    * emailed patches from Andrew Morton : (191 commits)
    mm,hwpoison: make get_hwpoison_page() call get_any_page()
    mm,hwpoison: send SIGBUS with error virutal address
    mm/page_alloc: split pcp->high across all online CPUs for cpuless nodes
    mm/page_alloc: allow high-order pages to be stored on the per-cpu lists
    mm: replace CONFIG_FLAT_NODE_MEM_MAP with CONFIG_FLATMEM
    mm: replace CONFIG_NEED_MULTIPLE_NODES with CONFIG_NUMA
    docs: remove description of DISCONTIGMEM
    arch, mm: remove stale mentions of DISCONIGMEM
    mm: remove CONFIG_DISCONTIGMEM
    m68k: remove support for DISCONTIGMEM
    arc: remove support for DISCONTIGMEM
    arc: update comment about HIGHMEM implementation
    alpha: remove DISCONTIGMEM and NUMA
    mm/page_alloc: move free_the_page
    mm/page_alloc: fix counting of managed_pages
    mm/page_alloc: improve memmap_pages dbg msg
    mm: drop SECTION_SHIFT in code comments
    mm/page_alloc: introduce vm.percpu_pagelist_high_fraction
    mm/page_alloc: limit the number of pages on PCP lists when reclaim is active
    mm/page_alloc: scale the number of pages that are batch freed
    ...

    Linus Torvalds
     
  • vma_lookup() finds the vma of a specific address with a cleaner interface
    and is more readable.

    Link: https://lkml.kernel.org/r/20210521174745.2219620-11-Liam.Howlett@Oracle.com
    Signed-off-by: Liam R. Howlett
    Reviewed-by: Laurent Dufour
    Acked-by: David Hildenbrand
    Acked-by: Davidlohr Bueso
    Cc: Geert Uytterhoeven
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Liam Howlett
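
    For illustration, the typical before/after shape of such a conversion
    (hva is a hypothetical host virtual address the caller cares about):

        struct vm_area_struct *vma;

        /* Before: find_vma() returns the first VMA that ends above the
         * address, so callers must also check vm_start themselves. */
        vma = find_vma(current->mm, hva);
        if (!vma || vma->vm_start > hva)
            return -EFAULT;

        /* After: vma_lookup() only returns a VMA that contains hva. */
        vma = vma_lookup(current->mm, hva);
        if (!vma)
            return -EFAULT;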
     

29 Jun, 2021

2 commits

  • Pull kvm updates from Paolo Bonzini:
    "This covers all architectures (except MIPS) so I don't expect any
    other feature pull requests this merge window.

    ARM:

    - Add MTE support in guests, complete with tag save/restore interface

    - Reduce the impact of CMOs by moving them in the page-table code

    - Allow device block mappings at stage-2

    - Reduce the footprint of the vmemmap in protected mode

    - Support the vGIC on dumb systems such as the Apple M1

    - Add selftest infrastructure to support multiple configuration and
    apply that to PMU/non-PMU setups

    - Add selftests for the debug architecture

    - The usual crop of PMU fixes

    PPC:

    - Support for the H_RPT_INVALIDATE hypercall

    - Conversion of Book3S entry/exit to C

    - Bug fixes

    S390:

    - new HW facilities for guests

    - make inline assembly more robust with KASAN and co

    x86:

    - Allow userspace to handle emulation errors (unknown instructions)

    - Lazy allocation of the rmap (host physical -> guest physical
    address)

    - Support for virtualizing TSC scaling on VMX machines

    - Optimizations to avoid shattering huge pages at the beginning of
    live migration

    - Support for initializing the PDPTRs without loading them from
    memory

    - Many TLB flushing cleanups

    - Refuse to load if two-stage paging is available but NX is not (this
    has been a requirement in practice for over a year)

    - A large series that separates the MMU mode (WP/SMAP/SMEP etc.) from
    CR0/CR4/EFER, using the MMU mode everywhere once it is computed
    from the CPU registers

    - Use PM notifier to notify the guest about host suspend or hibernate

    - Support for passing arguments to Hyper-V hypercalls using XMM
    registers

    - Support for Hyper-V TLB flush hypercalls and enlightened MSR bitmap
    on AMD processors

    - Hide Hyper-V hypercalls that are not included in the guest CPUID

    - Fixes for live migration of virtual machines that use the Hyper-V
    "enlightened VMCS" optimization of nested virtualization

    - Bugfixes (not many)

    Generic:

    - Support for retrieving statistics without debugfs

    - Cleanups for the KVM selftests API"

    * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (314 commits)
    KVM: x86: rename apic_access_page_done to apic_access_memslot_enabled
    kvm: x86: disable the narrow guest module parameter on unload
    selftests: kvm: Allows userspace to handle emulation errors.
    kvm: x86: Allow userspace to handle emulation errors
    KVM: x86/mmu: Let guest use GBPAGES if supported in hardware and TDP is on
    KVM: x86/mmu: Get CR4.SMEP from MMU, not vCPU, in shadow page fault
    KVM: x86/mmu: Get CR0.WP from MMU, not vCPU, in shadow page fault
    KVM: x86/mmu: Drop redundant rsvd bits reset for nested NPT
    KVM: x86/mmu: Optimize and clean up so called "last nonleaf level" logic
    KVM: x86: Enhance comments for MMU roles and nested transition trickiness
    KVM: x86/mmu: WARN on any reserved SPTE value when making a valid SPTE
    KVM: x86/mmu: Add helpers to do full reserved SPTE checks w/ generic MMU
    KVM: x86/mmu: Use MMU's role to determine PTTYPE
    KVM: x86/mmu: Collapse 32-bit PAE and 64-bit statements for helpers
    KVM: x86/mmu: Add a helper to calculate root from role_regs
    KVM: x86/mmu: Add helper to update paging metadata
    KVM: x86/mmu: Don't update nested guest's paging bitmasks if CR0.PG=0
    KVM: x86/mmu: Consolidate reset_rsvds_bits_mask() calls
    KVM: x86/mmu: Use MMU role_regs to get LA57, and drop vCPU LA57 helper
    KVM: x86/mmu: Get nested MMU's root level from the MMU's role
    ...

    Linus Torvalds
     
  • Pull scheduler updates from Ingo Molnar:

    - Changes to core scheduling facilities:

    - Add "Core Scheduling" via CONFIG_SCHED_CORE=y, which enables
    coordinated scheduling across SMT siblings. This is a much
    requested feature for cloud computing platforms, to allow the
    flexible utilization of SMT siblings, without exposing untrusted
    domains to information leaks & side channels, plus to ensure more
    deterministic computing performance on SMT systems used by
    heterogeneous workloads.

    There are new prctls to set core scheduling groups, which allows
    more flexible management of workloads that can share siblings.

    - Fix task->state access anti-patterns that may result in missed
    wakeups and rename it to ->__state in the process to catch new
    abuses.

    - Load-balancing changes:

    - Tweak newidle_balance for fair-sched, to improve 'memcache'-like
    workloads.

    - "Age" (decay) average idle time, to better track & improve
    workloads such as 'tbench'.

    - Fix & improve energy-aware (EAS) balancing logic & metrics.

    - Fix & improve the uclamp metrics.

    - Fix task migration (taskset) corner case on !CONFIG_CPUSET.

    - Fix RT and deadline utilization tracking across policy changes

    - Introduce a "burstable" CFS controller via cgroups, which allows
    bursty CPU-bound workloads to borrow a bit against their future
    quota to improve overall latencies & batching. Can be tweaked via
    /sys/fs/cgroup/cpu//cpu.cfs_burst_us.

    - Rework asymmetric topology/capacity detection & handling.

    - Scheduler statistics & tooling:

    - Disable delayacct by default, but add a sysctl to enable it at
    runtime if tooling needs it. Use static keys and other
    optimizations to make it more palatable.

    - Use sched_clock() in delayacct, instead of ktime_get_ns().

    - Misc cleanups and fixes.

    * tag 'sched-core-2021-06-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (72 commits)
    sched/doc: Update the CPU capacity asymmetry bits
    sched/topology: Rework CPU capacity asymmetry detection
    sched/core: Introduce SD_ASYM_CPUCAPACITY_FULL sched_domain flag
    psi: Fix race between psi_trigger_create/destroy
    sched/fair: Introduce the burstable CFS controller
    sched/uclamp: Fix uclamp_tg_restrict()
    sched/rt: Fix Deadline utilization tracking during policy change
    sched/rt: Fix RT utilization tracking during policy change
    sched: Change task_struct::state
    sched,arch: Remove unused TASK_STATE offsets
    sched,timer: Use __set_current_state()
    sched: Add get_current_state()
    sched,perf,kvm: Fix preemption condition
    sched: Introduce task_is_running()
    sched: Unbreak wakeups
    sched/fair: Age the average idle time
    sched/cpufreq: Consider reduced CPU capacity in energy calculation
    sched/fair: Take thermal pressure into account while estimating energy
    thermal/cpufreq_cooling: Update offline CPUs per-cpu thermal_pressure
    sched/fair: Return early from update_tg_cfs_load() if delta == 0
    ...

    Linus Torvalds
     

25 Jun, 2021

3 commits

  • To remove code duplication, use the binary stats descriptors in the
    implementation of the debugfs interface for statistics. This unifies
    the definition of statistics for the binary and debugfs interfaces.

    Signed-off-by: Jing Zhang
    Message-Id:
    Signed-off-by: Paolo Bonzini

    Jing Zhang
     
  • Add a VCPU ioctl that returns a statistics file descriptor, through which
    userspace can read out the VCPU stats header, descriptors and data.
    Define VCPU statistics descriptors and header for all architectures.

    Reviewed-by: David Matlack
    Reviewed-by: Ricardo Koller
    Reviewed-by: Krish Sadhukhan
    Reviewed-by: Fuad Tabba
    Tested-by: Fuad Tabba #arm64
    Signed-off-by: Jing Zhang
    Message-Id:
    Signed-off-by: Paolo Bonzini

    Jing Zhang
     
  • Add a VM ioctl that returns a statistics file descriptor, through which
    userspace can read out the VM stats header, descriptors and data.
    Define VM statistics descriptors and header for all architectures.

    Reviewed-by: David Matlack
    Reviewed-by: Ricardo Koller
    Reviewed-by: Krish Sadhukhan
    Reviewed-by: Fuad Tabba
    Tested-by: Fuad Tabba #arm64
    Signed-off-by: Jing Zhang
    Message-Id:
    Signed-off-by: Paolo Bonzini

    Jing Zhang
     

24 Jun, 2021

3 commits

  • It's possible to create a region which maps valid but non-refcounted
    pages (e.g., tail pages of non-compound higher-order allocations). These
    host pages can then be returned by the gfn_to_page, gfn_to_pfn, etc.
    family of APIs, which take a reference to the page, taking its refcount
    from 0 to 1. When the reference is dropped, this will free the page
    incorrectly.

    Fix this by only taking a reference on a page if its refcount was already
    non-zero, which indicates it is participating in normal refcounting (and
    can be released with put_page); see the sketch after this entry.

    This addresses CVE-2021-22543.

    Signed-off-by: Nicholas Piggin
    Tested-by: Paolo Bonzini
    Cc: stable@vger.kernel.org
    Signed-off-by: Paolo Bonzini

    Nicholas Piggin
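
    A sketch of the check described above, in the shape of a helper that
    pinning paths can call before relying on get_page()/put_page() semantics
    (name and placement are illustrative):

        static bool kvm_try_get_pfn(kvm_pfn_t pfn)
        {
            if (kvm_is_reserved_pfn(pfn))
                return true;

            /*
             * A refcount of zero means the page is not participating in
             * normal refcounting (e.g. a tail page of a non-compound
             * higher-order allocation); taking it from 0 to 1 and later
             * calling put_page() would free it incorrectly, so refuse to
             * pin it.
             */
            return get_page_unless_zero(pfn_to_page(pfn));
        }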
     
  • This commit defines the API for userspace and prepares the common
    functionality to support per-VM/VCPU binary stats data reads.

    KVM stats are currently only accessible through debugfs, which has some
    shortcomings that this series is meant to fix:
    1. The debugfs stats interface can be disabled when kernel Lockdown
    mode is enabled, which is a potential risk for production.
    2. The debugfs stats interface is organized as "one stat per file",
    which is good for debugging but not efficient for production.
    3. Stats read/clear in the debugfs interface are protected by the
    global kvm_lock.

    Besides that, there are some other benefits with this change:
    1. All KVM VM/VCPU stats can be read out in bulk with one copy to
    userspace.
    2. A schema is used to describe the KVM statistics. From userspace's
    perspective, the KVM statistics are self-describing.
    3. With the fd-based solution, a separate telemetry process is able to
    read KVM stats in a less privileged environment.
    4. After the initial setup of reading in the stats descriptors, a
    telemetry process only needs to read the stats data itself; no further
    parsing or setup is needed. (A rough userspace sketch follows this
    entry.)

    Reviewed-by: David Matlack
    Reviewed-by: Ricardo Koller
    Reviewed-by: Krish Sadhukhan
    Reviewed-by: Fuad Tabba
    Tested-by: Fuad Tabba #arm64
    Signed-off-by: Jing Zhang
    Message-Id:
    Signed-off-by: Paolo Bonzini

    Jing Zhang
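
    The userspace sketch referenced above, assuming the KVM_GET_STATS_FD
    ioctl and the kvm_stats_header layout from the uapi headers; it treats
    every stat as a single u64 and skips descriptor parsing, so check the
    kernel documentation before relying on the exact offsets:

        #include <stdio.h>
        #include <stdlib.h>
        #include <sys/ioctl.h>
        #include <unistd.h>
        #include <linux/kvm.h>

        /* Read the stats header and raw data block from a VM or vCPU fd. */
        static int dump_kvm_stats(int kvm_obj_fd)
        {
            struct kvm_stats_header header;
            __u64 *data;
            int stats_fd;

            /* Same ioctl for VM and vCPU fds; it takes no argument. */
            stats_fd = ioctl(kvm_obj_fd, KVM_GET_STATS_FD, NULL);
            if (stats_fd < 0)
                return -1;

            /* One-time setup: the header (and, not shown, the descriptor
             * table starting at header.desc_offset). */
            if (pread(stats_fd, &header, sizeof(header), 0) != sizeof(header))
                goto err;

            /* Periodic collection: only the data block needs re-reading. */
            data = calloc(header.num_desc, sizeof(*data));
            if (!data)
                goto err;
            pread(stats_fd, data, header.num_desc * sizeof(*data),
                  header.data_offset);

            printf("first stat value: %llu\n", (unsigned long long)data[0]);

            free(data);
            close(stats_fd);
            return 0;
        err:
            close(stats_fd);
            return -1;
        }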
     
  • Generic KVM stats are those collected in architecture independent code
    or those supported by all architectures; put all generic statistics in
    a separate structure. This ensures that they are defined the same way
    in the statistics API which is being added, removing duplication among
    different architectures in the declaration of the descriptors.

    No functional change intended.

    Reviewed-by: David Matlack
    Reviewed-by: Ricardo Koller
    Reviewed-by: Krish Sadhukhan
    Signed-off-by: Jing Zhang
    Message-Id:
    Signed-off-by: Paolo Bonzini

    Jing Zhang
     

18 Jun, 2021

5 commits

  • When run from the sched-out path (preempt_notifier or perf_event),
    p->state is irrelevant for determining preemption: a task can get
    preempted with !task_is_running() just fine.

    The right indicator for preemption is if the task is still on the
    runqueue in the sched-out path.

    Signed-off-by: Peter Zijlstra (Intel)
    Acked-by: Mark Rutland
    Link: https://lore.kernel.org/r/20210611082838.285099381@infradead.org

    Peter Zijlstra
     
  • Make them the same type as vCPU stats. There is no reason
    to limit the counters to unsigned long.

    Signed-off-by: Paolo Bonzini

    Paolo Bonzini
     
  • Add KVM PM-notifier so that architectures can have arch-specific
    VM suspend/resume routines. Such architectures need to select
    CONFIG_HAVE_KVM_PM_NOTIFIER and implement kvm_arch_pm_notifier().

    Signed-off-by: Sergey Senozhatsky
    Acked-by: Marc Zyngier
    Message-Id:
    Signed-off-by: Paolo Bonzini

    Sergey Senozhatsky
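
    A hedged sketch of what an arch-side implementation could look like; the
    hook name comes from the entry above, while the per-state actions are
    placeholders:

        #include <linux/suspend.h>

        /* Selected via CONFIG_HAVE_KVM_PM_NOTIFIER by interested arches. */
        int kvm_arch_pm_notifier(struct kvm *kvm, unsigned long state)
        {
            switch (state) {
            case PM_HIBERNATION_PREPARE:
            case PM_SUSPEND_PREPARE:
                /* e.g. save clocks or quiesce vCPUs before the host sleeps */
                break;
            case PM_POST_HIBERNATION:
            case PM_POST_SUSPEND:
                /* undo the above after resume */
                break;
            }
            return NOTIFY_DONE;
        }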
     
  • Add a new lock to protect the arch-specific fields of memslots if they
    need to be modified in a kvm->srcu read critical section. A future
    commit will use this lock to lazily allocate memslot rmaps for x86.

    Signed-off-by: Ben Gardon
    Message-Id:
    [Add Documentation/ hunk. - Paolo]
    Signed-off-by: Paolo Bonzini

    Ben Gardon
     
  • Factor out copying kvm_memslots from allocating the memory for new ones
    in preparation for adding a new lock to protect the arch-specific fields
    of the memslots.

    No functional change intended.

    Reviewed-by: David Hildenbrand
    Signed-off-by: Ben Gardon
    Message-Id:
    Signed-off-by: Paolo Bonzini

    Ben Gardon
     

27 May, 2021

3 commits

  • For VMX, when a vcpu enters HLT emulation, pi_post_block will:

    1) Add vcpu to per-cpu list of blocked vcpus.

    2) Program the posted-interrupt descriptor "notification vector"
    to POSTED_INTR_WAKEUP_VECTOR

    With interrupt remapping, an interrupt will set the PIR bit for the
    vector programmed for the device on the CPU, test-and-set the
    ON bit on the posted interrupt descriptor, and if the ON bit is clear
    generate an interrupt for the notification vector.

    This way, the target CPU wakes upon a device interrupt and wakes up
    the target vcpu.

    The problem is that pi_post_block only programs the notification vector
    if kvm_arch_has_assigned_device() is true. It's possible for the
    following to happen:

    1) vcpu V HLTs on pcpu P, kvm_arch_has_assigned_device is false,
    notification vector is not programmed
    2) device is assigned to VM
    3) device interrupts vcpu V, sets ON bit
    (notification vector not programmed, so pcpu P remains in idle)
    4) vcpu 0 IPIs vcpu V (in guest), but since pi descriptor ON bit is set,
    kvm_vcpu_kick is skipped
    5) vcpu 0 busy spins on vcpu V's response for several seconds, until
    RCU watchdog NMIs all vCPUs.

    To fix this, use the start_assignment kvm_x86_ops callback to kick
    vcpus out of the halt loop, so the notification vector is
    properly reprogrammed to the wakeup vector.

    Reported-by: Pei Zhang
    Signed-off-by: Marcelo Tosatti
    Message-Id:
    Signed-off-by: Paolo Bonzini

    Marcelo Tosatti
     
  • KVM_REQ_UNBLOCK will be used to exit a vcpu from
    its inner vcpu halt emulation loop.

    Rename KVM_REQ_PENDING_TIMER to KVM_REQ_UNBLOCK, switch
    PowerPC to arch specific request bit.

    Signed-off-by: Marcelo Tosatti

    Message-Id:
    Signed-off-by: Paolo Bonzini

    Marcelo Tosatti
     
  • This is inspired by commit 262de4102c7bb8 ("kvm: exit halt polling on
    need_resched() as well"). Because PPC implements its own arch-specific
    halt-polling logic, the need_resched() check has to be added there as
    well. This patch adds a helper function that can be shared between the
    book3s and generic halt-polling loops (a sketch of the helper follows
    this entry).

    Reviewed-by: David Matlack
    Reviewed-by: Venkatesh Srinivas
    Cc: Ben Segall
    Cc: Venkatesh Srinivas
    Cc: Jim Mattson
    Cc: David Matlack
    Cc: Paul Mackerras
    Cc: Suraj Jitindar Singh
    Signed-off-by: Wanpeng Li
    Message-Id:
    [Make the function inline. - Paolo]
    Signed-off-by: Paolo Bonzini

    Wanpeng Li
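
    Roughly, the shared helper looks like this (a sketch; the exact name and
    location may differ):

        /*
         * Keep polling only while nothing else wants this CPU and the
         * polling window has not yet expired; usable from both the generic
         * halt-polling loop and book3s's own loop.
         */
        static inline bool kvm_vcpu_can_poll(ktime_t cur, ktime_t stop)
        {
            return single_task_running() && !need_resched() &&
                   ktime_before(cur, stop);
        }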
     

17 May, 2021

1 commit


15 May, 2021

1 commit

  • This reverts commit a979a6aa009f3c99689432e0cdb5402a4463fb88.

    The reverted commit may cause VM freeze on arm64 with GICv4,
    where stopping a consumer is implemented by suspending the VM.
    Should the connect fail, the VM will not be resumed, which
    is a bit of a problem.

    It also erroneously calls the producer destructor unconditionally,
    which is unexpected.

    Reported-by: Shaokun Zhang
    Suggested-by: Marc Zyngier
    Acked-by: Jason Wang
    Acked-by: Michael S. Tsirkin
    Reviewed-by: Eric Auger
    Tested-by: Shaokun Zhang
    Signed-off-by: Zhu Lingshan
    [maz: tags and cc-stable, commit message update]
    Signed-off-by: Marc Zyngier
    Fixes: a979a6aa009f ("irqbypass: do not start cons/prod when failed connect")
    Link: https://lore.kernel.org/r/3a2c66d6-6ca0-8478-d24b-61e8e3241b20@hisilicon.com
    Link: https://lore.kernel.org/r/20210508071152.722425-1-lingshan.zhu@intel.com
    Cc: stable@vger.kernel.org

    Zhu Lingshan
     

07 May, 2021

1 commit

  • When growing halt-polling, there is no check that the poll time exceeds
    the per-VM limit. It's possible for vcpu->halt_poll_ns to grow past
    kvm->max_halt_poll_ns and stay there until a halt which takes longer
    than kvm->halt_poll_ns.

    Signed-off-by: David Matlack
    Signed-off-by: Venkatesh Srinivas
    Message-Id:
    Cc: stable@vger.kernel.org
    Signed-off-by: Paolo Bonzini

    David Matlack
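
    A sketch of the growth path with the missing clamp added; module
    parameter and trace names follow kvm_main.c but are incidental here:

        static void grow_halt_poll_ns(struct kvm_vcpu *vcpu)
        {
            unsigned int old, val, grow, grow_start;

            old = val = vcpu->halt_poll_ns;
            grow_start = READ_ONCE(halt_poll_ns_grow_start);
            grow = READ_ONCE(halt_poll_ns_grow);
            if (!grow)
                goto out;

            val *= grow;
            if (val < grow_start)
                val = grow_start;

            /* The missing check: never grow past the per-VM limit. */
            if (val > vcpu->kvm->max_halt_poll_ns)
                val = vcpu->kvm->max_halt_poll_ns;

            vcpu->halt_poll_ns = val;
        out:
            trace_kvm_halt_poll_ns_grow(vcpu->vcpu_id, val, old);
        }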
     

03 May, 2021

1 commit

  • single_task_running() is usually a more general check than
    need_resched(), but CFS_BANDWIDTH throttling will use resched_task()
    when there is just one task, in order to get that task to block. This
    was causing long need_resched warnings and was likely allowing VMs to
    overrun their quota when halt polling.

    Signed-off-by: Ben Segall
    Signed-off-by: Venkatesh Srinivas
    Message-Id:
    Signed-off-by: Paolo Bonzini
    Cc: stable@vger.kernel.org
    Reviewed-by: Jim Mattson

    Benjamin Segall
     

22 Apr, 2021

2 commits

  • Both the lock holder vCPU and an IPI receiver that has halted are
    candidates for a boost. However, the PLE handler was originally designed
    to deal with the lock holder preemption problem: Intel PLE occurs when
    the spinlock waiter is in kernel mode. This assumption doesn't hold for
    IPI receivers, which can be in either kernel or user mode; a candidate
    vCPU in user mode will not be boosted even if it should respond to IPIs.
    Some benchmarks like pbzip2, swaptions etc. do the TLB shootdown in
    kernel mode while running in user mode most of the time. This can lead
    to a large number of continuous PLE events, because the IPI sender
    causes PLE events repeatedly until the receiver is scheduled, while the
    receiver is not a candidate for a boost.

    This patch boosts vCPU candidates that are in user mode and are being
    delivered an interrupt. We observe that pbzip2 speeds up by 10% in a
    96-vCPU VM in an over-subscription scenario (the host machine is a
    2-socket, 48-core, 96-HT Intel CLX box). There is no performance
    regression for other benchmarks like Unixbench spawn (which mostly
    contends read/write locks in kernel mode) or ebizzy (which mostly
    contends read/write semaphores and does TLB shootdowns in kernel mode).

    Signed-off-by: Wanpeng Li
    Message-Id:
    Signed-off-by: Paolo Bonzini

    Wanpeng Li
     
  • Add a capability for userspace to mirror SEV encryption context from
    one vm to another. On our side, this is intended to support a
    Migration Helper vCPU, but it can also be used generically to support
    other in-guest workloads scheduled by the host. The intention is for
    the primary guest and the mirror to have nearly identical memslots.

    The primary benefits of this are that:
    1) The VMs do not share KVM contexts (think APIC/MSRs/etc), so they
    can't accidentally clobber each other.
    2) The VMs can have different memory-views, which is necessary for post-copy
    migration (the migration vCPUs on the target need to read and write to
    pages, when the primary guest would VMEXIT).

    This does not change the threat model for AMD SEV. Any memory involved
    is still owned by the primary guest and its initial state is still
    attested to through the normal SEV_LAUNCH_* flows. If userspace wanted
    to circumvent SEV, they could achieve the same effect by simply attaching
    a vCPU to the primary VM.
    This patch deliberately leaves userspace in charge of the memslots for the
    mirror, as it already has the power to mess with them in the primary guest.

    This patch does not support SEV-ES (much less SNP), as it does not
    handle handing off attested VMSAs to the mirror.

    For additional context, we need a Migration Helper because SEV PSP
    migration is far too slow for our live migration on its own. Using
    an in-guest migrator lets us speed this up significantly.

    Signed-off-by: Nathan Tempelman
    Message-Id:
    Signed-off-by: Paolo Bonzini

    Nathan Tempelman
     

20 Apr, 2021

3 commits

  • Convert a comment above kvm_io_bus_unregister_dev() into an actual
    lockdep assertion, and opportunistically add curly braces to a multi-line
    for-loop.

    Signed-off-by: Sean Christopherson
    Message-Id:
    Signed-off-by: Paolo Bonzini

    Sean Christopherson
     
  • Abort the walk of coalesced MMIO zones if kvm_io_bus_unregister_dev()
    fails to allocate memory for the new instance of the bus. If it can't
    instantiate a new bus, unregister_dev() destroys all devices _except_ the
    target device. But, it doesn't tell the caller that it obliterated the
    bus and invoked the destructor for all devices that were on the bus. In
    the coalesced MMIO case, this can result in a deleted list entry
    dereference due to attempting to continue iterating on coalesced_zones
    after future entries (in the walk) have been deleted.

    Opportunistically add curly braces to the for-loop, which encompasses
    many lines but sneaks by without braces due to the guts being a single
    if statement.

    Fixes: f65886606c2d ("KVM: fix memory leak in kvm_io_bus_unregister_dev()")
    Cc: stable@vger.kernel.org
    Reported-by: Hao Sun
    Signed-off-by: Sean Christopherson
    Message-Id:
    Signed-off-by: Paolo Bonzini

    Sean Christopherson
     
  • If allocating a new instance of an I/O bus fails when unregistering a
    device, wait to destroy the device until after all readers are guaranteed
    to see the new null bus. Destroying devices before the bus is nullified
    could lead to use-after-free since readers expect the devices on their
    reference of the bus to remain valid.

    Fixes: f65886606c2d ("KVM: fix memory leak in kvm_io_bus_unregister_dev()")
    Cc: stable@vger.kernel.org
    Signed-off-by: Sean Christopherson
    Message-Id:
    Signed-off-by: Paolo Bonzini

    Sean Christopherson
     

17 Apr, 2021

6 commits

  • Defer acquiring mmu_lock in the MMU notifier paths until a "hit" has been
    detected in the memslots, i.e. don't take the lock for notifications that
    don't affect the guest.

    For small VMs, spurious locking is a minor annoyance. And for "volatile"
    setups where the majority of notifications _are_ relevant, this barely
    qualifies as an optimization.

    But, for large VMs (hundreds of threads) with static setups, e.g. no
    page migration, no swapping, etc..., the vast majority of MMU notifier
    callbacks will be unrelated to the guest, e.g. will often be in response
    to the userspace VMM adjusting its own virtual address space. In such
    large VMs, acquiring mmu_lock can be painful as it blocks vCPUs from
    handling page faults. In some scenarios it can even be "fatal" in the
    sense that it causes unacceptable brownouts, e.g. when rebuilding huge
    pages after live migration, a significant percentage of vCPUs will be
    attempting to handle page faults.

    x86's TDP MMU implementation is especially susceptible to spurious
    locking due to it taking mmu_lock for read when handling page faults.
    Because rwlock is fair, a single writer will stall future readers, while
    the writer is itself stalled waiting for in-progress readers to complete.
    This is exacerbated by the MMU notifiers often firing multiple times in
    quick succession, e.g. moving a page will (always?) invoke three separate
    notifiers: .invalidate_range_start(), .invalidate_range_end(), and
    .change_pte(). Unnecessarily taking mmu_lock each time means even a
    single spurious sequence can be problematic.

    Note, this optimizes only the unpaired callbacks. Optimizing the
    .invalidate_range_{start,end}() pairs is more complex and will be done in
    a future patch.

    Suggested-by: Ben Gardon
    Signed-off-by: Sean Christopherson
    Message-Id:
    Signed-off-by: Paolo Bonzini

    Sean Christopherson
     
  • Acquire and release mmu_lock in the __kvm_handle_hva_range() helper
    instead of requiring the caller to do the same. This paves the way for
    future patches to take mmu_lock if and only if an overlapping memslot is
    found, without also having to introduce the on_lock() shenanigans used
    to manipulate the notifier count and sequence.

    No functional change intended.

    Signed-off-by: Sean Christopherson
    Message-Id:
    Signed-off-by: Paolo Bonzini

    Sean Christopherson
     
  • Yank out the hva-based MMU notifier APIs now that all architectures that
    use the notifiers have moved to the gfn-based APIs.

    No functional change intended.

    Signed-off-by: Sean Christopherson
    Message-Id:
    Signed-off-by: Paolo Bonzini

    Sean Christopherson
     
  • Move the hva->gfn lookup for MMU notifiers into common code. Every arch
    does a similar lookup, and some arch code is all but identical across
    multiple architectures.

    In addition to consolidating code, this will allow introducing
    optimizations that will benefit all architectures without incurring
    multiple walks of the memslots, e.g. by taking mmu_lock if and only if a
    relevant range exists in the memslots.

    The use of __always_inline to avoid indirect call retpolines, as done by
    x86, may also benefit other architectures.

    Consolidating the lookups also fixes a wart in x86, where the legacy MMU
    and TDP MMU each do their own memslot walks.

    Lastly, future enhancements to the memslot implementation, e.g. to add an
    interval tree to track host address, will need to touch far less arch
    specific code.

    MIPS, PPC, and arm64 will be converted one at a time in future patches.

    Signed-off-by: Sean Christopherson
    Message-Id:
    Signed-off-by: Paolo Bonzini

    Sean Christopherson
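
    To make the consolidated lookup concrete, a sketch of the per-memslot
    hva-to-gfn clamping the common code performs; hva_range_to_gfn_range is
    a hypothetical helper name, while hva_to_gfn_memslot and struct
    kvm_gfn_range follow the series:

        static void hva_range_to_gfn_range(struct kvm_memory_slot *slot,
                                           unsigned long start,
                                           unsigned long end,
                                           struct kvm_gfn_range *range)
        {
            unsigned long hva_start, hva_end;

            /* Clamp the notified hva range to the part backed by this slot. */
            hva_start = max(start, slot->userspace_addr);
            hva_end = min(end, slot->userspace_addr +
                               (slot->npages << PAGE_SHIFT));

            /*
             * The gfn range is exclusive at the end; round up so every page
             * overlapping the hva range is covered.
             */
            range->start = hva_to_gfn_memslot(hva_start, slot);
            range->end = hva_to_gfn_memslot(hva_end + PAGE_SIZE - 1, slot);
            range->slot = slot;
        }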
     
  • In KVM's .change_pte() notification callback, replace the notifier
    sequence bump with a WARN_ON assertion that the notifier count is
    elevated. An elevated count provides stricter protections than bumping
    the sequence, and the sequence is guaranteed to be bumped before the
    count hits zero.

    When .change_pte() was added by commit 828502d30073 ("ksm: add
    mmu_notifier set_pte_at_notify()"), bumping the sequence was necessary
    as .change_pte() would be invoked without any surrounding notifications.

    However, since commit 6bdb913f0a70 ("mm: wrap calls to set_pte_at_notify
    with invalidate_range_start and invalidate_range_end"), all calls to
    .change_pte() are guaranteed to be surrounded by start() and end(), and
    so are guaranteed to run with an elevated notifier count.

    Note, wrapping .change_pte() with .invalidate_range_{start,end}() is a
    bug of sorts, as invalidating the secondary MMU's (KVM's) PTE defeats
    the purpose of .change_pte(). Every arch's kvm_set_spte_hva() assumes
    .change_pte() is called when the relevant SPTE is present in KVM's MMU,
    as the original goal was to accelerate Kernel Samepage Merging (KSM) by
    updating KVM's SPTEs without requiring a VM-Exit (due to invalidating
    the SPTE). I.e. it means that .change_pte() is effectively dead code
    on _all_ architectures.

    x86 and MIPS are clearcut nops if the old SPTE is not-present, and that
    is guaranteed due to the prior invalidation. PPC simply unmaps the SPTE,
    which again should be a nop due to the invalidation. arm64 is a bit
    murky, but it's also likely a nop because kvm_pgtable_stage2_map() is
    called without a cache pointer, which means it will map an entry if and
    only if an existing PTE was found.

    For now, take advantage of the bug to simplify future consolidation of
    KVMs's MMU notifier code. Doing so will not greatly complicate fixing
    .change_pte(), assuming it's even worth fixing. .change_pte() has been
    broken for 8+ years and no one has complained. Even if there are
    KSM+KVM users that care deeply about its performance, the benefits of
    avoiding VM-Exits via .change_pte() need to be reevaluated to justify
    the added complexity and testing burden. Ripping out .change_pte()
    entirely would be a lot easier.

    Signed-off-by: Sean Christopherson
    Signed-off-by: Paolo Bonzini

    Sean Christopherson
     
  • Use GFP_KERNEL_ACCOUNT when allocating vCPUs to make it more obvious
    that the allocations are accounted, to make it easier to audit KVM's
    allocations in the future, and to be consistent with other cache usage in
    KVM.

    When using SLAB/SLUB, this is a nop as the cache itself is created with
    SLAB_ACCOUNT.

    When using SLOB, there are caveats within caveats. SLOB doesn't honor
    SLAB_ACCOUNT, so passing GFP_KERNEL_ACCOUNT will result in vCPU
    allocations now being accounted. But, even that depends on internal
    SLOB details as SLOB will only go to the page allocator when its cache is
    depleted. That just happens to be extremely likely for vCPUs because the
    size of kvm_vcpu is larger than a page for almost all combinations of
    architecture and page size. Whether or not the SLOB behavior is by
    design is unknown; it's just as likely that no SLOB users care about
    accounting and so no one has bothered to implement support in SLOB.
    Regardless, accounting vCPU allocations will not break SLOB+KVM+cgroup
    users, if any exist.

    Reviewed-by: Wanpeng Li
    Acked-by: David Rientjes
    Signed-off-by: Sean Christopherson
    Message-Id:
    Signed-off-by: Paolo Bonzini

    Sean Christopherson
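
    For illustration, the change boils down to the allocation flag at the
    vCPU creation site (a sketch, assuming the kvm_vcpu_cache slab used by
    common code):

        struct kvm_vcpu *vcpu;

        /*
         * GFP_KERNEL_ACCOUNT makes the cgroup accounting explicit at the
         * call site; with SLAB/SLUB the cache already has SLAB_ACCOUNT, so
         * this is effectively a no-op there.
         */
        vcpu = kmem_cache_zalloc(kvm_vcpu_cache, GFP_KERNEL_ACCOUNT);
        if (!vcpu)
            return -ENOMEM;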