08 Sep, 2021

1 commit

  • Pull KVM updates from Paolo Bonzini:
    "ARM:
    - Page ownership tracking between host EL1 and EL2
    - Rely on userspace page tables to create large stage-2 mappings
    - Fix incompatibility between pKVM and kmemleak
    - Fix the PMU reset state, and improve the performance of the virtual
    PMU
    - Move over to the generic KVM entry code
    - Address PSCI reset issues w.r.t. save/restore
    - Preliminary rework for the upcoming pKVM fixed feature
    - A bunch of MM cleanups
    - a vGIC fix for timer spurious interrupts
    - Various cleanups

    s390:
    - enable interpretation of specification exceptions
    - fix a vcpu_idx vs vcpu_id mixup

    x86:
    - fast (lockless) page fault support for the new MMU
    - new MMU now the default
    - increased maximum allowed VCPU count
    - allow inhibiting IRQs on KVM_RUN while debugging guests
    - let Hyper-V-enabled guests run with virtualized LAPIC as long as
    they do not enable the Hyper-V "AutoEOI" feature
    - fixes and optimizations for the toggling of AMD AVIC (virtualized
    LAPIC)
    - tuning for the case when two-dimensional paging (EPT/NPT) is
    disabled
    - bugfixes and cleanups, especially with respect to vCPU reset and
    choosing a paging mode based on CR0/CR4/EFER
    - support for 5-level page table on AMD processors

    Generic:
    - MMU notifier invalidation callbacks do not take mmu_lock unless
    necessary
    - improved caching of LRU kvm_memory_slot
    - support for histogram statistics
    - add statistics for halt polling and remote TLB flush requests"

    * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (210 commits)
    KVM: Drop unused kvm_dirty_gfn_invalid()
    KVM: x86: Update vCPU's hv_clock before back to guest when tsc_offset is adjusted
    KVM: MMU: mark role_regs and role accessors as maybe unused
    KVM: MIPS: Remove a "set but not used" variable
    x86/kvm: Don't enable IRQ when IRQ enabled in kvm_wait
    KVM: stats: Add VM stat for remote tlb flush requests
    KVM: Remove unnecessary export of kvm_{inc,dec}_notifier_count()
    KVM: x86/mmu: Move lpage_disallowed_link further "down" in kvm_mmu_page
    KVM: x86/mmu: Relocate kvm_mmu_page.tdp_mmu_page for better cache locality
    Revert "KVM: x86: mmu: Add guest physical address check in translate_gpa()"
    KVM: x86/mmu: Remove unused field mmio_cached in struct kvm_mmu_page
    kvm: x86: Increase KVM_SOFT_MAX_VCPUS to 710
    kvm: x86: Increase MAX_VCPUS to 1024
    kvm: x86: Set KVM_MAX_VCPU_ID to 4*KVM_MAX_VCPUS
    KVM: VMX: avoid running vmx_handle_exit_irqoff in case of emulation
    KVM: x86/mmu: Don't freak out if pml5_root is NULL on 4-level host
    KVM: s390: index kvm->arch.idle_mask by vcpu_idx
    KVM: s390: Enable specification exception interpretation
    KVM: arm64: Trim guest debug exception handling
    KVM: SVM: Add 5-level page table support for SVM
    ...

    Linus Torvalds
     

02 Sep, 2021

1 commit

  • Pull documentation updates from Jonathan Corbet:
    "Yet another set of documentation changes:

    - A reworking of PDF generation to yield better results for documents
    using CJK fonts in particular.

    - A new set of translations into traditional Chinese, a dialect for
    which I am assured there is a community of interested readers.

    - A lot more regular Chinese translation work as well.

    ... plus the usual assortment of updates, fixes, typo tweaks, etc"

    * tag 'docs-5.15' of git://git.lwn.net/linux: (55 commits)
    docs: sphinx-requirements: Move sphinx_rtd_theme to top
    docs: pdfdocs: Enable language-specific font choice of zh_TW translations
    docs: pdfdocs: Teach xeCJK about character classes of quotation marks
    docs: pdfdocs: Permit AutoFakeSlant for CJK fonts
    docs: pdfdocs: One-half spacing for CJK translations
    docs: pdfdocs: Add conf.py local to translations for ascii-art alignment
    docs: pdfdocs: Preserve inter-phrase space in Korean translations
    docs: pdfdocs: Choose Serif font as CJK mainfont if possible
    docs: pdfdocs: Add CJK-language-specific font settings
    docs: pdfdocs: Refactor config for CJK document
    scripts/kernel-doc: Override -Werror from KCFLAGS with KDOC_WERROR
    docs/zh_CN: Add zh_CN/accounting/psi.rst
    doc: align Italian translation
    Documentation/features/vm: riscv supports THP now
    docs/zh_CN: add infiniband user_verbs translation
    docs/zh_CN: add infiniband user_mad translation
    docs/zh_CN: add infiniband tag_matching translation
    docs/zh_CN: add infiniband sysfs translation
    docs/zh_CN: add infiniband opa_vnic translation
    docs/zh_CN: add infiniband ipoib translation
    ...

    Linus Torvalds
     

21 Aug, 2021

2 commits

  • KVM_GUESTDBG_BLOCKIRQ will allow KVM to block all interrupts
    while running.

    This change is mostly intended for more robust single stepping
    of the guest and it has the following benefits when enabled:

    * Resuming from a breakpoint is much more reliable.
    When resuming execution from a breakpoint, with interrupts enabled,
    more often than not, KVM would inject an interrupt and make the CPU
    jump immediately to the interrupt handler and eventually return to
    the breakpoint, to trigger it again.

    From the user's point of view it looks like the CPU never executed a
    single instruction, and in some cases that can even prevent forward
    progress, for example, when the breakpoint is placed by an automated
    script (e.g. lx-symbols), which does something in response to the
    breakpoint and then continues the guest automatically.
    If the script execution takes enough time for another interrupt to
    arrive, the guest will be stuck on the same breakpoint RIP forever.

    * Normal single stepping is much more predictable, since it won't
    land the debugger into an interrupt handler.

    * RFLAGS.TF has less chance to be leaked to the guest:

    We set that flag behind the guest's back to do single stepping
    but if single step lands us into an interrupt/exception handler
    it will be leaked to the guest in the form of being pushed
    to the stack.
    This doesn't completely eliminate this problem as exceptions
    can still happen, but at least this reduces the chances
    of this happening.

    Signed-off-by: Maxim Levitsky
    Message-Id:
    Signed-off-by: Paolo Bonzini

    Maxim Levitsky
     
  • Add documentation for linear and logarithmic histogram statistics.

    Signed-off-by: Jing Zhang
    Message-Id:
    [Small changes to the phrasing. - Paolo]
    Signed-off-by: Paolo Bonzini

    Jing Zhang
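
    As a rough illustration of the two histogram kinds named in the entry
    above, the sketch below maps a sample to a bucket index. It assumes
    fixed-width buckets for the linear case and power-of-two buckets
    ([0, 1), [1, 2), [2, 4), ...) for the logarithmic case, with the last
    bucket catching everything larger. The function names are hypothetical
    and are not taken from the kernel.

```python
def linear_bucket(value: int, bucket_size: int, num_buckets: int) -> int:
    """Fixed-width buckets; the last bucket is open-ended."""
    return min(value // bucket_size, num_buckets - 1)


def log_bucket(value: int, num_buckets: int) -> int:
    """Power-of-two buckets: bucket 0 holds only 0, bucket i >= 1 holds
    values in [2**(i-1), 2**i); the last bucket is open-ended."""
    return min(value.bit_length(), num_buckets - 1)


def record(histogram: list, index: int) -> None:
    """Count one sample in the chosen bucket."""
    histogram[index] += 1


# Usage: count a few halt-poll style durations (made-up sample values).
hist = [0] * 8
for sample in (0, 1, 5, 6, 1_000_000):
    record(hist, log_bucket(sample, len(hist)))
print(hist)
```

    The logarithmic form trades resolution for range: eight buckets already
    cover every value, at the cost of coarse buckets for large samples.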
     

13 Aug, 2021

2 commits

  • Merge topic branch with fixes for 5.14-rc6 and 5.15 merge window.

    Paolo Bonzini
     
  • Add yet another spinlock for the TDP MMU and take it when marking indirect
    shadow pages unsync. When using the TDP MMU and L1 is running L2(s) with
    nested TDP, KVM may encounter shadow pages for the TDP entries managed by
    L1 (controlling L2) when handling a TDP MMU page fault. The unsync logic
    is not thread safe, e.g. the kvm_mmu_page fields are not atomic, and
    misbehaves when a shadow page is marked unsync via a TDP MMU page fault,
    which runs with mmu_lock held for read, not write.

    Lack of a critical section manifests most visibly as an underflow of
    unsync_children in clear_unsync_child_bit() due to unsync_children being
    corrupted when multiple CPUs write it without a critical section and
    without atomic operations. But underflow is the best case scenario. The
    worst case scenario is that unsync_children prematurely hits '0' and
    leads to guest memory corruption due to KVM neglecting to properly sync
    shadow pages.

    Use an entirely new spinlock even though piggybacking tdp_mmu_pages_lock
    would functionally be ok. Usurping the lock could degrade performance when
    building upper level page tables on different vCPUs, especially since the
    unsync flow could hold the lock for a comparatively long time depending on
    the number of indirect shadow pages and the depth of the paging tree.

    For simplicity, take the lock for all MMUs, even though KVM could fairly
    easily know that mmu_lock is held for write. If mmu_lock is held for
    write, there cannot be contention for the inner spinlock, and marking
    shadow pages unsync across multiple vCPUs will be slow enough that
    bouncing the kvm_arch cacheline should be in the noise.

    Note, even though L2 could theoretically be given access to its own EPT
    entries, a nested MMU must hold mmu_lock for write and thus cannot race
    against a TDP MMU page fault. I.e. the additional spinlock only _needs_ to
    be taken by the TDP MMU, as opposed to being taken by any MMU for a VM
    that is running with the TDP MMU enabled. Holding mmu_lock for read also
    prevents the indirect shadow page from being freed. But as above, keep
    it simple and always take the lock.

    Alternative #1, the TDP MMU could simply pass "false" for can_unsync and
    effectively disable unsync behavior for nested TDP. Write protecting leaf
    shadow pages is unlikely to noticeably impact traditional L1 VMMs, as such
    VMMs typically don't modify TDP entries, but the same may not hold true for
    non-standard use cases and/or VMMs that are migrating physical pages (from
    L1's perspective).

    Alternative #2, the unsync logic could be made thread safe. In theory,
    simply converting all relevant kvm_mmu_page fields to atomics and using
    atomic bitops for the bitmap would suffice. However, (a) an in-depth audit
    would be required, (b) the code churn would be substantial, and (c) legacy
    shadow paging would incur additional atomic operations in performance
    sensitive paths for no benefit (to legacy shadow paging).

    Fixes: a2855afc7ee8 ("KVM: x86/mmu: Allow parallel page faults for the TDP MMU")
    Cc: stable@vger.kernel.org
    Cc: Ben Gardon
    Signed-off-by: Sean Christopherson
    Message-Id:
    Signed-off-by: Paolo Bonzini

    Sean Christopherson
     

03 Aug, 2021

1 commit

  • We would like to avoid taking mmu_lock for .invalidate_range_{start,end}()
    notifications that are unrelated to KVM. Because mmu_notifier_count
    must be modified while holding mmu_lock for write, and must always
    be paired across start->end to stay balanced, lock elision must
    happen in both or none. Therefore, in preparation for this change,
    this patch prevents memslot updates across range_start() and range_end().

    Note, technically flag-only memslot updates could be allowed in parallel,
    but stalling a memslot update for a relatively short amount of time is
    not a scalability issue, and this is all more than complex enough.

    A long note on the locking: a previous version of the patch used an rwsem
    to block the memslot update while the MMU notifier runs, but this resulted
    in the following deadlock involving the pseudo-lock tagged as
    "mmu_notifier_invalidate_range_start".

    ======================================================
    WARNING: possible circular locking dependency detected
    5.12.0-rc3+ #6 Tainted: G OE
    ------------------------------------------------------
    qemu-system-x86/3069 is trying to acquire lock:
    ffffffff9c775ca0 (mmu_notifier_invalidate_range_start){+.+.}-{0:0}, at: __mmu_notifier_invalidate_range_end+0x5/0x190

    but task is already holding lock:
    ffffaff7410a9160 (&kvm->mmu_notifier_slots_lock){.+.+}-{3:3}, at: kvm_mmu_notifier_invalidate_range_start+0x36d/0x4f0 [kvm]

    which lock already depends on the new lock.

    This corresponds to the following MMU notifier logic:

    invalidate_range_start
      take pseudo lock
      down_read()           (*)
      release pseudo lock
    invalidate_range_end
      take pseudo lock      (**)
      up_read()
      release pseudo lock

    At point (*) we take the mmu_notifier_slots_lock inside the pseudo lock;
    at point (**) we take the pseudo lock inside the mmu_notifier_slots_lock.

    This could cause a deadlock (ignoring for a second that the pseudo lock
    is not a lock):

    - invalidate_range_start waits on down_read(), because the rwsem is
    held by install_new_memslots

    - install_new_memslots waits on down_write(), because the rwsem is
    held till (another) invalidate_range_end finishes

    - invalidate_range_end sits waiting on the pseudo lock, held by
    invalidate_range_start.

    Removing the fairness of the rwsem breaks the cycle (in lockdep terms,
    it would change the *shared* rwsem readers into *shared recursive*
    readers), so open-code the wait using a readers count and a
    spinlock. This also allows handling blockable and non-blockable
    critical sections in the same way.

    Losing the rwsem fairness does theoretically allow MMU notifiers to
    block install_new_memslots forever. Note that mm/mmu_notifier.c's own
    retry scheme in mmu_interval_read_begin also uses wait/wake_up
    and is likewise not fair.

    Signed-off-by: Paolo Bonzini

    Paolo Bonzini
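
    The "readers count and a spinlock" scheme described in the entry above
    can be modelled in user space. The following is only a sketch under
    stated assumptions: a threading.Condition stands in for the spinlock
    plus wait queue, and the class and method names are hypothetical
    rather than the kernel's. Invalidations never block on the updater,
    which is exactly the loss of fairness the message mentions.

```python
import threading


class InvalidationGate:
    """Toy model: in-flight invalidations are counted; a memslot update
    waits until the count drops to zero, then runs under the lock."""

    def __init__(self):
        self._cond = threading.Condition()  # stands in for spinlock + waitqueue
        self._active = 0                    # number of start() without end()

    def range_start(self):
        with self._cond:
            self._active += 1               # never waits for the updater

    def range_end(self):
        with self._cond:
            self._active -= 1
            if self._active == 0:
                self._cond.notify_all()     # wake a waiting updater

    def update(self, apply, timeout=None):
        """Run apply() once no invalidation is in flight.
        Returns False if the timeout expires first."""
        with self._cond:
            if not self._cond.wait_for(lambda: self._active == 0, timeout):
                return False
            apply()                         # new start() calls are held off here
            return True


# Usage: the update blocks until the matching range_end() arrives.
gate = InvalidationGate()
slots = {"generation": 0}
gate.range_start()
threading.Timer(0.01, gate.range_end).start()
gate.update(lambda: slots.update(generation=1))
print(slots["generation"])
```

    Because range_start() only increments a counter, a steady stream of
    overlapping invalidations could keep update() waiting indefinitely.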
     

26 Jul, 2021

5 commits

  • The conversion tools used during DocBook/LaTeX/html/Markdown->ReST
    conversion and some cut-and-pasted text contain some characters that
    aren't easily reachable on standard keyboards and/or could cause
    troubles when parsed by the documentation build system.

    Replace the occurrences of the following characters:

    - U+00a0 (' '): NO-BREAK SPACE
    as it can cause lines to be truncated in PDF output

    Signed-off-by: Mauro Carvalho Chehab
    Message-Id:
    Signed-off-by: Paolo Bonzini

    Mauro Carvalho Chehab
     
  • 'KVM_CAP_ENFORCE_PV_CPUID' doesn't match the define in
    include/uapi/linux/kvm.h.

    Signed-off-by: Vitaly Kuznetsov
    Message-Id:
    Signed-off-by: Paolo Bonzini

    Vitaly Kuznetsov
     
  • The conversion tools used during DocBook/LaTeX/html/Markdown->ReST
    conversion and some cut-and-pasted text contain some characters that
    aren't easily reachable on standard keyboards and/or could cause
    troubles when parsed by the documentation build system.

    Replace the occurrences of the following characters:

    - U+00a0 (' '): NO-BREAK SPACE
    as it can cause lines to be truncated in PDF output

    Signed-off-by: Mauro Carvalho Chehab
    Link: https://lore.kernel.org/r/ff70cb42d63f3a1da66af1b21b8d038418ed5189.1626947264.git.mchehab+huawei@kernel.org
    Signed-off-by: Jonathan Corbet

    Mauro Carvalho Chehab
     
  • Add a '::' so that a code block is interpreted properly and also add a
    blank line before the start of a list.

    Fixes: fdc09ddd4064 ("KVM: stats: Add documentation for binary statistics interface")
    Signed-off-by: Ioana Ciornei
    Reviewed-by: Jing Zhang
    Link: https://lore.kernel.org/r/20210722100356.635078-4-ciorneiioana@gmail.com
    Signed-off-by: Jonathan Corbet

    Ioana Ciornei
     
  • Fix some small build warnings. The title underline was too short in some
    cases and a code block was not indented.

    Documentation/virt/kvm/api.rst:7216: WARNING: Title underline too short.

    Fixes: 6dba94035203 ("KVM: x86: Introduce KVM_GET_SREGS2 / KVM_SET_SREGS2")
    Signed-off-by: Ioana Ciornei
    Link: https://lore.kernel.org/r/20210722100356.635078-3-ciorneiioana@gmail.com
    Signed-off-by: Jonathan Corbet

    Ioana Ciornei
     

29 Jun, 2021

1 commit

  • Pull documentation updates from Jonathan Corbet:
    "This was a reasonably active cycle for documentation; this includes:

    - Some kernel-doc cleanups. That script is still regex onslaught from
    hell, but it has gotten a little better.

    - Improvements to the checkpatch docs, which are also used by the
    tool itself.

    - A major update to the pathname lookup documentation.

    - Elimination of :doc: markup, since our automarkup magic can create
    references from filenames without all the extra noise.

    - The flurry of Chinese translation activity continues.

    Plus, of course, the usual collection of updates, typo fixes, and
    warning fixes"

    * tag 'docs-5.14' of git://git.lwn.net/linux: (115 commits)
    docs: path-lookup: use bare function() rather than literals
    docs: path-lookup: update symlink description
    docs: path-lookup: update get_link() ->follow_link description
    docs: path-lookup: update WALK_GET, WALK_PUT desc
    docs: path-lookup: no get_link()
    docs: path-lookup: update i_op->put_link and cookie description
    docs: path-lookup: i_op->follow_link replaced with i_op->get_link
    docs: path-lookup: Add macro name to symlink limit description
    docs: path-lookup: remove filename_mountpoint
    docs: path-lookup: update do_last() part
    docs: path-lookup: update path_mountpoint() part
    docs: path-lookup: update path_to_nameidata() part
    docs: path-lookup: update follow_managed() part
    docs: Makefile: Use CONFIG_SHELL not SHELL
    docs: Take a little noise out of the build process
    docs: x86: avoid using ReST :doc:`foo` markup
    docs: virt: kvm: s390-pv-boot.rst: avoid using ReST :doc:`foo` markup
    docs: userspace-api: landlock.rst: avoid using ReST :doc:`foo` markup
    docs: trace: ftrace.rst: avoid using ReST :doc:`foo` markup
    docs: trace: coresight: coresight.rst: avoid using ReST :doc:`foo` markup
    ...

    Linus Torvalds
     

25 Jun, 2021

6 commits

  • KVM/arm64 updates for v5.14.

    - Add MTE support in guests, complete with tag save/restore interface
    - Reduce the impact of CMOs by moving them into the page-table code
    - Allow device block mappings at stage-2
    - Reduce the footprint of the vmemmap in protected mode
    - Support the vGIC on dumb systems such as the Apple M1
    - Add selftest infrastructure to support multiple configurations
    and apply that to PMU/non-PMU setups
    - Add selftests for the debug architecture
    - The usual crop of PMU fixes

    Paolo Bonzini
     
  • Add a fallback mechanism to the in-kernel instruction emulator that
    allows userspace the opportunity to process an instruction the emulator
    was unable to. When the in-kernel instruction emulator fails to process
    an instruction it will either inject a #UD into the guest or exit to
    userspace with exit reason KVM_INTERNAL_ERROR. This is because it does
    not know how to proceed in an appropriate manner. This feature lets
    userspace get involved to see if it can figure out a better path
    forward.

    Signed-off-by: Aaron Lewis
    Reviewed-by: David Edmondson
    Message-Id:
    Reviewed-by: Jim Mattson
    Signed-off-by: Paolo Bonzini

    Aaron Lewis
     
  • Rename "nxe" to "efer_nx" so that future macro magic can use the pattern
    <reg>_<bit> for all CR0, CR4, and EFER bits that are included in the role.
    Using "efer_nx" also makes it clear that the role bit reflects EFER.NX,
    not the NX bit in the corresponding PTE.

    Signed-off-by: Sean Christopherson
    Message-Id:
    Signed-off-by: Paolo Bonzini

    Sean Christopherson
     
  • Originally, __kvm_sync_page used to check the cr4_pae bit in the role
    to avoid zapping 4-byte kvm_mmu_pages when guest PTEs are 8-byte
    or the other way round. However, in commit 47c42e6b4192 ("KVM: x86: fix
    handling of role.cr4_pae and rename it to 'gpte_size'", 2019-03-28) it
    was observed that this did not work for nested EPT, where the page table
    size would be 8 bytes even if CR4.PAE=0. (Note that the check still
    has to be done for nested *NPT*, so it is not possible to use tdp_enabled
    or similar).

    Therefore, a hack was introduced to identify nested EPT shadow pages
    and unconditionally call __kvm_sync_page() on them. However, it is
    possible to do without the hack to identify nested EPT shadow pages:
    if EPT is active, there will be no shadow pages in non-EPT format,
    and all of them will have gpte_is_8_bytes set to true; we can just
    check the MMU role directly, and the test will always be true.

    Even for non-EPT shadow MMUs, this test should really always be true
    now that __kvm_sync_page() is called if and only if the role is an
    exact match (kvm_mmu_get_page()) or is part of the current MMU context
    (kvm_mmu_sync_roots()). A future commit will convert the likely-pointless
    check into a meaningful WARN to enforce that the mmu_roles of the current
    context and the shadow page are compatible.

    Cc: Vitaly Kuznetsov
    Signed-off-by: Sean Christopherson
    Message-Id:
    Signed-off-by: Paolo Bonzini

    Sean Christopherson
     
  • Warn userspace that KVM_SET_CPUID{,2} after KVM_RUN "may" cause guest
    instability. Initialize last_vmentry_cpu to -1 and use it to detect if
    the vCPU has been run at least once when its CPUID model is changed.

    KVM does not correctly handle changes to paging related settings in the
    guest's vCPU model after KVM_RUN, e.g. MAXPHYADDR, GBPAGES, etc... KVM
    could theoretically zap all shadow pages, but actually making that happen
    is a mess due to lock inversion (vcpu->mutex is held). And even then,
    updating paging settings on the fly would only work if all vCPUs are
    stopped, updated in concert with identical settings, then restarted.

    To support running vCPUs with different vCPU models (that affect paging),
    KVM would need to track all relevant information in kvm_mmu_page_role.
    Note, that's the _page_ role, not the full mmu_role. Updating mmu_role
    isn't sufficient as a vCPU can reuse a shadow page translation that was
    created by a vCPU with different settings and thus completely skip the
    reserved bit checks (that are tied to CPUID).

    Tracking CPUID state in kvm_mmu_page_role is _extremely_ undesirable as
    it would require doubling gfn_track from a u16 to a u32, i.e. would
    increase KVM's memory footprint by 2 bytes for every 4kb of guest memory.
    E.g. MAXPHYADDR (6 bits), GBPAGES, AMD vs. INTEL = 1 bit, and SEV C-BIT
    would all need to be tracked.

    In practice, there is no remotely sane use case for changing any paging
    related CPUID entries on the fly, so just sweep it under the rug (after
    yelling at userspace).

    Signed-off-by: Sean Christopherson
    Message-Id:
    Signed-off-by: Paolo Bonzini

    Sean Christopherson
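
    To put the "2 bytes for every 4kb of guest memory" figure from the
    entry above into perspective, here is a small worked calculation. It
    assumes exactly one tracking entry per 4 KiB guest page and counts
    only the growth from a 2-byte to a 4-byte entry; the helper name is
    hypothetical.

```python
PAGE_SIZE = 4 * 1024  # one tracking entry per 4 KiB guest page (assumption)
GIB = 1 << 30
MIB = 1 << 20


def extra_tracking_bytes(guest_memory_bytes, old_entry_bytes=2, new_entry_bytes=4):
    """Additional memory needed when each per-page entry grows."""
    pages = guest_memory_bytes // PAGE_SIZE
    return pages * (new_entry_bytes - old_entry_bytes)


# Usage: overhead for a 64 GiB guest and a 1 TiB guest, in MiB.
print(extra_tracking_bytes(64 * GIB) // MIB)    # 32
print(extra_tracking_bytes(1024 * GIB) // MIB)  # 512
```

    The cost scales linearly with guest memory: roughly 0.05% of guest
    RAM, paid by every VM whether or not it ever changes its CPUID.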
     
  • This new API provides a file descriptor for every VM and VCPU to read
    KVM statistics data in binary format.
    It is meant to provide a lightweight, flexible, scalable and efficient
    lock-free solution for user space telemetry applications to pull the
    statistics data periodically for large scale systems. The pulling
    frequency could be as high as a few times per second.
    The statistics descriptors are defined by KVM in kernel and can be
    used by userspace to discover VM/VCPU statistics during the one-time
    setup stage.
    The statistics data itself could be read out by userspace telemetry
    periodically without any extra parsing or setup effort.
    There are a few existing interface protocols and definitions, but
    none of them fulfils all the requirements that this interface
    implements, listed below:
    1. During high frequency periodic stats reading, there should be no
    extra effort except the stats data read itself.
    2. Support stats annotation, like type (cumulative, instantaneous,
    peak, histogram, etc) and unit (counter, time, size, cycles, etc).
    3. The stats data reading should be free of lock/synchronization. We
    don't care about the consistency between all the stats data, since
    all stats data cannot be read out at exactly the same time anyway. We
    really care about the change or trend of the stats data. The
    lock-free solution is not just for efficiency and scalability, but
    also for the stats data accuracy and usability. For example, if
    all the stats data readings were protected by a global lock and
    one VCPU died somehow with that lock held, then all stats data
    reading would be blocked, and we would have no way to tell from the
    stats data which VCPU has died.
    4. The stats data reading workload can be handed over to another
    unprivileged process.

    Reviewed-by: David Matlack
    Reviewed-by: Ricardo Koller
    Reviewed-by: Krish Sadhukhan
    Reviewed-by: Fuad Tabba
    Signed-off-by: Jing Zhang
    Message-Id:
    Signed-off-by: Paolo Bonzini

    Jing Zhang
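
    The split between a one-time setup stage and cheap periodic reads,
    described in the entry above, can be illustrated with a toy binary
    layout. This sketch deliberately uses a simplified, hypothetical
    format (a two-field header followed by a block of 64-bit values); it
    is not the real KVM stats layout, which is defined in the KVM API
    documentation. All names below are made up for illustration.

```python
import struct

# Hypothetical header: number of values, byte offset of the data block.
HEADER = struct.Struct("=II")


def build_blob(values, padding=8):
    """Build a toy stats blob: header, padding, then u64 values."""
    data_offset = HEADER.size + padding
    body = struct.pack(f"={len(values)}Q", *values)
    return HEADER.pack(len(values), data_offset) + bytes(padding) + body


def parse_header(blob):
    """One-time setup: learn where the data block lives."""
    num_values, data_offset = HEADER.unpack_from(blob, 0)
    return num_values, data_offset


def read_values(blob, num_values, data_offset):
    """Periodic read: no parsing beyond unpacking the data block."""
    return list(struct.unpack_from(f"={num_values}Q", blob, data_offset))


# Usage: parse the layout once, then re-read the data block as often as needed.
blob = build_blob([3, 1024, 7])
layout = parse_header(blob)
print(read_values(blob, *layout))
```

    Keeping the layout description separate from the data is what lets a
    reader poll several times per second without repeating any setup work.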
     

23 Jun, 2021

1 commit


22 Jun, 2021

2 commits

  • Now that we have H_RPT_INVALIDATE fully implemented, enable
    support for it via the KVM_CAP_PPC_RPT_INVALIDATE KVM capability.

    Signed-off-by: Bharata B Rao
    Reviewed-by: David Gibson
    Signed-off-by: Michael Ellerman
    Link: https://lore.kernel.org/r/20210621085003.904767-6-bharata@linux.ibm.com

    Bharata B Rao
     
  • A new capability (KVM_CAP_ARM_MTE) identifies that the kernel supports
    granting a guest access to the tags, and provides a mechanism for the
    VMM to enable it.

    A new ioctl (KVM_ARM_MTE_COPY_TAGS) provides a simple way for a VMM to
    access the tags of a guest without having to maintain a PROT_MTE mapping
    in userspace. The above capability gates access to the ioctl.

    Reviewed-by: Catalin Marinas
    Signed-off-by: Steven Price
    Signed-off-by: Marc Zyngier
    Link: https://lore.kernel.org/r/20210621111716.37157-7-steven.price@arm.com

    Steven Price
     

18 Jun, 2021

5 commits

  • The :doc:`foo` tag is auto-generated via automarkup.py.
    So, use the filename at the sources, instead of :doc:`foo`.

    Signed-off-by: Mauro Carvalho Chehab
    Link: https://lore.kernel.org/r/8c0fc6578ff6384580fd0d622f363bbbd4fe91da.1623824363.git.mchehab+huawei@kernel.org
    Signed-off-by: Jonathan Corbet

    Mauro Carvalho Chehab
     
  • This hypercall is used by the SEV guest to notify a change in the page
    encryption status to the hypervisor. The hypercall should be invoked
    only when the encryption attribute is changed from encrypted -> decrypted
    and vice versa. By default all guest pages are considered encrypted.

    The hypercall exits to userspace to manage the guest shared regions and
    integrate with the userspace VMM's migration code.

    Cc: Thomas Gleixner
    Cc: Ingo Molnar
    Cc: "H. Peter Anvin"
    Cc: Paolo Bonzini
    Cc: Joerg Roedel
    Cc: Borislav Petkov
    Cc: Tom Lendacky
    Cc: x86@kernel.org
    Cc: kvm@vger.kernel.org
    Cc: linux-kernel@vger.kernel.org
    Reviewed-by: Steve Rutherford
    Signed-off-by: Brijesh Singh
    Signed-off-by: Ashish Kalra
    Co-developed-by: Sean Christopherson
    Signed-off-by: Sean Christopherson
    Co-developed-by: Paolo Bonzini
    Signed-off-by: Paolo Bonzini
    Message-Id:
    Signed-off-by: Paolo Bonzini

    Ashish Kalra
     
  • This is a new version of KVM_GET_SREGS / KVM_SET_SREGS.

    It has the following changes:
    * Has flags for future extensions
    * Has vcpu's PDPTRs, allowing them to be saved/restored on migration.
    * Lacks obsolete interrupt bitmap (done now via KVM_SET_VCPU_EVENTS)

    A new capability, KVM_CAP_SREGS2, is added to signal the availability
    of this ioctl to userspace.

    Signed-off-by: Maxim Levitsky
    Message-Id:
    Signed-off-by: Paolo Bonzini

    Maxim Levitsky
     
  • Modeled after KVM_CAP_ENFORCE_PV_FEATURE_CPUID, the new capability allows
    for limiting Hyper-V features to those exposed to the guest in Hyper-V
    CPUIDs (0x40000003, 0x40000004, ...).

    Signed-off-by: Vitaly Kuznetsov
    Signed-off-by: Paolo Bonzini
    Message-Id:
    Signed-off-by: Paolo Bonzini

    Vitaly Kuznetsov
     
  • Add a new lock to protect the arch-specific fields of memslots if they
    need to be modified in a kvm->srcu read critical section. A future
    commit will use this lock to lazily allocate memslot rmaps for x86.

    Signed-off-by: Ben Gardon
    Message-Id:
    [Add Documentation/ hunk. - Paolo]
    Signed-off-by: Paolo Bonzini

    Ben Gardon
     

09 Jun, 2021

1 commit

  • When computing the access permissions of a shadow page, use the effective
    permissions of the walk up to that point, i.e. the logical AND of its parents'
    permissions. Two guest PxE entries that point at the same table gfn need to
    be shadowed with different shadow pages if their parents' permissions are
    different. KVM currently uses the effective permissions of the last
    non-leaf entry for all non-leaf entries. Because all non-leaf SPTEs have
    full ("uwx") permissions, and the effective permissions are recorded only
    in role.access and merged into the leaves, this can lead to incorrect
    reuse of a shadow page and eventually to a missing guest protection page
    fault.

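    For example, here is a shared pagetable:

       pgd[]   pud[]        pmd[]            virtual address pointers
                         /->pmd1(u--)->pte1(uw-)->page1 <- ptr1 (u--)
            /->pud1(uw-)--->pmd2(uw-)->pte2(uw-)->page2 <- ptr2 (uw-)
       pgd-|           (shared pmd[] as above)
            \->pud2(u--)--->pmd1(u--)->pte1(uw-)->page1 <- ptr3 (u--)
                         \->pmd2(uw-)->pte2(uw-)->page2 <- ptr4 (u--)

    pud1 and pud2 point to the same pmd table, so:
    - ptr1 and ptr3 point to the same page.
    - ptr2 and ptr4 point to the same page.

    (pud1 and pud2 here are pud entries, while pmd1 and pmd2 here are pmd entries)

    - First, the guest reads from ptr1 and KVM prepares a shadow
    page table with role.access=u--, from ptr1's pud1 and ptr1's pmd1.
    "u--" comes from the effective permissions of pgd, pud1 and
    pmd1, which are stored in pt->access. "u--" is used also to get
    the pagetable for pud1, instead of "uw-".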

    - Then the guest writes to ptr2 and KVM reuses pud1 which is present.
    The hypervisor sets up a shadow page for ptr2 with pt->access being "uw-"
    even though the pud1 pmd (because of the incorrect argument to
    kvm_mmu_get_page in the previous step) has role.access="u--".

    - Then the guest reads from ptr3. The hypervisor reuses pud1's
    shadow pmd for pud2, because both use "u--" for their permissions.
    Thus, the shadow pmd already includes entries for both pmd1 and pmd2.

    - At last, the guest writes to ptr4. This causes no vmexit or pagefault,
    because pud1's shadow page structures included an "uw-" page even though
    its role.access was "u--".

    Any kind of shared pagetable might have a similar problem in a
    virtual machine without TDP enabled, if the permissions inherited
    from different ancestors differ.

    In order to fix the problem, we change pt->access to be an array, and
    any access in it will not include permissions ANDed from child ptes.

    The test code is: https://lore.kernel.org/kvm/20210603050537.19605-1-jiangshanlai@gmail.com/
    Remember to test it with TDP disabled.

    The problem had existed long before the commit 41074d07c78b ("KVM: MMU:
    Fix inherited permissions for emulated guest pte updates"), and it
    is hard to find which is the culprit. So there is no fixes tag here.

    Signed-off-by: Lai Jiangshan
    Message-Id:
    Cc: stable@vger.kernel.org
    Fixes: cea0f0e7ea54 ("[PATCH] KVM: MMU: Shadow page table caching")
    Signed-off-by: Paolo Bonzini

    Lai Jiangshan
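
    The rule stated at the top of the entry above, that a shadow page's
    access is the logical AND of the permissions on the walk leading to
    it, can be checked with a few lines of code. This is an illustrative
    sketch only: the bit values and helper names are hypothetical, and
    the pgd entry is assumed to grant full permissions.

```python
U, W, X = 4, 2, 1  # hypothetical bit values for user, write, execute


def parse(perm):
    """Turn a string such as 'uw-' into a bitmask."""
    return sum(bit for ch, bit in zip(perm, (U, W, X)) if ch != "-")


def fmt(mask):
    """Turn a bitmask back into a string such as 'uw-'."""
    return "".join(ch if mask & bit else "-" for ch, bit in zip("uwx", (U, W, X)))


def effective_access(walk):
    """Effective permissions after each entry: AND of all entries so far."""
    acc = U | W | X
    result = []
    for perm in walk:
        acc &= parse(perm)
        result.append(fmt(acc))
    return result


# Usage: the same pmd entry (uw-) and pte (uw-) reached through two pud entries.
via_pud1 = effective_access(["uw-", "uw-", "uw-"])  # pud1(uw-) -> pmd2 -> pte2
via_pud2 = effective_access(["u--", "uw-", "uw-"])  # pud2(u--) -> pmd2 -> pte2
print(via_pud1, via_pud2)
```

    The two walks end at the same guest page tables but accumulate
    different permissions at every level, which is why they cannot share
    shadow pages.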
     

30 May, 2021

1 commit

  • Pull KVM fixes from Paolo Bonzini:
    "ARM fixes:

    - Another state update on exit to userspace fix

    - Prevent the creation of mixed 32/64 VMs

    - Fix regression with irqbypass not restarting the guest on failed
    connect

    - Fix regression with debug register decoding resulting in
    overlapping access

    - Commit exception state on exit to userspace

    - Fix the MMU notifier return values

    - Add missing 'static' qualifiers in the new host stage-2 code

    x86 fixes:

    - fix guest missed wakeup with assigned devices

    - fix WARN reported by syzkaller

    - do not use BIT() in UAPI headers

    - make the kvm_amd.avic parameter bool

    PPC fixes:

    - make halt polling heuristics consistent with other architectures

    selftests:

    - various fixes

    - new performance selftest memslot_perf_test

    - test UFFD minor faults in demand_paging_test"

    * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (44 commits)
    selftests: kvm: fix overlapping addresses in memslot_perf_test
    KVM: X86: Kill off ctxt->ud
    KVM: X86: Fix warning caused by stale emulation context
    KVM: X86: Use kvm_get_linear_rip() in single-step and #DB/#BP interception
    KVM: x86/mmu: Fix comment mentioning skip_4k
    KVM: VMX: update vcpu posted-interrupt descriptor when assigning device
    KVM: rename KVM_REQ_PENDING_TIMER to KVM_REQ_UNBLOCK
    KVM: x86: add start_assignment hook to kvm_x86_ops
    KVM: LAPIC: Narrow the timer latency between wait_lapic_expire and world switch
    selftests: kvm: do only 1 memslot_perf_test run by default
    KVM: X86: Use _BITUL() macro in UAPI headers
    KVM: selftests: add shared hugetlbfs backing source type
    KVM: selftests: allow using UFFD minor faults for demand paging
    KVM: selftests: create alias mappings when using shared memory
    KVM: selftests: add shmem backing source type
    KVM: selftests: refactor vm_mem_backing_src_type flags
    KVM: selftests: allow different backing source types
    KVM: selftests: compute correct demand paging size
    KVM: selftests: simplify setup_demand_paging error handling
    KVM: selftests: Print a message if /dev/kvm is missing
    ...

    Linus Torvalds
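
    As background for the "do not use BIT() in UAPI headers" item above: BIT() comes from kernel-internal headers, so a UAPI header that uses it may fail to compile when included from userspace, whereas _BITUL() is provided by the exported <linux/const.h>. A minimal sketch of the pattern follows; the EXAMPLE_FLAG_* names and example_has_flag() are hypothetical and do not come from any real KVM header.

```c
/* Sketch only: flag definitions written the way a UAPI header would.
 * EXAMPLE_FLAG_* and example_has_flag() are hypothetical names. */
#include <linux/const.h>   /* _BITUL(), exported to userspace */

#define EXAMPLE_FLAG_READ   _BITUL(0)   /* 1UL << 0 */
#define EXAMPLE_FLAG_WRITE  _BITUL(1)   /* 1UL << 1 */
#define EXAMPLE_FLAG_HUGE   _BITUL(5)   /* 1UL << 5 */

/* Return nonzero when 'flag' is set in 'flags'. */
static inline int example_has_flag(unsigned long flags, unsigned long flag)
{
    return (flags & flag) != 0;
}
```

    Because the macros expand to plain unsigned long constants, the same header can be used by both the kernel and userspace programs.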
     

27 May, 2021

1 commit


21 May, 2021

2 commits

  • The document which describes the SGX kernel architecture was added at
    commit 3fa97bf00126 ("Documentation/x86: Document SGX kernel architecture")

    but the reference in virt/kvm/api.rst points to a
    nonexistent document.

    Signed-off-by: Mauro Carvalho Chehab
    Link: https://lore.kernel.org/r/138c24633c6e4edf862a2b4d77033c603fc10406.1621413933.git.mchehab+huawei@kernel.org
    Signed-off-by: Jonathan Corbet

    Mauro Carvalho Chehab
     
  • Changeset f0400a77ebdc ("atomic: Delete obsolete documentation")
    got rid of atomic_ops.rst, pointing out that it was superseded by
    Documentation/atomic_*.txt.

    Update its reference accordingly.

    Fixes: f0400a77ebdc ("atomic: Delete obsolete documentation")
    Signed-off-by: Mauro Carvalho Chehab
    Link: https://lore.kernel.org/r/703af756ac26a06c2185c05dfe6d902253f11161.1621413933.git.mchehab+huawei@kernel.org
    Signed-off-by: Jonathan Corbet

    Mauro Carvalho Chehab
     

17 May, 2021

1 commit

  • Pull x86 fixes from Borislav Petkov:
    "The three SEV commits are not really urgent material. But we figured
    since getting them in now will avoid a huge amount of conflicts
    between future SEV changes touching tip, the kvm and probably other
    trees, sending them to you now would be best.

    The idea is that the tip, kvm etc branches for 5.14 will all be based
    on top of -rc2 and thus everything will be peachy. What is more, those
    changes are purely mechanical and only move definitions around, so they
    should be fine to go now (famous last words).

    Summary:

    - Enable -Wundef for the compressed kernel build stage

    - Reorganize SEV code to streamline and simplify future development"

    * tag 'x86_urgent_for_v5.13_rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    x86/boot/compressed: Enable -Wundef
    x86/msr: Rename MSR_K8_SYSCFG to MSR_AMD64_SYSCFG
    x86/sev: Move GHCB MSR protocol and NAE definitions in a common header
    x86/sev-es: Rename sev-es.{ch} to sev.{ch}

    Linus Torvalds
     

10 May, 2021

1 commit


07 May, 2021

1 commit

  • The capability that exposes new ioctl KVM_X86_SET_MSR_FILTER to
    userspace is specified incorrectly as the ioctl itself (instead of
    KVM_CAP_X86_MSR_FILTER). This patch fixes it.

    Fixes: 1a155254ff93 ("KVM: x86: Introduce MSR filtering")
    Reviewed-by: Alexander Graf
    Signed-off-by: Siddharth Chandrasekaran
    Signed-off-by: Paolo Bonzini

    Siddharth Chandrasekaran
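
    The distinction the commit above corrects can be sketched from the userspace side: a program asks whether the capability KVM_CAP_X86_MSR_FILTER is present (via KVM_CHECK_EXTENSION) before it issues the KVM_X86_SET_MSR_FILTER ioctl. The helper name example_msr_filter_supported() is hypothetical, and the sketch assumes kernel headers new enough to define the capability.

```c
/* Sketch only: probe for the MSR filter capability.
 * example_msr_filter_supported() is a hypothetical helper name. */
#include <linux/kvm.h>   /* KVM_CHECK_EXTENSION, KVM_CAP_X86_MSR_FILTER */
#include <sys/ioctl.h>

/* 'kvm_fd' is a descriptor for /dev/kvm (or a VM fd).
 * Returns a positive value if the capability is present, 0 if it is
 * not, and -1 (with errno set) if the ioctl itself fails. */
static int example_msr_filter_supported(int kvm_fd)
{
    return ioctl(kvm_fd, KVM_CHECK_EXTENSION, KVM_CAP_X86_MSR_FILTER);
}
```

    A caller would typically obtain kvm_fd with open("/dev/kvm", O_RDWR) and only use KVM_X86_SET_MSR_FILTER on a VM when the probe returns a positive value.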
     

02 May, 2021

1 commit

  • Pull kvm updates from Paolo Bonzini:
    "This is a large update by KVM standards, including AMD PSP (Platform
    Security Processor, aka "AMD Secure Technology") and ARM CoreSight
    (debug and trace) changes.

    ARM:

    - CoreSight: Add support for ETE and TRBE

    - Stage-2 isolation for the host kernel when running in protected
    mode

    - Guest SVE support when running in nVHE mode

    - Force W^X hypervisor mappings in nVHE mode

    - ITS save/restore for guests using direct injection with GICv4.1

    - nVHE panics now produce readable backtraces

    - Guest support for PTP using the ptp_kvm driver

    - Performance improvements in the S2 fault handler

    x86:

    - AMD PSP driver changes

    - Optimizations and cleanup of nested SVM code

    - AMD: Support for virtual SPEC_CTRL

    - Optimizations of the new MMU code: fast invalidation, zap under
    read lock, enable/disable dirty page logging under read lock

    - /dev/kvm API for AMD SEV live migration (guest API coming soon)

    - support SEV virtual machines sharing the same encryption context

    - support SGX in virtual machines

    - add a few more statistics

    - improved directed yield heuristics

    - Lots and lots of cleanups

    Generic:

    - Rework of MMU notifier interface, simplifying and optimizing the
    architecture-specific code

    - a handful of "Get rid of oprofile leftovers" patches

    - Some selftests improvements"

    * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (379 commits)
    KVM: selftests: Speed up set_memory_region_test
    selftests: kvm: Fix the check of return value
    KVM: x86: Take advantage of kvm_arch_dy_has_pending_interrupt()
    KVM: SVM: Skip SEV cache flush if no ASIDs have been used
    KVM: SVM: Remove an unnecessary prototype declaration of sev_flush_asids()
    KVM: SVM: Drop redundant svm_sev_enabled() helper
    KVM: SVM: Move SEV VMCB tracking allocation to sev.c
    KVM: SVM: Explicitly check max SEV ASID during sev_hardware_setup()
    KVM: SVM: Unconditionally invoke sev_hardware_teardown()
    KVM: SVM: Enable SEV/SEV-ES functionality by default (when supported)
    KVM: SVM: Condition sev_enabled and sev_es_enabled on CONFIG_KVM_AMD_SEV=y
    KVM: SVM: Append "_enabled" to module-scoped SEV/SEV-ES control variables
    KVM: SEV: Mask CPUID[0x8000001F].eax according to supported features
    KVM: SVM: Move SEV module params/variables to sev.c
    KVM: SVM: Disable SEV/SEV-ES if NPT is disabled
    KVM: SVM: Free sev_asid_bitmap during init if SEV setup fails
    KVM: SVM: Zero out the VMCB array used to track SEV ASID association
    x86/sev: Drop redundant and potentially misleading 'sev_enabled'
    KVM: x86: Move reverse CPUID helpers to separate header file
    KVM: x86: Rename GPR accessors to make mode-aware variants the defaults
    ...

    Linus Torvalds
     

27 Apr, 2021

1 commit

  • Pull documentation updates from Jonathan Corbet:
    "It's been a relatively busy cycle in docsland, though more than
    usually well contained to Documentation/ itself. Highlights include:

    - The Chinese translators have been busy and show no signs of
    stopping anytime soon. Italian has also caught up.

    - Aditya Srivastava has been working on improvements to the
    kernel-doc script.

    - Thorsten continues his work on reporting-issues.rst and related
    documentation around regression reporting.

    - Lots of documentation updates, typo fixes, etc. as usual"

    * tag 'docs-5.13' of git://git.lwn.net/linux: (139 commits)
    docs/zh_CN: add openrisc translation to zh_CN index
    docs/zh_CN: add openrisc index.rst translation
    docs/zh_CN: add openrisc todo.rst translation
    docs/zh_CN: add openrisc openrisc_port.rst translation
    docs/zh_CN: add core api translation to zh_CN index
    docs/zh_CN: add core-api index.rst translation
    docs/zh_CN: add core-api irq index.rst translation
    docs/zh_CN: add core-api irq irqflags-tracing.rst translation
    docs/zh_CN: add core-api irq irq-domain.rst translation
    docs/zh_CN: add core-api irq irq-affinity.rst translation
    docs/zh_CN: add core-api irq concepts.rst translation
    docs: sphinx-pre-install: don't barf on beta Sphinx releases
    scripts: kernel-doc: improve parsing for kernel-doc comments syntax
    docs/zh_CN: two minor fixes in zh_CN/doc-guide/
    Documentation: dev-tools: Add Testing Overview
    docs/zh_CN: add translations in zh_CN/dev-tools/gcov
    docs: reporting-issues: make people CC the regressions list
    MAINTAINERS: add regressions mailing list
    doc:it_IT: align Italian documentation
    docs/zh_CN: sync reporting-issues.rst
    ...

    Linus Torvalds
     

26 Apr, 2021

1 commit


23 Apr, 2021

2 commits

  • KVM/arm64 updates for Linux 5.13

    New features:

    - Stage-2 isolation for the host kernel when running in protected mode
    - Guest SVE support when running in nVHE mode
    - Force W^X hypervisor mappings in nVHE mode
    - ITS save/restore for guests using direct injection with GICv4.1
    - nVHE panics now produce readable backtraces
    - Guest support for PTP using the ptp_kvm driver
    - Performance improvements in the S2 fault handler
    - Alexandru is now a reviewer (not really a new feature...)

    Fixes:
    - Proper emulation of the GICR_TYPER register
    - Handle the complete set of relocations in the nVHE EL2 object
    - Get rid of the oprofile dependency in the PMU code (and of the
    oprofile body parts at the same time)
    - Debug and SPE fixes
    - Fix vcpu reset

    Paolo Bonzini
     
  • Paolo Bonzini