16 Mar, 2019

2 commits

  • Pull KVM updates from Paolo Bonzini:
    "ARM:
    - some cleanups
    - direct physical timer assignment
    - cache sanitization for 32-bit guests

    s390:
    - interrupt cleanup
    - introduction of the Guest Information Block
    - preparation for processor subfunctions in cpu models

    PPC:
    - bug fixes and improvements, especially related to machine checks
    and protection keys

    x86:
    - many, many cleanups, including removing a bunch of MMU code for
    unnecessary optimizations
    - AVIC fixes

    Generic:
    - memcg accounting"

    * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (147 commits)
    kvm: vmx: fix formatting of a comment
    KVM: doc: Document the life cycle of a VM and its resources
    MAINTAINERS: Add KVM selftests to existing KVM entry
    Revert "KVM/MMU: Flush tlb directly in the kvm_zap_gfn_range()"
    KVM: PPC: Book3S: Add count cache flush parameters to kvmppc_get_cpu_char()
    KVM: PPC: Fix compilation when KVM is not enabled
    KVM: Minor cleanups for kvm_main.c
    KVM: s390: add debug logging for cpu model subfunctions
    KVM: s390: implement subfunction processor calls
    arm64: KVM: Fix architecturally invalid reset value for FPEXC32_EL2
    KVM: arm/arm64: Remove unused timer variable
    KVM: PPC: Book3S: Improve KVM reference counting
    KVM: PPC: Book3S HV: Fix build failure without IOMMU support
    Revert "KVM: Eliminate extra function calls in kvm_get_dirty_log_protect()"
    x86: kvmguest: use TSC clocksource if invariant TSC is exposed
    KVM: Never start grow vCPU halt_poll_ns from value below halt_poll_ns_grow_start
    KVM: Expose the initial start value in grow_halt_poll_ns() as a module parameter
    KVM: grow_halt_poll_ns() should never shrink vCPU halt_poll_ns
    KVM: x86/mmu: Consolidate kvm_mmu_zap_all() and kvm_mmu_zap_mmio_sptes()
    KVM: x86/mmu: WARN if zapping a MMIO spte results in zapping children
    ...

    Linus Torvalds
     
  • The series to add memcg accounting to KVM allocations[1] states:

    There are many KVM kernel memory allocations which are tied to the
    life of the VM process and should be charged to the VM process's
    cgroup.

    While it is correct to account KVM kernel allocations to the cgroup of
    the process that created the VM, it's technically incorrect to state
    that the KVM kernel memory allocations are tied to the life of the VM
    process. This is because the VM itself, i.e. struct kvm, is not tied to
    the life of the process which created it, rather it is tied to the life
    of its associated file descriptor. In other words, kvm_destroy_vm() is
    not invoked until fput() decrements its associated file's refcount to
    zero. A simple example is to fork() in Qemu and have the child sleep
    indefinitely; kvm_destroy_vm() isn't called until Qemu closes its file
    descriptor *and* the rogue child is killed.
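
    The fork() example above can be sketched with a toy reference count.
    This is only an illustration under stated assumptions: VmFile and
    rogue_child_scenario are hypothetical names, and the model tracks just
    the file's refcount, nothing else the kernel does.

```python
class VmFile:
    """Toy stand-in for the file that backs a VM; not a kernel structure."""

    def __init__(self):
        self.refcount = 1          # the creator's file descriptor
        self.vm_destroyed = False

    def fork(self):
        # A forked child inherits the descriptor, taking another reference.
        self.refcount += 1

    def fput(self):
        # Only dropping the last reference destroys the VM in this model.
        self.refcount -= 1
        if self.refcount == 0:
            self.vm_destroyed = True


def rogue_child_scenario():
    """Return (destroyed after Qemu closes, destroyed after child dies)."""
    vm = VmFile()
    vm.fork()                      # the child sleeps indefinitely
    vm.fput()                      # Qemu closes its descriptor
    after_qemu_close = vm.vm_destroyed
    vm.fput()                      # the rogue child is killed
    return after_qemu_close, vm.vm_destroyed
```

    In this model the VM survives Qemu closing its descriptor and is only
    destroyed once the child's reference is dropped as well.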

    The allocations are guaranteed to be *accounted* to the process which
    created the VM, but only because KVM's per-{VM,vCPU} ioctls reject the
    ioctl() with -EIO if kvm->mm != current->mm. I.e. the child can keep
    the VM "alive" but can't do anything useful with its reference.

    Note that because 'struct kvm' also holds a reference to the mm_struct
    of its owner, the above behavior also applies to userspace allocations.

    Given that mucking with a VM's file descriptor can lead to subtle and
    undesirable behavior, e.g. memcg charges persisting after a VM is shut
    down, explicitly document a VM's lifecycle and its impact on the VM's
    resources.

    Alternatively, KVM could aggressively free resources when the creating
    process exits, e.g. via mmu_notifier->release(). However, mmu_notifier
    isn't guaranteed to be available, and freeing resources when the creator
    exits is likely to be error prone and fragile as KVM would need to
    ensure that it only freed resources that are truly out of reach. In
    practice, the existing behavior shouldn't be problematic as a properly
    configured system will prevent a child process from being moved out of
    the appropriate cgroup hierarchy, i.e. prevent hiding the process from
    the OOM killer, and will prevent an unprivileged user from being able
    to hold a reference to struct kvm via another method, e.g. debugfs.

    [1]https://patchwork.kernel.org/patch/10806707/

    Signed-off-by: Sean Christopherson
    Signed-off-by: Paolo Bonzini

    Sean Christopherson
     

07 Mar, 2019

1 commit


21 Feb, 2019

3 commits

  • The hard-coded value 10000 in grow_halt_poll_ns() stands for the initial
    start value when raising vcpu->halt_poll_ns.
    It effectively sets the timeout of the first polling session.
    This value has a significant effect on how tolerant we are to outliers.
    In the standard case, a higher value is better: we spend more time
    in the polling busy loop, handle events/interrupts faster, and get
    better performance.
    But for outliers it puts us in a busy loop that does nothing.
    Even if the shrink factor is zero, we still waste time on the first
    iteration.
    The optimal value differs between workloads. It depends on the
    outlier rate and the length of polling sessions.
    As this value has a significant effect on the dynamic halt-polling
    algorithm, it should be configurable and exposed.
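
    A minimal sketch of how a configurable start value could feed the grow
    step. This is a simplified model, not the kernel function: the
    parameter names and the "grow == 0 leaves the value unchanged" rule
    are assumptions made for illustration.

```python
def grow_halt_poll_ns(halt_poll_ns, grow, grow_start=10000):
    """Return the next per-vCPU polling window in nanoseconds (toy model)."""
    if grow == 0:
        # Assumption of this sketch: growing is disabled, keep the value.
        return halt_poll_ns
    if halt_poll_ns == 0:
        # First polling session: begin at the configurable start value
        # instead of a hard-coded 10000.
        return grow_start
    return halt_poll_ns * grow
```

    With the default, the first window is 10000 ns; a workload with few
    outliers could pass a larger grow_start to poll longer from the start.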

    Reviewed-by: Boris Ostrovsky
    Reviewed-by: Liran Alon
    Signed-off-by: Nir Weiner
    Signed-off-by: Paolo Bonzini

    Nir Weiner
     
  • Remove x86 KVM's fast invalidate mechanism, i.e. revert all patches
    from the original series[1].

    Though not explicitly stated, for all intents and purposes the fast
    invalidate mechanism was added to speed up the scenario where removing
    a memslot, e.g. as part of reading PCI ROM, caused KVM to
    flush all shadow entries[1]. Now that the memslot case flushes only
    shadow entries belonging to the memslot, i.e. doesn't use the fast
    invalidate mechanism, the only remaining usages of the mechanism are
    when the VM is being destroyed and when the MMIO generation rolls
    over.

    When a VM is being destroyed, either there are no active vcpus, i.e.
    there's no lock contention, or the VM has ungracefully terminated, in
    which case we want to reclaim its pages as quickly as possible, i.e.
    not release the MMU lock if there are still CPUs executing in the VM.

    The MMIO generation scenario is almost literally a one-in-a-million
    occurrence, i.e. is not a performance sensitive scenario.

    Given that lock-breaking is not desirable (VM teardown) or irrelevant
    (MMIO generation overflow), remove the fast invalidate mechanism to
    simplify the code (a small amount) and to discourage future code from
    zapping all pages as using such a big hammer should be a last resort.

    This reverts commit f6f8adeef542a18b1cb26a0b772c9781a10bb477.

    [1] https://lkml.kernel.org/r/1369960590-14138-1-git-send-email-xiaoguangrong@linux.vnet.ibm.com

    Cc: Xiao Guangrong
    Signed-off-by: Sean Christopherson
    Signed-off-by: Paolo Bonzini

    Sean Christopherson
     
  • ...now that KVM won't explode by moving it out of bit 0. Using bit 63
    eliminates the need to jump over bit 0, e.g. when calculating a new
    memslots generation or when propagating the memslots generation to an
    MMIO spte.
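
    The benefit of bit 63 over bit 0 can be sketched as follows. The
    subject of this message is not shown above, so the sketch assumes it
    is a single flag bit stored in the same 64-bit word as the generation
    counter; all names are hypothetical.

```python
FLAG_BIT63 = 1 << 63           # flag kept in the top bit
FLAG_BIT0 = 1 << 0             # flag kept in the bottom bit
COUNTER_MASK_63 = FLAG_BIT63 - 1   # bits 0..62 hold the counter
MASK64 = (1 << 64) - 1


def next_gen_flag_in_bit63(gen):
    # The counter starts at bit 0, so the next generation is a plain +1.
    return ((gen & COUNTER_MASK_63) + 1) & COUNTER_MASK_63


def next_gen_flag_in_bit0(gen):
    # The counter starts at bit 1, so every step has to jump over bit 0.
    return ((gen & ~FLAG_BIT0) + 2) & MASK64
```

    With the flag in bit 63, the low bits behave like an ordinary counter,
    so no special casing is needed when incrementing or copying them.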

    Signed-off-by: Sean Christopherson
    Signed-off-by: Paolo Bonzini

    Sean Christopherson
     

12 Jan, 2019

1 commit


15 Dec, 2018

1 commit

  • With every new Hyper-V Enlightenment we implement we're forced to add a
    KVM_CAP_HYPERV_* capability. While this approach works it is fairly
    inconvenient: the majority of the enlightenments we do have corresponding
    CPUID feature bit(s) and userspace has to know this anyways to be able to
    expose the feature to the guest.

    Add KVM_GET_SUPPORTED_HV_CPUID ioctl (backed by KVM_CAP_HYPERV_CPUID, "one
    cap to rule them all!") returning all Hyper-V CPUID feature leaves.

    Using the existing KVM_GET_SUPPORTED_CPUID doesn't seem to be possible:
    Hyper-V CPUID feature leaves intersect with KVM's (e.g. 0x40000000,
    0x40000001) and we would probably confuse userspace in case we decide to
    return these twice.

    KVM_CAP_HYPERV_CPUID's number is interim: we intend to drop
    KVM_CAP_HYPERV_STIMER_DIRECT and use its number instead.

    Suggested-by: Paolo Bonzini
    Signed-off-by: Vitaly Kuznetsov
    Signed-off-by: Paolo Bonzini

    Vitaly Kuznetsov
     

14 Dec, 2018

2 commits

  • There are two problems with KVM_GET_DIRTY_LOG. First, and less important,
    it can take kvm->mmu_lock for an extended period of time. Second, its user
    can actually see many false positives in some cases. The latter is due
    to a benign race like this:

    1. KVM_GET_DIRTY_LOG returns a set of dirty pages and write protects
    them.
    2. The guest modifies the pages, causing them to be marked dirty.
    3. Userspace actually copies the pages.
    4. KVM_GET_DIRTY_LOG returns those pages as dirty again, even though
    they were not written to since (3).

    This is especially a problem for large guests, where the time between
    (1) and (3) can be substantial. This patch introduces a new
    capability which, when enabled, makes KVM_GET_DIRTY_LOG not
    write-protect the pages it returns. Instead, userspace has to
    explicitly clear the dirty log bits just before using the content
    of the page. The new KVM_CLEAR_DIRTY_LOG ioctl can also operate on a
    64-page granularity rather than requiring a full memslot to be synced;
    this way, the mmu_lock is taken for small amounts of time, and
    only a small amount of time will pass between write protection
    of pages and the sending of their content.
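
    The race in steps 1-4 and the effect of clearing just before the copy
    can be sketched with a toy model. DirtyLog and next_round are
    hypothetical names; the model clears whole ranges and insists on
    64-page alignment, which is a simplification assumed for this example
    rather than a statement of the real ioctl arguments.

```python
class DirtyLog:
    """Toy model of one memslot's dirty bitmap; not the kernel code."""

    def __init__(self, npages):
        self.dirty = [False] * npages

    def guest_write(self, page):
        self.dirty[page] = True

    def get(self, reprotect):
        """Return the dirty page numbers; optionally reset them at once."""
        pages = [i for i, d in enumerate(self.dirty) if d]
        if reprotect:
            self.dirty = [False] * len(self.dirty)
        return pages

    def clear(self, first_page, num_pages):
        """Reset a range; this model insists on 64-page alignment."""
        if first_page % 64 or num_pages % 64:
            raise ValueError("range must be a multiple of 64 pages")
        end = min(first_page + num_pages, len(self.dirty))
        for page in range(first_page, end):
            self.dirty[page] = False


def next_round(manual_clear):
    """Replay steps 1-4 and return what the next query reports."""
    log = DirtyLog(128)
    log.guest_write(3)
    log.get(reprotect=not manual_clear)   # step 1
    log.guest_write(3)                    # step 2: guest writes again
    if manual_clear:
        log.clear(0, 64)                  # clear just before copying
    # step 3: userspace copies page 3 here
    return log.get(reprotect=False)       # step 4
```

    In the legacy flow page 3 is reported again although it was not
    written after the copy; with the manual clear it is not.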

    Signed-off-by: Paolo Bonzini

    Paolo Bonzini
     
  • The first such capability to be handled in virt/kvm/ will be manual
    dirty page reprotection.

    Signed-off-by: Paolo Bonzini

    Paolo Bonzini
     

26 Oct, 2018

1 commit

  • Pull KVM updates from Radim Krčmář:
    "ARM:
    - Improved guest IPA space support (32 to 52 bits)

    - RAS event delivery for 32bit

    - PMU fixes

    - Guest entry hardening

    - Various cleanups

    - Port of dirty_log_test selftest

    PPC:
    - Nested HV KVM support for radix guests on POWER9. The performance
    is much better than with PR KVM. Migration and arbitrary level of
    nesting is supported.

    - Disable nested HV-KVM on early POWER9 chips that need a particular
    hardware bug workaround

    - One VM per core mode to prevent potential data leaks

    - PCI pass-through optimization

    - merge ppc-kvm topic branch and kvm-ppc-fixes to get a better base

    s390:
    - Initial version of AP crypto virtualization via vfio-mdev

    - Improvement for vfio-ap

    - Set the host program identifier

    - Optimize page table locking

    x86:
    - Enable nested virtualization by default

    - Implement Hyper-V IPI hypercalls

    - Improve #PF and #DB handling

    - Allow guests to use Enlightened VMCS

    - Add migration selftests for VMCS and Enlightened VMCS

    - Allow coalesced PIO accesses

    - Add an option to perform nested VMCS host state consistency check
    through hardware

    - Automatic tuning of lapic_timer_advance_ns

    - Many fixes, minor improvements, and cleanups"

    * tag 'kvm-4.20-1' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (204 commits)
    KVM/nVMX: Do not validate that posted_intr_desc_addr is page aligned
    Revert "kvm: x86: optimize dr6 restore"
    KVM: PPC: Optimize clearing TCEs for sparse tables
    x86/kvm/nVMX: tweak shadow fields
    selftests/kvm: add missing executables to .gitignore
    KVM: arm64: Safety check PSTATE when entering guest and handle IL
    KVM: PPC: Book3S HV: Don't use streamlined entry path on early POWER9 chips
    arm/arm64: KVM: Enable 32 bits kvm vcpu events support
    arm/arm64: KVM: Rename function kvm_arch_dev_ioctl_check_extension()
    KVM: arm64: Fix caching of host MDCR_EL2 value
    KVM: VMX: enable nested virtualization by default
    KVM/x86: Use 32bit xor to clear registers in svm.c
    kvm: x86: Introduce KVM_CAP_EXCEPTION_PAYLOAD
    kvm: vmx: Defer setting of DR6 until #DB delivery
    kvm: x86: Defer setting of CR2 until #PF delivery
    kvm: x86: Add payload operands to kvm_multiple_exception
    kvm: x86: Add exception payload fields to kvm_vcpu_events
    kvm: x86: Add has_payload and payload to kvm_queued_exception
    KVM: Documentation: Fix omission in struct kvm_vcpu_events
    KVM: selftests: add Enlightened VMCS test
    ...

    Linus Torvalds
     

25 Oct, 2018

1 commit

  • Pull documentation updates from Jonathan Corbet:
    "This is a fairly typical cycle for documentation. There's some welcome
    readability improvements for the formatted output, some LICENSES
    updates including the addition of the ISC license, the removal of the
    unloved and unmaintained 00-INDEX files, the deprecated APIs document
    from Kees, more MM docs from Mike Rapoport, and the usual pile of typo
    fixes and corrections"

    * tag 'docs-4.20' of git://git.lwn.net/linux: (41 commits)
    docs: Fix typos in histogram.rst
    docs: Introduce deprecated APIs list
    kernel-doc: fix declaration type determination
    doc: fix a typo in adding-syscalls.rst
    docs/admin-guide: memory-hotplug: remove table of contents
    doc: printk-formats: Remove bogus kobject references for device nodes
    Documentation: preempt-locking: Use better example
    dm flakey: Document "error_writes" feature
    docs/completion.txt: Fix a couple of punctuation nits
    LICENSES: Add ISC license text
    LICENSES: Add note to CDDL-1.0 license that it should not be used
    docs/core-api: memory-hotplug: add some details about locking internals
    docs/core-api: rename memory-hotplug-notifier to memory-hotplug
    docs: improve readability for people with poorer eyesight
    yama: clarify ptrace_scope=2 in Yama documentation
    docs/vm: split memory hotplug notifier description to Documentation/core-api
    docs: move memory hotplug description into admin-guide/mm
    doc: Fix acronym "FEKEK" in ecryptfs
    docs: fix some broken documentation references
    iommu: Fix passthrough option documentation
    ...

    Linus Torvalds
     

19 Oct, 2018

1 commit


18 Oct, 2018

2 commits

  • This is a per-VM capability which can be enabled by userspace so that
    the faulting linear address will be included with the information
    about a pending #PF in L2, and the "new DR6 bits" will be included
    with the information about a pending #DB in L2. With this capability
    enabled, the L1 hypervisor can now intercept #PF before CR2 is
    modified. Under VMX, the L1 hypervisor can now intercept #DB before
    DR6 and DR7 are modified.

    When userspace has enabled KVM_CAP_EXCEPTION_PAYLOAD, it should
    generally provide an appropriate payload when injecting a #PF or #DB
    exception via KVM_SET_VCPU_EVENTS. However, to support restoring old
    checkpoints, this payload is not required.

    Note that bit 16 of the "new DR6 bits" is set to indicate that a debug
    exception (#DB) or a breakpoint exception (#BP) occurred inside an RTM
    region while advanced debugging of RTM transactional regions was
    enabled. This is the reverse of DR6.RTM, which is cleared in this
    scenario.

    This capability also enables exception.pending in struct
    kvm_vcpu_events, which allows userspace to distinguish between pending
    and injected exceptions.
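
    The inverted sense of bit 16 can be sketched as below. This is an
    illustrative model only: apply_db_payload is a hypothetical helper,
    and it assumes DR6.RTM reads as 1 unless the payload reports an RTM
    region.

```python
DR6_RTM = 1 << 16


def apply_db_payload(dr6, payload):
    """Fold the "new DR6 bits" of a #DB payload into a DR6 value."""
    dr6 |= DR6_RTM               # default: not inside an RTM region
    dr6 |= payload               # set the newly reported bits
    dr6 ^= payload & DR6_RTM     # payload bit 16 set -> DR6.RTM cleared
    return dr6
```

    A payload with bit 16 set therefore ends up with DR6.RTM clear, while
    every other payload bit is copied through unchanged.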

    Reported-by: Jim Mattson
    Suggested-by: Paolo Bonzini
    Signed-off-by: Jim Mattson
    Signed-off-by: Paolo Bonzini

    Jim Mattson
     
  • The per-VM capability KVM_CAP_EXCEPTION_PAYLOAD (to be introduced in a
    later commit) adds the following fields to struct kvm_vcpu_events:
    exception_has_payload, exception_payload, and exception.pending.

    With this capability set, all of the details of vcpu->arch.exception,
    including the payload for a pending exception, are reported to
    userspace in response to KVM_GET_VCPU_EVENTS.

    With this capability clear, the original ABI is preserved, and the
    exception.injected field is set for either pending or injected
    exceptions.

    When userspace calls KVM_SET_VCPU_EVENTS with
    KVM_CAP_EXCEPTION_PAYLOAD clear, exception.injected is no longer
    translated to exception.pending. KVM_SET_VCPU_EVENTS can now only
    establish a pending exception when KVM_CAP_EXCEPTION_PAYLOAD is set.
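
    The reporting rule for the two modes can be sketched as follows.
    report_exception is a hypothetical helper that models only the two
    flags discussed above, not the full struct kvm_vcpu_events.

```python
def report_exception(pending, injected, payload_cap_enabled):
    """Return the (pending, injected) flags userspace would observe."""
    if payload_cap_enabled:
        # Both states are reported separately.
        return pending, injected
    # Original ABI: there is no separate "pending" flag, so a pending
    # exception is reported through "injected" as well.
    return False, pending or injected
```

    With the capability clear, a pending exception is indistinguishable
    from an injected one in the reported state.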

    Reported-by: Jim Mattson
    Suggested-by: Paolo Bonzini
    Signed-off-by: Jim Mattson
    Signed-off-by: Paolo Bonzini

    Jim Mattson
     

17 Oct, 2018

4 commits

  • The header file indicates that there are 36 reserved bytes at the end
    of this structure. Adjust the documentation to agree with the header
    file.

    Signed-off-by: Jim Mattson
    Signed-off-by: Paolo Bonzini

    Jim Mattson
     
  • Coalesced pio is based on coalesced mmio and can be used for ports
    such as the rtc port, the pci-host config port and so on.

    This is especially useful for the rtc: some versions of Windows guests
    access the rtc frequently because they use it as the system tick. The
    guest accesses the rtc like this: write the register index to 0x70, then
    write or read data from 0x71. Writing to port 0x70 only sets the index
    and does nothing else, so we can use coalesced pio to handle this case
    and reduce VM-EXIT time.

    A virtual machine accesses the pci-host config port frequently while
    starting up and shutting down, so setting these ports as coalesced pio
    can reduce startup and shutdown time.
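
    The effect on the rtc access pattern can be sketched by counting exits
    to userspace. This is a toy model under stated assumptions: count_exits
    is a hypothetical helper, writes to a coalesced port are buffered
    without an exit, and the buffer is assumed never to fill up.

```python
def count_exits(accesses, coalesced_ports=frozenset()):
    """Count exits to userspace for a list of (direction, port) accesses."""
    exits = 0
    for direction, port in accesses:
        if direction == "out" and port in coalesced_ports:
            continue             # buffered write, no exit of its own
        exits += 1               # any other access exits to userspace
    return exits


# 100 rtc register reads: write the index to 0x70, read the data from 0x71.
rtc_reads = [("out", 0x70), ("in", 0x71)] * 100
```

    In this model, marking 0x70 as coalesced halves the number of exits
    for the sequence, since only the 0x71 accesses still leave the guest.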

    Without my patch, the vm-exit time of accessing rtc 0x70 and piix 0xcf8,
    measured with perf tools (guest OS: Windows 7 64bit):

    IO Port Access  Samples  Samples%  Time%   Min Time  Max Time  Avg time
    0x70:POUT            86    30.99%  74.59%  9us       29us      10.75us (+- 3.41%)
    0xcf8:POUT         1119     2.60%   2.12%  2.79us    56.83us   3.41us (+- 2.23%)

    With my patch:

    IO Port Access  Samples  Samples%  Time%   Min Time  Max Time  Avg time
    0x70:POUT           106    32.02%  29.47%  0us       10us      1.57us (+- 7.38%)
    0xcf8:POUT         1065     1.67%   0.28%  0.41us    65.44us   0.66us (+- 10.55%)

    Signed-off-by: Peng Hao
    Signed-off-by: Paolo Bonzini

    Peng Hao
     
  • Signed-off-by: Peng Hao
    Signed-off-by: Paolo Bonzini

    Peng Hao
     
  • Using a hypercall for sending IPIs is faster because it allows specifying
    any number of vCPUs (even > 64 with a sparse CPU set), and the whole
    procedure takes only one VMEXIT.

    Current Hyper-V TLFS (v5.0b) claims that HvCallSendSyntheticClusterIpi
    hypercall can't be 'fast' (passing parameters through registers) but
    apparently this is not true: Windows always uses it as 'fast', so we need
    to support that.

    Signed-off-by: Vitaly Kuznetsov
    Signed-off-by: Paolo Bonzini

    Vitaly Kuznetsov
     

09 Oct, 2018

4 commits

  • This adds a KVM_PPC_NO_HASH flag to the flags field of the
    kvm_ppc_smmu_info struct, and arranges for it to be set when
    running as a nested hypervisor, as an unambiguous indication
    to userspace that HPT guests are not supported. Reporting the
    KVM_CAP_PPC_MMU_HASH_V3 capability as false could be taken as
    indicating only that the new HPT features in ISA V3.0 are not
    supported, leaving it ambiguous whether pre-V3.0 HPT features
    are supported.

    Reviewed-by: David Gibson
    Signed-off-by: Paul Mackerras

    Paul Mackerras
     
  • With this, userspace can enable a KVM-HV guest to run nested guests
    under it.

    The administrator can control whether any nested guests can be run;
    setting the "nested" module parameter to false prevents any guests
    becoming nested hypervisors (that is, any attempt to enable the nested
    capability on a guest will fail). Guests which are already nested
    hypervisors will continue to be so.

    Reviewed-by: David Gibson
    Signed-off-by: Paul Mackerras

    Paul Mackerras
     
  • This merges in the "ppc-kvm" topic branch of the powerpc tree to get a
    series of commits that touch both general arch/powerpc code and KVM
    code. These commits will be merged both via the KVM tree and the
    powerpc tree.

    Signed-off-by: Paul Mackerras

    Paul Mackerras
     
  • This adds a one-reg register identifier which can be used to read and
    set the virtual PTCR for the guest. This register identifies the
    address and size of the virtual partition table for the guest, which
    contains information about the nested guests under this guest.

    Migrating this value is the only extra requirement for migrating a
    guest which has nested guests (assuming of course that the destination
    host supports nested virtualization in the kvm-hv module).

    Reviewed-by: David Gibson
    Signed-off-by: Paul Mackerras
    Signed-off-by: Michael Ellerman

    Paul Mackerras
     

03 Oct, 2018

1 commit

  • Allow specifying the physical address size limit for a new
    VM via the kvm_type argument for the KVM_CREATE_VM ioctl. This
    allows us to finalise the stage2 page table as early as possible
    and hence perform the right checks on the memory slots
    without complication. The size is encoded as Log2(PA_Size) in
    bits[7:0] of the type field. For backward compatibility the
    value 0 is reserved and implies 40bits. Also, lift the limit
    of the IPA to host limit and allow lower IPA sizes (e.g, 32).

    Userspace can check the extension KVM_CAP_ARM_VM_IPA_SIZE
    for the availability of this feature. The cap check returns the
    maximum limit for the physical address shift supported by the host.
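
    The encoding described above can be sketched as follows. The constant
    and function names are hypothetical; only the layout (Log2 of the size
    in bits[7:0], with 0 meaning 40 bits) is taken from the text.

```python
IPA_SIZE_MASK = 0xFF       # bits[7:0] of the type argument
DEFAULT_IPA_BITS = 40      # what the reserved value 0 implies


def vm_type_for_ipa(ipa_bits):
    """Encode a requested IPA size (in bits) into the type field."""
    if not 0 < ipa_bits <= IPA_SIZE_MASK:
        raise ValueError("IPA size does not fit in bits[7:0]")
    return ipa_bits & IPA_SIZE_MASK


def ipa_bits_from_type(vm_type):
    """Decode the IPA size from a type field, applying the default."""
    bits = vm_type & IPA_SIZE_MASK
    return bits if bits else DEFAULT_IPA_BITS
```

    A type of 0 decodes to 40 bits, so callers that never set the field
    keep the previous behaviour, while 32 selects a 4 GiB guest physical
    address space.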

    Cc: Marc Zyngier
    Cc: Christoffer Dall
    Cc: Peter Maydell
    Cc: Paolo Bonzini
    Cc: Radim Krčmář
    Reviewed-by: Eric Auger
    Signed-off-by: Suzuki K Poulose
    Signed-off-by: Marc Zyngier

    Suzuki K Poulose
     

20 Sep, 2018

1 commit

  • Add KVM_CAP_MSR_PLATFORM_INFO so that userspace can disable guest access
    to reads of MSR_PLATFORM_INFO.

    Disabling access to reads of this MSR gives userspace the control to "expose"
    this platform-dependent information to guests in a clear way. As it exists
    today, guests that read this MSR would get unpopulated information if userspace
    hadn't already set it (and prior to this patch series, only the CPUID faulting
    information could have been populated). This existing interface could be
    confusing if guests don't handle the potential for incorrect/incomplete
    information gracefully (e.g. zero reported for base frequency).

    Signed-off-by: Drew Schmitt
    Signed-off-by: Paolo Bonzini

    Drew Schmitt
     

12 Sep, 2018

1 commit

  • We currently do not notify all gmaps when using gmap_pmdp_xchg(), due
    to locking constraints. This makes ucontrol VMs, which are the only VM
    type that creates multiple gmaps, incompatible with huge pages. Also
    we would need to hold the guest_table_lock of all gmaps that have this
    vmaddr mapped to synchronize access to the pmd.

    ucontrol VMs are rather exotic and creating a new locking concept is
    no easy task. Hence we return EINVAL when trying to activate
    KVM_CAP_S390_HPAGE_1M and report it as being not available when
    checking for it.

    Fixes: a4499382 ("KVM: s390: Add huge page enablement control")
    Signed-off-by: Janosch Frank
    Reviewed-by: David Hildenbrand
    Reviewed-by: Claudio Imbrenda
    Message-Id:
    Signed-off-by: Janosch Frank

    Janosch Frank
     

10 Sep, 2018

1 commit

  • This is a respin with a wider audience (all that get_maintainer returned)
    and I know this spams a *lot* of people. Not sure what would be the correct
    way, so my apologies for ruining your inbox.

    The 00-INDEX files are supposed to give a summary of all files present
    in a directory, but these files are horribly out of date and their
    usefulness is brought into question. Often a simple "ls" would reveal
    the same information as the filenames are generally quite descriptive as
    a short introduction to what the file covers (it should not surprise
    anyone what Documentation/sched/sched-design-CFS.txt covers)

    A few years back it was mentioned that these files were no longer really
    needed, and they have since then grown further out of date, so perhaps
    it is time to just throw them out.

    A short status yields the following _outdated_ 00-INDEX files, first
    counter is files listed in 00-INDEX but missing in the directory, last
    is files present but not listed in 00-INDEX.

    List of outdated 00-INDEX:
    Documentation: (4/10)
    Documentation/sysctl: (0/1)
    Documentation/timers: (1/0)
    Documentation/blockdev: (3/1)
    Documentation/w1/slaves: (0/1)
    Documentation/locking: (0/1)
    Documentation/devicetree: (0/5)
    Documentation/power: (1/1)
    Documentation/powerpc: (0/5)
    Documentation/arm: (1/0)
    Documentation/x86: (0/9)
    Documentation/x86/x86_64: (1/1)
    Documentation/scsi: (4/4)
    Documentation/filesystems: (2/9)
    Documentation/filesystems/nfs: (0/2)
    Documentation/cgroup-v1: (0/2)
    Documentation/kbuild: (0/4)
    Documentation/spi: (1/0)
    Documentation/virtual/kvm: (1/0)
    Documentation/scheduler: (0/2)
    Documentation/fb: (0/1)
    Documentation/block: (0/1)
    Documentation/networking: (6/37)
    Documentation/vm: (1/3)
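
    The two counters in the list above could be computed along these
    lines. This is only a sketch: index_drift is a hypothetical helper, and
    it assumes the 00-INDEX entries have already been parsed into a list
    of file names.

```python
def index_drift(index_entries, directory_files):
    """Return (listed but missing, present but unlisted) counts."""
    listed = set(index_entries)
    present = set(directory_files) - {"00-INDEX"}
    missing = listed - present     # listed in 00-INDEX, gone from the dir
    unlisted = present - listed    # in the directory, absent from 00-INDEX
    return len(missing), len(unlisted)
```

    A directory is up to date exactly when both counters are zero.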

    Then there are 364 subdirectories in Documentation/ with several files that
    are missing 00-INDEX altogether (and another 120 with a single file and no
    00-INDEX).

    I don't really have an opinion as to whether or not we /should/ have 00-INDEX,
    but the above 00-INDEX should either be removed or be kept up to date. If
    we should keep the files, I can try to keep them updated, but I'd rather not
    if we just want to delete them anyway.

    As a starting point, remove all index-files and references to 00-INDEX and
    see where the discussion is going.

    Signed-off-by: Henrik Austad
    Acked-by: "Paul E. McKenney"
    Just-do-it-by: Steven Rostedt
    Reviewed-by: Jens Axboe
    Acked-by: Paul Moore
    Acked-by: Greg Kroah-Hartman
    Acked-by: Mark Brown
    Acked-by: Mike Rapoport
    Cc: [Almost everybody else]
    Signed-off-by: Jonathan Corbet

    Henrik Austad
     

22 Aug, 2018

2 commits


06 Aug, 2018

2 commits

  • Use a hypercall to send IPIs with a single vmexit, instead of one vmexit
    per destination in xAPIC/x2APIC physical mode or one vmexit per cluster
    in x2APIC cluster mode. An Intel guest can enter x2apic cluster mode when
    interrupt remapping is enabled in qemu; however, the latest AMD EPYC still
    only supports xapic mode, which can benefit greatly from Exit-less IPIs.
    This patchset lets a guest send multicast IPIs, with at most 128
    destinations per hypercall in 64-bit mode and 64 vCPUs per hypercall in
    32-bit mode.
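
    The per-hypercall limit can be sketched as a batching step. This is a
    simplified model, not the guest implementation: batch_ipis is a
    hypothetical helper that assumes each hypercall carries a base id plus
    a bitmap of destinations relative to that base.

```python
def batch_ipis(apic_ids, window=128):
    """Group destination ids into (base_id, bitmap) pairs, one per call."""
    batches = []
    base = None
    bitmap = 0
    for apic_id in sorted(set(apic_ids)):
        if base is None or apic_id - base >= window:
            # The id does not fit in the current bitmap: start a new call.
            if base is not None:
                batches.append((base, bitmap))
            base, bitmap = apic_id, 0
        bitmap |= 1 << (apic_id - base)
    if base is not None:
        batches.append((base, bitmap))
    return batches
```

    With window=128, a broadcast to the 80 vCPUs of the benchmark VM below
    fits in one batch; with window=64 (the 32-bit case) it needs two.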

    Hardware: Xeon Skylake 2.5GHz, 2 sockets, 40 cores, 80 threads, the VM
    is 80 vCPUs, IPI microbenchmark(https://lkml.org/lkml/2017/12/19/141):

    x2apic cluster mode, vanilla

    Dry-run: 0, 2392199 ns
    Self-IPI: 6907514, 15027589 ns
    Normal IPI: 223910476, 251301666 ns
    Broadcast IPI: 0, 9282161150 ns
    Broadcast lock: 0, 8812934104 ns

    x2apic cluster mode, pv-ipi

    Dry-run: 0, 2449341 ns
    Self-IPI: 6720360, 15028732 ns
    Normal IPI: 228643307, 255708477 ns
    Broadcast IPI: 0, 7572293590 ns => 22% performance boost
    Broadcast lock: 0, 8316124651 ns

    x2apic physical mode, vanilla

    Dry-run: 0, 3135933 ns
    Self-IPI: 8572670, 17901757 ns
    Normal IPI: 226444334, 255421709 ns
    Broadcast IPI: 0, 19845070887 ns
    Broadcast lock: 0, 19827383656 ns

    x2apic physical mode, pv-ipi

    Dry-run: 0, 2446381 ns
    Self-IPI: 6788217, 15021056 ns
    Normal IPI: 219454441, 249583458 ns
    Broadcast IPI: 0, 7806540019 ns => 154% performance boost
    Broadcast lock: 0, 9143618799 ns

    Cc: Paolo Bonzini
    Cc: Radim Krčmář
    Cc: Vitaly Kuznetsov
    Signed-off-by: Wanpeng Li
    Signed-off-by: Paolo Bonzini

    Wanpeng Li
     
  • For nested virtualization, L0 KVM manages a bit of state for L2 guests
    that cannot be captured through the currently available IOCTLs. In
    fact, the state captured through all of these IOCTLs is usually a mix of
    L1 and L2 state. It also depends on whether the L2 guest was running at
    the moment when the process was interrupted to save its state.

    With this capability, there are two new vcpu ioctls: KVM_GET_NESTED_STATE
    and KVM_SET_NESTED_STATE. These can be used for saving and restoring a VM
    that is in VMX operation.

    Cc: Paolo Bonzini
    Cc: Radim Krčmář
    Cc: Thomas Gleixner
    Cc: Ingo Molnar
    Cc: H. Peter Anvin
    Cc: x86@kernel.org
    Cc: kvm@vger.kernel.org
    Cc: linux-kernel@vger.kernel.org
    Signed-off-by: Jim Mattson
    [karahmed@ - rename structs and functions and make them ready for AMD and
    address previous comments.
    - handle nested.smm state.
    - rebase & a bit of refactoring.
    - Merge 7/8 and 8/8 into one patch. ]
    Signed-off-by: KarimAllah Ahmed
    Signed-off-by: Paolo Bonzini

    Jim Mattson
     

31 Jul, 2018

1 commit

  • General KVM huge page support on s390 has to be enabled via the
    kvm.hpage module parameter. Either nested or hpage can be enabled, as
    we currently do not support vSIE for huge backed guests. Once the vSIE
    support is added we will either drop the parameter or enable it as
    default.

    For a guest the feature has to be enabled through the new
    KVM_CAP_S390_HPAGE_1M capability and the hpage module
    parameter. Enabling it means that cmm can't be enabled for the vm and
    disables pfmf and storage key interpretation.

    This is due to the fact that in some cases, in upcoming patches, we
    have to split huge pages in the guest mapping to be able to set more
    granular memory protection on 4k pages. These split pages have fake
    page tables that are not visible to the Linux memory management which
    subsequently will not manage its PGSTEs, while the SIE will. Disabling
    these features lets us manage PGSTE data in a consistent manner and
    solve that problem.

    Signed-off-by: Janosch Frank
    Reviewed-by: David Hildenbrand

    Janosch Frank
     

21 Jul, 2018

4 commits

  • arm64's new use of KVM's get_events/set_events API calls isn't just
    for RAS; it also allows an SError that has been made pending by KVM as
    part of its device emulation to be migrated.

    Wire this up for 32bit too.

    We only need to read/write the HCR_VA bit, and check that no esr has
    been provided, as we don't yet support VDFSR.

    Signed-off-by: James Morse
    Reviewed-by: Dongjiu Geng
    Signed-off-by: Marc Zyngier

    James Morse
     
  • For the arm64 RAS Extension, user space can inject a virtual SError
    with a specified ESR. User space therefore needs to know whether KVM
    supports injecting such an SError; this interface adds a query for that
    capability.

    KVM checks whether the system supports the RAS Extension; if it does,
    KVM returns true to user space, otherwise it returns false.

    Signed-off-by: Dongjiu Geng
    Reviewed-by: James Morse
    [expanded documentation wording]
    Signed-off-by: James Morse
    Signed-off-by: Marc Zyngier

    Dongjiu Geng
     
  • When migrating VMs, user space may need to know the exception
    state. For example, if KVM makes an SError pending on machine A,
    then after migration to machine B, KVM also needs to pend an SError.

    This new IOCTL exports user-invisible states related to SError.
    Together with appropriate user space changes, user space can get/set
    the SError exception state to do migrate/snapshot/suspend.

    Signed-off-by: Dongjiu Geng
    Reviewed-by: James Morse
    [expanded documentation wording]
    Signed-off-by: James Morse
    Signed-off-by: Marc Zyngier

    Dongjiu Geng
     
  • Update the documentation to reflect the ordering requirements of
    restoring the GICD_IIDR register before any other registers and the
    effects this has on restoring the interrupt groups for an emulated GICv2
    instance.

    Also remove some outdated limitations in the documentation while we're
    at it.

    Reviewed-by: Andrew Jones
    Signed-off-by: Christoffer Dall
    Signed-off-by: Marc Zyngier

    Christoffer Dall
     

22 Jun, 2018

1 commit


02 Jun, 2018

3 commits