30 Jul, 2019

1 commit

  • The cpuidle_haltpoll governor, in conjunction with the haltpoll cpuidle
    driver, allows guest vcpus to poll for a specified amount of time before
    halting.
    This provides the following benefits over host side polling:

    1) The POLL flag is set while polling is performed, which allows
    a remote vCPU to avoid sending an IPI (and the associated
    cost of handling the IPI) when performing a wakeup.

    2) The VM-exit cost can be avoided.

    The downside of guest side polling is that polling is performed
    even with other runnable tasks in the host.

    Results comparing host side polling (halt_poll_ns) and guest side
    polling (guest_halt_poll_ns) for a server/client application in which
    a small packet is ping-ponged:

    host --> 31.33
    halt_poll_ns=300000 / no guest busy spin --> 33.40 (93.8%)
    halt_poll_ns=0 / guest_halt_poll_ns=300000 --> 32.73 (95.7%)

    For the SAP HANA benchmarks (where idle_spin is a parameter
    of the previous version of the patch, results should be the
    same):

    hpns == halt_poll_ns

                               idle_spin=0/  idle_spin=800/  idle_spin=0/
                               hpns=200000   hpns=0          hpns=800000
    DeleteC06T03 (100 thread)  1.76          1.71 (-3%)      1.78 (+1%)
    InsertC16T02 (100 thread)  2.14          2.07 (-3%)      2.18 (+1.8%)
    DeleteC00T01 (1 thread)    1.34          1.28 (-4.5%)    1.29 (-3.7%)
    UpdateC00T03 (1 thread)    4.72          4.18 (-12%)     4.53 (-5%)

    Signed-off-by: Marcelo Tosatti
    Signed-off-by: Rafael J. Wysocki

    Marcelo Tosatti
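
The poll-then-halt behaviour described in this commit can be pictured with a small model. This is only an illustrative sketch, not the governor's code: the function name `idle_wait` and its return values are hypothetical, and the only quantity taken from the commit message is a polling window in nanoseconds (such as guest_halt_poll_ns=300000).

```python
def idle_wait(wakeup_after_ns, guest_halt_poll_ns):
    """Model one idle period of a guest vCPU (hypothetical helper).

    The vCPU polls for up to `guest_halt_poll_ns` before halting.
    Returns (outcome, polled_ns):
      "polled": the wakeup arrived inside the polling window, so the
                vCPU never halted (no IPI needed, no VM-exit taken).
      "halted": the window expired first and the vCPU halted.
    `polled_ns` is the time spent busy-polling, which is the cost the
    commit message mentions: it is paid even if the host has other
    runnable tasks.
    """
    polled_ns = min(wakeup_after_ns, guest_halt_poll_ns)
    outcome = "polled" if wakeup_after_ns < guest_halt_poll_ns else "halted"
    return outcome, polled_ns


if __name__ == "__main__":
    print(idle_wait(100_000, 300_000))  # wakeup lands inside the window
    print(idle_wait(500_000, 300_000))  # window expires, vCPU halts
```

With a window of 0 the model always halts immediately, which corresponds to guest polling being switched off.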
     

24 Jul, 2019

1 commit

  • Renaming docs seems to be en vogue at the moment, so fix one of the
    grossly misnamed directories. We usually never use "virtual" as
    a shortcut for virtualization in the kernel, but always virt,
    as seen in the virt/ top-level directory. Fix up the documentation
    to match that.

    Fixes: ed16648eb5b8 ("Move kvm, uml, and lguest subdirectories under a common "virtual" directory, I.E:")
    Signed-off-by: Christoph Hellwig
    Signed-off-by: Paolo Bonzini

    Christoph Hellwig
     

20 Jul, 2019

1 commit


13 Jul, 2019

1 commit

  • Pull KVM updates from Paolo Bonzini:
    "ARM:
    - support for chained PMU counters in guests
    - improved SError handling
    - handle Neoverse N1 erratum #1349291
    - allow side-channel mitigation status to be migrated
    - standardise most AArch64 system register accesses to msr_s/mrs_s
    - fix host MPIDR corruption on 32bit
    - selftests cleanups

    x86:
    - PMU event {white,black}listing
    - ability for the guest to disable host-side interrupt polling
    - fixes for enlightened VMCS (Hyper-V pv nested virtualization)
    - new hypercall to yield to IPI target
    - support for passing cstate MSRs through to the guest
    - lots of cleanups and optimizations

    Generic:
    - Some txt->rST conversions for the documentation"

    * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (128 commits)
    Documentation: virtual: Add toctree hooks
    Documentation: kvm: Convert cpuid.txt to .rst
    Documentation: virtual: Convert paravirt_ops.txt to .rst
    KVM: x86: Unconditionally enable irqs in guest context
    KVM: x86: PMU Event Filter
    kvm: x86: Fix -Wmissing-prototypes warnings
    KVM: Properly check if "page" is valid in kvm_vcpu_unmap
    KVM: arm/arm64: Initialise host's MPIDRs by reading the actual register
    KVM: LAPIC: Retry tune per-vCPU timer_advance_ns if adaptive tuning goes insane
    kvm: LAPIC: write down valid APIC registers
    KVM: arm64: Migrate _elx sysreg accessors to msr_s/mrs_s
    KVM: doc: Add API documentation on the KVM_REG_ARM_WORKAROUNDS register
    KVM: arm/arm64: Add save/restore support for firmware workaround state
    arm64: KVM: Propagate full Spectre v2 workaround state to KVM guests
    KVM: arm/arm64: Support chained PMU counters
    KVM: arm/arm64: Remove pmc->bitmask
    KVM: arm/arm64: Re-create event when setting counter value
    KVM: arm/arm64: Extract duplicated code to own function
    KVM: arm/arm64: Rename kvm_pmu_{enable/disable}_counter functions
    KVM: LAPIC: ARBPRI is a reserved register for x2APIC
    ...

    Linus Torvalds
     

11 Jul, 2019

5 commits

  • KVM/arm updates for 5.3

    - Add support for chained PMU counters in guests
    - Improve SError handling
    - Handle Neoverse N1 erratum #1349291
    - Allow side-channel mitigation status to be migrated
    - Standardise most AArch64 system register accesses to msr_s/mrs_s
    - Fix host MPIDR corruption on 32bit

    Paolo Bonzini
     
  • Added toctree hooks for indexing. Hooks added only for newly added
    files.

    The hook for the top of the tree will be added in a later patch series
    when a few more substantial changes have been added.

    Signed-off-by: Luke Nowakowski-Krijger
    Signed-off-by: Paolo Bonzini

    Luke Nowakowski-Krijger
     
  • Convert cpuid.txt to .rst format to be parsable by sphinx.

    Change format and spacing to make function definitions and return values
    much more clear. Also added a table that is parsable by sphinx and makes
    the information much more clean. Updated Author email to their new
    active email address. Added license identifier with the consent of the
    author.

    Signed-off-by: Luke Nowakowski-Krijger
    Signed-off-by: Paolo Bonzini

    Luke Nowakowski-Krijger
     
  • Convert paravirt_ops.txt to .rst format so that it can be parsed by
    sphinx.

    Made some minor spacing and formatting corrections to make definitions
    clearer and easier to read. Added default kernel license to the
    document.

    Signed-off-by: Luke Nowakowski-Krijger
    Signed-off-by: Paolo Bonzini

    Luke Nowakowski-Krijger
     
  • Some events can provide a guest with information about other guests or the
    host (e.g. L3 cache stats); providing the capability to restrict access
    to a "safe" set of events would limit the potential for the PMU to be used
    in any side channel attacks. This change introduces a new VM ioctl that
    sets an event filter. If the guest attempts to program a counter for
    any blacklisted or non-whitelisted event, the kernel counter won't be
    created, so any RDPMC/RDMSR will show 0 instances of that event.

    Signed-off-by: Eric Hankland
    [Lots of changes. All remaining bugs are probably mine. - Paolo]
    Signed-off-by: Paolo Bonzini

    Eric Hankland
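
The filtering rule in this commit (a counter is only created for events that pass the filter, otherwise reads show 0) can be sketched as follows. This is a toy model under stated assumptions: `EventFilter`, `counter_allowed`, `read_counter` and the "allow"/"deny" strings are hypothetical names chosen for illustration and are not the ioctl's actual structure or constants.

```python
from dataclasses import dataclass

ALLOW = "allow"  # hypothetical: the listed events form a whitelist
DENY = "deny"    # hypothetical: the listed events form a blacklist


@dataclass(frozen=True)
class EventFilter:
    action: str        # ALLOW or DENY
    events: frozenset  # raw event codes covered by the filter


def counter_allowed(event_filter, event):
    """Decide whether a kernel counter would be created for `event`."""
    if event_filter is None:  # no filter installed: nothing is restricted
        return True
    listed = event in event_filter.events
    return listed if event_filter.action == ALLOW else not listed


def read_counter(event_filter, event, real_count):
    """What the guest observes: 0 when the counter was never created."""
    return real_count if counter_allowed(event_filter, event) else 0


if __name__ == "__main__":
    safe_only = EventFilter(ALLOW, frozenset({0x3C, 0xC0}))
    print(read_counter(safe_only, 0xC0, 1234))  # whitelisted event
    print(read_counter(safe_only, 0x2E, 1234))  # non-whitelisted event
```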
     

10 Jul, 2019

1 commit

  • Pull Documentation updates from Jonathan Corbet:
    "It's been a relatively busy cycle for docs:

    - A fair pile of RST conversions, many from Mauro. These create more
    than the usual number of simple but annoying merge conflicts with
    other trees, unfortunately. He has a lot more of these waiting in
    the wings that, I think, will go to you directly later on.

    - A new document on how to use merges and rebases in kernel repos,
    and one on Spectre vulnerabilities.

    - Various improvements to the build system, including automatic
    markup of function() references because some people, for reasons I
    will never understand, were of the opinion that
    :c:func:``function()`` is unattractive and not fun to type.

    - We now recommend using sphinx 1.7, but still support back to 1.4.

    - Lots of smaller improvements, warning fixes, typo fixes, etc"

    * tag 'docs-5.3' of git://git.lwn.net/linux: (129 commits)
    docs: automarkup.py: ignore exceptions when seeking for xrefs
    docs: Move binderfs to admin-guide
    Disable Sphinx SmartyPants in HTML output
    doc: RCU callback locks need only _bh, not necessarily _irq
    docs: format kernel-parameters -- as code
    Doc : doc-guide : Fix a typo
    platform: x86: get rid of a non-existent document
    Add the RCU docs to the core-api manual
    Documentation: RCU: Add TOC tree hooks
    Documentation: RCU: Rename txt files to rst
    Documentation: RCU: Convert RCU UP systems to reST
    Documentation: RCU: Convert RCU linked list to reST
    Documentation: RCU: Convert RCU basic concepts to reST
    docs: filesystems: Remove uneeded .rst extension on toctables
    scripts/sphinx-pre-install: fix out-of-tree build
    docs: zh_CN: submitting-drivers.rst: Remove a duplicated Documentation/
    Documentation: PGP: update for newer HW devices
    Documentation: Add section about CPU vulnerabilities for Spectre
    Documentation: platform: Delete x86-laptop-drivers.txt
    docs: Note that :c:func: should no longer be used
    ...

    Linus Torvalds
     

05 Jul, 2019

1 commit


03 Jul, 2019

3 commits


19 Jun, 2019

1 commit

  • Improve the KVM_{GET,SET}_NESTED_STATE structs by detailing the format
    of VMX nested state data in a struct.

    In order to avoid changing the ioctl values of
    KVM_{GET,SET}_NESTED_STATE, there is a need to preserve
    sizeof(struct kvm_nested_state). This is done by defining the data
    struct as "data.vmx[0]". It was the most elegant way I found to
    preserve struct size while still keeping struct readable and easy to
    maintain. It does have an unfortunate side-effect that now it has to be
    accessed as "data.vmx[0]" rather than just "data.vmx".

    Because we are already modifying these structs, I also modified the
    following:
    * Define the "format" field values as macros.
    * Rename vmcs_pa to vmcs12_pa for better readability.

    Signed-off-by: Liran Alon
    [Remove SVM stubs, add KVM_STATE_NESTED_VMX_VMCS12_SIZE. - Paolo]
    Reviewed-by: Liran Alon
    Signed-off-by: Paolo Bonzini

    Liran Alon
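
The size-preserving trick described in this commit (declaring the payload as a zero-length array so that sizeof of the outer struct does not change) can be demonstrated with ctypes. The field layout below is an illustrative assumption, not a copy of the kernel header; only the idea of a `data.vmx[0]` member comes from the commit message.

```python
import ctypes


class VmxData(ctypes.Structure):
    # Hypothetical payload layout, used only to give the array a type.
    _fields_ = [("vmcs12", ctypes.c_uint8 * 4096)]


class Data(ctypes.Union):
    # Zero-length array: names the payload type without adding any size.
    _fields_ = [("vmx", VmxData * 0)]


class OldState(ctypes.Structure):
    # Assumed "before" layout: fixed header followed by opaque bytes.
    _fields_ = [("flags", ctypes.c_uint16),
                ("format", ctypes.c_uint16),
                ("size", ctypes.c_uint32),
                ("pad", ctypes.c_uint8 * 120),
                ("data", ctypes.c_uint8 * 0)]


class NewState(ctypes.Structure):
    # Assumed "after" layout: same header, typed payload via data.vmx[0].
    _fields_ = [("flags", ctypes.c_uint16),
                ("format", ctypes.c_uint16),
                ("size", ctypes.c_uint32),
                ("pad", ctypes.c_uint8 * 120),
                ("data", Data)]


def payload_view(buf):
    """Return a typed view of the payload that follows the fixed header."""
    return VmxData.from_buffer(buf, ctypes.sizeof(NewState))


if __name__ == "__main__":
    print(ctypes.sizeof(OldState), ctypes.sizeof(NewState))
```

Because the outer size is unchanged, anything derived from it (such as an ioctl number that encodes the struct size) stays the same, while the payload is reached at the offset just past the header.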
     

18 Jun, 2019

2 commits


15 Jun, 2019

1 commit

  • The documentation is in a format that is very close to ReST format.

    The conversion is actually:
    - add blank lines in order to identify paragraphs;
    - fix table markups;
    - add some list markups;
    - mark literal blocks;
    - adjust some title markups.

    At its new index.rst, let's add a :orphan: while this is not linked to
    the main index.rst file, in order to avoid build warnings.

    Signed-off-by: Mauro Carvalho Chehab
    Signed-off-by: Jonathan Corbet

    Mauro Carvalho Chehab
     

09 Jun, 2019

1 commit

  • Get rid of those warnings:

    Documentation/virtual/kvm/amd-memory-encryption.rst:244: WARNING: Citation [white-paper] is not referenced.
    Documentation/virtual/kvm/amd-memory-encryption.rst:246: WARNING: Citation [amd-apm] is not referenced.
    Documentation/virtual/kvm/amd-memory-encryption.rst:247: WARNING: Citation [kvm-forum] is not referenced.

    Fix them by adding an explicit reference to each citation that isn't
    mentioned in the text.

    Signed-off-by: Mauro Carvalho Chehab
    Acked-by: Paolo Bonzini
    Signed-off-by: Jonathan Corbet

    Mauro Carvalho Chehab
     

08 Jun, 2019

1 commit


05 Jun, 2019

3 commits


18 May, 2019

1 commit

  • Pull KVM updates from Paolo Bonzini:
    "ARM:
    - support for SVE and Pointer Authentication in guests
    - PMU improvements

    POWER:
    - support for direct access to the POWER9 XIVE interrupt controller
    - memory and performance optimizations

    x86:
    - support for accessing memory not backed by struct page
    - fixes and refactoring

    Generic:
    - dirty page tracking improvements"

    * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (155 commits)
    kvm: fix compilation on aarch64
    Revert "KVM: nVMX: Expose RDPMC-exiting only when guest supports PMU"
    kvm: x86: Fix L1TF mitigation for shadow MMU
    KVM: nVMX: Disable intercept for FS/GS base MSRs in vmcs02 when possible
    KVM: PPC: Book3S: Remove useless checks in 'release' method of KVM device
    KVM: PPC: Book3S HV: XIVE: Fix spelling mistake "acessing" -> "accessing"
    KVM: PPC: Book3S HV: Make sure to load LPID for radix VCPUs
    kvm: nVMX: Set nested_run_pending in vmx_set_nested_state after checks complete
    tests: kvm: Add tests for KVM_SET_NESTED_STATE
    KVM: nVMX: KVM_SET_NESTED_STATE - Tear down old EVMCS state before setting new state
    tests: kvm: Add tests for KVM_CAP_MAX_VCPUS and KVM_CAP_MAX_CPU_ID
    tests: kvm: Add tests to .gitignore
    KVM: Introduce KVM_CAP_MANUAL_DIRTY_LOG_PROTECT2
    KVM: Fix kvm_clear_dirty_log_protect off-by-(minus-)one
    KVM: Fix the bitmap range to copy during clear dirty
    KVM: arm64: Fix ptrauth ID register masking logic
    KVM: x86: use direct accessors for RIP and RSP
    KVM: VMX: Use accessors for GPRs outside of dedicated caching logic
    KVM: x86: Omit caching logic for always-available GPRs
    kvm, x86: Properly check whether a pfn is an MMIO or not
    ...

    Linus Torvalds
     

16 May, 2019

2 commits


08 May, 2019

1 commit


01 May, 2019

4 commits

  • If a memory slot's size is not a multiple of 64 pages (256K), then
    the KVM_CLEAR_DIRTY_LOG API is unusable: clearing the final 64 pages
    either requires the requested page range to go beyond memslot->npages,
    or requires log->num_pages to be unaligned, and kvm_clear_dirty_log_protect
    requires log->num_pages to be both in range and aligned.

    To allow this case, allow log->num_pages not to be a multiple of 64 if
    it ends exactly on the last page of the slot.

    Reported-by: Peter Xu
    Fixes: 98938aa8edd6 ("KVM: validate userspace input in kvm_clear_dirty_log_protect()", 2019-01-02)
    Signed-off-by: Paolo Bonzini

    Paolo Bonzini
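
The relaxed argument check described in this commit can be written out as a small predicate. This is a sketch of the rule as stated in the message, not the kernel function itself; the name `clear_dirty_log_args_valid` is hypothetical, and the "64 pages = 256K" figure assumes 4K pages.

```python
BITS_PER_LONG = 64  # the bitmap is handled in 64-page units


def clear_dirty_log_args_valid(npages, first_page, num_pages):
    """Model of the range/alignment rule for a memslot of `npages` pages."""
    if first_page % BITS_PER_LONG:
        return False  # the start must always be 64-page aligned
    if first_page > npages or num_pages > npages - first_page:
        return False  # the range must stay inside the slot
    ends_on_last_page = first_page + num_pages == npages
    if num_pages % BITS_PER_LONG and not ends_on_last_page:
        return False  # unaligned length only allowed at the slot's end
    return True


if __name__ == "__main__":
    # A 100-page slot: the tail (pages 64..99) can now be cleared.
    print(clear_dirty_log_args_valid(100, 64, 36))
    # Going past the end, or stopping short unaligned, is still rejected.
    print(clear_dirty_log_args_valid(100, 64, 64))
    print(clear_dirty_log_args_valid(100, 64, 32))
```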
     
  • This reverts commit 919f6cd8bb2fe7151f8aecebc3b3d1ca2567396e.

    The patch was applied twice.
    The first commit is eca6be566d47029f945a5f8e1c94d374e31df2ca.

    Reported-by: Cornelia Huck
    Signed-off-by: Radim Krčmář
    Signed-off-by: Paolo Bonzini

    Radim Krčmář
     
  • …/kvms390/linux into HEAD

    KVM: s390: Features and fixes for 5.2

    - VSIE crypto fixes
    - new guest features for gen15
    - disable halt polling for nested virtualization with overcommit

    Paolo Bonzini
     
  • If a memory slot's size is not a multiple of 64 pages (256K), then
    the KVM_CLEAR_DIRTY_LOG API is unusable: clearing the final 64 pages
    either requires the requested page range to go beyond memslot->npages,
    or requires log->num_pages to be unaligned, and kvm_clear_dirty_log_protect
    requires log->num_pages to be both in range and aligned.

    To allow this case, allow log->num_pages not to be a multiple of 64 if
    it ends exactly on the last page of the slot.

    Reported-by: Peter Xu
    Fixes: 98938aa8edd6 ("KVM: validate userspace input in kvm_clear_dirty_log_protect()", 2019-01-02)
    Signed-off-by: Paolo Bonzini

    Paolo Bonzini
     

30 Apr, 2019

9 commits

  • The KVM XICS-over-XIVE device and the proposed KVM XIVE native device
    implement an IRQ space for the guest using the generic IPI interrupts
    of the XIVE IC controller. These interrupts are allocated at the OPAL
    level and "mapped" into the guest IRQ number space in the range 0-0x1FFF.
    Interrupt management is performed in the XIVE way: using loads and
    stores on the addresses of the XIVE IPI interrupt ESB pages.

    Both KVM devices share the same internal structure caching information
    on the interrupts, among which the xive_irq_data struct containing the
    addresses of the IPI ESB pages and an extra one in case of pass-through.
    The latter contains the addresses of the ESB pages of the underlying HW
    controller interrupts, PHB4 in all cases for now.

    A guest, when running in the XICS legacy interrupt mode, lets the KVM
    XICS-over-XIVE device "handle" interrupt management, that is to
    perform the loads and stores on the addresses of the ESB pages of the
    guest interrupts. However, when running in XIVE native exploitation
    mode, the KVM XIVE native device exposes the interrupt ESB pages to
    the guest and lets the guest perform directly the loads and stores.

    The VMA exposing the ESB pages makes use of a custom VM fault handler
    whose role is to populate the VMA with appropriate pages. When a fault
    occurs, the guest IRQ number is deduced from the offset, and the ESB
    pages of associated XIVE IPI interrupt are inserted in the VMA (using
    the internal structure caching information on the interrupts).

    Supporting device passthrough in the guest running in XIVE native
    exploitation mode adds some extra refinements because the ESB pages
    of a different HW controller (PHB4) need to be exposed to the guest
    along with the initial IPI ESB pages of the XIVE IC controller. But
    the overall mechanism is the same.

    When the device HW irqs are mapped into or unmapped from the guest
    IRQ number space, the passthru_irq helpers, kvmppc_xive_set_mapped()
    and kvmppc_xive_clr_mapped(), are called to record or clear the
    passthrough interrupt information and to perform the switch.

    The approach taken by this patch is to clear the ESB pages of the
    guest IRQ number being mapped and let the VM fault handler repopulate.
    The handler will insert the ESB page corresponding to the HW interrupt
    of the device being passed-through or the initial IPI ESB page if the
    device is being removed.

    Signed-off-by: Cédric Le Goater
    Reviewed-by: David Gibson
    Signed-off-by: Paul Mackerras

    Cédric Le Goater
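
The "clear the mapping and let the fault handler repopulate" approach from this commit can be modelled in a few lines. This is a toy sketch: the class `EsbMapping` and its dictionaries are hypothetical stand-ins for the VMA and the internal interrupt cache, and only the method names echo the helpers mentioned in the message.

```python
class EsbMapping:
    """Toy model of one guest's ESB page mappings (hypothetical)."""

    def __init__(self, ipi_pages):
        self.ipi_page = dict(ipi_pages)  # guest irq -> initial IPI ESB page
        self.hw_page = {}                # guest irq -> HW ESB page (passthrough)
        self.vma = {}                    # guest irq -> page currently mapped

    def fault(self, irq):
        """Fault handler: insert the HW page if passed through, else the IPI page."""
        page = self.hw_page.get(irq, self.ipi_page[irq])
        self.vma[irq] = page
        return page

    def set_mapped(self, irq, hw_page):
        """Record the passthrough source and drop the stale mapping."""
        self.hw_page[irq] = hw_page
        self.vma.pop(irq, None)  # the next access faults and repopulates

    def clr_mapped(self, irq):
        """Forget the passthrough source and drop the stale mapping."""
        self.hw_page.pop(irq, None)
        self.vma.pop(irq, None)


if __name__ == "__main__":
    m = EsbMapping({5: "ipi-esb-5"})
    print(m.fault(5))            # initial IPI ESB page
    m.set_mapped(5, "phb4-esb")
    print(m.fault(5))            # HW ESB page after passthrough
    m.clr_mapped(5)
    print(m.fault(5))            # back to the IPI ESB page
```

The point of the design is that the switch itself never installs a page; it only invalidates, so the fault handler remains the single place where the choice between the two pages is made.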
     
  • Each source is associated with an Event State Buffer (ESB) with an
    even/odd pair of pages which provide commands to manage the source:
    to trigger, to EOI, or to turn off the source, for instance.

    The custom VM fault handler will deduce the guest IRQ number from the
    offset of the fault, and the ESB page of the associated XIVE interrupt
    will be inserted into the VMA using the internal structure caching
    information on the interrupts.

    Signed-off-by: Cédric Le Goater
    Reviewed-by: David Gibson
    Signed-off-by: Paul Mackerras

    Cédric Le Goater
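
Deducing the guest IRQ number from the fault offset, as this commit describes, amounts to simple arithmetic once each source owns an even/odd pair of pages. The sketch below assumes a 64 KiB page size and the helper name `locate_esb_page`; neither is stated in the commit message, and which page of the pair serves which command is deliberately left open.

```python
PAGE_SIZE = 64 * 1024  # assumption: 64 KiB pages


def locate_esb_page(fault_offset, page_size=PAGE_SIZE):
    """Map a byte offset in the ESB region to (guest irq, page parity).

    Each source owns two consecutive pages, so the page index divided
    by two gives the guest IRQ number and its remainder selects the
    even or odd page of that source's pair.
    """
    page_index = fault_offset // page_size
    irq = page_index // 2
    parity = "odd" if page_index % 2 else "even"
    return irq, parity


if __name__ == "__main__":
    print(locate_esb_page(0))
    print(locate_esb_page(3 * PAGE_SIZE))
```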
     
  • Each thread has an associated Thread Interrupt Management context
    composed of a set of registers. These registers let the thread handle
    priority management and interrupt acknowledgment. The most important
    are:

    - Interrupt Pending Buffer (IPB)
    - Current Processor Priority (CPPR)
    - Notification Source Register (NSR)

    They are exposed to software in four different pages each proposing a
    view with a different privilege. The first page is for the physical
    thread context and the second for the hypervisor. Only the third
    (operating system) and the fourth (user level) are exposed to the guest.

    A custom VM fault handler will populate the VMA with the appropriate
    pages, which should only be the OS page for now.

    Signed-off-by: Cédric Le Goater
    Reviewed-by: David Gibson
    Signed-off-by: Paul Mackerras

    Cédric Le Goater
     
  • The state of the thread interrupt management registers needs to be
    collected for migration. These registers are cached under the
    'xive_saved_state.w01' field of the VCPU when the VCPU context is
    pulled from the HW thread. An OPAL call retrieves the backup of the
    IPB register in the underlying XIVE NVT structure and merges it in the
    KVM state.

    Signed-off-by: Cédric Le Goater
    Reviewed-by: David Gibson
    Signed-off-by: Paul Mackerras

    Cédric Le Goater
     
  • When migration of a VM is initiated, a first copy of the RAM is
    transferred to the destination before the VM is stopped, but there is
    no guarantee that the EQ pages in which the event notifications are
    queued have not been modified.

    To make sure migration will capture a consistent memory state, the
    XIVE device should perform a XIVE quiesce sequence to stop the flow of
    event notifications and stabilize the EQs. This is the purpose of the
    KVM_DEV_XIVE_EQ_SYNC control which also marks the EQ pages dirty
    to force their transfer.

    Signed-off-by: Cédric Le Goater
    Reviewed-by: David Gibson
    Signed-off-by: Paul Mackerras

    Cédric Le Goater
     
  • This control will be used by the H_INT_SYNC hcall from QEMU to flush
    event notifications on the XIVE IC owning the source.

    Signed-off-by: Cédric Le Goater
    Reviewed-by: David Gibson
    Signed-off-by: Paul Mackerras

    Cédric Le Goater
     
  • This control is to be used by the H_INT_RESET hcall from QEMU. Its
    purpose is to clear all configuration of the sources and EQs. This is
    necessary in case of a kexec (for a kdump kernel for instance) to make
    sure that no remaining configuration is left from the previous boot
    setup so that the new kernel can start safely from a clean state.

    The queue 7 is ignored when the XIVE device is configured to run in
    single escalation mode. Prio 7 is used by escalations.

    The XIVE VP is kept enabled as the vCPU is still active and connected
    to the XIVE device.

    Signed-off-by: Cédric Le Goater
    Reviewed-by: David Gibson
    Signed-off-by: Paul Mackerras

    Cédric Le Goater
     
  • These controls will be used by the H_INT_SET_QUEUE_CONFIG and
    H_INT_GET_QUEUE_CONFIG hcalls from QEMU to configure the underlying
    Event Queue in the XIVE IC. They will also be used to restore the
    configuration of the XIVE EQs and to capture the internal run-time
    state of the EQs. Both 'get' and 'set' rely on an OPAL call to access
    the EQ toggle bit and EQ index which are updated by the XIVE IC when
    event notifications are enqueued in the EQ.

    The value of the guest physical address of the event queue is saved in
    the XIVE internal xive_q structure for later use. That is when
    migration needs to mark the EQ pages dirty to capture a consistent
    memory state of the VM.

    Note that H_INT_SET_QUEUE_CONFIG does not require the extra
    OPAL call setting the EQ toggle bit and EQ index to configure the EQ,
    but restoring the EQ state will.

    Signed-off-by: Cédric Le Goater
    Reviewed-by: David Gibson
    Signed-off-by: Paul Mackerras

    Cédric Le Goater
     
  • This control will be used by the H_INT_SET_SOURCE_CONFIG hcall from
    QEMU to configure the target of a source and also to restore the
    configuration of a source when migrating the VM.

    The XIVE source interrupt structure is extended with the value of the
    Effective Interrupt Source Number. The EISN is the interrupt number
    pushed in the event queue that the guest OS will use to dispatch
    events internally. Caching the EISN value in KVM eases the test when
    checking if a reconfiguration is indeed needed.

    Signed-off-by: Cédric Le Goater
    Reviewed-by: David Gibson
    Signed-off-by: Paul Mackerras

    Cédric Le Goater