27 Dec, 2018

2 commits

  • Pull RCU updates from Ingo Molnar:
    "The biggest RCU changes in this cycle were:

    - Convert RCU's BUG_ON() and similar calls to WARN_ON() and similar.

    - Replace calls of RCU-bh and RCU-sched update-side functions to
    their vanilla RCU counterparts. This series is a step towards
    complete removal of the RCU-bh and RCU-sched update-side functions.

    ( Note that some of these conversions are going upstream via their
    respective maintainers. )

    - Documentation updates, including a number of flavor-consolidation
    updates from Joel Fernandes.

    - Miscellaneous fixes.

    - Automate generation of the initrd filesystem used for rcutorture
    testing.

    - Convert spin_is_locked() assertions to instead use lockdep.

    ( Note that some of these conversions are going upstream via their
    respective maintainers. )

    - SRCU updates, especially including a fix from Dennis Krein for a
    bag-on-head-class bug.

    - RCU torture-test updates"

    * 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (112 commits)
    rcutorture: Don't do busted forward-progress testing
    rcutorture: Use 100ms buckets for forward-progress callback histograms
    rcutorture: Recover from OOM during forward-progress tests
    rcutorture: Print forward-progress test age upon failure
    rcutorture: Print time since GP end upon forward-progress failure
    rcutorture: Print histogram of CB invocation at OOM time
    rcutorture: Print GP age upon forward-progress failure
    rcu: Print per-CPU callback counts for forward-progress failures
    rcu: Account for nocb-CPU callback counts in RCU CPU stall warnings
    rcutorture: Dump grace-period diagnostics upon forward-progress OOM
    rcutorture: Prepare for asynchronous access to rcu_fwd_startat
    torture: Remove unnecessary "ret" variables
    rcutorture: Affinity forward-progress test to avoid housekeeping CPUs
    rcutorture: Break up too-long rcu_torture_fwd_prog() function
    rcutorture: Remove cbflood facility
    torture: Bring any extra CPUs online during kernel startup
    rcutorture: Add call_rcu() flooding forward-progress tests
    rcutorture/formal: Replace synchronize_sched() with synchronize_rcu()
    tools/kernel.h: Replace synchronize_sched() with synchronize_rcu()
    net/decnet: Replace rcu_barrier_bh() with rcu_barrier()
    ...

    Linus Torvalds
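
    A few illustrative conversions of the kind this flavor-consolidation
    series performs (the struct, callback, and function names below are
    made-up examples; only the RCU API calls themselves are real):

        struct foo {
                struct rcu_head rcu;
        };

        static void foo_free_cb(struct rcu_head *head)
        {
                kfree(container_of(head, struct foo, rcu));
        }

        static void foo_release(struct foo *p)
        {
                /* was call_rcu_bh() or call_rcu_sched() */
                call_rcu(&p->rcu, foo_free_cb);
        }

        static void foo_teardown(void)
        {
                /* was synchronize_rcu_bh() or synchronize_sched() */
                synchronize_rcu();
                /* was rcu_barrier_bh() or rcu_barrier_sched() */
                rcu_barrier();
        }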
     
  • Pull KVM updates from Paolo Bonzini:
    "ARM:
    - selftests improvements
    - large PUD support for HugeTLB
    - single-stepping fixes
    - improved tracing
    - various timer and vGIC fixes

    x86:
    - Processor Tracing virtualization
    - STIBP support
    - some correctness fixes
    - refactorings and splitting of vmx.c
    - use the Hyper-V range TLB flush hypercall
    - reduce order of vcpu struct
    - WBNOINVD support
    - do not use -ftrace for __noclone functions
    - nested guest support for PAUSE filtering on AMD
    - more Hyper-V enlightenments (direct mode for synthetic timers)

    PPC:
    - nested VFIO

    s390:
    - bugfixes only this time"

    * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (171 commits)
    KVM: x86: Add CPUID support for new instruction WBNOINVD
    kvm: selftests: ucall: fix exit mmio address guessing
    Revert "compiler-gcc: disable -ftracer for __noclone functions"
    KVM: VMX: Move VM-Enter + VM-Exit handling to non-inline sub-routines
    KVM: VMX: Explicitly reference RCX as the vmx_vcpu pointer in asm blobs
    KVM: x86: Use jmp to invoke kvm_spurious_fault() from .fixup
    MAINTAINERS: Add arch/x86/kvm sub-directories to existing KVM/x86 entry
    KVM/x86: Use SVM assembly instruction mnemonics instead of .byte streams
    KVM/MMU: Flush tlb directly in the kvm_zap_gfn_range()
    KVM/MMU: Flush tlb directly in kvm_set_pte_rmapp()
    KVM/MMU: Move tlb flush in kvm_set_pte_rmapp() to kvm_mmu_notifier_change_pte()
    KVM: Make kvm_set_spte_hva() return int
    KVM: Replace old tlb flush function with new one to flush a specified range.
    KVM/MMU: Add tlb flush with range helper function
    KVM/VMX: Add hv tlb range flush support
    x86/hyper-v: Add HvFlushGuestAddressList hypercall support
    KVM: Add tlb_remote_flush_with_range callback in kvm_x86_ops
    KVM: x86: Disable Intel PT when VMXON in L1 guest
    KVM: x86: Set intercept for Intel PT MSRs read/write
    KVM: x86: Implement Intel PT MSRs read/write emulation
    ...

    Linus Torvalds
     

26 Dec, 2018

1 commit

  • Pull arm64 festive updates from Will Deacon:
    "In the end, we ended up with quite a lot more than I expected:

    - Support for ARMv8.3 Pointer Authentication in userspace (CRIU and
    kernel-side support to come later)

    - Support for per-thread stack canaries, pending an update to GCC
    that is currently undergoing review

    - Support for kexec_file_load(), which permits secure boot of a kexec
    payload but also happens to improve the performance of kexec
    dramatically because we can avoid the sucky purgatory code from
    userspace. Kdump will come later (requires updates to libfdt).

    - Optimisation of our dynamic CPU feature framework, so that all
    detected features are enabled via a single stop_machine()
    invocation

    - KPTI whitelisting of Cortex-A CPUs unaffected by Meltdown, so that
    they can benefit from global TLB entries when KASLR is not in use

    - 52-bit virtual addressing for userspace (kernel remains 48-bit)

    - Patch in LSE atomics for per-cpu atomic operations

    - Custom preempt.h implementation to avoid unconditional calls to
    preempt_schedule() from preempt_enable()

    - Support for the new 'SB' Speculation Barrier instruction

    - Vectorised implementation of XOR checksumming and CRC32
    optimisations

    - Workaround for Cortex-A76 erratum #1165522

    - Improved compatibility with Clang/LLD

    - Support for TX2 system PMUs for profiling the L3 cache and DMC

    - Reflect read-only permissions in the linear map by default

    - Ensure MMIO reads are ordered with subsequent calls to Xdelay()

    - Initial support for memory hotplug

    - Tweak the threshold when we invalidate the TLB by-ASID, so that
    mremap() performance is improved for ranges spanning multiple PMDs.

    - Minor refactoring and cleanups"

    * tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux: (125 commits)
    arm64: kaslr: print PHYS_OFFSET in dump_kernel_offset()
    arm64: sysreg: Use _BITUL() when defining register bits
    arm64: cpufeature: Rework ptr auth hwcaps using multi_entry_cap_matches
    arm64: cpufeature: Reduce number of pointer auth CPU caps from 6 to 4
    arm64: docs: document pointer authentication
    arm64: ptr auth: Move per-thread keys from thread_info to thread_struct
    arm64: enable pointer authentication
    arm64: add prctl control for resetting ptrauth keys
    arm64: perf: strip PAC when unwinding userspace
    arm64: expose user PAC bit positions via ptrace
    arm64: add basic pointer authentication support
    arm64/cpufeature: detect pointer authentication
    arm64: Don't trap host pointer auth use to EL2
    arm64/kvm: hide ptrauth from guests
    arm64/kvm: consistently handle host HCR_EL2 flags
    arm64: add pointer authentication register bits
    arm64: add comments about EC exception levels
    arm64: perf: Treat EXCLUDE_EL* bit definitions as unsigned
    arm64: kpti: Whitelist Cortex-A CPUs that don't implement the CSV3 field
    arm64: enable per-task stack canaries
    ...

    Linus Torvalds
     

21 Dec, 2018

5 commits

  • This patch moves the tlb flush in kvm_set_pte_rmapp() to
    kvm_mmu_notifier_change_pte() in order to avoid a redundant tlb flush.

    Signed-off-by: Lan Tianyu
    Signed-off-by: Paolo Bonzini

    Lan Tianyu
     
  • This patch makes kvm_set_spte_hva() return int so the caller can
    check the return value to determine whether to flush the tlb or not.

    Signed-off-by: Lan Tianyu
    Acked-by: Paul Mackerras
    Signed-off-by: Paolo Bonzini

    Lan Tianyu
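
    Taken together, the two commits above let the generic change_pte MMU
    notifier decide when to flush; a simplified sketch of the resulting
    pattern (not the literal diff):

        spin_lock(&kvm->mmu_lock);
        kvm->mmu_notifier_seq++;
        if (kvm_set_spte_hva(kvm, address, pte))
                kvm_flush_remote_tlbs(kvm);   /* flush only when the arch hook asks */
        spin_unlock(&kvm->mmu_lock);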
     
  • Signed-off-by: Wei Yang
    [Preserved the iff and a probably intentional weird bracket notation.
    Also dropped the style change to make a single-purpose patch. - Radim]
    Signed-off-by: Radim Krčmář

    Wei Yang
     
  • Since the offset is added directly to the hva from the
    gfn_to_hva_cache, a negative offset could result in an out of bounds
    write. The existing BUG_ON only checks for addresses beyond the end of
    the gfn_to_hva_cache, not for addresses before the start of the
    gfn_to_hva_cache.

    Note that all current call sites have non-negative offsets.

    Fixes: 4ec6e8636256 ("kvm: Introduce kvm_write_guest_offset_cached()")
    Reported-by: Cfir Cohen
    Signed-off-by: Jim Mattson
    Reviewed-by: Cfir Cohen
    Reviewed-by: Peter Shier
    Reviewed-by: Krish Sadhukhan
    Reviewed-by: Sean Christopherson
    Signed-off-by: Radim Krčmář

    Jim Mattson
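
    A sketch of the stricter validation being described, assuming the usual
    gfn_to_hva_cache fields (illustrative, not the literal patch):

        /* Reject any (offset, len) pair that does not fall entirely inside the
         * cached window; with unsigned arithmetic this also catches offsets
         * that would land before the start of the cache. */
        if (offset >= ghc->len || len > ghc->len - offset)
                return -EINVAL;

        if (__copy_to_user((void __user *)ghc->hva + offset, data, len))
                return -EFAULT;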
     
  • Previously, in the case where (gpa + len) wrapped around, the entire
    region was not validated, as the comment claimed. It doesn't actually
    seem that wraparound should be allowed here at all.

    Furthermore, since some callers don't check the return code from this
    function, it seems prudent to clear ghc->memslot in the event of an
    error.

    Fixes: 8f964525a121f ("KVM: Allow cross page reads and writes from cached translations.")
    Reported-by: Cfir Cohen
    Signed-off-by: Jim Mattson
    Reviewed-by: Cfir Cohen
    Reviewed-by: Marc Orr
    Cc: Andrew Honig
    Signed-off-by: Radim Krčmář

    Jim Mattson
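
    A sketch of the two behaviours described above, as they would appear when
    initialising the cache (illustrative only):

        /* Refuse a region whose end wraps around the guest address space, and
         * poison the cache on any error so callers that ignore the return
         * code still fail safely on later accesses. */
        if (gpa + len < gpa) {
                ghc->memslot = NULL;
                return -EINVAL;
        }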
     

20 Dec, 2018

9 commits

  • …marm/kvmarm into HEAD

    KVM/arm updates for 4.21

    - Large PUD support for HugeTLB
    - Single-stepping fixes
    - Improved tracing
    - Various timer and vgic fixups

    Paolo Bonzini
     
  • 32 and 64bit use different symbols to identify the traps.
    32bit has a fine grained approach (prefetch abort, data abort and HVC),
    while 64bit is pretty happy with just "trap".

    This has been fine so far, except that we now need to decode some
    of that in tracepoints that are common to both architectures.

    Introduce ARM_EXCEPTION_IS_TRAP which abstracts the trap symbols
    and make the tracepoint use it.

    Acked-by: Christoffer Dall
    Signed-off-by: Marc Zyngier

    Marc Zyngier
     
  • There are two things we need to take care of when we create block
    mappings in the stage 2 page tables:

    (1) The alignment within a PMD between the host address range and the
    guest IPA range must be the same, since otherwise we end up mapping
    pages with the wrong offset.

    (2) The head and tail of a memory slot may not cover a full block
    size, and we have to take care to not map those with block
    descriptors, since we could expose memory to the guest that the host
    did not intend to expose.

    So far, we have been taking care of (1), but not (2), and our commentary
    describing (1) was somewhat confusing.

    This commit attempts to factor out the checks of both into a common
    function, and if we don't pass the check, we won't attempt any PMD
    mappings for either hugetlbfs or THP.

    Note that we used to only check the alignment for THP, not for
    hugetlbfs, but as far as I can tell the check needs to be applied to
    both scenarios.

    Cc: Ralph Palutke
    Cc: Lukas Braun
    Reported-by: Lukas Braun
    Signed-off-by: Christoffer Dall
    Signed-off-by: Marc Zyngier

    Christoffer Dall
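
    A simplified sketch of the common check described above (the helper name
    and exact arithmetic are illustrative, not the code from the patch):

        static bool fault_supports_block_mapping(struct kvm_memory_slot *memslot,
                                                 unsigned long hva, phys_addr_t ipa,
                                                 unsigned long block_size)
        {
                unsigned long uaddr_start = memslot->userspace_addr;
                unsigned long uaddr_end = uaddr_start + (memslot->npages << PAGE_SHIFT);

                /* (1) host VA and guest IPA must share the same offset within the block */
                if ((hva & (block_size - 1)) != (ipa & (block_size - 1)))
                        return false;

                /* (2) the block containing hva must lie entirely inside the memslot, so
                 *     the head and tail of the slot never get block descriptors */
                return (hva & ~(block_size - 1)) >= uaddr_start &&
                       (hva & ~(block_size - 1)) + block_size <= uaddr_end;
        }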
     
  • We currently only halt the guest when a vCPU messes with the active
    state of an SPI. This is perfectly fine for GICv2, but isn't enough
    for GICv3, where all vCPUs can access the state of any other vCPU.

    Let's broaden the condition to include any GICv3 interrupt that
    has an active state (i.e. all but LPIs).

    Cc: stable@vger.kernel.org
    Reviewed-by: Christoffer Dall
    Signed-off-by: Marc Zyngier

    Marc Zyngier
     
  • kvm_timer_vcpu_terminate can only be called in two scenarios:

    1. As part of cleanup during a failed VCPU create
    2. As part of freeing the whole VM (struct kvm refcount == 0)

    In the first case, we cannot have programmed any timers or mapped any
    IRQs, and therefore we do not have to cancel anything or unmap anything.

    In the second case, the VCPU will have gone through kvm_timer_vcpu_put,
    which will have canceled the emulated physical timer's hrtimer, and we
    do not need to do that here as well. We also do not care if the irq is
    recorded as mapped or not in the VGIC data structure, because the whole
    VM is going away. That leaves us only with having to ensure that we
    cancel the bg_timer if we were blocking the last time we called
    kvm_timer_vcpu_put().

    Signed-off-by: Christoffer Dall
    Signed-off-by: Marc Zyngier

    Christoffer Dall
     
  • The use of a work queue in the hrtimer expire function for the bg_timer
    is a leftover from the time when we would inject interrupts when the
    bg_timer expired.

    Since we are no longer doing that, we can instead call
    kvm_vcpu_wake_up() directly from the hrtimer function and remove all
    workqueue functionality from the arch timer code.

    Signed-off-by: Marc Zyngier
    Signed-off-by: Christoffer Dall
    Signed-off-by: Marc Zyngier

    Christoffer Dall
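
    A hedged sketch of the resulting expiry path (structure and field names
    follow the arch timer code of that era, but treat it as illustrative):

        static enum hrtimer_restart bg_timer_expire(struct hrtimer *hrt)
        {
                struct arch_timer_cpu *timer;
                struct kvm_vcpu *vcpu;

                timer = container_of(hrt, struct arch_timer_cpu, bg_timer);
                vcpu = container_of(timer, struct kvm_vcpu, arch.timer_cpu);

                /* Wake the blocked vCPU directly; no work item needed. */
                kvm_vcpu_wake_up(vcpu);
                return HRTIMER_NORESTART;
        }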
     
  • The kvm_exit tracepoint strangely always reported exits as being IRQs.
    This seems to be because either the __print_symbolic or the tracepoint
    macros use a variable named idx.

    Take this chance to update the fields in the tracepoint to reflect the
    concepts in the arm64 architecture that we pass to the tracepoint and
    move the exception type table to the same location and header files as
    the exits code.

    We also clear out the exception code to 0 for IRQ exits (which
    translates to UNKNOWN in text) to make it slightly less confusing to
    parse the trace output.

    Signed-off-by: Christoffer Dall
    Signed-off-by: Marc Zyngier

    Christoffer Dall
     
  • When checking if there are any pending IRQs for the VM, consider the
    active state and priority of the IRQs as well.

    Otherwise we could be continuously scheduling a guest hypervisor without
    it seeing an IRQ.

    Signed-off-by: Christoffer Dall
    Signed-off-by: Marc Zyngier

    Christoffer Dall
     
  • When using the nospec API, it should be taken into account that:

    "...if the CPU speculates past the bounds check then
    * array_index_nospec() will clamp the index within the range of [0,
    * size)."

    The above is part of the header for macro array_index_nospec() in
    linux/nospec.h

    Now, in this particular case, if intid evaluates to exactly VGIC_MAX_SPI
    or to exactly VGIC_MAX_PRIVATE, the array_index_nospec() macro ends up
    returning VGIC_MAX_SPI - 1 or VGIC_MAX_PRIVATE - 1 respectively, instead
    of VGIC_MAX_SPI or VGIC_MAX_PRIVATE, which, based on the original logic:

    /* SGIs and PPIs */
    if (intid <= VGIC_MAX_PRIVATE)
        return &vcpu->arch.vgic_cpu.private_irqs[intid];

    /* SPIs */
    if (intid <= VGIC_MAX_SPI)
        return &kvm->arch.vgic.spis[intid - VGIC_NR_PRIVATE_IRQS];

    are valid values for intid.

    Fix this by calling array_index_nospec() macro with VGIC_MAX_PRIVATE + 1
    and VGIC_MAX_SPI + 1 as arguments for its parameter size.

    Fixes: 41b87599c743 ("KVM: arm/arm64: vgic: fix possible spectre-v1 in vgic_get_irq()")
    Cc: stable@vger.kernel.org
    Signed-off-by: Gustavo A. R. Silva
    [dropped the SPI part which was fixed separately]
    Signed-off-by: Marc Zyngier

    Gustavo A. R. Silva
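
    The corrected clamping for the private interrupts, roughly (a sketch of
    the fix described above; array_index_nospec() takes the array size, i.e.
    the largest valid index plus one):

        /* SGIs and PPIs */
        if (intid <= VGIC_MAX_PRIVATE) {
                intid = array_index_nospec(intid, VGIC_MAX_PRIVATE + 1);
                return &vcpu->arch.vgic_cpu.private_irqs[intid];
        }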
     

19 Dec, 2018

1 commit

  • If you register a kvm_coalesced_mmio_zone with '.pio = 0' but then
    unregister it with '.pio = 1', KVM_UNREGISTER_COALESCED_MMIO will try to
    unregister it from KVM_PIO_BUS rather than KVM_MMIO_BUS, which is a
    no-op. But it frees the kvm_coalesced_mmio_dev anyway, causing a
    use-after-free.

    Fix it by only unregistering and freeing the zone if the correct value
    of 'pio' is provided.

    Reported-by: syzbot+f87f60bb6f13f39b54e3@syzkaller.appspotmail.com
    Fixes: 0804c849f1df ("kvm/x86 : add coalesced pio support")
    Signed-off-by: Eric Biggers
    Signed-off-by: Paolo Bonzini

    Eric Biggers
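
    A sketch of the guard being described, as it would sit in the unregister
    path (the surrounding loop is simplified and illustrative):

        list_for_each_entry_safe(dev, tmp, &kvm->coalesced_zones, list) {
                /* Only tear down a zone that was registered with the same
                 * 'pio' flag the caller is now passing. */
                if (zone->pio != dev->zone.pio)
                        continue;
                if (!coalesced_mmio_in_range(dev, zone->addr, zone->size))
                        continue;

                kvm_io_bus_unregister_dev(kvm,
                                zone->pio ? KVM_PIO_BUS : KVM_MMIO_BUS, &dev->dev);
                kvm_iodevice_destructor(&dev->dev);
        }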
     

18 Dec, 2018

15 commits

  • SPIs should be checked against the VM's specific configuration, and
    not the architectural maximum.

    Cc: stable@vger.kernel.org
    Signed-off-by: Marc Zyngier

    Marc Zyngier
     
  • In attempting to re-construct the logic for our stage 2 page table
    layout I found the reasoning in the comment explaining how we calculate
    the number of levels used for stage 2 page tables a bit backwards.

    This commit attempts to clarify the comment, to make it slightly easier
    to read without having the Arm ARM open on the right page.

    While we're at it, fixup a typo in a comment that was recently changed.

    Reviewed-by: Suzuki K Poulose
    Signed-off-by: Christoffer Dall
    Signed-off-by: Marc Zyngier

    Christoffer Dall
     
  • To change the active state of an interrupt via MMIO, halt is requested for all vcpus of
    the affected guest before modifying the IRQ state. This is done by calling
    cond_resched_lock() in vgic_mmio_change_active(). However interrupts are
    disabled at this point and we cannot reschedule a vcpu.

    We actually don't need any of this, as kvm_arm_halt_guest ensures that
    all the other vcpus are out of the guest. Let's just drop that useless
    code.

    Signed-off-by: Julien Thierry
    Suggested-by: Christoffer Dall
    Cc: stable@vger.kernel.org
    Signed-off-by: Marc Zyngier

    Julien Thierry
     
  • KVM only supports PMD hugepages at stage 2. Now that the various page
    handling routines are updated, extend the stage 2 fault handling to
    map in PUD hugepages.

    Addition of PUD hugepage support enables additional page sizes (e.g.,
    1G with 4K granule) which can be useful on cores that support mapping
    larger block sizes in the TLB entries.

    Signed-off-by: Punit Agrawal
    Reviewed-by: Christoffer Dall
    Cc: Russell King
    Cc: Catalin Marinas
    Cc: Will Deacon
    [ Replace BUG() => WARN_ON(1) for arm32 PUD helpers ]
    Signed-off-by: Suzuki Poulose
    Signed-off-by: Marc Zyngier

    Punit Agrawal
     
  • In preparation for creating larger hugepages at Stage 2, add support
    to the age handling notifiers for PUD hugepages when encountered.

    Provide trivial helpers for arm32 to allow sharing code.

    Signed-off-by: Punit Agrawal
    Reviewed-by: Christoffer Dall
    Cc: Russell King
    Cc: Catalin Marinas
    Cc: Will Deacon
    [ Replaced BUG() => WARN_ON(1) for arm32 PUD helpers ]
    Signed-off-by: Suzuki K Poulose
    Signed-off-by: Marc Zyngier

    Punit Agrawal
     
  • In preparation for creating larger hugepages at Stage 2, extend the
    access fault handling at Stage 2 to support PUD hugepages when
    encountered.

    Provide trivial helpers for arm32 to allow sharing of code.

    Signed-off-by: Punit Agrawal
    Reviewed-by: Christoffer Dall
    Cc: Russell King
    Cc: Catalin Marinas
    Cc: Will Deacon
    [ Replaced BUG() => WARN_ON(1) in PUD helpers ]
    Signed-off-by: Suzuki K Poulose
    Signed-off-by: Marc Zyngier

    Punit Agrawal
     
  • In preparation for creating PUD hugepages at stage 2, add support for
    detecting execute permissions on PUD page table entries. Faults due to
    lack of execute permissions on page table entries is used to perform
    i-cache invalidation on first execute.

    Provide trivial implementations of arm32 helpers to allow sharing of
    code.

    Signed-off-by: Punit Agrawal
    Reviewed-by: Christoffer Dall
    Cc: Russell King
    Cc: Catalin Marinas
    Cc: Will Deacon
    [ Replaced BUG() => WARN_ON(1) in arm32 PUD helpers ]
    Signed-off-by: Suzuki K Poulose
    Signed-off-by: Marc Zyngier

    Punit Agrawal
     
  • In preparation for creating PUD hugepages at stage 2, add support for
    write protecting PUD hugepages when they are encountered. Write
    protecting guest tables is used to track dirty pages when migrating
    VMs.

    Also, provide trivial implementations of required kvm_s2pud_* helpers
    to allow sharing of code with arm32.

    Signed-off-by: Punit Agrawal
    Reviewed-by: Christoffer Dall
    Cc: Russell King
    Cc: Catalin Marinas
    Cc: Will Deacon
    [ Replaced BUG() => WARN_ON() in arm32 pud helpers ]
    Signed-off-by: Suzuki K Poulose
    Signed-off-by: Marc Zyngier

    Punit Agrawal
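
    Since arm32 has no stage 2 PUD level, the kvm_s2pud_* helpers mentioned
    above can be simple stubs there; an illustrative sketch (exact names and
    bodies may differ from the patch):

        /* arm32: stage 2 never uses PUD entries, so these are unreachable;
         * WARN rather than BUG, as the bracketed note above says. */
        static inline void kvm_set_s2pud_readonly(pud_t *pudp)
        {
                WARN_ON(1);
        }

        static inline bool kvm_s2pud_readonly(pud_t *pudp)
        {
                WARN_ON(1);
                return false;
        }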
     
  • Introduce helpers to abstract architectural handling of the conversion
    of pfn to page table entries and marking a PMD page table entry as a
    block entry.

    The helpers are introduced in preparation for supporting PUD hugepages
    at stage 2 - which are supported on arm64 but do not exist on arm.

    Signed-off-by: Punit Agrawal
    Reviewed-by: Suzuki K Poulose
    Acked-by: Christoffer Dall
    Cc: Russell King
    Cc: Catalin Marinas
    Cc: Will Deacon
    Reviewed-by: Marc Zyngier
    Signed-off-by: Suzuki K Poulose
    Signed-off-by: Marc Zyngier

    Punit Agrawal
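
    On arm64 the abstractions described above can simply wrap the generic
    page-table constructors; a hedged sketch (the macro names are modelled on
    the description, not quoted from the patch):

        /* pfn -> page table entry, and marking a PMD as a block entry */
        #define kvm_pfn_pte(pfn, prot)   pfn_pte(pfn, prot)
        #define kvm_pfn_pmd(pfn, prot)   pfn_pmd(pfn, prot)
        #define kvm_pmd_mkhuge(pmd)      pmd_mkhuge(pmd)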
     
  • The stage 2 fault handler marks a page as executable if it is handling an
    execution fault, or if it was a permission fault, in which case the
    executable bit needs to be preserved.

    The logic to decide if the page should be marked executable is
    duplicated for PMD and PTE entries. To avoid creating another copy
    when support for PUD hugepages is introduced refactor the code to
    share the checks needed to mark a page table entry as executable.

    Signed-off-by: Punit Agrawal
    Reviewed-by: Suzuki K Poulose
    Reviewed-by: Christoffer Dall
    Signed-off-by: Suzuki K Poulose
    Signed-off-by: Marc Zyngier

    Punit Agrawal
     
  • The code for operations such as marking the pfn as dirty, and
    dcache/icache maintenance during stage 2 fault handling is duplicated
    between normal pages and PMD hugepages.

    Instead of creating another copy of the operations when we introduce
    PUD hugepages, let's share them across the different pagesizes.

    Signed-off-by: Punit Agrawal
    Reviewed-by: Suzuki K Poulose
    Reviewed-by: Christoffer Dall
    Signed-off-by: Suzuki K Poulose
    Signed-off-by: Marc Zyngier

    Punit Agrawal
     
  • When restoring the active state from userspace, we don't know which CPU
    was the source for the active state, and this is not architecturally
    exposed in any of the register state.

    Set the active_source to 0 in this case. In the future, we can expand
    on this and expose the information as additional information to
    userspace for GICv2 if anyone cares.

    Cc: stable@vger.kernel.org
    Signed-off-by: Christoffer Dall
    Signed-off-by: Marc Zyngier

    Christoffer Dall
     
  • We recently addressed a VMID generation race by introducing a read/write
    lock around accesses and updates to the vmid generation values.

    However, kvm_arch_vcpu_ioctl_run() also calls need_new_vmid_gen() but
    does so without taking the read lock.

    As far as I can tell, this can lead to the same kind of race:

    VM 0, VCPU 0                       VM 0, VCPU 1
    ------------                       ------------
    update_vttbr (vmid 254)
                                       update_vttbr (vmid 1) // roll over
                                       read_lock(kvm_vmid_lock);
                                       force_vm_exit()
    local_irq_disable
    need_new_vmid_gen == false
    // because vmid gen matches

    enter_guest (vmid 254)
                                       kvm_arch.vttbr = <PGD>:<VMID 1>
                                       read_unlock(kvm_vmid_lock);

                                       enter_guest (vmid 1)

    Which results in running two VCPUs in the same VM with different VMIDs
    and (even worse) other VCPUs from other VMs could now allocate clashing
    VMID 254 from the new generation as long as VCPU 0 is not exiting.

    Attempt to solve this by making sure vttbr is updated before another CPU
    can observe the updated VMID generation.

    Cc: stable@vger.kernel.org
    Fixes: f0cf47d939d0 "KVM: arm/arm64: Close VMID generation race"
    Reviewed-by: Julien Thierry
    Signed-off-by: Christoffer Dall
    Signed-off-by: Marc Zyngier

    Christoffer Dall
     
  • When we emulate a guest instruction, we don't advance the hardware
    singlestep state machine, and thus the guest will receive a software
    step exception after the next instruction that is not emulated by the
    host.

    We bodge around this in an ad-hoc fashion. Sometimes we explicitly check
    whether userspace requested a single step, and fake a debug exception
    from within the kernel. Other times, we advance the HW singlestep state and
    rely on the HW to generate the exception for us. Thus, the observed step
    behaviour differs for host and guest.

    Let's make this simpler and consistent by always advancing the HW
    singlestep state machine when we skip an instruction. Thus we can rely
    on the hardware to generate the singlestep exception for us, and never
    need to explicitly check for an active-pending step, nor do we need to
    fake a debug exception from the guest.

    Cc: Peter Maydell
    Reviewed-by: Alex Bennée
    Reviewed-by: Christoffer Dall
    Signed-off-by: Mark Rutland
    Signed-off-by: Marc Zyngier

    Mark Rutland
     
  • When we emulate an MMIO instruction, we advance the CPU state within
    decode_hsr(), before emulating the instruction effects.

    Having this logic in decode_hsr() is opaque, and advancing the state
    before emulation is problematic. It gets in the way of applying
    consistent single-step logic, and it prevents us from being able to fail
    an MMIO instruction with a synchronous exception.

    Clean this up by only advancing the CPU state *after* the effects of the
    instruction are emulated.

    Cc: Peter Maydell
    Reviewed-by: Alex Bennée
    Reviewed-by: Christoffer Dall
    Signed-off-by: Mark Rutland
    Signed-off-by: Marc Zyngier

    Mark Rutland
     

14 Dec, 2018

3 commits

  • There are two problems with KVM_GET_DIRTY_LOG. First, and less important,
    it can take kvm->mmu_lock for an extended period of time. Second, its user
    can actually see many false positives in some cases. The latter is due
    to a benign race like this:

    1. KVM_GET_DIRTY_LOG returns a set of dirty pages and write protects
    them.
    2. The guest modifies the pages, causing them to be marked dirty.
    3. Userspace actually copies the pages.
    4. KVM_GET_DIRTY_LOG returns those pages as dirty again, even though
    they were not written to since (3).

    This is especially a problem for large guests, where the time between
    (1) and (3) can be substantial. This patch introduces a new
    capability which, when enabled, makes KVM_GET_DIRTY_LOG not
    write-protect the pages it returns. Instead, userspace has to
    explicitly clear the dirty log bits just before using the content
    of the page. The new KVM_CLEAR_DIRTY_LOG ioctl can also operate on a
    64-page granularity rather than requiring a sync of a full memslot;
    this way, the mmu_lock is taken for small amounts of time, and
    only a small amount of time will pass between write protection
    of pages and the sending of their content.

    Signed-off-by: Paolo Bonzini

    Paolo Bonzini
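
    A hedged userspace sketch of the new flow (the ioctl and structure come
    from this series; error handling is omitted, the 64-page alignment rules
    are simplified, and enabling of the manual-protect capability via
    KVM_ENABLE_CAP is assumed to have happened at VM creation):

        #include <linux/kvm.h>
        #include <sys/ioctl.h>

        /* One pass over a memslot during migration. */
        static void sync_slot(int vm_fd, __u32 slot, void *bitmap, __u32 npages)
        {
                struct kvm_dirty_log get = {
                        .slot = slot,
                        .dirty_bitmap = bitmap,
                };
                struct kvm_clear_dirty_log clear = {
                        .slot = slot,
                        .first_page = 0,
                        .num_pages = npages,      /* 64-page granularity */
                        .dirty_bitmap = bitmap,
                };

                ioctl(vm_fd, KVM_GET_DIRTY_LOG, &get);   /* no longer write-protects */
                /* ...copy out the pages reported dirty in 'bitmap'... */
                ioctl(vm_fd, KVM_CLEAR_DIRTY_LOG, &clear); /* re-protect just these */
        }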
     
  • Once manual dirty log reprotection is enabled, kvm_get_dirty_log_protect's
    pointer argument will always be false on exit, because no TLB flush is needed
    until the manual re-protection operation. Rename it from "is_dirty" to "flush",
    which more accurately tells the caller what they have to do with it.

    Signed-off-by: Paolo Bonzini

    Paolo Bonzini
     
  • The first such capability to be handled in virt/kvm/ will be manual
    dirty page reprotection.

    Signed-off-by: Paolo Bonzini

    Paolo Bonzini
     

10 Dec, 2018

1 commit

  • An SVE system is so far the only case where we mandate VHE. As we're
    starting to grow these requirements, let's slightly rework the way we
    deal with that situation, allowing for easy extension of this check.

    Acked-by: Christoffer Dall
    Reviewed-by: James Morse
    Signed-off-by: Marc Zyngier
    Signed-off-by: Will Deacon

    Marc Zyngier
     

13 Nov, 2018

1 commit

  • lockdep_assert_held() is better suited to checking locking requirements,
    since it only checks if the current thread holds the lock regardless of
    whether someone else does. This is also a step towards possibly removing
    spin_is_locked().

    Signed-off-by: Lance Roy
    Cc: Marc Zyngier
    Cc: Eric Auger
    Cc: linux-arm-kernel@lists.infradead.org
    Cc:
    Signed-off-by: Paul E. McKenney
    Acked-by: Christoffer Dall

    Lance Roy
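
    The shape of the conversion, in generic terms (illustrative; the exact
    assertion wrappers used in the vgic code differ slightly):

        /* Before: only checks that *somebody* holds the lock. */
        BUG_ON(!spin_is_locked(&irq->irq_lock));

        /* After: checks that the *current* context holds it, and compiles
         * away when lockdep is not configured. */
        lockdep_assert_held(&irq->irq_lock);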
     

27 Oct, 2018

1 commit

  • Revert 5ff7091f5a2ca ("mm, mmu_notifier: annotate mmu notifiers with
    blockable invalidate callbacks").

    MMU_INVALIDATE_DOES_NOT_BLOCK flags was the only one used and it is no
    longer needed since 93065ac753e4 ("mm, oom: distinguish blockable mode for
    mmu notifiers"). We now have a full support for per range !blocking
    behavior so we can drop the stop gap workaround which the per notifier
    flag was used for.

    Link: http://lkml.kernel.org/r/20180827112623.8992-4-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Cc: David Rientjes
    Cc: Boris Ostrovsky
    Cc: Jerome Glisse
    Cc: Juergen Gross
    Cc: Tetsuo Handa
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     

26 Oct, 2018

1 commit

  • Pull KVM updates from Radim Krčmář:
    "ARM:
    - Improved guest IPA space support (32 to 52 bits)

    - RAS event delivery for 32bit

    - PMU fixes

    - Guest entry hardening

    - Various cleanups

    - Port of dirty_log_test selftest

    PPC:
    - Nested HV KVM support for radix guests on POWER9. The performance
    is much better than with PR KVM. Migration and arbitrary level of
    nesting is supported.

    - Disable nested HV-KVM on early POWER9 chips that need a particular
    hardware bug workaround

    - One VM per core mode to prevent potential data leaks

    - PCI pass-through optimization

    - merge ppc-kvm topic branch and kvm-ppc-fixes to get a better base

    s390:
    - Initial version of AP crypto virtualization via vfio-mdev

    - Improvement for vfio-ap

    - Set the host program identifier

    - Optimize page table locking

    x86:
    - Enable nested virtualization by default

    - Implement Hyper-V IPI hypercalls

    - Improve #PF and #DB handling

    - Allow guests to use Enlightened VMCS

    - Add migration selftests for VMCS and Enlightened VMCS

    - Allow coalesced PIO accesses

    - Add an option to perform nested VMCS host state consistency check
    through hardware

    - Automatic tuning of lapic_timer_advance_ns

    - Many fixes, minor improvements, and cleanups"

    * tag 'kvm-4.20-1' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (204 commits)
    KVM/nVMX: Do not validate that posted_intr_desc_addr is page aligned
    Revert "kvm: x86: optimize dr6 restore"
    KVM: PPC: Optimize clearing TCEs for sparse tables
    x86/kvm/nVMX: tweak shadow fields
    selftests/kvm: add missing executables to .gitignore
    KVM: arm64: Safety check PSTATE when entering guest and handle IL
    KVM: PPC: Book3S HV: Don't use streamlined entry path on early POWER9 chips
    arm/arm64: KVM: Enable 32 bits kvm vcpu events support
    arm/arm64: KVM: Rename function kvm_arch_dev_ioctl_check_extension()
    KVM: arm64: Fix caching of host MDCR_EL2 value
    KVM: VMX: enable nested virtualization by default
    KVM/x86: Use 32bit xor to clear registers in svm.c
    kvm: x86: Introduce KVM_CAP_EXCEPTION_PAYLOAD
    kvm: vmx: Defer setting of DR6 until #DB delivery
    kvm: x86: Defer setting of CR2 until #PF delivery
    kvm: x86: Add payload operands to kvm_multiple_exception
    kvm: x86: Add exception payload fields to kvm_vcpu_events
    kvm: x86: Add has_payload and payload to kvm_queued_exception
    KVM: Documentation: Fix omission in struct kvm_vcpu_events
    KVM: selftests: add Enlightened VMCS test
    ...

    Linus Torvalds