03 Apr, 2019

1 commit

  • commit ddba91801aeb5c160b660caed1800eb3aef403f8 upstream.

    KVM's API requires that ioctls be issued from the same process
    that created the VM. In other words, userspace can play games with a
    VM's file descriptors, e.g. fork(), SCM_RIGHTS, etc..., but only the
    creator can do anything useful. Explicitly reject device ioctls that
    are issued by a process other than the VM's creator, and update KVM's
    API documentation to extend its requirements to device ioctls.
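
    A minimal sketch of the resulting check, assuming it sits at the top of
    kvm_device_ioctl() as in the upstream patch (placement and error code may
    differ):

        static long kvm_device_ioctl(struct file *filp, unsigned int ioctl,
                                     unsigned long arg)
        {
                struct kvm_device *dev = filp->private_data;

                /* Only the process (mm) that created the VM may issue device ioctls. */
                if (dev->kvm->mm != current->mm)
                        return -EIO;

                /* ... dispatch KVM_SET/GET/HAS_DEVICE_ATTR as before ... */
        }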

    Fixes: 852b6d57dc7f ("kvm: add device control API")
    Cc:
    Signed-off-by: Sean Christopherson
    Signed-off-by: Paolo Bonzini
    Signed-off-by: Greg Kroah-Hartman

    Sean Christopherson
     

24 Mar, 2019

4 commits

  • commit 152482580a1b0accb60676063a1ac57b2d12daf6 upstream.

    kvm_arch_memslots_updated() is at this point in time an x86-specific
    hook for handling MMIO generation wraparound. x86 stashes 19 bits of
    the memslots generation number in its MMIO sptes in order to avoid
    full page fault walks for repeat faults on emulated MMIO addresses.
    Because only 19 bits are used, wrapping the MMIO generation number is
    possible, if unlikely. kvm_arch_memslots_updated() alerts x86 that
    the generation has changed so that it can invalidate all MMIO sptes in
    case the effective MMIO generation has wrapped so as to avoid using a
    stale spte, e.g. a (very) old spte that was created with generation==0.

    Given that the purpose of kvm_arch_memslots_updated() is to prevent
    consuming stale entries, it needs to be called before the new generation
    is propagated to memslots. Invalidating the MMIO sptes after updating
    memslots means that there is a window where a vCPU could dereference
    the new memslots generation, e.g. 0, and incorrectly reuse an old MMIO
    spte that was created with (pre-wrap) generation==0.
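
    Conceptually, the ordering after the fix looks like this (a sketch of the
    intent, not the literal diff to install_new_memslots()):

        gen = ...;                           /* next memslots generation        */
        kvm_arch_memslots_updated(kvm, gen); /* x86 zaps stale MMIO sptes here  */
        slots->generation = gen;             /* only then publish the new gen   */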

    Fixes: e59dbe09f8e6 ("KVM: Introduce kvm_arch_memslots_updated()")
    Cc:
    Signed-off-by: Sean Christopherson
    Signed-off-by: Paolo Bonzini
    Signed-off-by: Greg Kroah-Hartman

    Sean Christopherson
     
  • [ Upstream commit ab2d5eb03dbb7b37a1c6356686fb48626ab0c93e ]

    We currently initialize the group of private IRQs during
    kvm_vgic_vcpu_init, and the value of the group depends on the GIC model
    we are emulating. However, CPUs created before creating (and
    initializing) the VGIC might end up with the wrong group if the VGIC
    is created as GICv3 later.

    Since we have no enforced ordering of creating the VGIC and creating
    VCPUs, we can end up with some of the VCPUs being properly initialized and
    the remaining incorrectly initialized. That also means that we have no
    single place to do the per-cpu data structure initialization which
    depends on knowing the emulated GIC model (which is only the group
    field).

    This patch removes the incorrect comment from kvm_vgic_vcpu_init and
    initializes the group of all previously created VCPUs' private
    interrupts in vgic_init in addition to the existing initialization in
    kvm_vgic_vcpu_init.
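
    A sketch of the vgic_init() loop this adds (field and constant names
    follow virt/kvm/arm/vgic/vgic-init.c but may differ in detail):

        kvm_for_each_vcpu(idx, vcpu, kvm) {
                for (i = 0; i < VGIC_NR_PRIVATE_IRQS; i++) {
                        struct vgic_irq *irq = &vcpu->arch.vgic_cpu.private_irqs[i];

                        /* The group is only meaningful once the GIC model is known. */
                        irq->group = (dist->vgic_model == KVM_DEV_TYPE_ARM_VGIC_V3) ? 1 : 0;
                }
        }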

    Signed-off-by: Christoffer Dall
    Signed-off-by: Marc Zyngier
    Signed-off-by: Sasha Levin

    Christoffer Dall
     
  • [ Upstream commit 358b28f09f0ab074d781df72b8a671edb1547789 ]

    The current kvm_psci_vcpu_on implementation will directly try to
    manipulate the state of the VCPU to reset it. However, since this is
    not done on the thread that runs the VCPU, we can end up in a strangely
    corrupted state when the source and target VCPUs are running at the same
    time.

    Fix this by factoring out all reset logic from the PSCI implementation
    and forwarding the required information along with a request to the
    target VCPU.
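
    The rough shape of the change, as a sketch (names follow the upstream
    patch): the source VCPU records the requested entry point and makes a
    request, and the target VCPU resets itself from its own run loop.

        /* kvm_psci_vcpu_on(), running on the source VCPU: */
        reset_state->pc = target_pc;
        reset_state->r0 = context_id;
        reset_state->reset = true;
        kvm_make_request(KVM_REQ_VCPU_RESET, target_vcpu);
        kvm_vcpu_wake_up(target_vcpu);

        /* check_vcpu_requests(), running on the target VCPU: */
        if (kvm_check_request(KVM_REQ_VCPU_RESET, vcpu))
                kvm_reset_vcpu(vcpu);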

    Reviewed-by: Andrew Jones
    Signed-off-by: Marc Zyngier
    Signed-off-by: Christoffer Dall
    Signed-off-by: Sasha Levin

    Marc Zyngier
     
  • [ Upstream commit fc3bc475231e12e9c0142f60100cf84d077c79e1 ]

    vgic_dist->lpi_list_lock must always be taken with interrupts disabled as
    it is used in interrupt context.

    For configurations such as PREEMPT_RT_FULL, this means that it should
    be a raw_spinlock since RT spinlocks are interruptible.
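
    The conversion itself is mechanical; roughly:

        -       spinlock_t      lpi_list_lock;
        +       raw_spinlock_t  lpi_list_lock;

    and, at each use site:

        -       spin_lock_irqsave(&dist->lpi_list_lock, flags);
        +       raw_spin_lock_irqsave(&dist->lpi_list_lock, flags);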

    Signed-off-by: Julien Thierry
    Acked-by: Christoffer Dall
    Acked-by: Marc Zyngier
    Signed-off-by: Christoffer Dall
    Signed-off-by: Sasha Levin

    Julien Thierry
     

13 Feb, 2019

3 commits

  • commit cfa39381173d5f969daf43582c95ad679189cbc9 upstream.

    kvm_ioctl_create_device() does the following:

    1. creates a device that holds a reference to the VM object (with a borrowed
    reference, the VM's refcount has not been bumped yet)
    2. initializes the device
    3. transfers the reference to the device to the caller's file descriptor table
    4. calls kvm_get_kvm() to turn the borrowed reference to the VM into a real
    reference

    The ownership transfer in step 3 must not happen before the reference to the VM
    becomes a proper, non-borrowed reference, which only happens in step 4.
    After step 3, an attacker can close the file descriptor and drop the borrowed
    reference, which can cause the refcount of the kvm object to drop to zero.

    This means that we need to grab a reference for the device before
    anon_inode_getfd(), otherwise the VM can disappear from under us.
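
    A sketch of the reordering in kvm_ioctl_create_device() (error-path
    cleanup abbreviated):

        kvm_get_kvm(kvm);       /* take the reference before publishing the fd */
        ret = anon_inode_getfd(ops->name, &kvm_device_fops, dev,
                               O_RDWR | O_CLOEXEC);
        if (ret < 0) {
                kvm_put_kvm(kvm);
                /* ... unlink and destroy the half-created device ... */
                return ret;
        }

        return ret;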

    Fixes: 852b6d57dc7f ("kvm: add device control API")
    Cc: stable@kernel.org
    Signed-off-by: Jann Horn
    Signed-off-by: Paolo Bonzini
    Signed-off-by: Greg Kroah-Hartman

    Jann Horn
     
  • [ Upstream commit 7a86dab8cf2f0fdf508f3555dddfc236623bff60 ]

    Since the offset is added directly to the hva from the
    gfn_to_hva_cache, a negative offset could result in an out of bounds
    write. The existing BUG_ON only checks for addresses beyond the end of
    the gfn_to_hva_cache, not for addresses before the start of the
    gfn_to_hva_cache.

    Note that all current call sites have non-negative offsets.
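
    A conceptual sketch of the tightened bounds check (not the literal
    upstream diff): reject offsets before the start of the cached window as
    well as spans past its end.

        if (WARN_ON_ONCE(offset < 0 || offset + len > ghc->len))
                return -EINVAL;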

    Fixes: 4ec6e8636256 ("kvm: Introduce kvm_write_guest_offset_cached()")
    Reported-by: Cfir Cohen
    Signed-off-by: Jim Mattson
    Reviewed-by: Cfir Cohen
    Reviewed-by: Peter Shier
    Reviewed-by: Krish Sadhukhan
    Reviewed-by: Sean Christopherson
    Signed-off-by: Radim Krčmář
    Signed-off-by: Sasha Levin

    Jim Mattson
     
  • [ Upstream commit 0d640732dbebed0f10f18526de21652931f0b2f2 ]

    When we emulate an MMIO instruction, we advance the CPU state within
    decode_hsr(), before emulating the instruction effects.

    Having this logic in decode_hsr() is opaque, and advancing the state
    before emulation is problematic. It gets in the way of applying
    consistent single-step logic, and it prevents us from being able to fail
    an MMIO instruction with a synchronous exception.

    Clean this up by only advancing the CPU state *after* the effects of the
    instruction are emulated.
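
    A sketch of where the skip now happens, assuming the upstream placement
    in kvm_handle_mmio_return() (surrounding code elided):

        int kvm_handle_mmio_return(struct kvm_vcpu *vcpu, struct kvm_run *run)
        {
                /* ... write back the MMIO load data into the vcpu registers ... */

                /* Only advance the PC once the access has actually been emulated. */
                kvm_skip_instr(vcpu, kvm_vcpu_trap_il_is32bit(vcpu));

                return 0;
        }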

    Cc: Peter Maydell
    Reviewed-by: Alex Bennée
    Reviewed-by: Christoffer Dall
    Signed-off-by: Mark Rutland
    Signed-off-by: Marc Zyngier
    Signed-off-by: Sasha Levin

    Mark Rutland
     

17 Jan, 2019

1 commit

  • commit fb544d1ca65a89f7a3895f7531221ceeed74ada7 upstream.

    We recently addressed a VMID generation race by introducing a read/write
    lock around accesses and updates to the vmid generation values.

    However, kvm_arch_vcpu_ioctl_run() also calls need_new_vmid_gen() but
    does so without taking the read lock.

    As far as I can tell, this can lead to the same kind of race:

    VM 0, VCPU 0 VM 0, VCPU 1
    ------------ ------------
    update_vttbr (vmid 254)
    update_vttbr (vmid 1) // roll over
    read_lock(kvm_vmid_lock);
    force_vm_exit()
    local_irq_disable
    need_new_vmid_gen == false //because vmid gen matches

    enter_guest (vmid 254)
    kvm_arch.vttbr = <new vttbr>
    read_unlock(kvm_vmid_lock);

    enter_guest (vmid 1)

    Which results in running two VCPUs in the same VM with different VMIDs
    and (even worse) other VCPUs from other VMs could now allocate clashing
    VMID 254 from the new generation as long as VCPU 0 is not exiting.

    Attempt to solve this by making sure vttbr is updated before another CPU
    can observe the updated VMID generation.
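
    A sketch of the resulting ordering in update_vttbr(), simplified from the
    upstream patch:

        kvm->arch.vttbr = kvm_phys_to_vttbr(pgd_phys) | vmid;

        smp_wmb();      /* vttbr must be visible before the new generation */
        WRITE_ONCE(kvm->arch.vmid_gen, atomic64_read(&kvm_vmid_gen));

    with need_new_vmid_gen() pairing this with an smp_rmb() between its reads.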

    Cc: stable@vger.kernel.org
    Fixes: f0cf47d939d0 "KVM: arm/arm64: Close VMID generation race"
    Reviewed-by: Julien Thierry
    Signed-off-by: Christoffer Dall
    Signed-off-by: Marc Zyngier
    Signed-off-by: Greg Kroah-Hartman

    Christoffer Dall
     

10 Jan, 2019

5 commits

  • commit c23b2e6fc4ca346018618266bcabd335c0a8a49e upstream.

    When using the nospec API, it should be taken into account that:

    "...if the CPU speculates past the bounds check then
    * array_index_nospec() will clamp the index within the range of [0,
    * size)."

    The above is part of the header for macro array_index_nospec() in
    linux/nospec.h

    Now, in this particular case, if intid evaluates to exactly VGIC_MAX_SPI
    or to exactly VGIC_MAX_PRIVATE, the array_index_nospec() macro ends up
    returning VGIC_MAX_SPI - 1 or VGIC_MAX_PRIVATE - 1 respectively, instead
    of VGIC_MAX_SPI or VGIC_MAX_PRIVATE, which, based on the original logic:

    /* SGIs and PPIs */
    if (intid <= VGIC_MAX_PRIVATE)
            return &vcpu->arch.vgic_cpu.private_irqs[intid];

    /* SPIs */
    if (intid <= VGIC_MAX_SPI)
            return &kvm->arch.vgic.spis[intid - VGIC_NR_PRIVATE_IRQS];

    are valid values for intid.

    Fix this by calling array_index_nospec() macro with VGIC_MAX_PRIVATE + 1
    and VGIC_MAX_SPI + 1 as arguments for its parameter size.
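
    The resulting SGI/PPI leg of vgic_get_irq() then looks roughly like:

        /* SGIs and PPIs */
        if (intid <= VGIC_MAX_PRIVATE) {
                intid = array_index_nospec(intid, VGIC_MAX_PRIVATE + 1);
                return &vcpu->arch.vgic_cpu.private_irqs[intid];
        }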

    Fixes: 41b87599c743 ("KVM: arm/arm64: vgic: fix possible spectre-v1 in vgic_get_irq()")
    Cc: stable@vger.kernel.org
    Signed-off-by: Gustavo A. R. Silva
    [dropped the SPI part which was fixed separately]
    Signed-off-by: Marc Zyngier
    Signed-off-by: Greg Kroah-Hartman

    Gustavo A. R. Silva
     
  • commit 60c3ab30d8c2ff3a52606df03f05af2aae07dc6b upstream.

    When restoring the active state from userspace, we don't know which CPU
    was the source for the active state, and this is not architecturally
    exposed in any of the register state.

    Set the active_source to 0 in this case. In the future, we can expand
    on this and expose it as additional information to
    userspace for GICv2 if anyone cares.

    Cc: stable@vger.kernel.org
    Signed-off-by: Christoffer Dall
    Signed-off-by: Marc Zyngier
    Signed-off-by: Greg Kroah-Hartman

    Christoffer Dall
     
  • commit bea2ef803ade3359026d5d357348842bca9edcf1 upstream.

    SPIs should be checked against the VM's specific configuration, and
    not the architectural maximum.

    Cc: stable@vger.kernel.org
    Signed-off-by: Marc Zyngier
    Signed-off-by: Greg Kroah-Hartman

    Marc Zyngier
     
  • commit 2e2f6c3c0b08eed3fcf7de3c7684c940451bdeb1 upstream.

    To change the active state of an IRQ via MMIO, halt is requested for all vcpus of
    the affected guest before modifying the IRQ state. This is done by calling
    cond_resched_lock() in vgic_mmio_change_active(). However interrupts are
    disabled at this point and we cannot reschedule a vcpu.

    We actually don't need any of this, as kvm_arm_halt_guest ensures that
    all the other vcpus are out of the guest. Let's just drop that useless
    code.

    Signed-off-by: Julien Thierry
    Suggested-by: Christoffer Dall
    Cc: stable@vger.kernel.org
    Signed-off-by: Marc Zyngier
    Signed-off-by: Greg Kroah-Hartman

    Julien Thierry
     
  • commit 107352a24900fb458152b92a4e72fbdc83fd5510 upstream.

    We currently only halt the guest when a vCPU messes with the active
    state of an SPI. This is perfectly fine for GICv2, but isn't enough
    for GICv3, where all vCPUs can access the state of any other vCPU.

    Let's broaden the condition to include any GICv3 interrupt that
    has an active state (i.e. all but LPIs).

    Cc: stable@vger.kernel.org
    Reviewed-by: Christoffer Dall
    Signed-off-by: Marc Zyngier
    Signed-off-by: Greg Kroah-Hartman

    Marc Zyngier
     

14 Nov, 2018

2 commits

  • commit da5a3ce66b8bb51b0ea8a89f42aac153903f90fb upstream.

    At boot time, KVM stashes the host MDCR_EL2 value, but only does this
    when the kernel is not running in hyp mode (i.e. is non-VHE). In these
    cases, the stashed value of MDCR_EL2.HPMN happens to be zero, which can
    lead to CONSTRAINED UNPREDICTABLE behaviour.

    Since we use this value to derive the MDCR_EL2 value when switching
    to/from a guest, after a guest has been run, the performance counters
    do not behave as expected. This has been observed to result in accesses
    via PMXEVTYPER_EL0 and PMXEVCNTR_EL0 not affecting the relevant
    counters, resulting in events not being counted. In these cases, only
    the fixed-purpose cycle counter appears to work as expected.

    Fix this by always stashing the host MDCR_EL2 value, regardless of VHE.

    Cc: Christoffer Dall
    Cc: James Morse
    Cc: Will Deacon
    Cc: stable@vger.kernel.org
    Fixes: 1e947bad0b63b351 ("arm64: KVM: Skip HYP setup when already running in HYP")
    Tested-by: Robin Murphy
    Signed-off-by: Mark Rutland
    Signed-off-by: Marc Zyngier
    Signed-off-by: Greg Kroah-Hartman

    Mark Rutland
     
  • commit fd2ef358282c849c193aa36dadbf4f07f7dcd29b upstream.

    PageTransCompoundMap() returns true for hugetlbfs and THP
    hugepages. This behaviour incorrectly leads to stage 2 faults for
    unsupported hugepage sizes (e.g., 64K hugepage with 4K pages) to be
    treated as THP faults.

    Tighten the check to filter out hugetlbfs pages. This also leads to
    consistently mapping all unsupported hugepage sizes as PTE level
    entries at stage 2.

    Signed-off-by: Punit Agrawal
    Reviewed-by: Suzuki Poulose
    Cc: Christoffer Dall
    Cc: Marc Zyngier
    Cc: stable@vger.kernel.org # v4.13+
    Signed-off-by: Marc Zyngier
    Signed-off-by: Greg Kroah-Hartman

    Punit Agrawal
     

07 Sep, 2018

2 commits

  • kvm_unmap_hva is long gone, and we only have kvm_unmap_hva_range to
    deal with. Drop the now obsolete code.

    Fixes: fb1522e099f0 ("KVM: update to new mmu_notifier semantic v2")
    Cc: James Hogan
    Reviewed-by: Paolo Bonzini
    Signed-off-by: Marc Zyngier
    Signed-off-by: Christoffer Dall

    Marc Zyngier
     
  • When triggering a CoW, we unmap the RO page via an MMU notifier
    (invalidate_range_start), and then populate the new PTE using another
    one (change_pte). In the meantime, we'll have copied the old page
    into the new one.

    The problem is that the data for the new page is sitting in the
    cache, and should the guest have an uncached mapping to that page
    (or its MMU off), following accesses will bypass the cache.

    In a way, this is similar to what happens on a translation fault:
    We need to clean the page to the PoC before mapping it. So let's just
    do that.

    This fixes a KVM unit test regression observed on a HiSilicon platform,
    and subsequently reproduced on Seattle.
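
    A sketch of the change, assuming it lands in the change_pte handler
    (the kvm_set_spte_hva() path) as described above:

        /*
         * The page has just been copied (CoW); treat it like a translation
         * fault and clean it to the PoC before it becomes visible to a
         * guest that may access it with caches off.
         */
        clean_dcache_guest_page(pfn, PAGE_SIZE);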

    Fixes: a9c0e12ebee5 ("KVM: arm/arm64: Only clean the dcache on translation fault")
    Cc: stable@vger.kernel.org # v4.16+
    Reported-by: Mike Galbraith
    Signed-off-by: Marc Zyngier
    Signed-off-by: Christoffer Dall

    Marc Zyngier
     

23 Aug, 2018

3 commits

  • Pull second set of KVM updates from Paolo Bonzini:
    "ARM:
    - Support for Group0 interrupts in guests
    - Cache management optimizations for ARMv8.4 systems
    - Userspace interface for RAS
    - Fault path optimization
    - Emulated physical timer fixes
    - Random cleanups

    x86:
    - fixes for L1TF
    - a new test case
    - non-support for SGX (inject the right exception in the guest)
    - fix lockdep false positive"

    * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (49 commits)
    KVM: VMX: fixes for vmentry_l1d_flush module parameter
    kvm: selftest: add dirty logging test
    kvm: selftest: pass in extra memory when create vm
    kvm: selftest: include the tools headers
    kvm: selftest: unify the guest port macros
    tools: introduce test_and_clear_bit
    KVM: x86: SVM: Call x86_spec_ctrl_set_guest/host() with interrupts disabled
    KVM: vmx: Inject #UD for SGX ENCLS instruction in guest
    KVM: vmx: Add defines for SGX ENCLS exiting
    x86/kvm/vmx: Fix coding style in vmx_setup_l1d_flush()
    x86: kvm: avoid unused variable warning
    KVM: Documentation: rename the capability of KVM_CAP_ARM_SET_SERROR_ESR
    KVM: arm/arm64: Skip updating PTE entry if no change
    KVM: arm/arm64: Skip updating PMD entry if no change
    KVM: arm: Use true and false for boolean values
    KVM: arm/arm64: vgic: Do not use spin_lock_irqsave/restore with irq disabled
    KVM: arm/arm64: vgic: Move DEBUG_SPINLOCK_BUG_ON to vgic.h
    KVM: arm: vgic-v3: Add support for ICC_SGI0R and ICC_ASGI1R accesses
    KVM: arm64: vgic-v3: Add support for ICC_SGI0R_EL1 and ICC_ASGI1R_EL1 accesses
    KVM: arm/arm64: vgic-v3: Add core support for Group0 SGIs
    ...

    Linus Torvalds
     
  • Merge more updates from Andrew Morton:

    - the rest of MM

    - procfs updates

    - various misc things

    - more y2038 fixes

    - get_maintainer updates

    - lib/ updates

    - checkpatch updates

    - various epoll updates

    - autofs updates

    - hfsplus

    - some reiserfs work

    - fatfs updates

    - signal.c cleanups

    - ipc/ updates

    * emailed patches from Andrew Morton : (166 commits)
    ipc/util.c: update return value of ipc_getref from int to bool
    ipc/util.c: further variable name cleanups
    ipc: simplify ipc initialization
    ipc: get rid of ids->tables_initialized hack
    lib/rhashtable: guarantee initial hashtable allocation
    lib/rhashtable: simplify bucket_table_alloc()
    ipc: drop ipc_lock()
    ipc/util.c: correct comment in ipc_obtain_object_check
    ipc: rename ipcctl_pre_down_nolock()
    ipc/util.c: use ipc_rcu_putref() for failues in ipc_addid()
    ipc: reorganize initialization of kern_ipc_perm.seq
    ipc: compute kern_ipc_perm.id under the ipc lock
    init/Kconfig: remove EXPERT from CHECKPOINT_RESTORE
    fs/sysv/inode.c: use ktime_get_real_seconds() for superblock stamp
    adfs: use timespec64 for time conversion
    kernel/sysctl.c: fix typos in comments
    drivers/rapidio/devices/rio_mport_cdev.c: remove redundant pointer md
    fork: don't copy inconsistent signal handler state to child
    signal: make get_signal() return bool
    signal: make sigkill_pending() return bool
    ...

    Linus Torvalds
     
  • There are several blockable mmu notifiers which might sleep in
    mmu_notifier_invalidate_range_start and that is a problem for the
    oom_reaper because it needs to guarantee a forward progress so it cannot
    depend on any sleepable locks.

    Currently we simply back off and mark an oom victim with blockable mmu
    notifiers as done after a short sleep. That can result in selecting a new
    oom victim prematurely because the previous one still hasn't torn its
    memory down yet.

    We can do much better though. Even if mmu notifiers use sleepable locks
    there is no reason to automatically assume those locks are held. Moreover
    majority of notifiers only care about a portion of the address space and
    there is absolutely zero reason to fail when we are unmapping an unrelated
    range. Many notifiers really do block and wait for HW, which is harder to
    handle, and in those cases we have to bail out.

    This patch handles the low hanging fruit.
    __mmu_notifier_invalidate_range_start gets a blockable flag and callbacks
    are not allowed to sleep if the flag is set to false. This is achieved by
    using trylock instead of the sleepable lock for most callbacks and
    continuing as long as we do not block down the call chain.
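
    A sketch of the pattern a notifier is now expected to follow (the callback
    signature is the one introduced here; "foo" is a hypothetical stand-in for
    a driver's own state):

        static int foo_invalidate_range_start(struct mmu_notifier *mn,
                                              struct mm_struct *mm,
                                              unsigned long start,
                                              unsigned long end,
                                              bool blockable)
        {
                struct foo *foo = container_of(mn, struct foo, mn);

                if (blockable)
                        mutex_lock(&foo->lock);
                else if (!mutex_trylock(&foo->lock))
                        return -EAGAIN; /* oom_reaper will retry later */

                /* ... invalidate the overlapping mappings ... */

                mutex_unlock(&foo->lock);
                return 0;
        }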

    I think we can improve that even further because there is a common pattern
    to do a range lookup first and then do something about that. The first
    part can be done without a sleeping lock in most cases AFAICS.

    The oom_reaper end then simply retries if there is at least one notifier
    which couldn't make any progress in !blockable mode. A retry loop is
    already implemented to wait for the mmap_sem and this is basically the
    same thing.

    The simplest way for driver developers to test this code path is to wrap
    userspace code which uses these notifiers into a memcg and set the hard
    limit to hit the oom. This can be done e.g. after the test faults in all
    the mmu notifier managed memory and set the hard limit to something really
    small. Then we are looking for a proper process tear down.

    [akpm@linux-foundation.org: coding style fixes]
    [akpm@linux-foundation.org: minor code simplification]
    Link: http://lkml.kernel.org/r/20180716115058.5559-1-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Acked-by: Christian König # AMD notifiers
    Acked-by: Leon Romanovsky # mlx and umem_odp
    Reported-by: David Rientjes
    Cc: "David (ChunMing) Zhou"
    Cc: Paolo Bonzini
    Cc: Alex Deucher
    Cc: David Airlie
    Cc: Jani Nikula
    Cc: Joonas Lahtinen
    Cc: Rodrigo Vivi
    Cc: Doug Ledford
    Cc: Jason Gunthorpe
    Cc: Mike Marciniszyn
    Cc: Dennis Dalessandro
    Cc: Sudeep Dutt
    Cc: Ashutosh Dixit
    Cc: Dimitri Sivanich
    Cc: Boris Ostrovsky
    Cc: Juergen Gross
    Cc: "Jérôme Glisse"
    Cc: Andrea Arcangeli
    Cc: Felix Kuehling
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     

22 Aug, 2018

2 commits

  • …marm/kvmarm into HEAD

    KVM/arm updates for 4.19

    - Support for Group0 interrupts in guests
    - Cache management optimizations for ARMv8.4 systems
    - Userspace interface for RAS, allowing error retrieval and injection
    - Fault path optimization
    - Emulated physical timer fixes
    - Random cleanups

    Paolo Bonzini
     
  • …iederm/user-namespace

    Pull core signal handling updates from Eric Biederman:
    "It was observed that a periodic timer in combination with a
    sufficiently expensive fork could prevent fork from ever completing.
    This contains the changes to remove the need for that restart.

    This set of changes is split into several parts:

    - The first part makes PIDTYPE_TGID a proper pid type instead of
    something only for very special cases. The part starts using
    PIDTYPE_TGID enough so that in __send_signal, where signals are
    actually delivered, we know if the signal is being sent to a group
    of processes or just a single process.

    - With that prep work out of the way the logic in fork is modified so
    that fork logically makes signals received while it is running
    appear to be received after the fork completes"

    * 'siginfo-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: (22 commits)
    signal: Don't send signals to tasks that don't exist
    signal: Don't restart fork when signals come in.
    fork: Have new threads join on-going signal group stops
    fork: Skip setting TIF_SIGPENDING in ptrace_init_task
    signal: Add calculate_sigpending()
    fork: Unconditionally exit if a fatal signal is pending
    fork: Move and describe why the code examines PIDNS_ADDING
    signal: Push pid type down into complete_signal.
    signal: Push pid type down into __send_signal
    signal: Push pid type down into send_signal
    signal: Pass pid type into do_send_sig_info
    signal: Pass pid type into send_sigio_to_task & send_sigurg_to_task
    signal: Pass pid type into group_send_sig_info
    signal: Pass pid and pid type into send_sigqueue
    posix-timers: Noralize good_sigevent
    signal: Use PIDTYPE_TGID to clearly store where file signals will be sent
    pid: Implement PIDTYPE_TGID
    pids: Move the pgrp and session pid pointers from task_struct to signal_struct
    kvm: Don't open code task_pid in kvm_vcpu_ioctl
    pids: Compute task_tgid using signal->leader_pid
    ...

    Linus Torvalds
     

20 Aug, 2018

1 commit

  • Pull first set of KVM updates from Paolo Bonzini:
    "PPC:
    - minor code cleanups

    x86:
    - PCID emulation and CR3 caching for shadow page tables
    - nested VMX live migration
    - nested VMCS shadowing
    - optimized IPI hypercall
    - some optimizations

    ARM will come next week"

    * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (85 commits)
    kvm: x86: Set highest physical address bits in non-present/reserved SPTEs
    KVM/x86: Use CC_SET()/CC_OUT in arch/x86/kvm/vmx.c
    KVM: X86: Implement PV IPIs in linux guest
    KVM: X86: Add kvm hypervisor init time platform setup callback
    KVM: X86: Implement "send IPI" hypercall
    KVM/x86: Move X86_CR4_OSXSAVE check into kvm_valid_sregs()
    KVM: x86: Skip pae_root shadow allocation if tdp enabled
    KVM/MMU: Combine flushing remote tlb in mmu_set_spte()
    KVM: vmx: skip VMWRITE of HOST_{FS,GS}_BASE when possible
    KVM: vmx: skip VMWRITE of HOST_{FS,GS}_SEL when possible
    KVM: vmx: always initialize HOST_{FS,GS}_BASE to zero during setup
    KVM: vmx: move struct host_state usage to struct loaded_vmcs
    KVM: vmx: compute need to reload FS/GS/LDT on demand
    KVM: nVMX: remove a misleading comment regarding vmcs02 fields
    KVM: vmx: rename __vmx_load_host_state() and vmx_save_host_state()
    KVM: vmx: add dedicated utility to access guest's kernel_gs_base
    KVM: vmx: track host_state.loaded using a loaded_vmcs pointer
    KVM: vmx: refactor segmentation code in vmx_save_host_state()
    kvm: nVMX: Fix fault priority for VMX operations
    kvm: nVMX: Fix fault vector for VMX operation at CPL > 0
    ...

    Linus Torvalds
     

15 Aug, 2018

1 commit

  • Pull arm64 updates from Will Deacon:
    "A bunch of good stuff in here. Worth noting is that we've pulled in
    the x86/mm branch from -tip so that we can make use of the core
    ioremap changes which allow us to put down huge mappings in the
    vmalloc area without screwing up the TLB. Much of the positive
    diffstat is because of the rseq selftest for arm64.

    Summary:

    - Wire up support for qspinlock, replacing our trusty ticket lock
    code

    - Add an IPI to flush_icache_range() to ensure that stale
    instructions fetched into the pipeline are discarded along with the
    I-cache lines

    - Support for the GCC "stackleak" plugin

    - Support for restartable sequences, plus an arm64 port for the
    selftest

    - Kexec/kdump support on systems booting with ACPI

    - Rewrite of our syscall entry code in C, which allows us to zero the
    GPRs on entry from userspace

    - Support for chained PMU counters, allowing 64-bit event counters to
    be constructed on current CPUs

    - Ensure scheduler topology information is kept up-to-date with CPU
    hotplug events

    - Re-enable support for huge vmalloc/IO mappings now that the core
    code has the correct hooks to use break-before-make sequences

    - Miscellaneous, non-critical fixes and cleanups"

    * tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux: (90 commits)
    arm64: alternative: Use true and false for boolean values
    arm64: kexec: Add comment to explain use of __flush_icache_range()
    arm64: sdei: Mark sdei stack helper functions as static
    arm64, kaslr: export offset in VMCOREINFO ELF notes
    arm64: perf: Add cap_user_time aarch64
    efi/libstub: Only disable stackleak plugin for arm64
    arm64: drop unused kernel_neon_begin_partial() macro
    arm64: kexec: machine_kexec should call __flush_icache_range
    arm64: svc: Ensure hardirq tracing is updated before return
    arm64: mm: Export __sync_icache_dcache() for xen-privcmd
    drivers/perf: arm-ccn: Use devm_ioremap_resource() to map memory
    arm64: Add support for STACKLEAK gcc plugin
    arm64: Add stack information to on_accessible_stack
    drivers/perf: hisi: update the sccl_id/ccl_id when MT is supported
    arm64: fix ACPI dependencies
    rseq/selftests: Add support for arm64
    arm64: acpi: fix alignment fault in accessing ACPI
    efi/arm: map UEFI memory map even w/o runtime services enabled
    efi/arm: preserve early mapping of UEFI memory map longer for BGRT
    drivers: acpi: add dependency of EFI for arm64
    ...

    Linus Torvalds
     

13 Aug, 2018

2 commits

  • When there is contention on faulting in a particular page table entry
    at stage 2, the break-before-make requirement of the architecture can
    lead to additional refaulting due to TLB invalidation.

    Avoid this by skipping a page table update if the new value of the PTE
    matches the previous value.
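
    The early-out itself is a one-liner; a sketch of its shape in
    stage2_set_pte():

        old_pte = *pte;
        if (pte_val(old_pte) == pte_val(*new_pte))
                return 0;       /* no change, skip break-before-make */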

    Cc: stable@vger.kernel.org
    Fixes: d5d8184d35c9 ("KVM: ARM: Memory virtualization setup")
    Reviewed-by: Suzuki Poulose
    Acked-by: Christoffer Dall
    Signed-off-by: Punit Agrawal
    Signed-off-by: Marc Zyngier

    Punit Agrawal
     
  • Contention on updating a PMD entry by a large number of vcpus can lead
    to duplicate work when handling stage 2 page faults. As the page table
    update follows the break-before-make requirement of the architecture,
    it can lead to repeated refaults due to clearing the entry and
    flushing the tlbs.

    This problem is more likely when -

    * there are large number of vcpus
    * the mapping is large block mapping

    such as when using PMD hugepages (512MB) with 64k pages.

    Fix this by skipping the page table update if there is no change in
    the entry being updated.

    Cc: stable@vger.kernel.org
    Fixes: ad361f093c1e ("KVM: ARM: Support hugetlbfs backed huge pages")
    Reviewed-by: Suzuki Poulose
    Acked-by: Christoffer Dall
    Signed-off-by: Punit Agrawal
    Signed-off-by: Marc Zyngier

    Punit Agrawal
     

12 Aug, 2018

3 commits

  • kvm_vgic_sync_hwstate is only called with IRQ being disabled.
    There is thus no need to call spin_lock_irqsave/restore in
    vgic_fold_lr_state and vgic_prune_ap_list.

    This patch replaces them with the non-irq-safe versions.
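
    The change follows this pattern at the affected call sites (lock names
    may differ per site), roughly:

        -       spin_lock_irqsave(&irq->irq_lock, flags);
        +       spin_lock(&irq->irq_lock);
        ...
        -       spin_unlock_irqrestore(&irq->irq_lock, flags);
        +       spin_unlock(&irq->irq_lock);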

    Signed-off-by: Jia He
    Acked-by: Christoffer Dall
    [maz: commit message tidy-up]
    Signed-off-by: Marc Zyngier

    Jia He
     
  • DEBUG_SPINLOCK_BUG_ON can be used with both vgic-v2 and vgic-v3,
    so let's move it to vgic.h

    Signed-off-by: Jia He
    [maz: commit message tidy-up]
    Signed-off-by: Marc Zyngier

    Jia He
     
  • Although vgic-v3 now supports Group0 interrupts, it still doesn't
    deal with Group0 SGIs. As usually with the GIC, nothing is simple:

    - ICC_SGI1R can signal SGIs of both groups, since GICD_CTLR.DS==1
    with KVM (as per 8.1.10, Non-secure EL1 access)

    - ICC_SGI0R can only generate Group0 SGIs

    - ICC_ASGI1R sees its scope refocussed to generate only Group0
    SGIs (as per the note at the bottom of Table 8-14)

    We only support Group1 SGIs so far, so no material change.

    Reviewed-by: Eric Auger
    Reviewed-by: Christoffer Dall
    Signed-off-by: Marc Zyngier

    Marc Zyngier
     

06 Aug, 2018

4 commits

  • We are currently cutting hva_to_pfn_fast short if we do not want an
    immediate exit, which is represented by !async && !atomic. However,
    this is unnecessary, and __get_user_pages_fast is *much* faster
    because the regular get_user_pages takes pmd_lock/pte_lock.
    In fact, when many CPUs take a nested vmexit at the same time
    the contention on those locks is visible, and this patch removes
    about 25% (compared to 4.18) from vmexit.flat on a 16 vCPU
    nested guest.

    Suggested-by: Andrea Arcangeli
    Signed-off-by: Paolo Bonzini

    Paolo Bonzini
     
  • This patch is to provide a way for platforms to register hv tlb remote
    flush callback and this helps to optimize operation of tlb flush
    among vcpus for nested virtualization case.

    Signed-off-by: Lan Tianyu
    Signed-off-by: Paolo Bonzini

    Tianyu Lan
     
  • Use the fast CR3 switch mechanism to locklessly change the MMU root
    page when switching between L1 and L2. The switch from L2 to L1 should
    always go through the fast path, while the switch from L1 to L2 should
    go through the fast path if L1's CR3/EPTP for L2 hasn't changed
    since the last time.

    Signed-off-by: Junaid Shahid
    Signed-off-by: Paolo Bonzini

    Junaid Shahid
     
  • Pull bug fixes into the KVM development tree to avoid nasty conflicts.

    Paolo Bonzini
     

31 Jul, 2018

2 commits

  • When the VCPU is blocked (for example from WFI) we don't inject the
    physical timer interrupt if it should fire while the CPU is blocked, but
    instead we just wake up the VCPU and expect kvm_timer_vcpu_load to take
    care of injecting the interrupt.

    Unfortunately, kvm_timer_vcpu_load() doesn't actually do that, it only
    has support to schedule a soft timer if the emulated phys timer is
    expected to fire in the future.

    Follow the same pattern as kvm_timer_update_state() and update the irq
    state after potentially scheduling a soft timer.

    Reported-by: Andre Przywara
    Cc: Stable # 4.15+
    Fixes: bbdd52cfcba29 ("KVM: arm/arm64: Avoid phys timer emulation in vcpu entry/exit")
    Signed-off-by: Christoffer Dall
    Signed-off-by: Marc Zyngier

    Christoffer Dall
     
  • kvm_timer_update_state() is called when changing the phys timer
    configuration registers, either via vcpu reset, as a result of a trap
    from the guest, or when userspace programs the registers.

    phys_timer_emulate() is in turn called by kvm_timer_update_state() to
    either cancel an existing software timer, or program a new software
    timer, to emulate the behavior of a real phys timer, based on the change
    in configuration registers.

    Unfortunately, the interaction between these two functions left a small
    race; if the conceptual emulated phys timer should actually fire, but
    the soft timer hasn't executed its callback yet, we cancel the timer in
    phys_timer_emulate without injecting an irq. This only happens if the
    check in kvm_timer_update_state is called before the timer should fire,
    which is relatively unlikely, but possible.

    The solution is to update the state of the phys timer after calling
    phys_timer_emulate, which will pick up the pending timer state and
    update the interrupt value.

    Note that this leaves the opportunity of raising the interrupt twice,
    once in the just-programmed soft timer, and once in
    kvm_timer_update_state. Since this always happens synchronously with
    the VCPU execution, there is no harm in this, and the guest only ever
    sees a single timer interrupt.

    Cc: Stable # 4.15+
    Signed-off-by: Christoffer Dall
    Signed-off-by: Marc Zyngier

    Christoffer Dall
     

25 Jul, 2018

1 commit


24 Jul, 2018

1 commit

  • It's possible for userspace to control n. Sanitize n when using it as an
    array index, to inhibit the potential spectre-v1 write gadget.

    Note that while it appears that n must be bound to the interval [0,3]
    due to the way it is extracted from addr, we cannot guarantee that
    compiler transformations (and/or future refactoring) will ensure this is
    the case, and given this is a slow path it's better to always perform
    the masking.
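
    The sanitisation itself is a single call before the array write, roughly
    (the bound follows the [0,3] interval noted above):

        n = array_index_nospec(n, 4);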

    Found by smatch.

    Signed-off-by: Mark Rutland
    Cc: Christoffer Dall
    Cc: Marc Zyngier
    Cc: kvmarm@lists.cs.columbia.edu
    Signed-off-by: Marc Zyngier

    Mark Rutland
     

21 Jul, 2018

2 commits