23 Aug, 2018

3 commits

  • Pull second set of KVM updates from Paolo Bonzini:
    "ARM:
    - Support for Group0 interrupts in guests
    - Cache management optimizations for ARMv8.4 systems
    - Userspace interface for RAS
    - Fault path optimization
    - Emulated physical timer fixes
    - Random cleanups

    x86:
    - fixes for L1TF
    - a new test case
    - non-support for SGX (inject the right exception in the guest)
    - fix lockdep false positive"

    * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (49 commits)
    KVM: VMX: fixes for vmentry_l1d_flush module parameter
    kvm: selftest: add dirty logging test
    kvm: selftest: pass in extra memory when create vm
    kvm: selftest: include the tools headers
    kvm: selftest: unify the guest port macros
    tools: introduce test_and_clear_bit
    KVM: x86: SVM: Call x86_spec_ctrl_set_guest/host() with interrupts disabled
    KVM: vmx: Inject #UD for SGX ENCLS instruction in guest
    KVM: vmx: Add defines for SGX ENCLS exiting
    x86/kvm/vmx: Fix coding style in vmx_setup_l1d_flush()
    x86: kvm: avoid unused variable warning
    KVM: Documentation: rename the capability of KVM_CAP_ARM_SET_SERROR_ESR
    KVM: arm/arm64: Skip updating PTE entry if no change
    KVM: arm/arm64: Skip updating PMD entry if no change
    KVM: arm: Use true and false for boolean values
    KVM: arm/arm64: vgic: Do not use spin_lock_irqsave/restore with irq disabled
    KVM: arm/arm64: vgic: Move DEBUG_SPINLOCK_BUG_ON to vgic.h
    KVM: arm: vgic-v3: Add support for ICC_SGI0R and ICC_ASGI1R accesses
    KVM: arm64: vgic-v3: Add support for ICC_SGI0R_EL1 and ICC_ASGI1R_EL1 accesses
    KVM: arm/arm64: vgic-v3: Add core support for Group0 SGIs
    ...

    Linus Torvalds
     
  • Merge more updates from Andrew Morton:

    - the rest of MM

    - procfs updates

    - various misc things

    - more y2038 fixes

    - get_maintainer updates

    - lib/ updates

    - checkpatch updates

    - various epoll updates

    - autofs updates

    - hfsplus

    - some reiserfs work

    - fatfs updates

    - signal.c cleanups

    - ipc/ updates

    * emailed patches from Andrew Morton: (166 commits)
    ipc/util.c: update return value of ipc_getref from int to bool
    ipc/util.c: further variable name cleanups
    ipc: simplify ipc initialization
    ipc: get rid of ids->tables_initialized hack
    lib/rhashtable: guarantee initial hashtable allocation
    lib/rhashtable: simplify bucket_table_alloc()
    ipc: drop ipc_lock()
    ipc/util.c: correct comment in ipc_obtain_object_check
    ipc: rename ipcctl_pre_down_nolock()
    ipc/util.c: use ipc_rcu_putref() for failues in ipc_addid()
    ipc: reorganize initialization of kern_ipc_perm.seq
    ipc: compute kern_ipc_perm.id under the ipc lock
    init/Kconfig: remove EXPERT from CHECKPOINT_RESTORE
    fs/sysv/inode.c: use ktime_get_real_seconds() for superblock stamp
    adfs: use timespec64 for time conversion
    kernel/sysctl.c: fix typos in comments
    drivers/rapidio/devices/rio_mport_cdev.c: remove redundant pointer md
    fork: don't copy inconsistent signal handler state to child
    signal: make get_signal() return bool
    signal: make sigkill_pending() return bool
    ...

    Linus Torvalds
     
  • There are several blockable mmu notifiers which might sleep in
    mmu_notifier_invalidate_range_start and that is a problem for the
    oom_reaper because it needs to guarantee forward progress, so it cannot
    depend on any sleepable locks.

    Currently we simply back off and mark an oom victim with blockable mmu
    notifiers as done after a short sleep. That can result in selecting a new
    oom victim prematurely because the previous one still hasn't torn its
    memory down yet.

    We can do much better though. Even if mmu notifiers use sleepable locks
    there is no reason to automatically assume those locks are held. Moreover,
    the majority of notifiers only care about a portion of the address space
    and there is absolutely zero reason to fail when we are unmapping an
    unrelated range. Many notifiers do really block and wait for HW, though,
    which is harder to handle, and there we have to bail out.

    This patch handles the low hanging fruit.
    __mmu_notifier_invalidate_range_start gets a blockable flag and callbacks
    are not allowed to sleep if the flag is set to false. This is achieved by
    using trylock instead of the sleepable lock for most callbacks and
    continuing as long as we do not block down the call chain.

    I think we can improve that even further because there is a common pattern
    to do a range lookup first and then do something about that. The first
    part can be done without a sleeping lock in most cases AFAICS.

    The oom_reaper end then simply retries if there is at least one notifier
    which couldn't make any progress in !blockable mode. A retry loop is
    already implemented to wait for the mmap_sem and this is basically the
    same thing.

    The simplest way for driver developers to test this code path is to wrap
    userspace code which uses these notifiers into a memcg and set the hard
    limit to hit the oom. This can be done e.g. after the test faults in all
    the mmu notifier managed memory and set the hard limit to something really
    small. Then we are looking for a proper process tear down.
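
    The !blockable contract described in this message can be sketched in
    plain C. This is a user-space model under stated assumptions, not the
    kernel code: struct range_notifier, notifier_invalidate_range_start()
    and reap_range() are hypothetical names, and a C11 atomic_flag stands in
    for whatever sleepable lock a real notifier would take.

```c
#include <assert.h>
#include <errno.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical notifier: one lock plus the address range it cares about. */
struct range_notifier {
    atomic_flag lock;            /* stand-in for the driver's sleepable lock */
    unsigned long start, end;    /* tracked range, half-open [start, end) */
    unsigned long invalidations; /* number of invalidations performed */
};

/* Invalidate [start, end). Returns 0 on success, -EAGAIN if it would block. */
int notifier_invalidate_range_start(struct range_notifier *n,
                                    unsigned long start, unsigned long end,
                                    bool blockable)
{
    assert(start < end);

    /* Unrelated range: nothing to do, so no lock is needed at all. */
    if (end <= n->start || start >= n->end)
        return 0;

    if (atomic_flag_test_and_set(&n->lock)) {
        if (!blockable)
            return -EAGAIN;      /* lock is contended and we must not wait */
        while (atomic_flag_test_and_set(&n->lock))
            ;                    /* spinning stands in for sleeping on the lock */
    }

    n->invalidations++;
    atomic_flag_clear(&n->lock);
    return 0;
}

/* oom_reaper-style caller: bounded retries, always in !blockable mode. */
int reap_range(struct range_notifier *n, unsigned long start,
               unsigned long end, int max_tries)
{
    for (int i = 0; i < max_tries; i++) {
        if (notifier_invalidate_range_start(n, start, end, false) == 0)
            return 0;
        /* The kernel would sleep briefly here before retrying. */
    }
    return -EAGAIN;
}
```

    In this model an invalidation of a range the notifier does not track
    succeeds even while the lock is held, which mirrors the point that
    unmapping an unrelated range has no reason to fail.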

    [akpm@linux-foundation.org: coding style fixes]
    [akpm@linux-foundation.org: minor code simplification]
    Link: http://lkml.kernel.org/r/20180716115058.5559-1-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Acked-by: Christian König # AMD notifiers
    Acked-by: Leon Romanovsky # mlx and umem_odp
    Reported-by: David Rientjes
    Cc: "David (ChunMing) Zhou"
    Cc: Paolo Bonzini
    Cc: Alex Deucher
    Cc: David Airlie
    Cc: Jani Nikula
    Cc: Joonas Lahtinen
    Cc: Rodrigo Vivi
    Cc: Doug Ledford
    Cc: Jason Gunthorpe
    Cc: Mike Marciniszyn
    Cc: Dennis Dalessandro
    Cc: Sudeep Dutt
    Cc: Ashutosh Dixit
    Cc: Dimitri Sivanich
    Cc: Boris Ostrovsky
    Cc: Juergen Gross
    Cc: "Jérôme Glisse"
    Cc: Andrea Arcangeli
    Cc: Felix Kuehling
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     

22 Aug, 2018

2 commits

  • …marm/kvmarm into HEAD

    KVM/arm updates for 4.19

    - Support for Group0 interrupts in guests
    - Cache management optimizations for ARMv8.4 systems
    - Userspace interface for RAS, allowing error retrieval and injection
    - Fault path optimization
    - Emulated physical timer fixes
    - Random cleanups

    Paolo Bonzini
     
  • …iederm/user-namespace

    Pull core signal handling updates from Eric Biederman:
    "It was observed that a periodic timer in combination with a
    sufficiently expensive fork could prevent fork from ever completing.
    This contains the changes to remove the need for that restart.

    This set of changes is split into several parts:

    - The first part makes PIDTYPE_TGID a proper pid type instead of
    something only for very special cases. The part starts using
    PIDTYPE_TGID enough so that in __send_signal where signals are
    actually delivered we know if the signal is being sent to a group
    of processes or just a single process.

    - With that prep work out of the way the logic in fork is modified so
    that fork logically makes signals received while it is running
    appear to be received after the fork completes"

    * 'siginfo-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: (22 commits)
    signal: Don't send signals to tasks that don't exist
    signal: Don't restart fork when signals come in.
    fork: Have new threads join on-going signal group stops
    fork: Skip setting TIF_SIGPENDING in ptrace_init_task
    signal: Add calculate_sigpending()
    fork: Unconditionally exit if a fatal signal is pending
    fork: Move and describe why the code examines PIDNS_ADDING
    signal: Push pid type down into complete_signal.
    signal: Push pid type down into __send_signal
    signal: Push pid type down into send_signal
    signal: Pass pid type into do_send_sig_info
    signal: Pass pid type into send_sigio_to_task & send_sigurg_to_task
    signal: Pass pid type into group_send_sig_info
    signal: Pass pid and pid type into send_sigqueue
    posix-timers: Noralize good_sigevent
    signal: Use PIDTYPE_TGID to clearly store where file signals will be sent
    pid: Implement PIDTYPE_TGID
    pids: Move the pgrp and session pid pointers from task_struct to signal_struct
    kvm: Don't open code task_pid in kvm_vcpu_ioctl
    pids: Compute task_tgid using signal->leader_pid
    ...

    Linus Torvalds
     

20 Aug, 2018

1 commit

  • Pull first set of KVM updates from Paolo Bonzini:
    "PPC:
    - minor code cleanups

    x86:
    - PCID emulation and CR3 caching for shadow page tables
    - nested VMX live migration
    - nested VMCS shadowing
    - optimized IPI hypercall
    - some optimizations

    ARM will come next week"

    * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (85 commits)
    kvm: x86: Set highest physical address bits in non-present/reserved SPTEs
    KVM/x86: Use CC_SET()/CC_OUT in arch/x86/kvm/vmx.c
    KVM: X86: Implement PV IPIs in linux guest
    KVM: X86: Add kvm hypervisor init time platform setup callback
    KVM: X86: Implement "send IPI" hypercall
    KVM/x86: Move X86_CR4_OSXSAVE check into kvm_valid_sregs()
    KVM: x86: Skip pae_root shadow allocation if tdp enabled
    KVM/MMU: Combine flushing remote tlb in mmu_set_spte()
    KVM: vmx: skip VMWRITE of HOST_{FS,GS}_BASE when possible
    KVM: vmx: skip VMWRITE of HOST_{FS,GS}_SEL when possible
    KVM: vmx: always initialize HOST_{FS,GS}_BASE to zero during setup
    KVM: vmx: move struct host_state usage to struct loaded_vmcs
    KVM: vmx: compute need to reload FS/GS/LDT on demand
    KVM: nVMX: remove a misleading comment regarding vmcs02 fields
    KVM: vmx: rename __vmx_load_host_state() and vmx_save_host_state()
    KVM: vmx: add dedicated utility to access guest's kernel_gs_base
    KVM: vmx: track host_state.loaded using a loaded_vmcs pointer
    KVM: vmx: refactor segmentation code in vmx_save_host_state()
    kvm: nVMX: Fix fault priority for VMX operations
    kvm: nVMX: Fix fault vector for VMX operation at CPL > 0
    ...

    Linus Torvalds
     

15 Aug, 2018

1 commit

  • Pull arm64 updates from Will Deacon:
    "A bunch of good stuff in here. Worth noting is that we've pulled in
    the x86/mm branch from -tip so that we can make use of the core
    ioremap changes which allow us to put down huge mappings in the
    vmalloc area without screwing up the TLB. Much of the positive
    diffstat is because of the rseq selftest for arm64.

    Summary:

    - Wire up support for qspinlock, replacing our trusty ticket lock
    code

    - Add an IPI to flush_icache_range() to ensure that stale
    instructions fetched into the pipeline are discarded along with the
    I-cache lines

    - Support for the GCC "stackleak" plugin

    - Support for restartable sequences, plus an arm64 port for the
    selftest

    - Kexec/kdump support on systems booting with ACPI

    - Rewrite of our syscall entry code in C, which allows us to zero the
    GPRs on entry from userspace

    - Support for chained PMU counters, allowing 64-bit event counters to
    be constructed on current CPUs

    - Ensure scheduler topology information is kept up-to-date with CPU
    hotplug events

    - Re-enable support for huge vmalloc/IO mappings now that the core
    code has the correct hooks to use break-before-make sequences

    - Miscellaneous, non-critical fixes and cleanups"

    * tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux: (90 commits)
    arm64: alternative: Use true and false for boolean values
    arm64: kexec: Add comment to explain use of __flush_icache_range()
    arm64: sdei: Mark sdei stack helper functions as static
    arm64, kaslr: export offset in VMCOREINFO ELF notes
    arm64: perf: Add cap_user_time aarch64
    efi/libstub: Only disable stackleak plugin for arm64
    arm64: drop unused kernel_neon_begin_partial() macro
    arm64: kexec: machine_kexec should call __flush_icache_range
    arm64: svc: Ensure hardirq tracing is updated before return
    arm64: mm: Export __sync_icache_dcache() for xen-privcmd
    drivers/perf: arm-ccn: Use devm_ioremap_resource() to map memory
    arm64: Add support for STACKLEAK gcc plugin
    arm64: Add stack information to on_accessible_stack
    drivers/perf: hisi: update the sccl_id/ccl_id when MT is supported
    arm64: fix ACPI dependencies
    rseq/selftests: Add support for arm64
    arm64: acpi: fix alignment fault in accessing ACPI
    efi/arm: map UEFI memory map even w/o runtime services enabled
    efi/arm: preserve early mapping of UEFI memory map longer for BGRT
    drivers: acpi: add dependency of EFI for arm64
    ...

    Linus Torvalds
     

13 Aug, 2018

2 commits

  • When there is contention on faulting in a particular page table entry
    at stage 2, the break-before-make requirement of the architecture can
    lead to additional refaulting due to TLB invalidation.

    Avoid this by skipping a page table update if the new value of the PTE
    matches the previous value.

    Cc: stable@vger.kernel.org
    Fixes: d5d8184d35c9 ("KVM: ARM: Memory virtualization setup")
    Reviewed-by: Suzuki Poulose
    Acked-by: Christoffer Dall
    Signed-off-by: Punit Agrawal
    Signed-off-by: Marc Zyngier

    Punit Agrawal
     
  • Contention on updating a PMD entry by a large number of vcpus can lead
    to duplicate work when handling stage 2 page faults. As the page table
    update follows the break-before-make requirement of the architecture,
    it can lead to repeated refaults due to clearing the entry and
    flushing the tlbs.

    This problem is more likely when -

    * there are a large number of vcpus
    * the mapping is a large block mapping

    such as when using PMD hugepages (512MB) with 64k pages.

    Fix this by skipping the page table update if there is no change in
    the entry being updated.
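
    The skip-if-unchanged idea can be illustrated with a small user-space
    model. It is only a sketch: struct stage2_table, stage2_set_entry() and
    the tlb_flushes counter are hypothetical, and a zero entry is assumed to
    mean "not present".

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define NR_ENTRIES 8

/* Hypothetical table: entries plus a counter of simulated TLB flushes. */
struct stage2_table {
    uint64_t entry[NR_ENTRIES];
    unsigned int tlb_flushes;
};

/* Install new_val at idx. Returns true if the table was modified. */
bool stage2_set_entry(struct stage2_table *t, size_t idx, uint64_t new_val)
{
    assert(idx < NR_ENTRIES);

    uint64_t old = t->entry[idx];

    /* Another vcpu already installed the same mapping: nothing to do. */
    if (old == new_val)
        return false;

    /* Break: clear a live entry and flush before changing it. */
    if (old != 0) {
        t->entry[idx] = 0;
        t->tlb_flushes++;
    }

    /* Make: install the new entry. */
    t->entry[idx] = new_val;
    return true;
}
```

    With many vcpus faulting on the same block, only the first caller pays
    for the update here; the rest return early without clearing the entry
    or flushing.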

    Cc: stable@vger.kernel.org
    Fixes: ad361f093c1e ("KVM: ARM: Support hugetlbfs backed huge pages")
    Reviewed-by: Suzuki Poulose
    Acked-by: Christoffer Dall
    Signed-off-by: Punit Agrawal
    Signed-off-by: Marc Zyngier

    Punit Agrawal
     

12 Aug, 2018

3 commits

  • kvm_vgic_sync_hwstate is only called with IRQ being disabled.
    There is thus no need to call spin_lock_irqsave/restore in
    vgic_fold_lr_state and vgic_prune_ap_list.

    This patch replaces them with the non irq-safe versions.

    Signed-off-by: Jia He
    Acked-by: Christoffer Dall
    [maz: commit message tidy-up]
    Signed-off-by: Marc Zyngier

    Jia He
     
  • DEBUG_SPINLOCK_BUG_ON can be used with both vgic-v2 and vgic-v3,
    so let's move it to vgic.h

    Signed-off-by: Jia He
    [maz: commit message tidy-up]
    Signed-off-by: Marc Zyngier

    Jia He
     
  • Although vgic-v3 now supports Group0 interrupts, it still doesn't
    deal with Group0 SGIs. As usual with the GIC, nothing is simple:

    - ICC_SGI1R can signal SGIs of both groups, since GICD_CTLR.DS==1
    with KVM (as per 8.1.10, Non-secure EL1 access)

    - ICC_SGI0R can only generate Group0 SGIs

    - ICC_ASGI1R sees its scope refocussed to generate only Group0
    SGIs (as per the note at the bottom of Table 8-14)

    We only support Group1 SGIs so far, so no material change.
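
    The three rules above reduce to a single predicate. The following is a
    sketch of that reading, not the vgic code: the enum and
    sgi_is_delivered() are hypothetical names, and irq_group is the group
    (0 or 1) the target SGI is configured with.

```c
#include <assert.h>
#include <stdbool.h>

/* The three system registers a guest can use to generate an SGI. */
enum sgi_reg { ICC_SGI1R, ICC_SGI0R, ICC_ASGI1R };

/*
 * Decide whether a write to 'reg' makes an SGI of the given group pending,
 * assuming GICD_CTLR.DS == 1 as stated in the commit message.
 */
bool sgi_is_delivered(enum sgi_reg reg, int irq_group)
{
    assert(irq_group == 0 || irq_group == 1);

    /* Only ICC_SGI1R may signal Group1 SGIs; all three reach Group0. */
    bool allow_group1 = (reg == ICC_SGI1R);

    return irq_group == 0 || allow_group1;
}
```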

    Reviewed-by: Eric Auger
    Reviewed-by: Christoffer Dall
    Signed-off-by: Marc Zyngier

    Marc Zyngier
     

06 Aug, 2018

4 commits

  • We are currently cutting hva_to_pfn_fast short if we do not want an
    immediate exit, which is represented by !async && !atomic. However,
    this is unnecessary, and __get_user_pages_fast is *much* faster
    because the regular get_user_pages takes pmd_lock/pte_lock.
    In fact, when many CPUs take a nested vmexit at the same time
    the contention on those locks is visible, and this patch removes
    about 25% (compared to 4.18) from vmexit.flat on a 16 vCPU
    nested guest.

    Suggested-by: Andrea Arcangeli
    Signed-off-by: Paolo Bonzini

    Paolo Bonzini
     
  • Provide a way for platforms to register a hv tlb remote flush callback.
    This helps to optimize tlb flushes among vcpus in the nested
    virtualization case.

    Signed-off-by: Lan Tianyu
    Signed-off-by: Paolo Bonzini

    Tianyu Lan
     
  • Use the fast CR3 switch mechanism to locklessly change the MMU root
    page when switching between L1 and L2. The switch from L2 to L1 should
    always go through the fast path, while the switch from L1 to L2 should
    go through the fast path if L1's CR3/EPTP for L2 hasn't changed
    since the last time.

    Signed-off-by: Junaid Shahid
    Signed-off-by: Paolo Bonzini

    Junaid Shahid
     
  • Pull bug fixes into the KVM development tree to avoid nasty conflicts.

    Paolo Bonzini
     

31 Jul, 2018

2 commits

  • When the VCPU is blocked (for example from WFI) we don't inject the
    physical timer interrupt if it should fire while the CPU is blocked, but
    instead we just wake up the VCPU and expect kvm_timer_vcpu_load to take
    care of injecting the interrupt.

    Unfortunately, kvm_timer_vcpu_load() doesn't actually do that, it only
    has support to schedule a soft timer if the emulated phys timer is
    expected to fire in the future.

    Follow the same pattern as kvm_timer_update_state() and update the irq
    state after potentially scheduling a soft timer.

    Reported-by: Andre Przywara
    Cc: Stable # 4.15+
    Fixes: bbdd52cfcba29 ("KVM: arm/arm64: Avoid phys timer emulation in vcpu entry/exit")
    Signed-off-by: Christoffer Dall
    Signed-off-by: Marc Zyngier

    Christoffer Dall
     
  • kvm_timer_update_state() is called when changing the phys timer
    configuration registers, either via vcpu reset, as a result of a trap
    from the guest, or when userspace programs the registers.

    phys_timer_emulate() is in turn called by kvm_timer_update_state() to
    either cancel an existing software timer, or program a new software
    timer, to emulate the behavior of a real phys timer, based on the change
    in configuration registers.

    Unfortunately, the interaction between these two functions left a small
    race; if the conceptual emulated phys timer should actually fire, but
    the soft timer hasn't executed its callback yet, we cancel the timer in
    phys_timer_emulate without injecting an irq. This only happens if the
    check in kvm_timer_update_state is called before the timer should fire,
    which is relatively unlikely, but possible.

    The solution is to update the state of the phys timer after calling
    phys_timer_emulate, which will pick up the pending timer state and
    update the interrupt value.

    Note that this leaves the opportunity of raising the interrupt twice,
    once in the just-programmed soft timer, and once in
    kvm_timer_update_state. Since this always happens synchronously with
    the VCPU execution, there is no harm in this, and the guest only ever
    sees a single timer interrupt.
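
    The ordering fix can be shown with a toy model. Everything here is an
    assumption made for illustration: struct phys_timer and the three
    functions are simplified stand-ins, time is a plain counter, and the
    soft timer is reduced to a boolean.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical emulated timer state. */
struct phys_timer {
    bool enabled;
    uint64_t cval;          /* compare value: fire when now >= cval */
    bool soft_timer_armed;  /* a software timer is scheduled */
    bool irq_level;         /* interrupt line as seen by the guest */
};

bool timer_should_fire(const struct phys_timer *t, uint64_t now)
{
    return t->enabled && t->cval <= now;
}

/* Cancel or (re)program the software timer from the current config. */
void phys_timer_emulate(struct phys_timer *t, uint64_t now)
{
    if (!t->enabled || timer_should_fire(t, now)) {
        t->soft_timer_armed = false;  /* nothing left to wait for */
        return;
    }
    t->soft_timer_armed = true;
}

/* The fix: recompute the irq level after emulation, not before it. */
void timer_update_state(struct phys_timer *t, uint64_t now)
{
    phys_timer_emulate(t, now);
    t->irq_level = timer_should_fire(t, now);

    /* A timer that has fired never leaves a soft timer behind. */
    assert(!(t->irq_level && t->soft_timer_armed));
}
```

    If the level were sampled before phys_timer_emulate() cancelled the
    soft timer, a timer expiring in between would be cancelled with the
    line still low, which is the race described above.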

    Cc: Stable # 4.15+
    Signed-off-by: Christoffer Dall
    Signed-off-by: Marc Zyngier

    Christoffer Dall
     

25 Jul, 2018

1 commit


24 Jul, 2018

1 commit

  • It's possible for userspace to control n. Sanitize n when using it as an
    array index, to inhibit the potential spectre-v1 write gadget.

    Note that while it appears that n must be bound to the interval [0,3]
    due to the way it is extracted from addr, we cannot guarantee that
    compiler transformations (and/or future refactoring) will ensure this is
    the case, and given this is a slow path it's better to always perform
    the masking.
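
    A branch-free index mask of the kind this message relies on can be
    sketched in user space as follows. The formula is modelled on the
    kernel's generic array-index masking helper; index_mask_nospec(),
    write_reg() and NR_REGS are hypothetical names for this example. The
    sketch assumes an arithmetic right shift of negative signed values
    (which GCC and Clang provide) and omits the optimizer barrier a kernel
    implementation would use.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>
#include <stdint.h>

#define NR_REGS 4

/* Returns ~0UL if index < size, 0 otherwise, without a branch. */
unsigned long index_mask_nospec(unsigned long index, unsigned long size)
{
    assert(size <= (unsigned long)LONG_MAX);

    /*
     * If index < size, neither operand of the OR has its top bit set, so
     * the complement is negative and the shift smears the sign bit into
     * an all-ones mask. Otherwise the top bit is set and the result is 0.
     */
    return (unsigned long)(~(long)(index | (size - 1UL - index)) >>
                           (sizeof(long) * CHAR_BIT - 1));
}

/* Write val to regs[n], clamping n even if the bounds check is bypassed. */
void write_reg(uint32_t regs[NR_REGS], unsigned long n, uint32_t val)
{
    if (n >= NR_REGS)
        return;

    /* Under misspeculation of the check above, n is forced to 0. */
    n &= index_mask_nospec(n, NR_REGS);
    regs[n] = val;
}
```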

    Found by smatch.

    Signed-off-by: Mark Rutland
    Cc: Christoffer Dall
    Cc: Marc Zyngier
    Cc: kvmarm@lists.cs.columbia.edu
    Signed-off-by: Marc Zyngier

    Mark Rutland
     

21 Jul, 2018

16 commits

  • Signed-off-by: "Eric W. Biederman"

    Eric W. Biederman
     
  • arm64's new use of KVM's get_events/set_events API calls isn't just
    for RAS, it allows an SError that has been made pending by KVM as
    part of its device emulation to be migrated.

    Wire this up for 32bit too.

    We only need to read/write the HCR_VA bit, and check that no esr has
    been provided, as we don't yet support VDFSR.

    Signed-off-by: James Morse
    Reviewed-by: Dongjiu Geng
    Signed-off-by: Marc Zyngier

    James Morse
     
  • The get/set events helpers do some work to check that reserved
    and padding fields are zero. This is useful on 32bit too.

    Move this code into virt/kvm/arm/arm.c, and give the arch
    code some underscores.

    This is temporarily hidden behind __KVM_HAVE_VCPU_EVENTS until
    32bit is wired up.
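
    A reserved-and-padding check of this kind might look like the sketch
    below. The struct layout is a hypothetical one loosely modelled on a
    vcpu events structure, and check_events() is an illustrative name, not
    the helper the patch moves.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical events structure with padding and reserved space. */
struct vcpu_events {
    uint8_t serror_pending;
    uint8_t serror_has_esr;
    uint8_t pad[6];
    uint64_t serror_esr;
    uint32_t reserved[12];
};

/* Reject input whose padding or reserved fields are not all zero. */
int check_events(const struct vcpu_events *ev)
{
    assert(ev != NULL);

    for (size_t i = 0; i < sizeof(ev->pad) / sizeof(ev->pad[0]); i++)
        if (ev->pad[i])
            return -EINVAL;

    for (size_t i = 0; i < sizeof(ev->reserved) / sizeof(ev->reserved[0]); i++)
        if (ev->reserved[i])
            return -EINVAL;

    return 0;
}
```

    Rejecting non-zero values up front is what keeps those fields available
    for later extensions of the interface.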

    Signed-off-by: James Morse
    Reviewed-by: Dongjiu Geng
    Signed-off-by: Marc Zyngier

    James Morse
     
  • When migrating VMs, user space may need to know the exception state.
    For example, if KVM makes an SError pending on machine A, then after
    migration to machine B, KVM also needs to make an SError pending.

    This new IOCTL exports user-invisible states related to SError.
    Together with appropriate user space changes, user space can get/set
    the SError exception state to do migrate/snapshot/suspend.

    Signed-off-by: Dongjiu Geng
    Reviewed-by: James Morse
    [expanded documentation wording]
    Signed-off-by: James Morse
    Signed-off-by: Marc Zyngier

    Dongjiu Geng
     
  • Simply letting IGROUPR be writable from userspace would break
    migration from old kernels to newer kernels, because old kernels
    incorrectly report interrupt groups as group 1. This would not be a big
    problem if userspace wrote GICD_IIDR as read from the kernel, because we
    could detect the incompatibility and return an error to userspace.
    Unfortunately, this is not the case with current userspace
    implementations and simply letting IGROUPR be writable from userspace for
    an emulated GICv2 silently breaks migration and causes the destination
    VM to no longer run after migration.

    We now encourage userspace to write the read and expected value of
    GICD_IIDR as the first part of a GIC register restore, and if we observe
    a write to GICD_IIDR we know that userspace has been updated and has had
    a chance to cope with older kernels (VGICv2 IIDR.Revision == 0)
    incorrectly reporting interrupts as group 1, and therefore we now allow
    groups to be user writable.

    Reviewed-by: Andrew Jones
    Signed-off-by: Christoffer Dall
    Signed-off-by: Marc Zyngier

    Christoffer Dall
     
  • Implement the required MMIO accessors for GICv2 and GICv3 for the
    IGROUPR distributor and redistributor registers.

    This can allow guests to change behavior compared to running on previous
    versions of KVM, but only to align with the architecture and hardware
    implementations.

    This also allows userspace to configure the interrupt groups for GICv3.
    We don't allow userspace to write the groups on GICv2 just yet, because
    that would result in GICv2 guests not receiving interrupts after
    migrating from an older kernel that exposes GICv2 interrupts as group 1.

    Reviewed-by: Andrew Jones
    Signed-off-by: Christoffer Dall
    Signed-off-by: Marc Zyngier

    Christoffer Dall
     
  • If userspace attempts to write a GICD_IIDR that does not match the
    kernel version, return an error to userspace. The intention is to allow
    implementation changes inside KVM while avoiding silently breaking
    migration resulting in guests not running without any clear indication
    of what went wrong.
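
    The behaviour described here amounts to a write handler that can fail
    for userspace but never for the guest. A minimal sketch, assuming a
    hypothetical struct vgic_dist_model and dist_write_iidr() rather than
    the real vgic interfaces:

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical distributor state: just the IIDR value KVM reports. */
struct vgic_dist_model {
    uint32_t iidr;
};

/*
 * Handle a write to the (read-only) GICD_IIDR register.
 * Guest writes are silently ignored; userspace writes must match the
 * value this kernel reports, otherwise the restore is refused.
 */
int dist_write_iidr(const struct vgic_dist_model *d, uint32_t val,
                    bool from_userspace)
{
    assert(d != NULL);

    if (!from_userspace)
        return 0;           /* MMIO traps from the guest never fail */

    if (val != d->iidr)
        return -EINVAL;     /* saved state came from an incompatible kernel */

    return 0;
}
```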

    Reviewed-by: Andrew Jones
    Signed-off-by: Christoffer Dall
    Signed-off-by: Marc Zyngier

    Christoffer Dall
     
  • Currently we do not allow any vgic mmio write operations to fail, which
    makes sense for mmio traps from the guest. However, we should be able
    to report failures to userspace, if userspace writes incompatible values
    to read-only registers. Rework the internal interface to allow errors
    to be returned on the write side for userspace writes.

    Reviewed-by: Andrew Jones
    Signed-off-by: Christoffer Dall
    Signed-off-by: Marc Zyngier

    Christoffer Dall
     
  • Now that we have a group configuration on the struct IRQ, use this state
    when populating the LR and signaling interrupts as either group 0 or
    group 1 to the VM. Depending on the model of the emulated GIC, and the
    guest's configuration of the VMCR, interrupts may be signaled as IRQs or
    FIQs to the VM.

    Reviewed-by: Andrew Jones
    Signed-off-by: Christoffer Dall
    Signed-off-by: Marc Zyngier

    Christoffer Dall
     
  • In preparation for proper group 0 and group 1 support in the vgic, we
    add a field in the struct irq to store the group of all interrupts.

    We initialize the group to group 0 when emulating GICv2 and to group 1
    when emulating GICv3, just like we treat them today. LPIs are always
    group 1. We also continue to ignore writes from the guest, preserving
    existing functionality, for now.

    Finally, we also add this field to the vgic debug logic to show the
    group for all interrupts.

    Reviewed-by: Andrew Jones
    Signed-off-by: Christoffer Dall
    Signed-off-by: Marc Zyngier

    Christoffer Dall
     
  • We currently don't support grouping in the emulated VGIC, which is a
    known defect on KVM (not hurting any currently used guests as far as
    we're aware). This is currently handled by treating all interrupts as
    group 0 interrupts for an emulated GICv2 and always signaling interrupts
    as group 0 to the virtual CPU interface.

    However, when reading which group interrupts belong to in the guest
    from the emulated VGIC, the VGIC currently reports group 1 instead of
    group 0, which is misleading. Fix this temporarily before introducing
    full group support by changing the handler to _raz instead of _rao.

    Fixes: fb848db39661a "KVM: arm/arm64: vgic-new: Add GICv2 MMIO handling framework"
    Reviewed-by: Andrew Jones
    Signed-off-by: Christoffer Dall
    Signed-off-by: Marc Zyngier

    Christoffer Dall
     
  • As we are about to tweak implementation aspects of the VGIC emulation,
    while still preserving some level of backwards compatibility support,
    add a field to keep track of the implementation revision field which is
    reported to the VM and to userspace.

    Reviewed-by: Andrew Jones
    Signed-off-by: Christoffer Dall
    Signed-off-by: Marc Zyngier

    Christoffer Dall
     
  • Instead of hardcoding the shifts and masks in the GICD_IIDR register
    emulation, let's add the definition of these fields to the GIC header
    files and use them.

    This will make things more obvious when we bump the revision in the
    IIDR as we make guest-visible changes to the implementation.
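
    Named shifts and masks for this register could look like the sketch
    below. The field positions follow the architectural GICD_IIDR layout
    (Implementer in bits [11:0], Revision in [15:12], Variant in [19:16],
    ProductID in [31:24]); the macro and function names are hypothetical
    and need not match the ones added to the GIC headers.

```c
#include <assert.h>
#include <stdint.h>

#define IIDR_IMPLEMENTER_SHIFT  0
#define IIDR_IMPLEMENTER_MASK   (0xfffU << IIDR_IMPLEMENTER_SHIFT)
#define IIDR_REVISION_SHIFT     12
#define IIDR_REVISION_MASK      (0xfU << IIDR_REVISION_SHIFT)
#define IIDR_VARIANT_SHIFT      16
#define IIDR_VARIANT_MASK       (0xfU << IIDR_VARIANT_SHIFT)
#define IIDR_PRODUCT_ID_SHIFT   24
#define IIDR_PRODUCT_ID_MASK    (0xffU << IIDR_PRODUCT_ID_SHIFT)

/* Compose an IIDR value from its fields (Variant left as zero). */
uint32_t iidr_make(uint32_t product_id, uint32_t revision,
                   uint32_t implementer)
{
    assert(product_id <= 0xff && revision <= 0xf && implementer <= 0xfff);

    return (product_id << IIDR_PRODUCT_ID_SHIFT) |
           (revision << IIDR_REVISION_SHIFT) |
           (implementer << IIDR_IMPLEMENTER_SHIFT);
}

/* Extract the implementation revision from an IIDR value. */
uint32_t iidr_revision(uint32_t iidr)
{
    return (iidr & IIDR_REVISION_MASK) >> IIDR_REVISION_SHIFT;
}
```

    With the fields named, bumping the revision becomes a one-argument
    change at the call site instead of an edit to a magic constant.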

    Reviewed-by: Andrew Jones
    Signed-off-by: Christoffer Dall
    Signed-off-by: Marc Zyngier

    Christoffer Dall
     
  • The vgic debugfs file only knows about SGI/PPI/SPI interrupts, and
    completely ignores LPIs. Let's fix that.

    Signed-off-by: Marc Zyngier

    Marc Zyngier
     
  • In the quest to remove all stack VLA usage from the kernel[1], this
    switches to using a maximum size and adds sanity checks. Additionally
    cleans up some of the int-vs-u32 usage and adds additional bounds checking.
    As it currently stands, this will always be 8 bytes until the ABI changes.

    [1] https://lkml.kernel.org/r/CA+55aFzCG-zNmZwX4A2FQpadafLfEzK6CC=qPXydAacU1RqZWA@mail.gmail.com
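
    Replacing a stack VLA with a fixed maximum plus sanity checks follows a
    common pattern, sketched here in user-space C. MAX_ENTRY_SIZE and
    read_table_entry() are hypothetical; the bound of 8 is taken from the
    statement above that the size is currently always 8 bytes.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define MAX_ENTRY_SIZE 8   /* hypothetical upper bound on the entry size */

/*
 * Read entry 'index' of size 'esz' bytes from 'table' into *out.
 * A fixed-size buffer replaces what would otherwise be 'uint8_t entry[esz]'.
 */
int read_table_entry(const uint8_t *table, size_t table_len, size_t index,
                     size_t esz, uint64_t *out)
{
    uint8_t entry[MAX_ENTRY_SIZE] = { 0 };

    assert(table != NULL && out != NULL);

    /* Sanity check the size before it is used as a copy length. */
    if (esz == 0 || esz > MAX_ENTRY_SIZE)
        return -EINVAL;

    /* Bounds check the index against the number of whole entries. */
    if (index >= table_len / esz)
        return -ERANGE;

    memcpy(entry, table + index * esz, esz);
    memcpy(out, entry, sizeof(*out));   /* value is in host byte order */
    return 0;
}
```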

    Cc: Christoffer Dall
    Cc: Eric Auger
    Cc: Andre Przywara
    Cc: linux-arm-kernel@lists.infradead.org
    Cc: kvmarm@lists.cs.columbia.edu
    Signed-off-by: Kees Cook
    [maz: dropped WARN_ONs]
    Signed-off-by: Marc Zyngier

    Kees Cook
     
  • The vgic_init function can race with kvm_arch_vcpu_create() which does
    not hold kvm_lock() and we therefore have no synchronization primitives
    to ensure we're doing the right thing.

    As the user is trying to initialize or run the VM while at the same time
    creating more VCPUs, we just have to refuse to initialize the VGIC in
    this case rather than silently failing with a broken VCPU.
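
    One way to refuse initialization while VCPUs are still being created is
    to compare a "created" count with an "online" count. The sketch below
    assumes that detection method; struct vm_model, its fields and
    vgic_init_model() are hypothetical names for illustration.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical VM state relevant to the race. */
struct vm_model {
    int created_vcpus;      /* VCPUs whose creation has started */
    int online_vcpus;       /* VCPUs whose creation has completed */
    bool vgic_initialized;
};

/* Initialize the VGIC unless a VCPU creation is still in flight. */
int vgic_init_model(struct vm_model *vm)
{
    assert(vm->online_vcpus <= vm->created_vcpus);

    if (vm->vgic_initialized)
        return 0;

    /* A VCPU is mid-creation: refuse rather than set up a broken VCPU. */
    if (vm->created_vcpus != vm->online_vcpus)
        return -EBUSY;

    vm->vgic_initialized = true;
    return 0;
}
```

    Returning an error puts the failure in front of the caller, who can
    retry once VCPU creation has finished.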

    Reviewed-by: Eric Auger
    Signed-off-by: Christoffer Dall
    Signed-off-by: Marc Zyngier

    Christoffer Dall
     

19 Jul, 2018

1 commit

  • Pull kvm fixes from Paolo Bonzini:
    "Miscellaneous bugfixes, plus a small patchlet related to Spectre v2"

    * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
    kvmclock: fix TSC calibration for nested guests
    KVM: VMX: Mark VMXArea with revision_id of physical CPU even when eVMCS enabled
    KVM: irqfd: fix race between EPOLLHUP and irq_bypass_register_consumer
    KVM/Eventfd: Avoid crash when assign and deassign specific eventfd in parallel.
    x86/kvmclock: set pvti_cpu0_va after enabling kvmclock
    x86/kvm/Kconfig: Ensure CRYPTO_DEV_CCP_DD state at minimum matches KVM_AMD
    kvm: nVMX: Restore exit qual for VM-entry failure due to MSR loading
    x86/kvm/vmx: don't read current->thread.{fs,gs}base of legacy tasks
    KVM: VMX: support MSR_IA32_ARCH_CAPABILITIES as a feature MSR

    Linus Torvalds
     

18 Jul, 2018

2 commits

  • A comment warning against this bug is there, but the code is not doing what
    the comment says. Therefore it is possible that an EPOLLHUP races against
    irq_bypass_register_consumer. The EPOLLHUP handler schedules irqfd_shutdown,
    and if that runs soon enough, you get a use-after-free.

    Reported-by: syzbot
    Cc: stable@vger.kernel.org
    Signed-off-by: Paolo Bonzini
    Reviewed-by: David Hildenbrand

    Paolo Bonzini
     
  • Syzbot reports crashes in kvm_irqfd_assign(), caused by use-after-free
    when kvm_irqfd_assign() and kvm_irqfd_deassign() run in parallel
    for one specific eventfd. When the assign path hasn't finished but irqfd
    has been added to kvm->irqfds.items list, another thread may deassign the
    eventfd and free struct kvm_kernel_irqfd(). The assign path then uses
    the struct kvm_kernel_irqfd that has been freed by the deassign path. To
    avoid this issue, keep irqfd under kvm->irq_srcu protection after the irqfd
    has been added to kvm->irqfds.items list, and call synchronize_srcu()
    in irq_shutdown() to make sure that irqfd has been fully initialized in
    the assign path.

    Reported-by: Dmitry Vyukov
    Cc: Paolo Bonzini
    Cc: Radim Krčmář
    Cc: Dmitry Vyukov
    Signed-off-by: Tianyu Lan
    Cc: stable@vger.kernel.org
    Signed-off-by: Paolo Bonzini

    Lan Tianyu
     

13 Jul, 2018

1 commit