25 Sep, 2015

1 commit

  • We observed some performance degradation on s390x with dynamic
    halt polling. Until we can provide a proper fix, let's enable
    halt_poll_ns as default only for supported architectures.

    Architectures are now free to set their own halt_poll_ns
    default value.

    Signed-off-by: David Hildenbrand
    Signed-off-by: Paolo Bonzini

    David Hildenbrand
     

17 Sep, 2015

2 commits

  • …it/kvmarm/kvmarm into kvm-master

    Second set of KVM/ARM changes for 4.3-rc2

    - Workaround for a Cortex-A57 erratum
    - Bug fix for the debugging infrastructure
    - Fix for 32bit guests with more than 4GB of address space
    on a 32bit host
    - A number of fixes for the (unusual) case when we don't use
    the in-kernel GIC emulation
    - Removal of ThumbEE handling on arm64, since these have been
    dropped from the architecture before anyone actually ever
    built a CPU
    - Remove the KVM_ARM_MAX_VCPUS limitation which has become
    fairly pointless

    Paolo Bonzini
     
  • This patch removes the KVM_ARM_MAX_VCPUS config option and,
    like other ARCHs, just chooses the maximum value allowed by
    the hardware, for the following reasons:

    1) from a distribution's point of view, the option has to be
    defined as the maximum allowed value because it needs to
    meet all kinds of virtualization applications and
    needs to support most SoCs;

    2) using a bigger value doesn't introduce extra memory
    consumption, and the help text in Kconfig isn't accurate
    because the kvm_vcpu structure isn't allocated until a
    VCPU creation request is sent from QEMU;

    3) the main effect is that the vcpus[] field in 'struct kvm'
    becomes a bit bigger (sizeof(void *) per vcpu) and needs more cache
    lines to hold the structure, but 'struct kvm' is one generic struct,
    and it has already worked well this way on other ARCHs. Also,
    the world switch frequency is often low, for example, it is ~2000
    when running a kernel building load in a VM on an APM xgene KVM host,
    so the effect is very small, and the difference can't be observed
    in my test at all.

    Cc: Dann Frazier
    Signed-off-by: Ming Lei
    Reviewed-by: Christoffer Dall
    Signed-off-by: Marc Zyngier

    Ming Lei
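
As a rough illustration of point 3 in the commit above, the growth of the vcpus[] pointer array can be estimated with a few lines of arithmetic. This is only a sketch: the function name, the 8-byte pointer size, the 64-byte cache line and the example limits (4 and 255) are assumptions chosen for illustration and are not taken from the commit.

```python
import math


def extra_vcpus_array_cost(old_max, new_max, pointer_bytes=8, cache_line_bytes=64):
    """Estimate the growth of a per-VM array holding one pointer per VCPU.

    All parameters are hypothetical; the commit only states that the cost
    is sizeof(void *) per vcpu.
    """
    extra_bytes = (new_max - old_max) * pointer_bytes
    extra_lines = math.ceil(extra_bytes / cache_line_bytes)
    return extra_bytes, extra_lines


# Hypothetical limits: raising the maximum from 4 to 255 VCPUs.
extra_bytes, extra_lines = extra_vcpus_array_cost(4, 255)
print(f"{extra_bytes} extra bytes, about {extra_lines} extra cache lines")
```

The VCPU structures themselves are not part of this estimate, which matches point 2: they are only allocated when a VCPU is actually created.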
     

16 Sep, 2015

1 commit

  • This new statistic can help diagnosing VCPUs that, for any reason,
    trigger bad behavior of halt_poll_ns autotuning.

    For example, say halt_poll_ns = 480000, and wakeups are spaced exactly
    like 479us, 481us, 479us, 481us. Then KVM always fails polling and wastes
    10+20+40+80+160+320+480 = 1110 microseconds out of every
    479+481+479+481+479+481+479 = 3359 microseconds. The VCPU then
    is consuming about 30% more CPU than it would use without
    polling. This would show as an abnormally high number of
    attempted polls compared to the successful polls.

    Acked-by: Christian Borntraeger
    Reviewed-by: David Matlack
    Signed-off-by: Paolo Bonzini

    Paolo Bonzini
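
The arithmetic in the example above can be checked with a short sketch. The list names and the helper `wasted_fraction` are illustrative and are not kernel identifiers; the numbers are the ones quoted in the commit message.

```python
# Poll windows (in microseconds) that the example says are spent on
# failed polls, and the wakeup gaps observed over the same period.
# Names are illustrative; values are copied from the commit message.
POLL_WINDOWS_US = [10, 20, 40, 80, 160, 320, 480]
WAKEUP_GAPS_US = [479, 481, 479, 481, 479, 481, 479]


def wasted_fraction(poll_windows_us, wakeup_gaps_us):
    """Return time spent in failed polls as a fraction of elapsed time."""
    return sum(poll_windows_us) / sum(wakeup_gaps_us)


print(sum(POLL_WINDOWS_US), sum(WAKEUP_GAPS_US))
print(f"{wasted_fraction(POLL_WINDOWS_US, WAKEUP_GAPS_US):.1%}")
```

The ratio comes out at roughly a third, which is the "about 30%" overhead mentioned in the message.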
     

15 Sep, 2015

5 commits

  • Currently, if a zero-length mmio eventfd is assigned on
    KVM_MMIO_BUS, it will never be found by kvm_io_bus_cmp(), since that
    function always compares the kvm_io_range with the length that the
    guest wrote. As a result, e.g. for vhost, the kick is trapped by qemu
    userspace instead of vhost. Fix this by using zero length for the
    comparison if an iodevice is zero length.

    Cc: stable@vger.kernel.org
    Cc: Gleb Natapov
    Cc: Paolo Bonzini
    Signed-off-by: Jason Wang
    Reviewed-by: Cornelia Huck
    Signed-off-by: Paolo Bonzini

    Jason Wang
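
The lookup failure described in the commit above can be modeled roughly as follows. This is a simplified Python model, not the kernel code: `Range`, `cmp_before` and `cmp_after` are hypothetical names, and the comparison logic is an assumption reconstructed from the description (a zero-length device should match a write of any length at its address).

```python
from collections import namedtuple

# A bus range: start address and length in bytes (hypothetical model type).
Range = namedtuple("Range", ["addr", "length"])


def cmp_before(key, dev):
    """Assumed old behaviour: always take both lengths into account."""
    if key.addr < dev.addr:
        return -1
    if key.addr + key.length > dev.addr + dev.length:
        return 1
    return 0


def cmp_after(key, dev):
    """Assumed new behaviour: ignore lengths for a zero-length device."""
    addr1, addr2 = key.addr, dev.addr
    if addr1 < addr2:
        return -1
    if dev.length:
        # Only a device with a real length restricts the access size.
        addr1 += key.length
        addr2 += dev.length
    if addr1 > addr2:
        return 1
    return 0


wildcard = Range(addr=0x1000, length=0)     # zero-length eventfd
guest_write = Range(addr=0x1000, length=4)  # 4-byte guest write
print(cmp_before(guest_write, wildcard), cmp_after(guest_write, wildcard))
```

In this model a result of 0 means "match". With the old comparison the 4-byte write never matches the zero-length registration, so the access would fall through to userspace.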
     
  • We register a wildcard mmio eventfd on two buses, once on KVM_MMIO_BUS
    and once on KVM_FAST_MMIO_BUS, but with a single iodev
    instance. This leads to an issue: kvm_io_bus_destroy() knows
    nothing about the devices on two buses pointing to a single dev, which
    leads to a double free[1] during exit. Fix this by allocating two
    iodev instances, then registering one on KVM_MMIO_BUS and the other
    on KVM_FAST_MMIO_BUS.

    CPU: 1 PID: 2894 Comm: qemu-system-x86 Not tainted 3.19.0-26-generic #28-Ubuntu
    Hardware name: LENOVO 2356BG6/2356BG6, BIOS G7ET96WW (2.56 ) 09/12/2013
    task: ffff88009ae0c4b0 ti: ffff88020e7f0000 task.ti: ffff88020e7f0000
    RIP: 0010:[] [] ioeventfd_release+0x28/0x60 [kvm]
    RSP: 0018:ffff88020e7f3bc8 EFLAGS: 00010292
    RAX: dead000000200200 RBX: ffff8801ec19c900 RCX: 000000018200016d
    RDX: ffff8801ec19cf80 RSI: ffffea0008bf1d40 RDI: ffff8801ec19c900
    RBP: ffff88020e7f3bd8 R08: 000000002fc75a01 R09: 000000018200016d
    R10: ffffffffc07df6ae R11: ffff88022fc75a98 R12: ffff88021e7cc000
    R13: ffff88021e7cca48 R14: ffff88021e7cca50 R15: ffff8801ec19c880
    FS: 00007fc1ee3e6700(0000) GS:ffff88023e240000(0000) knlGS:0000000000000000
    CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: 00007f8f389d8000 CR3: 000000023dc13000 CR4: 00000000001427e0
    Stack:
    ffff88021e7cc000 0000000000000000 ffff88020e7f3be8 ffffffffc07e2622
    ffff88020e7f3c38 ffffffffc07df69a ffff880232524160 ffff88020e792d80
    0000000000000000 ffff880219b78c00 0000000000000008 ffff8802321686a8
    Call Trace:
    [] ioeventfd_destructor+0x12/0x20 [kvm]
    [] kvm_put_kvm+0xca/0x210 [kvm]
    [] kvm_vcpu_release+0x18/0x20 [kvm]
    [] __fput+0xe7/0x250
    [] ____fput+0xe/0x10
    [] task_work_run+0xd4/0xf0
    [] do_exit+0x368/0xa50
    [] ? recalc_sigpending+0x1f/0x60
    [] do_group_exit+0x45/0xb0
    [] get_signal+0x291/0x750
    [] do_signal+0x28/0xab0
    [] ? do_futex+0xdb/0x5d0
    [] ? __wake_up_locked_key+0x18/0x20
    [] ? SyS_futex+0x76/0x170
    [] do_notify_resume+0x69/0xb0
    [] int_signal+0x12/0x17
    Code: 5d c3 90 0f 1f 44 00 00 55 48 89 e5 53 48 89 fb 48 83 ec 08 48 8b 7f 20 e8 06 d6 a5 c0 48 8b 43 08 48 8b 13 48 89 df 48 89 42 08 89 10 48 b8 00 01 10 00 00
    RIP [] ioeventfd_release+0x28/0x60 [kvm]
    RSP

    Cc: stable@vger.kernel.org
    Cc: Gleb Natapov
    Cc: Paolo Bonzini
    Signed-off-by: Jason Wang
    Reviewed-by: Cornelia Huck
    Signed-off-by: Paolo Bonzini

    Jason Wang
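
The double free described in the commit above can be illustrated with a toy model. Everything here (`IoDev`, `Bus`, `build_buses`, `teardown_is_safe`) is hypothetical and only mimics the ownership problem: each bus releases every device it holds, so one instance registered on two buses is released twice.

```python
class IoDev:
    """Toy stand-in for an I/O device whose release must happen once."""

    def __init__(self, name):
        self.name = name
        self.released = False

    def release(self):
        if self.released:
            raise RuntimeError(f"double free of {self.name}")
        self.released = True


class Bus:
    """Toy bus that releases every registered device on destroy."""

    def __init__(self):
        self.devs = []

    def register(self, dev):
        self.devs.append(dev)

    def destroy(self):
        for dev in self.devs:
            dev.release()
        self.devs.clear()


def build_buses(shared):
    """Register a wildcard device on two buses, shared or duplicated."""
    mmio_bus, fast_mmio_bus = Bus(), Bus()
    first = IoDev("wildcard-eventfd")
    second = first if shared else IoDev("wildcard-eventfd-fast")
    mmio_bus.register(first)
    fast_mmio_bus.register(second)
    return [mmio_bus, fast_mmio_bus]


def teardown_is_safe(buses):
    """Destroy all buses and report whether any device was freed twice."""
    try:
        for bus in buses:
            bus.destroy()
    except RuntimeError:
        return False
    return True
```

Calling `teardown_is_safe(build_buses(shared=True))` hits the double release in this model, while `shared=False` tears down cleanly, which is the shape of the fix: one instance per bus.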
     
  • This patch factors out core eventfd assign/deassign logic and leaves
    the argument checking and bus index selection to callers.

    Cc: stable@vger.kernel.org
    Cc: Gleb Natapov
    Cc: Paolo Bonzini
    Signed-off-by: Jason Wang
    Reviewed-by: Cornelia Huck
    Signed-off-by: Paolo Bonzini

    Jason Wang
     
  • We only want zero-length mmio eventfds to be registered on
    KVM_FAST_MMIO_BUS, so check for this explicitly when arg->len is
    zero.

    Cc: stable@vger.kernel.org
    Cc: Gleb Natapov
    Cc: Paolo Bonzini
    Signed-off-by: Jason Wang
    Reviewed-by: Cornelia Huck
    Signed-off-by: Paolo Bonzini

    Jason Wang
     
  • After 'commit 0b8ba4a2b658 ("KVM: fix checkpatch.pl errors in
    kvm/coalesced_mmio.h")', the declarations of the two functions exceed 80
    characters.

    This patch reduces the TABs so that each line fits in 80 characters.

    Signed-off-by: Wei Yang
    Signed-off-by: Paolo Bonzini

    Wei Yang
     

14 Sep, 2015

2 commits


11 Sep, 2015

2 commits

  • Merge third patch-bomb from Andrew Morton:

    - even more of the rest of MM

    - lib/ updates

    - checkpatch updates

    - small changes to a few scruffy filesystems

    - kmod fixes/cleanups

    - kexec updates

    - a dma-mapping cleanup series from hch

    * emailed patches from Andrew Morton : (81 commits)
    dma-mapping: consolidate dma_set_mask
    dma-mapping: consolidate dma_supported
    dma-mapping: cosolidate dma_mapping_error
    dma-mapping: consolidate dma_{alloc,free}_noncoherent
    dma-mapping: consolidate dma_{alloc,free}_{attrs,coherent}
    mm: use vma_is_anonymous() in create_huge_pmd() and wp_huge_pmd()
    mm: make sure all file VMAs have ->vm_ops set
    mm, mpx: add "vm_flags_t vm_flags" arg to do_mmap_pgoff()
    mm: mark most vm_operations_struct const
    namei: fix warning while make xmldocs caused by namei.c
    ipc: convert invalid scenarios to use WARN_ON
    zlib_deflate/deftree: remove bi_reverse()
    lib/decompress_unlzma: Do a NULL check for pointer
    lib/decompressors: use real out buf size for gunzip with kernel
    fs/affs: make root lookup from blkdev logical size
    sysctl: fix int -> unsigned long assignments in INT_MIN case
    kexec: export KERNEL_IMAGE_SIZE to vmcoreinfo
    kexec: align crash_notes allocation to make it be inside one physical page
    kexec: remove unnecessary test in kimage_alloc_crash_control_pages()
    kexec: split kexec_load syscall from kexec core code
    ...

    Linus Torvalds
     
  • In the scope of the idle memory tracking feature, which is introduced by
    the following patch, we need to clear the referenced/accessed bit not only
    in primary, but also in secondary ptes. The latter is required in order
    to estimate wss of KVM VMs. At the same time we want to avoid flushing
    tlb, because it is quite expensive and it won't really affect the final
    result.

    Currently, there is no function for clearing pte young bit that would meet
    our requirements, so this patch introduces one. To achieve that we have
    to add a new mmu-notifier callback, clear_young, since there is no method
    for testing-and-clearing a secondary pte w/o flushing tlb. The new method
    is not mandatory and currently only implemented by KVM.

    Signed-off-by: Vladimir Davydov
    Reviewed-by: Andres Lagar-Cavilla
    Acked-by: Paolo Bonzini
    Cc: Minchan Kim
    Cc: Raghavendra K T
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Greg Thelen
    Cc: Michel Lespinasse
    Cc: David Rientjes
    Cc: Pavel Emelyanov
    Cc: Cyrill Gorcunov
    Cc: Jonathan Corbet
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     

08 Sep, 2015

1 commit


06 Sep, 2015

3 commits

  • Tracepoint for dynamic halt_poll_ns, fired on every potential change.

    Signed-off-by: Wanpeng Li
    Signed-off-by: Paolo Bonzini

    Wanpeng Li
     
  • There is a downside to always-poll: polling still happens for idle
    vCPUs, which wastes CPU time. This patchset adds the ability to adjust
    halt_poll_ns dynamically, growing halt_poll_ns when a short halt is
    detected and shrinking it when a long halt is detected.

    There are two new kernel parameters for changing the halt_poll_ns:
    halt_poll_ns_grow and halt_poll_ns_shrink.

                             no-poll    always-poll    dynamic-poll
    -----------------------------------------------------------------------
    Idle (nohz) vCPU %c0     0.15%      0.3%           0.2%
    Idle (250HZ) vCPU %c0    1.1%       4.6%~14%       1.2%
    TCP_RR latency           34us       27us           26.7us

    "Idle (X) vCPU %c0" is the percent of time the physical cpu spent in
    c0 over 60 seconds (each vCPU is pinned to a pCPU). (nohz) means the
    guest was tickless. (250HZ) means the guest was ticking at 250HZ.

    The big win is with ticking operating systems. Running the linux guest
    with nohz=off (and HZ=250), we save 3.4%~12.8% CPUs/second and get close
    to no-polling overhead levels by using the dynamic-poll. The savings
    should be even higher for higher frequency ticks.

    Suggested-by: David Matlack
    Signed-off-by: Wanpeng Li
    [Simplify the patch. - Paolo]
    Signed-off-by: Paolo Bonzini

    Wanpeng Li
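
The grow/shrink behaviour described in the commit above can be sketched as a small function. This is a hedged model, not the kernel implementation: the function name `adjust_poll_ns`, the default factors, the 10 microsecond starting window and the cap at the module-wide maximum are all assumptions made for illustration. Only the parameter names halt_poll_ns_grow and halt_poll_ns_shrink come from the commit message.

```python
def adjust_poll_ns(poll_ns, block_ns, max_poll_ns,
                   grow=2, shrink=0, start_ns=10_000):
    """Return the next per-VCPU poll window after one halt.

    poll_ns     -- current poll window of this VCPU
    block_ns    -- how long the VCPU actually stayed halted
    max_poll_ns -- upper bound, standing in for the halt_poll_ns parameter
    grow/shrink -- stand-ins for halt_poll_ns_grow / halt_poll_ns_shrink
    All defaults are assumptions, not values taken from the commit.
    """
    if block_ns <= poll_ns:
        # The wakeup arrived inside the poll window: keep it as is.
        return poll_ns
    if poll_ns and block_ns > max_poll_ns:
        # Long halt: polling was wasted, so shrink (0 means reset to 0).
        return poll_ns // shrink if shrink else 0
    if poll_ns < max_poll_ns and block_ns < max_poll_ns:
        # Short halt that the window missed: grow, starting from start_ns.
        grown = start_ns if poll_ns == 0 else poll_ns * grow
        return min(grown, max_poll_ns)
    return poll_ns


# A VCPU that keeps halting for 400us grows its window step by step.
window = 0
for _ in range(8):
    window = adjust_poll_ns(window, block_ns=400_000, max_poll_ns=480_000)
print(window)
```

An idle VCPU, whose halts are much longer than the maximum, ends up with a window of 0 under these assumptions and stops polling, which is how the dynamic scheme approaches the no-poll numbers in the table.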
     
  • Change halt_poll_ns into per-VCPU variable, seeded from module parameter,
    to allow greater flexibility.

    Signed-off-by: Wanpeng Li
    Signed-off-by: Paolo Bonzini

    Wanpeng Li
     

04 Sep, 2015

2 commits

  • Provide a better quality of implementation and be architecture compliant
    on ARMv7 for the architected timer by resetting the CNTV_CTL to 0 on
    reset of the timer.

    This change alone fixes the UEFI reset issue reported by Laszlo back in
    February.

    Cc: Laszlo Ersek
    Cc: Ard Biesheuvel
    Cc: Drew Jones
    Cc: Wei Huang
    Cc: Peter Maydell
    Reviewed-by: Marc Zyngier
    Signed-off-by: Christoffer Dall
    Signed-off-by: Marc Zyngier

    Christoffer Dall
     
  • We currently set the physical active state only when we *inject* a new
    pending virtual interrupt, but this is actually not correct, because we
    could have been preempted and run something else on the system that
    resets the active state to clear. This causes us to run the VM with the
    timer set to fire, but without setting the physical active state.

    The solution is to always check the LR configurations, and if we have a
    mapped interrupt in the LR in either the pending or active state
    (virtual), then set the physical active state.

    Acked-by: Marc Zyngier
    Signed-off-by: Christoffer Dall
    Signed-off-by: Marc Zyngier

    Christoffer Dall
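
The rule stated in the commit above (check every LR, and set the physical active state if a mapped interrupt is pending or active) can be written as a one-line predicate. The `ListRegister` type and its three flags are a hypothetical simplification of the real LR contents, used only to show the condition.

```python
from dataclasses import dataclass


@dataclass
class ListRegister:
    """Hypothetical, simplified view of one list register (LR)."""
    mapped: bool   # virtual interrupt is mapped to a physical one
    pending: bool  # virtual pending state
    active: bool   # virtual active state


def physical_active_needed(list_registers):
    """True if any mapped interrupt is virtually pending or active."""
    return any(lr.mapped and (lr.pending or lr.active)
               for lr in list_registers)
```

The point of evaluating this on every entry to the guest, rather than only at injection time, is that the result no longer depends on whether something else cleared the physical state in between.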
     

23 Aug, 2015

1 commit


12 Aug, 2015

7 commits


30 Jul, 2015

1 commit


29 Jul, 2015

1 commit


10 Jul, 2015

1 commit

  • If there are no assigned devices, the guest PAT is not providing
    any useful information and can be overridden to writeback; VMX
    always does this because it has the "IPAT" bit in its extended
    page table entries, but SVM does not have anything similar.
    Hook into VFIO and legacy device assignment so that they
    provide this information to KVM.

    Reviewed-by: Alex Williamson
    Tested-by: Joerg Roedel
    Signed-off-by: Paolo Bonzini

    Paolo Bonzini
     

04 Jul, 2015

1 commit

  • Commit 1cde2930e154 ("sched/preempt: Add static_key() to preempt_notifiers")
    had two problems. First, with the addition of the static_key, the
    preempt-notifier API needs to sleep; we do, however, need to hold off
    preemption while modifying the preempt notifier list, otherwise a
    preemption could observe an inconsistent list state. KVM correctly
    registers and unregisters preempt notifiers with preemption disabled,
    so the sleep caused dmesg splats.

    Second, KVM registers and unregisters preemption notifiers very often
    (in vcpu_load/vcpu_put). With a single uniprocessor guest the static key
    would move between 0 and 1 continuously, hitting the slow path on every
    userspace exit.

    To fix this, wrap the static_key inc/dec in a new API, and call it from
    KVM.

    Fixes: 1cde2930e154 ("sched/preempt: Add static_key() to preempt_notifiers")
    Reported-by: Pontus Fuchs
    Reported-by: Takashi Iwai
    Tested-by: Takashi Iwai
    Signed-off-by: Peter Zijlstra (Intel)
    Signed-off-by: Paolo Bonzini

    Peter Zijlstra
     

25 Jun, 2015

1 commit

  • Pull arm64 updates from Catalin Marinas:
    "Mostly refactoring/clean-up:

    - CPU ops and PSCI (Power State Coordination Interface) refactoring
    following the merging of the arm64 ACPI support, together with
    handling of Trusted (secure) OS instances

    - Using fixmap for permanent FDT mapping, removing the initial dtb
    placement requirements (within 512MB from the start of the kernel
    image). This required moving the FDT self reservation out of the
    memreserve processing

    - Idmap (1:1 mapping used for MMU on/off) handling clean-up

    - Removing flush_cache_all() - not safe on ARM unless the MMU is off.
    Last stages of CPU power down/up are handled by firmware already

    - "Alternatives" (run-time code patching) refactoring and support for
    immediate branch patching, GICv3 CPU interface access

    - User faults handling clean-up

    And some fixes:

    - Fix for VDSO building with broken ELF toolchains

    - Fix another case of init_mm.pgd usage for user mappings (during
    ASID roll-over broadcasting)

    - Fix for FPSIMD reloading after CPU hotplug

    - Fix for missing syscall trace exit

    - Workaround for .inst asm bug

    - Compat fix for switching the user tls tpidr_el0 register"

    * tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux: (42 commits)
    arm64: use private ratelimit state along with show_unhandled_signals
    arm64: show unhandled SP/PC alignment faults
    arm64: vdso: work-around broken ELF toolchains in Makefile
    arm64: kernel: rename __cpu_suspend to keep it aligned with arm
    arm64: compat: print compat_sp instead of sp
    arm64: mm: Fix freeing of the wrong memmap entries with !SPARSEMEM_VMEMMAP
    arm64: entry: fix context tracking for el0_sp_pc
    arm64: defconfig: enable memtest
    arm64: mm: remove reference to tlb.S from comment block
    arm64: Do not attempt to use init_mm in reset_context()
    arm64: KVM: Switch vgic save/restore to alternative_insn
    arm64: alternative: Introduce feature for GICv3 CPU interface
    arm64: psci: fix !CONFIG_HOTPLUG_CPU build warning
    arm64: fix bug for reloading FPSIMD state after CPU hotplug.
    arm64: kernel thread don't need to save fpsimd context.
    arm64: fix missing syscall trace exit
    arm64: alternative: Work around .inst assembler bugs
    arm64: alternative: Merge alternative-asm.h into alternative.h
    arm64: alternative: Allow immediate branch as alternative instruction
    arm64: Rework alternate sequence for ARM erratum 845719
    ...

    Linus Torvalds
     

19 Jun, 2015

4 commits


18 Jun, 2015

1 commit


17 Jun, 2015

2 commits

  • Commit fd1d0ddf2ae9 (KVM: arm/arm64: check IRQ number on userland
    injection) rightly limited the range of interrupts userspace can
    inject in a guest, but failed to consider the (unlikely) case where
    a guest is configured with 1024 interrupts.

    In this case, interrupts ranging from 1020 to 1023 are unusable,
    as they have a special meaning for the GIC CPU interface.

    Make sure that these numbers cannot be used as an IRQ. Also delete
    a redundant (and similarly buggy) check in kvm_set_irq.

    Reported-by: Peter Maydell
    Cc: Andre Przywara
    Cc: # 4.1, 4.0, 3.19, 3.18
    Signed-off-by: Marc Zyngier

    Marc Zyngier
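
The range check described in the commit above might look like the following sketch. The function and constant names are hypothetical, and the model ignores how userspace encodes interrupt numbers; it only captures the two limits from the message: the number of interrupts the guest is configured with, and the reserved range 1020 to 1023.

```python
# First interrupt number with a special meaning for the GIC CPU
# interface, per the commit message (1020 to 1023 are reserved).
FIRST_RESERVED_IRQ = 1020


def irq_is_injectable(irq_num, nr_irqs):
    """Return True if irq_num may be injected into a guest that is
    configured with nr_irqs interrupts (simplified model)."""
    return 0 <= irq_num < min(nr_irqs, FIRST_RESERVED_IRQ)


# A guest configured with 1024 interrupts still cannot use 1020..1023.
print([n for n in range(1016, 1024) if irq_is_injectable(n, 1024)])
```

Checking only against the configured count would accept 1020 to 1023 for a 1024-interrupt guest, which is the case the original check missed.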
     
  • If a GICv3-enabled guest tries to configure Group0, we print a
    warning on the console (because we don't support Group0 interrupts).

    This is fairly pointless, and would allow a guest to spam the
    console. Let's just drop the warning.

    Acked-by: Christoffer Dall
    Signed-off-by: Marc Zyngier

    Marc Zyngier
     

12 Jun, 2015

1 commit

  • So far, we configured the world-switch by having a small array
    of pointers to the save and restore functions, depending on the
    GIC used on the platform.

    Loading these values each time is a bit silly (they never change),
    and it makes sense to rely on the instruction patching instead.

    This leads to a nice cleanup of the code.

    Acked-by: Will Deacon
    Reviewed-by: Christoffer Dall
    Signed-off-by: Marc Zyngier
    Signed-off-by: Catalin Marinas

    Marc Zyngier