05 May, 2016

1 commit

  • commit 1c5631c73fc2261a5df64a72c155cb53dcdc0c45 upstream.

    On a host that runs NTP, corrections can have a direct impact on
    the background timer that we program on the behalf of a vcpu.

    In particular, NTP performing a forward correction will result in
    a timer expiring sooner than expected from a guest point of view.
    Not a big deal, we kick the vcpu anyway.

    But on wake-up, the vcpu thread is going to perform a check to
    find out whether or not it should block. And at that point, the
    timer check is going to say "timer has not expired yet, go back
    to sleep". This results in the timer event being lost forever.

    There are multiple ways to handle this. One would be to record that
    the timer has expired and let kvm_cpu_has_pending_timer return
    true in that case, but that would be fairly invasive. Another is
    to check for the "short sleep" condition in the hrtimer callback,
    and restart the timer for the remaining time when the condition
    is detected.

    This patch implements the latter, with a bit of refactoring in
    order to avoid too much code duplication.

    Reported-by: Alexander Graf
    Reviewed-by: Alexander Graf
    Signed-off-by: Marc Zyngier
    Signed-off-by: Christoffer Dall
    Signed-off-by: Greg Kroah-Hartman

    Marc Zyngier
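
    The "short sleep" check described above can be modelled in user space
    roughly as follows. This is a minimal sketch, not the kernel patch: the
    soft_timer structure, the timer_action enum and both function names are
    hypothetical stand-ins, and time is passed in as plain nanosecond counts.

```c
#include <stdint.h>

/* Hypothetical stand-ins for "restart" / "do not restart" timer results. */
enum timer_action { TIMER_NORESTART, TIMER_RESTART };

/* Hypothetical model of the background timer programmed for a vcpu. */
struct soft_timer {
    uint64_t guest_deadline_ns; /* when the guest expects the timer to fire */
    uint64_t rearm_delay_ns;    /* delay used if the timer must be restarted */
    int vcpu_kicked;            /* set once the vcpu has been woken up */
};

/* Time left until the guest deadline, or 0 if it has already passed. */
static uint64_t remaining_ns(const struct soft_timer *t, uint64_t guest_now_ns)
{
    if (guest_now_ns >= t->guest_deadline_ns)
        return 0;
    return t->guest_deadline_ns - guest_now_ns;
}

/*
 * Callback run when the host timer fires. If the host clock jumped forward
 * and the guest deadline has not been reached yet, re-arm for the remaining
 * time instead of waking a vcpu that would just go back to sleep.
 */
static enum timer_action soft_timer_expired(struct soft_timer *t,
                                            uint64_t guest_now_ns)
{
    uint64_t left = remaining_ns(t, guest_now_ns);

    if (left > 0) {
        t->rearm_delay_ns = left;
        return TIMER_RESTART;
    }

    t->vcpu_kicked = 1;
    return TIMER_NORESTART;
}
```

    With a guest deadline of 1000 ns, a callback at guest time 900 asks for a
    restart with a 100 ns delay, and a callback at 1000 kicks the vcpu.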
     

13 Apr, 2016

1 commit

  • commit e9ad4ec8379ad1ba6f68b8ca1c26b50b5ae0a327 upstream.

    Moving the initialization earlier is needed in 4.6 because
    kvm_arch_init_vm is now using mmu_lock, causing lockdep to
    complain:

    [ 284.440294] INFO: trying to register non-static key.
    [ 284.445259] the code is fine but needs lockdep annotation.
    [ 284.450736] turning off the locking correctness validator.
    ...
    [ 284.528318] [] lock_acquire+0xd3/0x240
    [ 284.533733] [] ? kvm_page_track_register_notifier+0x20/0x60 [kvm]
    [ 284.541467] [] _raw_spin_lock+0x41/0x80
    [ 284.546960] [] ? kvm_page_track_register_notifier+0x20/0x60 [kvm]
    [ 284.554707] [] kvm_page_track_register_notifier+0x20/0x60 [kvm]
    [ 284.562281] [] kvm_mmu_init_vm+0x20/0x30 [kvm]
    [ 284.568381] [] kvm_arch_init_vm+0x1ea/0x200 [kvm]
    [ 284.574740] [] kvm_dev_ioctl+0xbf/0x4d0 [kvm]

    However, it also helps fixing a preexisting problem, which is why this
    patch is also good for stable kernels: kvm_create_vm was incrementing
    current->mm->mm_count but not decrementing it at the out_err label (in
    case kvm_init_mmu_notifier failed). The new initialization order makes
    it possible to add the required mmdrop without adding a new error label.

    Reported-by: Borislav Petkov
    Signed-off-by: Paolo Bonzini
    Signed-off-by: Greg Kroah-Hartman

    Paolo Bonzini
     

16 Mar, 2016

1 commit

  • commit 313f636d5c490c9741d3f750dc8da33029edbc6b upstream.

    When growing halt-polling, there is no check that the poll time exceeds
    the limit. It's possible for vcpu->halt_poll_ns to grow once past
    halt_poll_ns, and stay there until a halt which takes longer than
    vcpu->halt_poll_ns. For example, booting a Linux guest with
    halt_poll_ns=11000:

    ... kvm:kvm_halt_poll_ns: vcpu 0: halt_poll_ns 0 (shrink 10000)
    ... kvm:kvm_halt_poll_ns: vcpu 0: halt_poll_ns 10000 (grow 0)
    ... kvm:kvm_halt_poll_ns: vcpu 0: halt_poll_ns 20000 (grow 10000)

    Signed-off-by: David Matlack
    Fixes: aca6ff29c4063a8d467cdee241e6b3bf7dc4a171
    Signed-off-by: Paolo Bonzini
    Signed-off-by: Greg Kroah-Hartman

    David Matlack
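
    A bounded grow step could look like the sketch below. It is an
    illustration only: grow_poll_ns() is a hypothetical helper, and the
    starting value of 10000 and the multiplier of 2 are assumptions read off
    the trace above rather than taken from the patch.

```c
/*
 * Grow the per-vcpu poll window, but never past the configured limit.
 * cur:    current poll window in nanoseconds
 * factor: multiplier applied on each grow step (assumed to be 2 here)
 * limit:  the module-wide upper bound (halt_poll_ns in the text above)
 */
static unsigned int grow_poll_ns(unsigned int cur, unsigned int factor,
                                 unsigned int limit)
{
    unsigned int val = cur;

    if (val == 0 && factor)
        val = 10000;    /* assumed initial window, as seen in the trace */
    else
        val *= factor;

    /* The missing check: clamp to the limit instead of overshooting it. */
    if (val > limit)
        val = limit;

    return val;
}
```

    With limit=11000, the sequence becomes 0, 10000, 11000 instead of
    0, 10000, 20000.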
     

04 Mar, 2016

2 commits

  • commit 236cf17c2502007a9d2dda3c39fb0d9a6bd03cc2 upstream.

    When we allocate bitmaps in vgic_vcpu_init_maps, we divide the number of
    bits we need by 8 to figure out how many bytes to allocate. However,
    bitmap elements are always accessed as unsigned longs, and if we didn't
    happen to allocate a size such that size % sizeof(unsigned long) == 0,
    bitmap accesses may go past the end of the allocation.

    When using KASAN (which does byte-granular access checks), this results
    in a continuous stream of BUGs whenever these bitmaps are accessed:

    =============================================================================
    BUG kmalloc-128 (Tainted: G B ): kasan: bad access detected
    -----------------------------------------------------------------------------

    INFO: Allocated in vgic_init.part.25+0x55c/0x990 age=7493 cpu=3 pid=1730
    INFO: Slab 0xffffffbde6d5da40 objects=16 used=15 fp=0xffffffc935769700 flags=0x4000000000000080
    INFO: Object 0xffffffc935769500 @offset=1280 fp=0x (null)

    Bytes b4 ffffffc9357694f0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
    Object ffffffc935769500: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
    Object ffffffc935769510: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
    Object ffffffc935769520: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
    Object ffffffc935769530: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
    Object ffffffc935769540: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
    Object ffffffc935769550: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
    Object ffffffc935769560: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
    Object ffffffc935769570: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
    Padding ffffffc9357695b0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
    Padding ffffffc9357695c0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
    Padding ffffffc9357695d0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
    Padding ffffffc9357695e0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
    Padding ffffffc9357695f0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
    CPU: 3 PID: 1740 Comm: kvm-vcpu-0 Tainted: G B 4.4.0+ #17
    Hardware name: ARM Juno development board (r1) (DT)
    Call trace:
    [] dump_backtrace+0x0/0x280
    [] show_stack+0x14/0x20
    [] dump_stack+0x100/0x188
    [] print_trailer+0xfc/0x168
    [] object_err+0x3c/0x50
    [] kasan_report_error+0x244/0x558
    [] __asan_report_load8_noabort+0x48/0x50
    [] __bitmap_or+0xc0/0xc8
    [] kvm_vgic_flush_hwstate+0x1bc/0x650
    [] kvm_arch_vcpu_ioctl_run+0x2ec/0xa60
    [] kvm_vcpu_ioctl+0x474/0xa68
    [] do_vfs_ioctl+0x5b8/0xcb0
    [] SyS_ioctl+0x8c/0xa0
    [] el0_svc_naked+0x24/0x28
    Memory state around the buggy address:
    ffffffc935769400: 00 00 fc fc fc fc fc fc fc fc fc fc fc fc fc fc
    ffffffc935769480: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
    >ffffffc935769500: 04 fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
    ^
    ffffffc935769580: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
    ffffffc935769600: 00 00 00 00 fc fc fc fc fc fc fc fc fc fc fc fc
    ==================================================================

    Fix the issue by always allocating a multiple of sizeof(unsigned long),
    as we do elsewhere in the vgic code.

    Fixes: c1bfb577a ("arm/arm64: KVM: vgic: switch to dynamic allocation")
    Acked-by: Marc Zyngier
    Acked-by: Christoffer Dall
    Signed-off-by: Mark Rutland
    Signed-off-by: Marc Zyngier
    Signed-off-by: Greg Kroah-Hartman

    Mark Rutland
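
    The size calculation at issue can be sketched in plain C as below. The
    helper names are hypothetical; in the kernel the rounding is normally
    expressed with the BITS_TO_LONGS() macro rather than open-coded.

```c
#include <limits.h>
#include <stddef.h>

#define BITS_PER_WORD (sizeof(unsigned long) * CHAR_BIT)

/* The problematic calculation: bytes needed if accesses were byte-sized. */
static size_t bitmap_bytes_naive(size_t nbits)
{
    return nbits / 8;
}

/*
 * The safe calculation: round up to whole unsigned longs, because bitmap
 * helpers read and write the storage one unsigned long at a time.
 */
static size_t bitmap_bytes_rounded(size_t nbits)
{
    size_t words = (nbits + BITS_PER_WORD - 1) / BITS_PER_WORD;

    return words * sizeof(unsigned long);
}
```

    For 32 bits on a 64-bit machine the naive version yields 4 bytes, so an
    8-byte load reads 4 bytes past the allocation; the rounded version
    yields 8.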
     
  • commit d7444794a02ff655eda87e3cc54e86b940e7736f upstream.

    In async_pf we try to allocate with NOWAIT to get an element quickly
    or fail. This code also handles failures gracefully. Let's silence
    potential page allocation failures under load.

    qemu-system-s39: page allocation failure: order:0,mode:0x2200000
    [...]
    Call Trace:
    ([] show_trace+0xf8/0x148)
    [] show_stack+0x62/0xe8
    [] dump_stack+0x70/0x98
    [] warn_alloc_failed+0xd2/0x148
    [] __alloc_pages_nodemask+0x94e/0xb38
    [] new_slab+0x382/0x400
    [] ___slab_alloc.constprop.30+0x2dc/0x378
    [] kmem_cache_alloc+0x160/0x1d0
    [] kvm_setup_async_pf+0x6c/0x198
    [] kvm_arch_vcpu_ioctl_run+0xd48/0xd58
    [] kvm_vcpu_ioctl+0x372/0x690
    [] do_vfs_ioctl+0x3be/0x510
    [] SyS_ioctl+0xa4/0xb8
    [] system_call+0xd6/0x264
    [] 0x3ffa24fa06a

    Signed-off-by: Christian Borntraeger
    Reviewed-by: Dominik Dingel
    Signed-off-by: Paolo Bonzini
    Signed-off-by: Greg Kroah-Hartman

    Christian Borntraeger
     

26 Feb, 2016

1 commit

  • commit b3aff6ccbb1d25e506b60ccd9c559013903f3464 upstream.

    Commit 4b4b4512da2a ("arm/arm64: KVM: Rework the arch timer to use
    level-triggered semantics") brought the virtual architected timer
    closer to the VGIC. There is one occasion where we don't properly
    check for the VGIC actually having been initialized before, but
    instead go on to check the active state of some IRQ number.
    If userland hasn't instantiated a virtual GIC, we end up with a
    kernel NULL pointer dereference:
    =========
    Unable to handle kernel NULL pointer dereference at virtual address 00000000
    pgd = ffffffc9745c5000
    [00000000] *pgd=00000009f631e003, *pud=00000009f631e003, *pmd=0000000000000000
    Internal error: Oops: 96000006 [#2] PREEMPT SMP
    Modules linked in:
    CPU: 0 PID: 2144 Comm: kvm_simplest-ar Tainted: G D 4.5.0-rc2+ #1300
    Hardware name: ARM Juno development board (r1) (DT)
    task: ffffffc976da8000 ti: ffffffc976e28000 task.ti: ffffffc976e28000
    PC is at vgic_bitmap_get_irq_val+0x78/0x90
    LR is at kvm_vgic_map_is_active+0xac/0xc8
    pc : [] lr : [] pstate: 20000145
    ....
    =========

    Fix this by bailing out early of kvm_timer_flush_hwstate() if we don't
    have a VGIC at all.

    Reported-by: Cosmin Gorgovan
    Acked-by: Marc Zyngier
    Signed-off-by: Andre Przywara
    Signed-off-by: Marc Zyngier
    Signed-off-by: Greg Kroah-Hartman

    Andre Przywara
     

12 Dec, 2015

1 commit

  • External inputs to the vgic from time to time need to poke into the
    state of a virtual interrupt; the prime example is the architected timer
    code.

    Since the IRQ's active state can be represented in two places, the LR or
    the distributor, we first loop over the LRs, but if the IRQ is not active
    in the LRs we just return whether *any* IRQ is active on the VCPU in
    question.

    This is of course bogus, as we should check if the specific IRQ in
    question is active on the distributor instead.

    Reported-by: Eric Auger
    Acked-by: Marc Zyngier
    Signed-off-by: Christoffer Dall
    Signed-off-by: Marc Zyngier

    Christoffer Dall
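
    The difference between the two checks can be illustrated with a small
    stand-alone bitmap model. Both helpers are hypothetical and only mirror
    the idea; they are not the vgic functions themselves.

```c
#include <limits.h>
#include <stdbool.h>

#define BITS_PER_WORD (sizeof(unsigned long) * CHAR_BIT)

/* The bogus check: true as soon as any IRQ in the bitmap is active. */
static bool any_irq_active(const unsigned long *active, unsigned int nwords)
{
    for (unsigned int i = 0; i < nwords; i++)
        if (active[i])
            return true;
    return false;
}

/* The intended check: test only the bit that belongs to the given IRQ. */
static bool irq_is_active(const unsigned long *active, unsigned int irq)
{
    return (active[irq / BITS_PER_WORD] >> (irq % BITS_PER_WORD)) & 1UL;
}
```

    If only IRQ 5 is active, any_irq_active() reports true, which would be
    the wrong answer for a caller asking about IRQ 27.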
     

25 Nov, 2015

2 commits

  • We were probing the physical distributor state for the active state of a
    HW virtual IRQ, because we had seen evidence that the LR state was not
    cleared when the guest deactivated a virtual interrupt.

    However, this issue turned out to be a software bug in the GIC, which
    was solved by: 84aab5e68c2a5e1e18d81ae8308c3ce25d501b29
    (KVM: arm/arm64: arch_timer: Preserve physical dist. active
    state on LR.active, 2015-11-24)

    Therefore, get rid of the complexities and just look at the LR.

    Reviewed-by: Marc Zyngier
    Signed-off-by: Christoffer Dall

    Christoffer Dall
     
  • We were incorrectly removing the active state from the physical
    distributor on the timer interrupt when the timer output level was
    deasserted. We shouldn't be doing this without considering the virtual
    interrupt's active state, because the architecture requires that when an
    LR has the HW bit set and the pending or active bits set, then the
    physical interrupt must also have the corresponding bits set.

    This addresses an issue where we have been observing an inconsistency
    between the LR state and the physical distributor state where the LR
    state was active and the physical distributor was not active, which
    shouldn't happen.

    Reviewed-by: Marc Zyngier
    Signed-off-by: Christoffer Dall

    Christoffer Dall
     

06 Nov, 2015

1 commit

  • Pull KVM updates from Paolo Bonzini:
    "First batch of KVM changes for 4.4.

    s390:
    A bunch of fixes and optimizations for interrupt and time handling.

    PPC:
    Mostly bug fixes.

    ARM:
    No big features, but many small fixes and prerequisites including:

    - a number of fixes for the arch-timer

    - introducing proper level-triggered semantics for the arch-timers

    - a series of patches to synchronously halt a guest (prerequisite
    for IRQ forwarding)

    - some tracepoint improvements

    - a tweak for the EL2 panic handlers

    - some more VGIC cleanups getting rid of redundant state

    x86:
    Quite a few changes:

    - support for VT-d posted interrupts (i.e. PCI devices can inject
    interrupts directly into vCPUs). This introduces a new
    component (in virt/lib/) that connects VFIO and KVM together.
    The same infrastructure will be used for ARM interrupt
    forwarding as well.

    - more Hyper-V features, though the main one Hyper-V synthetic
    interrupt controller will have to wait for 4.5. These will let
    KVM expose Hyper-V devices.

    - nested virtualization now supports VPID (same as PCID but for
    vCPUs) which makes it quite a bit faster

    - for future hardware that supports NVDIMM, there is support for
    clflushopt, clwb, pcommit

    - support for "split irqchip", i.e. LAPIC in kernel +
    IOAPIC/PIC/PIT in userspace, which reduces the attack surface of
    the hypervisor

    - obligatory smattering of SMM fixes

    - on the guest side, stable scheduler clock support was rewritten
    to not require help from the hypervisor"

    * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (123 commits)
    KVM: VMX: Fix commit which broke PML
    KVM: x86: obey KVM_X86_QUIRK_CD_NW_CLEARED in kvm_set_cr0()
    KVM: x86: allow RSM from 64-bit mode
    KVM: VMX: fix SMEP and SMAP without EPT
    KVM: x86: move kvm_set_irq_inatomic to legacy device assignment
    KVM: device assignment: remove pointless #ifdefs
    KVM: x86: merge kvm_arch_set_irq with kvm_set_msi_inatomic
    KVM: x86: zero apic_arb_prio on reset
    drivers/hv: share Hyper-V SynIC constants with userspace
    KVM: x86: handle SMBASE as physical address in RSM
    KVM: x86: add read_phys to x86_emulate_ops
    KVM: x86: removing unused variable
    KVM: don't pointlessly leave KVM_COMPAT=y in non-KVM configs
    KVM: arm/arm64: Merge vgic_set_lr() and vgic_sync_lr_elrsr()
    KVM: arm/arm64: Clean up vgic_retire_lr() and surroundings
    KVM: arm/arm64: Optimize away redundant LR tracking
    KVM: s390: use simple switch statement as multiplexer
    KVM: s390: drop useless newline in debugging data
    KVM: s390: SCA must not cross page boundaries
    KVM: arm: Do not indent the arguments of DECLARE_BITMAP
    ...

    Linus Torvalds
     

04 Nov, 2015

7 commits

  • We do not want to do too much work in atomic context, in particular
    not walking all the VCPUs of the virtual machine. So we want
    to distinguish the architecture-specific injection function for irqfd
    from kvm_set_msi. Since it's still empty, reuse the newly added
    kvm_arch_set_irq and rename it to kvm_arch_set_irq_inatomic.

    Reviewed-by: Radim Krčmář
    Signed-off-by: Paolo Bonzini

    Paolo Bonzini
     
  • The symbol was missing a KVM dependency.

    Signed-off-by: Jan Beulich
    Signed-off-by: Paolo Bonzini

    Jan Beulich
     
  • KVM/ARM Changes for v4.4-rc1

    Includes a number of fixes for the arch-timer, introducing proper
    level-triggered semantics for the arch-timers, a series of patches to
    synchronously halt a guest (prerequisite for IRQ forwarding), some tracepoint
    improvements, a tweak for the EL2 panic handlers, some more VGIC cleanups
    getting rid of redundant state, and finally a stylistic change that gets rid of
    some ctags warnings.

    Conflicts:
    arch/x86/include/asm/kvm_host.h

    Paolo Bonzini
     
  • Now we see that vgic_set_lr() and vgic_sync_lr_elrsr() are always used
    together. Merge them into one function, saving a second vgic_ops
    dereference every time.

    Signed-off-by: Pavel Fedin
    Signed-off-by: Christoffer Dall

    Pavel Fedin
     
  • 1. Remove unnecessary 'irq' argument, because irq number can be retrieved
    from the LR.
    2. Since cff9211eb1a1f58ce7f5a2d596b617928fd4be0e
    ("arm/arm64: KVM: Fix arch timer behavior for disabled interrupts ")
    LR_STATE_PENDING is queued back by vgic_retire_lr() itself. Also, it
    clears vlr.state itself. Therefore, we remove the same, now duplicated,
    check with all accompanying bit manipulations from vgic_unqueue_irqs().
    3. vgic_retire_lr() is always accompanied by vgic_irq_clear_queued(). Since
    it already does more than just clearing the LR, move
    vgic_irq_clear_queued() inside of it.

    Signed-off-by: Pavel Fedin
    Signed-off-by: Christoffer Dall

    Pavel Fedin
     
  • Currently we use vgic_irq_lr_map in order to track which LRs hold which
    IRQs, and lr_used bitmap in order to track which LRs are used or free.

    vgic_irq_lr_map is actually used only for piggy-back optimization, and
    can be easily replaced by iteration over lr_used. This is good because in
    the future, when LPI support is introduced, the number of IRQs will grow
    to at least 16384, while numbers from 1024 to 8192 are never going to be
    used. This would be a huge memory waste.

    In turn, lr_used is also completely redundant since
    ae705930fca6322600690df9dc1c7d0516145a93 ("arm/arm64: KVM: Keep elrsr/aisr
    in sync with software model"), because together with lr_used we also update
    elrsr. This allows us to easily replace lr_used with elrsr, inverting all
    conditions (because in elrsr '1' means 'free').

    Signed-off-by: Pavel Fedin
    Signed-off-by: Christoffer Dall

    Pavel Fedin
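
    The inverted sense of elrsr mentioned above ('1' means 'free') can be
    shown with a small sketch. The helper names and the use of a single
    64-bit word for elrsr are assumptions made for illustration.

```c
#include <stdint.h>

/* An LR is in use when its elrsr bit is clear ('1' means 'free'). */
static int lr_is_used(uint64_t elrsr, unsigned int lr)
{
    return !(elrsr & (UINT64_C(1) << lr));
}

/* Return the index of the first free LR, or -1 if all nr_lr are in use. */
static int find_free_lr(uint64_t elrsr, unsigned int nr_lr)
{
    for (unsigned int lr = 0; lr < nr_lr; lr++)
        if (!lr_is_used(elrsr, lr))
            return (int)lr;
    return -1;
}
```

    With four LRs and elrsr = 0xA, LRs 1 and 3 are free and LRs 0 and 2 are
    in use, so code that used to test a set bit in lr_used now tests a clear
    bit in elrsr.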
     
  • Pull irq updates from Thomas Gleixner:
    "The irq department delivers:

    - Rework the irqdomain core infrastructure to accommodate ACPI based
    systems. This is required to support ARM64 without creating
    artificial device tree nodes.

    - Sanitize the ACPI based ARM GIC initialization by making use of the
    new firmware independent irqdomain core

    - Further improvements to the generic MSI management

    - Generalize the irq migration on CPU hotplug

    - Improvements to the threaded interrupt infrastructure

    - Allow the migration of "chained" low level interrupt handlers

    - Allow optional force masking of interrupts in disable_irq[_nosync]

    - Support for two new interrupt chips - Sigh!

    - A larger set of errata fixes for ARM gicv3

    - The usual pile of fixes, updates, improvements and cleanups all
    over the place"

    * 'irq-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (71 commits)
    Document that IRQ_NONE should be returned when IRQ not actually handled
    PCI/MSI: Allow the MSI domain to be device-specific
    PCI: Add per-device MSI domain hook
    of/irq: Use the msi-map property to provide device-specific MSI domain
    of/irq: Split of_msi_map_rid to reuse msi-map lookup
    irqchip/gic-v3-its: Parse new version of msi-parent property
    PCI/MSI: Use of_msi_get_domain instead of open-coded "msi-parent" parsing
    of/irq: Use of_msi_get_domain instead of open-coded "msi-parent" parsing
    of/irq: Add support code for multi-parent version of "msi-parent"
    irqchip/gic-v3-its: Add handling of PCI requester id.
    PCI/MSI: Add helper function pci_msi_domain_get_msi_rid().
    of/irq: Add new function of_msi_map_rid()
    Docs: dt: Add PCI MSI map bindings
    irqchip/gic-v2m: Add support for multiple MSI frames
    irqchip/gic-v3: Fix translation of LPIs after conversion to irq_fwspec
    irqchip/mxs: Add Alphascale ASM9260 support
    irqchip/mxs: Prepare driver for hardware with different offsets
    irqchip/mxs: Panic if ioremap or domain creation fails
    irqdomain: Documentation updates
    irqdomain/msi: Use fwnode instead of of_node
    ...

    Linus Torvalds
     

23 Oct, 2015

8 commits

  • The VGIC and timer code for KVM arm/arm64 doesn't have any tracepoints
    or tracepoint infrastructure defined. Rewriting some of the timer code
    handling showed me how much we need this, so let's add these simple
    trace points once and for all and we can easily expand with additional
    trace points in these files as we go along.

    Cc: Wei Huang
    Signed-off-by: Christoffer Dall

    Christoffer Dall
     
  • We mark edge-triggered interrupts with the HW bit set as queued to
    prevent the VGIC code from injecting LRs with both the Active and
    Pending bits set at the same time while also setting the HW bit,
    because the hardware does not support this.

    However, this means that we must also clear the queued flag when we sync
    back a LR where the state on the physical distributor went from active
    to inactive because the guest deactivated the interrupt. At this point
    we must also check if the interrupt is pending on the distributor, and
    tell the VGIC to queue it again if it is.

    Since these actions on the sync path are extremely close to those for
    level-triggered interrupts, rename process_level_irq to
    process_queued_irq, allowing it to cater for both cases.

    Signed-off-by: Christoffer Dall

    Christoffer Dall
     
  • The arch timer currently uses edge-triggered semantics in the sense that
    the line is never sampled by the vgic and lowering the line from the
    timer to the vgic doesn't have any effect on the pending state of
    virtual interrupts in the vgic. This means that we do not support a
    guest with the otherwise valid behavior of (1) disable interrupts (2)
    enable the timer (3) disable the timer (4) enable interrupts. Such a
    guest would validly not expect to see any interrupts on real hardware,
    but will see interrupts on KVM.

    This patch fixes this shortcoming through the following series of
    changes.

    First, we change the flow of the timer/vgic sync/flush operations. Now
    the timer is always flushed/synced before the vgic, because the vgic
    samples the state of the timer output. This has the implication that we
    move the timer operations into non-preemptible sections, but that is
    fine after the previous commit getting rid of hrtimer schedules on every
    entry/exit.

    Second, we change the internal behavior of the timer, letting the timer
    keep track of its previous output state, and only lower/raise the line
    to the vgic when the state changes. Note that in theory this could have
    been accomplished more simply by signalling the vgic every time the
    state *potentially* changed, but we don't want to be hitting the vgic
    more often than necessary.

    Third, we get rid of the use of the map->active field in the vgic and
    instead simply set the interrupt as active on the physical distributor
    whenever the input to the GIC is asserted and conversely clear the
    physical active state when the input to the GIC is deasserted.

    Fourth, and finally, we now initialize the timer PPIs (and all the other
    unused PPIs for now), to be level-triggered, and modify the sync code to
    sample the line state on HW sync and re-inject a new interrupt if it is
    still pending at that time.

    Signed-off-by: Christoffer Dall

    Christoffer Dall
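
    The second change, tracking the previous output state and signalling
    only on a transition, can be sketched as follows. The timer_line
    structure and timer_update_line() are hypothetical names; the signal
    counter exists only to make the behaviour observable.

```c
#include <stdbool.h>

/* Hypothetical model of the line from the timer to the vgic. */
struct timer_line {
    bool level;             /* last level reported to the vgic */
    unsigned int signals;   /* how many times the vgic was notified */
};

/*
 * Report a new timer output level. The vgic is only notified when the
 * level actually changes; returns true if a notification was sent.
 */
static bool timer_update_line(struct timer_line *line, bool new_level)
{
    if (line->level == new_level)
        return false;

    line->level = new_level;
    line->signals++;
    return true;
}
```

    For the guest sequence in the text (enable the timer, then disable it
    before enabling interrupts), the line goes high and then low again, so a
    level-sensitive sample taken afterwards sees no pending interrupt.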
     
  • We currently initialize the SGIs to be enabled in the VGIC code, but we
    use the VGIC_NR_PPIS define for this purpose, instead of the more
    natural VGIC_NR_SGIS. Change this slightly confusing use of the
    defines.

    Note: This should have no functional change, as both names are defined
    to the number 16.

    Acked-by: Marc Zyngier
    Signed-off-by: Christoffer Dall

    Christoffer Dall
     
  • The GICD_ICFGR allows the bits for the SGIs and PPIs to be read only.
    We currently simulate this behavior by writing a hardcoded value to the
    register for the SGIs and PPIs on every write of these bits to the
    register (ignoring what the guest actually wrote), and by writing the
    same value as the reset value to the register.

    This is a bit counter-intuitive, as the register is RO for these bits,
    and we can just implement it that way, allowing us to control the value
    of the bits purely in the reset code.

    Reviewed-by: Marc Zyngier
    Signed-off-by: Christoffer Dall

    Christoffer Dall
     
  • Currently vgic_process_maintenance() processes dealing with a completed
    level-triggered interrupt directly, but we are soon going to reuse this
    logic for level-triggered mapped interrupts with the HW bit set, so
    move this logic into a separate static function.

    Probably the most scary part of this commit is convincing yourself that
    the current flow is safe compared to the old one. In the following I
    try to list the changes and why they are harmless:

    Move vgic_irq_clear_queued after kvm_notify_acked_irq:
    Harmless because the only potential effect of clearing the queued
    flag wrt. kvm_set_irq is that vgic_update_irq_pending does not set
    the pending bit on the emulated CPU interface or in the
    pending_on_cpu bitmask if the function is called with level=1.
    However, the point of kvm_notify_acked_irq is to call kvm_set_irq
    with level=0, and we set the queued flag again in
    __kvm_vgic_sync_hwstate later on if the level is still high.

    Move vgic_set_lr before kvm_notify_acked_irq:
    Also harmless, because the LR updates are cpu-local operations and
    kvm_notify_acked_irq only affects the distributor.

    Move vgic_dist_irq_clear_soft_pend after kvm_notify_acked_irq:
    Also harmless, because now we check the level state in the
    clear_soft_pend function and lower the pending bits if the level is
    low.

    Reviewed-by: Eric Auger
    Reviewed-by: Marc Zyngier
    Signed-off-by: Christoffer Dall

    Christoffer Dall
     
  • We currently schedule a soft timer every time we exit the guest if the
    timer did not expire while running the guest. This is really not
    necessary, because the only work we do in the timer work function is to
    kick the vcpu.

    Kicking the vcpu does two things:
    (1) If the vcpu thread is on a waitqueue, make it runnable and remove it
    from the waitqueue.
    (2) If the vcpu is running on a different physical CPU from the one
    doing the kick, it sends a reschedule IPI.

    The second case cannot happen, because the soft timer is only ever
    scheduled when the vcpu is not running. The first case is only relevant
    when the vcpu thread is on a waitqueue, which is only the case when the
    vcpu thread has called kvm_vcpu_block().

    Therefore, we only need to make sure a timer is scheduled for
    kvm_vcpu_block(), which we do by encapsulating all calls to
    kvm_vcpu_block() with kvm_timer_{un}schedule calls.

    Additionally, we only schedule a soft timer if the timer is enabled and
    unmasked, since it is useless otherwise.

    Note that theoretically userspace can use the SET_ONE_REG interface to
    change registers that should cause the timer to fire, even if the vcpu
    is blocked without a scheduled timer, but this case was not supported
    before this patch and we leave it for future work for now.

    Signed-off-by: Christoffer Dall

    Christoffer Dall
     
  • Sometimes it is useful for architecture implementations of KVM to know
    when the VCPU thread is about to block or when it comes back from
    blocking (arm/arm64 needs to know this to properly implement timers, for
    example).

    Therefore provide a generic architecture callback function in line with
    what we do elsewhere for KVM generic-arch interactions.

    Reviewed-by: Marc Zyngier
    Signed-off-by: Christoffer Dall

    Christoffer Dall
     

21 Oct, 2015

4 commits

  • We currently do a single update of the vgic state when the distributor
    enable/disable control register is accessed and then bypass updating the
    state for as long as the distributor remains disabled.

    This is incorrect, because updating the state does not consider the
    distributor enable bit, and thus you can end up in a situation where an
    interrupt is marked as pending on the CPU interface, but not pending on
    the distributor, which is an impossible state to be in, and triggers a
    warning. Consider for example the following sequence of events:

    1. An interrupt is marked as pending on the distributor
    - the interrupt is also forwarded to the CPU interface
    2. The guest turns off the distributor (it's about to do a reboot)
    - we stop updating the CPU interface state from now on
    3. The guest disables the pending interrupt
    - we remove the pending state from the distributor, but don't touch
    the CPU interface, see point 2.

    Since the distributor disable bit really means that no interrupts should
    be forwarded to the CPU interface, we modify the code to keep updating
    the internal VGIC state, but always set the CPU interface pending bits
    to zero when the distributor is disabled.

    Signed-off-by: Christoffer Dall

    Christoffer Dall
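
    The resulting rule can be sketched with a tiny model of the distributor.
    The dist_model structure and cpu_if_pending() are hypothetical and cover
    only 32 interrupts; the point is that the internal state keeps being
    updated while the CPU interface view is forced to zero when the
    distributor is disabled.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, heavily simplified distributor state. */
struct dist_model {
    bool enabled;           /* distributor enable bit */
    uint32_t pending;       /* interrupts pending on the distributor */
    uint32_t irq_enabled;   /* per-interrupt enable bits */
};

/*
 * Compute what the CPU interface should see as pending. This is always
 * recomputed from the distributor state, and is zero while the
 * distributor is disabled.
 */
static uint32_t cpu_if_pending(const struct dist_model *d)
{
    if (!d->enabled)
        return 0;
    return d->pending & d->irq_enabled;
}
```

    Replaying the three steps above against this model never leaves an
    interrupt pending on the CPU interface that is not also pending on the
    distributor.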
     
  • When a guest reboots or offlines/onlines CPUs, it is not uncommon for it
    to clear the pending and active states of an interrupt through the
    emulated VGIC distributor. However, since the architected timers are
    defined by the architecture to be level triggered and the guest
    rightfully expects them to be that, but we emulate them as
    edge-triggered, we have to mimic level-triggered behavior for an
    edge-triggered virtual implementation.

    We currently do not signal the VGIC when the map->active field is true,
    because it indicates that the guest has already been signalled of the
    interrupt as required. Normally this field is set to false when the
    guest deactivates the virtual interrupt through the sync path.

    We also need to catch the case where the guest deactivates the interrupt
    through the emulated distributor, again allowing guests to boot even if
    the original virtual timer signal hit before the guest's GIC
    initialization sequence is run.

    Reviewed-by: Eric Auger
    Signed-off-by: Christoffer Dall

    Christoffer Dall
     
  • We have an interesting issue when the guest disables the timer interrupt
    on the VGIC, which happens when turning VCPUs off using PSCI, for
    example.

    The problem is that because the guest disables the virtual interrupt at
    the VGIC level, we never inject interrupts to the guest and therefore
    never mark the interrupt as active on the physical distributor. The
    host also never takes the timer interrupt (we only use the timer device
    to trigger a guest exit and everything else is done in software), so the
    interrupt does not become active through normal means.

    The result is that we keep entering the guest with a programmed timer
    that will always fire as soon as we context switch the hardware timer
    state and run the guest, preventing forward progress for the VCPU.

    Since the active state on the physical distributor is really part of the
    timer logic, it is the job of our virtual arch timer driver to manage
    this state.

    The timer->map->active boolean field indicates whether we have signalled
    this interrupt to the vgic and if that interrupt is still pending or
    active. As long as that is the case, the hardware doesn't have to
    generate physical interrupts and therefore we mark the interrupt as
    active on the physical distributor.

    We also have to restore the pending state of an interrupt that was
    queued to an LR but was retired from the LR for some reason, while
    remaining pending in the LR.

    Cc: Marc Zyngier
    Reported-by: Lorenzo Pieralisi
    Signed-off-by: Christoffer Dall

    Christoffer Dall
     
  • When lowering a level-triggered line from userspace, we forgot to lower
    the pending bit on the emulated CPU interface and we also did not
    re-compute the pending_on_cpu bitmap for the CPU affected by the change.

    Update vgic_update_irq_pending() to fix the two issues above and also
    raise a warning in vgic_queue_irq_to_lr() if we encounter an interrupt
    pending on a CPU which is neither marked active nor pending.

    [ Commit text reworked completely - Christoffer ]

    Signed-off-by: Pavel Fedin
    Signed-off-by: Christoffer Dall

    Pavel Fedin
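
    The two steps named in the message above (clear the pending bit when the
    line is lowered, then recompute pending_on_cpu for the affected CPU) can
    be sketched in plain C. This is a minimal illustration, not the vgic
    code: the demo_* names, the array sizes, and the rule "a CPU bit is set
    when any interrupt is pending on it" are assumptions made for the sketch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define DEMO_NR_CPUS 8u   /* hypothetical sizes, chosen for the sketch */
#define DEMO_NR_IRQS 32u

/* Hypothetical distributor state; not the kernel's struct vgic_dist. */
struct demo_dist {
    uint32_t irq_pending[DEMO_NR_CPUS]; /* bit n set: irq n pending on that CPU */
    uint32_t pending_on_cpu;            /* bit c set: CPU c has pending work */
};

/* Recompute the summary bit for one CPU from its per-interrupt pending bits. */
void demo_recompute_pending_on_cpu(struct demo_dist *d, unsigned cpu)
{
    assert(d != 0 && cpu < DEMO_NR_CPUS);
    if (d->irq_pending[cpu] != 0)
        d->pending_on_cpu |= 1u << cpu;
    else
        d->pending_on_cpu &= ~(1u << cpu);
}

/* Raise or lower a level-triggered line as userspace would. */
void demo_set_level(struct demo_dist *d, unsigned cpu, unsigned irq, bool level)
{
    assert(d != 0 && cpu < DEMO_NR_CPUS && irq < DEMO_NR_IRQS);
    if (level)
        d->irq_pending[cpu] |= 1u << irq;
    else
        d->irq_pending[cpu] &= ~(1u << irq); /* step 1: drop the pending bit */
    demo_recompute_pending_on_cpu(d, cpu);   /* step 2: refresh the summary */
}
```

    Skipping either step on the lowering path leaves a stale bit behind,
    which is the kind of inconsistency the added warning is meant to expose.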
     

16 Oct, 2015

4 commits

  • Any other irq routing types (MSI, S390_ADAPTER, upcoming Hyper-V
    SynIC) map one-to-one to GSI.

    Signed-off-by: Andrey Smetanin
    Reviewed-by: Roman Kagan
    Signed-off-by: Denis V. Lunev
    CC: Vitaly Kuznetsov
    CC: "K. Y. Srinivasan"
    CC: Gleb Natapov
    CC: Paolo Bonzini
    Signed-off-by: Paolo Bonzini

    Andrey Smetanin
     
  • Allow for arch-specific interrupt types to be set. For that, add
    kvm_arch_set_irq() which takes interrupt type-specific action if it
    recognizes the interrupt type given, and returns -EWOULDBLOCK otherwise.

    The default implementation always returns -EWOULDBLOCK.

    Signed-off-by: Andrey Smetanin
    Reviewed-by: Roman Kagan
    Signed-off-by: Denis V. Lunev
    CC: Vitaly Kuznetsov
    CC: "K. Y. Srinivasan"
    CC: Gleb Natapov
    CC: Paolo Bonzini
    Signed-off-by: Paolo Bonzini

    Andrey Smetanin
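
    The contract described above (act on a recognized interrupt type, return
    -EWOULDBLOCK for anything else) can be sketched as follows. All demo_*
    names and the route types are hypothetical, and the caller's fallback to
    a deferred path on -EWOULDBLOCK is an assumption of the sketch rather
    than something this commit message states.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical routing types, for illustration only. */
enum demo_route_type { DEMO_ROUTE_ARCH_FAST, DEMO_ROUTE_OTHER };

struct demo_route {
    enum demo_route_type type;
    int gsi;
};

typedef int (*demo_set_irq_fn)(const struct demo_route *route, int level);

/* Default hook: recognizes nothing, so it always asks the caller to defer. */
int demo_default_set_irq(const struct demo_route *route, int level)
{
    (void)route;
    (void)level;
    return -EWOULDBLOCK;
}

/* An arch override: handles one type inline, defers everything else. */
int demo_arch_set_irq(const struct demo_route *route, int level)
{
    if (route->type == DEMO_ROUTE_ARCH_FAST)
        return level ? 1 : 0; /* pretend the interrupt was delivered inline */
    return -EWOULDBLOCK;
}

/* Caller: try the hook first and fall back to deferred work if it declines.
 * Returns 1 when the deferred path was taken, 0 otherwise. */
int demo_inject(demo_set_irq_fn hook, const struct demo_route *route,
                int *deferred_count)
{
    assert(hook != NULL && route != NULL && deferred_count != NULL);
    if (hook(route, 1) == -EWOULDBLOCK) {
        (*deferred_count)++; /* stands in for queueing deferred injection */
        return 1;
    }
    return 0;
}
```

    With the default hook every injection takes the deferred path, so
    behaviour is unchanged until an architecture supplies its own hook.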
     
  • Factor out kvm_notify_acked_gsi() helper to iterate over EOI listeners
    and notify those matching the given gsi.

    It will be reused in the upcoming Hyper-V SynIC implementation.

    Signed-off-by: Andrey Smetanin
    Reviewed-by: Roman Kagan
    Signed-off-by: Denis V. Lunev
    CC: Vitaly Kuznetsov
    CC: "K. Y. Srinivasan"
    CC: Gleb Natapov
    CC: Paolo Bonzini
    Signed-off-by: Paolo Bonzini

    Andrey Smetanin
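
    A helper of this shape walks a list of listeners and calls back only
    those registered for the acknowledged gsi. The sketch below uses a
    hand-rolled singly linked list and hypothetical demo_* names; the
    kernel's notifier structure and list primitives are not reproduced here.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical listener record; not the kernel's notifier struct. */
struct demo_ack_listener {
    int gsi;                                          /* gsi of interest */
    void (*irq_acked)(struct demo_ack_listener *self);
    int acked_count;                                  /* bumped by demo_count_ack */
    struct demo_ack_listener *next;
};

/* Example callback: just count how often this listener was notified. */
void demo_count_ack(struct demo_ack_listener *self)
{
    self->acked_count++;
}

/* Notify every listener whose gsi matches; return how many were notified. */
int demo_notify_acked_gsi(struct demo_ack_listener *head, int gsi)
{
    int notified = 0;

    for (struct demo_ack_listener *l = head; l != NULL; l = l->next) {
        if (l->gsi != gsi)
            continue;
        assert(l->irq_acked != NULL);
        l->irq_acked(l);
        notified++;
    }
    return notified;
}
```

    Keeping the match-and-notify loop in one function lets several callers
    share it, which is the reuse the message above anticipates.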
     
  • The for loop inside irqfd_update() is unnecessary
    because any other value for irq_entry.type will just trigger
    schedule_work(&irqfd->inject) in irqfd_wakeup().

    Signed-off-by: Andrey Smetanin
    Reviewed-by: Roman Kagan
    Signed-off-by: Denis V. Lunev
    CC: Vitaly Kuznetsov
    CC: "K. Y. Srinivasan"
    CC: Gleb Natapov
    CC: Paolo Bonzini
    Signed-off-by: Paolo Bonzini

    Andrey Smetanin
     

14 Oct, 2015

1 commit

  • async_pf_execute() seems to be missing a memory barrier which might
    cause the waker to not notice the waiter and miss sending a wake_up as
    in the following figure.

    async_pf_execute                        kvm_vcpu_block
    ------------------------------------------------------------------------
    spin_lock(&vcpu->async_pf.lock);
    if (waitqueue_active(&vcpu->wq))
    /* The CPU might reorder the test for
       the waitqueue up here, before
       prior writes complete */
                                            prepare_to_wait(&vcpu->wq, &wait,
                                              TASK_INTERRUPTIBLE);
                                            /*if (kvm_vcpu_check_block(vcpu) < 0) */
                                             /*if (kvm_arch_vcpu_runnable(vcpu)) { */
                                              ...
                                              return (vcpu->arch.mp_state == KVM_MP_STATE_RUNNABLE &&
                                                !vcpu->arch.apf.halted)
                                                || !list_empty_careful(&vcpu->async_pf.done)
                                             ...
                                             return 0;
    list_add_tail(&apf->link,
      &vcpu->async_pf.done);
    spin_unlock(&vcpu->async_pf.lock);
                                            waited = true;
                                            schedule();
    ------------------------------------------------------------------------

    The attached patch adds the missing memory barrier.

    I found this issue when I was looking through the linux source code
    for places calling waitqueue_active() before wake_up*(), but without
    preceding memory barriers, after sending a patch to fix a similar
    issue in drivers/tty/n_tty.c (Details about the original issue can be
    found here: https://lkml.org/lkml/2015/9/28/849).

    Signed-off-by: Kosuke Tatsukawa
    Signed-off-by: Paolo Bonzini

    Kosuke Tatsukawa
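
    The race above is the classic "store, then check the other side's flag"
    pattern: the waker publishes work and then looks for a waiter, while the
    waiter queues itself and then looks for work. A userspace analogue with
    C11 atomics is sketched below. The demo_* names are hypothetical, two
    flags stand in for the done list and the wait queue, and the sequentially
    consistent fences play the role of the kernel barrier; the primitive the
    patch actually uses is not shown in this log.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical shared state; not a kernel structure. */
struct demo_sync {
    atomic_bool work_done;     /* stands in for the async_pf.done list */
    atomic_bool waiter_queued; /* stands in for a non-empty wait queue */
    int wakeups_sent;          /* touched by the waker only */
};

void demo_sync_init(struct demo_sync *s)
{
    assert(s != NULL);
    atomic_init(&s->work_done, false);
    atomic_init(&s->waiter_queued, false);
    s->wakeups_sent = 0;
}

/* Waker: publish the work, full barrier, then test for a waiter.
 * Returns true if a wake-up was sent. */
bool demo_waker(struct demo_sync *s)
{
    assert(s != NULL);
    atomic_store_explicit(&s->work_done, true, memory_order_relaxed);
    atomic_thread_fence(memory_order_seq_cst); /* keep the load below after the store */
    if (atomic_load_explicit(&s->waiter_queued, memory_order_relaxed)) {
        s->wakeups_sent++;
        return true;
    }
    return false;
}

/* Waiter: queue ourselves, full barrier, then re-check the condition.
 * Returns true if the caller should go to sleep. */
bool demo_waiter_should_sleep(struct demo_sync *s)
{
    assert(s != NULL);
    atomic_store_explicit(&s->waiter_queued, true, memory_order_relaxed);
    atomic_thread_fence(memory_order_seq_cst);
    return !atomic_load_explicit(&s->work_done, memory_order_relaxed);
}
```

    With a full fence on both sides, at least one of the two loads observes
    the other side's store, so the outcome "waker saw no waiter and waiter
    saw no work" is ruled out. Without the waker's fence that outcome is
    allowed, and the wake-up is lost.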
     

10 Oct, 2015

1 commit

  • Hardware virtualisation of GICv3 is only supported by 64bit hosts for
    the moment. Some VGICv3 bits are missing from the 32bit side, and this
    patch makes it possible to still build 32bit hosts when
    CONFIG_ARM_GIC_V3 is selected.

    To this end, we introduce a new option, CONFIG_KVM_ARM_VGIC_V3, that is
    only enabled on the 64bit side. The selection is done unconditionally
    because CONFIG_ARM_GIC_V3 is always enabled on arm64.

    Reviewed-by: Marc Zyngier
    Signed-off-by: Jean-Philippe Brucker
    Signed-off-by: Marc Zyngier

    Jean-Philippe Brucker
     

01 Oct, 2015

5 commits

  • This patch updates the Posted-Interrupts Descriptor when the vCPU
    is blocked.

    pre-block:
    - Add the vCPU to the blocked per-CPU list
    - Set 'NV' to POSTED_INTR_WAKEUP_VECTOR

    post-block:
    - Remove the vCPU from the per-CPU list

    Signed-off-by: Feng Wu
    [Concentrate invocation of pre/post-block hooks to vcpu_block. - Paolo]
    Signed-off-by: Paolo Bonzini

    Feng Wu
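
    The pre-block and post-block steps listed above amount to list
    bookkeeping plus one field update. The sketch below is a minimal,
    single-threaded illustration: the demo_* names and vector values are
    hypothetical, and it omits the locking a real per-CPU list needs as well
    as any step not listed in the message.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical vector numbers; the real values are not given in this log. */
#define DEMO_NOTIFY_VECTOR 1u
#define DEMO_WAKEUP_VECTOR 2u

struct demo_vcpu {
    unsigned pi_nv;                 /* 'NV' field of the descriptor */
    bool on_blocked_list;
    struct demo_vcpu *next_blocked;
};

struct demo_cpu {
    struct demo_vcpu *blocked_head; /* per-CPU list of blocked vCPUs */
};

/* pre-block: add the vCPU to the per-CPU list and switch NV to the wakeup vector. */
void demo_pre_block(struct demo_cpu *cpu, struct demo_vcpu *v)
{
    assert(cpu != NULL && v != NULL && !v->on_blocked_list);
    v->next_blocked = cpu->blocked_head;
    cpu->blocked_head = v;
    v->on_blocked_list = true;
    v->pi_nv = DEMO_WAKEUP_VECTOR;
}

/* post-block: unlink the vCPU from the per-CPU list if it is there. */
void demo_post_block(struct demo_cpu *cpu, struct demo_vcpu *v)
{
    assert(cpu != NULL && v != NULL);
    for (struct demo_vcpu **pp = &cpu->blocked_head; *pp != NULL;
         pp = &(*pp)->next_blocked) {
        if (*pp == v) {
            *pp = v->next_blocked;
            v->next_blocked = NULL;
            v->on_blocked_list = false;
            return;
        }
    }
}
```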
     
  • This patch adds an arch-specific hook, 'arch_update', to
    'struct kvm_kernel_irqfd'. On the Intel side, it is used to
    update the IRTE when VT-d posted interrupts are used.

    Signed-off-by: Feng Wu
    Reviewed-by: Alex Williamson
    Signed-off-by: Paolo Bonzini

    Feng Wu
     
  • This patch adds the registration/unregistration of an
    irq_bypass_consumer on irqfd assignment/deassignment.

    Signed-off-by: Eric Auger
    Signed-off-by: Feng Wu
    Reviewed-by: Alex Williamson
    Signed-off-by: Paolo Bonzini

    Eric Auger
     
  • This patch introduces
    - kvm_arch_irq_bypass_add_producer
    - kvm_arch_irq_bypass_del_producer
    - kvm_arch_irq_bypass_stop
    - kvm_arch_irq_bypass_start

    They make it possible to specialize the KVM IRQ bypass consumer when
    CONFIG_KVM_HAVE_IRQ_BYPASS is set.

    Signed-off-by: Eric Auger
    [Add weak implementations of the callbacks. - Feng]
    Signed-off-by: Feng Wu
    Reviewed-by: Alex Williamson
    Signed-off-by: Paolo Bonzini

    Eric Auger
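
    The bracketed note above mentions weak implementations of the callbacks.
    The general technique is sketched below with the GCC/Clang weak attribute
    in userspace C: generic code ships do-nothing defaults, and a platform
    file may supply a strong definition of the same name, which the linker
    then prefers. The demo_* names and the consumer struct are hypothetical,
    and the defaults' behaviour (return 0, do nothing) is an assumption.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the consumer state; not a kernel struct. */
struct demo_bypass_consumer {
    int producers;
};

/* Weak default: used only if no strong definition exists elsewhere. */
__attribute__((weak)) int demo_arch_bypass_add_producer(struct demo_bypass_consumer *c)
{
    (void)c;
    return 0; /* nothing to set up by default */
}

__attribute__((weak)) void demo_arch_bypass_del_producer(struct demo_bypass_consumer *c)
{
    (void)c; /* nothing to tear down by default */
}

/* Generic code calls the hooks without knowing whether they were overridden. */
int demo_connect_producer(struct demo_bypass_consumer *c)
{
    assert(c != NULL);
    int ret = demo_arch_bypass_add_producer(c);
    if (ret == 0)
        c->producers++;
    return ret;
}

void demo_disconnect_producer(struct demo_bypass_consumer *c)
{
    assert(c != NULL && c->producers > 0);
    demo_arch_bypass_del_producer(c);
    c->producers--;
}
```

    The generic caller needs no #ifdef for this: an architecture opts in
    simply by defining the same function without the weak attribute.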
     
  • Move the _irqfd_resampler and _irqfd struct declarations into a new
    public header, kvm_irqfd.h. They are renamed to
    kvm_kernel_irqfd_resampler and kvm_kernel_irqfd, respectively. Those
    datatypes will be used by architecture-specific code, in the context
    of IRQ bypass manager integration.

    Signed-off-by: Eric Auger
    Signed-off-by: Feng Wu
    Reviewed-by: Alex Williamson
    Signed-off-by: Paolo Bonzini

    Eric Auger