19 May, 2016

2 commits

  • AVIC has a use for kvm_vcpu_wake_up.

    Signed-off-by: Radim Krčmář
    Tested-by: Suravee Suthikulpanit
    Reviewed-by: Paolo Bonzini
    Signed-off-by: Paolo Bonzini

    Radim Krčmář
     
  • commit 3491caf2755e ("KVM: halt_polling: provide a way to qualify
    wakeups during poll") added more aggressive shrinking of the
    polling interval if the wakeup did not match some criteria. This
    still allows keeping polling enabled if the polling time was
    smaller than the current max poll time (block_ns <= halt_poll_ns).
    Performance measurements show that even more aggressive shrinking
    (shrink polling on any invalid wakeup) reduces absolute and relative
    (to the workload) CPU usage even further. A sketch of the resulting
    decision logic follows at the end of this entry.

    Cc: David Matlack
    Cc: Wanpeng Li
    Cc: Radim Krčmář
    CC: Paolo Bonzini
    CC: Cornelia Huck
    Signed-off-by: Christian Borntraeger
    Signed-off-by: Paolo Bonzini

    Christian Borntraeger
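
    A minimal sketch of the resulting decision logic, in kernel style. The
    helper names (grow_halt_poll_ns/shrink_halt_poll_ns) follow the existing
    halt-polling code, but the exact conditions here are illustrative, not
    the verbatim patch:

      /* Sketch: adjust vcpu->halt_poll_ns after a halt. block_ns is how
       * long the vCPU actually blocked; "valid_wakeup" is the arch's
       * qualification of the wakeup. */
      static void adjust_halt_poll_ns(struct kvm_vcpu *vcpu, u64 block_ns,
                                      bool valid_wakeup)
      {
              if (!valid_wakeup || block_ns > halt_poll_ns) {
                      /* New, more aggressive policy: shrink on any invalid
                       * wakeup instead of only when block_ns exceeded the
                       * global limit. */
                      shrink_halt_poll_ns(vcpu);
              } else if (block_ns > vcpu->halt_poll_ns) {
                      grow_halt_poll_ns(vcpu);
              }
      }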
     

13 May, 2016

1 commit

  • Some wakeups should not be considered a successful poll. For example, on
    s390 I/O interrupts are usually floating, which means that _ALL_ CPUs
    would be considered runnable - letting all vCPUs poll all the time for
    transactional-like workloads, even if one vCPU would be enough.
    This can result in huge CPU usage for large guests.
    This patch lets architectures provide a way to qualify whether a wakeup
    should be considered a good or bad wakeup in regard to polls.

    For s390 the implementation will fence off halt polling for anything but
    known good, single-vCPU events. The s390 implementation for floating
    interrupts does a wakeup for one vCPU, but the interrupt will be delivered
    by whatever CPU checks first for a pending interrupt. We prefer the
    woken-up CPU by marking its poll as a "good" poll.
    This code will also mark several other wakeup reasons like IPIs or
    expired timers as "good". This will of course also mark some events as
    not successful. As KVM on z always runs as a 2nd-level hypervisor,
    we prefer not to poll unless we are really sure, though.

    This patch successfully limits the CPU usage for cases like uperf 1byte
    transactional ping pong workload or wakeup heavy workload like OLTP
    while still providing a proper speedup.

    This also introduces a new vcpu stat, "halt_poll_no_tuning", that counts
    wakeups that are considered not good for polling. A rough sketch of the
    qualification hook follows at the end of this entry.

    Signed-off-by: Christian Borntraeger
    Acked-by: Radim Krčmář (for an earlier version)
    Cc: David Matlack
    Cc: Wanpeng Li
    [Rename config symbol. - Paolo]
    Signed-off-by: Paolo Bonzini

    Christian Borntraeger
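
    A rough sketch of the shape of such an arch hook. The helper name
    kvm_arch_vcpu_valid_wakeup() and the surrounding logic are illustrative
    assumptions, not the exact symbols added by the patch:

      /* Sketch: only count a poll as successful if the architecture says
       * this wakeup was a "good" one for the woken vCPU. */
      static void record_poll_result(struct kvm_vcpu *vcpu, bool woken_while_polling)
      {
              if (!woken_while_polling)
                      return;

              if (kvm_arch_vcpu_valid_wakeup(vcpu)) {
                      ++vcpu->stat.halt_successful_poll;  /* keep/grow polling */
              } else {
                      ++vcpu->stat.halt_poll_no_tuning;   /* don't tune on this wakeup */
                      shrink_halt_poll_ns(vcpu);
              }
      }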
     

12 May, 2016

3 commits

  • If we don't support a mechanism for bypassing IRQs, don't register as
    a consumer. This eliminates meaningless dev_info()s when the connection
    between producer and consumer fails, such as on AMD systems where
    kvm_x86_ops->update_pi_irte is not implemented.

    Signed-off-by: Alex Williamson
    Signed-off-by: Paolo Bonzini

    Alex Williamson
     
  • A NULL token is meaningless and can only lead to unintended problems.
    Return an error on registration with a NULL token, and ignore
    de-registrations with a NULL token. A sketch of the check follows at
    the end of this entry.

    Signed-off-by: Alex Williamson
    Signed-off-by: Paolo Bonzini

    Alex Williamson
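
    The change is essentially input validation; a hedged sketch (the
    function names carry a _sketch suffix to mark them as illustrative):

      /* Sketch: reject NULL tokens at registration, silently ignore them
       * at de-registration. */
      int irq_bypass_register_consumer_sketch(struct irq_bypass_consumer *consumer)
      {
              if (!consumer->token)
                      return -EINVAL;         /* a NULL token is meaningless */
              /* ... normal registration ... */
              return 0;
      }

      void irq_bypass_unregister_consumer_sketch(struct irq_bypass_consumer *consumer)
      {
              if (!consumer->token)
                      return;                 /* nothing could have been registered */
              /* ... normal de-registration ... */
      }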
     
  • The KVM_MAX_VCPUS define provides the maximum number of vCPUs per guest, and
    also the upper limit for vCPU ids. This is okay for all archs except PowerPC
    which can have higher ids, depending on the cpu/core/thread topology. In the
    worst case (single-threaded guest, host with 8 threads per core), it limits
    the maximum number of vCPUs to KVM_MAX_VCPUS / 8.

    This patch separates the vCPU numbering from the total number of vCPUs, with
    the introduction of KVM_MAX_VCPU_ID, as the maximal valid value for vCPU ids
    plus one.

    The corresponding KVM_CAP_MAX_VCPU_ID allows userspace to validate vCPU ids
    before passing them to KVM_CREATE_VCPU.

    This patch only implements KVM_MAX_VCPU_ID with a specific value for PowerPC.
    Other archs continue to return KVM_MAX_VCPUS instead. A userspace usage
    sketch follows at the end of this entry.

    Suggested-by: Radim Krcmar
    Signed-off-by: Greg Kurz
    Reviewed-by: Cornelia Huck
    Signed-off-by: Paolo Bonzini

    Greg Kurz
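
    A hedged userspace usage sketch of the new capability; error handling
    is elided and the chosen vCPU id is just an example:

      #include <fcntl.h>
      #include <stdio.h>
      #include <sys/ioctl.h>
      #include <linux/kvm.h>

      int main(void)
      {
              int kvm = open("/dev/kvm", O_RDWR);
              int vm  = ioctl(kvm, KVM_CREATE_VM, 0);

              /* Upper bound for vCPU ids; may exceed the vCPU count limit,
               * e.g. on PowerPC with sparse cpu/core/thread numbering. */
              int max_vcpus   = ioctl(vm, KVM_CHECK_EXTENSION, KVM_CAP_MAX_VCPUS);
              int max_vcpu_id = ioctl(vm, KVM_CHECK_EXTENSION, KVM_CAP_MAX_VCPU_ID);
              if (max_vcpu_id <= 0)
                      max_vcpu_id = max_vcpus;        /* older kernels: fall back */

              printf("max vcpus %d, max vcpu id %d\n", max_vcpus, max_vcpu_id);

              int id = 8;                             /* example id chosen by the VMM */
              if (id < max_vcpu_id)
                      ioctl(vm, KVM_CREATE_VCPU, id);
              return 0;
      }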
     

03 May, 2016

2 commits

  • Currently, the firmware tables are parsed twice: once in the GIC
    drivers, and a second time when initializing the vGIC. This means code
    duplication and makes it more tedious to add support for another
    firmware table (like ACPI).

    Use the recently introduced helper gic_get_kvm_info() to get
    information about the virtual GIC.

    With this change, the virtual GIC becomes agnostic to the firmware
    table and KVM will be able to initialize the vGIC on ACPI.

    Signed-off-by: Julien Grall
    Reviewed-by: Christoffer Dall
    Signed-off-by: Christoffer Dall

    Julien Grall
     
  • The firmware table is currently parsed by the virtual timer code in
    order to retrieve the virtual timer interrupt. However, this is already
    done by the arch timer driver.

    To avoid code duplication, use the newly introduced function
    arch_timer_get_kvm_info(), which returns all the information required
    by the virtual timer code.

    Signed-off-by: Julien Grall
    Acked-by: Christoffer Dall
    Signed-off-by: Christoffer Dall

    Julien Grall
     

06 Apr, 2016

1 commit

  • On a host that runs NTP, corrections can have a direct impact on
    the background timer that we program on behalf of a vcpu.

    In particular, NTP performing a forward correction will result in
    a timer expiring sooner than expected from a guest point of view.
    Not a big deal, we kick the vcpu anyway.

    But on wake-up, the vcpu thread is going to perform a check to
    find out whether or not it should block. And at that point, the
    timer check is going to say "timer has not expired yet, go back
    to sleep". This results in the timer event being lost forever.

    There are multiple ways to handle this. One would be to record that
    the timer has expired and let kvm_cpu_has_pending_timer return
    true in that case, but that would be fairly invasive. Another is
    to check for the "short sleep" condition in the hrtimer callback,
    and restart the timer for the remaining time when the condition
    is detected.

    This patch implements the latter, with a bit of refactoring in
    order to avoid too much code duplication. A sketch of the callback
    approach follows at the end of this entry.

    Cc:
    Reported-by: Alexander Graf
    Reviewed-by: Alexander Graf
    Signed-off-by: Marc Zyngier
    Signed-off-by: Christoffer Dall

    Marc Zyngier
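
    A hedged sketch of the "restart from the callback" approach; the
    container type, field names and the timer_remaining_ns()/timer_to_vcpu()
    helpers are illustrative, not the exact code:

      /* Sketch: if NTP made the soft timer fire earlier than the guest's
       * view of its expiry, re-arm it for the remaining time instead of
       * delivering an expiry that the blocking logic would then ignore. */
      static enum hrtimer_restart kvm_timer_expired_sketch(struct hrtimer *hrt)
      {
              struct arch_timer_cpu *timer = container_of(hrt, struct arch_timer_cpu, timer);
              u64 remaining_ns = timer_remaining_ns(timer);   /* guest-visible time left */

              if (remaining_ns) {
                      /* Fired too early: push the expiry out by what is left. */
                      hrtimer_forward_now(hrt, ns_to_ktime(remaining_ns));
                      return HRTIMER_RESTART;
              }

              kvm_vcpu_kick(timer_to_vcpu(timer));
              return HRTIMER_NORESTART;
      }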
     

01 Apr, 2016

1 commit

  • The kernel is written in C, not python, so we need braces around
    multi-line if statements. GCC 6 actually warns about this, thanks to the
    fantastic new "-Wmisleading-indentation" flag:

    | virt/kvm/arm/pmu.c: In function ‘kvm_pmu_overflow_status’:
    | virt/kvm/arm/pmu.c:198:3: warning: statement is indented as if it were guarded by... [-Wmisleading-indentation]
    | reg &= vcpu_sys_reg(vcpu, PMCNTENSET_EL0);
    | ^~~
    | arch/arm64/kvm/../../../virt/kvm/arm/pmu.c:196:2: note: ...this ‘if’ clause, but it is not
    | if ((vcpu_sys_reg(vcpu, PMCR_EL0) & ARMV8_PMU_PMCR_E))
    | ^~

    As it turns out, this particular case is harmless (we just do some &=
    operations with 0), but it is worth fixing nonetheless. A before/after
    sketch follows at the end of this entry.

    Signed-off-by: Will Deacon
    Signed-off-by: Christoffer Dall

    Will Deacon
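
    The fix amounts to adding the braces that the indentation already
    implied; a before/after sketch based on the warning quoted above:

      /* Before: only the first statement is guarded by the if. */
      if ((vcpu_sys_reg(vcpu, PMCR_EL0) & ARMV8_PMU_PMCR_E))
              reg &= vcpu_sys_reg(vcpu, PMCNTENSET_EL0);
              /* ...the following statements run unconditionally... */

      /* After: braces make the guarded block explicit. */
      if ((vcpu_sys_reg(vcpu, PMCR_EL0) & ARMV8_PMU_PMCR_E)) {
              reg &= vcpu_sys_reg(vcpu, PMCNTENSET_EL0);
              /* ...the following statements are now inside the if... */
      }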
     

22 Mar, 2016

3 commits

  • smp_load_acquire() is enough here and it's cheaper than smp_mb().
    Also add a comment about reusing the memory barrier of
    kvm_make_all_cpus_request() here to keep ordering between modifications
    to the page tables and reading the mode. A generic sketch of the
    pattern follows at the end of this entry.

    Signed-off-by: Lan Tianyu
    Signed-off-by: Paolo Bonzini

    Lan Tianyu
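
    A generic, hedged sketch of the pattern (the shared structure here is
    purely illustrative; in the real code the writer-side barrier is the
    one already contained in kvm_make_all_cpus_request()):

      /* Writer: publish the data, then set the flag with release
       * semantics (or behind an existing full barrier). */
      shared->data = new_value;
      smp_store_release(&shared->flag, 1);

      /* Reader: one acquire load replaces "plain load; smp_mb()".
       * Everything read after this line is ordered after the flag read. */
      if (smp_load_acquire(&shared->flag))
              use(shared->data);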
     
  • Signed-off-by: Lan Tianyu
    Signed-off-by: Paolo Bonzini

    Lan Tianyu
     
  • Moving the initialization earlier is needed in 4.6 because
    kvm_arch_init_vm is now using mmu_lock, causing lockdep to
    complain:

    [ 284.440294] INFO: trying to register non-static key.
    [ 284.445259] the code is fine but needs lockdep annotation.
    [ 284.450736] turning off the locking correctness validator.
    ...
    [ 284.528318] [] lock_acquire+0xd3/0x240
    [ 284.533733] [] ? kvm_page_track_register_notifier+0x20/0x60 [kvm]
    [ 284.541467] [] _raw_spin_lock+0x41/0x80
    [ 284.546960] [] ? kvm_page_track_register_notifier+0x20/0x60 [kvm]
    [ 284.554707] [] kvm_page_track_register_notifier+0x20/0x60 [kvm]
    [ 284.562281] [] kvm_mmu_init_vm+0x20/0x30 [kvm]
    [ 284.568381] [] kvm_arch_init_vm+0x1ea/0x200 [kvm]
    [ 284.574740] [] kvm_dev_ioctl+0xbf/0x4d0 [kvm]

    However, it also helps fixing a preexisting problem, which is why this
    patch is also good for stable kernels: kvm_create_vm was incrementing
    current->mm->mm_count but not decrementing it at the out_err label (in
    case kvm_init_mmu_notifier failed). The new initialization order makes
    it possible to add the required mmdrop without adding a new error label.

    Cc: stable@vger.kernel.org
    Reported-by: Borislav Petkov
    Signed-off-by: Paolo Bonzini

    Paolo Bonzini
     

21 Mar, 2016

1 commit

  • Pull x86 protection key support from Ingo Molnar:
    "This tree adds support for a new memory protection hardware feature
    that is available in upcoming Intel CPUs: 'protection keys' (pkeys).

    There's a background article at LWN.net:

    https://lwn.net/Articles/643797/

    The gist is that protection keys allow the encoding of
    user-controllable permission masks in the pte. So instead of having a
    fixed protection mask in the pte (which needs a system call to change
    and works on a per-page basis), the user can map a (handful of)
    protection mask variants and can change the masks at runtime relatively
    cheaply, without having to change every single page in the affected
    virtual memory range.

    This allows the dynamic switching of the protection bits of large
    amounts of virtual memory, via user-space instructions. It also
    allows more precise control of MMU permission bits: for example the
    executable bit is separate from the read bit (see more about that
    below).

    This tree adds the MM infrastructure and low level x86 glue needed for
    that, plus it adds a high level API to make use of protection keys -
    if a user-space application calls:

    mmap(..., PROT_EXEC);

    or

    mprotect(ptr, sz, PROT_EXEC);

    (note PROT_EXEC-only, without PROT_READ/WRITE), the kernel will notice
    this special case, and will set a special protection key on this
    memory range. It also sets the appropriate bits in the Protection
    Keys User Rights (PKRU) register so that the memory becomes unreadable
    and unwritable (a small userspace illustration follows at the end of
    this entry).

    So using protection keys the kernel is able to implement 'true'
    PROT_EXEC on x86 CPUs: without protection keys PROT_EXEC implies
    PROT_READ as well. Unreadable executable mappings have security
    advantages: they cannot be read via information leaks to figure out
    ASLR details, nor can they be scanned for ROP gadgets - and they
    cannot be used by exploits for data purposes either.

    We know about no user-space code that relies on pure PROT_EXEC
    mappings today, but binary loaders could start making use of this new
    feature to map binaries and libraries in a more secure fashion.

    There is other pending pkeys work that offers more high level system
    call APIs to manage protection keys - but those are not part of this
    pull request.

    Right now there's a Kconfig that controls this feature
    (CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS) that is default enabled
    (like most x86 CPU feature enablement code that has no runtime
    overhead), but it's not user-configurable at the moment. If there's
    any serious problem with this then we can make it configurable and/or
    flip the default"

    * 'mm-pkeys-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (38 commits)
    x86/mm/pkeys: Fix mismerge of protection keys CPUID bits
    mm/pkeys: Fix siginfo ABI breakage caused by new u64 field
    x86/mm/pkeys: Fix access_error() denial of writes to write-only VMA
    mm/core, x86/mm/pkeys: Add execute-only protection keys support
    x86/mm/pkeys: Create an x86 arch_calc_vm_prot_bits() for VMA flags
    x86/mm/pkeys: Allow kernel to modify user pkey rights register
    x86/fpu: Allow setting of XSAVE state
    x86/mm: Factor out LDT init from context init
    mm/core, x86/mm/pkeys: Add arch_validate_pkey()
    mm/core, arch, powerpc: Pass a protection key in to calc_vm_flag_bits()
    x86/mm/pkeys: Actually enable Memory Protection Keys in the CPU
    x86/mm/pkeys: Add Kconfig prompt to existing config option
    x86/mm/pkeys: Dump pkey from VMA in /proc/pid/smaps
    x86/mm/pkeys: Dump PKRU with other kernel registers
    mm/core, x86/mm/pkeys: Differentiate instruction fetches
    x86/mm/pkeys: Optimize fault handling in access_error()
    mm/core: Do not enforce PKEY permissions on remote mm access
    um, pkeys: Add UML arch_*_access_permitted() methods
    mm/gup, x86/mm/pkeys: Check VMAs and PTEs for protection keys
    x86/mm/gup: Simplify get_user_pages() PTE bit handling
    ...

    Linus Torvalds
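
    A small, hedged userspace illustration of the PROT_EXEC-only behaviour
    described above. On pkeys-capable hardware and kernels the read is
    expected to fault; without pkeys it may simply succeed:

      #include <stdio.h>
      #include <sys/mman.h>

      int main(void)
      {
              /* Execute-only mapping: no PROT_READ/PROT_WRITE requested. */
              unsigned char *p = mmap(NULL, 4096, PROT_EXEC,
                                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
              if (p == MAP_FAILED) {
                      perror("mmap");
                      return 1;
              }

              /* With protection keys the kernel tags this VMA with an
               * execute-only pkey, so this read should SIGSEGV instead of
               * silently succeeding. */
              printf("first byte: %d\n", p[0]);
              return 0;
      }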
     

17 Mar, 2016

1 commit

  • Pull KVM updates from Paolo Bonzini:
    "One of the largest releases for KVM... Hardly any generic
    changes, but lots of architecture-specific updates.

    ARM:
    - VHE support so that we can run the kernel at EL2 on ARMv8.1 systems
    - PMU support for guests
    - 32bit world switch rewritten in C
    - various optimizations to the vgic save/restore code.

    PPC:
    - enabled KVM-VFIO integration ("VFIO device")
    - optimizations to speed up IPIs between vcpus
    - in-kernel handling of IOMMU hypercalls
    - support for dynamic DMA windows (DDW).

    s390:
    - provide the floating point registers via sync regs;
    - separated instruction vs. data accesses
    - dirty log improvements for huge guests
    - bugfixes and documentation improvements.

    x86:
    - Hyper-V VMBus hypercall userspace exit
    - alternative implementation of lowest-priority interrupts using
    vector hashing (for better VT-d posted interrupt support)
    - fixed guest debugging with nested virtualizations
    - improved interrupt tracking in the in-kernel IOAPIC
    - generic infrastructure for tracking writes to guest
    memory - currently its only use is to speed up the legacy shadow
    paging (pre-EPT) case, but in the future it will be used for
    virtual GPUs as well
    - much cleanup (LAPIC, kvmclock, MMU, PIT), including ubsan fixes"

    * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (217 commits)
    KVM: x86: remove eager_fpu field of struct kvm_vcpu_arch
    KVM: x86: disable MPX if host did not enable MPX XSAVE features
    arm64: KVM: vgic-v3: Only wipe LRs on vcpu exit
    arm64: KVM: vgic-v3: Reset LRs at boot time
    arm64: KVM: vgic-v3: Do not save an LR known to be empty
    arm64: KVM: vgic-v3: Save maintenance interrupt state only if required
    arm64: KVM: vgic-v3: Avoid accessing ICH registers
    KVM: arm/arm64: vgic-v2: Make GICD_SGIR quicker to hit
    KVM: arm/arm64: vgic-v2: Only wipe LRs on vcpu exit
    KVM: arm/arm64: vgic-v2: Reset LRs at boot time
    KVM: arm/arm64: vgic-v2: Do not save an LR known to be empty
    KVM: arm/arm64: vgic-v2: Move GICH_ELRSR saving to its own function
    KVM: arm/arm64: vgic-v2: Save maintenance interrupt state only if required
    KVM: arm/arm64: vgic-v2: Avoid accessing GICH registers
    KVM: s390: allocate only one DMA page per VM
    KVM: s390: enable STFLE interpretation only if enabled for the guest
    KVM: s390: wake up when the VCPU cpu timer expires
    KVM: s390: step the VCPU timer while in enabled wait
    KVM: s390: protect VCPU cpu timer with a seqcount
    KVM: s390: step VCPU cpu timer during kvm_run ioctl
    ...

    Linus Torvalds
     

15 Mar, 2016

1 commit

  • Pull scheduler updates from Ingo Molnar:
    "The main changes in this cycle are:

    - Make schedstats a runtime tunable (disabled by default) and
    optimize it via static keys.

    As most distributions enable CONFIG_SCHEDSTATS=y due to its
    instrumentation value, this is a nice performance enhancement.
    (Mel Gorman)

    - Implement 'simple waitqueues' (swait): these are just pure
    waitqueues without any of the more complex features of full-blown
    waitqueues (callbacks, wake flags, wake keys, etc.). Simple
    waitqueues have less memory overhead and are faster.

    Use simple waitqueues in the RCU code (in 4 different places) and
    for handling KVM vCPU wakeups.

    (Peter Zijlstra, Daniel Wagner, Thomas Gleixner, Paul Gortmaker,
    Marcelo Tosatti)

    - sched/numa enhancements (Rik van Riel)

    - NOHZ performance enhancements (Rik van Riel)

    - Various sched/deadline enhancements (Steven Rostedt)

    - Various fixes (Peter Zijlstra)

    - ... and a number of other fixes, cleanups and smaller enhancements"

    * 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (29 commits)
    sched/cputime: Fix steal_account_process_tick() to always return jiffies
    sched/deadline: Remove dl_new from struct sched_dl_entity
    Revert "kbuild: Add option to turn incompatible pointer check into error"
    sched/deadline: Remove superfluous call to switched_to_dl()
    sched/debug: Fix preempt_disable_ip recording for preempt_disable()
    sched, time: Switch VIRT_CPU_ACCOUNTING_GEN to jiffy granularity
    time, acct: Drop irq save & restore from __acct_update_integrals()
    acct, time: Change indentation in __acct_update_integrals()
    sched, time: Remove non-power-of-two divides from __acct_update_integrals()
    sched/rt: Kick RT bandwidth timer immediately on start up
    sched/debug: Add deadline scheduler bandwidth ratio to /proc/sched_debug
    sched/debug: Move sched_domain_sysctl to debug.c
    sched/debug: Move the /sys/kernel/debug/sched_features file setup into debug.c
    sched/rt: Fix PI handling vs. sched_setscheduler()
    sched/core: Remove duplicated sched_group_set_shares() prototype
    sched/fair: Consolidate nohz CPU load update code
    sched/fair: Avoid using decay_load_missed() with a negative value
    sched/deadline: Always calculate end of period on sched_yield()
    sched/cgroup: Fix cgroup entity load tracking tear-down
    rcu: Use simple wait queues where possible in rcutree
    ...

    Linus Torvalds
     

09 Mar, 2016

11 commits

  • When growing halt-polling, there is no check that the poll time exceeds
    the limit. It's possible for vcpu->halt_poll_ns to grow once past
    halt_poll_ns, and stay there until a halt which takes longer than
    vcpu->halt_poll_ns. For example, booting a Linux guest with
    halt_poll_ns=11000 (a sketch of the added clamp follows at the end of
    this entry):

    ... kvm:kvm_halt_poll_ns: vcpu 0: halt_poll_ns 0 (shrink 10000)
    ... kvm:kvm_halt_poll_ns: vcpu 0: halt_poll_ns 10000 (grow 0)
    ... kvm:kvm_halt_poll_ns: vcpu 0: halt_poll_ns 20000 (grow 10000)

    Signed-off-by: David Matlack
    Fixes: aca6ff29c4063a8d467cdee241e6b3bf7dc4a171
    Cc: stable@vger.kernel.org
    Signed-off-by: Paolo Bonzini

    David Matlack
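
    A hedged sketch of the missing bound: after growing, the per-vCPU poll
    window should be capped at the module-wide halt_poll_ns limit. The
    helper mirrors the existing grow logic but is illustrative, not the
    exact patch:

      static void grow_halt_poll_ns_sketch(struct kvm_vcpu *vcpu)
      {
              unsigned int val = vcpu->halt_poll_ns;

              val = val ? val * halt_poll_ns_grow : 10000;    /* 10us initial window */
              if (val > halt_poll_ns)                         /* the missing check */
                      val = halt_poll_ns;

              vcpu->halt_poll_ns = val;
      }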
     
  • KVM/ARM updates for 4.6

    - VHE support so that we can run the kernel at EL2 on ARMv8.1 systems
    - PMU support for guests
    - 32bit world switch rewritten in C
    - Various optimizations to the vgic save/restore code

    Conflicts:
    include/uapi/linux/kvm.h

    Paolo Bonzini
     
  • In order to let the GICv3 code be more lazy in the way it
    accesses the LRs, it is necessary to start with a clean slate.

    Let's reset the LRs on each CPU when the vgic is probed (which
    includes a round trip to EL2...).

    Reviewed-by: Christoffer Dall
    Signed-off-by: Marc Zyngier

    Marc Zyngier
     
  • Just like on GICv2, we're a bit hammer-happy with GICv3, and access
    its ICH registers more often than we should.

    Adopt a policy similar to what we do for GICv2, only save/restoring
    the minimal set of registers. As we don't access the registers
    linearly anymore (we may skip some), the convoluted accessors become
    slightly simpler, and we can drop the ugly indexing macro that
    tended to confuse the reviewers.

    Reviewed-by: Christoffer Dall
    Signed-off-by: Marc Zyngier

    Marc Zyngier
     
  • The GICD_SGIR register lives a long way from the beginning of
    the handler array, which is searched linearly. As this is hit
    pretty often, let's move it up. This saves us some precious
    cycles when the guest is generating IPIs.

    Acked-by: Christoffer Dall
    Signed-off-by: Marc Zyngier

    Marc Zyngier
     
  • So far, we're always writing all possible LRs, setting the empty
    ones with a zero value. This is obviously doing a lot of work for
    nothing, and we're better off clearing those we've actually
    dirtied on the exit path (it is very rare to inject more than one
    interrupt at a time anyway).

    Reviewed-by: Christoffer Dall
    Signed-off-by: Marc Zyngier

    Marc Zyngier
     
  • In order to make the GICv2 code more lazy in the way it
    accesses the LRs, it is necessary to start with a clean slate.

    Let's reset the LRs on each CPU when the vgic is probed.

    Reviewed-by: Christoffer Dall
    Signed-off-by: Marc Zyngier

    Marc Zyngier
     
  • On exit, any empty LR will be signaled in GICH_ELRSR*. Which
    means that we do not have to save it, and we can just clear
    its state in the in-memory copy.

    Take this opportunity to move the LR saving code into its
    own function.

    Reviewed-by: Christoffer Dall
    Signed-off-by: Marc Zyngier

    Marc Zyngier
     
  • In order to make the saving path slightly more readable and
    prepare for some more optimizations, let's move the GICH_ELRSR
    saving to its own function.

    No functional change.

    Reviewed-by: Christoffer Dall
    Signed-off-by: Marc Zyngier

    Marc Zyngier
     
  • Next on our list of useless accesses is the maintenance interrupt
    status registers (GICH_MISR, GICH_EISR{0,1}).

    It is pointless to save them if we haven't asked for a maintenance
    interrupt in the first place, which can only happen for two reasons:
    - Underflow: GICH_HCR_UIE will be set,
    - EOI: GICH_LR_EOI will be set.

    These conditions can be checked on the in-memory copies of the regs.
    Should either of these two conditions be true, we must read GICH_MISR.
    We can then check for GICH_MISR_EOI, and only when it is set read
    GICH_EISR*.

    This means that in most cases, we don't have to save them at all.

    Reviewed-by: Christoffer Dall
    Signed-off-by: Marc Zyngier

    Marc Zyngier
     
  • GICv2 registers are *slow*. As in "terrifyingly slow". Which is bad.
    But we're equally bad, as we make a point of accessing them even if
    we don't have any interrupt in flight.

    A good solution is to first find out if we have anything useful to
    write into the GIC, and if we don't, to simply not do it. This
    involves tracking which LRs actually have something valid there.
    A sketch of that tracking follows at the end of this entry.

    Reviewed-by: Christoffer Dall
    Signed-off-by: Marc Zyngier

    Marc Zyngier
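
    A hedged sketch of the "track which LRs are in use" idea; the live_lrs
    bitmap and the exact save loop are illustrative:

      /* Sketch: remember which list registers were populated on entry and
       * only touch those on exit. If none were used, skip the (very slow)
       * GICv2 MMIO accesses entirely. */
      static void save_lrs_sketch(struct kvm_vcpu *vcpu, void __iomem *base, int nr_lr)
      {
              unsigned long live_lrs = vcpu->arch.vgic_cpu.live_lrs;  /* set on entry */
              int lr;

              if (!live_lrs)
                      return;                 /* nothing in flight: no MMIO at all */

              for_each_set_bit(lr, &live_lrs, nr_lr)
                      vcpu->arch.vgic_cpu.vgic_v2.vgic_lr[lr] =
                              readl_relaxed(base + GICH_LR0 + lr * 4);
      }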
     

04 Mar, 2016

1 commit

  • For the kvm_is_error_hva case, ubsan complains if the uninitialized
    writable is passed to __direct_map, even though the value itself is not
    used (__direct_map goes to mmu_set_spte->set_spte->set_mmio_spte but
    never looks at that argument).

    Ensuring that __gfn_to_pfn_memslot initializes *writable is cheap and
    avoids this kind of issue. A sketch follows at the end of this entry.

    Signed-off-by: Paolo Bonzini

    Paolo Bonzini
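
    A hedged sketch of the defensive initialization (the function carries a
    _sketch suffix; the real __gfn_to_pfn_memslot takes more parameters):

      kvm_pfn_t gfn_to_pfn_memslot_sketch(struct kvm_memory_slot *slot, gfn_t gfn,
                                          bool write_fault, bool *writable)
      {
              unsigned long addr = __gfn_to_hva_many(slot, gfn, NULL, write_fault);

              if (kvm_is_error_hva(addr)) {
                      if (writable)
                              *writable = false;  /* cheap; keeps the value defined */
                      return KVM_PFN_NOSLOT;
              }

              /* ... normal translation path, which also sets *writable ... */
              return 0;                           /* placeholder for the real pfn */
      }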
     

01 Mar, 2016

12 commits

  • Programming the active state in the (re)distributor can be an
    expensive operation so it makes some sense to try and reduce
    the number of accesses as much as possible. So far, we
    program the active state on each VM entry, but there is some
    opportunity to do less.

    An obvious solution is to cache the active state in memory,
    and only program it in the HW when conditions change. But
    because the HW can also change things under our feet (the active
    state can transition from 1 to 0 when the guest does an EOI),
    some precautions have to be taken, which amount to only caching
    an "inactive" state, and always programming it otherwise.

    With this in place, we observe a reduction of around 700 cycles
    on a 2GHz GICv2 platform for a NULL hypercall.

    Reviewed-by: Christoffer Dall
    Signed-off-by: Marc Zyngier

    Marc Zyngier
     
  • To configure the virtual PMUv3 overflow interrupt number, we use the
    vcpu kvm_device ioctl, encapsulating the KVM_ARM_VCPU_PMU_V3_IRQ
    attribute within the KVM_ARM_VCPU_PMU_V3_CTRL group.

    After configuring the PMUv3, call the vcpu ioctl with attribute
    KVM_ARM_VCPU_PMU_V3_INIT to initialize the PMUv3.

    Signed-off-by: Shannon Zhao
    Acked-by: Peter Maydell
    Reviewed-by: Andrew Jones
    Reviewed-by: Christoffer Dall
    Signed-off-by: Marc Zyngier

    Shannon Zhao
     
  • To support guest PMUv3, use one bit of the VCPU INIT feature array.
    Initialize the PMU when initializing the vcpu with that bit and the PMU
    overflow interrupt set.

    Signed-off-by: Shannon Zhao
    Acked-by: Peter Maydell
    Reviewed-by: Andrew Jones
    Signed-off-by: Marc Zyngier

    Shannon Zhao
     
  • When KVM frees a VCPU, it needs to free the perf_event of the PMU.

    Signed-off-by: Shannon Zhao
    Reviewed-by: Marc Zyngier
    Reviewed-by: Andrew Jones
    Signed-off-by: Marc Zyngier

    Shannon Zhao
     
  • When resetting a vcpu, it needs to reset the PMU state to its initial status.

    Signed-off-by: Shannon Zhao
    Reviewed-by: Marc Zyngier
    Reviewed-by: Andrew Jones
    Signed-off-by: Marc Zyngier

    Shannon Zhao
     
  • When calling perf_event_create_kernel_counter to create the perf_event,
    assign an overflow handler. Then, when the perf event overflows, set the
    corresponding bit of the guest PMOVSSET register. If this counter is
    enabled and its interrupt is enabled as well, kick the vcpu to sync the
    interrupt.

    On VM entry, if a counter has overflowed and the interrupt level has
    changed, inject the interrupt with the corresponding level. On VM exit,
    sync the interrupt level as well if it has changed. A sketch of the
    overflow handler follows at the end of this entry.

    Signed-off-by: Shannon Zhao
    Reviewed-by: Marc Zyngier
    Reviewed-by: Andrew Jones
    Reviewed-by: Christoffer Dall
    Signed-off-by: Marc Zyngier

    Shannon Zhao
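
    A hedged sketch of the overflow path; helper names such as
    pmc_to_vcpu() are illustrative:

      /* Sketch: perf overflow callback for a guest counter. Flag the
       * counter's bit in PMOVSSET and kick the vCPU so the interrupt level
       * is synchronized on the next VM entry. */
      static void pmu_perf_overflow_sketch(struct perf_event *event,
                                           struct perf_sample_data *data,
                                           struct pt_regs *regs)
      {
              struct kvm_pmc *pmc = event->overflow_handler_context;
              struct kvm_vcpu *vcpu = pmc_to_vcpu(pmc);           /* illustrative */

              vcpu_sys_reg(vcpu, PMOVSSET_EL0) |= BIT(pmc->idx);  /* record overflow */
              kvm_vcpu_kick(vcpu);                                /* sync on entry */
      }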
     
  • According to the ARMv8 spec, writing 1 to PMCR.E enables all counters
    selected by PMCNTENSET, while writing 0 to PMCR.E disables all counters.
    Writing 1 to PMCR.P resets all event counters, not including PMCCNTR,
    to zero. Writing 1 to PMCR.C resets PMCCNTR to zero. A sketch of this
    handling follows at the end of this entry.

    Signed-off-by: Shannon Zhao
    Reviewed-by: Marc Zyngier
    Signed-off-by: Marc Zyngier

    Shannon Zhao
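
    A hedged sketch of the described PMCR write handling; the
    kvm_pmu_*_sketch helpers are illustrative stand-ins for the PMU code:

      static void handle_pmcr_sketch(struct kvm_vcpu *vcpu, u64 val)
      {
              u64 enabled = vcpu_sys_reg(vcpu, PMCNTENSET_EL0);
              int i;

              if (val & ARMV8_PMU_PMCR_E)
                      kvm_pmu_enable_counters_sketch(vcpu, enabled);
              else
                      kvm_pmu_disable_counters_sketch(vcpu, enabled);

              if (val & ARMV8_PMU_PMCR_C)             /* reset the cycle counter */
                      kvm_pmu_set_counter_sketch(vcpu, ARMV8_PMU_CYCLE_IDX, 0);

              if (val & ARMV8_PMU_PMCR_P)             /* reset event counters, not PMCCNTR */
                      for (i = 0; i < ARMV8_PMU_CYCLE_IDX; i++)
                              kvm_pmu_set_counter_sketch(vcpu, i, 0);
      }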
     
  • Add an access handler which emulates writing and reading the PMSWINC
    register, and add support for creating the software increment event.

    Signed-off-by: Shannon Zhao
    Reviewed-by: Andrew Jones
    Signed-off-by: Marc Zyngier

    Shannon Zhao
     
  • Since the reset value of PMOVSSET and PMOVSCLR is UNKNOWN, use
    reset_unknown for their reset handlers. Add a handler to emulate writing
    the PMOVSSET or PMOVSCLR register.

    When a non-zero value is written to PMOVSSET while the counter and its
    interrupt are enabled, kick this vcpu to sync the PMU interrupt.

    Signed-off-by: Shannon Zhao
    Reviewed-by: Andrew Jones
    Signed-off-by: Marc Zyngier

    Shannon Zhao
     
  • When we use tools like perf on the host, perf passes the event type and
    the id of this event type category to the kernel, and the kernel maps
    them to a hardware event number and writes this number to the PMU
    PMEVTYPER_EL0 register. When getting the event number in KVM, directly
    use the raw event type to create a perf_event for it.

    Signed-off-by: Shannon Zhao
    Reviewed-by: Marc Zyngier
    Signed-off-by: Marc Zyngier

    Shannon Zhao
     
  • Since the reset value of PMCNTENSET and PMCNTENCLR is UNKNOWN, use
    reset_unknown for their reset handlers. Add a handler to emulate writing
    the PMCNTENSET or PMCNTENCLR register.

    When writing to PMCNTENSET, call perf_event_enable to enable the perf
    event. When writing to PMCNTENCLR, call perf_event_disable to disable
    the perf event. A sketch of this mapping follows at the end of this
    entry.

    Signed-off-by: Shannon Zhao
    Signed-off-by: Marc Zyngier

    Shannon Zhao
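
    A hedged sketch of how the set/clear writes map onto perf; the pmu/pmc
    field layout is approximate:

      /* Sketch: each bit in the written value selects one counter.
       * PMCNTENSET enables the backing perf event, PMCNTENCLR disables it. */
      static void toggle_counters_sketch(struct kvm_vcpu *vcpu, u64 val, bool enable)
      {
              int i;

              for (i = 0; i < ARMV8_PMU_MAX_COUNTERS; i++) {
                      struct kvm_pmc *pmc = &vcpu->arch.pmu.pmc[i];

                      if (!(val & BIT(i)) || !pmc->perf_event)
                              continue;

                      if (enable)
                              perf_event_enable(pmc->perf_event);
                      else
                              perf_event_disable(pmc->perf_event);
              }
      }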
     
  • These kinds of registers include PMEVCNTRn, PMCCNTR and PMXEVCNTR, which
    is mapped to PMEVCNTRn.

    The access handler translates all aarch32 register offsets to aarch64
    ones and uses vcpu_sys_reg() to access their values, avoiding the need
    to deal with endianness.

    When reading these registers, return the sum of the register value and
    the value the perf event has counted. A sketch of this read-out follows
    at the end of this entry.

    Signed-off-by: Shannon Zhao
    Reviewed-by: Andrew Jones
    Signed-off-by: Marc Zyngier

    Shannon Zhao
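
    A hedged sketch of the "stored register value plus what perf has
    counted" read-out; register indexing and field names are approximate:

      static u64 get_counter_value_sketch(struct kvm_vcpu *vcpu, u64 select_idx)
      {
              struct kvm_pmc *pmc = &vcpu->arch.pmu.pmc[select_idx];
              u64 reg = (select_idx == ARMV8_PMU_CYCLE_IDX)
                              ? PMCCNTR_EL0 : PMEVCNTR0_EL0 + select_idx;
              u64 counter = vcpu_sys_reg(vcpu, reg);
              u64 enabled, running;

              if (pmc->perf_event)
                      counter += perf_event_read_value(pmc->perf_event,
                                                       &enabled, &running);

              /* Event counters are 32-bit; the cycle counter is 64-bit. */
              return (select_idx == ARMV8_PMU_CYCLE_IDX) ? counter : (u32)counter;
      }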