15 May, 2019

1 commit

  • Use the mmu_notifier_range_blockable() helper function instead of directly
    dereferencing the range->blockable field. This is done to make it easier
    to change the mmu_notifier range field.
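
    As a minimal sketch of the pattern (a simplified stand-in struct, not the
    kernel's real definition; example_invalidate_range_start() is a
    hypothetical caller), the accessor hides the field from its users:

```c
#include <errno.h>
#include <stdbool.h>

/* Simplified stand-in for the kernel's struct mmu_notifier_range. */
struct mmu_notifier_range {
	unsigned long start;
	unsigned long end;
	bool blockable;
};

/* Callers go through the helper, so the field itself can change later. */
static inline bool
mmu_notifier_range_blockable(const struct mmu_notifier_range *range)
{
	return range->blockable;
}

/* Hypothetical caller: refuse to sleep when the range is not blockable. */
static int
example_invalidate_range_start(const struct mmu_notifier_range *range)
{
	if (!mmu_notifier_range_blockable(range))
		return -EAGAIN;
	return 0;
}
```

    If the field is later renamed or reshaped, only the helper body needs to
    change; callers stay as they are.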

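    This patch is the outcome of the following coccinelle patch:

    %<-------------------------------------------------------------------
    @@
    identifier I1, FN;
    @@
    FN(..., struct mmu_notifier_range *I1, ...) {
    <...
    -I1->blockable
    +mmu_notifier_range_blockable(I1)
    ...>
    }
    ------------------------------------------------------------------->%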

    spatch --in-place --sp-file blockable.spatch --dir .

    Link: http://lkml.kernel.org/r/20190326164747.24405-3-jglisse@redhat.com
    Signed-off-by: Jérôme Glisse
    Reviewed-by: Ralph Campbell
    Reviewed-by: Ira Weiny
    Cc: Christian König
    Cc: Joonas Lahtinen
    Cc: Jani Nikula
    Cc: Rodrigo Vivi
    Cc: Jan Kara
    Cc: Andrea Arcangeli
    Cc: Peter Xu
    Cc: Felix Kuehling
    Cc: Jason Gunthorpe
    Cc: Ross Zwisler
    Cc: Dan Williams
    Cc: Paolo Bonzini
    Cc: Radim Krcmar
    Cc: Michal Hocko
    Cc: Christian Koenig
    Cc: John Hubbard
    Cc: Arnd Bergmann
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jérôme Glisse
     

07 May, 2019

1 commit

  • Pull arm64 updates from Will Deacon:
    "Mostly just incremental improvements here:

    - Introduce AT_HWCAP2 for advertising CPU features to userspace

    - Expose SVE2 availability to userspace

    - Support for "data cache clean to point of deep persistence" (DC PODP)

    - Honour "mitigations=off" on the cmdline and advertise status via
    sysfs

    - CPU timer erratum workaround (Neoverse-N1 #1188873)

    - Introduce perf PMU driver for the SMMUv3 performance counters

    - Add config option to disable the kuser helpers page for AArch32 tasks

    - Futex modifications to ensure liveness under contention

    - Rework debug exception handling to separate kernel and user
    handlers

    - Non-critical fixes and cleanup"

    * tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux: (92 commits)
    Documentation: Add ARM64 to kernel-parameters.rst
    arm64/speculation: Support 'mitigations=' cmdline option
    arm64: ssbs: Don't treat CPUs with SSBS as unaffected by SSB
    arm64: enable generic CPU vulnerabilites support
    arm64: add sysfs vulnerability show for speculative store bypass
    arm64: Fix size of __early_cpu_boot_status
    clocksource/arm_arch_timer: Use arch_timer_read_counter to access stable counters
    clocksource/arm_arch_timer: Remove use of workaround static key
    clocksource/arm_arch_timer: Drop use of static key in arch_timer_reg_read_stable
    clocksource/arm_arch_timer: Direcly assign set_next_event workaround
    arm64: Use arch_timer_read_counter instead of arch_counter_get_cntvct
    watchdog/sbsa: Use arch_timer_read_counter instead of arch_counter_get_cntvct
    ARM: vdso: Remove dependency with the arch_timer driver internals
    arm64: Apply ARM64_ERRATUM_1188873 to Neoverse-N1
    arm64: Add part number for Neoverse N1
    arm64: Make ARM64_ERRATUM_1188873 depend on COMPAT
    arm64: Restrict ARM64_ERRATUM_1188873 mitigation to AArch32
    arm64: mm: Remove pte_unmap_nested()
    arm64: Fix compiler warning from pte_unmap() with -Wunused-but-set-variable
    arm64: compat: Reduce address limit for 64K pages
    ...

    Linus Torvalds
     

01 May, 2019

2 commits

  • …git/kvmarm/kvmarm into kvm-master

    KVM/ARM fixes for 5.1, take #2:

    - Don't try to emulate timers on userspace access
    - Fix unaligned huge mappings, again
    - Properly reset a vcpu that fails to reset(!)
    - Properly retire pending LPIs on reset
    - Fix computation of emulated CNTP_TVAL

    Paolo Bonzini
     
  • If a memory slot's size is not a multiple of 64 pages (256K), then
    the KVM_CLEAR_DIRTY_LOG API is unusable: clearing the final 64 pages
    either requires the requested page range to go beyond memslot->npages,
    or requires log->num_pages to be unaligned, and kvm_clear_dirty_log_protect
    requires log->num_pages to be both in range and aligned.

    To allow this case, allow log->num_pages not to be a multiple of 64 if
    it ends exactly on the last page of the slot.
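
    The relaxed rule can be sketched as a standalone range check. This is an
    illustration only: toy_clear_dirty_log_range_ok() is a hypothetical name,
    and the three values are assumed to be passed in directly rather than
    read from the memslot and log structures.

```c
#include <errno.h>
#include <stdint.h>

/*
 * npages:     size of the memory slot in pages
 * first_page: first page of the requested range (must be 64-aligned)
 * num_pages:  length of the requested range
 */
static int toy_clear_dirty_log_range_ok(uint64_t npages, uint64_t first_page,
					uint64_t num_pages)
{
	/* The start of the range must always be aligned to 64 pages. */
	if (first_page & 63)
		return -EINVAL;

	/* The range must lie entirely inside the slot. */
	if (first_page > npages || num_pages > npages - first_page)
		return -EINVAL;

	/*
	 * An unaligned length is accepted only when the range ends
	 * exactly on the last page of the slot.
	 */
	if (num_pages < npages - first_page && (num_pages & 63))
		return -EINVAL;

	return 0;
}
```

    For a 100-page slot, a request for first_page = 64 and num_pages = 36
    passes, while num_pages = 30 from the same start is rejected.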

    Reported-by: Peter Xu
    Fixes: 98938aa8edd6 ("KVM: validate userspace input in kvm_clear_dirty_log_protect()", 2019-01-02)
    Signed-off-by: Paolo Bonzini

    Paolo Bonzini
     

25 Apr, 2019

3 commits

  • When a VCPU never runs before a guest exists, but we set timer registers
    up via ioctls, the associated hrtimer might never get cancelled.

    Since we moved vcpu_load/put into the arch-specific implementations and
    only have load/put for KVM_RUN, we won't ever have a scheduled hrtimer
    for emulating a timer when modifying the timer state via an ioctl from
    user space. All we need to do is make sure that we pick up the right
    state when we load the timer state next time userspace calls KVM_RUN
    again.

    We also do not need to worry about this interacting with the bg_timer,
    because if we were in WFI from the guest, and somehow ended up in a
    kvm_arm_timer_set_reg, it means that:

    1. the VCPU thread has received a signal,
    2. we have called vcpu_load when being scheduled in again,
    3. we have called vcpu_put when we returned to userspace for it to issue
    another ioctl

    And therefore will not have a bg_timer programmed and the event is
    treated as a spurious wakeup from WFI if userspace decides to run the
    vcpu again even if there are no virtual interrupts.

    This fixes stray virtual timer interrupts triggered by an expiring
    hrtimer, which happens after a failed live migration, for instance.

    Fixes: bee038a674875 ("KVM: arm/arm64: Rework the timer code to use a timer_map")
    Signed-off-by: Christoffer Dall
    Reported-by: Andre Przywara
    Tested-by: Andre Przywara
    Signed-off-by: Andre Przywara
    Signed-off-by: Marc Zyngier

    Christoffer Dall
     
  • With commit a80868f398554842b14, we no longer ensure that the
    THP page is properly aligned in the guest IPA. Skip the stage2
    huge mapping for unaligned IPA backed by transparent hugepages.

    Fixes: a80868f398554842b14 ("KVM: arm/arm64: Enforce PTE mappings at stage2 when needed")
    Reported-by: Eric Auger
    Cc: Marc Zyngier
    Cc: Christoffer Dall
    Cc: Zenghui Yu
    Cc: Zheng Xiang
    Cc: Andrew Murray
    Cc: Eric Auger
    Signed-off-by: Suzuki K Poulose
    Signed-off-by: Marc Zyngier

    Suzuki K Poulose
     
  • A failed KVM_ARM_VCPU_INIT should not set the vcpu target,
    as the vcpu target is used by kvm_vcpu_initialized() to
    determine if other vcpu ioctls may proceed. We need to set
    the target before calling kvm_reset_vcpu(), but if that call
    fails, we should then unset it and clear the feature bitmap
    while we're at it.
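
    The set-then-roll-back ordering can be sketched with toy types. All
    names below (toy_vcpu, toy_reset_vcpu(), the fail flag) are hypothetical
    stand-ins and do not mirror the kernel structures.

```c
#include <errno.h>

#define TOY_TARGET_UNSET (-1)

struct toy_vcpu {
	int target;             /* TOY_TARGET_UNSET until init succeeds */
	unsigned long features; /* feature bitmap, one bit per feature */
};

/* Hypothetical reset hook; fails on request to exercise the rollback. */
static int toy_reset_vcpu(struct toy_vcpu *vcpu, int fail)
{
	(void)vcpu;
	return fail ? -EINVAL : 0;
}

static int toy_vcpu_init(struct toy_vcpu *vcpu, int target,
			 unsigned long features, int fail_reset)
{
	int ret;

	/* The target has to be in place before the reset runs... */
	vcpu->target = target;
	vcpu->features = features;

	ret = toy_reset_vcpu(vcpu, fail_reset);
	if (ret) {
		/* ...but a failed reset must leave the vcpu uninitialized. */
		vcpu->target = TOY_TARGET_UNSET;
		vcpu->features = 0;
	}
	return ret;
}

/* Other ioctls would be gated on this check. */
static int toy_vcpu_initialized(const struct toy_vcpu *vcpu)
{
	return vcpu->target != TOY_TARGET_UNSET;
}
```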

    Signed-off-by: Andrew Jones
    [maz: Simplified patch, completed commit message]
    Signed-off-by: Marc Zyngier

    Andrew Jones
     

16 Apr, 2019

1 commit


09 Apr, 2019

1 commit

  • ARM64 standard pgtable functions are going to use pgtable_page_[ctor|dtor]
    or pgtable_pmd_page_[ctor|dtor] constructs. At present KVM guest stage-2
    PUD|PMD|PTE level page table pages are allocated with __get_free_page()
    via mmu_memory_cache_alloc() but released with standard pud|pmd_free() or
    pte_free_kernel(). These will fail once they start calling into
    pgtable_[pmd]_page_dtor() for pages which never went through the respective
    constructor functions. Hence convert all stage-2 page table page release
    functions to call buddy directly while freeing pages.

    Reviewed-by: Suzuki K Poulose
    Acked-by: Yu Zhao
    Acked-by: Marc Zyngier
    Signed-off-by: Anshuman Khandual
    Signed-off-by: Will Deacon

    Anshuman Khandual
     

03 Apr, 2019

1 commit

  • When disabling LPIs (for example on reset) at the redistributor
    level, it is expected that LPIs that were pending in the CPU
    interface are eventually retired.

    Currently, this is not what is happening, and these LPIs will
    stay in the ap_list, eventually being acknowledged by the vcpu
    (which didn't quite expect this behaviour).

    The fix is thus to retire these LPIs from the list of pending
    interrupts as we disable LPIs.

    Reported-by: Heyi Guo
    Tested-by: Heyi Guo
    Fixes: 0e4e82f154e3 ("KVM: arm64: vgic-its: Enable ITS emulation as a virtual MSI controller")
    Signed-off-by: Marc Zyngier

    Marc Zyngier
     

30 Mar, 2019

1 commit

  • Recently the generic timer test of kvm-unit-tests failed to complete
    (stalled) when a physical timer is being used. This issue is caused
    by incorrect update of CNTP_CVAL when CNTP_TVAL is being accessed,
    introduced by 'Commit 84135d3d18da ("KVM: arm/arm64: consolidate arch
    timer trap handlers")'. According to Arm ARM, the read/write behavior
    of accesses to the TVAL registers is expected to be:

    * READ: TimerValue = (CompareValue - (Counter - Offset))
    * WRITE: CompareValue = ((Counter - Offset) + Sign(TimerValue))

    This patch fixes the TVAL read/write code path according to the
    specification.
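
    The two formulas can be written out as a short sketch. It assumes that
    TimerValue is a 32-bit quantity and that Sign() means sign extension to
    64 bits; toy_tval_read() and toy_tval_write() are hypothetical names, not
    the kernel's functions.

```c
#include <stdint.h>

/* READ: TimerValue = CompareValue - (Counter - Offset), truncated to 32 bits. */
static uint32_t toy_tval_read(uint64_t cval, uint64_t counter, uint64_t offset)
{
	return (uint32_t)(cval - (counter - offset));
}

/* WRITE: CompareValue = (Counter - Offset) + sign-extended TimerValue. */
static uint64_t toy_tval_write(uint32_t tval, uint64_t counter, uint64_t offset)
{
	/* Going through int32_t makes a "negative" TVAL move CVAL backwards. */
	return (counter - offset) + (uint64_t)(int64_t)(int32_t)tval;
}
```

    Writing a TVAL and reading it back at the same counter value returns the
    value that was written, which is the round trip the test relies on.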

    Fixes: 84135d3d18da ("KVM: arm/arm64: consolidate arch timer trap handlers")
    Signed-off-by: Wei Huang
    [maz: commit message tidy-up]
    Signed-off-by: Marc Zyngier

    Wei Huang
     

29 Mar, 2019

3 commits

  • …t/kvmarm/kvmarm into kvm-master

    KVM/ARM fixes for 5.1

    - Fix THP handling in the presence of pre-existing PTEs
    - Honor request for PTE mappings even when THPs are available
    - GICv4 performance improvement
    - Take the srcu lock when writing to guest-controlled ITS data structures
    - Reset the virtual PMU in preemptible context
    - Various cleanups

    Paolo Bonzini
     
  • The function irqfd_wakeup() has flags defined as __poll_t and then it
    has an additional flags variable which is used for irqflags.

    Redefine the inner flags variable as iflags so it does not shadow the
    outer flags.

    Cc: Paolo Bonzini
    Cc: "Radim Krčmář"
    Cc: kvm@vger.kernel.org
    Signed-off-by: Sebastian Andrzej Siewior
    Signed-off-by: Paolo Bonzini

    Sebastian Andrzej Siewior
     
  • KVM's API requires that ioctls be issued from the same process
    that created the VM. In other words, userspace can play games with a
    VM's file descriptors, e.g. fork(), SCM_RIGHTS, etc..., but only the
    creator can do anything useful. Explicitly reject device ioctls that
    are issued by a process other than the VM's creator, and update KVM's
    API documentation to extend its requirements to device ioctls.
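
    A minimal sketch of such a check, assuming the VM remembers the address
    space of its creator. The toy_* types and the choice of -EIO as the
    error value are assumptions of this illustration.

```c
#include <errno.h>

struct toy_mm { int id; };

struct toy_vm {
	const struct toy_mm *mm; /* address space of the creating process */
};

/* Reject device ioctls issued from any process other than the creator. */
static long toy_device_ioctl(const struct toy_vm *vm,
			     const struct toy_mm *current_mm)
{
	if (vm->mm != current_mm)
		return -EIO;

	return 0; /* the real ioctl would be dispatched here */
}
```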

    Fixes: 852b6d57dc7f ("kvm: add device control API")
    Cc:
    Signed-off-by: Sean Christopherson
    Signed-off-by: Paolo Bonzini

    Sean Christopherson
     

28 Mar, 2019

1 commit


21 Mar, 2019

2 commits

  • Fix sparse warnings:

    arch/arm64/kvm/../../../virt/kvm/arm/vgic/vgic-its.c:1732:5: warning:
    symbol 'vgic_its_has_attr_regs' was not declared. Should it be static?
    arch/arm64/kvm/../../../virt/kvm/arm/vgic/vgic-its.c:1753:5: warning:
    symbol 'vgic_its_attr_regs_access' was not declared. Should it be static?

    Signed-off-by: YueHaibing
    [maz: fixed subject]
    Signed-off-by: Marc Zyngier

    YueHaibing
     
  • We rely on the mmu_notifier callbacks to handle the split/merge
    of huge pages and thus we are guaranteed that, while creating a
    block mapping, either the entire block is unmapped at stage2 or it
    is missing permission.

    However, we miss the case where a block mapping is split for dirty
    logging and could later be made a block mapping again if dirty logging
    is cancelled. This not only creates inconsistent TLB entries for
    the pages in the block, but also leaks the table pages for
    PMD level.

    Handle this corner case for the huge mappings at stage2 by
    unmapping the non-huge mapping for the block. This could potentially
    release the upper level table. So we need to restart the table walk
    once we unmap the range.

    Fixes: ad361f093c1e31d ("KVM: ARM: Support hugetlbfs backed huge pages")
    Reported-by: Zheng Xiang
    Cc: Zheng Xiang
    Cc: Zenghui Yu
    Cc: Christoffer Dall
    Signed-off-by: Suzuki K Poulose
    Signed-off-by: Marc Zyngier

    Suzuki K Poulose
     

20 Mar, 2019

4 commits

  • commit 6794ad5443a2118 ("KVM: arm/arm64: Fix unintended stage 2 PMD mappings")
    made the checks to skip huge mappings stricter. However, it introduced
    a bug where we still use huge mappings, ignoring the flag to
    use PTE mappings, by not resetting the vma_pagesize to PAGE_SIZE.

    Also, the checks do not cover the PUD huge pages, which were
    under review during the same period. This patch fixes both
    issues.

    Fixes: 6794ad5443a2118 ("KVM: arm/arm64: Fix unintended stage 2 PMD mappings")
    Reported-by: Zenghui Yu
    Cc: Zenghui Yu
    Cc: Christoffer Dall
    Signed-off-by: Suzuki K Poulose
    Signed-off-by: Marc Zyngier

    Suzuki K Poulose
     
  • Calling kvm_is_visible_gfn() implies that we're parsing the memslots,
    and doing this without the srcu lock is frowned upon:

    [12704.164532] =============================
    [12704.164544] WARNING: suspicious RCU usage
    [12704.164560] 5.1.0-rc1-00008-g600025238f51-dirty #16 Tainted: G W
    [12704.164573] -----------------------------
    [12704.164589] ./include/linux/kvm_host.h:605 suspicious rcu_dereference_check() usage!
    [12704.164602] other info that might help us debug this:
    [12704.164616] rcu_scheduler_active = 2, debug_locks = 1
    [12704.164631] 6 locks held by qemu-system-aar/13968:
    [12704.164644] #0: 000000007ebdae4f (&kvm->lock){+.+.}, at: vgic_its_set_attr+0x244/0x3a0
    [12704.164691] #1: 000000007d751022 (&its->its_lock){+.+.}, at: vgic_its_set_attr+0x250/0x3a0
    [12704.164726] #2: 00000000219d2706 (&vcpu->mutex){+.+.}, at: lock_all_vcpus+0x64/0xd0
    [12704.164761] #3: 00000000a760aecd (&vcpu->mutex){+.+.}, at: lock_all_vcpus+0x64/0xd0
    [12704.164794] #4: 000000000ef8e31d (&vcpu->mutex){+.+.}, at: lock_all_vcpus+0x64/0xd0
    [12704.164827] #5: 000000007a872093 (&vcpu->mutex){+.+.}, at: lock_all_vcpus+0x64/0xd0
    [12704.164861] stack backtrace:
    [12704.164878] CPU: 2 PID: 13968 Comm: qemu-system-aar Tainted: G W 5.1.0-rc1-00008-g600025238f51-dirty #16
    [12704.164887] Hardware name: rockchip evb_rk3399/evb_rk3399, BIOS 2019.04-rc3-00124-g2feec69fb1 03/15/2019
    [12704.164896] Call trace:
    [12704.164910] dump_backtrace+0x0/0x138
    [12704.164920] show_stack+0x24/0x30
    [12704.164934] dump_stack+0xbc/0x104
    [12704.164946] lockdep_rcu_suspicious+0xcc/0x110
    [12704.164958] gfn_to_memslot+0x174/0x190
    [12704.164969] kvm_is_visible_gfn+0x28/0x70
    [12704.164980] vgic_its_check_id.isra.0+0xec/0x1e8
    [12704.164991] vgic_its_save_tables_v0+0x1ac/0x330
    [12704.165001] vgic_its_set_attr+0x298/0x3a0
    [12704.165012] kvm_device_ioctl_attr+0x9c/0xd8
    [12704.165022] kvm_device_ioctl+0x8c/0xf8
    [12704.165035] do_vfs_ioctl+0xc8/0x960
    [12704.165045] ksys_ioctl+0x8c/0xa0
    [12704.165055] __arm64_sys_ioctl+0x28/0x38
    [12704.165067] el0_svc_common+0xd8/0x138
    [12704.165078] el0_svc_handler+0x38/0x78
    [12704.165089] el0_svc+0x8/0xc

    Make sure the lock is taken when doing this.

    Fixes: bf308242ab98 ("KVM: arm/arm64: VGIC/ITS: protect kvm_read_guest() calls with SRCU lock")
    Reviewed-by: Eric Auger
    Signed-off-by: Marc Zyngier

    Marc Zyngier
     
  • When halting a guest, QEMU flushes the virtual ITS caches, which
    amounts to writing to the various tables that the guest has allocated.

    When doing this, we fail to take the srcu lock, and the kernel
    shouts loudly if running a lockdep kernel:

    [ 69.680416] =============================
    [ 69.680819] WARNING: suspicious RCU usage
    [ 69.681526] 5.1.0-rc1-00008-g600025238f51-dirty #18 Not tainted
    [ 69.682096] -----------------------------
    [ 69.682501] ./include/linux/kvm_host.h:605 suspicious rcu_dereference_check() usage!
    [ 69.683225]
    [ 69.683225] other info that might help us debug this:
    [ 69.683225]
    [ 69.683975]
    [ 69.683975] rcu_scheduler_active = 2, debug_locks = 1
    [ 69.684598] 6 locks held by qemu-system-aar/4097:
    [ 69.685059] #0: 0000000034196013 (&kvm->lock){+.+.}, at: vgic_its_set_attr+0x244/0x3a0
    [ 69.686087] #1: 00000000f2ed935e (&its->its_lock){+.+.}, at: vgic_its_set_attr+0x250/0x3a0
    [ 69.686919] #2: 000000005e71ea54 (&vcpu->mutex){+.+.}, at: lock_all_vcpus+0x64/0xd0
    [ 69.687698] #3: 00000000c17e548d (&vcpu->mutex){+.+.}, at: lock_all_vcpus+0x64/0xd0
    [ 69.688475] #4: 00000000ba386017 (&vcpu->mutex){+.+.}, at: lock_all_vcpus+0x64/0xd0
    [ 69.689978] #5: 00000000c2c3c335 (&vcpu->mutex){+.+.}, at: lock_all_vcpus+0x64/0xd0
    [ 69.690729]
    [ 69.690729] stack backtrace:
    [ 69.691151] CPU: 2 PID: 4097 Comm: qemu-system-aar Not tainted 5.1.0-rc1-00008-g600025238f51-dirty #18
    [ 69.691984] Hardware name: rockchip evb_rk3399/evb_rk3399, BIOS 2019.04-rc3-00124-g2feec69fb1 03/15/2019
    [ 69.692831] Call trace:
    [ 69.694072] lockdep_rcu_suspicious+0xcc/0x110
    [ 69.694490] gfn_to_memslot+0x174/0x190
    [ 69.694853] kvm_write_guest+0x50/0xb0
    [ 69.695209] vgic_its_save_tables_v0+0x248/0x330
    [ 69.695639] vgic_its_set_attr+0x298/0x3a0
    [ 69.696024] kvm_device_ioctl_attr+0x9c/0xd8
    [ 69.696424] kvm_device_ioctl+0x8c/0xf8
    [ 69.696788] do_vfs_ioctl+0xc8/0x960
    [ 69.697128] ksys_ioctl+0x8c/0xa0
    [ 69.697445] __arm64_sys_ioctl+0x28/0x38
    [ 69.697817] el0_svc_common+0xd8/0x138
    [ 69.698173] el0_svc_handler+0x38/0x78
    [ 69.698528] el0_svc+0x8/0xc

    The fix is to obviously take the srcu lock, just like we do on the
    read side of things since bf308242ab98. One wonders why this wasn't
    fixed at the same time, but hey...

    Fixes: bf308242ab98 ("KVM: arm/arm64: VGIC/ITS: protect kvm_read_guest() calls with SRCU lock")
    Signed-off-by: Marc Zyngier

    Marc Zyngier
     
  • The normal interrupt flow is not to enable the vgic when no virtual
    interrupt is to be injected (i.e. the LRs are empty). But when a guest
    is likely to use GICv4 for LPIs, we absolutely need to switch it on
    at all times. Otherwise, VLPIs only get delivered when there is something
    in the LRs, which doesn't happen very often.

    Reported-by: Nianyao Tang
    Tested-by: Shameerali Kolothum Thodi
    Signed-off-by: Marc Zyngier

    Marc Zyngier
     

16 Mar, 2019

1 commit

  • Pull KVM updates from Paolo Bonzini:
    "ARM:
    - some cleanups
    - direct physical timer assignment
    - cache sanitization for 32-bit guests

    s390:
    - interrupt cleanup
    - introduction of the Guest Information Block
    - preparation for processor subfunctions in cpu models

    PPC:
    - bug fixes and improvements, especially related to machine checks
    and protection keys

    x86:
    - many, many cleanups, including removing a bunch of MMU code for
    unnecessary optimizations
    - AVIC fixes

    Generic:
    - memcg accounting"

    * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (147 commits)
    kvm: vmx: fix formatting of a comment
    KVM: doc: Document the life cycle of a VM and its resources
    MAINTAINERS: Add KVM selftests to existing KVM entry
    Revert "KVM/MMU: Flush tlb directly in the kvm_zap_gfn_range()"
    KVM: PPC: Book3S: Add count cache flush parameters to kvmppc_get_cpu_char()
    KVM: PPC: Fix compilation when KVM is not enabled
    KVM: Minor cleanups for kvm_main.c
    KVM: s390: add debug logging for cpu model subfunctions
    KVM: s390: implement subfunction processor calls
    arm64: KVM: Fix architecturally invalid reset value for FPEXC32_EL2
    KVM: arm/arm64: Remove unused timer variable
    KVM: PPC: Book3S: Improve KVM reference counting
    KVM: PPC: Book3S HV: Fix build failure without IOMMU support
    Revert "KVM: Eliminate extra function calls in kvm_get_dirty_log_protect()"
    x86: kvmguest: use TSC clocksource if invariant TSC is exposed
    KVM: Never start grow vCPU halt_poll_ns from value below halt_poll_ns_grow_start
    KVM: Expose the initial start value in grow_halt_poll_ns() as a module parameter
    KVM: grow_halt_poll_ns() should never shrink vCPU halt_poll_ns
    KVM: x86/mmu: Consolidate kvm_mmu_zap_all() and kvm_mmu_zap_mmio_sptes()
    KVM: x86/mmu: WARN if zapping a MMIO spte results in zapping children
    ...

    Linus Torvalds
     

07 Mar, 2019

1 commit

  • Pull ACPI updates from Rafael Wysocki:
    "These are ACPICA updates including ACPI 6.3 support among other
    things, APEI updates including the ARM Software Delegated Exception
    Interface (SDEI) support, ACPI EC driver fixes and cleanups and other
    assorted improvements.

    Specifics:

    - Update the ACPICA code in the kernel to upstream revision 20190215
    including ACPI 6.3 support and more:
    * New predefined methods: _NBS, _NCH, _NIC, _NIH, and _NIG (Erik
    Schmauss).
    * Update of the PCC Identifier structure in PDTT (Erik Schmauss).
    * Support for new Generic Affinity Structure subtable in SRAT
    (Erik Schmauss).
    * New PCC operation region support (Erik Schmauss).
    * Support for GICC statistical profiling for MADT (Erik Schmauss).
    * New Error Disconnect Recover notification support (Erik
    Schmauss).
    * New PPTT Processor Structure Flags fields support (Erik
    Schmauss).
    * ACPI 6.3 HMAT updates (Erik Schmauss).
    * GTDT Revision 3 support (Erik Schmauss).
    * Legacy module-level code (MLC) support removal (Erik Schmauss).
    * Update/clarification of messages for control method failures
    (Bob Moore).
    * Warning on creation of a zero-length opregion (Bob Moore).
    * acpiexec option to dump extra info for memory leaks (Bob Moore).
    * More ACPI error to firmware error conversions (Bob Moore).
    * Debugger fix (Bob Moore).
    * Copyrights update (Bob Moore)

    - Clean up sleep states support code in ACPICA (Christoph Hellwig)

    - Rework in_nmi() handling in the APEI code and add support for the
    ARM Software Delegated Exception Interface (SDEI) to it (James
    Morse)

    - Fix possible out-of-bounds accesses in BERT-related core (Ross
    Lagerwall)

    - Fix the APEI code parsing HEST that includes a Deferred Machine
    Check subtable (Yazen Ghannam)

    - Use DEFINE_DEBUGFS_ATTRIBUTE for APEI-related debugfs files
    (YueHaibing)

    - Switch the APEI ERST code to the new generic UUID API (Andy
    Shevchenko)

    - Update the MAINTAINERS entry for APEI (Borislav Petkov)

    - Fix and clean up the ACPI EC driver (Rafael Wysocki, Zhang Rui)

    - Fix DMI checks handling in the ACPI backlight driver and add the
    "Lunch Box" chassis-type check to it (Hans de Goede)

    - Add support for using ACPI table overrides included in built-in
    initrd images (Shunyong Yang)

    - Update ACPI device enumeration to treat the PWM2 device as "always
    present" on Lenovo Yoga Book (Yauhen Kharuzhy)

    - Fix up the enumeration of device objects with the PRP0001 device ID
    (Andy Shevchenko)

    - Clean up PPTT parsing error messages (John Garry)

    - Clean up debugfs files creation handling (Greg Kroah-Hartman,
    Rafael Wysocki)

    - Clean up the ACPI DPTF Makefile (Masahiro Yamada)"

    * tag 'acpi-5.1-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (65 commits)
    ACPI / bus: Respect PRP0001 when retrieving device match data
    ACPICA: Update version to 20190215
    ACPI/ACPICA: Trivial: fix spelling mistakes and fix whitespace formatting
    ACPICA: ACPI 6.3: add GTDT Revision 3 support
    ACPICA: ACPI 6.3: HMAT updates
    ACPICA: ACPI 6.3: PPTT add additional fields in Processor Structure Flags
    ACPICA: ACPI 6.3: add Error Disconnect Recover Notification value
    ACPICA: ACPI 6.3: MADT: add support for statistical profiling in GICC
    ACPICA: ACPI 6.3: add PCC operation region support for AML interpreter
    efi: cper: Fix possible out-of-bounds access
    ACPI: APEI: Fix possible out-of-bounds access to BERT region
    ACPICA: ACPI 6.3: SRAT: add Generic Affinity Structure subtable
    ACPICA: ACPI 6.3: Add Trigger order to PCC Identifier structure in PDTT
    ACPICA: ACPI 6.3: Adding predefined methods _NBS, _NCH, _NIC, _NIH, and _NIG
    ACPICA: Update/clarify messages for control method failures
    ACPICA: Debugger: Fix possible fault with the "test objects" command
    ACPICA: Interpreter: Emit warning for creation of a zero-length op region
    ACPICA: Remove legacy module-level code support
    ACPI / x86: Make PWM2 device always present at Lenovo Yoga Book
    ACPI / video: Extend chassis-type detection with a "Lunch Box" check
    ...

    Linus Torvalds
     

06 Mar, 2019

1 commit

  • Pull RCU updates from Ingo Molnar:
    "The main RCU related changes in this cycle were:

    - Additional cleanups after RCU flavor consolidation

    - Grace-period forward-progress cleanups and improvements

    - Documentation updates

    - Miscellaneous fixes

    - spin_is_locked() conversions to lockdep

    - SPDX changes to RCU source and header files

    - SRCU updates

    - Torture-test updates, including nolibc updates and moving nolibc to
    tools/include"

    * 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (71 commits)
    locking/locktorture: Convert to SPDX license identifier
    linux/torture: Convert to SPDX license identifier
    torture: Convert to SPDX license identifier
    linux/srcu: Convert to SPDX license identifier
    linux/rcutree: Convert to SPDX license identifier
    linux/rcutiny: Convert to SPDX license identifier
    linux/rcu_sync: Convert to SPDX license identifier
    linux/rcu_segcblist: Convert to SPDX license identifier
    linux/rcupdate: Convert to SPDX license identifier
    linux/rcu_node_tree: Convert to SPDX license identifier
    rcu/update: Convert to SPDX license identifier
    rcu/tree: Convert to SPDX license identifier
    rcu/tiny: Convert to SPDX license identifier
    rcu/sync: Convert to SPDX license identifier
    rcu/srcu: Convert to SPDX license identifier
    rcu/rcutorture: Convert to SPDX license identifier
    rcu/rcu_segcblist: Convert to SPDX license identifier
    rcu/rcuperf: Convert to SPDX license identifier
    rcu/rcu.h: Convert to SPDX license identifier
    RCU/torture.txt: Remove section MODULE PARAMETERS
    ...

    Linus Torvalds
     

04 Mar, 2019

1 commit

  • * acpi-apei: (29 commits)
    efi: cper: Fix possible out-of-bounds access
    ACPI: APEI: Fix possible out-of-bounds access to BERT region
    MAINTAINERS: Add James Morse to the list of APEI reviewers
    ACPI / APEI: Add support for the SDEI GHES Notification type
    firmware: arm_sdei: Add ACPI GHES registration helper
    ACPI / APEI: Use separate fixmap pages for arm64 NMI-like notifications
    ACPI / APEI: Only use queued estatus entry during in_nmi_queue_one_entry()
    ACPI / APEI: Split ghes_read_estatus() to allow a peek at the CPER length
    ACPI / APEI: Make GHES estatus header validation more user friendly
    ACPI / APEI: Pass ghes and estatus separately to avoid a later copy
    ACPI / APEI: Let the notification helper specify the fixmap slot
    ACPI / APEI: Move locking to the notification helper
    arm64: KVM/mm: Move SEA handling behind a single 'claim' interface
    KVM: arm/arm64: Add kvm_ras.h to collect kvm specific RAS plumbing
    ACPI / APEI: Switch NOTIFY_SEA to use the estatus queue
    ACPI / APEI: Move NOTIFY_SEA between the estatus-queue and NOTIFY_NMI
    ACPI / APEI: Don't allow ghes_ack_error() to mask earlier errors
    ACPI / APEI: Generalise the estatus queue's notify code
    ACPI / APEI: Don't update struct ghes' flags in read/clear estatus
    ACPI / APEI: Remove spurious GHES_TO_CLEAR check
    ...

    Rafael J. Wysocki
     

01 Mar, 2019

1 commit

  • debugfs can now report an error code if something went wrong instead of
    just NULL. So if the return value is to be used as a "real" dentry, it
    needs to be checked if it is an error before dereferencing it.

    This is now happening because of ff9fb72bc077 ("debugfs: return error
    values, not NULL"). syzbot has found a way to trigger multiple debugfs
    files attempting to be created, which fails, and then the error code
    gets passed to dentry_path_raw() which obviously does not like it.
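
    The check-before-use pattern can be sketched in user space. The toy_*
    helpers below imitate the kernel's error-pointer convention and are
    assumptions of this illustration, not the kernel macros themselves.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_MAX_ERRNO 4095

/* Encode a negative errno value in a pointer. */
static inline void *toy_err_ptr(long error)
{
	return (void *)(intptr_t)error;
}

/* True when the pointer lies in the top TOY_MAX_ERRNO addresses. */
static inline int toy_is_err(const void *ptr)
{
	return (uintptr_t)ptr >= (uintptr_t)-TOY_MAX_ERRNO;
}

struct toy_dentry { const char *name; };

/* Hypothetical creator: returns an error pointer instead of NULL. */
static struct toy_dentry *toy_create_dir(struct toy_dentry *storage,
					 const char *name, int fail)
{
	if (fail)
		return toy_err_ptr(-EEXIST);
	storage->name = name;
	return storage;
}

/* Only dereference the result once it is known not to be an error. */
static const char *toy_dir_name(const struct toy_dentry *dentry)
{
	if (toy_is_err(dentry))
		return NULL;
	return dentry->name;
}
```

    A plain NULL test would let the error pointer through, which is the
    failure mode described above.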

    Reported-by: Eric Biggers
    Reported-and-tested-by: syzbot+7857962b4d45e602b8ad@syzkaller.appspotmail.com
    Cc: "Radim Krčmář"
    Cc: kvm@vger.kernel.org
    Acked-by: Paolo Bonzini
    Signed-off-by: Greg Kroah-Hartman
    Signed-off-by: Linus Torvalds

    Greg Kroah-Hartman
     

23 Feb, 2019

2 commits


22 Feb, 2019

1 commit

  • The 'timer' local variable became unused after commit bee038a67487
    ("KVM: arm/arm64: Rework the timer code to use a timer_map").
    Remove it to avoid [-Wunused-but-set-variable] warning.

    Cc: Christoffer Dall
    Cc: James Morse
    Cc: Suzuki K Poulose
    Reviewed-by: Julien Thierry
    Signed-off-by: Shaokun Zhang
    Signed-off-by: Marc Zyngier

    Shaokun Zhang
     

21 Feb, 2019

10 commits

  • The value of "dirty_bitmap[i]" is already checked before setting its value
    to mask, so the following check of "mask" is redundant. The check of "mask" was
    introduced by commit 58d2930f4ee3 ("KVM: Eliminate extra function calls in
    kvm_get_dirty_log_protect()"); revert it.

    Signed-off-by: Lan Tianyu
    Signed-off-by: Paolo Bonzini

    Lan Tianyu
     
  • grow_halt_poll_ns() has a strange behaviour in case
    (vcpu->halt_poll_ns != 0) &&
    (vcpu->halt_poll_ns < halt_poll_ns_grow_start).

    In this case, vcpu->halt_poll_ns will be multiplied by the grow factor
    (halt_poll_ns_grow), which will require several grow iterations in order
    to reach a value bigger than halt_poll_ns_grow_start.
    This means that growing vcpu->halt_poll_ns from a value of 0 is faster
    than growing it from a positive value less than halt_poll_ns_grow_start,
    which is misleading and inconsistent.

    Fix the issue by changing grow_halt_poll_ns() to set vcpu->halt_poll_ns
    to halt_poll_ns_grow_start in any case where
    (vcpu->halt_poll_ns < halt_poll_ns_grow_start),
    regardless of whether vcpu->halt_poll_ns is 0.

    use READ_ONCE to get a consistent number for all cases.
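
    The resulting growth rule can be sketched as a pure function. This is a
    simplified illustration: toy_grow_halt_poll_ns() is a hypothetical name,
    the parameters stand in for the module parameters, and the guard for a
    zero grow factor is assumed to be in place already.

```c
/*
 * cur:        current per-vCPU polling window (vcpu->halt_poll_ns)
 * grow:       growth factor (halt_poll_ns_grow)
 * grow_start: initial window size (halt_poll_ns_grow_start)
 */
static unsigned int toy_grow_halt_poll_ns(unsigned int cur, unsigned int grow,
					  unsigned int grow_start)
{
	unsigned int val = cur;

	/* A zero growth factor means "do not grow"; never shrink here. */
	if (!grow)
		return cur;

	val *= grow;

	/* Any value below the start value jumps straight to it. */
	if (val < grow_start)
		val = grow_start;

	return val;
}
```

    With grow = 2 and grow_start = 10000, both 0 and 100 grow to 10000 in a
    single step, so the two starting points no longer behave differently.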

    Reviewed-by: Boris Ostrovsky
    Reviewed-by: Liran Alon
    Signed-off-by: Nir Weiner
    Signed-off-by: Paolo Bonzini

    Nir Weiner
     
  • The hard-coded value 10000 in grow_halt_poll_ns() stands for the initial
    start value when raising up vcpu->halt_poll_ns.
    It actually sets the first timeout to the first polling session.
    This value has significant effect on how tolerant we are to outliers.
    On the standard case, higher value is better - we will spend more time
    in the polling busyloop, handle events/interrupts faster and result
    in better performance.
    But on outliers it puts us in a busy loop that does nothing.
    Even if the shrink factor is zero, we will still waste time on the first
    iteration.
    The optimal value changes between different workloads. It depends on
    outliers rate and polling sessions length.
    As this value has significant effect on the dynamic halt-polling
    algorithm, it should be configurable and exposed.

    Reviewed-by: Boris Ostrovsky
    Reviewed-by: Liran Alon
    Signed-off-by: Nir Weiner
    Signed-off-by: Paolo Bonzini

    Nir Weiner
     
  • grow_halt_poll_ns() has a strange behavior in case
    (halt_poll_ns_grow == 0) && (vcpu->halt_poll_ns != 0).

    In this case, vcpu->halt_poll_ns will be set to zero.
    That results in shrinking instead of growing.

    Fix the issue by changing grow_halt_poll_ns() to not modify
    vcpu->halt_poll_ns in case halt_poll_ns_grow is zero.

    Reviewed-by: Boris Ostrovsky
    Reviewed-by: Liran Alon
    Signed-off-by: Nir Weiner
    Suggested-by: Liran Alon
    Signed-off-by: Paolo Bonzini

    Nir Weiner
     
  • ...now that KVM won't explode by moving it out of bit 0. Using bit 63
    eliminates the need to jump over bit 0, e.g. when calculating a new
    memslots generation or when propagating the memslots generation to an
    MMIO spte.
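
    As a rough user-space sketch of why bit 63 is simpler: with the flag in
    the top bit, the counter occupies the low bits and can be advanced by a
    plain +1. The macro and function names below are hypothetical and are
    not taken from the kernel sources.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical name for the "update in-progress" flag, placed in bit 63. */
#define GEN_UPDATE_IN_PROGRESS (UINT64_C(1) << 63)

/* Mark an update as in progress without touching the counter bits. */
static uint64_t gen_begin_update(uint64_t gen)
{
    return gen | GEN_UPDATE_IN_PROGRESS;
}

/* Finish an update: clear the flag, then advance the counter by one. */
static uint64_t gen_end_update(uint64_t gen)
{
    return (gen & ~GEN_UPDATE_IN_PROGRESS) + 1;
}

/* Usage example: generation 5 becomes 6, with no need to skip bit 0. */
void gen_bit63_example(void)
{
    uint64_t gen = 5;

    gen = gen_begin_update(gen);
    assert((gen & GEN_UPDATE_IN_PROGRESS) != 0);
    assert((gen & ~GEN_UPDATE_IN_PROGRESS) == 5);

    gen = gen_end_update(gen);
    assert(gen == 6);
}
```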

    Signed-off-by: Sean Christopherson
    Signed-off-by: Paolo Bonzini

    Sean Christopherson
     
  • x86 captures a subset of the memslot generation (19 bits) in its MMIO
    sptes so that it can expedite emulated MMIO handling by checking only
    the relevant spte, i.e. doesn't need to do a full page fault walk.

    Because the MMIO sptes capture only 19 bits (due to limited space in
    the sptes), there is a non-zero probability that the MMIO generation
    could wrap, e.g. after 500k memslot updates. Since normal usage is
    extremely unlikely to result in 500k memslot updates, a hack was added
    by commit 69c9ea93eaea ("KVM: MMU: init kvm generation close to mmio
    wrap-around value") to offset the MMIO generation in order to trigger
    a wraparound, e.g. after 150 memslot updates.
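
    The wrap arithmetic can be checked with a small user-space sketch. It
    assumes a simplified model in which each memslot update adds exactly one
    to the generation and the MMIO generation is just the low 19 bits; the
    names and the biased starting value are illustrative, not the kernel
    code.

```c
#include <assert.h>
#include <stdint.h>

#define MMIO_GEN_BITS 19  /* width stated in the text above */
#define MMIO_GEN_MASK ((UINT64_C(1) << MMIO_GEN_BITS) - 1)

/* Truncate a 64-bit memslots generation to the bits kept in an MMIO spte. */
static uint64_t mmio_gen(uint64_t memslots_gen)
{
    return memslots_gen & MMIO_GEN_MASK;
}

/* Number of +1 updates until the truncated generation next reads 0. */
static uint64_t updates_until_wrap(uint64_t memslots_gen)
{
    return (MMIO_GEN_MASK + 1) - mmio_gen(memslots_gen);
}

/* Usage example: unbiased start vs. a start biased 150 below the wrap. */
void mmio_wrap_example(void)
{
    uint64_t biased = (uint64_t)0 - 150;  /* illustrative offset start value */

    assert(updates_until_wrap(0) == 524288);    /* 2^19, roughly 500k */
    assert(updates_until_wrap(biased) == 150);  /* wraps after 150 updates */
    assert(mmio_gen(biased + 150) == 0);
}
```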

    When separate memslot generation sequences were assigned to each
    address space, commit 00f034a12fdd ("KVM: do not bias the generation
    number in kvm_current_mmio_generation") moved the offset logic into the
    initialization of the memslot generation itself so that the per-address
    space bit(s) were not dropped/corrupted by the MMIO shenanigans.

    Remove the offset hack for three reasons:

    - While it does exercise x86's kvm_mmu_invalidate_mmio_sptes(), simply
    wrapping the generation doesn't actually test the interesting case
    of having stale MMIO sptes with the new generation number, e.g. old
    sptes with a generation number of 0.

    - Triggering kvm_mmu_invalidate_mmio_sptes() prematurely makes its
    performance rather important since the probability of invalidating
    MMIO sptes jumps from "effectively never" to "fairly likely". This
    limits what can be done in future patches, e.g. to simplify the
    invalidation code, as doing so without proper caution could lead to
    a noticeable performance regression.

    - Forcing the memslots generation, which is a 64-bit number, to wrap
    prevents KVM from assuming the memslots generation will never wrap.
    This in turn prevents KVM from using an arbitrary bit for the
    "update in-progress" flag, e.g. using bit 63 would immediately
    collide with using a large value as the starting generation number.
    The "update in-progress" flag is effectively forced into bit 0 so
    that it's (subtly) taken into account when incrementing the
    generation.

    Signed-off-by: Sean Christopherson
    Signed-off-by: Paolo Bonzini

    Sean Christopherson
     
  • KVM uses bit 0 of the memslots generation as an "update in-progress"
    flag, which is used by x86 to prevent caching MMIO access while the
    memslots are changing. Although the intended behavior is flag-like,
    e.g. MMIO sptes intentionally drop the in-progress bit so as to avoid
    caching data from in-flux memslots, the implementation oftentimes treats
    the bit as part of the generation number itself, e.g. incrementing the
    generation increments twice, once to set the flag and once to clear it.

    Prior to commit 4bd518f1598d ("KVM: use separate generations for
    each address space"), incorporating the "update in-progress" bit into
    the generation number largely made sense, e.g. "real" generations are
    even, "bogus" generations are odd, most code doesn't need to be aware of
    the bit, etc...

    Now that unique memslots generation numbers are assigned to each address
    space, stealthing the in-progress status into the generation number
    results in a wide variety of subtle code, e.g. kvm_create_vm() jumps
    over bit 0 when initializing the memslots generation without any hint as
    to why.

    Explicitly define the flag and convert as much code as possible (which
    isn't much) to actually treat it like a flag. This paves the way for
    eventually using a different bit for "update in-progress" so that it can
    be a flag in truth instead of an awkward extension to the generation
    number.
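
    To make the bit 0 scheme concrete, here is a hedged user-space sketch
    of a generation whose lowest bit is an explicitly named flag. All names
    are hypothetical; the point is only that stable generations stay even
    and that the counter has to step over bit 0.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical name for the flag while it still lives in bit 0. */
#define GEN_FLAG_IN_PROGRESS UINT64_C(1)

/* Test the flag explicitly instead of relying on odd/even arithmetic. */
static int gen_update_in_progress(uint64_t gen)
{
    return (gen & GEN_FLAG_IN_PROGRESS) != 0;
}

/* Begin an update: set the flag, which makes the value odd. */
static uint64_t gen_begin_update(uint64_t gen)
{
    return gen | GEN_FLAG_IN_PROGRESS;
}

/* End an update: clear the flag and step over bit 0 to the next even value. */
static uint64_t gen_end_update(uint64_t gen)
{
    return (gen & ~GEN_FLAG_IN_PROGRESS) + 2;
}

/* Usage example: 4 -> 5 (in progress) -> 6; the counter part is gen >> 1. */
void gen_bit0_example(void)
{
    uint64_t gen = 4;

    gen = gen_begin_update(gen);
    assert(gen == 5 && gen_update_in_progress(gen));

    gen = gen_end_update(gen);
    assert(gen == 6 && !gen_update_in_progress(gen));
    assert((gen >> 1) == 3);
}
```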

    Signed-off-by: Sean Christopherson
    Signed-off-by: Paolo Bonzini

    Sean Christopherson
     
  • kvm_arch_memslots_updated() is at this point in time an x86-specific
    hook for handling MMIO generation wraparound. x86 stashes 19 bits of
    the memslots generation number in its MMIO sptes in order to avoid
    full page fault walks for repeat faults on emulated MMIO addresses.
    Because only 19 bits are used, wrapping the MMIO generation number is
    possible, if unlikely. kvm_arch_memslots_updated() alerts x86 that
    the generation has changed so that it can invalidate all MMIO sptes in
    case the effective MMIO generation has wrapped so as to avoid using a
    stale spte, e.g. a (very) old spte that was created with generation==0.

    Given that the purpose of kvm_arch_memslots_updated() is to prevent
    consuming stale entries, it needs to be called before the new generation
    is propagated to memslots. Invalidating the MMIO sptes after updating
    memslots means that there is a window where a vCPU could dereference
    the new memslots generation, e.g. 0, and incorrectly reuse an old MMIO
    spte that was created with (pre-wrap) generation==0.

    Fixes: e59dbe09f8e6 ("KVM: Introduce kvm_arch_memslots_updated()")
    Cc:
    Signed-off-by: Sean Christopherson
    Signed-off-by: Paolo Bonzini

    Sean Christopherson
     
  • There are many KVM kernel memory allocations which are tied to the life of
    the VM process and should be charged to the VM process's cgroup. If the
    allocations aren't tied to the process, the OOM killer will not know
    that killing the process will free the associated kernel memory.
    Add __GFP_ACCOUNT flags to many of the allocations which are not yet being
    charged to the VM process's cgroup.

    Tested:
    Ran all kvm-unit-tests on a 64 bit Haswell machine; the patch
    introduced no new failures.
    Ran a kernel memory accounting test which creates a VM to touch
    memory and then checks that the kernel memory allocated for the
    process is within certain bounds.
    With this patch we account for much more of the vmalloc and slab memory
    allocated for the VM.

    There remain a few allocations which should be charged to the VM's
    cgroup but are not. They include:
    vcpu->run
    kvm->coalesced_mmio_ring
    These allocations are unaccounted in this patch because they are mapped
    to userspace, and accounting them to a cgroup causes problems. This
    should be addressed in a future patch.

    Signed-off-by: Ben Gardon
    Reviewed-by: Shakeel Butt
    Signed-off-by: Paolo Bonzini

    Ben Gardon
     
  • One of the more common cases of allocation size calculations is finding
    the size of a structure that has a zero-sized array at the end, along
    with memory for some number of elements for that array. For example:

    struct foo {
        int stuff;
        void *entry[];
    };

    instance = kmalloc(sizeof(struct foo) + sizeof(void *) * count, GFP_KERNEL);

    Instead of leaving these open-coded and prone to type mistakes, we can
    now use the new struct_size() helper:

    instance = kmalloc(struct_size(instance, entry, count), GFP_KERNEL);
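
    For readers outside the kernel tree, the same size calculation can be
    sketched in user-space C. flex_size() and foo_alloc() are hypothetical
    stand-ins written for this example; they assume the helper's documented
    behaviour of saturating to SIZE_MAX when the calculation overflows.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

struct foo {
    int stuff;
    void *entry[];
};

/*
 * Hypothetical stand-in for struct_size(): header bytes plus
 * count * elem_size, saturating to SIZE_MAX instead of wrapping.
 */
static size_t flex_size(size_t header, size_t elem_size, size_t count)
{
    if (elem_size != 0 && count > (SIZE_MAX - header) / elem_size)
        return SIZE_MAX;

    return header + elem_size * count;
}

/* Allocate a struct foo with room for `count` entries, or NULL on failure. */
static struct foo *foo_alloc(size_t count)
{
    size_t bytes = flex_size(sizeof(struct foo), sizeof(void *), count);

    if (bytes == SIZE_MAX)
        return NULL;  /* the size calculation overflowed */

    return malloc(bytes);
}

/* Usage example: a normal allocation and an overflowing request. */
void foo_alloc_example(void)
{
    struct foo *instance = foo_alloc(3);

    if (instance) {
        instance->stuff = 3;
        instance->entry[2] = NULL;
        free(instance);
    }
    assert(foo_alloc(SIZE_MAX) == NULL);
}
```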

    This code was detected with the help of Coccinelle.

    Signed-off-by: Gustavo A. R. Silva
    Signed-off-by: Paolo Bonzini

    Gustavo A. R. Silva
     

20 Feb, 2019

1 commit