10 Nov, 2015

1 commit

  • commit 8832317f662c06f5c06e638f57bfe89a71c9b266 upstream.

    Currently we do not validate rtas.entry before calling enter_rtas(). This
    leads to a kernel oops when user space calls the rtas system call on a
    powernv platform (see below). This patch adds code to validate rtas.entry
    before making the enter_rtas() call.
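
    The fix amounts to failing the syscall early when no RTAS entry point
    exists; a minimal sketch of the check, assuming it sits at the top of
    the ppc_rtas() syscall handler:

        /* powernv has no RTAS, so rtas.entry is 0; fail before enter_rtas() */
        if (!rtas.entry)
                return -EINVAL;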

    Oops: Exception in kernel mode, sig: 4 [#1]
    SMP NR_CPUS=1024 NUMA PowerNV
    task: c000000004294b80 ti: c0000007e1a78000 task.ti: c0000007e1a78000
    NIP: 0000000000000000 LR: 0000000000009c14 CTR: c000000000423140
    REGS: c0000007e1a7b920 TRAP: 0e40 Not tainted (3.18.17-340.el7_1.pkvm3_1_0.2400.1.ppc64le)
    MSR: 1000000000081000 CR: 00000000 XER: 00000000
    CFAR: c000000000009c0c SOFTE: 0
    NIP [0000000000000000] (null)
    LR [0000000000009c14] 0x9c14
    Call Trace:
    [c0000007e1a7bba0] [c00000000041a7f4] avc_has_perm_noaudit+0x54/0x110 (unreliable)
    [c0000007e1a7bd80] [c00000000002ddc0] ppc_rtas+0x150/0x2d0
    [c0000007e1a7be30] [c000000000009358] syscall_exit+0x0/0x98

    Fixes: 55190f88789a ("powerpc: Add skeleton PowerNV platform")
    Reported-by: NAGESWARA R. SASTRY
    Signed-off-by: Vasant Hegde
    [mpe: Reword change log, trim oops, and add stable + fixes]
    Signed-off-by: Michael Ellerman
    Signed-off-by: Greg Kroah-Hartman

    Vasant Hegde
     

27 Oct, 2015

1 commit

  • commit c56dadf39761a6157239cac39e3988998c994f98 upstream.

    Function should_resched() is equivalent to (!preempt_count() && need_resched()).
    In a preemptible kernel, preempt_count() is non-zero here because vc->lock
    is held, so the check can never succeed; test need_resched() directly instead.
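
    A minimal sketch of the change, assuming the check runs in a loop that
    holds vc->lock:

        /* should_resched() is always false here: vc->lock keeps
         * preempt_count() non-zero; test need_resched() directly */
        if (need_resched())
                cond_resched_lock(&vc->lock);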

    Signed-off-by: Konstantin Khlebnikov
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Alexander Graf
    Cc: Boris Ostrovsky
    Cc: David Vrabel
    Cc: Linus Torvalds
    Cc: Mike Galbraith
    Cc: Paul Mackerras
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Link: http://lkml.kernel.org/r/20150715095203.12246.72922.stgit@buzz
    Signed-off-by: Ingo Molnar
    Cc: Guenter Roeck
    Signed-off-by: Greg Kroah-Hartman

    Konstantin Khlebnikov
     

23 Oct, 2015

3 commits

  • commit e297c939b745e420ef0b9dc989cb87bda617b399 upstream.

    This fixes a race which can result in the same virtual IRQ number
    being assigned to two different MSI interrupts. The most visible
    consequence of that is usually a warning and stack trace from the
    sysfs code about an attempt to create a duplicate entry in sysfs.

    The race happens when one CPU (say CPU 0) is disposing of an MSI
    while another CPU (say CPU 1) is setting up an MSI. CPU 0 calls
    (for example) pnv_teardown_msi_irqs(), which calls
    msi_bitmap_free_hwirqs() to indicate that the MSI (i.e. its
    hardware IRQ number) is no longer in use. Then, before CPU 0 gets
    to calling irq_dispose_mapping() to free up the virtual IRQ number,
    CPU 1 comes in and calls msi_bitmap_alloc_hwirqs() to allocate an
    MSI, and gets the same hardware IRQ number that CPU 0 just freed.
    CPU 1 then calls irq_create_mapping() to get a virtual IRQ number,
    which sees that there is currently a mapping for that hardware IRQ
    number and returns the corresponding virtual IRQ number (which is
    the same virtual IRQ number that CPU 0 was using). CPU 0 then
    calls irq_dispose_mapping() and frees that virtual IRQ number.
    Now, if another CPU comes along and calls irq_create_mapping(), it
    is likely to get the virtual IRQ number that was just freed,
    resulting in the same virtual IRQ number apparently being used for
    two different hardware interrupts.

    To fix this race, we just move the call to msi_bitmap_free_hwirqs()
    to after the call to irq_dispose_mapping(). Since virq_to_hw()
    doesn't work for the virtual IRQ number after irq_dispose_mapping()
    has been called, we need to call it before irq_dispose_mapping() and
    remember the result for the msi_bitmap_free_hwirqs() call.
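
    A sketch of the fixed teardown ordering, simplified from the
    arch/powerpc call sites (the bitmap name is illustrative):

        hwirq = virq_to_hw(virq);   /* invalid after disposal, so read first */
        irq_set_msi_desc(virq, NULL);
        irq_dispose_mapping(virq);  /* free the virtual IRQ first...  */
        msi_bitmap_free_hwirqs(&bmp, hwirq, 1);  /* ...then the hw IRQ */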

    The pattern of calling msi_bitmap_free_hwirqs() before
    irq_dispose_mapping() appears in 5 places under arch/powerpc, and
    appears to have originated in commit 05af7bd2d75e ("[POWERPC] MPIC
    U3/U4 MSI backend") from 2007.

    Fixes: 05af7bd2d75e ("[POWERPC] MPIC U3/U4 MSI backend")
    Reported-by: Alexey Kardashevskiy
    Signed-off-by: Paul Mackerras
    Signed-off-by: Michael Ellerman
    Signed-off-by: Greg Kroah-Hartman

    Paul Mackerras
     
  • commit 7e022e717f54897e396504306d0c9b61452adf4e upstream.

    In guest_exit_cont we call kvmhv_commence_exit which expects the trap
    number as the argument. However r3 doesn't contain the trap number at
    this point and as a result we would be calling the function with a
    spurious trap number.

    Fix this by copying r12 into r3 before calling kvmhv_commence_exit as
    r12 contains the trap number.

    Fixes: eddb60fb1443
    Signed-off-by: Gautham R. Shenoy
    Signed-off-by: Paul Mackerras
    Signed-off-by: Greg Kroah-Hartman

    Gautham R. Shenoy
     
  • commit 3eb4ee68254235e4f47bc0410538fcdaede39589 upstream.

    Access to the kvm->buses (like with the kvm_io_bus_read() and -write()
    functions) has to be protected via the kvm->srcu lock.
    The kvmppc_h_logical_ci_load() and -store() functions are missing
    this lock so far, so let's add it there, too.
    This fixes the problem that the kernel reports "suspicious RCU usage"
    when lock debugging is enabled.
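
    A minimal sketch of the locking pattern that was added around the bus
    access (surrounding variables assumed, names from the generic KVM API):

        idx = srcu_read_lock(&vcpu->kvm->srcu);
        ret = kvm_io_bus_read(vcpu, KVM_MMIO_BUS, addr, size, &buf);
        srcu_read_unlock(&vcpu->kvm->srcu, idx);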

    Fixes: 99342cf8044420eebdf9297ca03a14cb6a7085a1
    Signed-off-by: Thomas Huth
    Signed-off-by: Paul Mackerras
    Signed-off-by: Greg Kroah-Hartman

    Thomas Huth
     

30 Sep, 2015

8 commits

  • commit 36b35d5d807b7e57aff7d08e63de8b17731ee211 upstream.

    If we had the secondary hash flag set, we ended up modifying the hash
    value in the updatepp code path. Hence with a failed updatepp we would
    be using the wrong hash value for the following hash insert. Fix this
    by recomputing the hash before the insert.
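
    A sketch of the idea: after a failed updatepp, derive the hash from
    the VPN again instead of reusing the possibly-inverted value (variable
    names assumed from the hash fault path):

        /* recompute from scratch before the insert */
        hash = hpt_hash(vpn, shift, ssize);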

    Without this patch we can end up using the wrong slot number in the
    Linux pte. That can result in us missing a hash pte update or
    invalidate, which can cause memory corruption or even a machine check.

    Fixes: 6d492ecc6489 ("powerpc/THP: Add code to handle HPTE faults for hugepages")
    Signed-off-by: Aneesh Kumar K.V
    Reviewed-by: Paul Mackerras
    Signed-off-by: Michael Ellerman
    Signed-off-by: Greg Kroah-Hartman

    Aneesh Kumar K.V
     
  • commit 655471f54c2e395ba29ae4156ba0f49928177cc1 upstream.

    The kernel does it, not the boot wrapper, which breaks with some
    cross compilers that still default to ABI v1.

    Fixes: 147c05168fc8 ("powerpc/boot: Add support for 64bit little endian wrapper")
    Signed-off-by: Benjamin Herrenschmidt
    Signed-off-by: Michael Ellerman
    Signed-off-by: Greg Kroah-Hartman

    Benjamin Herrenschmidt
     
  • commit 72cd7b44bc99376b3f3c93cedcd052663fcdf705 upstream.

    The enable_kernel_vsx() function was commented out since nothing was
    using it. However, the vmx-crypto driver uses VSX instructions, which
    are only available if VSX is enabled; otherwise it raises an exception
    oops.

    This patch uncomments the enable_kernel_vsx() routine and makes it
    available again.

    Signed-off-by: Leonidas S. Barbosa
    Signed-off-by: Herbert Xu
    Signed-off-by: Michael Ellerman
    Signed-off-by: Greg Kroah-Hartman

    Leonidas Da Silva Barbosa
     
  • commit 1c2cb594441d02815d304cccec9742ff5c707495 upstream.

    The EPOW interrupt handler uses rtas_get_sensor(), which in turn
    uses rtas_busy_delay() to wait for RTAS to become ready if
    necessary. But rtas_busy_delay() is annotated with might_sleep()
    and thus may not be used by interrupt handlers like the EPOW handler!
    This leads to the following BUG when CONFIG_DEBUG_ATOMIC_SLEEP is
    enabled:

    BUG: sleeping function called from invalid context at arch/powerpc/kernel/rtas.c:496
    in_atomic(): 1, irqs_disabled(): 1, pid: 0, name: swapper/1
    CPU: 1 PID: 0 Comm: swapper/1 Not tainted 4.2.0-rc2-thuth #6
    Call Trace:
    [c00000007ffe7b90] [c000000000807670] dump_stack+0xa0/0xdc (unreliable)
    [c00000007ffe7bc0] [c0000000000e1f14] ___might_sleep+0x134/0x180
    [c00000007ffe7c20] [c00000000002aec0] rtas_busy_delay+0x30/0xd0
    [c00000007ffe7c50] [c00000000002bde4] rtas_get_sensor+0x74/0xe0
    [c00000007ffe7ce0] [c000000000083264] ras_epow_interrupt+0x44/0x450
    [c00000007ffe7d90] [c000000000120260] handle_irq_event_percpu+0xa0/0x300
    [c00000007ffe7e70] [c000000000120524] handle_irq_event+0x64/0xc0
    [c00000007ffe7eb0] [c000000000124dbc] handle_fasteoi_irq+0xec/0x260
    [c00000007ffe7ef0] [c00000000011f4f0] generic_handle_irq+0x50/0x80
    [c00000007ffe7f20] [c000000000010f3c] __do_irq+0x8c/0x200
    [c00000007ffe7f90] [c0000000000236cc] call_do_irq+0x14/0x24
    [c00000007e6f39e0] [c000000000011144] do_IRQ+0x94/0x110
    [c00000007e6f3a30] [c000000000002594] hardware_interrupt_common+0x114/0x180

    Fix this issue by introducing a new rtas_get_sensor_fast() function
    that does not use rtas_busy_delay() - and thus can only be used for
    sensors that do not cause a BUSY condition - known as "fast" sensors.

    The EPOW sensor is defined to be "fast" in sPAPR - mpe.
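
    A hedged sketch of such a "fast" variant: it simply omits the
    rtas_busy_delay() loop, so it is safe in interrupt context but must
    only be used with sensors that never return a BUSY status:

        int rtas_get_sensor_fast(int sensor, int index, int *state)
        {
                int token = rtas_token("get-sensor-state");

                if (token == RTAS_UNKNOWN_SERVICE)
                        return -ENOENT;

                /* no busy-wait loop, unlike rtas_get_sensor() */
                return rtas_call(token, 2, 2, state, sensor, index);
        }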

    Fixes: 587f83e8dd50 ("powerpc/pseries: Use rtas_get_sensor in RAS code")
    Signed-off-by: Thomas Huth
    Reviewed-by: Nathan Fontenot
    Signed-off-by: Michael Ellerman
    Signed-off-by: Greg Kroah-Hartman

    Thomas Huth
     
  • commit 74b5037baa2011a2799e2c43adde7d171b072f9e upstream.

    The powerpc kernel can be built to have either a 4K PAGE_SIZE or a 64K
    PAGE_SIZE.

    However when built with a 4K PAGE_SIZE there is an additional config
    option which can be enabled, PPC_HAS_HASH_64K, which means the kernel
    also knows how to hash a 64K page even though the base PAGE_SIZE is 4K.

    This is used in one obscure configuration, to support 64K pages for SPU
    local store on the Cell processor when the rest of the kernel is using
    4K pages.

    In this configuration, pte_pagesize_index() is defined to just pass
    through its arguments to get_slice_psize(). However pte_pagesize_index()
    is called for both user and kernel addresses, whereas get_slice_psize()
    only knows how to handle user addresses.

    This has been broken forever, however until recently it happened to
    work. That was because in get_slice_psize() the large kernel address
    would cause the right shift of the slice mask to return zero.

    However in commit 7aa0727f3302 ("powerpc/mm: Increase the slice range to
    64TB"), the get_slice_psize() code was changed so that instead of a
    right shift we do an array lookup based on the address. When passed a
    kernel address this means we index way off the end of the slice array
    and return random junk.

    That is only fatal if we happen to hit something non-zero, but when we
    do return a non-zero value we confuse the MMU code and eventually cause
    a check stop.

    This fix is ugly, but simple. When we're called for a kernel address we
    return 4K, which is always correct in this configuration, otherwise we
    use the slice mask.
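
    A minimal sketch of the approach (the real kernel-address test may be
    spelled differently):

        #define pte_pagesize_index(mm, addr, pte)               \
                (is_kernel_addr(addr) ? MMU_PAGE_4K             \
                                      : get_slice_psize(mm, addr))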

    Fixes: 7aa0727f3302 ("powerpc/mm: Increase the slice range to 64TB")
    Reported-by: Cyril Bur
    Signed-off-by: Michael Ellerman
    Reviewed-by: Aneesh Kumar K.V
    Signed-off-by: Greg Kroah-Hartman

    Michael Ellerman
     
  • commit 259800135c654a098d9f0adfdd3d1f20eef1f231 upstream.

    The config space of some PCI devices can't be accessed when their
    PEs are in the frozen state. Otherwise, a fenced PHB might be seen.
    Those PEs are identified with the flag EEH_PE_CFG_RESTRICTED, meaning
    EEH_PE_CFG_BLOCKED is set automatically when the PE is put into the
    frozen state (EEH_PE_ISOLATED). eeh_slot_error_detail() restores
    PCI device BARs with eeh_pe_restore_bars(), which then calls
    eeh_ops->restore_config() to reinitialize the PCI device in
    (OPAL) firmware. eeh_ops->restore_config() produces PCI config
    accesses that cause a fenced PHB. The problem was reported on the
    adapter below:

    0001:01:00.0 0200: 14e4:168e (rev 10)
    0001:01:00.0 Ethernet controller: Broadcom Corporation \
    NetXtreme II BCM57810 10 Gigabit Ethernet (rev 10)

    This fixes the issue by skipping eeh_pe_restore_bars() in
    eeh_slot_error_detail() when EEH_PE_CFG_BLOCKED is set for the PE.
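
    A sketch of the added guard, assuming the flag is checked via the PE
    state field:

        /* restoring BARs needs config access, which is blocked here */
        if (!(pe->state & EEH_PE_CFG_BLOCKED))
                eeh_pe_restore_bars(pe);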

    Fixes: b6541db1 ("powerpc/eeh: Block PCI config access upon frozen PE")
    Reported-by: Manvanthara B. Puttashankar
    Signed-off-by: Gavin Shan
    Signed-off-by: Michael Ellerman
    Signed-off-by: Greg Kroah-Hartman

    Gavin Shan
     
  • commit e642d11bdbfe8eb10116ab3959a2b5d75efda832 upstream.

    In the complete hotplug case, EEH PEs are supposed to be released
    and set to NULL. Normally, this is done by eeh_remove_device(),
    which is called from pcibios_release_device().

    However, if something is holding a kref to the device, it will not
    be released, and the PE will remain. eeh_add_device_late() has
    a check for this which will explicitly destroy the PE in this case.

    This check in eeh_add_device_late() occurs after a call to
    eeh_ops->probe(). On PowerNV, probe is a pointer to pnv_eeh_probe(),
    which will exit without probing if there is an existing PE.

    This means that on PowerNV, devices with outstanding krefs will not
    be rediscovered by EEH correctly after a complete hotplug. This is
    affecting CXL (CAPI) devices in the field.

    Put the probe after the kref check so that the PE is destroyed
    and affected devices are correctly rediscovered by EEH.

    Fixes: d91dafc02f42 ("powerpc/eeh: Delay probing EEH device during hotplug")
    Cc: Gavin Shan
    Signed-off-by: Daniel Axtens
    Acked-by: Gavin Shan
    Signed-off-by: Michael Ellerman
    Signed-off-by: Greg Kroah-Hartman

    Daniel Axtens
     
  • commit 590c7567a2895f939525ead57b0334c6d47986f0 upstream.

    Commit cca87d30 ("powerpc/pci: Refactor pci_dn") introduced a pdn
    list for SRIOV VFs, meaning the pdn is put into the child list
    of its parent pdn when the pdn is created. When doing PCI hot
    unplugging on pSeries, the PCI device node as well as its pdn are
    released through the procfs entry "powerpc/ofdt". Someone else then
    grabs the memory chunk of the pdn and updates it. At the same
    time, the pdn is still tracked in the child list of the parent pdn,
    which leads to a corrupted child list in the parent pdn.

    This fixes the above issue by removing the pdn from the child list of
    its parent pdn when the device node is detached from the system.
    Note the pdn is freed when the device node is released, if the
    device node is a dynamic one. Otherwise, neither the device node
    nor the pdn will be released.

    Fixes: cca87d30 ("powerpc/pci: Refactor pci_dn")
    Reported-by: Santwana Samantray
    Signed-off-by: Gavin Shan
    Signed-off-by: Michael Ellerman
    Signed-off-by: Greg Kroah-Hartman

    Gavin Shan
     

22 Sep, 2015

2 commits

  • commit 1e5bf454f58731e360e504253e85bae7aaa2d298 upstream.

    The reference (R) and change (C) bits in a HPT entry can be set by
    hardware at any time up until the HPTE is invalidated and the TLB
    invalidation sequence has completed. This means that when removing
    a HPTE, we need to read the HPTE after the invalidation sequence has
    completed in order to obtain reliable values of R and C. The code
    in kvmppc_do_h_remove() used to do this. However, commit 6f22bd3265fb
    ("KVM: PPC: Book3S HV: Make HTAB code LE host aware") removed the
    read after invalidation as a side effect of other changes. This
    restores the read of the HPTE after invalidation.

    The user-visible effect of this bug would be that when migrating a
    guest, there is a small probability that a page modified by the guest
    and then unmapped by the guest might not get re-transmitted and thus
    the destination might end up with a stale copy of the page.

    Fixes: 6f22bd3265fb
    Signed-off-by: Paul Mackerras
    Signed-off-by: Alexander Graf
    Signed-off-by: Greg Kroah-Hartman

    Paul Mackerras
     
  • commit 06554d9f6cc8f0b5ec903db19726a15dfc7b09d6 upstream.

    The code that handles the case when we receive a H_DOORBELL interrupt
    has a comment which says "Hypervisor doorbell - exit only if host IPI
    flag set". However, the current code does not actually check if the
    host IPI flag is set. This is due to a comparison instruction that
    got missed.

    As a result, the current code performs the exit to host only
    if some sibling thread or a sibling sub-core is exiting to the
    host. This implies that an IPI sent to a sibling core in
    (subcores-per-core != 1) mode will be missed by the host unless the
    sibling core is on the exit path to the host.

    This patch adds the missing comparison operation, which will ensure
    that when the HOST_IPI flag is set, we unconditionally exit to the host.

    Fixes: 66feed61cdf6
    Signed-off-by: Gautham R. Shenoy
    Reviewed-by: David Gibson
    Signed-off-by: Paul Mackerras
    Signed-off-by: Greg Kroah-Hartman

    Gautham R. Shenoy
     

17 Aug, 2015

1 commit

  • commit 3c00cb5e68dc719f2fc73a33b1b230aadfcb1309 upstream.

    This function can leak kernel stack data when the user siginfo_t has a
    positive si_code value. The top 16 bits of si_code describe which fields
    in the siginfo_t union are active, but they are treated inconsistently
    between copy_siginfo_from_user32, copy_siginfo_to_user32 and
    copy_siginfo_to_user.

    copy_siginfo_from_user32 is called from rt_sigqueueinfo and
    rt_tgsigqueueinfo, in which the user has full control over the top 16 bits
    of si_code.

    This fixes the following information leaks:
    x86: 8 bytes leaked when sending a signal from a 32-bit process to
    itself. This leak grows to 16 bytes if the process uses x32.
    (si_code = __SI_CHLD)
    x86: 100 bytes leaked when sending a signal from a 32-bit process to
    a 64-bit process. (si_code = -1)
    sparc: 4 bytes leaked when sending a signal from a 32-bit process to a
    64-bit process. (si_code = any)

    parisc and s390 have similar bugs, but they are not vulnerable because
    rt_[tg]sigqueueinfo have checks that prevent sending a positive si_code
    to a different process. These bugs are also fixed for consistency.

    Signed-off-by: Amanieu d'Antras
    Cc: Oleg Nesterov
    Cc: Ingo Molnar
    Cc: Russell King
    Cc: Ralf Baechle
    Cc: Benjamin Herrenschmidt
    Cc: Chris Metcalf
    Cc: Paul Mackerras
    Cc: Michael Ellerman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Amanieu d'Antras
     

11 Aug, 2015

1 commit

  • commit b32aadc1a8ed84afbe924cd2ced31cd6a2e67074 upstream.

    core_idle_state is maintained for each core. Bits 0-7 track whether a
    thread in the core has entered fastsleep or winkle, and bit 8 is used
    as a lock bit. The lock bit is set in these two scenarios:
    - The thread is the first in the subcore to wake up from sleep/winkle.
    - The thread is the last in the core about to enter sleep/winkle.

    While the lock bit is set, if any other thread in the core wakes up, it
    loops until the lock bit is cleared before proceeding in the wakeup
    path. This helps prevent race conditions w.r.t. the fastsleep workaround
    and prevents threads from switching to process context before
    core/subcore resources are restored.

    But in the path to sleep/winkle entry, we currently don't check the
    lock bit. This exposes us to the following race when running with
    subcores on:

    First thread in the subcore              Another thread in the same
    waking up                                core entering sleep/winkle

    lwarx   r15,0,r14
    ori     r15,r15,PNV_CORE_IDLE_LOCK_BIT
    stwcx.  r15,0,r14
    [Code to restore subcore state]

                                             lwarx   r15,0,r14
                                             [clear thread bit]
                                             stwcx.  r15,0,r14

    andi.   r15,r15,PNV_CORE_IDLE_THREAD_BITS
    stw     r15,0(r14)

    Here, after the thread entering sleep clears its thread bit in
    core_idle_state, the value is overwritten by the thread waking up.
    In such cases, when the core enters fastsleep, the code mistakes an
    idle thread for a running one. Because of this, the first thread
    waking up from fastsleep, which is supposed to resync the timebase,
    skips doing so. We can thus end up with a core that has a stale
    timebase value.

    This patch fixes the above race by looping on the lock bit even while
    entering the idle states.

    Signed-off-by: Shreyas B. Prabhu
    Fixes: 77b54e9f213f ("powernv/powerpc: Add winkle support for offline cpus")
    Signed-off-by: Michael Ellerman
    Signed-off-by: Greg Kroah-Hartman

    Shreyas B. Prabhu
     

11 Jul, 2015

1 commit

  • commit 72e349f1124a114435e599479c9b8d14bfd1ebcd upstream.

    When we take a PMU exception or a software event we call
    perf_read_regs(). This overloads regs->result with a boolean that
    describes if we should use the sampled instruction address register
    (SIAR) or the regs.

    If the exception happened in the kernel, we start with the kernel regs
    and backtrace through the kernel stack. At that point we switch to the
    userspace regs and backtrace the user stack with perf_callchain_user().

    Unfortunately these regs have not had the perf_read_regs() treatment,
    so regs->result could be anything. If it is non-zero,
    perf_instruction_pointer() decides to use the SIAR, and we get issues
    like this:

    0.11% qemu-system-ppc [kernel.kallsyms] [k] _raw_spin_lock_irqsave
    |
    ---_raw_spin_lock_irqsave
    |
    |--52.35%-- 0
    | |
    | |--46.39%-- __hrtimer_start_range_ns
    | | kvmppc_run_core
    | | kvmppc_vcpu_run_hv
    | | kvmppc_vcpu_run
    | | kvm_arch_vcpu_ioctl_run
    | | kvm_vcpu_ioctl
    | | do_vfs_ioctl
    | | sys_ioctl
    | | system_call
    | | |
    | | |--67.08%-- _raw_spin_lock_irqsave
    Signed-off-by: Michael Ellerman
    Signed-off-by: Greg Kroah-Hartman

    Anton Blanchard
     

22 May, 2015

1 commit

  • Pull KVM fixes from Paolo Bonzini:
    "This includes a fix for two oopses, one on PPC and on x86.

    The rest is fixes for bugs with newer Intel processors"

    * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
    kvm/fpu: Enable eager restore kvm FPU for MPX
    Revert "KVM: x86: drop fpu_activate hook"
    kvm: fix crash in kvm_vcpu_reload_apic_access_page
    KVM: MMU: fix SMAP virtualization
    KVM: MMU: fix CR4.SMEP=1, CR0.WP=0 with shadow pages
    KVM: MMU: fix smap permission check
    KVM: PPC: Book3S HV: Fix list traversal in error case

    Linus Torvalds
     

14 May, 2015

1 commit

  • Recent toolchains force the TOC to be 256 byte aligned. We need
    to enforce this alignment in our linker script, otherwise pointers
    to our TOC variables (__toc_start, __prom_init_toc_start) could
    be incorrect.

    If they are bad, we die a few hundred instructions into boot.

    Cc: stable@vger.kernel.org
    Signed-off-by: Anton Blanchard
    Signed-off-by: Michael Ellerman

    Anton Blanchard
     

12 May, 2015

3 commits

  • Before 69111bac42f5 ("powerpc: Replace __get_cpu_var uses"), in
    save_mce_event, index got the value of mce_nest_count, and
    mce_nest_count was incremented *after* index was set.

    However, that patch changed the behaviour so that mce_nest_count was
    incremented *before* setting index.

    This causes an off-by-one error, as get_mce_event sets index to
    mce_nest_count - 1 before reading mce_event. Thus get_mce_event reads
    bogus data, causing warnings like
    "Machine Check Exception, Unknown event version 0 !"
    and breaking MCE handling.

    Restore the old behaviour and unbreak MCE handling by subtracting one
    from the newly incremented value.

    The same broken change occurred in machine_check_queue_event (which
    sets up a queue read by machine_check_process_queued_event). Fix that
    too, unbreaking the printing of MCE information.
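
    A sketch of the fix on the save_mce_event() side (the queue path is
    analogous):

        /* __this_cpu_inc_return() increments first, so subtract one to
         * recover the pre-increment value that get_mce_event() expects */
        int index = __this_cpu_inc_return(mce_nest_count) - 1;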

    Fixes: 69111bac42f5 ("powerpc: Replace __get_cpu_var uses")
    CC: stable@vger.kernel.org
    CC: Mahesh Salgaonkar
    CC: Christoph Lameter
    Signed-off-by: Daniel Axtens
    Signed-off-by: Michael Ellerman

    Daniel Axtens
     
    We need to check whether the pte is present in follow_huge_addr() and
    properly return NULL if the mapping is not present. Also use READ_ONCE()
    when dereferencing the pte_t address.

    Without this patch, we may wrongly return a zero pfn page in
    follow_huge_addr().
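
    A minimal sketch of the checks:

        pte_t pte = READ_ONCE(*ptep);   /* stable snapshot of the entry */

        if (!pte_present(pte))
                return NULL;            /* no mapping present */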

    Reviewed-by: David Gibson
    Signed-off-by: Aneesh Kumar K.V
    Signed-off-by: Michael Ellerman

    Aneesh Kumar K.V
     
    Serialize against find_linux_pte_or_hugepte(), which does lock-less
    lookup in page tables with local interrupts disabled. For huge pages it
    casts pmd_t to pte_t. Since the format of pte_t is different from pmd_t,
    we want to prevent transit from a pmd pointing to a page table to a pmd
    pointing to a huge page (and back) while interrupts are disabled. We
    clear the pmd to possibly replace it with a page table pointer in
    different code paths. So make sure we wait for the parallel
    find_linux_pte_or_hugepte() to finish.
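
    A sketch of the wait, assuming kick_all_cpus_sync() is used to drain
    walkers that run with interrupts disabled:

        pmd_clear(pmdp);
        /* an IPI to every CPU cannot be acknowledged while a lock-less
         * find_linux_pte_or_hugepte() walker still has IRQs disabled */
        kick_all_cpus_sync();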

    Without this patch, a find_linux_pte_or_hugepte() running in parallel to
    __split_huge_zero_page_pmd() or do_huge_pmd_wp_page_fallback() or
    zap_huge_pmd() can run into the above issue. With
    __split_huge_zero_page_pmd() and do_huge_pmd_wp_page_fallback() we clear
    the hugepage pte before inserting the pmd entry with a regular pgtable
    address. Such a clear needs to wait for the parallel
    find_linux_pte_or_hugepte() to finish.

    With zap_huge_pmd(), we can run into issues with a hugepage pte getting
    zapped due to MADV_DONTNEED while another CPU faults it in as small
    pages.

    Reported-by: Kirill A. Shutemov
    Signed-off-by: Aneesh Kumar K.V
    Reviewed-by: Kirill A. Shutemov
    Signed-off-by: Michael Ellerman

    Aneesh Kumar K.V
     

10 May, 2015

1 commit

  • This fixes a regression introduced in commit 25fedfca94cf, "KVM: PPC:
    Book3S HV: Move vcore preemption point up into kvmppc_run_vcpu", which
    leads to a user-triggerable oops.

    In the case where we try to run a vcore on a physical core that is
    not in single-threaded mode, or the vcore has too many threads for
    the physical core, we iterate the list of runnable vcpus to make
    each one return an EBUSY error to userspace. Since this involves
    taking each vcpu off the runnable_threads list for the vcore, we
    need to use list_for_each_entry_safe rather than list_for_each_entry
    to traverse the list. Otherwise the kernel will crash with an oops
    message like this:

    Unable to handle kernel paging request for data at address 0x000fff88
    Faulting instruction address: 0xd00000001e635dc8
    Oops: Kernel access of bad area, sig: 11 [#2]
    SMP NR_CPUS=1024 NUMA PowerNV
    ...
    CPU: 48 PID: 91256 Comm: qemu-system-ppc Tainted: G D 3.18.0 #1
    task: c00000274e507500 ti: c0000027d1924000 task.ti: c0000027d1924000
    NIP: d00000001e635dc8 LR: d00000001e635df8 CTR: c00000000011ba50
    REGS: c0000027d19275b0 TRAP: 0300 Tainted: G D (3.18.0)
    MSR: 9000000000009033 CR: 22002824 XER: 00000000
    CFAR: c000000000008468 DAR: 00000000000fff88 DSISR: 40000000 SOFTE: 1
    GPR00: d00000001e635df8 c0000027d1927830 d00000001e64c850 0000000000000001
    GPR04: 0000000000000001 0000000000000001 0000000000000000 0000000000000000
    GPR08: 0000000000200200 0000000000000000 0000000000000000 d00000001e63e588
    GPR12: 0000000000002200 c000000007dbc800 c000000fc7800000 000000000000000a
    GPR16: fffffffffffffffc c000000fd5439690 c000000fc7801c98 0000000000000001
    GPR20: 0000000000000003 c0000027d1927aa8 c000000fd543b348 c000000fd543b350
    GPR24: 0000000000000000 c000000fa57f0000 0000000000000030 0000000000000000
    GPR28: fffffffffffffff0 c000000fd543b328 00000000000fe468 c000000fd543b300
    NIP [d00000001e635dc8] kvmppc_run_core+0x198/0x17c0 [kvm_hv]
    LR [d00000001e635df8] kvmppc_run_core+0x1c8/0x17c0 [kvm_hv]
    Call Trace:
    [c0000027d1927830] [d00000001e635df8] kvmppc_run_core+0x1c8/0x17c0 [kvm_hv] (unreliable)
    [c0000027d1927a30] [d00000001e638350] kvmppc_vcpu_run_hv+0x5b0/0xdd0 [kvm_hv]
    [c0000027d1927b70] [d00000001e510504] kvmppc_vcpu_run+0x44/0x60 [kvm]
    [c0000027d1927ba0] [d00000001e50d4a4] kvm_arch_vcpu_ioctl_run+0x64/0x170 [kvm]
    [c0000027d1927be0] [d00000001e504be8] kvm_vcpu_ioctl+0x5e8/0x7a0 [kvm]
    [c0000027d1927d40] [c0000000002d6720] do_vfs_ioctl+0x490/0x780
    [c0000027d1927de0] [c0000000002d6ae4] SyS_ioctl+0xd4/0xf0
    [c0000027d1927e30] [c000000000009358] syscall_exit+0x0/0x98
    Instruction dump:
    60000000 60420000 387e1b30 38800003 38a00001 38c00000 480087d9 e8410018
    ebde1c98 7fbdf040 3bdee368 419e0048 939e1b18 2f890001 409effcc
    ---[ end trace 8cdf50251cca6680 ]---

    Fixes: 25fedfca94cfbf2461314c6c34ef58e74a31b025
    Signed-off-by: Paul Mackerras
    Reviewed-by: Alexander Graf
    Signed-off-by: Paolo Bonzini

    Paul Mackerras
     

01 May, 2015

4 commits

  • Patches 7cba160ad "powernv/cpuidle: Redesign idle states management"
    and 77b54e9f2 "powernv/powerpc: Add winkle support for offline cpus"
    use non-volatile condition registers (cr2, cr3 and cr4) early in the system
    reset interrupt handler (system_reset_pSeries()) before it has been determined
    if state loss has occurred. If state loss has not occurred, control returns via
    the power7_wakeup_noloss() path which does not restore those condition
    registers, leaving them corrupted.

    Fix this by restoring the condition registers in the power7_wakeup_noloss()
    case.

    This is apparent when running a KVM guest on hardware that does not
    support winkle or sleep and the guest makes use of secondary threads. In
    practice this means Power7 machines, though some early unreleased Power8
    machines may also be susceptible.

    The secondary CPUs are taken off line before the guest is started and
    they call pnv_smp_cpu_kill_self(). This checks support for sleep
    states (in this case there is no support) and power7_nap() is called.

    When the CPU is woken, power7_nap() returns and because the CPU is
    still off line, the main while loop executes again. The sleep states
    support test is executed again, but because the tested values cannot
    have changed, the compiler has optimized the test away and instead we
    rely on the result of the first test, which has been left in cr3
    and/or cr4. With the result overwritten, the wrong branch is taken and
    power7_winkle() is called on a CPU that does not support it, leading
    to it stalling.

    Fixes: 7cba160ad789 ("powernv/cpuidle: Redesign idle states management")
    Fixes: 77b54e9f213f ("powernv/powerpc: Add winkle support for offline cpus")
    [mpe: Massage change log a bit more]
    Signed-off-by: Sam Bobroff
    Signed-off-by: Michael Ellerman

    Sam Bobroff
     
    Commit 1c509148b ("powerpc/eeh: Do probe on pci_dn") probes EEH
    devices at an early stage, which is reasonable for the pSeries
    platform. However, it's wrong for the PowerNV platform, because in
    the hotplug case the PE# isn't determined until the resources (IO
    and MMIO) are assigned to the PE. So we have to delay probing EEH
    devices on the PowerNV platform until the PE# is assigned.

    Fixes: ff57b454ddb9 ("powerpc/eeh: Do probe on pci_dn")
    Signed-off-by: Gavin Shan
    Signed-off-by: Michael Ellerman

    Gavin Shan
     
    When asserting reset in pcibios_set_pcie_reset_state(), the PE is
    forced into the (hardware) frozen state in order to drop unexpected
    PCI transactions (except PCI config reads/writes) automatically by
    hardware during reset, which would otherwise cause a recursive EEH
    error. However, the (software) frozen state EEH_PE_ISOLATED is
    missed. When users get 0xFF from a PCI config or MMIO read,
    EEH_PE_ISOLATED is set in the PE state retrieval backend.
    Unfortunately, nobody (neither the reset handler nor the EEH
    recovery functionality in the host) will clear EEH_PE_ISOLATED when
    the PE has been passed through to a guest.

    The patch sets and clears EEH_PE_ISOLATED properly during reset
    in pcibios_set_pcie_reset_state() to fix the issue.

    Fixes: 28158cd ("Enhance pcibios_set_pcie_reset_state()")
    Reported-by: Carol L. Soto
    Signed-off-by: Gavin Shan
    Tested-by: Carol L. Soto
    Signed-off-by: Michael Ellerman

    Gavin Shan
     
    The incorrect ordering of operations during cpu dlpar add results in
    invalid affinity for the cpu being added. The ibm,associativity property
    in the device tree is populated with all zeroes for the added cpu, which
    results in invalid affinity mappings, and all cpus appear to belong to
    node 0.

    This occurs because rtas configure-connector is called prior to making the
    rtas set-indicator calls. Phyp does not assign affinity information
    for a cpu until the rtas set-indicator calls are made to set the isolation
    and allocation state.

    Correct the order of operations to make the rtas set-indicator
    calls (done in dlpar_acquire_drc) before calling rtas configure-connector.

    Fixes: 1a8061c46c46 ("powerpc/pseries: Add kernel based CPU DLPAR handling")

    Signed-off-by: Nathan Fontenot
    Signed-off-by: Michael Ellerman

    Nathan Fontenot
     

30 Apr, 2015

1 commit

  • This reverts commit feba40362b11341bee6d8ed58d54b896abbd9f84.

    Although the principle of this change is good, the implementation has a
    few issues.

    Firstly we can sometimes fail to abort a syscall because r12 may have
    been clobbered by C code if we went down the virtual CPU accounting
    path, or if syscall tracing was enabled.

    Secondly we have decided that it is safer to abort the syscall even
    earlier in the syscall entry path, so that we avoid the syscall tracing
    path when we are transactional.

    So that we have time to thoroughly test those changes we have decided to
    revert this for this merge window and will merge the fixed version in
    the next window.

    NB. Rather than reverting the selftest we just drop tm-syscall from
    TEST_PROGS so that it's not run by default.

    Fixes: feba40362b11 ("powerpc/tm: Abort syscalls in active transactions")
    Signed-off-by: Michael Ellerman

    Michael Ellerman
     

29 Apr, 2015

2 commits

  • Load the PowerNV platform pci controller ops into pci controllers
    after all the operations are loaded into the platform ops struct, not
    before.

    Otherwise we aren't actually setting the ops properly which can break
    IO for some devices.

    Fixes: 65ebf4b63 ("powerpc/powernv: Move controller ops from ppc_md to controller_ops")
    Reported-by: Gavin Shan
    Reviewed-by: Gavin Shan
    Signed-off-by: Daniel Axtens
    Signed-off-by: Michael Ellerman

    Daniel Axtens
     
  • Commit 34cb7954c0aa "Convert ICS mutex lock to spin lock" added an
    include of asm/spinlock.h, which does not work in the SMP=n case.

    It should instead include linux/spinlock.h
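
    The fix is the one-line include swap (sketch):

        #include <linux/spinlock.h>     /* not <asm/spinlock.h> */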

    Fixes: 34cb7954c0aa ("KVM: PPC: Book3S HV: Convert ICS mutex lock to spin lock")
    Acked-by: Paul Mackerras
    Reviewed-by: Alexander Graf
    Signed-off-by: Michael Ellerman

    Michael Ellerman
     

27 Apr, 2015

3 commits

  • Pull fourth vfs update from Al Viro:
    "d_inode() annotations from David Howells (sat in for-next since before
    the beginning of merge window) + four assorted fixes"

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
    RCU pathwalk breakage when running into a symlink overmounting something
    fix I_DIO_WAKEUP definition
    direct-io: only inc/dec inode->i_dio_count for file systems
    fs/9p: fix readdir()
    VFS: assorted d_backing_inode() annotations
    VFS: fs/inode.c helpers: d_inode() annotations
    VFS: fs/cachefiles: d_backing_inode() annotations
    VFS: fs library helpers: d_inode() annotations
    VFS: assorted weird filesystems: d_inode() annotations
    VFS: normal filesystems (and lustre): d_inode() annotations
    VFS: security/: d_inode() annotations
    VFS: security/: d_backing_inode() annotations
    VFS: net/: d_inode() annotations
    VFS: net/unix: d_backing_inode() annotations
    VFS: kernel/: d_inode() annotations
    VFS: audit: d_backing_inode() annotations
    VFS: Fix up some ->d_inode accesses in the chelsio driver
    VFS: Cachefiles should perform fs modifications on the top layer only
    VFS: AF_UNIX sockets should call mknod on the top layer only

    Linus Torvalds
     
  • Pull powerpc fixes from Michael Ellerman:

    - fix for mm_dec_nr_pmds() from Scott.

    - fixes for oopses seen with KVM + THP from Aneesh.

    - build fixes from Aneesh & Shreyas.

    * tag 'powerpc-4.1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/mpe/linux:
    powerpc/mm: Fix build error with CONFIG_PPC_TRANSACTIONAL_MEM disabled
    powerpc/kvm: Fix ppc64_defconfig + PPC_POWERNV=n build error
    powerpc/mm/thp: Return pte address if we find trans_splitting.
    powerpc/mm/thp: Make page table walk safe against thp split/collapse
    KVM: PPC: Remove page table walk helpers
    KVM: PPC: Use READ_ONCE when dereferencing pte_t pointer
    powerpc/hugetlb: Call mm_dec_nr_pmds() in hugetlb_free_pmd_range()

    Linus Torvalds
     
  • Pull second batch of KVM changes from Paolo Bonzini:
    "This mostly includes the PPC changes for 4.1, which this time cover
    Book3S HV only (debugging aids, minor performance improvements and
    some cleanups). But there are also bug fixes and small cleanups for
    ARM, x86 and s390.

    The task_migration_notifier revert and real fix is still pending
    review, but I'll send it as soon as possible after -rc1"

    * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (29 commits)
    KVM: arm/arm64: check IRQ number on userland injection
    KVM: arm: irqfd: fix value returned by kvm_irq_map_gsi
    KVM: VMX: Preserve host CR4.MCE value while in guest mode.
    KVM: PPC: Book3S HV: Use msgsnd for signalling threads on POWER8
    KVM: PPC: Book3S HV: Translate kvmhv_commence_exit to C
    KVM: PPC: Book3S HV: Streamline guest entry and exit
    KVM: PPC: Book3S HV: Use bitmap of active threads rather than count
    KVM: PPC: Book3S HV: Use decrementer to wake napping threads
    KVM: PPC: Book3S HV: Don't wake thread with no vcpu on guest IPI
    KVM: PPC: Book3S HV: Get rid of vcore nap_count and n_woken
    KVM: PPC: Book3S HV: Move vcore preemption point up into kvmppc_run_vcpu
    KVM: PPC: Book3S HV: Minor cleanups
    KVM: PPC: Book3S HV: Simplify handling of VCPUs that need a VPA update
    KVM: PPC: Book3S HV: Accumulate timing information for real-mode code
    KVM: PPC: Book3S HV: Create debugfs file for each guest's HPT
    KVM: PPC: Book3S HV: Add ICP real mode counters
    KVM: PPC: Book3S HV: Move virtual mode ICP functions to real-mode
    KVM: PPC: Book3S HV: Convert ICS mutex lock to spin lock
    KVM: PPC: Book3S HV: Add guest->host real mode completion counters
    KVM: PPC: Book3S HV: Add helpers for lock/unlock hpte
    ...

    Linus Torvalds
     

21 Apr, 2015

5 commits

  • This uses msgsnd where possible for signalling other threads within
    the same core on POWER8 systems, rather than IPIs through the XICS
    interrupt controller. This includes waking secondary threads to run
    the guest, the interrupts generated by the virtual XICS, and the
    interrupts to bring the other threads out of the guest when exiting.

    Aggregated statistics from debugfs across vcpus for a guest with 32
    vcpus, 8 threads/vcore, running on a POWER8, show this before the
    change:

    rm_entry: 3387.6ns (228 - 86600, 1008969 samples)
    rm_exit: 4561.5ns (12 - 3477452, 1009402 samples)
    rm_intr: 1660.0ns (12 - 553050, 3600051 samples)

    and this after the change:

    rm_entry: 3060.1ns (212 - 65138, 953873 samples)
    rm_exit: 4244.1ns (12 - 9693408, 954331 samples)
    rm_intr: 1342.3ns (12 - 1104718, 3405326 samples)

    for a test of booting Fedora 20 big-endian to the login prompt.

    The time taken for a H_PROD hcall (which is handled in the host
    kernel) went down from about 35 microseconds to about 16 microseconds
    with this change.

    The noinline added to kvmppc_run_core turned out to be necessary for
    good performance, at least with gcc 4.9.2 as packaged with Fedora 21
    and a little-endian POWER8 host.

    Signed-off-by: Paul Mackerras
    Signed-off-by: Alexander Graf

    Paul Mackerras
     
  • This replaces the assembler code for kvmhv_commence_exit() with C code
    in book3s_hv_builtin.c. It also moves the IPI sending code that was
    in book3s_hv_rm_xics.c into a new kvmhv_rm_send_ipi() function so it
    can be used by kvmhv_commence_exit() as well as icp_rm_set_vcpu_irq().

    Signed-off-by: Paul Mackerras
    Signed-off-by: Alexander Graf

    Paul Mackerras
     
  • On entry to the guest, secondary threads now wait for the primary to
    switch the MMU after loading up most of their state, rather than before.
    This means that the secondary threads get into the guest sooner, in the
    common case where the secondary threads get to kvmppc_hv_entry before
    the primary thread.

    On exit, the first thread out increments the exit count and interrupts
    the other threads (to get them out of the guest) before saving most
    of its state, rather than after. That means that the other threads
    exit sooner and means that the first thread doesn't spend so much
    time waiting for the other threads at the point where the MMU gets
    switched back to the host.

    This pulls out the code that increments the exit count and interrupts
    other threads into a separate function, kvmhv_commence_exit().
    This also makes sure that r12 and vcpu->arch.trap are set correctly
    in some corner cases.

    Statistics from /sys/kernel/debug/kvm/vm*/vcpu*/timings show the
    improvement. Aggregating across vcpus for a guest with 32 vcpus,
    8 threads/vcore, running on a POWER8, gives this before the change:

    rm_entry: avg 4537.3ns (222 - 48444, 1068878 samples)
    rm_exit: avg 4787.6ns (152 - 165490, 1010717 samples)
    rm_intr: avg 1673.6ns (12 - 341304, 3818691 samples)

    and this after the change:

    rm_entry: avg 3427.7ns (232 - 68150, 1118921 samples)
    rm_exit: avg 4716.0ns (12 - 150720, 1119477 samples)
    rm_intr: avg 1614.8ns (12 - 522436, 3850432 samples)

    showing a substantial reduction in the time spent per guest entry in
    the real-mode guest entry code, and smaller reductions in the real
    mode guest exit and interrupt handling times. (The test was to start
    the guest and boot Fedora 20 big-endian to the login prompt.)

    Signed-off-by: Paul Mackerras
    Signed-off-by: Alexander Graf

    Paul Mackerras
     
  • Currently, the entry_exit_count field in the kvmppc_vcore struct
    contains two 8-bit counts, one of the threads that have started entering
    the guest, and one of the threads that have started exiting the guest.
    This changes it to an entry_exit_map field which contains two bitmaps
    of 8 bits each. The advantage of doing this is that it gives us a
    bitmap of which threads need to be signalled when exiting the guest.
    That means that we no longer need to use the trick of setting the
    HDEC to 0 to pull the other threads out of the guest, which led in
    some cases to a spurious HDEC interrupt on the next guest entry.
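
    A sketch of the new field, with the bit layout assumed for
    illustration:

        struct kvmppc_vcore {
                /* ... */
                u32 entry_exit_map;  /* [15:8] exit map, [7:0] entry map */
        };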

    Signed-off-by: Paul Mackerras
    Signed-off-by: Alexander Graf

    Paul Mackerras
     
  • This arranges for threads that are napping due to their vcpu having
    ceded or due to not having a vcpu to wake up at the end of the guest's
    timeslice without having to be poked with an IPI. We do that by
    arranging for the decrementer to contain a value no greater than the
    number of timebase ticks remaining until the end of the timeslice.
    In the case of a thread with no vcpu, this number is in the hypervisor
    decrementer already. In the case of a ceded vcpu, we use the smaller
    of the HDEC value and the DEC value.

    Using the DEC like this when ceded means we need to save and restore
    the guest decrementer value around the nap.

    Signed-off-by: Paul Mackerras
    Signed-off-by: Alexander Graf

    Paul Mackerras