17 Dec, 2014

1 commit


07 Dec, 2014

3 commits

  • commit 360743814c4082515581aa23ab1d8e699e1fbe88 upstream.

    Instead of the arch specific quirk which we are deprecating
    and that drivers don't understand.

    Signed-off-by: Benjamin Herrenschmidt
    Signed-off-by: Greg Kroah-Hartman

    Benjamin Herrenschmidt
     
  • commit 3b8a3c01096925a824ed3272601082289d9c23a5 upstream.

    On pseries systems (LPAR), xmon fails to enter when running in LE mode
    and the system hangs. Initiating xmon leads to output like this on the
    console:

    SysRq : Entering xmon
    cpu 0x15: Vector: 0 at [c0000003f39ffb10]
    pc: c00000000007ed7c: sysrq_handle_xmon+0x5c/0x70
    lr: c00000000007ed7c: sysrq_handle_xmon+0x5c/0x70
    sp: c0000003f39ffc70
    msr: 8000000000009033
    current = 0xc0000003fafa7180
    paca = 0xc000000007d75e80 softe: 0 irq_happened: 0x01
    pid = 14617, comm = bash
    Bad kernel stack pointer fafb4b0 at eca7cc4
    cpu 0x15: Vector: 300 (Data Access) at [c000000007f07d40]
    pc: 000000000eca7cc4
    lr: 000000000eca7c44
    sp: fafb4b0
    msr: 8000000000001000
    dar: 10000000
    dsisr: 42000000
    current = 0xc0000003fafa7180
    paca = 0xc000000007d75e80 softe: 0 irq_happened: 0x01
    pid = 14617, comm = bash
    cpu 0x15: Exception 300 (Data Access) in xmon, returning to main loop
    xmon: WARNING: bad recursive fault on cpu 0x15

    The root cause is that xmon calls RTAS to turn off surveillance when
    entering xmon, and RTAS requires big-endian parameters.

    This patch byte-swaps the RTAS arguments when running in LE mode.
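
    A minimal sketch of the idea, assuming the usual struct rtas_args
    layout; the set-indicator token, SURVEILLANCE_TOKEN and the argument
    values are illustrative, not the exact upstream diff:

    static void disable_surveillance_sketch(void)
    {
        struct rtas_args args;

        /* RTAS consumes and returns big-endian values, so byte-swap
         * everything when the kernel runs little-endian. */
        args.token = cpu_to_be32(rtas_token("set-indicator"));
        args.nargs = cpu_to_be32(3);
        args.nret  = cpu_to_be32(1);
        args.rets  = &args.args[3];
        args.args[0] = cpu_to_be32(SURVEILLANCE_TOKEN);
        args.args[1] = cpu_to_be32(0);  /* index */
        args.args[2] = cpu_to_be32(0);  /* new value: surveillance off */

        enter_rtas(__pa(&args));
    }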

    Signed-off-by: Laurent Dufour
    Signed-off-by: Michael Ellerman
    Signed-off-by: Greg Kroah-Hartman

    Laurent Dufour
     
  • commit 415072a041bf50dbd6d56934ffc0cbbe14c97be8 upstream.

    Instead of the arch specific quirk which we are deprecating.

    Signed-off-by: Benjamin Herrenschmidt
    Signed-off-by: Greg Kroah-Hartman

    Benjamin Herrenschmidt
     

15 Nov, 2014

1 commit

  • commit 10ccaf178b2b961d8bca252d647ed7ed8aae2a20 upstream.

    In powerpc pseries platform dlpar operations, use device_online() and
    device_offline() instead of cpu_up() and cpu_down().

    Calling cpu_up/down() directly does not update the cpu device offline
    field, which is used to online/offline a cpu from sysfs. Calling
    device_online/offline() instead keeps the sysfs cpu online value
    correct. The hotplug lock, which is required to be held when calling
    device_online/offline(), is already held when dlpar_online/offline_cpu()
    are called, since they are called only from cpu_probe|release_store().

    This patch fixes errors on phyp (PowerVM) systems that have cpu(s)
    added/removed using dlpar operations; without this patch, the
    /sys/devices/system/cpu/cpuN/online nodes do not correctly show the
    online state of added/removed cpus.
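
    A minimal sketch of the pattern described above; error handling is
    abbreviated and the device hotplug lock is assumed to be held by the
    caller, as the text notes:

    static int dlpar_online_cpu_sketch(unsigned int cpu)
    {
        struct device *dev = get_cpu_device(cpu);

        if (!dev)
            return -ENODEV;

        /* device_online() brings the cpu up and also updates
         * dev->offline, which backs /sys/devices/system/cpu/cpuN/online. */
        return device_online(dev);
    }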

    Signed-off-by: Dan Streetman
    Cc: Nathan Fontenot
    Fixes: 0902a9044fa5 ("Driver core: Use generic offline/online for CPU offline/online")
    Signed-off-by: Michael Ellerman
    Signed-off-by: Greg Kroah-Hartman

    Dan Streetman
     

31 Oct, 2014

1 commit

  • commit 9410e0185e65394c0c6d046033904b53b97a9423 upstream.

    rtas_call() accepts and returns values in CPU endianness.
    The ddw_query_response and ddw_create_response struct members are
    defined and treated as BE, but since they are passed to rtas_call() as
    (u32 *) they get byteswapped automatically, so the data is actually
    CPU-endian. This fixes the ddw_query_response and ddw_create_response
    definitions and their use.

    of_read_number() is designed to work with device tree cells - it assumes
    the input is big-endian and returns data in CPU-endian. However due
    to the ddw_create_response struct fix, create.addr_hi/lo are already
    CPU-endian so do not byteswap them.

    ddw_avail is a pointer to the "ibm,ddw-applicable" property, which
    contains 3 cells that are big-endian as they come from the device tree.
    rtas_call() accepts an RTAS token in CPU endianness. This makes use of
    of_property_read_u32_array() to byte-swap the cells and avoid the need
    for a number of be32_to_cpu() calls.
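
    A short sketch of the of_property_read_u32_array() usage described
    above; 'np' and the surrounding function are illustrative:

    static int read_ddw_tokens_sketch(struct device_node *np, u32 ddw_avail[3])
    {
        /* of_property_read_u32_array() byteswaps the big-endian
         * device-tree cells into CPU endianness for us. */
        int ret = of_property_read_u32_array(np, "ibm,ddw-applicable",
                                             &ddw_avail[0], 3);
        if (ret)
            return ret;     /* property missing or malformed */

        /* ddw_avail[0..2] now hold the query/create/remove RTAS tokens
         * in CPU endianness, ready to pass to rtas_call(). */
        return 0;
    }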

    Cc: Benjamin Herrenschmidt
    [aik: folded Anton's patch with of_property_read_u32_array]
    Signed-off-by: Alexey Kardashevskiy
    Acked-by: Anton Blanchard
    Signed-off-by: Michael Ellerman
    Signed-off-by: Greg Kroah-Hartman

    Alexey Kardashevskiy
     

06 Oct, 2014

3 commits

  • commit 78e05b1421fa41ae8457701140933baa5e7d9479 upstream.

    Similar to the previous commit which described why we need to add a
    barrier to arch_spin_is_locked(), we have a similar problem with
    spin_unlock_wait().

    We need a barrier on entry to ensure any spinlock we have previously
    taken is visibly locked prior to the load of lock->slock.

    It's also not clear if spin_unlock_wait() is intended to have ACQUIRE
    semantics. For now be conservative and add a barrier on exit to give it
    ACQUIRE semantics.
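
    A conservative sketch of the barrier placement described above (not
    the exact upstream diff):

    static inline void arch_spin_unlock_wait_sketch(arch_spinlock_t *lock)
    {
        smp_mb();       /* our own held locks are visibly locked first */

        while (arch_spin_is_locked(lock))
            cpu_relax();

        smp_mb();       /* conservatively give ACQUIRE semantics on exit */
    }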

    Signed-off-by: Michael Ellerman
    Signed-off-by: Benjamin Herrenschmidt
    Signed-off-by: Greg Kroah-Hartman

    Michael Ellerman
     
  • commit 51d7d5205d3389a32859f9939f1093f267409929 upstream.

    The kernel defines the function spin_is_locked(), which can be used to
    check if a spinlock is currently locked.

    Using spin_is_locked() on a lock you don't hold is obviously racy. That
    is, even though you may observe that the lock is unlocked, it may become
    locked at any time.

    There is (at least) one exception to that, which is if two locks are
    used as a pair, and the holder of each checks the status of the other
    before doing any update.

    Assuming *A and *B are two locks, and *COUNTER is a shared non-atomic
    value:

    The first CPU does:

    spin_lock(*A)

    if spin_is_locked(*B)
        # nothing
    else
        smp_mb()
        LOAD r = *COUNTER
        r++
        STORE *COUNTER = r

    spin_unlock(*A)

    And the second CPU does:

    spin_lock(*B)

    if spin_is_locked(*A)
        # nothing
    else
        smp_mb()
        LOAD r = *COUNTER
        r++
        STORE *COUNTER = r

    spin_unlock(*B)

    Although this is a strange locking construct, it should work.

    It seems to be understood, but not documented, that spin_is_locked() is
    not a memory barrier, so in the examples above and below the caller
    inserts its own memory barrier before acting on the result of
    spin_is_locked().

    For now we assume spin_is_locked() is implemented as below, and we break
    it out in our examples:

    bool spin_is_locked(*LOCK) {
        LOAD l = *LOCK
        return l.locked
    }

    Our intuition is that there should be no problem even if the two code
    sequences run simultaneously such as:

    CPU 0                       CPU 1
    ==================================================
    spin_lock(*A)               spin_lock(*B)
    LOAD b = *B                 LOAD a = *A
    if b.locked # true          if a.locked # true
      # nothing                   # nothing
    spin_unlock(*A)             spin_unlock(*B)

    If one CPU gets the lock before the other then it will do the update and
    the other CPU will back off:

    CPU 0                       CPU 1
    ==================================================
    spin_lock(*A)
    LOAD b = *B
                                spin_lock(*B)
    if b.locked # false         LOAD a = *A
    else                        if a.locked # true
      smp_mb()                    # nothing
      LOAD r1 = *COUNTER        spin_unlock(*B)
      r1++
      STORE *COUNTER = r1
    spin_unlock(*A)

    However in reality spin_lock() itself is not indivisible. On powerpc we
    implement it as a load-and-reserve and store-conditional.

    Ignoring the retry logic for the lost reservation case, it boils down to:

    spin_lock(*LOCK) {
        LOAD l = *LOCK
        l.locked = true
        STORE *LOCK = l
        ACQUIRE_BARRIER
    }

    The ACQUIRE_BARRIER is required to give spin_lock() ACQUIRE semantics as
    defined in memory-barriers.txt:

    This acts as a one-way permeable barrier. It guarantees that all
    memory operations after the ACQUIRE operation will appear to happen
    after the ACQUIRE operation with respect to the other components of
    the system.

    On modern powerpc systems we use lwsync for ACQUIRE_BARRIER. lwsync is
    also known as "lightweight sync", or "sync 1".

    As described in Power ISA v2.07 section B.2.1.1, in this scenario the
    lwsync is not the barrier itself. It instead causes the LOAD of *LOCK to
    act as the barrier, preventing any loads or stores in the locked region
    from occurring prior to the load of *LOCK.

    Whether this behaviour is in accordance with the definition of ACQUIRE
    semantics in memory-barriers.txt is open to discussion; we may switch to
    a different barrier in future.

    What this means in practice is that the following can occur:

    CPU 0                       CPU 1
    ==================================================
    LOAD a = *A                 LOAD b = *B
    a.locked = true             b.locked = true
    LOAD b = *B                 LOAD a = *A
    STORE *A = a                STORE *B = b
    if b.locked # false         if a.locked # false
    else                        else
      smp_mb()                    smp_mb()
      LOAD r1 = *COUNTER          LOAD r2 = *COUNTER
      r1++                        r2++
      STORE *COUNTER = r1
                                  STORE *COUNTER = r2 # Lost update
    spin_unlock(*A)             spin_unlock(*B)

    That is, the load of *B can occur prior to the store that makes *A
    visibly locked. And similarly for CPU 1. The result is both CPUs hold
    their lock and believe the other lock is unlocked.

    The easiest fix for this is to add a full memory barrier to the start of
    spin_is_locked(), so adding to our previous definition would give us:

    bool spin_is_locked(*LOCK) {
        smp_mb()
        LOAD l = *LOCK
        return l.locked
    }

    The new barrier orders the store to the lock we are locking vs the load
    of the other lock:

    CPU 0                       CPU 1
    ==================================================
    LOAD a = *A                 LOAD b = *B
    a.locked = true             b.locked = true
    STORE *A = a                STORE *B = b
    smp_mb()                    smp_mb()
    LOAD b = *B                 LOAD a = *A
    if b.locked # true          if a.locked # true
      # nothing                   # nothing
    spin_unlock(*A)             spin_unlock(*B)

    Although the above example is theoretical, there is code similar to it
    in sem_lock() in ipc/sem.c. This commit, together with the next one,
    appears to fix crashes we are seeing in that code, where we believe this
    race happens in practice.
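
    A minimal sketch of the fix, mirroring the pseudocode above rather
    than the exact powerpc implementation:

    static inline int arch_spin_is_locked_sketch(arch_spinlock_t *lock)
    {
        smp_mb();       /* order our own lock's store vs this load */
        return lock->slock != 0;
    }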

    Signed-off-by: Michael Ellerman
    Signed-off-by: Benjamin Herrenschmidt
    Signed-off-by: Greg Kroah-Hartman

    Michael Ellerman
     
  • commit 85101af13bb854a6572fa540df7c7201958624b9 upstream.

    ABIv2 kernels are failing to backtrace through the kernel. An example:

    39.30% readseek2_proce [kernel.kallsyms] [k] find_get_entry
    |
    --- find_get_entry
    __GI___libc_read

    The problem is in valid_next_sp() where we check that the new stack
    pointer is at least STACK_FRAME_OVERHEAD below the previous one.

    ABIv1 has a minimum stack frame size of 112 bytes, consisting of a
    48 byte fixed area and a 64 byte parameter save area. ABIv2 changes
    that to 32 bytes with no parameter save area.

    STACK_FRAME_OVERHEAD is in theory the minimum stack frame size, but we
    have over 240 uses of it, some of which assume that it includes space
    for the parameter save area.

    We need to work through all our stack defines and rationalise them,
    but let's fix perf now by creating STACK_FRAME_MIN_SIZE and using it
    in valid_next_sp(). This fixes the issue:

    30.64% readseek2_proce [kernel.kallsyms] [k] find_get_entry
    |
    --- find_get_entry
    pagecache_get_page
    generic_file_read_iter
    new_sync_read
    vfs_read
    sys_read
    syscall_exit
    __GI___libc_read
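
    A hedged sketch of the valid_next_sp() check described above, using
    the new STACK_FRAME_MIN_SIZE in place of STACK_FRAME_OVERHEAD; the
    real function performs additional checks:

    static int valid_next_sp_sketch(unsigned long sp, unsigned long prev_sp)
    {
        if (sp & 0xf)
            return 0;       /* frames must be 16-byte aligned */
        if (sp < prev_sp + STACK_FRAME_MIN_SIZE)
            return 0;       /* too small to be a real frame */
        return 1;
    }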

    Reported-by: Aneesh Kumar K.V
    Signed-off-by: Anton Blanchard
    Signed-off-by: Greg Kroah-Hartman

    Anton Blanchard
     

18 Sep, 2014

10 commits

  • commit 7e467245bf5226db34c4b12d3cbacfa2f7a15a8b upstream.

    We would get wrong results if the compiler recomputed old_pmd. Avoid
    that by using ACCESS_ONCE.
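
    A small sketch of the pattern, assuming a THP-style helper; names
    are illustrative:

    static void read_pmd_once_sketch(pmd_t *pmdp)
    {
        /* A single, non-reloadable read: the compiler may not re-fetch
         * *pmdp behind our back and compute with two different values. */
        pmd_t old_pmd = ACCESS_ONCE(*pmdp);

        if (pmd_trans_huge(old_pmd)) {
            /* ... operate only on the old_pmd snapshot ... */
        }
    }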

    Signed-off-by: Aneesh Kumar K.V
    Signed-off-by: Benjamin Herrenschmidt
    Signed-off-by: Greg Kroah-Hartman

    Aneesh Kumar K.V
     
  • commit 969b7b208f7408712a3526856e4ae60ad13f6928 upstream.

    As per the ISA, for a 4K base page size we compare bits 14..65 of the
    VA specified with the entry VA in the TLB. That implies we need to make
    sure we do a tlbie with all the possible 4K VAs we used to access the
    16MB hugepage. With a 64K base page size we compare bits 14..57 of the
    VA, hence we cannot ignore the lower 24 bits of the VA when doing a
    tlbie. We also cannot invalidate a 16MB entry with just one tlbie
    instruction because we don't track which VA was used to instantiate the
    TLB entry.

    Signed-off-by: Aneesh Kumar K.V
    Signed-off-by: Benjamin Herrenschmidt
    Signed-off-by: Greg Kroah-Hartman

    Aneesh Kumar K.V
     
  • commit fc0479557572375100ef16c71170b29a98e0d69a upstream.

    If we changed base page size of the segment, either via sub_page_protect
    or via remap_4k_pfn, we do a demote_segment which doesn't flush the hash
    table entries. We do a lazy hash page table flush for all mapped pages
    in the demoted segment. This happens when we handle hash page fault for
    these pages.

    We use _PAGE_COMBO bit along with _PAGE_HASHPTE to indicate whether a
    pte is backed by 4K hash pte. If we find _PAGE_COMBO not set on the pte,
    that implies that we could possibly have older 64K hash pte entries in
    the hash page table and we need to invalidate those entries.

    Use _PAGE_COMBO to determine the page size with which we should
    invalidate the hash table entries on unmap.

    Signed-off-by: Aneesh Kumar K.V
    Signed-off-by: Benjamin Herrenschmidt
    Signed-off-by: Greg Kroah-Hartman

    Aneesh Kumar K.V
     
  • commit 629149fae478f0ac6bf705a535708b192e9c6b59 upstream.

    If we changed base page size of the segment, either via sub_page_protect
    or via remap_4k_pfn, we do a demote_segment which doesn't flush the hash
    table entries. We do a lazy hash page table flush for all mapped pages
    in the demoted segment. This happens when we handle hash page fault
    for these pages.

    We use _PAGE_COMBO bit along with _PAGE_HASHPTE to indicate whether a
    pte is backed by 4K hash pte. If we find _PAGE_COMBO not set on the pte,
    that implies that we could possibly have older 64K hash pte entries in
    the hash page table and we need to invalidate those entries.

    Handle this correctly for 16M pages.

    Signed-off-by: Aneesh Kumar K.V
    Signed-off-by: Benjamin Herrenschmidt
    Signed-off-by: Greg Kroah-Hartman

    Aneesh Kumar K.V
     
  • commit fa1f8ae80f8bb996594167ff4750a0b0a5a5bb5d upstream.

    The segment identifier and segment size will remain the same in the
    loop, so we can compute them outside it. We also change the
    hugepage_invalidate() interface so that we can use it in a later patch.

    Signed-off-by: Aneesh Kumar K.V
    Signed-off-by: Benjamin Herrenschmidt
    Signed-off-by: Greg Kroah-Hartman

    Aneesh Kumar K.V
     
  • commit b0aa44a3dfae3d8f45bd1264349aa87f87b7774f upstream.

    With hugepages, we store the hpte valid information in the pte page
    whose address is stored in the second half of the PMD. Use a
    write barrier to make sure clearing pmd busy bit and updating
    hpte valid info are ordered properly.
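
    A hedged sketch of the ordering described above; names are
    illustrative of the THP hash code, not the exact upstream diff:

    static void publish_hpte_then_unbusy_sketch(unsigned char *hpte_slot_array,
                                                unsigned int index,
                                                unsigned int hidx, pmd_t *pmdp)
    {
        hpte_slot_array[index] = (hidx << 1) | 0x1; /* mark hpte valid */

        smp_wmb();      /* hpte info must be visible before busy clear */

        *pmdp = __pmd(pmd_val(*pmdp) & ~_PAGE_BUSY);
    }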

    Signed-off-by: Aneesh Kumar K.V
    Signed-off-by: Benjamin Herrenschmidt
    Signed-off-by: Greg Kroah-Hartman

    Aneesh Kumar K.V
     
  • commit 5efbabe09d986f25c02d19954660238fcd7f008a upstream.

    Function remove_ddw() can be called from an of_reconfig notifier, and
    there we potentially remove the dynamic DMA window property, which
    invokes the of_reconfig notifier again. Eventually, this leads to the
    deadlock the following backtrace shows.

    The patch fixes the issue by deferring the release of the dynamic DMA
    window property until the device node itself is released.

    =============================================
    [ INFO: possible recursive locking detected ]
    3.16.0+ #428 Tainted: G W
    ---------------------------------------------
    drmgr/2273 is trying to acquire lock:
    ((of_reconfig_chain).rwsem){.+.+..}, at: [] \
    .__blocking_notifier_call_chain+0x40/0x78

    but task is already holding lock:
    ((of_reconfig_chain).rwsem){.+.+..}, at: [] \
    .__blocking_notifier_call_chain+0x40/0x78

    other info that might help us debug this:
    Possible unsafe locking scenario:

    CPU0
    ----
    lock((of_reconfig_chain).rwsem);
    lock((of_reconfig_chain).rwsem);
    *** DEADLOCK ***

    May be due to missing lock nesting notation

    2 locks held by drmgr/2273:
    #0: (sb_writers#4){.+.+.+}, at: [] \
    .vfs_write+0xb0/0x1f8
    #1: ((of_reconfig_chain).rwsem){.+.+..}, at: [] \
    .__blocking_notifier_call_chain+0x40/0x78

    stack backtrace:
    CPU: 17 PID: 2273 Comm: drmgr Tainted: G W 3.16.0+ #428
    Call Trace:
    [c0000000137e7000] [c000000000013d9c] .show_stack+0x88/0x148 (unreliable)
    [c0000000137e70b0] [c00000000083cd34] .dump_stack+0x7c/0x9c
    [c0000000137e7130] [c0000000000b8afc] .__lock_acquire+0x128c/0x1c68
    [c0000000137e7280] [c0000000000b9a4c] .lock_acquire+0xe8/0x104
    [c0000000137e7350] [c00000000083588c] .down_read+0x4c/0x90
    [c0000000137e73e0] [c000000000091890] .__blocking_notifier_call_chain+0x40/0x78
    [c0000000137e7490] [c000000000091900] .blocking_notifier_call_chain+0x38/0x48
    [c0000000137e7520] [c000000000682a28] .of_reconfig_notify+0x34/0x5c
    [c0000000137e75b0] [c000000000682a9c] .of_property_notify+0x4c/0x54
    [c0000000137e7650] [c000000000682bf0] .of_remove_property+0x30/0xd4
    [c0000000137e76f0] [c000000000052a44] .remove_ddw+0x144/0x168
    [c0000000137e7790] [c000000000053204] .iommu_reconfig_notifier+0x30/0xe0
    [c0000000137e7820] [c00000000009137c] .notifier_call_chain+0x6c/0xb4
    [c0000000137e78c0] [c0000000000918ac] .__blocking_notifier_call_chain+0x5c/0x78
    [c0000000137e7970] [c000000000091900] .blocking_notifier_call_chain+0x38/0x48
    [c0000000137e7a00] [c000000000682a28] .of_reconfig_notify+0x34/0x5c
    [c0000000137e7a90] [c000000000682e14] .of_detach_node+0x44/0x1fc
    [c0000000137e7b40] [c0000000000518e4] .ofdt_write+0x3ac/0x688
    [c0000000137e7c20] [c000000000238430] .proc_reg_write+0xb8/0xd4
    [c0000000137e7cd0] [c0000000001cbeac] .vfs_write+0xec/0x1f8
    [c0000000137e7d70] [c0000000001cc3b0] .SyS_write+0x58/0xa0
    [c0000000137e7e30] [c00000000000a064] syscall_exit+0x0/0x98

    Signed-off-by: Gavin Shan
    Signed-off-by: Benjamin Herrenschmidt
    Signed-off-by: Greg Kroah-Hartman

    Gavin Shan
     
  • commit f1b3929c232784580e5d8ee324b6bc634e709575 upstream.

    While running the command "drmgr -c phb -r -s 'PHB 528'", the following
    backtrace appeared because the target device node isn't marked with
    OF_DETACHED by of_detach_node(). That is caused by an error returned
    from the memory hotplug related reconfig notifier when
    CONFIG_MEMORY_HOTREMOVE is disabled. The patch fixes it.

    ERROR: Bad of_node_put() on /pci@800000020000210/ethernet@0
    CPU: 14 PID: 2252 Comm: drmgr Tainted: G W 3.16.0+ #427
    Call Trace:
    [c000000012a776a0] [c000000000013d9c] .show_stack+0x88/0x148 (unreliable)
    [c000000012a77750] [c00000000083cd34] .dump_stack+0x7c/0x9c
    [c000000012a777d0] [c0000000006807c4] .of_node_release+0x58/0xe0
    [c000000012a77860] [c00000000038a7d0] .kobject_release+0x174/0x1b8
    [c000000012a77900] [c00000000038a884] .kobject_put+0x70/0x78
    [c000000012a77980] [c000000000681680] .of_node_put+0x28/0x34
    [c000000012a77a00] [c000000000681ea8] .__of_get_next_child+0x64/0x70
    [c000000012a77a90] [c000000000682138] .of_find_node_by_path+0x1b8/0x20c
    [c000000012a77b40] [c000000000051840] .ofdt_write+0x308/0x688
    [c000000012a77c20] [c000000000238430] .proc_reg_write+0xb8/0xd4
    [c000000012a77cd0] [c0000000001cbeac] .vfs_write+0xec/0x1f8
    [c000000012a77d70] [c0000000001cc3b0] .SyS_write+0x58/0xa0
    [c000000012a77e30] [c00000000000a064] syscall_exit+0x0/0x98

    Signed-off-by: Gavin Shan
    Signed-off-by: Benjamin Herrenschmidt
    Signed-off-by: Greg Kroah-Hartman

    Gavin Shan
     
  • commit 85c1fafd7262e68ad821ee1808686b1392b1167d upstream.

    On ppc64 we support 4K hash ptes with a 64K page size. That requires
    us to track the hash pte slot information on a per-4K basis. We do that
    by storing the slot details in the second half of the pte page. The pte
    bit _PAGE_COMBO is used to indicate whether the second half needs to be
    looked at while building the real_pte. We need to use a read memory
    barrier while doing that so that the load of hidx is not reordered
    w.r.t. the _PAGE_COMBO check. On the store side we already do an lwsync
    in __hash_page_4K.
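
    A sketch of the read side described above, loosely following the
    64K real_pte helpers; not the exact upstream diff:

    static inline unsigned long rpte_to_hidx_sketch(real_pte_t rpte,
                                                    unsigned long index)
    {
        if (!(pte_val(rpte.pte) & _PAGE_COMBO))
            return 0;

        smp_rmb();      /* don't reorder the hidx load before the
                         * _PAGE_COMBO check; pairs with the lwsync
                         * in __hash_page_4K */
        return (pte_val(rpte.hidx) >> (index << 2)) & 0xf;
    }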

    Signed-off-by: Aneesh Kumar K.V
    Signed-off-by: Benjamin Herrenschmidt
    Signed-off-by: Greg Kroah-Hartman

    Aneesh Kumar K.V
     
  • commit b00fc6ec1f24f9d7af9b8988b6a198186eb3408c upstream.

    Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=81631
    Reported-by: David Binderman
    Signed-off-by: Andrey Utkin
    Signed-off-by: Benjamin Herrenschmidt
    Signed-off-by: Greg Kroah-Hartman

    Andrey Utkin
     

06 Sep, 2014

2 commits

  • commit a32305bf90a2ae0e6a9a93370c7616565f75e15a upstream.

    powerpc defines various machine-specific routines for handling
    pci_set_dma_mask(). The routines for machine "PowerNV" may neglect
    to set dev->dma_mask. This could confuse anyone (e.g. drivers) that
    consult dev->dma_mask to find the current mask. Set the dma_mask in
    the PowerNV leaf routine.
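
    A minimal sketch of the missing assignment, assuming the PowerNV
    leaf routine takes the device and mask; the surrounding logic is
    elided:

    static int pnv_dma_set_mask_sketch(struct device *dev, u64 dma_mask)
    {
        /* ... pick 32-bit vs bypass DMA ops as the real routine does ... */

        *dev->dma_mask = dma_mask;  /* remember the mask for later readers */
        return 0;
    }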

    Signed-off-by: Brian W. Hart
    Signed-off-by: Benjamin Herrenschmidt
    Signed-off-by: Greg Kroah-Hartman

    Brian W Hart
     
  • commit 7340056567e32b2c9d3554eb146e1977c93da116 upstream.

    Commit bcdde7e made __sysfs_remove_dir() recursive and introduced a
    BUG_ON during PHB removal while attempting to delete the power
    management attribute group of the bus. This is a result of tearing the
    bridge and bus devices down out of order in remove_phb_dynamic. Since
    the bus resides below the bridge in the sysfs device tree, it should be
    torn down first.

    This patch simply moves the device_unregister call for the PHB bridge device
    after the device_unregister call for the PHB bus.
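
    A sketch of the fixed teardown order, assuming 'b' is the struct
    pci_bus of the PHB:

    static void remove_phb_teardown_sketch(struct pci_bus *b)
    {
        device_unregister(&b->dev);     /* the bus, which sits below... */
        device_unregister(b->bridge);   /* ...then the PHB bridge device */
    }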

    Fixes: bcdde7e221a8 ("sysfs: make __sysfs_remove_dir() recursive")
    Signed-off-by: Tyrel Datwyler
    Signed-off-by: Benjamin Herrenschmidt
    Signed-off-by: Greg Kroah-Hartman

    Tyrel Datwyler
     

28 Jul, 2014

1 commit

  • commit 4badad352a6bb202ec68afa7a574c0bb961e5ebc upstream.

    The optimistic spin code assumes regular stores and cmpxchg() play nice;
    this is found to not be true for at least: parisc, sparc32, tile32,
    metag-lock1, arc-!llsc and hexagon.

    There is further wreckage, but this in particular seemed easy to
    trigger, so blacklist this.

    Opt in for known good archs.

    Signed-off-by: Peter Zijlstra
    Reported-by: Mikulas Patocka
    Cc: David Miller
    Cc: Chris Metcalf
    Cc: James Bottomley
    Cc: Vineet Gupta
    Cc: Jason Low
    Cc: Waiman Long
    Cc: "James E.J. Bottomley"
    Cc: Paul McKenney
    Cc: John David Anglin
    Cc: James Hogan
    Cc: Linus Torvalds
    Cc: Davidlohr Bueso
    Cc: Benjamin Herrenschmidt
    Cc: Catalin Marinas
    Cc: Russell King
    Cc: Will Deacon
    Cc: linux-arm-kernel@lists.infradead.org
    Cc: linux-kernel@vger.kernel.org
    Cc: linuxppc-dev@lists.ozlabs.org
    Cc: sparclinux@vger.kernel.org
    Link: http://lkml.kernel.org/r/20140606175316.GV13930@laptop.programming.kicks-ass.net
    Signed-off-by: Ingo Molnar
    Signed-off-by: Greg Kroah-Hartman

    Peter Zijlstra
     

18 Jul, 2014

4 commits

  • commit fb43e8477ed9006c4f397f904c691a120503038c upstream.

    powerpc:allmodconfig has been failing for some time with the following
    error.

    arch/powerpc/kernel/exceptions-64s.S: Assembler messages:
    arch/powerpc/kernel/exceptions-64s.S:1312: Error: attempt to move .org backwards
    make[1]: *** [arch/powerpc/kernel/head_64.o] Error 1

    A number of attempts to fix the problem by moving around code have been
    unsuccessful and resulted in failed builds for some configurations and
    the discovery of toolchain bugs.

    Fix the problem by disabling RELOCATABLE for COMPILE_TEST builds instead.
    While this is less than perfect, it avoids substantial code changes
    which would otherwise be necessary just to make COMPILE_TEST builds
    happy and might have undesired side effects.

    Signed-off-by: Guenter Roeck
    Signed-off-by: Benjamin Herrenschmidt
    Signed-off-by: Greg Kroah-Hartman

    Guenter Roeck
     
  • commit b50a6c584bb47b370f84bfd746770c0bbe7129b7 upstream.

    On POWER8 when switching to a KVM guest we set bits in MMCR2 to freeze
    the PMU counters. Aside from on boot they are then never reset,
    resulting in stuck perf counters for any user in the guest or host.

    We now set MMCR2 to 0 whenever enabling the PMU, which provides a sane
    state for perf to use the PMU counters under either the guest or the
    host.
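
    A fragment sketch of the change as it might sit in power_pmu_enable();
    'ppmu' is the file-local PMU descriptor and PPMU_ARCH_207S is the
    v2.07 feature bit described in the next entry:

    static void power_pmu_enable_fragment_sketch(void)
    {
        /* Reset MMCR2 to a sane state so counters frozen by a guest
         * switch start counting again. */
        if (ppmu->flags & PPMU_ARCH_207S)
            mtspr(SPRN_MMCR2, 0);
    }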

    This was manifesting as a bug with ppc64_cpu --frequency:

    $ sudo ppc64_cpu --frequency
    WARNING: couldn't run on cpu 0
    WARNING: couldn't run on cpu 8
    ...
    WARNING: couldn't run on cpu 144
    WARNING: couldn't run on cpu 152
    min: 18446744073.710 GHz (cpu -1)
    max: 0.000 GHz (cpu -1)
    avg: 0.000 GHz

    The command uses a perf counter to measure CPU cycles over a fixed
    amount of time, in order to approximate the frequency of the machine.
    The counters were returning zero once a guest was started, regardless
    of whether it was still running or had been shut down.

    By dumping the value of MMCR2, it was observed that once a guest is
    running, MMCR2 is set to all 1s, which stops the counters from running:

    $ sudo sh -c 'echo p > /proc/sysrq-trigger'
    CPU: 0 PMU registers, ppmu = POWER8 n_counters = 6
    PMC1: 5b635e38 PMC2: 00000000 PMC3: 00000000 PMC4: 00000000
    PMC5: 1bf5a646 PMC6: 5793d378 PMC7: deadbeef PMC8: deadbeef
    MMCR0: 0000000080000000 MMCR1: 000000001e000000 MMCRA: 0000040000000000
    MMCR2: fffffffffffffc00 EBBHR: 0000000000000000
    EBBRR: 0000000000000000 BESCR: 0000000000000000
    SIAR: 00000000000a51cc SDAR: c00000000fc40000 SIER: 0000000001000000

    This is done unconditionally in book3s_hv_interrupts.S upon entering
    the guest, and the original value is only saved/restored if the host
    has indicated it was using the PMU. This is okay; however, the user of
    the PMU needs to ensure that it is in a defined state when it starts
    using it.

    Fixes: e05b9b9e5c10 ("powerpc/perf: Power8 PMU support")
    Signed-off-by: Joel Stanley
    Acked-by: Michael Ellerman
    Signed-off-by: Benjamin Herrenschmidt
    Signed-off-by: Greg Kroah-Hartman

    Joel Stanley
     
  • commit 4d9690dd56b0d18f2af8a9d4a279cb205aae3345 upstream.

    Instead of separate bits for every POWER8 PMU feature, have a single one
    for v2.07 of the architecture.

    This saves us adding a MMCR2 define for a future patch.

    Signed-off-by: Joel Stanley
    Acked-by: Michael Ellerman
    Signed-off-by: Benjamin Herrenschmidt
    Signed-off-by: Greg Kroah-Hartman

    Joel Stanley
     
  • commit f56029410a13cae3652d1f34788045c40a13ffc7 upstream.

    We are seeing a lot of PMU warnings on POWER8:

    Can't find PMC that caused IRQ

    Looking closer, the active PMC is 0 at this point and we took a PMU
    exception on the transition from negative to 0. Some versions of POWER8
    have an issue where they edge-detect rather than level-detect PMC
    overflows.

    A number of places program the PMC with (0x80000000 - period_left),
    where period_left can be negative. We can either fix all of these or
    just ensure that period_left is always >= 1.

    This patch takes the second option.
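
    A minimal sketch of the clamping, assuming 'left' is the (possibly
    negative) period_left value:

    static u64 pmc_program_value_sketch(s64 left)
    {
        if (left < 1)
            left = 1;       /* never program a value >= 0x80000000 */

        return 0x80000000UL - left;
    }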

    Signed-off-by: Anton Blanchard
    Signed-off-by: Benjamin Herrenschmidt
    Signed-off-by: Greg Kroah-Hartman

    Anton Blanchard
     

07 Jul, 2014

10 commits

  • commit 6663a4fa6711050036562ddfd2086edf735fae21 upstream.

    Commit 59a53afe70fd530040bdc69581f03d880157f15a "powerpc: Don't setup
    CPUs with bad status" broke ePAPR SMP booting. ePAPR says that CPUs
    that aren't presently running shall have status of disabled, with
    enable-method being used to determine whether the CPU can be enabled.

    Fix by checking for spin-table, which is currently the only supported
    enable-method.
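
    A hedged sketch of the check described above; the helper name is
    illustrative:

    static bool cpu_is_usable_sketch(struct device_node *np)
    {
        const char *status = of_get_property(np, "status", NULL);
        const char *method = of_get_property(np, "enable-method", NULL);

        /* No status, or an "okay" status, means the CPU is usable. */
        if (!status || !strcmp(status, "okay") || !strcmp(status, "ok"))
            return true;

        /* ePAPR: "disabled" CPUs are still usable if they can be
         * enabled; spin-table is the only enable-method we support. */
        return method && !strcmp(method, "spin-table");
    }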

    Signed-off-by: Scott Wood
    Cc: Michael Neuling
    Cc: Emil Medve
    Signed-off-by: Benjamin Herrenschmidt
    Signed-off-by: Greg Kroah-Hartman

    Scott Wood
     
  • commit dd58a092c4202f2bd490adab7285b3ff77f8e467 upstream.

    The Vector Crypto category instructions are supported by current POWER8
    chips, advertise them to userspace using a specific bit to properly
    differentiate with chips of the same architecture level that might not
    have them.

    Signed-off-by: Benjamin Herrenschmidt
    Signed-off-by: Greg Kroah-Hartman

    Benjamin Herrenschmidt
     
  • commit 59a53afe70fd530040bdc69581f03d880157f15a upstream.

    OPAL will mark a CPU that is guarded as "bad" in the status property of the CPU
    node.

    Unfortunately Linux doesn't check this property and will put the bad CPU in the
    present map. This has caused hangs on booting when we try to unsplit the core.

    This patch checks that the CPU is available via this status property before putting
    it in the present map.

    Signed-off-by: Michael Neuling
    Tested-by: Anton Blanchard
    Signed-off-by: Benjamin Herrenschmidt
    Signed-off-by: Greg Kroah-Hartman

    Michael Neuling
     
  • commit b69a1da94f3d1589d1942b5d1b384d8cfaac4500 upstream.

    Commit cd64d1697cf0 ("powerpc: mtmsrd not defined") added a check for
    CONFIG_PPC_CPU where a check for CONFIG_PPC_FPU was clearly intended.

    Fixes: cd64d1697cf0 ("powerpc: mtmsrd not defined")
    Signed-off-by: Paul Bolle
    Signed-off-by: Benjamin Herrenschmidt
    Signed-off-by: Greg Kroah-Hartman

    Paul Bolle
     
  • commit 3df48c981d5a9610e02e9270b1bc4274fb536710 upstream.

    In commit 330a1eb "Core EBB support for 64-bit book3s" I messed up
    clear_task_ebb(). It clears some but not all of the task's Event Based
    Branch (EBB) registers when we duplicate a task struct.

    That allows a child task to observe the EBBHR & EBBRR of its parent,
    which it should not be able to do.

    Fix it by clearing EBBHR & EBBRR.
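
    A sketch of the fixed clear_task_ebb(), assuming the usual
    thread_struct EBB fields:

    static inline void clear_task_ebb_sketch(struct task_struct *t)
    {
    #ifdef CONFIG_PPC_BOOK3S_64
        t->thread.ebbrr = 0;
        t->thread.ebbhr = 0;    /* the two registers previously missed */
        t->thread.bescr = 0;
        t->thread.siar = 0;
        t->thread.sdar = 0;
        t->thread.sier = 0;
        t->thread.used_ebb = 0;
    #endif
    }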

    Signed-off-by: Michael Ellerman
    Signed-off-by: Benjamin Herrenschmidt
    Signed-off-by: Greg Kroah-Hartman

    Michael Ellerman
     
  • commit 6e0fdf9af216887e0032c19d276889aad41cad00 upstream.

    Commit b0d278b7d3ae ("powerpc/perf_event: Reduce latency of calling
    perf_event_do_pending") added a check for CONFIG_PMAC where a check for
    CONFIG_PPC_PMAC was clearly intended.

    Fixes: b0d278b7d3ae ("powerpc/perf_event: Reduce latency of calling perf_event_do_pending")
    Signed-off-by: Paul Bolle
    Signed-off-by: Benjamin Herrenschmidt
    Signed-off-by: Greg Kroah-Hartman

    Paul Bolle
     
  • commit 5d73320a96fcce80286f1447864c481b5f0b96fa upstream.

    Commit 8f9c0119d7ba ("compat: fs: Generic compat_sys_sendfile
    implementation") changed the PowerPC 64-bit sendfile call from
    sys_sendfile64 to sys_sendfile.

    Unfortunately this broke sendfile of lengths greater than 2G because
    sys_sendfile caps at MAX_NON_LFS. Restore what we had previously which
    fixes the bug.

    Signed-off-by: Anton Blanchard
    Signed-off-by: Benjamin Herrenschmidt
    Signed-off-by: Greg Kroah-Hartman

    Anton Blanchard
     
  • commit c4cad90f9e9dcb85afc5e75a02ae3522ed077296 upstream.

    We had a mix & match of flags used when creating legacy ports
    depending on where we found them in the device tree. Among others,
    we were missing UPF_SKIP_TEST for some kinds of ISA ports, which is
    a problem as quite a few UARTs out there don't support the loopback
    test (such as a lot of BMCs).

    Let's pick the set of flags used by the SoC code and generalize it
    which means autoconf, no loopback test, irq maybe shared and fixed
    port.

    Sending to stable as the lack of UPF_SKIP_TEST is breaking serial on
    some machines, so I want this back into distros.

    Signed-off-by: Benjamin Herrenschmidt
    Signed-off-by: Greg Kroah-Hartman

    Benjamin Herrenschmidt
     
  • commit 09567e7fd44291bfc08accfdd67ad8f467842332 upstream.

    We have a bug in our hugepage handling which exhibits as an infinite
    loop of hash faults. If the fault is being taken in the kernel it will
    typically trigger the softlockup detector, or the RCU stall detector.

    The bug is as follows:

    1. mmap(0xa0000000, ..., MAP_FIXED | MAP_HUGE_TLB | MAP_ANONYMOUS ..)
    2. Slice code converts the slice psize to 16M.
    3. The code on lines 539-540 of slice.c in slice_get_unmapped_area()
    synchronises the mm->context with the paca->context. So the paca slice
    mask is updated to include the 16M slice.
    3. Either:
    * mmap() fails because there are no huge pages available.
    * mmap() succeeds and the mapping is then munmapped.
    In both cases the slice psize remains at 16M in both the paca & mm.
    4. mmap(0xa0000000, ..., MAP_FIXED | MAP_ANONYMOUS ..)
    5. The slice psize is converted back to 64K. Because of the check on line 539
    of slice.c we DO NOT update the paca->context. The paca slice mask is now
    out of sync with the mm slice mask.
    6. User/kernel accesses 0xa0000000.
    7. The SLB miss handler slb_allocate_realmode() **uses the paca slice mask**
    to create an SLB entry and inserts it in the SLB.
    8. With the 16M SLB entry in place the hardware does a hash lookup, no entry
    is found so a data access exception is generated.
    9. The data access handler calls do_page_fault() -> handle_mm_fault().
    10. __handle_mm_fault() creates a THP mapping with do_huge_pmd_anonymous_page().
    11. The hardware retries the access, there is still nothing in the hash table
    so once again a data access exception is generated.
    12. hash_page() calls into __hash_page_thp() and inserts a mapping in the
    hash. Although the THP mapping maps 16M the hashing is done using 64K
    as the segment page size.
    13. hash_page() returns immediately after calling __hash_page_thp(), skipping
    over the code at line 1125. Resulting in the mismatch between the
    paca->context and mm->context not being detected.
    14. The hardware retries the access, the hash it generates using the 16M
    SLB entry does NOT match the hash we inserted.
    15. We take another data access and go into __hash_page_thp().
    16. We see a valid entry in the hpte_slot_array and so we call updatepp()
    which succeeds.
    17. Goto 14.

    We could fix this in two ways. The first would be to remove or modify
    the check on line 539 of slice.c.

    The second option is to cause the check of paca psize in hash_page() on
    line 1125 to also be done for THP pages.

    We prefer the latter, because the check & update of the paca psize is
    not done until we know it's necessary. It's also done only on the
    current cpu, so we don't need to IPI all other cpus.

    Without further rearranging the code, the simplest fix is to pull out
    the code that checks paca psize and call it in two places. Firstly for
    THP/hugetlb, and secondly for other mappings as before.

    Thanks to Dave Jones for trinity, which originally found this bug.

    Signed-off-by: Michael Ellerman
    Reviewed-by: Aneesh Kumar K.V
    Signed-off-by: Benjamin Herrenschmidt
    Signed-off-by: Greg Kroah-Hartman

    Michael Ellerman
     
  • commit 54f112a3837d4e7532bbedbbbf27c0de277be510 upstream.

    In pseries_eeh_get_state(), EEH_STATE_UNAVAILABLE is always
    overwritten by EEH_STATE_NOT_SUPPORT because of the missing
    "break" there. The patch fixes the issue.

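    A sketch of the control flow in question; the case value is
    illustrative:

    static int eeh_state_sketch(int ret)
    {
        int result;

        switch (ret) {
        case 5:                         /* illustrative case value */
            result = EEH_STATE_UNAVAILABLE;
            break;                      /* the missing break the patch adds */
        default:
            result = EEH_STATE_NOT_SUPPORT;
        }

        return result;
    }
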
    Reported-by: Joe Perches
    Signed-off-by: Gavin Shan
    Signed-off-by: Benjamin Herrenschmidt
    Signed-off-by: Greg Kroah-Hartman

    Gavin Shan
     

01 Jul, 2014

1 commit

  • commit c177c81e09e517bbf75b67762cdab1b83aba6976 upstream.

    Currently hugepage migration is available for all archs which support
    pmd-level hugepages, but testing is done only for x86_64 and there are
    bugs for other archs. So to avoid breaking such archs, this patch
    limits the availability strictly to x86_64 until developers of other
    archs get interested in enabling this feature.

    Simply disabling hugepage migration on non-x86_64 archs is not enough to
    fix the reported problem where sys_move_pages() hits the BUG_ON() in
    follow_page(FOLL_GET), so let's fix this by checking if hugepage
    migration is supported in vma_migratable().

    Signed-off-by: Naoya Horiguchi
    Reported-by: Michael Ellerman
    Tested-by: Michael Ellerman
    Acked-by: Hugh Dickins
    Cc: Benjamin Herrenschmidt
    Cc: Tony Luck
    Cc: Russell King
    Cc: Martin Schwidefsky
    Cc: James Hogan
    Cc: Ralf Baechle
    Cc: David Miller
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Naoya Horiguchi
     

08 Jun, 2014

3 commits

  • commit 011e4b02f1da156ac7fea28a9da878f3c23af739 upstream.

    If we try to perform a kexec when the machine is in ST (Single-Threaded) mode
    (ppc64_cpu --smt=off), the kexec operation doesn't succeed properly, and we
    get the following messages during boot:

    [ 0.089866] POWER8 performance monitor hardware support registered
    [ 0.089985] power8-pmu: PMAO restore workaround active.
    [ 5.095419] Processor 1 is stuck.
    [ 10.097933] Processor 2 is stuck.
    [ 15.100480] Processor 3 is stuck.
    [ 20.102982] Processor 4 is stuck.
    [ 25.105489] Processor 5 is stuck.
    [ 30.108005] Processor 6 is stuck.
    [ 35.110518] Processor 7 is stuck.
    [ 40.113369] Processor 9 is stuck.
    [ 45.115879] Processor 10 is stuck.
    [ 50.118389] Processor 11 is stuck.
    [ 55.120904] Processor 12 is stuck.
    [ 60.123425] Processor 13 is stuck.
    [ 65.125970] Processor 14 is stuck.
    [ 70.128495] Processor 15 is stuck.
    [ 75.131316] Processor 17 is stuck.

    Note that only the sibling threads are stuck, while the primary threads (0, 8,
    16 etc) boot just fine. Looking closer at the previous step of kexec, we observe
    that kexec tries to wake up (bring online) the sibling threads of all the cores
    before performing kexec:

    [ 9464.131231] Starting new kernel
    [ 9464.148507] kexec: Waking offline cpu 1.
    [ 9464.148552] kexec: Waking offline cpu 2.
    [ 9464.148600] kexec: Waking offline cpu 3.
    [ 9464.148636] kexec: Waking offline cpu 4.
    [ 9464.148671] kexec: Waking offline cpu 5.
    [ 9464.148708] kexec: Waking offline cpu 6.
    [ 9464.148743] kexec: Waking offline cpu 7.
    [ 9464.148779] kexec: Waking offline cpu 9.
    [ 9464.148815] kexec: Waking offline cpu 10.
    [ 9464.148851] kexec: Waking offline cpu 11.
    [ 9464.148887] kexec: Waking offline cpu 12.
    [ 9464.148922] kexec: Waking offline cpu 13.
    [ 9464.148958] kexec: Waking offline cpu 14.
    [ 9464.148994] kexec: Waking offline cpu 15.
    [ 9464.149030] kexec: Waking offline cpu 17.

    Instrumenting this piece of code revealed that the cpu_up() operation actually
    fails with -EBUSY. Thus, only the primary threads of all the cores are online
    during kexec, and hence this is a sure-shot recipe for disaster, as explained
    in commit e8e5c2155b ("powerpc/kexec: Fix orphaned offline CPUs across kexec"),
    as well as in the comment above wake_offline_cpus().

    It turns out that cpu_up() was returning -EBUSY because the variable
    'cpu_hotplug_disabled' was set to 1; and this disabling of CPU hotplug was done
    by migrate_to_reboot_cpu() inside kernel_kexec().

    Now, migrate_to_reboot_cpu() was originally written with the assumption that
    any further code will not need to perform CPU hotplug, since we are anyway in
    the reboot path. However, kexec is clearly not such a case, since we depend on
    onlining CPUs, at least on powerpc.

    So re-enable cpu-hotplug after returning from migrate_to_reboot_cpu() in the
    kexec path, to fix this regression in kexec on powerpc.

    Also, wrap the cpu_up() in powerpc kexec code within a WARN_ON(), so that we
    can catch such issues more easily in the future.
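
    A simplified sketch of the two changes described above; placement
    comments mark where each piece would live:

    /* In kernel_kexec(), after migrate_to_reboot_cpu(): */
    cpu_hotplug_enable();       /* allow the kexec path to online CPUs again */

    /* In the powerpc wake_offline_cpus() loop: */
    if (!cpu_online(cpu))
        WARN_ON(cpu_up(cpu));   /* complain loudly if waking a thread fails */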

    Fixes: c97102ba963 (kexec: migrate to reboot cpu)
    Signed-off-by: Srivatsa S. Bhat
    Signed-off-by: Benjamin Herrenschmidt
    Signed-off-by: Greg Kroah-Hartman

    Srivatsa S. Bhat
     
  • commit 7998eb3dc700aaf499f93f50b3d77da834ef9e1d upstream.

    With binutils 2.24, various 64 bit builds fail with relocation errors
    such as

    arch/powerpc/kernel/built-in.o: In function `exc_debug_crit_book3e':
    (.text+0x165ee): relocation truncated to fit: R_PPC64_ADDR16_HI
    against symbol `interrupt_base_book3e' defined in .text section
    in arch/powerpc/kernel/built-in.o
    arch/powerpc/kernel/built-in.o: In function `exc_debug_crit_book3e':
    (.text+0x16602): relocation truncated to fit: R_PPC64_ADDR16_HI
    against symbol `interrupt_end_book3e' defined in .text section
    in arch/powerpc/kernel/built-in.o

    The assembler maintainer says:

    I changed the ABI, something that had to be done but unfortunately
    happens to break the booke kernel code. When building up a 64-bit
    value with lis, ori, shl, oris, ori or similar sequences, you now
    should use @high and @higha in place of @h and @ha. @h and @ha
    (and their associated relocs R_PPC64_ADDR16_HI and R_PPC64_ADDR16_HA)
    now report overflow if the value is out of 32-bit signed range.
    ie. @h and @ha assume you're building a 32-bit value. This is needed
    to report out-of-range -mcmodel=medium toc pointer offsets in @toc@h
    and @toc@ha expressions, and for consistency I did the same for all
    other @h and @ha relocs.

    Replacing @h with @high in one strategic location fixes the relocation
    errors. This has to be done conditionally since the assembler either
    supports @h or @high but not both.

    Signed-off-by: Guenter Roeck
    Signed-off-by: Benjamin Herrenschmidt
    Signed-off-by: Greg Kroah-Hartman

    Guenter Roeck
     
  • commit 8050936caf125fbe54111ba5e696b68a360556ba upstream.

    I am seeing an issue where a CPU running perf eventually hangs.
    Traces show timer interrupts happening every 4 seconds even
    when a userspace task is running on the CPU. /proc/timer_list
    also shows pending hrtimers have not run in over an hour,
    including the scheduler.

    Looking closer, decrementers_next_tb is getting set to
    0xffffffffffffffff, and at that point we will never take
    a timer interrupt again.

    In __timer_interrupt() we set decrementers_next_tb to
    0xffffffffffffffff and rely on ->event_handler to update it:

    *next_tb = ~(u64)0;
    if (evt->event_handler)
        evt->event_handler(evt);

    In this case ->event_handler is hrtimer_interrupt. This will eventually
    call back through the clockevents code with the next event to be
    programmed:

    static int decrementer_set_next_event(unsigned long evt,
                                          struct clock_event_device *dev)
    {
        /* Don't adjust the decrementer if some irq work is pending */
        if (test_irq_work_pending())
            return 0;
        __get_cpu_var(decrementers_next_tb) = get_tb_or_rtc() + evt;

    If irq work came in between these two points, we will return
    before updating decrementers_next_tb and we never process a timer
    interrupt again.

    This looks to have been introduced by 0215f7d8c53f ("powerpc: Fix races
    with irq_work"). Fix it by removing the early exit and relying on the
    code later in the function to force an early decrementer:

    /* We may have raced with new irq work */
    if (test_irq_work_pending())
        set_dec(1);
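
    A sketch of the fixed flow described above, assuming the usual
    decrementer helpers:

    static int decrementer_set_next_event_sketch(unsigned long evt,
                                                 struct clock_event_device *dev)
    {
        /* Always publish the next event before programming the decrementer. */
        __get_cpu_var(decrementers_next_tb) = get_tb_or_rtc() + evt;
        set_dec(evt);

        /* We may have raced with new irq work */
        if (test_irq_work_pending())
            set_dec(1);

        return 0;
    }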

    Signed-off-by: Anton Blanchard
    Signed-off-by: Benjamin Herrenschmidt
    Signed-off-by: Greg Kroah-Hartman

    Anton Blanchard