26 Sep, 2014

7 commits

  • commit b745bc85f21ea707e4ea1a91948055fa3e72c77b upstream.

    cold is a bool, so make it one. Make the likely case the "if" part of the
    block instead of the "else", since according to the optimisation manual
    this ordering is preferred.
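
    A minimal sketch of the resulting shape (hedged: the function and list
    parameters here are illustrative, not the actual mm/ diff; only the bool
    parameter and the branch ordering reflect the commit):

    /* Sketch: illustrative helper, not the actual mm/ code. */
    static void free_page_example(struct page *page, struct list_head *list,
                                  bool cold)
    {
            if (likely(!cold))
                    list_add(&page->lru, list);       /* common case in "if" */
            else
                    list_add_tail(&page->lru, list);  /* rare case in "else" */
    }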

    Signed-off-by: Mel Gorman
    Acked-by: Rik van Riel
    Cc: Johannes Weiner
    Cc: Vlastimil Babka
    Cc: Jan Kara
    Cc: Michal Hocko
    Cc: Hugh Dickins
    Cc: Dave Hansen
    Cc: Theodore Ts'o
    Cc: "Paul E. McKenney"
    Cc: Oleg Nesterov
    Cc: Peter Zijlstra
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Mel Gorman
    Signed-off-by: Jiri Slaby

    Mel Gorman
     
  • commit b13b1d2d8692b437203de7a404c6b809d2cc4d99 upstream.

    We use the accessed bit to age a page at page reclaim time,
    and currently we also flush the TLB when doing so.

    But in some workloads TLB flush overhead is very heavy. In my
    simple multithreaded app with a lot of swap to several PCIe
    SSDs, removing the TLB flush gives about a 20%-30% swapout
    speedup.

    Fortunately just removing the TLB flush is a valid optimization:
    on x86 CPUs, clearing the accessed bit without a TLB flush
    doesn't cause data corruption.

    It could cause incorrect page aging and the (mistaken) reclaim of
    hot pages, but the chance of that should be relatively low.

    So as a performance optimization don't flush the TLB when
    clearing the accessed bit, it will eventually be flushed by
    a context switch or a VM operation anyway. [ In the rare
    event of it not getting flushed for a long time the delay
    shouldn't really matter because there's no real memory
    pressure for swapout to react to. ]
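
    On x86 the change essentially reduces to the following (a sketch of the
    idea, not necessarily the verbatim patch):

    int ptep_clear_flush_young(struct vm_area_struct *vma,
                               unsigned long address, pte_t *ptep)
    {
            /*
             * Clear the accessed bit but skip the flush_tlb_page() call:
             * the stale TLB entry is harmless and will be flushed by the
             * next context switch or VM operation anyway.
             */
            return ptep_test_and_clear_young(vma, address, ptep);
    }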

    Suggested-by: Linus Torvalds
    Signed-off-by: Shaohua Li
    Acked-by: Rik van Riel
    Acked-by: Mel Gorman
    Acked-by: Hugh Dickins
    Acked-by: Johannes Weiner
    Cc: linux-mm@kvack.org
    Cc: Peter Zijlstra
    Link: http://lkml.kernel.org/r/20140408075809.GA1764@kernel.org
    [ Rewrote the changelog and the code comments. ]
    Signed-off-by: Ingo Molnar
    Signed-off-by: Mel Gorman
    Signed-off-by: Jiri Slaby

    Shaohua Li
     
  • commit 615d6e8756c87149f2d4c1b93d471bca002bd849 upstream.

    This patch is a continuation of efforts trying to optimize find_vma(),
    avoiding potentially expensive rbtree walks to locate a vma upon faults.
    The original approach (https://lkml.org/lkml/2013/11/1/410), where the
    largest vma was also cached, ended up being too specific and random, so
    further comparisons with other approaches were needed. There are two
    things to consider when dealing with this: the cache hit rate and the
    latency of find_vma(). Improving the hit rate does not necessarily
    translate into finding the vma any faster, as the overhead of any fancy
    caching scheme can be too high to consider.

    We currently cache the last used vma for the whole address space, which
    provides a nice optimization, reducing the total cycles in find_vma() by
    up to 250%, for workloads with good locality. On the other hand, this
    simple scheme is pretty much useless for workloads with poor locality.
    Analyzing ebizzy runs shows that, no matter how many threads are
    running, the mmap_cache hit rate is less than 2%, and in many situations
    below 1%.

    The proposed approach is to replace this scheme with a small per-thread
    cache, maximizing hit rates at a very low maintenance cost.
    Invalidations are performed by simply bumping up a 32-bit sequence
    number. The only expensive operation is in the rare case of a seq
    number overflow, where all caches that share the same address space are
    flushed. Upon a miss, the proposed replacement policy is based on the
    page number that contains the virtual address in question (a sketch of
    the scheme follows the results below). Concretely, the following results
    are seen on an 80 core, 8 socket x86-64 box:

    1) System bootup: Most programs are single threaded, so the per-thread
    scheme improves on the ~50% hit rate by just adding a few more slots to
    the cache.

    +----------------+----------+------------------+
    | caching scheme | hit-rate | cycles (billion) |
    +----------------+----------+------------------+
    | baseline       |   50.61% |            19.90 |
    | patched        |   73.45% |            13.58 |
    +----------------+----------+------------------+

    2) Kernel build: This one is already pretty good with the current
    approach as we're dealing with good locality.

    +----------------+----------+------------------+
    | caching scheme | hit-rate | cycles (billion) |
    +----------------+----------+------------------+
    | baseline       |   75.28% |            11.03 |
    | patched        |   88.09% |             9.31 |
    +----------------+----------+------------------+

    3) Oracle 11g Data Mining (4k pages): Similar to the kernel build workload.

    +----------------+----------+------------------+
    | caching scheme | hit-rate | cycles (billion) |
    +----------------+----------+------------------+
    | baseline       |   70.66% |            17.14 |
    | patched        |   91.15% |            12.57 |
    +----------------+----------+------------------+

    4) Ebizzy: There's a fair amount of variation from run to run, but this
    approach always shows nearly perfect hit rates, while the baseline's is
    just about non-existent. The cycle counts fluctuate anywhere from ~60 to
    ~116 billion for the baseline scheme, but this approach reduces them
    considerably. For instance, with 80 threads:

    +----------------+----------+------------------+
    | caching scheme | hit-rate | cycles (billion) |
    +----------------+----------+------------------+
    | baseline       |    1.06% |            91.54 |
    | patched        |   99.97% |            14.18 |
    +----------------+----------+------------------+
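
    A toy model of the per-thread scheme in plain C (illustrative only; the
    real code lives in mm/vmacache.c and differs in detail, e.g. it also
    verifies that the cached vma still covers the address):

    #include <string.h>

    #define VMACACHE_BITS 2
    #define VMACACHE_SIZE (1U << VMACACHE_BITS)
    #define VMACACHE_MASK (VMACACHE_SIZE - 1)

    struct vma;                              /* stand-in for vm_area_struct */

    struct mm_model {
            unsigned int seqnum;             /* bumped to invalidate caches */
    };

    struct thread_model {
            unsigned int vmacache_seqnum;    /* snapshot of the mm seqnum */
            struct vma *vmacache[VMACACHE_SIZE];
    };

    /* Replacement policy: index by the page number of the address. */
    static unsigned int vmacache_hash(unsigned long addr)
    {
            return (addr >> 12) & VMACACHE_MASK;   /* 12 == PAGE_SHIFT */
    }

    static struct vma *vmacache_find(struct thread_model *t,
                                     struct mm_model *mm, unsigned long addr)
    {
            if (t->vmacache_seqnum != mm->seqnum) {
                    /* Invalidation is just a seqnum mismatch. */
                    memset(t->vmacache, 0, sizeof(t->vmacache));
                    t->vmacache_seqnum = mm->seqnum;
                    return NULL;
            }
            return t->vmacache[vmacache_hash(addr)];  /* NULL means a miss */
    }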

    [akpm@linux-foundation.org: fix nommu build, per Davidlohr]
    [akpm@linux-foundation.org: document vmacache_valid() logic]
    [akpm@linux-foundation.org: attempt to untangle header files]
    [akpm@linux-foundation.org: add vmacache_find() BUG_ON]
    [hughd@google.com: add vmacache_valid_mm() (from Oleg)]
    [akpm@linux-foundation.org: coding-style fixes]
    [akpm@linux-foundation.org: adjust and enhance comments]
    Signed-off-by: Davidlohr Bueso
    Reviewed-by: Rik van Riel
    Acked-by: Linus Torvalds
    Reviewed-by: Michel Lespinasse
    Cc: Oleg Nesterov
    Tested-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Mel Gorman
    Signed-off-by: Jiri Slaby

    Davidlohr Bueso
     
  • commit 71b54f8263860a37dd9f50f81880a9d681fd9c10 upstream.

    When choosing between doing an address space or a ranged flush,
    the x86 implementation of flush_tlb_mm_range takes into account
    whether there are any large pages in the range. A per-page
    flush typically requires fewer entries than would be covered by a
    single large page, so the check is redundant.

    There is one potential exception. THP migration flushes single
    THP entries and it conceivably would benefit from flushing a
    single entry instead of the mm. However, this flush is after a
    THP allocation, copy and page table update potentially with any
    other threads serialised behind it. In comparison to that, the
    flush is noise. It makes more sense to optimise balancing to
    require fewer flushes than to optimise the flush itself.

    This patch deletes the redundant huge page check.
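
    A sketch of the simplified decision after the patch (hedged:
    tlb_entries_limit is a hypothetical stand-in for the real heuristics,
    and this is the shape of the logic rather than the actual diff):

    static unsigned long tlb_entries_limit;   /* hypothetical threshold */

    void flush_tlb_mm_range_sketch(unsigned long start, unsigned long end)
    {
            unsigned long nr_pages = (end - start) >> PAGE_SHIFT;
            unsigned long addr;

            if (nr_pages > tlb_entries_limit) {
                    local_flush_tlb();        /* whole address space */
                    return;
            }
            /* No per-range huge-page test any more: flush page by page. */
            for (addr = start; addr < end; addr += PAGE_SIZE)
                    __flush_tlb_single(addr);
    }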

    Signed-off-by: Mel Gorman
    Tested-by: Davidlohr Bueso
    Reviewed-by: Rik van Riel
    Signed-off-by: Andrew Morton
    Cc: Alex Shi
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Link: http://lkml.kernel.org/n/tip-sgei1drpOcburujPsfh6ovmo@git.kernel.org
    Signed-off-by: Ingo Molnar
    Signed-off-by: Mel Gorman
    Signed-off-by: Jiri Slaby

    Mel Gorman
     
  • commit 15aa368255f249df0b2af630c9487bb5471bd7da upstream.

    NR_TLB_LOCAL_FLUSH_ALL is not always accounted for correctly and
    the comparison with total_vm is done before taking
    tlb_flushall_shift into account. Clean it up.

    Signed-off-by: Mel Gorman
    Tested-by: Davidlohr Bueso
    Reviewed-by: Alex Shi
    Reviewed-by: Rik van Riel
    Signed-off-by: Andrew Morton
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Hugh Dickins
    Link: http://lkml.kernel.org/n/tip-Iz5gcahrgskIldvukulzi0hh@git.kernel.org
    Signed-off-by: Ingo Molnar
    Signed-off-by: Mel Gorman
    Signed-off-by: Jiri Slaby

    Mel Gorman
     
  • commit ec65993443736a5091b68e80ff1734548944a4b8 upstream.

    Bisection between 3.11 and 3.12 fingered commit 9824cf97 ("mm:
    vmstats: tlb flush counters") as the cause of the overhead problems.

    The counters are undeniably useful but how often do we really
    need to debug TLB flush related issues? It does not justify
    taking the penalty everywhere so make it a debugging option.
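
    The resulting wrapper is roughly as follows (a sketch; the option and
    macro names follow the upstream commit as I understand it):

    #ifdef CONFIG_DEBUG_TLBFLUSH
    #define count_vm_tlb_event(x)   count_vm_event(x)
    #else
    #define count_vm_tlb_event(x)   do { } while (0)
    #endif

    /* Call sites then cost nothing unless the debug option is enabled:
     * count_vm_tlb_event(NR_TLB_LOCAL_FLUSH_ALL); */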

    Signed-off-by: Mel Gorman
    Tested-by: Davidlohr Bueso
    Reviewed-by: Rik van Riel
    Signed-off-by: Andrew Morton
    Cc: Hugh Dickins
    Cc: Alex Shi
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Link: http://lkml.kernel.org/n/tip-XzxjntugxuwpxXhcrxqqh53b@git.kernel.org
    Signed-off-by: Ingo Molnar
    Signed-off-by: Mel Gorman
    Signed-off-by: Jiri Slaby

    Mel Gorman
     
  • commit eb35bdd7bca29a13c8ecd44e6fd747a84ce675db upstream.

    Nathan reports that we leak TLS information from the parent context
    during an exec, as we don't clear the TLS registers when flushing the
    thread state.

    This patch updates the flushing code so that we:

    (1) Unconditionally zero the tpidr_el0 register (since this is fully
    context switched for native tasks and zeroed for compat tasks)

    (2) Zero the tp_value state in thread_info before clearing the
    tpidrro_el0 register for compat tasks (since this is only writable
    by the set_tls compat syscall and therefore not fully switched).

    A missing compiler barrier is also added to the compat set_tls syscall.
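
    A sketch of the resulting flush helper (close to, but not guaranteed to
    match, the upstream code in arch/arm64/kernel/process.c):

    static void tls_thread_flush(void)
    {
            asm ("msr tpidr_el0, xzr");       /* (1) always zeroed */

            if (is_compat_task()) {
                    current->thread.tp_value = 0;
                    /*
                     * (2) Ensure tp_value is zeroed before the register,
                     * so a preempting context switch cannot write a stale
                     * value back into tpidrro_el0.
                     */
                    barrier();
                    asm ("msr tpidrro_el0, xzr");
            }
    }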

    Acked-by: Nathan Lynch
    Reported-by: Nathan Lynch
    Signed-off-by: Will Deacon
    Signed-off-by: Jiri Slaby

    Will Deacon
     

17 Sep, 2014

16 commits

  • commit 608308682addfdc7b8e2aee88f0e028331d88e4d upstream.

    get_system_type() is not thread-safe on OCTEON. It uses static data,
    and a more dangerous issue is that it calls cvmx_fuse_read_byte()
    every time without any synchronization. Currently it's possible to get
    processes stuck looping forever in the kernel simply by launching
    multiple readers of /proc/cpuinfo:

    (while true; do cat /proc/cpuinfo > /dev/null; done) &
    (while true; do cat /proc/cpuinfo > /dev/null; done) &
    ...

    Fix by initializing the system type string only once during the early
    boot.
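
    A sketch of the fix (board_name_string() is a hypothetical stand-in for
    the real OCTEON board-name helpers):

    static char octeon_system_type[80];

    static int __init init_octeon_system_type(void)
    {
            /* Runs once, single-threaded, before /proc/cpuinfo is usable.
             * board_name_string() is hypothetical. */
            snprintf(octeon_system_type, sizeof(octeon_system_type),
                     "Cavium Octeon %s", board_name_string());
            return 0;
    }
    early_initcall(init_octeon_system_type);

    const char *get_system_type(void)
    {
            return octeon_system_type;   /* read-only after early boot */
    }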

    Signed-off-by: Aaro Koskinen
    Reviewed-by: Markos Chandras
    Patchwork: http://patchwork.linux-mips.org/patch/7437/
    Signed-off-by: James Hogan
    Signed-off-by: Jiri Slaby

    Aaro Koskinen
     
  • commit 2e5767a27337812f6850b3fa362419e2f085e5c3 upstream.

    In do_ade(), is_fpu_owner() isn't preempt-safe. For example, when an
    unaligned ldc1 is executed, do_cpu() is called and then the FPU will be
    enabled (and TIF_USEDFPU will be set for the current process). Then,
    do_ade() is called because the access is unaligned. If the current
    process is preempted at this time, TIF_USEDFPU will be cleared. So when
    the process is scheduled again, BUG_ON(!is_fpu_owner()) is triggered.

    This small program can trigger this BUG in a preemptible kernel:

    int main(int argc, char *argv[])
    {
            double u64[2];

            while (1) {
                    asm volatile (
                            ".set push \n\t"
                            ".set noreorder \n\t"
                            "ldc1 $f3, 4(%0) \n\t"
                            ".set pop \n\t"
                            :: "r" (u64) :
                    );
            }

            return 0;
    }

    V2: Remove the BUG_ON() unconditionally, per Paul's suggestion.

    Signed-off-by: Huacai Chen
    Signed-off-by: Jie Chen
    Signed-off-by: Rui Wang
    Cc: John Crispin
    Cc: Steven J. Hill
    Cc: linux-mips@linux-mips.org
    Cc: Fuxin Zhang
    Cc: Zhangjin Wu
    Signed-off-by: Ralf Baechle
    Signed-off-by: Jiri Slaby

    Huacai Chen
     
  • commit 8393c524a25609a30129e4a8975cf3b91f6c16a5 upstream.

    In commit 2c8c53e28f1 (MIPS: Optimize TLB handlers for Octeon CPUs)
    build_r4000_tlb_refill_handler() was modified, but it is not compatible
    with the original code in the HUGETLB case because of a copy & paste
    error: one line of code is missing. It is very easy to trigger the bug
    with LTP's hugemmap05 test.

    Signed-off-by: Huacai Chen
    Signed-off-by: Binbin Zhou
    Cc: John Crispin
    Cc: Steven J. Hill
    Cc: linux-mips@linux-mips.org
    Cc: Fuxin Zhang
    Cc: Zhangjin Wu
    Patchwork: https://patchwork.linux-mips.org/patch/7496/
    Signed-off-by: Ralf Baechle
    Signed-off-by: Jiri Slaby

    Huacai Chen
     
  • commit b1442d39fac2fcfbe6a4814979020e993ca59c9e upstream.

    If one or more matching FCSR cause & enable bits are set in saved thread
    context then when that context is restored the kernel will take an FP
    exception. This is of course undesirable and considered an oops, leading
    to the kernel writing a backtrace to the console and potentially
    rebooting depending upon the configuration. Thus the kernel avoids this
    situation by clearing the cause bits of the FCSR register when handling
    FP exceptions and after emulating FP instructions.

    However the kernel does not prevent userland from setting arbitrary FCSR
    cause & enable bits via ptrace, using either the PTRACE_POKEUSR or
    PTRACE_SETFPREGS requests. This means userland can trivially cause the
    kernel to oops on any system with an FPU. Prevent this from happening
    by clearing the cause bits when writing to the saved FCSR context via
    ptrace.

    This problem appears to exist at least back to the beginning of the git
    era in the PTRACE_POKEUSR case.
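
    The ptrace-side fix is essentially a masking operation; a sketch of the
    PTRACE_POKEUSR path (FPU_CSR_ALL_X is the mask of FCSR cause bits; the
    PTRACE_SETFPREGS path gets equivalent treatment):

    switch (addr) {
    case FPC_CSR:
            /* Never let userland plant cause bits in the saved FCSR. */
            child->thread.fpu.fcr31 = data & ~FPU_CSR_ALL_X;
            break;
    }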

    Signed-off-by: Paul Burton
    Cc: linux-mips@linux-mips.org
    Cc: Paul Burton
    Cc: stable@vger.kernel.org
    Patchwork: https://patchwork.linux-mips.org/patch/7438/
    Signed-off-by: Ralf Baechle
    Signed-off-by: Jiri Slaby

    Paul Burton
     
  • commit ffc8415afab20bd97754efae6aad1f67b531132b upstream.

    A GIC interrupt which is declared as having a GIC_MAP_TO_NMI_MSK
    mapping causes the cpu parameter to gic_setup_intr() to be increased
    to 32, causing memory corruption when pcpu_masks[] is written to again
    later in the function.

    Signed-off-by: Jeffrey Deans
    Signed-off-by: Markos Chandras
    Cc: linux-mips@linux-mips.org
    Patchwork: https://patchwork.linux-mips.org/patch/7375/
    Signed-off-by: Ralf Baechle
    Signed-off-by: Jiri Slaby

    Jeffrey Deans
     
  • commit 7e467245bf5226db34c4b12d3cbacfa2f7a15a8b upstream.

    We could get wrong results if the compiler recomputed old_pmd. Avoid
    that by using ACCESS_ONCE.
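
    The fix is essentially a one-liner; a sketch of the pattern:

    pmd_t old_pmd;

    /*
     * Read the pmd exactly once. Without ACCESS_ONCE the compiler is free
     * to re-load *pmdp, and two reads may observe different values.
     */
    old_pmd = ACCESS_ONCE(*pmdp);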

    Signed-off-by: Aneesh Kumar K.V
    Signed-off-by: Benjamin Herrenschmidt
    Signed-off-by: Jiri Slaby

    Aneesh Kumar K.V
     
  • commit 969b7b208f7408712a3526856e4ae60ad13f6928 upstream.

    As per the ISA, for a 4K base page size we compare bits 14..65 of the
    VA specified with the entry_VA in the TLB. That implies we need to make
    sure we do a tlbie with all the possible 4K VAs we used to access the
    16MB hugepage. With a 64K base page size we compare bits 14..57 of the
    VA, hence we cannot ignore the lower 24 bits of the VA while doing
    tlbie. We also cannot invalidate a 16MB entry with just one tlbie
    instruction, because we don't track which VA was used to instantiate
    the TLB entry.

    Signed-off-by: Aneesh Kumar K.V
    Signed-off-by: Benjamin Herrenschmidt
    Signed-off-by: Jiri Slaby

    Aneesh Kumar K.V
     
  • commit fc0479557572375100ef16c71170b29a98e0d69a upstream.

    If we change the base page size of the segment, either via
    sub_page_protect or via remap_4k_pfn, we do a demote_segment, which
    doesn't flush the hash table entries. We do a lazy hash page table flush
    for all mapped pages in the demoted segment. This happens when we handle
    a hash page fault for these pages.

    We use _PAGE_COMBO bit along with _PAGE_HASHPTE to indicate whether a
    pte is backed by 4K hash pte. If we find _PAGE_COMBO not set on the pte,
    that implies that we could possibly have older 64K hash pte entries in
    the hash page table and we need to invalidate those entries.

    Use _PAGE_COMBO to determine the page size with which we should
    invalidate the hash table entries on unmap.
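
    A sketch of the selection on unmap (illustrative; the surrounding
    variables and the flush_hash_page() arguments follow the usual ppc64
    conventions but are not the verbatim diff):

    /* vpn, rpte, ssize and local come from the surrounding unmap context. */
    if (pte_val(pte) & _PAGE_COMBO)
            psize = MMU_PAGE_4K;    /* demoted segment: 4K hash ptes */
    else
            psize = MMU_PAGE_64K;   /* possibly stale 64K hash ptes */
    flush_hash_page(vpn, rpte, psize, ssize, local);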

    Signed-off-by: Aneesh Kumar K.V
    Signed-off-by: Benjamin Herrenschmidt
    Signed-off-by: Jiri Slaby

    Aneesh Kumar K.V
     
  • commit 629149fae478f0ac6bf705a535708b192e9c6b59 upstream.

    If we change the base page size of the segment, either via
    sub_page_protect or via remap_4k_pfn, we do a demote_segment, which
    doesn't flush the hash table entries. We do a lazy hash page table flush
    for all mapped pages in the demoted segment. This happens when we handle
    a hash page fault for these pages.

    We use _PAGE_COMBO bit along with _PAGE_HASHPTE to indicate whether a
    pte is backed by 4K hash pte. If we find _PAGE_COMBO not set on the pte,
    that implies that we could possibly have older 64K hash pte entries in
    the hash page table and we need to invalidate those entries.

    Handle this correctly for 16M pages.

    Signed-off-by: Aneesh Kumar K.V
    Signed-off-by: Benjamin Herrenschmidt
    Signed-off-by: Jiri Slaby

    Aneesh Kumar K.V
     
  • commit fa1f8ae80f8bb996594167ff4750a0b0a5a5bb5d upstream.

    The segment identifier and segment size will remain the same in
    the loop, so we can compute them outside it. We also change the
    hugepage_invalidate interface so that we can use it in a later patch.

    Signed-off-by: Aneesh Kumar K.V
    Signed-off-by: Benjamin Herrenschmidt
    Signed-off-by: Jiri Slaby

    Aneesh Kumar K.V
     
  • commit b0aa44a3dfae3d8f45bd1264349aa87f87b7774f upstream.

    With hugepages, we store the hpte valid information in the pte page
    whose address is stored in the second half of the PMD. Use a
    write barrier to make sure that clearing the pmd busy bit and updating
    the hpte valid info are ordered properly.
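
    One plausible shape of the ordering (a hedged sketch; the slot-array
    encoding and variable names are illustrative, not the verbatim diff):

    /* hpte_slot_array, index, hidx and pmdp come from the caller. */
    hpte_slot_array[index] = (hidx << 1) | 0x1;  /* record hpte valid+slot */

    smp_wmb();   /* slot info must be visible before the pmd is unlocked */

    *pmdp = __pmd(pmd_val(*pmdp) & ~_PAGE_BUSY); /* clear the pmd busy bit */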

    Signed-off-by: Aneesh Kumar K.V
    Signed-off-by: Benjamin Herrenschmidt
    Signed-off-by: Jiri Slaby

    Aneesh Kumar K.V
     
  • commit 5efbabe09d986f25c02d19954660238fcd7f008a upstream.

    Function remove_ddw() could be called in of_reconfig_notifier, and
    we potentially remove the dynamic DMA window property, which invokes
    of_reconfig_notifier again. Eventually, it leads to a deadlock, as
    the following backtrace shows.

    The patch fixes the issue by deferring the release of the dynamic
    DMA window property until the device node itself is released.

    =============================================
    [ INFO: possible recursive locking detected ]
    3.16.0+ #428 Tainted: G W
    ---------------------------------------------
    drmgr/2273 is trying to acquire lock:
    ((of_reconfig_chain).rwsem){.+.+..}, at: [] \
    .__blocking_notifier_call_chain+0x40/0x78

    but task is already holding lock:
    ((of_reconfig_chain).rwsem){.+.+..}, at: [] \
    .__blocking_notifier_call_chain+0x40/0x78

    other info that might help us debug this:
    Possible unsafe locking scenario:

    CPU0
    ----
    lock((of_reconfig_chain).rwsem);
    lock((of_reconfig_chain).rwsem);
    *** DEADLOCK ***

    May be due to missing lock nesting notation

    2 locks held by drmgr/2273:
    #0: (sb_writers#4){.+.+.+}, at: [] \
    .vfs_write+0xb0/0x1f8
    #1: ((of_reconfig_chain).rwsem){.+.+..}, at: [] \
    .__blocking_notifier_call_chain+0x40/0x78

    stack backtrace:
    CPU: 17 PID: 2273 Comm: drmgr Tainted: G W 3.16.0+ #428
    Call Trace:
    [c0000000137e7000] [c000000000013d9c] .show_stack+0x88/0x148 (unreliable)
    [c0000000137e70b0] [c00000000083cd34] .dump_stack+0x7c/0x9c
    [c0000000137e7130] [c0000000000b8afc] .__lock_acquire+0x128c/0x1c68
    [c0000000137e7280] [c0000000000b9a4c] .lock_acquire+0xe8/0x104
    [c0000000137e7350] [c00000000083588c] .down_read+0x4c/0x90
    [c0000000137e73e0] [c000000000091890] .__blocking_notifier_call_chain+0x40/0x78
    [c0000000137e7490] [c000000000091900] .blocking_notifier_call_chain+0x38/0x48
    [c0000000137e7520] [c000000000682a28] .of_reconfig_notify+0x34/0x5c
    [c0000000137e75b0] [c000000000682a9c] .of_property_notify+0x4c/0x54
    [c0000000137e7650] [c000000000682bf0] .of_remove_property+0x30/0xd4
    [c0000000137e76f0] [c000000000052a44] .remove_ddw+0x144/0x168
    [c0000000137e7790] [c000000000053204] .iommu_reconfig_notifier+0x30/0xe0
    [c0000000137e7820] [c00000000009137c] .notifier_call_chain+0x6c/0xb4
    [c0000000137e78c0] [c0000000000918ac] .__blocking_notifier_call_chain+0x5c/0x78
    [c0000000137e7970] [c000000000091900] .blocking_notifier_call_chain+0x38/0x48
    [c0000000137e7a00] [c000000000682a28] .of_reconfig_notify+0x34/0x5c
    [c0000000137e7a90] [c000000000682e14] .of_detach_node+0x44/0x1fc
    [c0000000137e7b40] [c0000000000518e4] .ofdt_write+0x3ac/0x688
    [c0000000137e7c20] [c000000000238430] .proc_reg_write+0xb8/0xd4
    [c0000000137e7cd0] [c0000000001cbeac] .vfs_write+0xec/0x1f8
    [c0000000137e7d70] [c0000000001cc3b0] .SyS_write+0x58/0xa0
    [c0000000137e7e30] [c00000000000a064] syscall_exit+0x0/0x98

    Signed-off-by: Gavin Shan
    Signed-off-by: Benjamin Herrenschmidt
    Signed-off-by: Jiri Slaby

    Gavin Shan
     
  • commit f1b3929c232784580e5d8ee324b6bc634e709575 upstream.

    While running the command "drmgr -c phb -r -s 'PHB 528'", the following
    backtrace jumped out because the target device node isn't marked
    with OF_DETACHED by of_detach_node(). This is caused by an error
    returned from the memory hotplug related reconfig notifier when
    CONFIG_MEMORY_HOTREMOVE is disabled. The patch fixes it.

    ERROR: Bad of_node_put() on /pci@800000020000210/ethernet@0
    CPU: 14 PID: 2252 Comm: drmgr Tainted: G W 3.16.0+ #427
    Call Trace:
    [c000000012a776a0] [c000000000013d9c] .show_stack+0x88/0x148 (unreliable)
    [c000000012a77750] [c00000000083cd34] .dump_stack+0x7c/0x9c
    [c000000012a777d0] [c0000000006807c4] .of_node_release+0x58/0xe0
    [c000000012a77860] [c00000000038a7d0] .kobject_release+0x174/0x1b8
    [c000000012a77900] [c00000000038a884] .kobject_put+0x70/0x78
    [c000000012a77980] [c000000000681680] .of_node_put+0x28/0x34
    [c000000012a77a00] [c000000000681ea8] .__of_get_next_child+0x64/0x70
    [c000000012a77a90] [c000000000682138] .of_find_node_by_path+0x1b8/0x20c
    [c000000012a77b40] [c000000000051840] .ofdt_write+0x308/0x688
    [c000000012a77c20] [c000000000238430] .proc_reg_write+0xb8/0xd4
    [c000000012a77cd0] [c0000000001cbeac] .vfs_write+0xec/0x1f8
    [c000000012a77d70] [c0000000001cc3b0] .SyS_write+0x58/0xa0
    [c000000012a77e30] [c00000000000a064] syscall_exit+0x0/0x98

    Signed-off-by: Gavin Shan
    Signed-off-by: Benjamin Herrenschmidt
    Signed-off-by: Jiri Slaby

    Gavin Shan
     
  • commit 85c1fafd7262e68ad821ee1808686b1392b1167d upstream.

    On ppc64 we support 4K hash ptes with a 64K page size. That requires
    us to track the hash pte slot information on a per-4K basis. We do that
    by storing the slot details in the second half of the pte page. The pte
    bit _PAGE_COMBO is used to indicate whether the second half needs to be
    looked at while building the real_pte. We need to use a read memory
    barrier while doing that so that the load of hidx is not reordered
    w.r.t. the _PAGE_COMBO check. On the store side we already do an lwsync
    in __hash_page_4K.
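
    The fix amounts to a read barrier in the _PAGE_COMBO path; a sketch
    close to (but not guaranteed to match) the upstream __real_pte():

    static inline real_pte_t __real_pte(pte_t pte, pte_t *ptep)
    {
            real_pte_t rpte;

            rpte.pte = pte;
            rpte.hidx = 0;
            if (pte_val(pte) & _PAGE_COMBO) {
                    /*
                     * Load hidx only after seeing _PAGE_COMBO set; pairs
                     * with the lwsync on the store side in __hash_page_4K.
                     */
                    smp_rmb();
                    rpte.hidx = pte_val(*(ptep + PTRS_PER_PTE));
            }
            return rpte;
    }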

    Signed-off-by: Aneesh Kumar K.V
    Signed-off-by: Benjamin Herrenschmidt
    Signed-off-by: Jiri Slaby

    Aneesh Kumar K.V
     
  • commit b00fc6ec1f24f9d7af9b8988b6a198186eb3408c upstream.

    Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=81631
    Reported-by: David Binderman
    Signed-off-by: Andrey Utkin
    Signed-off-by: Benjamin Herrenschmidt
    Signed-off-by: Jiri Slaby

    Andrey Utkin
     
  • commit 36e7fdaa1a04fcf65b864232e1af56a51c7814d6 upstream.

    commit 4badad352a6bb202ec68afa7a574c0bb961e5ebc ("locking/mutex: Disable
    optimistic spinning on some architectures") fenced spinning for
    architectures without a proper cmpxchg.
    There is no need to disable mutex spinning on s390, though:
    the instructions CS, CSG and friends provide the proper guarantees.
    (We don't implement cmpxchg with locks.)

    Signed-off-by: Christian Borntraeger
    Cc: Ingo Molnar
    Cc: Peter Zijlstra
    Signed-off-by: Heiko Carstens
    Signed-off-by: Martin Schwidefsky
    Signed-off-by: Jiri Slaby

    Christian Borntraeger
     

12 Sep, 2014

2 commits

  • commit 39424e89d64661faa0a2e00c5ad1e6dbeebfa972 upstream.

    Further discussion here: http://marc.info/?l=linux-kernel&m=139073901101034&w=2

    kbuild, the 0day kernel build service, outputs the warning:

    arch/x86/kernel/irq.c:333:1: warning: the frame size of 2056 bytes
    is larger than 2048 bytes [-Wframe-larger-than=]

    because check_irq_vectors_for_cpu_disable() allocates two cpumasks on
    the stack. Fix this by moving the two cpumasks to file scope.
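
    A sketch of the change (with NR_CPUS=8192 a cpumask_t is 1KB, so two
    on-stack masks alone nearly fill the 2048-byte frame budget; CPU
    hotplug is serialized, so file-scope copies are safe):

    /* Was: cpumask_t affinity_new, online_new; on the stack inside
     * check_irq_vectors_for_cpu_disable(). */
    static cpumask_t affinity_new, online_new;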

    Reported-by: Fengguang Wu
    Tested-by: David Rientjes
    Signed-off-by: Prarit Bhargava
    Link: http://lkml.kernel.org/r/1390915331-27375-1-git-send-email-prarit@redhat.com
    Cc: Andi Kleen
    Cc: Michel Lespinasse
    Cc: Seiji Aguchi
    Cc: Yang Zhang
    Cc: Paul Gortmaker
    Cc: Janet Morgan
    Cc: Tony Luck
    Cc: Ruiv Wang
    Cc: Gong Chen
    Cc: Yinghai Lu
    Signed-off-by: H. Peter Anvin
    Signed-off-by: Jiri Slaby

    Prarit Bhargava
     
  • commit da6139e49c7cb0f4251265cb5243b8d220adb48d upstream.

    Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=64791

    When a cpu is downed on a system, the irqs on the cpu are assigned to
    other cpus. It is possible, however, that when a cpu is downed there
    aren't enough free vectors on the remaining cpus to account for the
    vectors from the cpu that is being downed.

    This results in an interesting "overflow" condition where irqs are
    "assigned" to a CPU but are not handled.

    For example, when downing cpus on a 1-64 logical processor system:

    [ 232.021745] smpboot: CPU 61 is now offline
    [ 238.480275] smpboot: CPU 62 is now offline
    [ 245.991080] ------------[ cut here ]------------
    [ 245.996270] WARNING: CPU: 0 PID: 0 at net/sched/sch_generic.c:264 dev_watchdog+0x246/0x250()
    [ 246.005688] NETDEV WATCHDOG: p786p1 (ixgbe): transmit queue 0 timed out
    [ 246.013070] Modules linked in: lockd sunrpc iTCO_wdt iTCO_vendor_support sb_edac ixgbe microcode e1000e pcspkr joydev edac_core lpc_ich ioatdma ptp mdio mfd_core i2c_i801 dca pps_core i2c_core wmi acpi_cpufreq isci libsas scsi_transport_sas
    [ 246.037633] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 3.12.0+ #14
    [ 246.044451] Hardware name: Intel Corporation S4600LH ........../SVRBD-ROW_T, BIOS SE5C600.86B.01.08.0003.022620131521 02/26/2013
    [ 246.057371] 0000000000000009 ffff88081fa03d40 ffffffff8164fbf6 ffff88081fa0ee48
    [ 246.065728] ffff88081fa03d90 ffff88081fa03d80 ffffffff81054ecc ffff88081fa13040
    [ 246.074073] 0000000000000000 ffff88200cce0000 0000000000000040 0000000000000000
    [ 246.082430] Call Trace:
    [ 246.085174] [] dump_stack+0x46/0x58
    [ 246.091633] [] warn_slowpath_common+0x8c/0xc0
    [ 246.098352] [] warn_slowpath_fmt+0x46/0x50
    [ 246.104786] [] dev_watchdog+0x246/0x250
    [ 246.110923] [] ? dev_deactivate_queue.constprop.31+0x80/0x80
    [ 246.119097] [] call_timer_fn+0x3a/0x110
    [ 246.125224] [] ? update_process_times+0x6f/0x80
    [ 246.132137] [] ? dev_deactivate_queue.constprop.31+0x80/0x80
    [ 246.140308] [] run_timer_softirq+0x1f0/0x2a0
    [ 246.146933] [] __do_softirq+0xe0/0x220
    [ 246.152976] [] call_softirq+0x1c/0x30
    [ 246.158920] [] do_softirq+0x55/0x90
    [ 246.164670] [] irq_exit+0xa5/0xb0
    [ 246.170227] [] smp_apic_timer_interrupt+0x4a/0x60
    [ 246.177324] [] apic_timer_interrupt+0x6a/0x70
    [ 246.184041] [] ? cpuidle_enter_state+0x5b/0xe0
    [ 246.191559] [] ? cpuidle_enter_state+0x57/0xe0
    [ 246.198374] [] cpuidle_idle_call+0xbd/0x200
    [ 246.204900] [] arch_cpu_idle+0xe/0x30
    [ 246.210846] [] cpu_startup_entry+0xd0/0x250
    [ 246.217371] [] rest_init+0x77/0x80
    [ 246.223028] [] start_kernel+0x3ee/0x3fb
    [ 246.229165] [] ? repair_env_string+0x5e/0x5e
    [ 246.235787] [] x86_64_start_reservations+0x2a/0x2c
    [ 246.242990] [] x86_64_start_kernel+0xf8/0xfc
    [ 246.249610] ---[ end trace fb74fdef54d79039 ]---
    [ 246.254807] ixgbe 0000:c2:00.0 p786p1: initiating reset due to tx timeout
    [ 246.262489] ixgbe 0000:c2:00.0 p786p1: Reset adapter
    Last login: Mon Nov 11 08:35:14 from 10.18.17.119
    [root@(none) ~]# [ 246.792676] ixgbe 0000:c2:00.0 p786p1: detected SFP+: 5
    [ 249.231598] ixgbe 0000:c2:00.0 p786p1: NIC Link is Up 10 Gbps, Flow Control: RX/TX
    [ 246.792676] ixgbe 0000:c2:00.0 p786p1: detected SFP+: 5
    [ 249.231598] ixgbe 0000:c2:00.0 p786p1: NIC Link is Up 10 Gbps, Flow Control: RX/TX

    (last lines keep repeating. ixgbe driver is dead until module reload.)

    If the downed cpu has more vectors than are free on the remaining cpus on the
    system, it is possible that some vectors are "orphaned" even though they are
    assigned to a cpu. In this case, since the ixgbe driver had a watchdog, the
    watchdog fired and notified that something was wrong.

    This patch adds a function, check_vectors(), which compares the number
    of vectors on the CPU going down with the number of vectors available on
    the rest of the system. If there aren't enough vectors for the CPU to go
    down, an error is returned and propagated back to userspace.
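
    A simplified sketch of the counting logic (illustrative; the real
    function also skips percpu irqs and accounts for irq affinity, per the
    version notes below):

    int check_vectors(unsigned int this_cpu)
    {
            unsigned int used = 0, free = 0;
            int cpu, vector;

            /* Vectors that must be migrated off the dying CPU. */
            for (vector = FIRST_EXTERNAL_VECTOR; vector < NR_VECTORS; vector++)
                    if (per_cpu(vector_irq, this_cpu)[vector] >= 0)
                            used++;

            /* Free vector slots on the CPUs that remain online. */
            for_each_online_cpu(cpu) {
                    if (cpu == this_cpu)
                            continue;
                    for (vector = FIRST_EXTERNAL_VECTOR; vector < NR_VECTORS; vector++)
                            if (per_cpu(vector_irq, cpu)[vector] < 0)
                                    free++;
            }

            return free >= used ? 0 : -ENOSPC;
    }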

    v2: Do not need to look at percpu irqs
    v3: Need to check affinity to prevent counting of MSIs in IOAPIC Lowest
    Priority Mode
    v4: Additional changes suggested by Gong Chen.
    v5/v6/v7/v8: Updated comment text

    Signed-off-by: Prarit Bhargava
    Link: http://lkml.kernel.org/r/1389613861-3853-1-git-send-email-prarit@redhat.com
    Reviewed-by: Gong Chen
    Cc: Andi Kleen
    Cc: Michel Lespinasse
    Cc: Seiji Aguchi
    Cc: Yang Zhang
    Cc: Paul Gortmaker
    Cc: Janet Morgan
    Cc: Tony Luck
    Cc: Ruiv Wang
    Cc: Gong Chen
    Signed-off-by: H. Peter Anvin
    Cc:
    Signed-off-by: Jiri Slaby

    Prarit Bhargava
     

09 Sep, 2014

1 commit

  • commit 10f67dbf6add97751050f294d4c8e0cc1e5c2c23 upstream.

    The mainline signal handling code for OpenRISC has been buggy since day
    one with respect to syscall restart. This patch significantly reworks
    the signal handling code:

    i) Move the "work pending" loop to C code (borrowed from ARM arch)

    ii) Allow a tracer to muck about with the IP and skip syscall restart
    in that case (again, borrowed from ARM)

    iii) Make signal handling WRT syscall restart actually work

    iv) Make the signal handling code look more like that of other
    architectures so that it's easier for others to follow

    Reported-by: Anders Nystrom
    Signed-off-by: Jonas Bonn
    Signed-off-by: Jiri Slaby

    Jonas Bonn
     

04 Sep, 2014

7 commits

  • commit 8d5999df35314607c38fbd6bdd709e25c3a4eeab upstream.

    If the timer irqs are resumed during device resume it is possible in
    certain circumstances for the resume to hang early on, before device
    interrupts are resumed. For an Ubuntu 14.04 PVHVM guest this would
    occur in ~0.5% of resume attempts.

    It is not entirely clear what is occurring at the point of the hang, but
    I think a task necessary for the resume calls schedule_timeout(),
    waiting for a timer interrupt (which never arrives). This failure may
    require specific tasks to be running on the other VCPUs to trigger
    (processes are not frozen during a suspend/resume if PREEMPT is
    disabled).

    Add IRQF_EARLY_RESUME to the timer interrupts so they are resumed in
    syscore_resume().
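
    A sketch of the flag change when binding the per-VCPU timer interrupt
    (the exact set of companion flags is illustrative, not the verbatim
    patch):

    irq = bind_virq_to_irqhandler(VIRQ_TIMER, cpu, xen_timer_interrupt,
                                  IRQF_PERCPU | IRQF_TIMER |
                                  IRQF_FORCE_RESUME | IRQF_EARLY_RESUME,
                                  "timer", NULL);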

    Signed-off-by: David Vrabel
    Reviewed-by: Boris Ostrovsky
    Signed-off-by: Jiri Slaby

    David Vrabel
     
  • commit 7b2a583afb4ab894f78bc0f8bd136e96b6499a7e upstream.

    Without CONFIG_RELOCATABLE the early boot code will decompress the
    kernel to LOAD_PHYSICAL_ADDR. While this may have been fine in the BIOS
    days, that isn't going to fly with UEFI since parts of the firmware
    code/data may be located at LOAD_PHYSICAL_ADDR.

    Straying outside of the bounds of the regions we've explicitly requested
    from the firmware will cause all sorts of trouble. Bruno reports that
    his machine resets while trying to decompress the kernel image.

    We already go to great pains to ensure the kernel is loaded into a
    suitably aligned buffer; it's just that the address isn't necessarily
    LOAD_PHYSICAL_ADDR, because we can't guarantee that address isn't in use
    by the firmware.

    Explicitly enforce CONFIG_RELOCATABLE for the EFI boot stub, so that we
    can load the kernel at any address with the correct alignment.

    Reported-by: Bruno Prémont
    Tested-by: Bruno Prémont
    Cc: H. Peter Anvin
    Signed-off-by: Matt Fleming
    Signed-off-by: Jiri Slaby

    Matt Fleming
     
  • commit 53b884ac3745353de220d92ef792515c3ae692f0 upstream.

    This commit in Linux 3.6:

    commit c767a54ba0657e52e6edaa97cbe0b0a8bf1c1655
    Author: Joe Perches
    Date: Mon May 21 19:50:07 2012 -0700

    x86/debug: Add KERN_ to bare printks, convert printks to pr_

    caused warn_bad_vsyscall to output garbage in the middle of the
    line. Revert the bad part of it.

    The printk in question isn't actually bare; the level is "%s".

    The bug this fixes is purely cosmetic; backports are optional.

    Signed-off-by: Andy Lutomirski
    Link: http://lkml.kernel.org/r/03eac1f24110bbe496ecc12a4df467e0d88466d4.1406330947.git.luto@amacapital.net
    Signed-off-by: H. Peter Anvin
    Signed-off-by: Jiri Slaby

    Andy Lutomirski
     
  • commit cbace46a9710a480cae51e4611697df5de41713e upstream.

    Commit 30919b0bf356 ("x86: avoid low BIOS area when allocating address
    space") moved the test for resource allocations that fall within the first
    1MB of address space from the PCI-specific path to a generic path, such
    that all resource allocations will avoid this area. However, this breaks
    ISA cards which need to allocate a memory region within the first 1MB. An
    example is the i82365 PCMCIA controller and derivatives like the Ricoh
    RF5C296/396 which map part of the PCMCIA socket memory address space into
    the first 1MB of system memory address space. They do not work anymore as
    no usable memory region exists due to this change:

    Intel ISA PCIC probe: Ricoh RF5C296/396 ISA-to-PCMCIA at port 0x3e0 ofs 0x00, 2 sockets
    host opts [0]: none
    host opts [1]: none
    ISA irqs (scanned) = 3,4,5,9,10 status change on irq 10
    pcmcia_socket pcmcia_socket1: pccard: PCMCIA card inserted into slot 1
    pcmcia_socket pcmcia_socket0: cs: IO port probe 0xc00-0xcff: excluding 0xcf8-0xcff
    pcmcia_socket pcmcia_socket0: cs: IO port probe 0xa00-0xaff: clean.
    pcmcia_socket pcmcia_socket0: cs: IO port probe 0x100-0x3ff: excluding 0x170-0x177 0x1f0-0x1f7 0x2f8-0x2ff 0x370-0x37f 0x3c0-0x3e7 0x3f0-0x3ff
    pcmcia_socket pcmcia_socket0: cs: memory probe 0x0a0000-0x0affff: excluding 0xa0000-0xaffff
    pcmcia_socket pcmcia_socket0: cs: memory probe 0x0b0000-0x0bffff: excluding 0xb0000-0xbffff
    pcmcia_socket pcmcia_socket0: cs: memory probe 0x0c0000-0x0cffff: excluding 0xc0000-0xcbfff
    pcmcia_socket pcmcia_socket0: cs: memory probe 0x0d0000-0x0dffff: clean.
    pcmcia_socket pcmcia_socket0: cs: memory probe 0x0e0000-0x0effff: clean.
    pcmcia_socket pcmcia_socket0: cs: memory probe 0x60000000-0x60ffffff: clean.
    pcmcia_socket pcmcia_socket0: cs: memory probe 0xa0000000-0xa0ffffff: clean.
    pcmcia_socket pcmcia_socket1: cs: IO port probe 0xc00-0xcff: excluding 0xcf8-0xcff
    pcmcia_socket pcmcia_socket1: cs: IO port probe 0xa00-0xaff: clean.
    pcmcia_socket pcmcia_socket1: cs: IO port probe 0x100-0x3ff: excluding 0x170-0x177 0x1f0-0x1f7 0x2f8-0x2ff 0x370-0x37f 0x3c0-0x3e7 0x3f0-0x3ff
    pcmcia_socket pcmcia_socket1: cs: memory probe 0x0a0000-0x0affff: excluding 0xa0000-0xaffff
    pcmcia_socket pcmcia_socket1: cs: memory probe 0x0b0000-0x0bffff: excluding 0xb0000-0xbffff
    pcmcia_socket pcmcia_socket1: cs: memory probe 0x0c0000-0x0cffff: excluding 0xc0000-0xcbfff
    pcmcia_socket pcmcia_socket1: cs: memory probe 0x0d0000-0x0dffff: clean.
    pcmcia_socket pcmcia_socket1: cs: memory probe 0x0e0000-0x0effff: clean.
    pcmcia_socket pcmcia_socket1: cs: memory probe 0x60000000-0x60ffffff: clean.
    pcmcia_socket pcmcia_socket1: cs: memory probe 0xa0000000-0xa0ffffff: clean.
    pcmcia_socket pcmcia_socket1: cs: memory probe 0x0cc000-0x0effff: excluding 0xe0000-0xeffff
    pcmcia_socket pcmcia_socket1: cs: unable to map card memory!

    If filtering out the first 1MB is reverted, everything works as expected.

    Tested-by: Robert Resch
    Signed-off-by: Christoph Schulz
    Signed-off-by: Bjorn Helgaas
    Signed-off-by: Jiri Slaby

    Christoph Schulz
     
  • commit 0d234daf7e0a3290a3a20c8087eefbd6335a5bd4 upstream.

    This reverts commit 682367c494869008eb89ef733f196e99415ae862,
    which causes 32-bit SMP Windows 7 guests to panic.

    SeaBIOS has a limit on the number of MTRRs that it can handle,
    and this patch exceeded the limit. Better revert it.
    Thanks to Nadav Amit for debugging the cause.

    Reported-by: Wanpeng Li
    Signed-off-by: Paolo Bonzini
    Signed-off-by: Jiri Slaby

    Paolo Bonzini
     
  • commit 9e8919ae793f4edfaa29694a70f71a515ae9942a upstream.

    Return an unhandleable error on an inter-privilege level ret
    instruction. This is because the current emulation does not check the
    privilege level correctly when loading the CS, and does not pop RSP/SS
    as needed.

    Signed-off-by: Nadav Amit
    Signed-off-by: Paolo Bonzini
    Signed-off-by: Jiri Slaby

    Nadav Amit
     
  • commit 9b5f7428f8b16bd8980213f2b70baf1dd0b9e36c upstream.

    According to the comment "restore_es3: applies to 34xx >= ES3.0" in
    "arch/arm/mach-omap2/sleep34xx.S", omap3_restore_es3 should be used
    if the revision of an OMAP34xx is ES3.1.2.

    Signed-off-by: Jeremy Vial
    Signed-off-by: Tony Lindgren
    Signed-off-by: Jiri Slaby

    Jeremy Vial
     

26 Aug, 2014

7 commits

  • [ Upstream commit 093758e3daede29cb4ce6aedb111becf9d4bfc57 ]

    This commit is guesswork, but it seems to make sense to drop this
    break, as otherwise the following line is never executed and becomes
    dead code. And that following line actually saves the result of the
    local calculation through the pointer given as a function argument. So
    the proposed change makes sense if this code as a whole makes sense (but
    I am unable to analyze it as a whole).

    Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=81641
    Reported-by: David Binderman
    Signed-off-by: Andrey Utkin
    Signed-off-by: David S. Miller
    Signed-off-by: Jiri Slaby

    Andrey Utkin
     
  • [ Upstream commit 4ec1b01029b4facb651b8ef70bc20a4be4cebc63 ]

    The LDC handshake could have been asynchronously triggered
    after ldc_bind() enables the ldc_rx() receive interrupt-handler
    (and thus intercepts incoming control packets)
    and before vio_port_up() calls ldc_connect(). If that is the case,
    ldc_connect() should return 0 and let the state-machine
    progress.

    Signed-off-by: Sowmini Varadhan
    Acked-by: Karl Volz
    Signed-off-by: David S. Miller
    Signed-off-by: Jiri Slaby

    Sowmini Varadhan
     
  • [ Upstream commit 4ca9a23765da3260058db3431faf5b4efd8cf926 ]

    Based almost entirely upon a patch by Christopher Alexander Tobias
    Schulze.

    In commit db64fe02258f1507e13fe5212a989922323685ce ("mm: rewrite vmap
    layer") lazy VMAP tlb flushing was added to the vmalloc layer. This
    causes problems on sparc64.

    Sparc64 has two VMAP mapped regions and they are not contiguous with
    each other. First we have the malloc mapping area, then another
    unrelated region, then the vmalloc region.

    This "another unrelated region" is where the firmware is mapped.

    If the lazy TLB flushing logic in the vmalloc code triggers after
    we've had both a module unload and a vfree or similar, it will pass an
    address range that goes from somewhere inside the malloc region to
    somewhere inside the vmalloc region, thus covering the
    openfirmware area entirely.

    The sparc64 kernel learns about openfirmware's dynamic mappings in
    this region early in the boot, and then services TLB misses in this
    area. But openfirmware has some locked TLB entries which are not
    mentioned in those dynamic mappings and we should thus not disturb
    them.

    These huge lazy TLB flush ranges cause those openfirmware locked TLB
    entries to be removed, resulting in all kinds of problems including
    hard hangs and crashes during reboot/reset.

    Besides causing problems like this, such huge TLB flush ranges are
    also incredibly inefficient. A plea has been made with the author of
    the VMAP lazy TLB flushing code, but for now we'll put a safety guard
    into our flush_tlb_kernel_range() implementation.

    Since the implementation has become non-trivial, stop defining it as a
    macro and instead make it a function in a C source file.
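
    A sketch of the guard (the OBP constants and the low-level flush helper
    exist in the sparc64 tree, but treat this as an illustration rather
    than the verbatim patch):

    void flush_tlb_kernel_range(unsigned long start, unsigned long end)
    {
            /* If the range overlaps the firmware (OBP) area, flush the
             * pieces on either side and leave OBP's locked entries alone. */
            if (start < HI_OBP_ADDRESS && end > LOW_OBP_ADDRESS) {
                    if (start < LOW_OBP_ADDRESS)
                            __flush_tlb_kernel_range(start, LOW_OBP_ADDRESS);
                    if (end > HI_OBP_ADDRESS)
                            __flush_tlb_kernel_range(HI_OBP_ADDRESS, end);
            } else {
                    __flush_tlb_kernel_range(start, end);
            }
    }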

    Signed-off-by: David S. Miller
    Signed-off-by: Jiri Slaby

    David S. Miller
     
  • [ Upstream commit 18f38132528c3e603c66ea464727b29e9bbcb91b ]

    The assumption was that update_mmu_cache() (and the equivalent for PMDs) would
    only be called when the PTE being installed will be accessible by the user.

    This is not true for code paths originating from remove_migration_pte().

    There are dire consequences for placing a non-valid PTE into the TSB.
    The TLB miss framework assumes that when a TSB entry matches we can just
    load it into the TLB and return from the TLB miss trap.

    So if a non-valid PTE is in there, we will deadlock taking the TLB miss over
    and over, never satisfying the miss.

    Just exit early from update_mmu_cache() and friends in this situation.
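
    A sketch of the early exit (the upstream fix is along these lines):

    void update_mmu_cache(struct vm_area_struct *vma,
                          unsigned long address, pte_t *ptep)
    {
            struct mm_struct *mm = vma->vm_mm;
            pte_t pte = *ptep;

            if (!pte_accessible(mm, pte))
                    return;   /* never insert a non-valid PTE into the TSB */

            /* ... existing TSB insertion ... */
    }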

    Based upon a report and patch from Christopher Alexander Tobias Schulze.

    Signed-off-by: David S. Miller
    Signed-off-by: Jiri Slaby

    David S. Miller
     
  • [ Upstream commit 5aa4ecfd0ddb1e6dcd1c886e6c49677550f581aa ]

    This is to prevent previous stores from overlapping the block stores
    done by the memcpy loop.

    Based upon a glibc patch by Jose E. Marchesi

    Signed-off-by: David S. Miller
    Signed-off-by: Jiri Slaby

    David S. Miller
     
  • [ Upstream commit b18eb2d779240631a098626cb6841ee2dd34fda0 ]

    Access to the TSB hash tables during TLB misses requires that there be
    an atomic 128-bit quad load available so that we fetch a matching TAG
    and DATA field at the same time.

    On cpus prior to UltraSPARC-III only virtual address based quad loads
    are available. UltraSPARC-III and later provide physical address
    based variants which are easier to use.

    When we only have virtual address based quad loads available this
    means that we have to lock the TSB into the TLB at a fixed virtual
    address on each cpu when it runs that process. We can't just access
    the PAGE_OFFSET based aliased mapping of these TSBs because we cannot
    take a recursive TLB miss inside of the TLB miss handler without
    risking running out of hardware trap levels (some trap combinations
    can be deep, such as those generated by register window spill and fill
    traps).

    Without huge pages it's working perfectly fine, but when the huge TSB
    got added, another chunk of fixed virtual address space was not
    allocated for this second TSB mapping.

    So we were mapping both the 8K and 4MB TSBs to the same exact virtual
    address, causing multiple TLB matches which gives undefined behavior.

    Signed-off-by: David S. Miller
    Signed-off-by: Jiri Slaby

    David S. Miller
     
  • [ Upstream commit e5c460f46ae7ee94831cb55cb980f942aa9e5a85 ]

    This was found using Dave Jones' trinity tool.

    When a user process which is 32-bit performs a load or a store, the
    cpu chops off the top 32-bits of the effective address before
    translating it.

    This is because we run 32-bit tasks with the PSTATE_AM (address
    masking) bit set.

    We can't run the kernel with that bit set, so when the kernel accesses
    userspace no address masking occurs.

    Since a 32-bit process will have no mappings in that region we will
    properly fault, so we don't try to handle this using access_ok(),
    which can safely just be a NOP on sparc64.

    Real faults from 32-bit processes should never generate such addresses
    so a bug check was added long ago, and it barks in the logs if this
    happens.

    But it also barks when a kernel user access causes this condition, and
    that _can_ happen. For example, if a pointer passed into a system call
    is "0xfffffffc" and the kernel accesses 4 bytes offset from that
    pointer.

    Just handle such faults normally via the exception entries.

    Signed-off-by: David S. Miller
    Signed-off-by: Jiri Slaby

    David S. Miller