26 Sep, 2014

7 commits

  • commit b745bc85f21ea707e4ea1a91948055fa3e72c77b upstream.

    cold is a bool, so make it one. Make the likely case the "if" part of the
    block instead of the "else", since according to the optimisation manual
    this ordering is preferred.
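
    A minimal sketch of the resulting shape (hedged: the function and list
    parameters here are illustrative, not the actual mm/ diff; only the bool
    parameter and the branch ordering reflect the commit):

    /* Sketch: illustrative helper, not the actual mm/ code. */
    static void free_page_example(struct page *page, struct list_head *list,
                                  bool cold)
    {
            if (likely(!cold))
                    list_add(&page->lru, list);       /* common case in "if" */
            else
                    list_add_tail(&page->lru, list);  /* rare case in "else" */
    }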

    Signed-off-by: Mel Gorman
    Acked-by: Rik van Riel
    Cc: Johannes Weiner
    Cc: Vlastimil Babka
    Cc: Jan Kara
    Cc: Michal Hocko
    Cc: Hugh Dickins
    Cc: Dave Hansen
    Cc: Theodore Ts'o
    Cc: "Paul E. McKenney"
    Cc: Oleg Nesterov
    Cc: Peter Zijlstra
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Mel Gorman
    Signed-off-by: Jiri Slaby

    Mel Gorman
     
  • commit b13b1d2d8692b437203de7a404c6b809d2cc4d99 upstream.

    We use the accessed bit to age a page at page reclaim time,
    and currently we also flush the TLB when doing so.

    But in some workloads TLB flush overhead is very heavy. In my
    simple multithreaded app with a lot of swap to several PCIe
    SSDs, removing the TLB flush gives about a 20%-30% swapout
    speedup.

    Fortunately just removing the TLB flush is a valid optimization:
    on x86 CPUs, clearing the accessed bit without a TLB flush
    doesn't cause data corruption.

    It could cause incorrect page aging and the (mistaken) reclaim of
    hot pages, but the chance of that should be relatively low.

    So as a performance optimization don't flush the TLB when
    clearing the accessed bit, it will eventually be flushed by
    a context switch or a VM operation anyway. [ In the rare
    event of it not getting flushed for a long time the delay
    shouldn't really matter because there's no real memory
    pressure for swapout to react to. ]
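
    On x86 the change essentially reduces to the following (a sketch of the
    idea, not necessarily the verbatim patch):

    int ptep_clear_flush_young(struct vm_area_struct *vma,
                               unsigned long address, pte_t *ptep)
    {
            /*
             * Clear the accessed bit but skip the flush_tlb_page() call:
             * the stale TLB entry is harmless and will be flushed by the
             * next context switch or VM operation anyway.
             */
            return ptep_test_and_clear_young(vma, address, ptep);
    }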

    Suggested-by: Linus Torvalds
    Signed-off-by: Shaohua Li
    Acked-by: Rik van Riel
    Acked-by: Mel Gorman
    Acked-by: Hugh Dickins
    Acked-by: Johannes Weiner
    Cc: linux-mm@kvack.org
    Cc: Peter Zijlstra
    Link: http://lkml.kernel.org/r/20140408075809.GA1764@kernel.org
    [ Rewrote the changelog and the code comments. ]
    Signed-off-by: Ingo Molnar
    Signed-off-by: Mel Gorman
    Signed-off-by: Jiri Slaby

    Shaohua Li
     
  • commit 615d6e8756c87149f2d4c1b93d471bca002bd849 upstream.

    This patch is a continuation of efforts trying to optimize find_vma(),
    avoiding potentially expensive rbtree walks to locate a vma upon faults.
    The original approach (https://lkml.org/lkml/2013/11/1/410), where the
    largest vma was also cached, ended up being too specific and random, so
    further comparisons with other approaches were needed. There are two
    things to consider when dealing with this: the cache hit rate and the
    latency of find_vma(). Improving the hit rate does not necessarily
    translate into finding the vma any faster, as the overhead of any fancy
    caching scheme can be too high to consider.

    We currently cache the last used vma for the whole address space, which
    provides a nice optimization, reducing the total cycles in find_vma() by
    up to 250%, for workloads with good locality. On the other hand, this
    simple scheme is pretty much useless for workloads with poor locality.
    Analyzing ebizzy runs shows that, no matter how many threads are
    running, the mmap_cache hit rate is less than 2%, and in many situations
    below 1%.

    The proposed approach is to replace this scheme with a small per-thread
    cache, maximizing hit rates at a very low maintenance cost.
    Invalidations are performed by simply bumping up a 32-bit sequence
    number. The only expensive operation is in the rare case of a seq
    number overflow, where all caches that share the same address space are
    flushed. Upon a miss, the proposed replacement policy is based on the
    page number that contains the virtual address in question (a sketch of
    the scheme follows the results below). Concretely, the following results
    are seen on an 80 core, 8 socket x86-64 box:

    1) System bootup: Most programs are single threaded, so the per-thread
    scheme improves on the ~50% hit rate by just adding a few more slots to
    the cache.

    +----------------+----------+------------------+
    | caching scheme | hit-rate | cycles (billion) |
    +----------------+----------+------------------+
    | baseline       |   50.61% |            19.90 |
    | patched        |   73.45% |            13.58 |
    +----------------+----------+------------------+

    2) Kernel build: This one is already pretty good with the current
    approach as we're dealing with good locality.

    +----------------+----------+------------------+
    | caching scheme | hit-rate | cycles (billion) |
    +----------------+----------+------------------+
    | baseline       |   75.28% |            11.03 |
    | patched        |   88.09% |             9.31 |
    +----------------+----------+------------------+

    3) Oracle 11g Data Mining (4k pages): Similar to the kernel build workload.

    +----------------+----------+------------------+
    | caching scheme | hit-rate | cycles (billion) |
    +----------------+----------+------------------+
    | baseline       |   70.66% |            17.14 |
    | patched        |   91.15% |            12.57 |
    +----------------+----------+------------------+

    4) Ebizzy: There's a fair amount of variation from run to run, but this
    approach always shows nearly perfect hit rates, while the baseline's is
    just about non-existent. The cycle counts fluctuate anywhere from ~60 to
    ~116 billion for the baseline scheme, but this approach reduces them
    considerably. For instance, with 80 threads:

    +----------------+----------+------------------+
    | caching scheme | hit-rate | cycles (billion) |
    +----------------+----------+------------------+
    | baseline       |    1.06% |            91.54 |
    | patched        |   99.97% |            14.18 |
    +----------------+----------+------------------+
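
    A toy model of the per-thread scheme in plain C (illustrative only; the
    real code lives in mm/vmacache.c and differs in detail, e.g. it also
    verifies that the cached vma still covers the address):

    #include <string.h>

    #define VMACACHE_BITS 2
    #define VMACACHE_SIZE (1U << VMACACHE_BITS)
    #define VMACACHE_MASK (VMACACHE_SIZE - 1)

    struct vma;                              /* stand-in for vm_area_struct */

    struct mm_model {
            unsigned int seqnum;             /* bumped to invalidate caches */
    };

    struct thread_model {
            unsigned int vmacache_seqnum;    /* snapshot of the mm seqnum */
            struct vma *vmacache[VMACACHE_SIZE];
    };

    /* Replacement policy: index by the page number of the address. */
    static unsigned int vmacache_hash(unsigned long addr)
    {
            return (addr >> 12) & VMACACHE_MASK;   /* 12 == PAGE_SHIFT */
    }

    static struct vma *vmacache_find(struct thread_model *t,
                                     struct mm_model *mm, unsigned long addr)
    {
            if (t->vmacache_seqnum != mm->seqnum) {
                    /* Invalidation is just a seqnum mismatch. */
                    memset(t->vmacache, 0, sizeof(t->vmacache));
                    t->vmacache_seqnum = mm->seqnum;
                    return NULL;
            }
            return t->vmacache[vmacache_hash(addr)];  /* NULL means a miss */
    }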

    [akpm@linux-foundation.org: fix nommu build, per Davidlohr]
    [akpm@linux-foundation.org: document vmacache_valid() logic]
    [akpm@linux-foundation.org: attempt to untangle header files]
    [akpm@linux-foundation.org: add vmacache_find() BUG_ON]
    [hughd@google.com: add vmacache_valid_mm() (from Oleg)]
    [akpm@linux-foundation.org: coding-style fixes]
    [akpm@linux-foundation.org: adjust and enhance comments]
    Signed-off-by: Davidlohr Bueso
    Reviewed-by: Rik van Riel
    Acked-by: Linus Torvalds
    Reviewed-by: Michel Lespinasse
    Cc: Oleg Nesterov
    Tested-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Mel Gorman
    Signed-off-by: Jiri Slaby

    Davidlohr Bueso
     
  • commit 71b54f8263860a37dd9f50f81880a9d681fd9c10 upstream.

    When choosing between doing an address space or a ranged flush,
    the x86 implementation of flush_tlb_mm_range takes into account
    whether there are any large pages in the range. A per-page
    flush typically requires fewer entries than would be covered by a
    single large page, so the check is redundant.

    There is one potential exception. THP migration flushes single
    THP entries and it conceivably would benefit from flushing a
    single entry instead of the mm. However, this flush is after a
    THP allocation, copy and page table update potentially with any
    other threads serialised behind it. In comparison to that, the
    flush is noise. It makes more sense to optimise balancing to
    require fewer flushes than to optimise the flush itself.

    This patch deletes the redundant huge page check.
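
    A sketch of the simplified decision after the patch (hedged:
    tlb_entries_limit is a hypothetical stand-in for the real heuristics,
    and this is the shape of the logic rather than the actual diff):

    static unsigned long tlb_entries_limit;   /* hypothetical threshold */

    void flush_tlb_mm_range_sketch(unsigned long start, unsigned long end)
    {
            unsigned long nr_pages = (end - start) >> PAGE_SHIFT;
            unsigned long addr;

            if (nr_pages > tlb_entries_limit) {
                    local_flush_tlb();        /* whole address space */
                    return;
            }
            /* No per-range huge-page test any more: flush page by page. */
            for (addr = start; addr < end; addr += PAGE_SIZE)
                    __flush_tlb_single(addr);
    }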

    Signed-off-by: Mel Gorman
    Tested-by: Davidlohr Bueso
    Reviewed-by: Rik van Riel
    Signed-off-by: Andrew Morton
    Cc: Alex Shi
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Link: http://lkml.kernel.org/n/tip-sgei1drpOcburujPsfh6ovmo@git.kernel.org
    Signed-off-by: Ingo Molnar
    Signed-off-by: Mel Gorman
    Signed-off-by: Jiri Slaby

    Mel Gorman
     
  • commit 15aa368255f249df0b2af630c9487bb5471bd7da upstream.

    NR_TLB_LOCAL_FLUSH_ALL is not always accounted for correctly and
    the comparison with total_vm is done before taking
    tlb_flushall_shift into account. Clean it up.

    Signed-off-by: Mel Gorman
    Tested-by: Davidlohr Bueso
    Reviewed-by: Alex Shi
    Reviewed-by: Rik van Riel
    Signed-off-by: Andrew Morton
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Hugh Dickins
    Link: http://lkml.kernel.org/n/tip-Iz5gcahrgskIldvukulzi0hh@git.kernel.org
    Signed-off-by: Ingo Molnar
    Signed-off-by: Mel Gorman
    Signed-off-by: Jiri Slaby

    Mel Gorman
     
  • commit ec65993443736a5091b68e80ff1734548944a4b8 upstream.

    Bisection between 3.11 and 3.12 fingered commit 9824cf97 ("mm:
    vmstats: tlb flush counters") as the cause of the overhead problems.

    The counters are undeniably useful but how often do we really
    need to debug TLB flush related issues? It does not justify
    taking the penalty everywhere so make it a debugging option.
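
    The resulting wrapper is roughly as follows (a sketch; the option and
    macro names follow the upstream commit as I understand it):

    #ifdef CONFIG_DEBUG_TLBFLUSH
    #define count_vm_tlb_event(x)   count_vm_event(x)
    #else
    #define count_vm_tlb_event(x)   do { } while (0)
    #endif

    /* Call sites then cost nothing unless the debug option is enabled:
     * count_vm_tlb_event(NR_TLB_LOCAL_FLUSH_ALL); */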

    Signed-off-by: Mel Gorman
    Tested-by: Davidlohr Bueso
    Reviewed-by: Rik van Riel
    Signed-off-by: Andrew Morton
    Cc: Hugh Dickins
    Cc: Alex Shi
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Link: http://lkml.kernel.org/n/tip-XzxjntugxuwpxXhcrxqqh53b@git.kernel.org
    Signed-off-by: Ingo Molnar
    Signed-off-by: Mel Gorman
    Signed-off-by: Jiri Slaby

    Mel Gorman
     
  • commit eb35bdd7bca29a13c8ecd44e6fd747a84ce675db upstream.

    Nathan reports that we leak TLS information from the parent context
    during an exec, as we don't clear the TLS registers when flushing the
    thread state.

    This patch updates the flushing code so that we:

    (1) Unconditionally zero the tpidr_el0 register (since this is fully
    context switched for native tasks and zeroed for compat tasks)

    (2) Zero the tp_value state in thread_info before clearing the
    tpidrro_el0 register for compat tasks (since this is only writable
    by the set_tls compat syscall and therefore not fully switched).

    A missing compiler barrier is also added to the compat set_tls syscall.
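
    A sketch of the resulting flush helper (close to, but not guaranteed to
    match, the upstream code in arch/arm64/kernel/process.c):

    static void tls_thread_flush(void)
    {
            asm ("msr tpidr_el0, xzr");       /* (1) always zeroed */

            if (is_compat_task()) {
                    current->thread.tp_value = 0;
                    /*
                     * (2) Ensure tp_value is zeroed before the register,
                     * so a preempting context switch cannot write a stale
                     * value back into tpidrro_el0.
                     */
                    barrier();
                    asm ("msr tpidrro_el0, xzr");
            }
    }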

    Acked-by: Nathan Lynch
    Reported-by: Nathan Lynch
    Signed-off-by: Will Deacon
    Signed-off-by: Jiri Slaby

    Will Deacon
     

17 Sep, 2014

16 commits

  • commit 608308682addfdc7b8e2aee88f0e028331d88e4d upstream.

    get_system_type() is not thread-safe on OCTEON. It uses static data,
    and a more dangerous issue is that it calls cvmx_fuse_read_byte()
    every time without any synchronization. Currently it's possible to get
    processes stuck looping forever in the kernel simply by launching
    multiple readers of /proc/cpuinfo:

    (while true; do cat /proc/cpuinfo > /dev/null; done) &
    (while true; do cat /proc/cpuinfo > /dev/null; done) &
    ...

    Fix by initializing the system type string only once during the early
    boot.
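
    A sketch of the fix (board_name_string() is a hypothetical stand-in for
    the real OCTEON board-name helpers):

    static char octeon_system_type[80];

    static int __init init_octeon_system_type(void)
    {
            /* Runs once, single-threaded, before /proc/cpuinfo is usable.
             * board_name_string() is hypothetical. */
            snprintf(octeon_system_type, sizeof(octeon_system_type),
                     "Cavium Octeon %s", board_name_string());
            return 0;
    }
    early_initcall(init_octeon_system_type);

    const char *get_system_type(void)
    {
            return octeon_system_type;   /* read-only after early boot */
    }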

    Signed-off-by: Aaro Koskinen
    Reviewed-by: Markos Chandras
    Patchwork: http://patchwork.linux-mips.org/patch/7437/
    Signed-off-by: James Hogan
    Signed-off-by: Jiri Slaby

    Aaro Koskinen
     
  • commit 2e5767a27337812f6850b3fa362419e2f085e5c3 upstream.

    In do_ade(), is_fpu_owner() isn't preempt-safe. For example, when an
    unaligned ldc1 is executed, do_cpu() is called and then the FPU will be
    enabled (and TIF_USEDFPU will be set for the current process). Then,
    do_ade() is called because the access is unaligned. If the current
    process is preempted at this time, TIF_USEDFPU will be cleared. So when
    the process is scheduled again, BUG_ON(!is_fpu_owner()) is triggered.

    This small program can trigger this BUG in a preemptible kernel:

    int main(int argc, char *argv[])
    {
            double u64[2];

            while (1) {
                    asm volatile (
                            ".set push \n\t"
                            ".set noreorder \n\t"
                            "ldc1 $f3, 4(%0) \n\t"
                            ".set pop \n\t"
                            :: "r" (u64) :
                    );
            }

            return 0;
    }

    V2: Remove the BUG_ON() unconditionally, per Paul's suggestion.

    Signed-off-by: Huacai Chen
    Signed-off-by: Jie Chen
    Signed-off-by: Rui Wang
    Cc: John Crispin
    Cc: Steven J. Hill
    Cc: linux-mips@linux-mips.org
    Cc: Fuxin Zhang
    Cc: Zhangjin Wu
    Signed-off-by: Ralf Baechle
    Signed-off-by: Jiri Slaby

    Huacai Chen
     
  • commit 8393c524a25609a30129e4a8975cf3b91f6c16a5 upstream.

    In commit 2c8c53e28f1 (MIPS: Optimize TLB handlers for Octeon CPUs)
    build_r4000_tlb_refill_handler() was modified, but it is not compatible
    with the original code in the HUGETLB case because of a copy & paste
    error: one line of code is missing. It is very easy to trigger the bug
    with LTP's hugemmap05 test.

    Signed-off-by: Huacai Chen
    Signed-off-by: Binbin Zhou
    Cc: John Crispin
    Cc: Steven J. Hill
    Cc: linux-mips@linux-mips.org
    Cc: Fuxin Zhang
    Cc: Zhangjin Wu
    Patchwork: https://patchwork.linux-mips.org/patch/7496/
    Signed-off-by: Ralf Baechle
    Signed-off-by: Jiri Slaby

    Huacai Chen
     
  • commit b1442d39fac2fcfbe6a4814979020e993ca59c9e upstream.

    If one or more matching FCSR cause & enable bits are set in saved thread
    context then when that context is restored the kernel will take an FP
    exception. This is of course undesirable and considered an oops, leading
    to the kernel writing a backtrace to the console and potentially
    rebooting depending upon the configuration. Thus the kernel avoids this
    situation by clearing the cause bits of the FCSR register when handling
    FP exceptions and after emulating FP instructions.

    However the kernel does not prevent userland from setting arbitrary FCSR
    cause & enable bits via ptrace, using either the PTRACE_POKEUSR or
    PTRACE_SETFPREGS requests. This means userland can trivially cause the
    kernel to oops on any system with an FPU. Prevent this from happening
    by clearing the cause bits when writing to the saved FCSR context via
    ptrace.

    This problem appears to exist at least back to the beginning of the git
    era in the PTRACE_POKEUSR case.
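
    The ptrace-side fix is essentially a masking operation; a sketch of the
    PTRACE_POKEUSR path (FPU_CSR_ALL_X is the mask of FCSR cause bits; the
    PTRACE_SETFPREGS path gets equivalent treatment):

    switch (addr) {
    case FPC_CSR:
            /* Never let userland plant cause bits in the saved FCSR. */
            child->thread.fpu.fcr31 = data & ~FPU_CSR_ALL_X;
            break;
    }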

    Signed-off-by: Paul Burton
    Cc: linux-mips@linux-mips.org
    Cc: Paul Burton
    Cc: stable@vger.kernel.org
    Patchwork: https://patchwork.linux-mips.org/patch/7438/
    Signed-off-by: Ralf Baechle
    Signed-off-by: Jiri Slaby

    Paul Burton
     
  • commit ffc8415afab20bd97754efae6aad1f67b531132b upstream.

    A GIC interrupt which is declared as having a GIC_MAP_TO_NMI_MSK
    mapping causes the cpu parameter to gic_setup_intr() to be increased
    to 32, causing memory corruption when pcpu_masks[] is written to again
    later in the function.

    Signed-off-by: Jeffrey Deans
    Signed-off-by: Markos Chandras
    Cc: linux-mips@linux-mips.org
    Patchwork: https://patchwork.linux-mips.org/patch/7375/
    Signed-off-by: Ralf Baechle
    Signed-off-by: Jiri Slaby

    Jeffrey Deans
     
  • commit 7e467245bf5226db34c4b12d3cbacfa2f7a15a8b upstream.

    We could get wrong results if the compiler recomputed old_pmd. Avoid
    that by using ACCESS_ONCE.
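
    The fix is essentially a one-liner; a sketch of the pattern:

    pmd_t old_pmd;

    /*
     * Read the pmd exactly once. Without ACCESS_ONCE the compiler is free
     * to re-load *pmdp, and two reads may observe different values.
     */
    old_pmd = ACCESS_ONCE(*pmdp);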

    Signed-off-by: Aneesh Kumar K.V
    Signed-off-by: Benjamin Herrenschmidt
    Signed-off-by: Jiri Slaby

    Aneesh Kumar K.V
     
  • commit 969b7b208f7408712a3526856e4ae60ad13f6928 upstream.

    As per the ISA, for a 4K base page size we compare bits 14..65 of the
    VA specified with the entry_VA in the TLB. That implies we need to make
    sure we do a tlbie with all the possible 4K VAs we used to access the
    16MB hugepage. With a 64K base page size we compare bits 14..57 of the
    VA, hence we cannot ignore the lower 24 bits of the VA while doing
    tlbie. We also cannot invalidate a 16MB entry with just one tlbie
    instruction, because we don't track which VA was used to instantiate
    the TLB entry.

    Signed-off-by: Aneesh Kumar K.V
    Signed-off-by: Benjamin Herrenschmidt
    Signed-off-by: Jiri Slaby

    Aneesh Kumar K.V
     
  • commit fc0479557572375100ef16c71170b29a98e0d69a upstream.

    If we change the base page size of the segment, either via
    sub_page_protect or via remap_4k_pfn, we do a demote_segment, which
    doesn't flush the hash table entries. We do a lazy hash page table flush
    for all mapped pages in the demoted segment. This happens when we handle
    a hash page fault for these pages.

    We use _PAGE_COMBO bit along with _PAGE_HASHPTE to indicate whether a
    pte is backed by 4K hash pte. If we find _PAGE_COMBO not set on the pte,
    that implies that we could possibly have older 64K hash pte entries in
    the hash page table and we need to invalidate those entries.

    Use _PAGE_COMBO to determine the page size with which we should
    invalidate the hash table entries on unmap.
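
    A sketch of the selection on unmap (illustrative; the surrounding
    variables and the flush_hash_page() arguments follow the usual ppc64
    conventions but are not the verbatim diff):

    /* vpn, rpte, ssize and local come from the surrounding unmap context. */
    if (pte_val(pte) & _PAGE_COMBO)
            psize = MMU_PAGE_4K;    /* demoted segment: 4K hash ptes */
    else
            psize = MMU_PAGE_64K;   /* possibly stale 64K hash ptes */
    flush_hash_page(vpn, rpte, psize, ssize, local);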

    Signed-off-by: Aneesh Kumar K.V
    Signed-off-by: Benjamin Herrenschmidt
    Signed-off-by: Jiri Slaby

    Aneesh Kumar K.V
     
  • commit 629149fae478f0ac6bf705a535708b192e9c6b59 upstream.

    If we change the base page size of the segment, either via
    sub_page_protect or via remap_4k_pfn, we do a demote_segment, which
    doesn't flush the hash table entries. We do a lazy hash page table flush
    for all mapped pages in the demoted segment. This happens when we handle
    a hash page fault for these pages.

    We use _PAGE_COMBO bit along with _PAGE_HASHPTE to indicate whether a
    pte is backed by 4K hash pte. If we find _PAGE_COMBO not set on the pte,
    that implies that we could possibly have older 64K hash pte entries in
    the hash page table and we need to invalidate those entries.

    Handle this correctly for 16M pages.

    Signed-off-by: Aneesh Kumar K.V
    Signed-off-by: Benjamin Herrenschmidt
    Signed-off-by: Jiri Slaby

    Aneesh Kumar K.V
     
  • commit fa1f8ae80f8bb996594167ff4750a0b0a5a5bb5d upstream.

    The segment identifier and segment size will remain the same in
    the loop, so we can compute them outside it. We also change the
    hugepage_invalidate interface so that we can use it in a later patch.

    Signed-off-by: Aneesh Kumar K.V
    Signed-off-by: Benjamin Herrenschmidt
    Signed-off-by: Jiri Slaby

    Aneesh Kumar K.V
     
  • commit b0aa44a3dfae3d8f45bd1264349aa87f87b7774f upstream.

    With hugepages, we store the hpte valid information in the pte page
    whose address is stored in the second half of the PMD. Use a
    write barrier to make sure that clearing the pmd busy bit and updating
    the hpte valid info are ordered properly.
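
    One plausible shape of the ordering (a hedged sketch; the slot-array
    encoding and variable names are illustrative, not the verbatim diff):

    /* hpte_slot_array, index, hidx and pmdp come from the caller. */
    hpte_slot_array[index] = (hidx << 1) | 0x1;  /* record hpte valid+slot */

    smp_wmb();   /* slot info must be visible before the pmd is unlocked */

    *pmdp = __pmd(pmd_val(*pmdp) & ~_PAGE_BUSY); /* clear the pmd busy bit */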

    Signed-off-by: Aneesh Kumar K.V
    Signed-off-by: Benjamin Herrenschmidt
    Signed-off-by: Jiri Slaby

    Aneesh Kumar K.V
     
  • commit 5efbabe09d986f25c02d19954660238fcd7f008a upstream.

    Function remove_ddw() could be called in of_reconfig_notifier, and
    we potentially remove the dynamic DMA window property, which invokes
    of_reconfig_notifier again. Eventually, it leads to a deadlock, as
    the following backtrace shows.

    The patch fixes the issue by deferring the release of the dynamic
    DMA window property until the device node itself is released.

    =============================================
    [ INFO: possible recursive locking detected ]
    3.16.0+ #428 Tainted: G W
    ---------------------------------------------
    drmgr/2273 is trying to acquire lock:
    ((of_reconfig_chain).rwsem){.+.+..}, at: [] \
    .__blocking_notifier_call_chain+0x40/0x78

    but task is already holding lock:
    ((of_reconfig_chain).rwsem){.+.+..}, at: [] \
    .__blocking_notifier_call_chain+0x40/0x78

    other info that might help us debug this:
    Possible unsafe locking scenario:

    CPU0
    ----
    lock((of_reconfig_chain).rwsem);
    lock((of_reconfig_chain).rwsem);
    *** DEADLOCK ***

    May be due to missing lock nesting notation

    2 locks held by drmgr/2273:
    #0: (sb_writers#4){.+.+.+}, at: [] \
    .vfs_write+0xb0/0x1f8
    #1: ((of_reconfig_chain).rwsem){.+.+..}, at: [] \
    .__blocking_notifier_call_chain+0x40/0x78

    stack backtrace:
    CPU: 17 PID: 2273 Comm: drmgr Tainted: G W 3.16.0+ #428
    Call Trace:
    [c0000000137e7000] [c000000000013d9c] .show_stack+0x88/0x148 (unreliable)
    [c0000000137e70b0] [c00000000083cd34] .dump_stack+0x7c/0x9c
    [c0000000137e7130] [c0000000000b8afc] .__lock_acquire+0x128c/0x1c68
    [c0000000137e7280] [c0000000000b9a4c] .lock_acquire+0xe8/0x104
    [c0000000137e7350] [c00000000083588c] .down_read+0x4c/0x90
    [c0000000137e73e0] [c000000000091890] .__blocking_notifier_call_chain+0x40/0x78
    [c0000000137e7490] [c000000000091900] .blocking_notifier_call_chain+0x38/0x48
    [c0000000137e7520] [c000000000682a28] .of_reconfig_notify+0x34/0x5c
    [c0000000137e75b0] [c000000000682a9c] .of_property_notify+0x4c/0x54
    [c0000000137e7650] [c000000000682bf0] .of_remove_property+0x30/0xd4
    [c0000000137e76f0] [c000000000052a44] .remove_ddw+0x144/0x168
    [c0000000137e7790] [c000000000053204] .iommu_reconfig_notifier+0x30/0xe0
    [c0000000137e7820] [c00000000009137c] .notifier_call_chain+0x6c/0xb4
    [c0000000137e78c0] [c0000000000918ac] .__blocking_notifier_call_chain+0x5c/0x78
    [c0000000137e7970] [c000000000091900] .blocking_notifier_call_chain+0x38/0x48
    [c0000000137e7a00] [c000000000682a28] .of_reconfig_notify+0x34/0x5c
    [c0000000137e7a90] [c000000000682e14] .of_detach_node+0x44/0x1fc
    [c0000000137e7b40] [c0000000000518e4] .ofdt_write+0x3ac/0x688
    [c0000000137e7c20] [c000000000238430] .proc_reg_write+0xb8/0xd4
    [c0000000137e7cd0] [c0000000001cbeac] .vfs_write+0xec/0x1f8
    [c0000000137e7d70] [c0000000001cc3b0] .SyS_write+0x58/0xa0
    [c0000000137e7e30] [c00000000000a064] syscall_exit+0x0/0x98

    Signed-off-by: Gavin Shan
    Signed-off-by: Benjamin Herrenschmidt
    Signed-off-by: Jiri Slaby

    Gavin Shan
     
  • commit f1b3929c232784580e5d8ee324b6bc634e709575 upstream.

    While running the command "drmgr -c phb -r -s 'PHB 528'", the following
    backtrace jumped out because the target device node isn't marked
    with OF_DETACHED by of_detach_node(). This is caused by an error
    returned from the memory hotplug related reconfig notifier when
    CONFIG_MEMORY_HOTREMOVE is disabled. The patch fixes it.

    ERROR: Bad of_node_put() on /pci@800000020000210/ethernet@0
    CPU: 14 PID: 2252 Comm: drmgr Tainted: G W 3.16.0+ #427
    Call Trace:
    [c000000012a776a0] [c000000000013d9c] .show_stack+0x88/0x148 (unreliable)
    [c000000012a77750] [c00000000083cd34] .dump_stack+0x7c/0x9c
    [c000000012a777d0] [c0000000006807c4] .of_node_release+0x58/0xe0
    [c000000012a77860] [c00000000038a7d0] .kobject_release+0x174/0x1b8
    [c000000012a77900] [c00000000038a884] .kobject_put+0x70/0x78
    [c000000012a77980] [c000000000681680] .of_node_put+0x28/0x34
    [c000000012a77a00] [c000000000681ea8] .__of_get_next_child+0x64/0x70
    [c000000012a77a90] [c000000000682138] .of_find_node_by_path+0x1b8/0x20c
    [c000000012a77b40] [c000000000051840] .ofdt_write+0x308/0x688
    [c000000012a77c20] [c000000000238430] .proc_reg_write+0xb8/0xd4
    [c000000012a77cd0] [c0000000001cbeac] .vfs_write+0xec/0x1f8
    [c000000012a77d70] [c0000000001cc3b0] .SyS_write+0x58/0xa0
    [c000000012a77e30] [c00000000000a064] syscall_exit+0x0/0x98

    Signed-off-by: Gavin Shan
    Signed-off-by: Benjamin Herrenschmidt
    Signed-off-by: Jiri Slaby

    Gavin Shan
     
  • commit 85c1fafd7262e68ad821ee1808686b1392b1167d upstream.

    On ppc64 we support 4K hash ptes with a 64K page size. That requires
    us to track the hash pte slot information on a per-4K basis. We do that
    by storing the slot details in the second half of the pte page. The pte
    bit _PAGE_COMBO is used to indicate whether the second half needs to be
    looked at while building the real_pte. We need to use a read memory
    barrier while doing that so that the load of hidx is not reordered
    w.r.t. the _PAGE_COMBO check. On the store side we already do an lwsync
    in __hash_page_4K.
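
    The fix amounts to a read barrier in the _PAGE_COMBO path; a sketch
    close to (but not guaranteed to match) the upstream __real_pte():

    static inline real_pte_t __real_pte(pte_t pte, pte_t *ptep)
    {
            real_pte_t rpte;

            rpte.pte = pte;
            rpte.hidx = 0;
            if (pte_val(pte) & _PAGE_COMBO) {
                    /*
                     * Load hidx only after seeing _PAGE_COMBO set; pairs
                     * with the lwsync on the store side in __hash_page_4K.
                     */
                    smp_rmb();
                    rpte.hidx = pte_val(*(ptep + PTRS_PER_PTE));
            }
            return rpte;
    }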

    Signed-off-by: Aneesh Kumar K.V
    Signed-off-by: Benjamin Herrenschmidt
    Signed-off-by: Jiri Slaby

    Aneesh Kumar K.V
     
  • commit b00fc6ec1f24f9d7af9b8988b6a198186eb3408c upstream.

    Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=81631
    Reported-by: David Binderman
    Signed-off-by: Andrey Utkin
    Signed-off-by: Benjamin Herrenschmidt
    Signed-off-by: Jiri Slaby

    Andrey Utkin
     
  • commit 36e7fdaa1a04fcf65b864232e1af56a51c7814d6 upstream.

    commit 4badad352a6bb202ec68afa7a574c0bb961e5ebc ("locking/mutex: Disable
    optimistic spinning on some architectures") fenced spinning for
    architectures without a proper cmpxchg.
    There is no need to disable mutex spinning on s390, though:
    the instructions CS, CSG and friends provide the proper guarantees.
    (We don't implement cmpxchg with locks.)

    Signed-off-by: Christian Borntraeger
    Cc: Ingo Molnar
    Cc: Peter Zijlstra
    Signed-off-by: Heiko Carstens
    Signed-off-by: Martin Schwidefsky
    Signed-off-by: Jiri Slaby

    Christian Borntraeger
     

12 Sep, 2014

2 commits

  • commit 39424e89d64661faa0a2e00c5ad1e6dbeebfa972 upstream.

    Further discussion here: http://marc.info/?l=linux-kernel&m=139073901101034&w=2

    kbuild, the 0day kernel build service, outputs the warning:

    arch/x86/kernel/irq.c:333:1: warning: the frame size of 2056 bytes
    is larger than 2048 bytes [-Wframe-larger-than=]

    because check_irq_vectors_for_cpu_disable() allocates two cpumasks on
    the stack. Fix this by moving the two cpumasks to file scope.
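
    A sketch of the change (with NR_CPUS=8192 a cpumask_t is 1KB, so two
    on-stack masks alone nearly fill the 2048-byte frame budget; CPU
    hotplug is serialized, so file-scope copies are safe):

    /* Was: cpumask_t affinity_new, online_new; on the stack inside
     * check_irq_vectors_for_cpu_disable(). */
    static cpumask_t affinity_new, online_new;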

    Reported-by: Fengguang Wu
    Tested-by: David Rientjes
    Signed-off-by: Prarit Bhargava
    Link: http://lkml.kernel.org/r/1390915331-27375-1-git-send-email-prarit@redhat.com
    Cc: Andi Kleen
    Cc: Michel Lespinasse
    Cc: Seiji Aguchi
    Cc: Yang Zhang
    Cc: Paul Gortmaker
    Cc: Janet Morgan
    Cc: Tony Luck
    Cc: Ruiv Wang
    Cc: Gong Chen
    Cc: Yinghai Lu
    Signed-off-by: H. Peter Anvin
    Signed-off-by: Jiri Slaby

    Prarit Bhargava
     
  • commit da6139e49c7cb0f4251265cb5243b8d220adb48d upstream.

    Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=64791

    When a cpu is downed on a system, the irqs on the cpu are assigned to
    other cpus. It is possible, however, that when a cpu is downed there
    aren't enough free vectors on the remaining cpus to account for the
    vectors from the cpu that is being downed.

    This results in an interesting "overflow" condition where irqs are
    "assigned" to a CPU but are not handled.

    For example, when downing cpus on a 1-64 logical processor system:

    [ 232.021745] smpboot: CPU 61 is now offline
    [ 238.480275] smpboot: CPU 62 is now offline
    [ 245.991080] ------------[ cut here ]------------
    [ 245.996270] WARNING: CPU: 0 PID: 0 at net/sched/sch_generic.c:264 dev_watchdog+0x246/0x250()
    [ 246.005688] NETDEV WATCHDOG: p786p1 (ixgbe): transmit queue 0 timed out
    [ 246.013070] Modules linked in: lockd sunrpc iTCO_wdt iTCO_vendor_support sb_edac ixgbe microcode e1000e pcspkr joydev edac_core lpc_ich ioatdma ptp mdio mfd_core i2c_i801 dca pps_core i2c_core wmi acpi_cpufreq isci libsas scsi_transport_sas
    [ 246.037633] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 3.12.0+ #14
    [ 246.044451] Hardware name: Intel Corporation S4600LH ........../SVRBD-ROW_T, BIOS SE5C600.86B.01.08.0003.022620131521 02/26/2013
    [ 246.057371] 0000000000000009 ffff88081fa03d40 ffffffff8164fbf6 ffff88081fa0ee48
    [ 246.065728] ffff88081fa03d90 ffff88081fa03d80 ffffffff81054ecc ffff88081fa13040
    [ 246.074073] 0000000000000000 ffff88200cce0000 0000000000000040 0000000000000000
    [ 246.082430] Call Trace:
    [ 246.085174] [] dump_stack+0x46/0x58
    [ 246.091633] [] warn_slowpath_common+0x8c/0xc0
    [ 246.098352] [] warn_slowpath_fmt+0x46/0x50
    [ 246.104786] [] dev_watchdog+0x246/0x250
    [ 246.110923] [] ? dev_deactivate_queue.constprop.31+0x80/0x80
    [ 246.119097] [] call_timer_fn+0x3a/0x110
    [ 246.125224] [] ? update_process_times+0x6f/0x80
    [ 246.132137] [] ? dev_deactivate_queue.constprop.31+0x80/0x80
    [ 246.140308] [] run_timer_softirq+0x1f0/0x2a0
    [ 246.146933] [] __do_softirq+0xe0/0x220
    [ 246.152976] [] call_softirq+0x1c/0x30
    [ 246.158920] [] do_softirq+0x55/0x90
    [ 246.164670] [] irq_exit+0xa5/0xb0
    [ 246.170227] [] smp_apic_timer_interrupt+0x4a/0x60
    [ 246.177324] [] apic_timer_interrupt+0x6a/0x70
    [ 246.184041] [] ? cpuidle_enter_state+0x5b/0xe0
    [ 246.191559] [] ? cpuidle_enter_state+0x57/0xe0
    [ 246.198374] [] cpuidle_idle_call+0xbd/0x200
    [ 246.204900] [] arch_cpu_idle+0xe/0x30
    [ 246.210846] [] cpu_startup_entry+0xd0/0x250
    [ 246.217371] [] rest_init+0x77/0x80
    [ 246.223028] [] start_kernel+0x3ee/0x3fb
    [ 246.229165] [] ? repair_env_string+0x5e/0x5e
    [ 246.235787] [] x86_64_start_reservations+0x2a/0x2c
    [ 246.242990] [] x86_64_start_kernel+0xf8/0xfc
    [ 246.249610] ---[ end trace fb74fdef54d79039 ]---
    [ 246.254807] ixgbe 0000:c2:00.0 p786p1: initiating reset due to tx timeout
    [ 246.262489] ixgbe 0000:c2:00.0 p786p1: Reset adapter
    Last login: Mon Nov 11 08:35:14 from 10.18.17.119
    [root@(none) ~]# [ 246.792676] ixgbe 0000:c2:00.0 p786p1: detected SFP+: 5
    [ 249.231598] ixgbe 0000:c2:00.0 p786p1: NIC Link is Up 10 Gbps, Flow Control: RX/TX
    [ 246.792676] ixgbe 0000:c2:00.0 p786p1: detected SFP+: 5
    [ 249.231598] ixgbe 0000:c2:00.0 p786p1: NIC Link is Up 10 Gbps, Flow Control: RX/TX

    (last lines keep repeating. ixgbe driver is dead until module reload.)

    If the downed cpu has more vectors than are free on the remaining cpus on the
    system, it is possible that some vectors are "orphaned" even though they are
    assigned to a cpu. In this case, since the ixgbe driver had a watchdog, the
    watchdog fired and notified that something was wrong.

    This patch adds a function, check_vectors(), which compares the number
    of vectors on the CPU going down with the number of vectors available on
    the rest of the system. If there aren't enough vectors for the CPU to go
    down, an error is returned and propagated back to userspace.
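
    A simplified sketch of the counting logic (illustrative; the real
    function also skips percpu irqs and accounts for irq affinity, per the
    version notes below):

    int check_vectors(unsigned int this_cpu)
    {
            unsigned int used = 0, free = 0;
            int cpu, vector;

            /* Vectors that must be migrated off the dying CPU. */
            for (vector = FIRST_EXTERNAL_VECTOR; vector < NR_VECTORS; vector++)
                    if (per_cpu(vector_irq, this_cpu)[vector] >= 0)
                            used++;

            /* Free vector slots on the CPUs that remain online. */
            for_each_online_cpu(cpu) {
                    if (cpu == this_cpu)
                            continue;
                    for (vector = FIRST_EXTERNAL_VECTOR; vector < NR_VECTORS; vector++)
                            if (per_cpu(vector_irq, cpu)[vector] < 0)
                                    free++;
            }

            return free >= used ? 0 : -ENOSPC;
    }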

    v2: Do not need to look at percpu irqs
    v3: Need to check affinity to prevent counting of MSIs in IOAPIC Lowest
    Priority Mode
    v4: Additional changes suggested by Gong Chen.
    v5/v6/v7/v8: Updated comment text

    Signed-off-by: Prarit Bhargava
    Link: http://lkml.kernel.org/r/1389613861-3853-1-git-send-email-prarit@redhat.com
    Reviewed-by: Gong Chen
    Cc: Andi Kleen
    Cc: Michel Lespinasse
    Cc: Seiji Aguchi
    Cc: Yang Zhang
    Cc: Paul Gortmaker
    Cc: Janet Morgan
    Cc: Tony Luck
    Cc: Ruiv Wang
    Cc: Gong Chen
    Signed-off-by: H. Peter Anvin
    Cc:
    Signed-off-by: Jiri Slaby

    Prarit Bhargava
     

09 Sep, 2014

1 commit

  • commit 10f67dbf6add97751050f294d4c8e0cc1e5c2c23 upstream.

    The mainline signal handling code for OpenRISC has been buggy since day
    one with respect to syscall restart. This patch significantly reworks
    the signal handling code:

    i) Move the "work pending" loop to C code (borrowed from ARM arch)

    ii) Allow a tracer to muck about with the IP and skip syscall restart
    in that case (again, borrowed from ARM)

    iii) Make signal handling WRT syscall restart actually work

    iv) Make the signal handling code look more like that of other
    architectures so that it's easier for others to follow

    Reported-by: Anders Nystrom
    Signed-off-by: Jonas Bonn
    Signed-off-by: Jiri Slaby

    Jonas Bonn
     

04 Sep, 2014

7 commits

  • commit 8d5999df35314607c38fbd6bdd709e25c3a4eeab upstream.

    If the timer irqs are resumed during device resume it is possible in
    certain circumstances for the resume to hang early on, before device
    interrupts are resumed. For an Ubuntu 14.04 PVHVM guest this would
    occur in ~0.5% of resume attempts.

    It is not entirely clear what is occurring at the point of the hang, but
    I think a task necessary for the resume calls schedule_timeout(),
    waiting for a timer interrupt (which never arrives). This failure may
    require specific tasks to be running on the other VCPUs to trigger
    (processes are not frozen during a suspend/resume if PREEMPT is
    disabled).

    Add IRQF_EARLY_RESUME to the timer interrupts so they are resumed in
    syscore_resume().
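
    A sketch of the flag change when binding the per-VCPU timer interrupt
    (the exact set of companion flags is illustrative, not the verbatim
    patch):

    irq = bind_virq_to_irqhandler(VIRQ_TIMER, cpu, xen_timer_interrupt,
                                  IRQF_PERCPU | IRQF_TIMER |
                                  IRQF_FORCE_RESUME | IRQF_EARLY_RESUME,
                                  "timer", NULL);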

    Signed-off-by: David Vrabel
    Reviewed-by: Boris Ostrovsky
    Signed-off-by: Jiri Slaby

    David Vrabel
     
  • commit 7b2a583afb4ab894f78bc0f8bd136e96b6499a7e upstream.

    Without CONFIG_RELOCATABLE the early boot code will decompress the
    kernel to LOAD_PHYSICAL_ADDR. While this may have been fine in the BIOS
    days, that isn't going to fly with UEFI since parts of the firmware
    code/data may be located at LOAD_PHYSICAL_ADDR.

    Straying outside of the bounds of the regions we've explicitly requested
    from the firmware will cause all sorts of trouble. Bruno reports that
    his machine resets while trying to decompress the kernel image.

    We already go to great pains to ensure the kernel is loaded into a
    suitably aligned buffer; it's just that the address isn't necessarily
    LOAD_PHYSICAL_ADDR, because we can't guarantee that address isn't in use
    by the firmware.

    Explicitly enforce CONFIG_RELOCATABLE for the EFI boot stub, so that we
    can load the kernel at any address with the correct alignment.

    Reported-by: Bruno Prémont
    Tested-by: Bruno Prémont
    Cc: H. Peter Anvin
    Signed-off-by: Matt Fleming
    Signed-off-by: Jiri Slaby

    Matt Fleming
     
  • commit 53b884ac3745353de220d92ef792515c3ae692f0 upstream.

    This commit in Linux 3.6:

    commit c767a54ba0657e52e6edaa97cbe0b0a8bf1c1655
    Author: Joe Perches
    Date: Mon May 21 19:50:07 2012 -0700

    x86/debug: Add KERN_ to bare printks, convert printks to pr_

    caused warn_bad_vsyscall to output garbage in the middle of the
    line. Revert the bad part of it.

    The printk in question isn't actually bare; the level is "%s".

    The bug this fixes is purely cosmetic; backports are optional.

    Signed-off-by: Andy Lutomirski
    Link: http://lkml.kernel.org/r/03eac1f24110bbe496ecc12a4df467e0d88466d4.1406330947.git.luto@amacapital.net
    Signed-off-by: H. Peter Anvin
    Signed-off-by: Jiri Slaby

    Andy Lutomirski
     
  • commit cbace46a9710a480cae51e4611697df5de41713e upstream.

    Commit 30919b0bf356 ("x86: avoid low BIOS area when allocating address
    space") moved the test for resource allocations that fall within the first
    1MB of address space from the PCI-specific path to a generic path, such
    that all resource allocations will avoid this area. However, this breaks
    ISA cards which need to allocate a memory region within the first 1MB. An
    example is the i82365 PCMCIA controller and derivatives like the Ricoh
    RF5C296/396 which map part of the PCMCIA socket memory address space into
    the first 1MB of system memory address space. They do not work anymore as
    no usable memory region exists due to this change:

    Intel ISA PCIC probe: Ricoh RF5C296/396 ISA-to-PCMCIA at port 0x3e0 ofs 0x00, 2 sockets
    host opts [0]: none
    host opts [1]: none
    ISA irqs (scanned) = 3,4,5,9,10 status change on irq 10
    pcmcia_socket pcmcia_socket1: pccard: PCMCIA card inserted into slot 1
    pcmcia_socket pcmcia_socket0: cs: IO port probe 0xc00-0xcff: excluding 0xcf8-0xcff
    pcmcia_socket pcmcia_socket0: cs: IO port probe 0xa00-0xaff: clean.
    pcmcia_socket pcmcia_socket0: cs: IO port probe 0x100-0x3ff: excluding 0x170-0x177 0x1f0-0x1f7 0x2f8-0x2ff 0x370-0x37f 0x3c0-0x3e7 0x3f0-0x3ff
    pcmcia_socket pcmcia_socket0: cs: memory probe 0x0a0000-0x0affff: excluding 0xa0000-0xaffff
    pcmcia_socket pcmcia_socket0: cs: memory probe 0x0b0000-0x0bffff: excluding 0xb0000-0xbffff
    pcmcia_socket pcmcia_socket0: cs: memory probe 0x0c0000-0x0cffff: excluding 0xc0000-0xcbfff
    pcmcia_socket pcmcia_socket0: cs: memory probe 0x0d0000-0x0dffff: clean.
    pcmcia_socket pcmcia_socket0: cs: memory probe 0x0e0000-0x0effff: clean.
    pcmcia_socket pcmcia_socket0: cs: memory probe 0x60000000-0x60ffffff: clean.
    pcmcia_socket pcmcia_socket0: cs: memory probe 0xa0000000-0xa0ffffff: clean.
    pcmcia_socket pcmcia_socket1: cs: IO port probe 0xc00-0xcff: excluding 0xcf8-0xcff
    pcmcia_socket pcmcia_socket1: cs: IO port probe 0xa00-0xaff: clean.
    pcmcia_socket pcmcia_socket1: cs: IO port probe 0x100-0x3ff: excluding 0x170-0x177 0x1f0-0x1f7 0x2f8-0x2ff 0x370-0x37f 0x3c0-0x3e7 0x3f0-0x3ff
    pcmcia_socket pcmcia_socket1: cs: memory probe 0x0a0000-0x0affff: excluding 0xa0000-0xaffff
    pcmcia_socket pcmcia_socket1: cs: memory probe 0x0b0000-0x0bffff: excluding 0xb0000-0xbffff
    pcmcia_socket pcmcia_socket1: cs: memory probe 0x0c0000-0x0cffff: excluding 0xc0000-0xcbfff
    pcmcia_socket pcmcia_socket1: cs: memory probe 0x0d0000-0x0dffff: clean.
    pcmcia_socket pcmcia_socket1: cs: memory probe 0x0e0000-0x0effff: clean.
    pcmcia_socket pcmcia_socket1: cs: memory probe 0x60000000-0x60ffffff: clean.
    pcmcia_socket pcmcia_socket1: cs: memory probe 0xa0000000-0xa0ffffff: clean.
    pcmcia_socket pcmcia_socket1: cs: memory probe 0x0cc000-0x0effff: excluding 0xe0000-0xeffff
    pcmcia_socket pcmcia_socket1: cs: unable to map card memory!

    If filtering out the first 1MB is reverted, everything works as expected.

    Tested-by: Robert Resch
    Signed-off-by: Christoph Schulz
    Signed-off-by: Bjorn Helgaas
    Signed-off-by: Jiri Slaby

    Christoph Schulz
     
  • commit 0d234daf7e0a3290a3a20c8087eefbd6335a5bd4 upstream.

    This reverts commit 682367c494869008eb89ef733f196e99415ae862,
    which causes 32-bit SMP Windows 7 guests to panic.

    SeaBIOS has a limit on the number of MTRRs that it can handle,
    and this patch exceeded the limit. Better revert it.
    Thanks to Nadav Amit for debugging the cause.

    Reported-by: Wanpeng Li
    Signed-off-by: Paolo Bonzini
    Signed-off-by: Jiri Slaby

    Paolo Bonzini
     
  • commit 9e8919ae793f4edfaa29694a70f71a515ae9942a upstream.

    Return an unhandleable error on an inter-privilege level ret
    instruction. This is because the current emulation does not check the
    privilege level correctly when loading the CS, and does not pop RSP/SS
    as needed.

    Signed-off-by: Nadav Amit
    Signed-off-by: Paolo Bonzini
    Signed-off-by: Jiri Slaby

    Nadav Amit
     
  • commit 9b5f7428f8b16bd8980213f2b70baf1dd0b9e36c upstream.

    According to the comment "restore_es3: applies to 34xx >= ES3.0" in
    "arch/arm/mach-omap2/sleep34xx.S", omap3_restore_es3 should be used
    if the revision of an OMAP34xx is ES3.1.2.

    Signed-off-by: Jeremy Vial
    Signed-off-by: Tony Lindgren
    Signed-off-by: Jiri Slaby

    Jeremy Vial
     

26 Aug, 2014

7 commits

  • [ Upstream commit 093758e3daede29cb4ce6aedb111becf9d4bfc57 ]

    This commit is guesswork, but it seems to make sense to drop this
    break, as otherwise the following line is never executed and becomes
    dead code. And that following line actually saves the result of the
    local calculation through the pointer given as a function argument. So
    the proposed change makes sense if this code as a whole makes sense (but
    I am unable to analyze it as a whole).

    Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=81641
    Reported-by: David Binderman
    Signed-off-by: Andrey Utkin
    Signed-off-by: David S. Miller
    Signed-off-by: Jiri Slaby

    Andrey Utkin
     
  • [ Upstream commit 4ec1b01029b4facb651b8ef70bc20a4be4cebc63 ]

    The LDC handshake could have been asynchronously triggered
    after ldc_bind() enables the ldc_rx() receive interrupt-handler
    (and thus intercepts incoming control packets)
    and before vio_port_up() calls ldc_connect(). If that is the case,
    ldc_connect() should return 0 and let the state-machine
    progress.

    Signed-off-by: Sowmini Varadhan
    Acked-by: Karl Volz
    Signed-off-by: David S. Miller
    Signed-off-by: Jiri Slaby

    Sowmini Varadhan
     
  • [ Upstream commit 4ca9a23765da3260058db3431faf5b4efd8cf926 ]

    Based almost entirely upon a patch by Christopher Alexander Tobias
    Schulze.

    In commit db64fe02258f1507e13fe5212a989922323685ce ("mm: rewrite vmap
    layer") lazy VMAP tlb flushing was added to the vmalloc layer. This
    causes problems on sparc64.

    Sparc64 has two VMAP mapped regions and they are not contiguous with
    each other. First we have the malloc mapping area, then another
    unrelated region, then the vmalloc region.

    This "another unrelated region" is where the firmware is mapped.

    If the lazy TLB flushing logic in the vmalloc code triggers after
    we've had both a module unload and a vfree or similar, it will pass an
    address range that goes from somewhere inside the malloc region to
    somewhere inside the vmalloc region, thus covering the
    openfirmware area entirely.

    The sparc64 kernel learns about openfirmware's dynamic mappings in
    this region early in the boot, and then services TLB misses in this
    area. But openfirmware has some locked TLB entries which are not
    mentioned in those dynamic mappings and we should thus not disturb
    them.

    These huge lazy TLB flush ranges cause those openfirmware locked TLB
    entries to be removed, resulting in all kinds of problems including
    hard hangs and crashes during reboot/reset.

    Besides causing problems like this, such huge TLB flush ranges are
    also incredibly inefficient. A plea has been made with the author of
    the VMAP lazy TLB flushing code, but for now we'll put a safety guard
    into our flush_tlb_kernel_range() implementation.

    Since the implementation has become non-trivial, stop defining it as a
    macro and instead make it a function in a C source file.
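
    A sketch of the guard (the OBP constants and the low-level flush helper
    exist in the sparc64 tree, but treat this as an illustration rather
    than the verbatim patch):

    void flush_tlb_kernel_range(unsigned long start, unsigned long end)
    {
            /* If the range overlaps the firmware (OBP) area, flush the
             * pieces on either side and leave OBP's locked entries alone. */
            if (start < HI_OBP_ADDRESS && end > LOW_OBP_ADDRESS) {
                    if (start < LOW_OBP_ADDRESS)
                            __flush_tlb_kernel_range(start, LOW_OBP_ADDRESS);
                    if (end > HI_OBP_ADDRESS)
                            __flush_tlb_kernel_range(HI_OBP_ADDRESS, end);
            } else {
                    __flush_tlb_kernel_range(start, end);
            }
    }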

    Signed-off-by: David S. Miller
    Signed-off-by: Jiri Slaby

    David S. Miller
     
  • [ Upstream commit 18f38132528c3e603c66ea464727b29e9bbcb91b ]

    The assumption was that update_mmu_cache() (and the equivalent for PMDs) would
    only be called when the PTE being installed will be accessible by the user.

    This is not true for code paths originating from remove_migration_pte().

    There are dire consequences for placing a non-valid PTE into the TSB.
    The TLB miss framework assumes that when a TSB entry matches we can just
    load it into the TLB and return from the TLB miss trap.

    So if a non-valid PTE is in there, we will deadlock taking the TLB miss over
    and over, never satisfying the miss.

    Just exit early from update_mmu_cache() and friends in this situation.
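
    A sketch of the early exit (the upstream fix is along these lines):

    void update_mmu_cache(struct vm_area_struct *vma,
                          unsigned long address, pte_t *ptep)
    {
            struct mm_struct *mm = vma->vm_mm;
            pte_t pte = *ptep;

            if (!pte_accessible(mm, pte))
                    return;   /* never insert a non-valid PTE into the TSB */

            /* ... existing TSB insertion ... */
    }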

    Based upon a report and patch from Christopher Alexander Tobias Schulze.

    Signed-off-by: David S. Miller
    Signed-off-by: Jiri Slaby

    David S. Miller
     
  • [ Upstream commit 5aa4ecfd0ddb1e6dcd1c886e6c49677550f581aa ]

    This is to prevent previous stores from overlapping the block stores
    done by the memcpy loop.

    Based upon a glibc patch by Jose E. Marchesi

    Signed-off-by: David S. Miller
    Signed-off-by: Jiri Slaby

    David S. Miller
     
  • [ Upstream commit b18eb2d779240631a098626cb6841ee2dd34fda0 ]

    Access to the TSB hash tables during TLB misses requires that there be
    an atomic 128-bit quad load available so that we fetch a matching TAG
    and DATA field at the same time.

    On cpus prior to UltraSPARC-III only virtual address based quad loads
    are available. UltraSPARC-III and later provide physical address
    based variants which are easier to use.

    When we only have virtual address based quad loads available this
    means that we have to lock the TSB into the TLB at a fixed virtual
    address on each cpu when it runs that process. We can't just access
    the PAGE_OFFSET based aliased mapping of these TSBs because we cannot
    take a recursive TLB miss inside of the TLB miss handler without
    risking running out of hardware trap levels (some trap combinations
    can be deep, such as those generated by register window spill and fill
    traps).

    Without huge pages it's working perfectly fine, but when the huge TSB
    got added, another chunk of fixed virtual address space was not
    allocated for this second TSB mapping.

    So we were mapping both the 8K and 4MB TSBs to the same exact virtual
    address, causing multiple TLB matches which gives undefined behavior.

    Signed-off-by: David S. Miller
    Signed-off-by: Jiri Slaby

    David S. Miller
     
  • [ Upstream commit e5c460f46ae7ee94831cb55cb980f942aa9e5a85 ]

    This was found using Dave Jones' trinity tool.

    When a user process which is 32-bit performs a load or a store, the
    cpu chops off the top 32-bits of the effective address before
    translating it.

    This is because we run 32-bit tasks with the PSTATE_AM (address
    masking) bit set.

    We can't run the kernel with that bit set, so when the kernel accesses
    userspace no address masking occurs.

    Since a 32-bit process will have no mappings in that region we will
    properly fault, so we don't try to handle this using access_ok(),
    which can safely just be a NOP on sparc64.

    Real faults from 32-bit processes should never generate such addresses
    so a bug check was added long ago, and it barks in the logs if this
    happens.

    But it also barks when a kernel user access causes this condition, and
    that _can_ happen. For example, if a pointer passed into a system call
    is "0xfffffffc" and the kernel accesses 4 bytes offset from that
    pointer.

    Just handle such faults normally via the exception entries.

    Signed-off-by: David S. Miller
    Signed-off-by: Jiri Slaby

    David S. Miller