Eric Lee / smarc-fsl-linux-kernel

19 Dec, 2011

1 commit

a66086b81 powerpc: POWER7 optimised copy_to_user/copy_from_user using VMX

Implement a POWER7 optimised copy_to_user/copy_from_user using VMX.
For large aligned copies this new loop is over 10% faster, and for
large unaligned copies it is over 200% faster.

If we take a fault we fall back to the old version; this keeps
things relatively simple and easy to verify.

On POWER7 unaligned stores rarely slow down - they only flush when
a store crosses a 4KB page boundary. Furthermore this flush is
handled completely in hardware and should be 20-30 cycles.

Unaligned loads on the other hand flush much more often - whenever
crossing a 128 byte cache line, or a 32 byte sector if either sector
is an L1 miss.

Considering this information we really want to get the loads aligned
and not worry about the alignment of the stores. Microbenchmarks
confirm that this approach is much faster than the current unaligned
copy loop that uses shifts and rotates to ensure both loads and
stores are aligned.

We also want to try and do the stores in cacheline aligned, cacheline
sized chunks. If the store queue is unable to merge an entire
cacheline of stores then the L2 cache will have to do a
read/modify/write. Even worse, we will serialise this with the stores
in the next iteration of the copy loop since both iterations hit
the same cacheline.

Based on this, the new loop does the following things:

1 - 127 bytes
Get the source 8 byte aligned and use 8 byte loads and stores. Pretty
boring and similar to how the current loop works.

128 - 4095 bytes
Get the source 8 byte aligned and use 8 byte loads and stores,
1 cacheline at a time. We aren't doing the stores in cacheline
aligned chunks so we will potentially serialise once per cacheline.
Even so it is much better than the loop we have today.

4096 bytes and above
If both source and destination have the same alignment get them both
16 byte aligned, then get the destination cacheline aligned. Do
cacheline sized loads and stores using VMX.

If source and destination do not have the same alignment, we get the
destination cacheline aligned, and use permute to do aligned loads.

In both cases the VMX loop should be optimal - we always do aligned
loads and stores and are always doing stores in cacheline aligned,
cacheline sized chunks.
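
As a rough sketch of the size-based dispatch described above (not the actual assembly), the choice of strategy could be written in C as follows. The enum, the function name choose_copy_path and the two constants are hypothetical names for illustration; only the 128 byte and 4096 byte thresholds and the 16 byte alignment test come from the text.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical labels for the strategies listed in the commit message. */
enum copy_path {
	COPY_SHORT,       /* 1 - 127 bytes: 8 byte loads and stores */
	COPY_CACHELINE,   /* 128 - 4095 bytes: 8 byte accesses, one cacheline at a time */
	COPY_VMX,         /* 4096+ bytes, same alignment: aligned VMX loads and stores */
	COPY_VMX_PERMUTE  /* 4096+ bytes, differing alignment: VMX loads plus permute */
};

#define CACHELINE_BYTES 128   /* POWER7 cache line size, from the text */
#define VMX_BREAKPOINT  4096  /* threshold chosen in the commit message */

static enum copy_path choose_copy_path(uintptr_t dst, uintptr_t src, size_t len)
{
	if (len < CACHELINE_BYTES)
		return COPY_SHORT;
	if (len < VMX_BREAKPOINT)
		return COPY_CACHELINE;
	/* Do source and destination share the same offset within 16 bytes? */
	if (((dst ^ src) & 15) == 0)
		return COPY_VMX;
	return COPY_VMX_PERMUTE;
}
```

The sketch ignores the interrupt and page fault conditions; it only shows how length and relative alignment select a loop.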

To be able to use VMX we must be careful about interrupts and
sleeping. We don't use the VMX loop when in an interrupt (which should
be rare anyway) and we wrap the VMX loop in pagefault_disable/pagefault_enable
and fall back to the existing copy_tofrom_user loop if we do need to
sleep.

The VMX breakpoint of 4096 bytes was chosen using this microbenchmark:

http://ozlabs.org/~anton/junkcode/copy_to_user.c

Since we are using VMX and there is a cost to saving and restoring
the user VMX state there are two broad cases we need to benchmark:

- Best case - userspace never uses VMX

- Worst case - userspace always uses VMX

In reality a userspace process will sit somewhere between these two
extremes. Since we need to test both aligned and unaligned copies we
end up with 4 combinations. The point at which the VMX loop begins to
win is:

0% VMX
aligned 2048 bytes
unaligned 2048 bytes

100% VMX
aligned 16384 bytes
unaligned 8192 bytes

Considering this is a microbenchmark, the data is hot in cache and
the VMX loop has better store queue merging properties we set the
breakpoint to 4096 bytes, a little below the unaligned breakpoints.

Some future optimisations we can look at:

- Looking at the perf data, a significant part of the cost when a
task is always using VMX is the extra exception we take to restore
the VMX state. As such we should do something similar to the x86
optimisation that restores FPU state for heavy users. ie:

	/*
	 * If the task has used fpu the last 5 timeslices, just do a full
	 * restore of the math state immediately to avoid the trap; the
	 * chances of needing FPU soon are obviously high now
	 */
	preload_fpu = tsk_used_math(next_p) && next_p->fpu_counter > 5;

and

	/*
	 * fpu_counter contains the number of consecutive context switches
	 * that the FPU is used. If this is over a threshold, the lazy fpu
	 * saving becomes unlazy to save the trap. This is an unsigned char
	 * so that after 256 times the counter wraps and the behavior turns
	 * lazy again; this to deal with bursty apps that only use FPU for
	 * a short time
	 */

- We could create a paca bit to mirror the VMX enabled MSR bit and check
that first, avoiding multiple calls to enable_kernel_altivec.
That should help with iovec based system calls like readv.

- We could have two VMX breakpoints, one for when we know the user VMX
state is loaded into the registers and one when it isn't. This could
be a second bit in the paca so we can calculate the break points quickly.

- One suggestion from Ben was to save and restore the VSX registers
we use inline instead of using enable_kernel_altivec.
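
The x86 heuristic quoted in the first item above can be sketched in plain C. This is a minimal illustration under assumptions: struct task_sketch and both helper names are hypothetical, and resetting the counter after a timeslice that did not use the unit is assumed rather than taken from the quoted comments.

```c
#include <stdbool.h>

/* Hypothetical stand-in for per-task state; not the kernel's task_struct. */
struct task_sketch {
	bool used_math;            /* task has used the unit at least once */
	unsigned char fpu_counter; /* consecutive timeslices that used it */
};

#define PRELOAD_THRESHOLD 5 /* value used in the quoted x86 snippet */

/* At switch out: extend or reset the streak of timeslices using the unit. */
static void account_timeslice(struct task_sketch *t, bool used_unit)
{
	if (used_unit) {
		t->used_math = true;
		t->fpu_counter++; /* unsigned char: 255 + 1 wraps to 0, back to lazy */
	} else {
		t->fpu_counter = 0;
	}
}

/* At switch in: restore eagerly for heavy users to avoid the trap. */
static bool should_preload(const struct task_sketch *t)
{
	return t->used_math && t->fpu_counter > PRELOAD_THRESHOLD;
}
```

The same shape would apply to VMX state if a per-task counter were added for it; whether 5 is the right threshold there is an open question.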

[BenH: Fixed a problem with preempt and fixed build without CONFIG_ALTIVEC]

Signed-off-by: Anton Blanchard
Signed-off-by: Benjamin Herrenschmidt

Anton Blanchard
2011-12-19 11:40:40 +0800

16 Nov, 2011

1 commit

d715e433b powerpc: Copy down exception vectors after feature fixups

kdump fails because we try to execute an HV only instruction. Feature
fixups are being applied after we copy the exception vectors down to 0
so they miss out on any updates.

We have always had this issue but it only became critical in v3.0
when we added CFAR support (breaks POWER5) and v3.1 when we added
POWERNV (breaks everyone).

Signed-off-by: Anton Blanchard
Cc: [v3.0+]
Signed-off-by: Benjamin Herrenschmidt

Anton Blanchard
2011-11-16 11:47:54 +0800

01 Nov, 2011

1 commit

4b16f8e2d powerpc: various straight conversions from module.h --> export.h

All these files were including module.h just for the basic
EXPORT_SYMBOL infrastructure. We can shift them off to the
export.h header which is a way smaller footprint and thus
realize some compile time gains.

Signed-off-by: Paul Gortmaker

Paul Gortmaker
2011-11-01 07:30:44 +0800

21 May, 2011

2 commits

82aff107f Merge branch 'merge' of git://git.kernel.org/pub/scm/linux/kernel/git/benh/powerpc

* 'merge' of git://git.kernel.org/pub/scm/linux/kernel/git/benh/powerpc: (152 commits)
powerpc: Fix hard CPU IDs detection
powerpc/pmac: Update via-pmu to new syscore_ops
powerpc/kvm: Fix the build for 32-bit Book 3S (classic) processors
powerpc/kvm: Fix kvmppc_core_pending_dec
powerpc: Remove last piece of GEMINI
powerpc: Fix for Pegasos keyboard and mouse
powerpc: Make early memory scan more resilient to out of order nodes
powerpc/pseries/iommu: Cleanup ddw naming
powerpc/pseries/iommu: Find windows after kexec during boot
powerpc/pseries/iommu: Remove ddw property when destroying window
powerpc/pseries/iommu: Add additional checks when changing iommu mask
powerpc/pseries/iommu: Use correct return type in dupe_ddw_if_already_created
powerpc: Remove unused/obsolete CONFIG_XICS
misc: Add CARMA DATA-FPGA Programmer support
misc: Add CARMA DATA-FPGA Access Driver
powerpc: Make IRQ_NOREQUEST last to clear, first to set
powerpc: Integrated Flash controller device tree bindings
powerpc/85xx: Create dts of each core in CAMP mode for P1020RDB
powerpc/85xx: Fix PCIe IDSEL for Px020RDB
powerpc/85xx: P2020 DTS: re-organize dts files
...

Linus Torvalds
2011-05-21 04:28:01 +0800
268bb0ce3 sanitize <linux/prefetch.h> usage

Commit e66eed651fd1 ("list: remove prefetching from regular list
iterators") removed the include of prefetch.h from list.h, which
uncovered several cases that had apparently relied on that rather
obscure header file dependency.

So this fixes things up a bit, using

grep -L linux/prefetch.h $(git grep -l '[^a-z_]prefetchw*(' -- '*.[ch]')
grep -L 'prefetchw*(' $(git grep -l 'linux/prefetch.h' -- '*.[ch]')

to guide us in finding files that either need <linux/prefetch.h>
inclusion, or have it despite not needing it.

There are more of them around (mostly network drivers), but this gets
many core ones.

Reported-by: Stephen Rothwell
Signed-off-by: Linus Torvalds

Linus Torvalds
2011-05-21 03:50:29 +0800

19 May, 2011

3 commits

a56555e57 powerpc: Remove alloc_maybe_bootmem for zalloc version

Replace all remaining callers of alloc_maybe_bootmem with
zalloc_maybe_bootmem. The callsite in pci_dn is followed with a
memset to clear the memory, and not zeroing at the other callsites
in the celleb fake pci code could lead to following uninitialized
memory as pointers or even freeing said pointers on error paths.

Signed-off-by: Milton Miller
Signed-off-by: Benjamin Herrenschmidt

Milton Miller
2011-05-19 13:30:57 +0800
40f1ce7fb powerpc: Remove ioremap_flags

We have a confusing number of ioremap functions. Make things just a
bit simpler by merging ioremap_flags and ioremap_prot.

Signed-off-by: Anton Blanchard
Signed-off-by: Benjamin Herrenschmidt

Anton Blanchard
2011-05-19 12:30:43 +0800
d988f0e3f powerpc: Simplify 4k/64k copy_page logic

To make it easier to add optimised versions of copy_page, remove
the 4kB loop for 64kB pages and just do all the work in copy_page.

Signed-off-by: Anton Blanchard
Signed-off-by: Benjamin Herrenschmidt

Anton Blanchard
2011-05-19 12:30:42 +0800

27 Apr, 2011

1 commit

b91e136cd powerpc: Use MSR_64BIT in sstep.c, fix kprobes on BOOK3E

We check MSR_SF a lot in sstep.c, to decide if we need to emulate the
truncation of values when running in 32-bit mode. Factor out that code
into a helper, and convert it and the other uses to use MSR_64BIT.

This fixes a bug on BOOK3E where kprobes would end up returning to a
32-bit address, because regs->nip was truncated, because (msr & MSR_SF)
was false.
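
A minimal sketch of the kind of helper described above, assuming the 64-bit mode bit differs between the two processor families. The _SKETCH constants, their bit positions and the extra mode_bit parameter are assumptions made so that both cases can be shown in one program; the kernel instead selects MSR_64BIT at build time.

```c
#include <stdint.h>

/* Illustrative bit positions; treat them as assumptions of this sketch. */
#define MSR_SF_SKETCH (1ULL << 63) /* 64-bit mode bit on server processors */
#define MSR_CM_SKETCH (1ULL << 31) /* 64-bit mode bit on BOOK3E */

/*
 * Emulate 32-bit mode truncation: when the CPU is not in 64-bit mode,
 * only the low 32 bits of an address or result are kept.
 */
static uint64_t truncate_if_32bit_sketch(uint64_t msr, uint64_t mode_bit,
					 uint64_t val)
{
	if (!(msr & mode_bit))
		val &= 0xffffffffULL;
	return val;
}
```

Testing the wrong bit on a BOOK3E CPU running in 64-bit mode makes the helper truncate a valid 64-bit address, which matches the regs->nip symptom described in the commit message.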

Signed-off-by: Michael Ellerman
Signed-off-by: Benjamin Herrenschmidt

Michael Ellerman
2011-04-27 12:18:46 +0800

21 Jan, 2011

1 commit

c0337288a powerpc: Ensure the else case of feature sections will fit

When we create an alternative feature section, the else case must be the
same size or smaller than the body. This is because when we patch the
else case in we just overwrite the body, so there must be room.

Up to now we just did this by inspection, but it's quite easy to enforce
it in the assembler, so we should.

The only change is to add the ifgt block, but that affects the alignment
of the tabs and so the whole macro is modified.

Also add a test, but #if 0 it because we don't want to break the build.
Anyone who's modifying the feature macros should enable the test.

Signed-off-by: Michael Ellerman
Signed-off-by: Benjamin Herrenschmidt

Michael Ellerman
2011-01-21 11:08:33 +0800

09 Dec, 2010

1 commit

b5f9b6665 powerpc: Hardcode popcnt instructions for old assemblers

The popcnt instructions went into binutils relatively recently. As with a
number of other instructions, create macros and hardcode them.

Signed-off-by: Anton Blanchard
Signed-off-by: Benjamin Herrenschmidt

Anton Blanchard
2010-12-09 12:35:30 +0800

29 Nov, 2010

1 commit

64ff31287 powerpc: Add support for popcnt instructions

POWER5 added popcntb, and POWER7 added popcntw and popcntd. As a first step
this patch does all the work out of line, but it would be nice to implement
them as inlines with an out of line fallback.

The performance issue with hweight was noticed when disabling SMT on a large
(192 thread) POWER7 box. The patch improves that testcase by about 8%.
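
For comparison, here is a sketch of the kind of software population count that a hardware popcnt instruction replaces. It is the well-known bit-twiddling method written in portable C, not code from this patch, and hweight64_sketch is a hypothetical name.

```c
#include <stdint.h>

/* Portable population count: number of set bits in a 64-bit word. */
static unsigned int hweight64_sketch(uint64_t w)
{
	/* Sum adjacent bits into 2-bit fields, then 4-bit fields. */
	w = w - ((w >> 1) & 0x5555555555555555ULL);
	w = (w & 0x3333333333333333ULL) + ((w >> 2) & 0x3333333333333333ULL);
	/* Each byte now holds the count of set bits in that byte. */
	w = (w + (w >> 4)) & 0x0f0f0f0f0f0f0f0fULL;
	/* Add the eight byte counts together into the top byte. */
	return (unsigned int)((w * 0x0101010101010101ULL) >> 56);
}
```

The intermediate per-byte result is what popcntb produces in one instruction, and popcntd gives the final count directly, so the hardware path removes most of these dependent operations.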

Signed-off-by: Anton Blanchard
Signed-off-by: Benjamin Herrenschmidt

Anton Blanchard
2010-11-29 12:48:17 +0800

13 Oct, 2010

1 commit

4108d9ba9 powerpc/Makefiles: Change to new flag variables

Replace EXTRA_CFLAGS with ccflags-y and EXTRA_AFLAGS with asflags-y.

Signed-off-by: matt mooney
Signed-off-by: Benjamin Herrenschmidt

matt mooney
2010-10-13 13:19:22 +0800

02 Sep, 2010

6 commits

cd64d1697 powerpc: mtmsrd not defined

Replace the BOOK3S_64 specific mtmsrd with the generic MTMSRD macro.
Only enable ldstfp when CONFIG_PPC_FPU is set.

Signed-off-by: Sean MacLennan
Signed-off-by: Benjamin Herrenschmidt

Sean MacLennan
2010-09-02 12:07:34 +0800
025c0186a powerpc: Fix incorrect .stabs entry for copy_32.S

Signed-off-by: Sean MacLennan
Signed-off-by: Benjamin Herrenschmidt

Sean MacLennan
2010-09-02 12:07:34 +0800
8154c5d22 powerpc: Abstract indexing of lppaca structs

Currently we have the lppaca structs as a simple array of NR_CPUS
entries, taking up space in the data section of the kernel image.
In future we would like to allocate them dynamically, so this
abstracts out the accesses to the array, making it easier to
change how we locate the lppaca for a given cpu in future.
Specifically, lppaca[cpu] changes to lppaca_of(cpu).

Signed-off-by: Paul Mackerras
Signed-off-by: Benjamin Herrenschmidt

Paul Mackerras
2010-09-02 12:07:31 +0800
8c7739147 powerpc: Add 64bit csum_and_copy_to_user

This adds the equivalent of csum_and_copy_from_user for the receive side so we
can copy and checksum in one pass. It is modelled on the generic checksum
routine.

Signed-off-by: Anton Blanchard
Signed-off-by: Benjamin Herrenschmidt

Anton Blanchard
2010-09-02 12:07:30 +0800
fdd374b62 powerpc: Optimise 64bit csum_partial_copy_generic and add csum_and_copy_from_user

We use the same core loop as the new csum_partial, adding in the
stores and exception handling code. To keep things simple we do all the
exception fixup in csum_and_copy_from_user. This wrapper function is
modelled on the generic checksum code and is careful to always calculate
a complete checksum even if we only copied part of the data to userspace.

To test this I forced checksumming on over loopback and ran socklib (a
simple TCP benchmark). On a POWER6 575 throughput improved by 19% with
this patch. If I forced both the sender and receiver onto the same cpu
(with the hope of shifting the benchmark from being cache bandwidth limited
to cpu limited), adding this patch improved performance by 55%.

Signed-off-by: Anton Blanchard
Signed-off-by: Benjamin Herrenschmidt

Anton Blanchard
2010-09-02 12:07:30 +0800
9b83ecb0a powerpc: Optimise 64bit csum_partial

The main loop of csum_partial runs very slowly on recent POWER CPUs. After some
analysis on both POWER6 and POWER7 I came up with the routine below. First we get
the source aligned to a double word, ignoring any odd alignment to keep things
simple. Then we do 64 bytes at a time, with an entry and exit limb of a further
64 bytes. On both POWER6 and POWER7 this should be as fast as we can go since
we are limited by the latency of the adde instructions.
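
The core idea, summing doublewords with an end-around carry, can be sketched in C. This is an illustration only: it handles one doubleword per step rather than 64 bytes, copies bytes with memcpy instead of first aligning the source, and folds all the way to 16 bits. The names add_carry64 and csum_sketch are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Add with end-around carry, the C analogue of a chain of adde. */
static uint64_t add_carry64(uint64_t sum, uint64_t v)
{
	sum += v;
	return sum + (sum < v); /* wrap the carry out back into bit 0 */
}

/* Ones' complement sum of a buffer in native byte order, folded to 16 bits. */
static uint16_t csum_sketch(const void *buf, size_t len)
{
	const unsigned char *p = buf;
	uint64_t sum = 0, v;

	/* Main loop: one doubleword per iteration. */
	while (len >= 8) {
		memcpy(&v, p, 8);
		sum = add_carry64(sum, v);
		p += 8;
		len -= 8;
	}
	/* Tail: zero pad the final partial doubleword. */
	if (len) {
		v = 0;
		memcpy(&v, p, len);
		sum = add_carry64(sum, v);
	}
	/* Fold 64 -> 32 -> 16 bits, adding the carries back in each time. */
	sum = (sum & 0xffffffffULL) + (sum >> 32);
	sum = (sum & 0xffffffffULL) + (sum >> 32);
	sum = (sum & 0xffff) + (sum >> 16);
	sum = (sum & 0xffff) + (sum >> 16);
	return (uint16_t)sum;
}
```

Each add_carry64 depends on the previous sum, which is the serial dependency the commit message refers to when it says the loop is limited by adde latency.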

To test this I forced checksumming on over loopback and ran socklib (a
simple TCP benchmark). On a POWER6 575 throughput improved by 11% with
this patch.

Signed-off-by: Anton Blanchard
Signed-off-by: Benjamin Herrenschmidt

Anton Blanchard
2010-09-02 12:07:29 +0800

09 Jul, 2010

1 commit

5f07aa752 Merge commit 'paulus-perf/master' into next

Benjamin Herrenschmidt
2010-07-09 09:25:48 +0800

08 Jul, 2010

2 commits

3880ecb05 powerpc: Fix feature-fixup tests for gcc 4.5

The feature-fixup tests declare some extern void variables and then take
their addresses. Fix this by declaring them as extern u8 instead.

Fixes these warnings (treated as errors):

CC arch/powerpc/lib/feature-fixups.o
cc1: warnings being treated as errors
arch/powerpc/lib/feature-fixups.c: In function 'test_cpu_macros':
arch/powerpc/lib/feature-fixups.c:293:23: error: taking address of expression of type 'void'
arch/powerpc/lib/feature-fixups.c:294:9: error: taking address of expression of type 'void'
arch/powerpc/lib/feature-fixups.c:297:2: error: taking address of expression of type 'void'
arch/powerpc/lib/feature-fixups.c:297:2: error: taking address of expression of type 'void'
arch/powerpc/lib/feature-fixups.c: In function 'test_fw_macros':
arch/powerpc/lib/feature-fixups.c:306:23: error: taking address of expression of type 'void'
arch/powerpc/lib/feature-fixups.c:307:9: error: taking address of expression of type 'void'
arch/powerpc/lib/feature-fixups.c:310:2: error: taking address of expression of type 'void'
arch/powerpc/lib/feature-fixups.c:310:2: error: taking address of expression of type 'void'
arch/powerpc/lib/feature-fixups.c: In function 'test_lwsync_macros':
arch/powerpc/lib/feature-fixups.c:321:23: error: taking address of expression of type 'void'
arch/powerpc/lib/feature-fixups.c:322:9: error: taking address of expression of type 'void'
arch/powerpc/lib/feature-fixups.c:326:3: error: taking address of expression of type 'void'
arch/powerpc/lib/feature-fixups.c:326:3: error: taking address of expression of type 'void'
arch/powerpc/lib/feature-fixups.c:329:3: error: taking address of expression of type 'void'
arch/powerpc/lib/feature-fixups.c:329:3: error: taking address of expression of type 'void'

Signed-off-by: Stephen Rothwell
Signed-off-by: Benjamin Herrenschmidt

Stephen Rothwell
2010-07-08 16:11:41 +0800
7fca5dc8a powerpc: Fix module building for gcc 4.5 and 64 bit

Gcc 4.5 is now generating out of line register save and restore
in the function prefix and postfix when we use -Os.

Signed-off-by: Stephen Rothwell
Signed-off-by: Benjamin Herrenschmidt

Stephen Rothwell
2010-07-08 16:11:38 +0800

22 Jun, 2010

2 commits

5aae8a537 powerpc, hw_breakpoints: Implement hw_breakpoints for 64-bit server processors

Implement perf-events based hw-breakpoint interfaces for PowerPC
64-bit server (Book III S) processors. This allows access to a
given location to be used as an event that can be counted or
profiled by the perf_events subsystem.

This is done using the DABR (data breakpoint register), which can
also be used for process debugging via ptrace. When perf_event
hw_breakpoint support is configured in, the perf_event subsystem
manages the DABR and arbitrates access to it, and ptrace then
creates a perf_event when it is requested to set a data breakpoint.

[Adopted suggestions from Paul Mackerras to
- emulate_step() all system-wide breakpoints and single-step only the
per-task breakpoints
- perform arch-specific cleanup before unregistration through
arch_unregister_hw_breakpoint()
]

Signed-off-by: K.Prasad
Signed-off-by: Paul Mackerras

K.Prasad
2010-06-22 17:40:50 +0800
0016a4cf5 powerpc: Emulate most Book I instructions in emulate_step()

This extends the emulate_step() function to handle a large proportion
of the Book I instructions implemented on current 64-bit server
processors. The aim is to handle all the load and store instructions
used in the kernel, plus all of the instructions that appear between
l[wd]arx and st[wd]cx., so this handles the Altivec/VMX lvx and stvx
and the VSX lxv2dx and stxv2dx instructions (implemented in POWER7).

The new code can emulate user mode instructions, and checks the
effective address for a load or store if the saved state is for
user mode. It doesn't handle little-endian mode at present.

For floating-point, Altivec/VMX and VSX instructions, it checks
that the saved MSR has the enable bit for the relevant facility
set, and if so, assumes that the FP/VMX/VSX registers contain
valid state, and does loads or stores directly to/from the
FP/VMX/VSX registers, using assembly helpers in ldstfp.S.

Instructions supported now include:
* Loads and stores, including some but not all VMX and VSX instructions,
and lmw/stmw
* Atomic loads and stores (l[dw]arx, st[dw]cx.)
* Arithmetic instructions (add, subtract, multiply, divide, etc.)
* Compare instructions
* Rotate and mask instructions
* Shift instructions
* Logical instructions (and, or, xor, etc.)
* Condition register logical instructions
* mtcrf, cntlz[wd], exts[bhw]
* isync, sync, lwsync, ptesync, eieio
* Cache operations (dcbf, dcbst, dcbt, dcbtst)

The overflow-checking arithmetic instructions are not included, but
they appear not to be ever used in C code.

This uses decimal values for the minor opcodes in the switch statements
because that is what appears in the Power ISA specification, thus it is
easier to check that they are correct if they are in decimal.
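
To illustrate the decimal minor opcodes, here is a small decoding sketch. The field extraction follows the standard 32-bit instruction layout (primary opcode in the top 6 bits, 10-bit extended opcode above the low bit); the enum and the classify function are hypothetical and cover only three instructions.

```c
#include <stdint.h>

/* Primary opcode: the top 6 bits of the instruction word. */
static unsigned int primary_opcode(uint32_t instr)
{
	return instr >> 26;
}

/* Extended (minor) opcode used by instructions under primary opcode 31. */
static unsigned int minor_opcode(uint32_t instr)
{
	return (instr >> 1) & 0x3ff;
}

enum op_kind { OP_UNKNOWN, OP_LWARX, OP_ADD, OP_SYNC };

/* Decimal case labels can be compared directly with the ISA tables. */
static enum op_kind classify(uint32_t instr)
{
	if (primary_opcode(instr) != 31)
		return OP_UNKNOWN;
	switch (minor_opcode(instr)) {
	case 20:
		return OP_LWARX;
	case 266:
		return OP_ADD;
	case 598:
		return OP_SYNC; /* sync and lwsync share this minor opcode */
	default:
		return OP_UNKNOWN;
	}
}
```

For example, the word 0x7c2004ac (lwsync) decodes to primary opcode 31 and minor opcode 598.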

If this is used to single-step an instruction where a data breakpoint
interrupt occurred, then there is the possibility that the instruction
is a lwarx or ldarx. In that case we have to be careful not to lose the
reservation until we get to the matching st[wd]cx., or we'll never make
forward progress. One alternative is to try to arrange that we can
return from interrupts and handle data breakpoint interrupts without
losing the reservation, which means not using any spinlocks, mutexes,
or atomic ops (including bitops). That seems rather fragile. The
other alternative is to emulate the larx/stcx and all the instructions
in between. This is why this commit adds support for a wide range
of integer instructions.

Signed-off-by: Paul Mackerras

Paul Mackerras
2010-06-22 17:40:29 +0800

21 May, 2010

1 commit

ca5d0674c powerpc: Fix string library functions

The powerpc strncmp implementation does not correctly handle a zero
length, despite the claim in 0119536cd314ef95553604208c25bc35581f7f0a
(Add hand-coded assembly strcmp).

Additionally, all the length arguments are size_t, not int, so use
PPC_LCMPI and eq instead of cmpwi and le throughout.
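
For reference, the expected semantics can be sketched in C; this is a plain reference version, not a translation of the assembly, and strncmp_sketch is a hypothetical name. The length is an unsigned size_t, so it is tested against zero rather than compared as a signed value.

```c
#include <stddef.h>

/* Reference behaviour: compare at most n characters, 0 if n is zero. */
static int strncmp_sketch(const char *a, const char *b, size_t n)
{
	/* A zero length means nothing is compared, so the result is 0. */
	while (n != 0) {
		unsigned char ca = (unsigned char)*a++;
		unsigned char cb = (unsigned char)*b++;

		if (ca != cb)
			return ca - cb;
		if (ca == '\0')
			break;
		n--;
	}
	return 0;
}
```

A very large size_t length such as (size_t)-1 must be treated as a large count, which is why a signed "less than or equal" test on the length is wrong.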

Signed-off-by: Andreas Schwab
Acked-by: Paul Mackerras
Signed-off-by: Benjamin Herrenschmidt

Andreas Schwab
2010-05-21 15:31:08 +0800

07 Apr, 2010

1 commit

637a99022 powerpc: Fix handling of strncmp with zero len

Commit 0119536c, which added the assembly version of strncmp to
powerpc, mentions that it adds two instructions to the version from
boot/string.S to allow it to handle len=0. Unfortunately, it doesn't
always return 0 when that is the case. The length is passed in r5, but
the return value is passed back in r3. In certain cases, this will
happen to work. Otherwise it will pass back the address of the first
string as the return value.

This patch lifts the len
CC:
Signed-off-by: Benjamin Herrenschmidt

Jeff Mahoney
2010-04-07 16:00:39 +0800

30 Mar, 2010

1 commit

5a0e3ad6a include cleanup: Update gfp.h and slab.h includes to prepare for breaking implicit slab.h inclusion from percpu.h

percpu.h is included by sched.h and module.h and thus ends up being
included when building most .c files. percpu.h includes slab.h which
in turn includes gfp.h making everything defined by the two files
universally available and complicating inclusion dependencies.

percpu.h -> slab.h dependency is about to be removed. Prepare for
this change by updating users of gfp and slab facilities to include those
headers directly instead of assuming availability. As this conversion
needs to touch large number of source files, the following script is
used as the basis of conversion.

http://userweb.kernel.org/~tj/misc/slabh-sweep.py

The script does the following.

* Scan files for gfp and slab usages and update includes such that
only the necessary includes are there. ie. if only gfp is used,
gfp.h, if slab is used, slab.h.

* When the script inserts a new include, it looks at the include
blocks and tries to put the new include such that its order conforms
to its surroundings. It's put in the include block which contains
core kernel includes, in the same order that the rest are ordered -
alphabetical, Christmas tree, rev-Xmas-tree or at the end if there
doesn't seem to be any matching order.

* If the script can't find a place to put a new include (mostly
because the file doesn't have fitting include block), it prints out
an error message indicating which .h file needs to be added to the
file.

The conversion was done in the following steps.

1. The initial automatic conversion of all .c files updated slightly
over 4000 files, deleting around 700 includes and adding ~480 gfp.h
and ~3000 slab.h inclusions. The script emitted errors for ~400
files.

2. Each error was manually checked. Some didn't need the inclusion,
some needed manual addition while adding it to implementation .h or
embedding .c file was more appropriate for others. This step added
inclusions to around 150 files.

3. The script was run again and the output was compared to the edits
from #2 to make sure no file was left behind.

4. Several build tests were done and a couple of problems were fixed.
e.g. lib/decompress_*.c used malloc/free() wrappers around slab
APIs requiring slab.h to be added manually.

5. The script was run on all .h files but without automatically
editing them as sprinkling gfp.h and slab.h inclusions around .h
files could easily lead to inclusion dependency hell. Most gfp.h
inclusion directives were ignored as stuff from gfp.h was usually
widely available and often used in preprocessor macros. Each
slab.h inclusion directive was examined and added manually as
necessary.

6. percpu.h was updated not to include slab.h.

7. Build tests were done on the following configurations and failures
were fixed. CONFIG_GCOV_KERNEL was turned off for all tests (as my
distributed build env didn't work with gcov compiles) and a few
more options had to be turned off depending on archs to make things
build (like ipr on powerpc/64 which failed due to missing writeq).

* x86 and x86_64 UP and SMP allmodconfig and a custom test config.
* powerpc and powerpc64 SMP allmodconfig
* sparc and sparc64 SMP allmodconfig
* ia64 SMP allmodconfig
* s390 SMP allmodconfig
* alpha SMP allmodconfig
* um on x86_64 SMP allmodconfig

8. percpu.h modifications were reverted so that it could be applied as
a separate patch and serve as bisection point.

Given the fact that I had only a couple of failures from tests on step
6, I'm fairly confident about the coverage of this conversion patch.
If there is a breakage, it's likely to be something in one of the arch
headers which should be easily discoverable on most builds of
the specific arch.

Signed-off-by: Tejun Heo <tj@kernel.org>
Guess-its-ok-by: Christoph Lameter <cl@linux-foundation.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>

Tejun Heo
2010-03-30 21:02:32 +0800

26 Feb, 2010

1 commit

3d98ffbff powerpc: Fix lwsync feature fixup vs. modules on 64-bit

Anton's commit enabling the use of the lwsync fixup mechanism on 64-bit
breaks modules. The lwsync fixup section uses .long instead of the
FTR_ENTRY_OFFSET macro used by other fixups sections, and thus will
generate 32-bit relocations that our module loader cannot resolve.

This changes it to use the same type as other feature sections.

Note however that we might want to consider using 32-bit for all the
feature fixup offsets and add support for R_PPC_REL32 to module_64.c
instead as that would reduce the size of the kernel image. I'll leave
that as an exercise for the reader for now...

Signed-off-by: Benjamin Herrenschmidt

Benjamin Herrenschmidt
2010-02-26 15:29:17 +0800

17 Feb, 2010

3 commits

789c299ca powerpc: Improve 64bit copy_tofrom_user

Here is a patch from Paul Mackerras that improves the ppc64 copy_tofrom_user.
The loop now does 32 bytes at a time as well as pairing loads and stores.

A quick test case that reads 8kB over and over shows the improvement:

POWER6: 53% faster
POWER7: 51% faster

#define _XOPEN_SOURCE 500
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#define BUFSIZE (8 * 1024)
#define ITERATIONS 10000000

int main(void)
{
	char tmpfile[] = "/tmp/copy_to_user_testXXXXXX";
	int fd;
	static char buf[BUFSIZE];	/* zero-initialised 8kB buffer */
	unsigned long i;

	/* Create a scratch file holding one buffer's worth of data. */
	fd = mkstemp(tmpfile);
	if (fd < 0) {
		perror("mkstemp");
		exit(1);
	}

	if (write(fd, buf, BUFSIZE) != BUFSIZE) {
		perror("write");
		exit(1);
	}

	/* Re-read the same 8kB so the copy to userspace dominates. */
	for (i = 0; i < ITERATIONS; i++) {
		if (pread(fd, buf, BUFSIZE, 0) != BUFSIZE) {
			perror("pread");
			exit(1);
		}
	}

	unlink(tmpfile);

	return 0;
}
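
The loop structure described at the top of this message can be sketched in C. This is only a structural illustration under assumptions: it reads "pairing" as grouping the loads together and then the stores, the compiler is free to reschedule the accesses, and copy_paired_sketch is a hypothetical name. The real routine is hand-written assembly.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Copy len bytes, 32 bytes per iteration, loads grouped before stores. */
static void copy_paired_sketch(void *dst, const void *src, size_t len)
{
	unsigned char *d = dst;
	const unsigned char *s = src;
	uint64_t a, b, c, e;

	while (len >= 32) {
		/* Issue the four 8 byte loads together... */
		memcpy(&a, s, 8);
		memcpy(&b, s + 8, 8);
		memcpy(&c, s + 16, 8);
		memcpy(&e, s + 24, 8);
		/* ...then the four 8 byte stores. */
		memcpy(d, &a, 8);
		memcpy(d + 8, &b, 8);
		memcpy(d + 16, &c, 8);
		memcpy(d + 24, &e, 8);
		s += 32;
		d += 32;
		len -= 32;
	}
	/* Copy the remaining 0 - 31 bytes. */
	memcpy(d, s, len);
}
```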

Signed-off-by: Anton Blanchard
Signed-off-by: Benjamin Herrenschmidt

Anton Blanchard
2010-02-17 11:03:16 +0800
63e6c5b81 powerpc: Pair loads and stores in copy_4k_page

A number of our chips like loads and stores to be paired. A small kernel
module testcase shows the improvement of pairing loads and stores in
copy_4k_page:

POWER6: +9%
POWER7: +1.5%

#include <linux/module.h>
#include <linux/gfp.h>
#include <linux/mm.h>
#include <linux/time.h>

#define ITERATIONS 10000000

static int __init copypage_init(void)
{
	struct timespec before, after;
	unsigned long i;
	struct page *destpage, *srcpage;
	char *dest, *src;

	destpage = alloc_page(GFP_KERNEL);
	srcpage = alloc_page(GFP_KERNEL);

	/* Bail out cleanly if either allocation failed. */
	if (!destpage || !srcpage) {
		if (destpage)
			__free_page(destpage);
		if (srcpage)
			__free_page(srcpage);
		return -ENOMEM;
	}

	dest = page_address(destpage);
	src = page_address(srcpage);

	getnstimeofday(&before);

	/* Time a fixed number of page copies. */
	for (i = 0; i < ITERATIONS; i++)
		copy_4K_page(dest, src);

	getnstimeofday(&after);

	free_page((unsigned long)dest);
	free_page((unsigned long)src);

	printk(KERN_DEBUG "copy_4K_page loop took %ld ns\n",
		(after.tv_sec - before.tv_sec) * NSEC_PER_SEC +
		(after.tv_nsec - before.tv_nsec));

	return 0;
}

static void __exit copypage_exit(void)
{
}

module_init(copypage_init);
module_exit(copypage_exit);
MODULE_LICENSE("GPL");
MODULE_AUTHOR("Anton Blanchard");

Signed-off-by: Anton Blanchard
Signed-off-by: Benjamin Herrenschmidt

Anton Blanchard
2010-02-17 11:03:16 +0800
53eae2281 powerpc: Fix lwsync patching code on 64bit

do_lwsync_fixups doesn't work on 64bit, we end up writing lwsyncs to the
wrong addresses:

0:mon> di c0000001000bfacc
c0000001000bfacc 7c2004ac lwsync

Since the lwsync section has negative offsets we need to use a signed int
pointer so we sign extend the value.
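
The effect of the pointer type can be shown with a small C sketch. The function names and the base address are hypothetical; the point is only how a 32-bit offset widens to 64 bits before it is added.

```c
#include <stdint.h>

/* Wrong on 64-bit: an unsigned 32-bit offset is zero extended. */
static uint64_t fixup_addr_unsigned(uint64_t base, const uint32_t *entry)
{
	return base + *entry;
}

/* Right: a signed 32-bit offset is sign extended before the add. */
static uint64_t fixup_addr_signed(uint64_t base, const int32_t *entry)
{
	return base + (uint64_t)(int64_t)*entry;
}
```

With a base of 0xc000000000100000 and an offset of -8, the signed version yields 0xc0000000000ffff8, while the unsigned version lands 4GB higher at 0xc0000001000ffff8. On 32-bit the two agree because the addition wraps, which would explain why the bug only shows up on 64-bit.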

Signed-off-by: Anton Blanchard
Signed-off-by: Benjamin Herrenschmidt

Anton Blanchard
2010-02-17 11:03:15 +0800

15 Dec, 2009

3 commits

fb3a6bbc9 locking: Convert raw_rwlock to arch_rwlock

Not strictly necessary for -rt as -rt does not have non sleeping
rwlocks, but it's odd to not have a consistent naming convention.

No functional change.

Signed-off-by: Thomas Gleixner
Acked-by: Peter Zijlstra
Acked-by: David S. Miller
Acked-by: Ingo Molnar
Cc: linux-arch@vger.kernel.org

Thomas Gleixner
2009-12-15 06:55:32 +0800
0199c4e68 locking: Convert __raw_spin* functions to arch_spin*

Name space cleanup. No functional change.

Signed-off-by: Thomas Gleixner
Acked-by: Peter Zijlstra
Acked-by: David S. Miller
Acked-by: Ingo Molnar
Cc: linux-arch@vger.kernel.org

Thomas Gleixner
2009-12-15 06:55:32 +0800
445c89514 locking: Convert raw_spinlock to arch_spinlock

The raw_spin* namespace was taken by lockdep for the architecture
specific implementations. raw_spin_* would be the ideal name space for
the spinlocks which are not converted to sleeping locks in preempt-rt.

Linus suggested to convert the raw_ to arch_ locks and cleanup the
name space instead of using an artificial name like core_spin,
atomic_spin or whatever

No functional change.

Signed-off-by: Thomas Gleixner
Acked-by: Peter Zijlstra
Acked-by: David S. Miller
Acked-by: Ingo Molnar
Cc: linux-arch@vger.kernel.org

Thomas Gleixner
2009-12-15 06:55:32 +0800

09 Dec, 2009

2 commits

bcd6acd51 Merge commit 'origin/master' into next

Conflicts:
include/linux/kvm.h

Benjamin Herrenschmidt
2009-12-09 14:14:38 +0800
15d914d72 powerpc/8xx: Start using dcbX instructions in various copy routines

Now that 8xx can fixup dcbX instructions, start using them
where possible like every other PowerPC arch does.

Signed-off-by: Joakim Tjernlund
Signed-off-by: Benjamin Herrenschmidt

Joakim Tjernlund
2009-12-09 14:10:37 +0800

28 Oct, 2009

1 commit

3cd980dbc powerpc: perf_event: Cleanup copy_page output by hiding setup symbol

Seeing a lot of hits in "setup" doesn't make much sense, so hide this symbol and
allow all the hits to end up in copy_4k_page.

Signed-off-by: Anton Blanchard
Signed-off-by: Paul Mackerras

Anton Blanchard
2009-10-28 13:13:05 +0800

16 Jun, 2009

1 commit

ba55bd743 powerpc: Add configurable -Werror for arch/powerpc

Add the option to build the code under arch/powerpc with -Werror.

The intention is to make it harder for people to inadvertently introduce
warnings in the arch/powerpc code. It needs to be configurable so that
if a warning is introduced, people can easily work around it while it's
being fixed.

The option is a negative, ie. don't enable -Werror, so that it will be
turned on for allyes and allmodconfig builds.

The default is n, in the hope that developers will build with -Werror.
That will probably lead to some build breaks; I am prepared to be flamed.

It's not enabled for math-emu, which is a steaming pile of warnings.

Signed-off-by: Michael Ellerman
Signed-off-by: Benjamin Herrenschmidt

Michael Ellerman
2009-06-16 12:15:45 +0800

27 May, 2009

2 commits

b16e7766d powerpc: Move dma-noncoherent.c from arch/powerpc/lib to arch/powerpc/mm

(pre-requisite to make the next patches more palatable)

Signed-off-by: Benjamin Herrenschmidt

Benjamin Herrenschmidt
2009-05-27 14:32:05 +0800
84532a0fc Revert "powerpc: Rework dma-noncoherent to use generic vmalloc layer"

This reverts commit 33f00dcedb0e22cdb156a23632814fc580fcfcf8.

While it was a good idea to try to use the mm/vmalloc.c allocator instead
of our own (in fact, ours is itself a dup on an old variant of the vmalloc
one), unfortunately, the approach is terminally busted since
dma_alloc_coherent() can be called at interrupt time or in atomic contexts
and there's little chance we'll make the code in mm/vmalloc.c cope with that :-(

Until we can get the generic code to forbid that idiocy and fix all
drivers abusing it, we pretty much have no choice but revert to
our custom virtual space allocator.

There's also a problem with SMP safety since freeing such mapping
would require an IPI which cannot be done at interrupt time.

However, right now, I don't think we support any platform that is
both SMP and has non-coherent DMA (don't laugh, I know such things
do exist !) so we can sort that out later.

Signed-off-by: Benjamin Herrenschmidt

Benjamin Herrenschmidt
2009-05-27 11:33:14 +0800