19 Dec, 2011

1 commit

  • Implement a POWER7 optimised copy_to_user/copy_from_user using VMX.
    For large aligned copies this new loop is over 10% faster, and for
    large unaligned copies it is over 200% faster.

    If we take a fault we fall back to the old version; this keeps
    things relatively simple and easy to verify.

    On POWER7 unaligned stores rarely slow down - they only flush when
    a store crosses a 4KB page boundary. Furthermore this flush is
    handled completely in hardware and should be 20-30 cycles.

    Unaligned loads on the other hand flush much more often - whenever
    crossing a 128 byte cache line, or a 32 byte sector if either sector
    is an L1 miss.

    Considering this information we really want to get the loads aligned
    and not worry about the alignment of the stores. Microbenchmarks
    confirm that this approach is much faster than the current unaligned
    copy loop that uses shifts and rotates to ensure both loads and
    stores are aligned.

    We also want to try and do the stores in cacheline aligned, cacheline
    sized chunks. If the store queue is unable to merge an entire
    cacheline of stores then the L2 cache will have to do a
    read/modify/write. Even worse, we will serialise this with the stores
    in the next iteration of the copy loop since both iterations hit
    the same cacheline.

    Based on this, the new loop does the following things:

    1 - 127 bytes
    Get the source 8 byte aligned and use 8 byte loads and stores. Pretty
    boring and similar to how the current loop works.

    128 - 4095 bytes
    Get the source 8 byte aligned and use 8 byte loads and stores,
    1 cacheline at a time. We aren't doing the stores in cacheline
    aligned chunks so we will potentially serialise once per cacheline.
    Even so it is much better than the loop we have today.

    4096 bytes and above
    If both source and destination have the same alignment get them both
    16 byte aligned, then get the destination cacheline aligned. Do
    cacheline sized loads and stores using VMX.

    If source and destination do not have the same alignment, we get the
    destination cacheline aligned, and use permute to do aligned loads.

    In both cases the VMX loop should be optimal - we always do aligned
    loads and stores and are always doing stores in cacheline aligned,
    cacheline sized chunks.
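
A minimal sketch of the size and alignment dispatch described above, in portable C rather than the real assembly. The thresholds (128 and 4096 bytes) and the 16 byte VMX alignment come from the text; the enum, the function names and the helper `bytes_to_cacheline` are hypothetical and only illustrate the decision, not the copy loops themselves.

```c
#include <stddef.h>
#include <stdint.h>

/* Thresholds taken from the description above; all names are hypothetical. */
#define CACHELINE_BYTES 128u
#define VMX_THRESHOLD   4096u

enum copy_strategy {
	COPY_SMALL,        /* 1 - 127 bytes: 8 byte loads and stores */
	COPY_CACHELINE,    /* 128 - 4095 bytes: 8 byte ops, one cacheline per pass */
	COPY_VMX_ALIGNED,  /* 4096+ bytes, source and destination equally aligned */
	COPY_VMX_PERMUTE   /* 4096+ bytes, differing alignment: permute the loads */
};

/* Bytes that must be copied before addr reaches a cacheline boundary. */
static size_t bytes_to_cacheline(uintptr_t addr)
{
	return (size_t)(-addr & (CACHELINE_BYTES - 1));
}

static enum copy_strategy choose_copy_strategy(uintptr_t dst, uintptr_t src,
					       size_t len)
{
	if (len < CACHELINE_BYTES)
		return COPY_SMALL;
	if (len < VMX_THRESHOLD)
		return COPY_CACHELINE;
	/* Same offset within a 16 byte VMX register on both sides? */
	if (((dst ^ src) & 15) == 0)
		return COPY_VMX_ALIGNED;
	return COPY_VMX_PERMUTE;
}
```

In both VMX cases the first `bytes_to_cacheline(dst)` bytes would be copied with a scalar loop so that the vector stores start on a cacheline boundary.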

    To be able to use VMX we must be careful about interrupts and
    sleeping. We don't use the VMX loop when in an interrupt (which should
    be rare anyway) and we wrap the VMX loop in disable/enable_pagefault
    and fall back to the existing copy_tofrom_user loop if we do need to
    sleep.

    The VMX breakpoint of 4096 bytes was chosen using this microbenchmark:

    http://ozlabs.org/~anton/junkcode/copy_to_user.c

    Since we are using VMX and there is a cost to saving and restoring
    the user VMX state there are two broad cases we need to benchmark:

    - Best case - userspace never uses VMX

    - Worst case - userspace always uses VMX

    In reality a userspace process will sit somewhere between these two
    extremes. Since we need to test both aligned and unaligned copies we
    end up with 4 combinations. The point at which the VMX loop begins to
    win is:

    0% VMX
    aligned 2048 bytes
    unaligned 2048 bytes

    100% VMX
    aligned 16384 bytes
    unaligned 8192 bytes

    Considering this is a microbenchmark, the data is hot in cache and
    the VMX loop has better store queue merging properties we set the
    breakpoint to 4096 bytes, a little below the unaligned breakpoints.

    Some future optimisations we can look at:

    - Looking at the perf data, a significant part of the cost when a
    task is always using VMX is the extra exception we take to restore
    the VMX state. As such we should do something similar to the x86
    optimisation that restores FPU state for heavy users. ie:

    /*
    * If the task has used fpu the last 5 timeslices, just do a full
    * restore of the math state immediately to avoid the trap; the
    * chances of needing FPU soon are obviously high now
    */
    preload_fpu = tsk_used_math(next_p) && next_p->fpu_counter > 5;

    and

    /*
    * fpu_counter contains the number of consecutive context switches
    * that the FPU is used. If this is over a threshold, the lazy fpu
    * saving becomes unlazy to save the trap. This is an unsigned char
    * so that after 256 times the counter wraps and the behavior turns
    * lazy again; this to deal with bursty apps that only use FPU for
    * a short time
    */

    - We could create a paca bit to mirror the VMX enabled MSR bit and check
    that first, avoiding multiple calls to enable_kernel_altivec.
    That should help with iovec based system calls like readv.

    - We could have two VMX breakpoints, one for when we know the user VMX
    state is loaded into the registers and one when it isn't. This could
    be a second bit in the paca so we can calculate the break points quickly.

    - One suggestion from Ben was to save and restore the VSX registers
    we use inline instead of using enable_kernel_altivec.
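
The x86 heuristic quoted in the first item can be modelled in a few lines of C. This is only a sketch: `struct task_sketch` and both function names are hypothetical, and the threshold of 5 is the value from the quoted snippet. It shows how the counter switches a task from lazy to eager restore, and how the `unsigned char` wrap makes it lazy again after 256 consecutive uses.

```c
#include <stdbool.h>

#define PRELOAD_THRESHOLD 5  /* value used in the quoted x86 code */

struct task_sketch {
	unsigned char fpu_counter;  /* consecutive switches that used the FPU */
};

/* Called once per context switch; used_fpu says whether the slice used the FPU. */
static void account_timeslice(struct task_sketch *t, bool used_fpu)
{
	if (used_fpu)
		t->fpu_counter++;   /* unsigned char: wraps from 255 back to 0 */
	else
		t->fpu_counter = 0;
}

/* Eagerly restore the state only for tasks that keep using the FPU. */
static bool should_preload(const struct task_sketch *t)
{
	return t->fpu_counter > PRELOAD_THRESHOLD;
}
```

A VMX variant would presumably keep a similar counter per task, but that is an assumption and not something this commit implements.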

    [BenH: Fixed a problem with preempt and fixed build without CONFIG_ALTIVEC]

    Signed-off-by: Anton Blanchard
    Signed-off-by: Benjamin Herrenschmidt

    Anton Blanchard
     

16 Nov, 2011

1 commit

  • kdump fails because we try to execute an HV only instruction. Feature
    fixups are being applied after we copy the exception vectors down to 0
    so they miss out on any updates.

    We have always had this issue but it only became critical in v3.0
    when we added CFAR support (breaks POWER5) and v3.1 when we added
    POWERNV (breaks everyone).
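
The ordering problem can be shown with a toy model. This is a sketch under stated assumptions: `HV_ONLY_INSN` is a placeholder value and not a real instruction encoding, and the function names are hypothetical. It shows only that a copy taken before the fixups run keeps the unpatched instruction.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define PPC_NOP      0x60000000u  /* PowerPC nop */
#define HV_ONLY_INSN 0xffffffffu  /* placeholder, NOT a real encoding */
#define NVEC         4

/* Replace every HV-only placeholder in code[] with a nop. */
static void apply_feature_fixups(uint32_t *code, size_t n)
{
	for (size_t i = 0; i < n; i++)
		if (code[i] == HV_ONLY_INSN)
			code[i] = PPC_NOP;
}

static size_t count_hv_only(const uint32_t *code, size_t n)
{
	size_t hits = 0;

	for (size_t i = 0; i < n; i++)
		if (code[i] == HV_ONLY_INSN)
			hits++;
	return hits;
}

/* Returns how many HV-only instructions remain in the low-memory copy. */
static size_t boot_sketch(int fixups_first)
{
	uint32_t vectors[NVEC] = { PPC_NOP, HV_ONLY_INSN, PPC_NOP, PPC_NOP };
	uint32_t low[NVEC];

	if (fixups_first)
		apply_feature_fixups(vectors, NVEC);
	memcpy(low, vectors, sizeof(low));            /* copy vectors "down to 0" */
	if (!fixups_first)
		apply_feature_fixups(vectors, NVEC);  /* too late: low[] is stale */
	return count_hv_only(low, NVEC);
}
```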

    Signed-off-by: Anton Blanchard
    Cc: [v3.0+]
    Signed-off-by: Benjamin Herrenschmidt

    Anton Blanchard
     

01 Nov, 2011

1 commit


21 May, 2011

2 commits

  • * 'merge' of git://git.kernel.org/pub/scm/linux/kernel/git/benh/powerpc: (152 commits)
    powerpc: Fix hard CPU IDs detection
    powerpc/pmac: Update via-pmu to new syscore_ops
    powerpc/kvm: Fix the build for 32-bit Book 3S (classic) processors
    powerpc/kvm: Fix kvmppc_core_pending_dec
    powerpc: Remove last piece of GEMINI
    powerpc: Fix for Pegasos keyboard and mouse
    powerpc: Make early memory scan more resilient to out of order nodes
    powerpc/pseries/iommu: Cleanup ddw naming
    powerpc/pseries/iommu: Find windows after kexec during boot
    powerpc/pseries/iommu: Remove ddw property when destroying window
    powerpc/pseries/iommu: Add additional checks when changing iommu mask
    powerpc/pseries/iommu: Use correct return type in dupe_ddw_if_already_created
    powerpc: Remove unused/obsolete CONFIG_XICS
    misc: Add CARMA DATA-FPGA Programmer support
    misc: Add CARMA DATA-FPGA Access Driver
    powerpc: Make IRQ_NOREQUEST last to clear, first to set
    powerpc: Integrated Flash controller device tree bindings
    powerpc/85xx: Create dts of each core in CAMP mode for P1020RDB
    powerpc/85xx: Fix PCIe IDSEL for Px020RDB
    powerpc/85xx: P2020 DTS: re-organize dts files
    ...

    Linus Torvalds
     
  • Commit e66eed651fd1 ("list: remove prefetching from regular list
    iterators") removed the include of prefetch.h from list.h, which
    uncovered several cases that had apparently relied on that rather
    obscure header file dependency.

    So this fixes things up a bit, using

    grep -L linux/prefetch.h $(git grep -l '[^a-z_]prefetchw*(' -- '*.[ch]')
    grep -L 'prefetchw*(' $(git grep -l 'linux/prefetch.h' -- '*.[ch]')

    to guide us in finding files that either need linux/prefetch.h
    inclusion, or have it despite not needing it.

    There are more of them around (mostly network drivers), but this gets
    many core ones.

    Reported-by: Stephen Rothwell
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

19 May, 2011

3 commits

  • Replace all remaining callers of alloc_maybe_bootmem with
    zalloc_maybe_bootmem. The callsite in pci_dn is followed with a
    memset to clear the memory, and not zeroing at the other callsites
    in the celleb fake pci code could lead to following uninitialized
    memory as pointers or even freeing said pointers on error paths.

    Signed-off-by: Milton Miller
    Signed-off-by: Benjamin Herrenschmidt

    Milton Miller
     
  • We have a confusing number of ioremap functions. Make things just a
    bit simpler by merging ioremap_flags and ioremap_prot.

    Signed-off-by: Anton Blanchard
    Signed-off-by: Benjamin Herrenschmidt

    Anton Blanchard
     
  • To make it easier to add optimised versions of copy_page, remove
    the 4kB loop for 64kB pages and just do all the work in copy_page.

    Signed-off-by: Anton Blanchard
    Signed-off-by: Benjamin Herrenschmidt

    Anton Blanchard
     

27 Apr, 2011

1 commit

  • We check MSR_SF a lot in sstep.c, to decide if we need to emulate the
    truncation of values when running in 32-bit mode. Factor out that code
    into a helper, and convert it and the other uses to use MSR_64BIT.

    This fixes a bug on BOOK3E where kprobes would end up returning to a
    32-bit address, because regs->nip was truncated, because (msr & MSR_SF)
    was false.
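
A sketch of what such a helper might look like, written as standalone C. The name `truncate_if_32bit` and the bit position used for `MSR_64BIT_SKETCH` are assumptions made for illustration. In the kernel the 64-bit mode bit differs between platforms, which is the point of using a generic MSR_64BIT instead of testing MSR_SF directly.

```c
#include <stdint.h>

/* Assumed for illustration only; the real bit depends on the platform. */
#define MSR_64BIT_SKETCH (1ULL << 63)

/* Emulate the truncation of a result when the CPU is in 32-bit mode. */
static uint64_t truncate_if_32bit(uint64_t msr, uint64_t val)
{
	if ((msr & MSR_64BIT_SKETCH) == 0)
		val &= 0xffffffffULL;
	return val;
}
```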

    Signed-off-by: Michael Ellerman
    Signed-off-by: Benjamin Herrenschmidt

    Michael Ellerman
     

21 Jan, 2011

1 commit

  • When we create an alternative feature section, the else case must be the
    same size or smaller than the body. This is because when we patch the
    else case in we just overwrite the body, so there must be room.

    Up to now we just did this by inspection, but it's quite easy to enforce
    it in the assembler, so we should.
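
The size constraint is easier to see in a C model of the patching step. This is a sketch with hypothetical names; padding the unused tail of the body with nops is an assumption about how the leftover space is filled. In the commit the check happens at assembly time, whereas this model only detects the problem at run time.

```c
#include <stddef.h>
#include <stdint.h>

#define PPC_NOP 0x60000000u  /* PowerPC nop */

/*
 * Overwrite body[0..body_len) with alt[0..alt_len) and pad the rest with
 * nops. Returns 0 on success, -1 if the alternative does not fit.
 */
static int patch_alternative(uint32_t *body, size_t body_len,
			     const uint32_t *alt, size_t alt_len)
{
	size_t i;

	if (alt_len > body_len)
		return -1;      /* the case the assembler check rules out */
	for (i = 0; i < alt_len; i++)
		body[i] = alt[i];
	for (; i < body_len; i++)
		body[i] = PPC_NOP;
	return 0;
}
```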

    The only change is to add the ifgt block, but that affects the alignment
    of the tabs and so the whole macro is modified.

    Also add a test, but #if 0 it because we don't want to break the build.
    Anyone who's modifying the feature macros should enable the test.

    Signed-off-by: Michael Ellerman
    Signed-off-by: Benjamin Herrenschmidt

    Michael Ellerman
     

09 Dec, 2010

1 commit


29 Nov, 2010

1 commit

  • POWER5 added popcntb, and POWER7 added popcntw and popcntd. As a first step
    this patch does all the work out of line, but it would be nice to implement
    them as inlines with an out of line fallback.

    The performance issue with hweight was noticed when disabling SMT on a large
    (192 thread) POWER7 box. The patch improves that testcase by about 8%.
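
For readers unfamiliar with popcntb: it counts the set bits of each byte separately, so a full hweight still needs the per-byte counts summed. The sketch below models that in portable C. `popcntb_sw` is a software stand-in for the instruction and both names are hypothetical; popcntw and popcntd make the summing step unnecessary.

```c
#include <stdint.h>

/* Software stand-in for popcntb: each result byte holds that byte's bit count. */
static uint64_t popcntb_sw(uint64_t x)
{
	x = x - ((x >> 1) & 0x5555555555555555ULL);
	x = (x & 0x3333333333333333ULL) + ((x >> 2) & 0x3333333333333333ULL);
	x = (x + (x >> 4)) & 0x0f0f0f0f0f0f0f0fULL;
	return x;
}

/* Sum the eight per-byte counts; the multiply gathers them in the top byte. */
static unsigned int hweight64_sketch(uint64_t x)
{
	return (unsigned int)((popcntb_sw(x) * 0x0101010101010101ULL) >> 56);
}
```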

    Signed-off-by: Anton Blanchard
    Signed-off-by: Benjamin Herrenschmidt

    Anton Blanchard
     

13 Oct, 2010

1 commit


02 Sep, 2010

6 commits

  • Replace the BOOK3S_64 specific mtmsrd with the generic MTMSRD macro.
    Only enable ldstfp when CONFIG_PPC_FPU is set.

    Signed-off-by: Sean MacLennan
    Signed-off-by: Benjamin Herrenschmidt

    Sean MacLennan
     
  • Signed-off-by: Sean MacLennan
    Signed-off-by: Benjamin Herrenschmidt

    Sean MacLennan
     
  • Currently we have the lppaca structs as a simple array of NR_CPUS
    entries, taking up space in the data section of the kernel image.
    In future we would like to allocate them dynamically, so this
    abstracts out the accesses to the array, making it easier to
    change how we locate the lppaca for a given cpu in future.
    Specifically, lppaca[cpu] changes to lppaca_of(cpu).
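
A minimal illustration of the accessor pattern, using hypothetical stand-in types (`struct lppaca_sketch`, `NR_CPUS_SKETCH`) rather than the kernel's definitions. Callers go through `lppaca_of(cpu)`, so the storage behind it can later change from a static array to dynamically allocated structures without touching them.

```c
#define NR_CPUS_SKETCH 4

/* Stand-in for the real structure; the field is only for illustration. */
struct lppaca_sketch {
	unsigned int yield_count;
};

static struct lppaca_sketch lppaca[NR_CPUS_SKETCH];

/* Today this is a plain array lookup; only this macro needs to change later. */
#define lppaca_of(cpu) (lppaca[cpu])
```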

    Signed-off-by: Paul Mackerras
    Signed-off-by: Benjamin Herrenschmidt

    Paul Mackerras
     
  • This adds the equivalent of csum_and_copy_from_user for the receive side so we
    can copy and checksum in one pass. It is modelled on the generic checksum
    routine.

    Signed-off-by: Anton Blanchard
    Signed-off-by: Benjamin Herrenschmidt

    Anton Blanchard
     
  • We use the same core loop as the new csum_partial, adding in the
    stores and exception handling code. To keep things simple we do all the
    exception fixup in csum_and_copy_from_user. This wrapper function is
    modelled on the generic checksum code and is careful to always calculate
    a complete checksum even if we only copied part of the data to userspace.

    To test this I forced checksumming on over loopback and ran socklib (a
    simple TCP benchmark). On a POWER6 575 throughput improved by 19% with
    this patch. If I forced both the sender and receiver onto the same cpu
    (with the hope of shifting the benchmark from being cache bandwidth limited
    to cpu limited), adding this patch improved performance by 55%.
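
The idea of checksumming while copying can be sketched in portable C. This is not the assembly routine: the function name is hypothetical, the length is assumed to be even, and there is no fault handling. Each 16-bit word is loaded once, stored to the destination and added to a ones' complement accumulator.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Copy len bytes (len assumed even) and return the folded 16-bit sum. */
static uint16_t copy_and_csum(void *dst, const void *src, size_t len)
{
	const unsigned char *s = src;
	unsigned char *d = dst;
	uint64_t sum = 0;
	size_t i;

	for (i = 0; i + 1 < len; i += 2) {
		uint16_t w;

		memcpy(&w, s + i, sizeof(w));   /* load the word once */
		memcpy(d + i, &w, sizeof(w));   /* store it to the destination */
		sum += w;                       /* and accumulate it */
	}
	while (sum >> 16)                       /* fold the carries back in */
		sum = (sum & 0xffff) + (sum >> 16);
	return (uint16_t)sum;
}
```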

    Signed-off-by: Anton Blanchard
    Signed-off-by: Benjamin Herrenschmidt

    Anton Blanchard
     
  • The main loop of csum_partial runs very slowly on recent POWER CPUs. After some
    analysis on both POWER6 and POWER7 I came up with the routine below. First we get
    the source aligned to a double word, ignoring any odd alignment to keep things
    simple. Then we do 64 bytes at a time, with an entry and exit limb of a further
    64 bytes. On both POWER6 and POWER7 this should be as fast as we can go since
    we are limited by the latency of the adde instructions.

    To test this I forced checksumming on over loopback and ran socklib (a
    simple TCP benchmark). On a POWER6 575 throughput improved by 11% with
    this patch.
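
A C model of the accumulation may help when reading the assembly. It is a sketch with hypothetical names and assumes the length is a multiple of 8. Each doubleword is added with an end-around carry, which is the C analogue of a chain of adde instructions, and the 64-bit sum is folded to 16 bits at the end. The real routine unrolls this to 64 bytes per iteration.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Add with end-around carry, the C analogue of an adde chain. */
static uint64_t add_carry64(uint64_t sum, uint64_t v)
{
	sum += v;
	if (sum < v)
		sum++;          /* wrap the carry back into the sum */
	return sum;
}

/* Fold a 64-bit ones' complement sum down to 16 bits. */
static uint16_t csum_fold64(uint64_t sum)
{
	sum = (sum & 0xffffffffULL) + (sum >> 32);
	sum = (sum & 0xffffffffULL) + (sum >> 32);
	sum = (sum & 0xffff) + (sum >> 16);
	sum = (sum & 0xffff) + (sum >> 16);
	return (uint16_t)sum;
}

/* Checksum len bytes; len is assumed to be a multiple of 8. */
static uint16_t csum_partial_sketch(const void *buf, size_t len)
{
	const unsigned char *p = buf;
	uint64_t sum = 0;
	size_t i;

	for (i = 0; i + 8 <= len; i += 8) {
		uint64_t v;

		memcpy(&v, p + i, sizeof(v));
		sum = add_carry64(sum, v);
	}
	return csum_fold64(sum);
}
```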

    Signed-off-by: Anton Blanchard
    Signed-off-by: Benjamin Herrenschmidt

    Anton Blanchard
     

09 Jul, 2010

1 commit


08 Jul, 2010

2 commits

  • The feature-fixup tests declare some extern void variables and then take
    their addresses. Fix this by declaring them as extern u8 instead.

    Fixes these warnings (treated as errors):

    CC arch/powerpc/lib/feature-fixups.o
    cc1: warnings being treated as errors
    arch/powerpc/lib/feature-fixups.c: In function 'test_cpu_macros':
    arch/powerpc/lib/feature-fixups.c:293:23: error: taking address of expression of type 'void'
    arch/powerpc/lib/feature-fixups.c:294:9: error: taking address of expression of type 'void'
    arch/powerpc/lib/feature-fixups.c:297:2: error: taking address of expression of type 'void'
    arch/powerpc/lib/feature-fixups.c:297:2: error: taking address of expression of type 'void'
    arch/powerpc/lib/feature-fixups.c: In function 'test_fw_macros':
    arch/powerpc/lib/feature-fixups.c:306:23: error: taking address of expression of type 'void'
    arch/powerpc/lib/feature-fixups.c:307:9: error: taking address of expression of type 'void'
    arch/powerpc/lib/feature-fixups.c:310:2: error: taking address of expression of type 'void'
    arch/powerpc/lib/feature-fixups.c:310:2: error: taking address of expression of type 'void'
    arch/powerpc/lib/feature-fixups.c: In function 'test_lwsync_macros':
    arch/powerpc/lib/feature-fixups.c:321:23: error: taking address of expression of type 'void'
    arch/powerpc/lib/feature-fixups.c:322:9: error: taking address of expression of type 'void'
    arch/powerpc/lib/feature-fixups.c:326:3: error: taking address of expression of type 'void'
    arch/powerpc/lib/feature-fixups.c:326:3: error: taking address of expression of type 'void'
    arch/powerpc/lib/feature-fixups.c:329:3: error: taking address of expression of type 'void'
    arch/powerpc/lib/feature-fixups.c:329:3: error: taking address of expression of type 'void'

    Signed-off-by: Stephen Rothwell
    Signed-off-by: Benjamin Herrenschmidt

    Stephen Rothwell
     
  • GCC 4.5 now generates out-of-line register save and restore calls
    in the function prologue and epilogue when we use -Os.

    Signed-off-by: Stephen Rothwell
    Signed-off-by: Benjamin Herrenschmidt

    Stephen Rothwell
     

22 Jun, 2010

2 commits

  • Implement perf-events based hw-breakpoint interfaces for PowerPC
    64-bit server (Book III S) processors. This allows access to a
    given location to be used as an event that can be counted or
    profiled by the perf_events subsystem.

    This is done using the DABR (data breakpoint register), which can
    also be used for process debugging via ptrace. When perf_event
    hw_breakpoint support is configured in, the perf_event subsystem
    manages the DABR and arbitrates access to it, and ptrace then
    creates a perf_event when it is requested to set a data breakpoint.

    [Adopted suggestions from Paul Mackerras to
    - emulate_step() all system-wide breakpoints and single-step only the
    per-task breakpoints
    - perform arch-specific cleanup before unregistration through
    arch_unregister_hw_breakpoint()
    ]

    Signed-off-by: K.Prasad
    Signed-off-by: Paul Mackerras

    K.Prasad
     
  • This extends the emulate_step() function to handle a large proportion
    of the Book I instructions implemented on current 64-bit server
    processors. The aim is to handle all the load and store instructions
    used in the kernel, plus all of the instructions that appear between
    l[wd]arx and st[wd]cx., so this handles the Altivec/VMX lvx and stvx
    and the VSX lxv2dx and stxv2dx instructions (implemented in POWER7).

    The new code can emulate user mode instructions, and checks the
    effective address for a load or store if the saved state is for
    user mode. It doesn't handle little-endian mode at present.

    For floating-point, Altivec/VMX and VSX instructions, it checks
    that the saved MSR has the enable bit for the relevant facility
    set, and if so, assumes that the FP/VMX/VSX registers contain
    valid state, and does loads or stores directly to/from the
    FP/VMX/VSX registers, using assembly helpers in ldstfp.S.

    Instructions supported now include:
    * Loads and stores, including some but not all VMX and VSX instructions,
    and lmw/stmw
    * Atomic loads and stores (l[dw]arx, st[dw]cx.)
    * Arithmetic instructions (add, subtract, multiply, divide, etc.)
    * Compare instructions
    * Rotate and mask instructions
    * Shift instructions
    * Logical instructions (and, or, xor, etc.)
    * Condition register logical instructions
    * mtcrf, cntlz[wd], exts[bhw]
    * isync, sync, lwsync, ptesync, eieio
    * Cache operations (dcbf, dcbst, dcbt, dcbtst)

    The overflow-checking arithmetic instructions are not included, but
    they appear not to be ever used in C code.

    This uses decimal values for the minor opcodes in the switch statements
    because that is what appears in the Power ISA specification, thus it is
    easier to check that they are correct if they are in decimal.
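
As a small illustration of the opcode fields involved, the helpers below extract the primary opcode and the 10-bit extended (minor) opcode from an instruction word. The function names are hypothetical. Applied to the lwsync encoding 0x7c2004ac they give 31 and 598, the decimal values under which sync appears in the ISA.

```c
#include <stdint.h>

/* Primary opcode: the top 6 bits of the instruction word. */
static unsigned int primary_opcode(uint32_t instr)
{
	return instr >> 26;
}

/* Extended opcode of an X-form instruction: 10 bits above the Rc bit. */
static unsigned int extended_opcode(uint32_t instr)
{
	return (instr >> 1) & 0x3ff;
}
```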

    If this is used to single-step an instruction where a data breakpoint
    interrupt occurred, then there is the possibility that the instruction
    is a lwarx or ldarx. In that case we have to be careful not to lose the
    reservation until we get to the matching st[wd]cx., or we'll never make
    forward progress. One alternative is to try to arrange that we can
    return from interrupts and handle data breakpoint interrupts without
    losing the reservation, which means not using any spinlocks, mutexes,
    or atomic ops (including bitops). That seems rather fragile. The
    other alternative is to emulate the larx/stcx and all the instructions
    in between. This is why this commit adds support for a wide range
    of integer instructions.

    Signed-off-by: Paul Mackerras

    Paul Mackerras
     

21 May, 2010

1 commit

  • The powerpc strncmp implementation does not correctly handle a zero
    length, despite the claim in 0119536cd314ef95553604208c25bc35581f7f0a
    (Add hand-coded assembly strcmp).

    Additionally, all the length arguments are size_t, not int, so use
    PPC_LCMPI and eq instead of cmpwi and le throughout.
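
For reference, the required behaviour expressed in C (a sketch with a hypothetical name, not the assembly from this patch): a zero length must return 0 without touching either string, and because the length is an unsigned size_t the loop tests for zero instead of for a negative value.

```c
#include <stddef.h>

static int strncmp_sketch(const char *a, const char *b, size_t n)
{
	while (n-- > 0) {
		unsigned char ca = (unsigned char)*a++;
		unsigned char cb = (unsigned char)*b++;

		if (ca != cb)
			return (int)ca - (int)cb;
		if (ca == '\0')
			break;          /* both strings ended together */
	}
	return 0;                       /* also the result for n == 0 */
}
```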

    Signed-off-by: Andreas Schwab
    Acked-by: Paul Mackerras
    Signed-off-by: Benjamin Herrenschmidt

    Andreas Schwab
     

07 Apr, 2010

1 commit

  • Commit 0119536c, which added the assembly version of strncmp to
    powerpc, mentions that it adds two instructions to the version from
    boot/string.S to allow it to handle len=0. Unfortunately, it doesn't
    always return 0 when that is the case. The length is passed in r5, but
    the return value is passed back in r3. In certain cases, this will
    happen to work. Otherwise it will pass back the address of the first
    string as the return value.

    This patch lifts the len
    CC:
    Signed-off-by: Benjamin Herrenschmidt

    Jeff Mahoney
     

30 Mar, 2010

1 commit

  • …it slab.h inclusion from percpu.h

    percpu.h is included by sched.h and module.h and thus ends up being
    included when building most .c files. percpu.h includes slab.h which
    in turn includes gfp.h making everything defined by the two files
    universally available and complicating inclusion dependencies.

    percpu.h -> slab.h dependency is about to be removed. Prepare for
    this change by updating users of gfp and slab facilities to include those
    headers directly instead of assuming availability. As this conversion
    needs to touch large number of source files, the following script is
    used as the basis of conversion.

    http://userweb.kernel.org/~tj/misc/slabh-sweep.py

    The script does the following.

    * Scan files for gfp and slab usages and update includes such that
    only the necessary includes are there. ie. if only gfp is used,
    gfp.h, if slab is used, slab.h.

    * When the script inserts a new include, it looks at the include
    blocks and tries to put the new include such that its order conforms
    to its surrounding. It's put in the include block which contains
    core kernel includes, in the same order that the rest are ordered -
    alphabetical, Christmas tree, rev-Xmas-tree or at the end if there
    doesn't seem to be any matching order.

    * If the script can't find a place to put a new include (mostly
    because the file doesn't have fitting include block), it prints out
    an error message indicating which .h file needs to be added to the
    file.

    The conversion was done in the following steps.

    1. The initial automatic conversion of all .c files updated slightly
    over 4000 files, deleting around 700 includes and adding ~480 gfp.h
    and ~3000 slab.h inclusions. The script emitted errors for ~400
    files.

    2. Each error was manually checked. Some didn't need the inclusion,
    some needed manual addition while adding it to implementation .h or
    embedding .c file was more appropriate for others. This step added
    inclusions to around 150 files.

    3. The script was run again and the output was compared to the edits
    from #2 to make sure no file was left behind.

    4. Several build tests were done and a couple of problems were fixed.
    e.g. lib/decompress_*.c used malloc/free() wrappers around slab
    APIs requiring slab.h to be added manually.

    5. The script was run on all .h files but without automatically
    editing them as sprinkling gfp.h and slab.h inclusions around .h
    files could easily lead to inclusion dependency hell. Most gfp.h
    inclusion directives were ignored as stuff from gfp.h was usually
    widely available and often used in preprocessor macros. Each
    slab.h inclusion directive was examined and added manually as
    necessary.

    6. percpu.h was updated not to include slab.h.

    7. Build tests were done on the following configurations and failures
    were fixed. CONFIG_GCOV_KERNEL was turned off for all tests (as my
    distributed build env didn't work with gcov compiles) and a few
    more options had to be turned off depending on archs to make things
    build (like ipr on powerpc/64 which failed due to missing writeq).

    * x86 and x86_64 UP and SMP allmodconfig and a custom test config.
    * powerpc and powerpc64 SMP allmodconfig
    * sparc and sparc64 SMP allmodconfig
    * ia64 SMP allmodconfig
    * s390 SMP allmodconfig
    * alpha SMP allmodconfig
    * um on x86_64 SMP allmodconfig

    8. percpu.h modifications were reverted so that it could be applied as
    a separate patch and serve as bisection point.

    Given the fact that I had only a couple of failures from tests on step
    6, I'm fairly confident about the coverage of this conversion patch.
    If there is a breakage, it's likely to be something in one of the arch
    headers which should be easily discoverable on most builds of
    the specific arch.

    Signed-off-by: Tejun Heo <tj@kernel.org>
    Guess-its-ok-by: Christoph Lameter <cl@linux-foundation.org>
    Cc: Ingo Molnar <mingo@redhat.com>
    Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>

    Tejun Heo
     

26 Feb, 2010

1 commit

  • Anton's commit enabling the use of the lwsync fixup mechanism on 64-bit
    breaks modules. The lwsync fixup section uses .long instead of the
    FTR_ENTRY_OFFSET macro used by other fixups sections, and thus will
    generate 32-bit relocations that our module loader cannot resolve.

    This changes it to use the same type as other feature sections.

    Note however that we might want to consider using 32-bit for all the
    feature fixup offsets and add support for R_PPC_REL32 to module_64.c
    instead as that would reduce the size of the kernel image. I'll leave
    that as an exercise for the reader for now...

    Signed-off-by: Benjamin Herrenschmidt

    Benjamin Herrenschmidt
     

17 Feb, 2010

3 commits

  • Here is a patch from Paul Mackerras that improves the ppc64 copy_tofrom_user.
    The loop now does 32 bytes at a time as well as pairing loads and stores.

    A quick test case that reads 8kB over and over shows the improvement:

    POWER6: 53% faster
    POWER7: 51% faster

    #define _XOPEN_SOURCE 500
    #include <stdlib.h>
    #include <stdio.h>
    #include <unistd.h>
    #include <fcntl.h>
    #include <sys/types.h>
    #include <sys/stat.h>

    #define BUFSIZE (8 * 1024)
    #define ITERATIONS 10000000

    int main(void)
    {
            char tmpfile[] = "/tmp/copy_to_user_testXXXXXX";
            int fd;
            char buf[BUFSIZE] = { 0 };
            unsigned long i;

            /* Create a scratch file holding one buffer's worth of data. */
            fd = mkstemp(tmpfile);
            if (fd < 0) {
                    perror("mkstemp");
                    exit(1);
            }

            if (write(fd, buf, BUFSIZE) != BUFSIZE) {
                    perror("write");
                    exit(1);
            }

            /* Read the same 8kB back repeatedly to exercise copy_to_user. */
            for (i = 0; i < ITERATIONS; i++) {
                    if (pread(fd, buf, BUFSIZE, 0) != BUFSIZE) {
                            perror("pread");
                            exit(1);
                    }
            }

            unlink(tmpfile);

            return 0;
    }

    Signed-off-by: Anton Blanchard
    Signed-off-by: Benjamin Herrenschmidt

    Anton Blanchard
     
  • A number of our chips like loads and stores to be paired. A small kernel
    module testcase shows the improvement of pairing loads and stores in
    copy_4K_page:

    POWER6: +9%
    POWER7: +1.5%

    #include <linux/module.h>
    #include <linux/mm.h>

    #define ITERATIONS 10000000

    static int __init copypage_init(void)
    {
            struct timespec before, after;
            unsigned long i;
            struct page *destpage, *srcpage;
            char *dest, *src;

            destpage = alloc_page(GFP_KERNEL);
            srcpage = alloc_page(GFP_KERNEL);
            if (!destpage || !srcpage) {
                    /* Release whichever page was allocated before bailing out. */
                    if (destpage)
                            __free_page(destpage);
                    if (srcpage)
                            __free_page(srcpage);
                    return -ENOMEM;
            }

            dest = page_address(destpage);
            src = page_address(srcpage);

            getnstimeofday(&before);

            for (i = 0; i < ITERATIONS; i++)
                    copy_4K_page(dest, src);

            getnstimeofday(&after);

            free_page((unsigned long)dest);
            free_page((unsigned long)src);

            printk(KERN_DEBUG "copy_4K_page loop took %lu ns\n",
                   (after.tv_sec - before.tv_sec) * NSEC_PER_SEC +
                   (after.tv_nsec - before.tv_nsec));

            return 0;
    }

    static void __exit copypage_exit(void)
    {
    }

    module_init(copypage_init)
    module_exit(copypage_exit)
    MODULE_LICENSE("GPL");
    MODULE_AUTHOR("Anton Blanchard");

    Signed-off-by: Anton Blanchard
    Signed-off-by: Benjamin Herrenschmidt

    Anton Blanchard
     
  • do_lwsync_fixups doesn't work on 64bit, we end up writing lwsyncs to the
    wrong addresses:

    0:mon> di c0000001000bfacc
    c0000001000bfacc 7c2004ac lwsync

    Since the lwsync section has negative offsets we need to use a signed int
    pointer so we sign extend the value.
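
The effect of the pointer type can be reproduced in plain C. This is a sketch with hypothetical function names, assuming each fixup entry stores a 32-bit offset relative to its own address. Reading a negative offset through an unsigned type zero-extends it on 64-bit, so the computed target lands about 4GB above where it should.

```c
#include <stdint.h>

/* Offset read through an unsigned int pointer: zero-extended on 64-bit. */
static uint64_t fixup_target_buggy(uint64_t entry_addr, uint32_t raw)
{
	return entry_addr + raw;
}

/* Offset read through a signed int pointer: sign-extended as intended. */
static uint64_t fixup_target_fixed(uint64_t entry_addr, uint32_t raw)
{
	return entry_addr + (uint64_t)(int64_t)(int32_t)raw;
}
```

With an entry at 0xc000000000100000 and a stored offset of -16, the fixed version yields 0xc0000000000ffff0 while the buggy one yields 0xc0000001000ffff0.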

    Signed-off-by: Anton Blanchard
    Signed-off-by: Benjamin Herrenschmidt

    Anton Blanchard
     

15 Dec, 2009

3 commits

  • Not strictly necessary for -rt as -rt does not have non sleeping
    rwlocks, but it's odd to not have a consistent naming convention.

    No functional change.

    Signed-off-by: Thomas Gleixner
    Acked-by: Peter Zijlstra
    Acked-by: David S. Miller
    Acked-by: Ingo Molnar
    Cc: linux-arch@vger.kernel.org

    Thomas Gleixner
     
  • Name space cleanup. No functional change.

    Signed-off-by: Thomas Gleixner
    Acked-by: Peter Zijlstra
    Acked-by: David S. Miller
    Acked-by: Ingo Molnar
    Cc: linux-arch@vger.kernel.org

    Thomas Gleixner
     
  • The raw_spin* namespace was taken by lockdep for the architecture
    specific implementations. raw_spin_* would be the ideal name space for
    the spinlocks which are not converted to sleeping locks in preempt-rt.

    Linus suggested to convert the raw_ to arch_ locks and cleanup the
    name space instead of using an artificial name like core_spin,
    atomic_spin or whatever

    No functional change.

    Signed-off-by: Thomas Gleixner
    Acked-by: Peter Zijlstra
    Acked-by: David S. Miller
    Acked-by: Ingo Molnar
    Cc: linux-arch@vger.kernel.org

    Thomas Gleixner
     

09 Dec, 2009

2 commits


28 Oct, 2009

1 commit


16 Jun, 2009

1 commit

  • Add the option to build the code under arch/powerpc with -Werror.

    The intention is to make it harder for people to inadvertently introduce
    warnings in the arch/powerpc code. It needs to be configurable so that
    if a warning is introduced, people can easily work around it while it's
    being fixed.

    The option is a negative, ie. don't enable -Werror, so that it will be
    turned on for allyes and allmodconfig builds.

    The default is n, in the hope that developers will build with -Werror.
    That will probably lead to some build breaks; I am prepared to be flamed.

    It's not enabled for math-emu, which is a steaming pile of warnings.

    Signed-off-by: Michael Ellerman
    Signed-off-by: Benjamin Herrenschmidt

    Michael Ellerman
     

27 May, 2009

2 commits

  • (pre-requisite to make the next patches more palatable)

    Signed-off-by: Benjamin Herrenschmidt

    Benjamin Herrenschmidt
     
  • This reverts commit 33f00dcedb0e22cdb156a23632814fc580fcfcf8.

    While it was a good idea to try to use the mm/vmalloc.c allocator instead
    of our own (in fact, ours is itself a dup on an old variant of the vmalloc
    one), unfortunately, the approach is terminally busted since
    dma_alloc_coherent() can be called at interrupt time or in atomic contexts
    and there's little chance we'll make the code in mm/vmalloc.c cope with that :-(

    Until we can get the generic code to forbid that idiocy and fix all
    drivers abusing it, we pretty much have no choice but revert to
    our custom virtual space allocator.

    There's also a problem with SMP safety since freeing such mapping
    would require an IPI which cannot be done at interrupt time.

    However, right now, I don't think we support any platform that is
    both SMP and has non-coherent DMA (don't laugh, I know such things
    do exist !) so we can sort that out later.

    Signed-off-by: Benjamin Herrenschmidt

    Benjamin Herrenschmidt