10 Jul, 2012

2 commits

  • The mtocrf define is just a wrapper around the real instruction, so we
    can just use real register names here (i.e. lower case).

    Also remove braces in macro so this is possible.
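    As a rough illustration of the difference (hypothetical macros and a
    simplified encoding, not the kernel's actual ppc-opcode.h):

        #include <stdio.h>

        /* A wrapper define just re-emits the real mnemonic as text, so the
         * register token reaches the assembler unchanged and the usual
         * lower-case names (r5, ...) work. */
        #define MTOCRF(fxm, rs) "mtocrf " #fxm "," #rs

        /* A constructed instruction instead ORs the register *number* into
         * an opcode word, so its argument must be a plain integer (field
         * shifts simplified for illustration). */
        #define INST_MTOCRF 0x7c100120u
        #define CONSTRUCT_MTOCRF(fxm, rs) \
                (INST_MTOCRF | ((fxm) << 12) | ((rs) << 21))

        int main(void)
        {
            puts(MTOCRF(0x01, r5));                        /* mtocrf 0x01,r5 */
            printf("0x%08x\n", CONSTRUCT_MTOCRF(0x01, 5)); /* needs plain 5 */
            return 0;
        }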

    Signed-off-by: Michael Neuling
    Signed-off-by: Benjamin Herrenschmidt

  • Anything that uses a constructed instruction (i.e. one built from
    ppc-opcode.h) needs to use the new R0 macro, as %r0 is not going to work.

    Also convert usages of macros where we are just computing an offset
    (usually for a load/store), like:

        std r14,STK_REG(r14)(r1)

    We can't use STK_REG(r14), as %r14 doesn't work in the STK_REG macro,
    which is just calculating an offset; the sketch below shows why.
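    A minimal sketch of why the offset macros need numeric register defines,
    using stand-ins approximating the kernel's ppc_asm.h and ppc-opcode.h
    definitions (the real headers may differ in detail):

        #include <stdio.h>

        /* Stand-ins for the kernel's defines, for illustration only. */
        #define STK_REG(i) (112 + ((i) - 14) * 8) /* offset: needs a number */
        #define R14 14                            /* numeric register define */

        int main(void)
        {
            /* STK_REG(R14) -> 112 + (14 - 14) * 8 = 112, a usable offset. */
            printf("STK_REG(R14) = %d\n", STK_REG(R14));

            /* STK_REG(%r14) would expand to 112 + (%r14 - 14) * 8, which
             * neither the preprocessor nor the assembler can evaluate,
             * hence the numeric R0..R31 macros. */
            return 0;
        }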

    Signed-off-by: Michael Neuling
    Signed-off-by: Benjamin Herrenschmidt


03 Jul, 2012

1 commit

  • Implement a POWER7 optimised memcpy using VMX and enhanced prefetch
    instructions.

    This is a copy of the POWER7 optimised copy_to_user/copy_from_user
    loop. Detailed implementation and performance details can be found in
    commit a66086b8197d (powerpc: POWER7 optimised
    copy_to_user/copy_from_user using VMX).

    I noticed memcpy issues when profiling a RAID6 workload:

    .memcpy
    .async_memcpy
    .async_copy_data
    .__raid_run_ops
    .handle_stripe
    .raid5d
    .md_thread

    I created a simplified testcase by building a RAID6 array with 4 1GB
    ramdisks (booting with brd.rd_size=1048576):

    # mdadm -CR -e 1.2 /dev/md0 --level=6 -n4 /dev/ram[0-3]

    I then timed how long it took to write to the entire array:

    # dd if=/dev/zero of=/dev/md0 bs=1M

    Before: 892 MB/s
    After: 999 MB/s

    A 12% improvement.
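    For illustration only, a rough user-space sketch of how one might measure
    hot-cache memcpy() bandwidth (arbitrary buffer size and timing choices;
    the numbers above came from the dd run to the array, not from this):

        #include <stdio.h>
        #include <stdlib.h>
        #include <string.h>
        #include <time.h>

        int main(void)
        {
            size_t len = 1 << 20;          /* 1 MB buffer */
            int iters = 4096;
            char *src = malloc(len), *dst = malloc(len);
            if (!src || !dst)
                return 1;
            memset(src, 0xa5, len);        /* touch pages, warm the cache */

            struct timespec t0, t1;
            clock_gettime(CLOCK_MONOTONIC, &t0);
            for (int i = 0; i < iters; i++)
                memcpy(dst, src, len);
            clock_gettime(CLOCK_MONOTONIC, &t1);

            double secs = (t1.tv_sec - t0.tv_sec) +
                          (t1.tv_nsec - t0.tv_nsec) / 1e9;
            printf("%.0f MB/s\n", (double)len * iters / secs / 1e6);
            free(src);
            free(dst);
            return 0;
        }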

    Signed-off-by: Anton Blanchard
    Signed-off-by: Benjamin Herrenschmidt


30 Apr, 2012

1 commit

  • Remove CONFIG_POWER4_ONLY. The option is badly named and only does two
    things:

    - It wraps the MMU segment table code. With feature fixups there is
      little downside to compiling this in.

    - It uses the newer mtocrf instruction in various assembly functions.
      Instead of making this a compile option, just do it at runtime via a
      feature fixup (a sketch of the patching idea follows the list).
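    A minimal C model of the runtime feature-fixup idea: boot code rewrites
    alternative instruction slots based on the detected CPU feature mask.
    Names, encodings and layout here are illustrative, not the kernel's
    actual implementation (which uses linker sections and feature-fixups.c):

        #include <stdint.h>
        #include <stdio.h>

        #define CPU_FTR_HAS_MTOCRF (1u << 0)  /* hypothetical feature bit */

        struct alt_entry {
            uint32_t *site;      /* instruction word to patch at boot */
            uint32_t alt_insn;   /* replacement when feature is present */
            uint32_t features;   /* required feature mask */
        };

        static void apply_fixups(struct alt_entry *e, int n, uint32_t cur)
        {
            for (int i = 0; i < n; i++)
                if ((e[i].features & cur) == e[i].features)
                    *e[i].site = e[i].alt_insn;  /* patch in mtocrf form */
        }

        int main(void)
        {
            uint32_t text[1] = { 0x7c8ff120u };  /* pretend: generic mtcrf */
            struct alt_entry table[] = {
                { &text[0], 0x7c801120u, CPU_FTR_HAS_MTOCRF }, /* pretend: mtocrf */
            };

            apply_fixups(table, 1, CPU_FTR_HAS_MTOCRF); /* CPU has the bit */
            printf("patched insn: 0x%08x\n", text[0]);
            return 0;
        }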

    Signed-off-by: Anton Blanchard
    Signed-off-by: Benjamin Herrenschmidt


26 Feb, 2009

1 commit

  • This fixes a regression introduced by commit
    25d6e2d7c58ddc4a3b614fc5381591c0cfe66556 ("powerpc: Update 64bit memcpy()
    using CPU_FTR_UNALIGNED_LD_STD").

    That commit allowed CPUs with the CPU_FTR_UNALIGNED_LD_STD feature bit
    set to do memcpy() with unaligned load doubles. But along with this came
    a bug where the final load double could read bytes beyond a page
    boundary and into the next (unmapped) page. This was caught by enabling
    CONFIG_DEBUG_PAGEALLOC.

    The fix was to read only the number of bytes we need to store, rather
    than reading a full 8-byte doubleword and storing only a portion of it.
    To minimise the amount of existing code touched, we use the original
    do_tail for the src_unaligned case. A sketch of the failure mode and
    the fix follows.
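    A minimal C sketch of the overread and the corrected tail, with made-up
    helper names (the real code is powerpc assembly in memcpy_64.S):

        #include <string.h>
        #include <stdint.h>

        /* Buggy tail: always load a full 8-byte doubleword, then store only
         * the bytes we need. If src + n is the last mapped byte, the 8-byte
         * load reads past the end and can fault on the next page. */
        static void tail_overread(unsigned char *dst,
                                  const unsigned char *src, size_t n)
        {
            uint64_t v;
            memcpy(&v, src, sizeof(v));  /* may touch bytes beyond src + n */
            memcpy(dst, &v, n);
        }

        /* Fixed tail: read exactly the n bytes that will be stored, as a
         * byte-at-a-time do_tail-style loop, so no access crosses src + n. */
        static void tail_exact(unsigned char *dst,
                               const unsigned char *src, size_t n)
        {
            for (size_t i = 0; i < n; i++)
                dst[i] = src[i];
        }

        int main(void)
        {
            unsigned char src[8] = "abcdefg", dst[8] = { 0 };
            tail_exact(dst, src, 3);     /* safe for any n */
            tail_overread(dst, src, 3);  /* fine here, not at a page edge */
            return 0;
        }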

    Below is an example of the regression, as reported by Sachin Sant:

    Unable to handle kernel paging request for data at address 0xc00000003f380000
    Faulting instruction address: 0xc000000000039574
    cpu 0x1: Vector: 300 (Data Access) at [c00000003baf3020]
    pc: c000000000039574: .memcpy+0x74/0x244
    lr: d00000000244916c: .ext3_xattr_get+0x288/0x2f4 [ext3]
    sp: c00000003baf32a0
    msr: 8000000000009032
    dar: c00000003f380000
    dsisr: 40000000
    current = 0xc00000003e54b010
    paca = 0xc000000000a53680
    pid = 1840, comm = readahead
    enter ? for help
    [link register ] d00000000244916c .ext3_xattr_get+0x288/0x2f4 [ext3]
    [c00000003baf32a0] d000000002449104 .ext3_xattr_get+0x220/0x2f4 [ext3] (unreliable)
    [c00000003baf3390] d00000000244a6e8 .ext3_xattr_security_get+0x40/0x5c [ext3]
    [c00000003baf3400] c000000000148154 .generic_getxattr+0x74/0x9c
    [c00000003baf34a0] c000000000333400 .inode_doinit_with_dentry+0x1c4/0x678
    [c00000003baf3560] c00000000032c6b0 .security_d_instantiate+0x50/0x68
    [c00000003baf35e0] c00000000013c818 .d_instantiate+0x78/0x9c
    [c00000003baf3680] c00000000013ced0 .d_splice_alias+0xf0/0x120
    [c00000003baf3720] d00000000243e05c .ext3_lookup+0xec/0x134 [ext3]
    [c00000003baf37c0] c000000000131e74 .do_lookup+0x110/0x260
    [c00000003baf3880] c000000000134ed0 .__link_path_walk+0xa98/0x1010
    [c00000003baf3970] c0000000001354a0 .path_walk+0x58/0xc4
    [c00000003baf3a20] c000000000135720 .do_path_lookup+0x138/0x1e4
    [c00000003baf3ad0] c00000000013645c .path_lookup_open+0x6c/0xc8
    [c00000003baf3b70] c000000000136780 .do_filp_open+0xcc/0x874
    [c00000003baf3d10] c0000000001251e0 .do_sys_open+0x80/0x140
    [c00000003baf3dc0] c00000000016aaec .compat_sys_open+0x24/0x38
    [c00000003baf3e30] c00000000000855c syscall_exit+0x0/0x40

    Signed-off-by: Benjamin Herrenschmidt


05 Nov, 2008

1 commit

  • Update memcpy() to add two new feature sections: one for aligning the
    destination before copying and one for copying using aligned load
    and store doubles.

    These new feature sections will only affect Power6 and Cell because
    the CPU feature bit was only added to these two processors.

    Power6 gets its best performance in memcpy() when aligning neither the
    source nor the destination, while Cell gets its best performance when
    just the destination is aligned. But in order to save on CPU feature
    bits we can use the previously added CPU_FTR_CP_USE_DCBTZ feature bit
    to differentiate between Power6 and Cell (because CPU_FTR_CP_USE_DCBTZ
    was added to Cell but not Power6).

    The first feature section acts to nop out the branch that takes us to
    the code that aligns us to an eight byte boundary for the destination.
    We only want to nop out this branch on Power6.

    So the ALT_FTR_SECTION_END() for this feature section creates a test
    mask of the two feature bits ORed together and an expected result of
    just CPU_FTR_UNALIGNED_LD_STD; thus we nop out the branch on a CPU that
    has CPU_FTR_UNALIGNED_LD_STD set and CPU_FTR_CP_USE_DCBTZ unset (see
    the sketch below).
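    A small C sketch of the mask/expected-value test this encodes (the bit
    positions here are illustrative, not the kernel's actual values):

        #include <stdio.h>

        #define CPU_FTR_UNALIGNED_LD_STD (1u << 0)  /* illustrative bits */
        #define CPU_FTR_CP_USE_DCBTZ     (1u << 1)

        /* Nop out the branch only when the masked feature bits equal the
         * expected value: UNALIGNED_LD_STD set *and* CP_USE_DCBTZ clear,
         * i.e. a Power6-like CPU but not a Cell-like one. */
        static int should_nop_branch(unsigned int cpu_ftrs)
        {
            unsigned int mask = CPU_FTR_UNALIGNED_LD_STD |
                                CPU_FTR_CP_USE_DCBTZ;
            unsigned int expected = CPU_FTR_UNALIGNED_LD_STD;
            return (cpu_ftrs & mask) == expected;
        }

        int main(void)
        {
            printf("Power6-like: %d\n",
                   should_nop_branch(CPU_FTR_UNALIGNED_LD_STD));
            printf("Cell-like:   %d\n",
                   should_nop_branch(CPU_FTR_UNALIGNED_LD_STD |
                                     CPU_FTR_CP_USE_DCBTZ));
            return 0;
        }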

    For the second feature section added, if we're on a CPU that has the
    CPU_FTR_UNALIGNED_LD_STD bit set then we don't want to do the copy
    with aligned loads and stores (and the appropriate shifting left and
    right instructions), so we want to nop out the branch to
    .Lsrc_unaligned.

    The andi. used for this branch is moved to just above the branch, which
    lets us nop out both instructions with a single feature section; this
    gives us better performance and avoids the readability hit of two
    separate feature sections.

    Moving the andi. to just above the branch doesn't have any noticeable
    negative effect on the remaining 64bit processors (the ones that
    didn't have this feature bit added).

    On Cell this simple modification results in an improvement to measured
    memcpy() bandwidth of up to 50% in the hot cache case and up to 15% in
    the cold cache case.

    On Power6 we get memory bandwidth results that are up to three times
    faster in the hot cache case and up to 50% faster in the cold cache
    case.

    Commit 2a9294369bd020db89bfdf78b84c3615b39a5c84 ("powerpc: Add new CPU
    feature: CPU_FTR_CP_USE_DCBTZ") was where CPU_FTR_CP_USE_DCBTZ was
    added.

    Saying that Cell gets its best performance in memcpy() with just the
    destination aligned is true, but only because the indirect shift and
    rotate instructions, sld and srd, are microcoded on Cell. This means
    that either the destination or the source can be aligned, but not both;
    and since we get better performance with the destination aligned, we
    choose that option.

    While we're at it, make a one-line change from cmpldi r1,... to
    cmpldi cr1,... for consistency.

    Signed-off-by: Mark Nelson
    Signed-off-by: Paul Mackerras


13 Apr, 2007

1 commit


31 Aug, 2006

1 commit


10 Feb, 2006

1 commit


10 Oct, 2005

1 commit