07 Jan, 2012

1 commit

  • * 'x86-asm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (21 commits)
    x86: Fix atomic64_xxx_cx8() functions
    x86: Fix and improve cmpxchg_double{,_local}()
    x86_64, asm: Optimise fls(), ffs() and fls64()
    x86, bitops: Move fls64.h inside __KERNEL__
    x86: Fix and improve percpu_cmpxchg{8,16}b_double()
    x86: Report cpb and eff_freq_ro flags correctly
    x86/i386: Use less assembly in strlen(), speed things up a bit
    x86: Use the same node_distance for 32 and 64-bit
    x86: Fix rflags in FAKE_STACK_FRAME
    x86: Clean up and extend do_int3()
    x86: Call do_notify_resume() with interrupts enabled
    x86/div64: Add a micro-optimization shortcut if base is power of two
    x86-64: Cleanup some assembly entry points
    x86-64: Slightly shorten line system call entry and exit paths
    x86-64: Reduce amount of redundant code generated for invalidate_interruptNN
    x86-64: Slightly shorten int_ret_from_sys_call
    x86, efi: Convert efi_phys_get_time() args to physical addresses
    x86: Default to vsyscall=emulate
    x86-64: Set siginfo and context on vsyscall emulation faults
    x86: consolidate xchg and xadd macros
    ...

    Linus Torvalds
     

13 Dec, 2011

1 commit

  • The current i386 strlen() hardcodes a NOT/DEC sequence. DEC is
    reported to be suboptimal on Core2. So put only the REPNE SCASB
    sequence in assembly and let the compiler do the rest.

    The difference in generated code is like below (MCORE2=y):

    push %edi
    mov $0xffffffff,%ecx
    mov %eax,%edi
    xor %eax,%eax
    repnz scas %es:(%edi),%al
    not %ecx

    - dec %ecx
    - mov %ecx,%eax
    + lea -0x1(%ecx),%eax

    pop %edi
    ret

    Signed-off-by: Alexey Dobriyan
    Cc: Linus Torvalds
    Cc: Jan Beulich
    Link: http://lkml.kernel.org/r/20111211181319.GA17097@p183.telecom.by
    Signed-off-by: Ingo Molnar

    Alexey Dobriyan
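
    A hedged user-space sketch of the approach described in this entry,
    using GCC extended inline asm: only the REPNE SCASB scan stays in
    assembly, and the compiler is left to turn "~count - 1" into NOT/DEC
    or NOT/LEA as it sees fit. The kernel's actual routine lives in
    arch/x86/lib/string_32.c and differs in detail.

    #include <stddef.h>

    static size_t strlen_scasb(const char *s)
    {
            unsigned long count = -1UL;     /* scan "forever" */

            /* repne scasb advances the destination index register until
             * it finds the NUL byte in %al, decrementing the counter
             * once per byte examined. */
            asm("repne scasb"
                : "+D" (s), "+c" (count)
                : "a" (0)
                : "memory", "cc");

            return ~count - 1;              /* iterations taken, minus the NUL */
    }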
     

05 Dec, 2011

2 commits

  • Since the new Intel Software Developer's Manual introduces a new
    format for the AVX instruction set (including AVX2), update
    x86-opcode-map.txt to reflect those changes.

    Signed-off-by: Masami Hiramatsu
    Cc: "H. Peter Anvin"
    Cc: yrl.pp-manager.tt@hitachi.com
    Link: http://lkml.kernel.org/r/20111205120557.15475.13236.stgit@cloud
    Signed-off-by: Ingo Molnar

    Masami Hiramatsu
     
  • To reduce the memory usage of the attribute table, the x86 instruction
    decoder puts the "Group" attribute only in the "no-last-prefix"
    attribute table (the same as the vex_p == 0 case).

    Thus, the decoder should consult the no-last-prefix table first and,
    only if the opcode is not a group, move on to the "with-last-prefix"
    table (vex_p != 0).

    However, in the current implementation inat_get_avx_attribute()
    looks up the with-last-prefix table directly. So, when decoding a
    grouped AVX instruction, the decoder fails to find the correct group
    because there is no "Group" attribute in that table. This ends up
    mis-decoding instructions, as Ingo reported in
    http://thread.gmane.org/gmane.linux.kernel/1214103

    This patch fixes it to check the no-last-prefix table first even for
    an AVX instruction, and to take an attribute from the
    "with-last-prefix" table only if the opcode is not a group.

    Reported-by: Ingo Molnar
    Signed-off-by: Masami Hiramatsu
    Cc: "H. Peter Anvin"
    Cc: yrl.pp-manager.tt@hitachi.com
    Link: http://lkml.kernel.org/r/20111205120539.15475.91428.stgit@cloud
    Signed-off-by: Ingo Molnar

    Masami Hiramatsu
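
    A simplified, self-contained C sketch of the lookup order this fix
    establishes. The table layout, the INAT_GROUP flag value and the
    function name are illustrative stand-ins, not the real inat.c code.

    #include <stdio.h>

    typedef unsigned int insn_attr_t;
    #define INAT_GROUP 0x100u               /* illustrative "is a group" flag */

    /* Toy attribute tables: index 0 is the no-last-prefix (vex_p == 0)
     * table, indices 1..3 are the with-last-prefix tables. */
    static const insn_attr_t avx_table[4][256] = {
            [0][0x71] = INAT_GROUP | 0x1,   /* a grouped opcode */
            [1][0x71] = 0x2,                /* per-prefix variant, no group flag */
    };

    static insn_attr_t get_avx_attribute(unsigned char opcode, unsigned int vex_p)
    {
            insn_attr_t attr = avx_table[0][opcode];     /* no-last-prefix first */

            if (!(attr & INAT_GROUP) && vex_p)
                    attr = avx_table[vex_p][opcode];     /* then with-last-prefix */

            return attr;
    }

    int main(void)
    {
            /* keeps the GROUP attribute instead of losing it */
            printf("attr=%#x\n", get_avx_attribute(0x71, 2));
            return 0;
    }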
     

10 Oct, 2011

1 commit

  • Harden the x86 instruction decoder against invalid-length
    instructions. This adds a length check to every byte-read site; if the
    read would exceed MAX_INSN_SIZE, the decoder returns immediately. This
    can happen when decoding a user-space binary.

    Callers can check whether this happened by testing whether the
    corresponding insn.*.got member is set.

    Signed-off-by: Masami Hiramatsu
    Cc: Stephane Eranian
    Cc: Andi Kleen
    Cc: acme@redhat.com
    Cc: ming.m.lin@intel.com
    Cc: robert.richter@amd.com
    Cc: ravitillo@lbl.gov
    Cc: yrl.pp-manager.tt@hitachi.com
    Cc: Peter Zijlstra
    Cc: Srikar Dronamraju
    Cc: Peter Zijlstra
    Link: http://lkml.kernel.org/r/20111007133155.10933.58577.stgit@localhost.localdomain
    Signed-off-by: Ingo Molnar

    Masami Hiramatsu
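
    A rough user-space sketch of the bounds-checked byte-read pattern
    described above. The structure and field names are illustrative, not
    the real insn.c helpers, but the idea is the same: never read past
    MAX_INSN_SIZE and let the caller detect truncation via a "got" flag.

    #include <stdio.h>

    #define MAX_INSN_SIZE 16

    struct insn_buf {
            const unsigned char *kaddr;     /* start of the instruction bytes */
            const unsigned char *next;      /* next byte to consume */
            const unsigned char *end;       /* kaddr + MAX_INSN_SIZE */
            int got;                        /* set once a field decoded completely */
    };

    /* Bounds-checked byte read: fails instead of running past
     * MAX_INSN_SIZE, which can happen on garbage user-space bytes. */
    static int get_next_byte(struct insn_buf *b, unsigned char *out)
    {
            if (b->next >= b->end)
                    return -1;              /* caller sees .got == 0 and bails out */
            *out = *b->next++;
            return 0;
    }

    int main(void)
    {
            unsigned char bytes[MAX_INSN_SIZE] = { 0x90 };  /* NOP + padding */
            struct insn_buf b = { bytes, bytes, bytes + sizeof(bytes), 0 };
            unsigned char op;

            if (get_next_byte(&b, &op) == 0) {
                    b.got = 1;
                    printf("opcode byte: %#x\n", op);
            }
            return 0;
    }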
     

27 Jul, 2011

1 commit

  • This allows us to move code currently duplicated in the
    per-architecture atomic headers (atomic_inc_not_zero() for now) into
    the generic atomic header.

    Signed-off-by: Arun Sharma
    Reviewed-by: Eric Dumazet
    Cc: Ingo Molnar
    Cc: David Miller
    Cc: Eric Dumazet
    Acked-by: Mike Frysinger
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Arun Sharma
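
    A user-space sketch, using C11 atomics, of the kind of helper that can
    now live once in a generic header instead of being duplicated per
    architecture. The kernel's real implementation is built on its own
    atomic primitives, so treat this only as an illustration of the
    semantics.

    #include <stdatomic.h>
    #include <stdio.h>

    /* Increment *v unless its current value is u; return non-zero if the
     * add happened. */
    static int atomic_add_unless(atomic_int *v, int a, int u)
    {
            int c = atomic_load(v);

            while (c != u) {
                    /* c is refreshed with the current value on failure */
                    if (atomic_compare_exchange_weak(v, &c, c + a))
                            return 1;
            }
            return 0;
    }

    #define atomic_inc_not_zero(v) atomic_add_unless((v), 1, 0)

    int main(void)
    {
            atomic_int zero = 0, one = 1;

            /* prints "0 1": the zero counter is left alone */
            printf("%d %d\n", atomic_inc_not_zero(&zero), atomic_inc_not_zero(&one));
            return 0;
    }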
     

23 Jul, 2011

2 commits

  • * 'x86-vdso-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
    x86-64, vdso: Do not allocate memory for the vDSO
    clocksource: Change __ARCH_HAS_CLOCKSOURCE_DATA to a CONFIG option
    x86, vdso: Drop now wrong comment
    Document the vDSO and add a reference parser
    ia64: Replace clocksource.fsys_mmio with generic arch data
    x86-64: Move vread_tsc and vread_hpet into the vDSO
    clocksource: Replace vread with generic arch data
    x86-64: Add --no-undefined to vDSO build
    x86-64: Allow alternative patching in the vDSO
    x86: Make alternative instruction pointers relative
    x86-64: Improve vsyscall emulation CS and RIP handling
    x86-64: Emulate legacy vsyscalls
    x86-64: Fill unused parts of the vsyscall page with 0xcc
    x86-64: Remove vsyscall number 3 (venosys)
    x86-64: Map the HPET NX
    x86-64: Remove kernel.vsyscall64 sysctl
    x86-64: Give vvars their own page
    x86-64: Document some of entry_64.S
    x86-64: Fix alignment of jiffies variable

    Linus Torvalds
     
  • * 'x86-asm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
    x86: Fix write lock scalability 64-bit issue
    x86: Unify rwsem assembly implementation
    x86: Unify rwlock assembly implementation
    x86, asm: Fix binutils 2.16 issue with __USER32_CS
    x86, asm: Cleanup thunk_64.S
    x86, asm: Flip RESTORE_ARGS arguments logic
    x86, asm: Flip SAVE_ARGS arguments logic
    x86, asm: Thin down SAVE/RESTORE_* asm macros

    Linus Torvalds
     

22 Jul, 2011

1 commit

  • copy_from_user_nmi() is used in oprofile and perf. Move it to the
    library code, next to other functions like copy_from_user(). As this
    is x86 code shared by 32 and 64 bit, create a new file usercopy.c for
    the unified code.

    Signed-off-by: Robert Richter
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/20110607172413.GJ20052@erda.amd.com
    Signed-off-by: Ingo Molnar

    Robert Richter
     

21 Jul, 2011

3 commits

  • With the write lock path simply subtracting RW_LOCK_BIAS there
    is, on large systems, the theoretical possibility of overflowing
    the 32-bit value that was used so far (namely if 128 or more
    CPUs manage to do the subtraction, but don't get to do the
    inverse addition in the failure path quickly enough).

    A first measure is to modify RW_LOCK_BIAS itself - with the new
    value chosen, it is good for up to 2048 CPUs each allowed to
    nest over 2048 times on the read path without causing an issue.
    Quite possibly it would even be sufficient to adjust the bias a
    little further, assuming that allowing for significantly less
    nesting would suffice.

    However, as the original value chosen allowed for even more
    nesting levels, to support more than 2048 CPUs (possible
    currently only for 64-bit kernels) the lock itself gets widened
    to 64 bits.

    Signed-off-by: Jan Beulich
    Cc: Linus Torvalds
    Cc: Andrew Morton
    Cc: Peter Zijlstra
    Link: http://lkml.kernel.org/r/4E258E0D020000780004E3F0@nat28.tlf.novell.com
    Signed-off-by: Ingo Molnar

    Jan Beulich
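
    A small arithmetic illustration of the overflow window described
    above, assuming the historical 32-bit bias value of 0x01000000; the
    exact new bias and lock layout are in the patch itself.

    #include <stdio.h>
    #include <stdint.h>

    int main(void)
    {
            const int64_t old_bias = 0x01000000;  /* assumed pre-patch RW_LOCK_BIAS */
            int64_t delta = 128 * old_bias;       /* 128 writers, no failure-path re-add yet */

            /* The accumulated subtraction already equals 2^31, which a signed
             * 32-bit lock word cannot represent; a 64-bit word has ample room. */
            printf("128 * %#llx = %#llx (INT32_MAX is %#lx)\n",
                   (long long)old_bias, (long long)delta, (long)INT32_MAX);

            /* The figure quoted above: 2048 CPUs nesting the read path 2048 deep. */
            printf("2048 * 2048 = %lld read-side slots\n", 2048LL * 2048LL);
            return 0;
    }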
     
  • Rather than having two functionally identical implementations
    for 32- and 64-bit configurations, use the previously extended
    assembly abstractions to fold the two rwsem implementations into
    a shared one.

    Signed-off-by: Jan Beulich
    Cc: Linus Torvalds
    Cc: Andrew Morton
    Cc: Peter Zijlstra
    Link: http://lkml.kernel.org/r/4E258DF3020000780004E3ED@nat28.tlf.novell.com
    Signed-off-by: Ingo Molnar

    Jan Beulich
     
  • Rather than having two functionally identical implementations
    for 32- and 64-bit configurations, extend the existing assembly
    abstractions enough to fold the two rwlock implementations into
    a shared one.

    Signed-off-by: Jan Beulich
    Cc: Linus Torvalds
    Cc: Andrew Morton
    Cc: Peter Zijlstra
    Link: http://lkml.kernel.org/r/4E258DD7020000780004E3EA@nat28.tlf.novell.com
    Signed-off-by: Ingo Molnar

    Jan Beulich
     

04 Jun, 2011

1 commit

  • Drop thunk_ra macro in favor of an additional argument to the thunk
    macro since their bodies are almost identical. Do a whitespace scrubbing
    and use CFI-aware macros for full annotation.

    Signed-off-by: Borislav Petkov
    Link: http://lkml.kernel.org/r/1306873314-32523-5-git-send-email-bp@alien8.de
    Signed-off-by: H. Peter Anvin

    Borislav Petkov
     

20 May, 2011

1 commit

  • …inus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip

    * 'x86-apic-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
    x86, apic: Print verbose error interrupt reason on apic=debug

    * 'x86-asm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
    x86: Demacro CONFIG_PARAVIRT cpu accessors

    * 'x86-cleanups-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
    x86: Fix mrst sparse complaints
    x86: Fix spelling error in the memcpy() source code comment
    x86, mpparse: Remove unnecessary variable

    Linus Torvalds
     

18 May, 2011

6 commits

  • As reported in BZ #30352:

    https://bugzilla.kernel.org/show_bug.cgi?id=30352

    there's a kernel bug related to reading the last allowed page on x86_64.

    The _copy_to_user() and _copy_from_user() functions use the following
    check for address limit:

    if (buf + size >= limit)
            fail();

    while it should be more permissive:

    if (buf + size > limit)
            fail();

    That's because size is the number of bytes being read/written
    starting at (and including) the buf address, so the copy function
    never actually touches the limit address, even when
    "buf + size == limit".

    Following program fails to use the last page as buffer
    due to the wrong limit check:

    #include <assert.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <sys/socket.h>

    #define PAGE_SIZE (4096)
    #define LAST_PAGE ((void*)(0x7fffffffe000))

    int main(void)
    {
            int fds[2], err;
            void *ptr = mmap(LAST_PAGE, PAGE_SIZE, PROT_READ | PROT_WRITE,
                             MAP_ANONYMOUS | MAP_PRIVATE | MAP_FIXED, -1, 0);

            assert(ptr == LAST_PAGE);
            err = socketpair(AF_LOCAL, SOCK_STREAM, 0, fds);
            assert(err == 0);
            err = send(fds[0], ptr, PAGE_SIZE, 0);
            perror("send");
            assert(err == PAGE_SIZE);
            err = recv(fds[1], ptr, PAGE_SIZE, MSG_WAITALL);
            perror("recv");
            assert(err == PAGE_SIZE);
            return 0;
    }

    The other place checking the addr limit is the access_ok() function,
    which is working properly. There's just a misleading comment
    for the __range_not_ok() macro - which this patch fixes as well.

    The last page of the user-space address range is a guard page, and
    Brian Gerst observed that the guard page itself is unusable anyway due
    to an erratum on K8 CPUs (#121 Sequential Execution Across
    Non-Canonical Boundary Causes Processor Hang).

    However, the test code is using the last valid page before the guard page.
    The bug is that the last byte before the guard page can't be read
    because of the off-by-one error. The guard page is left in place.

    This bug would normally not show up because the last page is
    part of the process stack and never accessed via syscalls.

    Signed-off-by: Jiri Olsa
    Acked-by: Brian Gerst
    Acked-by: Linus Torvalds
    Cc:
    Link: http://lkml.kernel.org/r/1305210630-7136-1-git-send-email-jolsa@redhat.com
    Signed-off-by: Ingo Molnar

    Jiri Olsa
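
    A tiny demonstration of the boundary arithmetic, with illustrative
    addresses standing in for the real user-space limit:

    #include <stdio.h>

    int main(void)
    {
            unsigned long limit = 0x7ffffffff000UL;   /* illustrative address limit */
            unsigned long buf   = limit - 0x1000;     /* last valid page */
            unsigned long size  = 0x1000;

            /* Bytes touched are buf .. buf+size-1, so buf+size == limit is fine. */
            printf("old check rejects: %d\n", buf + size >= limit);  /* 1: wrongly fails */
            printf("new check rejects: %d\n", buf + size >  limit);  /* 0: allowed */
            return 0;
    }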
     
  • Support memset() with enhanced rep stosb. On processors supporting enhanced
    REP MOVSB/STOSB, the alternative memset_c_e function using enhanced rep stosb
    overrides the fast string alternative memset_c and the original function.

    Signed-off-by: Fenghua Yu
    Link: http://lkml.kernel.org/r/1305671358-14478-10-git-send-email-fenghua.yu@intel.com
    Signed-off-by: H. Peter Anvin

    Fenghua Yu
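
    A hedged user-space sketch of the core of such a memset: on ERMS
    hardware a bare rep stosb is the whole fill. The kernel selects
    between variants via the alternatives mechanism; this is only the
    inner loop, not memset_c_e itself.

    #include <stddef.h>
    #include <stdio.h>

    static void *memset_erms(void *dst, int c, size_t n)
    {
            void *d = dst;

            /* store AL to [rdi], rcx times */
            asm volatile("rep stosb"
                         : "+D" (d), "+c" (n)
                         : "a" (c)
                         : "memory");
            return dst;
    }

    int main(void)
    {
            char buf[32];

            memset_erms(buf, 'x', sizeof(buf) - 1);
            buf[31] = '\0';
            printf("%s\n", buf);
            return 0;
    }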
     
  • Support memmove() by enhanced rep movsb. On processors supporting enhanced
    REP MOVSB/STOSB, the alternative memmove() function using enhanced rep movsb
    overrides the original function.

    The patch doesn't change the backward memmove case to use enhanced rep
    movsb.

    Signed-off-by: Fenghua Yu
    Link: http://lkml.kernel.org/r/1305671358-14478-9-git-send-email-fenghua.yu@intel.com
    Signed-off-by: H. Peter Anvin

    Fenghua Yu
     
  • Support memcpy() with enhanced rep movsb. On processors supporting
    enhanced rep movsb, the alternative memcpy() function using enhanced
    rep movsb overrides the original function and the fast string
    function.

    Signed-off-by: Fenghua Yu
    Link: http://lkml.kernel.org/r/1305671358-14478-8-git-send-email-fenghua.yu@intel.com
    Signed-off-by: H. Peter Anvin

    Fenghua Yu
     
  • Support copy_to_user/copy_from_user() by enhanced REP MOVSB/STOSB.
    On processors supporting enhanced REP MOVSB/STOSB, the alternative
    copy_user_enhanced_fast_string function using enhanced rep movsb overrides the
    original function and the fast string function.

    Signed-off-by: Fenghua Yu
    Link: http://lkml.kernel.org/r/1305671358-14478-7-git-send-email-fenghua.yu@intel.com
    Signed-off-by: H. Peter Anvin

    Fenghua Yu
     
  • Intel processors are adding enhancements to REP MOVSB/STOSB and the use of
    REP MOVSB/STOSB for optimal memcpy/memset or similar functions is recommended.
    Enhancement availability is indicated by CPUID.7.0.EBX[9] (Enhanced REP MOVSB/
    STOSB).

    Support clear_page() with rep stosb on processors supporting enhanced
    REP MOVSB/STOSB. On such processors, the alternative clear_page_c_e
    function using enhanced REP STOSB overrides the original function and
    the fast string function.

    Signed-off-by: Fenghua Yu
    Link: http://lkml.kernel.org/r/1305671358-14478-6-git-send-email-fenghua.yu@intel.com
    Signed-off-by: H. Peter Anvin

    Fenghua Yu
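
    A small user-space check for the feature bit named above
    (CPUID.7.0.EBX[9]), using the cpuid.h helper shipped with recent
    GCC/Clang:

    #include <cpuid.h>
    #include <stdio.h>

    static int has_erms(void)
    {
            unsigned int eax, ebx, ecx, edx;

            /* leaf 7, subleaf 0: structured extended feature flags */
            if (!__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx))
                    return 0;
            return !!(ebx & (1u << 9));     /* bit 9: enhanced REP MOVSB/STOSB */
    }

    int main(void)
    {
            printf("ERMS: %s\n", has_erms() ? "yes" : "no");
            return 0;
    }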
     

26 Jan, 2011

1 commit

  • memmove_64.c only implements the memmove() function, which is written
    entirely in inline assembly, so it doesn't make sense to keep that
    assembly code in a .c file.

    Currently memmove() doesn't store its return value in rax, which can
    cause problems if a caller uses the return value. The patch fixes this
    issue.

    Signed-off-by: Fenghua Yu
    LKML-Reference:
    Signed-off-by: H. Peter Anvin

    Fenghua Yu
     

25 Sep, 2010

1 commit

  • The movs instruction combines data accesses to accelerate copying,
    but there are two cases we need to be careful about:

    1. movs needs a long latency to start up, so for small copies we use
    general mov instructions to copy the data.
    2. movs is not good for the unaligned case; even if the source offset
    is 0x10 and the destination offset is 0x0, we avoid movs and handle
    the case with general mov instructions.

    Signed-off-by: Ma Ling
    LKML-Reference:
    Signed-off-by: H. Peter Anvin

    Ma Ling
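
    A rough user-space illustration of that policy: plain moves for short
    or mutually misaligned copies, rep movsl for the bulk of larger
    aligned ones. The 64-byte cutoff is invented, not the patch's actual
    threshold, and the real routine is assembly.

    #include <stddef.h>
    #include <stdio.h>

    static void *memcpy_sketch(void *dst, const void *src, size_t n)
    {
            unsigned char *d = dst;
            const unsigned char *s = src;
            size_t dwords, tail;

            if (n < 64 || (((unsigned long)d ^ (unsigned long)s) & 3)) {
                    while (n--)
                            *d++ = *s++;            /* general mov path */
                    return dst;
            }

            while ((unsigned long)d & 3) {          /* align the destination */
                    *d++ = *s++;
                    n--;
            }
            dwords = n / 4;
            tail = n % 4;
            asm volatile("rep movsl"                /* bulk 4-byte string move */
                         : "+D" (d), "+S" (s), "+c" (dwords)
                         : : "memory");
            while (tail--)
                    *d++ = *s++;                    /* copy the remainder */
            return dst;
    }

    int main(void)
    {
            char a[128] = "hello, movs", b[128];

            memcpy_sketch(b, a, sizeof(a));
            printf("%s\n", b);
            return 0;
    }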
     

24 Aug, 2010

2 commits

  • All read operations after the allocation stage can run speculatively,
    while all write operations run in program order; if the addresses
    differ, a read may run before an older write, otherwise it waits until
    the write commits. However, the CPU doesn't check every address bit,
    so a read can fail to recognize a different address even when the two
    are in different pages. For example, if rsi is 0xf004 and rdi is
    0xe008, the following sequence incurs a big latency penalty:

    1. movq (%rsi), %rax
    2. movq %rax, (%rdi)
    3. movq 8(%rsi), %rax
    4. movq %rax, 8(%rdi)

    If %rsi and %rdi really were in the same memory page, there would be a
    true read-after-write dependence, because instruction 2 writes 0x008
    and instruction 3 reads 0x00c; the two accesses partially overlap.
    Here they are actually in different pages and there is no real
    dependence, but without checking every address bit the CPU may assume
    they are in the same page, so instruction 3 has to wait for
    instruction 2 to write its data from the write buffer into the cache
    and then load the data from the cache; the time spent on that read is
    comparable to an mfence instruction. We can avoid this by reordering
    the operations as follows:

    1. movq 8(%rsi), %rax
    2. movq %rax, 8(%rdi)
    3. movq (%rsi), %rax
    4. movq %rax, (%rdi)

    Now instruction 3 reads 0x004 while instruction 2 writes address
    0x010, so there is no apparent dependence. On Core2 this gives a 1.83x
    speedup over the original instruction sequence. In this patch we first
    handle small sizes (less than 20 bytes), then jump to the different
    copy modes. Based on our micro-benchmark, for small sizes from 1 to
    127 bytes we got up to a 2x improvement, and up to a 1.5x improvement
    for 1024 bytes on Core i7. (We used our own micro-benchmark and will
    do further testing according to your requirements.)

    Signed-off-by: Ma Ling
    LKML-Reference:
    Signed-off-by: H. Peter Anvin

    Ma Ling
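
    A minimal sketch of the reordered inner copy from the entry above,
    kept in inline asm so the compiler cannot undo the high-quadword-first
    ordering; the real 64-bit memcpy does far more (prefetching, more
    registers in flight, tail handling).

    #include <stddef.h>
    #include <stdio.h>

    /* Within each 16-byte chunk, copy the high quadword before the low
     * one so a load is never issued right behind a store whose partial
     * address bits overlap it (the false read-after-write case). */
    static void copy_16byte_chunks(void *dst, const void *src, size_t chunks)
    {
            while (chunks--) {
                    asm volatile("movq 8(%[s]), %%rax\n\t"
                                 "movq %%rax, 8(%[d])\n\t"
                                 "movq (%[s]), %%rax\n\t"
                                 "movq %%rax, (%[d])"
                                 :
                                 : [s] "r" (src), [d] "r" (dst)
                                 : "rax", "memory");
                    src = (const char *)src + 16;
                    dst = (char *)dst + 16;
            }
    }

    int main(void)
    {
            char a[32] = "0123456789abcdef0123456789abcde", b[32];

            copy_16byte_chunks(b, a, 2);
            printf("%s\n", b);
            return 0;
    }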
     
  • memmove() allows the source and destination addresses to overlap,
    while memcpy() is not required to handle overlap. Therefore,
    explicitly implement memmove() in both the forward and backward
    directions, to give us the freedom to optimize memcpy().

    Signed-off-by: Ma Ling
    LKML-Reference:
    Signed-off-by: H. Peter Anvin

    Ma, Ling
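
    A minimal C sketch of the forward/backward split this change
    introduces, with the conventional return-the-destination contract; the
    kernel version is optimized assembly, this only illustrates the
    direction choice.

    #include <stddef.h>
    #include <stdio.h>

    static void *memmove_sketch(void *dst, const void *src, size_t n)
    {
            unsigned char *d = dst;
            const unsigned char *s = src;

            if (d < s || d >= s + n) {
                    while (n--)                    /* forward copy is safe */
                            *d++ = *s++;
            } else {
                    d += n;                        /* backward copy for overlap */
                    s += n;
                    while (n--)
                            *--d = *--s;
            }
            return dst;
    }

    int main(void)
    {
            char buf[] = "abcdef";

            memmove_sketch(buf + 2, buf, 4);       /* overlapping move */
            printf("%s\n", buf);                   /* prints "ababcd" */
            return 0;
    }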
     

14 Aug, 2010

1 commit