13 May, 2019

1 commit

  • Remove an unnecessary arch complication:

    arch/x86/include/asm/arch_hweight.h uses __sw_hweight{32,64} as
    alternatives, and they are implemented in arch/x86/lib/hweight.S

    x86 does not rely on the generic C implementation lib/hweight.c
    at all, so CONFIG_GENERIC_HWEIGHT should be disabled.

    __HAVE_ARCH_SW_HWEIGHT is not necessary either.

    No change in functionality intended.

    Signed-off-by: Masahiro Yamada
    Cc: Borislav Petkov
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: Uros Bizjak
    Link: http://lkml.kernel.org/r/1557665521-17570-1-git-send-email-yamada.masahiro@socionext.com
    Signed-off-by: Ingo Molnar

    Masahiro Yamada
     

02 Nov, 2017

1 commit

  • Many source files in the tree are missing licensing information, which
    makes it harder for compliance tools to determine the correct license.

    By default all files without license information are under the default
    license of the kernel, which is GPL version 2.

    Update the files which contain no license information with the 'GPL-2.0'
    SPDX license identifier. The SPDX identifier is a legally binding
    shorthand, which can be used instead of the full boiler plate text.

    This patch is based on work done by Thomas Gleixner and Kate Stewart and
    Philippe Ombredanne.

    How this work was done:

    Patches were generated and checked against linux-4.14-rc6 for a subset of
    the use cases:
    - file had no licensing information in it,
    - file was a */uapi/* one with no licensing information in it,
    - file was a */uapi/* one with existing licensing information,

    Further patches will be generated in subsequent months to fix up cases
    where non-standard license headers were used, and references to license
    had to be inferred by heuristics based on keywords.

    The analysis to determine which SPDX License Identifier should be applied
    to a file was done in a spreadsheet of side-by-side results from the
    output of two independent scanners (ScanCode & Windriver) producing SPDX
    tag:value files created by Philippe Ombredanne. Philippe prepared the
    base worksheet, and did an initial spot review of a few 1000 files.

    The 4.13 kernel was the starting point of the analysis with 60,537 files
    assessed. Kate Stewart did a file by file comparison of the scanner
    results in the spreadsheet to determine which SPDX license identifier(s)
    to be applied to the file. She confirmed any determination that was not
    immediately clear with lawyers working with the Linux Foundation.

    Criteria used to select files for SPDX license identifier tagging were:
    - Files considered eligible had to be source code files.
    - Make and config files were included as candidates if they contained >5
    lines of source
    - File already had some variant of a license header in it (even if
    Reviewed-by: Philippe Ombredanne
    Reviewed-by: Thomas Gleixner
    Signed-off-by: Greg Kroah-Hartman

    Greg Kroah-Hartman
     

08 Jun, 2016

1 commit

  • People complained about ARCH_HWEIGHT_CFLAGS and how it throws a wrench
    into kcov, lto, etc, experimentations.

    Add asm versions for __sw_hweight{32,64}() and do explicit saving and
    restoring of clobbered registers. This gets rid of the special calling
    convention. We get to call those functions on !X86_FEATURE_POPCNT CPUs.

    We still need to hardcode POPCNT and register operands, as some old gas
    versions which we support do not know about POPCNT.

    Btw, remove redundant REX prefix from 32-bit POPCNT because alternatives
    can do padding now.

    Suggested-by: H. Peter Anvin
    Signed-off-by: Borislav Petkov
    Acked-by: Peter Zijlstra (Intel)
    Cc: Andy Lutomirski
    Cc: Borislav Petkov
    Cc: Brian Gerst
    Cc: Denys Vlasenko
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Link: http://lkml.kernel.org/r/1464605787-20603-1-git-send-email-bp@alien8.de
    Signed-off-by: Ingo Molnar

    Borislav Petkov
     

14 Sep, 2014

1 commit

  • It used to be an ad-hoc hack defined by the x86 version of
    <asm/bitops.h> that enabled a couple of library routines to know whether
    an integer multiply is faster than repeated shifts and additions.

    This just makes it use the real Kconfig system instead, and makes x86
    (which was the only architecture that did this) select the option.

    NOTE! Even for x86, this really is kind of wrong. If we cared, we would
    probably not enable this for builds optimized for netburst (P4), where
    shifts-and-adds are generally faster than multiplies. This patch does
    *not* change that kind of logic, though, it is purely a syntactic change
    with no code changes.

    This was triggered by the fact that we have other places that really
    want to know "do I want to expand multiples by constants by hand or
    not", particularly the hash generation code.

    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

08 Mar, 2012

1 commit


07 Apr, 2010

2 commits

  • Add support for the hardware version of the Hamming weight function,
    popcnt, present in CPUs which advertise it under CPUID, Function
    0x0000_0001_ECX[23]. On CPUs which don't support it, we fall back to the
    default lib/hweight.c sw versions.
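
    The detect-then-fall-back idea can be sketched in userspace C. This is
    only an illustration under stated assumptions: it uses a runtime branch
    rather than the kernel's alternatives patching, it relies on the GCC/Clang
    <cpuid.h> helper __get_cpuid() and on inline asm, and the names
    sketch_sw_hweight32(), sketch_cpu_has_popcnt() and sketch_hweight32() are
    hypothetical.

```c
#include <stdint.h>

#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
#include <cpuid.h>
#define SKETCH_X86 1
#endif

/* Software fallback: parallel bit counting (hypothetical helper name). */
static unsigned int sketch_sw_hweight32(uint32_t w)
{
    w -= (w >> 1) & 0x55555555u;
    w = (w & 0x33333333u) + ((w >> 2) & 0x33333333u);
    w = (w + (w >> 4)) & 0x0f0f0f0fu;
    return (w * 0x01010101u) >> 24;
}

/* CPUID function 1, ECX bit 23 advertises POPCNT. */
static int sketch_cpu_has_popcnt(void)
{
#ifdef SKETCH_X86
    unsigned int eax, ebx, ecx, edx;

    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx))
        return 0;
    return (ecx >> 23) & 1;
#else
    return 0; /* not x86: always use the software version */
#endif
}

/* Use the instruction when advertised, else the software version. */
unsigned int sketch_hweight32(uint32_t w)
{
#ifdef SKETCH_X86
    if (sketch_cpu_has_popcnt()) {
        uint32_t res;

        __asm__("popcnt %1, %0" : "=r"(res) : "r"(w) : "cc");
        return res;
    }
#endif
    return sketch_sw_hweight32(w);
}
```

    Both paths should agree for every input; only the cost differs.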

    A synthetic benchmark comparing popcnt with __sw_hweight64 showed almost
    a 3x speedup on a F10h machine.

    Signed-off-by: Borislav Petkov
    LKML-Reference:
    Signed-off-by: H. Peter Anvin

    Borislav Petkov
     
  • Rename the existing runtime hweight() implementations to
    __arch_hweight(), rename the compile-time versions to __const_hweight()
    and then have hweight() pick between them.
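
    A rough sketch of how such a selection can work, assuming the GCC/Clang
    builtin __builtin_constant_p(). The SKETCH_/sketch_ names are hypothetical
    and only mirror the roles of __const_hweight(), __arch_hweight() and
    hweight() described above; they are not the kernel's definitions.

```c
#include <stdint.h>

/* Compile-time form: a pure expression, usable in constant contexts. */
#define SKETCH_CONST_HWEIGHT8(w)                              \
    ((unsigned int)                                           \
     (!!((w) & (1ULL << 0)) + !!((w) & (1ULL << 1)) +         \
      !!((w) & (1ULL << 2)) + !!((w) & (1ULL << 3)) +         \
      !!((w) & (1ULL << 4)) + !!((w) & (1ULL << 5)) +         \
      !!((w) & (1ULL << 6)) + !!((w) & (1ULL << 7))))
#define SKETCH_CONST_HWEIGHT16(w) \
    (SKETCH_CONST_HWEIGHT8(w) + SKETCH_CONST_HWEIGHT8((w) >> 8))
#define SKETCH_CONST_HWEIGHT32(w) \
    (SKETCH_CONST_HWEIGHT16(w) + SKETCH_CONST_HWEIGHT16((w) >> 16))

/* Runtime form: stands in for an arch-provided routine. */
static inline unsigned int sketch_arch_hweight32(uint32_t w)
{
    w -= (w >> 1) & 0x55555555u;
    w = (w & 0x33333333u) + ((w >> 2) & 0x33333333u);
    w = (w + (w >> 4)) & 0x0f0f0f0fu;
    return (w * 0x01010101u) >> 24;
}

/* Pick the constant form for constants, the runtime form otherwise. */
#define sketch_hweight32(w)                                   \
    (__builtin_constant_p(w) ? SKETCH_CONST_HWEIGHT32(w)      \
                             : sketch_arch_hweight32(w))

/* The constant form is an integer constant expression. */
_Static_assert(SKETCH_CONST_HWEIGHT32(0xFF00FF00u) == 16,
               "constant form evaluates at compile time");
```

    The constant form evaluates its argument many times, so it is only
    suitable for side-effect-free expressions.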

    Suggested-by: H. Peter Anvin
    Signed-off-by: Peter Zijlstra
    LKML-Reference:
    Acked-by: H. Peter Anvin
    LKML-Reference:
    Signed-off-by: H. Peter Anvin

    Peter Zijlstra
     

28 Dec, 2009

1 commit

  • Optimize hweight32 by using the same technique in hweight64.

    The proof of this technique can be found in the commit log for
    f9b4192923fa6e38331e88214b1fe5fc21583fcc ("bitops: hweight()
    speedup").

    The userspace benchmark on x86_32 showed 20% speedup with
    bitmap_weight() which uses hweight32 to count bits for each
    unsigned long on 32bit architectures.

    int main(void)
    {
    #define SZ (1024 * 1024 * 512)

        static DECLARE_BITMAP(bitmap, SZ) = {
            [0 ... 100] = 1,
        };

        return bitmap_weight(bitmap, SZ);
    }
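
    For readers without the kernel headers, a minimal userspace stand-in for
    bitmap_weight() might look like the sketch below. The names
    sketch_word_weight() and sketch_bitmap_weight() are hypothetical, and the
    per-word counter is a simple loop rather than the optimized hweight32
    this commit is about.

```c
#include <limits.h>
#include <stddef.h>

#define SKETCH_WORD_BITS (sizeof(unsigned long) * CHAR_BIT)

/* Count set bits in one word by clearing the lowest set bit each pass. */
static unsigned int sketch_word_weight(unsigned long w)
{
    unsigned int n = 0;

    while (w) {
        w &= w - 1;
        n++;
    }
    return n;
}

/* Sum the weights of all full words, then the masked partial last word. */
unsigned int sketch_bitmap_weight(const unsigned long *map, size_t nbits)
{
    size_t full = nbits / SKETCH_WORD_BITS;
    size_t rest = nbits % SKETCH_WORD_BITS;
    unsigned int total = 0;
    size_t i;

    for (i = 0; i < full; i++)
        total += sketch_word_weight(map[i]);
    if (rest)
        total += sketch_word_weight(map[full] & ((1UL << rest) - 1));
    return total;
}
```

    Bits beyond nbits in the last word are masked off, so they never
    contribute to the result.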

    Signed-off-by: Akinobu Mita
    Signed-off-by: Andrew Morton
    Cc: Linus Torvalds
    LKML-Reference:
    [ only x86 sets ARCH_HAS_FAST_MULTIPLIER so we do this via the x86 tree]
    Signed-off-by: Ingo Molnar

    Akinobu Mita
     

20 Oct, 2007

1 commit

  • remove asm/bitops.h includes

    Including asm/bitops.h directly may cause compile errors. Don't include
    it; include linux/bitops.h instead. The next patch will deny including the
    asm header directly.

    Cc: Adrian Bunk
    Signed-off-by: Jiri Slaby
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jiri Slaby
     

26 Sep, 2006

1 commit

  • Based on patch from David Rientjes, but changed by AK.

    Optimizes the 64-bit hamming weight for x86_64 processors assuming they
    have fast multiplication. Uses five fewer bitops than the generic
    hweight64. Benchmark on one EM64T showed ~25% speedup with 2^24
    consecutive calls.

    Define a new ARCH_HAS_FAST_MULTIPLIER that can be set by other
    architectures that can also multiply fast.
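
    As an illustration only (not the code from this patch), the two ways of
    finishing a 64-bit count can be sketched as follows. The function names
    sketch_hweight64_mul() and sketch_hweight64_shift() are hypothetical; the
    first assumes a fast multiplier, the second uses shifts and adds.

```c
#include <stdint.h>

/* Reduce to eight byte-wide counts, each in the range 0..8. */
static uint64_t sketch_fold64(uint64_t w)
{
    w -= (w >> 1) & 0x5555555555555555ull;
    w = (w & 0x3333333333333333ull) + ((w >> 2) & 0x3333333333333333ull);
    return (w + (w >> 4)) & 0x0f0f0f0f0f0f0f0full;
}

/* Fast-multiplier variant: one multiply sums all bytes into the top byte. */
unsigned int sketch_hweight64_mul(uint64_t w)
{
    return (unsigned int)((sketch_fold64(w) * 0x0101010101010101ull) >> 56);
}

/* Shift-and-add variant: three more dependent steps plus a final mask. */
unsigned int sketch_hweight64_shift(uint64_t w)
{
    w = sketch_fold64(w);
    w += w >> 8;
    w += w >> 16;
    w += w >> 32;
    return (unsigned int)(w & 0x7f);
}
```

    Both variants return the same value; which one is faster depends on the
    multiplier latency of the target CPU.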

    Signed-off-by: Andi Kleen

    Andi Kleen
     

27 Mar, 2006

2 commits

  • wrote:

    This is an extremely well-known technique. You can see a similar version that
    uses a multiply for the last few steps at
    http://graphics.stanford.edu/~seander/bithacks.html#CountBitsSetParallel which
    refers to "Software Optimization Guide for AMD Athlon 64 and Opteron
    Processors"
    http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/25112.PDF

    It's section 8.6, "Efficient Implementation of Population-Count Function in
    32-bit Mode", pages 179-180.

    It uses the name that I am more familiar with, "popcount" (population count),
    although "Hamming weight" also makes sense.

    Anyway, the proof of correctness proceeds as follows:

    b = a - ((a >> 1) & 0x55555555);
    c = (b & 0x33333333) + ((b >> 2) & 0x33333333);
    d = (c + (c >> 4)) & 0x0f0f0f0f;
    #if SLOW_MULTIPLY
    e = d + (d >> 8);
    f = e + (e >> 16);
    return f & 63;
    #else
    /* Useful if multiply takes at most 4 cycles */
    return (d * 0x01010101) >> 24;
    #endif

    The input value a can be thought of as 32 1-bit fields each holding their own
    hamming weight. Now look at it as 16 2-bit fields. Each 2-bit field a1..a0
    has the value 2*a1 + a0. This can be converted into the hamming weight of the
    2-bit field a1+a0 by subtracting a1.

    That's what the (a >> 1) & mask subtraction does. Since there can be no
    borrows, you can just do it all at once.

    Enumerating the 4 possible cases:

    0b00 = 0 -> 0 - 0 = 0
    0b01 = 1 -> 1 - 0 = 1
    0b10 = 2 -> 2 - 1 = 1
    0b11 = 3 -> 3 - 1 = 2

    The next step consists of breaking up b (made of 16 2-bit fields) into
    even and odd halves and adding them into 4-bit fields. Since the largest
    possible sum is 2+2 = 4, which will not fit into a 2-bit field, the 2-bit
    fields have to be masked before they are added.

    After this point, the masking can be delayed. Each 4-bit field holds a
    population count from 0..4, taking at most 3 bits. These numbers can be added
    without overflowing a 4-bit field, so we can compute c + (c >> 4), and only
    then mask off the unwanted bits.

    This produces d, a number made of four 8-bit fields, each in the range 0..8. From
    this point, we can shift and add d multiple times without overflowing an 8-bit
    field, and only do a final mask at the end.

    The number to mask with has to be at least 63 (so that 32 won't be truncated),
    but can also be 127 or 255. The x86 has a special encoding for signed
    immediate byte values -128..127, so the value of 255 is slower. On other
    processors, a special "sign extend byte" instruction might be faster.

    On a processor with fast integer multiplies (Athlon but not P4), you can
    reduce the final few serially dependent instructions to a single integer
    multiply. Consider d to be four 8-bit values d3, d2, d1 and d0, each in the
    range 0..8. The multiply forms the partial products:

               d3 d2 d1 d0
            d3 d2 d1 d0
         d3 d2 d1 d0
    + d3 d2 d1 d0
    -----------------------
               e3 e2 e1 e0

    Where e3 = d3 + d2 + d1 + d0. e2, e1 and e0 obviously cannot generate
    any carries.
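
    The derivation above can be checked with a small userspace sketch. The
    function names are hypothetical: sketch_popcount32_shift() and
    sketch_popcount32_mul() follow the two branches of the snippet in the
    proof, and sketch_popcount32_naive() is a bit-by-bit reference used only
    for comparison.

```c
#include <stdint.h>

/* Shared first three steps: 2-bit, 4-bit, then 8-bit field sums. */
static uint32_t sketch_fold32(uint32_t a)
{
    uint32_t b = a - ((a >> 1) & 0x55555555u);
    uint32_t c = (b & 0x33333333u) + ((b >> 2) & 0x33333333u);

    return (c + (c >> 4)) & 0x0f0f0f0fu;
}

/* Slow-multiply branch: shift, add, and mask once at the end. */
unsigned int sketch_popcount32_shift(uint32_t a)
{
    uint32_t d = sketch_fold32(a);
    uint32_t e = d + (d >> 8);
    uint32_t f = e + (e >> 16);

    return f & 63;
}

/* Fast-multiply branch: the top byte of the product is d3+d2+d1+d0. */
unsigned int sketch_popcount32_mul(uint32_t a)
{
    return (sketch_fold32(a) * 0x01010101u) >> 24;
}

/* Reference: test each bit in turn. */
unsigned int sketch_popcount32_naive(uint32_t a)
{
    unsigned int n = 0;

    for (; a; a >>= 1)
        n += a & 1u;
    return n;
}
```

    For any input, all three functions should return the same count.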

    Signed-off-by: Akinobu Mita
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Akinobu Mita
     
  • This patch introduces the C-language equivalents of the functions below:

    unsigned int hweight32(unsigned int w);
    unsigned int hweight16(unsigned int w);
    unsigned int hweight8(unsigned int w);
    unsigned long hweight64(__u64 w);

    In include/asm-generic/bitops/hweight.h

    This code largely copied from: include/linux/bitops.h

    Signed-off-by: Akinobu Mita
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Akinobu Mita