13 Jul, 2017

23 commits

  • Add a few initial respective tests for an array:

    o Echoing values separated by spaces works
    o Echoing only first elements will set first elements
    o Confirm PAGE_SIZE limit still applies even if an array is used

    Link: http://lkml.kernel.org/r/20170630224431.17374-7-mcgrof@kernel.org
    Signed-off-by: Luis R. Rodriguez
    Cc: Kees Cook
    Cc: "Eric W. Biederman"
    Cc: Shuah Khan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Luis R. Rodriguez
     
  • Test against a simple proc_douintvec() case. While at it, add a test
    against UINT_MAX. Make sure UINT_MAX works, and UINT_MAX+1 will fail
    and that negative values are not accepted.

    Link: http://lkml.kernel.org/r/20170630224431.17374-6-mcgrof@kernel.org
    Signed-off-by: Luis R. Rodriguez
    Cc: Kees Cook
    Cc: "Eric W. Biederman"
    Cc: Shuah Khan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Luis R. Rodriguez
     
  • Test against a simple proc_dointvec() case. While at it, add a test
    against INT_MAX. Make sure INT_MAX works, and INT_MAX+1 will fail.
    Also test negative values work.

    Link: http://lkml.kernel.org/r/20170630224431.17374-5-mcgrof@kernel.org
    Signed-off-by: Luis R. Rodriguez
    Cc: Kees Cook
    Cc: "Eric W. Biederman"
    Cc: Shuah Khan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Luis R. Rodriguez
     
  • Add the following tests to ensure we do not regress:

    o Test using a buffer full of spaces (PAGE_SIZE-1) followed by a
    single digit works

    o Test using a buffer full of spaces (PAGE_SIZE or over) will fail

    As the number of tests grows, instead of unloading and reloading the
    module we can just call a shell reset_vals() that resets to the values
    we know are set at init in the driver.

    Link: http://lkml.kernel.org/r/20170630224431.17374-4-mcgrof@kernel.org
    Signed-off-by: Luis R. Rodriguez
    Cc: Kees Cook
    Cc: "Eric W. Biederman"
    Cc: Shuah Khan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Luis R. Rodriguez
     
  • This adds a generic script to let us more easily add more test cases.
    Since we really have only two types of test cases just fold them into
    the one file. Each test unit is now identified in its own separate
    function:

    # ./sysctl.sh -l
    Test ID list:

    TEST_ID x NUM_TEST
    TEST_ID: Test ID
    NUM_TESTS: Number of recommended times to run the test

    0001 x 1 - tests proc_dointvec_minmax()
    0002 x 1 - tests proc_dostring()

    For now we start off with what we had before, and run each test only
    once. We can now watch a test case until it fails:

    ./sysctl.sh -w 0002

    We can also run a test case x number of times, say we want to run a test
    case 100 times:

    ./sysctl.sh -c 0001 100

    To run a test case only once, for example:

    ./sysctl.sh -s 0002

    The default settings are specified at the top of sysctl.sh.

    Link: http://lkml.kernel.org/r/20170630224431.17374-3-mcgrof@kernel.org
    Signed-off-by: Luis R. Rodriguez
    Cc: Kees Cook
    Cc: "Eric W. Biederman"
    Cc: Shuah Khan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Luis R. Rodriguez
     
  • The existing tools/testing/selftests/sysctl/ tests include two test
    cases, but these use existing production kernel sysctl interfaces. We
    want to expand test coverage but we can't just be looking for random
    safe production values to poke at, that's just insane!

    Instead just dedicate a test driver for debugging purposes and port the
    existing scripts to use it. This will make it easier for further tests
    to be added.

    Subsequent patches will extend our test coverage for sysctl.

    The stress test driver uses a new license (GPL on Linux, copyleft-next
    outside of Linux). Linus was fine with this [0] and later, at Ted's
    and Alan's request, ironed out an "or" language clause to use [1], which
    is already present upstream.

    [0] https://lkml.kernel.org/r/CA+55aFyhxcvD+q7tp+-yrSFDKfR0mOHgyEAe=f_94aKLsOu0Og@mail.gmail.com
    [1] https://lkml.kernel.org/r/1495234558.7848.122.camel@linux.intel.com

    Link: http://lkml.kernel.org/r/20170630224431.17374-2-mcgrof@kernel.org
    Signed-off-by: Luis R. Rodriguez
    Acked-by: Kees Cook
    Cc: "Eric W. Biederman"
    Cc: Shuah Khan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Luis R. Rodriguez
     
  • To keep parity with regular int interfaces provide an unsigned int
    proc_douintvec_minmax() which allows you to specify a range of allowed
    valid numbers.
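
    As a rough userspace illustration of what a min/max range check on an
    unsigned int looks like (this is not the kernel handler; the struct, the
    function name and the choice of -ERANGE as the error code are all
    assumptions made for this sketch):

```c
#include <errno.h>

/* Hypothetical range descriptor; the names are illustrative only. */
struct uint_range {
	unsigned int min;
	unsigned int max;
};

/*
 * Store val only if it lies inside [r->min, r->max]. On a range
 * violation *dst is left untouched and -ERANGE is returned (the error
 * code is this sketch's choice).
 */
int store_uint_in_range(unsigned int val, const struct uint_range *r,
			unsigned int *dst)
{
	if (val < r->min || val > r->max)
		return -ERANGE;
	*dst = val;
	return 0;
}
```

    Leaving the destination untouched on failure means a rejected write
    never disturbs the previously accepted value.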

    Adding proc_douintvec_minmax_sysadmin() is easy but we can wait for an
    actual user for that.

    Link: http://lkml.kernel.org/r/20170519033554.18592-6-mcgrof@kernel.org
    Signed-off-by: Luis R. Rodriguez
    Acked-by: Kees Cook
    Cc: Subash Abhinov Kasiviswanathan
    Cc: Heinrich Schuchardt
    Cc: Kees Cook
    Cc: "David S. Miller"
    Cc: Ingo Molnar
    Cc: Al Viro
    Cc: "Eric W. Biederman"
    Cc: Alexey Dobriyan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Luis R. Rodriguez
     
  • Commit e7d316a02f68 ("sysctl: handle error writing UINT_MAX to u32
    fields") added proc_douintvec() to start adding support for
    unsigned int, this however was only half the work needed. Two fixes
    have come in since then for the following issues:

    o Printing the values shows a negative value, this happens since
    do_proc_dointvec() is used and this uses proc_put_long()

    This was fixed by commit 5380e5644afbba9 ("sysctl: don't print negative
    flag for proc_douintvec").

    o We can easily wrap around the int values: UINT_MAX is 4294967295, if
    we echo in 4294967295 + 1 we end up with 0, using 4294967295 + 2 we
    end up with 1.
    o We echo negative values in and they are accepted

    This was fixed by commit 425fffd886ba ("sysctl: report EINVAL if value
    is larger than UINT_MAX for proc_douintvec").
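
    The wrap-around and negative-value problems described above can be
    sketched in plain userspace C. This is only an illustration under stated
    assumptions, not the kernel's implementation: parse_uint_strict() is a
    hypothetical helper, and it assumes a 32-bit unsigned int so that
    UINT_MAX is 4294967295 as in the text.

```c
#include <ctype.h>
#include <errno.h>
#include <limits.h>
#include <stdlib.h>

/*
 * Hypothetical userspace helper, not the kernel's proc_douintvec():
 * parse a decimal string into an unsigned int, refusing negative input
 * and anything above UINT_MAX instead of silently wrapping.
 */
int parse_uint_strict(const char *s, unsigned int *out)
{
	char *end;
	unsigned long long v;

	/* strtoull() accepts a leading '-' and negates, so reject it first. */
	while (isspace((unsigned char)*s))
		s++;
	if (*s == '-' || *s == '\0')
		return -EINVAL;

	errno = 0;
	v = strtoull(s, &end, 10);
	/* Allow an optional trailing newline, as "echo" would produce. */
	if (errno || end == s || (*end != '\0' && *end != '\n'))
		return -EINVAL;
	/* Compare in the wider type before narrowing, so nothing wraps. */
	if (v > UINT_MAX)
		return -EINVAL;

	*out = (unsigned int)v;
	return 0;
}
```

    With this check "4294967295" is stored as UINT_MAX, while "4294967296"
    and "-1" are both refused rather than ending up as 0 or a large value.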

    It still also failed to be added to sysctl_check_table()... instead of
    adding it with the current implementation just provide proper and
    simplified unsigned int support, without any array support for
    unsigned int and with no negative support at all.

    Historically sysctl proc helpers have supported arrays, due to the
    complexity this adds though we've taken a step back to evaluate array
    users to determine if it's worth maintaining for unsigned int. An
    evaluation using Coccinelle has been done to perform a grammatical
    search to ask ourselves:

    o How many sysctl proc_dointvec() (int) users exist which likely
    should be moved over to proc_douintvec() (unsigned int) ?
    Answer: about 8
    - Of these how many are array users ?
    Answer: Probably only 1
    o How many sysctl array users exist ?
    Answer: about 12

    This last question gives us an idea just how popular arrays are: they are not.
    Array support should probably just be kept for strings.

    The identified uint ports are:

    drivers/infiniband/core/ucma.c - max_backlog
    drivers/infiniband/core/iwcm.c - default_backlog
    net/core/sysctl_net_core.c - rps_sock_flow_sysctl()
    net/netfilter/nf_conntrack_timestamp.c - nf_conntrack_timestamp -- bool
    net/netfilter/nf_conntrack_acct.c nf_conntrack_acct -- bool
    net/netfilter/nf_conntrack_ecache.c - nf_conntrack_events -- bool
    net/netfilter/nf_conntrack_helper.c - nf_conntrack_helper -- bool
    net/phonet/sysctl.c proc_local_port_range()

    The only possible array user is proc_local_port_range() but it does not
    seem worth it to add array support just for this given the range support
    works just as well. Unsigned int support is mostly desirable for when
    you *need* more than INT_MAX, or when int min/max support does not
    suffice for your ranges.

    If you forget and by mistake happen to register an unsigned int proc
    entry with an array, the driver will fail and you will get something as
    follows:

    sysctl table check failed: debug/test_sysctl//uint_0002 array now allowed
    CPU: 2 PID: 1342 Comm: modprobe Tainted: G W E
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS
    Call Trace:
    dump_stack+0x63/0x81
    __register_sysctl_table+0x350/0x650
    ? kmem_cache_alloc_trace+0x107/0x240
    __register_sysctl_paths+0x1b3/0x1e0
    ? 0xffffffffc005f000
    register_sysctl_table+0x1f/0x30
    test_sysctl_init+0x10/0x1000 [test_sysctl]
    do_one_initcall+0x52/0x1a0
    ? kmem_cache_alloc_trace+0x107/0x240
    do_init_module+0x5f/0x200
    load_module+0x1867/0x1bd0
    ? __symbol_put+0x60/0x60
    SYSC_finit_module+0xdf/0x110
    SyS_finit_module+0xe/0x10
    entry_SYSCALL_64_fastpath+0x1e/0xad
    RIP: 0033:0x7f042b22d119

    Fixes: e7d316a02f68 ("sysctl: handle error writing UINT_MAX to u32 fields")
    Link: http://lkml.kernel.org/r/20170519033554.18592-5-mcgrof@kernel.org
    Signed-off-by: Luis R. Rodriguez
    Suggested-by: Alexey Dobriyan
    Cc: Subash Abhinov Kasiviswanathan
    Cc: Liping Zhang
    Cc: Alexey Dobriyan
    Cc: Heinrich Schuchardt
    Cc: Kees Cook
    Cc: "David S. Miller"
    Cc: Ingo Molnar
    Cc: Al Viro
    Cc: "Eric W. Biederman"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Luis R. Rodriguez
     
  • The mode sysctl_writes_strict positional checks keep being copy and pasted
    as we add new proc handlers. Just add a helper to avoid code duplication.

    Link: http://lkml.kernel.org/r/20170519033554.18592-4-mcgrof@kernel.org
    Signed-off-by: Luis R. Rodriguez
    Suggested-by: Kees Cook
    Cc: Al Viro
    Cc: "Eric W. Biederman"
    Cc: Alexey Dobriyan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Luis R. Rodriguez
     
  • Document the different sysctl_writes_strict modes in code.

    Link: http://lkml.kernel.org/r/20170519033554.18592-3-mcgrof@kernel.org
    Signed-off-by: Luis R. Rodriguez
    Cc: Al Viro
    Cc: "Eric W. Biederman"
    Cc: Alexey Dobriyan
    Cc: Kees Cook
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Luis R. Rodriguez
     
  • Patch series "sysctl: few fixes", v5.

    I've been working on making kmod more deterministic, and as I did that I
    couldn't help but notice a few issues with sysctl. My end goal was just
    to fix unsigned int support, which back then was completely broken.
    Liping Zhang has sent up small atomic fixes, however it still missed yet
    one more fix and Alexey Dobriyan had also suggested to just drop array
    support given its complexity.

    I have inspected array support using Coccinelle and indeed it's not that
    popular, so if in fact we can avoid it for new interfaces, I agree it's
    best.

    I did develop a sysctl stress driver but will hold that off for another
    series.

    This patch (of 5):

    Commit 7c60c48f58a7 ("sysctl: Improve the sysctl sanity checks")
    improved sanity checks considerably, however the enhancements on
    sysctl_check_table() meant adding a functional change so that only the
    last table entry's sanity error is propagated. It also changed the way
    errors were propagated so that each new check reset the err value, this
    means only the last sanity check computed is used for an error. This has
    been in the kernel since v3.4 days.

    Fix this by carrying on errors from previous checks and iterations as we
    traverse the table and ensuring we keep any error from previous checks.
    We keep iterating on the table even if an error is found so we can
    complain for all errors found in one shot. This works as -EINVAL is
    always returned on error anyway, and the check for error is any non-zero
    value.
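
    The accumulate-and-continue pattern described above can be sketched as
    follows. This is a simplified userspace illustration, not the kernel
    code: struct entry, complain() and check_table() are hypothetical names,
    and the two checks shown are stand-ins for the real sanity checks.

```c
#include <errno.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical table entry; not the kernel's struct ctl_table. */
struct entry {
	const char *name;	/* NULL name terminates the table */
	int mode;
	void *data;
};

/* Report one problem and hand back the error to be accumulated. */
static int complain(const struct entry *e, const char *why)
{
	fprintf(stderr, "table check failed: %s %s\n", e->name, why);
	return -EINVAL;
}

/*
 * Walk the whole table and OR every failure into err, so that an error
 * from an earlier check or an earlier entry is never overwritten by a
 * later check that passes. Every failure is -EINVAL and callers only
 * test for non-zero, so OR-ing the values together is safe.
 */
int check_table(const struct entry *table)
{
	int err = 0;

	for (; table->name; table++) {
		if (!table->data)
			err |= complain(table, "No data");
		if ((table->mode & 0666) != table->mode)
			err |= complain(table, "bogus .mode");
	}
	return err;
}
```

    A table whose first entry is broken and whose last entry is fine still
    returns -EINVAL here, which is the case the original code got wrong.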

    Fixes: 7c60c48f58a7 ("sysctl: Improve the sysctl sanity checks")
    Link: http://lkml.kernel.org/r/20170519033554.18592-2-mcgrof@kernel.org
    Signed-off-by: Luis R. Rodriguez
    Cc: Al Viro
    Cc: "Eric W. Biederman"
    Cc: Alexey Dobriyan
    Cc: Kees Cook
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Luis R. Rodriguez
     
  • Minor updates in Documentation for arm64 as relocatable kernel. Also
    this patch updates documentation for using uncompressed image "Image"
    which is used for ARM64.

    Link: http://lkml.kernel.org/r/1495104793-6563-1-git-send-email-Bharat.Bhushan@nxp.com
    Signed-off-by: Bharat Bhushan
    Cc: Dave Young
    Cc: Baoquan He
    Cc: Vivek Goyal
    Cc: Jonathan Corbet
    Cc: AKASHI Takahiro
    Cc: Pratyush Anand
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Bharat Bhushan
     
  • Currently vmcoreinfo data is updated at boot time subsys_initcall(), it
    has the risk of being modified by some wrong code while the system is
    running.

    As a result, vmcore dumped may contain the wrong vmcoreinfo. Later on,
    when using "crash", "makedumpfile", etc utility to parse this vmcore, we
    probably will get "Segmentation fault" or other unexpected errors.

    E.g. 1) wrong code overwrites vmcoreinfo_data; 2) further crashes the
    system; 3) trigger kdump, then we obviously will fail to recognize the
    crash context correctly due to the corrupted vmcoreinfo.

    Now except for vmcoreinfo, all the crash data is well protected
    (including the cpu note which is fully updated in the crash path, thus
    its correctness is guaranteed). Given that vmcoreinfo data is a large
    chunk prepared for kdump, we better protect it as well.

    To solve this, we relocate and copy vmcoreinfo_data to the crash memory
    when kdump is loading via kexec syscalls. Because the whole crash
    memory will be protected by existing arch_kexec_protect_crashkres()
    mechanism, we naturally protect vmcoreinfo_data from write (even read)
    access under kernel direct mapping after kdump is loaded.

    Since kdump is usually loaded at the very early stage after boot, we can
    trust the correctness of the vmcoreinfo data copied.

    On the other hand, we still need to operate on the vmcoreinfo safe copy
    when a crash happens in order to generate vmcoreinfo_note again, so we
    rely on vmap() to map out a new kernel virtual address and update to use
    this new one instead in the following crash_save_vmcoreinfo().

    BTW, we do not touch vmcoreinfo_note, because it will be fully updated
    using the protected vmcoreinfo_data after crash which is surely correct
    just like the cpu crash note.

    Link: http://lkml.kernel.org/r/1493281021-20737-3-git-send-email-xlpang@redhat.com
    Signed-off-by: Xunlei Pang
    Tested-by: Michael Holzheu
    Cc: Benjamin Herrenschmidt
    Cc: Dave Young
    Cc: Eric Biederman
    Cc: Hari Bathini
    Cc: Juergen Gross
    Cc: Mahesh Salgaonkar
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Xunlei Pang
     
  • vmcoreinfo_max_size stands for the vmcoreinfo_data, the correct one we
    should use is vmcoreinfo_note whose total size is VMCOREINFO_NOTE_SIZE.

    Like explained in commit 77019967f06b ("kdump: fix exported size of
    vmcoreinfo note"), it should not affect the actual function, but we
    better fix it, also this change should be safe and backward compatible.

    After this, we can get rid of variable vmcoreinfo_max_size, let's use
    the corresponding macros directly, fewer variables means more safety for
    vmcoreinfo operation.

    [xlpang@redhat.com: fix build warning]
    Link: http://lkml.kernel.org/r/1494830606-27736-1-git-send-email-xlpang@redhat.com
    Link: http://lkml.kernel.org/r/1493281021-20737-2-git-send-email-xlpang@redhat.com
    Signed-off-by: Xunlei Pang
    Reviewed-by: Mahesh Salgaonkar
    Reviewed-by: Dave Young
    Cc: Hari Bathini
    Cc: Benjamin Herrenschmidt
    Cc: Eric Biederman
    Cc: Juergen Gross
    Cc: Michael Holzheu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Xunlei Pang
     
  • As Eric said,
    "what we need to do is move the variable vmcoreinfo_note out of the
    kernel's .bss section. And modify the code to regenerate and keep this
    information in something like the control page.

    Definitely something like this needs a page all to itself, and ideally
    far away from any other kernel data structures. I clearly was not
    watching closely the data someone decided to keep this silly thing in
    the kernel's .bss section."

    This patch allocates extra pages for these vmcoreinfo_XXX variables, one
    advantage is that it enhances some safety of vmcoreinfo, because
    vmcoreinfo now is kept far away from other kernel data structures.

    Link: http://lkml.kernel.org/r/1493281021-20737-1-git-send-email-xlpang@redhat.com
    Signed-off-by: Xunlei Pang
    Tested-by: Michael Holzheu
    Reviewed-by: Juergen Gross
    Suggested-by: Eric Biederman
    Cc: Benjamin Herrenschmidt
    Cc: Dave Young
    Cc: Hari Bathini
    Cc: Mahesh Salgaonkar
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Xunlei Pang
     
  • The reason to disable interrupts seems to be to avoid switching to a
    different processor while handling per cpu data using individual loads and
    stores. If we use per cpu RMW primitives we will not have to disable
    interrupts.

    Link: http://lkml.kernel.org/r/alpine.DEB.2.20.1705171055130.5898@east.gentwo.org
    Signed-off-by: Christoph Lameter
    Cc: Andy Lutomirski
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • With gcc 4.1.2:

    mm/memory.o: In function `create_huge_pmd':
    memory.c:(.text+0x93e): undefined reference to `do_huge_pmd_anonymous_page'

    Interestingly, create_huge_pmd() is emitted in the assembler output, but
    never called.

    Converting transparent_hugepage_enabled() from a macro to a static
    inline function reduced the ability of the compiler to remove unused
    code.

    Fix this by marking create_huge_pmd() inline.

    Fixes: 16981d763501c0e0 ("mm: improve readability of transparent_hugepage_enabled()")
    Link: http://lkml.kernel.org/r/1499842660-10665-1-git-send-email-geert@linux-m68k.org
    Signed-off-by: Geert Uytterhoeven
    Acked-by: Arnd Bergmann
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Geert Uytterhoeven
     
  • If the first parameter of container_of() is a pointer to a
    non-const-qualified array type (and the third parameter names a
    non-const-qualified array member), the local variable __mptr will be
    defined with a const-qualified array type. In ISO C, these types are
    incompatible. They work as expected in GNU C, but some versions will
    issue warnings. For example, GCC 4.9 produces the warning
    "initialization from incompatible pointer type".

    Here is an example of where the problem occurs:

    -------------------------------------------------------
    #include <linux/kernel.h>
    #include <linux/module.h>

    MODULE_LICENSE("GPL");

    struct st {
        int a;
        char b[16];
    };

    static int __init example_init(void) {
        struct st t = { .a = 101, .b = "hello" };
        char (*p)[16] = &t.b;
        struct st *x = container_of(p, struct st, b);
        printk(KERN_DEBUG "%p %p\n", (void *)&t, (void *)x);
        return 0;
    }

    static void __exit example_exit(void) {
    }

    module_init(example_init);
    module_exit(example_exit);
    -------------------------------------------------------

    Building the module with gcc-4.9 results in these warnings (where '{m}'
    is the module source and '{k}' is the kernel source):

    -------------------------------------------------------
    In file included from {m}/example.c:1:0:
    {m}/example.c: In function `example_init':
    {k}/include/linux/kernel.h:854:48: warning: initialization from incompatible pointer type
    const typeof( ((type *)0)->member ) *__mptr = (ptr); \
    ^
    {m}/example.c:14:17: note: in expansion of macro `container_of'
    struct st *x = container_of(p, struct st, b);
    ^
    {k}/include/linux/kernel.h:854:48: warning: (near initialization for `x')
    const typeof( ((type *)0)->member ) *__mptr = (ptr); \
    ^
    {m}/example.c:14:17: note: in expansion of macro `container_of'
    struct st *x = container_of(p, struct st, b);
    ^
    -------------------------------------------------------

    Replace the type checking performed by the macro to avoid these
    warnings. Make sure `*(ptr)` either has type compatible with the
    member, or has type compatible with `void`, ignoring qualifiers. Raise
    compiler errors if this is not true. This is stronger than the previous
    behaviour, which only resulted in compiler warnings for a type mismatch.
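
    A userspace approximation of the stricter check described above is
    sketched below. It is not the kernel macro: container_of_checked() and
    same_type() are names made up for this illustration, and the sketch
    relies on GNU C extensions (statement expressions, __typeof__,
    __builtin_types_compatible_p), so it assumes GCC or Clang.

```c
#include <stddef.h>

/* True when a and b have compatible types, ignoring qualifiers. */
#define same_type(a, b) \
	__builtin_types_compatible_p(__typeof__(a), __typeof__(b))

/*
 * Illustrative container_of() variant: *(ptr) must be compatible with
 * the member's type or with void, otherwise the build fails. The
 * pointer is carried as void * so no const-qualified array type is
 * ever initialized from a non-const one.
 */
#define container_of_checked(ptr, type, member) ({			\
	void *__mptr = (void *)(ptr);					\
	_Static_assert(same_type(*(ptr), ((type *)0)->member) ||	\
		       same_type(*(ptr), void),				\
		       "pointer type mismatch in container_of()");	\
	((type *)((char *)__mptr - offsetof(type, member))); })

struct st {
	int a;
	char b[16];
};

/* Recover the enclosing struct st from a pointer to its array member. */
struct st *st_from_b(char (*p)[16])
{
	return container_of_checked(p, struct st, b);
}
```

    Passing, say, an int * for member b in this sketch stops the build with
    the "pointer type mismatch" message instead of producing only a warning.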

    [arnd@arndb.de: fix new warnings for container_of()]
    Link: http://lkml.kernel.org/r/20170620200940.90557-1-arnd@arndb.de
    Link: http://lkml.kernel.org/r/20170525120316.24473-7-abbotti@mev.co.uk
    Signed-off-by: Ian Abbott
    Signed-off-by: Arnd Bergmann
    Acked-by: Michal Nazarewicz
    Acked-by: Kees Cook
    Cc: Hidehiro Kawai
    Cc: Borislav Petkov
    Cc: Rasmus Villemoes
    Cc: Johannes Berg
    Cc: Peter Zijlstra
    Cc: Alexander Potapenko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ian Abbott
     
  • "kernel.h: handle pointers to arrays better in container_of()" triggers:

    In file included from include/uapi/linux/stddef.h:1:0,
    from include/linux/stddef.h:4,
    from include/uapi/linux/posix_types.h:4,
    from include/uapi/linux/types.h:13,
    from include/linux/types.h:5,
    from include/linux/syscalls.h:71,
    from fs/dcache.c:17:
    fs/dcache.c: In function 'release_dentry_name_snapshot':
    include/linux/compiler.h:542:38: error: call to '__compiletime_assert_305' declared with attribute error: pointer type mismatch in container_of()
    _compiletime_assert(condition, msg, __compiletime_assert_, __LINE__)
    ^
    include/linux/compiler.h:525:4: note: in definition of macro '__compiletime_assert'
    prefix ## suffix(); \
    ^
    include/linux/compiler.h:542:2: note: in expansion of macro '_compiletime_assert'
    _compiletime_assert(condition, msg, __compiletime_assert_, __LINE__)
    ^
    include/linux/build_bug.h:46:37: note: in expansion of macro 'compiletime_assert'
    #define BUILD_BUG_ON_MSG(cond, msg) compiletime_assert(!(cond), msg)
    ^
    include/linux/kernel.h:860:2: note: in expansion of macro 'BUILD_BUG_ON_MSG'
    BUILD_BUG_ON_MSG(!__same_type(*(ptr), ((type *)0)->member) && \
    ^
    fs/dcache.c:305:7: note: in expansion of macro 'container_of'
    p = container_of(name->name, struct external_name, name[0]);

    Switch name_snapshot to use unsigned chars, matching struct qstr and
    struct external_name.

    Link: http://lkml.kernel.org/r/20170710152134.0f78c1e6@canb.auug.org.au
    Signed-off-by: Stephen Rothwell
    Cc: Al Viro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Stephen Rothwell
     
  • Pull i2c updates from Wolfram Sang:
    "This pull request contains:

    - i2c core reorganization. One source file became too monolithic. It
    is now split up, yet we still have the same named object as the
    final output. This should ease maintenance.

    - new drivers: ZTE ZX2967 family, ASPEED 24XX/25XX

    - designware driver gained slave mode support

    - xgene-slimpro driver gained ACPI support

    - bigger overhaul for pca-platform driver

    - the algo-bit module now supports messages with enforced STOP

    - slightly bigger than usual set of driver updates and improvements

    and with much appreciated quality assurance from Andy Shevchenko"

    * 'i2c/for-4.13' of git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux: (51 commits)
    i2c: Provide a stub for i2c_detect_slave_mode()
    i2c: designware: Let slave adapter support be optional
    i2c: designware: Make HW init functions static
    i2c: designware: fix spelling mistakes
    i2c: pca-platform: propagate error from i2c_pca_add_numbered_bus
    i2c: pca-platform: correctly set algo_data.reset_chip
    i2c: acpi: Do not create i2c-clients for LNXVIDEO ACPI devices
    i2c: designware: enable SLAVE in platform module
    i2c: designware: add SLAVE mode functions
    i2c: zx2967: drop COMPILE_TEST dependency
    i2c: zx2967: always use the same device when printing errors
    i2c: pca-platform: use dev_warn/dev_info instead of printk
    i2c: pca-platform: use device managed allocations
    i2c: pca-platform: add devicetree awareness
    i2c: pca-platform: switch to struct gpio_desc
    dt-bindings: add bindings for i2c-pca-platform
    i2c: cadance: fix ctrl/addr reg write order
    i2c: zx2967: add i2c controller driver for ZTE's zx2967 family
    dt: bindings: add documentation for zx2967 family i2c controller
    i2c: algo-bit: add support for I2C_M_STOP
    ...

    Linus Torvalds
     
  • Pull IOMMU updates from Joerg Roedel:
    "This update comes with:

    - Support for lockless operation in the ARM io-pgtable code.

    This is an important step to solve the scalability problems in the
    common dma-iommu code for ARM

    - Some Errata workarounds for ARM SMMU implementations

    - Rewrite of the deferred IO/TLB flush code in the AMD IOMMU driver.

    The code suffered from very high flush rates, with the new
    implementation the flush rate is down to ~1% of what it was before

    - Support for amd_iommu=off when booting with kexec.

    The problem here was that the IOMMU driver bailed out early without
    disabling the iommu hardware, if it was enabled in the old kernel

    - The Rockchip IOMMU driver is now available on ARM64

    - Align the return value of the iommu_ops->device_group call-backs to
    not miss error values

    - Preempt-disable optimizations in the Intel VT-d and common IOVA
    code to help Linux-RT

    - Various other small cleanups and fixes"

    * tag 'iommu-updates-v4.13' of git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu: (60 commits)
    iommu/vt-d: Constify intel_dma_ops
    iommu: Warn once when device_group callback returns NULL
    iommu/omap: Return ERR_PTR in device_group call-back
    iommu: Return ERR_PTR() values from device_group call-backs
    iommu/s390: Use iommu_group_get_for_dev() in s390_iommu_add_device()
    iommu/vt-d: Don't disable preemption while accessing deferred_flush()
    iommu/iova: Don't disable preempt around this_cpu_ptr()
    iommu/arm-smmu-v3: Add workaround for Cavium ThunderX2 erratum #126
    iommu/arm-smmu-v3: Enable ACPI based HiSilicon CMD_PREFETCH quirk(erratum 161010701)
    iommu/arm-smmu-v3: Add workaround for Cavium ThunderX2 erratum #74
    ACPI/IORT: Fixup SMMUv3 resource size for Cavium ThunderX2 SMMUv3 model
    iommu/arm-smmu-v3, acpi: Add temporary Cavium SMMU-V3 IORT model number definitions
    iommu/io-pgtable-arm: Use dma_wmb() instead of wmb() when publishing table
    iommu/io-pgtable: depend on !GENERIC_ATOMIC64 when using COMPILE_TEST with LPAE
    iommu/arm-smmu-v3: Remove io-pgtable spinlock
    iommu/arm-smmu: Remove io-pgtable spinlock
    iommu/io-pgtable-arm-v7s: Support lockless operation
    iommu/io-pgtable-arm: Support lockless operation
    iommu/io-pgtable: Introduce explicit coherency
    iommu/io-pgtable-arm-v7s: Refactor split_blk_unmap
    ...

    Linus Torvalds
     
  • Pull overlayfs updates from Miklos Szeredi:
    "This work from Amir introduces the inodes index feature, which
    provides:

    - hardlinks are not broken on copy up

    - infrastructure for overlayfs NFS export

    This also fixes constant st_ino for samefs case for lower hardlinks"

    * 'overlayfs-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs: (33 commits)
    ovl: mark parent impure and restore timestamp on ovl_link_up()
    ovl: document copying layers restrictions with inodes index
    ovl: cleanup orphan index entries
    ovl: persistent overlay inode nlink for indexed inodes
    ovl: implement index dir copy up
    ovl: move copy up lock out
    ovl: rearrange copy up
    ovl: add flag for upper in ovl_entry
    ovl: use struct copy_up_ctx as function argument
    ovl: base tmpfile in workdir too
    ovl: factor out ovl_copy_up_inode() helper
    ovl: extract helper to get temp file in copy up
    ovl: defer upper dir lock to tempfile link
    ovl: hash overlay non-dir inodes by copy up origin
    ovl: cleanup bad and stale index entries on mount
    ovl: lookup index entry for copy up origin
    ovl: verify index dir matches upper dir
    ovl: verify upper root dir matches lower root dir
    ovl: introduce the inodes index dir feature
    ovl: generalize ovl_create_workdir()
    ...

    Linus Torvalds
     
  • Reported-and-tested-by: Meelis Roos
    Fixes: commit d9e968cb9f84 "getrlimit()/setrlimit(): move compat to native"
    Signed-off-by: Al Viro
    Acked-by: David S. Miller
    Signed-off-by: Linus Torvalds

    Al Viro
     

12 Jul, 2017

7 commits

  • Pull sparc fixes from David Miller:

    - Fix symbol version generation for assembler on sparc, from
    Nagarathnam Muthusamy.

    - Fix compound page handling in gup_huge_pmd(), from Nitin Gupta.

    * git://git.kernel.org/pub/scm/linux/kernel/git/davem/sparc:
    sparc64: Fix gup_huge_pmd
    Adding the type of exported symbols
    sed regex in Makefile.build requires line break between exported symbols
    Adding asm-prototypes.h for genksyms to generate crc

    Linus Torvalds
     
  • Pull more block updates from Jens Axboe:
    "This is a followup for block changes, that didn't make the initial
    pull request. It's a bit of a mixed bag, this contains:

    - A followup pull request from Sagi for NVMe. Outside of fixups for
    NVMe, it also includes a series for ensuring that we properly
    quiesce hardware queues when browsing live tags.

    - Set of integrity fixes from Dmitry (mostly), fixing various issues
    for folks using DIF/DIX.

    - Fix for a bug introduced in cciss, with the req init changes. From
    Christoph.

    - Fix for a bug in BFQ, from Paolo.

    - Two followup fixes for lightnvm/pblk from Javier.

    - Depth fix from Ming for blk-mq-sched.

    - Also from Ming, performance fix for mtip32xx that was introduced
    with the dynamic initialization of commands"

    * 'for-linus' of git://git.kernel.dk/linux-block: (44 commits)
    block: call bio_uninit in bio_endio
    nvmet: avoid unneeded assignment of submit_bio return value
    nvme-pci: add module parameter for io queue depth
    nvme-pci: compile warnings in nvme_alloc_host_mem()
    nvmet_fc: Accept variable pad lengths on Create Association LS
    nvme_fc/nvmet_fc: revise Create Association descriptor length
    lightnvm: pblk: remove unnecessary checks
    lightnvm: pblk: control I/O flow also on tear down
    cciss: initialize struct scsi_req
    null_blk: fix error flow for shared tags during module_init
    block: Fix __blkdev_issue_zeroout loop
    nvme-rdma: unconditionally recycle the request mr
    nvme: split nvme_uninit_ctrl into stop and uninit
    virtio_blk: quiesce/unquiesce live IO when entering PM states
    mtip32xx: quiesce request queues to make sure no submissions are inflight
    nbd: quiesce request queues to make sure no submissions are inflight
    nvme: kick requeue list when requeueing a request instead of when starting the queues
    nvme-pci: quiesce/unquiesce admin_q instead of start/stop its hw queues
    nvme-loop: quiesce/unquiesce admin_q instead of start/stop its hw queues
    nvme-fc: quiesce/unquiesce admin_q instead of start/stop its hw queues
    ...

    Linus Torvalds
     
  • Pull cifs fixes and sane default from Steve French:
    "Upgrade default dialect to more secure SMB3 from older cifs dialect"

    * tag 'smb3-security-fixes-for-4.13' of git://git.samba.org/sfrench/cifs-2.6:
    cifs: Clean up unused variables in smb2pdu.c
    [SMB3] Improve security, move default dialect to SMB3 from old CIFS
    [SMB3] Remove ifdef since SMB3 (and later) now STRONGLY preferred
    CIFS: Reconnect expired SMB sessions
    CIFS: Display SMB2 error codes in the hex format
    cifs: Use smb 2 - 3 and cifsacl mount options setacl function
    cifs: prototype declaration and definition to set acl for smb 2 - 3 and cifsacl mount options

    Linus Torvalds
     
  • Pull ceph updates from Ilya Dryomov:
    "The main item here is support for v12.y.z ("Luminous") clusters:
    RESEND_ON_SPLIT, RADOS_BACKOFF, OSDMAP_PG_UPMAP and CRUSH_CHOOSE_ARGS
    feature bits, and various other changes in the RADOS client protocol.

    On top of that we have a new fsc mount option to allow supplying
    fscache uniquifier (similar to NFS) and the usual pile of filesystem
    fixes from Zheng"

    * tag 'ceph-for-4.13-rc1' of git://github.com/ceph/ceph-client: (44 commits)
    libceph: advertise support for NEW_OSDOP_ENCODING and SERVER_LUMINOUS
    libceph: osd_state is 32 bits wide in luminous
    crush: remove an obsolete comment
    crush: crush_init_workspace starts with struct crush_work
    libceph, crush: per-pool crush_choose_arg_map for crush_do_rule()
    crush: implement weight and id overrides for straw2
    libceph: apply_upmap()
    libceph: compute actual pgid in ceph_pg_to_up_acting_osds()
    libceph: pg_upmap[_items] infrastructure
    libceph: ceph_decode_skip_* helpers
    libceph: kill __{insert,lookup,remove}_pg_mapping()
    libceph: introduce and switch to decode_pg_mapping()
    libceph: don't pass pgid by value
    libceph: respect RADOS_BACKOFF backoffs
    libceph: make DEFINE_RB_* helpers more general
    libceph: avoid unnecessary pi lookups in calc_target()
    libceph: use target pi for calc_target() calculations
    libceph: always populate t->target_{oid,oloc} in calc_target()
    libceph: make sure need_resend targets reflect latest map
    libceph: delete from need_resend_linger before check_linger_pool_dne()
    ...

    Linus Torvalds
     
  • Pull watchdog updates from Wim Van Sebroeck:

    - Add Renesas RZ/A WDT Watchdog driver

    - STM32 Independent WatchDoG (IWDG) support

    - UniPhier watchdog support

    - Add F71868 support

    - Add support for NCT6793D and NCT6795D

    - dw_wdt: add reset lines support

    - core: add option to avoid early handling of watchdog

    - core: introduce watchdog_worker_should_ping helper

    - Cleanups and improvements for sama5d4, intel-mid_wdt, s3c2410_wdt,
    orion_wdt, gpio_wdt, it87_wdt, meson_wdt, davinci_wdt, bcm47xx_wdt,
    zx2967_wdt, cadence_wdt

    * git://www.linux-watchdog.org/linux-watchdog: (32 commits)
    watchdog: introduce watchdog_worker_should_ping helper
    watchdog: uniphier: add UniPhier watchdog driver
    dt-bindings: watchdog: add description for UniPhier WDT controller
    watchdog: cadence_wdt: make of_device_ids const.
    watchdog: zx2967: constify zx2967_wdt_ops.
    watchdog: bcm47xx_wdt: constify bcm47xx_wdt_hard_ops and bcm47xx_wdt_soft_ops
    watchdog: davinci: Add missing clk_disable_unprepare().
    watchdog: davinci: Handle return value of clk_prepare_enable
    watchdog: meson: Handle return value of clk_prepare_enable
    watchdog: it87: Add support for various Super-IO chips
    watchdog: it87: Use infrastructure to stop watchdog on reboot
    watchdog: it87: Drop support for resetting watchdog though CIR and Game port
    watchdog: it87: Convert to use watchdog core infrastructure
    watchdog: it87: Drop FSF mailing address
    watchdog: dw_wdt: get reset lines from dt
    watchdog: bindings: dw_wdt: add reset lines
    watchdog: w83627hf: Add support for NCT6793D and NCT6795D
    watchdog: core: add option to avoid early handling of watchdog
    watchdog: f71808e_wdt: Add F71868 support
    watchdog: Add STM32 IWDG driver
    ...

    Linus Torvalds
     
  • …/kernel/git/bleung/chrome-platform

    Pull chrome platform updates from Benson Leung:
    "Changes in this pull request are around catching up cros_ec with the
    internal chromeos-kernel versions of cros_ec, cros_ec_lpc, and
    cros_ec_lightbar.

    Also, switching maintainership from olof to bleung"

    * tag 'chrome-platform-for-linus-4.13' of git://git.kernel.org/pub/scm/linux/kernel/git/bleung/chrome-platform:
    platform/chrome : Add myself as Maintainer
    platform/chrome: cros_ec_lightbar - hide unused PM functions
    cros_ec: Don't signal wake event for non-wake host events
    cros_ec: Fix deadlock when EC is not responsive at probe
    cros_ec: Don't return error when checking command version
    platform/chrome: cros_ec_lightbar - Avoid I2C xfer to EC during suspend
    platform/chrome: cros_ec_lightbar - Add userspace lightbar control bit to EC
    platform/chrome: cros_ec_lightbar - Control of suspend/resume lightbar sequence
    platform/chrome: cros_ec_lightbar - Add lightbar program feature to sysfs
    platform/chrome: cros_ec_lpc: Add MKBP events support over ACPI
    platform/chrome: cros_ec_lpc: Add power management ops
    platform/chrome: cros_ec_lpc: Add support for GOOG004 ACPI device
    platform/chrome: cros_ec_lpc: Add support for mec1322 EC
    platform/chrome: cros_ec_lpc: Add R/W helpers to LPC protocol variants
    mfd: cros_ec: Add support for dumping panic information
    cros_ec_debugfs: Pass proper struct sizes to cros_ec_cmd_xfer()
    mfd: cros_ec: add debugfs, console log file
    mfd: cros_ec: Add EC console read structures definitions
    mfd: cros_ec: Add helper for event notifier.

    Linus Torvalds
     
  • Pull m68knommu update from Greg Ungerer:
    "Only a single change, to remove old Kconfig options from defconfigs"

    * 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/gerg/m68knommu:
    m68k: defconfig: Cleanup from old Kconfig options

    Linus Torvalds
     

11 Jul, 2017

10 commits

  • Merge more updates from Andrew Morton:

    - most of the rest of MM

    - KASAN updates

    - lib/ updates

    - checkpatch updates

    - some binfmt_elf changes

    - various misc bits

    * emailed patches from Andrew Morton: (115 commits)
    kernel/exit.c: avoid undefined behaviour when calling wait4()
    kernel/signal.c: avoid undefined behaviour in kill_something_info
    binfmt_elf: safely increment argv pointers
    s390: reduce ELF_ET_DYN_BASE
    powerpc: move ELF_ET_DYN_BASE to 4GB / 4MB
    arm64: move ELF_ET_DYN_BASE to 4GB / 4MB
    arm: move ELF_ET_DYN_BASE to 4MB
    binfmt_elf: use ELF_ET_DYN_BASE only for PIE
    fs, epoll: short circuit fetching events if thread has been killed
    checkpatch: improve multi-line alignment test
    checkpatch: improve macro reuse test
    checkpatch: change format of --color argument to --color[=WHEN]
    checkpatch: silence perl 5.26.0 unescaped left brace warnings
    checkpatch: improve tests for multiple line function definitions
    checkpatch: remove false warning for commit reference
    checkpatch: fix stepping through statements with $stat and ctx_statement_block
    checkpatch: [HLP]LIST_HEAD is also declaration
    checkpatch: warn when a MAINTAINERS entry isn't [A-Z]:\t
    checkpatch: improve the unnecessary OOM message test
    lib/bsearch.c: micro-optimize pivot position calculation
    ...

    Linus Torvalds
     
  • wait4(-2147483648, 0x20, 0, 0xdd0000) triggers:
    UBSAN: Undefined behaviour in kernel/exit.c:1651:9

    The related calltrace is as follows:

    negation of -2147483648 cannot be represented in type 'int':
    CPU: 9 PID: 16482 Comm: zj Tainted: G B ---- ------- 3.10.0-327.53.58.71.x86_64+ #66
    Hardware name: Huawei Technologies Co., Ltd. Tecal RH2285 /BC11BTSA , BIOS CTSAV036 04/27/2011
    Call Trace:
    dump_stack+0x19/0x1b
    ubsan_epilogue+0xd/0x50
    __ubsan_handle_negate_overflow+0x109/0x14e
    SyS_wait4+0x1cb/0x1e0
    system_call_fastpath+0x16/0x1b

    Exclude the overflow to avoid the UBSAN warning.
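
    The overflow arises because a negative pid is negated to obtain a
    process-group ID, and -INT_MIN does not fit in an int. A minimal
    userspace sketch of such a guard follows, assuming the fix rejects
    INT_MIN before negating. The helper name and the -ESRCH return value
    are illustrative assumptions, not taken from the patch.

```c
#include <errno.h>
#include <limits.h>

/*
 * Hypothetical helper: convert a negative pid argument into a
 * process-group ID. Returns 0 on success, or -ESRCH when the negation
 * would overflow. Callers are expected to pass upid < 0.
 */
static int pgid_from_negative_pid(int upid, int *pgid)
{
	/* -INT_MIN cannot be represented in an int, so refuse it. */
	if (upid == INT_MIN)
		return -ESRCH;

	*pgid = -upid;
	return 0;
}
```

    With the guard in place, the negation is only ever evaluated for
    values whose result is representable, so there is nothing left for
    UBSAN to report.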

    Link: http://lkml.kernel.org/r/1497264618-20212-1-git-send-email-zhongjiang@huawei.com
    Signed-off-by: zhongjiang
    Cc: Oleg Nesterov
    Cc: David Rientjes
    Cc: Aneesh Kumar K.V
    Cc: Kirill A. Shutemov
    Cc: Xishi Qiu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    zhongjiang
     
  • When running kill(72057458746458112, 0) in userspace I hit the following
    issue.

    UBSAN: Undefined behaviour in kernel/signal.c:1462:11
    negation of -2147483648 cannot be represented in type 'int':
    CPU: 226 PID: 9849 Comm: test Tainted: G B ---- ------- 3.10.0-327.53.58.70.x86_64_ubsan+ #116
    Hardware name: Huawei Technologies Co., Ltd. RH8100 V3/BC61PBIA, BIOS BLHSV028 11/11/2014
    Call Trace:
    dump_stack+0x19/0x1b
    ubsan_epilogue+0xd/0x50
    __ubsan_handle_negate_overflow+0x109/0x14e
    SYSC_kill+0x43e/0x4d0
    SyS_kill+0xe/0x10
    system_call_fastpath+0x16/0x1b

    Add code to avoid the UBSAN detection.
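
    The odd-looking pid in the reproducer matters only for its low 32
    bits. The sketch below shows that arithmetic, assuming the 64-bit
    syscall argument is narrowed to a 32-bit signed pid; the helper name
    is hypothetical.

```c
#include <stdint.h>

/*
 * Return the low 32 bits of a raw 64-bit argument, interpreted as a
 * two's-complement signed value.
 */
static int32_t low32_as_signed(uint64_t raw)
{
	uint32_t low = (uint32_t)raw;	/* well defined: reduces modulo 2^32 */

	if (low <= (uint32_t)INT32_MAX)
		return (int32_t)low;

	/* Map 0x80000000..0xffffffff onto INT32_MIN..-1 without overflow. */
	return (int32_t)(low - 0x80000000u) + INT32_MIN;
}
```

    72057458746458112 is 0x00ffffe080000000, so
    low32_as_signed(72057458746458112ULL) yields INT32_MIN, the one value
    whose negation cannot be represented in an int.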

    [akpm@linux-foundation.org: tweak comment]
    Link: http://lkml.kernel.org/r/1496670008-59084-1-git-send-email-zhongjiang@huawei.com
    Signed-off-by: zhongjiang
    Cc: Oleg Nesterov
    Cc: Michal Hocko
    Cc: Vlastimil Babka
    Cc: Xishi Qiu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    zhongjiang
     
  • When building the argv/envp pointers, the envp is needlessly
    pre-incremented instead of just continuing after the argv pointers are
    finished. In some (likely impossible) race where the strings could be
    changed from userspace between copy_strings() and here, it might be
    possible to confuse the envp position. Instead, just use sp like
    everything else.
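
    The idea of using one running pointer for both vectors can be
    sketched in userspace as follows. This is a simplified model, not the
    kernel code: the function name and the flat uintptr_t slot array are
    assumptions made for illustration.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Write argc, the argv pointers, a NULL terminator, the envp pointers and
 * a final NULL terminator into consecutive slots. A single cursor (sp) is
 * advanced throughout, so envp always starts where argv actually ended.
 * Returns the number of slots written.
 */
static size_t fill_stack(uintptr_t *sp, int argc, char *const argv[],
			 int envc, char *const envp[])
{
	uintptr_t *start = sp;
	int i;

	*sp++ = (uintptr_t)argc;
	for (i = 0; i < argc; i++)
		*sp++ = (uintptr_t)argv[i];
	*sp++ = 0;			/* argv terminator */

	for (i = 0; i < envc; i++)
		*sp++ = (uintptr_t)envp[i];
	*sp++ = 0;			/* envp terminator */

	return (size_t)(sp - start);
}
```

    Because the envp position is derived from where the argv loop
    stopped, rather than precomputed from a count, the two vectors cannot
    disagree about the boundary between them.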

    Link: http://lkml.kernel.org/r/20170622173838.GA43308@beast
    Signed-off-by: Kees Cook
    Cc: Rik van Riel
    Cc: Daniel Micay
    Cc: Qualys Security Advisory
    Cc: Thomas Gleixner
    Cc: Ingo Molnar
    Cc: "H. Peter Anvin"
    Cc: Alexander Viro
    Cc: Dmitry Safonov
    Cc: Andy Lutomirski
    Cc: Grzegorz Andrejczuk
    Cc: Masahiro Yamada
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kees Cook
     
  • Now that explicitly executed loaders are loaded in the mmap region, we
    have more freedom to decide where we position PIE binaries in the
    address space to avoid possible collisions with mmap or stack regions.

    For 64-bit, align to 4GB to allow runtimes to use the entire 32-bit
    address space for 32-bit pointers. On 32-bit use 4MB, which is the
    traditional x86 minimum load location, likely to avoid historically
    requiring a 4MB page table entry when only a portion of the first 4MB
    would be used (since the NULL address is avoided). For s390 the
    position could be 0x10000, but that is needlessly close to the NULL
    address.

    Link: http://lkml.kernel.org/r/1498154792-49952-5-git-send-email-keescook@chromium.org
    Signed-off-by: Kees Cook
    Cc: Russell King
    Cc: Catalin Marinas
    Cc: Will Deacon
    Cc: Benjamin Herrenschmidt
    Cc: Paul Mackerras
    Cc: Michael Ellerman
    Cc: Martin Schwidefsky
    Cc: Heiko Carstens
    Cc: James Hogan
    Cc: Pratyush Anand
    Cc: Ingo Molnar
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kees Cook
     
  • Now that explicitly executed loaders are loaded in the mmap region, we
    have more freedom to decide where we position PIE binaries in the
    address space to avoid possible collisions with mmap or stack regions.

    For 64-bit, align to 4GB to allow runtimes to use the entire 32-bit
    address space for 32-bit pointers. On 32-bit use 4MB, which is the
    traditional x86 minimum load location, likely to avoid historically
    requiring a 4MB page table entry when only a portion of the first 4MB
    would be used (since the NULL address is avoided).

    Link: http://lkml.kernel.org/r/1498154792-49952-4-git-send-email-keescook@chromium.org
    Signed-off-by: Kees Cook
    Tested-by: Michael Ellerman
    Acked-by: Michael Ellerman
    Cc: Russell King
    Cc: Catalin Marinas
    Cc: Will Deacon
    Cc: Benjamin Herrenschmidt
    Cc: Paul Mackerras
    Cc: Martin Schwidefsky
    Cc: Heiko Carstens
    Cc: James Hogan
    Cc: Pratyush Anand
    Cc: Ingo Molnar
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kees Cook
     
  • Now that explicitly executed loaders are loaded in the mmap region, we
    have more freedom to decide where we position PIE binaries in the
    address space to avoid possible collisions with mmap or stack regions.

    For 64-bit, align to 4GB to allow runtimes to use the entire 32-bit
    address space for 32-bit pointers. On 32-bit use 4MB, to match ARM.
    This could be 0x8000, the standard ET_EXEC load address, but that is
    needlessly close to the NULL address, and anyone running arm compat PIE
    will have an MMU, so the tight mapping is not needed.

    Link: http://lkml.kernel.org/r/1498251600-132458-4-git-send-email-keescook@chromium.org
    Signed-off-by: Kees Cook
    Cc: Ard Biesheuvel
    Cc: Catalin Marinas
    Cc: Mark Rutland
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kees Cook
     
  • Now that explicitly executed loaders are loaded in the mmap region, we
    have more freedom to decide where we position PIE binaries in the
    address space to avoid possible collisions with mmap or stack regions.

    4MB is chosen here mainly to have parity with x86, where this is the
    traditional minimum load location, likely to avoid historically
    requiring a 4MB page table entry when only a portion of the first 4MB
    would be used (since the NULL address is avoided).

    For ARM the position could be 0x8000, the standard ET_EXEC load address,
    but that is needlessly close to the NULL address, and anyone running PIE
    on 32-bit ARM will have an MMU, so the tight mapping is not needed.

    Link: http://lkml.kernel.org/r/1498154792-49952-2-git-send-email-keescook@chromium.org
    Signed-off-by: Kees Cook
    Cc: Russell King
    Cc: Catalin Marinas
    Cc: Will Deacon
    Cc: Benjamin Herrenschmidt
    Cc: Paul Mackerras
    Cc: Michael Ellerman
    Cc: Martin Schwidefsky
    Cc: Heiko Carstens
    Cc: James Hogan
    Cc: Pratyush Anand
    Cc: Ingo Molnar
    Cc: "H. Peter Anvin"
    Cc: Alexander Viro
    Cc: Andy Lutomirski
    Cc: Daniel Micay
    Cc: Dmitry Safonov
    Cc: Grzegorz Andrejczuk
    Cc: Kees Cook
    Cc: Masahiro Yamada
    Cc: Qualys Security Advisory
    Cc: Rik van Riel
    Cc: Thomas Gleixner
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kees Cook
     
  • The ELF_ET_DYN_BASE position was originally intended to keep loaders
    away from ET_EXEC binaries. (For example, running "/lib/ld-linux.so.2
    /bin/cat" might cause the subsequent load of /bin/cat into where the
    loader had been loaded.)

    With the advent of PIE (ET_DYN binaries with an INTERP Program Header),
    ELF_ET_DYN_BASE continued to be used since the kernel was only looking
    at ET_DYN. However, since ELF_ET_DYN_BASE is traditionally set at the
    top 1/3rd of the TASK_SIZE, a substantial portion of the address space
    is unused.

    For 32-bit tasks when RLIMIT_STACK is set to RLIM_INFINITY, programs are
    loaded above the mmap region. This means they can be made to collide
    (CVE-2017-1000370) or nearly collide (CVE-2017-1000371) with
    pathological stack regions.

    Lowering ELF_ET_DYN_BASE solves both by moving programs below the mmap
    region in all cases, and will now additionally avoid programs falling
    back to the mmap region by enforcing MAP_FIXED for program loads (i.e.
    if it would have collided with the stack, now it will fail to load
    instead of falling back to the mmap region).

    To allow for a lower ELF_ET_DYN_BASE, loaders (ET_DYN without INTERP)
    are loaded into the mmap region, leaving space available for either an
    ET_EXEC binary with a fixed location or PIE being loaded into mmap by
    the loader. Only PIE programs are loaded offset from ELF_ET_DYN_BASE,
    which means architectures can now safely lower their values without risk
    of loaders colliding with their subsequently loaded programs.
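
    The placement rules described above can be summarised as a small
    decision function. This is a sketch only: the enum and function names
    are hypothetical and do not correspond to identifiers in
    fs/binfmt_elf.c.

```c
#include <stdbool.h>

/* Local stand-ins for the ELF e_type values; not the <elf.h> macros. */
enum obj_type { OBJ_EXEC, OBJ_DYN };

enum load_policy {
	LOAD_AT_FIXED_ADDRESS,	/* ET_EXEC: addresses come from the file */
	LOAD_AT_ET_DYN_BASE,	/* PIE: ET_DYN with an INTERP header */
	LOAD_IN_MMAP_REGION,	/* loader: ET_DYN without INTERP */
};

/* Decide where an ELF object is placed under the scheme described above. */
static enum load_policy choose_load_policy(enum obj_type type, bool has_interp)
{
	if (type == OBJ_EXEC)
		return LOAD_AT_FIXED_ADDRESS;

	return has_interp ? LOAD_AT_ET_DYN_BASE : LOAD_IN_MMAP_REGION;
}
```

    Only the middle case depends on ELF_ET_DYN_BASE, which is why the
    value can be lowered without a directly executed loader landing on
    top of the program it goes on to load.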

    For 64-bit, ELF_ET_DYN_BASE is best set to 4GB to allow runtimes to use
    the entire 32-bit address space for 32-bit pointers.
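
    To make the numbers concrete, the sketch below compares the
    traditional placement with the lowered bases. It assumes the
    traditional value is two thirds of TASK_SIZE and takes the 4MB figure
    for 32-bit tasks from the related architecture commits in this log;
    the helper names are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

#define SZ_4MB	(UINT64_C(4) << 20)
#define SZ_4GB	(UINT64_C(4) << 30)

/* Lowered PIE base: 4MB for 32-bit tasks, 4GB for 64-bit tasks. */
static uint64_t pie_base(bool is_32bit_task)
{
	return is_32bit_task ? SZ_4MB : SZ_4GB;
}

/* Traditional base: the start of the top third of the address space. */
static uint64_t legacy_base(uint64_t task_size)
{
	return task_size / 3 * 2;
}
```

    For a 3GB task size (0xC0000000), legacy_base() gives 0x80000000,
    while pie_base(true) gives 0x400000. On 64-bit, pie_base(false) is
    0x100000000, the first address above the 32-bit range, which leaves
    everything below it free for runtimes that use 32-bit pointers.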

    Thanks to PaX Team, Daniel Micay, and Rik van Riel for inspiration and
    suggestions on how to implement this solution.

    Fixes: d1fd836dcf00 ("mm: split ET_DYN ASLR from mmap ASLR")
    Link: http://lkml.kernel.org/r/20170621173201.GA114489@beast
    Signed-off-by: Kees Cook
    Acked-by: Rik van Riel
    Cc: Daniel Micay
    Cc: Qualys Security Advisory
    Cc: Thomas Gleixner
    Cc: Ingo Molnar
    Cc: "H. Peter Anvin"
    Cc: Alexander Viro
    Cc: Dmitry Safonov
    Cc: Andy Lutomirski
    Cc: Grzegorz Andrejczuk
    Cc: Masahiro Yamada
    Cc: Benjamin Herrenschmidt
    Cc: Catalin Marinas
    Cc: Heiko Carstens
    Cc: James Hogan
    Cc: Martin Schwidefsky
    Cc: Michael Ellerman
    Cc: Paul Mackerras
    Cc: Pratyush Anand
    Cc: Russell King
    Cc: Will Deacon
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kees Cook
     
  • We've encountered zombies that are waiting for a thread to exit that are
    looping in ep_poll() almost endlessly although there is a pending
    SIGKILL as a result of a group exit.

    This happens because we always find ep_events_available() and fetch more
    events and never are able to check for signal_pending() that would break
    from the loop and return -EINTR.

    Special case fatal signals and break immediately to guarantee that we
    don't loop to fetch more events and delay making a timely exit.
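
    The effect of checking for a fatal signal first can be seen in a toy
    model of the loop. This is not the ep_poll() code: the struct, its
    fields and the function are invented for illustration, and the round
    limit exists only so that the model terminates.

```c
#include <errno.h>
#include <stdbool.h>

/* Toy stand-in for the waiter's state; none of these are kernel names. */
struct toy_waiter {
	bool fatal_signal;	/* a SIGKILL is pending */
	bool events_ready;	/* the ready list is never empty */
	int max_rounds;		/* bound so the model terminates */
};

/*
 * Returns -EINTR when the loop exits because of the signal, otherwise the
 * number of fetch rounds performed before stopping.
 */
static int toy_poll(const struct toy_waiter *w, bool check_fatal_first)
{
	int rounds = 0;

	while (rounds < w->max_rounds) {
		if (check_fatal_first && w->fatal_signal)
			return -EINTR;

		if (w->events_ready) {
			rounds++;	/* fetch more events, go around again */
			continue;
		}

		if (w->fatal_signal)
			return -EINTR;

		break;	/* nothing ready, no signal: real code would sleep */
	}

	return rounds;
}
```

    With events always ready and a fatal signal pending, the loop without
    the early check runs until the bound is reached and never looks at
    the signal, whereas the loop with the early check returns -EINTR
    before fetching anything.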

    It would also be possible to simply move the check for signal_pending()
    higher than checking for ep_events_available(), but there have been no
    reports of delayed signal handling other than SIGKILL preventing zombies
    from exiting that would be fixed by this.

    It fixes an issue for us where we have witnessed zombies sticking around
    for at least O(minutes), but considering the code has been like this
    forever and nobody else has complained that I have found, I would simply
    queue it up for 4.12.

    Link: http://lkml.kernel.org/r/alpine.DEB.2.10.1705031722350.76784@chino.kir.corp.google.com
    Signed-off-by: David Rientjes
    Cc: Alexander Viro
    Cc: Jan Kara
    Cc: Davide Libenzi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes