13 Apr, 2016

1 commit


05 Apr, 2016

2 commits

  • Merge PAGE_CACHE_SIZE removal patches from Kirill Shutemov:
    "PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} macros were introduced a *long* time
    ago with the promise that one day it would be possible to implement the
    page cache with bigger chunks than PAGE_SIZE.

    This promise never materialized, and it is unlikely it ever will.

    Let's stop pretending that pages in page cache are special. They are
    not.

    The first patch, with most of the changes, has been done with
    coccinelle. The second is manual fixups on top.

    The third patch removes the macro definitions"

    [ I was planning to apply this just before rc2, but then I spaced out,
    so here it is right _after_ rc2 instead.

    As Kirill suggested as a possibility, I could have decided to only
    merge the first two patches, and leave the old interfaces for
    compatibility, but I'd rather get it all done and any out-of-tree
    modules and patches can trivially do the conversion while still also
    working with older kernels, so there is little reason to try to
    maintain the redundant legacy model. - Linus ]

    * PAGE_CACHE_SIZE-removal:
    mm: drop PAGE_CACHE_* and page_cache_{get,release} definition
    mm, fs: remove remaining PAGE_CACHE_* and page_cache_{get,release} usage
    mm, fs: get rid of PAGE_CACHE_* and page_cache_{get,release} macros

    Linus Torvalds
     
    PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} macros were introduced a *long* time
    ago with the promise that one day it would be possible to implement the
    page cache with bigger chunks than PAGE_SIZE.

    This promise never materialized, and it is unlikely it ever will.

    We have many places where PAGE_CACHE_SIZE is assumed to be equal to
    PAGE_SIZE, and it is a constant source of confusion whether the
    PAGE_CACHE_* or the PAGE_* constant should be used in a particular
    case, especially on the border between fs and mm.

    Globally switching to PAGE_CACHE_SIZE != PAGE_SIZE would cause too much
    breakage to be doable.

    Let's stop pretending that pages in page cache are special. They are
    not.

    The changes are pretty straightforward:

    - << (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <nothing>;

    - >> (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <nothing>;

    - PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} -> PAGE_{SIZE,SHIFT,MASK,ALIGN};

    - page_cache_get() -> get_page();

    - page_cache_release() -> put_page();

    This patch contains automated changes generated with coccinelle using
    script below. For some reason, coccinelle doesn't patch header files.
    I've called spatch for them manually.

    The only adjustment after coccinelle is a revert of the changes to the
    PAGE_CACHE_ALIGN definition: we are going to drop it later.

    There are a few places in the code that coccinelle didn't reach. I'll
    fix them manually in a separate patch. Comments and documentation will
    also be addressed in a separate patch.

    virtual patch

    @@
    expression E;
    @@
    - E << (PAGE_CACHE_SHIFT - PAGE_SHIFT)
    + E

    @@
    expression E;
    @@
    - E >> (PAGE_CACHE_SHIFT - PAGE_SHIFT)
    + E

    @@
    @@
    - PAGE_CACHE_SHIFT
    + PAGE_SHIFT

    @@
    @@
    - PAGE_CACHE_SIZE
    + PAGE_SIZE

    @@
    @@
    - PAGE_CACHE_MASK
    + PAGE_MASK

    @@
    expression E;
    @@
    - PAGE_CACHE_ALIGN(E)
    + PAGE_ALIGN(E)

    @@
    expression E;
    @@
    - page_cache_get(E)
    + get_page(E)

    @@
    expression E;
    @@
    - page_cache_release(E)
    + put_page(E)

    Signed-off-by: Kirill A. Shutemov
    Acked-by: Michal Hocko
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     

03 Apr, 2016

2 commits

  • Pull perf fixes from Ingo Molnar:
    "Misc kernel side fixes:

    - fix event leak
    - fix AMD PMU driver bug
    - fix core event handling bug
    - fix build bug on certain randconfigs

    Plus misc tooling fixes"

    * 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    perf/x86/amd/ibs: Fix pmu::stop() nesting
    perf/core: Don't leak event in the syscall error path
    perf/core: Fix time tracking bug with multiplexing
    perf jit: genelf makes assumptions about endian
    perf hists: Fix determination of a callchain node's childlessness
    perf tools: Add missing initialization of perf_sample.cpumode in synthesized samples
    perf tools: Fix build break on powerpc
    perf/x86: Move events_sysfs_show() outside CPU_SUP_INTEL
    perf bench: Fix detached tarball building due to missing 'perf bench memcpy' headers
    perf tests: Fix tarpkg build test error output redirection

    Linus Torvalds
     
  • Pull core kernel fixes from Ingo Molnar:
    "This contains the nohz/atomic cleanup/fix for the fetch_or() ugliness
    you noted during the original nohz pull request, plus there's also
    misc fixes:

    - fix liblockdep build bug
    - fix uapi header build bug
    - print more lockdep hash collision info to help debug recent reports
    of hash collisions
    - update MAINTAINERS email address"

    * 'core-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    MAINTAINERS: Update my email address
    locking/lockdep: Print chain_key collision information
    uapi/linux/stddef.h: Provide __always_inline to userspace headers
    tools/lib/lockdep: Fix unsupported 'basename -s' in run_tests.sh
    locking/atomic, sched: Unexport fetch_or()
    timers/nohz: Convert tick dependency mask to atomic_t
    locking/atomic: Introduce atomic_fetch_or()

    Linus Torvalds
     

02 Apr, 2016

1 commit

  • Pull networking fixes from David Miller:

    1) Missing device reference in IPSEC input path results in crashes
    during device unregistration. From Subash Abhinov Kasiviswanathan.

    2) Per-queue ISR register writes not being done properly in macb
    driver, from Cyrille Pitchen.

    3) Stats accounting bugs in bcmgenet, from Petri Gynther.

    4) Lightweight tunnel's TTL and TOS were swapped in netlink dumps, from
    Quentin Armitage.

    5) SXGBE driver has off-by-one in probe error paths, from Rasmus
    Villemoes.

    6) Fix race in save/swap/delete options in netfilter ipset, from
    Vishwanath Pai.

    7) Ageing time of bridge not set properly when not operating over a
    switchdev device. Fix from Haishuang Yan.

    8) Fix GRO regression wrt nested FOU/GUE based tunnels, from Alexander
    Duyck.

    9) IPV6 UDP code bumps wrong stats, from Eric Dumazet.

    10) FEC driver should only access registers that actually exist on the
    given chipset, fix from Fabio Estevam.

    * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (73 commits)
    net: mvneta: fix changing MTU when using per-cpu processing
    stmmac: fix MDIO settings
    Revert "stmmac: Fix 'eth0: No PHY found' regression"
    stmmac: fix TX normal DESC
    net: mvneta: use cache_line_size() to get cacheline size
    net: mvpp2: use cache_line_size() to get cacheline size
    net: mvpp2: fix maybe-uninitialized warning
    tun, bpf: fix suspicious RCU usage in tun_{attach, detach}_filter
    net: usb: cdc_ncm: adding Telit LE910 V2 mobile broadband card
    rtnl: fix msg size calculation in if_nlmsg_size()
    fec: Do not access unexisting register in Coldfire
    net: mvneta: replace MVNETA_CPU_D_CACHE_LINE_SIZE with L1_CACHE_BYTES
    net: mvpp2: replace MVPP2_CPU_D_CACHE_LINE_SIZE with L1_CACHE_BYTES
    net: dsa: mv88e6xxx: Clear the PDOWN bit on setup
    net: dsa: mv88e6xxx: Introduce _mv88e6xxx_phy_page_{read, write}
    bpf: make padding in bpf_tunnel_key explicit
    ipv6: udp: fix UDP_MIB_IGNOREDMULTI updates
    bnxt_en: Fix ethtool -a reporting.
    bnxt_en: Fix typo in bnxt_hwrm_set_pause_common().
    bnxt_en: Implement proper firmware message padding.
    ...

    Linus Torvalds
     

31 Mar, 2016

11 commits

  • A sequence of pairs [class_idx -> corresponding chain_key iteration]
    is printed for both the current held_lock chain and the cached chain.

    That exposes the two different class_idx sequences that led to that
    particular hash value.

    This helps with debugging hash chain collision reports.

    Signed-off-by: Alfredo Alvarez Fernandez
    Acked-by: Peter Zijlstra (Intel)
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: linux-fsdevel@vger.kernel.org
    Cc: sedat.dilek@gmail.com
    Cc: tytso@mit.edu
    Link: http://lkml.kernel.org/r/1459357416-19190-1-git-send-email-alfredoalvarezernandez@gmail.com
    Signed-off-by: Ingo Molnar

    Alfredo Alvarez Fernandez
     
    Convert perf_output_begin() to __perf_output_begin() and make the
    latter function able to write records from the end of the ring-buffer.

    Following commits will utilize the 'backward' flag.

    This is the core patch to support writing to the ring-buffer backwards,
    which will be introduced by upcoming patches to support reading from
    overwritable ring-buffers.

    In theory, this patch should not introduce any extra performance
    overhead since we use always_inline, but it does not hurt to double
    check that assumption:

    When CONFIG_OPTIMIZE_INLINING is disabled, the output object is nearly
    identical to the original one. See:

    http://lkml.kernel.org/g/56F52E83.70409@huawei.com

    When CONFIG_OPTIMIZE_INLINING is enabled, the resulting object file
    becomes smaller:

    $ size kernel/events/ring_buffer.o*
    text data bss dec hex filename
    4641 4 8 4653 122d kernel/events/ring_buffer.o.old
    4545 4 8 4557 11cd kernel/events/ring_buffer.o.new

    Performance testing results:

    Call 'close(-1)' 3000000 times and use gettimeofday() to measure the
    duration. Use 'perf record -o /dev/null -e raw_syscalls:*' to capture
    the system calls. Times are in ns.

    Testing environment:

    CPU : Intel(R) Core(TM) i7-4790 CPU @ 3.60GHz
    Kernel : v4.5.0

    MEAN STDVAR
    BASE 800214.950 2853.083
    PRE 2253846.700 9997.014
    POST 2257495.540 8516.293

    Here 'BASE' is the performance without capturing, 'PRE' is the result
    for the plain v4.5.0 kernel, and 'POST' is the result with this patch
    applied.

    Considering the stdvar, this patch doesn't hurt performance; the
    difference is within the noise margin.

    For testing details, see:

    http://lkml.kernel.org/g/56F89DCD.1040202@huawei.com

    Signed-off-by: Wang Nan
    Signed-off-by: Peter Zijlstra (Intel)
    Cc:
    Cc: Alexander Shishkin
    Cc: Alexei Starovoitov
    Cc: Arnaldo Carvalho de Melo
    Cc: Brendan Gregg
    Cc: He Kuang
    Cc: Jiri Olsa
    Cc: Jiri Olsa
    Cc: Linus Torvalds
    Cc: Masami Hiramatsu
    Cc: Namhyung Kim
    Cc: Peter Zijlstra
    Cc: Stephane Eranian
    Cc: Thomas Gleixner
    Cc: Vince Weaver
    Cc: Zefan Li
    Link: http://lkml.kernel.org/r/1459147292-239310-4-git-send-email-wangnan0@huawei.com
    Signed-off-by: Ingo Molnar

    Wang Nan
     
    Set a default event->overflow_handler in perf_event_alloc() so we
    don't need to check event->overflow_handler for NULL in
    __perf_event_overflow(). Following commits can install a different
    default overflow_handler.

    Initial idea comes from Peter:

    http://lkml.kernel.org/r/20130708121557.GA17211@twins.programming.kicks-ass.net

    Since the default value of event->overflow_handler is not NULL, existing
    'if (!overflow_handler)' checks need to be changed.

    is_default_overflow_handler() is introduced for this.

    No extra performance overhead is introduced into the hot path because
    the original code already had to read this handler from memory. A
    conditional branch is avoided, so we actually remove some instructions.

    Signed-off-by: Wang Nan
    Signed-off-by: Peter Zijlstra (Intel)
    Cc:
    Cc: Alexander Shishkin
    Cc: Alexei Starovoitov
    Cc: Arnaldo Carvalho de Melo
    Cc: Brendan Gregg
    Cc: He Kuang
    Cc: Jiri Olsa
    Cc: Jiri Olsa
    Cc: Linus Torvalds
    Cc: Masami Hiramatsu
    Cc: Namhyung Kim
    Cc: Peter Zijlstra
    Cc: Stephane Eranian
    Cc: Thomas Gleixner
    Cc: Vince Weaver
    Cc: Zefan Li
    Link: http://lkml.kernel.org/r/1459147292-239310-3-git-send-email-wangnan0@huawei.com
    Signed-off-by: Ingo Molnar

    Wang Nan
     
  • Add new ioctl() to pause/resume ring-buffer output.

    In some situations we want to read from the ring-buffer only when we
    ensure nothing can write to the ring-buffer during reading. Without
    this patch we have to turn off all events attached to this ring-buffer
    to achieve this.

    This patch is a prerequisite for enabling overwrite support in the
    perf ring-buffer. Following commits will introduce new methods to
    support reading from overwritable ring buffers. Before reading, the
    caller must ensure the ring buffer is frozen, or the read is
    unreliable.

    Signed-off-by: Wang Nan
    Signed-off-by: Peter Zijlstra (Intel)
    Cc:
    Cc: Alexander Shishkin
    Cc: Alexei Starovoitov
    Cc: Arnaldo Carvalho de Melo
    Cc: Brendan Gregg
    Cc: He Kuang
    Cc: Jiri Olsa
    Cc: Jiri Olsa
    Cc: Linus Torvalds
    Cc: Masami Hiramatsu
    Cc: Namhyung Kim
    Cc: Peter Zijlstra
    Cc: Stephane Eranian
    Cc: Thomas Gleixner
    Cc: Vince Weaver
    Cc: Zefan Li
    Link: http://lkml.kernel.org/r/1459147292-239310-2-git-send-email-wangnan0@huawei.com
    Signed-off-by: Ingo Molnar

    Wang Nan
     
    Currently we check the sample type of ftrace:function events
    even if the event was not created as a sampling event. That prevents
    creating an ftrace:function event in counting mode.

    Make sure we check sample types only for sampling events.

    Before:
    $ sudo perf stat -e ftrace:function ls
    ...

    Performance counter stats for 'ls':

    ftrace:function

    0.001983662 seconds time elapsed

    After:
    $ sudo perf stat -e ftrace:function ls
    ...

    Performance counter stats for 'ls':

    44,498 ftrace:function

    0.037534722 seconds time elapsed

    Suggested-by: Namhyung Kim
    Signed-off-by: Jiri Olsa
    Signed-off-by: Peter Zijlstra (Intel)
    Acked-by: Steven Rostedt
    Cc: Alexander Shishkin
    Cc: Arnaldo Carvalho de Melo
    Cc: Arnaldo Carvalho de Melo
    Cc: Jiri Olsa
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Stephane Eranian
    Cc: Thomas Gleixner
    Cc: Vince Weaver
    Link: http://lkml.kernel.org/r/1458138873-1553-2-git-send-email-jolsa@kernel.org
    Signed-off-by: Ingo Molnar

    Jiri Olsa
     
  • In order to ensure safe AUX buffer management, we rely on the assumption
    that pmu::stop() stops its ongoing AUX transaction and not just the hw.

    This patch documents this requirement for the perf_aux_output_{begin,end}()
    APIs.

    Signed-off-by: Alexander Shishkin
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Arnaldo Carvalho de Melo
    Cc: Arnaldo Carvalho de Melo
    Cc: Jiri Olsa
    Cc: Linus Torvalds
    Cc: Mathieu Poirier
    Cc: Peter Zijlstra
    Cc: Stephane Eranian
    Cc: Thomas Gleixner
    Cc: Vince Weaver
    Cc: vince@deater.net
    Link: http://lkml.kernel.org/r/1457098969-21595-4-git-send-email-alexander.shishkin@linux.intel.com
    Signed-off-by: Ingo Molnar

    Alexander Shishkin
     
    Now that we can ensure that new transactions won't start while the
    ring buffer's AUX area is on its way to getting unmapped, we only need
    to stop all events that could potentially be writing AUX data to our
    ring buffer.

    Having done that, we can safely free the AUX pages and corresponding
    PMU data, as this time it is guaranteed to be the last aux reference
    holder.

    This partially reverts:

    57ffc5ca679 ("perf: Fix AUX buffer refcounting")

    ... which was made to defer deallocation that was otherwise possible
    from an NMI context. Now it is no longer the case; the last call to
    rb_free_aux() that drops the last AUX reference has to happen in
    perf_mmap_close() on that AUX area.

    Signed-off-by: Alexander Shishkin
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Arnaldo Carvalho de Melo
    Cc: Arnaldo Carvalho de Melo
    Cc: Jiri Olsa
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Stephane Eranian
    Cc: Thomas Gleixner
    Cc: Vince Weaver
    Cc: vince@deater.net
    Link: http://lkml.kernel.org/r/87d1qtz23d.fsf@ashishki-desk.ger.corp.intel.com
    Signed-off-by: Ingo Molnar

    Alexander Shishkin
     
    When the ring buffer's AUX area is unmapped and rb->aux_mmap_count
    drops to zero, new AUX transactions into this buffer can still be
    started, even though the buffer is en route to deallocation.

    This patch adds a check to perf_aux_output_begin() for
    rb->aux_mmap_count being zero, in which case there is no point
    starting new transactions. In other words, events on ring buffers that
    pass a certain point in perf_mmap_close() stop sending new data, which
    clears the path for freeing those buffers' pages right there and then,
    provided that no active transactions are holding the AUX reference.

    Signed-off-by: Alexander Shishkin
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Arnaldo Carvalho de Melo
    Cc: Arnaldo Carvalho de Melo
    Cc: Jiri Olsa
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Stephane Eranian
    Cc: Thomas Gleixner
    Cc: Vince Weaver
    Cc: vince@deater.net
    Link: http://lkml.kernel.org/r/1457098969-21595-2-git-send-email-alexander.shishkin@linux.intel.com
    Signed-off-by: Ingo Molnar

    Alexander Shishkin
     
  • There should (and can) only be a single PMU for perf_hw_context
    events.

    This is because of how we schedule events: once a hardware event fails to
    schedule (the PMU is 'full') we stop trying to add more. The trivial
    'fix' would break the Round-Robin scheduling we do.

    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Alexander Shishkin
    Cc: Arnaldo Carvalho de Melo
    Cc: Jiri Olsa
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Stephane Eranian
    Cc: Thomas Gleixner
    Cc: Vince Weaver
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • In the error path, event_file not being NULL is used to determine
    whether the event itself still needs to be free'd, so fix it up to
    avoid leaking.

    Reported-by: Leon Yu
    Signed-off-by: Alexander Shishkin
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Arnaldo Carvalho de Melo
    Cc: Arnaldo Carvalho de Melo
    Cc: Jiri Olsa
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Stephane Eranian
    Cc: Thomas Gleixner
    Cc: Vince Weaver
    Fixes: 130056275ade ("perf: Do not double free")
    Link: http://lkml.kernel.org/r/87twk06yxp.fsf@ashishki-desk.ger.corp.intel.com
    Signed-off-by: Ingo Molnar

    Alexander Shishkin
     
  • Stephane reported that commit:

    3cbaa5906967 ("perf: Fix ctx time tracking by introducing EVENT_TIME")

    introduced a regression wrt. time tracking, as easily observed by:

    > This patch introduce a bug in the time tracking of events when
    > multiplexing is used.
    >
    > The issue is easily reproducible with the following perf run:
    >
    > $ perf stat -a -C 0 -e branches,branches,branches,branches,branches,branches -I 1000
    > 1.000730239 652,394 branches (66.41%)
    > 1.000730239 597,809 branches (66.41%)
    > 1.000730239 593,870 branches (66.63%)
    > 1.000730239 651,440 branches (67.03%)
    > 1.000730239 656,725 branches (66.96%)
    > 1.000730239 branches
    >
    > One branches event is shown as not having run. Yet, with
    > multiplexing, all events should run especially with a 1s (-I 1000)
    > interval. The delta for time_running comes out to 0. Yet, the event
    > has run because the kernel is actually multiplexing the events. The
    > problem is that the time tracking is the kernel and especially in
    > ctx_sched_out() is wrong now.
    >
    > The problem is that in case that the kernel enters ctx_sched_out() with the
    > following state:
    > ctx->is_active=0x7 event_type=0x1
    > Call Trace:
    > [] dump_stack+0x63/0x82
    > [] ctx_sched_out+0x2bc/0x2d0
    > [] perf_mux_hrtimer_handler+0xf6/0x2c0
    > [] ? __perf_install_in_context+0x130/0x130
    > [] __hrtimer_run_queues+0xf8/0x2f0
    > [] hrtimer_interrupt+0xb7/0x1d0
    > [] local_apic_timer_interrupt+0x38/0x60
    > [] smp_apic_timer_interrupt+0x3d/0x50
    > [] apic_timer_interrupt+0x8c/0xa0
    >
    > In that case, the test:
    > if (is_active & EVENT_TIME)
    >
    > will be false and the time will not be updated. Time must always be updated on
    > sched out.

    Fix this by always updating time if EVENT_TIME was set, as opposed to
    only updating time when EVENT_TIME changed.

    Reported-by: Stephane Eranian
    Tested-by: Stephane Eranian
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Alexander Shishkin
    Cc: Arnaldo Carvalho de Melo
    Cc: Jiri Olsa
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: Vince Weaver
    Cc: kan.liang@intel.com
    Cc: namhyung@kernel.org
    Fixes: 3cbaa5906967 ("perf: Fix ctx time tracking by introducing EVENT_TIME")
    Link: http://lkml.kernel.org/r/20160329072644.GB3408@twins.programming.kicks-ass.net
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     

29 Mar, 2016

2 commits

  • This patch functionally reverts:

    5fd7a09cfb8c ("atomic: Export fetch_or()")

    During the merge Linus observed that the generic version of fetch_or()
    was messy:

    " This makes the ugly "fetch_or()" macro that the scheduler used
    internally a new generic helper, and does a bad job at it. "

    e23604edac2a Merge branch 'timers-nohz-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

    Now that we have introduced atomic_fetch_or(), fetch_or() is only used
    by the scheduler in order to deal with thread_info flags, whose type
    can vary across architectures.

    Let's confine fetch_or() back to the scheduler so that we encourage
    future users to use the more robust and well-typed atomic_t version
    instead.

    While at it, fetch_or() gets robustified, pasting improvements from a
    previous patch by Ingo Molnar that avoids needless expression
    re-evaluations in the loop.

    Reported-by: Linus Torvalds
    Signed-off-by: Frederic Weisbecker
    Cc: Andrew Morton
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Link: http://lkml.kernel.org/r/1458830281-4255-4-git-send-email-fweisbec@gmail.com
    Signed-off-by: Ingo Molnar

    Frederic Weisbecker
     
    The tick dependency mask was initially an unsigned long because that
    is the type clear_bit() operates on, and fetch_or() accepts it.

    But now that we have atomic_fetch_or(), we can instead use
    atomic_andnot() to clear the bit. This consolidates the type of our
    tick dependency mask, reduces its size in structures, and benefits
    from possible architecture optimizations of atomic_t operations.

    Suggested-by: Linus Torvalds
    Signed-off-by: Frederic Weisbecker
    Cc: Andrew Morton
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Link: http://lkml.kernel.org/r/1458830281-4255-3-git-send-email-fweisbec@gmail.com
    Signed-off-by: Ingo Molnar

    Frederic Weisbecker
     

26 Mar, 2016

3 commits

    KASAN needs to know whether the allocation happens in an IRQ handler.
    This lets us strip everything below the IRQ entry point to reduce the
    number of unique stack traces that need to be stored.

    Move the definition of __irq_entry to <linux/interrupt.h> so that the
    users don't need to pull in <linux/ftrace.h>. Also introduce the
    __softirq_entry macro, which is similar to __irq_entry but puts the
    corresponding functions into the .softirqentry.text section.

    Signed-off-by: Alexander Potapenko
    Acked-by: Steven Rostedt
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc: Andrey Konovalov
    Cc: Dmitry Vyukov
    Cc: Andrey Ryabinin
    Cc: Konstantin Serebryany
    Cc: Dmitry Chernenkov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexander Potapenko
     
    When the oom_reaper manages to unmap all the eligible vmas, there
    shouldn't be much of the freeable memory held by the oom victim left
    anymore, so it makes sense to clear the TIF_MEMDIE flag for the victim
    and allow the OOM killer to select another task.

    The lack of TIF_MEMDIE also means that the victim cannot access memory
    reserves anymore, but that shouldn't be a problem because it would get
    the access again if it needs to allocate and hits the OOM killer again
    due to the fatal_signal_pending resp. PF_EXITING check. We can safely
    hide the task from the OOM killer because it is clearly not a good
    candidate anymore, as everything reclaimable has been torn down
    already.

    This patch allows capping the time an OOM victim can keep TIF_MEMDIE
    and thus hold off further global OOM killer actions, provided the oom
    reaper is able to take mmap_sem for the associated mm struct. This is
    not guaranteed now, but further steps should make sure that taking
    mmap_sem for write is blocked killably, which will help reduce such
    lock contention. That is not done by this patch.

    Note that exit_oom_victim might now be called on a remote task from
    __oom_reap_task, so we have to check and clear the flag atomically;
    otherwise we might race and underflow oom_victims or wake up waiters
    too early.

    Signed-off-by: Michal Hocko
    Suggested-by: Johannes Weiner
    Suggested-by: Tetsuo Handa
    Cc: Andrea Argangeli
    Cc: David Rientjes
    Cc: Hugh Dickins
    Cc: Mel Gorman
    Cc: Oleg Nesterov
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • This will be needed in the patch "mm, oom: introduce oom reaper".

    Acked-by: Michal Hocko
    Cc: Ingo Molnar
    Cc: Peter Zijlstra
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     

25 Mar, 2016

6 commits

    Add a map_flags attribute to bpf_map_show_fdinfo(), so that tools like
    tc can check it when loading objects from a pinned entry, e.g. if the
    user's intent wrt allocation (BPF_F_NO_PREALLOC) differs from that of
    the pinned object, it can bail out. Follow-up to 6c9059817432 ("bpf:
    pre-allocate hash map elements"), so that tc can still support this
    with v4.6.

    Signed-off-by: Daniel Borkmann
    Acked-by: Alexei Starovoitov
    Signed-off-by: David S. Miller

    Daniel Borkmann
     
  • Pull more power management and ACPI updates from Rafael Wysocki:
    "The second batch of power management and ACPI updates for v4.6.

    Included are fixups on top of the previous PM/ACPI pull request and
    other material that didn't make it into that one but should still go
    into 4.6.

    Among other things, there's a fix for an intel_pstate driver issue
    uncovered by recent cpufreq changes, a workaround for a boot hang on
    Skylake-H related to the handling of deep C-states by the platform and
    a PCI/ACPI fix for the handling of IO port resources on non-x86
    architectures plus some new device IDs and similar.

    Specifics:

    - Fix for an intel_pstate driver issue related to the handling of MSR
    updates uncovered by the recent cpufreq rework (Rafael Wysocki).

    - cpufreq core cleanups related to starting governors and frequency
    synchronization during resume from system suspend and a locking fix
    for cpufreq_quick_get() (Rafael Wysocki, Richard Cochran).

    - acpi-cpufreq and powernv cpufreq driver updates (Jisheng Zhang,
    Michael Neuling, Richard Cochran, Shilpasri Bhat).

    - intel_idle driver update preventing some Skylake-H systems from
    hanging during initialization by disabling deep C-states mishandled
    by the platform in the problematic configurations (Len Brown).

    - Intel Xeon Phi Processor x200 support for intel_idle
    (Dasaratharaman Chandramouli).

    - cpuidle menu governor updates to make it always honor PM QoS
    latency constraints (and prevent C1 from being used as the fallback
    C-state on x86 when they are set below its exit latency), and to
    restore the pre-4.4 behavior of falling back to C1 if the next timer
    event is far enough in the future; the 4.4 change had led to an
    energy consumption regression (Rik van Riel, Rafael
    Wysocki).

    - New device ID for a future AMD UART controller in the ACPI driver
    for AMD SoCs (Wang Hongcheng).

    - Rockchip rk3399 support for the rockchip-io-domain adaptive voltage
    scaling (AVS) driver (David Wu).

    - ACPI PCI resources management fix for the handling of IO space
    resources on architectures where the IO space is memory mapped
    (IA64 and ARM64) broken by the introduction of common ACPI
    resources parsing for PCI host bridges in 4.4 (Lorenzo Pieralisi).

    - Fix for the ACPI backend of the generic device properties API to
    make it parse non-device (data node only) children of an ACPI
    device correctly (Irina Tirdea).

    - Fixes for the handling of global suspend flags (introduced in 4.4)
    during hibernation and resume from it (Lukas Wunner).

    - Support for obtaining configuration information from Device Trees
    in the PM clocks framework (Jon Hunter).

    - ACPI _DSM helper code and devfreq framework cleanups (Colin Ian
    King, Geert Uytterhoeven)"

    * tag 'pm+acpi-4.6-rc1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (23 commits)
    PM / AVS: rockchip-io: add io selectors and supplies for rk3399
    intel_idle: Support for Intel Xeon Phi Processor x200 Product Family
    intel_idle: prevent SKL-H boot failure when C8+C9+C10 enabled
    ACPI / PM: Runtime resume devices when waking from hibernate
    PM / sleep: Clear pm_suspend_global_flags upon hibernate
    cpufreq: governor: Always schedule work on the CPU running update
    cpufreq: Always update current frequency before startig governor
    cpufreq: Introduce cpufreq_update_current_freq()
    cpufreq: Introduce cpufreq_start_governor()
    cpufreq: powernv: Add sysfs attributes to show throttle stats
    cpufreq: acpi-cpufreq: make Intel/AMD MSR access, io port access static
    PCI: ACPI: IA64: fix IO port generic range check
    ACPI / util: cast data to u64 before shifting to fix sign extension
    cpufreq: powernv: Define per_cpu chip pointer to optimize hot-path
    cpuidle: menu: Fall back to polling if next timer event is near
    cpufreq: acpi-cpufreq: Clean up hot plug notifier callback
    intel_pstate: Do not call wrmsrl_on_cpu() with disabled interrupts
    cpufreq: Make cpufreq_quick_get() safe to call
    ACPI / property: fix data node parsing in acpi_get_next_subnode()
    ACPI / APD: Add device HID for future AMD UART controller
    ...

    Linus Torvalds
     
  • * pm-avs:
    PM / AVS: rockchip-io: add io selectors and supplies for rk3399

    * pm-clk:
    PM / clk: Add support for obtaining clocks from device-tree

    * pm-devfreq:
    PM / devfreq: Spelling s/frequnecy/frequency/

    * pm-sleep:
    ACPI / PM: Runtime resume devices when waking from hibernate
    PM / sleep: Clear pm_suspend_global_flags upon hibernate

    Rafael J. Wysocki
     
  • Pull tracing updates from Steven Rostedt:
    "Nothing major this round. Mostly small clean ups and fixes.

    Some visible changes:

    - A new flag was added to distinguish traces done in NMI context.

    - Preempt tracer now shows functions where preemption is disabled but
    interrupts are still enabled.

    Other notes:

    - Updates were done to function tracing to allow better performance
    with perf.

    - Infrastructure code has been added to allow for a new histogram
    feature for recording live trace event histograms that can be
    configured by simple user commands. The feature itself was just
    finished, but needs a round in linux-next before being pulled.

    This only includes some infrastructure changes that will be needed"

    * tag 'trace-v4.6' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace: (22 commits)
    tracing: Record and show NMI state
    tracing: Fix trace_printk() to print when not using bprintk()
    tracing: Remove redundant reset per-CPU buff in irqsoff tracer
    x86: ftrace: Fix the misleading comment for arch/x86/kernel/ftrace.c
    tracing: Fix crash from reading trace_pipe with sendfile
    tracing: Have preempt(irqs)off trace preempt disabled functions
    tracing: Fix return while holding a lock in register_tracer()
    ftrace: Use kasprintf() in ftrace_profile_tracefs()
    ftrace: Update dynamic ftrace calls only if necessary
    ftrace: Make ftrace_hash_rec_enable return update bool
    tracing: Fix typoes in code comment and printk in trace_nop.c
    tracing, writeback: Replace cgroup path to cgroup ino
    tracing: Use flags instead of bool in trigger structure
    tracing: Add an unreg_all() callback to trigger commands
    tracing: Add needs_rec flag to event triggers
    tracing: Add a per-event-trigger 'paused' field
    tracing: Add get_syscall_name()
    tracing: Add event record param to trigger_ops.func()
    tracing: Make event trigger functions available
    tracing: Make ftrace_event_field checking functions available
    ...

    Linus Torvalds
     
  • Pull perf fixes from Ingo Molnar:
    "This tree contains various perf fixes on the kernel side, plus three
    hw/event-enablement late additions:

    - Intel Memory Bandwidth Monitoring events and handling
    - the AMD Accumulated Power Mechanism reporting facility
    - more IOMMU events

    ... and a final round of perf tooling updates/fixes"

    * 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (44 commits)
    perf llvm: Use strerror_r instead of the thread unsafe strerror one
    perf llvm: Use realpath to canonicalize paths
    perf tools: Unexport some methods unused outside strbuf.c
    perf probe: No need to use formatting strbuf method
    perf help: Use asprintf instead of adhoc equivalents
    perf tools: Remove unused perf_pathdup, xstrdup functions
    perf tools: Do not include stringify.h from the kernel sources
    tools include: Copy linux/stringify.h from the kernel
    tools lib traceevent: Remove redundant CPU output
    perf tools: Remove needless 'extern' from function prototypes
    perf tools: Simplify die() mechanism
    perf tools: Remove unused DIE_IF macro
    perf script: Remove lots of unused arguments
    perf thread: Rename perf_event__preprocess_sample_addr to thread__resolve
    perf machine: Rename perf_event__preprocess_sample to machine__resolve
    perf tools: Add cpumode to struct perf_sample
    perf tests: Forward the perf_sample in the dwarf unwind test
    perf tools: Remove misplaced __maybe_unused
    perf list: Fix documentation of :ppp
    perf bench numa: Fix assertion for nodes bitfield
    ...

    Linus Torvalds
     
  • Pull scheduler fixes from Ingo Molnar:
    "Misc fixes: a cgroup fix, a fair-scheduler migration accounting fix, a
    cputime fix and two cpuacct cleanups"

    * 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    sched/cpuacct: Simplify the cpuacct code
    sched/cpuacct: Rename parameter in cpuusage_write() for readability
    sched/fair: Add comments to explain select_idle_sibling()
    sched/fair: Fix fairness issue on migration
    sched/cgroup: Fix/cleanup cgroup teardown/init
    sched/cputime: Fix steal time accounting vs. CPU hotplug

    Linus Torvalds
     

23 Mar, 2016

12 commits

  • When suspending to RAM, waking up and later suspending to disk,
    we gratuitously runtime resume devices after the thaw phase.
    This does not occur if we always suspend to RAM or always to disk.

    pm_complete_with_resume_check(), which gets called from
    pci_pm_complete() among others, schedules a runtime resume
    if PM_SUSPEND_FLAG_FW_RESUME is set. The flag is set during
    a suspend-to-RAM cycle. It is cleared at the beginning of
    the suspend-to-RAM cycle but not afterwards and it is not
    cleared during a suspend-to-disk cycle at all. Fix it.

    Fixes: ef25ba047601 (PM / sleep: Add flags to indicate platform firmware involvement)
    Signed-off-by: Lukas Wunner
    Cc:  # 4.4+
    Signed-off-by: Rafael J. Wysocki

    Lukas Wunner
     
  • Use the more common logging method with the eventual goal of removing
    pr_warning altogether.

    Miscellanea:

    - Realign arguments
    - Coalesce formats
    - Add missing space between a few coalesced formats

    Signed-off-by: Joe Perches
    Acked-by: Rafael J. Wysocki [kernel/power/suspend.c]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joe Perches
     
  • Add a flag to memremap() for writecombine mappings. Mappings satisfied
    by this flag will not be cached, however writes may be delayed or
    combined into more efficient bursts. This is most suitable for buffers
    written sequentially by the CPU for use by other DMA devices.

    Signed-off-by: Brian Starkey
    Reviewed-by: Catalin Marinas
    Cc: Dan Williams
    Cc: Greg Kroah-Hartman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Brian Starkey
     
  • These patches implement a MEMREMAP_WC flag for memremap(), which can be
    used to obtain writecombine mappings. This is then used for setting up
    dma_coherent_mem regions which use the DMA_MEMORY_MAP flag.

    The motivation is to fix an alignment fault on arm64, and the suggestion
    to implement MEMREMAP_WC for this case was made at [1]. That particular
    issue is handled in patch 4, which makes sure that the appropriate
    memset function is used when zeroing allocations mapped as IO memory.

    This patch (of 4):

    Don't modify the flags input argument to memremap(). MEMREMAP_WB is
    already a special case so we can check for it directly instead of
    clearing flag bits in each mapper.

    Signed-off-by: Brian Starkey
    Cc: Catalin Marinas
    Cc: Dan Williams
    Cc: Greg Kroah-Hartman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Brian Starkey
     
  • The value of __ARCH_SI_PREAMBLE_SIZE defines the size (including
    padding) of the part of the struct siginfo that is before the union, and
    it is then used to calculate the needed padding (SI_PAD_SIZE) to make
    the size of struct siginfo equal to 128 (SI_MAX_SIZE) bytes.

    Depending on the target architecture and word width, it equals either
    3 or 4 times sizeof(int).

    Since the very beginning we had __ARCH_SI_PREAMBLE_SIZE wrong on the
    parisc architecture for the 64-bit kernel build. It's even more
    frustrating because it can easily be checked at compile time whether
    the value was defined correctly.

    This patch adds such a check for the correctness of
    __ARCH_SI_PREAMBLE_SIZE in the hope that it will prevent existing and
    future architectures from running into the same problem.

    I refrained from replacing __ARCH_SI_PREAMBLE_SIZE by offsetof() in
    copy_siginfo() in include/asm-generic/siginfo.h, because a) it doesn't
    make any difference and b) it's used in the Documentation/kmemcheck.txt
    example.

    I ran this patch through the 0-DAY kernel test infrastructure and only
    the parisc architecture triggered as expected. That means that this
    patch should be OK for all major architectures.

    Signed-off-by: Helge Deller
    Cc: Stephen Rothwell
    Cc: Michael Ellerman
    Cc: Oleg Nesterov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Helge Deller
     
  • kcov provides code coverage collection for coverage-guided fuzzing
    (randomized testing). Coverage-guided fuzzing is a testing technique
    that uses coverage feedback to determine new interesting inputs to a
    system. A notable user-space example is AFL
    (http://lcamtuf.coredump.cx/afl/). However, this technique is not
    widely used for kernel testing due to missing compiler and kernel
    support.

    kcov does not aim to collect as much coverage as possible. It aims to
    collect more or less stable coverage that is a function of syscall
    inputs. To achieve this goal, it does not collect coverage in
    soft/hard interrupts, and instrumentation of some inherently
    non-deterministic or uninteresting parts of the kernel (e.g. the
    scheduler and locking) is disabled.

    Currently there is a single coverage collection mode (tracing), but the
    API anticipates additional collection modes. Initially I also
    implemented a second mode which exposes coverage in a fixed-size hash
    table of counters (what Quentin used in his original patch). I've
    dropped the second mode for simplicity.

    This patch adds the necessary support on the kernel side. The
    complementary compiler support was added in gcc revision 231296.

    We've used this support to build syzkaller system call fuzzer, which has
    found 90 kernel bugs in just 2 months:

    https://github.com/google/syzkaller/wiki/Found-Bugs

    We've also found 30+ bugs in our internal systems with syzkaller.
    Another (yet unexplored) direction where kcov coverage would greatly
    help is more traditional "blob mutation". For example, mounting a
    random blob as a filesystem, or receiving a random blob over wire.

    Why not gcov? A typical fuzzing loop looks as follows: (1) reset
    coverage, (2) execute a bit of code, (3) collect coverage, repeat. A
    typical coverage set can be just a dozen basic blocks (e.g. an
    invalid input). In such a context gcov becomes prohibitively
    expensive, as the reset/collect steps depend on the total number of
    basic blocks/edges in the program (about 2M in the case of the
    kernel), whereas the cost of kcov depends only on the number of
    executed basic blocks/edges. On top of that, the kernel requires
    per-thread coverage, because there are always background threads and
    unrelated processes that also produce coverage. With inlined gcov
    instrumentation, per-thread coverage is not possible.

    kcov exposes kernel PCs and control flow to user-space, which is
    insecure. But debugfs should not be mapped as user accessible.

    Based on a patch by Quentin Casasnovas.

    [akpm@linux-foundation.org: make task_struct.kcov_mode have type `enum kcov_mode']
    [akpm@linux-foundation.org: unbreak allmodconfig]
    [akpm@linux-foundation.org: follow x86 Makefile layout standards]
    Signed-off-by: Dmitry Vyukov
    Reviewed-by: Kees Cook
    Cc: syzkaller
    Cc: Vegard Nossum
    Cc: Catalin Marinas
    Cc: Tavis Ormandy
    Cc: Will Deacon
    Cc: Quentin Casasnovas
    Cc: Kostya Serebryany
    Cc: Eric Dumazet
    Cc: Alexander Potapenko
    Cc: Kees Cook
    Cc: Bjorn Helgaas
    Cc: Sasha Levin
    Cc: David Drysdale
    Cc: Ard Biesheuvel
    Cc: Andrey Ryabinin
    Cc: Kirill A. Shutemov
    Cc: Jiri Slaby
    Cc: Ingo Molnar
    Cc: Thomas Gleixner
    Cc: "H. Peter Anvin"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dmitry Vyukov
     
  • A couple of functions and variables in the profile implementation are
    used only on SMP systems by the procfs code, but are unused if either
    procfs is disabled or in uniprocessor kernels. gcc prints a harmless
    warning about the unused symbols:

    kernel/profile.c:243:13: error: 'profile_flip_buffers' defined but not used [-Werror=unused-function]
    static void profile_flip_buffers(void)
    ^
    kernel/profile.c:266:13: error: 'profile_discard_flip_buffers' defined but not used [-Werror=unused-function]
    static void profile_discard_flip_buffers(void)
    ^
    kernel/profile.c:330:12: error: 'profile_cpu_callback' defined but not used [-Werror=unused-function]
    static int profile_cpu_callback(struct notifier_block *info,
    ^

    This adds further #ifdefs to the file, to annotate exactly in which
    cases they are used. I have built several thousand ARM randconfig
    kernels with this patch applied and no longer get any warnings in
    this file.

    Signed-off-by: Arnd Bergmann
    Cc: Vlastimil Babka
    Cc: Robin Holt
    Cc: Johannes Weiner
    Cc: Christoph Lameter
    Cc: Tejun Heo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Arnd Bergmann
     
  • Commit 1717f2096b54 ("panic, x86: Fix re-entrance problem due to panic
    on NMI") and commit 58c5661f2144 ("panic, x86: Allow CPUs to save
    registers even if looping in NMI context") introduced nmi_panic() which
    prevents concurrent/recursive execution of panic(). It also saves
    registers for the crash dump on x86.

    However, there are some cases where NMI handlers still use panic().
    This patch set partially replaces them with nmi_panic() in those cases.

    Even after this patchset is applied, some NMI or similar handlers
    (e.g. the MCE handler) continue to use panic(). This is because I
    can't test them well and actual problems are unlikely to happen. For
    example, the possibility that a normal panic and a panic on MCE
    happen simultaneously is very low.

    This patch (of 3):

    Convert nmi_panic() to a proper function and export it instead of
    exporting internal implementation details to modules, for obvious
    reasons.

    Signed-off-by: Hidehiro Kawai
    Acked-by: Borislav Petkov
    Acked-by: Michal Nazarewicz
    Cc: Michal Hocko
    Cc: Rasmus Villemoes
    Cc: Nicolas Iooss
    Cc: Javi Merino
    Cc: Gobinda Charan Maji
    Cc: "Steven Rostedt (Red Hat)"
    Cc: Thomas Gleixner
    Cc: Vitaly Kuznetsov
    Cc: HATAYAMA Daisuke
    Cc: Tejun Heo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hidehiro Kawai
     
  • This commit fixes the following security hole affecting systems where
    all of the following conditions are fulfilled:

    - The fs.suid_dumpable sysctl is set to 2.
    - The kernel.core_pattern sysctl's value starts with "/". (Systems
    where kernel.core_pattern starts with "|/" are not affected.)
    - Unprivileged user namespace creation is permitted. (This is
    true on Linux >=3.8, but some distributions disallow it by
    default using a distro patch.)

    Under these conditions, if a program executes under secure exec rules,
    causing it to run with the SUID_DUMP_ROOT flag, then unshares its user
    namespace, changes its root directory and crashes, the coredump will be
    written using fsuid=0 and a path derived from kernel.core_pattern - but
    this path is interpreted relative to the root directory of the process,
    allowing the attacker to control where a coredump will be written with
    root privileges.

    To fix the security issue, always interpret core_pattern for dumps that
    are written under SUID_DUMP_ROOT relative to the root directory of init.

    Signed-off-by: Jann Horn
    Acked-by: Kees Cook
    Cc: Al Viro
    Cc: "Eric W. Biederman"
    Cc: Andy Lutomirski
    Cc: Oleg Nesterov
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jann Horn
     
  • This test-case (a simplified version of one generated by syzkaller)

    #include <unistd.h>
    #include <sys/ptrace.h>
    #include <sys/wait.h>

    void test(void)
    {
        for (;;) {
            if (fork()) {
                wait(NULL);
                continue;
            }

            ptrace(PTRACE_SEIZE, getppid(), 0, 0);
            ptrace(PTRACE_INTERRUPT, getppid(), 0, 0);
            _exit(0);
        }
    }

    int main(void)
    {
        int np;

        for (np = 0; np < 8; ++np)
            if (!fork())
                test();

        while (wait(NULL) > 0)
            ;
        return 0;
    }

    triggers the 2nd WARN_ON_ONCE(!signr) warning in do_jobctl_trap(). The
    problem is that __ptrace_unlink() clears task->jobctl under siglock but
    task->ptrace is cleared without this lock held; this fools the "else"
    branch which assumes that !PT_SEIZED means PT_PTRACED.

    Note also that most of other PTRACE_SEIZE checks can race with detach
    from the exiting tracer too. Say, the callers of ptrace_trap_notify()
    assume that SEIZED can't go away after it was checked.

    Signed-off-by: Oleg Nesterov
    Reported-by: Dmitry Vyukov
    Cc: Tejun Heo
    Cc: syzkaller
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • Except on SPARC, this is what the code always did. SPARC compat seccomp
    was buggy, although the impact of the bug was limited because SPARC
    32-bit and 64-bit syscall numbers are the same.

    Signed-off-by: Andy Lutomirski
    Cc: Paul Moore
    Cc: Eric Paris
    Cc: David Miller
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andy Lutomirski
     
  • Users of the 32-bit ptrace() ABI expect the full 32-bit ABI. siginfo
    translation should check ptrace() ABI, not caller task ABI.

    This is an ABI change on SPARC. Let's hope that no one relied on the
    old buggy ABI.

    Signed-off-by: Andy Lutomirski
    Cc: Oleg Nesterov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andy Lutomirski