06 Nov, 2015

1 commit

  • Pull cgroup updates from Tejun Heo:
    "The cgroup core saw several significant updates this cycle:

    - percpu_rwsem for threadgroup locking is reinstated. This was
    temporarily dropped due to down_write latency issues. Oleg's
    rework of percpu_rwsem, which is scheduled to be merged in this
    merge window, resolves the issue.

    - On the v2 hierarchy, when controllers are enabled and disabled, all
    operations are atomic and can fail and revert cleanly. This allows
    ->can_attach() failure which is necessary for cpu RT slices.

    - Tasks now stay associated with the original cgroups after exit
    until released. This allows tracking resources held by zombies
    (e.g. pids) and makes it easy to find out where zombies came from
    on the v2 hierarchy. The pids controller was broken before these
    changes as zombies escaped the limits; unfortunately, updating this
    behavior required too many invasive changes and I don't think it's
    a good idea to backport them, so the pids controller on 4.3, the
    first version which included the pids controller, will stay broken
    at least until I'm sure about the cgroup core changes.

    - Optimization of a couple common tests using static_key"

    * 'for-4.4' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: (38 commits)
    cgroup: fix race condition around termination check in css_task_iter_next()
    blkcg: don't create "io.stat" on the root cgroup
    cgroup: drop cgroup__DEVEL__legacy_files_on_dfl
    cgroup: replace error handling in cgroup_init() with WARN_ON()s
    cgroup: add cgroup_subsys->free() method and use it to fix pids controller
    cgroup: keep zombies associated with their original cgroups
    cgroup: make css_set_rwsem a spinlock and rename it to css_set_lock
    cgroup: don't hold css_set_rwsem across css task iteration
    cgroup: reorganize css_task_iter functions
    cgroup: factor out css_set_move_task()
    cgroup: keep css_set and task lists in chronological order
    cgroup: make cgroup_destroy_locked() test cgroup_is_populated()
    cgroup: make css_sets pin the associated cgroups
    cgroup: relocate cgroup_[try]get/put()
    cgroup: move check_for_release() invocation
    cgroup: replace cgroup_has_tasks() with cgroup_is_populated()
    cgroup: make cgroup->nr_populated count the number of populated css_sets
    cgroup: remove an unused parameter from cgroup_task_migrate()
    cgroup: fix too early usage of static_branch_disable()
    cgroup: make cgroup_update_dfl_csses() migrate all target processes atomically
    ...

    Linus Torvalds
     

05 Nov, 2015

1 commit

  • Pull networking updates from David Miller:

    Changes of note:

    1) Allow scheduling of ICMP packets in IPVS, from Alex Gartrell.

    2) Provide FIB table ID in ipv4 route dumps just as ipv6 does, from
    David Ahern.

    3) Allow the user to ask for the statistics to be filtered out of
    ipv4/ipv6 address netlink dumps. From Sowmini Varadhan.

    4) More work to pass the network namespace context around deep into
    various packet path APIs, starting with the netfilter hooks. From
    Eric W Biederman.

    5) Add layer 2 TX/RX checksum offloading to qeth driver, from Thomas
    Richter.

    6) Use usec resolution for SYN/ACK RTTs in TCP, from Yuchung Cheng.

    7) Support Very High Throughput in wireless MESH code, from Bob
    Copeland.

    8) Allow setting the ageing_time in switchdev/rocker. From Scott
    Feldman.

    9) Properly autoload L2TP type modules, from Stephen Hemminger.

    10) Fix and enable offload features by default in 8139cp driver, from
    David Woodhouse.

    11) Support both ipv4 and ipv6 sockets in a single vxlan device, from
    Jiri Benc.

    12) Fix CWND limiting of thin streams in TCP, from Bendik Rønning
    Opstad.

    13) Fix IPSEC flowcache overflows on large systems, from Steffen
    Klassert.

    14) Convert bridging to track VLANs using rhashtable entries rather than
    a bitmap. From Nikolay Aleksandrov.

    15) Make TCP listener handling completely lockless; this is a major
    accomplishment. Incoming request sockets now live in the
    established hash table just like any other socket.

    From Eric Dumazet.

    16) Provide more bridging attributes to netlink, from Nikolay
    Aleksandrov.

    17) Use hash based algorithm for ipv4 multipath routing, this was very
    long overdue. From Peter Nørlund.

    18) Several y2038 cures, mostly avoiding timespec. From Arnd Bergmann.

    19) Allow non-root execution of EBPF programs, from Alexei Starovoitov.

    20) Support SO_INCOMING_CPU as setsockopt, from Eric Dumazet. This
    influences the port binding selection logic used by SO_REUSEPORT.

    21) Add ipv6 support to VRF, from David Ahern.

    22) Add support for Mellanox Spectrum switch ASIC, from Jiri Pirko.

    23) Add rtl8xxxu Realtek wireless driver, from Jes Sorensen.

    24) Implement RACK loss recovery in TCP, from Yuchung Cheng.

    25) Support multipath routes in MPLS, from Roopa Prabhu.

    26) Fix POLLOUT notification for listening sockets in AF_UNIX, from Eric
    Dumazet.

    27) Add new QED QLogic driver, from Yuval Mintz, Manish Chopra, and
    Sudarsana Kalluru.

    28) Don't fetch timestamps on AF_UNIX sockets, from Hannes Frederic
    Sowa.

    29) Support ipv6 geneve tunnels, from John W Linville.

    30) Add flood control support to switchdev layer, from Ido Schimmel.

    31) Fix CHECKSUM_PARTIAL handling of potentially fragmented frames, from
    Hannes Frederic Sowa.

    32) Support persistent maps and progs in bpf, from Daniel Borkmann.

    * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1790 commits)
    sh_eth: use DMA barriers
    switchdev: respect SKIP_EOPNOTSUPP flag in case there is no recursion
    net: sched: kill dead code in sch_choke.c
    irda: Delete an unnecessary check before the function call "irlmp_unregister_service"
    net: dsa: mv88e6xxx: include DSA ports in VLANs
    net: dsa: mv88e6xxx: disable SA learning for DSA and CPU ports
    net/core: fix for_each_netdev_feature
    vlan: Invoke driver vlan hooks only if device is present
    arcnet/com20020: add LEDS_CLASS dependency
    bpf, verifier: annotate verbose printer with __printf
    dp83640: Only wait for timestamps for packets with timestamping enabled.
    ptp: Change ptp_class to a proper bitmask
    dp83640: Prune rx timestamp list before reading from it
    dp83640: Delay scheduled work.
    dp83640: Include hash in timestamp/packet matching
    ipv6: fix tunnel error handling
    net/mlx5e: Fix LSO vlan insertion
    net/mlx5e: Re-eanble client vlan TX acceleration
    net/mlx5e: Return error in case mlx5e_set_features() fails
    net/mlx5e: Don't allow more than max supported channels
    ...

    Linus Torvalds
     

04 Nov, 2015

2 commits

  • Pull perf updates from Ingo Molnar:
    "Kernel side changes:

    - Improve accuracy of perf/sched clock on x86. (Adrian Hunter)

    - Intel DS and BTS updates. (Alexander Shishkin)

    - Intel cstate PMU support. (Kan Liang)

    - Add group read support to perf_event_read(). (Peter Zijlstra)

    - Branch call hardware sampling support, implemented on x86 and
    PowerPC. (Stephane Eranian)

    - Event groups transactional interface enhancements. (Sukadev
    Bhattiprolu)

    - Enable proper x86/intel/uncore PMU support on multi-segment PCI
    systems. (Taku Izumi)

    - ... misc fixes and cleanups.

    The perf tooling team was very busy again with 200+ commits, the full
    diff doesn't fit into lkml size limits. Here's an (incomplete) list
    of the tooling highlights:

    New features:

    - Change the default event used in all tools (record/top): use the
    most precise "cycles" hw counter available, i.e. when the user
    doesn't specify any event, it will try using cycles:ppp, cycles:pp,
    etc and fall back transparently until it finds a working counter.
    (Arnaldo Carvalho de Melo)

    - Integration of perf with eBPF that, given an eBPF .c source file
    (or .o file built for the 'bpf' target with clang), will get it
    automatically built, validated and loaded into the kernel via the
    sys_bpf syscall, which can then be used and seen using 'perf trace'
    and other tools.

    (Wang Nan)

    Various user interface improvements:

    - Automatic pager invocation on long help output. (Namhyung Kim)

    - Search for more options when passing args to -h, e.g.: (Arnaldo
    Carvalho de Melo)

    $ perf report -h interface

    Usage: perf report [<options>]

    --gtk Use the GTK2 interface
    --stdio Use the stdio interface
    --tui Use the TUI interface

    - Show ordered command line options when -h is used or when an
    unknown option is specified. (Arnaldo Carvalho de Melo)

    - If options are passed after -h, show just their descriptions, not all
    options. (Arnaldo Carvalho de Melo)

    - Implement column based horizontal scrolling in the hists browser
    (top, report), making it possible to use the TUI for things like
    'perf mem report' where there are many more columns than can fit in
    a terminal. (Arnaldo Carvalho de Melo)

    - Enhance the error reporting of tracepoint event parsing, e.g.:

    $ oldperf record -e sched:sched_switc usleep 1
    event syntax error: 'sched:sched_switc'
    \___ unknown tracepoint
    Run 'perf list' for a list of valid events

    Now we get the much nicer:

    $ perf record -e sched:sched_switc ls
    event syntax error: 'sched:sched_switc'
    \___ can't access trace events

    Error: No permissions to read /sys/kernel/debug/tracing/events/sched/sched_switc
    Hint: Try 'sudo mount -o remount,mode=755 /sys/kernel/debug'

    And after we have those mount point permissions fixed:

    $ perf record -e sched:sched_switc ls
    event syntax error: 'sched:sched_switc'
    \___ unknown tracepoint

    Error: File /sys/kernel/debug/tracing/events/sched/sched_switc not found.
    Hint: Perhaps this kernel misses some CONFIG_ setting to enable this feature?.

    I.e. basically now the event parsing routine uses the strerror_open()
    routines introduced by, and used in, the 'perf trace' work. (Jiri Olsa)

    - Fail properly when pattern matching fails to find a tracepoint,
    i.e. '-e non:existent' was being correctly handled, with a proper
    error message about that not being a valid event, but '-e
    non:existent*' wasn't, fix it. (Jiri Olsa)

    - Do event name substring search as last resort in 'perf list'.
    (Arnaldo Carvalho de Melo)

    E.g.:

    # perf list clock

    List of pre-defined events (to be used in -e):

    cpu-clock [Software event]
    task-clock [Software event]

    uncore_cbox_0/clockticks/ [Kernel PMU event]
    uncore_cbox_1/clockticks/ [Kernel PMU event]

    kvm:kvm_pvclock_update [Tracepoint event]
    kvm:kvm_update_master_clock [Tracepoint event]
    power:clock_disable [Tracepoint event]
    power:clock_enable [Tracepoint event]
    power:clock_set_rate [Tracepoint event]
    syscalls:sys_enter_clock_adjtime [Tracepoint event]
    syscalls:sys_enter_clock_getres [Tracepoint event]
    syscalls:sys_enter_clock_gettime [Tracepoint event]
    syscalls:sys_enter_clock_nanosleep [Tracepoint event]
    syscalls:sys_enter_clock_settime [Tracepoint event]
    syscalls:sys_exit_clock_adjtime [Tracepoint event]
    syscalls:sys_exit_clock_getres [Tracepoint event]
    syscalls:sys_exit_clock_gettime [Tracepoint event]
    syscalls:sys_exit_clock_nanosleep [Tracepoint event]
    syscalls:sys_exit_clock_settime [Tracepoint event]

    Intel PT hardware tracing enhancements:

    - Accept a zero --itrace period, meaning "as often as possible". In
    the case of Intel PT that is the same as a period of 1 and a unit
    of 'instructions' (i.e. --itrace=i1i). (Adrian Hunter)

    - Harmonize itrace's synthesized callchains with the existing
    --max-stack tool option. (Adrian Hunter)

    - Allow time to be displayed in nanoseconds in 'perf script'.
    (Adrian Hunter)

    - Fix potential infinite loop when handling Intel PT timestamps.
    (Adrian Hunter)

    - Slightly improve Intel PT debug logging. (Adrian Hunter)

    - Warn when AUX data has been lost, just like when processing
    PERF_RECORD_LOST. (Adrian Hunter)

    - Further document export-to-postgresql.py script. (Adrian Hunter)

    - Add option to synthesize branch stack from auxtrace data. (Adrian
    Hunter)

    Misc notable changes:

    - Switch the default callchain output mode to 'graph,0.5,caller', to
    make it look like the default for other tools, reducing the
    learning curve for people used to 'caller' based viewing. (Arnaldo
    Carvalho de Melo)

    - Various call chain usability enhancements. (Namhyung Kim)

    - Introduce the 'P' event modifier, meaning 'max precision level,
    please', i.e.:

    $ perf record -e cycles:P usleep 1

    Is now similar to:

    $ perf record usleep 1

    Useful, for instance, when specifying multiple events. (Jiri Olsa)

    - Add 'socket' sort entry, to sort by the processor socket in 'perf
    top' and 'perf report'. (Kan Liang)

    - Introduce --socket-filter to 'perf report', for filtering by
    processor socket. (Kan Liang)

    - Add new "Zoom into Processor Socket" operation in the perf hists
    browser, used in 'perf top' and 'perf report'. (Kan Liang)

    - Allow probing on kmodules without DWARF. (Masami Hiramatsu)

    - Fix 'perf probe -l' for probes added to kernel module functions.
    (Masami Hiramatsu)

    - Preparatory work for the 'perf stat record' feature that will allow
    generating perf.data files with counting data in addition to the
    sampling mode we have now (Jiri Olsa)

    - Update libtraceevent KVM plugin. (Paolo Bonzini)

    - ... plus lots of other enhancements that I failed to list properly,
    by: Adrian Hunter, Alexander Shishkin, Andi Kleen, Andrzej Hajda,
    Arnaldo Carvalho de Melo, Dima Kogan, Don Zickus, Geliang Tang, He
    Kuang, Huaitong Han, Ingo Molnar, Jan Stancek, Jiri Olsa, Kan
    Liang, Kirill Tkhai, Masami Hiramatsu, Matt Fleming, Namhyung Kim,
    Paolo Bonzini, Peter Zijlstra, Rabin Vincent, Scott Wood, Stephane
    Eranian, Sukadev Bhattiprolu, Taku Izumi, Vaishali Thakkar, Wang
    Nan, Yang Shi and Yunlong Song"

    * 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (260 commits)
    perf unwind: Pass symbol source to libunwind
    tools build: Fix libiberty feature detection
    perf tools: Compile scriptlets to BPF objects when passing '.c' to --event
    perf record: Add clang options for compiling BPF scripts
    perf bpf: Attach eBPF filter to perf event
    perf tools: Make sure fixdep is built before libbpf
    perf script: Enable printing of branch stack
    perf trace: Add cmd string table to decode sys_bpf first arg
    perf bpf: Collect perf_evsel in BPF object files
    perf tools: Load eBPF object into kernel
    perf tools: Create probe points for BPF programs
    perf tools: Enable passing bpf object file to --event
    perf ebpf: Add the libbpf glue
    perf tools: Make perf depend on libbpf
    perf symbols: Fix endless loop in dso__split_kallsyms_for_kcore
    perf tools: Enable pre-event inherit setting by config terms
    perf symbols: we can now read separate debug-info files based on a build ID
    perf symbols: Fix type error when reading a build-id
    perf tools: Search for more options when passing args to -h
    perf stat: Cache aggregated map entries in extra cpumap
    ...

    Linus Torvalds
     
  • This seems to be a mis-reading of how alpha memory ordering works, and
    is not backed up by the alpha architecture manual. The helper functions
    don't do anything special on any other architectures, and the arguments
    that support them being safe on other architectures also argue that they
    are safe on alpha.

    Basically, the "control dependency" is between a previous read and a
    subsequent write that is dependent on the value read. Even if the
    subsequent write is actually done speculatively, there is no way that
    such a speculative write could be made visible to other cpu's until it
    has been committed, which requires validating the speculation.

    Note that most weakly ordered architectures (very much including alpha)
    do not guarantee any ordering relationship between two loads that depend
    on each other only through a control dependency:

        read A
        if (A == 1)
                read B

    because the conditional may be predicted, and the "read B" may be
    speculatively moved up to before reading the value A. So we require the
    user to insert a smp_rmb() between the two accesses to be correct:

        read A;
        if (A == 1)
                smp_rmb();
        read B;

    Alpha is further special in that it can break that ordering even if the
    *address* of B depends on the read of A, because the cacheline that is
    read later may be stale unless you have a memory barrier in between the
    pointer read and the read of the value behind a pointer:

        read ptr
        read offset(ptr)

    whereas all other weakly ordered architectures guarantee that the data
    dependency (as opposed to just a control dependency) will order the two
    accesses. As a result, alpha needs a "smp_read_barrier_depends()" in
    between those two reads for them to be ordered.
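
    To make the three cases just described concrete, here is a minimal
    kernel-style C sketch (hedged: the flag/data/ptr variables and the
    function names are purely illustrative, not taken from any real code):

        #include <linux/compiler.h>     /* READ_ONCE(), WRITE_ONCE() */
        #include <asm/barrier.h>        /* smp_rmb(), smp_read_barrier_depends() */

        static int flag, data;
        static int *ptr;

        /* Case 1: control dependency from a read to a subsequent write.
         * The store cannot become visible before the load is resolved,
         * so no barrier is needed -- this is the case the removed
         * *_ctrl() helpers were meant to cover. */
        static void ctrl_dep_to_write(void)
        {
                if (READ_ONCE(flag))
                        WRITE_ONCE(data, 1);
        }

        /* Case 2: control dependency between two reads.  The second read
         * can be speculated past the first, so an explicit smp_rmb() is
         * required. */
        static int ctrl_dep_between_reads(void)
        {
                if (READ_ONCE(flag)) {
                        smp_rmb();
                        return READ_ONCE(data);
                }
                return 0;
        }

        /* Case 3: data dependency through a pointer.  Ordered on all other
         * weakly ordered architectures, but alpha needs
         * smp_read_barrier_depends() between the pointer read and the
         * dereference. */
        static int data_dep_read(void)
        {
                int *p = READ_ONCE(ptr);

                smp_read_barrier_depends();
                return p ? *p : 0;
        }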

    The control dependency that "READ_ONCE_CTRL()" and "atomic_read_ctrl()"
    had was a control dependency to a subsequent *write*, however, and
    nobody can finalize such a subsequent write without having actually done
    the read. And were you to write such a value to a "stale" cacheline
    (the way the unordered reads came to be), that would seem to lose the
    write entirely.

    So the things that make alpha able to re-order reads even more
    aggressively than other weak architectures do not seem to be relevant
    for a subsequent write. Alpha memory ordering may be strange, but
    there's no real indication that it is *that* strange.

    Also, the alpha architecture reference manual very explicitly talks
    about the definition of "Dependence Constraints" in section 5.6.1.7,
    where a preceding read dominates a subsequent write.

    Such a dependence constraint admittedly does not impose a BEFORE (alpha
    architecture term for globally visible ordering), but it does guarantee
    that there can be no "causal loop". I don't see how you could avoid
    such a loop if another cpu could see the stored value and then impact
    the value of the first read. Put another way: the read and the write
    could not be seen as being out of order wrt other cpus.

    So I do not see how these "x_ctrl()" functions can currently be necessary.

    I may have to eat my words at some point, but in the absence of clear
    proof that alpha actually needs this, or indeed even an explanation of
    how alpha could _possibly_ need it, I do not believe these functions are
    called for.

    And if it turns out that alpha really _does_ need a barrier for this
    case, that barrier still should not be "smp_read_barrier_depends()".
    We'd have to make up some new speciality barrier just for alpha, along
    with the documentation for why it really is necessary.

    Cc: Peter Zijlstra
    Cc: Paul E McKenney
    Cc: Dmitry Vyukov
    Cc: Will Deacon
    Cc: Ingo Molnar
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

16 Oct, 2015

1 commit

  • cgroup_exit() is called when a task exits; it disassociates the
    exiting task from its cgroups and half-attaches it to the root cgroup.
    This is unnecessary and undesirable.

    No controller actually needs an exiting task to be disassociated from
    non-root cgroups. Both the cpu and perf_event controllers update the
    association to the root cgroup from their exit callbacks just to keep
    consistent with the cgroup core behavior.

    Also, this disassociation makes it difficult to track resources held
    by zombies or determine where the zombies came from. Currently, pids
    controller is completely broken as it uncharges on exit and zombies
    always escape the resource restriction. With cgroup association being
    reset on exit, fixing it is pretty painful.

    There's no reason to reset cgroup membership on exit. The zombie can
    be removed from its css_set so that it doesn't show up on
    "cgroup.procs" and thus can't be migrated or interfere with cgroup
    removal. It can still pin and point to the css_set so that its cgroup
    membership is maintained. This patch makes cgroup core keep zombies
    associated with their cgroups at the time of exit.

    * Previous patches decoupled populated_cnt tracking from css_set
    lifetime, so a dying task can be simply unlinked from its css_set
    while pinning and pointing to the css_set. This keeps css_set
    association from task side alive while hiding it from "cgroup.procs"
    and populated_cnt tracking. The css_set reference is dropped when
    the task_struct is freed.

    * ->exit() callback no longer needs the css arguments as the
    associated css never changes once PF_EXITING is set. Removed.

    * cpu and perf_events controllers no longer need ->exit() callbacks.
    There's no reason to explicitly switch away on exit. The final
    schedule out is enough. The callbacks are removed.

    * On traditional hierarchies, nothing changes. "/proc/PID/cgroup"
    still reports "/" for all zombies. On the default hierarchy,
    "/proc/PID/cgroup" keeps reporting the cgroup that the task belonged
    to at the time of exit. If the cgroup gets removed before the task
    is reaped, " (deleted)" is appended.

    v2: Build breakage due to missing dummy cgroup_free() when
    !CONFIG_CGROUP fixed.
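
    For reference, the companion patch in this series ("cgroup: add
    cgroup_subsys->free() method and use it to fix pids controller" in the
    shortlog above) lets a charge-tracking controller uncharge when the
    task_struct is finally freed rather than at exit time. A rough, hedged
    sketch of that shape (the demo_* names are illustrative, not the actual
    pids controller code):

        #include <linux/cgroup.h>
        #include <linux/sched.h>

        static void demo_free(struct task_struct *task)
        {
                /* The task kept its css_set until the task_struct is freed,
                 * so this is the natural place to drop any per-task charge. */
        }

        static struct cgroup_subsys demo_cgrp_subsys = {
                .free   = demo_free,    /* replaces uncharging from ->exit() */
        };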

    Signed-off-by: Tejun Heo
    Cc: Ingo Molnar
    Cc: Peter Zijlstra
    Cc: Arnaldo Carvalho de Melo

    Tejun Heo
     

18 Sep, 2015

4 commits

  • Signed-off-by: Ingo Molnar

    Ingo Molnar
     
  • There are two races with the current code:

    - Another event can join the group and compute a larger header_size
    concurrently, if the smaller store wins we'll have an incorrect
    header_size set.

    - We compute the header_size after the event becomes active,
    therefore it's possible to use the size before it's computed.

    Remedy the first by moving the computation inside the ctx::mutex lock,
    and the second by placing it _before_ perf_install_in_context().

    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Arnaldo Carvalho de Melo
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: linux-kernel@vger.kernel.org
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • Vince reported that it's possible to overflow the various size fields
    and get weird stuff if you stick too many events in a group.

    Put a lid on this by requiring that the fixed record size not exceed
    16k. This is still a fair amount of events (a silly amount, really) and
    leaves plenty of room for callchains and stack dwarves while also
    avoiding overflowing the u16 variables.

    Reported-by: Vince Weaver
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: linux-kernel@vger.kernel.org
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • The exclusive_event_installable() stuff only works because it's
    exclusive with the grouping bits.

    Rework the code such that there is a sane place to error out before we
    go do things we cannot undo.

    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Alexander Shishkin
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: linux-kernel@vger.kernel.org
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     

13 Sep, 2015

8 commits

  • Define a new PERF_PMU_TXN_READ interface to read a group of counters
    at once.

        pmu->start_txn()                // Initialize before first event

        for each event in group
                pmu->read(event);       // Queue each event to be read

        rc = pmu->commit_txn()          // Read/update all queued counters

    Note that we use this interface with all PMUs. PMUs that implement this
    interface use the ->read() operation to _queue_ the counters to be read
    and use ->commit_txn() to actually read all the queued counters at once.

    PMUs that don't implement PERF_PMU_TXN_READ ignore ->start_txn() and
    ->commit_txn() and continue to read counters one at a time.

    Thanks to input from Peter Zijlstra.
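
    Sketched in driver terms (hedged: the demo_* names and the per-CPU flag
    storage are illustrative, not the actual 24x7 code), the split between
    queueing and committing looks roughly like this:

        #include <linux/perf_event.h>
        #include <linux/percpu-defs.h>

        static DEFINE_PER_CPU(unsigned int, demo_txn_flags);

        static void demo_pmu_start_txn(struct pmu *pmu, unsigned int flags)
        {
                __this_cpu_write(demo_txn_flags, flags);
                /* For PERF_PMU_TXN_READ, reset the driver's request queue here. */
        }

        static void demo_pmu_read(struct perf_event *event)
        {
                if (__this_cpu_read(demo_txn_flags) & PERF_PMU_TXN_READ) {
                        /* Just queue 'event'; don't touch the hardware yet. */
                } else {
                        /* Legacy path: read this one counter immediately and
                         * fold the delta into event->count. */
                }
        }

        static int demo_pmu_commit_txn(struct pmu *pmu)
        {
                int ret = 0;

                if (__this_cpu_read(demo_txn_flags) & PERF_PMU_TXN_READ) {
                        /* Issue one hardware request for all queued counters
                         * and update each queued event's count, or set ret. */
                }
                __this_cpu_write(demo_txn_flags, 0);
                return ret;
        }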

    Signed-off-by: Sukadev Bhattiprolu
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Arnaldo Carvalho de Melo
    Cc: Arnaldo Carvalho de Melo
    Cc: Jiri Olsa
    Cc: Linus Torvalds
    Cc: Michael Ellerman
    Cc: Peter Zijlstra
    Cc: Stephane Eranian
    Cc: Thomas Gleixner
    Cc: Vince Weaver
    Link: http://lkml.kernel.org/r/1441336073-22750-9-git-send-email-sukadev@linux.vnet.ibm.com
    Signed-off-by: Ingo Molnar

    Sukadev Bhattiprolu
     
  • When we implement the ability to read several counters at once (using
    the PERF_PMU_TXN_READ transaction interface), perf_event_read() can
    fail when the 'group' parameter is true (e.g. when trying to read too
    many events at once).

    For now, have perf_event_read() return an integer. Ignore the return
    value when the 'group' parameter is false.

    Signed-off-by: Sukadev Bhattiprolu
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Arnaldo Carvalho de Melo
    Cc: Arnaldo Carvalho de Melo
    Cc: Jiri Olsa
    Cc: Linus Torvalds
    Cc: Michael Ellerman
    Cc: Peter Zijlstra
    Cc: Stephane Eranian
    Cc: Thomas Gleixner
    Cc: Vince Weaver
    Link: http://lkml.kernel.org/r/1441336073-22750-8-git-send-email-sukadev@linux.vnet.ibm.com
    Signed-off-by: Ingo Molnar

    Sukadev Bhattiprolu
     
  • In order to enable the use of perf_event_read(.group = true), we need
    to invert the sibling-child loop nesting of perf_read_group().

    Currently we iterate the child list for each sibling, this precludes
    using group reads. Flip things around so we iterate each group for
    each child.
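
    A hedged sketch of the inversion (the demo_* names are illustrative, and
    the leader->child_mutex locking is omitted for brevity): the outer loop is
    now over child events, and each iteration accumulates the whole group.

        #include <linux/perf_event.h>

        static void demo_sum_group(struct perf_event *leader, u64 *values)
        {
                struct perf_event *sub;
                int n = 0;

                values[n++] += local64_read(&leader->count);
                list_for_each_entry(sub, &leader->sibling_list, group_entry)
                        values[n++] += local64_read(&sub->count);
        }

        static void demo_read_group(struct perf_event *leader, u64 *values)
        {
                struct perf_event *child;

                demo_sum_group(leader, values);         /* the parent group itself */

                list_for_each_entry(child, &leader->child_list, child_list)
                        demo_sum_group(child, values);  /* whole group per child */
        }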

    Signed-off-by: Peter Zijlstra (Intel)
    [ Made the patch compile and things. ]
    Signed-off-by: Sukadev Bhattiprolu
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Arnaldo Carvalho de Melo
    Cc: Arnaldo Carvalho de Melo
    Cc: Jiri Olsa
    Cc: Linus Torvalds
    Cc: Michael Ellerman
    Cc: Peter Zijlstra
    Cc: Stephane Eranian
    Cc: Thomas Gleixner
    Cc: Vince Weaver
    Link: http://lkml.kernel.org/r/1441336073-22750-7-git-send-email-sukadev@linux.vnet.ibm.com
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • Enable perf_event_read() to update entire groups at once, this will be
    useful for read transactions.

    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Arnaldo Carvalho de Melo
    Cc: Arnaldo Carvalho de Melo
    Cc: Jiri Olsa
    Cc: Linus Torvalds
    Cc: Michael Ellerman
    Cc: Peter Zijlstra
    Cc: Stephane Eranian
    Cc: Sukadev Bhattiprolu
    Cc: Thomas Gleixner
    Cc: Vince Weaver
    Link: http://lkml.kernel.org/r/20150723080435.GE25159@twins.programming.kicks-ass.net
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • In order to free up the perf_event_read_group() name:

    s/perf_event_read_\(one\|group\)/perf_read_\1/g
    s/perf_read_hw/__perf_read/g

    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Arnaldo Carvalho de Melo
    Cc: Arnaldo Carvalho de Melo
    Cc: Jiri Olsa
    Cc: Linus Torvalds
    Cc: Michael Ellerman
    Cc: Peter Zijlstra
    Cc: Stephane Eranian
    Cc: Thomas Gleixner
    Cc: Vince Weaver
    Link: http://lkml.kernel.org/r/1441336073-22750-5-git-send-email-sukadev@linux.vnet.ibm.com
    Signed-off-by: Ingo Molnar

    Peter Zijlstra (Intel)
     
  • perf_event_read() does two things:

    - call the PMU to read/update the counter value, and
    - compute the total count of the event and its children

    Not all callers need both. perf_event_reset() for instance needs the
    first piece but doesn't need the second. Similarly, when we implement
    the ability to read a group of events using the transaction interface,
    we would need the two pieces done independently.

    Break up perf_event_read() and have it just read/update the counter
    and have the callers compute the total count if necessary.

    Signed-off-by: Sukadev Bhattiprolu
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Arnaldo Carvalho de Melo
    Cc: Arnaldo Carvalho de Melo
    Cc: Jiri Olsa
    Cc: Linus Torvalds
    Cc: Michael Ellerman
    Cc: Peter Zijlstra
    Cc: Stephane Eranian
    Cc: Thomas Gleixner
    Cc: Vince Weaver
    Link: http://lkml.kernel.org/r/1441336073-22750-4-git-send-email-sukadev@linux.vnet.ibm.com
    Signed-off-by: Ingo Molnar

    Sukadev Bhattiprolu
     
  • Currently, the PMU interface allows reading only one counter at a time.
    But some PMUs, like the 24x7 counters in Power, support reading several
    counters at once. To leverage this functionality, extend the transaction
    interface to support a "transaction type".

    The first type, PERF_PMU_TXN_ADD, refers to the existing transactions,
    i.e. used to _schedule_ all the events on the PMU as a group. A second
    transaction type, PERF_PMU_TXN_READ, will be used in a follow-on patch,
    by the 24x7 counters to read several counters at once.

    Extend the transaction interfaces to the PMU to accept a 'txn_flags'
    parameter and use this parameter to ignore any transactions that are
    not of type PERF_PMU_TXN_ADD.

    Thanks to Peter Zijlstra for his input.
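
    For a PMU that only supports the existing ADD transactions, the filtering
    amounts to an early return. A hedged sketch (the demo_* names and the
    per-CPU flag storage are illustrative; real drivers keep this in their
    cpu_hw_events-style state):

        #include <linux/perf_event.h>
        #include <linux/percpu-defs.h>

        static DEFINE_PER_CPU(unsigned int, demo_txn_flags);

        static void demo_start_txn(struct pmu *pmu, unsigned int txn_flags)
        {
                __this_cpu_write(demo_txn_flags, txn_flags);
                if (txn_flags & ~PERF_PMU_TXN_ADD)
                        return;                 /* not an ADD transaction: ignore */

                perf_pmu_disable(pmu);          /* existing TXN_ADD setup */
        }

        static int demo_commit_txn(struct pmu *pmu)
        {
                unsigned int flags = __this_cpu_read(demo_txn_flags);

                __this_cpu_write(demo_txn_flags, 0);
                if (flags & ~PERF_PMU_TXN_ADD)
                        return 0;               /* nothing was scheduled */

                perf_pmu_enable(pmu);           /* existing TXN_ADD commit */
                return 0;
        }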

    Signed-off-by: Sukadev Bhattiprolu
    [peterz: s390 compile fix]
    Signed-off-by: Peter Zijlstra (Intel)
    Acked-by: Michael Ellerman
    Cc: Arnaldo Carvalho de Melo
    Cc: Arnaldo Carvalho de Melo
    Cc: Jiri Olsa
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Stephane Eranian
    Cc: Thomas Gleixner
    Cc: Vince Weaver
    Link: http://lkml.kernel.org/r/1441336073-22750-3-git-send-email-sukadev@linux.vnet.ibm.com
    Signed-off-by: Ingo Molnar

    Sukadev Bhattiprolu
     
  • cgroup_exit() is no longer called from copy_process() after commit:

    e8604cb43690 ("cgroup: fix spurious lockdep warning in cgroup_exit()")

    It is only called from do_exit(). So this check is useless and the
    comment is obsolete.

    Signed-off-by: Kirill Tkhai
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Arnaldo Carvalho de Melo
    Cc: Jiri Olsa
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Stephane Eranian
    Cc: Thomas Gleixner
    Cc: Vince Weaver
    Link: http://lkml.kernel.org/r/55E444C8.3020402@odin.com
    Signed-off-by: Ingo Molnar

    Kirill Tkhai
     

11 Sep, 2015

1 commit

  • There are two kexec load syscalls: kexec_load and kexec_file_load.
    kexec_file_load has been split out into kernel/kexec_file.c. In this patch
    I split the kexec_load syscall code out into kernel/kexec.c.

    Also add a new kconfig option, KEXEC_CORE, so we can disable kexec_load
    and use kexec_file_load only, or vice versa.

    The original requirement is from Ted Ts'o: he wants the kexec kernel
    signature to be checked with CONFIG_KEXEC_VERIFY_SIG enabled. But
    kexec-tools can bypass that check by using the kexec_load syscall.

    Vivek Goyal proposed creating a common kconfig option so users can compile
    in only one syscall for loading the kexec kernel. KEXEC/KEXEC_FILE select
    KEXEC_CORE so that old config files still work.

    Because there is generic code that needs CONFIG_KEXEC_CORE, I updated all
    the architecture Kconfigs with the new KEXEC_CORE option and made KEXEC
    select KEXEC_CORE in the arch Kconfigs. The generic kernel code related to
    the kexec_load syscall was updated as well.

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Dave Young
    Cc: Eric W. Biederman
    Cc: Vivek Goyal
    Cc: Petr Tesarik
    Cc: Theodore Ts'o
    Cc: Josh Boyer
    Cc: David Howells
    Cc: Geert Uytterhoeven
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dave Young
     

03 Sep, 2015

1 commit

  • Pull networking updates from David Miller:
    "Another merge window, another set of networking changes. I've heard
    rumblings that the lightweight tunnels infrastructure has been voted
    networking change of the year. But what do I know?

    1) Add conntrack support to openvswitch, from Joe Stringer.

    2) Initial support for VRF (Virtual Routing and Forwarding), which
    allows the segmentation of routing paths without using multiple
    devices. There are some semantic kinks to work out still, but
    this is a reasonably strong foundation. From David Ahern.

    3) Remove spinlock from act_bpf fast path, from Alexei Starovoitov.

    4) Ignore route nexthops with a link down state in ipv6, just like
    ipv4. From Andy Gospodarek.

    5) Remove spinlock from fast path of act_gact and act_mirred, from
    Eric Dumazet.

    6) Document the DSA layer, from Florian Fainelli.

    7) Add netconsole support to bcmgenet, systemport, and DSA. Also
    from Florian Fainelli.

    8) Add Mellanox Switch Driver and core infrastructure, from Jiri
    Pirko.

    9) Add support for "light weight tunnels", which allow for
    encapsulation and decapsulation without bearing the overhead of a
    full blown netdevice. From Thomas Graf, Jiri Benc, and a cast of
    others.

    10) Add Identifier Locator Addressing support for ipv6, from Tom
    Herbert.

    11) Support fragmented SKBs in iwlwifi, from Johannes Berg.

    12) Allow perf PMUs to be accessed from eBPF programs, from Kaixu Xia.

    13) Add BQL support to 3c59x driver, from Loganaden Velvindron.

    14) Stop using a zero TX queue length to mean that a device shouldn't
    have a qdisc attached, use an explicit flag instead. From Phil
    Sutter.

    15) Use generic geneve netdevice infrastructure in openvswitch, from
    Pravin B Shelar.

    16) Add infrastructure to avoid re-forwarding a packet in software
    that was already forwarded by a hardware switch. From Scott
    Feldman.

    17) Allow AF_PACKET fanout function to be implemented in a bpf
    program, from Willem de Bruijn"

    * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1458 commits)
    netfilter: nf_conntrack: make nf_ct_zone_dflt built-in
    netfilter: nf_dup{4, 6}: fix build error when nf_conntrack disabled
    net: fec: clear receive interrupts before processing a packet
    ipv6: fix exthdrs offload registration in out_rt path
    xen-netback: add support for multicast control
    bgmac: Update fixed_phy_register()
    sock, diag: fix panic in sock_diag_put_filterinfo
    flow_dissector: Use 'const' where possible.
    flow_dissector: Fix function argument ordering dependency
    ixgbe: Resolve "initialized field overwritten" warnings
    ixgbe: Remove bimodal SR-IOV disabling
    ixgbe: Add support for reporting 2.5G link speed
    ixgbe: fix bounds checking in ixgbe_setup_tc for 82598
    ixgbe: support for ethtool set_rxfh
    ixgbe: Avoid needless PHY access on copper phys
    ixgbe: cleanup to use cached mask value
    ixgbe: Remove second instance of lan_id variable
    ixgbe: use kzalloc for allocating one thing
    flow: Move __get_hash_from_flowi{4,6} into flow_dissector.c
    ixgbe: Remove unused PCI bus types
    ...

    Linus Torvalds
     

12 Aug, 2015

4 commits

  • A question [1] was raised about the use of page::private in AUX buffer
    allocations, so let's add a clarification about its intended use.

    The private field and flag are used by perf's rb_alloc_aux() path to
    tell the pmu driver the size of each high-order allocation, so that the
    driver can program those appropriately into its hardware. This only
    matters for PMUs that don't support hardware scatter tables. Otherwise,
    every page in the buffer is just a page.
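
    Concretely, a driver's setup path might recover the block sizes like this
    (hedged sketch; demo_program_block() stands in for whatever register
    programming the real hardware needs):

        #include <linux/mm.h>

        static void demo_program_block(struct page *page, unsigned long size)
        {
                /* Program one physically contiguous block into the
                 * (imaginary) hardware's scatter-table substitute. */
        }

        static void demo_program_aux(void **pages, int nr_pages)
        {
                int i = 0;

                while (i < nr_pages) {
                        struct page *page = virt_to_page(pages[i]);
                        /* rb_alloc_aux() marks the first page of each
                         * high-order allocation PagePrivate() and stores the
                         * order in page_private(); plain pages are order 0. */
                        int order = PagePrivate(page) ? page_private(page) : 0;

                        demo_program_block(page, PAGE_SIZE << order);
                        i += 1 << order;
                }
        }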

    This patch adds a comment about the private field to the AUX buffer
    allocation path.

    [1] http://marc.info/?l=linux-kernel&m=143803696607968

    Reported-by: Mathieu Poirier
    Signed-off-by: Alexander Shishkin
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Arnaldo Carvalho de Melo
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Link: http://lkml.kernel.org/r/1438063204-665-1-git-send-email-alexander.shishkin@linux.intel.com
    Signed-off-by: Ingo Molnar

    Alexander Shishkin
     
  • Signed-off-by: Ingo Molnar

    Ingo Molnar
     
  • I ran the perf fuzzer, which triggered some WARN()s which are due to
    trying to stop/restart an event on the wrong CPU.

    Use the normal IPI pattern to ensure we run the code on the correct CPU.

    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Vince Weaver
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Fixes: bad7192b842c ("perf: Fix PERF_EVENT_IOC_PERIOD to force-reset the period")
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • If rb->aux_refcount is decremented to zero before rb->refcount,
    __rb_free_aux() may be called twice resulting in a double free of
    rb->aux_pages. Fix this by adding a check to __rb_free_aux().

    Signed-off-by: Ben Hutchings
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Alexander Shishkin
    Cc: Arnaldo Carvalho de Melo
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: stable@vger.kernel.org
    Fixes: 57ffc5ca679f ("perf: Fix AUX buffer refcounting")
    Link: http://lkml.kernel.org/r/1437953468.12842.17.camel@decadent.org.uk
    Signed-off-by: Ingo Molnar

    Ben Hutchings
     

10 Aug, 2015

1 commit

  • This patch adds three core perf APIs:
    - perf_event_attrs(): export the struct perf_event_attr from struct
    perf_event;
    - perf_event_get(): get the struct perf_event from the given fd;
    - perf_event_read_local(): read the event counters active on the
    current CPU;
    These APIs are needed when accessing event counters in eBPF programs.

    The API perf_event_read_local() comes from Peter and I add the
    corresponding SOB.

    Signed-off-by: Kaixu Xia
    Signed-off-by: Peter Zijlstra
    Signed-off-by: David S. Miller

    Kaixu Xia
     

07 Aug, 2015

1 commit

  • By copying the BPF-related operations into the uprobe processing path,
    this patch allows users to attach BPF programs to uprobes, just as they
    already do with kprobes.

    After this patch, users are allowed to use PERF_EVENT_IOC_SET_BPF on a
    uprobe perf event, which makes it possible to profile user space programs
    and kernel events together using BPF.

    Because of this patch, CONFIG_BPF_EVENTS should be selected by
    CONFIG_UPROBE_EVENT to ensure trace_call_bpf() is compiled even if
    KPROBE_EVENT is not set.

    Signed-off-by: Wang Nan
    Acked-by: Alexei Starovoitov
    Cc: Brendan Gregg
    Cc: Daniel Borkmann
    Cc: David Ahern
    Cc: He Kuang
    Cc: Jiri Olsa
    Cc: Kaixu Xia
    Cc: Masami Hiramatsu
    Cc: Namhyung Kim
    Cc: Peter Zijlstra
    Cc: Zefan Li
    Cc: pi3orama@163.com
    Link: http://lkml.kernel.org/r/1435716878-189507-3-git-send-email-wangnan0@huawei.com
    Signed-off-by: Arnaldo Carvalho de Melo

    Wang Nan
     

04 Aug, 2015

2 commits

  • Currently, the PT driver zeroes out the status register every time before
    starting the event. However, all the writable bits are already taken care
    of in pt_handle_status() function, except the new PacketByteCnt field,
    which in new versions of PT contains the number of packet bytes written
    since the last sync (PSB) packet. Zeroing it out before enabling PT forces
    a sync packet to be written. This means that, with the existing code, a
    sync packet (PSB and PSBEND, 18 bytes in total) will be generated every
    time a PT event is scheduled in.

    To avoid these unnecessary syncs and save a WRMSR in the fast path, this
    patch changes the default behavior to not clear PacketByteCnt field, so
    that the sync packets will be generated with the period specified as
    "psb_period" attribute config field. This has little impact on the trace
    data as the other packets that are normally sent within PSB+ (between PSB
    and PSBEND) have their own generation scenarios which do not depend on the
    sync packets.

    One exception is when tracing starts, where we do need to force a PSB like
    this so that the decoder has a clear sync point in the trace. For this
    purpose we already have the hw::itrace_started flag, which we are currently
    using to output PERF_RECORD_ITRACE_START. This patch moves the setting of
    itrace_started from the perf core to pmu::start, where it should still be 0
    on the very first run.

    Signed-off-by: Alexander Shishkin
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: acme@infradead.org
    Cc: adrian.hunter@intel.com
    Cc: hpa@zytor.com
    Link: http://lkml.kernel.org/r/1438264104-16189-1-git-send-email-alexander.shishkin@linux.intel.com
    Signed-off-by: Ingo Molnar

    Alexander Shishkin
     
  • Vince reported that the fasync signal stuff doesn't work properly for
    inherited events. So fix that.

    Installing fasync allocates memory and sets filp->f_flags |= FASYNC,
    which upon the demise of the file descriptor ensures the allocation is
    freed and state is updated.

    Now for perf, we can have the events stick around for a while after the
    original FD is dead because of references from child events. So we
    cannot copy the fasync pointer around. We can however consistently use
    the parent's fasync, as that will be updated.
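
    The shape of the fix is small; a hedged sketch of the helper it boils down
    to (the name is illustrative):

        #include <linux/fs.h>
        #include <linux/perf_event.h>

        static struct fasync_struct **demo_event_fasync(struct perf_event *event)
        {
                /* Inherited (child) events route fasync through the parent,
                 * whose fasync state is the one the fd owner keeps updated. */
                if (event->parent)
                        event = event->parent;

                return &event->fasync;
        }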

    Reported-and-Tested-by: Vince Weaver
    Signed-off-by: Peter Zijlstra (Intel)
    Cc:
    Cc: Arnaldo Carvalho de Melo
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: eranian@google.com
    Link: http://lkml.kernel.org/r/1434011521.1495.71.camel@twins
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     

31 Jul, 2015

10 commits

  • The xol_free_insn_slot()->waitqueue_active() check is buggy. We
    need mb() after we set the condition for wait_event(), or
    xol_take_insn_slot() can miss the wakeup.
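
    The generic shape of the fix (hedged: the demo_* names are illustrative,
    the real change is in xol_free_insn_slot()) is a full barrier between
    setting the condition and the waitqueue_active() fast-path check:

        #include <linux/atomic.h>
        #include <linux/wait.h>

        static atomic_t demo_slots_busy;
        static DECLARE_WAIT_QUEUE_HEAD(demo_wq);

        static void demo_release_slot(void)
        {
                atomic_dec(&demo_slots_busy);   /* the wait_event() condition */
                smp_mb__after_atomic();         /* pairs with prepare_to_wait() */
                if (waitqueue_active(&demo_wq))
                        wake_up(&demo_wq);
        }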

    Signed-off-by: Oleg Nesterov
    Cc: Andy Lutomirski
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Pratyush Anand
    Cc: Srikar Dronamraju
    Cc: Thomas Gleixner
    Link: http://lkml.kernel.org/r/20150721134036.GA4799@redhat.com
    Signed-off-by: Ingo Molnar

    Oleg Nesterov
     
  • Change xol_add_vma() to use _install_special_mapping(); this way
    we can name the vma installed by uprobes. Currently it looks
    like a private anonymous mapping, which is confusing and
    complicates debugging. With this change /proc/$pid/maps
    reports "[uprobes]".

    As a side effect this will cause core dumps to include the XOL vma
    and I think this is good; this can help to debug the problem if
    the app crashed because it was probed.
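
    A hedged sketch of what the mapping setup looks like with a struct
    vm_special_mapping (the demo_* names are illustrative; the flags follow
    the ones xol_add_vma() already uses):

        #include <linux/err.h>
        #include <linux/mm.h>

        static struct page *demo_xol_pages[2];  /* [1] stays NULL (terminator) */

        static struct vm_special_mapping demo_xol_mapping = {
                .name   = "[uprobes]",          /* shows up in /proc/$pid/maps */
                .pages  = demo_xol_pages,
        };

        static int demo_add_xol_vma(struct mm_struct *mm, unsigned long vaddr)
        {
                struct vm_area_struct *vma;

                vma = _install_special_mapping(mm, vaddr, PAGE_SIZE,
                                VM_EXEC | VM_MAYEXEC | VM_DONTCOPY | VM_IO,
                                &demo_xol_mapping);
                return IS_ERR(vma) ? PTR_ERR(vma) : 0;
        }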

    Signed-off-by: Oleg Nesterov
    Cc: Andy Lutomirski
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Pratyush Anand
    Cc: Srikar Dronamraju
    Cc: Thomas Gleixner
    Link: http://lkml.kernel.org/r/20150721134033.GA4796@redhat.com
    Signed-off-by: Ingo Molnar

    Oleg Nesterov
     
  • install_special_mapping(pages) expects "pages" to be a zero-
    terminated array, while xol_add_vma() passes &area->page; this
    means that special_mapping_fault() can wrongly use the next
    member in xol_area (vaddr) as a "struct page *".

    Fortunately, this area is not expandable so pgoff != 0 isn't
    possible (modulo bugs in special_mapping_vmops), but still this
    does not look good.
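
    One way to fix it (hedged sketch, not necessarily the exact merged patch)
    is to keep a two-entry array in the area so the mapping code always sees a
    NULL terminator:

        #include <linux/errno.h>
        #include <linux/gfp.h>
        #include <linux/mm_types.h>

        struct demo_xol_area {
                struct page     *pages[2];      /* pages[1] stays NULL */
                unsigned long   vaddr;
        };

        static int demo_alloc_xol_page(struct demo_xol_area *area)
        {
                area->pages[0] = alloc_page(GFP_HIGHUSER);
                area->pages[1] = NULL;  /* terminator the mapping code expects */
                return area->pages[0] ? 0 : -ENOMEM;
        }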

    Signed-off-by: Oleg Nesterov
    Cc: Andy Lutomirski
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Pratyush Anand
    Cc: Srikar Dronamraju
    Cc: Thomas Gleixner
    Link: http://lkml.kernel.org/r/20150721134031.GA4789@redhat.com
    Signed-off-by: Ingo Molnar

    Oleg Nesterov
     
  • The previous change documents that cleanup_return_instances()
    can't always detect the dead frames, since the stack can grow. But
    there is one special case which is imho worth fixing:
    arch_uretprobe_is_alive() can return true when the stack didn't
    actually grow, but the next "call" insn uses the already
    invalidated frame.

    Test-case:

    #include <stdio.h>
    #include <setjmp.h>

    jmp_buf jmp;
    int nr = 1024;

    void func_2(void)
    {
            if (--nr == 0)
                    return;
            longjmp(jmp, 1);
    }

    void func_1(void)
    {
            setjmp(jmp);
            func_2();
    }

    int main(void)
    {
            func_1();
            return 0;
    }

    If you ret-probe func_1() and func_2() prepare_uretprobe() hits
    the MAX_URETPROBE_DEPTH limit and "return" from func_2() is not
    reported.

    When we know that the new call is not chained, we can do the
    more strict check. In this case "sp" points to the new ret-addr,
    so every frame which uses the same "sp" must be dead. The only
    complication is that arch_uretprobe_is_alive() needs to know whether
    it was chained or not, so we add the new RP_CHECK_CHAIN_CALL enum
    and change prepare_uretprobe() to pass RP_CHECK_CALL only if
    !chained.

    Note: arch_uretprobe_is_alive() could also re-read *sp and check
    if this word is still trampoline_vaddr. This could obviously
    improve the logic, but I would like to avoid another
    copy_from_user() especially in the case when we can't avoid the
    false "alive == T" positives.

    Tested-by: Pratyush Anand
    Signed-off-by: Oleg Nesterov
    Acked-by: Srikar Dronamraju
    Acked-by: Anton Arapov
    Cc: Andy Lutomirski
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Link: http://lkml.kernel.org/r/20150721134028.GA4786@redhat.com
    Signed-off-by: Ingo Molnar

    Oleg Nesterov
     
  • arch/x86 doesn't care (so far), but as Pratyush Anand pointed
    out, other architectures might want to know why
    arch_uretprobe_is_alive() was called and use different checks
    depending on the context. Add the new argument to distinguish
    the 2 callers.

    Tested-by: Pratyush Anand
    Signed-off-by: Oleg Nesterov
    Acked-by: Srikar Dronamraju
    Acked-by: Anton Arapov
    Cc: Andy Lutomirski
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Link: http://lkml.kernel.org/r/20150721134026.GA4779@redhat.com
    Signed-off-by: Ingo Molnar

    Oleg Nesterov
     
  • Change prepare_uretprobe() to flush the !arch_uretprobe_is_alive()
    return_instances. This is not needed correctness-wise, but can help
    to avoid the failure caused by MAX_URETPROBE_DEPTH.

    Note: in this case arch_uretprobe_is_alive() can be false
    positive, the stack can grow after longjmp(). Unfortunately, the
    kernel can't 100% solve this problem, but see the next patch.

    Tested-by: Pratyush Anand
    Signed-off-by: Oleg Nesterov
    Acked-by: Srikar Dronamraju
    Acked-by: Anton Arapov
    Cc: Andy Lutomirski
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Link: http://lkml.kernel.org/r/20150721134023.GA4776@redhat.com
    Signed-off-by: Ingo Molnar

    Oleg Nesterov
     
  • Test-case:

    #include <stdio.h>
    #include <setjmp.h>

    jmp_buf jmp;

    void func_2(void)
    {
            longjmp(jmp, 1);
    }

    void func_1(void)
    {
            if (setjmp(jmp))
                    return;
            func_2();
            printf("ERR!! I am running on the caller's stack\n");
    }

    int main(void)
    {
            func_1();
            return 0;
    }

    fails if you probe func_1() and func_2() because
    handle_trampoline() assumes that the probed function must
    return and hit the bp installed by prepare_uretprobe(). But in
    this case func_2() does not return, so when func_1() returns the
    kernel uses the no longer valid return_instance of func_2().

    Change handle_trampoline() to unwind ->return_instances until we
    know that the next chain is alive or NULL, this ensures that the
    current chain is the last we need to report and free.

    Alternatively, every return_instance could use unique
    trampoline_vaddr, in this case we could use it as a key. And
    this could solve the problem with sigaltstack() automatically.

    But this approach needs more changes, and it puts the "hard"
    limit on MAX_URETPROBE_DEPTH. Plus it can not solve another
    problem partially fixed by the next patch.

    Note: this change has no effect on !x86, the arch-agnostic
    version of arch_uretprobe_is_alive() just returns "true".

    TODO: as documented by the previous change, arch_uretprobe_is_alive()
    can be fooled by sigaltstack/etc.

    Tested-by: Pratyush Anand
    Signed-off-by: Oleg Nesterov
    Acked-by: Srikar Dronamraju
    Acked-by: Anton Arapov
    Cc: Andy Lutomirski
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Link: http://lkml.kernel.org/r/20150721134021.GA4773@redhat.com
    Signed-off-by: Ingo Molnar

    Oleg Nesterov
     
  • Add the x86 specific version of arch_uretprobe_is_alive()
    helper. It returns true if the stack frame mangled by
    prepare_uretprobe() is still on stack. So if it returns false,
    we know that the probed function has already returned.

    We add the new return_instance->stack member and change the
    generic code to initialize it in prepare_uretprobe, but it
    should be equally useful for other architectures.

    TODO: this assumes that the probed application can't use
    multiple stacks (say sigaltstack). We will try to improve
    this logic later.
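
    The check itself is tiny; a hedged sketch of the x86 helper described
    above (modulo the exact comparison and signature the merged patch uses):

        #include <linux/types.h>
        #include <linux/uprobes.h>
        #include <asm/ptrace.h>

        bool arch_uretprobe_is_alive(struct return_instance *ret,
                                     struct pt_regs *regs)
        {
                /* The frame is alive as long as the stack has not been
                 * unwound past the sp recorded in prepare_uretprobe(). */
                return regs->sp <= ret->stack;
        }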

    Tested-by: Pratyush Anand
    Signed-off-by: Oleg Nesterov
    Acked-by: Srikar Dronamraju
    Acked-by: Anton Arapov
    Cc: Andy Lutomirski
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Link: http://lkml.kernel.org/r/20150721134018.GA4766@redhat.com
    Signed-off-by: Ingo Molnar

    Oleg Nesterov
     
  • Add the new "weak" helper, arch_uretprobe_is_alive(), used by
    the next patches. It should return true if this return_instance
    is still valid. The arch agnostic version just always returns
    true.

    The patch exports "struct return_instance" for the architectures
    which want to override this hook. We can also cleanup
    prepare_uretprobe() if we pass the new return_instance to
    arch_uretprobe_hijack_return_addr().

    Tested-by: Pratyush Anand
    Signed-off-by: Oleg Nesterov
    Acked-by: Srikar Dronamraju
    Acked-by: Anton Arapov
    Cc: Andy Lutomirski
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Link: http://lkml.kernel.org/r/20150721134016.GA4762@redhat.com
    Signed-off-by: Ingo Molnar

    Oleg Nesterov
     
  • No functional changes, preparation.

    Add the new helper, find_next_ret_chain(), which finds the first
    !chained entry and returns its ->next. Yes, it is suboptimal. We
    probably want to turn ->chained into a ->start_of_this_chain
    pointer and avoid another loop. But this needs the boring
    changes in dup_utask(), so let's do this later.

    Change the main loop in handle_trampoline() to unwind the stack
    until ri is equal to the pointer returned by this new helper.
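
    A hedged sketch of that helper, matching the description above
    (return_instance is the structure local to kernel/events/uprobes.c):

        static struct return_instance *find_next_ret_chain(struct return_instance *ri)
        {
                bool chained;

                do {
                        chained = ri->chained;
                        ri = ri->next;          /* can't be NULL if chained */
                } while (chained);

                return ri;
        }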

    Tested-by: Pratyush Anand
    Signed-off-by: Oleg Nesterov
    Acked-by: Srikar Dronamraju
    Acked-by: Anton Arapov
    Cc: Andy Lutomirski
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Link: http://lkml.kernel.org/r/20150721134013.GA4755@redhat.com
    Signed-off-by: Ingo Molnar

    Oleg Nesterov