06 Nov, 2015

1 commit

  • Pull cgroup updates from Tejun Heo:
    "The cgroup core saw several significant updates this cycle:

    - percpu_rwsem for threadgroup locking is reinstated. This was
    temporarily dropped due to down_write latency issues. Oleg's
    rework of percpu_rwsem, which is scheduled to be merged in this
    merge window, resolves the issue.

    - On the v2 hierarchy, when controllers are enabled and disabled, all
    operations are atomic and can fail and revert cleanly. This allows
    ->can_attach() failure which is necessary for cpu RT slices.

    - Tasks now stay associated with the original cgroups after exit
    until released. This allows tracking resources held by zombies
    (e.g. pids) and makes it easy to find out where zombies came from
    on the v2 hierarchy. The pids controller was broken before these
    changes as zombies escaped the limits; unfortunately, updating this
    behavior required too many invasive changes and I don't think it's
    a good idea to backport them, so the pids controller on 4.3, the
    first version which included the pids controller, will stay broken
    at least until I'm sure about the cgroup core changes.

    - Optimization of a couple common tests using static_key"

    * 'for-4.4' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: (38 commits)
    cgroup: fix race condition around termination check in css_task_iter_next()
    blkcg: don't create "io.stat" on the root cgroup
    cgroup: drop cgroup__DEVEL__legacy_files_on_dfl
    cgroup: replace error handling in cgroup_init() with WARN_ON()s
    cgroup: add cgroup_subsys->free() method and use it to fix pids controller
    cgroup: keep zombies associated with their original cgroups
    cgroup: make css_set_rwsem a spinlock and rename it to css_set_lock
    cgroup: don't hold css_set_rwsem across css task iteration
    cgroup: reorganize css_task_iter functions
    cgroup: factor out css_set_move_task()
    cgroup: keep css_set and task lists in chronological order
    cgroup: make cgroup_destroy_locked() test cgroup_is_populated()
    cgroup: make css_sets pin the associated cgroups
    cgroup: relocate cgroup_[try]get/put()
    cgroup: move check_for_release() invocation
    cgroup: replace cgroup_has_tasks() with cgroup_is_populated()
    cgroup: make cgroup->nr_populated count the number of populated css_sets
    cgroup: remove an unused parameter from cgroup_task_migrate()
    cgroup: fix too early usage of static_branch_disable()
    cgroup: make cgroup_update_dfl_csses() migrate all target processes atomically
    ...

    Linus Torvalds
     

05 Nov, 2015

1 commit

  • Pull networking updates from David Miller:

    Changes of note:

    1) Allow scheduling of ICMP packets in IPVS, from Alex Gartrell.

    2) Provide FIB table ID in ipv4 route dumps just as ipv6 does, from
    David Ahern.

    3) Allow the user to ask for the statistics to be filtered out of
    ipv4/ipv6 address netlink dumps. From Sowmini Varadhan.

    4) More work to pass the network namespace context around deep into
    various packet path APIs, starting with the netfilter hooks. From
    Eric W Biederman.

    5) Add layer 2 TX/RX checksum offloading to qeth driver, from Thomas
    Richter.

    6) Use usec resolution for SYN/ACK RTTs in TCP, from Yuchung Cheng.

    7) Support Very High Throughput in wireless MESH code, from Bob
    Copeland.

    8) Allow setting the ageing_time in switchdev/rocker. From Scott
    Feldman.

    9) Properly autoload L2TP type modules, from Stephen Hemminger.

    10) Fix and enable offload features by default in 8139cp driver, from
    David Woodhouse.

    11) Support both ipv4 and ipv6 sockets in a single vxlan device, from
    Jiri Benc.

    12) Fix CWND limiting of thin streams in TCP, from Bendik Rønning
    Opstad.

    13) Fix IPSEC flowcache overflows on large systems, from Steffen
    Klassert.

    14) Convert bridging to track VLANs using rhashtable entries rather than
    a bitmap. From Nikolay Aleksandrov.

    15) Make TCP listener handling completely lockless; this is a major
    accomplishment. Incoming request sockets now live in the
    established hash table just like any other socket.

    From Eric Dumazet.

    16) Provide more bridging attributes to netlink, from Nikolay
    Aleksandrov.

    17) Use hash based algorithm for ipv4 multipath routing, this was very
    long overdue. From Peter Nørlund.

    18) Several y2038 cures, mostly avoiding timespec. From Arnd Bergmann.

    19) Allow non-root execution of EBPF programs, from Alexei Starovoitov.

    20) Support SO_INCOMING_CPU as setsockopt, from Eric Dumazet. This
    influences the port binding selection logic used by SO_REUSEPORT.

    21) Add ipv6 support to VRF, from David Ahern.

    22) Add support for Mellanox Spectrum switch ASIC, from Jiri Pirko.

    23) Add rtl8xxxu Realtek wireless driver, from Jes Sorensen.

    24) Implement RACK loss recovery in TCP, from Yuchung Cheng.

    25) Support multipath routes in MPLS, from Roopa Prabhu.

    26) Fix POLLOUT notification for listening sockets in AF_UNIX, from Eric
    Dumazet.

    27) Add new QED QLogic driver, from Yuval Mintz, Manish Chopra, and
    Sudarsana Kalluru.

    28) Don't fetch timestamps on AF_UNIX sockets, from Hannes Frederic
    Sowa.

    29) Support ipv6 geneve tunnels, from John W Linville.

    30) Add flood control support to switchdev layer, from Ido Schimmel.

    31) Fix CHECKSUM_PARTIAL handling of potentially fragmented frames, from
    Hannes Frederic Sowa.

    32) Support persistent maps and progs in bpf, from Daniel Borkmann.

    * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1790 commits)
    sh_eth: use DMA barriers
    switchdev: respect SKIP_EOPNOTSUPP flag in case there is no recursion
    net: sched: kill dead code in sch_choke.c
    irda: Delete an unnecessary check before the function call "irlmp_unregister_service"
    net: dsa: mv88e6xxx: include DSA ports in VLANs
    net: dsa: mv88e6xxx: disable SA learning for DSA and CPU ports
    net/core: fix for_each_netdev_feature
    vlan: Invoke driver vlan hooks only if device is present
    arcnet/com20020: add LEDS_CLASS dependency
    bpf, verifier: annotate verbose printer with __printf
    dp83640: Only wait for timestamps for packets with timestamping enabled.
    ptp: Change ptp_class to a proper bitmask
    dp83640: Prune rx timestamp list before reading from it
    dp83640: Delay scheduled work.
    dp83640: Include hash in timestamp/packet matching
    ipv6: fix tunnel error handling
    net/mlx5e: Fix LSO vlan insertion
    net/mlx5e: Re-eanble client vlan TX acceleration
    net/mlx5e: Return error in case mlx5e_set_features() fails
    net/mlx5e: Don't allow more than max supported channels
    ...

    Linus Torvalds
     

04 Nov, 2015

2 commits

  • Pull perf updates from Ingo Molnar:
    "Kernel side changes:

    - Improve accuracy of perf/sched clock on x86. (Adrian Hunter)

    - Intel DS and BTS updates. (Alexander Shishkin)

    - Intel cstate PMU support. (Kan Liang)

    - Add group read support to perf_event_read(). (Peter Zijlstra)

    - Branch call hardware sampling support, implemented on x86 and
    PowerPC. (Stephane Eranian)

    - Event groups transactional interface enhancements. (Sukadev
    Bhattiprolu)

    - Enable proper x86/intel/uncore PMU support on multi-segment PCI
    systems. (Taku Izumi)

    - ... misc fixes and cleanups.

    The perf tooling team was very busy again with 200+ commits, the full
    diff doesn't fit into lkml size limits. Here's an (incomplete) list
    of the tooling highlights:

    New features:

    - Change the default event used in all tools (record/top): use the
    most precise "cycles" hw counter available, i.e. when the user
    doesn't specify any event, it will try using cycles:ppp, cycles:pp,
    etc and fall back transparently until it finds a working counter.
    (Arnaldo Carvalho de Melo)

    - Integration of perf with eBPF that, given an eBPF .c source file
    (or .o file built for the 'bpf' target with clang), will get it
    automatically built, validated and loaded into the kernel via the
    sys_bpf syscall, which can then be used and seen using 'perf trace'
    and other tools.

    (Wang Nan)

    Various user interface improvements:

    - Automatic pager invocation on long help output. (Namhyung Kim)

    - Search for more options when passing args to -h, e.g.: (Arnaldo
    Carvalho de Melo)

    $ perf report -h interface

    Usage: perf report [<options>]

    --gtk Use the GTK2 interface
    --stdio Use the stdio interface
    --tui Use the TUI interface

    - Show ordered command line options when -h is used or when an
    unknown option is specified. (Arnaldo Carvalho de Melo)

    - If options are passed after -h, show just their descriptions, not all
    options. (Arnaldo Carvalho de Melo)

    - Implement column based horizontal scrolling in the hists browser
    (top, report), making it possible to use the TUI for things like
    'perf mem report' where there are many more columns than can fit in
    a terminal. (Arnaldo Carvalho de Melo)

    - Enhance the error reporting of tracepoint event parsing, e.g.:

    $ oldperf record -e sched:sched_switc usleep 1
    event syntax error: 'sched:sched_switc'
    \___ unknown tracepoint
    Run 'perf list' for a list of valid events

    Now we get the much nicer:

    $ perf record -e sched:sched_switc ls
    event syntax error: 'sched:sched_switc'
    \___ can't access trace events

    Error: No permissions to read /sys/kernel/debug/tracing/events/sched/sched_switc
    Hint: Try 'sudo mount -o remount,mode=755 /sys/kernel/debug'

    And after we have those mount point permissions fixed:

    $ perf record -e sched:sched_switc ls
    event syntax error: 'sched:sched_switc'
    \___ unknown tracepoint

    Error: File /sys/kernel/debug/tracing/events/sched/sched_switc not found.
    Hint: Perhaps this kernel misses some CONFIG_ setting to enable this feature?.

    I.e. basically now the event parsing routine uses the strerror_open()
    routines introduced by, and used in, the 'perf trace' work. (Jiri Olsa)

    - Fail properly when pattern matching fails to find a tracepoint,
    i.e. '-e non:existent' was being correctly handled, with a proper
    error message about that not being a valid event, but '-e
    non:existent*' wasn't, fix it. (Jiri Olsa)

    - Do event name substring search as last resort in 'perf list'.
    (Arnaldo Carvalho de Melo)

    E.g.:

    # perf list clock

    List of pre-defined events (to be used in -e):

    cpu-clock [Software event]
    task-clock [Software event]

    uncore_cbox_0/clockticks/ [Kernel PMU event]
    uncore_cbox_1/clockticks/ [Kernel PMU event]

    kvm:kvm_pvclock_update [Tracepoint event]
    kvm:kvm_update_master_clock [Tracepoint event]
    power:clock_disable [Tracepoint event]
    power:clock_enable [Tracepoint event]
    power:clock_set_rate [Tracepoint event]
    syscalls:sys_enter_clock_adjtime [Tracepoint event]
    syscalls:sys_enter_clock_getres [Tracepoint event]
    syscalls:sys_enter_clock_gettime [Tracepoint event]
    syscalls:sys_enter_clock_nanosleep [Tracepoint event]
    syscalls:sys_enter_clock_settime [Tracepoint event]
    syscalls:sys_exit_clock_adjtime [Tracepoint event]
    syscalls:sys_exit_clock_getres [Tracepoint event]
    syscalls:sys_exit_clock_gettime [Tracepoint event]
    syscalls:sys_exit_clock_nanosleep [Tracepoint event]
    syscalls:sys_exit_clock_settime [Tracepoint event]

    Intel PT hardware tracing enhancements:

    - Accept a zero --itrace period, meaning "as often as possible". In
    the case of Intel PT that is the same as a period of 1 and a unit
    of 'instructions' (i.e. --itrace=i1i). (Adrian Hunter)

    - Harmonize itrace's synthesized callchains with the existing
    --max-stack tool option. (Adrian Hunter)

    - Allow time to be displayed in nanoseconds in 'perf script'.
    (Adrian Hunter)

    - Fix potential infinite loop when handling Intel PT timestamps.
    (Adrian Hunter)

    - Slightly improve Intel PT debug logging. (Adrian Hunter)

    - Warn when AUX data has been lost, just like when processing
    PERF_RECORD_LOST. (Adrian Hunter)

    - Further document export-to-postgresql.py script. (Adrian Hunter)

    - Add option to synthesize branch stack from auxtrace data. (Adrian
    Hunter)

    Misc notable changes:

    - Switch the default callchain output mode to 'graph,0.5,caller', to
    make it look like the default for other tools, reducing the
    learning curve for people used to 'caller' based viewing. (Arnaldo
    Carvalho de Melo)

    - Various call chain usability enhancements. (Namhyung Kim)

    - Introduce the 'P' event modifier, meaning 'max precision level,
    please', i.e.:

    $ perf record -e cycles:P usleep 1

    Is now similar to:

    $ perf record usleep 1

    Useful, for instance, when specifying multiple events. (Jiri Olsa)

    - Add 'socket' sort entry, to sort by the processor socket in 'perf
    top' and 'perf report'. (Kan Liang)

    - Introduce --socket-filter to 'perf report', for filtering by
    processor socket. (Kan Liang)

    - Add new "Zoom into Processor Socket" operation in the perf hists
    browser, used in 'perf top' and 'perf report'. (Kan Liang)

    - Allow probing on kmodules without DWARF. (Masami Hiramatsu)

    - Fix 'perf probe -l' for probes added to kernel module functions.
    (Masami Hiramatsu)

    - Preparatory work for the 'perf stat record' feature that will allow
    generating perf.data files with counting data in addition to the
    sampling mode we have now (Jiri Olsa)

    - Update libtraceevent KVM plugin. (Paolo Bonzini)

    - ... plus lots of other enhancements that I failed to list properly,
    by: Adrian Hunter, Alexander Shishkin, Andi Kleen, Andrzej Hajda,
    Arnaldo Carvalho de Melo, Dima Kogan, Don Zickus, Geliang Tang, He
    Kuang, Huaitong Han, Ingo Molnar, Jan Stancek, Jiri Olsa, Kan
    Liang, Kirill Tkhai, Masami Hiramatsu, Matt Fleming, Namhyung Kim,
    Paolo Bonzini, Peter Zijlstra, Rabin Vincent, Scott Wood, Stephane
    Eranian, Sukadev Bhattiprolu, Taku Izumi, Vaishali Thakkar, Wang
    Nan, Yang Shi and Yunlong Song"

    * 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (260 commits)
    perf unwind: Pass symbol source to libunwind
    tools build: Fix libiberty feature detection
    perf tools: Compile scriptlets to BPF objects when passing '.c' to --event
    perf record: Add clang options for compiling BPF scripts
    perf bpf: Attach eBPF filter to perf event
    perf tools: Make sure fixdep is built before libbpf
    perf script: Enable printing of branch stack
    perf trace: Add cmd string table to decode sys_bpf first arg
    perf bpf: Collect perf_evsel in BPF object files
    perf tools: Load eBPF object into kernel
    perf tools: Create probe points for BPF programs
    perf tools: Enable passing bpf object file to --event
    perf ebpf: Add the libbpf glue
    perf tools: Make perf depend on libbpf
    perf symbols: Fix endless loop in dso__split_kallsyms_for_kcore
    perf tools: Enable pre-event inherit setting by config terms
    perf symbols: we can now read separate debug-info files based on a build ID
    perf symbols: Fix type error when reading a build-id
    perf tools: Search for more options when passing args to -h
    perf stat: Cache aggregated map entries in extra cpumap
    ...

    Linus Torvalds
     
  • This seems to be a mis-reading of how alpha memory ordering works, and
    is not backed up by the alpha architecture manual. The helper functions
    don't do anything special on any other architectures, and the arguments
    that support them being safe on other architectures also argue that they
    are safe on alpha.

    Basically, the "control dependency" is between a previous read and a
    subsequent write that is dependent on the value read. Even if the
    subsequent write is actually done speculatively, there is no way that
    such a speculative write could be made visible to other cpu's until it
    has been committed, which requires validating the speculation.

    Note that most weakly ordered architectures (very much including alpha)
    do not guarantee any ordering relationship between two loads that depend
    on each other only through a control dependency:

        read A
        if (A == 1)
                read B

    because the conditional may be predicted, and the "read B" may be
    speculatively moved up to before reading the value A. So we require the
    user to insert a smp_rmb() between the two accesses to be correct:

        read A;
        if (A == 1)
                smp_rmb();
        read B;

    Alpha is further special in that it can break that ordering even if the
    *address* of B depends on the read of A, because the cacheline that is
    read later may be stale unless you have a memory barrier in between the
    pointer read and the read of the value behind a pointer:

        read ptr
        read offset(ptr)

    whereas all other weakly ordered architectures guarantee that the data
    dependency (as opposed to just a control dependency) will order the two
    accesses. As a result, alpha needs a "smp_read_barrier_depends()" in
    between those two reads for them to be ordered.
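
    To make the three cases just described concrete, here is a minimal
    kernel-style C sketch (hedged: the flag/data/ptr variables and the
    function names are purely illustrative, not taken from any real code):

        #include <linux/compiler.h>     /* READ_ONCE(), WRITE_ONCE() */
        #include <asm/barrier.h>        /* smp_rmb(), smp_read_barrier_depends() */

        static int flag, data;
        static int *ptr;

        /* Case 1: control dependency from a read to a subsequent write.
         * The store cannot become visible before the load is resolved,
         * so no barrier is needed -- this is the case the removed
         * *_ctrl() helpers were meant to cover. */
        static void ctrl_dep_to_write(void)
        {
                if (READ_ONCE(flag))
                        WRITE_ONCE(data, 1);
        }

        /* Case 2: control dependency between two reads.  The second read
         * can be speculated past the first, so an explicit smp_rmb() is
         * required. */
        static int ctrl_dep_between_reads(void)
        {
                if (READ_ONCE(flag)) {
                        smp_rmb();
                        return READ_ONCE(data);
                }
                return 0;
        }

        /* Case 3: data dependency through a pointer.  Ordered on all other
         * weakly ordered architectures, but alpha needs
         * smp_read_barrier_depends() between the pointer read and the
         * dereference. */
        static int data_dep_read(void)
        {
                int *p = READ_ONCE(ptr);

                smp_read_barrier_depends();
                return p ? *p : 0;
        }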

    The control dependency that "READ_ONCE_CTRL()" and "atomic_read_ctrl()"
    had was a control dependency to a subsequent *write*, however, and
    nobody can finalize such a subsequent write without having actually done
    the read. And were you to write such a value to a "stale" cacheline
    (the way the unordered reads came to be), that would seem to lose the
    write entirely.

    So the things that make alpha able to re-order reads even more
    aggressively than other weak architectures do not seem to be relevant
    for a subsequent write. Alpha memory ordering may be strange, but
    there's no real indication that it is *that* strange.

    Also, the alpha architecture reference manual very explicitly talks
    about the definition of "Dependence Constraints" in section 5.6.1.7,
    where a preceding read dominates a subsequent write.

    Such a dependence constraint admittedly does not impose a BEFORE (alpha
    architecture term for globally visible ordering), but it does guarantee
    that there can be no "causal loop". I don't see how you could avoid
    such a loop if another cpu could see the stored value and then impact
    the value of the first read. Put another way: the read and the write
    could not be seen as being out of order wrt other cpus.

    So I do not see how these "x_ctrl()" functions can currently be necessary.

    I may have to eat my words at some point, but in the absence of clear
    proof that alpha actually needs this, or indeed even an explanation of
    how alpha could _possibly_ need it, I do not believe these functions are
    called for.

    And if it turns out that alpha really _does_ need a barrier for this
    case, that barrier still should not be "smp_read_barrier_depends()".
    We'd have to make up some new speciality barrier just for alpha, along
    with the documentation for why it really is necessary.

    Cc: Peter Zijlstra
    Cc: Paul E McKenney
    Cc: Dmitry Vyukov
    Cc: Will Deacon
    Cc: Ingo Molnar
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

16 Oct, 2015

1 commit

  • cgroup_exit() is called when a task exits; it disassociates the
    exiting task from its cgroups and half-attaches it to the root cgroup.
    This is unnecessary and undesirable.

    No controller actually needs an exiting task to be disassociated from
    non-root cgroups. Both the cpu and perf_event controllers update the
    association to the root cgroup from their exit callbacks just to keep
    consistent with the cgroup core behavior.

    Also, this disassociation makes it difficult to track resources held
    by zombies or determine where the zombies came from. Currently, pids
    controller is completely broken as it uncharges on exit and zombies
    always escape the resource restriction. With cgroup association being
    reset on exit, fixing it is pretty painful.

    There's no reason to reset cgroup membership on exit. The zombie can
    be removed from its css_set so that it doesn't show up on
    "cgroup.procs" and thus can't be migrated or interfere with cgroup
    removal. It can still pin and point to the css_set so that its cgroup
    membership is maintained. This patch makes cgroup core keep zombies
    associated with their cgroups at the time of exit.

    * Previous patches decoupled populated_cnt tracking from css_set
    lifetime, so a dying task can be simply unlinked from its css_set
    while pinning and pointing to the css_set. This keeps css_set
    association from task side alive while hiding it from "cgroup.procs"
    and populated_cnt tracking. The css_set reference is dropped when
    the task_struct is freed.

    * ->exit() callback no longer needs the css arguments as the
    associated css never changes once PF_EXITING is set. Removed.

    * cpu and perf_events controllers no longer need ->exit() callbacks.
    There's no reason to explicitly switch away on exit. The final
    schedule out is enough. The callbacks are removed.

    * On traditional hierarchies, nothing changes. "/proc/PID/cgroup"
    still reports "/" for all zombies. On the default hierarchy,
    "/proc/PID/cgroup" keeps reporting the cgroup that the task belonged
    to at the time of exit. If the cgroup gets removed before the task
    is reaped, " (deleted)" is appended.

    v2: Build breakage due to missing dummy cgroup_free() when
    !CONFIG_CGROUP fixed.
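
    For reference, the companion patch in this series ("cgroup: add
    cgroup_subsys->free() method and use it to fix pids controller" in the
    shortlog above) lets a charge-tracking controller uncharge when the
    task_struct is finally freed rather than at exit time. A rough, hedged
    sketch of that shape (the demo_* names are illustrative, not the actual
    pids controller code):

        #include <linux/cgroup.h>
        #include <linux/sched.h>

        static void demo_free(struct task_struct *task)
        {
                /* The task kept its css_set until the task_struct is freed,
                 * so this is the natural place to drop any per-task charge. */
        }

        static struct cgroup_subsys demo_cgrp_subsys = {
                .free   = demo_free,    /* replaces uncharging from ->exit() */
        };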

    Signed-off-by: Tejun Heo
    Cc: Ingo Molnar
    Cc: Peter Zijlstra
    Cc: Arnaldo Carvalho de Melo

    Tejun Heo
     

18 Sep, 2015

4 commits

  • Signed-off-by: Ingo Molnar

    Ingo Molnar
     
  • There are two races with the current code:

    - Another event can join the group and compute a larger header_size
    concurrently, if the smaller store wins we'll have an incorrect
    header_size set.

    - We compute the header_size after the event becomes active,
    therefore it's possible to use the size before it's computed.

    Remedy the first by moving the computation inside the ctx::mutex lock,
    and the second by placing it _before_ perf_install_in_context().

    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Arnaldo Carvalho de Melo
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: linux-kernel@vger.kernel.org
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • Vince reported that it's possible to overflow the various size fields
    and get weird stuff if you stick too many events in a group.

    Put a lid on this by requiring that the fixed record size not exceed
    16k. This is still a fair amount of events (a silly amount, really) and
    leaves plenty of room for callchains and stack dwarves while also
    avoiding overflowing the u16 variables.

    Reported-by: Vince Weaver
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: linux-kernel@vger.kernel.org
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • The exclusive_event_installable() stuff only works because it's
    exclusive with the grouping bits.

    Rework the code such that there is a sane place to error out before we
    go do things we cannot undo.

    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Alexander Shishkin
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: linux-kernel@vger.kernel.org
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     

13 Sep, 2015

8 commits

  • Define a new PERF_PMU_TXN_READ interface to read a group of counters
    at once.

        pmu->start_txn()                // Initialize before first event

        for each event in group
                pmu->read(event);       // Queue each event to be read

        rc = pmu->commit_txn()          // Read/update all queued counters

    Note that we use this interface with all PMUs. PMUs that implement this
    interface use the ->read() operation to _queue_ the counters to be read
    and use ->commit_txn() to actually read all the queued counters at once.

    PMUs that don't implement PERF_PMU_TXN_READ ignore ->start_txn() and
    ->commit_txn() and continue to read counters one at a time.

    Thanks to input from Peter Zijlstra.
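
    Sketched in driver terms (hedged: the demo_* names and the per-CPU flag
    storage are illustrative, not the actual 24x7 code), the split between
    queueing and committing looks roughly like this:

        #include <linux/perf_event.h>
        #include <linux/percpu-defs.h>

        static DEFINE_PER_CPU(unsigned int, demo_txn_flags);

        static void demo_pmu_start_txn(struct pmu *pmu, unsigned int flags)
        {
                __this_cpu_write(demo_txn_flags, flags);
                /* For PERF_PMU_TXN_READ, reset the driver's request queue here. */
        }

        static void demo_pmu_read(struct perf_event *event)
        {
                if (__this_cpu_read(demo_txn_flags) & PERF_PMU_TXN_READ) {
                        /* Just queue 'event'; don't touch the hardware yet. */
                } else {
                        /* Legacy path: read this one counter immediately and
                         * fold the delta into event->count. */
                }
        }

        static int demo_pmu_commit_txn(struct pmu *pmu)
        {
                int ret = 0;

                if (__this_cpu_read(demo_txn_flags) & PERF_PMU_TXN_READ) {
                        /* Issue one hardware request for all queued counters
                         * and update each queued event's count, or set ret. */
                }
                __this_cpu_write(demo_txn_flags, 0);
                return ret;
        }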

    Signed-off-by: Sukadev Bhattiprolu
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Arnaldo Carvalho de Melo
    Cc: Arnaldo Carvalho de Melo
    Cc: Jiri Olsa
    Cc: Linus Torvalds
    Cc: Michael Ellerman
    Cc: Peter Zijlstra
    Cc: Stephane Eranian
    Cc: Thomas Gleixner
    Cc: Vince Weaver
    Link: http://lkml.kernel.org/r/1441336073-22750-9-git-send-email-sukadev@linux.vnet.ibm.com
    Signed-off-by: Ingo Molnar

    Sukadev Bhattiprolu
     
  • When we implement the ability to read several counters at once (using
    the PERF_PMU_TXN_READ transaction interface), perf_event_read() can
    fail when the 'group' parameter is true (e.g. when trying to read too
    many events at once).

    For now, have perf_event_read() return an integer. Ignore the return
    value when the 'group' parameter is false.

    Signed-off-by: Sukadev Bhattiprolu
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Arnaldo Carvalho de Melo
    Cc: Arnaldo Carvalho de Melo
    Cc: Jiri Olsa
    Cc: Linus Torvalds
    Cc: Michael Ellerman
    Cc: Peter Zijlstra
    Cc: Stephane Eranian
    Cc: Thomas Gleixner
    Cc: Vince Weaver
    Link: http://lkml.kernel.org/r/1441336073-22750-8-git-send-email-sukadev@linux.vnet.ibm.com
    Signed-off-by: Ingo Molnar

    Sukadev Bhattiprolu
     
  • In order to enable the use of perf_event_read(.group = true), we need
    to invert the sibling-child loop nesting of perf_read_group().

    Currently we iterate the child list for each sibling, this precludes
    using group reads. Flip things around so we iterate each group for
    each child.
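
    A hedged sketch of the inversion (the demo_* names are illustrative, and
    the leader->child_mutex locking is omitted for brevity): the outer loop is
    now over child events, and each iteration accumulates the whole group.

        #include <linux/perf_event.h>

        static void demo_sum_group(struct perf_event *leader, u64 *values)
        {
                struct perf_event *sub;
                int n = 0;

                values[n++] += local64_read(&leader->count);
                list_for_each_entry(sub, &leader->sibling_list, group_entry)
                        values[n++] += local64_read(&sub->count);
        }

        static void demo_read_group(struct perf_event *leader, u64 *values)
        {
                struct perf_event *child;

                demo_sum_group(leader, values);         /* the parent group itself */

                list_for_each_entry(child, &leader->child_list, child_list)
                        demo_sum_group(child, values);  /* whole group per child */
        }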

    Signed-off-by: Peter Zijlstra (Intel)
    [ Made the patch compile and things. ]
    Signed-off-by: Sukadev Bhattiprolu
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Arnaldo Carvalho de Melo
    Cc: Arnaldo Carvalho de Melo
    Cc: Jiri Olsa
    Cc: Linus Torvalds
    Cc: Michael Ellerman
    Cc: Peter Zijlstra
    Cc: Stephane Eranian
    Cc: Thomas Gleixner
    Cc: Vince Weaver
    Link: http://lkml.kernel.org/r/1441336073-22750-7-git-send-email-sukadev@linux.vnet.ibm.com
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • Enable perf_event_read() to update entire groups at once, this will be
    useful for read transactions.

    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Arnaldo Carvalho de Melo
    Cc: Arnaldo Carvalho de Melo
    Cc: Jiri Olsa
    Cc: Linus Torvalds
    Cc: Michael Ellerman
    Cc: Peter Zijlstra
    Cc: Stephane Eranian
    Cc: Sukadev Bhattiprolu
    Cc: Thomas Gleixner
    Cc: Vince Weaver
    Link: http://lkml.kernel.org/r/20150723080435.GE25159@twins.programming.kicks-ass.net
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • In order to free up the perf_event_read_group() name:

    s/perf_event_read_\(one\|group\)/perf_read_\1/g
    s/perf_read_hw/__perf_read/g

    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Arnaldo Carvalho de Melo
    Cc: Arnaldo Carvalho de Melo
    Cc: Jiri Olsa
    Cc: Linus Torvalds
    Cc: Michael Ellerman
    Cc: Peter Zijlstra
    Cc: Stephane Eranian
    Cc: Thomas Gleixner
    Cc: Vince Weaver
    Link: http://lkml.kernel.org/r/1441336073-22750-5-git-send-email-sukadev@linux.vnet.ibm.com
    Signed-off-by: Ingo Molnar

    Peter Zijlstra (Intel)
     
  • perf_event_read() does two things:

    - call the PMU to read/update the counter value, and
    - compute the total count of the event and its children

    Not all callers need both. perf_event_reset() for instance needs the
    first piece but doesn't need the second. Similarly, when we implement
    the ability to read a group of events using the transaction interface,
    we would need the two pieces done independently.

    Break up perf_event_read() and have it just read/update the counter
    and have the callers compute the total count if necessary.

    Signed-off-by: Sukadev Bhattiprolu
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Arnaldo Carvalho de Melo
    Cc: Arnaldo Carvalho de Melo
    Cc: Jiri Olsa
    Cc: Linus Torvalds
    Cc: Michael Ellerman
    Cc: Peter Zijlstra
    Cc: Stephane Eranian
    Cc: Thomas Gleixner
    Cc: Vince Weaver
    Link: http://lkml.kernel.org/r/1441336073-22750-4-git-send-email-sukadev@linux.vnet.ibm.com
    Signed-off-by: Ingo Molnar

    Sukadev Bhattiprolu
     
  • Currently, the PMU interface allows reading only one counter at a time.
    But some PMUs, like the 24x7 counters in Power, support reading several
    counters at once. To leverage this functionality, extend the transaction
    interface to support a "transaction type".

    The first type, PERF_PMU_TXN_ADD, refers to the existing transactions,
    i.e. used to _schedule_ all the events on the PMU as a group. A second
    transaction type, PERF_PMU_TXN_READ, will be used in a follow-on patch,
    by the 24x7 counters to read several counters at once.

    Extend the transaction interfaces to the PMU to accept a 'txn_flags'
    parameter and use this parameter to ignore any transactions that are
    not of type PERF_PMU_TXN_ADD.

    Thanks to Peter Zijlstra for his input.
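
    For a PMU that only supports the existing ADD transactions, the filtering
    amounts to an early return. A hedged sketch (the demo_* names and the
    per-CPU flag storage are illustrative; real drivers keep this in their
    cpu_hw_events-style state):

        #include <linux/perf_event.h>
        #include <linux/percpu-defs.h>

        static DEFINE_PER_CPU(unsigned int, demo_txn_flags);

        static void demo_start_txn(struct pmu *pmu, unsigned int txn_flags)
        {
                __this_cpu_write(demo_txn_flags, txn_flags);
                if (txn_flags & ~PERF_PMU_TXN_ADD)
                        return;                 /* not an ADD transaction: ignore */

                perf_pmu_disable(pmu);          /* existing TXN_ADD setup */
        }

        static int demo_commit_txn(struct pmu *pmu)
        {
                unsigned int flags = __this_cpu_read(demo_txn_flags);

                __this_cpu_write(demo_txn_flags, 0);
                if (flags & ~PERF_PMU_TXN_ADD)
                        return 0;               /* nothing was scheduled */

                perf_pmu_enable(pmu);           /* existing TXN_ADD commit */
                return 0;
        }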

    Signed-off-by: Sukadev Bhattiprolu
    [peterz: s390 compile fix]
    Signed-off-by: Peter Zijlstra (Intel)
    Acked-by: Michael Ellerman
    Cc: Arnaldo Carvalho de Melo
    Cc: Arnaldo Carvalho de Melo
    Cc: Jiri Olsa
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Stephane Eranian
    Cc: Thomas Gleixner
    Cc: Vince Weaver
    Link: http://lkml.kernel.org/r/1441336073-22750-3-git-send-email-sukadev@linux.vnet.ibm.com
    Signed-off-by: Ingo Molnar

    Sukadev Bhattiprolu
     
  • cgroup_exit() is no longer called from copy_process() after commit:

    e8604cb43690 ("cgroup: fix spurious lockdep warning in cgroup_exit()")

    It is only called from do_exit(). So this check is useless and the
    comment is obsolete.

    Signed-off-by: Kirill Tkhai
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Arnaldo Carvalho de Melo
    Cc: Jiri Olsa
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Stephane Eranian
    Cc: Thomas Gleixner
    Cc: Vince Weaver
    Link: http://lkml.kernel.org/r/55E444C8.3020402@odin.com
    Signed-off-by: Ingo Molnar

    Kirill Tkhai
     

11 Sep, 2015

1 commit

  • There are two kexec load syscalls: kexec_load and kexec_file_load.
    kexec_file_load has been split out into kernel/kexec_file.c. In this patch
    I split the kexec_load syscall code out into kernel/kexec.c.

    Also add a new kconfig option, KEXEC_CORE, so we can disable kexec_load
    and use kexec_file_load only, or vice versa.

    The original requirement is from Ted Ts'o: he wants the kexec kernel
    signature to be checked with CONFIG_KEXEC_VERIFY_SIG enabled. But
    kexec-tools can bypass that check by using the kexec_load syscall.

    Vivek Goyal proposed creating a common kconfig option so users can compile
    in only one syscall for loading the kexec kernel. KEXEC/KEXEC_FILE select
    KEXEC_CORE so that old config files still work.

    Because there is generic code that needs CONFIG_KEXEC_CORE, I updated all
    the architecture Kconfigs with the new KEXEC_CORE option and made KEXEC
    select KEXEC_CORE in the arch Kconfigs. The generic kernel code related to
    the kexec_load syscall was updated as well.

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Dave Young
    Cc: Eric W. Biederman
    Cc: Vivek Goyal
    Cc: Petr Tesarik
    Cc: Theodore Ts'o
    Cc: Josh Boyer
    Cc: David Howells
    Cc: Geert Uytterhoeven
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dave Young
     

03 Sep, 2015

1 commit

  • Pull networking updates from David Miller:
    "Another merge window, another set of networking changes. I've heard
    rumblings that the lightweight tunnels infrastructure has been voted
    networking change of the year. But what do I know?

    1) Add conntrack support to openvswitch, from Joe Stringer.

    2) Initial support for VRF (Virtual Routing and Forwarding), which
    allows the segmentation of routing paths without using multiple
    devices. There are some semantic kinks to work out still, but
    this is a reasonably strong foundation. From David Ahern.

    3) Remove spinlock from act_bpf fast path, from Alexei Starovoitov.

    4) Ignore route nexthops with a link down state in ipv6, just like
    ipv4. From Andy Gospodarek.

    5) Remove spinlock from fast path of act_gact and act_mirred, from
    Eric Dumazet.

    6) Document the DSA layer, from Florian Fainelli.

    7) Add netconsole support to bcmgenet, systemport, and DSA. Also
    from Florian Fainelli.

    8) Add Mellanox Switch Driver and core infrastructure, from Jiri
    Pirko.

    9) Add support for "light weight tunnels", which allow for
    encapsulation and decapsulation without bearing the overhead of a
    full blown netdevice. From Thomas Graf, Jiri Benc, and a cast of
    others.

    10) Add Identifier Locator Addressing support for ipv6, from Tom
    Herbert.

    11) Support fragmented SKBs in iwlwifi, from Johannes Berg.

    12) Allow perf PMUs to be accessed from eBPF programs, from Kaixu Xia.

    13) Add BQL support to 3c59x driver, from Loganaden Velvindron.

    14) Stop using a zero TX queue length to mean that a device shouldn't
    have a qdisc attached, use an explicit flag instead. From Phil
    Sutter.

    15) Use generic geneve netdevice infrastructure in openvswitch, from
    Pravin B Shelar.

    16) Add infrastructure to avoid re-forwarding a packet in software
    that was already forwarded by a hardware switch. From Scott
    Feldman.

    17) Allow AF_PACKET fanout function to be implemented in a bpf
    program, from Willem de Bruijn"

    * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1458 commits)
    netfilter: nf_conntrack: make nf_ct_zone_dflt built-in
    netfilter: nf_dup{4, 6}: fix build error when nf_conntrack disabled
    net: fec: clear receive interrupts before processing a packet
    ipv6: fix exthdrs offload registration in out_rt path
    xen-netback: add support for multicast control
    bgmac: Update fixed_phy_register()
    sock, diag: fix panic in sock_diag_put_filterinfo
    flow_dissector: Use 'const' where possible.
    flow_dissector: Fix function argument ordering dependency
    ixgbe: Resolve "initialized field overwritten" warnings
    ixgbe: Remove bimodal SR-IOV disabling
    ixgbe: Add support for reporting 2.5G link speed
    ixgbe: fix bounds checking in ixgbe_setup_tc for 82598
    ixgbe: support for ethtool set_rxfh
    ixgbe: Avoid needless PHY access on copper phys
    ixgbe: cleanup to use cached mask value
    ixgbe: Remove second instance of lan_id variable
    ixgbe: use kzalloc for allocating one thing
    flow: Move __get_hash_from_flowi{4,6} into flow_dissector.c
    ixgbe: Remove unused PCI bus types
    ...

    Linus Torvalds
     

12 Aug, 2015

4 commits

  • A question [1] was raised about the use of page::private in AUX buffer
    allocations, so let's add a clarification about its intended use.

    The private field and flag are used by perf's rb_alloc_aux() path to
    tell the pmu driver the size of each high-order allocation, so that the
    driver can program those appropriately into its hardware. This only
    matters for PMUs that don't support hardware scatter tables. Otherwise,
    every page in the buffer is just a page.
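
    Concretely, a driver's setup path might recover the block sizes like this
    (hedged sketch; demo_program_block() stands in for whatever register
    programming the real hardware needs):

        #include <linux/mm.h>

        static void demo_program_block(struct page *page, unsigned long size)
        {
                /* Program one physically contiguous block into the
                 * (imaginary) hardware's scatter-table substitute. */
        }

        static void demo_program_aux(void **pages, int nr_pages)
        {
                int i = 0;

                while (i < nr_pages) {
                        struct page *page = virt_to_page(pages[i]);
                        /* rb_alloc_aux() marks the first page of each
                         * high-order allocation PagePrivate() and stores the
                         * order in page_private(); plain pages are order 0. */
                        int order = PagePrivate(page) ? page_private(page) : 0;

                        demo_program_block(page, PAGE_SIZE << order);
                        i += 1 << order;
                }
        }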

    This patch adds a comment about the private field to the AUX buffer
    allocation path.

    [1] http://marc.info/?l=linux-kernel&m=143803696607968

    Reported-by: Mathieu Poirier
    Signed-off-by: Alexander Shishkin
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Arnaldo Carvalho de Melo
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Link: http://lkml.kernel.org/r/1438063204-665-1-git-send-email-alexander.shishkin@linux.intel.com
    Signed-off-by: Ingo Molnar

    Alexander Shishkin
     
  • Signed-off-by: Ingo Molnar

    Ingo Molnar
     
  • I ran the perf fuzzer, which triggered some WARN()s which are due to
    trying to stop/restart an event on the wrong CPU.

    Use the normal IPI pattern to ensure we run the code on the correct CPU.

    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Vince Weaver
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Fixes: bad7192b842c ("perf: Fix PERF_EVENT_IOC_PERIOD to force-reset the period")
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • If rb->aux_refcount is decremented to zero before rb->refcount,
    __rb_free_aux() may be called twice resulting in a double free of
    rb->aux_pages. Fix this by adding a check to __rb_free_aux().

    Signed-off-by: Ben Hutchings
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Alexander Shishkin
    Cc: Arnaldo Carvalho de Melo
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: stable@vger.kernel.org
    Fixes: 57ffc5ca679f ("perf: Fix AUX buffer refcounting")
    Link: http://lkml.kernel.org/r/1437953468.12842.17.camel@decadent.org.uk
    Signed-off-by: Ingo Molnar

    Ben Hutchings
     

10 Aug, 2015

1 commit

  • This patch adds three core perf APIs:
    - perf_event_attrs(): export the struct perf_event_attr from struct
    perf_event;
    - perf_event_get(): get the struct perf_event from the given fd;
    - perf_event_read_local(): read the event counters active on the
    current CPU;
    These APIs are needed when accessing event counters in eBPF programs.

    The API perf_event_read_local() comes from Peter and I add the
    corresponding SOB.

    Signed-off-by: Kaixu Xia
    Signed-off-by: Peter Zijlstra
    Signed-off-by: David S. Miller

    Kaixu Xia
     

07 Aug, 2015

1 commit

  • By copying the BPF-related operations into the uprobe processing path,
    this patch allows users to attach BPF programs to uprobes, just as they
    already do with kprobes.

    After this patch, users are allowed to use PERF_EVENT_IOC_SET_BPF on a
    uprobe perf event, which makes it possible to profile user space programs
    and kernel events together using BPF.

    Because of this patch, CONFIG_BPF_EVENTS should be selected by
    CONFIG_UPROBE_EVENT to ensure trace_call_bpf() is compiled even if
    KPROBE_EVENT is not set.

    Signed-off-by: Wang Nan
    Acked-by: Alexei Starovoitov
    Cc: Brendan Gregg
    Cc: Daniel Borkmann
    Cc: David Ahern
    Cc: He Kuang
    Cc: Jiri Olsa
    Cc: Kaixu Xia
    Cc: Masami Hiramatsu
    Cc: Namhyung Kim
    Cc: Peter Zijlstra
    Cc: Zefan Li
    Cc: pi3orama@163.com
    Link: http://lkml.kernel.org/r/1435716878-189507-3-git-send-email-wangnan0@huawei.com
    Signed-off-by: Arnaldo Carvalho de Melo

    Wang Nan
     

04 Aug, 2015

2 commits

  • Currently, the PT driver zeroes out the status register every time before
    starting the event. However, all the writable bits are already taken care
    of in pt_handle_status() function, except the new PacketByteCnt field,
    which in new versions of PT contains the number of packet bytes written
    since the last sync (PSB) packet. Zeroing it out before enabling PT forces
    a sync packet to be written. This means that, with the existing code, a
    sync packet (PSB and PSBEND, 18 bytes in total) will be generated every
    time a PT event is scheduled in.

    To avoid these unnecessary syncs and save a WRMSR in the fast path, this
    patch changes the default behavior to not clear PacketByteCnt field, so
    that the sync packets will be generated with the period specified as
    "psb_period" attribute config field. This has little impact on the trace
    data as the other packets that are normally sent within PSB+ (between PSB
    and PSBEND) have their own generation scenarios which do not depend on the
    sync packets.

    One exception is when tracing starts, where we do need to force a PSB like
    this so that the decoder has a clear sync point in the trace. For this
    purpose we already have the hw::itrace_started flag, which we are currently
    using to output PERF_RECORD_ITRACE_START. This patch moves the setting of
    itrace_started from the perf core to pmu::start, where it should still be 0
    on the very first run.

    Signed-off-by: Alexander Shishkin
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: acme@infradead.org
    Cc: adrian.hunter@intel.com
    Cc: hpa@zytor.com
    Link: http://lkml.kernel.org/r/1438264104-16189-1-git-send-email-alexander.shishkin@linux.intel.com
    Signed-off-by: Ingo Molnar

    Alexander Shishkin
     
  • Vince reported that the fasync signal stuff doesn't work properly for
    inherited events. So fix that.

    Installing fasync allocates memory and sets filp->f_flags |= FASYNC,
    which upon the demise of the file descriptor ensures the allocation is
    freed and state is updated.

    Now for perf, we can have the events stick around for a while after the
    original FD is dead because of references from child events. So we
    cannot copy the fasync pointer around. We can however consistently use
    the parent's fasync, as that will be updated.
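
    The shape of the fix is small; a hedged sketch of the helper it boils down
    to (the name is illustrative):

        #include <linux/fs.h>
        #include <linux/perf_event.h>

        static struct fasync_struct **demo_event_fasync(struct perf_event *event)
        {
                /* Inherited (child) events route fasync through the parent,
                 * whose fasync state is the one the fd owner keeps updated. */
                if (event->parent)
                        event = event->parent;

                return &event->fasync;
        }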

    Reported-and-Tested-by: Vince Weaver
    Signed-off-by: Peter Zijlstra (Intel)
    Cc:
    Cc: Arnaldo Carvalho de Melo
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: eranian@google.com
    Link: http://lkml.kernel.org/r/1434011521.1495.71.camel@twins
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     

31 Jul, 2015

10 commits

  • The xol_free_insn_slot()->waitqueue_active() check is buggy. We
    need mb() after we set the condition for wait_event(), or
    xol_take_insn_slot() can miss the wakeup.
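
    The generic shape of the fix (hedged: the demo_* names are illustrative,
    the real change is in xol_free_insn_slot()) is a full barrier between
    setting the condition and the waitqueue_active() fast-path check:

        #include <linux/atomic.h>
        #include <linux/wait.h>

        static atomic_t demo_slots_busy;
        static DECLARE_WAIT_QUEUE_HEAD(demo_wq);

        static void demo_release_slot(void)
        {
                atomic_dec(&demo_slots_busy);   /* the wait_event() condition */
                smp_mb__after_atomic();         /* pairs with prepare_to_wait() */
                if (waitqueue_active(&demo_wq))
                        wake_up(&demo_wq);
        }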

    Signed-off-by: Oleg Nesterov
    Cc: Andy Lutomirski
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Pratyush Anand
    Cc: Srikar Dronamraju
    Cc: Thomas Gleixner
    Link: http://lkml.kernel.org/r/20150721134036.GA4799@redhat.com
    Signed-off-by: Ingo Molnar

    Oleg Nesterov
     
  • Change xol_add_vma() to use _install_special_mapping(); this way
    we can name the vma installed by uprobes. Currently it looks
    like a private anonymous mapping, which is confusing and
    complicates debugging. With this change /proc/$pid/maps
    reports "[uprobes]".

    As a side effect this will cause core dumps to include the XOL vma
    and I think this is good; this can help to debug the problem if
    the app crashed because it was probed.
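
    A hedged sketch of what the mapping setup looks like with a struct
    vm_special_mapping (the demo_* names are illustrative; the flags follow
    the ones xol_add_vma() already uses):

        #include <linux/err.h>
        #include <linux/mm.h>

        static struct page *demo_xol_pages[2];  /* [1] stays NULL (terminator) */

        static struct vm_special_mapping demo_xol_mapping = {
                .name   = "[uprobes]",          /* shows up in /proc/$pid/maps */
                .pages  = demo_xol_pages,
        };

        static int demo_add_xol_vma(struct mm_struct *mm, unsigned long vaddr)
        {
                struct vm_area_struct *vma;

                vma = _install_special_mapping(mm, vaddr, PAGE_SIZE,
                                VM_EXEC | VM_MAYEXEC | VM_DONTCOPY | VM_IO,
                                &demo_xol_mapping);
                return IS_ERR(vma) ? PTR_ERR(vma) : 0;
        }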

    Signed-off-by: Oleg Nesterov
    Cc: Andy Lutomirski
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Pratyush Anand
    Cc: Srikar Dronamraju
    Cc: Thomas Gleixner
    Link: http://lkml.kernel.org/r/20150721134033.GA4796@redhat.com
    Signed-off-by: Ingo Molnar

    Oleg Nesterov
     
  • install_special_mapping(pages) expects "pages" to be a zero-
    terminated array, while xol_add_vma() passes &area->page; this
    means that special_mapping_fault() can wrongly use the next
    member in xol_area (vaddr) as a "struct page *".

    Fortunately, this area is not expandable so pgoff != 0 isn't
    possible (modulo bugs in special_mapping_vmops), but still this
    does not look good.
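
    One way to fix it (hedged sketch, not necessarily the exact merged patch)
    is to keep a two-entry array in the area so the mapping code always sees a
    NULL terminator:

        #include <linux/errno.h>
        #include <linux/gfp.h>
        #include <linux/mm_types.h>

        struct demo_xol_area {
                struct page     *pages[2];      /* pages[1] stays NULL */
                unsigned long   vaddr;
        };

        static int demo_alloc_xol_page(struct demo_xol_area *area)
        {
                area->pages[0] = alloc_page(GFP_HIGHUSER);
                area->pages[1] = NULL;  /* terminator the mapping code expects */
                return area->pages[0] ? 0 : -ENOMEM;
        }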

    Signed-off-by: Oleg Nesterov
    Cc: Andy Lutomirski
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Pratyush Anand
    Cc: Srikar Dronamraju
    Cc: Thomas Gleixner
    Link: http://lkml.kernel.org/r/20150721134031.GA4789@redhat.com
    Signed-off-by: Ingo Molnar

    Oleg Nesterov
     
  • The previous change documents that cleanup_return_instances()
    can't always detect the dead frames, since the stack can grow. But
    there is one special case which is imho worth fixing:
    arch_uretprobe_is_alive() can return true when the stack didn't
    actually grow, but the next "call" insn uses the already
    invalidated frame.

    Test-case:

    #include <stdio.h>
    #include <setjmp.h>

    jmp_buf jmp;
    int nr = 1024;

    void func_2(void)
    {
            if (--nr == 0)
                    return;
            longjmp(jmp, 1);
    }

    void func_1(void)
    {
            setjmp(jmp);
            func_2();
    }

    int main(void)
    {
            func_1();
            return 0;
    }

    If you ret-probe func_1() and func_2() prepare_uretprobe() hits
    the MAX_URETPROBE_DEPTH limit and "return" from func_2() is not
    reported.

    When we know that the new call is not chained, we can do the
    more strict check. In this case "sp" points to the new ret-addr,
    so every frame which uses the same "sp" must be dead. The only
    complication is that arch_uretprobe_is_alive() needs to know whether
    it was chained or not, so we add the new RP_CHECK_CHAIN_CALL enum
    and change prepare_uretprobe() to pass RP_CHECK_CALL only if
    !chained.

    Note: arch_uretprobe_is_alive() could also re-read *sp and check
    if this word is still trampoline_vaddr. This could obviously
    improve the logic, but I would like to avoid another
    copy_from_user() especially in the case when we can't avoid the
    false "alive == T" positives.

    Tested-by: Pratyush Anand
    Signed-off-by: Oleg Nesterov
    Acked-by: Srikar Dronamraju
    Acked-by: Anton Arapov
    Cc: Andy Lutomirski
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Link: http://lkml.kernel.org/r/20150721134028.GA4786@redhat.com
    Signed-off-by: Ingo Molnar

    Oleg Nesterov
     
  • arch/x86 doesn't care (so far), but as Pratyush Anand pointed
    out, other architectures might want to know why
    arch_uretprobe_is_alive() was called and use different checks
    depending on the context. Add the new argument to distinguish
    the 2 callers.

    Tested-by: Pratyush Anand
    Signed-off-by: Oleg Nesterov
    Acked-by: Srikar Dronamraju
    Acked-by: Anton Arapov
    Cc: Andy Lutomirski
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Link: http://lkml.kernel.org/r/20150721134026.GA4779@redhat.com
    Signed-off-by: Ingo Molnar

    Oleg Nesterov
     
  • Change prepare_uretprobe() to flush the !arch_uretprobe_is_alive()
    return_instances. This is not needed correctness-wise, but can help
    to avoid the failure caused by MAX_URETPROBE_DEPTH.

    Note: in this case arch_uretprobe_is_alive() can be false
    positive, the stack can grow after longjmp(). Unfortunately, the
    kernel can't 100% solve this problem, but see the next patch.

    Tested-by: Pratyush Anand
    Signed-off-by: Oleg Nesterov
    Acked-by: Srikar Dronamraju
    Acked-by: Anton Arapov
    Cc: Andy Lutomirski
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Link: http://lkml.kernel.org/r/20150721134023.GA4776@redhat.com
    Signed-off-by: Ingo Molnar

    Oleg Nesterov
     
  • Test-case:

    #include <stdio.h>
    #include <setjmp.h>

    jmp_buf jmp;

    void func_2(void)
    {
            longjmp(jmp, 1);
    }

    void func_1(void)
    {
            if (setjmp(jmp))
                    return;
            func_2();
            printf("ERR!! I am running on the caller's stack\n");
    }

    int main(void)
    {
            func_1();
            return 0;
    }

    fails if you probe func_1() and func_2() because
    handle_trampoline() assumes that the probed function must
    return and hit the bp installed by prepare_uretprobe(). But in
    this case func_2() does not return, so when func_1() returns the
    kernel uses the no longer valid return_instance of func_2().

    Change handle_trampoline() to unwind ->return_instances until we
    know that the next chain is alive or NULL, this ensures that the
    current chain is the last we need to report and free.

    Alternatively, every return_instance could use unique
    trampoline_vaddr, in this case we could use it as a key. And
    this could solve the problem with sigaltstack() automatically.

    But this approach needs more changes, and it puts the "hard"
    limit on MAX_URETPROBE_DEPTH. Plus it can not solve another
    problem partially fixed by the next patch.

    Note: this change has no effect on !x86, the arch-agnostic
    version of arch_uretprobe_is_alive() just returns "true".

    TODO: as documented by the previous change, arch_uretprobe_is_alive()
    can be fooled by sigaltstack/etc.

    Tested-by: Pratyush Anand
    Signed-off-by: Oleg Nesterov
    Acked-by: Srikar Dronamraju
    Acked-by: Anton Arapov
    Cc: Andy Lutomirski
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Link: http://lkml.kernel.org/r/20150721134021.GA4773@redhat.com
    Signed-off-by: Ingo Molnar

    Oleg Nesterov
     
  • Add the x86 specific version of arch_uretprobe_is_alive()
    helper. It returns true if the stack frame mangled by
    prepare_uretprobe() is still on stack. So if it returns false,
    we know that the probed function has already returned.

    We add the new return_instance->stack member and change the
    generic code to initialize it in prepare_uretprobe, but it
    should be equally useful for other architectures.

    TODO: this assumes that the probed application can't use
    multiple stacks (say sigaltstack). We will try to improve
    this logic later.
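
    The check itself is tiny; a hedged sketch of the x86 helper described
    above (modulo the exact comparison and signature the merged patch uses):

        #include <linux/types.h>
        #include <linux/uprobes.h>
        #include <asm/ptrace.h>

        bool arch_uretprobe_is_alive(struct return_instance *ret,
                                     struct pt_regs *regs)
        {
                /* The frame is alive as long as the stack has not been
                 * unwound past the sp recorded in prepare_uretprobe(). */
                return regs->sp <= ret->stack;
        }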

    Tested-by: Pratyush Anand
    Signed-off-by: Oleg Nesterov
    Acked-by: Srikar Dronamraju
    Acked-by: Anton Arapov
    Cc: Andy Lutomirski
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Link: http://lkml.kernel.org/r/20150721134018.GA4766@redhat.com
    Signed-off-by: Ingo Molnar

    Oleg Nesterov
     
  • Add the new "weak" helper, arch_uretprobe_is_alive(), used by
    the next patches. It should return true if this return_instance
    is still valid. The arch agnostic version just always returns
    true.

    The patch exports "struct return_instance" for the architectures
    which want to override this hook. We can also cleanup
    prepare_uretprobe() if we pass the new return_instance to
    arch_uretprobe_hijack_return_addr().

    Tested-by: Pratyush Anand
    Signed-off-by: Oleg Nesterov
    Acked-by: Srikar Dronamraju
    Acked-by: Anton Arapov
    Cc: Andy Lutomirski
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Link: http://lkml.kernel.org/r/20150721134016.GA4762@redhat.com
    Signed-off-by: Ingo Molnar

    Oleg Nesterov
     
  • No functional changes, preparation.

    Add the new helper, find_next_ret_chain(), which finds the first
    !chained entry and returns its ->next. Yes, it is suboptimal. We
    probably want to turn ->chained into a ->start_of_this_chain
    pointer and avoid another loop. But this needs the boring
    changes in dup_utask(), so let's do this later.

    Change the main loop in handle_trampoline() to unwind the stack
    until ri is equal to the pointer returned by this new helper.
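
    A hedged sketch of that helper, matching the description above
    (return_instance is the structure local to kernel/events/uprobes.c):

        static struct return_instance *find_next_ret_chain(struct return_instance *ri)
        {
                bool chained;

                do {
                        chained = ri->chained;
                        ri = ri->next;          /* can't be NULL if chained */
                } while (chained);

                return ri;
        }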

    Tested-by: Pratyush Anand
    Signed-off-by: Oleg Nesterov
    Acked-by: Srikar Dronamraju
    Acked-by: Anton Arapov
    Cc: Andy Lutomirski
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Link: http://lkml.kernel.org/r/20150721134013.GA4755@redhat.com
    Signed-off-by: Ingo Molnar

    Oleg Nesterov