15 Feb, 2017
1 commit
-
commit 451d24d1e5f40bad000fa9abe36ddb16fc9928cb upstream.
Alexei had his box explode because doing read() on a package
(rapl/uncore) event that isn't currently scheduled in ends up doing an
out-of-bounds load.
Rework the code to more explicitly deal with event->oncpu being -1.
Reported-by: Alexei Starovoitov
Tested-by: Alexei Starovoitov
Tested-by: David Carrillo-Cisneros
Signed-off-by: Peter Zijlstra (Intel)
Cc: Linus Torvalds
Cc: Peter Zijlstra
Cc: Thomas Gleixner
Cc: eranian@google.com
Fixes: d6a2f9035bfc ("perf/core: Introduce PMU_EV_CAP_READ_ACTIVE_PKG")
Link: http://lkml.kernel.org/r/20170131102710.GL6515@twins.programming.kicks-ass.net
Signed-off-by: Ingo Molnar
Signed-off-by: Greg Kroah-Hartman
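The failure mode can be modelled outside the kernel. The following is a minimal user-space sketch, not the upstream patch: `pkg_of_cpu`, `NR_CPUS_MODEL` and `can_read_on_local_cpu()` are hypothetical names chosen for illustration. It shows why a CPU number of -1 ("not scheduled in") has to be rejected before it is used as an index into a per-CPU table.

```c
#include <stdbool.h>

#define NR_CPUS_MODEL 4

/* Hypothetical topology table: package id of each CPU. */
static const int pkg_of_cpu[NR_CPUS_MODEL] = { 0, 0, 1, 1 };

/*
 * Return true if an event last active on event_cpu may be read from
 * local_cpu without a cross-CPU call, i.e. both CPUs share a package.
 * event_cpu == -1 means "not scheduled in"; it is checked first so it
 * is never used as an array index.
 */
static bool can_read_on_local_cpu(int event_cpu, int local_cpu)
{
	if (event_cpu < 0 || event_cpu >= NR_CPUS_MODEL)
		return false;
	if (local_cpu < 0 || local_cpu >= NR_CPUS_MODEL)
		return false;
	return pkg_of_cpu[event_cpu] == pkg_of_cpu[local_cpu];
}
```

Without the range check, `pkg_of_cpu[-1]` would read memory in front of the table, which is the kind of out-of-bounds load the commit describes.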
09 Feb, 2017
2 commits
-
commit 0b3589be9b98994ce3d5aeca52445d1f5627c4ba upstream.
Andres reported that MMAP2 records for anonymous memory always have
their protection field 0.
Turns out, someone daft put the prot/flags generation code in the file
branch, leaving them unset for anonymous memory.
Reported-by: Andres Freund
Signed-off-by: Peter Zijlstra (Intel)
Cc: Alexander Shishkin
Cc: Arnaldo Carvalho de Melo
Cc: Don Zickus
Cc: Linus Torvalds
Cc: Peter Zijlstra
Cc: Stephane Eranian
Cc: Thomas Gleixner
Cc: acme@kernel.org
Cc: anton@ozlabs.org
Cc: namhyung@kernel.org
Fixes: f972eb63b100 ("perf: Pass protection and flags bits through mmap2 interface")
Link: http://lkml.kernel.org/r/20170126221508.GF6536@twins.programming.kicks-ass.net
Signed-off-by: Ingo Molnar
Signed-off-by: Greg Kroah-Hartman
-
commit a76a82a3e38c8d3fb6499e3dfaeb0949241ab588 upstream.
Dmitry reported a KASAN use-after-free on event->group_leader.
It turns out there's a hole in perf_remove_from_context() due to
event_function_call() not calling its function when the task
associated with the event is already dead.
In this case the event will have been detached from the task, but the
grouping will have been retained, such that group operations might
still work properly while there are live child events etc.
This does however mean that we can miss a perf_group_detach() call
when the group decomposes, which in turn can then lead to
use-after-free.
Fix it by explicitly doing the group detach if it's still required.
Reported-by: Dmitry Vyukov
Tested-by: Dmitry Vyukov
Signed-off-by: Peter Zijlstra (Intel)
Cc: Alexander Shishkin
Cc: Arnaldo Carvalho de Melo
Cc: Jiri Olsa
Cc: Linus Torvalds
Cc: Mathieu Desnoyers
Cc: Peter Zijlstra
Cc: Thomas Gleixner
Cc: syzkaller
Fixes: 63b6da39bb38 ("perf: Fix perf_event_exit_task() race")
Link: http://lkml.kernel.org/r/20170126153955.GD6515@twins.programming.kicks-ass.net
Signed-off-by: Ingo Molnar
Signed-off-by: Greg Kroah-Hartman
01 Feb, 2017
1 commit
-
commit 321027c1fe77f892f4ea07846aeae08cefbbb290 upstream.
Di Shen reported a race between two concurrent sys_perf_event_open()
calls where both try and move the same pre-existing software group
into a hardware context.
The problem is exactly that described in commit:
f63a8daa5812 ("perf: Fix event->ctx locking")
... where, while we wait for a ctx->mutex acquisition, the event->ctx
relation can have changed under us.
That very same commit failed to recognise sys_perf_event_context() as an
external access vector to the events and thereby didn't apply the
established locking rules correctly.
So while one sys_perf_event_open() call is stuck waiting on
mutex_lock_double(), the other (which owns said locks) moves the group
about. So by the time the former sys_perf_event_open() acquires the
locks, the context we've acquired is stale (and possibly dead).
Apply the established locking rules as per perf_event_ctx_lock_nested()
to the mutex_lock_double() for the 'move_group' case. This obviously means
we need to validate state after we acquire the locks.
Reported-by: Di Shen (Keen Lab)
Tested-by: John Dias
Signed-off-by: Peter Zijlstra (Intel)
Cc: Alexander Shishkin
Cc: Arnaldo Carvalho de Melo
Cc: Jiri Olsa
Cc: Kees Cook
Cc: Linus Torvalds
Cc: Min Chong
Cc: Peter Zijlstra
Cc: Stephane Eranian
Cc: Thomas Gleixner
Cc: Vince Weaver
Fixes: f63a8daa5812 ("perf: Fix event->ctx locking")
Link: http://lkml.kernel.org/r/20170106131444.GZ3174@twins.programming.kicks-ass.net
Signed-off-by: Ingo Molnar
Signed-off-by: Greg Kroah-Hartman
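The "lock, then validate" rule described above can be sketched in user space. This is an illustrative model only: `ctx_model`, `event_model` and `event_ctx_lock()` are hypothetical names, a C11 `atomic_flag` spin lock stands in for ctx->mutex, and the reference counting and RCU protection that the kernel needs to keep a stale context alive are deliberately left out.

```c
#include <stdatomic.h>
#include <stddef.h>

struct ctx_model {
	atomic_flag lock;			/* stand-in for ctx->mutex */
};

struct event_model {
	_Atomic(struct ctx_model *) ctx;	/* may be changed by a concurrent mover */
};

static void ctx_lock(struct ctx_model *c)
{
	while (atomic_flag_test_and_set_explicit(&c->lock, memory_order_acquire))
		;	/* spin until the lock is free */
}

static void ctx_unlock(struct ctx_model *c)
{
	atomic_flag_clear_explicit(&c->lock, memory_order_release);
}

/*
 * Lock the context the event currently belongs to. The event->ctx
 * relation is re-read after the lock is taken; if it changed while we
 * were waiting, the lock we hold is for a stale context, so drop it
 * and retry.
 */
static struct ctx_model *event_ctx_lock(struct event_model *e)
{
	struct ctx_model *c;

	for (;;) {
		c = atomic_load(&e->ctx);
		ctx_lock(c);
		if (atomic_load(&e->ctx) == c)
			return c;	/* still current: caller holds the right lock */
		ctx_unlock(c);
	}
}
```

The point of the pattern is that the check happens after the lock is acquired, so a move that completed while the caller was blocked is always noticed.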
06 Dec, 2016
1 commit
-
The warning introduced in commit:
864c2357ca89 ("perf/core: Do not set cpuctx->cgrp for unscheduled cgroups")
assumed that a cgroup switch always precedes list_del_event. This is
not the case. Remove warning.
Make sure that cpuctx->cgrp is NULL until a cgroup event is sched in
or ctx->nr_cgroups == 0.
Signed-off-by: David Carrillo-Cisneros
Signed-off-by: Peter Zijlstra (Intel)
Cc: Alexander Shishkin
Cc: Arnaldo Carvalho de Melo
Cc: Borislav Petkov
Cc: Fenghua Yu
Cc: Jiri Olsa
Cc: Kan Liang
Cc: Linus Torvalds
Cc: Marcelo Tosatti
Cc: Nilay Vaish
Cc: Paul Turner
Cc: Peter Zijlstra
Cc: Ravi V Shankar
Cc: Stephane Eranian
Cc: Thomas Gleixner
Cc: Vegard Nossum
Cc: Vikas Shivappa
Cc: Vince Weaver
Link: http://lkml.kernel.org/r/1480841177-27299-1-git-send-email-davidcc@google.com
Signed-off-by: Ingo Molnar
21 Nov, 2016
1 commit
-
The token table passed into match_token() must be null-terminated, which
it currently is not in perf's address filter string parser, as caught
by Vince's perf_fuzzer and KASAN.
It doesn't blow up otherwise because of the alignment padding of the table
to the next element in the .rodata, which is luck.
Fix this by adding a null-terminator to the token table.
Reported-by: Vince Weaver
Tested-by: Vince Weaver
Signed-off-by: Alexander Shishkin
Acked-by: Peter Zijlstra (Intel)
Cc: Arnaldo Carvalho de Melo
Cc: Linus Torvalds
Cc: Thomas Gleixner
Cc: dvyukov@google.com
Cc: stable@vger.kernel.org # v4.7+
Fixes: 375637bc524 ("perf/core: Introduce address range filtering")
Link: http://lkml.kernel.org/r/877f81f264.fsf@ashishki-desk.ger.corp.intel.com
Signed-off-by: Ingo Molnar
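Why the terminator matters can be shown with a small user-space model of a sentinel-terminated token table. This is a sketch under stated assumptions: `token_model`, the `TOK_*` ids, `filter_tokens` and `match_token_model()` are hypothetical and use plain string comparison rather than the kernel's pattern matching.

```c
#include <stddef.h>
#include <string.h>

struct token_model {
	int id;
	const char *pattern;	/* NULL marks the end of the table */
};

enum { TOK_FILTER, TOK_START, TOK_STOP, TOK_ERROR };

static const struct token_model filter_tokens[] = {
	{ TOK_FILTER, "filter" },
	{ TOK_START,  "start"  },
	{ TOK_STOP,   "stop"   },
	{ TOK_ERROR,  NULL     },	/* terminator: without it the loop runs off the array */
};

/*
 * Walk the table until a pattern matches or the NULL sentinel is hit.
 * When nothing matches, the id stored in the sentinel entry is returned.
 */
static int match_token_model(const char *s, const struct token_model *table)
{
	const struct token_model *t;

	for (t = table; t->pattern != NULL; t++)
		if (strcmp(s, t->pattern) == 0)
			return t->id;
	return t->id;
}
```

The loop has no length argument, so the sentinel is the only thing that stops it; an unmatched string on a table without one keeps reading whatever follows the array in memory.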
15 Nov, 2016
1 commit
-
Commit:
db4a835601b7 ("perf/core: Set cgroup in CPU contexts for new cgroup events")
failed to verify that event->cgrp is actually the scheduled cgroup
in a CPU before setting cpuctx->cgrp. This patch fixes that.
Now that there is a different path for scheduled and unscheduled
cgroup, add a warning to catch when cpuctx->cgrp is still set after
the last cgroup event has been unscheduled.
To verify the bug:
# Create 2 cgroups.
mkdir /dev/cgroups/devices/g1
mkdir /dev/cgroups/devices/g2
# launch a task, bind it to a cpu and move it to g1
CPU=2
while :; do : ; done &
P=$!
taskset -pc $CPU $P
echo $P > /dev/cgroups/devices/g1/tasks
# monitor g2 (it runs no tasks) and observe output
perf stat -e cycles -I 1000 -C $CPU -G g2
# time counts unit events
1.000091408 7,579,527 cycles g2
2.000350111 <not counted> cycles g2
3.000589181 <not counted> cycles g2
4.000771428 <not counted> cycles g2
# Note that the first line shows a task running in g2, despite
# g2 having no tasks. This is because cpuctx->cgrp was wrongly
# set when context of new event was installed.
# After applying the fix we obtain the right output:
perf stat -e cycles -I 1000 -C $CPU -G g2
# time counts unit events
1.000119615 <not counted> cycles g2
2.000389430 <not counted> cycles g2
3.000590962 <not counted> cycles g2
Signed-off-by: David Carrillo-Cisneros
Signed-off-by: Peter Zijlstra (Intel)
Cc: Alexander Shishkin
Cc: Arnaldo Carvalho de Melo
Cc: Jiri Olsa
Cc: Kan Liang
Cc: Linus Torvalds
Cc: Nilay Vaish
Cc: Paul Turner
Cc: Peter Zijlstra
Cc: Stephane Eranian
Cc: Thomas Gleixner
Cc: Vegard Nossum
Link: http://lkml.kernel.org/r/1478026378-86083-1-git-send-email-davidcc@google.com
Signed-off-by: Ingo Molnar
29 Oct, 2016
1 commit
-
Pull perf fixes from Ingo Molnar:
"Misc kernel fixes: a virtualization environment related fix, an uncore
PMU driver removal handling fix, a PowerPC fix and new events for
Knights Landing"
* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
perf/x86/intel: Honour the CPUID for number of fixed counters in hypervisors
perf/powerpc: Don't call perf_event_disable() from atomic context
perf/core: Protect PMU device removal with a 'pmu_bus_running' check, to fix CONFIG_DEBUG_TEST_DRIVER_REMOVE=y kernel panic
perf/x86/intel/cstate: Add C-state residency events for Knights Landing
28 Oct, 2016
2 commits
-
The trinity syscall fuzzer triggered the following WARN() on powerpc:
WARNING: CPU: 9 PID: 2998 at arch/powerpc/kernel/hw_breakpoint.c:278
...
NIP [c00000000093aedc] .hw_breakpoint_handler+0x28c/0x2b0
LR [c00000000093aed8] .hw_breakpoint_handler+0x288/0x2b0
Call Trace:
[c0000002f7933580] [c00000000093aed8] .hw_breakpoint_handler+0x288/0x2b0 (unreliable)
[c0000002f7933630] [c0000000000f671c] .notifier_call_chain+0x7c/0xf0
[c0000002f79336d0] [c0000000000f6abc] .__atomic_notifier_call_chain+0xbc/0x1c0
[c0000002f7933780] [c0000000000f6c40] .notify_die+0x70/0xd0
[c0000002f7933820] [c00000000001a74c] .do_break+0x4c/0x100
[c0000002f7933920] [c0000000000089fc] handle_dabr_fault+0x14/0x48
Followed by a lockdep warning:
===============================
[ INFO: suspicious RCU usage. ]
4.8.0-rc5+ #7 Tainted: G W
-------------------------------
./include/linux/rcupdate.h:556 Illegal context switch in RCU read-side critical section!
other info that might help us debug this:
rcu_scheduler_active = 1, debug_locks = 0
2 locks held by ls/2998:
#0: (rcu_read_lock){......}, at: [] .__atomic_notifier_call_chain+0x0/0x1c0
#1: (rcu_read_lock){......}, at: [] .hw_breakpoint_handler+0x0/0x2b0
stack backtrace:
CPU: 9 PID: 2998 Comm: ls Tainted: G W 4.8.0-rc5+ #7
Call Trace:
[c0000002f7933150] [c00000000094b1f8] .dump_stack+0xe0/0x14c (unreliable)
[c0000002f79331e0] [c00000000013c468] .lockdep_rcu_suspicious+0x138/0x180
[c0000002f7933270] [c0000000001005d8] .___might_sleep+0x278/0x2e0
[c0000002f7933300] [c000000000935584] .mutex_lock_nested+0x64/0x5a0
[c0000002f7933410] [c00000000023084c] .perf_event_ctx_lock_nested+0x16c/0x380
[c0000002f7933500] [c000000000230a80] .perf_event_disable+0x20/0x60
[c0000002f7933580] [c00000000093aeec] .hw_breakpoint_handler+0x29c/0x2b0
[c0000002f7933630] [c0000000000f671c] .notifier_call_chain+0x7c/0xf0
[c0000002f79336d0] [c0000000000f6abc] .__atomic_notifier_call_chain+0xbc/0x1c0
[c0000002f7933780] [c0000000000f6c40] .notify_die+0x70/0xd0
[c0000002f7933820] [c00000000001a74c] .do_break+0x4c/0x100
[c0000002f7933920] [c0000000000089fc] handle_dabr_fault+0x14/0x48
While it looks like the first WARN() is probably valid, the other one is
triggered by disabling the event via perf_event_disable() from atomic context.
The event is disabled here in case we were not able to emulate
the instruction that hit the breakpoint. By disabling the event
we unschedule the event and make sure it's not scheduled back.
But we can't call perf_event_disable() from atomic context; instead
we need to use the event's pending_disable irq_work method to disable it.
Reported-by: Jan Stancek
Signed-off-by: Jiri Olsa
Signed-off-by: Peter Zijlstra (Intel)
Cc: Alexander Shishkin
Cc: Arnaldo Carvalho de Melo
Cc: Huang Ying
Cc: Jiri Olsa
Cc: Linus Torvalds
Cc: Michael Neuling
Cc: Paul Mackerras
Cc: Peter Zijlstra
Cc: Thomas Gleixner
Link: http://lkml.kernel.org/r/20161026094824.GA21397@krava
Signed-off-by: Ingo Molnar
-
…FIG_DEBUG_TEST_DRIVER_REMOVE=y kernel panic
CAI Qian reported a crash in the PMU uncore device removal code,
enabled by the CONFIG_DEBUG_TEST_DRIVER_REMOVE=y option:
https://marc.info/?l=linux-kernel&m=147688837328451
The reason for the crash is that perf_pmu_unregister() tries to remove
a PMU device which is not added at this point. We add PMU devices
only after pmu_bus is registered, which happens in the
perf_event_sysfs_init() call and sets the 'pmu_bus_running' flag.
The fix is to get the 'pmu_bus_running' flag state at the point
the PMU is taken out of the PMU list and remove the device
later only if it's set.
Reported-by: CAI Qian <caiqian@redhat.com>
Tested-by: CAI Qian <caiqian@redhat.com>
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Kan Liang <kan.liang@intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rob Herring <robh@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20161020111011.GA13361@krava
Signed-off-by: Ingo Molnar <mingo@kernel.org>
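The shape of the fix can be modelled in a few lines of user-space C. This is a sketch, not the kernel code: `pmu_state`, `pmu_register_model()` and `pmu_unregister_model()` are hypothetical, and a plain global stands in for the 'pmu_bus_running' flag.

```c
#include <stdbool.h>

static bool pmu_bus_running_model;	/* set once the bus has been registered */

struct pmu_state {
	bool on_list;
	bool dev_added;		/* a device only exists if the bus was up at registration */
	int dev_removals;	/* counts device removals, for illustration */
};

static void pmu_register_model(struct pmu_state *p)
{
	p->on_list = true;
	p->dev_added = pmu_bus_running_model;
}

static void pmu_unregister_model(struct pmu_state *p)
{
	/* Sample the flag at the moment the PMU leaves the list ... */
	bool remove_device = pmu_bus_running_model;

	p->on_list = false;

	/* ... and only tear the device down later if it was set. */
	if (remove_device) {
		p->dev_added = false;
		p->dev_removals++;
	}
}
```

A PMU that is registered and unregistered before the bus comes up never had a device, so the removal step is skipped instead of operating on something that was never added.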
19 Oct, 2016
1 commit
-
This removes the 'write' and 'force' from get_user_pages_remote() and
replaces them with 'gup_flags' to make the use of FOLL_FORCE explicit in
callers as use of this flag can result in surprising behaviour (and
hence bugs) within the mm subsystem.
Signed-off-by: Lorenzo Stoakes
Acked-by: Michal Hocko
Reviewed-by: Jan Kara
Signed-off-by: Linus Torvalds
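The conversion from two boolean-style arguments to a single flags word can be sketched as follows. The names and numeric values here (`FOLL_WRITE_MODEL`, `FOLL_FORCE_MODEL`, `gup_flags_from_legacy()`) are assumptions made for illustration; the kernel headers define the real flags.

```c
/* Illustrative flag values for this sketch only. */
#define FOLL_WRITE_MODEL 0x01u
#define FOLL_FORCE_MODEL 0x10u

/* Translate the old (write, force) argument pair into one flags word. */
static unsigned int gup_flags_from_legacy(int write, int force)
{
	unsigned int gup_flags = 0;

	if (write)
		gup_flags |= FOLL_WRITE_MODEL;
	if (force)
		gup_flags |= FOLL_FORCE_MODEL;
	return gup_flags;
}
```

With a flags argument, a caller that wants forced access has to spell the flag out by name, which is easier to find in review than a bare `1` in the last argument position.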
06 Oct, 2016
1 commit
-
Pull networking updates from David Miller:
1) BBR TCP congestion control, from Neal Cardwell, Yuchung Cheng and
co. at Google. https://lwn.net/Articles/701165/
2) Do TCP Small Queues for retransmits, from Eric Dumazet.
3) Support collect_md mode for all IPV4 and IPV6 tunnels, from Alexei
Starovoitov.
4) Allow cls_flower to classify packets in ip tunnels, from Amir Vadai.
5) Support DSA tagging in older mv88e6xxx switches, from Andrew Lunn.
6) Support GMAC protocol in iwlwifi mwm, from Ayala Beker.
7) Support ndo_poll_controller in mlx5, from Calvin Owens.
8) Move VRF processing to an output hook and allow l3mdev to be
loopback, from David Ahern.
9) Support SOCK_DESTROY for UDP sockets. Also from David Ahern.
10) Congestion control in RXRPC, from David Howells.
11) Support geneve RX offload in ixgbe, from Emil Tantilov.
12) When hitting pressure for new incoming TCP data SKBs, perform a
partial rather than a full purge of the OFO queue (which could be
huge). From Eric Dumazet.
13) Convert XFRM state and policy lookups to RCU, from Florian Westphal.
14) Support RX network flow classification to igb, from Gangfeng Huang.
15) Hardware offloading of eBPF in nfp driver, from Jakub Kicinski.
16) New skbmod packet action, from Jamal Hadi Salim.
17) Remove some inefficiencies in snmp proc output, from Jia He.
18) Add FIB notifications to properly propagate route changes to
hardware which is doing forwarding offloading. From Jiri Pirko.
19) New dsa driver for qca8xxx chips, from John Crispin.
20) Implement RFC7559 ipv6 router solicitation backoff, from Maciej
Żenczykowski.
21) Add L3 mode to ipvlan, from Mahesh Bandewar.
22) Support 802.1ad in mlx4, from Moshe Shemesh.
23) Support hardware LRO in mediatek driver, from Nelson Chang.
24) Add TC offloading to mlx5, from Or Gerlitz.
25) Convert various drivers to ethtool ksettings interfaces, from
Philippe Reynes.
26) TX max rate limiting for cxgb4, from Rahul Lakkireddy.
27) NAPI support for ath10k, from Rajkumar Manoharan.
28) Support XDP in mlx5, from Rana Shahout and Saeed Mahameed.
29) UDP replicast support in TIPC, from Richard Alpe.
30) Per-queue statistics for qed driver, from Sudarsana Reddy Kalluru.
31) Support BQL in thunderx driver, from Sunil Goutham.
32) TSO support in alx driver, from Tobias Regnery.
33) Add stream parser engine and use it in kcm.
34) Support async DHCP replies in ipconfig module, from Uwe
Kleine-König.
35) DSA port fast aging for mv88e6xxx driver, from Vivien Didelot.
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1715 commits)
mlxsw: switchx2: Fix misuse of hard_header_len
mlxsw: spectrum: Fix misuse of hard_header_len
net/faraday: Stop NCSI device on shutdown
net/ncsi: Introduce ncsi_stop_dev()
net/ncsi: Rework the channel monitoring
net/ncsi: Allow to extend NCSI request properties
net/ncsi: Rework request index allocation
net/ncsi: Don't probe on the reserved channel ID (0x1f)
net/ncsi: Introduce NCSI_RESERVED_CHANNEL
net/ncsi: Avoid unused-value build warning from ia64-linux-gcc
net: Add netdev all_adj_list refcnt propagation to fix panic
net: phy: Add Edge-rate driver for Microsemi PHYs.
vmxnet3: Wake queue from reset work
i40e: avoid NULL pointer dereference and recursive errors on early PCI error
qed: Add RoCE ll2 & GSI support
qed: Add support for memory registeration verbs
qed: Add support for QP verbs
qed: PD,PKEY and CQ verb support
qed: Add support for RoCE hw init
qede: Add qedr framework
...
03 Oct, 2016
1 commit
-
Three sets of overlapping changes. Nothing serious.
Signed-off-by: David S. Miller
23 Sep, 2016
2 commits
-
Signed-off-by: Ingo Molnar
22 Sep, 2016
1 commit
-
An "exclusive" PMU is the one that can only have one event scheduled in
at any given time. There may be more than one of such PMUs in a system,
though, like Intel PT and BTS. It should be allowed to have one event
for either of those inside the same context (there may be other constraints
that may prevent this, but those would be hardware-specific). However,
the exclusivity code is written so that only one event from any of the
"exclusive" PMUs is allowed in a context.
Fix this by making the exclusive event filter explicitly match two events'
PMUs.
Signed-off-by: Alexander Shishkin
Acked-by: Peter Zijlstra
Cc: Arnaldo Carvalho de Melo
Cc: Jiri Olsa
Cc: Linus Torvalds
Cc: Peter Zijlstra
Cc: Stephane Eranian
Cc: Thomas Gleixner
Cc: Vince Weaver
Cc: vince@deater.net
Link: http://lkml.kernel.org/r/20160920154811.3255-3-alexander.shishkin@linux.intel.com
Signed-off-by: Ingo Molnar
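The corrected rule, that two events only exclude each other when they belong to the same exclusive PMU, can be written as a small predicate. This is a sketch with hypothetical types (`excl_pmu`, `excl_event`) and a hypothetical function name, not the kernel's filter.

```c
#include <stdbool.h>

struct excl_pmu {
	bool exclusive;		/* PMU can have only one event scheduled at a time */
};

struct excl_event {
	const struct excl_pmu *pmu;
};

/*
 * Two events conflict only if they share one PMU and that PMU is
 * exclusive. Events on two different exclusive PMUs may coexist.
 */
static bool exclusive_conflict(const struct excl_event *a,
			       const struct excl_event *b)
{
	return a->pmu == b->pmu && a->pmu->exclusive;
}
```

Under the earlier behaviour described in the message, the predicate would have been equivalent to "either PMU is exclusive", which wrongly rejects one Intel PT event plus one BTS event in the same context.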
10 Sep, 2016
3 commits
-
Signed-off-by: Ingo Molnar
-
The order of accesses to ring buffer's aux_mmap_count and aux_refcount
has to be preserved across the users, namely perf_mmap_close() and
perf_aux_output_begin(), otherwise the inversion can result in the latter
holding the last reference to the aux buffer and subsequently free'ing
it in atomic context, triggering a warning.
> ------------[ cut here ]------------
> WARNING: CPU: 0 PID: 257 at kernel/events/ring_buffer.c:541 __rb_free_aux+0x11a/0x130
> CPU: 0 PID: 257 Comm: stopbug Not tainted 4.8.0-rc1+ #2596
> Call Trace:
> [] __warn+0xcb/0xf0
> [] warn_slowpath_null+0x1d/0x20
> [] __rb_free_aux+0x11a/0x130
> [] rb_free_aux+0x18/0x20
> [] perf_aux_output_begin+0x163/0x1e0
> [] bts_event_start+0x3a/0xd0
> [] bts_event_add+0x5d/0x80
> [] event_sched_in.isra.104+0xf6/0x2f0
> [] group_sched_in+0x6e/0x190
> [] ctx_sched_in+0x2fe/0x5f0
> [] perf_event_sched_in+0x60/0x80
> [] ctx_resched+0x5b/0x90
> [] __perf_event_enable+0x1e1/0x240
> [] event_function+0xa9/0x180
> [] ? perf_cgroup_attach+0x70/0x70
> [] remote_function+0x3f/0x50
> [] flush_smp_call_function_queue+0x83/0x150
> [] generic_smp_call_function_single_interrupt+0x13/0x60
> [] smp_call_function_single_interrupt+0x27/0x40
> [] call_function_single_interrupt+0x89/0x90
> [] finish_task_switch+0xa6/0x210
> [] ? finish_task_switch+0x67/0x210
> [] __schedule+0x3dd/0xb50
> [] schedule+0x35/0x80
> [] sys_sched_yield+0x61/0x70
> [] entry_SYSCALL_64_fastpath+0x18/0xa8
> ---[ end trace 6235f556f5ea83a9 ]---
This patch puts the checks in perf_aux_output_begin() in the same order
as that of perf_mmap_close().
Reported-by: Vince Weaver
Signed-off-by: Alexander Shishkin
Signed-off-by: Peter Zijlstra (Intel)
Cc: Arnaldo Carvalho de Melo
Cc: Jiri Olsa
Cc: Linus Torvalds
Cc: Peter Zijlstra
Cc: Stephane Eranian
Cc: Thomas Gleixner
Cc: vince@deater.net
Link: http://lkml.kernel.org/r/20160906132353.19887-3-alexander.shishkin@linux.intel.com
Signed-off-by: Ingo Molnar
-
In the mmap_close() path we need to stop all the AUX events that are
writing data to the AUX area that we are unmapping, before we can
safely free the pages. To determine if an event needs to be stopped,
we're comparing its ->rb against the one that's getting unmapped.
However, a SET_OUTPUT ioctl may turn up inside an AUX transaction
and swizzle event::rb to some other ring buffer, but the transaction
will keep writing data to the old ring buffer until the event gets
scheduled out. At this point, mmap_close() will skip over such an
event and will proceed to free the AUX area, while it's still being
used by this event, which will set off a warning in the mmap_close()
path and cause memory corruption.
To avoid this, always stop an AUX event before its ->rb is updated;
this will release the (potentially) last reference on the AUX area
of the buffer. If the event gets restarted, its new ring buffer will
be used. If another SET_OUTPUT comes and switches it back to the
old ring buffer that's getting unmapped, it's also fine: this
ring buffer's aux_mmap_count will be zero and AUX transactions won't
start any more.
Reported-by: Vince Weaver
Signed-off-by: Alexander Shishkin
Signed-off-by: Peter Zijlstra (Intel)
Cc: Arnaldo Carvalho de Melo
Cc: Arnaldo Carvalho de Melo
Cc: Jiri Olsa
Cc: Linus Torvalds
Cc: Peter Zijlstra
Cc: Stephane Eranian
Cc: Thomas Gleixner
Cc: vince@deater.net
Link: http://lkml.kernel.org/r/20160906132353.19887-2-alexander.shishkin@linux.intel.com
Signed-off-by: Ingo Molnar
07 Sep, 2016
1 commit
-
The newly added bpf_overflow_handler function is only built if both
CONFIG_EVENT_TRACING and CONFIG_BPF_SYSCALL are enabled, but the caller
only checks the latter:
kernel/events/core.c: In function 'perf_event_alloc':
kernel/events/core.c:9106:27: error: 'bpf_overflow_handler' undeclared (first use in this function)
This changes the caller so we also skip this call if CONFIG_EVENT_TRACING
is disabled entirely.
Signed-off-by: Arnd Bergmann
Fixes: aa6a5f3cb2b2 ("perf, bpf: add perf events core support for BPF_PROG_TYPE_PERF_EVENT programs")
Acked-by: Alexei Starovoitov
Signed-off-by: David S. Miller
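The class of build failure is easy to reproduce in miniature: a function defined under two configuration symbols must also be referenced under both. The sketch below uses hypothetical stand-in macros (`CONFIG_BPF_SYSCALL_MODEL`, `CONFIG_EVENT_TRACING_MODEL`) and hypothetical function names; it is not the kernel source.

```c
#include <stddef.h>

/* Stand-ins for Kconfig symbols; only the first one is enabled here. */
#define CONFIG_BPF_SYSCALL_MODEL 1
/* #define CONFIG_EVENT_TRACING_MODEL 1 */

typedef int (*overflow_fn)(int sample);

/* Always-available fallback handler. */
static int default_handler_model(int sample)
{
	return -sample;
}

/* The special handler only exists when both symbols are defined. */
#if defined(CONFIG_BPF_SYSCALL_MODEL) && defined(CONFIG_EVENT_TRACING_MODEL)
static int bpf_overflow_handler_model(int sample)
{
	return sample;
}
#endif

/*
 * The caller uses exactly the same condition as the definition.
 * Guarding it with CONFIG_BPF_SYSCALL_MODEL alone would reference an
 * undeclared function in the configuration chosen above.
 */
static overflow_fn pick_handler(overflow_fn fallback)
{
#if defined(CONFIG_BPF_SYSCALL_MODEL) && defined(CONFIG_EVENT_TRACING_MODEL)
	return bpf_overflow_handler_model;
#else
	return fallback;
#endif
}
```

Keeping the two conditions textually identical is the simplest way to make sure every configuration that compiles the caller also compiles the callee.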
05 Sep, 2016
3 commits
-
PERF_EF_START is a flag to indicate to the PMU ->add() callback that, as
well as claiming the PMU resources required by the event being added,
it should also start the PMU.
Passing this flag to the ->start() callback doesn't make sense, because
->start() always tries to start the PMU. Remove it.
Signed-off-by: Will Deacon
Signed-off-by: Peter Zijlstra (Intel)
Cc: Alexander Shishkin
Cc: Arnaldo Carvalho de Melo
Cc: Jiri Olsa
Cc: Linus Torvalds
Cc: Peter Zijlstra
Cc: Stephane Eranian
Cc: Thomas Gleixner
Cc: Vince Weaver
Cc: mark.rutland@arm.com
Link: http://lkml.kernel.org/r/1471257765-29662-1-git-send-email-will.deacon@arm.com
Signed-off-by: Ingo Molnar
-
Conflicts:
kernel/events/core.c
Signed-off-by: Ingo Molnar
-
This effectively reverts commit:
71e7bc2bab77 ("perf/core: Check return value of the perf_event_read() IPI")
... and puts in a comment explaining why we ignore the return value.
Reported-by: Vegard Nossum
Signed-off-by: Peter Zijlstra (Intel)
Cc: David Carrillo-Cisneros
Cc: Jiri Olsa
Cc: Linus Torvalds
Cc: Peter Zijlstra
Cc: Thomas Gleixner
Fixes: 71e7bc2bab77 ("perf/core: Check return value of the perf_event_read() IPI")
Signed-off-by: Ingo Molnar
03 Sep, 2016
1 commit
-
Allow attaching BPF_PROG_TYPE_PERF_EVENT programs to sw and hw perf events
via the overflow_handler mechanism.
When a program is attached the overflow_handlers become stacked.
The program acts as a filter.
Returning zero from the program means that the normal perf_event_output handler
will not be called and the sampling event won't be stored in the ring buffer.
The overflow_handler_context==NULL is an additional safety check
to make sure programs are not attached to hw breakpoints and watchdog
in case other checks (that prevent that now anyway) get accidentally
relaxed in the future.
The program refcnt is incremented in case perf_events are inherited
when the target task is forked.
Similar to kprobe and tracepoint programs there is no ioctl to
detach the program or swap an already attached program. User space
is expected to close(perf_event_fd) like it does right now for kprobe+bpf.
That restriction simplifies the code quite a bit.
The invocation of overflow_handler in __perf_event_overflow() is now
done via READ_ONCE, since that pointer can be replaced when the program
is attached while perf_event itself could have been active already.
There is no need to do similar treatment for event->prog, since it's
assigned only once before it's accessed.
Signed-off-by: Alexei Starovoitov
Signed-off-by: David S. Miller
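The stacking of handlers described above can be modelled with plain function pointers. The sketch below is a user-space illustration under stated assumptions: `sample_event`, its fields, `attach_prog_model()` and the example program `keep_even()` are hypothetical, and a counter stands in for the ring buffer.

```c
#include <stddef.h>

struct sample_event;
typedef void (*overflow_handler_t)(struct sample_event *ev);
typedef int (*filter_prog_t)(const struct sample_event *ev);

struct sample_event {
	overflow_handler_t overflow_handler;		/* called on overflow */
	overflow_handler_t orig_overflow_handler;	/* saved when a program is attached */
	filter_prog_t prog;
	unsigned long value;
	int stored;	/* number of samples that reached the "ring buffer" */
};

/* Normal output path: store the sample. */
static void output_handler_model(struct sample_event *ev)
{
	ev->stored++;
}

/* Stacked handler: run the program first, then maybe the original handler. */
static void filtering_handler_model(struct sample_event *ev)
{
	if (ev->prog(ev) == 0)
		return;		/* program returned zero: drop the sample */
	ev->orig_overflow_handler(ev);
}

/* Attach a program by saving the current handler and installing the filter. */
static void attach_prog_model(struct sample_event *ev, filter_prog_t prog)
{
	ev->prog = prog;
	ev->orig_overflow_handler = ev->overflow_handler;
	ev->overflow_handler = filtering_handler_model;
}

/* Example program: keep only samples with an even value. */
static int keep_even(const struct sample_event *ev)
{
	return ev->value % 2 == 0;
}
```

In this model there is no detach operation either; dropping the event is the only way to get rid of the program, which mirrors the close(perf_event_fd) restriction in the message.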
24 Aug, 2016
1 commit
-
When tearing down an AUX buf for an event via perf_mmap_close(),
__perf_event_output_stop() is called on the event's CPU to ensure that
trace generation is halted before the process of unmapping and
freeing the buffer pages begins.
The callback is performed via cpu_function_call(), which ensures that it
runs with interrupts disabled and is therefore not preemptible.
Unfortunately, the current code grabs the per-cpu context pointer using
get_cpu_ptr(), which unnecessarily disables preemption and doesn't pair
the call with put_cpu_ptr(), leading to a preempt_count() imbalance and
a BUG when freeing the AUX buffer later on:
WARNING: CPU: 1 PID: 2249 at kernel/events/ring_buffer.c:539 __rb_free_aux+0x10c/0x120
Modules linked in:
[...]
Call Trace:
[] dump_stack+0x4f/0x72
[] __warn+0xc6/0xe0
[] warn_slowpath_null+0x18/0x20
[] __rb_free_aux+0x10c/0x120
[] rb_free_aux+0x13/0x20
[] perf_mmap_close+0x29e/0x2f0
[] ? perf_iterate_ctx+0xe0/0xe0
[] remove_vma+0x25/0x60
[] exit_mmap+0x106/0x140
[] mmput+0x1c/0xd0
[] do_exit+0x253/0xbf0
[] do_group_exit+0x3e/0xb0
[] get_signal+0x249/0x640
[] do_signal+0x23/0x640
[] ? _raw_write_unlock_irq+0x12/0x30
[] ? _raw_spin_unlock_irq+0x9/0x10
[] ? __schedule+0x2c6/0x710
[] exit_to_usermode_loop+0x74/0x90
[] prepare_exit_to_usermode+0x26/0x30
[] retint_user+0x8/0x10
This patch uses this_cpu_ptr() instead of get_cpu_ptr(), since preemption is
already disabled by the caller.
Signed-off-by: Will Deacon
Reviewed-by: Alexander Shishkin
Cc: Arnaldo Carvalho de Melo
Cc: Linus Torvalds
Cc: Peter Zijlstra
Cc: Thomas Gleixner
Cc: Vince Weaver
Fixes: 95ff4ca26c49 ("perf/core: Free AUX pages in unmap path")
Link: http://lkml.kernel.org/r/20160824091905.GA16944@arm.com
Signed-off-by: Ingo Molnar
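The imbalance is easy to see in a toy model where the preemption count is an ordinary integer. Everything below is a hypothetical user-space stand-in (the `_model` accessors and the three `output_stop_*` variants); it only illustrates the pairing rule, not the kernel's per-cpu machinery.

```c
static int preempt_count_model;	/* 0 means preemptible */
static int per_cpu_slot;	/* stand-in for one CPU's per-cpu variable */

/* Model of get_cpu_ptr(): disables preemption, then returns the pointer. */
static int *get_cpu_ptr_model(void)
{
	preempt_count_model++;
	return &per_cpu_slot;
}

/* Model of put_cpu_ptr(): re-enables preemption. */
static void put_cpu_ptr_model(void)
{
	preempt_count_model--;
}

/* Model of this_cpu_ptr(): returns the pointer, leaves the count alone. */
static int *this_cpu_ptr_model(void)
{
	return &per_cpu_slot;
}

/* Buggy shape: get without put leaves the count raised on return. */
static void output_stop_buggy(void)
{
	int *p = get_cpu_ptr_model();
	*p = 0;
}

/* Balanced but unnecessary when the caller is already non-preemptible. */
static void output_stop_paired(void)
{
	int *p = get_cpu_ptr_model();
	*p = 0;
	put_cpu_ptr_model();
}

/* Shape of the fix: plain accessor, no change to the count. */
static void output_stop_fixed(void)
{
	int *p = this_cpu_ptr_model();
	*p = 0;
}
```

Each call to the buggy variant leaves the count one higher than it found it, so later code that expects a preemptible context trips its check.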
18 Aug, 2016
12 commits
-
Introduce the flag PMU_EV_CAP_READ_ACTIVE_PKG, useful for uncore events,
that allows a PMU to signal the generic perf code that an event is readable
in the current CPU if the event is active in a CPU in the same package as
the current CPU.
This is an optimization that avoids an unnecessary IPI for the common case
where uncore events are run and read in the same package but in
different CPUs.
As an example, the IPI removal speeds up perf_read() in my Haswell system
as follows:
- For event UNC_C_LLC_LOOKUP: From 260 us to 31 us.
- For event RAPL's power/energy-cores/: From 255 us to 27 us.
For the optimization to work, all events in the group must have it
(similarly to PERF_EV_CAP_SOFTWARE).
Signed-off-by: David Carrillo-Cisneros
Signed-off-by: Peter Zijlstra (Intel)
Cc: Alexander Shishkin
Cc: Arnaldo Carvalho de Melo
Cc: David Carrillo-Cisneros
Cc: Jiri Olsa
Cc: Kan Liang
Cc: Linus Torvalds
Cc: Paul Turner
Cc: Peter Zijlstra
Cc: Stephane Eranian
Cc: Thomas Gleixner
Cc: Vegard Nossum
Cc: Vince Weaver
Link: http://lkml.kernel.org/r/1471467307-61171-4-git-send-email-davidcc@google.com
Signed-off-by: Ingo Molnar
-
Currently, PERF_GROUP_SOFTWARE is used in the group_flags field of a
group's leader to indicate that is_software_event(event) is true for all
events in a group. This is the only usage of event->group_flags.
This pattern of setting a group-level flag when all events in the group
share a property is useful for the flag introduced in the next patch and
for future CQM/CMT flags. So this patch generalizes group_flags to work
as an aggregate of event-level flags.
PERF_GROUP_SOFTWARE denotes an immutable property of an event. All other flags
that I intend to add are also determinable at event initialization.
To better convey the above, this patch renames event's group_flags to
group_caps and PERF_GROUP_SOFTWARE to PERF_EV_CAP_SOFTWARE.
Individual event flags are stored in the new event->event_caps. Since the
cap flags do not change after event initialization, there is no need to
serialize event_caps. This new field is used when events are added to a
context, similarly to how PERF_GROUP_SOFTWARE and is_software_event()
worked.
Lastly, for consistency, update is_software_event() to rely on event_caps
instead of the context index.
Signed-off-by: David Carrillo-Cisneros
Signed-off-by: Peter Zijlstra (Intel)
Cc: Alexander Shishkin
Cc: Arnaldo Carvalho de Melo
Cc: Jiri Olsa
Cc: Kan Liang
Cc: Linus Torvalds
Cc: Paul Turner
Cc: Peter Zijlstra
Cc: Stephane Eranian
Cc: Thomas Gleixner
Cc: Vegard Nossum
Cc: Vince Weaver
Link: http://lkml.kernel.org/r/1471467307-61171-3-git-send-email-davidcc@google.com
Signed-off-by: Ingo Molnar
-
When decoding the perf_regs mask in perf_output_sample_regs(),
we loop through the mask using find_first_bit and find_next_bit functions.
While the existing code works fine in most cases, the logic
is broken for big-endian 32-bit kernels.
When reading a u64 mask using (u32 *)(&val)[0], find_*_bit() assumes
that it gets the lower 32 bits of u64, but instead it gets the upper
32 bits - which is wrong.
The fix is to swap the words of the u64 to handle this case.
This is _not_ a regular endianness swap.
Suggested-by: Yury Norov
Signed-off-by: Madhavan Srinivasan
Signed-off-by: Peter Zijlstra (Intel)
Reviewed-by: Yury Norov
Cc: Alexander Shishkin
Cc: Arnaldo Carvalho de Melo
Cc: Jiri Olsa
Cc: Linus Torvalds
Cc: Michael Ellerman
Cc: Peter Zijlstra
Cc: Stephane Eranian
Cc: Thomas Gleixner
Cc: Vince Weaver
Cc: linuxppc-dev@lists.ozlabs.org
Link: http://lkml.kernel.org/r/1471426568-31051-2-git-send-email-maddy@linux.vnet.ibm.com
Signed-off-by: Ingo Molnar
-
Signed-off-by: Ingo Molnar
-
The call to smp_call_function_single in perf_event_read() may fail if
an invalid or offline CPU index is passed. Warn the user if such a bug is
present and return an error.
Signed-off-by: David Carrillo-Cisneros
Signed-off-by: Peter Zijlstra (Intel)
Cc: Alexander Shishkin
Cc: Arnaldo Carvalho de Melo
Cc: Jiri Olsa
Cc: Kan Liang
Cc: Linus Torvalds
Cc: Paul Turner
Cc: Peter Zijlstra
Cc: Stephane Eranian
Cc: Thomas Gleixner
Cc: Vegard Nossum
Cc: Vince Weaver
Link: http://lkml.kernel.org/r/1471467307-61171-2-git-send-email-davidcc@google.com
Signed-off-by: Ingo Molnar
-
At this time the perf_addr_filter_needs_mmap() function will _not_
return true on a user space 'stop' filter. But stop filters need
exactly the same kind of mapping that range and start filters get.
Signed-off-by: Mathieu Poirier
Signed-off-by: Peter Zijlstra (Intel)
Acked-by: Alexander Shishkin
Cc: Arnaldo Carvalho de Melo
Cc: Jiri Olsa
Cc: Linus Torvalds
Cc: Peter Zijlstra
Cc: Stephane Eranian
Cc: Thomas Gleixner
Cc: Vince Weaver
Link: http://lkml.kernel.org/r/1468860187-318-4-git-send-email-mathieu.poirier@linaro.org
Signed-off-by: Ingo Molnar
-
Function perf_event_mmap() is called by the MM subsystem each time
part of a binary is loaded in memory. There can be several mappings
for a binary, many times unrelated to the code section.
Each time a section of a binary is mapped address filters are
updated, even when the map doesn't pertain to the code section.
The end result is that filters are configured based on the last map
event that was received rather than the last mapping of the code
segment.
For example, suppose we have an executable 'main' that calls library
'libcstest.so.1.0', and that we want to collect traces on code
that is in that library. The perf command line for this scenario
would be:
perf record -e cs_etm// --filter 'filter 0x72c/0x40@/opt/lib/libcstest.so.1.0' --per-thread ./main
Resulting in binaries being mapped this way:
root@linaro-nano:~# cat /proc/1950/maps
00400000-00401000 r-xp 00000000 08:02 33169 /home/linaro/main
00410000-00411000 r--p 00000000 08:02 33169 /home/linaro/main
00411000-00412000 rw-p 00001000 08:02 33169 /home/linaro/main
7fa2464000-7fa2474000 rw-p 00000000 00:00 0
7fa2474000-7fa25a4000 r-xp 00000000 08:02 543 /lib/aarch64-linux-gnu/libc-2.21.so
7fa25a4000-7fa25b3000 ---p 00130000 08:02 543 /lib/aarch64-linux-gnu/libc-2.21.so
7fa25b3000-7fa25b7000 r--p 0012f000 08:02 543 /lib/aarch64-linux-gnu/libc-2.21.so
7fa25b7000-7fa25b9000 rw-p 00133000 08:02 543 /lib/aarch64-linux-gnu/libc-2.21.so
7fa25b9000-7fa25bd000 rw-p 00000000 00:00 0
7fa25bd000-7fa25be000 r-xp 00000000 08:02 38308 /opt/lib/libcstest.so.1.0
7fa25be000-7fa25cd000 ---p 00001000 08:02 38308 /opt/lib/libcstest.so.1.0
7fa25cd000-7fa25ce000 r--p 00000000 08:02 38308 /opt/lib/libcstest.so.1.0
7fa25ce000-7fa25cf000 rw-p 00001000 08:02 38308 /opt/lib/libcstest.so.1.0
7fa25cf000-7fa25eb000 r-xp 00000000 08:02 574 /lib/aarch64-linux-gnu/ld-2.21.so
7fa25ef000-7fa25f2000 rw-p 00000000 00:00 0
7fa25f7000-7fa25f9000 rw-p 00000000 00:00 0
7fa25f9000-7fa25fa000 r--p 00000000 00:00 0 [vvar]
7fa25fa000-7fa25fb000 r-xp 00000000 00:00 0 [vdso]
7fa25fb000-7fa25fc000 r--p 0001c000 08:02 574 /lib/aarch64-linux-gnu/ld-2.21.so
7fa25fc000-7fa25fe000 rw-p 0001d000 08:02 574 /lib/aarch64-linux-gnu/ld-2.21.so
7ff2ea8000-7ff2ec9000 rw-p 00000000 00:00 0 [stack]
root@linaro-nano:~#

Before 'main()' can execute 'libcstest.so.1.0' has to be loaded in
memory. Once that has been done perf_event_mmap() has been called
4 times, with the last map starting at address 0x7fa25ce000 and
the address filter configured to start filtering when the
IP has passed over address 0x7fa25ce72c (0x7fa25ce000 + 0x72c).

But that is wrong since the code segment for library 'libcstest.so.1.0'
has been mapped at 0x7fa25bd000, resulting in traces not being
collected.

This patch corrects the situation by requesting that address
filters be updated only if the mapped event is for a code
segment.

Signed-off-by: Mathieu Poirier
Signed-off-by: Peter Zijlstra (Intel)
Acked-by: Alexander Shishkin
Cc: Arnaldo Carvalho de Melo
Cc: Jiri Olsa
Cc: Linus Torvalds
Cc: Peter Zijlstra
Cc: Stephane Eranian
Cc: Thomas Gleixner
Cc: Vince Weaver
Link: http://lkml.kernel.org/r/1468860187-318-3-git-send-email-mathieu.poirier@linaro.org
Signed-off-by: Ingo Molnar -
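Illustration for the address filter fix above (not part of the patch): a minimal user-space Python sketch of the arithmetic in the commit message. It replays the four 'libcstest.so.1.0' mappings from the listing and lets the last matching mapping set the filter base, once for every mapping and once for executable mappings only. The helper names parse_maps_line() and filter_start() are hypothetical, and the "x permission bit means code segment" test is an assumption made for this sketch, not a statement about how the kernel decides.

```python
# User-space model only; the kernel does not parse /proc/<pid>/maps for this.
MAPS = """\
7fa25bd000-7fa25be000 r-xp 00000000 08:02 38308 /opt/lib/libcstest.so.1.0
7fa25be000-7fa25cd000 ---p 00001000 08:02 38308 /opt/lib/libcstest.so.1.0
7fa25cd000-7fa25ce000 r--p 00000000 08:02 38308 /opt/lib/libcstest.so.1.0
7fa25ce000-7fa25cf000 rw-p 00001000 08:02 38308 /opt/lib/libcstest.so.1.0
"""


def parse_maps_line(line):
    """Return (start_address, perms, path) for one maps line; path may be ''."""
    fields = line.split(None, 5)
    start = int(fields[0].split("-")[0], 16)
    perms = fields[1]
    path = fields[5].strip() if len(fields) > 5 else ""
    return start, perms, path


def filter_start(maps_text, path, offset, code_only):
    """Replay the mappings in listed order; the last accepted one sets the base."""
    base = None
    for line in maps_text.splitlines():
        start, perms, mapped = parse_maps_line(line)
        if mapped != path:
            continue
        if code_only and "x" not in perms:
            continue  # assumed stand-in for "not a code segment"
        base = start
    return None if base is None else base + offset


LIB = "/opt/lib/libcstest.so.1.0"
before = filter_start(MAPS, LIB, 0x72C, code_only=False)
after = filter_start(MAPS, LIB, 0x72C, code_only=True)
print(hex(before), hex(after))  # 0x7fa25ce72c 0x7fa25bd72c
```

The first value is the wrong address quoted in the message; the second is the offset applied to the code segment mapped at 0x7fa25bd000.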
Binary file names have to be supplied for both range and start/stop
filters but the current code only processes the filename if an
address range filter is specified. This code adds processing of
the filename for start/stop filters.

Signed-off-by: Mathieu Poirier
Signed-off-by: Peter Zijlstra (Intel)
Acked-by: Alexander Shishkin
Cc: Arnaldo Carvalho de Melo
Cc: Jiri Olsa
Cc: Linus Torvalds
Cc: Peter Zijlstra
Cc: Stephane Eranian
Cc: Thomas Gleixner
Cc: Vince Weaver
Link: http://lkml.kernel.org/r/1468860187-318-2-git-send-email-mathieu.poirier@linaro.org
Signed-off-by: Ingo Molnar -
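Illustration for the filename handling described above (not part of the patch): a small Python sketch that splits a filter string of the form used in the perf command line earlier in this log, 'ACTION START[/SIZE][@PATH]'. The function parse_addr_filter() is hypothetical, the grammar is inferred from that one example, and the sketch assumes hexadecimal numbers. The only point it makes is that the path after '@' is extracted for every action rather than for range filters only.

```python
def parse_addr_filter(spec):
    """Parse 'ACTION START[/SIZE][@PATH]' into a dict (illustrative only)."""
    action, _, rest = spec.strip().partition(" ")
    if action not in ("filter", "start", "stop"):
        raise ValueError("unknown action: %r" % action)
    addr_part, _, path = rest.strip().partition("@")
    start_str, _, size_str = addr_part.partition("/")
    return {
        "action": action,
        "start": int(start_str, 16),  # assumes hex input such as 0x72c
        "size": int(size_str, 16) if size_str else 0,
        # The file name is kept for 'start' and 'stop' too, not just 'filter'.
        "path": path or None,
    }


range_filter = parse_addr_filter("filter 0x72c/0x40@/opt/lib/libcstest.so.1.0")
stop_filter = parse_addr_filter("stop 0x72c@/opt/lib/libcstest.so.1.0")
```

Both results carry the same 'path' value, which is what a later mapping lookup would need in either case.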
Vincent reported triggering the WARN_ON_ONCE() in event_function_local().
While thinking through cases I noticed that by using event_function()
directly, we miss the inactive case usually handled by
event_function_call().

Therefore construct a blend of event_function_call() and
event_function() that handles the cases relevant to
event_function_local().

Reported-by: Vince Weaver
Signed-off-by: Peter Zijlstra (Intel)
Cc: Alexander Shishkin
Cc: Arnaldo Carvalho de Melo
Cc: Jiri Olsa
Cc: Linus Torvalds
Cc: Peter Zijlstra
Cc: Stephane Eranian
Cc: Thomas Gleixner
Cc: stable@vger.kernel.org # 4.5+
Fixes: fae3fde65138 ("perf: Collapse and fix event_function_call() users")
Signed-off-by: Ingo Molnar -
Purely cosmetic, no changes in the compiled code.
Perhaps it is just me but I can hardly read __replace_page() because I can't
distinguish "page" from "kpage" and because I need to look at the caller to
ensure that, say, kpage is really the new page and the code is correct.
Rename them to old_page and new_page; this matches the caller.

Signed-off-by: Oleg Nesterov
Cc: Alexander Shishkin
Cc: Alexei Starovoitov
Cc: Arnaldo Carvalho de Melo
Cc: Brenden Blanco
Cc: Jiri Olsa
Cc: Johannes Weiner
Cc: Linus Torvalds
Cc: Michal Hocko
Cc: Peter Zijlstra
Cc: Thomas Gleixner
Cc: Vladimir Davydov
Link: http://lkml.kernel.org/r/20160817153704.GC29724@redhat.com
Signed-off-by: Ingo Molnar -
Signed-off-by: Ingo Molnar
-
__replace_page() wrongly calls mem_cgroup_cancel_charge() in the "success" path;
it should only do this if page_check_address() fails.

This means that every enable/disable leads to unbalanced mem_cgroup_uncharge()
from put_page(old_page), so it is trivial to underflow the page_counter->count
and trigger OOM.

Reported-and-tested-by: Brenden Blanco
Signed-off-by: Oleg Nesterov
Reviewed-by: Johannes Weiner
Acked-by: Michal Hocko
Cc: Alexander Shishkin
Cc: Alexei Starovoitov
Cc: Arnaldo Carvalho de Melo
Cc: Jiri Olsa
Cc: Linus Torvalds
Cc: Peter Zijlstra
Cc: Thomas Gleixner
Cc: Vladimir Davydov
Cc: stable@vger.kernel.org # 3.17+
Fixes: 00501b531c47 ("mm: memcontrol: rewrite charge API")
Link: http://lkml.kernel.org/r/20160817153629.GB29724@redhat.com
Signed-off-by: Ingo Molnar
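Illustration for the __replace_page() charge fix above (not part of the patch): a deliberately simplified Python model of the accounting imbalance. PageCounter, replace_page(), free_page() and cycles() are hypothetical names, and one "charge" stands for one page. The model assumes the sequence suggested by the message: the new page is charged, the charge is cancelled only on the failure path (fixed) or also on the success path (buggy), and a later put_page() of the committed page uncharges it again.

```python
class PageCounter:
    """Toy stand-in for a per-cgroup page counter."""

    def __init__(self):
        self.count = 0

    def charge(self):
        self.count += 1

    def uncharge(self):
        self.count -= 1


def replace_page(counter, check_ok, cancel_on_success):
    counter.charge()            # models charging the new page up front
    if not check_ok:
        counter.uncharge()      # failure path: cancelling is correct here
        return False
    # Success path: the new page now owns the charge taken above.
    if cancel_on_success:
        counter.uncharge()      # the bug: charge dropped while page stays in use
    return True


def free_page(counter):
    counter.uncharge()          # models the uncharge done when the page is freed


def cycles(n, cancel_on_success):
    """Run n enable/disable cycles and return the final counter value."""
    counter = PageCounter()
    for _ in range(n):
        if replace_page(counter, True, cancel_on_success):
            free_page(counter)
    return counter.count


buggy = cycles(3, cancel_on_success=True)
fixed = cycles(3, cancel_on_success=False)
```

In this model the buggy variant loses one charge per cycle, so the count drifts below zero, while the fixed variant stays balanced.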
10 Aug, 2016
3 commits
-
For perf record -b, which requires the pmu::sched_task callback the
current code is rather expensive:

7.68% sched-pipe [kernel.vmlinux] [k] perf_pmu_sched_task
5.95% sched-pipe [kernel.vmlinux] [k] __switch_to
5.20% sched-pipe [kernel.vmlinux] [k] __intel_pmu_disable_all
3.95% sched-pipe perf [.] worker_thread

The problem is that it will iterate all registered PMUs, most of which
will not have anything to do. Avoid this by keeping an explicit list
of PMUs that have requested the callback.

The perf_sched_cb_{inc,dec}() functions already take the required pmu
argument, and now that these functions are no longer called from NMI
context we can use them to manage a list.

With this patch applied the function doesn't show up in the top 4
anymore (it dropped to 18th place).

6.67% sched-pipe [kernel.vmlinux] [k] __switch_to
6.18% sched-pipe [kernel.vmlinux] [k] __intel_pmu_disable_all
3.92% sched-pipe [kernel.vmlinux] [k] switch_mm_irqs_off
3.71% sched-pipe perf [.] worker_thread

Signed-off-by: Peter Zijlstra (Intel)
Cc: Alexander Shishkin
Cc: Arnaldo Carvalho de Melo
Cc: Jiri Olsa
Cc: Linus Torvalds
Cc: Peter Zijlstra
Cc: Stephane Eranian
Cc: Thomas Gleixner
Cc: Vince Weaver
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar -
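Illustration for the sched_task optimisation above (not part of the patch): a Python sketch of "keep an explicit list of PMUs that have requested the callback". The class SchedCallbackList and its methods are hypothetical and only mirror the role of perf_sched_cb_{inc,dec}() as described in the message: a per-PMU usage count, with list membership changing on the 0 to 1 and 1 to 0 transitions. The PMU names in the usage lines are placeholders.

```python
class SchedCallbackList:
    """Track which PMUs currently want the context-switch callback."""

    def __init__(self):
        self.usage = {}    # pmu name -> number of outstanding requests
        self.active = []   # PMUs with at least one outstanding request

    def inc(self, pmu):
        if self.usage.get(pmu, 0) == 0:
            self.active.append(pmu)       # first user: put PMU on the list
        self.usage[pmu] = self.usage.get(pmu, 0) + 1

    def dec(self, pmu):
        self.usage[pmu] -= 1
        if self.usage[pmu] == 0:
            self.active.remove(pmu)       # last user gone: drop PMU

    def sched_task(self):
        """Return the PMUs visited on a context switch."""
        return list(self.active)


registered = ["cpu", "uncore", "rapl", "software"]  # placeholder names
cbs = SchedCallbackList()
cbs.inc("cpu")
cbs.inc("cpu")
visited = cbs.sched_task()
```

Here a context switch visits one PMU instead of all four registered ones; the saving in a real system depends on how many PMUs are registered and how many need the callback.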
In order to allow optimizing perf_pmu_sched_task() we must ensure
perf_sched_cb_{inc,dec}() are no longer called from NMI context; this
means that pmu::{start,stop}() can no longer use them.

Prepare for this by reworking the whole large PEBS setup code.
The current code relied on the cpuc->pebs_enabled state, however since
that reflects the current active state as per pmu::{start,stop}() we
can no longer rely on this.

Introduce two counters: cpuc->n_pebs and cpuc->n_large_pebs which
count the total number of PEBS events and the number of PEBS events
that have FREERUNNING set, respectively. With this we can tell if the
current setup requires a single record interrupt threshold or can use a
larger buffer.

This also improves the code in that it re-enables the large threshold
once the PEBS event that required single record gets removed.

Signed-off-by: Peter Zijlstra (Intel)
Cc: Alexander Shishkin
Cc: Arnaldo Carvalho de Melo
Cc: Jiri Olsa
Cc: Linus Torvalds
Cc: Peter Zijlstra
Cc: Stephane Eranian
Cc: Thomas Gleixner
Cc: Vince Weaver
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar -
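Illustration for the two PEBS counters above (not part of the patch): a Python sketch of how n_pebs and n_large_pebs could drive the threshold choice. PebsState and its methods are hypothetical, and the rule "use the large buffer only when every PEBS event has FREERUNNING set" is this sketch's reading of the message, not a quote from the code.

```python
class PebsState:
    """Toy model of the per-CPU PEBS bookkeeping."""

    def __init__(self):
        self.n_pebs = 0         # all PEBS events
        self.n_large_pebs = 0   # PEBS events with FREERUNNING set

    def add(self, freerunning):
        self.n_pebs += 1
        if freerunning:
            self.n_large_pebs += 1

    def remove(self, freerunning):
        self.n_pebs -= 1
        if freerunning:
            self.n_large_pebs -= 1

    def threshold(self):
        if self.n_pebs == 0:
            return "none"
        # Assumed rule: one event needing single-record mode forces it for all.
        return "large" if self.n_pebs == self.n_large_pebs else "single"


state = PebsState()
state.add(freerunning=True)
first = state.threshold()
state.add(freerunning=False)
second = state.threshold()
state.remove(freerunning=False)
third = state.threshold()
```

The third step shows the improvement mentioned in the message: once the event that required single-record mode is removed, the large threshold becomes usable again.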
Groups of events are supposed to be scheduled atomically, such that it
is possible to derive meaningful ratios between their values.

We take great pains to achieve this when scheduling event groups to a
PMU in group_sched_in(), calling {start,commit}_txn() (which fall back
to perf_pmu_{disable,enable}() if necessary) to provide this guarantee.
However we don't mirror this in group_sched_out(), and in some cases
events will not be scheduled out atomically.

For example, if we disable an event group with PERF_EVENT_IOC_DISABLE,
we'll cross-call __perf_event_disable() for the group leader, and will
call group_sched_out() without having first disabled the relevant PMU.
We will disable/enable the PMU around each pmu->del() call, but between
each call the PMU will be enabled and events may count.

Avoid this by explicitly disabling and enabling the PMU around event
removal in group_sched_out(), mirroring what we do in group_sched_in().

Signed-off-by: Mark Rutland
Signed-off-by: Peter Zijlstra (Intel)
Cc: Alexander Shishkin
Cc: Arnaldo Carvalho de Melo
Cc: Jiri Olsa
Cc: Linus Torvalds
Cc: Peter Zijlstra
Cc: Stephane Eranian
Cc: Thomas Gleixner
Cc: Vince Weaver
Link: http://lkml.kernel.org/r/1469553141-28314-1-git-send-email-mark.rutland@arm.com
Signed-off-by: Ingo Molnar
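Illustration for the group_sched_out() change above (not part of the patch): a Python sketch of why an outer disable/enable pair closes the window between pmu->del() calls. The Pmu class, its nesting counter and the group_sched_out() helper are hypothetical; the sketch assumes that disable/enable nest and that the PMU counts only when the nesting depth is zero.

```python
class Pmu:
    """Toy PMU with nested disable/enable."""

    def __init__(self):
        self.disable_depth = 0

    def disable(self):
        self.disable_depth += 1

    def enable(self):
        self.disable_depth -= 1

    def counting(self):
        return self.disable_depth == 0

    def delete(self, event):
        # Each removal is itself bracketed, as the message describes.
        self.disable()
        self.enable()


def group_sched_out(pmu, events, bracket):
    """Remove all events; return how often the PMU ran with a partial group."""
    windows = 0
    if bracket:
        pmu.disable()              # outer disable, mirroring group_sched_in()
    remaining = list(events)
    while remaining:
        pmu.delete(remaining.pop(0))
        if remaining and pmu.counting():
            windows += 1           # rest of the group may count here
    if bracket:
        pmu.enable()
    return windows


group = ["leader", "sibling1", "sibling2"]  # placeholder event names
without_bracket = group_sched_out(Pmu(), group, bracket=False)
with_bracket = group_sched_out(Pmu(), group, bracket=True)
```

With three events the unbracketed variant leaves two windows in which part of the group can still count; the bracketed variant leaves none.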