31 Aug, 2011
1 commit
-
We detected a serious issue with PERF_SAMPLE_READ and
timing information when events were being multiplexing.Samples would have time_running > time_enabled. That
was easy to reproduce with a libpfm4 example (ran 3
times to cause multiplexing on Core 2):$ syst_smpl -e uops_retired:freq=1 &
$ syst_smpl -e uops_retired:freq=1 &
$ syst_smpl -e uops_retired:freq=1 &
IIP:0x0000000040062d ... PERIOD:2355332948 ENA=40144625315 RUN=60014875184
syst_smpl: WARNING: time_running > time_enabled
63277537998 uops_retired:freq=1 , scaledThe bug was not present in kernel up to (and including) 3.0. It turns
out the bug was introduced by the following commit:commit c4794295917ebeda8013b6cb9c8d71ab4f74a1fa
events: Move lockless timer calculation into helper function
The parameters of the function got reversed yet the call sites
were not updated to reflect the change. That lead to time_running
and time_enabled being swapped. That had no effect when there was
no multiplexing because in that case time_running = time_enabled
but it would show up in any other scenario.Signed-off-by: Stephane Eranian
Signed-off-by: Peter Zijlstra
Link: http://lkml.kernel.org/r/20110829124112.GA4828@quad
Signed-off-by: Ingo Molnar
29 Aug, 2011
1 commit
-
The current cgroup context switch code was incorrect leading
to bogus counts. Furthermore, as soon as there was an active
cgroup event on a CPU, the context switch cost on that CPU
would increase by a significant amount as demonstrated by a
simple ping/pong example:$ ./pong
Both processes pinned to CPU1, running for 10s
10684.51 ctxsw/sNow start a cgroup perf stat:
$ perf stat -e cycles,cycles -A -a -G test -C 1 -- sleep 100$ ./pong
Both processes pinned to CPU1, running for 10s
6674.61 ctxsw/sThat's a 37% penalty.
Note that pong is not even in the monitored cgroup.
The results shown by perf stat are bogus:
$ perf stat -e cycles,cycles -A -a -G test -C 1 -- sleep 100Performance counter stats for 'sleep 100':
CPU1 cycles test
CPU1 16,984,189,138 cycles # 0.000 GHzThe second 'cycles' event should report a count @ CPU clock
(here 2.4GHz) as it is counting across all cgroups.The patch below fixes the bogus accounting and bypasses any
cgroup switches in case the outgoing and incoming tasks are
in the same cgroup.With this patch the same test now yields:
$ ./pong
Both processes pinned to CPU1, running for 10s
10775.30 ctxsw/sStart perf stat with cgroup:
$ perf stat -e cycles,cycles -A -a -G test -C 1 -- sleep 10
Run pong outside the cgroup:
$ /pong
Both processes pinned to CPU1, running for 10s
10687.80 ctxsw/sThe penalty is now less than 2%.
And the results for perf stat are correct:
$ perf stat -e cycles,cycles -A -a -G test -C 1 -- sleep 10
Performance counter stats for 'sleep 10':
CPU1 cycles test # 0.000 GHz
CPU1 23,933,981,448 cycles # 0.000 GHzNow perf stat reports the correct counts for
for the non cgroup event.If we run pong inside the cgroup, then we also get the
correct counts:$ perf stat -e cycles,cycles -A -a -G test -C 1 -- sleep 10
Performance counter stats for 'sleep 10':
CPU1 22,297,726,205 cycles test # 0.000 GHz
CPU1 23,933,981,448 cycles # 0.000 GHz10.001457237 seconds time elapsed
Signed-off-by: Stephane Eranian
Signed-off-by: Peter Zijlstra
Link: http://lkml.kernel.org/r/20110825135803.GA4697@quad
Signed-off-by: Ingo Molnar
22 Jul, 2011
1 commit
-
PMU type id can be allocated dynamically, so perf_event_attr::type check
when copying attribute from userspace to kernel is not valid.Signed-off-by: Lin Ming
Cc: Robert Richter
Signed-off-by: Peter Zijlstra
Link: http://lkml.kernel.org/r/1309421396-17438-4-git-send-email-ming.m.lin@intel.com
Signed-off-by: Ingo Molnar
01 Jul, 2011
8 commits
-
KVM needs one-shot samples, since a PMC programmed to -X will fire after X
events and then again after 2^40 events (i.e. variable period).Signed-off-by: Avi Kivity
Signed-off-by: Peter Zijlstra
Link: http://lkml.kernel.org/r/1309362157-6596-4-git-send-email-avi@redhat.com
Signed-off-by: Ingo Molnar -
The perf_event overflow handler does not receive any caller-derived
argument, so many callers need to resort to looking up the perf_event
in their local data structure. This is ugly and doesn't scale if a
single callback services many perf_events.Fix by adding a context parameter to perf_event_create_kernel_counter()
(and derived hardware breakpoints APIs) and storing it in the perf_event.
The field can be accessed from the callback as event->overflow_handler_context.
All callers are updated.Signed-off-by: Avi Kivity
Signed-off-by: Peter Zijlstra
Link: http://lkml.kernel.org/r/1309362157-6596-2-git-send-email-avi@redhat.com
Signed-off-by: Ingo Molnar -
Since only samples call perf_output_sample() its much saner (and more
correct) to put the sample logic in there than in the
perf_output_begin()/perf_output_end() pair.Saves a useless argument, reduces conditionals and shrinks
struct perf_output_handle, win!Signed-off-by: Peter Zijlstra
Link: http://lkml.kernel.org/n/tip-2crpvsx3cqu67q3zqjbnlpsc@git.kernel.org
Signed-off-by: Ingo Molnar -
The nmi parameter indicated if we could do wakeups from the current
context, if not, we would set some state and self-IPI and let the
resulting interrupt do the wakeup.For the various event classes:
- hardware: nmi=0; PMI is in fact an NMI or we run irq_work_run from
the PMI-tail (ARM etc.)
- tracepoint: nmi=0; since tracepoint could be from NMI context.
- software: nmi=[0,1]; some, like the schedule thing cannot
perform wakeups, and hence need 0.As one can see, there is very little nmi=1 usage, and the down-side of
not using it is that on some platforms some software events can have a
jiffy delay in wakeup (when arch_irq_work_raise isn't implemented).The up-side however is that we can remove the nmi parameter and save a
bunch of conditionals in fast paths.Signed-off-by: Peter Zijlstra
Cc: Michael Cree
Cc: Will Deacon
Cc: Deng-Cheng Zhu
Cc: Anton Blanchard
Cc: Eric B Munson
Cc: Heiko Carstens
Cc: Paul Mundt
Cc: David S. Miller
Cc: Frederic Weisbecker
Cc: Jason Wessel
Cc: Don Zickus
Link: http://lkml.kernel.org/n/tip-agjev8eu666tvknpb3iaj0fg@git.kernel.org
Signed-off-by: Ingo Molnar -
The event tracing infrastructure exposes two timers which should be updated
each time the value of the counter is updated. Currently, these counters are
only updated when userspace calls read() on the fd associated with an event.
This means that counters which are read via the mmap'd page exclusively never
have their timers updated. This patch adds ensures that the timers are updated
each time the values in the mmap'd page are updated.Signed-off-by: Eric B Munson
Signed-off-by: Peter Zijlstra
Link: http://lkml.kernel.org/r/1308932786-5111-1-git-send-email-emunson@mgebm.net
Signed-off-by: Ingo Molnar -
Take the timer calculation from perf_output_read and move it to a helper
function for any place that needs timer values but cannot take the ctx->lock.Signed-off-by: Eric B Munson
Signed-off-by: Peter Zijlstra
Link: http://lkml.kernel.org/r/1308861279-15216-2-git-send-email-emunson@mgebm.net
Signed-off-by: Ingo Molnar -
Signed-off-by: Eric B Munson
Signed-off-by: Peter Zijlstra
Link: http://lkml.kernel.org/r/1308861279-15216-1-git-send-email-emunson@mgebm.net
Signed-off-by: Ingo Molnar -
Since 2.6.36 (specifically commit d57e34fdd60b ("perf: Simplify the
ring-buffer logic: make perf_buffer_alloc() do everything needed"),
the perf_buffer_init_code() has been mis-setting the buffer watermark
if perf_event_attr.wakeup_events has a non-zero value.This is because perf_event_attr.wakeup_events is a union with
perf_event_attr.wakeup_watermark.This commit re-enables the check for perf_event_attr.watermark being
set before continuing with setting a non-default watermark.This bug is most noticable when you are trying to use PERF_IOC_REFRESH
with a value larger than one and perf_event_attr.wakeup_events is set to
one. In this case the buffer watermark will be set to 1 and you will
get extraneous POLL_IN overflows rather than POLL_HUP as expected.[ avoid using attr.wakeup_events when attr.watermark is set ]
Signed-off-by: Vince Weaver
Signed-off-by: Peter Zijlstra
Cc:
Link: http://lkml.kernel.org/r/alpine.DEB.2.00.1106011506390.5384@cl320.eecs.utk.edu
Signed-off-by: Ingo Molnar
09 Jun, 2011
1 commit
-
And create the internal perf events header.
v2: Keep an internal inlined perf_output_copy()
Signed-off-by: Frederic Weisbecker
Acked-by: Peter Zijlstra
Cc: Borislav Petkov
Cc: Stephane Eranian
Cc: Arnaldo Carvalho de Melo
Cc: Steven Rostedt
Link: http://lkml.kernel.org/r/1305827704-5607-1-git-send-email-fweisbec@gmail.com
[ v3: use clearer 'ring_buffer' and 'rb' naming ]
Signed-off-by: Ingo Molnar
07 Jun, 2011
1 commit
-
A lost Quilt refresh of 2c29ef0fef8 (perf: Simplify and fix
__perf_install_in_context()) is causing grief and lockups,
reported by Jiri Olsa.When installing an event in a task context, there's a number of
issues:- there might not be an existing task context, in which case
we should install the now current context;- there might already be a context, not the current one, in
which case we should de-schedule the old and install the new;these cases were dealt with in the lost refresh, however there is one
further case that was found in testing:- there might already be a context, the current one, in which
case we should still de-schedule, and should take care
to re-install it (note that task_ctx_sched_out() clears
cpuctx->task_ctx).Reported-by: Jiri Olsa
Signed-off-by: Peter Zijlstra
Link: http://lkml.kernel.org/r/1307399008.2497.971.camel@laptop
Signed-off-by: Ingo Molnar
04 Jun, 2011
2 commits
-
Conflicts:
tools/perf/util/python.cMerge reason: resolve the conflict with perf/urgent.
Signed-off-by: Ingo Molnar
-
…/linux into perf/urgent
31 May, 2011
1 commit
-
Ben changed the cgroup API in commit f780bdb7c1c (cgroups: add
per-thread subsystem callbacks) in an incompatible way, but
forgot to convert the perf cgroup bits.Avoid compile warnings and runtime splats and convert perf too ;-)
Acked-by: Ben Blum
Cc: Stephane Eranian
Signed-off-by: Peter Zijlstra
Link: http://lkml.kernel.org/r/1306767651.1200.2990.camel@twins
Signed-off-by: Ingo Molnar
29 May, 2011
9 commits
-
Since perf_install_in_context() will now install a context when we
add the first event, we can de-schedule the context when the last
event is removed.Signed-off-by: Peter Zijlstra
Link: http://lkml.kernel.org/r/20110409192142.090431763@chello.nl
Signed-off-by: Ingo Molnar -
In order to always call list_del_event() on the correct cpu if the
event is part of an active context and avoid having to do two IPIs,
change the close() semantics slightly.The current perf_event_disable() call would disable a whole group if
the event that's being closed is the group leader, whereas the new
code keeps the group siblings enabled.People should not rely on this behaviour and I don't think they do,
but in case we find they do, the fix is easy and we have to take the
double IPI cost.Signed-off-by: Peter Zijlstra
Cc: Vince Weaver
Link: http://lkml.kernel.org/r/20110409192142.038377551@chello.nl
Signed-off-by: Ingo Molnar -
This was scattered out - refactor it into a single function.
No change in functionality.Signed-off-by: Peter Zijlstra
Link: http://lkml.kernel.org/r/20110409192141.979862055@chello.nl
Signed-off-by: Ingo Molnar -
Instead of tracking if a context is active or not, track which events
of the context are active. By making it a bitmask of
EVENT_PINNED|EVENT_FLEXIBLE we can simplify some of the scheduling
routines since it can avoid adding events that are already active.Signed-off-by: Peter Zijlstra
Link: http://lkml.kernel.org/r/20110409192141.930282378@chello.nl
Signed-off-by: Ingo Molnar -
Currently __perf_install_in_context() will try and schedule in the
event irrespective of our event scheduling rules, that is, we try to
schedule CPU-pinned, TASK-pinned, CPU-flexible, TASK-flexible, but
when creating a new event we simply try and schedule it on top of
whatever is already on the PMU, this can lead to errors for pinned
events.Therefore, simplify things and simply schedule everything out, add the
event to the corresponding context and schedule everything back in.This also nicely handles the case where with
__ARCH_WANT_INTERRUPTS_ON_CTXSW the IPI can come right in the middle
of schedule, before we managed to call perf_event_task_sched_in().Signed-off-by: Peter Zijlstra
Link: http://lkml.kernel.org/r/20110409192141.870894224@chello.nl
Signed-off-by: Ingo Molnar -
Make task_ctx_sched_*() imply EVENT_ALL, since anything less will not
actually have scheduled the task in/out at all.Since there's no site that schedules all of a task in (due to the
interleave with flexible cpuctx) we can remove this function.Signed-off-by: Peter Zijlstra
Link: http://lkml.kernel.org/r/20110409192141.817893268@chello.nl
Signed-off-by: Ingo Molnar -
Currently we only hold one ctx->lock at a time, which results in us
flipping back and forth between cpuctx->ctx.lock and task_ctx->lock.Avoid this and gain large atomic regions by holding both locks. We
nest the task lock inside the cpu lock, since with task scheduling we
might have to change task ctx while holding the cpu ctx lock.Signed-off-by: Peter Zijlstra
Link: http://lkml.kernel.org/r/20110409192141.769881865@chello.nl
Signed-off-by: Ingo Molnar -
Small cleanup to how we refcount in find_get_context(), this also
allows us to use put_ctx() to free things instead of using kfree().Signed-off-by: Peter Zijlstra
Link: http://lkml.kernel.org/r/20110409192141.719340481@chello.nl
Signed-off-by: Ingo Molnar -
Oleg noted that ctx_sched_out() disables the PMU even though it might
not actually do something, avoid needless PMU-disabling.Reported-by: Oleg Nesterov
Signed-off-by: Peter Zijlstra
Link: http://lkml.kernel.org/r/20110409192141.665385503@chello.nl
Signed-off-by: Ingo Molnar
28 May, 2011
1 commit
-
Vince noticed that unless we mmap() a buffer, SIGIO gets lost. So
explicitly push the wakeup (including signals) when requested.Reported-by: Vince Weaver
Signed-off-by: Peter Zijlstra
Cc:
Link: http://lkml.kernel.org/n/tip-2euus3f3x3dyvdk52cjxw8zu@git.kernel.org
Signed-off-by: Ingo Molnar
20 May, 2011
1 commit
-
* 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (78 commits)
Revert "rcu: Decrease memory-barrier usage based on semi-formal proof"
net,rcu: convert call_rcu(prl_entry_destroy_rcu) to kfree
batman,rcu: convert call_rcu(softif_neigh_free_rcu) to kfree_rcu
batman,rcu: convert call_rcu(neigh_node_free_rcu) to kfree()
batman,rcu: convert call_rcu(gw_node_free_rcu) to kfree_rcu
net,rcu: convert call_rcu(kfree_tid_tx) to kfree_rcu()
net,rcu: convert call_rcu(xt_osf_finger_free_rcu) to kfree_rcu()
net/mac80211,rcu: convert call_rcu(work_free_rcu) to kfree_rcu()
net,rcu: convert call_rcu(wq_free_rcu) to kfree_rcu()
net,rcu: convert call_rcu(phonet_device_rcu_free) to kfree_rcu()
perf,rcu: convert call_rcu(swevent_hlist_release_rcu) to kfree_rcu()
perf,rcu: convert call_rcu(free_ctx) to kfree_rcu()
net,rcu: convert call_rcu(__nf_ct_ext_free_rcu) to kfree_rcu()
net,rcu: convert call_rcu(net_generic_release) to kfree_rcu()
net,rcu: convert call_rcu(netlbl_unlhsh_free_addr6) to kfree_rcu()
net,rcu: convert call_rcu(netlbl_unlhsh_free_addr4) to kfree_rcu()
security,rcu: convert call_rcu(sel_netif_free) to kfree_rcu()
net,rcu: convert call_rcu(xps_dev_maps_release) to kfree_rcu()
net,rcu: convert call_rcu(xps_map_release) to kfree_rcu()
net,rcu: convert call_rcu(rps_map_release) to kfree_rcu()
...
04 May, 2011
1 commit
-
Fix a few inconsistent style bits that were added over the past few
months.Cc: Peter Zijlstra
Link: http://lkml.kernel.org/n/tip-yv4hwf9yhnzoada8pcpb3a97@git.kernel.org
Signed-off-by: Ingo Molnar
03 May, 2011
2 commits
-
As part of the events sybsystem unification, relocate hw_breakpoint.c
into its new destination.Cc: Frederic Weisbecker
Signed-off-by: Borislav Petkov -
mv kernel/perf_event.c -> kernel/events/core.c. From there, all further
sensible splitting can happen. The idea is that due to perf_event.c
becoming pretty sizable and with the advent of the marriage with ftrace,
splitting functionality into its logical parts should help speeding up
the unification and to manage the complexity of the subsystem.Signed-off-by: Borislav Petkov