31 Aug, 2011

1 commit

  • We detected a serious issue with PERF_SAMPLE_READ and
    timing information when events were being multiplexed.

    Samples would have time_running > time_enabled. That
    was easy to reproduce with a libpfm4 example (run 3
    times to cause multiplexing on Core 2):

    $ syst_smpl -e uops_retired:freq=1 &
    $ syst_smpl -e uops_retired:freq=1 &
    $ syst_smpl -e uops_retired:freq=1 &
    IIP:0x0000000040062d ... PERIOD:2355332948 ENA=40144625315 RUN=60014875184
    syst_smpl: WARNING: time_running > time_enabled
    63277537998 uops_retired:freq=1 , scaled

    The bug was not present in kernels up to (and including) 3.0. It
    turns out the bug was introduced by the following commit:

    commit c4794295917ebeda8013b6cb9c8d71ab4f74a1fa

    events: Move lockless timer calculation into helper function

    The parameters of the function were reversed, yet the call sites
    were not updated to reflect the change. That led to time_running
    and time_enabled being swapped. This had no effect when there was
    no multiplexing, because in that case time_running = time_enabled,
    but it would show up in any other scenario.
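
    For reference, user space derives a scaled count from these two
    times; a minimal sketch in C (types from <linux/types.h>, fd opened
    with attr.read_format = PERF_FORMAT_TOTAL_TIME_ENABLED |
    PERF_FORMAT_TOTAL_TIME_RUNNING):

    struct read_format {
            __u64 value;         /* raw count */
            __u64 time_enabled;  /* ns the event was enabled */
            __u64 time_running;  /* ns the event was on the PMU */
    } rf;

    read(fd, &rf, sizeof(rf));
    /* compensate for multiplexing; only sane if running <= enabled */
    __u64 scaled = rf.time_running ?
            (__u64)((double)rf.value * rf.time_enabled / rf.time_running) : 0;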

    Signed-off-by: Stephane Eranian
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/20110829124112.GA4828@quad
    Signed-off-by: Ingo Molnar

    Eric B Munson
     

29 Aug, 2011

1 commit

  • The current cgroup context switch code was incorrect, leading
    to bogus counts. Furthermore, as soon as there was an active
    cgroup event on a CPU, the context switch cost on that CPU
    would increase by a significant amount as demonstrated by a
    simple ping/pong example:

    $ ./pong
    Both processes pinned to CPU1, running for 10s
    10684.51 ctxsw/s

    Now start a cgroup perf stat:
    $ perf stat -e cycles,cycles -A -a -G test -C 1 -- sleep 100

    $ ./pong
    Both processes pinned to CPU1, running for 10s
    6674.61 ctxsw/s

    That's a 37% penalty.

    Note that pong is not even in the monitored cgroup.

    The results shown by perf stat are bogus:
    $ perf stat -e cycles,cycles -A -a -G test -C 1 -- sleep 100

    Performance counter stats for 'sleep 100':

    CPU1 <not counted> cycles test
    CPU1 16,984,189,138 cycles # 0.000 GHz

    The second 'cycles' event should report a count @ CPU clock
    (here 2.4GHz) as it is counting across all cgroups.

    The patch below fixes the bogus accounting and bypasses any
    cgroup switches in case the outgoing and incoming tasks are
    in the same cgroup.
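
    The bypass is essentially a pointer comparison at context-switch
    time; a sketch of the idea (close to, but not literally, the patch):

    static void perf_cgroup_sched_out(struct task_struct *task,
                                      struct task_struct *next)
    {
            struct perf_cgroup *cgrp1 = perf_cgroup_from_task(task);
            struct perf_cgroup *cgrp2 = perf_cgroup_from_task(next);

            /*
             * Only touch the cgroup switch machinery when the outgoing
             * and incoming tasks are in different cgroups.
             */
            if (cgrp1 != cgrp2)
                    perf_cgroup_switch(task, PERF_CGROUP_SWOUT);
    }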

    With this patch the same test now yields:
    $ ./pong
    Both processes pinned to CPU1, running for 10s
    10775.30 ctxsw/s

    Start perf stat with cgroup:

    $ perf stat -e cycles,cycles -A -a -G test -C 1 -- sleep 10

    Run pong outside the cgroup:
    $ ./pong
    Both processes pinned to CPU1, running for 10s
    10687.80 ctxsw/s

    The penalty is now less than 2%.

    And the results for perf stat are correct:

    $ perf stat -e cycles,cycles -A -a -G test -C 1 -- sleep 10

    Performance counter stats for 'sleep 10':

    CPU1 <not counted> cycles test # 0.000 GHz
    CPU1 23,933,981,448 cycles # 0.000 GHz

    Now perf stat reports the correct counts
    for the non-cgroup event.

    If we run pong inside the cgroup, then we also get the
    correct counts:

    $ perf stat -e cycles,cycles -A -a -G test -C 1 -- sleep 10

    Performance counter stats for 'sleep 10':

    CPU1 22,297,726,205 cycles test # 0.000 GHz
    CPU1 23,933,981,448 cycles # 0.000 GHz

    10.001457237 seconds time elapsed

    Signed-off-by: Stephane Eranian
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/20110825135803.GA4697@quad
    Signed-off-by: Ingo Molnar

    Stephane Eranian
     

22 Jul, 2011

1 commit

  • A PMU type id can be allocated dynamically, so the perf_event_attr::type
    range check performed when copying the attribute from userspace to the
    kernel is no longer valid.
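
    The dropped test was the static range check in the attribute-copy
    path, along these lines (a sketch; dynamically allocated PMU ids
    are numbered above PERF_TYPE_MAX, which is what made the check
    wrong):

    if (attr->type >= PERF_TYPE_MAX)
            return -EINVAL;   /* rejects valid dynamic PMU types */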

    Signed-off-by: Lin Ming
    Cc: Robert Richter
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/1309421396-17438-4-git-send-email-ming.m.lin@intel.com
    Signed-off-by: Ingo Molnar

    Lin Ming
     

01 Jul, 2011

8 commits

  • KVM needs one-shot samples, since a PMC programmed to -X will fire after X
    events and then again after 2^40 events (i.e. variable period).
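
    One-shot behaviour builds on the existing event_limit/refresh
    machinery: each overflow decrements the limit, and at zero the
    event disables itself until re-armed. A sketch of the in-kernel
    use this enables (event assumed to come from
    perf_event_create_kernel_counter()):

    /* arm the counter for exactly one more overflow;
     * after that fires, the event switches itself off */
    perf_event_refresh(event, 1);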

    Signed-off-by: Avi Kivity
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/1309362157-6596-4-git-send-email-avi@redhat.com
    Signed-off-by: Ingo Molnar

    Avi Kivity
     
  • The perf_event overflow handler does not receive any caller-derived
    argument, so many callers need to resort to looking up the perf_event
    in their local data structure. This is ugly and doesn't scale if a
    single callback services many perf_events.

    Fix by adding a context parameter to perf_event_create_kernel_counter()
    (and derived hardware breakpoints APIs) and storing it in the perf_event.
    The field can be accessed from the callback as event->overflow_handler_context.
    All callers are updated.

    Signed-off-by: Avi Kivity
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/1309362157-6596-2-git-send-email-avi@redhat.com
    Signed-off-by: Ingo Molnar

    Avi Kivity
     
  • Since only samples call perf_output_sample(), it's much saner (and more
    correct) to put the sample logic in there than in the
    perf_output_begin()/perf_output_end() pair.

    Saves a useless argument, reduces conditionals and shrinks
    struct perf_output_handle, win!

    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/n/tip-2crpvsx3cqu67q3zqjbnlpsc@git.kernel.org
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • The nmi parameter indicated if we could do wakeups from the current
    context, if not, we would set some state and self-IPI and let the
    resulting interrupt do the wakeup.

    For the various event classes:

    - hardware: nmi=0; the PMI is in fact an NMI, or we run irq_work_run
      from the PMI-tail (ARM etc.)
    - tracepoint: nmi=0; since the tracepoint could be from NMI context.
    - software: nmi=[0,1]; some, like the schedule thing, cannot
      perform wakeups and hence need 0.

    As one can see, there is very little nmi=1 usage, and the down-side of
    not using it is that on some platforms some software events can have a
    jiffy delay in wakeup (when arch_irq_work_raise isn't implemented).

    The up-side however is that we can remove the nmi parameter and save a
    bunch of conditionals in fast paths.
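
    The visible API change is in the overflow path; roughly:

    /* before */
    int perf_event_overflow(struct perf_event *event, int nmi,
                            struct perf_sample_data *data,
                            struct pt_regs *regs);

    /* after: wakeups always go through irq_work, so nmi is gone */
    int perf_event_overflow(struct perf_event *event,
                            struct perf_sample_data *data,
                            struct pt_regs *regs);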

    Signed-off-by: Peter Zijlstra
    Cc: Michael Cree
    Cc: Will Deacon
    Cc: Deng-Cheng Zhu
    Cc: Anton Blanchard
    Cc: Eric B Munson
    Cc: Heiko Carstens
    Cc: Paul Mundt
    Cc: David S. Miller
    Cc: Frederic Weisbecker
    Cc: Jason Wessel
    Cc: Don Zickus
    Link: http://lkml.kernel.org/n/tip-agjev8eu666tvknpb3iaj0fg@git.kernel.org
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • The event tracing infrastructure exposes two timers which should be updated
    each time the value of the counter is updated. Currently, these timers are
    only updated when userspace calls read() on the fd associated with an event.
    This means that counters which are read exclusively via the mmap'd page
    never have their timers updated. This patch ensures that the timers are
    updated each time the values in the mmap'd page are updated.
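
    The timers in question are the two fields exposed in the mmap'd
    control page (struct perf_event_mmap_page):

    __u64 time_enabled;  /* time event active */
    __u64 time_running;  /* time event on cpu */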

    Signed-off-by: Eric B Munson
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/1308932786-5111-1-git-send-email-emunson@mgebm.net
    Signed-off-by: Ingo Molnar

    Eric B Munson
     
  • Take the timer calculation from perf_output_read and move it into a helper
    function, for use anywhere timer values are needed but the ctx->lock cannot
    be taken.

    Signed-off-by: Eric B Munson
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/1308861279-15216-2-git-send-email-emunson@mgebm.net
    Signed-off-by: Ingo Molnar

    Eric B Munson
     
  • Signed-off-by: Eric B Munson
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/1308861279-15216-1-git-send-email-emunson@mgebm.net
    Signed-off-by: Ingo Molnar

    Eric B Munson
     
  • Since 2.6.36 (specifically commit d57e34fdd60b ("perf: Simplify the
    ring-buffer logic: make perf_buffer_alloc() do everything needed")),
    the perf_buffer_init() code has been mis-setting the buffer watermark
    if perf_event_attr.wakeup_events has a non-zero value.

    This is because perf_event_attr.wakeup_events is a union with
    perf_event_attr.wakeup_watermark.
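
    The union in question, from struct perf_event_attr:

    union {
            __u32 wakeup_events;    /* wakeup every n events */
            __u32 wakeup_watermark; /* bytes before wakeup   */
    };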

    This commit re-enables the check for perf_event_attr.watermark being
    set before continuing with setting a non-default watermark.

    This bug is most noticeable when you are trying to use
    PERF_EVENT_IOC_REFRESH with a value larger than one and
    perf_event_attr.wakeup_events is set to one. In this case the buffer
    watermark will be set to 1 and you will get extraneous POLL_IN
    overflows rather than POLL_HUP as expected.

    [ avoid using attr.wakeup_events when attr.watermark is set ]

    Signed-off-by: Vince Weaver
    Signed-off-by: Peter Zijlstra
    Cc:
    Link: http://lkml.kernel.org/r/alpine.DEB.2.00.1106011506390.5384@cl320.eecs.utk.edu
    Signed-off-by: Ingo Molnar

    Vince Weaver
     

09 Jun, 2011

1 commit

  • And create the internal perf events header.

    v2: Keep an internal inlined perf_output_copy()

    Signed-off-by: Frederic Weisbecker
    Acked-by: Peter Zijlstra
    Cc: Borislav Petkov
    Cc: Stephane Eranian
    Cc: Arnaldo Carvalho de Melo
    Cc: Steven Rostedt
    Link: http://lkml.kernel.org/r/1305827704-5607-1-git-send-email-fweisbec@gmail.com
    [ v3: use clearer 'ring_buffer' and 'rb' naming ]
    Signed-off-by: Ingo Molnar

    Frederic Weisbecker
     

07 Jun, 2011

1 commit

  • A lost Quilt refresh of 2c29ef0fef8 (perf: Simplify and fix
    __perf_install_in_context()) is causing grief and lockups,
    reported by Jiri Olsa.

    When installing an event in a task context, there are a number of
    issues:

    - there might not be an existing task context, in which case
      we should install the now current context;

    - there might already be a context, not the current one, in
      which case we should de-schedule the old and install the new;

    these cases were dealt with in the lost refresh; however, one
    further case was found in testing:

    - there might already be a context, the current one, in which
      case we should still de-schedule, and should take care
      to re-install it (note that task_ctx_sched_out() clears
      cpuctx->task_ctx).
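
    Condensed, handling that last case means __perf_install_in_context()
    must do something like this before re-scheduling (a sketch of the
    control flow, not the literal patch):

    struct perf_event_context *task_ctx = cpuctx->task_ctx;

    /* schedule out any active task context, the current one included;
     * task_ctx_sched_out() clears cpuctx->task_ctx, so the context must
     * be re-installed when everything is scheduled back in */
    if (task_ctx)
            task_ctx_sched_out(task_ctx);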

    Reported-by: Jiri Olsa
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/1307399008.2497.971.camel@laptop
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     

31 May, 2011

1 commit

  • Ben changed the cgroup API in commit f780bdb7c1c (cgroups: add
    per-thread subsystem callbacks) in an incompatible way, but
    forgot to convert the perf cgroup bits.

    Avoid compile warnings and runtime splats and convert perf too ;-)

    Acked-by: Ben Blum
    Cc: Stephane Eranian
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/1306767651.1200.2990.camel@twins
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     

29 May, 2011

9 commits

  • Since perf_install_in_context() will now install a context when we
    add the first event, we can de-schedule the context when the last
    event is removed.

    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/20110409192142.090431763@chello.nl
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • In order to always call list_del_event() on the correct cpu if the
    event is part of an active context and avoid having to do two IPIs,
    change the close() semantics slightly.

    The current perf_event_disable() call would disable a whole group if
    the event that's being closed is the group leader, whereas the new
    code keeps the group siblings enabled.

    People should not rely on this behaviour and I don't think they do,
    but in case we find they do, the fix is easy and we have to take the
    double IPI cost.

    Signed-off-by: Peter Zijlstra
    Cc: Vince Weaver
    Link: http://lkml.kernel.org/r/20110409192142.038377551@chello.nl
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • This was scattered out - refactor it into a single function.
    No change in functionality.

    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/20110409192141.979862055@chello.nl
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • Instead of tracking if a context is active or not, track which events
    of the context are active. By making it a bitmask of
    EVENT_PINNED|EVENT_FLEXIBLE we can simplify some of the scheduling
    routines since it can avoid adding events that are already active.
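
    The mask uses the existing event_type_t bits:

    enum event_type_t {
            EVENT_FLEXIBLE = 0x1,
            EVENT_PINNED   = 0x2,
            EVENT_ALL      = EVENT_FLEXIBLE | EVENT_PINNED,
    };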

    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/20110409192141.930282378@chello.nl
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • Currently __perf_install_in_context() will try and schedule in the
    event irrespective of our event scheduling rules; that is, we
    normally schedule CPU-pinned, TASK-pinned, CPU-flexible, then
    TASK-flexible events, but when creating a new event we simply try
    to schedule it on top of whatever is already on the PMU. This can
    lead to errors for pinned events.

    Therefore, simplify things and simply schedule everything out, add the
    event to the corresponding context and schedule everything back in.

    This also nicely handles the case where with
    __ARCH_WANT_INTERRUPTS_ON_CTXSW the IPI can come right in the middle
    of schedule, before we managed to call perf_event_task_sched_in().
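
    In code, the simplification boils down to a sequence like the
    following (a sketch; helper names as used elsewhere in this series):

    perf_ctx_lock(cpuctx, task_ctx);
    ctx_sched_out(ctx, cpuctx, EVENT_ALL);          /* everything out  */
    add_event_to_ctx(event, ctx);                   /* add the new one */
    perf_event_sched_in(cpuctx, task_ctx, current); /* everything back */
    perf_ctx_unlock(cpuctx, task_ctx);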

    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/20110409192141.870894224@chello.nl
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • Make task_ctx_sched_*() imply EVENT_ALL, since anything less will not
    actually have scheduled the task in/out at all.

    Since there's no site that schedules all of a task in (due to the
    interleave with flexible cpuctx) we can remove this function.

    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/20110409192141.817893268@chello.nl
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • Currently we only hold one ctx->lock at a time, which results in us
    flipping back and forth between cpuctx->ctx.lock and task_ctx->lock.

    Avoid this and gain large atomic regions by holding both locks. We
    nest the task lock inside the cpu lock, since with task scheduling we
    might have to change task ctx while holding the cpu ctx lock.
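
    The nesting is captured by a small helper pair; the lock side looks
    like this (matching the shape of the helper the patch introduces):

    static void perf_ctx_lock(struct perf_cpu_context *cpuctx,
                              struct perf_event_context *ctx)
    {
            raw_spin_lock(&cpuctx->ctx.lock);
            if (ctx)
                    raw_spin_lock(&ctx->lock); /* task ctx nests inside */
    }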

    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/20110409192141.769881865@chello.nl
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • Small cleanup to how we refcount in find_get_context(), this also
    allows us to use put_ctx() to free things instead of using kfree().

    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/20110409192141.719340481@chello.nl
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • Oleg noted that ctx_sched_out() disables the PMU even though it might
    not actually do anything; avoid the needless PMU-disabling.

    Reported-by: Oleg Nesterov
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/20110409192141.665385503@chello.nl
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     

28 May, 2011

1 commit

  • Vince noticed that unless we mmap() a buffer, SIGIO gets lost. So
    explicitly push the wakeup (including signals) when requested.
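
    For reference, the userspace side that requests SIGIO on a perf fd
    is the standard async-I/O fcntl setup (a sketch):

    fcntl(fd, F_SETFL, fcntl(fd, F_GETFL) | O_ASYNC);
    fcntl(fd, F_SETOWN, getpid());  /* deliver SIGIO to this process */
    /* before this fix the signal could be lost unless the ring buffer
     * was also mmap()ed; now the wakeup is pushed regardless */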

    Reported-by: Vince Weaver
    Signed-off-by: Peter Zijlstra
    Cc:
    Link: http://lkml.kernel.org/n/tip-2euus3f3x3dyvdk52cjxw8zu@git.kernel.org
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     

20 May, 2011

1 commit

  • * 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (78 commits)
    Revert "rcu: Decrease memory-barrier usage based on semi-formal proof"
    net,rcu: convert call_rcu(prl_entry_destroy_rcu) to kfree
    batman,rcu: convert call_rcu(softif_neigh_free_rcu) to kfree_rcu
    batman,rcu: convert call_rcu(neigh_node_free_rcu) to kfree()
    batman,rcu: convert call_rcu(gw_node_free_rcu) to kfree_rcu
    net,rcu: convert call_rcu(kfree_tid_tx) to kfree_rcu()
    net,rcu: convert call_rcu(xt_osf_finger_free_rcu) to kfree_rcu()
    net/mac80211,rcu: convert call_rcu(work_free_rcu) to kfree_rcu()
    net,rcu: convert call_rcu(wq_free_rcu) to kfree_rcu()
    net,rcu: convert call_rcu(phonet_device_rcu_free) to kfree_rcu()
    perf,rcu: convert call_rcu(swevent_hlist_release_rcu) to kfree_rcu()
    perf,rcu: convert call_rcu(free_ctx) to kfree_rcu()
    net,rcu: convert call_rcu(__nf_ct_ext_free_rcu) to kfree_rcu()
    net,rcu: convert call_rcu(net_generic_release) to kfree_rcu()
    net,rcu: convert call_rcu(netlbl_unlhsh_free_addr6) to kfree_rcu()
    net,rcu: convert call_rcu(netlbl_unlhsh_free_addr4) to kfree_rcu()
    security,rcu: convert call_rcu(sel_netif_free) to kfree_rcu()
    net,rcu: convert call_rcu(xps_dev_maps_release) to kfree_rcu()
    net,rcu: convert call_rcu(xps_map_release) to kfree_rcu()
    net,rcu: convert call_rcu(rps_map_release) to kfree_rcu()
    ...

    Linus Torvalds
     

03 May, 2011

2 commits

  • As part of the events subsystem unification, relocate hw_breakpoint.c
    into its new destination.

    Cc: Frederic Weisbecker
    Signed-off-by: Borislav Petkov

    Borislav Petkov
     
  • mv kernel/perf_event.c -> kernel/events/core.c. From there, all further
    sensible splitting can happen. The idea is that, with perf_event.c
    becoming pretty sizable and with the advent of the marriage with ftrace,
    splitting the functionality into its logical parts should help speed up
    the unification and manage the complexity of the subsystem.

    Signed-off-by: Borislav Petkov

    Borislav Petkov