26 Nov, 2010

2 commits

  • Stephane noticed that because the perf_sw_event() call sits inside
    the perf_event_task_sched_out() call, it won't get called unless we
    have a per-task counter.
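
    A minimal sketch of the shape of the fix, using the 2010-era
    perf_sw_event() signature; the COND_STMT() jump-label wrapper from
    this same series guards the per-task path:

    static inline void perf_event_task_sched_out(struct task_struct *task,
                                                 struct task_struct *next)
    {
            /* Count every switch, before the per-task fast-path check. */
            perf_sw_event(PERF_COUNT_SW_CONTEXT_SWITCHES, 1, 1, NULL, 0);

            COND_STMT(&perf_task_events,
                      __perf_event_task_sched_out(task, next));
    }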

    Reported-by: Stephane Eranian
    Signed-off-by: Peter Zijlstra
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • It was found that children of tasks with inherited events sometimes had
    one extra event. Eventually it turned out to be due to the list rotation
    not being exclusive with the list iteration in the inheritance code.

    Cure this by temporarily disabling the rotation while we inherit the events.
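
    A sketch of the mechanism, assuming a plain disable flag on the
    context (the rotate_disable field and the check in the rotation path
    follow the merged fix; surrounding code is elided):

    /* inherit side, in perf_event_init_context() */
    raw_spin_lock_irqsave(&parent_ctx->lock, flags);
    parent_ctx->rotate_disable = 1;
    raw_spin_unlock_irqrestore(&parent_ctx->lock, flags);

    /* ... iterate the parent's event list and inherit each event ... */

    raw_spin_lock_irqsave(&parent_ctx->lock, flags);
    parent_ctx->rotate_disable = 0;
    raw_spin_unlock_irqrestore(&parent_ctx->lock, flags);

    /* rotation side, in rotate_ctx(): skip rotation while inheriting */
    if (!ctx->rotate_disable)
            list_rotate_left(&ctx->flexible_groups);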

    Signed-off-by: Thomas Gleixner
    Signed-off-by: Peter Zijlstra
    LKML-Reference:
    Cc:
    Signed-off-by: Ingo Molnar

    Thomas Gleixner
     

18 Nov, 2010

2 commits

  • Oleg noticed that a perf-fd keeping a reference on the creating task
    leads to a few funny side effects.

    There are two different aspects to this:

    - kernel-based perf-events: these should not take a reference
      on the creating task nor appear on the task's event list,
      since they're not bound to fds and not visible to userspace.

    - fork() and pthread_create(): these can lead to the creating
      task dying (and thus the task's event list becoming useless)
      while the list and ref are kept alive until the event is closed.

    Combined they lead to malfunction of the ptrace hw_tracepoints.

    Cure this by not considering kernel based perf_events for the
    owner-list and destroying the owner-list when the owner dies.
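
    A sketch of the resulting ownership rules (field names follow the
    upstream perf code; treat the details as illustrative):

    /* Only syscall-created events join the owner list: */
    event->owner = current;
    mutex_lock(&current->perf_event_mutex);
    list_add_tail(&event->owner_entry, &current->perf_event_list);
    mutex_unlock(&current->perf_event_mutex);

    /* On exit, the owner tears the list down so no stale ref remains: */
    mutex_lock(&task->perf_event_mutex);
    list_for_each_entry_safe(event, tmp, &task->perf_event_list, owner_entry)
            list_del_init(&event->owner_entry);
    mutex_unlock(&task->perf_event_mutex);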

    Reported-by: Oleg Nesterov
    Signed-off-by: Peter Zijlstra
    Acked-by: Oleg Nesterov
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • …eric/random-tracing into perf/urgent

    Ingo Molnar
     

12 Nov, 2010

1 commit

  • When using early debugging, the kernel does not initialize the
    hw_breakpoint API early enough, which causes the late initialization
    of the kernel debugger to fail. The boot arguments are:

    earlyprintk=vga ekgdboc=kbd kgdbwait

    Then simply type "go" at the kdb prompt and boot. The kernel will
    later emit the message:

    kgdb: Could not allocate hwbreakpoints

    And at that point the kernel debugger will cease to work correctly.

    The solution is to initialize the hw_breakpoint API at the same time
    that all the other perf callbacks are initialized, instead of using a
    core_initcall() initialization, which happens well after the kernel
    debugger can make use of hardware breakpoints.
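
    A sketch of the reordering, assuming init_hw_breakpoint() keeps its
    existing prototype and only the call site moves:

    void __init perf_event_init(void)
    {
            ...
            perf_tp_register();
            init_hw_breakpoint();  /* was core_initcall(), too late for kgdb */
    }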

    Signed-off-by: Jason Wessel
    CC: Frederic Weisbecker
    CC: Ingo Molnar
    CC: Peter Zijlstra
    LKML-Reference:
    Signed-off-by: Frederic Weisbecker

    Jason Wessel
     

11 Nov, 2010

1 commit

  • This patch corrects time tracking in samples. Without this patch,
    both time_enabled and time_running are bogus when the user asks for
    PERF_SAMPLE_READ.

    One uses PERF_SAMPLE_READ to sample the values of other counters
    in each sample. Because of multiplexing, it is necessary to know
    both time_enabled and time_running to be able to scale counts correctly.
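
    As a userspace illustration (a hedged sketch, not part of the patch):
    with a read_format requesting both times, each sampled value can be
    scaled to undo the multiplexing:

    struct read_format {        /* PERF_FORMAT_TOTAL_TIME_ENABLED |
                                   PERF_FORMAT_TOTAL_TIME_RUNNING */
            __u64 value;
            __u64 time_enabled;
            __u64 time_running;
    } rf;

    /* Estimate what the count would have been had the event run the
       whole time it was enabled: */
    __u64 scaled = rf.time_running ?
            (__u64)((double)rf.value * rf.time_enabled / rf.time_running) : 0;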

    In this second version of the patch, we maintain a shadow
    copy of ctx->time which allows us to compute ctx->time without
    calling update_context_time() from NMI context. We avoid the
    issue that update_context_time() must always be called with
    ctx->lock held.

    We do not keep shadow copies of the other event timings
    because if the lead event is overflowing then it is active
    and thus has been scheduled in via event_sched_in(), in
    which case neither tstamp_stopped nor tstamp_running can be modified.

    This timing logic only applies to samples when PERF_SAMPLE_READ
    is used.

    Note that this patch does not address timing issues related
    to sampling inheritance between tasks. This will be addressed
    in a future patch.

    With this patch, the libpfm4 example task_smpl now reports
    correct counts (shown on a 2.4GHz Core 2):

    $ task_smpl -p 2400000000 -e unhalted_core_cycles:u,instructions_retired:u,baclears noploop 5
    noploop for 5 seconds
    IIP:0x000000004006d6 PID:5596 TID:5596 TIME:466,210,211,430 STREAM_ID:33 PERIOD:2,400,000,000 ENA=1,010,157,814 RUN=1,010,157,814 NR=3
    2,400,000,254 unhalted_core_cycles:u (33)
    2,399,273,744 instructions_retired:u (34)
    53,340 baclears (35)

    Signed-off-by: Stephane Eranian
    Signed-off-by: Peter Zijlstra
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Stephane Eranian
     

22 Oct, 2010

3 commits

  • This new version (see commit 8e5fc1a) is much simpler and ensures that
    in case of error in group_sched_in() during event_sched_in(), the
    events up to the failed event go through regular event_sched_out().
    But the failed event and the remaining events in the group have their
    timings adjusted as if they had also gone through event_sched_in() and
    event_sched_out(). This ensures timing uniformity across all events in
    a group. This also takes care of the tstamp_stopped problem in case
    the group could never be scheduled: tstamp_stopped is updated as
    if the event had actually run.

    With this patch, the following now reports correct time_enabled,
    in case the NMI watchdog is active:

    $ task -e unhalted_core_cycles,instructions_retired,baclears,baclears noploop 1
    noploop for 1 seconds

    0 unhalted_core_cycles (100.00% scaling, ena=997,552,872, run=0)
    0 instructions_retired (100.00% scaling, ena=997,552,872, run=0)
    0 baclears (100.00% scaling, ena=997,552,872, run=0)
    0 baclears (100.00% scaling, ena=997,552,872, run=0)

    And the older test case also works:

    $ task -einstructions_retired,baclears,baclears -e unhalted_core_cycles,baclears,baclears sleep 5

    1680885 instructions_retired (69.39% scaling, ena=950756, run=291006)
    10735 baclears (69.39% scaling, ena=950756, run=291006)
    10735 baclears (69.39% scaling, ena=950756, run=291006)

    0 unhalted_core_cycles (100.00% scaling, ena=817932, run=0)
    0 baclears (100.00% scaling, ena=817932, run=0)
    0 baclears (100.00% scaling, ena=817932, run=0)

    Signed-off-by: Stephane Eranian
    Signed-off-by: Peter Zijlstra
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Stephane Eranian
     
  • This patch reverts commit 8e5fc1a (perf_events: Fix transaction
    recovery in group_sched_in()) because it had one flaw in case the
    group could never be scheduled: it would cause time_enabled to go
    negative.

    Signed-off-by: Stephane Eranian
    Signed-off-by: Peter Zijlstra
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Stephane Eranian
     
  • …git/tip/linux-2.6-tip

    * 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (163 commits)
    tracing: Fix compile issue for trace_sched_wakeup.c
    [S390] hardirq: remove pointless header file includes
    [IA64] Move local_softirq_pending() definition
    perf, powerpc: Fix power_pmu_event_init to not use event->ctx
    ftrace: Remove recursion between recordmcount and scripts/mod/empty
    jump_label: Add COND_STMT(), reducer wrappery
    perf: Optimize sw events
    perf: Use jump_labels to optimize the scheduler hooks
    jump_label: Add atomic_t interface
    jump_label: Use more consistent naming
    perf, hw_breakpoint: Fix crash in hw_breakpoint creation
    perf: Find task before event alloc
    perf: Fix task refcount bugs
    perf: Fix group moving
    irq_work: Add generic hardirq context callbacks
    perf_events: Fix transaction recovery in group_sched_in()
    perf_events: Fix bogus AMD64 generic TLB events
    perf_events: Fix bogus context time tracking
    tracing: Remove parent recording in latency tracer graph options
    tracing: Use one prologue for the preempt irqs off tracer function tracers
    ...

    Linus Torvalds
     

19 Oct, 2010

9 commits

  • Acked-by: Frederic Weisbecker
    Signed-off-by: Peter Zijlstra
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • Trades a call + conditional + ret for an unconditional jmp.
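
    Illustrated with the static-key form of the jump_label API (a sketch;
    the interface at the time spelled this differently, e.g. COND_STMT()):

    static DEFINE_STATIC_KEY_FALSE(perf_task_events);

    void perf_hook(void)
    {
            /* Compiles down to a NOP that is patched to a jmp when the
             * key is enabled -- no load, test and conditional branch. */
            if (static_branch_unlikely(&perf_task_events))
                    slow_path();    /* hypothetical */
    }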

    Acked-by: Frederic Weisbecker
    Signed-off-by: Peter Zijlstra
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • hw_breakpoint creation needs to account stuff per-task to ensure there
    are always sufficient hardware resources to back these things, due to
    ptrace.

    With the perf per-pmu context changes the event initialization no
    longer has access to the event context, for the simple reason that we
    need to first find the pmu (result of initialization) before we can
    find the context.

    This makes hw_breakpoints unhappy, because they can no longer do per
    task accounting; cure this by frobbing a task pointer into the
    event::hw bits for now...
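
    A sketch of the frobbing: stash the target task in the breakpoint
    leg of the hw_perf_event union, where event init can reach it
    (field name per the merged fix):

    struct hw_perf_event {
            union {
                    ...
                    struct { /* breakpoint */
                            struct arch_hw_breakpoint info;
                            struct list_head bp_list;
                            /* accounting target; init no longer sees ctx */
                            struct task_struct *bp_target;
                    };
            };
            ...
    };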

    Signed-off-by: Peter Zijlstra
    Cc: Frederic Weisbecker
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • Find the task before allocating the event, so that we can pass the
    task pointer to the event allocation and use task-associated data
    during event initialization.

    Signed-off-by: Peter Zijlstra
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • Currently it looks like find_lively_task_by_vpid() takes a task ref
    and relies on find_get_context() to drop it.

    The problem is that perf_event_create_kernel_counter() shouldn't be
    dropping task refs.

    Signed-off-by: Peter Zijlstra
    Acked-by: Frederic Weisbecker
    Acked-by: Matt Helsley
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • Matt found we trigger the WARN_ON_ONCE() in perf_group_attach() when we take
    the move_group path in perf_event_open().

    Since we cannot de-construct the group (we rely on it to move the events), we
    have to simply ignore the double attach. The group state is context invariant
    and doesn't need changing.

    Reported-by: Matt Fleming
    Signed-off-by: Peter Zijlstra
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • Provide a mechanism that allows running code in IRQ context. It is
    most useful for NMI code that needs to interact with the rest of the
    system -- like waking up a task to drain buffers.

    Perf currently has such a mechanism, so extract that and provide it as
    a generic feature, independent of perf so that others may also
    benefit.

    The IRQ context callback is generated through self-IPIs where
    possible, or on architectures like powerpc the decrementer (the
    built-in timer facility) is set to generate an interrupt immediately.

    Architectures that don't have anything like this make do with a
    callback from the timer tick. These architectures can call
    irq_work_run() at the tail of any IRQ handlers that might enqueue such
    work (like the perf IRQ handler) to avoid undue latencies in
    processing the work.
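
    A minimal usage sketch of the new API (the callback and the task
    pointer are hypothetical):

    #include <linux/irq_work.h>

    static struct task_struct *consumer_task;   /* hypothetical consumer */
    static struct irq_work wakeup_work;

    static void wake_consumer(struct irq_work *work)
    {
            /* Runs in hard-IRQ context, where waking a task is safe. */
            wake_up_process(consumer_task);
    }

    void setup(void)
    {
            init_irq_work(&wakeup_work, wake_consumer);
    }

    /* From NMI context, e.g. a PMU overflow handler: */
    void nmi_path(void)
    {
            irq_work_queue(&wakeup_work);  /* runs once IRQs are possible */
    }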

    Signed-off-by: Peter Zijlstra
    Acked-by: Kyle McMartin
    Acked-by: Martin Schwidefsky
    [ various fixes ]
    Signed-off-by: Huang Ying
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • The group_sched_in() function uses a transactional approach to schedule
    a group of events. In a group, either all events can be scheduled or
    none are. To schedule each event in, the function calls event_sched_in().
    In case of error, event_sched_out() is called on each event in the group.

    The problem is that event_sched_out() does not completely cancel the
    effects of event_sched_in(). Furthermore, event_sched_out() changes the
    state of the event as if it had run, which is not true in this particular
    case.

    Those inconsistencies impact time tracking fields and may lead to events
    in a group not all reporting the same time_enabled and time_running values.
    This is demonstrated with the example below:

    $ task -eunhalted_core_cycles,baclears,baclears -e unhalted_core_cycles,baclears,baclears sleep 5
    1946101 unhalted_core_cycles (32.85% scaling, ena=829181, run=556827)
    11423 baclears (32.85% scaling, ena=829181, run=556827)
    7671 baclears (0.00% scaling, ena=556827, run=556827)

    2250443 unhalted_core_cycles (57.83% scaling, ena=962822, run=405995)
    11705 baclears (57.83% scaling, ena=962822, run=405995)
    11705 baclears (57.83% scaling, ena=962822, run=405995)

    Notice that in the first group, the last baclears event does not
    report the same timings as its siblings.

    This issue comes from the fact that tstamp_stopped is updated
    by event_sched_out() as if the event had actually run.

    To solve the issue, we must ensure that, in case of error, there is
    no change in the event state whatsoever. That means timings must
    remain as they were when entering group_sched_in().

    To do this we defer updating tstamp_running until we know the
    transaction succeeded. Therefore, we have split event_sched_in()
    into two parts, separating out the update to tstamp_running.

    Similarly, in case of error, we do not want to update tstamp_stopped.
    Therefore, we have split event_sched_out() into two parts, separating
    out the update to tstamp_stopped.
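
    A hedged sketch of the resulting structure (the split helpers
    __event_sched_in()/__event_sched_out() and commit_tstamp_running()
    are illustrative names, not the exact upstream ones):

    list_for_each_entry(event, &leader->sibling_list, group_entry) {
            if (__event_sched_in(event)) {  /* tstamp_running untouched */
                    partial = event;
                    goto rollback;
            }
    }
    list_for_each_entry(event, &leader->sibling_list, group_entry)
            commit_tstamp_running(event);   /* transaction succeeded */
    return 0;

    rollback:
    list_for_each_entry(event, &leader->sibling_list, group_entry) {
            if (event == partial)
                    break;
            __event_sched_out(event);       /* tstamp_stopped untouched */
    }
    return -EAGAIN;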

    With this patch, we now get the following output:

    $ task -eunhalted_core_cycles,baclears,baclears -e unhalted_core_cycles,baclears,baclears sleep 5
    2492050 unhalted_core_cycles (71.75% scaling, ena=1093330, run=308841)
    11243 baclears (71.75% scaling, ena=1093330, run=308841)
    11243 baclears (71.75% scaling, ena=1093330, run=308841)

    1852746 unhalted_core_cycles (0.00% scaling, ena=784489, run=784489)
    9253 baclears (0.00% scaling, ena=784489, run=784489)
    9253 baclears (0.00% scaling, ena=784489, run=784489)

    Note that the uneven timing between groups is a side effect of
    the process spending most of its time sleeping, i.e., not enough
    event rotations (but that's a separate issue).

    Signed-off-by: Stephane Eranian
    Signed-off-by: Peter Zijlstra
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Stephane Eranian
     
  • You can only call update_context_time() when the context
    is active, i.e., the thread it is attached to is still running.

    However, perf_event_read() can be called even when the context
    is inactive, e.g., when the user read()s the counters. The call to
    update_context_time() must be conditioned on the status of
    the context, otherwise bogus time_enabled and time_running values
    may be returned. Here is an example on AMD64. The task program
    is an example from libpfm4. The -p option prints deltas every 1s.

    $ task -p -e cpu_clk_unhalted sleep 5
    2,266,610 cpu_clk_unhalted (0.00% scaling, ena=2,158,982, run=2,158,982)
    0 cpu_clk_unhalted (0.00% scaling, ena=2,158,982, run=2,158,982)
    0 cpu_clk_unhalted (0.00% scaling, ena=2,158,982, run=2,158,982)
    0 cpu_clk_unhalted (0.00% scaling, ena=2,158,982, run=2,158,982)
    0 cpu_clk_unhalted (0.00% scaling, ena=2,158,982, run=2,158,982)
    5,242,358,071 cpu_clk_unhalted (99.95% scaling, ena=5,000,359,984, run=2,319,270)

    Whereas if you don't read deltas, e.g., no call to perf_event_read() until
    the process terminates:

    $ task -e cpu_clk_unhalted sleep 5
    2,497,783 cpu_clk_unhalted (0.00% scaling, ena=2,376,899, run=2,376,899)

    Notice that time_enabled and time_running are bogus in the first
    example, causing bogus scaling.

    This patch fixes the problem by conditionally calling
    update_context_time() in perf_event_read().
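
    A sketch of the resulting read path, under the context lock
    (helper names as in the perf code of the time):

    raw_spin_lock_irqsave(&ctx->lock, flags);
    if (ctx->is_active)
            update_context_time(ctx);
    update_event_times(event);
    raw_spin_unlock_irqrestore(&ctx->lock, flags);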

    Signed-off-by: Stephane Eranian
    Signed-off-by: Peter Zijlstra
    Cc: stable@kernel.org
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Stephane Eranian
     

11 Oct, 2010

1 commit

  • Introduce perf_pmu_name() helper function that returns the name of the
    pmu. This gives us a generic way to get the name of a pmu regardless of
    how an architecture identifies it internally.
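
    A sketch of the helper, assuming a weak default that architectures
    override (the default string here is illustrative):

    __weak const char *perf_pmu_name(void)
    {
            return "pmu";
    }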

    Signed-off-by: Matt Fleming
    Acked-by: Peter Zijlstra
    Acked-by: Paul Mundt
    Signed-off-by: Robert Richter

    Matt Fleming
     

04 Oct, 2010

1 commit

  • This patch fixes an error in perf_event_open() when the pid
    provided by the user is invalid. find_lively_task_by_vpid()
    does not return NULL on error but an error code. Without the
    fix, the error code was silently passed to find_get_context(),
    which would eventually cause an invalid pointer dereference.
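
    A sketch of the check perf_event_open() gains (the error label is
    illustrative):

    task = find_lively_task_by_vpid(pid);
    if (IS_ERR(task)) {
            err = PTR_ERR(task);    /* propagate, don't dereference */
            goto err_fd;
    }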

    Signed-off-by: Stephane Eranian
    Cc: peterz@infradead.org
    Cc: paulus@samba.org
    Cc: davem@davemloft.net
    Cc: fweisbec@gmail.com
    Cc: perfmon2-devel@lists.sf.net
    Cc: eranian@gmail.com
    Cc: robert.richter@amd.com
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Stephane Eranian
     

21 Sep, 2010

1 commit

  • The per-pmu per-cpu context patch converted things from
    get_cpu_var() to this_cpu_ptr(), but that only works if
    rcu_read_lock() actually disables preemption, and since
    there is no such guarantee, we need to fix that.

    Use the newly introduced {get,put}_cpu_ptr().
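
    A sketch of the substitution, e.g. where a per-cpu context is
    looked up under rcu_read_lock():

    cpuctx = get_cpu_ptr(pmu->pmu_cpu_context);  /* disables preemption */
    /* ... use cpuctx ... */
    put_cpu_ptr(pmu->pmu_cpu_context);           /* re-enables it */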

    Signed-off-by: Peter Zijlstra
    Cc: Tejun Heo
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     

17 Sep, 2010

4 commits

  • Revert the per cpu-context timers because of an unfortunate
    nohz interaction. Fixing that would have been somewhat ugly, so
    go back to driving things from the regular tick. Provide a
    jiffies interval feature for people who want slower rotations.

    Signed-off-by: Peter Zijlstra
    Cc: Stephane Eranian
    Cc: Robert Richter
    Cc: Yinghai Lu
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • Use the right cpu-context; spotted by a preempt warning on
    hot-unplug.

    Signed-off-by: Peter Zijlstra
    Cc: Stephane Eranian
    Cc: Robert Richter
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • Aside from allowing software events into a !software group,
    allow adding !software events to pure software groups.

    Once we've moved the software group and attached the first
    !software event, the group will no longer be a pure software
    group and hence no longer be eligible for movement, at which
    point the straight ctx comparison is correct again.

    Signed-off-by: Peter Zijlstra
    Cc: Stephane Eranian
    Cc: Robert Richter
    Cc: Paul Mackerras
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • Events were not grouped anymore. The reason was that in
    perf_event_open(), the field event->group_leader was
    initialized before the function looked up the group_fd
    to find the event leader. This patch fixes this by
    reordering the code correctly.
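
    For reference, grouping from userspace hinges on the group_fd
    argument (a hedged sketch using the raw syscall):

    #include <linux/perf_event.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    static int perf_event_open(struct perf_event_attr *attr, pid_t pid,
                               int cpu, int group_fd, unsigned long flags)
    {
            return syscall(__NR_perf_event_open, attr, pid, cpu,
                           group_fd, flags);
    }

    /* leader first, then a sibling attached to it: */
    int leader  = perf_event_open(&attr_cycles, 0, -1, -1, 0);
    int sibling = perf_event_open(&attr_insns,  0, -1, leader, 0);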

    Signed-off-by: Stephane Eranian
    Signed-off-by: Peter Zijlstra
    Cc: Robert Richter
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Stephane Eranian
     

15 Sep, 2010

2 commits

  • The kernel perf event creation path shouldn't use find_task_by_vpid()
    because a vpid exists in a specific namespace. find_task_by_vpid() uses
    current's pid namespace which isn't always the correct namespace to use
    for the vpid in all the places perf_event_create_kernel_counter() (and
    thus find_get_context()) is called.

    The goal is to clean up pid namespace handling and prevent bugs like:

    https://bugzilla.kernel.org/show_bug.cgi?id=17281

    Instead of using pids switch find_get_context() to use task struct
    pointers directly. The syscall is responsible for resolving the pid to
    a task struct. This moves the pid namespace resolution into the syscall
    much like every other syscall that takes pid parameters.

    Signed-off-by: Matt Helsley
    Signed-off-by: Peter Zijlstra
    Cc: Robin Green
    Cc: Prasad
    Cc: Arnaldo Carvalho de Melo
    Cc: Steven Rostedt
    Cc: Will Deacon
    Cc: Mahesh Salgaonkar
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Matt Helsley
     
  • Split out the code which searches for non-exiting tasks into its own
    helper. Creating this helper not only makes the code slightly more
    readable, it also prepares for moving the search out of
    find_get_context() in a subsequent commit.

    Signed-off-by: Matt Helsley
    Signed-off-by: Peter Zijlstra
    Cc: Robin Green
    Cc: Prasad
    Cc: Arnaldo Carvalho de Melo
    Cc: Steven Rostedt
    Cc: Will Deacon
    Cc: Mahesh Salgaonkar
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Matt Helsley
     

13 Sep, 2010

2 commits

  • With the context rework stuff we can actually end up freeing an event
    before it gets attached to a context.

    Reported-by: Cyrill Gorcunov
    Signed-off-by: Peter Zijlstra
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • Simplify things by synchronizing against both RCU variants for
    PMU unregister -- we don't care about performance, it's module
    unload if anything.
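
    A sketch of the unregister path, waiting out both read-side flavours
    (list and lock names follow the perf code):

    mutex_lock(&pmus_lock);
    list_del_rcu(&pmu->entry);
    mutex_unlock(&pmus_lock);

    synchronize_srcu(&pmus_srcu);   /* SRCU readers of the pmu list */
    synchronize_rcu();              /* plain RCU readers */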

    Reported-by: Frederic Weisbecker
    Signed-off-by: Peter Zijlstra
    Cc: Paul E. McKenney
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     

10 Sep, 2010

8 commits

  • We ought to return -ENOENT when none of the registered PMUs
    recognise the requested event.

    This fixes a boot crash that occurs if no PMU is available
    but the NMI watchdog tries to register an event.

    Reported-by: Ingo Molnar
    Signed-off-by: Peter Zijlstra
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • Even though we call it from the inherit path, where the child is
    not yet accessible, we need to hold ctx->lock: add_event_to_ctx()
    assumes IRQs are disabled.

    Signed-off-by: Peter Zijlstra
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • I missed a perf_event_ctxp user when converting it to an array. Pull this
    last user into perf_event.c as well and fix it up.

    Signed-off-by: Peter Zijlstra
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • Assuming we don't mix events of different pmus onto a single context
    (with the exception of software events inside a hardware group), we can
    now assume that all events on a particular context belong to the same
    pmu, hence we can disable the pmu around entire-context operations.

    This reduces the amount of hardware writes.

    The exception for swevents comes from the fact that the sw pmu disable
    is a nop.
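
    A sketch of the pattern (EVENT_ALL and ctx_sched_out() as in the
    perf core of the time):

    perf_pmu_disable(ctx->pmu);     /* one hardware write up front */
    ctx_sched_out(ctx, cpuctx, EVENT_ALL);
    perf_pmu_enable(ctx->pmu);      /* and one when we're done */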

    Signed-off-by: Peter Zijlstra
    Cc: paulus
    Cc: stephane eranian
    Cc: Robert Richter
    Cc: Frederic Weisbecker
    Cc: Lin Ming
    Cc: Yanmin
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • Since software events are always schedulable, mixing them up with
    hardware events (which are not) can lead to funny scheduling oddities.

    Giving them their own context solves this.

    Signed-off-by: Peter Zijlstra
    Cc: paulus
    Cc: stephane eranian
    Cc: Robert Richter
    Cc: Frederic Weisbecker
    Cc: Lin Ming
    Cc: Yanmin
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • Provide the infrastructure for multiple task contexts.

    A more flexible approach would have resulted in more pointer chases
    in the scheduling hot-paths. This approach has the limitation of a
    static number of task contexts.

    Since I expect most external PMUs to be system-wide, or at least
    node-wide (as with the Intel uncore unit), they won't actually need
    a task context.
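
    The static-array shape this introduces (as merged; the enum indexes
    the per-task context array):

    enum perf_event_task_context {
            perf_invalid_context = -1,
            perf_hw_context = 0,
            perf_sw_context,
            perf_nr_task_contexts,
    };

    /* in struct task_struct: */
    struct perf_event_context *perf_event_ctxp[perf_nr_task_contexts];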

    Signed-off-by: Peter Zijlstra
    Cc: paulus
    Cc: stephane eranian
    Cc: Robert Richter
    Cc: Frederic Weisbecker
    Cc: Lin Ming
    Cc: Yanmin
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • Unify the two perf_event_context allocation sites.

    Signed-off-by: Peter Zijlstra
    Cc: paulus
    Cc: stephane eranian
    Cc: Robert Richter
    Cc: Frederic Weisbecker
    Cc: Lin Ming
    Cc: Yanmin
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • Move all inherit code near each other.

    Signed-off-by: Peter Zijlstra
    Cc: paulus
    Cc: stephane eranian
    Cc: Robert Richter
    Cc: Frederic Weisbecker
    Cc: Lin Ming
    Cc: Yanmin
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Peter Zijlstra