14 Sep, 2015

4 commits

  • commit c7999c6f3fed9e383d3131474588f282ae6d56b9 upstream.

    I ran the perf fuzzer, which triggered some WARN()s which are due to
    trying to stop/restart an event on the wrong CPU.

    Use the normal IPI pattern to ensure we run the code on the correct CPU.

    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Vince Weaver
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Fixes: bad7192b842c ("perf: Fix PERF_EVENT_IOC_PERIOD to force-reset the period")
    Signed-off-by: Ingo Molnar
    Signed-off-by: Greg Kroah-Hartman

    Peter Zijlstra
     
  • commit ee9397a6fb9bc4e52677f5e33eed4abee0f515e6 upstream.

    If rb->aux_refcount is decremented to zero before rb->refcount,
    __rb_free_aux() may be called twice resulting in a double free of
    rb->aux_pages. Fix this by adding a check to __rb_free_aux().
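
    As a rough sketch of the fix (field and helper names as used in
    kernel/events/ring_buffer.c; the aux_priv cleanup is elided), the added
    check makes a second invocation a no-op:

        static void __rb_free_aux(struct ring_buffer *rb)
        {
            int pg;

            /* Only free the AUX pages if they haven't been freed already. */
            if (rb->aux_nr_pages) {
                for (pg = 0; pg < rb->aux_nr_pages; pg++)
                    rb_free_aux_page(rb, pg);

                kfree(rb->aux_pages);
                rb->aux_nr_pages = 0;   /* a second call now does nothing */
            }
        }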

    Signed-off-by: Ben Hutchings
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Alexander Shishkin
    Cc: Arnaldo Carvalho de Melo
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Fixes: 57ffc5ca679f ("perf: Fix AUX buffer refcounting")
    Link: http://lkml.kernel.org/r/1437953468.12842.17.camel@decadent.org.uk
    Signed-off-by: Ingo Molnar
    Signed-off-by: Greg Kroah-Hartman

    Ben Hutchings
     
  • commit 00a2916f7f82c348a2a94dbb572874173bc308a3 upstream.

    A recent fix to the shadow timestamp inadvertently broke the running time
    accounting.

    We must not update the running timestamp if we fail to schedule the
    event; the event will not have run. This can (and did) result in
    negative total runtime because the stopped timestamp was before the
    running timestamp (we 'started' but never stopped the event -- because
    it never really started we didn't have to stop it either).

    Reported-and-Tested-by: Vince Weaver
    Fixes: 72f669c0086f ("perf: Update shadow timestamp before add event")
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Shaohua Li
    Signed-off-by: Thomas Gleixner
    Signed-off-by: Greg Kroah-Hartman

    Peter Zijlstra
     
  • commit fed66e2cdd4f127a43fd11b8d92a99bdd429528c upstream.

    Vince reported that the fasync signal stuff doesn't work properly for
    inherited events. So fix that.

    Installing fasync allocates memory and sets filp->f_flags |= FASYNC,
    which upon the demise of the file descriptor ensures the allocation is
    freed and state is updated.

    Now for perf, we can have the events stick around for a while after the
    original FD is dead because of references from child events. So we
    cannot copy the fasync pointer around. We can however consistently use
    the parent's fasync, as that will be updated.

    Reported-and-Tested-by: Vince Weaver
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Arnaldo Carvalho de Melo
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: eranian@google.com
    Link: http://lkml.kernel.org/r/1434011521.1495.71.camel@twins
    Signed-off-by: Ingo Molnar
    Signed-off-by: Greg Kroah-Hartman

    Peter Zijlstra
     

30 Jun, 2015

1 commit

  • commit 2f993cf093643b98477c421fa2b9a98dcc940323 upstream.

    While looking for other users of get_state/cond_sync, I found
    ring_buffer_attach(), and it looks obviously buggy.

    Don't we need to ensure that we have "synchronize" _between_
    list_del() and list_add() ?

    IOW, suppose that ring_buffer_attach() is preempted right after
    get_state_synchronize_rcu() and the grace period completes before spin_lock().

    In this case cond_synchronize_rcu() does nothing and we reuse
    ->rb_entry without waiting for a grace period in between.

    This patch also moves the ->rcu_pending check under "if (rb)", to make it
    more readable.

    Signed-off-by: Oleg Nesterov
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Alexander Shishkin
    Cc: Andrew Morton
    Cc: Andy Lutomirski
    Cc: Borislav Petkov
    Cc: Brian Gerst
    Cc: Denys Vlasenko
    Cc: H. Peter Anvin
    Cc: Linus Torvalds
    Cc: Paul E. McKenney
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: dave@stgolabs.net
    Cc: der.herr@hofr.at
    Cc: josh@joshtriplett.org
    Cc: tj@kernel.org
    Fixes: b69cf53640da ("perf: Fix a race between ring_buffer_detach() and ring_buffer_attach()")
    Link: http://lkml.kernel.org/r/20150530200425.GA15748@redhat.com
    Signed-off-by: Ingo Molnar
    Signed-off-by: Greg Kroah-Hartman

    Oleg Nesterov
     

27 May, 2015

2 commits

  • PMUs that don't support hardware scatter tables require big contiguous
    chunks of memory and a PMI to switch between them. However, in overwrite
    mode, using a PMI for this purpose adds extra overhead that users would
    like to avoid. Thus, in overwrite mode for such PMUs we can only allow
    one contiguous chunk for the entire requested buffer.

    This patch changes the behavior accordingly, so that if the buddy allocator
    fails to come up with a single high-order chunk for the entire requested
    buffer, the allocation will fail.

    Signed-off-by: Alexander Shishkin
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Linus Torvalds
    Cc: Paul Mackerras
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: acme@infradead.org
    Cc: adrian.hunter@intel.com
    Cc: hpa@zytor.com
    Link: http://lkml.kernel.org/r/1432308626-18845-2-git-send-email-alexander.shishkin@linux.intel.com
    Signed-off-by: Ingo Molnar

    Alexander Shishkin
     
  • There is a race between perf_event_free_bpf_prog() and free_trace_kprobe():

        __free_event()
          event->destroy(event)
            tp_perf_event_destroy()
              perf_trace_destroy()
                perf_trace_event_unreg()

    which drops event->tp_event->perf_refcount and allows the following to
    proceed:

        unregister_trace_kprobe()
          unregister_kprobe_event()
            trace_remove_event_call()
              probe_remove_event_call()
                free_trace_kprobe()

    while __free_event() does:

        call_rcu(&event->rcu_head, free_event_rcu);
          free_event_rcu()
            perf_event_free_bpf_prog()

    To fix the race, simply move perf_event_free_bpf_prog() before
    event->destroy(), since event->tp_event is still valid at that point.

    Note, perf_trace_destroy() is not racing with trace_remove_event_call()
    since they both grab event_mutex.

    Reported-by: Wang Nan
    Signed-off-by: Alexei Starovoitov
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Arnaldo Carvalho de Melo
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: lizefan@huawei.com
    Cc: pi3orama@163.com
    Fixes: 2541517c32be ("tracing, perf: Implement BPF programs attached to kprobes")
    Link: http://lkml.kernel.org/r/1431717321-28772-1-git-send-email-ast@plumgrid.com
    Signed-off-by: Ingo Molnar

    Alexei Starovoitov
     

08 May, 2015

1 commit


16 Apr, 2015

1 commit

  • Pull networking updates from David Miller:

    1) Add BQL support to via-rhine, from Tino Reichardt.

    2) Integrate SWITCHDEV layer support into the DSA layer, so DSA drivers
    can support hw switch offloading. From Florian Fainelli.

    3) Allow 'ip address' commands to initiate multicast group join/leave,
    from Madhu Challa.

    4) Many ipv4 FIB lookup optimizations from Alexander Duyck.

    5) Support EBPF in cls_bpf classifier and act_bpf action, from Daniel
    Borkmann.

    6) Remove the ugly compat support in ARP for ugly layers like ax25,
    rose, etc. And use this to clean up the neigh layer, then use it to
    implement MPLS support. All from Eric Biederman.

    7) Support L3 forwarding offloading in switches, from Scott Feldman.

    8) Collapse the LOCAL and MAIN ipv4 FIB tables when possible, to speed
    up route lookups even further. From Alexander Duyck.

    9) Many improvements and bug fixes to the rhashtable implementation,
    from Herbert Xu and Thomas Graf. In particular, in the case where
    an rhashtable user bulk adds a large number of items into an empty
    table, we expand the table much more sanely.

    10) Don't make the tcp_metrics hash table per-namespace, from Eric
    Biederman.

    11) Extend EBPF to access SKB fields, from Alexei Starovoitov.

    12) Split out new connection request sockets so that they can be
    established in the main hash table. Much less false sharing since
    hash lookups go direct to the request sockets instead of having to
    go first to the listener then to the request socks hashed
    underneath. From Eric Dumazet.

    13) Add async I/O support for crypto AF_ALG sockets, from Tadeusz Struk.

    14) Support stable privacy address generation for RFC7217 in IPV6. From
    Hannes Frederic Sowa.

    15) Hash network namespace into IP frag IDs, also from Hannes Frederic
    Sowa.

    16) Convert PTP get/set methods to use 64-bit time, from Richard
    Cochran.

    * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1816 commits)
    fm10k: Bump driver version to 0.15.2
    fm10k: corrected VF multicast update
    fm10k: mbx_update_max_size does not drop all oversized messages
    fm10k: reset head instead of calling update_max_size
    fm10k: renamed mbx_tx_dropped to mbx_tx_oversized
    fm10k: update xcast mode before synchronizing multicast addresses
    fm10k: start service timer on probe
    fm10k: fix function header comment
    fm10k: comment next_vf_mbx flow
    fm10k: don't handle mailbox events in iov_event path and always process mailbox
    fm10k: use separate workqueue for fm10k driver
    fm10k: Set PF queues to unlimited bandwidth during virtualization
    fm10k: expose tx_timeout_count as an ethtool stat
    fm10k: only increment tx_timeout_count in Tx hang path
    fm10k: remove extraneous "Reset interface" message
    fm10k: separate PF only stats so that VF does not display them
    fm10k: use hw->mac.max_queues for stats
    fm10k: only show actual queues, not the maximum in hardware
    fm10k: allow creation of VLAN on default vid
    fm10k: fix unused warnings
    ...

    Linus Torvalds
     

02 Apr, 2015

11 commits

  • For counters that generate AUX data that is bound to the context of a
    running task, such as instruction tracing, the decoder needs to know
    exactly which task is running when the event is first scheduled in,
    before the first sched_switch. The decoder's need to know this stems
    from the fact that instruction flow trace decoding will almost always
    require the program's object code in order to reconstruct said flow and
    for that we need at least its pid/tid in the perf stream.

    To single out such instruction tracing pmus, this patch introduces an
    ITRACE PMU capability. The reason this is not part of the RECORD_AUX
    record is that not all pmus capable of generating AUX data need this,
    and the opposite is *probably* also true.

    While sched_switch covers most cases, there are two problems with it:
    the consumer will need to process events out of order (that is, having
    found RECORD_AUX, it will have to skip forward to the nearest sched_switch
    to figure out which task it was, then go back to the actual trace to
    decode it) and it completely misses the case when the tracing is enabled
    and disabled before sched_switch, for example, via PERF_EVENT_IOC_DISABLE.

    Signed-off-by: Alexander Shishkin
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Borislav Petkov
    Cc: Frederic Weisbecker
    Cc: H. Peter Anvin
    Cc: Kaixu Xia
    Cc: Linus Torvalds
    Cc: Mike Galbraith
    Cc: Paul Mackerras
    Cc: Robert Richter
    Cc: Stephane Eranian
    Cc: Thomas Gleixner
    Cc: acme@infradead.org
    Cc: adrian.hunter@intel.com
    Cc: kan.liang@intel.com
    Cc: markus.t.metzger@intel.com
    Cc: mathieu.poirier@linaro.org
    Link: http://lkml.kernel.org/r/1421237903-181015-15-git-send-email-alexander.shishkin@linux.intel.com
    Signed-off-by: Ingo Molnar

    Alexander Shishkin
     
  • When AUX area gets a certain amount of new data, we want to wake up
    userspace to collect it. This adds a new control to specify how much
    data will cause a wakeup. This is then passed down to pmu drivers via
    output handle's "wakeup" field, so that the driver can find the nearest
    point where it can generate an interrupt.

    We repurpose __reserved_2 in the event attribute for this. Even though
    it was never checked to be zero before, aux_watermark will only matter
    for new AUX-aware code, so the old code should still be fine.
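
    As an illustration, a minimal userspace sketch of requesting a wakeup
    every 32 KiB of new AUX data (aux_watermark is the field added here; the
    rest of the perf_event_open() setup is assumed):

        #include <string.h>
        #include <linux/perf_event.h>

        static void init_attr_with_watermark(struct perf_event_attr *attr)
        {
            memset(attr, 0, sizeof(*attr));
            attr->size          = sizeof(*attr);
            attr->aux_watermark = 32 * 1024;    /* bytes of new AUX data per wakeup */
            /* fill in type/config and call perf_event_open() as usual */
        }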

    Signed-off-by: Alexander Shishkin
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Borislav Petkov
    Cc: Frederic Weisbecker
    Cc: H. Peter Anvin
    Cc: Kaixu Xia
    Cc: Linus Torvalds
    Cc: Mike Galbraith
    Cc: Paul Mackerras
    Cc: Robert Richter
    Cc: Stephane Eranian
    Cc: Thomas Gleixner
    Cc: acme@infradead.org
    Cc: adrian.hunter@intel.com
    Cc: kan.liang@intel.com
    Cc: markus.t.metzger@intel.com
    Cc: mathieu.poirier@linaro.org
    Link: http://lkml.kernel.org/r/1421237903-181015-10-git-send-email-alexander.shishkin@linux.intel.com
    Signed-off-by: Ingo Molnar

    Alexander Shishkin
     
  • This adds support for overwrite mode in the AUX area, which means "keep
    collecting data till you're stopped", turning AUX area into a circular
    buffer, where new data overwrites old data. It does not depend on data
    buffer's overwrite mode, so that it doesn't lose sideband data that is
    instrumental for processing AUX data.

    Overwrite mode is enabled by mapping the AUX area read-only. Even though
    aux_tail in the buffer's user page might be user writable, it will be
    ignored in this mode.

    A PERF_RECORD_AUX with PERF_AUX_FLAG_OVERWRITE set is written to the perf
    data stream every time an event writes new data to the AUX area. The pmu
    driver might not be able to infer the exact beginning of the new data in
    each snapshot; some drivers will only provide the tail, which is
    aux_offset + aux_size in the AUX record. The consumer has to be able to
    tell the new data from the old, for example by means of timestamps if
    such are provided in the trace.

    The consumer is also responsible for disabling any events that might write
    to the AUX area (thus potentially racing with the consumer) before
    collecting the data.
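
    A minimal sketch of how a consumer might select overwrite mode, assuming
    the user page (up) and the perf event fd are already set up:

        #include <sys/mman.h>
        #include <linux/perf_event.h>

        static void *map_aux_overwrite(int fd, struct perf_event_mmap_page *up,
                                       size_t aux_size)
        {
            up->aux_offset = up->data_offset + up->data_size;   /* page aligned */
            up->aux_size   = aux_size;

            /* Mapping without PROT_WRITE is what requests overwrite mode. */
            return mmap(NULL, aux_size, PROT_READ, MAP_SHARED, fd, up->aux_offset);
        }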

    Signed-off-by: Alexander Shishkin
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Borislav Petkov
    Cc: Frederic Weisbecker
    Cc: H. Peter Anvin
    Cc: Kaixu Xia
    Cc: Linus Torvalds
    Cc: Mike Galbraith
    Cc: Paul Mackerras
    Cc: Robert Richter
    Cc: Stephane Eranian
    Cc: Thomas Gleixner
    Cc: acme@infradead.org
    Cc: adrian.hunter@intel.com
    Cc: kan.liang@intel.com
    Cc: markus.t.metzger@intel.com
    Cc: mathieu.poirier@linaro.org
    Link: http://lkml.kernel.org/r/1421237903-181015-9-git-send-email-alexander.shishkin@linux.intel.com
    Signed-off-by: Ingo Molnar

    Alexander Shishkin
     
  • For pmus that wish to write data to the ring buffer's AUX area, provide
    perf_aux_output_{begin,end}() calls to initiate/commit data writes,
    similarly to perf_output_{begin,end}. These also use the same output
    handle structure. Also, similarly to their software counterparts, these
    will direct inherited events' output to their parents' ring buffers.

    After perf_aux_output_begin() returns successfully, handle->size is set
    to the maximum amount of data that can be written with respect to the
    aux_tail pointer, so that no data the user hasn't seen will be
    overwritten; therefore this should always be called before hardware
    writing is enabled. On success, it returns the pointer to the pmu
    driver's private structure allocated for this aux area by
    pmu::setup_aux. The same pointer can also be retrieved using
    perf_get_aux() while hardware writing is enabled.

    The PMU driver should pass the actual amount of data written as a
    parameter to perf_aux_output_end(). All hardware writes should be
    completed and visible before it is called.

    Additionally, perf_aux_output_skip() will adjust output handle and
    aux_head in case some part of the buffer has to be skipped over to
    maintain hardware's alignment constraints.

    Nested writers are forbidden and guards are in place to catch such
    attempts.
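
    A rough sketch of the intended call pattern from a hypothetical PMU
    driver; the my_* names are made up, and the exact signatures are
    assumptions based on the description above:

        #include <linux/perf_event.h>

        static struct perf_output_handle my_handle;     /* per-cpu in a real driver */

        static void my_pmu_start(struct perf_event *event)
        {
            void *buf;

            /* Claim the AUX space; handle.size is how much we may write
             * before overwriting data the consumer hasn't seen yet. */
            buf = perf_aux_output_begin(&my_handle, event);
            if (!buf)
                return;                          /* no AUX buffer or no room */

            my_hw_enable(buf, my_handle.size);   /* hypothetical hardware setup */
        }

        static void my_pmu_stop(struct perf_event *event)
        {
            /* buf is the pmu::setup_aux private data, also via perf_get_aux(). */
            void *buf = perf_get_aux(&my_handle);
            unsigned long written = my_hw_disable(buf);  /* hypothetical */

            /* Commit only what the hardware actually made visible. */
            perf_aux_output_end(&my_handle, written, false /* truncated */);
        }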

    Signed-off-by: Alexander Shishkin
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Borislav Petkov
    Cc: Frederic Weisbecker
    Cc: H. Peter Anvin
    Cc: Kaixu Xia
    Cc: Linus Torvalds
    Cc: Mike Galbraith
    Cc: Paul Mackerras
    Cc: Robert Richter
    Cc: Stephane Eranian
    Cc: Thomas Gleixner
    Cc: acme@infradead.org
    Cc: adrian.hunter@intel.com
    Cc: kan.liang@intel.com
    Cc: markus.t.metzger@intel.com
    Cc: mathieu.poirier@linaro.org
    Link: http://lkml.kernel.org/r/1421237903-181015-8-git-send-email-alexander.shishkin@linux.intel.com
    Signed-off-by: Ingo Molnar

    Alexander Shishkin
     
  • When there's new data in the AUX space, output a record indicating its
    offset and size and a set of flags, such as PERF_AUX_FLAG_TRUNCATED, to
    mean the described data was truncated to fit in the ring buffer.
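
    For reference, a sketch of the record layout (the struct name is
    hypothetical; the fields follow the uapi description of PERF_RECORD_AUX):

        struct perf_record_aux_sketch {
            struct perf_event_header header;    /* header.type == PERF_RECORD_AUX */
            __u64 aux_offset;                   /* where the new data starts in the AUX area */
            __u64 aux_size;                     /* how much new data there is */
            __u64 flags;                        /* e.g. PERF_AUX_FLAG_TRUNCATED */
            /* followed by struct sample_id when sample_id_all is set */
        };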

    Signed-off-by: Alexander Shishkin
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Arnaldo Carvalho de Melo
    Cc: Borislav Petkov
    Cc: Frederic Weisbecker
    Cc: H. Peter Anvin
    Cc: Kaixu Xia
    Cc: Linus Torvalds
    Cc: Mike Galbraith
    Cc: Paul Mackerras
    Cc: Robert Richter
    Cc: Stephane Eranian
    Cc: Thomas Gleixner
    Cc: adrian.hunter@intel.com
    Cc: kan.liang@intel.com
    Cc: markus.t.metzger@intel.com
    Cc: mathieu.poirier@linaro.org
    Link: http://lkml.kernel.org/r/1421237903-181015-7-git-send-email-alexander.shishkin@linux.intel.com
    Signed-off-by: Ingo Molnar

    Alexander Shishkin
     
  • Usually, pmus that do, for example, instruction tracing, would only ever
    be able to have one event per task per cpu (or per perf_event_context). For
    such pmus it makes sense to disallow creating conflicting events early on,
    so as to provide consistent behavior for the user.

    This patch adds a pmu capability that indicates such constraint on event
    creation.

    Signed-off-by: Alexander Shishkin
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Borislav Petkov
    Cc: Frederic Weisbecker
    Cc: H. Peter Anvin
    Cc: Kaixu Xia
    Cc: Linus Torvalds
    Cc: Mike Galbraith
    Cc: Paul Mackerras
    Cc: Robert Richter
    Cc: Stephane Eranian
    Cc: Thomas Gleixner
    Cc: acme@infradead.org
    Cc: adrian.hunter@intel.com
    Cc: kan.liang@intel.com
    Cc: markus.t.metzger@intel.com
    Cc: mathieu.poirier@linaro.org
    Link: http://lkml.kernel.org/r/1422613866-113186-1-git-send-email-alexander.shishkin@linux.intel.com
    Signed-off-by: Ingo Molnar

    Alexander Shishkin
     
  • For pmus that don't support scatter-gather for AUX data in hardware, it
    might still make sense to implement software double buffering to avoid
    losing data while the user is reading data out. For this purpose, add
    a pmu capability that guarantees multiple high-order chunks for AUX buffer,
    so that the pmu driver can do switchover tricks.

    To make use of this feature, add PERF_PMU_CAP_AUX_SW_DOUBLEBUF to your
    pmu's capability mask. This will make the ring buffer AUX allocation code
    ensure that the biggest high order allocation for the aux buffer pages is
    no bigger than half of the total requested buffer size, thus making sure
    that the buffer has at least two high order allocations.
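
    A minimal sketch of a hypothetical driver opting in (the double-buffer
    flag is the one named above; combining it with the no-scatter-gather
    capability from the same series is an assumption):

        static struct pmu my_pmu = {
            /* Ask the AUX allocator for at least two high-order chunks. */
            .capabilities = PERF_PMU_CAP_AUX_NO_SG | PERF_PMU_CAP_AUX_SW_DOUBLEBUF,
            /* ... the usual pmu callbacks ... */
        };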

    Signed-off-by: Alexander Shishkin
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Borislav Petkov
    Cc: Frederic Weisbecker
    Cc: H. Peter Anvin
    Cc: Kaixu Xia
    Cc: Linus Torvalds
    Cc: Mike Galbraith
    Cc: Paul Mackerras
    Cc: Robert Richter
    Cc: Stephane Eranian
    Cc: Thomas Gleixner
    Cc: acme@infradead.org
    Cc: adrian.hunter@intel.com
    Cc: kan.liang@intel.com
    Cc: markus.t.metzger@intel.com
    Cc: mathieu.poirier@linaro.org
    Link: http://lkml.kernel.org/r/1421237903-181015-5-git-send-email-alexander.shishkin@linux.intel.com
    Signed-off-by: Ingo Molnar

    Alexander Shishkin
     
  • Some pmus (such as BTS or Intel PT without multiple-entry ToPA capability)
    don't support scatter-gather and will prefer larger contiguous areas for
    their output regions.

    This patch adds a new pmu capability to request higher order allocations.

    Signed-off-by: Alexander Shishkin
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Borislav Petkov
    Cc: Frederic Weisbecker
    Cc: H. Peter Anvin
    Cc: Kaixu Xia
    Cc: Linus Torvalds
    Cc: Mike Galbraith
    Cc: Paul Mackerras
    Cc: Robert Richter
    Cc: Stephane Eranian
    Cc: Thomas Gleixner
    Cc: acme@infradead.org
    Cc: adrian.hunter@intel.com
    Cc: kan.liang@intel.com
    Cc: markus.t.metzger@intel.com
    Cc: mathieu.poirier@linaro.org
    Link: http://lkml.kernel.org/r/1421237903-181015-4-git-send-email-alexander.shishkin@linux.intel.com
    Signed-off-by: Ingo Molnar

    Alexander Shishkin
     
  • This patch introduces "AUX space" in the perf mmap buffer, intended for
    exporting high bandwidth data streams to userspace, such as instruction
    flow traces.

    AUX space is a ring buffer, defined by aux_{offset,size} fields in the
    user_page structure, and read/write pointers aux_{head,tail}, which abide
    by the same rules as data_* counterparts of the main perf buffer.

    In order to allocate/mmap AUX, userspace needs to set up aux_offset to
    such an offset that will be greater than data_offset+data_size and
    aux_size to be the desired buffer size. Both need to be page aligned.
    Then, the same aux_offset and aux_size should be passed to the mmap()
    call, and if everything adds up, you should have an AUX buffer as a result.

    Pages that are mapped into this buffer also come out of user's mlock
    rlimit plus perf_event_mlock_kb allowance.
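
    A minimal userspace sketch of the two-step mapping described above (fd is
    an already-opened perf event descriptor; the sizes are illustrative):

        #include <unistd.h>
        #include <sys/mman.h>
        #include <linux/perf_event.h>

        static void *map_aux(int fd, size_t data_pages, size_t aux_size)
        {
            size_t page = sysconf(_SC_PAGESIZE);
            struct perf_event_mmap_page *up;

            /* Map the user page plus the regular data area first. */
            up = mmap(NULL, (data_pages + 1) * page, PROT_READ | PROT_WRITE,
                      MAP_SHARED, fd, 0);
            if (up == MAP_FAILED)
                return NULL;

            /* Place the AUX area right after the data area, page aligned. */
            up->aux_offset = (data_pages + 1) * page;
            up->aux_size   = aux_size;

            return mmap(NULL, aux_size, PROT_READ | PROT_WRITE,
                        MAP_SHARED, fd, up->aux_offset);
        }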

    Signed-off-by: Peter Zijlstra (Intel)
    Acked-by: Alexander Shishkin
    Cc: Borislav Petkov
    Cc: Frederic Weisbecker
    Cc: H. Peter Anvin
    Cc: Kaixu Xia
    Cc: Linus Torvalds
    Cc: Mike Galbraith
    Cc: Paul Mackerras
    Cc: Robert Richter
    Cc: Stephane Eranian
    Cc: Thomas Gleixner
    Cc: acme@infradead.org
    Cc: adrian.hunter@intel.com
    Cc: kan.liang@intel.com
    Cc: markus.t.metzger@intel.com
    Cc: mathieu.poirier@linaro.org
    Link: http://lkml.kernel.org/r/1421237903-181015-3-git-send-email-alexander.shishkin@linux.intel.com
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • Currently, the actual perf ring buffer is one page into the mmap area,
    following the user page and the userspace follows this convention. This
    patch adds data_{offset,size} fields to user_page that can be used by
    userspace instead for locating perf data in the mmap area. This is also
    helpful when mapping existing or shared buffers if their size is not
    known in advance.

    Right now, it is made to follow the existing convention that

    data_offset == PAGE_SIZE and
    data_offset + data_size == mmap_size.
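
    A one-line sketch of the intended userspace use (instead of hard-coding
    the one-page offset):

        /* base is the start of the perf mmap area (the user page). */
        static inline void *perf_data_start(struct perf_event_mmap_page *up, void *base)
        {
            return (char *)base + up->data_offset;  /* data area spans data_size bytes */
        }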

    Signed-off-by: Alexander Shishkin
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Borislav Petkov
    Cc: Frederic Weisbecker
    Cc: H. Peter Anvin
    Cc: Kaixu Xia
    Cc: Linus Torvalds
    Cc: Mike Galbraith
    Cc: Paul Mackerras
    Cc: Robert Richter
    Cc: Stephane Eranian
    Cc: Thomas Gleixner
    Cc: acme@infradead.org
    Cc: adrian.hunter@intel.com
    Cc: kan.liang@intel.com
    Cc: markus.t.metzger@intel.com
    Cc: mathieu.poirier@linaro.org
    Link: http://lkml.kernel.org/r/1421237903-181015-2-git-send-email-alexander.shishkin@linux.intel.com
    Signed-off-by: Ingo Molnar

    Alexander Shishkin
     
  • BPF programs, attached to kprobes, provide a safe way to execute
    user-defined BPF byte-code programs without being able to crash or
    hang the kernel in any way. The BPF engine makes sure that such
    programs have a finite execution time and that they cannot break
    out of their sandbox.

    The user interface is to attach to a kprobe via the perf syscall:

        struct perf_event_attr attr = {
            .type   = PERF_TYPE_TRACEPOINT,
            .config = event_id,
            ...
        };

        event_fd = perf_event_open(&attr, ...);
        ioctl(event_fd, PERF_EVENT_IOC_SET_BPF, prog_fd);

    'prog_fd' is a file descriptor associated with a previously loaded BPF
    program.

    'event_id' is an ID of the kprobe created.

    Closing 'event_fd':

    close(event_fd);

    ... automatically detaches BPF program from it.

    BPF programs can call in-kernel helper functions to:

    - lookup/update/delete elements in maps

    - probe_read - wrapper of probe_kernel_read() used to access any
    kernel data structures

    BPF programs receive 'struct pt_regs *' as an input ('struct pt_regs' is
    architecture dependent) and return 0 to ignore the event and 1 to store
    kprobe event into the ring buffer.

    Note, kprobes are fundamentally _not_ a stable kernel ABI,
    so BPF programs attached to kprobes must be recompiled for
    every kernel version, and the user must supply the correct
    LINUX_VERSION_CODE in attr.kern_version during the bpf_prog_load() call.

    Signed-off-by: Alexei Starovoitov
    Reviewed-by: Steven Rostedt
    Reviewed-by: Masami Hiramatsu
    Cc: Andrew Morton
    Cc: Arnaldo Carvalho de Melo
    Cc: Arnaldo Carvalho de Melo
    Cc: Daniel Borkmann
    Cc: David S. Miller
    Cc: Jiri Olsa
    Cc: Linus Torvalds
    Cc: Namhyung Kim
    Cc: Peter Zijlstra
    Cc: Peter Zijlstra
    Link: http://lkml.kernel.org/r/1427312966-8434-4-git-send-email-ast@plumgrid.com
    Signed-off-by: Ingo Molnar

    Alexei Starovoitov
     

27 Mar, 2015

4 commits

  • While thinking on the whole clock discussion it occurred to me we have
    two distinct uses of time:

    1) the tracking of event/ctx/cgroup enabled/running/stopped times
    which includes the self-monitoring support in struct
    perf_event_mmap_page.

    2) the actual timestamps visible in the data records.

    And we've been conflating them.

    The first is all about tracking time deltas; nobody should really care
    in what time base that happens, it's all relative information, and as
    long as it's internally consistent it works.

    The second however is what people are worried about when having to
    merge their data with external sources. And here we have the
    discussion on MONOTONIC vs MONOTONIC_RAW etc..

    Where MONOTONIC is good for correlating between machines (static
    offset), MONOTONIC_RAW is required for correlating against a fixed rate
    hardware clock.

    This means configurability; now 1) makes that hard because it needs to
    be internally consistent across groups of unrelated events, which is
    why we had to have a global perf_clock().

    However, for 2) it doesn't really matter, perf itself doesn't care
    what it writes into the buffer.

    The below patch makes the distinction between these two cases by
    adding perf_event_clock() which is used for the second case. It
    further makes this configurable on a per-event basis, but adds a few
    sanity checks such that we cannot combine events with different clocks
    in confusing ways.

    And since we then have per-event configurability we might as well
    retain the 'legacy' behaviour as a default.
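
    A minimal sketch of the resulting per-event configurability from the
    userspace side (field names as introduced by this patch; the remaining
    setup is assumed):

        #include <string.h>
        #include <time.h>
        #include <linux/perf_event.h>

        static void request_monotonic_raw(struct perf_event_attr *attr)
        {
            memset(attr, 0, sizeof(*attr));
            attr->size        = sizeof(*attr);
            attr->use_clockid = 1;                      /* opt out of the legacy perf clock */
            attr->clockid     = CLOCK_MONOTONIC_RAW;    /* clock used for data record timestamps */
            /* type/config and perf_event_open() as usual */
        }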

    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Andrew Morton
    Cc: Arnaldo Carvalho de Melo
    Cc: David Ahern
    Cc: Jiri Olsa
    Cc: John Stultz
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Stephane Eranian
    Cc: Thomas Gleixner
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • While looking at some fuzzer output I noticed that we do not hold any
    locks on leader->ctx and therefore the sibling_list iteration is
    unsafe.

    Acquire the relevant ctx->mutex before calling into the pmu specific
    code.

    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Vince Weaver
    Cc: Jiri Olsa
    Cc: Sasha Levin
    Link: http://lkml.kernel.org/r/20150225151639.GL5029@twins.programming.kicks-ass.net
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • Signed-off-by: Ingo Molnar

    Ingo Molnar
     
  • Signed-off-by: Ingo Molnar

    Ingo Molnar
     

23 Mar, 2015

2 commits

  • The only reason CQM had to use a hard-coded pmu type was so it could use
    cqm_target in hw_perf_event.

    Do away with the {tp,bp,cqm}_target pointers and provide a non type
    specific one.

    This allows us to do away with that silly pmu type as well.

    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Vince Weaver
    Cc: acme@kernel.org
    Cc: acme@redhat.com
    Cc: hpa@zytor.com
    Cc: jolsa@redhat.com
    Cc: kanaka.d.juvva@intel.com
    Cc: matt.fleming@intel.com
    Cc: tglx@linutronix.de
    Cc: torvalds@linux-foundation.org
    Cc: vikas.shivappa@linux.intel.com
    Link: http://lkml.kernel.org/r/20150305211019.GU21418@twins.programming.kicks-ass.net
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • Vince reported a watchdog lockup like:

    [] perf_tp_event+0xc4/0x210
    [] perf_trace_lock+0x12a/0x160
    [] lock_release+0x130/0x260
    [] _raw_spin_unlock_irqrestore+0x24/0x40
    [] do_send_sig_info+0x5d/0x80
    [] send_sigio_to_task+0x12f/0x1a0
    [] send_sigio+0xae/0x100
    [] kill_fasync+0x97/0xf0
    [] perf_event_wakeup+0xd4/0xf0
    [] perf_pending_event+0x33/0x60
    [] irq_work_run_list+0x4c/0x80
    [] irq_work_run+0x18/0x40
    [] smp_trace_irq_work_interrupt+0x3f/0xc0
    [] trace_irq_work_interrupt+0x6d/0x80

    This is caused by an irq_work generating new irq_work and therefore
    not allowing forward progress.

    This happens because processing the perf irq_work triggers another
    perf event (tracepoint stuff) which in turn generates an irq_work ad
    infinitum.

    Avoid this by raising the recursion counter in the irq_work -- which
    effectively disables all software events (including tracepoints) from
    actually triggering again.

    Reported-by: Vince Weaver
    Tested-by: Vince Weaver
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Arnaldo Carvalho de Melo
    Cc: Jiri Olsa
    Cc: Paul Mackerras
    Cc: Steven Rostedt
    Cc:
    Link: http://lkml.kernel.org/r/20150219170311.GH21418@twins.programming.kicks-ass.net
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     

13 Mar, 2015

1 commit

  • Commit:

    a83fe28e2e45 ("perf: Fix put_event() ctx lock")

    changed the locking logic in put_event() by replacing mutex_lock_nested()
    with perf_event_ctx_lock_nested(), but didn't fix the subsequent
    mutex_unlock() with a correct counterpart, perf_event_ctx_unlock().

    Contexts are thus leaked as a result of incremented refcount
    in perf_event_ctx_lock_nested().
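
    A sketch of the corrected pairing (not the full put_event() body; the
    surrounding teardown is elided):

        static void put_event_ctx_sketch(struct perf_event *event)
        {
            struct perf_event_context *ctx;

            ctx = perf_event_ctx_lock_nested(event, SINGLE_DEPTH_NESTING);
            /* ... detach and tear down the event ... */
            perf_event_ctx_unlock(event, ctx);  /* was: mutex_unlock(&ctx->mutex) */
        }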

    Signed-off-by: Leon Yu
    Cc: Arnaldo Carvalho de Melo
    Cc: Paul Mackerras
    Cc: Peter Zijlstra
    Fixes: a83fe28e2e45 ("perf: Fix put_event() ctx lock")
    Link: http://lkml.kernel.org/r/1424954613-5034-1-git-send-email-chianglungyu@gmail.com
    Signed-off-by: Ingo Molnar

    Leon Yu
     

03 Mar, 2015

1 commit

  • This reverts commit 74390aa55678 ("perf: Remove the extra validity check
    on nr_pages").

    nr_pages equals the number of pages - 1 in perf_mmap, so nr_pages = 0 is
    valid.

    So the full nr_pages != 0 && !is_power_of_2(nr_pages) check is
    needed. Otherwise, for example, perf test 6 fails:

    # perf test 6
    6: x86 rdpmc test :Error:
    mmap() syscall returned with (Invalid argument)
    FAILED!
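
    The restored check, as a sketch of the relevant lines in perf_mmap()
    (nr_pages == 0, i.e. mapping only the user page for rdpmc-style
    self-monitoring, stays valid):

        if (nr_pages != 0 && !is_power_of_2(nr_pages))
            return -EINVAL;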

    Signed-off-by: Kan Liang
    Cc: Andi Kleen
    Cc: Kaixu Xia
    Cc: Peter Zijlstra
    Link: http://lkml.kernel.org/r/1425280466-7830-1-git-send-email-kan.liang@intel.com
    Signed-off-by: Arnaldo Carvalho de Melo

    Kan Liang
     

26 Feb, 2015

1 commit


25 Feb, 2015

4 commits

  • Add support for task events as well as system-wide events. This change
    has a big impact on the way that we gather LLC occupancy values in
    intel_cqm_event_read().

    Currently, for system-wide (per-cpu) events we defer processing to
    userspace which knows how to discard all but one cpu result per package.

    Things aren't so simple for task events because we need to do the value
    aggregation ourselves. To do this, we defer updating the LLC occupancy
    value in event->count from intel_cqm_event_read() and do an SMP
    cross-call to read values for all packages in intel_cqm_event_count().
    We need to ensure that we only do this for one task event per cache
    group, otherwise we'll report duplicate values.

    If we're a system-wide event we want to fall back to the default
    perf_event_count() implementation. Refactor this into a common function
    so that we don't duplicate the code.

    Also, introduce PERF_TYPE_INTEL_CQM, since we need a way to track an
    event's task (if the event isn't per-cpu) inside of the Intel CQM PMU
    driver. This task information is only available in the upper layers of
    the perf infrastructure.

    Other perf backends stash the target task in event->hw.*target so we
    need to do something similar. The task is used to determine whether
    events should share a cache group and an RMID.

    Signed-off-by: Matt Fleming
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Arnaldo Carvalho de Melo
    Cc: Arnaldo Carvalho de Melo
    Cc: H. Peter Anvin
    Cc: Jiri Olsa
    Cc: Kanaka Juvva
    Cc: Linus Torvalds
    Cc: Vikas Shivappa
    Cc: linux-api@vger.kernel.org
    Link: http://lkml.kernel.org/r/1422038748-21397-8-git-send-email-matt@codeblueprint.co.uk
    Signed-off-by: Ingo Molnar

    Matt Fleming
     
  • The Intel QoS PMU needs to know whether an event is part of a cgroup
    during ->event_init(), because tasks in the same cgroup share a
    monitoring ID.

    Move the cgroup initialisation before calling into the PMU driver.

    Signed-off-by: Matt Fleming
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Arnaldo Carvalho de Melo
    Cc: Arnaldo Carvalho de Melo
    Cc: H. Peter Anvin
    Cc: Jiri Olsa
    Cc: Kanaka Juvva
    Cc: Linus Torvalds
    Cc: Vikas Shivappa
    Link: http://lkml.kernel.org/r/1422038748-21397-4-git-send-email-matt@codeblueprint.co.uk
    Signed-off-by: Ingo Molnar

    Matt Fleming
     
  • For PMU drivers that record per-package counters, the ->count variable
    cannot be used to record an accurate aggregated value, since it's not
    possible to perform SMP cross-calls to cpus on other packages from the
    context in which we update ->count.

    Introduce a new optional ->count() accessor function that can be used to
    customize how values are collected. If a PMU driver doesn't provide a
    ->count() function, we fall back to the existing code.

    There is necessarily a window of staleness with this approach because
    the task that generated the counter value may not have been scheduled by
    the cpu recently.

    An alternative and more complex approach would be to use a hrtimer to
    periodically refresh the values from a more permissive scheduling
    context. So, we're trading off complexity for accuracy.
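
    A sketch of how the core read path might use the new accessor (the
    fallback expression is the existing perf_event_count() logic; details
    assumed):

        static u64 read_event_count(struct perf_event *event)
        {
            if (event->pmu->count)
                return event->pmu->count(event);        /* PMU aggregates, e.g. per package */

            return local64_read(&event->count) +
                   atomic64_read(&event->child_count);  /* existing default */
        }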

    Signed-off-by: Matt Fleming
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Arnaldo Carvalho de Melo
    Cc: Arnaldo Carvalho de Melo
    Cc: H. Peter Anvin
    Cc: Jiri Olsa
    Cc: Kanaka Juvva
    Cc: Linus Torvalds
    Cc: Vikas Shivappa
    Link: http://lkml.kernel.org/r/1422038748-21397-3-git-send-email-matt@codeblueprint.co.uk
    Signed-off-by: Ingo Molnar

    Matt Fleming
     
  • Move perf_cgroup_from_task() from kernel/events/ to include/linux/ along
    with the necessary struct definitions, so that it can be used by the PMU
    code.

    When the upcoming Intel Cache Monitoring PMU driver assigns monitoring
    IDs to perf events, it needs to be able to check whether any two
    monitoring events overlap (say, a cgroup and task event), which means we
    need to be able to lookup the cgroup associated with a task (if any).

    Signed-off-by: Matt Fleming
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Arnaldo Carvalho de Melo
    Cc: Arnaldo Carvalho de Melo
    Cc: H. Peter Anvin
    Cc: Jiri Olsa
    Cc: Kanaka Juvva
    Cc: Linus Torvalds
    Cc: Paul Mackerras
    Cc: Vikas Shivappa
    Link: http://lkml.kernel.org/r/1422038748-21397-2-git-send-email-matt@codeblueprint.co.uk
    Signed-off-by: Ingo Molnar

    Matt Fleming
     

19 Feb, 2015

7 commits

  • …/acme/linux into perf/core

    Pull perf/core improvements and fixes from Arnaldo Carvalho de Melo:

    User visible changes:

    - No need to explicitly enable evsels for a workload started from perf; let it
    be enabled via perf_event_attr.enable_on_exec, removing some events that take
    place in the 'perf trace' before a workload is really started by it.
    (Arnaldo Carvalho de Melo)

    - Fix to handle optimized not-inlined functions in 'perf probe' (Masami Hiramatsu)

    - Update 'perf probe' man page (Masami Hiramatsu)

    - 'perf trace': Allow mixing with tracepoints and suppressing plain syscalls
    (Arnaldo Carvalho de Melo)

    Infrastructure changes:

    - Introduce {trace_seq_do,event_format_}_fprintf functions to allow
    a default tracepoint field list printer to be used in tools that allows
    redirecting output to a file. (Arnaldo Carvalho de Melo)

    - The man page for pthread_attr_setaffinity_np states that _GNU_SOURCE
    must be defined before pthread.h, do it to fix the build in some
    systems (Josh Boyer)

    - Cleanups in 'perf buildid-cache' (Masami Hiramatsu)

    - Fix dso cache test case (Namhyung Kim)

    - Do not rely on dso__data_read_offset() to open DSO (Namhyung Kim)

    - Make perf aware of tracefs (Steven Rostedt).

    - Fix build by defining STT_GNU_IFUNC for glibc 2.9 and older (Vinson Lee)

    - AArch64 symbol resolution fixes (Victor Kamensky)

    - Kconfig beachhead (Jiri Olsa)

    - Simplify nr_pages validity (Kaixu Xia)

    - Fixup header positioning in 'perf list' (Yunlong Song)

    Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
    Signed-off-by: Ingo Molnar <mingo@kernel.org>

    Ingo Molnar
     
  • Use event->attr.branch_sample_type to replace
    intel_pmu_needs_lbr_smpl(), avoiding duplicated code that
    implicitly enables the LBR.

    Currently, branch stack can be enabled by user explicitly requesting
    branch sampling or implicit branch sampling to correct PEBS skid.

    For explicitly user-requested branch sampling, the branch_sample_type
    is explicitly set by the user. For the PEBS case, the branch_sample_type
    is also implicitly set to PERF_SAMPLE_BRANCH_ANY in x86_pmu_hw_config.

    Signed-off-by: Yan, Zheng
    Signed-off-by: Kan Liang
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Arnaldo Carvalho de Melo
    Cc: Linus Torvalds
    Cc: Paul Mackerras
    Cc: eranian@google.com
    Cc: jolsa@redhat.com
    Link: http://lkml.kernel.org/r/1415156173-10035-11-git-send-email-kan.liang@intel.com
    Signed-off-by: Ingo Molnar

    Yan, Zheng
     
  • If two tasks were both forked from the same parent task, the events in
    their perf task contexts can be the same, and the perf core may skip
    switching the perf event contexts.

    The previous patch introduces pmu-specific data. The data is for saving
    the LBR stack and is task specific, so we need to switch the data
    even when the context switch is optimized out.

    Signed-off-by: Yan, Zheng
    Signed-off-by: Kan Liang
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Arnaldo Carvalho de Melo
    Cc: Linus Torvalds
    Cc: Paul Mackerras
    Cc: eranian@google.com
    Cc: jolsa@redhat.com
    Link: http://lkml.kernel.org/r/1415156173-10035-7-git-send-email-kan.liang@intel.com
    Signed-off-by: Ingo Molnar

    Yan, Zheng
     
  • Introduce a new flag PERF_ATTACH_TASK_DATA for a perf event's attach
    state. The flag is set by the PMU's event_init() callback; it indicates
    that the perf event needs PMU-specific data.

    The PMU-specific data is initialized to zeros. Later patches will
    use the PMU-specific data to save the LBR stack.

    Signed-off-by: Yan, Zheng
    Signed-off-by: Kan Liang
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Arnaldo Carvalho de Melo
    Cc: Linus Torvalds
    Cc: Paul Mackerras
    Cc: eranian@google.com
    Cc: jolsa@redhat.com
    Link: http://lkml.kernel.org/r/1415156173-10035-6-git-send-email-kan.liang@intel.com
    Signed-off-by: Ingo Molnar

    Yan, Zheng
     
  • The previous commit introduces a context switch callback whose function
    overlaps with the flush branch stack callback. So we can use the
    context switch callback to flush the LBR stack.

    This patch adds code that uses the context switch callback to
    flush the LBR stack when a task is being scheduled in. The callback
    is enabled only when there are events that use the LBR hardware. This
    patch also removes all the old flush branch stack code.

    Signed-off-by: Yan, Zheng
    Signed-off-by: Kan Liang
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Andy Lutomirski
    Cc: Arnaldo Carvalho de Melo
    Cc: Linus Torvalds
    Cc: Paul Mackerras
    Cc: Vince Weaver
    Cc: eranian@google.com
    Cc: jolsa@redhat.com
    Link: http://lkml.kernel.org/r/1415156173-10035-4-git-send-email-kan.liang@intel.com
    Signed-off-by: Ingo Molnar

    Yan, Zheng
     
  • The callback is invoked when a process is scheduled in or out.
    It provides a mechanism for later patches to save/store the LBR
    stack. For the schedule-in case, the callback is invoked at
    the same place that the flush branch stack callback is invoked,
    so it can also replace the flush branch stack callback. To
    avoid unnecessary overhead, the callback is enabled only when
    there are events that use the LBR stack.

    Signed-off-by: Yan, Zheng
    Signed-off-by: Kan Liang
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Andy Lutomirski
    Cc: Arnaldo Carvalho de Melo
    Cc: Linus Torvalds
    Cc: Paul Mackerras
    Cc: Vince Weaver
    Cc: eranian@google.com
    Cc: jolsa@redhat.com
    Link: http://lkml.kernel.org/r/1415156173-10035-3-git-send-email-kan.liang@intel.com
    Signed-off-by: Ingo Molnar

    Yan, Zheng
     
  • For hardware events, the userspace page of the event gets updated in
    context switches, so if we read the timestamp in the page, we get
    fresh info.

    For software events, this is missing currently. This patch makes the
    behavior consistent.

    With this patch, we can implement clock_gettime(THREAD_CPUTIME) with
    PERF_COUNT_SW_DUMMY in userspace as suggested by Andy and Peter. Code
    like this:

        if (pc->cap_user_time) {
            do {
                seq = pc->lock;
                barrier();

                running = pc->time_running;
                cyc = rdtsc();
                time_mult = pc->time_mult;
                time_shift = pc->time_shift;
                time_offset = pc->time_offset;

                barrier();
            } while (pc->lock != seq);

            quot = (cyc >> time_shift);
            rem = cyc & ((1 << time_shift) - 1);
            delta = time_offset + quot * time_mult +
                    ((rem * time_mult) >> time_shift);

            running += delta;
            return running;
        }

    I tried it on a busy system; the userspace page updating doesn't
    have noticeable overhead.

    Signed-off-by: Shaohua Li
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Andy Lutomirski
    Cc: Arnaldo Carvalho de Melo
    Cc: Arnaldo Carvalho de Melo
    Cc: Frederic Weisbecker
    Cc: Linus Torvalds
    Link: http://lkml.kernel.org/r/aa2dd2e4f1e9f2225758be5ba00f14d6909a8ce1.1423180257.git.shli@fb.com
    [ Improved the changelog. ]
    Signed-off-by: Ingo Molnar

    Shaohua Li