14 Sep, 2015
4 commits
-
commit c7999c6f3fed9e383d3131474588f282ae6d56b9 upstream.
I ran the perf fuzzer, which triggered some WARN()s which are due to
trying to stop/restart an event on the wrong CPU.Use the normal IPI pattern to ensure we run the code on the correct CPU.
Signed-off-by: Peter Zijlstra (Intel)
Cc: Vince Weaver
Cc: Linus Torvalds
Cc: Peter Zijlstra
Cc: Thomas Gleixner
Fixes: bad7192b842c ("perf: Fix PERF_EVENT_IOC_PERIOD to force-reset the period")
Signed-off-by: Ingo Molnar
Signed-off-by: Greg Kroah-Hartman -
commit ee9397a6fb9bc4e52677f5e33eed4abee0f515e6 upstream.
If rb->aux_refcount is decremented to zero before rb->refcount,
__rb_free_aux() may be called twice resulting in a double free of
rb->aux_pages. Fix this by adding a check to __rb_free_aux().Signed-off-by: Ben Hutchings
Signed-off-by: Peter Zijlstra (Intel)
Cc: Alexander Shishkin
Cc: Arnaldo Carvalho de Melo
Cc: Linus Torvalds
Cc: Peter Zijlstra
Cc: Thomas Gleixner
Fixes: 57ffc5ca679f ("perf: Fix AUX buffer refcounting")
Link: http://lkml.kernel.org/r/1437953468.12842.17.camel@decadent.org.uk
Signed-off-by: Ingo Molnar
Signed-off-by: Greg Kroah-Hartman -
commit 00a2916f7f82c348a2a94dbb572874173bc308a3 upstream.
A recent fix to the shadow timestamp inadvertly broke the running time
accounting.We must not update the running timestamp if we fail to schedule the
event, the event will not have ran. This can (and did) result in
negative total runtime because the stopped timestamp was before the
running timestamp (we 'started' but never stopped the event -- because
it never really started we didn't have to stop it either).Reported-and-Tested-by: Vince Weaver
Fixes: 72f669c0086f ("perf: Update shadow timestamp before add event")
Signed-off-by: Peter Zijlstra (Intel)
Cc: Shaohua Li
Signed-off-by: Thomas Gleixner
Signed-off-by: Greg Kroah-Hartman -
commit fed66e2cdd4f127a43fd11b8d92a99bdd429528c upstream.
Vince reported that the fasync signal stuff doesn't work proper for
inherited events. So fix that.Installing fasync allocates memory and sets filp->f_flags |= FASYNC,
which upon the demise of the file descriptor ensures the allocation is
freed and state is updated.Now for perf, we can have the events stick around for a while after the
original FD is dead because of references from child events. So we
cannot copy the fasync pointer around. We can however consistently use
the parent's fasync, as that will be updated.Reported-and-Tested-by: Vince Weaver
Signed-off-by: Peter Zijlstra (Intel)
Cc: Arnaldo Carvalho deMelo
Cc: Linus Torvalds
Cc: Peter Zijlstra
Cc: Thomas Gleixner
Cc: eranian@google.com
Link: http://lkml.kernel.org/r/1434011521.1495.71.camel@twins
Signed-off-by: Ingo Molnar
Signed-off-by: Greg Kroah-Hartman
30 Jun, 2015
1 commit
-
commit 2f993cf093643b98477c421fa2b9a98dcc940323 upstream.
While looking for other users of get_state/cond_sync. I Found
ring_buffer_attach() and it looks obviously buggy?Don't we need to ensure that we have "synchronize" _between_
list_del() and list_add() ?IOW. Suppose that ring_buffer_attach() preempts right_after
get_state_synchronize_rcu() and gp completes before spin_lock().In this case cond_synchronize_rcu() does nothing and we reuse
->rb_entry without waiting for gp in between?It also moves the ->rcu_pending check under "if (rb)", to make it
more readable imo.Signed-off-by: Oleg Nesterov
Signed-off-by: Peter Zijlstra (Intel)
Cc: Alexander Shishkin
Cc: Andrew Morton
Cc: Andy Lutomirski
Cc: Borislav Petkov
Cc: Brian Gerst
Cc: Denys Vlasenko
Cc: H. Peter Anvin
Cc: Linus Torvalds
Cc: Paul E. McKenney
Cc: Peter Zijlstra
Cc: Thomas Gleixner
Cc: dave@stgolabs.net
Cc: der.herr@hofr.at
Cc: josh@joshtriplett.org
Cc: tj@kernel.org
Fixes: b69cf53640da ("perf: Fix a race between ring_buffer_detach() and ring_buffer_attach()")
Link: http://lkml.kernel.org/r/20150530200425.GA15748@redhat.com
Signed-off-by: Ingo Molnar
Signed-off-by: Greg Kroah-Hartman
27 May, 2015
2 commits
-
PMUs that don't support hardware scatter tables require big contiguous
chunks of memory and a PMI to switch between them. However, in overwrite
using a PMI for this purpose adds extra overhead that the users would
like to avoid. Thus, in overwrite mode for such PMUs we can only allow
one contiguous chunk for the entire requested buffer.This patch changes the behavior accordingly, so that if the buddy allocator
fails to come up with a single high-order chunk for the entire requested
buffer, the allocation will fail.Signed-off-by: Alexander Shishkin
Signed-off-by: Peter Zijlstra (Intel)
Cc: Linus Torvalds
Cc: Paul Mackerras
Cc: Peter Zijlstra
Cc: Thomas Gleixner
Cc: acme@infradead.org
Cc: adrian.hunter@intel.com
Cc: hpa@zytor.com
Link: http://lkml.kernel.org/r/1432308626-18845-2-git-send-email-alexander.shishkin@linux.intel.com
Signed-off-by: Ingo Molnar -
there is a race between perf_event_free_bpf_prog() and free_trace_kprobe():
__free_event()
event->destroy(event)
tp_perf_event_destroy()
perf_trace_destroy()
perf_trace_event_unreg()which is dropping event->tp_event->perf_refcount and allows to proceed in:
unregister_trace_kprobe()
unregister_kprobe_event()
trace_remove_event_call()
probe_remove_event_call()
free_trace_kprobe()while __free_event does:
call_rcu(&event->rcu_head, free_event_rcu);
free_event_rcu()
perf_event_free_bpf_prog()To fix the race simply move perf_event_free_bpf_prog() before
event->destroy(), since event->tp_event is still valid at that point.Note, perf_trace_destroy() is not racing with trace_remove_event_call()
since they both grab event_mutex.Reported-by: Wang Nan
Signed-off-by: Alexei Starovoitov
Signed-off-by: Peter Zijlstra (Intel)
Cc: Arnaldo Carvalho de Melo
Cc: Linus Torvalds
Cc: Peter Zijlstra
Cc: Thomas Gleixner
Cc: lizefan@huawei.com
Cc: pi3orama@163.com
Fixes: 2541517c32be ("tracing, perf: Implement BPF programs attached to kprobes")
Link: http://lkml.kernel.org/r/1431717321-28772-1-git-send-email-ast@plumgrid.com
Signed-off-by: Ingo Molnar
08 May, 2015
1 commit
-
While fuzzing Sasha tripped over another ctx->mutex recursion lockdep
splat. Annotate this.Reported-by: Sasha Levin
Signed-off-by: Peter Zijlstra (Intel)
Cc: Borislav Petkov
Cc: H. Peter Anvin
Cc: Jiri Olsa
Cc: Linus Torvalds
Cc: Thomas Gleixner
Cc: Vince Weaver
Signed-off-by: Ingo Molnar
16 Apr, 2015
1 commit
-
Pull networking updates from David Miller:
1) Add BQL support to via-rhine, from Tino Reichardt.
2) Integrate SWITCHDEV layer support into the DSA layer, so DSA drivers
can support hw switch offloading. From Floria Fainelli.3) Allow 'ip address' commands to initiate multicast group join/leave,
from Madhu Challa.4) Many ipv4 FIB lookup optimizations from Alexander Duyck.
5) Support EBPF in cls_bpf classifier and act_bpf action, from Daniel
Borkmann.6) Remove the ugly compat support in ARP for ugly layers like ax25,
rose, etc. And use this to clean up the neigh layer, then use it to
implement MPLS support. All from Eric Biederman.7) Support L3 forwarding offloading in switches, from Scott Feldman.
8) Collapse the LOCAL and MAIN ipv4 FIB tables when possible, to speed
up route lookups even further. From Alexander Duyck.9) Many improvements and bug fixes to the rhashtable implementation,
from Herbert Xu and Thomas Graf. In particular, in the case where
an rhashtable user bulk adds a large number of items into an empty
table, we expand the table much more sanely.10) Don't make the tcp_metrics hash table per-namespace, from Eric
Biederman.11) Extend EBPF to access SKB fields, from Alexei Starovoitov.
12) Split out new connection request sockets so that they can be
established in the main hash table. Much less false sharing since
hash lookups go direct to the request sockets instead of having to
go first to the listener then to the request socks hashed
underneath. From Eric Dumazet.13) Add async I/O support for crytpo AF_ALG sockets, from Tadeusz Struk.
14) Support stable privacy address generation for RFC7217 in IPV6. From
Hannes Frederic Sowa.15) Hash network namespace into IP frag IDs, also from Hannes Frederic
Sowa.16) Convert PTP get/set methods to use 64-bit time, from Richard
Cochran.* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1816 commits)
fm10k: Bump driver version to 0.15.2
fm10k: corrected VF multicast update
fm10k: mbx_update_max_size does not drop all oversized messages
fm10k: reset head instead of calling update_max_size
fm10k: renamed mbx_tx_dropped to mbx_tx_oversized
fm10k: update xcast mode before synchronizing multicast addresses
fm10k: start service timer on probe
fm10k: fix function header comment
fm10k: comment next_vf_mbx flow
fm10k: don't handle mailbox events in iov_event path and always process mailbox
fm10k: use separate workqueue for fm10k driver
fm10k: Set PF queues to unlimited bandwidth during virtualization
fm10k: expose tx_timeout_count as an ethtool stat
fm10k: only increment tx_timeout_count in Tx hang path
fm10k: remove extraneous "Reset interface" message
fm10k: separate PF only stats so that VF does not display them
fm10k: use hw->mac.max_queues for stats
fm10k: only show actual queues, not the maximum in hardware
fm10k: allow creation of VLAN on default vid
fm10k: fix unused warnings
...
02 Apr, 2015
11 commits
-
For counters that generate AUX data that is bound to the context of a
running task, such as instruction tracing, the decoder needs to know
exactly which task is running when the event is first scheduled in,
before the first sched_switch. The decoder's need to know this stems
from the fact that instruction flow trace decoding will almost always
require program's object code in order to reconstruct said flow and
for that we need at least its pid/tid in the perf stream.To single out such instruction tracing pmus, this patch introduces
ITRACE PMU capability. The reason this is not part of RECORD_AUX
record is that not all pmus capable of generating AUX data need this,
and the opposite is *probably* also true.While sched_switch covers for most cases, there are two problems with it:
the consumer will need to process events out of order (that is, having
found RECORD_AUX, it will have to skip forward to the nearest sched_switch
to figure out which task it was, then go back to the actual trace to
decode it) and it completely misses the case when the tracing is enabled
and disabled before sched_switch, for example, via PERF_EVENT_IOC_DISABLE.Signed-off-by: Alexander Shishkin
Signed-off-by: Peter Zijlstra (Intel)
Cc: Borislav Petkov
Cc: Frederic Weisbecker
Cc: H. Peter Anvin
Cc: Kaixu Xia
Cc: Linus Torvalds
Cc: Mike Galbraith
Cc: Paul Mackerras
Cc: Robert Richter
Cc: Stephane Eranian
Cc: Thomas Gleixner
Cc: acme@infradead.org
Cc: adrian.hunter@intel.com
Cc: kan.liang@intel.com
Cc: markus.t.metzger@intel.com
Cc: mathieu.poirier@linaro.org
Link: http://lkml.kernel.org/r/1421237903-181015-15-git-send-email-alexander.shishkin@linux.intel.com
Signed-off-by: Ingo Molnar -
When AUX area gets a certain amount of new data, we want to wake up
userspace to collect it. This adds a new control to specify how much
data will cause a wakeup. This is then passed down to pmu drivers via
output handle's "wakeup" field, so that the driver can find the nearest
point where it can generate an interrupt.We repurpose __reserved_2 in the event attribute for this, even though
it was never checked to be zero before, aux_watermark will only matter
for new AUX-aware code, so the old code should still be fine.Signed-off-by: Alexander Shishkin
Signed-off-by: Peter Zijlstra (Intel)
Cc: Borislav Petkov
Cc: Frederic Weisbecker
Cc: H. Peter Anvin
Cc: Kaixu Xia
Cc: Linus Torvalds
Cc: Mike Galbraith
Cc: Paul Mackerras
Cc: Robert Richter
Cc: Stephane Eranian
Cc: Thomas Gleixner
Cc: acme@infradead.org
Cc: adrian.hunter@intel.com
Cc: kan.liang@intel.com
Cc: markus.t.metzger@intel.com
Cc: mathieu.poirier@linaro.org
Link: http://lkml.kernel.org/r/1421237903-181015-10-git-send-email-alexander.shishkin@linux.intel.com
Signed-off-by: Ingo Molnar -
This adds support for overwrite mode in the AUX area, which means "keep
collecting data till you're stopped", turning AUX area into a circular
buffer, where new data overwrites old data. It does not depend on data
buffer's overwrite mode, so that it doesn't lose sideband data that is
instrumental for processing AUX data.Overwrite mode is enabled at mapping AUX area read only. Even though
aux_tail in the buffer's user page might be user writable, it will be
ignored in this mode.A PERF_RECORD_AUX with PERF_AUX_FLAG_OVERWRITE set is written to the perf
data stream every time an event writes new data to the AUX area. The pmu
driver might not be able to infer the exact beginning of the new data in
each snapshot, some drivers will only provide the tail, which is
aux_offset + aux_size in the AUX record. Consumer has to be able to tell
the new data from the old one, for example, by means of time stamps if
such are provided in the trace.Consumer is also responsible for disabling any events that might write
to the AUX area (thus potentially racing with the consumer) before
collecting the data.Signed-off-by: Alexander Shishkin
Signed-off-by: Peter Zijlstra (Intel)
Cc: Borislav Petkov
Cc: Frederic Weisbecker
Cc: H. Peter Anvin
Cc: Kaixu Xia
Cc: Linus Torvalds
Cc: Mike Galbraith
Cc: Paul Mackerras
Cc: Robert Richter
Cc: Stephane Eranian
Cc: Thomas Gleixner
Cc: acme@infradead.org
Cc: adrian.hunter@intel.com
Cc: kan.liang@intel.com
Cc: markus.t.metzger@intel.com
Cc: mathieu.poirier@linaro.org
Link: http://lkml.kernel.org/r/1421237903-181015-9-git-send-email-alexander.shishkin@linux.intel.com
Signed-off-by: Ingo Molnar -
For pmus that wish to write data to ring buffer's AUX area, provide
perf_aux_output_{begin,end}() calls to initiate/commit data writes,
similarly to perf_output_{begin,end}. These also use the same output
handle structure. Also, similarly to software counterparts, these
will direct inherited events' output to parents' ring buffers.After the perf_aux_output_begin() returns successfully, handle->size
is set to the maximum amount of data that can be written wrt aux_tail
pointer, so that no data that the user hasn't seen will be overwritten,
therefore this should always be called before hardware writing is
enabled. On success, this will return the pointer to pmu driver's
private structure allocated for this aux area by pmu::setup_aux. Same
pointer can also be retrieved using perf_get_aux() while hardware
writing is enabled.PMU driver should pass the actual amount of data written as a parameter
to perf_aux_output_end(). All hardware writes should be completed and
visible before this one is called.Additionally, perf_aux_output_skip() will adjust output handle and
aux_head in case some part of the buffer has to be skipped over to
maintain hardware's alignment constraints.Nested writers are forbidden and guards are in place to catch such
attempts.Signed-off-by: Alexander Shishkin
Signed-off-by: Peter Zijlstra (Intel)
Cc: Borislav Petkov
Cc: Frederic Weisbecker
Cc: H. Peter Anvin
Cc: Kaixu Xia
Cc: Linus Torvalds
Cc: Mike Galbraith
Cc: Paul Mackerras
Cc: Robert Richter
Cc: Stephane Eranian
Cc: Thomas Gleixner
Cc: acme@infradead.org
Cc: adrian.hunter@intel.com
Cc: kan.liang@intel.com
Cc: markus.t.metzger@intel.com
Cc: mathieu.poirier@linaro.org
Link: http://lkml.kernel.org/r/1421237903-181015-8-git-send-email-alexander.shishkin@linux.intel.com
Signed-off-by: Ingo Molnar -
When there's new data in the AUX space, output a record indicating its
offset and size and a set of flags, such as PERF_AUX_FLAG_TRUNCATED, to
mean the described data was truncated to fit in the ring buffer.Signed-off-by: Alexander Shishkin
Signed-off-by: Peter Zijlstra (Intel)
Cc: Arnaldo Carvalho de Melo
Cc: Borislav Petkov
Cc: Frederic Weisbecker
Cc: H. Peter Anvin
Cc: Kaixu Xia
Cc: Linus Torvalds
Cc: Mike Galbraith
Cc: Paul Mackerras
Cc: Robert Richter
Cc: Stephane Eranian
Cc: Thomas Gleixner
Cc: adrian.hunter@intel.com
Cc: kan.liang@intel.com
Cc: markus.t.metzger@intel.com
Cc: mathieu.poirier@linaro.org
Link: http://lkml.kernel.org/r/1421237903-181015-7-git-send-email-alexander.shishkin@linux.intel.com
Signed-off-by: Ingo Molnar -
Usually, pmus that do, for example, instruction tracing, would only ever
be able to have one event per task per cpu (or per perf_event_context). For
such pmus it makes sense to disallow creating conflicting events early on,
so as to provide consistent behavior for the user.This patch adds a pmu capability that indicates such constraint on event
creation.Signed-off-by: Alexander Shishkin
Signed-off-by: Peter Zijlstra (Intel)
Cc: Borislav Petkov
Cc: Frederic Weisbecker
Cc: H. Peter Anvin
Cc: Kaixu Xia
Cc: Linus Torvalds
Cc: Mike Galbraith
Cc: Paul Mackerras
Cc: Robert Richter
Cc: Stephane Eranian
Cc: Thomas Gleixner
Cc: acme@infradead.org
Cc: adrian.hunter@intel.com
Cc: kan.liang@intel.com
Cc: markus.t.metzger@intel.com
Cc: mathieu.poirier@linaro.org
Link: http://lkml.kernel.org/r/1422613866-113186-1-git-send-email-alexander.shishkin@linux.intel.com
Signed-off-by: Ingo Molnar -
For pmus that don't support scatter-gather for AUX data in hardware, it
might still make sense to implement software double buffering to avoid
losing data while the user is reading data out. For this purpose, add
a pmu capability that guarantees multiple high-order chunks for AUX buffer,
so that the pmu driver can do switchover tricks.To make use of this feature, add PERF_PMU_CAP_AUX_SW_DOUBLEBUF to your
pmu's capability mask. This will make the ring buffer AUX allocation code
ensure that the biggest high order allocation for the aux buffer pages is
no bigger than half of the total requested buffer size, thus making sure
that the buffer has at least two high order allocations.Signed-off-by: Alexander Shishkin
Signed-off-by: Peter Zijlstra (Intel)
Cc: Borislav Petkov
Cc: Frederic Weisbecker
Cc: H. Peter Anvin
Cc: Kaixu Xia
Cc: Linus Torvalds
Cc: Mike Galbraith
Cc: Paul Mackerras
Cc: Robert Richter
Cc: Stephane Eranian
Cc: Thomas Gleixner
Cc: acme@infradead.org
Cc: adrian.hunter@intel.com
Cc: kan.liang@intel.com
Cc: markus.t.metzger@intel.com
Cc: mathieu.poirier@linaro.org
Link: http://lkml.kernel.org/r/1421237903-181015-5-git-send-email-alexander.shishkin@linux.intel.com
Signed-off-by: Ingo Molnar -
Some pmus (such as BTS or Intel PT without multiple-entry ToPA capability)
don't support scatter-gather and will prefer larger contiguous areas for
their output regions.This patch adds a new pmu capability to request higher order allocations.
Signed-off-by: Alexander Shishkin
Signed-off-by: Peter Zijlstra (Intel)
Cc: Borislav Petkov
Cc: Frederic Weisbecker
Cc: H. Peter Anvin
Cc: Kaixu Xia
Cc: Linus Torvalds
Cc: Mike Galbraith
Cc: Paul Mackerras
Cc: Robert Richter
Cc: Stephane Eranian
Cc: Thomas Gleixner
Cc: acme@infradead.org
Cc: adrian.hunter@intel.com
Cc: kan.liang@intel.com
Cc: markus.t.metzger@intel.com
Cc: mathieu.poirier@linaro.org
Link: http://lkml.kernel.org/r/1421237903-181015-4-git-send-email-alexander.shishkin@linux.intel.com
Signed-off-by: Ingo Molnar -
This patch introduces "AUX space" in the perf mmap buffer, intended for
exporting high bandwidth data streams to userspace, such as instruction
flow traces.AUX space is a ring buffer, defined by aux_{offset,size} fields in the
user_page structure, and read/write pointers aux_{head,tail}, which abide
by the same rules as data_* counterparts of the main perf buffer.In order to allocate/mmap AUX, userspace needs to set up aux_offset to
such an offset that will be greater than data_offset+data_size and
aux_size to be the desired buffer size. Both need to be page aligned.
Then, same aux_offset and aux_size should be passed to mmap() call and
if everything adds up, you should have an AUX buffer as a result.Pages that are mapped into this buffer also come out of user's mlock
rlimit plus perf_event_mlock_kb allowance.Signed-off-by: Peter Zijlstra (Intel)
Acked-by: Alexander Shishkin
Cc: Borislav Petkov
Cc: Frederic Weisbecker
Cc: H. Peter Anvin
Cc: Kaixu Xia
Cc: Linus Torvalds
Cc: Mike Galbraith
Cc: Paul Mackerras
Cc: Robert Richter
Cc: Stephane Eranian
Cc: Thomas Gleixner
Cc: acme@infradead.org
Cc: adrian.hunter@intel.com
Cc: kan.liang@intel.com
Cc: markus.t.metzger@intel.com
Cc: mathieu.poirier@linaro.org
Link: http://lkml.kernel.org/r/1421237903-181015-3-git-send-email-alexander.shishkin@linux.intel.com
Signed-off-by: Ingo Molnar -
Currently, the actual perf ring buffer is one page into the mmap area,
following the user page and the userspace follows this convention. This
patch adds data_{offset,size} fields to user_page that can be used by
userspace instead for locating perf data in the mmap area. This is also
helpful when mapping existing or shared buffers if their size is not
known in advance.Right now, it is made to follow the existing convention that
data_offset == PAGE_SIZE and
data_offset + data_size == mmap_size.Signed-off-by: Alexander Shishkin
Signed-off-by: Peter Zijlstra (Intel)
Cc: Borislav Petkov
Cc: Frederic Weisbecker
Cc: H. Peter Anvin
Cc: Kaixu Xia
Cc: Linus Torvalds
Cc: Mike Galbraith
Cc: Paul Mackerras
Cc: Robert Richter
Cc: Stephane Eranian
Cc: Thomas Gleixner
Cc: acme@infradead.org
Cc: adrian.hunter@intel.com
Cc: kan.liang@intel.com
Cc: markus.t.metzger@intel.com
Cc: mathieu.poirier@linaro.org
Link: http://lkml.kernel.org/r/1421237903-181015-2-git-send-email-alexander.shishkin@linux.intel.com
Signed-off-by: Ingo Molnar -
BPF programs, attached to kprobes, provide a safe way to execute
user-defined BPF byte-code programs without being able to crash or
hang the kernel in any way. The BPF engine makes sure that such
programs have a finite execution time and that they cannot break
out of their sandbox.The user interface is to attach to a kprobe via the perf syscall:
struct perf_event_attr attr = {
.type = PERF_TYPE_TRACEPOINT,
.config = event_id,
...
};event_fd = perf_event_open(&attr,...);
ioctl(event_fd, PERF_EVENT_IOC_SET_BPF, prog_fd);'prog_fd' is a file descriptor associated with BPF program
previously loaded.'event_id' is an ID of the kprobe created.
Closing 'event_fd':
close(event_fd);
... automatically detaches BPF program from it.
BPF programs can call in-kernel helper functions to:
- lookup/update/delete elements in maps
- probe_read - wraper of probe_kernel_read() used to access any
kernel data structuresBPF programs receive 'struct pt_regs *' as an input ('struct pt_regs' is
architecture dependent) and return 0 to ignore the event and 1 to store
kprobe event into the ring buffer.Note, kprobes are a fundamentally _not_ a stable kernel ABI,
so BPF programs attached to kprobes must be recompiled for
every kernel version and user must supply correct LINUX_VERSION_CODE
in attr.kern_version during bpf_prog_load() call.Signed-off-by: Alexei Starovoitov
Reviewed-by: Steven Rostedt
Reviewed-by: Masami Hiramatsu
Cc: Andrew Morton
Cc: Arnaldo Carvalho de Melo
Cc: Arnaldo Carvalho de Melo
Cc: Daniel Borkmann
Cc: David S. Miller
Cc: Jiri Olsa
Cc: Linus Torvalds
Cc: Namhyung Kim
Cc: Peter Zijlstra
Cc: Peter Zijlstra
Link: http://lkml.kernel.org/r/1427312966-8434-4-git-send-email-ast@plumgrid.com
Signed-off-by: Ingo Molnar
27 Mar, 2015
4 commits
-
While thinking on the whole clock discussion it occurred to me we have
two distinct uses of time:1) the tracking of event/ctx/cgroup enabled/running/stopped times
which includes the self-monitoring support in struct
perf_event_mmap_page.2) the actual timestamps visible in the data records.
And we've been conflating them.
The first is all about tracking time deltas, nobody should really care
in what time base that happens, its all relative information, as long
as its internally consistent it works.The second however is what people are worried about when having to
merge their data with external sources. And here we have the
discussion on MONOTONIC vs MONOTONIC_RAW etc..Where MONOTONIC is good for correlating between machines (static
offset), MONOTNIC_RAW is required for correlating against a fixed rate
hardware clock.This means configurability; now 1) makes that hard because it needs to
be internally consistent across groups of unrelated events; which is
why we had to have a global perf_clock().However, for 2) it doesn't really matter, perf itself doesn't care
what it writes into the buffer.The below patch makes the distinction between these two cases by
adding perf_event_clock() which is used for the second case. It
further makes this configurable on a per-event basis, but adds a few
sanity checks such that we cannot combine events with different clocks
in confusing ways.And since we then have per-event configurability we might as well
retain the 'legacy' behaviour as a default.Signed-off-by: Peter Zijlstra (Intel)
Cc: Andrew Morton
Cc: Arnaldo Carvalho de Melo
Cc: David Ahern
Cc: Jiri Olsa
Cc: John Stultz
Cc: Linus Torvalds
Cc: Peter Zijlstra
Cc: Stephane Eranian
Cc: Thomas Gleixner
Signed-off-by: Ingo Molnar -
While looking at some fuzzer output I noticed that we do not hold any
locks on leader->ctx and therefore the sibling_list iteration is
unsafe.Acquire the relevant ctx->mutex before calling into the pmu specific
code.Signed-off-by: Peter Zijlstra (Intel)
Cc: Vince Weaver
Cc: Jiri Olsa
Cc: Sasha Levin
Link: http://lkml.kernel.org/r/20150225151639.GL5029@twins.programming.kicks-ass.net
Signed-off-by: Ingo Molnar -
Signed-off-by: Ingo Molnar
-
Signed-off-by: Ingo Molnar
23 Mar, 2015
2 commits
-
The only reason CQM had to use a hard-coded pmu type was so it could use
cqm_target in hw_perf_event.Do away with the {tp,bp,cqm}_target pointers and provide a non type
specific one.This allows us to do away with that silly pmu type as well.
Signed-off-by: Peter Zijlstra (Intel)
Cc: Vince Weaver
Cc: acme@kernel.org
Cc: acme@redhat.com
Cc: hpa@zytor.com
Cc: jolsa@redhat.com
Cc: kanaka.d.juvva@intel.com
Cc: matt.fleming@intel.com
Cc: tglx@linutronix.de
Cc: torvalds@linux-foundation.org
Cc: vikas.shivappa@linux.intel.com
Link: http://lkml.kernel.org/r/20150305211019.GU21418@twins.programming.kicks-ass.net
Signed-off-by: Ingo Molnar -
Vince reported a watchdog lockup like:
[] perf_tp_event+0xc4/0x210
[] perf_trace_lock+0x12a/0x160
[] lock_release+0x130/0x260
[] _raw_spin_unlock_irqrestore+0x24/0x40
[] do_send_sig_info+0x5d/0x80
[] send_sigio_to_task+0x12f/0x1a0
[] send_sigio+0xae/0x100
[] kill_fasync+0x97/0xf0
[] perf_event_wakeup+0xd4/0xf0
[] perf_pending_event+0x33/0x60
[] irq_work_run_list+0x4c/0x80
[] irq_work_run+0x18/0x40
[] smp_trace_irq_work_interrupt+0x3f/0xc0
[] trace_irq_work_interrupt+0x6d/0x80Which is caused by an irq_work generating new irq_work and therefore
not allowing forward progress.This happens because processing the perf irq_work triggers another
perf event (tracepoint stuff) which in turn generates an irq_work ad
infinitum.Avoid this by raising the recursion counter in the irq_work -- which
effectively disables all software events (including tracepoints) from
actually triggering again.Reported-by: Vince Weaver
Tested-by: Vince Weaver
Signed-off-by: Peter Zijlstra (Intel)
Cc: Arnaldo Carvalho de Melo
Cc: Jiri Olsa
Cc: Paul Mackerras
Cc: Steven Rostedt
Cc:
Link: http://lkml.kernel.org/r/20150219170311.GH21418@twins.programming.kicks-ass.net
Signed-off-by: Ingo Molnar
13 Mar, 2015
1 commit
-
Commit:
a83fe28e2e45 ("perf: Fix put_event() ctx lock")
changed the locking logic in put_event() by replacing mutex_lock_nested()
with perf_event_ctx_lock_nested(), but didn't fix the subsequent
mutex_unlock() with a correct counterpart, perf_event_ctx_unlock().Contexts are thus leaked as a result of incremented refcount
in perf_event_ctx_lock_nested().Signed-off-by: Leon Yu
Cc: Arnaldo Carvalho de Melo
Cc: Paul Mackerras
Cc: Peter Zijlstra
Fixes: a83fe28e2e45 ("perf: Fix put_event() ctx lock")
Link: http://lkml.kernel.org/r/1424954613-5034-1-git-send-email-chianglungyu@gmail.com
Signed-off-by: Ingo Molnar
03 Mar, 2015
1 commit
-
This reverts commit 74390aa55678 ("perf: Remove the extra validity check
on nr_pages")nr_pages equals to number of pages - 1 in perf_mmap. So nr_pages = 0 is
valid.So the nr_pages != 0 && !is_power_of_2(nr_pages) are all
needed for checking. Otherwise, for example, perf test 6 failed.# perf test 6
6: x86 rdpmc test :Error:
mmap() syscall returned with (Invalid argument)
FAILED!Signed-off-by: Kan Liang
Cc: Andi Kleen
Cc: Kaixu Xia
Cc: Peter Zijlstra
Link: http://lkml.kernel.org/r/1425280466-7830-1-git-send-email-kan.liang@intel.com
Signed-off-by: Arnaldo Carvalho de Melo
26 Feb, 2015
1 commit
-
Signed-off-by: Ingo Molnar
25 Feb, 2015
4 commits
-
Add support for task events as well as system-wide events. This change
has a big impact on the way that we gather LLC occupancy values in
intel_cqm_event_read().Currently, for system-wide (per-cpu) events we defer processing to
userspace which knows how to discard all but one cpu result per package.Things aren't so simple for task events because we need to do the value
aggregation ourselves. To do this, we defer updating the LLC occupancy
value in event->count from intel_cqm_event_read() and do an SMP
cross-call to read values for all packages in intel_cqm_event_count().
We need to ensure that we only do this for one task event per cache
group, otherwise we'll report duplicate values.If we're a system-wide event we want to fallback to the default
perf_event_count() implementation. Refactor this into a common function
so that we don't duplicate the code.Also, introduce PERF_TYPE_INTEL_CQM, since we need a way to track an
event's task (if the event isn't per-cpu) inside of the Intel CQM PMU
driver. This task information is only availble in the upper layers of
the perf infrastructure.Other perf backends stash the target task in event->hw.*target so we
need to do something similar. The task is used to determine whether
events should share a cache group and an RMID.Signed-off-by: Matt Fleming
Signed-off-by: Peter Zijlstra (Intel)
Cc: Arnaldo Carvalho de Melo
Cc: Arnaldo Carvalho de Melo
Cc: H. Peter Anvin
Cc: Jiri Olsa
Cc: Kanaka Juvva
Cc: Linus Torvalds
Cc: Vikas Shivappa
Cc: linux-api@vger.kernel.org
Link: http://lkml.kernel.org/r/1422038748-21397-8-git-send-email-matt@codeblueprint.co.uk
Signed-off-by: Ingo Molnar -
The Intel QoS PMU needs to know whether an event is part of a cgroup
during ->event_init(), because tasks in the same cgroup share a
monitoring ID.Move the cgroup initialisation before calling into the PMU driver.
Signed-off-by: Matt Fleming
Signed-off-by: Peter Zijlstra (Intel)
Cc: Arnaldo Carvalho de Melo
Cc: Arnaldo Carvalho de Melo
Cc: H. Peter Anvin
Cc: Jiri Olsa
Cc: Kanaka Juvva
Cc: Linus Torvalds
Cc: Vikas Shivappa
Link: http://lkml.kernel.org/r/1422038748-21397-4-git-send-email-matt@codeblueprint.co.uk
Signed-off-by: Ingo Molnar -
For PMU drivers that record per-package counters, the ->count variable
cannot be used to record an accurate aggregated value, since it's not
possible to perform SMP cross-calls to cpus on other packages from the
context in which we update ->count.Introduce a new optional ->count() accessor function that can be used to
customize how values are collected. If a PMU driver doesn't provide a
->count() function, we fallback to the existing code.There is necessarily a window of staleness with this approach because
the task that generated the counter value may not have been scheduled by
the cpu recently.An alternative and more complex approach would be to use a hrtimer to
periodically refresh the values from a more permissive scheduling
context. So, we're trading off complexity for accuracy.Signed-off-by: Matt Fleming
Signed-off-by: Peter Zijlstra (Intel)
Cc: Arnaldo Carvalho de Melo
Cc: Arnaldo Carvalho de Melo
Cc: H. Peter Anvin
Cc: Jiri Olsa
Cc: Kanaka Juvva
Cc: Linus Torvalds
Cc: Vikas Shivappa
Link: http://lkml.kernel.org/r/1422038748-21397-3-git-send-email-matt@codeblueprint.co.uk
Signed-off-by: Ingo Molnar -
Move perf_cgroup_from_task() from kernel/events/ to include/linux/ along
with the necessary struct definitions, so that it can be used by the PMU
code.When the upcoming Intel Cache Monitoring PMU driver assigns monitoring
IDs to perf events, it needs to be able to check whether any two
monitoring events overlap (say, a cgroup and task event), which means we
need to be able to lookup the cgroup associated with a task (if any).Signed-off-by: Matt Fleming
Signed-off-by: Peter Zijlstra (Intel)
Cc: Arnaldo Carvalho de Melo
Cc: Arnaldo Carvalho de Melo
Cc: H. Peter Anvin
Cc: Jiri Olsa
Cc: Kanaka Juvva
Cc: Linus Torvalds
Cc: Paul Mackerras
Cc: Vikas Shivappa
Link: http://lkml.kernel.org/r/1422038748-21397-2-git-send-email-matt@codeblueprint.co.uk
Signed-off-by: Ingo Molnar
19 Feb, 2015
7 commits
-
…/acme/linux into perf/core
Pull perf/core improvements and fixes from Arnaldo Carvalho de Melo:
User visible changes:
- No need to explicitely enable evsels for workload started from perf, let it
be enabled via perf_event_attr.enable_on_exec, removing some events that take
place in the 'perf trace' before a workload is really started by it.
(Arnaldo Carvalho de Melo)- Fix to handle optimized not-inlined functions in 'perf probe' (Masami Hiramatsu)
- Update 'perf probe' man page (Masami Hiramatsu)
- 'perf trace': Allow mixing with tracepoints and suppressing plain syscalls
(Arnaldo Carvalho de Melo)Infrastructure changes:
- Introduce {trace_seq_do,event_format_}_fprintf functions to allow
a default tracepoint field list printer to be used in tools that allows
redirecting output to a file. (Arnaldo Carvalho de Melo)- The man page for pthread_attr_set_affinity_np states that _GNU_SOURCE
must be defined before pthread.h, do it to fix the build in some
systems (Josh Boyer)- Cleanups in 'perf buildid-cache' (Masami Hiramatsu)
- Fix dso cache test case (Namhyung Kim)
- Do Not rely on dso__data_read_offset() to open DSO (Namhyung Kim)
- Make perf aware of tracefs (Steven Rostedt).
- Fix build by defining STT_GNU_IFUNC for glibc 2.9 and older (Vinson Lee)
- AArch64 symbol resolution fixes (Victor Kamensky)
- Kconfig beachhead (Jiri Olsa)
- Simplify nr_pages validity (Kaixu Xia)
- Fixup header positioning in 'perf list' (Yunlong Song)
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org> -
Use event->attr.branch_sample_type to replace
intel_pmu_needs_lbr_smpl() for avoiding duplicated code that
implicitly enables the LBR.Currently, branch stack can be enabled by user explicitly requesting
branch sampling or implicit branch sampling to correct PEBS skid.For user explicitly requested branch sampling, the branch_sample_type
is explicitly set by user. For PEBS case, the branch_sample_type is also
implicitly set to PERF_SAMPLE_BRANCH_ANY in x86_pmu_hw_config.Signed-off-by: Yan, Zheng
Signed-off-by: Kan Liang
Signed-off-by: Peter Zijlstra (Intel)
Cc: Arnaldo Carvalho de Melo
Cc: Linus Torvalds
Cc: Paul Mackerras
Cc: eranian@google.com
Cc: jolsa@redhat.com
Link: http://lkml.kernel.org/r/1415156173-10035-11-git-send-email-kan.liang@intel.com
Signed-off-by: Ingo Molnar -
If two tasks were both forked from the same parent task, Events in
their perf task contexts can be the same. Perf core may leave out
switching the perf event contexts.Previous patch inroduces pmu specific data. The data is for saving
the LBR stack, it is task specific. So we need to switch the data
even when context switch is optimized out.Signed-off-by: Yan, Zheng
Signed-off-by: Kan Liang
Signed-off-by: Peter Zijlstra (Intel)
Cc: Arnaldo Carvalho de Melo
Cc: Linus Torvalds
Cc: Paul Mackerras
Cc: eranian@google.com
Cc: jolsa@redhat.com
Link: http://lkml.kernel.org/r/1415156173-10035-7-git-send-email-kan.liang@intel.com
Signed-off-by: Ingo Molnar -
Introduce a new flag PERF_ATTACH_TASK_DATA for perf event's attach
stata. The flag is set by PMU's event_init() callback, it indicates
that perf event needs PMU specific data.The PMU specific data are initialized to zeros. Later patches will
use PMU specific data to save LBR stack.Signed-off-by: Yan, Zheng
Signed-off-by: Kan Liang
Signed-off-by: Peter Zijlstra (Intel)
Cc: Arnaldo Carvalho de Melo
Cc: Linus Torvalds
Cc: Paul Mackerras
Cc: eranian@google.com
Cc: jolsa@redhat.com
Link: http://lkml.kernel.org/r/1415156173-10035-6-git-send-email-kan.liang@intel.com
Signed-off-by: Ingo Molnar -
Previous commit introduces context switch callback, its function
overlaps with the flush branch stack callback. So we can use the
context switch callback to flush LBR stack.This patch adds code that uses the flush branch callback to
flush the LBR stack when task is being scheduled in. The callback
is enabled only when there are events use the LBR hardware. This
patch also removes all old flush branch stack code.Signed-off-by: Yan, Zheng
Signed-off-by: Kan Liang
Signed-off-by: Peter Zijlstra (Intel)
Cc: Andy Lutomirski
Cc: Arnaldo Carvalho de Melo
Cc: Linus Torvalds
Cc: Paul Mackerras
Cc: Vince Weaver
Cc: eranian@google.com
Cc: jolsa@redhat.com
Link: http://lkml.kernel.org/r/1415156173-10035-4-git-send-email-kan.liang@intel.com
Signed-off-by: Ingo Molnar -
The callback is invoked when process is scheduled in or out.
It provides mechanism for later patches to save/store the LBR
stack. For the schedule in case, the callback is invoked at
the same place that flush branch stack callback is invoked.
So it also can replace the flush branch stack callback. To
avoid unnecessary overhead, the callback is enabled only when
there are events use the LBR stack.Signed-off-by: Yan, Zheng
Signed-off-by: Kan Liang
Signed-off-by: Peter Zijlstra (Intel)
Cc: Andy Lutomirski
Cc: Arnaldo Carvalho de Melo
Cc: Linus Torvalds
Cc: Paul Mackerras
Cc: Vince Weaver
Cc: eranian@google.com
Cc: jolsa@redhat.com
Link: http://lkml.kernel.org/r/1415156173-10035-3-git-send-email-kan.liang@intel.com
Signed-off-by: Ingo Molnar -
For hardware events, the userspace page of the event gets updated in
context switches, so if we read the timestamp in the page, we get
fresh info.For software events, this is missing currently. This patch makes the
behavior consistent.With this patch, we can implement clock_gettime(THREAD_CPUTIME) with
PERF_COUNT_SW_DUMMY in userspace as suggested by Andy and Peter. Code
like this:if (pc->cap_user_time) {
do {
seq = pc->lock;
barrier();running = pc->time_running;
cyc = rdtsc();
time_mult = pc->time_mult;
time_shift = pc->time_shift;
time_offset = pc->time_offset;barrier();
} while (pc->lock != seq);quot = (cyc >> time_shift);
rem = cyc & ((1 << time_shift) - 1);
delta = time_offset + quot * time_mult +
((rem * time_mult) >> time_shift);running += delta;
return running;
}I tried it on a busy system, the userspace page updating doesn't
have noticeable overhead.Signed-off-by: Shaohua Li
Signed-off-by: Peter Zijlstra (Intel)
Cc: Andy Lutomirski
Cc: Arnaldo Carvalho de Melo
Cc: Arnaldo Carvalho de Melo
Cc: Frederic Weisbecker
Cc: Linus Torvalds
Link: http://lkml.kernel.org/r/aa2dd2e4f1e9f2225758be5ba00f14d6909a8ce1.1423180257.git.shli@fb.com
[ Improved the changelog. ]
Signed-off-by: Ingo Molnar