04 Feb, 2019

2 commits

  • atomic_t variables are currently used to implement reference
    counters with the following properties:

    - counter is initialized to 1 using atomic_set()
    - a resource is freed upon counter reaching zero
    - once counter reaches zero, its further increments aren't allowed
    - counter schema uses basic atomic operations
      (set, inc, inc_not_zero, dec_and_test, etc.)

    Such atomic variables should be converted to a newly provided
    refcount_t type and API that prevents accidental counter overflows
    and underflows. This is important since overflows and underflows
    can lead to a use-after-free situation and be exploitable.

    The variable ring_buffer.aux_refcount is used as pure reference counter.
    Convert it to refcount_t and fix up the operations.

    ** Important note for maintainers:

    Some functions from refcount_t API defined in lib/refcount.c
    have different memory ordering guarantees than their atomic
    counterparts. Please check Documentation/core-api/refcount-vs-atomic.rst
    for more information.

    Normally the differences should not matter since refcount_t provides
    enough guarantees to satisfy the refcounting use cases, but in
    some rare cases it might matter.
    Please double check that you don't rely on any undocumented memory
    ordering guarantees for this variable's usage.

    For the ring_buffer.aux_refcount it might make a difference
    in the following places:

    - perf_aux_output_begin(): increment in refcount_inc_not_zero() only
      guarantees control dependency on success vs. fully ordered atomic
      counterpart
    - rb_free_aux(): decrement in refcount_dec_and_test() only provides
      RELEASE ordering and ACQUIRE ordering + control dependency on success
      vs. fully ordered atomic counterpart
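
    As a rough illustration of the conversion pattern described above (a
    minimal sketch; the helper names around the refcount are illustrative,
    the refcount operations themselves are the refcount_t API):

    #include <linux/refcount.h>

    struct ring_buffer {
            refcount_t      aux_refcount;   /* was: atomic_t aux_refcount; */
            /* ... */
    };

    /* perf_aux_output_begin() path: only take a reference while non-zero */
    static bool rb_get_aux(struct ring_buffer *rb)
    {
            /* was: atomic_inc_not_zero(&rb->aux_refcount) */
            return refcount_inc_not_zero(&rb->aux_refcount);
    }

    /* rb_free_aux() path: free the AUX pages on the 1 -> 0 transition */
    static void rb_put_aux(struct ring_buffer *rb)
    {
            /* was: atomic_dec_and_test(&rb->aux_refcount) */
            if (refcount_dec_and_test(&rb->aux_refcount))
                    __rb_free_aux(rb);
    }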

    Suggested-by: Kees Cook
    Signed-off-by: Elena Reshetova
    Signed-off-by: Peter Zijlstra (Intel)
    Reviewed-by: David Windsor
    Reviewed-by: Hans Liljestrand
    Cc: Alexander Shishkin
    Cc: Arnaldo Carvalho de Melo
    Cc: Jiri Olsa
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: acme@kernel.org
    Cc: namhyung@kernel.org
    Link: https://lkml.kernel.org/r/1548678448-24458-4-git-send-email-elena.reshetova@intel.com
    Signed-off-by: Ingo Molnar

    Elena Reshetova
     
  • atomic_t variables are currently used to implement reference
    counters with the following properties:

    - counter is initialized to 1 using atomic_set()
    - a resource is freed upon counter reaching zero
    - once counter reaches zero, its further increments aren't allowed
    - counter schema uses basic atomic operations
      (set, inc, inc_not_zero, dec_and_test, etc.)

    Such atomic variables should be converted to a newly provided
    refcount_t type and API that prevents accidental counter overflows
    and underflows. This is important since overflows and underflows
    can lead to a use-after-free situation and be exploitable.

    The variable ring_buffer.refcount is used as pure reference counter.
    Convert it to refcount_t and fix up the operations.

    ** Important note for maintainers:

    Some functions from refcount_t API defined in lib/refcount.c
    have different memory ordering guarantees than their atomic
    counterparts. Please check Documentation/core-api/refcount-vs-atomic.rst
    for more information.

    Normally the differences should not matter since refcount_t provides
    enough guarantees to satisfy the refcounting use cases, but in
    some rare cases it might matter.
    Please double check that you don't rely on any undocumented memory
    ordering guarantees for this variable's usage.

    For the ring_buffer.refcount it might make a difference
    in the following places:

    - ring_buffer_get(): increment in refcount_inc_not_zero() only
      guarantees control dependency on success vs. fully ordered atomic
      counterpart
    - ring_buffer_put(): decrement in refcount_dec_and_test() only provides
      RELEASE ordering and ACQUIRE ordering + control dependency on success
      vs. fully ordered atomic counterpart
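
    The same mapping applies to the allocation and teardown paths of this
    counter; a minimal sketch, with the surrounding function shapes
    illustrative:

    #include <linux/rcupdate.h>
    #include <linux/refcount.h>

    /* allocation: was atomic_set(&rb->refcount, 1); */
    static void rb_init(struct ring_buffer *rb)
    {
            refcount_set(&rb->refcount, 1);
    }

    /* ring_buffer_put(): was "if (atomic_dec_and_test(&rb->refcount))" */
    static void ring_buffer_put(struct ring_buffer *rb)
    {
            if (!refcount_dec_and_test(&rb->refcount))
                    return;

            /* last reference gone; free the buffer (RCU-deferred in core.c) */
            call_rcu(&rb->rcu_head, rb_free_rcu);
    }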

    Suggested-by: Kees Cook
    Signed-off-by: Elena Reshetova
    Signed-off-by: Peter Zijlstra (Intel)
    Reviewed-by: David Windsor
    Reviewed-by: Hans Liljestrand
    Cc: Alexander Shishkin
    Cc: Arnaldo Carvalho de Melo
    Cc: Jiri Olsa
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: acme@kernel.org
    Cc: namhyung@kernel.org
    Link: https://lkml.kernel.org/r/1548678448-24458-3-git-send-email-elena.reshetova@intel.com
    Signed-off-by: Ingo Molnar

    Elena Reshetova
     

08 Jan, 2018

1 commit

  • And move it to core.c, because there's no caller of this function other
    than the one in core.c

    Signed-off-by: Jiri Olsa
    Cc: Alexander Shishkin
    Cc: Andi Kleen
    Cc: David Ahern
    Cc: Namhyung Kim
    Cc: Peter Zijlstra
    Link: http://lkml.kernel.org/r/20180107160356.28203-6-jolsa@kernel.org
    Signed-off-by: Arnaldo Carvalho de Melo

    Jiri Olsa
     

02 Nov, 2017

1 commit

  • Many source files in the tree are missing licensing information, which
    makes it harder for compliance tools to determine the correct license.

    By default all files without license information are under the default
    license of the kernel, which is GPL version 2.

    Update the files which contain no license information with the 'GPL-2.0'
    SPDX license identifier. The SPDX identifier is a legally binding
    shorthand, which can be used instead of the full boilerplate text.
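
    In practice the identifier is a single comment line at the very top of
    the file; for a header such as kernel/events/internal.h it looks like
    this (the rest of the file is unchanged):

    /* SPDX-License-Identifier: GPL-2.0 */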

    This patch is based on work done by Thomas Gleixner and Kate Stewart and
    Philippe Ombredanne.

    How this work was done:

    Patches were generated and checked against linux-4.14-rc6 for a subset of
    the use cases:
    - file had no licensing information in it,
    - file was a */uapi/* one with no licensing information in it,
    - file was a */uapi/* one with existing licensing information.

    Further patches will be generated in subsequent months to fix up cases
    where non-standard license headers were used, and references to license
    had to be inferred by heuristics based on keywords.

    The analysis to determine which SPDX License Identifier should be applied
    to a file was done in a spreadsheet of side-by-side results from the
    output of two independent scanners (ScanCode & Windriver) producing SPDX
    tag:value files, created by Philippe Ombredanne. Philippe prepared the
    base worksheet and did an initial spot review of a few thousand files.

    The 4.13 kernel was the starting point of the analysis with 60,537 files
    assessed. Kate Stewart did a file by file comparison of the scanner
    results in the spreadsheet to determine which SPDX license identifier(s)
    to be applied to the file. She confirmed any determination that was not
    immediately clear with lawyers working with the Linux Foundation.

    Criteria used to select files for SPDX license identifier tagging were:
    - Files considered eligible had to be source code files.
    - Make and config files were included as candidates if they contained >5
    lines of source
    - File already had some variant of a license header in it (even if
    Reviewed-by: Philippe Ombredanne
    Reviewed-by: Thomas Gleixner
    Signed-off-by: Greg Kroah-Hartman

    Greg Kroah-Hartman
     

25 Aug, 2017

3 commits

    In an XDP redirect application using the tracepoint xdp:xdp_redirect to
    diagnose TX overruns, I noticed perf_swevent_get_recursion_context()
    was consuming 2% CPU. This was reduced to 1.85% with this simple
    change.

    Looking at the annotated asm code, it was clear that the in_nmi() test,
    although an unlikely case, was chosen (by the compiler) as the most
    likely event/branch. This small adjustment makes the compiler (GCC
    version 7.1.1 20170622 (Red Hat 7.1.1-3)) put in_nmi() as an unlikely
    branch.
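
    A sketch of the annotation being described, modeled on
    get_recursion_context() in kernel/events/internal.h (body abridged to
    the relevant branch chain):

    static inline int get_recursion_context(int *recursion)
    {
            int rctx;

            if (unlikely(in_nmi()))         /* was: if (in_nmi()) */
                    rctx = 3;
            else if (in_irq())
                    rctx = 2;
            else if (in_softirq())
                    rctx = 1;
            else
                    rctx = 0;

            if (recursion[rctx])
                    return -1;

            recursion[rctx]++;
            barrier();

            return rctx;
    }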

    Signed-off-by: Jesper Dangaard Brouer
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Arnaldo Carvalho de Melo
    Cc: Jiri Olsa
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Link: http://lkml.kernel.org/r/150342256382.16595.986861478681783732.stgit@firesoul
    Signed-off-by: Ingo Molnar

    Jesper Dangaard Brouer
     
  • The aux_watermark member of struct ring_buffer represents the period (in
    terms of bytes) at which wakeup events should be generated when data is
    written to the aux buffer in non-snapshot mode. On hardware that cannot
    generate an interrupt when the aux_head reaches an arbitrary wakeup index
    (such as ARM SPE), the aux_head sampled from handle->head in
    perf_aux_output_{skip,end} may in fact be past the wakeup index. This
    can lead to wakeup slowly falling behind the head. For example, consider
    the case where hardware can only generate an interrupt on a page-boundary
    and the aux buffer is initialised as follows:

    // Buffer size is 2 * PAGE_SIZE
    rb->aux_head = rb->aux_wakeup = 0
    rb->aux_watermark = PAGE_SIZE / 2

    following the first perf_aux_output_begin call, the handle is
    initialised with:

    handle->head = 0
    handle->size = 2 * PAGE_SIZE
    handle->wakeup = PAGE_SIZE / 2

    and the hardware will be programmed to generate an interrupt at
    PAGE_SIZE.

    When the interrupt is raised, the hardware head will be at PAGE_SIZE,
    so calling perf_aux_output_end(handle, PAGE_SIZE) puts the ring buffer
    into the following state:

    rb->aux_head = PAGE_SIZE
    rb->aux_wakeup = PAGE_SIZE / 2
    rb->aux_watermark = PAGE_SIZE / 2

    and then the next call to perf_aux_output_begin will result in:

    handle->head = handle->wakeup = PAGE_SIZE

    for which the semantics are unclear; for a smaller aux_watermark
    (e.g. PAGE_SIZE / 4), the wakeup would in fact be behind the head at
    this point.

    This patch fixes the problem by rounding down the aux_head (as sampled
    from the handle) to the nearest aux_watermark boundary when updating
    rb->aux_wakeup, therefore taking into account any overruns by the
    hardware.
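
    A sketch of the fix as described (the helper name follows what this
    area of kernel/events/ring_buffer.c ends up using; rounddown() is the
    generic kernel macro):

    /* Decide whether perf_aux_output_{skip,end} should wake up the consumer. */
    static bool rb_need_aux_wakeup(struct ring_buffer *rb)
    {
            if (rb->aux_head - rb->aux_wakeup >= rb->aux_watermark) {
                    /*
                     * was: rb->aux_wakeup += rb->aux_watermark, which slowly
                     * falls behind aux_head when the hardware overruns the
                     * requested wakeup point.
                     */
                    rb->aux_wakeup = rounddown(rb->aux_head, rb->aux_watermark);
                    return true;
            }

            return false;
    }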

    Reported-by: Mark Rutland
    Signed-off-by: Will Deacon
    Signed-off-by: Peter Zijlstra (Intel)
    Acked-by: Alexander Shishkin
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: linux-arm-kernel@lists.infradead.org
    Link: http://lkml.kernel.org/r/1502900297-21839-2-git-send-email-will.deacon@arm.com
    Signed-off-by: Ingo Molnar

    Will Deacon
     
  • The aux_head and aux_wakeup members of struct ring_buffer are defined
    using the local_t type, despite the fact that they are only accessed via
    the perf_aux_output_*() functions, which cannot race with each other for a
    given ring buffer.

    This patch changes the type of the members to long, so we can avoid
    using the local_*() API where it isn't needed.
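
    A before/after sketch of the change described above (struct abridged;
    the accessor function is illustrative):

    struct ring_buffer {
            /* was: local_t aux_head;  local_t aux_wakeup; */
            long    aux_head;       /* next byte to be written in the AUX area */
            long    aux_wakeup;     /* last point a wakeup was generated from  */
            /* ... */
    };

    /* Accesses turn from local_read()/local_add() into plain loads/stores: */
    static void rb_advance_aux(struct ring_buffer *rb, unsigned long size)
    {
            rb->aux_head += size;   /* was: local_add(size, &rb->aux_head); */
    }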

    Signed-off-by: Will Deacon
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Alexander Shishkin
    Cc: Linus Torvalds
    Cc: Mark Rutland
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: linux-arm-kernel@lists.infradead.org
    Link: http://lkml.kernel.org/r/1502900297-21839-1-git-send-email-will.deacon@arm.com
    Signed-off-by: Ingo Molnar

    Will Deacon
     

26 Jul, 2016

1 commit

  • This patch fixes the __output_custom() routine we currently use with
    bpf_skb_copy(). I missed that when len is larger than the size of the
    current handle, we can issue multiple invocations of copy_func, and
    __output_custom() advances the destination but also the source buffer by
    the number of bytes written. For __output_custom() this is wrong, since
    in that case the source buffer points to a non-linear object, in our
    case an skb, which the copy_func helper is supposed to walk. Since the
    source is non-linear, we therefore need to pass the offset into the
    helper, so that copy_func can use it for extracting the data from
    the source object.

    Therefore, adjust the callback signatures properly and pass offset
    into the skb_header_pointer() invoked from bpf_skb_copy() callback. The
    __DEFINE_OUTPUT_COPY_BODY() is adjusted to accommodate two things:
    i) to pass in whether we should advance source buffer or not; this is
    a compile-time constant condition, ii) to pass in the offset for
    __output_custom(), which we do with help of __VA_ARGS__, so everything
    can stay inlined as is currently. Both changes allow for adapting the
    __output_* fast-path helpers w/o extra overhead.
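
    A sketch of the adjusted callback, modeled on bpf_skb_copy() in
    net/core/filter.c after this change (the offset is now part of the
    signature and forwarded to skb_header_pointer()):

    static unsigned long bpf_skb_copy(void *dst_buff, const void *skb,
                                      unsigned long off, unsigned long len)
    {
            void *ptr = skb_header_pointer(skb, off, len, dst_buff);

            if (unlikely(!ptr))
                    return len;     /* report bytes not copied */
            if (ptr != dst_buff)
                    memcpy(dst_buff, ptr, len);

            return 0;
    }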

    Fixes: 555c8a8623a3 ("bpf: avoid stack copy and use skb ctx for event output")
    Fixes: 7e3f977edd0b ("perf, events: add non-linear data support for raw records")
    Signed-off-by: Daniel Borkmann
    Acked-by: Alexei Starovoitov
    Signed-off-by: David S. Miller

    Daniel Borkmann
     

16 Jul, 2016

1 commit

  • This patch adds support for non-linear data on raw records. It
    extends raw records to have one or multiple fragments that will
    be written linearly into the ring slot, where each fragment can
    optionally have a custom callback handler to walk and extract
    complex, possibly non-linear data.

    If a callback handler is provided for a fragment, then the new
    __output_custom() will be used instead of __output_copy() for
    the perf_output_sample() part. perf_prepare_sample() does all
    the size calculation only once, so perf_output_sample() doesn't
    need to redo the same work anymore, meaning real_size and padding
    will be cached in the raw record. The raw record becomes 32 bytes
    in size without holes; to not increase it further and to avoid
    doing unnecessary recalculations in the fast path, we can reuse the
    next pointer of the last fragment. The idea here is borrowed from
    ZERO_OR_NULL_PTR(), which should keep the perf_output_sample()
    path for PERF_SAMPLE_RAW minimal.

    This facility is needed for BPF's event output helper as a first
    user that will, in a follow-up, add an additional perf_raw_frag
    to its perf_raw_record in order to be able to more efficiently
    dump skb context after a linear head meta data related to it.
    skbs can be non-linear and thus need a custom output function to
    dump buffers. Currently, the skb data needs to be copied twice;
    with the help of __output_custom() this work only needs to be
    done once. Future users could be things like XDP/BPF programs
    that work on different context though and would thus also have
    a different callback function.

    The few users of raw records are adapted to initialize their frag
    data from the raw record itself; there is no change in behavior for them.
    The code is based upon a PoC diff provided by Peter Zijlstra [1].

    [1] http://thread.gmane.org/gmane.linux.network/421294
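
    The resulting layout looks roughly like this (a sketch following the
    description above and include/linux/perf_event.h):

    #include <linux/types.h>

    typedef unsigned long (*perf_copy_f)(void *dst, const void *src,
                                         unsigned long off, unsigned long len);

    struct perf_raw_frag {
            union {
                    struct perf_raw_frag    *next;  /* next fragment, if any       */
                    unsigned long           pad;    /* reused in the last fragment */
            };
            perf_copy_f                     copy;   /* optional custom copy callback */
            void                            *data;  /* linear data if no callback    */
            u32                             size;
    } __packed;

    struct perf_raw_record {
            struct perf_raw_frag            frag;   /* first (possibly only) fragment  */
            u32                             size;   /* cached total size incl. padding */
    };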

    Suggested-by: Peter Zijlstra
    Signed-off-by: Daniel Borkmann
    Acked-by: Alexei Starovoitov
    Signed-off-by: David S. Miller

    Daniel Borkmann
     

13 Apr, 2016

1 commit


31 Mar, 2016

2 commits

  • Add new ioctl() to pause/resume ring-buffer output.

    In some situations we want to read from the ring-buffer only when we
    ensure nothing can write to the ring-buffer during reading. Without
    this patch we have to turn off all events attached to this ring-buffer
    to achieve this.

    This patch is a prerequisite for enabling overwrite support in the
    perf ring-buffer. Following commits will introduce new methods that
    support reading from an overwrite ring buffer. Before reading, the
    caller must ensure the ring buffer is frozen, or the read is unreliable.
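
    From userspace the new control is used roughly like this (a sketch;
    PERF_EVENT_IOC_PAUSE_OUTPUT takes a boolean argument and needs uapi
    headers that already define it):

    #include <sys/ioctl.h>
    #include <linux/perf_event.h>

    /* 'fd' is a perf_event_open() fd whose ring buffer we want to read safely. */
    static int read_ring_buffer_frozen(int fd)
    {
            if (ioctl(fd, PERF_EVENT_IOC_PAUSE_OUTPUT, 1))  /* pause output */
                    return -1;

            /* ... consume the ring buffer; nothing can be written meanwhile ... */

            return ioctl(fd, PERF_EVENT_IOC_PAUSE_OUTPUT, 0);       /* resume */
    }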

    Signed-off-by: Wang Nan
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Alexander Shishkin
    Cc: Alexei Starovoitov
    Cc: Arnaldo Carvalho de Melo
    Cc: Brendan Gregg
    Cc: He Kuang
    Cc: Jiri Olsa
    Cc: Jiri Olsa
    Cc: Linus Torvalds
    Cc: Masami Hiramatsu
    Cc: Namhyung Kim
    Cc: Peter Zijlstra
    Cc: Stephane Eranian
    Cc: Thomas Gleixner
    Cc: Vince Weaver
    Cc: Zefan Li
    Link: http://lkml.kernel.org/r/1459147292-239310-2-git-send-email-wangnan0@huawei.com
    Signed-off-by: Ingo Molnar

    Wang Nan
     
    Now that we can ensure that, when a ring buffer's AUX area is on the way
    to getting unmapped, new transactions won't start, we only need to stop
    all events that can potentially be writing aux data to our ring buffer.

    Having done that, we can safely free the AUX pages and corresponding
    PMU data, as this time it is guaranteed to be the last aux reference
    holder.

    This partially reverts:

    57ffc5ca679 ("perf: Fix AUX buffer refcounting")

    ... which was made to defer deallocation that was otherwise possible
    from an NMI context. Now it is no longer the case; the last call to
    rb_free_aux() that drops the last AUX reference has to happen in
    perf_mmap_close() on that AUX area.

    Signed-off-by: Alexander Shishkin
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Arnaldo Carvalho de Melo
    Cc: Arnaldo Carvalho de Melo
    Cc: Jiri Olsa
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Stephane Eranian
    Cc: Thomas Gleixner
    Cc: Vince Weaver
    Cc: vince@deater.net
    Link: http://lkml.kernel.org/r/87d1qtz23d.fsf@ashishki-desk.ger.corp.intel.com
    Signed-off-by: Ingo Molnar

    Alexander Shishkin
     

20 Feb, 2016

1 commit


06 Jul, 2015

1 commit

    It's currently possible to drop the last refcount to the aux buffer
    from NMI context, which results in the expected fireworks.

    The refcounting needs a bigger overhaul, but to cure the immediate
    problem, delay the freeing by using an irq_work.
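
    A sketch of the deferral, assuming an irq_work member is added to
    struct ring_buffer (names of the internal helpers are illustrative):

    #include <linux/irq_work.h>

    void rb_free_aux(struct ring_buffer *rb)
    {
            /* Never free from here: this may run in NMI context. */
            if (atomic_dec_and_test(&rb->aux_refcount))
                    irq_work_queue(&rb->irq_work);
    }

    static void rb_irq_work(struct irq_work *work)
    {
            struct ring_buffer *rb = container_of(work, struct ring_buffer, irq_work);

            /* Runs from the irq_work interrupt, outside NMI: safe to free. */
            if (!atomic_read(&rb->aux_refcount))
                    __rb_free_aux(rb);
    }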

    Reviewed-and-tested-by: Alexander Shishkin
    Reported-by: Vince Weaver
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Arnaldo Carvalho de Melo
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Stephane Eranian
    Cc: Thomas Gleixner
    Link: http://lkml.kernel.org/r/20150618103249.GK19282@twins.programming.kicks-ass.net
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     

07 Jun, 2015

1 commit

  • When the PEBS interrupt threshold is larger than one record and the
    machine supports multiple PEBS events, the records of these events are
    mixed up and we need to demultiplex them.

    Demuxing the records is hard because the hardware is deficient. The
    hardware has two issues that, when combined, create impossible
    scenarios to demux.

    The first issue is that the 'status' field of the PEBS record is a copy
    of the GLOBAL_STATUS MSR at PEBS assist time. To see why this is a
    problem let us first describe the regular PEBS cycle:

    A) the CTRn value reaches 0:
       - the corresponding bit in GLOBAL_STATUS gets set
       - we start arming the hardware assist

       < some unspecified amount of time later -- this could cover multiple
         events of interest >

    B) the hardware assist is armed, any next event will trigger it

    C) a matching event happens:
       - the hardware assist triggers and generates a PEBS record;
         this includes a copy of GLOBAL_STATUS at this moment
       - if we auto-reload we (re)set CTRn
       - we clear the relevant bit in GLOBAL_STATUS

    Now consider the following chain of events:

    A0, B0, A1, C0

    The event generated for counter 0 will include a status with counter 1
    set, even though it's not at all related to the record. A similar thing
    can happen with a !PEBS event if it just happens to overflow at the
    right moment.

    The second issue is that the hardware will only emit one record for two
    or more counters if the event that triggers the assist is 'close'. This
    'close' can be several cycles, in some cases even the duration of the
    complete assist, if the event is something that doesn't need retirement.

    For instance, consider this chain of events:

    A0, B0, A1, B1, C01

    Where C01 is an event that triggers both hardware assists, we will
    generate but a single record, but again with both counters listed in the
    status field.

    This time the record pertains to both events.

    Note that these two cases are different but indistinguishable with the
    data as generated. Therefore demuxing records with multiple PEBS bits
    (we can safely ignore status bits for !PEBS counters) is impossible.

    Furthermore we cannot emit the record to both events because that might
    cause a data leak -- the events might not have the same privileges -- so
    what this patch does is discard such events.

    The assumption/hope is that such discards will be rare.

    Here are some possible ways you may get a high discard rate:

    - when you count the same thing multiple times. But it is not a useful
      configuration.
    - you can be unfortunate if you measure with a userspace only PEBS
      event along with either a kernel or unrestricted PEBS event. Imagine
      the event triggering and setting the overflow flag right before
      entering the kernel. Then all kernel side events will end up with
      multiple bits set.

    Signed-off-by: Yan, Zheng
    Signed-off-by: Kan Liang
    [ Changelog improvements. ]
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Andrew Morton
    Cc: H. Peter Anvin
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: acme@infradead.org
    Cc: eranian@google.com
    Link: http://lkml.kernel.org/r/1430940834-8964-4-git-send-email-kan.liang@intel.com
    Signed-off-by: Ingo Molnar

    Signed-off-by: Ingo Molnar

    Yan, Zheng
     

02 Apr, 2015

5 commits

  • When AUX area gets a certain amount of new data, we want to wake up
    userspace to collect it. This adds a new control to specify how much
    data will cause a wakeup. This is then passed down to pmu drivers via
    output handle's "wakeup" field, so that the driver can find the nearest
    point where it can generate an interrupt.

    We repurpose __reserved_2 in the event attribute for this; even though
    it was never checked to be zero before, aux_watermark will only matter
    for new AUX-aware code, so the old code should still be fine.
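
    From the userspace side this is just another perf_event_attr field (a
    sketch; the PMU type and the AUX area mapping are set up elsewhere):

    #include <string.h>
    #include <linux/perf_event.h>

    static void setup_attr(struct perf_event_attr *attr, __u32 aux_pmu_type)
    {
            memset(attr, 0, sizeof(*attr));
            attr->size          = sizeof(*attr);
            attr->type          = aux_pmu_type;     /* an AUX-capable PMU         */
            attr->aux_watermark = 64 * 1024;        /* wake the reader per 64 KiB */
    }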

    Signed-off-by: Alexander Shishkin
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Borislav Petkov
    Cc: Frederic Weisbecker
    Cc: H. Peter Anvin
    Cc: Kaixu Xia
    Cc: Linus Torvalds
    Cc: Mike Galbraith
    Cc: Paul Mackerras
    Cc: Robert Richter
    Cc: Stephane Eranian
    Cc: Thomas Gleixner
    Cc: acme@infradead.org
    Cc: adrian.hunter@intel.com
    Cc: kan.liang@intel.com
    Cc: markus.t.metzger@intel.com
    Cc: mathieu.poirier@linaro.org
    Link: http://lkml.kernel.org/r/1421237903-181015-10-git-send-email-alexander.shishkin@linux.intel.com
    Signed-off-by: Ingo Molnar

    Alexander Shishkin
     
  • This adds support for overwrite mode in the AUX area, which means "keep
    collecting data till you're stopped", turning AUX area into a circular
    buffer, where new data overwrites old data. It does not depend on data
    buffer's overwrite mode, so that it doesn't lose sideband data that is
    instrumental for processing AUX data.

    Overwrite mode is enabled by mapping the AUX area read-only. Even though
    aux_tail in the buffer's user page might be user writable, it will be
    ignored in this mode.

    A PERF_RECORD_AUX with PERF_AUX_FLAG_OVERWRITE set is written to the perf
    data stream every time an event writes new data to the AUX area. The pmu
    driver might not be able to infer the exact beginning of the new data in
    each snapshot; some drivers will only provide the tail, which is
    aux_offset + aux_size in the AUX record. The consumer has to be able to
    tell the new data from the old one, for example by means of time stamps
    if such are provided in the trace.

    The consumer is also responsible for disabling any events that might write
    to the AUX area (thus potentially racing with the consumer) before
    collecting the data.

    Signed-off-by: Alexander Shishkin
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Borislav Petkov
    Cc: Frederic Weisbecker
    Cc: H. Peter Anvin
    Cc: Kaixu Xia
    Cc: Linus Torvalds
    Cc: Mike Galbraith
    Cc: Paul Mackerras
    Cc: Robert Richter
    Cc: Stephane Eranian
    Cc: Thomas Gleixner
    Cc: acme@infradead.org
    Cc: adrian.hunter@intel.com
    Cc: kan.liang@intel.com
    Cc: markus.t.metzger@intel.com
    Cc: mathieu.poirier@linaro.org
    Link: http://lkml.kernel.org/r/1421237903-181015-9-git-send-email-alexander.shishkin@linux.intel.com
    Signed-off-by: Ingo Molnar

    Alexander Shishkin
     
  • For pmus that wish to write data to ring buffer's AUX area, provide
    perf_aux_output_{begin,end}() calls to initiate/commit data writes,
    similarly to perf_output_{begin,end}. These also use the same output
    handle structure. Also, similarly to software counterparts, these
    will direct inherited events' output to parents' ring buffers.

    After perf_aux_output_begin() returns successfully, handle->size
    is set to the maximum amount of data that can be written wrt the aux_tail
    pointer, so that no data that the user hasn't seen will be overwritten;
    therefore this should always be called before hardware writing is
    enabled. On success, it returns the pointer to the pmu driver's
    private structure allocated for this aux area by pmu::setup_aux. The same
    pointer can also be retrieved using perf_get_aux() while hardware
    writing is enabled.

    PMU driver should pass the actual amount of data written as a parameter
    to perf_aux_output_end(). All hardware writes should be completed and
    visible before this one is called.

    Additionally, perf_aux_output_skip() will adjust output handle and
    aux_head in case some part of the buffer has to be skipped over to
    maintain hardware's alignment constraints.

    Nested writers are forbidden and guards are in place to catch such
    attempts.
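
    Roughly, a PMU driver's start/stop path would use the API like this (a
    sketch; the driver-private buffer handling, handle placement and
    hardware programming are illustrative, and the perf_aux_output_end()
    'truncated' argument is the one introduced here):

    static DEFINE_PER_CPU(struct perf_output_handle, my_handle);

    static void my_pmu_start_trace(struct perf_event *event)
    {
            struct perf_output_handle *handle = this_cpu_ptr(&my_handle);
            void *buf;

            buf = perf_aux_output_begin(handle, event);
            if (!buf)
                    return;         /* no AUX buffer mapped, or no space left */

            /*
             * Program the hardware to write at most handle->size bytes into
             * 'buf', raising an interrupt no later than the wakeup point.
             */
    }

    static void my_pmu_stop_trace(struct perf_event *event, unsigned long written)
    {
            struct perf_output_handle *handle = this_cpu_ptr(&my_handle);

            /* All hardware writes must be complete and visible before this. */
            perf_aux_output_end(handle, written, false /* not truncated */);
    }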

    Signed-off-by: Alexander Shishkin
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Borislav Petkov
    Cc: Frederic Weisbecker
    Cc: H. Peter Anvin
    Cc: Kaixu Xia
    Cc: Linus Torvalds
    Cc: Mike Galbraith
    Cc: Paul Mackerras
    Cc: Robert Richter
    Cc: Stephane Eranian
    Cc: Thomas Gleixner
    Cc: acme@infradead.org
    Cc: adrian.hunter@intel.com
    Cc: kan.liang@intel.com
    Cc: markus.t.metzger@intel.com
    Cc: mathieu.poirier@linaro.org
    Link: http://lkml.kernel.org/r/1421237903-181015-8-git-send-email-alexander.shishkin@linux.intel.com
    Signed-off-by: Ingo Molnar

    Alexander Shishkin
     
  • When there's new data in the AUX space, output a record indicating its
    offset and size and a set of flags, such as PERF_AUX_FLAG_TRUNCATED, to
    mean the described data was truncated to fit in the ring buffer.
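
    The record body looks roughly like this (a sketch of the layout; the
    struct name is illustrative, the fields follow the uapi description):

    struct perf_record_aux {
            struct perf_event_header        header;         /* type = PERF_RECORD_AUX       */
            __u64                           aux_offset;     /* where the new data starts    */
            __u64                           aux_size;       /* how much new data there is   */
            __u64                           flags;          /* e.g. PERF_AUX_FLAG_TRUNCATED */
            /* followed by the usual sample_id fields */
    };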

    Signed-off-by: Alexander Shishkin
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Arnaldo Carvalho de Melo
    Cc: Borislav Petkov
    Cc: Frederic Weisbecker
    Cc: H. Peter Anvin
    Cc: Kaixu Xia
    Cc: Linus Torvalds
    Cc: Mike Galbraith
    Cc: Paul Mackerras
    Cc: Robert Richter
    Cc: Stephane Eranian
    Cc: Thomas Gleixner
    Cc: adrian.hunter@intel.com
    Cc: kan.liang@intel.com
    Cc: markus.t.metzger@intel.com
    Cc: mathieu.poirier@linaro.org
    Link: http://lkml.kernel.org/r/1421237903-181015-7-git-send-email-alexander.shishkin@linux.intel.com
    Signed-off-by: Ingo Molnar

    Alexander Shishkin
     
  • This patch introduces "AUX space" in the perf mmap buffer, intended for
    exporting high bandwidth data streams to userspace, such as instruction
    flow traces.

    AUX space is a ring buffer, defined by aux_{offset,size} fields in the
    user_page structure, and read/write pointers aux_{head,tail}, which abide
    by the same rules as data_* counterparts of the main perf buffer.

    In order to allocate/mmap AUX, userspace needs to set aux_offset to an
    offset greater than data_offset + data_size, and aux_size to the desired
    buffer size. Both need to be page aligned. Then the same aux_offset and
    aux_size should be passed to the mmap() call and, if everything adds up,
    you should have an AUX buffer as a result.

    Pages that are mapped into this buffer also come out of user's mlock
    rlimit plus perf_event_mlock_kb allowance.
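
    A sketch of that userspace sequence ('fd' is the perf_event_open() fd,
    'base' the existing data mapping whose first page is the user page, and
    'data_len' its total length; for the overwrite mode described in the
    entry above, the AUX area would be mapped PROT_READ only):

    #include <sys/mman.h>
    #include <linux/perf_event.h>

    static void *map_aux(int fd, void *base, size_t data_len, size_t aux_size)
    {
            struct perf_event_mmap_page *up = base;

            /* Both values must be page aligned; AUX starts past the data area. */
            up->aux_offset = data_len;
            up->aux_size   = aux_size;

            return mmap(NULL, aux_size, PROT_READ | PROT_WRITE, MAP_SHARED,
                        fd, up->aux_offset);
    }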

    Signed-off-by: Peter Zijlstra (Intel)
    Acked-by: Alexander Shishkin
    Cc: Borislav Petkov
    Cc: Frederic Weisbecker
    Cc: H. Peter Anvin
    Cc: Kaixu Xia
    Cc: Linus Torvalds
    Cc: Mike Galbraith
    Cc: Paul Mackerras
    Cc: Robert Richter
    Cc: Stephane Eranian
    Cc: Thomas Gleixner
    Cc: acme@infradead.org
    Cc: adrian.hunter@intel.com
    Cc: kan.liang@intel.com
    Cc: markus.t.metzger@intel.com
    Cc: mathieu.poirier@linaro.org
    Link: http://lkml.kernel.org/r/1421237903-181015-3-git-send-email-alexander.shishkin@linux.intel.com
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     

06 Nov, 2013

1 commit

  • The arch_perf_output_copy_user() default of
    __copy_from_user_inatomic() returns bytes not copied, while all other
    argument functions given DEFINE_OUTPUT_COPY() return bytes copied.

    Since copy_from_user_nmi() is the odd duck out by returning bytes
    copied where all other *copy_{to,from}* functions return bytes not
    copied, change it over and amend DEFINE_OUTPUT_COPY() to expect bytes
    not copied.

    Oddly enough DEFINE_OUTPUT_COPY() already returned bytes not copied
    while expecting its worker functions to return bytes copied.

    Signed-off-by: Peter Zijlstra
    Acked-by: will.deacon@arm.com
    Cc: Frederic Weisbecker
    Link: http://lkml.kernel.org/r/20131030201622.GR16117@laptop.programming.kicks-ass.net
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     

19 Jun, 2013

1 commit

  • Vince's fuzzer once again found holes. This time it spotted a leak in
    the locked page accounting.

    When an event had redirected output and its close() was the last
    reference to the buffer we didn't have a vm context to undo accounting.

    Change the code to destroy the buffer on the last munmap() and detach
    all redirected events at that time. This provides us the right context
    to undo the vm accounting.

    Reported-and-tested-by: Vince Weaver
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/20130604084421.GI8923@twins.programming.kicks-ass.net
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     

28 May, 2013

1 commit

  • Vince reported a problem found by his perf specific trinity
    fuzzer.

    Al noticed 2 problems with perf's mmap():

    - it has issues against fork() since we use vma->vm_mm for accounting.
    - it has an rb refcount leak on double mmap().

    We fix the issues against fork() by using VM_DONTCOPY; I don't
    think there's code out there that uses this; we didn't hear
    about weird accounting problems/crashes. If we do need this to
    work, the previously proposed VM_PINNED could make this work.

    Aside from the rb reference leak spotted by Al, Vince's example
    prog was indeed doing a double mmap() through the use of
    perf_event_set_output().

    This exposes another problem, since we now have 2 events with
    one buffer, the accounting gets screwy because we account per
    event. Fix this by making the buffer responsible for its own
    accounting.

    Reported-by: Vince Weaver
    Signed-off-by: Peter Zijlstra
    Cc: Al Viro
    Cc: Paul Mackerras
    Cc: Arnaldo Carvalho de Melo
    Link: http://lkml.kernel.org/r/20130528085548.GA12193@twins.programming.kicks-ass.net
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     

21 Mar, 2013

1 commit

    This patch fixes a flaw in perf_output_space(). In case the size
    of the space needed is bigger than the actual buffer size, there
    may be situations where the function would return true (i.e.,
    there is space) when it should not, because head > offset after the
    rounding done by the masking logic.

    The problem can be tested by activating BTS on Intel processors.
    A BTS record can be as big as 16 pages. The following command
    fails:

    $ perf record -m 4 -c 1 -e branches:u my_test_program

    You will get a buffer corruption with this. Perf report won't be
    able to parse the perf.data.

    The fix is to first check that the requested space is smaller
    than the buffer size. If so, then the masking logic will work
    fine. If not, then there is no chance the record can be saved
    and it will be gracefully handled by upper code layers.

    [ In v2, we also make the logic for the writable more explicit by
    renaming it to rb->overwrite because it tells whether or not the
    buffer can overwrite its tail (suggested by PeterZ). ]
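
    The shape of the fix, per the description above (a sketch of
    perf_output_space(); perf_data_size() returns the data buffer size in
    bytes):

    static bool perf_output_space(struct ring_buffer *rb, unsigned long tail,
                                  unsigned long offset, unsigned long head)
    {
            unsigned long sz = perf_data_size(rb);
            unsigned long mask = sz - 1;

            /* An overwrite buffer may always overwrite its own tail. */
            if (rb->overwrite)
                    return true;

            /*
             * New check: a payload bigger than the whole buffer can never
             * fit and would defeat the masking logic below.
             */
            if ((head - offset) > sz)
                    return false;

            offset = (offset - tail) & mask;
            head   = (head   - tail) & mask;

            return (int)(head - offset) >= 0;
    }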

    Signed-off-by: Stephane Eranian
    Acked-by: Peter Zijlstra
    Cc: peterz@infradead.org
    Cc: jolsa@redhat.com
    Cc: fweisbec@gmail.com
    Link: http://lkml.kernel.org/r/20130318133327.GA3056@quad
    Signed-off-by: Ingo Molnar

    Stephane Eranian
     

10 Aug, 2012

3 commits

  • Introducing PERF_SAMPLE_STACK_USER sample type bit to trigger the dump
    of the user level stack on sample. The size of the dump is specified by
    sample_stack_user value.

    Being able to dump parts of the user stack, starting from the stack
    pointer, will be useful for post-mortem DWARF CFI based stack
    unwinding.

    Added HAVE_PERF_USER_STACK_DUMP config option to determine if the
    architecture provides user stack dump on perf event samples. This needs
    access to the user stack pointer which is not unified across
    architectures. Enabling this for x86 architecture.
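
    On the userspace side, requesting the dump comes down to two attr
    fields (a sketch; DWARF-based unwinding in practice also requests user
    registers via a companion patch in this series):

    #include <linux/perf_event.h>

    static void request_user_stack(struct perf_event_attr *attr)
    {
            attr->sample_type       |= PERF_SAMPLE_STACK_USER;
            attr->sample_stack_user  = 8192;        /* bytes of user stack per sample */
    }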

    Signed-off-by: Jiri Olsa
    Original-patch-by: Frederic Weisbecker
    Cc: "Frank Ch. Eigler"
    Cc: Arun Sharma
    Cc: Benjamin Redelings
    Cc: Corey Ashford
    Cc: Cyrill Gorcunov
    Cc: Frank Ch. Eigler
    Cc: Frederic Weisbecker
    Cc: Ingo Molnar
    Cc: Masami Hiramatsu
    Cc: Paul Mackerras
    Cc: Peter Zijlstra
    Cc: Robert Richter
    Cc: Stephane Eranian
    Cc: Tom Zanussi
    Cc: Ulrich Drepper
    Link: http://lkml.kernel.org/r/1344345647-11536-6-git-send-email-jolsa@redhat.com
    Signed-off-by: Arnaldo Carvalho de Melo

    Jiri Olsa
     
  • Introducing perf_output_skip function to be able to skip data within the
    perf ring buffer.

    When writing data into perf ring buffer we first reserve needed place in
    ring buffer and then copy the actual data.

    There's a possibility we won't be able to fill all the reserved size
    with data, so we need a way to skip the remaining bytes.

    This is going to be useful when storing the user stack dump, where we
    might end up with less data than we originally requested.
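
    A sketch of the intended pattern (helper signatures as in
    kernel/events/ring_buffer.c; error handling abridged):

    static void output_with_skip(struct perf_event *event, const void *data,
                                 unsigned int actual, unsigned int reserved)
    {
            struct perf_output_handle handle;

            /* Reserve the worst case, write what we have, skip the rest. */
            if (perf_output_begin(&handle, event, reserved))
                    return;

            perf_output_copy(&handle, data, actual);
            perf_output_skip(&handle, reserved - actual);
            perf_output_end(&handle);
    }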

    Signed-off-by: Jiri Olsa
    Acked-by: Frederic Weisbecker
    Cc: "Frank Ch. Eigler"
    Cc: Arun Sharma
    Cc: Benjamin Redelings
    Cc: Corey Ashford
    Cc: Cyrill Gorcunov
    Cc: Frank Ch. Eigler
    Cc: Frederic Weisbecker
    Cc: Ingo Molnar
    Cc: Masami Hiramatsu
    Cc: Paul Mackerras
    Cc: Peter Zijlstra
    Cc: Robert Richter
    Cc: Stephane Eranian
    Cc: Tom Zanussi
    Cc: Ulrich Drepper
    Link: http://lkml.kernel.org/r/1344345647-11536-5-git-send-email-jolsa@redhat.com
    Signed-off-by: Arnaldo Carvalho de Melo

    Jiri Olsa
     
    Adding a generic way to use the __output_copy function with a specific
    copy function via the DEFINE_OUTPUT_COPY macro.

    Using this to add a new __output_copy_user function that provides output
    copy from user pointers. For x86 the copy_from_user_nmi function is used
    and __copy_from_user_inatomic for the rest of the architectures.

    This new function will be used in user stack dump on sample, coming in
    next patches.
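
    The macro is then instantiated once per copy strategy, roughly (a
    sketch following kernel/events/internal.h; the user-copy helper is
    selected per architecture at build time):

    /* kernel-space sources: a plain memcpy-like helper */
    DEFINE_OUTPUT_COPY(__output_copy, memcpy_common)

    /* user-space sources: copy_from_user_nmi() on x86,
     * __copy_from_user_inatomic() on other architectures */
    DEFINE_OUTPUT_COPY(__output_copy_user, arch_perf_out_copy_user)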

    Signed-off-by: Jiri Olsa
    Cc: "Frank Ch. Eigler"
    Cc: Arun Sharma
    Cc: Benjamin Redelings
    Cc: Corey Ashford
    Cc: Cyrill Gorcunov
    Cc: Frank Ch. Eigler
    Cc: Ingo Molnar
    Cc: Jiri Olsa
    Cc: Masami Hiramatsu
    Cc: Paul Mackerras
    Cc: Peter Zijlstra
    Cc: Robert Richter
    Cc: Stephane Eranian
    Cc: Tom Zanussi
    Cc: Ulrich Drepper
    Link: http://lkml.kernel.org/r/1344345647-11536-4-git-send-email-jolsa@redhat.com
    Signed-off-by: Frederic Weisbecker
    Signed-off-by: Arnaldo Carvalho de Melo

    Frederic Weisbecker
     

31 Jul, 2012

1 commit

  • A few events are interesting not only for a current task.
    For example, sched_stat_* events are interesting for a task
    which wakes up. For this reason, it would be good if such
    events were delivered to a target task too.

    Now a target task can be set by using __perf_task().

    The original idea and a draft patch belongs to Peter Zijlstra.

    I need these events for profiling sleep times. sched_switch is used for
    getting callchains and sched_stat_* is used for getting time periods.
    These events are combined in user space, then it can be analyzed by
    perf tools.

    Inspired-by: Peter Zijlstra
    Cc: Steven Rostedt
    Cc: Paul Mackerras
    Cc: Arnaldo Carvalho de Melo
    Cc: Steven Rostedt
    Cc: Arun Sharma
    Signed-off-by: Andrew Vagin
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/1342016098-213063-1-git-send-email-avagin@openvz.org
    Signed-off-by: Ingo Molnar

    Andrew Vagin
     

06 Dec, 2011

1 commit


05 Dec, 2011

1 commit

  • When you do:
    $ perf record -e cycles,cycles,cycles noploop 10

    You expect about 10,000 samples for each event, i.e., 10s at
    1000 samples/sec. However, this is not what's happening. You
    get far fewer samples, maybe 3700 samples/event:

    $ perf report -D | tail -15
    Aggregated stats:
    TOTAL events: 10998
    MMAP events: 66
    COMM events: 2
    SAMPLE events: 10930
    cycles stats:
    TOTAL events: 3644
    SAMPLE events: 3644
    cycles stats:
    TOTAL events: 3642
    SAMPLE events: 3642
    cycles stats:
    TOTAL events: 3644
    SAMPLE events: 3644

    On an Intel Nehalem or even AMD64, there are 4 counters capable
    of measuring cycles, so there is plenty of space to measure those
    events without multiplexing (even with the NMI watchdog active).
    And even with multiplexing, we'd expect roughly the same number
    of samples per event.

    The root of the problem was that when the event that caused the buffer
    to become full was not the first event passed on the cmdline, the user
    notification would get lost. The notification was sent to the file
    descriptor of the overflowed event but the perf tool was not polling
    on it. The perf tool aggregates all samples into a single buffer,
    i.e., the buffer of the first event. Consequently, it assumes
    notifications for any event will come via that descriptor.

    The seemingly straightforward solution of moving the waitq into the
    ringbuffer object doesn't work because of life-time issues. One could
    perf_event_set_output() on a fd that you're also blocking on and cause
    the old rb object to be freed while its waitq would still be
    referenced by the blocked thread -> FAIL.

    Therefore link all events to the ringbuffer and broadcast the wakeup
    from the ringbuffer object to all possible events that could be waited
    upon. This is rather ugly, and we're open to better solutions but it
    works for now.
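
    A sketch of the broadcast described above, modeled on the
    ring_buffer_wakeup() helper this change introduces (locking around the
    event list management is omitted):

    static void ring_buffer_wakeup(struct perf_event *event)
    {
            struct ring_buffer *rb;

            rcu_read_lock();
            rb = rcu_dereference(event->rb);
            if (rb) {
                    /* every event redirected into this buffer gets the wakeup */
                    list_for_each_entry_rcu(event, &rb->event_list, rb_entry)
                            wake_up_all(&event->waitq);
            }
            rcu_read_unlock();
    }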

    Reported-by: Stephane Eranian
    Finished-by: Stephane Eranian
    Reviewed-by: Stephane Eranian
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/20111126014731.GA7030@quad
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     

14 Nov, 2011

1 commit

  • Split the callchain code from the perf events core into
    a new kernel/events/callchain.c file.

    This simplifies the big core.c a bit.

    Signed-off-by: Borislav Petkov
    Cc: Arnaldo Carvalho de Melo
    Cc: Stephane Eranian
    [keep ctx recursion handling inline and use internal headers]
    Signed-off-by: Frederic Weisbecker
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/1318778104-17152-1-git-send-email-fweisbec@gmail.com
    Signed-off-by: Ingo Molnar

    Borislav Petkov
     

01 Jul, 2011

1 commit

  • The nmi parameter indicated if we could do wakeups from the current
    context, if not, we would set some state and self-IPI and let the
    resulting interrupt do the wakeup.

    For the various event classes:

    - hardware: nmi=0; PMI is in fact an NMI or we run irq_work_run from
    the PMI-tail (ARM etc.)
    - tracepoint: nmi=0; since tracepoint could be from NMI context.
    - software: nmi=[0,1]; some, like the schedule thing, cannot
      perform wakeups, and hence need 0.

    As one can see, there is very little nmi=1 usage, and the down-side of
    not using it is that on some platforms some software events can have a
    jiffy delay in wakeup (when arch_irq_work_raise isn't implemented).

    The up-side however is that we can remove the nmi parameter and save a
    bunch of conditionals in fast paths.
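
    The visible effect is one parameter dropped along the overflow path,
    roughly (a sketch; only the removed argument matters here):

    /*
     * was: int perf_event_overflow(struct perf_event *event, int nmi,
     *                              struct perf_sample_data *data,
     *                              struct pt_regs *regs);
     */
    int perf_event_overflow(struct perf_event *event,
                            struct perf_sample_data *data,
                            struct pt_regs *regs);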

    Signed-off-by: Peter Zijlstra
    Cc: Michael Cree
    Cc: Will Deacon
    Cc: Deng-Cheng Zhu
    Cc: Anton Blanchard
    Cc: Eric B Munson
    Cc: Heiko Carstens
    Cc: Paul Mundt
    Cc: David S. Miller
    Cc: Frederic Weisbecker
    Cc: Jason Wessel
    Cc: Don Zickus
    Link: http://lkml.kernel.org/n/tip-agjev8eu666tvknpb3iaj0fg@git.kernel.org
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     

09 Jun, 2011

1 commit

  • And create the internal perf events header.

    v2: Keep an internal inlined perf_output_copy()

    Signed-off-by: Frederic Weisbecker
    Acked-by: Peter Zijlstra
    Cc: Borislav Petkov
    Cc: Stephane Eranian
    Cc: Arnaldo Carvalho de Melo
    Cc: Steven Rostedt
    Link: http://lkml.kernel.org/r/1305827704-5607-1-git-send-email-fweisbec@gmail.com
    [ v3: use clearer 'ring_buffer' and 'rb' naming ]
    Signed-off-by: Ingo Molnar

    Frederic Weisbecker