04 Feb, 2019

2 commits

  • atomic_t variables are currently used to implement reference
    counters with the following properties:

    - counter is initialized to 1 using atomic_set()
    - a resource is freed upon counter reaching zero
    - once counter reaches zero, its further increments aren't allowed
    - counter schema uses basic atomic operations
      (set, inc, inc_not_zero, dec_and_test, etc.)

    Such atomic variables should be converted to a newly provided
    refcount_t type and API that prevents accidental counter overflows
    and underflows. This is important since overflows and underflows
    can lead to a use-after-free situation and be exploitable.

    The variable ring_buffer.aux_refcount is used as pure reference counter.
    Convert it to refcount_t and fix up the operations.

    ** Important note for maintainers:

    Some functions from refcount_t API defined in lib/refcount.c
    have different memory ordering guarantees than their atomic
    counterparts. Please check Documentation/core-api/refcount-vs-atomic.rst
    for more information.

    Normally the differences should not matter since refcount_t provides
    enough guarantees to satisfy the refcounting use cases, but in
    some rare cases it might matter.
    Please double check that you don't rely on any undocumented memory
    ordering guarantees for this variable's usage.

    For the ring_buffer.aux_refcount it might make a difference
    in the following places:

    - perf_aux_output_begin(): increment in refcount_inc_not_zero() only
      guarantees control dependency on success vs. fully ordered atomic
      counterpart
    - rb_free_aux(): decrement in refcount_dec_and_test() only provides
      RELEASE ordering and ACQUIRE ordering + control dependency on success
      vs. fully ordered atomic counterpart
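
    As a rough illustration of the conversion pattern described above (a
    minimal sketch; the helper names around the refcount are illustrative,
    the refcount operations themselves are the refcount_t API):

    #include <linux/refcount.h>

    struct ring_buffer {
            refcount_t      aux_refcount;   /* was: atomic_t aux_refcount; */
            /* ... */
    };

    /* perf_aux_output_begin() path: only take a reference while non-zero */
    static bool rb_get_aux(struct ring_buffer *rb)
    {
            /* was: atomic_inc_not_zero(&rb->aux_refcount) */
            return refcount_inc_not_zero(&rb->aux_refcount);
    }

    /* rb_free_aux() path: free the AUX pages on the 1 -> 0 transition */
    static void rb_put_aux(struct ring_buffer *rb)
    {
            /* was: atomic_dec_and_test(&rb->aux_refcount) */
            if (refcount_dec_and_test(&rb->aux_refcount))
                    __rb_free_aux(rb);
    }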

    Suggested-by: Kees Cook
    Signed-off-by: Elena Reshetova
    Signed-off-by: Peter Zijlstra (Intel)
    Reviewed-by: David Windsor
    Reviewed-by: Hans Liljestrand
    Cc: Alexander Shishkin
    Cc: Arnaldo Carvalho de Melo
    Cc: Jiri Olsa
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: acme@kernel.org
    Cc: namhyung@kernel.org
    Link: https://lkml.kernel.org/r/1548678448-24458-4-git-send-email-elena.reshetova@intel.com
    Signed-off-by: Ingo Molnar

    Elena Reshetova
     
  • atomic_t variables are currently used to implement reference
    counters with the following properties:

    - counter is initialized to 1 using atomic_set()
    - a resource is freed upon counter reaching zero
    - once counter reaches zero, its further increments aren't allowed
    - counter schema uses basic atomic operations
      (set, inc, inc_not_zero, dec_and_test, etc.)

    Such atomic variables should be converted to a newly provided
    refcount_t type and API that prevents accidental counter overflows
    and underflows. This is important since overflows and underflows
    can lead to a use-after-free situation and be exploitable.

    The variable ring_buffer.refcount is used as pure reference counter.
    Convert it to refcount_t and fix up the operations.

    ** Important note for maintainers:

    Some functions from refcount_t API defined in lib/refcount.c
    have different memory ordering guarantees than their atomic
    counterparts. Please check Documentation/core-api/refcount-vs-atomic.rst
    for more information.

    Normally the differences should not matter since refcount_t provides
    enough guarantees to satisfy the refcounting use cases, but in
    some rare cases it might matter.
    Please double check that you don't rely on any undocumented memory
    ordering guarantees for this variable's usage.

    For the ring_buffer.refcount it might make a difference
    in the following places:

    - ring_buffer_get(): increment in refcount_inc_not_zero() only
      guarantees control dependency on success vs. fully ordered atomic
      counterpart
    - ring_buffer_put(): decrement in refcount_dec_and_test() only provides
      RELEASE ordering and ACQUIRE ordering + control dependency on success
      vs. fully ordered atomic counterpart
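
    The same mapping applies to the allocation and teardown paths of this
    counter; a minimal sketch, with the surrounding function shapes
    illustrative:

    #include <linux/rcupdate.h>
    #include <linux/refcount.h>

    /* allocation: was atomic_set(&rb->refcount, 1); */
    static void rb_init(struct ring_buffer *rb)
    {
            refcount_set(&rb->refcount, 1);
    }

    /* ring_buffer_put(): was "if (atomic_dec_and_test(&rb->refcount))" */
    static void ring_buffer_put(struct ring_buffer *rb)
    {
            if (!refcount_dec_and_test(&rb->refcount))
                    return;

            /* last reference gone; free the buffer (RCU-deferred in core.c) */
            call_rcu(&rb->rcu_head, rb_free_rcu);
    }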

    Suggested-by: Kees Cook
    Signed-off-by: Elena Reshetova
    Signed-off-by: Peter Zijlstra (Intel)
    Reviewed-by: David Windsor
    Reviewed-by: Hans Liljestrand
    Cc: Alexander Shishkin
    Cc: Arnaldo Carvalho de Melo
    Cc: Jiri Olsa
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: acme@kernel.org
    Cc: namhyung@kernel.org
    Link: https://lkml.kernel.org/r/1548678448-24458-3-git-send-email-elena.reshetova@intel.com
    Signed-off-by: Ingo Molnar

    Elena Reshetova
     

08 Jan, 2018

1 commit

  • And move it to core.c, because there's no caller of this function other
    than the one in core.c

    Signed-off-by: Jiri Olsa
    Cc: Alexander Shishkin
    Cc: Andi Kleen
    Cc: David Ahern
    Cc: Namhyung Kim
    Cc: Peter Zijlstra
    Link: http://lkml.kernel.org/r/20180107160356.28203-6-jolsa@kernel.org
    Signed-off-by: Arnaldo Carvalho de Melo

    Jiri Olsa
     

02 Nov, 2017

1 commit

  • Many source files in the tree are missing licensing information, which
    makes it harder for compliance tools to determine the correct license.

    By default all files without license information are under the default
    license of the kernel, which is GPL version 2.

    Update the files which contain no license information with the 'GPL-2.0'
    SPDX license identifier. The SPDX identifier is a legally binding
    shorthand, which can be used instead of the full boilerplate text.
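
    In practice the identifier is a single comment line at the very top of
    the file; for a header such as kernel/events/internal.h it looks like
    this (the rest of the file is unchanged):

    /* SPDX-License-Identifier: GPL-2.0 */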

    This patch is based on work done by Thomas Gleixner and Kate Stewart and
    Philippe Ombredanne.

    How this work was done:

    Patches were generated and checked against linux-4.14-rc6 for a subset of
    the use cases:
    - file had no licensing information in it,
    - file was a */uapi/* one with no licensing information in it,
    - file was a */uapi/* one with existing licensing information.

    Further patches will be generated in subsequent months to fix up cases
    where non-standard license headers were used, and references to license
    had to be inferred by heuristics based on keywords.

    The analysis to determine which SPDX License Identifier should be applied
    to a file was done in a spreadsheet of side-by-side results from the
    output of two independent scanners (ScanCode & Windriver) producing SPDX
    tag:value files, created by Philippe Ombredanne. Philippe prepared the
    base worksheet and did an initial spot review of a few thousand files.

    The 4.13 kernel was the starting point of the analysis with 60,537 files
    assessed. Kate Stewart did a file by file comparison of the scanner
    results in the spreadsheet to determine which SPDX license identifier(s)
    to be applied to the file. She confirmed any determination that was not
    immediately clear with lawyers working with the Linux Foundation.

    Criteria used to select files for SPDX license identifier tagging were:
    - Files considered eligible had to be source code files.
    - Make and config files were included as candidates if they contained >5
    lines of source
    - File already had some variant of a license header in it (even if
    Reviewed-by: Philippe Ombredanne
    Reviewed-by: Thomas Gleixner
    Signed-off-by: Greg Kroah-Hartman

    Greg Kroah-Hartman
     

25 Aug, 2017

3 commits

    In an XDP redirect application using the tracepoint xdp:xdp_redirect to
    diagnose TX overruns, I noticed perf_swevent_get_recursion_context()
    was consuming 2% CPU. This was reduced to 1.85% with this simple
    change.

    Looking at the annotated asm code, it was clear that the in_nmi() test,
    although an unlikely case, was chosen (by the compiler) as the most
    likely event/branch. This small adjustment makes the compiler (GCC
    version 7.1.1 20170622 (Red Hat 7.1.1-3)) put in_nmi() as an unlikely
    branch.
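
    A sketch of the annotation being described, modeled on
    get_recursion_context() in kernel/events/internal.h (body abridged to
    the relevant branch chain):

    static inline int get_recursion_context(int *recursion)
    {
            int rctx;

            if (unlikely(in_nmi()))         /* was: if (in_nmi()) */
                    rctx = 3;
            else if (in_irq())
                    rctx = 2;
            else if (in_softirq())
                    rctx = 1;
            else
                    rctx = 0;

            if (recursion[rctx])
                    return -1;

            recursion[rctx]++;
            barrier();

            return rctx;
    }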

    Signed-off-by: Jesper Dangaard Brouer
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Arnaldo Carvalho de Melo
    Cc: Jiri Olsa
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Link: http://lkml.kernel.org/r/150342256382.16595.986861478681783732.stgit@firesoul
    Signed-off-by: Ingo Molnar

    Jesper Dangaard Brouer
     
  • The aux_watermark member of struct ring_buffer represents the period (in
    terms of bytes) at which wakeup events should be generated when data is
    written to the aux buffer in non-snapshot mode. On hardware that cannot
    generate an interrupt when the aux_head reaches an arbitrary wakeup index
    (such as ARM SPE), the aux_head sampled from handle->head in
    perf_aux_output_{skip,end} may in fact be past the wakeup index. This
    can lead to wakeup slowly falling behind the head. For example, consider
    the case where hardware can only generate an interrupt on a page-boundary
    and the aux buffer is initialised as follows:

    // Buffer size is 2 * PAGE_SIZE
    rb->aux_head = rb->aux_wakeup = 0
    rb->aux_watermark = PAGE_SIZE / 2

    following the first perf_aux_output_begin call, the handle is
    initialised with:

    handle->head = 0
    handle->size = 2 * PAGE_SIZE
    handle->wakeup = PAGE_SIZE / 2

    and the hardware will be programmed to generate an interrupt at
    PAGE_SIZE.

    When the interrupt is raised, the hardware head will be at PAGE_SIZE,
    so calling perf_aux_output_end(handle, PAGE_SIZE) puts the ring buffer
    into the following state:

    rb->aux_head = PAGE_SIZE
    rb->aux_wakeup = PAGE_SIZE / 2
    rb->aux_watermark = PAGE_SIZE / 2

    and then the next call to perf_aux_output_begin will result in:

    handle->head = handle->wakeup = PAGE_SIZE

    for which the semantics are unclear; for a smaller aux_watermark
    (e.g. PAGE_SIZE / 4), the wakeup would in fact be behind the head at
    this point.

    This patch fixes the problem by rounding down the aux_head (as sampled
    from the handle) to the nearest aux_watermark boundary when updating
    rb->aux_wakeup, therefore taking into account any overruns by the
    hardware.
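
    A sketch of the fix as described (the helper name follows what this
    area of kernel/events/ring_buffer.c ends up using; rounddown() is the
    generic kernel macro):

    /* Decide whether perf_aux_output_{skip,end} should wake up the consumer. */
    static bool rb_need_aux_wakeup(struct ring_buffer *rb)
    {
            if (rb->aux_head - rb->aux_wakeup >= rb->aux_watermark) {
                    /*
                     * was: rb->aux_wakeup += rb->aux_watermark, which slowly
                     * falls behind aux_head when the hardware overruns the
                     * requested wakeup point.
                     */
                    rb->aux_wakeup = rounddown(rb->aux_head, rb->aux_watermark);
                    return true;
            }

            return false;
    }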

    Reported-by: Mark Rutland
    Signed-off-by: Will Deacon
    Signed-off-by: Peter Zijlstra (Intel)
    Acked-by: Alexander Shishkin
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: linux-arm-kernel@lists.infradead.org
    Link: http://lkml.kernel.org/r/1502900297-21839-2-git-send-email-will.deacon@arm.com
    Signed-off-by: Ingo Molnar

    Will Deacon
     
  • The aux_head and aux_wakeup members of struct ring_buffer are defined
    using the local_t type, despite the fact that they are only accessed via
    the perf_aux_output_*() functions, which cannot race with each other for a
    given ring buffer.

    This patch changes the type of the members to long, so we can avoid
    using the local_*() API where it isn't needed.
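
    A before/after sketch of the change described above (struct abridged;
    the accessor function is illustrative):

    struct ring_buffer {
            /* was: local_t aux_head;  local_t aux_wakeup; */
            long    aux_head;       /* next byte to be written in the AUX area */
            long    aux_wakeup;     /* last point a wakeup was generated from  */
            /* ... */
    };

    /* Accesses turn from local_read()/local_add() into plain loads/stores: */
    static void rb_advance_aux(struct ring_buffer *rb, unsigned long size)
    {
            rb->aux_head += size;   /* was: local_add(size, &rb->aux_head); */
    }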

    Signed-off-by: Will Deacon
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Alexander Shishkin
    Cc: Linus Torvalds
    Cc: Mark Rutland
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: linux-arm-kernel@lists.infradead.org
    Link: http://lkml.kernel.org/r/1502900297-21839-1-git-send-email-will.deacon@arm.com
    Signed-off-by: Ingo Molnar

    Will Deacon
     

26 Jul, 2016

1 commit

  • This patch fixes the __output_custom() routine we currently use with
    bpf_skb_copy(). I missed that when len is larger than the size of the
    current handle, we can issue multiple invocations of copy_func, and
    __output_custom() advances the destination but also the source buffer by
    the number of bytes written. For __output_custom() this is wrong, since
    in that case the source buffer points to a non-linear object, in our
    case an skb, which the copy_func helper is supposed to walk. Since the
    source is non-linear, we therefore need to pass the offset into the
    helper, so that copy_func can use it for extracting the data from
    the source object.

    Therefore, adjust the callback signatures properly and pass offset
    into the skb_header_pointer() invoked from bpf_skb_copy() callback. The
    __DEFINE_OUTPUT_COPY_BODY() is adjusted to accommodate two things:
    i) to pass in whether we should advance source buffer or not; this is
    a compile-time constant condition, ii) to pass in the offset for
    __output_custom(), which we do with help of __VA_ARGS__, so everything
    can stay inlined as is currently. Both changes allow for adapting the
    __output_* fast-path helpers w/o extra overhead.
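
    A sketch of the adjusted callback, modeled on bpf_skb_copy() in
    net/core/filter.c after this change (the offset is now part of the
    signature and forwarded to skb_header_pointer()):

    static unsigned long bpf_skb_copy(void *dst_buff, const void *skb,
                                      unsigned long off, unsigned long len)
    {
            void *ptr = skb_header_pointer(skb, off, len, dst_buff);

            if (unlikely(!ptr))
                    return len;     /* report bytes not copied */
            if (ptr != dst_buff)
                    memcpy(dst_buff, ptr, len);

            return 0;
    }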

    Fixes: 555c8a8623a3 ("bpf: avoid stack copy and use skb ctx for event output")
    Fixes: 7e3f977edd0b ("perf, events: add non-linear data support for raw records")
    Signed-off-by: Daniel Borkmann
    Acked-by: Alexei Starovoitov
    Signed-off-by: David S. Miller

    Daniel Borkmann
     

16 Jul, 2016

1 commit

  • This patch adds support for non-linear data on raw records. It
    extends raw records to have one or multiple fragments that will
    be written linearly into the ring slot, where each fragment can
    optionally have a custom callback handler to walk and extract
    complex, possibly non-linear data.

    If a callback handler is provided for a fragment, then the new
    __output_custom() will be used instead of __output_copy() for
    the perf_output_sample() part. perf_prepare_sample() does all
    the size calculation only once, so perf_output_sample() doesn't
    need to redo the same work anymore, meaning real_size and padding
    will be cached in the raw record. The raw record becomes 32 bytes
    in size without holes; to not increase it further and to avoid
    doing unnecessary recalculations in the fast path, we can reuse the
    next pointer of the last fragment. The idea here is borrowed from
    ZERO_OR_NULL_PTR(), which should keep the perf_output_sample()
    path for PERF_SAMPLE_RAW minimal.

    This facility is needed for BPF's event output helper as a first
    user that will, in a follow-up, add an additional perf_raw_frag
    to its perf_raw_record in order to be able to more efficiently
    dump skb context after a linear head meta data related to it.
    skbs can be non-linear and thus need a custom output function to
    dump buffers. Currently, the skb data needs to be copied twice;
    with the help of __output_custom() this work only needs to be
    done once. Future users could be things like XDP/BPF programs
    that work on different context though and would thus also have
    a different callback function.

    The few users of raw records are adapted to initialize their frag
    data from the raw record itself; there is no change in behavior for them.
    The code is based upon a PoC diff provided by Peter Zijlstra [1].

    [1] http://thread.gmane.org/gmane.linux.network/421294
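
    The resulting layout looks roughly like this (a sketch following the
    description above and include/linux/perf_event.h):

    #include <linux/types.h>

    typedef unsigned long (*perf_copy_f)(void *dst, const void *src,
                                         unsigned long off, unsigned long len);

    struct perf_raw_frag {
            union {
                    struct perf_raw_frag    *next;  /* next fragment, if any       */
                    unsigned long           pad;    /* reused in the last fragment */
            };
            perf_copy_f                     copy;   /* optional custom copy callback */
            void                            *data;  /* linear data if no callback    */
            u32                             size;
    } __packed;

    struct perf_raw_record {
            struct perf_raw_frag            frag;   /* first (possibly only) fragment  */
            u32                             size;   /* cached total size incl. padding */
    };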

    Suggested-by: Peter Zijlstra
    Signed-off-by: Daniel Borkmann
    Acked-by: Alexei Starovoitov
    Signed-off-by: David S. Miller

    Daniel Borkmann
     

13 Apr, 2016

1 commit


31 Mar, 2016

2 commits

  • Add new ioctl() to pause/resume ring-buffer output.

    In some situations we want to read from the ring-buffer only when we
    ensure nothing can write to the ring-buffer during reading. Without
    this patch we have to turn off all events attached to this ring-buffer
    to achieve this.

    This patch is a prerequisite for enabling overwrite support in the
    perf ring-buffer. Following commits will introduce new methods that
    support reading from an overwrite ring buffer. Before reading, the
    caller must ensure the ring buffer is frozen, or the read is unreliable.
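
    From userspace the new control is used roughly like this (a sketch;
    PERF_EVENT_IOC_PAUSE_OUTPUT takes a boolean argument and needs uapi
    headers that already define it):

    #include <sys/ioctl.h>
    #include <linux/perf_event.h>

    /* 'fd' is a perf_event_open() fd whose ring buffer we want to read safely. */
    static int read_ring_buffer_frozen(int fd)
    {
            if (ioctl(fd, PERF_EVENT_IOC_PAUSE_OUTPUT, 1))  /* pause output */
                    return -1;

            /* ... consume the ring buffer; nothing can be written meanwhile ... */

            return ioctl(fd, PERF_EVENT_IOC_PAUSE_OUTPUT, 0);       /* resume */
    }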

    Signed-off-by: Wang Nan
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Alexander Shishkin
    Cc: Alexei Starovoitov
    Cc: Arnaldo Carvalho de Melo
    Cc: Brendan Gregg
    Cc: He Kuang
    Cc: Jiri Olsa
    Cc: Jiri Olsa
    Cc: Linus Torvalds
    Cc: Masami Hiramatsu
    Cc: Namhyung Kim
    Cc: Peter Zijlstra
    Cc: Stephane Eranian
    Cc: Thomas Gleixner
    Cc: Vince Weaver
    Cc: Zefan Li
    Link: http://lkml.kernel.org/r/1459147292-239310-2-git-send-email-wangnan0@huawei.com
    Signed-off-by: Ingo Molnar

    Wang Nan
     
    Now that we can ensure that, when a ring buffer's AUX area is on the way
    to getting unmapped, new transactions won't start, we only need to stop
    all events that can potentially be writing aux data to our ring buffer.

    Having done that, we can safely free the AUX pages and corresponding
    PMU data, as this time it is guaranteed to be the last aux reference
    holder.

    This partially reverts:

    57ffc5ca679 ("perf: Fix AUX buffer refcounting")

    ... which was made to defer deallocation that was otherwise possible
    from an NMI context. Now it is no longer the case; the last call to
    rb_free_aux() that drops the last AUX reference has to happen in
    perf_mmap_close() on that AUX area.

    Signed-off-by: Alexander Shishkin
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Arnaldo Carvalho de Melo
    Cc: Arnaldo Carvalho de Melo
    Cc: Jiri Olsa
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Stephane Eranian
    Cc: Thomas Gleixner
    Cc: Vince Weaver
    Cc: vince@deater.net
    Link: http://lkml.kernel.org/r/87d1qtz23d.fsf@ashishki-desk.ger.corp.intel.com
    Signed-off-by: Ingo Molnar

    Alexander Shishkin
     

20 Feb, 2016

1 commit


06 Jul, 2015

1 commit

    It's currently possible to drop the last refcount to the aux buffer
    from NMI context, which results in the expected fireworks.

    The refcounting needs a bigger overhaul, but to cure the immediate
    problem, delay the freeing by using an irq_work.
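
    A sketch of the deferral, assuming an irq_work member is added to
    struct ring_buffer (names of the internal helpers are illustrative):

    #include <linux/irq_work.h>

    void rb_free_aux(struct ring_buffer *rb)
    {
            /* Never free from here: this may run in NMI context. */
            if (atomic_dec_and_test(&rb->aux_refcount))
                    irq_work_queue(&rb->irq_work);
    }

    static void rb_irq_work(struct irq_work *work)
    {
            struct ring_buffer *rb = container_of(work, struct ring_buffer, irq_work);

            /* Runs from the irq_work interrupt, outside NMI: safe to free. */
            if (!atomic_read(&rb->aux_refcount))
                    __rb_free_aux(rb);
    }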

    Reviewed-and-tested-by: Alexander Shishkin
    Reported-by: Vince Weaver
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Arnaldo Carvalho de Melo
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Stephane Eranian
    Cc: Thomas Gleixner
    Link: http://lkml.kernel.org/r/20150618103249.GK19282@twins.programming.kicks-ass.net
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     

07 Jun, 2015

1 commit

  • When the PEBS interrupt threshold is larger than one record and the
    machine supports multiple PEBS events, the records of these events are
    mixed up and we need to demultiplex them.

    Demuxing the records is hard because the hardware is deficient. The
    hardware has two issues that, when combined, create impossible
    scenarios to demux.

    The first issue is that the 'status' field of the PEBS record is a copy
    of the GLOBAL_STATUS MSR at PEBS assist time. To see why this is a
    problem let us first describe the regular PEBS cycle:

    A) the CTRn value reaches 0:
       - the corresponding bit in GLOBAL_STATUS gets set
       - we start arming the hardware assist

       < some unspecified amount of time later -- this could cover multiple
         events of interest >

    B) the hardware assist is armed, any next event will trigger it

    C) a matching event happens:
       - the hardware assist triggers and generates a PEBS record;
         this includes a copy of GLOBAL_STATUS at this moment
       - if we auto-reload we (re)set CTRn
       - we clear the relevant bit in GLOBAL_STATUS

    Now consider the following chain of events:

    A0, B0, A1, C0

    The event generated for counter 0 will include a status with counter 1
    set, even though it's not at all related to the record. A similar thing
    can happen with a !PEBS event if it just happens to overflow at the
    right moment.

    The second issue is that the hardware will only emit one record for two
    or more counters if the event that triggers the assist is 'close'. This
    'close' can be several cycles, in some cases even the duration of the
    complete assist, if the event is something that doesn't need retirement.

    For instance, consider this chain of events:

    A0, B0, A1, B1, C01

    Where C01 is an event that triggers both hardware assists, we will
    generate but a single record, but again with both counters listed in the
    status field.

    This time the record pertains to both events.

    Note that these two cases are different but indistinguishable with the
    data as generated. Therefore demuxing records with multiple PEBS bits
    (we can safely ignore status bits for !PEBS counters) is impossible.

    Furthermore we cannot emit the record to both events because that might
    cause a data leak -- the events might not have the same privileges -- so
    what this patch does is discard such events.

    The assumption/hope is that such discards will be rare.

    Here are some possible ways you may get a high discard rate:

    - when you count the same thing multiple times. But it is not a useful
      configuration.
    - you can be unfortunate if you measure with a userspace only PEBS
      event along with either a kernel or unrestricted PEBS event. Imagine
      the event triggering and setting the overflow flag right before
      entering the kernel. Then all kernel side events will end up with
      multiple bits set.

    Signed-off-by: Yan, Zheng
    Signed-off-by: Kan Liang
    [ Changelog improvements. ]
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Andrew Morton
    Cc: H. Peter Anvin
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: acme@infradead.org
    Cc: eranian@google.com
    Link: http://lkml.kernel.org/r/1430940834-8964-4-git-send-email-kan.liang@intel.com
    Signed-off-by: Ingo Molnar

    Signed-off-by: Ingo Molnar

    Yan, Zheng
     

02 Apr, 2015

5 commits

  • When AUX area gets a certain amount of new data, we want to wake up
    userspace to collect it. This adds a new control to specify how much
    data will cause a wakeup. This is then passed down to pmu drivers via
    output handle's "wakeup" field, so that the driver can find the nearest
    point where it can generate an interrupt.

    We repurpose __reserved_2 in the event attribute for this; even though
    it was never checked to be zero before, aux_watermark will only matter
    for new AUX-aware code, so the old code should still be fine.
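
    From the userspace side this is just another perf_event_attr field (a
    sketch; the PMU type and the AUX area mapping are set up elsewhere):

    #include <string.h>
    #include <linux/perf_event.h>

    static void setup_attr(struct perf_event_attr *attr, __u32 aux_pmu_type)
    {
            memset(attr, 0, sizeof(*attr));
            attr->size          = sizeof(*attr);
            attr->type          = aux_pmu_type;     /* an AUX-capable PMU         */
            attr->aux_watermark = 64 * 1024;        /* wake the reader per 64 KiB */
    }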

    Signed-off-by: Alexander Shishkin
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Borislav Petkov
    Cc: Frederic Weisbecker
    Cc: H. Peter Anvin
    Cc: Kaixu Xia
    Cc: Linus Torvalds
    Cc: Mike Galbraith
    Cc: Paul Mackerras
    Cc: Robert Richter
    Cc: Stephane Eranian
    Cc: Thomas Gleixner
    Cc: acme@infradead.org
    Cc: adrian.hunter@intel.com
    Cc: kan.liang@intel.com
    Cc: markus.t.metzger@intel.com
    Cc: mathieu.poirier@linaro.org
    Link: http://lkml.kernel.org/r/1421237903-181015-10-git-send-email-alexander.shishkin@linux.intel.com
    Signed-off-by: Ingo Molnar

    Alexander Shishkin
     
  • This adds support for overwrite mode in the AUX area, which means "keep
    collecting data till you're stopped", turning AUX area into a circular
    buffer, where new data overwrites old data. It does not depend on data
    buffer's overwrite mode, so that it doesn't lose sideband data that is
    instrumental for processing AUX data.

    Overwrite mode is enabled by mapping the AUX area read-only. Even though
    aux_tail in the buffer's user page might be user writable, it will be
    ignored in this mode.

    A PERF_RECORD_AUX with PERF_AUX_FLAG_OVERWRITE set is written to the perf
    data stream every time an event writes new data to the AUX area. The pmu
    driver might not be able to infer the exact beginning of the new data in
    each snapshot; some drivers will only provide the tail, which is
    aux_offset + aux_size in the AUX record. The consumer has to be able to
    tell the new data from the old one, for example by means of time stamps
    if such are provided in the trace.

    The consumer is also responsible for disabling any events that might write
    to the AUX area (thus potentially racing with the consumer) before
    collecting the data.

    Signed-off-by: Alexander Shishkin
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Borislav Petkov
    Cc: Frederic Weisbecker
    Cc: H. Peter Anvin
    Cc: Kaixu Xia
    Cc: Linus Torvalds
    Cc: Mike Galbraith
    Cc: Paul Mackerras
    Cc: Robert Richter
    Cc: Stephane Eranian
    Cc: Thomas Gleixner
    Cc: acme@infradead.org
    Cc: adrian.hunter@intel.com
    Cc: kan.liang@intel.com
    Cc: markus.t.metzger@intel.com
    Cc: mathieu.poirier@linaro.org
    Link: http://lkml.kernel.org/r/1421237903-181015-9-git-send-email-alexander.shishkin@linux.intel.com
    Signed-off-by: Ingo Molnar

    Alexander Shishkin
     
  • For pmus that wish to write data to ring buffer's AUX area, provide
    perf_aux_output_{begin,end}() calls to initiate/commit data writes,
    similarly to perf_output_{begin,end}. These also use the same output
    handle structure. Also, similarly to software counterparts, these
    will direct inherited events' output to parents' ring buffers.

    After perf_aux_output_begin() returns successfully, handle->size
    is set to the maximum amount of data that can be written wrt the aux_tail
    pointer, so that no data that the user hasn't seen will be overwritten;
    therefore this should always be called before hardware writing is
    enabled. On success, it returns the pointer to the pmu driver's
    private structure allocated for this aux area by pmu::setup_aux. The same
    pointer can also be retrieved using perf_get_aux() while hardware
    writing is enabled.

    PMU driver should pass the actual amount of data written as a parameter
    to perf_aux_output_end(). All hardware writes should be completed and
    visible before this one is called.

    Additionally, perf_aux_output_skip() will adjust output handle and
    aux_head in case some part of the buffer has to be skipped over to
    maintain hardware's alignment constraints.

    Nested writers are forbidden and guards are in place to catch such
    attempts.
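
    Roughly, a PMU driver's start/stop path would use the API like this (a
    sketch; the driver-private buffer handling, handle placement and
    hardware programming are illustrative, and the perf_aux_output_end()
    'truncated' argument is the one introduced here):

    static DEFINE_PER_CPU(struct perf_output_handle, my_handle);

    static void my_pmu_start_trace(struct perf_event *event)
    {
            struct perf_output_handle *handle = this_cpu_ptr(&my_handle);
            void *buf;

            buf = perf_aux_output_begin(handle, event);
            if (!buf)
                    return;         /* no AUX buffer mapped, or no space left */

            /*
             * Program the hardware to write at most handle->size bytes into
             * 'buf', raising an interrupt no later than the wakeup point.
             */
    }

    static void my_pmu_stop_trace(struct perf_event *event, unsigned long written)
    {
            struct perf_output_handle *handle = this_cpu_ptr(&my_handle);

            /* All hardware writes must be complete and visible before this. */
            perf_aux_output_end(handle, written, false /* not truncated */);
    }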

    Signed-off-by: Alexander Shishkin
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Borislav Petkov
    Cc: Frederic Weisbecker
    Cc: H. Peter Anvin
    Cc: Kaixu Xia
    Cc: Linus Torvalds
    Cc: Mike Galbraith
    Cc: Paul Mackerras
    Cc: Robert Richter
    Cc: Stephane Eranian
    Cc: Thomas Gleixner
    Cc: acme@infradead.org
    Cc: adrian.hunter@intel.com
    Cc: kan.liang@intel.com
    Cc: markus.t.metzger@intel.com
    Cc: mathieu.poirier@linaro.org
    Link: http://lkml.kernel.org/r/1421237903-181015-8-git-send-email-alexander.shishkin@linux.intel.com
    Signed-off-by: Ingo Molnar

    Alexander Shishkin
     
  • When there's new data in the AUX space, output a record indicating its
    offset and size and a set of flags, such as PERF_AUX_FLAG_TRUNCATED, to
    mean the described data was truncated to fit in the ring buffer.
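
    The record body looks roughly like this (a sketch of the layout; the
    struct name is illustrative, the fields follow the uapi description):

    struct perf_record_aux {
            struct perf_event_header        header;         /* type = PERF_RECORD_AUX       */
            __u64                           aux_offset;     /* where the new data starts    */
            __u64                           aux_size;       /* how much new data there is   */
            __u64                           flags;          /* e.g. PERF_AUX_FLAG_TRUNCATED */
            /* followed by the usual sample_id fields */
    };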

    Signed-off-by: Alexander Shishkin
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Arnaldo Carvalho de Melo
    Cc: Borislav Petkov
    Cc: Frederic Weisbecker
    Cc: H. Peter Anvin
    Cc: Kaixu Xia
    Cc: Linus Torvalds
    Cc: Mike Galbraith
    Cc: Paul Mackerras
    Cc: Robert Richter
    Cc: Stephane Eranian
    Cc: Thomas Gleixner
    Cc: adrian.hunter@intel.com
    Cc: kan.liang@intel.com
    Cc: markus.t.metzger@intel.com
    Cc: mathieu.poirier@linaro.org
    Link: http://lkml.kernel.org/r/1421237903-181015-7-git-send-email-alexander.shishkin@linux.intel.com
    Signed-off-by: Ingo Molnar

    Alexander Shishkin
     
  • This patch introduces "AUX space" in the perf mmap buffer, intended for
    exporting high bandwidth data streams to userspace, such as instruction
    flow traces.

    AUX space is a ring buffer, defined by aux_{offset,size} fields in the
    user_page structure, and read/write pointers aux_{head,tail}, which abide
    by the same rules as data_* counterparts of the main perf buffer.

    In order to allocate/mmap AUX, userspace needs to set aux_offset to an
    offset greater than data_offset + data_size, and aux_size to the desired
    buffer size. Both need to be page aligned. Then the same aux_offset and
    aux_size should be passed to the mmap() call and, if everything adds up,
    you should have an AUX buffer as a result.

    Pages that are mapped into this buffer also come out of user's mlock
    rlimit plus perf_event_mlock_kb allowance.
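
    A sketch of that userspace sequence ('fd' is the perf_event_open() fd,
    'base' the existing data mapping whose first page is the user page, and
    'data_len' its total length; for the overwrite mode described in the
    entry above, the AUX area would be mapped PROT_READ only):

    #include <sys/mman.h>
    #include <linux/perf_event.h>

    static void *map_aux(int fd, void *base, size_t data_len, size_t aux_size)
    {
            struct perf_event_mmap_page *up = base;

            /* Both values must be page aligned; AUX starts past the data area. */
            up->aux_offset = data_len;
            up->aux_size   = aux_size;

            return mmap(NULL, aux_size, PROT_READ | PROT_WRITE, MAP_SHARED,
                        fd, up->aux_offset);
    }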

    Signed-off-by: Peter Zijlstra (Intel)
    Acked-by: Alexander Shishkin
    Cc: Borislav Petkov
    Cc: Frederic Weisbecker
    Cc: H. Peter Anvin
    Cc: Kaixu Xia
    Cc: Linus Torvalds
    Cc: Mike Galbraith
    Cc: Paul Mackerras
    Cc: Robert Richter
    Cc: Stephane Eranian
    Cc: Thomas Gleixner
    Cc: acme@infradead.org
    Cc: adrian.hunter@intel.com
    Cc: kan.liang@intel.com
    Cc: markus.t.metzger@intel.com
    Cc: mathieu.poirier@linaro.org
    Link: http://lkml.kernel.org/r/1421237903-181015-3-git-send-email-alexander.shishkin@linux.intel.com
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     

06 Nov, 2013

1 commit

  • The arch_perf_output_copy_user() default of
    __copy_from_user_inatomic() returns bytes not copied, while all other
    argument functions given DEFINE_OUTPUT_COPY() return bytes copied.

    Since copy_from_user_nmi() is the odd duck out by returning bytes
    copied where all other *copy_{to,from}* functions return bytes not
    copied, change it over and amend DEFINE_OUTPUT_COPY() to expect bytes
    not copied.

    Oddly enough DEFINE_OUTPUT_COPY() already returned bytes not copied
    while expecting its worker functions to return bytes copied.

    Signed-off-by: Peter Zijlstra
    Acked-by: will.deacon@arm.com
    Cc: Frederic Weisbecker
    Link: http://lkml.kernel.org/r/20131030201622.GR16117@laptop.programming.kicks-ass.net
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     

19 Jun, 2013

1 commit

  • Vince's fuzzer once again found holes. This time it spotted a leak in
    the locked page accounting.

    When an event had redirected output and its close() was the last
    reference to the buffer we didn't have a vm context to undo accounting.

    Change the code to destroy the buffer on the last munmap() and detach
    all redirected events at that time. This provides us the right context
    to undo the vm accounting.

    Reported-and-tested-by: Vince Weaver
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/20130604084421.GI8923@twins.programming.kicks-ass.net
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     

28 May, 2013

1 commit

  • Vince reported a problem found by his perf specific trinity
    fuzzer.

    Al noticed 2 problems with perf's mmap():

    - it has issues against fork() since we use vma->vm_mm for accounting.
    - it has an rb refcount leak on double mmap().

    We fix the issues against fork() by using VM_DONTCOPY; I don't
    think there's code out there that uses this; we didn't hear
    about weird accounting problems/crashes. If we do need this to
    work, the previously proposed VM_PINNED could make this work.

    Aside from the rb reference leak spotted by Al, Vince's example
    prog was indeed doing a double mmap() through the use of
    perf_event_set_output().

    This exposes another problem, since we now have 2 events with
    one buffer, the accounting gets screwy because we account per
    event. Fix this by making the buffer responsible for its own
    accounting.

    Reported-by: Vince Weaver
    Signed-off-by: Peter Zijlstra
    Cc: Al Viro
    Cc: Paul Mackerras
    Cc: Arnaldo Carvalho de Melo
    Link: http://lkml.kernel.org/r/20130528085548.GA12193@twins.programming.kicks-ass.net
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     

21 Mar, 2013

1 commit

    This patch fixes a flaw in perf_output_space(). In case the size
    of the space needed is bigger than the actual buffer size, there
    may be situations where the function would return true (i.e.,
    there is space) when it should not, because head > offset after the
    rounding done by the masking logic.

    The problem can be tested by activating BTS on Intel processors.
    A BTS record can be as big as 16 pages. The following command
    fails:

    $ perf record -m 4 -c 1 -e branches:u my_test_program

    You will get a buffer corruption with this. Perf report won't be
    able to parse the perf.data.

    The fix is to first check that the requested space is smaller
    than the buffer size. If so, then the masking logic will work
    fine. If not, then there is no chance the record can be saved
    and it will be gracefully handled by upper code layers.

    [ In v2, we also make the logic for the writable more explicit by
    renaming it to rb->overwrite because it tells whether or not the
    buffer can overwrite its tail (suggested by PeterZ). ]
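
    The shape of the fix, per the description above (a sketch of
    perf_output_space(); perf_data_size() returns the data buffer size in
    bytes):

    static bool perf_output_space(struct ring_buffer *rb, unsigned long tail,
                                  unsigned long offset, unsigned long head)
    {
            unsigned long sz = perf_data_size(rb);
            unsigned long mask = sz - 1;

            /* An overwrite buffer may always overwrite its own tail. */
            if (rb->overwrite)
                    return true;

            /*
             * New check: a payload bigger than the whole buffer can never
             * fit and would defeat the masking logic below.
             */
            if ((head - offset) > sz)
                    return false;

            offset = (offset - tail) & mask;
            head   = (head   - tail) & mask;

            return (int)(head - offset) >= 0;
    }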

    Signed-off-by: Stephane Eranian
    Acked-by: Peter Zijlstra
    Cc: peterz@infradead.org
    Cc: jolsa@redhat.com
    Cc: fweisbec@gmail.com
    Link: http://lkml.kernel.org/r/20130318133327.GA3056@quad
    Signed-off-by: Ingo Molnar

    Stephane Eranian
     

10 Aug, 2012

3 commits

  • Introducing PERF_SAMPLE_STACK_USER sample type bit to trigger the dump
    of the user level stack on sample. The size of the dump is specified by
    sample_stack_user value.

    Being able to dump parts of the user stack, starting from the stack
    pointer, will be useful for post-mortem DWARF CFI based stack
    unwinding.

    Added HAVE_PERF_USER_STACK_DUMP config option to determine if the
    architecture provides user stack dump on perf event samples. This needs
    access to the user stack pointer which is not unified across
    architectures. Enabling this for x86 architecture.
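
    On the userspace side, requesting the dump comes down to two attr
    fields (a sketch; DWARF-based unwinding in practice also requests user
    registers via a companion patch in this series):

    #include <linux/perf_event.h>

    static void request_user_stack(struct perf_event_attr *attr)
    {
            attr->sample_type       |= PERF_SAMPLE_STACK_USER;
            attr->sample_stack_user  = 8192;        /* bytes of user stack per sample */
    }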

    Signed-off-by: Jiri Olsa
    Original-patch-by: Frederic Weisbecker
    Cc: "Frank Ch. Eigler"
    Cc: Arun Sharma
    Cc: Benjamin Redelings
    Cc: Corey Ashford
    Cc: Cyrill Gorcunov
    Cc: Frank Ch. Eigler
    Cc: Frederic Weisbecker
    Cc: Ingo Molnar
    Cc: Masami Hiramatsu
    Cc: Paul Mackerras
    Cc: Peter Zijlstra
    Cc: Robert Richter
    Cc: Stephane Eranian
    Cc: Tom Zanussi
    Cc: Ulrich Drepper
    Link: http://lkml.kernel.org/r/1344345647-11536-6-git-send-email-jolsa@redhat.com
    Signed-off-by: Arnaldo Carvalho de Melo

    Jiri Olsa
     
  • Introducing perf_output_skip function to be able to skip data within the
    perf ring buffer.

    When writing data into perf ring buffer we first reserve needed place in
    ring buffer and then copy the actual data.

    There's a possibility we won't be able to fill all the reserved size
    with data, so we need a way to skip the remaining bytes.

    This is going to be useful when storing the user stack dump, where we
    might end up with less data than we originally requested.
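
    A sketch of the intended pattern (helper signatures as in
    kernel/events/ring_buffer.c; error handling abridged):

    static void output_with_skip(struct perf_event *event, const void *data,
                                 unsigned int actual, unsigned int reserved)
    {
            struct perf_output_handle handle;

            /* Reserve the worst case, write what we have, skip the rest. */
            if (perf_output_begin(&handle, event, reserved))
                    return;

            perf_output_copy(&handle, data, actual);
            perf_output_skip(&handle, reserved - actual);
            perf_output_end(&handle);
    }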

    Signed-off-by: Jiri Olsa
    Acked-by: Frederic Weisbecker
    Cc: "Frank Ch. Eigler"
    Cc: Arun Sharma
    Cc: Benjamin Redelings
    Cc: Corey Ashford
    Cc: Cyrill Gorcunov
    Cc: Frank Ch. Eigler
    Cc: Frederic Weisbecker
    Cc: Ingo Molnar
    Cc: Masami Hiramatsu
    Cc: Paul Mackerras
    Cc: Peter Zijlstra
    Cc: Robert Richter
    Cc: Stephane Eranian
    Cc: Tom Zanussi
    Cc: Ulrich Drepper
    Link: http://lkml.kernel.org/r/1344345647-11536-5-git-send-email-jolsa@redhat.com
    Signed-off-by: Arnaldo Carvalho de Melo

    Jiri Olsa
     
    Adding a generic way to use the __output_copy function with a specific
    copy function via the DEFINE_OUTPUT_COPY macro.

    Using this to add a new __output_copy_user function that provides output
    copy from user pointers. For x86 the copy_from_user_nmi function is used
    and __copy_from_user_inatomic for the rest of the architectures.

    This new function will be used in user stack dump on sample, coming in
    next patches.
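
    The macro is then instantiated once per copy strategy, roughly (a
    sketch following kernel/events/internal.h; the user-copy helper is
    selected per architecture at build time):

    /* kernel-space sources: a plain memcpy-like helper */
    DEFINE_OUTPUT_COPY(__output_copy, memcpy_common)

    /* user-space sources: copy_from_user_nmi() on x86,
     * __copy_from_user_inatomic() on other architectures */
    DEFINE_OUTPUT_COPY(__output_copy_user, arch_perf_out_copy_user)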

    Signed-off-by: Jiri Olsa
    Cc: "Frank Ch. Eigler"
    Cc: Arun Sharma
    Cc: Benjamin Redelings
    Cc: Corey Ashford
    Cc: Cyrill Gorcunov
    Cc: Frank Ch. Eigler
    Cc: Ingo Molnar
    Cc: Jiri Olsa
    Cc: Masami Hiramatsu
    Cc: Paul Mackerras
    Cc: Peter Zijlstra
    Cc: Robert Richter
    Cc: Stephane Eranian
    Cc: Tom Zanussi
    Cc: Ulrich Drepper
    Link: http://lkml.kernel.org/r/1344345647-11536-4-git-send-email-jolsa@redhat.com
    Signed-off-by: Frederic Weisbecker
    Signed-off-by: Arnaldo Carvalho de Melo

    Frederic Weisbecker
     

31 Jul, 2012

1 commit

  • A few events are interesting not only for a current task.
    For example, sched_stat_* events are interesting for a task
    which wakes up. For this reason, it would be good if such
    events were delivered to a target task too.

    Now a target task can be set by using __perf_task().

    The original idea and a draft patch belongs to Peter Zijlstra.

    I need these events for profiling sleep times. sched_switch is used for
    getting callchains and sched_stat_* is used for getting time periods.
    These events are combined in user space, then it can be analyzed by
    perf tools.

    Inspired-by: Peter Zijlstra
    Cc: Steven Rostedt
    Cc: Paul Mackerras
    Cc: Arnaldo Carvalho de Melo
    Cc: Steven Rostedt
    Cc: Arun Sharma
    Signed-off-by: Andrew Vagin
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/1342016098-213063-1-git-send-email-avagin@openvz.org
    Signed-off-by: Ingo Molnar

    Andrew Vagin
     

06 Dec, 2011

1 commit


05 Dec, 2011

1 commit

  • When you do:
    $ perf record -e cycles,cycles,cycles noploop 10

    You expect about 10,000 samples for each event, i.e., 10s at
    1000 samples/sec. However, this is not what's happening. You
    get far fewer samples, maybe 3700 samples/event:

    $ perf report -D | tail -15
    Aggregated stats:
    TOTAL events: 10998
    MMAP events: 66
    COMM events: 2
    SAMPLE events: 10930
    cycles stats:
    TOTAL events: 3644
    SAMPLE events: 3644
    cycles stats:
    TOTAL events: 3642
    SAMPLE events: 3642
    cycles stats:
    TOTAL events: 3644
    SAMPLE events: 3644

    On an Intel Nehalem or even AMD64, there are 4 counters capable
    of measuring cycles, so there is plenty of space to measure those
    events without multiplexing (even with the NMI watchdog active).
    And even with multiplexing, we'd expect roughly the same number
    of samples per event.

    The root of the problem was that when the event that caused the buffer
    to become full was not the first event passed on the cmdline, the user
    notification would get lost. The notification was sent to the file
    descriptor of the overflowed event but the perf tool was not polling
    on it. The perf tool aggregates all samples into a single buffer,
    i.e., the buffer of the first event. Consequently, it assumes
    notifications for any event will come via that descriptor.

    The seemingly straightforward solution of moving the waitq into the
    ringbuffer object doesn't work because of life-time issues. One could
    perf_event_set_output() on a fd that you're also blocking on and cause
    the old rb object to be freed while its waitq would still be
    referenced by the blocked thread -> FAIL.

    Therefore link all events to the ringbuffer and broadcast the wakeup
    from the ringbuffer object to all possible events that could be waited
    upon. This is rather ugly, and we're open to better solutions but it
    works for now.
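
    A sketch of the broadcast described above, modeled on the
    ring_buffer_wakeup() helper this change introduces (locking around the
    event list management is omitted):

    static void ring_buffer_wakeup(struct perf_event *event)
    {
            struct ring_buffer *rb;

            rcu_read_lock();
            rb = rcu_dereference(event->rb);
            if (rb) {
                    /* every event redirected into this buffer gets the wakeup */
                    list_for_each_entry_rcu(event, &rb->event_list, rb_entry)
                            wake_up_all(&event->waitq);
            }
            rcu_read_unlock();
    }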

    Reported-by: Stephane Eranian
    Finished-by: Stephane Eranian
    Reviewed-by: Stephane Eranian
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/20111126014731.GA7030@quad
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     

14 Nov, 2011

1 commit

  • Split the callchain code from the perf events core into
    a new kernel/events/callchain.c file.

    This simplifies the big core.c a bit.

    Signed-off-by: Borislav Petkov
    Cc: Arnaldo Carvalho de Melo
    Cc: Stephane Eranian
    [keep ctx recursion handling inline and use internal headers]
    Signed-off-by: Frederic Weisbecker
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/1318778104-17152-1-git-send-email-fweisbec@gmail.com
    Signed-off-by: Ingo Molnar

    Borislav Petkov
     

01 Jul, 2011

1 commit

  • The nmi parameter indicated if we could do wakeups from the current
    context, if not, we would set some state and self-IPI and let the
    resulting interrupt do the wakeup.

    For the various event classes:

    - hardware: nmi=0; PMI is in fact an NMI or we run irq_work_run from
    the PMI-tail (ARM etc.)
    - tracepoint: nmi=0; since tracepoint could be from NMI context.
    - software: nmi=[0,1]; some, like the schedule thing, cannot
      perform wakeups, and hence need 0.

    As one can see, there is very little nmi=1 usage, and the down-side of
    not using it is that on some platforms some software events can have a
    jiffy delay in wakeup (when arch_irq_work_raise isn't implemented).

    The up-side however is that we can remove the nmi parameter and save a
    bunch of conditionals in fast paths.
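
    The visible effect is one parameter dropped along the overflow path,
    roughly (a sketch; only the removed argument matters here):

    /*
     * was: int perf_event_overflow(struct perf_event *event, int nmi,
     *                              struct perf_sample_data *data,
     *                              struct pt_regs *regs);
     */
    int perf_event_overflow(struct perf_event *event,
                            struct perf_sample_data *data,
                            struct pt_regs *regs);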

    Signed-off-by: Peter Zijlstra
    Cc: Michael Cree
    Cc: Will Deacon
    Cc: Deng-Cheng Zhu
    Cc: Anton Blanchard
    Cc: Eric B Munson
    Cc: Heiko Carstens
    Cc: Paul Mundt
    Cc: David S. Miller
    Cc: Frederic Weisbecker
    Cc: Jason Wessel
    Cc: Don Zickus
    Link: http://lkml.kernel.org/n/tip-agjev8eu666tvknpb3iaj0fg@git.kernel.org
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     

09 Jun, 2011

1 commit

  • And create the internal perf events header.

    v2: Keep an internal inlined perf_output_copy()

    Signed-off-by: Frederic Weisbecker
    Acked-by: Peter Zijlstra
    Cc: Borislav Petkov
    Cc: Stephane Eranian
    Cc: Arnaldo Carvalho de Melo
    Cc: Steven Rostedt
    Link: http://lkml.kernel.org/r/1305827704-5607-1-git-send-email-fweisbec@gmail.com
    [ v3: use clearer 'ring_buffer' and 'rb' naming ]
    Signed-off-by: Ingo Molnar

    Frederic Weisbecker