30 May, 2016

1 commit

  • In addition to being able to control the system-wide maximum depth via
    /proc/sys/kernel/perf_event_max_stack, we are now able to ask for
    different depths per event, using perf_event_attr.sample_max_stack for
    that.

    This uses a u16 hole at the end of perf_event_attr: when
    perf_event_attr.sample_type has PERF_SAMPLE_CALLCHAIN set, a
    sample_max_stack of zero means "use perf_event_max_stack", otherwise the
    value is bounds checked under callchain_mutex.
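
    A minimal sketch of the new knob in use (the event choice, period, and
    depth of 16 are illustrative, not part of the commit; error handling
    omitted):

    #include <string.h>
    #include <unistd.h>
    #include <sys/syscall.h>
    #include <linux/perf_event.h>

    /* Sketch: ask for callchains at most 16 entries deep for this event
     * only, instead of the system-wide perf_event_max_stack default. */
    static int open_sampling_event(void)
    {
            struct perf_event_attr attr;

            memset(&attr, 0, sizeof(attr));
            attr.size = sizeof(attr);
            attr.type = PERF_TYPE_HARDWARE;
            attr.config = PERF_COUNT_HW_CPU_CYCLES;
            attr.sample_period = 100000;
            attr.sample_type = PERF_SAMPLE_IP | PERF_SAMPLE_CALLCHAIN;
            attr.sample_max_stack = 16;     /* 0 means: use the sysctl value */

            return syscall(__NR_perf_event_open, &attr, 0, -1, -1, 0);
    }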

    Cc: Adrian Hunter
    Cc: Alexander Shishkin
    Cc: Alexei Starovoitov
    Cc: Brendan Gregg
    Cc: David Ahern
    Cc: Frederic Weisbecker
    Cc: He Kuang
    Cc: Jiri Olsa
    Cc: Linus Torvalds
    Cc: Masami Hiramatsu
    Cc: Milian Wolff
    Cc: Namhyung Kim
    Cc: Peter Zijlstra
    Cc: Stephane Eranian
    Cc: Thomas Gleixner
    Cc: Vince Weaver
    Cc: Wang Nan
    Cc: Zefan Li
    Link: http://lkml.kernel.org/n/tip-kolmn1yo40p7jhswxwrc7rrd@git.kernel.org
    Signed-off-by: Arnaldo Carvalho de Melo

    Arnaldo Carvalho de Melo
     

17 May, 2016

1 commit

  • The perf_sample->ip_callchain->nr value includes all the entries in the
    ip_callchain->ip[] array, both real addresses and the
    PERF_CONTEXT_{KERNEL,USER,etc} markers, while what the user expects is
    that the kernel.perf_event_max_stack sysctl, or the upcoming per-event
    perf_event_attr.sample_max_stack knob, be honoured in terms of actual IP
    addresses in the stack trace.

    So allocate a bunch of extra entries for contexts, and do the accounting
    via perf_callchain_entry_ctx struct members.

    A new sysctl, kernel.perf_event_max_contexts_per_stack, is also
    introduced to help investigate possible bugs in some architecture's
    callchain implementation.
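
    For illustration, a sketch (helper name made up) of how a consumer using
    the uapi perf_callchain_entry layout tells real IP addresses apart from
    the context markers that these extra entries account for:

    #include <linux/perf_event.h>

    /* Sketch: count only real instruction pointers in a sampled callchain,
     * skipping the PERF_CONTEXT_{KERNEL,USER,...} markers interleaved with
     * them; the markers are huge values, >= PERF_CONTEXT_MAX as a u64. */
    static unsigned int count_real_ips(const struct perf_callchain_entry *chain)
    {
            unsigned int i, n = 0;

            for (i = 0; i < chain->nr; i++)
                    if (chain->ips[i] < (__u64)PERF_CONTEXT_MAX)
                            n++;
            return n;
    }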

    Cc: Adrian Hunter
    Cc: Alexander Shishkin
    Cc: Alexei Starovoitov
    Cc: Brendan Gregg
    Cc: David Ahern
    Cc: Frederic Weisbecker
    Cc: He Kuang
    Cc: Jiri Olsa
    Cc: Masami Hiramatsu
    Cc: Milian Wolff
    Cc: Namhyung Kim
    Cc: Peter Zijlstra
    Cc: Stephane Eranian
    Cc: Thomas Gleixner
    Cc: Vince Weaver
    Cc: Wang Nan
    Cc: Zefan Li
    Link: http://lkml.kernel.org/n/tip-3b4wnqk340c4sg4gwkfdi9yk@git.kernel.org
    Signed-off-by: Arnaldo Carvalho de Melo

    Arnaldo Carvalho de Melo
     

23 Apr, 2016

1 commit

  • This patch introduces a 'write_backward' bit to perf_event_attr, which
    controls the direction of a ring buffer. When set, the corresponding
    ring buffer is written from end to beginning. This feature is designed
    to support reading from overwritable ring buffers.

    A ring buffer can be created by mapping a perf event fd. The kernel puts
    event records into the ring buffer, and user tooling like perf fetches
    them from the address returned by mmap(). To prevent racing between
    kernel and tooling, they communicate with each other through the 'head'
    and 'tail' pointers. The kernel maintains the 'head' pointer, pointing
    it at the next free area (the tail of the last record). Tooling
    maintains the 'tail' pointer, pointing it at the tail of the last
    consumed record (a record that has already been fetched). The kernel
    determines the available space in a ring buffer from these two pointers,
    to avoid overwriting unfetched records.

    By mapping without 'PROT_WRITE', an overwritable ring buffer is created.
    Unlike a normal ring buffer, tooling is unable to maintain the 'tail'
    pointer because writing is forbidden. Therefore, for this type of ring
    buffer, the kernel overwrites old records unconditionally, working like
    a flight recorder. This feature would be useful if reading from an
    overwritable ring buffer were as easy as reading from a normal ring
    buffer. However, there's an obscure problem.

    The following figure demonstrates a full overwritable ring buffer. In
    this figure, the 'head' pointer points to the end of the last record,
    and a long record 'E' is pending. For a normal ring buffer, a 'tail'
    pointer would have pointed to position (X), so the kernel would know
    there's no more space in the ring buffer. However, for an overwritable
    ring buffer, the kernel ignores the 'tail' pointer.

    (X)                               head
     .                                |
     .                                V
    +------+-------+----------+------+---+
    |A....A|B.....B|C........C|D....D|   |
    +------+-------+----------+------+---+

    Record 'A' is overwritten by event 'E':

       head
       |
       V
    +--+---+-------+----------+------+---+
    |.E|..A|B.....B|C........C|D....D|E..|
    +--+---+-------+----------+------+---+

    Now tooling decides to read from this ring buffer. However, neither of
    the two natural positions, 'head' and the start of this ring buffer,
    points to the head of a record. Even though the full ring buffer can be
    accessed by tooling, it is unable to find a position to start decoding
    from.

    The first attempt to solve this problem (AFAIK) can be found at [1]. It
    makes the kernel maintain the 'tail' pointer, updating it when the ring
    buffer is half full. However, this approach introduces overhead to the
    fast path: test results show a 1% overhead [2]. In addition, this
    method utilizes no more than 50% of the records.

    Another attempt can be found at [3], which puts the size of an event at
    the end of each record. This approach allows tooling to find records in
    a backward manner from the 'head' pointer by reading the size of a
    record from its tail. However, because of alignment requirements, it
    needs 8 bytes to record the size of a record, which is a huge waste. Its
    performance is also not good, because more data needs to be written.
    This approach also introduces some extra branch instructions to the fast
    path.

    'write_backward' is a better solution to this problem.

    The following figure demonstrates the state of the overwritable ring
    buffer when 'write_backward' is set, before overwriting:

         head
         |
         V
    +---+------+----------+-------+------+
    |   |D....D|C........C|B.....B|A....A|
    +---+------+----------+-------+------+

    and after overwriting:

                                       head
                                       |
                                       V
    +---+------+----------+-------+---+--+
    |..E|D....D|C........C|B.....B|A..|E.|
    +---+------+----------+-------+---+--+

    In each situation, 'head' points to the beginning of the newest record.
    From this record, tooling can iterate over the full ring buffer and fetch
    records one by one.
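
    A minimal sketch of setting up such a buffer (the event choice, buffer
    size and raw syscall use are illustrative; error handling omitted):

    #include <string.h>
    #include <unistd.h>
    #include <sys/mman.h>
    #include <sys/syscall.h>
    #include <linux/perf_event.h>

    /* Sketch: create one overwritable, backward-written ring buffer. */
    static void *open_backward_buffer(int *fdp)
    {
            struct perf_event_attr attr;
            size_t len = (1 + 8) * 4096;    /* user page + 8 data pages */

            memset(&attr, 0, sizeof(attr));
            attr.size = sizeof(attr);
            attr.type = PERF_TYPE_SOFTWARE;
            attr.config = PERF_COUNT_SW_DUMMY;
            attr.sample_type = PERF_SAMPLE_IP | PERF_SAMPLE_TID;
            attr.write_backward = 1;        /* write from end to beginning */

            *fdp = syscall(__NR_perf_event_open, &attr, 0, -1, -1, 0);

            /* Mapping without PROT_WRITE makes the buffer overwritable
             * (flight recorder): there is no 'tail' for us to maintain. */
            return mmap(NULL, len, PROT_READ, MAP_SHARED, *fdp, 0);
    }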

    The only limitation that needs to be considered is back-to-back reading.
    Due to the non-deterministic nature of user programs, it is impossible
    to ensure the ring buffer stays stable during reading. Consider an
    extreme situation: tooling is scheduled out after reading record 'D',
    then a burst of events comes and eats up the whole ring buffer (one or
    multiple rounds). When the tooling process comes back, reading after 'D'
    is now incorrect.

    To prevent this problem, we need to find a way to ensure the ring buffer
    is stable during reading. ioctl(PERF_EVENT_IOC_PAUSE_OUTPUT) is
    suggested because its overhead is lower than
    ioctl(PERF_EVENT_IOC_ENABLE).

    By carefully verifying against the 'head' pointer, a reader can avoid
    pausing the ring buffer. For example:

    /* A union of all possible events */
    union perf_event event;

    p = head = perf_mmap__read_head();
    while (true) {
            /* copy header of next event */
            fetch(&event.header, p, sizeof(event.header));

            /* read 'head' pointer */
            head = perf_mmap__read_head();

            /* check overwritten: is the header good? */
            if (!verify(sizeof(event.header), p, head))
                    break;

            /* copy the whole event */
            fetch(&event, p, event.header.size);

            /* read 'head' pointer again */
            head = perf_mmap__read_head();

            /* is the whole event good? */
            if (!verify(event.header.size, p, head))
                    break;

            p += event.header.size;
    }

    However, the overhead is high because:

    a) In-place decoding is not safe.
    Copying-verifying-decoding is required.
    b) Fetching 'head' pointer requires additional synchronization.

    (From Alexei Starovoitov:

    Even when this trick works, pause is needed for more than stability of
    reading. When we collect the events into overwrite buffer we're waiting
    for some other trigger (like all cpu utilization spike or just one cpu
    running and all others are idle) and when it happens the buffer has
    valuable info from the past. At this point new events are no longer
    interesting and buffer should be paused, events read and unpaused until
    next trigger comes.)

    This patch utilizes event's default overflow_handler introduced
    previously. perf_event_output_backward() is created as the default
    overflow handler for backward ring buffers. To avoid extra overhead to
    fast path, original perf_event_output() becomes __perf_event_output()
    and marked '__always_inline'. In theory, there's no extra overhead
    introduced to fast path.

    Performance testing:

    Call 'close(-1)' 3000000 times and use gettimeofday() to measure the
    duration. Use 'perf record -o /dev/null -e raw_syscalls:*' to capture
    the system calls. Times are in ns.

    Testing environment:

    CPU : Intel(R) Core(TM) i7-4790 CPU @ 3.60GHz
    Kernel : v4.5.0

                  MEAN       STDVAR
    BASE    800214.950     2853.083
    PRE1   2253846.700     9997.014
    PRE2   2257495.540     8516.293
    POST   2250896.100     8933.921

    Where 'BASE' is the pure performance without capturing, 'PRE1' is the
    test result of the pure 'v4.5.0' kernel, 'PRE2' is the test result
    before this patch, and 'POST' is the test result after this patch. See
    [4] for the detailed experimental setup.

    Considering the stdvar, this patch doesn't introduce performance
    overhead to the fast path.

    [1] http://lkml.iu.edu/hypermail/linux/kernel/1304.1/04584.html
    [2] http://lkml.iu.edu/hypermail/linux/kernel/1307.1/00535.html
    [3] http://lkml.iu.edu/hypermail/linux/kernel/1512.0/01265.html
    [4] http://lkml.kernel.org/g/56F89DCD.1040202@huawei.com

    Signed-off-by: Wang Nan
    Signed-off-by: Peter Zijlstra (Intel)
    Acked-by: Alexei Starovoitov
    Cc:
    Cc:
    Cc: Alexander Shishkin
    Cc: Arnaldo Carvalho de Melo
    Cc: Brendan Gregg
    Cc: He Kuang
    Cc: Jiri Olsa
    Cc: Jiri Olsa
    Cc: Linus Torvalds
    Cc: Masami Hiramatsu
    Cc: Namhyung Kim
    Cc: Peter Zijlstra
    Cc: Stephane Eranian
    Cc: Thomas Gleixner
    Cc: Vince Weaver
    Cc: Zefan Li
    Link: http://lkml.kernel.org/r/1459865478-53413-1-git-send-email-wangnan0@huawei.com
    [ Fixed the changelog some more. ]
    Signed-off-by: Ingo Molnar

    Wang Nan
     

31 Mar, 2016

1 commit

  • Add new ioctl() to pause/resume ring-buffer output.

    In some situations we want to read from the ring-buffer only when we
    ensure nothing can write to the ring-buffer during reading. Without
    this patch we have to turn off all events attached to this ring-buffer
    to achieve this.

    This patch is a prerequisite for enabling overwrite support in the perf
    ring buffer. Following commits will introduce new methods that support
    reading from the overwritable ring buffer. Before reading, the caller
    must ensure the ring buffer is frozen, or the reading is unreliable.
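
    A sketch of the intended usage ('fd' is a perf event fd whose output is
    routed to the ring buffer being read; function name is made up):

    #include <sys/ioctl.h>
    #include <linux/perf_event.h>

    /* Sketch: freeze the ring buffer, consume it, then resume output.
     * While paused, writers drop new records instead of overwriting the
     * ones being read. */
    static void read_frozen(int fd)
    {
            ioctl(fd, PERF_EVENT_IOC_PAUSE_OUTPUT, 1);  /* pause */
            /* ... walk and decode the mmap'ed ring buffer here ... */
            ioctl(fd, PERF_EVENT_IOC_PAUSE_OUTPUT, 0);  /* resume */
    }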

    Signed-off-by: Wang Nan
    Signed-off-by: Peter Zijlstra (Intel)
    Cc:
    Cc: Alexander Shishkin
    Cc: Alexei Starovoitov
    Cc: Arnaldo Carvalho de Melo
    Cc: Brendan Gregg
    Cc: He Kuang
    Cc: Jiri Olsa
    Cc: Jiri Olsa
    Cc: Linus Torvalds
    Cc: Masami Hiramatsu
    Cc: Namhyung Kim
    Cc: Peter Zijlstra
    Cc: Stephane Eranian
    Cc: Thomas Gleixner
    Cc: Vince Weaver
    Cc: Zefan Li
    Link: http://lkml.kernel.org/r/1459147292-239310-2-git-send-email-wangnan0@huawei.com
    Signed-off-by: Ingo Molnar

    Wang Nan
     

23 Nov, 2015

1 commit

  • With LBRv5, reading the extra LBR flags like mispredict, TSX and cycles
    is not free anymore, as it has moved to a separate MSR.

    For callstack mode we don't need any of this information, so we can
    avoid the unnecessary MSR read. Add flags to the perf interface with
    which perf record can request not collecting this information.

    Add branch_sample_type flags for CYCLES and FLAGS. It's a bit unusual
    for branch_sample_types to be negative (disable), not positive (enable),
    but since the legacy ABI reported the flags we need some form of
    explicit disabling to avoid breaking the ABI.

    After we have the flags, the x86 perf code can keep track of whether any
    users need the flags. If no one needs them, the information is not
    collected.

    This cuts down the cost of LBR callstack on Skylake significantly.
    Profiling a kernel build with LBR call stack, the average run time of
    the PMI handler drops by 43%.
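
    A sketch of a tool opting out of the extra information when it only
    needs the call stack (attribute setup abridged; the USER/CALL_STACK
    selection mirrors what LBR call-stack mode requires):

    #include <linux/perf_event.h>

    /* Sketch: request an LBR call stack but tell the kernel we need
     * neither the mispredict/TSX flags nor the cycle counts, avoiding the
     * extra MSR read on parts with LBRv5. */
    static void setup_callstack_branches(struct perf_event_attr *attr)
    {
            attr->sample_type |= PERF_SAMPLE_BRANCH_STACK;
            attr->branch_sample_type = PERF_SAMPLE_BRANCH_USER |
                                       PERF_SAMPLE_BRANCH_CALL_STACK |
                                       PERF_SAMPLE_BRANCH_NO_FLAGS |
                                       PERF_SAMPLE_BRANCH_NO_CYCLES;
    }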

    Signed-off-by: Andi Kleen
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Arnaldo Carvalho de Melo
    Cc: Jiri Olsa
    Cc: Linus Torvalds
    Cc: Mike Galbraith
    Cc: Peter Zijlstra
    Cc: Stephane Eranian
    Cc: Thomas Gleixner
    Cc: Vince Weaver
    Cc: acme@kernel.org
    Cc: jolsa@kernel.org
    Link: http://lkml.kernel.org/r/1445366797-30894-2-git-send-email-andi@firstfloor.org
    Signed-off-by: Ingo Molnar

    Andi Kleen
     

05 Nov, 2015

1 commit

  • Pull networking updates from David Miller:

    Changes of note:

    1) Allow to schedule ICMP packets in IPVS, from Alex Gartrell.

    2) Provide FIB table ID in ipv4 route dumps just as ipv6 does, from
    David Ahern.

    3) Allow the user to ask for the statistics to be filtered out of
    ipv4/ipv6 address netlink dumps. From Sowmini Varadhan.

    4) More work to pass the network namespace context around deep into
    various packet path APIs, starting with the netfilter hooks. From
    Eric W Biederman.

    5) Add layer 2 TX/RX checksum offloading to qeth driver, from Thomas
    Richter.

    6) Use usec resolution for SYN/ACK RTTs in TCP, from Yuchung Cheng.

    7) Support Very High Throughput in wireless MESH code, from Bob
    Copeland.

    8) Allow setting the ageing_time in switchdev/rocker. From Scott
    Feldman.

    9) Properly autoload L2TP type modules, from Stephen Hemminger.

    10) Fix and enable offload features by default in 8139cp driver, from
    David Woodhouse.

    11) Support both ipv4 and ipv6 sockets in a single vxlan device, from
    Jiri Benc.

    12) Fix CWND limiting of thin streams in TCP, from Bendik Rønning
    Opstad.

    13) Fix IPSEC flowcache overflows on large systems, from Steffen
    Klassert.

    14) Convert bridging to track VLANs using rhashtable entries rather than
    a bitmap. From Nikolay Aleksandrov.

    15) Make TCP listener handling completely lockless, this is a major
    accomplishment. Incoming request sockets now live in the
    established hash table just like any other socket too.

    From Eric Dumazet.

    15) Provide more bridging attributes to netlink, from Nikolay
    Aleksandrov.

    16) Use hash based algorithm for ipv4 multipath routing, this was very
    long overdue. From Peter Nørlund.

    17) Several y2038 cures, mostly avoiding timespec. From Arnd Bergmann.

    18) Allow non-root execution of EBPF programs, from Alexei Starovoitov.

    19) Support SO_INCOMING_CPU as setsockopt, from Eric Dumazet. This
    influences the port binding selection logic used by SO_REUSEPORT.

    20) Add ipv6 support to VRF, from David Ahern.

    21) Add support for Mellanox Spectrum switch ASIC, from Jiri Pirko.

    22) Add rtl8xxxu Realtek wireless driver, from Jes Sorensen.

    23) Implement RACK loss recovery in TCP, from Yuchung Cheng.

    24) Support multipath routes in MPLS, from Roopa Prabhu.

    25) Fix POLLOUT notification for listening sockets in AF_UNIX, from Eric
    Dumazet.

    26) Add new QED Qlogic driver, from Yuval Mintz, Manish Chopra, and
    Sudarsana Kalluru.

    27) Don't fetch timestamps on AF_UNIX sockets, from Hannes Frederic
    Sowa.

    28) Support ipv6 geneve tunnels, from John W Linville.

    29) Add flood control support to switchdev layer, from Ido Schimmel.

    30) Fix CHECKSUM_PARTIAL handling of potentially fragmented frames, from
    Hannes Frederic Sowa.

    31) Support persistent maps and progs in bpf, from Daniel Borkmann.

    * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1790 commits)
    sh_eth: use DMA barriers
    switchdev: respect SKIP_EOPNOTSUPP flag in case there is no recursion
    net: sched: kill dead code in sch_choke.c
    irda: Delete an unnecessary check before the function call "irlmp_unregister_service"
    net: dsa: mv88e6xxx: include DSA ports in VLANs
    net: dsa: mv88e6xxx: disable SA learning for DSA and CPU ports
    net/core: fix for_each_netdev_feature
    vlan: Invoke driver vlan hooks only if device is present
    arcnet/com20020: add LEDS_CLASS dependency
    bpf, verifier: annotate verbose printer with __printf
    dp83640: Only wait for timestamps for packets with timestamping enabled.
    ptp: Change ptp_class to a proper bitmask
    dp83640: Prune rx timestamp list before reading from it
    dp83640: Delay scheduled work.
    dp83640: Include hash in timestamp/packet matching
    ipv6: fix tunnel error handling
    net/mlx5e: Fix LSO vlan insertion
    net/mlx5e: Re-eanble client vlan TX acceleration
    net/mlx5e: Return error in case mlx5e_set_features() fails
    net/mlx5e: Don't allow more than max supported channels
    ...

    Linus Torvalds
     

22 Oct, 2015

1 commit

  • This helper is used to send raw data from eBPF program into
    special PERF_TYPE_SOFTWARE/PERF_COUNT_SW_BPF_OUTPUT perf_event.
    User space needs to perf_event_open() it (either for one or all cpus) and
    store FD into perf_event_array (similar to bpf_perf_event_read() helper)
    before eBPF program can send data into it.

    Today, programs triggered by kprobes collect the data and either store
    it into maps or print it via bpf_trace_printk(), where the latter is a
    debug facility and not suitable for streaming data. This new helper
    replaces such bpf_trace_printk() usage and allows programs to have a
    dedicated channel into user space for post-processing of the raw data
    collected.
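
    As an illustration only (map layout, SEC() annotations and helper
    declarations follow the samples/bpf conventions and are assumptions, not
    part of this commit), a sketch of a kprobe program streaming a record
    to user space:

    #include <uapi/linux/bpf.h>
    #include <uapi/linux/ptrace.h>
    #include <linux/types.h>
    #include "bpf_helpers.h"            /* assumed: samples/bpf helper stubs */

    /* A perf_event_array map: user space perf_event_open()s one
     * PERF_COUNT_SW_BPF_OUTPUT event per cpu and stores the fds here. */
    struct bpf_map_def SEC("maps") output_map = {
            .type = BPF_MAP_TYPE_PERF_EVENT_ARRAY,
            .key_size = sizeof(int),
            .value_size = sizeof(__u32),
            .max_entries = 64,          /* one slot per possible cpu */
    };

    SEC("kprobe/sys_write")
    int probe(struct pt_regs *ctx)
    {
            struct {
                    __u64 pid_tgid;
                    __u64 cookie;
            } rec = {
                    .pid_tgid = bpf_get_current_pid_tgid(),
                    .cookie = 0x12345678,
            };

            /* Emit the raw record into the event bound to this cpu's slot. */
            bpf_perf_event_output(ctx, &output_map, bpf_get_smp_processor_id(),
                                  &rec, sizeof(rec));
            return 0;
    }

    char _license[] SEC("license") = "GPL";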

    Signed-off-by: Alexei Starovoitov
    Signed-off-by: David S. Miller

    Alexei Starovoitov
     

20 Oct, 2015

2 commits

  • Add a new branch sample type to cover only call branches (function calls).
    The current ANY_CALL included direct, indirect calls and far jumps.

    We want to be able to differentiate indirect from direct calls. Therefore
    we introduce PERF_SAMPLE_BRANCH_CALL. The implementation is up to each
    architecture.

    Signed-off-by: Stephane Eranian
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Arnaldo Carvalho de Melo
    Cc: David Ahern
    Cc: Jiri Olsa
    Cc: Linus Torvalds
    Cc: Namhyung Kim
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: Vince Weaver
    Cc: khandual@linux.vnet.ibm.com
    Link: http://lkml.kernel.org/r/1444720151-10275-2-git-send-email-eranian@google.com
    Signed-off-by: Ingo Molnar

    Stephane Eranian
     
  • Commit:

    b20112edeadf ("perf/x86: Improve accuracy of perf/sched clock")

    allowed the time_shift value in perf_event_mmap_page to be as much
    as 32. Unfortunately the documented algorithms for using time_shift
    have it shifting an integer, whereas to work correctly with the value
    32, the type must be u64.

    In the case of perf tools, Intel PT decodes correctly but the timestamps
    that are output (for example by perf script) have lost 32-bits of
    granularity so they look like they are not changing at all.

    Fix by limiting the shift to 31 and adjusting the multiplier accordingly.

    Also update the documentation of perf_event_mmap_page so that new code
    based on it will be more future-proof.
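
    For illustration, a sketch of the documented cycles-to-timestamp
    conversion using the mmap page fields; the point here is that the
    shifted quantities must be u64:

    #include <linux/perf_event.h>

    /* Sketch: convert a raw cycle count to a perf timestamp using the
     * self-monitoring fields in the mmap page.  All intermediates are u64;
     * shifting an 'int' by a time_shift of 32 would lose everything. */
    static __u64 cyc_to_perf_time(const struct perf_event_mmap_page *pc,
                                  __u64 cyc)
    {
            __u64 quot = cyc >> pc->time_shift;
            __u64 rem  = cyc & (((__u64)1 << pc->time_shift) - 1);

            return pc->time_zero + quot * pc->time_mult +
                   ((rem * pc->time_mult) >> pc->time_shift);
    }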

    Signed-off-by: Adrian Hunter
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Andy Lutomirski
    Cc: Arnaldo Carvalho de Melo
    Cc: Arnaldo Carvalho de Melo
    Cc: David Ahern
    Cc: Jiri Olsa
    Cc: Linus Torvalds
    Cc: Namhyung Kim
    Cc: Peter Zijlstra
    Cc: Stephane Eranian
    Cc: Thomas Gleixner
    Cc: Vince Weaver
    Fixes: b20112edeadf ("perf/x86: Improve accuracy of perf/sched clock")
    Link: http://lkml.kernel.org/r/1445001845-13688-2-git-send-email-adrian.hunter@intel.com
    Signed-off-by: Ingo Molnar

    Adrian Hunter
     

04 Aug, 2015

1 commit

  • Intel Skylake supports reporting the time in cycles a branch in the LBR
    took, to give a rough indication of the basic block performance.

    Export the cycle information in the branch_info structure.
    This can be done by just reusing some currently zero padding.

    This is just the generic header change. The architecture
    still needs to fill it in.

    There's no attempt to convert to real time, as we really
    want cycles here.
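
    For reference, the resulting generic layout (abridged from the uapi
    header after this change; the architecture code fills in 'cycles'):

    struct perf_branch_entry {
            __u64   from;
            __u64   to;
            __u64   mispred:1,      /* target mispredicted         */
                    predicted:1,    /* target predicted            */
                    in_tx:1,        /* in transaction              */
                    abort:1,        /* transaction abort           */
                    cycles:16,      /* cycle count to last branch  */
                    reserved:44;
    };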

    Signed-off-by: Andi Kleen
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: eranian@google.com
    Link: http://lkml.kernel.org/r/1431285767-27027-5-git-send-email-andi@firstfloor.org
    Signed-off-by: Ingo Molnar

    Andi Kleen
     

24 Jul, 2015

1 commit

  • There are already two events for context switches, namely the tracepoint
    sched:sched_switch and the software event context_switches.
    Unfortunately neither is suitable for use by non-privileged users for
    the purpose of synchronizing hardware trace data (e.g. Intel PT) to the
    context switch.

    Tracepoints are no good at all for non-privileged users because they
    need either CAP_SYS_ADMIN or /proc/sys/kernel/perf_event_paranoid
    Acked-by: Peter Zijlstra (Intel)
    Tested-by: Jiri Olsa
    Cc: Andi Kleen
    Cc: Mathieu Poirier
    Cc: Pawel Moll
    Cc: Stephane Eranian
    Link: http://lkml.kernel.org/r/1437471846-26995-2-git-send-email-adrian.hunter@intel.com
    Signed-off-by: Arnaldo Carvalho de Melo

    Adrian Hunter
     

20 Jun, 2015

1 commit

  • System-wide sampling like 'perf top' or 'perf record -a' reads all
    threads' /proc/xxx/maps before sampling. If any thread keeps generating
    a huge, ever-growing maps file, perf will loop forever while
    synthesizing. Nothing will be sampled.

    This patch fixes this issue by adding per-thread timeout to force stop
    this kind of endless proc map processing.

    PERF_RECORD_MISC_PROC_MAP_PARSE_TIME_OUT is introduced to indicate that
    the mmap records were truncated by a timeout. The user gets a warning
    notification when truncated mmap records are detected.

    Reported-by: Ying Huang
    Signed-off-by: Kan Liang
    Cc: Andi Kleen
    Cc: David Ahern
    Cc: Ying Huang
    Link: http://lkml.kernel.org/r/1434549071-25611-1-git-send-email-kan.liang@intel.com
    Signed-off-by: Arnaldo Carvalho de Melo

    Kan Liang
     

07 Jun, 2015

2 commits

  • After enlarging the PEBS interrupt threshold, there may be some mixed up
    PEBS samples which are discarded by the kernel.

    This patch makes the kernel emit a PERF_RECORD_LOST_SAMPLES record with
    the number of possible discarded records when it is impossible to demux
    the samples.

    It makes sure the user is not left in the dark about such discards.

    Signed-off-by: Kan Liang
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Andrew Morton
    Cc: H. Peter Anvin
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: acme@infradead.org
    Cc: eranian@google.com
    Link: http://lkml.kernel.org/r/1431285195-14269-8-git-send-email-kan.liang@intel.com
    Signed-off-by: Ingo Molnar

    Kan Liang
     
  • This patch adds a new branch_sample_type flag to enable
    filtering branch sampling to indirect jumps. The support
    is subject to hardware or kernel software support on each
    architecture.

    Filtering on indirect jump is useful to study the targets
    of the jump.

    Signed-off-by: Stephane Eranian
    Signed-off-by: Peter Zijlstra (Intel)
    Reviewed-by: Andi Kleen
    Cc: Andrew Morton
    Cc: H. Peter Anvin
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: acme@redhat.com
    Cc: dsahern@gmail.com
    Cc: jolsa@redhat.com
    Cc: kan.liang@intel.com
    Cc: namhyung@kernel.org
    Link: http://lkml.kernel.org/r/1431637800-31061-2-git-send-email-eranian@google.com
    Signed-off-by: Ingo Molnar

    Stephane Eranian
     

02 Apr, 2015

7 commits

  • For counters that generate AUX data that is bound to the context of a
    running task, such as instruction tracing, the decoder needs to know
    exactly which task is running when the event is first scheduled in,
    before the first sched_switch. The decoder's need to know this stems
    from the fact that instruction flow trace decoding will almost always
    require the program's object code in order to reconstruct said flow, and
    for that we need at least its pid/tid in the perf stream.

    To single out such instruction tracing PMUs, this patch introduces the
    ITRACE PMU capability. The reason this is not part of the RECORD_AUX
    record is that not all PMUs capable of generating AUX data need this,
    and the opposite is *probably* also true.

    While sched_switch covers most cases, there are two problems with it:
    the consumer will need to process events out of order (that is, having
    found RECORD_AUX, it will have to skip forward to the nearest sched_switch
    to figure out which task it was, then go back to the actual trace to
    decode it) and it completely misses the case when the tracing is enabled
    and disabled before sched_switch, for example, via PERF_EVENT_IOC_DISABLE.

    Signed-off-by: Alexander Shishkin
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Borislav Petkov
    Cc: Frederic Weisbecker
    Cc: H. Peter Anvin
    Cc: Kaixu Xia
    Cc: Linus Torvalds
    Cc: Mike Galbraith
    Cc: Paul Mackerras
    Cc: Robert Richter
    Cc: Stephane Eranian
    Cc: Thomas Gleixner
    Cc: acme@infradead.org
    Cc: adrian.hunter@intel.com
    Cc: kan.liang@intel.com
    Cc: markus.t.metzger@intel.com
    Cc: mathieu.poirier@linaro.org
    Link: http://lkml.kernel.org/r/1421237903-181015-15-git-send-email-alexander.shishkin@linux.intel.com
    Signed-off-by: Ingo Molnar

    Alexander Shishkin
     
  • When AUX area gets a certain amount of new data, we want to wake up
    userspace to collect it. This adds a new control to specify how much
    data will cause a wakeup. This is then passed down to pmu drivers via
    output handle's "wakeup" field, so that the driver can find the nearest
    point where it can generate an interrupt.

    We repurpose __reserved_2 in the event attribute for this. Even though
    it was never checked to be zero before, aux_watermark will only matter
    to new AUX-aware code, so the old code should still be fine.

    Signed-off-by: Alexander Shishkin
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Borislav Petkov
    Cc: Frederic Weisbecker
    Cc: H. Peter Anvin
    Cc: Kaixu Xia
    Cc: Linus Torvalds
    Cc: Mike Galbraith
    Cc: Paul Mackerras
    Cc: Robert Richter
    Cc: Stephane Eranian
    Cc: Thomas Gleixner
    Cc: acme@infradead.org
    Cc: adrian.hunter@intel.com
    Cc: kan.liang@intel.com
    Cc: markus.t.metzger@intel.com
    Cc: mathieu.poirier@linaro.org
    Link: http://lkml.kernel.org/r/1421237903-181015-10-git-send-email-alexander.shishkin@linux.intel.com
    Signed-off-by: Ingo Molnar

    Alexander Shishkin
     
  • This adds support for overwrite mode in the AUX area, which means "keep
    collecting data till you're stopped", turning AUX area into a circular
    buffer, where new data overwrites old data. It does not depend on data
    buffer's overwrite mode, so that it doesn't lose sideband data that is
    instrumental for processing AUX data.

    Overwrite mode is enabled by mapping the AUX area read-only. Even though
    aux_tail in the buffer's user page might be user writable, it will be
    ignored in this mode.

    A PERF_RECORD_AUX with PERF_AUX_FLAG_OVERWRITE set is written to the
    perf data stream every time an event writes new data to the AUX area.
    The pmu driver might not be able to infer the exact beginning of the new
    data in each snapshot; some drivers will only provide the tail, which is
    aux_offset + aux_size in the AUX record. The consumer has to be able to
    tell the new data from the old, for example by means of timestamps if
    such are provided in the trace.

    The consumer is also responsible for disabling any events that might
    write to the AUX area (thus potentially racing with the consumer) before
    collecting the data.

    Signed-off-by: Alexander Shishkin
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Borislav Petkov
    Cc: Frederic Weisbecker
    Cc: H. Peter Anvin
    Cc: Kaixu Xia
    Cc: Linus Torvalds
    Cc: Mike Galbraith
    Cc: Paul Mackerras
    Cc: Robert Richter
    Cc: Stephane Eranian
    Cc: Thomas Gleixner
    Cc: acme@infradead.org
    Cc: adrian.hunter@intel.com
    Cc: kan.liang@intel.com
    Cc: markus.t.metzger@intel.com
    Cc: mathieu.poirier@linaro.org
    Link: http://lkml.kernel.org/r/1421237903-181015-9-git-send-email-alexander.shishkin@linux.intel.com
    Signed-off-by: Ingo Molnar

    Alexander Shishkin
     
  • When there's new data in the AUX space, output a record indicating its
    offset and size and a set of flags, such as PERF_AUX_FLAG_TRUNCATED, to
    mean the described data was truncated to fit in the ring buffer.

    Signed-off-by: Alexander Shishkin
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Arnaldo Carvalho de Melo
    Cc: Borislav Petkov
    Cc: Frederic Weisbecker
    Cc: H. Peter Anvin
    Cc: Kaixu Xia
    Cc: Linus Torvalds
    Cc: Mike Galbraith
    Cc: Paul Mackerras
    Cc: Robert Richter
    Cc: Stephane Eranian
    Cc: Thomas Gleixner
    Cc: adrian.hunter@intel.com
    Cc: kan.liang@intel.com
    Cc: markus.t.metzger@intel.com
    Cc: mathieu.poirier@linaro.org
    Link: http://lkml.kernel.org/r/1421237903-181015-7-git-send-email-alexander.shishkin@linux.intel.com
    Signed-off-by: Ingo Molnar

    Alexander Shishkin
     
  • This patch introduces "AUX space" in the perf mmap buffer, intended for
    exporting high bandwidth data streams to userspace, such as instruction
    flow traces.

    AUX space is a ring buffer, defined by aux_{offset,size} fields in the
    user_page structure, and read/write pointers aux_{head,tail}, which abide
    by the same rules as data_* counterparts of the main perf buffer.

    In order to allocate/mmap AUX, userspace needs to set aux_offset to an
    offset greater than data_offset + data_size, and aux_size to the desired
    buffer size. Both need to be page aligned. Then the same aux_offset and
    aux_size should be passed to the mmap() call and, if everything adds up,
    you should have an AUX buffer as a result.

    Pages that are mapped into this buffer also count against the user's
    mlock rlimit plus the perf_event_mlock_kb allowance.
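
    A minimal sketch of that sequence (page counts are illustrative; error
    handling omitted):

    #include <unistd.h>
    #include <sys/mman.h>
    #include <linux/perf_event.h>

    /* Sketch: put an AUX area on top of an existing perf event fd. */
    static void *map_aux_area(int fd, size_t *aux_len)
    {
            size_t page = sysconf(_SC_PAGESIZE);
            size_t data_pages = 8, aux_pages = 64;
            struct perf_event_mmap_page *up;

            /* Map the user page plus the normal data buffer first. */
            up = mmap(NULL, (1 + data_pages) * page, PROT_READ | PROT_WRITE,
                      MAP_SHARED, fd, 0);

            /* Describe where the AUX area lives and how big it is. */
            up->aux_offset = (1 + data_pages) * page; /* > data_offset + data_size */
            up->aux_size   = aux_pages * page;
            *aux_len = up->aux_size;

            /* The file offset of this second mmap() selects the AUX area. */
            return mmap(NULL, up->aux_size, PROT_READ | PROT_WRITE,
                        MAP_SHARED, fd, up->aux_offset);
    }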

    Signed-off-by: Peter Zijlstra (Intel)
    Acked-by: Alexander Shishkin
    Cc: Borislav Petkov
    Cc: Frederic Weisbecker
    Cc: H. Peter Anvin
    Cc: Kaixu Xia
    Cc: Linus Torvalds
    Cc: Mike Galbraith
    Cc: Paul Mackerras
    Cc: Robert Richter
    Cc: Stephane Eranian
    Cc: Thomas Gleixner
    Cc: acme@infradead.org
    Cc: adrian.hunter@intel.com
    Cc: kan.liang@intel.com
    Cc: markus.t.metzger@intel.com
    Cc: mathieu.poirier@linaro.org
    Link: http://lkml.kernel.org/r/1421237903-181015-3-git-send-email-alexander.shishkin@linux.intel.com
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • Currently, the actual perf ring buffer starts one page into the mmap
    area, following the user page, and userspace follows this convention. This
    patch adds data_{offset,size} fields to user_page that can be used by
    userspace instead for locating perf data in the mmap area. This is also
    helpful when mapping existing or shared buffers if their size is not
    known in advance.

    Right now, it is made to follow the existing convention that

    data_offset == PAGE_SIZE and
    data_offset + data_size == mmap_size.

    Signed-off-by: Alexander Shishkin
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Borislav Petkov
    Cc: Frederic Weisbecker
    Cc: H. Peter Anvin
    Cc: Kaixu Xia
    Cc: Linus Torvalds
    Cc: Mike Galbraith
    Cc: Paul Mackerras
    Cc: Robert Richter
    Cc: Stephane Eranian
    Cc: Thomas Gleixner
    Cc: acme@infradead.org
    Cc: adrian.hunter@intel.com
    Cc: kan.liang@intel.com
    Cc: markus.t.metzger@intel.com
    Cc: mathieu.poirier@linaro.org
    Link: http://lkml.kernel.org/r/1421237903-181015-2-git-send-email-alexander.shishkin@linux.intel.com
    Signed-off-by: Ingo Molnar

    Alexander Shishkin
     
  • BPF programs, attached to kprobes, provide a safe way to execute
    user-defined BPF byte-code programs without being able to crash or
    hang the kernel in any way. The BPF engine makes sure that such
    programs have a finite execution time and that they cannot break
    out of their sandbox.

    The user interface is to attach to a kprobe via the perf syscall:

    struct perf_event_attr attr = {
            .type = PERF_TYPE_TRACEPOINT,
            .config = event_id,
            ...
    };

    event_fd = perf_event_open(&attr, ...);
    ioctl(event_fd, PERF_EVENT_IOC_SET_BPF, prog_fd);

    'prog_fd' is a file descriptor associated with BPF program
    previously loaded.

    'event_id' is an ID of the kprobe created.

    Closing 'event_fd':

    close(event_fd);

    ... automatically detaches BPF program from it.

    BPF programs can call in-kernel helper functions to:

    - lookup/update/delete elements in maps

    - probe_read - wrapper of probe_kernel_read() used to access any
    kernel data structures

    BPF programs receive 'struct pt_regs *' as an input ('struct pt_regs' is
    architecture dependent) and return 0 to ignore the event and 1 to store
    kprobe event into the ring buffer.

    Note, kprobes are fundamentally _not_ a stable kernel ABI, so BPF
    programs attached to kprobes must be recompiled for every kernel version
    and the user must supply the correct LINUX_VERSION_CODE in
    attr.kern_version during the bpf_prog_load() call.

    Signed-off-by: Alexei Starovoitov
    Reviewed-by: Steven Rostedt
    Reviewed-by: Masami Hiramatsu
    Cc: Andrew Morton
    Cc: Arnaldo Carvalho de Melo
    Cc: Arnaldo Carvalho de Melo
    Cc: Daniel Borkmann
    Cc: David S. Miller
    Cc: Jiri Olsa
    Cc: Linus Torvalds
    Cc: Namhyung Kim
    Cc: Peter Zijlstra
    Cc: Peter Zijlstra
    Link: http://lkml.kernel.org/r/1427312966-8434-4-git-send-email-ast@plumgrid.com
    Signed-off-by: Ingo Molnar

    Alexei Starovoitov
     

27 Mar, 2015

1 commit

  • While thinking on the whole clock discussion it occurred to me we have
    two distinct uses of time:

    1) the tracking of event/ctx/cgroup enabled/running/stopped times
    which includes the self-monitoring support in struct
    perf_event_mmap_page.

    2) the actual timestamps visible in the data records.

    And we've been conflating them.

    The first is all about tracking time deltas; nobody should really care
    in what time base that happens, it's all relative information, and as
    long as it's internally consistent it works.

    The second however is what people are worried about when having to
    merge their data with external sources. And here we have the
    discussion on MONOTONIC vs MONOTONIC_RAW etc..

    Where MONOTONIC is good for correlating between machines (static
    offset), MONOTONIC_RAW is required for correlating against a fixed rate
    hardware clock.

    This means configurability; now 1) makes that hard because it needs to
    be internally consistent across groups of unrelated events; which is
    why we had to have a global perf_clock().

    However, for 2) it doesn't really matter, perf itself doesn't care
    what it writes into the buffer.

    The below patch makes the distinction between these two cases by
    adding perf_event_clock() which is used for the second case. It
    further makes this configurable on a per-event basis, but adds a few
    sanity checks such that we cannot combine events with different clocks
    in confusing ways.

    And since we then have per-event configurability we might as well
    retain the 'legacy' behaviour as a default.
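
    A sketch of the per-event selection (assuming the use_clockid/clockid
    perf_event_attr fields, which is how this configurability is exposed in
    the uapi; attribute setup abridged):

    #include <time.h>
    #include <linux/perf_event.h>

    /* Sketch: have the timestamps written into data records come from
     * CLOCK_MONOTONIC_RAW so they correlate with a fixed-rate hardware
     * clock; enabled/running time accounting is unaffected. */
    static void use_raw_clock(struct perf_event_attr *attr)
    {
            attr->use_clockid = 1;
            attr->clockid = CLOCK_MONOTONIC_RAW;
            attr->sample_type |= PERF_SAMPLE_TIME;
    }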

    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Andrew Morton
    Cc: Arnaldo Carvalho de Melo
    Cc: David Ahern
    Cc: Jiri Olsa
    Cc: John Stultz
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Stephane Eranian
    Cc: Thomas Gleixner
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     

19 Feb, 2015

2 commits

  • With the LBR call stack feature enabled, there are three callchain
    options. Expose the 3rd callchain option (LBR callstack) to user space
    tooling.

    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Jiri Olsa
    Cc: Arnaldo Carvalho de Melo
    Cc: Andy Lutomirski
    Cc: Kan Liang
    Cc: Linus Torvalds
    Cc: Paul Mackerras
    Cc: Vince Weaver
    Cc: linux-api@vger.kernel.org
    Link: http://lkml.kernel.org/r/20141105093759.GQ10501@worktop.programming.kicks-ass.net
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • The index of lbr_sel_map is the bit value of perf branch_sample_type.
    PERF_SAMPLE_BRANCH_MAX is 1024 at present, so each lbr_sel_map uses
    4096 bytes. By using the bit shift as the index, we can reduce the
    lbr_sel_map size to 40 bytes. This patch defines a 'bit shift' for each
    branch type, and uses the 'bit shift' to define the lbr_sel_maps.

    Signed-off-by: Yan, Zheng
    Signed-off-by: Kan Liang
    Signed-off-by: Peter Zijlstra (Intel)
    Reviewed-by: Stephane Eranian
    Cc: Andy Lutomirski
    Cc: Arnaldo Carvalho de Melo
    Cc: Linus Torvalds
    Cc: Paul Mackerras
    Cc: Vince Weaver
    Cc: jolsa@redhat.com
    Cc: linux-api@vger.kernel.org
    Link: http://lkml.kernel.org/r/1415156173-10035-2-git-send-email-kan.liang@intel.com
    Signed-off-by: Ingo Molnar

    Yan, Zheng
     

16 Nov, 2014

1 commit

  • Enable capture of interrupted machine state for each sample.

    Registers to sample are passed per event in the sample_regs_intr bitmask.

    To sample the interrupted machine state, PERF_SAMPLE_REGS_INTR must be
    passed in sample_type.

    The list of available registers is arch dependent and provided by
    asm/perf_regs.h.

    Registers are laid out as u64 in the bit order of sample_regs_intr.

    This patch also adds a new ABI version PERF_ATTR_SIZE_VER4 because we extend
    the perf_event_attr struct with a new u64 field.
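
    A sketch of requesting this from user space (x86 register names for
    illustration; the available set is arch dependent):

    #include <string.h>
    #include <linux/perf_event.h>
    #include <asm/perf_regs.h>          /* arch-specific register numbering */

    /* Sketch: capture the interrupted IP and SP with every sample. */
    static void request_intr_regs(struct perf_event_attr *attr)
    {
            memset(attr, 0, sizeof(*attr));
            attr->size = sizeof(*attr);     /* PERF_ATTR_SIZE_VER4 or later */
            attr->sample_type = PERF_SAMPLE_IP | PERF_SAMPLE_REGS_INTR;
            attr->sample_regs_intr = (1ULL << PERF_REG_X86_IP) |
                                     (1ULL << PERF_REG_X86_SP);
    }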

    Reviewed-by: Jiri Olsa
    Signed-off-by: Stephane Eranian
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: cebbert.lkml@gmail.com
    Cc: Arnaldo Carvalho de Melo
    Cc: Linus Torvalds
    Cc: linux-api@vger.kernel.org
    Link: http://lkml.kernel.org/r/1411559322-16548-2-git-send-email-eranian@google.com
    Signed-off-by: Ingo Molnar

    Stephane Eranian
     

28 Oct, 2014

1 commit

  • struct perf_event_mmap_page has members called "index" and
    "cap_user_rdpmc". Spell them correctly in the examples.

    Signed-off-by: Andy Lutomirski
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Arnaldo Carvalho de Melo
    Cc: Linus Torvalds
    Cc: linux-api@vger.kernel.org
    Link: http://lkml.kernel.org/r/320ba26391a8123cc16e5f02d24d34bd404332fd.1412313343.git.luto@amacapital.net
    Signed-off-by: Ingo Molnar

    Andy Lutomirski
     

09 Jun, 2014

1 commit

  • The mmap2 interface was missing the protection and flags bits needed to
    accurately determine if a mmap memory area was shared or private and
    if it was readable or not.

    Signed-off-by: Peter Zijlstra
    [tweaked patch to compile and wrote changelog]
    Signed-off-by: Don Zickus
    Link: http://lkml.kernel.org/r/1400526833-141779-2-git-send-email-dzickus@redhat.com
    Signed-off-by: Jiri Olsa

    Peter Zijlstra
     

06 Jun, 2014

1 commit

  • perf tools like 'perf report' can aggregate samples by comm strings,
    which generally works. However, there are other potential use-cases.
    For example, to pair up 'calls' with 'returns' accurately (from branch
    events like Intel BTS) it is necessary to identify whether the process
    has exec'd. Although a comm event is generated when an 'exec' happens,
    it is also generated whenever the comm string is changed on a whim
    (e.g. by prctl PR_SET_NAME). This patch adds a flag to the comm event
    to differentiate one case from the other.

    In order to determine whether the kernel supports the new flag, a
    selection bit named 'exec' is added to struct perf_event_attr. The
    bit does nothing but will cause perf_event_open() to fail if the bit
    is set on kernels that do not have it defined.

    Signed-off-by: Adrian Hunter
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/537D9EBE.7030806@intel.com
    Cc: Paul Mackerras
    Cc: Dave Jones
    Cc: Arnaldo Carvalho de Melo
    Cc: David Ahern
    Cc: Jiri Olsa
    Cc: Alexander Viro
    Cc: Linus Torvalds
    Cc: linux-fsdevel@vger.kernel.org
    Cc: linux-kernel@vger.kernel.org
    Signed-off-by: Ingo Molnar

    Adrian Hunter
     

05 Jun, 2014

1 commit

  • This patch introduces new branch filter PERF_SAMPLE_BRANCH_COND which
    will extend the existing perf ABI. This will filter branches which are
    conditional. Various architectures can provide this functionality either
    with HW filtering support (if present) or with SW filtering of captured
    branch instructions.

    Signed-off-by: Anshuman Khandual
    Reviewed-by: Stephane Eranian
    Reviewed-by: Andi Kleen
    Signed-off-by: Peter Zijlstra
    Cc: mpe@ellerman.id.au
    Cc: benh@kernel.crashing.org
    Cc: Arnaldo Carvalho de Melo
    Cc: Linus Torvalds
    Link: http://lkml.kernel.org/r/1400743210-32289-1-git-send-email-khandual@linux.vnet.ibm.com
    Signed-off-by: Ingo Molnar

    Anshuman Khandual
     

19 May, 2014

1 commit

  • Vince noticed that we test the (unsigned long) flags field against an
    (unsigned int) constant. This would allow setting the high bits on
    64-bit platforms without getting an error.

    There is nothing that uses the high bits, so it should be entirely
    harmless, but we don't want userspace to accidentally set them anyway,
    so fix the constants.

    Cc: Ingo Molnar
    Cc: Thomas Gleixner
    Reported-by: Vince Weaver
    Tested-by: Vince Weaver
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/20140423102254.GL11096@twins.programming.kicks-ass.net
    Signed-off-by: Thomas Gleixner

    Peter Zijlstra
     

12 Jan, 2014

1 commit

  • Unlike recent modern userspace API such as:

    epoll_create1 (EPOLL_CLOEXEC), eventfd (EFD_CLOEXEC),
    fanotify_init (FAN_CLOEXEC), inotify_init1 (IN_CLOEXEC),
    signalfd (SFD_CLOEXEC), timerfd_create (TFD_CLOEXEC),
    or the venerable general purpose open (O_CLOEXEC),

    the perf_event_open() syscall lacks a flag to atomically set the
    FD_CLOEXEC (i.e. close-on-exec) flag on the file descriptor it returns
    to userspace.

    The present patch adds a PERF_FLAG_FD_CLOEXEC flag to allow the
    perf_event_open() syscall to atomically set close-on-exec.

    Having this flag will enable userspace to remove the file descriptor
    from the list of file descriptors being inherited across exec,
    without the need to call fcntl(fd, F_SETFD, FD_CLOEXEC) and the
    associated race condition between the current thread and another
    thread calling fork(2) then execve(2).
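
    A sketch of the flag in use (raw syscall and attribute setup are
    illustrative):

    #include <unistd.h>
    #include <sys/syscall.h>
    #include <linux/perf_event.h>

    /* Sketch: open a counter whose fd is atomically marked close-on-exec,
     * leaving no window for a concurrent fork(2)+execve(2) to leak it. */
    static int open_counter_cloexec(struct perf_event_attr *attr)
    {
            return syscall(__NR_perf_event_open, attr, 0, -1, -1,
                           PERF_FLAG_FD_CLOEXEC);
    }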

    Links:

    - Secure File Descriptor Handling (Ulrich Drepper, 2008)
    http://udrepper.livejournal.com/20407.html

    - Excuse me son, but your code is leaking !!! (Dan Walsh, March 2012)
    http://danwalsh.livejournal.com/53603.html

    - Notes in DMA buffer sharing: leak and security hole
    http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/tree/Documentation/dma-buf-sharing.txt?id=v3.13-rc3#n428

    Signed-off-by: Yann Droneaud
    Cc: Arnaldo Carvalho de Melo
    Cc: Al Viro
    Cc: Andrew Morton
    Cc: Paul Mackerras
    Cc: Linus Torvalds
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/8c03f54e1598b1727c19706f3af03f98685d9fe6.1388952061.git.ydroneaud@opteya.com
    Signed-off-by: Ingo Molnar

    Yann Droneaud
     

17 Dec, 2013

1 commit

  • Commit fdfbbd07e91f8fe3871 ("perf: Add generic transaction flags")
    added support for PERF_SAMPLE_TRANSACTION but forgot to add documentation
    for the sample type to include/uapi/linux/perf_event.h

    Signed-off-by: Vince Weaver
    Signed-off-by: Peter Zijlstra
    Cc: Andi Kleen
    Link: http://lkml.kernel.org/r/alpine.DEB.2.02.1312131548450.10372@pianoman.cluster.toy
    Signed-off-by: Ingo Molnar

    Vince Weaver
     

29 Oct, 2013

1 commit

  • The PPC64 people noticed a missing memory barrier and crufty old
    comments in the perf ring buffer code. So update all the comments and
    add the missing barrier.

    When the architecture implements local_t using atomic_long_t there
    will be double barriers issued; but short of introducing more
    conditional barrier primitives this is the best we can do.

    Reported-by: Victor Kaplansky
    Tested-by: Victor Kaplansky
    Signed-off-by: Peter Zijlstra
    Cc: Mathieu Desnoyers
    Cc: michael@ellerman.id.au
    Cc: Paul McKenney
    Cc: Michael Neuling
    Cc: Frederic Weisbecker
    Cc: anton@samba.org
    Cc: benh@kernel.crashing.org
    Link: http://lkml.kernel.org/r/20131025173749.GG19466@laptop.lan
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     

04 Oct, 2013

1 commit

  • Add a generic qualifier for transaction events, as a new sample
    type that returns a flag word. This is particularly useful
    for qualifying aborts: to distinguish aborts which happen
    due to asynchronous events (like conflicts caused by another
    CPU) versus instructions that lead to an abort.

    The tuning strategies are very different for those cases,
    so it's important to distinguish them easily and early.

    Since it's inconvenient and inflexible to filter for this
    in the kernel we report all the events out and allow
    some post processing in user space.

    The flags are based on the Intel TSX events, but should be fairly
    generic and mostly applicable to other HTM architectures too. In addition
    to various flag words there's also reserved space to report an
    program supplied abort code. For TSX this is used to distinguish specific
    classes of aborts, like a lock busy abort when doing lock elision.

    Flags:

    Elision and generic transactions (ELISION vs TRANSACTION)
    (HLE vs RTM on TSX; IBM etc. would likely only use TRANSACTION)
    Aborts caused by current thread vs aborts caused by others (SYNC vs ASYNC)
    Retryable transaction (RETRY)
    Conflicts with other threads (CONFLICT)
    Transaction write capacity overflow (CAPACITY WRITE)
    Transaction read capacity overflow (CAPACITY READ)

    Transactions implicitly aborted can also return an abort code.
    This can be used to signal specific events to the profiler. A common
    case is an abort on lock busy in an RTM eliding library (code 0xff).
    To handle this case we include the TSX abort code.

    Common example aborts in TSX would be:

    - Data conflict with another thread on memory read.
    Flags: TRANSACTION|ASYNC|CONFLICT
    - executing a WRMSR in a transaction. Flags: TRANSACTION|SYNC
    - HLE transaction in user space is too large
    Flags: ELISION|SYNC|CAPACITY-WRITE

    The only flag that is somewhat TSX specific is ELISION.
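
    A sketch of decoding the resulting flag word in user space (the printed
    classification is illustrative; only the PERF_TXN_* bits come from the
    ABI):

    #include <stdio.h>
    #include <linux/perf_event.h>

    /* Sketch: classify one PERF_SAMPLE_TRANSACTION value read from a
     * sample record, using the generic PERF_TXN_* bits. */
    static void classify_txn(__u64 txn)
    {
            unsigned int code = (txn & PERF_TXN_ABORT_MASK) >>
                                PERF_TXN_ABORT_SHIFT;

            printf("%s, %s abort%s%s, abort code %#x\n",
                   (txn & PERF_TXN_ELISION) ? "elision" : "transaction",
                   (txn & PERF_TXN_SYNC) ? "sync" : "async",
                   (txn & PERF_TXN_RETRY) ? ", retryable" : "",
                   (txn & PERF_TXN_CONFLICT) ? ", conflict" : "",
                   code);
    }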

    This adds the perf core glue needed for reporting the new flag word out.

    v2: Add MEM/MISC
    v3: Move transaction to the end
    v4: Separate capacity-read/write and remove misc
    v5: Remove _SAMPLE. Move abort flags to 32bit. Rename
    transaction to txn
    Signed-off-by: Andi Kleen
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/1379688044-14173-2-git-send-email-andi@firstfloor.org
    Signed-off-by: Ingo Molnar

    Andi Kleen
     

20 Sep, 2013

2 commits

  • Solve the problems around the broken definition of perf_event_mmap_page::
    cap_usr_time and cap_usr_rdpmc fields which used to overlap, partially
    fixed by:

    860f085b74e9 ("perf: Fix broken union in 'struct perf_event_mmap_page'")

    The problem with the fix (merged in v3.12-rc1 and not yet released
    officially), noticed by Vince Weaver is that the new behavior is
    not detectable by new user-space, and that due to the reuse of the
    field names it's easy to mis-compile a binary if old headers are used
    on a new kernel or new headers are used on an old kernel.

    To solve all that make this change explicit, detectable and self-contained,
    by iterating the ABI the following way:

    - Always clear bit 0, and rename it to usrpage->cap_bit0, to at least not
    confuse old user-space binaries. RDPMC will be marked as unavailable
    to old binaries but that's within the ABI, this is a capability bit.

    - Rename bit 1 to ->cap_bit0_is_deprecated and always set it to 1, so new
    libraries can reliably detect that bit 0 is deprecated and perma-zero
    without having to check the kernel version.

    - Use bits 2, 3, 4 for the newly defined, correct functionality:

    cap_user_rdpmc : 1, /* The RDPMC instruction can be used to read counts */
    cap_user_time : 1, /* The time_* fields are used */
    cap_user_time_zero : 1, /* The time_zero field is used */

    - Rename all the bitfield names in perf_event.h to be different from the
    old names, to make sure it's not possible to mis-compile it
    accidentally with old assumptions.

    The 'size' field can then be used in the future to add new fields and it
    will act as a natural ABI version indicator as well.
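
    A sketch of how new user space can use these bits (helper name made up):

    #include <linux/perf_event.h>

    /* Sketch: only trust cap_user_rdpmc on kernels new enough to set the
     * cap_bit0_is_deprecated marker; older kernels have the broken union
     * where bit 0 could mean either capability. */
    static int rdpmc_usable(const struct perf_event_mmap_page *pc)
    {
            if (!pc->cap_bit0_is_deprecated)
                    return 0;           /* old ABI: do not trust bit 0 */
            return pc->cap_user_rdpmc;
    }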

    Also adjust tools/perf/ userspace for the new definitions, noticed by
    Adrian Hunter.

    Reported-by: Vince Weaver
    Signed-off-by: Peter Zijlstra
    Also-Fixed-by: Adrian Hunter
    Link: http://lkml.kernel.org/n/tip-zr03yxjrpXesOzzupszqglbv@git.kernel.org
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • For some mysterious reason the sample_id field of PERF_RECORD_MMAP went AWOL.

    Reported-by: Vince Weaver
    Signed-off-by: Peter Zijlstra
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     

18 Sep, 2013

1 commit

  • Without the following patch I have problems compiling code using
    the new PERF_EVENT_IOC_ID ioctl(). It looks like u64 was used
    instead of __u64

    Signed-off-by: Vince Weaver
    Acked-by: Peter Zijlstra
    Cc: Arnaldo Carvalho de Melo
    Link: http://lkml.kernel.org/r/alpine.DEB.2.10.1309171450380.11444@vincent-weaver-1.um.maine.edu
    Signed-off-by: Ingo Molnar

    Vince Weaver