10 Nov, 2020

2 commits

  • struct perf_sample_data lives on-stack, so we should be careful about
    its size. Furthermore, the pt_regs copy in there is only because x86_64
    is a trainwreck; solve it differently.

    Reported-by: Thomas Gleixner
    Signed-off-by: Peter Zijlstra (Intel)
    Tested-by: Steven Rostedt
    Link: https://lkml.kernel.org/r/20201030151955.258178461@infradead.org

    Peter Zijlstra
     
  • __perf_output_begin() has an on-stack struct perf_sample_data in the
    unlikely case it needs to generate a LOST record. However, every call
    to perf_output_begin() must already have a perf_sample_data on-stack.
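
    A hedged sketch of the resulting calling convention (assuming the
    signature described in this series): callers pass their on-stack
    sample data down, so the LOST path can reuse it.

        struct perf_sample_data data;

        perf_sample_data_init(&data, 0, event->hw.last_period);
        if (perf_output_begin(&handle, &data, event, size))
                return;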

    Reported-by: Thomas Gleixner
    Signed-off-by: Peter Zijlstra (Intel)
    Link: https://lkml.kernel.org/r/20201030151954.985416146@infradead.org

    Peter Zijlstra
     

10 Sep, 2020

1 commit

  • The pmu::sched_task() is a context switch callback. It passes the
    cpuctx->task_ctx as a parameter to the lower code. To find the
    cpuctx->task_ctx, the current code iterates a cpuctx list.
    The same context will be iterated in perf_event_context_sched_out()
    soon. Sharing the cpuctx->task_ctx avoids the unnecessary iteration
    of the cpuctx list.

    The pmu::sched_task() is also required for the optimization case for
    equivalent contexts.

    The task_ctx_sched_out() will eventually disable and reenable the PMU
    when scheduling out events. Adding perf_pmu_disable() and
    perf_pmu_enable() around task_ctx_sched_out() doesn't break anything.

    Drop the cpuctx->ctx.lock for the pmu::sched_task(). The lock is for
    the per-CPU context and is not necessary for the per-task context
    switch.

    No one uses sched_cb_entry, perf_sched_cb_usages, sched_cb_list, and
    perf_pmu_sched_task() any more.
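
    A simplified, hedged sketch of the resulting ordering (details differ
    in the actual patch):

        perf_pmu_disable(pmu);
        if (pmu->sched_task)
                pmu->sched_task(cpuctx->task_ctx, false); /* sched out */
        task_ctx_sched_out(cpuctx, ctx, EVENT_ALL);
        perf_pmu_enable(pmu);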

    Suggested-by: Peter Zijlstra (Intel)
    Signed-off-by: Kan Liang
    Signed-off-by: Peter Zijlstra (Intel)
    Link: https://lkml.kernel.org/r/20200821195754.20159-2-kan.liang@linux.intel.com

    Kan Liang
     

18 Aug, 2020

2 commits

  • Starting with Ice Lake, the TopDown metrics are directly available as
    fixed counters and do not require generic counters. Also, the TopDown
    metrics can be collected per thread. Extend the RDPMC usage to support
    per-thread TopDown metrics.

    The RDPMC index of the PERF_METRICS will be output if RDPMC users ask
    for the RDPMC index of the metrics events.

    To support per-thread RDPMC TopDown, the metrics and slots counters
    have to be saved/restored during context switching.

    The last_period and period_left fields are not used in counting mode.
    Reuse them for saved_metric and saved_slots.
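
    For reference, a hedged userspace sketch of a raw RDPMC read; the
    index for a metrics event would come from the event's mmap'ed user
    page rather than being hard-coded:

        #include <stdint.h>

        static inline uint64_t rdpmc(uint32_t counter)
        {
                uint32_t lo, hi;

                __asm__ volatile("rdpmc" : "=a"(lo), "=d"(hi) : "c"(counter));
                return ((uint64_t)hi << 32) | lo;
        }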

    Signed-off-by: Kan Liang
    Signed-off-by: Peter Zijlstra (Intel)
    Link: https://lkml.kernel.org/r/20200723171117.9918-12-kan.liang@linux.intel.com

    Kan Liang
     
  • Current perf assumes that events in a group are independent: closing
    an event doesn't impact the value of the other events in the same
    group. If the closed event is a member, the other events keep running
    as a group after the closure. If the closed event is the leader, the
    other events continue running as singleton events.

    Add PERF_EV_CAP_SIBLING to allow events to indicate that they require
    being part of a group, and that when the leader dies they cannot
    exist independently.
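
    A hedged sketch of how a PMU driver would mark such an event at
    event init time:

        /* Only meaningful as a sibling of its group leader. */
        event->event_caps |= PERF_EV_CAP_SIBLING;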

    Suggested-by: Peter Zijlstra
    Signed-off-by: Kan Liang
    Signed-off-by: Peter Zijlstra (Intel)
    Link: https://lkml.kernel.org/r/20200723171117.9918-8-kan.liang@linux.intel.com

    Kan Liang
     

06 Aug, 2020

1 commit

  • Pull networking updates from David Miller:

    1) Support 6Ghz band in ath11k driver, from Rajkumar Manoharan.

    2) Support UDP segmentation in core TSO code, from Eric Dumazet.

    3) Allow flashing different flash images in cxgb4 driver, from Vishal
    Kulkarni.

    4) Add drop frames counter and flow status to tc flower offloading,
    from Po Liu.

    5) Support n-tuple filters in cxgb4, from Vishal Kulkarni.

    6) Various new indirect call avoidance, from Eric Dumazet and Brian
    Vazquez.

    7) Fix BPF verifier failures on 32-bit pointer arithmetic, from
    Yonghong Song.

    8) Support querying and setting hardware address of a port function via
    devlink, use this in mlx5, from Parav Pandit.

    9) Support hw ipsec offload on bonding slaves, from Jarod Wilson.

    10) Switch qca8k driver over to phylink, from Jonathan McDowell.

    11) In bpftool, show list of processes holding BPF FD references to
    maps, programs, links, and btf objects. From Andrii Nakryiko.

    12) Several conversions over to generic power management, from Vaibhav
    Gupta.

    13) Add support for SO_KEEPALIVE et al. to bpf_setsockopt(), from Dmitry
    Yakunin.

    14) Various https url conversions, from Alexander A. Klimov.

    15) Timestamping and PHC support for mscc PHY driver, from Antoine
    Tenart.

    16) Support bpf iterating over tcp and udp sockets, from Yonghong Song.

    17) Support 5GBASE-T i40e NICs, from Aleksandr Loktionov.

    18) Add kTLS RX HW offload support to mlx5e, from Tariq Toukan.

    19) Fix the ->ndo_start_xmit() return type to be netdev_tx_t in several
    drivers. From Luc Van Oostenryck.

    20) XDP support for xen-netfront, from Denis Kirjanov.

    21) Support receive buffer autotuning in MPTCP, from Florian Westphal.

    22) Support EF100 chip in sfc driver, from Edward Cree.

    23) Add XDP support to mvpp2 driver, from Matteo Croce.

    24) Support MPTCP in sock_diag, from Paolo Abeni.

    25) Commonize UDP tunnel offloading code by creating udp_tunnel_nic
    infrastructure, from Jakub Kicinski.

    26) Several pci_ --> dma_ API conversions, from Christophe JAILLET.

    27) Add FLOW_ACTION_POLICE support to mlxsw, from Ido Schimmel.

    28) Add SK_LOOKUP bpf program type, from Jakub Sitnicki.

    29) Refactor a lot of networking socket option handling code in order to
    avoid set_fs() calls, from Christoph Hellwig.

    30) Add rfc4884 support to icmp code, from Willem de Bruijn.

    31) Support TBF offload in dpaa2-eth driver, from Ioana Ciornei.

    32) Support XDP_REDIRECT in qede driver, from Alexander Lobakin.

    33) Support PCI relaxed ordering in mlx5 driver, from Aya Levin.

    34) Support TCP syncookies in MPTCP, from Florian Westphal.

    35) Fix several tricky cases of PMTU handling wrt. bridging, from Stefano
    Brivio.

    * git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (2056 commits)
    net: thunderx: initialize VF's mailbox mutex before first usage
    usb: hso: remove bogus check for EINPROGRESS
    usb: hso: no complaint about kmalloc failure
    hso: fix bailout in error case of probe
    ip_tunnel_core: Fix build for archs without _HAVE_ARCH_IPV6_CSUM
    selftests/net: relax cpu affinity requirement in msg_zerocopy test
    mptcp: be careful on subflow creation
    selftests: rtnetlink: make kci_test_encap() return sub-test result
    selftests: rtnetlink: correct the final return value for the test
    net: dsa: sja1105: use detected device id instead of DT one on mismatch
    tipc: set ub->ifindex for local ipv6 address
    ipv6: add ipv6_dev_find()
    net: openvswitch: silence suspicious RCU usage warning
    Revert "vxlan: fix tos value before xmit"
    ptp: only allow phase values lower than 1 period
    farsync: switch from 'pci_' to 'dma_' API
    wan: wanxl: switch from 'pci_' to 'dma_' API
    hv_netvsc: do not use VF device if link is down
    dpaa2-eth: Fix passing zero to 'PTR_ERR' warning
    net: macb: Properly handle phylink on at91sam9x
    ...

    Linus Torvalds
     

08 Jul, 2020

2 commits

  • A new kmem_cache method has replaced kzalloc() for allocating the PMU
    specific data, so the task_ctx_size field is no longer required.

    Signed-off-by: Kan Liang
    Signed-off-by: Peter Zijlstra (Intel)
    Link: https://lkml.kernel.org/r/1593780569-62993-19-git-send-email-kan.liang@linux.intel.com

    Kan Liang
     
  • Currently, the PMU specific data task_ctx_data is allocated by the
    function kzalloc() in the perf generic code. When there is no specific
    alignment requirement for the task_ctx_data, the method works well for
    now. However, there will be a problem once a specific alignment
    requirement is introduced in future features, e.g., the Architecture LBR
    XSAVE feature requires 64-byte alignment. If the specific alignment
    requirement is not fulfilled, the XSAVE family of instructions will fail
    to save/restore the xstate to/from the task_ctx_data.

    The function kzalloc() itself only guarantees a natural alignment. A
    new method to allocate the task_ctx_data has to be introduced, and it
    has to meet the following requirements:
    - it must be a generic method that can be used by different
      architectures, because the allocation of the task_ctx_data is
      implemented in the perf generic code;
    - it must guarantee the alignment (the alignment requirement does not
      change after boot);
    - it must be able to allocate/free a buffer (smaller than a page size)
      dynamically;
    - it should not cause extra CPU or space overhead.

    Several options were considered:
    - One option is to allocate a larger buffer for the task_ctx_data and
      align within it, e.g.,
          ptr = kmalloc(size + alignment, GFP_KERNEL);
          ptr = PTR_ALIGN(ptr, alignment);
      This option causes space overhead.
    - Another option is to allocate the task_ctx_data in the PMU specific
      code. To do so, several function pointers have to be added. As a
      result, both the generic structure and the PMU specific structure
      will become bigger. Besides, extra function calls are added when
      allocating/freeing the buffer. This option will increase both the
      space overhead and the CPU overhead.
    - The third option is to use a kmem_cache to allocate a buffer for the
      task_ctx_data. The kmem_cache can be created with a specific
      alignment requirement by the PMU at boot time. A new pointer for the
      kmem_cache has to be added to the generic struct pmu and is used to
      dynamically allocate a buffer for the task_ctx_data at run time.
      Although the new pointer is added to the struct pmu, the existing
      task_ctx_size variable is no longer required, so the size of the
      generic structure stays the same.

    The third option, which meets all the aforementioned requirements, is
    used to replace kzalloc() for the PMU specific data allocation. A
    later patch will remove the kzalloc() method and the related
    variables.
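
    A hedged sketch of the third option, with hypothetical names for a
    PMU whose context needs 64-byte alignment:

        static struct kmem_cache *ctx_cache;    /* hypothetical */

        /* Boot time: the PMU creates the cache with a fixed alignment;
         * ctx_size is the PMU's context size (hypothetical). */
        ctx_cache = kmem_cache_create("pmu_task_ctx_data", ctx_size,
                                      64, SLAB_PANIC, NULL);

        /* Run time: allocate/free the task_ctx_data dynamically. */
        void *task_ctx_data = kmem_cache_zalloc(ctx_cache, GFP_KERNEL);
        kmem_cache_free(ctx_cache, task_ctx_data);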

    Signed-off-by: Kan Liang
    Signed-off-by: Peter Zijlstra (Intel)
    Link: https://lkml.kernel.org/r/1593780569-62993-17-git-send-email-kan.liang@linux.intel.com

    Kan Liang
     

01 Jul, 2020

1 commit

  • Sanitize and expose get/put_callchain_entry(). This will be used by
    the bpf stack map.
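
    A hedged usage sketch, assuming the sanitized interface returns the
    entry and reports the recursion context through a pointer:

        int rctx;
        struct perf_callchain_entry *entry = get_callchain_entry(&rctx);

        if (entry) {
                /* ... fill the entry for the bpf stack map ... */
                put_callchain_entry(rctx);
        }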

    Suggested-by: Peter Zijlstra
    Signed-off-by: Song Liu
    Signed-off-by: Alexei Starovoitov
    Acked-by: Andrii Nakryiko
    Link: https://lore.kernel.org/bpf/20200630062846.664389-2-songliubraving@fb.com

    Song Liu
     

15 Jun, 2020

1 commit

  • Record (single instruction) changes to the kernel text (i.e.
    self-modifying code) in order to support tracers like Intel PT and
    ARM CoreSight.

    A copy of the running kernel code is needed as a reference point (e.g.
    from /proc/kcore). The text poke event records the old bytes and the
    new bytes so that the event can be processed forwards or backwards.

    The basic problem is recording the modified instruction in an
    unambiguous manner given SMP instruction cache (in)coherence. That is,
    when modifying an instruction concurrently any solution with one or
    multiple timestamps is not sufficient:

         CPU0                CPU1
    0
    1    write insn A
    2                        execute insn A
    3    sync-I$
    4

    Due to I$, CPU1 might execute either the old or new A. No matter where
    we record tracepoints on CPU0, one simply cannot tell what CPU1 will
    have observed, except that at 0 it must be the old one and at 4 it
    must be the new one.

    To solve this, take inspiration from x86 text poking, which has to
    solve this exact problem due to variable length instruction encoding
    and I-fetch windows.

    1) overwrite the instruction with a breakpoint and sync I$

    This guarantees that the code flow will never hit the target
    instruction anymore, on any CPU (or rather, it will cause an
    exception).

    2) issue the TEXT_POKE event

    3) overwrite the breakpoint with the new instruction and sync I$

    Now we know that any execution after the TEXT_POKE event will either
    observe the breakpoint (and hit the exception) or the new instruction.

    So, by guarding the TEXT_POKE event with an exception on either side,
    we can now tell, without doubt, which instruction another CPU will
    have observed.
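
    A hedged sketch of the sequence, with hypothetical helpers standing
    in for the arch-specific poke and I$-sync primitives:

        /* 1) fence off the old instruction on all CPUs */
        write_insn(addr, BREAKPOINT);           /* hypothetical */
        sync_icache_all_cpus();                 /* hypothetical */

        /* 2) emit the event; it records both old and new bytes */
        perf_event_text_poke(addr, old, len, new, len);

        /* 3) install the new instruction */
        write_insn(addr, new);
        sync_icache_all_cpus();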

    Signed-off-by: Adrian Hunter
    Signed-off-by: Peter Zijlstra (Intel)
    Acked-by: Peter Zijlstra (Intel)
    Link: https://lkml.kernel.org/r/20200512121922.8997-2-adrian.hunter@intel.com

    Adrian Hunter
     

04 Jun, 2020

1 commit

  • Pull networking updates from David Miller:

    1) Allow setting bluetooth L2CAP modes via socket option, from Luiz
    Augusto von Dentz.

    2) Add GSO partial support to igc, from Sasha Neftin.

    3) Several cleanups and improvements to r8169 from Heiner Kallweit.

    4) Add IF_OPER_TESTING link state and use it when ethtool triggers a
    device self-test. From Andrew Lunn.

    5) Start moving away from custom driver versions, use the globally
    defined kernel version instead, from Leon Romanovsky.

    6) Support GRO via gro_cells in DSA layer, from Alexander Lobakin.

    7) Allow hard IRQ deferral during NAPI, from Eric Dumazet.

    8) Add sriov and vf support to hinic, from Luo bin.

    9) Support Media Redundancy Protocol (MRP) in the bridging code, from
    Horatiu Vultur.

    10) Support netmap in the nft_nat code, from Pablo Neira Ayuso.

    11) Allow UDPv6 encapsulation of ESP in the ipsec code, from Sabrina
    Dubroca. Also add ipv6 support for espintcp.

    12) Lots of ReST conversions of the networking documentation, from Mauro
    Carvalho Chehab.

    13) Support configuration of ethtool rxnfc flows in bcmgenet driver,
    from Doug Berger.

    14) Allow to dump cgroup id and filter by it in inet_diag code, from
    Dmitry Yakunin.

    15) Add infrastructure to export netlink attribute policies to
    userspace, from Johannes Berg.

    16) Several optimizations to sch_fq scheduler, from Eric Dumazet.

    17) Fallback to the default qdisc if qdisc init fails because otherwise
    a packet scheduler init failure will make a device inoperative. From
    Jesper Dangaard Brouer.

    18) Several RISCV bpf jit optimizations, from Luke Nelson.

    19) Correct the return type of the ->ndo_start_xmit() method in several
    drivers, it's netdev_tx_t but many drivers were using
    'int'. From Yunjian Wang.

    20) Add an ethtool interface for PHY master/slave config, from Oleksij
    Rempel.

    21) Add BPF iterators, from Yonghong Song.

    22) Add cable test infrastructure, including ethtool interfaces, from
    Andrew Lunn. Marvell PHY driver is the first to support this
    facility.

    23) Remove zero-length arrays all over, from Gustavo A. R. Silva.

    24) Calculate and maintain an explicit frame size in XDP, from Jesper
    Dangaard Brouer.

    25) Add CAP_BPF, from Alexei Starovoitov.

    26) Support terse dumps in the packet scheduler, from Vlad Buslov.

    27) Support XDP_TX bulking in dpaa2 driver, from Ioana Ciornei.

    28) Add devm_register_netdev(), from Bartosz Golaszewski.

    29) Minimize qdisc resets, from Cong Wang.

    30) Get rid of kernel_getsockopt and kernel_setsockopt in order to
    eliminate set_fs/get_fs calls. From Christoph Hellwig.

    * git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (2517 commits)
    selftests: net: ip_defrag: ignore EPERM
    net_failover: fixed rollback in net_failover_open()
    Revert "tipc: Fix potential tipc_aead refcnt leak in tipc_crypto_rcv"
    Revert "tipc: Fix potential tipc_node refcnt leak in tipc_rcv"
    vmxnet3: allow rx flow hash ops only when rss is enabled
    hinic: add set_channels ethtool_ops support
    selftests/bpf: Add a default $(CXX) value
    tools/bpf: Don't use $(COMPILE.c)
    bpf, selftests: Use bpf_probe_read_kernel
    s390/bpf: Use bcr 0,%0 as tail call nop filler
    s390/bpf: Maintain 8-byte stack alignment
    selftests/bpf: Fix verifier test
    selftests/bpf: Fix sample_cnt shared between two threads
    bpf, selftests: Adapt cls_redirect to call csum_level helper
    bpf: Add csum_level helper for fixing up csum levels
    bpf: Fix up bpf_skb_adjust_room helper's skb csum setting
    sfc: add missing annotation for efx_ef10_try_update_nic_stats_vf()
    crypto/chtls: IPv6 support for inline TLS
    Crypto/chcr: Fixes a coccinile check error
    Crypto/chcr: Fixes compilations warnings
    ...

    Linus Torvalds
     

20 May, 2020

1 commit

  • The current codebase makes use of the zero-length array language
    extension to the C90 standard, but the preferred mechanism to declare
    variable-length types such as these ones is a flexible array member[1][2],
    introduced in C99:

    struct foo {
            int stuff;
            struct boo array[];
    };

    By making use of the mechanism above, we will get a compiler warning
    in case the flexible array does not occur last in the structure, which
    will help us prevent some kind of undefined behavior bugs from being
    inadvertently introduced[3] to the codebase from now on.

    Also, notice that dynamic memory allocations won't be affected by
    this change:

    "Flexible array members have incomplete type, and so the sizeof operator
    may not be applied. As a quirk of the original implementation of
    zero-length arrays, sizeof evaluates to zero."[1]

    sizeof(flexible-array-member) triggers a warning because flexible array
    members have incomplete type[1]. There are some instances of code in
    which the sizeof operator is being incorrectly/erroneously applied to
    zero-length arrays and the result is zero. Such instances may be hiding
    some bugs. So, this work (flexible-array member conversions) will also
    help to get completely rid of those sorts of issues.

    This issue was found with the help of Coccinelle.

    [1] https://gcc.gnu.org/onlinedocs/gcc/Zero-Length.html
    [2] https://github.com/KSPP/linux/issues/21
    [3] commit 76497732932f ("cxgb3/l2t: Fix undefined behaviour")
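
    A hedged allocation example for the struct above; struct_size() from
    include/linux/overflow.h avoids open-coding the size math:

        struct foo *p = kzalloc(struct_size(p, array, n), GFP_KERNEL);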

    Signed-off-by: Gustavo A. R. Silva
    Signed-off-by: Peter Zijlstra (Intel)
    Link: https://lkml.kernel.org/r/20200511201227.GA14041@embeddedor

    Gustavo A. R. Silva
     

27 Apr, 2020

1 commit

  • Instead of having all the sysctl handlers deal with user pointers, which
    is rather hairy in terms of the BPF interaction, copy the input to and
    from userspace in common code. This also means that the strings are
    always NUL-terminated by the common code, making the API a little bit
    safer.

    As most handlers just pass the data through to one of the common
    handlers, a lot of the changes are mechanical.
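
    After the change, a handler sees a kernel buffer; a hedged sketch of
    the resulting shape (handler name hypothetical):

        static int my_handler(struct ctl_table *table, int write,
                              void *buffer, size_t *lenp, loff_t *ppos)
        {
                /* buffer is kernel memory; common code did the copy */
                return proc_dointvec_minmax(table, write, buffer,
                                            lenp, ppos);
        }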

    Signed-off-by: Christoph Hellwig
    Acked-by: Andrey Ignatov
    Signed-off-by: Al Viro

    Christoph Hellwig
     

16 Apr, 2020

1 commit

  • Open access to monitoring of kernel code, CPUs, tracepoints and
    namespaces data for a CAP_PERFMON privileged process. Providing the
    access under the CAP_PERFMON capability alone, without the rest of
    the CAP_SYS_ADMIN credentials, reduces the chance of credential
    misuse and makes operation more secure.

    CAP_PERFMON implements the principle of least privilege for performance
    monitoring and observability operations (POSIX IEEE 1003.1e 2.2.2.39
    principle of least privilege: A security design principle that states
    that a process or program be granted only those privileges (e.g.,
    capabilities) necessary to accomplish its legitimate function, and only
    for the time that such privileges are actually required)

    For backward compatibility reasons, access to the perf_events
    subsystem remains open for CAP_SYS_ADMIN privileged processes, but
    using CAP_SYS_ADMIN for secure perf_events monitoring is discouraged
    in favor of the CAP_PERFMON capability.
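
    The backward-compatibility rule maps onto a helper like the
    perfmon_capable() accessor from this series (hedged sketch):

        static inline bool perfmon_capable(void)
        {
                return capable(CAP_PERFMON) || capable(CAP_SYS_ADMIN);
        }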

    Signed-off-by: Alexey Budankov
    Reviewed-by: James Morris
    Tested-by: Arnaldo Carvalho de Melo
    Cc: Alexei Starovoitov
    Cc: Andi Kleen
    Cc: Igor Lubashev
    Cc: Jiri Olsa
    Cc: linux-man@vger.kernel.org
    Cc: Namhyung Kim
    Cc: Peter Zijlstra
    Cc: Serge Hallyn
    Cc: Song Liu
    Cc: Stephane Eranian
    Cc: Thomas Gleixner
    Cc: intel-gfx@lists.freedesktop.org
    Cc: linux-doc@vger.kernel.org
    Cc: linux-security-module@vger.kernel.org
    Cc: selinux@vger.kernel.org
    Link: http://lore.kernel.org/lkml/471acaef-bb8a-5ce2-923f-90606b78eef9@linux.intel.com
    Signed-off-by: Arnaldo Carvalho de Melo

    Alexey Budankov
     

27 Mar, 2020

1 commit

  • The PERF_SAMPLE_CGROUP bit is to save (perf_event) cgroup information
    in the sample. It will add a 64-bit id to identify the current
    cgroup, which is the file handle in the cgroup file system. Userspace
    should use this information with the PERF_RECORD_CGROUP event to
    match which cgroup the sample belongs to.

    I put it before PERF_SAMPLE_AUX for simplicity, since it just needs a
    64-bit word. But if we want bigger samples, I can work in that
    direction too.

    Committer testing:

    $ pahole perf_sample_data | grep -w cgroup -B5 -A5
            /* --- cacheline 4 boundary (256 bytes) was 56 bytes ago --- */
            struct perf_regs        regs_intr;            /*   312    16 */
            /* --- cacheline 5 boundary (320 bytes) was 8 bytes ago --- */
            u64                     stack_user_size;      /*   328     8 */
            u64                     phys_addr;            /*   336     8 */
            u64                     cgroup;               /*   344     8 */

            /* size: 384, cachelines: 6, members: 22 */
            /* padding: 32 */
    };
    $
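
    A hedged userspace sketch of requesting the new sample field:

        struct perf_event_attr attr = { 0 };

        attr.sample_type = PERF_SAMPLE_TID | PERF_SAMPLE_CGROUP;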

    Signed-off-by: Namhyung Kim
    Tested-by: Arnaldo Carvalho de Melo
    Acked-by: Peter Zijlstra (Intel)
    Acked-by: Tejun Heo
    Cc: Alexander Shishkin
    Cc: Jiri Olsa
    Cc: Johannes Weiner
    Cc: Mark Rutland
    Cc: Zefan Li
    Link: http://lore.kernel.org/lkml/20200325124536.2800725-3-namhyung@kernel.org
    Signed-off-by: Arnaldo Carvalho de Melo

    Namhyung Kim
     

06 Mar, 2020

1 commit

  • The storage required for visit_groups_merge's min heap needs to vary in
    order to support more iterators, such as when multiple nested cgroups'
    events are being visited. This change allows for 2 iterators and doesn't
    support growth.

    Based-on-work-by: Peter Zijlstra (Intel)
    Signed-off-by: Ian Rogers
    Signed-off-by: Peter Zijlstra (Intel)
    Signed-off-by: Ingo Molnar
    Link: https://lkml.kernel.org/r/20200214075133.181299-5-irogers@google.com

    Ian Rogers
     

11 Feb, 2020

1 commit

  • The low level index is the index in the underlying hardware buffer of
    the most recently captured taken branch, which is always saved in
    branch_entries[0]. It is very useful for reconstructing the call
    stack. For example, in Intel LBR call stack mode, the depth of the
    reconstructed LBR call stack is limited by the number of LBR
    registers. With the low level index information, the perf tool may
    stitch the stacks of two samples, and the reconstructed LBR call
    stack can break that HW limitation.

    Add a new branch sample type to retrieve the low level index of raw
    branch records. The low level index is between -1 (unknown) and the
    max depth, which can be retrieved from /sys/devices/cpu/caps/branches.

    The low level index information is dumped into the
    PERF_SAMPLE_BRANCH_STACK output only when the new branch sample type
    is set. The perf tool should check attr.branch_sample_type and apply
    the corresponding format for PERF_SAMPLE_BRANCH_STACK samples.
    Otherwise, some use cases may break. For example, users may parse a
    perf.data file that includes the new branch sample type with an old
    version of the perf tool (without the check) and get incorrect
    information without any warning.
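
    A hedged parsing sketch, assuming the new bit is named
    PERF_SAMPLE_BRANCH_HW_INDEX and the index sits in the branch stack:

        /* Only interpret the index when it was explicitly requested. */
        if (attr->branch_sample_type & PERF_SAMPLE_BRANCH_HW_INDEX)
                hw_idx = branch_stack->hw_idx;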

    Signed-off-by: Kan Liang
    Signed-off-by: Peter Zijlstra (Intel)
    Signed-off-by: Ingo Molnar
    Link: https://lkml.kernel.org/r/20200127165355.27495-2-kan.liang@linux.intel.com

    Kan Liang
     

10 Feb, 2020

1 commit

  • Pull perf fixes from Thomas Gleixner:
    "A set of fixes and improvements for the perf subsystem:

    Kernel fixes:

    - Install cgroup events to the correct CPU context to prevent a
    potential list double add

    - Prevent an integer underflow in the perf mlock accounting

    - Add a missing prototype for arch_perf_update_userpage()

    Tooling:

    - Add a missing unlock in the error path of maps__insert() in perf
    maps.

    - Fix the build with the latest libbfd

    - Fix the perf parser so it does not delete parse event terms, which
    caused a regression when using perf with ARM CoreSight, as the sink
    configuration was missing due to the deletion.

    - Fix the double free in the perf CPU map merging test case

    - Add the missing ustring support for the perf probe command"

    * tag 'perf-urgent-2020-02-09' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    perf maps: Add missing unlock to maps__insert() error case
    perf probe: Add ustring support for perf probe command
    perf: Make perf able to build with latest libbfd
    perf test: Fix test case Merge cpu map
    perf parse: Copy string to perf_evsel_config_term
    perf parse: Refactor 'struct perf_evsel_config_term'
    kernel/events: Add a missing prototype for arch_perf_update_userpage()
    perf/cgroups: Install cgroup events to correct cpuctx
    perf/core: Fix mlock accounting in perf_mmap()

    Linus Torvalds
     

14 Jan, 2020

1 commit

  • eBPF needs to know the size of the perf ring buffer structure, but it
    unfortunately has the same name as the generic ring buffer used by
    tracing and oprofile. To make it less ambiguous, rename the perf ring
    buffer structure to "perf_buffer".

    As other parts of the ring buffer code have "perf_" as the prefix, it
    only makes sense to give the ring buffer the "perf_" prefix as well.

    Link: https://lore.kernel.org/r/20191213153553.GE20583@krava
    Acked-by: Peter Zijlstra
    Suggested-by: Alexei Starovoitov
    Signed-off-by: Steven Rostedt (VMware)

    Steven Rostedt (VMware)
     

27 Nov, 2019

1 commit

  • Pull perf updates from Ingo Molnar:
    "The main kernel side changes in this cycle were:

    - Various Intel-PT updates and optimizations (Alexander Shishkin)

    - Prohibit kprobes on Xen/KVM emulate prefixes (Masami Hiramatsu)

    - Add support for LSM and SELinux checks to control access to the
    perf syscall (Joel Fernandes)

    - Misc other changes, optimizations, fixes and cleanups - see the
    shortlog for details.

    There were numerous tooling changes as well - 254 non-merge commits.
    Here are the main changes - too many to list in detail:

    - Enhancements to core tooling infrastructure, perf.data, libperf,
    libtraceevent, event parsing, vendor events, Intel PT, callchains,
    BPF support and instruction decoding.

    - There were updates to the following tools:

    perf annotate
    perf diff
    perf inject
    perf kvm
    perf list
    perf maps
    perf parse
    perf probe
    perf record
    perf report
    perf script
    perf stat
    perf test
    perf trace

    - And a lot of other changes: please see the shortlog and Git log for
    more details"

    * 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (279 commits)
    perf parse: Fix potential memory leak when handling tracepoint errors
    perf probe: Fix spelling mistake "addrees" -> "address"
    libtraceevent: Fix memory leakage in copy_filter_type
    libtraceevent: Fix header installation
    perf intel-bts: Does not support AUX area sampling
    perf intel-pt: Add support for decoding AUX area samples
    perf intel-pt: Add support for recording AUX area samples
    perf pmu: When using default config, record which bits of config were changed by the user
    perf auxtrace: Add support for queuing AUX area samples
    perf session: Add facility to peek at all events
    perf auxtrace: Add support for dumping AUX area samples
    perf inject: Cut AUX area samples
    perf record: Add aux-sample-size config term
    perf record: Add support for AUX area sampling
    perf auxtrace: Add support for AUX area sample recording
    perf auxtrace: Move perf_evsel__find_pmu()
    perf record: Add a function to test for kernel support for AUX area sampling
    perf tools: Add kernel AUX area sampling definitions
    perf/core: Make the mlock accounting simple again
    perf report: Jump to symbol source view from total cycles view
    ...

    Linus Torvalds
     

15 Nov, 2019

2 commits

  • Export perf_event_pause() as an external accessor for kernel users
    (such as KVM) that may want to both disable the perf_event and read
    its count while holding perf_event_ctx_lock only once. The value can
    also optionally be reset.
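
    A hedged usage sketch, assuming the accessor takes a reset flag and
    returns the count:

        /* Disable, read and reset, under one ctx-lock acquisition. */
        u64 count = perf_event_pause(event, true);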

    Suggested-by: Peter Zijlstra
    Signed-off-by: Like Xu
    Acked-by: Peter Zijlstra
    Signed-off-by: Paolo Bonzini

    Like Xu
     
  • Currently, perf_event_period() is used by user tools via ioctl.
    Following the naming convention, export perf_event_period() for
    kernel users (such as KVM) that may recalibrate the event period for
    their assigned counter according to their requirements.

    The perf_event_period() is an external accessor, just like
    perf_event_{en,dis}able(), and should thus use perf_event_ctx_lock().
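
    A hedged usage sketch of the exported accessor (new_period is
    hypothetical):

        /* Recalibrate the sampling period of an assigned counter. */
        int err = perf_event_period(event, new_period);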

    Suggested-by: Kan Liang
    Signed-off-by: Like Xu
    Acked-by: Peter Zijlstra
    Signed-off-by: Paolo Bonzini

    Like Xu
     

13 Nov, 2019

1 commit

  • AUX data can be used to annotate perf events such as performance counters
    or tracepoints/breakpoints by including it in sample records when
    PERF_SAMPLE_AUX flag is set. Such samples would be instrumental in debugging
    and profiling by providing, for example, a history of instruction flow
    leading up to the event's overflow.

    The implementation makes use of grouping an AUX event with all the events
    that wish to take samples of the AUX data, such that the former is the
    group leader. The samplees should also specify the desired size of the AUX
    sample via attr.aux_sample_size.

    AUX capable PMUs need to explicitly add support for sampling, because it
    relies on a new callback to take a snapshot of the buffer without touching
    the event states.
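
    A hedged userspace sketch (pe and aux_fd hypothetical): the samplee
    requests 1KB of AUX data per sample and joins the AUX event's group:

        pe.sample_type |= PERF_SAMPLE_AUX;
        pe.aux_sample_size = 1024;
        /* aux_fd is the fd of the AUX event, i.e. the group leader */
        fd = syscall(__NR_perf_event_open, &pe, 0, -1, aux_fd, 0);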

    Signed-off-by: Alexander Shishkin
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Arnaldo Carvalho de Melo
    Cc: David Ahern
    Cc: Jiri Olsa
    Cc: Linus Torvalds
    Cc: Mark Rutland
    Cc: Namhyung Kim
    Cc: Stephane Eranian
    Cc: Thomas Gleixner
    Cc: Vince Weaver
    Cc: adrian.hunter@intel.com
    Cc: mathieu.poirier@linaro.org
    Link: https://lkml.kernel.org/r/20191025140835.53665-2-alexander.shishkin@linux.intel.com
    Signed-off-by: Ingo Molnar

    Alexander Shishkin
     

28 Oct, 2019

3 commits

  • Declare swap_task_ctx() methods at the generic and x86 specific
    pmu types to bridge calls to platform specific PMU code on the
    optimized context switch path between equivalent task perf event
    contexts.
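
    A hedged sketch of the new method's declaration in struct pmu:

        /* Swap PMU specific per-task state between two equivalent
         * task contexts on the optimized switch path. */
        void (*swap_task_ctx)(struct perf_event_context *prev,
                              struct perf_event_context *next);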

    Signed-off-by: Alexey Budankov
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Alexander Shishkin
    Cc: Arnaldo Carvalho de Melo
    Cc: Ian Rogers
    Cc: Jiri Olsa
    Cc: Kan Liang
    Cc: Linus Torvalds
    Cc: Mark Rutland
    Cc: Namhyung Kim
    Cc: Peter Zijlstra
    Cc: Song Liu
    Cc: Stephane Eranian
    Cc: Thomas Gleixner
    Cc: Vince Weaver
    Link: https://lkml.kernel.org/r/9a0aa84a-f062-9b64-3133-373658550c4b@linux.intel.com
    Signed-off-by: Ingo Molnar

    Alexey Budankov
     
  • Signed-off-by: Ingo Molnar

    Ingo Molnar
     
  • As per POSIX, the correct spelling of the error code is EACCES:

    include/uapi/asm-generic/errno-base.h:#define EACCES 13 /* Permission denied */

    Signed-off-by: Geert Uytterhoeven
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Alexander Shishkin
    Cc: Arnaldo Carvalho de Melo
    Cc: Jiri Kosina
    Cc: Jiri Olsa
    Cc: Linus Torvalds
    Cc: Mark Rutland
    Cc: Namhyung Kim
    Cc: Stephane Eranian
    Cc: Thomas Gleixner
    Cc: Vince Weaver
    Link: https://lkml.kernel.org/r/20191024122904.12463-1-geert+renesas@glider.be
    Signed-off-by: Ingo Molnar

    Geert Uytterhoeven
     

18 Oct, 2019

1 commit

  • In current mainline, the degree of access to perf_event_open(2) system
    call depends on the perf_event_paranoid sysctl. This has a number of
    limitations:

    1. The sysctl is only a single value. Many types of accesses are
    controlled based on that single value, thus making the control very
    limited and coarse-grained.
    2. The sysctl is global, so if the sysctl is changed, then all
    processes get access to perf_event_open(2), opening the door to
    security issues.

    This patch adds LSM and SELinux access checking which will be used in
    Android to access perf_event_open(2) for the purposes of attaching BPF
    programs to tracepoints, perf profiling and other operations from
    userspace. These operations are intended for production systems.

    5 new LSM hooks are added:
    1. perf_event_open: This controls access during the perf_event_open(2)
    syscall itself. The hook is called from all the places where the
    perf_event_paranoid sysctl is checked, to keep it consistent with the
    sysctl. The hook gets passed a 'type' argument which controls CPU,
    kernel and tracepoint accesses (in this context, CPU, kernel and
    tracepoint have the same semantics as the perf_event_paranoid sysctl).
    Additionally, I added an 'open' type which is similar to the
    perf_event_paranoid sysctl == 3 patch carried in Android and several
    other distros but was rejected in mainline [1] in 2016.

    2. perf_event_alloc: This allocates a new security object for the event
    which stores the current SID within the event. It will be useful when
    the perf event's FD is passed through IPC to another process which may
    try to read the FD. Appropriate security checks will limit access.

    3. perf_event_free: Called when the event is closed.

    4. perf_event_read: Called from the read(2) and mmap(2) syscalls for the event.

    5. perf_event_write: Called from the ioctl(2) syscalls for the event.

    [1] https://lwn.net/Articles/696240/

    Since Peter had suggested LSM hooks in 2016 [1], I am adding his
    Suggested-by tag below.

    To use this patch, we set the perf_event_paranoid sysctl to -1 and then
    apply selinux checking as appropriate (default deny everything, and then
    add policy rules to give access to domains that need it). In the future
    we can remove the perf_event_paranoid sysctl altogether.
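
    A hedged sketch of the main hook's shape, with 'type' mirroring the
    paranoid levels described above:

        /* Called wherever perf_event_paranoid used to be the gate. */
        int security_perf_event_open(struct perf_event_attr *attr,
                                     int type);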

    Suggested-by: Peter Zijlstra
    Co-developed-by: Peter Zijlstra
    Signed-off-by: Joel Fernandes (Google)
    Signed-off-by: Peter Zijlstra (Intel)
    Acked-by: James Morris
    Cc: Arnaldo Carvalho de Melo
    Cc: rostedt@goodmis.org
    Cc: Yonghong Song
    Cc: Kees Cook
    Cc: Ingo Molnar
    Cc: Alexei Starovoitov
    Cc: jeffv@google.com
    Cc: Jiri Olsa
    Cc: Daniel Borkmann
    Cc: primiano@google.com
    Cc: Song Liu
    Cc: rsavitski@google.com
    Cc: Namhyung Kim
    Cc: Matthew Garrett
    Link: https://lkml.kernel.org/r/20191014170308.70668-1-joel@joelfernandes.org

    Joel Fernandes (Google)
     

28 Aug, 2019

1 commit

  • In some cases, ordinary (non-AUX) events can generate data for AUX events.
    For example, PEBS events can come out as records in the Intel PT stream
    instead of their usual DS records, if configured to do so.

    One requirement for such events is to consistently schedule together, to
    ensure that the data from the "AUX output" events isn't lost while their
    corresponding AUX event is not scheduled. We use grouping to provide this
    guarantee: an "AUX output" event can be added to a group where an AUX event
    is a group leader, and provided that the former supports writing to the
    latter.
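
    A hedged userspace sketch of the grouping contract (pe and
    intel_pt_fd hypothetical): the "AUX output" event sets aux_output
    and opens with the AUX event as its leader:

        pe.aux_output = 1;
        fd = syscall(__NR_perf_event_open, &pe, 0, -1, intel_pt_fd, 0);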

    Signed-off-by: Alexander Shishkin
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Ingo Molnar
    Cc: Arnaldo Carvalho de Melo
    Cc: kan.liang@linux.intel.com
    Link: https://lkml.kernel.org/r/20190806084606.4021-2-alexander.shishkin@linux.intel.com

    Alexander Shishkin
     

13 Jul, 2019

1 commit

  • So far, we tried to disallow grouping exclusive events for fear of
    complications they would cause with moving between contexts. Specifically,
    moving a software group to a hardware context would violate the exclusivity
    rules if both groups contain matching exclusive events.

    This attempt was, however, unsuccessful: the check that we have in the
    perf_event_open() syscall is both wrong (looks at wrong PMU) and
    insufficient (group leader may still be exclusive), as can be illustrated
    by running:

    $ perf record -e '{intel_pt//,cycles}' uname
    $ perf record -e '{cycles,intel_pt//}' uname

    both of which ultimately succeed.

    Furthermore, we are completely free to trigger the exclusivity violation
    by:

    perf -e '{cycles,intel_pt//}' -e '{intel_pt//,instructions}'

    even though the helpful perf record will not allow that, the ABI will.

    The warning later in the perf_event_open() path will also not trigger, because
    it's also wrong.

    Fix all this by validating the original group before moving, getting
    rid of broken safeguards and placing a useful one in
    perf_install_in_context().

    Signed-off-by: Alexander Shishkin
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Arnaldo Carvalho de Melo
    Cc: Jiri Olsa
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Stephane Eranian
    Cc: Thomas Gleixner
    Cc: Vince Weaver
    Cc: mathieu.poirier@linaro.org
    Cc: will.deacon@arm.com
    Fixes: bed5b25ad9c8a ("perf: Add a pmu capability for "exclusive" events")
    Link: https://lkml.kernel.org/r/20190701110755.24646-1-alexander.shishkin@linux.intel.com
    Signed-off-by: Ingo Molnar

    Alexander Shishkin
     

25 Jun, 2019

2 commits

  • Currently perf_rotate_context assumes that if the context's nr_events !=
    nr_active a rotation is necessary for perf event multiplexing. With
    cgroups, nr_events is the total count of events for all cgroups and
    nr_active will not include events in a cgroup other than the current
    task's. This makes rotation appear necessary for cgroups when it is not.

    Add a perf_event_context flag that is set when rotation is necessary.
    Clear the flag during sched_out and set it when a flexible sched_in
    fails due to resources.
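
    A hedged sketch of the flag protocol (condition name hypothetical):

        ctx->rotate_necessary = 0;          /* on sched_out */

        if (flexible_group_did_not_fit)     /* hypothetical condition */
                ctx->rotate_necessary = 1;  /* sched_in hit resource limit */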

    Signed-off-by: Ian Rogers
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Alexander Shishkin
    Cc: Arnaldo Carvalho de Melo
    Cc: Borislav Petkov
    Cc: Jiri Olsa
    Cc: Kan Liang
    Cc: Linus Torvalds
    Cc: Namhyung Kim
    Cc: Peter Zijlstra
    Cc: Stephane Eranian
    Cc: Thomas Gleixner
    Cc: Vince Weaver
    Link: https://lkml.kernel.org/r/20190601082722.44543-1-irogers@google.com
    Signed-off-by: Ingo Molnar

    Ian Rogers
     
  • The perf fuzzer caused a Skylake machine to crash:

    [ 9680.085831] Call Trace:
    [ 9680.088301] <IRQ>
    [ 9680.090363] perf_output_sample_regs+0x43/0xa0
    [ 9680.094928] perf_output_sample+0x3aa/0x7a0
    [ 9680.099181] perf_event_output_forward+0x53/0x80
    [ 9680.103917] __perf_event_overflow+0x52/0xf0
    [ 9680.108266] ? perf_trace_run_bpf_submit+0xc0/0xc0
    [ 9680.113108] perf_swevent_hrtimer+0xe2/0x150
    [ 9680.117475] ? check_preempt_wakeup+0x181/0x230
    [ 9680.122091] ? check_preempt_curr+0x62/0x90
    [ 9680.126361] ? ttwu_do_wakeup+0x19/0x140
    [ 9680.130355] ? try_to_wake_up+0x54/0x460
    [ 9680.134366] ? reweight_entity+0x15b/0x1a0
    [ 9680.138559] ? __queue_work+0x103/0x3f0
    [ 9680.142472] ? update_dl_rq_load_avg+0x1cd/0x270
    [ 9680.147194] ? timerqueue_del+0x1e/0x40
    [ 9680.151092] ? __remove_hrtimer+0x35/0x70
    [ 9680.155191] __hrtimer_run_queues+0x100/0x280
    [ 9680.159658] hrtimer_interrupt+0x100/0x220
    [ 9680.163835] smp_apic_timer_interrupt+0x6a/0x140
    [ 9680.168555] apic_timer_interrupt+0xf/0x20
    [ 9680.172756] </IRQ>

    The XMM registers can only be collected by PEBS hardware events on
    platforms with PEBS baseline support, e.g. Icelake; they cannot be
    collected by software/probe events.

    Add the capability flag PERF_PMU_CAP_EXTENDED_REGS to indicate PMUs
    which support extended registers. For x86, the extended registers are
    the XMM registers.

    Add has_extended_regs() to check if extended registers are requested.

    The generic code defines the mask of extended registers as 0 if the
    arch headers haven't overridden it.
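
    A hedged sketch matching that description:

        #ifndef PERF_REG_EXTENDED_MASK
        #define PERF_REG_EXTENDED_MASK 0  /* arch did not override */
        #endif

        static inline bool has_extended_regs(struct perf_event *event)
        {
                return (event->attr.sample_regs_user & PERF_REG_EXTENDED_MASK) ||
                       (event->attr.sample_regs_intr & PERF_REG_EXTENDED_MASK);
        }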

    Originally-by: Peter Zijlstra (Intel)
    Reported-by: Vince Weaver
    Signed-off-by: Kan Liang
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Alexander Shishkin
    Cc: Arnaldo Carvalho de Melo
    Cc: Jiri Olsa
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Stephane Eranian
    Cc: Thomas Gleixner
    Fixes: 878068ea270e ("perf/x86: Support outputting XMM registers")
    Link: https://lkml.kernel.org/r/1559081314-9714-1-git-send-email-kan.liang@linux.intel.com
    Signed-off-by: Ingo Molnar

    Kan Liang
     

03 Jun, 2019

1 commit

  • Add an attr_update attribute group into struct pmu, to allow having
    multiple attribute groups for the same group name.

    This will allow us to update the "events" or "format" directories
    with attributes that depend on various HW conditions.

    For example, a group_format_extra group that updates the "format"
    directory only if the pmu version is 2 or higher:

    static umode_t
    exra_is_visible(struct kobject *kobj, struct attribute *attr, int i)
    {
            return x86_pmu.version >= 2 ? attr->mode : 0;
    }

    static struct attribute_group group_format_extra = {
            .name       = "format",
            .is_visible = exra_is_visible,
    };

    Signed-off-by: Jiri Olsa
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Alexander Shishkin
    Cc: Arnaldo Carvalho de Melo
    Cc: Greg Kroah-Hartman
    Cc: Linus Torvalds
    Cc: Namhyung Kim
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Link: https://lkml.kernel.org/r/20190512155518.21468-3-jolsa@kernel.org
    Signed-off-by: Ingo Molnar

    Jiri Olsa
     

18 May, 2019

1 commit

  • Pull KVM updates from Paolo Bonzini:
    "ARM:
    - support for SVE and Pointer Authentication in guests
    - PMU improvements

    POWER:
    - support for direct access to the POWER9 XIVE interrupt controller
    - memory and performance optimizations

    x86:
    - support for accessing memory not backed by struct page
    - fixes and refactoring

    Generic:
    - dirty page tracking improvements"

    * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (155 commits)
    kvm: fix compilation on aarch64
    Revert "KVM: nVMX: Expose RDPMC-exiting only when guest supports PMU"
    kvm: x86: Fix L1TF mitigation for shadow MMU
    KVM: nVMX: Disable intercept for FS/GS base MSRs in vmcs02 when possible
    KVM: PPC: Book3S: Remove useless checks in 'release' method of KVM device
    KVM: PPC: Book3S HV: XIVE: Fix spelling mistake "acessing" -> "accessing"
    KVM: PPC: Book3S HV: Make sure to load LPID for radix VCPUs
    kvm: nVMX: Set nested_run_pending in vmx_set_nested_state after checks complete
    tests: kvm: Add tests for KVM_SET_NESTED_STATE
    KVM: nVMX: KVM_SET_NESTED_STATE - Tear down old EVMCS state before setting new state
    tests: kvm: Add tests for KVM_CAP_MAX_VCPUS and KVM_CAP_MAX_CPU_ID
    tests: kvm: Add tests to .gitignore
    KVM: Introduce KVM_CAP_MANUAL_DIRTY_LOG_PROTECT2
    KVM: Fix kvm_clear_dirty_log_protect off-by-(minus-)one
    KVM: Fix the bitmap range to copy during clear dirty
    KVM: arm64: Fix ptrauth ID register masking logic
    KVM: x86: use direct accessors for RIP and RSP
    KVM: VMX: Use accessors for GPRs outside of dedicated caching logic
    KVM: x86: Omit caching logic for always-available GPRs
    kvm, x86: Properly check whether a pfn is an MMIO or not
    ...

    Linus Torvalds
     

07 May, 2019

1 commit

  • Pull perf updates from Ingo Molnar:
    "The main kernel changes were:

    - add support for Intel's "adaptive PEBS v4" - which embeds LBR data
    in PEBS records and can thus batch up and reduce the IRQ (NMI) rate
    significantly - reducing overhead and making call-graph profiling
    less intrusive.

    - add Intel CPU core and uncore support updates for Tremont, Icelake,

    - extend the x86 PMU constraints scheduler with 'constraint ranges'
    to better support Icelake hw constraints,

    - make x86 call-chain support work better with CONFIG_FRAME_POINTER=y

    - misc other changes

    Tooling changes:

    - updates to the main tools: 'perf record', 'perf trace', 'perf
    stat'

    - updated Intel and S/390 vendor events

    - libtraceevent updates

    - misc other updates and fixes"

    * 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (69 commits)
    perf/x86: Make perf callchains work without CONFIG_FRAME_POINTER
    watchdog: Fix typo in comment
    perf/x86/intel: Add Tremont core PMU support
    perf/x86/intel/uncore: Add Intel Icelake uncore support
    perf/x86/msr: Add Icelake support
    perf/x86/intel/rapl: Add Icelake support
    perf/x86/intel/cstate: Add Icelake support
    perf/x86/intel: Add Icelake support
    perf/x86: Support constraint ranges
    perf/x86/lbr: Avoid reading the LBRs when adaptive PEBS handles them
    perf/x86/intel: Support adaptive PEBS v4
    perf/x86/intel/ds: Extract code of event update in short period
    perf/x86/intel: Extract memory code PEBS parser for reuse
    perf/x86: Support outputting XMM registers
    perf/x86/intel: Force resched when TFA sysctl is modified
    perf/core: Add perf_pmu_resched() as global function
    perf/headers: Fix stale comment for struct perf_addr_filter
    perf/core: Make perf_swevent_init_cpu() static
    perf/x86: Add sanity checks to x86_schedule_events()
    perf/x86: Optimize x86_schedule_events()
    ...

    Linus Torvalds
     

03 May, 2019

1 commit

  • Now that all AUX allocations are high-order by default, the software
    double buffering PMU capability doesn't make sense any more, get rid
    of it. In case some PMUs choose to opt out, we can re-introduce it.

    Signed-off-by: Alexander Shishkin
    Acked-by: Peter Zijlstra
    Cc: Arnaldo Carvalho de Melo
    Cc: Jiri Olsa
    Cc: Linus Torvalds
    Cc: Stephane Eranian
    Cc: Thomas Gleixner
    Cc: Vince Weaver
    Cc: adrian.hunter@intel.com
    Link: http://lkml.kernel.org/r/20190503085536.24119-3-alexander.shishkin@linux.intel.com
    Signed-off-by: Ingo Molnar

    Alexander Shishkin
     

01 May, 2019

1 commit