05 Sep, 2017

1 commit

  • Pull x86 cache quality monitoring update from Thomas Gleixner:
    "This update provides a complete rewrite of the Cache Quality
    Monitoring (CQM) facility.

    The existing CQM support was duct taped into perf with a lot of issues
    and the attempts to fix those turned out to be incomplete and
    horrible.

    After lengthy discussions it was decided to integrate the CQM support
    into the Resource Director Technology (RDT) facility, which is the
    obvious choice, as in hardware CQM is part of RDT. This allowed adding
    Memory Bandwidth Monitoring support on top.

    As a result the mechanisms for allocating cache/memory bandwidth and
    the corresponding monitoring mechanisms are integrated into a single
    management facility with a consistent user interface"

    * 'x86-cache-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (37 commits)
    x86/intel_rdt: Turn off most RDT features on Skylake
    x86/intel_rdt: Add command line options for resource director technology
    x86/intel_rdt: Move special case code for Haswell to a quirk function
    x86/intel_rdt: Remove redundant ternary operator on return
    x86/intel_rdt/cqm: Improve limbo list processing
    x86/intel_rdt/mbm: Fix MBM overflow handler during CPU hotplug
    x86/intel_rdt: Modify the intel_pqr_state for better performance
    x86/intel_rdt/cqm: Clear the default RMID during hotcpu
    x86/intel_rdt: Show bitmask of shareable resource with other executing units
    x86/intel_rdt/mbm: Handle counter overflow
    x86/intel_rdt/mbm: Add mbm counter initialization
    x86/intel_rdt/mbm: Basic counting of MBM events (total and local)
    x86/intel_rdt/cqm: Add CPU hotplug support
    x86/intel_rdt/cqm: Add sched_in support
    x86/intel_rdt: Introduce rdt_enable_key for scheduling
    x86/intel_rdt/cqm: Add mount,umount support
    x86/intel_rdt/cqm: Add rmdir support
    x86/intel_rdt: Separate the ctrl bits from rmdir
    x86/intel_rdt/cqm: Add mon_data
    x86/intel_rdt: Prepare for RDT monitor data support
    ...

    Linus Torvalds
     

29 Aug, 2017

3 commits

  • For understanding how the workload maps to memory channels and hardware
    behavior, it's very important to collect address maps with physical
    addresses. For example, 3D XPoint access can only be found by filtering
    the physical address.

    Add a new sample type for physical address.

    perf already has a facility to collect the data virtual address. This patch
    introduces a function to convert the virtual address to a physical address.
    The function is quite generic and can be extended to any architecture as
    long as a virtual address is provided.

    - For kernel direct mapping addresses, virt_to_phys is used to convert
    the virtual addresses to physical addresses.

    - For user virtual addresses, __get_user_pages_fast is used to walk the
    page tables for the user physical address.

    - This does not work for vmalloc addresses right now. These are not
    resolved, but code to do that could be added.

    The new sample type requires collecting the virtual address. The
    virtual address will not be output unless SAMPLE_ADDR is applied.

    For security, the physical address is only exposed to root or a
    privileged user.
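
    As an illustration (not part of the patch text), a minimal userspace sketch
    of requesting the new bit, which lands in the uapi as PERF_SAMPLE_PHYS_ADDR;
    a real user would pick a precise memory-access event rather than plain
    cycles, and error handling is omitted:

    #include <linux/perf_event.h>
    #include <sys/syscall.h>
    #include <string.h>
    #include <unistd.h>

    int open_phys_addr_sampling(void)
    {
            struct perf_event_attr attr;

            memset(&attr, 0, sizeof(attr));
            attr.size = sizeof(attr);
            attr.type = PERF_TYPE_HARDWARE;
            attr.config = PERF_COUNT_HW_CPU_CYCLES;
            attr.sample_period = 100000;
            /* SAMPLE_ADDR must be set as well, as noted above */
            attr.sample_type = PERF_SAMPLE_IP | PERF_SAMPLE_ADDR |
                               PERF_SAMPLE_PHYS_ADDR;

            /* needs root or a privileged user to see real physical addresses */
            return syscall(__NR_perf_event_open, &attr, 0, -1, -1, 0);
    }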

    Tested-by: Madhavan Srinivasan
    Signed-off-by: Kan Liang
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Alexander Shishkin
    Cc: Arnaldo Carvalho de Melo
    Cc: Jiri Olsa
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Stephane Eranian
    Cc: Thomas Gleixner
    Cc: Vince Weaver
    Cc: acme@kernel.org
    Cc: mpe@ellerman.id.au
    Link: http://lkml.kernel.org/r/1503967969-48278-1-git-send-email-kan.liang@intel.com
    Signed-off-by: Ingo Molnar

    Kan Liang
     
  • I just noticed that hw.itrace_started and hw.config are aliased to the
    same location. Now, the PT driver happens to use both, which works out
    fine by sheer luck:

    - STORE(hw.itrace_started) is ordered before STORE(hw.config) in
    program order, although there are no compiler barriers to ensure that,

    - to perf_log_itrace_start(), hw.itrace_started appears to be set at the
    time it is intended to be set, because both stores happen on the same
    path,

    - hw.config is never reset to zero in the PT driver.

    Now, the use of hw.config by the PT driver makes more sense (it being a
    HW PMU) than messing around with itrace_started, which is an awkward API
    to begin with.

    This patch replaces hw.itrace_started with an attach_state bit and an
    API call for the PMU drivers to use to communicate the condition.

    Signed-off-by: Alexander Shishkin
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Arnaldo Carvalho de Melo
    Cc: Arnaldo Carvalho de Melo
    Cc: Jiri Olsa
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Stephane Eranian
    Cc: Thomas Gleixner
    Cc: Vince Weaver
    Cc: vince@deater.net
    Link: http://lkml.kernel.org/r/20170330153956.25994-1-alexander.shishkin@linux.intel.com
    Signed-off-by: Ingo Molnar

    Alexander Shishkin
     
  • When running perf on the ftrace:function tracepoint, there is a bug
    which can be reproduced by:

    perf record -e ftrace:function -a sleep 20 &
    perf record -e ftrace:function ls
    perf script

    ls 10304 [005] 171.853235: ftrace:function:
    perf_output_begin
    ls 10304 [005] 171.853237: ftrace:function:
    perf_output_begin
    ls 10304 [005] 171.853239: ftrace:function:
    task_tgid_nr_ns
    ls 10304 [005] 171.853240: ftrace:function:
    task_tgid_nr_ns
    ls 10304 [005] 171.853242: ftrace:function:
    __task_pid_nr_ns
    ls 10304 [005] 171.853244: ftrace:function:
    __task_pid_nr_ns

    We can see that all the function traces are doubled.

    The problem is caused by an inconsistency between the register
    function perf_ftrace_event_register() and the probe function
    perf_ftrace_function_call(). The former registers one probe
    for every perf_event, while the latter handles all perf_events
    on the current CPU. So when there are two perf_events on the
    current CPU, their traces are doubled.

    So this patch adds an extra "event" parameter to perf_tp_event():
    when it is not NULL, sample data is sent only to that event.

    Signed-off-by: Zhou Chengming
    Reviewed-by: Jiri Olsa
    Acked-by: Steven Rostedt (VMware)
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: acme@kernel.org
    Cc: alexander.shishkin@linux.intel.com
    Cc: huawei.libin@huawei.com
    Link: http://lkml.kernel.org/r/1503668977-12526-1-git-send-email-zhouchengming1@huawei.com
    Signed-off-by: Ingo Molnar

    Zhou Chengming
     

10 Aug, 2017

1 commit

  • Vince reported the following rdpmc() testcase failure:

    > Failing test case:
    >
    > fd=perf_event_open();
    > addr=mmap(fd);
    > exec() // without closing or unmapping the event
    > fd=perf_event_open();
    > addr=mmap(fd);
    > rdpmc() // GPFs due to rdpmc being disabled

    The problem is of course that exec() plays tricks with what is
    current->mm, only destroying the old mappings after having
    installed the new mm.

    Fix this confusion by passing along vma->vm_mm instead of relying on
    current->mm.
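
    For reference, a rough sketch of the rdpmc-via-mmap pattern the failing test
    exercises (this is only the basic pattern, not the full exec() reproducer;
    error handling omitted):

    #include <linux/perf_event.h>
    #include <sys/mman.h>
    #include <sys/syscall.h>
    #include <string.h>
    #include <unistd.h>

    int main(void)
    {
            struct perf_event_attr attr;
            struct perf_event_mmap_page *pc;
            int fd;

            memset(&attr, 0, sizeof(attr));
            attr.size = sizeof(attr);
            attr.type = PERF_TYPE_HARDWARE;
            attr.config = PERF_COUNT_HW_INSTRUCTIONS;

            fd = syscall(__NR_perf_event_open, &attr, 0, -1, -1, 0);
            pc = mmap(NULL, sysconf(_SC_PAGESIZE), PROT_READ, MAP_SHARED, fd, 0);

            /* Before the fix, repeating the open+mmap after an exec() (without
             * closing or unmapping the old event) left rdpmc disabled for the
             * new mm, so the read below faulted. */
            if (pc->cap_user_rdpmc && pc->index)
                    (void)__builtin_ia32_rdpmc(pc->index - 1);

            return 0;
    }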

    Reported-by: Vince Weaver
    Tested-by: Vince Weaver
    Signed-off-by: Peter Zijlstra (Intel)
    Reviewed-by: Andy Lutomirski
    Cc: Arnaldo Carvalho de Melo
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Stephane Eranian
    Cc: Thomas Gleixner
    Cc: stable@vger.kernel.org
    Fixes: 1e0fb9ec679c ("perf: Add pmu callbacks to track event mapping and unmapping")
    Link: http://lkml.kernel.org/r/20170802173930.cstykcqefmqt7jau@hirez.programming.kicks-ass.net
    [ Minor cleanups. ]
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     

02 Aug, 2017

1 commit

  • 'perf cqm' never worked due to the incompatibility between perf
    infrastructure and cqm hardware support. The hardware uses RMIDs to
    track the llc occupancy of tasks and these RMIDs are per package. This
    makes monitoring a hierarchy like cgroup along with monitoring of tasks
    separately difficult and several patches sent to lkml to fix them were
    NACKed. Furthermore, the following issues in the current perf cqm make
    it almost unusable:

    1. No support to monitor the same group of tasks for which we do
    allocation using resctrl.

    2. It gives random and inaccurate data (mostly 0s) once we run out
    of RMIDs due to issues in Recycling.

    3. Recycling results in inaccuracy of data because we cannot
    guarantee that the RMID was stolen from a task when it was not
    pulling data into cache or even when it pulled the least data. Also
    for monitoring llc_occupancy, if we stop using an RMID_x and then
    start using an RMID_y after we reclaim an RMID from another event,
    we miss accounting all the occupancy that was tagged to RMID_x at a
    later perf_count.

    4. Recycling code makes the monitoring code complex, including
    scheduling, because the event can lose its RMID any time. Since MBM
    counters count bandwidth for a period of time by taking a snapshot of
    total bytes at two different times, recycling complicates the way we
    count MBM in a hierarchy. Also we need a spin lock while we do the
    processing to account for MBM counter overflow. We also currently
    use a spin lock in scheduling to prevent the RMID from being taken
    away.

    5. Lack of support when we run different kinds of events (task,
    system-wide and cgroup events) together. Data mostly prints 0s. This
    is also because we can have only one RMID tied to a CPU as defined
    by the cqm hardware, but perf can tie multiple events to that CPU
    during one sched_in.

    6. No support for monitoring a group of tasks. There is partial support
    for cgroups, but it does not work once there is a hierarchy of cgroups
    or if we want to monitor a task in a cgroup and the cgroup itself.

    7. No support for monitoring tasks for their lifetime without perf
    overhead.

    8. It reported the aggregate cache occupancy or memory bandwidth over
    all sockets. But most cloud and VMM based use cases want to know the
    individual per-socket usage.

    Signed-off-by: Vikas Shivappa
    Signed-off-by: Thomas Gleixner
    Cc: ravi.v.shankar@intel.com
    Cc: tony.luck@intel.com
    Cc: fenghua.yu@intel.com
    Cc: peterz@infradead.org
    Cc: eranian@google.com
    Cc: vikas.shivappa@intel.com
    Cc: ak@linux.intel.com
    Cc: davidcc@google.com
    Cc: reinette.chatre@intel.com
    Link: http://lkml.kernel.org/r/1501017287-28083-2-git-send-email-vikas.shivappa@linux.intel.com

    Vikas Shivappa
     

06 Jul, 2017

1 commit

  • Pull networking updates from David Miller:
    "Reasonably busy this cycle, but perhaps not as busy as in the 4.12
    merge window:

    1) Several optimizations for UDP processing under high load from
    Paolo Abeni.

    2) Support pacing internally in TCP when using the sch_fq packet
    scheduler for this is not practical. From Eric Dumazet.

    3) Support multiple filter chains per qdisc, from Jiri Pirko.

    4) Move to 1ms TCP timestamp clock, from Eric Dumazet.

    5) Add batch dequeueing to vhost_net, from Jason Wang.

    6) Flesh out more completely SCTP checksum offload support, from
    Davide Caratti.

    7) More plumbing of extended netlink ACKs, from David Ahern, Pablo
    Neira Ayuso, and Matthias Schiffer.

    8) Add devlink support to nfp driver, from Simon Horman.

    9) Add RTM_F_FIB_MATCH flag to RTM_GETROUTE queries, from Roopa
    Prabhu.

    10) Add stack depth tracking to BPF verifier and use this information
    in the various eBPF JITs. From Alexei Starovoitov.

    11) Support XDP on qed device VFs, from Yuval Mintz.

    12) Introduce BPF PROG ID for better introspection of installed BPF
    programs. From Martin KaFai Lau.

    13) Add bpf_set_hash helper for TC bpf programs, from Daniel Borkmann.

    14) For loads, allow narrower accesses in bpf verifier checking, from
    Yonghong Song.

    15) Support MIPS in the BPF selftests and samples infrastructure, the
    MIPS eBPF JIT will be merged in via the MIPS GIT tree. From David
    Daney.

    16) Support kernel based TLS, from Dave Watson and others.

    17) Remove completely DST garbage collection, from Wei Wang.

    18) Allow installing TCP MD5 rules using prefixes, from Ivan
    Delalande.

    19) Add XDP support to Intel i40e driver, from Björn Töpel

    20) Add support for TC flower offload in nfp driver, from Simon
    Horman, Pieter Jansen van Vuuren, Benjamin LaHaise, Jakub
    Kicinski, and Bert van Leeuwen.

    21) IPSEC offloading support in mlx5, from Ilan Tayari.

    22) Add HW PTP support to macb driver, from Rafal Ozieblo.

    23) Networking refcount_t conversions, From Elena Reshetova.

    24) Add sock_ops support to BPF, from Lawrence Brakmo. This is useful
    for tuning the TCP sockopt settings of a group of applications,
    currently via CGROUPs"

    * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1899 commits)
    net: phy: dp83867: add workaround for incorrect RX_CTRL pin strap
    dt-bindings: phy: dp83867: provide a workaround for incorrect RX_CTRL pin strap
    cxgb4: Support for get_ts_info ethtool method
    cxgb4: Add PTP Hardware Clock (PHC) support
    cxgb4: time stamping interface for PTP
    nfp: default to chained metadata prepend format
    nfp: remove legacy MAC address lookup
    nfp: improve order of interfaces in breakout mode
    net: macb: remove extraneous return when MACB_EXT_DESC is defined
    bpf: add missing break in for the TCP_BPF_SNDCWND_CLAMP case
    bpf: fix return in load_bpf_file
    mpls: fix rtm policy in mpls_getroute
    net, ax25: convert ax25_cb.refcount from atomic_t to refcount_t
    net, ax25: convert ax25_route.refcount from atomic_t to refcount_t
    net, ax25: convert ax25_uid_assoc.refcount from atomic_t to refcount_t
    net, sctp: convert sctp_ep_common.refcnt from atomic_t to refcount_t
    net, sctp: convert sctp_transport.refcnt from atomic_t to refcount_t
    net, sctp: convert sctp_chunk.refcnt from atomic_t to refcount_t
    net, sctp: convert sctp_datamsg.refcnt from atomic_t to refcount_t
    net, sctp: convert sctp_auth_bytes.refcnt from atomic_t to refcount_t
    ...

    Linus Torvalds
     

05 Jun, 2017

1 commit

  • Allow BPF_PROG_TYPE_PERF_EVENT program types to attach to all
    perf_event types, including HW_CACHE, RAW, and dynamic pmu events.
    Only tracepoint/kprobe events are treated differently which require
    BPF_PROG_TYPE_TRACEPOINT/BPF_PROG_TYPE_KPROBE program types accordingly.

    Also add support for reading all event counters using the
    bpf_perf_event_read() helper.
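
    A sketch of the counter-reading side, written in the style of the
    samples/bpf programs of that era (the map name, sizes and attach point are
    illustrative; assumes the samples' bpf_helpers.h):

    #include <uapi/linux/bpf.h>
    #include "bpf_helpers.h"

    struct bpf_map_def SEC("maps") counters = {
            .type = BPF_MAP_TYPE_PERF_EVENT_ARRAY,
            .key_size = sizeof(int),
            .value_size = sizeof(__u32),
            .max_entries = 64,      /* one slot per CPU, filled from user space */
    };

    SEC("kprobe/sys_write")
    int read_counter(struct pt_regs *ctx)
    {
            __u32 cpu = bpf_get_smp_processor_id();
            __u64 count;

            /* reads whatever event user space stored at index 'cpu'; with this
             * change it may be a HW_CACHE, RAW or dynamic-PMU event */
            count = bpf_perf_event_read(&counters, cpu);

            /* a real program would aggregate 'count' into another map */
            return 0;
    }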

    Signed-off-by: Alexei Starovoitov
    Signed-off-by: David S. Miller

    Alexei Starovoitov
     

26 May, 2017

1 commit

  • perf, tracing, kprobes and jump_labels have a gazillion of ways to create
    dependency lock chains. Some of those involve nested invocations of
    get_online_cpus().

    The conversion of the hotplug locking to a percpu rwsem requires avoiding
    such nested calls. sys_perf_event_open() protects most of the syscall logic
    against cpu hotplug. This causes nested calls and lock inversions versus
    ftrace and kprobes in various interesting ways.

    It's impossible to move the hotplug locking to the outer end of all call
    chains in the involved facilities, so the hotplug protection in
    sys_perf_event_open() needs to be solved differently.

    Introduce 'pmus_mutex' which protects a perf private online cpumask. This
    mutex is taken when the mask is updated in the cpu hotplug callbacks and
    can be taken in sys_perf_event_open() to protect the swhash setup/teardown
    code and when the final judgement about a valid event has to be made.

    [ tglx: Produced changelog and fixed the swhash interaction ]

    Signed-off-by: Peter Zijlstra (Intel)
    Signed-off-by: Thomas Gleixner
    Acked-by: Ingo Molnar
    Cc: Paul E. McKenney
    Cc: Sebastian Siewior
    Cc: Steven Rostedt
    Cc: Mathieu Desnoyers
    Cc: Masami Hiramatsu
    Link: http://lkml.kernel.org/r/20170524081548.930941109@linutronix.de

    Thomas Gleixner
     

30 Mar, 2017

1 commit

  • Current AMD IOMMU perf PMU inappropriately uses the hardware struct
    inside the union in struct hw_perf_event, extra_reg in particular.

    Instead, introduce an AMD IOMMU-specific struct with required parameters
    to be programmed into the IOMMU performance counter control register.

    Update the pasid field from 16 to 20 bits while at it.

    Signed-off-by: Suravee Suthikulpanit
    [ Fixup macros, shorten get_next_avail_iommu_bnk_cntr() local vars, massage commit message. ]
    Signed-off-by: Borislav Petkov
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Alexander Shishkin
    Cc: Arnaldo Carvalho de Melo
    Cc: Jiri Olsa
    Cc: Jörg Rödel
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Stephane Eranian
    Cc: Thomas Gleixner
    Cc: Vince Weaver
    Cc: iommu@lists.linux-foundation.org
    Link: http://lkml.kernel.org/r/1487926102-13073-10-git-send-email-Suravee.Suthikulpanit@amd.com
    Signed-off-by: Ingo Molnar

    Suravee Suthikulpanit
     

16 Mar, 2017

1 commit

  • In preparation for adding more flags to perf AUX records, introduce a
    separate API for setting the flags for a session, rather than appending
    more bool arguments to perf_aux_output_end. This allows setting each
    flag at the time a corresponding condition is detected, instead of
    tracking it in each driver's private state.

    Signed-off-by: Will Deacon
    Signed-off-by: Alexander Shishkin
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Arnaldo Carvalho de Melo
    Cc: Arnaldo Carvalho de Melo
    Cc: Jiri Olsa
    Cc: Linus Torvalds
    Cc: Mathieu Poirier
    Cc: Peter Zijlstra
    Cc: Stephane Eranian
    Cc: Thomas Gleixner
    Cc: Vince Weaver
    Cc: vince@deater.net
    Link: http://lkml.kernel.org/r/20170220133352.17995-3-alexander.shishkin@linux.intel.com
    Signed-off-by: Ingo Molnar

    Will Deacon
     

14 Mar, 2017

1 commit

  • With the advent of container technologies like Docker, which depend on
    namespaces for isolation, there is a need for tracing support for
    namespaces. This patch introduces a new PERF_RECORD_NAMESPACES event for
    recording namespace-related info. By recording info for every
    namespace, it is left to userspace to decide what constitutes a
    container and to trace containers by updating the perf tool accordingly.

    Each namespace is identified by a combination of device and inode numbers.
    Though every namespace has the same device number currently, that may
    change in the future to avoid the need for a namespace of namespaces.
    Considering that possibility, record both device and inode numbers
    separately for each namespace.
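
    For orientation (not part of the patch): the (device, inode) pair for a
    namespace is what stat() reports for the /proc/<pid>/ns/* links, e.g.:

    #include <stdio.h>
    #include <sys/stat.h>

    int main(void)
    {
            struct stat st;

            if (stat("/proc/self/ns/pid", &st) == 0)
                    printf("pid namespace: dev=%lu ino=%lu\n",
                           (unsigned long)st.st_dev, (unsigned long)st.st_ino);
            return 0;
    }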

    Signed-off-by: Hari Bathini
    Acked-by: Jiri Olsa
    Acked-by: Peter Zijlstra
    Cc: Alexander Shishkin
    Cc: Alexei Starovoitov
    Cc: Ananth N Mavinakayanahalli
    Cc: Aravinda Prasad
    Cc: Brendan Gregg
    Cc: Daniel Borkmann
    Cc: Eric Biederman
    Cc: Sargun Dhillon
    Cc: Steven Rostedt
    Link: http://lkml.kernel.org/r/148891929686.25309.2827618988917007768.stgit@hbathini.in.ibm.com
    Signed-off-by: Arnaldo Carvalho de Melo

    Hari Bathini
     

10 Feb, 2017

1 commit

  • While supporting file-based address filters for CPU events requires some
    extra context switch handling, kernel address filters are easy, since the
    kernel mapping is preserved across address spaces. They are also useful, as
    they permit tracing the scheduling paths of the kernel.

    This patch allows setting up kernel filters for CPU events.

    Signed-off-by: Alexander Shishkin
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Arnaldo Carvalho de Melo
    Cc: Arnaldo Carvalho de Melo
    Cc: Jiri Olsa
    Cc: Linus Torvalds
    Cc: Mark Rutland
    Cc: Mathieu Poirier
    Cc: Peter Zijlstra
    Cc: Stephane Eranian
    Cc: Thomas Gleixner
    Cc: Vince Weaver
    Cc: Will Deacon
    Cc: vince@deater.net
    Link: http://lkml.kernel.org/r/20170126094057.13805-4-alexander.shishkin@linux.intel.com
    Signed-off-by: Ingo Molnar

    Alexander Shishkin
     

30 Jan, 2017

2 commits

  • cpuctx->unique_pmu was originally introduced as a way to identify cpuctxs
    with shared pmus in order to avoid visiting the same cpuctx more than once
    in a for_each_pmu loop.

    cpuctx->unique_pmu == cpuctx->pmu in non-software task contexts since they
    have only one pmu per cpuctx. Since perf_pmu_sched_task() is only called in
    hw contexts, this patch replaces cpuctx->unique_pmu by cpuctx->pmu in it.

    The change above, together with the previous patch in this series, removed
    the remaining uses of cpuctx->unique_pmu, so we remove it altogether.

    Signed-off-by: David Carrillo-Cisneros
    Signed-off-by: Peter Zijlstra (Intel)
    Acked-by: Mark Rutland
    Cc: Alexander Shishkin
    Cc: Arnaldo Carvalho de Melo
    Cc: Arnaldo Carvalho de Melo
    Cc: Borislav Petkov
    Cc: Dave Hansen
    Cc: Jiri Olsa
    Cc: Kan Liang
    Cc: Linus Torvalds
    Cc: Paul Turner
    Cc: Peter Zijlstra
    Cc: Srinivas Pandruvada
    Cc: Stephane Eranian
    Cc: Thomas Gleixner
    Cc: Vikas Shivappa
    Cc: Vince Weaver
    Cc: Vince Weaver
    Link: http://lkml.kernel.org/r/20170118192454.58008-3-davidcc@google.com
    Signed-off-by: Ingo Molnar

    David Carrillo-Cisneros
     
  • This patch follows from a conversation in CQM/CMT's last series about
    speeding up the context switch for cgroup events:

    https://patchwork.kernel.org/patch/9478617/

    This is a low-hanging fruit optimization. It replaces the iteration over
    the "pmus" list in cgroup switch by an iteration over a new list that
    contains only cpuctxs with at least one cgroup event.

    This is necessary because the number of PMUs has increased over the years,
    e.g. modern x86 server systems have well above 50 PMUs.

    The iteration over the full PMU list is unnecessary and can be costly in
    heavy cache contention scenarios.

    Below are some instrumentation measurements with the 10th, 50th and 90th
    percentiles of the total cost of context switch before and after this
    optimization for a simple array read/write microbenchmark.

    Contention       Nr event   Before (us)           After (us)            Median
    L2     L3        types      (10%, 50%, 90%)       (10%, 50%, 90%)       Speedup
    --------------------------------------------------------------------------
    Low    Low       1          (1.72, 2.42, 5.85)    (1.35, 1.64, 5.46)    29%
    High   Low       1          (2.08, 4.56, 19.8)    (1720, 2.20, 13.7)    51%
    High   High      1          (2.86, 10.4, 12.7)    (2.54, 4.32, 12.1)    58%

    Low    Low       2          (1.98, 3.20, 6.89)    (1.68, 2.41, 8.89)    24%
    High   Low       2          (2.48, 5.28, 22.4)    (2150, 3.69, 14.6)    30%
    High   High      2          (3.32, 8.09, 13.9)    (2.80, 5.15, 13.7)    36%

    where:

    1 event type  = cycles
    2 event types = cycles, intel_cqm/llc_occupancy/

    Contention L2: Low  = workset < L2 cache size
                   High = workset >> L2 cache size
    Contention L3: Low  = workset of task on all sockets < L3 cache size
                   High = workset of task on all sockets >> L3 cache size

    Median Speedup = (50%ile Before - 50%ile After) / 50%ile Before

    Unsurprisingly, the benefits of this optimization decrease with the number
    of cpuctxs with cgroup events, yet it is never detrimental.

    Tested-by: Mark Rutland
    Signed-off-by: David Carrillo-Cisneros
    Signed-off-by: Peter Zijlstra (Intel)
    Acked-by: Mark Rutland
    Cc: Alexander Shishkin
    Cc: Arnaldo Carvalho de Melo
    Cc: Arnaldo Carvalho de Melo
    Cc: Borislav Petkov
    Cc: Dave Hansen
    Cc: Jiri Olsa
    Cc: Kan Liang
    Cc: Linus Torvalds
    Cc: Paul Turner
    Cc: Peter Zijlstra
    Cc: Srinivas Pandruvada
    Cc: Stephane Eranian
    Cc: Thomas Gleixner
    Cc: Vikas Shivappa
    Cc: Vince Weaver
    Cc: Vince Weaver
    Link: http://lkml.kernel.org/r/20170118192454.58008-2-davidcc@google.com
    Signed-off-by: Ingo Molnar

    David Carrillo-Cisneros
     

14 Jan, 2017

1 commit

  • It's possible to set up PEBS events to get only errors and not
    any data, like on SNB-X (model 45) and IVB-EP (model 62)
    via 2 perf commands running simultaneously:

    taskset -c 1 ./perf record -c 4 -e branches:pp -j any -C 10

    This leads to a soft lockup, because the error path of
    intel_pmu_drain_pebs_nhm() does not account event->hw.interrupts
    for error PEBS interrupts, so in case you're getting ONLY
    errors there is no way to stop the event when it goes over
    the max_samples_per_tick limit:

    NMI watchdog: BUG: soft lockup - CPU#22 stuck for 22s! [perf_fuzzer:5816]
    ...
    RIP: 0010:[] [] smp_call_function_single+0xe2/0x140
    ...
    Call Trace:
    ? trace_hardirqs_on_caller+0xf5/0x1b0
    ? perf_cgroup_attach+0x70/0x70
    perf_install_in_context+0x199/0x1b0
    ? ctx_resched+0x90/0x90
    SYSC_perf_event_open+0x641/0xf90
    SyS_perf_event_open+0x9/0x10
    do_syscall_64+0x6c/0x1f0
    entry_SYSCALL64_slow_path+0x25/0x25

    Add perf_event_account_interrupt() which does the interrupt
    and frequency checks and call it from intel_pmu_drain_pebs_nhm()'s
    error path.

    We keep the pending_kill and pending_wakeup logic only in the
    __perf_event_overflow() path, because they make sense only if
    there's any data to deliver.
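
    Roughly, the shape of the change in the drain path looks like this (a
    sketch of the idea, not the exact upstream diff):

    /* intel_pmu_drain_pebs_nhm() error path, sketched */
    if (error[bit]) {
            perf_log_lost_samples(event, error[bit]);

            /* account the interrupt so throttling can kick in even though
             * the PEBS record carried no usable data */
            if (perf_event_account_interrupt(event))
                    x86_pmu_stop(event, 0);
    }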

    Signed-off-by: Jiri Olsa
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Alexander Shishkin
    Cc: Arnaldo Carvalho de Melo
    Cc: Arnaldo Carvalho de Melo
    Cc: Jiri Olsa
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Stephane Eranian
    Cc: Thomas Gleixner
    Cc: Vince Weaver
    Cc: Vince Weaver
    Link: http://lkml.kernel.org/r/1482931866-6018-2-git-send-email-jolsa@kernel.org
    Signed-off-by: Ingo Molnar

    Jiri Olsa
     

28 Oct, 2016

1 commit

  • The trinity syscall fuzzer triggered following WARN() on powerpc:

    WARNING: CPU: 9 PID: 2998 at arch/powerpc/kernel/hw_breakpoint.c:278
    ...
    NIP [c00000000093aedc] .hw_breakpoint_handler+0x28c/0x2b0
    LR [c00000000093aed8] .hw_breakpoint_handler+0x288/0x2b0
    Call Trace:
    [c0000002f7933580] [c00000000093aed8] .hw_breakpoint_handler+0x288/0x2b0 (unreliable)
    [c0000002f7933630] [c0000000000f671c] .notifier_call_chain+0x7c/0xf0
    [c0000002f79336d0] [c0000000000f6abc] .__atomic_notifier_call_chain+0xbc/0x1c0
    [c0000002f7933780] [c0000000000f6c40] .notify_die+0x70/0xd0
    [c0000002f7933820] [c00000000001a74c] .do_break+0x4c/0x100
    [c0000002f7933920] [c0000000000089fc] handle_dabr_fault+0x14/0x48

    Followed by a lockdep warning:

    ===============================
    [ INFO: suspicious RCU usage. ]
    4.8.0-rc5+ #7 Tainted: G W
    -------------------------------
    ./include/linux/rcupdate.h:556 Illegal context switch in RCU read-side critical section!

    other info that might help us debug this:

    rcu_scheduler_active = 1, debug_locks = 0
    2 locks held by ls/2998:
    #0: (rcu_read_lock){......}, at: [] .__atomic_notifier_call_chain+0x0/0x1c0
    #1: (rcu_read_lock){......}, at: [] .hw_breakpoint_handler+0x0/0x2b0

    stack backtrace:
    CPU: 9 PID: 2998 Comm: ls Tainted: G W 4.8.0-rc5+ #7
    Call Trace:
    [c0000002f7933150] [c00000000094b1f8] .dump_stack+0xe0/0x14c (unreliable)
    [c0000002f79331e0] [c00000000013c468] .lockdep_rcu_suspicious+0x138/0x180
    [c0000002f7933270] [c0000000001005d8] .___might_sleep+0x278/0x2e0
    [c0000002f7933300] [c000000000935584] .mutex_lock_nested+0x64/0x5a0
    [c0000002f7933410] [c00000000023084c] .perf_event_ctx_lock_nested+0x16c/0x380
    [c0000002f7933500] [c000000000230a80] .perf_event_disable+0x20/0x60
    [c0000002f7933580] [c00000000093aeec] .hw_breakpoint_handler+0x29c/0x2b0
    [c0000002f7933630] [c0000000000f671c] .notifier_call_chain+0x7c/0xf0
    [c0000002f79336d0] [c0000000000f6abc] .__atomic_notifier_call_chain+0xbc/0x1c0
    [c0000002f7933780] [c0000000000f6c40] .notify_die+0x70/0xd0
    [c0000002f7933820] [c00000000001a74c] .do_break+0x4c/0x100
    [c0000002f7933920] [c0000000000089fc] handle_dabr_fault+0x14/0x48

    While it looks like the first WARN() is probably valid, the other one is
    triggered by disabling the event via perf_event_disable() from atomic context.

    The event is disabled here in case we were not able to emulate
    the instruction that hit the breakpoint. By disabling the event
    we unschedule the event and make sure it's not scheduled back.

    But we can't call perf_event_disable() from atomic context; instead
    we need to use the event's pending_disable irq_work method to disable it.
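
    The helper that implements this is tiny; a sketch of the mechanism (modulo
    the exact upstream naming and details):

    void perf_event_disable_inatomic(struct perf_event *event)
    {
            /* defer the ctx-lock-taking disable work to irq_work context */
            event->pending_disable = 1;
            irq_work_queue(&event->pending);
    }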

    Reported-by: Jan Stancek
    Signed-off-by: Jiri Olsa
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Alexander Shishkin
    Cc: Arnaldo Carvalho de Melo
    Cc: Huang Ying
    Cc: Jiri Olsa
    Cc: Linus Torvalds
    Cc: Michael Neuling
    Cc: Paul Mackerras
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Link: http://lkml.kernel.org/r/20161026094824.GA21397@krava
    Signed-off-by: Ingo Molnar

    Jiri Olsa
     

06 Oct, 2016

1 commit

  • Pull networking updates from David Miller:

    1) BBR TCP congestion control, from Neal Cardwell, Yuchung Cheng and
    co. at Google. https://lwn.net/Articles/701165/

    2) Do TCP Small Queues for retransmits, from Eric Dumazet.

    3) Support collect_md mode for all IPV4 and IPV6 tunnels, from Alexei
    Starovoitov.

    4) Allow cls_flower to classify packets in ip tunnels, from Amir Vadai.

    5) Support DSA tagging in older mv88e6xxx switches, from Andrew Lunn.

    6) Support GMAC protocol in iwlwifi mwm, from Ayala Beker.

    7) Support ndo_poll_controller in mlx5, from Calvin Owens.

    8) Move VRF processing to an output hook and allow l3mdev to be
    loopback, from David Ahern.

    9) Support SOCK_DESTROY for UDP sockets. Also from David Ahern.

    10) Congestion control in RXRPC, from David Howells.

    11) Support geneve RX offload in ixgbe, from Emil Tantilov.

    12) When hitting pressure for new incoming TCP data SKBs, perform a
    partial rather than a full purge of the OFO queue (which could be
    huge). From Eric Dumazet.

    13) Convert XFRM state and policy lookups to RCU, from Florian Westphal.

    14) Support RX network flow classification to igb, from Gangfeng Huang.

    15) Hardware offloading of eBPF in nfp driver, from Jakub Kicinski.

    16) New skbmod packet action, from Jamal Hadi Salim.

    17) Remove some inefficiencies in snmp proc output, from Jia He.

    18) Add FIB notifications to properly propagate route changes to
    hardware which is doing forwarding offloading. From Jiri Pirko.

    19) New dsa driver for qca8xxx chips, from John Crispin.

    20) Implement RFC7559 ipv6 router solicitation backoff, from Maciej
    Żenczykowski.

    21) Add L3 mode to ipvlan, from Mahesh Bandewar.

    22) Support 802.1ad in mlx4, from Moshe Shemesh.

    23) Support hardware LRO in mediatek driver, from Nelson Chang.

    24) Add TC offloading to mlx5, from Or Gerlitz.

    25) Convert various drivers to ethtool ksettings interfaces, from
    Philippe Reynes.

    26) TX max rate limiting for cxgb4, from Rahul Lakkireddy.

    27) NAPI support for ath10k, from Rajkumar Manoharan.

    28) Support XDP in mlx5, from Rana Shahout and Saeed Mahameed.

    29) UDP replicast support in TIPC, from Richard Alpe.

    30) Per-queue statistics for qed driver, from Sudarsana Reddy Kalluru.

    31) Support BQL in thunderx driver, from Sunil Goutham.

    32) TSO support in alx driver, from Tobias Regnery.

    33) Add stream parser engine and use it in kcm.

    34) Support async DHCP replies in ipconfig module, from Uwe
    Kleine-König.

    35) DSA port fast aging for mv88e6xxx driver, from Vivien Didelot.

    * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1715 commits)
    mlxsw: switchx2: Fix misuse of hard_header_len
    mlxsw: spectrum: Fix misuse of hard_header_len
    net/faraday: Stop NCSI device on shutdown
    net/ncsi: Introduce ncsi_stop_dev()
    net/ncsi: Rework the channel monitoring
    net/ncsi: Allow to extend NCSI request properties
    net/ncsi: Rework request index allocation
    net/ncsi: Don't probe on the reserved channel ID (0x1f)
    net/ncsi: Introduce NCSI_RESERVED_CHANNEL
    net/ncsi: Avoid unused-value build warning from ia64-linux-gcc
    net: Add netdev all_adj_list refcnt propagation to fix panic
    net: phy: Add Edge-rate driver for Microsemi PHYs.
    vmxnet3: Wake queue from reset work
    i40e: avoid NULL pointer dereference and recursive errors on early PCI error
    qed: Add RoCE ll2 & GSI support
    qed: Add support for memory registeration verbs
    qed: Add support for QP verbs
    qed: PD,PKEY and CQ verb support
    qed: Add support for RoCE hw init
    qede: Add qedr framework
    ...

    Linus Torvalds
     

03 Sep, 2016

2 commits

  • Allow attaching BPF_PROG_TYPE_PERF_EVENT programs to sw and hw perf events
    via overflow_handler mechanism.
    When a program is attached, the overflow handlers become stacked.
    The program acts as a filter.
    Returning zero from the program means that the normal perf_event_output handler
    will not be called and the sampling event won't be stored in the ring buffer.

    The overflow_handler_context == NULL condition is an additional safety check
    to make sure programs are not attached to hw breakpoints and the watchdog,
    in case other checks (that prevent that now anyway) get accidentally
    relaxed in the future.

    The program refcnt is incremented in case perf_events are inherited
    when the target task is forked.
    Similar to kprobe and tracepoint programs, there is no ioctl to
    detach the program or swap an already attached program. User space
    is expected to close(perf_event_fd) like it does right now for kprobe+bpf.
    That restriction simplifies the code quite a bit.

    The invocation of overflow_handler in __perf_event_overflow() is now
    done via READ_ONCE, since that pointer can be replaced when the program
    is attached while perf_event itself could have been active already.
    There is no need to do similar treatment for event->prog, since it's
    assigned only once before it's accessed.
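
    From user space the attach point is the existing PERF_EVENT_IOC_SET_BPF
    ioctl; a minimal sketch (error handling omitted):

    #include <linux/perf_event.h>
    #include <sys/ioctl.h>

    /* bpf_fd: a loaded BPF_PROG_TYPE_PERF_EVENT program,
     * perf_fd: a HW/SW sampling perf event */
    static int attach_filter_prog(int perf_fd, int bpf_fd)
    {
            /* the program now runs on every overflow; returning 0 from it
             * drops the sample before perf_event_output() */
            return ioctl(perf_fd, PERF_EVENT_IOC_SET_BPF, bpf_fd);
    }

    /* there is no detach ioctl: close(perf_fd) tears the program down */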

    Signed-off-by: Alexei Starovoitov
    Signed-off-by: David S. Miller

    Alexei Starovoitov
     
  • Introduce BPF_PROG_TYPE_PERF_EVENT programs that can be attached to
    HW and SW perf events (PERF_TYPE_HARDWARE and PERF_TYPE_SOFTWARE
    correspondingly in uapi/linux/perf_event.h)

    The program visible context meta structure is
    struct bpf_perf_event_data {
    struct pt_regs regs;
    __u64 sample_period;
    };
    which is accessible directly from the program:
    int bpf_prog(struct bpf_perf_event_data *ctx)
    {
    ... ctx->sample_period ...
    ... ctx->regs.ip ...
    }

    The bpf verifier rewrites the accesses into kernel internal
    struct bpf_perf_event_data_kern which allows changing
    struct perf_sample_data without affecting bpf programs.
    New fields can be added to the end of struct bpf_perf_event_data
    in the future.

    Signed-off-by: Alexei Starovoitov
    Acked-by: Daniel Borkmann
    Signed-off-by: David S. Miller

    Alexei Starovoitov
     

18 Aug, 2016

2 commits

  • Introduce the flag PERF_EV_CAP_READ_ACTIVE_PKG, useful for uncore events,
    that allows a PMU to signal the generic perf code that an event is readable
    on the current CPU if the event is active on a CPU in the same package as
    the current CPU.

    This is an optimization that avoids an unnecessary IPI for the common case
    where uncore events are run and read in the same package but on
    different CPUs.

    As an example, the IPI removal speeds up perf_read() in my Haswell system
    as follows:

    - For event UNC_C_LLC_LOOKUP: From 260 us to 31 us.
    - For the RAPL event power/energy-cores/: From 255 us to 27 us.

    For the optimization to work, all events in the group must have it
    (similarly to PERF_EV_CAP_SOFTWARE).
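
    In a PMU driver this is a one-liner at event init time; a hypothetical
    uncore driver sketch, assuming the flag lands as PERF_EV_CAP_READ_ACTIVE_PKG
    on event->event_caps:

    static int my_uncore_event_init(struct perf_event *event)
    {
            /* ... usual event validation ... */

            /* let perf_read() run on any CPU of the event's package,
             * instead of IPI'ing the CPU the event is active on */
            event->event_caps |= PERF_EV_CAP_READ_ACTIVE_PKG;

            return 0;
    }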

    Signed-off-by: David Carrillo-Cisneros
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Alexander Shishkin
    Cc: Arnaldo Carvalho de Melo
    Cc: David Carrillo-Cisneros
    Cc: Jiri Olsa
    Cc: Kan Liang
    Cc: Linus Torvalds
    Cc: Paul Turner
    Cc: Peter Zijlstra
    Cc: Stephane Eranian
    Cc: Thomas Gleixner
    Cc: Vegard Nossum
    Cc: Vince Weaver
    Link: http://lkml.kernel.org/r/1471467307-61171-4-git-send-email-davidcc@google.com
    Signed-off-by: Ingo Molnar

    David Carrillo-Cisneros
     
  • Currently, PERF_GROUP_SOFTWARE is used in the group_flags field of a
    group's leader to indicate that is_software_event(event) is true for all
    events in a group. This is the only usage of event->group_flags.

    This pattern of setting group-level flags when all events in the group
    share a property is useful for the flag introduced in the next patch and
    for future CQM/CMT flags. So this patch generalizes group_flags to work
    as an aggregate of event-level flags.

    PERF_GROUP_SOFTWARE denotes an immutable event property. All other flags
    that I intend to add are also determinable at event initialization.
    To better convey the above, this patch renames the event's group_flags to
    group_caps and PERF_GROUP_SOFTWARE to PERF_EV_CAP_SOFTWARE.

    Individual event flags are stored in the new event->event_caps. Since the
    cap flags do not change after event initialization, there is no need to
    serialize event_caps. This new field is used when events are added to a
    context, similarly to how PERF_GROUP_SOFTWARE and is_software_event()
    worked.

    Lastly, for consistency, update is_software_event() to rely on event_caps
    instead of the context index.

    Signed-off-by: David Carrillo-Cisneros
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Alexander Shishkin
    Cc: Arnaldo Carvalho de Melo
    Cc: Jiri Olsa
    Cc: Kan Liang
    Cc: Linus Torvalds
    Cc: Paul Turner
    Cc: Peter Zijlstra
    Cc: Stephane Eranian
    Cc: Thomas Gleixner
    Cc: Vegard Nossum
    Cc: Vince Weaver
    Link: http://lkml.kernel.org/r/1471467307-61171-3-git-send-email-davidcc@google.com
    Signed-off-by: Ingo Molnar

    David Carrillo-Cisneros
     

10 Aug, 2016

2 commits

  • For perf record -b, which requires the pmu::sched_task callback the
    current code is rather expensive:

    7.68% sched-pipe [kernel.vmlinux] [k] perf_pmu_sched_task
    5.95% sched-pipe [kernel.vmlinux] [k] __switch_to
    5.20% sched-pipe [kernel.vmlinux] [k] __intel_pmu_disable_all
    3.95% sched-pipe perf [.] worker_thread

    The problem is that it will iterate all registered PMUs, most of which
    will not have anything to do. Avoid this by keeping an explicit list
    of PMUs that have requested the callback.

    The perf_sched_cb_{inc,dec}() functions already take the required pmu
    argument, and now that these functions are no longer called from NMI
    context we can use them to manage a list.

    With this patch applied the function doesn't show up in the top 4
    anymore (it dropped to 18th place).

    6.67% sched-pipe [kernel.vmlinux] [k] __switch_to
    6.18% sched-pipe [kernel.vmlinux] [k] __intel_pmu_disable_all
    3.92% sched-pipe [kernel.vmlinux] [k] switch_mm_irqs_off
    3.71% sched-pipe perf [.] worker_thread

    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Alexander Shishkin
    Cc: Arnaldo Carvalho de Melo
    Cc: Jiri Olsa
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Stephane Eranian
    Cc: Thomas Gleixner
    Cc: Vince Weaver
    Cc: linux-kernel@vger.kernel.org
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • There's a perf stat bug that is easy to observe on a machine with only one cgroup:

    $ perf stat -e cycles -I 1000 -C 0 -G /
    # time counts unit events
    1.000161699 cycles /
    2.000355591 cycles /
    3.000565154 cycles /
    4.000951350 cycles /

    We'd expect some output there.

    The underlying problem is that there is an optimization in
    perf_cgroup_sched_{in,out}() that skips the switch of cgroup events
    if the old and new cgroups in a task switch are the same.

    This optimization interacts with the current code in two ways
    that cause a CPU context's cgroup (cpuctx->cgrp) to be NULL even if a
    cgroup event matches the current task. These are:

    1. On creation of the first cgroup event in a CPU: In the current code,
    cpuctx->cgrp is only set in perf_cgroup_sched_in, but due to the
    aforesaid optimization, perf_cgroup_sched_in will not run until the next
    cgroup switch in that CPU. This may happen late or never,
    depending on the system's number of cgroups, CPU load, etc.

    2. On deletion of the last cgroup event in a cpuctx: In list_del_event,
    cpuctx->cgrp is set to NULL. Any new cgroup event will not be scheduled in,
    because cpuctx->cgrp == NULL, until a cgroup switch occurs and
    perf_cgroup_sched_in is executed (updating cpuctx->cgrp).

    This patch fixes both problems by setting cpuctx->cgrp in list_add_event,
    mirroring what list_del_event does when removing a cgroup event from CPU
    context, as introduced in:

    commit 68cacd29167b ("perf_events: Fix stale ->cgrp pointer in update_cgrp_time_from_cpuctx()")

    With this patch, cpuctx->cgrp is always set/cleared when installing/removing
    the first/last cgroup event in/from the CPU context. With cpuctx->cgrp
    correctly set, event_filter_match works as intended when events are
    sched in/out.

    After the fix, the output is as expected:

    $ perf stat -e cycles -I 1000 -a -G /
    # time counts unit events
    1.004699159 627342882 cycles /
    2.007397156 615272690 cycles /
    3.010019057 616726074 cycles /

    Signed-off-by: David Carrillo-Cisneros
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Alexander Shishkin
    Cc: Arnaldo Carvalho de Melo
    Cc: Jiri Olsa
    Cc: Kan Liang
    Cc: Linus Torvalds
    Cc: Paul Turner
    Cc: Peter Zijlstra
    Cc: Stephane Eranian
    Cc: Thomas Gleixner
    Cc: Vegard Nossum
    Cc: Vince Weaver
    Link: http://lkml.kernel.org/r/1470124092-113192-1-git-send-email-davidcc@google.com
    Signed-off-by: Ingo Molnar

    David Carrillo-Cisneros
     

30 Jul, 2016

1 commit

  • Pull smp hotplug updates from Thomas Gleixner:
    "This is the next part of the hotplug rework.

    - Convert all notifiers with a priority assigned

    - Convert all CPU_STARTING/DYING notifiers

    The final removal of the STARTING/DYING infrastructure will happen
    when the merge window closes.

    Another 700 lines of impenetrable maze gone :)"

    * 'smp-hotplug-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (70 commits)
    timers/core: Correct callback order during CPU hot plug
    leds/trigger/cpu: Move from CPU_STARTING to ONLINE level
    powerpc/numa: Convert to hotplug state machine
    arm/perf: Fix hotplug state machine conversion
    irqchip/armada: Avoid unused function warnings
    ARC/time: Convert to hotplug state machine
    clocksource/atlas7: Convert to hotplug state machine
    clocksource/armada-370-xp: Convert to hotplug state machine
    clocksource/exynos_mct: Convert to hotplug state machine
    clocksource/arm_global_timer: Convert to hotplug state machine
    rcu: Convert rcutree to hotplug state machine
    KVM/arm/arm64/vgic-new: Convert to hotplug state machine
    smp/cfd: Convert core to hotplug state machine
    x86/x2apic: Convert to CPU hotplug state machine
    profile: Convert to hotplug state machine
    timers/core: Convert to hotplug state machine
    hrtimer: Convert to hotplug state machine
    x86/tboot: Convert to hotplug state machine
    arm64/armv8 deprecated: Convert to hotplug state machine
    hwtracing/coresight-etm4x: Convert to hotplug state machine
    ...

    Linus Torvalds
     

28 Jul, 2016

1 commit

  • Pull networking updates from David Miller:

    1) Unified UDP encapsulation offload methods for drivers, from
    Alexander Duyck.

    2) Make DSA binding more sane, from Andrew Lunn.

    3) Support QCA9888 chips in ath10k, from Anilkumar Kolli.

    4) Several workqueue usage cleanups, from Bhaktipriya Shridhar.

    5) Add XDP (eXpress Data Path), essentially running BPF programs on RX
    packets as soon as the device sees them, with the option to mirror
    the packet on TX via the same interface. From Brenden Blanco and
    others.

    6) Allow qdisc/class stats dumps to run lockless, from Eric Dumazet.

    7) Add VLAN support to b53 and bcm_sf2, from Florian Fainelli.

    8) Simplify netlink conntrack entry layout, from Florian Westphal.

    9) Add ipv4 forwarding support to mlxsw spectrum driver, from Ido
    Schimmel, Yotam Gigi, and Jiri Pirko.

    10) Add SKB array infrastructure and convert tun and macvtap over to it.
    From Michael S Tsirkin and Jason Wang.

    11) Support qdisc packet injection in pktgen, from John Fastabend.

    12) Add neighbour monitoring framework to TIPC, from Jon Paul Maloy.

    13) Add NV congestion control support to TCP, from Lawrence Brakmo.

    14) Add GSO support to SCTP, from Marcelo Ricardo Leitner.

    15) Allow GRO and RPS to function on macsec devices, from Paolo Abeni.

    16) Support MPLS over IPV4, from Simon Horman.

    * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1622 commits)
    xgene: Fix build warning with ACPI disabled.
    be2net: perform temperature query in adapter regardless of its interface state
    l2tp: Correctly return -EBADF from pppol2tp_getname.
    net/mlx5_core/health: Remove deprecated create_singlethread_workqueue
    net: ipmr/ip6mr: update lastuse on entry change
    macsec: ensure rx_sa is set when validation is disabled
    tipc: dump monitor attributes
    tipc: add a function to get the bearer name
    tipc: get monitor threshold for the cluster
    tipc: make cluster size threshold for monitoring configurable
    tipc: introduce constants for tipc address validation
    net: neigh: disallow transition to NUD_STALE if lladdr is unchanged in neigh_update()
    MAINTAINERS: xgene: Add driver and documentation path
    Documentation: dtb: xgene: Add MDIO node
    dtb: xgene: Add MDIO node
    drivers: net: xgene: ethtool: Use phy_ethtool_gset and sset
    drivers: net: xgene: Use exported functions
    drivers: net: xgene: Enable MDIO driver
    drivers: net: xgene: Add backward compatibility
    drivers: net: phy: xgene: Add MDIO driver
    ...

    Linus Torvalds
     

26 Jul, 2016

1 commit

  • This patch fixes the __output_custom() routine we currently use with
    bpf_skb_copy(). I missed that when len is larger than the size of the
    current handle, we can issue multiple invocations of copy_func, and
    __output_custom() advances the destination but also the source buffer by the
    written amount of bytes. When we have __output_custom(), this is actually
    wrong since in that case the source buffer points to a non-linear object,
    in our case an skb, which the copy_func helper is supposed to walk.
    Therefore, since this is non-linear we thus need to pass the offset into
    the helper, so that copy_func can use it for extracting the data from
    the source object.

    Therefore, adjust the callback signatures properly and pass offset
    into the skb_header_pointer() invoked from bpf_skb_copy() callback. The
    __DEFINE_OUTPUT_COPY_BODY() is adjusted to accommodate for two things:
    i) to pass in whether we should advance source buffer or not; this is
    a compile-time constant condition, ii) to pass in the offset for
    __output_custom(), which we do with help of __VA_ARGS__, so everything
    can stay inlined as is currently. Both changes allow for adapting the
    __output_* fast-path helpers w/o extra overhead.

    Fixes: 555c8a8623a3 ("bpf: avoid stack copy and use skb ctx for event output")
    Fixes: 7e3f977edd0b ("perf, events: add non-linear data support for raw records")
    Signed-off-by: Daniel Borkmann
    Acked-by: Alexei Starovoitov
    Signed-off-by: David S. Miller

    Daniel Borkmann
     

16 Jul, 2016

1 commit

  • This patch adds support for non-linear data on raw records. It
    extends raw records to have one or multiple fragments that will
    be written linearly into the ring slot, where each fragment can
    optionally have a custom callback handler to walk and extract
    complex, possibly non-linear data.

    If a callback handler is provided for a fragment, then the new
    __output_custom() will be used instead of __output_copy() for
    the perf_output_sample() part. perf_prepare_sample() does all
    the size calculation only once, so perf_output_sample() doesn't
    need to redo the same work anymore, meaning real_size and padding
    will be cached in the raw record. The raw record becomes 32 bytes
    in size without holes; to not increase it further and to avoid doing
    unnecessary recalculations in the fast path, we can reuse the
    next pointer of the last fragment; the idea here is borrowed from
    ZERO_OR_NULL_PTR(), which should keep the perf_output_sample()
    path for PERF_SAMPLE_RAW minimal.

    This facility is needed for BPF's event output helper as a first
    user that will, in a follow-up, add an additional perf_raw_frag
    to its perf_raw_record in order to be able to more efficiently
    dump skb context after a linear head meta data related to it.
    skbs can be non-linear and thus need a custom output function to
    dump buffers. Currently, the skb data needs to be copied twice;
    with the help of __output_custom() this work only needs to be
    done once. Future users could be things like XDP/BPF programs
    that work on different context though and would thus also have
    a different callback function.

    The few users of raw records are adapted to initialize their frag
    data from the raw record itself, no change in behavior for them.
    The code is based upon a PoC diff provided by Peter Zijlstra [1].

    [1] http://thread.gmane.org/gmane.linux.network/421294
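
    For the existing in-tree users the adaptation is mostly mechanical; a
    sketch of how a single-fragment raw record is now set up ('record' and
    'entry_size' stand in for the caller's buffer and length):

    struct perf_raw_record raw = {
            .frag = {
                    /* one linear fragment; a non-linear user would also set
                     * .copy to a callback that walks the source object */
                    .data = record,
                    .size = entry_size,
            },
    };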

    Suggested-by: Peter Zijlstra
    Signed-off-by: Daniel Borkmann
    Acked-by: Alexei Starovoitov
    Signed-off-by: David S. Miller

    Daniel Borkmann
     

14 Jul, 2016

2 commits

  • All users converted to state machine callbacks.

    Signed-off-by: Thomas Gleixner
    Signed-off-by: Anna-Maria Gleixner
    Reviewed-by: Sebastian Andrzej Siewior
    Cc: Alexander Shishkin
    Cc: Arnaldo Carvalho de Melo
    Cc: Linus Torvalds
    Cc: Nicolas Iooss
    Cc: Paul E. McKenney
    Cc: Peter Zijlstra
    Cc: rt@linutronix.de
    Link: http://lkml.kernel.org/r/20160713153335.115333381@linutronix.de
    Signed-off-by: Ingo Molnar

    Thomas Gleixner
     
  • Actually a nice symmetric startup/teardown pair which fits properly into
    the state machine concept. In the long run we should be able to invoke
    the startup callback for the boot CPU via the state machine and get
    rid of the init function which invokes it on the boot CPU.

    Note: This comes actually before the perf hardware callbacks. In the notifier
    model the hardware callbacks have a higher priority than the core
    callback. But that's solely for CPU offline so that hardware migration of
    events happens before the core is notified about the outgoing CPU.

    With the symmetric state array model we have the following ordering:

    UP: core -> hardware
    DOWN: hardware -> core

    Signed-off-by: Thomas Gleixner
    Signed-off-by: Anna-Maria Gleixner
    Reviewed-by: Sebastian Siewior
    Cc: Alexander Shishkin
    Cc: Arnaldo Carvalho de Melo
    Cc: Arnaldo Carvalho de Melo
    Cc: Jiri Olsa
    Cc: Linus Torvalds
    Cc: Oleg Nesterov
    Cc: Peter Zijlstra
    Cc: Rasmus Villemoes
    Cc: rt@linutronix.de
    Link: http://lkml.kernel.org/r/20160713153333.587514098@linutronix.de
    Signed-off-by: Ingo Molnar

    Thomas Gleixner
     

08 Jun, 2016

1 commit


03 Jun, 2016

2 commits

  • Add a way to show different sysfs event attributes depending on whether
    HyperThreading is on or off. This is difficult to determine
    early at boot, so we just do it dynamically when the sysfs
    attribute is read.

    Signed-off-by: Andi Kleen
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Alexander Shishkin
    Cc: Arnaldo Carvalho de Melo
    Cc: Jiri Olsa
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Stephane Eranian
    Cc: Thomas Gleixner
    Cc: Vince Weaver
    Cc: acme@kernel.org
    Cc: jolsa@kernel.org
    Link: http://lkml.kernel.org/r/1463703002-19686-3-git-send-email-andi@firstfloor.org
    Signed-off-by: Ingo Molnar

    Andi Kleen
     
  • The perf_event_aux() function iterates all PMUs and all events in
    their respective per-CPU contexts to find the events to deliver
    side-band records to.

    For example, the brk test case in lkp triggers many mmap() operations,
    which, if we're also running perf, results in many perf_event_aux()
    invocations.

    If we enable uncore PMU support (even when uncore events are not used),
    dozens of uncore PMUs will be iterated, which can significantly
    decrease brk_test's throughput.

    For example, the brk throughput:

    without uncore PMUs: 2647573 ops_per_sec
    with uncore PMUs: 1768444 ops_per_sec

    ... a 33% reduction.

    To get at the per-CPU events that need side-band records, this patch
    puts these events on a per-CPU list; this avoids iterating the PMUs
    and any events that do not need side-band records.

    Per task events are unchanged to avoid extra overhead on the context
    switch paths.

    Suggested-by: Peter Zijlstra (Intel)
    Reported-by: Huang, Ying
    Signed-off-by: Kan Liang
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Alexander Shishkin
    Cc: Arnaldo Carvalho de Melo
    Cc: Jiri Olsa
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Stephane Eranian
    Cc: Thomas Gleixner
    Cc: Vince Weaver
    Link: http://lkml.kernel.org/r/1458757477-3781-1-git-send-email-kan.liang@intel.com
    Signed-off-by: Ingo Molnar

    Kan Liang
     

30 May, 2016

1 commit

  • Additionally to being able to control the system wide maximum depth via
    /proc/sys/kernel/perf_event_max_stack, now we are able to ask for
    different depths per event, using perf_event_attr.sample_max_stack for
    that.

    This uses a u16 hole at the end of perf_event_attr: when
    perf_event_attr.sample_type has PERF_SAMPLE_CALLCHAIN set, a
    sample_max_stack of zero means use perf_event_max_stack, otherwise
    the value is bounds checked under callchain_mutex.
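
    A minimal sketch of using the new knob from user space (error handling
    omitted):

    #include <linux/perf_event.h>
    #include <sys/syscall.h>
    #include <string.h>
    #include <unistd.h>

    int open_callchain_event(void)
    {
            struct perf_event_attr attr;

            memset(&attr, 0, sizeof(attr));
            attr.size = sizeof(attr);
            attr.type = PERF_TYPE_SOFTWARE;
            attr.config = PERF_COUNT_SW_CPU_CLOCK;
            attr.sample_period = 100000;
            attr.sample_type = PERF_SAMPLE_IP | PERF_SAMPLE_CALLCHAIN;
            attr.sample_max_stack = 32;  /* 0 means: use kernel.perf_event_max_stack */

            return syscall(__NR_perf_event_open, &attr, 0, -1, -1, 0);
    }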

    Cc: Adrian Hunter
    Cc: Alexander Shishkin
    Cc: Alexei Starovoitov
    Cc: Brendan Gregg
    Cc: David Ahern
    Cc: Frederic Weisbecker
    Cc: He Kuang
    Cc: Jiri Olsa
    Cc: Linus Torvalds
    Cc: Masami Hiramatsu
    Cc: Milian Wolff
    Cc: Namhyung Kim
    Cc: Peter Zijlstra
    Cc: Stephane Eranian
    Cc: Thomas Gleixner
    Cc: Vince Weaver
    Cc: Wang Nan
    Cc: Zefan Li
    Link: http://lkml.kernel.org/n/tip-kolmn1yo40p7jhswxwrc7rrd@git.kernel.org
    Signed-off-by: Arnaldo Carvalho de Melo

    Arnaldo Carvalho de Melo
     

26 May, 2016

1 commit

  • Pull perf updates from Ingo Molnar:
    "Mostly tooling and PMU driver fixes, but also a number of late updates
    such as the reworking of the call-chain size limiting logic to make
    call-graph recording more robust, plus tooling side changes for the
    new 'backwards ring-buffer' extension to the perf ring-buffer"

    * 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (34 commits)
    perf record: Read from backward ring buffer
    perf record: Rename variable to make code clear
    perf record: Prevent reading invalid data in record__mmap_read
    perf evlist: Add API to pause/resume
    perf trace: Use the ptr->name beautifier as default for "filename" args
    perf trace: Use the fd->name beautifier as default for "fd" args
    perf report: Add srcline_from/to branch sort keys
    perf evsel: Record fd into perf_mmap
    perf evsel: Add overwrite attribute and check write_backward
    perf tools: Set buildid dir under symfs when --symfs is provided
    perf trace: Only auto set call-graph to "dwarf" when syscalls are being traced
    perf annotate: Sort list of recognised instructions
    perf annotate: Fix identification of ARM blt and bls instructions
    perf tools: Fix usage of max_stack sysctl
    perf callchain: Stop validating callchains by the max_stack sysctl
    perf trace: Fix exit_group() formatting
    perf top: Use machine->kptr_restrict_warned
    perf trace: Warn when trying to resolve kernel addresses with kptr_restrict=1
    perf machine: Do not bail out if not managing to read ref reloc symbol
    perf/x86/intel/p4: Trival indentation fix, remove space
    ...

    Linus Torvalds
     

18 May, 2016

1 commit

  • Pull networking updates from David Miller:
    "Highlights:

    1) Support SPI based w5100 devices, from Akinobu Mita.

    2) Partial Segmentation Offload, from Alexander Duyck.

    3) Add GMAC4 support to stmmac driver, from Alexandre TORGUE.

    4) Allow cls_flower stats offload, from Amir Vadai.

    5) Implement bpf blinding, from Daniel Borkmann.

    6) Optimize _ASYNC_ bit twiddling on sockets, unless the socket is
    actually using FASYNC these atomics are superfluous. From Eric
    Dumazet.

    7) Run TCP more preemptibly, also from Eric Dumazet.

    8) Support LED blinking, EEPROM dumps, and rxvlan offloading in mlx5e
    driver, from Gal Pressman.

    9) Allow creating ppp devices via rtnetlink, from Guillaume Nault.

    10) Improve BPF usage documentation, from Jesper Dangaard Brouer.

    11) Support tunneling offloads in qed, from Manish Chopra.

    12) aRFS offloading in mlx5e, from Maor Gottlieb.

    13) Add RFS and RPS support to SCTP protocol, from Marcelo Ricardo
    Leitner.

    14) Add MSG_EOR support to TCP, this allows controlling packet
    coalescing on application record boundaries for more accurate
    socket timestamp sampling. From Martin KaFai Lau.

    15) Fix alignment of 64-bit netlink attributes across the board, from
    Nicolas Dichtel.

    16) Per-vlan stats in bridging, from Nikolay Aleksandrov.

    17) Several conversions of drivers to ethtool ksettings, from Philippe
    Reynes.

    18) Checksum neutral ILA in ipv6, from Tom Herbert.

    19) Factorize all of the various marvell dsa drivers into one, from
    Vivien Didelot

    20) Add VF support to qed driver, from Yuval Mintz"

    * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1649 commits)
    Revert "phy dp83867: Fix compilation with CONFIG_OF_MDIO=m"
    Revert "phy dp83867: Make rgmii parameters optional"
    r8169: default to 64-bit DMA on recent PCIe chips
    phy dp83867: Make rgmii parameters optional
    phy dp83867: Fix compilation with CONFIG_OF_MDIO=m
    bpf: arm64: remove callee-save registers use for tmp registers
    asix: Fix offset calculation in asix_rx_fixup() causing slow transmissions
    switchdev: pass pointer to fib_info instead of copy
    net_sched: close another race condition in tcf_mirred_release()
    tipc: fix nametable publication field in nl compat
    drivers: net: Don't print unpopulated net_device name
    qed: add support for dcbx.
    ravb: Add missing free_irq() calls to ravb_close()
    qed: Remove a stray tab
    net: ethernet: fec-mpc52xx: use phy_ethtool_{get|set}_link_ksettings
    net: ethernet: fec-mpc52xx: use phydev from struct net_device
    bpf, doc: fix typo on bpf_asm descriptions
    stmmac: hardware TX COE doesn't work when force_thresh_dma_mode is set
    net: ethernet: fs-enet: use phy_ethtool_{get|set}_link_ksettings
    net: ethernet: fs-enet: use phydev from struct net_device
    ...

    Linus Torvalds
     

17 May, 2016

4 commits

  • The perf_sample->ip_callchain->nr value includes all the entries in the
    ip_callchain->ip[] array, real addresses and PERF_CONTEXT_{KERNEL,USER,etc},
    while what the user expects is that the limit in the kernel.perf_event_max_stack
    sysctl, or in the upcoming per-event perf_event_attr.sample_max_stack knob, be
    honoured in terms of real IP addresses in the stack trace.

    So allocate a bunch of extra entries for contexts, and do the accounting
    via perf_callchain_entry_ctx struct members.

    A new sysctl, kernel.perf_event_max_contexts_per_stack is also
    introduced for investigating possible bugs in the callchain
    implementation by some arch.

    Cc: Adrian Hunter
    Cc: Alexander Shishkin
    Cc: Alexei Starovoitov
    Cc: Brendan Gregg
    Cc: David Ahern
    Cc: Frederic Weisbecker
    Cc: He Kuang
    Cc: Jiri Olsa
    Cc: Masami Hiramatsu
    Cc: Milian Wolff
    Cc: Namhyung Kim
    Cc: Peter Zijlstra
    Cc: Stephane Eranian
    Cc: Thomas Gleixner
    Cc: Vince Weaver
    Cc: Wang Nan
    Cc: Zefan Li
    Link: http://lkml.kernel.org/n/tip-3b4wnqk340c4sg4gwkfdi9yk@git.kernel.org
    Signed-off-by: Arnaldo Carvalho de Melo

    Arnaldo Carvalho de Melo
     
  • We need different helpers to account for how many contexts we have in
    the sample and how many real addresses, so do it now as a prep patch, to
    ease review.

    Cc: David Ahern
    Cc: Frederic Weisbecker
    Cc: Jiri Olsa
    Cc: Namhyung Kim
    Cc: Peter Zijlstra
    Link: http://lkml.kernel.org/n/tip-q964tnyuqrxw5gld18vizs3c@git.kernel.org
    Signed-off-by: Arnaldo Carvalho de Melo

    Arnaldo Carvalho de Melo
     
  • We will use it to count how many addresses are in the entry->ip[] array,
    excluding PERF_CONTEXT_{KERNEL,USER,etc} entries, so that we can really
    return the number of entries specified by the user via the relevant
    sysctl, kernel.perf_event_max_stack, or via the per-event
    perf_event_attr.sample_max_stack knob.

    This way we keep the perf_sample->ip_callchain->nr meaning, that is the
    number of entries, be it real addresses or PERF_CONTEXT_ entries, while
    honouring the max_stack knobs, i.e. the end result will be max_stack
    entries if we have at least that many entries in a given stack trace.

    Cc: David Ahern
    Cc: Frederic Weisbecker
    Cc: Jiri Olsa
    Cc: Namhyung Kim
    Cc: Peter Zijlstra
    Link: http://lkml.kernel.org/n/tip-s8teto51tdqvlfhefndtat9r@git.kernel.org
    Signed-off-by: Arnaldo Carvalho de Melo

    Arnaldo Carvalho de Melo
     
  • This makes perf_callchain_{user,kernel}() receive the max stack
    as context for the perf_callchain_entry, instead of accessing
    the global sysctl_perf_event_max_stack.

    Cc: Adrian Hunter
    Cc: Alexander Shishkin
    Cc: Alexei Starovoitov
    Cc: Brendan Gregg
    Cc: David Ahern
    Cc: Frederic Weisbecker
    Cc: He Kuang
    Cc: Jiri Olsa
    Cc: Linus Torvalds
    Cc: Masami Hiramatsu
    Cc: Milian Wolff
    Cc: Namhyung Kim
    Cc: Peter Zijlstra
    Cc: Stephane Eranian
    Cc: Thomas Gleixner
    Cc: Vince Weaver
    Cc: Wang Nan
    Cc: Zefan Li
    Link: http://lkml.kernel.org/n/tip-kolmn1yo40p7jhswxwrc7rrd@git.kernel.org
    Signed-off-by: Arnaldo Carvalho de Melo

    Arnaldo Carvalho de Melo