29 Oct, 2016

1 commit

  • Pull perf fixes from Ingo Molnar:
    "Misc kernel fixes: a virtualization environment related fix, an uncore
    PMU driver removal handling fix, a PowerPC fix and new events for
    Knights Landing"

    * 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    perf/x86/intel: Honour the CPUID for number of fixed counters in hypervisors
    perf/powerpc: Don't call perf_event_disable() from atomic context
    perf/core: Protect PMU device removal with a 'pmu_bus_running' check, to fix CONFIG_DEBUG_TEST_DRIVER_REMOVE=y kernel panic
    perf/x86/intel/cstate: Add C-state residency events for Knights Landing

    Linus Torvalds
     

28 Oct, 2016

2 commits

  • The trinity syscall fuzzer triggered the following WARN() on powerpc:

    WARNING: CPU: 9 PID: 2998 at arch/powerpc/kernel/hw_breakpoint.c:278
    ...
    NIP [c00000000093aedc] .hw_breakpoint_handler+0x28c/0x2b0
    LR [c00000000093aed8] .hw_breakpoint_handler+0x288/0x2b0
    Call Trace:
    [c0000002f7933580] [c00000000093aed8] .hw_breakpoint_handler+0x288/0x2b0 (unreliable)
    [c0000002f7933630] [c0000000000f671c] .notifier_call_chain+0x7c/0xf0
    [c0000002f79336d0] [c0000000000f6abc] .__atomic_notifier_call_chain+0xbc/0x1c0
    [c0000002f7933780] [c0000000000f6c40] .notify_die+0x70/0xd0
    [c0000002f7933820] [c00000000001a74c] .do_break+0x4c/0x100
    [c0000002f7933920] [c0000000000089fc] handle_dabr_fault+0x14/0x48

    Followed by a lockdep warning:

    ===============================
    [ INFO: suspicious RCU usage. ]
    4.8.0-rc5+ #7 Tainted: G W
    -------------------------------
    ./include/linux/rcupdate.h:556 Illegal context switch in RCU read-side critical section!

    other info that might help us debug this:

    rcu_scheduler_active = 1, debug_locks = 0
    2 locks held by ls/2998:
    #0: (rcu_read_lock){......}, at: [] .__atomic_notifier_call_chain+0x0/0x1c0
    #1: (rcu_read_lock){......}, at: [] .hw_breakpoint_handler+0x0/0x2b0

    stack backtrace:
    CPU: 9 PID: 2998 Comm: ls Tainted: G W 4.8.0-rc5+ #7
    Call Trace:
    [c0000002f7933150] [c00000000094b1f8] .dump_stack+0xe0/0x14c (unreliable)
    [c0000002f79331e0] [c00000000013c468] .lockdep_rcu_suspicious+0x138/0x180
    [c0000002f7933270] [c0000000001005d8] .___might_sleep+0x278/0x2e0
    [c0000002f7933300] [c000000000935584] .mutex_lock_nested+0x64/0x5a0
    [c0000002f7933410] [c00000000023084c] .perf_event_ctx_lock_nested+0x16c/0x380
    [c0000002f7933500] [c000000000230a80] .perf_event_disable+0x20/0x60
    [c0000002f7933580] [c00000000093aeec] .hw_breakpoint_handler+0x29c/0x2b0
    [c0000002f7933630] [c0000000000f671c] .notifier_call_chain+0x7c/0xf0
    [c0000002f79336d0] [c0000000000f6abc] .__atomic_notifier_call_chain+0xbc/0x1c0
    [c0000002f7933780] [c0000000000f6c40] .notify_die+0x70/0xd0
    [c0000002f7933820] [c00000000001a74c] .do_break+0x4c/0x100
    [c0000002f7933920] [c0000000000089fc] handle_dabr_fault+0x14/0x48

    While it looks like the first WARN() is probably valid, the other one is
    triggered by disabling the event via perf_event_disable() from atomic context.

    The event is disabled here in case we were not able to emulate
    the instruction that hit the breakpoint. By disabling the event
    we unschedule the event and make sure it's not scheduled back.

    But we can't call perf_event_disable() from atomic context, instead
    we need to use the event's pending_disable irq_work method to disable it.
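
    A minimal sketch of that deferred-disable pattern, following the helper that
    mainline grew for this purpose (illustrative, not the literal patch):

    void perf_event_disable_inatomic(struct perf_event *event)
    {
        /*
         * Flag the event and queue its pending irq_work; the actual
         * perf_event_disable() then runs from a context that may sleep.
         */
        event->pending_disable = 1;
        irq_work_queue(&event->pending);
    }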

    Reported-by: Jan Stancek
    Signed-off-by: Jiri Olsa
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Alexander Shishkin
    Cc: Arnaldo Carvalho de Melo
    Cc: Huang Ying
    Cc: Jiri Olsa
    Cc: Linus Torvalds
    Cc: Michael Neuling
    Cc: Paul Mackerras
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Link: http://lkml.kernel.org/r/20161026094824.GA21397@krava
    Signed-off-by: Ingo Molnar

    Jiri Olsa
     
  • perf/core: Protect PMU device removal with a 'pmu_bus_running' check, to fix CONFIG_DEBUG_TEST_DRIVER_REMOVE=y kernel panic

    CAI Qian reported a crash in the PMU uncore device removal code,
    enabled by the CONFIG_DEBUG_TEST_DRIVER_REMOVE=y option:

    https://marc.info/?l=linux-kernel&m=147688837328451

    The reason for the crash is that perf_pmu_unregister() tries to remove
    a PMU device which is not added at this point. We add PMU devices
    only after pmu_bus is registered, which happens in the
    perf_event_sysfs_init() call and sets the 'pmu_bus_running' flag.

    The fix is to get the 'pmu_bus_running' flag state at the point
    the PMU is taken out of the PMU list and remove the device
    later only if it's set.
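
    A condensed sketch of that flow, using the flag name from the changelog
    (illustrative only, not the verbatim diff):

    void perf_pmu_unregister(struct pmu *pmu)
    {
        int remove_device;

        mutex_lock(&pmus_lock);
        /* snapshot the flag while the PMU is being taken off the list */
        remove_device = pmu_bus_running;
        list_del_rcu(&pmu->entry);
        mutex_unlock(&pmus_lock);

        /* ... synchronize RCU, free per-CPU contexts ... */

        /* only remove the device if it was ever added */
        if (remove_device) {
            device_del(pmu->dev);
            put_device(pmu->dev);
        }
    }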

    Reported-by: CAI Qian <caiqian@redhat.com>
    Tested-by: CAI Qian <caiqian@redhat.com>
    Signed-off-by: Jiri Olsa <jolsa@kernel.org>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
    Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
    Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    Cc: Jiri Olsa <jolsa@redhat.com>
    Cc: Kan Liang <kan.liang@intel.com>
    Cc: Linus Torvalds <torvalds@linux-foundation.org>
    Cc: Peter Zijlstra <peterz@infradead.org>
    Cc: Rob Herring <robh@kernel.org>
    Cc: Thomas Gleixner <tglx@linutronix.de>
    Link: http://lkml.kernel.org/r/20161020111011.GA13361@krava
    Signed-off-by: Ingo Molnar <mingo@kernel.org>

    Jiri Olsa
     

19 Oct, 2016

1 commit


06 Oct, 2016

1 commit

  • Pull networking updates from David Miller:

    1) BBR TCP congestion control, from Neal Cardwell, Yuchung Cheng and
    co. at Google. https://lwn.net/Articles/701165/

    2) Do TCP Small Queues for retransmits, from Eric Dumazet.

    3) Support collect_md mode for all IPV4 and IPV6 tunnels, from Alexei
    Starovoitov.

    4) Allow cls_flower to classify packets in ip tunnels, from Amir Vadai.

    5) Support DSA tagging in older mv88e6xxx switches, from Andrew Lunn.

    6) Support GMAC protocol in iwlwifi mwm, from Ayala Beker.

    7) Support ndo_poll_controller in mlx5, from Calvin Owens.

    8) Move VRF processing to an output hook and allow l3mdev to be
    loopback, from David Ahern.

    9) Support SOCK_DESTROY for UDP sockets. Also from David Ahern.

    10) Congestion control in RXRPC, from David Howells.

    11) Support geneve RX offload in ixgbe, from Emil Tantilov.

    12) When hitting pressure for new incoming TCP data SKBs, perform a
    partial rather than a full purge of the OFO queue (which could be
    huge). From Eric Dumazet.

    13) Convert XFRM state and policy lookups to RCU, from Florian Westphal.

    14) Support RX network flow classification to igb, from Gangfeng Huang.

    15) Hardware offloading of eBPF in nfp driver, from Jakub Kicinski.

    16) New skbmod packet action, from Jamal Hadi Salim.

    17) Remove some inefficiencies in snmp proc output, from Jia He.

    18) Add FIB notifications to properly propagate route changes to
    hardware which is doing forwarding offloading. From Jiri Pirko.

    19) New dsa driver for qca8xxx chips, from John Crispin.

    20) Implement RFC7559 ipv6 router solicitation backoff, from Maciej
    Żenczykowski.

    21) Add L3 mode to ipvlan, from Mahesh Bandewar.

    22) Support 802.1ad in mlx4, from Moshe Shemesh.

    23) Support hardware LRO in mediatek driver, from Nelson Chang.

    24) Add TC offloading to mlx5, from Or Gerlitz.

    25) Convert various drivers to ethtool ksettings interfaces, from
    Philippe Reynes.

    26) TX max rate limiting for cxgb4, from Rahul Lakkireddy.

    27) NAPI support for ath10k, from Rajkumar Manoharan.

    28) Support XDP in mlx5, from Rana Shahout and Saeed Mahameed.

    29) UDP replicast support in TIPC, from Richard Alpe.

    30) Per-queue statistics for qed driver, from Sudarsana Reddy Kalluru.

    31) Support BQL in thunderx driver, from Sunil Goutham.

    32) TSO support in alx driver, from Tobias Regnery.

    33) Add stream parser engine and use it in kcm.

    34) Support async DHCP replies in ipconfig module, from Uwe
    Kleine-König.

    35) DSA port fast aging for mv88e6xxx driver, from Vivien Didelot.

    * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1715 commits)
    mlxsw: switchx2: Fix misuse of hard_header_len
    mlxsw: spectrum: Fix misuse of hard_header_len
    net/faraday: Stop NCSI device on shutdown
    net/ncsi: Introduce ncsi_stop_dev()
    net/ncsi: Rework the channel monitoring
    net/ncsi: Allow to extend NCSI request properties
    net/ncsi: Rework request index allocation
    net/ncsi: Don't probe on the reserved channel ID (0x1f)
    net/ncsi: Introduce NCSI_RESERVED_CHANNEL
    net/ncsi: Avoid unused-value build warning from ia64-linux-gcc
    net: Add netdev all_adj_list refcnt propagation to fix panic
    net: phy: Add Edge-rate driver for Microsemi PHYs.
    vmxnet3: Wake queue from reset work
    i40e: avoid NULL pointer dereference and recursive errors on early PCI error
    qed: Add RoCE ll2 & GSI support
    qed: Add support for memory registeration verbs
    qed: Add support for QP verbs
    qed: PD,PKEY and CQ verb support
    qed: Add support for RoCE hw init
    qede: Add qedr framework
    ...

    Linus Torvalds
     

03 Oct, 2016

1 commit


23 Sep, 2016

2 commits


22 Sep, 2016

1 commit

  • An "exclusive" PMU is one that can only have one event scheduled in
    at any given time. There may be more than one such PMU in a system,
    though, like Intel PT and BTS. It should be allowed to have one event
    for each of those inside the same context (there may be other constraints
    that prevent this, but those would be hardware-specific). However,
    the exclusivity code is written so that only one event from any of the
    "exclusive" PMUs is allowed in a context.

    Fix this by making the exclusive event filter explicitly match two events'
    PMUs.
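
    A sketch of what a PMU-aware match amounts to (assumed helper shape, not the
    verbatim diff):

    static bool exclusive_event_match(struct perf_event *e1, struct perf_event *e2)
    {
        /* only two events on the very same exclusive PMU conflict */
        if ((e1->pmu == e2->pmu) &&
            (e1->pmu->capabilities & PERF_PMU_CAP_EXCLUSIVE))
            return true;

        return false;
    }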

    Signed-off-by: Alexander Shishkin
    Acked-by: Peter Zijlstra
    Cc: Arnaldo Carvalho de Melo
    Cc: Arnaldo Carvalho de Melo
    Cc: Jiri Olsa
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Stephane Eranian
    Cc: Thomas Gleixner
    Cc: Vince Weaver
    Cc: vince@deater.net
    Link: http://lkml.kernel.org/r/20160920154811.3255-3-alexander.shishkin@linux.intel.com
    Signed-off-by: Ingo Molnar

    Alexander Shishkin
     

10 Sep, 2016

3 commits

  • Signed-off-by: Ingo Molnar

    Ingo Molnar
     
  • The order of accesses to ring buffer's aux_mmap_count and aux_refcount
    has to be preserved across the users, namely perf_mmap_close() and
    perf_aux_output_begin(), otherwise the inversion can result in the latter
    holding the last reference to the aux buffer and subsequently free'ing
    it in atomic context, triggering a warning.

    > ------------[ cut here ]------------
    > WARNING: CPU: 0 PID: 257 at kernel/events/ring_buffer.c:541 __rb_free_aux+0x11a/0x130
    > CPU: 0 PID: 257 Comm: stopbug Not tainted 4.8.0-rc1+ #2596
    > Call Trace:
    > [] __warn+0xcb/0xf0
    > [] warn_slowpath_null+0x1d/0x20
    > [] __rb_free_aux+0x11a/0x130
    > [] rb_free_aux+0x18/0x20
    > [] perf_aux_output_begin+0x163/0x1e0
    > [] bts_event_start+0x3a/0xd0
    > [] bts_event_add+0x5d/0x80
    > [] event_sched_in.isra.104+0xf6/0x2f0
    > [] group_sched_in+0x6e/0x190
    > [] ctx_sched_in+0x2fe/0x5f0
    > [] perf_event_sched_in+0x60/0x80
    > [] ctx_resched+0x5b/0x90
    > [] __perf_event_enable+0x1e1/0x240
    > [] event_function+0xa9/0x180
    > [] ? perf_cgroup_attach+0x70/0x70
    > [] remote_function+0x3f/0x50
    > [] flush_smp_call_function_queue+0x83/0x150
    > [] generic_smp_call_function_single_interrupt+0x13/0x60
    > [] smp_call_function_single_interrupt+0x27/0x40
    > [] call_function_single_interrupt+0x89/0x90
    > [] finish_task_switch+0xa6/0x210
    > [] ? finish_task_switch+0x67/0x210
    > [] __schedule+0x3dd/0xb50
    > [] schedule+0x35/0x80
    > [] sys_sched_yield+0x61/0x70
    > [] entry_SYSCALL_64_fastpath+0x18/0xa8
    > ---[ end trace 6235f556f5ea83a9 ]---

    This patch puts the checks in perf_aux_output_begin() in the same order
    as that of perf_mmap_close().

    Reported-by: Vince Weaver
    Signed-off-by: Alexander Shishkin
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Arnaldo Carvalho de Melo
    Cc: Arnaldo Carvalho de Melo
    Cc: Jiri Olsa
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Stephane Eranian
    Cc: Thomas Gleixner
    Cc: vince@deater.net
    Link: http://lkml.kernel.org/r/20160906132353.19887-3-alexander.shishkin@linux.intel.com
    Signed-off-by: Ingo Molnar

    Alexander Shishkin
     
  • In the mmap_close() path we need to stop all the AUX events that are
    writing data to the AUX area that we are unmapping, before we can
    safely free the pages. To determine if an event needs to be stopped,
    we're comparing its ->rb against the one that's getting unmapped.
    However, a SET_OUTPUT ioctl may turn up inside an AUX transaction
    and swizzle event::rb to some other ring buffer, but the transaction
    will keep writing data to the old ring buffer until the event gets
    scheduled out. At this point, mmap_close() will skip over such an
    event and will proceed to free the AUX area, while it's still being
    used by this event, which will set off a warning in the mmap_close()
    path and cause a memory corruption.

    To avoid this, always stop an AUX event before its ->rb is updated;
    this will release the (potentially) last reference on the AUX area
    of the buffer. If the event gets restarted, its new ring buffer will
    be used. If another SET_OUTPUT comes and switches it back to the
    old ring buffer that's getting unmapped, it's also fine: this
    ring buffer's aux_mmap_count will be zero and AUX transactions won't
    start any more.

    Reported-by: Vince Weaver
    Signed-off-by: Alexander Shishkin
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Arnaldo Carvalho de Melo
    Cc: Arnaldo Carvalho de Melo
    Cc: Jiri Olsa
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Stephane Eranian
    Cc: Thomas Gleixner
    Cc: vince@deater.net
    Link: http://lkml.kernel.org/r/20160906132353.19887-2-alexander.shishkin@linux.intel.com
    Signed-off-by: Ingo Molnar

    Alexander Shishkin
     

07 Sep, 2016

1 commit

  • The newly added bpf_overflow_handler function is only built if both
    CONFIG_EVENT_TRACING and CONFIG_BPF_SYSCALL are enabled, but the caller
    only checks the latter:

    kernel/events/core.c: In function 'perf_event_alloc':
    kernel/events/core.c:9106:27: error: 'bpf_overflow_handler' undeclared (first use in this function)

    This changes the caller so we also skip this call if CONFIG_EVENT_TRACING
    is disabled entirely.
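
    The guard in perf_event_alloc() then looks roughly like this (sketch, error
    handling omitted):

    if (!overflow_handler && parent_event) {
        overflow_handler = parent_event->overflow_handler;
        context = parent_event->overflow_handler_context;
    #if defined(CONFIG_BPF_SYSCALL) && defined(CONFIG_EVENT_TRACING)
        /* bpf_overflow_handler only exists when both options are enabled */
        if (overflow_handler == bpf_overflow_handler)
            event->prog = bpf_prog_inc(parent_event->prog);
    #endif
    }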

    Signed-off-by: Arnd Bergmann
    Fixes: aa6a5f3cb2b2 ("perf, bpf: add perf events core support for BPF_PROG_TYPE_PERF_EVENT programs")
    Acked-by: Alexei Starovoitov
    Signed-off-by: David S. Miller

    Arnd Bergmann
     

05 Sep, 2016

3 commits

  • PERF_EF_START is a flag to indicate to the PMU ->add() callback that, as
    well as claiming the PMU resources required by the event being added,
    it should also start the PMU.

    Passing this flag to the ->start() callback doesn't make sense, because
    ->start() always tries to start the PMU. Remove it.
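
    For context, the flag is meant to be honoured in a driver's ->add() callback;
    a hypothetical PMU driver (names made up for illustration) would use it like
    this:

    static int mydrv_pmu_add(struct perf_event *event, int flags)
    {
        /* ... claim a hardware counter for this event ... */
        event->hw.state = PERF_HES_STOPPED | PERF_HES_UPTODATE;

        /* ->add() may be asked to start counting right away */
        if (flags & PERF_EF_START)
            mydrv_pmu_start(event, PERF_EF_RELOAD);

        return 0;
    }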

    Signed-off-by: Will Deacon
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Alexander Shishkin
    Cc: Arnaldo Carvalho de Melo
    Cc: Jiri Olsa
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Stephane Eranian
    Cc: Thomas Gleixner
    Cc: Vince Weaver
    Cc: mark.rutland@arm.com
    Link: http://lkml.kernel.org/r/1471257765-29662-1-git-send-email-will.deacon@arm.com
    Signed-off-by: Ingo Molnar

    Will Deacon
     
  • Conflicts:
    kernel/events/core.c

    Signed-off-by: Ingo Molnar

    Ingo Molnar
     
  • This effectively reverts commit:

    71e7bc2bab77 ("perf/core: Check return value of the perf_event_read() IPI")

    ... and puts in a comment explaining why we ignore the return value.

    Reported-by: Vegard Nossum
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: David Carrillo-Cisneros
    Cc: Jiri Olsa
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Fixes: 71e7bc2bab77 ("perf/core: Check return value of the perf_event_read() IPI")
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     

03 Sep, 2016

1 commit

  • Allow attaching BPF_PROG_TYPE_PERF_EVENT programs to sw and hw perf events
    via the overflow_handler mechanism.
    When a program is attached, the overflow_handlers become stacked.
    The program acts as a filter.
    Returning zero from the program means that the normal perf_event_output handler
    will not be called and the sampling event won't be stored in the ring buffer.

    The overflow_handler_context==NULL is an additional safety check
    to make sure programs are not attached to hw breakpoints and watchdog
    in case other checks (that prevent that now anyway) get accidentally
    relaxed in the future.

    The program refcnt is incremented in case perf_events are inherited
    when the target task is forked.
    Similar to kprobe and tracepoint programs, there is no ioctl to
    detach the program or swap an already attached program. User space is
    expected to close(perf_event_fd), as it does right now for kprobe+bpf.
    That restriction simplifies the code quite a bit.

    The invocation of overflow_handler in __perf_event_overflow() is now
    done via READ_ONCE, since that pointer can be replaced when the program
    is attached while perf_event itself could have been active already.
    There is no need to do similar treatment for event->prog, since it's
    assigned only once before it's accessed.
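
    From user space the attach path is a plain perf ioctl; a minimal usage sketch
    (perf_fd comes from perf_event_open(), prog_fd from bpf(BPF_PROG_LOAD, ...)):

    if (ioctl(perf_fd, PERF_EVENT_IOC_SET_BPF, prog_fd) < 0)
        perror("PERF_EVENT_IOC_SET_BPF");

    /* no detach ioctl: closing the event fd releases the program */
    close(perf_fd);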

    Signed-off-by: Alexei Starovoitov
    Signed-off-by: David S. Miller

    Alexei Starovoitov
     

24 Aug, 2016

1 commit

  • When tearing down an AUX buf for an event via perf_mmap_close(),
    __perf_event_output_stop() is called on the event's CPU to ensure that
    trace generation is halted before the process of unmapping and
    freeing the buffer pages begins.

    The callback is performed via cpu_function_call(), which ensures that it
    runs with interrupts disabled and is therefore not preemptible.
    Unfortunately, the current code grabs the per-cpu context pointer using
    get_cpu_ptr(), which unnecessarily disables preemption and doesn't pair
    the call with put_cpu_ptr(), leading to a preempt_count() imbalance and
    a BUG when freeing the AUX buffer later on:

    WARNING: CPU: 1 PID: 2249 at kernel/events/ring_buffer.c:539 __rb_free_aux+0x10c/0x120
    Modules linked in:
    [...]
    Call Trace:
    [] dump_stack+0x4f/0x72
    [] __warn+0xc6/0xe0
    [] warn_slowpath_null+0x18/0x20
    [] __rb_free_aux+0x10c/0x120
    [] rb_free_aux+0x13/0x20
    [] perf_mmap_close+0x29e/0x2f0
    [] ? perf_iterate_ctx+0xe0/0xe0
    [] remove_vma+0x25/0x60
    [] exit_mmap+0x106/0x140
    [] mmput+0x1c/0xd0
    [] do_exit+0x253/0xbf0
    [] do_group_exit+0x3e/0xb0
    [] get_signal+0x249/0x640
    [] do_signal+0x23/0x640
    [] ? _raw_write_unlock_irq+0x12/0x30
    [] ? _raw_spin_unlock_irq+0x9/0x10
    [] ? __schedule+0x2c6/0x710
    [] exit_to_usermode_loop+0x74/0x90
    [] prepare_exit_to_usermode+0x26/0x30
    [] retint_user+0x8/0x10

    This patch uses this_cpu_ptr() instead of get_cpu_ptr(), since preemption is
    already disabled by the caller.
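
    The change boils down to this pattern inside __perf_event_output_stop()
    (sketch):

    /* before: disables preemption again and is never paired with put_cpu_ptr() */
    struct perf_cpu_context *cpuctx = get_cpu_ptr(pmu->pmu_cpu_context);

    /* after: the cpu_function_call() caller already runs us with IRQs off */
    struct perf_cpu_context *cpuctx = this_cpu_ptr(pmu->pmu_cpu_context);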

    Signed-off-by: Will Deacon
    Reviewed-by: Alexander Shishkin
    Cc: Arnaldo Carvalho de Melo
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: Vince Weaver
    Fixes: 95ff4ca26c49 ("perf/core: Free AUX pages in unmap path")
    Link: http://lkml.kernel.org/r/20160824091905.GA16944@arm.com
    Signed-off-by: Ingo Molnar

    Will Deacon
     

18 Aug, 2016

12 commits

  • Introduce the flag PMU_EV_CAP_READ_ACTIVE_PKG, useful for uncore events,
    that allows a PMU to signal the generic perf code that an event is readable
    in the current CPU if the event is active in a CPU in the same package as
    the current CPU.

    This is an optimization that avoids an unnecessary IPI for the common case
    where uncore events are run and read in the same package but in
    different CPUs.

    As an example, the IPI removal speeds up perf_read() in my Haswell system
    as follows:

    - For event UNC_C_LLC_LOOKUP: From 260 us to 31 us.
    - For event RAPL's power/energy-cores/: From 255 us to 27 us.

    For the optimization to work, all events in the group must have it
    (similarly to PERF_EV_CAP_SOFTWARE).
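
    On the core side the capability is consulted at read time, roughly like this
    (sketch, assuming the topology helpers):

    static int __perf_event_read_cpu(struct perf_event *event, int event_cpu)
    {
        int local_cpu = smp_processor_id();

        if (event->group_caps & PERF_EV_CAP_READ_ACTIVE_PKG) {
            /* same package: read locally and skip the cross-CPU IPI */
            if (topology_physical_package_id(event_cpu) ==
                topology_physical_package_id(local_cpu))
                return local_cpu;
        }

        return event_cpu;
    }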

    Signed-off-by: David Carrillo-Cisneros
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Alexander Shishkin
    Cc: Arnaldo Carvalho de Melo
    Cc: David Carrillo-Cisneros
    Cc: Jiri Olsa
    Cc: Kan Liang
    Cc: Linus Torvalds
    Cc: Paul Turner
    Cc: Peter Zijlstra
    Cc: Stephane Eranian
    Cc: Thomas Gleixner
    Cc: Vegard Nossum
    Cc: Vince Weaver
    Link: http://lkml.kernel.org/r/1471467307-61171-4-git-send-email-davidcc@google.com
    Signed-off-by: Ingo Molnar

    David Carrillo-Cisneros
     
  • Currently, PERF_GROUP_SOFTWARE is used in the group_flags field of a
    group's leader to indicate that is_software_event(event) is true for all
    events in a group. This is the only usage of event->group_flags.

    This pattern of setting a group-level flag when all events in the group
    share a property is useful for the flag introduced in the next patch and
    for future CQM/CMT flags. So this patch generalizes group_flags to work
    as an aggregate of event-level flags.

    PERF_GROUP_SOFTWARE denotes an immutable event property. All other flags
    that I intend to add are also determinable at event initialization.
    To better convey the above, this patch renames the event's group_flags to
    group_caps and PERF_GROUP_SOFTWARE to PERF_EV_CAP_SOFTWARE.

    Individual event flags are stored in the new event->event_caps. Since the
    cap flags do not change after event initialization, there is no need to
    serialize event_caps. This new field is used when events are added to a
    context, similarly to how PERF_GROUP_SOFTWARE and is_software_event()
    worked.

    Lastly, for consistency, this patch updates is_software_event() to rely on
    event_caps instead of the context index.
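
    A sketch of both pieces as described above (two separate fragments,
    illustrative only):

    /* perf_group_attach(): the group's caps are the AND of all members' caps */
    group_leader->group_caps &= event->event_caps;

    /* is_software_event() keys off the cap rather than the context index */
    static inline int is_software_event(struct perf_event *event)
    {
        return event->event_caps & PERF_EV_CAP_SOFTWARE;
    }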

    Signed-off-by: David Carrillo-Cisneros
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Alexander Shishkin
    Cc: Arnaldo Carvalho de Melo
    Cc: Jiri Olsa
    Cc: Kan Liang
    Cc: Linus Torvalds
    Cc: Paul Turner
    Cc: Peter Zijlstra
    Cc: Stephane Eranian
    Cc: Thomas Gleixner
    Cc: Vegard Nossum
    Cc: Vince Weaver
    Link: http://lkml.kernel.org/r/1471467307-61171-3-git-send-email-davidcc@google.com
    Signed-off-by: Ingo Molnar

    David Carrillo-Cisneros
     
  • When decoding the perf_regs mask in perf_output_sample_regs(),
    we loop through the mask using find_first_bit and find_next_bit functions.

    While the existing code works fine in most cases, the logic
    is broken for big-endian 32-bit kernels.

    When reading a u64 mask using (u32 *)(&val)[0], find_*_bit() assumes
    that it gets the lower 32 bits of u64, but instead it gets the upper
    32 bits - which is wrong.

    The fix is to swap the words of the u64 to handle this case.
    This is _not_ a regular endianness swap.
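
    A tiny stand-alone illustration of why reading a u64 mask through u32 words is
    endian-dependent (user-space sketch, not kernel code):

    #include <stdint.h>
    #include <stdio.h>

    int main(void)
    {
        uint64_t mask = 0xffULL;               /* bits 0-7 set */
        uint32_t *words = (uint32_t *)&mask;

        /*
         * Little endian: words[0] == 0xff, the low half the bitmap code expects.
         * Big endian 32-bit: words[0] == 0x0, the HIGH half, so find_*_bit()
         * walks the wrong word unless the two 32-bit words are swapped first.
         */
        printf("words[0]=%#x words[1]=%#x\n", words[0], words[1]);
        return 0;
    }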

    Suggested-by: Yury Norov
    Signed-off-by: Madhavan Srinivasan
    Signed-off-by: Peter Zijlstra (Intel)
    Reviewed-by: Yury Norov
    Cc: Alexander Shishkin
    Cc: Arnaldo Carvalho de Melo
    Cc: Arnaldo Carvalho de Melo
    Cc: Jiri Olsa
    Cc: Jiri Olsa
    Cc: Linus Torvalds
    Cc: Michael Ellerman
    Cc: Peter Zijlstra
    Cc: Stephane Eranian
    Cc: Thomas Gleixner
    Cc: Vince Weaver
    Cc: linuxppc-dev@lists.ozlabs.org
    Link: http://lkml.kernel.org/r/1471426568-31051-2-git-send-email-maddy@linux.vnet.ibm.com
    Signed-off-by: Ingo Molnar

    Madhavan Srinivasan
     
  • Signed-off-by: Ingo Molnar

    Ingo Molnar
     
  • The call to smp_call_function_single() in perf_event_read() may fail if
    an invalid or offline CPU index is passed. Warn the user if such a bug is
    present and return an error.

    Signed-off-by: David Carrillo-Cisneros
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Alexander Shishkin
    Cc: Arnaldo Carvalho de Melo
    Cc: Jiri Olsa
    Cc: Kan Liang
    Cc: Linus Torvalds
    Cc: Paul Turner
    Cc: Peter Zijlstra
    Cc: Stephane Eranian
    Cc: Thomas Gleixner
    Cc: Vegard Nossum
    Cc: Vince Weaver
    Link: http://lkml.kernel.org/r/1471467307-61171-2-git-send-email-davidcc@google.com
    Signed-off-by: Ingo Molnar

    David Carrillo-Cisneros
     
  • At this time the perf_addr_filter_needs_mmap() function will _not_
    return true on a user space 'stop' filter. But stop filters need
    exactly the same kind of mapping that range and start filters get.

    Signed-off-by: Mathieu Poirier
    Signed-off-by: Peter Zijlstra (Intel)
    Acked-by: Alexander Shishkin
    Cc: Arnaldo Carvalho de Melo
    Cc: Jiri Olsa
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Stephane Eranian
    Cc: Thomas Gleixner
    Cc: Vince Weaver
    Link: http://lkml.kernel.org/r/1468860187-318-4-git-send-email-mathieu.poirier@linaro.org
    Signed-off-by: Ingo Molnar

    Mathieu Poirier
     
  • Function perf_event_mmap() is called by the MM subsystem each time
    part of a binary is loaded in memory. There can be several mappings
    for a binary, many times unrelated to the code section.

    Each time a section of a binary is mapped, address filters are
    updated, even when the map doesn't pertain to the code section.
    The end result is that filters are configured based on the last map
    event that was received rather than the last mapping of the code
    segment.

    For example, suppose we have an executable 'main' that calls the library
    'libcstest.so.1.0', and we want to collect traces on code
    that is in that library. The perf command line for this scenario
    would be:

    perf record -e cs_etm// --filter 'filter 0x72c/0x40@/opt/lib/libcstest.so.1.0' --per-thread ./main

    Resulting in binaries being mapped this way:

    root@linaro-nano:~# cat /proc/1950/maps
    00400000-00401000 r-xp 00000000 08:02 33169 /home/linaro/main
    00410000-00411000 r--p 00000000 08:02 33169 /home/linaro/main
    00411000-00412000 rw-p 00001000 08:02 33169 /home/linaro/main
    7fa2464000-7fa2474000 rw-p 00000000 00:00 0
    7fa2474000-7fa25a4000 r-xp 00000000 08:02 543 /lib/aarch64-linux-gnu/libc-2.21.so
    7fa25a4000-7fa25b3000 ---p 00130000 08:02 543 /lib/aarch64-linux-gnu/libc-2.21.so
    7fa25b3000-7fa25b7000 r--p 0012f000 08:02 543 /lib/aarch64-linux-gnu/libc-2.21.so
    7fa25b7000-7fa25b9000 rw-p 00133000 08:02 543 /lib/aarch64-linux-gnu/libc-2.21.so
    7fa25b9000-7fa25bd000 rw-p 00000000 00:00 0
    7fa25bd000-7fa25be000 r-xp 00000000 08:02 38308 /opt/lib/libcstest.so.1.0
    7fa25be000-7fa25cd000 ---p 00001000 08:02 38308 /opt/lib/libcstest.so.1.0
    7fa25cd000-7fa25ce000 r--p 00000000 08:02 38308 /opt/lib/libcstest.so.1.0
    7fa25ce000-7fa25cf000 rw-p 00001000 08:02 38308 /opt/lib/libcstest.so.1.0
    7fa25cf000-7fa25eb000 r-xp 00000000 08:02 574 /lib/aarch64-linux-gnu/ld-2.21.so
    7fa25ef000-7fa25f2000 rw-p 00000000 00:00 0
    7fa25f7000-7fa25f9000 rw-p 00000000 00:00 0
    7fa25f9000-7fa25fa000 r--p 00000000 00:00 0 [vvar]
    7fa25fa000-7fa25fb000 r-xp 00000000 00:00 0 [vdso]
    7fa25fb000-7fa25fc000 r--p 0001c000 08:02 574 /lib/aarch64-linux-gnu/ld-2.21.so
    7fa25fc000-7fa25fe000 rw-p 0001d000 08:02 574 /lib/aarch64-linux-gnu/ld-2.21.so
    7ff2ea8000-7ff2ec9000 rw-p 00000000 00:00 0 [stack]
    root@linaro-nano:~#

    Before 'main()' can execute, 'libcstest.so.1.0' has to be loaded in
    memory. Once that has been done perf_event_mmap() has been called
    4 times, with the last map starting at address 0x7fa25ce000 and
    the address filter configured to start filtering when the
    IP has passed over address 0x7fa25ce72c (0x7fa25ce000 + 0x72c).

    But that is wrong since the code segment for library 'libcstest.so.1.0'
    has been mapped at 0x7fa25bd000, resulting in traces not being
    collected.

    This patch corrects the situation by requesting that address
    filters be updated only if the mapped event is for a code
    segment.
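
    The gist of the fix is a single executable-mapping check before the filters
    are touched (sketch):

    static void perf_addr_filters_adjust(struct vm_area_struct *vma)
    {
        /*
         * Data tracing isn't supported yet, so only mappings of
         * executable code are interesting to address filters.
         */
        if (!(vma->vm_flags & VM_EXEC))
            return;

        /* ... walk the contexts and update filters as before ... */
    }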

    Signed-off-by: Mathieu Poirier
    Signed-off-by: Peter Zijlstra (Intel)
    Acked-by: Alexander Shishkin
    Cc: Arnaldo Carvalho de Melo
    Cc: Jiri Olsa
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Stephane Eranian
    Cc: Thomas Gleixner
    Cc: Vince Weaver
    Link: http://lkml.kernel.org/r/1468860187-318-3-git-send-email-mathieu.poirier@linaro.org
    Signed-off-by: Ingo Molnar

    Mathieu Poirier
     
  • Binary file names have to be supplied for both range and start/stop
    filters but the current code only processes the filename if an
    address range filter is specified. This code adds processing of
    the filename for start/stop filters.
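
    With this in place a start/stop filter can name a binary just like the range
    filter example shown in the entry above; a hypothetical invocation (the
    address and library are placeholders, see perf-record(1) for the exact
    filter grammar):

    perf record -e cs_etm// --filter 'stop 0x72c@/opt/lib/libcstest.so.1.0' --per-thread ./main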

    Signed-off-by: Mathieu Poirier
    Signed-off-by: Peter Zijlstra (Intel)
    Acked-by: Alexander Shishkin
    Cc: Arnaldo Carvalho de Melo
    Cc: Jiri Olsa
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Stephane Eranian
    Cc: Thomas Gleixner
    Cc: Vince Weaver
    Link: http://lkml.kernel.org/r/1468860187-318-2-git-send-email-mathieu.poirier@linaro.org
    Signed-off-by: Ingo Molnar

    Mathieu Poirier
     
  • Vincent reported triggering the WARN_ON_ONCE() in event_function_local().

    While thinking through cases I noticed that by using event_function()
    directly, we miss the inactive case usually handled by
    event_function_call().

    Therefore construct a blend of event_function_call() and
    event_function() that handles the cases relevant to
    event_function_local().

    Reported-by: Vince Weaver
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Alexander Shishkin
    Cc: Arnaldo Carvalho de Melo
    Cc: Jiri Olsa
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Stephane Eranian
    Cc: Thomas Gleixner
    Cc: stable@vger.kernel.org # 4.5+
    Fixes: fae3fde65138 ("perf: Collapse and fix event_function_call() users")
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • Purely cosmetic, no changes in the compiled code.

    Perhaps it is just me but I can hardly read __replace_page() because I can't
    distinguish "page" from "kpage" and because I need to look at the caller
    to ensure that, say, kpage is really the new page and the code is correct.
    Rename them to old_page and new_page, this matches the caller.

    Signed-off-by: Oleg Nesterov
    Cc: Alexander Shishkin
    Cc: Alexei Starovoitov
    Cc: Arnaldo Carvalho de Melo
    Cc: Arnaldo Carvalho de Melo
    Cc: Brenden Blanco
    Cc: Jiri Olsa
    Cc: Johannes Weiner
    Cc: Linus Torvalds
    Cc: Michal Hocko
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: Vladimir Davydov
    Link: http://lkml.kernel.org/r/20160817153704.GC29724@redhat.com
    Signed-off-by: Ingo Molnar

    Oleg Nesterov
     
  • Signed-off-by: Ingo Molnar

    Ingo Molnar
     
  • __replace_page() wrongly calls mem_cgroup_cancel_charge() in the "success" path,
    it should only do this if page_check_address() fails.

    This means that every enable/disable leads to unbalanced mem_cgroup_uncharge()
    from put_page(old_page), it is trivial to underflow the page_counter->count
    and trigger OOM.

    Reported-and-tested-by: Brenden Blanco
    Signed-off-by: Oleg Nesterov
    Reviewed-by: Johannes Weiner
    Acked-by: Michal Hocko
    Cc: Alexander Shishkin
    Cc: Alexei Starovoitov
    Cc: Arnaldo Carvalho de Melo
    Cc: Arnaldo Carvalho de Melo
    Cc: Jiri Olsa
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: Vladimir Davydov
    Cc: stable@vger.kernel.org # 3.17+
    Fixes: 00501b531c47 ("mm: memcontrol: rewrite charge API")
    Link: http://lkml.kernel.org/r/20160817153629.GB29724@redhat.com
    Signed-off-by: Ingo Molnar

    Oleg Nesterov
     

10 Aug, 2016

5 commits

  • For perf record -b, which requires the pmu::sched_task callback, the
    current code is rather expensive:

    7.68% sched-pipe [kernel.vmlinux] [k] perf_pmu_sched_task
    5.95% sched-pipe [kernel.vmlinux] [k] __switch_to
    5.20% sched-pipe [kernel.vmlinux] [k] __intel_pmu_disable_all
    3.95% sched-pipe perf [.] worker_thread

    The problem is that it will iterate all registered PMUs, most of which
    will not have anything to do. Avoid this by keeping an explicit list
    of PMUs that have requested the callback.

    The perf_sched_cb_{inc,dec}() functions already take the required pmu
    argument, and now that these functions are no longer called from NMI
    context we can use them to manage a list.

    With this patch applied the function doesn't show up in the top 4
    anymore (it dropped to 18th place).

    6.67% sched-pipe [kernel.vmlinux] [k] __switch_to
    6.18% sched-pipe [kernel.vmlinux] [k] __intel_pmu_disable_all
    3.92% sched-pipe [kernel.vmlinux] [k] switch_mm_irqs_off
    3.71% sched-pipe perf [.] worker_thread
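
    Roughly, the bookkeeping described above becomes a per-CPU list that the
    sched_task path walks instead of iterating every registered PMU (sketch,
    field names assumed):

    void perf_sched_cb_inc(struct pmu *pmu)
    {
        struct perf_cpu_context *cpuctx = this_cpu_ptr(pmu->pmu_cpu_context);

        /* first user on this CPU: hook the context into the callback list */
        if (!cpuctx->sched_cb_usage++)
            list_add(&cpuctx->sched_cb_entry, this_cpu_ptr(&sched_cb_list));

        this_cpu_inc(perf_sched_cb_usages);
    }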

    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Alexander Shishkin
    Cc: Arnaldo Carvalho de Melo
    Cc: Jiri Olsa
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Stephane Eranian
    Cc: Thomas Gleixner
    Cc: Vince Weaver
    Cc: linux-kernel@vger.kernel.org
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • In order to allow optimizing perf_pmu_sched_task() we must ensure
    perf_sched_cb_{inc,dec}() are no longer called from NMI context; this
    means that pmu::{start,stop}() can no longer use them.

    Prepare for this by reworking the whole large PEBS setup code.

    The current code relied on the cpuc->pebs_enabled state, however since
    that reflects the current active state as per pmu::{start,stop}() we
    can no longer rely on this.

    Introduce two counters: cpuc->n_pebs and cpuc->n_large_pebs, which
    count the total number of PEBS events and the number of PEBS events
    that have FREERUNNING set, respectively. With this we can tell if the
    current setup requires a single-record interrupt threshold or can use
    a larger buffer.

    This also improves the code in that it re-enables the large threshold
    once the PEBS event that required single record gets removed.
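
    The threshold choice then reduces to comparing the two counters (sketch of
    the idea, field names as in the x86 PEBS code):

    static void pebs_update_threshold(struct cpu_hw_events *cpuc)
    {
        struct debug_store *ds = cpuc->ds;
        u64 threshold;

        if (cpuc->n_pebs == cpuc->n_large_pebs)
            /* every PEBS event can free-run: allow a large buffer */
            threshold = ds->pebs_absolute_maximum -
                        x86_pmu.max_pebs_events * x86_pmu.pebs_record_size;
        else
            /* at least one event wants an interrupt per record */
            threshold = ds->pebs_buffer_base + x86_pmu.pebs_record_size;

        ds->pebs_interrupt_threshold = threshold;
    }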

    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Alexander Shishkin
    Cc: Arnaldo Carvalho de Melo
    Cc: Jiri Olsa
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Stephane Eranian
    Cc: Thomas Gleixner
    Cc: Vince Weaver
    Cc: linux-kernel@vger.kernel.org
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • Groups of events are supposed to be scheduled atomically, such that it
    is possible to derive meaningful ratios between their values.

    We take great pains to achieve this when scheduling event groups to a
    PMU in group_sched_in(), calling {start,commit}_txn() (which fall back
    to perf_pmu_{disable,enable}() if necessary) to provide this guarantee.
    However we don't mirror this in group_sched_out(), and in some cases
    events will not be scheduled out atomically.

    For example, if we disable an event group with PERF_EVENT_IOC_DISABLE,
    we'll cross-call __perf_event_disable() for the group leader, and will
    call group_sched_out() without having first disabled the relevant PMU.
    We will disable/enable the PMU around each pmu->del() call, but between
    each call the PMU will be enabled and events may count.

    Avoid this by explicitly disabling and enabling the PMU around event
    removal in group_sched_out(), mirroring what we do in group_sched_in().
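
    The change amounts to bracketing the removal loop, mirroring group_sched_in()
    (sketch):

    static void group_sched_out(struct perf_event *group_event,
                                struct perf_cpu_context *cpuctx,
                                struct perf_event_context *ctx)
    {
        struct perf_event *event;

        perf_pmu_disable(ctx->pmu);

        event_sched_out(group_event, cpuctx, ctx);

        /* schedule out siblings (if any) with the PMU still disabled */
        list_for_each_entry(event, &group_event->sibling_list, group_entry)
            event_sched_out(event, cpuctx, ctx);

        perf_pmu_enable(ctx->pmu);

        /* (leader-state bookkeeping unchanged) */
    }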

    Signed-off-by: Mark Rutland
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Alexander Shishkin
    Cc: Arnaldo Carvalho de Melo
    Cc: Arnaldo Carvalho de Melo
    Cc: Jiri Olsa
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Stephane Eranian
    Cc: Thomas Gleixner
    Cc: Vince Weaver
    Link: http://lkml.kernel.org/r/1469553141-28314-1-git-send-email-mark.rutland@arm.com
    Signed-off-by: Ingo Molnar

    Mark Rutland
     
  • There's a perf stat bug easy to observe on a machine with only one cgroup:

    $ perf stat -e cycles -I 1000 -C 0 -G /
    # time counts unit events
    1.000161699      <not counted>      cycles      /
    2.000355591      <not counted>      cycles      /
    3.000565154      <not counted>      cycles      /
    4.000951350      <not counted>      cycles      /

    We'd expect some output there.

    The underlying problem is that there is an optimization in
    perf_cgroup_sched_{in,out}() that skips the switch of cgroup events
    if the old and new cgroups in a task switch are the same.

    This optimization interacts with the current code in two ways
    that cause a CPU context's cgroup (cpuctx->cgrp) to be NULL even if a
    cgroup event matches the current task. These are:

    1. On creation of the first cgroup event in a CPU: In the current code,
    cpuctx->cgrp is only set in perf_cgroup_sched_in, but due to the
    aforesaid optimization, perf_cgroup_sched_in will not run until the next
    cgroup switch in that CPU. This may happen late or never happen,
    depending on the system's number of cgroups, CPU load, etc.

    2. On deletion of the last cgroup event in a cpuctx: In list_del_event,
    cpuctx->cgrp is set NULL. Any new cgroup event will not be sched in
    because cpuctx->cgrp == NULL until a cgroup switch occurs and
    perf_cgroup_sched_in is executed (updating cpuctx->cgrp).

    This patch fixes both problems by setting cpuctx->cgrp in list_add_event,
    mirroring what list_del_event does when removing a cgroup event from CPU
    context, as introduced in:

    commit 68cacd29167b ("perf_events: Fix stale ->cgrp pointer in update_cgrp_time_from_cpuctx()")

    With this patch, cpuctx->cgrp is always set/cleared when installing/removing
    the first/last cgroup event in/from the CPU context. With cpuctx->cgrp
    correctly set, event_filter_match() works as intended when events are
    sched in/out.

    After the fix, the output is as expected:

    $ perf stat -e cycles -I 1000 -a -G /
    # time counts unit events
    1.004699159 627342882 cycles /
    2.007397156 615272690 cycles /
    3.010019057 616726074 cycles /
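
    A condensed sketch of the mechanism: a helper shared by the list add/del
    paths keeps cpuctx->cgrp in sync with the first/last cgroup event (names
    assumed from the description above):

    static void list_update_cgroup_event(struct perf_event *event,
                                         struct perf_event_context *ctx, bool add)
    {
        struct perf_cpu_context *cpuctx;

        if (!is_cgroup_event(event))
            return;

        /* only the first add or the last delete changes cpuctx->cgrp */
        if (add && ctx->nr_cgroups++)
            return;
        else if (!add && --ctx->nr_cgroups)
            return;

        /* cgroup events are per-CPU events, so this runs on the right CPU */
        cpuctx = __get_cpu_context(ctx);
        cpuctx->cgrp = add ? event->cgrp : NULL;
    }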

    Signed-off-by: David Carrillo-Cisneros
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Alexander Shishkin
    Cc: Arnaldo Carvalho de Melo
    Cc: Jiri Olsa
    Cc: Kan Liang
    Cc: Linus Torvalds
    Cc: Paul Turner
    Cc: Peter Zijlstra
    Cc: Stephane Eranian
    Cc: Thomas Gleixner
    Cc: Vegard Nossum
    Cc: Vince Weaver
    Link: http://lkml.kernel.org/r/1470124092-113192-1-git-send-email-davidcc@google.com
    Signed-off-by: Ingo Molnar

    David Carrillo-Cisneros
     
  • Vegard Nossum reported that perf fuzzing generates a NULL
    pointer dereference crash:

    > Digging a bit deeper into this, it seems the event itself is getting
    > created by perf_event_open() and it gets added to the pmu_event_list
    > through:
    >
    > perf_event_open()
    > - perf_event_alloc()
    > - account_event()
    > - account_pmu_sb_event()
    > - attach_sb_event()
    >
    > so at this point the event is being attached but its ->ctx is still
    > NULL. It seems like ->ctx is set just a bit later in
    > perf_event_open(), though.
    >
    > But before that, __schedule() comes along and creates a stack trace
    > similar to the one above:
    >
    > __schedule()
    > - __perf_event_task_sched_out()
    > - perf_iterate_sb()
    > - perf_iterate_sb_cpu()
    > - event_filter_match()
    > - perf_cgroup_match()
    > - __get_cpu_context()
    > - (dereference ctx which is NULL)
    >
    > So I guess the question is... should the event be attached (= put on
    > the list) before ->ctx gets set? Or should the cgroup code check for a
    > NULL ->ctx?

    The latter seems like the simplest solution. Moving the list-add later
    creates a bit of a mess.
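
    Conceptually the check lands in the side-band iteration, along these lines
    (sketch only; the exact barrier pairing is the subtle part):

    list_for_each_entry_rcu(event, &pel->list, sb_list) {
        /*
         * Skip events that are not fully constructed yet: ->ctx is only
         * assigned after the event has been put on pmu_event_list.
         */
        if (!smp_load_acquire(&event->ctx))
            continue;

        if (event->state < PERF_EVENT_STATE_ACTIVE)
            continue;
        if (!event_filter_match(event))
            continue;
        output(event, data);
    }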

    Reported-by: Vegard Nossum
    Tested-by: Vegard Nossum
    Tested-by: Vince Weaver
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Alexander Shishkin
    Cc: Arnaldo Carvalho de Melo
    Cc: David Carrillo-Cisneros
    Cc: Jiri Olsa
    Cc: Kan Liang
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Stephane Eranian
    Cc: Thomas Gleixner
    Fixes: f2fb6bef9251 ("perf/core: Optimize side-band event delivery")
    Link: http://lkml.kernel.org/r/20160804123724.GN6862@twins.programming.kicks-ass.net
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     

02 Aug, 2016

1 commit

  • When the perf interrupt handler exceeds a threshold, warning messages
    are displayed on the console:

    [12739.31793] perf interrupt took too long (2504 > 2500), lowering kernel.perf_event_max_sample_rate to 50000
    [71340.165065] perf interrupt took too long (5005 > 5000), lowering kernel.perf_event_max_sample_rate to 25000

    Many customers and users are confused by the message, wondering if
    something is wrong or whether they need to take action to fix a problem.
    Since a user cannot do anything to fix the issue, the message is really
    more informational than a warning. Adjust the log level accordingly.
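
    The change itself is just the printk level of that message; a sketch with the
    arguments elided:

    -    printk_ratelimited(KERN_WARNING
    +    printk_ratelimited(KERN_INFO
             "perf interrupt took too long (%lld > %lld), lowering "
             "kernel.perf_event_max_sample_rate to %d\n", ...);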

    Signed-off-by: David Ahern
    Cc: Alexander Shishkin
    Cc: Arnaldo Carvalho de Melo
    Cc: Jiri Olsa
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Link: http://lkml.kernel.org/r/1470084569-438-1-git-send-email-dsa@cumulusnetworks.com
    Signed-off-by: Ingo Molnar

    David Ahern
     

30 Jul, 2016

1 commit

  • Pull smp hotplug updates from Thomas Gleixner:
    "This is the next part of the hotplug rework.

    - Convert all notifiers with a priority assigned

    - Convert all CPU_STARTING/DYING notifiers

    The final removal of the STARTING/DYING infrastructure will happen
    when the merge window closes.

    Another 700 hundred line of unpenetrable maze gone :)"

    * 'smp-hotplug-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (70 commits)
    timers/core: Correct callback order during CPU hot plug
    leds/trigger/cpu: Move from CPU_STARTING to ONLINE level
    powerpc/numa: Convert to hotplug state machine
    arm/perf: Fix hotplug state machine conversion
    irqchip/armada: Avoid unused function warnings
    ARC/time: Convert to hotplug state machine
    clocksource/atlas7: Convert to hotplug state machine
    clocksource/armada-370-xp: Convert to hotplug state machine
    clocksource/exynos_mct: Convert to hotplug state machine
    clocksource/arm_global_timer: Convert to hotplug state machine
    rcu: Convert rcutree to hotplug state machine
    KVM/arm/arm64/vgic-new: Convert to hotplug state machine
    smp/cfd: Convert core to hotplug state machine
    x86/x2apic: Convert to CPU hotplug state machine
    profile: Convert to hotplug state machine
    timers/core: Convert to hotplug state machine
    hrtimer: Convert to hotplug state machine
    x86/tboot: Convert to hotplug state machine
    arm64/armv8 deprecated: Convert to hotplug state machine
    hwtracing/coresight-etm4x: Convert to hotplug state machine
    ...

    Linus Torvalds
     

28 Jul, 2016

1 commit

  • Pull networking updates from David Miller:

    1) Unified UDP encapsulation offload methods for drivers, from
    Alexander Duyck.

    2) Make DSA binding more sane, from Andrew Lunn.

    3) Support QCA9888 chips in ath10k, from Anilkumar Kolli.

    4) Several workqueue usage cleanups, from Bhaktipriya Shridhar.

    5) Add XDP (eXpress Data Path), essentially running BPF programs on RX
    packets as soon as the device sees them, with the option to mirror
    the packet on TX via the same interface. From Brenden Blanco and
    others.

    6) Allow qdisc/class stats dumps to run lockless, from Eric Dumazet.

    7) Add VLAN support to b53 and bcm_sf2, from Florian Fainelli.

    8) Simplify netlink conntrack entry layout, from Florian Westphal.

    9) Add ipv4 forwarding support to mlxsw spectrum driver, from Ido
    Schimmel, Yotam Gigi, and Jiri Pirko.

    10) Add SKB array infrastructure and convert tun and macvtap over to it.
    From Michael S Tsirkin and Jason Wang.

    11) Support qdisc packet injection in pktgen, from John Fastabend.

    12) Add neighbour monitoring framework to TIPC, from Jon Paul Maloy.

    13) Add NV congestion control support to TCP, from Lawrence Brakmo.

    14) Add GSO support to SCTP, from Marcelo Ricardo Leitner.

    15) Allow GRO and RPS to function on macsec devices, from Paolo Abeni.

    16) Support MPLS over IPV4, from Simon Horman.

    * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1622 commits)
    xgene: Fix build warning with ACPI disabled.
    be2net: perform temperature query in adapter regardless of its interface state
    l2tp: Correctly return -EBADF from pppol2tp_getname.
    net/mlx5_core/health: Remove deprecated create_singlethread_workqueue
    net: ipmr/ip6mr: update lastuse on entry change
    macsec: ensure rx_sa is set when validation is disabled
    tipc: dump monitor attributes
    tipc: add a function to get the bearer name
    tipc: get monitor threshold for the cluster
    tipc: make cluster size threshold for monitoring configurable
    tipc: introduce constants for tipc address validation
    net: neigh: disallow transition to NUD_STALE if lladdr is unchanged in neigh_update()
    MAINTAINERS: xgene: Add driver and documentation path
    Documentation: dtb: xgene: Add MDIO node
    dtb: xgene: Add MDIO node
    drivers: net: xgene: ethtool: Use phy_ethtool_gset and sset
    drivers: net: xgene: Use exported functions
    drivers: net: xgene: Enable MDIO driver
    drivers: net: xgene: Add backward compatibility
    drivers: net: phy: xgene: Add MDIO driver
    ...

    Linus Torvalds
     

26 Jul, 2016

1 commit

  • This patch fixes the __output_custom() routine we currently use with
    bpf_skb_copy(). I missed that when len is larger than the size of the
    current handle, we can issue multiple invocations of copy_func, and
    __output_custom() advances destination but also source buffer by the
    written amount of bytes. When we have __output_custom(), this is actually
    wrong since in that case the source buffer points to a non-linear object,
    in our case an skb, which the copy_func helper is supposed to walk.
    Therefore, since this is non-linear, we need to pass the offset into
    the helper, so that copy_func can use it for extracting the data from
    the source object.

    Therefore, adjust the callback signatures properly and pass offset
    into the skb_header_pointer() invoked from bpf_skb_copy() callback. The
    __DEFINE_OUTPUT_COPY_BODY() is adjusted to accommodate two things:
    i) to pass in whether we should advance the source buffer or not; this is
    a compile-time constant condition, ii) to pass in the offset for
    __output_custom(), which we do with the help of __VA_ARGS__, so everything
    can stay inlined as is currently. Both changes allow for adapting the
    __output_* fast-path helpers w/o extra overhead.

    Fixes: 555c8a8623a3 ("bpf: avoid stack copy and use skb ctx for event output")
    Fixes: 7e3f977edd0b ("perf, events: add non-linear data support for raw records")
    Signed-off-by: Daniel Borkmann
    Acked-by: Alexei Starovoitov
    Signed-off-by: David S. Miller

    Daniel Borkmann
     

24 Jul, 2016

1 commit