04 Aug, 2015

1 commit

  • commit d6726c8145290bef950ae2538ea6ae1d96a1944b upstream.

    He Kuang noticed that the trace event samples for arrays were broken:

    "The output result of trace_foo_bar event in traceevent samples is
    wrong. This problem can be reproduced as following:

    (Build kernel with SAMPLE_TRACE_EVENTS=m)

    $ insmod trace-events-sample.ko

    $ echo 1 > /sys/kernel/debug/tracing/events/sample-trace/foo_bar/enable

    $ cat /sys/kernel/debug/tracing/trace

    event-sample-980 [000] .... 43.649559: foo_bar: foo hello 21 0x15
    BIT1|BIT3|0x10 {0x1,0x6f6f6e53,0xff007970,0xffffffff} Snoopy
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    The array length is not right, should be {0x1}.
    (ffffffff,ffffffff)

    event-sample-980 [000] .... 44.653827: foo_bar: foo hello 22 0x16
    BIT2|BIT3|0x10
    {0x1,0x2,0x646e6147,0x666c61,0xffffffff,0xffffffff,0x750aeffe,0x7}
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    The array length is not right, should be {0x1,0x2}.
    Gandalf (ffffffff,ffffffff)"

    This was caused by an update to have __print_array()'s second parameter
    be the count of items in the array and not the size of the array.

    As there are already users of __print_array(), it cannot change. But
    the sample code can, and we can also improve the documentation about
    __print_array() and __get_dynamic_array_len().
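
    To make the count-versus-size distinction concrete, here is a minimal
    user-space sketch. It is not kernel code: format_u32_array() and
    bytes_to_count() are hypothetical stand-ins that only mimic the
    convention described above, where the second argument is an element
    count and a byte length has to be divided by the element size first.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical stand-in for an array printer whose second data argument
 * is the number of elements (not the size of the array in bytes).
 * Writes "{0x1,0x2}" style text into out and returns the length used.
 */
size_t format_u32_array(char *out, size_t out_len,
                        const uint32_t *items, size_t count)
{
    size_t used = 0;
    size_t i;

    assert(out != NULL && out_len > 0);
    out[0] = '\0';
    used += (size_t)snprintf(out + used, out_len - used, "{");
    for (i = 0; i < count && used < out_len; i++)
        used += (size_t)snprintf(out + used, out_len - used, "%s0x%x",
                                 i ? "," : "", (unsigned int)items[i]);
    if (used < out_len)
        used += (size_t)snprintf(out + used, out_len - used, "}");
    return used;
}

/*
 * Convert a byte length into an element count. Passing the byte length
 * straight through as the count would read past the end of the array.
 */
size_t bytes_to_count(size_t byte_len, size_t el_size)
{
    assert(el_size > 0);
    return byte_len / el_size;
}
```

    With a two-element array, passing sizeof(array) as the count would ask
    for eight elements on a system with 4-byte integers, which is the kind
    of over-read visible in the broken trace output quoted above.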

    Link: http://lkml.kernel.org/r/1436839171-31527-2-git-send-email-hekuang@huawei.com

    Fixes: ac01ce1410fc2 ("tracing: Make ftrace_print_array_seq compute buf_len")
    Reported-by: He Kuang
    Signed-off-by: Steven Rostedt
    Signed-off-by: Greg Kroah-Hartman

    Steven Rostedt (Red Hat)
     

17 Apr, 2015

2 commits

  • 1.
    first bug is a silly mistake. It broke tracing examples and prevented
    simple bpf programs from loading.

    In the following code:
    if (insn->imm == 0 && BPF_SIZE(insn->code) == BPF_W) {
    } else if (...) {
            // this part should have been executed when
            // insn->code == BPF_W and insn->imm != 0
    }

    Obviously it's not doing that. So simple instructions like:
    r2 = *(u64 *)(r1 + 8)
    will be rejected. Note that the comments in the code around these
    branches were, and still are, valid and indicate the true intent.

    Replace it with:
    if (BPF_SIZE(insn->code) != BPF_W)
            continue;

    if (insn->imm == 0) {
    } else if (...) {
            // now this code will be executed when
            // insn->code == BPF_W and insn->imm != 0
    }

    2.
    second bug is more subtle.
    If malicious code is using the same dest register as source register,
    the checks designed to prevent the same instruction from being used with
    different pointer types will fail to trigger, since we were assigning
    src_reg_type when it was already overwritten by check_mem_access().
    The fix is trivial. Just move the line:
    src_reg_type = regs[insn->src_reg].type;
    before check_mem_access().
    Add new 'access skb fields bad4' test to check this case.
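
    The ordering problem behind the second bug can be reproduced in a few
    lines of user-space C. This is only a sketch under simplifying
    assumptions: the enum values borrow the verifier's names but are
    defined locally, and struct load_insn, fake_check_mem_access(),
    src_type_buggy() and src_type_fixed() are hypothetical helpers, not
    verifier code.

```c
#include <assert.h>
#include <stddef.h>

/* Register states, named after the verifier's but defined locally here. */
enum reg_type { UNKNOWN_VALUE, PTR_TO_CTX };

/* Simplified load instruction: dst_reg = *(src_reg + off). */
struct load_insn {
    int dst_reg;
    int src_reg;
};

/*
 * Stand-in for the access check: after a successful load the destination
 * register no longer holds a pointer, so its state is overwritten.
 */
void fake_check_mem_access(enum reg_type *regs, const struct load_insn *insn)
{
    assert(regs != NULL && insn != NULL);
    regs[insn->dst_reg] = UNKNOWN_VALUE;
}

/* Buggy order: the source type is read after it may have been clobbered. */
enum reg_type src_type_buggy(enum reg_type *regs, const struct load_insn *insn)
{
    fake_check_mem_access(regs, insn);
    return regs[insn->src_reg];
}

/* Fixed order: remember the source type before running the access check. */
enum reg_type src_type_fixed(enum reg_type *regs, const struct load_insn *insn)
{
    enum reg_type src_reg_type = regs[insn->src_reg];

    fake_check_mem_access(regs, insn);
    return src_reg_type;
}
```

    When dst_reg and src_reg are the same register, the buggy variant
    reports the overwritten state, so a later comparison of pointer types
    has nothing meaningful to compare against.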

    Fixes: 9bac3d6d548e ("bpf: allow extended BPF programs access skb fields")
    Signed-off-by: Alexei Starovoitov
    Signed-off-by: David S. Miller

    Alexei Starovoitov
     
  • For the short-term solution, let's fix the bpf helper functions to use
    skb->mac_header relative offsets instead of skb->data in order to
    get the same eBPF programs with cls_bpf and act_bpf to work on the
    ingress and egress qdisc paths. We need to ensure that mac_header is set
    before calling into programs. This is effectively the first option
    from below referenced discussion.

    A longer-term solution for LD_ABS|LD_IND instructions will be more
    intrusive but also more beneficial than this, and will be implemented
    later, as it's too risky at this point in time.

    I.e., we plan to look into the option of moving skb_pull() out of
    eth_type_trans() and into netif_receive_skb() as has been suggested
    as second option. Meanwhile, this solution ensures ingress can be
    used with eBPF, too, and that we won't run into ABI troubles later.
    For dealing with negative offsets inside eBPF helper functions,
    we've implemented bpf_skb_clone_unwritable() to test for unwriteable
    headers.
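
    As a rough illustration of why mac_header relative offsets give the
    same result on both paths, consider the following user-space sketch.
    It assumes that on ingress the data pointer has already been advanced
    past the Ethernet header while on egress it still points at it; struct
    fake_skb and both load helpers are hypothetical and far simpler than
    struct sk_buff.

```c
#include <assert.h>
#include <stddef.h>

#define FAKE_ETH_HLEN 14 /* Ethernet header length in bytes */

/* Hypothetical, heavily simplified packet buffer; not struct sk_buff. */
struct fake_skb {
    const unsigned char *head; /* start of the buffer */
    size_t mac_header;         /* offset of the link-layer header */
    size_t data;               /* offset of the current data pointer */
};

/* Offset interpreted relative to the data pointer (path dependent). */
unsigned char load_from_data(const struct fake_skb *skb, size_t off)
{
    assert(skb != NULL);
    return skb->head[skb->data + off];
}

/* Offset interpreted relative to the MAC header (same on both paths). */
unsigned char load_from_mac_header(const struct fake_skb *skb, size_t off)
{
    assert(skb != NULL);
    return skb->head[skb->mac_header + off];
}
```

    A program that reads offset 12 through load_from_mac_header() sees the
    same byte whether the packet is described as ingress or egress, while
    load_from_data() returns different bytes for the two cases.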

    Reference: http://thread.gmane.org/gmane.linux.network/359129/focus=359694
    Fixes: 608cd71a9c7c ("tc: bpf: generalize pedit action")
    Fixes: 91bc4822c3d6 ("tc: bpf: add checksum helpers")
    Signed-off-by: Alexei Starovoitov
    Signed-off-by: Daniel Borkmann
    Signed-off-by: David S. Miller

    Alexei Starovoitov
     

16 Apr, 2015

1 commit

  • Pull networking updates from David Miller:

    1) Add BQL support to via-rhine, from Tino Reichardt.

    2) Integrate SWITCHDEV layer support into the DSA layer, so DSA drivers
    can support hw switch offloading. From Florian Fainelli.

    3) Allow 'ip address' commands to initiate multicast group join/leave,
    from Madhu Challa.

    4) Many ipv4 FIB lookup optimizations from Alexander Duyck.

    5) Support EBPF in cls_bpf classifier and act_bpf action, from Daniel
    Borkmann.

    6) Remove the ugly compat support in ARP for ugly layers like ax25,
    rose, etc. And use this to clean up the neigh layer, then use it to
    implement MPLS support. All from Eric Biederman.

    7) Support L3 forwarding offloading in switches, from Scott Feldman.

    8) Collapse the LOCAL and MAIN ipv4 FIB tables when possible, to speed
    up route lookups even further. From Alexander Duyck.

    9) Many improvements and bug fixes to the rhashtable implementation,
    from Herbert Xu and Thomas Graf. In particular, in the case where
    an rhashtable user bulk adds a large number of items into an empty
    table, we expand the table much more sanely.

    10) Don't make the tcp_metrics hash table per-namespace, from Eric
    Biederman.

    11) Extend EBPF to access SKB fields, from Alexei Starovoitov.

    12) Split out new connection request sockets so that they can be
    established in the main hash table. Much less false sharing since
    hash lookups go direct to the request sockets instead of having to
    go first to the listener then to the request socks hashed
    underneath. From Eric Dumazet.

    13) Add async I/O support for crypto AF_ALG sockets, from Tadeusz Struk.

    14) Support stable privacy address generation for RFC7217 in IPV6. From
    Hannes Frederic Sowa.

    15) Hash network namespace into IP frag IDs, also from Hannes Frederic
    Sowa.

    16) Convert PTP get/set methods to use 64-bit time, from Richard
    Cochran.

    * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1816 commits)
    fm10k: Bump driver version to 0.15.2
    fm10k: corrected VF multicast update
    fm10k: mbx_update_max_size does not drop all oversized messages
    fm10k: reset head instead of calling update_max_size
    fm10k: renamed mbx_tx_dropped to mbx_tx_oversized
    fm10k: update xcast mode before synchronizing multicast addresses
    fm10k: start service timer on probe
    fm10k: fix function header comment
    fm10k: comment next_vf_mbx flow
    fm10k: don't handle mailbox events in iov_event path and always process mailbox
    fm10k: use separate workqueue for fm10k driver
    fm10k: Set PF queues to unlimited bandwidth during virtualization
    fm10k: expose tx_timeout_count as an ethtool stat
    fm10k: only increment tx_timeout_count in Tx hang path
    fm10k: remove extraneous "Reset interface" message
    fm10k: separate PF only stats so that VF does not display them
    fm10k: use hw->mac.max_queues for stats
    fm10k: only show actual queues, not the maximum in hardware
    fm10k: allow creation of VLAN on default vid
    fm10k: fix unused warnings
    ...

    Linus Torvalds
     

15 Apr, 2015

3 commits

  • Pull perf changes from Ingo Molnar:
    "Core kernel changes:

    - One of the more interesting features in this cycle is the ability
    to attach eBPF programs (user-defined, sandboxed bytecode executed
    by the kernel) to kprobes.

    This allows user-defined instrumentation on a live kernel image
    that can never crash, hang or interfere with the kernel negatively.
    (Right now it's limited to root-only, but in the future we might
    allow unprivileged use as well.)

    (Alexei Starovoitov)

    - Another non-trivial feature is per event clockid support: this
    allows, amongst other things, the selection of different clock
    sources for event timestamps traced via perf.

    This feature is sought by people who'd like to merge perf generated
    events with external events that were measured with different
    clocks:

    - cluster wide profiling

    - for system wide tracing with user-space events,

    - JIT profiling events

    etc. Matching perf tooling support is added as well, available via
    the -k, --clockid parameter to perf record et al.

    (Peter Zijlstra)

    Hardware enablement kernel changes:

    - x86 Intel Processor Trace (PT) support: which is a hardware tracer
    on steroids, available on Broadwell CPUs.

    The hardware trace stream is directly output into the user-space
    ring-buffer, using the 'AUX' data format extension that was added
    to the perf core to support hardware constraints such as the
    necessity to have the tracing buffer physically contiguous.

    This patch-set was developed for two years and this is the result.
    A simple way to make use of this is to use BTS tracing, the PT
    driver emulates BTS output - available via the 'intel_bts' PMU.
    More explicit PT specific tooling support is in the works as well -
    will probably be ready by 4.2.

    (Alexander Shishkin, Peter Zijlstra)

    - x86 Intel Cache QoS Monitoring (CQM) support: this is a hardware
    feature of Intel Xeon CPUs that allows the measurement and
    allocation/partitioning of caches to individual workloads.

    These kernel changes expose the measurement side as a new PMU
    driver, which exposes various QoS related PMU events. (The
    partitioning change is work in progress and is planned to be merged
    as a cgroup extension.)

    (Matt Fleming, Peter Zijlstra; CPU feature detection by Peter P
    Waskiewicz Jr)

    - x86 Intel Haswell LBR call stack support: this is a new Haswell
    feature that allows the hardware recording of call chains, plus
    tooling support. To activate this feature you have to enable it
    via the new 'lbr' call-graph recording option:

    perf record --call-graph lbr
    perf report

    or:

    perf top --call-graph lbr

    This hardware feature is a lot faster than stack walk or dwarf
    based unwinding, but has some limitations:

    - It reuses the current LBR facility, so LBR call stack and
    branch record can not be enabled at the same time.

    - It is only available for user-space callchains.

    (Yan, Zheng)

    - x86 Intel Broadwell CPU support and various event constraints and
    event table fixes for earlier models.

    (Andi Kleen)

    - x86 Intel HT CPUs event scheduling workarounds. This is a complex
    CPU bug affecting the SNB,IVB,HSW families that results in counter
    value corruption. The mitigation code is automatically enabled and
    is transparent.

    (Maria Dimakopoulou, Stephane Eranian)

    The perf tooling side had a ton of changes in this cycle as well, so
    I'm only able to list the user visible changes here, in addition to
    the tooling changes outlined above:

    User visible changes affecting all tools:

    - Improve support of compressed kernel modules (Jiri Olsa)
    - Save DSO loading errno to better report errors (Arnaldo Carvalho de Melo)
    - Bash completion for subcommands (Yunlong Song)
    - Add 'I' event modifier for perf_event_attr.exclude_idle bit (Jiri Olsa)
    - Support missing -f to override perf.data file ownership. (Yunlong Song)
    - Show the first event with an invalid filter (David Ahern, Arnaldo Carvalho de Melo)

    User visible changes in individual tools:

    'perf data':

    New tool for converting perf.data to other formats, initially
    for the CTF (Common Trace Format) from LTTng (Jiri Olsa,
    Sebastian Siewior)

    'perf diff':

    Add --kallsyms option (David Ahern)

    'perf list':

    Allow listing events with 'tracepoint' prefix (Yunlong Song)

    Sort the output of the command (Yunlong Song)

    'perf kmem':

    Respect -i option (Jiri Olsa)

    Print big numbers using thousands' group (Namhyung Kim)

    Allow -v option (Namhyung Kim)

    Fix alignment of slab result table (Namhyung Kim)

    'perf probe':

    Support multiple probes on different binaries on the same command line (Masami Hiramatsu)

    Support unnamed union/structure members data collection. (Masami Hiramatsu)

    Check kprobes blacklist when adding new events. (Masami Hiramatsu)

    'perf record':

    Teach 'perf record' about perf_event_attr.clockid (Peter Zijlstra)

    Support recording running/enabled time (Andi Kleen)

    'perf sched':

    Improve the performance of 'perf sched replay' on high CPU core count machines (Yunlong Song)

    'perf report' and 'perf top':

    Allow annotating entries in callchains in the hists browser (Arnaldo Carvalho de Melo)

    Indicate which callchain entries are annotated in the
    TUI hists browser (Arnaldo Carvalho de Melo)

    Add pid/tid filtering to 'report' and 'script' commands (David Ahern)

    Consider PERF_RECORD_ events with cpumode == 0 in 'perf top', removing one
    cause of long term memory usage buildup, i.e. not processing PERF_RECORD_EXIT
    events (Arnaldo Carvalho de Melo)

    'perf stat':

    Report unsupported events properly (Suzuki K. Poulose)

    Output running time and run/enabled ratio in CSV mode (Andi Kleen)

    'perf trace':

    Handle legacy syscalls tracepoints (David Ahern, Arnaldo Carvalho de Melo)

    Only insert blank duration bracket when tracing syscalls (Arnaldo Carvalho de Melo)

    Filter out the trace pid when no threads are specified (Arnaldo Carvalho de Melo)

    Dump stack on segfaults (Arnaldo Carvalho de Melo)

    No need to explicitly enable evsels for workload started from perf, let it
    be enabled via perf_event_attr.enable_on_exec, removing some events that take
    place in the 'perf trace' before a workload is really started by it.
    (Arnaldo Carvalho de Melo)

    Allow mixing with tracepoints and suppressing plain syscalls. (Arnaldo Carvalho de Melo)

    There's also been a ton of infrastructure work done, such as the
    split-out of perf's build system into tools/build/ and other changes -
    see the shortlog and changelog for details"

    * 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (358 commits)
    perf/x86/intel/pt: Clean up the control flow in pt_pmu_hw_init()
    perf evlist: Fix type for references to data_head/tail
    perf probe: Check the orphaned -x option
    perf probe: Support multiple probes on different binaries
    perf buildid-list: Fix segfault when show DSOs with hits
    perf tools: Fix cross-endian analysis
    perf tools: Fix error path to do closedir() when synthesizing threads
    perf tools: Fix synthesizing fork_event.ppid for non-main thread
    perf tools: Add 'I' event modifier for exclude_idle bit
    perf report: Don't call map__kmap if map is NULL.
    perf tests: Fix attr tests
    perf probe: Fix ARM 32 building error
    perf tools: Merge all perf_event_attr print functions
    perf record: Add clockid parameter
    perf sched replay: Use replay_repeat to calculate the runavg of cpu usage instead of the default value 10
    perf sched replay: Support using -f to override perf.data file ownership
    perf sched replay: Fix the EMFILE error caused by the limitation of the maximum open files
    perf sched replay: Handle the dead halt of sem_wait when create_tasks() fails for any task
    perf sched replay: Fix the segmentation fault problem caused by pr_err in threads
    perf sched replay: Realloc the memory of pid_to_task stepwise to adapt to the different pid_max configurations
    ...

    Linus Torvalds
     
  • Pull tracing updates from Steven Rostedt:
    "Some clean ups and small fixes, but the biggest change is the addition
    of the TRACE_DEFINE_ENUM() macro that can be used by tracepoints.

    Tracepoints have helper functions for the TP_printk() called
    __print_symbolic() and __print_flags() that let a numeric value be
    displayed as human comprehensible text. What is placed in the
    TP_printk() is also shown in the tracepoint format file such that user
    space tools like perf and trace-cmd can parse the binary data and
    express the values too. Unfortunately, the way the TRACE_EVENT()
    macro works, anything placed in the TP_printk() will be shown pretty
    much exactly as is. The problem arises when enums are used. That's
    because unlike macros, enums will not be changed into their values by
    the C pre-processor. Thus, the enum string is exported to the format
    file, and this makes it useless for user space tools.

    The TRACE_DEFINE_ENUM() solves this by converting the enum strings in
    the TP_printk() format into their number, and that is what is shown to
    user space. For example, the tracepoint tlb_flush currently has this
    in its format file:

    __print_symbolic(REC->reason,
    { TLB_FLUSH_ON_TASK_SWITCH, "flush on task switch" },
    { TLB_REMOTE_SHOOTDOWN, "remote shootdown" },
    { TLB_LOCAL_SHOOTDOWN, "local shootdown" },
    { TLB_LOCAL_MM_SHOOTDOWN, "local mm shootdown" })

    After adding:

    TRACE_DEFINE_ENUM(TLB_FLUSH_ON_TASK_SWITCH);
    TRACE_DEFINE_ENUM(TLB_REMOTE_SHOOTDOWN);
    TRACE_DEFINE_ENUM(TLB_LOCAL_SHOOTDOWN);
    TRACE_DEFINE_ENUM(TLB_LOCAL_MM_SHOOTDOWN);

    Its format file will contain this:

    __print_symbolic(REC->reason,
    { 0, "flush on task switch" },
    { 1, "remote shootdown" },
    { 2, "local shootdown" },
    { 3, "local mm shootdown" })"
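
    The preprocessor behaviour that motivates TRACE_DEFINE_ENUM() can be
    seen in a small stand-alone C sketch. The names below
    (DEMO_REASON_MACRO, DEMO_REASON_ENUM, STRINGIFY and the helper
    functions) are made up for illustration and are not part of the
    tracing infrastructure.

```c
#include <assert.h>
#include <string.h>

/* Two-level stringification so that macro arguments are expanded first. */
#define STRINGIFY_RAW(x) #x
#define STRINGIFY(x) STRINGIFY_RAW(x)

/* The same value expressed once as a macro and once as an enum. */
#define DEMO_REASON_MACRO 1
enum demo_reason { DEMO_REASON_ENUM = 1 };

/* The macro is replaced by the preprocessor, so the text is "1". */
const char *macro_as_text(void)
{
    return STRINGIFY(DEMO_REASON_MACRO);
}

/* The enum is invisible to the preprocessor, so the name survives. */
const char *enum_as_text(void)
{
    return STRINGIFY(DEMO_REASON_ENUM);
}

/* Returns 1 if text is a plain decimal number a parser could use as is. */
int is_plain_number(const char *text)
{
    assert(text != NULL);
    return text[0] != '\0' && strspn(text, "0123456789") == strlen(text);
}
```

    A user-space tool reading the stringified enum gets a symbol name it
    cannot resolve, which is the situation the format file was in before
    the enum names were mapped to their values.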

    * tag 'trace-v4.1' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace: (27 commits)
    tracing: Add enum_map file to show enums that have been mapped
    writeback: Export enums used by tracepoint to user space
    v4l: Export enums used by tracepoints to user space
    SUNRPC: Export enums in tracepoints to user space
    mm: tracing: Export enums in tracepoints to user space
    irq/tracing: Export enums in tracepoints to user space
    f2fs: Export the enums in the tracepoints to userspace
    net/9p/tracing: Export enums in tracepoints to userspace
    x86/tlb/trace: Export enums in used by tlb_flush tracepoint
    tracing/samples: Update the trace-event-sample.h with TRACE_DEFINE_ENUM()
    tracing: Allow for modules to convert their enums to values
    tracing: Add TRACE_DEFINE_ENUM() macro to map enums to their values
    tracing: Update trace-event-sample with TRACE_SYSTEM_VAR documentation
    tracing: Give system name a pointer
    brcmsmac: Move each system tracepoints to their own header
    iwlwifi: Move each system tracepoints to their own header
    mac80211: Move message tracepoints to their own header
    tracing: Add TRACE_SYSTEM_VAR to xhci-hcd
    tracing: Add TRACE_SYSTEM_VAR to kvm-s390
    tracing: Add TRACE_SYSTEM_VAR to intel-sst
    ...

    Linus Torvalds
     
  • Pull HID updates from Jiri Kosina:

    - quite a few firmware fixes for RMI driver by Andrew Duggan

    - huion and uclogic drivers have been substantially overlapping in
    functionality lately. This redundancy is fixed by hid-huion driver
    being merged into hid-uclogic; work done by Benjamin Tissoires and
    Nikolai Kondrashov

    - i2c-hid now supports ACPI GPIO interrupts; patch from Mika Westerberg

    - Some of the quirks, that got separated into individual drivers, have
    historically had EXPERT dependency. As HID subsystem matured (as
    well as the individual drivers), this made less and less sense. This
    dependency is now being removed by patch from Jean Delvare

    - Logitech lg4ff driver received a couple of improvements for mode
    switching, by Michal Malý

    - multitouch driver now supports clickpads, patches by Benjamin
    Tissoires and Seth Forshee

    - hid-sensor framework received a substantial update; namely support
    for Custom and Generic pages is being added; work done by Srinivas
    Pandruvada

    - wacom driver received substantial update; it now supports
    i2c-connected devices (Mika Westerberg), Bamboo PADs are now
    properly supported (Benjamin Tissoires), much improved battery
    reporting (Jason Gerecke) and pen proximity cleanups (Ping Cheng)

    - small assorted fixes and device ID additions

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/hid: (68 commits)
    HID: sensor: Update document for custom sensor
    HID: sensor: Custom and Generic sensor support
    HID: debug: fix error handling in hid_debug_events_read()
    Input - mt: Fix input_mt_get_slot_by_key
    HID: logitech-hidpp: fix error return code
    HID: wacom: Add support for Cintiq 13HD Touch
    HID: logitech-hidpp: add a module parameter to keep firmware gestures
    HID: usbhid: yet another mouse with ALWAYS_POLL
    HID: usbhid: more mice with ALWAYS_POLL
    HID: wacom: set stylus_in_proximity before checking touch_down
    HID: wacom: use wacom_wac_finger_count_touches to set touch_down
    HID: wacom: remove hardcoded WACOM_QUIRK_MULTI_INPUT
    HID: pidff: effect can't be NULL
    HID: add quirk for PIXART OEM mouse used by HP
    HID: add HP OEM mouse to quirk ALWAYS_POLL
    HID: wacom: ask for a in-prox report when it was missed
    HID: hid-sensor-hub: Fix sparse warning
    HID: hid-sensor-hub: fix attribute read for logical usage id
    HID: plantronics: fix Kconfig default
    HID: pidff: support more than one concurrent effect
    ...

    Linus Torvalds
     

14 Apr, 2015

1 commit


08 Apr, 2015

2 commits

  • Document the use of TRACE_DEFINE_ENUM() by adding enums to the
    trace-event-sample.h and using this macro to convert them in the format
    files.

    Also update the comments and show the use of __print_symbolic() and
    __print_flags() as well as adding comments about __print_array().

    Link: http://lkml.kernel.org/r/20150403013802.220157513@goodmis.org

    Reviewed-by: Masami Hiramatsu
    Tested-by: Masami Hiramatsu
    Signed-off-by: Steven Rostedt

    Steven Rostedt (Red Hat)
     
  • Add documentation about TRACE_SYSTEM needing to be alpha-numeric or with
    underscores, and that if it is not, then the use of TRACE_SYSTEM_VAR is
    required to make something that is.

    An example of this is shown in samples/trace_events/trace-events-sample.h
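
    The naming rule can be checked mechanically. The sketch below is a
    hypothetical user-space helper, not something the kernel provides: it
    tests whether a system name such as "sample-trace" consists only of
    alpha-numeric characters and underscores, and derives a candidate
    variable-safe name by substituting underscores. In the kernel the
    TRACE_SYSTEM_VAR name is written by hand.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Returns 1 if name is non-empty and only contains [A-Za-z0-9_]. */
int is_alnum_or_underscore(const char *name)
{
    size_t i;

    assert(name != NULL);
    if (name[0] == '\0')
        return 0;
    for (i = 0; name[i] != '\0'; i++) {
        if (!isalnum((unsigned char)name[i]) && name[i] != '_')
            return 0;
    }
    return 1;
}

/*
 * Copy name into out, replacing every other character with '_'.
 * The result is truncated to fit and always NUL-terminated.
 */
void make_var_name(const char *name, char *out, size_t out_len)
{
    size_t i;

    assert(name != NULL && out != NULL && out_len > 0);
    for (i = 0; name[i] != '\0' && i + 1 < out_len; i++) {
        int ok = isalnum((unsigned char)name[i]) || name[i] == '_';

        out[i] = ok ? name[i] : '_';
    }
    out[i] = '\0';
}
```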

    Link: http://lkml.kernel.org/r/20150403013802.220157513@goodmis.org

    Reviewed-by: Masami Hiramatsu
    Tested-by: Masami Hiramatsu
    Signed-off-by: Steven Rostedt

    Steven Rostedt (Red Hat)
     

07 Apr, 2015

1 commit

  • Commit 608cd71a9c7c ("tc: bpf: generalize pedit action") has added the
    possibility to mangle packet data to BPF programs in the tc pipeline.
    This patch adds two helpers bpf_l3_csum_replace() and bpf_l4_csum_replace()
    for fixing up the protocol checksums after the packet mangling.

    It also adds 'flags' argument to bpf_skb_store_bytes() helper to avoid
    unnecessary checksum recomputations when BPF programs adjust l3/l4
    checksums, and documents all three helpers in the uapi header.

    Moreover, a sample program is added to show how BPF programs can make use
    of the mangle and csum helpers.
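
    For readers unfamiliar with checksum fix-ups, the idea of adjusting a
    checksum after changing one field, rather than recomputing it over the
    whole header, can be sketched in user space as follows. This follows
    the incremental update rule from RFC 1624 and is not the kernel's
    implementation; full_checksum() and checksum_replace16() are
    hypothetical names, and the words are treated as plain host-order
    16-bit values.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Fold a 32-bit one's complement accumulator down to 16 bits. */
uint16_t fold16(uint32_t sum)
{
    while (sum >> 16)
        sum = (sum & 0xffff) + (sum >> 16);
    return (uint16_t)sum;
}

/* Internet checksum over n 16-bit words (checksum field counted as 0). */
uint16_t full_checksum(const uint16_t *words, size_t n)
{
    uint32_t sum = 0;
    size_t i;

    assert(words != NULL);
    for (i = 0; i < n; i++)
        sum += words[i];
    return (uint16_t)~fold16(sum);
}

/* Incremental update: HC' = ~(~HC + ~m + m') for one changed word. */
uint16_t checksum_replace16(uint16_t check, uint16_t old_val, uint16_t new_val)
{
    uint32_t sum = (uint16_t)~check;

    sum += (uint16_t)~old_val;
    sum += new_val;
    return (uint16_t)~fold16(sum);
}
```

    Changing one word and applying checksum_replace16() to the old
    checksum yields the same value as recomputing with full_checksum()
    over the modified words.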

    Signed-off-by: Alexei Starovoitov
    Acked-by: Daniel Borkmann
    Signed-off-by: David S. Miller

    Alexei Starovoitov
     

02 Apr, 2015

4 commits

  • One BPF program attaches to kmem_cache_alloc_node() and
    remembers all allocated objects in the map.
    Another program attaches to kmem_cache_free() and deletes
    corresponding object from the map.

    User space walks the map every second and prints any objects
    which are older than 1 second.

    Usage:

    $ sudo tracex4

    Then start a few long-lived processes. The 'tracex4' will print
    something like this:

    obj 0xffff880465928000 is 13sec old was allocated at ip ffffffff8105dc32
    obj 0xffff88043181c280 is 13sec old was allocated at ip ffffffff8105dc32
    obj 0xffff880465848000 is 8sec old was allocated at ip ffffffff8105dc32
    obj 0xffff8804338bc280 is 15sec old was allocated at ip ffffffff8105dc32

    $ addr2line -fispe vmlinux ffffffff8105dc32
    do_fork at fork.c:1665

    As soon as processes exit the memory is reclaimed and 'tracex4'
    prints nothing.

    Similar experiment can be done with the __kmalloc()/kfree() pair.
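
    The bookkeeping described above (remember on allocation, forget on
    free, report what is older than a threshold) can be modelled without
    BPF at all. The following user-space sketch uses a small fixed-size
    table in place of a BPF map; on_alloc(), on_free() and
    count_older_than() are hypothetical names and timestamps are passed in
    by the caller.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_TRACKED 64

/* One remembered allocation: address plus allocation time. */
struct tracked_obj {
    uintptr_t ptr;
    uint64_t alloc_ns;
    int in_use;
};

static struct tracked_obj tracked[MAX_TRACKED];

/* "Allocation probe": remember ptr. Returns 0, or -1 if the table is full. */
int on_alloc(uintptr_t ptr, uint64_t now_ns)
{
    size_t i;

    assert(ptr != 0);
    for (i = 0; i < MAX_TRACKED; i++) {
        if (!tracked[i].in_use) {
            tracked[i].ptr = ptr;
            tracked[i].alloc_ns = now_ns;
            tracked[i].in_use = 1;
            return 0;
        }
    }
    return -1;
}

/* "Free probe": delete the matching entry, if any. */
void on_free(uintptr_t ptr)
{
    size_t i;

    for (i = 0; i < MAX_TRACKED; i++) {
        if (tracked[i].in_use && tracked[i].ptr == ptr) {
            tracked[i].in_use = 0;
            return;
        }
    }
}

/* "User space walk": count live objects older than min_age_ns. */
size_t count_older_than(uint64_t now_ns, uint64_t min_age_ns)
{
    size_t i, count = 0;

    for (i = 0; i < MAX_TRACKED; i++) {
        if (tracked[i].in_use && now_ns >= tracked[i].alloc_ns &&
            now_ns - tracked[i].alloc_ns > min_age_ns)
            count++;
    }
    return count;
}
```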

    Signed-off-by: Alexei Starovoitov
    Cc: Andrew Morton
    Cc: Arnaldo Carvalho de Melo
    Cc: Daniel Borkmann
    Cc: David S. Miller
    Cc: Jiri Olsa
    Cc: Linus Torvalds
    Cc: Masami Hiramatsu
    Cc: Namhyung Kim
    Cc: Peter Zijlstra
    Cc: Steven Rostedt
    Link: http://lkml.kernel.org/r/1427312966-8434-10-git-send-email-ast@plumgrid.com
    Signed-off-by: Ingo Molnar

    Alexei Starovoitov
     
  • BPF C program attaches to
    blk_mq_start_request()/blk_update_request() kprobe events to
    calculate IO latency.

    For every completed block IO event it computes the time delta
    in nsec and records in a histogram map:

    map[log10(delta)*10]++

    User space reads this histogram map every 2 seconds and prints
    it as a 'heatmap' using gray shades of text terminal. Black
    spaces have many events and white spaces have very few events.
    Left most space is the smallest latency, right most space is
    the largest latency in the range.

    Usage:

    $ sudo ./tracex3
    and do 'sudo dd if=/dev/sda of=/dev/null' in another terminal.

    Observe IO latencies and how different activity (like 'make
    kernel') affects it.

    Similar experiments can be done for network transmit latencies,
    syscalls, etc.

    '-t' flag prints the heatmap using normal ascii characters:

    $ sudo ./tracex3 -t
    heatmap of IO latency
    # - many events with this latency
    - few events
    |1us |10us |100us |1ms |10ms |100ms |1s |10s
    *ooo. *O.#. # 221
    . *# . # 125
    .. .o#*.. # 55
    . . . . .#O # 37
    .# # 175
    .#*. # 37
    # # 199
    . . *#*. # 55
    *#..* # 42
    # # 266
    ...***Oo#*OO**o#* . # 629
    # # 271
    . .#o* o.*o* # 221
    . . o* *#O.. # 50

    Signed-off-by: Alexei Starovoitov
    Cc: Arnaldo Carvalho de Melo
    Cc: Daniel Borkmann
    Cc: David S. Miller
    Cc: Jiri Olsa
    Cc: Linus Torvalds
    Cc: Masami Hiramatsu
    Cc: Namhyung Kim
    Cc: Peter Zijlstra
    Cc: Steven Rostedt
    Link: http://lkml.kernel.org/r/1427312966-8434-9-git-send-email-ast@plumgrid.com
    Signed-off-by: Ingo Molnar

    Alexei Starovoitov
     
  • this example has two probes in one C file that attach to
    different kprobe events and use two different maps.

    1st probe is x64 specific equivalent of dropmon. It attaches to
    kfree_skb, retrieves 'ip' address of kfree_skb() caller and
    counts number of packet drops at that 'ip' address. User space
    prints 'location - count' map every second.

    2nd probe attaches to kprobe:sys_write and computes a histogram
    of different write sizes

    Usage:
    $ sudo tracex2
    location 0xffffffff81695995 count 1
    location 0xffffffff816d0da9 count 2

    location 0xffffffff81695995 count 2
    location 0xffffffff816d0da9 count 2

    location 0xffffffff81695995 count 3
    location 0xffffffff816d0da9 count 2

    557145+0 records in
    557145+0 records out
    285258240 bytes (285 MB) copied, 1.02379 s, 279 MB/s
    syscall write() stats
    byte_size : count distribution
    1 -> 1 : 3 | |
    2 -> 3 : 0 | |
    4 -> 7 : 0 | |
    8 -> 15 : 0 | |
    16 -> 31 : 2 | |
    32 -> 63 : 3 | |
    64 -> 127 : 1 | |
    128 -> 255 : 1 | |
    256 -> 511 : 0 | |
    512 -> 1023 : 1118968 |************************************* |

    Ctrl-C at any time. Kernel will auto cleanup maps and programs

    $ addr2line -ape ./bld_x64/vmlinux 0xffffffff81695995 0xffffffff816d0da9
    0xffffffff81695995: ./bld_x64/../net/ipv4/icmp.c:1038
    0xffffffff816d0da9: ./bld_x64/../net/unix/af_unix.c:1231

    Signed-off-by: Alexei Starovoitov
    Cc: Arnaldo Carvalho de Melo
    Cc: Daniel Borkmann
    Cc: David S. Miller
    Cc: Jiri Olsa
    Cc: Linus Torvalds
    Cc: Masami Hiramatsu
    Cc: Namhyung Kim
    Cc: Peter Zijlstra
    Cc: Steven Rostedt
    Link: http://lkml.kernel.org/r/1427312966-8434-8-git-send-email-ast@plumgrid.com
    Signed-off-by: Ingo Molnar

    Alexei Starovoitov
     
  • tracex1_kern.c - C program compiled into BPF.

    It attaches to kprobe:netif_receive_skb()

    When skb->dev->name == "lo", it prints sample debug message into
    trace_pipe via bpf_trace_printk() helper function.

    tracex1_user.c - corresponding user space component that:
    - loads BPF program via bpf() syscall
    - opens kprobes:netif_receive_skb event via perf_event_open()
    syscall
    - attaches the program to event via ioctl(event_fd,
    PERF_EVENT_IOC_SET_BPF, prog_fd);
    - prints from trace_pipe

    Note, this BPF program is non-portable. It must be recompiled
    with current kernel headers. kprobe is not a stable ABI and
    BPF+kprobe scripts may no longer be meaningful when kernel
    internals change.

    No matter in what way the kernel changes, neither the kprobe,
    nor the BPF program can ever crash or corrupt the kernel,
    assuming the kprobes, perf and BPF subsystems have no bugs.

    The verifier will detect that the program is using
    bpf_trace_printk() and the kernel will print 'this is a DEBUG
    kernel' warning banner, which means that bpf_trace_printk()
    should be used for debugging of the BPF program only.

    Usage:
    $ sudo tracex1
    ping-19826 [000] d.s2 63103.382648: : skb ffff880466b1ca00 len 84
    ping-19826 [000] d.s2 63103.382684: : skb ffff880466b1d300 len 84

    ping-19826 [000] d.s2 63104.382533: : skb ffff880466b1ca00 len 84
    ping-19826 [000] d.s2 63104.382594: : skb ffff880466b1d300 len 84

    Signed-off-by: Alexei Starovoitov
    Cc: Arnaldo Carvalho de Melo
    Cc: Daniel Borkmann
    Cc: David S. Miller
    Cc: Jiri Olsa
    Cc: Linus Torvalds
    Cc: Masami Hiramatsu
    Cc: Namhyung Kim
    Cc: Peter Zijlstra
    Cc: Steven Rostedt
    Link: http://lkml.kernel.org/r/1427312966-8434-7-git-send-email-ast@plumgrid.com
    Signed-off-by: Ingo Molnar

    Alexei Starovoitov
     

25 Mar, 2015

2 commits


18 Mar, 2015

1 commit

  • as a follow on to patch 70006af95515 ("bpf: allow eBPF access skb fields")
    this patch allows 'protocol' and 'vlan_tci' fields to be accessible
    from extended BPF programs.

    The usage of 'protocol', 'vlan_present' and 'vlan_tci' fields is the same as
    corresponding SKF_AD_PROTOCOL, SKF_AD_VLAN_TAG_PRESENT and SKF_AD_VLAN_TAG
    accesses in classic BPF.

    Signed-off-by: Alexei Starovoitov
    Signed-off-by: David S. Miller

    Alexei Starovoitov
     

16 Mar, 2015

1 commit


15 Mar, 2015

1 commit


02 Mar, 2015

2 commits

  • We need to export BPF_PSEUDO_MAP_FD to user space, as it's used in the
    ELF BPF loader where instructions are being loaded that need map fixups.

    An initial stage loads all maps into the kernel, and later on replaces
    related instructions in the eBPF blob with BPF_PSEUDO_MAP_FD as source
    register and the actual fd as immediate value.

    The kernel verifier recognizes this keyword and replaces the map fd with
    a real pointer internally.
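
    The loader-side fixup described above might look roughly like the
    sketch below. To keep it buildable without kernel headers it uses a
    local struct demo_insn and local DEMO_* constants as assumed mirrors
    of the uapi definitions; fixup_map_fd() is a hypothetical helper, and
    restricting the fixup to the 64-bit immediate load opcode is an
    assumption of this sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed local mirrors of uapi values; not taken from kernel headers. */
#define DEMO_BPF_PSEUDO_MAP_FD 1
#define DEMO_BPF_LD_IMM64 0x18 /* BPF_LD | BPF_IMM | BPF_DW */

/* Simplified instruction layout (the real one packs registers in 4 bits). */
struct demo_insn {
    uint8_t code;
    uint8_t dst_reg;
    uint8_t src_reg;
    int16_t off;
    int32_t imm;
};

/*
 * Mark a map-loading instruction for the verifier: the source register
 * carries the pseudo keyword and the immediate carries the map fd.
 * Returns 0 on success, -1 if the instruction is not a 64-bit load.
 */
int fixup_map_fd(struct demo_insn *insn, int map_fd)
{
    assert(insn != NULL);
    if (insn->code != DEMO_BPF_LD_IMM64)
        return -1;
    insn->src_reg = DEMO_BPF_PSEUDO_MAP_FD;
    insn->imm = map_fd;
    return 0;
}
```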

    Signed-off-by: Daniel Borkmann
    Acked-by: Alexei Starovoitov
    Signed-off-by: David S. Miller

    Daniel Borkmann
     
  • Now that we have BPF_PROG_TYPE_SOCKET_FILTER up and running, we can
    remove the test stubs which were added to get the verifier suite up.

    We can just let the test cases probe under socket filter type instead.
    In the fill/spill test case, we cannot (yet) access fields from the
    context (skb), but we may adapt that test case in future.

    Signed-off-by: Daniel Borkmann
    Acked-by: Alexei Starovoitov
    Signed-off-by: David S. Miller

    Daniel Borkmann
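
The map fixup described in the BPF_PSEUDO_MAP_FD commit above can be illustrated with a minimal sketch. It assumes a local mirror of the kernel's `struct bpf_insn` layout so that it builds without kernel headers; `sketch_insn`, `SKETCH_LD_IMM64`, `SKETCH_PSEUDO_MAP_FD` and `sketch_fixup_map_fd()` are hypothetical names for illustration, not the actual ELF loader code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Local mirror of the kernel's struct bpf_insn layout (an assumption made
 * so this sketch compiles without kernel headers). */
struct sketch_insn {
	uint8_t code;        /* opcode */
	uint8_t dst_reg:4;   /* destination register */
	uint8_t src_reg:4;   /* source register */
	int16_t off;         /* signed offset */
	int32_t imm;         /* signed immediate */
};

#define SKETCH_LD_IMM64      0x18 /* BPF_LD | BPF_IMM | BPF_DW */
#define SKETCH_PSEUDO_MAP_FD 1    /* value of BPF_PSEUDO_MAP_FD */

/*
 * Hypothetical loader helper: patch the two-slot 64-bit immediate load at
 * insn[0..1] so that it refers to a map by file descriptor. The source
 * register carries the pseudo marker and the immediate carries the fd;
 * the verifier later swaps the fd for a real map pointer.
 * Returns 0 on success, -1 if insn[0] is not a 64-bit immediate load.
 */
static int sketch_fixup_map_fd(struct sketch_insn *insn, int map_fd)
{
	assert(insn != NULL);

	if (insn[0].code != SKETCH_LD_IMM64)
		return -1;

	insn[0].src_reg = SKETCH_PSEUDO_MAP_FD;
	insn[0].imm = map_fd;
	insn[1].imm = 0; /* upper 32 bits: unused for an fd */
	return 0;
}
```

A real loader would call such a helper once per map relocation, after the maps have been created and their fds are known.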
     

24 Feb, 2015

7 commits


18 Feb, 2015

1 commit

  • Fixes a potential corruption with uninitialized stack memory in the
    seccomp BPF sample program.

    [akpm@linux-foundation.org: coding-style fixlet]
    Signed-off-by: Kees Cook
    Reported-by: Robert Swiecki
    Tested-by: Robert Swiecki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kees Cook
     

13 Feb, 2015

1 commit

  • Pull tracing updates from Steven Rostedt:
    "The updates included in this pull request for ftrace are:

    o Several clean ups to the code

    One such clean up was to convert to 64 bit time keeping, in the
    ring buffer benchmark code.

    o Adding of __print_array() helper macro for TRACE_EVENT()

    o Updating the samples/trace_events/ to add samples of different ways
    to make trace events. Lots of features have been added since the
    sample code was made, and these features are mostly unknown.
    Developers have been making their own hacks to do things that are
    already available.

    o Performance improvements. Most notably, I found a performance bug
    where a waiter that is waiting for a full page from the ring buffer
    will see that a full page is not available, and go to sleep. The
    sched event caused by it going to sleep would cause it to wake up
    again. It would see that there was still not a full page, and go
    back to sleep again, and that would wake it up again, until finally
    it would see a full page. This change has been marked for stable.

    Other improvements include removing global locks from fast paths"

    * tag 'trace-v3.20' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
    ring-buffer: Do not wake up a splice waiter when page is not full
    tracing: Fix unmapping loop in tracing_mark_write
    tracing: Add samples of DECLARE_EVENT_CLASS() and DEFINE_EVENT()
    tracing: Add TRACE_EVENT_FN example
    tracing: Add TRACE_EVENT_CONDITION sample
    tracing: Update the TRACE_EVENT fields available in the sample code
    tracing: Separate out initializing top level dir from instances
    tracing: Make tracing_init_dentry_tr() static
    trace: Use 64-bit timekeeping
    tracing: Add array printing helper
    tracing: Remove newline from trace_printk warning banner
    tracing: Use IS_ERR() check for return value of tracing_init_dentry()
    tracing: Remove unneeded includes of debugfs.h and fs.h
    tracing: Remove taking of trace_types_lock in pipe files
    tracing: Add ref count to tracer for when they are being read by pipe

    Linus Torvalds
     

11 Feb, 2015

1 commit

  • Pull live patching infrastructure from Jiri Kosina:
    "Let me provide a bit of history first, before describing what is in
    this pile.

    Originally, there was kSplice as a standalone project that implemented
    stop_machine()-based patching for the Linux kernel. This project was
    later acquired, and the current owner is providing live patching as a
    proprietary service, without any intentions to have their
    implementation merged.

    Then, due to rising user/customer demand, both Red Hat and SUSE
    started working on their own implementation (not knowing about each
    other), and announced first versions roughly at the same time [1] [2].

    The principal difference between the two solutions is how they are
    making sure that the patching is performed in a consistent way when it
    comes to different execution threads with respect to the semantic
    nature of the change that is being introduced.

    In a nutshell, kPatch is issuing stop_machine(), then looking at
    stacks of all existing processes, and if it decides that the system
    is in a state that can be patched safely, it proceeds to insert code
    redirection machinery to the patched functions.

    On the other hand, kGraft provides a per-thread consistency during one
    single pass of a process through the kernel and performs a lazy
    continuous migration of threads from the "unpatched" universe to the
    "patched" one at safe checkpoints.

    If interested in a more detailed discussion about the consistency
    models and their possible combinations, please see the thread that
    evolved around [3].

    It pretty quickly became obvious to the interested parties that it's
    absolutely impractical in this case to have several isolated solutions
    for one task to co-exist in the kernel. During a dedicated Live
    Kernel Patching track at LPC in Dusseldorf, all the interested parties
    sat together and came up with a joint approach that would work for both
    distro vendors. Steven Rostedt took notes [4] from this meeting.

    And the foundation for that approach is what's present in this pull
    request.

    It provides a basic infrastructure for function "live patching" (i.e.
    code redirection), including API for kernel modules containing the
    actual patches, and API/ABI for userspace to be able to operate on the
    patches (look up what patches are applied, enable/disable them, etc).

    It's relatively simple and minimalistic, as it's making use of
    existing kernel infrastructure (namely ftrace) as much as possible.
    It's also self-contained, in a sense that it doesn't hook itself in
    any other kernel subsystem (it doesn't even touch any other code).
    It's now implemented for x86 only as a reference architecture, but
    support for powerpc, s390 and arm is already in the works (adding
    arch-specific support basically boils down to teaching ftrace about
    regs-saving).

    Once this common infrastructure gets merged, both Red Hat and SUSE
    have agreed to immediately start porting their current solutions on
    top of this, abandoning their out-of-tree code. The plan basically is
    that each patch will be marked by flag(s) that would indicate which
    consistency model it is willing to use (again, the details have been
    sketched out already in the thread at [3]).

    Before this happens, the current codebase can be used to patch a large
    group of security/stability problems the patches for which are not too
    complex (in a sense that they don't introduce non-trivial change of
    function's return value semantics, they don't change layout of data
    structures, etc) -- this corresponds to LEAVE_FUNCTION &&
    SWITCH_FUNCTION semantics described at [3].

    This tree has been in linux-next since December.

    [1] https://lkml.org/lkml/2014/4/30/477
    [2] https://lkml.org/lkml/2014/7/14/857
    [3] https://lkml.org/lkml/2014/11/7/354
    [4] http://linuxplumbersconf.org/2014/wp-content/uploads/2014/10/LPC2014_LivePatching.txt

    [ The core code is introduced by the three commits authored by Seth
    Jennings, which got a lot of changes incorporated during numerous
    respins and reviews of the initial implementation. All the followup
    commits have materialized only after public tree has been created,
    so they were not folded into initial three commits so that the
    public tree doesn't get rebased ]"

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/livepatching:
    livepatch: add missing newline to error message
    livepatch: rename config to CONFIG_LIVEPATCH
    livepatch: fix uninitialized return value
    livepatch: support for repatching a function
    livepatch: enforce patch stacking semantics
    livepatch: change ARCH_HAVE_LIVE_PATCHING to HAVE_LIVE_PATCHING
    livepatch: fix deferred module patching order
    livepatch: handle ancient compilers with more grace
    livepatch: kconfig: use bool instead of boolean
    livepatch: samples: fix usage example comments
    livepatch: MAINTAINERS: add git tree location
    livepatch: use FTRACE_OPS_FL_IPMODIFY
    livepatch: move x86 specific ftrace handler code to arch/x86
    livepatch: samples: add sample live patching module
    livepatch: kernel: add support for live patching
    livepatch: kernel: add TAINT_LIVEPATCH

    Linus Torvalds
     

10 Feb, 2015

4 commits


04 Feb, 2015

1 commit


27 Jan, 2015

1 commit

  • hash map is unordered, so get_next_key() iterator shouldn't
    rely on particular order of elements. So relax this test.

    Fixes: ffb65f27a155 ("bpf: add a testsuite for eBPF maps")
    Reported-by: Michael Holzheu
    Signed-off-by: Alexei Starovoitov
    Acked-by: Daniel Borkmann
    Signed-off-by: David S. Miller

    Alexei Starovoitov
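
The relaxed check described in the hash map commit above can be sketched as an order-independent walk. This is only an illustration under stated assumptions: `toy_map`, `toy_get_next_key()` and `toy_iteration_ok()` are hypothetical stand-ins that mimic the shape of a get_next_key() iterator, and the expected keys 1 and 2 are made up for the example. It is not the actual testsuite code.

```c
#include <stddef.h>

/* Toy stand-in for a map: keys are stored in an arbitrary order. */
struct toy_map {
	const long long *keys;
	size_t n;
};

/*
 * Mimics the shape of a get_next_key() iterator: returns 0 and sets *next,
 * or returns -1 when there are no more keys. A key that is not present
 * restarts the walk from the first stored element.
 */
static int toy_get_next_key(const struct toy_map *m, long long key,
			    long long *next)
{
	size_t i;

	for (i = 0; i < m->n; i++)
		if (m->keys[i] == key)
			break;

	if (i == m->n) {          /* key not found: start from the beginning */
		if (m->n == 0)
			return -1;
		*next = m->keys[0];
		return 0;
	}
	if (i + 1 == m->n)        /* key was the last element */
		return -1;
	*next = m->keys[i + 1];
	return 0;
}

/*
 * Order-independent check: walk every key and confirm that exactly the
 * expected set {1, 2} was seen, whichever key comes first.
 * Returns 1 if the walk matches, 0 otherwise.
 */
static int toy_iteration_ok(const struct toy_map *m)
{
	long long key = -1, next; /* -1 is assumed not to be a stored key */
	int seen1 = 0, seen2 = 0, count = 0;

	while (toy_get_next_key(m, key, &next) == 0) {
		if (next == 1)
			seen1++;
		else if (next == 2)
			seen2++;
		else
			return 0;
		count++;
		key = next;
	}
	return count == 2 && seen1 == 1 && seen2 == 1;
}
```

The point of the design is that the check counts which keys were returned instead of asserting that a particular key comes first, so it passes for either storage order.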
     

24 Dec, 2014

1 commit


22 Dec, 2014

1 commit