17 Apr, 2015

2 commits

  • 1.
    The first bug is a silly mistake. It broke the tracing examples and
    prevented simple bpf programs from loading.

    In the following code:
    if (insn->imm == 0 && BPF_SIZE(insn->code) == BPF_W) {
    } else if (...) {
        // this part should have been executed when
        // BPF_SIZE(insn->code) == BPF_W and insn->imm != 0
    }

    Obviously it's not doing that. So simple instructions like:
    r2 = *(u64 *)(r1 + 8)
    will be rejected. Note that the comments in the code around these branches
    were, and still are, valid and indicate the true intent.

    Replace it with:
    if (BPF_SIZE(insn->code) != BPF_W)
        continue;

    if (insn->imm == 0) {
    } else if (...) {
        // now this code will be executed when
        // BPF_SIZE(insn->code) == BPF_W and insn->imm != 0
    }

    2.
    The second bug is more subtle.
    If malicious code uses the same register as both dest and source,
    the checks designed to prevent the same instruction from being used with
    different pointer types fail to trigger, because src_reg_type was assigned
    after the register type had already been overwritten by check_mem_access().
    The fix is trivial. Just move the line:
    src_reg_type = regs[insn->src_reg].type;
    before check_mem_access().
    Add new 'access skb fields bad4' test to check this case.

    Fixes: 9bac3d6d548e ("bpf: allow extended BPF programs access skb fields")
    Signed-off-by: Alexei Starovoitov
    Signed-off-by: David S. Miller

    Alexei Starovoitov
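
Both fixes above can be illustrated with a small user-space model. This is only a sketch under stated assumptions: the `verdict` enum, the `classify_*` and `src_type_*` functions, and `model_mem_access()` are hypothetical names invented here, the elided `else if (...)` condition is modeled as always true, and the side effect of check_mem_access() is reduced to "the destination register becomes a plain value". It is not the verifier's code.

```c
#include <stdint.h>

/* Size field encoding of a BPF opcode (values match the BPF headers). */
#define BPF_SIZE(code) ((code) & 0x18)
#define BPF_W  0x00 /* 32-bit access */
#define BPF_DW 0x18 /* 64-bit access */

/* Hypothetical outcomes used only by this model. */
enum verdict { SKIPPED, HANDLED_IMM_ZERO, HANDLED_IMM_NONZERO };

/* Bug 1, old shape: any access that is not BPF_W falls into the else branch. */
enum verdict classify_buggy(uint8_t code, int32_t imm)
{
    if (imm == 0 && BPF_SIZE(code) == BPF_W)
        return HANDLED_IMM_ZERO;
    return HANDLED_IMM_NONZERO; /* elided condition modeled as always true */
}

/* Bug 1, fixed shape: non-BPF_W accesses are skipped before the branches. */
enum verdict classify_fixed(uint8_t code, int32_t imm)
{
    if (BPF_SIZE(code) != BPF_W)
        return SKIPPED;
    if (imm == 0)
        return HANDLED_IMM_ZERO;
    return HANDLED_IMM_NONZERO;
}

/* Hypothetical register types for the second bug. */
enum reg_type { PTR_TO_CTX, PTR_TO_MAP_VALUE, UNKNOWN_VALUE };

/* Assumed side effect of the memory access check on a load:
 * the destination register no longer holds a pointer. */
static void model_mem_access(enum reg_type regs[], int dst)
{
    regs[dst] = UNKNOWN_VALUE;
}

/* Bug 2, old order: the source type is read after it may be overwritten. */
enum reg_type src_type_buggy(enum reg_type regs[], int dst, int src)
{
    model_mem_access(regs, dst);
    return regs[src];
}

/* Bug 2, fixed order: remember the source type first. */
enum reg_type src_type_fixed(enum reg_type regs[], int dst, int src)
{
    enum reg_type src_reg_type = regs[src];

    model_mem_access(regs, dst);
    return src_reg_type;
}
```

With opcode 0x79 (a 64-bit load from memory), the old shape reaches the branch meant for BPF_W accesses, while the fixed shape skips it. With dst == src, only the fixed order still reports the pointer type that the instruction was actually used with.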
     
  • For the short-term solution, let's fix the bpf helper functions to use
    skb->mac_header relative offsets instead of skb->data, so that the
    same eBPF programs work with cls_bpf and act_bpf on both the ingress
    and egress qdisc paths. We need to ensure that mac_header is set
    before calling into programs. This is effectively the first option
    from the discussion referenced below.

    A longer-term solution for LD_ABS|LD_IND instructions will be more
    intrusive but also more beneficial than this one. It will be implemented
    later, as it is too risky at this point in time.

    I.e., we plan to look into the option of moving skb_pull() out of
    eth_type_trans() and into netif_receive_skb() as has been suggested
    as second option. Meanwhile, this solution ensures ingress can be
    used with eBPF, too, and that we won't run into ABI troubles later.
    To deal with negative offsets inside eBPF helper functions,
    we've implemented bpf_skb_clone_unwritable() to test for unwritable
    headers.

    Reference: http://thread.gmane.org/gmane.linux.network/359129/focus=359694
    Fixes: 608cd71a9c7c ("tc: bpf: generalize pedit action")
    Fixes: 91bc4822c3d6 ("tc: bpf: add checksum helpers")
    Signed-off-by: Alexei Starovoitov
    Signed-off-by: Daniel Borkmann
    Signed-off-by: David S. Miller

    Alexei Starovoitov
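
The reason mac_header relative offsets behave the same on both paths can be sketched with a toy buffer. Everything here is an assumption made for illustration: `struct mini_skb`, `make_skb()` and the two load functions are hypothetical, and the model only captures the idea that on ingress the data pointer has already been pulled past the Ethernet header while on egress it still points at it.

```c
#include <stddef.h>
#include <stdint.h>

#define ETH_HLEN 14 /* length of an Ethernet header */

/* Hypothetical, heavily simplified stand-in for struct sk_buff. */
struct mini_skb {
    uint8_t *head;       /* start of the buffer */
    size_t   mac_header; /* offset of the link-layer header from head */
    size_t   data;       /* offset of the current data pointer from head */
};

/* Build the toy skb: on ingress, data sits ETH_HLEN bytes past mac_header. */
struct mini_skb make_skb(uint8_t *buf, int ingress)
{
    struct mini_skb skb = { buf, 0, ingress ? ETH_HLEN : 0 };

    return skb;
}

/* Offset relative to data: the same offset hits different bytes per path. */
uint8_t load_rel_data(const struct mini_skb *skb, size_t off)
{
    return skb->head[skb->data + off];
}

/* Offset relative to mac_header: the same offset hits the same byte. */
uint8_t load_rel_mac(const struct mini_skb *skb, size_t off)
{
    return skb->head[skb->mac_header + off];
}
```

In this model a program that reads the IPv4 protocol byte at offset ETH_HLEN + 9 gets the same answer on both paths only through load_rel_mac().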
     

16 Apr, 2015

1 commit

  • Pull networking updates from David Miller:

    1) Add BQL support to via-rhine, from Tino Reichardt.

    2) Integrate SWITCHDEV layer support into the DSA layer, so DSA drivers
    can support hw switch offloading. From Florian Fainelli.

    3) Allow 'ip address' commands to initiate multicast group join/leave,
    from Madhu Challa.

    4) Many ipv4 FIB lookup optimizations from Alexander Duyck.

    5) Support EBPF in cls_bpf classifier and act_bpf action, from Daniel
    Borkmann.

    6) Remove the ugly compat support in ARP for ugly layers like ax25,
    rose, etc. And use this to clean up the neigh layer, then use it to
    implement MPLS support. All from Eric Biederman.

    7) Support L3 forwarding offloading in switches, from Scott Feldman.

    8) Collapse the LOCAL and MAIN ipv4 FIB tables when possible, to speed
    up route lookups even further. From Alexander Duyck.

    9) Many improvements and bug fixes to the rhashtable implementation,
    from Herbert Xu and Thomas Graf. In particular, in the case where
    an rhashtable user bulk adds a large number of items into an empty
    table, we expand the table much more sanely.

    10) Don't make the tcp_metrics hash table per-namespace, from Eric
    Biederman.

    11) Extend EBPF to access SKB fields, from Alexei Starovoitov.

    12) Split out new connection request sockets so that they can be
    established in the main hash table. Much less false sharing since
    hash lookups go direct to the request sockets instead of having to
    go first to the listener then to the request socks hashed
    underneath. From Eric Dumazet.

    13) Add async I/O support for crypto AF_ALG sockets, from Tadeusz Struk.

    14) Support stable privacy address generation for RFC7217 in IPV6. From
    Hannes Frederic Sowa.

    15) Hash network namespace into IP frag IDs, also from Hannes Frederic
    Sowa.

    16) Convert PTP get/set methods to use 64-bit time, from Richard
    Cochran.

    * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1816 commits)
    fm10k: Bump driver version to 0.15.2
    fm10k: corrected VF multicast update
    fm10k: mbx_update_max_size does not drop all oversized messages
    fm10k: reset head instead of calling update_max_size
    fm10k: renamed mbx_tx_dropped to mbx_tx_oversized
    fm10k: update xcast mode before synchronizing multicast addresses
    fm10k: start service timer on probe
    fm10k: fix function header comment
    fm10k: comment next_vf_mbx flow
    fm10k: don't handle mailbox events in iov_event path and always process mailbox
    fm10k: use separate workqueue for fm10k driver
    fm10k: Set PF queues to unlimited bandwidth during virtualization
    fm10k: expose tx_timeout_count as an ethtool stat
    fm10k: only increment tx_timeout_count in Tx hang path
    fm10k: remove extraneous "Reset interface" message
    fm10k: separate PF only stats so that VF does not display them
    fm10k: use hw->mac.max_queues for stats
    fm10k: only show actual queues, not the maximum in hardware
    fm10k: allow creation of VLAN on default vid
    fm10k: fix unused warnings
    ...

    Linus Torvalds
     

07 Apr, 2015

1 commit

  • Commit 608cd71a9c7c ("tc: bpf: generalize pedit action") added the
    ability for BPF programs in the tc pipeline to mangle packet data.
    This patch adds two helpers bpf_l3_csum_replace() and bpf_l4_csum_replace()
    for fixing up the protocol checksums after the packet mangling.

    It also adds a 'flags' argument to the bpf_skb_store_bytes() helper to avoid
    unnecessary checksum recomputations when BPF programs adjust l3/l4
    checksums, and documents all three helpers in the uapi header.

    Moreover, a sample program is added to show how BPF programs can make use
    of the mangle and csum helpers.

    Signed-off-by: Alexei Starovoitov
    Acked-by: Daniel Borkmann
    Signed-off-by: David S. Miller

    Alexei Starovoitov
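
Fixing up a checksum after mangling does not require summing the whole header again. The sketch below shows the standard incremental update for a 16-bit one's-complement checksum (the form given in RFC 1624). It illustrates the arithmetic only: `fold()`, `csum_words()` and `csum_update16()` are hypothetical names, and the real helpers take an skb, an offset and flags rather than raw words.

```c
#include <stddef.h>
#include <stdint.h>

/* Fold a 32-bit accumulator into 16 bits with end-around carry. */
static uint16_t fold(uint32_t sum)
{
    while (sum >> 16)
        sum = (sum & 0xffff) + (sum >> 16);
    return (uint16_t)sum;
}

/* Full checksum over n 16-bit words (checksum field assumed to be zero). */
uint16_t csum_words(const uint16_t *w, size_t n)
{
    uint32_t sum = 0;
    size_t i;

    for (i = 0; i < n; i++)
        sum += w[i];
    return (uint16_t)~fold(sum);
}

/* Incremental update when one 16-bit word changes from old_w to new_w:
 * check' = ~(~check + ~old_w + new_w), in one's-complement arithmetic. */
uint16_t csum_update16(uint16_t check, uint16_t old_w, uint16_t new_w)
{
    uint32_t sum = (uint16_t)~check;

    sum += (uint16_t)~old_w;
    sum += new_w;
    return (uint16_t)~fold(sum);
}
```

For a header with a non-zero word sum, csum_update16() gives the same result as recomputing with csum_words() after the change, which is why a program that rewrites a single field only needs the old and new values.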
     

02 Apr, 2015

4 commits

  • One BPF program attaches to kmem_cache_alloc_node() and
    remembers all allocated objects in the map.
    Another program attaches to kmem_cache_free() and deletes
    the corresponding object from the map.

    User space walks the map every second and prints any objects
    which are older than 1 second.

    Usage:

    $ sudo tracex4

    Then start a few long-lived processes. 'tracex4' will print
    something like this:

    obj 0xffff880465928000 is 13sec old was allocated at ip ffffffff8105dc32
    obj 0xffff88043181c280 is 13sec old was allocated at ip ffffffff8105dc32
    obj 0xffff880465848000 is 8sec old was allocated at ip ffffffff8105dc32
    obj 0xffff8804338bc280 is 15sec old was allocated at ip ffffffff8105dc32

    $ addr2line -fispe vmlinux ffffffff8105dc32
    do_fork at fork.c:1665

    As soon as the processes exit, the memory is reclaimed and 'tracex4'
    prints nothing.

    A similar experiment can be done with the __kmalloc()/kfree() pair.

    Signed-off-by: Alexei Starovoitov
    Cc: Andrew Morton
    Cc: Arnaldo Carvalho de Melo
    Cc: Daniel Borkmann
    Cc: David S. Miller
    Cc: Jiri Olsa
    Cc: Linus Torvalds
    Cc: Masami Hiramatsu
    Cc: Namhyung Kim
    Cc: Peter Zijlstra
    Cc: Steven Rostedt
    Link: http://lkml.kernel.org/r/1427312966-8434-10-git-send-email-ast@plumgrid.com
    Signed-off-by: Ingo Molnar

    Alexei Starovoitov
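
The bookkeeping described above (record on allocation, delete on free, report anything older than one second) can be modeled in plain C. This is a sketch with invented names: `struct alloc_map`, `on_alloc()`, `on_free()` and `count_older_than()` are hypothetical, and a fixed-size array stands in for the BPF hash map.

```c
#include <stddef.h>
#include <stdint.h>

#define MAX_OBJS 64 /* arbitrary capacity for this sketch */

struct alloc_rec {
    uintptr_t obj;   /* address of the allocated object */
    uint64_t  ts_ns; /* allocation time in nanoseconds */
    uint64_t  ip;    /* caller address that allocated it */
    int       used;
};

struct alloc_map {
    struct alloc_rec slot[MAX_OBJS];
};

/* Models the probe on the allocation path: remember the object. */
int on_alloc(struct alloc_map *m, uintptr_t obj, uint64_t now_ns, uint64_t ip)
{
    size_t i;

    for (i = 0; i < MAX_OBJS; i++) {
        if (!m->slot[i].used) {
            m->slot[i].obj = obj;
            m->slot[i].ts_ns = now_ns;
            m->slot[i].ip = ip;
            m->slot[i].used = 1;
            return 0;
        }
    }
    return -1; /* map full */
}

/* Models the probe on the free path: forget the object. */
void on_free(struct alloc_map *m, uintptr_t obj)
{
    size_t i;

    for (i = 0; i < MAX_OBJS; i++)
        if (m->slot[i].used && m->slot[i].obj == obj)
            m->slot[i].used = 0;
}

/* Models the user-space walk: count objects older than age_ns. */
size_t count_older_than(const struct alloc_map *m, uint64_t now_ns,
                        uint64_t age_ns)
{
    size_t i, n = 0;

    for (i = 0; i < MAX_OBJS; i++)
        if (m->slot[i].used && now_ns - m->slot[i].ts_ns > age_ns)
            n++;
    return n;
}
```

Objects that are freed quickly never show up in the report; only those that survive past the age threshold do.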
     
  • A BPF C program attaches to the
    blk_mq_start_request()/blk_update_request() kprobe events to
    calculate IO latency.

    For every completed block IO event it computes the time delta
    in nsec and records in a histogram map:

    map[log10(delta)*10]++

    User space reads this histogram map every 2 seconds and prints
    it as a 'heatmap' using gray shades of the text terminal. Black
    spaces have many events and white spaces have very few events.
    The leftmost space is the smallest latency and the rightmost space
    is the largest latency in the range.

    Usage:

    $ sudo ./tracex3
    and run 'sudo dd if=/dev/sda of=/dev/null' in another terminal.

    Observe the IO latencies and how different activities (like 'make
    kernel') affect them.

    Similar experiments can be done for network transmit latencies,
    syscalls, etc.

    '-t' flag prints the heatmap using normal ascii characters:

    $ sudo ./tracex3 -t
    heatmap of IO latency
    # - many events with this latency
    - few events
    |1us |10us |100us |1ms |10ms |100ms |1s |10s
    *ooo. *O.#. # 221
    . *# . # 125
    .. .o#*.. # 55
    . . . . .#O # 37
    .# # 175
    .#*. # 37
    # # 199
    . . *#*. # 55
    *#..* # 42
    # # 266
    ...***Oo#*OO**o#* . # 629
    # # 271
    . .#o* o.*o* # 221
    . . o* *#O.. # 50

    Signed-off-by: Alexei Starovoitov
    Cc: Arnaldo Carvalho de Melo
    Cc: Daniel Borkmann
    Cc: David S. Miller
    Cc: Jiri Olsa
    Cc: Linus Torvalds
    Cc: Masami Hiramatsu
    Cc: Namhyung Kim
    Cc: Peter Zijlstra
    Cc: Steven Rostedt
    Link: http://lkml.kernel.org/r/1427312966-8434-9-git-send-email-ast@plumgrid.com
    Signed-off-by: Ingo Molnar

    Alexei Starovoitov
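
The slot formula map[log10(delta)*10]++ gives ten slots per decade of latency. A BPF program cannot call a floating-point log10(), so some integer approximation is needed. The version below is one possible sketch, not the sample's own code: `latency_slot()` and its threshold table are hypothetical, and input is capped at 10 s to match the range of the heatmap.

```c
#include <stdint.h>

/* Returns floor(log10(delta_ns) * 10) using integer arithmetic only.
 * Assumption of this sketch: inputs above 10 s are clamped to 10 s,
 * and a delta of 0 is placed in slot 0. */
unsigned int latency_slot(uint64_t delta_ns)
{
    /* 10^(k/10) * 1000 for k = 0..9, rounded: sub-decade thresholds. */
    static const uint32_t step_milli[10] = {
        1000, 1259, 1585, 1995, 2512, 3162, 3981, 5012, 6310, 7943
    };
    const uint64_t cap = 10000000000ULL; /* 10 s in nanoseconds */
    uint64_t scale = 1, mant;
    unsigned int decade = 0, k = 0;

    if (delta_ns == 0)
        return 0;
    if (delta_ns > cap)
        delta_ns = cap;

    /* Find the decade: largest power of ten not above delta_ns. */
    while (delta_ns / scale >= 10) {
        scale *= 10;
        decade++;
    }

    /* Mantissa in thousandths, between 1000 and 9999. */
    mant = delta_ns * 1000 / scale;
    while (k < 9 && mant >= step_milli[k + 1])
        k++;

    return decade * 10 + k;
}
```

For example, 2000 ns lands in slot 33, since log10(2000) is about 3.30, and the 10 s cap lands in slot 100.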
     
  • This example has two probes in one C file that attach to
    different kprobe events and use two different maps.

    The 1st probe is an x64-specific equivalent of dropmon. It attaches to
    kfree_skb, retrieves the 'ip' address of the kfree_skb() caller and
    counts the number of packet drops at that 'ip' address. User space
    prints the 'location - count' map every second.

    The 2nd probe attaches to kprobe:sys_write and computes a histogram
    of different write sizes.

    Usage:
    $ sudo tracex2
    location 0xffffffff81695995 count 1
    location 0xffffffff816d0da9 count 2

    location 0xffffffff81695995 count 2
    location 0xffffffff816d0da9 count 2

    location 0xffffffff81695995 count 3
    location 0xffffffff816d0da9 count 2

    557145+0 records in
    557145+0 records out
    285258240 bytes (285 MB) copied, 1.02379 s, 279 MB/s
    syscall write() stats
    byte_size : count distribution
    1 -> 1 : 3 | |
    2 -> 3 : 0 | |
    4 -> 7 : 0 | |
    8 -> 15 : 0 | |
    16 -> 31 : 2 | |
    32 -> 63 : 3 | |
    64 -> 127 : 1 | |
    128 -> 255 : 1 | |
    256 -> 511 : 0 | |
    512 -> 1023 : 1118968 |************************************* |

    Press Ctrl-C at any time. The kernel will automatically clean up maps and programs.

    $ addr2line -ape ./bld_x64/vmlinux 0xffffffff81695995 0xffffffff816d0da9
    0xffffffff81695995: ./bld_x64/../net/ipv4/icmp.c:1038
    0xffffffff816d0da9: ./bld_x64/../net/unix/af_unix.c:1231

    Signed-off-by: Alexei Starovoitov
    Cc: Arnaldo Carvalho de Melo
    Cc: Daniel Borkmann
    Cc: David S. Miller
    Cc: Jiri Olsa
    Cc: Linus Torvalds
    Cc: Masami Hiramatsu
    Cc: Namhyung Kim
    Cc: Peter Zijlstra
    Cc: Steven Rostedt
    Link: http://lkml.kernel.org/r/1427312966-8434-8-git-send-email-ast@plumgrid.com
    Signed-off-by: Ingo Molnar

    Alexei Starovoitov
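
The write-size histogram above uses power-of-two buckets (1, 2-3, 4-7, and so on). A minimal sketch of that bucketing, with hypothetical function names and a plain loop instead of whatever the sample itself uses to compute the logarithm:

```c
#include <stdint.h>

/* Bucket index is floor(log2(v)); a size of 0 is placed in bucket 0. */
unsigned int log2_bucket(uint64_t v)
{
    unsigned int b = 0;

    while (v > 1) {
        v >>= 1;
        b++;
    }
    return b;
}

/* Smallest size that falls into bucket b (valid for b < 63 here). */
uint64_t bucket_low(unsigned int b)
{
    return 1ULL << b;
}

/* Largest size that falls into bucket b (valid for b < 63 here). */
uint64_t bucket_high(unsigned int b)
{
    return (1ULL << (b + 1)) - 1;
}
```

With this scheme, every write of 512 to 1023 bytes increments bucket 9, which corresponds to the "512 -> 1023" row in the output above.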
     
  • tracex1_kern.c - C program compiled into BPF.

    It attaches to kprobe:netif_receive_skb()

    When skb->dev->name == "lo", it prints a sample debug message into
    trace_pipe via the bpf_trace_printk() helper function.

    tracex1_user.c - corresponding user space component that:
    - loads BPF program via bpf() syscall
    - opens kprobes:netif_receive_skb event via perf_event_open()
    syscall
    - attaches the program to event via ioctl(event_fd,
    PERF_EVENT_IOC_SET_BPF, prog_fd);
    - prints from trace_pipe

    Note, this BPF program is non-portable. It must be recompiled
    with current kernel headers. kprobe is not a stable ABI and
    BPF+kprobe scripts may no longer be meaningful when kernel
    internals change.

    No matter how the kernel changes, neither the kprobe
    nor the BPF program can ever crash or corrupt the kernel,
    assuming the kprobes, perf and BPF subsystems have no bugs.

    The verifier will detect that the program is using
    bpf_trace_printk() and the kernel will print 'this is a DEBUG
    kernel' warning banner, which means that bpf_trace_printk()
    should be used for debugging of the BPF program only.

    Usage:
    $ sudo tracex1
    ping-19826 [000] d.s2 63103.382648: : skb ffff880466b1ca00 len 84
    ping-19826 [000] d.s2 63103.382684: : skb ffff880466b1d300 len 84

    ping-19826 [000] d.s2 63104.382533: : skb ffff880466b1ca00 len 84
    ping-19826 [000] d.s2 63104.382594: : skb ffff880466b1d300 len 84

    Signed-off-by: Alexei Starovoitov
    Cc: Arnaldo Carvalho de Melo
    Cc: Daniel Borkmann
    Cc: David S. Miller
    Cc: Jiri Olsa
    Cc: Linus Torvalds
    Cc: Masami Hiramatsu
    Cc: Namhyung Kim
    Cc: Peter Zijlstra
    Cc: Steven Rostedt
    Link: http://lkml.kernel.org/r/1427312966-8434-7-git-send-email-ast@plumgrid.com
    Signed-off-by: Ingo Molnar

    Alexei Starovoitov
     

18 Mar, 2015

1 commit

  • As a follow-on to patch 70006af95515 ("bpf: allow eBPF access skb fields"),
    this patch allows 'protocol' and 'vlan_tci' fields to be accessible
    from extended BPF programs.

    The usage of the 'protocol', 'vlan_present' and 'vlan_tci' fields is the same
    as the corresponding SKF_AD_PROTOCOL, SKF_AD_VLAN_TAG_PRESENT and
    SKF_AD_VLAN_TAG accesses in classic BPF.

    Signed-off-by: Alexei Starovoitov
    Signed-off-by: David S. Miller

    Alexei Starovoitov
     

16 Mar, 2015

1 commit


02 Mar, 2015

2 commits

  • We need to export BPF_PSEUDO_MAP_FD to user space, as it's used in the
    ELF BPF loader where instructions are being loaded that need map fixups.

    An initial stage loads all maps into the kernel, and later on replaces
    related instructions in the eBPF blob with BPF_PSEUDO_MAP_FD as source
    register and the actual fd as immediate value.

    The kernel verifier recognizes this keyword and replaces the map fd with
    a real pointer internally.

    Signed-off-by: Daniel Borkmann
    Acked-by: Alexei Starovoitov
    Signed-off-by: David S. Miller

    Daniel Borkmann
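
The fixup performed by the loader can be sketched as follows. The instruction layout and the constant values match the BPF uapi header as far as I know, but they are redefined locally so the example is self-contained, and `patch_map_fd()` is a hypothetical helper written for illustration, not the loader's function.

```c
#include <stdint.h>

/* Local copy of the eBPF instruction layout. */
struct bpf_insn {
    uint8_t code;        /* opcode */
    uint8_t dst_reg : 4; /* destination register */
    uint8_t src_reg : 4; /* source register */
    int16_t off;         /* signed offset */
    int32_t imm;         /* signed immediate */
};

#define LD_IMM64_OPCODE   0x18 /* BPF_LD | BPF_DW | BPF_IMM */
#define BPF_PSEUDO_MAP_FD 1

/* A 64-bit immediate load occupies two instruction slots. Mark the first
 * one as carrying a map fd and store the fd in its immediate field. */
int patch_map_fd(struct bpf_insn *insn, int map_fd)
{
    if (insn[0].code != LD_IMM64_OPCODE)
        return -1; /* not a 64-bit immediate load */

    insn[0].src_reg = BPF_PSEUDO_MAP_FD;
    insn[0].imm = map_fd;
    insn[1].imm = 0; /* upper 32 bits are unused for an fd */
    return 0;
}
```

The verifier then sees src_reg == BPF_PSEUDO_MAP_FD and swaps the fd for the in-kernel map pointer, as described above.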
     
  • Now that we have BPF_PROG_TYPE_SOCKET_FILTER up and running, we can
    remove the test stubs which were added to get the verifier suite up.

    We can just let the test cases probe under socket filter type instead.
    In the fill/spill test case, we cannot (yet) access fields from the
    context (skb), but we may adapt that test case in future.

    Signed-off-by: Daniel Borkmann
    Acked-by: Alexei Starovoitov
    Signed-off-by: David S. Miller

    Daniel Borkmann
     

27 Jan, 2015

1 commit

  • A hash map is unordered, so the test of the get_next_key() iterator
    shouldn't rely on a particular order of elements. Relax the test accordingly.

    Fixes: ffb65f27a155 ("bpf: add a testsuite for eBPF maps")
    Reported-by: Michael Holzheu
    Signed-off-by: Alexei Starovoitov
    Acked-by: Daniel Borkmann
    Signed-off-by: David S. Miller

    Alexei Starovoitov
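
An order-independent check of the kind this change calls for might look like the sketch below. `same_key_set()` is a hypothetical helper, not the testsuite's code, and it assumes the keys in each array are distinct.

```c
#include <stdbool.h>
#include <stddef.h>

/* True if every expected key appears somewhere in seen, regardless of
 * the order in which the iterator returned the keys. Both arrays hold
 * n keys, assumed distinct within each array. */
bool same_key_set(const long long *seen, const long long *expected, size_t n)
{
    size_t i, j;

    for (i = 0; i < n; i++) {
        bool found = false;

        for (j = 0; j < n; j++) {
            if (seen[j] == expected[i]) {
                found = true;
                break;
            }
        }
        if (!found)
            return false;
    }
    return true;
}
```

A test written this way passes whether the iterator yields key 1 or key 2 first.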
     

06 Dec, 2014

4 commits

  • sockex2_kern.c is a purposefully large eBPF program in C.
    llvm compiles ~200 lines of C code into ~300 eBPF instructions.

    It's similar to __skb_flow_dissect() and demonstrates that complex packet
    parsing can be done by eBPF.
    It then uses the (struct flow_keys)->dst IP address (or a hash of the ipv6
    dst) to keep stats of the number of packets per IP.
    User space loads the eBPF program, attaches it to the loopback interface and
    prints dest_ip->#packets stats every second.

    Usage:
    $ sudo samples/bpf/sockex2
    ip 127.0.0.1 count 19
    ip 127.0.0.1 count 178115
    ip 127.0.0.1 count 369437
    ip 127.0.0.1 count 559841
    ip 127.0.0.1 count 750539

    Signed-off-by: Alexei Starovoitov
    Signed-off-by: David S. Miller

    Alexei Starovoitov
     
  • This example performs the same task as the previous socket example,
    which was written in assembler, but does it in C.

    eBPF program in kernel does:
    /* assume that packet is IPv4, load one byte of IP->proto */
    int index = load_byte(skb, ETH_HLEN + offsetof(struct iphdr, protocol));
    long *value;

    value = bpf_map_lookup_elem(&my_map, &index);
    if (value)
        __sync_fetch_and_add(value, 1);

    Corresponding user space reads map[tcp], map[udp], map[icmp]
    and prints protocol stats every second

    Signed-off-by: Alexei Starovoitov
    Signed-off-by: David S. Miller

    Alexei Starovoitov
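
The counting logic of the program above can be tried out in user space with an ordinary array in place of the BPF map. This is a sketch under assumptions: `proto_count`, `map_lookup()` and `count_packet()` are hypothetical stand-ins for the map and helpers, and the packet is a plain byte buffer that starts at the Ethernet header.

```c
#include <stddef.h>
#include <stdint.h>

#define MAX_ENTRIES 256 /* one counter per possible protocol byte */
#define ETH_HLEN 14
#define IPHDR_PROTOCOL_OFF 9 /* offsetof(struct iphdr, protocol) */

static long proto_count[MAX_ENTRIES]; /* stands in for the array map */

/* Stand-in for bpf_map_lookup_elem(): pointer to the value, or NULL. */
long *map_lookup(int key)
{
    if (key < 0 || key >= MAX_ENTRIES)
        return NULL;
    return &proto_count[key];
}

/* Assume the packet is IPv4, load the protocol byte, bump its counter. */
void count_packet(const uint8_t *pkt, size_t len)
{
    const size_t off = ETH_HLEN + IPHDR_PROTOCOL_OFF;
    long *value;
    int index;

    if (len <= off)
        return; /* too short to contain the protocol byte */

    index = pkt[off];
    value = map_lookup(index);
    if (value)
        __sync_fetch_and_add(value, 1); /* atomic increment */
}
```

Reading map_lookup(6), map_lookup(17) and map_lookup(1) afterwards corresponds to the map[tcp], map[udp] and map[icmp] reads done by the user-space side.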
     
  • simple .o parser and loader using BPF syscall.
    .o is a standard ELF generated by LLVM backend

    It parses an ELF file compiled by llvm (.c -> .o):
    - parses 'maps' section and creates maps via BPF syscall
    - parses 'license' section and passes it to syscall
    - parses elf relocations for BPF maps and adjusts BPF_LD_IMM64 insns
    by storing map_fd into insn->imm and marking such insns as BPF_PSEUDO_MAP_FD
    - loads eBPF programs via BPF syscall

    One ELF file can contain multiple BPF programs.

    int load_bpf_file(char *path);
    populates prog_fd[] and map_fd[] with FDs received from bpf syscall

    bpf_helpers.h - helper functions available to eBPF programs written in C

    Signed-off-by: Alexei Starovoitov
    Signed-off-by: David S. Miller

    Alexei Starovoitov
     
  • this socket filter example does:
    - creates arraymap in kernel with key 4 bytes and value 8 bytes

    - loads eBPF program which assumes that packet is IPv4 and loads one byte of
    IP->proto from the packet and uses it as a key in a map

    r0 = skb->data[ETH_HLEN + offsetof(struct iphdr, protocol)];
    *(u32*)(fp - 4) = r0;
    value = bpf_map_lookup_elem(map_fd, fp - 4);
    if (value)
        (*(u64*)value) += 1;

    - attaches this program to raw socket

    - every second user space reads map[IPPROTO_TCP], map[IPPROTO_UDP], map[IPPROTO_ICMP]
    to see how many packets of given protocol were seen on loopback interface

    Usage:
    $ sudo samples/bpf/sock_example
    TCP 0 UDP 0 ICMP 0 packets
    TCP 187600 UDP 0 ICMP 4 packets
    TCP 376504 UDP 0 ICMP 8 packets
    TCP 563116 UDP 0 ICMP 12 packets
    TCP 753144 UDP 0 ICMP 16 packets

    Signed-off-by: Alexei Starovoitov
    Signed-off-by: David S. Miller

    Alexei Starovoitov
     

19 Nov, 2014

2 commits


31 Oct, 2014

1 commit

  • - add a test specifically targeting verifier state pruning.
    It checks state propagation between registers, storing that
    state into stack and state pruning algorithm recognizing
    equivalent stack and register states.

    - add summary line to spot failures easier

    Signed-off-by: Alexei Starovoitov
    Signed-off-by: David S. Miller

    Alexei Starovoitov
     

22 Oct, 2014

1 commit

  • When comparing verifier states for equivalence, the comparison
    was missing a check for an uninitialized register.
    Add the missing check and a testcase.

    Fixes: f1bca824dabb ("bpf: add search pruning optimization to verifier")
    Cc: Hannes Frederic Sowa
    Signed-off-by: Alexei Starovoitov
    Acked-by: Hannes Frederic Sowa
    Signed-off-by: David S. Miller

    Alexei Starovoitov
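
To make the missing check concrete, here is a simplified model of comparing two register states for pruning. The rules encoded below are my reading of the intent and should be treated as assumptions: an old register that was never initialized places no constraint on the current state, and an old "unknown value" register only matches a current register that is initialized. The enum, `NREGS` and `regs_equivalent()` are hypothetical names, not the verifier's code.

```c
#include <stdbool.h>

/* Hypothetical subset of register states for this model. */
enum reg_type { NOT_INIT, UNKNOWN_VALUE, PTR_TO_CTX };

#define NREGS 11 /* R0..R10 */

/* Can the current state be pruned because the old, already explored
 * state covers it? */
bool regs_equivalent(const enum reg_type *old, const enum reg_type *cur)
{
    int i;

    for (i = 0; i < NREGS; i++) {
        if (old[i] == cur[i])
            continue;
        /* The explored path never relied on this register. */
        if (old[i] == NOT_INIT)
            continue;
        /* Any initialized register is acceptable where the explored
         * path only needed some value; an uninitialized one is not. */
        if (old[i] == UNKNOWN_VALUE && cur[i] != NOT_INIT)
            continue;
        return false;
    }
    return true;
}
```

Without the `cur[i] != NOT_INIT` condition, a state with an uninitialized register would be treated as equivalent to one where that register held a value, which is the gap described above.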
     

02 Oct, 2014

1 commit


27 Sep, 2014

1 commit

  • 1.
    the library includes a trivial set of BPF syscall wrappers:
    int bpf_create_map(int key_size, int value_size, int max_entries);
    int bpf_update_elem(int fd, void *key, void *value);
    int bpf_lookup_elem(int fd, void *key, void *value);
    int bpf_delete_elem(int fd, void *key);
    int bpf_get_next_key(int fd, void *key, void *next_key);
    int bpf_prog_load(enum bpf_prog_type prog_type,
                      const struct sock_filter_int *insns, int insn_len,
                      const char *license);
    bpf_prog_load() stores verifier log into global bpf_log_buf[] array

    and BPF_*() macros to build instructions

    2.
    test stubs configure eBPF infra with 'unspec' map and program types.
    These are fake types used by user space testsuite only.

    3.
    verifier tests valid and invalid programs and expects predefined
    error log messages from kernel.
    40 tests so far.

    $ sudo ./test_verifier
    #0 add+sub+mul OK
    #1 unreachable OK
    #2 unreachable2 OK
    #3 out of range jump OK
    #4 out of range jump2 OK
    #5 test1 ld_imm64 OK
    ...

    Signed-off-by: Alexei Starovoitov
    Signed-off-by: David S. Miller

    Alexei Starovoitov
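
The BPF_*() macros mentioned in item 1 build instruction structs from readable arguments. The sketch below shows the idea with two such macros. The opcode values match the BPF headers as far as I know, but everything is redefined locally to keep the example self-contained. The struct is named bpf_insn here, its later name, whereas the prototypes above use the older sock_filter_int, and `build_return_zero()` is a hypothetical helper.

```c
#include <stddef.h>
#include <stdint.h>

/* Local copy of the eBPF instruction layout. */
struct bpf_insn {
    uint8_t code;
    uint8_t dst_reg : 4;
    uint8_t src_reg : 4;
    int16_t off;
    int32_t imm;
};

/* Opcode building blocks. */
#define BPF_ALU64 0x07
#define BPF_JMP   0x05
#define BPF_MOV   0xb0
#define BPF_EXIT  0x90
#define BPF_K     0x00

/* dst_reg = imm (64-bit move of an immediate) */
#define BPF_MOV64_IMM(DST, IMM)                     \
    ((struct bpf_insn) {                            \
        .code    = BPF_ALU64 | BPF_MOV | BPF_K,     \
        .dst_reg = (DST),                           \
        .src_reg = 0,                               \
        .off     = 0,                               \
        .imm     = (IMM) })

/* return from the program; R0 holds the return value */
#define BPF_EXIT_INSN()                             \
    ((struct bpf_insn) {                            \
        .code    = BPF_JMP | BPF_EXIT,              \
        .dst_reg = 0,                               \
        .src_reg = 0,                               \
        .off     = 0,                               \
        .imm     = 0 })

/* Writes the two-instruction program "r0 = 0; exit" into out and
 * returns the number of instructions written. */
size_t build_return_zero(struct bpf_insn *out)
{
    out[0] = BPF_MOV64_IMM(0, 0);
    out[1] = BPF_EXIT_INSN();
    return 2;
}
```

An array built this way is what would be handed to bpf_prog_load() together with its length and a license string.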