15 Dec, 2015

1 commit

  • [ Upstream commit fbca9d2d35c6ef1b323fae75cc9545005ba25097 ]

    During my own review, but also reported by Dmitry's syzkaller [1], it has
    been noticed that we trigger a heap out-of-bounds access on eBPF array
    maps when updating elements. This happens with each map whose
    map->value_size (specified at map creation time) is not a multiple of 8 bytes.

    In array_map_alloc(), elem_size is round_up(attr->value_size, 8) and
    used to align array map slots for faster access. However, in function
    array_map_update_elem(), we update the element as ...

    memcpy(array->value + array->elem_size * index, value, array->elem_size);

    ... where we access 'value' out-of-bounds, since it was allocated from
    map_update_elem() on the syscall side as kmalloc(map->value_size, GFP_USER)
    and later on copied through copy_from_user(value, uvalue, map->value_size).
    Thus, we can access up to 7 bytes out-of-bounds.
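
    A sketch of the corrected copy in array_map_update_elem(): elem_size
    (the rounded-up slot size) still sizes and aligns the slots, but only
    the user-visible map->value_size bytes may be read from 'value':

    /* slots stay 8-byte aligned via elem_size, but copy only what
     * was actually allocated and filled in from user space
     */
    memcpy(array->value + array->elem_size * index, value, map->value_size);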

    Same could happen from within an eBPF program, where in worst case we
    access beyond an eBPF program's designated stack.

    Since 1be7f75d1668 ("bpf: enable non-root eBPF programs") didn't hit an
    official release yet, it only affects privileged users.

    In case of array_map_lookup_elem(), the verifier prevents eBPF programs
    from accessing beyond map->value_size through check_map_access(). Also
    from syscall side map_lookup_elem() only copies map->value_size back to
    user, so nothing could leak.

    [1] http://github.com/google/syzkaller

    Fixes: 28fbcfa08d8e ("bpf: add array type of eBPF maps")
    Reported-by: Dmitry Vyukov
    Signed-off-by: Daniel Borkmann
    Acked-by: Alexei Starovoitov
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Daniel Borkmann
     

28 Apr, 2015

1 commit

  • The ALU64_DIV instruction should divide 64-bit by 64-bit,
    whereas do_div() does a 64-bit by 32-bit divide.
    The x64 and arm64 JITs correctly implement a 64 by 64 unsigned divide,
    and the llvm BPF backend emits code assuming that ALU64_DIV does 64 by 64.
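
    To illustrate the mismatch, a small userspace sketch (values chosen for
    illustration; do_div()-style semantics emulated by truncating the divisor
    to 32 bits):

    #include <stdint.h>
    #include <stdio.h>

    int main(void)
    {
            uint64_t dividend = 0x200000000ULL;     /* 2^33 */
            uint64_t divisor  = 0x100000001ULL;     /* 2^32 + 1 */

            /* do_div()-style: divisor silently truncated to u32 (== 1) */
            uint64_t wrong = dividend / (uint32_t)divisor;  /* 2^33 */

            /* ALU64_DIV semantics: full 64 by 64 unsigned divide */
            uint64_t right = dividend / divisor;            /* == 1 */

            printf("truncated: %llu, full: %llu\n",
                   (unsigned long long)wrong, (unsigned long long)right);
            return 0;
    }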

    Fixes: 89aa075832b0 ("net: sock: allow eBPF programs to be attached to sockets")
    Reported-by: Michael Holzheu
    Acked-by: Daniel Borkmann
    Signed-off-by: Alexei Starovoitov
    Signed-off-by: David S. Miller

    Alexei Starovoitov
     

17 Apr, 2015

2 commits

  • 1.
    The first bug is a silly mistake. It broke the tracing examples and
    prevented simple bpf programs from loading.

    In the following code:
    if (insn->imm == 0 && BPF_SIZE(insn->code) == BPF_W) {
    } else if (...) {
            // this part should have been executed when
            // insn->code == BPF_W and insn->imm != 0
    }

    Obviously it's not doing that. So simple instructions like:
    r2 = *(u64 *)(r1 + 8)
    will be rejected. Note that the comments in the code around these branches
    were and still are valid and indicate the true intent.

    Replace it with:
    if (BPF_SIZE(insn->code) != BPF_W)
            continue;

    if (insn->imm == 0) {
    } else if (...) {
            // now this code will be executed when
            // insn->code == BPF_W and insn->imm != 0
    }

    2.
    The second bug is more subtle.
    If malicious code uses the same dest register as the source register,
    the checks designed to prevent the same instruction from being used with
    different pointer types will fail to trigger, since we were assigning
    src_reg_type after check_mem_access() had already overwritten the register.
    The fix is trivial: just move the line
    src_reg_type = regs[insn->src_reg].type;
    before the check_mem_access() call.
    Add new 'access skb fields bad4' test to check this case.
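
    A condensed sketch of the reordering (surrounding load-handling code
    abbreviated; the exact check_mem_access() argument list is an assumption):

    /* before (buggy): for a load, check_mem_access() writes the result
     * into dst_reg; if dst_reg == src_reg, the source's pointer type
     * is gone by the time we record it
     */
    err = check_mem_access(env, insn->src_reg, insn->off,
                           BPF_SIZE(insn->code), BPF_READ, insn->dst_reg);
    src_reg_type = regs[insn->src_reg].type;        /* too late */

    /* after (fixed): record the pointer type first */
    src_reg_type = regs[insn->src_reg].type;
    err = check_mem_access(env, insn->src_reg, insn->off,
                           BPF_SIZE(insn->code), BPF_READ, insn->dst_reg);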

    Fixes: 9bac3d6d548e ("bpf: allow extended BPF programs access skb fields")
    Signed-off-by: Alexei Starovoitov
    Signed-off-by: David S. Miller

    Alexei Starovoitov
     
  • Due to a missing bounds check, the DAG pass of the BPF verifier can
    corrupt memory, which can cause random crashes during program loading:

    [8.449451] BUG: unable to handle kernel paging request at ffffffffffffffff
    [8.451293] IP: [] kmem_cache_alloc_trace+0x8d/0x2f0
    [8.452329] Oops: 0000 [#1] SMP
    [8.452329] Call Trace:
    [8.452329] [] bpf_check+0x852/0x2000
    [8.452329] [] bpf_prog_load+0x1e4/0x310
    [8.452329] [] ? might_fault+0x5f/0xb0
    [8.452329] [] SyS_bpf+0x806/0xa30

    Fixes: f1bca824dabb ("bpf: add search pruning optimization to verifier")
    Signed-off-by: Alexei Starovoitov
    Acked-by: Hannes Frederic Sowa
    Acked-by: Daniel Borkmann
    Signed-off-by: David S. Miller

    Alexei Starovoitov
     

16 Apr, 2015

1 commit

  • Pull networking updates from David Miller:

    1) Add BQL support to via-rhine, from Tino Reichardt.

    2) Integrate SWITCHDEV layer support into the DSA layer, so DSA drivers
    can support hw switch offloading. From Florian Fainelli.

    3) Allow 'ip address' commands to initiate multicast group join/leave,
    from Madhu Challa.

    4) Many ipv4 FIB lookup optimizations from Alexander Duyck.

    5) Support EBPF in cls_bpf classifier and act_bpf action, from Daniel
    Borkmann.

    6) Remove the ugly compat support in ARP for ugly layers like ax25,
    rose, etc. And use this to clean up the neigh layer, then use it to
    implement MPLS support. All from Eric Biederman.

    7) Support L3 forwarding offloading in switches, from Scott Feldman.

    8) Collapse the LOCAL and MAIN ipv4 FIB tables when possible, to speed
    up route lookups even further. From Alexander Duyck.

    9) Many improvements and bug fixes to the rhashtable implementation,
    from Herbert Xu and Thomas Graf. In particular, in the case where
    an rhashtable user bulk adds a large number of items into an empty
    table, we expand the table much more sanely.

    10) Don't make the tcp_metrics hash table per-namespace, from Eric
    Biederman.

    11) Extend EBPF to access SKB fields, from Alexei Starovoitov.

    12) Split out new connection request sockets so that they can be
    established in the main hash table. Much less false sharing since
    hash lookups go direct to the request sockets instead of having to
    go first to the listener then to the request socks hashed
    underneath. From Eric Dumazet.

    13) Add async I/O support for crypto AF_ALG sockets, from Tadeusz Struk.

    14) Support stable privacy address generation for RFC7217 in IPV6. From
    Hannes Frederic Sowa.

    15) Hash network namespace into IP frag IDs, also from Hannes Frederic
    Sowa.

    16) Convert PTP get/set methods to use 64-bit time, from Richard
    Cochran.

    * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1816 commits)
    fm10k: Bump driver version to 0.15.2
    fm10k: corrected VF multicast update
    fm10k: mbx_update_max_size does not drop all oversized messages
    fm10k: reset head instead of calling update_max_size
    fm10k: renamed mbx_tx_dropped to mbx_tx_oversized
    fm10k: update xcast mode before synchronizing multicast addresses
    fm10k: start service timer on probe
    fm10k: fix function header comment
    fm10k: comment next_vf_mbx flow
    fm10k: don't handle mailbox events in iov_event path and always process mailbox
    fm10k: use separate workqueue for fm10k driver
    fm10k: Set PF queues to unlimited bandwidth during virtualization
    fm10k: expose tx_timeout_count as an ethtool stat
    fm10k: only increment tx_timeout_count in Tx hang path
    fm10k: remove extraneous "Reset interface" message
    fm10k: separate PF only stats so that VF does not display them
    fm10k: use hw->mac.max_queues for stats
    fm10k: only show actual queues, not the maximum in hardware
    fm10k: allow creation of VLAN on default vid
    fm10k: fix unused warnings
    ...

    Linus Torvalds
     

02 Apr, 2015

1 commit

  • BPF programs, attached to kprobes, provide a safe way to execute
    user-defined BPF byte-code programs without being able to crash or
    hang the kernel in any way. The BPF engine makes sure that such
    programs have a finite execution time and that they cannot break
    out of their sandbox.

    The user interface is to attach to a kprobe via the perf syscall:

    struct perf_event_attr attr = {
            .type = PERF_TYPE_TRACEPOINT,
            .config = event_id,
            ...
    };

    event_fd = perf_event_open(&attr,...);
    ioctl(event_fd, PERF_EVENT_IOC_SET_BPF, prog_fd);

    'prog_fd' is a file descriptor associated with a previously loaded
    BPF program.

    'event_id' is the ID of the created kprobe.

    Closing 'event_fd':

    close(event_fd);

    ... automatically detaches BPF program from it.

    BPF programs can call in-kernel helper functions to:

    - lookup/update/delete elements in maps

    - probe_read - a wrapper of probe_kernel_read() used to access any
    kernel data structures

    BPF programs receive 'struct pt_regs *' as an input ('struct pt_regs' is
    architecture dependent) and return 0 to ignore the event or 1 to store
    the kprobe event into the ring buffer.
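
    A minimal sketch of such a program (hypothetical example; the register
    layout of pt_regs is architecture dependent):

    int bpf_prog(struct pt_regs *ctx)
    {
            if (ctx->di == 0)       /* x86-64: first function argument in %rdi */
                    return 0;       /* ignore the event */
            return 1;               /* store the kprobe event in the ring buffer */
    }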

    Note, kprobes are fundamentally _not_ a stable kernel ABI, so BPF
    programs attached to kprobes must be recompiled for every kernel version,
    and the user must supply the correct LINUX_VERSION_CODE in
    attr.kern_version during the bpf_prog_load() call.

    Signed-off-by: Alexei Starovoitov
    Reviewed-by: Steven Rostedt
    Reviewed-by: Masami Hiramatsu
    Cc: Andrew Morton
    Cc: Arnaldo Carvalho de Melo
    Cc: Arnaldo Carvalho de Melo
    Cc: Daniel Borkmann
    Cc: David S. Miller
    Cc: Jiri Olsa
    Cc: Linus Torvalds
    Cc: Namhyung Kim
    Cc: Peter Zijlstra
    Cc: Peter Zijlstra
    Link: http://lkml.kernel.org/r/1427312966-8434-4-git-send-email-ast@plumgrid.com
    Signed-off-by: Ingo Molnar

    Alexei Starovoitov
     

30 Mar, 2015

1 commit

  • The existing TC action 'pedit' can munge any bits of the packet.
    Generalize it for use in bpf programs attached as cls_bpf and act_bpf via
    the bpf_skb_store_bytes() helper function.
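
    A hedged sketch of how a program might use the helper, in the style of
    the other helper examples in this log (treat the exact argument list of
    bpf_skb_store_bytes() as an assumption):

    static int (*bpf_skb_store_bytes)(void *ctx, int off, void *from,
                                      int len, int flags) =
            (void *) BPF_FUNC_skb_store_bytes;

    int prog(struct sk_buff *skb)
    {
            __u8 tos = 0x10;

            /* pedit-style rewrite: set the IPv4 TOS byte in place */
            bpf_skb_store_bytes(skb, ETH_HLEN + __builtin_offsetof(struct iphdr, tos),
                                &tos, sizeof(tos), 0);
            return 0;
    }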

    Signed-off-by: Alexei Starovoitov
    Reviewed-by: Jiri Pirko
    Acked-by: Daniel Borkmann
    Signed-off-by: David S. Miller

    Alexei Starovoitov
     

21 Mar, 2015

1 commit

  • In order to prepare eBPF support for the tc action, we need to add
    sched_act_type, so that the eBPF verifier is aware of which helper
    functions act_bpf may use, and that it can load skb data and read out
    currently available skb fields.

    This is basically analogous to 96be4325f443 ("ebpf: add sched_cls_type
    and map it to sk_filter's verifier ops").

    BPF_PROG_TYPE_SCHED_CLS and BPF_PROG_TYPE_SCHED_ACT need to be
    separate since both will have a different set of functionality in the
    future (classifier vs action), thus we won't run into ABI trouble
    when the time comes to diverge functionality from the classifier.

    The future plan for act_bpf would be that it will be able to write
    into skb->data and alter selected fields mirrored in struct __sk_buff.

    For an initial support, it's sufficient to map it to sk_filter_ops.

    Signed-off-by: Daniel Borkmann
    Cc: Jiri Pirko
    Reviewed-by: Jiri Pirko
    Acked-by: Alexei Starovoitov
    Signed-off-by: David S. Miller

    Daniel Borkmann
     

16 Mar, 2015

3 commits

  • introduce a user-accessible mirror of the in-kernel 'struct sk_buff':

    struct __sk_buff {
            __u32 len;
            __u32 pkt_type;
            __u32 mark;
            __u32 queue_mapping;
    };

    bpf programs can do:

    int bpf_prog(struct __sk_buff *skb)
    {
            __u32 var = skb->pkt_type;

    which will be compiled to bpf assembler as:

    dst_reg = *(u32 *)(src_reg + 4) // 4 == offsetof(struct __sk_buff, pkt_type)

    bpf verifier will check validity of access and will convert it to:

    dst_reg = *(u8 *)(src_reg + offsetof(struct sk_buff, __pkt_type_offset))
    dst_reg &= 7

    since skb->pkt_type is a bitfield.

    Signed-off-by: Alexei Starovoitov
    Signed-off-by: David S. Miller

    Alexei Starovoitov
     
  • This patch adds the possibility to obtain raw_smp_processor_id() in
    eBPF. Currently, this is only possible in classic BPF where commit
    da2033c28226 ("filter: add SKF_AD_RXHASH and SKF_AD_CPU") has added
    facilities for this.

    Perhaps most importantly, this would also allow us to track per CPU
    statistics with eBPF maps, or to implement a poor-man's per CPU data
    structure through eBPF maps.

    Example function proto-type looks like:

    u32 (*smp_processor_id)(void) = (void *)BPF_FUNC_get_smp_processor_id;
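
    A hedged sketch of the 'poor man's per CPU data structure' idea, using
    the binding above plus a map-lookup binding ('cpu_counts' is a
    hypothetical array map with one 8-byte slot per possible CPU):

    static void *(*bpf_map_lookup_elem)(void *map, void *key) =
            (void *) BPF_FUNC_map_lookup_elem;

    int prog(struct sk_buff *skb)
    {
            u32 cpu = smp_processor_id();
            long *count = bpf_map_lookup_elem(&cpu_counts, &cpu);

            if (count)      /* slot is private to this CPU: no lock needed */
                    __sync_fetch_and_add(count, 1);
            return 0;
    }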

    Signed-off-by: Daniel Borkmann
    Signed-off-by: David S. Miller

    Daniel Borkmann
     
  • This work is similar to commit 4cd3675ebf74 ("filter: added BPF
    random opcode") and adds a possibility for packet sampling in eBPF.

    Currently, this is only possible in classic BPF and useful to
    combine sampling with f.e. packet sockets, possible also with tc.

    Example function proto-type looks like:

    u32 (*prandom_u32)(void) = (void *)BPF_FUNC_get_prandom_u32;
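
    A sketch of sampling with it, using the binding above and assuming
    classic socket-filter return semantics (0 drops the packet, a non-zero
    value keeps that many bytes):

    /* keep roughly 1 in 100 packets */
    int prog(struct sk_buff *skb)
    {
            if (prandom_u32() > 0xffffffff / 100)
                    return 0;
            return 0xffff;  /* accept up to 64k bytes of the packet */
    }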

    Signed-off-by: Daniel Borkmann
    Signed-off-by: David S. Miller

    Daniel Borkmann
     

13 Mar, 2015

1 commit

  • I noticed that a helper function with argument type ARG_ANYTHING does
    not need to have an initialized value (register).

    This can in the worst case lead to unintended stack memory leakage in
    future helper functions if they are not carefully designed, or to
    unintended application behaviour in case the application developer was
    not careful enough to match the correct helper function signature in the API.

    The underlying issue is that ARG_ANYTHING should actually be split
    into two different semantics:

    1) ARG_DONTCARE for function arguments that the helper function
    does not care about (in other words: the default for unused
    function arguments), and

    2) ARG_ANYTHING that is an argument actually being used by a
    helper function and *guaranteed* to be an initialized register.

    The current risk is low: ARG_ANYTHING is only used for the 'flags'
    argument (r4) in bpf_map_update_elem() that internally does strict
    checking.
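
    For illustration, the helper proto in question would then look roughly
    like this (field layout assumed from the description and the verifier's
    bpf_func_proto):

    const struct bpf_func_proto bpf_map_update_elem_proto = {
            .func      = bpf_map_update_elem,
            .gpl_only  = false,
            .ret_type  = RET_INTEGER,
            .arg1_type = ARG_CONST_MAP_PTR,         /* r1: map */
            .arg2_type = ARG_PTR_TO_MAP_KEY,        /* r2: key */
            .arg3_type = ARG_PTR_TO_MAP_VALUE,      /* r3: value */
            .arg4_type = ARG_ANYTHING,              /* r4: flags, must be initialized */
            /* r5 is unused and defaults to ARG_DONTCARE */
    };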

    Fixes: 17a5267067f3 ("bpf: verifier (add verifier core)")
    Signed-off-by: Daniel Borkmann
    Acked-by: Alexei Starovoitov
    Signed-off-by: David S. Miller

    Daniel Borkmann
     

07 Mar, 2015

1 commit

  • Fengguang reported that on the openrisc and avr32 architectures, we get
    the following linker errors on *_defconfig builds that have no bpf
    syscall support:

    net/built-in.o:(.rodata+0x1cd0): undefined reference to `bpf_map_lookup_elem_proto'
    net/built-in.o:(.rodata+0x1cd4): undefined reference to `bpf_map_update_elem_proto'
    net/built-in.o:(.rodata+0x1cd8): undefined reference to `bpf_map_delete_elem_proto'

    Fix it up by providing built-in weak definitions of the symbols,
    so they can be overridden when the syscall is enabled. I think
    the issue might be that gcc is not able to optimize all that away.
    This patch fixes the linker errors for me, tested with Fengguang's
    make.cross [1] script.
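
    A sketch of the weak built-in definitions (placement and exact form
    assumed; strong definitions from the syscall code override them when
    CONFIG_BPF_SYSCALL is enabled):

    /* weak fallbacks resolve the references in always-built code even
     * when the bpf syscall is compiled out (assumption: the empty
     * protos are never dereferenced in that configuration)
     */
    const struct bpf_func_proto bpf_map_lookup_elem_proto __weak;
    const struct bpf_func_proto bpf_map_update_elem_proto __weak;
    const struct bpf_func_proto bpf_map_delete_elem_proto __weak;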

    [1] https://git.kernel.org/cgit/linux/kernel/git/wfg/lkp-tests.git/plain/sbin/make.cross

    Reported-by: Fengguang Wu
    Fixes: d4052c4aea0c ("ebpf: remove CONFIG_BPF_SYSCALL ifdefs in socket filter code")
    Signed-off-by: Daniel Borkmann
    Acked-by: Alexei Starovoitov
    Signed-off-by: David S. Miller

    Daniel Borkmann
     

02 Mar, 2015

5 commits

  • This work extends the "classic" BPF programmable tc classifier by
    extending its scope also to native eBPF code!

    This allows user space to implement its own custom, 'safe', C-like
    classifiers (or whatever other frontend language LLVM et al may
    provide in the future), which can then be compiled with the LLVM eBPF
    backend to an eBPF ELF file. The result of this can be loaded into
    the kernel via iproute2's tc. In the kernel, they can be JITed on
    major archs and thus run at native performance.

    Simple, minimal toy example to demonstrate the workflow:

    #include <linux/if_ether.h>
    #include <linux/ip.h>

    #include "tc_bpf_api.h"

    __section("classify")
    int cls_main(struct sk_buff *skb)
    {
            return (0x800 << 16) | load_byte(skb, ETH_HLEN + __builtin_offsetof(struct iphdr, tos));
    }

    char __license[] __section("license") = "GPL";

    The classifier can then be compiled into eBPF opcodes and loaded
    via tc, for example:

    clang -O2 -emit-llvm -c cls.c -o - | llc -march=bpf -filetype=obj -o cls.o
    tc filter add dev em1 parent 1: bpf cls.o [...]

    As has been demonstrated, the scope can even reach up to a fully
    fledged flow dissector (similar to samples/bpf/sockex2_kern.c).

    For tc, maps are allowed to be used, but from kernel context only;
    in other words, eBPF code can keep state across filter invocations.
    In the future, we perhaps may reattach from a different application to
    those maps, e.g. to read out collected statistics/state.

    As with socket filters, we may extend functionality for eBPF
    classifiers over time depending on the use cases. For that purpose,
    cls_bpf programs use the BPF_PROG_TYPE_SCHED_CLS program type, so
    we can allow additional functions/accessors (e.g. an ABI compatible
    offset translation to skb fields/metadata). For the initial cls_bpf
    support, we allow the same set of helper functions as eBPF socket
    filters, but we could diverge at some point in time w/o problem.

    I was wondering whether cls_bpf and act_bpf could share C programs;
    I can imagine that at some point we introduce i) further common
    handlers for both (or even beyond their scope), and/or, if truly needed,
    ii) some restricted function space for each of them. Both can be
    abstracted easily through struct bpf_verifier_ops in future.

    The context of cls_bpf versus act_bpf is slightly different though:
    a cls_bpf program will return a specific classid whereas act_bpf returns
    a drop/non-drop code; the latter may also mangle skbs in the future.
    That said, we can surely have a "classify" and an "action" section in
    a single object file, or, considering the mentioned constraint, add
    the possibility of a shared section.

    The workflow for getting native eBPF running from tc [1] is as
    follows: for f_bpf, I've added slightly modified ELF parser code
    from Alexei's kernel sample, which reads out the LLVM compiled
    object, sets up maps (and dynamically fixes up map fds) if any, and
    loads the eBPF instructions all centrally through the bpf syscall.

    The resulting fd from the loaded program itself is passed down
    to cls_bpf, which looks up struct bpf_prog from the fd store and
    holds a reference, so that it stays available beyond the tc program's
    lifetime. On tc filter destruction, it will then drop its reference.

    Moreover, I've also added the optional possibility to annotate an
    eBPF filter with a name (e.g. path to object file, or something
    else if preferred) so that when tc dumps currently installed filters,
    some more context can be given to an admin for a given instance (as
    opposed to just the file descriptor number).

    Last but not least, bpf_prog_get() and bpf_prog_put() needed to be
    exported, so that eBPF can be used from cls_bpf built as a module.
    Thanks to 60a3b2253c41 ("net: bpf: make eBPF interpreter images
    read-only"), I think this is of no concern, since anything wanting to
    alter eBPF opcodes after the verification stage would crash the kernel.

    [1] http://git.breakpoint.cc/cgit/dborkman/iproute2.git/log/?h=ebpf

    Signed-off-by: Daniel Borkmann
    Cc: Jamal Hadi Salim
    Cc: Jiri Pirko
    Acked-by: Alexei Starovoitov
    Signed-off-by: David S. Miller

    Daniel Borkmann
     
  • is_gpl_compatible and prog_type should be moved directly into bpf_prog,
    as they stay immutable during bpf_prog's lifetime, are core attributes,
    and can be locked as read-only later on via bpf_prog_select_runtime().

    With a bit of rearranging, this also allows us to shrink bpf_prog_aux
    to exactly 1 cacheline.

    Signed-off-by: Daniel Borkmann
    Acked-by: Alexei Starovoitov
    Signed-off-by: David S. Miller

    Daniel Borkmann
     
  • As discussed recently and at netconf/netdev01, we want to prevent making
    bpf_verifier_ops registration available to modules, and have it at a
    controlled place inside the kernel instead.

    The reason for this is that out-of-tree modules can go crazy and define
    and register any verifier ops they want, doing all sorts of crap, even
    bypassing available GPLed eBPF helper functions. We don't want to offer
    such a shiny playground, of course, but want to keep strict control to
    ourselves inside the core kernel.

    This also encourages us to design eBPF user helpers carefully and
    generically, so they can be shared among various subsystems using eBPF.

    For the eBPF traffic classifier (cls_bpf), it's a good start to share
    the same helper facilities as we currently do in eBPF for socket filters.

    That way, we have BPF_PROG_TYPE_SCHED_CLS look like its own type, thus
    one day, if there's a good reason to diverge the set of helper functions
    from the set available to socket filters, we keep ABI compatibility.

    In future, we could place all bpf_prog_type_list at a central place,
    perhaps.

    Signed-off-by: Daniel Borkmann
    Signed-off-by: Alexei Starovoitov
    Signed-off-by: David S. Miller

    Daniel Borkmann
     
  • We can move bpf_map_ops, bpf_verifier_ops and the other structs into the
    read-only section, and bpf_map_type_list and bpf_prog_type_list into
    read-mostly.
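
    A sketch of the idea (struct and field names as used elsewhere in this
    log; initializer contents abbreviated):

    /* ops tables never change after build: const lands them in .rodata */
    static const struct bpf_map_ops array_ops = {
            .map_alloc       = array_map_alloc,
            .map_lookup_elem = array_map_lookup_elem,
            /* ... */
    };

    /* type lists are written once at registration, then only read */
    static struct bpf_map_type_list array_type __read_mostly = {
            .ops  = &array_ops,
            .type = BPF_MAP_TYPE_ARRAY,
    };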

    Signed-off-by: Daniel Borkmann
    Acked-by: Alexei Starovoitov
    Signed-off-by: David S. Miller

    Daniel Borkmann
     
  • Now that we have BPF_PROG_TYPE_SOCKET_FILTER up and running, we can
    remove the test stubs which were added to get the verifier suite up.

    We can just let the test cases probe under the socket filter type instead.
    In the fill/spill test case, we cannot (yet) access fields from the
    context (skb), but we may adapt that test case in the future.

    Signed-off-by: Daniel Borkmann
    Acked-by: Alexei Starovoitov
    Signed-off-by: David S. Miller

    Daniel Borkmann
     

28 Jan, 2015

1 commit

  • Pull networking fixes from David Miller:

    1) Don't OOPS on socket AIO, from Christoph Hellwig.

    2) Scheduled scans should be aborted upon RFKILL, from Emmanuel
    Grumbach.

    3) Fix sleep in atomic context in kvaser_usb, from Ahmed S Darwish.

    4) Fix RCU locking across copy_to_user() in bpf code, from Alexei
    Starovoitov.

    5) Lots of crash, memory leak, short TX packet et al bug fixes in
    sh_eth from Ben Hutchings.

    6) Fix memory corruption in SCTP wrt. INIT collisions, from Daniel
    Borkmann.

    7) Fix return value logic for poll handlers in netxen, enic, and bnx2x.
    From Eric Dumazet and Govindarajulu Varadarajan.

    8) Header length calculation fix in mac80211 from Fred Chou.

    9) mv643xx_eth doesn't handle highmem correctly in non-TSO code paths.
    From Ezequiel Garcia.

    10) udp_diag has bogus logic in its hash chain skipping; copy the same
    fix tcp diag used. From Herbert Xu.

    11) amd-xgbe programs wrong rx flow control register, from Thomas
    Lendacky.

    12) Fix race leading to use after free in ping receive path, from Subash
    Abhinov Kasiviswanathan.

    13) Cache redirect routes otherwise we can get a heavy backlog of rcu
    jobs liberating DST_NOCACHE entries. From Hannes Frederic Sowa.

    * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (48 commits)
    net: don't OOPS on socket aio
    stmmac: prevent probe drivers to crash kernel
    bnx2x: fix napi poll return value for repoll
    ipv6: replacing a rt6_info needs to purge possible propagated rt6_infos too
    sh_eth: Fix DMA-API usage for RX buffers
    sh_eth: Check for DMA mapping errors on transmit
    sh_eth: Ensure DMA engines are stopped before freeing buffers
    sh_eth: Remove RX overflow log messages
    ping: Fix race in free in receive path
    udp_diag: Fix socket skipping within chain
    can: kvaser_usb: Fix state handling upon BUS_ERROR events
    can: kvaser_usb: Retry the first bulk transfer on -ETIMEDOUT
    can: kvaser_usb: Send correct context to URB completion
    can: kvaser_usb: Do not sleep in atomic context
    ipv4: try to cache dst_entries which would cause a redirect
    samples: bpf: relax test_maps check
    bpf: rcu lock must not be held when calling copy_to_user()
    net: sctp: fix slab corruption from use after free on INIT collisions
    net: mv643xx_eth: Fix highmem support in non-TSO egress path
    sh_eth: Fix serialisation of interrupt disable with interrupt & NAPI handlers
    ...

    Linus Torvalds
     

27 Jan, 2015

1 commit

  • BUG: sleeping function called from invalid context at mm/memory.c:3732
    in_atomic(): 0, irqs_disabled(): 0, pid: 671, name: test_maps
    1 lock held by test_maps/671:
    #0: (rcu_read_lock){......}, at: [] map_lookup_elem+0xe8/0x260
    Call Trace:
    ([] show_trace+0x12e/0x150)
    [] show_stack+0xa0/0x100
    [] dump_stack+0x74/0xc8
    [] ___might_sleep+0x23a/0x248
    [] might_fault+0x70/0xe8
    [] map_lookup_elem+0x188/0x260
    [] SyS_bpf+0x20e/0x840

    Fix it by allocating temporary buffer to store map element value.
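
    A condensed sketch of the fixed lookup path (error handling trimmed;
    variable names assumed):

    value = kmalloc(map->value_size, GFP_USER);     /* may sleep: outside RCU */
    if (!value)
            return -ENOMEM;

    rcu_read_lock();
    ptr = map->ops->map_lookup_elem(map, key);
    if (ptr)
            memcpy(value, ptr, map->value_size);    /* snapshot under RCU */
    rcu_read_unlock();

    if (!ptr)
            err = -ENOENT;
    else if (copy_to_user(uvalue, value, map->value_size))
            err = -EFAULT;          /* may fault/sleep: now outside RCU */
    else
            err = 0;
    kfree(value);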

    Fixes: db20fd2b0108 ("bpf: add lookup/update/delete/iterate methods to BPF maps")
    Reported-by: Michael Holzheu
    Signed-off-by: Alexei Starovoitov
    Acked-by: Daniel Borkmann
    Signed-off-by: David S. Miller

    Alexei Starovoitov
     

20 Jan, 2015

1 commit

  • Nothing needs the module pointer any more, and the next patch will
    call it from RCU, where the module itself might no longer exist.
    Removing the arg is the safest approach.

    This just codifies the use of the module_alloc/module_free pattern
    which ftrace and bpf use.

    Signed-off-by: Rusty Russell
    Acked-by: Alexei Starovoitov
    Cc: Mikael Starvik
    Cc: Jesper Nilsson
    Cc: Ralf Baechle
    Cc: Ley Foon Tan
    Cc: Benjamin Herrenschmidt
    Cc: Chris Metcalf
    Cc: Steven Rostedt
    Cc: x86@kernel.org
    Cc: Ananth N Mavinakayanahalli
    Cc: Anil S Keshavamurthy
    Cc: Masami Hiramatsu
    Cc: linux-cris-kernel@axis.com
    Cc: linux-kernel@vger.kernel.org
    Cc: linux-mips@linux-mips.org
    Cc: nios2-dev@lists.rocketboards.org
    Cc: linuxppc-dev@lists.ozlabs.org
    Cc: sparclinux@vger.kernel.org
    Cc: netdev@vger.kernel.org

    Rusty Russell
     

06 Dec, 2014

1 commit

  • introduce program type BPF_PROG_TYPE_SOCKET_FILTER that is used
    for attaching programs to sockets where ctx == skb.

    add verifier checks for ABS/IND instructions which can only be seen
    in socket filters, therefore the check:
    if (env->prog->aux->prog_type != BPF_PROG_TYPE_SOCKET_FILTER)
    verbose("BPF_LD_ABS|IND instructions are only allowed in socket filters\n");

    Signed-off-by: Alexei Starovoitov
    Signed-off-by: David S. Miller

    Alexei Starovoitov
     

20 Nov, 2014

1 commit

  • - fix NULL pointer dereference:
    kernel/bpf/arraymap.c:41 array_map_alloc() error: potential null dereference 'array'. (kzalloc returns null)
    kernel/bpf/arraymap.c:41 array_map_alloc() error: we previously assumed 'array' could be null (see line 40)

    - integer overflow check was missing in arraymap
    (hashmap checks for overflow via kmalloc_array())

    - arraymap can round_up(value_size, 8) into zero; a check was missing

    - hashmap was missing a zero size check as well, since roundup_pow_of_two()
    can truncate into zero

    - found a typo in the arraymap comment and an unnecessary empty line

    Fix all of these issues and make both overflow checks explicitly u32 in size.
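
    A sketch of the added checks in array_map_alloc() (exact bounds and
    error codes assumed from the description):

    /* reject zero-sized keys/values and zero max_entries up front */
    if (attr->max_entries == 0 || attr->key_size != 4 ||
        attr->value_size == 0)
            return ERR_PTR(-EINVAL);

    elem_size = round_up(attr->value_size, 8);

    /* explicit u32 overflow checks: round_up() wrapping to zero, and
     * elem_size * max_entries exceeding U32_MAX (assumed bound)
     */
    if (elem_size == 0 ||
        attr->max_entries > (U32_MAX - sizeof(*array)) / elem_size)
            return ERR_PTR(-ENOMEM);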

    Reported-by: kbuild test robot
    Signed-off-by: Alexei Starovoitov
    Signed-off-by: David S. Miller

    Alexei Starovoitov
     

19 Nov, 2014

6 commits

  • proper types and function helpers are ready. Use them in the verifier
    testsuite and remove the temporary stubs.

    Signed-off-by: Alexei Starovoitov
    Signed-off-by: David S. Miller

    Alexei Starovoitov
     
  • expose bpf_map_lookup_elem(), bpf_map_update_elem(), bpf_map_delete_elem()
    map accessors to eBPF programs

    Signed-off-by: Alexei Starovoitov
    Signed-off-by: David S. Miller

    Alexei Starovoitov
     
  • fix the errno of the BPF_MAP_LOOKUP_ELEM command as the bpf manpage
    describes it in commit b4fc1a460f30 ("Merge branch 'bpf-next'"):
    -----
    BPF_MAP_LOOKUP_ELEM
    int bpf_lookup_elem(int fd, void *key, void *value)
    {
            union bpf_attr attr = {
                    .map_fd = fd,
                    .key    = ptr_to_u64(key),
                    .value  = ptr_to_u64(value),
            };

            return bpf(BPF_MAP_LOOKUP_ELEM, &attr, sizeof(attr));
    }
    bpf() syscall looks up an element with given key in a map fd.
    If element is found it returns zero and stores element's value
    into value. If element is not found it returns -1 and sets
    errno to ENOENT.

    and further down in manpage:

    ENOENT For BPF_MAP_LOOKUP_ELEM or BPF_MAP_DELETE_ELEM, indicates that
    element with given key was not found.
    -----

    In general, all BPF commands return ENOENT when a map element is not
    found (including BPF_MAP_GET_NEXT_KEY and BPF_MAP_UPDATE_ELEM with
    flags == BPF_MAP_UPDATE_ONLY).

    A subsequent patch adds a testsuite to check return values for all of
    these combinations.

    Signed-off-by: Alexei Starovoitov
    Signed-off-by: David S. Miller

    Alexei Starovoitov
     
  • add new map type BPF_MAP_TYPE_ARRAY and its implementation (see the
    usage sketch after this list)

    - optimized for fastest possible lookup()
    . in the future verifier/JIT may recognize lookup() with constant key
    and optimize it into constant pointer. Can optimize non-constant
    key into direct pointer arithmetic as well, since pointers and
    value_size are constant for the life of the eBPF program.
    In other words array_map_lookup_elem() may be 'inlined' by verifier/JIT
    while preserving concurrent access to this map from user space

    - two main use cases for array type:
    . 'global' eBPF variables: array of 1 element with key=0 and value is a
    collection of 'global' variables which programs can use to keep the state
    between events
    . aggregation of tracing events into fixed set of buckets

    - all array elements pre-allocated and zero initialized at init time

    - the key is used as an index into the array and can only be 4 bytes

    - map_delete_elem() returns EINVAL, since elements cannot be deleted

    - map_update_elem() replaces elements in a non-atomic way
    (for atomic updates, the hashtable type should be used instead)
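
    A usage sketch from the syscall side, using the wrapper names from the
    mini library entry further down this log (sizes are illustrative):

    int map_fd = bpf_create_map(sizeof(int) /* 4-byte index key */,
                                sizeof(long), 256 /* max_entries */);
    int key = 42;
    long value = 1;

    bpf_update_elem(map_fd, &key, &value);  /* non-atomic replace of slot 42 */
    bpf_lookup_elem(map_fd, &key, &value);  /* direct index, no hashing */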

    Signed-off-by: Alexei Starovoitov
    Signed-off-by: David S. Miller

    Alexei Starovoitov
     
  • add new map type BPF_MAP_TYPE_HASH and its implementation (see the
    creation sketch after this list)

    - maps are created/destroyed by userspace. Both userspace and eBPF programs
    can lookup/update/delete elements from the map

    - eBPF programs can be called in in_irq() context, so use the
    spin_lock_irqsave() mechanism for concurrent updates

    - key/value are opaque range of bytes (aligned to 8 bytes)

    - user space provides 3 configuration attributes via BPF syscall:
    key_size, value_size, max_entries

    - map takes care of allocating/freeing key/value pairs

    - map_update_elem() must fail to insert a new element when the max_entries
    limit is reached, to make sure that eBPF programs cannot exhaust memory

    - map_update_elem() replaces elements in an atomic way

    - optimized for speed of lookup() which can be called multiple times from
    eBPF program which itself is triggered by high volume of events
    . in the future JIT compiler may recognize lookup() call and optimize it
    further, since key_size is constant for life of eBPF program
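
    A sketch of creating such a map directly via the syscall (attribute
    names as in the bpf_attr examples elsewhere in this log; sizes are
    illustrative):

    union bpf_attr attr = {
            .map_type    = BPF_MAP_TYPE_HASH,
            .key_size    = 8,       /* opaque key bytes */
            .value_size  = 16,      /* opaque value bytes */
            .max_entries = 1024,    /* hard cap: inserts fail beyond this */
    };

    int map_fd = bpf(BPF_MAP_CREATE, &attr, sizeof(attr));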

    Signed-off-by: Alexei Starovoitov
    Signed-off-by: David S. Miller

    Alexei Starovoitov
     
  • the current meaning of the BPF_MAP_UPDATE_ELEM syscall command is:
    either update an existing map element or create a new one.
    Initially the plan was to add a new command to handle the case of
    'create new element if it didn't exist', but 'flags' style looks
    cleaner and overall diff is much smaller (more code reused), so add 'flags'
    attribute to BPF_MAP_UPDATE_ELEM command with the following meaning:
    #define BPF_ANY 0 /* create new element or update existing */
    #define BPF_NOEXIST 1 /* create new element if it didn't exist */
    #define BPF_EXIST 2 /* update existing element */

    bpf_update_elem(fd, key, value, BPF_NOEXIST) call can fail with EEXIST
    if element already exists.

    bpf_update_elem(fd, key, value, BPF_EXIST) can fail with ENOENT
    if element doesn't exist.

    Userspace will call it as:
    int bpf_update_elem(int fd, void *key, void *value, __u64 flags)
    {
            union bpf_attr attr = {
                    .map_fd = fd,
                    .key    = ptr_to_u64(key),
                    .value  = ptr_to_u64(value),
                    .flags  = flags,
            };

            return bpf(BPF_MAP_UPDATE_ELEM, &attr, sizeof(attr));
    }

    First two bits of 'flags' are used to encode style of bpf_update_elem() command.
    Bits 2-63 are reserved for future use.

    Signed-off-by: Alexei Starovoitov
    Signed-off-by: David S. Miller

    Alexei Starovoitov
     

31 Oct, 2014

1 commit

  • verifier keeps track of register state spilled to the stack.
    Registers are 8-byte wide and always aligned, so instead of tracking them
    in every byte-sized stack slot, use a MAX_BPF_STACK / 8 array to track
    spilled register state.
    Though the verifier runs in user context and its state is freed immediately
    after verification, it makes sense to reduce its memory usage.
    This optimization reduces sizeof(struct verifier_state)
    from 12464 to 1712 bytes on 64-bit and from 6232 to 1112 bytes on 32-bit.
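
    A sketch of the resulting layout (struct shape assumed from the
    description; BPF_REG_SIZE == 8):

    struct verifier_state {
            struct reg_state regs[MAX_BPF_REG];
            /* one slot-type byte per stack byte ... */
            u8 stack_slot_type[MAX_BPF_STACK];
            /* ... but full register state only per 8-byte spill slot */
            struct reg_state spilled_regs[MAX_BPF_STACK / BPF_REG_SIZE];
    };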

    Note, this patch doesn't change the existing limits, which are there to
    bound time and memory during verification: 4k total number of insns in a
    program, 1k number of jumps (states to visit) and 32k number of processed
    insns (since an insn may be visited multiple times). Theoretical worst case
    memory during verification is thus 1712 * 1k = ~1.7 Mbyte. An out-of-memory
    situation triggers cleanup and rejects the program.

    Suggested-by: Andy Lutomirski
    Signed-off-by: Alexei Starovoitov
    Signed-off-by: David S. Miller

    Alexei Starovoitov
     

28 Oct, 2014

1 commit

  • introduce two configs:
    - hidden CONFIG_BPF to select eBPF interpreter that classic socket filters
    depend on
    - visible CONFIG_BPF_SYSCALL (default off) that tracing and sockets can use

    that solves several problems:
    - tracing and others that wish to use eBPF don't need to depend on NET.
    They can use BPF_SYSCALL to allow loading from userspace or select BPF
    to use it directly from kernel in NET-less configs.
    - in 3.18 programs cannot be attached to events yet, so don't force it on
    - when the rest of eBPF infra is there in 3.19+, it's still useful to
    switch it off to minimize kernel size

    bloat-o-meter on x64 shows:
    add/remove: 0/60 grow/shrink: 0/2 up/down: 0/-15601 (-15601)

    tested with many different config combinations. Hopefully didn't miss anything.

    Signed-off-by: Alexei Starovoitov
    Acked-by: Daniel Borkmann
    Signed-off-by: David S. Miller

    Alexei Starovoitov
     

22 Oct, 2014

1 commit

  • while comparing verifier states for equivalence, the comparison
    was missing a check for uninitialized registers.
    Make sure it checks for them and add a testcase.

    Fixes: f1bca824dabb ("bpf: add search pruning optimization to verifier")
    Cc: Hannes Frederic Sowa
    Signed-off-by: Alexei Starovoitov
    Acked-by: Hannes Frederic Sowa
    Signed-off-by: David S. Miller

    Alexei Starovoitov
     

02 Oct, 2014

1 commit

  • consider a C program compiled to eBPF:

    int filter(int arg)
    {
            int a, b, c, *ptr;

            if (arg == 1)
                    ptr = &a;
            else if (arg == 2)
                    ptr = &b;
            else
                    ptr = &c;

            *ptr = 0;
            return 0;
    }

    The eBPF verifier has to follow all possible paths through the program
    to recognize that the '*ptr = 0' instruction is safe to execute
    in all situations.
    It does this by picking a path towards the end and observing changes
    to registers and stack at every insn until it reaches bpf_exit.
    Then it comes back to one of the previous branches and goes towards
    the end again with potentially different values in registers.
    When a program has a lot of branches, the number of possible combinations
    of branches is huge, so the verifier has a hard limit of walking no more
    than 32k instructions. This limit can be reached and complex (but valid)
    programs could be rejected. Therefore it's important to recognize equivalent
    verifier states to prune this depth-first search.

    Basic idea can be illustrated by the program (where .. are some eBPF insns):
    1: ..
    2: if (rX == rY) goto 4
    3: ..
    4: ..
    5: ..
    6: bpf_exit
    In the first pass towards bpf_exit the verifier will walk insns: 1, 2, 3, 4, 5, 6
    Since insn#2 is a branch the verifier will remember its state in verifier stack
    to come back to it later.
    Since insn#4 is marked as 'branch target', the verifier will remember its state
    in explored_states[4] linked list.
    Once it reaches insn#6 successfully it will pop the state recorded at insn#2 and
    will continue.
    Without the search pruning optimization the verifier would have to walk
    4, 5, 6 again, effectively simulating execution of insns 1, 2, 4, 5, 6.
    With search pruning it will check whether state at #4 after jumping from #2
    is equivalent to one recorded in explored_states[4] during first pass.
    If there is an equivalent state, verifier can prune the search at #4 and declare
    this path to be safe as well.
    In other words two states at #4 are equivalent if execution of 1, 2, 3, 4 insns
    and 1, 2, 4 insns produces equivalent registers and stack.

    Signed-off-by: Alexei Starovoitov
    Signed-off-by: David S. Miller

    Alexei Starovoitov
     

27 Sep, 2014

6 commits

  • 1.
    the library includes a trivial set of BPF syscall wrappers:
    int bpf_create_map(int key_size, int value_size, int max_entries);
    int bpf_update_elem(int fd, void *key, void *value);
    int bpf_lookup_elem(int fd, void *key, void *value);
    int bpf_delete_elem(int fd, void *key);
    int bpf_get_next_key(int fd, void *key, void *next_key);
    int bpf_prog_load(enum bpf_prog_type prog_type,
                      const struct sock_filter_int *insns, int insn_len,
                      const char *license);
    bpf_prog_load() stores the verifier log into the global bpf_log_buf[] array

    and BPF_*() macros to build instructions

    2.
    test stubs configure the eBPF infra with 'unspec' map and program types.
    These are fake types used by the user space testsuite only.

    3.
    the verifier suite tests valid and invalid programs and expects predefined
    error log messages from the kernel.
    40 tests so far.

    $ sudo ./test_verifier
    #0 add+sub+mul OK
    #1 unreachable OK
    #2 unreachable2 OK
    #3 out of range jump OK
    #4 out of range jump2 OK
    #5 test1 ld_imm64 OK
    ...

    Signed-off-by: Alexei Starovoitov
    Signed-off-by: David S. Miller

    Alexei Starovoitov
     
  • This patch adds verifier core which simulates execution of every insn and
    records the state of registers and program stack. Every branch instruction
    seen during simulation is pushed onto the state stack. When the verifier
    reaches BPF_EXIT, it pops the state from the stack and continues until it
    reaches BPF_EXIT again.
    For program:
    1: bpf_mov r1, xxx
    2: if (r1 == 0) goto 5
    3: bpf_mov r0, 1
    4: goto 6
    5: bpf_mov r0, 2
    6: bpf_exit
    The verifier will walk insns: 1, 2, 3, 4, 6
    then it will pop the state recorded at insn#2 and will continue: 5, 6

    This way it walks all possible paths through the program and checks all
    possible values of registers. While doing so, it checks for:
    - invalid instructions
    - uninitialized register access
    - uninitialized stack access
    - misaligned stack access
    - out of range stack access
    - invalid calling convention
    - instruction encoding is not using reserved fields

    The kernel subsystem configures the verifier with two callbacks:

    - bool (*is_valid_access)(int off, int size, enum bpf_access_type type);
    that provides information to the verifier about which fields of 'ctx'
    are accessible (remember 'ctx' is the first argument to an eBPF program)

    - const struct bpf_func_proto *(*get_func_proto)(enum bpf_func_id func_id);
    returns the argument constraints of kernel helper functions that an eBPF
    program may call, so that the verifier can check that R1-R5 types match
    the prototype
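
    For illustration, an is_valid_access callback might look like this
    (names, bounds and policy are hypothetical):

    static bool sk_filter_is_valid_access(int off, int size,
                                          enum bpf_access_type type)
    {
            /* hypothetical policy: read-only, aligned, in-bounds ctx access */
            if (type != BPF_READ)
                    return false;
            if (off < 0 || off % size != 0)
                    return false;
            return off + size <= CTX_SIZE;  /* CTX_SIZE: hypothetical ctx size */
    }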

    More details in Documentation/networking/filter.txt and in kernel/bpf/verifier.c

    Signed-off-by: Alexei Starovoitov
    Signed-off-by: David S. Miller

    Alexei Starovoitov
     
  • check that the control flow graph of an eBPF program is a directed
    acyclic graph

    check_cfg() does:
    - detect loops
    - detect unreachable instructions
    - check that program terminates with BPF_EXIT insn
    - check that all branches are within program boundary
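
    One classic way to implement the loop and reachability checks is a
    depth-first search with three-color marking; a generic sketch, not the
    kernel code itself (the kernel variant avoids recursion):

    enum { WHITE, GREY, BLACK };    /* unvisited, in progress, done */

    /* succ[v] holds insn v's successors (fall-through and/or jump
     * target), terminated by -1; meeting a GREY node is a back-edge,
     * i.e. a loop
     */
    static int dfs(int v, int succ[][2], unsigned char *color)
    {
            color[v] = GREY;
            for (int i = 0; i < 2 && succ[v][i] >= 0; i++) {
                    int w = succ[v][i];

                    if (color[w] == GREY)
                            return -1;      /* loop detected: reject */
                    if (color[w] == WHITE && dfs(w, succ, color) < 0)
                            return -1;
            }
            color[v] = BLACK;
            return 0;
    }
    /* after dfs(0, ...), any insn left WHITE is unreachable: reject too */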

    Signed-off-by: Alexei Starovoitov
    Signed-off-by: David S. Miller

    Alexei Starovoitov
     
  • eBPF programs passed from userspace use pseudo BPF_LD_IMM64 instructions
    to refer to a process-local map_fd. Scan the program for such instructions
    and, if the FDs are valid, convert them to 'struct bpf_map' pointers which
    will be used by the verifier to check access to maps in
    bpf_map_lookup/update() calls. If the program passes the verifier, convert
    the pseudo BPF_LD_IMM64 into a generic one by dropping the
    BPF_PSEUDO_MAP_FD flag.

    Note that the eBPF interpreter is generic and knows nothing about pseudo insns.

    Signed-off-by: Alexei Starovoitov
    Signed-off-by: David S. Miller

    Alexei Starovoitov
     
  • add optional attributes for BPF_PROG_LOAD syscall:
    union bpf_attr {
            struct {
                    ...
                    __u32         log_level; /* verbosity level of eBPF verifier */
                    __u32         log_size;  /* size of user buffer */
                    __aligned_u64 log_buf;   /* user supplied 'char *buffer' */
            };
    };

    when log_level > 0, the verifier will return its verification log in the
    user-supplied buffer 'log_buf', which can be used by the program author to
    analyze why the verifier rejected the given program.

    'Understanding eBPF verifier messages' section of Documentation/networking/filter.txt
    provides several examples of these messages, like the program:

    BPF_ST_MEM(BPF_DW, BPF_REG_10, -8, 0),
    BPF_MOV64_REG(BPF_REG_2, BPF_REG_10),
    BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -8),
    BPF_LD_MAP_FD(BPF_REG_1, 0),
    BPF_CALL_FUNC(BPF_FUNC_map_lookup_elem),
    BPF_JMP_IMM(BPF_JEQ, BPF_REG_0, 0, 1),
    BPF_ST_MEM(BPF_DW, BPF_REG_0, 4, 0),
    BPF_EXIT_INSN(),

    will be rejected with the following multi-line message in log_buf:

    0: (7a) *(u64 *)(r10 -8) = 0
    1: (bf) r2 = r10
    2: (07) r2 += -8
    3: (b7) r1 = 0
    4: (85) call 1
    5: (15) if r0 == 0x0 goto pc+1
    R0=map_ptr R10=fp
    6: (7a) *(u64 *)(r0 +4) = 0
    misaligned access off 4 size 8

    The format of the output can change at any time as verifier evolves.

    Signed-off-by: Alexei Starovoitov
    Signed-off-by: David S. Miller

    Alexei Starovoitov
     
  • this patch adds all of the eBPF verifier documentation and an empty
    bpf_check()

    The end goal for the verifier is to statically check safety of the program.

    Verifier will catch:
    - loops
    - out of range jumps
    - unreachable instructions
    - invalid instructions
    - uninitialized register access
    - uninitialized stack access
    - misaligned stack access
    - out of range stack access
    - invalid calling convention

    More details in Documentation/networking/filter.txt

    Signed-off-by: Alexei Starovoitov
    Signed-off-by: David S. Miller

    Alexei Starovoitov