04 Nov, 2015

1 commit

  • The verbose() printer dumps the verifier state to user space, so let gcc
    check calls to verbose() for (future) format string errors. Building with
    W=1 correctly suggests: function might be possible candidate for
    'gnu_printf' format attribute [-Wsuggest-attribute=format].

    Signed-off-by: Daniel Borkmann
    Signed-off-by: David S. Miller

    Daniel Borkmann
     

03 Nov, 2015

5 commits

  • This work adds support for "persistent" eBPF maps/programs. "Persistent"
    here means that maps/programs have a facility that lets them survive
    process termination. This is desired by various eBPF subsystem users.

    Just to name one example: tc classifier/action. Whenever tc parses the
    ELF object and extracts and loads maps/progs into the kernel, the
    resulting file descriptors are out of reach once the tc instance exits.
    So a subsequent tc invocation won't be able to access/relocate against
    this resource, and therefore maps cannot easily be shared, f.e. between
    the ingress and egress networking data path.

    The current workaround is that Unix domain sockets (UDS) need to be
    instrumented in order to pass the created eBPF map/program file
    descriptors to a third party management daemon through UDS' socket
    passing facility. This makes it a bit complicated to deploy shared
    eBPF maps or programs (programs f.e. for tail calls) among various
    processes.

    We've been brainstorming on how we could tackle this issue, and various
    approaches have been tried out so far, which can be read up on further
    in the reference below.

    The architecture we eventually ended up with is a minimal file system
    that can hold map/prog objects. The file system is a per mount namespace
    singleton, and the default mount point is /sys/fs/bpf/. Any subsequent
    mounts within a given namespace will point to the same instance. The
    file system allows for creating a user-defined directory structure.
    The objects for maps/progs are created/fetched through bpf(2) with
    two new commands (BPF_OBJ_PIN/BPF_OBJ_GET). For pinning, a bpf file
    descriptor along with a pathname is passed to bpf(2), which in turn
    creates the file system node (we call this eBPF object pinning). For
    getting a new BPF file descriptor to an existing node, only the
    pathname is passed to bpf(2). The user can then access maps and progs
    later on through bpf(2). Removal of file system nodes is managed
    through normal VFS functions such as unlink(2), etc. The file system
    code is kept to a minimum and can be further extended later on.
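
    As a rough illustration (not part of this commit; helper names are made
    up, error handling is elided, and the bpf_attr members used are the
    pathname/bpf_fd fields described above), pinning and later re-getting
    an object could look like:

    #include <linux/bpf.h>
    #include <stdint.h>
    #include <string.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    static int bpf_obj_pin(int fd, const char *pathname)
    {
        union bpf_attr attr;

        memset(&attr, 0, sizeof(attr));
        attr.pathname = (uint64_t)(unsigned long)pathname;
        attr.bpf_fd = fd;

        return syscall(__NR_bpf, BPF_OBJ_PIN, &attr, sizeof(attr));
    }

    static int bpf_obj_get(const char *pathname)
    {
        union bpf_attr attr;

        memset(&attr, 0, sizeof(attr));
        attr.pathname = (uint64_t)(unsigned long)pathname;

        return syscall(__NR_bpf, BPF_OBJ_GET, &attr, sizeof(attr));
    }

    /* e.g. bpf_obj_pin(map_fd, "/sys/fs/bpf/my_map") in one process and
     * map_fd = bpf_obj_get("/sys/fs/bpf/my_map") in a later one. */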

    The next step I'm working on is to add dump eBPF map/prog commands
    to bpf(2), so that a specification can be retrieved from a given file
    descriptor. This can be used by things like CRIU, but also by
    applications to inspect the metadata after calling BPF_OBJ_GET.

    Big thanks also to Alexei and Hannes, who significantly contributed
    to the design discussion that eventually led us to this architecture.

    Reference: https://lkml.org/lkml/2015/10/15/925
    Signed-off-by: Daniel Borkmann
    Signed-off-by: Alexei Starovoitov
    Signed-off-by: Hannes Frederic Sowa
    Signed-off-by: David S. Miller

    Daniel Borkmann
     
  • We currently have duplicated cleanup code in the bpf_prog_put() and
    bpf_prog_put_rcu() cleanup paths. Back then we decided that it was
    not worth making it a common helper called by both, but with the
    recent addition of resource charging, we could have avoided the fix
    in commit ac00737f4e81 ("bpf: Need to call bpf_prog_uncharge_memlock
    from bpf_prog_put") if we had had only a single, common path. We can
    simplify it further by assigning aux->prog only once during
    allocation time.

    Signed-off-by: Daniel Borkmann
    Acked-by: Alexei Starovoitov
    Signed-off-by: David S. Miller

    Daniel Borkmann
     
  • Add a bpf_map_get() function that we're going to use later on, and
    align/clean the remaining helpers a bit so that they are a bit more
    consistent:

    - __bpf_map_get() and __bpf_prog_get() that both work on the fd
    struct, check whether the descriptor is eBPF and return the
    pointer to the map/prog stored in the private data.

    Also, we can return f.file->private_data directly; the function
    signature is documentation enough already.

    - bpf_map_get() and bpf_prog_get() that both work on u32 user fd,
    call their respective __bpf_map_get()/__bpf_prog_get() variants,
    and take a reference.
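
    Roughly, the intended split looks as follows (simplified sketch, not
    the exact kernel code):

    static struct bpf_map *__bpf_map_get(struct fd f)
    {
        if (!f.file)
            return ERR_PTR(-EBADF);
        if (f.file->f_op != &bpf_map_fops) {
            fdput(f);
            return ERR_PTR(-EINVAL);
        }

        return f.file->private_data;
    }

    struct bpf_map *bpf_map_get(u32 ufd)
    {
        struct fd f = fdget(ufd);
        struct bpf_map *map;

        map = __bpf_map_get(f);
        if (IS_ERR(map))
            return map;

        /* grab a reference before dropping the fd reference again */
        atomic_inc(&map->refcnt);
        fdput(f);

        return map;
    }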

    Signed-off-by: Daniel Borkmann
    Acked-by: Alexei Starovoitov
    Signed-off-by: David S. Miller

    Daniel Borkmann
     
  • Since we're going to use anon_inode_getfd() invocations in more than just
    the current places, make a helper function for both, so that we only need
    to pass a map/prog pointer to the helper itself in order to get a fd. The
    new helpers are called bpf_map_new_fd() and bpf_prog_new_fd().
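
    A sketch of the resulting helpers (simplified; the anon inode names and
    fops match the existing setup, the flags are assumed to stay the same):

    int bpf_map_new_fd(struct bpf_map *map)
    {
        return anon_inode_getfd("bpf-map", &bpf_map_fops, map,
                                O_RDWR | O_CLOEXEC);
    }

    static int bpf_prog_new_fd(struct bpf_prog *prog)
    {
        return anon_inode_getfd("bpf-prog", &bpf_prog_fops, prog,
                                O_RDWR | O_CLOEXEC);
    }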

    Signed-off-by: Daniel Borkmann
    Acked-by: Alexei Starovoitov
    Signed-off-by: David S. Miller

    Daniel Borkmann
     
  • When running bpf samples on an -rt kernel, the following warning is
    reported:

    BUG: sleeping function called from invalid context at kernel/locking/rtmutex.c:917
    in_atomic(): 1, irqs_disabled(): 128, pid: 477, name: ping
    Preemption disabled at:[] kprobe_perf_func+0x30/0x228

    CPU: 3 PID: 477 Comm: ping Not tainted 4.1.10-rt8 #4
    Hardware name: Freescale Layerscape 2085a RDB Board (DT)
    Call trace:
    [] dump_backtrace+0x0/0x128
    [] show_stack+0x20/0x30
    [] dump_stack+0x7c/0xa0
    [] ___might_sleep+0x188/0x1a0
    [] rt_spin_lock+0x28/0x40
    [] htab_map_update_elem+0x124/0x320
    [] bpf_map_update_elem+0x40/0x58
    [] __bpf_prog_run+0xd48/0x1640
    [] trace_call_bpf+0x8c/0x100
    [] kprobe_perf_func+0x30/0x228
    [] kprobe_dispatcher+0x34/0x58
    [] kprobe_handler+0x114/0x250
    [] kprobe_breakpoint_handler+0x1c/0x30
    [] brk_handler+0x88/0x98
    [] do_debug_exception+0x50/0xb8
    Exception stack(0xffff808349687460 to 0xffff808349687580)
    7460: 4ca2b600 ffff8083 4a3a7000 ffff8083 49687620 ffff8083 0069c5f8 ffff8000
    7480: 00000001 00000000 007e0628 ffff8000 496874b0 ffff8083 007e1de8 ffff8000
    74a0: 496874d0 ffff8083 0008e04c ffff8000 00000001 00000000 4ca2b600 ffff8083
    74c0: 00ba2e80 ffff8000 49687528 ffff8083 49687510 ffff8083 000e5c70 ffff8000
    74e0: 00c22348 ffff8000 00000000 ffff8083 49687510 ffff8083 000e5c74 ffff8000
    7500: 4ca2b600 ffff8083 49401800 ffff8083 00000001 00000000 00000000 00000000
    7520: 496874d0 ffff8083 00000000 00000000 00000000 00000000 00000000 00000000
    7540: 2f2e2d2c 33323130 00000000 00000000 4c944500 ffff8083 00000000 00000000
    7560: 00000000 00000000 008751e0 ffff8000 00000001 00000000 124e2d1d 00107b77

    Convert the hashtab lock to a raw lock to avoid this warning.
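
    Sketch of the conversion (simplified; the point is that a raw_spinlock_t
    stays a true spinning lock on PREEMPT_RT instead of being converted into
    a sleeping rtmutex):

    /* in struct bpf_htab: spinlock_t lock  becomes  raw_spinlock_t lock */

    static void update_element_locked(raw_spinlock_t *lock)
    {
        unsigned long flags;

        raw_spin_lock_irqsave(lock, flags);      /* was: spin_lock_irqsave() */
        /* ... insert or overwrite the hash table element ... */
        raw_spin_unlock_irqrestore(lock, flags); /* was: spin_unlock_irqrestore() */
    }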

    Signed-off-by: Yang Shi
    Acked-by: Daniel Borkmann
    Signed-off-by: David S. Miller

    Yang Shi
     

27 Oct, 2015

1 commit

  • Fix safety checks for bpf_perf_event_read():
    - only non-inherited events can be added to a perf_event_array map
    (do this check statically at map insertion time)
    - dynamically check that the event is local and !pmu->count
    Otherwise a buggy bpf program can cause a kernel splat.

    Also fix error path after perf_event_attrs()
    and remove redundant 'extern'.

    Fixes: 35578d798400 ("bpf: Implement function bpf_perf_event_read() that get the selected hardware PMU conuter")
    Signed-off-by: Alexei Starovoitov
    Tested-by: Wang Nan
    Signed-off-by: David S. Miller

    Alexei Starovoitov
     

22 Oct, 2015

1 commit

  • This helper is used to send raw data from an eBPF program into a
    special PERF_TYPE_SOFTWARE/PERF_COUNT_SW_BPF_OUTPUT perf_event.
    User space needs to perf_event_open() it (either for one or all cpus)
    and store the FD into a perf_event_array (similar to the
    bpf_perf_event_read() helper) before the eBPF program can send data
    into it.

    Today the programs triggered by kprobes collect data and either store
    it into maps or print it via bpf_trace_printk(), where the latter is a
    debug facility and not suitable for streaming data. This new helper
    replaces such bpf_trace_printk() usage and gives programs a dedicated
    channel into user space for post-processing of the raw data collected.
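
    A hypothetical kprobe program using the helper could look as follows
    (sample-style SEC()/bpf_map_def conventions assumed; user space has to
    populate the perf_event_array via perf_event_open() plus
    bpf_map_update_elem() first, as described above):

    struct bpf_map_def SEC("maps") my_map = {
        .type = BPF_MAP_TYPE_PERF_EVENT_ARRAY,
        .key_size = sizeof(int),
        .value_size = sizeof(u32),
        .max_entries = 2,
    };

    SEC("kprobe/sys_write")
    int bpf_prog(struct pt_regs *ctx)
    {
        struct {
            u64 pid;
            u64 cookie;
        } data = {
            .pid = bpf_get_current_pid_tgid(),
            .cookie = 0x12345678,
        };

        /* stream the record into the perf event stored at index 0 */
        bpf_perf_event_output(ctx, &my_map, 0, &data, sizeof(data));
        return 0;
    }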

    Signed-off-by: Alexei Starovoitov
    Signed-off-by: David S. Miller

    Alexei Starovoitov
     

16 Oct, 2015

1 commit

  • Currently, bpf_prog_uncharge_memlock() is only called from
    __prog_put_rcu() in the bpf_prog_release path. It needs to be called
    from bpf_prog_put() as well to get correct accounting.

    Fixes: aaac3ba95e4c8b49 ("bpf: charge user for creation of BPF maps and programs")
    Signed-off-by: Tom Herbert
    Acked-by: Alexei Starovoitov
    Acked-by: Daniel Borkmann
    Signed-off-by: David S. Miller

    Tom Herbert
     

13 Oct, 2015

2 commits

  • Since eBPF programs and maps use kernel memory, consider it 'locked'
    memory from the user accounting point of view and charge it against
    the RLIMIT_MEMLOCK limit. This limit is typically set to 64 Kbytes by
    distros, so almost all bpf+tracing programs will need to increase it,
    since they use maps and the kernel charges the maximum map size
    upfront. For example, a hash map of 1024 elements will be charged as
    64 Kbytes. It's inconvenient for current users and changes current
    behavior for root, but it is probably worth doing to be consistent
    between root and non-root.

    Similar accounting logic is done by mmap of perf_event.
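
    For example, a loader can raise the limit before creating maps and
    programs (as the existing bpf samples now need to do):

    #include <stdio.h>
    #include <sys/resource.h>

    static int bump_memlock_rlimit(void)
    {
        struct rlimit r = { RLIM_INFINITY, RLIM_INFINITY };

        /* the kernel charges the maximum map size against RLIMIT_MEMLOCK */
        if (setrlimit(RLIMIT_MEMLOCK, &r)) {
            perror("setrlimit(RLIMIT_MEMLOCK)");
            return -1;
        }
        return 0;
    }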

    Signed-off-by: Alexei Starovoitov
    Signed-off-by: David S. Miller

    Alexei Starovoitov
     
  • In order to let unprivileged users load and execute eBPF programs,
    teach the verifier to prevent pointer leaks.
    The verifier will prevent:
    - any arithmetic on pointers
    (except R10+Imm which is used to compute stack addresses)
    - comparison of pointers
    (except if (map_value_ptr == 0) ... )
    - passing pointers to helper functions
    - indirectly passing pointers in stack to helper functions
    - returning pointer from bpf program
    - storing pointers into ctx or maps

    Spill/fill of pointers into stack is allowed, but mangling
    of pointers stored in the stack or reading them byte by byte is not.

    Within bpf programs the pointers do exist, since programs need to
    be able to access maps, pass skb pointer to LD_ABS insns, etc
    but programs cannot pass such pointer values to the outside
    or obfuscate them.

    Only BPF_PROG_TYPE_SOCKET_FILTER programs are allowed for
    unprivileged users, so that socket filters (tcpdump), af_packet
    (QUIC acceleration) and the future kcm can use it.
    Tracing and tc cls/act program types still require root permissions,
    since tracing actually needs to be able to see all kernel pointers
    and tc is root-only.

    For example, the following unprivileged socket filter program is allowed:

    int bpf_prog1(struct __sk_buff *skb)
    {
        u32 index = load_byte(skb, ETH_HLEN + offsetof(struct iphdr, protocol));
        u64 *value = bpf_map_lookup_elem(&my_map, &index);

        if (value)
            *value += skb->len;
        return 0;
    }

    but the following program is not:

    int bpf_prog1(struct __sk_buff *skb)
    {
        u32 index = load_byte(skb, ETH_HLEN + offsetof(struct iphdr, protocol));
        u64 *value = bpf_map_lookup_elem(&my_map, &index);

        if (value)
            *value += (u64) skb;
        return 0;
    }
    since it would leak the kernel address into the map.

    Unprivileged socket filter bpf programs have access to the
    following helper functions:
    - map lookup/update/delete (but they cannot store kernel pointers into them)
    - get_random (it's already exposed to unprivileged user space)
    - get_smp_processor_id
    - tail_call into another socket filter program
    - ktime_get_ns

    The feature is controlled by the sysctl kernel.unprivileged_bpf_disabled.
    This toggle defaults to off (0), but can be set to true (1). Once true,
    bpf programs and maps cannot be accessed from unprivileged processes,
    and the toggle cannot be set back to false.

    Signed-off-by: Alexei Starovoitov
    Reviewed-by: Kees Cook
    Signed-off-by: David S. Miller

    Alexei Starovoitov
     

11 Oct, 2015

1 commit

  • eBPF socket filter programs may see junk in the 'u32 cb[5]' area,
    since it could have been used by protocol layers earlier.

    For socket filter programs used in af_packet we need to clean
    20 bytes of skb->cb area if it could be used by the program.
    For programs attached to TCP/UDP sockets we need to save/restore
    these 20 bytes, since it's used by protocol layers.

    Remove SK_RUN_FILTER macro, since it's no longer used.

    Long term we may move this bpf cb area to per-cpu scratch space, but
    that requires the addition of new 'per-cpu load/store' instructions,
    so it is not suitable as a short term fix.

    Fixes: d691f9e8d440 ("bpf: allow programs to write to certain skb fields")
    Reported-by: Eric Dumazet
    Signed-off-by: Alexei Starovoitov
    Signed-off-by: David S. Miller

    Alexei Starovoitov
     

08 Oct, 2015

1 commit

  • While recently arguing in a seccomp discussion that raw prandom_u32()
    access shouldn't be exposed to unprivileged user space, I forgot the
    fact that the SKF_AD_RANDOM extension has actually already done so for
    some time in cBPF via commit 4cd3675ebf74 ("filter: added BPF random
    opcode").

    Since prandom_u32() is used in a lot of critical networking code,
    let's be more conservative and split their states. Furthermore,
    consolidate the eBPF and cBPF prandom handlers to use the new internal
    PRNG. For eBPF, bpf_get_prandom_u32() was only accessible to
    privileged users, but should that change one day, we also don't want
    to leak raw sequences through things like eBPF maps.

    One thought was also to have own per-bpf_prog states, but due to ABI
    reasons this is not easily possible, i.e. the program code currently
    cannot access bpf_prog itself, and copying the rnd_state to/from the
    stack scratch space whenever a program uses the prng doesn't really
    seem worth the trouble and seems too hacky. If needed, taus113 could
    in such cases be implemented within eBPF, using a map entry to keep
    the state space, or get_random_bytes() could become a second helper
    in cases where performance is not critical.

    Both sides can trigger a one-time late init via prandom_init_once() on
    the shared state. Performance-wise, there should even be a tiny gain
    as bpf_user_rnd_u32() saves one function call. The PRNG needs to live
    inside the BPF core since kernels could have a NET-less config as well.
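
    The shared state roughly boils down to a per-cpu rnd_state living in
    the BPF core (simplified sketch, details may differ):

    static DEFINE_PER_CPU(struct rnd_state, bpf_user_rnd_state);

    void bpf_user_rnd_init_once(void)
    {
        prandom_init_once(&bpf_user_rnd_state);
    }

    u64 bpf_user_rnd_u32(u64 r1, u64 r2, u64 r3, u64 r4, u64 r5)
    {
        /* separate state, so BPF users cannot learn the sequence that
         * the rest of the networking stack draws from */
        struct rnd_state *state = &get_cpu_var(bpf_user_rnd_state);
        u32 res = prandom_u32_state(state);

        put_cpu_var(bpf_user_rnd_state);
        return res;
    }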

    Signed-off-by: Daniel Borkmann
    Acked-by: Hannes Frederic Sowa
    Acked-by: Alexei Starovoitov
    Cc: Chema Gonzalez
    Signed-off-by: David S. Miller

    Daniel Borkmann
     

05 Oct, 2015

1 commit

  • Commit ea317b267e9d ("bpf: Add new bpf map type to store the pointer
    to struct perf_event") added perf_event.h to the main eBPF header, so
    it gets included for all users. perf_event.h is actually only needed
    on the array map side, so let's sanitize this a bit.

    Signed-off-by: Daniel Borkmann
    Cc: Kaixu Xia
    Acked-by: Alexei Starovoitov
    Signed-off-by: David S. Miller

    Daniel Borkmann
     

03 Oct, 2015

2 commits

  • Using routing realms as part of the classifier is quite useful; they
    can be viewed as a tag for one or multiple routing entries (think of
    an analogy to the net_cls cgroup for processes), set by user space
    routing daemons or via iproute2 as an indicator for traffic
    classifiers and later on processed in the eBPF program.

    Unlike actions, the classifier can inspect device flags and enable
    netif_keep_dst() if necessary. tc actions don't have that possibility,
    but in case people know what they are doing, it can be used from there
    as well (e.g. via devs that must keep dsts by design anyway).

    If a realm is set, the handler returns the non-zero realm. User space
    can set the full 32bit realm for the dst.
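
    A hypothetical cls_bpf usage could be as simple as returning the realm
    as the classification result (sketch only, assuming bpf_get_route_realm()
    is the helper this work introduces):

    SEC("classifier")
    int cls_main(struct __sk_buff *skb)
    {
        u32 realm = bpf_get_route_realm(skb);

        /* a non-zero realm set by routing wins, otherwise default class */
        return realm ? realm : 0x10;
    }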

    Signed-off-by: Daniel Borkmann
    Acked-by: Alexei Starovoitov
    Signed-off-by: David S. Miller

    Daniel Borkmann
     
  • As we need to add further flags to the bpf_prog structure, let's
    migrate both bools to a bitfield representation. The size of the base
    structure (excluding insns) remains unchanged at 40 bytes.

    Also add annotations for kmemcheck, so that it doesn't throw false
    positives. Even in case gcc would generate suboptimal code, the field
    is not accessed in performance critical paths.
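
    Rough shape of the change (illustrative sketch only, mirroring the
    bpf_prog members mentioned above):

    struct bpf_prog_hdr_sketch {            /* stand-in for struct bpf_prog */
        u16 pages;                          /* number of allocated pages */
        kmemcheck_bitfield_begin(meta);
        u16 jited:1,                        /* was: bool jited */
            gpl_compatible:1;               /* was: bool gpl_compatible */
        kmemcheck_bitfield_end(meta);
        /* ... remaining members unchanged ... */
    };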

    Signed-off-by: Daniel Borkmann
    Acked-by: Alexei Starovoitov
    Signed-off-by: David S. Miller

    Daniel Borkmann
     

10 Sep, 2015

2 commits

  • When the verifier log is enabled, print_bpf_insn() is doing
    bpf_alu_string[BPF_OP(insn->code) >> 4]
    and
    bpf_jmp_string[BPF_OP(insn->code) >> 4]
    where BPF_OP is a 4-bit instruction opcode.
    Malformed insns can cause an out of bounds access.
    Fix it by sizing the arrays appropriately.

    The bug was found by clang address sanitizer with libfuzzer.
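
    The fix boils down to sizing the string tables to the full 4-bit
    opcode space, along the lines of (abridged sketch):

    static const char * const bpf_alu_string[16] = {
        [BPF_ADD >> 4] = "+=",
        [BPF_SUB >> 4] = "-=",
        [BPF_MUL >> 4] = "*=",
        [BPF_DIV >> 4] = "/=",
        /* ... remaining opcodes; unknown slots stay NULL and can be
         * handled explicitly instead of being read out of bounds */
    };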

    Reported-by: Yonghong Song
    Signed-off-by: Alexei Starovoitov
    Acked-by: Daniel Borkmann
    Signed-off-by: David S. Miller

    Alexei Starovoitov
     
  • We may already have obtained a proper fd struct through fdget(), so
    whenever we return at the end of a map operation, we need to call
    fdput(). However, each map operation from the syscall side first
    probes CHECK_ATTR() to verify that unused fields in the bpf_attr
    union are zero.

    In case of malformed input, we return with an error, but the lookup
    of the map_fd was already performed at that time, so we return
    without a corresponding fdput(). Fix it by performing the fdget()
    only right before bpf_map_get(). The fdget() invocation on maps in
    the verifier is not affected.

    Fixes: db20fd2b0108 ("bpf: add lookup/update/delete/iterate methods to BPF maps")
    Signed-off-by: Daniel Borkmann
    Acked-by: Alexei Starovoitov
    Signed-off-by: David S. Miller

    Daniel Borkmann
     

21 Jul, 2015

1 commit

  • improve accuracy of timing in test_bpf and add two stress tests:
    - {skb->data[0], get_smp_processor_id} repeated 2k times
    - {skb->data[0], vlan_push} x 68 followed by {skb->data[0], vlan_pop} x 68

    1st test is useful to test performance of JIT implementation of BPF_LD_ABS
    together with BPF_CALL instructions.
    2nd test is stressing skb_vlan_push/pop logic together with skb->data access
    via BPF_LD_ABS insn which checks that re-caching of skb->data is done correctly.

    In order to call bpf_skb_vlan_push() from test_bpf.ko, we have to add
    three EXPORT_SYMBOL_GPL entries.

    Signed-off-by: Alexei Starovoitov
    Signed-off-by: David S. Miller

    Alexei Starovoitov
     

14 Jul, 2015

1 commit

  • ARG1 = BPF_R1 as it stands, evaluates to regs[BPF_REG_1] = regs[BPF_REG_1]
    and thus has no effect. Add a comment instead, explaining what happens and
    why it's okay to just remove it. Since from user space side, a tail call is
    invoked as a pseudo helper function via bpf_tail_call_proto, the verifier
    checks the arguments just like with any other helper function and makes
    sure that the first argument (regs[BPF_REG_1])'s type is ARG_PTR_TO_CTX.

    Signed-off-by: Daniel Borkmann
    Acked-by: Alexei Starovoitov
    Signed-off-by: David S. Miller

    Daniel Borkmann
     

16 Jun, 2015

2 commits

  • bpf_trace_printk() is a helper function used to debug eBPF programs.
    Let socket and TC programs use it as well.
    Note, it's a DEBUG ONLY helper. If it's used in a program, the kernel
    will print a warning banner to make sure users don't use it in
    production.

    Signed-off-by: Alexei Starovoitov
    Signed-off-by: David S. Miller

    Alexei Starovoitov
     
  • eBPF programs attached to kprobes need to filter based on
    current->pid, uid and other fields, so introduce helper functions:

    u64 bpf_get_current_pid_tgid(void)
    Return: current->tgid << 32 | current->pid

    u64 bpf_get_current_uid_gid(void)
    Return: current_gid << 32 | current_uid

    bpf_get_current_comm(char *buf, int size_of_buf)
    stores current->comm into buf

    They can be used from the programs attached to TC as well to classify packets
    based on current task fields.

    Update tracex2 example to print histogram of write syscalls for each process
    instead of aggregated for all.
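
    A hypothetical kprobe program filtering on the current task could look
    like (sketch only):

    SEC("kprobe/sys_write")
    int bpf_prog(struct pt_regs *ctx)
    {
        u64 pid_tgid = bpf_get_current_pid_tgid();
        u32 pid  = (u32)pid_tgid;       /* lower 32 bits: current->pid */
        u32 tgid = pid_tgid >> 32;      /* upper 32 bits: current->tgid */
        char comm[16];

        bpf_get_current_comm(&comm, sizeof(comm));

        if (tgid != 1234)               /* only trace the process of interest */
            return 0;
        /* ... e.g. bump a counter in a hash map keyed by pid/comm ... */
        return 0;
    }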

    Signed-off-by: Alexei Starovoitov
    Signed-off-by: David S. Miller

    Alexei Starovoitov
     

07 Jun, 2015

1 commit

  • Allow programs to read/write the skb->mark and tc_index fields and
    ((struct qdisc_skb_cb *)cb)->data.

    mark and tc_index are generically useful in TC.
    cb[0]-cb[4] are primarily used to pass arguments from one
    program to another called via bpf_tail_call() which can
    be seen in sockex3_kern.c example.

    All fields of 'struct __sk_buff' are readable to socket and tc_cls_act progs.
    mark, tc_index are writeable from tc_cls_act only.
    cb[0]-cb[4] are writeable by both sockets and tc_cls_act.

    Add verifier tests and improve sample code.
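
    For instance, a hypothetical tc_cls_act program could pass an argument
    in cb[0] to a program it tail-calls and update mark on the way
    (sample-style conventions assumed):

    struct bpf_map_def SEC("maps") jmp_table = {
        .type = BPF_MAP_TYPE_PROG_ARRAY,
        .key_size = sizeof(u32),
        .value_size = sizeof(u32),
        .max_entries = 8,
    };

    SEC("classifier")
    int parent_prog(struct __sk_buff *skb)
    {
        skb->cb[0] = 42;                /* argument for the tail-called program */
        skb->mark = 0x1;                /* writable from tc_cls_act */
        bpf_tail_call(skb, &jmp_table, 0);
        return 0;                       /* reached only if slot 0 is empty */
    }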

    Signed-off-by: Alexei Starovoitov
    Signed-off-by: David S. Miller

    Alexei Starovoitov
     

01 Jun, 2015

2 commits

  • Besides others, move bpf_tail_call_proto to the remaining definitions
    of other protos, improve comments a bit (i.e. remove some obvious ones,
    where the code is already self-documenting, add objectives for others),
    simplify bpf_prog_array_compatible() a bit.

    Signed-off-by: Daniel Borkmann
    Acked-by: Alexei Starovoitov
    Signed-off-by: David S. Miller

    Daniel Borkmann
     
  • As this is already exported from tracing side via commit d9847d310ab4
    ("tracing: Allow BPF programs to call bpf_ktime_get_ns()"), we might
    as well want to move it to the core, so also networking users can make
    use of it, e.g. to measure diffs for certain flows from ingress/egress.

    Signed-off-by: Daniel Borkmann
    Cc: Alexei Starovoitov
    Cc: Ingo Molnar
    Signed-off-by: David S. Miller

    Daniel Borkmann
     

31 May, 2015

1 commit

  • Normally the program attachment place (like sockets, qdiscs) takes
    care of rcu protection and calls bpf_prog_put() after a grace period.
    The programs stored inside prog_array may not be attached anywhere,
    so prog_array needs to take care of preserving rcu protection.
    Otherwise bpf_tail_call() will race with bpf_prog_put().
    To solve that, introduce a bpf_prog_put_rcu() helper function and use
    it in the 3 places where an unattached program can drop its refcnt:
    closing the program fd, and deleting/replacing a program in a prog_array.

    Fixes: 04fd61ab36ec ("bpf: allow bpf programs to tail-call other bpf programs")
    Reported-by: Martin Schwidefsky
    Signed-off-by: Alexei Starovoitov
    Acked-by: Daniel Borkmann
    Signed-off-by: David S. Miller

    Alexei Starovoitov
     

22 May, 2015

1 commit

  • Introduce the bpf_tail_call(ctx, &jmp_table, index) helper function,
    which can be used from BPF programs like:

    int bpf_prog(struct pt_regs *ctx)
    {
        ...
        bpf_tail_call(ctx, &jmp_table, index);
        ...
    }

    which is roughly equivalent to:

    int bpf_prog(struct pt_regs *ctx)
    {
        ...
        if (jmp_table[index])
            return (*jmp_table[index])(ctx);
        ...
    }

    The important detail is that it's not a normal call, but a tail call.
    The kernel stack is precious, so this helper reuses the current
    stack frame and jumps into another BPF program without adding
    extra call frame.
    It's trivially done in interpreter and a bit trickier in JITs.
    In case of x64 JIT the bigger part of generated assembler prologue
    is common for all programs, so it is simply skipped while jumping.
    Other JITs can do similar prologue-skipping optimization or
    do stack unwind before jumping into the next program.

    bpf_tail_call() arguments:
    ctx - context pointer
    jmp_table - one of BPF_MAP_TYPE_PROG_ARRAY maps used as the jump table
    index - index in the jump table

    Since all BPF programs are identified by file descriptor, user space
    needs to populate the jmp_table with FDs of other BPF programs.
    If jmp_table[index] is empty, bpf_tail_call() doesn't jump anywhere
    and program execution continues as normal.

    New BPF_MAP_TYPE_PROG_ARRAY map type is introduced so that user space can
    populate this jmp_table array with FDs of other bpf programs.
    Programs can share the same jmp_table array or use multiple jmp_tables.

    The chain of tail calls can form unpredictable dynamic loops, therefore
    tail_call_cnt is used to limit the number of calls; it is currently set to 32.

    Use cases:
    ==========
    - simplify complex programs by splitting them into a sequence of small programs

    - dispatch routine
    For tracing and future seccomp the program may be triggered on all system
    calls, but processing of syscall arguments will be different. It's more
    efficient to implement them as:
    int syscall_entry(struct seccomp_data *ctx)
    {
        bpf_tail_call(ctx, &syscall_jmp_table, ctx->nr /* syscall number */);
        ... default: process unknown syscall ...
    }
    int sys_write_event(struct seccomp_data *ctx) {...}
    int sys_read_event(struct seccomp_data *ctx) {...}
    syscall_jmp_table[__NR_write] = sys_write_event;
    syscall_jmp_table[__NR_read] = sys_read_event;

    For networking the program may call into different parsers depending on
    packet format, like:

    int packet_parser(struct __sk_buff *skb)
    {
        ... parse L2, L3 here ...
        __u8 ipproto = load_byte(skb, ... offsetof(struct iphdr, protocol));
        bpf_tail_call(skb, &ipproto_jmp_table, ipproto);
        ... default: process unknown protocol ...
    }
    int parse_tcp(struct __sk_buff *skb) {...}
    int parse_udp(struct __sk_buff *skb) {...}
    ipproto_jmp_table[IPPROTO_TCP] = parse_tcp;
    ipproto_jmp_table[IPPROTO_UDP] = parse_udp;

    - for the TC use case, bpf_tail_call() allows implementing reclassify-like logic

    - bpf_map_update_elem/delete calls into BPF_MAP_TYPE_PROG_ARRAY jump table
    are atomic, so user space can build chains of BPF programs on the fly

    Implementation details:
    =======================
    - high performance of bpf_tail_call() is the goal.
    It could have been implemented without JIT changes as a wrapper on top of
    BPF_PROG_RUN() macro, but with two downsides:
    . all programs would have to pay a performance penalty for this feature,
    and the tail call itself would be slower, since a mandatory stack unwind,
    return, and stack allocation would be done for every tail call.
    . tail calls would be limited to programs running with preemption disabled,
    since a generic 'void *ctx' doesn't have room for 'tail_call_cnt' and it
    would need to be either a global per_cpu variable accessed by helper and
    wrapper, or a global variable protected by locks.

    In this implementation x64 JIT bypasses stack unwind and jumps into the
    callee program after prologue.

    - bpf_prog_array_compatible() ensures that the prog_type of callee and
    caller are the same and that the JITed/non-JITed flag is the same, since
    calling a JITed program from a non-JITed one is invalid because the stack
    frames are different. Similarly, calling a kprobe type program from a
    socket type program is invalid.

    - jump table is implemented as BPF_MAP_TYPE_PROG_ARRAY to reuse 'map'
    abstraction, its user space API and all of verifier logic.
    It's in the existing arraymap.c file, since several functions are
    shared with regular array map.

    Signed-off-by: Alexei Starovoitov
    Acked-by: Daniel Borkmann
    Signed-off-by: David S. Miller

    Alexei Starovoitov
     

28 Apr, 2015

1 commit

  • ALU64_DIV instruction should be dividing 64-bit by 64-bit,
    whereas do_div() does 64-bit by 32-bit divide.
    x64 and arm64 JITs correctly implement 64 by 64 unsigned divide.
    llvm BPF backend emits code assuming that ALU64_DIV does 64 by 64.

    Fixes: 89aa075832b0 ("net: sock: allow eBPF programs to be attached to sockets")
    Reported-by: Michael Holzheu
    Acked-by: Daniel Borkmann
    Signed-off-by: Alexei Starovoitov
    Signed-off-by: David S. Miller

    Alexei Starovoitov
     

17 Apr, 2015

2 commits

  • 1.
    The first bug is a silly mistake. It broke tracing examples and prevented
    simple bpf programs from loading.

    In the following code:

    if (insn->imm == 0 && BPF_SIZE(insn->code) == BPF_W) {
    } else if (...) {
        // this part should have been executed when
        // insn->code == BPF_W and insn->imm != 0
    }

    Obviously it's not doing that. So simple instructions like:

    r2 = *(u64 *)(r1 + 8)

    will be rejected. Note the comments in the code around these branches
    were and still are valid and indicate the true intent.

    Replace it with:

    if (BPF_SIZE(insn->code) != BPF_W)
        continue;

    if (insn->imm == 0) {
    } else if (...) {
        // now this code will be executed when
        // insn->code == BPF_W and insn->imm != 0
    }

    2.
    The second bug is more subtle.
    If malicious code uses the same dest register as source register,
    the checks designed to prevent the same instruction from being used with
    different pointer types will fail to trigger, since we were assigning
    src_reg_type when it had already been overwritten by check_mem_access().
    The fix is trivial: just move the line
    src_reg_type = regs[insn->src_reg].type;
    before check_mem_access().
    Add a new 'access skb fields bad4' test to check this case.

    Fixes: 9bac3d6d548e ("bpf: allow extended BPF programs access skb fields")
    Signed-off-by: Alexei Starovoitov
    Signed-off-by: David S. Miller

    Alexei Starovoitov
     
  • Due to a missing bounds check, the DAG pass of the BPF verifier can
    corrupt memory, which can cause random crashes during program loading:

    [8.449451] BUG: unable to handle kernel paging request at ffffffffffffffff
    [8.451293] IP: [] kmem_cache_alloc_trace+0x8d/0x2f0
    [8.452329] Oops: 0000 [#1] SMP
    [8.452329] Call Trace:
    [8.452329] [] bpf_check+0x852/0x2000
    [8.452329] [] bpf_prog_load+0x1e4/0x310
    [8.452329] [] ? might_fault+0x5f/0xb0
    [8.452329] [] SyS_bpf+0x806/0xa30

    Fixes: f1bca824dabb ("bpf: add search pruning optimization to verifier")
    Signed-off-by: Alexei Starovoitov
    Acked-by: Hannes Frederic Sowa
    Acked-by: Daniel Borkmann
    Signed-off-by: David S. Miller

    Alexei Starovoitov
     

16 Apr, 2015

1 commit

  • Pull networking updates from David Miller:

    1) Add BQL support to via-rhine, from Tino Reichardt.

    2) Integrate SWITCHDEV layer support into the DSA layer, so DSA drivers
    can support hw switch offloading. From Florian Fainelli.

    3) Allow 'ip address' commands to initiate multicast group join/leave,
    from Madhu Challa.

    4) Many ipv4 FIB lookup optimizations from Alexander Duyck.

    5) Support EBPF in cls_bpf classifier and act_bpf action, from Daniel
    Borkmann.

    6) Remove the ugly compat support in ARP for ugly layers like ax25,
    rose, etc. And use this to clean up the neigh layer, then use it to
    implement MPLS support. All from Eric Biederman.

    7) Support L3 forwarding offloading in switches, from Scott Feldman.

    8) Collapse the LOCAL and MAIN ipv4 FIB tables when possible, to speed
    up route lookups even further. From Alexander Duyck.

    9) Many improvements and bug fixes to the rhashtable implementation,
    from Herbert Xu and Thomas Graf. In particular, in the case where
    an rhashtable user bulk adds a large number of items into an empty
    table, we expand the table much more sanely.

    10) Don't make the tcp_metrics hash table per-namespace, from Eric
    Biederman.

    11) Extend EBPF to access SKB fields, from Alexei Starovoitov.

    12) Split out new connection request sockets so that they can be
    established in the main hash table. Much less false sharing since
    hash lookups go direct to the request sockets instead of having to
    go first to the listener then to the request socks hashed
    underneath. From Eric Dumazet.

    13) Add async I/O support for crypto AF_ALG sockets, from Tadeusz Struk.

    14) Support stable privacy address generation for RFC7217 in IPV6. From
    Hannes Frederic Sowa.

    15) Hash network namespace into IP frag IDs, also from Hannes Frederic
    Sowa.

    16) Convert PTP get/set methods to use 64-bit time, from Richard
    Cochran.

    * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1816 commits)
    fm10k: Bump driver version to 0.15.2
    fm10k: corrected VF multicast update
    fm10k: mbx_update_max_size does not drop all oversized messages
    fm10k: reset head instead of calling update_max_size
    fm10k: renamed mbx_tx_dropped to mbx_tx_oversized
    fm10k: update xcast mode before synchronizing multicast addresses
    fm10k: start service timer on probe
    fm10k: fix function header comment
    fm10k: comment next_vf_mbx flow
    fm10k: don't handle mailbox events in iov_event path and always process mailbox
    fm10k: use separate workqueue for fm10k driver
    fm10k: Set PF queues to unlimited bandwidth during virtualization
    fm10k: expose tx_timeout_count as an ethtool stat
    fm10k: only increment tx_timeout_count in Tx hang path
    fm10k: remove extraneous "Reset interface" message
    fm10k: separate PF only stats so that VF does not display them
    fm10k: use hw->mac.max_queues for stats
    fm10k: only show actual queues, not the maximum in hardware
    fm10k: allow creation of VLAN on default vid
    fm10k: fix unused warnings
    ...

    Linus Torvalds
     

02 Apr, 2015

1 commit

  • BPF programs, attached to kprobes, provide a safe way to execute
    user-defined BPF byte-code programs without being able to crash or
    hang the kernel in any way. The BPF engine makes sure that such
    programs have a finite execution time and that they cannot break
    out of their sandbox.

    The user interface is to attach to a kprobe via the perf syscall:

    struct perf_event_attr attr = {
        .type = PERF_TYPE_TRACEPOINT,
        .config = event_id,
        ...
    };

    event_fd = perf_event_open(&attr, ...);
    ioctl(event_fd, PERF_EVENT_IOC_SET_BPF, prog_fd);

    'prog_fd' is a file descriptor associated with BPF program
    previously loaded.

    'event_id' is an ID of the kprobe created.

    Closing 'event_fd':

    close(event_fd);

    ... automatically detaches BPF program from it.

    BPF programs can call in-kernel helper functions to:

    - lookup/update/delete elements in maps

    - probe_read - a wrapper of probe_kernel_read() used to access any
    kernel data structures

    BPF programs receive 'struct pt_regs *' as an input ('struct pt_regs' is
    architecture dependent) and return 0 to ignore the event and 1 to store
    kprobe event into the ring buffer.

    Note, kprobes are fundamentally _not_ a stable kernel ABI, so BPF
    programs attached to kprobes must be recompiled for every kernel
    version, and the user must supply the correct LINUX_VERSION_CODE in
    attr.kern_version during the bpf_prog_load() call.
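
    A minimal hypothetical program for the above could be (sketch only):

    SEC("kprobe/sys_clone")
    int bpf_prog(struct pt_regs *ctx)
    {
        /* ctx is the architecture dependent struct pt_regs; maps and
         * helpers like probe_read can be used from here */
        return 1;   /* 1: store the kprobe event, 0: ignore it */
    }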

    Signed-off-by: Alexei Starovoitov
    Reviewed-by: Steven Rostedt
    Reviewed-by: Masami Hiramatsu
    Cc: Andrew Morton
    Cc: Arnaldo Carvalho de Melo
    Cc: Daniel Borkmann
    Cc: David S. Miller
    Cc: Jiri Olsa
    Cc: Linus Torvalds
    Cc: Namhyung Kim
    Cc: Peter Zijlstra
    Link: http://lkml.kernel.org/r/1427312966-8434-4-git-send-email-ast@plumgrid.com
    Signed-off-by: Ingo Molnar

    Alexei Starovoitov
     

30 Mar, 2015

1 commit

  • The existing TC action 'pedit' can munge any bits of the packet.
    Generalize this for use in bpf programs attached as cls_bpf and act_bpf
    via the bpf_skb_store_bytes() helper function.
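
    A hypothetical cls_bpf/act_bpf usage could rewrite a header byte along
    these lines (sketch only; flags left 0, checksum handling elided):

    SEC("classifier")
    int rewrite_tos(struct __sk_buff *skb)
    {
        __u8 tos = 0x10;

        /* overwrite the IPv4 TOS byte behind the Ethernet header */
        bpf_skb_store_bytes(skb, ETH_HLEN + offsetof(struct iphdr, tos),
                            &tos, sizeof(tos), 0);
        return 0;
    }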

    Signed-off-by: Alexei Starovoitov
    Reviewed-by: Jiri Pirko
    Acked-by: Daniel Borkmann
    Signed-off-by: David S. Miller

    Alexei Starovoitov
     

21 Mar, 2015

1 commit

  • In order to prepare eBPF support for tc action, we need to add
    sched_act_type, so that the eBPF verifier is aware of what helper
    function act_bpf may use, that it can load skb data and read out
    currently available skb fields.

    This is basically analogous to 96be4325f443 ("ebpf: add sched_cls_type
    and map it to sk_filter's verifier ops").

    BPF_PROG_TYPE_SCHED_CLS and BPF_PROG_TYPE_SCHED_ACT need to be
    separate since both will have a different set of functionality in
    future (classifier vs action), thus we won't run into ABI troubles
    when the point in time comes to diverge functionality from the
    classifier.

    The future plan for act_bpf would be that it will be able to write
    into skb->data and alter selected fields mirrored in struct __sk_buff.

    For an initial support, it's sufficient to map it to sk_filter_ops.

    Signed-off-by: Daniel Borkmann
    Cc: Jiri Pirko
    Reviewed-by: Jiri Pirko
    Acked-by: Alexei Starovoitov
    Signed-off-by: David S. Miller

    Daniel Borkmann
     

16 Mar, 2015

1 commit

  • Introduce a user accessible mirror of the in-kernel 'struct sk_buff':

    struct __sk_buff {
        __u32 len;
        __u32 pkt_type;
        __u32 mark;
        __u32 queue_mapping;
    };

    bpf programs can do:

    int bpf_prog(struct __sk_buff *skb)
    {
        __u32 var = skb->pkt_type;

    which will be compiled to bpf assembler as:

    dst_reg = *(u32 *)(src_reg + 4) // 4 == offsetof(struct __sk_buff, pkt_type)

    The bpf verifier will check the validity of the access and will convert it to:

    dst_reg = *(u8 *)(src_reg + offsetof(struct sk_buff, __pkt_type_offset))
    dst_reg &= 7

    since skb->pkt_type is a bitfield.

    Signed-off-by: Alexei Starovoitov
    Signed-off-by: David S. Miller

    Alexei Starovoitov