21 Mar, 2016

1 commit

  • Pull 'objtool' stack frame validation from Ingo Molnar:
    "This tree adds a new kernel build-time object file validation feature
    (CONFIG_STACK_VALIDATION=y): kernel stack frame correctness validation.
    It was written by and is maintained by Josh Poimboeuf.

    The motivation: there's a category of hard-to-find kernel bugs, most
    of them in assembly code (but also occasionally in C code), that
    degrade the quality of kernel stack dumps/backtraces. These bugs are
    hard to detect at the source code level. Such bugs result in
    incorrect/incomplete backtraces most of the time - but can also in some
    rare cases result in crashes or other undefined behavior.

    The build time correctness checking is done via the new 'objtool'
    user-space utility that was written for this purpose and which is
    hosted in the kernel repository in tools/objtool/. The tool's (very
    simple) UI and source code design is shaped after Git and perf, and it
    shares quite a bit of infrastructure with tools/perf (the tooling
    infrastructure sharing work was merged via perf and is already
    upstream). Objtool follows the well-known kernel coding style.

    Objtool does not try to check .c or .S files; instead, it analyzes the
    machine code in the resulting .o files from first principles: it
    decodes the instruction stream and interprets it. (Right now objtool
    supports the x86-64 architecture.)

    From tools/objtool/Documentation/stack-validation.txt:

    "The kernel CONFIG_STACK_VALIDATION option enables a host tool named
    objtool which runs at compile time. It has a "check" subcommand
    which analyzes every .o file and ensures the validity of its stack
    metadata. It enforces a set of rules on asm code and C inline
    assembly code so that stack traces can be reliable.

    Currently it only checks frame pointer usage, but there are plans to
    add CFI validation for C files and CFI generation for asm files.

    For each function, it recursively follows all possible code paths
    and validates the correct frame pointer state at each instruction.

    It also follows code paths involving special sections, like
    .altinstructions, __jump_table, and __ex_table, which can add
    alternative execution paths to a given instruction (or set of
    instructions). Similarly, it knows how to follow switch statements,
    for which gcc sometimes uses jump tables."

    When this new kernel option is enabled (it's disabled by default), the
    tool, if it finds any suspicious assembly code pattern, outputs
    warnings in compiler warning format:

    warning: objtool: rtlwifi_rate_mapping()+0x2e7: frame pointer state mismatch
    warning: objtool: cik_tiling_mode_table_init()+0x6ce: call without frame pointer save/setup
    warning: objtool: __schedule()+0x3c0: duplicate frame pointer save
    warning: objtool: __schedule()+0x3fd: sibling call from callable instruction with changed frame pointer

    ... so that scripts that pick up compiler warnings will notice them.
    All known warnings triggered by the tool are fixed by the tree; most
    of the commits in fact prepare the kernel to be warning-free. Most of
    them are bugfixes or cleanups that stand on their own, but there are
    also some annotations of 'special' stack frames for justified cases
    such as entries to JIT-ed code (BPF) or really special boot time code.

    There are two other long-term motivations behind this tool as well:

    - To improve the quality and reliability of kernel stack frames, so
    that they can be used for optimized live patching.

    - To create independent infrastructure to check the correctness of
    CFI stack frames at build time. CFI debuginfo is notoriously
    unreliable and we cannot use it in the kernel as-is without extra
    checking done both on the kernel side and on the build side.

    The quality of kernel stack frames matters to debuggability as well,
    so IMO we can merge this without having to consider the live patching
    or CFI debuginfo angle"

    * 'core-objtool-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (52 commits)
    objtool: Only print one warning per function
    objtool: Add several performance improvements
    tools: Copy hashtable.h into tools directory
    objtool: Fix false positive warnings for functions with multiple switch statements
    objtool: Rename some variables and functions
    objtool: Remove superflous INIT_LIST_HEAD
    objtool: Add helper macros for traversing instructions
    objtool: Fix false positive warnings related to sibling calls
    objtool: Compile with debugging symbols
    objtool: Detect infinite recursion
    objtool: Prevent infinite recursion in noreturn detection
    objtool: Detect and warn if libelf is missing and don't break the build
    tools: Support relative directory path for 'O='
    objtool: Support CROSS_COMPILE
    x86/asm/decoder: Use explicitly signed chars
    objtool: Enable stack metadata validation on 64-bit x86
    objtool: Add CONFIG_STACK_VALIDATION option
    objtool: Add tool to perform compile-time stack metadata validation
    x86/kprobes: Mark kretprobe_trampoline() stack frame as non-standard
    sched: Always inline context_switch()
    ...

    Linus Torvalds
     

10 Mar, 2016

2 commits

  • Lots of places in the kernel use memcpy(buf, comm, TASK_COMM_LEN), but
    the result is typically passed to print("%s", buf), so extra bytes
    after the terminating zero don't cause any harm.
    In bpf the result of bpf_get_current_comm() is used as part of a
    map key and was causing spurious hash map mismatches.
    Use strlcpy() to guarantee a zero-terminated string.
    The bpf verifier checks that the output buffer is zero-initialized,
    so even for short task names the output buffer doesn't have junk bytes.
    Note it's not a security concern, since kprobe+bpf is root only.
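
    A minimal sketch of the kind of change described above (helper body
    reduced to the relevant line; the exact in-kernel signature is not
    reproduced here):

    /* Sketch: strlcpy() guarantees NUL termination, and because the
     * verifier requires the output buffer to be zero-initialized, no
     * junk bytes remain after short task names.
     */
    char buf[TASK_COMM_LEN] = {};

    strlcpy(buf, current->comm, sizeof(buf));   /* instead of memcpy() */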

    Fixes: ffeedafbf023 ("bpf: introduce current->pid, tgid, uid, gid, comm accessors")
    Reported-by: Tobias Waldekranz
    Signed-off-by: Alexei Starovoitov
    Signed-off-by: David S. Miller

    Alexei Starovoitov
     
  • The 0-day bot reported a build error:
    kernel/built-in.o: In function `map_lookup_elem':
    >> kernel/bpf/.tmp_syscall.o:(.text+0x329b3c): undefined reference to `bpf_stackmap_copy'
    when CONFIG_BPF_SYSCALL is set and CONFIG_PERF_EVENTS is not.
    Add weak definition to resolve it.
    This code path in map_lookup_elem() is never taken
    when CONFIG_PERF_EVENTS is not set.
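
    A weak stub along these lines resolves the reference when
    CONFIG_PERF_EVENTS is not set (the signature below is an assumption
    for illustration):

    /* Fallback used only when the real bpf_stackmap_copy() is not built
     * (CONFIG_PERF_EVENTS=n); the calling path is never taken then.
     */
    int __weak bpf_stackmap_copy(struct bpf_map *map, void *key, void *value)
    {
            return -ENOTSUPP;
    }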

    Fixes: 557c0c6e7df8 ("bpf: convert stackmap to pre-allocation")
    Reported-by: Fengguang Wu
    Signed-off-by: Alexei Starovoitov
    Signed-off-by: David S. Miller

    Alexei Starovoitov
     

09 Mar, 2016

5 commits

  • It was observed that calling bpf_get_stackid() from a kprobe inside
    slub or from spin_unlock causes a similar deadlock as with the hashmap,
    therefore convert the stackmap to use pre-allocated memory.

    call_rcu is no longer a feasible mechanism, since delayed freeing
    causes bpf_get_stackid() to fail unpredictably when the number of actual
    stacks is significantly less than the user-requested max_entries.
    Since elements are no longer freed into slub, we can push elements into
    the freelist immediately and let them be recycled.
    However, a very unlikely race between user space map_lookup() and
    program-side recycling is possible:
    cpu0                                cpu1
    ----                                ----
    user does lookup(stackidX)
    starts copying ips into buffer
                                        delete(stackidX)
                                        calls bpf_get_stackid()
                                        which recycles the element and
                                        overwrites with new stack trace

    To avoid user space seeing a partial stack trace consisting of two
    merged stack traces, do bucket = xchg(, NULL); copy; xchg(,bucket);
    to preserve consistent stack trace delivery to user space.
    Now we can move the memset(,0) of the left-over element value from the
    critical path of bpf_get_stackid() into the slow path of the user space
    lookup. Also disallow lookup() from bpf programs, since it's useless and
    a program shouldn't be messing with the collected stack trace.
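
    A rough sketch of that lookup-side protection (variable and field
    names are assumed for illustration):

    /* Sketch: take the bucket out of the table while copying, so a
     * concurrent bpf_get_stackid() cannot recycle it and overwrite the
     * trace being read.
     */
    bucket = xchg(&smap->buckets[id], NULL);
    if (!bucket)
            return -ENOENT;

    memcpy(value, bucket->data, trace_len);          /* consistent copy */
    memset(value + trace_len, 0, map->value_size - trace_len);

    old_bucket = xchg(&smap->buckets[id], bucket);   /* put it back */
    if (old_bucket)                                  /* raced with a delete */
            pcpu_freelist_push(&smap->freelist, &old_bucket->fllist);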

    Note that a similar race between user space lookup and kernel-side
    updates is also present in the hashmap, but it's not a new race: bpf
    programs were always allowed to modify hash and array map elements
    while user space is copying them.

    Fixes: d5a3b1f69186 ("bpf: introduce BPF_MAP_TYPE_STACK_TRACE")
    Signed-off-by: Alexei Starovoitov
    Signed-off-by: David S. Miller

    Alexei Starovoitov
     
  • Suggested-by: Daniel Borkmann
    Signed-off-by: Alexei Starovoitov
    Signed-off-by: David S. Miller

    Alexei Starovoitov
     
  • If a kprobe is placed on spin_unlock, then calling kmalloc/kfree from
    bpf programs is not safe, since the following deadlock is possible:

    kfree->spin_lock(kmem_cache_node->lock)...spin_unlock->kprobe->
    bpf_prog->map_update->kmalloc->spin_lock(of the same kmem_cache_node->lock)

    The following solutions were considered, and some implemented, but
    eventually discarded:
    - kmem_cache_create for every map
    - add recursion check to slow-path of slub
    - use reserved memory in bpf_map_update for in_irq or in preempt_disabled
    - kmalloc via irq_work

    In the end, pre-allocation of all map elements turned out to be the
    simplest solution, and since the user is charged upfront for all the
    memory, such pre-allocation doesn't affect the user-visible behavior.

    Since it's impossible to tell whether a kprobe is triggered in a
    location that is safe from kmalloc's point of view, use pre-allocation
    by default and introduce a new BPF_F_NO_PREALLOC flag.

    While testing per-cpu hash maps it was discovered
    that alloc_percpu(GFP_ATOMIC) has odd corner cases and often
    fails to allocate memory even when 90% of it is free.
    Pre-allocation of per-cpu hash elements solves this problem as well.

    It turned out that bpf_map_update() quickly followed by
    bpf_map_lookup()+bpf_map_delete() is a very common pattern used in
    many of the iovisor/bcc tools, so there is an additional benefit of
    pre-allocation, since such use cases are much faster.

    Since all hash map elements are now pre-allocated, we can remove the
    atomic increment of htab->count and save a few more cycles.

    Also add bpf_map_precharge_memlock() to check rlimit_memlock early to avoid
    large malloc/free done by users who don't have sufficient limits.

    Pre-allocation is done with vmalloc and alloc/free is done
    via percpu_freelist. Here are performance numbers for different
    pre-allocation algorithms that were implemented, but discarded
    in favor of percpu_freelist:

    1 cpu:
    pcpu_ida 2.1M
    pcpu_ida nolock 2.3M
    bt 2.4M
    kmalloc 1.8M
    hlist+spinlock 2.3M
    pcpu_freelist 2.6M

    4 cpu:
    pcpu_ida 1.5M
    pcpu_ida nolock 1.8M
    bt w/smp_align 1.7M
    bt no/smp_align 1.1M
    kmalloc 0.7M
    hlist+spinlock 0.2M
    pcpu_freelist 2.0M

    8 cpu:
    pcpu_ida 0.7M
    bt w/smp_align 0.8M
    kmalloc 0.4M
    pcpu_freelist 1.5M

    32 cpu:
    kmalloc 0.13M
    pcpu_freelist 0.49M

    pcpu_ida nolock is a modified percpu_ida algorithm without
    percpu_ida_cpu locks and without cross-cpu tag stealing.
    It's faster than existing percpu_ida, but not as fast as pcpu_freelist.

    bt is a variant of block/blk-mq-tag.c, simplified and customized for
    the bpf use case. bt w/smp_align uses a cache line for every 'long'
    (similar to blk-mq-tag). bt no/smp_align allocates 'long' bitmasks
    contiguously to save memory. It's comparable to percpu_ida and in some
    cases faster, but slower than percpu_freelist.

    hlist+spinlock is the simplest free list with a single spinlock.
    As expected, it has very bad scaling in SMP.

    kmalloc is the existing implementation, which is still available via
    the BPF_F_NO_PREALLOC flag. It's significantly slower on a single cpu,
    and in the 8 cpu setup it's 3 times slower than pre-allocation with
    pcpu_freelist, but it saves memory; so in cases where map->max_entries
    can be large and the number of map updates/deletes per second is low,
    it may make sense to use it.
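
    A sketch of opting out of pre-allocation from user space; the
    attribute fields follow this series, while the syscall wrapper and
    map sizes are illustrative:

    #include <linux/bpf.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    /* Create a hash map that keeps the old kmalloc-per-element behaviour,
     * trading speed for memory on a large, rarely updated map.
     */
    static int create_map(void)
    {
            union bpf_attr attr = {
                    .map_type    = BPF_MAP_TYPE_HASH,
                    .key_size    = sizeof(__u32),
                    .value_size  = sizeof(long),
                    .max_entries = 1024 * 1024,
                    .map_flags   = BPF_F_NO_PREALLOC,
            };

            return syscall(__NR_bpf, BPF_MAP_CREATE, &attr, sizeof(attr));
    }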

    Signed-off-by: Alexei Starovoitov
    Signed-off-by: David S. Miller

    Alexei Starovoitov
     
  • Introduce a simple percpu_freelist to keep a single list of elements
    spread across per-cpu singly linked lists.

    /* push element into the list */
    void pcpu_freelist_push(struct pcpu_freelist *, struct pcpu_freelist_node *);

    /* pop element from the list */
    struct pcpu_freelist_node *pcpu_freelist_pop(struct pcpu_freelist *);

    An object is pushed onto the current cpu's list.
    Pop first tries to get an object from the current cpu's list;
    if it's empty, it goes to a neighbour cpu's list.

    For the bpf program usage pattern the collision rate is very low,
    since programs typically push and pop objects on the same cpu.
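
    A hedged sketch of how a user of this API embeds and recycles
    elements (the element layout is an assumption for illustration):

    /* Elements embed a freelist node and are recycled in place. */
    struct my_elem {
            struct pcpu_freelist_node fnode;
            u64 payload[4];
    };

    static void recycle(struct pcpu_freelist *fl, struct my_elem *e)
    {
            pcpu_freelist_push(fl, &e->fnode);      /* onto this cpu's list */
    }

    static struct my_elem *get(struct pcpu_freelist *fl)
    {
            struct pcpu_freelist_node *n = pcpu_freelist_pop(fl);

            return n ? container_of(n, struct my_elem, fnode) : NULL;
    }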

    Signed-off-by: Alexei Starovoitov
    Signed-off-by: David S. Miller

    Alexei Starovoitov
     
  • If a kprobe is placed within the update or delete hash map helpers
    that hold the bucket spin lock, and the triggered bpf program tries
    to grab the spinlock for the same bucket on the same cpu, it will
    deadlock.
    Fix it by extending the existing recursion prevention mechanism.

    Note, map_lookup and other tracing helpers don't have this problem,
    since they don't hold any locks and don't modify global data.
    bpf_trace_printk has its own recursion check and is fine as well.
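
    A sketch of the kind of per-cpu recursion guard being extended here
    (the counter name and exact placement are assumptions):

    /* Only run the program if no bpf map operation is already active on
     * this cpu, so a kprobe firing inside the update/delete helpers
     * cannot re-enter and contend for the same bucket lock.
     */
    preempt_disable();
    if (unlikely(__this_cpu_inc_return(bpf_prog_active) != 1)) {
            ret = 0;        /* recursion detected, skip the program */
            goto out;
    }
    ret = BPF_PROG_RUN(prog, ctx);
    out:
    __this_cpu_dec(bpf_prog_active);
    preempt_enable();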

    Signed-off-by: Alexei Starovoitov
    Acked-by: Daniel Borkmann
    Signed-off-by: David S. Miller

    Alexei Starovoitov
     

29 Feb, 2016

1 commit

  • objtool reports the following false positive warnings:

    kernel/bpf/core.o: warning: objtool: __bpf_prog_run()+0x5c: sibling call from callable instruction with changed frame pointer
    kernel/bpf/core.o: warning: objtool: __bpf_prog_run()+0x60: function has unreachable instruction
    kernel/bpf/core.o: warning: objtool: __bpf_prog_run()+0x64: function has unreachable instruction
    [...]

    It's confused by the following dynamic jump instruction in
    __bpf_prog_run():

    jmp *(%r12,%rax,8)

    which corresponds to the following line in the C code:

    goto *jumptable[insn->code];

    There's no way for objtool to deterministically find all possible
    branch targets for a dynamic jump, so it can't verify this code.

    In this case the jumps all stay within the function, and there's nothing
    unusual going on related to the stack, so we can whitelist the function.
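
    The whitelisting is done with the annotation introduced by the
    objtool series, roughly along these lines:

    #include <linux/frame.h>

    /* Tell objtool not to validate this function's stack frame: the
     * interpreter's computed goto (jmp *(%r12,%rax,8)) cannot be
     * statically verified.  Placed after the function definition.
     */
    STACK_FRAME_NON_STANDARD(__bpf_prog_run);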

    Signed-off-by: Josh Poimboeuf
    Acked-by: Daniel Borkmann
    Acked-by: Alexei Starovoitov
    Cc: Andrew Morton
    Cc: Andy Lutomirski
    Cc: Arnaldo Carvalho de Melo
    Cc: Bernd Petrovitsch
    Cc: Borislav Petkov
    Cc: Chris J Arges
    Cc: Jiri Slaby
    Cc: Linus Torvalds
    Cc: Michal Marek
    Cc: Namhyung Kim
    Cc: Pedro Alves
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: live-patching@vger.kernel.org
    Cc: netdev@vger.kernel.org
    Link: http://lkml.kernel.org/r/b90e6bf3fdbfb5c4cc1b164b965502e53cf48935.1456719558.git.jpoimboe@redhat.com
    Signed-off-by: Ingo Molnar

    Josh Poimboeuf
     

23 Feb, 2016

1 commit


22 Feb, 2016

1 commit

  • Currently, when we pass a buffer from the eBPF stack into a helper
    function, the function proto indicates the argument types as an
    ARG_PTR_TO_STACK and ARG_CONST_STACK_SIZE pair. If one argument
    register contains the former, then the next one must be of the latter
    type. Then, the verifier checks whether the buffer points into the
    eBPF stack, is initialized, etc. The verifier also guarantees that the
    constant size value is greater than 0, so helper functions don't need
    to test for it and can always assume a non-NULL initialized buffer as
    well as a non-0 buffer size.

    This patch adds a new argument type ARG_CONST_STACK_SIZE_OR_ZERO that
    also allows passing a NULL buffer pointer together with a 0 size into
    the helper function. Such helper functions, of course, need to be able
    to handle these cases internally. The verifier guarantees that either
    the pointer is NULL and the size is 0, or the pointer is non-NULL and
    the size is non-0 (as in the ARG_CONST_STACK_SIZE case); no other
    combination can be loaded.

    I went through various options for extending the verifier, and
    introducing the type ARG_CONST_STACK_SIZE_OR_ZERO seemed to need the
    fewest changes to the verifier.
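
    A hedged sketch of how a helper proto pairs the argument types; the
    helper itself is hypothetical:

    static const struct bpf_func_proto bpf_example_proto = {
            .func      = bpf_example,                  /* hypothetical helper */
            .gpl_only  = false,
            .ret_type  = RET_INTEGER,
            .arg1_type = ARG_PTR_TO_STACK,             /* buffer pointer */
            .arg2_type = ARG_CONST_STACK_SIZE_OR_ZERO, /* size may now be 0,
                                                        * with a NULL buffer */
    };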

    Signed-off-by: Daniel Borkmann
    Acked-by: Alexei Starovoitov
    Signed-off-by: David S. Miller

    Daniel Borkmann
     

20 Feb, 2016

2 commits

  • Add a new map type to store stack traces and a corresponding helper:

    bpf_get_stackid(ctx, map, flags) - walk the user or kernel stack and return an id
    @ctx: struct pt_regs*
    @map: pointer to stack_trace map
    @flags: bits 0-7   - number of stack frames to skip
            bit 8      - collect user stack instead of kernel
            bit 9      - compare stacks by hash only
            bit 10     - if two different stacks hash into the same stackid,
                         discard the old one
            other bits - reserved
    Return: >= 0 stackid on success or negative error

    The stackid is a 32-bit integer handle that can be further combined
    with other data (including other stackids) and used as a key into maps.

    Userspace will access the stackmap using the standard lookup/delete
    syscall commands to retrieve the full stack trace for a given stackid.
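
    A sketch of typical usage from a tracing program, modeled loosely on
    the kernel samples (map sizes, section name and the probed function
    are illustrative):

    struct bpf_map_def SEC("maps") stackmap = {
            .type        = BPF_MAP_TYPE_STACK_TRACE,
            .key_size    = sizeof(u32),
            .value_size  = PERF_MAX_STACK_DEPTH * sizeof(u64),
            .max_entries = 10000,
    };

    SEC("kprobe/try_to_wake_up")
    int waker(struct pt_regs *ctx)
    {
            /* id of the current kernel stack, skipping no frames */
            int stackid = bpf_get_stackid(ctx, &stackmap, 0);

            /* combine stackid with other data and use it as a map key */
            return 0;
    }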

    Signed-off-by: Alexei Starovoitov
    Signed-off-by: David S. Miller

    Alexei Starovoitov
     
  • bpf_percpu_hash_update() expects rcu lock to be held and warns if it's not,
    which pointed out a missing rcu read lock.

    Fixes: 15a07b338 ("bpf: add lookup/update support for per-cpu hash and array maps")
    Signed-off-by: Sasha Levin
    Acked-by: Alexei Starovoitov
    Acked-by: Daniel Borkmann
    Signed-off-by: David S. Miller

    Sasha Levin
     

11 Feb, 2016

1 commit

  • When ctx access is used, the kernel often needs to expand/rewrite
    instructions, so after that patching, branch offsets have to be
    adjusted for both forward and backward jumps in the new eBPF program,
    but for backward jumps it fails to account for the delta. Meaning, for
    example, if the expansion happens exactly on the insn that sits at
    the jump target, it doesn't fix up the back jump offset.

    Analysis on what the check in adjust_branches() is currently doing:

    /* adjust offset of jmps if necessary */
    if (i < pos && i + insn->off + 1 > pos)
            insn->off += delta;
    else if (i > pos && i + insn->off + 1 < pos)
            insn->off -= delta;

    First condition (forward jumps):

    (before/after instruction layout diagram not recoverable)

    off + 1 == pos means we jump to that newly patched instruction, so no
    offset adjustment is needed. That part is correct.

    Second condition (backward jumps):

    (before/after instruction layout diagram not recoverable)

    pos is okay only by itself. However, i + insn->off + 1 < pos does not
    always work as intended to trigger the adjustment. It works when jump
    targets would be far off where the delta wouldn't matter. But, for
    example, where the fixed insn->off before pointed to pos (target_Y),
    it now points to pos + delta, so that additional room needs to be
    taken into account for the check. This means that i) both tests here
    need to be adjusted into pos + delta, and ii) for the second
    condition, the test needs to be
    Signed-off-by: David S. Miller

    Daniel Borkmann
     

06 Feb, 2016

3 commits

  • The functions bpf_map_lookup_elem(map, key, value) and
    bpf_map_update_elem(map, key, value, flags) need to get/set
    values from all cpus for per-cpu hash and array maps,
    so that user space can aggregate/update them as necessary.

    Example of single counter aggregation in user space:
    unsigned int nr_cpus = sysconf(_SC_NPROCESSORS_CONF);
    long values[nr_cpus];
    long value = 0;

    bpf_lookup_elem(fd, key, values);
    for (i = 0; i < nr_cpus; i++)
            value += values[i];

    User space must provide a round_up(value_size, 8) * nr_cpus byte
    array to get/set the values, since the kernel uses 'long'-sized copies
    of the per-cpu values to try to copy the counters atomically.
    It's best-effort, since bpf programs and user space are racing
    to access the same memory.

    Signed-off-by: Alexei Starovoitov
    Signed-off-by: David S. Miller

    Alexei Starovoitov
     
  • The primary use case is a latency histogram array, where a bpf
    program computes the latency of block requests or other events and
    stores the latency histogram into an array of 64 elements.
    All cpus are constantly running, so a normal increment is not accurate
    and bpf_xadd causes cache ping-pong; this per-cpu approach allows the
    fastest collision-free counters.

    Signed-off-by: Alexei Starovoitov
    Signed-off-by: David S. Miller

    Alexei Starovoitov
     
  • Introduce the BPF_MAP_TYPE_PERCPU_HASH map type, which is used to do
    accurate counters without the need for the BPF_XADD instruction, which
    turned out to be too costly for high-performance network monitoring.
    In the typical use case the 'key' is a flow tuple or another
    long-living object that sees a lot of events per second.

    bpf_map_lookup_elem() returns a per-cpu area.
    Example:

    struct {
            u32 packets;
            u32 bytes;
    } *ptr = bpf_map_lookup_elem(&map, &key);

    /* ptr points to the this_cpu area of the value, so the following
     * increments will not collide with other cpus
     */
    ptr->packets++;
    ptr->bytes += skb->len;

    bpf_update_elem() atomically creates a new element where all per-cpu
    values are zero-initialized and the this_cpu value is populated with
    the given 'value'.
    Note that the non-per-cpu hash map always allocates a new element
    and then deletes the old one after an rcu grace period to maintain
    atomicity of the update. The per-cpu hash map updates element values
    in place.

    Signed-off-by: Alexei Starovoitov
    Signed-off-by: David S. Miller

    Alexei Starovoitov
     

29 Jan, 2016

1 commit

  • Robustify refcounting.

    Signed-off-by: Alexei Starovoitov
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Alexander Shishkin
    Cc: Arnaldo Carvalho de Melo
    Cc: Arnaldo Carvalho de Melo
    Cc: Daniel Borkmann
    Cc: David Ahern
    Cc: Jiri Olsa
    Cc: Jiri Olsa
    Cc: Linus Torvalds
    Cc: Namhyung Kim
    Cc: Peter Zijlstra
    Cc: Stephane Eranian
    Cc: Thomas Gleixner
    Cc: Vince Weaver
    Cc: Wang Nan
    Cc: vince@deater.net
    Link: http://lkml.kernel.org/r/20160126045947.GA40151@ast-mbp.thefacebook.com
    Signed-off-by: Ingo Molnar

    Alexei Starovoitov
     

13 Jan, 2016

1 commit

  • On ARM64, a BUG() is triggered in the eBPF JIT if a filter with a
    constant shift that can't be encoded in the immediate field of the
    UBFM/SBFM instructions is passed to the JIT. Since these shift
    amounts, which are negative or >= the register size, are invalid,
    reject them in the eBPF verifier and the classic BPF filter checker,
    for all architectures.

    Signed-off-by: Rabin Vincent
    Acked-by: Alexei Starovoitov
    Acked-by: Daniel Borkmann
    Signed-off-by: David S. Miller

    Rabin Vincent
     

30 Dec, 2015

3 commits

  • Both htab_map_update_elem() and htab_map_delete_elem() can be
    called from an eBPF program, and they may be in a kernel hot path,
    so it isn't efficient to use a per-hashtable lock in these two
    helpers.

    The per-hashtable spinlock is used for protecting the buckets'
    hlists, and a per-bucket lock is enough. This patch converts
    the per-hashtable lock into per-bucket spinlocks, so that
    contention can be decreased a lot.
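
    A sketch of the per-bucket layout this converts to (treat the exact
    field names as illustrative):

    /* Each bucket carries its own lock instead of one hashtable-wide
     * lock, so updates/deletes on different buckets no longer contend.
     */
    struct bucket {
            struct hlist_head head;
            raw_spinlock_t    lock;
    };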

    Signed-off-by: Ming Lei
    Acked-by: Daniel Borkmann
    Signed-off-by: David S. Miller

    tom.leiming@gmail.com
     
  • The spinlock is just used for protecting the per-bucket
    hlist, so it isn't needed for selecting the bucket.

    Acked-by: Daniel Borkmann
    Signed-off-by: Ming Lei
    Signed-off-by: David S. Miller

    tom.leiming@gmail.com
     
  • Preparing for removing the global per-hashtable lock, the
    element counter needs to be defined as atomic_t first.

    Acked-by: Daniel Borkmann
    Signed-off-by: Ming Lei
    Signed-off-by: David S. Miller

    tom.leiming@gmail.com
     

19 Dec, 2015

1 commit

  • Back in the days when eBPF (or back then "internal BPF" ;->) was not
    exposed to user space, and only classic BPF programs were internally
    translated into eBPF programs, we missed the fact that for classic BPF
    A and X needed to be cleared. It was fixed back then via 83d5b7ef99c9
    ("net: filter: initialize A and X registers"), and thus classic BPF
    specifics were added to the eBPF interpreter core to work around it.

    This added some confusion for JIT developers later on who took the
    eBPF interpreter code as an example for deriving their JIT. F.e. in
    f75298f5c3fe ("s390/bpf: clear correct BPF accumulator register"), at
    least X could leak stack memory. Furthermore, since this is only needed
    for classic BPF translations and not for eBPF (the verifier takes care
    that read access to regs cannot be done uninitialized), more complexity
    is added to JITs as they need to determine whether they deal with
    migrations or native eBPF, where they could just omit clearing A/X in
    their prologue and thus reduce image size a bit; see f.e. cde66c2d88da
    ("s390/bpf: Only clear A and X for converted BPF programs"). In other
    cases (x86, arm64), A and X are cleared in the prologue also for the
    eBPF case, which is unnecessary.

    Let's move this into the classic BPF to eBPF migration in
    bpf_convert_filter(), where it actually belongs, as long as the number
    of eBPF JITs is still small. It can thus be done generically, allowing
    us to remove the quirk from __bpf_prog_run() and to slightly reduce JIT
    image size in the eBPF case, while reducing code duplication on this
    matter in current (and future) eBPF JITs.
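
    The generic clearing in the migration path looks roughly like this,
    using the classic-to-eBPF register mapping macros:

    /* Classic BPF expects A and X to start out as zero, so emit the
     * clearing as the first two instructions of the converted program
     * instead of doing it in every JIT prologue.
     */
    *new_insn++ = BPF_ALU64_REG(BPF_XOR, BPF_REG_A, BPF_REG_A);
    *new_insn++ = BPF_ALU64_REG(BPF_XOR, BPF_REG_X, BPF_REG_X);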

    Signed-off-by: Daniel Borkmann
    Acked-by: Alexei Starovoitov
    Reviewed-by: Michael Holzheu
    Tested-by: Michael Holzheu
    Cc: Zi Shen Lim
    Cc: Yang Shi
    Acked-by: Yang Shi
    Acked-by: Zi Shen Lim
    Signed-off-by: David S. Miller

    Daniel Borkmann
     

13 Dec, 2015

1 commit

  • Add support for renaming and hard links to the fs. Most of this can be
    implemented by using simple library operations under the same constraints
    that we don't use a reserved name like elsewhere. Linking can be useful
    to share/manage things like maps across subsystem users. It works within
    the file system boundary, but is not allowed for directories.

    Symbolic links are explicitly not implemented here, as the same can
    already be done better with bind mounts inside bpf fs to set up shared
    directories. This is useful, f.e., when using volumes in docker
    containers that map a private working directory into /sys/fs/bpf/,
    which itself contains a bind-mounted path from the host's /sys/fs/bpf/
    mount that is shared among multiple containers. For single maps instead
    of a whole directory, hard links can easily be used to do the same.

    Signed-off-by: Daniel Borkmann
    Acked-by: Alexei Starovoitov
    Signed-off-by: David S. Miller

    Daniel Borkmann
     

04 Dec, 2015

1 commit


03 Dec, 2015

1 commit

  • For large map->value_size the user space can trigger memory allocation warnings like:
    WARNING: CPU: 2 PID: 11122 at mm/page_alloc.c:2989
    __alloc_pages_nodemask+0x695/0x14e0()
    Call Trace:
    [< inline >] __dump_stack lib/dump_stack.c:15
    [] dump_stack+0x68/0x92 lib/dump_stack.c:50
    [] warn_slowpath_common+0xd9/0x140 kernel/panic.c:460
    [] warn_slowpath_null+0x29/0x30 kernel/panic.c:493
    [< inline >] __alloc_pages_slowpath mm/page_alloc.c:2989
    [] __alloc_pages_nodemask+0x695/0x14e0 mm/page_alloc.c:3235
    [] alloc_pages_current+0xee/0x340 mm/mempolicy.c:2055
    [< inline >] alloc_pages include/linux/gfp.h:451
    [] alloc_kmem_pages+0x16/0xf0 mm/page_alloc.c:3414
    [] kmalloc_order+0x19/0x60 mm/slab_common.c:1007
    [] kmalloc_order_trace+0x1f/0xa0 mm/slab_common.c:1018
    [< inline >] kmalloc_large include/linux/slab.h:390
    [] __kmalloc+0x234/0x250 mm/slub.c:3525
    [< inline >] kmalloc include/linux/slab.h:463
    [< inline >] map_update_elem kernel/bpf/syscall.c:288
    [< inline >] SYSC_bpf kernel/bpf/syscall.c:744

    To avoid a never-succeeding kmalloc with order >= MAX_ORDER, check that
    elem->value_size and the computed elem_size are within limits for both
    hash and array type maps.
    Also add __GFP_NOWARN to kmalloc(value_size | elem_size) to avoid OOM warnings.
    Note that kmalloc(key_size) is highly unlikely to trigger OOM, since key_size
    Signed-off-by: Alexei Starovoitov
    Signed-off-by: David S. Miller

    Alexei Starovoitov
     

02 Dec, 2015

1 commit

  • During my own review, but also as reported by Dmitry's syzkaller [1],
    it has been noticed that we trigger a heap out-of-bounds access on eBPF
    array maps when updating elements. This happens with each map whose
    map->value_size (specified at map creation time) is not a multiple of
    8 bytes.

    In array_map_alloc(), elem_size is round_up(attr->value_size, 8) and
    used to align array map slots for faster access. However, in function
    array_map_update_elem(), we update the element as ...

    memcpy(array->value + array->elem_size * index, value, array->elem_size);

    ... where we access 'value' out-of-bounds, since it was allocated from
    map_update_elem() from syscall side as kmalloc(map->value_size, GFP_USER)
    and later on copied through copy_from_user(value, uvalue, map->value_size).
    Thus, up to 7 bytes, we can access out-of-bounds.

    Same could happen from within an eBPF program, where in worst case we
    access beyond an eBPF program's designated stack.

    Since 1be7f75d1668 ("bpf: enable non-root eBPF programs") didn't hit an
    official release yet, it only affects privileged users.

    In case of array_map_lookup_elem(), the verifier prevents eBPF programs
    from accessing beyond map->value_size through check_map_access(). Also
    from syscall side map_lookup_elem() only copies map->value_size back to
    user, so nothing could leak.

    [1] http://github.com/google/syzkaller

    Fixes: 28fbcfa08d8e ("bpf: add array type of eBPF maps")
    Reported-by: Dmitry Vyukov
    Signed-off-by: Daniel Borkmann
    Acked-by: Alexei Starovoitov
    Signed-off-by: David S. Miller

    Daniel Borkmann
     

26 Nov, 2015

1 commit

  • Currently, when having map file descriptors pointing to program arrays,
    there's still the issue that we unconditionally flush program array
    contents via bpf_fd_array_map_clear() in bpf_map_release(). This happens
    when such a file descriptor is released and is independent of the map's
    refcount.

    Having this flush independent of the refcount is for a reason: there
    can be arbitrary complex dependency chains among tail calls, also circular
    ones (direct or indirect, nesting limit determined during runtime), and
    we need to make sure that the map drops all references to eBPF programs
    it holds, so that the map's refcount can eventually drop to zero and
    initiate its freeing. Btw, a walk of the whole dependency graph would
    not be possible for various reasons, one being complexity and another
    one inconsistency, i.e. new programs can be added to parts of the graph
    at any time, so there's no guaranteed consistent state for the time of
    such a walk.

    Now, the program array pinning itself works, but the issue is that each
    derived file descriptor on close would nevertheless call unconditionally
    into bpf_fd_array_map_clear(). Instead, keep track of users and postpone
    this flush until the last reference to a user is dropped. As this only
    concerns a subset of references (f.e. a prog array could hold a program
    that itself has reference on the prog array holding it, etc), we need to
    track them separately.

    Short analysis on the refcounting: on map creation time usercnt will be
    one, so there's no change in behaviour for bpf_map_release(), if unpinned.
    If we already fail in map_create(), we are immediately freed, and no
    file descriptor has been made public yet. In bpf_obj_pin_user(), we need
    to probe for a possible map in bpf_fd_probe_obj() already with a usercnt
    reference, so before we drop the reference on the fd with fdput().
    Therefore, if actual pinning fails, we need to drop that reference again
    in bpf_any_put(), otherwise we keep holding it. When last reference
    drops on the inode, the bpf_any_put() in bpf_evict_inode() will take
    care of dropping the usercnt again. In the bpf_obj_get_user() case, the
    bpf_any_get() will grab a reference on the usercnt, still at a time when
    we have the reference on the path. Should we later on fail to grab a new
    file descriptor, bpf_any_put() will drop it, otherwise we hold it until
    bpf_map_release() time.

    Joint work with Alexei.

    Fixes: b2197755b263 ("bpf: add support for persistent maps/progs")
    Signed-off-by: Daniel Borkmann
    Signed-off-by: Alexei Starovoitov
    Signed-off-by: David S. Miller

    Daniel Borkmann
     

21 Nov, 2015

1 commit

  • Add a handler for show_fdinfo() to be used by the anon-inodes
    backend for eBPF maps, and dump the map specification there. Not
    only useful for admins, but also it provides a minimal way to
    compare specs from ELF vs pinned object.

    Signed-off-by: Daniel Borkmann
    Acked-by: Alexei Starovoitov
    Acked-by: Hannes Frederic Sowa
    Signed-off-by: David S. Miller

    Daniel Borkmann
     

04 Nov, 2015

1 commit

  • The verbose() printer dumps the verifier state to user space, so let gcc
    take care to check calls to verbose() for (future) errors. make with W=1
    correctly suggests: function might be possible candidate for 'gnu_printf'
    format attribute [-Wsuggest-attribute=format].
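
    The annotation is the kernel's standard format-checking attribute,
    roughly:

    /* __printf(1, 2) expands to __attribute__((format(printf, 1, 2))),
     * so gcc checks the format string against the arguments.
     */
    static __printf(1, 2) void verbose(const char *fmt, ...);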

    Signed-off-by: Daniel Borkmann
    Signed-off-by: David S. Miller

    Daniel Borkmann
     

03 Nov, 2015

5 commits

  • This work adds support for "persistent" eBPF maps/programs. The term
    "persistent" is to be understood as meaning that maps/programs have a
    facility that lets them survive process termination. This is desired
    by various eBPF subsystem users.

    Just to name one example: tc classifier/action. Whenever tc parses
    the ELF object, extracts and loads maps/progs into the kernel, these
    file descriptors will be out of reach after the tc instance exits.
    So a subsequent tc invocation won't be able to access/relocate on this
    resource, and therefore maps cannot easily be shared, f.e. between the
    ingress and egress networking data path.

    The current workaround is that Unix domain sockets (UDS) need to be
    instrumented in order to pass the created eBPF map/program file
    descriptors to a third party management daemon through UDS' socket
    passing facility. This makes it a bit complicated to deploy shared
    eBPF maps or programs (programs f.e. for tail calls) among various
    processes.

    We've been brainstorming on how we could tackle this issue, and various
    approaches have been tried out so far, which can be read up on further
    in the reference below.

    The architecture we eventually ended up with is a minimal file system
    that can hold map/prog objects. The file system is a per mount namespace
    singleton, and the default mount point is /sys/fs/bpf/. Any subsequent
    mounts within a given namespace will point to the same instance. The
    file system allows for creating a user-defined directory structure.
    The objects for maps/progs are created/fetched through bpf(2) with
    two new commands (BPF_OBJ_PIN/BPF_OBJ_GET). I.e. a bpf file descriptor
    along with a pathname is being passed to bpf(2) that in turn creates
    (we call it eBPF object pinning) the file system nodes. Only the pathname
    is being passed to bpf(2) for getting a new BPF file descriptor to an
    existing node. The user can use that to access maps and progs later on,
    through bpf(2). Removal of file system nodes is being managed through
    normal VFS functions such as unlink(2), etc. The file system code is
    kept to a very minimum and can be further extended later on.
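
    A sketch of pinning and re-fetching an object via the two new
    commands (syscall wrapper, error handling and the path are
    illustrative; the attribute fields follow this series):

    union bpf_attr attr = {};

    attr.pathname = (__u64)(unsigned long)"/sys/fs/bpf/my_map";
    attr.bpf_fd   = map_fd;            /* fd obtained from BPF_MAP_CREATE */

    /* create the file system node for the map */
    syscall(__NR_bpf, BPF_OBJ_PIN, &attr, sizeof(attr));

    /* later, possibly from a different process */
    attr.bpf_fd = 0;
    int new_fd = syscall(__NR_bpf, BPF_OBJ_GET, &attr, sizeof(attr));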

    The next step I'm working on is to add dump eBPF map/prog commands
    to bpf(2), so that a specification from a given file descriptor can
    be retrieved. This can be used by things like CRIU, but applications
    can also inspect the metadata after calling BPF_OBJ_GET.

    Big thanks also to Alexei and Hannes who significantly contributed
    in the design discussion that eventually let us end up with this
    architecture here.

    Reference: https://lkml.org/lkml/2015/10/15/925
    Signed-off-by: Daniel Borkmann
    Signed-off-by: Alexei Starovoitov
    Signed-off-by: Hannes Frederic Sowa
    Signed-off-by: David S. Miller

    Daniel Borkmann
     
  • We currently have duplicated cleanup code in bpf_prog_put() and
    bpf_prog_put_rcu() cleanup paths. Back then we decided that it was
    not worth it to make it a common helper called by both, but with
    the recent addition of resource charging, we could have avoided
    the fix in commit ac00737f4e81 ("bpf: Need to call bpf_prog_uncharge_memlock
    from bpf_prog_put") if we had had only a single, common path.
    We can simplify it further by assigning aux->prog only once during
    allocation time.

    Signed-off-by: Daniel Borkmann
    Acked-by: Alexei Starovoitov
    Signed-off-by: David S. Miller

    Daniel Borkmann
     
  • Add a bpf_map_get() function that we're going to use later on, and
    align/clean the remaining helpers a bit so that they are a bit
    more consistent:

    - __bpf_map_get() and __bpf_prog_get() that both work on the fd
    struct, check whether the descriptor is eBPF and return the
    pointer to the map/prog stored in the private data.

    Also, we can return f.file->private_data directly, the function
    signature is enough of a documentation already.

    - bpf_map_get() and bpf_prog_get() that both work on u32 user fd,
    call their respective __bpf_map_get()/__bpf_prog_get() variants,
    and take a reference.

    Signed-off-by: Daniel Borkmann
    Acked-by: Alexei Starovoitov
    Signed-off-by: David S. Miller

    Daniel Borkmann
     
  • Since we're going to use anon_inode_getfd() invocations in more than just
    the current places, make a helper function for both, so that we only need
    to pass a map/prog pointer to the helper itself in order to get a fd. The
    new helpers are called bpf_map_new_fd() and bpf_prog_new_fd().

    Signed-off-by: Daniel Borkmann
    Acked-by: Alexei Starovoitov
    Signed-off-by: David S. Miller

    Daniel Borkmann
     
  • When running bpf samples on an rt kernel, the following warning is reported:

    BUG: sleeping function called from invalid context at kernel/locking/rtmutex.c:917
    in_atomic(): 1, irqs_disabled(): 128, pid: 477, name: ping
    Preemption disabled at:[] kprobe_perf_func+0x30/0x228

    CPU: 3 PID: 477 Comm: ping Not tainted 4.1.10-rt8 #4
    Hardware name: Freescale Layerscape 2085a RDB Board (DT)
    Call trace:
    [] dump_backtrace+0x0/0x128
    [] show_stack+0x20/0x30
    [] dump_stack+0x7c/0xa0
    [] ___might_sleep+0x188/0x1a0
    [] rt_spin_lock+0x28/0x40
    [] htab_map_update_elem+0x124/0x320
    [] bpf_map_update_elem+0x40/0x58
    [] __bpf_prog_run+0xd48/0x1640
    [] trace_call_bpf+0x8c/0x100
    [] kprobe_perf_func+0x30/0x228
    [] kprobe_dispatcher+0x34/0x58
    [] kprobe_handler+0x114/0x250
    [] kprobe_breakpoint_handler+0x1c/0x30
    [] brk_handler+0x88/0x98
    [] do_debug_exception+0x50/0xb8
    Exception stack(0xffff808349687460 to 0xffff808349687580)
    7460: 4ca2b600 ffff8083 4a3a7000 ffff8083 49687620 ffff8083 0069c5f8 ffff8000
    7480: 00000001 00000000 007e0628 ffff8000 496874b0 ffff8083 007e1de8 ffff8000
    74a0: 496874d0 ffff8083 0008e04c ffff8000 00000001 00000000 4ca2b600 ffff8083
    74c0: 00ba2e80 ffff8000 49687528 ffff8083 49687510 ffff8083 000e5c70 ffff8000
    74e0: 00c22348 ffff8000 00000000 ffff8083 49687510 ffff8083 000e5c74 ffff8000
    7500: 4ca2b600 ffff8083 49401800 ffff8083 00000001 00000000 00000000 00000000
    7520: 496874d0 ffff8083 00000000 00000000 00000000 00000000 00000000 00000000
    7540: 2f2e2d2c 33323130 00000000 00000000 4c944500 ffff8083 00000000 00000000
    7560: 00000000 00000000 008751e0 ffff8000 00000001 00000000 124e2d1d 00107b77

    Convert hashtab lock to raw lock to avoid such warning.
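
    The conversion is essentially the lock type plus the lock/unlock
    calls (sketch):

    /* A raw spinlock stays a real spinning lock on PREEMPT_RT, so it
     * can be taken from the atomic kprobe/bpf context shown above.
     */
    raw_spinlock_t lock;                           /* was: spinlock_t */

    raw_spin_lock_irqsave(&htab->lock, flags);     /* was: spin_lock_irqsave() */
    /* ... hash table update ... */
    raw_spin_unlock_irqrestore(&htab->lock, flags);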

    Signed-off-by: Yang Shi
    Acked-by: Daniel Borkmann
    Signed-off-by: David S. Miller

    Yang Shi
     

27 Oct, 2015

1 commit

  • Fix safety checks for bpf_perf_event_read():
    - only non-inherited events can be added to a perf_event_array map
      (do this check statically at map insertion time)
    - dynamically check that the event is local and !pmu->count
    Otherwise a buggy bpf program can cause a kernel splat.

    Also fix error path after perf_event_attrs()
    and remove redundant 'extern'.

    Fixes: 35578d798400 ("bpf: Implement function bpf_perf_event_read() that get the selected hardware PMU conuter")
    Signed-off-by: Alexei Starovoitov
    Tested-by: Wang Nan
    Signed-off-by: David S. Miller

    Alexei Starovoitov
     

22 Oct, 2015

1 commit

  • This helper is used to send raw data from an eBPF program into a
    special PERF_TYPE_SOFTWARE/PERF_COUNT_SW_BPF_OUTPUT perf_event.
    User space needs to perf_event_open() it (either for one or all cpus)
    and store the FD into a perf_event_array (similar to the
    bpf_perf_event_read() helper) before the eBPF program can send data
    into it.

    Today the programs triggered by kprobes collect the data and either
    store it into maps or print it via bpf_trace_printk(), where the
    latter is a debug facility and not suitable for streaming data. This
    new helper replaces such bpf_trace_printk() usage and allows programs
    to have a dedicated channel into user space for post-processing of
    the raw data collected.
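
    A sketch of how a kprobe program streams a record into the new perf
    event channel, loosely following the kernel samples (record layout,
    map size and index are illustrative):

    struct bpf_map_def SEC("maps") events = {
            .type        = BPF_MAP_TYPE_PERF_EVENT_ARRAY,
            .key_size    = sizeof(int),
            .value_size  = sizeof(u32),
            .max_entries = 2,
    };

    SEC("kprobe/sys_write")
    int send_event(struct pt_regs *ctx)
    {
            struct { u64 pid; u64 cookie; } data = {
                    .pid    = bpf_get_current_pid_tgid(),
                    .cookie = 0x12345678,
            };

            /* index 0: the perf event FD stored by user space at key 0 */
            bpf_perf_event_output(ctx, &events, 0, &data, sizeof(data));
            return 0;
    }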

    Signed-off-by: Alexei Starovoitov
    Signed-off-by: David S. Miller

    Alexei Starovoitov
     

16 Oct, 2015

1 commit

  • Currently, bpf_prog_uncharge_memlock() is only called from
    __prog_put_rcu() in the bpf_prog_release path. It needs to also be
    called from bpf_prog_put() to get correct accounting.

    Fixes: aaac3ba95e4c8b49 ("bpf: charge user for creation of BPF maps and programs")
    Signed-off-by: Tom Herbert
    Acked-by: Alexei Starovoitov
    Acked-by: Daniel Borkmann
    Signed-off-by: David S. Miller

    Tom Herbert
     

13 Oct, 2015

2 commits

  • Since eBPF programs and maps use kernel memory, consider it 'locked'
    memory from the user accounting point of view and charge it against the
    RLIMIT_MEMLOCK limit.
    This limit is typically set to 64 Kbytes by distros, so almost all
    bpf+tracing programs would need to increase it, since they use maps and
    the kernel charges the maximum map size upfront.
    For example, a hash map of 1024 elements will be charged as 64 Kbytes.
    It's inconvenient for current users and changes current behavior for
    root, but it is probably worth doing to be consistent between root and
    non-root.

    Similar accounting logic is done by mmap of perf_event.
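
    What bpf-using programs typically do to lift the default 64K limit
    before creating maps and programs (sketch):

    #include <sys/resource.h>

    /* The kernel now charges the maximum map size upfront, so raise
     * RLIMIT_MEMLOCK first.
     */
    struct rlimit r = { RLIM_INFINITY, RLIM_INFINITY };

    setrlimit(RLIMIT_MEMLOCK, &r);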

    Signed-off-by: Alexei Starovoitov
    Signed-off-by: David S. Miller

    Alexei Starovoitov
     
  • In order to let unprivileged users load and execute eBPF programs,
    teach the verifier to prevent pointer leaks.
    The verifier will prevent:
    - any arithmetic on pointers
    (except R10+Imm which is used to compute stack addresses)
    - comparison of pointers
    (except if (map_value_ptr == 0) ... )
    - passing pointers to helper functions
    - indirectly passing pointers in stack to helper functions
    - returning pointer from bpf program
    - storing pointers into ctx or maps

    Spill/fill of pointers into the stack is allowed, but mangling
    of pointers stored in the stack or reading them byte by byte is not.

    Within bpf programs the pointers do exist, since programs need to
    be able to access maps, pass the skb pointer to LD_ABS insns, etc.,
    but programs cannot pass such pointer values to the outside
    or obfuscate them.

    Only BPF_PROG_TYPE_SOCKET_FILTER programs are allowed for unprivileged
    users, so that socket filters (tcpdump), af_packet (QUIC acceleration)
    and the future kcm can use it.
    Tracing and tc cls/act program types still require root permissions,
    since tracing actually needs to be able to see all kernel pointers
    and tc is root-only.

    For example, the following unprivileged socket filter program is allowed:

    int bpf_prog1(struct __sk_buff *skb)
    {
            u32 index = load_byte(skb, ETH_HLEN + offsetof(struct iphdr, protocol));
            u64 *value = bpf_map_lookup_elem(&my_map, &index);

            if (value)
                    *value += skb->len;
            return 0;
    }

    but the following program is not:

    int bpf_prog1(struct __sk_buff *skb)
    {
            u32 index = load_byte(skb, ETH_HLEN + offsetof(struct iphdr, protocol));
            u64 *value = bpf_map_lookup_elem(&my_map, &index);

            if (value)
                    *value += (u64) skb;
            return 0;
    }
    since it would leak the kernel address into the map.

    Unprivileged socket filter bpf programs have access to the
    following helper functions:
    - map lookup/update/delete (but they cannot store kernel pointers into them)
    - get_random (it's already exposed to unprivileged user space)
    - get_smp_processor_id
    - tail_call into another socket filter program
    - ktime_get_ns

    The feature is controlled by the sysctl kernel.unprivileged_bpf_disabled.
    This toggle defaults to off (0), but can be set to true (1). Once true,
    bpf programs and maps cannot be accessed from unprivileged processes,
    and the toggle cannot be set back to false.
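
    A sketch of the unprivileged use case: attaching an already loaded
    socket filter program to a socket (program loading and the socket fd
    are elided):

    #include <sys/socket.h>

    /* prog_fd refers to a loaded BPF_PROG_TYPE_SOCKET_FILTER program;
     * no privileges are required, subject to the sysctl above.
     */
    setsockopt(sock, SOL_SOCKET, SO_ATTACH_BPF, &prog_fd, sizeof(prog_fd));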

    Signed-off-by: Alexei Starovoitov
    Reviewed-by: Kees Cook
    Signed-off-by: David S. Miller

    Alexei Starovoitov