01 Dec, 2016

1 commit

  • If we have a branch that looks something like this

    int foo = map->value;
    if (condition) {
            foo += blah;
    } else {
            foo = bar;
    }
    map->array[foo] = baz;

    We will incorrectly assume that the !condition branch is equal to the
    condition branch, as the register for foo will be UNKNOWN_VALUE in both
    cases. We need to adjust this logic to only treat them as equal if we
    didn't do a varlen access after we processed the !condition branch;
    otherwise we have different ranges and need to check the other branch
    as well.

    Fixes: 484611357c19 ("bpf: allow access into map value arrays")
    Reported-by: Jann Horn
    Signed-off-by: Josef Bacik
    Acked-by: Alexei Starovoitov
    Acked-by: Daniel Borkmann
    Signed-off-by: David S. Miller

    Josef Bacik
     

17 Nov, 2016

1 commit

  • I made some invalid assumptions with BPF_AND and BPF_MOD that could result in
    invalid accesses to bpf map entries. Fix this up by doing a few things:

    1) Kill BPF_MOD support. This doesn't actually get used by the compiler in real
    life and just adds extra complexity.

    2) Fix the logic for BPF_AND: don't allow AND of negative numbers and set
    the minimum value to 0 for positive ANDs (see the sketch below).

    3) Don't do operations on the ranges if they are set to the limits, as they are
    by definition undefined, and allowing arithmetic operations on those values
    could make them appear valid when they really aren't.

    This fixes the testcase provided by Jann as well as a few other theoretical
    problems.
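
    A minimal sketch of the BPF_AND range handling described in point 2,
    assuming verifier field and helper names that may not match the patch
    exactly:

    case BPF_AND:
            /* ANDing with a negative value is not tracked (range reset);
             * otherwise the result is at least 0 and at most the mask. */
            if (min_val < 0)
                    reset_reg_range(dst_reg);   /* hypothetical helper */
            else
                    dst_reg->min_value = 0;
            dst_reg->max_value = max_val;
            break;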

    Reported-by: Jann Horn
    Signed-off-by: Josef Bacik
    Acked-by: Alexei Starovoitov
    Signed-off-by: David S. Miller

    Josef Bacik
     

08 Nov, 2016

2 commits

  • In map_create(), we first find and create the map, then once that
    succeeded, we charge it to the user's RLIMIT_MEMLOCK, and then fetch
    a new anon fd through anon_inode_getfd(). The problem is, once the
    latter fails f.e. due to RLIMIT_NOFILE limit, then we only destruct
    the map via map->ops->map_free(), but without uncharging the previously
    locked memory first. That means that the user_struct allocation is
    leaked as well as the accounted RLIMIT_MEMLOCK memory not released.
    Make the label names in the fix consistent with bpf_prog_load().
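
    A minimal sketch of the corrected unwinding order in map_create(),
    assuming helper and label names in the style of the surrounding syscall
    code; the exact names in the patch may differ:

    err = bpf_map_charge_memlock(map);
    if (err)
            goto free_map_nouncharge;

    err = bpf_map_new_fd(map);
    if (err < 0)
            goto free_map;          /* uncharge before freeing the map */

    return err;

    free_map:
            bpf_map_uncharge_memlock(map);
    free_map_nouncharge:
            map->ops->map_free(map);
            return err;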

    Fixes: aaac3ba95e4c ("bpf: charge user for creation of BPF maps and programs")
    Signed-off-by: Daniel Borkmann
    Acked-by: Alexei Starovoitov
    Signed-off-by: David S. Miller

    Daniel Borkmann
     
  • Commit a6ed3ea65d98 ("bpf: restore behavior of bpf_map_update_elem")
    added an extra per-cpu reserve to the hash table map to restore old
    behaviour from pre-prealloc times. When non-prealloc is in use for a
    map, the problem is that once a hash table extra element has been
    linked into the hash-table, and the hash table is destroyed due to
    refcount dropping to zero, then htab_map_free() -> delete_all_elements()
    will walk the whole hash table and drop all elements via htab_elem_free().
    The problem is that the element from the extra reserve is first fed
    to the wrong backend allocator and eventually freed twice.

    Fixes: a6ed3ea65d98 ("bpf: restore behavior of bpf_map_update_elem")
    Reported-by: Dmitry Vyukov
    Signed-off-by: Daniel Borkmann
    Acked-by: Alexei Starovoitov
    Signed-off-by: David S. Miller

    Daniel Borkmann
     

11 Oct, 2016

1 commit

  • Pull more vfs updates from Al Viro:
    ">rename2() work from Miklos + current_time() from Deepa"

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
    fs: Replace current_fs_time() with current_time()
    fs: Replace CURRENT_TIME_SEC with current_time() for inode timestamps
    fs: Replace CURRENT_TIME with current_time() for inode timestamps
    fs: proc: Delete inode time initializations in proc_alloc_inode()
    vfs: Add current_time() api
    vfs: add note about i_op->rename changes to porting
    fs: rename "rename2" i_op to "rename"
    vfs: remove unused i_op->rename
    fs: make remaining filesystems use .rename2
    libfs: support RENAME_NOREPLACE in simple_rename()
    fs: support RENAME_NOREPLACE for local filesystems
    ncpfs: fix unused variable warning

    Linus Torvalds
     

29 Sep, 2016

1 commit

  • Suppose you have a map array value that is something like this

    struct foo {
            unsigned iter;
            int array[SOME_CONSTANT];
    };

    You can easily insert this into an array, but you cannot modify the contents of
    foo->array[] after the fact. This is because we have no way to verify we won't
    go off the end of the array at verification time. This patch provides a start
    for this work. We accomplish this by keeping track of a minimum and maximum
    value a register could hold while we're checking the code. Then at the time
    we try to do an access into a MAP_VALUE, we verify that the maximum offset
    of that access still lies within that memory region. So in practice, code
    such as this

    unsigned index = 0;

    if (foo->iter >= SOME_CONSTANT)
            foo->iter = index;
    else
            index = foo->iter++;
    foo->array[index] = bar;

    would be allowed, as we can verify that index will always be between 0 and
    SOME_CONSTANT-1. If you wish to use signed values you'll have to have an extra
    check to make sure the index isn't less than 0, or do something like index %=
    SOME_CONSTANT.
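
    A minimal sketch of the kind of check described above, with field names
    assumed for illustration: an access through a register with a tracked
    range is only allowed when the worst case still fits in the map value.

    if (reg->min_value < 0 ||
        reg->max_value + off + size > map->value_size)
            return -EACCES;     /* possible out-of-bounds map value access */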

    Signed-off-by: Josef Bacik
    Acked-by: Alexei Starovoitov
    Signed-off-by: David S. Miller

    Josef Bacik
     

28 Sep, 2016

2 commits

  • put_cpu_var takes the percpu data, not the data returned from
    get_cpu_var.

    This doesn't change the behavior.
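
    An illustration of the API contract being fixed, using a hypothetical
    per-cpu variable:

    DEFINE_PER_CPU(struct foo_storage, foo_storage);

    struct foo_storage *st = &get_cpu_var(foo_storage);
    /* ... use st ... */
    put_cpu_var(foo_storage);   /* pass the per-cpu variable, not 'st' */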

    Cc: Tejun Heo
    Cc: Alexei Starovoitov
    Signed-off-by: Shaohua Li
    Acked-by: Alexei Starovoitov
    Acked-by: Tejun Heo
    Signed-off-by: David S. Miller

    Shaohua Li
     
  • CURRENT_TIME macro is not appropriate for filesystems as it
    doesn't use the right granularity for filesystem timestamps.
    Use current_time() instead.

    CURRENT_TIME is also not y2038 safe.

    This is also in preparation for the patch that transitions
    vfs timestamps to use 64 bit time and hence make them
    y2038 safe. As part of the effort current_time() will be
    extended to do range checks. Hence, it is necessary for all
    file system timestamps to use current_time(). Also,
    current_time() will be transitioned along with vfs to be
    y2038 safe.

    Note that whenever a single call to current_time() is used
    to change timestamps in different inodes, it is because they
    share the same time granularity.
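
    An illustrative before/after of the conversion, not taken from any
    particular filesystem:

    inode->i_mtime = inode->i_ctime = CURRENT_TIME;         /* old */
    inode->i_mtime = inode->i_ctime = current_time(inode);  /* new */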

    Signed-off-by: Deepa Dinamani
    Reviewed-by: Arnd Bergmann
    Acked-by: Felipe Balbi
    Acked-by: Steven Whitehouse
    Acked-by: Ryusuke Konishi
    Acked-by: David Sterba
    Signed-off-by: Al Viro

    Deepa Dinamani
     

27 Sep, 2016

1 commit

  • This prevents future potential pointer leaks when an unprivileged eBPF
    program reads a pointer value from its context. Even if
    is_valid_access() returns a pointer type, the eBPF verifier replaces it
    with UNKNOWN_VALUE, and the register value that contains a kernel
    address is then allowed to leak. Moreover, this fix allows unprivileged
    eBPF programs to use functions with (legitimate) pointer arguments.

    Not an issue currently since reg_type is only set for PTR_TO_PACKET or
    PTR_TO_PACKET_END in XDP and TC programs that can only be loaded as
    privileged. For now, the only unprivileged eBPF program allowed is for
    socket filtering and all the types from its context are UNKNOWN_VALUE.
    However, this fix is important for future unprivileged eBPF programs
    which could use pointers in their context.

    Signed-off-by: Mickaël Salaün
    Cc: Alexei Starovoitov
    Cc: Daniel Borkmann
    Acked-by: Daniel Borkmann
    Acked-by: Alexei Starovoitov
    Signed-off-by: David S. Miller

    Mickaël Salaün
     

22 Sep, 2016

4 commits

  • When running as a parser, interpret BPF_LD | BPF_IMM | BPF_DW
    instructions as loading CONST_IMM with the value stored
    in imm. The verifier will continue not recognizing those
    due to concerns about search space/program complexity
    increase.

    Signed-off-by: Jakub Kicinski
    Acked-by: Alexei Starovoitov
    Acked-by: Daniel Borkmann
    Signed-off-by: David S. Miller

    Jakub Kicinski
     
  • Advanced JIT compilers and translators may want to use
    eBPF verifier as a base for parsers or to perform custom
    checks and validations.

    Add the ability for external users to invoke the verifier
    and provide callbacks to be invoked for every instruction
    checked. For now only the most basic callback, for
    per-instruction pre-interpretation checks, is added. More
    advanced users may also like to have a per-instruction post
    callback and a state comparison callback.
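
    A rough sketch of the external-analyzer hook being added; the exact
    structure and entry point names in the patch may differ:

    struct bpf_ext_analyzer_ops {
            /* called before the verifier interprets each instruction */
            int (*insn_hook)(struct bpf_verifier_env *env,
                             int insn_idx, int prev_insn_idx);
    };

    int bpf_analyzer(struct bpf_prog *prog,
                     const struct bpf_ext_analyzer_ops *ops, void *priv);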

    Signed-off-by: Jakub Kicinski
    Acked-by: Alexei Starovoitov
    Signed-off-by: David S. Miller

    Jakub Kicinski
     
  • Move verifier's internal structures to a header file and
    prefix their names with bpf_ to avoid potential namespace
    conflicts. Those structures will soon be used by external
    analyzers.

    Signed-off-by: Jakub Kicinski
    Acked-by: Alexei Starovoitov
    Acked-by: Daniel Borkmann
    Signed-off-by: David S. Miller

    Jakub Kicinski
     
  • Storing state in reserved fields of instructions makes
    it impossible to run verifier on programs already
    marked as read-only. Allocate and use an array of
    per-instruction state instead.

    While touching the error path rename and move existing
    jump target.

    Suggested-by: Alexei Starovoitov
    Signed-off-by: Jakub Kicinski
    Acked-by: Alexei Starovoitov
    Acked-by: Daniel Borkmann
    Signed-off-by: David S. Miller

    Jakub Kicinski
     

21 Sep, 2016

2 commits

  • This work implements direct packet access for helpers and direct packet
    write in a similar fashion as already available for XDP types via commits
    4acf6c0b84c9 ("bpf: enable direct packet data write for xdp progs") and
    6841de8b0d03 ("bpf: allow helpers access the packet directly"), and as a
    complementary feature to the already available direct packet read for tc
    (cls/act) programs.

    For enabling this, we need to introduce two helpers, bpf_skb_pull_data()
    and bpf_csum_update(). The first is generally needed for both read and
    write, because they would otherwise be limited to the current linear
    skb head. Usually, when the data_end test fails, programs just bail out,
    or, in the direct read case, use bpf_skb_load_bytes() as an alternative
    to overcome this limitation. If such data sits in non-linear parts, we
    can just pull them in once with the new helper, retest and eventually
    access them.
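
    A hypothetical tc (cls/act) program snippet illustrating the
    pull-and-retest pattern described above, with skb being the
    struct __sk_buff context; the offset of 54 bytes is made up:

    void *data     = (void *)(long)skb->data;
    void *data_end = (void *)(long)skb->data_end;

    if (data + 54 > data_end) {
            /* needed bytes may sit in the non-linear part: pull them in */
            if (bpf_skb_pull_data(skb, 54))
                    return TC_ACT_OK;
            /* reload and retest, since skb->data may have changed */
            data     = (void *)(long)skb->data;
            data_end = (void *)(long)skb->data_end;
            if (data + 54 > data_end)
                    return TC_ACT_OK;
    }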

    At the same time, this also makes sure the skb is uncloned, which is, of
    course, a necessary condition for direct write. As this needs to be an
    invariant for the write part only, the verifier detects writes and adds
    a prologue that is calling bpf_skb_pull_data() to effectively unclone the
    skb from the very beginning in case it is indeed cloned. The heuristic
    makes use of a similar trick that was done in 233577a22089 ("net: filter:
    constify detection of pkt_type_offset"). This comes at zero cost for other
    programs that do not use the direct write feature. Should a program use
    this feature only sparsely and has read access for the most parts with,
    for example, drop return codes, then such write action can be delegated
    to a tail called program for mitigating this cost of potential uncloning
    to a late point in time where it would have been paid similarly with the
    bpf_skb_store_bytes() as well. The advantage of direct write is that
    the writes are inlined, whereas the helper cannot make any length
    assumptions and thus needs to generate a call to memcpy() even for
    small sizes; the cost of the helper call itself with its sanity checks
    is also avoided. Plus, when
    direct read is already used, we don't need to cache or perform rechecks
    on the data boundaries (due to verifier invalidating previous checks for
    helpers that change skb->data), so more complex programs using rewrites
    can benefit from switching to direct read plus write.

    For direct packet access to helpers, we save the otherwise needed copy into
    a temp struct sitting on stack memory when use-case allows. Both facilities
    are enabled via may_access_direct_pkt_data() in verifier. For now, we limit
    this to map helpers and csum_diff, and can successively enable other helpers
    where we find it makes sense. Helpers that definitely cannot be allowed for
    this are those part of bpf_helper_changes_skb_data() since they can change
    underlying data, and those that write into memory as this could happen for
    packet typed args when still cloned. The bpf_csum_update() helper accounts
    for the fact that we need to fixup checksum_complete when using direct write
    instead of bpf_skb_store_bytes(), meaning the programs can use available
    helpers like bpf_csum_diff(), and implement csum_add(), csum_sub(),
    csum_block_add(), csum_block_sub() equivalents in eBPF together with the
    new helper. A usage example will be provided for iproute2's examples/bpf/
    directory.

    Signed-off-by: Daniel Borkmann
    Acked-by: Alexei Starovoitov
    Signed-off-by: David S. Miller

    Daniel Borkmann
     
  • Current contract for the following two helper argument types is:

    * ARG_CONST_STACK_SIZE: passed argument pair must be (ptr, >0).
    * ARG_CONST_STACK_SIZE_OR_ZERO: passed argument pair can be either
    (NULL, 0) or (ptr, >0).

    With 6841de8b0d03 ("bpf: allow helpers access the packet directly"), we can
    pass also raw packet data to helpers, so depending on the argument type
    being PTR_TO_PACKET, we now either assert memory via check_packet_access()
    or check_stack_boundary(). As a result, the tests in check_packet_access()
    currently allow more than intended with regards to reg->imm.

    Back in 969bf05eb3ce ("bpf: direct packet access"), check_packet_access()
    was fine to ignore size argument since in check_mem_access() size was
    bpf_size_to_bytes() derived and prior to the call to check_packet_access()
    guaranteed to be larger than zero.

    However, for the above two argument types, it currently means, we can have
    a
    Acked-by: Alexei Starovoitov
    Signed-off-by: David S. Miller

    Daniel Borkmann
     

10 Sep, 2016

2 commits

  • This work adds BPF_CALL_() macros and converts all the eBPF helper functions
    to use them, in a similar fashion like we do with SYSCALL_DEFINE() macros
    that are used today. Motivation for this is to hide all the register handling
    and all necessary casts from the user, so that it is done automatically in the
    background when adding a BPF_CALL_() call.

    This makes current helpers easier to review, eases to write future helpers,
    avoids getting the casting mess wrong, and allows for extending all helpers at
    once (f.e. build time checks, etc). It also helps detecting more easily in
    code reviews that unused registers are not instrumented in the code by accident,
    breaking compatibility with existing programs.

    BPF_CALL_() internals are quite similar to SYSCALL_DEFINE() ones with some
    fundamental differences, for example, for generating the actual helper function
    that carries all u64 regs, we need to fill unused regs, so that we always end up
    with 5 u64 regs as an argument.
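
    For illustration, this is roughly how a simple helper looks after the
    conversion (abbreviated from the map lookup helper):

    BPF_CALL_2(bpf_map_lookup_elem, struct bpf_map *, map, void *, key)
    {
            WARN_ON_ONCE(!rcu_read_lock_held());
            return (unsigned long) map->ops->map_lookup_elem(map, key);
    }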

    I reviewed several 0-5 generated BPF_CALL_() variants of the .i results and
    they look all as expected. No sparse issue spotted. We let this also sit for a
    few days with Fengguang's kbuild test robot, and there were no issues seen. On
    s390, it barked on the "uses dynamic stack allocation" notice, which is an old
    one from bpf_perf_event_output{,_tp}() reappearing here due to the conversion
    to the call wrapper, just telling that the perf raw record/frag sits on stack
    (gcc with s390's -mwarn-dynamicstack), but that's all. Did various runtime tests
    and they were fine as well. All eBPF helpers are now converted to use these
    macros, getting rid of a good chunk of all the raw castings.

    Signed-off-by: Daniel Borkmann
    Acked-by: Alexei Starovoitov
    Signed-off-by: David S. Miller

    Daniel Borkmann
     
  • Some minor misc cleanups, f.e. use sizeof(__u32) instead of hardcoding
    it, and in __bpf_skb_max_len(), I missed that we always have skb->dev
    valid anyway, so we can drop the unneeded test for dev; a few more other
    misc bits are addressed here as well.

    Signed-off-by: Daniel Borkmann
    Acked-by: Alexei Starovoitov
    Signed-off-by: David S. Miller

    Daniel Borkmann
     

09 Sep, 2016

1 commit

  • LLVM can generate code that tests for direct packet access via
    skb->data/data_end in a way that currently gets rejected by the
    verifier, example:

    [...]
    7: (61) r3 = *(u32 *)(r6 +80)
    8: (61) r9 = *(u32 *)(r6 +76)
    9: (bf) r2 = r9
    10: (07) r2 += 54
    11: (3d) if r3 >= r2 goto pc+12
    R1=inv R2=pkt(id=0,off=54,r=0) R3=pkt_end R4=inv R6=ctx
    R9=pkt(id=0,off=0,r=0) R10=fp
    12: (18) r4 = 0xffffff7a
    14: (05) goto pc+430
    [...]

    from 11 to 24: R1=inv R2=pkt(id=0,off=54,r=0) R3=pkt_end R4=inv
    R6=ctx R9=pkt(id=0,off=0,r=0) R10=fp
    24: (7b) *(u64 *)(r10 -40) = r1
    25: (b7) r1 = 0
    26: (63) *(u32 *)(r6 +56) = r1
    27: (b7) r2 = 40
    28: (71) r8 = *(u8 *)(r9 +20)
    invalid access to packet, off=20 size=1, R9(id=0,off=0,r=0)

    The reason why this gets rejected despite a proper test is that we
    currently call find_good_pkt_pointers() only in case where we detect
    tests like rX > pkt_end, where rX is of type pkt(id=Y,off=Z,r=0) and
    derived, for example, from a register of type pkt(id=Y,off=0,r=0)
    pointing to skb->data. find_good_pkt_pointers() then fills the range
    in the current branch to pkt(id=Y,off=0,r=Z) on success.

    For above case, we need to extend that to recognize pkt_end >= rX
    pattern and mark the other branch that is taken on success with the
    appropriate pkt(id=Y,off=0,r=Z) type via find_good_pkt_pointers().
    Since eBPF operates on BPF_JGT (>) and BPF_JGE (>=), these are the
    only two practical options to test for from what LLVM could have
    generated, since there's no such thing as BPF_JLT (<) or BPF_JLE (<=).
    With the fix applied, the same program is now accepted:

    [...]
    11: (3d) if r3 >= r2 goto pc+12
    R1=inv R2=pkt(id=0,off=54,r=0) R3=pkt_end R4=inv R6=ctx
    R9=pkt(id=0,off=0,r=0) R10=fp
    12: (18) r4 = 0xffffff7a
    14: (05) goto pc+430
    [...]

    from 11 to 24: R1=inv R2=pkt(id=0,off=54,r=54) R3=pkt_end R4=inv
    R6=ctx R9=pkt(id=0,off=0,r=54) R10=fp
    24: (7b) *(u64 *)(r10 -40) = r1
    25: (b7) r1 = 0
    26: (63) *(u32 *)(r6 +56) = r1
    27: (b7) r2 = 40
    28: (71) r8 = *(u8 *)(r9 +20)
    29: (bf) r1 = r8
    30: (25) if r8 > 0x3c goto pc+47
    R1=inv56 R2=imm40 R3=pkt_end R4=inv R6=ctx R8=inv56
    R9=pkt(id=0,off=0,r=54) R10=fp
    31: (b7) r1 = 1
    [...]

    Verifier test cases are also added in this work, one that demonstrates
    the mentioned example here and one that tries a bad packet access for
    the current/fall-through branch (the one with types pkt(id=X,off=Y,r=0),
    pkt(id=X,off=0,r=0)), then a case with good and bad accesses, and two
    with both test variants (>, >=).

    Fixes: 969bf05eb3ce ("bpf: direct packet access")
    Signed-off-by: Daniel Borkmann
    Acked-by: Alexei Starovoitov
    Signed-off-by: David S. Miller

    Daniel Borkmann
     

03 Sep, 2016

2 commits

  • Make sure that BPF_PROG_TYPE_PERF_EVENT programs only use
    preallocated hash maps, since doing memory allocation
    in overflow_handler can crash depending on where nmi got triggered.

    Signed-off-by: Alexei Starovoitov
    Acked-by: Daniel Borkmann
    Signed-off-by: David S. Miller

    Alexei Starovoitov
     
  • The verifier supported only 4-byte metafields in
    struct __sk_buff and struct xdp_md. The metafields in upcoming
    struct bpf_perf_event are 8-byte to match register width in struct pt_regs.
    Teach verifier to recognize 8-byte metafield access.
    The patch doesn't affect safety of sockets and xdp programs.
    They check for 4-byte only ctx access before these conditions are hit.

    Signed-off-by: Alexei Starovoitov
    Acked-by: Daniel Borkmann
    Signed-off-by: David S. Miller

    Alexei Starovoitov
     

13 Aug, 2016

3 commits

  • The helper functions like bpf_map_lookup_elem(map, key) were only
    allowing 'key' to point to the initialized stack area.
    That is causing performance degradation when programs need to process
    millions of packets per second and need to copy contents of the packet
    into the stack just to pass the stack pointer into the lookup() function.
    Allow such helpers to read from the packet directly.
    All helpers that expect ARG_PTR_TO_MAP_KEY, ARG_PTR_TO_MAP_VALUE,
    ARG_PTR_TO_STACK assume a byte aligned pointer, so there are no alignment
    concerns; we only need to check that the helper will not be accessing
    beyond the packet range verified by the prior 'if (ptr < data_end)'
    condition.
    For now allow this feature for XDP programs only. Later it can be
    relaxed for the clsact programs as well.
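
    A hypothetical XDP snippet showing the pattern this enables, with ctx
    being the struct xdp_md context: four bytes of packet data are used
    directly as the lookup key instead of being copied onto the stack first
    (map name and offset are made up):

    void *data     = (void *)(long)ctx->data;
    void *data_end = (void *)(long)ctx->data_end;

    if (data + 14 + 4 > data_end)
            return XDP_PASS;

    /* key points straight into the packet, after the Ethernet header */
    long *value = bpf_map_lookup_elem(&counters, data + 14);
    if (value)
            __sync_fetch_and_add(value, 1);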

    Signed-off-by: Alexei Starovoitov
    Acked-by: Daniel Borkmann
    Signed-off-by: David S. Miller

    Alexei Starovoitov
     
  • While hashing out BPF's current_task_under_cgroup helper bits, it came
    to discussion that the skb_in_cgroup helper name was suboptimally chosen.

    Tejun says:

    So, I think in_cgroup should mean that the object is in that
    particular cgroup while under_cgroup in the subhierarchy of that
    cgroup. Let's rename the other subhierarchy test to under too. I
    think that'd be a lot less confusing going forward.

    [...]

    It's more intuitive and gives us the room to implement the real
    "in" test if ever necessary in the future.

    Since this touches uapi bits, we need to change this as long as v4.8
    is not yet officially released. Thus, change the helper enum and rename
    related bits.

    Fixes: 4a482f34afcc ("cgroup: bpf: Add bpf_skb_in_cgroup_proto")
    Reference: http://patchwork.ozlabs.org/patch/658500/
    Suggested-by: Sargun Dhillon
    Suggested-by: Tejun Heo
    Signed-off-by: Daniel Borkmann
    Acked-by: Alexei Starovoitov

    Daniel Borkmann
     
  • This adds a bpf helper that's similar to the skb_in_cgroup helper to check
    whether the probe is currently executing in the context of a specific
    subset of the cgroupsv2 hierarchy. It does this based on membership test
    for a cgroup arraymap. It is invalid to call this in an interrupt, and
    it'll return an error. The helper is primarily to be used in debugging
    activities for containers, where you may have multiple programs running in
    a given top-level "container".

    Signed-off-by: Sargun Dhillon
    Cc: Alexei Starovoitov
    Cc: Daniel Borkmann
    Cc: Tejun Heo
    Acked-by: Tejun Heo
    Acked-by: Alexei Starovoitov
    Acked-by: Daniel Borkmann
    Signed-off-by: David S. Miller

    Sargun Dhillon
     

07 Aug, 2016

1 commit

  • The introduction of pre-allocated hash elements inadvertently broke
    the behavior of bpf hash maps where users expected to call
    bpf_map_update_elem() without considering that the map can be full.
    Some programs do:
    old_value = bpf_map_lookup_elem(map, key);
    if (old_value) {
            ... prepare new_value on stack ...
            bpf_map_update_elem(map, key, new_value);
    }
    Before pre-alloc the update() for existing element would work even
    in 'map full' condition. Restore this behavior.

    The above program could have updated old_value in place instead of
    update() which would be faster and most programs use that approach,
    but sometimes the values are large and the programs use update()
    helper to do atomic replacement of the element.
    Note we cannot simply update element's value in-place like percpu
    hash map does and have to allocate extra num_possible_cpu elements
    and use this extra reserve when the map is full.

    Fixes: 6c9059817432 ("bpf: pre-allocate hash map elements")
    Signed-off-by: Alexei Starovoitov
    Signed-off-by: David S. Miller

    Alexei Starovoitov
     

04 Aug, 2016

1 commit

  • Using per-register incrementing ID can lead to
    find_good_pkt_pointers() confusing registers which
    have completely different values. Consider example:

    0: (bf) r6 = r1
    1: (61) r8 = *(u32 *)(r6 +76)
    2: (61) r0 = *(u32 *)(r6 +80)
    3: (bf) r7 = r8
    4: (07) r8 += 32
    5: (2d) if r8 > r0 goto pc+9
    R0=pkt_end R1=ctx R6=ctx R7=pkt(id=0,off=0,r=32) R8=pkt(id=0,off=32,r=32) R10=fp
    6: (bf) r8 = r7
    7: (bf) r9 = r7
    8: (71) r1 = *(u8 *)(r7 +0)
    9: (0f) r8 += r1
    10: (71) r1 = *(u8 *)(r7 +1)
    11: (0f) r9 += r1
    12: (07) r8 += 32
    13: (2d) if r8 > r0 goto pc+1
    R0=pkt_end R1=inv56 R6=ctx R7=pkt(id=0,off=0,r=32) R8=pkt(id=1,off=32,r=32) R9=pkt(id=1,off=0,r=32) R10=fp
    14: (71) r1 = *(u8 *)(r9 +16)
    15: (b7) r7 = 0
    16: (bf) r0 = r7
    17: (95) exit

    We need to get a UNKNOWN_VALUE with imm to force id
    generation so lines 0-5 make r7 a valid packet pointer.
    We then read two different bytes from the packet and
    add them to copies of the constructed packet pointer.
    r8 (line 9) and r9 (line 11) will get the same id of 1,
    independently. When either of them is validated (line
    13) - find_good_pkt_pointers() will also mark the other
    as safe. This leads to access on line 14 being mistakenly
    considered safe.

    Fixes: 969bf05eb3ce ("bpf: direct packet access")
    Signed-off-by: Jakub Kicinski
    Acked-by: Alexei Starovoitov
    Acked-by: Daniel Borkmann
    Signed-off-by: David S. Miller

    Jakub Kicinski
     

28 Jul, 2016

1 commit

  • Pull networking updates from David Miller:

    1) Unified UDP encapsulation offload methods for drivers, from
    Alexander Duyck.

    2) Make DSA binding more sane, from Andrew Lunn.

    3) Support QCA9888 chips in ath10k, from Anilkumar Kolli.

    4) Several workqueue usage cleanups, from Bhaktipriya Shridhar.

    5) Add XDP (eXpress Data Path), essentially running BPF programs on RX
    packets as soon as the device sees them, with the option to mirror
    the packet on TX via the same interface. From Brenden Blanco and
    others.

    6) Allow qdisc/class stats dumps to run lockless, from Eric Dumazet.

    7) Add VLAN support to b53 and bcm_sf2, from Florian Fainelli.

    8) Simplify netlink conntrack entry layout, from Florian Westphal.

    9) Add ipv4 forwarding support to mlxsw spectrum driver, from Ido
    Schimmel, Yotam Gigi, and Jiri Pirko.

    10) Add SKB array infrastructure and convert tun and macvtap over to it.
    From Michael S Tsirkin and Jason Wang.

    11) Support qdisc packet injection in pktgen, from John Fastabend.

    12) Add neighbour monitoring framework to TIPC, from Jon Paul Maloy.

    13) Add NV congestion control support to TCP, from Lawrence Brakmo.

    14) Add GSO support to SCTP, from Marcelo Ricardo Leitner.

    15) Allow GRO and RPS to function on macsec devices, from Paolo Abeni.

    16) Support MPLS over IPV4, from Simon Horman.

    * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1622 commits)
    xgene: Fix build warning with ACPI disabled.
    be2net: perform temperature query in adapter regardless of its interface state
    l2tp: Correctly return -EBADF from pppol2tp_getname.
    net/mlx5_core/health: Remove deprecated create_singlethread_workqueue
    net: ipmr/ip6mr: update lastuse on entry change
    macsec: ensure rx_sa is set when validation is disabled
    tipc: dump monitor attributes
    tipc: add a function to get the bearer name
    tipc: get monitor threshold for the cluster
    tipc: make cluster size threshold for monitoring configurable
    tipc: introduce constants for tipc address validation
    net: neigh: disallow transition to NUD_STALE if lladdr is unchanged in neigh_update()
    MAINTAINERS: xgene: Add driver and documentation path
    Documentation: dtb: xgene: Add MDIO node
    dtb: xgene: Add MDIO node
    drivers: net: xgene: ethtool: Use phy_ethtool_gset and sset
    drivers: net: xgene: Use exported functions
    drivers: net: xgene: Enable MDIO driver
    drivers: net: xgene: Add backward compatibility
    drivers: net: phy: xgene: Add MDIO driver
    ...

    Linus Torvalds
     

20 Jul, 2016

3 commits

  • For forwarding to be effective, XDP programs should be allowed to
    rewrite packet data.

    This requires that the drivers supporting XDP must all map the packet
    memory as TODEVICE or BIDIRECTIONAL before invoking the program.

    Signed-off-by: Brenden Blanco
    Acked-by: Alexei Starovoitov
    Signed-off-by: David S. Miller

    Brenden Blanco
     
  • Add a new bpf prog type that is intended to run in early stages of the
    packet rx path. Only minimal packet metadata will be available, hence a
    new context type, struct xdp_md, is exposed to userspace. So far only
    the packet start and end pointers are exposed, and only in read mode.

    An XDP program must return one of the well known enum values, all other
    return codes are reserved for future use. Unfortunately, this
    restriction is hard to enforce at verification time, so take the
    approach of warning at runtime when such programs are encountered. Out
    of bounds return codes should alias to XDP_ABORTED.
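
    A minimal sketch of a program for the new type, assuming the usual
    SEC() section annotation from the sample programs:

    #include <linux/bpf.h>

    #define SEC(name) __attribute__((section(name), used))

    SEC("xdp")
    int xdp_drop_all(struct xdp_md *ctx)
    {
            void *data     = (void *)(long)ctx->data;
            void *data_end = (void *)(long)ctx->data_end;

            /* only packet start/end are exposed so far, read-only */
            if (data + 14 > data_end)
                    return XDP_ABORTED;

            return XDP_DROP;   /* must be one of the well known enum values */
    }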

    Signed-off-by: Brenden Blanco
    Acked-by: Alexei Starovoitov
    Signed-off-by: David S. Miller

    Brenden Blanco
     
  • A subsystem may need to store many copies of a bpf program, each
    deserving its own reference. Rather than requiring the caller to loop
    one by one (with possible mid-loop failure), add a bulk bpf_prog_add
    api.
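
    The assumed shape of the bulk API: take 'i' references at once and
    return the program, or an error pointer if the reference count would
    overflow. A driver can then do, for example:

    struct bpf_prog *bpf_prog_add(struct bpf_prog *prog, int i);

    prog = bpf_prog_add(prog, nr_rx_rings);     /* one ref per RX ring */
    if (IS_ERR(prog))
            return PTR_ERR(prog);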

    Signed-off-by: Brenden Blanco
    Acked-by: Alexei Starovoitov
    Signed-off-by: David S. Miller

    Brenden Blanco
     

17 Jul, 2016

1 commit

  • Should have been obvious: this is only called from the bpf() syscall via
    map_update_elem(), which calls bpf_fd_array_map_update_elem() under the
    RCU read lock, and thus this must also use GFP_ATOMIC, of course.

    Fixes: 3b1efb196eee ("bpf, maps: flush own entries on perf map release")
    Signed-off-by: Daniel Borkmann
    Acked-by: Alexei Starovoitov
    Signed-off-by: David S. Miller

    Daniel Borkmann
     

16 Jul, 2016

1 commit

  • This work addresses a couple of issues bpf_skb_event_output()
    helper currently has: i) We need two copies instead of just a
    single one for the skb data when it should be part of a sample.
    The data can be non-linear and thus needs to be extracted via
    bpf_skb_load_bytes() helper first, and then copied once again
    into the ring buffer slot. ii) Since bpf_skb_load_bytes()
    currently needs to be used first, the helper needs to see a
    constant size on the passed stack buffer to make sure BPF
    verifier can do sanity checks on it during verification time.
    Thus, just passing skb->len (or any other non-constant value)
    wouldn't work, but changing bpf_skb_load_bytes() is also not
    the proper solution, since the two copies are generally still
    needed. iii) bpf_skb_load_bytes() is just for rather small
    buffers like headers, since they need to sit on the limited
    BPF stack anyway. Instead of working around in bpf_skb_load_bytes(),
    this work improves the bpf_skb_event_output() helper to address
    all 3 at once.

    We can make use of the passed in skb context that we have in
    the helper anyway, and use some of the reserved flag bits as
    a length argument. The helper will use the new __output_custom()
    facility from perf side with bpf_skb_copy() as callback helper
    to walk and extract the data. It will pass the data for setup
    to bpf_event_output(), which generates and pushes the raw record
    with an additional frag part. The linear data used in the first
    frag of the record serves as programmatically defined meta data
    passed along with the appended sample.
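
    A hypothetical use of the extended helper from a tc program: the upper
    flag bits carry how many bytes of skb payload to append to the sample
    (map, meta struct, and constant names as assumed here):

    __u64 flags = BPF_F_CURRENT_CPU |
                  ((__u64)skb->len << BPF_F_CTXLEN_SHIFT);

    bpf_perf_event_output(skb, &events, flags, &meta, sizeof(meta));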

    Signed-off-by: Daniel Borkmann
    Acked-by: Alexei Starovoitov
    Signed-off-by: David S. Miller

    Daniel Borkmann
     

12 Jul, 2016

1 commit

  • The Kconfig currently controlling compilation of this code is:

    init/Kconfig:config BPF_SYSCALL
    init/Kconfig: bool "Enable bpf() system call"

    ...meaning that it currently is not being built as a module by anyone.

    Let's remove the couple of traces of modular infrastructure use, so that
    when reading the driver there is no doubt it is builtin-only.

    Note that MODULE_ALIAS is a no-op for non-modular code.

    We replace module.h with init.h since the file does use __init.

    Cc: Alexei Starovoitov
    Cc: netdev@vger.kernel.org
    Signed-off-by: Paul Gortmaker
    Acked-by: Daniel Borkmann
    Signed-off-by: David S. Miller

    Paul Gortmaker
     

02 Jul, 2016

4 commits

  • Adds a bpf helper, bpf_skb_in_cgroup, to decide if a skb->sk
    belongs to a descendant of a cgroup2. It is similar to the
    feature added in netfilter:
    commit c38c4597e4bf ("netfilter: implement xt_cgroup cgroup2 path match")

    The user is expected to populate a BPF_MAP_TYPE_CGROUP_ARRAY
    which will be used by bpf_skb_in_cgroup().

    Modifications to the bpf verifier ensure that BPF_MAP_TYPE_CGROUP_ARRAY
    and bpf_skb_in_cgroup() are always used together.
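
    A hypothetical tc classifier snippet using the new helper, with the map
    and index made up for illustration:

    if (bpf_skb_in_cgroup(skb, &cgroup_map, 0) == 1)
            return TC_ACT_OK;   /* skb->sk is in a descendant of the cgroup */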

    Signed-off-by: Martin KaFai Lau
    Cc: Alexei Starovoitov
    Cc: Daniel Borkmann
    Cc: Tejun Heo
    Acked-by: Alexei Starovoitov
    Signed-off-by: David S. Miller

    Martin KaFai Lau
     
  • Add a BPF_MAP_TYPE_CGROUP_ARRAY and its bpf_map_ops's implementations.
    To update an element, the caller is expected to obtain a cgroup2 backed
    fd by open(cgroup2_dir) and then update the array with that fd.
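
    A hypothetical userspace sequence for populating the array, using a
    libbpf-style wrapper around the bpf(2) syscall; the cgroup path is made
    up:

    int cg_fd = open("/sys/fs/cgroup/unified/mygroup", O_RDONLY);
    __u32 idx = 0;

    bpf_map_update_elem(map_fd, &idx, &cg_fd, BPF_ANY);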

    Signed-off-by: Martin KaFai Lau
    Cc: Alexei Starovoitov
    Cc: Daniel Borkmann
    Cc: Tejun Heo
    Acked-by: Alexei Starovoitov
    Signed-off-by: David S. Miller

    Martin KaFai Lau
     
  • Since bpf_prog_get() and program type check is used in a couple of places,
    refactor this into a small helper function that we can make use of. Since
    the non RO prog->aux part is not used in performance critical paths and a
    program destruction via RCU is rather unlikely when doing the put, we
    shouldn't have an issue just doing the bpf_prog_get() + prog->type != type
    check, but actually not taking the ref at all (due to being in fdget() /
    fdput() section of the bpf fd) is even cleaner and makes the diff smaller
    as well, so just go for that. Callsites are changed to make use of the new
    helper where possible.

    Signed-off-by: Daniel Borkmann
    Acked-by: Alexei Starovoitov
    Signed-off-by: David S. Miller

    Daniel Borkmann
     
  • Jann Horn reported following analysis that could potentially result
    in a very hard to trigger (if not impossible) UAF race, to quote his
    event timeline:

    - Set up a process with threads T1, T2 and T3
    - Let T1 set up a socket filter F1 that invokes another filter F2
    through a BPF map [tail call]
    - Let T1 trigger the socket filter via a unix domain socket write,
    don't wait for completion
    - Let T2 call PERF_EVENT_IOC_SET_BPF with F2, don't wait for completion
    - Now T2 should be behind bpf_prog_get(), but before bpf_prog_put()
    - Let T3 close the file descriptor for F2, dropping the reference
    count of F2 to 2
    - At this point, T1 should have looked up F2 from the map, but not
    finished executing it
    - Let T3 remove F2 from the BPF map, dropping the reference count of
    F2 to 1
    - Now T2 should call bpf_prog_put() (wrong BPF program type), dropping
    the reference count of F2 to 0 and scheduling bpf_prog_free_deferred()
    via schedule_work()
    - At this point, the BPF program could be freed
    - BPF execution is still running in a freed BPF program

    While at PERF_EVENT_IOC_SET_BPF time it's only guaranteed that the perf
    event fd we're doing the syscall on doesn't disappear from underneath us
    for whole syscall time, it may not be the case for the bpf fd used as
    an argument only after we did the put. It needs to be a valid fd pointing
    to a BPF program at the time of the call to make the bpf_prog_get() and
    while T2 gets preempted, F2 must have dropped reference to 1 on the other
    CPU. The fput() from the close() in T3 should also add additional delay
    to the reference drop via exit_task_work() when bpf_prog_release() gets
    called as well as scheduling bpf_prog_free_deferred().

    That said, it makes nevertheless sense to move the BPF prog destruction
    generally after RCU grace period to guarantee that such scenario above,
    but also others as recently fixed in ceb56070359b ("bpf, perf: delay release
    of BPF prog after grace period") with regards to tail calls won't happen.
    Integrating bpf_prog_free_deferred() directly into the RCU callback is
    not allowed since the invocation might happen from either softirq or
    process context, so we're not permitted to block. Reviewing all bpf_prog_put()
    invocations from eBPF side (note, cBPF -> eBPF progs don't use this for
    their destruction) with call_rcu() looks good to me.

    Since we don't know whether at the time of attaching the program, we're
    already part of a tail call map, we need to use RCU variant. However, due
    to this, there won't be severely more stress on the RCU callback queue:
    situations with above bpf_prog_get() and bpf_prog_put() combo in practice
    normally won't lead to releases, but even if they would, enough effort/
    cycles have to be put into loading a BPF program into the kernel already.

    Reported-by: Jann Horn
    Signed-off-by: Daniel Borkmann
    Acked-by: Alexei Starovoitov
    Signed-off-by: David S. Miller

    Daniel Borkmann
     

30 Jun, 2016

2 commits

  • Use smp_processor_id() for the generic helper bpf_get_smp_processor_id()
    instead of the raw variant. This allows for preemption checks when we
    have DEBUG_PREEMPT, and otherwise uses the raw variant anyway. We only
    need to keep the raw variant for socket filters, but we can reuse the
    helper that is already there from cBPF side.

    Signed-off-by: Daniel Borkmann
    Acked-by: Alexei Starovoitov
    Signed-off-by: David S. Miller

    Daniel Borkmann
     
  • Some minor cleanups: i) Remove the unlikely() from fd array map lookups
    and let the CPU branch predictor do its job, scenarios where there is not
    always a map entry are perfectly valid. ii) Move the attribute type check
    in the bpf_perf_event_read() helper a bit earlier so it's consistent wrt
    checks with bpf_perf_event_output() helper as well. iii) remove some
    comments that are self-documenting in kprobe_prog_is_valid_access() and
    therefore make it consistent to tp_prog_is_valid_access() as well.

    Signed-off-by: Daniel Borkmann
    Acked-by: Alexei Starovoitov
    Signed-off-by: David S. Miller

    Daniel Borkmann