04 Nov, 2018

1 commit

  • [ Upstream commit ba6b8de423f8d0dee48d6030288ed81c03ddf9f0 ]

    Relying on the map_release hook to decrement the reference counts when
    a map is removed only works if the map is not being pinned. In the
    pinned case the ref is decremented immediately and the BPF programs
    released. After this, BPF programs may no longer be in use, which is
    not what the user would expect.

    This patch moves the release logic into bpf_map_put_uref() and brings
    sockmap in line with how a similar case is handled in prog array maps.
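
    A rough sketch of the resulting release path, with the per-map hook
    exposed through the map ops (names follow the commit context and are
    illustrative, not a verbatim copy of the patch):

        /* Sketch: drop the "user" reference and, on the last user, let
         * the map type release the programs it holds, instead of doing
         * it only in map_release/map_free.
         */
        static void bpf_map_put_uref(struct bpf_map *map)
        {
                if (atomic_dec_and_test(&map->usercnt)) {
                        if (map->ops->map_release_uref)
                                map->ops->map_release_uref(map);
                }
        }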

    Fixes: 3d9e952697de ("bpf: sockmap, fix leaking maps with attached but not detached progs")
    Signed-off-by: John Fastabend
    Signed-off-by: Daniel Borkmann
    Signed-off-by: Sasha Levin

    John Fastabend
     

31 Jan, 2018

1 commit

  • [ upstream commit be95a845cc4402272994ce290e3ad928aff06cb9 ]

    In addition to commit b2157399cc98 ("bpf: prevent out-of-bounds
    speculation") also change the layout of struct bpf_map such that
    false sharing of fast-path members like max_entries is avoided
    when the map's reference counter is altered. Therefore, enforce
    that they are placed in separate cachelines.

    pahole dump after change:

    struct bpf_map {
            const struct bpf_map_ops * ops;      /*     0     8 */
            struct bpf_map * inner_map_meta;     /*     8     8 */
            void * security;                     /*    16     8 */
            enum bpf_map_type map_type;          /*    24     4 */
            u32 key_size;                        /*    28     4 */
            u32 value_size;                      /*    32     4 */
            u32 max_entries;                     /*    36     4 */
            u32 map_flags;                       /*    40     4 */
            u32 pages;                           /*    44     4 */
            u32 id;                              /*    48     4 */
            int numa_node;                       /*    52     4 */
            bool unpriv_array;                   /*    56     1 */

            /* XXX 7 bytes hole, try to pack */

            /* --- cacheline 1 boundary (64 bytes) --- */
            struct user_struct * user;           /*    64     8 */
            atomic_t refcnt;                     /*    72     4 */
            atomic_t usercnt;                    /*    76     4 */
            struct work_struct work;             /*    80    32 */
            char name[16];                       /*   112    16 */
            /* --- cacheline 2 boundary (128 bytes) --- */

            /* size: 128, cachelines: 2, members: 17 */
            /* sum members: 121, holes: 1, sum holes: 7 */
    };

    Now all entries in the first cacheline are read-only throughout the
    lifetime of the map and are set up once during map creation. Overall
    struct size and number of cachelines don't change from the reordering.
    struct bpf_map is usually the first member embedded in the map structs
    of specific map implementations, so this also avoids having those
    members sit at the end, where they could potentially share a cacheline
    with the first map values, e.g. in the array case, since remote CPUs
    could trigger map updates just as well for those (easily dirtying
    members like max_entries intentionally as well) while having subsequent
    values in cache.

    Quoting from Google's Project Zero blog [1]:

    Additionally, at least on the Intel machine on which this was
    tested, bouncing modified cache lines between cores is slow,
    apparently because the MESI protocol is used for cache coherence
    [8]. Changing the reference counter of an eBPF array on one
    physical CPU core causes the cache line containing the reference
    counter to be bounced over to that CPU core, making reads of the
    reference counter on all other CPU cores slow until the changed
    reference counter has been written back to memory. Because the
    length and the reference counter of an eBPF array are stored in
    the same cache line, this also means that changing the reference
    counter on one physical CPU core causes reads of the eBPF array's
    length to be slow on other physical CPU cores (intentional false
    sharing).

    While this doesn't 'control' the out-of-bounds speculation through
    masking the index as in commit b2157399cc98, triggering a manipulation
    of the map's reference counter is really trivial, so let's not allow
    max_entries to be affected that easily through it.

    Splitting into separate cachelines also generally makes sense from a
    performance perspective anyway, in that the fast path won't take a
    cache miss if the map gets pinned, reused in other progs, etc. from
    the control path; thus it also avoids unintentional false sharing.
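
    One way to express that split in code is with explicit
    cacheline-alignment annotations on the struct; the sketch below is
    illustrative and not a verbatim copy of the patch:

        struct bpf_map {
                /* 1st cacheline: read-mostly members, some also used in
                 * the fast path (e.g. ops, max_entries), set up once at
                 * map creation time.
                 */
                const struct bpf_map_ops *ops ____cacheline_aligned;
                struct bpf_map *inner_map_meta;
                void *security;
                enum bpf_map_type map_type;
                u32 key_size;
                u32 value_size;
                u32 max_entries;
                u32 map_flags;
                u32 pages;
                u32 id;
                int numa_node;
                bool unpriv_array;

                /* 2nd cacheline: members written at runtime (e.g. the
                 * reference counters), kept away from the fields above
                 * to avoid false sharing.
                 */
                struct user_struct *user ____cacheline_aligned;
                atomic_t refcnt;
                atomic_t usercnt;
                struct work_struct work;
                char name[16];
        };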

    [1] https://googleprojectzero.blogspot.ch/2018/01/reading-privileged-memory-with-side.html

    Signed-off-by: Daniel Borkmann
    Signed-off-by: Alexei Starovoitov
    Signed-off-by: Greg Kroah-Hartman

    Daniel Borkmann
     

17 Jan, 2018

1 commit

  • commit b2157399cc9898260d6031c5bfe45fe137c1fbe7 upstream.

    Under speculation, CPUs may mis-predict branches in bounds checks. Thus,
    memory accesses under a bounds check may be speculated even if the
    bounds check fails, providing a primitive for building a side channel.

    To avoid leaking kernel data, round up array-based maps and mask the
    index after the bounds check, so a speculated load with an out-of-bounds
    index will load either a valid value from the array or zero from the
    padded area.

    Unconditionally mask the index for all array types, even when
    max_entries is not rounded to a power of 2 for the root user.
    When the map is created by an unprivileged user, generate a sequence
    of bpf insns that includes an AND operation to make sure that the
    JITed code includes the same 'index & index_mask' operation.

    If a prog_array map is created by an unprivileged user, replace

        bpf_tail_call(ctx, map, index);

    with

        if (index >= max_entries) {
            index &= map->index_mask;
            bpf_tail_call(ctx, map, index);
        }

    (along with roundup to power of 2) to prevent out-of-bounds speculation.
    There is a secondary, redundant 'if (index >= max_entries)' check in the
    interpreter and in all JITs, but they can be optimized later if necessary.
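
    For the array map lookup itself, the mitigation boils down to an
    unconditional AND with a power-of-two mask after the bounds check; a
    simplified sketch (field names such as index_mask follow the commit
    text):

        /* Simplified array lookup with the speculation mask applied after
         * the bounds check: even if the branch is mispredicted, the masked
         * index stays inside the (rounded up) array.
         */
        static void *array_map_lookup_elem(struct bpf_map *map, void *key)
        {
                struct bpf_array *array =
                        container_of(map, struct bpf_array, map);
                u32 index = *(u32 *)key;

                if (unlikely(index >= array->map.max_entries))
                        return NULL;

                return array->value +
                       array->elem_size * (index & array->index_mask);
        }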

    Other array-like maps (cpumap, devmap, sockmap, perf_event_array, cgroup_array)
    cannot be used by unpriv, so no changes there.

    That fixes bpf side of "Variant 1: bounds check bypass (CVE-2017-5753)" on
    all architectures with and without JIT.

    v2->v3:
    Daniel noticed that the attack can potentially be crafted via syscall
    commands without loading a program, so add masking to those paths as
    well.

    Signed-off-by: Alexei Starovoitov
    Acked-by: John Fastabend
    Signed-off-by: Daniel Borkmann
    Cc: Jiri Slaby
    Signed-off-by: Greg Kroah-Hartman

    Alexei Starovoitov
     

09 Oct, 2017

1 commit

  • Commit 2c16d6033264 ("netfilter: xt_bpf: support ebpf") introduced
    support for attaching an eBPF object by an fd, with the
    'bpf_mt_check_v1' ABI expecting the '.fd' to be specified upon each
    IPT_SO_SET_REPLACE call.

    However this breaks subsequent iptables calls:

    # iptables -A INPUT -m bpf --object-pinned /sys/fs/bpf/xxx -j ACCEPT
    # iptables -A INPUT -s 5.6.7.8 -j ACCEPT
    iptables: Invalid argument. Run `dmesg' for more information.

    That's because iptables works by loading existing rules using
    IPT_SO_GET_ENTRIES to userspace, then issuing IPT_SO_SET_REPLACE with
    the replacement set.

    However, the loaded 'xt_bpf_info_v1' has an arbitrary '.fd' number
    (from the initial "iptables -m bpf" invocation) - so when the 2nd
    invocation occurs, userspace passes a bogus fd number, which causes
    'bpf_mt_check_v1' to fail.

    One suggested solution [1] was to hack iptables userspace to perform an
    "entries fixup" immediately after IPT_SO_GET_ENTRIES, by opening a new,
    process-local fd for every 'xt_bpf_info_v1' entry seen.

    However, in [2] both Pablo Neira Ayuso and Willem de Bruijn suggested to
    deprecate the xt_bpf_info_v1 ABI dealing with pinned ebpf objects.

    This fix changes the XT_BPF_MODE_FD_PINNED behavior to ignore the given
    '.fd' and instead perform an in-kernel lookup for the bpf object given
    the provided '.path'.

    It also defines an alias for the XT_BPF_MODE_FD_PINNED mode, named
    XT_BPF_MODE_PATH_PINNED, to better reflect the fact that the user is
    expected to provide the path of the pinned object.

    Existing XT_BPF_MODE_FD_ELF behavior (non-pinned fd mode) is preserved.
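
    A rough sketch of the check-time decision for the pinned mode, where
    bpf_prog_get_by_path() is a placeholder name for the in-kernel path
    lookup this fix adds:

        /* Sketch only: 'bpf_prog_get_by_path' stands in for the actual
         * in-kernel lookup of a pinned program by its bpffs path; the
         * expected prog type is passed so the type check from the
         * fd-based path is preserved.
         */
        static int xt_bpf_check_path(const char *path, struct bpf_prog **ret)
        {
                *ret = bpf_prog_get_by_path(path, BPF_PROG_TYPE_SOCKET_FILTER);
                return PTR_ERR_OR_ZERO(*ret);
        }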

    References: [1] https://marc.info/?l=netfilter-devel&m=150564724607440&w=2
    [2] https://marc.info/?l=netfilter-devel&m=150575727129880&w=2

    Reported-by: Rafael Buchbinder
    Signed-off-by: Shmulik Ladkani
    Acked-by: Willem de Bruijn
    Acked-by: Daniel Borkmann
    Signed-off-by: Pablo Neira Ayuso

    Shmulik Ladkani
     

09 Sep, 2017

1 commit

  • The sockmap bpf map type supports adding programs via attach commands.
    This patch adds the detach command to keep the API symmetric and allow
    users to remove previously added programs. Otherwise the user would
    have to delete the map and re-add it to get into this state.

    This also adds a series of additional tests to cover the detach
    operation and attaching/detaching invalid prog types.

    API note: socks will run (or not run) programs depending on the state
    of the map at the time the sock is added. We do not for example walk
    the map and remove programs from previously attached socks.
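
    From user space the detach is the mirror image of the attach call; a
    hedged sketch using the raw bpf(2) syscall (shown for the verdict
    program; error handling is left to the caller):

        #include <unistd.h>
        #include <string.h>
        #include <sys/syscall.h>
        #include <linux/bpf.h>

        /* Detach the verdict program previously attached to a sockmap.
         * target_fd is the sockmap fd; no program fd is needed on detach.
         */
        static int sockmap_detach_verdict(int map_fd)
        {
                union bpf_attr attr;

                memset(&attr, 0, sizeof(attr));
                attr.target_fd = map_fd;
                attr.attach_type = BPF_SK_SKB_STREAM_VERDICT;

                return syscall(__NR_bpf, BPF_PROG_DETACH, &attr, sizeof(attr));
        }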

    Acked-by: Daniel Borkmann
    Signed-off-by: John Fastabend
    Acked-by: Alexei Starovoitov
    Signed-off-by: David S. Miller

    John Fastabend
     

29 Aug, 2017

1 commit

  • In the initial sockmap API we provided strparser and verdict programs
    using a single attach command, by extending the attach API with the
    attach_bpf_fd2 field.

    However, if we add other programs in the future we will be adding a
    field for every new possible type, attach_bpf_fd(3,4,..). This seems
    a bit clumsy for an API. So let's push the programs using two new
    type fields.

    BPF_SK_SKB_STREAM_PARSER
    BPF_SK_SKB_STREAM_VERDICT

    This has the advantage of having a readable name and can easily be
    extended in the future.

    Updates to samples and sockmap included here also generalize tests
    slightly to support upcoming patch for multiple map support.
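
    A sketch of how a loader might attach the two program types via the
    raw bpf(2) syscall, assuming the parser and verdict programs are
    already loaded:

        #include <unistd.h>
        #include <string.h>
        #include <sys/syscall.h>
        #include <linux/bpf.h>

        static int sockmap_attach(int map_fd, int parser_fd, int verdict_fd)
        {
                union bpf_attr attr;
                int err;

                /* strparser program: assembles messages from the stream */
                memset(&attr, 0, sizeof(attr));
                attr.target_fd = map_fd;
                attr.attach_bpf_fd = parser_fd;
                attr.attach_type = BPF_SK_SKB_STREAM_PARSER;
                err = syscall(__NR_bpf, BPF_PROG_ATTACH, &attr, sizeof(attr));
                if (err)
                        return err;

                /* verdict program: decides SK_DROP/SK_REDIRECT per message */
                memset(&attr, 0, sizeof(attr));
                attr.target_fd = map_fd;
                attr.attach_bpf_fd = verdict_fd;
                attr.attach_type = BPF_SK_SKB_STREAM_VERDICT;
                return syscall(__NR_bpf, BPF_PROG_ATTACH, &attr, sizeof(attr));
        }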

    Signed-off-by: John Fastabend
    Fixes: 174a79ff9515 ("bpf: sockmap with sk redirect support")
    Suggested-by: Alexei Starovoitov
    Acked-by: Alexei Starovoitov
    Signed-off-by: David S. Miller

    John Fastabend
     

20 Aug, 2017

2 commits

  • Reported-by: kbuild test robot
    Signed-off-by: David S. Miller

    David S. Miller
     
  • The current map creation API does not allow providing a numa-node
    preference. The memory usually comes from where the map-creation
    process is running. The performance is not ideal if the bpf_prog is
    known to always run in a numa node different from the map-creation
    process.

    One of the use cases is sharding on CPU to different LRU maps (i.e.
    an array of LRU maps). Here is the test result of map_perf_test on
    the INNER_LRU_HASH_PREALLOC test if we force the LRU map used by
    CPU0 to be allocated from a remote numa node:

    [ The machine has 20 cores. CPU0-9 at node 0. CPU10-19 at node 1 ]

    ># taskset -c 10 ./map_perf_test 512 8 1260000 8000000
    5:inner_lru_hash_map_perf pre-alloc 1628380 events per sec
    4:inner_lru_hash_map_perf pre-alloc 1626396 events per sec
    3:inner_lru_hash_map_perf pre-alloc 1626144 events per sec
    6:inner_lru_hash_map_perf pre-alloc 1621657 events per sec
    2:inner_lru_hash_map_perf pre-alloc 1621534 events per sec
    1:inner_lru_hash_map_perf pre-alloc 1620292 events per sec
    7:inner_lru_hash_map_perf pre-alloc 1613305 events per sec
    0:inner_lru_hash_map_perf pre-alloc 1239150 events per sec #<<<

    After specifying numa node:
    ># taskset -c 10 ./map_perf_test 512 8 1260000 8000000
    5:inner_lru_hash_map_perf pre-alloc 1629627 events per sec
    3:inner_lru_hash_map_perf pre-alloc 1628057 events per sec
    1:inner_lru_hash_map_perf pre-alloc 1623054 events per sec
    6:inner_lru_hash_map_perf pre-alloc 1616033 events per sec
    2:inner_lru_hash_map_perf pre-alloc 1614630 events per sec
    4:inner_lru_hash_map_perf pre-alloc 1612651 events per sec
    7:inner_lru_hash_map_perf pre-alloc 1609337 events per sec
    0:inner_lru_hash_map_perf pre-alloc 1619340 events per sec #<<<

    This patch adds one field, numa_node, to the bpf_attr. Since numa node 0
    is a valid node, a new flag BPF_F_NUMA_NODE is also added. The numa_node
    field is honored if and only if the BPF_F_NUMA_NODE flag is set.

    Numa node selection is not supported for percpu map.

    This patch does not change all kmalloc calls. F.e. 'htab = kzalloc()'
    is not changed since the object is small enough to stay in the cache.
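
    A user space sketch of creating a map on a specific node with the new
    flag (raw bpf(2) syscall; sizes and the node number are illustrative):

        #include <unistd.h>
        #include <string.h>
        #include <sys/syscall.h>
        #include <linux/bpf.h>

        /* Create a hash map whose memory is allocated from NUMA node 1.
         * numa_node is only honored because BPF_F_NUMA_NODE is set.
         */
        static int create_map_on_node1(void)
        {
                union bpf_attr attr;

                memset(&attr, 0, sizeof(attr));
                attr.map_type    = BPF_MAP_TYPE_HASH;
                attr.key_size    = 4;
                attr.value_size  = 8;
                attr.max_entries = 4096;
                attr.map_flags   = BPF_F_NUMA_NODE;
                attr.numa_node   = 1;

                return syscall(__NR_bpf, BPF_MAP_CREATE, &attr, sizeof(attr));
        }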

    Signed-off-by: Martin KaFai Lau
    Acked-by: Daniel Borkmann
    Acked-by: Alexei Starovoitov
    Signed-off-by: David S. Miller

    Martin KaFai Lau
     

17 Aug, 2017

3 commits

  • Resolve issues with !CONFIG_BPF_SYSCALL and !STREAM_PARSER

    net/core/filter.c: In function ‘do_sk_redirect_map’:
    net/core/filter.c:1881:3: error: implicit declaration of function ‘__sock_map_lookup_elem’ [-Werror=implicit-function-declaration]
    sk = __sock_map_lookup_elem(ri->map, ri->ifindex);
    ^
    net/core/filter.c:1881:6: warning: assignment makes pointer from integer without a cast [enabled by default]
    sk = __sock_map_lookup_elem(ri->map, ri->ifindex);

    Fixes: 174a79ff9515 ("bpf: sockmap with sk redirect support")
    Reported-by: Eric Dumazet
    Signed-off-by: John Fastabend
    Acked-by: Daniel Borkmann
    Signed-off-by: David S. Miller

    John Fastabend
     
  • Recently we added a new map type called dev map, used to forward XDP
    packets between ports (6093ec2dc313). This patch introduces a
    similar notion for sockets.

    A sockmap allows users to add participating sockets to a map. When
    sockets are added to the map enough context is stored with the
    map entry to use the entry with a new helper

    bpf_sk_redirect_map(map, key, flags)

    This helper (analogous to bpf_redirect_map in XDP) is given the map
    and an entry in the map. When called from a sockmap program, discussed
    below, the skb will be sent on the socket using skb_send_sock().

    With the above we need a bpf program to call the helper from, which
    will then implement the send logic. The initial site implemented in
    this series is the recv_sock hook. For this to work we implemented a
    map attach command to add attributes to a map. In sockmap we add two
    programs, a parse program and a verdict program. The parse program
    uses strparser to build messages and passes them to the verdict
    program. The parse program uses the normal strparser semantics. The
    verdict program is of type SK_SKB.

    The verdict program returns a verdict of SK_DROP or SK_REDIRECT for
    now. Additional actions may be added later. When SK_REDIRECT is
    returned, as expected when the bpf program uses bpf_sk_redirect_map(),
    the sockmap logic will consult per-cpu variables set by the helper
    routine and pull the sock entry out of the sock map. This pattern
    follows the existing redirect logic in cls and xdp programs.

    This gives the flow,

    recv_sock -> str_parser (parse_prog) -> verdict_prog -> skb_send_sock
                                                         \
                                                          -> kfree_skb

    As an example use case a message based load balancer may use specific
    logic in the verdict program to select the sock to send on.
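
    A rough sketch of such a verdict program in restricted C for clang's
    bpf target; the helper signature follows the commit text, the section
    name is loader-specific, and bpf_helpers.h refers to the helper stubs
    shipped with samples/bpf:

        #include <linux/bpf.h>
        #include "bpf_helpers.h"

        struct bpf_map_def SEC("maps") sock_map = {
                .type        = BPF_MAP_TYPE_SOCKMAP,
                .key_size    = sizeof(int),
                .value_size  = sizeof(int),
                .max_entries = 16,
        };

        /* Verdict program: redirect every message to the socket stored
         * at key 0; the helper returns SK_DROP if that is not possible.
         */
        SEC("sk_skb_verdict")
        int prog_verdict(struct __sk_buff *skb)
        {
                return bpf_sk_redirect_map(&sock_map, 0, 0);
        }

        char _license[] SEC("license") = "GPL";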

    Sample programs are provided in future patches that hopefully illustrate
    the user interfaces. Also selftests are in follow-on patches.

    Signed-off-by: John Fastabend
    Signed-off-by: David S. Miller

    John Fastabend
     
  • bpf_prog_inc_not_zero will be used by upcoming sockmap patches; this
    patch simply exports it so we can pull it in.

    Signed-off-by: John Fastabend
    Signed-off-by: David S. Miller

    John Fastabend
     

09 Aug, 2017

1 commit

  • Unifies adjusted and unadjusted register value types (e.g. FRAME_POINTER is
    now just a PTR_TO_STACK with zero offset).
    Tracks value alignment by means of tracking known & unknown bits. This
    also replaces the 'reg->imm' (leading zero bits) calculations for (what
    were) UNKNOWN_VALUEs.
    If pointer leaks are allowed, and adjust_ptr_min_max_vals returns -EACCES,
    treat the pointer as an unknown scalar and try again, because we might be
    able to conclude something about the result (e.g. pointer & 0x40 is either
    0 or 0x40).
    Verifier hooks in the netronome/nfp driver were changed to match the new
    data structures.

    Signed-off-by: Edward Cree
    Signed-off-by: David S. Miller

    Edward Cree
     

18 Jul, 2017

3 commits

  • Initial patches missed case with CONFIG_BPF_SYSCALL not set.

    Fixes: 11393cc9b9be ("xdp: Add batching support to redirect map")
    Fixes: 97f91a7cf04f ("bpf: add bpf_redirect_map helper routine")
    Signed-off-by: John Fastabend
    Signed-off-by: David S. Miller

    John Fastabend
     
  • For performance reasons we want to avoid updating the tail pointer in
    the driver tx ring as much as possible. To accomplish this we add
    batching support to the redirect path in XDP.

    This adds another ndo op "xdp_flush" that is used to inform the driver
    that it should bump the tail pointer on the TX ring.
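
    Conceptually the batching looks like the sketch below: transmit a
    bundle of frames per device, then ring the doorbell once; callback
    names follow this series and the sketch is simplified:

        /* Flush path sketch: push all frames queued for a device, then
         * let the driver bump its TX tail pointer a single time via
         * ndo_xdp_flush().
         */
        static void dev_map_flush_one(struct net_device *dev,
                                      struct xdp_buff **frames, int count)
        {
                int i;

                for (i = 0; i < count; i++)
                        dev->netdev_ops->ndo_xdp_xmit(dev, frames[i]);

                dev->netdev_ops->ndo_xdp_flush(dev); /* one doorbell update */
        }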

    Signed-off-by: John Fastabend
    Signed-off-by: Jesper Dangaard Brouer
    Acked-by: Daniel Borkmann
    Signed-off-by: David S. Miller

    John Fastabend
     
  • BPF programs can use the devmap with the bpf_redirect_map() helper
    routine to forward packets to a netdevice in the map.
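
    A minimal XDP program using the helper with a devmap might look like
    this (restricted-C sketch; bpf_helpers.h refers to the helper stubs
    shipped with samples/bpf):

        #include <linux/bpf.h>
        #include "bpf_helpers.h"

        struct bpf_map_def SEC("maps") tx_port = {
                .type        = BPF_MAP_TYPE_DEVMAP,
                .key_size    = sizeof(int),
                .value_size  = sizeof(int),
                .max_entries = 64,
        };

        /* Redirect every packet to the netdevice stored at key 0. */
        SEC("xdp")
        int xdp_redirect_map_prog(struct xdp_md *ctx)
        {
                return bpf_redirect_map(&tx_port, 0, 0);
        }

        char _license[] SEC("license") = "GPL";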

    Signed-off-by: John Fastabend
    Signed-off-by: Jesper Dangaard Brouer
    Acked-by: Daniel Borkmann
    Signed-off-by: David S. Miller

    John Fastabend
     

03 Jul, 2017

1 commit

  • This work tries to make the semantics and code around the
    narrower ctx access a bit easier to follow. Right now
    everything is done inside the .is_valid_access(). Offset
    matching is done differently for read/write types, meaning
    writes don't support narrower access and thus matching only
    on offsetof(struct foo, bar) is enough, whereas for the read
    case that supports narrower access we must check the range
    from offsetof(struct foo, bar) to offsetof(struct foo, bar) +
    sizeof(<bar>) - 1 for each of the cases. For read cases of
    individual members that don't support narrower access (like
    packet pointers or the skb->cb[] case, which has its own narrow
    access logic), we check as usual only offsetof(struct foo,
    bar), like in the write case. Then, for the case where narrower
    access is allowed, we also need to set the aux info for the
    access, meaning ctx_field_size and converted_op_size have
    to be set. The first is the original field size, e.g. sizeof(<bar>)
    as in the above example from the user-facing ctx, and the latter
    is the target size after the actual rewrite happened, thus
    for the kernel-facing ctx. Also here we need the range match,
    and we need to keep convert_ctx_access() and the converted_op_size
    from is_valid_access() in sync, as both are not at the same location.

    We can simplify the code a bit: check_ctx_access() becomes
    simpler in that we only store ctx_field_size as meta data,
    and later in convert_ctx_accesses() we fetch the target_size
    right from the location where we do the conversion. Should the
    verifier be misconfigured, we reject BPF_WRITE cases or a
    target_size that is not provided. For the subsystems, we always
    work on ranges in is_valid_access() and add small helpers for
    ranges and narrow access; convert_ctx_accesses() sets target_size
    for the relevant instruction.
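
    The helpers referred to above are essentially range macros used in
    the is_valid_access() switch; a sketch of the pattern, with the macro
    shown from memory and the callback reduced to its core:

        /* Match an access anywhere inside a ctx member; narrower reads
         * are then resolved later in convert_ctx_accesses(), where
         * target_size is filled in.
         */
        #define bpf_ctx_range(TYPE, MEMBER)                              \
                offsetof(TYPE, MEMBER) ... offsetofend(TYPE, MEMBER) - 1

        static bool example_is_valid_access(int off, int size,
                                            enum bpf_access_type type)
        {
                switch (off) {
                case bpf_ctx_range(struct __sk_buff, data):
                        return type != BPF_WRITE; /* read-only range */
                default:
                        return off >= 0 && off < sizeof(struct __sk_buff);
                }
        }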

    Signed-off-by: Daniel Borkmann
    Acked-by: John Fastabend
    Cc: Yonghong Song
    Signed-off-by: David S. Miller

    Daniel Borkmann
     

30 Jun, 2017

1 commit

  • This patch allows userspace to do BPF_MAP_LOOKUP_ELEM on
    BPF_MAP_TYPE_PROG_ARRAY,
    BPF_MAP_TYPE_ARRAY_OF_MAPS and
    BPF_MAP_TYPE_HASH_OF_MAPS.

    The lookup returns a prog-id or map-id to the userspace.
    The userspace can then use the BPF_PROG_GET_FD_BY_ID
    or BPF_MAP_GET_FD_BY_ID to get a fd.
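
    From user space the two-step lookup looks roughly as follows (raw
    bpf(2) syscall; shown for a prog array):

        #include <unistd.h>
        #include <string.h>
        #include <sys/syscall.h>
        #include <linux/bpf.h>

        /* Look up slot 'key' of a prog array: the lookup returns a prog
         * id, which is then turned back into an fd.
         */
        static int prog_fd_from_array_slot(int prog_array_fd, __u32 key)
        {
                union bpf_attr attr;
                __u32 prog_id = 0;
                int err;

                memset(&attr, 0, sizeof(attr));
                attr.map_fd = prog_array_fd;
                attr.key    = (__u64)(unsigned long)&key;
                attr.value  = (__u64)(unsigned long)&prog_id;
                err = syscall(__NR_bpf, BPF_MAP_LOOKUP_ELEM, &attr, sizeof(attr));
                if (err)
                        return err;

                memset(&attr, 0, sizeof(attr));
                attr.prog_id = prog_id;
                return syscall(__NR_bpf, BPF_PROG_GET_FD_BY_ID, &attr, sizeof(attr));
        }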

    Signed-off-by: Martin KaFai Lau
    Acked-by: Daniel Borkmann
    Signed-off-by: David S. Miller

    Martin KaFai Lau
     

24 Jun, 2017

1 commit

  • Commit 31fd85816dbe ("bpf: permits narrower load from bpf program
    context fields") permits narrower loads for certain ctx fields.
    The commit, however, will generate a masking even if the
    prog-specific ctx conversion already produces the result with
    the narrower size.

    For example, for __sk_buff->protocol, the ctx conversion
    loads the data into a register with a 2-byte load.
    A narrower 2-byte load should not generate masking.
    For __sk_buff->vlan_present, the conversion function
    sets the result to either 0 or 1, essentially a byte.
    A narrower 2-byte or 1-byte load should not generate masking.

    To avoid unnecessary masking, the prog-specific *_is_valid_access
    now passes converted_op_size back to the verifier, which indicates
    the valid data width after the perceived future conversion.
    Based on this information, the verifier is able to avoid
    unnecessary masking.

    Since we want more information back from the prog-specific
    *_is_valid_access checking, all of it is packed into
    one data structure for more clarity.

    Acked-by: Daniel Borkmann
    Signed-off-by: Yonghong Song
    Signed-off-by: David S. Miller

    Yonghong Song
     

15 Jun, 2017

1 commit

  • Currently, the verifier will reject a program if it contains a
    narrower load from the bpf context structure. For example,
        __u8 h = __sk_buff->hash, or
        __u16 p = __sk_buff->protocol, or
        __u32 sample_period = bpf_perf_event_data->sample_period,
    which are narrower loads of 4-byte or 8-byte fields.

    This patch solves the issue by:
    . Introducing a new parameter ctx_field_size to carry the
      field size of the narrower load from the prog type specific
      *__is_valid_access validator back to the verifier.
    . A non-zero ctx_field_size for a memory access indicates that
      (1) the underlying prog type specific convert_ctx_accesses
          supports non-whole-field access, and
      (2) the current insn is a narrower or whole field access.
    . In the verifier, for such loads where the load memory size is
      less than ctx_field_size, the verifier transforms them into a
      full field load followed by proper masking.
    . Currently, __sk_buff and bpf_perf_event_data->sample_period
      support narrowing loads.
    . Narrower stores are still not allowed, as typical ctx stores
      are just normal stores.

    Because of this change, some tests in the verifier selftests fail and
    are removed. As a bonus, rename some out-of-bound __sk_buff->cb
    accesses to proper field names and remove two redundant "skb cb oob"
    tests.

    Acked-by: Daniel Borkmann
    Signed-off-by: Yonghong Song
    Signed-off-by: David S. Miller

    Yonghong Song
     

07 Jun, 2017

2 commits

  • This patch generates a unique ID for each created bpf_map.
    The approach is similar to the earlier patch for the bpf_prog ID.

    It is worth noting that the bpf_map IDs and bpf_prog IDs
    are in two independent ID spaces and both have the same valid range:
    [1, INT_MAX).
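
    Together with the BPF_MAP_GET_NEXT_ID/BPF_MAP_GET_FD_BY_ID commands
    from the same series, the IDs allow iterating over all maps from user
    space; a sketch:

        #include <stdio.h>
        #include <string.h>
        #include <unistd.h>
        #include <sys/syscall.h>
        #include <linux/bpf.h>

        /* Walk all map IDs currently known to the kernel and print them. */
        static void list_map_ids(void)
        {
                union bpf_attr attr;
                __u32 id = 0;

                for (;;) {
                        memset(&attr, 0, sizeof(attr));
                        attr.start_id = id;
                        if (syscall(__NR_bpf, BPF_MAP_GET_NEXT_ID,
                                    &attr, sizeof(attr)))
                                break;          /* ENOENT: no more maps */
                        id = attr.next_id;
                        printf("map id %u\n", id);
                }
        }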

    Signed-off-by: Martin KaFai Lau
    Acked-by: Alexei Starovoitov
    Acked-by: Daniel Borkmann
    Signed-off-by: David S. Miller

    Martin KaFai Lau
     
  • This patch generates a unique ID for each BPF_PROG_LOAD-ed prog.
    It is worth noting that each BPF_PROG_LOAD-ed prog will have
    a different ID even if they have the same bpf instructions.

    The ID is generated by the existing idr_alloc_cyclic().
    The ID ranges over [1, INT_MAX). It is allocated in a cyclic manner,
    so an ID will only get reused after roughly 2 billion BPF_PROG_LOAD
    calls.

    bpf_prog_alloc_id() is done after bpf_prog_select_runtime()
    because the jit process may have allocated a new prog. Hence,
    we need to ensure the value of the 'prog' pointer will not change
    any more before storing the prog in prog_idr.

    After bpf_prog_select_runtime(), the prog is read-only. Hence,
    the id is stored in 'struct bpf_prog_aux'.
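
    A sketch of the allocation step, mirroring how idr_alloc_cyclic() is
    typically used for this purpose (locking and error handling
    simplified):

        /* Global idr and its lock, as assumed for kernel/bpf/syscall.c. */
        static DEFINE_IDR(prog_idr);
        static DEFINE_SPINLOCK(prog_idr_lock);

        /* Allocate a cyclic id in [1, INT_MAX) for a freshly loaded prog. */
        static int bpf_prog_alloc_id(struct bpf_prog *prog)
        {
                int id;

                spin_lock_bh(&prog_idr_lock);
                id = idr_alloc_cyclic(&prog_idr, prog, 1, INT_MAX, GFP_ATOMIC);
                if (id > 0)
                        prog->aux->id = id;
                spin_unlock_bh(&prog_idr_lock);

                return id > 0 ? 0 : id;
        }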

    Signed-off-by: Martin KaFai Lau
    Acked-by: Alexei Starovoitov
    Acked-by: Daniel Borkmann
    Signed-off-by: David S. Miller

    Martin KaFai Lau
     

01 Jun, 2017

1 commit


12 Apr, 2017

2 commits

  • There's no need to have struct bpf_map_type_list since
    it just contains a list_head, the type, and the ops
    pointer. Since the types are densely packed and not
    actually dynamically registered, it's much easier and
    smaller to have an array of type->ops pointers. Also
    initialize this array statically to remove the code needed
    to initialize it.

    In order to save duplicating the list, move it to the
    types header file added by the previous patch and
    include it in the same fashion.
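
    The shared list is expressed as an X-macro style header that is
    expanded into a static table at each include site; a simplified
    sketch of the pattern:

        /* linux/bpf_types.h (excerpt, sketch): one line per map type */
        BPF_MAP_TYPE(BPF_MAP_TYPE_ARRAY, array_map_ops)
        BPF_MAP_TYPE(BPF_MAP_TYPE_HASH, htab_map_ops)

        /* kernel/bpf/syscall.c: expand the header into a static table */
        #define BPF_PROG_TYPE(_id, _ops)
        #define BPF_MAP_TYPE(_id, _ops) \
                [_id] = &_ops,
        static const struct bpf_map_ops * const bpf_map_types[] = {
        #include <linux/bpf_types.h>
        };
        #undef BPF_PROG_TYPE
        #undef BPF_MAP_TYPE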

    Signed-off-by: Johannes Berg
    Acked-by: Alexei Starovoitov
    Acked-by: Daniel Borkmann
    Signed-off-by: David S. Miller

    Johannes Berg
     
  • There's no need to have struct bpf_prog_type_list since
    it just contains a list_head, the type, and the ops
    pointer. Since the types are densely packed and not
    actually dynamically registered, it's much easier and
    smaller to have an array of type->ops pointers. Also
    initialize this array statically to remove the code needed
    to initialize it.

    In order to save duplicating the list, move it to a new
    header file and include it in the places needing it.

    Signed-off-by: Johannes Berg
    Acked-by: Daniel Borkmann
    Acked-by: Alexei Starovoitov
    Signed-off-by: David S. Miller

    Johannes Berg
     

02 Apr, 2017

1 commit

  • Development and testing of networking bpf programs is quite cumbersome.
    Despite the availability of user space bpf interpreters, the kernel is
    the ultimate authority and execution environment.
    Current test frameworks for TC include creation of netns, veth,
    qdiscs and use of various packet generators just to test the
    functionality of a bpf program. XDP testing is even more complicated,
    since qemu needs to be started with gro/gso disabled and a precise
    queue configuration, transferring the xdp program from host into guest,
    attaching it to virtio/eth0 and generating traffic from the host
    while capturing the results from the guest.

    Moreover, analyzing performance bottlenecks in an XDP program is
    impossible in a virtio environment, since the cost of running the
    program is tiny compared to the overhead of virtio packet processing,
    so performance testing can only be done on a physical nic
    with another server generating traffic.

    Furthermore, ongoing changes to the user space control plane of
    production applications cannot be run on the test servers, leaving
    bpf programs stubbed out for testing.

    Last but not least, the upstream llvm changes are validated by the bpf
    backend testsuite, which has no ability to test the generated code.

    To improve this situation introduce BPF_PROG_TEST_RUN command
    to test and performance benchmark bpf programs.
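
    A user space sketch of running a loaded program against a canned
    packet via the new command (raw bpf(2) syscall; the field names come
    from the uapi test_run layout):

        #include <unistd.h>
        #include <string.h>
        #include <sys/syscall.h>
        #include <linux/bpf.h>

        /* Run prog_fd 1000000 times over 'pkt' and report retval/duration. */
        static int test_run(int prog_fd, void *pkt, __u32 pkt_len,
                            __u32 *retval, __u32 *duration_ns)
        {
                union bpf_attr attr;
                int err;

                memset(&attr, 0, sizeof(attr));
                attr.test.prog_fd      = prog_fd;
                attr.test.data_in      = (__u64)(unsigned long)pkt;
                attr.test.data_size_in = pkt_len;
                attr.test.repeat       = 1000000;

                err = syscall(__NR_bpf, BPF_PROG_TEST_RUN, &attr, sizeof(attr));
                if (!err) {
                        *retval      = attr.test.retval;
                        *duration_ns = attr.test.duration;
                }
                return err;
        }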

    Joint work with Daniel Borkmann.

    Signed-off-by: Alexei Starovoitov
    Acked-by: Daniel Borkmann
    Acked-by: Martin KaFai Lau
    Signed-off-by: David S. Miller

    Alexei Starovoitov
     

23 Mar, 2017

2 commits

  • This patch adds hash of maps support (hashmap->bpf_map).
    BPF_MAP_TYPE_HASH_OF_MAPS is added.

    A map-in-map contains a pointer to another map; let's call
    this pointer 'inner_map_ptr'.
    Notes on deleting inner_map_ptr from a hash map:

    1. For a BPF_F_NO_PREALLOC map-in-map, when deleting
       an inner_map_ptr, the htab_elem itself will go through
       an rcu grace period and the inner_map_ptr resides
       in the htab_elem.

    2. For a pre-allocated htab_elem (!BPF_F_NO_PREALLOC),
       when deleting an inner_map_ptr, the htab_elem may
       get reused immediately. This situation is similar
       to the existing preallocated use cases.

    However, the bpf_map_fd_put_ptr() calls bpf_map_put() which calls
    inner_map->ops->map_free(inner_map) which will go
    through a rcu grace period (i.e. all bpf_map's map_free
    currently goes through a rcu grace period). Hence,
    the inner_map_ptr is still safe for the rcu reader side.

    This patch also adds BPF_MAP_TYPE_HASH_OF_MAPS to the
    check_map_prealloc() in the verifier. Preallocation is a
    must for BPF_PROG_TYPE_PERF_EVENT. Hence, even though we don't expect
    heavy updates to map-in-map, enforcing BPF_F_NO_PREALLOC for map-in-map
    is impossible without first disallowing BPF_PROG_TYPE_PERF_EVENT from
    using map-in-map.
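
    On the bpf program side, a lookup in the outer map returns the inner
    map, which then feeds a second lookup; a restricted-C sketch, with
    the inner-map template wiring left to the loader and bpf_helpers.h
    standing for the samples/bpf helper stubs:

        #include <linux/bpf.h>
        #include "bpf_helpers.h"

        struct bpf_map_def SEC("maps") outer_map = {
                .type        = BPF_MAP_TYPE_HASH_OF_MAPS,
                .key_size    = sizeof(__u32),
                .value_size  = sizeof(__u32), /* slots get map fds on update */
                .max_entries = 64,
                /* the loader must supply an inner map fd as template */
        };

        SEC("socket")
        int two_level_lookup(struct __sk_buff *skb)
        {
                __u32 outer_key = 0, inner_key = 0;
                void *inner_map;
                __u64 *value;

                inner_map = bpf_map_lookup_elem(&outer_map, &outer_key);
                if (!inner_map)
                        return 0;

                value = bpf_map_lookup_elem(inner_map, &inner_key);
                if (value)
                        __sync_fetch_and_add(value, 1);

                return 0;
        }

        char _license[] SEC("license") = "GPL";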

    Signed-off-by: Martin KaFai Lau
    Acked-by: Alexei Starovoitov
    Acked-by: Daniel Borkmann
    Signed-off-by: David S. Miller

    Martin KaFai Lau
     
  • This patch adds a few helper funcs to enable map-in-map
    support (i.e. outer_map->inner_map). The first outer_map type,
    BPF_MAP_TYPE_ARRAY_OF_MAPS, is also added in this patch.
    The next patch will introduce a hash of maps type.

    Any bpf map type can act as an inner_map. The exception
    is BPF_MAP_TYPE_PROG_ARRAY, because the extra level of
    indirection makes it harder to verify the owner_prog_type
    and owner_jited.

    Multi-level map-in-map is not supported (i.e. map->map is ok
    but not map->map->map).

    When adding an inner_map to an outer_map, it currently checks the
    map_type, key_size, value_size, map_flags, max_entries and ops.
    The verifier also uses those map properties to do static analysis.
    map_flags is needed because we need to ensure BPF_PROG_TYPE_PERF_EVENT
    is using a preallocated hashtab for the inner hash as well. ops and
    max_entries are needed to generate inlined map-lookup instructions.
    For simplicity, a simple '==' test is used for both map_flags
    and max_entries. The equality of ops is implied by the equality of
    map_type.

    At outer_map creation time, an inner_map_fd is needed to create the
    outer_map. However, the inner_map_fd's lifetime does not depend on the
    outer_map. The inner_map_fd is merely used to initialize
    the inner_map_meta of the outer_map.

    Also, for the outer_map:

    * It allows element update and delete from syscall
    * It allows element lookup from bpf_prog

    The above is similar to the current fd_array pattern.
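
    Creating the outer map from user space then takes the fd of an
    already created inner map, which only serves as a template (raw
    bpf(2) sketch):

        #include <unistd.h>
        #include <string.h>
        #include <sys/syscall.h>
        #include <linux/bpf.h>

        /* Create an array-of-maps whose slots must hold maps compatible
         * with the map behind inner_map_fd (same type/key/value/flags/
         * max_entries).
         */
        static int create_array_of_maps(int inner_map_fd)
        {
                union bpf_attr attr;

                memset(&attr, 0, sizeof(attr));
                attr.map_type     = BPF_MAP_TYPE_ARRAY_OF_MAPS;
                attr.key_size     = sizeof(__u32);
                attr.value_size   = sizeof(__u32); /* slots store map fds */
                attr.max_entries  = 8;
                attr.inner_map_fd = inner_map_fd;

                return syscall(__NR_bpf, BPF_MAP_CREATE, &attr, sizeof(attr));
        }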

    Signed-off-by: Martin KaFai Lau
    Acked-by: Alexei Starovoitov
    Acked-by: Daniel Borkmann
    Signed-off-by: David S. Miller

    Martin KaFai Lau
     

17 Mar, 2017

1 commit

  • Optimize bpf_call -> bpf_map_lookup_elem() -> array_map_lookup_elem()
    into a sequence of bpf instructions.
    When the JIT is on, this sequence of bpf instructions becomes a
    sequence of native cpu instructions with significantly better
    performance than an indirect call plus two functions' prologue/epilogue.
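
    What the inlined sequence computes is roughly the C below; the
    verifier emits equivalent bpf instructions in place of the helper
    call (sketch of the computation, not of the emitted insns):

        /* Inlined equivalent of array_map_lookup_elem(): bounds check
         * plus a constant-stride address computation, no indirect call.
         */
        static void *inlined_array_lookup(struct bpf_array *array, u32 index)
        {
                if (index >= array->map.max_entries)
                        return NULL;

                /* elem_size is value_size rounded up to 8 at creation */
                return array->value + (u64)index * array->elem_size;
        }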

    Signed-off-by: Alexei Starovoitov
    Acked-by: Daniel Borkmann
    Signed-off-by: David S. Miller

    Alexei Starovoitov
     

18 Feb, 2017

1 commit

  • Long standing issue with JITed programs is that stack traces from
    function tracing check whether a given address is kernel code
    through {__,}kernel_text_address(), which checks for code in core
    kernel, modules and dynamically allocated ftrace trampolines. But
    what is still missing is BPF JITed programs (interpreted programs
    are not an issue as __bpf_prog_run() will be attributed to them),
    thus when a stack trace is triggered, the code walking the stack
    won't see any of the JITed ones. The same for address correlation
    done from user space via reading /proc/kallsyms. This is read by
    tools like perf, but the latter is also useful for permanent live
    tracing with eBPF itself in combination with stack maps when other
    eBPF types are part of the callchain. See offwaketime example on
    dumping stack from a map.

    This work tries to tackle that issue by making the addresses and
    symbols known to the kernel. The lookup from *kernel_text_address()
    is implemented through a latched RB tree that can be read under
    RCU in fast-path that is also shared for symbol/size/offset lookup
    for a specific given address in kallsyms. The slow-path iteration
    through all symbols in the seq file is done via an RCU list, which
    holds a tiny fraction of all exported ksyms, usually below 0.1 percent.
    Function symbols are exported as bpf_prog_<tag>, in order to aid
    debugging and attribution. This facility is currently enabled for
    root-only when bpf_jit_kallsyms is set to 1, and disabled if hardening
    is active in any mode. The rationale behind this is that still a lot
    of systems ship with world read permissions on kallsyms thus addresses
    should not get suddenly exposed for them. If that situation gets
    much better in future, we always have the option to change the
    default on this. Likewise, unprivileged programs are not allowed
    to add entries there either, but that is less of a concern as most
    such programs types relevant in this context are for root-only anyway.
    If enabled, call graphs and stack traces will then show a correct
    attribution; one example is illustrated below, where the trace is
    now visible in tooling such as perf script --kallsyms=/proc/kallsyms
    and friends.

    Before:

    7fff8166889d bpf_clone_redirect+0x80007f0020ed (/lib/modules/4.9.0-rc8+/build/vmlinux)
    f5d80 __sendmsg_nocancel+0xffff006451f1a007 (/usr/lib64/libc-2.18.so)

    After:

    7fff816688b7 bpf_clone_redirect+0x80007f002107 (/lib/modules/4.9.0-rc8+/build/vmlinux)
    7fffa0575728 bpf_prog_33c45a467c9e061a+0x8000600020fb (/lib/modules/4.9.0-rc8+/build/vmlinux)
    7fffa07ef1fc cls_bpf_classify+0x8000600020dc (/lib/modules/4.9.0-rc8+/build/vmlinux)
    7fff81678b68 tc_classify+0x80007f002078 (/lib/modules/4.9.0-rc8+/build/vmlinux)
    7fff8164d40b __netif_receive_skb_core+0x80007f0025fb (/lib/modules/4.9.0-rc8+/build/vmlinux)
    7fff8164d718 __netif_receive_skb+0x80007f002018 (/lib/modules/4.9.0-rc8+/build/vmlinux)
    7fff8164e565 process_backlog+0x80007f002095 (/lib/modules/4.9.0-rc8+/build/vmlinux)
    7fff8164dc71 net_rx_action+0x80007f002231 (/lib/modules/4.9.0-rc8+/build/vmlinux)
    7fff81767461 __softirqentry_text_start+0x80007f0020d1 (/lib/modules/4.9.0-rc8+/build/vmlinux)
    7fff817658ac do_softirq_own_stack+0x80007f00201c (/lib/modules/4.9.0-rc8+/build/vmlinux)
    7fff810a2c20 do_softirq+0x80007f002050 (/lib/modules/4.9.0-rc8+/build/vmlinux)
    7fff810a2cb5 __local_bh_enable_ip+0x80007f002085 (/lib/modules/4.9.0-rc8+/build/vmlinux)
    7fff8168d452 ip_finish_output2+0x80007f002152 (/lib/modules/4.9.0-rc8+/build/vmlinux)
    7fff8168ea3d ip_finish_output+0x80007f00217d (/lib/modules/4.9.0-rc8+/build/vmlinux)
    7fff8168f2af ip_output+0x80007f00203f (/lib/modules/4.9.0-rc8+/build/vmlinux)
    [...]
    7fff81005854 do_syscall_64+0x80007f002054 (/lib/modules/4.9.0-rc8+/build/vmlinux)
    7fff817649eb return_from_SYSCALL_64+0x80007f002000 (/lib/modules/4.9.0-rc8+/build/vmlinux)
    f5d80 __sendmsg_nocancel+0xffff01c484812007 (/usr/lib64/libc-2.18.so)

    Signed-off-by: Daniel Borkmann
    Acked-by: Alexei Starovoitov
    Cc: linux-kernel@vger.kernel.org
    Signed-off-by: David S. Miller

    Daniel Borkmann
     

28 Jan, 2017

1 commit


19 Jan, 2017

1 commit

  • This patch adds two helpers, bpf_map_area_alloc() and bpf_map_area_free(),
    that are to be used for map allocations. Using kmalloc() for very large
    allocations can cause excessive work within the page allocator, so i) fall
    back earlier to vmalloc() when the attempt is considered costly anyway,
    and even more importantly ii) don't trigger OOM killer with any of the
    allocators.
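
    The backend selection boils down to the sketch below; the exact gfp
    flags are recalled from the era's implementation and should be
    treated as an approximation:

        /* Try the page allocator only for cheap sizes, without warnings
         * or OOM-killer involvement, otherwise go straight to vmalloc().
         */
        void *bpf_map_area_alloc(size_t size)
        {
                const gfp_t flags = __GFP_NOWARN | __GFP_NORETRY | __GFP_ZERO;
                void *area;

                if (size <= (PAGE_SIZE << PAGE_ALLOC_COSTLY_ORDER)) {
                        area = kmalloc(size, GFP_USER | flags);
                        if (area)
                                return area;
                }

                return __vmalloc(size, GFP_KERNEL | flags, PAGE_KERNEL);
        }

        void bpf_map_area_free(void *area)
        {
                kvfree(area);
        }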

    Since this is based on a user space request, for example, when creating
    maps with element pre-allocation, we really want such requests to fail
    instead of killing other user space processes.

    Also, don't spam the kernel log with warnings should any of the allocations
    fail under pressure. Given that, we can make backend selection in
    bpf_map_area_alloc() generic, and convert all maps over to use this API
    for spots with potentially large allocation requests.

    Note, replacing the one kmalloc_array() is fine as overflow checks happen
    earlier in htab_map_alloc(), since it must also protect the multiplication
    for vmalloc() should kmalloc_array() fail.

    Signed-off-by: Daniel Borkmann
    Acked-by: Alexei Starovoitov
    Signed-off-by: David S. Miller

    Daniel Borkmann
     

18 Jan, 2017

1 commit


17 Jan, 2017

1 commit

  • Commit 7bd509e311f4 ("bpf: add prog_digest and expose it via
    fdinfo/netlink") was recently discussed, partially due to the
    admittedly suboptimal name of "prog_digest" in combination
    with the sha1 hash usage, and thus, inevitably and rightfully,
    concerns about its security in terms of collision resistance
    were raised with regards to use-cases.

    The intended use cases are for debugging and introspection
    only, to provide a stable "tag" over the instruction sequence
    that both kernel and user space can calculate independently.
    It's not usable at all for making a security-relevant decision.
    So collisions, where two different instruction sequences generate
    the same tag, can happen, but ideally at a rather low rate. The
    "tag" will be dumped in hex and is short enough to introspect
    in tracepoints or kallsyms output along with other data such
    as stack traces, etc. Thus, this patch performs a rename into
    prog_tag and truncates the tag to a short output (64 bits) to
    make it obvious it's not collision-free.

    Should in future a hash or facility be needed with a security
    relevant focus, then we can think about requirements, constraints,
    etc that would fit to that situation. For now, rework the exposed
    parts for the current use cases as long as nothing has been
    released yet. Tested on x86_64 and s390x.

    Fixes: 7bd509e311f4 ("bpf: add prog_digest and expose it via fdinfo/netlink")
    Signed-off-by: Daniel Borkmann
    Acked-by: Alexei Starovoitov
    Cc: Andy Lutomirski
    Signed-off-by: David S. Miller

    Daniel Borkmann
     

12 Jan, 2017

1 commit

  • Currently, when calling the convert_ctx_access() callback for the
    various program types, we pass in insn->dst_reg, insn->src_reg and
    insn->off from the original instruction. This information is needed
    to rewrite the instruction that is based on the user ctx structure
    into a kernel representation of the ctx. As we'd like to allow access
    sizes beyond just BPF_W, we'd also need insn->code in order to decode
    the original access size. Given that, let's just pass insn directly to
    the convert_ctx_access() callback and work on that, to avoid cluttering
    the callback with even more arguments when everything is already
    contained in insn. So let's go through it once; no functional
    change.
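
    The callback shape changes roughly as follows (sketch of the
    before/after prototypes, written from memory; the _old/_new suffixes
    are only there to keep the sketch readable):

        /* Before: individual fields of the instruction were passed in. */
        u32 (*convert_ctx_access_old)(enum bpf_access_type type, int dst_reg,
                                      int src_reg, int ctx_off,
                                      struct bpf_insn *insn_buf,
                                      struct bpf_prog *prog);

        /* After: the instruction is handed over as a whole, so the
         * callback can also look at insn->code to derive the access size.
         */
        u32 (*convert_ctx_access_new)(enum bpf_access_type type,
                                      const struct bpf_insn *si,
                                      struct bpf_insn *insn_buf,
                                      struct bpf_prog *prog);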

    Signed-off-by: Daniel Borkmann
    Acked-by: Alexei Starovoitov
    Signed-off-by: David S. Miller

    Daniel Borkmann
     

10 Jan, 2017

1 commit


18 Dec, 2016

2 commits

  • Commit aaac3ba95e4c ("bpf: charge user for creation of BPF maps and
    programs") made a wrong assumption by charging against prog->pages.
    Unlike map->pages, prog->pages is still subject to change when we
    need to expand the program through bpf_prog_realloc().

    This can for example happen during the verification stage when we need
    to expand and rewrite parts of the program. Should the required space
    cross a page boundary, then prog->pages is not the same anymore as
    its original value that we used to bpf_prog_charge_memlock() on. Thus,
    we'll hit a wrap-around during bpf_prog_uncharge_memlock() when the
    prog is eventually freed. I noticed this when, despite having unlimited
    memlock, programs suddenly refused to load with an EPERM error due to
    insufficient memlock.

    There are two ways to fix this issue. One would be to add a cached
    variable to struct bpf_prog that takes a snapshot of prog->pages at the
    time of charging. The other approach is to also account for resizes. I
    chose to go with the latter for a couple of reasons: i) we want the
    accounting to be more accurate instead of further fooling the limits,
    and ii) adding yet another page counter to struct bpf_prog would also
    be a waste just for this purpose. We also do want to charge as early as
    possible to avoid going into the verifier just to find out later on
    that we crossed limits. The only place that needs to be fixed is
    bpf_prog_realloc(), since only here do we expand the program, so we try
    to account for the needed delta and, should we fail, call sites check
    for the outcome anyway.
    On cBPF to eBPF migrations, we don't grab a reference to the user as
    they are charged differently. With that in place, my test case worked
    fine.

    Fixes: aaac3ba95e4c ("bpf: charge user for creation of BPF maps and programs")
    Signed-off-by: Daniel Borkmann
    Acked-by: Alexei Starovoitov
    Signed-off-by: David S. Miller

    Daniel Borkmann
     
  • Geert rightfully complained that 7bd509e311f4 ("bpf: add prog_digest
    and expose it via fdinfo/netlink") added a too large allocation of
    variable 'raw' from bss section, and should instead be done dynamically:

    # ./scripts/bloat-o-meter kernel/bpf/core.o.1 kernel/bpf/core.o.2
    add/remove: 3/0 grow/shrink: 0/0 up/down: 33291/0 (33291)
    function old new delta
    raw - 32832 +32832
    [...]

    Since this is only relevant during the program creation path, which can
    be considered slow-path anyway, let's allocate it dynamically and not
    be implicitly dependent on the verifier mutex. Move
    bpf_prog_calc_digest() to the beginning of
    replace_map_fd_with_map_ptr(); error handling also stays
    straightforward.

    Reported-by: Geert Uytterhoeven
    Signed-off-by: Daniel Borkmann
    Acked-by: Alexei Starovoitov
    Signed-off-by: David S. Miller

    Daniel Borkmann
     

06 Dec, 2016

1 commit

  • When loading a BPF program via bpf(2), calculate the digest over
    the program's instruction stream and store it in struct bpf_prog's
    digest member. This is done at a point in time before any instructions
    are rewritten by the verifier. Any unstable map file descriptor
    number part of the imm field will be zeroed for the hash.
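
    The stabilization step is essentially: copy the instructions, blank
    the map fd immediates of pseudo LD_IMM64 loads, then hash the raw
    buffer; a sketch, with hash_raw_buffer() as a placeholder since the
    real code deliberately avoids the crypto layer:

        /* Sketch: make the instruction stream stable before hashing by
         * zeroing the per-load map fd stored in the imm field of
         * BPF_PSEUDO_MAP_FD loads. hash_raw_buffer() is a placeholder
         * for the actual digest step.
         */
        static void prog_calc_digest(struct bpf_prog *prog, u8 *digest)
        {
                u32 size = bpf_prog_insn_size(prog);
                struct bpf_insn *dst = vmalloc(size);
                int i;

                if (!dst)
                        return;

                memcpy(dst, prog->insnsi, size);

                for (i = 0; i < prog->len; i++) {
                        if (dst[i].code == (BPF_LD | BPF_IMM | BPF_DW) &&
                            dst[i].src_reg == BPF_PSEUDO_MAP_FD)
                                dst[i].imm = 0;
                }

                hash_raw_buffer(digest, dst, size);
                vfree(dst);
        }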

    fdinfo example output for progs:

    # cat /proc/1590/fdinfo/5
    pos: 0
    flags: 02000002
    mnt_id: 11
    prog_type: 1
    prog_jited: 1
    prog_digest: b27e8b06da22707513aa97363dfb11c7c3675d28
    memlock: 4096

    When programs are pinned and retrieved by an ELF loader, the loader
    can check the program's digest through fdinfo and compare it against
    one that was generated over the ELF file's program section to see
    if the program needs to be reloaded. Furthermore, this can also be
    exposed through other means such as netlink in case of a tc cls/act
    dump (or xdp in future), but also through tracepoints or other
    facilities to identify the program. Other than that, the digest can
    also serve as a base name for the work in progress kallsyms support
    of programs. The digest doesn't depend/select the crypto layer, since
    we need to keep dependencies to a minimum. iproute2 will get support
    for this facility.

    Signed-off-by: Daniel Borkmann
    Acked-by: Alexei Starovoitov
    Signed-off-by: David S. Miller

    Daniel Borkmann
     

22 Nov, 2016

1 commit


13 Nov, 2016

1 commit

  • Commit 67f8b1dcb9ee ("net/mlx4_en: Refactor the XDP forwarding rings
    scheme") added a bug in that the prog's reference count is not dropped
    in the error path when mlx4_en_try_alloc_resources() fails in
    mlx4_xdp_set().

    We previously took bpf_prog_add(prog, priv->rx_ring_num - 1), which we
    need to release again. Earlier in the call path, dev_change_xdp_fd()
    itself holds a reference to the prog as well (hence the '- 1' in the
    bpf_prog_add()), so a simple atomic_sub() is safe to use here. When
    an error is propagated, bpf_prog_put() is eventually called from
    dev_change_xdp_fd().

    Fixes: 67f8b1dcb9ee ("net/mlx4_en: Refactor the XDP forwarding rings scheme")
    Signed-off-by: Daniel Borkmann
    Acked-by: Alexei Starovoitov
    Signed-off-by: David S. Miller

    Daniel Borkmann