05 Jun, 2019

1 commit

  • Based on 1 normalized pattern(s):

    this program is free software you can redistribute it and or modify
    it under the terms of version 2 of the gnu general public license as
    published by the free software foundation this program is
    distributed in the hope that it will be useful but without any
    warranty without even the implied warranty of merchantability or
    fitness for a particular purpose see the gnu general public license
    for more details

    extracted by the scancode license scanner the SPDX license identifier

    GPL-2.0-only

    has been chosen to replace the boilerplate/reference in 64 file(s).

    Signed-off-by: Thomas Gleixner
    Reviewed-by: Alexios Zavras
    Reviewed-by: Allison Randal
    Cc: linux-spdx@vger.kernel.org
    Link: https://lkml.kernel.org/r/20190529141901.894819585@linutronix.de
    Signed-off-by: Greg Kroah-Hartman

    Thomas Gleixner
     

13 Apr, 2019

1 commit

  • Add bpf_strtol and bpf_strtoul to convert a string to a long and an
    unsigned long, respectively. They are similar to the user space strtol(3)
    and strtoul(3), with a few changes to the API:

    * instead of a NUL-terminated C string, the helpers expect a buffer and a
    buffer length;

    * the resulting long or unsigned long is returned in a separate
    result argument;

    * the return value is used to indicate success or failure; on success, the
    number of consumed bytes is returned, which can be used to identify the
    position to read next if the buffer is expected to contain multiple
    integers;

    * instead of a *base* argument, a *flags* argument is used that provides
    the base in the 5 LSBs; other bits are reserved for future use;

    * the number of supported bases is limited.

    Documentation for the new helpers is provided in bpf.h UAPI.

    The helpers are made available to BPF_PROG_TYPE_CGROUP_SYSCTL programs so
    that they can convert string input to e.g. "ulongvec" output.

    E.g. "net/ipv4/tcp_mem" consists of three ulong integers, which can be
    parsed by calling bpf_strtoul three times, as sketched below.
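
    A minimal sketch of that parsing loop, assuming a BPF_PROG_TYPE_CGROUP_SYSCTL
    program and the bpf_sysctl_get_new_value() helper from the same series; the
    program name, buffer size and separator handling are illustrative only:

        #include <linux/bpf.h>
        #include <bpf/bpf_helpers.h>

        SEC("cgroup/sysctl")
        int parse_tcp_mem(struct bpf_sysctl *ctx)
        {
                unsigned long tcp_mem[3] = {};
                char buf[64] = {};
                int ret, off = 0;

                /* Copy the value being written by user space, if any. */
                ret = bpf_sysctl_get_new_value(ctx, buf, sizeof(buf));
                if (ret < 0)
                        return 1;               /* not a write, allow */

                /* Three iterations; older verifiers may need this unrolled. */
                for (int i = 0; i < 3; i++) {
                        if (off >= (int)sizeof(buf))
                                return 0;
                        /* flags carry the base in the 5 LSBs; 0 = auto-detect */
                        ret = bpf_strtoul(buf + off, sizeof(buf) - off, 0,
                                          &tcp_mem[i]);
                        if (ret <= 0)
                                return 0;       /* reject malformed input */
                        off += ret + 1;         /* skip the separating space */
                }

                return 1;                       /* allow the write */
        }

        char _license[] SEC("license") = "GPL";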

    Implementation notes:

    The implementation includes "../../lib/kstrtox.h" to reuse the integer
    parsing functions. It's done exactly the same way as fs/proc/base.c
    already does.

    Unfortunately, the existing kstrtoX functions can't be used directly since
    they fail if any invalid character is present right after the integer in
    the string. The existing simple_strtoX functions can't be used either,
    since they're obsolete and don't handle overflow properly.

    Signed-off-by: Andrey Ignatov
    Signed-off-by: Alexei Starovoitov

    Andrey Ignatov
     

02 Feb, 2019

2 commits

  • Introduce a BPF_F_LOCK flag for the map_lookup and map_update syscall
    commands and for the map_update() helper function.
    In all these cases, take the lock of the existing element (which was
    provided in the BTF description) before copying (in or out) the rest of
    the map value.

    Implementation details that are part of uapi:

    Array:
    The array map takes the element lock for lookup/update.

    Hash:
    The hash map also takes the element lock for lookup/update and tries to
    avoid the bucket lock. If an old element exists, it takes the element lock
    and updates the element in place. If the element doesn't exist, it
    allocates a new one and inserts it into the hash table while holding the
    bucket lock. In rare cases the hashmap has to take both the bucket lock
    and the element lock to update the old value in place.

    Cgroup local storage:
    It is similar to the array: update in place and lookup are done with the
    lock taken.
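
    From the userspace side, the flag is simply passed through the
    lookup/update wrappers. A minimal sketch, assuming libbpf's low-level
    bpf.h wrappers and a map whose BTF-described value carries a
    struct bpf_spin_lock (the struct layout shown here is illustrative):

        #include <linux/bpf.h>
        #include <bpf/bpf.h>

        struct map_value {
                struct bpf_spin_lock lock;      /* must be described via BTF */
                long cnt;
        };

        int update_then_read(int map_fd, __u32 key)
        {
                struct map_value val = { .cnt = 42 };
                int err;

                /* Kernel grabs val.lock of the existing element before
                 * copying the rest of the value in. */
                err = bpf_map_update_elem(map_fd, &key, &val, BPF_F_LOCK);
                if (err)
                        return err;

                /* Lookup under the same lock; the bpf_spin_lock field itself
                 * is never copied out to user space. */
                return bpf_map_lookup_elem_flags(map_fd, &key, &val, BPF_F_LOCK);
        }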

    Signed-off-by: Alexei Starovoitov
    Signed-off-by: Daniel Borkmann

    Alexei Starovoitov
     
  • Introduce 'struct bpf_spin_lock' and bpf_spin_lock/unlock() helpers to let
    bpf program serialize access to other variables.

    Example:

        struct hash_elem {
                int cnt;
                struct bpf_spin_lock lock;
        };

        struct hash_elem *val = bpf_map_lookup_elem(&hash_map, &key);
        if (val) {
                bpf_spin_lock(&val->lock);
                val->cnt++;
                bpf_spin_unlock(&val->lock);
        }

    Restrictions and safety checks:
    - bpf_spin_lock is only allowed inside HASH and ARRAY maps.
    - BTF description of the map is mandatory for safety analysis.
    - a bpf program can take one bpf_spin_lock at a time, since two or more
    could cause deadlocks.
    - only one 'struct bpf_spin_lock' is allowed per map element.
    It drastically simplifies implementation yet allows bpf program to use
    any number of bpf_spin_locks.
    - when bpf_spin_lock is taken the calls (either bpf2bpf or helpers) are not allowed.
    - bpf program must bpf_spin_unlock() before return.
    - bpf program can access 'struct bpf_spin_lock' only via
    bpf_spin_lock()/bpf_spin_unlock() helpers.
    - load/store into 'struct bpf_spin_lock lock;' field is not allowed.
    - to use the bpf_spin_lock() helper, the BTF description of the map value
    must be a struct and have a 'struct bpf_spin_lock anyname;' field at the
    top level (see the map definition sketch after this list).
    Nesting the lock inside another struct is not allowed.
    - syscall map_lookup doesn't copy bpf_spin_lock field to user space.
    - syscall map_update and program map_update do not update bpf_spin_lock field.
    - bpf_spin_lock cannot be on the stack or inside networking packet.
    bpf_spin_lock can only be inside HASH or ARRAY map value.
    - bpf_spin_lock is available to root only and to all program types.
    - bpf_spin_lock is not allowed in inner maps of map-in-map.
    - ld_abs is not allowed inside spin_lock-ed region.
    - tracing progs and socket filter progs cannot use bpf_spin_lock due to
    insufficient preemption checks.
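
    A sketch of the map definition referenced above, using the BTF-defined map
    syntax of current libbpf (newer than this commit; shown only to illustrate
    where the lock lives in the value):

        #include <linux/bpf.h>
        #include <bpf/bpf_helpers.h>

        struct hash_elem {
                int cnt;
                struct bpf_spin_lock lock;      /* exactly one, at top level */
        };

        struct {
                __uint(type, BPF_MAP_TYPE_HASH);
                __uint(max_entries, 1024);
                __type(key, __u32);
                __type(value, struct hash_elem);
        } hash_map SEC(".maps");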

    Implementation details:
    - cgroup-bpf class of programs can nest with xdp/tc programs.
    Hence bpf_spin_lock is equivalent to spin_lock_irqsave.
    Other solutions to avoid nested bpf_spin_lock are possible.
    Like making sure that all networking progs run with softirq disabled.
    spin_lock_irqsave is the simplest and doesn't add overhead to the
    programs that don't use it.
    - arch_spinlock_t is used when it's implemented as a queued_spin_lock
    - archs can force their own arch_spinlock_t
    - on architectures where queued_spin_lock is not available and
    sizeof(arch_spinlock_t) != sizeof(__u32), a trivial lock is used.
    - presence of bpf_spin_lock inside map value could have been indicated via
    extra flag during map_create, but specifying it via BTF is cleaner.
    It provides introspection for map key/value and reduces user mistakes.

    Next steps:
    - allow bpf_spin_lock in other map types (like cgroup local storage)
    - introduce BPF_F_LOCK flag for bpf_map_update() syscall and helper
    to request kernel to grab bpf_spin_lock before rewriting the value.
    That will serialize access to map elements.

    Acked-by: Peter Zijlstra (Intel)
    Signed-off-by: Alexei Starovoitov
    Signed-off-by: Daniel Borkmann

    Alexei Starovoitov
     

26 Oct, 2018

1 commit

  • Commit f1a2e44a3aec ("bpf: add queue and stack maps") probably just
    copy-pasted .pkt_access for bpf_map_{pop,peek}_elem() helpers, but
    this is buggy in this context since it would allow writes into cloned
    skbs which is invalid. Therefore, disable .pkt_access for the two.

    Fixes: f1a2e44a3aec ("bpf: add queue and stack maps")
    Signed-off-by: Daniel Borkmann
    Acked-by: Alexei Starovoitov
    Cc: Mauricio Vasquez B
    Acked-by: Mauricio Vasquez B
    Signed-off-by: Alexei Starovoitov

    Daniel Borkmann
     

20 Oct, 2018

1 commit

  • Queue/stack maps implement a FIFO/LIFO data storage for eBPF programs.
    These maps support peek, pop and push operations that are exposed to eBPF
    programs through the new bpf_map_{peek,pop,push}_elem helpers. Those
    operations are exposed to userspace applications through the already
    existing syscalls in the following way:

    BPF_MAP_LOOKUP_ELEM -> peek
    BPF_MAP_LOOKUP_AND_DELETE_ELEM -> pop
    BPF_MAP_UPDATE_ELEM -> push
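
    For illustration, a sketch of that mapping from user space, assuming
    libbpf's low-level syscall wrappers and a BPF_MAP_TYPE_QUEUE with a __u32
    value; queue/stack maps take no key, so NULL is passed for it:

        #include <linux/types.h>
        #include <bpf/bpf.h>

        int queue_roundtrip(int queue_fd)
        {
                __u32 in = 1234, out = 0;
                int err;

                err = bpf_map_update_elem(queue_fd, NULL, &in, 0);           /* push */
                if (err)
                        return err;
                err = bpf_map_lookup_elem(queue_fd, NULL, &out);             /* peek */
                if (err)
                        return err;
                return bpf_map_lookup_and_delete_elem(queue_fd, NULL, &out); /* pop */
        }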

    Queue/stack maps are implemented using a buffer, tail and head indexes,
    hence BPF_F_NO_PREALLOC is not supported.

    Unlike other maps, queue and stack do not use RCU for protecting map
    values; the bpf_map_{peek,pop}_elem helpers have an
    ARG_PTR_TO_UNINIT_MAP_VALUE argument, which is a pointer to a memory zone
    where the map value is saved. It is basically the same as
    ARG_PTR_TO_UNINIT_MEM, except that the size does not have to be passed as
    an extra argument.

    Our main motivation for implementing queue/stack maps was to keep track
    of a pool of elements, like network ports in a SNAT; however, we foresee
    other use cases, like for example saving the last N kernel events in a
    map and then analysing them from userspace.

    Signed-off-by: Mauricio Vasquez B
    Acked-by: Song Liu
    Signed-off-by: Alexei Starovoitov

    Mauricio Vasquez B
     

01 Oct, 2018

3 commits

  • This commit introduces per-cpu cgroup local storage.

    Per-cpu cgroup local storage is very similar to simple cgroup storage
    (let's call it shared), except all the data is per-cpu.

    The main goal of the per-cpu variant is to implement super fast
    counters (e.g. packet counters) which require neither lookups nor
    atomic operations.

    From userspace's point of view, accessing a per-cpu cgroup storage
    is similar to other per-cpu map types (e.g. per-cpu hashmaps and
    arrays).

    Writing to a per-cpu cgroup storage is not atomic, but is performed
    by copying longs, so some minimal atomicity is provided, exactly
    as with other per-cpu maps.
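
    A minimal sketch of such a packet counter, assuming the BTF-defined map
    syntax of current libbpf and a cgroup skb attachment (names are
    illustrative):

        #include <linux/bpf.h>
        #include <bpf/bpf_helpers.h>

        struct {
                __uint(type, BPF_MAP_TYPE_PERCPU_CGROUP_STORAGE);
                __type(key, struct bpf_cgroup_storage_key);
                __type(value, __u64);
        } pkt_cnt SEC(".maps");

        SEC("cgroup_skb/egress")
        int count_packets(struct __sk_buff *skb)
        {
                __u64 *cnt = bpf_get_local_storage(&pkt_cnt, 0);

                (*cnt)++;       /* per-cpu, so a plain increment suffices */
                return 1;       /* allow the packet */
        }

        char _license[] SEC("license") = "GPL";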

    Signed-off-by: Roman Gushchin
    Cc: Daniel Borkmann
    Cc: Alexei Starovoitov
    Acked-by: Song Liu
    Signed-off-by: Daniel Borkmann

    Roman Gushchin
     
  • To simplify the following introduction of per-cpu cgroup storage,
    let's rework a bit the mechanism of passing a pointer to a cgroup
    storage into bpf_get_local_storage(): save a pointer to the
    corresponding bpf_cgroup_storage structure instead of a pointer
    to the actual buffer.

    It will help us handle per-cpu storage later, which has a
    different way of accessing the actual data.

    Signed-off-by: Roman Gushchin
    Acked-by: Song Liu
    Cc: Daniel Borkmann
    Cc: Alexei Starovoitov
    Signed-off-by: Daniel Borkmann

    Roman Gushchin
     
  • In order to introduce per-cpu cgroup storage, let's generalize
    bpf cgroup core to support multiple cgroup storage types.
    Potentially, per-node cgroup storage can be added later.

    This commit is mostly a formal change that replaces the cgroup_storage
    pointer with an array of cgroup_storage pointers.
    It doesn't actually introduce a new storage type; that will be
    done later.

    Each bpf program is now able to have one cgroup storage of each type.

    Signed-off-by: Roman Gushchin
    Acked-by: Song Liu
    Cc: Daniel Borkmann
    Cc: Alexei Starovoitov
    Signed-off-by: Daniel Borkmann

    Roman Gushchin
     

03 Aug, 2018

1 commit

  • The bpf_get_local_storage() helper function is used
    to get a pointer to the bpf local storage from a bpf program.

    It takes a pointer to a storage map and flags as arguments.
    Right now it accepts only cgroup storage maps, and the flags
    argument has to be 0. Later it can be extended to support
    other types of local storage, e.g. thread local storage.
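
    A minimal usage sketch, assuming a shared (non-per-cpu) cgroup storage map
    in the BTF-defined syntax of current libbpf; since all CPUs share the same
    buffer here, the update uses an atomic add:

        #include <linux/bpf.h>
        #include <bpf/bpf_helpers.h>

        struct {
                __uint(type, BPF_MAP_TYPE_CGROUP_STORAGE);
                __type(key, struct bpf_cgroup_storage_key);
                __type(value, __u64);
        } byte_cnt SEC(".maps");

        SEC("cgroup_skb/ingress")
        int count_bytes(struct __sk_buff *skb)
        {
                __u64 *bytes = bpf_get_local_storage(&byte_cnt, 0); /* flags must be 0 */

                __sync_fetch_and_add(bytes, skb->len);
                return 1;
        }

        char _license[] SEC("license") = "GPL";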

    Signed-off-by: Roman Gushchin
    Cc: Alexei Starovoitov
    Cc: Daniel Borkmann
    Acked-by: Martin KaFai Lau
    Signed-off-by: Daniel Borkmann

    Roman Gushchin
     

04 Jun, 2018

1 commit

  • bpf has been used extensively for tracing. For example, bcc
    contains an almost full set of bpf-based tools to trace kernel
    and user functions/events. Most tracing tools are currently
    either filtered based on pid or system-wide.

    Containers have been used quite extensively in industry and
    cgroup is often used together to provide resource isolation
    and protection. Several processes may run inside the same
    container. It is often desirable to get container-level tracing
    results as well, e.g. syscall count, function count, I/O
    activity, etc.

    This patch implements a new helper, bpf_get_current_cgroup_id(),
    which returns the id of the cgroup within which the current task
    is running.

    A later patch will provide an example showing that userspace can
    obtain the same cgroup id, so it can configure a filter or policy
    in the bpf program based on the task's cgroup id.

    The helper is currently implemented for tracing. It can
    be added to other program types as well when needed.
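
    A minimal sketch of the filtering pattern described above: a tracepoint
    program that only counts events coming from one cgroup. The target_cgid
    and match_cnt globals are hypothetical and assumed to be read/written from
    user space via current libbpf:

        #include <linux/bpf.h>
        #include <bpf/bpf_helpers.h>

        __u64 target_cgid;      /* set from user space before attaching */
        __u64 match_cnt;

        SEC("tracepoint/syscalls/sys_enter_write")
        int count_writes(void *ctx)
        {
                if (bpf_get_current_cgroup_id() != target_cgid)
                        return 0;

                __sync_fetch_and_add(&match_cnt, 1);
                return 0;
        }

        char _license[] SEC("license") = "GPL";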

    Acked-by: Alexei Starovoitov
    Signed-off-by: Yonghong Song
    Signed-off-by: Alexei Starovoitov

    Yonghong Song
     

10 Jan, 2017

1 commit


23 Oct, 2016

1 commit

  • The use case is mainly for soreuseport to select sockets on the local
    NUMA node, but since it is generic, let's also add this for other
    networking and tracing program types.

    Suggested-by: Eric Dumazet
    Signed-off-by: Daniel Borkmann
    Acked-by: Alexei Starovoitov
    Acked-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Daniel Borkmann
     

21 Sep, 2016

1 commit

  • This work implements direct packet access for helpers and direct packet
    write in a similar fashion as already available for XDP types via commits
    4acf6c0b84c9 ("bpf: enable direct packet data write for xdp progs") and
    6841de8b0d03 ("bpf: allow helpers access the packet directly"), and as a
    complementary feature to the already available direct packet read for tc
    (cls/act) programs.

    To enable this, we need to introduce two helpers, bpf_skb_pull_data()
    and bpf_csum_update(). The first is generally needed for both read and
    write, because programs would otherwise be limited to the current linear
    skb head. Usually, when the data_end test fails, programs just bail out,
    or, in the direct read case, use bpf_skb_load_bytes() as an alternative
    to overcome this limitation. If such data sits in non-linear parts, we
    can just pull it in once with the new helper, retest and eventually
    access it.
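
    A minimal sketch of that pull/retest/access pattern in a tc (cls_act)
    program; the offset and length are illustrative:

        #include <linux/bpf.h>
        #include <linux/pkt_cls.h>
        #include <bpf/bpf_helpers.h>

        SEC("tc")
        int touch_payload(struct __sk_buff *skb)
        {
                const int need = 64;    /* bytes we want to access directly */
                void *data = (void *)(long)skb->data;
                void *data_end = (void *)(long)skb->data_end;

                if (data + need > data_end) {
                        /* Data may sit in non-linear parts (or the skb is
                         * cloned): pull it into the linear head, then retest. */
                        if (bpf_skb_pull_data(skb, need))
                                return TC_ACT_OK;
                        data = (void *)(long)skb->data;
                        data_end = (void *)(long)skb->data_end;
                        if (data + need > data_end)
                                return TC_ACT_OK;
                }

                ((__u8 *)data)[14] = 0; /* direct packet write */
                return TC_ACT_OK;
        }

        char _license[] SEC("license") = "GPL";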

    At the same time, this also makes sure the skb is uncloned, which is, of
    course, a necessary condition for direct write. As this needs to be an
    invariant for the write part only, the verifier detects writes and adds
    a prologue that is calling bpf_skb_pull_data() to effectively unclone the
    skb from the very beginning in case it is indeed cloned. The heuristic
    makes use of a similar trick that was done in 233577a22089 ("net: filter:
    constify detection of pkt_type_offset"). This comes at zero cost for other
    programs that do not use the direct write feature. Should a program use
    this feature only sparsely and has read access for the most parts with,
    for example, drop return codes, then such write action can be delegated
    to a tail called program for mitigating this cost of potential uncloning
    to a late point in time where it would have been paid similarly with the
    bpf_skb_store_bytes() as well. The advantage of direct write is that
    writes are inlined, whereas the helper cannot make any length assumptions
    and thus needs to generate a call to memcpy() even for small sizes; the
    cost of the helper call itself with its sanity checks is also avoided.
    Plus, when
    direct read is already used, we don't need to cache or perform rechecks
    on the data boundaries (due to verifier invalidating previous checks for
    helpers that change skb->data), so more complex programs using rewrites
    can benefit from switching to direct read plus write.

    For direct packet access to helpers, we save the otherwise needed copy into
    a temp struct sitting on stack memory when use-case allows. Both facilities
    are enabled via may_access_direct_pkt_data() in verifier. For now, we limit
    this to map helpers and csum_diff, and can successively enable other helpers
    where we find it makes sense. Helpers that definitely cannot be allowed for
    this are those part of bpf_helper_changes_skb_data() since they can change
    underlying data, and those that write into memory as this could happen for
    packet typed args when still cloned. The bpf_csum_update() helper
    accommodates the fact that we need to fix up checksum_complete when using
    direct write instead of bpf_skb_store_bytes(), meaning the programs can
    use available
    helpers like bpf_csum_diff(), and implement csum_add(), csum_sub(),
    csum_block_add(), csum_block_sub() equivalents in eBPF together with the
    new helper. A usage example will be provided for iproute2's examples/bpf/
    directory.

    Signed-off-by: Daniel Borkmann
    Acked-by: Alexei Starovoitov
    Signed-off-by: David S. Miller

    Daniel Borkmann
     

10 Sep, 2016

2 commits

  • This work adds BPF_CALL_() macros and converts all the eBPF helper functions
    to use them, in a similar fashion like we do with SYSCALL_DEFINE() macros
    that are used today. Motivation for this is to hide all the register handling
    and all necessary casts from the user, so that it is done automatically in the
    background when adding a BPF_CALL_() call.

    This makes current helpers easier to review, makes it easier to write
    future helpers, avoids getting the casting mess wrong, and allows for
    extending all helpers at once (f.e. build time checks, etc). It also makes
    it easier to detect in code reviews that unused registers are not
    instrumented in the code by accident, breaking compatibility with existing
    programs.

    BPF_CALL_() internals are quite similar to SYSCALL_DEFINE() ones with some
    fundamental differences, for example, for generating the actual helper function
    that carries all u64 regs, we need to fill unused regs, so that we always end up
    with 5 u64 regs as an argument.
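
    For illustration, the shape such a definition takes; the helper itself,
    bpf_example_add, is hypothetical, while the macro and proto fields follow
    the kernel's include/linux/filter.h and bpf.h:

        #include <linux/bpf.h>
        #include <linux/filter.h>

        /* Expands to a wrapper that takes five u64 registers and casts them
         * to the declared argument types before calling the body below. */
        BPF_CALL_2(bpf_example_add, u64, a, u64, b)
        {
                return a + b;
        }

        static const struct bpf_func_proto bpf_example_add_proto = {
                .func           = bpf_example_add,
                .gpl_only       = false,
                .ret_type       = RET_INTEGER,
                .arg1_type      = ARG_ANYTHING,
                .arg2_type      = ARG_ANYTHING,
        };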

    I reviewed several 0-5 generated BPF_CALL_() variants of the .i results and
    they look all as expected. No sparse issue spotted. We let this also sit for a
    few days with Fengguang's kbuild test robot, and there were no issues seen. On
    s390, it barked on the "uses dynamic stack allocation" notice, which is an old
    one from bpf_perf_event_output{,_tp}() reappearing here due to the conversion
    to the call wrapper, just telling that the perf raw record/frag sits on stack
    (gcc with s390's -mwarn-dynamicstack), but that's all. Did various runtime tests
    and they were fine as well. All eBPF helpers are now converted to use these
    macros, getting rid of a good chunk of all the raw castings.

    Signed-off-by: Daniel Borkmann
    Acked-by: Alexei Starovoitov
    Signed-off-by: David S. Miller

    Daniel Borkmann
     
  • Some minor misc cleanups, f.e. use sizeof(__u32) instead of hardcoding
    it, and in __bpf_skb_max_len(), I missed that we always have skb->dev
    valid anyway, so we can drop the unneeded test for dev; a few more other
    misc bits are addressed here as well.

    Signed-off-by: Daniel Borkmann
    Acked-by: Alexei Starovoitov
    Signed-off-by: David S. Miller

    Daniel Borkmann
     

30 Jun, 2016

1 commit

  • Use smp_processor_id() for the generic helper bpf_get_smp_processor_id()
    instead of the raw variant. This allows for preemption checks when we
    have DEBUG_PREEMPT, and otherwise uses the raw variant anyway. We only
    need to keep the raw variant for socket filters, but we can reuse the
    helper that is already there from cBPF side.

    Signed-off-by: Daniel Borkmann
    Acked-by: Alexei Starovoitov
    Signed-off-by: David S. Miller

    Daniel Borkmann
     

15 Apr, 2016

1 commit

  • This patch converts all helpers that can use ARG_PTR_TO_RAW_STACK as argument
    type. For tc programs this is bpf_skb_load_bytes(), bpf_skb_get_tunnel_key(),
    bpf_skb_get_tunnel_opt(). For tracing, this optimizes bpf_get_current_comm()
    and bpf_probe_read(). The check in bpf_skb_load_bytes() for MAX_BPF_STACK can
    also be removed since the verifier already makes sure we stay within bounds
    on stack buffers.

    Signed-off-by: Daniel Borkmann
    Acked-by: Alexei Starovoitov
    Signed-off-by: David S. Miller

    Daniel Borkmann
     

10 Mar, 2016

1 commit

  • Lots of places in the kernel use memcpy(buf, comm, TASK_COMM_LEN); but
    the result is typically passed to print("%s", buf) and extra bytes
    after the terminating zero don't cause any harm.
    In bpf the result of bpf_get_current_comm() is used as part of a map
    key and was causing spurious hash map mismatches.
    Use strlcpy() to guarantee a zero-terminated string.
    The bpf verifier checks that the output buffer is zero-initialized,
    so even for short task names the output buffer doesn't have junk bytes.
    Note it's not a security concern, since kprobe+bpf is root only.
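
    A sketch of the usage pattern that exposed the bug: the comm buffer is
    used verbatim as part of a hash map key, so stray bytes after the NUL
    would make otherwise-equal keys hash differently. The map definition uses
    the BTF-defined syntax of current libbpf; names are illustrative:

        #include <linux/bpf.h>
        #include <bpf/bpf_helpers.h>

        struct key_t {
                char comm[16];          /* TASK_COMM_LEN */
        };

        struct {
                __uint(type, BPF_MAP_TYPE_HASH);
                __uint(max_entries, 1024);
                __type(key, struct key_t);
                __type(value, __u64);
        } counts SEC(".maps");

        SEC("kprobe/vfs_write")
        int count_by_comm(void *ctx)
        {
                struct key_t key = {};  /* zero-initialized output buffer */
                __u64 one = 1, *val;

                bpf_get_current_comm(&key.comm, sizeof(key.comm));

                val = bpf_map_lookup_elem(&counts, &key);
                if (val)
                        __sync_fetch_and_add(val, 1);
                else
                        bpf_map_update_elem(&counts, &key, &one, BPF_NOEXIST);
                return 0;
        }

        char _license[] SEC("license") = "GPL";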

    Fixes: ffeedafbf023 ("bpf: introduce current->pid, tgid, uid, gid, comm accessors")
    Reported-by: Tobias Waldekranz
    Signed-off-by: Alexei Starovoitov
    Signed-off-by: David S. Miller

    Alexei Starovoitov
     

08 Oct, 2015

1 commit

  • While recently arguing in a seccomp discussion that raw prandom_u32()
    access shouldn't be exposed to unprivileged user space, I forgot the
    fact that the SKF_AD_RANDOM extension actually already does it for some
    time in cBPF via commit 4cd3675ebf74 ("filter: added BPF random opcode").

    Since prandom_u32() is being used in a lot of critical networking code,
    let's be more conservative and split their states. Furthermore, consolidate
    the eBPF and cBPF prandom handlers to use the new internal PRNG. For eBPF,
    bpf_get_prandom_u32() was only accessible to privileged users, but
    should that change one day, we also don't want to leak raw sequences
    through things like eBPF maps.

    One thought was also to have separate per-bpf_prog states, but due to ABI
    reasons this is not easily possible, i.e. the program code currently
    cannot access bpf_prog itself, and copying the rnd_state to/from the
    stack scratch space whenever a program uses the prng seems not really
    worth the trouble and seems too hacky. If needed, taus113 could in such
    cases be implemented within eBPF using a map entry to keep the state
    space, or get_random_bytes() could become a second helper in cases where
    performance would not be critical.

    Both sides can trigger a one-time late init via prandom_init_once() on
    the shared state. Performance-wise, there should even be a tiny gain
    as bpf_user_rnd_u32() saves one function call. The PRNG needs to live
    inside the BPF core since kernels could have a NET-less config as well.

    Signed-off-by: Daniel Borkmann
    Acked-by: Hannes Frederic Sowa
    Acked-by: Alexei Starovoitov
    Cc: Chema Gonzalez
    Signed-off-by: David S. Miller

    Daniel Borkmann
     

16 Jun, 2015

1 commit

  • eBPF programs attached to kprobes need to filter based on
    current->pid, uid and other fields, so introduce helper functions:

    u64 bpf_get_current_pid_tgid(void)
    Return: current->tgid << 32 | current->pid

    u64 bpf_get_current_uid_gid(void)
    Return: current_gid << 32 | current_uid

    bpf_get_current_comm(char *buf, int size_of_buf)
    stores current->comm into buf

    They can be used from programs attached to TC as well, to classify
    packets based on current task fields.

    Update the tracex2 example to print a histogram of write syscalls for
    each process instead of aggregating over all of them.
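
    A minimal sketch of unpacking these return values inside a kprobe program
    (the attach point and output are illustrative):

        #include <linux/bpf.h>
        #include <bpf/bpf_helpers.h>

        SEC("kprobe/vfs_write")
        int on_write(void *ctx)
        {
                __u64 pid_tgid = bpf_get_current_pid_tgid();
                __u64 uid_gid  = bpf_get_current_uid_gid();

                __u32 tgid = pid_tgid >> 32;    /* userspace-visible "pid" */
                __u32 pid  = (__u32)pid_tgid;   /* kernel task pid (tid) */
                __u32 uid  = (__u32)uid_gid;    /* gid sits in the upper half */

                bpf_printk("tgid=%u pid=%u uid=%u", tgid, pid, uid);
                return 0;
        }

        char _license[] SEC("license") = "GPL";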

    Signed-off-by: Alexei Starovoitov
    Signed-off-by: David S. Miller

    Alexei Starovoitov
     

01 Jun, 2015

2 commits

  • Among other things, move bpf_tail_call_proto to the remaining definitions
    of other protos, improve comments a bit (i.e. remove some obvious ones,
    where the code is already self-documenting, add objectives for others),
    simplify bpf_prog_array_compatible() a bit.

    Signed-off-by: Daniel Borkmann
    Acked-by: Alexei Starovoitov
    Signed-off-by: David S. Miller

    Daniel Borkmann
     
  • As this is already exported from tracing side via commit d9847d310ab4
    ("tracing: Allow BPF programs to call bpf_ktime_get_ns()"), we might
    as well want to move it to the core, so also networking users can make
    use of it, e.g. to measure diffs for certain flows from ingress/egress.

    Signed-off-by: Daniel Borkmann
    Cc: Alexei Starovoitov
    Cc: Ingo Molnar
    Signed-off-by: David S. Miller

    Daniel Borkmann
     

16 Mar, 2015

2 commits

  • This patch adds the possibility to obtain raw_smp_processor_id() in
    eBPF. Currently, this is only possible in classic BPF where commit
    da2033c28226 ("filter: add SKF_AD_RXHASH and SKF_AD_CPU") has added
    facilities for this.

    Perhaps most importantly, this would also allow us to track per CPU
    statistics with eBPF maps, or to implement a poor-man's per CPU data
    structure through eBPF maps.

    An example function prototype looks like:

    u32 (*smp_processor_id)(void) = (void *)BPF_FUNC_get_smp_processor_id;

    Signed-off-by: Daniel Borkmann
    Signed-off-by: David S. Miller

    Daniel Borkmann
     
  • This work is similar to commit 4cd3675ebf74 ("filter: added BPF
    random opcode") and adds a possibility for packet sampling in eBPF.

    Currently, this is only possible in classic BPF, and it is useful to
    combine sampling with f.e. packet sockets, possibly also with tc.

    An example function prototype looks like:

    u32 (*prandom_u32)(void) = (void *)BPF_FUNC_get_prandom_u32;

    Signed-off-by: Daniel Borkmann
    Signed-off-by: David S. Miller

    Daniel Borkmann
     

02 Mar, 2015

1 commit


19 Nov, 2014

1 commit