23 Sep, 2020

1 commit

  • Two minor conflicts:

    1) net/ipv4/route.c, adding a new local variable while
    moving another local variable and removing its
    initial assignment.

    2) drivers/net/dsa/microchip/ksz9477.c, overlapping changes.
    One pretty prints the port mode differently, whilst another
    changes the driver to try and obtain the port mode from
    the port node rather than the switch node.

    Signed-off-by: David S. Miller

    David S. Miller
     

16 Sep, 2020

1 commit

  • Running selftest
    ./test_btf -p
    the kernel had the following warning:
    [ 51.528185] WARNING: CPU: 3 PID: 1756 at kernel/bpf/hashtab.c:717 htab_map_get_next_key+0x2eb/0x300
    [ 51.529217] Modules linked in:
    [ 51.529583] CPU: 3 PID: 1756 Comm: test_btf Not tainted 5.9.0-rc1+ #878
    [ 51.530346] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.9.3-1.el7.centos 04/01/2014
    [ 51.531410] RIP: 0010:htab_map_get_next_key+0x2eb/0x300
    ...
    [ 51.542826] Call Trace:
    [ 51.543119] map_seq_next+0x53/0x80
    [ 51.543528] seq_read+0x263/0x400
    [ 51.543932] vfs_read+0xad/0x1c0
    [ 51.544311] ksys_read+0x5f/0xe0
    [ 51.544689] do_syscall_64+0x33/0x40
    [ 51.545116] entry_SYSCALL_64_after_hwframe+0x44/0xa9

    The related source code in kernel/bpf/hashtab.c:
    709 static int htab_map_get_next_key(struct bpf_map *map, void *key, void *next_key)
    710 {
    711 struct bpf_htab *htab = container_of(map, struct bpf_htab, map);
    712 struct hlist_nulls_head *head;
    713 struct htab_elem *l, *next_l;
    714 u32 hash, key_size;
    715 int i = 0;
    716
    717 WARN_ON_ONCE(!rcu_read_lock_held());

    In kernel/bpf/inode.c, the bpffs map pretty print calls map->ops->map_get_next_key()
    without holding rcu_read_lock(), hence causing the above warning.
    To fix the issue, surround map->ops->map_get_next_key() with the RCU read lock.
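
    The shape of the fix can be sketched in userspace as follows. This is a
    minimal sketch, not the kernel patch: the rcu_* functions are stand-ins
    that only track nesting depth, and demo_get_next_key(), next_key_unlocked()
    and next_key_locked() are hypothetical names used for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Userspace stand-ins for the kernel RCU primitives (assumption: they only
 * track nesting depth so the example can run outside the kernel). */
static int rcu_depth;
static void rcu_read_lock(void)      { rcu_depth++; }
static void rcu_read_unlock(void)    { rcu_depth--; }
static bool rcu_read_lock_held(void) { return rcu_depth > 0; }

/* Counts what WARN_ON_ONCE(!rcu_read_lock_held()) would have reported. */
static int warnings;

/* Hypothetical callback that mirrors the check at hashtab.c:717. */
static int demo_get_next_key(const int *key, int *next_key)
{
	assert(next_key != NULL);
	if (!rcu_read_lock_held())
		warnings++;
	*next_key = key ? *key + 1 : 0;
	return 0;
}

/* Caller before the fix: no read-side critical section. */
static int next_key_unlocked(const int *key, int *next_key)
{
	return demo_get_next_key(key, next_key);
}

/* Caller after the fix: the callback runs under the RCU read lock. */
static int next_key_locked(const int *key, int *next_key)
{
	int err;

	rcu_read_lock();
	err = demo_get_next_key(key, next_key);
	rcu_read_unlock();
	return err;
}
```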

    Fixes: a26ca7c982cb ("bpf: btf: Add pretty print support to the basic arraymap")
    Reported-by: Alexei Starovoitov
    Signed-off-by: Yonghong Song
    Signed-off-by: Alexei Starovoitov
    Acked-by: Andrii Nakryiko
    Cc: Martin KaFai Lau
    Link: https://lore.kernel.org/bpf/20200916004401.146277-1-yhs@fb.com

    Yonghong Song
     

20 Aug, 2020

1 commit

  • Add kernel module with user mode driver that populates bpffs with
    BPF iterators.

    $ mount bpffs /my/bpffs/ -t bpf
    $ ls -la /my/bpffs/
    total 4
    drwxrwxrwt 2 root root 0 Jul 2 00:27 .
    drwxr-xr-x 19 root root 4096 Jul 2 00:09 ..
    -rw------- 1 root root 0 Jul 2 00:27 maps.debug
    -rw------- 1 root root 0 Jul 2 00:27 progs.debug

    The user mode driver will load BPF Type Formats, create BPF maps, populate BPF
    maps, load two BPF programs, attach them to BPF iterators, and finally send two
    bpf_link IDs back to the kernel.
    The kernel will pin two bpf_links into newly mounted bpffs instance under
    names "progs.debug" and "maps.debug". These two files become human readable.

    $ cat /my/bpffs/progs.debug
    id name attached
    11 dump_bpf_map bpf_iter_bpf_map
    12 dump_bpf_prog bpf_iter_bpf_prog
    27 test_pkt_access
    32 test_main test_pkt_access test_pkt_access
    33 test_subprog1 test_pkt_access_subprog1 test_pkt_access
    34 test_subprog2 test_pkt_access_subprog2 test_pkt_access
    35 test_subprog3 test_pkt_access_subprog3 test_pkt_access
    36 new_get_skb_len get_skb_len test_pkt_access
    37 new_get_skb_ifindex get_skb_ifindex test_pkt_access
    38 new_get_constant get_constant test_pkt_access

    The BPF program dump_bpf_prog() in iterators.bpf.c prints this data about
    all BPF programs currently loaded in the system. This information is unstable
    and will change from kernel to kernel, as the ".debug" suffix conveys.

    Signed-off-by: Alexei Starovoitov
    Signed-off-by: Daniel Borkmann
    Link: https://lore.kernel.org/bpf/20200819042759.51280-4-alexei.starovoitov@gmail.com

    Alexei Starovoitov
     

10 May, 2020

1 commit

  • To produce a file bpf iterator, the fd must
    correspond to a link_fd associated with a
    trace/iter program. When the pinned file is
    opened, a seq_file will be generated.

    Signed-off-by: Yonghong Song
    Signed-off-by: Alexei Starovoitov
    Acked-by: Andrii Nakryiko
    Link: https://lore.kernel.org/bpf/20200509175906.2475893-1-yhs@fb.com

    Yonghong Song
     

03 Mar, 2020

1 commit

  • Introduce bpf_link abstraction, representing an attachment of a BPF program to
    a BPF hook point (e.g., tracepoint, perf event, etc). bpf_link encapsulates
    ownership of the attached BPF program and reference counting of the link itself
    when it is referenced from multiple anonymous inodes, and ensures that the release
    callback is called from process context, so that users can safely take
    mutex locks and sleep.
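
    The reference counting idea can be sketched in userspace as follows. This
    is a simplified illustration under stated assumptions: struct demo_link and
    the demo_link_*() helpers are hypothetical and are not the kernel's
    bpf_link API, and the "release" step is reduced to setting a flag.

```c
#include <assert.h>

/* Hypothetical stand-in for a link object shared by several holders. */
struct demo_link {
	int refcnt;   /* number of holders (fds, pinned files, ...) */
	int released; /* set once the last reference is dropped */
};

static void demo_link_init(struct demo_link *l)
{
	l->refcnt = 1;
	l->released = 0;
}

/* Take an additional reference, e.g. when the link is pinned. */
static void demo_link_get(struct demo_link *l)
{
	assert(l->refcnt > 0);
	l->refcnt++;
}

/* Drop a reference; the program is detached only when the last one goes. */
static void demo_link_put(struct demo_link *l)
{
	assert(l->refcnt > 0);
	if (--l->refcnt == 0)
		l->released = 1;
}
```

    In this sketch, a holder that exits calls demo_link_put(), and the
    attachment survives as long as another reference (for example a pinned
    one) still exists.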

    Additionally, with the new abstraction it is now possible to generalize pinning
    of a link object in BPF FS. Pinning a link explicitly prevents BPF program
    detachment on process exit and lets another, independent process open it
    and keep working with it.

    Convert two existing bpf_link-like objects (raw tracepoint and tracing BPF
    program attachments) into utilizing bpf_link framework, making them pinnable
    in BPF FS. More FD-based bpf_links will be added in follow up patches.

    Signed-off-by: Andrii Nakryiko
    Signed-off-by: Alexei Starovoitov
    Link: https://lore.kernel.org/bpf/20200303043159.323675-2-andriin@fb.com

    Andrii Nakryiko
     

09 Feb, 2020

1 commit

  • Pull vfs file system parameter updates from Al Viro:
    "Saner fs_parser.c guts and data structures. The system-wide registry
    of syntax types (string/enum/int32/oct32/.../etc.) is gone and so is
    the horror switch() in fs_parse() that would have to grow another case
    every time something got added to that system-wide registry.

    New syntax types can be added by filesystems easily now, and their
    namespace is that of functions - not of system-wide enum members. IOW,
    they can be shared or kept private and if some turn out to be widely
    useful, we can make them common library helpers, etc., without having
    to do anything whatsoever to fs_parse() itself.

    And we already get that kind of requests - the thing that finally
    pushed me into doing that was "oh, and let's add one for timeouts -
    things like 15s or 2h". If some filesystem really wants that, let them
    do it. Without somebody having to play gatekeeper for the variants
    blessed by direct support in fs_parse(), TYVM.

    Quite a bit of boilerplate is gone. And IMO the data structures make a
    lot more sense now. -200LoC, while we are at it"

    * 'merge.nfs-fs_parse.1' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (25 commits)
    tmpfs: switch to use of invalfc()
    cgroup1: switch to use of errorfc() et.al.
    procfs: switch to use of invalfc()
    hugetlbfs: switch to use of invalfc()
    cramfs: switch to use of errofc() et.al.
    gfs2: switch to use of errorfc() et.al.
    fuse: switch to use errorfc() et.al.
    ceph: use errorfc() and friends instead of spelling the prefix out
    prefix-handling analogues of errorf() and friends
    turn fs_param_is_... into functions
    fs_parse: handle optional arguments sanely
    fs_parse: fold fs_parameter_desc/fs_parameter_spec
    fs_parser: remove fs_parameter_description name field
    add prefix to fs_context->log
    ceph_parse_param(), ceph_parse_mon_ips(): switch to passing fc_log
    new primitive: __fs_parse()
    switch rbd and libceph to p_log-based primitives
    struct p_log, variants of warnf() et.al. taking that one instead
    teach logfc() to handle prefices, give it saner calling conventions
    get rid of cg_invalf()
    ...

    Linus Torvalds
     

08 Feb, 2020

2 commits


27 Jan, 2020

1 commit

  • If the seq_file .next function does not change the position index,
    a read after some lseek can generate unexpected output.
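
    The rule can be illustrated with a small userspace iterator. This is a
    sketch under stated assumptions, not the seq_file API: demo_items and
    demo_seq_next() are hypothetical, and the point shown is only that the
    position index is advanced on every call, including the one that reaches
    the end.

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_ITEMS 3
static const int demo_items[DEMO_ITEMS] = { 10, 20, 30 };

/* Return the element after *pos, or NULL at the end of the sequence.
 * The position index is updated unconditionally, even when NULL is returned. */
static const int *demo_seq_next(long *pos)
{
	assert(pos != NULL);
	++*pos;
	if (*pos < 0 || *pos >= DEMO_ITEMS)
		return NULL;
	return &demo_items[*pos];
}
```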

    See also: https://bugzilla.kernel.org/show_bug.cgi?id=206283

    v1 -> v2: removed missed increment in end of function

    Signed-off-by: Vasily Averin
    Signed-off-by: Daniel Borkmann
    Link: https://lore.kernel.org/bpf/eca84fdd-c374-a154-d874-6c7b55fc3bc4@virtuozzo.com

    Vasily Averin
     

22 Jan, 2020

1 commit

  • kernel/bpf/inode.c misuses kern_path...() - it's much simpler (and
    more efficient, on top of that) to use user_path...() counterparts
    rather than bothering with doing getname() manually.

    Signed-off-by: Al Viro
    Signed-off-by: Daniel Borkmann
    Link: https://lore.kernel.org/bpf/20200120232858.GF8904@ZenIV.linux.org.uk

    Al Viro
     

18 Nov, 2019

2 commits

  • Similarly to bpf_map's refcnt/usercnt, convert bpf_prog's refcnt to atomic64
    and remove the artificial 32k limit. This makes bpf_prog's refcounting
    non-failing, simplifying the logic of users of bpf_prog_add/bpf_prog_inc.

    Validated compilation by running allyesconfig kernel build.

    Suggested-by: Daniel Borkmann
    Signed-off-by: Andrii Nakryiko
    Signed-off-by: Daniel Borkmann
    Link: https://lore.kernel.org/bpf/20191117172806.2195367-3-andriin@fb.com

    Andrii Nakryiko
     
  • 92117d8443bc ("bpf: fix refcnt overflow") turned refcounting of bpf_map into a
    potentially failing operation, which fails once the refcount reaches the
    BPF_MAX_REFCNT limit (32k). With a 32-bit counter, it is possible in practice to
    overflow the refcounter and make it wrap around to 0, causing an erroneous map
    free while there are still references to it, and thus use-after-free problems.

    But failing refcounting operations are problematic in some cases. One
    example is the mmap() interface. After establishing the initial memory mapping, the user
    is allowed to arbitrarily map/remap/unmap parts of the mapped memory, arbitrarily
    splitting it into multiple non-contiguous regions. All of this happens without
    any control from the users of the mmap subsystem. Instead, the mmap subsystem sends
    notifications to the original creator of the memory mapping through open/close
    callbacks, which are optionally specified during initial memory mapping
    creation. These callbacks are used to maintain an accurate refcount for bpf_map
    (see next patch in this series). The problem is that the open() callback is not
    supposed to fail, because the memory-mapped resource is already set up and properly
    referenced. This poses a problem for using memory mapping with BPF maps.

    One solution to this is to maintain separate refcount for just memory-mappings
    and do single bpf_map_inc/bpf_map_put when it goes from/to zero, respectively.
    There are similar use cases in current work on tcp-bpf, necessitating extra
    counter as well. This seems like a rather unfortunate and ugly solution that
    doesn't scale well to various new use cases.

    Another approach is to use the non-failing refcount_t type, which
    uses a 32-bit counter internally but, once it reaches the overflow state at UINT_MAX,
    stays there. This ultimately causes a memory leak, but prevents use-after-free.

    But given refcounting is not the most performance-critical operation with BPF
    maps (it's not used from running BPF program code), we can also just switch to
    64-bit counter that can't overflow in practice, potentially disadvantaging
    32-bit platforms a tiny bit. This simplifies semantics and allows above
    described scenarios to not worry about failing refcount increment operation.
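
    The three counter behaviours discussed above can be compared with a small
    userspace sketch. The helper names are hypothetical and the code uses plain
    (non-atomic) integers purely to show the arithmetic; it is not the kernel's
    atomic_t, refcount_t or atomic64_t implementation.

```c
#include <assert.h>
#include <stdint.h>

/* Plain 32-bit counter: incrementing past UINT32_MAX wraps to 0, which is
 * what makes a premature free (and use-after-free) possible. */
static uint32_t inc_wrapping(uint32_t c)
{
	return c + 1;
}

/* Saturating counter, as described for refcount_t above: once at the
 * maximum it stays there, trading a leak for safety. */
static uint32_t inc_saturating(uint32_t c)
{
	return c == UINT32_MAX ? c : c + 1;
}

/* 64-bit counter: the maximum is not expected to be reached in practice. */
static uint64_t inc_wide(uint64_t c)
{
	assert(c < UINT64_MAX);
	return c + 1;
}
```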

    In terms of struct bpf_map size, we are still good and use the same amount of
    space:

    BEFORE (3 cache lines, 8 bytes of padding at the end):
    struct bpf_map {
            const struct bpf_map_ops * ops __attribute__((__aligned__(64))); /* 0 8 */
            struct bpf_map * inner_map_meta; /* 8 8 */
            void * security; /* 16 8 */
            enum bpf_map_type map_type; /* 24 4 */
            u32 key_size; /* 28 4 */
            u32 value_size; /* 32 4 */
            u32 max_entries; /* 36 4 */
            u32 map_flags; /* 40 4 */
            int spin_lock_off; /* 44 4 */
            u32 id; /* 48 4 */
            int numa_node; /* 52 4 */
            u32 btf_key_type_id; /* 56 4 */
            u32 btf_value_type_id; /* 60 4 */
            /* --- cacheline 1 boundary (64 bytes) --- */
            struct btf * btf; /* 64 8 */
            struct bpf_map_memory memory; /* 72 16 */
            bool unpriv_array; /* 88 1 */
            bool frozen; /* 89 1 */

            /* XXX 38 bytes hole, try to pack */

            /* --- cacheline 2 boundary (128 bytes) --- */
            atomic_t refcnt __attribute__((__aligned__(64))); /* 128 4 */
            atomic_t usercnt; /* 132 4 */
            struct work_struct work; /* 136 32 */
            char name[16]; /* 168 16 */

            /* size: 192, cachelines: 3, members: 21 */
            /* sum members: 146, holes: 1, sum holes: 38 */
            /* padding: 8 */
            /* forced alignments: 2, forced holes: 1, sum forced holes: 38 */
    } __attribute__((__aligned__(64)));

    AFTER (same 3 cache lines, no extra padding now):
    struct bpf_map {
            const struct bpf_map_ops * ops __attribute__((__aligned__(64))); /* 0 8 */
            struct bpf_map * inner_map_meta; /* 8 8 */
            void * security; /* 16 8 */
            enum bpf_map_type map_type; /* 24 4 */
            u32 key_size; /* 28 4 */
            u32 value_size; /* 32 4 */
            u32 max_entries; /* 36 4 */
            u32 map_flags; /* 40 4 */
            int spin_lock_off; /* 44 4 */
            u32 id; /* 48 4 */
            int numa_node; /* 52 4 */
            u32 btf_key_type_id; /* 56 4 */
            u32 btf_value_type_id; /* 60 4 */
            /* --- cacheline 1 boundary (64 bytes) --- */
            struct btf * btf; /* 64 8 */
            struct bpf_map_memory memory; /* 72 16 */
            bool unpriv_array; /* 88 1 */
            bool frozen; /* 89 1 */

            /* XXX 38 bytes hole, try to pack */

            /* --- cacheline 2 boundary (128 bytes) --- */
            atomic64_t refcnt __attribute__((__aligned__(64))); /* 128 8 */
            atomic64_t usercnt; /* 136 8 */
            struct work_struct work; /* 144 32 */
            char name[16]; /* 176 16 */

            /* size: 192, cachelines: 3, members: 21 */
            /* sum members: 154, holes: 1, sum holes: 38 */
            /* forced alignments: 2, forced holes: 1, sum forced holes: 38 */
    } __attribute__((__aligned__(64)));

    This patch, while modifying all users of bpf_map_inc, also cleans up its
    interface to match bpf_map_put with separate operations for bpf_map_inc and
    bpf_map_inc_with_uref (to match bpf_map_put and bpf_map_put_with_uref,
    respectively). Also, given there are no users of bpf_map_inc_not_zero
    specifying uref=true, remove uref flag and default to uref=false internally.

    Signed-off-by: Andrii Nakryiko
    Signed-off-by: Daniel Borkmann
    Acked-by: Song Liu
    Link: https://lore.kernel.org/bpf/20191117172806.2195367-2-andriin@fb.com

    Andrii Nakryiko
     

19 Sep, 2019

1 commit

  • Convert the bpf filesystem to the new internal mount API as the old
    one will be obsoleted and removed. This allows greater flexibility in
    communication of mount parameters between userspace, the VFS and the
    filesystem.

    See Documentation/filesystems/mount_api.txt for more information.

    Signed-off-by: David Howells
    cc: Alexei Starovoitov
    cc: Daniel Borkmann
    cc: Martin KaFai Lau
    cc: Song Liu
    cc: Yonghong Song
    cc: netdev@vger.kernel.org
    cc: bpf@vger.kernel.org
    Signed-off-by: Al Viro

    David Howells
     

19 Jun, 2019

1 commit

  • Based on 2 normalized pattern(s):

    this program is free software you can redistribute it and or modify
    it under the terms of the gnu general public license version 2 as
    published by the free software foundation

    this program is free software you can redistribute it and or modify
    it under the terms of the gnu general public license version 2 as
    published by the free software foundation #

    extracted by the scancode license scanner the SPDX license identifier

    GPL-2.0-only

    has been chosen to replace the boilerplate/reference in 4122 file(s).

    Signed-off-by: Thomas Gleixner
    Reviewed-by: Enrico Weigelt
    Reviewed-by: Kate Stewart
    Reviewed-by: Allison Randal
    Cc: linux-spdx@vger.kernel.org
    Link: https://lkml.kernel.org/r/20190604081206.933168790@linutronix.de
    Signed-off-by: Greg Kroah-Hartman

    Thomas Gleixner
     

17 May, 2019

1 commit

  • When the iptables module loads a bpf program from a pinned location, it
    only retrieves a loaded program and cannot change the program content, so
    requiring write permission for it might not be necessary.
    Also, when adding or removing an unrelated iptables rule, it might need to
    flush and reload the xt_bpf related rules as well, which triggers the inode
    permission check. It might be better to remove the write permission
    check for the inode so we won't need to grant write access to all the
    processes that flush and restore iptables rules.

    Signed-off-by: Chenbo Feng
    Signed-off-by: Alexei Starovoitov

    Chenbo Feng
     

02 May, 2019

1 commit


26 Mar, 2019

1 commit

  • syzkaller was able to generate the following UAF in bpf:

    BUG: KASAN: use-after-free in lookup_last fs/namei.c:2269 [inline]
    BUG: KASAN: use-after-free in path_lookupat.isra.43+0x9f8/0xc00 fs/namei.c:2318
    Read of size 1 at addr ffff8801c4865c47 by task syz-executor2/9423

    CPU: 0 PID: 9423 Comm: syz-executor2 Not tainted 4.20.0-rc1-next-20181109+
    #110
    Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS
    Google 01/01/2011
    Call Trace:
    __dump_stack lib/dump_stack.c:77 [inline]
    dump_stack+0x244/0x39d lib/dump_stack.c:113
    print_address_description.cold.7+0x9/0x1ff mm/kasan/report.c:256
    kasan_report_error mm/kasan/report.c:354 [inline]
    kasan_report.cold.8+0x242/0x309 mm/kasan/report.c:412
    __asan_report_load1_noabort+0x14/0x20 mm/kasan/report.c:430
    lookup_last fs/namei.c:2269 [inline]
    path_lookupat.isra.43+0x9f8/0xc00 fs/namei.c:2318
    filename_lookup+0x26a/0x520 fs/namei.c:2348
    user_path_at_empty+0x40/0x50 fs/namei.c:2608
    user_path include/linux/namei.h:62 [inline]
    do_mount+0x180/0x1ff0 fs/namespace.c:2980
    ksys_mount+0x12d/0x140 fs/namespace.c:3258
    __do_sys_mount fs/namespace.c:3272 [inline]
    __se_sys_mount fs/namespace.c:3269 [inline]
    __x64_sys_mount+0xbe/0x150 fs/namespace.c:3269
    do_syscall_64+0x1b9/0x820 arch/x86/entry/common.c:290
    entry_SYSCALL_64_after_hwframe+0x49/0xbe
    RIP: 0033:0x457569
    Code: fd b3 fb ff c3 66 2e 0f 1f 84 00 00 00 00 00 66 90 48 89 f8 48 89 f7
    48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 3d 01 f0 ff
    ff 0f 83 cb b3 fb ff c3 66 2e 0f 1f 84 00 00 00 00
    RSP: 002b:00007fde6ed96c78 EFLAGS: 00000246 ORIG_RAX: 00000000000000a5
    RAX: ffffffffffffffda RBX: 0000000000000005 RCX: 0000000000457569
    RDX: 0000000020000040 RSI: 0000000020000000 RDI: 0000000000000000
    RBP: 000000000072bf00 R08: 0000000020000340 R09: 0000000000000000
    R10: 0000000000200000 R11: 0000000000000246 R12: 00007fde6ed976d4
    R13: 00000000004c2c24 R14: 00000000004d4990 R15: 00000000ffffffff

    Allocated by task 9424:
    save_stack+0x43/0xd0 mm/kasan/kasan.c:448
    set_track mm/kasan/kasan.c:460 [inline]
    kasan_kmalloc+0xc7/0xe0 mm/kasan/kasan.c:553
    __do_kmalloc mm/slab.c:3722 [inline]
    __kmalloc_track_caller+0x157/0x760 mm/slab.c:3737
    kstrdup+0x39/0x70 mm/util.c:49
    bpf_symlink+0x26/0x140 kernel/bpf/inode.c:356
    vfs_symlink+0x37a/0x5d0 fs/namei.c:4127
    do_symlinkat+0x242/0x2d0 fs/namei.c:4154
    __do_sys_symlink fs/namei.c:4173 [inline]
    __se_sys_symlink fs/namei.c:4171 [inline]
    __x64_sys_symlink+0x59/0x80 fs/namei.c:4171
    do_syscall_64+0x1b9/0x820 arch/x86/entry/common.c:290
    entry_SYSCALL_64_after_hwframe+0x49/0xbe

    Freed by task 9425:
    save_stack+0x43/0xd0 mm/kasan/kasan.c:448
    set_track mm/kasan/kasan.c:460 [inline]
    __kasan_slab_free+0x102/0x150 mm/kasan/kasan.c:521
    kasan_slab_free+0xe/0x10 mm/kasan/kasan.c:528
    __cache_free mm/slab.c:3498 [inline]
    kfree+0xcf/0x230 mm/slab.c:3817
    bpf_evict_inode+0x11f/0x150 kernel/bpf/inode.c:565
    evict+0x4b9/0x980 fs/inode.c:558
    iput_final fs/inode.c:1550 [inline]
    iput+0x674/0xa90 fs/inode.c:1576
    do_unlinkat+0x733/0xa30 fs/namei.c:4069
    __do_sys_unlink fs/namei.c:4110 [inline]
    __se_sys_unlink fs/namei.c:4108 [inline]
    __x64_sys_unlink+0x42/0x50 fs/namei.c:4108
    do_syscall_64+0x1b9/0x820 arch/x86/entry/common.c:290
    entry_SYSCALL_64_after_hwframe+0x49/0xbe

    In this scenario path lookup under RCU is racing with the final
    unlink in case of symlinks. As Linus puts it in his analysis:

    [...] We actually RCU-delay the inode freeing itself, but
    when we do the final iput(), the "evict()" function is called
    synchronously. Now, the simple fix would seem to just RCU-delay
    the kfree() of the symlink data in bpf_evict_inode(). Maybe
    that's the right thing to do. [...]

    Al suggested to piggy-back on the ->destroy_inode() callback in
    order to implement RCU deferral there which can then kfree() the
    inode->i_link eventually right before putting inode back into
    inode cache. By reusing free_inode_nonrcu() from there we can
    avoid the need for our own inode cache and just reuse generic
    one as we currently do.

    And in fact, on top of all this, we should just get rid of
    bpf_evict_inode() entirely. This means truncate_inode_pages_final()
    and clear_inode() will then simply be called by the fs core via
    evict(). Dropping the reference should really only be done when the
    inode is unhashed and nothing is reachable anymore, so it's better
    also moved into the final ->destroy_inode() callback.

    Fixes: 0f98621bef5d ("bpf, inode: add support for symlinks and fix mtime/ctime")
    Reported-by: syzbot+fb731ca573367b7f6564@syzkaller.appspotmail.com
    Reported-by: syzbot+a13e5ead792d6df37818@syzkaller.appspotmail.com
    Reported-by: syzbot+7a8ba368b47fdefca61e@syzkaller.appspotmail.com
    Suggested-by: Al Viro
    Analyzed-by: Linus Torvalds
    Signed-off-by: Daniel Borkmann
    Acked-by: Alexei Starovoitov
    Acked-by: Linus Torvalds
    Acked-by: Al Viro
    Link: https://lore.kernel.org/lkml/0000000000006946d2057bbd0eef@google.com/T/

    Daniel Borkmann
     

13 Aug, 2018

1 commit

  • Commit a26ca7c982cb ("bpf: btf: Add pretty print support to
    the basic arraymap") and 699c86d6ec21 ("bpf: btf: add pretty
    print for hash/lru_hash maps") enabled support for BTF and
    dumping via BPF fs for array and hash/lru map. However, both
    can be decoupled from each other such that regular BPF maps
    can be supported for attaching BTF key/value information,
    while not all maps necessarily need to dump via map_seq_show_elem()
    callback.

    The basic sanity check which is a prerequisite for all maps
    is that key/value size has to match in any case, and some maps
    can have extra checks via map_check_btf() callback, e.g.
    probing certain types or indicating no support in general. With
    that we can also enable retrieving BTF info for per-cpu map
    types and lpm.

    Signed-off-by: Daniel Borkmann
    Acked-by: Alexei Starovoitov
    Acked-by: Yonghong Song

    Daniel Borkmann
     

11 Aug, 2018

1 commit

  • In function map_seq_next() of kernel/bpf/inode.c,
    the first key is "0" regardless of the map type.
    This works for arrays. But for hash maps, if
    key "0" happens to be in the map, the bpffs map show will miss
    some items if key "0" is not the first element of
    the first bucket.

    This patch fixes the issue by guaranteeing to get
    the first element when the seq_show has just started,
    by passing a NULL key pointer to the map_get_next_key() callback.
    This way, no elements will be missed in the
    bpffs hash table show even if key "0" is in the map.
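
    The effect can be reproduced with a small userspace model. This is a
    sketch under stated assumptions: demo_order stands for the (hypothetical)
    order in which a hash table would return its keys, and
    demo_get_next_key() only imitates the "NULL key returns the first
    element" convention described above.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical iteration order of a hash map: key 0 is not the first. */
#define DEMO_N 3
static const int demo_order[DEMO_N] = { 7, 0, 3 };

/* NULL key (or an unknown key) yields the first element; otherwise the
 * element following *key is returned, or -ENOENT after the last one. */
static int demo_get_next_key(const int *key, int *next_key)
{
	size_t i;

	assert(next_key != NULL);
	if (key) {
		for (i = 0; i < DEMO_N; i++) {
			if (demo_order[i] != *key)
				continue;
			if (i + 1 == DEMO_N)
				return -ENOENT;
			*next_key = demo_order[i + 1];
			return 0;
		}
	}
	*next_key = demo_order[0];
	return 0;
}

/* Count how many elements a walk visits when it starts from 'start'. */
static int demo_count_from(const int *start)
{
	const int *cur = start;
	int key, n = 0;

	while (demo_get_next_key(cur, &key) == 0) {
		n++;
		cur = &key;
	}
	return n;
}
```

    Starting the walk from a NULL key visits every element, while starting
    from key 0 skips everything stored at or before key 0 in iteration order.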

    Fixes: a26ca7c982cb5 ("bpf: btf: Add pretty print support to the basic arraymap")
    Acked-by: Alexei Starovoitov
    Signed-off-by: Yonghong Song
    Signed-off-by: Daniel Borkmann

    Yonghong Song
     

09 Jun, 2018

1 commit

  • syzkaller was able to trigger the following warning in
    do_dentry_open():

    WARNING: CPU: 1 PID: 4508 at fs/open.c:778 do_dentry_open+0x4ad/0xe40 fs/open.c:778
    Kernel panic - not syncing: panic_on_warn set ...

    CPU: 1 PID: 4508 Comm: syz-executor867 Not tainted 4.17.0+ #90
    Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
    Call Trace:
    [...]
    vfs_open+0x139/0x230 fs/open.c:908
    do_last fs/namei.c:3370 [inline]
    path_openat+0x1717/0x4dc0 fs/namei.c:3511
    do_filp_open+0x249/0x350 fs/namei.c:3545
    do_sys_open+0x56f/0x740 fs/open.c:1101
    __do_sys_openat fs/open.c:1128 [inline]
    __se_sys_openat fs/open.c:1122 [inline]
    __x64_sys_openat+0x9d/0x100 fs/open.c:1122
    do_syscall_64+0x1b1/0x800 arch/x86/entry/common.c:287
    entry_SYSCALL_64_after_hwframe+0x49/0xbe

    Problem was that prog and map inodes in bpf fs did not
    implement a dummy file open operation that would return an
    error. The patch in do_dentry_open() checks whether f_ops
    are present and if not bails out with an error. While this
    may be fine, we really shouldn't be throwing a warning
    though. Thus follow the model similar to bad_file_ops and
    reject the request unconditionally with -EIO.
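
    The approach can be sketched in userspace as follows. All names are
    hypothetical: struct demo_file_ops is a stand-in for a file operations
    table, demo_do_open() only imitates a caller that warns when no operations
    are present, and the -ENODEV returned on that path is an assumption made
    for the example.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical, minimal file operations table. */
struct demo_file_ops {
	int (*open)(void);
};

/* Counts how often the caller would have emitted a warning. */
static int demo_warnings;

/* Dummy open operation: always reject the request with -EIO. */
static int demo_bpffs_obj_open(void)
{
	return -EIO;
}

static const struct demo_file_ops demo_bpffs_obj_fops = {
	.open = demo_bpffs_obj_open,
};

/* Imitation of a caller that warns when no operations are installed. */
static int demo_do_open(const struct demo_file_ops *fops)
{
	if (!fops) {
		demo_warnings++;
		return -ENODEV; /* assumption: error used on the warning path */
	}
	return fops->open ? fops->open() : 0;
}
```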

    Fixes: b2197755b263 ("bpf: add support for persistent maps/progs")
    Reported-by: syzbot+2e7fcab0f56fdbb330b8@syzkaller.appspotmail.com
    Signed-off-by: Daniel Borkmann
    Signed-off-by: Alexei Starovoitov

    Daniel Borkmann
     

30 Apr, 2018

1 commit

  • tracepoints to bpf core were added as a way to provide introspection
    to bpf programs and maps, but after some time it became clear that
    this approach is inadequate, so prog_id, map_id and corresponding
    get_next_id, get_fd_by_id, get_info_by_fd, prog_query APIs were
    introduced and fully adopted by bpftool and other applications.
    The tracepoints in bpf core started to rot and cause syzbot warnings:
    WARNING: CPU: 0 PID: 3008 at kernel/trace/trace_event_perf.c:274
    Kernel panic - not syncing: panic_on_warn set ...
    perf_trace_bpf_map_keyval+0x260/0xbd0 include/trace/events/bpf.h:228
    trace_bpf_map_update_elem include/trace/events/bpf.h:274 [inline]
    map_update_elem kernel/bpf/syscall.c:597 [inline]
    SYSC_bpf kernel/bpf/syscall.c:1478 [inline]
    Hence this patch deletes tracepoints in bpf core.

    Reported-by: Eric Biggers
    Reported-by: syzbot
    Signed-off-by: Alexei Starovoitov
    Acked-by: David S. Miller
    Signed-off-by: Daniel Borkmann

    Alexei Starovoitov
     

20 Apr, 2018

1 commit

  • This patch adds pretty print support to the basic arraymap.
    Support for other bpf maps can be added later.

    This patch adds new attrs to the BPF_MAP_CREATE command to allow
    specifying the btf_fd, btf_key_id and btf_value_id.
    BPF_MAP_CREATE can then associate the btf with the map if
    the map being created supports BTF.

    A BTF supported map needs to implement two new map ops,
    map_seq_show_elem() and map_check_btf(). This patch has
    implemented these new map ops for the basic arraymap.

    It also adds file_operations, bpffs_map_fops, to the pinned
    map such that the pinned map can be opened and read.
    After that, the user has an intuitive way to do
    "cat bpffs/pathto/a-pinned-map" instead of getting
    an error.

    bpffs_map_fops should not be extended further to support
    other operations. Other operations (e.g. write/key-lookup...)
    should be realized by the userspace tools (e.g. bpftool) through
    the BPF_OBJ_GET_INFO_BY_FD, map's lookup/update interface...etc.
    Follow up patches will allow the userspace to obtain
    the BTF from a map-fd.

    Here is a sample output when reading a pinned arraymap
    with the following map's value:

    struct map_value {
    int count_a;
    int count_b;
    };

    cat /sys/fs/bpf/pinned_array_map:

    0: {1,2}
    1: {3,4}
    2: {5,6}
    ...
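
    The "key: {field,field}" layout of the sample output can be imitated in
    userspace as below. This is only an illustration of the output format:
    demo_format_elem() is a hypothetical helper and does not use BTF or the
    map_seq_show_elem() callback.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Same value layout as in the sample above. */
struct map_value {
	int count_a;
	int count_b;
};

/* Format one array element as "<key>: {<count_a>,<count_b>}\n".
 * Returns the snprintf() result, i.e. the length of the full line. */
static int demo_format_elem(char *buf, size_t len, unsigned int key,
			    const struct map_value *v)
{
	assert(buf != NULL && v != NULL);
	return snprintf(buf, len, "%u: {%d,%d}\n", key, v->count_a, v->count_b);
}
```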

    Signed-off-by: Martin KaFai Lau
    Acked-by: Alexei Starovoitov
    Signed-off-by: Daniel Borkmann

    Martin KaFai Lau
     

09 Mar, 2018

1 commit

  • When pinning a file under the BPF virtual file system (traditionally
    /sys/fs/bpf), using a dot in the name of the location to pin at is not
    allowed. For example, trying to pin at "/sys/fs/bpf/foo.bar" will be
    rejected with -EPERM.
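
    The name check described above amounts to something like the following
    userspace sketch. demo_check_pin_name() is a hypothetical helper written
    for illustration; it only shows the "any dot is rejected with -EPERM"
    rule, not the surrounding lookup code.

```c
#include <assert.h>
#include <errno.h>
#include <string.h>

/* Reject any name containing a dot; dots are reserved for future use. */
static int demo_check_pin_name(const char *name)
{
	assert(name != NULL);
	if (strchr(name, '.'))
		return -EPERM;
	return 0;
}
```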

    This check was introduced at the same time as the BPF file system
    itself, with commit b2197755b263 ("bpf: add support for persistent
    maps/progs"). At this time, it was checked in a function called
    "bpf_dname_reserved()", which made clear that using a dot was reserved
    for future extensions.

    This function disappeared and the check was moved elsewhere with commit
    0c93b7d85d40 ("bpf: reject invalid names right in ->lookup()"), and the
    meaning of the dot ban was lost.

    The present commit simply adds a comment in the source to explain to the
    reader that the usage of dots is reserved for future usage.

    Signed-off-by: Quentin Monnet
    Signed-off-by: Daniel Borkmann

    Quentin Monnet
     

31 Jan, 2018

1 commit

  • Pull mqueue/bpf vfs cleanups from Al Viro:
    "mqueue and bpf go through rather painful and similar contortions to
    create objects in their dentry trees. Provide a primitive for doing
    that without abusing ->mknod(), switch bpf and mqueue to it.

    Another mqueue-related thing that has ended up in that branch is
    on-demand creation of internal mount (based upon the work of Giuseppe
    Scrivano)"

    * 'work.mqueue' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
    mqueue: switch to on-demand creation of internal mount
    tidy do_mq_open() up a bit
    mqueue: clean prepare_open() up
    do_mq_open(): move all work prior to dentry_open() into a helper
    mqueue: fold mq_attr_ok() into mqueue_get_inode()
    move dentry_open() calls up into do_mq_open()
    mqueue: switch to vfs_mkobj(), quit abusing ->d_fsdata
    bpf_obj_do_pin(): switch to vfs_mkobj(), quit abusing ->mknod()
    new primitive: vfs_mkobj()

    Linus Torvalds
     

06 Jan, 2018

2 commits


20 Oct, 2017

1 commit

  • Introduce map read/write flags to the eBPF syscalls that return a
    map fd. The flags are used to set up the file mode when constructing a new
    file descriptor for bpf maps. To not break backward compatibility,
    f_flags is set to O_RDWR if the flag passed by the syscall is 0. Otherwise
    it should be O_RDONLY or O_WRONLY. When userspace wants to modify or
    read the map content, the kernel checks the file mode to see if the
    operation is allowed.
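
    The flag-to-file-mode mapping can be sketched in userspace as follows.
    The DEMO_F_* constants and demo_file_flags() are hypothetical names with
    illustrative values, and rejecting a request that sets both flags with
    -EINVAL is an assumption of this sketch rather than something stated
    above.

```c
#include <errno.h>
#include <fcntl.h>

/* Illustrative flag values for this sketch only. */
#define DEMO_F_RDONLY (1U << 3)
#define DEMO_F_WRONLY (1U << 4)

/* Translate the flags passed by the syscall into file open flags.
 * A value of 0 keeps the old behaviour (read-write). */
static int demo_file_flags(unsigned int flags)
{
	/* Assumption: asking for both read-only and write-only is invalid. */
	if ((flags & DEMO_F_RDONLY) && (flags & DEMO_F_WRONLY))
		return -EINVAL;
	if (flags & DEMO_F_RDONLY)
		return O_RDONLY;
	if (flags & DEMO_F_WRONLY)
		return O_WRONLY;
	return O_RDWR;
}
```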

    Signed-off-by: Chenbo Feng
    Acked-by: Alexei Starovoitov
    Acked-by: Daniel Borkmann
    Signed-off-by: David S. Miller

    Chenbo Feng
     

09 Oct, 2017

1 commit

  • Commit 2c16d6033264 ("netfilter: xt_bpf: support ebpf") introduced
    support for attaching an eBPF object by an fd, with the
    'bpf_mt_check_v1' ABI expecting the '.fd' to be specified upon each
    IPT_SO_SET_REPLACE call.

    However this breaks subsequent iptables calls:

    # iptables -A INPUT -m bpf --object-pinned /sys/fs/bpf/xxx -j ACCEPT
    # iptables -A INPUT -s 5.6.7.8 -j ACCEPT
    iptables: Invalid argument. Run `dmesg' for more information.

    That's because iptables works by loading existing rules using
    IPT_SO_GET_ENTRIES to userspace, then issuing IPT_SO_SET_REPLACE with
    the replacement set.

    However, the loaded 'xt_bpf_info_v1' has an arbitrary '.fd' number
    (from the initial "iptables -m bpf" invocation), so when the 2nd invocation
    occurs, userspace passes a bogus fd number, which causes
    'bpf_mt_check_v1' to fail.

    One suggested solution [1] was to hack iptables userspace to perform an
    "entries fixup" immediately after IPT_SO_GET_ENTRIES, by opening a new,
    process-local fd for every 'xt_bpf_info_v1' entry seen.

    However, in [2] both Pablo Neira Ayuso and Willem de Bruijn suggested
    deprecating the xt_bpf_info_v1 ABI dealing with pinned ebpf objects.

    This fix changes the XT_BPF_MODE_FD_PINNED behavior to ignore the given
    '.fd' and instead perform an in-kernel lookup for the bpf object given
    the provided '.path'.

    It also defines an alias for the XT_BPF_MODE_FD_PINNED mode, named
    XT_BPF_MODE_PATH_PINNED, to better reflect the fact that the user is
    expected to provide the path of the pinned object.

    Existing XT_BPF_MODE_FD_ELF behavior (non-pinned fd mode) is preserved.
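
    A rough userspace sketch of the resulting mode handling follows. All
    sketch_* and SKETCH_* names are hypothetical stand-ins rather than the
    kernel's definitions; only the alias relationship and the "ignore
    '.fd', use '.path'" rule are taken from the description above.

```c
#include <errno.h>

/* Mode values assumed for illustration; the alias is what the text describes. */
enum sketch_bpf_mode {
	SKETCH_MODE_BYTECODE,
	SKETCH_MODE_FD_PINNED,
	SKETCH_MODE_FD_ELF,
};
#define SKETCH_MODE_PATH_PINNED SKETCH_MODE_FD_PINNED

enum sketch_source { SKETCH_FROM_BYTECODE, SKETCH_FROM_FD, SKETCH_FROM_PATH };

struct sketch_info {
	enum sketch_bpf_mode mode;
	int fd;			/* may be stale once rules are reloaded */
	const char *path;	/* stable across processes */
};

/* Decide where the program comes from; returns -EINVAL for unknown modes. */
int sketch_pick_source(const struct sketch_info *info)
{
	switch (info->mode) {
	case SKETCH_MODE_BYTECODE:
		return SKETCH_FROM_BYTECODE;
	case SKETCH_MODE_FD_ELF:
		return SKETCH_FROM_FD;		/* non-pinned fd mode is unchanged */
	case SKETCH_MODE_PATH_PINNED:
		return SKETCH_FROM_PATH;	/* '.fd' is ignored in this mode */
	default:
		return -EINVAL;
	}
}
```

    Because the pinned mode no longer looks at '.fd', a rule set read
    back by one process and replayed by another resolves to the same
    object.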

    References: [1] https://marc.info/?l=netfilter-devel&m=150564724607440&w=2
    [2] https://marc.info/?l=netfilter-devel&m=150575727129880&w=2

    Reported-by: Rafael Buchbinder
    Signed-off-by: Shmulik Ladkani
    Acked-by: Willem de Bruijn
    Acked-by: Daniel Borkmann
    Signed-off-by: Pablo Neira Ayuso

    Shmulik Ladkani
     

06 Jul, 2017

1 commit

  • Implement the show_options superblock op for bpf as part of a bid to get
    rid of s_options and generic_show_options() to make it easier to implement
    a context-based mount where the mount options can be passed individually
    over a file descriptor.

    Signed-off-by: David Howells
    cc: Alexei Starovoitov
    cc: Daniel Borkmann
    cc: netdev@vger.kernel.org
    Signed-off-by: Al Viro

    David Howells
     

27 Apr, 2017

1 commit

  • simple_fill_super() is passed an array of tree_descr structures which
    describe the files to create in the filesystem's root directory. Since
    these arrays are never modified intentionally, they should be 'const' so
    that they are placed in .rodata and benefit from memory protection.
    This patch updates the function signature and all users, and also
    constifies tree_descr.name.
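
    As a userspace illustration of the constification, the sketch below
    uses a simplified stand-in for tree_descr. The struct layout, the
    file names, and the empty-name terminator are assumptions made for
    this example.

```c
#include <stddef.h>

/* Simplified stand-in for the kernel's tree_descr (assumption). */
struct sketch_tree_descr {
	const char *name;	/* constified name */
	int mode;
};

/* A const array can be placed in read-only data by the toolchain. */
static const struct sketch_tree_descr sketch_files[] = {
	{ "status", 0444 },
	{ "control", 0644 },
	{ "", 0 },	/* empty name ends the list in this sketch */
};

/* The consumer takes a pointer to const, so it cannot modify the table. */
size_t sketch_count_files(const struct sketch_tree_descr *files)
{
	size_t n = 0;

	while (files[n].name[0] != '\0')
		n++;
	return n;
}
```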

    Signed-off-by: Eric Biggers
    Signed-off-by: Al Viro

    Eric Biggers
     

26 Jan, 2017

1 commit

  • This work adds a number of tracepoints to paths that are either
    considered slow-path or exception-like states, where monitoring or
    inspecting them would be desirable.

    For the bpf(2) syscall, tracepoints have been placed for the main
    commands when they succeed. In the XDP case, the tracepoint is for
    exceptions, e.g. on abnormal BPF program exit such as an unknown or
    XDP_ABORTED return code, or when an error occurs during the XDP_TX
    action and the packet could not be forwarded.

    Both have been split into separate event headers, and can be further
    extended. Worst case, if they unexpectedly get in our way in the
    future, they can also be removed [1]. Of course, these tracepoints (like
    any other) can be analyzed by eBPF itself, etc. Example output:

    # ./perf record -a -e bpf:* sleep 10
    # ./perf script
    sock_example 6197 [005] 283.980322: bpf:bpf_map_create: map type=ARRAY ufd=4 key=4 val=8 max=256 flags=0
    sock_example 6197 [005] 283.980721: bpf:bpf_prog_load: prog=a5ea8fa30ea6849c type=SOCKET_FILTER ufd=5
    sock_example 6197 [005] 283.988423: bpf:bpf_prog_get_type: prog=a5ea8fa30ea6849c type=SOCKET_FILTER
    sock_example 6197 [005] 283.988443: bpf:bpf_map_lookup_elem: map type=ARRAY ufd=4 key=[06 00 00 00] val=[00 00 00 00 00 00 00 00]
    [...]
    sock_example 6197 [005] 288.990868: bpf:bpf_map_lookup_elem: map type=ARRAY ufd=4 key=[01 00 00 00] val=[14 00 00 00 00 00 00 00]
    swapper 0 [005] 289.338243: bpf:bpf_prog_put_rcu: prog=a5ea8fa30ea6849c type=SOCKET_FILTER

    [1] https://lwn.net/Articles/705270/

    Signed-off-by: Daniel Borkmann
    Acked-by: Alexei Starovoitov
    Signed-off-by: David S. Miller

    Daniel Borkmann
     

28 Nov, 2016

1 commit

  • Since we recently converted the BPF filesystem over to use mount_nodev(),
    we now have the possibility to also hold mount options in sb's s_fs_info.
    This work implements mount options support for specifying permissions on
    the sb's inode, which will be used by tc when it manually needs to mount
    the fs.
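
    A minimal sketch of parsing such a permission option, written as
    plain userspace C. The option name "mode=", the octal format, the
    07777 mask, and sketch_parse_mode() itself are assumptions for
    illustration and not the in-kernel parser.

```c
#include <stdlib.h>
#include <string.h>

#define SKETCH_MODE_MASK 07777u	/* permission bits incl. setuid/setgid/sticky */

/*
 * Hypothetical parser: scan a comma-separated option string for
 * "mode=<octal>" and store the masked value in *mode. Returns 0 on
 * success and -1 on a malformed mode value. Other options are ignored.
 */
int sketch_parse_mode(const char *opts, unsigned int *mode)
{
	const char *p = opts;

	while (*p) {
		size_t len = strcspn(p, ",");	/* length of the current option */

		if (len >= 5 && strncmp(p, "mode=", 5) == 0) {
			char *end;
			unsigned long val = strtoul(p + 5, &end, 8);

			/* Need at least one digit, ending exactly at the option end. */
			if (end == p + 5 || end != p + len)
				return -1;
			*mode = (unsigned int)val & SKETCH_MODE_MASK;
		}
		p += len;
		if (*p == ',')
			p++;
	}
	return 0;
}
```

    A caller would keep the parsed value next to the superblock state
    and apply it to the root inode when the filesystem is mounted.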

    Signed-off-by: Daniel Borkmann
    Acked-by: Alexei Starovoitov
    Signed-off-by: David S. Miller

    Daniel Borkmann
     

01 Nov, 2016

1 commit

  • While commit bb35a6ef7da4 ("bpf, inode: allow for rename and link ops")
    added support for hard links that can be used for prog and map nodes,
    this work adds simple symlink support, which can be used e.g. for
    directories, also when unprivileged, and works with cmdline tooling that
    understands S_IFLNK anyway. Since the switch in e27f4a942a0e ("bpf: Use
    mount_nodev not mount_ns to mount the bpf filesystem"), there can be
    various mount instances with mount_nodev() and thus hierarchy can be
    flattened to facilitate object sharing. Thus, we can keep bpf tooling
    also working by repointing paths.

    Most of the functionality can be used from vfs library operations. The
    symlink is stored in the inode itself, that is in i_link, which is
    sufficient in our case as opposed to storing it in the page cache.
    While at it, I noticed that bpf_mkdir() and bpf_mkobj() don't update
    the directory's mtime and ctime, so add a common helper for it called
    bpf_dentry_finalize() that takes care of it for all cases now.

    Signed-off-by: Daniel Borkmann
    Acked-by: Alexei Starovoitov
    Signed-off-by: David S. Miller

    Daniel Borkmann
     

28 Sep, 2016

1 commit

  • CURRENT_TIME macro is not appropriate for filesystems as it
    doesn't use the right granularity for filesystem timestamps.
    Use current_time() instead.

    CURRENT_TIME is also not y2038 safe.

    This is also in preparation for the patch that transitions
    vfs timestamps to use 64 bit time and hence make them
    y2038 safe. As part of the effort current_time() will be
    extended to do range checks. Hence, it is necessary for all
    file system timestamps to use current_time(). Also,
    current_time() will be transitioned along with vfs to be
    y2038 safe.

    Note that whenever a single call to current_time() is used
    to change timestamps in different inodes, it is because they
    share the same time granularity.
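
    The granularity point can be illustrated with a small userspace
    sketch that rounds a timestamp down to a given granularity in
    nanoseconds. sketch_trunc() is a hypothetical helper, not the
    kernel's current_time(), and treating any granularity of one second
    or more as "zero the nanoseconds" is a simplification made here.

```c
#include <time.h>

#define SKETCH_NSEC_PER_SEC 1000000000L

/*
 * Hypothetical helper: round a timestamp down to a filesystem's
 * timestamp granularity, given in nanoseconds (1 = full resolution).
 */
struct timespec sketch_trunc(struct timespec t, long gran)
{
	if (gran >= SKETCH_NSEC_PER_SEC)
		t.tv_nsec = 0;			/* second granularity or coarser */
	else if (gran > 1)
		t.tv_nsec -= t.tv_nsec % gran;	/* drop the sub-granule remainder */
	return t;
}
```

    Two inodes with the same granularity get identical results from one
    truncated value, which is why a single call can serve both.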

    Signed-off-by: Deepa Dinamani
    Reviewed-by: Arnd Bergmann
    Acked-by: Felipe Balbi
    Acked-by: Steven Whitehouse
    Acked-by: Ryusuke Konishi
    Acked-by: David Sterba
    Signed-off-by: Al Viro

    Deepa Dinamani
     

12 Jul, 2016

1 commit

  • The Kconfig currently controlling compilation of this code is:

    init/Kconfig:config BPF_SYSCALL
    init/Kconfig: bool "Enable bpf() system call"

    ...meaning that it currently is not being built as a module by anyone.

    Let's remove the couple of traces of modular infrastructure use, so that
    when reading the driver there is no doubt it is builtin-only.

    Note that MODULE_ALIAS is a no-op for non-modular code.

    We replace module.h with init.h since the file does use __init.

    Cc: Alexei Starovoitov
    Cc: netdev@vger.kernel.org
    Signed-off-by: Paul Gortmaker
    Acked-by: Daniel Borkmann
    Signed-off-by: David S. Miller

    Paul Gortmaker
     

24 May, 2016

1 commit

  • Follow-up to commit e27f4a942a0e ("bpf: Use mount_nodev not mount_ns
    to mount the bpf filesystem"), which removes the FS_USERNS_MOUNT flag.

    The original idea was to have a per mountns instance instead of a
    single global fs instance, but that didn't work out and we had to
    switch to the mount_nodev() model. The intent of that middle ground was
    to keep users who don't play nice from creating endless instances
    of bpf fs, which are difficult to control and discover from an admin
    point of view, while at the same time allowing us to be
    more flexible with regard to namespaces.

    Therefore, since we have now switched to mount_nodev() as a fix,
    where individual instances are created, we also need to remove the userns
    mount flag along with it to avoid running into the mentioned situation.
    I don't expect any breakage at this early point in time with removing
    the flag and we can revisit this later should the requirement for
    this come up with future users. This and commit e27f4a942a0e have
    been split to facilitate tracking should any of them run into the
    unlikely case of causing a regression.

    Fixes: b2197755b263 ("bpf: add support for persistent maps/progs")
    Signed-off-by: Daniel Borkmann
    Acked-by: Hannes Frederic Sowa
    Acked-by: Alexei Starovoitov
    Signed-off-by: David S. Miller

    Daniel Borkmann
     

21 May, 2016

1 commit

  • While reviewing the filesystems that set FS_USERNS_MOUNT I spotted the
    bpf filesystem. Looking at the code I saw a broken usage of mount_ns
    with current->nsproxy->mnt_ns. As the code does not acquire a
    reference to the mount namespace it can not possibly be correct to
    store the mount namespace on the superblock as it does.

    Replace mount_ns with mount_nodev so that each mount of the bpf
    filesystem returns a distinct instance, and the code is not buggy.

    In discussion with Hannes Frederic Sowa it was reported that the use
    of mount_ns was an attempt to have one bpf instance per mount
    namespace, in an attempt to keep resources that pin resources from
    hiding. That intent simply does not work; the vfs is not built to
    allow that kind of behavior. This means that the bpf filesystem
    really is buggy both semantically and in its implementation, as it does
    not, nor can it, implement the original intent.

    This change is userspace visible, but my experience with similar
    filesystems leads me to believe nothing will break with a model where
    each mount of the bpf filesystem is distinct from all others.

    Fixes: b2197755b263 ("bpf: add support for persistent maps/progs")
    Cc: Hannes Frederic Sowa
    Acked-by: Daniel Borkmann
    Signed-off-by: "Eric W. Biederman"
    Acked-by: Hannes Frederic Sowa
    Signed-off-by: David S. Miller

    Eric W. Biederman
     

19 May, 2016

1 commit

  • Pull misc vfs cleanups from Al Viro:
    "Assorted cleanups and fixes all over the place"

    * 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
    coredump: only charge written data against RLIMIT_CORE
    coredump: get rid of coredump_params->written
    ecryptfs_lookup(): try either only encrypted or plaintext name
    ecryptfs: avoid multiple aliases for directories
    bpf: reject invalid names right in ->lookup()
    __d_alloc(): treat NULL name as QSTR("/", 1)
    mtd: switch ubi_open_volume_path() to vfs_stat()
    mtd: switch open_mtd_by_chdev() to use of vfs_stat()

    Linus Torvalds
     

29 Apr, 2016

1 commit

  • On a system with >32Gbyte of physical memory and infinite RLIMIT_MEMLOCK,
    a malicious application may overflow the 32-bit bpf program refcnt.
    It's also possible to overflow the map refcnt on a 1Tb system.
    Impose a 32k hard limit, which means that the same bpf program or
    map cannot be shared by more than 32k processes.
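
    The capped reference count can be sketched with C11 atomics as
    below. This is a userspace illustration only: struct sketch_obj,
    sketch_obj_get(), and the -EBUSY return value are assumptions, and
    only the 32k limit comes from the description above.

```c
#include <errno.h>
#include <stdatomic.h>

#define SKETCH_MAX_REFCNT 32768	/* the 32k hard limit from the text */

struct sketch_obj {
	atomic_int refcnt;
};

/* Take a reference; back out and fail once the cap would be exceeded. */
int sketch_obj_get(struct sketch_obj *obj)
{
	/* atomic_fetch_add returns the old value, so +1 is the new count. */
	if (atomic_fetch_add(&obj->refcnt, 1) + 1 > SKETCH_MAX_REFCNT) {
		atomic_fetch_sub(&obj->refcnt, 1);
		return -EBUSY;	/* assumed error code for this sketch */
	}
	return 0;
}
```

    Keeping the cap far below the 32-bit range means a burst of
    concurrent increments still cannot wrap the counter before the
    check backs them out.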

    Fixes: 1be7f75d1668 ("bpf: enable non-root eBPF programs")
    Reported-by: Jann Horn
    Signed-off-by: Alexei Starovoitov
    Acked-by: Daniel Borkmann
    Signed-off-by: David S. Miller

    Alexei Starovoitov
     

28 Mar, 2016

1 commit