13 May, 2019

1 commit

  • Commit 31fd85816dbe ("bpf: permits narrower load from bpf program
    context fields") made the verifier add AND instructions to clear the
    unwanted bits with a mask when doing a narrow load. The mask is
    computed with

    (1 << size * 8) - 1

    where "size" is the size of the narrow load. When doing a 4 byte load
    of a an 8 byte field the verifier shifts the literal 1 by 32 places to
    the left. This results in an overflow of a signed integer, which is an
    undefined behavior. Typically, the computed mask was zero, so the
    result of the narrow load ended up being zero too.

    Cast the literal to long long to avoid the overflow. Note that a
    narrow load of a 4 byte field does not hit the undefined behavior,
    because there the load size can only be 1 or 2 bytes, so shifting 1
    by 8 or 16 places does not overflow. And reading 4 bytes would not
    be a narrow load of a 4 byte field in the first place.
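
    The arithmetic can be illustrated with a few lines of standalone C (a
    sketch of the mask computation only, not the verifier code itself; the
    1ULL spelling of the cast is an assumption):

    #include <stdio.h>

    int main(void)
    {
        unsigned int size = 4;  /* 4 byte narrow load of an 8 byte field */

        /* Undefined behavior: 1 is a signed int, shifting it by 32 overflows. */
        /* unsigned long long bad_mask = (1 << size * 8) - 1; */

        /* Fixed: make the literal 64-bit so the shift is well defined. */
        unsigned long long mask = (1ULL << size * 8) - 1;

        printf("mask = %#llx\n", mask);  /* prints 0xffffffff */
        return 0;
    }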

    Fixes: 31fd85816dbe ("bpf: permits narrower load from bpf program context fields")
    Reviewed-by: Alban Crequy
    Reviewed-by: Iago López Galeiras
    Signed-off-by: Krzesimir Nowak
    Cc: Yonghong Song
    Signed-off-by: Daniel Borkmann

    Krzesimir Nowak
     

11 May, 2019

1 commit

  • systemtap folks reported the following splat recently:

    [ 7790.862212] WARNING: CPU: 3 PID: 26759 at arch/x86/kernel/kprobes/core.c:1022 kprobe_fault_handler+0xec/0xf0
    [...]
    [ 7790.864113] CPU: 3 PID: 26759 Comm: sshd Not tainted 5.1.0-0.rc7.git1.1.fc31.x86_64 #1
    [ 7790.864198] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS[...]
    [ 7790.864314] RIP: 0010:kprobe_fault_handler+0xec/0xf0
    [ 7790.864375] Code: 48 8b 50 [...]
    [ 7790.864714] RSP: 0018:ffffc06800bdbb48 EFLAGS: 00010082
    [ 7790.864812] RAX: ffff9e2b75a16320 RBX: 0000000000000000 RCX: 0000000000000000
    [ 7790.865306] RDX: ffffffffffffffff RSI: 000000000000000e RDI: ffffc06800bdbbf8
    [ 7790.865514] RBP: ffffc06800bdbbf8 R08: 0000000000000000 R09: 0000000000000000
    [ 7790.865960] R10: 0000000000000000 R11: 0000000000000000 R12: ffffc06800bdbbf8
    [ 7790.866037] R13: ffff9e2ab56a0418 R14: ffff9e2b6d0bb400 R15: ffff9e2b6d268000
    [ 7790.866114] FS: 00007fde49937d80(0000) GS:ffff9e2b75a00000(0000) knlGS:0000000000000000
    [ 7790.866193] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    [ 7790.866318] CR2: 0000000000000000 CR3: 000000012f312000 CR4: 00000000000006e0
    [ 7790.866419] Call Trace:
    [ 7790.866677] do_user_addr_fault+0x64/0x480
    [ 7790.867513] do_page_fault+0x33/0x210
    [ 7790.868002] async_page_fault+0x1e/0x30
    [ 7790.868071] RIP: 0010: (null)
    [ 7790.868144] Code: Bad RIP value.
    [ 7790.868229] RSP: 0018:ffffc06800bdbca8 EFLAGS: 00010282
    [ 7790.868362] RAX: ffff9e2b598b60f8 RBX: ffffc06800bdbe48 RCX: 0000000000000004
    [ 7790.868629] RDX: 0000000000000004 RSI: ffffc06800bdbc6c RDI: ffff9e2b598b60f0
    [ 7790.868834] RBP: ffffc06800bdbcf8 R08: 0000000000000000 R09: 0000000000000004
    [ 7790.870432] R10: 00000000ff6f7a03 R11: 0000000000000000 R12: 0000000000000001
    [ 7790.871859] R13: ffffc06800bdbcb8 R14: 0000000000000000 R15: ffff9e2acd0a5310
    [ 7790.873455] ? vfs_read+0x5/0x170
    [ 7790.874639] ? vfs_read+0x1/0x170
    [ 7790.875834] ? trace_call_bpf+0xf6/0x260
    [ 7790.877044] ? vfs_read+0x1/0x170
    [ 7790.878208] ? vfs_read+0x5/0x170
    [ 7790.879345] ? kprobe_perf_func+0x233/0x260
    [ 7790.880503] ? vfs_read+0x1/0x170
    [ 7790.881632] ? vfs_read+0x5/0x170
    [ 7790.882751] ? kprobe_ftrace_handler+0x92/0xf0
    [ 7790.883926] ? __vfs_read+0x30/0x30
    [ 7790.885050] ? ftrace_ops_assist_func+0x94/0x100
    [ 7790.886183] ? vfs_read+0x1/0x170
    [ 7790.887283] ? vfs_read+0x5/0x170
    [ 7790.888348] ? ksys_read+0x5a/0xe0
    [ 7790.889389] ? do_syscall_64+0x5c/0xa0
    [ 7790.890401] ? entry_SYSCALL_64_after_hwframe+0x49/0xbe

    After some debugging, it turns out that the logic in 2cbd95a5c4fb
    ("bpf: change parameters of call/branch offset adjustment") has
    a bug that is exposed after 52875a04f4b2 ("bpf: verifier: remove
    dead code"): we miss some of the jump offset adjustments after
    code patching when we remove dead code, more concretely, upon a
    backward jump spanning over the area that is being removed.

    BPF insns of a case that was hit pre 52875a04f4b2:

    [...]
    676: (85) call bpf_perf_event_output#-47616
    677: (05) goto pc-636
    678: (62) *(u32 *)(r10 -64) = 0
    679: (bf) r7 = r10
    680: (07) r7 += -64
    681: (05) goto pc-44
    682: (05) goto pc-1
    683: (05) goto pc-1

    BPF insns afterwards:

    [...]
    618: (85) call bpf_perf_event_output#-47616
    619: (05) goto pc-638
    620: (62) *(u32 *)(r10 -64) = 0
    621: (bf) r7 = r10
    622: (07) r7 += -64
    623: (05) goto pc-44

    To illustrate the bug (the original commit message carries an ASCII
    diagram of the instruction layout here): the case of curr >= end_new
    && curr + off + 1 < end_new in the branch delta adjustments is never
    hit because curr + off + 1 < end_new is compared as unsigned and
    therefore curr + off + 1 > end_new in the unsigned realm, as curr +
    off + 1 becomes negative since the insns are memmove()'d before the
    offset adjustments.

    Correct BPF insns after this fix:

    [...]
    618: (85) call bpf_perf_event_output#-47216
    619: (05) goto pc-578
    620: (62) *(u32 *)(r10 -64) = 0
    621: (bf) r7 = r10
    622: (07) r7 += -64
    623: (05) goto pc-44

    Note that the unprivileged case is not affected by this.
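
    A sketch of the adjustment logic in question, modeled on the kernel's
    bpf_adj_delta_to_off() (parameter names and the signed s32 types are a
    reconstruction of the fix described above, not a verbatim copy):

    static int adj_delta_to_off(struct bpf_insn *insn, u32 pos, s32 end_old,
                                s32 end_new, s32 curr, const bool probe_pass)
    {
        const s32 off_min = S16_MIN, off_max = S16_MAX;
        s32 delta = end_new - end_old;
        s32 off = insn->off;

        if (curr < pos && curr + off + 1 >= end_old)
            off += delta;
        else if (curr >= end_new && curr + off + 1 < end_new)
            off -= delta;   /* backward jump over the removed region; needs
                               signed comparison to be reachable after the
                               insns have been memmove()'d */
        if (off < off_min || off > off_max)
            return -ERANGE;
        if (!probe_pass)
            insn->off = off;
        return 0;
    }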

    Fixes: 52875a04f4b2 ("bpf: verifier: remove dead code")
    Fixes: 2cbd95a5c4fb ("bpf: change parameters of call/branch offset adjustment")
    Reported-by: Frank Ch. Eigler
    Signed-off-by: Daniel Borkmann
    Reviewed-by: Jakub Kicinski
    Signed-off-by: Alexei Starovoitov

    Daniel Borkmann
     

08 May, 2019

2 commits

  • Pull networking updates from David Miller:
    "Highlights:

    1) Support AES128-CCM ciphers in kTLS, from Vakul Garg.

    2) Add fib_sync_mem to control the amount of dirty memory we allow to
    queue up between synchronize RCU calls, from David Ahern.

    3) Make flow classifier more lockless, from Vlad Buslov.

    4) Add PHY downshift support to aquantia driver, from Heiner
    Kallweit.

    5) Add SKB cache for TCP rx and tx, from Eric Dumazet. This reduces
    contention on SLAB spinlocks in heavy RPC workloads.

    6) Partial GSO offload support in XFRM, from Boris Pismenny.

    7) Add fast link down support to ethtool, from Heiner Kallweit.

    8) Use siphash for IP ID generator, from Eric Dumazet.

    9) Pull nexthops even further out from ipv4/ipv6 routes and FIB
    entries, from David Ahern.

    10) Move skb->xmit_more into a per-cpu variable, from Florian
    Westphal.

    11) Improve eBPF verifier speed and increase maximum program size,
    from Alexei Starovoitov.

    12) Eliminate per-bucket spinlocks in rhashtable, and instead use bit
    spinlocks. From Neil Brown.

    13) Allow tunneling with GUE encap in ipvs, from Jacky Hu.

    14) Improve link partner cap detection in generic PHY code, from
    Heiner Kallweit.

    15) Add layer 2 encap support to bpf_skb_adjust_room(), from Alan
    Maguire.

    16) Remove SKB list implementation assumptions in SCTP, yours truly.

    17) Various cleanups, optimizations, and simplifications in r8169
    driver. From Heiner Kallweit.

    18) Add memory accounting on TX and RX path of SCTP, from Xin Long.

    19) Switch PHY drivers over to use dynamic feature detection, from
    Heiner Kallweit.

    20) Support flow steering without masking in dpaa2-eth, from Ioana
    Ciocoi.

    21) Implement ndo_get_devlink_port in netdevsim driver, from Jiri
    Pirko.

    22) Increase the strict parsing of current and future netlink
    attributes, also export such policies to userspace. From Johannes
    Berg.

    23) Allow DSA tag drivers to be modular, from Andrew Lunn.

    24) Remove legacy DSA probing support, also from Andrew Lunn.

    25) Allow ll_temac driver to be used on non-x86 platforms, from Esben
    Haabendal.

    26) Add a generic tracepoint for TX queue timeouts to ease debugging,
    from Cong Wang.

    27) More indirect call optimizations, from Paolo Abeni"

    * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1763 commits)
    cxgb4: Fix error path in cxgb4_init_module
    net: phy: improve pause mode reporting in phy_print_status
    dt-bindings: net: Fix a typo in the phy-mode list for ethernet bindings
    net: macb: Change interrupt and napi enable order in open
    net: ll_temac: Improve error message on error IRQ
    net/sched: remove block pointer from common offload structure
    net: ethernet: support of_get_mac_address new ERR_PTR error
    net: usb: smsc: fix warning reported by kbuild test robot
    staging: octeon-ethernet: Fix of_get_mac_address ERR_PTR check
    net: dsa: support of_get_mac_address new ERR_PTR error
    net: dsa: sja1105: Fix status initialization in sja1105_get_ethtool_stats
    vrf: sit mtu should not be updated when vrf netdev is the link
    net: dsa: Fix error cleanup path in dsa_init_module
    l2tp: Fix possible NULL pointer dereference
    taprio: add null check on sched_nest to avoid potential null pointer dereference
    net: mvpp2: cls: fix less than zero check on a u32 variable
    net_sched: sch_fq: handle non connected flows
    net_sched: sch_fq: do not assume EDT packets are ordered
    net: hns3: use devm_kcalloc when allocating desc_cb
    net: hns3: some cleanup for struct hns3_enet_ring
    ...

    Linus Torvalds
     
  • Pull vfs inode freeing updates from Al Viro:
    "Introduction of separate method for RCU-delayed part of
    ->destroy_inode() (if any).

    Pretty much as posted, except that destroy_inode() stashes
    ->free_inode into the victim (anon-unioned with ->i_fops) before
    scheduling i_callback() and the last two patches (sockfs conversion
    and folding struct socket_wq into struct socket) are excluded - that
    pair should go through netdev once davem reopens his tree"

    * 'work.icache' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (58 commits)
    orangefs: make use of ->free_inode()
    shmem: make use of ->free_inode()
    hugetlb: make use of ->free_inode()
    overlayfs: make use of ->free_inode()
    jfs: switch to ->free_inode()
    fuse: switch to ->free_inode()
    ext4: make use of ->free_inode()
    ecryptfs: make use of ->free_inode()
    ceph: use ->free_inode()
    btrfs: use ->free_inode()
    afs: switch to use of ->free_inode()
    dax: make use of ->free_inode()
    ntfs: switch to ->free_inode()
    securityfs: switch to ->free_inode()
    apparmor: switch to ->free_inode()
    rpcpipe: switch to ->free_inode()
    bpf: switch to ->free_inode()
    mqueue: switch to ->free_inode()
    ufs: switch to ->free_inode()
    coda: switch to ->free_inode()
    ...

    Linus Torvalds
     

07 May, 2019

1 commit

  • Pull x86 mm updates from Ingo Molnar:
    "The changes in here are:

    - text_poke() fixes and an extensive set of executability lockdowns,
    to (hopefully) eliminate the last residual circumstances under
    which we are using W|X mappings even temporarily on x86 kernels.
    This required a broad range of surgery in text patching facilities,
    module loading, trampoline handling and other bits.

    - tweak page fault messages to be more informative and more
    structured.

    - remove DISCONTIGMEM support on x86-32 and make SPARSEMEM the
    default.

    - reduce KASLR granularity on 5-level paging kernels from 512 GB to
    1 GB.

    - misc other changes and updates"

    * 'x86-mm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (36 commits)
    x86/mm: Initialize PGD cache during mm initialization
    x86/alternatives: Add comment about module removal races
    x86/kprobes: Use vmalloc special flag
    x86/ftrace: Use vmalloc special flag
    bpf: Use vmalloc special flag
    modules: Use vmalloc special flag
    mm/vmalloc: Add flag for freeing of special permsissions
    mm/hibernation: Make hibernation handle unmapped pages
    x86/mm/cpa: Add set_direct_map_*() functions
    x86/alternatives: Remove the return value of text_poke_*()
    x86/jump-label: Remove support for custom text poker
    x86/modules: Avoid breaking W^X while loading modules
    x86/kprobes: Set instruction page as executable
    x86/ftrace: Set trampoline pages as executable
    x86/kgdb: Avoid redundant comparison of patched code
    x86/alternatives: Use temporary mm for text poking
    x86/alternatives: Initialize temporary mm for patching
    fork: Provide a function for copying init_mm
    uprobes: Initialize uprobes earlier
    x86/mm: Save debug registers when loading a temporary mm
    ...

    Linus Torvalds
     

30 Apr, 2019

1 commit

  • Use the new flag VM_FLUSH_RESET_PERMS for handling the freeing of
    special-permission memory in vmalloc, and remove the places where memory
    was set RW before freeing, which is no longer needed. Don't track whether
    the memory is RO anymore because it is now tracked in vmalloc.
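
    As a sketch of what a caller looks like after this change (the BPF JIT's
    lock-down path is the assumed example; exact in-tree call sites may
    differ), the allocation is tagged before being made RO+X, and a plain
    vfree() at teardown then flushes the mappings and resets the direct map
    permissions without setting the memory back to RW first:

    static void jit_binary_lock_ro(struct bpf_binary_header *hdr)
    {
        set_vm_flush_reset_perms(hdr);  /* permissions reset at vfree() time */
        set_memory_ro((unsigned long)hdr, hdr->pages);
        set_memory_x((unsigned long)hdr, hdr->pages);
    }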

    Signed-off-by: Rick Edgecombe
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Alexei Starovoitov
    Cc: Andy Lutomirski
    Cc: Borislav Petkov
    Cc: Daniel Borkmann
    Cc: Dave Hansen
    Cc: H. Peter Anvin
    Cc: Linus Torvalds
    Cc: Nadav Amit
    Cc: Rik van Riel
    Cc: Thomas Gleixner
    Link: https://lkml.kernel.org/r/20190426001143.4983-19-namit@vmware.com
    Signed-off-by: Ingo Molnar

    Rick Edgecombe
     

28 Apr, 2019

1 commit

  • After allowing a bpf prog to
    - directly read the skb->sk ptr
    - get the fullsock bpf_sock by "bpf_sk_fullsock()"
    - get the bpf_tcp_sock by "bpf_tcp_sock()"
    - get the listener sock by "bpf_get_listener_sock()"
    - avoid duplicating the fields of "(bpf_)sock" and "(bpf_)tcp_sock"
    into different bpf running contexts,

    this patch is another effort to make bpf's network programming
    more intuitive to do (together with memory and performance benefits).

    When a bpf prog needs to store data for a sk, the current practice is to
    define a map with the usual 4-tuple (src/dst ip/port) as the key.
    If multiple bpf progs need to store different sk data, multiple maps
    have to be defined, wasting memory to store the duplicated
    key (i.e. the 4-tuple here) in each of the bpf maps.
    [ The smallest key could be the sk pointer itself, which requires
    some enhancement in the verifier and is a separate topic. ]

    Also, the bpf prog needs to clean up the elem when the sk is freed.
    Otherwise, the bpf map quickly becomes full and unusable.
    The sk-free tracking currently could be done during sk state
    transition (e.g. BPF_SOCK_OPS_STATE_CB).

    The size of the map needs to be predefined, which usually ends up as an
    over-provisioned map in production. Even if the map were re-sizable,
    since sks naturally come and go anyway, this potential re-size
    operation is arguably redundant if the data can be directly attached
    to the sk itself instead of proxied through a bpf map.

    This patch introduces sk->sk_bpf_storage to provide local storage space
    at sk for bpf prog to use. The space will be allocated when the first bpf
    prog has created data for this particular sk.

    The design optimizes the bpf prog's lookup (and then optionally followed by
    an inline update). bpf_spin_lock should be used if the inline update needs
    to be protected.

    BPF_MAP_TYPE_SK_STORAGE:
    -----------------------
    To define a bpf "sk-local-storage", a BPF_MAP_TYPE_SK_STORAGE map (new in
    this patch) needs to be created. Multiple BPF_MAP_TYPE_SK_STORAGE maps can
    be created to fit different bpf progs' needs. The map enforces
    BTF to allow printing the sk-local-storage during a system-wide
    sk dump (e.g. "ss -ta") in the future.

    The purpose of a BPF_MAP_TYPE_SK_STORAGE map is not to lookup/update/delete
    "sk-local-storage" data from a particular sk.
    Think of the map as the meta-data (or "type") of an "sk-local-storage". This
    particular "type" of "sk-local-storage" data can then be stored in any sk.

    The main purposes of this map are mostly:
    1. Define the size of a "sk-local-storage" type.
    2. Provide a similar syscall userspace API as the map (e.g. lookup/update,
    map-id, map-btf...etc.)
    3. Keep track of all sk's storages of this "type" and clean them up
    when the map is freed.

    sk->sk_bpf_storage:
    ------------------
    The main lookup/update/delete is done on sk->sk_bpf_storage (which
    is a "struct bpf_sk_storage"). When doing a lookup,
    the "map" pointer is now used as the "key" to search on the
    sk_storage->list. The "map" pointer is actually serving
    as the "type" of the "sk-local-storage" that is being
    requested.

    To allow very fast lookup, it should be as fast as looking up an
    array at a stable-offset. At the same time, it is not ideal to
    set a hard limit on the number of sk-local-storage "type" that the
    system can have. Hence, this patch takes a cache approach.
    The last search result from sk_storage->list is cached in
    sk_storage->cache[], which is a fixed-size array. Each
    "sk-local-storage" type has a stable offset into the cache[] array.
    In the future, a map's flag could be introduced to do cache
    opt-out/enforcement if it became necessary.

    The cache size is 16 (i.e. 16 types of "sk-local-storage").
    Programs can share a map. On the program side, having a few bpf_progs
    running in the networking hotpath is already a lot. The bpf_prog
    should have already consolidated the existing sock-key-ed map usage
    to minimize the map lookup penalty. 16 has enough runway to grow.

    All sk-local-storage data will be removed from sk->sk_bpf_storage
    during sk destruction.

    bpf_sk_storage_get() and bpf_sk_storage_delete():
    ------------------------------------------------
    Instead of using bpf_map_(lookup|update|delete)_elem(),
    the bpf prog needs to use the new helpers bpf_sk_storage_get() and
    bpf_sk_storage_delete(). The verifier can then enforce the
    ARG_PTR_TO_SOCKET argument. bpf_sk_storage_get() also allows
    "creating" a new elem if one does not exist for the sk. This is done by
    the new BPF_SK_STORAGE_GET_F_CREATE flag. An optional value can also be
    provided as the initial value during BPF_SK_STORAGE_GET_F_CREATE.
    BPF_MAP_TYPE_SK_STORAGE also supports bpf_spin_lock. Together,
    this eliminates the potential use cases for an equivalent
    bpf_map_update_elem() API (for bpf_prog) in this patch.
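
    A minimal sketch of a BPF-side user (written in today's BTF-defined map
    style, which is an assumption; struct and section names are illustrative):

    #include <linux/bpf.h>
    #include <bpf/bpf_helpers.h>

    struct pkt_cnt {
        __u32 cnt;
    };

    struct {
        __uint(type, BPF_MAP_TYPE_SK_STORAGE);
        __uint(map_flags, BPF_F_NO_PREALLOC);
        __type(key, int);
        __type(value, struct pkt_cnt);
    } sk_cnt_map SEC(".maps");

    SEC("cgroup_skb/egress")
    int count_egress(struct __sk_buff *skb)
    {
        struct bpf_sock *sk = skb->sk;
        struct pkt_cnt *c;

        if (!sk)
            return 1;
        sk = bpf_sk_fullsock(sk);       /* helper expects a fullsock */
        if (!sk)
            return 1;

        /* Lookup; create a zero-initialized elem for this sk on first use. */
        c = bpf_sk_storage_get(&sk_cnt_map, sk, NULL,
                               BPF_SK_STORAGE_GET_F_CREATE);
        if (c)
            __sync_fetch_and_add(&c->cnt, 1);
        return 1;                       /* allow the packet */
    }

    char _license[] SEC("license") = "GPL";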

    Misc notes:
    ----------
    1. map_get_next_key is not supported. From the userspace syscall
    perspective, the map has the socket fd as the key while the map
    can be shared by pinned-file or map-id.

    Since btf is enforced, the existing "ss" could be enhanced to pretty
    print the local-storage.

    Supporting a kernel defined btf with 4 tuples as the return key could
    be explored later also.

    2. The sk->sk_lock cannot be acquired. Atomic operations are used instead,
    e.g. cmpxchg is done on the sk->sk_bpf_storage ptr.
    Please refer to the source code comments for the details in
    synchronization cases and considerations.

    3. The mem is charged to the sk->sk_omem_alloc as the sk filter does.

    Benchmark:
    ---------
    Here is the benchmark data collected by turning on
    the "kernel.bpf_stats_enabled" sysctl.
    Two bpf progs are tested:

    One bpf prog with the usual bpf hashmap (max_entries = 8192) with the
    sk ptr as the key. (The verifier is modified to support the sk ptr as
    the key, which should have shortened the key lookup time.)

    Another bpf prog is with the new BPF_MAP_TYPE_SK_STORAGE.

    Both are storing a "u32 cnt", do a lookup on "egress_skb/cgroup" for
    each egress skb and then bump the cnt. netperf is used to drive
    data with 4096 connected UDP sockets.

    BPF_MAP_TYPE_HASH with a modified verifier (152ns per bpf run)
    27: cgroup_skb name egress_sk_map tag 74f56e832918070b run_time_ns 58280107540 run_cnt 381347633
    loaded_at 2019-04-15T13:46:39-0700 uid 0
    xlated 344B jited 258B memlock 4096B map_ids 16
    btf_id 5

    BPF_MAP_TYPE_SK_STORAGE in this patch (66ns per bpf run)
    30: cgroup_skb name egress_sk_stora tag d4aa70984cc7bbf6 run_time_ns 25617093319 run_cnt 390989739
    loaded_at 2019-04-15T13:47:54-0700 uid 0
    xlated 168B jited 156B memlock 4096B map_ids 17
    btf_id 6

    Here is a high-level picture of how the objects are organized (the
    original commit message carries an ASCII diagram of this layout): each
    sk points via *sk_bpf_storage to its bpf_sk_storage, whose list links
    that sk's elems. Each elem holds an snode, the data itself and a
    map_node, and the map_node is additionally linked into the list of the
    owning bpf_map, so the map can find and clean up all storages of its
    "type".

    Signed-off-by: Martin KaFai Lau
    Signed-off-by: Alexei Starovoitov

    Martin KaFai Lau
     

27 Apr, 2019

1 commit

  • This is an opt-in interface that allows a tracepoint to provide a safe
    buffer that can be written from a BPF_PROG_TYPE_RAW_TRACEPOINT program.
    The size of the buffer must be a compile-time constant, and is checked
    before allowing a BPF program to attach to a tracepoint that uses this
    feature.

    The pointer to this buffer will be the first argument of tracepoints
    that opt in; the pointer is valid and can be bpf_probe_read() by both
    BPF_PROG_TYPE_RAW_TRACEPOINT and BPF_PROG_TYPE_RAW_TRACEPOINT_WRITABLE
    programs that attach to such a tracepoint, but the buffer to which it
    points may only be written by the latter.

    Signed-off-by: Matt Mullins
    Acked-by: Yonghong Song
    Signed-off-by: Alexei Starovoitov

    Matt Mullins
     

26 Apr, 2019

2 commits

  • In case of a null check on a pointer inside a subprog, we should mark all
    registers with this pointer as either safe or unknown, in both the current
    and previous frames. Currently, only spilled registers and registers in
    the current frame are marked. Packet bound checks in subprogs have the
    same issue. This patch fixes it to mark registers in previous frames as
    well.

    A good reproducer for null checks looks as follow:

    1: ptr = bpf_map_lookup_elem(map, &key);
    2: ret = subprog(ptr) {
    3:     return ptr != NULL;
    4: }
    5: if (ret)
    6:     value = *ptr;

    With the above, the verifier will complain on line 6 because it sees ptr
    as map_value_or_null despite the null check in subprog 1.

    Note that this patch fixes another resulting bug when using
    bpf_sk_release():

    1: sk = bpf_sk_lookup_tcp(...);
    2: subprog(sk) {
    3:     if (sk)
    4:         bpf_sk_release(sk);
    5: }
    6: if (!sk)
    7:     return 0;
    8: return 1;

    In the above, mark_ptr_or_null_regs will warn on line 6 because it will
    try to free the reference state, even though it was already freed on
    line 4.

    Fixes: f4d7e40a5b71 ("bpf: introduce function calls (verification)")
    Signed-off-by: Paul Chaignon
    Signed-off-by: Alexei Starovoitov

    Paul Chaignon
     
  • target_fd is the target namespace. If there is a flow dissector BPF program
    attached to that namespace, its (single) id is returned.
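
    A userspace sketch of the query (using the libbpf bpf_prog_query()
    wrapper; the wrapper and the use of a /proc netns fd as target_fd are
    assumptions based on the description above):

    #include <fcntl.h>
    #include <unistd.h>
    #include <bpf/bpf.h>

    int query_flow_dissector_id(void)
    {
        __u32 prog_ids[1] = {}, prog_cnt = 1;
        int netns_fd = open("/proc/self/ns/net", O_RDONLY);
        int err;

        if (netns_fd < 0)
            return -1;
        err = bpf_prog_query(netns_fd, BPF_FLOW_DISSECTOR, 0, NULL,
                             prog_ids, &prog_cnt);
        close(netns_fd);
        if (err)
            return -1;
        return prog_cnt ? (int)prog_ids[0] : 0;  /* 0: nothing attached */
    }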

    v5:
    * drop net ref right after rcu unlock (Daniel Borkmann)

    v4:
    * add missing put_net (Jann Horn)

    v3:
    * add missing inline to skb_flow_dissector_prog_query static def
    (kbuild test robot)

    v2:
    * don't sleep in rcu critical section (Jakub Kicinski)
    * check input prog_cnt (exit early)

    Cc: Jann Horn
    Signed-off-by: Stanislav Fomichev
    Signed-off-by: Daniel Borkmann

    Stanislav Fomichev
     

23 Apr, 2019

2 commits

  • Drop bpf_verifier_lock for root to avoid being DoS-ed by unprivileged users.
    The BPF verifier is now fully parallel.
    All unpriv users are still serialized by bpf_verifier_lock to avoid
    exhausting kernel memory by running N parallel verifications.

    Signed-off-by: Alexei Starovoitov
    Signed-off-by: Daniel Borkmann

    Alexei Starovoitov
     
  • Move three global variables protected by bpf_verifier_lock into
    'struct bpf_verifier_env' to allow parallel verification.

    Signed-off-by: Alexei Starovoitov
    Signed-off-by: Daniel Borkmann

    Alexei Starovoitov
     

18 Apr, 2019

3 commits

  • A lot of the performance gain comes from this patch.

    While analysing the performance overhead it was found that the largest CPU
    stalls were caused when touching the struct page area. It is first read with
    a READ_ONCE from build_skb_around via page_is_pfmemalloc(), and, when freed,
    written by the page_frag_free() call.

    Measurements show that the prefetchw (W) variant operation is needed to
    achieve the performance gain. We believe this optimization is two fold:
    first, the W-variant saves one step in the cache-coherency protocol, and
    second, it helps us avoid the non-temporal prefetch HW optimizations and
    brings this into all cache levels. It might be worth investigating whether
    prefetching into L2 would have the same benefit.

    Signed-off-by: Jesper Dangaard Brouer
    Acked-by: Ilias Apalodimas
    Acked-by: Song Liu
    Signed-off-by: Alexei Starovoitov

    Jesper Dangaard Brouer
     
  • As cpumap now batch-consumes xdp_frames from the ptr_ring, it knows how many
    SKBs it needs to allocate. Thus, let's bulk-allocate these SKBs via the
    kmem_cache_alloc_bulk() API, and use the previously introduced function
    build_skb_around().

    Notice that the flag __GFP_ZERO asks the slab/slub allocator to clear the
    memory for us. This does clear a larger area than needed, but my micro
    benchmarks on Intel CPUs show that this is slightly faster because a
    cacheline-aligned area is cleared for the SKBs. (For the SLUB allocator
    there is future optimization potential, because SKBs will with high
    probability originate from the same page. If we can find/identify contiguous
    memory areas then the Intel CPU's rep stos memset will bring a real
    performance gain.)

    Signed-off-by: Jesper Dangaard Brouer
    Acked-by: Song Liu
    Signed-off-by: Alexei Starovoitov

    Jesper Dangaard Brouer
     
  • Move the ptr_ring dequeue outside the loop that allocates SKBs and calls into
    the network stack, as these operations can take some time. The ptr_ring is a
    communication channel between CPUs, where we want to reduce/limit any
    cacheline bouncing.

    Do a concentrated bulk dequeue via ptr_ring_consume_batched() to shorten the
    period and the number of times the remote cacheline in the ptr_ring is read.

    The batch size of 8 is chosen both to (1) limit the BH-disable period, and
    (2) consume one cacheline on 64-bit archs. After reducing the BH-disable
    section further we can consider changing this, while still keeping the L1
    cacheline size in mind.

    Signed-off-by: Jesper Dangaard Brouer
    Acked-by: Song Liu
    Signed-off-by: Alexei Starovoitov

    Jesper Dangaard Brouer
     

16 Apr, 2019

1 commit

  • commit f1a2e44a3aec ("bpf: add queue and stack maps") introduced new BPF
    helper functions:
    - BPF_FUNC_map_push_elem
    - BPF_FUNC_map_pop_elem
    - BPF_FUNC_map_peek_elem

    but they were made available only for network BPF programs. This patch
    makes them available for tracepoint, cgroup and lirc programs.

    Signed-off-by: Alban Crequy
    Cc: Mauricio Vasquez B
    Acked-by: Song Liu
    Signed-off-by: Daniel Borkmann

    Alban Crequy
     

13 Apr, 2019

13 commits

  • There are a few "regs[regno]" here and there across "check_reg_arg"; this
    patch factors them out into a simple "reg" pointer. The intention is to
    simplify code indentation and make the later patches in this set look
    cleaner.

    Reviewed-by: Jakub Kicinski
    Signed-off-by: Jiong Wang
    Signed-off-by: Alexei Starovoitov

    Jiong Wang
     
  • After the code refactor in previous patches, it becomes clear that the
    propagation logic inside the for loop in "propagate_liveness" is good enough
    to be factored out into a common function, "propagate_liveness_reg".

    Reviewed-by: Jakub Kicinski
    Signed-off-by: Jiong Wang
    Signed-off-by: Alexei Starovoitov

    Jiong Wang
     
  • Access to reg states was not factored out; the consequence is long code
    for dereferencing them, which made the indentation hard to read.

    This patch factors out this code so the core code in the loop is
    easier to follow.

    Reviewed-by: Jakub Kicinski
    Signed-off-by: Jiong Wang
    Signed-off-by: Alexei Starovoitov

    Jiong Wang
     
  • Propagation for registers and stack slots is done in separate for loops,
    while they can perfectly well be put into a single loop.

    This also lets them share some common variables in later patches.

    Signed-off-by: Jiong Wang
    Signed-off-by: Alexei Starovoitov

    Jiong Wang
     
  • Fix a new warning reported by kbuild for make ARCH=i386:

    In file included from kernel/bpf/cgroup.c:11:0:
    kernel/bpf/cgroup.c: In function '__cgroup_bpf_run_filter_sysctl':
    include/linux/kernel.h:827:29: warning: comparison of distinct pointer types lacks a cast
    (!!(sizeof((typeof(x) *)1 == (typeof(y) *)1)))
    ^
    include/linux/kernel.h:841:4: note: in expansion of macro '__typecheck'
    (__typecheck(x, y) && __no_side_effects(x, y))
    ^~~~~~~~~~~
    include/linux/kernel.h:851:24: note: in expansion of macro '__safe_cmp'
    __builtin_choose_expr(__safe_cmp(x, y), \
    ^~~~~~~~~~
    include/linux/kernel.h:860:19: note: in expansion of macro '__careful_cmp'
    #define min(x, y) __careful_cmp(x, y, <)
    kernel/bpf/cgroup.c:837:17: note: in expansion of macro 'min'
    ctx.new_len = min(PAGE_SIZE, *pcount);
    ^~~

    Fixes: 4e63acdff864 ("bpf: Introduce bpf_sysctl_{get,set}_new_value helpers")
    Signed-off-by: Andrey Ignatov
    Signed-off-by: Alexei Starovoitov

    Andrey Ignatov
     
  • Add bpf_strtol and bpf_strtoul to convert a string to long and unsigned
    long correspondingly. It's similar to user space strtol(3) and
    strtoul(3) with a few changes to the API:

    * instead of NUL-terminated C string the helpers expect buffer and
    buffer length;

    * resulting long or unsigned long is returned in a separate
    result-argument;

    * return value is used to indicate success or failure, on success number
    of consumed bytes is returned that can be used to identify position to
    read next if the buffer is expected to contain multiple integers;

    * instead of *base* argument, *flags* is used that provides base in 5
    LSB, other bits are reserved for future use;

    * number of supported bases is limited.

    Documentation for the new helpers is provided in bpf.h UAPI.

    The helpers are made available to BPF_PROG_TYPE_CGROUP_SYSCTL programs to
    be able to convert string input to e.g. "ulongvec" output.

    E.g. "net/ipv4/tcp_mem" consists of three ulong integers. They can be
    parsed by calling bpf_strtoul three times.
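
    A sketch of that kind of usage from a BPF_PROG_TYPE_CGROUP_SYSCTL program
    (not the in-tree selftest; only the first value is parsed here, and the
    4096 limit is purely illustrative):

    #include <linux/bpf.h>
    #include <bpf/bpf_helpers.h>

    SEC("cgroup/sysctl")
    int sysctl_limit_first_ulong(struct bpf_sysctl *ctx)
    {
        unsigned long val = 0;
        char buf[64] = {};
        int ret;

        if (!ctx->write)
            return 1;               /* reads are always allowed */

        ret = bpf_sysctl_get_new_value(ctx, buf, sizeof(buf));
        if (ret < 0)
            return 0;               /* reject: no new value available */

        /* flags: base in the 5 LSB; 0 auto-detects like strtoul(3) */
        ret = bpf_strtoul(buf, sizeof(buf), 0, &val);
        if (ret <= 0)
            return 0;               /* reject: not a valid integer */

        return val <= 4096;         /* 1 = allow, 0 = -EPERM for the writer */
    }

    char _license[] SEC("license") = "GPL";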

    Implementation notes:

    Implementation includes "../../lib/kstrtox.h" to reuse integer parsing
    functions. It's done exactly same way as fs/proc/base.c already does.

    Unfortunately existing kstrtoX function can't be used directly since
    they fail if any invalid character is present right after integer in the
    string. Existing simple_strtoX functions can't be used either since
    they're obsolete and don't handle overflow properly.

    Signed-off-by: Andrey Ignatov
    Signed-off-by: Alexei Starovoitov

    Andrey Ignatov
     
  • Currently the way to pass result from BPF helper to BPF program is to
    provide memory area defined by pointer and size: func(void *, size_t).

    It works great for generic use-case, but for simple types, such as int,
    it's overkill and consumes two arguments when it could use just one.

    Introduce new argument types ARG_PTR_TO_INT and ARG_PTR_TO_LONG to be
    able to pass result from helper to program via pointer to int and long
    correspondingly: func(int *) or func(long *).

    New argument types are similar to ARG_PTR_TO_MEM with the following
    differences:
    * they don't require corresponding ARG_CONST_SIZE argument, predefined
    access sizes are used instead (32bit for int, 64bit for long);
    * it's possible to use more than one such argument in a helper;
    * provided pointers have to be aligned.

    It's easy to introduce similar ARG_PTR_TO_CHAR and ARG_PTR_TO_SHORT
    argument types. It's not done due to lack of use-case though.
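
    A sketch of how a helper declares such an argument (this follows the
    struct bpf_func_proto pattern; the bpf_strtoul proto shown is a
    reconstruction of what the helper from the previous patch would look
    like, not a verbatim copy):

    const struct bpf_func_proto bpf_strtoul_proto = {
        .func       = bpf_strtoul,
        .gpl_only   = false,
        .ret_type   = RET_INTEGER,
        .arg1_type  = ARG_PTR_TO_MEM,       /* buffer */
        .arg2_type  = ARG_CONST_SIZE,       /* buffer length */
        .arg3_type  = ARG_ANYTHING,         /* flags */
        .arg4_type  = ARG_PTR_TO_LONG,      /* result written here */
    };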

    Signed-off-by: Andrey Ignatov
    Signed-off-by: Alexei Starovoitov

    Andrey Ignatov
     
  • Add a file_pos field to the bpf_sysctl context to read and write the sysctl
    file position at which the sysctl is being accessed (read or written).

    The field can be used to e.g. override the whole sysctl value on a write to
    the sysctl even when sys_write is called by user space with file_pos > 0. Or
    the BPF program may reject such accesses.

    Signed-off-by: Andrey Ignatov
    Signed-off-by: Alexei Starovoitov

    Andrey Ignatov
     
  • Add helpers to work with new value being written to sysctl by user
    space.

    bpf_sysctl_get_new_value() copies value being written to sysctl into
    provided buffer.

    bpf_sysctl_set_new_value() overrides the new value being written by user
    space with the one from the provided buffer. The buffer should contain a
    string representation of the value, similar to what can be seen in /proc/sys/.

    Both helpers can be used only on sysctl write.

    File position matters and can be managed by an interface that will be
    introduced separately. E.g. if user space calls sys_write to a file in
    /proc/sys/ at file position = X, where X > 0, then the value set by
    bpf_sysctl_set_new_value() will be written starting from X. If the program
    wants to override the whole value with the specified buffer, the file
    position has to be set to zero.

    Documentation for the new helpers is provided in bpf.h UAPI.

    Signed-off-by: Andrey Ignatov
    Signed-off-by: Alexei Starovoitov

    Andrey Ignatov
     
  • Add a bpf_sysctl_get_current_value() helper to copy the current sysctl value
    into a buffer provided by the BPF_PROG_TYPE_CGROUP_SYSCTL program.

    It provides the same string as user space can see by reading the
    corresponding file in /proc/sys/, including the new line, etc.

    Documentation for the new helper is provided in bpf.h UAPI.

    Since current value is kept in ctl_table->data in a parsed form,
    ctl_table->proc_handler() with write=0 is called to read that data and
    convert it to a string. Such a string can later be parsed by a program
    using helpers that will be introduced separately.

    Unfortunately it's not trivial to provide an API to access the parsed data
    due to the variety of data representations (string, intvec, uintvec,
    ulongvec, custom structures, even NULL, etc). Instead it's assumed that
    users know how to handle the specific sysctl they're interested in, and the
    appropriate helpers can be used.

    Since ctl_table->proc_handler() expects a __user buffer, a conversion to
    __user is done for the kernel-allocated buffer where the value is stored.

    Signed-off-by: Andrey Ignatov
    Signed-off-by: Alexei Starovoitov

    Andrey Ignatov
     
  • Add a bpf_sysctl_get_name() helper to copy the sysctl name (/proc/sys/ entry)
    into a buffer provided by the BPF_PROG_TYPE_CGROUP_SYSCTL program.

    By default full name (w/o /proc/sys/) is copied, e.g. "net/ipv4/tcp_mem".

    If BPF_F_SYSCTL_BASE_NAME flag is set, only base name will be copied,
    e.g. "tcp_mem".

    Documentation for the new helper is provided in bpf.h UAPI.

    Signed-off-by: Andrey Ignatov
    Signed-off-by: Alexei Starovoitov

    Andrey Ignatov
     
  • Containerized applications may run as root and that may create problems
    for the whole host. Specifically, such applications may change a sysctl and
    affect applications in other containers.

    Furthermore, in existing infrastructure it may not be possible to just
    completely disable writing to sysctls; instead such a process should be
    gradual, with the ability to log which sysctls are being changed by a
    container, investigate, limit the set of writable sysctls to the currently
    used ones (so that new ones cannot be changed) and eventually reduce
    this set to zero.

    The patch introduces new program type BPF_PROG_TYPE_CGROUP_SYSCTL and
    attach type BPF_CGROUP_SYSCTL to solve these problems on cgroup basis.

    The new program type has access to the following minimal context:

    struct bpf_sysctl {
        __u32 write;
    };

    Where @write indicates whether the sysctl is being read (= 0) or
    written (= 1).

    Helpers to access sysctl name and value will be introduced separately.

    BPF_CGROUP_SYSCTL attach point is added to sysctl code right before
    passing control to ctl_table->proc_handler so that BPF program can
    either allow or deny access to sysctl.
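
    Attaching works like other cgroup-bpf attach types; a userspace sketch
    using the libbpf wrappers (prog_fd is assumed to be an already loaded
    BPF_PROG_TYPE_CGROUP_SYSCTL program):

    #include <fcntl.h>
    #include <unistd.h>
    #include <bpf/bpf.h>

    int attach_sysctl_prog(int prog_fd, const char *cgroup_path)
    {
        int cg_fd = open(cgroup_path, O_RDONLY | O_DIRECTORY);
        int err;

        if (cg_fd < 0)
            return -1;
        err = bpf_prog_attach(prog_fd, cg_fd, BPF_CGROUP_SYSCTL, 0);
        close(cg_fd);
        return err;
    }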

    Suggested-by: Roman Gushchin
    Signed-off-by: Andrey Ignatov
    Signed-off-by: Alexei Starovoitov

    Andrey Ignatov
     
  • Currently kernel/bpf/cgroup.c contains only one program type and one
    proto function cgroup_dev_func_proto(). It'd be useful to have a base
    proto function that can be reused for new cgroup-bpf program types
    coming soon.

    Introduce cgroup_base_func_proto().

    Signed-off-by: Andrey Ignatov
    Signed-off-by: Alexei Starovoitov

    Andrey Ignatov
     

12 Apr, 2019

1 commit

  • Daniel Borkmann says:

    ====================
    pull-request: bpf-next 2019-04-12

    The following pull-request contains BPF updates for your *net-next* tree.

    The main changes are:

    1) Improve BPF verifier scalability for large programs through two
    optimizations: i) remove verifier states that are not useful in pruning,
    ii) stop walking parentage chain once first LIVE_READ is seen. Combined
    gives approx 20x speedup. Increase limits for accepting large programs
    under root, and add various stress tests, from Alexei.

    2) Implement global data support in BPF. This enables static global variables
    for .data, .rodata and .bss sections to be properly handled which allows
    for more natural program development. This also opens up the possibility
    to optimize program workflow by compiling ELFs only once and later only
    rewriting section data before reload, from Daniel and with test cases and
    libbpf refactoring from Joe.

    3) Add config option to generate BTF type info for vmlinux as part of the
    kernel build process. DWARF debug info is converted via pahole to BTF.
    Latter relies on libbpf and makes use of BTF deduplication algorithm which
    results in 100x savings compared to DWARF data. Resulting .BTF section is
    typically about 2MB in size, from Andrii.

    4) Add BPF verifier support for stack access with variable offset from
    helpers and add various test cases along with it, from Andrey.

    5) Extend bpf_skb_adjust_room() growth BPF helper to mark inner MAC header
    so that L2 encapsulation can be used for tc tunnels, from Alan.

    6) Add support for input __sk_buff context in BPF_PROG_TEST_RUN so that
    users can define a subset of allowed __sk_buff fields that get fed into
    the test program, from Stanislav.

    7) Add bpf fs multi-dimensional array tests for BTF test suite and fix up
    various UBSAN warnings in bpftool, from Yonghong.

    8) Generate a pkg-config file for libbpf, from Luca.

    9) Dump program's BTF id in bpftool, from Prashant.

    10) libbpf fix to use smaller BPF log buffer size for AF_XDP's XDP
    program, from Magnus.

    11) kallsyms related fixes for the case when symbols are not present in
    BPF selftests and samples, from Daniel
    ====================

    Signed-off-by: David S. Miller

    David S. Miller
     

11 Apr, 2019

1 commit

  • Add new set of arguments to bpf_attr for BPF_PROG_TEST_RUN:
    * ctx_in/ctx_size_in - input context
    * ctx_out/ctx_size_out - output context

    The intended use case is to pass some meta data to the test runs that
    operate on skb (this has been brought up at a recent LPC).

    For programs that use bpf_prog_test_run_skb, support __sk_buff input and
    output. Initially, from input __sk_buff, copy _only_ cb and priority into
    skb, all other non-zero fields are prohibited (with EINVAL).
    If the user has set ctx_out/ctx_size_out, copy the potentially modified
    __sk_buff back to the userspace.

    We require all fields of input __sk_buff except the ones we explicitly
    support to be set to zero. The expectation is that in the future we might
    add support for more fields and we want to fail explicitly if the user
    runs the program on the kernel where we don't yet support them.

    The API is intentionally vague (i.e. we don't explicitly add __sk_buff
    to bpf_attr, but ctx_in) to potentially let other test_run types use
    this interface in the future (this can be xdp_md for xdp types for
    example).
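
    A userspace sketch of the API (shown via today's libbpf
    bpf_prog_test_run_opts() wrapper rather than the raw bpf_attr, which is
    an assumption; prog_fd is an already loaded skb-type program):

    #include <linux/bpf.h>
    #include <bpf/bpf.h>

    int test_run_with_skb_ctx(int prog_fd)
    {
        char pkt[64] = {};                       /* minimal packet payload */
        struct __sk_buff ctx_in = {}, ctx_out = {};
        LIBBPF_OPTS(bpf_test_run_opts, opts,
            .data_in      = pkt,
            .data_size_in = sizeof(pkt),
            .ctx_in       = &ctx_in,
            .ctx_size_in  = sizeof(ctx_in),
            .ctx_out      = &ctx_out,
            .ctx_size_out = sizeof(ctx_out));

        ctx_in.priority = 7;                     /* only cb[] and priority are accepted */
        ctx_in.cb[0] = 0xdeadbeef;

        if (bpf_prog_test_run_opts(prog_fd, &opts))
            return -1;
        /* opts.retval is the program's return code; ctx_out reflects any
           modifications the program made to the context. */
        return (int)opts.retval;
    }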

    v4:
    * don't copy more than allowed in bpf_ctx_init [Martin]

    v3:
    * handle case where ctx_in is NULL, but ctx_out is not [Martin]
    * convert size==0 checks to ptr==NULL checks and add some extra ptr
    checks [Martin]

    v2:
    * Addressed comments from Martin Lau

    Signed-off-by: Stanislav Fomichev
    Acked-by: Martin KaFai Lau
    Signed-off-by: Daniel Borkmann

    Stanislav Fomichev
     

10 Apr, 2019

6 commits

  • Given we'll be reusing BPF array maps for global data/bss/rodata
    sections, we need a way to associate BTF DataSec type as its map
    value type. In usual cases we have this ugly BPF_ANNOTATE_KV_PAIR()
    macro hack e.g. via 38d5d3b3d5db ("bpf: Introduce BPF_ANNOTATE_KV_PAIR")
    to get initial map to type association going. While more use cases
    for it are discouraged, this also won't work for global data since
    the use of array map is a BPF loader detail and therefore unknown
    at compilation time. For array maps with just a single entry we make
    an exception in terms of BTF in that key type is declared optional
    if value type is of DataSec type. The latter LLVM is guaranteed to
    emit and it also aligns with how we regard global data maps as just
    a plain buffer area reusing existing map facilities for allowing
    things like introspection with existing tools.

    Signed-off-by: Daniel Borkmann
    Acked-by: Martin KaFai Lau
    Signed-off-by: Alexei Starovoitov

    Daniel Borkmann
     
  • This work adds kernel-side verification, logging and seq_show dumping
    of BTF Var and DataSec kinds which are emitted with latest LLVM. The
    following constraints apply:

    BTF Var must have:

    - Its kind_flag is 0
    - Its vlen is 0
    - Must point to a valid type
    - Type must not resolve to a forward type
    - Size of underlying type must be > 0
    - Must have a valid name
    - Can only be a source type, not sink or intermediate one
    - Name may include dots (e.g. in case of static variables
    inside functions)
    - Cannot be a member of a struct/union
    - Linkage so far can either only be static or global/allocated

    BTF DataSec must have:

    - Its kind_flag is 0
    - Its vlen cannot be 0
    - Its size cannot be 0
    - Must have a valid name
    - Can only be a source type, not sink or intermediate one
    - Name may include dots (e.g. to represent .bss, .data, .rodata etc)
    - Cannot be a member of a struct/union
    - Inner btf_var_secinfo array with {type,offset,size} triple
    must be sorted by offset in ascending order
    - Type must always point to BTF Var
    - BTF resolved size of Var must be = sum of triple sizes (thus holes
    are allowed)

    btf_var_resolve(), btf_ptr_resolve() and btf_modifier_resolve()
    are on a high level quite similar but each come with slight,
    subtle differences. They could potentially be a bit refactored
    in future which hasn't been done here to ease review.

    Signed-off-by: Daniel Borkmann
    Acked-by: Martin KaFai Lau
    Signed-off-by: Alexei Starovoitov

    Daniel Borkmann
     
  • Trivial addition to allow '.' aside from '_' as "special" characters
    in the object name. Used to allow for substrings in maps from loader
    side such as ".bss", ".data", ".rodata", but could also be useful for
    other purposes.

    Signed-off-by: Daniel Borkmann
    Acked-by: Andrii Nakryiko
    Acked-by: Martin KaFai Lau
    Signed-off-by: Alexei Starovoitov

    Daniel Borkmann
     
  • This patch adds a new BPF_MAP_FREEZE command which allows "freezing"
    the map globally as read-only / immutable from the syscall
    side.

    Map permission handling has been refactored into map_get_sys_perms()
    and drops FMODE_CAN_WRITE in case of locked map. Main use case is
    to allow for setting up .rodata sections from the BPF ELF which
    are loaded into the kernel, meaning BPF loader first allocates
    map, sets up map value by copying .rodata section into it and once
    complete, it calls BPF_MAP_FREEZE on the map fd to prevent further
    modifications.

    Right now BPF_MAP_FREEZE only takes map fd as argument while remaining
    bpf_attr members are required to be zero. I didn't add write-only
    locking here as counterpart since I don't have a concrete use-case
    for it on my side, and I think it makes probably more sense to wait
    once there is actually one. In that case bpf_attr can be extended
    as usual with a flag field and/or others where flag 0 means that
    we lock the map read-only hence this doesn't prevent to add further
    extensions to BPF_MAP_FREEZE upon need.
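
    A userspace sketch of that flow (using current libbpf wrappers, which is
    an assumption; a single-entry array stands in for the .rodata contents):

    #include <unistd.h>
    #include <linux/bpf.h>
    #include <bpf/bpf.h>

    int setup_frozen_rodata(const void *blob, __u32 blob_sz)
    {
        int fd = bpf_map_create(BPF_MAP_TYPE_ARRAY, "rodata_map",
                                sizeof(__u32), blob_sz, 1, NULL);
        __u32 key = 0;

        if (fd < 0)
            return -1;
        if (bpf_map_update_elem(fd, &key, blob, BPF_ANY) ||
            bpf_map_freeze(fd)) {       /* no further writes via bpf(2) */
            close(fd);
            return -1;
        }
        return fd;
    }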

    A map creation flag like BPF_F_WRONCE was not considered for couple
    of reasons: i) in case of a generic implementation, a map can consist
    of more than just one element, thus there could be multiple map
    updates needed to set the map into a state where it can then be
    made immutable, ii) WRONCE indicates exact one-time write before
    it is then set immutable. A generic implementation would set a bit
    atomically on map update entry (if unset), indicating that every
    subsequent update from then onwards will need to bail out there.
    However, map updates can fail, so upon failure that flag would need
    to be unset again and the update attempt would need to be repeated
    for it to be eventually made immutable. While this can be made
    race-free, this approach feels less clean and in combination with
    reason i), it's not generic enough. A dedicated BPF_MAP_FREEZE
    command directly sets the flag and caller has the guarantee that
    map is immutable from syscall side upon successful return for any
    future syscall invocations that would alter the map state, which
    is also more intuitive from an API point of view. A command name
    such as BPF_MAP_LOCK has been avoided as it's too close with BPF
    map spin locks (which already has BPF_F_LOCK flag). BPF_MAP_FREEZE
    is so far only enabled for privileged users.

    Signed-off-by: Daniel Borkmann
    Acked-by: Martin KaFai Lau
    Signed-off-by: Alexei Starovoitov

    Daniel Borkmann
     
  • This work adds two new map creation flags BPF_F_RDONLY_PROG
    and BPF_F_WRONLY_PROG in order to allow for read-only or
    write-only BPF maps from a BPF program side.

    Today we have BPF_F_RDONLY and BPF_F_WRONLY, but this only
    applies to system call side, meaning the BPF program has full
    read/write access to the map as usual while bpf(2) calls with
    map fd can either only read or write into the map depending
    on the flags. BPF_F_RDONLY_PROG and BPF_F_WRONLY_PROG allows
    for the exact opposite such that verifier is going to reject
    program loads if write into a read-only map or a read into a
    write-only map is detected. For read-only map case also some
    helpers are forbidden for programs that would alter the map
    state such as map deletion, update, etc. As opposed to the two
    BPF_F_RDONLY / BPF_F_WRONLY flags, BPF_F_RDONLY_PROG as well
    as BPF_F_WRONLY_PROG really do correspond to the map lifetime.

    We've enabled this generic map extension to various non-special
    maps holding normal user data: array, hash, lru, lpm, local
    storage, queue and stack. Further generic map types could be
    followed up in future depending on use-case. Main use case
    here is to forbid writes into .rodata map values from verifier
    side.

    Signed-off-by: Daniel Borkmann
    Acked-by: Martin KaFai Lau
    Signed-off-by: Alexei Starovoitov

    Daniel Borkmann
     
  • Both BPF_F_WRONLY / BPF_F_RDONLY flags are tied to the map file
    descriptor, but not to the map object itself! Meaning, at map
    creation time BPF_F_RDONLY can be set to make the map read-only
    from syscall side, but this holds only for the returned fd, so
    any other fd either retrieved via bpf file system or via map id
    for the very same underlying map object can have read-write access
    instead.

    Given that, keeping the two flags around in the map_flags attribute
    and exposing them to user space upon map dump is misleading and
    may lead to false conclusions. Since these two flags are not
    tied to the map object, let's also not store them as a map property.

    Signed-off-by: Daniel Borkmann
    Acked-by: Martin KaFai Lau
    Signed-off-by: Alexei Starovoitov

    Daniel Borkmann