20 Jan, 2021

2 commits

  • [ Upstream commit 69ca310f34168eae0ada434796bfc22fb4a0fa26 ]

    On some systems, some variant of the following splat is
    repeatedly seen. The common factor in all traces seems
    to be the entry point to task_file_seq_next(). With the
    patch, all warnings go away.

    rcu: INFO: rcu_sched self-detected stall on CPU
    rcu:     26-....: (20992 ticks this GP) idle=d7e/1/0x4000000000000002 softirq=81556231/81556231 fqs=4876
    (t=21033 jiffies g=159148529 q=223125)
    NMI backtrace for cpu 26
    CPU: 26 PID: 2015853 Comm: bpftool Kdump: loaded Not tainted 5.6.13-0_fbk4_3876_gd8d1f9bf80bb #1
    Hardware name: Quanta Twin Lakes MP/Twin Lakes Passive MP, BIOS F09_3A12 10/08/2018
    Call Trace:
    <IRQ>
    dump_stack+0x50/0x70
    nmi_cpu_backtrace.cold.6+0x13/0x50
    ? lapic_can_unplug_cpu.cold.30+0x40/0x40
    nmi_trigger_cpumask_backtrace+0xba/0xca
    rcu_dump_cpu_stacks+0x99/0xc7
    rcu_sched_clock_irq.cold.90+0x1b4/0x3aa
    ? tick_sched_do_timer+0x60/0x60
    update_process_times+0x24/0x50
    tick_sched_timer+0x37/0x70
    __hrtimer_run_queues+0xfe/0x270
    hrtimer_interrupt+0xf4/0x210
    smp_apic_timer_interrupt+0x5e/0x120
    apic_timer_interrupt+0xf/0x20
    </IRQ>
    RIP: 0010:get_pid_task+0x38/0x80
    Code: 89 f6 48 8d 44 f7 08 48 8b 00 48 85 c0 74 2b 48 83 c6 55 48 c1 e6 04 48 29 f0 74 19 48 8d 78 20 ba 01 00 00 00 f0 0f c1 50 20 d2 74 27 78 11 83 c2 01 78 0c 48 83 c4 08 c3 31 c0 48 83 c4 08
    RSP: 0018:ffffc9000d293dc8 EFLAGS: 00000202 ORIG_RAX: ffffffffffffff13
    RAX: ffff888637c05600 RBX: ffffc9000d293e0c RCX: 0000000000000000
    RDX: 0000000000000001 RSI: 0000000000000550 RDI: ffff888637c05620
    RBP: ffffffff8284eb80 R08: ffff88831341d300 R09: ffff88822ffd8248
    R10: ffff88822ffd82d0 R11: 00000000003a93c0 R12: 0000000000000001
    R13: 00000000ffffffff R14: ffff88831341d300 R15: 0000000000000000
    ? find_ge_pid+0x1b/0x20
    task_seq_get_next+0x52/0xc0
    task_file_seq_get_next+0x159/0x220
    task_file_seq_next+0x4f/0xa0
    bpf_seq_read+0x159/0x390
    vfs_read+0x8a/0x140
    ksys_read+0x59/0xd0
    do_syscall_64+0x42/0x110
    entry_SYSCALL_64_after_hwframe+0x44/0xa9
    RIP: 0033:0x7f95ae73e76e
    Code: Bad RIP value.
    RSP: 002b:00007ffc02c1dbf8 EFLAGS: 00000246 ORIG_RAX: 0000000000000000
    RAX: ffffffffffffffda RBX: 000000000170faa0 RCX: 00007f95ae73e76e
    RDX: 0000000000001000 RSI: 00007ffc02c1dc30 RDI: 0000000000000007
    RBP: 00007ffc02c1ec70 R08: 0000000000000005 R09: 0000000000000006
    R10: fffffffffffff20b R11: 0000000000000246 R12: 00000000019112a0
    R13: 0000000000000000 R14: 0000000000000007 R15: 00000000004283c0

    If unable to obtain the file structure for the current task,
    proceed to the next task number after the one returned from
    task_seq_get_next(), instead of the next task number from the
    original iterator.

    Also, save the stopping task number from task_seq_get_next()
    on failure in case of restarts.
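
    A minimal sketch of the resulting iteration step (hedged: simplified,
    with names taken from the trace above rather than the verbatim patch):

    again:
        task = task_seq_get_next(ns, &tid, true);
        if (!task) {
            info->tid = tid;     /* save the stopping tid for restarts */
            return NULL;
        }
        files = get_files_struct(task);
        if (!files) {
            put_task_struct(task);
            tid += 1;            /* advance from the returned tid, not from
                                  * the iterator's original tid */
            info->tid = tid;
            goto again;
        }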

    Fixes: eaaacd23910f ("bpf: Add task and task/file iterator targets")
    Signed-off-by: Jonathan Lemon
    Signed-off-by: Daniel Borkmann
    Acked-by: Andrii Nakryiko
    Link: https://lore.kernel.org/bpf/20201218185032.2464558-2-jonathan.lemon@gmail.com
    Signed-off-by: Sasha Levin

    Jonathan Lemon
     
  • [ Upstream commit 91b2db27d3ff9ad29e8b3108dfbf1e2f49fe9bd3 ]

    Simplify task_file_seq_get_next() by removing two in/out arguments: task
    and fstruct. Use info->task and info->files instead.

    Signed-off-by: Song Liu
    Signed-off-by: Daniel Borkmann
    Acked-by: Yonghong Song
    Link: https://lore.kernel.org/bpf/20201120002833.2481110-1-songliubraving@fb.com
    Signed-off-by: Sasha Levin

    Song Liu
     

12 Dec, 2020

1 commit

  • Remove the bpf_ prefix, which causes these helpers to be reported in the
    verifier dump as bpf_bpf_this_cpu_ptr() and bpf_bpf_per_cpu_ptr(),
    respectively. Let's fix it while it is still possible, before the UAPI
    freezes on these helpers.
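
    For context, a sketch of why the prefix doubled: the UAPI helper list
    already prepends "bpf_" when helper name strings are generated, so an
    FN() entry must not carry the prefix itself (illustrative):

    /* name strings are built roughly as [BPF_FUNC_##x] = "bpf_" #x */
    FN(bpf_per_cpu_ptr),   /* wrong: dumped as bpf_bpf_per_cpu_ptr() */
    FN(per_cpu_ptr),       /* right: dumped as bpf_per_cpu_ptr()     */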

    Fixes: eaa6bcb71ef6 ("bpf: Introduce bpf_per_cpu_ptr()")
    Signed-off-by: Andrii Nakryiko
    Signed-off-by: Alexei Starovoitov
    Signed-off-by: Daniel Borkmann
    Signed-off-by: Linus Torvalds

    Andrii Nakryiko
     

11 Dec, 2020

1 commit

  • The 64-bit signed bounds should not affect the 32-bit signed bounds unless
    the verifier knows that the upper 32 bits are either all 1s or all 0s. For
    example, a register with smin_value == 1 need not have s32_min_value == 1,
    since smax_value could be larger than a 32-bit subregister can hold.
    The verifier refines the smax/s32_max return value from certain helpers in
    do_refine_retval_range(). Teach the verifier to recognize that the
    smin/s32_min value is also bounded. When both the smin and smax bounds fit
    into the 32-bit subregister, the verifier can propagate those bounds.
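
    A worked example of why the guard is needed (standalone C with
    illustrative values):

    #include <stdint.h>
    #include <stdio.h>

    int main(void)
    {
        /* 64-bit value satisfying smin_value == 1 (it is >= 1)... */
        int64_t r = 0x180000000LL;   /* 6442450944 */
        int32_t sub = (int32_t)r;    /* lower 32 bits: 0x80000000 */

        /* ...yet the signed 32-bit subregister is INT32_MIN, not >= 1 */
        printf("%d\n", sub);         /* prints -2147483648 */
        return 0;
    }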

    Fixes: 3f50f132d840 ("bpf: Verifier, do explicit ALU32 bounds tracking")
    Reported-by: Jean-Philippe Brucker
    Acked-by: John Fastabend
    Signed-off-by: Alexei Starovoitov

    Alexei Starovoitov
     

15 Nov, 2020

1 commit

  • Currently the verifier enforces return code checks for subprograms in the
    same manner as it does for program entry points. This prevents returning
    arbitrary scalar values from subprograms. Scalar type of returned values
    is checked by btf_prepare_func_args() and hence it should be safe to
    allow only scalars for now. Relax return code checks for subprograms and
    allow any correct scalar values.
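
    An illustrative BPF C program that the relaxed check accepts (a sketch;
    section and function names are hypothetical):

    #include <linux/bpf.h>
    #include <bpf/bpf_helpers.h>

    /* global function: verified separately, may now return any scalar */
    __attribute__((noinline)) int get_constant(__u64 val)
    {
        return val > 101 ? -1 : 95;
    }

    SEC("classifier")
    int entry(struct __sk_buff *skb)
    {
        /* the entry point itself is still subject to return code checks */
        return get_constant(skb->len) < 0 ? 0 : 1;
    }

    char _license[] SEC("license") = "GPL";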

    Fixes: 51c39bb1d5d10 ("bpf: Introduce function-by-function verification")
    Signed-off-by: Dmitrii Banshchikov
    Signed-off-by: Alexei Starovoitov
    Acked-by: Andrii Nakryiko
    Link: https://lore.kernel.org/bpf/20201113171756.90594-1-me@ubique.spb.ru

    Dmitrii Banshchikov
     

11 Nov, 2020

1 commit

  • The unsigned variable datasec_id is assigned a return value from the call
    to check_pseudo_btf_id(), which may return a negative error code.

    This fixes the following coccicheck warning:

    ./kernel/bpf/verifier.c:9616:5-15: WARNING: Unsigned expression compared with zero: datasec_id > 0
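
    The bug class, demonstrated standalone (the stub stands in for
    check_pseudo_btf_id() and is hypothetical):

    #include <stdio.h>

    static int check_pseudo_btf_id_stub(void) { return -22; } /* -EINVAL */

    int main(void)
    {
        unsigned int datasec_id = check_pseudo_btf_id_stub();

        /* -22 wraps to 4294967274, so the error branch is never taken;
         * the check needs a signed type to catch negative returns */
        if (datasec_id > 0)
            printf("error missed: datasec_id = %u\n", datasec_id);
        return 0;
    }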

    Fixes: eaa6bcb71ef6 ("bpf: Introduce bpf_per_cpu_ptr()")
    Reported-by: Tosk Robot
    Signed-off-by: Kaixu Xia
    Signed-off-by: Daniel Borkmann
    Acked-by: Andrii Nakryiko
    Acked-by: John Fastabend
    Cc: Hao Luo
    Link: https://lore.kernel.org/bpf/1605071026-25906-1-git-send-email-kaixuxia@tencent.com

    Kaixu Xia
     

07 Nov, 2020

1 commit

  • The current logic checks if the name of the BTF type passed in
    attach_btf_id starts with "bpf_lsm_". This is not sufficient, as it also
    allows attachment to non-LSM hooks like the very function that performs
    this check, i.e. bpf_lsm_verify_prog.

    In order to ensure that this verification logic allows attachment to
    only LSM hooks, the LSM_HOOK definitions in lsm_hook_defs.h are used to
    generate a BTF_ID set. Upon verification, the attach_btf_id of the
    program being attached is checked for presence in this set.
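
    A sketch of the approach using the kernel's BTF-ID set infrastructure
    (simplified placement):

    BTF_SET_START(bpf_lsm_hooks)
    #define LSM_HOOK(RET, DEFAULT, NAME, ...) BTF_ID(func, bpf_lsm_##NAME)
    #include <linux/lsm_hook_defs.h>
    #undef LSM_HOOK
    BTF_SET_END(bpf_lsm_hooks)

    /* in bpf_lsm_verify_prog(): */
    if (!btf_id_set_contains(&bpf_lsm_hooks, prog->aux->attach_btf_id))
        return -EINVAL;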

    Fixes: 9e4e01dfd325 ("bpf: lsm: Implement attach, detach and execution")
    Signed-off-by: KP Singh
    Signed-off-by: Alexei Starovoitov
    Link: https://lore.kernel.org/bpf/20201105230651.2621917-1-kpsingh@chromium.org

    KP Singh
     

06 Nov, 2020

2 commits

  • Zero-fill element values for all other cpus than current, just as
    when not using prealloc. This is the only way the bpf program can
    ensure known initial values for all cpus ('onallcpus' cannot be
    set when coming from the bpf program).

    The scenario is: bpf program inserts some elements in a per-cpu
    map, then deletes some (or userspace does). When later adding
    new elements using bpf_map_update_elem(), the bpf program can
    only set the value of the new elements for the current cpu.
    When prealloc is enabled, previously deleted elements are re-used.
    Without the fix, values for other cpus remain whatever they were
    when the re-used entry was previously freed.

    A selftest is added to validate correct operation in above
    scenario as well as in case of LRU per-cpu map element re-use.
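
    To illustrate the scenario from the BPF side (a sketch; map and section
    names are hypothetical):

    struct {
        __uint(type, BPF_MAP_TYPE_PERCPU_HASH); /* preallocated by default */
        __uint(max_entries, 16);
        __type(key, __u32);
        __type(value, __u64);
    } pcpu_map SEC(".maps");

    SEC("tp/syscalls/sys_enter_getpgid")
    int update(void *ctx)
    {
        __u32 key = 1;
        __u64 val = 42;

        /* sets the value for the current cpu only; with the fix, the
         * other cpus' slots of a re-used element read back as zero
         * instead of stale data */
        bpf_map_update_elem(&pcpu_map, &key, &val, BPF_ANY);
        return 0;
    }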

    Fixes: 6c9059817432 ("bpf: pre-allocate hash map elements")
    Signed-off-by: David Verbeiren
    Signed-off-by: Alexei Starovoitov
    Acked-by: Matthieu Baerts
    Acked-by: Andrii Nakryiko
    Link: https://lore.kernel.org/bpf/20201104112332.15191-1-david.verbeiren@tessares.net

    David Verbeiren
     
  • Fix build error when BPF_SYSCALL is not set/enabled but BPF_PRELOAD is
    by making BPF_PRELOAD depend on BPF_SYSCALL.

    ERROR: modpost: "bpf_preload_ops" [kernel/bpf/preload/bpf_preload.ko] undefined!

    Reported-by: kernel test robot
    Reported-by: Randy Dunlap
    Signed-off-by: Randy Dunlap
    Signed-off-by: Alexei Starovoitov
    Link: https://lore.kernel.org/bpf/20201105195109.26232-1-rdunlap@infradead.org

    Randy Dunlap
     

30 Oct, 2020

1 commit

  • Commit 3193c0836 ("bpf: Disable GCC -fgcse optimization for
    ___bpf_prog_run()") introduced a __no_fgcse macro that expands to a
    function scope __attribute__((optimize("-fno-gcse"))), to disable a
    GCC specific optimization that was causing trouble on x86 builds, and
    was not expected to have any positive effect in the first place.

    However, as the GCC manual documents, __attribute__((optimize))
    is not for production use, and causes all other optimization
    options to be forgotten for the function in question. This can
    cause all kinds of trouble, but in one particular reported case,
    it causes -fno-asynchronous-unwind-tables to be disregarded,
    resulting in .eh_frame info being emitted for the function.

    This reverts commit 3193c0836, and instead, it disables the -fgcse
    optimization for the entire source file, but only when building for
    X86 using GCC with CONFIG_BPF_JIT_ALWAYS_ON disabled. Note that the
    original commit states that CONFIG_RETPOLINE=n triggers the issue,
    whereas CONFIG_RETPOLINE=y performs better without the optimization,
    so it is kept disabled in both cases.
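
    For reference, the reverted construct looked roughly like this (hedged
    reconstruction):

    /* per-function attribute: GCC replaces, rather than extends, the
     * command-line optimization flags for the annotated function */
    #define __no_fgcse __attribute__((optimize("-fno-gcse")))

    /* the replacement passes -fno-gcse for the whole object file from the
     * Makefile instead, only for GCC on x86 with CONFIG_BPF_JIT_ALWAYS_ON
     * disabled */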

    Fixes: 3193c0836f20 ("bpf: Disable GCC -fgcse optimization for ___bpf_prog_run()")
    Signed-off-by: Ard Biesheuvel
    Signed-off-by: Alexei Starovoitov
    Tested-by: Geert Uytterhoeven
    Reviewed-by: Nick Desaulniers
    Link: https://lore.kernel.org/lkml/CAMuHMdUg0WJHEcq6to0-eODpXPOywLot6UD2=GFHpzoj_hCoBQ@mail.gmail.com/
    Link: https://lore.kernel.org/bpf/20201028171506.15682-2-ardb@kernel.org

    Ard Biesheuvel
     

24 Oct, 2020

1 commit

  • Pull networking fixes from Jakub Kicinski:
    "Cross-tree/merge window issues:

    - rtl8150: don't incorrectly assign random MAC addresses; a fix late in
    the 5.9 cycle started depending on a return code from a function
    which changed with the 5.10 PR from the usb subsystem

    Current release regressions:

    - Revert "virtio-net: ethtool configurable RXCSUM", it was causing
    crashes at probe when control vq was not negotiated/available

    Previous release regressions:

    - ixgbe: fix probing of multi-port 10 Gigabit Intel NICs with an MDIO
    bus, only first device would be probed correctly

    - nexthop: Fix performance regression in nexthop deletion by
    effectively switching from recently added synchronize_rcu() to
    synchronize_rcu_expedited()

    - netsec: ignore 'phy-mode' device property on ACPI systems; the
    property is not populated correctly by the firmware, but firmware
    configures the PHY so just keep boot settings

    Previous releases - always broken:

    - tcp: fix to update snd_wl1 in bulk receiver fast path, addressing
    bulk transfers getting "stuck"

    - icmp: randomize the global rate limiter to prevent attackers from
    getting useful signal

    - r8169: fix operation under forced interrupt threading, make the
    driver always use hard irqs, even on RT, given the handler is light
    and only wants to schedule napi (and do so through a _irqoff()
    variant, preferably)

    - bpf: Enforce pointer id generation for all may-be-null register
    type to avoid pointers erroneously getting marked as null-checked

    - tipc: re-configure queue limit for broadcast link

    - net/sched: act_tunnel_key: fix OOB write in case of IPv6 ERSPAN
    tunnels

    - fix various issues in chelsio inline tls driver

    Misc:

    - bpf: improve just-added bpf_redirect_neigh() helper api to support
    supplying nexthop by the caller - in case BPF program has already
    done a lookup we can avoid doing another one

    - remove unnecessary break statements

    - make MPTCP not select IPV6, but rather depend on it"

    * tag 'net-5.10-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (62 commits)
    tcp: fix to update snd_wl1 in bulk receiver fast path
    net: Properly typecast int values to set sk_max_pacing_rate
    netfilter: nf_fwd_netdev: clear timestamp in forwarding path
    ibmvnic: save changed mac address to adapter->mac_addr
    selftests: mptcp: depends on built-in IPv6
    Revert "virtio-net: ethtool configurable RXCSUM"
    rtnetlink: fix data overflow in rtnl_calcit()
    net: ethernet: mtk-star-emac: select REGMAP_MMIO
    net: hdlc_raw_eth: Clear the IFF_TX_SKB_SHARING flag after calling ether_setup
    net: hdlc: In hdlc_rcv, check to make sure dev is an HDLC device
    bpf, libbpf: Guard bpf inline asm from bpf_tail_call_static
    bpf, selftests: Extend test_tc_redirect to use modified bpf_redirect_neigh()
    bpf: Fix bpf_redirect_neigh helper api to support supplying nexthop
    mptcp: depends on IPV6 but not as a module
    sfc: move initialisation of efx->filter_sem to efx_init_struct()
    mpls: load mpls_gso after mpls_iptunnel
    net/sched: act_tunnel_key: fix OOB write in case of IPv6 ERSPAN tunnels
    net/sched: act_gate: Unlock ->tcfa_lock in tc_setup_flow_action()
    net: dsa: bcm_sf2: make const array static, makes object smaller
    mptcp: MPTCP_IPV6 should depend on IPV6 instead of selecting it
    ...

    Linus Torvalds
     

23 Oct, 2020

1 commit

  • Pull initial set_fs() removal from Al Viro:
    "Christoph's set_fs base series + fixups"

    * 'work.set_fs' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
    fs: Allow a NULL pos pointer to __kernel_read
    fs: Allow a NULL pos pointer to __kernel_write
    powerpc: remove address space overrides using set_fs()
    powerpc: use non-set_fs based maccess routines
    x86: remove address space overrides using set_fs()
    x86: make TASK_SIZE_MAX usable from assembly code
    x86: move PAGE_OFFSET, TASK_SIZE & friends to page_{32,64}_types.h
    lkdtm: remove set_fs-based tests
    test_bitmap: remove user bitmap tests
    uaccess: add infrastructure for kernel builds with set_fs()
    fs: don't allow splice read/write without explicit ops
    fs: don't allow kernel reads and writes without iter ops
    sysctl: Convert to iter interfaces
    proc: add a read_iter method to proc proc_ops
    proc: cleanup the compat vs no compat file ops
    proc: remove a level of indentation in proc_get_inode

    Linus Torvalds
     

20 Oct, 2020

2 commits

  • The commit af7ec1383361 ("bpf: Add bpf_skc_to_tcp6_sock() helper")
    introduces RET_PTR_TO_BTF_ID_OR_NULL and
    the commit eaa6bcb71ef6 ("bpf: Introduce bpf_per_cpu_ptr()")
    introduces RET_PTR_TO_MEM_OR_BTF_ID_OR_NULL.
    Note that for RET_PTR_TO_MEM_OR_BTF_ID_OR_NULL, the reg0->type
    could become PTR_TO_MEM_OR_NULL which is not covered by
    BPF_PROBE_MEM.

    BPF_REG_0 will then hold an _OR_NULL pointer type. This _OR_NULL
    pointer type requires the bpf program to do an explicit NULL check first.
    After the NULL check, the verifier marks all registers having
    the same reg->id as safe to use. However, the reg->id
    is not set for those new _OR_NULL return types. One way this
    can go wrong: checking NULL for one btf_id typed pointer would
    end up validating all other btf_id typed pointers, because
    all of them have id == 0. The tests later in this series exercise
    this path.

    To fix it, and to avoid similar issues in the future, this patch
    moves the id generation logic out of each individual RET type
    test in check_helper_call(). Instead, it does one
    reg_type_may_be_null() test and then does the id generation
    if needed.
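
    In essence (a sketch of the centralized check in check_helper_call()):

    /* one place instead of per-RET-type id assignment */
    if (reg_type_may_be_null(regs[BPF_REG_0].type))
        regs[BPF_REG_0].id = ++env->id_gen;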

    This patch also adds a WARN_ON_ONCE in mark_ptr_or_null_reg()
    to catch future breakage.

    The _OR_NULL pointer usage in the bpf_iter_reg.ctx_arg_info is
    fine because it just happens that the existing id generation after
    check_ctx_access() has covered it. It is also using the
    reg_type_may_be_null() to decide if id generation is needed or not.

    Fixes: af7ec1383361 ("bpf: Add bpf_skc_to_tcp6_sock() helper")
    Fixes: eaa6bcb71ef6 ("bpf: Introduce bpf_per_cpu_ptr()")
    Signed-off-by: Martin KaFai Lau
    Signed-off-by: Alexei Starovoitov
    Link: https://lore.kernel.org/bpf/20201019194212.1050855-1-kafai@fb.com

    Martin KaFai Lau
     
  • A break is not needed if it is preceded by a return.

    Signed-off-by: Tom Rix
    Signed-off-by: Daniel Borkmann
    Acked-by: Andrii Nakryiko
    Link: https://lore.kernel.org/bpf/20201019173846.1021-1-trix@redhat.com

    Tom Rix
     

16 Oct, 2020

1 commit

  • Pull networking updates from Jakub Kicinski:

    - Add redirect_neigh() BPF packet redirect helper, allowing stack
    traversal to be limited in common container configs and improving TCP
    back-pressure.

    Daniel reports ~10Gbps => ~15Gbps single stream TCP performance gain.

    - Expand netlink policy support and improve policy export to user
    space. (Ge)netlink core performs request validation according to
    declared policies. Expand the expressiveness of those policies
    (min/max length and bitmasks). Allow dumping policies for particular
    commands. This is used for feature discovery by user space (instead
    of kernel version parsing or trial and error).

    - Support IGMPv3/MLDv2 multicast listener discovery protocols in
    bridge.

    - Allow more than 255 IPv4 multicast interfaces.

    - Add support for Type of Service (ToS) reflection in SYN/SYN-ACK
    packets of TCPv6.

    - In Multipath TCP (MPTCP), support concurrent transmission of data on
    multiple subflows in a load balancing scenario. Enhance advertising
    addresses via the RM_ADDR/ADD_ADDR options.

    - Support SMC-Dv2 version of SMC, which enables multi-subnet
    deployments.

    - Allow more calls to same peer in RxRPC.

    - Support two new Controller Area Network (CAN) protocols - CAN-FD and
    ISO 15765-2:2016.

    - Add xfrm/IPsec compat layer, solving the 32bit user space on 64bit
    kernel problem.

    - Add TC actions for implementing MPLS L2 VPNs.

    - Improve nexthop code - e.g. handle various corner cases when nexthop
    objects are removed from groups better, skip unnecessary
    notifications and make it easier to offload nexthops into HW by
    converting to a blocking notifier.

    - Support adding and consuming TCP header options by BPF programs,
    opening the doors for easy experimental and deployment-specific TCP
    option use.

    - Reorganize TCP congestion control (CC) initialization to simplify
    life of TCP CC implemented in BPF.

    - Add support for shipping BPF programs with the kernel and loading
    them early on boot via the User Mode Driver mechanism, hence reusing
    all the user space infra we have.

    - Support sleepable BPF programs, initially targeting LSM and tracing.

    - Add bpf_d_path() helper for returning full path for given 'struct
    path'.

    - Make bpf_tail_call compatible with bpf-to-bpf calls.

    - Allow BPF programs to call map_update_elem on sockmaps.

    - Add BPF Type Format (BTF) support for type and enum discovery, as
    well as support for using BTF within the kernel itself (current use
    is for pretty printing structures).

    - Support listing and getting information about bpf_links via the bpf
    syscall.

    - Enhance kernel interfaces around NIC firmware update. Allow
    specifying overwrite mask to control if settings etc. are reset
    during update; report expected max time operation may take to users;
    support firmware activation without machine reboot incl. limits of
    how much impact reset may have (e.g. dropping link or not).

    - Extend ethtool configuration interface to report IEEE-standard
    counters, to limit the need for per-vendor logic in user space.

    - Adopt or extend devlink use for debug, monitoring, fw update in many
    drivers (dsa loop, ice, ionic, sja1105, qed, mlxsw, mv88e6xxx,
    dpaa2-eth).

    - In mlxsw expose critical and emergency SFP module temperature alarms.
    Refactor port buffer handling to make the defaults more suitable and
    support setting these values explicitly via the DCBNL interface.

    - Add XDP support for Intel's igb driver.

    - Support offloading TC flower classification and filtering rules to
    mscc_ocelot switches.

    - Add PTP support for Marvell Octeontx2 and PP2.2 hardware, as well as
    fixed interval period pulse generator and one-step timestamping in
    dpaa-eth.

    - Add support for various auth offloads in WiFi APs, e.g. SAE (WPA3)
    offload.

    - Add Lynx PHY/PCS MDIO module, and convert various drivers which have
    this HW to use it. Convert mvpp2 to split PCS.

    - Support Marvell Prestera 98DX3255 24-port switch ASICs, as well as
    7-port Mediatek MT7531 IP.

    - Add initial support for QCA6390 and IPQ6018 in ath11k WiFi driver,
    and wcn3680 support in wcn36xx.

    - Improve performance for packets which don't require much offloads on
    recent Mellanox NICs by 20% by making multiple packets share a
    descriptor entry.

    - Move chelsio inline crypto drivers (for TLS and IPsec) from the
    crypto subtree to drivers/net. Move MDIO drivers out of the phy
    directory.

    - Clean up a lot of W=1 warnings, reportedly the actively developed
    subsections of networking drivers should now build W=1 warning free.

    - Make sure drivers don't use in_interrupt() to dynamically adapt their
    code. Convert tasklets to use new tasklet_setup API (sadly this
    conversion is not yet complete).

    * tag 'net-next-5.10' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (2583 commits)
    Revert "bpfilter: Fix build error with CONFIG_BPFILTER_UMH"
    net, sockmap: Don't call bpf_prog_put() on NULL pointer
    bpf, selftest: Fix flaky tcp_hdr_options test when adding addr to lo
    bpf, sockmap: Add locking annotations to iterator
    netfilter: nftables: allow re-computing sctp CRC-32C in 'payload' statements
    net: fix pos incrementment in ipv6_route_seq_next
    net/smc: fix invalid return code in smcd_new_buf_create()
    net/smc: fix valid DMBE buffer sizes
    net/smc: fix use-after-free of delayed events
    bpfilter: Fix build error with CONFIG_BPFILTER_UMH
    cxgb4/ch_ipsec: Replace the module name to ch_ipsec from chcr
    net: sched: Fix suspicious RCU usage while accessing tcf_tunnel_info
    bpf: Fix register equivalence tracking.
    rxrpc: Fix loss of final ack on shutdown
    rxrpc: Fix bundle counting for exclusive connections
    netfilter: restore NF_INET_NUMHOOKS
    ibmveth: Identify ingress large send packets.
    ibmveth: Switch order of ibmveth_helper calls.
    cxgb4: handle 4-tuple PEDIT to NAT mode translation
    selftests: Add VRF route leaking tests
    ...

    Linus Torvalds
     

15 Oct, 2020

2 commits

  • The 64-bit JEQ/JNE handling in reg_set_min_max() was clearing reg->id in either
    true or false branch. In the case 'if (reg->id)' check was done on the other
    branch the counter part register would have reg->id == 0 when called into
    find_equal_scalars(). In such case the helper would incorrectly identify other
    registers with id == 0 as equivalent and propagate the state incorrectly.
    Fix it by preserving ID across reg_set_min_max().

    In other words any kind of comparison operator on the scalar register
    should preserve its ID to recognize:

    r1 = r2
    if (r1 == 20) {
      #1 here both r1 and r2 == 20
    } else if (r2 < 20) {
      #2 here both r1 and r2 < 20
    }

    The patch is addressing #1 case. The #2 was working correctly already.

    Fixes: 75748837b7e5 ("bpf: Propagate scalar ranges through register assignments.")
    Signed-off-by: Alexei Starovoitov
    Signed-off-by: Daniel Borkmann
    Acked-by: Andrii Nakryiko
    Acked-by: John Fastabend
    Tested-by: Yonghong Song
    Link: https://lore.kernel.org/bpf/20201014175608.1416-1-alexei.starovoitov@gmail.com

    Alexei Starovoitov
     
  • Pull objtool updates from Ingo Molnar:
    "Most of the changes are cleanups and reorganization to make the
    objtool code more arch-agnostic. This is in preparation for non-x86
    support.

    Other changes:

    - KASAN fixes

    - Handle unreachable trap after call to noreturn functions better

    - Ignore unreachable fake jumps

    - Misc smaller fixes & cleanups"

    * tag 'objtool-core-2020-10-13' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (21 commits)
    perf build: Allow nested externs to enable BUILD_BUG() usage
    objtool: Allow nested externs to enable BUILD_BUG()
    objtool: Permit __kasan_check_{read,write} under UACCESS
    objtool: Ignore unreachable trap after call to noreturn functions
    objtool: Handle calling non-function symbols in other sections
    objtool: Ignore unreachable fake jumps
    objtool: Remove useless tests before save_reg()
    objtool: Decode unwind hint register depending on architecture
    objtool: Make unwind hint definitions available to other architectures
    objtool: Only include valid definitions depending on source file type
    objtool: Rename frame.h -> objtool.h
    objtool: Refactor jump table code to support other architectures
    objtool: Make relocation in alternative handling arch dependent
    objtool: Abstract alternative special case handling
    objtool: Move macros describing structures to arch-dependent code
    objtool: Make sync-check consider the target architecture
    objtool: Group headers to check in a single list
    objtool: Define 'struct orc_entry' only when needed
    objtool: Skip ORC entry creation for non-text sections
    objtool: Move ORC logic out of check()
    ...

    Linus Torvalds
     

13 Oct, 2020

1 commit

  • Alexei Starovoitov says:

    ====================
    pull-request: bpf-next 2020-10-12

    The main changes are:

    1) The BPF verifier improvements to track register allocation pattern, from Alexei and Yonghong.

    2) libbpf relocation support for different size load/store, from Andrii.

    3) bpf_redirect_peer() helper and support for inner map array with different max_entries, from Daniel.

    4) BPF support for per-cpu variables, from Hao.

    5) sockmap improvements, from John.
    ====================

    Signed-off-by: Jakub Kicinski

    Jakub Kicinski
     

12 Oct, 2020

1 commit

  • Recent work in f4d05259213f ("bpf: Add map_meta_equal map ops") and 134fede4eecf
    ("bpf: Relax max_entries check for most of the inner map types") added support
    for dynamic inner max elements for most map-in-map types. Exceptions were maps
    like array or prog array where the map_gen_lookup() callback uses the maps'
    max_entries field as a constant when emitting instructions.

    We recently implemented Maglev consistent hashing into Cilium's load balancer
    which uses map-in-map with an outer map being hash and inner being array holding
    the Maglev backend table for each service. This has been designed this way in
    order to reduce overall memory consumption given the outer hash map allows to
    avoid preallocating a large, flat memory area for all services. Also, the
    number of service mappings is not always known a priori.

    The use case for dynamic inner array map entries is to further reduce memory
    overhead, for example, some services might just have a small number of back
    ends while others could have a large number. Right now the Maglev backend table
    for small and large number of backends would need to have the same inner array
    map entries which adds a lot of unneeded overhead.

    Dynamic inner array map entries can be realized by avoiding the inlined code
    generation for their lookup. The lookup will still be efficient since it will
    be calling into array_map_lookup_elem() directly and thus avoiding retpoline.
    The patch adds a BPF_F_INNER_MAP flag to map creation which therefore skips
    inline code generation and relaxes array_map_meta_equal() check to ignore both
    maps' max_entries. This also still allows to have faster lookups for map-in-map
    when BPF_F_INNER_MAP is not specified and hence dynamic max_entries not needed.
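
    A sketch of the intended usage with libbpf BTF-defined maps (map names
    hypothetical):

    struct inner_arr {
        __uint(type, BPF_MAP_TYPE_ARRAY);
        __uint(map_flags, BPF_F_INNER_MAP); /* skip inlined lookup, allow
                                             * differing max_entries */
        __uint(max_entries, 1);
        __type(key, int);
        __type(value, int);
    } inner1 SEC(".maps");

    struct {
        __uint(type, BPF_MAP_TYPE_HASH_OF_MAPS);
        __uint(max_entries, 128);
        __type(key, int);
        __array(values, struct inner_arr);
    } outer SEC(".maps");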

    Example code generation where inner map is dynamic sized array:

    # bpftool p d x i 125
    int handle__sys_enter(void * ctx):
    ; int handle__sys_enter(void *ctx)
    0: (b4) w1 = 0
    ; int key = 0;
    1: (63) *(u32 *)(r10 -4) = r1
    2: (bf) r2 = r10
    ;
    3: (07) r2 += -4
    ; inner_map = bpf_map_lookup_elem(&outer_arr_dyn, &key);
    4: (18) r1 = map[id:468]
    6: (07) r1 += 272
    7: (61) r0 = *(u32 *)(r2 +0)
    8: (35) if r0 >= 0x3 goto pc+5
    9: (67) r0 <
    [...]

    Signed-off-by: Alexei Starovoitov
    Acked-by: Andrii Nakryiko
    Link: https://lore.kernel.org/bpf/20201010234006.7075-4-daniel@iogearbox.net

    Daniel Borkmann
     

10 Oct, 2020

2 commits

  • Under register pressure, LLVM may spill registers with bounds into the stack.
    The verifier has to track them through spill/fill otherwise many kinds of bound
    errors will be seen. The spill/fill of induction variables was already
    happening. This patch extends this logic from tracking spill/fill of a constant
    into any bounded register. There is no need to track spill/fill of unbounded,
    since no new information will be retrieved from the stack during register fill.

    Though extra stack difference could cause state pruning to be less effective, no
    adverse effects were seen from this patch on selftests and on cilium programs.
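
    Conceptually, the spill side becomes (a hedged sketch; helper names
    follow the verifier's conventions and are assumptions):

    /* on a 64-bit stack spill, keep the full register state not only for
     * constants but for any bounded scalar */
    if (reg->type == SCALAR_VALUE && size == BPF_REG_SIZE &&
        register_is_bounded(reg))
        save_register_state(state, spi, reg);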

    Signed-off-by: Yonghong Song
    Signed-off-by: Alexei Starovoitov
    Signed-off-by: Daniel Borkmann
    Acked-by: Andrii Nakryiko
    Acked-by: John Fastabend
    Link: https://lore.kernel.org/bpf/20201009011240.48506-3-alexei.starovoitov@gmail.com

    Yonghong Song
     
  • The LLVM register allocator may use two different registers to represent the
    same virtual register. In such a case the following pattern can be observed:
    1047: (bf) r9 = r6
    1048: (a5) if r6 < 0x1000 goto pc+1
    1050: ...
    1051: (a5) if r9 < 0x2 goto pc+66
    1052: ...
    1053: (bf) r2 = r9 /* r2 needs to have upper and lower bounds */

    This is normal behavior of the greedy register allocator.
    Slides 137+ of the following talk explain why regalloc introduces such register copies:
    http://llvm.org/devmtg/2018-04/slides/Yatsina-LLVM%20Greedy%20Register%20Allocator.pdf
    There is no way to tell llvm not to do this.
    Hence the verifier has to recognize such patterns.

    In order to track this information without backtracking allocate ID
    for scalars in a similar way as it's done for find_good_pkt_pointers().

    When the verifier encounters r9 = r6 assignment it will assign the same ID
    to both registers. Later if either register range is narrowed via conditional
    jump propagate the register state into the other register.

    Clear register ID in adjust_reg_min_max_vals() for any alu instruction. The
    register ID is ignored for scalars in regsafe() and doesn't affect state
    pruning. mark_reg_unknown() clears the ID. It's used to process call, endian
    and other instructions. Hence ID is explicitly cleared only in
    adjust_reg_min_max_vals() and in 32-bit mov.

    Signed-off-by: Alexei Starovoitov
    Signed-off-by: Daniel Borkmann
    Acked-by: Andrii Nakryiko
    Acked-by: John Fastabend
    Link: https://lore.kernel.org/bpf/20201009011240.48506-2-alexei.starovoitov@gmail.com

    Alexei Starovoitov
     

08 Oct, 2020

2 commits

  • Simon reported an issue with the current scalar32_min_max_or() implementation.
    That is, compared to the other 32-bit subreg tracking functions, the code in
    scalar32_min_max_or() stands out in that it uses the 64-bit registers instead
    of the 32-bit ones. This leads to bounds tracking issues, for example:

    [...]
    8: R0=map_value(id=0,off=0,ks=4,vs=48,imm=0) R10=fp0 fp-8=mmmmmmmm
    8: (79) r1 = *(u64 *)(r0 +0)
    R0=map_value(id=0,off=0,ks=4,vs=48,imm=0) R10=fp0 fp-8=mmmmmmmm
    9: R0=map_value(id=0,off=0,ks=4,vs=48,imm=0) R1_w=inv(id=0) R10=fp0 fp-8=mmmmmmmm
    9: (b7) r0 = 1
    10: R0_w=inv1 R1_w=inv(id=0) R10=fp0 fp-8=mmmmmmmm
    10: (18) r2 = 0x600000002
    12: R0_w=inv1 R1_w=inv(id=0) R2_w=inv25769803778 R10=fp0 fp-8=mmmmmmmm
    12: (ad) if r1 < r2 goto pc+1
    R0_w=inv1 R1_w=inv(id=0,umin_value=25769803778) R2_w=inv25769803778 R10=fp0 fp-8=mmmmmmmm
    13: R0_w=inv1 R1_w=inv(id=0,umin_value=25769803778) R2_w=inv25769803778 R10=fp0 fp-8=mmmmmmmm
    13: (95) exit
    14: R0_w=inv1 R1_w=inv(id=0,umax_value=25769803777,var_off=(0x0; 0x7ffffffff)) R2_w=inv25769803778 R10=fp0 fp-8=mmmmmmmm
    14: (25) if r1 > 0x0 goto pc+1
    R0_w=inv1 R1_w=inv(id=0,umax_value=0,var_off=(0x0; 0x7fffffff),u32_max_value=2147483647) R2_w=inv25769803778 R10=fp0 fp-8=mmmmmmmm
    15: R0_w=inv1 R1_w=inv(id=0,umax_value=0,var_off=(0x0; 0x7fffffff),u32_max_value=2147483647) R2_w=inv25769803778 R10=fp0 fp-8=mmmmmmmm
    15: (95) exit
    16: R0_w=inv1 R1_w=inv(id=0,umin_value=1,umax_value=25769803777,var_off=(0x0; 0x77fffffff),u32_max_value=2147483647) R2_w=inv25769803778 R10=fp0 fp-8=mmmmmmmm
    16: (47) r1 |= 0
    17: R0_w=inv1 R1_w=inv(id=0,umin_value=1,umax_value=32212254719,var_off=(0x1; 0x700000000),s32_max_value=1,u32_max_value=1) R2_w=inv25769803778 R10=fp0 fp-8=mmmmmmmm
    [...]

    The bound tests on the map value force the upper unsigned bound to be 25769803777
    in 64 bit (0b11000000000000000000000000000000001) and then lower one to be 1. By
    using OR they are truncated and thus result in the range [1,1] for the 32 bit reg
    tracker. This is incorrect given the only thing we know is that the value must be
    positive and thus 2147483647 (0b1111111111111111111111111111111) at max for the
    subregs. Fix it by using the {u,s}32_{min,max}_value vars instead. This also
    makes sense: for example, when we update dst_reg->s32_{min,max}_value in
    the else branch, we need to use the newly computed dst_reg->u32_{min,max}_value,
    as we know that these are positive. Previously, in the else branch the 64-bit
    values of umin_value=1 and umax_value=32212254719 were used and the latter got
    truncated to 1 as the upper bound there. After the fix the subreg range is now correct:

    [...]
    8: R0=map_value(id=0,off=0,ks=4,vs=48,imm=0) R10=fp0 fp-8=mmmmmmmm
    8: (79) r1 = *(u64 *)(r0 +0)
    R0=map_value(id=0,off=0,ks=4,vs=48,imm=0) R10=fp0 fp-8=mmmmmmmm
    9: R0=map_value(id=0,off=0,ks=4,vs=48,imm=0) R1_w=inv(id=0) R10=fp0 fp-8=mmmmmmmm
    9: (b7) r0 = 1
    10: R0_w=inv1 R1_w=inv(id=0) R10=fp0 fp-8=mmmmmmmm
    10: (18) r2 = 0x600000002
    12: R0_w=inv1 R1_w=inv(id=0) R2_w=inv25769803778 R10=fp0 fp-8=mmmmmmmm
    12: (ad) if r1 < r2 goto pc+1
    R0_w=inv1 R1_w=inv(id=0,umin_value=25769803778) R2_w=inv25769803778 R10=fp0 fp-8=mmmmmmmm
    13: R0_w=inv1 R1_w=inv(id=0,umin_value=25769803778) R2_w=inv25769803778 R10=fp0 fp-8=mmmmmmmm
    13: (95) exit
    14: R0_w=inv1 R1_w=inv(id=0,umax_value=25769803777,var_off=(0x0; 0x7ffffffff)) R2_w=inv25769803778 R10=fp0 fp-8=mmmmmmmm
    14: (25) if r1 > 0x0 goto pc+1
    R0_w=inv1 R1_w=inv(id=0,umax_value=0,var_off=(0x0; 0x7fffffff),u32_max_value=2147483647) R2_w=inv25769803778 R10=fp0 fp-8=mmmmmmmm
    15: R0_w=inv1 R1_w=inv(id=0,umax_value=0,var_off=(0x0; 0x7fffffff),u32_max_value=2147483647) R2_w=inv25769803778 R10=fp0 fp-8=mmmmmmmm
    15: (95) exit
    16: R0_w=inv1 R1_w=inv(id=0,umin_value=1,umax_value=25769803777,var_off=(0x0; 0x77fffffff),u32_max_value=2147483647) R2_w=inv25769803778 R10=fp0 fp-8=mmmmmmmm
    16: (47) r1 |= 0
    17: R0_w=inv1 R1_w=inv(id=0,umin_value=1,umax_value=32212254719,var_off=(0x0; 0x77fffffff),u32_max_value=2147483647) R2_w=inv25769803778 R10=fp0 fp-8=mmmmmmmm
    [...]
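
    A simplified sketch of the fixed function, matching the description
    above (constant-operand case elided):

    static void scalar32_min_max_or(struct bpf_reg_state *dst_reg,
                                    struct bpf_reg_state *src_reg)
    {
        struct tnum var32_off = tnum_subreg(dst_reg->var_off);
        s32 smin_val = src_reg->s32_min_value;
        u32 umin_val = src_reg->u32_min_value;

        /* OR only sets bits: umin can only grow, umax follows var_off */
        dst_reg->u32_min_value = max(dst_reg->u32_min_value, umin_val);
        dst_reg->u32_max_value = var32_off.value | var32_off.mask;
        if (dst_reg->s32_min_value < 0 || smin_val < 0) {
            /* sign bit unknown: give up on the signed 32-bit bounds */
            dst_reg->s32_min_value = S32_MIN;
            dst_reg->s32_max_value = S32_MAX;
        } else {
            /* ORing two non-negatives stays non-negative, so the new
             * u32 bounds can be reused as signed bounds */
            dst_reg->s32_min_value = dst_reg->u32_min_value;
            dst_reg->s32_max_value = dst_reg->u32_max_value;
        }
    }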

    Fixes: 3f50f132d840 ("bpf: Verifier, do explicit ALU32 bounds tracking")
    Reported-by: Simon Scannell
    Signed-off-by: Daniel Borkmann
    Reviewed-by: John Fastabend
    Acked-by: Alexei Starovoitov

    Daniel Borkmann
     
  • Fix build errors in kernel/bpf/verifier.c when CONFIG_NET is
    not enabled.

    ../kernel/bpf/verifier.c:3995:13: error: ‘btf_sock_ids’ undeclared here (not in a function); did you mean ‘bpf_sock_ops’?
    .btf_id = &btf_sock_ids[BTF_SOCK_TYPE_SOCK_COMMON],

    ../kernel/bpf/verifier.c:3995:26: error: ‘BTF_SOCK_TYPE_SOCK_COMMON’ undeclared here (not in a function); did you mean ‘PTR_TO_SOCK_COMMON’?
    .btf_id = &btf_sock_ids[BTF_SOCK_TYPE_SOCK_COMMON],
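
    One way to read the fix (a hedged sketch): guard the net-only BTF IDs
    behind CONFIG_NET, e.g.:

    #ifdef CONFIG_NET
        .btf_id = &btf_sock_ids[BTF_SOCK_TYPE_SOCK_COMMON],
    #endif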

    Fixes: 1df8f55a37bd ("bpf: Enable bpf_skc_to_* sock casting helper to networking prog type")
    Signed-off-by: Randy Dunlap
    Signed-off-by: Alexei Starovoitov
    Acked-by: Yonghong Song
    Link: https://lore.kernel.org/bpf/20201007021613.13646-1-rdunlap@infradead.org

    Randy Dunlap
     

06 Oct, 2020

2 commits

  • Rejecting non-native endian BTF overlapped with the addition
    of support for it.

    The rest were more simple overlapping changes, except the
    renesas ravb binding update, which had to follow a file
    move as well as a YAML conversion.

    Signed-off-by: David S. Miller

    David S. Miller
     
  • Recent improvements in LOCKDEP highlighted a potential A-A deadlock with
    pcpu_freelist in NMI:

    ./tools/testing/selftests/bpf/test_progs -t stacktrace_build_id_nmi

    [ 18.984807] ================================
    [ 18.984807] WARNING: inconsistent lock state
    [ 18.984808] 5.9.0-rc6-01771-g1466de1330e1 #2967 Not tainted
    [ 18.984809] --------------------------------
    [ 18.984809] inconsistent {INITIAL USE} -> {IN-NMI} usage.
    [ 18.984810] test_progs/1990 [HC2[2]:SC0[0]:HE0:SE1] takes:
    [ 18.984810] ffffe8ffffc219c0 (&head->lock){....}-{2:2}, at: __pcpu_freelist_pop+0xe3/0x180
    [ 18.984813] {INITIAL USE} state was registered at:
    [ 18.984814] lock_acquire+0x175/0x7c0
    [ 18.984814] _raw_spin_lock+0x2c/0x40
    [ 18.984815] __pcpu_freelist_pop+0xe3/0x180
    [ 18.984815] pcpu_freelist_pop+0x31/0x40
    [ 18.984816] htab_map_alloc+0xbbf/0xf40
    [ 18.984816] __do_sys_bpf+0x5aa/0x3ed0
    [ 18.984817] do_syscall_64+0x2d/0x40
    [ 18.984818] entry_SYSCALL_64_after_hwframe+0x44/0xa9
    [ 18.984818] irq event stamp: 12
    [...]
    [ 18.984822] other info that might help us debug this:
    [ 18.984823] Possible unsafe locking scenario:
    [ 18.984823]
    [ 18.984824] CPU0
    [ 18.984824] ----
    [ 18.984824] lock(&head->lock);
    [ 18.984826] <Interrupt>
    [ 18.984826] lock(&head->lock);
    [ 18.984827]
    [ 18.984828] *** DEADLOCK ***
    [ 18.984828]
    [ 18.984829] 2 locks held by test_progs/1990:
    [...]
    [ 18.984838]
    [ 18.984838] dump_stack+0x9a/0xd0
    [ 18.984839] lock_acquire+0x5c9/0x7c0
    [ 18.984839] ? lock_release+0x6f0/0x6f0
    [ 18.984840] ? __pcpu_freelist_pop+0xe3/0x180
    [ 18.984840] _raw_spin_lock+0x2c/0x40
    [ 18.984841] ? __pcpu_freelist_pop+0xe3/0x180
    [ 18.984841] __pcpu_freelist_pop+0xe3/0x180
    [ 18.984842] pcpu_freelist_pop+0x17/0x40
    [ 18.984842] ? lock_release+0x6f0/0x6f0
    [ 18.984843] __bpf_get_stackid+0x534/0xaf0
    [ 18.984843] bpf_prog_1fd9e30e1438d3c5_oncpu+0x73/0x350
    [ 18.984844] bpf_overflow_handler+0x12f/0x3f0

    This is because pcpu_freelist_head.lock is accessed in both NMI and
    non-NMI context. Fix this issue by using raw_spin_trylock() in NMI.

    Since NMI interrupts non-NMI context, when NMI context tries to lock the
    raw_spinlock, non-NMI context of the same CPU may already have locked a
    lock and is blocked from unlocking the lock. For a system with N CPUs,
    there could be N NMIs at the same time, and they may block N non-NMI
    raw_spinlocks. This is tricky for pcpu_freelist_push(), where unlike
    _pop(), a failing _push() means leaking memory. This issue is more likely
    to trigger on a non-SMP system.

    Fix this issue with an extra list, pcpu_freelist.extralist. The extralist
    is primarily used to take a _push() when raw_spin_trylock() fails on all
    the per-CPU lists. It should be empty most of the time. The following
    table summarizes the behavior of pcpu_freelist in NMI and non-NMI:

    non-NMI pop(): use _lock(); check per CPU lists first;
    if all per CPU lists are empty, check extralist;
    if extralist is empty, return NULL.

    non-NMI push(): use _lock(); only push to per CPU lists.

    NMI pop(): use _trylock(); check per CPU lists first;
    if all per CPU lists are locked or empty, check extralist;
    if extralist is locked or empty, return NULL.

    NMI push(): use _trylock(); check per CPU lists first;
    if all per CPU lists are locked; try push to extralist;
    if extralist is also locked, keep trying on per CPU lists.
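
    A sketch of the NMI-safe push path described above (simplified; the
    iteration over the per-CPU lists is elided):

    static void pcpu_freelist_push_nmi(struct pcpu_freelist *s,
                                       struct pcpu_freelist_node *node)
    {
        while (1) {
            struct pcpu_freelist_head *head = this_cpu_ptr(s->freelist);

            /* per-CPU lists first (iteration over cpus elided) */
            if (raw_spin_trylock(&head->lock)) {
                node->next = head->first;
                head->first = node;
                raw_spin_unlock(&head->lock);
                return;
            }
            /* all per-CPU lists locked: fall back to the extralist */
            if (raw_spin_trylock(&s->extralist.lock)) {
                node->next = s->extralist.first;
                s->extralist.first = node;
                raw_spin_unlock(&s->extralist.lock);
                return;
            }
            /* extralist also locked: keep retrying the per-CPU lists */
        }
    }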

    Reported-by: Alexei Starovoitov
    Signed-off-by: Song Liu
    Signed-off-by: Daniel Borkmann
    Acked-by: Martin KaFai Lau
    Link: https://lore.kernel.org/bpf/20201005165838.3735218-1-songliubraving@fb.com

    Song Liu
     

05 Oct, 2020

1 commit

  • Replace /* fallthrough */ comments with the new pseudo-keyword
    macro fallthrough [1].

    [1] https://www.kernel.org/doc/html/v5.7/process/deprecated.html?highlight=fallthrough#implicit-switch-case-fall-through
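
    The transformation, illustrated (case labels hypothetical):

    switch (cmd) {
    case CMD_PREPARE:
        prepare();
        fallthrough;    /* replaces the old comment-style annotation */
    case CMD_RUN:
        run();
        break;
    }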

    Signed-off-by: Gustavo A. R. Silva
    Signed-off-by: Daniel Borkmann
    Acked-by: Yonghong Song
    Link: https://lore.kernel.org/bpf/20201002234217.GA12280@embeddedor

    Gustavo A. R. Silva
     

03 Oct, 2020

4 commits

  • We are missing a reference drop for the case when we are doing
    BPF_PROG_BIND_MAP on a map that's already held by the program.
    There is an 'if (ret) bpf_map_put(map)' below which doesn't trigger
    because we don't consider this an error.
    Let's add the missing bpf_map_put() for this specific condition.
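
    The shape of the fix, as a sketch (simplified from the description
    above):

    for (i = 0; i < prog->aux->used_map_cnt; i++) {
        if (used_maps_old[i] == map) {
            bpf_map_put(map);   /* drop the ref taken by this syscall;
                                 * the program already holds the map */
            goto out_unlock;
        }
    }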

    Fixes: ef15314aa5de ("bpf: Add BPF_PROG_BIND_MAP syscall")
    Reported-by: Alexei Starovoitov
    Signed-off-by: Stanislav Fomichev
    Signed-off-by: Alexei Starovoitov
    Acked-by: Andrii Nakryiko
    Link: https://lore.kernel.org/bpf/20201003002544.3601440-1-sdf@google.com

    Stanislav Fomichev
     
  • Add bpf_this_cpu_ptr() to help access percpu vars on this cpu. This
    helper always returns a valid pointer, so there is no need to check the
    returned value for NULL. Also note that all programs run with
    preemption disabled, which means that the returned pointer is stable
    for the entire execution of the program.
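
    Example usage from BPF C (a sketch; runqueues is the percpu ksym used
    by the selftests and needs vmlinux.h for struct rq):

    extern const struct rq runqueues __ksym;

    SEC("raw_tp/sys_enter")
    int on_enter(void *ctx)
    {
        /* never NULL: always resolves for the current cpu */
        struct rq *rq = bpf_this_cpu_ptr(&runqueues);

        bpf_printk("cpu=%d", rq->cpu);
        return 0;
    }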

    Signed-off-by: Hao Luo
    Signed-off-by: Alexei Starovoitov
    Acked-by: Andrii Nakryiko
    Link: https://lore.kernel.org/bpf/20200929235049.2533242-6-haoluo@google.com

    Hao Luo
     
  • Add bpf_per_cpu_ptr() to help bpf programs access percpu vars.
    bpf_per_cpu_ptr() has the same semantic as per_cpu_ptr() in the kernel
    except that it may return NULL. This happens when the cpu parameter is
    out of range. So the caller must check the returned value.
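
    Usage therefore needs the NULL check (same assumed percpu ksym as
    above):

    struct rq *rq = bpf_per_cpu_ptr(&runqueues, cpu);

    if (!rq)    /* cpu was out of range */
        return 0;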

    Signed-off-by: Hao Luo
    Signed-off-by: Alexei Starovoitov
    Acked-by: Andrii Nakryiko
    Link: https://lore.kernel.org/bpf/20200929235049.2533242-5-haoluo@google.com

    Hao Luo
     
  • Pseudo_btf_id is a type of ld_imm insn that associates a btf_id to a
    ksym so that further dereferences on the ksym can use the BTF info
    to validate accesses. Internally, when seeing a pseudo_btf_id ld insn,
    the verifier reads the btf_id stored in the insn[0]'s imm field and
    marks the dst_reg as PTR_TO_BTF_ID. The btf_id points to a VAR_KIND,
    which is encoded in btf_vmlinux by pahole. If the VAR is not of a struct
    type, the dst reg will be marked as PTR_TO_MEM instead of PTR_TO_BTF_ID
    and the mem_size is resolved to the size of the VAR's type.

    From the VAR btf_id, the verifier can also read the address of the
    ksym's corresponding kernel var from kallsyms and use that to fill
    dst_reg.

    Therefore, the proper functionality of pseudo_btf_id depends on (1)
    kallsyms and (2) the encoding of kernel global VARs in pahole, which
    should be available since pahole v1.18.
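
    From the program side this surfaces as typed ksym externs (a sketch;
    the symbol name is hypothetical):

    /* libbpf turns this into a pseudo_btf_id ld_imm64 against vmlinux BTF */
    extern const int some_kernel_int __ksym;

    SEC("raw_tp/sys_enter")
    int read_ksym(void *ctx)
    {
        /* direct, BTF-checked read of the kernel variable */
        bpf_printk("val=%d", some_kernel_int);
        return 0;
    }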

    Signed-off-by: Hao Luo
    Signed-off-by: Alexei Starovoitov
    Acked-by: Andrii Nakryiko
    Link: https://lore.kernel.org/bpf/20200929235049.2533242-2-haoluo@google.com

    Hao Luo
     

01 Oct, 2020

2 commits

  • Currently, a perf event in a perf event array is removed from the array
    when the map fd used to add the event is closed. This behavior makes it
    difficult to share perf events with a perf event array.

    Introduce a new map flag, BPF_F_PRESERVE_ELEMS, that keeps the perf events
    open. With this flag set, perf events in the array are not
    removed when the original map fd is closed. Instead, the perf event will
    stay in the map until 1) it is explicitly removed from the array; or 2)
    the array is freed.
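
    Declared from BPF C, this looks like the following (a sketch mirroring
    the selftest style):

    struct {
        __uint(type, BPF_MAP_TYPE_PERF_EVENT_ARRAY);
        __uint(map_flags, BPF_F_PRESERVE_ELEMS); /* events outlive the fd */
        __uint(key_size, sizeof(int));
        __uint(value_size, sizeof(int));
        __uint(max_entries, 64);
    } events SEC(".maps");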

    Signed-off-by: Song Liu
    Signed-off-by: Alexei Starovoitov
    Link: https://lore.kernel.org/bpf/20200930224927.1936644-2-songliubraving@fb.com

    Song Liu
     
  • With its use in BPF, the cookie generator can be called very frequently
    in particular when used out of cgroup v2 hooks (e.g. connect / sendmsg)
    and attached to the root cgroup, for example, when used in v1/v2 mixed
    environments. In particular, when there's a high churn on sockets in the
    system there can be many parallel requests to the bpf_get_socket_cookie()
    and bpf_get_netns_cookie() helpers which then cause contention on the
    atomic counter.

    As similarly done in f991bd2e1421 ("fs: introduce a per-cpu last_ino
    allocator"), add a small helper library that both can use for the 64 bit
    counters. Given this can be called from different contexts, we also need
    to deal with potential nested calls even though in practice they are
    considered extremely rare. One idea as suggested by Eric Dumazet was
    to use a reverse counter for this situation since we don't expect 64 bit
    overflows anyways; that way, we can avoid bigger gaps in the 64 bit
    counter space compared to just batch-wise increase. Even on machines
    with small number of cores (e.g. 4) the cookie generation shrinks from
    min/max/med/avg (ns) of 22/50/40/38.9 down to 10/35/14/17.3 when run
    in parallel from multiple CPUs.
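
    A simplified sketch of the scheme (field names and the batch size are
    assumptions, not the exact kernel code):

    #define COOKIE_LOCAL_BATCH 4096

    u64 gen_cookie_next(struct gen_cookie *gc)
    {
        struct pcpu_gen_cookie *local = this_cpu_ptr(gc->local);
        u64 val;

        if (likely(local_inc_return(&local->nesting) == 1)) {
            /* common case: hand out from a per-cpu batch, touching the
             * shared forward counter only once per batch */
            val = local->last;
            if (unlikely((val & (COOKIE_LOCAL_BATCH - 1)) == 0))
                val = atomic64_add_return(COOKIE_LOCAL_BATCH,
                                          &gc->forward_last) -
                      COOKIE_LOCAL_BATCH;
            local->last = ++val;
        } else {
            /* nested call (e.g. NMI): allocate downward from the top of
             * the 64-bit space so it never collides with forward batches */
            val = (u64)atomic64_dec_return(&gc->reverse_last);
        }
        local_dec(&local->nesting);
        return val;
    }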

    Signed-off-by: Daniel Borkmann
    Signed-off-by: Alexei Starovoitov
    Reviewed-by: Eric Dumazet
    Acked-by: Martin KaFai Lau
    Cc: Eric Dumazet
    Link: https://lore.kernel.org/bpf/8a80b8d27d3c49f9a14e1d5213c19d8be87d1dc8.1601477936.git.daniel@iogearbox.net

    Daniel Borkmann
     

30 Sep, 2020

4 commits

  • Eelco reported we can't properly access arguments if the tracing
    program is attached to an extension program.

    Having following program:

    SEC("classifier/test_pkt_md_access")
    int test_pkt_md_access(struct __sk_buff *skb)

    with its extension:

    SEC("freplace/test_pkt_md_access")
    int test_pkt_md_access_new(struct __sk_buff *skb)

    and tracing that extension with:

    SEC("fentry/test_pkt_md_access_new")
    int BPF_PROG(fentry, struct sk_buff *skb)

    It's not possible to access skb argument in the fentry program,
    with following error from verifier:

    ; int BPF_PROG(fentry, struct sk_buff *skb)
    0: (79) r1 = *(u64 *)(r1 +0)
    invalid bpf_context access off=0 size=8

    The problem is that btf_ctx_access() gets the context type for the
    traced program, which in this case is the extension.

    But when we trace an extension program, we want to get the context
    type of the program that the extension is attached to, so we can
    access the arguments properly in the trace program.

    This version of the patch is tweaked slightly from Jiri's original one,
    since the refactoring in the previous patches means we have to get the
    target prog type from the new variable in prog->aux instead of directly
    from the target prog.

    Reported-by: Eelco Chaudron
    Suggested-by: Jiri Olsa
    Signed-off-by: Toke Høiland-Jørgensen
    Signed-off-by: Alexei Starovoitov
    Acked-by: Andrii Nakryiko
    Link: https://lore.kernel.org/bpf/160138355278.48470.17057040257274725638.stgit@toke.dk

    Toke Høiland-Jørgensen
     
  • This enables support for attaching freplace programs to multiple attach
    points. It does this by amending the UAPI for bpf_link_create with a target
    btf ID that can be used to supply the new attachment point along with the
    target program fd. The target must be compatible with the target that was
    supplied at program load time.

    The implementation reuses the checks that were factored out of
    check_attach_btf_id() to ensure compatibility between the BTF types of the
    old and new attachment. If these match, a new bpf_tracing_link will be
    created for the new attach target, allowing multiple attachments to
    co-exist simultaneously.

    The code could theoretically support multiple-attach of other types of
    tracing programs as well, but since I don't have a use case for any of
    those, there is no API support for doing so.
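
    From user space, re-attachment could look roughly like this (libbpf
    low-level API; the target_btf_id option field is the one introduced by
    this series):

    DECLARE_LIBBPF_OPTS(bpf_link_create_opts, opts,
                        .target_btf_id = btf_id_of_new_target);

    /* freplace prog previously loaded against a compatible target */
    int link_fd = bpf_link_create(freplace_prog_fd, new_target_prog_fd,
                                  0 /* attach_type */, &opts);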

    Signed-off-by: Toke Høiland-Jørgensen
    Signed-off-by: Alexei Starovoitov
    Acked-by: Andrii Nakryiko
    Link: https://lore.kernel.org/bpf/160138355169.48470.17165680973640685368.stgit@toke.dk

    Toke Høiland-Jørgensen
     
  • In preparation for allowing multiple attachments of freplace programs, move
    the references to the target program and trampoline into the
    bpf_tracing_link structure when that is created. To do this atomically,
    introduce a new mutex in prog->aux to protect writing to the two pointers
    to target prog and trampoline, and rename the members to make it clear that
    they are related.

    With this change, it is no longer possible to attach the same tracing
    program multiple times (detaching in-between), since the reference from the
    tracing program to the target disappears on the first attach. However,
    since the next patch will let the caller supply an attach target, that will
    also make it possible to attach to the same place multiple times.

    Signed-off-by: Toke Høiland-Jørgensen
    Signed-off-by: Alexei Starovoitov
    Acked-by: Andrii Nakryiko
    Link: https://lore.kernel.org/bpf/160138355059.48470.2503076992210324984.stgit@toke.dk

    Toke Høiland-Jørgensen
     
  • The Makefile in bpf/preload builds a local copy of libbpf, but does not
    properly clean up after itself. This can lead to subsequent compilation
    failures, since the feature detection cache is kept around which can lead
    subsequent detection to fail.

    Fix this by properly setting clean-files, and while we're at it, also add a
    .gitignore for the directory to ignore the build artifacts.

    Fixes: d71fa5c9763c ("bpf: Add kernel module with user mode driver that populates bpffs.")
    Signed-off-by: Toke Høiland-Jørgensen
    Signed-off-by: Alexei Starovoitov
    Link: https://lore.kernel.org/bpf/20200927193005.8459-1-toke@redhat.com

    Toke Høiland-Jørgensen
     

29 Sep, 2020

3 commits

  • A helper is added to allow seq file writing of kernel data
    structures using vmlinux BTF. Its signature is

    long bpf_seq_printf_btf(struct seq_file *m, struct btf_ptr *ptr,
    u32 btf_ptr_size, u64 flags);

    Flags and struct btf_ptr definitions/use are identical to the
    bpf_snprintf_btf helper, and the helper returns 0 on success
    or a negative error value.
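
    Typical use from a task iterator program (a sketch based on the
    signature above):

    SEC("iter/task")
    int dump_task(struct bpf_iter__task *ctx)
    {
        struct seq_file *seq = ctx->meta->seq;
        struct task_struct *task = ctx->task;
        struct btf_ptr ptr = {};

        if (!task)
            return 0;

        ptr.ptr = task;
        ptr.type_id = bpf_core_type_id_kernel(struct task_struct);
        /* pretty-print the whole task_struct via vmlinux BTF */
        bpf_seq_printf_btf(seq, &ptr, sizeof(ptr), 0);
        return 0;
    }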

    Suggested-by: Alexei Starovoitov
    Signed-off-by: Alan Maguire
    Signed-off-by: Alexei Starovoitov
    Link: https://lore.kernel.org/bpf/1601292670-1616-8-git-send-email-alan.maguire@oracle.com

    Alan Maguire
     
  • BPF iter size is limited to PAGE_SIZE; if we wish to display BTF-based
    representations of larger kernel data structures such as task_struct,
    this will be insufficient.

    Suggested-by: Alexei Starovoitov
    Signed-off-by: Alan Maguire
    Signed-off-by: Alexei Starovoitov
    Link: https://lore.kernel.org/bpf/1601292670-1616-6-git-send-email-alan.maguire@oracle.com

    Alan Maguire
     
  • A helper is added to support tracing kernel type information in BPF
    using the BPF Type Format (BTF). Its signature is

    long bpf_snprintf_btf(char *str, u32 str_size, struct btf_ptr *ptr,
    u32 btf_ptr_size, u64 flags);

    struct btf_ptr * specifies

    - a pointer to the data to be traced
    - the BTF id of the type of data pointed to
    - a flags field is provided for future use; these flags
    are not to be confused with the BTF_F_* flags
    below that control how the btf_ptr is displayed. The
    flags member of struct btf_ptr may be used to
    disambiguate types in kernel versus module BTF, etc.;
    the main distinction is that these flags relate to the type
    and the information needed to identify it, not to how it
    is displayed.

    For example a BPF program with a struct sk_buff *skb
    could do the following:

    static struct btf_ptr b = { };

    b.ptr = skb;
    b.type_id = __builtin_btf_type_id(struct sk_buff, 1);
    bpf_snprintf_btf(str, sizeof(str), &b, sizeof(b), 0);

    Default output looks like this:

    (struct sk_buff){
    .transport_header = (__u16)65535,
    .mac_header = (__u16)65535,
    .end = (sk_buff_data_t)192,
    .head = (unsigned char *)0x000000007524fd8b,
    .data = (unsigned char *)0x000000007524fd8b,
    .truesize = (unsigned int)768,
    .users = (refcount_t){
    .refs = (atomic_t){
    .counter = (int)1,
    },
    },
    }

    Flags modifying display are as follows:

    - BTF_F_COMPACT: no formatting around type information
    - BTF_F_NONAME: no struct/union member names/types
    - BTF_F_PTR_RAW: show raw (unobfuscated) pointer values;
    equivalent to %px.
    - BTF_F_ZERO: show zero-valued struct/union members;
    they are not displayed by default

    Signed-off-by: Alan Maguire
    Signed-off-by: Alexei Starovoitov
    Link: https://lore.kernel.org/bpf/1601292670-1616-4-git-send-email-alan.maguire@oracle.com

    Alan Maguire