09 Dec, 2019

1 commit

  • Pull networking fixes from David Miller:

    1) More jumbo frame fixes in r8169, from Heiner Kallweit.

    2) Fix bpf build in minimal configuration, from Alexei Starovoitov.

    3) Use after free in slcan driver, from Jouni Hogander.

    4) Flower classifier port ranges don't work properly in the HW offload
    case, from Yoshiki Komachi.

    5) Use after free in hns3_nic_maybe_stop_tx(), from Yunsheng Lin.

    6) Out of bounds access in mqprio_dump(), from Vladyslav Tarasiuk.

    7) Fix flow dissection in dsa TX path, from Alexander Lobakin.

    8) Stale syncookie timestamp fixes from Guillaume Nault.

    [ Did an evil merge to silence a warning introduced by this pull - Linus ]

    * git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (84 commits)
    r8169: fix rtl_hw_jumbo_disable for RTL8168evl
    net_sched: validate TCA_KIND attribute in tc_chain_tmplt_add()
    r8169: add missing RX enabling for WoL on RTL8125
    vhost/vsock: accept only packets with the right dst_cid
    net: phy: dp83867: fix hfs boot in rgmii mode
    net: ethernet: ti: cpsw: fix extra rx interrupt
    inet: protect against too small mtu values.
    gre: refetch erspan header from skb->data after pskb_may_pull()
    pppoe: remove redundant BUG_ON() check in pppoe_pernet
    tcp: Protect accesses to .ts_recent_stamp with {READ,WRITE}_ONCE()
    tcp: tighten acceptance of ACKs not matching a child socket
    tcp: fix rejected syncookies due to stale timestamps
    lpc_eth: kernel BUG on remove
    tcp: md5: fix potential overestimation of TCP option space
    net: sched: allow indirect blocks to bind to clsact in TC
    net: core: rename indirect block ingress cb function
    net-sysfs: Call dev_hold always in netdev_queue_add_kobject
    net: dsa: fix flow dissection on Tx path
    net/tls: Fix return values to avoid ENOTSUPP
    net: avoid an indirect call in ____sys_recvmsg()
    ...

    Linus Torvalds
     

05 Dec, 2019

1 commit

  • For a jited bpf program, if the subprogram count is 1, i.e.,
    there are no callees in the program, prog->aux->func will be NULL
    and prog->bpf_func points to the image address of the program.

    If there is more than one subprogram, prog->aux->func is populated,
    and subprogram 0 can be accessed through either prog->bpf_func or
    prog->aux->func[0]. Other subprograms should be accessed through
    prog->aux->func[subprog_id].
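
    As a minimal sketch of that access rule (illustrative only, not the literal
    kernel code; it assumes the kernel-internal struct bpf_prog fields named
    above):

        /* Pick the image address of a given subprogram of a jited prog. */
        static void *subprog_image(const struct bpf_prog *prog, u32 subprog_id)
        {
                if (!prog->aux->func)
                        /* single-subprogram case: only subprog 0 exists */
                        return subprog_id == 0 ? (void *)prog->bpf_func : NULL;
                /* multi-subprogram case: func[] covers every subprogram */
                return (void *)prog->aux->func[subprog_id]->bpf_func;
        }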

    This patch fixes a bug in check_attach_btf_id(), where
    prog->aux->func[subprog_id] was used to access any subprogram, which
    caused a segfault like below:
    [79162.619208] BUG: kernel NULL pointer dereference, address:
    0000000000000000
    ......
    [79162.634255] Call Trace:
    [79162.634974] ? _cond_resched+0x15/0x30
    [79162.635686] ? kmem_cache_alloc_trace+0x162/0x220
    [79162.636398] ? selinux_bpf_prog_alloc+0x1f/0x60
    [79162.637111] bpf_prog_load+0x3de/0x690
    [79162.637809] __do_sys_bpf+0x105/0x1740
    [79162.638488] do_syscall_64+0x5b/0x180
    [79162.639147] entry_SYSCALL_64_after_hwframe+0x44/0xa9
    ......

    Fixes: 5b92a28aae4d ("bpf: Support attaching tracing BPF program to other BPF programs")
    Reported-by: Eelco Chaudron
    Signed-off-by: Yonghong Song
    Signed-off-by: Alexei Starovoitov
    Link: https://lore.kernel.org/bpf/20191205010606.177774-1-yhs@fb.com

    Yonghong Song
     

04 Dec, 2019

1 commit

  • Pull irq updates from Ingo Molnar:
    "Most of the IRQ subsystem changes in this cycle were irq-chip driver
    updates:

    - Qualcomm PDC wakeup interrupt support

    - Layerscape external IRQ support

    - Broadcom bcm7038 PM and wakeup support

    - Ingenic driver cleanup and modernization

    - GICv3 ITS preparation for GICv4.1 updates

    - GICv4 fixes

    There's also the series from Frederic Weisbecker that fixes memory
    ordering bugs for the irq-work logic, whose primary fix is to turn
    work->irq_work.flags into an atomic variable and then convert the
    complex (and buggy) atomic_cmpxchg() loop in irq_work_claim() into a
    much simpler atomic_fetch_or() call.

    There are also various smaller cleanups"
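
    For the irq_work fix mentioned above, the claim fast path roughly becomes the
    following (a simplified sketch; flag names follow the irq_work code and the
    memory-ordering comments are omitted):

        static bool irq_work_claim(struct irq_work *work)
        {
                int oflags;

                /* Single atomic RMW replaces the old cmpxchg() retry loop. */
                oflags = atomic_fetch_or(IRQ_WORK_CLAIMED, &work->flags);
                /* If the work was already pending, someone else will run it. */
                if (oflags & IRQ_WORK_PENDING)
                        return false;
                return true;
        }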

    * 'irq-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (44 commits)
    pinctrl/sdm845: Add PDC wakeup interrupt map for GPIOs
    pinctrl/msm: Setup GPIO chip in hierarchy
    irqchip/qcom-pdc: Add irqchip set/get state calls
    irqchip/qcom-pdc: Add irqdomain for wakeup capable GPIOs
    irqchip/qcom-pdc: Do not toggle IRQ_ENABLE during mask/unmask
    irqchip/qcom-pdc: Update max PDC interrupts
    of/irq: Document properties for wakeup interrupt parent
    genirq: Introduce irq_chip_get/set_parent_state calls
    irqdomain: Add bus token DOMAIN_BUS_WAKEUP
    genirq: Fix function documentation of __irq_alloc_descs()
    irq_work: Fix IRQ_WORK_BUSY bit clearing
    irqchip/ti-sci-inta: Use ERR_CAST inlined function instead of ERR_PTR(PTR_ERR(...))
    irq_work: Slightly simplify IRQ_WORK_PENDING clearing
    irq_work: Fix irq_work_claim() memory ordering
    irq_work: Convert flags to atomic_t
    irqchip: Ingenic: Add process for more than one irq at the same time.
    irqchip: ingenic: Alloc generic chips from IRQ domain
    irqchip: ingenic: Get virq number from IRQ domain
    irqchip: ingenic: Error out if IRQ domain creation failed
    irqchip: ingenic: Drop redundant irq_suspend / irq_resume functions
    ...

    Linus Torvalds
     

03 Dec, 2019

1 commit

  • Daniel Borkmann says:

    ====================
    pull-request: bpf 2019-12-02

    The following pull-request contains BPF updates for your *net* tree.

    We've added 10 non-merge commits during the last 6 day(s) which contain
    a total of 10 files changed, 60 insertions(+), 51 deletions(-).

    The main changes are:

    1) Fix vmlinux BTF generation for binutils pre v2.25, from Stanislav Fomichev.

    2) Fix libbpf global variable relocation to take symbol's st_value offset
    into account, from Andrii Nakryiko.

    3) Fix libbpf build on powerpc where check_abi target fails due to different
    readelf output format, from Aurelien Jarno.

    4) Don't set BPF insns RO for the case when they are JITed in order to avoid
    fragmenting the direct map, from Daniel Borkmann.

    5) Fix static checker warning in btf_distill_func_proto() as well as a build
    error due to empty enum when BPF is compiled out, from Alexei Starovoitov.

    6) Fix up generation of bpf_helper_defs.h for perf, from Arnaldo Carvalho de Melo.
    ====================

    Signed-off-by: David S. Miller

    David S. Miller
     

29 Nov, 2019

1 commit

  • Some kconfigs can have BPF enabled without a single valid program type.
    In such configurations the build will fail with:
    ./kernel/bpf/btf.c:3466:1: error: empty enum is invalid

    Fix it by adding an unused value to the enum.
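
    The shape of the problem and of the fix, sketched with hypothetical names
    (the real enum in btf.c is generated by macro expansion over the configured
    program types, so with no program types configured it would be empty):

        /* Sketch: an enum whose members come from configurable macro expansion
         * can end up empty, which C does not allow. */
        #define FOR_EACH_CONFIGURED_PROG_TYPE(M)  /* empty in such a minimal config */

        enum ctx_convert {
        #define M(_id) __ctx_convert_##_id,
                FOR_EACH_CONFIGURED_PROG_TYPE(M)
        #undef M
                __ctx_convert_unused,   /* unused value keeps the enum non-empty */
        };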

    Reported-by: Randy Dunlap
    Signed-off-by: Alexei Starovoitov
    Signed-off-by: Daniel Borkmann
    Acked-by: Randy Dunlap # build-tested
    Link: https://lore.kernel.org/bpf/20191128043508.2346723-1-ast@kernel.org

    Alexei Starovoitov
     

27 Nov, 2019

3 commits

  • kernel/bpf/btf.c:4023 btf_distill_func_proto()
    error: potentially dereferencing uninitialized 't'.

    kernel/bpf/btf.c
    4012 nargs = btf_type_vlen(func);
    4013 if (nargs >= MAX_BPF_FUNC_ARGS) {
    4014 bpf_log(log,
    4015 "The function %s has %d arguments. Too many.\n",
    4016 tname, nargs);
    4017 return -EINVAL;
    4018 }
    4019 ret = __get_type_size(btf, func->type, &t);
    ^^
    t isn't initialized for the first -EINVAL return

    This is an unlikely path, since BTF should have been validated at this point.
    Fix it by returning the 'void' BTF type.
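
    One plausible shape of such a fix, sketched (the helper name below is
    hypothetical; in BTF, type id 0 is 'void'):

        /* Hand back the 'void' type on failure instead of leaving the caller's
         * type pointer uninitialized. */
        static const struct btf_type *type_or_void(const struct btf *btf,
                                                   const struct btf_type *t)
        {
                return t ? t : btf_type_by_id(btf, 0);
        }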

    Reported-by: Dan Carpenter
    Signed-off-by: Alexei Starovoitov
    Signed-off-by: Daniel Borkmann
    Link: https://lore.kernel.org/bpf/20191126230106.237179-1-ast@kernel.org

    Alexei Starovoitov
     
  • Pull locking updates from Ingo Molnar:
    "The main changes in this cycle were:

    - A comprehensive rewrite of the robust/PI futex code's exit handling
    to fix various exit races. (Thomas Gleixner et al)

    - Rework the generic REFCOUNT_FULL implementation using
    atomic_fetch_* operations so that the performance impact of the
    cmpxchg() loops is mitigated for common refcount operations.

    With these performance improvements the generic implementation of
    refcount_t should be good enough for everybody - and this got
    confirmed by performance testing, so remove ARCH_HAS_REFCOUNT and
    REFCOUNT_FULL entirely, leaving the generic implementation enabled
    unconditionally. (Will Deacon)

    - Other misc changes, fixes, cleanups"

    * 'locking-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (27 commits)
    lkdtm: Remove references to CONFIG_REFCOUNT_FULL
    locking/refcount: Remove unused 'refcount_error_report()' function
    locking/refcount: Consolidate implementations of refcount_t
    locking/refcount: Consolidate REFCOUNT_{MAX,SATURATED} definitions
    locking/refcount: Move saturation warnings out of line
    locking/refcount: Improve performance of generic REFCOUNT_FULL code
    locking/refcount: Move the bulk of the REFCOUNT_FULL implementation into the header
    locking/refcount: Remove unused refcount_*_checked() variants
    locking/refcount: Ensure integer operands are treated as signed
    locking/refcount: Define constants for saturation and max refcount values
    futex: Prevent exit livelock
    futex: Provide distinct return value when owner is exiting
    futex: Add mutex around futex exit
    futex: Provide state handling for exec() as well
    futex: Sanitize exit state handling
    futex: Mark the begin of futex exit explicitly
    futex: Set task::futex_state to DEAD right after handling futex exit
    futex: Split futex_mm_release() for exit/exec
    exit/exec: Seperate mm_release()
    futex: Replace PF_EXITPIDONE with a state
    ...

    Linus Torvalds
     
  • Pull RCU updates from Ingo Molnar:
    "The main changes in this cycle were:

    - Dynamic tick (nohz) updates, perhaps most notably changes to force
    the tick on when needed due to lengthy in-kernel execution on CPUs
    on which RCU is waiting.

    - Linux-kernel memory consistency model updates.

    - Replace rcu_swap_protected() with rcu_replace_pointer() (see the sketch below).

    - Torture-test updates.

    - Documentation updates.

    - Miscellaneous fixes"
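
    The conversion pattern behind the rcu_replace_pointer() item above, sketched
    (struct, field and lock names are placeholders):

        struct foo {
                struct bar __rcu *bar;
                spinlock_t lock;
        };

        /* Before: rcu_swap_protected(foo->bar, newbar, lockdep_is_held(&foo->lock));
         * After: the old pointer is returned rather than swapped into 'newbar'. */
        static struct bar *foo_replace_bar(struct foo *foo, struct bar *newbar)
        {
                return rcu_replace_pointer(foo->bar, newbar,
                                           lockdep_is_held(&foo->lock));
        }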

    * 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (51 commits)
    security/safesetid: Replace rcu_swap_protected() with rcu_replace_pointer()
    net/sched: Replace rcu_swap_protected() with rcu_replace_pointer()
    net/netfilter: Replace rcu_swap_protected() with rcu_replace_pointer()
    net/core: Replace rcu_swap_protected() with rcu_replace_pointer()
    bpf/cgroup: Replace rcu_swap_protected() with rcu_replace_pointer()
    fs/afs: Replace rcu_swap_protected() with rcu_replace_pointer()
    drivers/scsi: Replace rcu_swap_protected() with rcu_replace_pointer()
    drm/i915: Replace rcu_swap_protected() with rcu_replace_pointer()
    x86/kvm/pmu: Replace rcu_swap_protected() with rcu_replace_pointer()
    rcu: Upgrade rcu_swap_protected() to rcu_replace_pointer()
    rcu: Suppress levelspread uninitialized messages
    rcu: Fix uninitialized variable in nocb_gp_wait()
    rcu: Update descriptions for rcu_future_grace_period tracepoint
    rcu: Update descriptions for rcu_nocb_wake tracepoint
    rcu: Remove obsolete descriptions for rcu_barrier tracepoint
    rcu: Ensure that ->rcu_urgent_qs is set before resched IPI
    workqueue: Convert for_each_wq to use built-in list check
    rcu: Several rcu_segcblist functions can be static
    rcu: Remove unused function hlist_bl_del_init_rcu()
    Documentation: Rename rcu_node_context_switch() to rcu_note_context_switch()
    ...

    Linus Torvalds
     

26 Nov, 2019

2 commits

  • Pull networking updates from David Miller:
    "Another merge window, another pull full of stuff:

    1) Support alternative names for network devices, from Jiri Pirko.

    2) Introduce per-netns netdev notifiers, also from Jiri Pirko.

    3) Support MSG_PEEK in vsock/virtio, from Matias Ezequiel Vara
    Larsen.

    4) Allow compiling out the TLS TOE code, from Jakub Kicinski.

    5) Add several new tracepoints to the kTLS code, also from Jakub.

    6) Support set channels ethtool callback in ena driver, from Sameeh
    Jubran.

    7) New SCTP events SCTP_ADDR_ADDED, SCTP_ADDR_REMOVED,
    SCTP_ADDR_MADE_PRIM, and SCTP_SEND_FAILED_EVENT. From Xin Long.

    8) Add XDP support to mvneta driver, from Lorenzo Bianconi.

    9) Lots of netfilter hw offload fixes, cleanups and enhancements,
    from Pablo Neira Ayuso.

    10) PTP support for aquantia chips, from Egor Pomozov.

    11) Add UDP segmentation offload support to igb, ixgbe, and i40e. From
    Josh Hunt.

    12) Add smart nagle to tipc, from Jon Maloy.

    13) Support L2 field rewrite by TC offloads in bnxt_en, from Venkat
    Duvvuru.

    14) Add a flow mask cache to OVS, from Tonghao Zhang.

    15) Add XDP support to ice driver, from Maciej Fijalkowski.

    16) Add AF_XDP support to ice driver, from Krzysztof Kazimierczak.

    17) Support UDP GSO offload in atlantic driver, from Igor Russkikh.

    18) Support it in stmmac driver too, from Jose Abreu.

    19) Support TIPC encryption and auth, from Tuong Lien.

    20) Introduce BPF trampolines, from Alexei Starovoitov.

    21) Make page_pool API more numa friendly, from Saeed Mahameed.

    22) Introduce route hints to ipv4 and ipv6, from Paolo Abeni.

    23) Add UDP segmentation offload to cxgb4, Rahul Lakkireddy"

    * git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (1857 commits)
    libbpf: Fix usage of u32 in userspace code
    mm: Implement no-MMU variant of vmalloc_user_node_flags
    slip: Fix use-after-free Read in slip_open
    net: dsa: sja1105: fix sja1105_parse_rgmii_delays()
    macvlan: schedule bc_work even if error
    enetc: add support Credit Based Shaper(CBS) for hardware offload
    net: phy: add helpers phy_(un)lock_mdio_bus
    mdio_bus: don't use managed reset-controller
    ax88179_178a: add ethtool_op_get_ts_info()
    mlxsw: spectrum_router: Fix use of uninitialized adjacency index
    mlxsw: spectrum_router: After underlay moves, demote conflicting tunnels
    bpf: Simplify __bpf_arch_text_poke poke type handling
    bpf: Introduce BPF_TRACE_x helper for the tracing tests
    bpf: Add bpf_jit_blinding_enabled for !CONFIG_BPF_JIT
    bpf, testing: Add various tail call test cases
    bpf, x86: Emit patchable direct jump as tail call
    bpf: Constant map key tracking for prog array pokes
    bpf: Add poke dependency tracking for prog array maps
    bpf: Add initial poke descriptor table for jit images
    bpf: Move owner type, jited info into array auxiliary data
    ...

    Linus Torvalds
     
  • Pull cgroup updates from Tejun Heo:
    "There are several notable changes here:

    - Single thread migrating itself has been optimized so that it
    doesn't need threadgroup rwsem anymore.

    - Freezer optimization to avoid unnecessary frozen state changes.

    - cgroup ID unification so that cgroup fs ino is the only unique ID
    used for the cgroup and can be used to directly look up live
    cgroups through filehandle interface on 64bit ino archs. On 32bit
    archs, cgroup fs ino is still the only ID in use but it is only
    unique when combined with gen.

    - selftest and other changes"

    * 'for-5.5' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: (24 commits)
    writeback: fix -Wformat compilation warnings
    docs: cgroup: mm: Fix spelling of "list"
    cgroup: fix incorrect WARN_ON_ONCE() in cgroup_setup_root()
    cgroup: use cgrp->kn->id as the cgroup ID
    kernfs: use 64bit inos if ino_t is 64bit
    kernfs: implement custom exportfs ops and fid type
    kernfs: combine ino/id lookup functions into kernfs_find_and_get_node_by_id()
    kernfs: convert kernfs_node->id from union kernfs_node_id to u64
    kernfs: kernfs_find_and_get_node_by_ino() should only look up activated nodes
    kernfs: use dumber locking for kernfs_find_and_get_node_by_ino()
    netprio: use css ID instead of cgroup ID
    writeback: use ino_t for inodes in tracepoints
    kernfs: fix ino wrap-around detection
    kselftests: cgroup: Avoid the reuse of fd after it is deallocated
    cgroup: freezer: don't change task and cgroups status unnecessarily
    cgroup: use cgroup->last_bstat instead of cgroup->bstat_pending for consistency
    cgroup: remove cgroup_enable_task_cg_lists() optimization
    cgroup: pids: use atomic64_t for pids->limit
    selftests: cgroup: Run test_core under interfering stress
    selftests: cgroup: Add task migration tests
    ...

    Linus Torvalds
     

25 Nov, 2019

8 commits

  • Given that we have BPF_MOD_NOP_TO_{CALL,JUMP}, BPF_MOD_{CALL,JUMP}_TO_NOP
    and BPF_MOD_{CALL,JUMP}_TO_{CALL,JUMP} poke types and that we also pass in
    old_addr as well as new_addr, it's a bit redundant and unnecessarily
    complicates __bpf_arch_text_poke() itself since we can derive the same from
    the *_addr values that were passed in. Hence simplify and use BPF_MOD_{CALL,JUMP}
    as types, which also allows cleaning up call-sites.

    In addition to that, __bpf_arch_text_poke() currently verifies that text
    matches expected old_insn before we invoke text_poke_bp(). Also add a check
    on new_insn and skip the rewrite if it already matches. The reason this is
    useful is that it avoids any special casing in prog_array_map_poke_run()
    when the old and new prog are NULL, and has the benefit that even for this
    case we check whether the text really matches our expectations.
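
    The resulting interface, sketched from the description above (the exact kernel
    prototypes may differ slightly):

        enum bpf_text_poke_type {
                BPF_MOD_CALL,
                BPF_MOD_JUMP,
        };

        int bpf_arch_text_poke(void *ip, enum bpf_text_poke_type t,
                               void *old_addr, void *new_addr);

        /* e.g. patch a nop at 'ip' into a jump to a program image:
         *      bpf_arch_text_poke(ip, BPF_MOD_JUMP, NULL, new_prog_addr);
         * NULL old_addr/new_addr stand in for "nop here", replacing the former
         * *_TO_NOP / NOP_TO_* poke types. */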

    Suggested-by: Andrii Nakryiko
    Signed-off-by: Daniel Borkmann
    Signed-off-by: Alexei Starovoitov
    Link: https://lore.kernel.org/bpf/fcb00a2b0b288d6c73de4ef58116a821c8fe8f2f.1574555798.git.daniel@iogearbox.net

    Daniel Borkmann
     
  • Add tracking of constant keys into tail call maps. The signature of
    bpf_tail_call_proto is that arg1 is ctx, arg2 is the map pointer and arg3
    is an index key. The direct call approach for tail calls can be enabled
    if the verifier asserted that for all branches leading to the tail call
    helper invocation, the map pointer and index key were both constant
    and the same.

    Tracking of map pointers we already do from prior work via c93552c443eb
    ("bpf: properly enforce index mask to prevent out-of-bounds speculation")
    and 09772d92cd5a ("bpf: avoid retpoline for lookup/update/ delete calls
    on maps").

    Given the tail call map index key is not on stack but directly in the
    register, we can add a similar tracking approach and later in fixup_bpf_calls()
    add a poke descriptor to the progs poke_tab with the relevant information
    for the JITing phase.

    We internally reuse insn->imm for the rewritten BPF_JMP | BPF_TAIL_CALL
    instruction in order to point into the prog's poke_tab, and keep insn->imm
    as 0 as indicator that current indirect tail call emission must be used.
    Note that publishing to the tracker must happen at the end of fixup_bpf_calls()
    since adding elements to the poke_tab reallocates its memory, so we need
    to wait until it's in its final state.

    Future work can generalize and add similar approach to optimize plain
    array map lookups. Difference there is that we need to look into the key
    value that sits on stack. For clarity in bpf_insn_aux_data, map_state
    has been renamed into map_ptr_state, so we get map_{ptr,key}_state as
    trackers.
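
    A hedged example of a program shape that can now be JITed as a direct jump
    (map and section names are illustrative; both the map pointer and the index
    are constants on every path to the helper call):

        #include <linux/bpf.h>
        #include <bpf/bpf_helpers.h>

        struct {
                __uint(type, BPF_MAP_TYPE_PROG_ARRAY);
                __uint(max_entries, 2);
                __uint(key_size, sizeof(__u32));
                __uint(value_size, sizeof(__u32));
        } jmp_table SEC(".maps");

        SEC("xdp")
        int xdp_dispatch(struct xdp_md *ctx)
        {
                /* constant map, constant key 0 -> direct jump candidate */
                bpf_tail_call(ctx, &jmp_table, 0);
                /* only reached if the tail call misses */
                return XDP_PASS;
        }

        char _license[] SEC("license") = "GPL";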

    Signed-off-by: Daniel Borkmann
    Signed-off-by: Alexei Starovoitov
    Acked-by: Andrii Nakryiko
    Link: https://lore.kernel.org/bpf/e8db37f6b2ae60402fa40216c96738ee9b316c32.1574452833.git.daniel@iogearbox.net

    Daniel Borkmann
     
  • This work adds program tracking to prog array maps. This is needed such
    that upon prog array updates/deletions we can fix up all programs which
    make use of this tail call map. We add ops->map_poke_{un,}track()
    helpers to maps to maintain the list of programs and ops->map_poke_run()
    for triggering the actual update.

    bpf_array_aux is extended to contain the list head and poke_mutex in
    order to serialize program patching during updates/deletions.
    bpf_free_used_maps() will untrack the program shortly before dropping
    the reference to the map. For clearing out the prog array once all urefs
    are dropped we need to use schedule_work() to have a sleepable context.

    The prog_array_map_poke_run() is triggered during updates/deletions and
    walks the maintained prog list. It checks in their poke_tabs whether the
    map and key is matching and runs the actual bpf_arch_text_poke() for
    patching in the nop or new jmp location. Depending on the type of update,
    we use one of BPF_MOD_{NOP_TO_JUMP,JUMP_TO_NOP,JUMP_TO_JUMP}.
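
    The new hooks, sketched from the description above (prototypes approximate;
    they are added to struct bpf_map_ops):

        int  (*map_poke_track)(struct bpf_map *map, struct bpf_prog_aux *prog_aux);
        void (*map_poke_untrack)(struct bpf_map *map, struct bpf_prog_aux *prog_aux);
        void (*map_poke_run)(struct bpf_map *map, u32 key,
                             struct bpf_prog *old, struct bpf_prog *new);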

    Signed-off-by: Daniel Borkmann
    Signed-off-by: Alexei Starovoitov
    Acked-by: Andrii Nakryiko
    Link: https://lore.kernel.org/bpf/1fb364bb3c565b3e415d5ea348f036ff379e779d.1574452833.git.daniel@iogearbox.net

    Daniel Borkmann
     
  • Add initial poke table data structures and management to the BPF
    prog that can later be used by JITs. Also add an instance of poke
    specific data for tail call maps; plan for later work is to extend
    this also for BPF static keys.

    Signed-off-by: Daniel Borkmann
    Signed-off-by: Alexei Starovoitov
    Acked-by: Andrii Nakryiko
    Link: https://lore.kernel.org/bpf/1db285ec2ea4207ee0455b3f8e191a4fc58b9ade.1574452833.git.daniel@iogearbox.net

    Daniel Borkmann
     
  • We're going to extend this with further information which is only
    relevant for prog array at this point. Given this info is not used
    in the critical path, move it into its own structure so that the main
    array map structure can be kept on a diet.

    Signed-off-by: Daniel Borkmann
    Signed-off-by: Alexei Starovoitov
    Acked-by: Andrii Nakryiko
    Link: https://lore.kernel.org/bpf/b9ddccdb0f6f7026489ee955f16c96381e1e7238.1574452833.git.daniel@iogearbox.net

    Daniel Borkmann
     
  • Later on we are going to need a sleepable context, as opposed to a plain
    RCU callback, in order to untrack programs we need to poke at runtime;
    tracking as well as image updates are performed under a mutex.

    Signed-off-by: Daniel Borkmann
    Signed-off-by: Alexei Starovoitov
    Acked-by: Andrii Nakryiko
    Link: https://lore.kernel.org/bpf/09823b1d5262876e9b83a8e75df04cf0467357a4.1574452833.git.daniel@iogearbox.net

    Daniel Borkmann
     
  • With latest llvm (trunk https://github.com/llvm/llvm-project),
    test_progs, which has +alu32 enabled, failed for strobemeta.o.
    The verifier output looks like below with edit to replace large
    decimal numbers with hex ones.
    193: (85) call bpf_probe_read_user_str#114
    R0=inv(id=0)
    194: (26) if w0 > 0x1 goto pc+4
    R0_w=inv(id=0,umax_value=0xffffffff00000001)
    195: (6b) *(u16 *)(r7 +80) = r0
    196: (bc) w6 = w0
    R6_w=inv(id=0,umax_value=0xffffffff,var_off=(0x0; 0xffffffff))
    197: (67) r6 <<= 32
    198: (77) r6 >>= 32
    R6=inv(id=0,umax_value=0xffffffff,var_off=(0x0; 0xffffffff))
    ...
    201: (79) r8 = *(u64 *)(r10 -416)
    R8_w=map_value(id=0,off=40,ks=4,vs=13872,imm=0)
    202: (0f) r8 += r6
    R8_w=map_value(id=0,off=40,ks=4,vs=13872,umax_value=0xffffffff,var_off=(0x0; 0xffffffff))
    203: (07) r8 += 9696
    R8_w=map_value(id=0,off=9736,ks=4,vs=13872,umax_value=0xffffffff,var_off=(0x0; 0xffffffff))
    ...
    255: (bf) r1 = r8
    R1_w=map_value(id=0,off=9736,ks=4,vs=13872,umax_value=0xffffffff,var_off=(0x0; 0xffffffff))
    ...
    257: (85) call bpf_probe_read_user_str#114
    R1 unbounded memory access, make sure to bounds check any array access into a map

    The value range for register r6 at insn 198 should be really just 0/1.
    The umax_value=0xffffffff caused later verification failure.

    After jmp instructions, the current verifier already tries to use the just
    obtained information to get a better register range. The current mechanism is
    for 64bit registers only. This patch implements tightening of the range
    for 32bit sub-registers after jmp32 instructions.
    With the patch, we have the below ranges for the
    above code sequence:
    193: (85) call bpf_probe_read_user_str#114
    R0=inv(id=0)
    194: (26) if w0 > 0x1 goto pc+4
    R0_w=inv(id=0,smax_value=0x7fffffff00000001,umax_value=0xffffffff00000001,
    var_off=(0x0; 0xffffffff00000001))
    195: (6b) *(u16 *)(r7 +80) = r0
    196: (bc) w6 = w0
    R6_w=inv(id=0,umax_value=0xffffffff,var_off=(0x0; 0x1))
    197: (67) r6 <<= 32
    198: (77) r6 >>= 32
    R6=inv(id=0,umax_value=1,var_off=(0x0; 0x1))
    ...
    201: (79) r8 = *(u64 *)(r10 -416)
    R8_w=map_value(id=0,off=40,ks=4,vs=13872,imm=0)
    202: (0f) r8 += r6
    R8_w=map_value(id=0,off=40,ks=4,vs=13872,umax_value=1,var_off=(0x0; 0x1))
    203: (07) r8 += 9696
    R8_w=map_value(id=0,off=9736,ks=4,vs=13872,umax_value=1,var_off=(0x0; 0x1))
    ...
    255: (bf) r1 = r8
    R1_w=map_value(id=0,off=9736,ks=4,vs=13872,umax_value=1,var_off=(0x0; 0x1))
    ...
    257: (85) call bpf_probe_read_user_str#114
    ...

    At insn 194, the register R0 has a better var_off.mask and smax_value.
    In particular, the var_off.mask ensures the later lshift and rshift
    maintain a proper value range.

    Suggested-by: Alexei Starovoitov
    Signed-off-by: Yonghong Song
    Signed-off-by: Alexei Starovoitov
    Link: https://lore.kernel.org/bpf/20191121170650.449030-1-yhs@fb.com

    Yonghong Song
     
  • Tetsuo pointed out that it was not only the device unregister hook that was
    broken for devmap_hash types, it was also cleanup on map free. So better
    fix this as well.

    While we're at it, there's no reason to allocate the netdev_map array for
    DEVMAP_HASH, so skip that and adjust the cost accordingly.

    Fixes: 6f9d451ab1a3 ("xdp: Add devmap_hash map type for looking up devices by hashed index")
    Reported-by: Tetsuo Handa
    Signed-off-by: Toke Høiland-Jørgensen
    Signed-off-by: Alexei Starovoitov
    Acked-by: John Fastabend
    Link: https://lore.kernel.org/bpf/20191121133612.430414-1-toke@redhat.com

    Toke Høiland-Jørgensen
     

24 Nov, 2019

1 commit


23 Nov, 2019

1 commit


21 Nov, 2019

3 commits

  • Daniel Borkmann says:

    ====================
    pull-request: bpf-next 2019-11-20

    The following pull-request contains BPF updates for your *net-next* tree.

    We've added 81 non-merge commits during the last 17 day(s) which contain
    a total of 120 files changed, 4958 insertions(+), 1081 deletions(-).

    There are 3 trivial conflicts; resolve them by always taking the chunk from
    196e8ca74886c433:

    <<<<<<< HEAD
    =======
    void *bpf_map_area_mmapable_alloc(u64 size, int numa_node);
    >>>>>>> 196e8ca74886c433dcfc64a809707074b936aaf5

    <<<<<<< HEAD
    void *bpf_map_area_alloc(u64 size, int numa_node)
    =======
    static void *__bpf_map_area_alloc(u64 size, int numa_node, bool mmapable)
    >>>>>>> 196e8ca74886c433dcfc64a809707074b936aaf5

    <<<<<<< HEAD
    if (size <= (PAGE_SIZE << PAGE_ALLOC_COSTLY_ORDER)) {
    =======
    /* kmalloc()'ed memory can't be mmap()'ed */
    if (!mmapable && size <= (PAGE_SIZE << PAGE_ALLOC_COSTLY_ORDER)) {
    >>>>>>> 196e8ca74886c433dcfc64a809707074b936aaf5

    The main changes are:

    1) Addition of BPF trampoline which works as a bridge between kernel functions,
    BPF programs and other BPF programs along with two new use cases: i) fentry/fexit
    BPF programs for tracing with practically zero overhead to call into BPF (as
    opposed to k[ret]probes) and ii) attachment of the former to networking related
    programs to see input/output of networking programs (covering xdpdump use case),
    from Alexei Starovoitov.

    2) BPF array map mmap support and use in libbpf for global data maps; also a big
    batch of libbpf improvements, among others, support for reading bitfields in a
    relocatable manner (via libbpf's CO-RE helper API), from Andrii Nakryiko.

    3) Extend s390x JIT with usage of relative long jumps and loads in order to lift
    the current 64/512k size limits on JITed BPF programs there, from Ilya Leoshkevich.

    4) Add BPF audit support and emit messages upon successful prog load and unload in
    order to have a timeline of events, from Daniel Borkmann and Jiri Olsa.

    5) Extension to libbpf and xdpsock sample programs to demo the shared umem mode
    (XDP_SHARED_UMEM) as well as RX-only and TX-only sockets, from Magnus Karlsson.

    6) Several follow-up bug fixes for libbpf's auto-pinning code and a new API
    call named bpf_get_link_xdp_info() for retrieving the full set of prog
    IDs attached to XDP, from Toke Høiland-Jørgensen.

    7) Add BTF support for array of int, array of struct and multidimensional arrays
    and enable it for skb->cb[] access in kfree_skb test, from Martin KaFai Lau.

    8) Fix AF_XDP by using the correct number of channels from ethtool, from Luigi Rizzo.

    9) Two fixes for BPF selftest to get rid of a hang in test_tc_tunnel and to avoid
    xdping to be run as standalone, from Jiri Benc.

    10) Various BPF selftest fixes when run with latest LLVM trunk, from Yonghong Song.

    11) Fix a memory leak in BPF fentry test run data, from Colin Ian King.

    12) Various smaller misc cleanups and improvements mostly all over BPF selftests and
    samples, from Daniel T. Lee, Andre Guedes, Anders Roxell, Mao Wenan, Yue Haibing.
    ====================

    Signed-off-by: David S. Miller

    David S. Miller
     
  • Given we recently extended the original bpf_map_area_alloc() helper in
    commit fc9702273e2e ("bpf: Add mmap() support for BPF_MAP_TYPE_ARRAY"),
    we need to apply the same logic as in ff1c08e1f74b ("bpf: Change size
    to u64 for bpf_map_{area_alloc, charge_init}()"). To avoid conflicts,
    extend it for bpf-next.

    Reported-by: Stephen Rothwell
    Signed-off-by: Daniel Borkmann

    Daniel Borkmann
     
  • Allow for audit messages to be emitted upon BPF program load and
    unload for having a timeline of events. The load itself is in
    syscall context, so additional info about the process initiating
    the BPF prog creation can be logged and later directly correlated
    to the unload event.

    The only info really needed from the BPF side is the globally unique
    prog ID, with which audit user space tooling can then query / dump all
    info needed about the specific BPF program right upon the load event
    and enrich the record; thus the changes needed here can be kept
    small and non-intrusive to the core.

    Raw example output:

    # auditctl -D
    # auditctl -a always,exit -F arch=x86_64 -S bpf
    # ausearch --start recent -m 1334
    [...]
    ----
    time->Wed Nov 20 12:45:51 2019
    type=PROCTITLE msg=audit(1574271951.590:8974): proctitle="./test_verifier"
    type=SYSCALL msg=audit(1574271951.590:8974): arch=c000003e syscall=321 success=yes exit=14 a0=5 a1=7ffe2d923e80 a2=78 a3=0 items=0 ppid=742 pid=949 auid=0 uid=0 gid=0 euid=0 suid=0 fsuid=0 egid=0 sgid=0 fsgid=0 tty=pts0 ses=2 comm="test_verifier" exe="/root/bpf-next/tools/testing/selftests/bpf/test_verifier" subj=unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023 key=(null)
    type=UNKNOWN[1334] msg=audit(1574271951.590:8974): auid=0 uid=0 gid=0 ses=2 subj=unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023 pid=949 comm="test_verifier" exe="/root/bpf-next/tools/testing/selftests/bpf/test_verifier" prog-id=3260 event=LOAD
    ----
    time->Wed Nov 20 12:45:51 2019
    type=UNKNOWN[1334] msg=audit(1574271951.590:8975): prog-id=3260 event=UNLOAD
    ----
    [...]

    Signed-off-by: Daniel Borkmann
    Signed-off-by: Jiri Olsa
    Signed-off-by: Alexei Starovoitov
    Link: https://lore.kernel.org/bpf/20191120213816.8186-1-jolsa@kernel.org

    Daniel Borkmann
     

20 Nov, 2019

1 commit

  • Fix sparse warning:

    kernel/bpf/arraymap.c:481:5: warning:
    symbol 'array_map_mmap' was not declared. Should it be static?

    Reported-by: Hulk Robot
    Signed-off-by: YueHaibing
    Signed-off-by: Alexei Starovoitov
    Acked-by: Andrii Nakryiko
    Link: https://lore.kernel.org/bpf/20191119142113.15388-1-yuehaibing@huawei.com

    YueHaibing
     

18 Nov, 2019

3 commits

  • Add ability to memory-map contents of BPF array map. This is extremely useful
    for working with BPF global data from userspace programs. It allows avoiding
    the typical bpf_map_{lookup,update}_elem operations, improving both performance
    and usability.

    There had to be special considerations for map freezing, to avoid having
    writable memory view into a frozen map. To solve this issue, map freezing and
    mmap-ing is happening under mutex now:
    - if map is already frozen, no writable mapping is allowed;
    - if map has writable memory mappings active (accounted in map->writecnt),
    map freezing will keep failing with -EBUSY;
    - once number of writable memory mappings drops to zero, map freezing can be
    performed again.

    Only non-per-CPU plain arrays are supported right now. Maps with spinlocks
    can't be memory mapped either.

    For BPF_F_MMAPABLE array, memory allocation has to be done through vmalloc()
    to be mmap()'able. We also need to make sure that array data memory is
    page-sized and page-aligned, so we over-allocate memory in such a way that
    struct bpf_array is at the end of a single page of memory with array->value
    being aligned with the start of the second page. On deallocation we need to
    accommodate this memory arrangement to free vmalloc()'ed memory correctly.

    One important consideration concerns how the memory-mapping subsystem functions.
    It provides a few optional callbacks, among them open()
    and close(). close() is called for each memory region that is unmapped, so
    that users can decrease their reference counters and free up resources, if
    necessary. open() is *almost* symmetrical: it's called for each memory region
    that is being mapped, **except** the very first one. So bpf_map_mmap does the
    initial refcnt bump, while open() will do any extra ones after that. Thus the
    number of close() calls is equal to the number of open() calls plus one.
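
    A hedged userspace sketch of the resulting usage (error handling elided; a
    page-sized value keeps the size math trivial):

        #include <stdint.h>
        #include <string.h>
        #include <unistd.h>
        #include <sys/mman.h>
        #include <sys/syscall.h>
        #include <linux/bpf.h>

        int main(void)
        {
                union bpf_attr attr;
                int map_fd;
                uint32_t *data;

                memset(&attr, 0, sizeof(attr));
                attr.map_type    = BPF_MAP_TYPE_ARRAY;
                attr.key_size    = sizeof(uint32_t);
                attr.value_size  = 4096;
                attr.max_entries = 1;
                attr.map_flags   = BPF_F_MMAPABLE;   /* flag added by this patch */

                map_fd = syscall(__NR_bpf, BPF_MAP_CREATE, &attr, sizeof(attr));
                data = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, map_fd, 0);
                data[0] = 42;   /* no bpf_map_update_elem() round trip needed */
                munmap(data, 4096);
                close(map_fd);
                return 0;
        }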

    Signed-off-by: Andrii Nakryiko
    Signed-off-by: Daniel Borkmann
    Acked-by: Song Liu
    Acked-by: John Fastabend
    Acked-by: Johannes Weiner
    Link: https://lore.kernel.org/bpf/20191117172806.2195367-4-andriin@fb.com

    Andrii Nakryiko
     
  • Similarly to bpf_map's refcnt/usercnt, convert bpf_prog's refcnt to atomic64
    and remove the artificial 32k limit. This allows making bpf_prog's refcounting
    non-failing, simplifying the logic of bpf_prog_add/bpf_prog_inc users.

    Validated compilation by running allyesconfig kernel build.

    Suggested-by: Daniel Borkmann
    Signed-off-by: Andrii Nakryiko
    Signed-off-by: Daniel Borkmann
    Link: https://lore.kernel.org/bpf/20191117172806.2195367-3-andriin@fb.com

    Andrii Nakryiko
     
  • 92117d8443bc ("bpf: fix refcnt overflow") turned refcounting of bpf_map into
    a potentially failing operation when the refcount reaches the BPF_MAX_REFCNT
    limit (32k). Due to the use of a 32-bit counter, it's possible in practice to
    overflow the refcounter and make it wrap around to 0, causing an erroneous map
    free while there are still references to it, leading to use-after-free problems.

    But having failing refcounting operations is problematic in some cases. One
    example is mmap() interface. After establishing initial memory-mapping, user
    is allowed to arbitrarily map/remap/unmap parts of mapped memory, arbitrarily
    splitting it into multiple non-contiguous regions. All this happens without
    any control from the users of the mmap subsystem. Rather, the mmap subsystem sends
    notifications to original creator of memory mapping through open/close
    callbacks, which are optionally specified during initial memory mapping
    creation. These callbacks are used to maintain accurate refcount for bpf_map
    (see next patch in this series). The problem is that open() callback is not
    supposed to fail, because memory-mapped resource is set up and properly
    referenced. This is posing a problem for using memory-mapping with BPF maps.

    One solution to this is to maintain a separate refcount for just memory-mappings
    and do a single bpf_map_inc/bpf_map_put when it goes from/to zero, respectively.
    There are similar use cases in current work on tcp-bpf, necessitating an extra
    counter as well. This seems like a rather unfortunate and ugly solution that
    doesn't scale well to various new use cases.

    Another approach to solve this is to use non-failing refcount_t type, which
    uses a 32-bit counter internally, but, once reaching the overflow state at UINT_MAX,
    stays there. This ultimately causes a memory leak, but prevents use-after-free.

    But given refcounting is not the most performance-critical operation with BPF
    maps (it's not used from running BPF program code), we can also just switch to
    64-bit counter that can't overflow in practice, potentially disadvantaging
    32-bit platforms a tiny bit. This simplifies semantics and allows the above
    described scenarios to not worry about a failing refcount increment operation.

    In terms of struct bpf_map size, we are still good and use the same amount of
    space:

    BEFORE (3 cache lines, 8 bytes of padding at the end):
    struct bpf_map {
    const struct bpf_map_ops * ops __attribute__((__aligned__(64))); /* 0 8 */
    struct bpf_map * inner_map_meta; /* 8 8 */
    void * security; /* 16 8 */
    enum bpf_map_type map_type; /* 24 4 */
    u32 key_size; /* 28 4 */
    u32 value_size; /* 32 4 */
    u32 max_entries; /* 36 4 */
    u32 map_flags; /* 40 4 */
    int spin_lock_off; /* 44 4 */
    u32 id; /* 48 4 */
    int numa_node; /* 52 4 */
    u32 btf_key_type_id; /* 56 4 */
    u32 btf_value_type_id; /* 60 4 */
    /* --- cacheline 1 boundary (64 bytes) --- */
    struct btf * btf; /* 64 8 */
    struct bpf_map_memory memory; /* 72 16 */
    bool unpriv_array; /* 88 1 */
    bool frozen; /* 89 1 */

    /* XXX 38 bytes hole, try to pack */

    /* --- cacheline 2 boundary (128 bytes) --- */
    atomic_t refcnt __attribute__((__aligned__(64))); /* 128 4 */
    atomic_t usercnt; /* 132 4 */
    struct work_struct work; /* 136 32 */
    char name[16]; /* 168 16 */

    /* size: 192, cachelines: 3, members: 21 */
    /* sum members: 146, holes: 1, sum holes: 38 */
    /* padding: 8 */
    /* forced alignments: 2, forced holes: 1, sum forced holes: 38 */
    } __attribute__((__aligned__(64)));

    AFTER (same 3 cache lines, no extra padding now):
    struct bpf_map {
    const struct bpf_map_ops * ops __attribute__((__aligned__(64))); /* 0 8 */
    struct bpf_map * inner_map_meta; /* 8 8 */
    void * security; /* 16 8 */
    enum bpf_map_type map_type; /* 24 4 */
    u32 key_size; /* 28 4 */
    u32 value_size; /* 32 4 */
    u32 max_entries; /* 36 4 */
    u32 map_flags; /* 40 4 */
    int spin_lock_off; /* 44 4 */
    u32 id; /* 48 4 */
    int numa_node; /* 52 4 */
    u32 btf_key_type_id; /* 56 4 */
    u32 btf_value_type_id; /* 60 4 */
    /* --- cacheline 1 boundary (64 bytes) --- */
    struct btf * btf; /* 64 8 */
    struct bpf_map_memory memory; /* 72 16 */
    bool unpriv_array; /* 88 1 */
    bool frozen; /* 89 1 */

    /* XXX 38 bytes hole, try to pack */

    /* --- cacheline 2 boundary (128 bytes) --- */
    atomic64_t refcnt __attribute__((__aligned__(64))); /* 128 8 */
    atomic64_t usercnt; /* 136 8 */
    struct work_struct work; /* 144 32 */
    char name[16]; /* 176 16 */

    /* size: 192, cachelines: 3, members: 21 */
    /* sum members: 154, holes: 1, sum holes: 38 */
    /* forced alignments: 2, forced holes: 1, sum forced holes: 38 */
    } __attribute__((__aligned__(64)));

    This patch, while modifying all users of bpf_map_inc, also cleans up its
    interface to match bpf_map_put with separate operations for bpf_map_inc and
    bpf_map_inc_with_uref (to match bpf_map_put and bpf_map_put_with_uref,
    respectively). Also, given there are no users of bpf_map_inc_not_zero
    specifying uref=true, remove uref flag and default to uref=false internally.

    Signed-off-by: Andrii Nakryiko
    Signed-off-by: Daniel Borkmann
    Acked-by: Song Liu
    Link: https://lore.kernel.org/bpf/20191117172806.2195367-2-andriin@fb.com

    Andrii Nakryiko
     

16 Nov, 2019

7 commits

  • Allow FENTRY/FEXIT BPF programs to attach to other BPF programs of any type
    including their subprograms. This feature allows snooping on input and output
    packets in XDP and TC programs, including their return values. In order to do that
    the verifier needs to track types not only of vmlinux, but types of other BPF
    programs as well. The verifier also needs to translate uapi/linux/bpf.h types
    used by networking programs into kernel internal BTF types used by FENTRY/FEXIT
    BPF programs. In some cases LLVM optimizations can remove arguments from BPF
    subprograms without adjusting the BTF info that the LLVM backend knows about.
    When BTF info disagrees with the actual types that the verifier sees, the BPF
    trampoline has to fall back to being conservative and treat all arguments as
    u64. The FENTRY/FEXIT
    program can still attach to such subprograms, but it won't be able to recognize
    pointer types like 'struct sk_buff *' and it won't be able to pass them to
    bpf_skb_output() for dumping packets to user space. The FENTRY/FEXIT program
    would need to use bpf_probe_read_kernel() instead.

    The BPF_PROG_LOAD command is extended with an attach_prog_fd field. When it's set
    to zero, the attach_btf_id is one of the vmlinux BTF type ids. When attach_prog_fd
    points to a previously loaded BPF program, the attach_btf_id is the BTF type id of
    the main function or of one of its subprograms.
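
    A hedged sketch of the extended BPF_PROG_LOAD usage (a fragment; 'insns',
    'target_prog_fd' and 'subprog_btf_id' are placeholders assumed to be prepared
    elsewhere):

        union bpf_attr attr = {};

        attr.prog_type            = BPF_PROG_TYPE_TRACING;
        attr.expected_attach_type = BPF_TRACE_FENTRY;
        attr.attach_prog_fd       = target_prog_fd;  /* fd of a loaded XDP/TC prog */
        attr.attach_btf_id        = subprog_btf_id;  /* BTF id of its main func or subprog */
        /* ... insns, insn_cnt, license, etc. as usual ... */
        prog_fd = syscall(__NR_bpf, BPF_PROG_LOAD, &attr, sizeof(attr));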

    Signed-off-by: Alexei Starovoitov
    Signed-off-by: Daniel Borkmann
    Acked-by: Song Liu
    Link: https://lore.kernel.org/bpf/20191114185720.1641606-18-ast@kernel.org

    Alexei Starovoitov
     
  • Make the verifier check that BTF types of function arguments match actual types
    passed into top-level BPF program and into BPF-to-BPF calls. If types match
    such BPF programs and sub-programs will have full support of BPF trampoline. If
    types mismatch the trampoline has to be conservative. It has to save/restore
    five program arguments and assume 64-bit scalars.

    Signed-off-by: Alexei Starovoitov
    Signed-off-by: Daniel Borkmann
    Acked-by: Song Liu
    Acked-by: Andrii Nakryiko
    Link: https://lore.kernel.org/bpf/20191114185720.1641606-17-ast@kernel.org

    Alexei Starovoitov
     
  • Annotate BPF program context types with program-side type and kernel-side type.
    This type information is used by the verifier. btf_get_prog_ctx_type() is
    used in the later patches to verify that the BTF type of ctx in a BPF program matches
    the kernel-expected ctx type. For example, the XDP program type is:
    BPF_PROG_TYPE(BPF_PROG_TYPE_XDP, xdp, struct xdp_md, struct xdp_buff)
    That means that XDP program should be written as:
    int xdp_prog(struct xdp_md *ctx) { ... }

    Signed-off-by: Alexei Starovoitov
    Signed-off-by: Daniel Borkmann
    Acked-by: Song Liu
    Link: https://lore.kernel.org/bpf/20191114185720.1641606-16-ast@kernel.org

    Alexei Starovoitov
     
  • btf_resolve_helper_id() caching logic is a bit racy, since under root the
    verifier can verify several programs in parallel. Fix it with READ/WRITE_ONCE.
    Fix the type as well, since error is also recorded.

    Fixes: a7658e1a4164 ("bpf: Check types of arguments passed into helpers")
    Signed-off-by: Alexei Starovoitov
    Signed-off-by: Daniel Borkmann
    Acked-by: Song Liu
    Acked-by: Andrii Nakryiko
    Link: https://lore.kernel.org/bpf/20191114185720.1641606-15-ast@kernel.org

    Alexei Starovoitov
     
  • Introduce BPF trampoline concept to allow kernel code to call into BPF programs
    with practically zero overhead. The trampoline generation logic is
    architecture dependent. It's converting native calling convention into BPF
    calling convention. BPF ISA is 64-bit (even on 32-bit architectures). The
    registers R1 to R5 are used to pass arguments into BPF functions. The main BPF
    program accepts only single argument "ctx" in R1. Whereas CPU native calling
    convention is different. x86-64 is passing first 6 arguments in registers
    and the rest on the stack. x86-32 is passing first 3 arguments in registers.
    sparc64 is passing first 6 in registers. And so on.

    The trampolines between BPF and kernel already exist. BPF_CALL_x macros in
    include/linux/filter.h statically compile trampolines from BPF into kernel
    helpers. They convert up to five u64 arguments into kernel C pointers and
    integers. On 64-bit architectures these BPF_to_kernel trampolines are nops. On
    32-bit architectures they're meaningful.

    The opposite job kernel_to_BPF trampolines is done by CAST_TO_U64 macros and
    __bpf_trace_##call() shim functions in include/trace/bpf_probe.h. They convert
    kernel function arguments into array of u64s that BPF program consumes via
    R1=ctx pointer.

    This patch set is doing the same job as __bpf_trace_##call() static
    trampolines, but dynamically for any kernel function. There are ~22k global
    kernel functions that are attachable via nop at function entry. The function
    arguments and types are described in BTF. The job of btf_distill_func_proto()
    function is to extract useful information from BTF into "function model" that
    architecture dependent trampoline generators will use to generate assembly code
    to cast kernel function arguments into array of u64s. For example the kernel
    function eth_type_trans has two pointers. They will be cast to u64 and stored
    into stack of generated trampoline. The pointer to that stack space will be
    passed into BPF program in R1. On x86-64 such generated trampoline will consume
    16 bytes of stack and two stores of %rdi and %rsi into stack. The verifier will
    make sure that only two u64 are accessed read-only by BPF program. The verifier
    will also recognize the precise type of the pointers being accessed and will
    not allow typecasting of the pointer to a different type within BPF program.

    The tracing use case in the datacenter demonstrated that certain key kernel
    functions (like tcp_retransmit_skb) have 2 or more kprobes that are always
    active. Other functions have both a kprobe and a kretprobe. So it is essential to
    keep both kernel code and BPF programs executing at maximum speed. Hence
    the generated BPF trampoline is re-generated every time a new program is attached or
    detached to maintain maximum performance.

    To avoid the high cost of retpoline the attached BPF programs are called
    directly. __bpf_prog_enter/exit() are used to support per-program execution
    stats. In the future this logic will be optimized further by adding support
    for bpf_stats_enabled_key inside generated assembly code. Introduction of
    preemptible and sleepable BPF programs will completely remove the need to call
    to __bpf_prog_enter/exit().

    Detach of a BPF program from the trampoline should not fail. To avoid memory
    allocation in detach path the half of the page is used as a reserve and flipped
    after each attach/detach. 2k bytes is enough to call 40+ BPF programs directly
    which is enough for BPF tracing use cases. This limit can be increased in the
    future.

    BPF_TRACE_FENTRY programs have access to raw kernel function arguments while
    BPF_TRACE_FEXIT programs have access to kernel return value as well. Often
    kprobe BPF program remembers function arguments in a map while kretprobe
    fetches arguments from a map and analyzes them together with return value.
    BPF_TRACE_FEXIT accelerates this typical use case.

    Recursion prevention for kprobe BPF programs is done via per-cpu
    bpf_prog_active counter. In practice that turned out to be a mistake. It
    caused programs to randomly skip execution. The tracing tools missed results
    they were looking for. Hence BPF trampoline doesn't provide builtin recursion
    prevention. It's a job of BPF program itself and will be addressed in the
    follow up patches.

    BPF trampoline is intended to be used beyond tracing and fentry/fexit use cases
    in the future. For example to remove retpoline cost from XDP programs.
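
    A hedged sketch of what an FENTRY program looks like under this scheme (per the
    description above, ctx in R1 points to an array of u64 holding the traced
    function's arguments; section and program names are illustrative):

        #include <linux/bpf.h>
        #include <bpf/bpf_helpers.h>

        SEC("fentry/eth_type_trans")
        int trace_eth_type_trans(__u64 *ctx)
        {
                __u64 skb_ptr = ctx[0];  /* 1st arg: struct sk_buff *, cast to u64 */
                __u64 dev_ptr = ctx[1];  /* 2nd arg: struct net_device *, cast to u64 */

                bpf_printk("eth_type_trans skb=%llx dev=%llx", skb_ptr, dev_ptr);
                return 0;
        }

        char _license[] SEC("license") = "GPL";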

    Signed-off-by: Alexei Starovoitov
    Signed-off-by: Daniel Borkmann
    Acked-by: Andrii Nakryiko
    Acked-by: Song Liu
    Link: https://lore.kernel.org/bpf/20191114185720.1641606-5-ast@kernel.org

    Alexei Starovoitov
     
  • Add bpf_arch_text_poke() helper that is used by BPF trampoline logic to patch
    nops/calls in kernel text into calls into BPF trampoline and to patch
    calls/nops inside BPF programs too.

    Signed-off-by: Alexei Starovoitov
    Signed-off-by: Daniel Borkmann
    Acked-by: Song Liu
    Acked-by: Andrii Nakryiko
    Link: https://lore.kernel.org/bpf/20191114185720.1641606-4-ast@kernel.org

    Alexei Starovoitov
     
  • Currently passing alignment greater than 4 to bpf_jit_binary_alloc does
    not work: in such cases it silently aligns only to 4 bytes.

    On s390, in order to load a constant from memory in a large (>512k) BPF
    program, one must use lgrl instruction, whose memory operand must be
    aligned on an 8-byte boundary.

    This patch makes it possible to request 8-byte alignment from
    bpf_jit_binary_alloc, and also makes it issue a warning when an
    unsupported alignment is requested.
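
    For reference, the allocator's signature (sketched; see include/linux/filter.h
    for the authoritative prototype) and the kind of call an s390 JIT can now make:

        struct bpf_binary_header *
        bpf_jit_binary_alloc(unsigned int proglen, u8 **image_ptr,
                             unsigned int alignment,
                             bpf_jit_fill_hole_t bpf_fill_ill_insns);

        /* e.g.: header = bpf_jit_binary_alloc(proglen, &image, 8, jit_fill_hole); */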

    Signed-off-by: Ilya Leoshkevich
    Signed-off-by: Daniel Borkmann
    Link: https://lore.kernel.org/bpf/20191115123722.58462-1-iii@linux.ibm.com

    Ilya Leoshkevich
     

13 Nov, 2019

2 commits

  • cgroup ID is currently allocated using a dedicated per-hierarchy idr
    and used internally and exposed through tracepoints and bpf. This is
    confusing because there are tracepoints and other interfaces which use
    the cgroupfs ino as IDs.

    The preceding changes made kn->id be exposed as the ino itself on 64bit-ino
    archs, or as ino+gen (low 32bits ino, high 32bits gen) otherwise. There's no
    reason for cgroup to use different IDs. The kernfs IDs are unique and
    userland can easily discover them and map them back to paths using
    standard file operations.

    This patch replaces cgroup IDs with kernfs IDs.

    * cgroup_id() is added and all cgroup ID users are converted to use it.

    * kernfs_node creation is moved to earlier during cgroup init so that
    cgroup_id() is available during init.

    * While at it, s/cgroup/cgrp/ in psi helpers for consistency.

    * Fallback ID value is changed to 1 to be consistent with root cgroup
    ID.

    Signed-off-by: Tejun Heo
    Reviewed-by: Greg Kroah-Hartman
    Cc: Namhyung Kim

    Tejun Heo
     
  • kernfs_node->id is currently a union kernfs_node_id which represents
    either a 32bit (ino, gen) pair or u64 value. I can't see much value
    in the usage of the union - all that's needed is a 64bit ID which the
    current code is already limited to. Using a union makes the code
    unnecessarily complicated and prevents using 64bit ino without adding
    practical benefits.

    This patch drops union kernfs_node_id and makes kernfs_node->id a u64.
    ino is stored in the lower 32bits and gen upper. Accessors -
    kernfs[_id]_ino() and kernfs[_id]_gen() - are added to retrieve the
    ino and gen. This makes ID handling less cumbersome and will
    allow using 64bit inos on supported archs.

    This patch doesn't make any functional changes.
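
    The packing described above, sketched as accessors (simplified; the real helpers
    also special-case archs with a 64bit ino_t):

        /* kernfs_node->id: low 32 bits are the inode number, high 32 bits the
         * generation. */
        static inline u32 kernfs_id_ino(u64 id)
        {
                return (u32)id;
        }

        static inline u32 kernfs_id_gen(u64 id)
        {
                return (u32)(id >> 32);
        }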

    Signed-off-by: Tejun Heo
    Reviewed-by: Greg Kroah-Hartman
    Cc: Namhyung Kim
    Cc: Jens Axboe
    Cc: Alexei Starovoitov

    Tejun Heo
     

11 Nov, 2019

1 commit

  • We need to convert flags to atomic_t in order to later fix an ordering
    issue on atomic_cmpxchg() failure. This will allow us to use atomic_fetch_or().

    Also clarify the nature of those flags.

    [ mingo: Converted two more usage site the original patch missed. ]

    Signed-off-by: Frederic Weisbecker
    Cc: Linus Torvalds
    Cc: Paul E . McKenney
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Link: https://lkml.kernel.org/r/20191108160858.31665-2-frederic@kernel.org
    Signed-off-by: Ingo Molnar

    Frederic Weisbecker
     

10 Nov, 2019

1 commit


08 Nov, 2019

1 commit

  • This patch adds array support to btf_struct_access().
    It supports array of int, array of struct and multidimensional
    array.

    It also allows using u8[] as a scratch space. For example,
    it allows accessing the "char cb[48]" with a size larger than
    the array's element "char". Another potential use case is
    "u64 icsk_ca_priv[]" in the tcp congestion control.

    btf_resolve_size() is added to resolve the size of any type.
    It will follow the modifier if there is any. Please
    see the function comment for details.

    This patch also adds the "off < moff" check at the beginning
    of the for loop. It is to reject cases when "off" is pointing
    to a "hole" in a struct.

    Signed-off-by: Martin KaFai Lau
    Signed-off-by: Alexei Starovoitov
    Link: https://lore.kernel.org/bpf/20191107180903.4097702-1-kafai@fb.com

    Martin KaFai Lau
     

07 Nov, 2019

1 commit

  • In the bpf interpreter mode, bpf_probe_read_kernel is used to read
    from PTR_TO_BTF_ID's kernel object. It was missing consideration of
    insn->off. This patch fixes it.

    Fixes: 2a02759ef5f8 ("bpf: Add support for BTF pointers to interpreter")
    Signed-off-by: Martin KaFai Lau
    Signed-off-by: Alexei Starovoitov
    Acked-by: Andrii Nakryiko
    Link: https://lore.kernel.org/bpf/20191107014640.384083-1-kafai@fb.com

    Martin KaFai Lau