14 Mar, 2020

13 commits

  • Daniel Borkmann says:

    ====================
    pull-request: bpf-next 2020-03-13

    The following pull-request contains BPF updates for your *net-next* tree.

    We've added 86 non-merge commits during the last 12 day(s) which contain
    a total of 107 files changed, 5771 insertions(+), 1700 deletions(-).

    The main changes are:

    1) Add modify_return attach type which allows attaching to a function via
    BPF trampoline; such programs run after the fentry and before the fexit
    programs and can pass a return code to the original caller, from KP Singh.

    2) Generalize BPF's kallsyms handling and add BPF trampoline and dispatcher
    objects to be visible in /proc/kallsyms so they can be annotated in
    stack traces, from Jiri Olsa.

    3) Extend BPF sockmap to allow for UDP next to existing TCP support in order
    to enable this for BPF based socket dispatch, from Lorenz Bauer.

    4) Introduce a new bpftool 'prog profile' command which attaches to existing
    BPF programs via fentry and fexit hooks and reads out hardware counters
    during that period, from Song Liu. Example usage:

    bpftool prog profile id 337 duration 3 cycles instructions llc_misses

    4228 run_cnt
    3403698 cycles (84.08%)
    3525294 instructions # 1.04 insn per cycle (84.05%)
    13 llc_misses # 3.69 LLC misses per million insns (83.50%)

    5) Batch of improvements to libbpf, bpftool and BPF selftests. Also addition
    of a new bpf_link abstraction to keep in particular BPF tracing programs
    attached even when the application owning them exits, from Andrii Nakryiko.

    6) New bpf_get_ns_current_pid_tgid() helper for tracing to perform PID
    filtering, returning the PID of the current task as seen from a given pid
    namespace, from Carlos Neira.

    7) Refactor of RISC-V JIT code to move out common pieces and addition of a
    new RV32G BPF JIT compiler, from Luke Nelson.

    8) Add gso_size context member to __sk_buff in order to be able to know whether
    a given skb is GSO or not, from Willem de Bruijn.

    9) Add a new bpf_xdp_output() helper which reuses XDP's existing perf RB output
    implementation but can be called from tracepoint programs, from Eelco Chaudron.
    ====================

    Signed-off-by: David S. Miller

    David S. Miller
     
  • Sparse reports a warning at __bpf_prog_enter() and __bpf_prog_exit()

    warning: context imbalance in __bpf_prog_enter() - wrong count at exit
    warning: context imbalance in __bpf_prog_exit() - unexpected unlock

    The root cause is the missing annotation at __bpf_prog_enter()
    and __bpf_prog_exit()

    Add the missing __acquires(RCU) annotation
    Add the missing __releases(RCU) annotation

    Signed-off-by: Jules Irenge
    Signed-off-by: Daniel Borkmann
    Link: https://lore.kernel.org/bpf/20200311010908.42366-2-jbi.octave@gmail.com

    Jules Irenge
     
  • Now that we have all the objects (bpf_prog, bpf_trampoline,
    bpf_dispatcher) linked in bpf_tree, there's no need to have
    separate bpf_image tree for images.

    Reverting the bpf_image tree together with struct bpf_image,
    because it's no longer needed.

    Also removing bpf_image_alloc function and adding the original
    bpf_jit_alloc_exec_page interface instead.

    The kernel_text_address function can now rely only on is_bpf_text_address,
    because it checks the bpf_tree that contains all the objects.

    Keeping bpf_image_ksym_add and bpf_image_ksym_del because they are
    useful wrappers with perf's ksymbol interface calls.

    Signed-off-by: Jiri Olsa
    Signed-off-by: Alexei Starovoitov
    Link: https://lore.kernel.org/bpf/20200312195610.346362-13-jolsa@kernel.org
    Signed-off-by: Alexei Starovoitov

    Jiri Olsa
     
  • Adding dispatchers to kallsyms. It's displayed as
    bpf_dispatcher_<NAME>

    where NAME is the name of the dispatcher.

    Signed-off-by: Jiri Olsa
    Signed-off-by: Alexei Starovoitov
    Link: https://lore.kernel.org/bpf/20200312195610.346362-12-jolsa@kernel.org
    Signed-off-by: Alexei Starovoitov

    Jiri Olsa
     
  • Adding trampolines to kallsyms. It's displayed as
    bpf_trampoline_<ID> [bpf]

    where ID is the BTF id of the trampoline function.

    Adding bpf_image_ksym_add/del functions that setup
    the start/end values and call KSYMBOL perf events
    handlers.

    Signed-off-by: Jiri Olsa
    Signed-off-by: Alexei Starovoitov
    Link: https://lore.kernel.org/bpf/20200312195610.346362-11-jolsa@kernel.org
    Signed-off-by: Alexei Starovoitov

    Jiri Olsa
     
  • Separating /proc/kallsyms add/del code and adding bpf_ksym_add/del
    functions for that.

    Moving bpf_prog_ksym_node_add/del functions to __bpf_ksym_add/del
    and changing their argument to 'struct bpf_ksym' object. This way
    we can call them for other bpf objects types like trampoline and
    dispatcher.

    Signed-off-by: Jiri Olsa
    Signed-off-by: Alexei Starovoitov
    Link: https://lore.kernel.org/bpf/20200312195610.346362-10-jolsa@kernel.org
    Signed-off-by: Alexei Starovoitov

    Jiri Olsa
     
  • Adding 'prog' bool flag to 'struct bpf_ksym' to mark that
    this object belongs to bpf_prog object.

    This change allows having bpf_prog objects together with
    other types (trampolines and dispatchers) in the single
    bpf_tree. It's used when searching for bpf_prog exception
    tables by the bpf_prog_ksym_find function, where we need
    to get the bpf_prog pointer.

    From now on we can safely add bpf_ksym support for trampoline
    or dispatcher objects, because we can differentiate them
    from bpf_prog objects.

    Signed-off-by: Jiri Olsa
    Signed-off-by: Alexei Starovoitov
    Link: https://lore.kernel.org/bpf/20200312195610.346362-9-jolsa@kernel.org
    Signed-off-by: Alexei Starovoitov

    Jiri Olsa
     
  • Adding bpf_ksym_find function that is used by the bpf address
    lookup functions:
    __bpf_address_lookup
    is_bpf_text_address

    while keeping bpf_prog_kallsyms_find to be used only for lookup
    of bpf_prog objects (will happen in following changes).

    Signed-off-by: Jiri Olsa
    Signed-off-by: Alexei Starovoitov
    Link: https://lore.kernel.org/bpf/20200312195610.346362-8-jolsa@kernel.org
    Signed-off-by: Alexei Starovoitov

    Jiri Olsa
     
  • Moving ksym_tnode list node to 'struct bpf_ksym' object,
    so the symbol itself can be chained and used in other
    objects like bpf_trampoline and bpf_dispatcher.

    We need bpf_ksym object to be linked both in bpf_kallsyms
    via lnode for /proc/kallsyms and in bpf_tree via tnode for
    bpf address lookup functions like __bpf_address_lookup or
    bpf_prog_kallsyms_find.

    Signed-off-by: Jiri Olsa
    Signed-off-by: Alexei Starovoitov
    Link: https://lore.kernel.org/bpf/20200312195610.346362-7-jolsa@kernel.org
    Signed-off-by: Alexei Starovoitov

    Jiri Olsa
     
  • Adding lnode list node to 'struct bpf_ksym' object,
    so the struct bpf_ksym itself can be chained and used
    in other objects like bpf_trampoline and bpf_dispatcher.

    Changing iterator to bpf_ksym in bpf_get_kallsym function.

    The ksym->start is holding the prog->bpf_func value,
    so it's ok to use it as value in bpf_get_kallsym.

    Signed-off-by: Jiri Olsa
    Signed-off-by: Alexei Starovoitov
    Acked-by: Song Liu
    Link: https://lore.kernel.org/bpf/20200312195610.346362-6-jolsa@kernel.org
    Signed-off-by: Alexei Starovoitov

    Jiri Olsa
     
  • Adding name to 'struct bpf_ksym' object to carry the name
    of the symbol for bpf_prog, bpf_trampoline, bpf_dispatcher
    objects.

    The current benefit is that name is now generated only when
    the symbol is added to the list, so we don't need to generate
    it every time it's accessed.

    The future benefit is that we will have all the bpf objects
    symbols represented by struct bpf_ksym.

    Signed-off-by: Jiri Olsa
    Signed-off-by: Alexei Starovoitov
    Acked-by: Song Liu
    Link: https://lore.kernel.org/bpf/20200312195610.346362-5-jolsa@kernel.org
    Signed-off-by: Alexei Starovoitov

    Jiri Olsa
     
  • Adding 'struct bpf_ksym' object that will carry the
    kallsym information for bpf symbol. Adding the start
    and end address to begin with. It will be used by
    bpf_prog, bpf_trampoline, bpf_dispatcher objects.

    The symbol_start/symbol_end values were originally used
    to sort bpf_prog objects. For the address displayed in
    /proc/kallsyms we are using prog->bpf_func value.

    I'm using the bpf_func value for program symbol start
    instead of the symbol_start, because it makes no difference
    for sorting bpf_prog objects and we can use it directly as
    an address to display it in /proc/kallsyms.

    Signed-off-by: Jiri Olsa
    Signed-off-by: Alexei Starovoitov
    Acked-by: Song Liu
    Link: https://lore.kernel.org/bpf/20200312195610.346362-4-jolsa@kernel.org
    Signed-off-by: Alexei Starovoitov

    Jiri Olsa
     
  • Instead of requiring users to do three steps for cleaning up bpf_link, its
    anon_inode file, and unused fd, abstract that away into bpf_link_cleanup()
    helper. bpf_link_defunct() is removed, as it shouldn't be needed as an
    individual operation anymore.

    v1->v2:
    - keep bpf_link_cleanup() static for now (Daniel).

    Signed-off-by: Andrii Nakryiko
    Signed-off-by: Alexei Starovoitov
    Acked-by: Martin KaFai Lau
    Link: https://lore.kernel.org/bpf/20200313002128.2028680-1-andriin@fb.com
    Signed-off-by: Alexei Starovoitov

    Andrii Nakryiko
     

13 Mar, 2020

4 commits

  • Minor overlapping changes, nothing serious.

    Signed-off-by: David S. Miller

    David S. Miller
     
  • Introduce new helper that reuses existing xdp perf_event output
    implementation, but can be called from raw_tracepoint programs
    that receive 'struct xdp_buff *' as a tracepoint argument.

    Signed-off-by: Eelco Chaudron
    Signed-off-by: Alexei Starovoitov
    Acked-by: John Fastabend
    Acked-by: Toke Høiland-Jørgensen
    Link: https://lore.kernel.org/bpf/158348514556.2239.11050972434793741444.stgit@xdp-tutorial

    Eelco Chaudron
     
  • New bpf helper bpf_get_ns_current_pid_tgid. This helper returns the pid
    and tgid of the current task as seen from the pid namespace matching the
    dev_t and inode number provided, which allows us to instrument a process
    inside a container.

    Signed-off-by: Carlos Neira
    Signed-off-by: Alexei Starovoitov
    Acked-by: Yonghong Song
    Link: https://lore.kernel.org/bpf/20200304204157.58695-3-cneirabustos@gmail.com

    Carlos Neira
     
  • Pull networking fixes from David Miller:
    "It looks like a decent sized set of fixes, but a lot of these are one
    liner off-by-one and similar type changes:

    1) Fix netlink header pointer to calculate bad attribute offset
    reported to user. From Pablo Neira Ayuso.

    2) Don't double clear PHY interrupts when ->did_interrupt is set,
    from Heiner Kallweit.

    3) Add missing validation of various (devlink, nl802154, fib, etc.)
    attributes, from Jakub Kicinski.

    4) Missing *pos increments in various netfilter seq_next ops, from
    Vasily Averin.

    5) Missing break in of_mdiobus_register() loop, from Dajun Jin.

    6) Don't double bump tx_dropped in veth driver, from Jiang Lidong.

    7) Work around FMAN erratum A050385, from Madalin Bucur.

    8) Make sure ARP header is pulled early enough in bonding driver,
    from Eric Dumazet.

    9) Do a cond_resched() during multicast processing of ipvlan and
    macvlan, from Mahesh Bandewar.

    10) Don't attach cgroups to unrelated sockets when in interrupt
    context, from Shakeel Butt.

    11) Fix tpacket ring state management when encountering unknown GSO
    types. From Willem de Bruijn.

    12) Fix MDIO bus PHY resume by checking mdio_bus_phy_may_suspend()
    only in the suspend context. From Heiner Kallweit"

    * git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (112 commits)
    net: systemport: fix index check to avoid an array out of bounds access
    tc-testing: add ETS scheduler to tdc build configuration
    net: phy: fix MDIO bus PM PHY resuming
    net: hns3: clear port base VLAN when unload PF
    net: hns3: fix RMW issue for VLAN filter switch
    net: hns3: fix VF VLAN table entries inconsistent issue
    net: hns3: fix "tc qdisc del" failed issue
    taprio: Fix sending packets without dequeueing them
    net: mvmdio: avoid error message for optional IRQ
    net: dsa: mv88e6xxx: Add missing mask of ATU occupancy register
    net: memcg: fix lockdep splat in inet_csk_accept()
    s390/qeth: implement smarter resizing of the RX buffer pool
    s390/qeth: refactor buffer pool code
    s390/qeth: use page pointers to manage RX buffer pool
    seg6: fix SRv6 L2 tunnels to use IANA-assigned protocol number
    net: dsa: Don't instantiate phylink for CPU/DSA ports unless needed
    net/packet: tpacket_rcv: do not increment ring index on drop
    sxgbe: Fix off by one in samsung driver strncpy size arg
    net: caif: Add lockdep expression to RCU traversal primitive
    MAINTAINERS: remove Sathya Perla as Emulex NIC maintainer
    ...

    Linus Torvalds
     

12 Mar, 2020

2 commits

  • Pull thread fix from Christian Brauner:
    "This contains a single fix for a regression which was introduced when
    we introduced the ability to select a specific pid at process creation
    time.

    When this feature is requested, the error value will be set to -EPERM
    after exiting the pid allocation loop. This caused EPERM to be
    returned when e.g. the init process/child subreaper of the pid
    namespace has already died where we used to return ENOMEM before.

    The first patch here simply fixes the regression by unconditionally
    setting the return value back to ENOMEM again once we've successfully
    allocated the requested pid number. This should be easy to backport to
    v5.5.

    The second patch adds a comment explaining that we must keep returning
    ENOMEM since we've been doing it for a long time and have explicitly
    documented this behavior for userspace. This seemed worthwhile because
    we now have at least two separate example where people tried to change
    the return value to something other than ENOMEM (The first version of
    the regression fix did that too and the commit message links to an
    earlier patch that tried to do the same.).

    I have a simple regression test to make sure we catch this regression
    in the future but since that introduces a whole new selftest subdir
    and test files I'll keep this for v5.7"

    * tag 'for-linus-2020-03-10' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux:
    pid: make ENOMEM return value more obvious
    pid: Fix error return value in some cases

    Linus Torvalds
     
  • Pull ftrace fix from Steven Rostedt:
    "Have ftrace lookup_rec() return a consistent record otherwise it can
    break live patching"

    * tag 'trace-v5.6-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
    ftrace: Return the first found result in lookup_rec()

    Linus Torvalds
     

11 Mar, 2020

5 commits

  • It appears that ip ranges can overlap. In that case lookup_rec()
    returns whatever result it got last, even if it found nothing in the
    last searched page.

    This breaks an obscure livepatch late module patching usecase:
    - load livepatch
    - load the patched module
    - unload livepatch
    - try to load livepatch again

    To fix this, return from lookup_rec() as soon as it finds the record
    containing the searched-for ip. This used to be the behavior prior to
    lookup_rec()'s introduction.

    Link: http://lkml.kernel.org/r/20200306174317.21699-1-asavkov@redhat.com

    Cc: stable@vger.kernel.org
    Fixes: 7e16f581a817 ("ftrace: Separate out functionality from ftrace_location_range()")
    Signed-off-by: Artem Savkov
    Signed-off-by: Steven Rostedt (VMware)

    Artem Savkov
     
  • Add bpf_link_new_file() API for cases when we need to ensure anon_inode is
    successfully created before we proceed with expensive BPF program attachment
    procedure, which will require equally (if not more so) expensive and
    potentially failing compensation detachment procedure just because anon_inode
    creation failed. This API allows simplifying code by ensuring first that
    anon_inode is created and, after the BPF program is attached, proceeding
    with fd_install() that can't fail.

    After anon_inode file is created, link can't be just kfree()'d anymore,
    because its destruction will be performed by deferred file_operations->release
    call. For this, bpf_link API required specifying two separate operations:
    release() and dealloc(), former performing detachment only, while the latter
    frees memory used by bpf_link itself. dealloc() needs to be specified, because
    struct bpf_link is frequently embedded into link type-specific container
    struct (e.g., struct bpf_raw_tp_link), so bpf_link itself doesn't know how to
    properly free the memory. In case when anon_inode file was successfully
    created, but subsequent BPF attachment failed, bpf_link needs to be marked as
    "defunct", so that file's release() callback will perform only memory
    deallocation, but no detachment.

    Convert raw tracepoint and tracing attachment to new API and eliminate
    detachment from error handling path.

    Signed-off-by: Andrii Nakryiko
    Signed-off-by: Daniel Borkmann
    Acked-by: John Fastabend
    Link: https://lore.kernel.org/bpf/20200309231051.1270337-1-andriin@fb.com

    Andrii Nakryiko
     
  • We are testing network memory accounting in our setup and noticed
    inconsistent network memory usage and often unrelated cgroups network
    usage correlates with testing workload. On further inspection, it
    seems like mem_cgroup_sk_alloc() and cgroup_sk_alloc() are broken in
    irq context, especially for cgroup v1.

    mem_cgroup_sk_alloc() and cgroup_sk_alloc() can be called in irq context
    and kind of assumes that this can only happen from sk_clone_lock()
    and the source sock object has already associated cgroup. However in
    cgroup v1, where network memory accounting is opt-in, the source sock
    can be unassociated with any cgroup and the new cloned sock can get
    associated with unrelated interrupted cgroup.

    Cgroup v2 can also suffer if the source sock object was created by
    process in the root cgroup or if sk_alloc() is called in irq context.
    The fix is to just do nothing in interrupt.

    WARNING: Please note that about half of the TCP sockets are allocated
    from the IRQ context, so memory used by such sockets will not be
    accounted by the memcg.

    The stack trace of mem_cgroup_sk_alloc() from IRQ-context:

    CPU: 70 PID: 12720 Comm: ssh Tainted: 5.6.0-smp-DEV #1
    Hardware name: ...
    Call Trace:

    dump_stack+0x57/0x75
    mem_cgroup_sk_alloc+0xe9/0xf0
    sk_clone_lock+0x2a7/0x420
    inet_csk_clone_lock+0x1b/0x110
    tcp_create_openreq_child+0x23/0x3b0
    tcp_v6_syn_recv_sock+0x88/0x730
    tcp_check_req+0x429/0x560
    tcp_v6_rcv+0x72d/0xa40
    ip6_protocol_deliver_rcu+0xc9/0x400
    ip6_input+0x44/0xd0
    ? ip6_protocol_deliver_rcu+0x400/0x400
    ip6_rcv_finish+0x71/0x80
    ipv6_rcv+0x5b/0xe0
    ? ip6_sublist_rcv+0x2e0/0x2e0
    process_backlog+0x108/0x1e0
    net_rx_action+0x26b/0x460
    __do_softirq+0x104/0x2a6
    do_softirq_own_stack+0x2a/0x40

    do_softirq.part.19+0x40/0x50
    __local_bh_enable_ip+0x51/0x60
    ip6_finish_output2+0x23d/0x520
    ? ip6table_mangle_hook+0x55/0x160
    __ip6_finish_output+0xa1/0x100
    ip6_finish_output+0x30/0xd0
    ip6_output+0x73/0x120
    ? __ip6_finish_output+0x100/0x100
    ip6_xmit+0x2e3/0x600
    ? ipv6_anycast_cleanup+0x50/0x50
    ? inet6_csk_route_socket+0x136/0x1e0
    ? skb_free_head+0x1e/0x30
    inet6_csk_xmit+0x95/0xf0
    __tcp_transmit_skb+0x5b4/0xb20
    __tcp_send_ack.part.60+0xa3/0x110
    tcp_send_ack+0x1d/0x20
    tcp_rcv_state_process+0xe64/0xe80
    ? tcp_v6_connect+0x5d1/0x5f0
    tcp_v6_do_rcv+0x1b1/0x3f0
    ? tcp_v6_do_rcv+0x1b1/0x3f0
    __release_sock+0x7f/0xd0
    release_sock+0x30/0xa0
    __inet_stream_connect+0x1c3/0x3b0
    ? prepare_to_wait+0xb0/0xb0
    inet_stream_connect+0x3b/0x60
    __sys_connect+0x101/0x120
    ? __sys_getsockopt+0x11b/0x140
    __x64_sys_connect+0x1a/0x20
    do_syscall_64+0x51/0x200
    entry_SYSCALL_64_after_hwframe+0x44/0xa9

    Fixes: 2d7580738345 ("mm: memcontrol: consolidate cgroup socket tracking")
    Fixes: d979a39d7242 ("cgroup: duplicate cgroup reference when cloning sockets")
    Signed-off-by: Shakeel Butt
    Reviewed-by: Roman Gushchin
    Signed-off-by: David S. Miller

    Shakeel Butt
     
  • Pull cgroup fixes from Tejun Heo:

    - cgroup.procs listing related fixes.

    It didn't interlock properly with exiting tasks leaving a short
    window where a cgroup has empty cgroup.procs but still can't be
    removed and misbehaved on short reads.

    - psi_show() crash fix on 32bit ino archs

    - Empty release_agent handling fix

    * 'for-5.6-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
    cgroup1: don't call release_agent when it is ""
    cgroup: fix psi_show() crash on 32bit ino archs
    cgroup: Iterate tasks that did not finish do_exit()
    cgroup: cgroup_procs_next should increase position index
    cgroup-v1: cgroup_pidlist_next should update position index

    Linus Torvalds
     
  • Pull workqueue fixes from Tejun Heo:
    "Workqueue has been incorrectly round-robining per-cpu work items.
    Hillf's patch fixes that.

    The other patch documents memory-ordering properties of workqueue
    operations"

    * 'for-5.6-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq:
    workqueue: don't use wq_select_unbound_cpu() for bound works
    workqueue: Document (some) memory-ordering properties of {queue,schedule}_work()

    Linus Torvalds
     

10 Mar, 2020

2 commits

  • wq_select_unbound_cpu() is designed for unbound workqueues only, but
    it's wrongly called when using a bound workqueue too.

    Fixing this ensures work queued to a bound workqueue with
    cpu=WORK_CPU_UNBOUND always runs on the local CPU.

    Before, that would happen only if wq_unbound_cpumask happened to include
    it (likely almost always the case), or was empty, or we got lucky with
    forced round-robin placement. So restricting
    /sys/devices/virtual/workqueue/cpumask to a small subset of a machine's
    CPUs would cause some bound work items to run unexpectedly there.

    Fixes: ef557180447f ("workqueue: schedule WORK_CPU_UNBOUND work on wq_unbound_cpumask CPUs")
    Cc: stable@vger.kernel.org # v4.5+
    Signed-off-by: Hillf Danton
    [dj: massage changelog]
    Signed-off-by: Daniel Jordan
    Cc: Tejun Heo
    Cc: Lai Jiangshan
    Cc: linux-kernel@vger.kernel.org
    Signed-off-by: Tejun Heo

    Hillf Danton
     
  • The alloc_pid() codepath used to be simpler. With the introduction of the
    ability to choose specific pids in 49cb2fc42ce4 ("fork: extend clone3() to
    support setting a PID") it got more complex. It hasn't been super obvious
    that ENOMEM is returned when the pid namespace init process/child subreaper
    of the pid namespace has died. As can be seen from multiple attempts to
    improve this see e.g. [1] and most recently [2].
    We regressed returning ENOMEM in [3] and [2] restored it. Let's add a
    comment on top explaining that this is historic and documented behavior and
    cannot easily be changed.

    [1]: 35f71bc0a09a ("fork: report pid reservation failure properly")
    [2]: b26ebfe12f34 ("pid: Fix error return value in some cases")
    [3]: 49cb2fc42ce4 ("fork: extend clone3() to support setting a PID")
    Signed-off-by: Christian Brauner

    Christian Brauner
     

08 Mar, 2020

2 commits

  • Recent changes to alloc_pid() allow the pid number to be specified on
    the command line. If set_tid_size is set, then the code scanning the
    levels will hard-set retval to -EPERM, overriding its previous -ENOMEM
    value.

    After the code scanning the levels, there are error returns that do not
    set retval, assuming it is still set to -ENOMEM.

    So set retval back to -ENOMEM after scanning the levels.

    Fixes: 49cb2fc42ce4 ("fork: extend clone3() to support setting a PID")
    Signed-off-by: Corey Minyard
    Acked-by: Christian Brauner
    Cc: Andrei Vagin
    Cc: Dmitry Safonov
    Cc: Oleg Nesterov
    Cc: Adrian Reber
    Cc: # 5.5
    Link: https://lore.kernel.org/r/20200306172314.12232-1-minyard@acm.org
    [christian.brauner@ubuntu.com: fixup commit message]
    Signed-off-by: Christian Brauner

    Corey Minyard
     
  • Pull block fixes from Jens Axboe:
    "Here are a few fixes that should go into this release. This contains:

    - Revert of a bad bcache patch from this merge window

    - Removed unused function (Daniel)

    - Fixup for the blktrace fix from Jan from this release (Cengiz)

    - Fix of deeper level bfqq overwrite in BFQ (Carlo)"

    * tag 'block-5.6-2020-03-07' of git://git.kernel.dk/linux-block:
    block, bfq: fix overwrite of bfq_group pointer in bfq_find_set_group()
    blktrace: fix dereference after null check
    Revert "bcache: ignore pending signals when creating gc and allocator thread"
    block: Remove used kblockd_schedule_work_on()

    Linus Torvalds
     

07 Mar, 2020

1 commit

  • Pull thread fixes from Christian Brauner:
    "Here are a few hopefully uncontroversial fixes:

    - Use RCU_INIT_POINTER() when initializing rcu protected members in
    task_struct to fix sparse warnings.

    - Add pidfd_fdinfo_test binary to .gitignore file"

    * tag 'for-linus-2020-03-07' of gitolite.kernel.org:pub/scm/linux/kernel/git/brauner/linux:
    selftests: pidfd: Add pidfd_fdinfo_test in .gitignore
    exit: Fix Sparse errors and warnings
    fork: Use RCU_INIT_POINTER() instead of rcu_access_pointer()

    Linus Torvalds
     

06 Mar, 2020

3 commits

  • test_run.o is not built when CONFIG_NET is not set and
    bpf_prog_test_run_tracing being referenced in bpf_trace.o causes the
    linker error:

    ld: kernel/trace/bpf_trace.o:(.rodata+0x38): undefined reference to
    `bpf_prog_test_run_tracing'

    Add a __weak function in bpf_trace.c to handle this.

    Fixes: da00d2f117a0 ("bpf: Add test ops for BPF_PROG_TYPE_TRACING")
    Signed-off-by: KP Singh
    Reported-by: Randy Dunlap
    Acked-by: Randy Dunlap
    Signed-off-by: Alexei Starovoitov
    Link: https://lore.kernel.org/bpf/20200305220127.29109-1-kpsingh@chromium.org

    KP Singh
     
  • While well intentioned, checking CAP_MAC_ADMIN for attaching
    BPF_MODIFY_RETURN tracing programs to "security_" functions is not
    necessary as tracing BPF programs already require CAP_SYS_ADMIN.

    Fixes: 6ba43b761c41 ("bpf: Attachment verification for BPF_MODIFY_RETURN")
    Signed-off-by: KP Singh
    Signed-off-by: Alexei Starovoitov
    Link: https://lore.kernel.org/bpf/20200305204955.31123-1-kpsingh@chromium.org

    KP Singh
     
  • There was a recent change in blktrace.c that added a RCU protection to
    `q->blk_trace` in order to fix a use-after-free issue during access.

    However the change missed an edge case that can lead to dereferencing of
    `bt` pointer even when it's NULL:

    Coverity static analyzer marked this as a FORWARD_NULL issue with CID
    1460458.

    ```
    /kernel/trace/blktrace.c: 1904 in sysfs_blk_trace_attr_store()
    1898         ret = 0;
    1899         if (bt == NULL)
    1900                 ret = blk_trace_setup_queue(q, bdev);
    1901
    1902         if (ret == 0) {
    1903                 if (attr == &dev_attr_act_mask)
    >>> CID 1460458: Null pointer dereferences (FORWARD_NULL)
    >>> Dereferencing null pointer "bt".
    1904                         bt->act_mask = value;
    1905                 else if (attr == &dev_attr_pid)
    1906                         bt->pid = value;
    1907                 else if (attr == &dev_attr_start_lba)
    1908                         bt->start_lba = value;
    1909                 else if (attr == &dev_attr_end_lba)
    ```

    Added a reassignment with RCU annotation to fix the issue.

    Fixes: c780e86dd48 ("blktrace: Protect q->blk_trace with RCU")
    Cc: stable@vger.kernel.org
    Reviewed-by: Ming Lei
    Reviewed-by: Bob Liu
    Reviewed-by: Steven Rostedt (VMware)
    Signed-off-by: Cengiz Can
    Signed-off-by: Jens Axboe

    Cengiz Can
     

05 Mar, 2020

6 commits

  • The current fexit and fentry tests rely on a different program to
    exercise the functions they attach to. Instead of doing this, implement
    the test operations for tracing which will also be used for
    BPF_MODIFY_RETURN in a subsequent patch.

    Also, clean up the fexit test to use the generated skeleton.

    Signed-off-by: KP Singh
    Signed-off-by: Alexei Starovoitov
    Acked-by: Andrii Nakryiko
    Acked-by: Daniel Borkmann
    Link: https://lore.kernel.org/bpf/20200304191853.1529-7-kpsingh@chromium.org

    KP Singh
     
  • - Allow BPF_MODIFY_RETURN attachment only to functions that are:

    * Whitelisted for error injection by checking
    within_error_injection_list. Similar discussions happened for the
    bpf_override_return helper.

    * security hooks, this is expected to be cleaned up with the LSM
    changes after the KRSI patches introduce the LSM_HOOK macro:

    https://lore.kernel.org/bpf/20200220175250.10795-1-kpsingh@chromium.org/

    - The attachment is currently limited to functions that return an int.
    This can be extended later to other types (e.g. PTR).

    Signed-off-by: KP Singh
    Signed-off-by: Alexei Starovoitov
    Acked-by: Andrii Nakryiko
    Acked-by: Daniel Borkmann
    Link: https://lore.kernel.org/bpf/20200304191853.1529-5-kpsingh@chromium.org

    KP Singh
     
  • When multiple programs are attached, each program receives the return
    value from the previous program on the stack and the last program
    provides the return value to the attached function.

    The fmod_ret bpf programs are run after the fentry programs and before
    the fexit programs. The original function is only called if all the
    fmod_ret programs return 0 to avoid any unintended side-effects. The
    success value, i.e. 0 is not currently configurable but can be made so
    where user-space can specify it at load time.

    For example:

    int func_to_be_attached(int a, int b)
    {                          <-- do_fentry

    do_fmod_ret:
        <update ret by calling fmod_ret>
        if (ret != 0)
            goto do_fexit;

    original_function:

        <side_effects_happen_here>

    }                          <-- do_fexit

    Signed-off-by: Alexei Starovoitov
    Acked-by: Andrii Nakryiko
    Acked-by: Daniel Borkmann
    Link: https://lore.kernel.org/bpf/20200304191853.1529-4-kpsingh@chromium.org

    KP Singh
     
  • As we need to introduce a third type of attachment for trampolines, the
    flattened signature of arch_prepare_bpf_trampoline gets even more
    complicated.

    Refactor the prog and count argument to arch_prepare_bpf_trampoline to
    use bpf_tramp_progs to simplify the addition and accounting for new
    attachment types.

    Signed-off-by: KP Singh
    Signed-off-by: Alexei Starovoitov
    Acked-by: Andrii Nakryiko
    Acked-by: Daniel Borkmann
    Link: https://lore.kernel.org/bpf/20200304191853.1529-2-kpsingh@chromium.org

    KP Singh
     
  • Older (and maybe current) versions of systemd set release_agent to "" when
    shutting down, but do not set notify_on_release to 0.

    Since 64e90a8acb85 ("Introduce STATIC_USERMODEHELPER to mediate
    call_usermodehelper()"), we filter out such calls when the user mode helper
    path is "". However, when used in conjunction with an actual (i.e. non "")
    STATIC_USERMODEHELPER, the path is never "", so the real usermode helper
    will be called with argv[0] == "".

    Let's avoid this by not invoking the release_agent when it is "".

    Signed-off-by: Tycho Andersen
    Signed-off-by: Tejun Heo

    Tycho Andersen
     
  • Similar to the commit d7495343228f ("cgroup: fix incorrect
    WARN_ON_ONCE() in cgroup_setup_root()"), cgroup_id(root_cgrp) does not
    equal 1 on 32bit ino archs, which triggers all sorts of issues with
    psi_show() on s390x. For example,

    BUG: KASAN: slab-out-of-bounds in collect_percpu_times+0x2d0/
    Read of size 4 at addr 000000001e0ce000 by task read_all/3667
    collect_percpu_times+0x2d0/0x798
    psi_show+0x7c/0x2a8
    seq_read+0x2ac/0x830
    vfs_read+0x92/0x150
    ksys_read+0xe2/0x188
    system_call+0xd8/0x2b4

    Fix it by using cgroup_ino().

    Fixes: 743210386c03 ("cgroup: use cgrp->kn->id as the cgroup ID")
    Signed-off-by: Qian Cai
    Acked-by: Johannes Weiner
    Signed-off-by: Tejun Heo
    Cc: stable@vger.kernel.org # v5.5

    Qian Cai
     

03 Mar, 2020

1 commit

  • Introduce bpf_link abstraction, representing an attachment of a BPF program
    to a BPF hook point (e.g., tracepoint, perf event, etc). bpf_link
    encapsulates ownership of the attached BPF program and reference counting
    of the link itself, which may be referenced from multiple anonymous inodes,
    and ensures that the release callback will be called from a process
    context, so that users can safely take mutex locks and sleep.

    Additionally, with the new abstraction it's now possible to generalize
    pinning of a link object in BPF FS, allowing to explicitly prevent BPF
    program detachment on process exit by pinning it in BPF FS and letting an
    independent other process open it and keep working with it.

    Convert two existing bpf_link-like objects (raw tracepoint and tracing BPF
    program attachments) into utilizing bpf_link framework, making them pinnable
    in BPF FS. More FD-based bpf_links will be added in follow up patches.

    Signed-off-by: Andrii Nakryiko
    Signed-off-by: Alexei Starovoitov
    Link: https://lore.kernel.org/bpf/20200303043159.323675-2-andriin@fb.com

    Andrii Nakryiko
     

02 Mar, 2020

1 commit