08 Oct, 2020

1 commit

  • * tag 'v5.4.70': (3051 commits)
    Linux 5.4.70
    netfilter: ctnetlink: add a range check for l3/l4 protonum
    ep_create_wakeup_source(): dentry name can change under you...
    ...

    Conflicts:
    arch/arm/mach-imx/pm-imx6.c
    arch/arm64/boot/dts/freescale/imx8mm-evk.dts
    arch/arm64/boot/dts/freescale/imx8mn-ddr4-evk.dts
    drivers/crypto/caam/caamalg.c
    drivers/gpu/drm/imx/dw_hdmi-imx.c
    drivers/gpu/drm/imx/imx-ldb.c
    drivers/gpu/drm/imx/ipuv3/ipuv3-crtc.c
    drivers/mmc/host/sdhci-esdhc-imx.c
    drivers/net/ethernet/freescale/dpaa2/dpaa2-eth.c
    drivers/net/ethernet/freescale/enetc/enetc.c
    drivers/net/ethernet/freescale/enetc/enetc_pf.c
    drivers/thermal/imx_thermal.c
    drivers/usb/cdns3/ep0.c
    drivers/xen/swiotlb-xen.c
    sound/soc/fsl/fsl_esai.c
    sound/soc/fsl/fsl_sai.c

    Signed-off-by: Jason Liu

    Jason Liu
     

01 Oct, 2020

3 commits

  • [ Upstream commit e6a18d36118bea3bf497c9df4d9988b6df120689 ]

    Bryce reported that he saw the following with:

    0: r6 = r1
    1: r1 = 12
    2: r0 = *(u16 *)skb[r1]

    The xlated sequence was incorrectly clobbering r2 with pointer
    value of r6 ...

    0: (bf) r6 = r1
    1: (b7) r1 = 12
    2: (bf) r1 = r6
    3: (bf) r2 = r1
    4: (85) call bpf_skb_load_helper_16_no_cache#7692160

    ... and hence the call to the load helper never succeeded given the
    offset was too high. Fix it by reordering the load of r6 to r1.

    Other than that, the insn has a calling convention similar to BPF
    helpers, that is, r0 - r5 are scratch regs, so nothing else is
    affected after the insn.

    Fixes: e0cea7ce988c ("bpf: implement ld_abs/ld_ind in native bpf")
    Reported-by: Bryce Kahle
    Signed-off-by: Daniel Borkmann
    Signed-off-by: Alexei Starovoitov
    Link: https://lore.kernel.org/bpf/cace836e4d07bb63b1a53e49c5dfb238a040c298.1599512096.git.daniel@iogearbox.net
    Signed-off-by: Sasha Levin

    Daniel Borkmann
     
  • [ Upstream commit bea0c5c942d3b4e9fb6ed45f6a7de74c6b112437 ]

    Devlink health core conditions the reporter's recovery with the
    expiration of the grace period. This is not relevant for the first
    recovery. Explicitly demand that the grace period will only apply to
    recoveries other than the first.

    Fixes: c8e1da0bf923 ("devlink: Add health report functionality")
    Signed-off-by: Aya Levin
    Reviewed-by: Moshe Shemesh
    Reviewed-by: Jiri Pirko
    Signed-off-by: David S. Miller
    Signed-off-by: Sasha Levin

    Aya Levin
     
  • [ Upstream commit 1e3f9f073c47bee7c23e77316b07bc12338c5bba ]

    If a seq_file .next function does not change the position index, a read
    after some lseek can generate unexpected output.
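
    A minimal sketch of the .next contract being described, using a
    hypothetical foo iterator (foo_get_item() is a placeholder lookup helper):

    static void *foo_seq_next(struct seq_file *seq, void *v, loff_t *pos)
    {
            /* advance the position index even when there is no next item,
             * so a read issued after lseek() stays consistent */
            ++*pos;
            return foo_get_item(seq, *pos);
    }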

    https://bugzilla.kernel.org/show_bug.cgi?id=206283
    Signed-off-by: Vasily Averin
    Signed-off-by: David S. Miller
    Signed-off-by: Sasha Levin

    Vasily Averin
     

27 Sep, 2020

2 commits

  • [ Upstream commit e1b9efe6baebe79019a2183176686a0e709388ae ]

    When a netdev is enslaved to a bridge, its parent identifier is queried.
    This is done so that packets that were already forwarded in hardware
    will not be forwarded again by the bridge device between netdevs
    belonging to the same hardware instance.

    The operation fails when the netdev is an upper of netdevs with
    different parent identifiers.

    Instead of failing the enslavement, have dev_get_port_parent_id() return
    '-EOPNOTSUPP' which will signal the bridge to skip the query operation.
    Other callers of the function are not affected by this change.
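
    A hedged sketch of the caller-side pattern this enables (everything apart
    from dev_get_port_parent_id() and the error code is illustrative):

    struct netdev_phys_item_id ppid = { };
    int err;

    err = dev_get_port_parent_id(dev, &ppid, true);
    if (err == -EOPNOTSUPP)
            return 0;       /* mixed parent IDs: skip the check, don't fail */
    if (err)
            return err;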

    Fixes: 7e1146e8c10c ("net: devlink: introduce devlink_compat_switch_id_get() helper")
    Signed-off-by: Ido Schimmel
    Reported-by: Vasundhara Volam
    Reviewed-by: Jiri Pirko
    Reviewed-by: Nikolay Aleksandrov
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Ido Schimmel
     
  • [ Upstream commit 1869e226a7b3ef75b4f70ede2f1b7229f7157fa4 ]

    flowi4_multipath_hash was added by the commit referenced below for
    tunnels. Unfortunately, the patch did not initialize the new field
    for several fast path lookups that do not initialize the entire flow
    struct to 0. Fix those locations. Currently, flowi4_multipath_hash
    is random garbage and affects the hash value computed by
    fib_multipath_hash for multipath selection.
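
    A minimal sketch of the kind of fix described above, for a fast-path
    lookup that fills a struct flowi4 field by field instead of zeroing it:

    struct flowi4 fl4;

    /* either zero the whole key ... */
    memset(&fl4, 0, sizeof(fl4));
    /* ... or at least initialize the new member explicitly, so that
     * fib_multipath_hash() never sees random stack garbage */
    fl4.flowi4_multipath_hash = 0;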

    Fixes: 24ba14406c5c ("route: Add multipath_hash in flowi_common to make user-define hash")
    Signed-off-by: David Ahern
    Cc: wenxu
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    David Ahern
     

23 Sep, 2020

1 commit

  • commit eabe861881a733fc84f286f4d5a1ffaddd4f526f upstream.

    pskb_carve_frag_list() may return -ENOMEM in pskb_carve_inside_nonlinear().
    We should handle this error correctly, or we would end up with a wrong sk_buff.

    Fixes: 6fa01ccd8830 ("skbuff: Add pskb_extract() helper function")
    Signed-off-by: Miaohe Lin
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Miaohe Lin
     

12 Sep, 2020

1 commit

  • [ Upstream commit 96e97bc07e90f175a8980a22827faf702ca4cb30 ]

    napi_disable() makes sure to set the NAPI_STATE_NPSVC bit to prevent
    netpoll from accessing rings before init is complete. However, the
    same is not done for fresh napi instances in netif_napi_add(),
    even though we expect NAPI instances to be added as disabled.

    This causes crashes during driver reconfiguration (enabling XDP,
    changing the channel count) - if there is any printk() after
    netif_napi_add() but before napi_enable().

    To ensure memory ordering is correct we need to use RCU accessors.
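
    A hedged sketch of the marking part of the fix (the RCU-accessor changes
    on the netpoll side are not shown):

    /* in netif_napi_add(): a fresh instance behaves as if napi_disable()
     * had already been called */
    set_bit(NAPI_STATE_SCHED, &napi->state);
    set_bit(NAPI_STATE_NPSVC, &napi->state);

    /* in napi_enable(): clear both bits again before use */
    clear_bit(NAPI_STATE_SCHED, &napi->state);
    clear_bit(NAPI_STATE_NPSVC, &napi->state);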

    Reported-by: Rob Sherwood
    Fixes: 2d8bff12699a ("netpoll: Close race condition between poll_one_napi and napi_disable")
    Signed-off-by: Jakub Kicinski
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Jakub Kicinski
     

10 Sep, 2020

1 commit

  • commit 6570bc79c0dfff0f228b7afd2de720fb4e84d61d upstream.

    Commit 323ebb61e32b4 ("net: use listified RX for handling GRO_NORMAL
    skbs") made use of listified skb processing for the users of
    napi_gro_frags().
    The same technique can be used in a way more common napi_gro_receive()
    to speed up non-merged (GRO_NORMAL) skbs for a wide range of drivers
    including gro_cells and mac80211 users.
    This slightly changes the return value in cases where skb is being
    dropped by the core stack, but it seems to have no impact on related
    drivers' functionality.
    gro_normal_batch is left untouched as it's very individual for every
    single system configuration and might be tuned manually to achieve
    optimal performance.
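
    A hedged sketch of the batching technique (gro_normal_one(),
    gro_normal_list(), napi->rx_list and gro_normal_batch are the in-tree
    names of that era; treat the body as illustrative):

    /* queue a GRO_NORMAL skb on the per-NAPI list and flush the list
     * once the batch reaches the gro_normal_batch sysctl */
    static void gro_normal_one(struct napi_struct *napi, struct sk_buff *skb)
    {
            list_add_tail(&skb->list, &napi->rx_list);
            if (++napi->rx_count >= gro_normal_batch)
                    gro_normal_list(napi);
    }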

    Signed-off-by: Alexander Lobakin
    Acked-by: Edward Cree
    Signed-off-by: David S. Miller
    Signed-off-by: Hyunsoon Kim
    Signed-off-by: Greg Kroah-Hartman

    Alexander Lobakin
     

03 Sep, 2020

1 commit

  • [ Upstream commit 55eff0eb7460c3d50716ed9eccf22257b046ca92 ]

    We may access the two bytes after vlan_hdr in vlan_set_encap_proto(). So
    we should pull VLAN_HLEN + sizeof(unsigned short) in skb_vlan_untag() or
    we may access the wrong data.
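
    A minimal sketch of the pull described above, inside skb_vlan_untag()
    (the error label is illustrative):

    /* make sure the VLAN header plus the following ethertype are in the
     * linear area before vlan_set_encap_proto() reads them */
    if (unlikely(!pskb_may_pull(skb, VLAN_HLEN + sizeof(unsigned short))))
            goto err_free;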

    Fixes: 0d5501c1c828 ("net: Always untag vlan-tagged traffic on input.")
    Signed-off-by: Miaohe Lin
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Miaohe Lin
     

26 Aug, 2020

1 commit

  • [ Upstream commit 84f44df664e9f0e261157e16ee1acd77cc1bb78d ]

    Similar to patch ("bpf: sock_ops ctx access may stomp registers") if the
    src_reg = dst_reg when reading the sk field of a sock_ops struct we
    generate xlated code,

    53: (61) r9 = *(u32 *)(r9 +28)
    54: (15) if r9 == 0x0 goto pc+3
    56: (79) r9 = *(u64 *)(r9 +0)

    This stomps on the r9 reg to do the sk_fullsock check, and then when
    reading the skops->sk field, instead of the sk pointer we get the
    sk_fullsock. To fix this, use a pattern similar to the one noted in the
    previous fix and use the temp field to save/restore the register used
    to do the sk_fullsock check.

    After the fix the generated xlated code reads,

    52: (7b) *(u64 *)(r9 +32) = r8
    53: (61) r8 = *(u32 *)(r9 +28)
    54: (15) if r9 == 0x0 goto pc+3
    55: (79) r8 = *(u64 *)(r9 +32)
    56: (79) r9 = *(u64 *)(r9 +0)
    57: (05) goto pc+1
    58: (79) r8 = *(u64 *)(r9 +32)

    Here r9 register was in-use so r8 is chosen as the temporary register.
    In line 52 r8 is saved in temp variable and at line 54 restored in case
    fullsock != 0. Finally we handle fullsock == 0 case by restoring at
    line 58.

    This adds a new macro, SOCK_OPS_GET_SK; it is almost possible to merge
    this with SOCK_OPS_GET_FIELD, but I found the extra branch logic a
    bit more confusing than just adding a new macro, despite a bit of
    duplicated code.

    Fixes: 1314ef561102e ("bpf: export bpf_sock for BPF_PROG_TYPE_SOCK_OPS prog type")
    Signed-off-by: John Fastabend
    Signed-off-by: Daniel Borkmann
    Acked-by: Song Liu
    Acked-by: Martin KaFai Lau
    Link: https://lore.kernel.org/bpf/159718349653.4728.6559437186853473612.stgit@john-Precision-5820-Tower
    Signed-off-by: Sasha Levin

    John Fastabend
     

21 Aug, 2020

1 commit

  • commit d9539752d23283db4692384a634034f451261e29 upstream.

    Add missed sock updates to compat path via a new helper, which will be
    used more in coming patches. (The net/core/scm.c code is left as-is here
    to assist with -stable backports for the compat path.)
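
    A hedged sketch of what such a helper might look like, assuming
    sock_from_file(), sock_update_netprioidx() and sock_update_classid()
    are usable from the fd-receive path (the helper name is hypothetical):

    static void receive_sock_sketch(struct file *file)
    {
            struct socket *sock;
            int err;

            sock = sock_from_file(file, &err);
            if (sock) {
                    /* the updates the compat path was missing */
                    sock_update_netprioidx(&sock->sk->sk_cgrp_data);
                    sock_update_classid(&sock->sk->sk_cgrp_data);
            }
    }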

    Cc: Christoph Hellwig
    Cc: Sargun Dhillon
    Cc: Jakub Kicinski
    Cc: stable@vger.kernel.org
    Fixes: 48a87cc26c13 ("net: netprio: fd passed in SCM_RIGHTS datagram not set correctly")
    Fixes: d84295067fc7 ("net: net_cls: fd passed in SCM_RIGHTS datagram not set correctly")
    Acked-by: Christian Brauner
    Signed-off-by: Kees Cook
    Signed-off-by: Greg Kroah-Hartman

    Kees Cook
     

19 Aug, 2020

1 commit

  • [ Upstream commit 0f5907af39137f8183ed536aaa00f322d7365130 ]

    If we failed to assign proto idx, we free the twsk_slab_name but forget to
    free the twsk_slab. Add a helper function tw_prot_cleanup() to free these
    together and also use this helper function in proto_unregister().
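
    A sketch of what the helper might look like (field names as in
    struct timewait_sock_ops; the exact shape is illustrative):

    static void tw_prot_cleanup(struct timewait_sock_ops *twsk_prot)
    {
            if (!twsk_prot)
                    return;
            kfree(twsk_prot->twsk_slab_name);
            twsk_prot->twsk_slab_name = NULL;
            kmem_cache_destroy(twsk_prot->twsk_slab);
            twsk_prot->twsk_slab = NULL;
    }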

    Fixes: b45ce32135d1 ("sock: fix potential memory leak in proto_register()")
    Signed-off-by: Miaohe Lin
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Miaohe Lin
     

07 Aug, 2020

1 commit

  • commit bb0de3131f4c60a9bf976681e0fe4d1e55c7a821 upstream.

    The sockmap code currently ignores the value of attach_bpf_fd when
    detaching a program. This is contrary to the usual behaviour of
    checking that attach_bpf_fd represents the currently attached
    program.

    Ensure that attach_bpf_fd is indeed the currently attached
    program. It turns out that all sockmap selftests already do this,
    which indicates that this is unlikely to cause breakage.
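
    A hedged sketch of the detach-time check (attached_prog stands in for
    whatever program pointer the map currently has attached):

    struct bpf_prog *prog;
    int err = 0;

    prog = bpf_prog_get(attr->attach_bpf_fd);
    if (IS_ERR(prog))
            return PTR_ERR(prog);
    if (prog != attached_prog)
            err = -ENOENT;          /* fd does not match the attached program */
    bpf_prog_put(prog);
    return err;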

    Fixes: 604326b41a6f ("bpf, sockmap: convert to generic sk_msg interface")
    Signed-off-by: Lorenz Bauer
    Signed-off-by: Alexei Starovoitov
    Link: https://lore.kernel.org/bpf/20200629095630.7933-5-lmb@cloudflare.com
    Signed-off-by: Greg Kroah-Hartman

    Lorenz Bauer
     

01 Aug, 2020

4 commits

  • [ Upstream commit f2b2c55e512879a05456eaf5de4d1ed2f7757509 ]

    If an unconnected socket in a UDP reuseport group connect()s, has_conns is
    set to 1. Then, when a packet is received, udp[46]_lib_lookup2() scans all
    sockets in udp_hslot looking for the connected socket with the highest
    score.

    However, when the number of sockets bound to the port exceeds max_socks,
    reuseport_grow() resets has_conns to 0. This can cause udp[46]_lib_lookup2()
    to return without scanning all sockets, so packets sent to connected
    sockets may be distributed to unconnected ones.

    Therefore, reuseport_grow() should copy has_conns.
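
    A minimal sketch of the fix inside reuseport_grow(), assuming the local
    names it uses for the old and the newly allocated sock_reuseport:

    more_reuse->num_socks = reuse->num_socks;
    more_reuse->has_conns = reuse->has_conns;   /* the missing copy */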

    Fixes: acdcecc61285 ("udp: correct reuseport selection with connected sockets")
    CC: Willem de Bruijn
    Reviewed-by: Benjamin Herrenschmidt
    Signed-off-by: Kuniyuki Iwashima
    Acked-by: Willem de Bruijn
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Kuniyuki Iwashima
     
  • [ Upstream commit cebb69754f37d68e1355a5e726fdac317bcda302 ]

    When vlan_newlink's call to register_vlan_dev fails, it might return an
    error with dev->reg_state = NETREG_UNREGISTERED. rtnl_newlink should then
    free the memory, but currently rtnl_newlink only frees the memory when
    the state is NETREG_UNINITIALIZED.

    BUG: memory leak
    unreferenced object 0xffff8881051de000 (size 4096):
    comm "syz-executor139", pid 560, jiffies 4294745346 (age 32.445s)
    hex dump (first 32 bytes):
    76 6c 61 6e 32 00 00 00 00 00 00 00 00 00 00 00 vlan2...........
    00 45 28 03 81 88 ff ff 00 00 00 00 00 00 00 00 .E(.............
    backtrace:
    [] kmalloc_node include/linux/slab.h:578 [inline]
    [] kvmalloc_node+0x33/0xd0 mm/util.c:574
    [] kvmalloc include/linux/mm.h:753 [inline]
    [] kvzalloc include/linux/mm.h:761 [inline]
    [] alloc_netdev_mqs+0x83/0xd90 net/core/dev.c:9929
    [] rtnl_create_link+0x2c0/0xa20 net/core/rtnetlink.c:3067
    [] __rtnl_newlink+0xc9c/0x1330 net/core/rtnetlink.c:3329
    [] rtnl_newlink+0x66/0x90 net/core/rtnetlink.c:3397
    [] rtnetlink_rcv_msg+0x540/0x990 net/core/rtnetlink.c:5460
    [] netlink_rcv_skb+0x12b/0x3a0 net/netlink/af_netlink.c:2469
    [] netlink_unicast_kernel net/netlink/af_netlink.c:1303 [inline]
    [] netlink_unicast+0x4c6/0x690 net/netlink/af_netlink.c:1329
    [] netlink_sendmsg+0x735/0xcc0 net/netlink/af_netlink.c:1918
    [] sock_sendmsg_nosec net/socket.c:652 [inline]
    [] sock_sendmsg+0x109/0x140 net/socket.c:672
    [] ____sys_sendmsg+0x5f5/0x780 net/socket.c:2352
    [] ___sys_sendmsg+0x11d/0x1a0 net/socket.c:2406
    [] __sys_sendmsg+0xeb/0x1b0 net/socket.c:2439
    [] do_syscall_64+0x56/0xa0 arch/x86/entry/common.c:359
    [] entry_SYSCALL_64_after_hwframe+0x44/0xa9
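
    A minimal sketch of the changed error-path condition described above
    (placement inside the rtnl_newlink()/__rtnl_newlink() cleanup is
    illustrative):

    if (dev->reg_state == NETREG_UNINITIALIZED ||
        dev->reg_state == NETREG_UNREGISTERED)
            free_netdev(dev);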

    Fixes: cb626bf566eb ("net-sysfs: Fix reference count leak")
    Reported-by: Hulk Robot
    Signed-off-by: Weilong Chen
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Weilong Chen
     
  • [ Upstream commit 9bb5fbea59f36a589ef886292549ca4052fe676c ]

    When I cat 'tx_timeout' via sysfs, it displays as follows. It's better to
    add a newline for easier reading.

    root@syzkaller:~# cat /sys/devices/virtual/net/lo/queues/tx-0/tx_timeout
    0root@syzkaller:~#

    Signed-off-by: Xiongfeng Wang
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Xiongfeng Wang
     
  • [ Upstream commit 7df5cb75cfb8acf96c7f2342530eb41e0c11f4c3 ]

    IRQs are disabled when freeing skbs in input queue.
    Use the IRQ safe variant to free skbs here.
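
    A minimal sketch of the pattern (the surrounding loop over the input
    packet queue and its locking are elided):

    __skb_unlink(skb, &sd->input_pkt_queue);
    dev_kfree_skb_irq(skb);     /* IRQ-safe variant, instead of kfree_skb(skb) */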

    Fixes: 145dd5f9c88f ("net: flush the softnet backlog in process context")
    Signed-off-by: Subash Abhinov Kasiviswanathan
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Subash Abhinov Kasiviswanathan
     

22 Jul, 2020

2 commits

  • [ Upstream commit ad0f75e5f57ccbceec13274e1e242f2b5a6397ed ]

    When we clone a socket in sk_clone_lock(), its sk_cgrp_data is
    copied, so the cgroup refcnt must be taken too. And, unlike the
    sk_alloc() path, sock_update_netprioidx() is not called here.
    Therefore, it is safe and necessary to grab the cgroup refcnt
    even when cgroup_sk_alloc is disabled.

    sk_clone_lock() is in BH context anyway; the in_interrupt() check
    would terminate this function if it were called there. And for sk_alloc()
    skcd->val is always zero, so it's safe to factor out the code
    to make it more readable.

    The global variable 'cgroup_sk_alloc_disabled' is used to determine
    whether to take these reference counts. It is impossible to make
    the reference counting correct unless we save this bit of information
    in skcd->val. So, add a new bit there to record whether the socket
    has already taken the reference counts. This obviously relies on
    kmalloc() to align cgroup pointers to at least 4 bytes,
    ARCH_KMALLOC_MINALIGN is certainly larger than that.

    This bug seems to have been present since the beginning; commit
    d979a39d7242 ("cgroup: duplicate cgroup reference when cloning sockets")
    tried to fix it but not completely. It seems it was not easy to trigger
    until the recent commit 090e28b229af
    ("netprio_cgroup: Fix unlimited memory leak of v2 cgroups") was merged.

    Fixes: bd1060a1d671 ("sock, cgroup: add sock->sk_cgroup")
    Reported-by: Cameron Berkenpas
    Reported-by: Peter Geis
    Reported-by: Lu Fengqi
    Reported-by: Daniël Sonck
    Reported-by: Zhang Qiang
    Tested-by: Cameron Berkenpas
    Tested-by: Peter Geis
    Tested-by: Thomas Lamprecht
    Cc: Daniel Borkmann
    Cc: Zefan Li
    Cc: Tejun Heo
    Cc: Roman Gushchin
    Signed-off-by: Cong Wang
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Cong Wang
     
  • [ Upstream commit d7bf2ebebc2bd61ab95e2a8e33541ef282f303d4 ]

    There are a couple of places in net/sched/ that check skb->protocol and act
    on the value there. However, in the presence of VLAN tags, the value stored
    in skb->protocol can be inconsistent based on whether VLAN acceleration is
    enabled. The commit quoted in the Fixes tag below fixed the users of
    skb->protocol to use a helper that will always see the VLAN ethertype.

    However, most of the callers don't actually handle the VLAN ethertype, but
    expect to find the IP header type in the protocol field. This means that
    things like changing the ECN field, or parsing diffserv values, stops
    working if there's a VLAN tag, or if there are multiple nested VLAN
    tags (QinQ).

    To fix this, change the helper to take an argument that indicates whether
    the caller wants to skip the VLAN tags or not. When skipping VLAN tags, we
    make sure to skip all of them, so behaviour is consistent even in QinQ
    mode.

    To make the helper usable from the ECN code, move it to if_vlan.h instead
    of pkt_sched.h.
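
    A hedged sketch of the reworked helper (close to, but not necessarily
    identical to, the in-tree skb_protocol()):

    static inline __be16 skb_protocol(const struct sk_buff *skb, bool skip_vlan)
    {
            unsigned int offset = skb_mac_offset(skb) + ETH_HLEN;
            __be16 proto = skb->protocol;

            if (!skip_vlan)
                    /* VLAN acceleration strips the VLAN header from the skb
                     * and stores the tag out of band */
                    return skb_vlan_tag_present(skb) ? skb->vlan_proto : proto;

            /* skip every nested VLAN tag so QinQ behaves consistently */
            while (eth_type_vlan(proto)) {
                    struct vlan_hdr vhdr, *vh;

                    vh = skb_header_pointer(skb, offset, sizeof(vhdr), &vhdr);
                    if (!vh)
                            break;

                    proto = vh->h_vlan_encapsulated_proto;
                    offset += sizeof(vhdr);
            }

            return proto;
    }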

    v3:
    - Remove empty lines
    - Move vlan variable definitions inside loop in skb_protocol()
    - Also use skb_protocol() helper in IP{,6}_ECN_decapsulate() and
    bpf_skb_ecn_set_ce()

    v2:
    - Use eth_type_vlan() helper in skb_protocol()
    - Also fix code that reads skb->protocol directly
    - Change a couple of 'if/else if' statements to switch constructs to avoid
    calling the helper twice

    Reported-by: Ilya Ponetayev
    Fixes: d8b9605d2697 ("net: sched: fix skb->protocol use in case of accelerated vlan path")
    Signed-off-by: Toke Høiland-Jørgensen
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Toke Høiland-Jørgensen
     

16 Jul, 2020

3 commits

  • commit 63960260457a02af2a6cb35d75e6bdb17299c882 upstream.

    When evaluating access control over kallsyms visibility, credentials at
    open() time need to be used, not the "current" creds (though in BPF's
    case, this has likely always been the same). Plumb access to the associated
    file->f_cred down through bpf_dump_raw_ok() and its callers now that
    kallsyms_show_value() has been refactored to take struct cred.
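
    A minimal sketch of the plumbing described above (the check ends up keyed
    on the credentials captured at open() time):

    static inline bool bpf_dump_raw_ok(const struct cred *cred)
    {
            /* file->f_cred from open() time, not the "current" creds */
            return kallsyms_show_value(cred);
    }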

    Cc: Alexei Starovoitov
    Cc: Daniel Borkmann
    Cc: bpf@vger.kernel.org
    Cc: stable@vger.kernel.org
    Fixes: 7105e828c087 ("bpf: allow for correlation of maps and helpers in dump")
    Signed-off-by: Kees Cook
    Signed-off-by: Greg Kroah-Hartman

    Kees Cook
     
  • [ Upstream commit 8025751d4d55a2f32be6bdf825b6a80c299875f5 ]

    If an ingress verdict program specifies message sizes greater than
    skb->len and there is an ENOMEM error due to memory pressure we
    may call the rcv_msg handler outside the strp_data_ready() caller
    context. This is because on an ENOMEM error the strparser will
    retry from a workqueue. The caller currently protects the use of
    psock by calling the strp_data_ready() inside a rcu_read_lock/unlock
    block.

    But, in the above workqueue error case the psock is accessed outside
    the rcu_read_lock/unlock block of the caller. So instead of using the
    psock directly we must do a lookup against the sk again to
    ensure the psock is available.

    There is an ugly piece here where we must handle
    the case where we paused the strp and removed the psock. On
    psock removal we first pause the strparser and then remove
    the psock. If the strparser is paused while an skb is
    scheduled on the workqueue, the skb will be dropped on the
    floor and kfree_skb() is called. If the workqueue manages
    to get called before we pause the strparser but runs the rcvmsg
    callback after the psock is removed, we will hit the unlikely
    case where we run the sockmap rcvmsg handler but do not have
    a psock. For now we follow the strparser logic and drop the
    skb on the floor with kfree_skb(). This is ugly because the
    data is dropped. To date this has not caused problems in practice
    because either the application controlling the sockmap is
    coordinating with the datapath so that skbs are "flushed"
    before removal, or we simply wait for the sock to be closed before
    removing it.

    This patch fixes the described RCU bug, and dropping the skb doesn't
    make things worse. Future patches will improve this by allowing
    the normal case where skbs are not merged to skip the strparser
    altogether. In practice many (most?) use cases have no need to
    merge skbs, so it's both a code complexity hit as seen above and
    a performance issue. For example, in the Cilium case we always
    set the strparser up to return skbs 1:1 without any merging and
    have avoided the above issues.

    Fixes: e91de6afa81c1 ("bpf: Fix running sk_skb program types with ktls")
    Signed-off-by: John Fastabend
    Signed-off-by: Alexei Starovoitov
    Acked-by: Martin KaFai Lau
    Link: https://lore.kernel.org/bpf/159312679888.18340.15248924071966273998.stgit@john-XPS-13-9370
    Signed-off-by: Sasha Levin

    John Fastabend
     
  • [ Upstream commit 93dd5f185916b05e931cffae636596f21f98546e ]

    There are two paths to generate the below RCU splat the first and
    most obvious is the result of the BPF verdict program issuing a
    redirect on a TLS socket (This is the splat shown below). Unlike
    the non-TLS case the caller of the *strp_read() hooks does not
    wrap the call in a rcu_read_lock/unlock. Then if the BPF program
    issues a redirect action we hit the RCU splat.

    However, in the non-TLS socket case the splat appears to be
    relatively rare, because the skmsg caller into the strp_data_ready()
    is wrapped in a rcu_read_lock/unlock. Shown here,

    static void sk_psock_strp_data_ready(struct sock *sk)
    {
    struct sk_psock *psock;

    rcu_read_lock();
    psock = sk_psock(sk);
    if (likely(psock)) {
    if (tls_sw_has_ctx_rx(sk)) {
    psock->parser.saved_data_ready(sk);
    } else {
    write_lock_bh(&sk->sk_callback_lock);
    strp_data_ready(&psock->parser.strp);
    write_unlock_bh(&sk->sk_callback_lock);
    }
    }
    rcu_read_unlock();
    }

    If the above was the only way to run the verdict program we
    would be safe. But, there is a case where the strparser may throw an
    ENOMEM error while parsing the skb. This is a result of a failed
    skb_clone, or alloc_skb_for_msg while building a new merged skb when
    the msg length needed spans multiple skbs. This will in turn put the
    skb on the strp_wrk workqueue in the strparser code. The skb will
    later be dequeued and verdict programs run, but now from a
    different context without the rcu_read_lock()/unlock() critical
    section in sk_psock_strp_data_ready() shown above. In practice
    I have not seen this yet, because as far as I know most users of the
    verdict programs are also only working on single skbs. In this case no
    merge happens which could trigger the above ENOMEM errors. In addition
    the system would need to be under memory pressure. For example, we
    can't hit the above case in selftests because we missed having tests
    to merge skbs. (Added in later patch)

    To fix the below splat, extend the rcu_read_lock/unlock block to
    include the call to sk_psock_tls_verdict_apply(). This will fix both the
    TLS redirect case and the non-TLS redirect+error case. Also remove
    psock from the sk_psock_tls_verdict_apply() function signature since it
    is not used there.

    [ 1095.937597] WARNING: suspicious RCU usage
    [ 1095.940964] 5.7.0-rc7-02911-g463bac5f1ca79 #1 Tainted: G W
    [ 1095.944363] -----------------------------
    [ 1095.947384] include/linux/skmsg.h:284 suspicious rcu_dereference_check() usage!
    [ 1095.950866]
    [ 1095.950866] other info that might help us debug this:
    [ 1095.950866]
    [ 1095.957146]
    [ 1095.957146] rcu_scheduler_active = 2, debug_locks = 1
    [ 1095.961482] 1 lock held by test_sockmap/15970:
    [ 1095.964501] #0: ffff9ea6b25de660 (sk_lock-AF_INET){+.+.}-{0:0}, at: tls_sw_recvmsg+0x13a/0x840 [tls]
    [ 1095.968568]
    [ 1095.968568] stack backtrace:
    [ 1095.975001] CPU: 1 PID: 15970 Comm: test_sockmap Tainted: G W 5.7.0-rc7-02911-g463bac5f1ca79 #1
    [ 1095.977883] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-1 04/01/2014
    [ 1095.980519] Call Trace:
    [ 1095.982191] dump_stack+0x8f/0xd0
    [ 1095.984040] sk_psock_skb_redirect+0xa6/0xf0
    [ 1095.986073] sk_psock_tls_strp_read+0x1d8/0x250
    [ 1095.988095] tls_sw_recvmsg+0x714/0x840 [tls]

    v2: Improve commit message to identify the non-TLS redirect plus error
    case condition as well as the more common TLS case. In the process I
    decided doing the rcu_read_unlock followed by the lock/unlock inside
    branches was unnecessarily complex. We can just extend the current rcu
    block and get the same effect without the shuffling and branching.
    Thanks Martin!

    Fixes: e91de6afa81c1 ("bpf: Fix running sk_skb program types with ktls")
    Reported-by: Jakub Sitnicki
    Reported-by: kernel test robot
    Signed-off-by: John Fastabend
    Signed-off-by: Alexei Starovoitov
    Acked-by: Martin KaFai Lau
    Acked-by: Jakub Sitnicki
    Link: https://lore.kernel.org/bpf/159312677907.18340.11064813152758406626.stgit@john-XPS-13-9370
    Signed-off-by: Sasha Levin

    John Fastabend
     

01 Jul, 2020

3 commits

  • [ Upstream commit 0ad6f6e767ec2f613418cbc7ebe5ec4c35af540c ]

    Back in commit f60e5990d9c1 ("ipv6: protect skb->sk accesses
    from recursive dereference inside the stack") Hannes added code
    so that IPv6 stack would not trust skb->sk for typical cases
    where packet goes through 'standard' xmit path (__dev_queue_xmit())

    Alas af_packet had a dev_direct_xmit() path that was not yet
    dealing with the xmit_recursion level.

    Also change sk_mc_loop() to dump a stack once only.

    Without this patch, syzbot was able to trigger :

    [1]
    [ 153.567378] WARNING: CPU: 7 PID: 11273 at net/core/sock.c:721 sk_mc_loop+0x51/0x70
    [ 153.567378] Modules linked in: nfnetlink ip6table_raw ip6table_filter iptable_raw iptable_nat nf_nat nf_conntrack nf_defrag_ipv4 nf_defrag_ipv6 iptable_filter macsec macvtap tap macvlan 8021q hsr wireguard libblake2s blake2s_x86_64 libblake2s_generic udp_tunnel ip6_udp_tunnel libchacha20poly1305 poly1305_x86_64 chacha_x86_64 libchacha curve25519_x86_64 libcurve25519_generic netdevsim batman_adv dummy team bridge stp llc w1_therm wire i2c_mux_pca954x i2c_mux cdc_acm ehci_pci ehci_hcd mlx4_en mlx4_ib ib_uverbs ib_core mlx4_core
    [ 153.567386] CPU: 7 PID: 11273 Comm: b159172088 Not tainted 5.8.0-smp-DEV #273
    [ 153.567387] RIP: 0010:sk_mc_loop+0x51/0x70
    [ 153.567388] Code: 66 83 f8 0a 75 24 0f b6 4f 12 b8 01 00 00 00 31 d2 d3 e0 a9 bf ef ff ff 74 07 48 8b 97 f0 02 00 00 0f b6 42 3a 83 e0 01 5d c3 0b b8 01 00 00 00 5d c3 0f b6 87 18 03 00 00 5d c0 e8 04 83 e0
    [ 153.567388] RSP: 0018:ffff95c69bb93990 EFLAGS: 00010212
    [ 153.567388] RAX: 0000000000000011 RBX: ffff95c6e0ee3e00 RCX: 0000000000000007
    [ 153.567389] RDX: ffff95c69ae50000 RSI: ffff95c6c30c3000 RDI: ffff95c6c30c3000
    [ 153.567389] RBP: ffff95c69bb93990 R08: ffff95c69a77f000 R09: 0000000000000008
    [ 153.567389] R10: 0000000000000040 R11: 00003e0e00026128 R12: ffff95c6c30c3000
    [ 153.567390] R13: ffff95c6cc4fd500 R14: ffff95c6f84500c0 R15: ffff95c69aa13c00
    [ 153.567390] FS: 00007fdc3a283700(0000) GS:ffff95c6ff9c0000(0000) knlGS:0000000000000000
    [ 153.567390] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    [ 153.567391] CR2: 00007ffee758e890 CR3: 0000001f9ba20003 CR4: 00000000001606e0
    [ 153.567391] Call Trace:
    [ 153.567391] ip6_finish_output2+0x34e/0x550
    [ 153.567391] __ip6_finish_output+0xe7/0x110
    [ 153.567391] ip6_finish_output+0x2d/0xb0
    [ 153.567392] ip6_output+0x77/0x120
    [ 153.567392] ? __ip6_finish_output+0x110/0x110
    [ 153.567392] ip6_local_out+0x3d/0x50
    [ 153.567392] ipvlan_queue_xmit+0x56c/0x5e0
    [ 153.567393] ? ksize+0x19/0x30
    [ 153.567393] ipvlan_start_xmit+0x18/0x50
    [ 153.567393] dev_direct_xmit+0xf3/0x1c0
    [ 153.567393] packet_direct_xmit+0x69/0xa0
    [ 153.567394] packet_sendmsg+0xbf0/0x19b0
    [ 153.567394] ? plist_del+0x62/0xb0
    [ 153.567394] sock_sendmsg+0x65/0x70
    [ 153.567394] sock_write_iter+0x93/0xf0
    [ 153.567394] new_sync_write+0x18e/0x1a0
    [ 153.567395] __vfs_write+0x29/0x40
    [ 153.567395] vfs_write+0xb9/0x1b0
    [ 153.567395] ksys_write+0xb1/0xe0
    [ 153.567395] __x64_sys_write+0x1a/0x20
    [ 153.567395] do_syscall_64+0x43/0x70
    [ 153.567396] entry_SYSCALL_64_after_hwframe+0x44/0xa9
    [ 153.567396] RIP: 0033:0x453549
    [ 153.567396] Code: Bad RIP value.
    [ 153.567396] RSP: 002b:00007fdc3a282cc8 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
    [ 153.567397] RAX: ffffffffffffffda RBX: 00000000004d32d0 RCX: 0000000000453549
    [ 153.567397] RDX: 0000000000000020 RSI: 0000000020000300 RDI: 0000000000000003
    [ 153.567398] RBP: 00000000004d32d8 R08: 0000000000000000 R09: 0000000000000000
    [ 153.567398] R10: 0000000000000000 R11: 0000000000000246 R12: 00000000004d32dc
    [ 153.567398] R13: 00007ffee742260f R14: 00007fdc3a282dc0 R15: 00007fdc3a283700
    [ 153.567399] ---[ end trace c1d5ae2b1059ec62 ]---

    Fixes: f60e5990d9c1 ("ipv6: protect skb->sk accesses from recursive dereference inside the stack")
    Signed-off-by: Eric Dumazet
    Reported-by: syzbot
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Eric Dumazet
     
  • [ Upstream commit 814152a89ed52c722ab92e9fbabcac3cb8a39245 ]

    I got a memleak report when doing some fuzz test:

    unreferenced object 0xffff888112584000 (size 13599):
    comm "ip", pid 3048, jiffies 4294911734 (age 343.491s)
    hex dump (first 32 bytes):
    74 61 70 30 00 00 00 00 00 00 00 00 00 00 00 00 tap0............
    00 ee d9 19 81 88 ff ff 00 00 00 00 00 00 00 00 ................
    backtrace:
    [] __kmalloc_node+0x309/0x3a0
    [] kvmalloc_node+0x7f/0xc0
    [] alloc_netdev_mqs+0x76/0xfc0
    [] __tun_chr_ioctl+0x1456/0x3d70
    [] ksys_ioctl+0xe5/0x130
    [] __x64_sys_ioctl+0x6f/0xb0
    [] do_syscall_64+0x56/0xa0
    [] entry_SYSCALL_64_after_hwframe+0x44/0xa9
    unreferenced object 0xffff888111845cc0 (size 8):
    comm "ip", pid 3048, jiffies 4294911734 (age 343.491s)
    hex dump (first 8 bytes):
    74 61 70 30 00 88 ff ff tap0....
    backtrace:
    [] kstrdup+0x35/0x70
    [] kstrdup_const+0x3d/0x50
    [] kvasprintf_const+0xf1/0x180
    [] kobject_set_name_vargs+0x56/0x140
    [] dev_set_name+0xab/0xe0
    [] netdev_register_kobject+0xc0/0x390
    [] register_netdevice+0xb61/0x1250
    [] __tun_chr_ioctl+0x1cd1/0x3d70
    [] ksys_ioctl+0xe5/0x130
    [] __x64_sys_ioctl+0x6f/0xb0
    [] do_syscall_64+0x56/0xa0
    [] entry_SYSCALL_64_after_hwframe+0x44/0xa9
    unreferenced object 0xffff88811886d800 (size 512):
    comm "ip", pid 3048, jiffies 4294911734 (age 343.491s)
    hex dump (first 32 bytes):
    00 00 00 00 ad 4e ad de ff ff ff ff 00 00 00 00 .....N..........
    ff ff ff ff ff ff ff ff c0 66 3d a3 ff ff ff ff .........f=.....
    backtrace:
    [] device_add+0x61e/0x1950
    [] netdev_register_kobject+0x17e/0x390
    [] register_netdevice+0xb61/0x1250
    [] __tun_chr_ioctl+0x1cd1/0x3d70
    [] ksys_ioctl+0xe5/0x130
    [] __x64_sys_ioctl+0x6f/0xb0
    [] do_syscall_64+0x56/0xa0
    [] entry_SYSCALL_64_after_hwframe+0x44/0xa9

    If call_netdevice_notifiers() fails, rollback_registered()
    calls netdev_unregister_kobject(), which holds the kobject. The
    reference cannot be put because the netdev won't be added to the todo
    list, so this leads to a memleak; we need to put the reference to
    avoid the memleak.

    Reported-by: Hulk Robot
    Signed-off-by: Yang Yingliang
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Yang Yingliang
     
  • [ Upstream commit 41b14fb8724d5a4b382a63cb4a1a61880347ccb8 ]

    Clearing the sock TX queue in sk_set_socket() might cause unexpected
    out-of-order transmit when called from sock_orphan(), as outstanding
    packets can pick a different TX queue and bypass the ones already queued.

    This is undesired in general. More specifically, it breaks the in-order
    scheduling property guarantee for device-offloaded TLS sockets.

    Remove the call to sk_tx_queue_clear() in sk_set_socket(), and add it
    explicitly only where needed.
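
    A minimal sketch of the after-state described above (sk_set_socket() once
    the clearing is moved out of it):

    static inline void sk_set_socket(struct sock *sk, struct socket *sock)
    {
            sk->sk_socket = sock;
    }

    /* ... with sk_tx_queue_clear(sk) called explicitly only where a fresh
     * mapping is wanted, e.g. in sk_alloc() and sk_clone_lock() */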

    Fixes: e022f0b4a03f ("net: Introduce sk_tx_queue_mapping")
    Signed-off-by: Tariq Toukan
    Reviewed-by: Boris Pismenny
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Tariq Toukan
     

24 Jun, 2020

6 commits

  • [ Upstream commit 11d6011c2cf29f7c8181ebde6c8bc0c4d83adcd7 ]

    Sequence counters write paths are critical sections that must never be
    preempted, and blocking, even for CONFIG_PREEMPTION=n, is not allowed.

    Commit 5dbe7c178d3f ("net: fix kernel deadlock with interface rename and
    netdev name retrieval.") handled a deadlock, observed with
    CONFIG_PREEMPTION=n, where the devnet_rename seqcount read side was
    infinitely spinning: it got scheduled after the seqcount write side
    blocked inside its own critical section.

    To fix that deadlock, among other issues, the commit added a
    cond_resched() inside the read side section. While this will get the
    non-preemptible kernel eventually unstuck, the seqcount reader is fully
    exhausting its slice just spinning -- until TIF_NEED_RESCHED is set.

    The fix is also still broken: if the seqcount reader belongs to a
    real-time scheduling policy, it can spin forever and the kernel will
    livelock.

    Disabling preemption over the seqcount write side critical section will
    not work: inside it are a number of GFP_KERNEL allocations and mutex
    locking through the drivers/base/ :: device_rename() call chain.

    From all the above, replace the seqcount with a rwsem.
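
    A minimal sketch of the replacement, assuming a devnet_rename_sem rwsem
    as described (surrounding code elided):

    static DECLARE_RWSEM(devnet_rename_sem);

    /* write side, in dev_change_name(): sleeping and GFP_KERNEL are fine */
    down_write(&devnet_rename_sem);
    /* ... rename the device ... */
    up_write(&devnet_rename_sem);

    /* read side, in netdev_get_name(): no retry/spin loop any more */
    down_read(&devnet_rename_sem);
    strcpy(name, dev->name);
    up_read(&devnet_rename_sem);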

    Fixes: 5dbe7c178d3f (net: fix kernel deadlock with interface rename and netdev name retrieval.)
    Fixes: 30e6c9fa93cf (net: devnet_rename_seq should be a seqcount)
    Fixes: c91f6df2db49 (sockopt: Change getsockopt() of SO_BINDTODEVICE to return an interface name)
    Cc:
    Reported-by: kbuild test robot [ v1 missing up_read() on error exit ]
    Reported-by: Dan Carpenter [ v1 missing up_read() on error exit ]
    Signed-off-by: Ahmed S. Darwish
    Reviewed-by: Sebastian Andrzej Siewior
    Signed-off-by: David S. Miller
    Signed-off-by: Sasha Levin

    Ahmed S. Darwish
     
  • [ Upstream commit 2da2b32fd9346009e9acdb68c570ca8d3966aba7 ]

    CONFIG_PREEMPTION is selected by CONFIG_PREEMPT and by CONFIG_PREEMPT_RT.
    Both PREEMPT and PREEMPT_RT require the same functionality which today
    depends on CONFIG_PREEMPT.

    Update the comment to use CONFIG_PREEMPTION.

    Signed-off-by: Thomas Gleixner
    Signed-off-by: Sebastian Andrzej Siewior
    Signed-off-by: Thomas Gleixner
    Acked-by: David S. Miller
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: netdev@vger.kernel.org
    Link: https://lore.kernel.org/r/20191015191821.11479-22-bigeasy@linutronix.de
    Signed-off-by: Ingo Molnar
    Signed-off-by: Sasha Levin

    Thomas Gleixner
     
  • [ Upstream commit 60e5ca8a64bad8f3e2e20a1e57846e497361c700 ]

    Add missed bpf_map_charge_init() in sock_hash_alloc() and
    correspondingly bpf_map_charge_finish() on ENOMEM.

    It was found accidentally while working on an unrelated selftest that
    checks that "map->memory.pages > 0" is true for all map types.

    Before:
    # bpftool m l
    ...
    3692: sockhash name m_sockhash flags 0x0
    key 4B value 4B max_entries 8 memlock 0B

    After:
    # bpftool m l
    ...
    84: sockmap name m_sockmap flags 0x0
    key 4B value 4B max_entries 8 memlock 4096B
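
    A minimal sketch of the missing charge/uncharge pairing in
    sock_hash_alloc() (cost computation and labels are illustrative):

    err = bpf_map_charge_init(&htab->map.memory, cost);
    if (err)
            goto free_htab;

    /* ... and if a later allocation fails: */
    bpf_map_charge_finish(&htab->map.memory);
    err = -ENOMEM;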

    Fixes: 604326b41a6f ("bpf, sockmap: convert to generic sk_msg interface")
    Signed-off-by: Andrey Ignatov
    Signed-off-by: Alexei Starovoitov
    Link: https://lore.kernel.org/bpf/20200612000857.2881453-1-rdna@fb.com
    Signed-off-by: Sasha Levin

    Andrey Ignatov
     
  • [ Upstream commit 0f5d82f187e1beda3fe7295dfc500af266a5bd80 ]

    Added a check in the switch case on start_header that checks for
    the existence of the header; in the case that MAC is not set
    and the caller requests MAC, return -EFAULT. If the caller requests
    NET then MAC's existence is completely ignored.

    There is no function to check NET header's existence and as far
    as cgroup_skb/egress is concerned it should always be set.

    Removed the check for ptr >= the start of header, considering offset is
    bounded unsigned and should always be true. len ...
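
    A hedged sketch of the switch described above, inside the
    bpf_skb_load_bytes_relative() helper (the error label name is an
    assumption):

    case BPF_HDR_START_MAC:
            if (unlikely(!skb_mac_header_was_set(skb)))
                    goto err_clear;         /* caller sees -EFAULT */
            start = skb_mac_header(skb);
            break;
    case BPF_HDR_START_NET:
            start = skb_network_header(skb);
            break;
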
    Signed-off-by: Daniel Borkmann
    Reviewed-by: Stanislav Fomichev
    Link: https://lore.kernel.org/bpf/76bb820ddb6a95f59a772ecbd8c8a336f646b362.1591812755.git.zhuyifei@google.com
    Signed-off-by: Sasha Levin

    YiFei Zhu
     
  • [ Upstream commit 75e68e5bf2c7fa9d3e874099139df03d5952a3e1 ]

    We can end up modifying the sockhash bucket list from two CPUs when a
    sockhash is being destroyed (sock_hash_free) on one CPU, while a socket
    that is in the sockhash is unlinking itself from it on another CPU
    (sock_hash_delete_from_link).

    This results in accessing a list element that is in an undefined state as
    reported by KASAN:

    | ==================================================================
    | BUG: KASAN: wild-memory-access in sock_hash_free+0x13c/0x280
    | Write of size 8 at addr dead000000000122 by task kworker/2:1/95
    |
    | CPU: 2 PID: 95 Comm: kworker/2:1 Not tainted 5.7.0-rc7-02961-ge22c35ab0038-dirty #691
    | Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS ?-20190727_073836-buildvm-ppc64le-16.ppc.fedoraproject.org-3.fc31 04/01/2014
    | Workqueue: events bpf_map_free_deferred
    | Call Trace:
    | dump_stack+0x97/0xe0
    | ? sock_hash_free+0x13c/0x280
    | __kasan_report.cold+0x5/0x40
    | ? mark_lock+0xbc1/0xc00
    | ? sock_hash_free+0x13c/0x280
    | kasan_report+0x38/0x50
    | ? sock_hash_free+0x152/0x280
    | sock_hash_free+0x13c/0x280
    | bpf_map_free_deferred+0xb2/0xd0
    | ? bpf_map_charge_finish+0x50/0x50
    | ? rcu_read_lock_sched_held+0x81/0xb0
    | ? rcu_read_lock_bh_held+0x90/0x90
    | process_one_work+0x59a/0xac0
    | ? lock_release+0x3b0/0x3b0
    | ? pwq_dec_nr_in_flight+0x110/0x110
    | ? rwlock_bug.part.0+0x60/0x60
    | worker_thread+0x7a/0x680
    | ? _raw_spin_unlock_irqrestore+0x4c/0x60
    | kthread+0x1cc/0x220
    | ? process_one_work+0xac0/0xac0
    | ? kthread_create_on_node+0xa0/0xa0
    | ret_from_fork+0x24/0x30
    | ==================================================================

    Fix it by reintroducing spin-lock protected critical section around the
    code that removes the elements from the bucket on sockhash free.

    To do that we also need to defer processing of removed elements, until out
    of atomic context so that we can unlink the socket from the map when
    holding the sock lock.

    Fixes: 90db6d772f74 ("bpf, sockmap: Remove bucket->lock from sock_{hash|map}_free")
    Reported-by: Eric Dumazet
    Signed-off-by: Jakub Sitnicki
    Signed-off-by: Alexei Starovoitov
    Acked-by: John Fastabend
    Link: https://lore.kernel.org/bpf/20200607205229.2389672-3-jakub@cloudflare.com
    Signed-off-by: Sasha Levin

    Jakub Sitnicki
     
  • [ Upstream commit 33a7c831565c43a7ee2f38c7df4c4a40e1dfdfed ]

    When sockhash gets destroyed while sockets are still linked to it, we will
    walk the bucket lists and delete the links. However, we are not freeing the
    list elements after processing them, leaking the memory.

    The leak can be triggered by close()'ing a sockhash map when it still
    contains sockets, and observed with kmemleak:

    unreferenced object 0xffff888116e86f00 (size 64):
    comm "race_sock_unlin", pid 223, jiffies 4294731063 (age 217.404s)
    hex dump (first 32 bytes):
    00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
    81 de e8 41 00 00 00 00 c0 69 2f 15 81 88 ff ff ...A.....i/.....
    backtrace:
    [] sock_hash_update_common+0x4ca/0x760
    [] sock_hash_update_elem+0x1d2/0x200
    [] __do_sys_bpf+0x2046/0x2990
    [] do_syscall_64+0xad/0x9a0
    [] entry_SYSCALL_64_after_hwframe+0x49/0xb3

    Fix it by freeing the list element when we're done with it.

    Fixes: 604326b41a6f ("bpf, sockmap: convert to generic sk_msg interface")
    Signed-off-by: Jakub Sitnicki
    Signed-off-by: Alexei Starovoitov
    Acked-by: John Fastabend
    Link: https://lore.kernel.org/bpf/20200607205229.2389672-2-jakub@cloudflare.com
    Signed-off-by: Sasha Levin

    Jakub Sitnicki
     

22 Jun, 2020

2 commits

  • [ Upstream commit e91de6afa81c10e9f855c5695eb9a53168d96b73 ]

    KTLS uses a stream parser to collect TLS messages and send them to
    the upper layer tls receive handler. This ensures the tls receiver
    has a full TLS header to parse when it is run. However, when a
    socket has BPF_SK_SKB_STREAM_VERDICT program attached before KTLS
    is enabled we end up with two stream parsers running on the same
    socket.

    The result is that both try to run on the same socket. First the KTLS
    stream parser runs and calls read_sock(), which calls tcp_read_sock(),
    which in turn calls tcp_rcv_skb(). This dequeues the skb from the
    sk_receive_queue. When this is done, the KTLS code calls the data_ready()
    callback which, because we stacked KTLS on top of the bpf stream
    verdict program, has been replaced with sk_psock_start_strp(). This
    will in turn kick the stream parser again and eventually do the
    same thing KTLS did above, calling into tcp_rcv_skb() and dequeuing
    a skb from the sk_receive_queue.

    At this point the data stream is broken. Part of the stream was
    handled by the KTLS side and some other bytes may have been handled
    by the BPF side. Generally this results in either missing data
    or, more likely, a "Bad Message" complaint from the kTLS receive
    handler, as the BPF program steals some bytes meant to be in a
    TLS header and/or the TLS header length is no longer correct.

    We've already broken the idealized model where we can stack ULPs
    in any order with generic callbacks on the TX side to handle this.
    So in this patch we do the same thing but for the RX side. We add
    a sk_psock_strp_enabled() helper so TLS can learn a BPF verdict
    program is running and add a tls_sw_has_ctx_rx() helper so the BPF
    side can learn there is a TLS ULP on the socket.

    Then on the BPF side we omit calling our stream parser to avoid
    breaking the data stream for the KTLS receiver. Then on the
    KTLS side we call BPF_SK_SKB_STREAM_VERDICT once the KTLS
    receiver is done with the packet but before it posts the
    msg to userspace. This gives us symmetry between the TX and
    RX halves and IMO makes it usable again. On the TX side we
    process packets in this order BPF -> TLS -> TCP and on
    the receive side in the reverse order TCP -> TLS -> BPF.
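
    A hedged sketch of what the two small helpers might look like (treat the
    exact field accesses as assumptions):

    static inline bool sk_psock_strp_enabled(struct sk_psock *psock)
    {
            if (!psock)
                    return false;
            return psock->parser.enabled;
    }

    static inline bool tls_sw_has_ctx_rx(const struct sock *sk)
    {
            struct tls_context *ctx = tls_get_ctx(sk);

            if (!ctx)
                    return false;
            return !!tls_sw_ctx_rx(ctx);
    }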

    Discovered while testing OpenSSL 3.0 Alpha2.0 release.

    Fixes: d829e9c4112b5 ("tls: convert to generic sk_msg interface")
    Signed-off-by: John Fastabend
    Signed-off-by: Alexei Starovoitov
    Link: https://lore.kernel.org/bpf/159079361946.5745.605854335665044485.stgit@john-Precision-5820-Tower
    Signed-off-by: Alexei Starovoitov
    Signed-off-by: Sasha Levin

    John Fastabend
     
  • [ Upstream commit ca2f5f21dbbd5e3a00cd3e97f728aa2ca0b2e011 ]

    We will need this block of code called from the tls context shortly;
    let's refactor the redirect logic so it's easy to use. This also
    cleans up the switch stmt so we have fewer fallthrough cases.

    No logic changes are intended.

    Fixes: d829e9c4112b5 ("tls: convert to generic sk_msg interface")
    Signed-off-by: John Fastabend
    Signed-off-by: Alexei Starovoitov
    Reviewed-by: Jakub Sitnicki
    Acked-by: Song Liu
    Link: https://lore.kernel.org/bpf/159079360110.5745.7024009076049029819.stgit@john-Precision-5820-Tower
    Signed-off-by: Alexei Starovoitov
    Signed-off-by: Sasha Levin

    John Fastabend
     

19 Jun, 2020

1 commit

  • * tag 'v5.4.47': (2193 commits)
    Linux 5.4.47
    KVM: arm64: Save the host's PtrAuth keys in non-preemptible context
    KVM: arm64: Synchronize sysreg state on injecting an AArch32 exception
    ...

    Conflicts:
    arch/arm/boot/dts/imx6qdl.dtsi
    arch/arm/mach-imx/Kconfig
    arch/arm/mach-imx/common.h
    arch/arm/mach-imx/suspend-imx6.S
    arch/arm64/boot/dts/freescale/imx8qxp-mek.dts
    arch/powerpc/include/asm/cacheflush.h
    drivers/cpufreq/imx6q-cpufreq.c
    drivers/dma/imx-sdma.c
    drivers/edac/synopsys_edac.c
    drivers/firmware/imx/imx-scu.c
    drivers/net/ethernet/freescale/fec.h
    drivers/net/ethernet/freescale/fec_main.c
    drivers/net/ethernet/stmicro/stmmac/stmmac_platform.c
    drivers/net/phy/phy_device.c
    drivers/perf/fsl_imx8_ddr_perf.c
    drivers/usb/cdns3/gadget.c
    drivers/usb/dwc3/gadget.c
    include/uapi/linux/dma-buf.h

    Signed-off-by: Jason Liu

    Jason Liu
     

03 Jun, 2020

1 commit

  • [ Upstream commit c0bbbdc32febd4f034ecbf3ea17865785b2c0652 ]

    __netif_receive_skb_core may change the skb pointer passed into it (e.g.
    in rx_handler). The original skb may be freed as a result of this
    operation.

    The callers of __netif_receive_skb_core may further process original skb
    by using pt_prev pointer returned by __netif_receive_skb_core thus
    leading to unpleasant effects.

    The solution is to pass skb by reference into __netif_receive_skb_core.

    v2: Added Fixes tag and comment regarding pt_prev and skb invariant.

    Fixes: 88eb1944e18c ("net: core: propagate SKB lists through packet_type lookup")
    Signed-off-by: Boris Sukholitko
    Acked-by: Edward Cree
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Boris Sukholitko
     

27 May, 2020

1 commit

  • commit 5cf65922bb15279402e1e19b5ee8c51d618fa51f upstream.

    When attaching a flow dissector program to a network namespace with
    bpf(BPF_PROG_ATTACH, ...) we grab a reference to bpf_prog.

    If netns gets destroyed while a flow dissector is still attached, and there
    are no other references to the prog, we leak the reference and the program
    remains loaded.

    Leak can be reproduced by running flow dissector tests from selftests/bpf:

    # bpftool prog list
    # ./test_flow_dissector.sh
    ...
    selftests: test_flow_dissector [PASS]
    # bpftool prog list
    4: flow_dissector name _dissect tag e314084d332a5338 gpl
    loaded_at 2020-05-20T18:50:53+0200 uid 0
    xlated 552B jited 355B memlock 4096B map_ids 3,4
    btf_id 4
    #

    Fix it by detaching the flow dissector program when netns is going away.

    Fixes: d58e468b1112 ("flow_dissector: implements flow dissector BPF hook")
    Signed-off-by: Jakub Sitnicki
    Signed-off-by: Alexei Starovoitov
    Reviewed-by: Stanislav Fomichev
    Link: https://lore.kernel.org/bpf/20200521083435.560256-1-jakub@cloudflare.com
    Signed-off-by: Greg Kroah-Hartman

    Jakub Sitnicki
     

20 May, 2020

3 commits

  • [ Upstream commit 3e104c23816220919ea1b3fd93fabe363c67c484 ]

    When sk_msg_pop() is called where the pop operation is working on
    the end of a sge element and there is no additional trailing data
    and there _is_ data in front of pop, like the following case,

    |____________a_____________|__pop__|

    We have out of order operations where we incorrectly set the pop
    variable so that instead of zero'ing pop we incorrectly leave it
    untouched, effectively. This can cause later logic to shift the
    buffers around believing it should pop extra space. The result is
    we have 'popped' more data than we expected, potentially breaking
    program logic.

    It took us a while to hit this case because typically we pop headers,
    which seem to rarely be at the end of a scatterlist element, but
    we can't rely on this.

    Fixes: 7246d8ed4dcce ("bpf: helper to pop data from messages")
    Signed-off-by: John Fastabend
    Signed-off-by: Daniel Borkmann
    Reviewed-by: Jakub Sitnicki
    Acked-by: Martin KaFai Lau
    Link: https://lore.kernel.org/bpf/158861288359.14306.7654891716919968144.stgit@john-Precision-5820-Tower
    Signed-off-by: Sasha Levin

    John Fastabend
     
  • [ Upstream commit 090e28b229af92dc5b40786ca673999d59e73056 ]

    If systemd is configured to use hybrid mode which enables the use of
    both cgroup v1 and v2, systemd will create new cgroup on both the default
    root (v2) and netprio_cgroup hierarchy (v1) for a new session and attach
    task to the two cgroups. If the task does some network thing then the v2
    cgroup can never be freed after the session exited.

    One of our machines ran into OOM due to this memory leak.

    In the scenario described above, when sk_alloc() is called,
    cgroup_sk_alloc() thought it's in v2 mode, so it stores
    the cgroup pointer in sk->sk_cgrp_data and increments
    the cgroup refcnt, but then sock_update_netprioidx()
    thought it's in v1 mode, so it stores the netprioidx value
    in sk->sk_cgrp_data, so the cgroup refcnt is never dropped
    and the cgroup is never freed.

    Currently we do the mode switch when someone writes to the ifpriomap
    cgroup control file. The easiest fix is to also do the switch when
    a task is attached to a new cgroup.

    Fixes: bd1060a1d671 ("sock, cgroup: add sock->sk_cgroup")
    Reported-by: Yang Yingliang
    Tested-by: Yang Yingliang
    Signed-off-by: Zefan Li
    Acked-by: Tejun Heo
    Signed-off-by: Jakub Kicinski
    Signed-off-by: Greg Kroah-Hartman

    Zefan Li
     
  • [ Upstream commit dd912306ff008891c82cd9f63e8181e47a9cb2fb ]

    syzbot managed to trigger a recursive NETDEV_FEAT_CHANGE event
    between bonding master and slave. I managed to find a reproducer
    for this:

    ip li set bond0 up
    ifenslave bond0 eth0
    brctl addbr br0
    ethtool -K eth0 lro off
    brctl addif br0 bond0
    ip li set br0 up

    When a NETDEV_FEAT_CHANGE event is triggered on a bonding slave,
    it captures this and calls bond_compute_features() to fixup its
    master's and other slaves' features. However, when syncing with
    its lower devices by netdev_sync_lower_features() this event is
    triggered again on slaves when the LRO feature fails to change,
    so it goes back and forth recursively until the kernel stack is
    exhausted.

    Commit 17b85d29e82c intentionally lets __netdev_update_features()
    return -1 for such a failure case, so we have to just rely on
    the existing check inside netdev_sync_lower_features() and skip
    NETDEV_FEAT_CHANGE event only for this specific failure case.

    Fixes: fd867d51f889 ("net/core: generic support for disabling netdev features down stack")
    Reported-by: syzbot+e73ceacfd8560cc8a3ca@syzkaller.appspotmail.com
    Reported-by: syzbot+c2fb6f9ddcea95ba49b5@syzkaller.appspotmail.com
    Cc: Jarod Wilson
    Cc: Nikolay Aleksandrov
    Cc: Josh Poimboeuf
    Cc: Jann Horn
    Reviewed-by: Jay Vosburgh
    Signed-off-by: Cong Wang
    Acked-by: Nikolay Aleksandrov
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Cong Wang