23 Dec, 2019

1 commit

  • Pull networking fixes from David Miller:

    1) Several nf_flow_table_offload fixes from Pablo Neira Ayuso,
    including adding a missing ipv6 match description.

    2) Several heap overflow fixes in mwifiex from qize wang and Ganapathi
    Bhat.

    3) Fix uninit value in bond_neigh_init(), from Eric Dumazet.

    4) Fix non-ACPI probing of nxp-nci, from Stephan Gerhold.

    5) Fix use after free in tipc_disc_rcv(), from Tuong Lien.

    6) Enforce limit of 33 tail calls in mips and riscv JIT, from Paul
    Chaignon.

    7) Multicast MAC limit test is off by one in qede, from Manish Chopra.

    8) Fix established socket lookup race when socket goes from
    TCP_ESTABLISHED to TCP_LISTEN, caused by the lack of an intervening
    RCU grace period. From Eric Dumazet.

    9) Don't send empty SKBs from tcp_write_xmit(), also from Eric Dumazet.

    10) Fix active backup transition after link failure in bonding, from
    Mahesh Bandewar.

    11) Avoid zero sized hash table in gtp driver, from Taehee Yoo.

    12) Fix wrong interface passed to ->mac_link_up(), from Russell King.

    13) Fix DSA egress flooding settings in b53, from Florian Fainelli.

    14) Memory leak in gmac_setup_txqs(), from Navid Emamdoost.

    15) Fix double free in dpaa2-ptp code, from Ioana Ciornei.

    16) Reject invalid MTU values in stmmac, from Jose Abreu.

    17) Fix refcount leak in error path of u32 classifier, from Davide
    Caratti.

    18) Fix regression causing iwlwifi firmware crashes on boot, from Anders
    Kaseorg.

    19) Fix inverted return value logic in llc2 code, from Chan Shu Tak.

    20) Disable hardware GRO when XDP is attached to qede, from Manish
    Chopra.

    21) Since we encode state in the low pointer bits, dst metrics must be
    at least 4 byte aligned, which is not necessarily true on m68k. Add
    annotations to fix this, from Geert Uytterhoeven.

    * git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (160 commits)
    sfc: Include XDP packet headroom in buffer step size.
    sfc: fix channel allocation with brute force
    net: dst: Force 4-byte alignment of dst_metrics
    selftests: pmtu: fix init mtu value in description
    hv_netvsc: Fix unwanted rx_table reset
    net: phy: ensure that phy IDs are correctly typed
    mod_devicetable: fix PHY module format
    qede: Disable hardware gro when xdp prog is installed
    net: ena: fix issues in setting interrupt moderation params in ethtool
    net: ena: fix default tx interrupt moderation interval
    net/smc: unregister ib devices in reboot_event
    net: stmmac: platform: Fix MDIO init for platforms without PHY
    llc2: Fix return statement of llc_stat_ev_rx_null_dsap_xid_c (and _test_c)
    net: hisilicon: Fix a BUG trigered by wrong bytes_compl
    net: dsa: ksz: use common define for tag len
    s390/qeth: don't return -ENOTSUPP to userspace
    s390/qeth: fix promiscuous mode after reset
    s390/qeth: handle error due to unsupported transport mode
    cxgb4: fix refcount init for TC-MQPRIO offload
    tc-testing: initial tdc selftests for cls_u32
    ...

    Linus Torvalds
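
Item 21 of the pull above (dst metrics alignment) can be illustrated with a small userspace sketch. It assumes a GCC/Clang-style alignment attribute and uses hypothetical names (metrics_sketch, STATE_MASK, pack_metrics); it is not the kernel's struct dst_metrics, only the general idea of why state stored in the low pointer bits needs at least 4-byte alignment.

```c
#include <stdint.h>

/* Hypothetical stand-in for a metrics block; not the kernel's struct dst_metrics.
 * With only 16-bit members the natural alignment would be 2 bytes, so the
 * attribute below is what guarantees that the two low address bits are zero. */
struct metrics_sketch {
    uint16_t values[8];
    uint16_t refcnt;
} __attribute__((aligned(4)));  /* GCC/Clang extension: force 4-byte alignment */

#define STATE_MASK ((uintptr_t)0x3)  /* assumption: two low pointer bits carry state */

/* Store the state bits in the low bits of the pointer value. */
static uintptr_t pack_metrics(const struct metrics_sketch *m, unsigned int state)
{
    return (uintptr_t)m | ((uintptr_t)state & STATE_MASK);
}

/* Recover the pointer by masking the state bits off again. */
static struct metrics_sketch *metrics_ptr(uintptr_t packed)
{
    return (struct metrics_sketch *)(packed & ~STATE_MASK);
}

/* Recover the state bits. */
static unsigned int metrics_state(uintptr_t packed)
{
    return (unsigned int)(packed & STATE_MASK);
}
```

Without the alignment guarantee, a block placed at an address with a nonzero low bit would be corrupted by the mask in metrics_ptr().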
     

21 Dec, 2019

3 commits

  • In the reboot_event handler, unregister the ib devices and enable
    the IB layer to release the devices before the reboot.

    Fixes: a33a803cfe64 ("net/smc: guarantee removal of link groups in reboot")
    Signed-off-by: Karsten Graul
    Reviewed-by: Ursula Braun
    Signed-off-by: David S. Miller

    Karsten Graul
     
  • When a frame with NULL DSAP is received, llc_station_rcv is called.
    In turn, llc_stat_ev_rx_null_dsap_xid_c is called to check if it is a NULL
    XID frame. The return statement of llc_stat_ev_rx_null_dsap_xid_c returns 1
    when the incoming frame is not a NULL XID frame and 0 otherwise. Hence, a
    NULL XID response is returned unexpectedly, e.g. when the incoming frame is
    a NULL TEST command.

    To fix the error, simply remove the conditional operator.

    A similar error in llc_stat_ev_rx_null_dsap_test_c is also fixed.

    Signed-off-by: Chan Shu Tak, Alex
    Signed-off-by: David S. Miller

    Chan Shu Tak, Alex
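
The inverted return value described in the llc2 commit above can be sketched as follows. The struct and function names are hypothetical simplifications (the kernel inspects PDU headers, not boolean fields); only the shape of the bug, a trailing conditional operator that flips the result, is taken from the commit message.

```c
#include <stdbool.h>

/* Hypothetical, simplified view of a received frame. */
struct frame_sketch {
    bool is_command;
    bool is_xid;
    bool dsap_is_null;
};

/* Buggy shape: the trailing conditional operator inverts the match result,
 * so every frame that is NOT a NULL XID command is reported as one. */
static int is_null_xid_cmd_buggy(const struct frame_sketch *f)
{
    return f->is_command && f->is_xid && f->dsap_is_null ? 0 : 1;
}

/* Fixed shape: drop the conditional operator and return the condition itself. */
static int is_null_xid_cmd_fixed(const struct frame_sketch *f)
{
    return f->is_command && f->is_xid && f->dsap_is_null;
}
```

With the buggy variant, a NULL TEST command (is_xid false) yields 1 and would be answered as if it were a NULL XID frame.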
     
  • Remove special taglen define KSZ8795_INGRESS_TAG_LEN
    and use generic KSZ_INGRESS_TAG_LEN instead.

    Signed-off-by: Michael Grzeschik
    Reviewed-by: Andrew Lunn
    Signed-off-by: David S. Miller

    Michael Grzeschik
     

20 Dec, 2019

3 commits

  • When users replace cls_u32 filters with new ones having wrong parameters,
    so that u32_change() fails to validate them, the kernel doesn't roll-back
    correctly, and leaves semi-configured rules.

    Fix this in u32_walk(), avoiding a call to the walker function on filters
    that don't have a match rule connected. The side effect is, these "empty"
    filters are not even dumped when present; but that shouldn't be a problem
    as long as we are restoring the original behaviour, where semi-configured
    filters were not even added in the error path of u32_change().

    Fixes: 6676d5e416ee ("net: sched: set dedicated tcf_walker flag when tp is empty")
    Signed-off-by: Davide Caratti
    Signed-off-by: David S. Miller

    Davide Caratti
     
  • Daniel Borkmann says:

    ====================
    pull-request: bpf 2019-12-19

    The following pull-request contains BPF updates for your *net* tree.

    We've added 10 non-merge commits during the last 8 day(s) which contain
    a total of 21 files changed, 269 insertions(+), 108 deletions(-).

    The main changes are:

    1) Fix lack of synchronization between xsk wakeup and destroying resources
    used by xsk wakeup, from Maxim Mikityanskiy.

    2) Fix pruning with tail call patching, untrack programs in case of verifier
    error and fix a cgroup local storage tracking bug, from Daniel Borkmann.

    3) Fix clearing skb->tstamp in bpf_redirect() when going from ingress to
    egress, which otherwise causes issues e.g. on fq qdisc, from Lorenz Bauer.

    4) Fix compile warning of unused proc_dointvec_minmax_bpf_restricted() when
    only cBPF is present, from Alexander Lobakin.
    ====================

    Signed-off-by: David S. Miller

    David S. Miller
     
  • proc_dointvec_minmax_bpf_restricted() was first introduced
    in commit 2e4a30983b0f ("bpf: restrict access to core bpf sysctls")
    under CONFIG_HAVE_EBPF_JIT. This ifdef was then removed in
    ede95a63b5e8 ("bpf: add bpf_jit_limit knob to restrict unpriv
    allocations"), because a new sysctl, bpf_jit_limit, made use of it.
    Finally, this parameter became long instead of integer with
    fdadd04931c2 ("bpf: fix bpf_jit_limit knob for PAGE_SIZE >= 64K")
    and thus a new proc_dolongvec_minmax_bpf_restricted() was
    added.

    With this last change, proc_dointvec_minmax_bpf_restricted() is
    once again used only under CONFIG_HAVE_EBPF_JIT, but the
    corresponding ifdef has not been brought back.

    So, in configurations like CONFIG_BPF_JIT=y && CONFIG_HAVE_EBPF_JIT=n
    since v4.20 we have:

    CC net/core/sysctl_net_core.o
    net/core/sysctl_net_core.c:292:1: warning: ‘proc_dointvec_minmax_bpf_restricted’ defined but not used [-Wunused-function]
    292 | proc_dointvec_minmax_bpf_restricted(struct ctl_table *table, int write,
    | ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

    Suppress this by guarding it with CONFIG_HAVE_EBPF_JIT again.

    Fixes: fdadd04931c2 ("bpf: fix bpf_jit_limit knob for PAGE_SIZE >= 64K")
    Signed-off-by: Alexander Lobakin
    Signed-off-by: Daniel Borkmann
    Link: https://lore.kernel.org/bpf/20191218091821.7080-1-alobakin@dlink.ru

    Alexander Lobakin
     

19 Dec, 2019

2 commits

  • The XSK wakeup callback in drivers makes some sanity checks before
    triggering NAPI. However, some configuration changes may occur during
    this function that affect the result of those checks. For example, the
    interface can go down, and all the resources will be destroyed after the
    checks in the wakeup function, but before it attempts to use these
    resources. Wrap this callback in rcu_read_lock to allow the driver to
    call synchronize_rcu before actually destroying the resources.

    xsk_wakeup is a new function that encapsulates calling ndo_xsk_wakeup
    wrapped into the RCU lock. After this commit, xsk_poll starts using
    xsk_wakeup and checks xs->zc instead of ndo_xsk_wakeup != NULL to decide
    whether ndo_xsk_wakeup should be called. It also fixes a bug introduced
    with the need_wakeup feature: a non-zero-copy socket may be used with a
    driver supporting zero-copy, and in this case ndo_xsk_wakeup should not
    be called, so the xs->zc check is the correct one.

    Fixes: 77cd0d7b3f25 ("xsk: add support for need_wakeup flag in AF_XDP rings")
    Signed-off-by: Maxim Mikityanskiy
    Signed-off-by: Björn Töpel
    Signed-off-by: Daniel Borkmann
    Link: https://lore.kernel.org/bpf/20191217162023.16011-2-maximmi@mellanox.com

    Maxim Mikityanskiy
     
  • The kernel may sleep while holding a spinlock.
    The function call path (from bottom to top) in Linux 4.19 is:

    net/nfc/nci/uart.c, 349:
    nci_skb_alloc in nci_uart_default_recv_buf
    net/nfc/nci/uart.c, 255:
    (FUNC_PTR)nci_uart_default_recv_buf in nci_uart_tty_receive
    net/nfc/nci/uart.c, 254:
    spin_lock in nci_uart_tty_receive

    nci_skb_alloc(GFP_KERNEL) can sleep at runtime.
    (FUNC_PTR) means a function pointer is called.

    To fix this bug, GFP_KERNEL is replaced with GFP_ATOMIC for
    nci_skb_alloc().

    This bug was found by STCheck, a static analysis tool written by myself.

    Signed-off-by: Jia-Ju Bai
    Signed-off-by: David S. Miller

    Jia-Ju Bai
     

18 Dec, 2019

4 commits

  • dev_hold() must always be called in rx_queue_add_kobject().
    Otherwise the usage count drops below 0 if kobject_init_and_add()
    fails.

    Fixes: b8eb718348b8 ("net-sysfs: Fix reference count leak in rx|netdev_queue_add_kobject")
    Reported-by: syzbot
    Cc: Tetsuo Handa
    Cc: David Miller
    Cc: Lukas Bulwahn
    Signed-off-by: Jouni Hogander
    Signed-off-by: David S. Miller

    Jouni Hogander
     
  • dsa_link_touch() is not exported and is not used outside the file
    it is defined in, so make it static to avoid the following warning:

    net/dsa/dsa2.c:127:17: warning: symbol 'dsa_link_touch' was not declared. Should it be static?

    Signed-off-by: Ben Dooks (Codethink)
    Signed-off-by: David S. Miller

    Ben Dooks (Codethink)
     
  • sk->sk_pacing_shift can be read and written without lock
    synchronization. This patch adds annotations to
    document this fact and avoid future syzbot complaints.

    This might also avoid unexpected false sharing
    in sk_pacing_shift_update(), as the compiler
    could remove the conditional check and always
    write over sk->sk_pacing_shift :

        if (sk->sk_pacing_shift != val)
                sk->sk_pacing_shift = val;

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
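
The annotation idea in the sk_pacing_shift commit above can be sketched in userspace as follows. The macros are simplified stand-ins for the kernel's READ_ONCE()/WRITE_ONCE() and rely on the GCC/Clang __typeof__ extension; the struct and function names are hypothetical. The point is that each access goes through a volatile-qualified lvalue, so the compiler has to perform the load and the conditional store as written instead of turning them into an unconditional write.

```c
/* Simplified userspace stand-ins for READ_ONCE()/WRITE_ONCE();
 * assumption: GCC or Clang, for the __typeof__ extension. */
#define READ_ONCE_SKETCH(x)      (*(const volatile __typeof__(x) *)&(x))
#define WRITE_ONCE_SKETCH(x, v)  (*(volatile __typeof__(x) *)&(x) = (v))

/* Hypothetical socket with a single lock-free field. */
struct sock_sketch {
    unsigned char pacing_shift;
};

/* Only write when the value actually changes, so that a cache line shared
 * with other CPUs is not dirtied needlessly. */
static void pacing_shift_update(struct sock_sketch *sk, unsigned char val)
{
    if (READ_ONCE_SKETCH(sk->pacing_shift) == val)
        return;
    WRITE_ONCE_SKETCH(sk->pacing_shift, val);
}
```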
     
  • syzbot reported a memory leak when an allocation fails within
    genradix_prealloc() for output streams. That's because
    genradix_prealloc() leaves already-initialized members in place when
    the failure happens, and the SCTP stack aborts the current
    initialization without cleaning up such members.

    The fix here is to always call genradix_free() when genradix_prealloc()
    fails, for output and also input streams, as it suffers from the same
    issue.

    Reported-by: syzbot+772d9e36c490b18d51d1@syzkaller.appspotmail.com
    Fixes: 2075e50caf5e ("sctp: convert to genradix")
    Signed-off-by: Marcelo Ricardo Leitner
    Tested-by: Xin Long
    Signed-off-by: David S. Miller

    Marcelo Ricardo Leitner
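
The cleanup pattern from the SCTP commit above, sketched in plain C. Everything here is hypothetical (table_sketch, table_prealloc, the fail_at parameter used to simulate an allocation failure); it only mirrors the described behaviour: preallocation leaves earlier slots allocated on failure, so the caller must free them.

```c
#include <errno.h>
#include <stddef.h>
#include <stdlib.h>

#define SLOTS 8

/* Hypothetical stand-in for a preallocated radix structure. */
struct table_sketch {
    void *slot[SLOTS];
};

/* Allocate slots [0, n). fail_at simulates an allocation failure at that
 * index. On failure, slots allocated so far are deliberately left in place. */
static int table_prealloc(struct table_sketch *t, size_t n, size_t fail_at)
{
    for (size_t i = 0; i < n && i < SLOTS; i++) {
        if (t->slot[i])
            continue;
        if (i == fail_at)
            return -ENOMEM;
        t->slot[i] = malloc(16);
        if (!t->slot[i])
            return -ENOMEM;
    }
    return 0;
}

/* Release every slot and reset the table to its empty state. */
static void table_free(struct table_sketch *t)
{
    for (size_t i = 0; i < SLOTS; i++) {
        free(t->slot[i]);
        t->slot[i] = NULL;
    }
}

/* The fix pattern: always release partial state when preallocation fails. */
static int stream_init(struct table_sketch *t, size_t n, size_t fail_at)
{
    int ret = table_prealloc(t, n, fail_at);

    if (ret)
        table_free(t);
    return ret;
}
```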
     

17 Dec, 2019

3 commits

  • …rnel/git/jberg/mac80211

    Johannes Berg says:

    ====================
    A handful of fixes:
    * disable AQL on most drivers, addressing the iwlwifi issues
    * fix double-free on network namespace changes
    * fix TID field in frames injected through monitor interfaces
    * fix ieee80211_calc_rx_airtime()
    * fix NULL pointer dereference in rfkill (and remove BUG_ON)
    ====================

    Signed-off-by: David S. Miller <davem@davemloft.net>

    David S. Miller
     
  • virtio_transport_get_ops() and virtio_transport_send_pkt_info()
    can only be used on connecting/connected sockets, since a socket
    assigned to a transport is required.

    This patch adds a WARN_ON() to virtio_transport_get_ops() to check
    this requirement, and adds a comment and an error return to
    virtio_transport_send_pkt_info().

    Signed-off-by: Stefano Garzarella
    Signed-off-by: David S. Miller

    Stefano Garzarella
     
  • With multi-transport support, listener sockets are not bound to any
    transport. So, calling virtio_transport_reset(), when an error
    occurs, on a listener socket produces the following null-pointer
    dereference:

    BUG: kernel NULL pointer dereference, address: 00000000000000e8
    #PF: supervisor read access in kernel mode
    #PF: error_code(0x0000) - not-present page
    PGD 0 P4D 0
    Oops: 0000 [#1] SMP PTI
    CPU: 0 PID: 20 Comm: kworker/0:1 Not tainted 5.5.0-rc1-ste-00003-gb4be21f316ac-dirty #56
    Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS ?-20190727_073836-buildvm-ppc64le-16.ppc.fedoraproject.org-3.fc31 04/01/2014
    Workqueue: virtio_vsock virtio_transport_rx_work [vmw_vsock_virtio_transport]
    RIP: 0010:virtio_transport_send_pkt_info+0x20/0x130 [vmw_vsock_virtio_transport_common]
    Code: 1f 84 00 00 00 00 00 0f 1f 00 55 48 89 e5 41 57 41 56 41 55 49 89 f5 41 54 49 89 fc 53 48 83 ec 10 44 8b 76 20 e8 c0 ba fe ff 8b 80 e8 00 00 00 e8 64 e3 7d c1 45 8b 45 00 41 8b 8c 24 d4 02
    RSP: 0018:ffffc900000b7d08 EFLAGS: 00010282
    RAX: 0000000000000000 RBX: ffff88807bf12728 RCX: 0000000000000000
    RDX: ffff88807bf12700 RSI: ffffc900000b7d50 RDI: ffff888035c84000
    RBP: ffffc900000b7d40 R08: ffff888035c84000 R09: ffffc900000b7d08
    R10: ffff8880781de800 R11: 0000000000000018 R12: ffff888035c84000
    R13: ffffc900000b7d50 R14: 0000000000000000 R15: ffff88807bf12724
    FS: 0000000000000000(0000) GS:ffff88807dc00000(0000) knlGS:0000000000000000
    CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: 00000000000000e8 CR3: 00000000790f4004 CR4: 0000000000160ef0
    DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
    DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
    Call Trace:
    virtio_transport_reset+0x59/0x70 [vmw_vsock_virtio_transport_common]
    virtio_transport_recv_pkt+0x5bb/0xe50 [vmw_vsock_virtio_transport_common]
    ? detach_buf_split+0xf1/0x130
    virtio_transport_rx_work+0xba/0x130 [vmw_vsock_virtio_transport]
    process_one_work+0x1c0/0x300
    worker_thread+0x45/0x3c0
    kthread+0xfc/0x130
    ? current_work+0x40/0x40
    ? kthread_park+0x90/0x90
    ret_from_fork+0x35/0x40
    Modules linked in: sunrpc kvm_intel kvm vmw_vsock_virtio_transport vmw_vsock_virtio_transport_common irqbypass vsock virtio_rng rng_core
    CR2: 00000000000000e8
    ---[ end trace e75400e2ea2fa824 ]---

    This happens because virtio_transport_reset() calls
    virtio_transport_send_pkt_info() that can be used only on
    connecting/connected sockets.

    This patch fixes the issue, using virtio_transport_reset_no_sock()
    instead of virtio_transport_reset() when we are handling a listener
    socket.

    Fixes: c0cfa2d8a788 ("vsock: add multi-transports support")
    Signed-off-by: Stefano Garzarella
    Signed-off-by: David S. Miller

    Stefano Garzarella
     

16 Dec, 2019

2 commits

  • In rfkill_register, the struct rfkill pointer is first dereferenced
    and then checked for NULL. This patch removes the BUG_ON and returns
    an error to the caller in case rfkill is NULL.

    Signed-off-by: Aditya Pakki
    Link: https://lore.kernel.org/r/20191215153409.21696-1-pakki001@umn.edu
    Signed-off-by: Johannes Berg

    Aditya Pakki
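
The check-before-dereference ordering from the rfkill commit above, as a minimal sketch. The struct, the function name and the specific error codes are assumptions for illustration; the commit message only states that an error is returned when the pointer is NULL.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for the registered object. */
struct rfkill_sketch {
    int registered;
};

static int register_sketch(struct rfkill_sketch *rfkill)
{
    /* Validate the pointer before its first dereference and report an
     * error to the caller instead of crashing (error code is an assumption). */
    if (!rfkill)
        return -EINVAL;

    if (rfkill->registered)
        return -EALREADY;

    rfkill->registered = 1;
    return 0;
}
```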
     
  • FASTOPEN setsockopt() or sendmsg() may switch the SMC socket to fallback
    mode. Once fallback mode is active, the native TCP socket functions are
    called. Nevertheless there is a small race window, when FASTOPEN
    setsockopt/sendmsg runs in parallel to a connect(), and switches the
    socket into fallback mode before connect() takes the sock lock.
    Make sure the SMC-specific connect setup is omitted in this case.

    This way a syzbot-reported refcount problem is fixed, triggered by
    different threads running non-blocking connect() and FASTOPEN_KEY
    setsockopt.

    Reported-by: syzbot+96d3f9ff6a86d37e44c8@syzkaller.appspotmail.com
    Fixes: 6d6dd528d5af ("net/smc: fix refcount non-blocking connect() -part 2")
    Signed-off-by: Ursula Braun
    Signed-off-by: Karsten Graul
    Signed-off-by: Jakub Kicinski

    Ursula Braun
     

14 Dec, 2019

7 commits

  • At the time commit ce5ec440994b ("tcp: ensure epoll edge trigger
    wakeup when write queue is empty") was added to the kernel,
    we still had a single write queue, combining rtx and write queues.

    Once we moved the rtx queue into a separate rb-tree, testing
    if sk_write_queue is empty has been suboptimal.

    Indeed, if we have packets in the rtx queue, we probably want
    to delay the EPOLLOUT generation until incoming packets
    free them, making room, but more importantly avoiding
    flooding the application with EPOLLOUT events.

    The solution is to use the tcp_rtx_and_write_queues_empty() helper.

    Fixes: 75c119afe14f ("tcp: implement rb-tree based retransmit queue")
    Signed-off-by: Eric Dumazet
    Cc: Jason Baron
    Cc: Neal Cardwell
    Acked-by: Soheil Hassas Yeganeh
    Signed-off-by: Jakub Kicinski

    Eric Dumazet
     
  • Due to how tcp_sendmsg() is implemented, we can have an empty
    skb at the tail of the write queue.

    Most [1] tcp_write_queue_empty() callers want to know if there is
    anything to send (payload and/or FIN)

    Instead of checking if the sk_write_queue is empty, we need
    to test if tp->write_seq == tp->snd_nxt

    [1] tcp_send_fin() was the only caller that expected to
    see if an skb was in the write queue, I have changed the code
    to reuse the tcp_write_queue_tail() result.

    Signed-off-by: Eric Dumazet
    Cc: Neal Cardwell
    Acked-by: Soheil Hassas Yeganeh
    Signed-off-by: Jakub Kicinski

    Eric Dumazet
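
The difference between "the queue list is empty" and "there is nothing to send" in the commit above can be sketched like this. The struct is a hypothetical reduction of the TCP socket state to the two sequence numbers named in the message plus a buffer count.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical reduction of the relevant TCP socket state. */
struct tcp_sketch {
    uint32_t write_seq;        /* sequence just past the last byte queued by the app */
    uint32_t snd_nxt;          /* next sequence number to be sent */
    unsigned int queued_skbs;  /* buffers on the write queue, possibly an empty one */
};

/* Old test: is the list of buffers empty? An empty skb at the tail of the
 * queue makes this false although there is nothing to transmit. */
static bool write_queue_list_empty(const struct tcp_sketch *tp)
{
    return tp->queued_skbs == 0;
}

/* New test: nothing is pending when everything queued has been sent. */
static bool nothing_to_send(const struct tcp_sketch *tp)
{
    return tp->write_seq == tp->snd_nxt;
}
```

For a socket holding one empty skb with write_seq equal to snd_nxt, the first helper reports pending data while the second correctly reports none.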
     
  • Backport of commit fdfc5c8594c2 ("tcp: remove empty skb from
    write queue in error cases") in linux-4.14 stable triggered
    various bugs. One of them has been fixed in commit ba2ddb43f270
    ("tcp: Don't dequeue SYN/FIN-segments from write-queue"), but
    we still have crashes on some occasions.

    The root cause is that when tcp_sendmsg() has allocated a fresh
    skb and could not append a fragment before being blocked
    in sk_stream_wait_memory(), tcp_write_xmit() might be called
    and decide to send this fresh and empty skb.

    Sending an empty packet is not only silly, it might have caused
    many issues we had in the past with tp->packets_out being
    out of sync.

    Fixes: c65f7f00c587 ("[TCP]: Simplify SKB data portion allocation with NETIF_F_SG.")
    Signed-off-by: Eric Dumazet
    Cc: Christoph Paasch
    Acked-by: Neal Cardwell
    Cc: Jason Baron
    Acked-by: Soheil Hassas Yeganeh
    Signed-off-by: Jakub Kicinski

    Eric Dumazet
     
  • Michal Kubecek and Firo Yang did a very nice analysis of crashes
    happening in __inet_lookup_established().

    Since a TCP socket can go from TCP_ESTABLISHED to TCP_LISTEN
    (via a close()/socket()/listen() cycle) without an RCU grace period,
    I should not have changed listeners linkage in their hash table.

    They must use the nulls protocol (Documentation/RCU/rculist_nulls.txt),
    so that a lookup can detect that a socket in a hash list was moved to
    another one.

    Since we added code in commit d296ba60d8e2 ("soreuseport: Resolve
    merge conflict for v4/v6 ordering fix"), we have to add
    hlist_nulls_add_tail_rcu() helper.

    Fixes: 3b24d854cb35 ("tcp/dccp: do not touch listener sk_refcnt under synflood")
    Signed-off-by: Eric Dumazet
    Reported-by: Michal Kubecek
    Reported-by: Firo Yang
    Reviewed-by: Michal Kubecek
    Link: https://lore.kernel.org/netdev/20191120083919.GH27852@unicorn.suse.cz/
    Signed-off-by: Jakub Kicinski

    Eric Dumazet
     
  • In commit 4b1373de73a3 ("net: ipv6: addr: perform strict checks also for
    doit handlers") we added strict checks for inet6_rtm_getaddr(). But we did
    the invalid header values check before checking if NETLINK_F_STRICT_CHK
    is set. This may break backwards compatibility if users already set
    ifm->ifa_prefixlen, ifm->ifa_flags, or ifm->ifa_scope in their netlink code.

    I didn't move the nlmsg_len check because I think it is a valid check.

    Reported-by: Jianlin Shi
    Fixes: 4b1373de73a3 ("net: ipv6: addr: perform strict checks also for doit handlers")
    Signed-off-by: Hangbin Liu
    Reviewed-by: David Ahern
    Signed-off-by: Jakub Kicinski

    Hangbin Liu
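
The ordering problem in the commit above can be sketched as follows: header values are only rejected once it is known that the caller asked for strict checking. The struct and function are hypothetical and ignore the rest of netlink parsing.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical reduction of the address message header. */
struct addr_hdr_sketch {
    uint8_t prefixlen;
    uint8_t flags;
    uint8_t scope;
};

static int validate_getaddr_sketch(const struct addr_hdr_sketch *h, bool strict)
{
    /* Legacy callers did not opt in to strict checking: whatever they left
     * in the header fields is ignored, as it was before the strict checks. */
    if (!strict)
        return 0;

    /* Strict callers must leave these fields zeroed. */
    if (h->prefixlen || h->flags || h->scope)
        return -EINVAL;

    return 0;
}
```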
     
  • Redirecting a packet from ingress to egress by using bpf_redirect
    breaks if the egress interface has an fq qdisc installed. This is the same
    problem as fixed in commit 8203e2d844d3 ("net: clear skb->tstamp in forwarding paths").

    Clear skb->tstamp when redirecting into the egress path.

    Fixes: 80b14dee2bea ("net: Add a new socket option for a future transmit time.")
    Fixes: fb420d5d91c1 ("tcp/fq: move back to CLOCK_MONOTONIC")
    Signed-off-by: Lorenz Bauer
    Signed-off-by: Alexei Starovoitov
    Reviewed-by: Eric Dumazet
    Link: https://lore.kernel.org/bpf/20191213180817.2510-1-lmb@cloudflare.com

    Lorenz Bauer
     
  • Pull io_uring fixes from Jens Axboe:

    - A tweak to IOSQE_IO_LINK (also marked for stable) to allow links that
    don't sever if the result is < 0.

    This is mostly for linked timeouts, where if we ask for a pure
    timeout we always get -ETIME. This makes links useless for that case,
    hence allow a case where it works.

    - Five minor optimizations to fix and improve cases that regressed
    since v5.4.

    - An SQTHREAD locking fix.

    - A sendmsg/recvmsg iov assignment fix.

    - Net fix where read_iter/write_iter don't honor IOCB_NOWAIT, and
    subsequently ensuring that works for io_uring.

    - Fix a case where for an invalid opcode we might return -EBADF instead
    of -EINVAL, if the ->fd of that sqe was set to an invalid fd value.

    * tag 'io_uring-5.5-20191212' of git://git.kernel.dk/linux-block:
    io_uring: ensure we return -EINVAL on unknown opcode
    io_uring: add sockets to list of files that support non-blocking issue
    net: make socket read/write_iter() honor IOCB_NOWAIT
    io_uring: only hash regular files for async work execution
    io_uring: run next sqe inline if possible
    io_uring: don't dynamically allocate poll data
    io_uring: deferred send/recvmsg should assign iov
    io_uring: sqthread should grab ctx->uring_lock for submissions
    io-wq: briefly spin for new work after finishing work
    io-wq: remove worker->wait waitqueue
    io_uring: allow unbreakable links

    Linus Torvalds
     

13 Dec, 2019

4 commits

  • Instead of just having an airtime flag in debugfs, turn AQL into a proper
    NL80211_EXT_FEATURE, so drivers can turn it on when they are ready, and so
    we also expose the presence of the feature to userspace.

    This also has the effect of flipping the default, so drivers have to opt in
    to using AQL instead of getting it by default with TXQs. To keep
    functionality the same as pre-patch, we set this feature for ath10k (which
    is where it is needed the most).

    While we're at it, split out the debugfs interface so AQL gets its own
    per-station debugfs file instead of using the 'airtime' file.

    [Johannes:]
    This effectively disables AQL for iwlwifi, where it fixes a number of
    issues:
    * TSO in iwlwifi is causing underflows and associated warnings in AQL
    * HE (802.11ax) rates aren't reported properly so at HE rates, AQL could
    never have a valid estimate (it'd use 6 Mbps instead of up to 2400!)

    Signed-off-by: Toke Høiland-Jørgensen
    Link: https://lore.kernel.org/r/20191212111437.224294-1-toke@redhat.com
    Fixes: 3ace10f5b5ad ("mac80211: Implement Airtime-based Queue Limit (AQL)")
    Signed-off-by: Johannes Berg

    Toke Høiland-Jørgensen
     
  • This code was copied from mt76 and inherited an off by one bug from
    there. The > should be >= so that we don't read one element beyond
    the end of the array.

    Fixes: db3e1c40cf2f ("mac80211: Import airtime calculation code from mt76")
    Reported-by: Toke Høiland-Jørgensen
    Signed-off-by: Dan Carpenter
    Acked-by: Toke Høiland-Jørgensen
    Link: https://lore.kernel.org/r/20191126120910.ftr4t7me3by32aiz@kili.mountain
    Signed-off-by: Johannes Berg

    Dan Carpenter
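
The off-by-one described in the commit above, as a minimal sketch. The table contents and names are hypothetical; what matters is that an index equal to the array size is already out of bounds, so the guard has to use >= rather than >.

```c
#include <stddef.h>

#define ARRAY_SIZE_SKETCH(a) (sizeof(a) / sizeof((a)[0]))

/* Hypothetical lookup table; the values are placeholders. */
static const unsigned int rate_table_sketch[] = { 6, 12, 24, 48 };

static unsigned int lookup_rate_sketch(size_t idx)
{
    /* '>=' rejects idx == array size; with '>' that index would read one
     * element beyond the end of the array. */
    if (idx >= ARRAY_SIZE_SKETCH(rate_table_sketch))
        return 0;

    return rate_table_sketch[idx];
}
```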
     
  • If wdev->wext.keys was initialized it didn't get reset to NULL on
    unregister (and it doesn't get set in cfg80211_init_wdev either), but
    wdev is reused if unregister was triggered through
    cfg80211_switch_netns.

    The next unregister (for whatever reason) will try to free
    wdev->wext.keys again.

    Signed-off-by: Stefan Bühler
    Link: https://lore.kernel.org/r/20191126100543.782023-1-stefan.buehler@tik.uni-stuttgart.de
    Signed-off-by: Johannes Berg

    Stefan Bühler
     
  • Fix overwriting of the qos_ctrl.tid field for encrypted frames injected on
    a monitor interface. While qos_ctrl.tid is not encrypted, it's used as an
    input into the encryption algorithm so it's protected, and thus cannot be
    modified after encryption. For injected frames, the encryption may already
    have been done in userspace, so we cannot change any fields.

    Before passing the frame to the driver, the qos_ctrl.tid field is updated
    from skb->priority. Prior to dbd50a851c50 skb->priority was updated in
    ieee80211_select_queue_80211(), but this function is no longer always
    called.

    Update skb->priority in ieee80211_monitor_start_xmit() so that the value
    is stored, and when later code 'modifies' the TID it really sets it to
    the same value as before, preserving the encryption.

    Fixes: dbd50a851c50 ("mac80211: only allocate one queue when using iTXQs")
    Signed-off-by: Fredrik Olofsson
    Link: https://lore.kernel.org/r/20191119133451.14711-1-fredrik.olofsson@anyfinetworks.com
    [rewrite commit message based on our discussion]
    Signed-off-by: Johannes Berg

    Fredrik Olofsson
     

11 Dec, 2019

5 commits

  • In the function 'tipc_disc_rcv()', the 'msg_peer_net_hash()' is called
    to read a header data field after the message skb has been freed,
    which might result in a garbage value.

    This commit fixes it by defining a new local variable to store the data
    first, just like the other header fields' handling.

    Fixes: f73b12812a3d ("tipc: improve throughput between nodes in netns")
    Acked-by: Jon Maloy
    Signed-off-by: Tuong Lien
    Signed-off-by: David S. Miller

    Tuong Lien
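
The "copy to a local before freeing" fix from the tipc_disc_rcv() commit above can be sketched in userspace C. The message layout and function name are hypothetical; malloc()/free() stand in for the skb lifetime.

```c
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical message header with two fields that are needed later. */
struct msg_sketch {
    uint32_t peer_node;
    uint32_t peer_net_hash;
};

/* Takes ownership of msg and frees it. */
static uint32_t disc_rcv_sketch(struct msg_sketch *msg)
{
    /* Copy every header field that is needed later into locals first. */
    uint32_t node = msg->peer_node;
    uint32_t hash = msg->peer_net_hash;

    free(msg);  /* the buffer must not be touched from here on */

    /* Only the saved copies are used after the free. */
    return node ^ hash;
}
```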
     
  • When a user message is sent, TIPC checks whether the socket has hit
    congestion at the link layer. If so, it sleeps until the congestion
    disappears. This leaves a gap for other users to take over the socket
    (e.g. multiple threads), since the socket is released as well. Also, in
    the connectionless case (e.g. SOCK_RDM), the user is free to send
    messages to various destinations (e.g. via 'sendto()'), so the socket's
    preformatted header has to be updated correspondingly prior to the
    actual payload message building.

    Unfortunately, the latter action is done before the former, which
    creates a race condition: the destination of a message can be
    modified incorrectly in the middle, leading to a wrong destination
    when that message is built. Consequently, when the message is sent to
    the link layer, it gets stuck there forever because the peer node will
    simply reject it. After a number of retransmission attempts, the link
    is eventually taken down and the retransmission failure is reported.

    This commit fixes the problem by rearranging the order of actions to
    prevent the race condition from occurring, so the message building is
    'atomic' and its header will not be modified by anyone.

    Fixes: 365ad353c256 ("tipc: reduce risk of user starvation during link congestion")
    Acked-by: Jon Maloy
    Signed-off-by: Tuong Lien
    Signed-off-by: David S. Miller

    Tuong Lien
     
  • In commit c55c8edafa91 ("tipc: smooth change between replicast and
    broadcast"), we allow instant switching between replicast and broadcast
    by sending a dummy 'SYN' packet on the last used link to synchronize
    packets on the links. The 'SYN' message is also subject to link
    congestion, so if that happens, a 'SOCK_WAKEUP' will be scheduled to be
    sent back to the socket.
    However, in that commit, we simply use the same socket 'cong_link_cnt'
    counter for both the 'SYN' & normal payload message sending. Therefore,
    if both the replicast & broadcast links are congested, the counter will
    not be updated correctly but overwritten by the latter congestion.
    Later on, when the 'SOCK_WAKEUP' messages are processed, the counter is
    reduced one by one and eventually overflows. Consequently, further
    activities on the socket will wait forever for the false congestion
    signal to disappear.

    Because sending the 'SYN' message is vital for the mechanism, it should
    be done anyway. This commit fixes the issue by marking the message with
    an error code, e.g. 'TIPC_ERR_NO_PORT', so that its sending does not face
    link congestion and there is no need to touch the socket 'cong_link_cnt'
    either. In addition, in the event of any error (e.g. -ENOBUFS), we will
    purge the entire payload message queue and return immediately.

    Fixes: c55c8edafa91 ("tipc: smooth change between replicast and broadcast")
    Acked-by: Jon Maloy
    Signed-off-by: Tuong Lien
    Signed-off-by: David S. Miller

    Tuong Lien
     
  • The current rbtree for service ranges in the name table is built based
    on the 'lower' & 'upper' range values resulting in a flaw in the rbtree
    searching. Some issues have been observed in case of range overlapping:

    Case #1: unable to withdraw a name entry:
    After some name services are bound, all of them are withdrawn by user
    but one remains in the name table forever. This corrupts the table and
    that service becomes dummy i.e. no real port.
    E.g.

                    /
               {22, 22}
                  /
                 /
       --->  {10, 50}
               /  \
              /    \
        {10, 30}  {20, 60}

    The node {10, 30} cannot be removed since the rbtree searching stops at
    the node's ancestor i.e. {10, 50}, so a search starting from it will
    never reach the node being looked for.

    Case #2: failed to send data in some cases:
    E.g. Two service ranges: {20, 60}, {10, 50} are bound. The rbtree for
    this service will be one of the two cases below depending on the order
    of the bindings:

    {20, 60} {10, 50}
    Signed-off-by: Tuong Lien
    Signed-off-by: David S. Miller

    Tuong Lien
     
  • The socket read/write helpers only look at the file O_NONBLOCK, not
    the iocb IOCB_NOWAIT flag. This breaks users like preadv2/pwritev2
    and io_uring that rely on not having the file itself marked nonblocking,
    but rather the iocb itself.

    Cc: netdev@vger.kernel.org
    Acked-by: David Miller
    Signed-off-by: Jens Axboe

    Jens Axboe
     

10 Dec, 2019

6 commits

  • There is a soft lockup when using TPACKET_V3:
    ...
    NMI watchdog: BUG: soft lockup - CPU#2 stuck for 60010ms!
    (__irq_svc) from [] (_raw_spin_unlock_irqrestore+0x44/0x54)
    (_raw_spin_unlock_irqrestore) from [] (mod_timer+0x210/0x25c)
    (mod_timer) from []
    (prb_retire_rx_blk_timer_expired+0x68/0x11c)
    (prb_retire_rx_blk_timer_expired) from []
    (call_timer_fn+0x90/0x17c)
    (call_timer_fn) from [] (run_timer_softirq+0x2d4/0x2fc)
    (run_timer_softirq) from [] (__do_softirq+0x218/0x318)
    (__do_softirq) from [] (irq_exit+0x88/0xac)
    (irq_exit) from [] (msa_irq_exit+0x11c/0x1d4)
    (msa_irq_exit) from [] (handle_IPI+0x650/0x7f4)
    (handle_IPI) from [] (gic_handle_irq+0x108/0x118)
    (gic_handle_irq) from [] (__irq_usr+0x44/0x5c)
    ...

    If __ethtool_get_link_ksettings() fails in
    prb_calc_retire_blk_tmo(), msec and tmo will be zero, so tov_in_jiffies
    is zero and the timer for retire_blk_timer is re-armed as
    mod_timer(&pkc->retire_blk_timer, jiffies + 0),
    which drives softirq CPU usage to 100%.
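
    One way to keep such a timeout from collapsing to zero is a guarded
    fallback. This is a sketch only, not necessarily what the patch does:
    retire_tmo_ms() and DEMO_DEFAULT_RETIRE_TMO_MS are hypothetical names,
    and the default value is chosen for illustration.

```c
#include <assert.h>

/* Hypothetical fallback in milliseconds; chosen for illustration only. */
#define DEMO_DEFAULT_RETIRE_TMO_MS 8u

/*
 * lookup_err:  nonzero when the link-settings query failed.
 * computed_ms: timeout derived from block size and link speed.
 * Returns a timeout that is never zero, so the timer cannot be re-armed
 * for "now" on every expiry.
 */
unsigned int retire_tmo_ms(int lookup_err, unsigned int computed_ms)
{
    if (lookup_err || computed_ms == 0)
        return DEMO_DEFAULT_RETIRE_TMO_MS;
    return computed_ms;
}
```

    A failed lookup then yields the fallback rather than a zero delay.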

    Fixes: f6fb8f100b80 ("af-packet: TPACKET_V3 flexible buffer implementation.")
    Tested-by: Xiao Jiangfeng
    Signed-off-by: Mao Wenan
    Signed-off-by: David S. Miller

    Mao Wenan
     
  • Pablo Neira Ayuso says:

    ====================
    Netfilter fixes for net

    The following patchset contains Netfilter fixes for net:

    1) Wait for rcu grace period after releasing netns in ctnetlink,
    from Florian Westphal.

    2) Incorrect command type in flowtable offload ndo invocation,
    from wenxu.

    3) Incorrect callback type in flowtable offload flow tuple
    updates, also from wenxu.

    4) Fix compile warning on flowtable offload infrastructure due to
    possible reference to uninitialized variable, from Nathan Chancellor.

    5) Do not inline nf_ct_resolve_clash(), this is called from slow
    path / stress situations. From Florian Westphal.

    6) Missing IPv6 flow selector description in flowtable offload.

    7) Missing check for NETDEV_UNREGISTER in nf_tables offload
    infrastructure, from wenxu.

    8) Update NAT selftest to use randomized netns names, from
    Florian Westphal.

    9) Restore nfqueue bridge support, from Marco Oliverio.

    10) Compilation warning in SCTP_CHUNKMAP_*() on xt_sctp header.
    From Phil Sutter.

    11) Fix bogus lookup/get match for non-anonymous rbtree sets.

    12) Missing netlink validation for NFT_SET_ELEM_INTERVAL_END
    elements.

    13) Missing netlink validation for NFT_DATA_VALUE after
    nft_data_init().

    14) If rule specifies no actions, offload infrastructure returns
    EOPNOTSUPP.

    15) Module refcount leak in object updates.

    16) Missing sanitization for ARP traffic from br_netfilter, from
    Eric Dumazet.

    17) Compilation breakage on big-endian due to incorrect memcpy()
    size in the flowtable offload infrastructure.
    ====================

    Signed-off-by: David S. Miller

    David S. Miller
     
  • In function 'memcpy',
    inlined from 'flow_offload_mangle' at net/netfilter/nf_flow_table_offload.c:112:2,
    inlined from 'flow_offload_port_dnat' at net/netfilter/nf_flow_table_offload.c:373:2,
    inlined from 'nf_flow_rule_route_ipv4' at net/netfilter/nf_flow_table_offload.c:424:3:
    ./include/linux/string.h:376:4: error: call to '__read_overflow2' declared with attribute error: detected read beyond size of object passed as 2nd parameter
    376 | __read_overflow2();
    | ^~~~~~~~~~~~~~~~~~

    The original u8* parameters were meant to make this more adaptable, but
    the consensus is to keep this as it is in tc pedit.
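
    The class of error reported above can be illustrated outside the
    kernel. This sketch assumes the cause is a fixed 4-byte copy through a
    byte pointer that may point at a smaller object. demo_mangle and both
    functions are hypothetical names, not the netfilter code.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for a mangle action entry. */
struct demo_mangle {
    uint32_t mask;
    uint32_t val;
};

/*
 * Typed 32-bit pointers tell the compiler that each source object is at
 * least sizeof(uint32_t) bytes, so the fixed-size copies stay in bounds.
 */
void demo_mangle_set(struct demo_mangle *m,
                     const uint32_t *value, const uint32_t *mask)
{
    memcpy(&m->mask, mask, sizeof(uint32_t));
    memcpy(&m->val, value, sizeof(uint32_t));
}

/* A 16-bit port is widened into a 32-bit object before it is passed on. */
struct demo_mangle demo_mangle_port(uint16_t port)
{
    struct demo_mangle m;
    uint32_t value = port;
    uint32_t mask = 0xffffu; /* illustrative mask value */

    demo_mangle_set(&m, &value, &mask);
    return m;
}
```

    Passing the address of the 16-bit port directly would let the 4-byte
    copy read past the end of that object.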

    Fixes: c29f74e0df7a ("netfilter: nf_flow_table: hardware offload support")
    Reported-by: Laura Abbott
    Signed-off-by: Pablo Neira Ayuso

    Pablo Neira Ayuso
     
  • Replace all the occurrences of FIELD_SIZEOF() with sizeof_field() except
    at places where these are defined. Later patches will remove the unused
    definition of FIELD_SIZEOF().

    This patch is generated using the following script:

    EXCLUDE_FILES="include/linux/stddef.h|include/linux/kernel.h"

    git grep -l -e "\bFIELD_SIZEOF\b" | while read file;
    do
        if [[ "$file" =~ $EXCLUDE_FILES ]]; then
            continue
        fi
        sed -i -e 's/\bFIELD_SIZEOF\b/sizeof_field/g' $file;
    done
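
    For readers unfamiliar with the macro, the following userspace sketch
    shows what sizeof_field() computes. The macro body follows the usual
    null-pointer idiom and is believed to match the kernel definition.
    demo_hdr and demo_mac_len() are hypothetical and only exercise it.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Size of a struct member without needing an instance of the struct. */
#define sizeof_field(TYPE, MEMBER) sizeof(((TYPE *)0)->MEMBER)

/* Hypothetical struct used only to exercise the macro. */
struct demo_hdr {
    uint16_t port;
    uint8_t  mac[6];
    uint32_t addr;
};

/* Length of the mac member, computed at compile time. */
size_t demo_mac_len(void)
{
    return sizeof_field(struct demo_hdr, mac);
}
```

    The null pointer is never dereferenced, because sizeof does not
    evaluate its operand here.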

    Signed-off-by: Pankaj Bharadiya
    Link: https://lore.kernel.org/r/20190924105839.110713-3-pankaj.laxminarayan.bharadiya@intel.com
    Co-developed-by: Kees Cook
    Signed-off-by: Kees Cook
    Acked-by: David Miller # for net

    Pankaj Bharadiya
     
  • This is needed because, if the flag X25_ACCPT_APPRV_FLAG is not set on a
    socket (manual call confirmation) and the remote side clears the channel
    before the manual call confirmation is sent, that situation must be
    handled.

    Signed-off-by: Martin Schiller
    Signed-off-by: David S. Miller

    Martin Schiller
     
  • Syzbot found a crash:

    BUG: KMSAN: uninit-value in crc32_body lib/crc32.c:112 [inline]
    BUG: KMSAN: uninit-value in crc32_le_generic lib/crc32.c:179 [inline]
    BUG: KMSAN: uninit-value in __crc32c_le_base+0x4fa/0xd30 lib/crc32.c:202
    Call Trace:
    crc32_body lib/crc32.c:112 [inline]
    crc32_le_generic lib/crc32.c:179 [inline]
    __crc32c_le_base+0x4fa/0xd30 lib/crc32.c:202
    chksum_update+0xb2/0x110 crypto/crc32c_generic.c:90
    crypto_shash_update+0x4c5/0x530 crypto/shash.c:107
    crc32c+0x150/0x220 lib/libcrc32c.c:47
    sctp_csum_update+0x89/0xa0 include/net/sctp/checksum.h:36
    __skb_checksum+0x1297/0x12a0 net/core/skbuff.c:2640
    sctp_compute_cksum include/net/sctp/checksum.h:59 [inline]
    sctp_packet_pack net/sctp/output.c:528 [inline]
    sctp_packet_transmit+0x40fb/0x4250 net/sctp/output.c:597
    sctp_outq_flush_transports net/sctp/outqueue.c:1146 [inline]
    sctp_outq_flush+0x1823/0x5d80 net/sctp/outqueue.c:1194
    sctp_outq_uncork+0xd0/0xf0 net/sctp/outqueue.c:757
    sctp_cmd_interpreter net/sctp/sm_sideeffect.c:1781 [inline]
    sctp_side_effects net/sctp/sm_sideeffect.c:1184 [inline]
    sctp_do_sm+0x8fe1/0x9720 net/sctp/sm_sideeffect.c:1155
    sctp_primitive_REQUESTHEARTBEAT+0x175/0x1a0 net/sctp/primitive.c:185
    sctp_apply_peer_addr_params+0x212/0x1d40 net/sctp/socket.c:2433
    sctp_setsockopt_peer_addr_params net/sctp/socket.c:2686 [inline]
    sctp_setsockopt+0x189bb/0x19090 net/sctp/socket.c:4672

    The issue was caused by transport->ipaddr being set from an uninitialized
    addr param, which was passed by:

    sctp_transport_init net/sctp/transport.c:47 [inline]
    sctp_transport_new+0x248/0xa00 net/sctp/transport.c:100
    sctp_assoc_add_peer+0x5ba/0x2030 net/sctp/associola.c:611
    sctp_process_param net/sctp/sm_make_chunk.c:2524 [inline]

    where 'addr' is set by sctp_v4_from_addr_param(), and it doesn't initialize
    the padding of addr->v4.

    Later, when sctp_make_heartbeat() is called, hbinfo.daddr
    (= transport->ipaddr) becomes part of the skb, and the issue occurs.

    This patch fixes it by initializing the padding of addr->v4 in
    sctp_v4_from_addr_param(), as well as in other functions that do a
    similar thing. These functions should not trust the caller to initialize
    the memory, as Marcelo suggested.
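
    The same idea can be shown in userspace with struct sockaddr_in, whose
    sin_zero member is the padding in question. This is a sketch only:
    demo_v4_fill() and demo_padding_is_clean() are hypothetical helpers,
    not the SCTP functions.

```c
#include <arpa/inet.h>
#include <assert.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>

/*
 * Fill an IPv4 socket address from a port and address that are already in
 * network byte order. The callee clears sin_zero itself instead of trusting
 * the caller to have zeroed the structure.
 */
void demo_v4_fill(struct sockaddr_in *sin, in_port_t port_be, in_addr_t addr_be)
{
    sin->sin_family = AF_INET;
    sin->sin_port = port_be;
    sin->sin_addr.s_addr = addr_be;
    memset(sin->sin_zero, 0, sizeof(sin->sin_zero));
}

/* Returns 1 when the padding is all zero after filling a dirty structure. */
int demo_padding_is_clean(void)
{
    struct sockaddr_in sin;
    size_t i;

    memset(&sin, 0xAA, sizeof(sin)); /* simulate uninitialized memory */
    demo_v4_fill(&sin, htons(9899), htonl(INADDR_LOOPBACK));
    for (i = 0; i < sizeof(sin.sin_zero); i++)
        if (sin.sin_zero[i] != 0)
            return 0;
    return 1;
}
```

    Without the memset() on sin_zero, the 0xAA bytes would survive and
    could later be copied out along with the address.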

    Reported-by: syzbot+6dcbfea81cd3d4dd0b02@syzkaller.appspotmail.com
    Signed-off-by: Xin Long
    Acked-by: Neil Horman
    Signed-off-by: David S. Miller

    Xin Long