06 Feb, 2021

1 commit

  • Commit c80794323e82 ("net: Fix packet reordering caused by GRO and
    listified RX cooperation") had the unfortunate effect of adding
    latencies in common workloads.

    Before the patch, GRO packets were immediately passed to
    upper stacks.

After the patch, we can accumulate quite a lot of GRO
packets (depending on NAPI budget).

My fix is to count the number of segments in napi->rx_count
instead of the number of logical packets.
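
    A minimal sketch of the idea (helper names per the existing
    listified-RX code: gro_normal_one(), gro_normal_list() and the
    gro_normal_batch sysctl): flush the batch once enough segments, not
    packets, have accumulated, which bounds the latency a coalesced
    packet can add.

    static void gro_normal_one(struct napi_struct *napi,
                               struct sk_buff *skb, int segs)
    {
            list_add_tail(&skb->list, &napi->rx_list);
            napi->rx_count += segs;        /* was: napi->rx_count++ */
            if (napi->rx_count >= gro_normal_batch)
                    gro_normal_list(napi); /* flush batch to upper stacks */
    }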

    Fixes: c80794323e82 ("net: Fix packet reordering caused by GRO and listified RX cooperation")
    Signed-off-by: Eric Dumazet
    Bisected-by: John Sperbeck
    Tested-by: Jian Yang
    Cc: Maxim Mikityanskiy
    Reviewed-by: Saeed Mahameed
    Reviewed-by: Edward Cree
    Reviewed-by: Alexander Lobakin
    Link: https://lore.kernel.org/r/20210204213146.4192368-1-eric.dumazet@gmail.com
    Signed-off-by: Jakub Kicinski

    Eric Dumazet
     

05 Feb, 2021

1 commit

When iteratively computing a checksum with csum_block_add, track the
offset "pos" so that csum_block_add can rotate the intermediate
checksum correctly when the offset is odd.

The open-coded implementation of skb_copy_and_csum_datagram did this.
    With the switch to __skb_datagram_iter calling csum_and_copy_to_iter,
    pos was reinitialized to 0 on each call.

Bring back pos by passing it along with the csum to the callback.
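
    For reference, the helper itself (roughly as defined in
    include/net/checksum.h) shows why a stale pos breaks the fold: the
    partial sum must be byte-rotated whenever its block starts at an odd
    offset.

    static inline __wsum csum_block_add(__wsum csum, __wsum csum2, int offset)
    {
            u32 sum = (__force u32)csum2;

            /* rotate sum to align it with a 16b boundary */
            if (offset & 1)
                    sum = ror32(sum, 8);

            return csum_add(csum, (__force __wsum)sum);
    }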

    Changes v1->v2
    - pass csum value, instead of csump pointer (Alexander Duyck)

    Link: https://lore.kernel.org/netdev/20210128152353.GB27281@optiplex/
    Fixes: 950fcaecd5cc ("datagram: consolidate datagram copy to iter helpers")
    Reported-by: Oliver Graute
    Signed-off-by: Willem de Bruijn
    Reviewed-by: Alexander Duyck
    Reviewed-by: Eric Dumazet
    Link: https://lore.kernel.org/r/20210203192952.1849843-1-willemdebruijn.kernel@gmail.com
    Signed-off-by: Jakub Kicinski

    Willem de Bruijn
     

31 Jan, 2021

1 commit

The following race condition was detected:

- (t0) neigh_flush_dev() is under execution and calls
neigh_mark_dead(n), marking the neighbour entry 'n' as dead and
removing it from gc_list.

- (t1) On another CPU: __netif_receive_skb() ->
__netif_receive_skb_core() -> arp_rcv() -> arp_process().
arp_process() calls __neigh_lookup(), which takes a reference on
the neighbour entry 'n'.

- (t2) neigh_flush_dev() moves further along and calls
neigh_cleanup_and_release(n); but since the reference count was
increased in t1, 'n' cannot be destroyed.

- (t3) arp_process() moves further along and calls
neigh_update() -> __neigh_update() -> neigh_update_gc_list(), which
adds the neighbour entry back to gc_list (neigh_mark_dead() removed
it from gc_list earlier, in t0).

- (t4) arp_process() finally calls neigh_release(n), destroying
the neighbour entry.

    This leads to 'n' still being part of gc_list, but the actual
    neighbour structure has been freed.

    The situation can be prevented from happening if we disallow a dead
    entry to have any possibility of updating gc_list. This is what the
    patch intends to achieve.
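
    A sketch of the guard; the surrounding locking and list handling
    approximate the upstream neigh_update_gc_list():

    static void neigh_update_gc_list(struct neighbour *n)
    {
            bool on_gc_list, exempt_from_gc;

            write_lock_bh(&n->tbl->lock);
            write_lock(&n->lock);

            /* the fix: a dead entry must never re-enter gc_list */
            if (n->dead)
                    goto out;

            exempt_from_gc = n->nud_state & NUD_PERMANENT ||
                             n->flags & NTF_EXT_LEARNED;
            on_gc_list = !list_empty(&n->gc_list);

            if (exempt_from_gc && on_gc_list) {
                    list_del_init(&n->gc_list);
                    atomic_dec(&n->tbl->gc_entries);
            } else if (!exempt_from_gc && !on_gc_list) {
                    list_add_tail(&n->gc_list, &n->tbl->gc_list);
                    atomic_inc(&n->tbl->gc_entries);
            }
    out:
            write_unlock(&n->lock);
            write_unlock_bh(&n->tbl->lock);
    }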

    Fixes: 9c29a2f55ec0 ("neighbor: Fix locking order for gc_list changes")
    Signed-off-by: Chinmay Agarwal
    Reviewed-by: Cong Wang
    Reviewed-by: David Ahern
    Link: https://lore.kernel.org/r/20210127165453.GA20514@chinagar-linux.qualcomm.com
    Signed-off-by: Jakub Kicinski

    Chinmay Agarwal
     

20 Jan, 2021

2 commits

With NETIF_F_HW_TLS_RX, packets are decrypted in HW. This cannot
logically be done when RXCSUM offload is off.
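
    A sketch of the dependency check, in the style of the other feature
    fixups in netdev_fix_features() (message text assumed):

    if ((features & NETIF_F_HW_TLS_RX) && !(features & NETIF_F_RXCSUM)) {
            netdev_dbg(dev, "Dropping TLS RX HW offload feature since no RXCSUM feature.\n");
            features &= ~NETIF_F_HW_TLS_RX;
    }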

    Fixes: 14136564c8ee ("net: Add TLS RX offload feature")
    Signed-off-by: Tariq Toukan
    Reviewed-by: Boris Pismenny
    Link: https://lore.kernel.org/r/20210117151538.9411-1-tariqt@nvidia.com
    Signed-off-by: Jakub Kicinski

    Tariq Toukan
     
  • Fix incorrect user_ptr dereferencing when handling port param get/set:

idx[0] stores the 'struct devlink' pointer;
idx[1] stores the 'struct devlink_port' pointer;
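
    In code terms, the port-param handlers should read (a sketch; local
    variable names assumed):

    struct devlink *devlink = info->user_ptr[0];
    struct devlink_port *devlink_port = info->user_ptr[1]; /* was [0] */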

    Fixes: 637989b5d77e ("devlink: Always use user_ptr[0] for devlink and simplify post_doit")
    CC: Parav Pandit
    Signed-off-by: Oleksandr Mazur
    Signed-off-by: Vadym Kochan
    Link: https://lore.kernel.org/r/20210119085333.16833-1-vadym.kochan@plvision.eu
    Signed-off-by: Jakub Kicinski

    Oleksandr Mazur
     

17 Jan, 2021

1 commit

  • Commit 3226b158e67c ("net: avoid 32 x truesize under-estimation for
    tiny skbs") ensured that skbs with data size lower than 1025 bytes
    will be kmalloc'ed to avoid excessive page cache fragmentation and
    memory consumption.
However, the fix addressed only __napi_alloc_skb() (primarily for
virtio_net and napi_get_frags()), but the issue can still be triggered
through __netdev_alloc_skb(), which is still used by several drivers.
    Drivers often allocate a tiny skb for headers and place the rest of
    the frame to frags (so-called copybreak).
    Mirror the condition to __netdev_alloc_skb() to handle this case too.
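
    A sketch of the mirrored condition at the top of __netdev_alloc_skb(),
    matching what 3226b158e67c added to __napi_alloc_skb():

    if (len <= SKB_WITH_OVERHEAD(1024) ||
        len > SKB_WITH_OVERHEAD(PAGE_SIZE) ||
        (gfp_mask & (__GFP_DIRECT_RECLAIM | GFP_DMA))) {
            /* small or oversized request: kmalloc skb->head directly */
            skb = __alloc_skb(len, gfp_mask, SKB_ALLOC_RX, NUMA_NO_NODE);
            if (!skb)
                    goto skb_fail;
            goto skb_success;
    }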

    Since v1 [0]:
    - fix "Fixes:" tag;
    - refine commit message (mention copybreak usecase).

    [0] https://lore.kernel.org/netdev/20210114235423.232737-1-alobakin@pm.me

    Fixes: a1c7fff7e18f ("net: netdev_alloc_skb() use build_skb()")
    Signed-off-by: Alexander Lobakin
    Link: https://lore.kernel.org/r/20210115150354.85967-1-alobakin@pm.me
    Signed-off-by: Jakub Kicinski

    Alexander Lobakin
     

16 Jan, 2021

1 commit

A syzbot report reminded us that very big ewma_log values were supported
in the past, even if they made little sense.

    tc qdisc replace dev xxx root est 1sec 131072sec ...

    While fixing the bug, also add boundary checks for ewma_log, in line
    with range supported by iproute2.
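
    A sketch of the added sanity check in gen_new_estimator(), with
    bounds matching the range iproute2 can generate:

    if (parm->interval < -2 || parm->interval > 3 ||
        parm->ewma_log == 0 || parm->ewma_log >= 31)
            return -EINVAL;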

    UBSAN: shift-out-of-bounds in net/core/gen_estimator.c:83:38
    shift exponent -1 is negative
    CPU: 0 PID: 0 Comm: swapper/0 Not tainted 5.10.0-syzkaller #0
    Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
    Call Trace:

    __dump_stack lib/dump_stack.c:79 [inline]
    dump_stack+0x107/0x163 lib/dump_stack.c:120
    ubsan_epilogue+0xb/0x5a lib/ubsan.c:148
    __ubsan_handle_shift_out_of_bounds.cold+0xb1/0x181 lib/ubsan.c:395
    est_timer.cold+0xbb/0x12d net/core/gen_estimator.c:83
    call_timer_fn+0x1a5/0x710 kernel/time/timer.c:1417
    expire_timers kernel/time/timer.c:1462 [inline]
    __run_timers.part.0+0x692/0xa80 kernel/time/timer.c:1731
    __run_timers kernel/time/timer.c:1712 [inline]
    run_timer_softirq+0xb3/0x1d0 kernel/time/timer.c:1744
    __do_softirq+0x2bc/0xa77 kernel/softirq.c:343
    asm_call_irq_on_stack+0xf/0x20

    __run_on_irqstack arch/x86/include/asm/irq_stack.h:26 [inline]
    run_on_irqstack_cond arch/x86/include/asm/irq_stack.h:77 [inline]
    do_softirq_own_stack+0xaa/0xd0 arch/x86/kernel/irq_64.c:77
    invoke_softirq kernel/softirq.c:226 [inline]
    __irq_exit_rcu+0x17f/0x200 kernel/softirq.c:420
    irq_exit_rcu+0x5/0x20 kernel/softirq.c:432
    sysvec_apic_timer_interrupt+0x4d/0x100 arch/x86/kernel/apic/apic.c:1096
    asm_sysvec_apic_timer_interrupt+0x12/0x20 arch/x86/include/asm/idtentry.h:628
    RIP: 0010:native_save_fl arch/x86/include/asm/irqflags.h:29 [inline]
    RIP: 0010:arch_local_save_flags arch/x86/include/asm/irqflags.h:79 [inline]
    RIP: 0010:arch_irqs_disabled arch/x86/include/asm/irqflags.h:169 [inline]
    RIP: 0010:acpi_safe_halt drivers/acpi/processor_idle.c:111 [inline]
    RIP: 0010:acpi_idle_do_entry+0x1c9/0x250 drivers/acpi/processor_idle.c:516

    Fixes: 1c0d32fde5bd ("net_sched: gen_estimator: complete rewrite of rate estimators")
    Signed-off-by: Eric Dumazet
    Reported-by: syzbot
    Link: https://lore.kernel.org/r/20210114181929.1717985-1-eric.dumazet@gmail.com
    Signed-off-by: Jakub Kicinski

    Eric Dumazet
     

15 Jan, 2021

2 commits

The cited patch below blocked the TLS TX device offload unless HW_CSUM
is set. This broke devices that use IP_CSUM && IP6_CSUM.
    Here we fix it.

    Note that the single HW_TLS_TX feature flag indicates support for
    both IPv4/6, hence it should still be disabled in case only one of
    (IP_CSUM | IPV6_CSUM) is set.
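
    A sketch of the relaxed check in netdev_fix_features():
    NETIF_F_HW_TLS_TX now survives if either HW_CSUM or both of IP_CSUM
    and IPV6_CSUM are present.

    if (features & NETIF_F_HW_TLS_TX) {
            bool ip_csum = (features & (NETIF_F_IP_CSUM | NETIF_F_IPV6_CSUM)) ==
                           (NETIF_F_IP_CSUM | NETIF_F_IPV6_CSUM);
            bool hw_csum = features & NETIF_F_HW_CSUM;

            if (!ip_csum && !hw_csum)
                    features &= ~NETIF_F_HW_TLS_TX;
    }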

    Fixes: ae0b04b238e2 ("net: Disable NETIF_F_HW_TLS_TX when HW_CSUM is disabled")
    Signed-off-by: Tariq Toukan
    Reported-by: Rohit Maheshwari
    Reviewed-by: Maxim Mikityanskiy
    Link: https://lore.kernel.org/r/20210114151215.7061-1-tariqt@nvidia.com
    Signed-off-by: Jakub Kicinski

    Tariq Toukan
     
Both virtio net and napi_get_frags() allocate skbs
with a very small skb->head.

While using page fragments instead of a kmalloc backed skb->head might give
a small performance improvement in some cases, there is a huge risk of
underestimating memory usage.

For both GOOD_COPY_LEN and GRO_MAX_HEAD, we can fit at least 32 allocations
per page (an order-3 page on x86), or even 64 on PowerPC.
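
    Back-of-the-envelope view of the under-estimation (x86 defaults
    assumed): an order-3 page is 32768 bytes and a small skb->head
    accounts roughly 1024 bytes of truesize, so one lingering skb can pin
    the whole page while its 31 freed siblings are no longer accounted,
    i.e. about 32768 real bytes against ~1024 accounted, a ~32x gap.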

We have been tracking OOM issues on GKE hosts hitting tcp_mem limits
while consuming far more memory for TCP buffers than instructed in tcp_mem[2].

Even if we force napi_alloc_skb() to only use order-0 pages, the issue
would still be there on arches with PAGE_SIZE >= 32768.

This patch makes sure that small skb heads are kmalloc backed, so that
other objects in the slab page can be reused instead of being held as long
as skbs are sitting in socket queues.

Note that we might in the future use the sk_buff napi cache,
instead of going through a more expensive __alloc_skb().

Another idea would be to use separate page sizes depending
on the allocated length (to never have more than 4 frags per page).

I would like to thank Greg Thelen for his precious help on this matter;
analysing crash dumps is always a time-consuming task.

    Fixes: fd11a83dd363 ("net: Pull out core bits of __netdev_alloc_skb and add __napi_alloc_skb")
    Signed-off-by: Eric Dumazet
    Cc: Paolo Abeni
    Cc: Greg Thelen
    Reviewed-by: Alexander Duyck
    Acked-by: Michael S. Tsirkin
    Link: https://lore.kernel.org/r/20210113161819.1155526-1-eric.dumazet@gmail.com
    Signed-off-by: Jakub Kicinski

    Eric Dumazet
     

12 Jan, 2021

1 commit

  • skb_seq_read iterates over an skb, returning pointer and length of
    the next data range with each call.

    It relies on kmap_atomic to access highmem pages when needed.

    An skb frag may be backed by a compound page, but kmap_atomic maps
    only a single page. There are not enough kmap slots to always map all
    pages concurrently.

    Instead, if kmap_atomic is needed, iterate over each page.

    As this increases the number of calls, avoid this unless needed.
    The necessary condition is captured in skb_frag_must_loop.

    I tried to make the change as obvious as possible. It should be easy
    to verify that nothing changes if skb_frag_must_loop returns false.
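
    The gating helper is small; roughly (per the kernel's HIGHMEM
    handling, with the CONFIG_DEBUG_KMAP_LOCAL_FORCE_MAP variant omitted):

    static inline bool skb_frag_must_loop(struct page *p)
    {
    #if defined(CONFIG_HIGHMEM)
            if (PageHighMem(p))
                    return true;
    #endif
            return false;
    }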

    Tested:
    On an x86 platform with
    CONFIG_HIGHMEM=y
    CONFIG_DEBUG_KMAP_LOCAL_FORCE_MAP=y
    CONFIG_NETFILTER_XT_MATCH_STRING=y

    Run
    ip link set dev lo mtu 1500
iptables -A OUTPUT -m string --string 'badstring' --algo bm -j ACCEPT
    dd if=/dev/urandom of=in bs=1M count=20
    nc -l -p 8000 > /dev/null &
    nc -w 1 -q 0 localhost 8000 < in

    Signed-off-by: Willem de Bruijn
    Signed-off-by: Jakub Kicinski

    Willem de Bruijn
     

09 Jan, 2021

5 commits

  • If register_netdevice() fails at the very last stage - the
    notifier call - some subsystems may have already seen it and
    grabbed a reference. struct net_device can't be freed right
    away without calling netdev_wait_all_refs().

    Now that we have a clean interface in form of dev->needs_free_netdev
    and lenient free_netdev() we can undo what commit 93ee31f14f6f ("[NET]:
    Fix free_netdev on register_netdev failure.") has done and complete
    the unregistration path by bringing the net_set_todo() call back.

After registration fails, the user is still expected to explicitly
free the net_device, so make sure ->needs_free_netdev is cleared;
otherwise rolling back the registration will cause the old double
free for callers who release rtnl_lock before the free.

    This also solves the problem of priv_destructor not being called
    on notifier error.

    net_set_todo() will be moved back into unregister_netdevice_queue()
    in a follow up.
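
    A sketch of the resulting notifier-failure path in
    register_netdevice():

    ret = call_netdevice_notifiers(NETDEV_REGISTER, dev);
    ret = notifier_to_errno(ret);
    if (ret) {
            /* Expect explicit free_netdev() on failure */
            dev->needs_free_netdev = false;
            unregister_netdevice_queue(dev, NULL);
            goto out;
    }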

    Reported-by: Hulk Robot
    Reported-by: Yang Yingliang
    Signed-off-by: Jakub Kicinski

    Jakub Kicinski
     
  • There are two flavors of handling netdev registration:
    - ones called without holding rtnl_lock: register_netdev() and
    unregister_netdev(); and
    - those called with rtnl_lock held: register_netdevice() and
    unregister_netdevice().

    While the semantics of the former are pretty clear, the same can't
    be said about the latter. The netdev_todo mechanism is utilized to
    perform some of the device unregistering tasks and it hooks into
    rtnl_unlock() so the locked variants can't actually finish the work.
    In general free_netdev() does not mix well with locked calls. Most
    drivers operating under rtnl_lock set dev->needs_free_netdev to true
    and expect core to make the free_netdev() call some time later.

    The part where this becomes most problematic is error paths. There is
    no way to unwind the state cleanly after a call to register_netdevice(),
    since unreg can't be performed fully without dropping locks.

    Make free_netdev() more lenient, and defer the freeing if device
    is being unregistered. This allows error paths to simply call
    free_netdev() both after register_netdevice() failed, and after
    a call to unregister_netdevice() but before dropping rtnl_lock.

    Simplify the error paths which are currently doing gymnastics
    around free_netdev() handling.
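
    A sketch of the deferral at the top of free_netdev():

    /* When called immediately after register_netdevice() failed, the
     * unwind may still be dismantling the device; defer the free until
     * the todo list processes it.
     */
    if (dev->reg_state == NETREG_UNREGISTERING) {
            ASSERT_RTNL();
            dev->needs_free_netdev = true;
            return;
    }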

    Signed-off-by: Jakub Kicinski

    Jakub Kicinski
     
  • Explain the two basic flows of struct net_device's operation.

    Signed-off-by: Jakub Kicinski

    Jakub Kicinski
     
  • reuse->socks[] is modified concurrently by reuseport_add_sock. To
    prevent reading values that have not been fully initialized, only read
    the array up until the last known safe index instead of incorrectly
    re-reading the last index of the array.
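
    A sketch of the bounded scan (shape assumed from the surrounding
    reuseport_select_sock() code): num_socks is sampled once and the
    wrap-around uses that snapshot instead of re-reading the field.

    u32 socks = READ_ONCE(reuse->num_socks);

    smp_rmb();     /* paired with smp_wmb() in reuseport_add_sock() */

    i = j = reciprocal_scale(hash, socks);
    while (reuse->socks[i]->sk_state == TCP_ESTABLISHED) {
            i++;
            if (i >= socks)     /* was: i >= reuse->num_socks (re-read) */
                    i = 0;
            if (i == j)
                    goto out;
    }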

    Fixes: acdcecc61285f ("udp: correct reuseport selection with connected sockets")
    Signed-off-by: Baptiste Lepers
    Acked-by: Willem de Bruijn
    Link: https://lore.kernel.org/r/20210107051110.12247-1-baptiste.lepers@gmail.com
    Signed-off-by: Jakub Kicinski

    Baptiste Lepers
     
  • skbs in fraglist could be shared by a BPF filter loaded at TC. If TC
    writes, it will call skb_ensure_writable -> pskb_expand_head to create
    a private linear section for the head_skb. And then call
    skb_clone_fraglist -> skb_get on each skb in the fraglist.

    skb_segment_list overwrites part of the skb linear section of each
    fragment itself. Even after skb_clone, the frag_skbs share their
    linear section with their clone in PF_PACKET.

    Both sk_receive_queue of PF_PACKET and PF_INET (or PF_INET6) can have
    a link for the same frag_skbs chain. If a new skb (not frags) is
    queued to one of the sk_receive_queue, multiple ptypes can see and
    release this. It causes use-after-free.

    [ 4443.426215] ------------[ cut here ]------------
    [ 4443.426222] refcount_t: underflow; use-after-free.
    [ 4443.426291] WARNING: CPU: 7 PID: 28161 at lib/refcount.c:190
    refcount_dec_and_test_checked+0xa4/0xc8
    [ 4443.426726] pstate: 60400005 (nZCv daif +PAN -UAO)
    [ 4443.426732] pc : refcount_dec_and_test_checked+0xa4/0xc8
    [ 4443.426737] lr : refcount_dec_and_test_checked+0xa0/0xc8
    [ 4443.426808] Call trace:
    [ 4443.426813] refcount_dec_and_test_checked+0xa4/0xc8
    [ 4443.426823] skb_release_data+0x144/0x264
    [ 4443.426828] kfree_skb+0x58/0xc4
    [ 4443.426832] skb_queue_purge+0x64/0x9c
    [ 4443.426844] packet_set_ring+0x5f0/0x820
    [ 4443.426849] packet_setsockopt+0x5a4/0xcd0
    [ 4443.426853] __sys_setsockopt+0x188/0x278
    [ 4443.426858] __arm64_sys_setsockopt+0x28/0x38
    [ 4443.426869] el0_svc_common+0xf0/0x1d0
    [ 4443.426873] el0_svc_handler+0x74/0x98
    [ 4443.426880] el0_svc+0x8/0xc

Fixes: 3a1296a38d0c ("net: Support GRO/GSO fraglist chaining.")
    Signed-off-by: Dongseok Yi
    Acked-by: Willem de Bruijn
    Acked-by: Daniel Borkmann
    Link: https://lore.kernel.org/r/1610072918-174177-1-git-send-email-dseok.yi@samsung.com
    Signed-off-by: Jakub Kicinski

    Dongseok Yi
     

29 Dec, 2020

5 commits

pneigh_enqueue() tries to obtain a random delay by taking
prandom_u32() modulo NEIGH_VAR(p, PROXY_DELAY). However,
NEIGH_VAR(p, PROXY_DELAY) might be zero at that point because someone
could write zero to /proc/sys/net/ipv4/neigh/[device]/proxy_delay
after the callers check it.

    This patch uses prandom_u32_max() to get a random delay instead
    which avoids potential division by zero.
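
    A sketch of the change in pneigh_enqueue(); prandom_u32_max(0) simply
    yields 0, so a zero proxy_delay no longer divides by zero:

    /* was: sched_next = now + (prandom_u32() % NEIGH_VAR(p, PROXY_DELAY)); */
    unsigned long sched_next = now +
                               prandom_u32_max(NEIGH_VAR(p, PROXY_DELAY));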

    Signed-off-by: weichenchen
    Signed-off-by: David S. Miller

    weichenchen
     
  • Accesses to dev->xps_rxqs_map (when using dev->num_tc) should be
    protected by the rtnl lock, like we do for netif_set_xps_queue. I didn't
    see an actual bug being triggered, but let's be safe here and take the
    rtnl lock while accessing the map in sysfs.
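
    A sketch of the read side (shape of xps_rxqs_show() assumed): sysfs
    handlers cannot block unconditionally on rtnl, hence the
    trylock/restart pattern.

    if (!rtnl_trylock())
            return restart_syscall();

    /* ... read dev->num_tc and dev->xps_rxqs_map under rtnl ... */

    rtnl_unlock();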

    Fixes: 8af2c06ff4b1 ("net-sysfs: Add interface for Rx queue(s) map per Tx queue")
    Signed-off-by: Antoine Tenart
    Reviewed-by: Alexander Duyck
    Signed-off-by: Jakub Kicinski

    Antoine Tenart
     
  • Two race conditions can be triggered when storing xps rxqs, resulting in
    various oops and invalid memory accesses:

1. Calling netdev_set_num_tc while netif_set_xps_queue is running:

- netif_set_xps_queue uses dev->num_tc as one of the parameters to
compute the size of new_dev_maps when allocating it. dev->num_tc is
also used to access the map, and the compiler may generate code to
retrieve this field multiple times in the function.

- netdev_set_num_tc sets dev->num_tc.

If new_dev_maps is allocated using dev->num_tc and dev->num_tc is
then set to a higher value through netdev_set_num_tc, later accesses
to new_dev_maps in netif_set_xps_queue could lead to accessing memory
outside of new_dev_maps, triggering an oops.

2. Calling netif_set_xps_queue while netdev_set_num_tc is running:

2.1. netdev_set_num_tc starts by resetting the xps queues;
dev->num_tc isn't updated yet.

2.2. netif_set_xps_queue is called, setting up the map with the
*old* dev->num_tc.

2.3. netdev_set_num_tc updates dev->num_tc.

2.4. Later accesses to the map lead to out-of-bounds accesses and
oops.

    A similar issue can be found with netdev_reset_tc.

    One way of triggering this is to set an iface up (for which the driver
    uses netdev_set_num_tc in the open path, such as bnx2x) and writing to
    xps_rxqs in a concurrent thread. With the right timing an oops is
    triggered.

    Both issues have the same fix: netif_set_xps_queue, netdev_set_num_tc
    and netdev_reset_tc should be mutually exclusive. We do that by taking
    the rtnl lock in xps_rxqs_store.

    Fixes: 8af2c06ff4b1 ("net-sysfs: Add interface for Rx queue(s) map per Tx queue")
    Signed-off-by: Antoine Tenart
    Reviewed-by: Alexander Duyck
    Signed-off-by: Jakub Kicinski

    Antoine Tenart
     
  • Accesses to dev->xps_cpus_map (when using dev->num_tc) should be
    protected by the rtnl lock, like we do for netif_set_xps_queue. I didn't
    see an actual bug being triggered, but let's be safe here and take the
    rtnl lock while accessing the map in sysfs.

    Fixes: 184c449f91fe ("net: Add support for XPS with QoS via traffic classes")
    Signed-off-by: Antoine Tenart
    Reviewed-by: Alexander Duyck
    Signed-off-by: Jakub Kicinski

    Antoine Tenart
     
  • Two race conditions can be triggered when storing xps cpus, resulting in
    various oops and invalid memory accesses:

1. Calling netdev_set_num_tc while netif_set_xps_queue is running:

- netif_set_xps_queue uses dev->num_tc as one of the parameters to
compute the size of new_dev_maps when allocating it. dev->num_tc is
also used to access the map, and the compiler may generate code to
retrieve this field multiple times in the function.

- netdev_set_num_tc sets dev->num_tc.

If new_dev_maps is allocated using dev->num_tc and dev->num_tc is
then set to a higher value through netdev_set_num_tc, later accesses
to new_dev_maps in netif_set_xps_queue could lead to accessing memory
outside of new_dev_maps, triggering an oops.

2. Calling netif_set_xps_queue while netdev_set_num_tc is running:

2.1. netdev_set_num_tc starts by resetting the xps queues;
dev->num_tc isn't updated yet.

2.2. netif_set_xps_queue is called, setting up the map with the
*old* dev->num_tc.

2.3. netdev_set_num_tc updates dev->num_tc.

2.4. Later accesses to the map lead to out-of-bounds accesses and
oops.

    A similar issue can be found with netdev_reset_tc.

    One way of triggering this is to set an iface up (for which the driver
    uses netdev_set_num_tc in the open path, such as bnx2x) and writing to
    xps_cpus in a concurrent thread. With the right timing an oops is
    triggered.

    Both issues have the same fix: netif_set_xps_queue, netdev_set_num_tc
    and netdev_reset_tc should be mutually exclusive. We do that by taking
    the rtnl lock in xps_cpus_store.
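
    A sketch of the store-side serialization (shape of xps_cpus_store()
    assumed; the same pattern applies to xps_rxqs_store()):

    if (!rtnl_trylock()) {
            free_cpumask_var(mask);
            return restart_syscall();
    }

    err = netif_set_xps_queue(dev, mask, index);
    rtnl_unlock();

    free_cpumask_var(mask);
    return err ? : len;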

    Fixes: 184c449f91fe ("net: Add support for XPS with QoS via traffic classes")
    Signed-off-by: Antoine Tenart
    Reviewed-by: Alexander Duyck
    Signed-off-by: Jakub Kicinski

    Antoine Tenart
     

17 Dec, 2020

1 commit

There are some use cases for netdev_notify_peers in contexts where
the rtnl lock is already held. Introduce a lockless version of the
netdev_notify_peers call to save the extra code needed to call
call_netdevice_notifiers(NETDEV_NOTIFY_PEERS, dev);
call_netdevice_notifiers(NETDEV_RESEND_IGMP, dev);
After that, convert netdev_notify_peers to call the new helper.
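
    A sketch of the resulting pair of helpers:

    void __netdev_notify_peers(struct net_device *dev)
    {
            ASSERT_RTNL();
            call_netdevice_notifiers(NETDEV_NOTIFY_PEERS, dev);
            call_netdevice_notifiers(NETDEV_RESEND_IGMP, dev);
    }

    void netdev_notify_peers(struct net_device *dev)
    {
            rtnl_lock();
            __netdev_notify_peers(dev);
            rtnl_unlock();
    }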

    Suggested-by: Nathan Lynch
    Signed-off-by: Lijun Pan
    Signed-off-by: Jakub Kicinski

    Lijun Pan
     

16 Dec, 2020

1 commit

  • Pull networking updates from Jakub Kicinski:
    "Core:

    - support "prefer busy polling" NAPI operation mode, where we defer
    softirq for some time expecting applications to periodically busy
    poll

    - AF_XDP: improve efficiency by more batching and hindering the
    adjacency cache prefetcher

    - af_packet: make packet_fanout.arr size configurable up to 64K

    - tcp: optimize TCP zero copy receive in presence of partial or
    unaligned reads making zero copy a performance win for much smaller
    messages

    - XDP: add bulk APIs for returning / freeing frames

    - sched: support fragmenting IP packets as they come out of conntrack

    - net: allow virtual netdevs to forward UDP L4 and fraglist GSO skbs

    BPF:

    - BPF switch from crude rlimit-based to memcg-based memory accounting

    - BPF type format information for kernel modules and related tracing
    enhancements

    - BPF implement task local storage for BPF LSM

    - allow the FENTRY/FEXIT/RAW_TP tracing programs to use
    bpf_sk_storage

    Protocols:

    - mptcp: improve multiple xmit streams support, memory accounting and
    many smaller improvements

    - TLS: support CHACHA20-POLY1305 cipher

    - seg6: add support for SRv6 End.DT4/DT6 behavior

    - sctp: Implement RFC 6951: UDP Encapsulation of SCTP

    - ppp_generic: add ability to bridge channels directly

    - bridge: Connectivity Fault Management (CFM) support as is defined
    in IEEE 802.1Q section 12.14.

    Drivers:

    - mlx5: make use of the new auxiliary bus to organize the driver
    internals

    - mlx5: more accurate port TX timestamping support

    - mlxsw:
    - improve the efficiency of offloaded next hop updates by using
    the new nexthop object API
    - support blackhole nexthops
    - support IEEE 802.1ad (Q-in-Q) bridging

- rtw88: major Bluetooth co-existence improvements

    - iwlwifi: support new 6 GHz frequency band

    - ath11k: Fast Initial Link Setup (FILS)

    - mt7915: dual band concurrent (DBDC) support

    - net: ipa: add basic support for IPA v4.5

    Refactor:

    - a few pieces of in_interrupt() cleanup work from Sebastian Andrzej
    Siewior

    - phy: add support for shared interrupts; get rid of multiple driver
    APIs and have the drivers write a full IRQ handler, slight growth
    of driver code should be compensated by the simpler API which also
    allows shared IRQs

    - add common code for handling netdev per-cpu counters

    - move TX packet re-allocation from Ethernet switch tag drivers to a
    central place

    - improve efficiency and rename nla_strlcpy

    - number of W=1 warning cleanups as we now catch those in a patchwork
    build bot

    Old code removal:

    - wan: delete the DLCI / SDLA drivers

    - wimax: move to staging

    - wifi: remove old WDS wifi bridging support"

    * tag 'net-next-5.11' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (1922 commits)
    net: hns3: fix expression that is currently always true
    net: fix proc_fs init handling in af_packet and tls
    nfc: pn533: convert comma to semicolon
    af_vsock: Assign the vsock transport considering the vsock address flags
    af_vsock: Set VMADDR_FLAG_TO_HOST flag on the receive path
    vsock_addr: Check for supported flag values
    vm_sockets: Add VMADDR_FLAG_TO_HOST vsock flag
    vm_sockets: Add flags field in the vsock address data structure
    net: Disable NETIF_F_HW_TLS_TX when HW_CSUM is disabled
    tcp: Add logic to check for SYN w/ data in tcp_simple_retransmit
    net: mscc: ocelot: install MAC addresses in .ndo_set_rx_mode from process context
    nfc: s3fwrn5: Release the nfc firmware
    net: vxget: clean up sparse warnings
    mlxsw: spectrum_router: Use eXtended mezzanine to offload IPv4 router
    mlxsw: spectrum: Set KVH XLT cache mode for Spectrum2/3
    mlxsw: spectrum_router_xm: Introduce basic XM cache flushing
    mlxsw: reg: Add Router LPM Cache Enable Register
    mlxsw: reg: Add Router LPM Cache ML Delete Register
    mlxsw: spectrum_router_xm: Implement L-value tracking for M-index
    mlxsw: reg: Add XM Router M Table Register
    ...

    Linus Torvalds
     

15 Dec, 2020

5 commits

  • With NETIF_F_HW_TLS_TX packets are encrypted in HW. This cannot be
    logically done when HW_CSUM offload is off.

    Fixes: 2342a8512a1e ("net: Add TLS TX offload features")
    Signed-off-by: Tariq Toukan
    Reviewed-by: Boris Pismenny
    Link: https://lore.kernel.org/r/20201213143929.26253-1-tariqt@nvidia.com
    Signed-off-by: Jakub Kicinski

    Tariq Toukan
     
syzbot reproduces the BUG_ON in skb_checksum_help():
tun creates a (bogus) skb with a huge partial-checksummed area and
a small ip packet inside. Then ip_rcv trims the skb based on the size
of the internal ip packet; after that, the csum offset points beyond
the trimmed skb. checksum_tg(), called via a netfilter hook, then
triggers the BUG_ON:

    offset = skb_checksum_start_offset(skb);
    BUG_ON(offset >= skb_headlen(skb));

To work around the problem, this patch forces pskb_trim_rcsum_slow()
to return -EINVAL in the described scenario. This allows its callers
to drop such packets.
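
    A sketch of the added guard in pskb_trim_rcsum_slow(): if the
    checksum start/offset of a CHECKSUM_PARTIAL skb would land beyond the
    trimmed headlen, the packet is rejected.

    if (skb->ip_summed == CHECKSUM_PARTIAL) {
            int hdlen = (len > skb_headlen(skb)) ? skb_headlen(skb) : len;
            int offset = skb_checksum_start_offset(skb) + skb->csum_offset;

            if (offset + sizeof(__sum16) > hdlen)
                    return -EINVAL;
    }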

    Link: https://syzkaller.appspot.com/bug?id=b419a5ca95062664fe1a60b764621eb4526e2cd0
    Reported-by: syzbot+7010af67ced6105e5ab6@syzkaller.appspotmail.com
    Signed-off-by: Vasily Averin
    Acked-by: Willem de Bruijn
    Link: https://lore.kernel.org/r/1b2494af-2c56-8ee2-7bc0-923fcad1cdf8@virtuozzo.com
    Signed-off-by: Jakub Kicinski

    Vasily Averin
     
  • Pull scheduler updates from Thomas Gleixner:

    - migrate_disable/enable() support which originates from the RT tree
    and is now a prerequisite for the new preemptible kmap_local() API
    which aims to replace kmap_atomic().

    - A fair amount of topology and NUMA related improvements

    - Improvements for the frequency invariant calculations

    - Enhanced robustness for the global CPU priority tracking and decision
    making

    - The usual small fixes and enhancements all over the place

    * tag 'sched-core-2020-12-14' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (61 commits)
    sched/fair: Trivial correction of the newidle_balance() comment
    sched/fair: Clear SMT siblings after determining the core is not idle
    sched: Fix kernel-doc markup
    x86: Print ratio freq_max/freq_base used in frequency invariance calculations
    x86, sched: Use midpoint of max_boost and max_P for frequency invariance on AMD EPYC
    x86, sched: Calculate frequency invariance for AMD systems
    irq_work: Optimize irq_work_single()
    smp: Cleanup smp_call_function*()
    irq_work: Cleanup
    sched: Limit the amount of NUMA imbalance that can exist at fork time
    sched/numa: Allow a floating imbalance between NUMA nodes
    sched: Avoid unnecessary calculation of load imbalance at clone time
    sched/numa: Rename nr_running and break out the magic number
    sched: Make migrate_disable/enable() independent of RT
    sched/topology: Condition EAS enablement on FIE support
    arm64: Rebuild sched domains on invariance status changes
    sched/topology,schedutil: Wrap sched domains rebuild
    sched/uclamp: Allow to reset a task uclamp constraint value
    sched/core: Fix typos in comments
    Documentation: scheduler: fix information on arch SD flags, sched_domain and sched_debug
    ...

    Linus Torvalds
     
  • Pull misc fixes from Christian Brauner:
    "This contains several fixes which felt worth being combined into a
    single branch:

- Use put_nsproxy() instead of open-coding it in switch_task_namespaces()

    - Kirill's work to unify lifecycle management for all namespaces. The
    lifetime counters are used identically for all namespaces types.
    Namespaces may of course have additional unrelated counters and
    these are not altered. This work allows us to unify the type of the
    counters and reduces maintenance cost by moving the counter in one
    place and indicating that basic lifetime management is identical
    for all namespaces.

    - Peilin's fix adding three byte padding to Dmitry's
    PTRACE_GET_SYSCALL_INFO uapi struct to prevent an info leak.

- Two small patches to convert from the /* fall through */ comment
    annotation to the fallthrough keyword annotation which I had taken
    into my branch and into -next before df561f6688fe ("treewide: Use
    fallthrough pseudo-keyword") made it upstream which fixed this
    tree-wide.

    Since I didn't want to invalidate all testing for other commits I
    didn't rebase and kept them"

    * tag 'fixes-v5.11' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux:
    nsproxy: use put_nsproxy() in switch_task_namespaces()
    sys: Convert to the new fallthrough notation
    signal: Convert to the new fallthrough notation
    time: Use generic ns_common::count
    cgroup: Use generic ns_common::count
    mnt: Use generic ns_common::count
    user: Use generic ns_common::count
    pid: Use generic ns_common::count
    ipc: Use generic ns_common::count
    uts: Use generic ns_common::count
    net: Use generic ns_common::count
    ns: Add a common refcount into ns_common
    ptrace: Prevent kernel-infoleak in ptrace_get_syscall_info()

    Linus Torvalds
     
  • Daniel Borkmann says:

    ====================
    pull-request: bpf-next 2020-12-14

    1) Expose bpf_sk_storage_*() helpers to iterator programs, from Florent Revest.

    2) Add AF_XDP selftests based on veth devs to BPF selftests, from Weqaar Janjua.

    3) Support for finding BTF based kernel attach targets through libbpf's
    bpf_program__set_attach_target() API, from Andrii Nakryiko.

    4) Permit pointers on stack for helper calls in the verifier, from Yonghong Song.

    5) Fix overflows in hash map elem size after rlimit removal, from Eric Dumazet.

    6) Get rid of direct invocation of llc in BPF selftests, from Andrew Delgadillo.

    7) Fix xsk_recvmsg() to reorder socket state check before access, from Björn Töpel.

    8) Add new libbpf API helper to retrieve ring buffer epoll fd, from Brendan Jackman.

    9) Batch of minor BPF selftest improvements all over the place, from Florian Lehner,
    KP Singh, Jiri Olsa and various others.

    * https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (31 commits)
    selftests/bpf: Add a test for ptr_to_map_value on stack for helper access
    bpf: Permits pointers on stack for helper calls
    libbpf: Expose libbpf ring_buffer epoll_fd
    selftests/bpf: Add set_attach_target() API selftest for module target
    libbpf: Support modules in bpf_program__set_attach_target() API
    selftests/bpf: Silence ima_setup.sh when not running in verbose mode.
    selftests/bpf: Drop the need for LLVM's llc
    selftests/bpf: fix bpf_testmod.ko recompilation logic
    samples/bpf: Fix possible hang in xdpsock with multiple threads
    selftests/bpf: Make selftest compilation work on clang 11
    selftests/bpf: Xsk selftests - adding xdpxceiver to .gitignore
    selftests/bpf: Drop tcp-{client,server}.py from Makefile
    selftests/bpf: Xsk selftests - Bi-directional Sockets - SKB, DRV
    selftests/bpf: Xsk selftests - Socket Teardown - SKB, DRV
    selftests/bpf: Xsk selftests - DRV POLL, NOPOLL
    selftests/bpf: Xsk selftests - SKB POLL, NOPOLL
    selftests/bpf: Xsk selftests framework
    bpf: Only provide bpf_sock_from_file with CONFIG_NET
    bpf: Return -ENOTSUPP when attaching to non-kernel BTF
    xsk: Validate socket state in xsk_recvmsg, prior touching socket members
    ...
    ====================

    Link: https://lore.kernel.org/r/20201214214316.20642-1-daniel@iogearbox.net
    Signed-off-by: Jakub Kicinski

    Jakub Kicinski
     

11 Dec, 2020

2 commits

  • Alexei Starovoitov says:

    ====================
    pull-request: bpf 2020-12-10

    The following pull-request contains BPF updates for your *net* tree.

    We've added 21 non-merge commits during the last 12 day(s) which contain
    a total of 21 files changed, 163 insertions(+), 88 deletions(-).

    The main changes are:

    1) Fix propagation of 32-bit signed bounds from 64-bit bounds, from Alexei.

    2) Fix ring_buffer__poll() return value, from Andrii.

    3) Fix race in lwt_bpf, from Cong.

    4) Fix test_offload, from Toke.

    5) Various xsk fixes.

    Please consider pulling these changes from:

    git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf.git

    Thanks a lot!

    Also thanks to reporters, reviewers and testers of commits in this pull-request:

    Cong Wang, Hulk Robot, Jakub Kicinski, Jean-Philippe Brucker, John
    Fastabend, Magnus Karlsson, Maxim Mikityanskiy, Yonghong Song
    ====================

    Signed-off-by: David S. Miller

    David S. Miller
     
  • We use rcu_assign_pointer to assign both the table and the entries,
    but the entries are not marked as __rcu. This generates sparse
    warnings.

    Signed-off-by: Jakub Kicinski
    Signed-off-by: David S. Miller

    Jakub Kicinski
     

10 Dec, 2020

1 commit

  • The offending commit introduces a cleanup callback that is invoked
    when the driver module is removed to clean up the tunnel device
    flow block. But it returns on the first iteration of the for loop.
    The remaining indirect flow blocks will never be freed.
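
    The shape of the fix (a sketch; list and field names assumed): keep
    walking the list and move every match, rather than returning on the
    first iteration.

    list_for_each_entry_safe(this, next, &flow_block_indr_list, list) {
            if (this->release != release ||
                this->indr.cb_priv != cb_priv)
                    continue;   /* was effectively: return */
            list_move(&this->indr.list, cleanup_list);
    }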

    Fixes: 1fac52da5942 ("net: flow_offload: consolidate indirect flow_block infrastructure")
    CC: Pablo Neira Ayuso
    Signed-off-by: Chris Mi
    Reviewed-by: Roi Dayan

    Chris Mi
     

09 Dec, 2020

3 commits

  • Since commit 7f0a838254bd ("bpf, xdp: Maintain info on attached XDP BPF
    programs in net_device"), the XDP program attachment info is now maintained
    in the core code. This interacts badly with the xdp_attachment_flags_ok()
    check that prevents unloading an XDP program with different load flags than
    it was loaded with. In practice, two kinds of failures are seen:

    - An XDP program loaded without specifying a mode (and which then ends up
    in driver mode) cannot be unloaded if the program mode is specified on
    unload.

    - The dev_xdp_uninstall() hook always calls the driver callback with the
    mode set to the type of the program but an empty flags argument, which
    means the flags_ok() check prevents the program from being removed,
    leading to bpf prog reference leaks.

    The original reason this check was added was to avoid ambiguity when
    multiple programs were loaded. With the way the checks are done in the core
    now, this is quite simple to enforce in the core code, so let's add a check
    there and get rid of the xdp_attachment_flags_ok() callback entirely.

    Fixes: 7f0a838254bd ("bpf, xdp: Maintain info on attached XDP BPF programs in net_device")
    Signed-off-by: Toke Høiland-Jørgensen
    Signed-off-by: Daniel Borkmann
    Acked-by: Jakub Kicinski
    Link: https://lore.kernel.org/bpf/160752225751.110217.10267659521308669050.stgit@toke.dk

    Toke Høiland-Jørgensen
     
  • This moves the bpf_sock_from_file definition into net/core/filter.c
    which only gets compiled with CONFIG_NET and also moves the helper proto
    usage next to other tracing helpers that are conditional on CONFIG_NET.

This avoids:
ld: kernel/trace/bpf_trace.o: in function `bpf_sock_from_file':
bpf_trace.c:(.text+0xe23): undefined reference to `sock_from_file'
when compiling a kernel with BPF and without NET.

    Reported-by: kernel test robot
    Reported-by: Randy Dunlap
    Signed-off-by: Florent Revest
    Signed-off-by: Alexei Starovoitov
    Acked-by: Randy Dunlap
    Acked-by: Martin KaFai Lau
    Acked-by: KP Singh
    Link: https://lore.kernel.org/bpf/20201208173623.1136863-1-revest@chromium.org

    Florent Revest
     
  • Simplify the return expression.

    Signed-off-by: Zheng Yongjun
    Signed-off-by: David S. Miller

    Zheng Yongjun
     

08 Dec, 2020

2 commits

migrate_disable() is just a wrapper for preempt_disable() on
non-RT kernels. It is safe to replace it, and RT kernels will
benefit.

Note that migrate_disable() has only existed since Feb 2020.
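
    For context, the non-RT fallback at the time reduced to preemption
    control, so the substitution is behavior-preserving there (simplified
    sketch):

    #ifndef CONFIG_PREEMPT_RT
    static inline void migrate_disable(void)
    {
            preempt_disable();
    }

    static inline void migrate_enable(void)
    {
            preempt_enable();
    }
    #endif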

    Suggested-by: Alexei Starovoitov
    Signed-off-by: Cong Wang
    Signed-off-by: Alexei Starovoitov
    Link: https://lore.kernel.org/bpf/20201205075946.497763-2-xiyou.wangcong@gmail.com

    Cong Wang
     
  • The per-cpu bpf_redirect_info is shared among all skb_do_redirect()
    and BPF redirect helpers. Callers on RX path are all in BH context,
    disabling preemption is not sufficient to prevent BH interruption.

    In production, we observed strange packet drops because of the race
    condition between LWT xmit and TC ingress, and we verified this issue
    is fixed after we disable BH.

    Although this bug was technically introduced from the beginning, that
    is commit 3a0af8fd61f9 ("bpf: BPF for lightweight tunnel infrastructure"),
    at that time call_rcu() had to be call_rcu_bh() to match the RCU context.
    So this patch may not work well before RCU flavor consolidation has been
    completed around v5.0.

    Update the comments above the code too, as call_rcu() is now BH friendly.
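
    Combined with the follow-up that swaps in migrate_disable(), the
    protected section in run_lwt_bpf() looks roughly like:

    /* Migration disable and BH disable are needed to protect per-cpu
     * redirect_info between the BPF prog and skb_do_redirect().
     */
    migrate_disable();
    local_bh_disable();
    bpf_compute_data_pointers(skb);
    ret = bpf_prog_run_save_cb(lwt->prog, skb);
    local_bh_enable();
    migrate_enable();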

    Signed-off-by: Dongdong Wang
    Signed-off-by: Alexei Starovoitov
    Reviewed-by: Cong Wang
    Link: https://lore.kernel.org/bpf/20201205075946.497763-1-xiyou.wangcong@gmail.com

    Dongdong Wang
     

05 Dec, 2020

2 commits

  • Iterators are currently used to expose kernel information to userspace
    over fast procfs-like files but iterators could also be used to
    manipulate local storage. For example, the task_file iterator could be
    used to initialize a socket local storage with associations between
    processes and sockets or to selectively delete local storage values.
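
    A hypothetical iterator program along those lines, using the
    bpf_sock_from_file helper this series introduces (sk_stg_map is an
    assumed BPF_MAP_TYPE_SK_STORAGE map declared elsewhere; the deletion
    policy is invented for illustration):

    SEC("iter/task_file")
    int clean_sk_storage(struct bpf_iter__task_file *ctx)
    {
            struct file *file = ctx->file;
            struct socket *sock;

            if (!file)
                    return 0;

            sock = bpf_sock_from_file(file);
            if (sock)
                    /* selectively drop this socket's local storage value */
                    bpf_sk_storage_delete(&sk_stg_map, sock->sk);
            return 0;
    }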

    Signed-off-by: Florent Revest
    Signed-off-by: Daniel Borkmann
    Acked-by: Martin KaFai Lau
    Acked-by: KP Singh
    Link: https://lore.kernel.org/bpf/20201204113609.1850150-3-revest@google.com

    Florent Revest
     
Currently, the sock_from_file prototype takes an "err" pointer that is
either not set or set to -ENOTSOCK iff the returned socket is NULL.
This makes the error redundant, and it is ignored by a few callers.

    This patch simplifies the API by letting callers deduce the error based
    on whether the returned socket is NULL or not.
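
    The simplified helper and a typical caller then look roughly like:

    struct socket *sock_from_file(struct file *file)
    {
            if (file->f_op == &socket_file_ops)
                    return file->private_data;  /* set in sock_alloc_file() */
            return NULL;
    }

    /* caller side: the error is now deduced from NULL */
    sock = sock_from_file(file);
    if (!sock)
            return -ENOTSOCK;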

    Suggested-by: Al Viro
    Signed-off-by: Florent Revest
    Signed-off-by: Daniel Borkmann
    Reviewed-by: KP Singh
    Link: https://lore.kernel.org/bpf/20201204113609.1850150-1-revest@google.com

    Florent Revest
     

04 Dec, 2020

2 commits

  • Alexei Starovoitov says:

    ====================
    pull-request: bpf-next 2020-12-03

    The main changes are:

    1) Support BTF in kernel modules, from Andrii.

    2) Introduce preferred busy-polling, from Björn.

    3) bpf_ima_inode_hash() and bpf_bprm_opts_set() helpers, from KP Singh.

    4) Memcg-based memory accounting for bpf objects, from Roman.

    5) Allow bpf_{s,g}etsockopt from cgroup bind{4,6} hooks, from Stanislav.

    * https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (118 commits)
    selftests/bpf: Fix invalid use of strncat in test_sockmap
    libbpf: Use memcpy instead of strncpy to please GCC
    selftests/bpf: Add fentry/fexit/fmod_ret selftest for kernel module
    selftests/bpf: Add tp_btf CO-RE reloc test for modules
    libbpf: Support attachment of BPF tracing programs to kernel modules
    libbpf: Factor out low-level BPF program loading helper
    bpf: Allow to specify kernel module BTFs when attaching BPF programs
    bpf: Remove hard-coded btf_vmlinux assumption from BPF verifier
    selftests/bpf: Add CO-RE relocs selftest relying on kernel module BTF
    selftests/bpf: Add support for marking sub-tests as skipped
    selftests/bpf: Add bpf_testmod kernel module for testing
    libbpf: Add kernel module BTF support for CO-RE relocations
    libbpf: Refactor CO-RE relocs to not assume a single BTF object
    libbpf: Add internal helper to load BTF data by FD
    bpf: Keep module's btf_data_size intact after load
    bpf: Fix bpf_put_raw_tracepoint()'s use of __module_address()
    selftests/bpf: Add Userspace tests for TCP_WINDOW_CLAMP
    bpf: Adds support for setting window clamp
    samples/bpf: Fix spelling mistake "recieving" -> "receiving"
    bpf: Fix cold build of test_progs-no_alu32
    ...
    ====================

    Link: https://lore.kernel.org/r/20201204021936.85653-1-alexei.starovoitov@gmail.com
    Signed-off-by: Jakub Kicinski

    Jakub Kicinski
     
Add a new bpf_setsockopt option for TCP sockets, TCP_BPF_WINDOW_CLAMP,
which sets the maximum receiver window size. It will be useful for
limiting the receiver window based on RTT.
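
    Illustrative sockops usage (the clamp value and attach point are
    invented; TCP_BPF_WINDOW_CLAMP is the optname this patch adds):

    SEC("sockops")
    int clamp_rwnd(struct bpf_sock_ops *skops)
    {
            int clamp = 65535;  /* hypothetical receive window clamp */

            if (skops->op == BPF_SOCK_OPS_ACTIVE_ESTABLISHED_CB)
                    bpf_setsockopt(skops, SOL_TCP, TCP_BPF_WINDOW_CLAMP,
                                   &clamp, sizeof(clamp));
            return 1;
    }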

    Signed-off-by: Prankur gupta
    Signed-off-by: Alexei Starovoitov
    Link: https://lore.kernel.org/bpf/20201202213152.435886-2-prankgup@fb.com

    Prankur gupta