24 Oct, 2020

1 commit

  • With SO_RCVLOWAT, under memory pressure,
    it is possible to enter a state where:

    1. We have not received enough bytes to satisfy SO_RCVLOWAT.
    2. We have not entered buffer pressure (see tcp_rmem_pressure()).
    3. But, we do not have enough buffer space to accept more packets.

    In this case, we advertise 0 rwnd (due to #3) but the application does
    not drain the receive queue (no wakeup because of #1 and #2) so the
    flow stalls.

    Modify the heuristic for SO_RCVLOWAT so that, if we are advertising
    rwnd <= rcv_mss, we force a wakeup to prevent a stall.
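
    For context, SO_RCVLOWAT is set from userspace roughly as follows (a
    minimal sketch; the 64KB threshold is an arbitrary example, not taken
    from the commit):

    #include <sys/socket.h>

    static int set_rcvlowat(int fd)
    {
            int lowat = 64 * 1024;  /* arbitrary example threshold */

            /* SO_RCVLOWAT takes an int byte count at the SOL_SOCKET
             * level; the reader is normally not woken until at least
             * this many bytes are queued. */
            return setsockopt(fd, SOL_SOCKET, SO_RCVLOWAT,
                              &lowat, sizeof(lowat));
    }
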
    Acked-by: Soheil Hassas Yeganeh
    Acked-by: Neal Cardwell
    Signed-off-by: Eric Dumazet
    Link: https://lore.kernel.org/r/20201023184709.217614-1-arjunroy.kdev@gmail.com
    Signed-off-by: Jakub Kicinski

    Arjun Roy
     

03 Oct, 2020

1 commit

  • Commit a10674bf2406 ("tcp: detecting the misuse of .sendpage for Slab
    objects") added checks for Slab pages, but pages with a zero
    page_count are still missing from the check.

    The network layer's sendpage method is not designed to send pages
    with a zero page_count either, so both PageSlab() and page_count()
    should be checked for the page being sent. This is exactly what
    sendpage_ok() does.

    This patch uses sendpage_ok() in do_tcp_sendpages() to detect misuse
    of .sendpage and make the code more robust.
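
    For reference, sendpage_ok() boils down to roughly the following
    sketch (see include/linux/net.h for the authoritative helper):

    /* A page is safe for .sendpage only if it is not a Slab page
     * and its reference count is at least one. */
    static inline bool sendpage_ok(struct page *page)
    {
            return !PageSlab(page) && page_ref_count(page) >= 1;
    }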

    Fixes: a10674bf2406 ("tcp: detecting the misuse of .sendpage for Slab objects")
    Suggested-by: Eric Dumazet
    Signed-off-by: Coly Li
    Cc: Vasily Averin
    Cc: David S. Miller
    Cc: stable@vger.kernel.org
    Signed-off-by: David S. Miller

    Coly Li
     

01 Oct, 2020

1 commit

  • TCP has been using icsk_ack.blocked to work around the possibility of
    tcp_delack_timer() finding the socket owned by user.

    After commit 6f458dfb4092 ("tcp: improve latencies of timer triggered events")
    we added the TCP_DELACK_TIMER_DEFERRED atomic bit for more immediate
    recovery, so we can get rid of icsk_ack.blocked.

    This frees space that the following patch will reuse.

    Signed-off-by: Eric Dumazet
    Acked-by: Soheil Hassas Yeganeh
    Signed-off-by: David S. Miller

    Eric Dumazet
     

24 Sep, 2020

1 commit

  • Alexei Starovoitov says:

    ====================
    pull-request: bpf-next 2020-09-23

    The following pull-request contains BPF updates for your *net-next* tree.

    We've added 95 non-merge commits during the last 22 day(s) which contain
    a total of 124 files changed, 4211 insertions(+), 2040 deletions(-).

    The main changes are:

    1) Full multi function support in libbpf, from Andrii.

    2) Refactoring of function argument checks, from Lorenz.

    3) Make bpf_tail_call compatible with functions (subprograms), from Maciej.

    4) Program metadata support, from YiFei.

    5) bpf iterator optimizations, from Yonghong.
    ====================

    Signed-off-by: David S. Miller

    David S. Miller
     

15 Sep, 2020

3 commits

  • For EPOLLET, applications must call sendmsg until they get EAGAIN.
    Otherwise, there is no guarantee that EPOLLOUT is sent if there was
    a failure upon memory allocation.

    As a result, on high-speed NICs, userspace observes multiple small
    sendmsgs after a partial sendmsg until EAGAIN, since TCP can send
    1-2 TSOs in between two sendmsg syscalls:

    // One large partial send due to memory allocation failure.
    sendmsg(20MB) = 2MB
    // Many small sends until EAGAIN.
    sendmsg(18MB) = 64KB
    sendmsg(17.9MB) = 128KB
    sendmsg(17.8MB) = 64KB
    ...
    sendmsg(...) = EAGAIN
    // At this point, userspace can assume an EPOLLOUT.

    To fix this, set SOCK_NOSPACE in all partial sendmsg scenarios to
    guarantee that EPOLLOUT is sent after a partial sendmsg.

    After this commit, userspace can assume that it will receive an EPOLLOUT
    after the first partial sendmsg. This EPOLLOUT will benefit from the
    sk_stream_write_space() logic delaying the EPOLLOUT until significant
    space is available in the write queue.
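
    A minimal userspace sketch of the EPOLLET contract described above
    (buffer management and error handling elided; the caller resumes from
    its epoll loop on the next EPOLLOUT):

    #include <errno.h>
    #include <sys/types.h>
    #include <sys/socket.h>

    static size_t send_until_eagain(int fd, const char *buf, size_t len)
    {
            size_t off = 0;

            while (off < len) {
                    ssize_t n = send(fd, buf + off, len - off,
                                     MSG_DONTWAIT);

                    if (n > 0) {
                            off += n;       /* partial or full send */
                            continue;
                    }
                    if (n < 0 && (errno == EAGAIN || errno == EWOULDBLOCK))
                            break;          /* wait for the next EPOLLOUT */
                    break;                  /* real error: handle upstream */
            }
            return off;
    }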

    Signed-off-by: Eric Dumazet
    Signed-off-by: Soheil Hassas Yeganeh
    Signed-off-by: David S. Miller

    Soheil Hassas Yeganeh
     
  • If there is any event available on the TCP socket, tcp_poll() is
    called to retrieve all the events. In tcp_poll(), we call
    sk_stream_is_writeable(), which returns true as long as we are at least
    one byte below notsent_lowat. This results in quite a few
    spurious EPOLLOUT wakeups and frequent tiny sendmsg() calls.

    Similar to sk_stream_write_space(), use __sk_stream_is_writeable
    with a wake value of 1, so that we set EPOLLOUT only if half the
    space is available for write.
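
    For reference, the "half the space" condition corresponds roughly to
    the following helpers (a sketch from my reading of include/net/sock.h;
    details may differ by kernel version):

    /* The socket counts as writable only once free write-queue space
     * reaches half of what is currently queued. */
    static inline int sk_stream_min_wspace(const struct sock *sk)
    {
            return READ_ONCE(sk->sk_wmem_queued) >> 1;
    }

    static inline bool __sk_stream_is_writeable(const struct sock *sk,
                                                int wake)
    {
            return sk_stream_wspace(sk) >= sk_stream_min_wspace(sk) &&
                   __sk_stream_memory_free(sk, wake);
    }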

    Signed-off-by: Soheil Hassas Yeganeh
    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Soheil Hassas Yeganeh
     
  • That is needed to let the subflows promptly announce when new
    space is available in the receive buffer.

    tcp_cleanup_rbuf() is currently a static function; drop the
    scope modifier and add a declaration in the TCP header.

    Reviewed-by: Mat Martineau
    Signed-off-by: Paolo Abeni
    Signed-off-by: David S. Miller

    Paolo Abeni
     

11 Sep, 2020

2 commits

  • Now that the previous patches ensure that all call sites for
    tcp_set_congestion_control() want to initialize congestion control, we
    can simplify tcp_set_congestion_control() by removing the reinit
    argument and the code to support it.

    Signed-off-by: Neal Cardwell
    Signed-off-by: Eric Dumazet
    Signed-off-by: Alexei Starovoitov
    Acked-by: Yuchung Cheng
    Acked-by: Kevin Yang
    Cc: Lawrence Brakmo

    Neal Cardwell
     
  • Change tcp_init_transfer() to only initialize congestion control if it
    has not been initialized already.

    With this new approach, we can arrange things so that if the EBPF code
    sets the congestion control by calling setsockopt(TCP_CONGESTION) then
    tcp_init_transfer() will not re-initialize the CC module (a sketch of
    this check follows the list below).

    This is an approach that has the following beneficial properties:

    (1) This allows CC module customizations made by the EBPF called in
    tcp_init_transfer() to persist, and not be wiped out by a later
    call to tcp_init_congestion_control() in tcp_init_transfer().

    (2) Does not flip the order of EBPF and CC init, to avoid causing bugs
    for existing code upstream that depends on the current order.

    (3) Does not cause 2 initializations for CC in the case where the
    EBPF called in tcp_init_transfer() wants to set the CC to a new CC
    algorithm.

    (4) Allows follow-on simplifications to the code in net/core/filter.c
    and net/ipv4/tcp_cong.c, which currently both have some complexity
    to special-case CC initialization to avoid double CC
    initialization if EBPF sets the CC.
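
    A sketch of the init-once check mentioned above, as I read it (the
    icsk_ca_initialized flag is the new state bit this patch introduces;
    exact placement per the actual patch):

    /* In tcp_init_transfer(): skip CC init if e.g. a BPF
     * setsockopt(TCP_CONGESTION) has already initialized it. */
    if (!icsk->icsk_ca_initialized)
            tcp_init_congestion_control(sk);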

    Signed-off-by: Neal Cardwell
    Signed-off-by: Eric Dumazet
    Signed-off-by: Alexei Starovoitov
    Acked-by: Yuchung Cheng
    Acked-by: Kevin Yang
    Cc: Lawrence Brakmo

    Neal Cardwell
     

25 Aug, 2020

4 commits

  • This patch is adapted from Eric's patch in an earlier discussion [1].

    The TCP_SAVE_SYN option currently only stores the network header and
    the tcp header. This patch allows it to optionally store
    the mac header as well if the setsockopt's optval is 2.

    It requires one more bit for the "save_syn" bit field in tcp_sock.
    This patch achieves this by moving the syn_smc bit next to is_mptcp.
    The syn_smc bit is currently used with the TCP experimental option.
    Since syn_smc is only used when CONFIG_SMC is enabled, this patch also
    puts "IS_ENABLED(CONFIG_SMC)" around it, as is_mptcp does
    with "IS_ENABLED(CONFIG_MPTCP)".

    The mac_hdrlen is also stored in the "struct saved_syn"
    to allow the bpf prog to compute a quick offset if it chooses to
    start reading from the network header or the tcp header.

    [1]: https://lore.kernel.org/netdev/CANn89iLJNWh6bkH7DNhy_kmcAexuUCccqERqe7z2QsvPhGrYPQ@mail.gmail.com/
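
    A minimal sketch of the userspace side, with the optval semantics as
    described above (TCP_SAVED_SYN read-back shown as a comment):

    #include <netinet/in.h>
    #include <netinet/tcp.h>
    #include <sys/socket.h>

    static int enable_save_syn_with_mac(int listen_fd)
    {
            int val = 2;    /* 1 = save SYN; 2 = also keep the mac header */

            return setsockopt(listen_fd, IPPROTO_TCP, TCP_SAVE_SYN,
                              &val, sizeof(val));
    }

    /* Later, on the accepted socket, the saved headers can be read back
     * with getsockopt(fd, IPPROTO_TCP, TCP_SAVED_SYN, buf, &len). */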

    Suggested-by: Eric Dumazet
    Signed-off-by: Martin KaFai Lau
    Signed-off-by: Alexei Starovoitov
    Reviewed-by: Eric Dumazet
    Link: https://lore.kernel.org/bpf/20200820190123.2886935-1-kafai@fb.com

    Martin KaFai Lau
     
  • This patch adds bpf_setsockopt(TCP_BPF_RTO_MIN) to allow bpf prog
    to set the min rto of a connection. It could be used together
    with the earlier patch which has added bpf_setsockopt(TCP_BPF_DELACK_MAX).

    A later selftest patch will communicate the max delay ack in a
    bpf tcp header option and then the receiving side can use
    bpf_setsockopt(TCP_BPF_RTO_MIN) to set a shorter rto.
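
    A hedged sketch of a sock_ops program using the new option; the
    callback chosen, the 20ms value, and the microsecond unit are
    assumptions for illustration, not taken from the commit text:

    #include <linux/bpf.h>
    #include <bpf/bpf_helpers.h>

    #ifndef SOL_TCP
    #define SOL_TCP 6       /* IPPROTO_TCP */
    #endif

    SEC("sockops")
    int set_rto_min(struct bpf_sock_ops *skops)
    {
            int rto_min_us = 20000; /* assumed unit: microseconds */

            if (skops->op == BPF_SOCK_OPS_ACTIVE_ESTABLISHED_CB)
                    bpf_setsockopt(skops, SOL_TCP, TCP_BPF_RTO_MIN,
                                   &rto_min_us, sizeof(rto_min_us));
            return 1;
    }

    char _license[] SEC("license") = "GPL";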

    Signed-off-by: Martin KaFai Lau
    Signed-off-by: Alexei Starovoitov
    Reviewed-by: Eric Dumazet
    Acked-by: John Fastabend
    Link: https://lore.kernel.org/bpf/20200820190027.2884170-1-kafai@fb.com

    Martin KaFai Lau
     
  • This change is mostly from an internal patch and adapts it from sysctl
    config to the bpf_setsockopt setup.

    The bpf_prog can set the max delay ack by using
    bpf_setsockopt(TCP_BPF_DELACK_MAX). This max delay ack can be
    communicated to its peer through a bpf header option. The receiving
    peer can then use this max delay ack and set a potentially lower rto
    by using bpf_setsockopt(TCP_BPF_RTO_MIN), which will be introduced
    in the next patch.

    Another later selftest patch will also use it as above to show
    how to write and parse bpf tcp header options.

    Signed-off-by: Martin KaFai Lau
    Signed-off-by: Alexei Starovoitov
    Reviewed-by: Eric Dumazet
    Acked-by: John Fastabend
    Link: https://lore.kernel.org/bpf/20200820190021.2884000-1-kafai@fb.com

    Martin KaFai Lau
     
  • TCP_SAVE_SYN stores both the network header and the tcp header.
    The total length of the saved syn packet is currently stored in
    the first 4 bytes (u32) of an array, and the actual packet data is
    stored after that.

    A later patch will add a bpf helper that allows getting the tcp header
    alone from the saved syn, without the network header. It will be more
    convenient to have a direct offset to a specific header instead of
    re-parsing it. This requires storing the network hdrlen separately.
    The total header length (i.e. network + tcp) is still needed for the
    current usage in getsockopt. Although this total length can be obtained
    by looking into the tcphdr and reading (th->doff << 2), this patch
    chooses to directly store the tcp hdrlen in the second four bytes of
    the newly created "struct saved_syn". Using a new struct
    gives a readable name to each individual header length.
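
    Per my reading of the patch, the new struct looks roughly like:

    /* Named lengths replace the old leading u32 total length. */
    struct saved_syn {
            u32 network_hdrlen;
            u32 tcp_hdrlen;
            u8  data[];
    };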

    Signed-off-by: Martin KaFai Lau
    Signed-off-by: Alexei Starovoitov
    Reviewed-by: Eric Dumazet
    Acked-by: John Fastabend
    Link: https://lore.kernel.org/bpf/20200820190014.2883694-1-kafai@fb.com

    Martin KaFai Lau
     

11 Aug, 2020

1 commit

  • When TFO keys are read back on big endian systems either via the global
    sysctl interface or via getsockopt() using TCP_FASTOPEN_KEY, the values
    don't match what was written.

    For example, on s390x:

    # echo "1-2-3-4" > /proc/sys/net/ipv4/tcp_fastopen_key
    # cat /proc/sys/net/ipv4/tcp_fastopen_key
    02000000-01000000-04000000-03000000

    Instead of:

    # cat /proc/sys/net/ipv4/tcp_fastopen_key
    00000001-00000002-00000003-00000004

    Fix this by converting to the correct endianness on read. This was
    reported by Colin Ian King when running the 'tcp_fastopen_backup_key' net
    selftest on s390x, which depends on the read value matching what was
    written. I've confirmed that the test now passes on big and little endian
    systems.

    Signed-off-by: Jason Baron
    Fixes: 438ac88009bc ("net: fastopen: robustness and endianness fixes for SipHash")
    Cc: Ard Biesheuvel
    Cc: Eric Dumazet
    Reported-and-tested-by: Colin Ian King
    Signed-off-by: David S. Miller

    Jason Baron
     

01 Aug, 2020

1 commit

  • This change adds TCP_NLA_EDT to SCM_TIMESTAMPING_OPT_STATS, reporting
    the earliest departure time (EDT) of the timestamped skb. By tracking EDT
    values of the skb across different timestamps, we can observe when and
    by how much the value changed. This allows measuring the precise delay
    injected on the sender host, e.g. by a bpf-based throttler.

    Signed-off-by: Yousuk Seung
    Signed-off-by: Eric Dumazet
    Acked-by: Neal Cardwell
    Acked-by: Soheil Hassas Yeganeh
    Acked-by: Yuchung Cheng
    Signed-off-by: David S. Miller

    Yousuk Seung
     

29 Jul, 2020

1 commit

  • sockptr_advance never properly worked. Replace it with _offset variants
    of copy_from_sockptr and copy_to_sockptr.
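
    The _offset variants have signatures along these lines (a sketch from
    my reading of include/linux/sockptr.h):

    /* Copy size bytes starting at byte offset within the user or
     * kernel pointer wrapped by the sockptr_t. */
    int copy_from_sockptr_offset(void *dst, sockptr_t src,
                                 size_t offset, size_t size);
    int copy_to_sockptr_offset(sockptr_t dst, size_t offset,
                               const void *src, size_t size);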

    Fixes: ba423fdaa589 ("net: add a new sockptr_t type")
    Reported-by: Jason A. Donenfeld
    Reported-by: Ido Schimmel
    Signed-off-by: Christoph Hellwig
    Acked-by: Jason A. Donenfeld
    Tested-by: Ido Schimmel
    Signed-off-by: David S. Miller

    Christoph Hellwig
     

10 Jul, 2020

1 commit

  • syzkaller found its way into setsockopt with TCP_CONGESTION "cdg".
    tcp_cdg_init() does a kcalloc to store the gradients. As sk_clone_lock()
    just copies all the memory, the allocated pointer will be copied as
    well if the app called setsockopt(..., TCP_CONGESTION) on the listener.
    If the socket is then destroyed before the congestion control has
    properly been initialized (through a call to tcp_init_transfer()), we
    will end up freeing memory that does not belong to that particular
    socket, opening the door to a double-free:

    [ 11.413102] ==================================================================
    [ 11.414181] BUG: KASAN: double-free or invalid-free in tcp_cleanup_congestion_control+0x58/0xd0
    [ 11.415329]
    [ 11.415560] CPU: 3 PID: 4884 Comm: syz-executor.5 Not tainted 5.8.0-rc2 #80
    [ 11.416544] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.1-0-ga5cab58e9a3f-prebuilt.qemu.org 04/01/2014
    [ 11.418148] Call Trace:
    [ 11.418534]
    [ 11.418834] dump_stack+0x7d/0xb0
    [ 11.419297] print_address_description.constprop.0+0x1a/0x210
    [ 11.422079] kasan_report_invalid_free+0x51/0x80
    [ 11.423433] __kasan_slab_free+0x15e/0x170
    [ 11.424761] kfree+0x8c/0x230
    [ 11.425157] tcp_cleanup_congestion_control+0x58/0xd0
    [ 11.425872] tcp_v4_destroy_sock+0x57/0x5a0
    [ 11.426493] inet_csk_destroy_sock+0x153/0x2c0
    [ 11.427093] tcp_v4_syn_recv_sock+0xb29/0x1100
    [ 11.427731] tcp_get_cookie_sock+0xc3/0x4a0
    [ 11.429457] cookie_v4_check+0x13d0/0x2500
    [ 11.433189] tcp_v4_do_rcv+0x60e/0x780
    [ 11.433727] tcp_v4_rcv+0x2869/0x2e10
    [ 11.437143] ip_protocol_deliver_rcu+0x23/0x190
    [ 11.437810] ip_local_deliver+0x294/0x350
    [ 11.439566] __netif_receive_skb_one_core+0x15d/0x1a0
    [ 11.441995] process_backlog+0x1b1/0x6b0
    [ 11.443148] net_rx_action+0x37e/0xc40
    [ 11.445361] __do_softirq+0x18c/0x61a
    [ 11.445881] asm_call_on_stack+0x12/0x20
    [ 11.446409]
    [ 11.446716] do_softirq_own_stack+0x34/0x40
    [ 11.447259] do_softirq.part.0+0x26/0x30
    [ 11.447827] __local_bh_enable_ip+0x46/0x50
    [ 11.448406] ip_finish_output2+0x60f/0x1bc0
    [ 11.450109] __ip_queue_xmit+0x71c/0x1b60
    [ 11.451861] __tcp_transmit_skb+0x1727/0x3bb0
    [ 11.453789] tcp_rcv_state_process+0x3070/0x4d3a
    [ 11.456810] tcp_v4_do_rcv+0x2ad/0x780
    [ 11.457995] __release_sock+0x14b/0x2c0
    [ 11.458529] release_sock+0x4a/0x170
    [ 11.459005] __inet_stream_connect+0x467/0xc80
    [ 11.461435] inet_stream_connect+0x4e/0xa0
    [ 11.462043] __sys_connect+0x204/0x270
    [ 11.465515] __x64_sys_connect+0x6a/0xb0
    [ 11.466088] do_syscall_64+0x3e/0x70
    [ 11.466617] entry_SYSCALL_64_after_hwframe+0x44/0xa9
    [ 11.467341] RIP: 0033:0x7f56046dc469
    [ 11.467844] Code: Bad RIP value.
    [ 11.468282] RSP: 002b:00007f5604dccdd8 EFLAGS: 00000246 ORIG_RAX: 000000000000002a
    [ 11.469326] RAX: ffffffffffffffda RBX: 000000000068bf00 RCX: 00007f56046dc469
    [ 11.470379] RDX: 0000000000000010 RSI: 0000000020000000 RDI: 0000000000000004
    [ 11.471311] RBP: 00000000ffffffff R08: 0000000000000000 R09: 0000000000000000
    [ 11.472286] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
    [ 11.473341] R13: 000000000041427c R14: 00007f5604dcd5c0 R15: 0000000000000003
    [ 11.474321]
    [ 11.474527] Allocated by task 4884:
    [ 11.475031] save_stack+0x1b/0x40
    [ 11.475548] __kasan_kmalloc.constprop.0+0xc2/0xd0
    [ 11.476182] tcp_cdg_init+0xf0/0x150
    [ 11.476744] tcp_init_congestion_control+0x9b/0x3a0
    [ 11.477435] tcp_set_congestion_control+0x270/0x32f
    [ 11.478088] do_tcp_setsockopt.isra.0+0x521/0x1a00
    [ 11.478744] __sys_setsockopt+0xff/0x1e0
    [ 11.479259] __x64_sys_setsockopt+0xb5/0x150
    [ 11.479895] do_syscall_64+0x3e/0x70
    [ 11.480395] entry_SYSCALL_64_after_hwframe+0x44/0xa9
    [ 11.481097]
    [ 11.481321] Freed by task 4872:
    [ 11.481783] save_stack+0x1b/0x40
    [ 11.482230] __kasan_slab_free+0x12c/0x170
    [ 11.482839] kfree+0x8c/0x230
    [ 11.483240] tcp_cleanup_congestion_control+0x58/0xd0
    [ 11.483948] tcp_v4_destroy_sock+0x57/0x5a0
    [ 11.484502] inet_csk_destroy_sock+0x153/0x2c0
    [ 11.485144] tcp_close+0x932/0xfe0
    [ 11.485642] inet_release+0xc1/0x1c0
    [ 11.486131] __sock_release+0xc0/0x270
    [ 11.486697] sock_close+0xc/0x10
    [ 11.487145] __fput+0x277/0x780
    [ 11.487632] task_work_run+0xeb/0x180
    [ 11.488118] __prepare_exit_to_usermode+0x15a/0x160
    [ 11.488834] do_syscall_64+0x4a/0x70
    [ 11.489326] entry_SYSCALL_64_after_hwframe+0x44/0xa9

    Wei Wang fixed a part of these CDG-malloc issues with commit c12014440750
    ("tcp: memset ca_priv data to 0 properly").

    This patch here fixes the listener scenario: we make sure that listeners
    setting the congestion control through setsockopt won't initialize it
    (thus CDG never allocates on listeners). For those who use AF_UNSPEC to
    reuse a socket, tcp_disconnect() is changed to clean up afterwards.

    (The issue can be reproduced at least down to v4.4.x.)

    Cc: Wei Wang
    Cc: Eric Dumazet
    Fixes: 2b0a8c9eee81 ("tcp: add CDG congestion control")
    Signed-off-by: Christoph Paasch
    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Christoph Paasch
     

03 Jul, 2020

1 commit

  • This essentially reverts commit 721230326891 ("tcp: md5: reject TCP_MD5SIG
    or TCP_MD5SIG_EXT on established sockets").

    Mathieu reported that many vendors' BGP implementations can
    actually switch TCP MD5 on established flows.

    Quoting Mathieu:
    Here is a list of a few network vendors along with their behavior
    with respect to TCP MD5:

    - Cisco: Allows for password to be changed, but within the hold-down
    timer (~180 seconds).
    - Juniper: When the password is initially set on an active connection
    it will reset, but after that any subsequent password change causes no
    network resets.
    - Nokia: No notes on if they flap the tcp connection or not.
    - Ericsson/RedBack: Allows for 2 password (old/new) to co-exist until
    both sides are ok with new passwords.
    - Meta-Switch: Expects the password to be set before a connection is
    attempted, but no further info on whether they reset the TCP
    connection on a change.
    - Avaya: Disable the neighbor, then set password, then re-enable.
    - Zebos: Would normally allow the change when socket connected.

    We can revert my prior change because commit 9424e2e7ad93 ("tcp: md5: fix potential
    overestimation of TCP option space") removed the leak of 4 kernel bytes to
    the wire, which was the main reason for my patch.

    While doing my investigations, I found a bug when an MD5 key is changed,
    leading to these commits that stable teams want to consider before
    backporting this revert:

    Commit 6a2febec338d ("tcp: md5: add missing memory barriers in tcp_md5_do_add()/tcp_md5_hash_key()")
    Commit e6ced831ef11 ("tcp: md5: refine tcp_md5_do_add()/tcp_md5_hash_key() barriers")

    Fixes: 721230326891 ("tcp: md5: reject TCP_MD5SIG or TCP_MD5SIG_EXT on established sockets")
    Signed-off-by: Eric Dumazet
    Reported-by: Mathieu Desnoyers
    Signed-off-by: David S. Miller

    Eric Dumazet
     

02 Jul, 2020

1 commit

  • My prior fix went a bit too far, according to Herbert and Mathieu.

    Since we accept that concurrent TCP MD5 lookups might see inconsistent
    keys, we can use READ_ONCE()/WRITE_ONCE() instead of smp_rmb()/smp_wmb().

    Clearing all key->key[] is needed to avoid possible KMSAN reports,
    if key->keylen is increased. Since tcp_md5_do_add() is not fast path,
    using __GFP_ZERO to clear all struct tcp_md5sig_key is simpler.

    data_race() was added in linux-5.8 and will prevent KCSAN reports,
    this can safely be removed in stable backports, if data_race() is
    not yet backported.

    v2: use data_race() both in tcp_md5_hash_key() and tcp_md5_do_add()
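
    The resulting pattern, roughly (a sketch; see tcp_md5_do_add() and
    tcp_md5_hash_key() for the authoritative code):

    /* writer, tcp_md5_do_add(): publish the bytes, then the length */
    memcpy(key->key, newkey, newkeylen);
    WRITE_ONCE(key->keylen, newkeylen);

    /* reader, tcp_md5_hash_key(): read the length exactly once */
    u8 keylen = READ_ONCE(key->keylen);
    sg_init_one(&sg, key->key, keylen);
    ahash_request_set_crypt(hp->md5_req, &sg, NULL, keylen);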

    Fixes: 6a2febec338d ("tcp: md5: add missing memory barriers in tcp_md5_do_add()/tcp_md5_hash_key()")
    Signed-off-by: Eric Dumazet
    Cc: Mathieu Desnoyers
    Cc: Herbert Xu
    Cc: Marco Elver
    Reviewed-by: Mathieu Desnoyers
    Acked-by: Herbert Xu
    Signed-off-by: David S. Miller

    Eric Dumazet
     

01 Jul, 2020

1 commit

  • MD5 keys are read with RCU protection, and tcp_md5_do_add()
    might update a prior key in place.

    Normally, typical RCU updates would allocate a new piece
    of memory. In this case only key->key and key->keylen might
    be updated, and we do not care if an incoming packet could
    see the old key, the new one, or some intermediate value,
    since changing the key on a live flow is known to be problematic
    anyway.

    We only want to make sure that in the case key->keylen
    is changed, cpus in tcp_md5_hash_key() won't try to use
    uninitialized data, or crash because key->keylen was
    read twice to feed sg_init_one() and ahash_request_set_crypt().

    Fixes: 9ea88a153001 ("tcp: md5: check md5 signature without socket lock")
    Signed-off-by: Eric Dumazet
    Cc: Mathieu Desnoyers
    Signed-off-by: David S. Miller

    Eric Dumazet
     

14 Jun, 2020

1 commit

  • Pull networking fixes from David Miller:

    1) Fix cfg80211 deadlock, from Johannes Berg.

    2) RXRPC fails to send notifications, from David Howells.

    3) MPTCP RM_ADDR parsing has an off-by-one pointer error, fix from
    Geliang Tang.

    4) Fix crash when using MSG_PEEK with sockmap, from Anny Hu.

    5) The ucc_geth driver needs __netdev_watchdog_up exported, from
    Valentin Longchamp.

    6) Fix hashtable memory leak in dccp, from Wang Hai.

    7) Fix how nexthops are marked as FDB nexthops, from David Ahern.

    8) Fix mptcp races between shutdown and recvmsg, from Paolo Abeni.

    9) Fix crashes in tipc_disc_rcv(), from Tuong Lien.

    10) Fix link speed reporting in iavf driver, from Brett Creeley.

    11) When a channel is used for XSK and then reused again later for XSK,
    we forget to clear out the relevant data structures in mlx5 which
    causes all kinds of problems. Fix from Maxim Mikityanskiy.

    12) Fix memory leak in genetlink, from Cong Wang.

    13) Disallow sockmap attachments to UDP sockets, it simply won't work.
    From Lorenz Bauer.

    * git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (83 commits)
    net: ethernet: ti: ale: fix allmulti for nu type ale
    net: ethernet: ti: am65-cpsw-nuss: fix ale parameters init
    net: atm: Remove the error message according to the atomic context
    bpf: Undo internal BPF_PROBE_MEM in BPF insns dump
    libbpf: Support pre-initializing .bss global variables
    tools/bpftool: Fix skeleton codegen
    bpf: Fix memlock accounting for sock_hash
    bpf: sockmap: Don't attach programs to UDP sockets
    bpf: tcp: Recv() should return 0 when the peer socket is closed
    ibmvnic: Flush existing work items before device removal
    genetlink: clean up family attributes allocations
    net: ipa: header pad field only valid for AP->modem endpoint
    net: ipa: program upper nibbles of sequencer type
    net: ipa: fix modem LAN RX endpoint id
    net: ipa: program metadata mask differently
    ionic: add pcie_print_link_status
    rxrpc: Fix race between incoming ACK parser and retransmitter
    net/mlx5: E-Switch, Fix some error pointer dereferences
    net/mlx5: Don't fail driver on failure to create debugfs
    net/mlx5e: CT: Fix ipv6 nat header rewrite actions
    ...

    Linus Torvalds
     

10 Jun, 2020

2 commits

  • Convert comments that reference old mmap_sem APIs to reference
    corresponding new mmap locking APIs instead.

    Signed-off-by: Michel Lespinasse
    Signed-off-by: Andrew Morton
    Reviewed-by: Vlastimil Babka
    Reviewed-by: Davidlohr Bueso
    Reviewed-by: Daniel Jordan
    Cc: David Rientjes
    Cc: Hugh Dickins
    Cc: Jason Gunthorpe
    Cc: Jerome Glisse
    Cc: John Hubbard
    Cc: Laurent Dufour
    Cc: Liam Howlett
    Cc: Matthew Wilcox
    Cc: Peter Zijlstra
    Cc: Ying Han
    Link: http://lkml.kernel.org/r/20200520052908.204642-12-walken@google.com
    Signed-off-by: Linus Torvalds

    Michel Lespinasse
     
  • This change converts the existing mmap_sem rwsem calls to use the new mmap
    locking API instead.

    The change is generated using coccinelle with the following rule:

    // spatch --sp-file mmap_lock_api.cocci --in-place --include-headers --dir .

    @@
    expression mm;
    @@
    (
    -init_rwsem
    +mmap_init_lock
    |
    -down_write
    +mmap_write_lock
    |
    -down_write_killable
    +mmap_write_lock_killable
    |
    -down_write_trylock
    +mmap_write_trylock
    |
    -up_write
    +mmap_write_unlock
    |
    -downgrade_write
    +mmap_write_downgrade
    |
    -down_read
    +mmap_read_lock
    |
    -down_read_killable
    +mmap_read_lock_killable
    |
    -down_read_trylock
    +mmap_read_trylock
    |
    -up_read
    +mmap_read_unlock
    )
    -(&mm->mmap_sem)
    +(mm)

    Signed-off-by: Michel Lespinasse
    Signed-off-by: Andrew Morton
    Reviewed-by: Daniel Jordan
    Reviewed-by: Laurent Dufour
    Reviewed-by: Vlastimil Babka
    Cc: Davidlohr Bueso
    Cc: David Rientjes
    Cc: Hugh Dickins
    Cc: Jason Gunthorpe
    Cc: Jerome Glisse
    Cc: John Hubbard
    Cc: Liam Howlett
    Cc: Matthew Wilcox
    Cc: Peter Zijlstra
    Cc: Ying Han
    Link: http://lkml.kernel.org/r/20200520052908.204642-5-walken@google.com
    Signed-off-by: Linus Torvalds

    Michel Lespinasse
     

09 Jun, 2020

1 commit

  • Use vm_insert_pages() for tcp receive zerocopy. Spin lock cycles (as
    reported by perf) drop from a couple of percentage points to a fraction of
    a percent. This results in a roughly 6% increase in efficiency, measured
    as zerocopy receive count divided by CPU utilization.

    The intention of this patchset is to reduce atomic ops for tcp zerocopy
    receives, which normally hits the same spinlock multiple times
    consecutively.

    [akpm@linux-foundation.org: suppress gcc-7.2.0 warning]
    Link: http://lkml.kernel.org/r/20200128025958.43490-3-arjunroy.kdev@gmail.com
    Signed-off-by: Arjun Roy
    Signed-off-by: Eric Dumazet
    Signed-off-by: Soheil Hassas Yeganeh
    Cc: David Miller
    Cc: Matthew Wilcox
    Cc: Jason Gunthorpe
    Cc: Stephen Rothwell
    Signed-off-by: Andrew Morton
    Signed-off-by: David S. Miller

    Arjun Roy
     
