24 Nov, 2020

1 commit

  • When the TCP stack is in SYN flood mode, the server child socket is
    created from the SYN cookie received in a TCP packet with the ACK flag
    set.

    The child socket is created when the server receives the first TCP
    packet with a valid SYN cookie from the client. Usually, this packet
    corresponds to the final step of the TCP 3-way handshake, the ACK
    packet. But it is also possible to receive a valid SYN cookie from the
    first TCP data packet sent by the client, and thus create a child socket
    from that SYN cookie.

    Since a client socket is ready to send data as soon as it receives the
    SYN+ACK packet from the server, the client can send the ACK packet (sent
    by the TCP stack code), and the first data packet (sent by the userspace
    program) almost at the same time, and thus the server will equally
    receive the two TCP packets with valid SYN cookies almost at the same
    instant.

    When such an event happens, the TCP stack code has a race condition that
    occurs between the moment a lookup is done in the established
    connections hashtable to check for the existence of a connection for the
    same client, and the moment that the child socket is added to the
    established connections hashtable. As a consequence, this race condition
    can lead to a situation where we add two child sockets to the
    established connections hashtable and deliver two sockets for the same
    client to the userspace program.

    This patch fixes the race condition by checking whether a child socket
    already exists for the same client when we are adding the second child
    socket to the established connections hashtable. If one exists, we drop
    the packet and discard the second child socket for the same client.
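    The idea behind the fix can be sketched in userspace as a "check and
    insert under one lock" operation. This is only an illustration under
    stated assumptions: struct conn, struct conn_table and
    conn_insert_unique() are hypothetical names, a single mutex stands in
    for the kernel's hashtable locking, and none of this is the kernel code.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

#define NBUCKETS 16

/* Hypothetical stand-in for a child socket, keyed by client address/port. */
struct conn {
    unsigned int client_addr;
    unsigned short client_port;
    struct conn *next;
};

/* Hypothetical table; one mutex for brevity. */
struct conn_table {
    pthread_mutex_t lock;
    struct conn *buckets[NBUCKETS];
};

static void conn_table_init(struct conn_table *t)
{
    assert(t);
    pthread_mutex_init(&t->lock, NULL);
    for (size_t i = 0; i < NBUCKETS; i++)
        t->buckets[i] = NULL;
}

static size_t bucket_of(const struct conn *c)
{
    return (c->client_addr ^ c->client_port) % NBUCKETS;
}

/*
 * Insert c unless an entry for the same client already exists.
 * The lookup and the insertion happen under the same lock, so two
 * concurrent callers for the same client cannot both succeed.
 * Returns true if c was inserted, false if a duplicate was found.
 */
static bool conn_insert_unique(struct conn_table *t, struct conn *c)
{
    assert(t && c);
    size_t b = bucket_of(c);
    bool inserted = false;
    struct conn *it;

    pthread_mutex_lock(&t->lock);
    for (it = t->buckets[b]; it; it = it->next)
        if (it->client_addr == c->client_addr &&
            it->client_port == c->client_port)
            break;
    if (!it) {
        c->next = t->buckets[b];
        t->buckets[b] = c;
        inserted = true;
    }
    pthread_mutex_unlock(&t->lock);
    return inserted;
}
```

    A caller that gets false back would discard its freshly created entry,
    which mirrors dropping the packet and the second child socket.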

    Signed-off-by: Ricardo Dias
    Signed-off-by: Eric Dumazet
    Link: https://lore.kernel.org/r/20201120111133.GA67501@rdias-suse-pc.lan
    Signed-off-by: Jakub Kicinski

    Ricardo Dias
     

01 Oct, 2020

1 commit

  • TCP has been using icsk_ack.blocked to work around the possibility of
    tcp_delack_timer() finding the socket owned by user.

    After commit 6f458dfb4092 ("tcp: improve latencies of timer triggered events")
    we added TCP_DELACK_TIMER_DEFERRED atomic bit for more immediate recovery,
    so we can get rid of icsk_ack.blocked.

    This frees space that following patch will reuse.

    Signed-off-by: Eric Dumazet
    Acked-by: Soheil Hassas Yeganeh
    Signed-off-by: David S. Miller

    Eric Dumazet
     

12 Aug, 2020

1 commit


20 Jul, 2020

1 commit


11 Jul, 2020

1 commit

  • Commit 0c3d79bce48034018e840468ac5a642894a521a3 ("tcp: reduce SYN-ACK
    retrans for TCP_DEFER_ACCEPT") introduces syn_ack_recalc() which decides
    if a minisock is held and a SYN+ACK is retransmitted or not.

    If rskq_defer_accept is not zero in syn_ack_recalc(), max_retries always
    has the same value because max_retries is overwritten by rskq_defer_accept
    in reqsk_timer_handler().

    This commit adds three changes:
    - remove redundant non-zero check for rskq_defer_accept in
    reqsk_timer_handler().
    - remove max_retries from the arguments of syn_ack_recalc() and use
    rskq_defer_accept instead.
    - rename thresh to max_syn_ack_retries for readability.
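    A rough userspace model of the resulting decision, with
    rskq_defer_accept passed in directly, might look like the following.
    The struct and function names are hypothetical, and the expire/resend
    conditions are a sketch of the behaviour described here, not a copy of
    the kernel function.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, reduced stand-in for a request socket (minisock). */
struct req_model {
    int  num_timeout;  /* SYN+ACK timeouts seen so far */
    bool acked;        /* a bare ACK was already received from the client */
};

/*
 * Decide whether the minisock expires and whether a SYN+ACK is resent.
 * defer_accept == 0 means TCP_DEFER_ACCEPT is not in use.
 */
static void syn_ack_recalc_model(const struct req_model *req,
                                 int max_syn_ack_retries,
                                 int defer_accept,
                                 bool *expire, bool *resend)
{
    assert(req && expire && resend);

    if (!defer_accept) {
        *expire = req->num_timeout >= max_syn_ack_retries;
        *resend = true;
        return;
    }
    /* Keep an ACKed minisock around until the deferring period ends. */
    *expire = req->num_timeout >= max_syn_ack_retries &&
              (!req->acked || req->num_timeout >= defer_accept);
    /* Do not resend while waiting for data after the ACK, except near
     * the end of the deferring period. */
    *resend = !req->acked || req->num_timeout >= defer_accept - 1;
}
```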

    Signed-off-by: Kuniyuki Iwashima
    Reviewed-by: Benjamin Herrenschmidt
    CC: Julian Anastasov
    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Kuniyuki Iwashima
     

05 Jun, 2020

1 commit

  • Clearing the 'inet_num' field is necessary and safe if and
    only if the socket is not bound. The MPTCP protocol calls
    the destroy helper on bound sockets, as tcp_v{4,6}_syn_recv_sock
    completed successfully.

    Move the clearing of such field out of the common code, otherwise
    the MPTCP MP_JOIN error path will find the wrong 'inet_num' value
    on socket disposal, __inet_put_port() will acquire the wrong lock
    and bind_node removal could race with other modifiers possibly
    corrupting the bind hash table.

    Reported-and-tested-by: Christoph Paasch
    Fixes: 729cd6436f35 ("mptcp: cope better with MP_JOIN failure")
    Signed-off-by: Paolo Abeni
    Reviewed-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Paolo Abeni
     

25 May, 2020

1 commit


20 May, 2020

1 commit

  • The commit 637bc8bbe6c0 ("inet: reset tb->fastreuseport when adding a reuseport sk")
    added a bind-address cache in tb->fast*. The tb->fast* caches the address
    of a sk which has successfully been bound with SO_REUSEPORT on. The idea
    is to avoid the expensive conflict search in inet_csk_bind_conflict().

    There is an issue with wildcard matching where sk_reuseport_match() should
    have returned false but it is currently returning true. It ends up
    hiding the bind conflict. For example,

    bind("[::1]:443"); /* without SO_REUSEPORT. Succeed. */
    bind("[::2]:443"); /* with SO_REUSEPORT. Succeed. */
    bind("[::]:443"); /* with SO_REUSEPORT. Still Succeed where it shouldn't */

    The last bind("[::]:443") with SO_REUSEPORT on should have failed because
    it should have a conflict with the very first bind("[::1]:443") which
    has SO_REUSEPORT off. However, the address "[::2]" is cached in
    tb->fast* in the second bind. In the last bind, the sk_reuseport_match()
    returns true because the binding sk's wildcard addr "[::]" matches with
    the "[::2]" cached in tb->fast*.

    The correct bind conflict is reported by removing the second
    bind such that tb->fast* cache is not involved and forces the
    bind("[::]:443") to go through the inet_csk_bind_conflict():

    bind("[::1]:443"); /* without SO_REUSEPORT. Succeed. */
    bind("[::]:443"); /* with SO_REUSEPORT. -EADDRINUSE */

    The expected behavior for sk_reuseport_match() is, it should only allow
    the "cached" tb->fast* address to be used as a wildcard match but not
    the address of the binding sk. To do that, the current
    "bool match_wildcard" arg is split into
    "bool match_sk1_wildcard" and "bool match_sk2_wildcard".

    This change only affects the sk_reuseport_match() which is only
    used by inet_csk (e.g. TCP).
    The other use cases are calling inet_rcv_saddr_equal() and
    this patch makes it pass the same "match_wildcard" arg twice to
    the "ipv[46]_rcv_saddr_equal(..., match_wildcard, match_wildcard)".
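    The effect of splitting the flag can be illustrated with a reduced
    model. Assumptions: addresses are plain 32-bit integers, 0 stands for
    the wildcard, and rcv_saddr_equal_model() is a hypothetical name; the
    real helpers also deal with IPv6 and ipv6only, which is left out here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define ADDR_ANY 0u  /* wildcard address in this model */

/*
 * Two addresses match if they are equal, or if one side is the wildcard
 * and wildcard matching is allowed for that side.
 */
static bool rcv_saddr_equal_model(uint32_t sk1_addr, uint32_t sk2_addr,
                                  bool match_sk1_wildcard,
                                  bool match_sk2_wildcard)
{
    if (sk1_addr == sk2_addr)
        return true;
    return (match_sk1_wildcard && sk1_addr == ADDR_ANY) ||
           (match_sk2_wildcard && sk2_addr == ADDR_ANY);
}
```

    With sk1 as the cached address (say 2) and sk2 as the binding socket's
    wildcard address, passing (true, true) reports a match, which is the
    problem described above, while (true, false) does not, so the bind
    would go on to the full conflict search.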

    Cc: Josef Bacik
    Fixes: 637bc8bbe6c0 ("inet: reset tb->fastreuseport when adding a reuseport sk")
    Signed-off-by: Martin KaFai Lau
    Signed-off-by: David S. Miller

    Martin KaFai Lau
     

16 May, 2020

1 commit

  • Move the steps to prepare an inet_connection_sock for
    forced disposal inside a separate helper. No functional
    changes intended, this will just simplify the next patch.

    Signed-off-by: Paolo Abeni
    Reviewed-by: Christoph Paasch
    Reviewed-by: Mat Martineau
    Signed-off-by: David S. Miller

    Paolo Abeni
     

13 Mar, 2020

4 commits

  • Minor overlapping changes, nothing serious.

    Signed-off-by: David S. Miller

    David S. Miller
     
  • If there is no TCP_LISTEN socket on an ephemeral port, we can bind multiple
    sockets having SO_REUSEADDR to the same port. Then if all sockets bound to
    the port also have SO_REUSEPORT enabled and have the same EUID, all of them
    can call listen(). This is not safe.

    Let's say an application has root privilege and binds sockets to an
    ephemeral port with both SO_REUSEADDR and SO_REUSEPORT. While none of
    the sockets is listening yet, a malicious user can use sudo, exhaust
    ephemeral ports, and bind sockets to the same ephemeral port, so he or she
    can call listen() and steal the port.

    To prevent this issue, we must not bind more than one socket that has the
    same EUID and both SO_REUSEADDR and SO_REUSEPORT.

    On the other hand, if the sockets have different EUIDs, the issue above does
    not occur. After sockets with different EUIDs are bound to the same port and
    one of them is listening, no other socket can listen. This is because the
    condition below evaluates to true and listen() for the second socket fails.

            } else if (!reuseport_ok ||
                       !reuseport || !sk2->sk_reuseport ||
                       rcu_access_pointer(sk->sk_reuseport_cb) ||
                       (sk2->sk_state != TCP_TIME_WAIT &&
                        !uid_eq(uid, sock_i_uid(sk2)))) {
                    if (inet_rcv_saddr_equal(sk, sk2, true))
                            break;
            }

    Therefore, on the same port, listen() cannot succeed for multiple sockets
    with different EUIDs: once one socket is listening, any other listen()
    call fails, so the problem does not happen. In this case, we can still
    call connect() on the other sockets that cannot listen, so bind() has to
    succeed in order to fully utilize 4-tuples.

    Summarizing the above, we should be able to bind only one socket having
    SO_REUSEADDR and SO_REUSEPORT per EUID.

    Signed-off-by: Kuniyuki Iwashima
    Signed-off-by: David S. Miller

    Kuniyuki Iwashima
     
  • Commit aacd9289af8b82f5fb01bcdd53d0e3406d1333c7 ("tcp: bind() use stronger
    condition for bind_conflict") introduced a restriction that forbids binding
    SO_REUSEADDR enabled sockets to the same (addr, port) tuple, in order to
    spread port assignments out so that we can connect to the same remote host.

    The change accelerates port depletion, so that we fail to bind
    sockets to the same local port even if we want to connect to
    different remote hosts.

    You can reproduce this issue by following the instructions below.

    1. # sysctl -w net.ipv4.ip_local_port_range="32768 32768"
    2. set SO_REUSEADDR to two sockets.
    3. bind two sockets to (localhost, 0) and the latter fails.

    Therefore, when ephemeral ports are exhausted, bind(0) should fall back to
    the legacy behaviour to enable the SO_REUSEADDR option and make it possible
    to connect to different remote (addr, port) tuples.

    This patch allows us to bind SO_REUSEADDR enabled sockets to the same
    (addr, port) only when net.ipv4.ip_autobind_reuse is set to 1 and all
    ephemeral ports are exhausted. This also allows connect() and listen() to
    share ports in the following way and may break some applications. So
    ip_autobind_reuse is 0 by default, which disables the feature.

    1. setsockopt(sk1, SO_REUSEADDR)
    2. setsockopt(sk2, SO_REUSEADDR)
    3. bind(sk1, saddr, 0)
    4. bind(sk2, saddr, 0)
    5. connect(sk1, daddr)
    6. listen(sk2)

    If it is set to 1, we can fully utilize the 4-tuples, but we should use
    IP_BIND_ADDRESS_NO_PORT for bind()+connect() where possible.

    The notable thing is that if all sockets bound to the same port have
    both SO_REUSEADDR and SO_REUSEPORT enabled, we can bind sockets to an
    ephemeral port and also do listen().
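    For reference, the setsockopt() and bind() steps of the sequence above
    can be sketched for a single socket as below. This is a minimal
    userspace sketch: bind_reuseaddr_ephemeral() is a hypothetical helper,
    it binds to 127.0.0.1, and whether two such sockets end up sharing a
    port depends on ip_autobind_reuse and on port exhaustion, which the
    sketch does not try to provoke.

```c
#include <assert.h>
#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/*
 * Create a TCP socket with SO_REUSEADDR set and bind it to 127.0.0.1
 * with port 0, so that the kernel picks an ephemeral port.
 * Returns the socket fd, or -1 on error. On success, *port holds the
 * chosen port in host byte order.
 */
static int bind_reuseaddr_ephemeral(unsigned short *port)
{
    struct sockaddr_in sa;
    socklen_t len = sizeof(sa);
    int one = 1;
    int fd;

    assert(port);
    fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0)
        return -1;

    memset(&sa, 0, sizeof(sa));
    sa.sin_family = AF_INET;
    sa.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    sa.sin_port = 0;  /* let the kernel choose */

    if (setsockopt(fd, SOL_SOCKET, SO_REUSEADDR, &one, sizeof(one)) < 0 ||
        bind(fd, (struct sockaddr *)&sa, sizeof(sa)) < 0 ||
        getsockname(fd, (struct sockaddr *)&sa, &len) < 0) {
        close(fd);
        return -1;
    }
    *port = ntohs(sa.sin_port);
    return fd;
}
```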

    Signed-off-by: Kuniyuki Iwashima
    Signed-off-by: David S. Miller

    Kuniyuki Iwashima
     
  • When we get an ephemeral port, the relax is false, so the SO_REUSEADDR
    conditions may be evaluated twice. We do not need to check the conditions
    again.

    Signed-off-by: Kuniyuki Iwashima
    Signed-off-by: David S. Miller

    Kuniyuki Iwashima
     

12 Mar, 2020

1 commit

  • Locking newsk while still holding the listener lock triggered
    a lockdep splat [1]

    We can simply move the memcg code after we release the listener lock,
    as this can also help if multiple threads are sharing a common listener.

    Also fix a typo while reading socket sk_rmem_alloc.

    [1]
    WARNING: possible recursive locking detected
    5.6.0-rc3-syzkaller #0 Not tainted
    --------------------------------------------
    syz-executor598/9524 is trying to acquire lock:
    ffff88808b5b8b90 (sk_lock-AF_INET6){+.+.}, at: lock_sock include/net/sock.h:1541 [inline]
    ffff88808b5b8b90 (sk_lock-AF_INET6){+.+.}, at: inet_csk_accept+0x69f/0xd30 net/ipv4/inet_connection_sock.c:492

    but task is already holding lock:
    ffff88808b5b9590 (sk_lock-AF_INET6){+.+.}, at: lock_sock include/net/sock.h:1541 [inline]
    ffff88808b5b9590 (sk_lock-AF_INET6){+.+.}, at: inet_csk_accept+0x8d/0xd30 net/ipv4/inet_connection_sock.c:445

    other info that might help us debug this:
    Possible unsafe locking scenario:

    CPU0
    ----
    lock(sk_lock-AF_INET6);
    lock(sk_lock-AF_INET6);

    *** DEADLOCK ***

    May be due to missing lock nesting notation

    1 lock held by syz-executor598/9524:
    #0: ffff88808b5b9590 (sk_lock-AF_INET6){+.+.}, at: lock_sock include/net/sock.h:1541 [inline]
    #0: ffff88808b5b9590 (sk_lock-AF_INET6){+.+.}, at: inet_csk_accept+0x8d/0xd30 net/ipv4/inet_connection_sock.c:445

    stack backtrace:
    CPU: 0 PID: 9524 Comm: syz-executor598 Not tainted 5.6.0-rc3-syzkaller #0
    Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
    Call Trace:
    __dump_stack lib/dump_stack.c:77 [inline]
    dump_stack+0x188/0x20d lib/dump_stack.c:118
    print_deadlock_bug kernel/locking/lockdep.c:2370 [inline]
    check_deadlock kernel/locking/lockdep.c:2411 [inline]
    validate_chain kernel/locking/lockdep.c:2954 [inline]
    __lock_acquire.cold+0x114/0x288 kernel/locking/lockdep.c:3954
    lock_acquire+0x197/0x420 kernel/locking/lockdep.c:4484
    lock_sock_nested+0xc5/0x110 net/core/sock.c:2947
    lock_sock include/net/sock.h:1541 [inline]
    inet_csk_accept+0x69f/0xd30 net/ipv4/inet_connection_sock.c:492
    inet_accept+0xe9/0x7c0 net/ipv4/af_inet.c:734
    __sys_accept4_file+0x3ac/0x5b0 net/socket.c:1758
    __sys_accept4+0x53/0x90 net/socket.c:1809
    __do_sys_accept4 net/socket.c:1821 [inline]
    __se_sys_accept4 net/socket.c:1818 [inline]
    __x64_sys_accept4+0x93/0xf0 net/socket.c:1818
    do_syscall_64+0xf6/0x790 arch/x86/entry/common.c:294
    entry_SYSCALL_64_after_hwframe+0x49/0xbe
    RIP: 0033:0x4445c9
    Code: e8 0c 0d 03 00 48 83 c4 18 c3 0f 1f 80 00 00 00 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 3d 01 f0 ff ff 0f 83 eb 08 fc ff c3 66 2e 0f 1f 84 00 00 00 00
    RSP: 002b:00007ffc35b37608 EFLAGS: 00000246 ORIG_RAX: 0000000000000120
    RAX: ffffffffffffffda RBX: 0000000000000003 RCX: 00000000004445c9
    RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000003
    RBP: 0000000000000000 R08: 0000000000306777 R09: 0000000000306777
    R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
    R13: 00000000004053d0 R14: 0000000000000000 R15: 0000000000000000

    Fixes: d752a4986532 ("net: memcg: late association of sock to memcg")
    Signed-off-by: Eric Dumazet
    Cc: Shakeel Butt
    Reported-by: syzbot
    Signed-off-by: David S. Miller

    Eric Dumazet
     

11 Mar, 2020

1 commit

  • If a TCP socket is allocated in IRQ context or cloned from an unassociated
    socket (i.e. one not associated to a memcg) in IRQ context, then it will
    remain unassociated for its whole life. Almost half of the TCP sockets
    created on the system are created in IRQ context, so memory used by such
    sockets will not be accounted by the memcg.

    This issue is more widespread in cgroup v1 where network memory
    accounting is opt-in but it can happen in cgroup v2 if the source socket
    for the cloning was created in root memcg.

    To fix the issue, just do the association of the sockets at the accept()
    time in the process context and then force charge the memory buffer
    already used and reserved by the socket.

    Signed-off-by: Shakeel Butt
    Reviewed-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Shakeel Butt
     

21 Jan, 2020

1 commit

  • After commit 079096f103fa ("tcp/dccp: install syn_recv requests into ehash table")
    the macro isn't used anymore. Remove it.

    Signed-off-by: Alex Shi
    Cc: "David S. Miller"
    Cc: Alexey Kuznetsov
    Cc: Hideaki YOSHIFUJI
    Cc: netdev@vger.kernel.org
    Cc: linux-kernel@vger.kernel.org
    Signed-off-by: David S. Miller

    Alex Shi
     

10 Jan, 2020

1 commit

  • If ULP is used on a listening socket, icsk_ulp_ops and icsk_ulp_data are
    copied when the listener is cloned. Sometimes the clone is immediately
    deleted, which will invoke the release op on the clone and likely
    corrupt the listening socket's icsk_ulp_data.

    The clone operation is invoked immediately after the clone is copied and
    gives the ULP type an opportunity to set up the clone socket and its
    icsk_ulp_data.

    The MPTCP ULP clone will silently fallback to plain TCP on allocation
    failure, so 'clone()' does not need to return an error code.

    v6 -> v7:
    - move and rename ulp clone helper to make it inline-friendly
    v5 -> v6:
    - clarified MPTCP clone usage in commit message

    Signed-off-by: Mat Martineau
    Signed-off-by: David S. Miller

    Mat Martineau
     

25 Dec, 2019

1 commit

  • The MTU update code is supposed to be invoked in response to real
    networking events that update the PMTU. In IPv6 PMTU update function
    __ip6_rt_update_pmtu() we called dst_confirm_neigh() to update neighbor
    confirmed time.

    But the tunnel code calls the PMTU update before xmit, like:
    - tnl_update_pmtu()
    - skb_dst_update_pmtu()
    - ip6_rt_update_pmtu()
    - __ip6_rt_update_pmtu()
    - dst_confirm_neigh()

    If the tunnel remote dst mac address changed and we still do the neigh
    confirm, we will not be able to update the neigh cache and ping6 to the
    remote will fail.

    So for this ip_tunnel_xmit() case, _EVEN_ if the MTU is changed, we
    should not be invoking dst_confirm_neigh() as we have no evidence
    of successful two-way communication at this point.

    On the other hand it is also important to keep the neigh reachability fresh
    for TCP flows, so we cannot remove this dst_confirm_neigh() call.

    To fix the issue, we have to add a new bool parameter for dst_ops.update_pmtu
    to choose whether we should do neigh update or not. I will add the parameter
    in this patch and set all the callers to true to comply with the previous
    way, and fix the tunnel code one by one on later patches.
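    The shape of the change can be sketched with a reduced callback table.
    The struct and function names below are hypothetical and the bodies
    are a model only; the point is that the caller, not the callee,
    decides whether a neighbour confirmation is appropriate.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

struct dst_model;

/* Hypothetical, reduced ops table with the extra bool parameter. */
struct dst_ops_model {
    void (*update_pmtu)(struct dst_model *dst, uint32_t mtu,
                        bool confirm_neigh);
};

/* Hypothetical, reduced route entry. */
struct dst_model {
    const struct dst_ops_model *ops;
    uint32_t pmtu;
    unsigned int neigh_confirms;  /* counts neighbour confirmations */
};

static void model_update_pmtu(struct dst_model *dst, uint32_t mtu,
                              bool confirm_neigh)
{
    assert(dst);
    /* Only refresh neighbour state when the caller has evidence of
     * two-way communication. */
    if (confirm_neigh)
        dst->neigh_confirms++;
    if (mtu < dst->pmtu)
        dst->pmtu = mtu;
}

static const struct dst_ops_model model_ops = {
    .update_pmtu = model_update_pmtu,
};
```

    In this model, a TCP-like caller would pass true and a tunnel xmit
    path would pass false.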

    v5: No change.
    v4: No change.
    v3: Do not remove dst_confirm_neigh, but add a new bool parameter in
    dst_ops.update_pmtu to control whether we should do neighbor confirm.
    Also split the big patch to small ones for each area.
    v2: Remove dst_confirm_neigh in __ip6_rt_update_pmtu.

    Suggested-by: David Miller
    Reviewed-by: Guillaume Nault
    Acked-by: David Ahern
    Signed-off-by: Hangbin Liu
    Signed-off-by: David S. Miller

    Hangbin Liu
     

07 Nov, 2019

1 commit


14 Oct, 2019

1 commit

  • Both tcp_v4_err() and tcp_v6_err() do the following operations
    while they do not own the socket lock :

    fastopen = tp->fastopen_rsk;
    snd_una = fastopen ? tcp_rsk(fastopen)->snt_isn : tp->snd_una;

    The problem is that without an appropriate barrier, the compiler
    might reload tp->fastopen_rsk and trigger a NULL deref.

    Request sockets are protected by RCU, so we can simply add
    the missing annotations and barriers to solve the issue.
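    A userspace analogue of the safe pattern is to load the pointer
    exactly once into a local variable and use only that local afterwards.
    The sketch below uses a C11 atomic load as a stand-in for the kernel's
    RCU accessors; struct rsk_model, struct tp_model and err_snd_una() are
    hypothetical names.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, reduced request socket. */
struct rsk_model {
    uint32_t snt_isn;
};

/* Hypothetical, reduced TCP socket; the pointer may be cleared by
 * another thread at any time. */
struct tp_model {
    _Atomic(struct rsk_model *) fastopen_rsk;
    uint32_t snd_una;
};

static uint32_t err_snd_una(struct tp_model *tp)
{
    assert(tp);
    /* Single load: the NULL test and the dereference below both use
     * this local copy, so the compiler cannot re-read the field in
     * between. */
    struct rsk_model *fastopen =
        atomic_load_explicit(&tp->fastopen_rsk, memory_order_acquire);

    return fastopen ? fastopen->snt_isn : tp->snd_una;
}
```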

    Fixes: 168a8f58059a ("tcp: TCP Fast Open Server - main code path")
    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

10 Oct, 2019

1 commit


21 Sep, 2019

1 commit

  • Julian noted that rt_uses_gateway has a more subtle use than 'is gateway
    set':
    https://lore.kernel.org/netdev/alpine.LFD.2.21.1909151104060.2546@ja.home.ssi.bg/

    Revert that part of the commit referenced in the Fixes tag.

    Currently, there are no u8 holes in 'struct rtable'. There is a 4-byte hole
    in the second cacheline which contains the gateway declaration. So move
    rt_gw_family down to the gateway declarations since they are always used
    together, and then re-use that u8 for rt_uses_gateway. End result is that
    rtable size is unchanged.

    Fixes: 1550c171935d ("ipv4: Prepare rtable for IPv6 gateway")
    Reported-by: Julian Anastasov
    Signed-off-by: David Ahern
    Reviewed-by: Julian Anastasov
    Signed-off-by: Jakub Kicinski

    David Ahern
     

22 Jun, 2019

1 commit


20 Jun, 2019

1 commit

  • KMSAN caught uninit-value in tcp_create_openreq_child() [1]
    This is caused by a recent change, combined with the fact
    that TCP cleared num_timeout, num_retrans and sk fields only
    when a request socket was about to be queued.

    Under syncookie mode, a temporary request socket is used,
    and req->num_timeout could contain garbage.

    Let's clear these three fields sooner; there is really no
    point trying to defer this and risk other bugs.
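    As an illustration of "clear at allocation time", consider the reduced
    model below. The struct layout and reqsk_alloc_model() are
    hypothetical; the memset() with a non-zero pattern only simulates the
    garbage an allocator may hand back.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical, reduced request socket. */
struct reqsk_model {
    void *sk;
    unsigned char num_timeout;
    unsigned char num_retrans;
    unsigned int other_state;  /* initialized later by the caller */
};

static struct reqsk_model *reqsk_alloc_model(void)
{
    struct reqsk_model *req = malloc(sizeof(*req));

    if (!req)
        return NULL;
    /* Simulate uninitialized memory from the allocator. */
    memset(req, 0xa5, sizeof(*req));

    /* Clear the fields right away, so that every user of the object,
     * including a temporary one that is never queued, sees sane values. */
    req->sk = NULL;
    req->num_timeout = 0;
    req->num_retrans = 0;
    return req;
}
```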

    [1]

    BUG: KMSAN: uninit-value in tcp_create_openreq_child+0x157f/0x1cc0 net/ipv4/tcp_minisocks.c:526
    CPU: 1 PID: 13357 Comm: syz-executor591 Not tainted 5.2.0-rc4+ #3
    Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
    Call Trace:

    __dump_stack lib/dump_stack.c:77 [inline]
    dump_stack+0x191/0x1f0 lib/dump_stack.c:113
    kmsan_report+0x162/0x2d0 mm/kmsan/kmsan.c:611
    __msan_warning+0x75/0xe0 mm/kmsan/kmsan_instr.c:304
    tcp_create_openreq_child+0x157f/0x1cc0 net/ipv4/tcp_minisocks.c:526
    tcp_v6_syn_recv_sock+0x761/0x2d80 net/ipv6/tcp_ipv6.c:1152
    tcp_get_cookie_sock+0x16e/0x6b0 net/ipv4/syncookies.c:209
    cookie_v6_check+0x27e0/0x29a0 net/ipv6/syncookies.c:252
    tcp_v6_cookie_check net/ipv6/tcp_ipv6.c:1039 [inline]
    tcp_v6_do_rcv+0xf1c/0x1ce0 net/ipv6/tcp_ipv6.c:1344
    tcp_v6_rcv+0x60b7/0x6a30 net/ipv6/tcp_ipv6.c:1554
    ip6_protocol_deliver_rcu+0x1433/0x22f0 net/ipv6/ip6_input.c:397
    ip6_input_finish net/ipv6/ip6_input.c:438 [inline]
    NF_HOOK include/linux/netfilter.h:305 [inline]
    ip6_input+0x2af/0x340 net/ipv6/ip6_input.c:447
    dst_input include/net/dst.h:439 [inline]
    ip6_rcv_finish net/ipv6/ip6_input.c:76 [inline]
    NF_HOOK include/linux/netfilter.h:305 [inline]
    ipv6_rcv+0x683/0x710 net/ipv6/ip6_input.c:272
    __netif_receive_skb_one_core net/core/dev.c:4981 [inline]
    __netif_receive_skb net/core/dev.c:5095 [inline]
    process_backlog+0x721/0x1410 net/core/dev.c:5906
    napi_poll net/core/dev.c:6329 [inline]
    net_rx_action+0x738/0x1940 net/core/dev.c:6395
    __do_softirq+0x4ad/0x858 kernel/softirq.c:293
    do_softirq_own_stack+0x49/0x80 arch/x86/entry/entry_64.S:1052

    do_softirq kernel/softirq.c:338 [inline]
    __local_bh_enable_ip+0x199/0x1e0 kernel/softirq.c:190
    local_bh_enable+0x36/0x40 include/linux/bottom_half.h:32
    rcu_read_unlock_bh include/linux/rcupdate.h:682 [inline]
    ip6_finish_output2+0x213f/0x2670 net/ipv6/ip6_output.c:117
    ip6_finish_output+0xae4/0xbc0 net/ipv6/ip6_output.c:150
    NF_HOOK_COND include/linux/netfilter.h:294 [inline]
    ip6_output+0x5d3/0x720 net/ipv6/ip6_output.c:167
    dst_output include/net/dst.h:433 [inline]
    NF_HOOK include/linux/netfilter.h:305 [inline]
    ip6_xmit+0x1f53/0x2650 net/ipv6/ip6_output.c:271
    inet6_csk_xmit+0x3df/0x4f0 net/ipv6/inet6_connection_sock.c:135
    __tcp_transmit_skb+0x4076/0x5b40 net/ipv4/tcp_output.c:1156
    tcp_transmit_skb net/ipv4/tcp_output.c:1172 [inline]
    tcp_write_xmit+0x39a9/0xa730 net/ipv4/tcp_output.c:2397
    __tcp_push_pending_frames+0x124/0x4e0 net/ipv4/tcp_output.c:2573
    tcp_send_fin+0xd43/0x1540 net/ipv4/tcp_output.c:3118
    tcp_close+0x16ba/0x1860 net/ipv4/tcp.c:2403
    inet_release+0x1f7/0x270 net/ipv4/af_inet.c:427
    inet6_release+0xaf/0x100 net/ipv6/af_inet6.c:470
    __sock_release net/socket.c:601 [inline]
    sock_close+0x156/0x490 net/socket.c:1273
    __fput+0x4c9/0xba0 fs/file_table.c:280
    ____fput+0x37/0x40 fs/file_table.c:313
    task_work_run+0x22e/0x2a0 kernel/task_work.c:113
    tracehook_notify_resume include/linux/tracehook.h:185 [inline]
    exit_to_usermode_loop arch/x86/entry/common.c:168 [inline]
    prepare_exit_to_usermode+0x39d/0x4d0 arch/x86/entry/common.c:199
    syscall_return_slowpath+0x90/0x5c0 arch/x86/entry/common.c:279
    do_syscall_64+0xe2/0xf0 arch/x86/entry/common.c:305
    entry_SYSCALL_64_after_hwframe+0x63/0xe7
    RIP: 0033:0x401d50
    Code: 01 f0 ff ff 0f 83 40 0d 00 00 c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 83 3d dd 8d 2d 00 00 75 14 b8 03 00 00 00 0f 05 3d 01 f0 ff ff 0f 83 14 0d 00 00 c3 48 83 ec 08 e8 7a 02 00 00
    RSP: 002b:00007fff1cf58cf8 EFLAGS: 00000246 ORIG_RAX: 0000000000000003
    RAX: 0000000000000000 RBX: 0000000000000004 RCX: 0000000000401d50
    RDX: 000000000000001c RSI: 0000000000000000 RDI: 0000000000000003
    RBP: 00000000004a9050 R08: 0000000020000040 R09: 000000000000001c
    R10: 0000000020004004 R11: 0000000000000246 R12: 0000000000402ef0
    R13: 0000000000402f80 R14: 0000000000000000 R15: 0000000000000000

    Uninit was created at:
    kmsan_save_stack_with_flags mm/kmsan/kmsan.c:201 [inline]
    kmsan_internal_poison_shadow+0x53/0xa0 mm/kmsan/kmsan.c:160
    kmsan_kmalloc+0xa4/0x130 mm/kmsan/kmsan_hooks.c:177
    kmem_cache_alloc+0x534/0xb00 mm/slub.c:2781
    reqsk_alloc include/net/request_sock.h:84 [inline]
    inet_reqsk_alloc+0xa8/0x600 net/ipv4/tcp_input.c:6384
    cookie_v6_check+0xadb/0x29a0 net/ipv6/syncookies.c:173
    tcp_v6_cookie_check net/ipv6/tcp_ipv6.c:1039 [inline]
    tcp_v6_do_rcv+0xf1c/0x1ce0 net/ipv6/tcp_ipv6.c:1344
    tcp_v6_rcv+0x60b7/0x6a30 net/ipv6/tcp_ipv6.c:1554
    ip6_protocol_deliver_rcu+0x1433/0x22f0 net/ipv6/ip6_input.c:397
    ip6_input_finish net/ipv6/ip6_input.c:438 [inline]
    NF_HOOK include/linux/netfilter.h:305 [inline]
    ip6_input+0x2af/0x340 net/ipv6/ip6_input.c:447
    dst_input include/net/dst.h:439 [inline]
    ip6_rcv_finish net/ipv6/ip6_input.c:76 [inline]
    NF_HOOK include/linux/netfilter.h:305 [inline]
    ipv6_rcv+0x683/0x710 net/ipv6/ip6_input.c:272
    __netif_receive_skb_one_core net/core/dev.c:4981 [inline]
    __netif_receive_skb net/core/dev.c:5095 [inline]
    process_backlog+0x721/0x1410 net/core/dev.c:5906
    napi_poll net/core/dev.c:6329 [inline]
    net_rx_action+0x738/0x1940 net/core/dev.c:6395
    __do_softirq+0x4ad/0x858 kernel/softirq.c:293
    do_softirq_own_stack+0x49/0x80 arch/x86/entry/entry_64.S:1052
    do_softirq kernel/softirq.c:338 [inline]
    __local_bh_enable_ip+0x199/0x1e0 kernel/softirq.c:190
    local_bh_enable+0x36/0x40 include/linux/bottom_half.h:32
    rcu_read_unlock_bh include/linux/rcupdate.h:682 [inline]
    ip6_finish_output2+0x213f/0x2670 net/ipv6/ip6_output.c:117
    ip6_finish_output+0xae4/0xbc0 net/ipv6/ip6_output.c:150
    NF_HOOK_COND include/linux/netfilter.h:294 [inline]
    ip6_output+0x5d3/0x720 net/ipv6/ip6_output.c:167
    dst_output include/net/dst.h:433 [inline]
    NF_HOOK include/linux/netfilter.h:305 [inline]
    ip6_xmit+0x1f53/0x2650 net/ipv6/ip6_output.c:271
    inet6_csk_xmit+0x3df/0x4f0 net/ipv6/inet6_connection_sock.c:135
    __tcp_transmit_skb+0x4076/0x5b40 net/ipv4/tcp_output.c:1156
    tcp_transmit_skb net/ipv4/tcp_output.c:1172 [inline]
    tcp_write_xmit+0x39a9/0xa730 net/ipv4/tcp_output.c:2397
    __tcp_push_pending_frames+0x124/0x4e0 net/ipv4/tcp_output.c:2573
    tcp_send_fin+0xd43/0x1540 net/ipv4/tcp_output.c:3118
    tcp_close+0x16ba/0x1860 net/ipv4/tcp.c:2403
    inet_release+0x1f7/0x270 net/ipv4/af_inet.c:427
    inet6_release+0xaf/0x100 net/ipv6/af_inet6.c:470
    __sock_release net/socket.c:601 [inline]
    sock_close+0x156/0x490 net/socket.c:1273
    __fput+0x4c9/0xba0 fs/file_table.c:280
    ____fput+0x37/0x40 fs/file_table.c:313
    task_work_run+0x22e/0x2a0 kernel/task_work.c:113
    tracehook_notify_resume include/linux/tracehook.h:185 [inline]
    exit_to_usermode_loop arch/x86/entry/common.c:168 [inline]
    prepare_exit_to_usermode+0x39d/0x4d0 arch/x86/entry/common.c:199
    syscall_return_slowpath+0x90/0x5c0 arch/x86/entry/common.c:279
    do_syscall_64+0xe2/0xf0 arch/x86/entry/common.c:305
    entry_SYSCALL_64_after_hwframe+0x63/0xe7

    Fixes: 336c39a03151 ("tcp: undo init congestion window on false SYNACK timeout")
    Signed-off-by: Eric Dumazet
    Cc: Yuchung Cheng
    Cc: Neal Cardwell
    Cc: Soheil Hassas Yeganeh
    Reported-by: syzbot
    Acked-by: Soheil Hassas Yeganeh
    Acked-by: Yuchung Cheng
    Signed-off-by: David S. Miller

    Eric Dumazet
     

08 Jun, 2019

1 commit


06 Jun, 2019

1 commit


31 May, 2019

1 commit

  • Based on 1 normalized pattern(s):

    this program is free software you can redistribute it and or modify
    it under the terms of the gnu general public license as published by
    the free software foundation either version 2 of the license or at
    your option any later version

    extracted by the scancode license scanner the SPDX license identifier

    GPL-2.0-or-later

    has been chosen to replace the boilerplate/reference in 3029 file(s).

    Signed-off-by: Thomas Gleixner
    Reviewed-by: Allison Randal
    Cc: linux-spdx@vger.kernel.org
    Link: https://lkml.kernel.org/r/20190527070032.746973796@linutronix.de
    Signed-off-by: Greg Kroah-Hartman

    Thomas Gleixner
     

09 Apr, 2019

1 commit

  • To allow the gateway to be either an IPv4 or IPv6 address, remove
    rt_uses_gateway from rtable and replace with rt_gw_family. If
    rt_gw_family is set it implies rt_uses_gateway. Rename rt_gateway
    to rt_gw4 to represent the IPv4 version.

    Signed-off-by: David Ahern
    Reviewed-by: Ido Schimmel
    Signed-off-by: David S. Miller

    David Ahern
     

08 Nov, 2018

2 commits

  • Set the backlog earlier in inet_dccp_listen() and inet_listen(),
    then we can avoid the redundant setting.

    Signed-off-by: Yafang Shao
    Reviewed-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Yafang Shao
     
  • Change the inet socket lookup to avoid packets arriving on a device
    enslaved to an l3mdev from matching unbound sockets by removing the
    wildcard for non sk_bound_dev_if and instead relying on check against
    the secondary device index, which will be 0 when the input device is
    not enslaved to an l3mdev and so match against an unbound socket and
    not match when the input device is enslaved.

    Change the socket binding to take the l3mdev into account, so that an
    unbound socket does not conflict with sockets bound to an l3mdev, given
    the datapath isolation now guaranteed.
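    The lookup rule from the first paragraph can be modelled as a simple
    comparison. Assumptions: bound_dev_match_model() is a hypothetical
    name, dif is the (non-zero) input device index, and sdif is the
    secondary index, which is 0 when the input device is not enslaved to
    an l3mdev.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * A socket matches if its bound device equals the input device or the
 * secondary (l3mdev) index. An unbound socket has bound_dev_if == 0,
 * so it only matches when sdif == 0, i.e. when the input device is not
 * enslaved.
 */
static bool bound_dev_match_model(int bound_dev_if, int dif, int sdif)
{
    return bound_dev_if == dif || bound_dev_if == sdif;
}
```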

    Signed-off-by: Robert Shearman
    Signed-off-by: Mike Manning
    Reviewed-by: David Ahern
    Tested-by: David Ahern
    Signed-off-by: David S. Miller

    Robert Shearman
     

03 Oct, 2018

1 commit

  • Timer handlers do not imply rcu_read_lock(), so my recent fix
    triggered a LOCKDEP warning when a SYNACK is retransmitted.

    Let's add rcu_read_lock()/rcu_read_unlock() pairs around ireq->ireq_opt
    usages instead of guessing what is done by callers, since it is
    not worth the pain.

    Get rid of ireq_opt_deref() helper since it hides the logic
    without real benefit, since it is now a standard rcu_dereference().

    Fixes: 1ad98e9d1bdf ("tcp/dccp: fix lockdep issue when SYN is backlogged")
    Signed-off-by: Eric Dumazet
    Reported-by: Willem de Bruijn
    Signed-off-by: David S. Miller

    Eric Dumazet
     

11 Aug, 2018

1 commit

  • This patch adds a BPF_PROG_TYPE_SK_REUSEPORT which can select
    a SO_REUSEPORT sk from a BPF_MAP_TYPE_REUSEPORT_ARRAY. Like other
    non SK_FILTER/CGROUP_SKB programs, it requires CAP_SYS_ADMIN.

    BPF_PROG_TYPE_SK_REUSEPORT introduces "struct sk_reuseport_kern"
    to store the bpf context instead of using the skb->cb[48].

    At the SO_REUSEPORT sk lookup time, it is in the middle of transiting
    from a lower layer (ipv4/ipv6) to an upper layer (udp/tcp). At this
    point, it is not always clear where the bpf context can be appended
    in the skb->cb[48] to avoid saving-and-restoring cb[]. Even putting
    aside the difference between ipv4-vs-ipv6 and udp-vs-tcp, it is not
    clear whether the lower layer will only ever be ipv4 and ipv6, or
    whether it will leave the cb[] untouched before transiting to the
    upper layer.

    For example, in udp_gro_receive(), it uses the 48 byte NAPI_GRO_CB
    instead of IP[6]CB and it may still modify the cb[] after calling
    the udp[46]_lib_lookup_skb(). For this reason, if skb->cb is used
    for the bpf ctx, saving-and-restoring is needed, and likely the
    whole 48 byte cb[] has to be saved and restored.

    Instead of saving, setting and restoring the cb[], this patch opts
    to create a new "struct sk_reuseport_kern" and set the needed
    values in there.

    The new BPF_PROG_TYPE_SK_REUSEPORT and "struct sk_reuseport_(kern|md)"
    will serve all ipv4/ipv6 + udp/tcp combinations. There is no protocol
    specific usage at this point and it is also in line with the current
    sock_reuseport.c implementation (i.e. no protocol specific requirement).

    In "struct sk_reuseport_md", this patch exposes data/data_end/len
    with semantics similar to other existing usages. Together
    with "bpf_skb_load_bytes()" and "bpf_skb_load_bytes_relative()",
    the bpf prog can peek anywhere in the skb. The "bind_inany" field tells
    the bpf prog that the reuseport group is bound to a local
    INANY address, which cannot be learned from the skb.

    The new "bind_inany" is added to "struct sock_reuseport" which will be
    used when running the new "BPF_PROG_TYPE_SK_REUSEPORT" bpf prog in order
    to avoid repeating the "bind INANY" test on
    "sk_v6_rcv_saddr/sk->sk_rcv_saddr" every time a bpf prog is run. It can
    only be properly initialized when a "sk->sk_reuseport" enabled sk is
    being added to a hashtable (i.e. during "reuseport_alloc()" and
    "reuseport_add_sock()").

    The new "sk_select_reuseport()" is the main helper that the
    bpf prog will use to select a SO_REUSEPORT sk. It is the only function
    that can use the new BPF_MAP_TYPE_REUSEPORT_ARRAY. As mentioned in
    the earlier patch, the validity of a selected sk is checked at
    run time in "sk_select_reuseport()". Doing the check at
    verification time is difficult and inflexible (consider the map-in-map
    use case). The runtime check compares the selected sk's reuseport_id
    with the reuseport_id that we want. This helper will return -EXXX if the
    selected sk cannot serve the incoming request (e.g. the reuseport_id
    does not match). The bpf prog can decide whether to do SK_DROP at its
    discretion.
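
    The runtime check described above can be sketched as a small userspace
    model. This is only an illustration, not the kernel helper: the
    toy_* types, toy_select_reuseport() and the TOY_ERR_* values are
    hypothetical names, and the error values are placeholders because the
    message itself only says "-EXXX".

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical placeholder error values; the commit only says "-EXXX". */
#define TOY_ERR_NO_SOCK  (-1)
#define TOY_ERR_MISMATCH (-2)

/* Toy stand-in for a socket that belongs to some reuseport group. */
struct toy_sock {
	unsigned int reuseport_id;
};

/* Toy stand-in for the per-lookup context handed to the program. */
struct toy_ctx {
	unsigned int reuseport_id;          /* group being looked up */
	const struct toy_sock *selected_sk; /* result, NULL until selected */
};

/*
 * Accept the candidate only if it belongs to the group being looked up.
 * On failure the context is left untouched and a negative value is
 * returned, so the caller can decide what to do next.
 */
static int toy_select_reuseport(struct toy_ctx *ctx,
				const struct toy_sock *candidate)
{
	assert(ctx != NULL);

	if (candidate == NULL)
		return TOY_ERR_NO_SOCK;
	if (candidate->reuseport_id != ctx->reuseport_id)
		return TOY_ERR_MISMATCH;

	ctx->selected_sk = candidate;
	return 0;
}
```

    Doing the comparison at call time keeps the model independent of how
    the candidate was found, which is the flexibility the message argues
    for.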

    When the bpf prog returns SK_PASS, the kernel will check if a
    valid sk has been selected (i.e. "reuse_kern->selected_sk != NULL").
    If so, it will use the selected sk. If not, the kernel
    will select one from "reuse->socks[]" (as before this patch).
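
    A minimal userspace sketch of that fallback, under stated assumptions:
    the toy_* names and the verdict enum are hypothetical, and the plain
    modulo used to pick from the array is an assumption made for
    illustration, not necessarily how the kernel indexes "reuse->socks[]".

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical verdicts standing in for SK_DROP / SK_PASS. */
enum toy_verdict { TOY_DROP = 0, TOY_PASS = 1 };

struct toy_sock {
	int id;
};

/* Models the reuseport group and its socks[] array. */
struct toy_reuseport {
	struct toy_sock **socks;
	unsigned int num_socks;
};

/* Models the context the program may have filled in. */
struct toy_reuse_kern {
	struct toy_sock *selected_sk; /* NULL if the program chose nothing */
};

/* Return the socket that should serve the packet, or NULL for none. */
static struct toy_sock *toy_pick_sock(const struct toy_reuseport *reuse,
				      const struct toy_reuse_kern *reuse_kern,
				      enum toy_verdict verdict,
				      unsigned int hash)
{
	assert(reuse != NULL && reuse_kern != NULL);

	if (verdict != TOY_PASS)
		return NULL;                    /* program asked for a drop */
	if (reuse_kern->selected_sk != NULL)
		return reuse_kern->selected_sk; /* program made a choice */
	if (reuse->num_socks == 0)
		return NULL;
	/* Fallback: pick from the group (modulo is an assumption). */
	return reuse->socks[hash % reuse->num_socks];
}
```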

    The SK_DROP and SK_PASS handling logic will be in the next patch.

    Signed-off-by: Martin KaFai Lau
    Acked-by: Alexei Starovoitov
    Signed-off-by: Daniel Borkmann

    Martin KaFai Lau
     

11 May, 2018

1 commit


03 Feb, 2018

1 commit

  • This patch effectively reverts commit 9f1c2674b328 ("net: memcontrol:
    defer call to mem_cgroup_sk_alloc()").

    Moving mem_cgroup_sk_alloc() to the inet_csk_accept() completely breaks
    memcg socket memory accounting, as packets received before memcg
    pointer initialization are not accounted and are causing refcounting
    underflow on socket release.

    Actually, the use-after-free problem was fixed by
    commit c0576e397508 ("net: call cgroup_sk_alloc() earlier in
    sk_clone_lock()") for the cgroup pointer.

    So, let's revert it and call mem_cgroup_sk_alloc() just before
    cgroup_sk_alloc(). This is safe, as we hold a reference to the socket
    we're cloning, and it holds a reference to the memcg.
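
    The accounting imbalance can be illustrated with a deliberately
    simplified userspace model. Everything here (the toy_* types and
    functions, byte-granular charging) is hypothetical and far simpler
    than the real memcg code; it only shows why attaching the memcg
    pointer after data has already been queued leaves the counter below
    zero on release.

```c
#include <assert.h>
#include <stddef.h>

/* Toy memory cgroup: a single signed counter of charged bytes. */
struct toy_memcg {
	long charged;
};

/* Toy socket: an optional memcg pointer plus queued receive memory. */
struct toy_sock {
	struct toy_memcg *memcg;
	long rmem;
};

/* Attach the socket to a memcg (models the pointer initialization). */
static void toy_attach(struct toy_sock *sk, struct toy_memcg *cg)
{
	sk->memcg = cg;
}

/* Queue data; it is charged only if the memcg pointer is already set. */
static void toy_receive(struct toy_sock *sk, long bytes)
{
	assert(bytes >= 0);
	sk->rmem += bytes;
	if (sk->memcg != NULL)
		sk->memcg->charged += bytes;
}

/* Release the socket; everything still queued is uncharged. */
static void toy_release(struct toy_sock *sk)
{
	if (sk->memcg != NULL)
		sk->memcg->charged -= sk->rmem;
	sk->rmem = 0;
}
```

    In this model, calling toy_receive() before toy_attach() and then
    toy_release() leaves "charged" negative, while attaching first keeps
    it balanced at zero.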

    Also, let's drop BUG_ON(mem_cgroup_is_root()) check from
    mem_cgroup_sk_alloc(). I see no reasons why bumping the root
    memcg counter is a good reason to panic, and there are no realistic
    ways to hit it.

    Signed-off-by: Roman Gushchin
    Cc: Eric Dumazet
    Cc: David S. Miller
    Cc: Johannes Weiner
    Cc: Tejun Heo
    Signed-off-by: David S. Miller

    Roman Gushchin
     

21 Dec, 2017

2 commits

  • sk_state_load is only used by AF_INET/AF_INET6, so rename it to
    inet_sk_state_load and move it into inet_sock.h.

    sk_state_store is removed as it is not used any more.

    Signed-off-by: Yafang Shao
    Signed-off-by: David S. Miller

    Yafang Shao
     
  • sk_state is a common field for struct sock, so the state
    transition tracepoint should not be a TCP specific feature.
    Currently it traces all AF_INET state transitions, so I rename this
    tracepoint to inet_sock_set_state, with some minor changes, and move it
    into trace/events/sock.h.
    We don't need to create a file named trace/events/inet_sock.h for this
    single tracepoint.

    Two helpers are introduced to trace sk_state transitions:
    - void inet_sk_state_store(struct sock *sk, int newstate);
    - void inet_sk_set_state(struct sock *sk, int state);
    As the trace header should not be included in other header files,
    they are defined in sock.c.

    Protocols such as SCTP may be compiled as a module (ko), hence export
    inet_sk_set_state().
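
    The shape of the two helpers can be sketched in userspace with C11
    atomics. This is an assumption-laden model rather than the kernel
    code: the toy_* names are hypothetical, the "tracepoint" is just a
    function that records its arguments, and using a relaxed store for
    one helper and a release store for the other is my reading of the
    intended difference between them.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Toy socket with only the state field. */
struct toy_sock {
	_Atomic int state;
};

/* Stand-in for the tracepoint: remembers the last transition seen. */
static int toy_trace_old = -1;
static int toy_trace_new = -1;
static int toy_trace_hits;

static void toy_trace_set_state(int oldstate, int newstate)
{
	toy_trace_old = oldstate;
	toy_trace_new = newstate;
	toy_trace_hits++;
}

/* Acquire load, for readers that do not hold the socket lock. */
static int toy_sk_state_load(struct toy_sock *sk)
{
	return atomic_load_explicit(&sk->state, memory_order_acquire);
}

/* Trace, then store without ordering (caller is assumed to be locked). */
static void toy_sk_set_state(struct toy_sock *sk, int state)
{
	assert(sk != NULL);
	toy_trace_set_state(
		atomic_load_explicit(&sk->state, memory_order_relaxed), state);
	atomic_store_explicit(&sk->state, state, memory_order_relaxed);
}

/* Trace, then store with release ordering to pair with the load above. */
static void toy_sk_state_store(struct toy_sock *sk, int newstate)
{
	assert(sk != NULL);
	toy_trace_set_state(
		atomic_load_explicit(&sk->state, memory_order_relaxed), newstate);
	atomic_store_explicit(&sk->state, newstate, memory_order_release);
}
```

    Keeping the trace call inside out-of-line helpers mirrors the point
    made above: callers never need to include the trace header.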

    Signed-off-by: Yafang Shao
    Signed-off-by: David S. Miller

    Yafang Shao
     

30 Oct, 2017

1 commit

  • Several conflicts here.

    NFP driver bug fix adding nfp_netdev_is_nfp_repr() check to
    nfp_fl_output() needed some adjustments because the code block is in
    an else block now.

    Parallel additions to net/pkt_cls.h and net/sch_generic.h

    A bug fix in __tcp_retransmit_skb() conflicted with some of
    the rbtree changes in net-next.

    The tc action RCU callback fixes in 'net' had some overlap with some
    of the recent tcf_block reworking.

    Signed-off-by: David S. Miller

    David S. Miller
     

26 Oct, 2017

1 commit

  • In my first attempt to fix the lockdep splat, I forgot we could
    enter inet_csk_route_req() with a freshly allocated request socket,
    for which refcount has not yet been elevated, due to complex
    SLAB_TYPESAFE_BY_RCU rules.

    We either are in rcu_read_lock() section _or_ we own a refcount on the
    request.

    Correct RCU verb to use here is rcu_dereference_check(), although it is
    not possible to prove we actually own a reference on a shared
    refcount :/
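
    The "read-side section or owned refcount" rule can be modelled in
    userspace as follows. This is a toy sketch, not kernel code: the
    toy_* names are hypothetical, the read-side section is just a depth
    counter, and a "splat" is only counted instead of printed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

static int toy_rcu_depth; /* > 0 models being inside a read-side section */
static int toy_splats;    /* number of accesses lockdep would complain about */

static void toy_rcu_read_lock(void)
{
	toy_rcu_depth++;
}

static void toy_rcu_read_unlock(void)
{
	assert(toy_rcu_depth > 0);
	toy_rcu_depth--;
}

/*
 * Checked dereference: the access is considered legal when the caller is
 * in a read-side section OR the extra condition holds. Otherwise a splat
 * is recorded. The pointer is returned either way.
 */
static void *toy_deref_check(void *p, bool extra_cond)
{
	if (toy_rcu_depth == 0 && !extra_cond)
		toy_splats++;
	return p;
}

/* Toy request socket: a reference count and an options pointer. */
struct toy_req {
	int refcnt;
	void *opt;
};

/* The extra condition here is "somebody holds a reference". */
static void *toy_ireq_opt_deref(struct toy_req *req)
{
	return toy_deref_check(req->opt, req->refcnt > 0);
}
```

    As in the message, the refcount test cannot prove that the caller is
    the one holding the reference; it only rules out the case where
    nobody does.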

    In v2, I added the ireq_opt_deref() helper and used it in three places,
    to fix other possible splats.

    [ 49.844590] lockdep_rcu_suspicious+0xea/0xf3
    [ 49.846487] inet_csk_route_req+0x53/0x14d
    [ 49.848334] tcp_v4_route_req+0xe/0x10
    [ 49.850174] tcp_conn_request+0x31c/0x6a0
    [ 49.851992] ? __lock_acquire+0x614/0x822
    [ 49.854015] tcp_v4_conn_request+0x5a/0x79
    [ 49.855957] ? tcp_v4_conn_request+0x5a/0x79
    [ 49.858052] tcp_rcv_state_process+0x98/0xdcc
    [ 49.859990] ? sk_filter_trim_cap+0x2f6/0x307
    [ 49.862085] tcp_v4_do_rcv+0xfc/0x145
    [ 49.864055] ? tcp_v4_do_rcv+0xfc/0x145
    [ 49.866173] tcp_v4_rcv+0x5ab/0xaf9
    [ 49.868029] ip_local_deliver_finish+0x1af/0x2e7
    [ 49.870064] ip_local_deliver+0x1b2/0x1c5
    [ 49.871775] ? inet_del_offload+0x45/0x45
    [ 49.873916] ip_rcv_finish+0x3f7/0x471
    [ 49.875476] ip_rcv+0x3f1/0x42f
    [ 49.876991] ? ip_local_deliver_finish+0x2e7/0x2e7
    [ 49.878791] __netif_receive_skb_core+0x6d3/0x950
    [ 49.880701] ? process_backlog+0x7e/0x216
    [ 49.882589] __netif_receive_skb+0x1d/0x5e
    [ 49.884122] process_backlog+0x10c/0x216
    [ 49.885812] net_rx_action+0x147/0x3df

    Fixes: a6ca7abe53633 ("tcp/dccp: fix lockdep splat in inet_csk_route_req()")
    Fixes: c92e8c02fe66 ("tcp/dccp: fix ireq->opt races")
    Signed-off-by: Eric Dumazet
    Reported-by: kernel test robot
    Reported-by: Maciej Żenczykowski
    Signed-off-by: David S. Miller

    Eric Dumazet
     

23 Oct, 2017

1 commit

  • This patch fixes the following lockdep splat in inet_csk_route_req()

    lockdep_rcu_suspicious
    inet_csk_route_req
    tcp_v4_send_synack
    tcp_rtx_synack
    inet_rtx_syn_ack
    tcp_fastopen_synack_time
    tcp_retransmit_timer
    tcp_write_timer_handler
    tcp_write_timer
    call_timer_fn

    Thread running inet_csk_route_req() owns a reference on the request
    socket, so we have the guarantee that ireq->ireq_opt won't be changed or
    freed.

    lockdep can enforce this invariant for us.

    Fixes: c92e8c02fe66 ("tcp/dccp: fix ireq->opt races")
    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

22 Oct, 2017

1 commit

  • There were quite a few overlapping sets of changes here.

    Daniel's bug fix for off-by-ones in the new BPF branch instructions,
    along with the added allowances for "data_end > ptr + x" forms
    collided with the metadata additions.

    Along with those three changes came verifier test cases, which in
    their final form I tried to group together properly. If I had just
    trimmed GIT's conflict tags as-is, this would have split up the
    meta tests unnecessarily.

    In the socketmap code, a set of preemption disabling changes
    overlapped with the rename of bpf_compute_data_end() to
    bpf_compute_data_pointers().

    Changes were made to the mv88e6060.c driver set addr method
    which got removed in net-next.

    The hyperv transport socket layer had a locking change in 'net'
    which overlapped with a change of socket state macro usage
    in 'net-next'.

    Signed-off-by: David S. Miller

    David S. Miller