24 Nov, 2020

1 commit

  • When the TCP stack is in SYN flood mode, the server child socket is
    created from the SYN cookie received in a TCP packet with the ACK flag
    set.

    The child socket is created when the server receives the first TCP
    packet with a valid SYN cookie from the client. Usually, this packet
    corresponds to the final step of the TCP 3-way handshake, the ACK
    packet. But it is also possible to receive a valid SYN cookie from the
    first TCP data packet sent by the client, and thus to create a child
    socket from that SYN cookie.

    Since a client socket is ready to send data as soon as it receives the
    SYN+ACK packet from the server, the client can send the ACK packet (sent
    by the TCP stack code) and the first data packet (sent by the userspace
    program) almost at the same time, and thus the server will likewise
    receive the two TCP packets with valid SYN cookies almost at the same
    instant.

    When this happens, the TCP stack code has a race condition that occurs
    between the moment a lookup is done in the established-connections
    hashtable to check for the existence of a connection for the same
    client, and the moment the child socket is added to the
    established-connections hashtable. As a consequence, this race
    condition can lead to a situation where we add two child sockets to
    the established-connections hashtable and deliver two sockets for the
    same client to the userspace program.

    This patch fixes the race condition by checking whether a child socket
    already exists for the same client when we are adding the second child
    socket to the established-connections hashtable. If one already
    exists, we drop the packet and discard the second child socket.
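
    A rough userspace model of the fixed insert path (illustrative types,
    with a pthread mutex standing in for the kernel's bucket lock; not
    the actual kernel code):

        #include <pthread.h>
        #include <stdbool.h>
        #include <stddef.h>

        struct sock { struct sock *next; unsigned long tuple_hash; };

        struct ehash_bucket {
                pthread_mutex_t lock;
                struct sock *chain;
        };

        /* Return true if @sk was inserted, false if a socket for the same
         * connection already exists, in which case the caller drops the
         * packet and discards @sk. Lookup and insert share one critical
         * section, closing the window described above. */
        bool ehash_insert_checked(struct ehash_bucket *b, struct sock *sk)
        {
                struct sock *cur;

                pthread_mutex_lock(&b->lock);
                for (cur = b->chain; cur; cur = cur->next) {
                        if (cur->tuple_hash == sk->tuple_hash) {
                                pthread_mutex_unlock(&b->lock);
                                return false;   /* duplicate child socket */
                        }
                }
                sk->next = b->chain;
                b->chain = sk;
                pthread_mutex_unlock(&b->lock);
                return true;
        }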

    Signed-off-by: Ricardo Dias
    Signed-off-by: Eric Dumazet
    Link: https://lore.kernel.org/r/20201120111133.GA67501@rdias-suse-pc.lan
    Signed-off-by: Jakub Kicinski

    Ricardo Dias
     

01 Sep, 2020

1 commit


12 Aug, 2020

1 commit

  • In the case of TPROXY, bind_conflict optimizations for SO_REUSEADDR or
    SO_REUSEPORT are broken, possibly resulting in O(n) instead of O(1) bind
    behaviour or in the incorrect reuse of a bind.

    The kernel keeps track, for each bind_bucket, of whether all sockets
    in the bind_bucket support SO_REUSEADDR or SO_REUSEPORT, in two
    fastreuse flags. These flags allow skipping the costly bind_conflict
    check when possible (meaning when all sockets have the proper
    SO_REUSE option).

    For every socket added to a bind_bucket, these flags need to be
    updated. As soon as a socket that does not support reuse is added, the
    corresponding flag is set to false and will never go back to true,
    unless the bind_bucket is deleted.

    Note that there is no mechanism to re-evaluate these flags when a socket
    is removed (this might make sense when removing a socket that would not
    allow reuse; this leaves room for a future patch).

    For this optimization to work, it is mandatory that these flags are
    properly initialized and updated.

    When a child socket is created from a listen socket in
    __inet_inherit_port, the TPROXY case could create a new bind_bucket
    without properly initializing these flags, thus preventing the
    optimization from working. Alternatively, a socket not allowing reuse
    could be added to an existing bind_bucket without updating the flags,
    causing bind_conflict never to be called when it should be.

    Call inet_csk_update_fastreuse when __inet_inherit_port decides to create
    a new bind_bucket or use a different bind_bucket than the one of the
    listen socket.
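
    A minimal sketch of the bookkeeping this relies on (illustrative
    field and function names, not the kernel's inet_bind_bucket API):

        #include <stdbool.h>

        struct bind_bucket {
                bool fastreuse;         /* all sockets set SO_REUSEADDR */
                bool fastreuseport;     /* all sockets set SO_REUSEPORT */
        };

        struct sock_opts { bool reuseaddr, reuseport; };

        /* Must run for a fresh bucket and for every socket added later. */
        void update_fastreuse(struct bind_bucket *tb,
                              const struct sock_opts *sk, bool bucket_is_new)
        {
                if (bucket_is_new) {
                        /* The first socket defines the initial state. */
                        tb->fastreuse = sk->reuseaddr;
                        tb->fastreuseport = sk->reuseport;
                        return;
                }
                /* Flags only ever go from true to false: one non-reuse
                 * socket forces the slow bind_conflict path until the
                 * bucket is deleted. */
                if (!sk->reuseaddr)
                        tb->fastreuse = false;
                if (!sk->reuseport)
                        tb->fastreuseport = false;
        }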

    Fixes: 093d282321da ("tproxy: fix hash locking issue when using port redirection in __inet_inherit_port()")
    Acked-by: Matthieu Baerts
    Signed-off-by: Tim Froidcoeur
    Signed-off-by: David S. Miller

    Tim Froidcoeur
     

18 Jul, 2020

2 commits

  • Run a BPF program before looking up a listening socket on the receive
    path. The program selects a listening socket to yield as the result of
    the socket lookup by calling the bpf_sk_assign() helper and returning
    SK_PASS. The program can revert its decision by assigning a NULL socket
    with bpf_sk_assign().

    Alternatively, the BPF program can also fail the lookup by returning
    SK_DROP, or let the lookup continue as usual by returning SK_PASS when
    no socket has been selected with bpf_sk_assign().

    This lets the user match packets with listening sockets freely at the last
    possible point on the receive path, where we know that packets are destined
    for local delivery after undergoing policing, filtering, and routing.

    With BPF code selecting the socket, directing packets destined to an IP
    range or to a port range to a single socket becomes possible.

    In case multiple programs are attached, they are run in series in the
    order in which they were attached. The end result is determined from
    the return codes of all the programs according to the following rules
    (a minimal program sketch follows the list):

    1. If any program returned SK_PASS and selected a valid socket, the socket
    is used as result of socket lookup.
    2. If more than one program returned SK_PASS and selected a socket,
    last selection takes effect.
    3. If any program returned SK_DROP, and no program returned SK_PASS and
    selected a socket, socket lookup fails with -ECONNREFUSED.
    4. If all programs returned SK_PASS and none of them selected a socket,
    socket lookup continues to htable-based lookup.
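
    A minimal program of this shape (a sketch following current libbpf
    conventions; the map name, port range and steering policy are made
    up for illustration):

        #include <linux/bpf.h>
        #include <bpf/bpf_helpers.h>

        struct {
                __uint(type, BPF_MAP_TYPE_SOCKMAP);
                __uint(max_entries, 1);
                __type(key, __u32);
                __type(value, __u64);
        } target_sock SEC(".maps");

        SEC("sk_lookup")
        int steer_port_range(struct bpf_sk_lookup *ctx)
        {
                __u32 zero = 0;
                struct bpf_sock *sk;
                long err;

                /* Steer a whole port range to one listening socket. */
                if (ctx->local_port < 7000 || ctx->local_port > 7999)
                        return SK_PASS;  /* not ours: lookup continues */

                sk = bpf_map_lookup_elem(&target_sock, &zero);
                if (!sk)
                        return SK_DROP;  /* fail with -ECONNREFUSED */

                err = bpf_sk_assign(ctx, sk, 0); /* select this socket */
                bpf_sk_release(sk);
                return err ? SK_DROP : SK_PASS;
        }

        char _license[] SEC("license") = "GPL";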

    Suggested-by: Marek Majkowski
    Signed-off-by: Jakub Sitnicki
    Signed-off-by: Alexei Starovoitov
    Link: https://lore.kernel.org/bpf/20200717103536.397595-5-jakub@cloudflare.com

    Jakub Sitnicki
     
  • Prepare for calling into reuseport from __inet_lookup_listener as well.

    Signed-off-by: Jakub Sitnicki
    Signed-off-by: Alexei Starovoitov
    Acked-by: Andrii Nakryiko
    Link: https://lore.kernel.org/bpf/20200717103536.397595-4-jakub@cloudflare.com

    Jakub Sitnicki
     

14 Dec, 2019

1 commit

  • Michal Kubecek and Firo Yang did a very nice analysis of crashes
    happening in __inet_lookup_established().

    Since a TCP socket can go from TCP_ESTABLISHED to TCP_LISTEN
    (via a close()/socket()/listen() cycle) without an RCU grace period,
    I should not have changed the listeners' linkage in their hash table.

    They must use the nulls protocol (Documentation/RCU/rculist_nulls.txt),
    so that a lookup can detect that a socket in a hash list was moved to
    another one.

    Since we added code in commit d296ba60d8e2 ("soreuseport: Resolve
    merge conflict for v4/v6 ordering fix"), we have to add
    hlist_nulls_add_tail_rcu() helper.
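
    The pattern at stake can be modelled in userspace (a sketch: odd
    "nulls" terminators encode the bucket index; RCU barriers and the
    writer side are omitted):

        #include <stdint.h>
        #include <stddef.h>

        #define IS_NULLS(p)     ((uintptr_t)(p) & 1)
        #define NULLS_VALUE(p)  ((uintptr_t)(p) >> 1)

        struct node { struct node *next; int key; };

        struct node *lookup(struct node *const *bucket,
                            unsigned int slot, int key)
        {
                struct node *p;
        restart:
                for (p = bucket[slot]; !IS_NULLS(p); p = p->next)
                        if (p->key == key)
                                return p;
                /* Fell off the end of a chain: if the terminator names a
                 * different slot, a concurrent close()/socket()/listen()
                 * cycle moved nodes mid-walk and the lookup must restart. */
                if (NULLS_VALUE(p) != slot)
                        goto restart;
                return NULL;
        }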

    Fixes: 3b24d854cb35 ("tcp/dccp: do not touch listener sk_refcnt under synflood")
    Signed-off-by: Eric Dumazet
    Reported-by: Michal Kubecek
    Reported-by: Firo Yang
    Reviewed-by: Michal Kubecek
    Link: https://lore.kernel.org/netdev/20191120083919.GH27852@unicorn.suse.cz/
    Signed-off-by: Jakub Kicinski

    Eric Dumazet
     

31 Oct, 2019

1 commit

  • This socket field can be read and written by concurrent cpus.

    Use READ_ONCE() and WRITE_ONCE() annotations to document this,
    and avoid some compiler 'optimizations'.

    KCSAN reported:

    BUG: KCSAN: data-race in tcp_v4_rcv / tcp_v4_rcv

    write to 0xffff88812220763c of 4 bytes by interrupt on cpu 0:
    sk_incoming_cpu_update include/net/sock.h:953 [inline]
    tcp_v4_rcv+0x1b3c/0x1bb0 net/ipv4/tcp_ipv4.c:1934
    ip_protocol_deliver_rcu+0x4d/0x420 net/ipv4/ip_input.c:204
    ip_local_deliver_finish+0x110/0x140 net/ipv4/ip_input.c:231
    NF_HOOK include/linux/netfilter.h:305 [inline]
    NF_HOOK include/linux/netfilter.h:299 [inline]
    ip_local_deliver+0x133/0x210 net/ipv4/ip_input.c:252
    dst_input include/net/dst.h:442 [inline]
    ip_rcv_finish+0x121/0x160 net/ipv4/ip_input.c:413
    NF_HOOK include/linux/netfilter.h:305 [inline]
    NF_HOOK include/linux/netfilter.h:299 [inline]
    ip_rcv+0x18f/0x1a0 net/ipv4/ip_input.c:523
    __netif_receive_skb_one_core+0xa7/0xe0 net/core/dev.c:5010
    __netif_receive_skb+0x37/0xf0 net/core/dev.c:5124
    process_backlog+0x1d3/0x420 net/core/dev.c:5955
    napi_poll net/core/dev.c:6392 [inline]
    net_rx_action+0x3ae/0xa90 net/core/dev.c:6460
    __do_softirq+0x115/0x33f kernel/softirq.c:292
    do_softirq_own_stack+0x2a/0x40 arch/x86/entry/entry_64.S:1082
    do_softirq.part.0+0x6b/0x80 kernel/softirq.c:337
    do_softirq kernel/softirq.c:329 [inline]
    __local_bh_enable_ip+0x76/0x80 kernel/softirq.c:189

    read to 0xffff88812220763c of 4 bytes by interrupt on cpu 1:
    sk_incoming_cpu_update include/net/sock.h:952 [inline]
    tcp_v4_rcv+0x181a/0x1bb0 net/ipv4/tcp_ipv4.c:1934
    ip_protocol_deliver_rcu+0x4d/0x420 net/ipv4/ip_input.c:204
    ip_local_deliver_finish+0x110/0x140 net/ipv4/ip_input.c:231
    NF_HOOK include/linux/netfilter.h:305 [inline]
    NF_HOOK include/linux/netfilter.h:299 [inline]
    ip_local_deliver+0x133/0x210 net/ipv4/ip_input.c:252
    dst_input include/net/dst.h:442 [inline]
    ip_rcv_finish+0x121/0x160 net/ipv4/ip_input.c:413
    NF_HOOK include/linux/netfilter.h:305 [inline]
    NF_HOOK include/linux/netfilter.h:299 [inline]
    ip_rcv+0x18f/0x1a0 net/ipv4/ip_input.c:523
    __netif_receive_skb_one_core+0xa7/0xe0 net/core/dev.c:5010
    __netif_receive_skb+0x37/0xf0 net/core/dev.c:5124
    process_backlog+0x1d3/0x420 net/core/dev.c:5955
    napi_poll net/core/dev.c:6392 [inline]
    net_rx_action+0x3ae/0xa90 net/core/dev.c:6460
    __do_softirq+0x115/0x33f kernel/softirq.c:292
    run_ksoftirqd+0x46/0x60 kernel/softirq.c:603
    smpboot_thread_fn+0x37d/0x4a0 kernel/smpboot.c:165

    Reported by Kernel Concurrency Sanitizer on:
    CPU: 1 PID: 16 Comm: ksoftirqd/1 Not tainted 5.4.0-rc3+ #0
    Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
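
    The annotation pattern, with simplified versions of the kernel's
    macros (a sketch; the real sk_incoming_cpu_update() lives in
    include/net/sock.h):

        #define READ_ONCE(x)     (*(const volatile typeof(x) *)&(x))
        #define WRITE_ONCE(x, v) (*(volatile typeof(x) *)&(x) = (v))

        struct sock { int sk_incoming_cpu; };

        static void sk_incoming_cpu_update(struct sock *sk, int cpu)
        {
                /* Plain accesses would let the compiler tear, refetch or
                 * elide loads/stores that other CPUs perform concurrently;
                 * the _ONCE() annotations document the race and forbid
                 * such transformations. */
                if (READ_ONCE(sk->sk_incoming_cpu) != cpu)
                        WRITE_ONCE(sk->sk_incoming_cpu, cpu);
        }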

    Signed-off-by: Eric Dumazet
    Reported-by: syzbot
    Signed-off-by: David S. Miller

    Eric Dumazet
     

08 Jun, 2019

1 commit


06 Jun, 2019

1 commit


31 May, 2019

1 commit

  • Based on 1 normalized pattern(s):

    this program is free software you can redistribute it and or modify
    it under the terms of the gnu general public license as published by
    the free software foundation either version 2 of the license or at
    your option any later version

    extracted by the scancode license scanner the SPDX license identifier

    GPL-2.0-or-later

    has been chosen to replace the boilerplate/reference in 3029 file(s).

    Signed-off-by: Thomas Gleixner
    Reviewed-by: Allison Randal
    Cc: linux-spdx@vger.kernel.org
    Link: https://lkml.kernel.org/r/20190527070032.746973796@linutronix.de
    Signed-off-by: Greg Kroah-Hartman

    Thomas Gleixner
     

25 Dec, 2018

1 commit

  • Patch eedbbb0d98b2 "net: dccp: initialize (addr,port) ..."
    added a call to inet_hashinfo2_init() from dccp_init().

    However, inet_hashinfo2_init() is marked __init, and
    thus the kernel panics when dccp is loaded as a module. Removing
    the __init tag from inet_hashinfo2_init() is not feasible because
    it calls into __init functions in mm.

    This patch adds an inet_hashinfo2_init_mod() function that can
    be called after the init phase is done; changes dccp_init() to
    call the new function; and un-marks inet_hashinfo2_init() as
    exported.

    Fixes: eedbbb0d98b2 ("net: dccp: initialize (addr,port) ...")
    Reported-by: kernel test robot
    Signed-off-by: Peter Oskolkov
    Signed-off-by: David S. Miller

    Peter Oskolkov
     

18 Dec, 2018

1 commit

  • Commit d9fbc7f6431f "net: tcp: prefer listeners bound to an address"
    removes port-only listener lookups. This caused segfaults in DCCP
    lookups because DCCP did not initialize the (addr,port) hashtable.

    This patch adds said initialization.

    The only non-trivial issue here is the size of the new hashtable.
    It seemed reasonable to make it match the size of the port-only
    hashtable (= INET_LHTABLE_SIZE) that was used previously. Other
    parameters to inet_hashinfo2_init() match those used in TCP.

    V2 changes: marked inet_hashinfo2_init as an exported symbol
    so that DCCP compiles when configured as a module.

    Tested: syzkaller issues fixed; the second patch in the patchset
    tests that DCCP lookups work correctly.

    Fixes: d9fbc7f6431f ("net: tcp: prefer listeners bound to an address")
    Reported-by: syzkaller
    Signed-off-by: Peter Oskolkov
    Signed-off-by: David S. Miller

    Peter Oskolkov
     

15 Dec, 2018

1 commit

  • A relatively common use case is to have several IPs configured
    on a host, and have different listeners for each of them. We would
    like to add a "catch all" listener on addr_any, to match incoming
    connections not served by any of the listeners bound to a specific
    address.

    However, port-only lookups can match addr_any sockets when sockets
    listening on specific addresses are present, if the SO_REUSEPORT flag
    is set. This patch eliminates lookups into the port-only hashtable,
    as lookups by (addr,port) tuple are easily available.

    In addition, compute_score() is tweaked to _not_ match
    addr_any sockets to specific addresses, as hash collisions
    could result in the unwanted behavior described above.

    Tested: the patch compiles; full test in the last patch in this
    patchset. Existing reuseport_* selftests also pass.
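
    The tweak reduces to requiring exact address equality in the score
    function (a simplified sketch, not the full compute_score()):

        #include <stdint.h>

        struct lsock { uint32_t rcv_saddr; uint16_t num; }; /* bound addr, port */

        int compute_score(const struct lsock *sk, uint32_t daddr, uint16_t hnum)
        {
                if (sk->num != hnum)
                        return -1;      /* wrong port: no match */
                /* Exact equality: an addr_any socket (rcv_saddr == 0) no
                 * longer matches a specific daddr even on a hash collision;
                 * it is only found by the separate INADDR_ANY:PORT lookup. */
                if (sk->rcv_saddr != daddr)
                        return -1;
                return 1;               /* matched; refinements omitted */
        }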

    Suggested-by: Eric Dumazet
    Signed-off-by: Peter Oskolkov
    Signed-off-by: David S. Miller

    Peter Oskolkov
     

08 Nov, 2018

2 commits

  • The commit a04a480d4392 ("net: Require exact match for TCP socket
    lookups if dif is l3mdev") only ensures that the correct socket is
    selected for packets in a VRF. However, there is no guarantee that
    the unbound socket will be selected for packets when not in a VRF.
    By also checking for a device match in compute_score() when there is
    no bound device, and attaching a score to that match, the unbound
    socket is selected. And if a failure is returned when there is no
    device match, this ensures that bound sockets are never selected,
    even if there is no unbound socket.

    Signed-off-by: Mike Manning
    Reviewed-by: David Ahern
    Tested-by: David Ahern
    Signed-off-by: David S. Miller

    Mike Manning
     
  • Change the inet socket lookup so that packets arriving on a device
    enslaved to an l3mdev no longer match unbound sockets: remove the
    wildcard match for a zero sk_bound_dev_if and instead rely on a check
    against the secondary device index (sdif), which is 0 when the input
    device is not enslaved to an l3mdev. An unbound socket therefore
    matches when the input device is not enslaved, and does not match
    when it is.

    Change the socket binding to take the l3mdev into account, allowing
    an unbound socket to not conflict with sockets bound to an l3mdev,
    given the datapath isolation now guaranteed.
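
    The resulting match rule can be summarized in one predicate (a
    sketch with illustrative names; l3mdev_accept stands for the
    tcp_l3mdev_accept sysctl):

        #include <stdbool.h>

        bool bound_dev_eq(bool l3mdev_accept, int bound_dev_if,
                          int dif, int sdif)
        {
                if (!bound_dev_if)      /* unbound socket */
                        return !sdif || l3mdev_accept; /* sdif == 0 means the
                                                          input device is not
                                                          enslaved */
                return bound_dev_if == dif || bound_dev_if == sdif;
        }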

    Signed-off-by: Robert Shearman
    Signed-off-by: Mike Manning
    Reviewed-by: David Ahern
    Tested-by: David Ahern
    Signed-off-by: David S. Miller

    Robert Shearman
     

31 Oct, 2018

1 commit

  • Move remaining definitions and declarations from include/linux/bootmem.h
    into include/linux/memblock.h and remove the redundant header.

    The includes were replaced with the semantic patch below and then
    semi-automated removal of the duplicated '#include <linux/memblock.h>'.

    @@
    @@
    - #include <linux/bootmem.h>
    + #include <linux/memblock.h>

    [sfr@canb.auug.org.au: dma-direct: fix up for the removal of linux/bootmem.h]
    Link: http://lkml.kernel.org/r/20181002185342.133d1680@canb.auug.org.au
    [sfr@canb.auug.org.au: powerpc: fix up for removal of linux/bootmem.h]
    Link: http://lkml.kernel.org/r/20181005161406.73ef8727@canb.auug.org.au
    [sfr@canb.auug.org.au: x86/kaslr, ACPI/NUMA: fix for linux/bootmem.h removal]
    Link: http://lkml.kernel.org/r/20181008190341.5e396491@canb.auug.org.au
    Link: http://lkml.kernel.org/r/1536927045-23536-30-git-send-email-rppt@linux.vnet.ibm.com
    Signed-off-by: Mike Rapoport
    Signed-off-by: Stephen Rothwell
    Acked-by: Michal Hocko
    Cc: Catalin Marinas
    Cc: Chris Zankel
    Cc: "David S. Miller"
    Cc: Geert Uytterhoeven
    Cc: Greentime Hu
    Cc: Greg Kroah-Hartman
    Cc: Guan Xuetao
    Cc: Ingo Molnar
    Cc: "James E.J. Bottomley"
    Cc: Jonas Bonn
    Cc: Jonathan Corbet
    Cc: Ley Foon Tan
    Cc: Mark Salter
    Cc: Martin Schwidefsky
    Cc: Matt Turner
    Cc: Michael Ellerman
    Cc: Michal Simek
    Cc: Palmer Dabbelt
    Cc: Paul Burton
    Cc: Richard Kuo
    Cc: Richard Weinberger
    Cc: Rich Felker
    Cc: Russell King
    Cc: Serge Semin
    Cc: Thomas Gleixner
    Cc: Tony Luck
    Cc: Vineet Gupta
    Cc: Yoshinori Sato
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mike Rapoport
     

11 Aug, 2018

2 commits

  • This patch allows a BPF_PROG_TYPE_SK_REUSEPORT bpf prog to select a
    SO_REUSEPORT sk from a BPF_MAP_TYPE_REUSEPORT_ARRAY introduced in
    the earlier patch. "bpf_run_sk_reuseport()" will return -ECONNREFUSED
    when the BPF_PROG_TYPE_SK_REUSEPORT prog returns SK_DROP.
    The callers, in inet[6]_hashtable.c and ipv[46]/udp.c, are modified to
    handle this case and return NULL immediately instead of continuing the
    sk search from its hashtable.

    It re-uses the existing SO_ATTACH_REUSEPORT_EBPF setsockopt to attach
    BPF_PROG_TYPE_SK_REUSEPORT. The "sk_reuseport_attach_bpf()" will check
    if the attaching bpf prog is in the new SK_REUSEPORT or the existing
    SOCKET_FILTER type and then check different things accordingly.

    One level of "__reuseport_attach_prog()" call is removed. The
    "sk_unhashed() && ..." and "sk->sk_reuseport_cb" tests are pushed
    back to "reuseport_attach_prog()" in sock_reuseport.c. sock_reuseport.c
    seems to have more knowledge on those test requirements than filter.c.
    In "reuseport_attach_prog()", after new_prog is attached to reuse->prog,
    the old_prog (if any) is also directly freed instead of returning the
    old_prog to the caller and asking the caller to free.

    The sysctl_optmem_max check is moved back to
    "sk_reuseport_attach_filter()" and "sk_reuseport_attach_bpf()".
    As with other bpf prog types, the new BPF_PROG_TYPE_SK_REUSEPORT is
    only bounded by the usual "bpf_prog_charge_memlock()" during load
    time, instead of being bounded by both bpf_prog_charge_memlock and
    sysctl_optmem_max.

    Signed-off-by: Martin KaFai Lau
    Acked-by: Alexei Starovoitov
    Signed-off-by: Daniel Borkmann

    Martin KaFai Lau
     
  • This patch adds a BPF_PROG_TYPE_SK_REUSEPORT which can select
    a SO_REUSEPORT sk from a BPF_MAP_TYPE_REUSEPORT_ARRAY. Like other
    non-SK_FILTER/CGROUP_SKB programs, it requires CAP_SYS_ADMIN.

    BPF_PROG_TYPE_SK_REUSEPORT introduces "struct sk_reuseport_kern"
    to store the bpf context instead of using the skb->cb[48].

    At the SO_REUSEPORT sk lookup time, we are in the middle of transiting
    from a lower layer (ipv4/ipv6) to an upper layer (udp/tcp). At this
    point, it is not always clear where the bpf context can be appended
    in the skb->cb[48] to avoid saving-and-restoring cb[], even putting
    aside the differences between ipv4 vs ipv6 and udp vs tcp. It is also
    not clear whether the lower layer will only ever be ipv4 and ipv6,
    and whether it will refrain from touching the cb[] again before
    transiting to the upper layer.

    For example, in udp_gro_receive(), it uses the 48-byte NAPI_GRO_CB
    instead of IP[6]CB and it may still modify the cb[] after calling
    udp[46]_lib_lookup_skb(). Because of the above reasons, if
    skb->cb is used for the bpf ctx, saving-and-restoring is needed
    and likely the whole 48-byte cb[] has to be saved and restored.

    Instead of saving, setting and restoring the cb[], this patch opts
    to create a new "struct sk_reuseport_kern" and set the needed
    values in it.

    The new BPF_PROG_TYPE_SK_REUSEPORT and "struct sk_reuseport_(kern|md)"
    will serve all ipv4/ipv6 + udp/tcp combinations. There is no
    protocol-specific usage at this point, and this is also in line with
    the current sock_reuseport.c implementation (i.e. no protocol-specific
    requirement).

    In "struct sk_reuseport_md", this patch exposes data/data_end/len
    with semantic similar to other existing usages. Together
    with "bpf_skb_load_bytes()" and "bpf_skb_load_bytes_relative()",
    the bpf prog can peek anywhere in the skb. The "bind_inany" tells
    the bpf prog that the reuseport group is bind-ed to a local
    INANY address which cannot be learned from skb.

    The new "bind_inany" is added to "struct sock_reuseport" which will be
    used when running the new "BPF_PROG_TYPE_SK_REUSEPORT" bpf prog in order
    to avoid repeating the "bind INANY" test on
    "sk_v6_rcv_saddr/sk->sk_rcv_saddr" every time a bpf prog is run. It can
    only be properly initialized when a "sk->sk_reuseport" enabled sk is
    adding to a hashtable (i.e. during "reuseport_alloc()" and
    "reuseport_add_sock()").

    The new "sk_select_reuseport()" is the main helper that the
    bpf prog will use to select a SO_REUSEPORT sk. It is the only function
    that can use the new BPF_MAP_TYPE_REUSEPORT_ARRAY. As mentioned in
    the earlier patch, the validity of a selected sk is checked in
    run time in "sk_select_reuseport()". Doing the check in
    verification time is difficult and inflexible (consider the map-in-map
    use case). The runtime check is to compare the selected sk's reuseport_id
    with the reuseport_id that we want. This helper will return -EXXX if the
    selected sk cannot serve the incoming request (e.g. reuseport_id
    not match). The bpf prog can decide if it wants to do SK_DROP as its
    discretion.

    When the bpf prog returns SK_PASS, the kernel will check if a
    valid sk has been selected (i.e. "reuse_kern->selected_sk != NULL").
    If it does , it will use the selected sk. If not, the kernel
    will select one from "reuse->socks[]" (as before this patch).

    The SK_DROP and SK_PASS handling logic will be in the next patch.
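
    A minimal program of this kind (a sketch following current libbpf
    conventions; note the map type landed in mainline under the name
    BPF_MAP_TYPE_REUSEPORT_SOCKARRAY, and the policy here is made up):

        #include <linux/bpf.h>
        #include <bpf/bpf_helpers.h>

        struct {
                __uint(type, BPF_MAP_TYPE_REUSEPORT_SOCKARRAY);
                __uint(max_entries, 2);
                __type(key, __u32);
                __type(value, __u64);
        } socks SEC(".maps");

        SEC("sk_reuseport")
        int select_sock(struct sk_reuseport_md *md)
        {
                __u32 index = md->hash % 2; /* trivial policy: flow hash */

                /* The helper validates the chosen sk at run time; a
                 * non-zero return means the slot cannot serve the
                 * request. */
                if (bpf_sk_select_reuseport(md, &socks, &index, 0) == 0)
                        return SK_PASS; /* selected_sk set: kernel uses it */
                return SK_DROP;         /* refuse instead of falling back */
        }

        char _license[] SEC("license") = "GPL";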

    Signed-off-by: Martin KaFai Lau
    Acked-by: Alexei Starovoitov
    Signed-off-by: Daniel Borkmann

    Martin KaFai Lau
     

20 Jun, 2018

1 commit

  • Similar to 69678bcd4d2d ("udp: fix SO_BINDTODEVICE"), TCP socket
    lookups need to fail if dev_match is not true. Currently, a packet to
    a given port can match a socket bound to a device when it should not.
    In the VRF case, this causes the lookup to hit a VRF socket and not a
    global socket, resulting in a response trying to go through the VRF
    when it should not.

    Fixes: 3fa6f616a7a4d ("net: ipv4: add second dif to inet socket lookups")
    Fixes: 4297a0ef08572 ("net: ipv6: add second dif to inet6 socket lookups")
    Reported-by: Lou Berger
    Diagnosed-by: Renato Westphal
    Tested-by: Renato Westphal
    Signed-off-by: David Ahern
    Signed-off-by: David S. Miller

    David Ahern
     

01 Feb, 2018

1 commit

  • With gcc-4.1.2:

    net/ipv4/inet_hashtables.c: In function ‘inet_unhash’:
    net/ipv4/inet_hashtables.c:628: warning: ‘ilb’ may be used uninitialized in this function

    While this is a false positive, it can easily be avoided by using the
    pointer itself as the canary variable.

    Signed-off-by: Geert Uytterhoeven
    Acked-by: Arnd Bergmann
    Signed-off-by: David S. Miller

    Geert Uytterhoeven
     

21 Dec, 2017

1 commit

  • As sk_state is a common field of struct sock, the state transition
    tracepoint should not be a TCP-specific feature. Currently it traces
    all AF_INET state transitions, so I rename this tracepoint to
    inet_sock_set_state, with some minor changes, and move it into
    trace/events/sock.h. We don't need to create a file named
    trace/events/inet_sock.h for this one single tracepoint.

    Two helpers are introduced to trace sk_state transitions:
    - void inet_sk_state_store(struct sock *sk, int newstate);
    - void inet_sk_set_state(struct sock *sk, int state);
    As trace headers should not be included in other header files,
    they are defined in sock.c.

    Protocols such as SCTP may be compiled as modules, hence
    inet_sk_set_state() is exported.

    Signed-off-by: Yafang Shao
    Signed-off-by: David S. Miller

    Yafang Shao
     

03 Dec, 2017

2 commits

  • The current listener hashtable is hashed by port only.
    When a process is listening on many IP addresses with the same port
    (e.g. [IP1]:443, [IP2]:443... [IPN]:443), the inet[6]_lookup_listener()
    performance degrades to a linked-list walk. It is prone to SYN
    attacks.

    UDP had a similar issue and a second hashtable was added to resolve it.

    This patch adds a second hashtable for the listener's sockets.
    The second hashtable is hashed by port and address.

    It cannot reuse the existing skc_portaddr_node which is shared
    with skc_bind_node. TCP listener needs to use skc_bind_node.
    Instead, this patch adds a hlist_node 'icsk_listen_portaddr_node' to
    the inet_connection_sock which the listener (like TCP) also belongs to.

    The new portaddr hashtable may need two lookups (first by IP:PORT,
    then by INADDR_ANY:PORT if the IP:PORT entry is not found). Hence,
    it implements a cutoff similar to UDP's, such that it will only
    consult the new portaddr hashtable if the current port-only hashtable
    has more than 10 sockets in its linked list.

    lhash2 and lhash2_mask are added to 'struct inet_hashinfo'. I take
    this chance to plug a 4-byte hole. It is done by first moving
    the existing bind_bucket_cachep up and then adding the new
    (int lhash2_mask, *lhash2) after the existing bhash_size.
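
    The resulting lookup strategy might be sketched as follows (userspace
    pseudo-C; the bucket/scan helpers are illustrative stand-ins, not
    kernel API):

        #include <stdint.h>
        #include <stddef.h>

        struct sock;
        struct bucket { int count; /* chain omitted */ };

        struct bucket *port_bucket(uint16_t port);      /* old, port-only */
        struct bucket *portaddr_bucket(uint32_t addr, uint16_t port); /* lhash2 */
        struct sock *scan(struct bucket *b, uint32_t addr, uint16_t port);

        struct sock *listener_lookup(uint32_t daddr, uint16_t port)
        {
                struct bucket *pb = port_bucket(port);
                struct sock *sk;

                if (pb->count <= 10)    /* short chain: the old walk is fine */
                        return scan(pb, daddr, port);

                /* Long chain: consult lhash2, exact address first, then
                 * the INADDR_ANY:PORT bucket. */
                sk = scan(portaddr_bucket(daddr, port), daddr, port);
                if (!sk)        /* 0 == INADDR_ANY */
                        sk = scan(portaddr_bucket(0, port), 0, port);
                return sk;
        }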

    Signed-off-by: Martin KaFai Lau
    Reviewed-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Martin KaFai Lau
     
  • This patch adds a count to 'struct inet_listen_hashbucket'.
    It counts how many sockets are hashed into a bucket. It will be
    used to decide whether the (to-be-added) portaddr listener hashtable
    should be used during inet[6]_lookup_listener().

    Signed-off-by: Martin KaFai Lau
    Reviewed-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Martin KaFai Lau
     

30 Nov, 2017

1 commit

  • Since commit e32ea7e74727 ("soreuseport: fast reuseport UDP socket
    selection") and commit c125e80b8868 ("soreuseport: fast reuseport
    TCP socket selection") the relevant reuseport socket matching the current
    packet is selected by the reuseport_select_sock() call. The only
    exceptions are invalid BPF filters/filters returning out-of-range
    indices.
    In the latter case the code implicitly falls back to using the hash
    demultiplexing, but instead of selecting the socket inside the
    reuseport_select_sock() function, it relies on the hash selection
    logic introduced with the early soreuseport implementation.

    With this patch, in case of a BPF filter returning a bad socket
    index value, we fall back to hash-based selection inside the
    reuseport_select_sock() body, so that we can drop some duplicate
    code in the ipv4 and ipv6 stack.

    This also allows faster lookup in the above scenario and will allow
    us to avoid computing the hash value for successful, BPF based
    demultiplexing - in a later patch.
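
    The shape of the change, as a userspace model (illustrative names;
    the real function is reuseport_select_sock() in
    net/core/sock_reuseport.c):

        #include <stdint.h>
        #include <stddef.h>

        struct sock;
        struct reuseport_group {
                unsigned int num_socks;
                struct sock *socks[16];
        };

        /* bpf_index < 0 means no BPF program ran; out-of-range indices
         * now fall back to hash selection here as well, instead of in
         * the ipv4/ipv6 callers. Assumes a non-empty group. */
        struct sock *select_sock(const struct reuseport_group *g,
                                 uint32_t hash, int bpf_index)
        {
                if (bpf_index >= 0 &&
                    (unsigned int)bpf_index < g->num_socks)
                        return g->socks[bpf_index];     /* BPF-chosen */
                return g->socks[hash % g->num_socks];   /* hash fallback */
        }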

    Signed-off-by: Paolo Abeni
    Acked-by: Craig Gallek
    Signed-off-by: David S. Miller

    Paolo Abeni
     

22 Oct, 2017

1 commit

  • Syzkaller stumbled upon a way to trigger
    WARNING: CPU: 1 PID: 13881 at net/core/sock_reuseport.c:41
    reuseport_alloc+0x306/0x3b0 net/core/sock_reuseport.c:39

    There are two initialization paths for the sock_reuseport structure
    in a socket: through the udp/tcp bind paths of SO_REUSEPORT sockets,
    or through SO_ATTACH_REUSEPORT_[CE]BPF before bind. The existing
    implementation assumed that the socket lock protected both of these
    paths when it actually only protects the SO_ATTACH_REUSEPORT path.
    Syzkaller triggered this double allocation by running these paths
    concurrently.

    This patch moves the check for double allocation into the
    reuseport_alloc() function, which is protected by a global spin lock.
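
    A userspace model of the fix (pthread mutex standing in for the
    global reuseport lock; illustrative, not the kernel code):

        #include <pthread.h>
        #include <stdlib.h>

        struct sock_reuseport { int max_socks; /* ... */ };
        struct sock { struct sock_reuseport *sk_reuseport_cb; };

        static pthread_mutex_t reuseport_lock = PTHREAD_MUTEX_INITIALIZER;

        /* The "already allocated?" check now happens inside the globally
         * locked region, so the bind path and the SO_ATTACH_REUSEPORT
         * path can no longer both allocate. */
        int reuseport_alloc(struct sock *sk)
        {
                int err = 0;

                pthread_mutex_lock(&reuseport_lock);
                if (!sk->sk_reuseport_cb) {     /* re-check under the lock */
                        sk->sk_reuseport_cb =
                                calloc(1, sizeof(struct sock_reuseport));
                        if (!sk->sk_reuseport_cb)
                                err = -1;       /* -ENOMEM in the kernel */
                }
                pthread_mutex_unlock(&reuseport_lock);
                return err;
        }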

    Fixes: e32ea7e74727 ("soreuseport: fast reuseport UDP socket selection")
    Fixes: c125e80b8868 ("soreuseport: fast reuseport TCP socket selection")
    Signed-off-by: Craig Gallek
    Signed-off-by: David S. Miller

    Craig Gallek
     

08 Aug, 2017

1 commit

  • Add a second device index, sdif, to inet socket lookups. sdif is the
    index for ingress devices enslaved to an l3mdev. It allows the lookups
    to consider the enslaved device as well as the L3 domain when searching
    for a socket.

    TCP moves the data in the cb. Prior to tcp_v4_rcv (e.g., early demux) the
    ingress index is obtained from IPCB using inet_sdif and after the cb move
    in tcp_v4_rcv the tcp_v4_sdif helper is used.

    Signed-off-by: David Ahern
    Signed-off-by: David S. Miller

    David Ahern
     

03 Jul, 2017

1 commit


01 Jul, 2017

1 commit

  • The refcount_t type and corresponding API should be
    used instead of atomic_t when the variable is used as
    a reference counter. This helps avoid accidental
    refcounter overflows that might lead to use-after-free
    situations.

    This patch uses refcount_inc_not_zero() instead of
    atomic_inc_not_zero_hint() due to the absence of a _hint()
    version of the refcount API. If the _hint() version must
    be used, we might need to revisit the API.

    Signed-off-by: Elena Reshetova
    Signed-off-by: Hans Liljestrand
    Signed-off-by: Kees Cook
    Signed-off-by: David Windsor
    Signed-off-by: David S. Miller

    Reshetova, Elena
     

09 May, 2017

1 commit

  • There are many code paths open-coding kvmalloc. Let's use the helper
    instead. The main difference to kvmalloc is that those users are
    usually not considering all the aspects of the memory allocator.
    E.g. allocation requests [...]
    Reviewed-by: Boris Ostrovsky # Xen bits
    Acked-by: Kees Cook
    Acked-by: Vlastimil Babka
    Acked-by: Andreas Dilger # Lustre
    Acked-by: Christian Borntraeger # KVM/s390
    Acked-by: Dan Williams # nvdim
    Acked-by: David Sterba # btrfs
    Acked-by: Ilya Dryomov # Ceph
    Acked-by: Tariq Toukan # mlx4
    Acked-by: Leon Romanovsky # mlx5
    Cc: Martin Schwidefsky
    Cc: Heiko Carstens
    Cc: Herbert Xu
    Cc: Anton Vorontsov
    Cc: Colin Cross
    Cc: Tony Luck
    Cc: "Rafael J. Wysocki"
    Cc: Ben Skeggs
    Cc: Kent Overstreet
    Cc: Santosh Raspatur
    Cc: Hariprasad S
    Cc: Yishai Hadas
    Cc: Oleg Drokin
    Cc: "Yan, Zheng"
    Cc: Alexander Viro
    Cc: Alexei Starovoitov
    Cc: Eric Dumazet
    Cc: David Miller
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     

19 Jan, 2017

2 commits

  • In inet_csk_get_port we seem to be using smallest_port to figure out
    the best place to look for a SO_REUSEPORT sk that matches an existing
    set of SO_REUSEPORT sockets. However, if we get to the logic

    if (smallest_size != -1) {
            port = smallest_port;
            goto have_port;
    }

    we will do a useless search, because we would have already done the
    inet_csk_bind_conflict for that port and it would have returned 1;
    otherwise we would have gone to found_tb and succeeded. Since this
    logic makes us do yet another trip through inet_csk_bind_conflict for
    a port we know won't work, just delete this code and save us the time.

    Signed-off-by: Josef Bacik
    Signed-off-by: David S. Miller

    Josef Bacik
     
  • We pass these per-protocol equal functions around in various places,
    but we can just have one function that checks sk->sk_family and then
    does the right comparison. I've also changed the ipv4 version to not
    cast to inet_sock since it is unneeded.

    Signed-off-by: Josef Bacik
    Signed-off-by: David S. Miller

    Josef Bacik
     

17 Oct, 2016

1 commit

  • Currently, socket lookups for l3mdev (vrf) use cases can match a
    socket that is bound to a port but not a device (i.e., a global
    socket). If the sysctl tcp_l3mdev_accept is not set, this leads to
    ACK packets going out based on the main table even though the packet
    came in from an L3 domain. The end result is that the connection does
    not establish, creating confusion for users since the service is
    running and a socket shows in ss output. Fix by requiring an exact
    dif to sk_bound_dev_if match if the skb came through an interface
    enslaved to an l3mdev device and tcp_l3mdev_accept is not set.

    skb's through an l3mdev interface are marked by setting a flag in
    inet{6}_skb_parm. The IPv6 variant is already set; this patch adds the
    flag for IPv4. Using an skb flag avoids a device lookup on the dif. The
    flag is set in the VRF driver using the IP{6}CB macros. For IPv4, the
    inet_skb_parm struct is moved in the cb per commit 971f10eca186, so the
    match function in the TCP stack needs to use TCP_SKB_CB. For IPv6, the
    move is done after the socket lookup, so IP6CB is used.

    The flags field in inet_skb_parm struct needs to be increased to add
    another flag. There is currently a 1-byte hole following the flags,
    so it can be expanded to u16 without increasing the size of the struct.

    Fixes: 193125dbd8eb ("net: Introduce VRF device driver")
    Signed-off-by: David Ahern
    Signed-off-by: David S. Miller

    David Ahern
     

04 May, 2016

1 commit


02 May, 2016

1 commit

  • I forgot to include a check for listener port equality when deciding
    if two sockets should belong to the same reuseport group. This was
    not caught previously because it's only necessary when two listening
    sockets for the same user happen to hash to the same listener bucket.
    The same error does not exist in the UDP path.

    Fixes: c125e80b8868 ("soreuseport: fast reuseport TCP socket selection")
    Signed-off-by: Craig Gallek
    Acked-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Craig Gallek
     

28 Apr, 2016

1 commit


26 Apr, 2016

1 commit

  • d894ba18d4e4 ("soreuseport: fix ordering for mixed v4/v6 sockets")
    was merged as a bug fix to the net tree. Two conflicting changes
    were committed to net-next before the above fix was merged back to
    net-next:
    ca065d0cf80f ("udp: no longer use SLAB_DESTROY_BY_RCU")
    3b24d854cb35 ("tcp/dccp: do not touch listener sk_refcnt under synflood")

    These changes switched the data structure used for TCP and UDP sockets
    from hlist_nulls to hlist. This patch applies the necessary parts
    of the net tree fix to net-next which were not automatic as part of the
    merge.

    Fixes: 1602f49b58ab ("Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net")
    Signed-off-by: Craig Gallek
    Signed-off-by: David S. Miller

    Craig Gallek
     

08 Apr, 2016

1 commit

  • David Ahern reported panics in __inet_hash() caused by my recent commit.

    The reason is inet_reuseport_add_sock() was still using
    sk_nulls_for_each_rcu() instead of sk_for_each_rcu().
    SO_REUSEPORT enabled listeners were causing an instant crash.

    While chasing this bug, I found that I forgot to clear SOCK_RCU_FREE
    flag, as it is inherited from the parent at clone time.

    Fixes: 3b24d854cb35 ("tcp/dccp: do not touch listener sk_refcnt under synflood")
    Signed-off-by: Eric Dumazet
    Reported-by: David Ahern
    Tested-by: David Ahern
    Signed-off-by: David S. Miller

    Eric Dumazet
     

05 Apr, 2016

2 commits

  • When a SYNFLOOD targets a non SO_REUSEPORT listener, multiple
    cpus contend on sk->sk_refcnt and sk->sk_wmem_alloc changes.

    By letting listeners use the SOCK_RCU_FREE infrastructure,
    we can relax TCP_LISTEN lookup rules and avoid touching sk_refcnt.

    Note that we still use SLAB_DESTROY_BY_RCU rules for other sockets,
    only listeners are impacted by this change.

    Peak performance under SYNFLOOD is increased by ~33%:

    On my test machine, I could process 3.2 Mpps instead of 2.4 Mpps

    Most consuming functions are now skb_set_owner_w() and sock_wfree()
    contending on sk->sk_wmem_alloc when cooking SYNACK and freeing them.

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     
  • RX packet processing holds rcu_read_lock(), so we can remove
    pairs of rcu_read_lock()/rcu_read_unlock() in lookup functions
    if inet_diag also holds rcu before calling them.

    This is needed anyway as __inet_lookup_listener() and
    inet6_lookup_listener() will soon no longer increment
    refcount on the found listener.

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

12 Feb, 2016

1 commit

  • In commit 07f4c90062f8 ("tcp/dccp: try to not exhaust ip_local_port_range
    in connect()"), I added a very simple heuristic, so that we got better
    chances to use even ports, and allow bind() users to have more available
    slots.

    It gave nice results, but with more than 200,000 TCP sessions on a typical
    server, the ~30,000 ephemeral ports are still a rare resource.

    I chose to go a step further, by looking at all even ports, and if
    none was available, falling back to odd ports.

    The companion patch does the same in bind(), but in the opposite way.

    I've seen exec times of up to 30ms on busy servers, so I no longer
    disable BH for the whole traversal, but only for each hash bucket.
    I also call cond_resched() to be gentle to other tasks.
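
    The even-first walk reduces to the following (a sketch that ignores
    the randomized starting offset, the per-bucket locking and
    cond_resched() mentioned above; port_available() is a stand-in for
    the bucket scan plus conflict check):

        #include <stdbool.h>

        bool port_available(int port);  /* stub: bucket scan + conflict check */

        int pick_local_port(int low, int high)
        {
                int port;

                /* First pass: even ports only, leaving odd ports for
                 * bind() users (which prefer odd ports, per the
                 * companion patch). */
                for (port = low + (low & 1); port <= high; port += 2)
                        if (port_available(port))
                                return port;

                /* Second pass: fall back to odd ports. */
                for (port = low | 1; port <= high; port += 2)
                        if (port_available(port))
                                return port;

                return -1;      /* local port range exhausted */
        }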

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet