22 Jul, 2018

1 commit

  • [ Upstream commit 8c43bd1706885ba1acfa88da02bc60a2ec16f68c ]

    Similar to 69678bcd4d2d ("udp: fix SO_BINDTODEVICE"), TCP socket lookups
    need to fail if dev_match is not true. Currently, a packet to a given port
    can match a socket bound to a device when it should not. In the VRF case,
    this causes the lookup to hit a VRF socket rather than a global socket,
    resulting in a response trying to go through the VRF when it should not
    (a sketch of the fixed check follows this entry).

    Fixes: 3fa6f616a7a4d ("net: ipv4: add second dif to inet socket lookups")
    Fixes: 4297a0ef08572 ("net: ipv6: add second dif to inet6 socket lookups")
    Reported-by: Lou Berger
    Diagnosed-by: Renato Westphal
    Tested-by: Renato Westphal
    Signed-off-by: David Ahern
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    David Ahern
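
    A minimal, compilable userspace sketch of the fixed lookup rule. The
    struct and function names mirror the kernel's compute_score(), but they
    are simplified stand-ins and the score values are illustrative only:

    #include <stdbool.h>
    #include <stdio.h>

    struct sock { int sk_bound_dev_if; };

    static int compute_score(const struct sock *sk, int dif, int sdif)
    {
            int score = 1;          /* base score: port matched */

            if (sk->sk_bound_dev_if) {
                    bool dev_match = sk->sk_bound_dev_if == dif ||
                                     sk->sk_bound_dev_if == sdif;
                    if (!dev_match)
                            return -1;  /* the fix: reject outright */
                    score += 2;         /* device-bound match ranks higher */
            }
            return score;
    }

    int main(void)
    {
            struct sock vrf_sk = { .sk_bound_dev_if = 4 }; /* bound to VRF dev 4 */

            /* A packet arriving with dif=7, sdif=0 must not match it. */
            printf("score = %d\n", compute_score(&vrf_sk, 7, 0)); /* prints -1 */
            return 0;
    }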
     

22 Oct, 2017

1 commit

  • Syzkaller stumbled upon a way to trigger
    WARNING: CPU: 1 PID: 13881 at net/core/sock_reuseport.c:41
    reuseport_alloc+0x306/0x3b0 net/core/sock_reuseport.c:39

    There are two initialization paths for the sock_reuseport structure in a
    socket: through the udp/tcp bind paths of SO_REUSEPORT sockets, or through
    SO_ATTACH_REUSEPORT_[CE]BPF before bind. The existing implementation
    assumed that the socket lock protected both of these paths when it
    actually only protects the SO_ATTACH_REUSEPORT path. Syzkaller triggered
    this double allocation by running these paths concurrently.

    This patch moves the check for double allocation into the reuseport_alloc
    function, which is protected by a global spin lock (modeled in the sketch
    after this entry).

    Fixes: e32ea7e74727 ("soreuseport: fast reuseport UDP socket selection")
    Fixes: c125e80b8868 ("soreuseport: fast reuseport TCP socket selection")
    Signed-off-by: Craig Gallek
    Signed-off-by: David S. Miller

    Craig Gallek
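
    A hedged userspace model of the fix, with a pthread mutex standing in
    for the kernel's global reuseport spinlock; the struct layouts are
    simplified stand-ins:

    #include <pthread.h>
    #include <stdlib.h>

    struct sock_reuseport { unsigned int num_socks; };
    struct sock { struct sock_reuseport *sk_reuseport_cb; };

    static pthread_mutex_t reuseport_lock = PTHREAD_MUTEX_INITIALIZER;

    static int reuseport_alloc(struct sock *sk)
    {
            pthread_mutex_lock(&reuseport_lock);
            /* The fix: test for an existing group inside the lock, so the
             * bind() path and the SO_ATTACH_REUSEPORT_[CE]BPF path
             * serialize here instead of racing to allocate twice. */
            if (!sk->sk_reuseport_cb)
                    sk->sk_reuseport_cb =
                            calloc(1, sizeof(struct sock_reuseport));
            pthread_mutex_unlock(&reuseport_lock);
            return sk->sk_reuseport_cb ? 0 : -1;
    }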
     

08 Aug, 2017

1 commit

  • Add a second device index, sdif, to inet socket lookups. sdif is the
    index for ingress devices enslaved to an l3mdev. It allows the lookups
    to consider the enslaved device as well as the L3 domain when searching
    for a socket.

    TCP moves the data in the cb. Prior to tcp_v4_rcv (e.g., early demux), the
    ingress index is obtained from IPCB using inet_sdif; after the cb move in
    tcp_v4_rcv, the tcp_v4_sdif helper is used (see the sketch after this
    entry).

    Signed-off-by: David Ahern
    Signed-off-by: David S. Miller

    David Ahern
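
    A simplified model of the helper described above; the flag value and
    struct layout are illustrative stand-ins for the kernel's inet_skb_parm:

    struct inet_skb_parm { int iif; unsigned int flags; };
    #define IPSKB_L3SLAVE 0x80      /* assumed bit, for illustration */

    /* Usable before tcp_v4_rcv() moves the cb (e.g., early demux): the
     * secondary device index is the ingress iif, but only when the skb
     * entered through an l3mdev slave; otherwise sdif is 0. */
    static int inet_sdif(const struct inet_skb_parm *cb)
    {
            return (cb->flags & IPSKB_L3SLAVE) ? cb->iif : 0;
    }

    /* After the cb move, tcp_v4_sdif() plays the same role but reads the
     * values preserved in TCP's own control block. */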
     

01 Jul, 2017

1 commit

  • The refcount_t type and its corresponding API should be
    used instead of atomic_t when the variable is used as
    a reference counter. This helps avoid accidental
    refcounter overflows that might lead to use-after-free
    situations (a userspace model follows this entry).

    This patch uses refcount_inc_not_zero() instead of
    atomic_inc_not_zero_hint() due to the absence of a _hint()
    version of the refcount API. If the _hint() version must
    be used, we might need to revisit the API.

    Signed-off-by: Elena Reshetova
    Signed-off-by: Hans Liljestrand
    Signed-off-by: Kees Cook
    Signed-off-by: David Windsor
    Signed-off-by: David S. Miller

    Reshetova, Elena
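
    A userspace model of the refcount_inc_not_zero() semantics using C11
    atomics; the kernel's refcount_t additionally saturates on overflow,
    which this sketch omits:

    #include <stdatomic.h>
    #include <stdbool.h>

    static bool refcount_inc_not_zero(atomic_int *r)
    {
            int old = atomic_load(r);

            do {
                    if (old == 0)
                            return false; /* already released: never resurrect */
            } while (!atomic_compare_exchange_weak(r, &old, old + 1));
            return true;
    }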
     

09 May, 2017

1 commit

  • There are many code paths open-coding kvmalloc. Let's use the helper
    instead. The main difference from kvmalloc is that those users usually
    do not consider all the aspects of the memory allocator (a sketch of
    the open-coded pattern follows this entry). E.g. allocation requests
    Reviewed-by: Boris Ostrovsky # Xen bits
    Acked-by: Kees Cook
    Acked-by: Vlastimil Babka
    Acked-by: Andreas Dilger # Lustre
    Acked-by: Christian Borntraeger # KVM/s390
    Acked-by: Dan Williams # nvdim
    Acked-by: David Sterba # btrfs
    Acked-by: Ilya Dryomov # Ceph
    Acked-by: Tariq Toukan # mlx4
    Acked-by: Leon Romanovsky # mlx5
    Cc: Martin Schwidefsky
    Cc: Heiko Carstens
    Cc: Herbert Xu
    Cc: Anton Vorontsov
    Cc: Colin Cross
    Cc: Tony Luck
    Cc: "Rafael J. Wysocki"
    Cc: Ben Skeggs
    Cc: Kent Overstreet
    Cc: Santosh Raspatur
    Cc: Hariprasad S
    Cc: Yishai Hadas
    Cc: Oleg Drokin
    Cc: "Yan, Zheng"
    Cc: Alexander Viro
    Cc: Alexei Starovoitov
    Cc: Eric Dumazet
    Cc: David Miller
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
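
    A kernel-context sketch (not userspace-compilable) of the pattern being
    replaced; table_alloc is a hypothetical caller:

    #include <linux/mm.h>
    #include <linux/slab.h>
    #include <linux/vmalloc.h>

    static void *table_alloc(size_t size)
    {
            void *p;

            /* Before: open-coded fallback; callers often got the gfp
             * details wrong (e.g., forgetting __GFP_NORETRY). */
            p = kmalloc(size, GFP_KERNEL | __GFP_NOWARN | __GFP_NORETRY);
            if (!p)
                    p = vmalloc(size);
            return p;

            /* After: return kvmalloc(size, GFP_KERNEL);
             * and free either result with kvfree(). */
    }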
     

19 Jan, 2017

2 commits

  • In inet_csk_get_port we seem to be using smallest_port to figure out the
    best place to look for a SO_REUSEPORT sk that matches an existing set of
    SO_REUSEPORT sockets. However, if we get to the logic

    if (smallest_size != -1) {
            port = smallest_port;
            goto have_port;
    }

    we will do a useless search, because we would have already done the
    inet_csk_bind_conflict for that port and it would have returned 1;
    otherwise we would have gone to found_tb and succeeded. Since this logic
    makes us do yet another trip through inet_csk_bind_conflict for a port we
    know won't work, just delete this code and save the time.

    Signed-off-by: Josef Bacik
    Signed-off-by: David S. Miller

    Josef Bacik
     
  • We pass these per-protocol equal functions around in various places, but
    we can just have one function that checks sk->sk_family and then does the
    right comparison (sketched after this entry). I've also changed the ipv4
    version not to cast to inet_sock, since that is unneeded.

    Signed-off-by: Josef Bacik
    Signed-off-by: David S. Miller

    Josef Bacik
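
    A self-contained model of the single dispatching comparison; the field
    names and sizes are simplified stand-ins for the kernel's socket layout:

    #include <stdbool.h>
    #include <string.h>

    #define AF_INET  2
    #define AF_INET6 10

    struct sock {
            int sk_family;                  /* AF_INET or AF_INET6 */
            unsigned int rcv_saddr;         /* IPv4 receive address */
            unsigned char rcv_saddr6[16];   /* IPv6 receive address */
    };

    static bool rcv_saddr_equal(const struct sock *a, const struct sock *b)
    {
            if (a->sk_family != b->sk_family)
                    return false;
            if (a->sk_family == AF_INET6)
                    return memcmp(a->rcv_saddr6, b->rcv_saddr6, 16) == 0;
            /* ipv4: plain field compare, no inet_sock cast needed */
            return a->rcv_saddr == b->rcv_saddr;
    }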
     

17 Oct, 2016

1 commit

  • Currently, socket lookups for l3mdev (vrf) use cases can match a socket
    that is bound to a port but not a device (i.e., a global socket). If the
    sysctl tcp_l3mdev_accept is not set, this leads to ack packets going out
    based on the main table even though the packet came in from an L3 domain.
    The end result is that the connection does not establish, creating
    confusion for users since the service is running and a socket shows in
    ss output. Fix by requiring an exact dif to sk_bound_dev_if match if the
    skb came through an interface enslaved to an l3mdev device and
    tcp_l3mdev_accept is not set.

    skbs through an l3mdev interface are marked by setting a flag in
    inet{6}_skb_parm (modeled in the sketch after this entry). The IPv6
    variant is already set; this patch adds the flag for IPv4. Using an skb
    flag avoids a device lookup on the dif. The flag is set in the VRF
    driver using the IP{6}CB macros. For IPv4, the inet_skb_parm struct is
    moved into the cb per commit 971f10eca186, so the match function in the
    TCP stack needs to use TCP_SKB_CB. For IPv6, the move is done after the
    socket lookup, so IP6CB is used.

    The flags field in inet_skb_parm struct needs to be increased to add
    another flag. There is currently a 1-byte hole following the flags,
    so it can be expanded to u16 without increasing the size of the struct.

    Fixes: 193125dbd8eb ("net: Introduce VRF device driver")
    Signed-off-by: David Ahern
    Signed-off-by: David S. Miller

    David Ahern
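
    A simplified model of the match rule; the widened flags field and the
    flag bit are stand-ins for the kernel's inet_skb_parm and IPSKB_L3SLAVE:

    #include <stdbool.h>

    struct inet_skb_parm { unsigned short flags; }; /* u8 widened to u16 */
    #define IPSKB_L3SLAVE 0x100     /* assumed bit, for illustration */

    /* An exact dif match is demanded only when the skb entered through an
     * l3mdev slave and tcp_l3mdev_accept is not set; otherwise a global
     * socket may still match. */
    static bool inet_exact_dif_match(bool tcp_l3mdev_accept,
                                     const struct inet_skb_parm *cb)
    {
            return !tcp_l3mdev_accept && (cb->flags & IPSKB_L3SLAVE);
    }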
     

02 May, 2016

1 commit

  • I forgot to include a check for listener port equality when deciding
    if two sockets should belong to the same reuseport group. This was
    not caught previously because it's only necessary when two listening
    sockets for the same user happen to hash to the same listener bucket.
    The same error does not exist in the UDP path.

    Fixes: c125e80b8868 ("soreuseport: fast reuseport TCP socket selection")
    Signed-off-by: Craig Gallek
    Acked-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Craig Gallek
     

26 Apr, 2016

1 commit

  • d894ba18d4e4 ("soreuseport: fix ordering for mixed v4/v6 sockets")
    was merged as a bug fix to the net tree. Two conflicting changes
    were committed to net-next before the above fix was merged back to
    net-next:
    ca065d0cf80f ("udp: no longer use SLAB_DESTROY_BY_RCU")
    3b24d854cb35 ("tcp/dccp: do not touch listener sk_refcnt under synflood")

    These changes switched the data structure used for TCP and UDP sockets
    from hlist_nulls to hlist. This patch applies the necessary parts of the
    net tree fix to net-next that were not automatic as part of the merge.

    Fixes: 1602f49b58ab ("Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net")
    Signed-off-by: Craig Gallek
    Signed-off-by: David S. Miller

    Craig Gallek
     

08 Apr, 2016

1 commit

  • David Ahern reported panics in __inet_hash() caused by my recent commit.

    The reason is that inet_reuseport_add_sock() was still using
    sk_nulls_for_each_rcu() instead of sk_for_each_rcu().
    SO_REUSEPORT-enabled listeners were causing an instant crash.

    While chasing this bug, I found that I forgot to clear the SOCK_RCU_FREE
    flag, as it is inherited from the parent at clone time.

    Fixes: 3b24d854cb35 ("tcp/dccp: do not touch listener sk_refcnt under synflood")
    Signed-off-by: Eric Dumazet
    Reported-by: David Ahern
    Tested-by: David Ahern
    Signed-off-by: David S. Miller

    Eric Dumazet
     

05 Apr, 2016

2 commits

  • When a SYNFLOOD targets a non-SO_REUSEPORT listener, multiple cpus
    contend on changes to sk->sk_refcnt and sk->sk_wmem_alloc.

    By letting listeners use the SOCK_RCU_FREE infrastructure (sketched
    after this entry), we can relax TCP_LISTEN lookup rules and avoid
    touching sk_refcnt.

    Note that we still use SLAB_DESTROY_BY_RCU rules for other sockets;
    only listeners are impacted by this change.

    Peak performance under SYNFLOOD is increased by ~33%:

    On my test machine, I could process 3.2 Mpps instead of 2.4 Mpps.

    Most consuming functions are now skb_set_owner_w() and sock_wfree(),
    contending on sk->sk_wmem_alloc when cooking SYNACKs and freeing them.

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
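
    The mechanism that makes the refcount-free lookup safe, as a
    kernel-context sketch close to what sk_destruct() became: freeing is
    deferred past an RCU grace period, so a reader inside rcu_read_lock()
    can never see the listener's memory recycled under it.

    static void sk_destruct(struct sock *sk)
    {
            if (sock_flag(sk, SOCK_RCU_FREE))
                    call_rcu(&sk->sk_rcu, __sk_destruct); /* readers may still run */
            else
                    __sk_destruct(&sk->sk_rcu);           /* old immediate path */
    }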
     
  • RX packet processing holds rcu_read_lock(), so we can remove
    pairs of rcu_read_lock()/rcu_read_unlock() in lookup functions
    if inet_diag also holds rcu before calling them.

    This is needed anyway as __inet_lookup_listener() and
    inet6_lookup_listener() will soon no longer increment
    refcount on the found listener.

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

12 Feb, 2016

1 commit

  • In commit 07f4c90062f8 ("tcp/dccp: try to not exhaust ip_local_port_range
    in connect()"), I added a very simple heuristic, so that we got better
    chances to use even ports, and allow bind() users to have more available
    slots.

    It gave nice results, but with more than 200,000 TCP sessions on a typical
    server, the ~30,000 ephemeral ports are still a rare resource.

    I chose to go a step further, by looking at all even ports, and if none
    was available, falling back to odd ports (see the sketch after this
    entry).

    The companion patch does the same in bind(), but in the opposite way.

    I've seen exec times of up to 30ms on busy servers, so I no longer
    disable BH for the whole traversal, but only for each hash bucket.
    I also call cond_resched() to be gentle to other tasks.

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
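
    A userspace model of the traversal order (all even ports first, then
    odd); the in_use callback stands in for the real bind-conflict test,
    and the per-bucket BH disabling and cond_resched() calls are omitted:

    #include <stdbool.h>

    static int find_port(int low, int high, bool (*in_use)(int))
    {
            int parity, port;

            for (parity = 0; parity <= 1; parity++) {
                    /* first pass walks even ports, second pass odd ones */
                    for (port = low + ((low ^ parity) & 1);
                         port <= high; port += 2)
                            if (!in_use(port))
                                    return port;
            }
            return -1;      /* whole range exhausted */
    }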
     

11 Feb, 2016

3 commits

  • This change extends the fast SO_REUSEPORT socket lookup implemented
    for UDP to TCP. Listener sockets with SO_REUSEPORT and the same
    receive address are additionally added to an array for faster
    random access (modeled in the sketch after this entry). This means
    that only a single socket from the group must be found in the
    listener list before any socket in the group can be used to receive
    a packet. Previously, every socket in the group needed to be
    considered before handing off the incoming packet.

    This feature also exposes the ability to use a BPF program when
    selecting a socket from a reuseport group.

    Signed-off-by: Craig Gallek
    Signed-off-by: David S. Miller

    Craig Gallek
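
    A hedged model of the fast selection step; the kernel version sizes
    the array dynamically, protects it with RCU, and may run the attached
    BPF program instead of the modulo pick shown here:

    #include <stddef.h>

    struct sock;

    struct sock_reuseport {
            unsigned int num_socks;
            struct sock *socks[16];         /* fixed cap, for illustration */
    };

    static struct sock *reuseport_select_sock(const struct sock_reuseport *grp,
                                              unsigned int hash)
    {
            if (!grp || grp->num_socks == 0)
                    return NULL;
            /* one array index instead of walking the whole listener list */
            return grp->socks[hash % grp->num_socks];
    }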
     
  • This is a preliminary step to allow fast socket lookup of SO_REUSEPORT
    groups. Doing so with a BPF filter will require access to the
    skb in question. This change plumbs the skb (and offset to payload
    data) through the call stack to the listening socket lookup
    implementations where it will be used in a following patch.

    Signed-off-by: Craig Gallek
    Signed-off-by: David S. Miller

    Craig Gallek
     
  • In order to support fast reuseport lookups in TCP, the hash function
    defined in struct proto must be capable of returning an error code.
    This patch changes the function signature of all related hash functions
    to return an integer and handles or propagates this return value at
    all call sites.

    Signed-off-by: Craig Gallek
    Signed-off-by: David S. Miller

    Craig Gallek
     

23 Oct, 2015

1 commit

  • Multiple cpus can process duplicates of incoming ACK messages
    matching a SYN_RECV request socket. This is a rare event under
    normal operations, but definitely can happen.

    Only one must win the race, otherwise corruption would occur.

    To fix this without adding new atomic ops, we use logic in
    inet_ehash_nolisten() to detect whether the request was present in the
    same ehash bucket where we try to insert the new child.

    If the request socket was not found, we have to undo the child creation.

    This actually removes a spin_lock()/spin_unlock() pair in
    reqsk_queue_unlink() for the fast path.

    Fixes: e994b2f0fb92 ("tcp: do not lock listener to process SYN packets")
    Fixes: 079096f103fa ("tcp/dccp: install syn_recv requests into ehash table")
    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

15 Oct, 2015

1 commit

  • As we no longer hold the listener lock in the fast path, it is possible
    that a child is created right after the listener freed its bound port,
    if a close() is done while incoming packets are processed.

    __inet_inherit_port() must detect this and return an error,
    so that the caller can free the child earlier.

    Fixes: e994b2f0fb92 ("tcp: do not lock listener to process SYN packets")
    Fixes: 079096f103fa ("tcp/dccp: install syn_recv requests into ehash table")
    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

13 Oct, 2015

1 commit

  • SO_INCOMING_CPU, as added in commit 2c8c56e15df3, was a getsockopt()
    command to fetch the incoming cpu handling a particular TCP flow after
    accept().

    This commit adds setsockopt() support and extends the SO_REUSEPORT
    selection logic: if a TCP listener or UDP socket has this option set,
    a packet is delivered to it only if the CPU handling the packet matches
    the specified one (see the example after this entry).

    This makes it possible to build very efficient TCP servers, using one
    listener per RX queue, as the associated TCP listener should only accept
    flows handled in softirq by the same cpu.
    This provides optimal NUMA behavior and keeps cpu caches hot.

    Note that __inet_lookup_listener() still has to iterate over the list of
    all listeners. A following patch puts sk_refcnt in a different cache line
    to let this iteration hit only shared, read-mostly cache lines.

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
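
    A runnable userspace example of the new setsockopt() side (needs
    kernel >= 4.4; the fallback define uses the asm-generic value of
    SO_INCOMING_CPU, and error handling is trimmed for brevity):

    #include <netinet/in.h>
    #include <string.h>
    #include <sys/socket.h>

    #ifndef SO_INCOMING_CPU
    #define SO_INCOMING_CPU 49
    #endif

    int main(void)
    {
            int cpu = 0, one = 1;
            struct sockaddr_in addr;
            int fd = socket(AF_INET, SOCK_STREAM, 0);

            /* Pin this listener to cpu 0; packets handled by other cpus
             * are steered to other members of the reuseport group. */
            setsockopt(fd, SOL_SOCKET, SO_REUSEPORT, &one, sizeof(one));
            setsockopt(fd, SOL_SOCKET, SO_INCOMING_CPU, &cpu, sizeof(cpu));

            memset(&addr, 0, sizeof(addr));
            addr.sin_family = AF_INET;
            addr.sin_port = htons(8080);
            bind(fd, (struct sockaddr *)&addr, sizeof(addr));
            listen(fd, 128);

            /* a real server creates one such listener per RX queue/cpu */
            return 0;
    }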
     

03 Oct, 2015

1 commit

  • In this patch, we insert request sockets into the regular TCP/DCCP
    ehash table (where ESTABLISHED and TIMEWAIT sockets are) instead of
    using the per-listener hash table.

    ACK packets find SYN_RECV pseudo sockets without having
    to find and lock the listener.

    In nominal conditions, this halves pressure on the listener lock.

    Note that this will allow for SO_REUSEPORT refinements,
    so that we can select a listener using cpu/numa affinities instead
    of the prior 'consistent hash', since only SYN packets will
    apply this selection logic.

    We will shrink listen_sock in the following patch to ease
    code review.

    Signed-off-by: Eric Dumazet
    Cc: Ying Cai
    Cc: Willem de Bruijn
    Signed-off-by: David S. Miller

    Eric Dumazet
     

22 Jul, 2015

1 commit

  • Andrew Morton reported the following warning on one ARM build
    with gcc-4.4:

    net/ipv4/inet_hashtables.c: In function 'inet_ehash_locks_alloc':
    net/ipv4/inet_hashtables.c:617: warning: division by zero

    Even guarded with a test on sizeof(spinlock_t), the compiler does not
    like the current construct on a !CONFIG_SMP build.

    Remove the warning by using a temporary variable (modeled in the
    sketch after this entry).

    Fixes: 095dc8e0c368 ("tcp: fix/cleanup inet_ehash_locks_alloc()")
    Reported-by: Andrew Morton
    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
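
    A small model of the workaround; locksz plays the role of
    sizeof(spinlock_t), which is 0 on !CONFIG_SMP builds:

    static unsigned int nblocks(unsigned int locksz, unsigned int region)
    {
            /* Before: region / sizeof(spinlock_t) kept a constant-zero
             * divisor visible to the compiler even inside a guarded
             * branch, and gcc-4.4 warned about it. Routing the size
             * through a variable hides the constant from the front end. */
            if (locksz == 0)
                    return 0;
            return region / locksz;
    }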
     

10 Jul, 2015

2 commits

  • inet_twsk_deschedule() calls are followed by inet_twsk_put().

    The only particular case is in inet_twsk_purge(), but there is no point
    in deferring the inet_twsk_put() until after re-enabling BH.

    Let's rename inet_twsk_deschedule() to inet_twsk_deschedule_put()
    and move the inet_twsk_put() inside.

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     
  • Timewait sockets have complex refcounting logic.
    Once we realize it should be similar to established and
    syn_recv sockets, we can use sk_nulls_del_node_init_rcu()
    and remove inet_twsk_unhash().

    In particular, the deferred inet_twsk_put() added in commit
    13475a30b66cd ("tcp: connect() race with timewait reuse")
    looks unnecessary: when removing a timewait socket from
    ehash or bhash, the caller must own a reference on the socket
    anyway.

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

28 May, 2015

2 commits

  • __inet_hash_connect() does not use its third argument (port_offset)
    if the socket was already bound to a source port.

    No need to perform useless but expensive md5 computations.

    Reported-by: Crestez Dan Leonard
    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     
  • A long-standing problem on busy servers is the tiny available TCP port
    range (/proc/sys/net/ipv4/ip_local_port_range) and the default
    sequential allocation of source ports in the connect() system call.

    If a host has a lot of active TCP sessions, chances are
    very high that all ports are in use by at least one flow,
    and subsequent bind(0) attempts fail, or have to scan a big portion of
    the space to find a slot.

    In this patch, I changed the starting point in __inet_hash_connect()
    so that we try to favor even [1] ports, leaving odd ports for bind()
    users.

    We still perform a sequential search, so there is no guarantee, but
    if connect() targets are very different, the end result is that we leave
    more ports available to bind(), and we spread them all over the range,
    lowering the time for both connect() and bind() to find a slot.

    This strategy only works well if /proc/sys/net/ipv4/ip_local_port_range
    spans an even number of ports, i.e., if the start/end values have
    different parity.

    Therefore, the default /proc/sys/net/ipv4/ip_local_port_range was changed
    to 32768 - 60999 (instead of 32768 - 61000); this range holds 28232
    ports, 14116 even and 14116 odd.

    There is no change to security aspects here; only some poor hashing
    schemes could eventually be impacted by this change.

    [1] : The odd/even property depends on the parity of the
    ip_local_port_range values.

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

27 May, 2015

1 commit

  • If the tcp ehash table is constrained to a very small number of buckets
    (e.g., boot parameter thash_entries=128), then we can crash if the
    spinlock array has more entries.

    While we are at it, un-inline inet_ehash_locks_alloc() and make the
    following changes (sketched after this entry):

    - Budget two cache lines per cpu worth of spinlocks.
    - Try to kmalloc() the array to avoid extra TLB pressure
    (most servers at Google allocate 8192 bytes for this hash table).
    - Get rid of various #ifdefs.

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
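
    A hedged model of the sizing rule; the real code also rounds the count
    to a power of two for cheap masking and falls back from kmalloc() to
    vmalloc() on failure:

    static unsigned int ehash_locks_count(unsigned int hash_buckets,
                                          unsigned int nr_cpus,
                                          unsigned int locksz)
    {
            unsigned int nblocks;

            if (locksz == 0)        /* !CONFIG_SMP: spinlocks are empty */
                    return 1;
            /* budget ~two cache lines (2 * 64 bytes) of locks per cpu */
            nblocks = (2U * 64 / locksz) * nr_cpus;
            /* the crash being fixed: with a tiny thash_entries, the lock
             * array must never outnumber the hash buckets it guards */
            if (nblocks > hash_buckets)
                    nblocks = hash_buckets;
            return nblocks;
    }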
     

22 May, 2015

1 commit

  • We no longer need the bsocket atomic counter, as inet_csk_get_port()
    calls bind_conflict() regardless of its value, after commit
    2b05ad33e1e624e ("tcp: bind() fix autoselection to share ports").

    This patch removes the overhead of maintaining this counter and of
    double inet_csk_get_port() calls under pressure.

    Signed-off-by: Eric Dumazet
    Cc: Marcelo Ricardo Leitner
    Cc: Flavio Leitner
    Acked-by: Flavio Leitner
    Signed-off-by: David S. Miller

    Eric Dumazet
     

14 Apr, 2015

1 commit

  • Using a timer wheel for timewait sockets was nice ~15 years ago when
    memory was expensive and machines had a single processor.

    This does not scale, the code is ugly, and it is a source of huge
    latencies (typically 30 ms have been seen, with cpus spinning on the
    death_lock spinlock).

    We can afford to use an extra 64 bytes per timewait sock and spread
    the timewait load across all cpus for better behavior.

    Tested:

    In the following test, /proc/sys/net/ipv4/tcp_tw_recycle is set to 1
    on the target (lpaa24).

    Before patch :

    lpaa23:~# ./super_netperf 200 -H lpaa24 -t TCP_CC -l 60 -- -p0,0
    419594

    lpaa23:~# ./super_netperf 200 -H lpaa24 -t TCP_CC -l 60 -- -p0,0
    437171

    While test is running, we can observe 25 or even 33 ms latencies.

    lpaa24:~# ping -c 1000 -i 0.02 -qn lpaa23
    ...
    1000 packets transmitted, 1000 received, 0% packet loss, time 20601ms
    rtt min/avg/max/mdev = 0.020/0.217/25.771/1.535 ms, pipe 2

    lpaa24:~# ping -c 1000 -i 0.02 -qn lpaa23
    ...
    1000 packets transmitted, 1000 received, 0% packet loss, time 20702ms
    rtt min/avg/max/mdev = 0.019/0.183/33.761/1.441 ms, pipe 2

    After patch :

    About a 90% increase in throughput:

    lpaa23:~# ./super_netperf 200 -H lpaa24 -t TCP_CC -l 60 -- -p0,0
    810442

    lpaa23:~# ./super_netperf 200 -H lpaa24 -t TCP_CC -l 60 -- -p0,0
    800992

    And latencies are kept to minimal values during this load, even
    though network utilization is 90% higher:

    lpaa24:~# ping -c 1000 -i 0.02 -qn lpaa23
    ...
    1000 packets transmitted, 1000 received, 0% packet loss, time 19991ms
    rtt min/avg/max/mdev = 0.023/0.064/0.360/0.042 ms

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

04 Apr, 2015

1 commit

  • The ipv4 code uses a mixture of coding styles. In some instances a
    check for a non-NULL pointer is done as x != NULL and sometimes simply
    as x. The latter is preferred according to checkpatch, and this patch
    makes the code consistent by adopting that form.

    No changes detected by objdiff.

    Signed-off-by: Ian Morris
    Signed-off-by: David S. Miller

    Ian Morris
     

19 Mar, 2015

5 commits