05 Oct, 2015

1 commit

  • I accidentally cleared fastopenq.max_qlen in reqsk_queue_alloc(),
    but max_qlen can be set before listen() is called, for example via
    the TCP_FASTOPEN socket option.
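
    A minimal userspace sketch of the sequence that trips the bug
    (hypothetical port and queue-length values, error handling
    omitted):

        #include <netinet/in.h>
        #include <netinet/tcp.h>
        #include <sys/socket.h>

        int main(void)
        {
            int fd = socket(AF_INET, SOCK_STREAM, 0);
            int qlen = 16;                 /* desired fastopen queue length */
            struct sockaddr_in a = { .sin_family = AF_INET,
                                     .sin_port   = htons(8080) };

            /* max_qlen is set BEFORE listen(); the buggy
             * reqsk_queue_alloc() wiped it when listen() allocated
             * the request queue. */
            setsockopt(fd, IPPROTO_TCP, TCP_FASTOPEN, &qlen, sizeof(qlen));

            bind(fd, (struct sockaddr *)&a, sizeof(a));
            listen(fd, 128);
            return 0;
        }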

    Fixes: 0536fcc039a8 ("tcp: prepare fastopen code for upcoming listener changes")
    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

03 Oct, 2015

6 commits

  • This control variable was set at the first listen(fd, backlog)
    call, but it was not updated if the application later tried to
    increase or decrease the backlog. That made sense back when the
    listener used a non-resizeable hash table.

    Rounding the backlog up to a power of two was also not very
    friendly.
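
    After this change, an application can resize the backlog simply by
    calling listen() again; a minimal sketch (hypothetical port and
    backlog values):

        #include <netinet/in.h>
        #include <sys/socket.h>

        int main(void)
        {
            int fd = socket(AF_INET, SOCK_STREAM, 0);
            struct sockaddr_in a = { .sin_family = AF_INET,
                                     .sin_port   = htons(8080) };

            bind(fd, (struct sockaddr *)&a, sizeof(a));
            listen(fd, 128);     /* initial backlog */

            /* calling listen() again on the listening socket now
             * updates the backlog instead of being ignored */
            listen(fd, 4096);
            return 0;
        }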

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     
  • It is enough to check the listener's sk_state; no extra condition
    is needed.

    max_qlen_log can be moved into struct request_sock_queue.

    We can also remove syn_wait_lock and the alignment it enforced.

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     
  • We no longer use hash_rnd, nr_table_entries and syn_table[].

    For a listener with a backlog of 10 million sockets, this saves
    80 MBytes of vmalloc()ed memory (roughly 10 million bucket
    pointers of 8 bytes each on 64-bit).

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     
  • In this patch, we insert request sockets into the regular TCP/DCCP
    ehash table (where ESTABLISHED and TIMEWAIT sockets live) instead
    of using the per-listener hash table.

    ACK packets then find SYN_RECV pseudo sockets without having
    to find and lock the listener.

    In nominal conditions, this halves pressure on listener lock.

    Note that this will allow for SO_REUSEPORT refinements,
    so that we can select a listener using cpu/numa affinities instead
    of the prior 'consistent hash', since only SYN packets will
    apply this selection logic.

    We will shrink listen_sock in the following patch to ease
    code review.

    Signed-off-by: Eric Dumazet
    Cc: Ying Cai
    Cc: Willem de Bruijn
    Signed-off-by: David S. Miller

    Eric Dumazet
     
  • qlen_inc & young_inc were protected by the listener lock, while
    qlen_dec & young_dec were atomic fields.

    Everything needs to be atomic for the upcoming lockless listener.

    Also move qlen/young into request_sock_queue, as we'll get rid
    of struct listen_sock eventually.
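
    A userspace sketch of the before/after using C11 atomics (the demo
    type and helpers are illustrative; the real fields live in struct
    request_sock_queue):

        #include <stdatomic.h>
        #include <stdio.h>

        struct reqq_demo {
            atomic_int qlen;    /* requests currently in SYN_RECV         */
            atomic_int young;   /* requests whose SYNACK was never rtx'ed */
        };

        static void queue_added(struct reqq_demo *q)
        {
            /* these increments previously ran under the listener lock */
            atomic_fetch_add(&q->qlen, 1);
            atomic_fetch_add(&q->young, 1);
        }

        static void queue_removed(struct reqq_demo *q, int was_young)
        {
            /* the decrements were already atomic; now both directions are */
            if (was_young)
                atomic_fetch_sub(&q->young, 1);
            atomic_fetch_sub(&q->qlen, 1);
        }

        int main(void)
        {
            struct reqq_demo q;

            atomic_init(&q.qlen, 0);
            atomic_init(&q.young, 0);
            queue_added(&q);
            queue_removed(&q, 1);
            printf("qlen=%d young=%d\n",
                   atomic_load(&q.qlen), atomic_load(&q.young));
            return 0;
        }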

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     
  • struct request_sock_queue fields are currently protected by the
    listener 'lock' (not a real spinlock).

    We need to add a private spinlock instead, so that softirq handlers
    creating children do not have to worry about the backlog notion
    that the listener 'lock' carries.
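
    Conceptually (a pthread-based sketch, not the kernel code), the
    queue gets its own lock so producers never touch the listener lock:

        #include <pthread.h>
        #include <stddef.h>

        struct child { struct child *next; };

        struct accept_queue {
            pthread_spinlock_t lock;    /* private lock, cf. the new queue lock */
            struct child *head, *tail;
        };

        /* Producer side (the softirq analogue): append a child using
         * only the queue's own lock; the listener lock is never taken. */
        static void queue_add(struct accept_queue *q, struct child *c)
        {
            c->next = NULL;
            pthread_spin_lock(&q->lock);
            if (q->tail)
                q->tail->next = c;
            else
                q->head = c;
            q->tail = c;
            pthread_spin_unlock(&q->lock);
        }

        int main(void)
        {
            struct accept_queue q = { .head = NULL, .tail = NULL };
            struct child c;

            pthread_spin_init(&q.lock, PTHREAD_PROCESS_PRIVATE);
            queue_add(&q, &c);
            return q.head == &c ? 0 : 1;
        }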

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

30 Sep, 2015

1 commit

  • While auditing the TCP stack for the upcoming 'lockless' listener
    changes, I found I had to change fastopen_init_queue() to properly
    initialize the object before publishing it.

    Otherwise another cpu could try to lock the spinlock before it got
    properly initialized.

    Instead of adding the appropriate barriers, just remove the dynamic
    memory allocation:
    - The structure is 28 bytes on 64-bit arches; using an additional
      8 bytes to hold a pointer seems overkill.
    - Two listeners could share the same cache line, and performance
      would suffer.

    If we really want to save a few bytes, we could instead dynamically
    allocate the whole struct request_sock_queue in the future.
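
    The shape of the change, sketched with illustrative types (names
    abbreviated; the real structure is the fastopen queue):

        struct fastopen_queue_demo { int lock_and_friends; int max_qlen; };

        /* Before: a pointer, kmalloc()ed on demand.  Another cpu could
         * see the pointer before the pointee (and its spinlock) was
         * initialized, absent extra memory barriers. */
        struct listener_before {
            struct fastopen_queue_demo *fastopenq;
        };

        /* After: the 28-byte structure is embedded.  There is no
         * publish step at all, and two listeners no longer risk
         * sharing one cache line through separately allocated queues. */
        struct listener_after {
            struct fastopen_queue_demo fastopenq;
        };

        int main(void)
        {
            struct listener_after l = { .fastopenq.max_qlen = 16 };
            return l.fastopenq.max_qlen == 16 ? 0 : 1;
        }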

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

11 Aug, 2015

1 commit

  • reqsk_queue_destroy() and reqsk_queue_unlink() should use
    del_timer_sync() instead of del_timer() before calling reqsk_put(),
    otherwise we could free a req still in use by another cpu.

    But before doing so, reqsk_queue_destroy() must release the
    syn_wait_lock spinlock or risk a deadlock, as reqsk_timer_handler()
    might need to take this same spinlock from reqsk_queue_unlink()
    (called from inet_csk_reqsk_queue_drop()).
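
    The deadlock shape, in a userspace analogy (pthread_join() stands
    in for del_timer_sync(), the mutex for syn_wait_lock):

        #include <pthread.h>
        #include <stdio.h>

        static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

        static void *timer_handler(void *arg)  /* cf. reqsk_timer_handler() */
        {
            (void)arg;
            pthread_mutex_lock(&lock);         /* the handler needs the lock too */
            puts("handler finished");
            pthread_mutex_unlock(&lock);
            return NULL;
        }

        int main(void)
        {
            pthread_t t;

            pthread_create(&t, NULL, timer_handler, NULL);
            pthread_mutex_lock(&lock);
            /* ... unlink the request under the lock ... */
            pthread_mutex_unlock(&lock);  /* drop the lock BEFORE waiting;      */
            pthread_join(t, NULL);        /* waiting while still holding it     */
                                          /* would deadlock if the handler runs */
            return 0;
        }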

    Fixes: fa76ce7328b2 ("inet: get rid of central tcp/dccp listener timer")
    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

24 Mar, 2015

1 commit

  • This is low-hanging fruit, as we'll get rid of syn_wait_lock
    eventually.

    We hold syn_wait_lock for such short sections that it makes no
    sense to use a read/write lock. A plain spinlock is simply faster.
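
    The swap, in a userspace sketch (pthread primitives standing in for
    the kernel's rwlock_t and spinlock_t):

        #include <pthread.h>

        /* Before: a reader/writer lock, even though every critical
         * section is only a few instructions long. */
        pthread_rwlock_t old_lock = PTHREAD_RWLOCK_INITIALIZER;

        /* After: a plain spinlock; cheaper to take and release, with
         * no reader/writer bookkeeping to maintain. */
        pthread_spinlock_t new_lock;

        int main(void)
        {
            pthread_spin_init(&new_lock, PTHREAD_PROCESS_PRIVATE);

            pthread_spin_lock(&new_lock);
            /* ... a few instructions ... */
            pthread_spin_unlock(&new_lock);
            return 0;
        }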

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

21 Mar, 2015

1 commit

  • One of the major issues for TCP is SYNACK retransmit handling,
    done by inet_csk_reqsk_queue_prune(), fired by the keepalive
    timer of a TCP_LISTEN socket.

    This function runs for awfully long times, with the socket lock
    held, meaning that other cpus needing this lock have to spin for
    hundreds of ms.

    SYNACKs are sent in huge bursts, likely to cause severe drops
    anyway.

    This model was OK 15 years ago when memory was very tight.

    We can now afford to have a timer per request sock.

    Timer invocations no longer need to lock the listener, and can be
    run from all cpus in parallel.

    With the following patch increasing somaxconn width to 32 bits, I
    tested a listener with more than 4 million active request sockets
    and a steady SYNFLOOD of ~200,000 SYN per second. The host was
    sending ~830,000 SYNACK per second.

    This is ~100 times more than what we could achieve before this
    patch.

    Later, we will get rid of the listener hash and use ehash instead.
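
    The shape of the change, as a userspace sketch using POSIX timers
    (rsk_timer echoes the kernel field name; everything else is
    illustrative; link with -lrt on older glibc):

        #include <signal.h>
        #include <stdio.h>
        #include <time.h>
        #include <unistd.h>

        struct req_demo {
            timer_t rsk_timer;    /* one timer per request sock */
        };

        /* Expiry work for a single request: no listener lock is
         * needed, and expirations for different requests can run on
         * different cpus in parallel. */
        static void synack_rtx(union sigval sv)
        {
            (void)sv;
            puts("retransmit SYNACK for this one request");
        }

        static int arm_req_timer(struct req_demo *req)
        {
            struct sigevent sev = {
                .sigev_notify          = SIGEV_THREAD,
                .sigev_notify_function = synack_rtx,
                .sigev_value.sival_ptr = req,
            };
            struct itimerspec its = { .it_value.tv_sec = 1 };

            if (timer_create(CLOCK_MONOTONIC, &sev, &req->rsk_timer))
                return -1;
            return timer_settime(req->rsk_timer, 0, &its, NULL);
        }

        int main(void)
        {
            struct req_demo req;

            if (arm_req_timer(&req) == 0)
                sleep(2);    /* let the per-request timer fire */
            return 0;
        }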

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

18 Mar, 2015

1 commit


17 Mar, 2015

1 commit

  • reqsk_put() is the generic function that should be used to release
    a refcount (and automatically call reqsk_free()).

    reqsk_free() may be called only if the refcount is known to be 0
    or is undefined.

    The refcount is set to one in inet_csk_reqsk_queue_add().

    As request socks are not yet in the global ehash table, I added
    temporary debugging checks in reqsk_put() and reqsk_free().
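
    The intended contract, sketched with C11 atomics (demo names; the
    kernel helpers differ in detail):

        #include <assert.h>
        #include <stdatomic.h>
        #include <stdlib.h>

        struct req_demo { atomic_int refcnt; };

        /* reqsk_free()-style: only legal when no references remain. */
        static void req_free(struct req_demo *req)
        {
            assert(atomic_load(&req->refcnt) == 0);  /* temporary debug check */
            free(req);
        }

        /* reqsk_put()-style: drop one reference; the last one frees. */
        static void req_put(struct req_demo *req)
        {
            if (atomic_fetch_sub(&req->refcnt, 1) == 1)
                req_free(req);
        }

        int main(void)
        {
            struct req_demo *req = calloc(1, sizeof(*req));

            atomic_store(&req->refcnt, 1);  /* cf. inet_csk_reqsk_queue_add() */
            req_put(req);                   /* last put frees the object */
            return 0;
        }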

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

26 Jun, 2014

1 commit

  • It seems overkill to use vmalloc() for typical listeners with fewer
    than 2048 hash buckets. Try kmalloc() first and fall back to
    vmalloc() to reduce TLB pressure.

    Use kvfree() helper as it is now available.
    Use ilog2() instead of a loop.
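
    Both simplifications, sketched in userspace C (malloc() stands in
    for kmalloc(); a kernel caller would fall back to vmalloc() and
    free with kvfree()):

        #include <stdio.h>
        #include <stdlib.h>

        /* floor(log2(n)) without a loop, as ilog2() computes (n > 0) */
        static unsigned int ilog2_demo(unsigned long n)
        {
            return (unsigned int)(8 * sizeof(n) - 1
                                  - (unsigned long)__builtin_clzl(n));
        }

        int main(void)
        {
            size_t nr_entries = 512;   /* typical small listener */
            void *table = malloc(nr_entries * sizeof(void *)); /* kmalloc() try */

            if (!table)
                return 1;   /* kernel: fall back to vmalloc() here */

            printf("max_qlen_log = %u\n", ilog2_demo(nr_entries));  /* 9 */
            free(table);    /* kernel: kvfree() handles either allocator */
            return 0;
        }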

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

14 Feb, 2014

1 commit

  • One of my pet coding-style peeves is the practice of adding an
    extra return; at the end of a function.
    Kill several instances of this in the network code.

    I suppose some coccinelle wizardry could do this automatically.

    Signed-off-by: Stephen Hemminger
    Signed-off-by: David S. Miller

    stephen hemminger
     

15 Jan, 2013

1 commit

  • spin_is_locked() on a !SMP build is kind of useless: it always
    returns 0 there, so BUG_ON(!spin_is_locked(xx)) is guaranteed to
    crash.

    Just remove this check in reqsk_fastopen_remove(), as the callers
    do hold the socket lock.

    Reported-by: Ketan Kulkarni
    Signed-off-by: Eric Dumazet
    Cc: Jerry Chu
    Cc: Yuchung Cheng
    Cc: Dave Taht
    Acked-by: H.K. Jerry Chu
    Signed-off-by: David S. Miller

    Eric Dumazet
     

01 Sep, 2012

1 commit

  • This patch builds on top of the previous patch to add support
    for TFO listeners. This includes:

    1. allocating, properly initializing, and managing the per listener
    fastopen_queue structure when TFO is enabled

    2. changes to the inet_csk_accept code to support TFO. E.g., the
    request_sock can no longer be freed upon accept(), not until 3WHS
    finishes

    3. allowing a TCP_SYN_RECV socket to properly poll() and sendmsg()
    if it's a TFO socket

    4. properly closing a TFO listener, and a TFO socket before 3WHS
    finishes

    5. supporting TCP_FASTOPEN socket option

    6. modifying tcp_check_req() to check a TFO socket as well as a
    request_sock

    7. supporting TCP's TFO cookie option

    8. adding a new SYN-ACK retransmit handler to use the timer directly
    off the TFO socket rather than the listener socket. Note that TFO
    server side will not retransmit anything other than SYN-ACK until
    the 3WHS is completed.

    The patch also contains an important function
    "reqsk_fastopen_remove()" to manage the somewhat complex relation
    between a listener, its request_sock, and the corresponding child
    socket. See the comment above the function for the details.
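
    For reference, the client-side counterpart that exercises such a
    listener looks like this (a hedged sketch; MSG_FASTOPEN is the
    userspace interface added alongside the TFO work):

        #include <netinet/in.h>
        #include <string.h>
        #include <sys/socket.h>

        #ifndef MSG_FASTOPEN
        #define MSG_FASTOPEN 0x20000000
        #endif

        /* sendto() with MSG_FASTOPEN performs the connect and, when a
         * valid TFO cookie is cached, carries the payload in the SYN. */
        int tfo_send(const struct sockaddr_in *srv, const char *msg)
        {
            int fd = socket(AF_INET, SOCK_STREAM, 0);

            return (int)sendto(fd, msg, strlen(msg), MSG_FASTOPEN,
                               (const struct sockaddr *)srv, sizeof(*srv));
        }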

    Signed-off-by: H.K. Jerry Chu
    Cc: Yuchung Cheng
    Cc: Neal Cardwell
    Cc: Eric Dumazet
    Cc: Tom Herbert
    Signed-off-by: David S. Miller

    Jerry Chu
     

07 Dec, 2011

1 commit

  • Since commit c5ed63d66f24 ("tcp: fix three tcp sysctls tuning"),
    sysctl_max_syn_backlog is determined by tcp_hashinfo->ehash_mask;
    the minimal value is 128, and it increases in proportion to the
    memory of the machine.
    The original descriptions of tcp_max_syn_backlog and
    sysctl_max_syn_backlog are out of date.

    Changelog:
    V2: update description for sysctl_max_syn_backlog

    Signed-off-by: Weiping Pan
    Reviewed-by: Shan Wei
    Acked-by: Neil Horman
    Signed-off-by: David S. Miller

    Peter Pan(潘卫平)
     

09 Dec, 2010

1 commit


03 Dec, 2010

1 commit

  • This will also improve handling of the ipv6 TCP socket request
    backlog when syncookies are not enabled. When the backlog becomes
    very deep, the last quarter of it is limited to validated
    destinations. Previously only ipv4 implemented this logic, but
    now ipv6 does too.

    Now we are only one step away from enabling timewait
    recycling for ipv6, and that step is simply filling in
    the implementation of tcp_v6_get_peer() and
    tcp_v6_tw_get_peer().

    Signed-off-by: David S. Miller

    David S. Miller
     

22 Nov, 2010

1 commit

  • We forgot to use __GFP_HIGHMEM in several __vmalloc() calls.

    In ceph, add the missing flag.

    In fib_trie.c, xfrm_hash.c and request_sock.c, using vzalloc() is
    cleaner and allows using HIGHMEM pages as well.

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

26 Jul, 2008

1 commit

  • Removes a legacy reinvent-the-wheel type thing. The generic
    machinery integrates much better with automated debugging aids
    such as kerneloops.org (and others), and is unambiguous due to
    better naming. Non-intuitively, BUG_TRAP() is actually equal to
    WARN_ON() rather than BUG_ON(); some instances might eventually
    be promoted to BUG_ON(), but I left that for the future.

    I could make at least one BUILD_BUG_ON conversion.

    Signed-off-by: Ilpo Järvinen
    Signed-off-by: David S. Miller

    Ilpo Järvinen
     

29 Jan, 2008

1 commit


15 Nov, 2007

1 commit

  • The request_sock_queue's listen_opt is either vmalloc()ed or
    kmalloc()ed, depending on the number of table entries. Thus it
    is expected to be handled properly on free, which is done in
    reqsk_queue_destroy().

    However, the error path in inet_csk_listen_start() calls the lite
    version of reqsk_queue_destroy(), called __reqsk_queue_destroy(),
    which calls kfree() unconditionally.

    Fix this, and move __reqsk_queue_destroy() into a .c file, as it
    looks too big to be inlined.

    As David also noticed, this is an error-recovery path only, so no
    locking is required and lopt is known to be non-NULL.

    reqsk_queue_yank_listen_sk() is also now only used in
    net/core/request_sock.c, so we should move it there too.

    Signed-off-by: Pavel Emelyanov
    Acked-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Pavel Emelyanov
     

03 Dec, 2006

1 commit

  • We currently allocate a fixed-size hash table (TCP_SYNQ_HSIZE =
    512 slots) for each LISTEN socket, regardless of various
    parameters (the listen backlog, for example).

    On x86_64, this means order-1 allocations (which might fail), even
    for 'small' sockets expecting few connections. Conversely, a huge
    server wanting a backlog of 50000 is slowed down a bit by this
    fixed limit.

    This patch makes the sizing of the listen hash table a dynamic
    parameter, depending on:
    - the net.core.somaxconn tunable (default: 128)
    - the net.ipv4.tcp_max_syn_backlog tunable (default: 256, 1024 or 128)
    - the backlog value given by the application (2nd parameter of listen())

    For large allocations (bigger than PAGE_SIZE), we use vmalloc()
    instead of kmalloc().

    We still limit memory allocation with the two existing tunables
    (somaxconn & tcp_max_syn_backlog), so for standard setups this
    patch actually reduces RAM usage.
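
    The sizing logic, sketched (the clamping order and constants follow
    the description above; treat the details as illustrative rather
    than the exact kernel code):

        #include <stdio.h>

        #define PAGE_SIZE 4096UL

        static unsigned long roundup_pow2(unsigned long n)
        {
            unsigned long p = 1;

            while (p < n)
                p <<= 1;
            return p;
        }

        int main(void)
        {
            unsigned long max_syn_backlog = 1024; /* tcp_max_syn_backlog */
            unsigned long backlog = 50000;        /* listen() argument, with
                                                     somaxconn raised to allow it */

            unsigned long n = backlog;
            if (n > max_syn_backlog)
                n = max_syn_backlog;
            if (n < 8)
                n = 8;
            n = roundup_pow2(n + 1);

            unsigned long sz = n * sizeof(void *);   /* syn_table[] bytes */
            printf("%lu buckets, %lu bytes -> %s\n", n, sz,
                   sz > PAGE_SIZE ? "vmalloc()" : "kmalloc()");
            return 0;
        }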

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

10 Apr, 2006

1 commit


27 Mar, 2006

1 commit

  • Just noticed that request_sock.[ch] contains a useless assignment
    of rskq_accept_head to itself. I assume this is a typo and the 2nd
    one was supposed to be _tail. However, setting _tail to NULL is not
    needed, so the patch below just drops the 2nd assignment.

    Signed-off-by: Norbert Kiesel
    Signed-off-by: Adrian Bunk
    Signed-off-by: David S. Miller

    Norbert Kiesel
     

28 Feb, 2006

1 commit

  • In 295f7324ff8d9ea58b4d3ec93b1aaa1d80e048a9 I moved defer_accept
    from tcp_sock to request_sock_queue and mistakenly reset it in
    reqsk_queue_alloc(), causing calls to setsockopt(TCP_DEFER_ACCEPT)
    to be lost after bind. The fix is to remove the zeroing of
    rskq_defer_accept from reqsk_queue_alloc().

    Thanks to Alexandra N. Kossovsky for reporting and testing the
    suggested fix.
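
    The affected usage pattern, as a minimal sketch (hypothetical port
    and timeout values): the option value set before listen() must
    survive the queue allocation.

        #include <netinet/in.h>
        #include <netinet/tcp.h>
        #include <sys/socket.h>

        int main(void)
        {
            int fd = socket(AF_INET, SOCK_STREAM, 0);
            int secs = 5;   /* wake accept() only once data has arrived */
            struct sockaddr_in a = { .sin_family = AF_INET,
                                     .sin_port   = htons(8080) };

            /* this value was being zeroed by the queue allocation
             * before the fix */
            setsockopt(fd, IPPROTO_TCP, TCP_DEFER_ACCEPT,
                       &secs, sizeof(secs));

            bind(fd, (struct sockaddr *)&a, sizeof(a));
            listen(fd, 128);
            return 0;
        }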

    Signed-off-by: Arnaldo Carvalho de Melo
    Signed-off-by: David S. Miller

    Arnaldo Carvalho de Melo
     

30 Aug, 2005

3 commits


19 Jun, 2005

3 commits