10 Jan, 2019

1 commit

  • [ Upstream commit f0c928d878e7d01b613c9ae5c971a6b1e473a938 ]

    Alexei reported use after frees in inet_diag_dump_icsk() [1]

    Because we use refcount_set() when various sockets are set up and
    inserted into ehash, we also need to make sure inet_diag_dump_icsk()
    won't race with the refcount_set() operations.

    Jonathan Lemon sent a patch changing inet_twsk_hashdance() but
    other spots would need risky changes.

    Instead, fix inet_diag_dump_icsk() as this bug came with
    linux-4.10 only.

    [1] Quoting Alexei :

    First something iterating over sockets finds already freed tw socket:

    refcount_t: increment on 0; use-after-free.
    WARNING: CPU: 2 PID: 2738 at lib/refcount.c:153 refcount_inc+0x26/0x30
    RIP: 0010:refcount_inc+0x26/0x30
    RSP: 0018:ffffc90004c8fbc0 EFLAGS: 00010282
    RAX: 000000000000002b RBX: 0000000000000000 RCX: 0000000000000000
    RDX: ffff88085ee9d680 RSI: ffff88085ee954c8 RDI: ffff88085ee954c8
    RBP: ffff88010ecbd2c0 R08: 0000000000000000 R09: 000000000000174c
    R10: ffffffff81e7c5a0 R11: 0000000000000000 R12: 0000000000000000
    R13: ffff8806ba9bf210 R14: ffffffff82304600 R15: ffff88010ecbd328
    FS: 00007f81f5a7d700(0000) GS:ffff88085ee80000(0000) knlGS:0000000000000000
    CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: 00007f81e2a95000 CR3: 000000069b2eb006 CR4: 00000000003606e0
    DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
    DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
    Call Trace:
    inet_diag_dump_icsk+0x2b3/0x4e0 [inet_diag] // sock_hold(sk); in net/ipv4/inet_diag.c:1002
    ? kmalloc_large_node+0x37/0x70
    ? __kmalloc_node_track_caller+0x1cb/0x260
    ? __alloc_skb+0x72/0x1b0
    ? __kmalloc_reserve.isra.40+0x2e/0x80
    __inet_diag_dump+0x3b/0x80 [inet_diag]
    netlink_dump+0x116/0x2a0
    netlink_recvmsg+0x205/0x3c0
    sock_read_iter+0x89/0xd0
    __vfs_read+0xf7/0x140
    vfs_read+0x8a/0x140
    SyS_read+0x3f/0xa0
    do_syscall_64+0x5a/0x100

    then a minute later twsk timer fires and hits two bad refcnts
    for this freed socket:

    refcount_t: decrement hit 0; leaking memory.
    WARNING: CPU: 31 PID: 0 at lib/refcount.c:228 refcount_dec+0x2e/0x40
    Modules linked in:
    RIP: 0010:refcount_dec+0x2e/0x40
    RSP: 0018:ffff88085f5c3ea8 EFLAGS: 00010296
    RAX: 000000000000002c RBX: ffff88010ecbd2c0 RCX: 000000000000083f
    RDX: 0000000000000000 RSI: 00000000000000f6 RDI: 000000000000003f
    RBP: ffffc90003c77280 R08: 0000000000000000 R09: 00000000000017d3
    R10: ffffffff81e7c5a0 R11: 0000000000000000 R12: ffffffff82ad2d80
    R13: ffffffff8182de00 R14: ffff88085f5c3ef8 R15: 0000000000000000
    FS: 0000000000000000(0000) GS:ffff88085f5c0000(0000) knlGS:0000000000000000
    CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: 00007fbe42685250 CR3: 0000000002209001 CR4: 00000000003606e0
    DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
    DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
    Call Trace:

    inet_twsk_kill+0x9d/0xc0 // inet_twsk_bind_unhash(tw, hashinfo);
    call_timer_fn+0x29/0x110
    run_timer_softirq+0x36b/0x3a0

    refcount_t: underflow; use-after-free.
    WARNING: CPU: 31 PID: 0 at lib/refcount.c:187 refcount_sub_and_test+0x46/0x50
    RIP: 0010:refcount_sub_and_test+0x46/0x50
    RSP: 0018:ffff88085f5c3eb8 EFLAGS: 00010296
    RAX: 0000000000000026 RBX: ffff88010ecbd2c0 RCX: 000000000000083f
    RDX: 0000000000000000 RSI: 00000000000000f6 RDI: 000000000000003f
    RBP: ffff88010ecbd358 R08: 0000000000000000 R09: 000000000000185b
    R10: ffffffff81e7c5a0 R11: 0000000000000000 R12: ffff88010ecbd358
    R13: ffffffff8182de00 R14: ffff88085f5c3ef8 R15: 0000000000000000
    FS: 0000000000000000(0000) GS:ffff88085f5c0000(0000) knlGS:0000000000000000
    CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: 00007fbe42685250 CR3: 0000000002209001 CR4: 00000000003606e0
    DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
    DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
    Call Trace:

    inet_twsk_put+0x12/0x20 // inet_twsk_put(tw);
    call_timer_fn+0x29/0x110
    run_timer_softirq+0x36b/0x3a0

    Fixes: 67db3e4bfbc9 ("tcp: no longer hold ehash lock while calling tcp_get_info()")
    Signed-off-by: Eric Dumazet
    Reported-by: Alexei Starovoitov
    Cc: Jonathan Lemon
    Acked-by: Jonathan Lemon
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Eric Dumazet
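
The commit message above does not include the diff. One plausible way to close this kind of race is for the dumper to take a reference only when the count is already non-zero, and to skip the socket otherwise. The following is a toy, single-threaded model of that idea; the `Refcount` class and both `dump_*` functions are hypothetical names for illustration, not kernel code.

```python
class Refcount:
    """Toy model of a kernel refcount_t (illustrative, not thread-safe)."""

    def __init__(self, value=0):
        self.value = value
        self.warnings = []

    def inc(self):
        # Models an unconditional hold: incrementing from 0 means the
        # object may already be freed, so only a warning is recorded.
        if self.value == 0:
            self.warnings.append("increment on 0; use-after-free")
            return
        self.value += 1

    def inc_not_zero(self):
        # Models a conditional hold: take a reference only if someone
        # else still holds one.
        if self.value == 0:
            return False
        self.value += 1
        return True


def dump_unsafe(bucket):
    """Dumper that blindly holds every socket found in the bucket."""
    held = []
    for name, ref in bucket:
        ref.inc()
        held.append(name)
    return held


def dump_safe(bucket):
    """Dumper that skips sockets whose refcount is still (or already) 0."""
    return [name for name, ref in bucket if ref.inc_not_zero()]


if __name__ == "__main__":
    live, dying = Refcount(2), Refcount(0)
    print(dump_safe([("live", live), ("dying", dying)]))
```

In this model the safe dumper never touches the socket whose count is zero, which is the situation the first warning in the report describes.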
     

02 Sep, 2017

1 commit


19 Aug, 2017

1 commit


14 Jan, 2017

2 commits

  • This patch removes the support of RFC5827 early retransmit (i.e.,
    fast recovery on small inflight with
    Signed-off-by: Neal Cardwell
    Acked-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Yuchung Cheng
     
  • This patch makes RACK install a reordering timer when it suspects
    some packets might be lost, but wants to delay the decision
    a little bit to accommodate reordering.

    It does not create a new timer but instead repurposes the existing
    RTO timer, because both are meant to retransmit packets.
    Specifically it arms a timer ICSK_TIME_REO_TIMEOUT when
    the RACK timing check fails. The wait time is set to

    RACK.RTT + RACK.reo_wnd - (NOW - Packet.xmit_time) + fudge

    This translates to expecting that a packet (Packet) should take
    (RACK.RTT + RACK.reo_wnd + fudge) to deliver after it was sent.

    When there are multiple packets that need a timer, we use one timer
    with the maximum timeout. Therefore the timer conservatively uses
    the maximum window to expire N packets by one timeout, instead of
    N timeouts to expire N packets sent at different times.

    The fudge factor is 2 jiffies to ensure when the timer fires, all
    the suspected packets would exceed the deadline and be marked lost
    by tcp_rack_detect_loss(). It has to be at least 1 jiffy because the
    clock may tick between calling icsk_reset_xmit_timer(timeout) and
    actually hanging the timer. The second jiffy is to lower-bound the timeout
    to 2 jiffies when reo_wnd is < 1ms.

    When the reordering timer fires (tcp_rack_reo_timeout): If we aren't
    in Recovery we'll enter fast recovery and force fast retransmit.
    This is very similar to the early retransmit (RFC5827) except RACK
    is not constrained to only enter recovery for small outstanding
    flights.

    Signed-off-by: Yuchung Cheng
    Signed-off-by: Neal Cardwell
    Acked-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Yuchung Cheng
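
The wait-time rule in the message above can be sketched as plain arithmetic. This is a minimal model that assumes all values share one abstract time unit (ticks); the function name `rack_reo_timeout` and its parameters are hypothetical, and the real code works on socket state rather than lists.

```python
def rack_reo_timeout(now, xmit_times, rtt, reo_wnd, fudge=2):
    """Return (lost, timeout) for a set of suspected packets.

    lost    -- send times of packets already past their deadline
    timeout -- one timer value covering all remaining packets
               (0 means no timer is needed)
    """
    lost = []
    timeout = 0
    for xmit_time in xmit_times:
        # RACK.RTT + RACK.reo_wnd - (NOW - Packet.xmit_time)
        remaining = rtt + reo_wnd - (now - xmit_time)
        if remaining < 0:
            # Deadline already exceeded: treat as lost right away.
            lost.append(xmit_time)
        else:
            # One timer with the maximum timeout, plus the fudge factor.
            timeout = max(timeout, remaining + fudge)
    return lost, timeout


if __name__ == "__main__":
    print(rack_reo_timeout(now=10, xmit_times=[0, 5, 8], rtt=10, reo_wnd=2))
```

Using the maximum means the most recently sent packet decides the timer, so older packets are well past their deadline when it fires.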
     

10 Nov, 2016

1 commit

  • We had various problems in the past in tcp_get_info() and used
    specific synchronization to avoid deadlocks.

    We would like to add more instrumentation points for TCP, and
    avoiding grabbing the socket lock in tcp_get_info() was too costly.

    Being able to lock the socket allows us to provide a consistent set
    of fields.

    inet_diag_dump_icsk() can make sure ehash locks are not
    held any more when tcp_get_info() is called.

    We can remove syncp added in commit d654976cbf85
    ("tcp: fix a potential deadlock in tcp_get_info()"), but we need
    to use lock_sock_fast() instead of spin_lock_bh() since TCP input
    path can now be run from process context.

    Signed-off-by: Eric Dumazet
    Signed-off-by: Yuchung Cheng
    Acked-by: Soheil Hassas Yeganeh
    Acked-by: Neal Cardwell
    Signed-off-by: David S. Miller

    Eric Dumazet
     

24 Oct, 2016

1 commit

  • In criu we actively use the diag interface to collect the sockets
    present in the system when dumping applications. While it works
    as expected for unix, tcp, udp[lite], packet and netlink sockets,
    raw sockets have no diag support. Thus add it.

    v2:
    - add missing sock_put calls in raw_diag_dump_one (by eric.dumazet@)
    - implement @destroy for diag requests (by dsa@)

    v3:
    - add export of raw_abort for IPv6 (by dsa@)
    - pass net-admin flag into inet_sk_diag_fill due to
    changes in net-next branch (by dsa@)

    v4:
    - use @pad in struct inet_diag_req_v2 for raw socket
    protocol specification: raw module carries sockets
    which may have custom protocol passed from socket()
    syscall and sole @sdiag_protocol is not enough to
    match underlying ones
    - start reporting protocol specified in socket() call
    when sockets are raw ones for the same reason: user
    space tools like ss may parse this attribute and use
    it for socket matching

    v5 (by eric.dumazet@):
    - use sock_hold in raw_sock_get instead of atomic_inc,
    we're holding (raw_v4_hashinfo|raw_v6_hashinfo)->lock
    when looking up so counter won't be zero here.

    v6:
    - use sdiag_raw_protocol() helper which will access @pad
    structure used for raw sockets protocol specification:
    we can't simply rename this member without breaking uapi

    v7:
    - since sdiag_raw_protocol() helper is not suitable for
    uapi let's rather make an alias structure with proper
    names. __check_inet_diag_req_raw helper will catch
    if either structure is unintentionally changed.

    CC: David S. Miller
    CC: Eric Dumazet
    CC: David Ahern
    CC: Alexey Kuznetsov
    CC: James Morris
    CC: Hideaki YOSHIFUJI
    CC: Patrick McHardy
    CC: Andrey Vagin
    CC: Stephen Hemminger
    Signed-off-by: Cyrill Gorcunov
    Signed-off-by: David S. Miller

    Cyrill Gorcunov
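
The v4 and v7 notes above describe carrying the raw protocol in the byte that struct inet_diag_req_v2 calls @pad. Below is a rough userspace sketch of what such a request could look like on the wire. The field order and sizes are assumptions based on the usual uapi layout (four one-byte fields, a 32-bit state mask, then a 48-byte socket id), and the helper names are hypothetical.

```python
import struct

AF_INET = 2
IPPROTO_ICMP = 1
IPPROTO_RAW = 255


def pack_sockid(sport=0, dport=0, src=b"", dst=b"", ifindex=0, cookie=(0, 0)):
    """Pack an assumed inet_diag_sockid: ports, addresses, ifindex, cookie."""
    return (struct.pack("!HH", sport, dport)          # ports, network order
            + src.ljust(16, b"\0")                    # source address
            + dst.ljust(16, b"\0")                    # destination address
            + struct.pack("=I2I", ifindex, *cookie))  # host-order u32 fields


def pack_req(family, protocol, ext, fourth_byte, states, sockid):
    """Pack the request; fourth_byte is @pad, reused for the raw protocol."""
    header = struct.pack("=BBBBI", family, protocol, ext, fourth_byte, states)
    return header + sockid


if __name__ == "__main__":
    # Ask for raw sockets that were created with IPPROTO_ICMP.
    req = pack_req(AF_INET, IPPROTO_RAW, 0, IPPROTO_ICMP, 0xFFFFFFFF,
                   pack_sockid())
    print(len(req), req[3])
```

Because only the meaning of one existing byte changes, a request built this way keeps the same size and offsets as before, which is the property an alias structure with a layout check is meant to protect.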
     

20 Oct, 2016

1 commit

  • softirq handlers use RCU protection to lookup listeners,
    and write operations all happen from process context.
    We do not need to block BH for dump operations.

    Also SYN_RECV since request sockets are stored in the ehash table :

    1) inet_diag_dump_icsk() no longer needs to clear
    cb->args[3] and cb->args[4] that were used as cursors while
    iterating the old per listener hash table.

    2) Also factorize a test : No need to scan listening_hash[]
    if r->id.idiag_dport is not zero.

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

09 Sep, 2016

1 commit

  • This adds the capability for a process that has CAP_NET_ADMIN on
    a socket to see the socket mark in socket dumps.

    Commit a52e95abf772 ("net: diag: allow socket bytecode filters to
    match socket marks") recently gave privileged processes the
    ability to filter socket dumps based on mark. This patch is
    complementary: it ensures that the mark is also passed to
    userspace in the socket's netlink attributes. It is useful for
    tools like ss which display information about sockets.

    Tested: https://android-review.googlesource.com/270210
    Signed-off-by: Lorenzo Colitti
    Signed-off-by: David S. Miller

    Lorenzo Colitti
     

25 Aug, 2016

2 commits

  • This allows a privileged process to filter by socket mark when
    dumping sockets via INET_DIAG_BY_FAMILY. This is useful on
    systems that use mark-based routing such as Android.

    The ability to filter socket marks requires CAP_NET_ADMIN, which
    is consistent with other privileged operations allowed by the
    SOCK_DIAG interface such as the ability to destroy sockets and
    the ability to inspect BPF filters attached to packet sockets.

    Tested: https://android-review.googlesource.com/261350
    Signed-off-by: Lorenzo Colitti
    Acked-by: David Ahern
    Signed-off-by: David S. Miller

    Lorenzo Colitti
     
  • This simplifies the code a bit and also allows inet_diag_bc_audit
    to send to userspace an error that isn't EINVAL.

    Signed-off-by: Lorenzo Colitti
    Acked-by: David Ahern
    Signed-off-by: David S. Miller

    Lorenzo Colitti
     

28 Jun, 2016

1 commit


27 Apr, 2016

1 commit


22 Apr, 2016

1 commit


16 Apr, 2016

1 commit

  • inet_diag_msg_common_fill is used to fill the diag msg common info,
    we need to use it in sctp_diag as well, so export it.

    inet_diag_msg_attrs_fill is used to fill some common attrs info between
    sctp diag and tcp diag.

    v2->v3:
    - do not need to define and export inet_diag_get_handler any more.
    cause all the functions in it are in sctp_diag.ko, we just call
    them in sctp_diag.ko.

    - add inet_diag_msg_attrs_fill to make codes clear.

    Signed-off-by: Xin Long
    Signed-off-by: David S. Miller

    Xin Long
     

05 Apr, 2016

2 commits

  • When a SYNFLOOD targets a non SO_REUSEPORT listener, multiple
    cpus contend on sk->sk_refcnt and sk->sk_wmem_alloc changes.

    By letting listeners use SOCK_RCU_FREE infrastructure,
    we can relax TCP_LISTEN lookup rules and avoid touching sk_refcnt

    Note that we still use SLAB_DESTROY_BY_RCU rules for other sockets,
    only listeners are impacted by this change.

    Peak performance under SYNFLOOD is increased by ~33% :

    On my test machine, I could process 3.2 Mpps instead of 2.4 Mpps

    Most consuming functions are now skb_set_owner_w() and sock_wfree()
    contending on sk->sk_wmem_alloc when cooking SYNACK and freeing them.

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     
  • RX packet processing holds rcu_read_lock(), so we can remove
    pairs of rcu_read_lock()/rcu_read_unlock() in lookup functions
    if inet_diag also holds rcu before calling them.

    This is needed anyway as __inet_lookup_listener() and
    inet6_lookup_listener() will soon no longer increment
    refcount on the found listener.

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

15 Mar, 2016

1 commit


11 Feb, 2016

1 commit

  • This is a preliminary step to allow fast socket lookup of SO_REUSEPORT
    groups. Doing so with a BPF filter will require access to the
    skb in question. This change plumbs the skb (and offset to payload
    data) through the call stack to the listening socket lookup
    implementations where it will be used in a following patch.

    Signed-off-by: Craig Gallek
    Signed-off-by: David S. Miller

    Craig Gallek
     

21 Jan, 2016

1 commit


16 Dec, 2015

2 commits

  • This passes the SOCK_DESTROY operation to the underlying protocol
    diag handler, or returns -EOPNOTSUPP if that handler does not
    define a destroy operation.

    Most of this patch is just renaming functions. This is not
    strictly necessary, but it would be fairly counterintuitive to
    have the code to destroy inet sockets be in a function whose name
    starts with inet_diag_get.

    Signed-off-by: Lorenzo Colitti
    Acked-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Lorenzo Colitti
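
The dispatch rule described above (call the protocol's destroy operation, or report that it is unsupported) can be sketched as follows. This is only a model of the control flow; `DiagHandler` and `sock_destroy` are hypothetical stand-ins, not the kernel structures.

```python
import errno


class DiagHandler:
    """Hypothetical stand-in for a per-protocol diag handler."""

    def __init__(self, name, destroy=None):
        self.name = name
        self.destroy = destroy  # optional callable(sock) -> int status


def sock_destroy(handler, sock):
    """Forward the destroy request, or fail if the handler lacks one."""
    if handler.destroy is None:
        return -errno.EOPNOTSUPP
    return handler.destroy(sock)


if __name__ == "__main__":
    closed = []
    tcp = DiagHandler("tcp", destroy=lambda s: closed.append(s) or 0)
    udp = DiagHandler("udp")  # no destroy operation defined
    print(sock_destroy(tcp, "sk1"), sock_destroy(udp, "sk2"), closed)
```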
     
  • Currently, inet_diag_dump_one_icsk finds a socket and then dumps
    its information to userspace. Split it into a part that finds the
    socket and a part that dumps the information.

    Signed-off-by: Lorenzo Colitti
    Acked-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Lorenzo Colitti
     

03 Oct, 2015

2 commits

  • In this patch, we insert request sockets into TCP/DCCP
    regular ehash table (where ESTABLISHED and TIMEWAIT sockets
    are) instead of using the per listener hash table.

    ACK packets find SYN_RECV pseudo sockets without having
    to find and lock the listener.

    In nominal conditions, this halves pressure on listener lock.

    Note that this will allow for SO_REUSEPORT refinements,
    so that we can select a listener using cpu/numa affinities instead
    of the prior 'consistent hash', since only SYN packets will
    apply this selection logic.

    We will shrink listen_sock in the following patch to ease
    code review.

    Signed-off-by: Eric Dumazet
    Cc: Ying Cai
    Cc: Willem de Bruijn
    Signed-off-by: David S. Miller

    Eric Dumazet
     
  • qlen_inc & young_inc were protected by listener lock,
    while qlen_dec & young_dec were atomic fields.

    Everything needs to be atomic for upcoming lockless listener.

    Also move qlen/young in request_sock_queue as we'll get rid
    of struct listen_sock eventually.

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

11 Jul, 2015

1 commit

  • Reconsidering my commit 20462155 "net: inet_diag: export IPV6_V6ONLY
    sockopt", I am not happy with the limitations it causes for socket
    analysing code in userspace. Exporting the value only if it is set makes
    it hard for userspace to decide whether the option is not set or the
    kernel does not support exporting the option at all.

    From an auditor's perspective, the interesting question for listening
    AF_INET6 sockets is: "Does it NOT have IPV6_V6ONLY set?" Because it is
    the unexpected case. This patch allows answering this question reliably.

    Signed-off-by: Phil Sutter
    Cc: Eric Dumazet
    Signed-off-by: David S. Miller

    Phil Sutter
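
The userspace ambiguity described above can be illustrated with a small decoder. It assumes the attributes of one AF_INET6 socket have already been parsed into a dict; the key "INET_DIAG_SKV6ONLY" is used here as a plain dict key, and the function name is hypothetical.

```python
def v6only_status(attrs):
    """Classify one AF_INET6 socket from its decoded diag attributes."""
    if "INET_DIAG_SKV6ONLY" not in attrs:
        # When the value is always exported, absence can only mean the
        # kernel does not support exporting the option.
        return "unknown"
    return "v6only" if attrs["INET_DIAG_SKV6ONLY"] else "dual-stack"


if __name__ == "__main__":
    for attrs in ({}, {"INET_DIAG_SKV6ONLY": 0}, {"INET_DIAG_SKV6ONLY": 1}):
        print(attrs, "->", v6only_status(attrs))
```

If the value were exported only when set, the first two cases would be indistinguishable, which is exactly the limitation the message describes.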
     

24 Jun, 2015

1 commit

  • For AF_INET6 sockets, the value of struct ipv6_pinfo.ipv6only is
    exported to userspace. It indicates whether a socket bound to in6addr_any
    listens on IPv4 as well as IPv6. Since the socket is natively IPv6, it is not
    listed by e.g. 'ss -l -4'.

    This patch is accompanied by an appropriate one for iproute2 to enable
    the additional information in 'ss -e'.

    Signed-off-by: Phil Sutter
    Signed-off-by: David S. Miller

    Phil Sutter
     

23 Jun, 2015

1 commit


22 Jun, 2015

1 commit

  • When an inet_sock is destroyed, its source port (sk_num) is set to
    zero as part of the unhash procedure. In order to supply a source
    port as part of the NETLINK_SOCK_DIAG socket destruction broadcasts,
    the source port number must be read from inet_sport instead.

    Tested: ss -E
    Signed-off-by: Craig Gallek
    Signed-off-by: David S. Miller

    Craig Gallek
     

16 Jun, 2015

2 commits

  • This get_info handler will simply dispatch to the appropriate
    existing inet protocol handler.

    This patch also includes a new netlink attribute
    (INET_DIAG_PROTOCOL). This attribute is currently only used
    for multicast messages. Without this attribute, there is no
    way of knowing the IP protocol used by the socket information
    being broadcast. This attribute is not necessary in the 'dump'
    variant of this protocol (though it could easily be added)
    because dump requests are issued for specific family/protocol
    pairs.

    Tested: ss -E (note, the -E option has not yet been merged into
    the upstream version of ss).

    Signed-off-by: Craig Gallek
    Acked-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Craig Gallek
     
  • Previously, there was no clear distinction between the inet protocols
    that used struct tcp_info to report information and those that didn't.
    This change adds a specific size attribute to the inet_diag_handler
    struct which defines these interfaces. This will make dispatching
    sock_diag get_info requests identical for all inet protocols in a
    following patch.

    Tested: ss -au
    Tested: ss -at
    Signed-off-by: Craig Gallek
    Acked-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Craig Gallek
     

30 Apr, 2015

1 commit

  • We would like the optional info that Congestion Control modules
    provide via netlink to also be readable using getsockopt()

    This patch changes get_info() to put this information in a buffer,
    instead of skb, like tcp_get_info(), so that a following patch
    can reuse this common infrastructure.

    Signed-off-by: Eric Dumazet
    Cc: Yuchung Cheng
    Cc: Neal Cardwell
    Acked-by: Neal Cardwell
    Acked-by: Daniel Borkmann
    Acked-by: Yuchung Cheng
    Signed-off-by: David S. Miller

    Eric Dumazet
     

18 Apr, 2015

1 commit

  • Two different problems are fixed here :

    1) inet_sk_diag_fill() might be called without socket lock held.
    icsk->icsk_ca_ops can change under us and module be unloaded.
    -> Access to freed memory.
    Fix this using rcu_read_lock() to prevent module unload.

    2) Some TCP Congestion Control modules provide information
    but again this is not safe against icsk->icsk_ca_ops
    change and nla_put() errors were ignored. Some sockets
    could not get the additional info if skb was almost full.

    Fix this by returning a status from get_info() handlers and
    using rcu protection as well.

    Signed-off-by: Eric Dumazet
    Acked-by: Daniel Borkmann
    Signed-off-by: David S. Miller

    Eric Dumazet
     

14 Apr, 2015

1 commit

  • Using a timer wheel for timewait sockets was nice ~15 years ago when
    memory was expensive and machines had a single processor.

    This does not scale, the code is ugly and a source of huge latencies
    (Typically 30 ms have been seen, cpus spinning on death_lock spinlock.)

    We can afford to use an extra 64 bytes per timewait sock and spread
    timewait load to all cpus to have better behavior.

    Tested:

    On following test, /proc/sys/net/ipv4/tcp_tw_recycle is set to 1
    on the target (lpaa24)

    Before patch :

    lpaa23:~# ./super_netperf 200 -H lpaa24 -t TCP_CC -l 60 -- -p0,0
    419594

    lpaa23:~# ./super_netperf 200 -H lpaa24 -t TCP_CC -l 60 -- -p0,0
    437171

    While test is running, we can observe 25 or even 33 ms latencies.

    lpaa24:~# ping -c 1000 -i 0.02 -qn lpaa23
    ...
    1000 packets transmitted, 1000 received, 0% packet loss, time 20601ms
    rtt min/avg/max/mdev = 0.020/0.217/25.771/1.535 ms, pipe 2

    lpaa24:~# ping -c 1000 -i 0.02 -qn lpaa23
    ...
    1000 packets transmitted, 1000 received, 0% packet loss, time 20702ms
    rtt min/avg/max/mdev = 0.019/0.183/33.761/1.441 ms, pipe 2

    After patch :

    About 90% increase of throughput :

    lpaa23:~# ./super_netperf 200 -H lpaa24 -t TCP_CC -l 60 -- -p0,0
    810442

    lpaa23:~# ./super_netperf 200 -H lpaa24 -t TCP_CC -l 60 -- -p0,0
    800992

    And latencies are kept to minimal values during this load, even
    if network utilization is 90% higher :

    lpaa24:~# ping -c 1000 -i 0.02 -qn lpaa23
    ...
    1000 packets transmitted, 1000 received, 0% packet loss, time 19991ms
    rtt min/avg/max/mdev = 0.023/0.064/0.360/0.042 ms

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
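
The "about 90%" figure above can be checked directly from the four super_netperf results and the ping maxima quoted in the message. The sketch below only redoes that arithmetic; averaging the two runs on each side is an assumption about how the figure was derived.

```python
def mean(values):
    return sum(values) / len(values)


# super_netperf results quoted in the message above.
before = [419594, 437171]
after = [810442, 800992]

# Relative throughput gain, comparing the averages of the two runs.
gain = mean(after) / mean(before) - 1

# Worst-case ping rtt (ms) before and after, from the quoted summaries.
worst_before = max(25.771, 33.761)
worst_after = 0.360
latency_ratio = worst_before / worst_after

if __name__ == "__main__":
    print(f"throughput gain: {gain:.1%}")
    print(f"worst-case latency ratio: {latency_ratio:.1f}x")
```

The gain comes out a little under 90%, consistent with the rounded figure in the message.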
     

24 Mar, 2015

1 commit

  • This is a low hanging fruit, as we'll get rid of syn_wait_lock eventually.

    We hold syn_wait_lock for such small sections, that it makes no sense to use
    a read/write lock. A spin lock is simply faster.

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

21 Mar, 2015

2 commits

  • Conflicts:
    drivers/net/ethernet/emulex/benet/be_main.c
    net/core/sysctl_net_core.c
    net/ipv4/inet_diag.c

    The be_main.c conflict resolution was really tricky. The conflict
    hunks generated by GIT were very unhelpful, to say the least. It
    split functions in half and moved them around, when the actual
    conflict existed solely inside of one function, that being
    be_map_pci_bars().

    So instead, to resolve this, I checked out be_main.c from the top
    of net-next, then I applied the be_main.c changes from 'net' since
    the last time I merged. And this worked beautifully.

    The inet_diag.c and sysctl_net_core.c conflicts were simple
    overlapping changes, and were easy to resolve.

    Signed-off-by: David S. Miller

    David S. Miller
     
  • One of the major issues for TCP is the SYNACK rtx handling,
    done by inet_csk_reqsk_queue_prune(), fired by the keepalive
    timer of a TCP_LISTEN socket.

    This function runs for awfully long times, with socket lock held,
    meaning that other cpus needing this lock have to spin for hundreds of ms.

    SYNACK are sent in huge bursts, likely to cause severe drops anyway.

    This model was OK 15 years ago when memory was very tight.

    We now can afford to have a timer per request sock.

    Timer invocations no longer need to lock the listener,
    and can be run from all cpus in parallel.

    With a following patch increasing somaxconn width to 32 bits,
    I tested a listener with more than 4 million active request sockets,
    and a steady SYNFLOOD of ~200,000 SYN per second.
    Host was sending ~830,000 SYNACK per second.

    This is ~100 times more than what we could achieve before this patch.

    Later, we will get rid of the listener hash and use ehash instead.

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

19 Mar, 2015

1 commit


17 Mar, 2015

1 commit


15 Mar, 2015

2 commits