16 Jun, 2017

1 commit

  • Add the infrastructure for attaching Upper Layer Protocols (ULPs) over TCP
    sockets, based on a similar infrastructure in tcp_cong. The idea is that any
    ULP can add its own logic by changing the TCP proto_ops structure to its own
    methods.

    Example usage:

    setsockopt(sock, SOL_TCP, TCP_ULP, "tls", sizeof("tls"));

    modules will call:
    tcp_register_ulp(&tcp_tls_ulp_ops);

    to register/unregister their ULP, with an init function and name.

    A list of registered ULPs will be returned by tcp_get_available_ulp, which is
    hooked up to /proc. Example:

    $ cat /proc/sys/net/ipv4/tcp_available_ulp
    tls

    There is currently no functionality to remove or chain ULPs, but
    it should be possible to add these in the future if needed.
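
    A minimal sketch of the module side, assuming tcp_ulp_ops carries the init
    callback, name and owner fields described above (the my_ulp names are
    illustrative, not part of the patch):

    #include <linux/module.h>
    #include <net/tcp.h>

    static int my_ulp_init(struct sock *sk)
    {
            /* Swap in this ULP's methods for the socket here. */
            return 0;
    }

    static struct tcp_ulp_ops my_ulp_ops __read_mostly = {
            .name  = "my_ulp",
            .owner = THIS_MODULE,
            .init  = my_ulp_init,
    };

    static int __init my_ulp_module_init(void)
    {
            return tcp_register_ulp(&my_ulp_ops);
    }

    static void __exit my_ulp_module_exit(void)
    {
            tcp_unregister_ulp(&my_ulp_ops);
    }

    module_init(my_ulp_module_init);
    module_exit(my_ulp_module_exit);
    MODULE_LICENSE("GPL");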

    Signed-off-by: Boris Pismenny
    Signed-off-by: Dave Watson
    Signed-off-by: David S. Miller

    Dave Watson
     

10 Mar, 2017

1 commit

  • Lockdep issues a circular dependency warning when AFS issues an operation
    through AF_RXRPC from a context in which the VFS/VM holds the mmap_sem.

    The theory lockdep comes up with is as follows:

    (1) If the pagefault handler decides it needs to read pages from AFS, it
    calls AFS with mmap_sem held and AFS begins an AF_RXRPC call, but
    creating a call requires the socket lock:

    mmap_sem must be taken before sk_lock-AF_RXRPC

    (2) afs_open_socket() opens an AF_RXRPC socket and binds it. rxrpc_bind()
    binds the underlying UDP socket whilst holding its socket lock.
    inet_bind() takes its own socket lock:

    sk_lock-AF_RXRPC must be taken before sk_lock-AF_INET

    (3) Reading from a TCP socket into a userspace buffer might cause a fault
    and thus cause the kernel to take the mmap_sem, but the TCP socket is
    locked whilst doing this:

    sk_lock-AF_INET must be taken before mmap_sem

    However, lockdep's theory is wrong in this instance because it deals only
    with lock classes and not individual locks. The AF_INET lock in (2) isn't
    really equivalent to the AF_INET lock in (3) as the former deals with a
    socket entirely internal to the kernel that never sees userspace. This is
    a limitation in the design of lockdep.

    Fix the general case by:

    (1) Double up all the locking keys used in sockets so that one set are
    used if the socket is created by userspace and the other set is used
    if the socket is created by the kernel.

    (2) Store the kern parameter passed to sk_alloc() in a variable in the
    sock struct (sk_kern_sock). This informs sock_lock_init(),
    sock_init_data() and sk_clone_lock() as to the lock keys to be used.

    Note that the child created by sk_clone_lock() inherits the parent's
    kern setting.

    (3) Add a 'kern' parameter to ->accept() that is analogous to the one
    passed in to ->create() that distinguishes whether kernel_accept() or
    sys_accept4() was the caller and can be passed to sk_alloc().

    Note that a lot of accept functions merely dequeue an already
    allocated socket. I haven't touched these as the new socket already
    exists before we get the parameter.

    Note also that there are a couple of places where I've made the accepted
    socket unconditionally kernel-based:

    irda_accept()
    rds_tcp_accept_one()
    tcp_accept_from_sock()

    because they follow a sock_create_kern() and accept off of that.

    Whilst creating this, I noticed that lustre and ocfs don't create sockets
    through sock_create_kern() and thus they aren't marked as for-kernel,
    though they appear to be internal. I wonder if these should do that so
    that they use the new set of lock keys.
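
    A hedged sketch of the ->accept() change in (3); the surrounding code is
    illustrative only, but the point is that the new boolean is threaded
    through to sk_alloc():

    /* kern is true when kernel_accept() is the caller, false for
     * sys_accept4(); pass it on so the new sock picks the right lock keys.
     */
    static int example_accept(struct socket *sock, struct socket *newsock,
                              int flags, bool kern)
    {
            struct sock *newsk;

            newsk = sk_alloc(sock_net(sock->sk), sock->sk->sk_family,
                             GFP_KERNEL, sock->sk->sk_prot, kern);
            if (!newsk)
                    return -ENOMEM;

            /* ... protocol-specific accept processing ... */
            sock_graft(newsk, newsock);
            return 0;
    }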

    Signed-off-by: David Howells
    Signed-off-by: David S. Miller

    David Howells
     

19 Jan, 2017

1 commit

  • The only difference between inet6_csk_bind_conflict and inet_csk_bind_conflict
    is how they check the rcv_saddr, so delete this callback and simply
    change inet_csk_bind_conflict to call inet_rcv_saddr_equal.

    Signed-off-by: Josef Bacik
    Signed-off-by: David S. Miller

    Josef Bacik
     

14 Jan, 2017

1 commit

  • This patch makes RACK install a reordering timer when it suspects
    some packets might be lost, but wants to delay the decision
    a little bit to accommodate reordering.

    It does not create a new timer but instead repurposes the existing
    RTO timer, because both are meant to retransmit packets.
    Specifically it arms a timer ICSK_TIME_REO_TIMEOUT when
    the RACK timing check fails. The wait time is set to

    RACK.RTT + RACK.reo_wnd - (NOW - Packet.xmit_time) + fudge

    This translates to expecting that a packet (Packet) should take
    (RACK.RTT + RACK.reo_wnd + fudge) to be delivered after it was sent.

    When there are multiple packets that need a timer, we use one timer
    with the maximum timeout. Therefore the timer conservatively uses
    the maximum window to expire N packets by one timeout, instead of
    N timeouts to expire N packets sent at different times.

    The fudge factor is 2 jiffies to ensure when the timer fires, all
    the suspected packets would exceed the deadline and be marked lost
    by tcp_rack_detect_loss(). It has to be at least 1 jiffy because the
    clock may tick between calling icsk_reset_xmit_timer(timeout) and
    actually arming the timer. The extra jiffy lower-bounds the timeout
    to 2 jiffies when reo_wnd is < 1ms.
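
    A self-contained model (not the kernel code) of the wait time derived
    from the formula above, with all times in milliseconds and the 2-jiffy
    fudge passed in as a plain value:

    #include <stdint.h>

    static int64_t rack_reo_wait_ms(int64_t rack_rtt_ms, int64_t reo_wnd_ms,
                                    int64_t now_ms, int64_t xmit_time_ms,
                                    int64_t fudge_ms)
    {
            int64_t wait = rack_rtt_ms + reo_wnd_ms
                           - (now_ms - xmit_time_ms) + fudge_ms;

            return wait > 0 ? wait : 0;  /* deadline already passed: fire now */
    }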

    When the reordering timer fires (tcp_rack_reo_timeout): If we aren't
    in Recovery we'll enter fast recovery and force fast retransmit.
    This is very similar to the early retransmit (RFC5827) except RACK
    is not constrained to only enter recovery for small outstanding
    flights.

    Signed-off-by: Yuchung Cheng
    Signed-off-by: Neal Cardwell
    Acked-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Yuchung Cheng
     

18 Dec, 2016

1 commit

  • A user may call listen without binding an explicit port, with the intent
    that the kernel will assign an available port to the socket. In this
    case inet_csk_get_port does a port scan. For such sockets, the user may
    also set soreuseport with the intent of creating more sockets for the
    port that is selected. The problem is that the initial socket being
    opened could inadvertently choose an existing and unrelated port
    number that was already created with soreuseport.

    This patch adds a boolean parameter to inet_bind_conflict that indicates
    whether soreuseport is allowed for the check (in addition to
    sk->sk_reuseport). In calls to inet_bind_conflict from inet_csk_get_port
    the argument is set to true if an explicit port is being looked up (snum
    argument is nonzero), and false if a port scan is done.
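
    A self-contained model of that rule (the names below are illustrative,
    not the kernel's): soreuseport may only excuse a bind conflict when an
    explicit port was requested, never during the kernel's own port scan:

    #include <stdbool.h>

    static bool reuseport_may_excuse_conflict(unsigned short requested_port,
                                              bool sk_reuseport)
    {
            bool explicit_port = requested_port != 0;   /* the snum != 0 case */

            return explicit_port && sk_reuseport;
    }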

    Signed-off-by: Tom Herbert
    Signed-off-by: David S. Miller

    Tom Herbert
     

30 Oct, 2016

1 commit

  • Per listen(fd, backlog) rules, there is really no point accepting a SYN,
    sending a SYNACK, and dropping the following ACK packet if the accept queue
    is full, because the application is not draining the accept queue fast enough.

    This behavior is fooling TCP clients that believe they established a
    flow, while there is nothing at the server side. They might then send about
    10 MSS (if using IW10) that will be dropped anyway while the server is under
    stress.

    Signed-off-by: Eric Dumazet
    Acked-by: Neal Cardwell
    Acked-by: Yuchung Cheng
    Signed-off-by: David S. Miller

    Eric Dumazet
     

21 Sep, 2016

1 commit

  • The TCP CUBIC module already uses 64 bytes.
    The upcoming TCP BBR module uses 88 bytes.

    Signed-off-by: Van Jacobson
    Signed-off-by: Neal Cardwell
    Signed-off-by: Yuchung Cheng
    Signed-off-by: Nandita Dukkipati
    Signed-off-by: Eric Dumazet
    Signed-off-by: Soheil Hassas Yeganeh
    Signed-off-by: David S. Miller

    Neal Cardwell
     

19 Feb, 2016

1 commit

  • Ilya reported following lockdep splat:

    kernel: =========================
    kernel: [ BUG: held lock freed! ]
    kernel: 4.5.0-rc1-ceph-00026-g5e0a311 #1 Not tainted
    kernel: -------------------------
    kernel: swapper/5/0 is freeing memory
    ffff880035c9d200-ffff880035c9dbff, with a lock still held there!
    kernel: (&(&queue->rskq_lock)->rlock){+.-...}, at:
    [] inet_csk_reqsk_queue_add+0x28/0xa0
    kernel: 4 locks held by swapper/5/0:
    kernel: #0: (rcu_read_lock){......}, at: []
    netif_receive_skb_internal+0x4b/0x1f0
    kernel: #1: (rcu_read_lock){......}, at: []
    ip_local_deliver_finish+0x3f/0x380
    kernel: #2: (slock-AF_INET){+.-...}, at: []
    sk_clone_lock+0x19b/0x440
    kernel: #3: (&(&queue->rskq_lock)->rlock){+.-...}, at:
    [] inet_csk_reqsk_queue_add+0x28/0xa0

    To properly fix this issue, inet_csk_reqsk_queue_add() needs
    to return to its callers whether the child has been queued
    into the accept queue.

    We also need to make sure the listener is still there before
    calling sk->sk_data_ready(), by holding a reference on it,
    since the reference carried by the child can disappear as
    soon as the child is put on the accept queue.

    Reported-by: Ilya Dryomov
    Fixes: ebb516af60e1 ("tcp/dccp: fix race at listener dismantle phase")
    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

23 Oct, 2015

1 commit

  • Multiple cpus can process duplicates of incoming ACK messages
    matching a SYN_RECV request socket. This is a rare event under
    normal operations, but definitely can happen.

    Only one must win the race, otherwise corruption would occur.

    To fix this without adding new atomic ops, we use logic in
    inet_ehash_nolisten() to detect the request was present in the same
    ehash bucket where we try to insert the new child.

    If request socket was not found, we have to undo the child creation.

    This actually removes a spin_lock()/spin_unlock() pair in
    reqsk_queue_unlink() for the fast path.

    Fixes: e994b2f0fb92 ("tcp: do not lock listener to process SYN packets")
    Fixes: 079096f103fa ("tcp/dccp: install syn_recv requests into ehash table")
    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

16 Oct, 2015

2 commits

  • Under stress, a close() on a listener can trigger the
    WARN_ON(sk->sk_ack_backlog) in inet_csk_listen_stop().

    We need to test whether the listener is still active before queueing
    a child in inet_csk_reqsk_queue_add().

    Create a common inet_child_forget() helper, and use it
    from inet_csk_reqsk_queue_add() and inet_csk_listen_stop().

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     
  • Let's reduce the confusion about inet_csk_reqsk_queue_drop() :
    In many cases we also need to release reference on request socket,
    so add a helper to do this, reducing code size and complexity.

    Fixes: 4bdc3d66147b ("tcp/dccp: fix behavior of stale SYN_RECV request sockets")
    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

03 Oct, 2015

3 commits

  • This control variable was set at the first listen(fd, backlog)
    call, but not updated if the application tried to increase or decrease
    the backlog. It made sense at the time the listener had a non-resizable
    hash table.

    Also, rounding to powers of two was not very friendly.

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     
  • In this patch, we insert request sockets into TCP/DCCP
    regular ehash table (where ESTABLISHED and TIMEWAIT sockets
    are) instead of using the per listener hash table.

    ACK packets find SYN_RECV pseudo sockets without having
    to find and lock the listener.

    In nominal conditions, this halves pressure on listener lock.

    Note that this will allow for SO_REUSEPORT refinements,
    so that we can select a listener using cpu/numa affinities instead
    of the prior 'consistent hash', since only SYN packets will
    apply this selection logic.

    We will shrink listen_sock in the following patch to ease
    code review.

    Signed-off-by: Eric Dumazet
    Cc: Ying Cai
    Cc: Willem de Bruijn
    Signed-off-by: David S. Miller

    Eric Dumazet
     
  • This is no longer used.

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

01 Jun, 2015

1 commit

  • Linux 3.17 and earlier are explicitly engineered so that if the app
    doesn't specifically request a CC module on a listener before the SYN
    arrives, then the child gets the system default CC when the connection
    is established. See tcp_init_congestion_control() in 3.17 or earlier,
    which says "if no choice made yet assign the current value set as
    default". The change ("net: tcp: assign tcp cong_ops when tcp sk is
    created") altered these semantics, so that children got their parent
    listener's congestion control even if the system default had changed
    after the listener was created.

    This commit returns to those original semantics from 3.17 and earlier,
    since they are the original semantics from 2007 in 4d4d3d1e8 ("[TCP]:
    Congestion control initialization."), and some Linux congestion
    control workflows depend on that.

    In summary, if a listener socket specifically sets TCP_CONGESTION to
    "x", or the route locks the CC module to "x", then the child gets
    "x". Otherwise the child gets current system default from
    net.ipv4.tcp_congestion_control. That's the behavior in 3.17 and
    earlier, and this commit restores that.
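
    For reference, the listener-side override mentioned above is the ordinary
    TCP_CONGESTION socket option; a minimal user-space example (error
    handling omitted):

    #include <netinet/in.h>
    #include <netinet/tcp.h>
    #include <string.h>
    #include <sys/socket.h>

    static void pin_listener_cc(int listen_fd)
    {
            const char cc[] = "reno";   /* any module name available on the host */

            setsockopt(listen_fd, IPPROTO_TCP, TCP_CONGESTION, cc, strlen(cc));
    }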

    Fixes: 55d8694fa82c ("net: tcp: assign tcp cong_ops when tcp sk is created")
    Cc: Florian Westphal
    Cc: Daniel Borkmann
    Cc: Glenn Judd
    Cc: Stephen Hemminger
    Signed-off-by: Neal Cardwell
    Signed-off-by: Eric Dumazet
    Signed-off-by: Yuchung Cheng
    Acked-by: Daniel Borkmann
    Signed-off-by: David S. Miller

    Neal Cardwell
     

19 May, 2015

1 commit

  • tcp_illinois and the upcoming tcp_cdg require 64-bit alignment of
    icsk_ca_priv.

    x86 does not care, but other architectures might.

    Fixes: 05cbc0db03e82 ("ipv4: Create probe timer for tcp PMTU as per RFC4821")
    Signed-off-by: Eric Dumazet
    Cc: Fan Du
    Acked-by: Fan Du
    Signed-off-by: David S. Miller

    Eric Dumazet
     

24 Apr, 2015

1 commit

  • [ 3897.923145] BUG: unable to handle kernel NULL pointer dereference at
    0000000000000080
    [ 3897.931025] IP: [] reqsk_timer_handler+0x1a6/0x243

    There is a race when reqsk_timer_handler() and tcp_check_req() call
    inet_csk_reqsk_queue_unlink() on the same req at the same time.

    Before commit fa76ce7328b2 ("inet: get rid of central tcp/dccp listener
    timer"), listener spinlock was held and race could not happen.

    To solve this bug, we change reqsk_queue_unlink() to not assume req
    must be found, and we return a status, to conditionally release a
    refcount on the request sock.

    This also means tcp_check_req() in the non-fastopen case might or might not
    consume the req refcount, so tcp_v6_hnd_req() & tcp_v4_hnd_req() have
    to handle this properly.

    (Same remark for dccp_check_req() and its callers)

    inet_csk_reqsk_queue_drop() is now too big to be inlined, as it is
    called 4 times in tcp and 3 times in dccp.

    Fixes: fa76ce7328b2 ("inet: get rid of central tcp/dccp listener timer")
    Signed-off-by: Eric Dumazet
    Reported-by: Yuchung Cheng
    Signed-off-by: David S. Miller

    Eric Dumazet
     

21 Mar, 2015

2 commits

  • One of the major issues for TCP is the SYNACK rtx handling,
    done by inet_csk_reqsk_queue_prune(), fired by the keepalive
    timer of a TCP_LISTEN socket.

    This function runs for awfully long times, with the socket lock held,
    meaning that other cpus needing this lock have to spin for hundreds of ms.

    SYNACK are sent in huge bursts, likely to cause severe drops anyway.

    This model was OK 15 years ago when memory was very tight.

    We now can afford to have a timer per request sock.

    Timer invocations no longer need to lock the listener,
    and can be run from all cpus in parallel.

    With the following patch increasing somaxconn width to 32 bits,
    I tested a listener with more than 4 million active request sockets,
    and a steady SYNFLOOD of ~200,000 SYN per second.
    The host was sending ~830,000 SYNACK per second.

    This is ~100 times more than we could achieve before this patch.

    Later, we will get rid of the listener hash and use ehash instead.

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     
  • When request socks are put in the ehash table, the whole notion
    of having a previous request to update dl_next is pointless.

    Also, the following patch will get rid of the big purge timer,
    so we want to delete a request sock without holding the listener lock.

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

18 Mar, 2015

1 commit

  • While testing last patch series, I found req sock refcounting was wrong.

    We must set skc_refcnt to 1 for all request socks added in hashes,
    but also on request sockets created by FastOpen or syncookies.

    It is tricky because we need to defer this initialization so that
    future RCU lookups do not try to take a refcount on a not yet
    fully initialized request socket.

    Also get rid of ireq_refcnt alias.

    Signed-off-by: Eric Dumazet
    Fixes: 13854e5a6046 ("inet: add proper refcounting to request sock")
    Signed-off-by: David S. Miller

    Eric Dumazet
     

17 Mar, 2015

1 commit

  • reqsk_put() is the generic function that should be used
    to release a refcount (and automatically call reqsk_free())

    reqsk_free() might be called if refcount is known to be 0
    or undefined.

    refcnt is set to one in inet_csk_reqsk_queue_add()

    As request socks are not yet in global ehash table,
    I added temporary debugging checks in reqsk_put() and reqsk_free()

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

07 Mar, 2015

1 commit

  • As per RFC4821, 7.3 "Selecting Probe Size", a probe timer should
    be armed once probing has converged. Once this timer expires,
    probe again to take advantage of any path PMTU change. The
    recommended probing interval is 10 minutes per RFC1981. The probing
    interval can be tuned via the sysctl_tcp_probe_interval sysctl.

    Eric Dumazet suggested implementing the pseudo timer based on the 32-bit
    jiffies tcp_time_stamp instead of using a classic timer for such a
    rare event.
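
    A self-contained model of that pseudo timer (not the kernel code):
    reprobe once enough 32-bit timestamp ticks have elapsed since probing
    converged, with unsigned arithmetic handling wraparound:

    #include <stdbool.h>
    #include <stdint.h>

    static bool pmtu_probe_due(uint32_t now_ts, uint32_t converged_ts,
                               uint32_t probe_interval_ts)
    {
            /* Unsigned subtraction stays correct across 32-bit wrap. */
            return (uint32_t)(now_ts - converged_ts) >= probe_interval_ts;
    }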

    Signed-off-by: Fan Du
    Signed-off-by: David S. Miller

    Fan Du
     

06 Jan, 2015

1 commit

  • This patch adds necessary infrastructure to the congestion control
    framework for later per route congestion control support.

    For a per route congestion control possibility, our aim is to store
    a unique u32 key identifier into dst metrics, which can then be
    mapped into a tcp_congestion_ops struct. We argue that having a
    RTAX key entry is the most simple, generic and easy way to manage,
    and also keeps the memory footprint of dst entries lower on 64 bit
    than with storing a pointer directly, for example. Having a unique
    key id also allows for decoupling actual TCP congestion control
    module management from the FIB layer, i.e. we don't have to care
    about expensive module refcounting inside the FIB at this point.

    We first thought of using an IDR store for the realization, which
    takes over dynamic assignment of unused key space and also performs
    the key to pointer mapping in RCU. While doing so, we stumbled upon
    the issue that due to the nature of dynamic key distribution, it
    just so happens, arguably in very rare occasions, that excessive
    module loads and unloads can lead to a possible reuse of previously
    used key space. Thus, previously stale keys in the dst metric are
    now being reassigned to a different congestion control algorithm,
    which might lead to unexpected behaviour. One way to resolve this
    would have been to walk FIBs on the actually rare occasion of a
    module unload and reset the metric keys for each FIB in each netns,
    but that's just very costly.

    Therefore, we argue a better solution is to reuse the unique
    congestion control algorithm name member and map that into u32 key
    space through jhash. For that, we split the flags attribute (as it
    currently uses 2 bits only anyway) into two u32 attributes, flags
    and key, so that we can keep the cacheline boundary of 2 cachelines
    on x86_64 and cache the precalculated key at registration time for
    the fast path. On average we might expect 2 - 4 modules being loaded,
    worst case perhaps 15, so the possibility of a key collision is extremely
    low, and guaranteed collision-free on LE/BE for all in-tree modules.
    Overall this results in much simpler code, and all without the
    overhead of an IDR. Due to the deterministic nature, modules can
    now be unloaded, the congestion control algorithm for a specific
    but unloaded key will fall back to the default one, and on module
    reload time it will switch back to the expected algorithm
    transparently.

    Joint work with Florian Westphal.
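
    A hedged sketch of the name-to-key mapping described above; the exact
    jhash length and seed arguments here are assumptions, not taken from the
    patch:

    #include <linux/jhash.h>
    #include <net/tcp.h>

    static u32 example_ca_key(const char name[TCP_CA_NAME_MAX])
    {
            /* Deterministic: the same module name always yields the same key. */
            return jhash(name, TCP_CA_NAME_MAX, 0);
    }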

    Signed-off-by: Florian Westphal
    Signed-off-by: Daniel Borkmann
    Signed-off-by: David S. Miller

    Daniel Borkmann
     

23 Sep, 2014

1 commit

  • icsk_rto is a 32-bit field, and icsk_backoff can reach 15 by default,
    or more if some sysctls (e.g. tcp_retries2) are changed.

    Better to use 64-bit arithmetic to perform icsk_rto << icsk_backoff operations.

    As Joe Perches suggested, add a helper for this.

    Yuchung spotted the tcp_v4_err() case.
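
    A self-contained model of the helper's intent (the in-tree helper this
    describes is inet_csk_rto_backoff(); the code below is only an
    illustration): widen to 64 bits before shifting, then clamp:

    #include <stdint.h>

    static uint64_t rto_shifted(uint32_t icsk_rto, unsigned int icsk_backoff,
                                uint64_t max_when)
    {
            uint64_t when = (uint64_t)icsk_rto << icsk_backoff;  /* widen first */

            return when < max_when ? when : max_when;
    }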

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

15 Aug, 2014

1 commit

  • Make sure we use the correct address-family-specific function for
    handling MTU reductions from within tcp_release_cb().

    Previously AF_INET6 sockets were incorrectly always using the IPv6
    code path when sometimes they were handling IPv4 traffic and thus had
    an IPv4 dst.

    Signed-off-by: Neal Cardwell
    Signed-off-by: Eric Dumazet
    Diagnosed-by: Willem de Bruijn
    Fixes: 563d34d057862 ("tcp: dont drop MTU reduction indications")
    Reviewed-by: Hannes Frederic Sowa
    Signed-off-by: David S. Miller

    Neal Cardwell
     

16 Apr, 2014

1 commit

  • ip_queue_xmit() assumes the skb it has to transmit is attached to an
    inet socket. Commit 31c70d5956fc ("l2tp: keep original skb ownership")
    changed l2tp to not change skb ownership and thus broke this assumption.

    One fix is to add a new 'struct sock *sk' parameter to ip_queue_xmit(),
    so that we do not assume skb->sk points to the socket used by the l2tp
    tunnel.

    Fixes: 31c70d5956fc ("l2tp: keep original skb ownership")
    Reported-by: Zhan Jianyu
    Tested-by: Zhan Jianyu
    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

22 Sep, 2013

1 commit

  • There is a mix of function prototypes with and without extern
    in the kernel sources. Standardize on not using extern for
    function prototypes.

    Function prototypes don't need to be written with extern.
    extern is assumed by the compiler. Its use is as unnecessary as
    using auto to declare automatic/local variables in a block.
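
    Illustration with a hypothetical prototype (both declarations are
    equivalent to the compiler, so the explicit extern is simply dropped):

    struct sock;

    /* Old style: redundant extern on a function prototype. */
    extern int tcp_example_fn(struct sock *sk);

    /* Preferred style after this cleanup: extern is implied. */
    int tcp_example_fn(struct sock *sk);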

    Signed-off-by: Joe Perches
    Signed-off-by: David S. Miller

    Joe Perches
     

12 Mar, 2013

1 commit

  • This patch series implement the Tail loss probe (TLP) algorithm described
    in http://tools.ietf.org/html/draft-dukkipati-tcpm-tcp-loss-probe-01. The
    first patch implements the basic algorithm.

    TLP's goal is to reduce tail latency of short transactions. It achieves
    this by converting retransmission timeouts (RTOs) occurring due
    to tail losses (losses at end of transactions) into fast recovery.
    TLP transmits one packet in two round-trips when a connection is in
    Open state and isn't receiving any ACKs. The transmitted packet, aka
    loss probe, can be either new or a retransmission. When there is tail
    loss, the ACK from a loss probe triggers FACK/early-retransmit based
    fast recovery, thus avoiding a costly RTO. In the absence of loss,
    there is no change in the connection state.

    PTO stands for probe timeout. It is a timer event indicating
    that an ACK is overdue and triggers a loss probe packet. The PTO value
    is set to max(2*SRTT, 10ms) and is adjusted to account for the delayed
    ACK timer when there is only one outstanding packet (a sketch of this
    computation follows the algorithm outline below).

    TLP Algorithm

    On transmission of new data in Open state:
    -> packets_out > 1: schedule PTO in max(2*SRTT, 10ms).
    -> packets_out == 1: schedule PTO in max(2*RTT, 1.5*RTT + 200ms)
    -> PTO = min(PTO, RTO)

    Conditions for scheduling PTO:
    -> Connection is in Open state.
    -> Connection is either cwnd limited or no new data to send.
    -> Number of probes per tail loss episode is limited to one.
    -> Connection is SACK enabled.

    When PTO fires:
    new_segment_exists:
    -> transmit new segment.
    -> packets_out++. cwnd remains same.

    no_new_packet:
    -> retransmit the last segment.
    Its ACK triggers FACK or early retransmit based recovery.

    ACK path:
    -> rearm RTO at start of ACK processing.
    -> reschedule PTO if need be.
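
    A self-contained model of the PTO value described above (not kernel
    code); all times are in milliseconds and SRTT stands in for RTT:

    #include <stdint.h>

    static int64_t max_ms(int64_t a, int64_t b) { return a > b ? a : b; }
    static int64_t min_ms(int64_t a, int64_t b) { return a < b ? a : b; }

    static int64_t tlp_pto_ms(int64_t srtt_ms, int64_t rto_ms,
                              unsigned int packets_out)
    {
            int64_t pto;

            if (packets_out > 1)
                    pto = max_ms(2 * srtt_ms, 10);
            else    /* one packet left: allow for the delayed ACK timer */
                    pto = max_ms(2 * srtt_ms, (3 * srtt_ms) / 2 + 200);

            return min_ms(pto, rto_ms);     /* never later than the RTO */
    }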

    In addition, the patch includes a small variation to the Early Retransmit
    (ER) algorithm, such that ER and TLP together can in principle recover any
    N-degree of tail loss through fast recovery. TLP is controlled by the same
    sysctl as ER, tcp_early_retrans sysctl.
    tcp_early_retrans==0; disables TLP and ER.
    ==1; enables RFC5827 ER.
    ==2; delayed ER.
    ==3; TLP and delayed ER. [DEFAULT]
    ==4; TLP only.

    The TLP patch series has been extensively tested on Google Web servers.
    It is most effective for short Web transactions, where it reduced RTOs by 15%
    and improved HTTP response time (average by 6%, 99th percentile by 10%).
    The transmitted probes account for
    Acked-by: Neal Cardwell
    Acked-by: Yuchung Cheng
    Signed-off-by: David S. Miller

    Nandita Dukkipati
     

15 Dec, 2012

1 commit

  • If either of the above functions, inet_csk_route_child_sock() or
    __inet_inherit_port(), fails, the newsk will not be freed:

    unreferenced object 0xffff88022e8a92c0 (size 1592):
    comm "softirq", pid 0, jiffies 4294946244 (age 726.160s)
    hex dump (first 32 bytes):
    0a 01 01 01 0a 01 01 02 00 00 00 00 a7 cc 16 00 ................
    02 00 03 01 00 00 00 00 00 00 00 00 00 00 00 00 ................
    backtrace:
    [] kmemleak_alloc+0x21/0x3e
    [] kmem_cache_alloc+0xb5/0xc5
    [] sk_prot_alloc.isra.53+0x2b/0xcd
    [] sk_clone_lock+0x16/0x21e
    [] inet_csk_clone_lock+0x10/0x7b
    [] tcp_create_openreq_child+0x21/0x481
    [] tcp_v4_syn_recv_sock+0x3a/0x23b
    [] tcp_check_req+0x29f/0x416
    [] tcp_v4_do_rcv+0x161/0x2bc
    [] tcp_v4_rcv+0x6c9/0x701
    [] ip_local_deliver_finish+0x70/0xc4
    [] ip_local_deliver+0x4e/0x7f
    [] ip_rcv_finish+0x1fc/0x233
    [] ip_rcv+0x217/0x267
    [] __netif_receive_skb+0x49e/0x553
    [] netif_receive_skb+0x50/0x82

    This happens because sk_clone_lock initializes sk_refcnt to 2, and thus
    a single sock_put() is not enough to free the memory. Additionally, things
    like xfrm, memcg, cookie_values, ... may have been initialized.
    We have to free them properly.

    This is fixed by forcing a call to tcp_done(), ending up in
    inet_csk_destroy_sock, which does the final sock_put(). tcp_done() is necessary,
    because it ends up doing all the cleanup on xfrm, memcg,
    cookie_values, ...

    Before calling tcp_done, we have to set the socket to SOCK_DEAD, to
    force it entering inet_csk_destroy_sock. To avoid the warning in
    inet_csk_destroy_sock, inet_num has to be set to 0.
    As inet_csk_destroy_sock does a dec on orphan_count, we first have to
    increase it.

    Calling tcp_done() allows us to remove the calls to
    tcp_clear_xmit_timer() and tcp_cleanup_congestion_control().

    A similar approach is taken for dccp by calling dccp_done().

    This bug has been in the kernel since 093d282321 (tproxy: fix hash locking issue
    when using port redirection in __inet_inherit_port()), i.e. since
    version >= 2.6.37.
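
    A hedged sketch of the forced-close sequence described above (the helper
    and field names are reproduced from memory of that era's code, so treat
    them as illustrative):

    static void example_forced_close(struct sock *newsk)
    {
            /* Let inet_csk_destroy_sock() accept the socket. */
            sock_set_flag(newsk, SOCK_DEAD);
            /* Balance the orphan_count decrement done on destroy. */
            percpu_counter_inc(newsk->sk_prot->orphan_count);
            /* Avoid the warning about a still-bound port. */
            inet_sk(newsk)->inet_num = 0;

            /* Full cleanup; ends up in inet_csk_destroy_sock(). */
            tcp_done(newsk);
    }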

    Signed-off-by: Christoph Paasch
    Signed-off-by: David S. Miller

    Christoph Paasch
     

07 Aug, 2012

1 commit

  • IPv6 needs a cookie in dst_check() call.

    We need to add rx_dst_cookie and provide a family independent
    sk_rx_dst_set(sk, skb) method to properly support IPv6 TCP early demux.

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

16 Jul, 2012

1 commit

  • This abstracts away the call to dst_ops->update_pmtu() so that we can
    transparently handle the fact that, in the future, the dst itself can
    be invalidated by the PMTU update (when we have non-host routes cached
    in sockets).

    So we try to rebuild the socket cached route after the method
    invocation if necessary.

    This isn't used by SCTP because it needs to cache dsts per-transport,
    and thus will need its own local version of this helper.

    Signed-off-by: David S. Miller

    David S. Miller
     

09 Jun, 2012

1 commit

  • The get_peer method TCP uses is full of special cases that make no
    sense accommodating, and it also gets in the way of doing more
    reasonable things here.

    First of all, if the socket doesn't have a usable cached route, there
    is no sense in trying to optimize timewait recycling.

    Likewise for the case where we have IP options, such as SRR enabled,
    that make the IP header destination address (and thus the destination
    address of the route key) differ from that of the connection's
    destination address.

    Just return a NULL peer in these cases, and thus we're also able to
    get rid of the clumsy inetpeer release logic.

    Signed-off-by: David S. Miller

    David S. Miller
     

27 Apr, 2012

1 commit

  • Quoting Tore Anderson from
    https://bugzilla.kernel.org/show_bug.cgi?id=42572 :

    When RTAX_FEATURE_ALLFRAG is set on a route, the effective TCP segment
    size does not take into account the size of the IPv6 Fragmentation
    header that needs to be included in outbound packets, causing every
    transmitted TCP segment to be fragmented across two IPv6 packets, the
    latter of which will only contain 8 bytes of actual payload.

    RTAX_FEATURE_ALLFRAG is typically set on a route in response to
    receiving an ICMPv6 Packet Too Big message indicating a Path MTU of less
    than 1280 bytes. 1280 bytes is the minimum IPv6 MTU, however ICMPv6
    PTBs with MTU < 1280 are still valid, in particular when an IPv6
    packet is sent to an IPv4 destination through a stateless translator.
    Any ICMPv4 Need To Fragment packets originated from the IPv4 part of
    the path will be translated to ICMPv6 PTB which may then indicate an
    MTU of less than 1280.

    The Linux kernel refuses to reduce the effective MTU to anything below
    1280 bytes, instead it sets it to exactly 1280 bytes, and
    RTAX_FEATURE_ALLFRAG is also set. However, the TCP segment size appears
    to be set to 1240 bytes (1280 Path MTU - 40 bytes of IPv6 header),
    instead of 1232 (additionally taking into account the 8 bytes required
    by the IPv6 Fragmentation extension header).

    This in turn results in rather inefficient transmission, as every
    transmitted TCP segment now is split in two fragments containing
    1232+8 bytes of payload.

    After this patch, all outgoing packets that include a
    Fragmentation header are "atomic" or "non-fragmented" fragments,
    i.e., they have both Offset=0 and More Fragments=0.

    With help from David S. Miller
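
    The arithmetic from the report, as a self-contained check of the 1240 vs.
    1232 byte figures:

    #include <stdio.h>

    int main(void)
    {
            int pmtu = 1280;        /* minimum IPv6 MTU, as clamped by the kernel */
            int ipv6_hdr = 40;
            int frag_hdr = 8;       /* only present when ALLFRAG is set */

            printf("without frag header: %d\n", pmtu - ipv6_hdr);            /* 1240 */
            printf("with frag header:    %d\n", pmtu - ipv6_hdr - frag_hdr); /* 1232 */
            return 0;
    }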

    Reported-by: Tore Anderson
    Signed-off-by: Eric Dumazet
    Cc: Maciej Żenczykowski
    Cc: Tom Herbert
    Tested-by: Tore Anderson
    Signed-off-by: David S. Miller

    Eric Dumazet