19 Apr, 2012

1 commit

  • Alexander Beregalov reported skb_over_panic errors and provided stack
    trace.

    It appears that commit a21d45726aca (tcp: avoid order-1 allocations on
    wifi and tx path) added a regression that triggers when a retransmit is
    done after a partial ACK.

    tcp_retransmit_skb() tries to aggregate several frames if the first one
    has enough available room to hold the payload of the following ones.
    This is controlled by the /proc/sys/net/ipv4/tcp_retrans_collapse
    tunable (default: enabled).

    The problem is we must make sure _pskb_trim_head() doesn't fool
    skb_availroom() when pulling some bytes from the skb (this pull is done
    when the receiver ACKs part of the frame).
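    The interaction can be sketched in user space as follows. This is a
    minimal model, not kernel code: struct sk_buff is far larger and
    skb_availroom()'s real definition differs; only the len/avail_size
    interplay described above is shown, with the hypothetical apply_fix
    flag standing in for the corrected _pskb_trim_head().

```c
/* Simplified sketch of the accounting involved (assumed field names). */
struct skb_sketch {
    unsigned int len;        /* bytes currently held by the skb   */
    unsigned int avail_size; /* payload limit fixed at allocation */
};

/* skb_availroom() on the non-SG path: payload bytes that may still
 * be appended without overflowing the allocation-time limit. */
static unsigned int skb_availroom_sketch(const struct skb_sketch *skb)
{
    return skb->avail_size - skb->len;
}

/* Pulling acked bytes off the head shrinks len; unless avail_size
 * shrinks too, availroom grows past what the buffer can really hold,
 * and a later retransmit collapse overruns the skb (skb_over_panic). */
static void pskb_trim_head_sketch(struct skb_sketch *skb,
                                  unsigned int acked, int apply_fix)
{
    skb->len -= acked;
    if (apply_fix)
        skb->avail_size -= acked; /* keep the limit honest */
}
```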

    Reported-by: Alexander Beregalov
    Cc: Marc MERLIN
    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

11 Apr, 2012

1 commit

    Marc Merlin reported many order-1 allocation failures in the TX path on
    his wireless setup, which make no sense on an MTU=1500 network with
    non-SG-capable hardware.

    After investigation, it turns out TCP uses sk_stream_alloc_skb() and,
    by convention, skb_tailroom(skb) to know how many bytes of data payload
    can be put in the skb (for non-SG-capable devices).

    Note: these skbs used kmalloc-4096 (MTU=1500 + MAX_HEADER +
    sizeof(struct skb_shared_info) is above 2048).

    Later, the mac80211 layer needs to add some bytes at the tail of the
    skb (IEEE80211_ENCRYPT_TAILROOM = 18 bytes), and since no more tailroom
    is available it has to call pskb_expand_head() and request order-1
    allocations.

    This patch changes sk_stream_alloc_skb() so that only
    sk->sk_prot->max_header bytes of headroom are reserved, and uses a new
    skb field, avail_size, to hold the data payload limit.

    This way, order-0 allocations done by the TCP stack can leave more than
    2 KB of tailroom and no further allocation is performed in the mac80211
    layer (or any layer needing some tailroom).

    avail_size is unioned with mark/dropcount, since mark will be set later
    in IP stack for output packets. Therefore, skb size is unchanged.
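    The size argument can be illustrated with two toy layouts. These
    structs are purely hypothetical stand-ins for the relevant corner of
    struct sk_buff; the point is only that a union adds no bytes, because
    mark is written later by the IP layer while avail_size is only needed
    while TCP fills the skb.

```c
#include <stddef.h>

/* Hypothetical "before" layout: a separate field would grow the struct. */
struct with_extra_field {
    unsigned int mark;
    unsigned int avail_size;
};

/* The patch's approach: the two fields are live at disjoint times in
 * the packet's life, so they can share storage in a union. */
struct with_union {
    union {
        unsigned int mark;
        unsigned int avail_size;
    };
};
```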

    Reported-by: Marc MERLIN
    Tested-by: Marc MERLIN
    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

05 Feb, 2012

1 commit


31 Jan, 2012

1 commit

  • This commit fixes tcp_trim_head() to recalculate the number of
    segments in the skb with the skb's existing MSS, so trimming the head
    causes the skb segment count to be monotonically non-increasing - it
    should stay the same or go down, but not increase.

    Previously tcp_trim_head() used the current MSS of the connection. But
    if there was a decrease in MSS between original transmission and ACK
    (e.g. due to PMTUD), this could cause tcp_trim_head() to
    counter-intuitively increase the segment count when trimming bytes off
    the head of an skb. This violated assumptions in tcp_tso_acked() that
    tcp_trim_head() only decreases the packet count, so that packets_acked
    in tcp_tso_acked() could underflow, leading tcp_clean_rtx_queue() to
    pass u32 pkts_acked values as large as 0xffffffff to
    ca_ops->pkts_acked().

    As an aside, if tcp_trim_head() had really wanted the skb to reflect
    the current MSS, it should have called tcp_set_skb_tso_segs()
    unconditionally, since a decrease in MSS would mean that a
    single-packet skb should now be sliced into multiple segments.
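    The segment arithmetic above can be checked with the ceiling division
    the kernel's DIV_ROUND_UP performs. The numbers below are illustrative:
    an skb of 2920 bytes sent at MSS 1460 counts as 2 segments; after 1460
    acked bytes are trimmed, recomputing with the original MSS gives 1
    segment (non-increasing), but recomputing with a reduced current MSS of
    600 gives 3, more than the original 2.

```c
/* Segment count of `len` payload bytes at a given MSS
 * (ceiling division, like the kernel's DIV_ROUND_UP). */
static unsigned int tso_segs(unsigned int len, unsigned int mss)
{
    return (len + mss - 1) / mss;
}
```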

    Signed-off-by: Neal Cardwell
    Acked-by: Nandita Dukkipati
    Acked-by: Ilpo Järvinen
    Signed-off-by: David S. Miller

    Neal Cardwell
     

27 Jan, 2012

1 commit


13 Dec, 2011

1 commit

    This patch replaces all uses of the struct sock fields memory_pressure,
    memory_allocated, sockets_allocated, and sysctl_mem with accessor
    macros. Those macros can receive either a socket argument or a
    mem_cgroup argument, depending on the context they live in.

    Since we're only doing macro wrapping here, no performance impact at
    all is expected in the case where cgroups are disabled.

    Signed-off-by: Glauber Costa
    Reviewed-by: Hiroyuki Kamezawa
    CC: David S. Miller
    CC: Eric W. Biederman
    CC: Eric Dumazet
    Signed-off-by: David S. Miller

    Glauber Costa
     

06 Dec, 2011

1 commit

    Commit f07d960df3 (tcp: avoid frag allocation for small frames) broke
    the assumption in the TCP stack that an skb is either linear
    (skb->data_len == 0) or fully fragged (skb->data_len == skb->len).

    tcp_trim_head() relied on this assumption, so we must fix it.

    Thanks to Vijay for providing a very detailed explanation.

    Reported-by: Vijay Subramanian
    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

05 Dec, 2011

1 commit

    We discovered that the TCP stack could retransmit misaligned skbs if a
    malicious peer acknowledged a sub-MSS frame. Currently this can happen
    only if the output interface is not SG-enabled: if SG is enabled, TCP
    builds headless skbs (all payload is included in fragments), so the
    TCP trimming process only removes parts of skb fragments and headers
    stay aligned.

    Some arches can't handle misalignments, so force a head reallocation
    and shrink headroom to MAX_TCP_HEADER.

    We don't care about misalignments on x86 and PPC (or other arches
    setting NET_IP_ALIGN to 0).

    This patch introduces __pskb_copy(), which can specify the headroom of
    the new head; pskb_copy() becomes a wrapper on top of __pskb_copy().
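    The wrapper pattern can be sketched in user space. This is an analog,
    not the kernel functions: a hypothetical buffer type with head/data
    pointers stands in for sk_buff, and the low-level copy takes the new
    headroom explicitly while the old entry point preserves the original
    headroom, mirroring the __pskb_copy()/pskb_copy() split.

```c
#include <stdlib.h>
#include <string.h>

struct buf { unsigned char *head; unsigned char *data; size_t len; };

/* Analog of __pskb_copy(): the caller chooses the new headroom. */
static struct buf *copy_with_headroom(const struct buf *b, size_t headroom)
{
    struct buf *n = malloc(sizeof(*n));
    if (!n)
        return NULL;
    n->head = malloc(headroom + b->len);
    if (!n->head) {
        free(n);
        return NULL;
    }
    n->data = n->head + headroom;   /* reserve caller-chosen headroom */
    memcpy(n->data, b->data, b->len);
    n->len = b->len;
    return n;
}

/* Analog of pskb_copy(): a wrapper that keeps the old headroom. */
static struct buf *copy_same_headroom(const struct buf *b)
{
    return copy_with_headroom(b, (size_t)(b->data - b->head));
}
```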

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

29 Nov, 2011

1 commit

  • Since 2005 (c1b4a7e69576d65efc31a8cea0714173c2841244)
    tcp_tso_should_defer has been using tcp_max_burst() as a target limit
    for deciding how large to make outgoing TSO packets when not using
    sysctl_tcp_tso_win_divisor. But since 2008
    (dd9e0dda66ba38a2ddd1405ac279894260dc5c36) tcp_max_burst() returns the
    reordering degree. We should not have tcp_tso_should_defer attempt to
    build larger segments just because there is more reordering. This
    commit splits the notion of deferral size used in TSO from the notion
    of burst size used in cwnd moderation, and returns the TSO deferral
    limit to its original value.

    Signed-off-by: Neal Cardwell
    Signed-off-by: David S. Miller

    Neal Cardwell
     

09 Nov, 2011

1 commit


21 Oct, 2011

1 commit

    Adding const qualifiers to pointers can ease code review and spot some
    bugs. It might also allow the compiler to optimize code further.

    For example, is it legal to temporarily write a null cksum into the
    tcphdr in tcp_md5_hash_header()? I am afraid a sniffer could catch the
    temporary null value...

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

19 Oct, 2011

1 commit

    To ease skb->truesize sanitization, it's better to be able to localize
    all references to skb frag sizes.

    Define accessors: skb_frag_size() to fetch a frag's size, and
    skb_frag_size_{set|add|sub}() to manipulate it.
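    The accessor pattern is straightforward; the sketch below models it
    over a toy frag type (the real skb_frag_t is different), with every
    read and update funneled through one helper so size accounting has a
    single place to audit or hook.

```c
/* Toy stand-in for skb_frag_t: only the size field matters here. */
typedef struct { unsigned int size; } frag_sketch;

static unsigned int skb_frag_size(const frag_sketch *f)
{
    return f->size;
}

static void skb_frag_size_set(frag_sketch *f, unsigned int v)
{
    f->size = v;
}

static void skb_frag_size_add(frag_sketch *f, int delta)
{
    f->size += delta;
}

static void skb_frag_size_sub(frag_sketch *f, int delta)
{
    f->size -= delta;
}
```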

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

28 Sep, 2011

1 commit

  • Rename struct tcp_skb_cb "flags" to "tcp_flags" to ease code review and
    maintenance.

    Its content is a combination of the FIN/SYN/RST/PSH/ACK/URG/ECE/CWR flags.

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

25 Aug, 2011

2 commits

  • This patch implements Proportional Rate Reduction (PRR) for TCP.
    PRR is an algorithm that determines TCP's sending rate in fast
    recovery. PRR avoids excessive window reductions and aims for
    the actual congestion window size at the end of recovery to be as
    close as possible to the window determined by the congestion control
    algorithm. PRR also improves accuracy of the amount of data sent
    during loss recovery.

    The patch implements the recommended flavor of PRR called PRR-SSRB
    (Proportional rate reduction with slow start reduction bound) and
    replaces the existing rate halving algorithm. PRR improves upon the
    existing Linux fast recovery under a number of conditions including:
    1) burst losses where the losses implicitly reduce the amount of
    outstanding data (pipe) below the ssthresh value selected by the
    congestion control algorithm and,
    2) losses near the end of short flows where the application runs out of
    data to send.

    As an example, with the existing rate halving implementation a single
    loss event can cause a connection carrying short Web transactions to
    go into the slow start mode after the recovery. This is because during
    recovery Linux pulls the congestion window down to packets_in_flight+1
    on every ACK. A short Web response often runs out of new data to send
    and its pipe reduces to zero by the end of recovery when all its packets
    are drained from the network. Subsequent HTTP responses using the same
    connection will have to slow start to raise cwnd to ssthresh. PRR on
    the other hand aims for the cwnd to be as close as possible to ssthresh
    by the end of recovery.

    A description of PRR and a discussion of its performance can be found at
    the following links:
    - IETF Draft:
    http://tools.ietf.org/html/draft-mathis-tcpm-proportional-rate-reduction-01
    - IETF Slides:
    http://www.ietf.org/proceedings/80/slides/tcpm-6.pdf
    http://tools.ietf.org/agenda/81/slides/tcpm-2.pdf
    - Paper to appear in Internet Measurements Conference (IMC) 2011:
    Improving TCP Loss Recovery
    Nandita Dukkipati, Matt Mathis, Yuchung Cheng
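    The per-ACK send quota can be sketched from the draft's two regimes:
    while pipe is above ssthresh, send in proportion to what the receiver
    reports delivered; otherwise apply the slow start reduction bound. This
    is a simplified packet-unit sketch following the draft's pseudocode,
    not the kernel implementation; RecoverFS is pipe at the start of
    recovery, and all the numbers in the test are hypothetical.

```c
static unsigned int min_u(unsigned int a, unsigned int b) { return a < b ? a : b; }
static unsigned int max_u(unsigned int a, unsigned int b) { return a > b ? a : b; }

/* Packets that may be sent on this ACK during recovery (PRR-SSRB). */
static unsigned int prr_sndcnt(unsigned int ssthresh, unsigned int pipe,
                               unsigned int recover_fs,
                               unsigned int prr_delivered,
                               unsigned int prr_out,
                               unsigned int delivered)
{
    if (pipe > ssthresh) {
        /* Proportional part: pace out ssthresh/RecoverFS of the data
         * the receiver has reported delivered so far. */
        unsigned int target =
            (prr_delivered * ssthresh + recover_fs - 1) / recover_fs;
        return target > prr_out ? target - prr_out : 0;
    }
    /* Slow start reduction bound: grow back toward ssthresh, at most
     * one packet faster than slow start would. */
    return min_u(ssthresh - pipe,
                 max_u(prr_delivered - prr_out, delivered) + 1);
}
```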

    Signed-off-by: Nandita Dukkipati
    Signed-off-by: David S. Miller

    Nandita Dukkipati
     
  • Signed-off-by: Ian Campbell
    Cc: "David S. Miller"
    Cc: Alexey Kuznetsov
    Cc: "Pekka Savola (ipv6)"
    Cc: James Morris
    Cc: Hideaki YOSHIFUJI
    Cc: Patrick McHardy
    Cc: netdev@vger.kernel.org
    Signed-off-by: David S. Miller

    Ian Campbell
     

09 May, 2011

1 commit

  • This allows us to acquire the exact route keying information from the
    protocol, however that might be managed.

    It handles all of the possibilities, from the simplest case of storing
    the key in inet->cork.fl to the more complex setup SCTP has where
    individual transports determine the flow.

    Signed-off-by: David S. Miller

    David S. Miller
     

08 Apr, 2011

1 commit


02 Apr, 2011

1 commit

    All callers are prepared for alloc failures anyway, so this error
    can safely be boomeranged to the caller's domain without super
    bad consequences. ...At worst the connection might go into a state
    where each RTO tries to (unsuccessfully) re-fragment with such
    a mis-sized value and eventually dies.

    Signed-off-by: Ilpo Järvinen
    Signed-off-by: David S. Miller

    Ilpo Järvinen
     

31 Mar, 2011

1 commit


22 Feb, 2011

1 commit

    Fix a bug where undo_retrans is incorrectly decremented when
    undo_marker is not set or undo_retrans is already 0. This happens when
    the sender receives more DSACK ACKs than packets retransmitted during
    the current undo phase. It may also happen when the sender receives a
    DSACK after the undo operation is completed or cancelled.

    Fix another bug where undo_retrans is incorrectly incremented when the
    sender retransmits an skb and tcp_skb_pcount(skb) > 1 (TSO). This case
    is rare but not impossible.
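    The corrected bookkeeping can be sketched as two guarded updates. The
    struct below is a toy stand-in (the real fields live in tcp_sock), and
    the guards follow the two fixes described above: count every segment on
    retransmit, and only decrement inside an open undo phase while the
    counter is positive.

```c
/* Toy stand-in for the relevant tcp_sock state. */
struct undo_state {
    unsigned int undo_marker; /* nonzero while an undo phase is open */
    int undo_retrans;         /* retransmits that may still be undone */
};

/* On retransmit: count all segments of the skb (TSO), not just 1. */
static void on_retransmit(struct undo_state *s, int pcount)
{
    if (s->undo_marker)
        s->undo_retrans += pcount;
}

/* On DSACK: decrement only inside an undo phase and while positive. */
static void on_dsack(struct undo_state *s)
{
    if (s->undo_marker && s->undo_retrans > 0)
        s->undo_retrans--;
}
```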

    Signed-off-by: Yuchung Cheng
    Acked-by: Ilpo Järvinen
    Signed-off-by: David S. Miller

    Yuchung Cheng
     

14 Jan, 2011

1 commit

  • * 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (43 commits)
    Documentation/trace/events.txt: Remove obsolete sched_signal_send.
    writeback: fix global_dirty_limits comment runtime -> real-time
    ppc: fix comment typo singal -> signal
    drivers: fix comment typo diable -> disable.
    m68k: fix comment typo diable -> disable.
    wireless: comment typo fix diable -> disable.
    media: comment typo fix diable -> disable.
    remove doc for obsolete dynamic-printk kernel-parameter
    remove extraneous 'is' from Documentation/iostats.txt
    Fix spelling milisec -> ms in snd_ps3 module parameter description
    Fix spelling mistakes in comments
    Revert conflicting V4L changes
    i7core_edac: fix typos in comments
    mm/rmap.c: fix comment
    sound, ca0106: Fix assignment to 'channel'.
    hrtimer: fix a typo in comment
    init/Kconfig: fix typo
    anon_inodes: fix wrong function name in comment
    fix comment typos concerning "consistent"
    poll: fix a typo in comment
    ...

    Fix up trivial conflicts in:
    - drivers/net/wireless/iwlwifi/iwl-core.c (moved to iwl-legacy.c)
    - fs/ext4/ext4.h

    Also fix missed 'diabled' typo in drivers/net/bnx2x/bnx2x.h while at it.

    Linus Torvalds
     

23 Dec, 2010

1 commit


21 Dec, 2010

1 commit

  • This patch changes the default initial receive window to 10 mss
    (defined constant). The default window is limited to the maximum
    of 10*1460 and 2*mss (when mss > 1460).

    draft-ietf-tcpm-initcwnd-00 is a proposal to the IETF that recommends
    increasing TCP's initial congestion window to 10 mss or about 15KB.
    Leading up to this proposal were several large-scale live Internet
    experiments with an initial congestion window of 10 mss (IW10), where
    we showed that the average latency of HTTP responses improved by
    approximately 10%. This was accompanied by a slight increase in
    retransmission rate (0.5%), most of which is coming from applications
    opening multiple simultaneous connections. To understand the extreme
    worst case scenarios, and fairness issues (IW10 versus IW3), we further
    conducted controlled testbed experiments. We came away finding minimal
    negative impact even under low link bandwidths (dial-ups) and small
    buffers. These results are extremely encouraging to adopting IW10.

    However, an initial congestion window of 10 mss is useless unless a TCP
    receiver advertises an initial receive window of at least 10 mss.
    Fortunately, in the large-scale Internet experiments we found that most
    widely used operating systems advertised large initial receive windows
    of 64KB, allowing us to experiment with a wide range of initial
    congestion windows. Linux systems were among the few exceptions that
    advertised a small receive window of 6KB. The purpose of this patch is
    to fix this shortcoming.

    References:
    1. A comprehensive list of all IW10 references to date.
    http://code.google.com/speed/protocols/tcpm-IW10.html

    2. Paper describing results from large-scale Internet experiments with IW10.
    http://ccr.sigcomm.org/drupal/?q=node/621

    3. Controlled testbed experiments under worst case scenarios and a
    fairness study.
    http://www.ietf.org/proceedings/79/slides/tcpm-0.pdf

    4. Raw test data from testbed experiments (Linux senders/receivers)
    with initial congestion and receive windows of both 10 mss.
    http://research.csc.ncsu.edu/netsrv/?q=content/iw10

    5. Internet-Draft. Increasing TCP's Initial Window.
    https://datatracker.ietf.org/doc/draft-ietf-tcpm-initcwnd/
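    The sizing rule described above (10 segments by default, limited to
    the maximum of 10*1460 and 2*mss when mss > 1460) can be sketched as a
    small helper. This is an illustration of the commit text, not the
    actual tcp_select_initial_window() code, which also clamps against the
    available buffer space.

```c
/* Default initial receive window in bytes for a given MSS. */
static unsigned int init_rcv_wnd(unsigned int mss)
{
    unsigned int segs = 10;            /* 10 segments by default     */
    if (mss > 1460) {
        segs = (10 * 1460) / mss;      /* keep bytes near 10 * 1460  */
        if (segs < 2)
            segs = 2;                  /* but never below 2 MSS      */
    }
    return segs * mss;
}
```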

    Signed-off-by: Nandita Dukkipati
    Acked-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Nandita Dukkipati
     

14 Dec, 2010

1 commit

  • Make all RTAX_ADVMSS metric accesses go through a new helper function,
    dst_metric_advmss().

    Leave the actual default metric as "zero" in the real metric slot,
    and compute the actual default value dynamically via a new dst_ops
    AF specific callback.

    For stacked IPSEC routes, we use the advmss of the path which
    preserves existing behavior.

    Unlike ipv4/ipv6, DecNET ties the advmss to the mtu and thus updates
    advmss on pmtu updates. This inconsistency in advmss handling
    results in more raw metric accesses than I wish we ended up with.

    Signed-off-by: David S. Miller

    David S. Miller
     

09 Dec, 2010

4 commits


03 Dec, 2010

1 commit

    TCP_BASE_MSS is defined but not used.
    Commit 5d424d5a introduced this macro, so use
    it to initialize sysctl_tcp_base_mss.

    commit 5d424d5a674f782d0659a3b66d951f412901faee
    Author: John Heffner
    Date: Mon Mar 20 17:53:41 2006 -0800

    [TCP]: MTU probing

    Signed-off-by: Shan Wei
    Signed-off-by: David S. Miller

    Shan Wei
     

25 Nov, 2010

1 commit

    In dev_pick_tx, don't do the work of calculating a queue index or
    setting the index in the sock unless the device has more than one
    queue. This allows the sock to be set only with a queue index of a
    multi-queue device, which is desirable if devices are stacked, as in a
    tunnel.

    We also allow the mapping of a socket to queue to be changed. To
    maintain in order packet transmission a flag (ooo_okay) has been
    added to the sk_buff structure. If a transport layer sets this flag
    on a packet, the transmit queue can be changed for the socket.
    Presumably, the transport would set this if there was no possibility
    of creating OOO packets (for instance, there are no packets in flight
    for the socket). This patch includes the modification in TCP output
    for setting this flag.

    Signed-off-by: Tom Herbert
    Signed-off-by: David S. Miller

    Tom Herbert
     

18 Nov, 2010

1 commit

    The current tcp_connect code completely ignores errors from sending an skb.
    This makes sense in many situations (like -ENOBUFS) but I want to be able to
    immediately fail connections if they are denied by the SELinux netfilter hook.
    Netfilter does not normally return ECONNREFUSED when it drops a packet, so we
    respect that error code as a final and fatal error that cannot be recovered.

    Based-on-patch-by: Patrick McHardy
    Signed-off-by: Eric Paris
    Signed-off-by: David S. Miller

    Eric Paris
     

02 Nov, 2010

1 commit

  • "gadget", "through", "command", "maintain", "maintain", "controller", "address",
    "between", "initiali[zs]e", "instead", "function", "select", "already",
    "equal", "access", "management", "hierarchy", "registration", "interest",
    "relative", "memory", "offset", "already",

    Signed-off-by: Uwe Kleine-König
    Signed-off-by: Jiri Kosina

    Uwe Kleine-König
     

24 Sep, 2010

1 commit


02 Sep, 2010

1 commit


23 Aug, 2010

1 commit

    Via setsockopt it is possible to reduce the socket RX buffer
    (SO_RCVBUF). TCP's method of selecting the initial window and window
    scaling option in tcp_select_initial_window() currently misbehaves and
    does not consider an RX socket buffer reduced via setsockopt.

    Even though the server's RX buffer is reduced via setsockopt() to 256
    bytes (initial window 384 bytes => 256 * 2 - (256 * 2 / 4)), the window
    scale option is still 7:

    192.168.1.38.40676 > 78.47.222.210.5001: Flags [S], seq 2577214362, win 5840, options [mss 1460,sackOK,TS val 338417 ecr 0,nop,wscale 0], length 0
    78.47.222.210.5001 > 192.168.1.38.40676: Flags [S.], seq 1570631029, ack 2577214363, win 384, options [mss 1452,sackOK,TS val 2435248895 ecr 338417,nop,wscale 7], length 0
    192.168.1.38.40676 > 78.47.222.210.5001: Flags [.], ack 1, win 5840, options [nop,nop,TS val 338421 ecr 2435248895], length 0

    Within tcp_select_initial_window() the original space argument - a
    representation of the RX buffer size - is expanded during the
    calculation. Only sysctl_tcp_rmem[2], sysctl_rmem_max and window_clamp
    are considered when calculating the initial window.

    This patch adjusts the window_clamp argument if the user explicitly
    reduces the receive buffer.
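    The adjustment amounts to one extra clamp. The helper below is a
    hypothetical sketch of that logic, not the kernel code: rcvbuf_locked
    stands in for the SOCK_RCVBUF_LOCK user-lock bit that records an
    explicit SO_RCVBUF setting, and space for the buffer-derived window
    space.

```c
/* When the user pinned SO_RCVBUF, the advertised window must not be
 * clamped above the space that buffer actually allows. */
static unsigned int clamp_window(unsigned int window_clamp,
                                 unsigned int space, int rcvbuf_locked)
{
    if (rcvbuf_locked && space < window_clamp)
        return space;
    return window_clamp;
}
```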

    Signed-off-by: Hagen Paul Pfeifer
    Cc: David S. Miller
    Cc: Patrick McHardy
    Cc: Eric Dumazet
    Cc: Ilpo Järvinen
    Signed-off-by: David S. Miller

    Hagen Paul Pfeifer
     

21 Jul, 2010

1 commit


20 Jul, 2010

1 commit

  • It can happen that there are no packets in queue while calling
    tcp_xmit_retransmit_queue(). tcp_write_queue_head() then returns
    NULL and that gets deref'ed to get sacked into a local var.

    There is no work to do if no packets are outstanding so we just
    exit early.

    This oops was introduced by 08ebd1721ab8fd (tcp: remove tp->lost_out
    guard to make joining diff nicer).

    Signed-off-by: Ilpo Järvinen
    Reported-by: Lennart Schulte
    Tested-by: Lennart Schulte
    Signed-off-by: David S. Miller

    Ilpo Järvinen
     

13 Jul, 2010

1 commit


29 Jun, 2010

1 commit


16 Jun, 2010

1 commit

    Unify the TCP flag macros: TCPHDR_FIN, TCPHDR_SYN, TCPHDR_RST,
    TCPHDR_PSH, TCPHDR_ACK, TCPHDR_URG, TCPHDR_ECE and TCPHDR_CWR. The
    TCPCB_FLAG_* macros are replaced with the corresponding TCPHDR_*.
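    The unified names carry the bit values of the TCP header's flag byte
    (per RFC 793 and its ECN extension), which is why one set of macros can
    serve both the header and the control-block flags:

```c
/* TCP header flag bits, low to high within the flag byte. */
#define TCPHDR_FIN 0x01
#define TCPHDR_SYN 0x02
#define TCPHDR_RST 0x04
#define TCPHDR_PSH 0x08
#define TCPHDR_ACK 0x10
#define TCPHDR_URG 0x20
#define TCPHDR_ECE 0x40
#define TCPHDR_CWR 0x80
```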

    Signed-off-by: Changli Gao
    ----
    include/net/tcp.h | 24 ++++++-------
    net/ipv4/tcp.c | 8 ++--
    net/ipv4/tcp_input.c | 2 -
    net/ipv4/tcp_output.c | 59 ++++++++++++++++-----------------
    net/netfilter/nf_conntrack_proto_tcp.c | 32 ++++++-----------
    net/netfilter/xt_TCPMSS.c | 4 --
    6 files changed, 58 insertions(+), 71 deletions(-)
    Signed-off-by: David S. Miller

    Changli Gao