30 Aug, 2013

2 commits

  • After hearing many people complain over the past years about TSO
    being bursty or even buggy, we are proud to present automatic sizing
    of TSO packets.

    One part of the problem is that tcp_tso_should_defer() uses a
    heuristic relying on upcoming ACKs instead of a timer; more generally,
    big TSO packets make little sense at low rates, as they tend to create
    micro-bursts on the network, and the general consensus is to reduce
    the amount of buffering.

    This patch introduces a per socket sk_pacing_rate, that approximates
    the current sending rate, and allows us to size the TSO packets so
    that we try to send one packet every ms.

    This field could be set by other transports.

    Patch has no impact for high speed flows, where having large TSO packets
    makes sense to reach line rate.

    For other flows, this helps better packet scheduling and ACK clocking.

    This patch increases performance of TCP flows in lossy environments.

    A new sysctl (tcp_min_tso_segs) is added, to specify the
    minimal size of a TSO packet (default being 2).

    A follow-up patch will provide a new packet scheduler (FQ), using
    sk_pacing_rate as an input to perform optional per flow pacing.

    This explains why we chose to set sk_pacing_rate to twice the current
    rate, allowing 'slow start' ramp up.

    sk_pacing_rate = 2 * cwnd * mss / srtt
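
    A hedged illustration of that computation (not this patch's kernel
    code; function name and units are assumptions):

    #include <stdint.h>

    /* Sketch: pacing rate in bytes/sec from cwnd (packets), mss (bytes)
     * and a smoothed RTT in microseconds. Illustrative only. */
    static uint64_t pacing_rate_bytes_per_sec(uint32_t cwnd, uint32_t mss,
                                              uint32_t srtt_us)
    {
        if (srtt_us == 0)
            return 0;       /* no RTT sample yet */
        return (2ULL * cwnd * mss * 1000000ULL) / srtt_us;
    }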

    v2: Neal Cardwell reported a suspicious deferral of the last two
    segments on an initial write of 10 MSS; I had to change
    tcp_tso_should_defer() to take tp->xmit_size_goal_segs into account.

    Signed-off-by: Eric Dumazet
    Cc: Neal Cardwell
    Cc: Yuchung Cheng
    Cc: Van Jacobson
    Cc: Tom Herbert
    Acked-by: Yuchung Cheng
    Acked-by: Neal Cardwell
    Signed-off-by: David S. Miller

    Eric Dumazet
     
  • This patch implements RFC 6980: drop fragmented ndisc packets by
    default. If a fragmented ndisc packet is received, the user is
    informed that it is possible to disable the check.

    Cc: Fernando Gont
    Cc: YOSHIFUJI Hideaki
    Signed-off-by: Hannes Frederic Sowa
    Signed-off-by: David S. Miller

    Hannes Frederic Sowa
     

14 Aug, 2013

1 commit

  • Commit cab70040dfd95ee32144f02fade64f0cb94f31a0 ("net: igmp:
    Reduce Unsolicited report interval to 1s when using IGMPv3") and
    2690048c01f32bf45d1c1e1ab3079bc10ad2aea7 ("net: igmp: Allow user-space
    configuration of igmp unsolicited report interval") by William Manley
    made IGMP unsolicited report intervals configurable per interface and
    corrected the resend interval of unsolicited IGMPv3 report messages
    to 1s.

    The same needs to be done for IPv6:

    MLDv1 (RFC2710 7.10.): 10 seconds
    MLDv2 (RFC3810 9.11.): 1 second

    Both intervals are configurable via new procfs knobs
    mldv1_unsolicited_report_interval and mldv2_unsolicited_report_interval.

    (also added .force_mld_version to ipv6_devconf_dflt to bring structs in
    line without semantic changes)
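
    A hedged userspace sketch of using one of these knobs (the path
    follows the usual per-interface ipv6 conf layout; treating the value
    as milliseconds is an assumption here):

    #include <stdio.h>

    /* Sketch: set the MLDv2 unsolicited report interval for one
     * interface. Path layout and ms units are assumptions. */
    static int set_mldv2_interval(const char *ifname, int msecs)
    {
        char path[128];
        FILE *f;

        snprintf(path, sizeof(path),
                 "/proc/sys/net/ipv6/conf/%s/mldv2_unsolicited_report_interval",
                 ifname);
        f = fopen(path, "w");
        if (!f)
            return -1;
        fprintf(f, "%d\n", msecs);
        return fclose(f);
    }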

    v2:
    a) Joined documentation update for IPv4 and IPv6 MLD/IGMP
    unsolicited_report_interval procfs knobs.
    b) incorporated stylistic feedback from William Manley

    v3:
    a) add new DEVCONF_* values to the end of the enum (thanks to David
    Miller)

    Cc: Cong Wang
    Cc: William Manley
    Cc: Benjamin LaHaise
    Cc: YOSHIFUJI Hideaki
    Signed-off-by: Hannes Frederic Sowa
    Signed-off-by: David S. Miller

    Hannes Frederic Sowa
     

25 Jul, 2013

1 commit

  • The idea of this patch is to add an optional limit on the number of
    unsent bytes in TCP sockets, to reduce kernel memory usage.

    A TCP receiver might announce a big window, and TCP sender autotuning
    might allow a large number of bytes in the write queue, but this has
    little performance benefit if a large part of this buffering is
    wasted:

    The write queue needs to be large only to deal with a large BDP, not
    necessarily to cope with scheduling delays (incoming ACKs make room
    for the application to queue more bytes).

    For most workloads, a value of 128 KB or less gives applications
    enough time to react to POLLOUT events in time (or to be woken up in
    a blocking sendmsg()).

    This patch adds two ways to set the limit :

    1) Per socket option TCP_NOTSENT_LOWAT

    2) A sysctl (/proc/sys/net/ipv4/tcp_notsent_lowat) for sockets not
    using the TCP_NOTSENT_LOWAT socket option (or setting a zero value).
    The default value is UINT_MAX (0xFFFFFFFF), meaning the limit has no
    effect.

    This changes poll()/select()/epoll() to report POLLOUT only if the
    number of unsent bytes is below tp->notsent_lowat.

    Note this might increase the number of sendmsg()/sendfile() calls
    when using non-blocking sockets, and the number of context switches
    for blocking sockets.

    Note this is not related to SO_SNDLOWAT (SO_SNDLOWAT is defined as:
    specify the minimum number of bytes in the buffer until the socket
    layer will pass the data to the protocol).
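
    A hedged usage sketch of the per-socket option (value illustrative;
    the #define fallback assumes the uapi value from this patch):

    #include <netinet/in.h>
    #include <netinet/tcp.h>
    #include <sys/socket.h>

    #ifndef TCP_NOTSENT_LOWAT
    #define TCP_NOTSENT_LOWAT 25    /* assumed uapi value */
    #endif

    /* Sketch: cap unsent bytes at 128 KB on one socket. */
    static int set_notsent_lowat(int fd)
    {
        int lowat = 128 * 1024;

        return setsockopt(fd, IPPROTO_TCP, TCP_NOTSENT_LOWAT,
                          &lowat, sizeof(lowat));
    }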

    Tested:

    netperf sessions, and watching /proc/net/protocols "memory" column for TCP

    With 200 concurrent netperf -t TCP_STREAM sessions, the amount of
    kernel memory used by TCP buffers shrinks by ~55% (20567 pages
    instead of 45458):

    lpq83:~# echo -1 >/proc/sys/net/ipv4/tcp_notsent_lowat
    lpq83:~# (super_netperf 200 -t TCP_STREAM -H remote -l 90 &); sleep 60 ; grep TCP /proc/net/protocols
    TCPv6 1880 2 45458 no 208 yes ipv6 y y y y y y y y y y y y y n y y y y y
    TCP 1696 508 45458 no 208 yes kernel y y y y y y y y y y y y y n y y y y y

    lpq83:~# echo 131072 >/proc/sys/net/ipv4/tcp_notsent_lowat
    lpq83:~# (super_netperf 200 -t TCP_STREAM -H remote -l 90 &); sleep 60 ; grep TCP /proc/net/protocols
    TCPv6 1880 2 20567 no 208 yes ipv6 y y y y y y y y y y y y y n y y y y y
    TCP 1696 508 20567 no 208 yes kernel y y y y y y y y y y y y y n y y y y y

    Using 128KB has no bad effect on the throughput or cpu usage of a
    single flow, although there is an increase in context switches.

    A bonus is that we hold the socket lock for a shorter amount of
    time, which should improve latencies of ACK processing.

    lpq83:~# echo -1 >/proc/sys/net/ipv4/tcp_notsent_lowat
    lpq83:~# perf stat -e context-switches ./netperf -H 7.7.7.84 -t omni -l 20 -c -i10,3
    OMNI Send TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 7.7.7.84 () port 0 AF_INET : +/-2.500% @ 99% conf.
    Local Remote Local Elapsed Throughput Throughput Local Local Remote Remote Local Remote Service
    Send Socket Recv Socket Send Time Units CPU CPU CPU CPU Service Service Demand
    Size Size Size (sec) Util Util Util Util Demand Demand Units
    Final Final % Method % Method
    1651584 6291456 16384 20.00 17447.90 10^6bits/s 3.13 S -1.00 U 0.353 -1.000 usec/KB

    Performance counter stats for './netperf -H 7.7.7.84 -t omni -l 20 -c -i10,3':

    412,514 context-switches

    200.034645535 seconds time elapsed

    lpq83:~# echo 131072 >/proc/sys/net/ipv4/tcp_notsent_lowat
    lpq83:~# perf stat -e context-switches ./netperf -H 7.7.7.84 -t omni -l 20 -c -i10,3
    OMNI Send TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 7.7.7.84 () port 0 AF_INET : +/-2.500% @ 99% conf.
    Local Remote Local Elapsed Throughput Throughput Local Local Remote Remote Local Remote Service
    Send Socket Recv Socket Send Time Units CPU CPU CPU CPU Service Service Demand
    Size Size Size (sec) Util Util Util Util Demand Demand Units
    Final Final % Method % Method
    1593240 6291456 16384 20.00 17321.16 10^6bits/s 3.35 S -1.00 U 0.381 -1.000 usec/KB

    Performance counter stats for './netperf -H 7.7.7.84 -t omni -l 20 -c -i10,3':

    2,675,818 context-switches

    200.029651391 seconds time elapsed

    Signed-off-by: Eric Dumazet
    Cc: Neal Cardwell
    Cc: Yuchung Cheng
    Acked-By: Yuchung Cheng
    Signed-off-by: David S. Miller

    Eric Dumazet
     

04 Jul, 2013

1 commit

  • Conflicts:
    drivers/net/ethernet/freescale/fec_main.c
    drivers/net/ethernet/renesas/sh_eth.c
    net/ipv4/gre.c

    The GRE conflict is between a bug fix (kfree_skb --> kfree_skb_list)
    and the splitting of the gre.c code into separate files.

    The FEC conflict was two sets of changes adding ethtool support code
    in an "!CONFIG_M5272" CPP protected block.

    Finally, the sh_eth.c conflict was between one commit adding bits to
    the .eesr_err_check mask while another commit removed the
    .tx_error_check member and its assignments.

    Signed-off-by: David S. Miller

    David S. Miller
     

13 Jun, 2013

1 commit

  • commit 6648bd7e0e62c0c8c03b (ipv4: Add sysctl knob to control early
    socket demux) introduced such a sysctl, but forgot to add
    documentation to Documentation/networking/ip-sysctl.txt. This patch
    adds it.

    Basically I grabbed the doc from the description of commit
    41063e9dd11956f2d285 (ipv4: Early TCP socket demux.) and the above
    commit.

    Cc: Eric Dumazet
    Cc: Alexander Duyck
    Cc: David S. Miller
    Signed-off-by: Cong Wang
    Signed-off-by: David S. Miller

    Cong Wang
     

21 Mar, 2013

2 commits

  • This patch implements F-RTO (forward RTO recovery):

    When the first retransmission after timeout is acknowledged, F-RTO
    sends new data instead of old data. If the next ACK acknowledges
    some never-retransmitted data, then the timeout was spurious and the
    congestion state is reverted. Otherwise if the next ACK selectively
    acknowledges the new data, then the timeout was genuine and the
    loss recovery continues. This idea applies to recurring timeouts
    as well. While F-RTO sends different data during timeout recovery,
    it does not (and should not) change the congestion control.
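
    A hedged sketch of that per-ACK decision (names illustrative, not
    the kernel's):

    /* Sketch: outcome of the ACK following the post-timeout probe. */
    enum frto_result {
        FRTO_UNDO,      /* spurious timeout: revert congestion state */
        FRTO_LOSS,      /* genuine loss: continue loss recovery */
        FRTO_PROBE,     /* undecided: wait for the next ACK */
    };

    static enum frto_result frto_next_ack(int acks_never_retrans_data,
                                          int sacks_new_data)
    {
        if (acks_never_retrans_data)
            return FRTO_UNDO;
        if (sacks_new_data)
            return FRTO_LOSS;
        return FRTO_PROBE;
    }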

    The implementation follows the three steps of the SACK-enhanced
    algorithm (section 3) in RFC 5682. Step 1 is in tcp_enter_loss();
    steps 2 and 3 are in tcp_process_loss(). The basic version is not
    supported, because the SACK-enhanced version also works for non-SACK
    connections.

    The new implementation is functionally at parity with the old F-RTO
    implementation, except for one case where it increases undo events:
    in addition to the RFC algorithm, a spurious timeout may be detected
    without sending data in step 2, as long as the SACK confirms that
    not all the original data are dropped. When this happens, the sender
    will undo the cwnd and perhaps enter fast recovery instead. This
    additional check increases F-RTO undo events by 5x compared to the
    prior implementation on Google Web servers, since the sender often
    does not have new data to send for HTTP.

    Note that F-RTO may detect a spurious timeout before Eifel with
    timestamps does so.

    Signed-off-by: Yuchung Cheng
    Acked-by: Eric Dumazet
    Acked-by: Neal Cardwell
    Signed-off-by: David S. Miller

    Yuchung Cheng
     
  • This patch series refactors the F-RTO feature (RFC4138/5682).

    This is to simplify the loss recovery processing. Existing F-RTO
    was developed during the experimental stage (RFC4138) and has
    many experimental features. It takes a separate code path from
    the traditional timeout processing by overloading CA_Disorder
    instead of using CA_Loss state. This complicates CA_Disorder state
    handling because it's also used for handling dubious ACKs and undos.
    While the algorithm in the RFC does not change the congestion control,
    the implementation intercepts congestion control in various places
    (e.g., frto_cwnd in tcp_ack()).

    The new code implements the newer F-RTO RFC 5682 using the CA_Loss
    processing path. F-RTO becomes a small extension in the timeout
    processing and interfaces with the congestion control and Eifel undo
    modules. It lets the congestion control (module) determine how much
    to send independently. F-RTO only chooses what to send in order to
    detect spurious retransmission. If the timeout is found spurious, it
    invokes existing Eifel undo algorithms such as DSACK or TCP
    timestamp based detection.

    The first patch removes all F-RTO code except sysctl_tcp_frto, which
    is left for the new implementation. Since CA_EVENT_FRTO is removed,
    TCP Westwood now computes ssthresh on the regular timeout
    CA_EVENT_LOSS event.

    Signed-off-by: Yuchung Cheng
    Acked-by: Neal Cardwell
    Acked-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Yuchung Cheng
     

18 Mar, 2013

1 commit

  • TCPCT uses option number 253, which is reserved for experimental use
    and should not be used in production environments. Further, TCPCT
    does not fully implement RFC 6013.

    As a nice side-effect, removing TCPCT increases TCP's performance for
    very short flows:

    An apache-benchmark run with -c 100 -n 100000, sending HTTP requests
    for files of 1KB size:

    before this patch:
    average (among 7 runs) of 20845.5 Requests/Second
    after:
    average (among 7 runs) of 21403.6 Requests/Second

    Signed-off-by: Christoph Paasch
    Signed-off-by: David S. Miller

    Christoph Paasch
     

12 Mar, 2013

1 commit

  • This patch series implements the Tail loss probe (TLP) algorithm
    described in
    http://tools.ietf.org/html/draft-dukkipati-tcpm-tcp-loss-probe-01.
    The first patch implements the basic algorithm.

    TLP's goal is to reduce tail latency of short transactions. It
    achieves this by converting retransmission timeouts (RTOs) occurring
    due to tail losses (losses at the end of transactions) into fast
    recovery. TLP transmits one packet in two round-trips when a
    connection is in Open state and isn't receiving any ACKs. The
    transmitted packet, aka loss probe, can be either new or a
    retransmission. When there is tail loss, the ACK from a loss probe
    triggers FACK/early-retransmit based fast recovery, thus avoiding a
    costly RTO. In the absence of loss, there is no change in the
    connection state.

    PTO stands for probe timeout. It is a timer event indicating that an
    ACK is overdue, and it triggers a loss probe packet. The PTO value is
    set to max(2*SRTT, 10ms) and is adjusted to account for the delayed
    ACK timer when there is only one outstanding packet.
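
    A hedged sketch of that computation in milliseconds (illustrative,
    not the kernel code):

    /* Sketch of the PTO rule above. */
    static unsigned int tlp_pto_ms(unsigned int srtt_ms, unsigned int rto_ms,
                                   unsigned int packets_out)
    {
        unsigned int pto = 2 * srtt_ms;

        if (packets_out > 1) {
            if (pto < 10)
                pto = 10;                   /* max(2*SRTT, 10ms) */
        } else {
            unsigned int dack = srtt_ms + srtt_ms / 2 + 200;
            if (pto < dack)
                pto = dack;                 /* max(2*RTT, 1.5*RTT + 200ms) */
        }
        return pto < rto_ms ? pto : rto_ms; /* PTO = min(PTO, RTO) */
    }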

    TLP Algorithm

    On transmission of new data in Open state:
    -> packets_out > 1: schedule PTO in max(2*SRTT, 10ms).
    -> packets_out == 1: schedule PTO in max(2*RTT, 1.5*RTT + 200ms)
    -> PTO = min(PTO, RTO)

    Conditions for scheduling PTO:
    -> Connection is in Open state.
    -> Connection is either cwnd limited or no new data to send.
    -> Number of probes per tail loss episode is limited to one.
    -> Connection is SACK enabled.

    When PTO fires:
    new_segment_exists:
    -> transmit new segment.
    -> packets_out++. cwnd remains same.

    no_new_packet:
    -> retransmit the last segment.
    Its ACK triggers FACK or early retransmit based recovery.

    ACK path:
    -> rearm RTO at start of ACK processing.
    -> reschedule PTO if need be.

    In addition, the patch includes a small variation to the Early
    Retransmit (ER) algorithm, such that ER and TLP together can in
    principle recover any N-degree of tail loss through fast recovery.
    TLP is controlled by the same sysctl as ER, the tcp_early_retrans
    sysctl:
    tcp_early_retrans==0; disables TLP and ER.
    ==1; enables RFC5827 ER.
    ==2; delayed ER.
    ==3; TLP and delayed ER. [DEFAULT]
    ==4; TLP only.

    The TLP patch series has been extensively tested on Google Web
    servers. It is most effective for short Web transactions, where it
    reduced RTOs by 15% and improved HTTP response time (average by 6%,
    99th percentile by 10%). The transmitted probes account for <0.5% of
    all transmissions.

    Acked-by: Neal Cardwell
    Acked-by: Yuchung Cheng
    Signed-off-by: David S. Miller

    Nandita Dukkipati
     

06 Feb, 2013

1 commit

  • TCP Appropriate Byte Count was added by me, but later disabled.
    There is no point in maintaining it, since it is a potential source
    of bugs and Linux already implements other, better window-protection
    heuristics.

    Signed-off-by: Stephen Hemminger
    Signed-off-by: David S. Miller

    Stephen Hemminger
     

23 Jan, 2013

1 commit

  • Since we have removed the NCE (Neighbour Cache Entry) reference from
    routing entries, the only refcnt holders of an NCE are, in the usual
    case, its timer (if running) and its owner table. As a result,
    neigh_periodic_work() purges NCEs over and over again, even for
    gateways.

    It does not make sense to purge entries if their number is very
    small, so keep them. The minimum number of entries to keep is
    specified by gc_thresh1.

    Signed-off-by: YOSHIFUJI Hideaki
    Signed-off-by: David S. Miller

    YOSHIFUJI Hideaki / 吉藤英明
     

11 Jan, 2013

1 commit

  • A recent commit (7e3a2dc52953 "doc: make the description of how
    tcp_ecn works more explicit and clear") clarified the behavior of the
    tcp_ecn sysctl variable, but the description is inconsistent: when
    requested by incoming connections, ECN is enabled not just with
    tcp_ecn = 2 but also with tcp_ecn = 1.

    This patch makes it clear that with tcp_ecn = 1, ECN is enabled when
    requested by incoming connections.

    Also fix spelling of 'incoming'.

    Signed-off-by: Vijay Subramanian
    Signed-off-by: David S. Miller

    Vijay Subramanian
     

11 Dec, 2012

1 commit

  • The description for tcp_fin_timeout should be tighter and clearer.

    In addition to being tighter, we should make the spelling of the
    state name consistent with what utilities report, remove the
    now-dated reference to 2.2, and put the default in the consistent
    place.

    Signed-off-by: Rick Jones
    Signed-off-by: David S. Miller

    Rick Jones
     

26 Oct, 2012

1 commit

  • Currently sctp allows for the optional use of the md5 or sha1 hmac
    algorithms to generate cookie values when establishing new
    connections, via two build-time config options. There's no real
    reason to make this a static selection. We can add a sysctl that
    allows for the dynamic selection of these algorithms at run time,
    with the default value determined by the availability of the
    corresponding crypto library.

    This comes in handy when, for example, running a system in FIPS
    mode, where use of md5 is disallowed but SHA1 is permitted.

    Note: This new sysctl has no corresponding socket option to select the cookie
    hmac algorithm. I chose not to implement that intentionally, as RFC 6458
    contains no option for this value, and I opted not to pollute the socket option
    namespace.
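
    A hedged userspace sketch of flipping the new sysctl (the procfs
    path and the accepted strings are assumptions here):

    #include <stdio.h>

    /* Sketch: select the cookie hmac algorithm at run time, e.g.
     * set_cookie_hmac("sha1"). Path and values are assumptions. */
    static int set_cookie_hmac(const char *alg)
    {
        FILE *f = fopen("/proc/sys/net/sctp/cookie_hmac_alg", "w");

        if (!f)
            return -1;
        fprintf(f, "%s\n", alg);
        return fclose(f);
    }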

    Change notes:
    v2)
    * Updated subject to have the proper sctp prefix as per Dave M.
    * Replaced default selection options with new options that allow
    developers to explicitly select available hmac algs at build time,
    as per suggestion by Vlad Y.

    Signed-off-by: Neil Horman
    CC: Vlad Yasevich
    CC: "David S. Miller"
    CC: netdev@vger.kernel.org
    Acked-by: Vlad Yasevich
    Signed-off-by: David S. Miller

    Neil Horman
     

01 Sep, 2012

2 commits

  • This patch adds all the necessary data structures and support
    functions to implement TFO server side. It also documents a number
    of flags for the sysctl_tcp_fastopen knob, and adds a few Linux
    extension MIBs.

    In addition, it includes the following:

    1. A new TCP_FASTOPEN socket option that an application must set,
    supplying the max backlog allowed, in order to enable TFO on its
    listener (see the sketch after this list).

    2. A number of key data structures:
    "fastopen_rsk" in tcp_sock - for a big socket to access its
    request_sock for retransmission and ACK processing purposes. It is
    non-NULL iff the 3WHS is not completed.

    "fastopenq" in request_sock_queue - points to a per Fast Open
    listener data structure "fastopen_queue" to keep track of qlen (# of
    outstanding Fast Open requests) and max_qlen, among other things.

    "listener" in tcp_request_sock - to point to the original listener
    for book-keeping purposes, i.e., to maintain qlen against max_qlen
    as part of the defense against IP spoofing attacks.

    3. Various data structures and functions, many in tcp_fastopen.c, to
    support server-side Fast Open cookie operations, including
    /proc/sys/net/ipv4/tcp_fastopen_key to allow manual rekeying.
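
    A hedged sketch of item 1 (the #define fallback assumes the uapi
    value of this era; the backlog value is up to the application):

    #include <netinet/in.h>
    #include <netinet/tcp.h>
    #include <sys/socket.h>

    #ifndef TCP_FASTOPEN
    #define TCP_FASTOPEN 23         /* assumed uapi value */
    #endif

    /* Sketch: enable TFO on a listener; qlen caps outstanding Fast
     * Open requests (the max_qlen described above). */
    static int enable_tfo_listener(int listen_fd, int qlen)
    {
        return setsockopt(listen_fd, IPPROTO_TCP, TCP_FASTOPEN,
                          &qlen, sizeof(qlen));
    }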

    Signed-off-by: H.K. Jerry Chu
    Cc: Yuchung Cheng
    Cc: Neal Cardwell
    Cc: Eric Dumazet
    Cc: Tom Herbert
    Signed-off-by: David S. Miller

    Jerry Chu
     
  • Commit 9ad7c049 ("tcp: RFC2988bis + taking RTT sample from 3WHS for
    the passive open side") changed the initRTO from 3secs to 1sec in
    accordance with RFC6298 (formerly RFC2988bis). This reduced the time
    until the last SYN retransmission packet gets sent from 93secs to
    31secs.

    RFC1122 states that the retransmission should be done for at least 3
    minutes, but this seems to be quite high.

    "However, the values of R1 and R2 may be different for SYN
    and data segments. In particular, R2 for a SYN segment MUST
    be set large enough to provide retransmission of the segment
    for at least 3 minutes. The application can close the
    connection (i.e., give up on the open attempt) sooner, of
    course."

    This patch increases TCP_SYN_RETRIES to 6, providing a
    retransmission window of 63secs (with a 1sec initRTO and exponential
    backoff: 1 + 2 + 4 + 8 + 16 + 32 = 63).

    The comments for SYN and SYNACK retries have also been updated to
    describe the current settings. The same goes for the documentation file
    "Documentation/networking/ip-sysctl.txt".

    Signed-off-by: Alexander Bergmann
    Acked-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Alex Bergmann
     

23 Jul, 2012

1 commit

  • I've seen several attempts recently made to do quick failover of
    sctp transports by reducing various retransmit timers and counters.
    While it's possible to implement a faster failover on multihomed
    sctp associations, it's not particularly robust, in that it can lead
    to unneeded retransmits as well as false connection failures due to
    intermittent latency on a network.

    Instead, let's implement the new ietf quick failover draft found
    here:
    http://tools.ietf.org/html/draft-nishida-tsvwg-sctp-failover-05

    This will let the sctp stack identify transports that have had a
    small number of errors and quickly avoid using them until their
    reliability can be re-established. I've tested this out on two virt
    guests connected via multiple isolated virt networks and believe
    it's in compliance with the above draft and works well.
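
    A hedged sketch of the resulting transport states (threshold names
    follow the draft's "potentially failed" idea; illustrative only):

    /* Sketch: classify a transport by its error count. */
    enum sctp_path_state { PATH_ACTIVE, PATH_PF, PATH_FAILED };

    static enum sctp_path_state classify_path(unsigned int error_count,
                                              unsigned int pf_retrans,
                                              unsigned int path_max_retrans)
    {
        if (error_count > path_max_retrans)
            return PATH_FAILED;   /* traditional failure threshold */
        if (error_count > pf_retrans)
            return PATH_PF;       /* potentially failed: prefer others */
        return PATH_ACTIVE;
    }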

    Signed-off-by: Neil Horman
    CC: Vlad Yasevich
    CC: Sridhar Samudrala
    CC: "David S. Miller"
    CC: linux-sctp@vger.kernel.org
    CC: joe@perches.com
    Acked-by: Vlad Yasevich
    Signed-off-by: David S. Miller

    Neil Horman
     

20 Jul, 2012

2 commits

  • In trusted networks, e.g., an intranet or data-center, the client
    does not need to use the Fast Open cookie to mitigate DoS attacks.
    In cookie-less mode, sendmsg() with the MSG_FASTOPEN flag will send
    SYN-data regardless of cookie availability.

    Signed-off-by: Yuchung Cheng
    Acked-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Yuchung Cheng
     
  • sendmsg() (or sendto()) with MSG_FASTOPEN is a combo of connect(2)
    and write(2). The application should replace connect() with it to
    send data in the opening SYN packet.

    For a blocking socket, sendmsg() blocks until all the data are
    buffered locally and the handshake is completed, like a connect()
    call. It returns a similar errno to connect() if the TCP handshake
    fails.

    For a non-blocking socket, it returns the number of bytes queued
    (and transmitted in the SYN-data packet) if a cookie is available.
    If a cookie is not available, it transmits a data-less SYN packet
    with the Fast Open cookie request option and returns -EINPROGRESS
    like connect().

    Using MSG_FASTOPEN on a connecting or connected socket will result
    in a similar errno to repeated connect() calls. Therefore the
    application should only use this flag on new sockets.

    The buffer size of sendmsg() is independent of the MSS of the connection.
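
    A hedged client-side sketch (destination setup omitted; the #define
    fallback assumes the uapi value of this era):

    #include <sys/socket.h>
    #include <sys/types.h>

    #ifndef MSG_FASTOPEN
    #define MSG_FASTOPEN 0x20000000  /* assumed uapi value */
    #endif

    /* Sketch: replace connect()+write() with one sendto(); the data
     * may ride in the SYN if a Fast Open cookie is cached. */
    static ssize_t tfo_connect_send(int fd, const void *buf, size_t len,
                                    const struct sockaddr *dst,
                                    socklen_t dst_len)
    {
        return sendto(fd, buf, len, MSG_FASTOPEN, dst, dst_len);
    }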

    Signed-off-by: Yuchung Cheng
    Acked-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Yuchung Cheng
     

17 Jul, 2012

1 commit

  • Implement the RFC 5961 mitigation against the Blind Reset attack
    using the RST bit.

    The idea is to validate the incoming RST sequence to match the
    RCV.NXT value exactly, instead of the previously accepted window:
    (RCV.NXT <= SEG.SEQ < RCV.NXT+RCV.WND)

    If the sequence is in the window but not an exact match, send a
    "challenge ACK", so that the other party can resend an RST with the
    appropriate sequence.
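
    A hedged sketch of that validation (sequence arithmetic simplified
    to unsigned wraparound; illustrative, not the kernel code):

    /* Sketch: decide what to do with an incoming RST. */
    enum rst_verdict { RST_DROP, RST_ACCEPT, RST_CHALLENGE };

    static enum rst_verdict check_rst_seq(unsigned int seg_seq,
                                          unsigned int rcv_nxt,
                                          unsigned int rcv_wnd)
    {
        if (seg_seq == rcv_nxt)
            return RST_ACCEPT;      /* exact match: process the reset */
        if (seg_seq - rcv_nxt < rcv_wnd)
            return RST_CHALLENGE;   /* in window: send challenge ACK */
        return RST_DROP;            /* out of window: ignore */
    }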

    Add a new sysctl, tcp_challenge_ack_limit, to limit the number of
    challenge ACKs sent per second.

    Add a new SNMP counter to count the number of challenge ACKs sent.
    (netstat -s | grep TCPChallengeACK)

    Signed-off-by: Eric Dumazet
    Cc: Kiran Kumar Kella
    Signed-off-by: David S. Miller

    Eric Dumazet
     

12 Jul, 2012

1 commit

  • This introduces TSQ (TCP Small Queues).

    TSQ's goal is to reduce the number of TCP packets in xmit queues
    (qdisc & device queues), to reduce RTT and cwnd bias, part of the
    bufferbloat problem.

    sk->sk_wmem_alloc is not allowed to grow above a given limit,
    allowing no more than ~128KB [1] per tcp socket in the qdisc/dev
    layers at a given time.

    TSO packets are sized/capped to half the limit, so that we have two
    TSO packets in flight, allowing better bandwidth use.

    As a side effect, setting the limit to 40000 automatically reduces
    the standard gso max limit (65536) to 40000/2: it can help to reduce
    latencies of high-prio packets, by having smaller TSO packets.

    This means we divert sock_wfree() to a tcp_wfree() handler, to
    queue/send following frames when skb_orphan() [2] is called for the
    already queued skbs.

    Results on my dev machines (tg3/ixgbe nics) are really impressive,
    using standard pfifo_fast, and with or without TSO/GSO.

    Without reduction of nominal bandwidth, we have a reduction of
    buffering per bulk sender:
    < 1ms on Gbit (instead of 50ms with TSO)
    < 8ms on 100Mbit (instead of 132 ms)

    I no longer have 4 MBytes backlogged in qdisc by a single netperf
    session, and socket autotuning on both sides no longer uses 4
    MBytes.

    As the skb destructor cannot restart xmit itself (the qdisc lock
    might be taken at this point), we delegate the work to a tasklet.
    We use one tasklet per cpu for performance reasons.

    If the tasklet finds a socket owned by the user, it sets the
    TSQ_OWNED flag. This flag is tested in a new protocol method called
    from release_sock(), to eventually send new segments.

    [1] New /proc/sys/net/ipv4/tcp_limit_output_bytes tunable
    [2] skb_orphan() is usually called at TX completion time,
    but some drivers call it in their start_xmit() handler.
    These drivers should at least use BQL, or else a single TCP
    session can still fill the whole NIC TX ring, since TSQ will
    have no effect.
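
    A hedged sketch of the resulting gate (names illustrative):

    #include <stdbool.h>

    /* Sketch: allow queuing a new skb to the qdisc/device layers only
     * while the socket's in-flight byte budget [1] is not exhausted;
     * tcp_wfree()/the per-cpu tasklet resumes transmission later. */
    static bool tsq_can_queue(unsigned int sk_wmem_alloc,
                              unsigned int limit_output_bytes)
    {
        return sk_wmem_alloc < limit_output_bytes;
    }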

    Signed-off-by: Eric Dumazet
    Cc: Dave Taht
    Cc: Tom Herbert
    Cc: Matt Mathis
    Cc: Yuchung Cheng
    Cc: Nandita Dukkipati
    Signed-off-by: David S. Miller

    Eric Dumazet
     

13 Jun, 2012

1 commit

  • Routing of 127/8 is traditionally forbidden; we consider packets
    from that address block martian when routing and do not process
    corresponding ARP requests.

    This is a sane default but renders a huge address space practically
    unusable.

    The RFC states that no address within the 127/8 block should ever
    appear on any network anywhere, but it does not forbid the use of
    such addresses outside of the loopback device in particular, for
    example to address a pool of virtual guests behind a load balancer.

    This patch adds a new interface option 'route_localnet'
    enabling routing of the 127/8 address block and processing
    of ARP requests on a specific interface.

    Note that for the feature to work, the default local route
    covering 127/8 dev lo needs to be removed.

    Example:
    $ sysctl -w net.ipv4.conf.eth0.route_localnet=1
    $ ip route del 127.0.0.0/8 dev lo table local
    $ ip addr add 127.1.0.1/16 dev eth0
    $ ip route flush cache

    V2: Fix invalid check to auto flush cache (thanks davem)

    Signed-off-by: Thomas Graf
    Acked-by: Neil Horman
    Signed-off-by: David S. Miller

    Thomas Graf
     

09 May, 2012

1 commit

  • If the net.bridge.bridge-nf-filter-vlan-tagged sysctl is enabled,
    bridge netfilter removes the vlan header temporarily and then feeds
    the packet to ip(6)tables.

    When the new "bridge-nf-pass-vlan-input-device" sysctl is on
    (default off), bridge netfilter will also set the in-interface to
    the vlan interface, if such an interface exists.

    This is needed to make iptables REDIRECT target work with
    "vlan-on-top-of-bridge" setups and to allow use of "iptables -i" to
    match the vlan device name.

    Also update Documentation with current brnf default settings.

    Signed-off-by: Florian Westphal
    Acked-by: Bart De Schuymer
    Signed-off-by: Pablo Neira Ayuso

    Pablo Neira Ayuso