20 Apr, 2013

1 commit


12 Mar, 2013

1 commit

  • This patch series implement the Tail loss probe (TLP) algorithm described
    in http://tools.ietf.org/html/draft-dukkipati-tcpm-tcp-loss-probe-01. The
    first patch implements the basic algorithm.

    TLP's goal is to reduce tail latency of short transactions. It achieves
    this by converting retransmission timeouts (RTOs) occuring due
    to tail losses (losses at end of transactions) into fast recovery.
    TLP transmits one packet in two round-trips when a connection is in
    Open state and isn't receiving any ACKs. The transmitted packet, aka
    loss probe, can be either new or a retransmission. When there is tail
    loss, the ACK from a loss probe triggers FACK/early-retransmit based
    fast recovery, thus avoiding a costly RTO. In the absence of loss,
    there is no change in the connection state.

    PTO stands for probe timeout. It is a timer event indicating
    that an ACK is overdue and triggers a loss probe packet. The PTO value
    is set to max(2*SRTT, 10ms) and is adjusted to account for delayed
    ACK timer when there is only one oustanding packet.

    TLP Algorithm

    On transmission of new data in Open state:
    -> packets_out > 1: schedule PTO in max(2*SRTT, 10ms).
    -> packets_out == 1: schedule PTO in max(2*RTT, 1.5*RTT + 200ms)
    -> PTO = min(PTO, RTO)

    Conditions for scheduling PTO:
    -> Connection is in Open state.
    -> Connection is either cwnd limited or no new data to send.
    -> Number of probes per tail loss episode is limited to one.
    -> Connection is SACK enabled.

    When PTO fires:
    new_segment_exists:
    -> transmit new segment.
    -> packets_out++. cwnd remains same.

    no_new_packet:
    -> retransmit the last segment.
    Its ACK triggers FACK or early retransmit based recovery.

    ACK path:
    -> rearm RTO at start of ACK processing.
    -> reschedule PTO if need be.

    In addition, the patch includes a small variation to the Early Retransmit
    (ER) algorithm, such that ER and TLP together can in principle recover any
    N-degree of tail loss through fast recovery. TLP is controlled by the same
    sysctl as ER, tcp_early_retrans sysctl.
    tcp_early_retrans==0; disables TLP and ER.
    ==1; enables RFC5827 ER.
    ==2; delayed ER.
    ==3; TLP and delayed ER. [DEFAULT]
    ==4; TLP only.

    The TLP patch series have been extensively tested on Google Web servers.
    It is most effective for short Web trasactions, where it reduced RTOs by 15%
    and improved HTTP response time (average by 6%, 99th percentile by 10%).
    The transmitted probes account for
    Acked-by: Neal Cardwell
    Acked-by: Yuchung Cheng
    Signed-off-by: David S. Miller

    Nandita Dukkipati
     

13 Dec, 2012

1 commit

  • Pull networking changes from David Miller:

    1) Allow to dump, monitor, and change the bridge multicast database
    using netlink. From Cong Wang.

    2) RFC 5961 TCP blind data injection attack mitigation, from Eric
    Dumazet.

    3) Networking user namespace support from Eric W. Biederman.

    4) tuntap/virtio-net multiqueue support by Jason Wang.

    5) Support for checksum offload of encapsulated packets (basically,
    tunneled traffic can still be checksummed by HW). From Joseph
    Gasparakis.

    6) Allow BPF filter access to VLAN tags, from Eric Dumazet and
    Daniel Borkmann.

    7) Bridge port parameters over netlink and BPDU blocking support
    from Stephen Hemminger.

    8) Improve data access patterns during inet socket demux by rearranging
    socket layout, from Eric Dumazet.

    9) TIPC protocol updates and cleanups from Ying Xue, Paul Gortmaker, and
    Jon Maloy.

    10) Update TCP socket hash sizing to be more in line with current day
    realities. The existing heurstics were choosen a decade ago.
    From Eric Dumazet.

    11) Fix races, queue bloat, and excessive wakeups in ATM and
    associated drivers, from Krzysztof Mazur and David Woodhouse.

    12) Support DOVE (Distributed Overlay Virtual Ethernet) extensions
    in VXLAN driver, from David Stevens.

    13) Add "oops_only" mode to netconsole, from Amerigo Wang.

    14) Support set and query of VEB/VEPA bridge mode via PF_BRIDGE, also
    allow DCB netlink to work on namespaces other than the initial
    namespace. From John Fastabend.

    15) Support PTP in the Tigon3 driver, from Matt Carlson.

    16) tun/vhost zero copy fixes and improvements, plus turn it on
    by default, from Michael S. Tsirkin.

    17) Support per-association statistics in SCTP, from Michele
    Baldessari.

    And many, many, driver updates, cleanups, and improvements. Too
    numerous to mention individually.

    * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1722 commits)
    net/mlx4_en: Add support for destination MAC in steering rules
    net/mlx4_en: Use generic etherdevice.h functions.
    net: ethtool: Add destination MAC address to flow steering API
    bridge: add support of adding and deleting mdb entries
    bridge: notify mdb changes via netlink
    ndisc: Unexport ndisc_{build,send}_skb().
    uapi: add missing netconf.h to export list
    pkt_sched: avoid requeues if possible
    solos-pci: fix double-free of TX skb in DMA mode
    bnx2: Fix accidental reversions.
    bna: Driver Version Updated to 3.1.2.1
    bna: Firmware update
    bna: Add RX State
    bna: Rx Page Based Allocation
    bna: TX Intr Coalescing Fix
    bna: Tx and Rx Optimizations
    bna: Code Cleanup and Enhancements
    ath9k: check pdata variable before dereferencing it
    ath5k: RX timestamp is reported at end of frame
    ath9k_htc: RX timestamp is reported at end of frame
    ...

    Linus Torvalds
     

10 Dec, 2012

4 commits

  • Add logic to verify that a port comparison byte code operation
    actually has the second inet_diag_bc_op from which we read the port
    for such operations.

    Previously the code blindly referenced op[1] without first checking
    whether a second inet_diag_bc_op struct could fit there. So a
    malicious user could make the kernel read 4 bytes beyond the end of
    the bytecode array by claiming to have a whole port comparison byte
    code (2 inet_diag_bc_op structs) when in fact the bytecode was not
    long enough to hold both.

    Signed-off-by: Neal Cardwell
    Signed-off-by: David S. Miller

    Neal Cardwell
     
  • Add logic to check the address family of the user-supplied conditional
    and the address family of the connection entry. We now do not do
    prefix matching of addresses from different address families (AF_INET
    vs AF_INET6), except for the previously existing support for having an
    IPv4 prefix match an IPv4-mapped IPv6 address (which this commit
    maintains as-is).

    This change is needed for two reasons:

    (1) The addresses are different lengths, so comparing a 128-bit IPv6
    prefix match condition to a 32-bit IPv4 connection address can cause
    us to unwittingly walk off the end of the IPv4 address and read
    garbage or oops.

    (2) The IPv4 and IPv6 address spaces are semantically distinct, so a
    simple bit-wise comparison of the prefixes is not meaningful, and
    would lead to bogus results (except for the IPv4-mapped IPv6 case,
    which this commit maintains).

    Signed-off-by: Neal Cardwell
    Signed-off-by: David S. Miller

    Neal Cardwell
     
  • Add logic to validate INET_DIAG_BC_S_COND and INET_DIAG_BC_D_COND
    operations.

    Previously we did not validate the inet_diag_hostcond, address family,
    address length, and prefix length. So a malicious user could make the
    kernel read beyond the end of the bytecode array by claiming to have a
    whole inet_diag_hostcond when the bytecode was not long enough to
    contain a whole inet_diag_hostcond of the given address family. Or
    they could make the kernel read up to about 27 bytes beyond the end of
    a connection address by passing a prefix length that exceeded the
    length of addresses of the given family.

    Signed-off-by: Neal Cardwell
    Signed-off-by: David S. Miller

    Neal Cardwell
     
  • Fix inet_diag to be aware of the fact that AF_INET6 TCP connections
    instantiated for IPv4 traffic and in the SYN-RECV state were actually
    created with inet_reqsk_alloc(), instead of inet6_reqsk_alloc(). This
    means that for such connections inet6_rsk(req) returns a pointer to a
    random spot in memory up to roughly 64KB beyond the end of the
    request_sock.

    With this bug, for a server using AF_INET6 TCP sockets and serving
    IPv4 traffic, an inet_diag user like `ss state SYN-RECV` would lead to
    inet_diag_fill_req() causing an oops or the export to user space of 16
    bytes of kernel memory as a garbage IPv6 address, depending on where
    the garbage inet6_rsk(req) pointed.

    Signed-off-by: Neal Cardwell
    Signed-off-by: David S. Miller

    Neal Cardwell
     

11 Nov, 2012

1 commit


04 Nov, 2012

2 commits

  • We've observed that in case if UDP diag module is not
    supported in kernel the netlink returns NLMSG_DONE without
    notifying a caller that handler is missed.

    This patch makes __inet_diag_dump to return error code instead.

    So as example it become possible to detect such situation
    and handle it gracefully on userspace level.

    Signed-off-by: Cyrill Gorcunov
    CC: David Miller
    CC: Eric Dumazet
    CC: Pavel Emelyanov
    Acked-by: Pavel Emelyanov
    Signed-off-by: David S. Miller

    Cyrill Gorcunov
     
  • For passive TCP connections using TCP_DEFER_ACCEPT facility,
    we incorrectly increment req->retrans each time timeout triggers
    while no SYNACK is sent.

    SYNACK are not sent for TCP_DEFER_ACCEPT that were established (for
    which we received the ACK from client). Only the last SYNACK is sent
    so that we can receive again an ACK from client, to move the req into
    accept queue. We plan to change this later to avoid the useless
    retransmit (and potential problem as this SYNACK could be lost)

    TCP_INFO later gives wrong information to user, claiming imaginary
    retransmits.

    Decouple req->retrans field into two independent fields :

    num_retrans : number of retransmit
    num_timeout : number of timeouts

    num_timeout is the counter that is incremented at each timeout,
    regardless of actual SYNACK being sent or not, and used to
    compute the exponential timeout.

    Introduce inet_rtx_syn_ack() helper to increment num_retrans
    only if ->rtx_syn_ack() succeeded.

    Use inet_rtx_syn_ack() from tcp_check_req() to increment num_retrans
    when we re-send a SYNACK in answer to a (retransmitted) SYN.
    Prior to this patch, we were not counting these retransmits.

    Change tcp_v[46]_rtx_synack() to increment TCP_MIB_RETRANSSEGS
    only if a synack packet was successfully queued.

    Reported-by: Yuchung Cheng
    Signed-off-by: Eric Dumazet
    Cc: Julian Anastasov
    Cc: Vijay Subramanian
    Cc: Elliott Hughes
    Cc: Neal Cardwell
    Signed-off-by: David S. Miller

    Eric Dumazet
     

24 Oct, 2012

1 commit


11 Sep, 2012

1 commit

  • It is a frequent mistake to confuse the netlink port identifier with a
    process identifier. Try to reduce this confusion by renaming fields
    that hold port identifiers portid instead of pid.

    I have carefully avoided changing the structures exported to
    userspace to avoid changing the userspace API.

    I have successfully built an allyesconfig kernel with this change.

    Signed-off-by: "Eric W. Biederman"
    Acked-by: Stephen Hemminger
    Signed-off-by: David S. Miller

    Eric W. Biederman
     

15 Aug, 2012

1 commit


17 Jul, 2012

1 commit

  • Before this patch sock_diag works for init_net only and dumps
    information about sockets from all namespaces.

    This patch expands sock_diag for all name-spaces.
    It creates a netlink kernel socket for each netns and filters
    data during dumping.

    v2: filter accoding with netns in all places
    remove an unused variable.

    Cc: "David S. Miller"
    Cc: Alexey Kuznetsov
    Cc: James Morris
    Cc: Hideaki YOSHIFUJI
    Cc: Patrick McHardy
    Cc: Pavel Emelyanov
    CC: Eric Dumazet
    Cc: linux-kernel@vger.kernel.org
    Cc: netdev@vger.kernel.org
    Signed-off-by: Andrew Vagin
    Acked-by: Pavel Emelyanov
    Signed-off-by: David S. Miller

    Andrey Vagin
     

28 Jun, 2012

1 commit


27 Jun, 2012

1 commit


08 May, 2012

1 commit

  • Conflicts:
    drivers/net/ethernet/intel/e1000e/param.c
    drivers/net/wireless/iwlwifi/iwl-agn-rx.c
    drivers/net/wireless/iwlwifi/iwl-trans-pcie-rx.c
    drivers/net/wireless/iwlwifi/iwl-trans.h

    Resolved the iwlwifi conflict with mainline using 3-way diff posted
    by John Linville and Stephen Rothwell. In 'net' we added a bug
    fix to make iwlwifi report a more accurate skb->truesize but this
    conflicted with RX path changes that happened meanwhile in net-next.

    In e1000e a conflict arose in the validation code for settings of
    adapter->itr. 'net-next' had more sophisticated logic so that
    logic was used.

    Signed-off-by: David S. Miller

    David S. Miller
     

26 Apr, 2012

2 commits


27 Feb, 2012

1 commit


12 Jan, 2012

2 commits


31 Dec, 2011

1 commit


17 Dec, 2011

2 commits


12 Dec, 2011

1 commit


10 Dec, 2011

8 commits


07 Dec, 2011

6 commits

  • This patch moves the sock_ code from inet_diag.c to generic sock_diag.c
    file and provides necessary request_module-s calls and a pointer on
    inet_diag_compat dumping routine.

    Signed-off-by: Pavel Emelyanov
    Signed-off-by: David S. Miller

    Pavel Emelyanov
     
  • Now all the code works with sock_diag_req-compatible structs, so it's
    possible to stop using the inet_diag_type2proto in inet_csk_diag_fill.
    Pass the inet_diag_req into it and use the sdiag_protocol field. At the
    same time remove the explicit ext argument, since it's also on the req.

    However, this conversion is still required in _compat code, so just move
    this routine, not remove.

    Signed-off-by: Pavel Emelyanov
    Signed-off-by: David S. Miller

    Pavel Emelyanov
     
  • The new API will specify family to work with. Teach the existing
    socket walking code to bypass not interesting ones.

    To preserve compatibility with existing behavior the _compat code
    sets interesting family to AF_UNSPEC to dump them all.

    Signed-off-by: Pavel Emelyanov
    Signed-off-by: David S. Miller

    Pavel Emelyanov
     
  • Make inet_diag_dumo work with given header instead of calculating
    one from the nl message.

    The SOCK_DIAG_BY_FAMILY just passes skb's one through, the compat code
    converts the old header to new one.

    Also fix the bytecode calculation to find one at proper offset.

    Signed-off-by: Pavel Emelyanov
    Signed-off-by: David S. Miller

    Pavel Emelyanov
     
  • Make inet_diag_get_exact work with given header instead of calculating
    one from the nl message.

    The SOCK_DIAG_BY_FAMILY just passes skb's one through, the compat code
    converts the old header to new one.

    Signed-off-by: Pavel Emelyanov
    Signed-off-by: David S. Miller

    Pavel Emelyanov
     
  • This one coinsides with the sock_diag_req in the beginning and
    contains only used fields from its previous analogue.

    The existing code is patched to use the _compat version of it
    for now.

    Signed-off-by: Pavel Emelyanov
    Signed-off-by: David S. Miller

    Pavel Emelyanov