14 Sep, 2013

1 commit

  • [ Upstream commit 7ed5c5ae96d23da22de95e1c7a239537acd378b1 ]

    When repair mode is turned off, the write queue seqs are
    updated so that the whole queue is considered 'already sent'.

    The "when" field must be set for such skbs. It is used in tcp_rearm_rto,
    for example. If the "when" field isn't set, the retransmit timeout can
    be calculated incorrectly and a tcp connection can stall for up to two
    minutes (TCP_RTO_MAX).
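
    For context, a minimal user-space sketch of toggling repair mode, the
    operation whose exit path this fixes (hedged: error handling trimmed,
    CAP_NET_ADMIN required, fallback constant from <linux/tcp.h>):

    #include <netinet/in.h>
    #include <sys/socket.h>

    #ifndef TCP_REPAIR
    #define TCP_REPAIR 19   /* from <linux/tcp.h> */
    #endif

    /* Leaving repair mode makes the kernel mark the restored write
     * queue as "already sent"; the bug was that the skb "when"
     * timestamps were left unset on that path. */
    static int tcp_repair_off(int fd)
    {
        int off = 0;

        return setsockopt(fd, IPPROTO_TCP, TCP_REPAIR, &off, sizeof(off));
    }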

    Acked-by: Pavel Emelyanov
    Cc: "David S. Miller"
    Cc: Alexey Kuznetsov
    Cc: James Morris
    Cc: Hideaki YOSHIFUJI
    Cc: Patrick McHardy
    Signed-off-by: Andrey Vagin
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Andrey Vagin
     

17 May, 2013

1 commit

  • The GSO TCP handler has the following issues:

    1) ooo_okay from the original GSO packet is duplicated to all segments
    2) all segments but the last are orphaned, so the transmit path cannot
    get the transmit queue number from the socket. This happens if GSO
    segmentation is done before a stacked device, for example.

    The result is that packets from a given TCP flow can be sent to
    different TX queues (if using multiqueue NICs). This generates
    out-of-order problems and spurious SACKs & retransmits.

    Fix this by keeping the socket pointer set for all segments.

    This means that every segment must also have a destructor, and the
    original gso skb truesize must be split across all segments, to keep
    precise sk->sk_wmem_alloc accounting.
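
    A standalone model of that accounting (illustrative only, not the
    kernel code; names are invented):

    struct seg {
        struct seg *next;
        unsigned int truesize;
        void *sk;                          /* owning socket */
        void (*destructor)(struct seg *);  /* e.g. a tcp_wfree analogue */
    };

    /* Give every segment the original socket and destructor, and split
     * the original truesize so sum(seg->truesize) == gso->truesize. */
    static void split_ownership(struct seg *gso, struct seg *segs,
                                unsigned int mss)
    {
        for (struct seg *s = segs; s; s = s->next) {
            s->sk = gso->sk;
            s->destructor = gso->destructor;
            if (s->next) {
                s->truesize = mss;           /* mss of truesize each... */
                gso->truesize -= mss;
            } else {
                s->truesize = gso->truesize; /* ...remainder on the last */
            }
        }
    }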

    Signed-off-by: Eric Dumazet
    Cc: Maciej Żenczykowski
    Cc: Tom Herbert
    Cc: Neal Cardwell
    Cc: Yuchung Cheng
    Signed-off-by: David S. Miller

    Eric Dumazet
     

15 May, 2013

1 commit

  • TCP md5 communications fail [1] on some devices, because the sg/crypto
    code assumes page offsets are below PAGE_SIZE.

    This was discovered using the mlx4 driver [2], but I suspect loopback
    might trigger the same bug now that we use order-3 pages in
    tcp_sendmsg().

    [1] The failure produces messages like:

    huh, entered softirq 3 NET_RX ffffffff806ad230 preempt_count 00000100,
    exited with 00000101?

    [2] the mlx4 driver uses order-2 pages to allocate RX frags
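
    The underlying arithmetic, as a standalone sketch (the idea, not the
    actual patch): fold whole pages contained in a large fragment offset
    into the page pointer, so downstream code only sees in-page offsets:

    #include <stdint.h>

    #define PAGE_SHIFT 12
    #define PAGE_SIZE  (1UL << PAGE_SHIFT)

    struct frag_ref {
        uintptr_t page;        /* address of the first order-0 page */
        unsigned long offset;  /* may exceed PAGE_SIZE on order-N pages */
    };

    static void normalize_frag(struct frag_ref *f)
    {
        f->page   += (f->offset >> PAGE_SHIFT) << PAGE_SHIFT;
        f->offset &= PAGE_SIZE - 1;    /* now always below PAGE_SIZE */
    }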

    Reported-by: Matt Schnall
    Signed-off-by: Eric Dumazet
    Cc: Bernhard Beck
    Signed-off-by: David S. Miller

    Eric Dumazet
     

13 Apr, 2013

1 commit

  • I noticed that TSQ (TCP Small Queues) was less effective when TSO is
    turned off and GSO is on. If BQL is not enabled, TSQ then has no
    effect.

    It turns out the GSO engine frees the original gso_skb at the time the
    fragments are generated and queued to the NIC.

    We should instead call the tcp_wfree() destructor for the last fragment,
    to keep the flow control as intended in TSQ. This effectively limits
    the number of queued packets on qdisc + NIC layers.
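
    A standalone sketch of the idea (illustrative only, not the kernel
    code):

    struct pkt {
        void *sk;                         /* owning socket */
        void (*destructor)(struct pkt *);
    };

    /* Hand the flow-control destructor (the tcp_wfree analogue) from
     * the original gso skb to the last fragment, so it fires when the
     * NIC frees that fragment rather than at segmentation time. */
    static void handoff_destructor(struct pkt *gso, struct pkt *last)
    {
        last->sk = gso->sk;
        last->destructor = gso->destructor;
        gso->sk = NULL;
        gso->destructor = NULL; /* original is now freed silently */
    }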

    Signed-off-by: Eric Dumazet
    Cc: Tom Herbert
    Cc: Yuchung Cheng
    Cc: Nandita Dukkipati
    Cc: Neal Cardwell
    Signed-off-by: David S. Miller

    Eric Dumazet
     

18 Mar, 2013

1 commit

  • TCPCT uses option number 253, which is reserved for experimental use
    and should not be used in production environments.
    Further, TCPCT does not fully implement RFC 6013.

    As a nice side effect, removing TCPCT increases TCP's performance for
    very short flows:

    An ApacheBench run with -c 100 -n 100000, sending HTTP requests for
    files of 1 KB size, gives:

    before this patch:
    average (among 7 runs) of 20845.5 Requests/Second
    after:
    average (among 7 runs) of 21403.6 Requests/Second (about 2.7% more)

    Signed-off-by: Christoph Paasch
    Signed-off-by: David S. Miller

    Christoph Paasch
     

14 Mar, 2013

1 commit

  • The Chrome OS team reported a crash on a Pixel ChromeBook in the TCP
    stack:

    https://code.google.com/p/chromium/issues/detail?id=182056

    commit a21d45726acac (tcp: avoid order-1 allocations on wifi and tx
    path) made a poor choice in adding an 'avail_size' field to the skb,
    while what we really needed was a 'reserved_tailroom' one.

    That would have avoided commit 22b4a4f22da (tcp: fix retransmit of
    partially acked frames) and this commit.

    The crash occurs because skb_split() is not aware of the 'avail_size'
    management (and should not be aware).
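
    A standalone model of why 'reserved_tailroom' is the better primitive
    (illustrative only; the reservation is subtracted in a single helper,
    so generic code such as skb_split() never has to know about it):

    struct buf {
        unsigned int end, tail;
        unsigned int reserved_tailroom; /* kept free, e.g. for trailers */
    };

    static unsigned int availroom(const struct buf *b)
    {
        return b->end - b->tail - b->reserved_tailroom;
    }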

    Signed-off-by: Eric Dumazet
    Reported-by: Mukesh Agrawal
    Signed-off-by: David S. Miller

    Eric Dumazet
     

10 Mar, 2013

1 commit

  • This adds generic tunneling offload support for IPv4-UDP based
    tunnels.
    A GSO type is added to request this offload for an skb.
    A netdev feature, NETIF_F_UDP_TUNNEL, is added for hardware-offloaded
    udp-tunnel support. Currently no device supports this feature,
    so software offload is used.

    This can be used by tunneling protocols like VXLAN.
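
    A fragment of how a tunnel driver requests this offload (hedged;
    surrounding code omitted):

    /* mark the skb so skb_gso_segment() or NETIF_F_UDP_TUNNEL-capable
     * hardware treats it as a UDP-encapsulated tunnel frame */
    skb_shinfo(skb)->gso_type |= SKB_GSO_UDP_TUNNEL;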

    CC: Jesse Gross
    Signed-off-by: Pravin B Shelar
    Acked-by: Stephen Hemminger
    Signed-off-by: David S. Miller

    Pravin B Shelar
     

27 Feb, 2013

1 commit

  • Pull slave-dmaengine updates from Vinod Koul:
    "This is fairly big pull by my standards as I had missed last merge
    window. So we have the support for device tree for slave-dmaengine,
    large updates to dw_dmac driver from Andy for reusing on different
    architectures. Along with this we have fixes on bunch of the drivers"

    Fix up trivial conflicts, usually due to #include line movement next to
    each other.

    * 'next' of git://git.infradead.org/users/vkoul/slave-dma: (111 commits)
    Revert "ARM: SPEAr13xx: Pass DW DMAC platform data from DT"
    ARM: dts: pl330: Add #dma-cells for generic dma binding support
    DMA: PL330: Register the DMA controller with the generic DMA helpers
    DMA: PL330: Add xlate function
    DMA: PL330: Add new pl330 filter for DT case.
    dma: tegra20-apb-dma: remove unnecessary assignment
    edma: do not waste memory for dma_mask
    dma: coh901318: set residue only if dma is in progress
    dma: coh901318: avoid unbalanced locking
    dmaengine.h: remove redundant else keyword
    dma: of-dma: protect list write operation by spin_lock
    dmaengine: ste_dma40: do not remove descriptors for cyclic transfers
    dma: of-dma.c: fix memory leakage
    dw_dmac: apply default dma_mask if needed
    dmaengine: ioat - fix spare sparse complain
    dmaengine: move drivers/of/dma.c -> drivers/dma/of-dma.c
    ioatdma: fix race between updating ioat->head and IOAT_COMPLETION_PENDING
    dw_dmac: add support for Lynxpoint DMA controllers
    dw_dmac: return proper residue value
    dw_dmac: fill individual length of descriptor
    ...

    Linus Torvalds
     

16 Feb, 2013

1 commit

  • The following patch adds a GRE protocol offload handler so that
    skb_gso_segment() can segment GRE packets.
    An SKB GSO CB is added to keep track of the total header length so
    that skb_segment can push the entire header, e.g. in the case of GRE,
    skb_segment needs to push the inner and outer headers to every
    segment.
    A new NETIF_F_GRE_GSO feature is added for devices which support HW
    GRE TSO offload. Currently no device supports it, therefore GRE GSO
    always falls back to software GSO.

    [ Compute pkt_len before ip_local_out() invocation. -DaveM ]

    Signed-off-by: Pravin B Shelar
    Signed-off-by: David S. Miller

    Pravin B Shelar
     

14 Feb, 2013

3 commits

  • Patch cef401de7be8c4e (net: fix possible wrong checksum
    generation) fixed wrong checksum calculation but broke TSO by
    defining a new GSO type without a netdev feature for that type.
    net_gso_ok() would not allow hardware checksum/segmentation
    offload of such packets without the feature.

    The following patch fixes both TSO and the wrong checksum. It uses
    the same logic that Eric Dumazet used, introducing a new flag,
    SKBTX_SHARED_FRAG, set if at least one frag can be modified by the
    user; but the SKBTX_SHARED_FRAG flag is kept in the skb shared info
    tx_flags rather than in gso_type.

    tx_flags is a better fit than gso_type, since an skb can have a
    shared frag without being a GSO packet. It does not link SHARED_FRAG
    to GSO, so there is no need to define a netdev feature for this.
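
    A fragment of the resulting producer/consumer convention (hedged;
    context omitted, and the helper in the consumer branch is
    hypothetical):

    /* producer: user memory backs at least one frag */
    skb_shinfo(skb)->tx_flags |= SKBTX_SHARED_FRAG;

    /* consumer: a set flag means frags can change under us, so the
     * checksum must be computed in software before offloading */
    if (skb_shinfo(skb)->tx_flags & SKBTX_SHARED_FRAG)
        compute_checksum_in_software(skb); /* hypothetical helper */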

    Signed-off-by: Pravin B Shelar
    Signed-off-by: David S. Miller

    Pravin B Shelar
     
  • A timestamp can be set only if a socket is in repair mode.

    This patch adds a new socket option, TCP_TIMESTAMP, which allows
    getting and setting the current tcp timestamp.
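
    A minimal user-space sketch of the new option for checkpoint/restore
    (hedged: error handling trimmed; the fallback constant is from
    <linux/tcp.h>, and the set only succeeds in repair mode):

    #include <netinet/in.h>
    #include <sys/socket.h>

    #ifndef TCP_TIMESTAMP
    #define TCP_TIMESTAMP 24   /* from <linux/tcp.h> */
    #endif

    static int copy_timestamp(int src_fd, int dst_fd)
    {
        unsigned int ts;
        socklen_t len = sizeof(ts);

        if (getsockopt(src_fd, IPPROTO_TCP, TCP_TIMESTAMP, &ts, &len))
            return -1;
        /* dst_fd must be in repair mode for this to succeed */
        return setsockopt(dst_fd, IPPROTO_TCP, TCP_TIMESTAMP,
                          &ts, sizeof(ts));
    }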

    Cc: "David S. Miller"
    Cc: Alexey Kuznetsov
    Cc: James Morris
    Cc: Hideaki YOSHIFUJI
    Cc: Patrick McHardy
    Cc: Eric Dumazet
    Cc: Pavel Emelyanov
    Signed-off-by: Andrey Vagin
    Signed-off-by: David S. Miller

    Andrey Vagin
     
  • This functionality is used for restoring tcp sockets. A tcp timestamp
    depends on how long a system has been running, so it differs for each
    host. The solution is to set a per-socket offset.

    A per-socket offset for a TIME_WAIT socket is inherited from the
    corresponding tcp socket.

    tcp_request_sock doesn't have a timestamp offset, because repair
    mode is not implemented for request sockets.

    Cc: "David S. Miller"
    Cc: Alexey Kuznetsov
    Cc: James Morris
    Cc: Hideaki YOSHIFUJI
    Cc: Patrick McHardy
    Cc: Eric Dumazet
    Cc: Pavel Emelyanov
    Signed-off-by: Andrey Vagin
    Signed-off-by: David S. Miller

    Andrey Vagin
     

06 Feb, 2013

1 commit

  • TCP Appropriate Byte Count was added by me, but later disabled.
    There is no point in maintaining it since it is a potential source
    of bugs and Linux already implements other better window protection
    heuristics.

    Signed-off-by: Stephen Hemminger
    Signed-off-by: David S. Miller

    Stephen Hemminger
     

28 Jan, 2013

1 commit

  • Pravin Shelar mentioned that GSO could potentially generate a wrong
    TX checksum if an skb has fragments that are overwritten by the user
    between the checksum computation and transmit.

    He suggested linearizing skbs, but this extra copy can be avoided
    for normal tcp skbs cooked by tcp_sendmsg().

    This patch introduces a new SKB_GSO_SHARED_FRAG flag, set in
    skb_shinfo(skb)->gso_type if at least one frag can be modified by
    the user.

    Typical sources of such possible overwrites are {vm}splice(),
    sendfile(), and the macvtap/tun/virtio_net drivers.

    Tested:

    $ netperf -H 7.7.8.84
    MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to
    7.7.8.84 () port 0 AF_INET
    Recv   Send    Send
    Socket Socket  Message  Elapsed
    Size   Size    Size     Time     Throughput
    bytes  bytes   bytes    secs.    10^6bits/sec

     87380  16384  16384    10.00    3959.52

    $ netperf -H 7.7.8.84 -t TCP_SENDFILE
    TCP SENDFILE TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 7.7.8.84 ()
    port 0 AF_INET
    Recv   Send    Send
    Socket Socket  Message  Elapsed
    Size   Size    Size     Time     Throughput
    bytes  bytes   bytes    secs.    10^6bits/sec

     87380  16384  16384    10.00    3216.80

    Performance of SENDFILE is impacted by the extra allocation and
    copy, and because we use order-0 pages for it, while TCP_STREAM uses
    bigger pages.

    Reported-by: Pravin Shelar
    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

11 Jan, 2013

2 commits

  • Under unusual circumstances, TCP collapse can split a big GRO TCP
    packet while it is being used in a splice(socket->pipe) operation.

    skb_splice_bits() releases the socket lock before calling
    splice_to_pipe().

    [ 1081.353685] WARNING: at net/ipv4/tcp.c:1330 tcp_cleanup_rbuf+0x4d/0xfc()
    [ 1081.371956] Hardware name: System x3690 X5 -[7148Z68]-
    [ 1081.391820] cleanup rbuf bug: copied AD3BCF1 seq AD370AF rcvnxt AD3CF13

    To fix this problem, we must eat skbs in tcp_recv_skb().

    Remove the inline keyword from tcp_recv_skb() definition since
    it has three call sites.

    Reported-by: Christian Becker
    Cc: Willy Tarreau
    Signed-off-by: Eric Dumazet
    Tested-by: Willy Tarreau
    Signed-off-by: David S. Miller

    Eric Dumazet
     
  • commit 02275a2ee7c0 (tcp: don't abort splice() after small transfers)
    added a regression.

    [ 83.843570] INFO: rcu_sched self-detected stall on CPU
    [ 83.844575] INFO: rcu_sched detected stalls on CPUs/tasks: { 6} (detected by 0, t=21002 jiffies, g=4457, c=4456, q=13132)
    [ 83.844582] Task dump for CPU 6:
    [ 83.844584] netperf R running task 0 8966 8952 0x0000000c
    [ 83.844587] 0000000000000000 0000000000000006 0000000000006c6c 0000000000000000
    [ 83.844589] 000000000000006c 0000000000000096 ffffffff819ce2bc ffffffffffffff10
    [ 83.844592] ffffffff81088679 0000000000000010 0000000000000246 ffff880c4b9ddcd8
    [ 83.844594] Call Trace:
    [ 83.844596] [] ? vprintk_emit+0x1c9/0x4c0
    [ 83.844601] [] ? schedule+0x29/0x70
    [ 83.844606] [] ? tcp_splice_data_recv+0x42/0x50
    [ 83.844610] [] ? tcp_read_sock+0xda/0x260
    [ 83.844613] [] ? tcp_prequeue_process+0xb0/0xb0
    [ 83.844615] [] ? tcp_splice_read+0xc0/0x250
    [ 83.844618] [] ? sock_splice_read+0x22/0x30
    [ 83.844622] [] ? do_splice_to+0x7b/0xa0
    [ 83.844627] [] ? sys_splice+0x59c/0x5d0
    [ 83.844630] [] ? putname+0x2b/0x40
    [ 83.844633] [] ? do_sys_open+0x174/0x1e0
    [ 83.844636] [] ? system_call_fastpath+0x16/0x1b

    If recv_actor() returns 0, we should stop immediately, because
    looping won't give a chance to drain the pipe.
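
    In code, the fix amounts to something like this hedged sketch of the
    loop body in tcp_read_sock() (heavily simplified):

    int used = recv_actor(desc, skb, offset, len);

    if (used <= 0) {          /* 0 now stops the loop too */
        if (!copied)
            copied = used;
        break;                /* spinning here cannot drain the pipe */
    }
    copied += used;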

    Signed-off-by: Eric Dumazet
    Cc: Willy Tarreau
    Signed-off-by: David S. Miller

    Eric Dumazet
     

13 Dec, 2012

1 commit

  • Pull networking changes from David Miller:

    1) Allow to dump, monitor, and change the bridge multicast database
    using netlink. From Cong Wang.

    2) RFC 5961 TCP blind data injection attack mitigation, from Eric
    Dumazet.

    3) Networking user namespace support from Eric W. Biederman.

    4) tuntap/virtio-net multiqueue support by Jason Wang.

    5) Support for checksum offload of encapsulated packets (basically,
    tunneled traffic can still be checksummed by HW). From Joseph
    Gasparakis.

    6) Allow BPF filter access to VLAN tags, from Eric Dumazet and
    Daniel Borkmann.

    7) Bridge port parameters over netlink and BPDU blocking support
    from Stephen Hemminger.

    8) Improve data access patterns during inet socket demux by rearranging
    socket layout, from Eric Dumazet.

    9) TIPC protocol updates and cleanups from Ying Xue, Paul Gortmaker, and
    Jon Maloy.

    10) Update TCP socket hash sizing to be more in line with current day
    realities. The existing heuristics were chosen a decade ago.
    From Eric Dumazet.

    11) Fix races, queue bloat, and excessive wakeups in ATM and
    associated drivers, from Krzysztof Mazur and David Woodhouse.

    12) Support DOVE (Distributed Overlay Virtual Ethernet) extensions
    in VXLAN driver, from David Stevens.

    13) Add "oops_only" mode to netconsole, from Amerigo Wang.

    14) Support set and query of VEB/VEPA bridge mode via PF_BRIDGE, also
    allow DCB netlink to work on namespaces other than the initial
    namespace. From John Fastabend.

    15) Support PTP in the Tigon3 driver, from Matt Carlson.

    16) tun/vhost zero copy fixes and improvements, plus turn it on
    by default, from Michael S. Tsirkin.

    17) Support per-association statistics in SCTP, from Michele
    Baldessari.

    And many, many, driver updates, cleanups, and improvements. Too
    numerous to mention individually.

    * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1722 commits)
    net/mlx4_en: Add support for destination MAC in steering rules
    net/mlx4_en: Use generic etherdevice.h functions.
    net: ethtool: Add destination MAC address to flow steering API
    bridge: add support of adding and deleting mdb entries
    bridge: notify mdb changes via netlink
    ndisc: Unexport ndisc_{build,send}_skb().
    uapi: add missing netconf.h to export list
    pkt_sched: avoid requeues if possible
    solos-pci: fix double-free of TX skb in DMA mode
    bnx2: Fix accidental reversions.
    bna: Driver Version Updated to 3.1.2.1
    bna: Firmware update
    bna: Add RX State
    bna: Rx Page Based Allocation
    bna: TX Intr Coalescing Fix
    bna: Tx and Rx Optimizations
    bna: Code Cleanup and Enhancements
    ath9k: check pdata variable before dereferencing it
    ath5k: RX timestamp is reported at end of frame
    ath9k_htc: RX timestamp is reported at end of frame
    ...

    Linus Torvalds
     

03 Dec, 2012

1 commit

  • TCP coalescing added a regression in splice(socket->pipe) performance
    for some workloads, because of the way tcp_read_sock() is implemented.

    The reason for this is the break when (offset + 1 != skb->len).

    As we released the socket lock, this condition is possible if the TCP
    stack added a fragment to the skb, which can happen with TCP
    coalescing.

    So let's go back to the beginning of the loop when this happens,
    to give a chance to splice more frags per system call.

    Doing so fixes the issue and makes GRO 10% faster than LRO
    on CPU-bound splice() workloads instead of the opposite.

    Signed-off-by: Willy Tarreau
    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Willy Tarreau
     

02 Dec, 2012

2 commits

  • Recent network changes allowed high order pages being used
    for skb fragments.

    This uncovered a bug in do_tcp_sendpages() which was assuming its caller
    provided an array of order-0 page pointers.

    We only have to deal with a single page in this function, and its order
    is irrelevant.

    Reported-by: Willy Tarreau
    Tested-by: Willy Tarreau
    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     
  • As time passed, available memory increased faster than the number of
    concurrent tcp sockets.

    As a result, a machine with 4 GB of RAM gets a hash table
    with 524288 slots, using 8388608 bytes of memory.

    Let's change that by a 16x factor (one slot per 128 KB of RAM).

    Even if a small machine needs a _lot_ of sockets, tcp lookups are now
    very efficient, using one cache line per socket.
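
    For scale: at the old ratio of one slot per 8 KB, 4 GiB / 8 KiB =
    524288 slots, which is the quoted 8388608 bytes at 16 bytes per
    slot; at one slot per 128 KB, the same machine gets 32768 slots,
    roughly 512 KB of table.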

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

19 Nov, 2012

1 commit

  • Allow an unprivileged user who has created a user namespace, and then
    created a network namespace, to effectively use the new network
    namespace, by reducing capable(CAP_NET_ADMIN) and
    capable(CAP_NET_RAW) calls to ns_capable(net->user_ns,
    CAP_NET_ADMIN) and ns_capable(net->user_ns, CAP_NET_RAW) calls.

    Settings that merely control a single network device are allowed.
    Either the network device is a logical network device, where
    restrictions make no difference, or the network device is a hardware
    NIC that has been explicitly moved from the initial network namespace.

    In general policy and network stack state changes are allowed
    while resource control is left unchanged.
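
    The recurring pattern of the conversion, shown as a short fragment
    (context omitted):

    /* before: only init_user_ns root passed this check */
    if (!capable(CAP_NET_ADMIN))
        return -EPERM;

    /* after: root in the user namespace owning this netns passes too */
    if (!ns_capable(net->user_ns, CAP_NET_ADMIN))
        return -EPERM;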

    Allow creating raw sockets.
    Allow the SIOCSARP ioctl to control the arp cache.
    Allow the SIOCSIFFLAG ioctl to allow setting network device flags.
    Allow the SIOCSIFADDR ioctl to allow setting a netdevice ipv4 address.
    Allow the SIOCSIFBRDADDR ioctl to allow setting a netdevice ipv4 broadcast address.
    Allow the SIOCSIFDSTADDR ioctl to allow setting a netdevice ipv4 destination address.
    Allow the SIOCSIFNETMASK ioctl to allow setting a netdevice ipv4 netmask.
    Allow the SIOCADDRT and SIOCDELRT ioctls to allow adding and deleting ipv4 routes.

    Allow the SIOCADDTUNNEL, SIOCCHGTUNNEL and SIOCDELTUNNEL ioctls for
    adding, changing and deleting gre tunnels.

    Allow the SIOCADDTUNNEL, SIOCCHGTUNNEL and SIOCDELTUNNEL ioctls for
    adding, changing and deleting ipip tunnels.

    Allow the SIOCADDTUNNEL, SIOCCHGTUNNEL and SIOCDELTUNNEL ioctls for
    adding, changing and deleting ipsec virtual tunnel interfaces.

    Allow setting the MRT_INIT, MRT_DONE, MRT_ADD_VIF, MRT_DEL_VIF, MRT_ADD_MFC,
    MRT_DEL_MFC, MRT_ASSERT, MRT_PIM, MRT_TABLE socket options on multicast routing
    sockets.

    Allow setting and receiving IPOPT_CIPSO, IPOPT_SEC, IPOPT_SID and
    arbitrary ip options.

    Allow setting the IP_IPSEC_POLICY/IP_XFRM_POLICY ipv4 socket options.
    Allow setting the IP_TRANSPARENT ipv4 socket option.
    Allow setting the TCP_REPAIR socket option.
    Allow setting the TCP_CONGESTION socket option.

    Signed-off-by: "Eric W. Biederman"
    Signed-off-by: David S. Miller

    Eric W. Biederman
     

16 Nov, 2012

1 commit

  • Currently, if a socket was repaired with a few packets in the write
    queue, a kernel bug may be triggered:

    kernel BUG at net/ipv4/tcp_output.c:2330!
    RIP: 0010:[] tcp_retransmit_skb+0x5ff/0x610

    According to the initial implementation (v3.4-rc2-963-gc0e88ff),
    all skbs should look as if they were already posted. This patch fixes
    the code accordingly.

    Three points were not handled in the initial patch:
    1. The tcp send head should not be changed.
    2. The TSO state of an skb must be initialized.
    3. The retransmission time must be reset.

    This patch moves logic from tcp_sendmsg to tcp_write_xmit. A packet
    passes through the usual path, but isn't sent to the network. This
    patch solves all the described problems and also handles
    tcp_sendpages.

    Cc: Pavel Emelyanov
    Cc: "David S. Miller"
    Cc: Alexey Kuznetsov
    Cc: James Morris
    Cc: Hideaki YOSHIFUJI
    Cc: Patrick McHardy
    Signed-off-by: Andrey Vagin
    Acked-by: Pavel Emelyanov
    Signed-off-by: David S. Miller

    Andrew Vagin
     

23 Oct, 2012

2 commits

  • Add a bit, TCPI_OPT_SYN_DATA (32), to the socket option
    TCP_INFO:tcpi_options. It's set if the data in a SYN (sent or
    received) is acked by the SYN-ACK. A server or client application can
    use this information to check the Fast Open success rate.
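
    A minimal user-space check (hedged: error handling trimmed, and the
    TCPI_OPT_SYN_DATA fallback value is taken from the text above):

    #include <netinet/in.h>
    #include <sys/socket.h>
    #include <linux/tcp.h>   /* struct tcp_info, TCP_INFO */

    #ifndef TCPI_OPT_SYN_DATA
    #define TCPI_OPT_SYN_DATA 32
    #endif

    static int syn_data_acked(int fd)
    {
        struct tcp_info info;
        socklen_t len = sizeof(info);

        if (getsockopt(fd, IPPROTO_TCP, TCP_INFO, &info, &len))
            return -1;
        return !!(info.tcpi_options & TCPI_OPT_SYN_DATA);
    }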

    Signed-off-by: Yuchung Cheng
    Acked-by: Neal Cardwell
    Acked-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Yuchung Cheng
     
  • SIOCINQ can use the lock_sock_fast() version to avoid double acquisition
    of socket lock.
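
    The pattern in question, as a fragment (illustrative; the byte count
    shown stands in for the real SIOCINQ computation):

    bool slow = lock_sock_fast(sk); /* cheap path when no backlog work */

    answ = tp->rcv_nxt - tp->copied_seq;
    unlock_sock_fast(sk, slow);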

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

19 Oct, 2012

1 commit

  • tcp_ioctl() tries to take into account whether a tcp socket received
    a FIN, to report the correct number of bytes in the receive queue.

    But it's flaky, because if the application ate the last skb, we
    return 1 instead of 0.

    The correct way to detect that a FIN was received is to test
    SOCK_DONE.
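
    A standalone model of the corrected accounting (illustrative only,
    not the kernel code):

    /* The FIN consumes one sequence number but is not readable data,
     * so once the connection is done it must be excluded from the
     * count instead of being guessed from the last queued skb. */
    static unsigned int readable_bytes(unsigned int rcv_nxt,
                                       unsigned int copied_seq, int done)
    {
        unsigned int answ = rcv_nxt - copied_seq;

        if (done && answ)
            answ--;   /* discount the FIN's sequence number */
        return answ;
    }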

    Reported-by: Elliot Hughes
    Signed-off-by: Eric Dumazet
    Cc: Neal Cardwell
    Cc: Tom Herbert
    Signed-off-by: David S. Miller

    Eric Dumazet
     

29 Sep, 2012

1 commit

  • Conflicts:
    drivers/net/team/team.c
    drivers/net/usb/qmi_wwan.c
    net/batman-adv/bat_iv_ogm.c
    net/ipv4/fib_frontend.c
    net/ipv4/route.c
    net/l2tp/l2tp_netlink.c

    The team, fib_frontend, route, and l2tp_netlink conflicts were simply
    overlapping changes.

    qmi_wwan and bat_iv_ogm were of the "use HEAD" variety.

    With help from Antonio Quartulli.

    Signed-off-by: David S. Miller

    David S. Miller
     

25 Sep, 2012

1 commit

  • We currently use a per-socket order-0 page cache for tcp_sendmsg()
    operations.

    This page is used to build fragments for skbs.

    It's done to increase the probability of coalescing small write()s
    into single segments in skbs still in the write queue (not yet sent).

    But it wastes a lot of memory for applications handling many mostly
    idle sockets, since each socket holds one page in sk->sk_sndmsg_page.

    It's also quite inefficient to build TSO 64KB packets, because we need
    about 16 pages per skb on arches where PAGE_SIZE = 4096, so we hit the
    page allocator more than wanted.

    This patch adds a per-task frag allocator and uses bigger pages,
    if available. An automatic fallback is done in case of memory pressure.

    (up to 32768 bytes per frag, that's order-3 pages on x86)

    This increases TCP stream performance by 20% on the loopback device,
    but also benefits other network devices, since 8x fewer frags are
    mapped on transmit and unmapped on tx completion. Alexander Duyck
    mentioned a probable performance win on systems with IOMMU enabled.

    It's possible some SG-enabled hardware can't cope with bigger
    fragments, but their ndo_start_xmit() should already handle this,
    splitting a fragment into sub-fragments, since some arches have
    PAGE_SIZE=65536.

    Successfully tested on various ethernet devices
    (ixgbe, igb, bnx2x, tg3, mellanox mlx4).
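
    A fragment showing how the allocator is consumed after this change
    (helper names per the patch; error paths and context omitted):

    struct page_frag *pfrag = sk_page_frag(sk); /* per-task frag, usually */

    if (!sk_page_frag_refill(sk, pfrag))
        goto wait_for_memory;   /* memory pressure: fall back and wait */

    /* copy user data into pfrag->page at pfrag->offset, attach that
     * region to the skb as a frag, then advance pfrag->offset */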

    Signed-off-by: Eric Dumazet
    Cc: Ben Hutchings
    Cc: Vijay Subramanian
    Cc: Alexander Duyck
    Tested-by: Vijay Subramanian
    Signed-off-by: David S. Miller

    Eric Dumazet
     

21 Sep, 2012

2 commits

  • rcv_wscale is a parameter symmetric with snd_wscale.

    Both of these parameters are set during the connection handshake.

    Without this value, a remote window size cannot be interpreted
    correctly, because a value from a packet must be shifted by
    rcv_wscale.

    One more thing: wscale_ok should be set too.

    This patch doesn't break backward compatibility. If someone uses the
    old scheme, the receive window will be restored with the same bug
    (rcv_wscale = 0).

    v2: Preserve backward compatibility on big-endian systems. Before,
    the first two bytes were snd_wscale and the second two bytes were
    rcv_wscale. Now snd_wscale is opt_val & 0xFFFF and rcv_wscale is
    opt_val >> 16. This approach is independent of byte ordering.
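
    A user-space sketch of restoring the scales with the new packing
    (hedged: error handling trimmed, the socket must be in repair mode,
    and TCPOPT_WINDOW is simply TCP option kind 3, defined locally):

    #include <netinet/in.h>
    #include <sys/socket.h>
    #include <linux/tcp.h>  /* TCP_REPAIR_OPTIONS, struct tcp_repair_opt */

    #ifndef TCPOPT_WINDOW
    #define TCPOPT_WINDOW 3
    #endif

    static int restore_wscale(int fd, unsigned snd_wscale,
                              unsigned rcv_wscale)
    {
        struct tcp_repair_opt opt = {
            .opt_code = TCPOPT_WINDOW,
            .opt_val  = snd_wscale | (rcv_wscale << 16),
        };

        return setsockopt(fd, IPPROTO_TCP, TCP_REPAIR_OPTIONS,
                          &opt, sizeof(opt));
    }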

    Cc: David S. Miller
    Cc: Alexey Kuznetsov
    Cc: James Morris
    Cc: Hideaki YOSHIFUJI
    Cc: Patrick McHardy
    CC: Pavel Emelyanov
    Signed-off-by: Andrew Vagin
    Acked-by: Pavel Emelyanov
    Signed-off-by: David S. Miller

    Andrey Vagin
     
  • Signed-off-by: Christoph Paasch
    Acked-by: H.K. Jerry Chu
    Acked-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Christoph Paasch
     

20 Sep, 2012

1 commit

  • If the recv() syscall is called on a TCP socket such that
    - IOAT DMA is used
    - the MSG_WAITALL flag is used
    - the requested length is bigger than sk_rcvbuf
    - enough data has already arrived to bring rcv_wnd to zero
    then when tcp_recvmsg() gets to calling sk_wait_data(), the receive
    window can still be zero while sk_async_wait_queue holds enough data
    to keep it zero. As this queue isn't cleaned until the
    tcp_service_net_dma() call, sk_wait_data() cannot receive any data
    and blocks forever.

    If a zero receive window and a non-empty sk_async_wait_queue are
    detected before calling sk_wait_data(), process the queue first.

    Signed-off-by: Michal Kubecek
    Signed-off-by: David S. Miller

    Michal Kubeček
     

01 Sep, 2012

1 commit

  • This patch builds on top of the previous patch to add support
    for TFO listeners. This includes:

    1. allocating, properly initializing, and managing the per listener
    fastopen_queue structure when TFO is enabled

    2. changes to the inet_csk_accept code to support TFO. E.g., the
    request_sock can no longer be freed upon accept(), not until 3WHS
    finishes

    3. allowing a TCP_SYN_RECV socket to properly poll() and sendmsg()
    if it's a TFO socket

    4. properly closing a TFO listener, and a TFO socket before 3WHS
    finishes

    5. supporting TCP_FASTOPEN socket option

    6. modifying tcp_check_req() to check a TFO socket as well as a
    request_sock

    7. supporting TCP's TFO cookie option

    8. adding a new SYN-ACK retransmit handler that uses the timer
    directly off the TFO socket rather than the listener socket. Note
    that the TFO server side will not retransmit anything other than a
    SYN-ACK until the 3WHS is completed.

    The patch also contains an important function,
    "reqsk_fastopen_remove()", to manage the somewhat complex
    relationship between a listener, its request_sock, and the
    corresponding child socket. See the comment above the function for
    the details.
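
    A minimal user-space sketch of the server-side knob (hedged: error
    handling trimmed; the value is the maximum number of pending TFO
    requests, i.e. child sockets accepted before their 3WHS completes):

    #include <netinet/in.h>
    #include <sys/socket.h>

    #ifndef TCP_FASTOPEN
    #define TCP_FASTOPEN 23   /* from <linux/tcp.h> */
    #endif

    static int enable_tfo(int listen_fd)
    {
        int qlen = 16;   /* queue length for not-yet-completed TFO reqs */

        return setsockopt(listen_fd, IPPROTO_TCP, TCP_FASTOPEN,
                          &qlen, sizeof(qlen));
    }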

    Signed-off-by: H.K. Jerry Chu
    Cc: Yuchung Cheng
    Cc: Neal Cardwell
    Cc: Eric Dumazet
    Cc: Tom Herbert
    Signed-off-by: David S. Miller

    Jerry Chu
     
