03 Feb, 2015

2 commits

  • This patch fixes a bug where vnet_skb_shape() didn't set the already-selected
    queue mapping when a packet copy was required. This results in using the
    wrong queue index for stops/starts, hung tx queues and watchdog timeouts
    under heavy load.

    Signed-off-by: David L Stevens
    Acked-by: Sowmini Varadhan
    Signed-off-by: David S. Miller

    David L Stevens
     
  • Currently qlge_update_hw_vlan_features() will always first put the
    interface down, then update features and then bring it up again. But it
    is possible to hit this code while the adapter is down and this causes a
    non-paired call to napi_disable(), which will get stuck.

    This patch fixes it by skipping these down/up actions if the interface
    is already down.

    Fixes: a45adbe8d352 ("qlge: Enhance nested VLAN (Q-in-Q) handling.")
    Cc: Harish Patil
    Signed-off-by: Marcelo Ricardo Leitner
    Signed-off-by: David S. Miller

    Marcelo Leitner
     

02 Feb, 2015

1 commit

  • In commit be9f4a44e7d41 ("ipv4: tcp: remove per net tcp_sock")
    I tried to address contention on a socket lock, but the solution
    I chose was horrible:

    commit 3a7c384ffd57e ("ipv4: tcp: unicast_sock should not land outside
    of TCP stack") addressed a selinux regression.

    commit 0980e56e506b ("ipv4: tcp: set unicast_sock uc_ttl to -1")
    took care of another regression.

    commit b5ec8eeac46 ("ipv4: fix ip_send_skb()") fixed another regression.

    commit 811230cd85 ("tcp: ipv4: initialize unicast_sock sk_pacing_rate")
    was another shot in the dark.

    Really, just use a proper socket per cpu, and remove the skb_orphan()
    call, to re-enable flow control.

    This solves a serious problem with FQ packet scheduler when used in
    hostile environments, as we do not want to allocate a flow structure
    for every RST packet sent in response to a spoofed packet.

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

01 Feb, 2015

2 commits

  • Doing the following commands on a non idle network device
    panics the box instantly, because cpu_bstats gets overwritten
    by stats.

    tc qdisc add dev eth0 root
    ... some traffic (one packet is enough) ...
    tc qdisc replace dev eth0 root est 1sec 4sec

    [ 325.355596] BUG: unable to handle kernel paging request at ffff8841dc5a074c
    [ 325.362609] IP: [] __gnet_stats_copy_basic+0x3e/0x90
    [ 325.369158] PGD 1fa7067 PUD 0
    [ 325.372254] Oops: 0000 [#1] SMP
    [ 325.375514] Modules linked in: ...
    [ 325.398346] CPU: 13 PID: 14313 Comm: tc Not tainted 3.19.0-smp-DEV #1163
    [ 325.412042] task: ffff8800793ab5d0 ti: ffff881ff2fa4000 task.ti: ffff881ff2fa4000
    [ 325.419518] RIP: 0010:[] [] __gnet_stats_copy_basic+0x3e/0x90
    [ 325.428506] RSP: 0018:ffff881ff2fa7928 EFLAGS: 00010286
    [ 325.433824] RAX: 000000000000000c RBX: ffff881ff2fa796c RCX: 000000000000000c
    [ 325.440988] RDX: ffff8841dc5a0744 RSI: 0000000000000060 RDI: 0000000000000060
    [ 325.448120] RBP: ffff881ff2fa7948 R08: ffffffff81cd4f80 R09: 0000000000000000
    [ 325.455268] R10: ffff883ff223e400 R11: 0000000000000000 R12: 000000015cba0744
    [ 325.462405] R13: ffffffff81cd4f80 R14: ffff883ff223e460 R15: ffff883feea0722c
    [ 325.469536] FS: 00007f2ee30fa700(0000) GS:ffff88407fa20000(0000) knlGS:0000000000000000
    [ 325.477630] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    [ 325.483380] CR2: ffff8841dc5a074c CR3: 0000003feeae9000 CR4: 00000000001407e0
    [ 325.490510] Stack:
    [ 325.492524] ffff883feea0722c ffff883fef719dc0 ffff883feea0722c ffff883ff223e4a0
    [ 325.499990] ffff881ff2fa79a8 ffffffff815424ee ffff883ff223e49c 000000015cba0744
    [ 325.507460] 00000000f2fa7978 0000000000000000 ffff881ff2fa79a8 ffff883ff223e4a0
    [ 325.514956] Call Trace:
    [ 325.517412] [] gen_new_estimator+0x8e/0x230
    [ 325.523250] [] gen_replace_estimator+0x4a/0x60
    [ 325.529349] [] tc_modify_qdisc+0x52b/0x590
    [ 325.535117] [] rtnetlink_rcv_msg+0xa0/0x240
    [ 325.540963] [] ? __rtnl_unlock+0x20/0x20
    [ 325.546532] [] netlink_rcv_skb+0xb1/0xc0
    [ 325.552145] [] rtnetlink_rcv+0x25/0x40
    [ 325.557558] [] netlink_unicast+0x168/0x220
    [ 325.563317] [] netlink_sendmsg+0x2ec/0x3e0

    Let's play safe and not use a union: percpu 'pointers' are mostly read
    anyway, and we have typically few qdiscs per host.

    Signed-off-by: Eric Dumazet
    Cc: John Fastabend
    Fixes: 22e0f8b9322c ("net: sched: make bstats per cpu and estimator RCU safe")
    Signed-off-by: David S. Miller

    Eric Dumazet
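
A minimal userspace sketch of why sharing storage between the counters and the per-cpu pointer is fragile. The struct and function names below are hypothetical stand-ins, not the kernel's definitions; the sketch only shows that resetting the counters in the union layout also wipes the pointer, while the split layout keeps it.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for the basic and per-cpu statistics blocks. */
struct basic_stats { uint64_t bytes; uint32_t packets; };
struct percpu_stats { uint64_t bytes; uint32_t packets; };

/* Overlapping layout: the counters and the pointer share the same bytes. */
struct qdisc_union {
    union {
        struct basic_stats bstats;
        struct percpu_stats *cpu_bstats;
    } u;
};

/* Separate layout: a few more bytes per qdisc, but no aliasing. */
struct qdisc_split {
    struct basic_stats bstats;
    struct percpu_stats *cpu_bstats;
};

/* Zeroing the counters in the union layout also zeroes the pointer bytes. */
void reset_union(struct qdisc_union *q)
{
    q->u.bstats.bytes = 0;
    q->u.bstats.packets = 0;
}

/* Zeroing the counters in the split layout leaves the pointer untouched. */
void reset_split(struct qdisc_split *q)
{
    q->bstats.bytes = 0;
    q->bstats.packets = 0;
}
```

After `reset_union()` the stored pointer no longer refers to the per-cpu block, so any later dereference reads from a bogus address, which is the kind of failure the oops above shows.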
     
  • The existing code frees the skb in EAGAIN case, in which the skb will be
    retried from upper layer and used again.
    Also, the existing code doesn't free send buffer slot in error case, because
    there is no completion message for unsent packets.
    This patch fixes these problems.

    (Please also include this patch for stable trees. Thanks!)

    Signed-off-by: Haiyang Zhang
    Reviewed-by: K. Y. Srinivasan
    Signed-off-by: David S. Miller

    Haiyang Zhang
     

31 Jan, 2015

8 commits

  • This patch fixes the following kernel crash,

    WARNING: CPU: 2 PID: 0 at net/ipv4/tcp_input.c:3079 tcp_clean_rtx_queue+0x658/0x80c()
    Call trace:
    [] dump_backtrace+0x0/0x184
    [] show_stack+0x10/0x1c
    [] dump_stack+0x74/0x98
    [] warn_slowpath_common+0x88/0xb0
    [] warn_slowpath_null+0x14/0x20
    [] tcp_clean_rtx_queue+0x654/0x80c
    [] tcp_ack+0x454/0x688
    [] tcp_rcv_established+0x4a4/0x62c
    [] tcp_v4_do_rcv+0x16c/0x350
    [] tcp_v4_rcv+0x8e8/0x904
    [] ip_local_deliver_finish+0x100/0x26c
    [] ip_local_deliver+0xac/0xc4
    [] ip_rcv_finish+0xe8/0x328
    [] ip_rcv+0x24c/0x38c
    [] __netif_receive_skb_core+0x29c/0x7c8
    [] __netif_receive_skb+0x28/0x7c
    [] netif_receive_skb_internal+0x5c/0xe0
    [] napi_gro_receive+0xb4/0x110
    [] xgene_enet_process_ring+0x144/0x338
    [] xgene_enet_napi+0x1c/0x50
    [] net_rx_action+0x154/0x228
    [] __do_softirq+0x110/0x28c
    [] irq_exit+0x8c/0xc0
    [] handle_IRQ+0x44/0xa8
    [] gic_handle_irq+0x38/0x7c
    [...]

    Software writes poison data into the descriptor bytes[15:8] and upon
    receiving the interrupt, if those bytes are overwritten by the hardware with
    the valid data, software also reads bytes[7:0] and executes receive/tx
    completion logic.

    If the CPU executes the above two reads out of order, then
    bytes[7:0] will hold older data, causing the kernel panic. We have to
    force the order of the reads, so this patch introduces a read memory
    barrier between them.

    Signed-off-by: Iyappan Subramanian
    Signed-off-by: Keyur Chudgar
    Signed-off-by: David S. Miller

    Iyappan Subramanian
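
The ordering requirement described above can be sketched in portable C11. This is an illustration only: the struct layout, the `DESC_POISON` value and the function name are hypothetical, and a C11 acquire load stands in for the kernel's read memory barrier (real DMA descriptors need the kernel's own barrier primitives).

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical "slot not yet written by hardware" marker. */
#define DESC_POISON UINT64_C(0x8000000000000000)

struct raw_desc {
    uint64_t m0;          /* bytes[7:0]: payload fields */
    _Atomic uint64_t m1;  /* bytes[15:8]: poisoned by software, then overwritten */
};

/*
 * Read m1 first; only if it no longer holds the poison value is m0 read.
 * The acquire load keeps the m0 load from being reordered before the m1 load.
 */
bool desc_read(struct raw_desc *d, uint64_t *m0_out)
{
    uint64_t m1 = atomic_load_explicit(&d->m1, memory_order_acquire);

    if (m1 == DESC_POISON)
        return false;     /* slot still empty, nothing to process */

    *m0_out = d->m0;      /* ordered after the m1 check */
    return true;
}
```

Without the ordering, the second read may be satisfied early and return the stale contents of bytes[7:0] even though bytes[15:8] already look valid.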
     
  • Toshiaki Makita says:

    ====================
    Fix checksum error when using stacked vlan

    When I was testing 802.1ad, I found several drivers don't take into
    account 802.1ad or multiple vlans when retrieving L3 (IP/IPv6) or
    L4 (TCP/UDP) protocol for checksum offload.

    It is mainly due to vlan_get_protocol(), which extracts ether type only
    when it is tagged with single 802.1Q. When 802.1ad is used or there are
    multiple vlans, it extracts vlan protocol and drivers cannot determine
    which L3/L4 protocol is used.

    Those drivers, most of which have IP_CSUM/IPV6_CSUM features, get L3/L4
    header-offset by software, so it seems that their checksum offload works
    with multiple vlans if we can parse protocols correctly.
    (They know mac header length, and probably don't care about what is in it.)

    And another thing, some of Intel's drivers seem to use skb->protocol where
    vlan_get_protocol() is more suitable.

    I tested that at least igb/igbvf on I350 works with this patch set.

    Note:
    We can hand a double tagged packet with CHECKSUM_PARTIAL to a HW driver
    by creating a vlan device on a bridge device and enabling vlan_filtering
    of the bridge with 802.1ad protocol.
    ====================

    Signed-off-by: David S. Miller

    David S. Miller
     
  • When a skb has multiple vlans and it is CHECKSUM_PARTIAL,
    ixgbevf_tx_csum() fails to get the network protocol and checksum related
    descriptor fields are not configured correctly because skb->protocol
    doesn't show the L3 protocol in this case.

    Use first->protocol instead of skb->protocol to get the proper network
    protocol.

    Signed-off-by: Toshiaki Makita
    Signed-off-by: David S. Miller

    Toshiaki Makita
     
  • When a skb has multiple vlans and it is CHECKSUM_PARTIAL,
    ixgbe_tx_csum() fails to get the network protocol and checksum related
    descriptor fields are not configured correctly because skb->protocol
    doesn't show the L3 protocol in this case.

    Use vlan_get_protocol() to get the proper network protocol.

    Signed-off-by: Toshiaki Makita
    Signed-off-by: David S. Miller

    Toshiaki Makita
     
  • When a skb has multiple vlans and it is CHECKSUM_PARTIAL,
    igbvf_tx_csum() fails to get the network protocol and checksum related
    descriptor fields are not configured correctly because skb->protocol
    doesn't show the L3 protocol in this case.

    Use vlan_get_protocol() to get the proper network protocol.

    Signed-off-by: Toshiaki Makita
    Signed-off-by: David S. Miller

    Toshiaki Makita
     
  • vlan_get_protocol() could not get network protocol if a skb has a 802.1ad
    vlan tag or multiple vlans, which caused incorrect checksum calculation
    in several drivers.

    Fix vlan_get_protocol() to retrieve network protocol instead of incorrect
    vlan protocol.

    As the logic is the same as skb_network_protocol(), create a common helper
    function __vlan_get_protocol() and call it from existing functions.

    Signed-off-by: Toshiaki Makita
    Signed-off-by: David S. Miller

    Toshiaki Makita
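
As a rough illustration of the parsing the message above describes, the following userspace sketch walks past any number of stacked 802.1Q/802.1ad tags in a raw Ethernet frame until it finds the network protocol. It operates on a plain byte buffer rather than an sk_buff, and the function name is an assumption made for this example.

```c
#include <stddef.h>
#include <stdint.h>

#define ETH_HLEN     14      /* dst MAC + src MAC + EtherType */
#define VLAN_HLEN    4       /* TCI + encapsulated EtherType */
#define ETH_P_8021Q  0x8100
#define ETH_P_8021AD 0x88A8

/* Read a big-endian 16-bit value from the frame. */
static uint16_t be16_at(const uint8_t *p)
{
    return (uint16_t)((p[0] << 8) | p[1]);
}

/*
 * Return the first non-VLAN EtherType (host byte order),
 * or 0 if the frame is too short to contain it.
 */
uint16_t frame_network_protocol(const uint8_t *frame, size_t len)
{
    size_t off = ETH_HLEN - 2;   /* offset of the outer EtherType */
    uint16_t type;

    if (len < ETH_HLEN)
        return 0;

    type = be16_at(frame + off);
    while (type == ETH_P_8021Q || type == ETH_P_8021AD) {
        off += VLAN_HLEN;        /* step over this tag to the next type field */
        if (off + 2 > len)
            return 0;
        type = be16_at(frame + off);
    }
    return type;
}
```

For a frame tagged with 802.1ad outside and 802.1Q inside, stopping after the first tag would report 0x8100, which is the vlan protocol rather than the L3 protocol a checksum-offload path needs.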
     
  • When making use of RFC5061, section 4.2.4. for setting the primary IP
    address, we're passing a wrong parameter header to param_type2af(),
    resulting always in NULL being returned.

    At this point, param.p points to a sctp_addip_param struct, containing
    a sctp_paramhdr (type = 0xc004, length = var), and crr_id as a correlation
    id. Followed by that, as also presented in RFC5061 section 4.2.4., comes
    the actual sctp_addr_param, which also contains a sctp_paramhdr, but
    this time with the correct type SCTP_PARAM_IPV{4,6}_ADDRESS that
    param_type2af() can make use of. Since we already hold a pointer to
    addr_param from previous line, just reuse it for param_type2af().

    Fixes: d6de3097592b ("[SCTP]: Add the handling of "Set Primary IP Address" parameter to INIT")
    Signed-off-by: Saran Maruti Ramanara
    Signed-off-by: Daniel Borkmann
    Acked-by: Vlad Yasevich
    Acked-by: Neil Horman
    Signed-off-by: David S. Miller

    Saran Maruti Ramanara
     
  • The subscription bitmask passed via struct sockaddr_nl is converted to
    the group number when calling the netlink_bind() and netlink_unbind()
    callbacks.

    The conversion is however incorrect since bitmask (1 << 0) needs to be
    mapped to group number 1. Note that you cannot specify the group number 0
    (usually known as _NONE) from setsockopt() using NETLINK_ADD_MEMBERSHIP
    since this is rejected through -EINVAL.

    This problem became noticeable since 97840cb ("netfilter: nfnetlink:
    fix insufficient validation in nfnetlink_bind") when binding to bitmask
    (1 << 0) in ctnetlink.

    Reported-by: Andre Tomt
    Reported-by: Ivan Delalande
    Signed-off-by: Pablo Neira Ayuso
    Signed-off-by: David S. Miller

    Pablo Neira
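
The bitmask-to-group mapping discussed above can be sketched as follows. The helper name is hypothetical; the only rule it encodes is the one stated in the message: bit n of the subscription bitmask corresponds to group number n + 1, so group 0 is never produced.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Expand a 32-bit subscription bitmask into group numbers.
 * Bit 0 maps to group 1, bit 1 to group 2, and so on.
 * Returns how many entries of out[] were filled.
 */
size_t mask_to_groups(uint32_t mask, unsigned int out[32])
{
    size_t n = 0;
    unsigned int bit;

    for (bit = 0; bit < 32; bit++) {
        if (mask & (UINT32_C(1) << bit))
            out[n++] = bit + 1;
    }
    return n;
}
```

Passing the raw bit index instead of `bit + 1` would hand group 0 to the bind callback for bitmask (1 << 0), which is the off-by-one the commit corrects.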
     

30 Jan, 2015

11 commits

  • RFC 1191 said, "a host MUST not increase its estimate of the Path
    MTU in response to the contents of a Datagram Too Big message."

    Signed-off-by: Li Wei
    Signed-off-by: David S. Miller

    Li Wei
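
A minimal sketch of the rule quoted from RFC 1191, assuming a hypothetical helper that receives the current estimate and the MTU reported in the Datagram Too Big message. It deliberately leaves out any lower bound the stack may enforce on the estimate.

```c
#include <stdint.h>

/*
 * Return the new path MTU estimate after a Datagram Too Big message.
 * The estimate may shrink but must never grow in response to the message.
 */
uint32_t pmtu_after_too_big(uint32_t current_pmtu, uint32_t reported_mtu)
{
    if (reported_mtu >= current_pmtu)
        return current_pmtu;   /* ignore reports that would raise the estimate */
    return reported_mtu;
}
```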
     
  • Arnd Bergmann says:

    ====================
    net: driver fixes from arm randconfig builds

    These four patches are fallout from test builds on ARM. I have a
    few more of them in my backlog but have not yet confirmed them
    to still be valid.

    The first three patches are about incomplete dependencies on
    old drivers. One could backport them to the beginning of time
    in theory, but there is little value since nobody would run into
    these problems.

    The final patch is one I had submitted before together with the
    respective pcmcia patch but forgot to follow up on that. It's
    still a valid but relatively theoretical bug, because the previous
    behavior of the driver was just as broken as what we have in
    mainline.
    ====================

    Signed-off-by: David S. Miller

    David S. Miller
     
  • A recent patch tried to work around a valid warning for the use of a
    deprecated interface by blindly changing from the old
    pcmcia_request_exclusive_irq() interface to pcmcia_request_irq().

    This driver has an interrupt handler that is not currently aware
    of shared interrupts, but can be easily converted to be.
    At the moment, the driver reads the interrupt status register
    repeatedly until it contains only zeroes in the interesting bits,
    and handles each bit individually.

    This patch adds the missing part of returning IRQ_NONE in case none
    of the bits are set to start with, so we can move on to the next
    interrupt source.

    Signed-off-by: Arnd Bergmann
    Fixes: 5f5316fcd08ef7 ("am2150: Update nmclan_cs.c to use update PCMCIA API")
    Signed-off-by: David S. Miller

    Arnd Bergmann
     
  • The ni65 and lance ethernet drivers manually program the ISA DMA
    controller that is only available on x86 PCs and a few compatible
    systems. Trying to build it on ARM results in this error:

    ni65.c: In function 'ni65_probe1':
    ni65.c:496:62: error: 'DMA1_STAT_REG' undeclared (first use in this function)
    ((inb(DMA1_STAT_REG) >> 4) & 0x0f)
    ^
    ni65.c:496:62: note: each undeclared identifier is reported only once for each function it appears in
    ni65.c:497:63: error: 'DMA2_STAT_REG' undeclared (first use in this function)
    | (inb(DMA2_STAT_REG) & 0xf0);

    The DMA1_STAT_REG and DMA2_STAT_REG registers are only defined for
    alpha, mips, parisc, powerpc and x86, although it is not clear
    which subarchitectures actually have them at the correct location.

    This patch for now just disables it for ARM, to avoid randconfig
    build errors. We could also decide to limit it to the set of
    architectures on which it does compile, but that might look more
    deliberate than guessing based on where the drivers build.

    Signed-off-by: Arnd Bergmann
    Signed-off-by: David S. Miller

    Arnd Bergmann
     
  • The cosa driver is rather outdated and does not get built on most
    platforms because it requires the ISA_DMA_API symbol. However
    there are some ARM platforms that have ISA_DMA_API but no virt_to_bus,
    and they get this build error when enabling the cosa driver.

    drivers/net/wan/cosa.c: In function 'tx_interrupt':
    drivers/net/wan/cosa.c:1768:3: error: implicit declaration of function 'virt_to_bus'
    unsigned long addr = virt_to_bus(cosa->txbuf);
    ^

    The same problem exists for the Hostess SV-11 and Sealevel Systems 4021
    drivers.

    This adds another dependency in Kconfig to avoid that configuration.

    Signed-off-by: Arnd Bergmann
    Signed-off-by: David S. Miller

    Arnd Bergmann
     
  • The cs89x0 driver can either be built as an ISA driver or a platform
    driver, the choice is controlled by the CS89x0_PLATFORM Kconfig
    symbol. Building the ISA driver on a system that does not have
    a way to map I/O ports fails with this error:

    drivers/built-in.o: In function `cs89x0_ioport_probe.constprop.1':
    :(.init.text+0x4794): undefined reference to `ioport_map'
    :(.init.text+0x4830): undefined reference to `ioport_unmap'

    This changes the Kconfig logic to take that option away and
    always force building the platform variant of this driver if
    CONFIG_HAS_IOPORT_MAP is not set. This is the only correct
    choice in this case, and it avoids the build error.

    Signed-off-by: Arnd Bergmann
    Signed-off-by: David S. Miller

    Arnd Bergmann
     
  • When we've run out of space in the output buffer to store more data, we
    will call zlib_deflate with a NULL output buffer until we've consumed
    remaining input.

    When this happens, olen contains the size the output buffer would have
    consumed iff we'd have had enough room.

    This can later cause skb_over_panic when ppp_generic skb_put()s
    the returned length.

    Reported-by: Iain Douglas
    Signed-off-by: Florian Westphal
    Signed-off-by: David S. Miller

    Florian Westphal
     
  • Nicolas Dichtel says:

    ====================
    netns: audit netdevice creation with IFLA_NET_NS_[PID|FD]

    When one of these attributes is set, the netdevice is created in the netns
    pointed to by IFLA_NET_NS_[PID|FD] (see the call to rtnl_create_link() in
    rtnl_newlink()). Let's call this netns the dest_net. After this creation, if the
    newlink handler exists, it is called with a netns argument that points to the
    netns where the netlink message has been received (called src_net in the code)
    which is the link netns.
    Hence, with one of these attributes, it's possible to create a x-netns
    netdevice.

    Here is the result of my code review:
    - all ip tunnels (sit, ipip, ip6_tunnels, gre[tap][v6], ip_vti[6]) do not
    really allow using this feature: the netdevice is created in the dest_net
    and the src_net is completely ignored in the newlink handler.
    - VLAN properly handles this x-netns creation.
    - bridge ignores src_net, which seems fine (NETIF_F_NETNS_LOCAL is set).
    - CAIF subsystem is not clear for me (I don't know how it works), but it seems
    to wrongly use src_net. Patch #1 tries to fix this, but it was done only by
    code review (and only compile-tested), so please carefully review it. I may
    miss something.
    - HSR subsystem uses src_net to parse IFLA_HSR_SLAVE[1|2], but the netdevice has
    the flag NETIF_F_NETNS_LOCAL, so the question is: does this netdevice really
    support x-netns? If not, the newlink handler should use the dest_net instead
    of src_net, I can provide the patch.
    - ieee802154 also uses src_net and does not have NETIF_F_NETNS_LOCAL. Same
    question: does this netdevice really support x-netns?
    - bonding ignores src_net and flag NETIF_F_NETNS_LOCAL is set, ie x-netns is not
    supported. Fine.
    - CAN does not support rtnl/newlink, ok.
    - ipvlan uses src_net and does not have NETIF_F_NETNS_LOCAL. After looking at
    the code, it seems that this driver supports x-netns. Am I right?
    - macvlan/macvtap uses src_net and seems to have x-netns support.
    - team ignores src_net and has the flag NETIF_F_NETNS_LOCAL, ie x-netns is not
    supported. Ok.
    - veth uses src_net and has x-netns support ;-) Ok.
    - VXLAN didn't properly handle this. The link netns (vxlan->net) is the src_net
    and not dest_net (see patch #2). Note that it was already possible to create a
    x-netns vxlan before the commit f01ec1c017de ("vxlan: add x-netns support")
    but the netdevice remains broken.

    To summarize:
    - CAIF patch must be carefully reviewed
    - for HSR, ieee802154, ipvlan: is x-netns supported?
    ====================

    Signed-off-by: David S. Miller

    David S. Miller
     
  • Rename the netns to src_net to avoid confusion with the netns where the
    interface stands. The user may specify IFLA_NET_NS_[PID|FD] to create
    a x-netns netdevice: IFLA_NET_NS_[PID|FD] points to the netns where the
    netdevice stands and src_net to the link netns.

    Note that before commit f01ec1c017de ("vxlan: add x-netns support"), it was
    possible to create a x-netns vxlan netdevice, but the netdevice was not
    operational.

    Fixes: f01ec1c017de ("vxlan: add x-netns support")
    Signed-off-by: Nicolas Dichtel
    Signed-off-by: David S. Miller

    Nicolas Dichtel
     
  • src_net points to the netns where the netlink message has been received. This
    netns may be different from the netns where the interface is created (because
    the user may add IFLA_NET_NS_[PID|FD]). In this case, src_net is the link netns.

    It seems wrong to override the netns in the newlink() handler because if it
    was not already src_net, it means that the user explicitly asks to create the
    netdevice in another netns.

    CC: Sjur Brændeland
    CC: Dmitry Tarnyagin
    Fixes: 8391c4aab1aa ("caif: Bugfixes in CAIF netdevice for close and flow control")
    Fixes: c41254006377 ("caif-hsi: Add rtnl support")
    Signed-off-by: Nicolas Dichtel
    Signed-off-by: David S. Miller

    Nicolas Dichtel
     
  • Fixed commit added from64to32 under #ifndef do_csum but used it
    under #ifndef csum_tcpudp_nofold, breaking some builds (Fengguang's
    robot reported TILEGX's). Move from64to32 under the latter.

    Fixes: 150ae0e94634 ("lib/checksum.c: fix carry in csum_tcpudp_nofold")
    Reported-by: kbuild test robot
    Signed-off-by: Karl Beldan
    Cc: Eric Dumazet
    Cc: David S. Miller
    Signed-off-by: David S. Miller

    karl beldan
     

29 Jan, 2015

9 commits

  • When I added sk_pacing_rate field, I forgot to initialize its value
    in the per cpu unicast_sock used in ip_send_unicast_reply()

    This means that for sch_fq users, RST packets, or ACK packets sent
    on behalf of TIME_WAIT sockets might be sent too slowly or even dropped
    once we reach the per flow limit.

    Signed-off-by: Eric Dumazet
    Fixes: 95bd09eb2750 ("tcp: TSO packets automatic sizing")
    Signed-off-by: David S. Miller

    Eric Dumazet
     
  • The carry from the 64->32bits folding was dropped, e.g with:
    saddr=0xFFFFFFFF daddr=0xFF0000FF len=0xFFFF proto=0 sum=1,
    csum_tcpudp_nofold returned 0 instead of 1.

    Signed-off-by: Karl Beldan
    Cc: Al Viro
    Cc: Eric Dumazet
    Cc: Arnd Bergmann
    Cc: Mike Frysinger
    Cc: netdev@vger.kernel.org
    Cc: linux-kernel@vger.kernel.org
    Cc: stable@vger.kernel.org
    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    karl beldan
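
The dropped carry can be reproduced in userspace with plain integer arithmetic. This is a sketch, not the kernel source: the function names other than `from64to32` are hypothetical, and the 64-bit accumulation assumes the little-endian form in which `proto + len` is shifted left by 8 before being added.

```c
#include <stdint.h>

/* Folding that adds the high half once and truncates; a second carry is lost. */
uint32_t fold_dropping_carry(uint64_t s)
{
    s += (s >> 32);
    return (uint32_t)s;
}

/* Folding with two rounds of end-around carry, so nothing is lost. */
uint32_t from64to32(uint64_t x)
{
    x = (x & 0xffffffffu) + (x >> 32);  /* 32 bits plus a possible carry */
    x = (x & 0xffffffffu) + (x >> 32);  /* fold that carry back in */
    return (uint32_t)x;
}

/* 64-bit pseudo-header accumulation (little-endian variant, an assumption). */
uint64_t pseudo_sum(uint32_t saddr, uint32_t daddr,
                    uint16_t len, uint16_t proto, uint32_t sum)
{
    uint64_t s = sum;

    s += saddr;
    s += daddr;
    s += ((uint32_t)proto + len) << 8;
    return s;
}
```

With the inputs from the message (saddr=0xFFFFFFFF, daddr=0xFF0000FF, len=0xFFFF, proto=0, sum=1) the accumulator is 0x1FFFFFFFF. Adding the high half gives 0x200000000, which truncates to 0, whereas the two-step fold yields 1.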
     
  • Reported in: https://bugzilla.kernel.org/show_bug.cgi?id=92081

    This patch avoids calling rtnl_notify if the device ndo_bridge_getlink
    handler does not return any bytes in the skb.

    Alternately, the skb->len check can be moved inside rtnl_notify.

    For the bridge vlan case described in 92081, there is also a fix needed
    in bridge driver to generate a proper notification. Will fix that in
    subsequent patch.

    v2: rebase patch on net tree

    Signed-off-by: Roopa Prabhu
    Signed-off-by: David S. Miller

    Roopa Prabhu
     
  • Neal Cardwell says:

    ====================
    fix stretch ACK bugs in TCP CUBIC and Reno

    This patch series fixes the TCP CUBIC and Reno congestion control
    modules to properly handle stretch ACKs in their respective additive
    increase modes, and in the transitions from slow start to additive
    increase.

    This finishes the project started by commit 9f9843a751d0a2057 ("tcp:
    properly handle stretch acks in slow start"), which fixed behavior for
    TCP congestion control when handling stretch ACKs in slow start mode.

    Motivation: In the Jan 2015 netdev thread 'BW regression after "tcp:
    refine TSO autosizing"', Eyal Perry documented a regression that Eric
    Dumazet determined was caused by improper handling of TCP stretch
    ACKs.

    Background: LRO, GRO, delayed ACKs, and middleboxes can cause "stretch
    ACKs" that cover more than the RFC-specified maximum of 2
    packets. These stretch ACKs can cause serious performance shortfalls
    in common congestion control algorithms, like Reno and CUBIC, which
    were designed and tuned years ago with receiver hosts that were not
    using LRO or GRO, and were instead ACKing every other packet.

    Testing: at Google we have been using this approach for handling
    stretch ACKs for CUBIC datacenter and Internet traffic for several
    years, with good results.

    v2:
    * fixed return type of tcp_slow_start() to be u32 instead of int
    ====================

    Signed-off-by: David S. Miller

    David S. Miller
     
  • This patch fixes a bug in CUBIC that causes cwnd to increase slightly
    too slowly when multiple ACKs arrive in the same jiffy.

    If cwnd is supposed to increase at a rate of more than once per jiffy,
    then CUBIC was sometimes too slow. Because the bic_target is
    calculated for a future point in time, calculated with time in
    jiffies, the cwnd can increase over the course of the jiffy while the
    bic_target calculated as the proper CUBIC cwnd at time
    t=tcp_time_stamp+rtt does not increase, because tcp_time_stamp only
    increases on jiffy tick boundaries.

    So since the cnt is set to:
    ca->cnt = cwnd / (bic_target - cwnd);
    as cwnd increases but bic_target does not increase due to jiffy
    granularity, the cnt becomes too large, causing cwnd to increase
    too slowly.

    For example:
    - suppose at the beginning of a jiffy, cwnd=40, bic_target=44
    - so CUBIC sets:
    ca->cnt = cwnd / (bic_target - cwnd) = 40 / (44 - 40) = 40/4 = 10
    - suppose we get 10 acks, each for 1 segment, so tcp_cong_avoid_ai()
    increases cwnd to 41
    - so CUBIC sets:
    ca->cnt = cwnd / (bic_target - cwnd) = 41 / (44 - 41) = 41 / 3 = 13

    So now CUBIC will wait for 13 packets to be ACKed before increasing
    cwnd to 42, instead of 10 as it should.

    The fix is to avoid adjusting the slope (determined by ca->cnt)
    multiple times within a jiffy, and instead skip to compute the Reno
    cwnd, the "TCP friendliness" code path.

    Reported-by: Eyal Perry
    Signed-off-by: Neal Cardwell
    Signed-off-by: Yuchung Cheng
    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Neal Cardwell
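
The arithmetic in the example above can be checked with a small model. It is a simplification made for illustration: bic_target is held frozen for the whole jiffy, every ACK covers one segment, and the helper names are hypothetical. The `recompute` flag selects between recalculating the slope after each cwnd increase and keeping the slope computed at the start of the jiffy.

```c
#include <stdint.h>

/* ACKed packets needed per +1 of cwnd; assumes bic_target > cwnd. */
uint32_t cubic_cnt(uint32_t cwnd, uint32_t bic_target)
{
    return cwnd / (bic_target - cwnd);
}

/*
 * Count the single-segment ACKs needed to grow cwnd up to bic_target
 * while bic_target stays frozen (same jiffy).
 */
uint32_t acks_to_reach_target(uint32_t cwnd, uint32_t bic_target, int recompute)
{
    uint32_t total = 0;
    uint32_t cnt;

    if (cwnd >= bic_target)
        return 0;

    cnt = cubic_cnt(cwnd, bic_target);
    while (cwnd < bic_target) {
        if (recompute)
            cnt = cubic_cnt(cwnd, bic_target);  /* slope re-derived each step */
        total += cnt;                           /* cnt ACKs buy one cwnd increment */
        cwnd++;
    }
    return total;
}
```

Starting from cwnd=40 and bic_target=44, the fixed slope needs 4 * 10 = 40 ACKs, while recomputing needs 10 + 13 + 21 + 43 = 87, which shows how the stale target makes growth progressively slower.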
     
  • Change CUBIC to properly handle stretch ACKs in additive increase mode
    by passing in the count of ACKed packets to tcp_cong_avoid_ai().

    In addition, because we are now precisely accounting for stretch ACKs,
    including delayed ACKs, we can now remove the delayed ACK tracking and
    estimation code that tracked recent delayed ACK behavior in
    ca->delayed_ack.

    Reported-by: Eyal Perry
    Signed-off-by: Neal Cardwell
    Signed-off-by: Yuchung Cheng
    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Neal Cardwell
     
  • Change Reno to properly handle stretch ACKs in additive increase mode
    by passing in the count of ACKed packets to tcp_cong_avoid_ai().

    In addition, if snd_cwnd crosses snd_ssthresh during slow start
    processing, and we then exit slow start mode, we need to carry over
    any remaining "credit" for packets ACKed and apply that to additive
    increase by passing this remaining "acked" count to
    tcp_cong_avoid_ai().

    Reported-by: Eyal Perry
    Signed-off-by: Neal Cardwell
    Signed-off-by: Yuchung Cheng
    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Neal Cardwell
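
The hand-off from slow start to additive increase described above might look like the following userspace sketch. The struct is a hypothetical stand-in for the relevant socket fields, the window clamp is omitted, and the code assumes slow start is only entered while snd_cwnd is at most snd_ssthresh.

```c
#include <stdint.h>

/* Hypothetical subset of the TCP socket state used by Reno. */
struct reno_state {
    uint32_t snd_cwnd;      /* congestion window, in packets */
    uint32_t snd_ssthresh;  /* slow start threshold */
    uint32_t snd_cwnd_cnt;  /* ACK credit accumulated in additive increase */
};

/* Grow cwnd by acked, stop just past ssthresh, return the unused credit. */
uint32_t slow_start(struct reno_state *tp, uint32_t acked)
{
    uint32_t cwnd = tp->snd_cwnd + acked;

    if (cwnd > tp->snd_ssthresh)
        cwnd = tp->snd_ssthresh + 1;
    acked -= cwnd - tp->snd_cwnd;
    tp->snd_cwnd = cwnd;
    return acked;
}

/* Additive increase: +1 cwnd per w ACKed packets, remainder kept (w > 0). */
void additive_increase(struct reno_state *tp, uint32_t w, uint32_t acked)
{
    tp->snd_cwnd_cnt += acked;
    if (tp->snd_cwnd_cnt >= w) {
        uint32_t delta = tp->snd_cwnd_cnt / w;

        tp->snd_cwnd_cnt -= delta * w;
        tp->snd_cwnd += delta;
    }
}

/* Reno: spend credit in slow start first, pass any leftover on. */
void reno_cong_avoid(struct reno_state *tp, uint32_t acked)
{
    if (tp->snd_cwnd <= tp->snd_ssthresh) {
        acked = slow_start(tp, acked);
        if (acked == 0)
            return;
    }
    additive_increase(tp, tp->snd_cwnd, acked);
}
```

With snd_cwnd=8, snd_ssthresh=10 and a stretch ACK for 14 packets, slow start consumes 3 credits to reach 11, and the remaining 11 credits raise snd_cwnd to 12 in additive increase instead of being discarded.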
     
  • tcp_cong_avoid_ai() was too timid (snd_cwnd increased too slowly) on
    "stretch ACKs" -- cases where the receiver ACKed more than 1 packet in
    a single ACK. For example, suppose w is 10 and we get a stretch ACK
    for 20 packets, so acked is 20. We ought to increase snd_cwnd by 2
    (since acked/w = 20/10 = 2), but instead we were only increasing cwnd
    by 1. This patch fixes that behavior.

    Reported-by: Eyal Perry
    Signed-off-by: Neal Cardwell
    Signed-off-by: Yuchung Cheng
    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Neal Cardwell
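
To make the w=10, acked=20 example concrete, here is a userspace sketch contrasting a "timid" update with a stretch-ACK-aware one. Both functions and the struct are illustrative assumptions rather than the kernel's code; w is assumed to be greater than zero.

```c
#include <stdint.h>

/* Hypothetical stand-in for the socket fields involved. */
struct cwnd_state {
    uint32_t snd_cwnd;        /* congestion window, in packets */
    uint32_t snd_cwnd_cnt;    /* ACKed packets not yet turned into growth */
    uint32_t snd_cwnd_clamp;  /* upper bound on snd_cwnd */
};

/* Timid variant: at most +1 per call, surplus credit is thrown away. */
void cong_avoid_ai_timid(struct cwnd_state *tp, uint32_t w, uint32_t acked)
{
    tp->snd_cwnd_cnt += acked;
    if (tp->snd_cwnd_cnt >= w) {
        if (tp->snd_cwnd < tp->snd_cwnd_clamp)
            tp->snd_cwnd++;
        tp->snd_cwnd_cnt = 0;
    }
}

/* Stretch-aware variant: grow by (credit / w) and keep the remainder. */
void cong_avoid_ai_stretch(struct cwnd_state *tp, uint32_t w, uint32_t acked)
{
    tp->snd_cwnd_cnt += acked;
    if (tp->snd_cwnd_cnt >= w) {
        uint32_t delta = tp->snd_cwnd_cnt / w;

        tp->snd_cwnd_cnt -= delta * w;
        tp->snd_cwnd += delta;
    }
    if (tp->snd_cwnd > tp->snd_cwnd_clamp)
        tp->snd_cwnd = tp->snd_cwnd_clamp;
}
```

Starting from snd_cwnd=10 with w=10 and a single ACK covering 20 packets, the timid variant ends at 11 while the stretch-aware variant ends at 12, matching acked/w = 2.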
     
  • LRO, GRO, delayed ACKs, and middleboxes can cause "stretch ACKs" that
    cover more than the RFC-specified maximum of 2 packets. These stretch
    ACKs can cause serious performance shortfalls in common congestion
    control algorithms that were designed and tuned years ago with
    receiver hosts that were not using LRO or GRO, and were instead
    politely ACKing every other packet.

    This patch series fixes Reno and CUBIC to handle stretch ACKs.

    This patch prepares for the upcoming stretch ACK bug fix patches. It
    adds an "acked" parameter to tcp_cong_avoid_ai() to allow for future
    fixes to tcp_cong_avoid_ai() to correctly handle stretch ACKs, and
    changes all congestion control algorithms to pass in 1 for the ACKed
    count. It also changes tcp_slow_start() to return the number of packet
    ACK "credits" that were not processed in slow start mode, and can be
    processed by the congestion control module in additive increase mode.

    In future patches we will fix tcp_cong_avoid_ai() to handle stretch
    ACKs, and fix Reno and CUBIC handling of stretch ACKs in slow start
    and additive increase mode.

    Reported-by: Eyal Perry
    Signed-off-by: Neal Cardwell
    Signed-off-by: Yuchung Cheng
    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Neal Cardwell
     

28 Jan, 2015

5 commits

  • Pull networking fixes from David Miller:

    1) Don't OOPS on socket AIO, from Christoph Hellwig.

    2) Scheduled scans should be aborted upon RFKILL, from Emmanuel
    Grumbach.

    3) Fix sleep in atomic context in kvaser_usb, from Ahmed S Darwish.

    4) Fix RCU locking across copy_to_user() in bpf code, from Alexei
    Starovoitov.

    5) Lots of crash, memory leak, short TX packet et al bug fixes in
    sh_eth from Ben Hutchings.

    6) Fix memory corruption in SCTP wrt. INIT collisions, from Daniel
    Borkmann.

    7) Fix return value logic for poll handlers in netxen, enic, and bnx2x.
    From Eric Dumazet and Govindarajulu Varadarajan.

    8) Header length calculation fix in mac80211 from Fred Chou.

    9) mv643xx_eth doesn't handle highmem correctly in non-TSO code paths.
    From Ezequiel Garcia.

    10) udp_diag has bogus logic in its hash chain skipping, copy same fix
    tcp diag used. From Herbert Xu.

    11) amd-xgbe programs wrong rx flow control register, from Thomas
    Lendacky.

    12) Fix race leading to use after free in ping receive path, from Subash
    Abhinov Kasiviswanathan.

    13) Cache redirect routes otherwise we can get a heavy backlog of rcu
    jobs liberating DST_NOCACHE entries. From Hannes Frederic Sowa.

    * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (48 commits)
    net: don't OOPS on socket aio
    stmmac: prevent probe drivers to crash kernel
    bnx2x: fix napi poll return value for repoll
    ipv6: replacing a rt6_info needs to purge possible propagated rt6_infos too
    sh_eth: Fix DMA-API usage for RX buffers
    sh_eth: Check for DMA mapping errors on transmit
    sh_eth: Ensure DMA engines are stopped before freeing buffers
    sh_eth: Remove RX overflow log messages
    ping: Fix race in free in receive path
    udp_diag: Fix socket skipping within chain
    can: kvaser_usb: Fix state handling upon BUS_ERROR events
    can: kvaser_usb: Retry the first bulk transfer on -ETIMEDOUT
    can: kvaser_usb: Send correct context to URB completion
    can: kvaser_usb: Do not sleep in atomic context
    ipv4: try to cache dst_entries which would cause a redirect
    samples: bpf: relax test_maps check
    bpf: rcu lock must not be held when calling copy_to_user()
    net: sctp: fix slab corruption from use after free on INIT collisions
    net: mv643xx_eth: Fix highmem support in non-TSO egress path
    sh_eth: Fix serialisation of interrupt disable with interrupt & NAPI handlers
    ...

    Linus Torvalds
     
  • Signed-off-by: Christoph Hellwig
    Signed-off-by: David S. Miller

    Christoph Hellwig
     
  • In the case when alloc_netdev fails we return NULL to a caller. But there is no
    check for NULL in the probe drivers. This patch changes NULL to an error
    pointer. The function description is amended to reflect what we may get
    returned.

    Signed-off-by: Andy Shevchenko
    Signed-off-by: David S. Miller

    Andy Shevchenko
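
The NULL versus error-pointer distinction can be sketched in userspace. The `ERR_PTR`, `PTR_ERR` and `IS_ERR` helpers below are re-implemented here to mirror the kernel idiom; `fake_netdev`, `fake_dvr_probe` and the choice of -ENOMEM are assumptions made for this example only.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

#define MAX_ERRNO 4095

/* Encode a negative errno value in a pointer. */
static inline void *ERR_PTR(long error)
{
    return (void *)(intptr_t)error;
}

/* Recover the errno value from an error pointer. */
static inline long PTR_ERR(const void *ptr)
{
    return (long)(intptr_t)ptr;
}

/* True if the pointer lies in the top MAX_ERRNO addresses, i.e. encodes an error. */
static inline bool IS_ERR(const void *ptr)
{
    return (uintptr_t)ptr >= (uintptr_t)-MAX_ERRNO;
}

struct fake_netdev { int id; };  /* hypothetical device object */

/* Hypothetical probe helper: returns an error pointer, never NULL, on failure. */
struct fake_netdev *fake_dvr_probe(bool alloc_ok)
{
    struct fake_netdev *ndev = alloc_ok ? malloc(sizeof(*ndev)) : NULL;

    if (!ndev)
        return ERR_PTR(-ENOMEM);
    ndev->id = 1;
    return ndev;
}
```

A caller that only tests `IS_ERR()` would miss a NULL return and dereference it, which is why the helper and its callers have to agree on a single failure convention.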
     
  • Pull powerpc fixes from Michael Ellerman:
    "Two powerpc fixes"

    * tag 'powerpc-3.19-5' of git://git.kernel.org/pub/scm/linux/kernel/git/mpe/linux:
    powerpc/powernv: Restore LPCR with LPCR_PECE1 cleared
    powerpc/xmon: Fix another endiannes issue in RTAS call from xmon

    Linus Torvalds
     
  • Pull one more module fix from Rusty Russell:
    "SCSI was using module_refcount() to figure out when the module was
    unloading: this broke with new atomic refcounting. The code is still
    suspicious, but this solves the WARN_ON()"

    * tag 'fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux:
    scsi: always increment reference count

    Linus Torvalds
     

27 Jan, 2015

2 commits