19 May, 2016

5 commits

  • [ Upstream commit 626abd59e51d4d8c6367e03aae252a8aa759ac78 ]

    Currently, when creating or updating a route, no check is performed
    on the hoplimit value in either the ipv4 or the ipv6 code.

    The caller can, e.g., set hoplimit to 256, and when such a route is
    used, packets will be sent with hoplimit/ttl equal to 0.

    This commit adds checks for the RTAX_HOPLIMIT value, in both ipv4 and
    ipv6 route code, substituting any value greater than 255 with 255.

    This is consistent with what is currently done for ADVMSS and MTU
    in the ipv4 code.
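
    The effect can be illustrated with a minimal user-space sketch. The
    helper names below are hypothetical and are not the kernel's code; the
    sketch only assumes that the hop limit ends up in an 8-bit header field.

```c
#include <stdint.h>

/* Hypothetical helper: clamp a user-supplied hop limit to the largest
 * value an 8-bit TTL/hop-limit field can hold. */
static uint32_t clamp_hoplimit(uint32_t val)
{
    return val > 255 ? 255 : val;
}

/* Without the clamp, only the low 8 bits reach the packet header. */
static uint8_t truncate_to_header_field(uint32_t val)
{
    return (uint8_t)val;
}
```

    With these helpers, truncate_to_header_field(256) yields 0, while
    truncate_to_header_field(clamp_hoplimit(256)) yields 255.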

    Signed-off-by: Paolo Abeni
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Paolo Abeni
     
  • [ Upstream commit 10a81980fc47e64ffac26a073139813d3f697b64 ]

    In the very unlikely case __tcp_retransmit_skb() can not use the cloning
    done in tcp_transmit_skb(), we need to refresh skb_mstamp before doing
    the copy and transmit, otherwise TCP TS val will be an exact copy of
    original transmit.

    Fixes: 7faee5c0d514 ("tcp: remove TCP_SKB_CB(skb)->when")
    Signed-off-by: Eric Dumazet
    Cc: Yuchung Cheng
    Acked-by: Yuchung Cheng
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Eric Dumazet
     
  • [ Upstream commit b7f8fe251e4609e2a437bd2c2dea01e61db6849c ]

    iptunnel_pull_header expects that IP header was already pulled; with this
    expectation, it pulls the tunnel header. This is not true in gre_err.
    Furthermore, ipv4_update_pmtu and ipv4_redirect expect that skb->data points
    to the IP header.

    We cannot pull the tunnel header in this path. It's just a matter of not
    calling iptunnel_pull_header - we don't need any of its effects.

    Fixes: bda7bb463436 ("gre: Allow multiple protocol listener for gre protocol.")
    Signed-off-by: Jiri Benc
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Jiri Benc
     
  • [ Upstream commit 391a20333b8393ef2e13014e6e59d192c5594471 ]

    After commit fbd40ea0180a ("ipv4: Don't do expensive useless work
    during inetdev destroy.") when deleting an interface,
    fib_del_ifaddr() can be executed without any primary address
    present on the dead interface.

    The above is safe, but triggers some "bug: prim == NULL" warnings.

    This commit avoids the warning if the in_dev is dead.

    Signed-off-by: Paolo Abeni
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Paolo Abeni
     
  • [ Upstream commit d6d5e999e5df67f8ec20b6be45e2229455ee3699 ]

    For local routes that require a particular output interface we do not want
    to cache the result. Caching the result causes incorrect behaviour when
    there are multiple source addresses on the interface. The end result
    being that if the intended recipient is waiting on that interface for the
    packet he won't receive it because it will be delivered on the loopback
    interface and the IP_PKTINFO ipi_ifindex will be set to the loopback
    interface as well.

    This can be tested by running a program such as "dhcp_release" which
    attempts to inject a packet on a particular interface so that it is
    received by another program on the same board. The receiving process
    should see an IP_PKTINFO ipi_ifindex value of the source interface
    (e.g., eth1) instead of the loopback interface (e.g., lo). The packet
    will still appear on the loopback interface in tcpdump but the important
    aspect is that the CMSG info is correct.

    Sample dhcp_release command line:

    dhcp_release eth1 192.168.204.222 02:11:33:22:44:66

    Signed-off-by: Allain Legacy
    Signed-off-by: Chris Friesen
    Reviewed-by: Julian Anastasov
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Chris Friesen
     

20 Apr, 2016

9 commits

  • [ Upstream commit 4cfc86f3dae6ca38ed49cdd78f458a03d4d87992 ]

    Field fl4.flowi4_flags is not initialized in fib_compute_spec_dst()
    before calling fib_lookup(), which means fib_table_lookup() is
    using non-deterministic data at this line:

    if (!(flp->flowi4_flags & FLOWI_FLAG_SKIP_NH_OIF)) {

    Fix by initializing the entire fl4 structure, which will prevent
    similar issues as fields are added in the future by ensuring that
    all fields are initialized to zero unless explicitly initialized
    to another value.
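
    One common C idiom for this kind of fix is a designated initializer,
    which zeroes every member that is not named. The sketch below uses a
    hypothetical stand-in structure and flag value, not the real
    struct flowi4.

```c
/* Hypothetical stand-in for a flow key; the real structure has more fields. */
struct demo_flow {
    int oif;
    unsigned int daddr;
    unsigned int saddr;
    unsigned char flags;   /* the kind of field that was left uninitialized */
};

#define DEMO_FLAG_SKIP_NH_OIF 0x01  /* illustrative value only */

static struct demo_flow make_flow(unsigned int daddr, unsigned int saddr)
{
    /* Designated initializer: members not listed here are set to zero. */
    struct demo_flow fl = {
        .daddr = daddr,
        .saddr = saddr,
    };
    return fl;
}
```

    A later test such as (fl.flags & DEMO_FLAG_SKIP_NH_OIF) then sees a
    well-defined zero, even for fields added after this code was written.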

    Fixes: 58189ca7b2741 ("net: Fix vti use case with oif in dst lookups")
    Suggested-by: David Ahern
    Signed-off-by: Lance Richardson
    Acked-by: David Ahern
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Lance Richardson
     
  • [ Upstream commit ad0ea1989cc4d5905941d0a9e62c63ad6d859cef ]

    Currently, ingress ipv4 broadcast datagrams are dropped since,
    in udp_v4_early_demux(), ip_check_mc_rcu() is invoked even on
    bcast packets.

    This patch addresses the issue, invoking ip_check_mc_rcu()
    only for mcast packets.
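
    The distinction between the two destination types can be sketched as
    follows. The helper name is hypothetical; it takes an IPv4 address in
    host byte order and tests for the 224.0.0.0/4 multicast range, so a
    broadcast destination does not match.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: true if addr (host byte order) lies in 224.0.0.0/4. */
static bool is_ipv4_multicast(uint32_t addr)
{
    return (addr & 0xf0000000u) == 0xe0000000u;
}
```

    A multicast membership check would then be applied only when this
    predicate is true, leaving broadcast datagrams untouched.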

    Fixes: 6e5403093261 ("ipv4/udp: Verify multicast group is ours in upd_v4_early_demux()")
    Signed-off-by: Paolo Abeni
    Acked-by: Hannes Frederic Sowa
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Paolo Abeni
     
  • [ Upstream commit e316ea62e3203d524ff0239a40c56d3a39ad1b5c ]

    Now SYN_RECV request sockets are installed in ehash table, an ICMP
    handler can find a request socket while another cpu handles an incoming
    packet transforming this SYN_RECV request socket into an ESTABLISHED
    socket.

    We need to remove the now obsolete WARN_ON(req->sk), since req->sk
    is set when a new child is created and added into listener accept queue.

    If this race happens, the ICMP will do nothing special.

    Fixes: 079096f103fa ("tcp/dccp: install syn_recv requests into ehash table")
    Signed-off-by: Eric Dumazet
    Reported-by: Ben Lazarus
    Reported-by: Neal Cardwell
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Eric Dumazet
     
  • [ Upstream commit fbd40ea0180a2d328c5adc61414dc8bab9335ce2 ]

    When an inetdev is destroyed, every address assigned to the interface
    is removed. And in this scenario we do two pointless things which can
    be very expensive if the number of assigned interfaces is large:

    1) Address promotion. We are deleting all addresses, so there is no
    point in doing this.

    2) A full nf conntrack table purge for every address. We only need to
    do this once, as is already caught by the existing
    masq_dev_notifier so masq_inet_event() can skip this.

    Reported-by: Solar Designer
    Signed-off-by: David S. Miller
    Tested-by: Cyrill Gorcunov
    Signed-off-by: Greg Kroah-Hartman

    David S. Miller
     
  • [ Upstream commit a9d99ce28ed359d68cf6f3c1a69038aefedf6d6a ]

    If the final packet (ACK) of 3WHS is lost, it appears we do not properly
    account the following incoming segment into tcpi_segs_in.

    While we are at it, start segs_in with one, to count the SYN packet.

    We do not yet count number of SYN we received for a request sock, we
    might add this someday.

    packetdrill script showing proper behavior after fix :

    // Tests tcpi_segs_in when 3rd packet (ACK) of 3WHS is lost
    0.000 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
    +0 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
    +0 bind(3, ..., ...) = 0
    +0 listen(3, 1) = 0

    +0 < S 0:0(0) win 32792
    +0 > S. 0:0(0) ack 1
    +.020 < P. 1:1001(1000) ack 1 win 32792

    +0 accept(3, ..., ...) = 4

    +.000 %{ assert tcpi_segs_in == 2, 'tcpi_segs_in=%d' % tcpi_segs_in }%

    Fixes: 2efd055c53c06 ("tcp: add tcpi_segs_in and tcpi_segs_out to tcp_info")
    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Eric Dumazet
     
  • [ Upstream commit 1837b2e2bcd23137766555a63867e649c0b637f0 ]

    The current reserved_tailroom calculation fails to take hlen and tlen into
    account.

    skb:
    [__hlen__|__data____________|__tlen___|__extra__]
    ^                                               ^
    head                                            skb_end_offset

    In this representation, hlen + data + tlen is the size passed to alloc_skb.
    "extra" is the extra space made available in __alloc_skb because of
    rounding up by kmalloc. We can reorder the representation like so:

    [__hlen__|__data____________|__extra__|__tlen___]
    ^                                               ^
    head                                            skb_end_offset

    The maximum space available for ip headers and payload without
    fragmentation is min(mtu, data + extra). Therefore,
    reserved_tailroom
    = data + extra + tlen - min(mtu, data + extra)
    = skb_end_offset - hlen - min(mtu, skb_end_offset - hlen - tlen)
    = skb_tailroom - min(mtu, skb_tailroom - tlen) ; after skb_reserve(hlen)

    Compare the second line to the current expression:
    reserved_tailroom = skb_end_offset - min(mtu, skb_end_offset)
    and we can see that hlen and tlen are not taken into account.

    The min() in the third line can be expanded into:
    if mtu < skb_tailroom - tlen:
        reserved_tailroom = skb_tailroom - mtu
    else:
        reserved_tailroom = tlen

    Depending on hlen, tlen, mtu and the number of multicast address records,
    the current code may output skbs that have less tailroom than
    dev->needed_tailroom or it may output more skbs than needed because not all
    space available is used.
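
    The two expressions compared above can be written as a small
    user-space sketch. The function names are hypothetical, all sizes are
    in bytes, and the sketch assumes tailroom >= tlen.

```c
/* Corrected form: tailroom is the skb tailroom after skb_reserve(hlen). */
static unsigned int reserved_tailroom(unsigned int tailroom,
                                      unsigned int tlen,
                                      unsigned int mtu)
{
    unsigned int usable = tailroom - tlen;   /* assumes tailroom >= tlen */
    unsigned int limit = mtu < usable ? mtu : usable;

    return tailroom - limit;
}

/* Previous form: ignores both hlen and tlen. */
static unsigned int old_reserved_tailroom(unsigned int end_offset,
                                          unsigned int mtu)
{
    unsigned int limit = mtu < end_offset ? mtu : end_offset;

    return end_offset - limit;
}
```

    For example, with tailroom = 1000, tlen = 16 and mtu = 1500 the
    corrected form reserves exactly tlen (16), while the previous form
    with end_offset = 1048 and the same mtu reserves 0.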

    Fixes: 4c672e4b ("ipv6: mld: fix add_grhead skb_over_panic for devs with large MTUs")
    Signed-off-by: Benjamin Poirier
    Acked-by: Hannes Frederic Sowa
    Acked-by: Daniel Borkmann
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Benjamin Poirier
     
  • [ Upstream commit a8c4a2522a0808c5c2143612909717d1115c40cf ]

    Otherwise we break the contract with GSO to only pass CHECKSUM_PARTIAL
    skbs down. This can easily happen with UDP+IPv4 sockets with the first
    MSG_MORE write smaller than the MTU, second write is a sendfile.

    Returning -EOPNOTSUPP lets the callers fall back into normal sendmsg path,
    where we calculate the checksum manually during copying.

    Commit d749c9cbffd6 ("ipv4: no CHECKSUM_PARTIAL on MSG_MORE corked
    sockets") started to expose this bug.

    Fixes: d749c9cbffd6 ("ipv4: no CHECKSUM_PARTIAL on MSG_MORE corked sockets")
    Reported-by: Jiri Benc
    Cc: Jiri Benc
    Reported-by: Wakko Warner
    Cc: Wakko Warner
    Signed-off-by: Hannes Frederic Sowa
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Hannes Frederic Sowa
     
  • [ Upstream commit 5146d1f151122e868e594c7b45115d64825aee5f ]

    IPCB may contain data from previous layers (in the observed case the
    qdisc layer). In the observed scenario, the data was misinterpreted as
    ip header options, which later caused the ihl to be set to an invalid
    value. This patch clears the opt field before dst_link_failure can be
    called for various types of tunnels. This change only applies to
    encapsulated ipv4 packets.

    The code introduced in 11c21a30 which clears all of IPCB has been removed
    to be consistent with these changes, and instead the opt field is cleared
    unconditionally in ip_tunnel_xmit. The change in ip_tunnel_xmit applies to
    SIT, GRE, and IPIP tunnels.

    The relevant vti, l2tp, and pptp functions already contain similar code for
    clearing the IPCB.

    Signed-off-by: Bernie Harris
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Bernie Harris
     
  • [ Upstream commit 9bdfb3b79e61c60e1a3e2dc05ad164528afa6b8a ]

    Currently it's converted into msecs, so only HZ=1000 is left intact.

    Signed-off-by: Konstantin Khlebnikov
    Fixes: 740b0f1841f6 ("tcp: switch rtt estimations to usec resolution")
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Konstantin Khlebnikov
     

04 Mar, 2016

10 commits

  • [ Upstream commit a97eb33ff225f34a8124774b3373fd244f0e83ce ]

    An error response from a RTM_GETNETCONF request can return the positive
    error value EINVAL in the struct nlmsgerr that can mislead userspace.

    Signed-off-by: Anton Protopopov
    Acked-by: Cong Wang
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Anton Protopopov
     
  • [ Upstream commit 7716682cc58e305e22207d5bb315f26af6b1e243 ]

    Ilya reported the following lockdep splat:

    kernel: =========================
    kernel: [ BUG: held lock freed! ]
    kernel: 4.5.0-rc1-ceph-00026-g5e0a311 #1 Not tainted
    kernel: -------------------------
    kernel: swapper/5/0 is freeing memory
    ffff880035c9d200-ffff880035c9dbff, with a lock still held there!
    kernel: (&(&queue->rskq_lock)->rlock){+.-...}, at:
    [] inet_csk_reqsk_queue_add+0x28/0xa0
    kernel: 4 locks held by swapper/5/0:
    kernel: #0: (rcu_read_lock){......}, at: []
    netif_receive_skb_internal+0x4b/0x1f0
    kernel: #1: (rcu_read_lock){......}, at: []
    ip_local_deliver_finish+0x3f/0x380
    kernel: #2: (slock-AF_INET){+.-...}, at: []
    sk_clone_lock+0x19b/0x440
    kernel: #3: (&(&queue->rskq_lock)->rlock){+.-...}, at:
    [] inet_csk_reqsk_queue_add+0x28/0xa0

    To properly fix this issue, inet_csk_reqsk_queue_add() needs
    to tell its callers whether the child has been queued
    into the accept queue.

    We also need to make sure listener is still there before
    calling sk->sk_data_ready(), by holding a reference on it,
    since the reference carried by the child can disappear as
    soon as the child is put on accept queue.

    Reported-by: Ilya Dryomov
    Fixes: ebb516af60e1 ("tcp/dccp: fix race at listener dismantle phase")
    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Eric Dumazet
     
  • [ Upstream commit deed49df7390d5239024199e249190328f1651e7 ]

    Since the gc of ipv4 routes was removed, a cached route has no chance
    of being removed, and even after it has timed out it can still be
    used, because no code checks its expiry.

    Fix this issue by checking for and removing expired cached routes when
    we look up a route.

    Signed-off-by: Xin Long
    Acked-by: Hannes Frederic Sowa
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Xin Long
     
  • [ Upstream commit 729235554d805c63e5e274fcc6a98e71015dd847 ]

    If tcp_v4_inbound_md5_hash() returns an error, we must release
    the refcount on the request socket, not on the listener.

    The bug was added for IPv4 only.

    Fixes: 079096f103fac ("tcp/dccp: install syn_recv requests into ehash table")
    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Eric Dumazet
     
  • [ Upstream commit 919483096bfe75dda338e98d56da91a263746a0a ]

    Dmitry reported memory leaks of IP options allocated in
    ip_cmsg_send() when/if this function returns an error.

    Callers are responsible for the freeing.

    Many thanks to Dmitry for the report and diagnostic.

    Reported-by: Dmitry Vyukov
    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Eric Dumazet
     
  • [ Upstream commit 5f74f82ea34c0da80ea0b49192bb5ea06e063593 ]

    Devices may have limits on the number of fragments in an skb they support.
    The current codebase uses a constant as the maximum number of fragments
    one skb can hold and use.
    When scatter/gather is enabled and traffic consists of many small
    messages, the codebase uses the maximum number of fragments and may
    thereby exceed the limit of certain devices.
    The patch introduces a global variable as the maximum number of fragments.

    Signed-off-by: Hans Westgaard Ry
    Reviewed-by: Håkon Bugge
    Acked-by: Eric Dumazet
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Hans Westgaard Ry
     
  • [ Upstream commit 9cf7490360bf2c46a16b7525f899e4970c5fc144 ]

    Petr Novopashenniy reported that ICMP redirects on SYN_RECV sockets
    were leading to RST.

    This is of course incorrect.

    A specific list of ICMP messages should be able to drop a SYN_RECV.

    For instance, a REDIRECT on SYN_RECV shall be ignored, as we do
    not hold a dst per SYN_RECV pseudo request.

    Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=111751
    Fixes: 079096f103fa ("tcp/dccp: install syn_recv requests into ehash table")
    Reported-by: Petr Novopashenniy
    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Eric Dumazet
     
  • [ Upstream commit ff5d749772018602c47509bdc0093ff72acd82ec ]

    With some combinations of user-provided flags in a netlink command,
    it is possible to call tcp_get_info() with a buffer that is not 8-byte
    aligned.

    It does matter on some arches, so we need to use put_unaligned() to
    store the u64 fields.
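
    A portable user-space analogue of an unaligned store is a memcpy of
    the value, which makes no assumption about the destination alignment.
    The helper names below are hypothetical and only illustrate the idea;
    they are not the kernel's put_unaligned() implementation.

```c
#include <stdint.h>
#include <string.h>

/* Store a 64-bit value at an address that may not be 8-byte aligned. */
static void store_u64_unaligned(void *dst, uint64_t val)
{
    memcpy(dst, &val, sizeof(val));
}

/* Read it back the same way, again without assuming alignment. */
static uint64_t load_u64_unaligned(const void *src)
{
    uint64_t val;

    memcpy(&val, src, sizeof(val));
    return val;
}
```

    Writing through a plain uint64_t pointer at such an address could
    fault on architectures that require natural alignment.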

    Current iproute2 package does not trigger this particular issue.

    Fixes: 0df48c26d841 ("tcp: add tcpi_bytes_acked to tcp_info")
    Fixes: 977cb0ecf82e ("tcp: add pacing_rate information into tcp_info")
    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Eric Dumazet
     
  • [ Upstream commit 8282f27449bf15548cb82c77b6e04ee0ab827bdc ]

    Later parts of the stack (including fragmentation) expect that there is
    never a socket attached to frag in a frag_list, however this invariant
    was not enforced on all defrag paths. This could lead to the
    BUG_ON(skb->sk) during ip_do_fragment(), as per the call stack at the
    end of this commit message.

    While the call could be added to openvswitch to fix this particular
    error, the head and tail of the frags list are already orphaned
    indirectly inside ip_defrag(), so it seems like the remaining fragments
    should all be orphaned in all circumstances.

    kernel BUG at net/ipv4/ip_output.c:586!
    [...]
    Call Trace:

    [] ? do_output.isra.29+0x1b0/0x1b0 [openvswitch]
    [] ovs_fragment+0xcc/0x214 [openvswitch]
    [] ? dst_discard_out+0x20/0x20
    [] ? dst_ifdown+0x80/0x80
    [] ? find_bucket.isra.2+0x62/0x70 [openvswitch]
    [] ? mod_timer_pending+0x65/0x210
    [] ? __lock_acquire+0x3db/0x1b90
    [] ? nf_conntrack_in+0x252/0x500 [nf_conntrack]
    [] ? __lock_is_held+0x54/0x70
    [] do_output.isra.29+0xe3/0x1b0 [openvswitch]
    [] do_execute_actions+0xe11/0x11f0 [openvswitch]
    [] ? __lock_is_held+0x54/0x70
    [] ovs_execute_actions+0x32/0xd0 [openvswitch]
    [] ovs_dp_process_packet+0x85/0x140 [openvswitch]
    [] ? __lock_is_held+0x54/0x70
    [] ovs_execute_actions+0xb2/0xd0 [openvswitch]
    [] ovs_dp_process_packet+0x85/0x140 [openvswitch]
    [] ? ovs_ct_get_labels+0x49/0x80 [openvswitch]
    [] ovs_vport_receive+0x5d/0xa0 [openvswitch]
    [] ? __lock_acquire+0x3db/0x1b90
    [] ? __lock_acquire+0x3db/0x1b90
    [] ? __lock_acquire+0x3db/0x1b90
    [] ? internal_dev_xmit+0x5/0x140 [openvswitch]
    [] internal_dev_xmit+0x6c/0x140 [openvswitch]
    [] ? internal_dev_xmit+0x5/0x140 [openvswitch]
    [] dev_hard_start_xmit+0x2b9/0x5e0
    [] ? netif_skb_features+0xd1/0x1f0
    [] __dev_queue_xmit+0x800/0x930
    [] ? __dev_queue_xmit+0x50/0x930
    [] ? mark_held_locks+0x71/0x90
    [] ? neigh_resolve_output+0x106/0x220
    [] dev_queue_xmit+0x10/0x20
    [] neigh_resolve_output+0x178/0x220
    [] ? ip_finish_output2+0x1ff/0x590
    [] ip_finish_output2+0x1ff/0x590
    [] ? ip_finish_output2+0x7e/0x590
    [] ip_do_fragment+0x831/0x8a0
    [] ? ip_copy_metadata+0x1b0/0x1b0
    [] ip_fragment.constprop.49+0x43/0x80
    [] ip_finish_output+0x17c/0x340
    [] ? nf_hook_slow+0xe4/0x190
    [] ip_output+0x70/0x110
    [] ? ip_fragment.constprop.49+0x80/0x80
    [] ip_local_out+0x39/0x70
    [] ip_send_skb+0x19/0x40
    [] ip_push_pending_frames+0x33/0x40
    [] icmp_push_reply+0xea/0x120
    [] icmp_reply.constprop.23+0x1ed/0x230
    [] icmp_echo.part.21+0x4e/0x50
    [] ? __lock_is_held+0x54/0x70
    [] ? rcu_read_lock_held+0x5e/0x70
    [] icmp_echo+0x36/0x70
    [] icmp_rcv+0x271/0x450
    [] ip_local_deliver_finish+0x127/0x3a0
    [] ? ip_local_deliver_finish+0x41/0x3a0
    [] ip_local_deliver+0x60/0xd0
    [] ? ip_rcv_finish+0x560/0x560
    [] ip_rcv_finish+0xdd/0x560
    [] ip_rcv+0x283/0x3e0
    [] ? match_held_lock+0x192/0x200
    [] ? inet_del_offload+0x40/0x40
    [] __netif_receive_skb_core+0x392/0xae0
    [] ? process_backlog+0x8e/0x230
    [] ? mark_held_locks+0x71/0x90
    [] __netif_receive_skb+0x18/0x60
    [] process_backlog+0x78/0x230
    [] ? process_backlog+0xdd/0x230
    [] net_rx_action+0x155/0x400
    [] __do_softirq+0xcc/0x420
    [] ? ip_finish_output2+0x217/0x590
    [] do_softirq_own_stack+0x1c/0x30

    [] do_softirq+0x4e/0x60
    [] __local_bh_enable_ip+0xa8/0xb0
    [] ip_finish_output2+0x240/0x590
    [] ? ip_do_fragment+0x831/0x8a0
    [] ip_do_fragment+0x831/0x8a0
    [] ? ip_copy_metadata+0x1b0/0x1b0
    [] ip_fragment.constprop.49+0x43/0x80
    [] ip_finish_output+0x17c/0x340
    [] ? nf_hook_slow+0xe4/0x190
    [] ip_output+0x70/0x110
    [] ? ip_fragment.constprop.49+0x80/0x80
    [] ip_local_out+0x39/0x70
    [] ip_send_skb+0x19/0x40
    [] ip_push_pending_frames+0x33/0x40
    [] raw_sendmsg+0x7d3/0xc30
    [] ? __lock_acquire+0x3db/0x1b90
    [] ? inet_sendmsg+0xc7/0x1d0
    [] ? __lock_is_held+0x54/0x70
    [] inet_sendmsg+0x10a/0x1d0
    [] ? inet_sendmsg+0x5/0x1d0
    [] sock_sendmsg+0x38/0x50
    [] ___sys_sendmsg+0x25f/0x270
    [] ? handle_mm_fault+0x8dd/0x1320
    [] ? _raw_spin_unlock+0x27/0x40
    [] ? __do_page_fault+0x1e2/0x460
    [] ? __fget_light+0x66/0x90
    [] __sys_sendmsg+0x42/0x80
    [] SyS_sendmsg+0x12/0x20
    [] entry_SYSCALL_64_fastpath+0x12/0x6f
    Code: 00 00 44 89 e0 e9 7c fb ff ff 4c 89 ff e8 e7 e7 ff ff 41 8b 9d 80 00 00 00 2b 5d d4 89 d8 c1 f8 03 0f b7 c0 e9 33 ff ff f
    66 66 66 2e 0f 1f 84 00 00 00 00 00 66 66 66 66 90 55 48
    RIP [] ip_do_fragment+0x892/0x8a0
    RSP

    Fixes: 7f8a436eaa2c ("openvswitch: Add conntrack action")
    Signed-off-by: Joe Stringer
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Joe Stringer
     
  • [ Upstream commit e62a123b8ef7c5dc4db2c16383d506860ad21b47 ]

    Neal reported crashes with this stack trace :

    RIP: 0010:[] tcp_v4_send_ack+0x41/0x20f
    ...
    CR2: 0000000000000018 CR3: 000000044005c000 CR4: 00000000001427e0
    ...
    [] tcp_v4_reqsk_send_ack+0xa5/0xb4
    [] tcp_check_req+0x2ea/0x3e0
    [] tcp_rcv_state_process+0x850/0x2500
    [] tcp_v4_do_rcv+0x141/0x330
    [] sk_backlog_rcv+0x21/0x30
    [] tcp_recvmsg+0x75d/0xf90
    [] inet_recvmsg+0x80/0xa0
    [] sock_aio_read+0xee/0x110
    [] do_sync_read+0x6f/0xa0
    [] SyS_read+0x1e1/0x290
    [] system_call_fastpath+0x16/0x1b

    The problem here is the skb we provide to tcp_v4_send_ack() had to
    be parked in the backlog of a new TCP fastopen child because this child
    was owned by the user at the time an out of window packet arrived.

    Before queuing a packet, TCP has to set skb->dev to NULL as the device
    could disappear before packet is removed from the queue.

    Fix this issue by using the net pointer provided by the socket (being a
    timewait or a request socket).

    IPv6 is immune to the bug : tcp_v6_send_response() already gets the net
    pointer from the socket if provided.

    Fixes: 168a8f58059a ("tcp: TCP Fast Open Server - main code path")
    Reported-by: Neal Cardwell
    Signed-off-by: Eric Dumazet
    Cc: Jerry Chu
    Cc: Yuchung Cheng
    Acked-by: Neal Cardwell
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Eric Dumazet
     

01 Feb, 2016

3 commits

  • [ Upstream commit 9207f9d45b0ad071baa128e846d7e7ed85016df3 ]

    skb_gso_segment() uses the skb control block during segmentation.
    This patch adds 32 bytes of room for the previous control block, which
    will be copied into all resulting segments.

    This patch fixes a kernel crash when fragmenting forwarded packets.
    Fragmentation requires a valid IP CB in the skb for clearing ip options.
    The patch also removes the custom save/restore in ovs code, which is now
    redundant.

    Signed-off-by: Konstantin Khlebnikov
    Link: http://lkml.kernel.org/r/CALYGNiP-0MZ-FExV2HutTvE9U-QQtkKSoE--KN=JQE5STYsjAA@mail.gmail.com
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Konstantin Khlebnikov
     
  • [ Upstream commit 40ba330227ad00b8c0cdf2f425736ff9549cc423 ]

    Commit acf8dd0a9d0b ("udp: only allow UFO for packets from SOCK_DGRAM
    sockets") disallows UFO for packets sent from raw sockets. We need to do
    the same for SOCK_DGRAM sockets with the SO_NO_CHECK option, albeit for
    a slightly different reason: while such a socket would override the
    CHECKSUM_PARTIAL set by ip_ufo_append_data(), gso_size is still set and
    a bad offloading flags warning is triggered in __skb_gso_segment().

    In the IPv6 case, SO_NO_CHECK option is ignored but we need to disallow
    UFO for packets sent by sockets with UDP_NO_CHECK6_TX option.

    Signed-off-by: Michal Kubecek
    Tested-by: Shannon Nelson
    Acked-by: Hannes Frederic Sowa
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Michal Kubeček
     
  • [ Upstream commit 83d15e70c4d8909d722c0d64747d8fb42e38a48f ]

    For tcp_yeah, use an ssthresh floor of 2, the same floor used by Reno
    and CUBIC, per RFC 5681 (equation 4).

    tcp_yeah_ssthresh() was sometimes returning a 0 or negative ssthresh
    value if the intended reduction is as big or bigger than the current
    cwnd. Congestion control modules should never return a zero or
    negative ssthresh. A zero ssthresh generally results in a zero cwnd,
    causing the connection to stall. A negative ssthresh value will be
    interpreted as a u32 and will set a target cwnd for PRR near 4
    billion.
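
    The two failure modes and the floor can be sketched in user-space C.
    The function names are hypothetical and the arithmetic is simplified;
    this is not the tcp_yeah code itself.

```c
#include <stdint.h>

/* Apply a reduction to cwnd but never return less than 2. */
static uint32_t ssthresh_with_floor(uint32_t cwnd, uint32_t reduction)
{
    /* Use a signed 64-bit intermediate so a large reduction cannot wrap. */
    int64_t target = (int64_t)cwnd - (int64_t)reduction;

    return target < 2 ? 2u : (uint32_t)target;
}

/* Without a floor, the subtraction wraps modulo 2^32. */
static uint32_t ssthresh_unclamped(uint32_t cwnd, uint32_t reduction)
{
    return cwnd - reduction;
}
```

    With cwnd = 10 and a reduction of 12, the unclamped form returns
    4294967294, which is the "near 4 billion" target mentioned above,
    while the floored form returns 2.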

    Oleksandr Natalenko reported that a system using tcp_yeah with ECN
    could see a warning about a prior_cwnd of 0 in
    tcp_cwnd_reduction(). Testing verified that this was due to
    tcp_yeah_ssthresh() misbehaving in this way.

    Reported-by: Oleksandr Natalenko
    Signed-off-by: Neal Cardwell
    Signed-off-by: Yuchung Cheng
    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Neal Cardwell
     

07 Jan, 2016

1 commit

  • Patch 3759824da87b ("tcp: PRR uses CRB mode by default and SS mode
    conditionally") introduced a bug that cwnd may become 0 when both
    inflight and sndcnt are 0 (cwnd = inflight + sndcnt). This may lead
    to a div-by-zero if the connection starts another cwnd reduction
    phase by setting tp->prior_cwnd to the current cwnd (0) in
    tcp_init_cwnd_reduction().

    To prevent this we skip PRR operation when nothing is acked or
    sacked. Then cwnd must be positive in all cases as long as ssthresh
    is positive:

    1) The proportional reduction mode
    inflight > ssthresh > 0

    2) The reduction bound mode
    a) inflight == ssthresh > 0

    b) inflight < ssthresh
    sndcnt > 0 since newly_acked_sacked > 0 and inflight < ssthresh

    Therefore in all cases inflight and sndcnt can not both be 0.
    We check invalid tp->prior_cwnd to avoid potential div0 bugs.

    In reality this bug is triggered only with a sequence of less common
    events. For example, the connection is terminating an ECN-triggered
    cwnd reduction with an inflight 0, then it receives reordered/old
    ACKs or DSACKs from prior transmission (which acks nothing). Or the
    connection is in fast recovery stage that marks everything lost,
    but fails to retransmit due to local issues, then receives data
    packets from other end which acks nothing.

    Fixes: 3759824da87b ("tcp: PRR uses CRB mode by default and SS mode conditionally")
    Reported-by: Oleksandr Natalenko
    Signed-off-by: Yuchung Cheng
    Signed-off-by: Neal Cardwell
    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Yuchung Cheng
     

05 Jan, 2016

1 commit

  • Commands run in a vrf context are not failing as expected on a route lookup:
    root@kenny:~# ip ro ls table vrf-red
    unreachable default

    root@kenny:~# ping -I vrf-red -c1 -w1 10.100.1.254
    ping: Warning: source address might be selected on device other than vrf-red.
    PING 10.100.1.254 (10.100.1.254) from 0.0.0.0 vrf-red: 56(84) bytes of data.

    --- 10.100.1.254 ping statistics ---
    2 packets transmitted, 0 received, 100% packet loss, time 999ms

    Since the vrf table does not have a route for 10.100.1.254 the ping
    should have failed. The saddr lookup causes a full VRF table lookup.
    Propagating a lookup failure to the user allows the command to fail as
    expected:

    root@kenny:~# ping -I vrf-red -c1 -w1 10.100.1.254
    connect: No route to host

    Signed-off-by: David Ahern
    Signed-off-by: David S. Miller

    David Ahern
     

23 Dec, 2015

1 commit


19 Dec, 2015

1 commit


18 Dec, 2015

1 commit

  • Yuchung tracked a regression caused by commit 57be5bdad759 ("ip: convert
    tcp_sendmsg() to iov_iter primitives") for TCP Fast Open.

    Some Fast Open users do not actually add any data in the SYN packet.

    Fixes: 57be5bdad759 ("ip: convert tcp_sendmsg() to iov_iter primitives")
    Reported-by: Yuchung Cheng
    Signed-off-by: Eric Dumazet
    Cc: Al Viro
    Acked-by: Yuchung Cheng
    Signed-off-by: David S. Miller

    Eric Dumazet
     

17 Dec, 2015

1 commit

  • fou->udp_offloads is managed by RCU. As it is actually included inside
    the fou sockets, we cannot let the memory go out of scope before a grace
    period. We either can synchronize_rcu or switch over to kfree_rcu to
    manage the sockets. kfree_rcu seems appropriate as it is used by vxlan
    and geneve.

    Fixes: 23461551c00628c ("fou: Support for foo-over-udp RX path")
    Cc: Tom Herbert
    Signed-off-by: Hannes Frederic Sowa
    Signed-off-by: David S. Miller

    Hannes Frederic Sowa
     

15 Dec, 2015

3 commits

  • David Wilder reported crashes caused by dst reuse.

    I am seeing a crash on a distro V4.2.3 kernel caused by a double
    release of a dst_entry. In ipv4_dst_destroy() the call to
    list_empty() finds a poisoned next pointer, indicating the dst_entry
    has already been removed from the list and freed. The crash occurs
    18 to 24 hours into a run of a network stress exerciser.

    Thanks to his detailed report and analysis, we were able to understand
    the core issue.

    IP early demux can associate a dst to skb, after a lookup in TCP/UDP
    sockets.

    When socket cache is not properly set, we want to store into
    sk->sk_dst_cache the dst for future IP early demux lookups,
    by acquiring a stable refcount on the dst.

    Problem is this acquisition is simply using an atomic_inc(),
    which works well, unless the dst was queued for destruction from
    dst_release() noticing dst refcount went to zero, if DST_NOCACHE
    was set on dst.

    We need to make sure current refcount is not zero before incrementing
    it, or risk double free as David reported.
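
    The "increment only if not zero" pattern can be sketched with C11
    atomics. This is a user-space illustration with a hypothetical
    function name, not the helpers added by the patch.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Take a reference only if the object is still alive (count != 0). */
static bool refcount_inc_not_zero(atomic_int *ref)
{
    int old = atomic_load(ref);

    while (old != 0) {
        /* On failure, old is updated to the current value and we retry. */
        if (atomic_compare_exchange_weak(ref, &old, old + 1))
            return true;
    }
    return false;   /* count already hit zero: object is being destroyed */
}
```

    A plain atomic increment would instead turn a count of 0 back into 1
    and hand out a reference to an object already queued for destruction.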

    This patch, being a stable candidate, adds two new helpers, and uses
    them only from IP early demux problematic paths.

    It might be possible to merge in net-next skb_dst_force() and
    skb_dst_force_safe(), but I prefer having the smallest patch for stable
    kernels : Maybe some skb_dst_force() callers do not expect skb->dst
    can suddenly be cleared.

    Can probably be backported back to linux-3.6 kernels.

    Reported-by: David J. Wilder
    Tested-by: David J. Wilder
    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     
  • 郭永刚 reported that one could simply crash the kernel as root by
    using a simple program:

    int socket_fd;
    struct sockaddr_in addr;
    addr.sin_port = 0;
    addr.sin_addr.s_addr = INADDR_ANY;
    addr.sin_family = 10;

    socket_fd = socket(10,3,0x40000000);
    connect(socket_fd , &addr,16);

    AF_INET and AF_INET6 sockets actually only support 8-bit protocol
    identifiers. inet_sock's skc_protocol field is sized accordingly,
    so larger protocol identifiers simply have their higher bits cut off
    and a zero is stored in the protocol fields.

    This could lead to e.g. a NULL function pointer dereference: as a
    result of the cut-off, inet_num is zero and we call down to
    inet_autobind, which is NULL for raw sockets.
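
    The truncation described above can be reproduced in plain C. This is
    only a sketch: store_protocol_8bit() and protocol_fits_8bit() are
    hypothetical helpers, and the bound of 256 is an assumption that
    follows from the 8-bit field width, not a quote of the actual fix.

```c
#include <assert.h>
#include <stdint.h>

/* The model only makes sense if the storage type is a single byte. */
static_assert(sizeof(uint8_t) == 1, "protocol field is modeled as one byte");

/* Model of storing a socket() protocol argument in an 8-bit field. */
static uint8_t store_protocol_8bit(int protocol)
{
    return (uint8_t)protocol; /* keeps only the low 8 bits */
}

/* Hypothetical range check that rejects values an 8-bit field cannot hold. */
static int protocol_fits_8bit(int protocol)
{
    return protocol >= 0 && protocol < 256;
}
```

    With the value from the program above, store_protocol_8bit(0x40000000)
    yields 0, because all set bits lie above bit 7.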

    kernel: Call Trace:
    kernel: [] ? inet_autobind+0x2e/0x70
    kernel: [] inet_dgram_connect+0x54/0x80
    kernel: [] SYSC_connect+0xd9/0x110
    kernel: [] ? ptrace_notify+0x5b/0x80
    kernel: [] ? syscall_trace_enter_phase2+0x108/0x200
    kernel: [] SyS_connect+0xe/0x10
    kernel: [] tracesys_phase2+0x84/0x89

    I found no particular commit which introduced this problem.

    CVE: CVE-2015-8543
    Cc: Cong Wang
    Reported-by: 郭永刚
    Signed-off-by: Hannes Frederic Sowa
    Signed-off-by: David S. Miller

    Hannes Frederic Sowa
     
  • Pablo Neira Ayuso says:

    ====================
    netfilter fixes for net

    The following patchset contains Netfilter fixes for your net tree,
    specifically for nf_tables and nfnetlink_queue, they are:

    1) Avoid a compilation warning in nfnetlink_queue that was introduced
    in the previous merge window with the simplification of the conntrack
    integration, from Arnd Bergmann.

    2) nfnetlink_queue is leaking the pernet subsystem registration from
    a failure path, patch from Nikolay Borisov.

    3) Pass down netns pointer to batch callback in nfnetlink, this is the
    largest patch and it is not a bugfix but it is a dependency to
    resolve a splat in the correct way.

    4) Fix a splat due to incorrect socket memory accounting with nfnetlink
    skbuff clones.

    5) Add missing conntrack dependencies to NFT_DUP_IPV4 and NFT_DUP_IPV6.

    6) Traverse the nftables commit list in reverse order from the commit
    path, otherwise we crash when the user applies an incremental update
    via 'nft -f' that deletes an object that was just introduced in this
    batch, from Xin Long.

    Regarding the compilation warning fix, many people have sent us (and
    keep sending us) patches to address this, that's why I'm including this
    batch even if this is not critical.
    ====================

    Signed-off-by: David S. Miller

    David S. Miller
     

14 Dec, 2015

1 commit

  • The VRF driver cycles netdevs when an interface is enslaved or released:
    the down event is used to flush neighbor and route tables and the up
    event (if the interface was already up) effectively moves local and
    connected routes to the proper table.

    As of 4f823defdd5b the local route is left hanging around after a link
    down, so when a netdev is moved from one VRF to another (or released
    from a VRF altogether) local routes are left in the wrong table.

    Fix by handling the NETDEV_CHANGEUPPER event. When the upper dev is
    an L3mdev then call fib_disable_ip to flush all routes, local ones
    too.

    Fixes: 4f823defdd5b ("ipv4: fix to not remove local route on link down")
    Cc: Julian Anastasov
    Signed-off-by: David Ahern
    Signed-off-by: Nikolay Aleksandrov
    Signed-off-by: David S. Miller

    David Ahern
     

11 Dec, 2015

1 commit


04 Dec, 2015

1 commit

    When a multicast group is joined on a socket, a struct ip_mc_socklist
    is appended to the socket's mc_list containing information about the
    joined group.

    If the interface is hot unplugged, this entry becomes stale. Prior to
    commit 52ad353a5344f ("igmp: fix the problem when mc leave group") it
    was possible to remove the stale entry by performing an
    IP_DROP_MEMBERSHIP, passing either the old ifindex or IP address on
    the interface. However, this fix enforces that the interface must
    still exist. Thus with time, the number of stale entries grows, until
    sysctl_igmp_max_memberships is reached and then it is not possible to
    join any more groups.

    The previous patch fixes an issue where an IP_DROP_MEMBERSHIP is
    performed without specifying the interface, either by ifindex or IP
    address. However, here we do supply one of these. So loosen the
    restriction on device existence to only apply when the interface has
    not been specified. This then restores the ability to clean up the
    stale entries.
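
    The loosened rule can be modeled as a small decision function. This is
    a sketch under assumptions: struct drop_request and
    check_drop_request() are hypothetical names, and returning -ENODEV for
    the rejected case is an assumption about the error code.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of what the caller passes to IP_DROP_MEMBERSHIP. */
struct drop_request {
    int      ifindex; /* 0 means "not specified" */
    uint32_t address; /* 0 means "not specified" */
};

/*
 * Require a live device only when the caller named no interface at all.
 * If an ifindex or address was given, the request is allowed through even
 * when the device is gone, so the stale entry can still be removed.
 */
static int check_drop_request(const struct drop_request *req, bool device_found)
{
    assert(req != NULL);

    bool iface_specified = req->ifindex != 0 || req->address != 0;

    if (!iface_specified && !device_found)
        return -ENODEV;
    return 0;
}
```

    A request that carries the old ifindex of an unplugged interface passes
    this check, while a request with neither field set still needs an
    existing device.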

    Signed-off-by: Andrew Lunn
    Fixes: 52ad353a5344f ("igmp: fix the problem when mc leave group")
    Signed-off-by: David S. Miller

    Andrew Lunn
     

02 Dec, 2015

1 commit

    This patch is a cleanup to make the following patch easier to
    review.

    The goal is to move SOCK_ASYNC_NOSPACE and SOCK_ASYNC_WAITDATA
    from (struct socket)->flags to (struct socket_wq)->flags
    to benefit from RCU protection in sock_wake_async().

    To ease backports, we rename both constants.

    Two new helpers, sk_set_bit(int nr, struct sock *sk)
    and sk_clear_bit(int nr, struct sock *sk), are added so that
    the following patch can change their implementation.
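
    The point of routing every flag update through two helpers can be
    shown with a userspace model. Everything here is hypothetical: struct
    model_sock, the MODEL_* bit numbers and the model_*() functions are
    illustrative names, and C11 atomics stand in for the kernel's bit
    operations.

```c
#include <assert.h>
#include <limits.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical bit numbers; the renamed kernel constants are not shown here. */
enum { MODEL_ASYNC_NOSPACE = 0, MODEL_ASYNC_WAITDATA = 1 };

/* Hypothetical socket model holding only a flags word. */
struct model_sock {
    atomic_ulong flags;
};

static int model_bit_valid(int nr)
{
    return nr >= 0 && nr < (int)(sizeof(unsigned long) * CHAR_BIT);
}

/* Callers never touch 'flags' directly, so its location can move later. */
static void model_set_bit(int nr, struct model_sock *sk)
{
    assert(sk != NULL && model_bit_valid(nr));
    atomic_fetch_or(&sk->flags, 1UL << nr);
}

static void model_clear_bit(int nr, struct model_sock *sk)
{
    assert(sk != NULL && model_bit_valid(nr));
    atomic_fetch_and(&sk->flags, ~(1UL << nr));
}

static int model_test_bit(int nr, struct model_sock *sk)
{
    assert(sk != NULL && model_bit_valid(nr));
    return (atomic_load(&sk->flags) >> nr) & 1UL;
}
```

    Because only the helper bodies know where the flags word lives, moving
    it to another structure changes two functions rather than every call
    site.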

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet