05 Mar, 2020

1 commit

  • [ Upstream commit 303d0403b8c25e994e4a6e45389e173cf8706fb5 ]

    As of the below commit, udp sockets bound to a specific address can
    coexist with one bound to the any addr for the same port.

    The commit also phased out the use of socket hashing based only on
    port (hslot), in favor of always hashing on {addr, port} (hslot2).

    The change broke the following behavior with disconnect (AF_UNSPEC):

    server binds to 0.0.0.0:1337
    server connects to 127.0.0.1:80
    server disconnects
    client connects to 127.0.0.1:1337
    client sends "hello"
    server reads "hello" // times out, packet did not find sk

    On connect the server acquires a specific source addr suitable for
    routing to its destination. On disconnect it reverts to the any addr.

    The connect call triggers a rehash to a different hslot2. On
    disconnect, add the same to return to the original hslot2.

    Skip this step if the socket is going to be unhashed completely.
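
    The rehash behaviour described above can be illustrated with a small
    toy model. This is only a sketch, not kernel code: the Sock and
    Hslot2Table classes, the run() helper and the matching rule in
    lookup() are all hypothetical simplifications, assuming that a socket
    is only matched when its bound address equals the address of the
    slot it is stored in.

```python
ANY = "0.0.0.0"


class Sock:
    """Hypothetical stand-in for a UDP socket: bound address and port only."""

    def __init__(self, port):
        self.addr = ANY
        self.port = port


class Hslot2Table:
    """Toy model of the {addr, port} hash (hslot2); not the kernel structure."""

    def __init__(self):
        self.slots = {}  # (addr, port) -> list of sockets

    def insert(self, sk):
        self.slots.setdefault((sk.addr, sk.port), []).append(sk)

    def rehash(self, sk, new_addr):
        # Move the socket from its current slot to the slot for new_addr.
        self.slots[(sk.addr, sk.port)].remove(sk)
        sk.addr = new_addr
        self.insert(sk)

    def lookup(self, daddr, dport):
        # Try the exact-address slot first, then the any-address slot.
        # Assumption: a socket only matches if its address equals the slot's.
        for key in ((daddr, dport), (ANY, dport)):
            for sk in self.slots.get(key, []):
                if sk.addr == key[0]:
                    return sk
        return None


def run(rehash_on_disconnect):
    table = Hslot2Table()
    server = Sock(1337)
    table.insert(server)               # server binds to 0.0.0.0:1337
    table.rehash(server, "127.0.0.1")  # connect picks a specific source addr
    if rehash_on_disconnect:
        table.rehash(server, ANY)      # fixed: return to the original slot
    else:
        server.addr = ANY              # broken: address reverts, slot does not
    # client sends to 127.0.0.1:1337
    return table.lookup("127.0.0.1", 1337) is server
```

    In this model run(False) misses the socket, mirroring the timeout in
    the scenario above, while run(True) finds it again.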

    Fixes: 4cdeeee9252a ("net: udp: prefer listeners bound to an address")
    Reported-by: Pavel Roskin
    Signed-off-by: Willem de Bruijn
    Reviewed-by: Eric Dumazet
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Willem de Bruijn
     

11 Feb, 2020

4 commits

  • [ Upstream commit 784f8344de750a41344f4bbbebb8507a730fc99c ]

    tp->segs_in and tp->segs_out need to be cleared in tcp_disconnect().

    tcp_disconnect() is rarely used, but it is worth fixing it.

    Fixes: 2efd055c53c0 ("tcp: add tcpi_segs_in and tcpi_segs_out to tcp_info")
    Signed-off-by: Eric Dumazet
    Cc: Marcelo Ricardo Leitner
    Cc: Yuchung Cheng
    Cc: Neal Cardwell
    Acked-by: Neal Cardwell
    Signed-off-by: Jakub Kicinski
    Signed-off-by: Greg Kroah-Hartman

    Eric Dumazet
     
  • [ Upstream commit db7ffee6f3eb3683cdcaeddecc0a630a14546fe3 ]

    tp->data_segs_in and tp->data_segs_out need to be cleared
    in tcp_disconnect().

    tcp_disconnect() is rarely used, but it is worth fixing it.

    Fixes: a44d6eacdaf5 ("tcp: Add RFC4898 tcpEStatsPerfDataSegsOut/In")
    Signed-off-by: Eric Dumazet
    Cc: Martin KaFai Lau
    Cc: Yuchung Cheng
    Cc: Neal Cardwell
    Acked-by: Neal Cardwell
    Signed-off-by: Jakub Kicinski
    Signed-off-by: Greg Kroah-Hartman

    Eric Dumazet
     
  • [ Upstream commit 2fbdd56251b5c62f96589f39eded277260de7267 ]

    tp->delivered needs to be cleared in tcp_disconnect().

    tcp_disconnect() is rarely used, but it is worth fixing it.

    Fixes: ddf1af6fa00e ("tcp: new delivery accounting")
    Signed-off-by: Eric Dumazet
    Cc: Yuchung Cheng
    Cc: Neal Cardwell
    Acked-by: Yuchung Cheng
    Acked-by: Neal Cardwell
    Acked-by: Soheil Hassas Yeganeh
    Signed-off-by: Jakub Kicinski
    Signed-off-by: Greg Kroah-Hartman

    Eric Dumazet
     
  • [ Upstream commit c13c48c00a6bc1febc73902505bdec0967bd7095 ]

    total_retrans needs to be cleared in tcp_disconnect().

    tcp_disconnect() is rarely used, but it is worth fixing it.

    Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
    Signed-off-by: Eric Dumazet
    Cc: SeongJae Park
    Signed-off-by: Jakub Kicinski
    Signed-off-by: Greg Kroah-Hartman

    Eric Dumazet
     

06 Feb, 2020

1 commit

  • [ Upstream commit 95224166a9032ff5d08fca633d37113078ce7d01 ]

    With an ebpf program that redirects packets through a vti[6] interface,
    the packets are dropped because no dst is attached.

    This could also be reproduced with an AF_PACKET socket, with the following
    python script (vti1 is an ip_vti interface):

    import socket
    send_s = socket.socket(socket.AF_PACKET, socket.SOCK_RAW, 0)
    # scapy
    # p = IP(src='10.100.0.2', dst='10.200.0.1')/ICMP(type='echo-request')
    # raw(p)
    req = b'E\x00\x00\x1c\x00\x01\x00\x00@\x01e\xb2\nd\x00\x02\n\xc8\x00\x01\x08\x00\xf7\xff\x00\x00\x00\x00'
    send_s.sendto(req, ('vti1', 0x800, 0, 0))
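
    To see what the raw bytes in the script contain, they can be decoded
    with the standard struct module. This is a minimal sketch that only
    inspects the packet; it does not open a socket. The helper name
    inet_checksum is an assumption made for this example, and req is
    copied unchanged from the script above.

```python
import socket
import struct

req = b'E\x00\x00\x1c\x00\x01\x00\x00@\x01e\xb2\nd\x00\x02\n\xc8\x00\x01\x08\x00\xf7\xff\x00\x00\x00\x00'


def inet_checksum(data):
    """One's-complement sum over 16-bit big-endian words (RFC 1071 style)."""
    if len(data) % 2:
        data += b"\x00"
    total = sum(struct.unpack("!%dH" % (len(data) // 2), data))
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF


# Fixed 20-byte IPv4 header (no options), followed by the ICMP header.
(ver_ihl, tos, total_len, ident, frag, ttl, proto, ip_csum,
 src_raw, dst_raw) = struct.unpack("!BBHHHBBH4s4s", req[:20])
icmp_type, icmp_code, icmp_csum, icmp_id, icmp_seq = struct.unpack(
    "!BBHHH", req[20:28])

version = ver_ihl >> 4
header_len = (ver_ihl & 0x0F) * 4
src = socket.inet_ntoa(src_raw)
dst = socket.inet_ntoa(dst_raw)

# A header that carries a valid checksum sums to zero when re-checked.
ip_header_ok = inet_checksum(req[:header_len]) == 0
icmp_ok = inet_checksum(req[header_len:]) == 0

print(version, header_len, total_len, proto, src, dst, icmp_type)
```

    The decoded fields match the scapy expression in the comments: an
    IPv4 packet from 10.100.0.2 to 10.200.0.1 carrying an ICMP echo
    request (type 8).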

    Signed-off-by: Nicolas Dichtel
    Signed-off-by: Steffen Klassert
    Signed-off-by: Sasha Levin

    Nicolas Dichtel
     

01 Feb, 2020

1 commit

  • [ Upstream commit f9e95555757915fc194288862d2978e370fe316b ]

    Include the size of struct nhmsg when calculating
    how much of a payload to allocate in a new netlink nexthop
    notification message.

    Without this, we will fail to fill the skbuff at certain nexthop
    group sizes.

    You can reproduce the failure with the following iproute2 commands:

    ip link add dummy1 type dummy
    ip link add dummy2 type dummy
    ip link add dummy3 type dummy
    ip link add dummy4 type dummy
    ip link add dummy5 type dummy
    ip link add dummy6 type dummy
    ip link add dummy7 type dummy
    ip link add dummy8 type dummy
    ip link add dummy9 type dummy
    ip link add dummy10 type dummy
    ip link add dummy11 type dummy
    ip link add dummy12 type dummy
    ip link add dummy13 type dummy
    ip link add dummy14 type dummy
    ip link add dummy15 type dummy
    ip link add dummy16 type dummy
    ip link add dummy17 type dummy
    ip link add dummy18 type dummy
    ip link add dummy19 type dummy

    ip ro add 1.1.1.1/32 dev dummy1
    ip ro add 1.1.1.2/32 dev dummy2
    ip ro add 1.1.1.3/32 dev dummy3
    ip ro add 1.1.1.4/32 dev dummy4
    ip ro add 1.1.1.5/32 dev dummy5
    ip ro add 1.1.1.6/32 dev dummy6
    ip ro add 1.1.1.7/32 dev dummy7
    ip ro add 1.1.1.8/32 dev dummy8
    ip ro add 1.1.1.9/32 dev dummy9
    ip ro add 1.1.1.10/32 dev dummy10
    ip ro add 1.1.1.11/32 dev dummy11
    ip ro add 1.1.1.12/32 dev dummy12
    ip ro add 1.1.1.13/32 dev dummy13
    ip ro add 1.1.1.14/32 dev dummy14
    ip ro add 1.1.1.15/32 dev dummy15
    ip ro add 1.1.1.16/32 dev dummy16
    ip ro add 1.1.1.17/32 dev dummy17
    ip ro add 1.1.1.18/32 dev dummy18
    ip ro add 1.1.1.19/32 dev dummy19

    ip next add id 1 via 1.1.1.1 dev dummy1
    ip next add id 2 via 1.1.1.2 dev dummy2
    ip next add id 3 via 1.1.1.3 dev dummy3
    ip next add id 4 via 1.1.1.4 dev dummy4
    ip next add id 5 via 1.1.1.5 dev dummy5
    ip next add id 6 via 1.1.1.6 dev dummy6
    ip next add id 7 via 1.1.1.7 dev dummy7
    ip next add id 8 via 1.1.1.8 dev dummy8
    ip next add id 9 via 1.1.1.9 dev dummy9
    ip next add id 10 via 1.1.1.10 dev dummy10
    ip next add id 11 via 1.1.1.11 dev dummy11
    ip next add id 12 via 1.1.1.12 dev dummy12
    ip next add id 13 via 1.1.1.13 dev dummy13
    ip next add id 14 via 1.1.1.14 dev dummy14
    ip next add id 15 via 1.1.1.15 dev dummy15
    ip next add id 16 via 1.1.1.16 dev dummy16
    ip next add id 17 via 1.1.1.17 dev dummy17
    ip next add id 18 via 1.1.1.18 dev dummy18
    ip next add id 19 via 1.1.1.19 dev dummy19

    ip next add id 1111 group 1/2/3/4/5/6/7/8/9/10/11/12/13/14/15/16/17/18/19
    ip next del id 1111
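
    The command list above follows a regular pattern, so it can also be
    generated instead of typed. The sketch below only builds the command
    strings; it does not run them. The function name
    nexthop_repro_commands and its count parameter are assumptions made
    for this example.

```python
def nexthop_repro_commands(count=19):
    """Build the iproute2 reproducer commands for `count` dummy devices."""
    ids = range(1, count + 1)
    cmds = [f"ip link add dummy{i} type dummy" for i in ids]
    cmds += [f"ip ro add 1.1.1.{i}/32 dev dummy{i}" for i in ids]
    cmds += [f"ip next add id {i} via 1.1.1.{i} dev dummy{i}" for i in ids]
    # One nexthop group that references every nexthop, then delete it.
    group = "/".join(str(i) for i in ids)
    cmds.append(f"ip next add id 1111 group {group}")
    cmds.append("ip next del id 1111")
    return cmds


if __name__ == "__main__":
    print("\n".join(nexthop_repro_commands()))
```

    Piping the printed output to a root shell should be equivalent to
    the commands listed above.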

    Fixes: 430a049190de ("nexthop: Add support for nexthop groups")
    Signed-off-by: Stephen Worley
    Reviewed-by: David Ahern
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Stephen Worley
     

29 Jan, 2020

7 commits

  • commit 4e4362d2bf2a49ff44dbbc9585207977ca3d71d0 upstream.

    Commit 9b42c1f179a6 ("xfrm: Extend the output_mark") added output_mark
    support but missed ESP offload support.

    xfrm_smark_get() is not called within xfrm_input() for packets coming
    from esp4_gro_receive() or esp6_gro_receive(). Therefore call
    xfrm_smark_get() directly within these functions.

    Fixes: 9b42c1f179a6 ("xfrm: Extend the output_mark to support input direction and masking.")
    Signed-off-by: Ulrich Weber
    Signed-off-by: Steffen Klassert
    Signed-off-by: Greg Kroah-Hartman

    Ulrich Weber
     
  • [ Upstream commit 9827c0634e461703abf81e8cc8b7adf5da5886d0 ]

    Sven-Haegar reported looping on fib dumps when 255.255.255.255 route has
    been added to a table. The looping is caused by the key rolling over from
    FFFFFFFF to 0. When dumping a specific table only, we need a means to detect
    when the table dump is done. The key and count saved to cb args are both 0
    only at the start of the table dump. If key is 0 and count > 0, then we are
    in the rollover case. Detect and return to avoid looping.

    This only affects dumps of a specific table; for dumps of all tables
    (the case prior to the change in the Fixes tag) inet_dump_fib moved
    the entry counter to the next table and reset the cb args used by
    fib_table_dump and fn_trie_dump_leaf, so the rollover ffffffff back
    to 0 did not cause looping with the dumps.
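
    The rollover check can be sketched with a toy resumable dump. This
    is not the kernel's fib_table_dump: the dump_table function, the
    batch and max_calls parameters and the list-based table are
    assumptions made for illustration. Only the saved key and count and
    the "key is 0 and count > 0" test follow the description above.

```python
U32_MASK = 0xFFFFFFFF


def dump_table(leaf_keys, batch=2, detect_rollover=True, max_calls=20):
    """Toy resumable dump of one table.

    Returns (dumped_keys, finished). `key` and `count` play the role of
    the values saved in the cb args between calls.
    """
    keys = sorted(leaf_keys)
    key, count = 0, 0
    dumped = []
    for _ in range(max_calls):  # each iteration models one dump call
        # Both values are 0 only at the very start of the table dump.
        if detect_rollover and key == 0 and count > 0:
            return dumped, True
        pending = [k for k in keys if k >= key][:batch]
        if not pending:
            return dumped, True
        for k in pending:
            dumped.append(k)
            count += 1
            key = (k + 1) & U32_MASK  # 0xFFFFFFFF + 1 wraps to 0
    return dumped, False  # gave up: the dump kept restarting


table = [0x0A000000, 0xFFFFFFFF]  # includes a 255.255.255.255 entry
good, good_done = dump_table(table)
bad, bad_done = dump_table(table, detect_rollover=False)
```

    With the check enabled each entry is dumped once; with it disabled
    the toy dump restarts from key 0 until max_calls is exhausted.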

    Fixes: effe67926624 ("net: Enable kernel side filtering of route dumps")
    Reported-by: Sven-Haegar Koch
    Signed-off-by: David Ahern
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    David Ahern
     
  • [ Upstream commit bb48eb9b12a95db9d679025927269d4adda6dbd1 ]

    When submitting v2 of "fou: Support binding FoU socket" (1713cb37bf67),
    I accidentally sent the wrong version of the patch and one fix was
    missing. In the initial version of the patch, as well as the version 2
    that I submitted, I incorrectly used ".type" for the two V6-attributes.
    The correct field to use is ".len".

    Reported-by: Dmitry Vyukov
    Fixes: 1713cb37bf67 ("fou: Support binding FoU socket")
    Signed-off-by: Kristian Evensen
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Kristian Evensen
     
  • [ Upstream commit 2bec445f9bf35e52e395b971df48d3e1e5dc704a ]

    Latest commit 853697504de0 ("tcp: Fix highest_sack and highest_sack_seq")
    apparently allowed syzbot to trigger various crashes in TCP stack [1]

    I believe this commit only made things easier for syzbot to find
    its way into triggering use-after-frees. But really the bugs
    could lead to bad TCP behavior or even plain crashes even for
    non malicious peers.

    I have audited all calls to tcp_rtx_queue_unlink() and
    tcp_rtx_queue_unlink_and_free() and made sure tp->highest_sack would be updated
    if we are removing from rtx queue the skb that tp->highest_sack points to.

    These updates were missing in three locations :

    1) tcp_clean_rtx_queue() [This one seems quite serious,
    I have no idea why this was not caught earlier]

    2) tcp_rtx_queue_purge() [Probably not a big deal for normal operations]

    3) tcp_send_synack() [Probably not a big deal for normal operations]

    [1]
    BUG: KASAN: use-after-free in tcp_highest_sack_seq include/net/tcp.h:1864 [inline]
    BUG: KASAN: use-after-free in tcp_highest_sack_seq include/net/tcp.h:1856 [inline]
    BUG: KASAN: use-after-free in tcp_check_sack_reordering+0x33c/0x3a0 net/ipv4/tcp_input.c:891
    Read of size 4 at addr ffff8880a488d068 by task ksoftirqd/1/16

    CPU: 1 PID: 16 Comm: ksoftirqd/1 Not tainted 5.5.0-rc5-syzkaller #0
    Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
    Call Trace:
    __dump_stack lib/dump_stack.c:77 [inline]
    dump_stack+0x197/0x210 lib/dump_stack.c:118
    print_address_description.constprop.0.cold+0xd4/0x30b mm/kasan/report.c:374
    __kasan_report.cold+0x1b/0x41 mm/kasan/report.c:506
    kasan_report+0x12/0x20 mm/kasan/common.c:639
    __asan_report_load4_noabort+0x14/0x20 mm/kasan/generic_report.c:134
    tcp_highest_sack_seq include/net/tcp.h:1864 [inline]
    tcp_highest_sack_seq include/net/tcp.h:1856 [inline]
    tcp_check_sack_reordering+0x33c/0x3a0 net/ipv4/tcp_input.c:891
    tcp_try_undo_partial net/ipv4/tcp_input.c:2730 [inline]
    tcp_fastretrans_alert+0xf74/0x23f0 net/ipv4/tcp_input.c:2847
    tcp_ack+0x2577/0x5bf0 net/ipv4/tcp_input.c:3710
    tcp_rcv_established+0x6dd/0x1e90 net/ipv4/tcp_input.c:5706
    tcp_v4_do_rcv+0x619/0x8d0 net/ipv4/tcp_ipv4.c:1619
    tcp_v4_rcv+0x307f/0x3b40 net/ipv4/tcp_ipv4.c:2001
    ip_protocol_deliver_rcu+0x5a/0x880 net/ipv4/ip_input.c:204
    ip_local_deliver_finish+0x23b/0x380 net/ipv4/ip_input.c:231
    NF_HOOK include/linux/netfilter.h:307 [inline]
    NF_HOOK include/linux/netfilter.h:301 [inline]
    ip_local_deliver+0x1e9/0x520 net/ipv4/ip_input.c:252
    dst_input include/net/dst.h:442 [inline]
    ip_rcv_finish+0x1db/0x2f0 net/ipv4/ip_input.c:428
    NF_HOOK include/linux/netfilter.h:307 [inline]
    NF_HOOK include/linux/netfilter.h:301 [inline]
    ip_rcv+0xe8/0x3f0 net/ipv4/ip_input.c:538
    __netif_receive_skb_one_core+0x113/0x1a0 net/core/dev.c:5148
    __netif_receive_skb+0x2c/0x1d0 net/core/dev.c:5262
    process_backlog+0x206/0x750 net/core/dev.c:6093
    napi_poll net/core/dev.c:6530 [inline]
    net_rx_action+0x508/0x1120 net/core/dev.c:6598
    __do_softirq+0x262/0x98c kernel/softirq.c:292
    run_ksoftirqd kernel/softirq.c:603 [inline]
    run_ksoftirqd+0x8e/0x110 kernel/softirq.c:595
    smpboot_thread_fn+0x6a3/0xa40 kernel/smpboot.c:165
    kthread+0x361/0x430 kernel/kthread.c:255
    ret_from_fork+0x24/0x30 arch/x86/entry/entry_64.S:352

    Allocated by task 10091:
    save_stack+0x23/0x90 mm/kasan/common.c:72
    set_track mm/kasan/common.c:80 [inline]
    __kasan_kmalloc mm/kasan/common.c:513 [inline]
    __kasan_kmalloc.constprop.0+0xcf/0xe0 mm/kasan/common.c:486
    kasan_slab_alloc+0xf/0x20 mm/kasan/common.c:521
    slab_post_alloc_hook mm/slab.h:584 [inline]
    slab_alloc_node mm/slab.c:3263 [inline]
    kmem_cache_alloc_node+0x138/0x740 mm/slab.c:3575
    __alloc_skb+0xd5/0x5e0 net/core/skbuff.c:198
    alloc_skb_fclone include/linux/skbuff.h:1099 [inline]
    sk_stream_alloc_skb net/ipv4/tcp.c:875 [inline]
    sk_stream_alloc_skb+0x113/0xc90 net/ipv4/tcp.c:852
    tcp_sendmsg_locked+0xcf9/0x3470 net/ipv4/tcp.c:1282
    tcp_sendmsg+0x30/0x50 net/ipv4/tcp.c:1432
    inet_sendmsg+0x9e/0xe0 net/ipv4/af_inet.c:807
    sock_sendmsg_nosec net/socket.c:652 [inline]
    sock_sendmsg+0xd7/0x130 net/socket.c:672
    __sys_sendto+0x262/0x380 net/socket.c:1998
    __do_sys_sendto net/socket.c:2010 [inline]
    __se_sys_sendto net/socket.c:2006 [inline]
    __x64_sys_sendto+0xe1/0x1a0 net/socket.c:2006
    do_syscall_64+0xfa/0x790 arch/x86/entry/common.c:294
    entry_SYSCALL_64_after_hwframe+0x49/0xbe

    Freed by task 10095:
    save_stack+0x23/0x90 mm/kasan/common.c:72
    set_track mm/kasan/common.c:80 [inline]
    kasan_set_free_info mm/kasan/common.c:335 [inline]
    __kasan_slab_free+0x102/0x150 mm/kasan/common.c:474
    kasan_slab_free+0xe/0x10 mm/kasan/common.c:483
    __cache_free mm/slab.c:3426 [inline]
    kmem_cache_free+0x86/0x320 mm/slab.c:3694
    kfree_skbmem+0x178/0x1c0 net/core/skbuff.c:645
    __kfree_skb+0x1e/0x30 net/core/skbuff.c:681
    sk_eat_skb include/net/sock.h:2453 [inline]
    tcp_recvmsg+0x1252/0x2930 net/ipv4/tcp.c:2166
    inet_recvmsg+0x136/0x610 net/ipv4/af_inet.c:838
    sock_recvmsg_nosec net/socket.c:886 [inline]
    sock_recvmsg net/socket.c:904 [inline]
    sock_recvmsg+0xce/0x110 net/socket.c:900
    __sys_recvfrom+0x1ff/0x350 net/socket.c:2055
    __do_sys_recvfrom net/socket.c:2073 [inline]
    __se_sys_recvfrom net/socket.c:2069 [inline]
    __x64_sys_recvfrom+0xe1/0x1a0 net/socket.c:2069
    do_syscall_64+0xfa/0x790 arch/x86/entry/common.c:294
    entry_SYSCALL_64_after_hwframe+0x49/0xbe

    The buggy address belongs to the object at ffff8880a488d040
    which belongs to the cache skbuff_fclone_cache of size 456
    The buggy address is located 40 bytes inside of
    456-byte region [ffff8880a488d040, ffff8880a488d208)
    The buggy address belongs to the page:
    page:ffffea0002922340 refcount:1 mapcount:0 mapping:ffff88821b057000 index:0x0
    raw: 00fffe0000000200 ffffea00022a5788 ffffea0002624a48 ffff88821b057000
    raw: 0000000000000000 ffff8880a488d040 0000000100000006 0000000000000000
    page dumped because: kasan: bad access detected

    Memory state around the buggy address:
    ffff8880a488cf00: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
    ffff8880a488cf80: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
    >ffff8880a488d000: fc fc fc fc fc fc fc fc fb fb fb fb fb fb fb fb
    ^
    ffff8880a488d080: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
    ffff8880a488d100: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb

    Fixes: 853697504de0 ("tcp: Fix highest_sack and highest_sack_seq")
    Fixes: 50895b9de1d3 ("tcp: highest_sack fix")
    Fixes: 737ff314563c ("tcp: use sequence distance to detect reordering")
    Signed-off-by: Eric Dumazet
    Cc: Cambda Zhu
    Cc: Yuchung Cheng
    Cc: Neal Cardwell
    Acked-by: Neal Cardwell
    Acked-by: Yuchung Cheng
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Eric Dumazet
     
  • [ Upstream commit 5b2f1f3070b6447b76174ea8bfb7390dc6253ebd ]

    do_div() does a 64-by-32 division. Use div64_long() instead of it
    if the divisor is long, to avoid truncation to 32-bit.
    As a nice side effect, this also cleans up the function a bit.
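
    The truncation can be illustrated with plain integer arithmetic.
    This sketch only mimics the effect of passing a 64-bit divisor to a
    64-by-32 division; the function names are hypothetical and it uses
    non-negative values only, so Python's floor division agrees with C.

```python
U32_MASK = 0xFFFFFFFF


def div_with_32bit_divisor(dividend, divisor):
    """Mimic a 64-by-32 division: only the low 32 bits of the divisor are used."""
    truncated = divisor & U32_MASK
    if truncated == 0:
        raise ZeroDivisionError("divisor truncates to zero")
    return dividend // truncated


def div_with_64bit_divisor(dividend, divisor):
    """Division that keeps the full width of the divisor."""
    return dividend // divisor


dividend = 10**12
divisor = (1 << 32) + 2  # the low 32 bits of this value are just 2

wrong = div_with_32bit_divisor(dividend, divisor)
right = div_with_64bit_divisor(dividend, divisor)
print(wrong, right)
```

    A divisor whose low 32 bits are all zero is the worst case: in this
    model the truncated division would divide by zero.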

    Signed-off-by: Wen Yang
    Cc: Eric Dumazet
    Cc: "David S. Miller"
    Cc: Alexey Kuznetsov
    Cc: Hideaki YOSHIFUJI
    Cc: netdev@vger.kernel.org
    Cc: linux-kernel@vger.kernel.org
    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Wen Yang
     
  • [ Upstream commit d39ca2590d10712f412add7a88e1dd467a7246f4 ]

    This reverts commit 0d4a6608f68c7532dcbfec2ea1150c9761767d03.

    Willem reported that after commit 0d4a6608f68c ("udp: do rmem bulk
    free even if the rx sk queue is empty") the memory allocated by
    an almost idle system with many UDP sockets can grow a lot.

    For stable kernel keep the solution as simple as possible and revert
    the offending commit.

    Reported-by: Willem de Bruijn
    Diagnosed-by: Eric Dumazet
    Fixes: 0d4a6608f68c ("udp: do rmem bulk free even if the rx sk queue is empty")
    Signed-off-by: Paolo Abeni
    Acked-by: Willem de Bruijn
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Paolo Abeni
     
  • [ Upstream commit d0f418516022c32ecceaf4275423e5bd3f8743a9 ]

    In the same manner as commit 690afc165bb3 ("net: ip6_gre: fix moving
    ip6gre between namespaces"), fix namespace moving as it was broken since
    commit 2e15ea390e6f ("ip_gre: Add support to collect tunnel metadata.").
    Indeed, the ip6_gre commit removed the local flag for collect_md
    condition, so there is no reason to keep it for ip_gre/ip_tunnel.

    This patch fixes both the ip_tunnel and ip_gre modules.

    Fixes: 2e15ea390e6f ("ip_gre: Add support to collect tunnel metadata.")
    Signed-off-by: William Dauchy
    Acked-by: Nicolas Dichtel
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    William Dauchy
     

23 Jan, 2020

6 commits

  • commit 216808c6ba6d00169fd2aa928ec3c0e63bef254f upstream.

    At the time commit ce5ec440994b ("tcp: ensure epoll edge trigger
    wakeup when write queue is empty") was added to the kernel,
    we still had a single write queue, combining rtx and write queues.

    Once we moved the rtx queue into a separate rb-tree, testing
    if sk_write_queue is empty has been suboptimal.

    Indeed, if we have packets in the rtx queue, we probably want
    to delay the EPOLLOUT generation until incoming packets
    free them, making room, but more importantly avoiding
    flooding the application with EPOLLOUT events.

    Solution is to use tcp_rtx_and_write_queues_empty() helper.

    Fixes: 75c119afe14f ("tcp: implement rb-tree based retransmit queue")
    Signed-off-by: Eric Dumazet
    Cc: Jason Baron
    Cc: Neal Cardwell
    Acked-by: Soheil Hassas Yeganeh
    Signed-off-by: Jakub Kicinski
    Signed-off-by: Greg Kroah-Hartman

    Eric Dumazet
     
  • [ Upstream commit e176b1ba476cf36f723cfcc7a9e57f3cb47dec70 ]

    When the packet pointed to by retransmit_skb_hint is unlinked by ACK,
    retransmit_skb_hint will be set to NULL in tcp_clean_rtx_queue().
    If packet loss is detected at this time, retransmit_skb_hint will be set
    to point to the current packet loss in tcp_verify_retransmit_hint(),
    then the packets that were previously marked lost but not retransmitted
    due to the restriction of cwnd will be skipped and cannot be
    retransmitted.

    To fix this, when retransmit_skb_hint is NULL, retransmit_skb_hint can
    be reset only after all marked lost packets are retransmitted
    (retrans_out >= lost_out), otherwise we need to traverse from
    tcp_rtx_queue_head in tcp_xmit_retransmit_queue().
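
    The effect of the hint handling can be sketched with a toy model.
    This is not the kernel implementation: the ToyRtxQueue class and the
    scenario() helper are hypothetical, segments are plain indices, the
    cwnd restriction is reduced to a simple budget counter, and SACK and
    hole tracking are left out. Only the rule "when the hint is NULL,
    reset it only if retrans_out >= lost_out, otherwise walk from the
    queue head" is taken from the description above.

```python
class ToyRtxQueue:
    def __init__(self, nsegs, fixed):
        self.fixed = fixed
        self.queue = list(range(nsegs))  # unacked segments, in sequence order
        self.lost = set()
        self.retransmitted = set()
        self.hint = None                 # models retransmit_skb_hint

    @property
    def lost_out(self):
        return len(self.lost)

    @property
    def retrans_out(self):
        return len(self.retransmitted)

    def mark_lost(self, seg):
        # Models the hint check done when a segment is marked lost.
        if self.hint is None:
            if not self.fixed or self.retrans_out >= self.lost_out:
                self.hint = seg
        elif seg < self.hint:
            self.hint = seg
        self.lost.add(seg)

    def ack_through(self, seg):
        # Cumulative ACK: drop everything up to and including seg.
        self.queue = [s for s in self.queue if s > seg]
        self.lost = {s for s in self.lost if s > seg}
        self.retransmitted = {s for s in self.retransmitted if s > seg}
        if self.hint is not None and self.hint <= seg:
            self.hint = None

    def retransmit(self, budget):
        if not self.queue:
            return []
        start = self.hint if self.hint is not None else self.queue[0]
        sent = []
        for seg in self.queue:
            if seg < start:
                continue
            self.hint = seg
            if budget == 0 or self.retrans_out >= self.lost_out:
                break
            if seg in self.lost and seg not in self.retransmitted:
                self.retransmitted.add(seg)
                sent.append(seg)
                budget -= 1
        return sent


def scenario(fixed):
    q = ToyRtxQueue(8, fixed)
    for seg in (0, 1, 2):   # enter recovery, first three segments lost
        q.mark_lost(seg)
    q.retransmit(budget=1)  # only segment 0 fits; hint moves to segment 1
    q.ack_through(1)        # segments 0 and 1 ACKed, hint becomes None
    q.mark_lost(4)          # a later segment is now marked lost
    return q.retransmit(budget=2)
```

    In this model scenario(False) never retransmits segment 2, while
    scenario(True) retransmits it before the newly lost segment.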

    Packetdrill to demonstrate:

    // Disable RACK and set max_reordering to keep things simple
    0 `sysctl -q net.ipv4.tcp_recovery=0`
    +0 `sysctl -q net.ipv4.tcp_max_reordering=3`

    // Establish a connection
    +0 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
    +0 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
    +0 bind(3, ..., ...) = 0
    +0 listen(3, 1) = 0

    +.1 < S 0:0(0) win 32792
    +0 > S. 0:0(0) ack 1
    +.01 < . 1:1(0) ack 1 win 257
    +0 accept(3, ..., ...) = 4

    // Send 8 data segments
    +0 write(4, ..., 8000) = 8000
    +0 > P. 1:8001(8000) ack 1

    // Enter recovery and 1:3001 is marked lost
    +.01 < . 1:1(0) ack 1 win 257
    +0 < . 1:1(0) ack 1 win 257
    +0 < . 1:1(0) ack 1 win 257

    // Retransmit 1:1001, now retransmit_skb_hint points to 1001:2001
    +0 > . 1:1001(1000) ack 1

    // 1001:2001 was ACKed causing retransmit_skb_hint to be set to NULL
    +.01 < . 1:1(0) ack 2001 win 257
    // Now retransmit_skb_hint points to 4001:5001 which is now marked lost

    // BUG: 2001:3001 was not retransmitted
    +0 > . 2001:3001(1000) ack 1

    Signed-off-by: Pengcheng Yang
    Acked-by: Neal Cardwell
    Tested-by: Neal Cardwell
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Pengcheng Yang
     
  • commit 212e7f56605ef9688d0846db60c6c6ec06544095 upstream.

    An earlier commit (1b789577f655060d98d20e,
    "netfilter: arp_tables: init netns pointer in xt_tgchk_param struct")
    fixed missing net initialization for arptables, but it turns out it was
    incomplete. We can get a very similar struct net NULL deref during
    error unwinding:

    general protection fault: 0000 [#1] PREEMPT SMP KASAN
    RIP: 0010:xt_rateest_put+0xa1/0x440 net/netfilter/xt_RATEEST.c:77
    xt_rateest_tg_destroy+0x72/0xa0 net/netfilter/xt_RATEEST.c:175
    cleanup_entry net/ipv4/netfilter/arp_tables.c:509 [inline]
    translate_table+0x11f4/0x1d80 net/ipv4/netfilter/arp_tables.c:587
    do_replace net/ipv4/netfilter/arp_tables.c:981 [inline]
    do_arpt_set_ctl+0x317/0x650 net/ipv4/netfilter/arp_tables.c:1461

    Also init the netns pointer in xt_tgdtor_param struct.

    Fixes: add67461240c1d ("netfilter: add struct net * to target parameters")
    Reported-by: syzbot+91bdd8eece0f6629ec8b@syzkaller.appspotmail.com
    Signed-off-by: Florian Westphal
    Signed-off-by: Pablo Neira Ayuso
    Signed-off-by: Greg Kroah-Hartman

    Florian Westphal
     
  • commit e7a5f1f1cd0008e5ad379270a8657e121eedb669 upstream.

    Right now in tcp_bpf_recvmsg, the sock reads data from sk_receive_queue
    first if it is not empty, and from psock->ingress_msg otherwise. If a FIN
    packet arrives while there is also data in psock->ingress_msg, the data
    in psock->ingress_msg is purged. This always happens on a request to an
    HTTP/1.0 server such as Python's SimpleHTTPServer, since the server
    sends a FIN packet after the data is sent out.

    Fixes: 604326b41a6fb ("bpf, sockmap: convert to generic sk_msg interface")
    Reported-by: Arika Chen
    Suggested-by: Arika Chen
    Signed-off-by: Lingpeng Chen
    Signed-off-by: John Fastabend
    Signed-off-by: Daniel Borkmann
    Acked-by: Song Liu
    Link: https://lore.kernel.org/bpf/20200109014833.18951-1-forrest0579@gmail.com
    Signed-off-by: Greg Kroah-Hartman

    Lingpeng Chen
     
  • commit 7361d44896ff20d48bdd502d1a0cd66308055d45 upstream.

    When user returns SK_DROP we need to reset the number of copied bytes
    to indicate to the user the bytes were dropped and not sent. If we
    don't reset the copied arg sendmsg will return as if those bytes were
    copied giving the user a positive return value.

    This works as expected today except in the case where the user also
    pops bytes. In the pop case the sg.size is reduced but we don't correctly
    account for this when copied bytes is reset. The popped bytes are not
    accounted for and we return a small positive value potentially confusing
    the user.

    This happens because of a typo where we do the wrong comparison
    when accounting for pop bytes. In this fix, notice that the if/else is
    not needed and that we have a similar problem if we push data, except
    it's not visible to the user: if delta is larger than sg.size we return
    a negative value, so it appears as an error regardless.

    Fixes: 7246d8ed4dcce ("bpf: helper to pop data from messages")
    Signed-off-by: John Fastabend
    Signed-off-by: Daniel Borkmann
    Acked-by: Jonathan Lemon
    Cc: stable@vger.kernel.org
    Link: https://lore.kernel.org/bpf/20200111061206.8028-9-john.fastabend@gmail.com
    Signed-off-by: Greg Kroah-Hartman

    John Fastabend
     
  • commit 33bfe20dd7117dd81fd896a53f743a233e1ad64f upstream.

    When sockmap sock with TLS enabled is removed we cleanup bpf/psock state
    and call tcp_update_ulp() to push updates to TLS ULP on top. However, we
    don't push the write_space callback up and instead simply overwrite the
    op with the psock stored previous op. This may or may not be correct so
    to ensure we don't overwrite the TLS write space hook pass this field to
    the ULP and have it fixup the ctx.

    This completes a previous fix that pushed the ops through to the ULP
    but at the time missed doing this for write_space, presumably because
    write_space TLS hook was added around the same time.

    Fixes: 95fa145479fbc ("bpf: sockmap/tls, close can race with map free")
    Signed-off-by: John Fastabend
    Signed-off-by: Daniel Borkmann
    Reviewed-by: Jakub Sitnicki
    Acked-by: Jonathan Lemon
    Cc: stable@vger.kernel.org
    Link: https://lore.kernel.org/bpf/20200111061206.8028-4-john.fastabend@gmail.com
    Signed-off-by: Greg Kroah-Hartman

    John Fastabend
     

15 Jan, 2020

1 commit

  • commit 1b789577f655060d98d20ed0c6f9fbd469d6ba63 upstream.

    We get a crash when the target's checkentry function tries to make
    use of the network namespace pointer for arptables.

    When the net pointer got added back in 2010, only ip/ip6/ebtables were
    changed to initialize it, so arptables has this set to NULL.

    This isn't a problem for normal arptables because no existing
    arptables target has a checkentry function that makes use of par->net.

    However, direct users of the setsockopt interface can provide any
    target they want as long as it's registered for ARP or UNSPEC protocols.

    syzkaller managed to send a semi-valid arptables rule for RATEEST target
    which is enough to trigger NULL deref:

    kasan: GPF could be caused by NULL-ptr deref or user memory access
    general protection fault: 0000 [#1] PREEMPT SMP KASAN
    RIP: xt_rateest_tg_checkentry+0x11d/0xb40 net/netfilter/xt_RATEEST.c:109
    [..]
    xt_check_target+0x283/0x690 net/netfilter/x_tables.c:1019
    check_target net/ipv4/netfilter/arp_tables.c:399 [inline]
    find_check_entry net/ipv4/netfilter/arp_tables.c:422 [inline]
    translate_table+0x1005/0x1d70 net/ipv4/netfilter/arp_tables.c:572
    do_replace net/ipv4/netfilter/arp_tables.c:977 [inline]
    do_arpt_set_ctl+0x310/0x640 net/ipv4/netfilter/arp_tables.c:1456

    Fixes: add67461240c1d ("netfilter: add struct net * to target parameters")
    Reported-by: syzbot+d7358a458d8a81aee898@syzkaller.appspotmail.com
    Signed-off-by: Florian Westphal
    Acked-by: Cong Wang
    Signed-off-by: Pablo Neira Ayuso
    Signed-off-by: Greg Kroah-Hartman

    Florian Westphal
     

12 Jan, 2020

1 commit


09 Jan, 2020

2 commits

  • [ Upstream commit 7c68fa2bddda6d942bd387c9ba5b4300737fd991 ]

    sk->sk_pacing_shift can be read and written without lock
    synchronization. This patch adds annotations to
    document this fact and avoid future syzbot complaints.

    This might also avoid unexpected false sharing
    in sk_pacing_shift_update(), as the compiler
    could remove the conditional check and always
    write over sk->sk_pacing_shift :

    if (sk->sk_pacing_shift != val)
        sk->sk_pacing_shift = val;

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller
    Signed-off-by: Sasha Levin

    Eric Dumazet
     
  • [ Upstream commit a5a7daa52edb5197a3b696afee13ef174dc2e993 ]

    Reading tp->recvmsg_inq after socket lock is released
    raises a KCSAN warning [1]

    Replace has_tss & has_cmsg by cmsg_flags and make
    sure to not read tp->recvmsg_inq a second time.

    [1]
    BUG: KCSAN: data-race in tcp_chrono_stop / tcp_recvmsg

    write to 0xffff888126adef24 of 2 bytes by interrupt on cpu 0:
    tcp_chrono_set net/ipv4/tcp_output.c:2309 [inline]
    tcp_chrono_stop+0x14c/0x280 net/ipv4/tcp_output.c:2338
    tcp_clean_rtx_queue net/ipv4/tcp_input.c:3165 [inline]
    tcp_ack+0x274f/0x3170 net/ipv4/tcp_input.c:3688
    tcp_rcv_established+0x37e/0xf50 net/ipv4/tcp_input.c:5696
    tcp_v4_do_rcv+0x381/0x4e0 net/ipv4/tcp_ipv4.c:1561
    tcp_v4_rcv+0x19dc/0x1bb0 net/ipv4/tcp_ipv4.c:1942
    ip_protocol_deliver_rcu+0x4d/0x420 net/ipv4/ip_input.c:204
    ip_local_deliver_finish+0x110/0x140 net/ipv4/ip_input.c:231
    NF_HOOK include/linux/netfilter.h:305 [inline]
    NF_HOOK include/linux/netfilter.h:299 [inline]
    ip_local_deliver+0x133/0x210 net/ipv4/ip_input.c:252
    dst_input include/net/dst.h:442 [inline]
    ip_rcv_finish+0x121/0x160 net/ipv4/ip_input.c:413
    NF_HOOK include/linux/netfilter.h:305 [inline]
    NF_HOOK include/linux/netfilter.h:299 [inline]
    ip_rcv+0x18f/0x1a0 net/ipv4/ip_input.c:523
    __netif_receive_skb_one_core+0xa7/0xe0 net/core/dev.c:5010
    __netif_receive_skb+0x37/0xf0 net/core/dev.c:5124
    netif_receive_skb_internal+0x59/0x190 net/core/dev.c:5214
    napi_skb_finish net/core/dev.c:5677 [inline]
    napi_gro_receive+0x28f/0x330 net/core/dev.c:5710

    read to 0xffff888126adef25 of 1 bytes by task 7275 on cpu 1:
    tcp_recvmsg+0x77b/0x1a30 net/ipv4/tcp.c:2187
    inet_recvmsg+0xbb/0x250 net/ipv4/af_inet.c:838
    sock_recvmsg_nosec net/socket.c:871 [inline]
    sock_recvmsg net/socket.c:889 [inline]
    sock_recvmsg+0x92/0xb0 net/socket.c:885
    sock_read_iter+0x15f/0x1e0 net/socket.c:967
    call_read_iter include/linux/fs.h:1889 [inline]
    new_sync_read+0x389/0x4f0 fs/read_write.c:414
    __vfs_read+0xb1/0xc0 fs/read_write.c:427
    vfs_read fs/read_write.c:461 [inline]
    vfs_read+0x143/0x2c0 fs/read_write.c:446
    ksys_read+0xd5/0x1b0 fs/read_write.c:587
    __do_sys_read fs/read_write.c:597 [inline]
    __se_sys_read fs/read_write.c:595 [inline]
    __x64_sys_read+0x4c/0x60 fs/read_write.c:595
    do_syscall_64+0xcc/0x370 arch/x86/entry/common.c:290
    entry_SYSCALL_64_after_hwframe+0x44/0xa9

    Reported by Kernel Concurrency Sanitizer on:
    CPU: 1 PID: 7275 Comm: sshd Not tainted 5.4.0-rc3+ #0
    Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011

    Fixes: b75eba76d3d7 ("tcp: send in-queue bytes in cmsg upon read")
    Signed-off-by: Eric Dumazet
    Acked-by: Soheil Hassas Yeganeh
    Reported-by: syzbot
    Signed-off-by: David S. Miller
    Signed-off-by: Sasha Levin

    Eric Dumazet
     

05 Jan, 2020

9 commits

  • [ Upstream commit 8dbd76e79a16b45b2ccb01d2f2e08dbf64e71e40 ]

    Michal Kubecek and Firo Yang did a very nice analysis of crashes
    happening in __inet_lookup_established().

    Since a TCP socket can go from TCP_ESTABLISHED to TCP_LISTEN
    (via a close()/socket()/listen() cycle) without a RCU grace period,
    I should not have changed listeners linkage in their hash table.

    They must use the nulls protocol (Documentation/RCU/rculist_nulls.txt),
    so that a lookup can detect that a socket in one hash list was moved
    to another one.

    Since we added code in commit d296ba60d8e2 ("soreuseport: Resolve
    merge conflict for v4/v6 ordering fix"), we have to add
    hlist_nulls_add_tail_rcu() helper.

    Fixes: 3b24d854cb35 ("tcp/dccp: do not touch listener sk_refcnt under synflood")
    Signed-off-by: Eric Dumazet
    Reported-by: Michal Kubecek
    Reported-by: Firo Yang
    Reviewed-by: Michal Kubecek
    Link: https://lore.kernel.org/netdev/20191120083919.GH27852@unicorn.suse.cz/
    Signed-off-by: Jakub Kicinski
    Signed-off-by: Greg Kroah-Hartman

    Eric Dumazet
     
  • [ Upstream commit 1f85e6267caca44b30c54711652b0726fadbb131 ]

    Backport of commit fdfc5c8594c2 ("tcp: remove empty skb from
    write queue in error cases") in linux-4.14 stable triggered
    various bugs. One of them has been fixed in commit ba2ddb43f270
    ("tcp: Don't dequeue SYN/FIN-segments from write-queue"), but
    we still have crashes on some occasions.

    The root cause is that when tcp_sendmsg() has allocated a fresh
    skb and could not append a fragment before being blocked
    in sk_stream_wait_memory(), tcp_write_xmit() might be called
    and decide to send this fresh and empty skb.

    Sending an empty packet is not only silly, it might have caused
    many issues we had in the past with tp->packets_out being
    out of sync.

    Fixes: c65f7f00c587 ("[TCP]: Simplify SKB data portion allocation with NETIF_F_SG.")
    Signed-off-by: Eric Dumazet
    Cc: Christoph Paasch
    Acked-by: Neal Cardwell
    Cc: Jason Baron
    Acked-by: Soheil Hassas Yeganeh
    Signed-off-by: Jakub Kicinski
    Signed-off-by: Greg Kroah-Hartman

    Eric Dumazet
     
  • [ Upstream commit 8247a79efa2f28b44329f363272550c1738377de ]

    When we do an IPv6 tunnel PMTU update and call __ip6_rt_update_pmtu() at the
    end, we should not call dst_confirm_neigh() as there is no two-way
    communication.

    Although vti and vti6 are immune to this problem because they are IFF_NOARP
    interfaces, as Guillaume pointed out, there is still no sense in confirming
    the neighbour here.

    v5: Update commit description.
    v4: No change.
    v3: Do not remove dst_confirm_neigh, but add a new bool parameter in
    dst_ops.update_pmtu to control whether we should do neighbor confirm.
    Also split the big patch to small ones for each area.
    v2: Remove dst_confirm_neigh in __ip6_rt_update_pmtu.

    Reviewed-by: Guillaume Nault
    Acked-by: David Ahern
    Signed-off-by: Hangbin Liu
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Hangbin Liu
     
  • [ Upstream commit 7a1592bcb15d71400a98632727791d1e68ea0ee8 ]

    When we do a tunnel PMTU update and call __ip6_rt_update_pmtu() at the end,
    we should not call dst_confirm_neigh() as there is no two-way communication.

    v5: No Change.
    v4: Update commit description
    v3: Do not remove dst_confirm_neigh, but add a new bool parameter in
    dst_ops.update_pmtu to control whether we should do neighbor confirm.
    Also split the big patch to small ones for each area.
    v2: Remove dst_confirm_neigh in __ip6_rt_update_pmtu.

    Fixes: 0dec879f636f ("net: use dst_confirm_neigh for UDP, RAW, ICMP, L2TP")
    Reviewed-by: Guillaume Nault
    Tested-by: Guillaume Nault
    Acked-by: David Ahern
    Signed-off-by: Hangbin Liu
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Hangbin Liu
     
  • [ Upstream commit bd085ef678b2cc8c38c105673dfe8ff8f5ec0c57 ]

    The MTU update code is supposed to be invoked in response to real
    networking events that update the PMTU. In IPv6 PMTU update function
    __ip6_rt_update_pmtu() we called dst_confirm_neigh() to update neighbor
    confirmed time.

    But the tunnel code calls the PMTU update before xmit, like:
    - tnl_update_pmtu()
    - skb_dst_update_pmtu()
    - ip6_rt_update_pmtu()
    - __ip6_rt_update_pmtu()
    - dst_confirm_neigh()

    If the tunnel remote dst mac address changed and we still do the neigh
    confirm, we will not be able to update the neigh cache and ping6 to the
    remote will fail.

    So for this ip_tunnel_xmit() case, _EVEN_ if the MTU is changed, we
    should not be invoking dst_confirm_neigh() as we have no evidence
    of successful two-way communication at this point.

    On the other hand it is also important to keep the neigh reachability fresh
    for TCP flows, so we cannot remove this dst_confirm_neigh() call.

    To fix the issue, we have to add a new bool parameter for dst_ops.update_pmtu
    to choose whether we should do neigh update or not. I will add the parameter
    in this patch and set all the callers to true to comply with the previous
    behaviour, and fix the tunnel code one by one in later patches.
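
    The shape of that change can be sketched as below. The struct and function
    names are hypothetical and the real dst_ops.update_pmtu callback takes
    more arguments; the point is only that the caller now decides whether the
    neighbour gets confirmed.

```c
#include <stdbool.h>

/* Hypothetical, stripped-down stand-in for a route entry. */
struct dst_sketch {
    int mtu;
    int neigh_confirms;   /* counts neighbour confirmations */
};

/* The MTU is always updated; the neighbour is confirmed only when the
 * caller has evidence of two-way communication and passes true. */
static void update_pmtu_sketch(struct dst_sketch *dst, int mtu,
                               bool confirm_neigh)
{
    dst->mtu = mtu;
    if (confirm_neigh)
        dst->neigh_confirms++;
}
```

    A tunnel transmit path would pass false, while callers that keep the old
    behaviour pass true.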

    v5: No change.
    v4: No change.
    v3: Do not remove dst_confirm_neigh, but add a new bool parameter in
    dst_ops.update_pmtu to control whether we should do neighbor confirm.
    Also split the big patch to small ones for each area.
    v2: Remove dst_confirm_neigh in __ip6_rt_update_pmtu.

    Suggested-by: David Miller
    Reviewed-by: Guillaume Nault
    Acked-by: David Ahern
    Signed-off-by: Hangbin Liu
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Hangbin Liu
     
  • [ Upstream commit feed8a4fc9d46c3126fb9fcae0e9248270c6321a ]

    When the size of the receive buffer for a socket is close to 2^31, we
    might hit an integer overflow when computing whether there is enough
    space in the buffer to copy a packet from the queue.

    When a user sets net.core.rmem_default to a value close to 2^31, UDP
    packets are dropped because of this overflow. This can be visible, for
    instance, as a failure to resolve hostnames.

    This can be fixed by casting sk_rcvbuf (which is an int) to unsigned
    int, similarly to how it is done in TCP.
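
    A minimal userspace sketch of the comparison, before and after the cast.
    The helper names are hypothetical, and the signed variant forms the sum in
    unsigned arithmetic and converts it back to int so that the wraparound is
    well defined in this example (the conversion wraps on GCC and Clang).

```c
#include <limits.h>
#include <stdbool.h>

/* Pre-fix shape: the limit (size + rcvbuf) is an int, so it goes
 * negative once the sum exceeds INT_MAX and every packet looks like
 * it is over the limit. */
static bool over_limit_signed(int rmem, int size, int rcvbuf)
{
    int limit = (int)((unsigned int)size + (unsigned int)rcvbuf);

    return rmem > limit;
}

/* Post-fix shape: the limit is computed and compared as unsigned int,
 * so a receive buffer close to 2^31 no longer wraps. */
static bool over_limit_unsigned(int rmem, int size, int rcvbuf)
{
    return (unsigned int)rmem > (unsigned int)size + (unsigned int)rcvbuf;
}
```

    With rcvbuf = INT_MAX - 100 and a 2000-byte packet, the signed check
    reports "over limit" on an empty queue while the unsigned one does not.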

    Signed-off-by: Antonio Messina
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Antonio Messina
     
  • [ Upstream commit 853697504de043ff0bfd815bd3a64de1dce73dc7 ]

    From commit 50895b9de1d3 ("tcp: highest_sack fix"), the logic about
    setting tp->highest_sack to the head of the send queue was removed.
    Of course the logic is error prone, but it is logical. Before we
    remove the pointer to the highest sack skb and use the seq instead,
    we need to set tp->highest_sack to NULL when there is no skb after
    the last sack, and then replace NULL with the real skb when new skb
    inserted into the rtx queue, because the NULL means the highest sack
    seq is tp->snd_nxt. If tp->highest_sack is NULL and new data sent,
    the next ACK with sack option will increase tp->reordering unexpectedly.

    This patch sets tp->highest_sack to the tail of the rtx queue if
    it's NULL and new data is sent. The patch keeps the rule that the
    highest_sack can only be maintained by sack processing, except for
    this one case.

    Fixes: 50895b9de1d3 ("tcp: highest_sack fix")
    Signed-off-by: Cambda Zhu
    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Cambda Zhu
     
  • commit bbab7ef235031f6733b5429ae7877bfa22339712 upstream.

    This code reads two global variables without protection
    of a lock. We need READ_ONCE()/WRITE_ONCE() pairs to
    avoid load/store-tearing and better document the intent.
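
    As a rough userspace illustration of the pattern, the sketch below uses
    simplified volatile-cast stand-ins for READ_ONCE()/WRITE_ONCE() (the
    kernel definitions are more involved) on two hypothetical token-bucket
    globals. The bucket logic is an assumption made for the example and omits
    the spinlock a real slow path would take.

```c
#include <stdbool.h>
#include <stdint.h>

/* Simplified stand-ins: force exactly one full-width access. */
#define READ_ONCE(x)     (*(const volatile __typeof__(x) *)&(x))
#define WRITE_ONCE(x, v) (*(volatile __typeof__(x) *)&(x) = (v))

#define BURST 2u

/* Hypothetical shared state, read without a lock on the fast path. */
static uint32_t bucket_credit;
static uint32_t bucket_stamp;

static bool bucket_allow(uint32_t now)
{
    uint32_t credit = READ_ONCE(bucket_credit);

    /* Lockless early exit: bucket empty and no time has elapsed. */
    if (credit == 0 && READ_ONCE(bucket_stamp) == now)
        return false;

    /* Refill once per tick; the stores pair with the loads above. */
    if (READ_ONCE(bucket_stamp) != now) {
        credit = BURST;
        WRITE_ONCE(bucket_stamp, now);
    }
    credit--;
    WRITE_ONCE(bucket_credit, credit);
    return true;
}
```

    The annotations do not make the update atomic; they document the lockless
    accesses and keep the compiler from tearing or repeating them.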

    KCSAN reported :
    BUG: KCSAN: data-race in icmp_global_allow / icmp_global_allow

    read to 0xffffffff861a8014 of 4 bytes by task 11201 on cpu 0:
    icmp_global_allow+0x36/0x1b0 net/ipv4/icmp.c:254
    icmpv6_global_allow net/ipv6/icmp.c:184 [inline]
    icmpv6_global_allow net/ipv6/icmp.c:179 [inline]
    icmp6_send+0x493/0x1140 net/ipv6/icmp.c:514
    icmpv6_send+0x71/0xb0 net/ipv6/ip6_icmp.c:43
    ip6_link_failure+0x43/0x180 net/ipv6/route.c:2640
    dst_link_failure include/net/dst.h:419 [inline]
    vti_xmit net/ipv4/ip_vti.c:243 [inline]
    vti_tunnel_xmit+0x27f/0xa50 net/ipv4/ip_vti.c:279
    __netdev_start_xmit include/linux/netdevice.h:4420 [inline]
    netdev_start_xmit include/linux/netdevice.h:4434 [inline]
    xmit_one net/core/dev.c:3280 [inline]
    dev_hard_start_xmit+0xef/0x430 net/core/dev.c:3296
    __dev_queue_xmit+0x14c9/0x1b60 net/core/dev.c:3873
    dev_queue_xmit+0x21/0x30 net/core/dev.c:3906
    neigh_direct_output+0x1f/0x30 net/core/neighbour.c:1530
    neigh_output include/net/neighbour.h:511 [inline]
    ip6_finish_output2+0x7a6/0xec0 net/ipv6/ip6_output.c:116
    __ip6_finish_output net/ipv6/ip6_output.c:142 [inline]
    __ip6_finish_output+0x2d7/0x330 net/ipv6/ip6_output.c:127
    ip6_finish_output+0x41/0x160 net/ipv6/ip6_output.c:152
    NF_HOOK_COND include/linux/netfilter.h:294 [inline]
    ip6_output+0xf2/0x280 net/ipv6/ip6_output.c:175
    dst_output include/net/dst.h:436 [inline]
    ip6_local_out+0x74/0x90 net/ipv6/output_core.c:179

    write to 0xffffffff861a8014 of 4 bytes by task 11183 on cpu 1:
    icmp_global_allow+0x174/0x1b0 net/ipv4/icmp.c:272
    icmpv6_global_allow net/ipv6/icmp.c:184 [inline]
    icmpv6_global_allow net/ipv6/icmp.c:179 [inline]
    icmp6_send+0x493/0x1140 net/ipv6/icmp.c:514
    icmpv6_send+0x71/0xb0 net/ipv6/ip6_icmp.c:43
    ip6_link_failure+0x43/0x180 net/ipv6/route.c:2640
    dst_link_failure include/net/dst.h:419 [inline]
    vti_xmit net/ipv4/ip_vti.c:243 [inline]
    vti_tunnel_xmit+0x27f/0xa50 net/ipv4/ip_vti.c:279
    __netdev_start_xmit include/linux/netdevice.h:4420 [inline]
    netdev_start_xmit include/linux/netdevice.h:4434 [inline]
    xmit_one net/core/dev.c:3280 [inline]
    dev_hard_start_xmit+0xef/0x430 net/core/dev.c:3296
    __dev_queue_xmit+0x14c9/0x1b60 net/core/dev.c:3873
    dev_queue_xmit+0x21/0x30 net/core/dev.c:3906
    neigh_direct_output+0x1f/0x30 net/core/neighbour.c:1530
    neigh_output include/net/neighbour.h:511 [inline]
    ip6_finish_output2+0x7a6/0xec0 net/ipv6/ip6_output.c:116
    __ip6_finish_output net/ipv6/ip6_output.c:142 [inline]
    __ip6_finish_output+0x2d7/0x330 net/ipv6/ip6_output.c:127
    ip6_finish_output+0x41/0x160 net/ipv6/ip6_output.c:152
    NF_HOOK_COND include/linux/netfilter.h:294 [inline]
    ip6_output+0xf2/0x280 net/ipv6/ip6_output.c:175

    Reported by Kernel Concurrency Sanitizer on:
    CPU: 1 PID: 11183 Comm: syz-executor.2 Not tainted 5.4.0-rc3+ #0
    Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011

    Fixes: 4cdf507d5452 ("icmp: add a global rate limitation")
    Signed-off-by: Eric Dumazet
    Reported-by: syzbot
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Eric Dumazet
     
  • commit 71685eb4ce80ae9c49eff82ca4dd15acab215de9 upstream.

    We need to explicitly forbid read/store tearing in inet_peer_gc()
    and inet_putpeer().

    The following syzbot report reminds us about inet_putpeer()
    running without a lock held.

    BUG: KCSAN: data-race in inet_putpeer / inet_putpeer

    write to 0xffff888121fb2ed0 of 4 bytes by interrupt on cpu 0:
    inet_putpeer+0x37/0xa0 net/ipv4/inetpeer.c:240
    ip4_frag_free+0x3d/0x50 net/ipv4/ip_fragment.c:102
    inet_frag_destroy_rcu+0x58/0x80 net/ipv4/inet_fragment.c:228
    __rcu_reclaim kernel/rcu/rcu.h:222 [inline]
    rcu_do_batch+0x256/0x5b0 kernel/rcu/tree.c:2157
    rcu_core+0x369/0x4d0 kernel/rcu/tree.c:2377
    rcu_core_si+0x12/0x20 kernel/rcu/tree.c:2386
    __do_softirq+0x115/0x33f kernel/softirq.c:292
    invoke_softirq kernel/softirq.c:373 [inline]
    irq_exit+0xbb/0xe0 kernel/softirq.c:413
    exiting_irq arch/x86/include/asm/apic.h:536 [inline]
    smp_apic_timer_interrupt+0xe6/0x280 arch/x86/kernel/apic/apic.c:1137
    apic_timer_interrupt+0xf/0x20 arch/x86/entry/entry_64.S:830
    native_safe_halt+0xe/0x10 arch/x86/kernel/paravirt.c:71
    arch_cpu_idle+0x1f/0x30 arch/x86/kernel/process.c:571
    default_idle_call+0x1e/0x40 kernel/sched/idle.c:94
    cpuidle_idle_call kernel/sched/idle.c:154 [inline]
    do_idle+0x1af/0x280 kernel/sched/idle.c:263

    write to 0xffff888121fb2ed0 of 4 bytes by interrupt on cpu 1:
    inet_putpeer+0x37/0xa0 net/ipv4/inetpeer.c:240
    ip4_frag_free+0x3d/0x50 net/ipv4/ip_fragment.c:102
    inet_frag_destroy_rcu+0x58/0x80 net/ipv4/inet_fragment.c:228
    __rcu_reclaim kernel/rcu/rcu.h:222 [inline]
    rcu_do_batch+0x256/0x5b0 kernel/rcu/tree.c:2157
    rcu_core+0x369/0x4d0 kernel/rcu/tree.c:2377
    rcu_core_si+0x12/0x20 kernel/rcu/tree.c:2386
    __do_softirq+0x115/0x33f kernel/softirq.c:292
    run_ksoftirqd+0x46/0x60 kernel/softirq.c:603
    smpboot_thread_fn+0x37d/0x4a0 kernel/smpboot.c:165
    kthread+0x1d4/0x200 drivers/block/aoe/aoecmd.c:1253
    ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:352

    Reported by Kernel Concurrency Sanitizer on:
    CPU: 1 PID: 16 Comm: ksoftirqd/1 Not tainted 5.4.0-rc3+ #0
    Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011

    Fixes: 4b9d9be839fd ("inetpeer: remove unused list")
    Signed-off-by: Eric Dumazet
    Reported-by: syzbot
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Eric Dumazet
     

18 Dec, 2019

3 commits

  • [ Upstream commit 0e4940928c26527ce8f97237fef4c8a91cd34207 ]

    After pskb_may_pull() we should always refetch the header
    pointers from the skb->data in case it got reallocated.

    In gre_parse_header(), the erspan header is still fetched
    from the 'options' pointer which is fetched before
    pskb_may_pull().
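
    The rule can be illustrated in userspace with a buffer whose "pull"
    operation always moves the data. struct buf, buf_from(), buf_may_pull()
    and read_option_byte() are hypothetical stand-ins for this sketch, not
    kernel APIs; the safe pattern is to remember an offset and recompute the
    pointer afterwards.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

struct buf {
    unsigned char *data;
    size_t len;
};

/* Build a heap-backed buffer from raw bytes (data is NULL on failure). */
static struct buf buf_from(const void *src, size_t len)
{
    struct buf b = { malloc(len), len };

    if (b.data)
        memcpy(b.data, src, len);
    return b;
}

/* Stand-in for pskb_may_pull(): on success the bytes are moved to a
 * fresh allocation, so any pointer taken earlier is stale. */
static int buf_may_pull(struct buf *b, size_t need)
{
    unsigned char *fresh;

    if (need > b->len)
        return 0;
    fresh = malloc(b->len);
    if (!fresh)
        return 0;
    memcpy(fresh, b->data, b->len);
    free(b->data);
    b->data = fresh;
    return 1;
}

/* Keep an offset across the pull, then refetch the pointer from
 * b->data instead of reusing one computed before the pull. */
static int read_option_byte(struct buf *b, size_t opt_off)
{
    const unsigned char *options;

    if (!buf_may_pull(b, opt_off + 1))
        return -1;
    options = b->data + opt_off;   /* refetched after the pull */
    return options[0];
}
```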

    Found this during code review of a KMSAN bug report.

    Fixes: cb73ee40b1b3 ("net: ip_gre: use erspan key field for tunnel lookup")
    Cc: Lorenzo Bianconi
    Signed-off-by: Cong Wang
    Acked-by: Lorenzo Bianconi
    Acked-by: William Tu
    Reviewed-by: Simon Horman
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Cong Wang
     
  • [ Upstream commit 9424e2e7ad93ffffa88f882c9bc5023570904b55 ]

    Back in 2008, Adam Langley fixed the corner case of packets for flows
    having all of the following options : MD5 TS SACK

    Since MD5 needs 20 bytes, and TS needs 12 bytes, no sack block
    can be cooked from the remaining 8 bytes.

    tcp_established_options() correctly sets opts->num_sack_blocks
    to zero, but returns 36 instead of 32.

    This means TCP cooks packets with 4 extra bytes at the end
    of options, containing uninitialized bytes.
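
    The arithmetic can be checked with a small sketch. The constants are the
    option sizes as I recall them from the kernel headers and should be read
    as assumptions; options_size() is a hypothetical helper, and its 'fixed'
    switch mirrors the behaviour described here rather than the exact
    upstream diff.

```c
#include <stdbool.h>

#define MAX_TCP_OPTION_SPACE      40
#define TCPOLEN_MD5SIG_ALIGNED    20
#define TCPOLEN_TSTAMP_ALIGNED    12
#define TCPOLEN_SACK_BASE_ALIGNED  4
#define TCPOLEN_SACK_PERBLOCK      8

/* Option bytes for a flow that uses MD5 and timestamps and has
 * 'eff_sacks' SACK blocks it would like to send. */
static unsigned int options_size(unsigned int eff_sacks, bool fixed,
                                 unsigned int *num_sack_blocks)
{
    unsigned int size = TCPOLEN_MD5SIG_ALIGNED + TCPOLEN_TSTAMP_ALIGNED;
    unsigned int remaining = MAX_TCP_OPTION_SPACE - size;   /* 8 bytes */
    unsigned int fit = (remaining - TCPOLEN_SACK_BASE_ALIGNED) /
                       TCPOLEN_SACK_PERBLOCK;               /* 0 blocks */

    *num_sack_blocks = eff_sacks < fit ? eff_sacks : fit;

    /* Pre-fix: the 4-byte SACK base is counted whenever SACKs are
     * pending. Fixed: it is counted only if a block actually fits. */
    if (eff_sacks && (!fixed || *num_sack_blocks))
        size += TCPOLEN_SACK_BASE_ALIGNED +
                *num_sack_blocks * TCPOLEN_SACK_PERBLOCK;
    return size;
}
```

    With one pending SACK block, the pre-fix path yields 36 bytes with zero
    blocks, while the fixed path yields 32.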

    Fixes: 33ad798c924b ("tcp: options clean up")
    Signed-off-by: Eric Dumazet
    Reported-by: syzbot
    Acked-by: Neal Cardwell
    Acked-by: Soheil Hassas Yeganeh
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Eric Dumazet
     
  • [ Upstream commit 501a90c945103e8627406763dac418f20f3837b2 ]

    syzbot was once again able to crash a host by setting a very small mtu
    on loopback device.

    Let's make inetdev_valid_mtu() available in include/net/ip.h,
    and use it in ip_setup_cork(), so that we protect both ip_append_page()
    and __ip_append_data()

    Also add a READ_ONCE() when the device mtu is read.

    Pair this lockless read with one WRITE_ONCE() in __dev_set_mtu(),
    even if other code paths might write over this field.

    Add a big comment in include/linux/netdevice.h about dev->mtu
    needing READ_ONCE()/WRITE_ONCE() annotations.

    Hopefully we will add the missing ones in followup patches.
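
    The check itself is small. The sketch below assumes the IPv4 minimum MTU
    of 68 bytes; valid_mtu() and setup_cork_sketch() are hypothetical
    stand-ins, and the error code is an assumption of this example.

```c
#include <errno.h>
#include <stdbool.h>

#define IPV4_MIN_MTU 68   /* smallest MTU IPv4 must support (RFC 791) */

static bool valid_mtu(unsigned int mtu)
{
    return mtu >= IPV4_MIN_MTU;
}

/* Reject a too-small device MTU once, at cork setup time, so that
 * every append path that relies on the cork sees a sane fragment size. */
static int setup_cork_sketch(unsigned int dev_mtu, unsigned int *fragsize)
{
    if (!valid_mtu(dev_mtu))
        return -ENETUNREACH;
    *fragsize = dev_mtu;
    return 0;
}
```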

    [1]

    refcount_t: saturated; leaking memory.
    WARNING: CPU: 0 PID: 9464 at lib/refcount.c:22 refcount_warn_saturate+0x138/0x1f0 lib/refcount.c:22
    Kernel panic - not syncing: panic_on_warn set ...
    CPU: 0 PID: 9464 Comm: syz-executor850 Not tainted 5.4.0-syzkaller #0
    Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
    Call Trace:
    __dump_stack lib/dump_stack.c:77 [inline]
    dump_stack+0x197/0x210 lib/dump_stack.c:118
    panic+0x2e3/0x75c kernel/panic.c:221
    __warn.cold+0x2f/0x3e kernel/panic.c:582
    report_bug+0x289/0x300 lib/bug.c:195
    fixup_bug arch/x86/kernel/traps.c:174 [inline]
    fixup_bug arch/x86/kernel/traps.c:169 [inline]
    do_error_trap+0x11b/0x200 arch/x86/kernel/traps.c:267
    do_invalid_op+0x37/0x50 arch/x86/kernel/traps.c:286
    invalid_op+0x23/0x30 arch/x86/entry/entry_64.S:1027
    RIP: 0010:refcount_warn_saturate+0x138/0x1f0 lib/refcount.c:22
    Code: 06 31 ff 89 de e8 c8 f5 e6 fd 84 db 0f 85 6f ff ff ff e8 7b f4 e6 fd 48 c7 c7 e0 71 4f 88 c6 05 56 a6 a4 06 01 e8 c7 a8 b7 fd 0b e9 50 ff ff ff e8 5c f4 e6 fd 0f b6 1d 3d a6 a4 06 31 ff 89
    RSP: 0018:ffff88809689f550 EFLAGS: 00010286
    RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
    RDX: 0000000000000000 RSI: ffffffff815e4336 RDI: ffffed1012d13e9c
    RBP: ffff88809689f560 R08: ffff88809c50a3c0 R09: fffffbfff15d31b1
    R10: fffffbfff15d31b0 R11: ffffffff8ae98d87 R12: 0000000000000001
    R13: 0000000000040100 R14: ffff888099041104 R15: ffff888218d96e40
    refcount_add include/linux/refcount.h:193 [inline]
    skb_set_owner_w+0x2b6/0x410 net/core/sock.c:1999
    sock_wmalloc+0xf1/0x120 net/core/sock.c:2096
    ip_append_page+0x7ef/0x1190 net/ipv4/ip_output.c:1383
    udp_sendpage+0x1c7/0x480 net/ipv4/udp.c:1276
    inet_sendpage+0xdb/0x150 net/ipv4/af_inet.c:821
    kernel_sendpage+0x92/0xf0 net/socket.c:3794
    sock_sendpage+0x8b/0xc0 net/socket.c:936
    pipe_to_sendpage+0x2da/0x3c0 fs/splice.c:458
    splice_from_pipe_feed fs/splice.c:512 [inline]
    __splice_from_pipe+0x3ee/0x7c0 fs/splice.c:636
    splice_from_pipe+0x108/0x170 fs/splice.c:671
    generic_splice_sendpage+0x3c/0x50 fs/splice.c:842
    do_splice_from fs/splice.c:861 [inline]
    direct_splice_actor+0x123/0x190 fs/splice.c:1035
    splice_direct_to_actor+0x3b4/0xa30 fs/splice.c:990
    do_splice_direct+0x1da/0x2a0 fs/splice.c:1078
    do_sendfile+0x597/0xd00 fs/read_write.c:1464
    __do_sys_sendfile64 fs/read_write.c:1525 [inline]
    __se_sys_sendfile64 fs/read_write.c:1511 [inline]
    __x64_sys_sendfile64+0x1dd/0x220 fs/read_write.c:1511
    do_syscall_64+0xfa/0x790 arch/x86/entry/common.c:294
    entry_SYSCALL_64_after_hwframe+0x49/0xbe
    RIP: 0033:0x441409
    Code: e8 ac e8 ff ff 48 83 c4 18 c3 0f 1f 80 00 00 00 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 3d 01 f0 ff ff 0f 83 eb 08 fc ff c3 66 2e 0f 1f 84 00 00 00 00
    RSP: 002b:00007fffb64c4f78 EFLAGS: 00000246 ORIG_RAX: 0000000000000028
    RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 0000000000441409
    RDX: 0000000000000000 RSI: 0000000000000006 RDI: 0000000000000005
    RBP: 0000000000073b8a R08: 0000000000000010 R09: 0000000000000010
    R10: 0000000000010001 R11: 0000000000000246 R12: 0000000000402180
    R13: 0000000000402210 R14: 0000000000000000 R15: 0000000000000000
    Kernel Offset: disabled
    Rebooting in 86400 seconds..

    Fixes: 1470ddf7f8ce ("inet: Remove explicit write references to sk/inet in ip_append_data")
    Signed-off-by: Eric Dumazet
    Reported-by: syzbot
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Eric Dumazet
     

05 Dec, 2019

1 commit

  • [ Upstream commit 031097d9e079e40dce401031d1012e83d80eaf01 ]

    TLS 1.3 started using the entry at the end of the SG array
    for chaining-in the single byte content type entry. This mostly
    works:

    [ E E E E E E . . ]
    ^ ^
    start end

    E < content type
    /
    [ E E E E E E C . ]
    ^ ^
    start end

    (Where E denotes a populated SG entry; C denotes a chaining entry.)

    If the array is full, however, the end will point to the start:

    [ E E E E E E E E ]
    ^
    start
    end

    And we end up overwriting the start:

    E < content type
    /
    [ C E E E E E E E ]
    ^
    start
    end

    The sg array is supposed to be a circular buffer with start and
    end markers pointing anywhere. In case where start > end
    (i.e. the circular buffer has "wrapped") there is an extra entry
    reserved at the end to chain the two halves together.

    [ E E E E E E . . l ]

    (Where l is the reserved entry for "looping" back to front.)

    As suggested by John, let's reserve another entry for chaining
    SG entries after the main circular buffer. Note that this entry
    has to be pointed to by the end entry so its position is not fixed.

    Examples of full messages:

    [ E E E E E E E E . l ]
    ^ ^
    start end
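
    The failure mode in the diagrams above comes down to index arithmetic on
    a ring. The sketch below is a simplified model with hypothetical names
    (struct sg_ring, an 8-slot ring as in the pictures), not the kernel's
    sk_msg layout: once every slot is used, 'end' has wrapped onto 'start',
    so writing a chaining entry at 'end' would overwrite the first element.

```c
#include <stdbool.h>

#define RING_SLOTS 8u

struct sg_ring {
    unsigned int start;   /* index of the first populated entry */
    unsigned int end;     /* index one past the last populated entry */
    unsigned int used;    /* number of populated entries */
};

/* Populate one more entry; returns false if the ring is already full. */
static bool sg_ring_add(struct sg_ring *r)
{
    if (r->used == RING_SLOTS)
        return false;
    r->end = (r->end + 1) % RING_SLOTS;
    r->used++;
    return true;
}

/* True when placing a chaining entry at 'end' would land on the slot
 * that still holds the first data entry. */
static bool chain_at_end_clobbers_start(const struct sg_ring *r)
{
    return r->used == RING_SLOTS && r->end == r->start;
}
```

    Reserving a slot outside the ring for the chaining entry avoids this
    collision when the ring is full.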


    Reviewed-by: Simon Horman
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Jakub Kicinski
     

23 Nov, 2019

1 commit

  • Once udp stack has set the UDP_SKB_IS_STATELESS flag, later skb free
    assumes all skb head state has been dropped already.

    This will leak the extension memory in case the skb has extensions other
    than the ipsec secpath, e.g. bridge nf data.

    To fix this, set the UDP_SKB_IS_STATELESS flag only if we don't have
    extensions or if the extension space can be free'd.

    Fixes: 895b5c9f206eb7d25dc1360a ("netfilter: drop bridge nf reset from nf_reset")
    Cc: Paolo Abeni
    Reported-by: Byron Stanoszek
    Signed-off-by: Florian Westphal
    Acked-by: Paolo Abeni
    Signed-off-by: David S. Miller

    Florian Westphal
     

19 Nov, 2019

1 commit

  • Commit eec4844fae7c ("proc/sysctl: add shared variables for range
    check") did:
    - .extra2 = &two,
    + .extra2 = SYSCTL_ONE,
    here, which doesn't seem to be intentional, given the changelog.
    This patch restores it to the previous, as the value of 2 still makes
    sense (used in fib_multipath_hash()).

    Fixes: eec4844fae7c ("proc/sysctl: add shared variables for range check")
    Cc: Matteo Croce
    Signed-off-by: Marcelo Ricardo Leitner
    Acked-by: Matteo Croce
    Signed-off-by: David S. Miller

    Marcelo Ricardo Leitner
     

17 Nov, 2019

1 commit

  • In route.c, inet_rtm_getroute_build_skb() creates an skb with no
    headroom. This skb is then used by inet_rtm_getroute() which may pass
    it to rt_fill_info() and, from there, to ipmr_get_route(). The latter
    might try to reuse this skb by cloning it and prepending an IPv4
    header. But since the original skb has no headroom, skb_push() triggers
    skb_under_panic():

    skbuff: skb_under_panic: text:00000000ca46ad8a len:80 put:20 head:00000000cd28494e data:000000009366fd6b tail:0x3c end:0xec0 dev:veth0
    ------------[ cut here ]------------
    kernel BUG at net/core/skbuff.c:108!
    invalid opcode: 0000 [#1] SMP KASAN PTI
    CPU: 6 PID: 587 Comm: ip Not tainted 5.4.0-rc6+ #1
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-2.fc30 04/01/2014
    RIP: 0010:skb_panic+0xbf/0xd0
    Code: 41 a2 ff 8b 4b 70 4c 8b 4d d0 48 c7 c7 20 76 f5 8b 44 8b 45 bc 48 8b 55 c0 48 8b 75 c8 41 54 41 57 41 56 41 55 e8 75 dc 7a ff 0b 0f 1f 44 00 00 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00
    RSP: 0018:ffff888059ddf0b0 EFLAGS: 00010286
    RAX: 0000000000000086 RBX: ffff888060a315c0 RCX: ffffffff8abe4822
    RDX: 0000000000000000 RSI: 0000000000000008 RDI: ffff88806c9a79cc
    RBP: ffff888059ddf118 R08: ffffed100d9361b1 R09: ffffed100d9361b0
    R10: ffff88805c68aee3 R11: ffffed100d9361b1 R12: ffff88805d218000
    R13: ffff88805c689fec R14: 000000000000003c R15: 0000000000000ec0
    FS: 00007f6af184b700(0000) GS:ffff88806c980000(0000) knlGS:0000000000000000
    CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: 00007ffc8204a000 CR3: 0000000057b40006 CR4: 0000000000360ee0
    DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
    DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
    Call Trace:
    skb_push+0x7e/0x80
    ipmr_get_route+0x459/0x6fa
    rt_fill_info+0x692/0x9f0
    inet_rtm_getroute+0xd26/0xf20
    rtnetlink_rcv_msg+0x45d/0x630
    netlink_rcv_skb+0x1a5/0x220
    rtnetlink_rcv+0x15/0x20
    netlink_unicast+0x305/0x3a0
    netlink_sendmsg+0x575/0x730
    sock_sendmsg+0xb5/0xc0
    ___sys_sendmsg+0x497/0x4f0
    __sys_sendmsg+0xcb/0x150
    __x64_sys_sendmsg+0x48/0x50
    do_syscall_64+0xd2/0xac0
    entry_SYSCALL_64_after_hwframe+0x49/0xbe

    Actually the original skb used to have enough headroom, but the
    skb_reserve() call was lost with the introduction of
    inet_rtm_getroute_build_skb() by commit 404eb77ea766 ("ipv4: support
    sport, dport and ip_proto in RTM_GETROUTE").

    We could reserve some headroom again in inet_rtm_getroute_build_skb(),
    but this function shouldn't be responsible for handling the special
    case of ipmr_get_route(). Let's handle that directly in
    ipmr_get_route() by calling skb_realloc_headroom() instead of
    skb_clone().
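
    The difference between the two approaches can be modelled in userspace.
    struct pkt and its helpers are hypothetical stand-ins, and pkt_push()
    returns NULL where the kernel's skb_push() would panic. A copy made with
    extra headroom can take the 20-byte header that the original cannot.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

struct pkt {
    unsigned char *head;   /* start of the allocation */
    size_t data_off;       /* payload offset, i.e. current headroom */
    size_t len;            /* payload length */
};

static struct pkt *pkt_alloc(size_t headroom, const void *payload, size_t len)
{
    struct pkt *p = malloc(sizeof(*p));

    if (!p)
        return NULL;
    p->head = malloc(headroom + len);
    if (!p->head) {
        free(p);
        return NULL;
    }
    p->data_off = headroom;
    p->len = len;
    memcpy(p->head + headroom, payload, len);
    return p;
}

/* Prepend n bytes; fails when there is not enough headroom. */
static unsigned char *pkt_push(struct pkt *p, size_t n)
{
    if (n > p->data_off)
        return NULL;
    p->data_off -= n;
    p->len += n;
    return p->head + p->data_off;
}

/* Rough analogue of skb_realloc_headroom(): a private copy of the
 * payload with the requested headroom in front of it. */
static struct pkt *pkt_copy_with_headroom(const struct pkt *p, size_t headroom)
{
    return pkt_alloc(headroom, p->head + p->data_off, p->len);
}

static void pkt_free(struct pkt *p)
{
    if (p) {
        free(p->head);
        free(p);
    }
}
```

    With a 60-byte packet and no headroom, pushing 20 bytes fails on the
    original but succeeds on the copy, giving the len:80 put:20 case from the
    trace above without the underflow.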

    Fixes: 404eb77ea766 ("ipv4: support sport, dport and ip_proto in RTM_GETROUTE")
    Signed-off-by: Guillaume Nault
    Signed-off-by: David S. Miller

    Guillaume Nault