01 Oct, 2020

2 commits

  • [ Upstream commit a3ea86739f1bc7e121d921842f0f4a8ab1af94d9 ]

    If the seq_file .next function does not change the position index,
    a read after some lseek can generate unexpected output.
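
    As a userspace sketch of that contract (the toy_* names are hypothetical
    and this is not the kernel seq_file API), a .next-style callback can bump
    the position index unconditionally, even when it reports the end:

```c
#include <stddef.h>

/* Toy table standing in for whatever a seq_file would iterate over. */
static const int toy_items[] = { 10, 20, 30 };
#define TOY_COUNT (sizeof(toy_items) / sizeof(toy_items[0]))

/* Return the element at *pos, or NULL when *pos is out of range. */
static const int *toy_start(long *pos)
{
	if (*pos < 0 || (size_t)*pos >= TOY_COUNT)
		return NULL;
	return &toy_items[*pos];
}

/* Advance to the next element. The index is incremented unconditionally,
 * so a later read that resumes from *pos does not revisit the last item. */
static const int *toy_next(const int *v, long *pos)
{
	(void)v;
	++*pos;
	return toy_start(pos);
}
```

    If toy_next() returned NULL without touching *pos, a reader resuming at
    the saved position would see the final element a second time.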

    https://bugzilla.kernel.org/show_bug.cgi?id=206283
    Signed-off-by: Vasily Averin
    Signed-off-by: David S. Miller
    Signed-off-by: Sasha Levin

    Vasily Averin
     
  • [ Upstream commit 9ed498c6280a2f2b51d02df96df53037272ede49 ]

    sk->sk_backlog.tail might be read without holding the socket spinlock,
    we need to add proper READ_ONCE()/WRITE_ONCE() to silence the warnings.
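
    A minimal userspace sketch of the pattern follows. The macros are
    simplified stand-ins for the kernel's READ_ONCE()/WRITE_ONCE() (they rely
    on the GCC/Clang __typeof__ extension), and the toy_* types are
    hypothetical, not the real struct sock layout:

```c
#include <stddef.h>

/* Simplified stand-ins for the kernel macros: force one volatile access. */
#define READ_ONCE(x)       (*(const volatile __typeof__(x) *)&(x))
#define WRITE_ONCE(x, val) (*(volatile __typeof__(x) *)&(x) = (val))

struct toy_skb { struct toy_skb *next; };

struct toy_backlog {
	struct toy_skb *head;
	struct toy_skb *tail;
};

/* Writer side; in the kernel this would run under the socket spinlock. */
static void toy_add_backlog(struct toy_backlog *b, struct toy_skb *skb)
{
	skb->next = NULL;
	if (!b->tail)
		WRITE_ONCE(b->head, skb);
	else
		b->tail->next = skb;
	WRITE_ONCE(b->tail, skb);
}

/* Lockless reader: only asks whether anything is queued. */
static int toy_backlog_pending(struct toy_backlog *b)
{
	return READ_ONCE(b->tail) != NULL;
}
```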

    KCSAN reported :

    BUG: KCSAN: data-race in tcp_add_backlog / tcp_recvmsg

    write to 0xffff8881265109f8 of 8 bytes by interrupt on cpu 1:
    __sk_add_backlog include/net/sock.h:907 [inline]
    sk_add_backlog include/net/sock.h:938 [inline]
    tcp_add_backlog+0x476/0xce0 net/ipv4/tcp_ipv4.c:1759
    tcp_v4_rcv+0x1a70/0x1bd0 net/ipv4/tcp_ipv4.c:1947
    ip_protocol_deliver_rcu+0x4d/0x420 net/ipv4/ip_input.c:204
    ip_local_deliver_finish+0x110/0x140 net/ipv4/ip_input.c:231
    NF_HOOK include/linux/netfilter.h:305 [inline]
    NF_HOOK include/linux/netfilter.h:299 [inline]
    ip_local_deliver+0x133/0x210 net/ipv4/ip_input.c:252
    dst_input include/net/dst.h:442 [inline]
    ip_rcv_finish+0x121/0x160 net/ipv4/ip_input.c:413
    NF_HOOK include/linux/netfilter.h:305 [inline]
    NF_HOOK include/linux/netfilter.h:299 [inline]
    ip_rcv+0x18f/0x1a0 net/ipv4/ip_input.c:523
    __netif_receive_skb_one_core+0xa7/0xe0 net/core/dev.c:4929
    __netif_receive_skb+0x37/0xf0 net/core/dev.c:5043
    netif_receive_skb_internal+0x59/0x190 net/core/dev.c:5133
    napi_skb_finish net/core/dev.c:5596 [inline]
    napi_gro_receive+0x28f/0x330 net/core/dev.c:5629
    receive_buf+0x284/0x30b0 drivers/net/virtio_net.c:1061
    virtnet_receive drivers/net/virtio_net.c:1323 [inline]
    virtnet_poll+0x436/0x7d0 drivers/net/virtio_net.c:1428
    napi_poll net/core/dev.c:6311 [inline]
    net_rx_action+0x3ae/0xa90 net/core/dev.c:6379
    __do_softirq+0x115/0x33f kernel/softirq.c:292
    invoke_softirq kernel/softirq.c:373 [inline]
    irq_exit+0xbb/0xe0 kernel/softirq.c:413
    exiting_irq arch/x86/include/asm/apic.h:536 [inline]
    do_IRQ+0xa6/0x180 arch/x86/kernel/irq.c:263
    ret_from_intr+0x0/0x19
    native_safe_halt+0xe/0x10 arch/x86/kernel/paravirt.c:71
    arch_cpu_idle+0x1f/0x30 arch/x86/kernel/process.c:571
    default_idle_call+0x1e/0x40 kernel/sched/idle.c:94
    cpuidle_idle_call kernel/sched/idle.c:154 [inline]
    do_idle+0x1af/0x280 kernel/sched/idle.c:263
    cpu_startup_entry+0x1b/0x20 kernel/sched/idle.c:355
    start_secondary+0x208/0x260 arch/x86/kernel/smpboot.c:264
    secondary_startup_64+0xa4/0xb0 arch/x86/kernel/head_64.S:241

    read to 0xffff8881265109f8 of 8 bytes by task 8057 on cpu 0:
    tcp_recvmsg+0x46e/0x1b40 net/ipv4/tcp.c:2050
    inet_recvmsg+0xbb/0x250 net/ipv4/af_inet.c:838
    sock_recvmsg_nosec net/socket.c:871 [inline]
    sock_recvmsg net/socket.c:889 [inline]
    sock_recvmsg+0x92/0xb0 net/socket.c:885
    sock_read_iter+0x15f/0x1e0 net/socket.c:967
    call_read_iter include/linux/fs.h:1889 [inline]
    new_sync_read+0x389/0x4f0 fs/read_write.c:414
    __vfs_read+0xb1/0xc0 fs/read_write.c:427
    vfs_read fs/read_write.c:461 [inline]
    vfs_read+0x143/0x2c0 fs/read_write.c:446
    ksys_read+0xd5/0x1b0 fs/read_write.c:587
    __do_sys_read fs/read_write.c:597 [inline]
    __se_sys_read fs/read_write.c:595 [inline]
    __x64_sys_read+0x4c/0x60 fs/read_write.c:595
    do_syscall_64+0xcc/0x370 arch/x86/entry/common.c:290
    entry_SYSCALL_64_after_hwframe+0x44/0xa9

    Reported by Kernel Concurrency Sanitizer on:
    CPU: 0 PID: 8057 Comm: syz-fuzzer Not tainted 5.4.0-rc6+ #0
    Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller
    Signed-off-by: Sasha Levin

    Eric Dumazet
     

27 Sep, 2020

3 commits

  • [ Upstream commit 2fbc6e89b2f1403189e624cabaf73e189c5e50c6 ]

    Kfir reported that pmtu exceptions are not created properly for
    deployments where multipath routes use the same device.

    After some digging I see 2 compounding problems:
    1. ip_route_output_key_hash_rcu is updating the flowi4_oif *after*
    the route lookup. This is the second use case where this has
    been a problem (the first is related to use of vti devices with
    VRF). I can not find any reason for the oif to be changed after the
    lookup; the code goes back to the start of git. It does not seem
    logical so remove it.

    2. fib_lookups for exceptions do not call fib_select_path to handle
    multipath route selection based on the hash.

    The end result is that the fib_lookup used to add the exception
    always creates it using the first leg of the route.
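
    To illustrate why skipping the selection step pins everything to the
    first leg, here is a rough sketch of hash-based leg selection. It assumes
    each leg owns a range of hash values up to an upper bound; the toy_* names
    and the bounds are made up for illustration and do not come from the patch:

```c
#include <stdint.h>

/* Hypothetical multipath leg: owns hash values up to upper_bound. */
struct toy_leg {
	const char *gw;
	uint32_t upper_bound;
};

/* Return the index of the first leg whose upper bound covers the hash,
 * or -1 if no leg does. */
static int toy_select_leg(const struct toy_leg *legs, int n, uint32_t hash)
{
	for (int i = 0; i < n; i++) {
		if (hash <= legs[i].upper_bound)
			return i;
	}
	return -1;
}
```

    A lookup that never calls such a selector and simply takes index 0 would
    record every exception against the first gateway, whatever the hash.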

    An example topology showing the problem:

                     |  host1
                 +------+
                 | eth0 |  .209
                 +------+
                     |
                 +------+
         switch  | br0  |
                 +------+
                     |
           +---------+---------+
           |  host2            |  host3
        +------+            +------+
        | eth0 | .250       | eth0 | 192.168.252.252
        +------+            +------+

        +-----+             +-----+
        | vti | .2          | vti | 192.168.247.3
        +-----+             +-----+
               \            /
        =================================
        tunnels
                192.168.247.1/24

    for h in host1 host2 host3; do
        ip netns add ${h}
        ip -netns ${h} link set lo up
        ip netns exec ${h} sysctl -wq net.ipv4.ip_forward=1
    done

    ip netns add switch
    ip -netns switch li set lo up
    ip -netns switch link add br0 type bridge stp 0
    ip -netns switch link set br0 up

    for n in 1 2 3; do
        ip -netns switch link add eth-sw type veth peer name eth-h${n}
        ip -netns switch li set eth-h${n} master br0 up
        ip -netns switch li set eth-sw netns host${n} name eth0
    done

    ip -netns host1 addr add 192.168.252.209/24 dev eth0
    ip -netns host1 link set dev eth0 up
    ip -netns host1 route add 192.168.247.0/24 \
    nexthop via 192.168.252.250 dev eth0 nexthop via 192.168.252.252 dev eth0

    ip -netns host2 addr add 192.168.252.250/24 dev eth0
    ip -netns host2 link set dev eth0 up

    ip -netns host3 addr add 192.168.252.252/24 dev eth0
    ip -netns host3 link set dev eth0 up

    ip netns add tunnel
    ip -netns tunnel li set lo up
    ip -netns tunnel li add br0 type bridge
    ip -netns tunnel li set br0 up
    for n in $(seq 11 20); do
        ip -netns tunnel addr add dev br0 192.168.247.${n}/24
    done

    for n in 2 3
    do
        ip -netns tunnel link add vti${n} type veth peer name eth${n}
        ip -netns tunnel link set eth${n} mtu 1360 master br0 up
        ip -netns tunnel link set vti${n} netns host${n} mtu 1360 up
        ip -netns host${n} addr add dev vti${n} 192.168.247.${n}/24
    done
    ip -netns tunnel ro add default nexthop via 192.168.247.2 nexthop via 192.168.247.3

    ip netns exec host1 ping -M do -s 1400 -c3 -I 192.168.252.209 192.168.247.11
    ip netns exec host1 ping -M do -s 1400 -c3 -I 192.168.252.209 192.168.247.15
    ip -netns host1 ro ls cache

    Before this patch the cache always shows exceptions against the first
    leg in the multipath route; 192.168.252.250 per this example. Since the
    hash has an initial random seed, you may need to vary the final octet
    more than what is listed. In my tests, using addresses between 11 and 19
    usually found 1 that used both legs.

    With this patch, the cache will have exceptions for both legs.

    Fixes: 4895c771c7f0 ("ipv4: Add FIB nexthop exceptions")
    Reported-by: Kfir Itzhak
    Signed-off-by: David Ahern
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    David Ahern
     
  • [ Upstream commit 1869e226a7b3ef75b4f70ede2f1b7229f7157fa4 ]

    flowi4_multipath_hash was added by the commit referenced below for
    tunnels. Unfortunately, the patch did not initialize the new field
    for several fast path lookups that do not initialize the entire flow
    struct to 0. Fix those locations. Currently, flowi4_multipath_hash
    is random garbage and affects the hash value computed by
    fib_multipath_hash for multipath selection.
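
    The hazard can be sketched in userspace as follows. The struct is a
    hypothetical, cut-down flow key (the real struct flowi4 has many more
    fields); the point is only that code which fills fields one by one,
    instead of zeroing the whole struct, has to set every field explicitly:

```c
#include <stdint.h>

/* Hypothetical, cut-down flow key used only for illustration. */
struct toy_flow {
	int      oif;
	uint32_t daddr;
	uint32_t saddr;
	uint32_t multipath_hash;
};

/* Fast-path style setup: no memset, so each field is assigned by hand,
 * including the newer multipath_hash. */
static void toy_flow_init(struct toy_flow *fl, int oif,
			  uint32_t daddr, uint32_t saddr)
{
	fl->oif = oif;
	fl->daddr = daddr;
	fl->saddr = saddr;
	fl->multipath_hash = 0;	/* the kind of assignment that was missing */
}
```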

    Fixes: 24ba14406c5c ("route: Add multipath_hash in flowi_common to make user-define hash")
    Signed-off-by: David Ahern
    Cc: wenxu
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    David Ahern
     
  • [ Upstream commit ba9e04a7ddf4f22a10e05bf9403db6b97743c7bf ]

    Currently, in tcp_v4_reqsk_send_ack() and tcp_v4_send_reset(), we
    echo the TOS value of the received packets in the response.
    However, we do not want to echo the lower 2 ECN bits in accordance
    with RFC 3168 6.1.5 robustness principles.
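
    In terms of bit operations, the idea is to keep the DSCP bits of the
    received TOS byte and clear the two ECN bits before replying. A small
    sketch, with hypothetical names (TOY_ECN_MASK, toy_reply_tos):

```c
#include <stdint.h>

#define TOY_ECN_MASK 0x03u	/* low two bits of the TOS byte carry ECN */

/* Keep the DSCP bits of the received TOS, drop the ECN bits. */
static uint8_t toy_reply_tos(uint8_t rx_tos)
{
	return (uint8_t)(rx_tos & ~TOY_ECN_MASK);
}
```

    For example, a received TOS of 0xbb would be echoed as 0xb8.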

    Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")

    Signed-off-by: Wei Wang
    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Wei Wang
     

12 Sep, 2020

1 commit

  • [ Upstream commit 7f6f32bb7d3355cd78ebf1dece9a6ea7a0ca8158 ]

    fib_info_notify_update() is always called with RTNL held, but not from
    an RCU read-side critical section. This leads to the following warning
    [1] when the FIB table list is traversed with
    hlist_for_each_entry_rcu(), but without a proper lockdep expression.

    Since modification of the list is protected by RTNL, silence the warning
    by adding a lockdep expression which verifies RTNL is held.

    [1]
    =============================
    WARNING: suspicious RCU usage
    5.9.0-rc1-custom-14233-g2f26e122d62f #129 Not tainted
    -----------------------------
    net/ipv4/fib_trie.c:2124 RCU-list traversed in non-reader section!!

    other info that might help us debug this:

    rcu_scheduler_active = 2, debug_locks = 1
    1 lock held by ip/834:
    #0: ffffffff85a3b6b0 (rtnl_mutex){+.+.}-{3:3}, at: rtnetlink_rcv_msg+0x49a/0xbd0

    stack backtrace:
    CPU: 0 PID: 834 Comm: ip Not tainted 5.9.0-rc1-custom-14233-g2f26e122d62f #129
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.13.0-2.fc32 04/01/2014
    Call Trace:
    dump_stack+0x100/0x184
    lockdep_rcu_suspicious+0x143/0x14d
    fib_info_notify_update+0x8d1/0xa60
    __nexthop_replace_notify+0xd2/0x290
    rtm_new_nexthop+0x35e2/0x5946
    rtnetlink_rcv_msg+0x4f7/0xbd0
    netlink_rcv_skb+0x17a/0x480
    rtnetlink_rcv+0x22/0x30
    netlink_unicast+0x5ae/0x890
    netlink_sendmsg+0x98a/0xf40
    ____sys_sendmsg+0x879/0xa00
    ___sys_sendmsg+0x122/0x190
    __sys_sendmsg+0x103/0x1d0
    __x64_sys_sendmsg+0x7d/0xb0
    do_syscall_64+0x32/0x50
    entry_SYSCALL_64_after_hwframe+0x44/0xa9
    RIP: 0033:0x7fde28c3be57
    Code: 0c 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b7 0f 1f 00 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 b8 2e 00 00 00 0f 05 3d 00 f0 ff ff 77 51
    c3 48 83 ec 28 89 54 24 1c 48 89 74 24 10
    RSP: 002b:00007ffc09330028 EFLAGS: 00000246 ORIG_RAX: 000000000000002e
    RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007fde28c3be57
    RDX: 0000000000000000 RSI: 00007ffc09330090 RDI: 0000000000000003
    RBP: 000000005f45f911 R08: 0000000000000001 R09: 00007ffc0933012c
    R10: 0000000000000076 R11: 0000000000000246 R12: 0000000000000001
    R13: 00007ffc09330290 R14: 00007ffc09330eee R15: 00005610e48ed020

    Fixes: 1bff1a0c9bbd ("ipv4: Add function to send route updates")
    Signed-off-by: Ido Schimmel
    Reviewed-by: David Ahern
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Ido Schimmel
     

03 Sep, 2020

1 commit

  • [ Upstream commit eeaac3634ee0e3f35548be35275efeca888e9b23 ]

    Currently the nexthop code will use an empty NHA_GROUP attribute, but it
    requires at least 1 entry in order to function properly. Otherwise we
    end up dereferencing null or random pointers all over the place due to not
    having any nh_grp_entry members allocated; the nexthop code relies on having
    at least the first member present. Empty NHA_GROUP doesn't make any sense so
    just disallow it.
    Also add a WARN_ON for any future users of nexthop_create_group().
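
    The validation idea can be sketched as a length check on the attribute
    payload: reject zero length and anything that is not a whole number of
    entries. The struct and function names below are hypothetical stand-ins,
    not the kernel definitions:

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for one group member carried in the attribute. */
struct toy_grp_entry {
	uint32_t id;
	uint8_t  weight;
	uint8_t  resvd1;
	uint16_t resvd2;
};

/* The payload must hold at least one entry and be a whole multiple
 * of the entry size. */
static int toy_check_group_len(size_t len)
{
	if (len == 0 || len % sizeof(struct toy_grp_entry) != 0)
		return -EINVAL;
	return 0;
}
```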

    BUG: kernel NULL pointer dereference, address: 0000000000000080
    #PF: supervisor read access in kernel mode
    #PF: error_code(0x0000) - not-present page
    PGD 0 P4D 0
    Oops: 0000 [#1] SMP
    CPU: 0 PID: 558 Comm: ip Not tainted 5.9.0-rc1+ #93
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.13.0-2.fc32 04/01/2014
    RIP: 0010:fib_check_nexthop+0x4a/0xaa
    Code: 0f 84 83 00 00 00 48 c7 02 80 03 f7 81 c3 40 80 fe fe 75 12 b8 ea ff ff ff 48 85 d2 74 6b 48 c7 02 40 03 f7 81 c3 48 8b 40 10 8b 80 80 00 00 00 eb 36 80 78 1a 00 74 12 b8 ea ff ff ff 48 85
    RSP: 0018:ffff88807983ba00 EFLAGS: 00010213
    RAX: 0000000000000000 RBX: ffff88807983bc00 RCX: 0000000000000000
    RDX: ffff88807983bc00 RSI: 0000000000000000 RDI: ffff88807bdd0a80
    RBP: ffff88807983baf8 R08: 0000000000000dc0 R09: 000000000000040a
    R10: 0000000000000000 R11: ffff88807bdd0ae8 R12: 0000000000000000
    R13: 0000000000000000 R14: ffff88807bea3100 R15: 0000000000000001
    FS: 00007f10db393700(0000) GS:ffff88807dc00000(0000) knlGS:0000000000000000
    CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: 0000000000000080 CR3: 000000007bd0f004 CR4: 00000000003706f0
    Call Trace:
    fib_create_info+0x64d/0xaf7
    fib_table_insert+0xf6/0x581
    ? __vma_adjust+0x3b6/0x4d4
    inet_rtm_newroute+0x56/0x70
    rtnetlink_rcv_msg+0x1e3/0x20d
    ? rtnl_calcit.isra.0+0xb8/0xb8
    netlink_rcv_skb+0x5b/0xac
    netlink_unicast+0xfa/0x17b
    netlink_sendmsg+0x334/0x353
    sock_sendmsg_nosec+0xf/0x3f
    ____sys_sendmsg+0x1a0/0x1fc
    ? copy_msghdr_from_user+0x4c/0x61
    ___sys_sendmsg+0x63/0x84
    ? handle_mm_fault+0xa39/0x11b5
    ? sockfd_lookup_light+0x72/0x9a
    __sys_sendmsg+0x50/0x6e
    do_syscall_64+0x54/0xbe
    entry_SYSCALL_64_after_hwframe+0x44/0xa9
    RIP: 0033:0x7f10dacc0bb7
    Code: d8 64 89 02 48 c7 c0 ff ff ff ff eb cd 66 0f 1f 44 00 00 8b 05 9a 4b 2b 00 85 c0 75 2e 48 63 ff 48 63 d2 b8 2e 00 00 00 0f 05 3d 00 f0 ff ff 77 01 c3 48 8b 15 b1 f2 2a 00 f7 d8 64 89 02 48
    RSP: 002b:00007ffcbe628bf8 EFLAGS: 00000246 ORIG_RAX: 000000000000002e
    RAX: ffffffffffffffda RBX: 00007ffcbe628f80 RCX: 00007f10dacc0bb7
    RDX: 0000000000000000 RSI: 00007ffcbe628c60 RDI: 0000000000000003
    RBP: 000000005f41099c R08: 0000000000000001 R09: 0000000000000008
    R10: 00000000000005e9 R11: 0000000000000246 R12: 0000000000000000
    R13: 0000000000000000 R14: 00007ffcbe628d70 R15: 0000563a86c6e440
    Modules linked in:
    CR2: 0000000000000080

    CC: David Ahern
    Fixes: 430a049190de ("nexthop: Add support for nexthop groups")
    Reported-by: syzbot+a61aa19b0c14c8770bd9@syzkaller.appspotmail.com
    Signed-off-by: Nikolay Aleksandrov
    Reviewed-by: David Ahern
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Nikolay Aleksandrov
     

19 Aug, 2020

3 commits

  • [ Upstream commit d76f3351cea2d927fdf70dd7c06898235035e84e ]

    In the case of TPROXY, bind_conflict optimizations for SO_REUSEADDR or
    SO_REUSEPORT are broken, possibly resulting in O(n) instead of O(1) bind
    behaviour or in the incorrect reuse of a bind.

    The kernel keeps track, for each bind_bucket, of whether all sockets in the
    bind_bucket support SO_REUSEADDR or SO_REUSEPORT, using two fastreuse flags.
    These flags allow skipping the costly bind_conflict check when possible
    (meaning when all sockets have the proper SO_REUSE option).

    For every socket added to a bind_bucket, these flags need to be updated.
    As soon as a socket that does not support reuse is added, the flag is
    set to false and will never go back to true, unless the bind_bucket is
    deleted.

    Note that there is no mechanism to re-evaluate these flags when a socket
    is removed (this might make sense when removing a socket that would not
    allow reuse; this leaves room for a future patch).

    For this optimization to work, it is mandatory that these flags are
    properly initialized and updated.

    When a child socket is created from a listen socket in
    __inet_inherit_port, the TPROXY case could create a new bind bucket
    without properly initializing these flags, thus preventing the
    optimization from working. Alternatively, a socket not allowing reuse could
    be added to an existing bind bucket without updating the flags, causing
    bind_conflict to never be called as it should.

    Call inet_csk_update_fastreuse when __inet_inherit_port decides to create
    a new bind_bucket or use a different bind_bucket than the one of the
    listen socket.
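
    The flag maintenance described above can be sketched roughly as follows.
    This is a simplified model with a single flag and hypothetical toy_*
    names; the real helper tracks more state than this:

```c
#include <stdbool.h>

/* Hypothetical bind bucket: owners counts sockets, fastreuse caches
 * "every owner so far allows address reuse". */
struct toy_bucket {
	int  owners;
	bool fastreuse;
};

/* Account for a socket joining the bucket and keep the flag in sync. */
static void toy_bucket_add(struct toy_bucket *tb, bool sk_reuse)
{
	if (tb->owners == 0)
		tb->fastreuse = sk_reuse;	/* first owner initializes the flag */
	else if (!sk_reuse)
		tb->fastreuse = false;		/* sticky until the bucket is deleted */
	tb->owners++;
}

/* The costly conflict walk can be skipped only while the flag holds. */
static bool toy_can_skip_conflict_check(const struct toy_bucket *tb, bool sk_reuse)
{
	return tb->owners > 0 && tb->fastreuse && sk_reuse;
}
```

    Any path that adds a socket to a bucket without going through the update
    step leaves the flag stale, which is the failure mode this commit addresses.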

    Fixes: 093d282321da ("tproxy: fix hash locking issue when using port redirection in __inet_inherit_port()")
    Acked-by: Matthieu Baerts
    Signed-off-by: Tim Froidcoeur
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Tim Froidcoeur
     
  • [ Upstream commit 62ffc589abb176821662efc4525ee4ac0b9c3894 ]

    Refactor the fastreuse update code in inet_csk_get_port into a small
    helper function that can be called from other places.

    Acked-by: Matthieu Baerts
    Signed-off-by: Tim Froidcoeur
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Tim Froidcoeur
     
  • [ Upstream commit f19008e676366c44e9241af57f331b6c6edf9552 ]

    When TFO keys are read back on big endian systems either via the global
    sysctl interface or via getsockopt() using TCP_FASTOPEN_KEY, the values
    don't match what was written.

    For example, on s390x:

    # echo "1-2-3-4" > /proc/sys/net/ipv4/tcp_fastopen_key
    # cat /proc/sys/net/ipv4/tcp_fastopen_key
    02000000-01000000-04000000-03000000

    Instead of:

    # cat /proc/sys/net/ipv4/tcp_fastopen_key
    00000001-00000002-00000003-00000004

    Fix this by converting to the correct endianness on read. This was
    reported by Colin Ian King when running the 'tcp_fastopen_backup_key' net
    selftest on s390x, which depends on the read value matching what was
    written. I've confirmed that the test now passes on big and little endian
    systems.
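
    As an illustration of endian-safe reads, assuming for the sake of the
    example that the key bytes are kept in a fixed little-endian layout, a
    value can be assembled with shifts so the result is the same on any host.
    The helper name is hypothetical:

```c
#include <stdint.h>

/* Assemble a 32-bit value from 4 little-endian bytes. Using shifts
 * rather than a pointer cast gives the same result on any host. */
static uint32_t toy_le32_load(const uint8_t *p)
{
	return (uint32_t)p[0] |
	       ((uint32_t)p[1] << 8) |
	       ((uint32_t)p[2] << 16) |
	       ((uint32_t)p[3] << 24);
}
```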

    Signed-off-by: Jason Baron
    Fixes: 438ac88009bc ("net: fastopen: robustness and endianness fixes for SipHash")
    Cc: Ard Biesheuvel
    Cc: Eric Dumazet
    Reported-and-tested-by: Colin Ian King
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Jason Baron
     

11 Aug, 2020

3 commits

  • [ Upstream commit 730e700e2c19d87e578ff0e7d8cb1d4a02b036d2 ]

    For retransmitted packets, TCP needs to resort to using TCP timestamps
    for computing RTT samples. In the common case where the data and ACK
    fall in the same 1-millisecond interval, TCP senders with millisecond-
    granularity TCP timestamps compute a ca_rtt_us of 0. This ca_rtt_us
    of 0 propagates to rs->rtt_us.

    This value of 0 can cause performance problems for congestion control
    modules. For example, in BBR, the zero min_rtt sample can bring the
    min_rtt and BDP estimate down to 0, reduce snd_cwnd and result in a
    low throughput. It would be hard to mitigate this with filtering in
    the congestion control module, because the proper floor to apply would
    depend on the method of RTT sampling (using timestamp options or
    internally-saved transmission timestamps).

    This fix applies a floor of 1 for the RTT sample delta from TCP
    timestamps, so that seq_rtt_us, ca_rtt_us, and rs->rtt_us will be at
    least 1 * (USEC_PER_SEC / TCP_TS_HZ).
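
    A small sketch of the floor, assuming millisecond timestamp granularity
    (TOY_TS_HZ of 1000 is an assumption, and the names are hypothetical).
    Very large deltas are not guarded against here:

```c
#include <stdint.h>

#define TOY_USEC_PER_SEC 1000000u
#define TOY_TS_HZ        1000u	/* assumed millisecond timestamp granularity */

/* Convert a timestamp delta (in ticks) to an RTT sample in microseconds,
 * flooring the delta at one tick so the sample is never 0. */
static uint32_t toy_ts_delta_to_rtt_us(uint32_t delta_ticks)
{
	if (delta_ticks == 0)
		delta_ticks = 1;
	return delta_ticks * (TOY_USEC_PER_SEC / TOY_TS_HZ);
}
```

    With these values, a data/ACK pair that falls in the same millisecond
    yields a sample of 1000 us rather than 0.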

    Note that the receiver RTT computation in tcp_rcv_rtt_measure() and
    min_rtt computation in tcp_update_rtt_min() both already apply a floor
    of 1 timestamp tick, so this commit makes the code more consistent in
    avoiding this edge case of a value of 0.

    Signed-off-by: Jianfeng Wang
    Signed-off-by: Neal Cardwell
    Signed-off-by: Eric Dumazet
    Acked-by: Kevin Yang
    Acked-by: Yuchung Cheng
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Jianfeng Wang
     
  • [ Upstream commit 622e32b7d4a6492cf5c1f759ef833f817418f7b3 ]

    The GRE tunnel can be used to transport traffic that does not rely on an
    Internet checksum (e.g. SCTP). The issue can be triggered by creating a GRE
    or GRETAP tunnel and transmitting SCTP traffic on top of it where CRC
    offload has been disabled. In order to fix the issue we need to
    recompute the GRE csum in gre_gso_segment() not relying on the inner
    checksum.
    The issue is still present when we have the CRC offload enabled.
    In this case we need to disable the CRC offload if we require GRE
    checksum since otherwise skb_checksum() will report a wrong value.

    Fixes: 90017accff61 ("sctp: Add GSO support")
    Signed-off-by: Lorenzo Bianconi
    Reviewed-by: Marcelo Ricardo Leitner
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Lorenzo Bianconi
     
  • [ Upstream commit 83f3522860f702748143e022f1a546547314c715 ]

    fib_trie_unmerge() is called with RTNL held, but not from an RCU
    read-side critical section. This leads to the following warning [1] when
    the FIB alias list in a leaf is traversed with
    hlist_for_each_entry_rcu().

    Since the function is always called with RTNL held and since
    modification of the list is protected by RTNL, simply use
    hlist_for_each_entry() and silence the warning.

    [1]
    WARNING: suspicious RCU usage
    5.8.0-rc4-custom-01520-gc1f937f3f83b #30 Not tainted
    -----------------------------
    net/ipv4/fib_trie.c:1867 RCU-list traversed in non-reader section!!

    other info that might help us debug this:

    rcu_scheduler_active = 2, debug_locks = 1
    1 lock held by ip/164:
    #0: ffffffff85a27850 (rtnl_mutex){+.+.}-{3:3}, at: rtnetlink_rcv_msg+0x49a/0xbd0

    stack backtrace:
    CPU: 0 PID: 164 Comm: ip Not tainted 5.8.0-rc4-custom-01520-gc1f937f3f83b #30
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.13.0-2.fc32 04/01/2014
    Call Trace:
    dump_stack+0x100/0x184
    lockdep_rcu_suspicious+0x153/0x15d
    fib_trie_unmerge+0x608/0xdb0
    fib_unmerge+0x44/0x360
    fib4_rule_configure+0xc8/0xad0
    fib_nl_newrule+0x37a/0x1dd0
    rtnetlink_rcv_msg+0x4f7/0xbd0
    netlink_rcv_skb+0x17a/0x480
    rtnetlink_rcv+0x22/0x30
    netlink_unicast+0x5ae/0x890
    netlink_sendmsg+0x98a/0xf40
    ____sys_sendmsg+0x879/0xa00
    ___sys_sendmsg+0x122/0x190
    __sys_sendmsg+0x103/0x1d0
    __x64_sys_sendmsg+0x7d/0xb0
    do_syscall_64+0x54/0xa0
    entry_SYSCALL_64_after_hwframe+0x44/0xa9
    RIP: 0033:0x7fc80a234e97
    Code: Bad RIP value.
    RSP: 002b:00007ffef8b66798 EFLAGS: 00000246 ORIG_RAX: 000000000000002e
    RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007fc80a234e97
    RDX: 0000000000000000 RSI: 00007ffef8b66800 RDI: 0000000000000003
    RBP: 000000005f141b1c R08: 0000000000000001 R09: 0000000000000000
    R10: 00007fc80a2a8ac0 R11: 0000000000000246 R12: 0000000000000001
    R13: 0000000000000000 R14: 00007ffef8b67008 R15: 0000556fccb10020

    Fixes: 0ddcf43d5d4a ("ipv4: FIB Local/MAIN table collapse")
    Signed-off-by: Ido Schimmel
    Reviewed-by: Jiri Pirko
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Ido Schimmel
     

01 Aug, 2020

3 commits

  • [ Upstream commit efc6b6f6c3113e8b203b9debfb72d81e0f3dcace ]

    Currently, SO_REUSEPORT does not work well if connected sockets are in a
    UDP reuseport group.

    In that case reuseport_has_conns() returns true and the result of
    reuseport_select_sock() is discarded. Also, unconnected sockets have the
    same score, hence only the first unconnected socket in udp_hslot
    receives all packets sent to unconnected sockets.

    So, the result of reuseport_select_sock() should be used for load
    balancing.

    The noteworthy point is that the unconnected sockets placed after
    connected sockets in sock_reuseport.socks will receive more packets than
    others because of the algorithm in reuseport_select_sock().

    index | connected | reciprocal_scale | result
    ---------------------------------------------
    0     | no        | 20%              | 40%
    1     | no        | 20%              | 20%
    2     | yes       | 20%              | 0%
    3     | no        | 20%              | 40%
    4     | yes       | 20%              | 0%

    If most of the sockets are connected, this can be a problem, but it still
    works better than now.
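
    The skew in the table can be reproduced with a sketch of the selection
    walk: start at the slot picked by the hash and move forward, wrapping
    around, until an unconnected socket is found. The function name and the
    boolean array are hypothetical simplifications:

```c
#include <stdbool.h>

/* Starting at the slot chosen by the hash, walk forward (wrapping) to the
 * first unconnected socket. Returns its index, or -1 if all are connected. */
static int toy_select_unconnected(const bool *connected, int n, int start)
{
	int i = start;

	while (connected[i]) {
		i++;
		if (i >= n)
			i = 0;
		if (i == start)
			return -1;
	}
	return i;
}
```

    With the layout from the table, a hash landing on slot 2 is served by
    slot 3 and a hash landing on slot 4 wraps to slot 0, which is why slots
    0 and 3 each end up with 40% of the packets.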

    Fixes: acdcecc61285 ("udp: correct reuseport selection with connected sockets")
    CC: Willem de Bruijn
    Reviewed-by: Benjamin Herrenschmidt
    Signed-off-by: Kuniyuki Iwashima
    Acked-by: Willem de Bruijn
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Kuniyuki Iwashima
     
  • [ Upstream commit 76be93fc0702322179bb0ea87295d820ee46ad14 ]

    Previously TLP could send multiple probes of new data in one
    flight. This happens when the sender is cwnd limited. After the
    initial TLP containing new data is sent, the sender receives another
    ACK that acknowledges part of the data in flight. It may then re-arm
    another TLP timer to send more, if no further ACK returns before the
    next TLP timeout (PTO) expires. In theory the sender may send a large
    number of TLPs until the send queue is depleted. This only happens if
    the sender sees such an irregular, uncommon ACK pattern, but it is
    generally undesirable behavior, especially during congestion.

    The original TLP design restricts the sender to one TLP probe per inflight, as
    published in "Reducing Web Latency: the Virtue of Gentle Aggression",
    SIGCOMM 2013. This patch changes TLP to send at most one probe
    per inflight.

    Note that if the sender is app-limited, TLP retransmits old data
    and does not have this issue.

    Signed-off-by: Yuchung Cheng
    Signed-off-by: Neal Cardwell
    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Yuchung Cheng
     
  • [ Upstream commit b0a422772fec29811e293c7c0e6f991c0fd9241d ]

    We can't use IS_UDPLITE to replace udp_sk->pcflag when UDPLITE_RECV_CC is
    checked.

    Fixes: b2bf1e2659b1 ("[UDP]: Clean up for IS_UDPLITE macro")
    Signed-off-by: Miaohe Lin
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Miaohe Lin
     

22 Jul, 2020

8 commits

  • [ Upstream commit 0da7536fb47f51df89ccfcb1fa09f249d9accec5 ]

    When no full socket is available, skbs are sent over a per-netns
    control socket. Its sk_mark is temporarily adjusted to match that
    of the real (request or timewait) socket or to reflect an incoming
    skb, so that the outgoing skb inherits this in __ip_make_skb.

    Introduction of the socket cookie mark field broke this. Now the
    skb is set through the cookie and cork:

    # init sockc.mark from sk_mark or cmsg
    ip_append_data
    ip_setup_cork # convert sockc.mark to cork mark
    ip_push_pending_frames
    ip_finish_skb
    __ip_make_skb # set skb->mark to cork mark

    But I missed these special control sockets. Update all callers of
    __ip(6)_make_skb that were originally missed.

    For IPv6, the same two icmp(v6) paths are affected. The third
    case is not, as commit 92e55f412cff ("tcp: don't annotate
    mark on control socket from tcp_v6_send_response()") replaced
    the ctl_sk->sk_mark with passing the mark field directly as a
    function argument. That commit predates the commit that
    introduced the bug.

    Fixes: c6af0c227a22 ("ip: support SO_MARK cmsg")
    Signed-off-by: Willem de Bruijn
    Reported-by: Martin KaFai Lau
    Reviewed-by: Martin KaFai Lau
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Willem de Bruijn
     
  • [ Upstream commit 1ca0fafd73c5268e8fc4b997094b8bb2bfe8deea ]

    This essentially reverts commit 721230326891 ("tcp: md5: reject TCP_MD5SIG
    or TCP_MD5SIG_EXT on established sockets")

    Mathieu reported that many vendors' BGP implementations can
    actually switch TCP MD5 on established flows.

    Quoting Mathieu :
    Here is a list of a few network vendors along with their behavior
    with respect to TCP MD5:

    - Cisco: Allows for password to be changed, but within the hold-down
    timer (~180 seconds).
    - Juniper: When password is initially set on active connection it will
    reset, but after that any subsequent password changes cause no network
    resets.
    - Nokia: No notes on if they flap the tcp connection or not.
    - Ericsson/RedBack: Allows for 2 passwords (old/new) to co-exist until
    both sides are ok with new passwords.
    - Meta-Switch: Expects the password to be set before a connection is
    attempted, but no further info on whether they reset the TCP
    connection on a change.
    - Avaya: Disable the neighbor, then set password, then re-enable.
    - Zebos: Would normally allow the change when socket connected.

    We can revert my prior change because commit 9424e2e7ad93 ("tcp: md5: fix potential
    overestimation of TCP option space") removed the leak of 4 kernel bytes to
    the wire that was the main reason for my patch.

    While doing my investigations, I found a bug when a MD5 key is changed, leading
    to these commits that stable teams want to consider before backporting this revert :

    Commit 6a2febec338d ("tcp: md5: add missing memory barriers in tcp_md5_do_add()/tcp_md5_hash_key()")
    Commit e6ced831ef11 ("tcp: md5: refine tcp_md5_do_add()/tcp_md5_hash_key() barriers")

    Fixes: 721230326891 "tcp: md5: reject TCP_MD5SIG or TCP_MD5SIG_EXT on established sockets"
    Signed-off-by: Eric Dumazet
    Reported-by: Mathieu Desnoyers
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Eric Dumazet
     
  • [ Upstream commit e6ced831ef11a2a06e8d00aad9d4fc05b610bf38 ]

    My prior fix went a bit too far, according to Herbert and Mathieu.

    Since we accept that concurrent TCP MD5 lookups might see inconsistent
    keys, we can use READ_ONCE()/WRITE_ONCE() instead of smp_rmb()/smp_wmb()

    Clearing all key->key[] is needed to avoid possible KMSAN reports,
    if key->keylen is increased. Since tcp_md5_do_add() is not fast path,
    using __GFP_ZERO to clear all struct tcp_md5sig_key is simpler.

    data_race() was added in linux-5.8 and will prevent KCSAN reports,
    this can safely be removed in stable backports, if data_race() is
    not yet backported.

    v2: use data_race() both in tcp_md5_hash_key() and tcp_md5_do_add()

    Fixes: 6a2febec338d ("tcp: md5: add missing memory barriers in tcp_md5_do_add()/tcp_md5_hash_key()")
    Signed-off-by: Eric Dumazet
    Cc: Mathieu Desnoyers
    Cc: Herbert Xu
    Cc: Marco Elver
    Reviewed-by: Mathieu Desnoyers
    Acked-by: Herbert Xu
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Eric Dumazet
     
  • [ Upstream commit e114e1e8ac9d31f25b9dd873bab5d80c1fc482ca ]

    Whenever cookie_init_timestamp() has been used to encode
    ECN,SACK,WSCALE options, we can not remove the TS option in the SYNACK.

    Otherwise, tcp_synack_options() will still advertise options like WSCALE
    that we can not deduce later when receiving the packet from the client
    to complete 3WHS.

    Note that modern Linux TCP stacks won't use MD5+TS+SACK in a SYN packet,
    but we can not know for sure that all TCP stacks have the same logic.

    Before the fix a tcpdump would exhibit this wrong exchange :

    10:12:15.464591 IP C > S: Flags [S], seq 4202415601, win 65535, options [nop,nop,md5 valid,mss 1400,sackOK,TS val 456965269 ecr 0,nop,wscale 8], length 0
    10:12:15.464602 IP S > C: Flags [S.], seq 253516766, ack 4202415602, win 65535, options [nop,nop,md5 valid,mss 1400,nop,nop,sackOK,nop,wscale 8], length 0
    10:12:15.464611 IP C > S: Flags [.], ack 1, win 256, options [nop,nop,md5 valid], length 0
    10:12:15.464678 IP C > S: Flags [P.], seq 1:13, ack 1, win 256, options [nop,nop,md5 valid], length 12
    10:12:15.464685 IP S > C: Flags [.], ack 13, win 65535, options [nop,nop,md5 valid], length 0

    After this patch the exchange looks saner :

    11:59:59.882990 IP C > S: Flags [S], seq 517075944, win 65535, options [nop,nop,md5 valid,mss 1400,sackOK,TS val 1751508483 ecr 0,nop,wscale 8], length 0
    11:59:59.883002 IP S > C: Flags [S.], seq 1902939253, ack 517075945, win 65535, options [nop,nop,md5 valid,mss 1400,sackOK,TS val 1751508479 ecr 1751508483,nop,wscale 8], length 0
    11:59:59.883012 IP C > S: Flags [.], ack 1, win 256, options [nop,nop,md5 valid,nop,nop,TS val 1751508483 ecr 1751508479], length 0
    11:59:59.883114 IP C > S: Flags [P.], seq 1:13, ack 1, win 256, options [nop,nop,md5 valid,nop,nop,TS val 1751508483 ecr 1751508479], length 12
    11:59:59.883122 IP S > C: Flags [.], ack 13, win 256, options [nop,nop,md5 valid,nop,nop,TS val 1751508483 ecr 1751508483], length 0
    11:59:59.883152 IP S > C: Flags [P.], seq 1:13, ack 13, win 256, options [nop,nop,md5 valid,nop,nop,TS val 1751508484 ecr 1751508483], length 12
    11:59:59.883170 IP C > S: Flags [.], ack 13, win 256, options [nop,nop,md5 valid,nop,nop,TS val 1751508484 ecr 1751508484], length 0

    Of course, no SACK block will ever be added later, but nothing should break.
    Technically, we could remove the 4 nops included in MD5+TS options,
    but, again, some stacks could break on seeing an unconventional alignment.

    Fixes: 4957faade11b ("TCPCT part 1g: Responder Cookie => Initiator")
    Signed-off-by: Eric Dumazet
    Cc: Florian Westphal
    Cc: Mathieu Desnoyers
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Eric Dumazet
     
  • [ Upstream commit 6a2febec338df7e7699a52d00b2e1207dcf65b28 ]

    MD5 keys are read with RCU protection, and tcp_md5_do_add()
    might update in-place a prior key.

    Normally, typical RCU updates would allocate a new piece
    of memory. In this case only key->key and key->keylen might
    be updated, and we do not care if an incoming packet could
    see the old key, the new one, or some intermediate value,
    since changing the key on a live flow is known to be problematic
    anyway.

    We only want to make sure that in the case key->keylen
    is changed, cpus in tcp_md5_hash_key() won't try to use
    uninitialized data, or crash because key->keylen was
    read twice to feed sg_init_one() and ahash_request_set_crypt().

    Fixes: 9ea88a153001 ("tcp: md5: check md5 signature without socket lock")
    Signed-off-by: Eric Dumazet
    Cc: Mathieu Desnoyers
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Eric Dumazet
     
  • [ Upstream commit ce69e563b325f620863830c246a8698ccea52048 ]

    syzkaller found its way into setsockopt with TCP_CONGESTION "cdg".
    tcp_cdg_init() does a kcalloc to store the gradients. As sk_clone_lock
    just copies all the memory, the allocated pointer will be copied as
    well, if the app called setsockopt(..., TCP_CONGESTION) on the listener.
    If the socket is then destroyed before the congestion control
    has been properly initialized (through a call to tcp_init_transfer), we
    will end up freeing memory that does not belong to that particular
    socket, opening the door to a double-free:

    [ 11.413102] ==================================================================
    [ 11.414181] BUG: KASAN: double-free or invalid-free in tcp_cleanup_congestion_control+0x58/0xd0
    [ 11.415329]
    [ 11.415560] CPU: 3 PID: 4884 Comm: syz-executor.5 Not tainted 5.8.0-rc2 #80
    [ 11.416544] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.1-0-ga5cab58e9a3f-prebuilt.qemu.org 04/01/2014
    [ 11.418148] Call Trace:
    [ 11.418534]
    [ 11.418834] dump_stack+0x7d/0xb0
    [ 11.419297] print_address_description.constprop.0+0x1a/0x210
    [ 11.422079] kasan_report_invalid_free+0x51/0x80
    [ 11.423433] __kasan_slab_free+0x15e/0x170
    [ 11.424761] kfree+0x8c/0x230
    [ 11.425157] tcp_cleanup_congestion_control+0x58/0xd0
    [ 11.425872] tcp_v4_destroy_sock+0x57/0x5a0
    [ 11.426493] inet_csk_destroy_sock+0x153/0x2c0
    [ 11.427093] tcp_v4_syn_recv_sock+0xb29/0x1100
    [ 11.427731] tcp_get_cookie_sock+0xc3/0x4a0
    [ 11.429457] cookie_v4_check+0x13d0/0x2500
    [ 11.433189] tcp_v4_do_rcv+0x60e/0x780
    [ 11.433727] tcp_v4_rcv+0x2869/0x2e10
    [ 11.437143] ip_protocol_deliver_rcu+0x23/0x190
    [ 11.437810] ip_local_deliver+0x294/0x350
    [ 11.439566] __netif_receive_skb_one_core+0x15d/0x1a0
    [ 11.441995] process_backlog+0x1b1/0x6b0
    [ 11.443148] net_rx_action+0x37e/0xc40
    [ 11.445361] __do_softirq+0x18c/0x61a
    [ 11.445881] asm_call_on_stack+0x12/0x20
    [ 11.446409]
    [ 11.446716] do_softirq_own_stack+0x34/0x40
    [ 11.447259] do_softirq.part.0+0x26/0x30
    [ 11.447827] __local_bh_enable_ip+0x46/0x50
    [ 11.448406] ip_finish_output2+0x60f/0x1bc0
    [ 11.450109] __ip_queue_xmit+0x71c/0x1b60
    [ 11.451861] __tcp_transmit_skb+0x1727/0x3bb0
    [ 11.453789] tcp_rcv_state_process+0x3070/0x4d3a
    [ 11.456810] tcp_v4_do_rcv+0x2ad/0x780
    [ 11.457995] __release_sock+0x14b/0x2c0
    [ 11.458529] release_sock+0x4a/0x170
    [ 11.459005] __inet_stream_connect+0x467/0xc80
    [ 11.461435] inet_stream_connect+0x4e/0xa0
    [ 11.462043] __sys_connect+0x204/0x270
    [ 11.465515] __x64_sys_connect+0x6a/0xb0
    [ 11.466088] do_syscall_64+0x3e/0x70
    [ 11.466617] entry_SYSCALL_64_after_hwframe+0x44/0xa9
    [ 11.467341] RIP: 0033:0x7f56046dc469
    [ 11.467844] Code: Bad RIP value.
    [ 11.468282] RSP: 002b:00007f5604dccdd8 EFLAGS: 00000246 ORIG_RAX: 000000000000002a
    [ 11.469326] RAX: ffffffffffffffda RBX: 000000000068bf00 RCX: 00007f56046dc469
    [ 11.470379] RDX: 0000000000000010 RSI: 0000000020000000 RDI: 0000000000000004
    [ 11.471311] RBP: 00000000ffffffff R08: 0000000000000000 R09: 0000000000000000
    [ 11.472286] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
    [ 11.473341] R13: 000000000041427c R14: 00007f5604dcd5c0 R15: 0000000000000003
    [ 11.474321]
    [ 11.474527] Allocated by task 4884:
    [ 11.475031] save_stack+0x1b/0x40
    [ 11.475548] __kasan_kmalloc.constprop.0+0xc2/0xd0
    [ 11.476182] tcp_cdg_init+0xf0/0x150
    [ 11.476744] tcp_init_congestion_control+0x9b/0x3a0
    [ 11.477435] tcp_set_congestion_control+0x270/0x32f
    [ 11.478088] do_tcp_setsockopt.isra.0+0x521/0x1a00
    [ 11.478744] __sys_setsockopt+0xff/0x1e0
    [ 11.479259] __x64_sys_setsockopt+0xb5/0x150
    [ 11.479895] do_syscall_64+0x3e/0x70
    [ 11.480395] entry_SYSCALL_64_after_hwframe+0x44/0xa9
    [ 11.481097]
    [ 11.481321] Freed by task 4872:
    [ 11.481783] save_stack+0x1b/0x40
    [ 11.482230] __kasan_slab_free+0x12c/0x170
    [ 11.482839] kfree+0x8c/0x230
    [ 11.483240] tcp_cleanup_congestion_control+0x58/0xd0
    [ 11.483948] tcp_v4_destroy_sock+0x57/0x5a0
    [ 11.484502] inet_csk_destroy_sock+0x153/0x2c0
    [ 11.485144] tcp_close+0x932/0xfe0
    [ 11.485642] inet_release+0xc1/0x1c0
    [ 11.486131] __sock_release+0xc0/0x270
    [ 11.486697] sock_close+0xc/0x10
    [ 11.487145] __fput+0x277/0x780
    [ 11.487632] task_work_run+0xeb/0x180
    [ 11.488118] __prepare_exit_to_usermode+0x15a/0x160
    [ 11.488834] do_syscall_64+0x4a/0x70
    [ 11.489326] entry_SYSCALL_64_after_hwframe+0x44/0xa9

    Wei Wang fixed a part of these CDG-malloc issues with commit c12014440750
    ("tcp: memset ca_priv data to 0 properly").

    This patch here fixes the listener-scenario: We make sure that listeners
    setting the congestion-control through setsockopt won't initialize it
    (thus CDG never allocates on listeners). For those who use AF_UNSPEC to
    reuse a socket, tcp_disconnect() is changed to cleanup afterwards.

    (The issue can be reproduced at least down to v4.4.x.)

    Cc: Wei Wang
    Cc: Eric Dumazet
    Fixes: 2b0a8c9eee81 ("tcp: add CDG congestion control")
    Signed-off-by: Christoph Paasch
    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Christoph Paasch
     
  • [ Upstream commit ba3bb0e76ccd464bb66665a1941fabe55dadb3ba ]

    Whenever tcp_try_rmem_schedule() returns an error, we are in
    trouble and should make sure to wake up readers so that they
    can drain socket queues and eventually make room.

    Fixes: 03f45c883c6f ("tcp: avoid extra wakeups for SO_RCVLOWAT users")
    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Eric Dumazet
     
  • [ Upstream commit 5eff06902394425c722f0a44d9545909a8800f79 ]

    IPv4 ping sockets don't set fl4.fl4_icmp_{type,code}, which leads to
    incomplete IPsec ACQUIRE messages being sent to userspace. Currently,
    both raw sockets and IPv6 ping sockets set those fields.

    Expected output of "ip xfrm monitor":
    acquire proto esp
    sel src 10.0.2.15/32 dst 8.8.8.8/32 proto icmp type 8 code 0 dev ens4
    policy src 10.0.2.15/32 dst 8.8.8.8/32

    Currently with ping sockets:
    acquire proto esp
    sel src 10.0.2.15/32 dst 8.8.8.8/32 proto icmp type 0 code 0 dev ens4
    policy src 10.0.2.15/32 dst 8.8.8.8/32

    The Libreswan test suite found this problem after Fedora changed the
    value for the sysctl net.ipv4.ping_group_range.

    Fixes: c319b4d76b9e ("net: ipv4: add IPPROTO_ICMP socket kind")
    Reported-by: Paul Wouters
    Tested-by: Paul Wouters
    Signed-off-by: Sabrina Dubroca
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Sabrina Dubroca
     

01 Jul, 2020

5 commits

  • [ Upstream commit b344579ca8478598937215f7005d6c7b84d28aee ]

    Mirja Kuehlewind reported a bug in Linux TCP CUBIC Hystart, where
    Hystart HYSTART_DELAY mechanism can exit Slow Start spuriously on an
    ACK when the minimum rtt of a connection goes down. From inspection it
    is clear from the existing code that this could happen in an example
    like the following:

    o The first 8 RTT samples in a round trip are 150ms, resulting in a
    curr_rtt of 150ms and a delay_min of 150ms.

    o The 9th RTT sample is 100ms. The curr_rtt does not change after the
    first 8 samples, so curr_rtt remains 150ms. But delay_min can be
    lowered at any time, so delay_min falls to 100ms. The code executes
    the HYSTART_DELAY comparison between curr_rtt of 150ms and delay_min
    of 100ms, and the curr_rtt is declared far enough above delay_min to
    force a (spurious) exit of Slow start.

    The fix here is simple: allow every RTT sample in a round trip to
    lower the curr_rtt.
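
    The example above can be replayed with a small Python model. This is a
    sketch under stated assumptions, not the tcp_cubic.c code: RTTs are in
    microseconds, the threshold is assumed to be delay_min/8 clamped to
    [4 ms, 16 ms], and exits_slow_start() and delay_thresh_us() are
    hypothetical names.

```python
HYSTART_MIN_SAMPLES = 8


def delay_thresh_us(delay_min_us):
    """Assumed threshold: delay_min/8 clamped to [4 ms, 16 ms]."""
    return max(4000, min(delay_min_us >> 3, 16000))


def exits_slow_start(rtt_samples_us, fixed):
    """Return True if the HYSTART_DELAY check fires within one round.

    fixed=False models the old behaviour (curr_rtt frozen after the first
    8 samples); fixed=True lets every sample lower curr_rtt.
    """
    curr_rtt = None      # minimum RTT tracked for the current round
    delay_min = None     # minimum RTT seen on the connection
    sample_cnt = 0
    for rtt in rtt_samples_us:
        delay_min = rtt if delay_min is None else min(delay_min, rtt)
        if fixed or sample_cnt < HYSTART_MIN_SAMPLES:
            curr_rtt = rtt if curr_rtt is None else min(curr_rtt, rtt)
        if sample_cnt < HYSTART_MIN_SAMPLES:
            sample_cnt += 1
        elif curr_rtt > delay_min + delay_thresh_us(delay_min):
            return True
    return False
```

    With eight samples of 150000 followed by one of 100000, the model exits
    Slow Start when fixed=False (150 ms against 100 ms + 12.5 ms) and stays
    in Slow Start when fixed=True.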

    Fixes: ae27e98a5152 ("[TCP] CUBIC v2.3")
    Reported-by: Mirja Kuehlewind
    Signed-off-by: Neal Cardwell
    Signed-off-by: Eric Dumazet
    Acked-by: Soheil Hassas Yeganeh
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Neal Cardwell
     
  • [ Upstream commit ba61539c6ae57f4146284a5cb4f7b7ed8d42bf45 ]

    In the datapath, the ip_tunnel_lookup() is used and it internally uses
    fallback tunnel device pointer, which is fb_tunnel_dev.
    This pointer should be set to NULL when the fb interface is deleted,
    but no routine does so. The pointer is therefore still used after the
    interface is deleted, which eventually results in a use-after-free.

    Test commands:
    ip netns add A
    ip netns add B
    ip link add eth0 type veth peer name eth1
    ip link set eth0 netns A
    ip link set eth1 netns B

    ip netns exec A ip link set lo up
    ip netns exec A ip link set eth0 up
    ip netns exec A ip link add gre1 type gre local 10.0.0.1 \
    remote 10.0.0.2
    ip netns exec A ip link set gre1 up
    ip netns exec A ip a a 10.0.100.1/24 dev gre1
    ip netns exec A ip a a 10.0.0.1/24 dev eth0

    ip netns exec B ip link set lo up
    ip netns exec B ip link set eth1 up
    ip netns exec B ip link add gre1 type gre local 10.0.0.2 \
    remote 10.0.0.1
    ip netns exec B ip link set gre1 up
    ip netns exec B ip a a 10.0.100.2/24 dev gre1
    ip netns exec B ip a a 10.0.0.2/24 dev eth1
    ip netns exec A hping3 10.0.100.2 -2 --flood -d 60000 &
    ip netns del B

    Splat looks like:
    [ 77.793450][ C3] ==================================================================
    [ 77.794702][ C3] BUG: KASAN: use-after-free in ip_tunnel_lookup+0xcc4/0xf30
    [ 77.795573][ C3] Read of size 4 at addr ffff888060bd9c84 by task hping3/2905
    [ 77.796398][ C3]
    [ 77.796664][ C3] CPU: 3 PID: 2905 Comm: hping3 Not tainted 5.8.0-rc1+ #616
    [ 77.797474][ C3] Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS VirtualBox 12/01/2006
    [ 77.798453][ C3] Call Trace:
    [ 77.798815][ C3]
    [ 77.799142][ C3] dump_stack+0x9d/0xdb
    [ 77.799605][ C3] print_address_description.constprop.7+0x2cc/0x450
    [ 77.800365][ C3] ? ip_tunnel_lookup+0xcc4/0xf30
    [ 77.800908][ C3] ? ip_tunnel_lookup+0xcc4/0xf30
    [ 77.801517][ C3] ? ip_tunnel_lookup+0xcc4/0xf30
    [ 77.802145][ C3] kasan_report+0x154/0x190
    [ 77.802821][ C3] ? ip_tunnel_lookup+0xcc4/0xf30
    [ 77.803503][ C3] ip_tunnel_lookup+0xcc4/0xf30
    [ 77.804165][ C3] __ipgre_rcv+0x1ab/0xaa0 [ip_gre]
    [ 77.804862][ C3] ? rcu_read_lock_sched_held+0xc0/0xc0
    [ 77.805621][ C3] gre_rcv+0x304/0x1910 [ip_gre]
    [ 77.806293][ C3] ? lock_acquire+0x1a9/0x870
    [ 77.806925][ C3] ? gre_rcv+0xfe/0x354 [gre]
    [ 77.807559][ C3] ? erspan_xmit+0x2e60/0x2e60 [ip_gre]
    [ 77.808305][ C3] ? rcu_read_lock_sched_held+0xc0/0xc0
    [ 77.809032][ C3] ? rcu_read_lock_held+0x90/0xa0
    [ 77.809713][ C3] gre_rcv+0x1b8/0x354 [gre]
    [ ... ]

    Suggested-by: Eric Dumazet
    Fixes: c54419321455 ("GRE: Refactor GRE tunneling code.")
    Signed-off-by: Taehee Yoo
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Taehee Yoo
     
  • [ Upstream commit 662051215c758ae8545451628816204ed6cd372d ]

    Back in 2013, we made a change that broke fast retransmit
    for non SACK flows.

    Indeed, for these flows, a sender needs to receive three duplicate
    ACKs before starting fast retransmit. ACKs sent with a different
    receive window do not count.

    Even if enabling SACK is strongly recommended these days,
    there still are some cases where it has to be disabled.

    Not increasing the window seems better than having to
    rely on RTO.

    After the fix, following packetdrill test gives :

    // Initialize connection
    0 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
    +0 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
    +0 bind(3, ..., ...) = 0
    +0 listen(3, 1) = 0

    +0 < S 0:0(0) win 32792
    +0 > S. 0:0(0) ack 1
    +0 < . 1:1(0) ack 1 win 514

    +0 accept(3, ..., ...) = 4

    +0 < . 1:1001(1000) ack 1 win 514
    // Quick ack
    +0 > . 1:1(0) ack 1001 win 264

    +0 < . 2001:3001(1000) ack 1 win 514
    // DUPACK : Normally we should not change the window
    +0 > . 1:1(0) ack 1001 win 264

    +0 < . 3001:4001(1000) ack 1 win 514
    // DUPACK : Normally we should not change the window
    +0 > . 1:1(0) ack 1001 win 264

    +0 < . 4001:5001(1000) ack 1 win 514
    // DUPACK : Normally we should not change the window
    +0 > . 1:1(0) ack 1001 win 264

    +0 < . 1001:2001(1000) ack 1 win 514
    // Hole is repaired.
    +0 > . 1:1(0) ack 5001 win 272
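
    Why a changing window defeats fast retransmit can be shown from the
    sender's side with a short Python model. It is a simplified reading of
    the RFC 5681 duplicate-ACK definition, not kernel code, and
    fast_retransmit_triggered() is a hypothetical name.

```python
def fast_retransmit_triggered(acks, dupthresh=3):
    """acks: list of (ack_number, window, payload_len) seen by a non-SACK sender.

    An ACK counts as a duplicate only if it carries no data and repeats
    both the previous ACK number and the previous advertised window.
    """
    dupacks = 0
    last = None  # (ack_number, window) of the previous ACK
    for ack, win, payload_len in acks:
        if last is not None and payload_len == 0 and (ack, win) == last:
            dupacks += 1
            if dupacks >= dupthresh:
                return True
        else:
            dupacks = 0
        last = (ack, win)
    return False
```

    Four ACKs of (1001, 264, 0), as in the script above, trigger fast
    retransmit in this model. If the window grew on every ACK (for example
    264, 272, 280, 288, which are made-up values), no ACK would count as a
    duplicate and the sender would have to wait for the RTO.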

    Fixes: 4e4f1fc22681 ("tcp: properly increase rcv_ssthresh for ofo packets")
    Signed-off-by: Eric Dumazet
    Reported-by: Venkat Venkatsubra
    Acked-by: Neal Cardwell
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Eric Dumazet
     
  • [ Upstream commit 2570284060b48f3f79d8f1a2698792f36c385e9a ]

    There is a problem with the CWR flag set in an incoming ACK segment:
    it leads to a situation where the ECE flag is latched forever.

    The following packetdrill script shows what happens:

    // Stack receives incoming segments with CE set
    +0.1 [noecn] . 1001:1001(0) ack 12001
    +0.0 >[noecn] E. 1001:1001(0) ack 13001
    +0.0 >[noecn] E. 1001:1001(0) ack 14001

    // Write a packet
    +0.1 write(3, ..., 1000) = 1000
    +0.0 >[ect0] PE. 1001:2001(1000) ack 14001

    // Pure ACK received
    +0.01 [ect0] P. 2001:3001(1000) ack 14001
    // but Linux will still keep ECE latched here, with packetdrill
    // flagging a missing ECE flag, expecting
    // >[ect0] PE. 2001:3001(1000) ack 14001
    // in the script

    In the situation above we will continue to send ECN ECHO packets
    and trigger the peer to reduce the congestion window. To avoid that
    we can check CWR on pure ACKs received.
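
    The latch can be pictured with a toy Python state machine. This is a
    sketch of the behaviour described above, not the kernel implementation;
    EcnReceiver and its check_cwr_on_pure_ack switch are hypothetical.

```python
class EcnReceiver:
    """Toy model of the receiver-side ECE latch."""

    def __init__(self, check_cwr_on_pure_ack):
        self.check_cwr_on_pure_ack = check_cwr_on_pure_ack
        self.demand_cwr = False  # True while we must keep sending ECE

    def receive(self, payload_len, ce=False, cwr=False):
        is_pure_ack = payload_len == 0
        if cwr and (not is_pure_ack or self.check_cwr_on_pure_ack):
            self.demand_cwr = False  # peer reduced cwnd: stop echoing
        if ce:
            self.demand_cwr = True   # congestion seen: start echoing

    def next_segment_has_ece(self):
        return self.demand_cwr
```

    With check_cwr_on_pure_ack=False, a CE-marked data segment followed by a
    pure ACK carrying CWR leaves next_segment_has_ece() returning True
    forever; with check_cwr_on_pure_ack=True, the same pure ACK clears it.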

    v3:
    - Add a sequence check to avoid sending an ACK to an ACK

    v2:
    - Adjusted the comment
    - move CWR check before checking for unacknowledged packets

    Signed-off-by: Denis Kirjanov
    Acked-by: Neal Cardwell
    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Denis Kirjanov
     
  • [ Upstream commit 5eea3a63ff4aba6a26002e657a6d21934b7e2b96 ]

    i.e.,
    $ ifconfig eth0 6.6.6.6 netmask 255.255.255.0

    $ ip rule add from 6.6.6.6 table 6666

    $ ip route add 9.9.9.9 via 6.6.6.6

    $ ping -I 6.6.6.6 9.9.9.9
    PING 9.9.9.9 (9.9.9.9) from 6.6.6.6 : 56(84) bytes of data.

    3 packets transmitted, 0 received, 100% packet loss, time 2079ms

    $ arp
    Address HWtype HWaddress Flags Mask Iface
    6.6.6.6 (incomplete) eth0

    The ARP request address is wrong. This is because fib_table_lookup in
    fib_check_nh looks up the nexthop for destination 9.9.9.9 and the scope
    of the fib result is RT_SCOPE_LINK, whereas the correct scope is
    RT_SCOPE_HOST. Add a check of whether the table is RT_TABLE_MAIN to
    solve this problem.

    Fixes: 3bfd847203c6 ("net: Use passed in table for nexthop lookups")
    Signed-off-by: guodeqing
    Reviewed-by: David Ahern
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    guodeqing
     

24 Jun, 2020

1 commit

  • [ Upstream commit 487082fb7bd2a32b66927d2b22e3a81b072b44f0 ]

    When a user application calls read() with the MSG_PEEK flag to read data
    from a bpf sockmap socket, a kernel panic happens at
    __tcp_bpf_recvmsg+0x12c/0x350. With MSG_PEEK set, sk_msg is not removed
    from the ingress_msg queue after it is read. Because the code does not
    check whether sk_msg is the last msg in the ingress_msg queue, the next
    sk_msg may be the head of the ingress_msg queue, whose sg page address
    is invalid. So it is necessary to add a check to prevent this problem.

    [20759.125457] BUG: kernel NULL pointer dereference, address: 0000000000000008
    [20759.132118] CPU: 53 PID: 51378 Comm: envoy Tainted: G E 5.4.32 #1
    [20759.140890] Hardware name: Inspur SA5212M4/YZMB-00370-109, BIOS 4.1.12 06/18/2017
    [20759.149734] RIP: 0010:copy_page_to_iter+0xad/0x300
    [20759.270877] __tcp_bpf_recvmsg+0x12c/0x350
    [20759.276099] tcp_bpf_recvmsg+0x113/0x370
    [20759.281137] inet_recvmsg+0x55/0xc0
    [20759.285734] __sys_recvfrom+0xc8/0x130
    [20759.290566] ? __audit_syscall_entry+0x103/0x130
    [20759.296227] ? syscall_trace_enter+0x1d2/0x2d0
    [20759.301700] ? __audit_syscall_exit+0x1e4/0x290
    [20759.307235] __x64_sys_recvfrom+0x24/0x30
    [20759.312226] do_syscall_64+0x55/0x1b0
    [20759.316852] entry_SYSCALL_64_after_hwframe+0x44/0xa9

    Signed-off-by: dihu
    Signed-off-by: Alexei Starovoitov
    Acked-by: John Fastabend
    Acked-by: Jakub Sitnicki
    Link: https://lore.kernel.org/bpf/20200605084625.9783-1-anny.hu@linux.alibaba.com
    Signed-off-by: Sasha Levin

    dihu
     

17 Jun, 2020

1 commit

  • [ Upstream commit fbe4e0c1b298b4665ee6915266c9d6c5b934ef4a ]

    fib_triestat_seq_show() calls hlist_for_each_entry_rcu(tb, head,
    tb_hlist) without rcu_read_lock(), which triggers a warning:

    net/ipv4/fib_trie.c:2579 RCU-list traversed in non-reader section!!

    other info that might help us debug this:

    rcu_scheduler_active = 2, debug_locks = 1
    1 lock held by proc01/115277:
    #0: c0000014507acf00 (&p->lock){+.+.}-{3:3}, at: seq_read+0x58/0x670

    Call Trace:
    dump_stack+0xf4/0x164 (unreliable)
    lockdep_rcu_suspicious+0x140/0x164
    fib_triestat_seq_show+0x750/0x880
    seq_read+0x1a0/0x670
    proc_reg_read+0x10c/0x1b0
    __vfs_read+0x3c/0x70
    vfs_read+0xac/0x170
    ksys_read+0x7c/0x140
    system_call+0x5c/0x68

    Fix it by adding a pair of rcu_read_lock/unlock() and using
    cond_resched_rcu() to avoid the situation where walking a large
    number of items may prevent scheduling for a long time.

    Signed-off-by: Qian Cai
    Reviewed-by: Eric Dumazet
    Signed-off-by: David S. Miller
    Signed-off-by: Sasha Levin

    Qian Cai
     

11 Jun, 2020

1 commit

  • [ Upstream commit 1b49cd71b52403822731dc9f283185d1da355f97 ]

    When devinet_sysctl_register() failed, the memory allocated
    in neigh_parms_alloc() should be freed.

    Fixes: 20e61da7ffcf ("ipv4: fail early when creating netdev named all or default")
    Signed-off-by: Yang Yingliang
    Acked-by: Cong Wang
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Yang Yingliang
     

03 Jun, 2020

8 commits

  • commit 1fd1c768f3624a5e66766e7b4ddb9b607cd834a5 upstream.

    Similar to the last patch, need to fix fib_info_nh_uses_dev for
    external nexthops to avoid referencing multiple nh_grp structs.
    Move the device check in fib_info_nh_uses_dev to a helper and
    create a nexthop version that is called if the fib_info uses an
    external nexthop.

    Fixes: 430a049190de ("nexthop: Add support for nexthop groups")
    Signed-off-by: David Ahern
    Acked-by: Nikolay Aleksandrov
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    David Ahern
     
  • commit 90f33bffa382598a32cc82abfeb20adc92d041b6 upstream.

    We must avoid modifying published nexthop groups while they might be
    in use, otherwise we might see NULL ptr dereferences. In order to do
    that we allocate 2 nexthop group structures upon nexthop creation
    and swap between them when we have to delete an entry. The reason is
    that we can't fail nexthop group removal, so we can't handle allocation
    failure. Thus we move the extra allocation to creation, where we can
    safely fail and return ENOMEM.

    Fixes: 430a049190de ("nexthop: Add support for nexthop groups")
    Signed-off-by: Nikolay Aleksandrov
    Signed-off-by: David Ahern
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Nikolay Aleksandrov
     
  • commit ac21753a5c2c9a6a2019997481a2ac12bbde48c8 upstream.

    Move nh_grp dereference and check for removing nexthop group due to
    all members gone into remove_nh_grp_entry.

    Fixes: 430a049190de ("nexthop: Add support for nexthop groups")
    Signed-off-by: David Ahern
    Acked-by: Nikolay Aleksandrov
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    David Ahern
     
  • commit 4c559f15efcc43b996f4da528cd7f9483aaca36d upstream.

    Dan Carpenter says: "Smatch complains that the value for "cmd" comes
    from the network and can't be trusted."

    Add pptp_msg_name() helper function that checks for the array boundary.
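
    The idea behind such a helper can be sketched in Python. The table below
    is a shortened, hypothetical stand-in for the real name array, and the
    fallback to entry 0 is an assumption about how an out-of-range value is
    reported.

```python
# Hypothetical, shortened name table; index 0 doubles as the fallback entry.
PPTP_MSG_NAMES = [
    "UNKNOWN_MESSAGE",
    "START_SESSION_REQUEST",
    "START_SESSION_REPLY",
    "STOP_SESSION_REQUEST",
    "STOP_SESSION_REPLY",
]
PPTP_MSG_MAX = len(PPTP_MSG_NAMES) - 1


def pptp_msg_name(msg):
    """Return a printable name for a message type taken from the wire."""
    if not 0 <= msg <= PPTP_MSG_MAX:
        return PPTP_MSG_NAMES[0]  # untrusted value: never index with it
    return PPTP_MSG_NAMES[msg]
```

    Callers that previously indexed the array directly with the untrusted
    value would go through the helper instead, so a bogus message type such
    as 0xFFFF prints as UNKNOWN_MESSAGE rather than reading past the array.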

    Fixes: f09943fefe6b ("[NETFILTER]: nf_conntrack/nf_nat: add PPTP helper port")
    Reported-by: Dan Carpenter
    Signed-off-by: Pablo Neira Ayuso
    Signed-off-by: Greg Kroah-Hartman

    Pablo Neira Ayuso
     
  • commit 976eba8ab596bab94b9714cd46d38d5c6a2c660d upstream.

    In Commit dd9ee3444014 ("vti4: Fix a ipip packet processing bug in
    'IPCOMP' virtual tunnel"), it tries to receive IPIP packets in vti
    by calling xfrm_input(). This case happens when a small packet or
    frag sent by peer is too small to get compressed.

    However, xfrm_input() will still get to the IPCOMP path where skb
    sec_path is set, but never dropped while it should have been done
    in vti_ipcomp4_protocol.cb_handler(vti_rcv_cb), as it's not an
    ipcomp4 packet. As a result, the packet can never pass
    xfrm4_policy_check() in the upper protocol rcv functions.

    So this patch is to call ip_tunnel_rcv() to process IPIP packets
    instead.

    Fixes: dd9ee3444014 ("vti4: Fix a ipip packet processing bug in 'IPCOMP' virtual tunnel")
    Reported-by: Xiumei Mu
    Signed-off-by: Xin Long
    Signed-off-by: Steffen Klassert
    Signed-off-by: Greg Kroah-Hartman

    Xin Long
     
  • commit db87668ad1e4917cfe04e217307ba6ed9390716e upstream.

    This xfrm_state_put call in esp4/6_gro_receive() will cause
    double put for state, as in out_reset path secpath_reset()
    will put all states set in skb sec_path.

    So fix it by simply removing the xfrm_state_put call.

    Fixes: 6ed69184ed9c ("xfrm: Reset secpath in xfrm failure")
    Signed-off-by: Xin Long
    Signed-off-by: Steffen Klassert
    Signed-off-by: Greg Kroah-Hartman

    Xin Long
     
  • [ Upstream commit 84be69b869a5a496a6cfde9b3c29509207a1f1fa ]

    For nexthop groups, attributes after NHA_GROUP_TYPE are invalid, but
    nh_check_attr_group starts checking at NHA_GROUP. The group type defaults
    to multipath and the NHA_GROUP_TYPE is currently optional so this has
    slipped through so far. Fix the attribute checking to handle support of
    new group types.

    Fixes: 430a049190de ("nexthop: Add support for nexthop groups")
    Signed-off-by: ASSOGBA Emery
    Signed-off-by: David Ahern
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    David Ahern
     
  • [ Upstream commit a6211caa634da39d861a47437ffcda8b38ef421b ]

    Commit adb03115f459 ("net: get rid of an signed integer overflow in ip_idents_reserve()")
    used atomic_cmpxchg to replace "atomic_add_return" inside the function
    "ip_idents_reserve". The reason was to avoid UBSAN warning.
    However, this change has caused a performance degradation, and in GCC-8
    -fno-strict-overflow is now mapped to -fwrapv -fwrapv-pointer
    and signed integer overflow is now undefined by default at all
    optimization levels[1]. Moreover, it was a bug in UBSAN vs -fwrapv
    /-fno-strict-overflow, so let's revert it safely.
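
    The two approaches can be compared with a single-threaded Python model
    of a wrapping 32-bit counter. It is only a sketch: IdentCounter is a
    hypothetical class, the random delta the real function adds is left out,
    and without concurrency the retry loop never actually retries.

```python
MASK32 = 0xFFFFFFFF


class IdentCounter:
    """Single-threaded model of a wrapping 32-bit ID counter."""

    def __init__(self, start=0):
        self.value = start & MASK32

    def add_return(self, n):
        """Model of a wrapping atomic add that returns the new value."""
        self.value = (self.value + n) & MASK32
        return self.value

    def reserve_add(self, segs):
        """One add, then subtract to get the first ID of the block."""
        return (self.add_return(segs) - segs) & MASK32

    def reserve_cmpxchg(self, segs):
        """Retry loop: read, compute, and store only if still unchanged."""
        while True:
            old = self.value
            new = (old + segs) & MASK32
            if self.value == old:      # always true without concurrency
                self.value = new
                return old
```

    Both methods hand out the same block of IDs, including across the 2^32
    wraparound; the difference the commit cares about is that the add needs
    one atomic operation while the loop may spin under contention.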

    [1] https://gcc.gnu.org/gcc-8/changes.html

    Suggested-by: Peter Zijlstra
    Suggested-by: Eric Dumazet
    Cc: "David S. Miller"
    Cc: Alexey Kuznetsov
    Cc: Hideaki YOSHIFUJI
    Cc: Jakub Kicinski
    Cc: Jiri Pirko
    Cc: Arvind Sankar
    Cc: Peter Zijlstra
    Cc: Eric Dumazet
    Cc: Jiong Wang
    Signed-off-by: Yuqi Jin
    Signed-off-by: Shaokun Zhang
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Yuqi Jin