07 Feb, 2019

1 commit

  • ade446403bfb ("net: ipv4: do not handle duplicate fragments as
    overlapping") was backported to many stable trees, but it had a problem
    that was "accidentally" fixed by the upstream commit 0ff89efb5246 ("ip:
    fail fast on IP defrag errors")

    This is the fixup for that problem as we do not want the larger patch in
    the older stable trees.

    Fixes: ade446403bfb ("net: ipv4: do not handle duplicate fragments as overlapping")
    Reported-by: Ivan Babrou
    Reported-by: Eric Dumazet
    Signed-off-by: Greg Kroah-Hartman

    Greg Kroah-Hartman
     

31 Jan, 2019

4 commits

  • [ Upstream commit f6f2a4a2eb92bc73671204198bb2f8ab53ff59fb ]

    Setting the low threshold to 0 has no effect on frags allocation;
    we need to clear high_thresh instead.

    The code predates commit 648700f76b03 ("inet: frags:
    use rhashtables for reassembly units"), but before that commit,
    the assignment served a different purpose: preventing concurrent
    eviction by the worker and the netns cleanup helper.

    Fixes: 648700f76b03 ("inet: frags: use rhashtables for reassembly units")
    Signed-off-by: Paolo Abeni
    Signed-off-by: David S. Miller
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Paolo Abeni
     
  • [ Upstream commit 13d7f46386e060df31b727c9975e38306fa51e7a ]

    TCP transmission with MSG_ZEROCOPY fails if the peer closes its end of
    the connection and so transitions this socket to CLOSE_WAIT state.

    Transmission in close wait state is acceptable. Other similar tests in
    the stack (e.g., in FastOpen) accept both states. Relax this test, too.

    Link: https://www.mail-archive.com/netdev@vger.kernel.org/msg276886.html
    Link: https://www.mail-archive.com/netdev@vger.kernel.org/msg227390.html
    Fixes: f214f915e7db ("tcp: enable MSG_ZEROCOPY")
    Reported-by: Marek Majkowski
    Signed-off-by: Willem de Bruijn
    CC: Yuchung Cheng
    CC: Neal Cardwell
    CC: Soheil Hassas Yeganeh
    CC: Alexey Kodanev
    Acked-by: Soheil Hassas Yeganeh
    Reviewed-by: Eric Dumazet
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Willem de Bruijn
     
  • [ Upstream commit f97f4dd8b3bb9d0993d2491e0f22024c68109184 ]

    IPv4 routing tables are flushed in two cases:

    1. In response to events in the netdev and inetaddr notification chains
    2. When a network namespace is being dismantled

    In both cases only routes associated with a dead nexthop group are
    flushed. However, a nexthop group will only be marked as dead in case it
    is populated with actual nexthops using a nexthop device. This is not
    the case when the route in question is an error route (e.g.,
    'blackhole', 'unreachable').

    Therefore, when a network namespace is being dismantled such routes are
    not flushed and leaked [1].

    To reproduce:
    # ip netns add blue
    # ip -n blue route add unreachable 192.0.2.0/24
    # ip netns del blue

    Fix this by not skipping error routes that are not marked with
    RTNH_F_DEAD when flushing the routing tables.

    To prevent the flushing of such routes in case #1, add a parameter to
    fib_table_flush() that indicates if the table is flushed as part of
    namespace dismantle or not.

    Note that this problem does not exist in IPv6 since error routes are
    associated with the loopback device.

    [1]
    unreferenced object 0xffff888066650338 (size 56):
    comm "ip", pid 1206, jiffies 4294786063 (age 26.235s)
    hex dump (first 32 bytes):
    00 00 00 00 00 00 00 00 b0 1c 62 61 80 88 ff ff ..........ba....
    e8 8b a1 64 80 88 ff ff 00 07 00 08 fe 00 00 00 ...d............
    backtrace:
    [] inet_rtm_newroute+0x129/0x220
    [] rtnetlink_rcv_msg+0x397/0xa20
    [] netlink_rcv_skb+0x132/0x380
    [] netlink_unicast+0x4c0/0x690
    [] netlink_sendmsg+0x929/0xe10
    [] sock_sendmsg+0xc8/0x110
    [] ___sys_sendmsg+0x77a/0x8f0
    [] __sys_sendmsg+0xf7/0x250
    [] do_syscall_64+0x14d/0x610
    [] entry_SYSCALL_64_after_hwframe+0x49/0xbe
    [] 0xffffffffffffffff
    unreferenced object 0xffff888061621c88 (size 48):
    comm "ip", pid 1206, jiffies 4294786063 (age 26.235s)
    hex dump (first 32 bytes):
    6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b kkkkkkkkkkkkkkkk
    6b 6b 6b 6b 6b 6b 6b 6b d8 8e 26 5f 80 88 ff ff kkkkkkkk..&_....
    backtrace:
    [] fib_table_insert+0x978/0x1500
    [] inet_rtm_newroute+0x129/0x220
    [] rtnetlink_rcv_msg+0x397/0xa20
    [] netlink_rcv_skb+0x132/0x380
    [] netlink_unicast+0x4c0/0x690
    [] netlink_sendmsg+0x929/0xe10
    [] sock_sendmsg+0xc8/0x110
    [] ___sys_sendmsg+0x77a/0x8f0
    [] __sys_sendmsg+0xf7/0x250
    [] do_syscall_64+0x14d/0x610
    [] entry_SYSCALL_64_after_hwframe+0x49/0xbe
    [] 0xffffffffffffffff

    Fixes: 8cced9eff1d4 ("[NETNS]: Enable routing configuration in non-initial namespace.")
    Signed-off-by: Ido Schimmel
    Reviewed-by: David Ahern
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Ido Schimmel
     
  • [ Upstream commit 6c57f0458022298e4da1729c67bd33ce41c14e7a ]

    In certain cases, pskb_trim_rcsum() may change skb pointers.
    Reinitialize header pointers afterwards to avoid potential
    use-after-frees. Add a note in the documentation of
    pskb_trim_rcsum(). Found by KASAN.

    Signed-off-by: Ross Lagerwall
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Ross Lagerwall
     

26 Jan, 2019

1 commit

  • [ Upstream commit 06aa151ad1fc74a49b45336672515774a678d78d ]

    If a config for the same destination IP address already exists, that
    config is simply reused. The MAC address should also match, but there
    is no MAC address checking routine, so add one.

    test commands:
    %iptables -A INPUT -p tcp -i lo -d 192.168.0.5 --dport 80 \
    -j CLUSTERIP --new --hashmode sourceip \
    --clustermac 01:00:5e:00:00:20 --total-nodes 2 --local-node 1
    %iptables -A INPUT -p tcp -i lo -d 192.168.0.5 --dport 80 \
    -j CLUSTERIP --new --hashmode sourceip \
    --clustermac 01:00:5e:00:00:21 --total-nodes 2 --local-node 1

    After this patch, above commands are disallowed.

    Signed-off-by: Taehee Yoo
    Signed-off-by: Pablo Neira Ayuso
    Signed-off-by: Sasha Levin

    Taehee Yoo
     

23 Jan, 2019

1 commit

  • [ Upstream commit 4a06fa67c4da20148803525151845276cdb995c1 ]

    Commit 2efd4fca703a ("ip: in cmsg IP(V6)_ORIGDSTADDR call
    pskb_may_pull") avoided a read beyond the end of the skb linear
    segment by calling pskb_may_pull.

    That function can trigger a BUG_ON in pskb_expand_head if the skb is
    shared, which it is when peeking. It can also return ENOMEM.

    Avoid both by switching to safer skb_header_pointer.

    Fixes: 2efd4fca703a ("ip: in cmsg IP(V6)_ORIGDSTADDR call pskb_may_pull")
    Reported-by: syzbot
    Suggested-by: Eric Dumazet
    Signed-off-by: Willem de Bruijn
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Willem de Bruijn
     

10 Jan, 2019

3 commits

  • [ Upstream commit f0c928d878e7d01b613c9ae5c971a6b1e473a938 ]

    Alexei reported use after frees in inet_diag_dump_icsk() [1]

    Because we use refcount_set() when various sockets are set up and
    inserted into ehash, we also need to make sure inet_diag_dump_icsk()
    won't race with the refcount_set() operations.

    Jonathan Lemon sent a patch changing inet_twsk_hashdance(), but
    other spots would need risky changes.

    Instead, fix inet_diag_dump_icsk(), as this bug was introduced
    in linux-4.10 only.

    [1] Quoting Alexei :

    First something iterating over sockets finds already freed tw socket:

    refcount_t: increment on 0; use-after-free.
    WARNING: CPU: 2 PID: 2738 at lib/refcount.c:153 refcount_inc+0x26/0x30
    RIP: 0010:refcount_inc+0x26/0x30
    RSP: 0018:ffffc90004c8fbc0 EFLAGS: 00010282
    RAX: 000000000000002b RBX: 0000000000000000 RCX: 0000000000000000
    RDX: ffff88085ee9d680 RSI: ffff88085ee954c8 RDI: ffff88085ee954c8
    RBP: ffff88010ecbd2c0 R08: 0000000000000000 R09: 000000000000174c
    R10: ffffffff81e7c5a0 R11: 0000000000000000 R12: 0000000000000000
    R13: ffff8806ba9bf210 R14: ffffffff82304600 R15: ffff88010ecbd328
    FS: 00007f81f5a7d700(0000) GS:ffff88085ee80000(0000) knlGS:0000000000000000
    CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: 00007f81e2a95000 CR3: 000000069b2eb006 CR4: 00000000003606e0
    DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
    DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
    Call Trace:
    inet_diag_dump_icsk+0x2b3/0x4e0 [inet_diag] // sock_hold(sk); in net/ipv4/inet_diag.c:1002
    ? kmalloc_large_node+0x37/0x70
    ? __kmalloc_node_track_caller+0x1cb/0x260
    ? __alloc_skb+0x72/0x1b0
    ? __kmalloc_reserve.isra.40+0x2e/0x80
    __inet_diag_dump+0x3b/0x80 [inet_diag]
    netlink_dump+0x116/0x2a0
    netlink_recvmsg+0x205/0x3c0
    sock_read_iter+0x89/0xd0
    __vfs_read+0xf7/0x140
    vfs_read+0x8a/0x140
    SyS_read+0x3f/0xa0
    do_syscall_64+0x5a/0x100

    then a minute later twsk timer fires and hits two bad refcnts
    for this freed socket:

    refcount_t: decrement hit 0; leaking memory.
    WARNING: CPU: 31 PID: 0 at lib/refcount.c:228 refcount_dec+0x2e/0x40
    Modules linked in:
    RIP: 0010:refcount_dec+0x2e/0x40
    RSP: 0018:ffff88085f5c3ea8 EFLAGS: 00010296
    RAX: 000000000000002c RBX: ffff88010ecbd2c0 RCX: 000000000000083f
    RDX: 0000000000000000 RSI: 00000000000000f6 RDI: 000000000000003f
    RBP: ffffc90003c77280 R08: 0000000000000000 R09: 00000000000017d3
    R10: ffffffff81e7c5a0 R11: 0000000000000000 R12: ffffffff82ad2d80
    R13: ffffffff8182de00 R14: ffff88085f5c3ef8 R15: 0000000000000000
    FS: 0000000000000000(0000) GS:ffff88085f5c0000(0000) knlGS:0000000000000000
    CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: 00007fbe42685250 CR3: 0000000002209001 CR4: 00000000003606e0
    DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
    DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
    Call Trace:

    inet_twsk_kill+0x9d/0xc0 // inet_twsk_bind_unhash(tw, hashinfo);
    call_timer_fn+0x29/0x110
    run_timer_softirq+0x36b/0x3a0

    refcount_t: underflow; use-after-free.
    WARNING: CPU: 31 PID: 0 at lib/refcount.c:187 refcount_sub_and_test+0x46/0x50
    RIP: 0010:refcount_sub_and_test+0x46/0x50
    RSP: 0018:ffff88085f5c3eb8 EFLAGS: 00010296
    RAX: 0000000000000026 RBX: ffff88010ecbd2c0 RCX: 000000000000083f
    RDX: 0000000000000000 RSI: 00000000000000f6 RDI: 000000000000003f
    RBP: ffff88010ecbd358 R08: 0000000000000000 R09: 000000000000185b
    R10: ffffffff81e7c5a0 R11: 0000000000000000 R12: ffff88010ecbd358
    R13: ffffffff8182de00 R14: ffff88085f5c3ef8 R15: 0000000000000000
    FS: 0000000000000000(0000) GS:ffff88085f5c0000(0000) knlGS:0000000000000000
    CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: 00007fbe42685250 CR3: 0000000002209001 CR4: 00000000003606e0
    DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
    DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
    Call Trace:

    inet_twsk_put+0x12/0x20 // inet_twsk_put(tw);
    call_timer_fn+0x29/0x110
    run_timer_softirq+0x36b/0x3a0

    Fixes: 67db3e4bfbc9 ("tcp: no longer hold ehash lock while calling tcp_get_info()")
    Signed-off-by: Eric Dumazet
    Reported-by: Alexei Starovoitov
    Cc: Jonathan Lemon
    Acked-by: Jonathan Lemon
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Eric Dumazet
     
  • [ Upstream commit ade446403bfb79d3528d56071a84b15351a139ad ]

    Since commit 7969e5c40dfd ("ip: discard IPv4 datagrams with overlapping
    segments.") IPv4 reassembly code drops the whole queue whenever an
    overlapping fragment is received. However, the test is written in a way
    which detects duplicate fragments as overlapping so that in environments
    with many duplicate packets, fragmented packets may be undeliverable.

    Add an extra test and, for a (potentially) duplicate fragment, drop
    only the new fragment rather than the whole queue. Only the starting
    offset and length are checked, not the contents of the fragments, as
    that would be too expensive. For the same reason, the linear list
    ("run") of an rbtree node is not iterated; we only check whether the
    new fragment is a subset of the interval covered by existing
    consecutive fragments.

    v2: instead of an exact check iterating through linear list of an rbtree
    node, only check if the new fragment is subset of the "run" (suggested
    by Eric Dumazet)

    Fixes: 7969e5c40dfd ("ip: discard IPv4 datagrams with overlapping segments.")
    Signed-off-by: Michal Kubecek
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Michal Kubecek
     
  • [ Upstream commit 5648451e30a0d13d11796574919a359025d52cce ]

    vr.vifi is indirectly controlled by user-space, hence leading to
    a potential exploitation of the Spectre variant 1 vulnerability.

    This issue was detected with the help of Smatch:

    net/ipv4/ipmr.c:1616 ipmr_ioctl() warn: potential spectre issue 'mrt->vif_table' [r] (local cap)
    net/ipv4/ipmr.c:1690 ipmr_compat_ioctl() warn: potential spectre issue 'mrt->vif_table' [r] (local cap)

    Fix this by sanitizing vr.vifi before using it to index mrt->vif_table.

    Notice that given that speculation windows are large, the policy is
    to kill the speculation on the first load and not worry if it can be
    completed with a dependent load/store [1].

    [1] https://marc.info/?l=linux-kernel&m=152449131114778&w=2

    Signed-off-by: Gustavo A. R. Silva
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Gustavo A. R. Silva
     

17 Dec, 2018

4 commits

  • commit f9bfe4e6a9d08d405fe7b081ee9a13e649c97ecf upstream.

    tcp_tso_should_defer() can return true in three different cases:

    1) We are cwnd-limited
    2) We are rwnd-limited
    3) We are application limited.

    Neal pointed out that my recent fix went too far, since
    it assumed that if we were not in case 1), we must be rwnd-limited.

    Fix this by properly populating the is_cwnd_limited and
    is_rwnd_limited booleans.

    After this change, we can finally restrict the silly check for the
    FIN flag to the application-limited case.

    The same move for EOR bit will be handled in net-next,
    since commit 1c09f7d073b1 ("tcp: do not try to defer skbs
    with eor mark (MSG_EOR)") is scheduled for linux-4.21

    Tested by running 200 concurrent netperf -t TCP_RR -- -r 60000,100
    and checking none of them was rwnd_limited in the chrono_stat
    output from "ss -ti" command.

    Fixes: 41727549de3e ("tcp: Do not underestimate rwnd_limited")
    Signed-off-by: Eric Dumazet
    Suggested-by: Neal Cardwell
    Reviewed-by: Neal Cardwell
    Acked-by: Soheil Hassas Yeganeh
    Reviewed-by: Yuchung Cheng
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Eric Dumazet
     
  • [ Upstream commit b2b7af861122a0c0f6260155c29a1b2e594cd5b5 ]

    The TCP loss probe timer may fire when the retransmission queue is
    empty but the tp->packets_out counter is non-zero. tcp_send_loss_probe
    will call tcp_rearm_rto, which triggers a NULL pointer dereference by
    fetching the retransmission queue head in its sub-routines.

    Add a more detailed warning to help catch the root cause of the inflight
    accounting inconsistency.

    Reported-by: Rafael Tinoco
    Signed-off-by: Yuchung Cheng
    Signed-off-by: Eric Dumazet
    Signed-off-by: Neal Cardwell
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Yuchung Cheng
     
  • [ Upstream commit 41727549de3e7281feb174d568c6e46823db8684 ]

    If available rwnd is too small, tcp_tso_should_defer()
    can decide it is worth waiting before splitting a TSO packet.

    This really means we are rwnd limited.

    Fixes: 5615f88614a4 ("tcp: instrument how long TCP is limited by receive window")
    Signed-off-by: Eric Dumazet
    Acked-by: Soheil Hassas Yeganeh
    Reviewed-by: Yuchung Cheng
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Eric Dumazet
     
  • [ Upstream commit ebaf39e6032faf77218220707fc3fa22487784e0 ]

    The *_frag_reasm() functions are susceptible to miscalculating the byte
    count of packet fragments in case the truesize of a head buffer changes.
    The truesize member may be changed by the call to skb_unclone(), leaving
    the fragment memory limit counter unbalanced even if all fragments are
    processed. This miscalculation goes unnoticed as long as the network
    namespace which holds the counter is not destroyed.

    Should an attempt be made to destroy a network namespace that holds an
    unbalanced fragment memory limit counter the cleanup of the namespace
    never finishes. The thread handling the cleanup gets stuck in
    inet_frags_exit_net() waiting for the percpu counter to reach zero. The
    thread is usually in running state with a stacktrace similar to:

    PID: 1073 TASK: ffff880626711440 CPU: 1 COMMAND: "kworker/u48:4"
    #5 [ffff880621563d48] _raw_spin_lock at ffffffff815f5480
    #6 [ffff880621563d48] inet_evict_bucket at ffffffff8158020b
    #7 [ffff880621563d80] inet_frags_exit_net at ffffffff8158051c
    #8 [ffff880621563db0] ops_exit_list at ffffffff814f5856
    #9 [ffff880621563dd8] cleanup_net at ffffffff814f67c0
    #10 [ffff880621563e38] process_one_work at ffffffff81096f14

    It is not possible to create new network namespaces, and processes
    that call unshare() end up being stuck in uninterruptible sleep state
    waiting to acquire the net_mutex.

    The bug was observed in the IPv6 netfilter code by Per Sundstrom.
    I thank him for his analysis of the problem. The parts of this patch
    that apply to IPv4 and IPv6 fragment reassembly are preemptive measures.

    Signed-off-by: Jiri Wiesner
    Reported-by: Per Sundstrom
    Acked-by: Peter Oskolkov
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Jiri Wiesner
     

08 Dec, 2018

1 commit

  • commit 000ade8016400d93b4d7c89970d96b8c14773d45 upstream.

    By passing a limit of 2 bytes to strncat, strncat is limited to writing
    fewer bytes than what it's supposed to append to the name here.

    Since the bounds are checked on the line above this, just remove the
    string bounds checks entirely, as they're unneeded.

    Signed-off-by: Sultan Alsawaf
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Sultan Alsawaf
     

01 Dec, 2018

1 commit

  • commit 8873c064d1de579ea23412a6d3eee972593f142b upstream.

    syzkaller was able to hit the WARN_ON(sock_owned_by_user(sk));
    in tcp_close()

    While a socket is being closed, it is very possible that other
    threads find it in an rtnetlink dump.

    tcp_get_info() will acquire the socket lock for a short amount
    of time (slow = lock_sock_fast(sk)/unlock_sock_fast(sk, slow);),
    enough to trigger the warning.

    Fixes: 67db3e4bfbc9 ("tcp: no longer hold ehash lock while calling tcp_get_info()")
    Signed-off-by: Eric Dumazet
    Reported-by: syzbot
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Eric Dumazet
     

23 Nov, 2018

2 commits

  • [ Upstream commit 0d5b9311baf27bb545f187f12ecfd558220c607d ]

    Multiple cpus might attempt to insert a new fragment in the rhashtable
    if, for example, RPS is buggy, as reported by 배석진 in
    https://patchwork.ozlabs.org/patch/994601/

    We use rhashtable_lookup_get_insert_key() instead of
    rhashtable_insert_fast() to let cpus losing the race
    free their own inet_frag_queue and use the one that
    was inserted by another cpu.

    Fixes: 648700f76b03 ("inet: frags: use rhashtables for reassembly units")
    Signed-off-by: Eric Dumazet
    Reported-by: 배석진
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Eric Dumazet
     
  • [ Upstream commit 16f7eb2b77b55da816c4e207f3f9440a8cafc00a ]

    The various types of tunnels running over IPv4 can ask to set the DF
    bit to do PMTU discovery. However, PMTU discovery is subject to the
    threshold set by the net.ipv4.route.min_pmtu sysctl, and is also
    disabled on routes with "mtu lock". In those cases, we shouldn't set
    the DF bit.

    This patch makes setting the DF bit conditional on the route's MTU
    locking state.

    This issue seems to be older than git history.

    Signed-off-by: Sabrina Dubroca
    Reviewed-by: Stefano Brivio
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Sabrina Dubroca
     

14 Nov, 2018

1 commit

  • commit 076ed3da0c9b2f88d9157dbe7044a45641ae369e upstream.

    commit 40413955ee26 ("Cipso: cipso_v4_optptr enter infinite loop") fixed
    a possible infinite loop in the IP option parsing of CIPSO. The fix
    assumes that ip_options_compile filtered out all zero length options and
    that no other one-byte options beside IPOPT_END and IPOPT_NOOP exist.
    While this assumption currently holds true, add explicit checks for zero
    length and invalid length options to be safe for the future. Even though
    ip_options_compile should have validated the options, the introduction of
    new one-byte options can still confuse this code without the additional
    checks.

    Signed-off-by: Stefan Nuernberger
    Cc: David Woodhouse
    Cc: Simon Veith
    Cc: stable@vger.kernel.org
    Acked-by: Paul Moore
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Stefan Nuernberger
     

04 Nov, 2018

4 commits

  • [ Upstream commit eddf016b910486d2123675a6b5fd7d64f77cdca8 ]

    If the skb space ends in an unresolved entry while dumping we'll miss
    some unresolved entries. The reason is due to zeroing the entry counter
    between dumping resolved and unresolved mfc entries. We should just
    keep counting until the whole table is dumped and zero when we move to
    the next as we have a separate table counter.

    Reported-by: Colin Ian King
    Fixes: 8fb472c09b9d ("ipmr: improve hash scalability")
    Signed-off-by: Nikolay Aleksandrov
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Nikolay Aleksandrov
     
  • [ Upstream commit 7de414a9dd91426318df7b63da024b2b07e53df5 ]

    Most callers of pskb_trim_rcsum() simply drop the skb when
    it fails; however, ip_check_defrag() still continues to pass
    the skb up the stack. This is suspicious.

    In ip_check_defrag(), after we learn the skb is an IP fragment,
    passing the skb to callers makes no sense, because callers expect
    fragments are defrag'ed on success. So, dropping the skb when we
    can't defrag it is reasonable.

    Note, prior to commit 88078d98d1bb, this is not a big problem as
    checksum will be fixed up anyway. After it, the checksum is not
    correct on failure.

    Found this during code review.

    Fixes: 88078d98d1bb ("net: pskb_trim_rcsum() and CHECKSUM_COMPLETE are friends")
    Cc: Eric Dumazet
    Signed-off-by: Cong Wang
    Reviewed-by: Eric Dumazet
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Cong Wang
     
  • [ Upstream commit db4f1be3ca9b0ef7330763d07bf4ace83ad6f913 ]

    Current handling of CHECKSUM_COMPLETE packets by the UDP stack is
    incorrect for any packet that has an incorrect checksum value.

    udp4/6_csum_init() will both make a call to
    __skb_checksum_validate_complete() to initialize/validate the csum
    field when receiving a CHECKSUM_COMPLETE packet. When this packet
    fails validation, skb->csum will be overwritten with the pseudoheader
    checksum so the packet can be fully validated by software, but the
    skb->ip_summed value will be left as CHECKSUM_COMPLETE so that way
    the stack can later warn the user about their hardware spewing bad
    checksums. Unfortunately, leaving the SKB in this state can cause
    problems later on in the checksum calculation.

    Since the packet is still marked as CHECKSUM_COMPLETE,
    udp_csum_pull_header() will SUBTRACT the checksum of the UDP header
    from skb->csum instead of adding it, leaving us with a garbage value
    in that field. Once we try to copy the packet to userspace in the
    udp4/6_recvmsg(), we'll make a call to skb_copy_and_csum_datagram_msg()
    to checksum the packet data and add it in the garbage skb->csum value
    to perform our final validation check.

    Since the value we're validating is not the proper checksum, it's possible
    that the folded value could come out to 0, causing us not to drop the
    packet. Instead, we believe that the packet was checksummed incorrectly
    by hardware since skb->ip_summed is still CHECKSUM_COMPLETE, and we attempt
    to warn the user with netdev_rx_csum_fault(skb->dev);

    Unfortunately, since this is the UDP path, skb->dev has been overwritten
    by skb->dev_scratch and is no longer a valid pointer, so we end up
    reading invalid memory.

    This patch addresses this problem in two ways:
    1) Do not use the dev pointer when calling netdev_rx_csum_fault()
    from skb_copy_and_csum_datagram_msg(). Since this gets called
    from the UDP path where skb->dev has been overwritten, we have
    no way of knowing if the pointer is still valid. Also for the
    sake of consistency with the other uses of
    netdev_rx_csum_fault(), don't attempt to call it if the
    packet was checksummed by software.

    2) Add better CHECKSUM_COMPLETE handling to udp4/6_csum_init().
    If we receive a packet that's CHECKSUM_COMPLETE that fails
    verification (i.e. skb->csum_valid == 0), check who performed
    the calculation. It's possible that the checksum was done in
    software by the network stack earlier (such as Netfilter's
    CONNTRACK module), and if that says the checksum is bad,
    we can drop the packet immediately instead of waiting until
    we try and copy it to userspace. Otherwise, we need to
    mark the SKB as CHECKSUM_NONE, since the skb->csum field
    no longer contains the full packet checksum after the
    call to __skb_checksum_validate_complete().

    Fixes: e6afc8ace6dd ("udp: remove headers from UDP packets before queueing")
    Fixes: c84d949057ca ("udp: copy skb->truesize in the first cache line")
    Cc: Sam Kumar
    Cc: Eric Dumazet
    Signed-off-by: Sean Tranchetti
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Sean Tranchetti
     
  • [ Upstream commit bfc0698bebcb16d19ecfc89574ad4d696955e5d3 ]

    A policy may have been set up with multiple transforms (e.g., ESP
    and ipcomp). In this situation, the ingress IPsec processing
    iterates in xfrm_input() and applies each transform in turn,
    processing the nexthdr to find any additional xfrm that may apply.

    This patch resets the transport header back to network header
    only after the last transformation so that subsequent xfrms
    can find the correct transport header.

    Fixes: 7785bba299a8 ("esp: Add a software GRO codepath")
    Suggested-by: Steffen Klassert
    Signed-off-by: Sowmini Varadhan
    Signed-off-by: Steffen Klassert
    Signed-off-by: Sasha Levin

    Sowmini Varadhan
     

18 Oct, 2018

6 commits

  • [ Upstream commit 2ab2ddd301a22ca3c5f0b743593e4ad2953dfa53 ]

    Timer handlers do not imply rcu_read_lock(), so my recent fix
    triggered a LOCKDEP warning when a SYNACK is retransmitted.

    Let's add rcu_read_lock()/rcu_read_unlock() pairs around ireq->ireq_opt
    usages instead of guessing what is done by callers, since it is
    not worth the pain.

    Get rid of ireq_opt_deref() helper since it hides the logic
    without real benefit, since it is now a standard rcu_dereference().

    Fixes: 1ad98e9d1bdf ("tcp/dccp: fix lockdep issue when SYN is backlogged")
    Signed-off-by: Eric Dumazet
    Reported-by: Willem de Bruijn
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Eric Dumazet
     
  • [ Upstream commit 1ad98e9d1bdf4724c0a8532fabd84bf3c457c2bc ]

    In normal SYN processing, packets are handled without listener
    lock and in RCU protected ingress path.

    But syzkaller is known to be able to trick us and SYN
    packets might be processed in process context, after being
    queued into socket backlog.

    In commit 06f877d613be ("tcp/dccp: fix other lockdep splats
    accessing ireq_opt") I made a very stupid fix, that happened
    to work mostly because of the regular path being RCU protected.

    Really the thing protecting ireq->ireq_opt is RCU read lock,
    and the pseudo request refcnt is not relevant.

    This patch extends what I did in commit 449809a66c1d ("tcp/dccp:
    block BH for SYN processing") by adding an extra rcu_read_{lock|unlock}
    pair in the paths that might be taken when processing SYN from
    socket backlog (thus possibly in process context).

    Fixes: 06f877d613be ("tcp/dccp: fix other lockdep splats accessing ireq_opt")
    Signed-off-by: Eric Dumazet
    Reported-by: syzbot
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Eric Dumazet
     
  • [ Upstream commit 7e823644b60555f70f241274b8d0120dd919269a ]

    Commit 2276f58ac589 ("udp: use a separate rx queue for packet reception")
    turned static inline __skb_recv_udp() from being a trivial helper around
    __skb_recv_datagram() into a UDP-specific implementation, making it
    EXPORT_SYMBOL_GPL() at the same time.

    There are external modules that got broken by __skb_recv_udp() not being
    visible to them. Let's unbreak them by making __skb_recv_udp EXPORT_SYMBOL().

    Rationale (one of several) for why this is actually the "technically
    correct" thing to do: __skb_recv_udp() used to be an inline wrapper
    around __skb_recv_datagram(), which itself (still, and correctly so,
    I believe) is EXPORT_SYMBOL().

    Cc: Paolo Abeni
    Cc: Eric Dumazet
    Fixes: 2276f58ac589 ("udp: use a separate rx queue for packet reception")
    Signed-off-by: Jiri Kosina
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Jiri Kosina
     
  • [ Upstream commit af7d6cce53694a88d6a1bb60c9a239a6a5144459 ]

    Since commit 5aad1de5ea2c ("ipv4: use separate genid for next hop
    exceptions"), exceptions get deprecated separately from cached
    routes. In particular, administrative changes don't clear PMTU anymore.

    As Stefano described in commit e9fa1495d738 ("ipv6: Reflect MTU changes
    on PMTU of exceptions for MTU-less routes"), the PMTU discovered before
    the local MTU change can become stale:
    - if the local MTU is now lower than the PMTU, that PMTU is now
    incorrect
    - if the local MTU was the lowest value in the path, and is increased,
    we might discover a higher PMTU

    Similarly to what commit e9fa1495d738 did for IPv6, update PMTU in those
    cases.

    If the exception was locked, the discovered PMTU was smaller than the
    minimal accepted PMTU. In that case, if the new local MTU is smaller
    than the current PMTU, let PMTU discovery figure out if locking of the
    exception is still needed.

    To do this, we need to know the old link MTU in the NETDEV_CHANGEMTU
    notifier. By the time the notifier is called, dev->mtu has been
    changed. This patch adds the old MTU as additional information in the
    notifier structure, and a new call_netdevice_notifiers_u32() function.

    Fixes: 5aad1de5ea2c ("ipv4: use separate genid for next hop exceptions")
    Signed-off-by: Sabrina Dubroca
    Reviewed-by: Stefano Brivio
    Reviewed-by: David Ahern
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Sabrina Dubroca
     
  • [ Upstream commit 64199fc0a46ba211362472f7f942f900af9492fd ]

    Caching ip_hdr(skb) before a call to pskb_may_pull() is buggy,
    do not do it.

    Fixes: 2efd4fca703a ("ip: in cmsg IP(V6)_ORIGDSTADDR call pskb_may_pull")
    Signed-off-by: Eric Dumazet
    Cc: Willem de Bruijn
    Reported-by: syzbot
    Acked-by: Willem de Bruijn
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Eric Dumazet
     
  • [ Upstream commit ccfec9e5cb2d48df5a955b7bf47f7782157d3bc2 ]

    Cong noted that we need the same checks introduced by commit 76c0ddd8c3a6
    ("ip6_tunnel: be careful when accessing the inner header")
    even for ipv4 tunnels.

    Fixes: c54419321455 ("GRE: Refactor GRE tunneling code.")
    Suggested-by: Cong Wang
    Signed-off-by: Paolo Abeni
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Paolo Abeni
     

29 Sep, 2018

2 commits

  • [ Upstream commit 2b5a921740a55c00223a797d075b9c77c42cb171 ]

    commit 2abb7cdc0dc8 ("udp: Add support for doing checksum
    unnecessary conversion") left out the early demux path for
    connected sockets. As a result IP_CMSG_CHECKSUM gives wrong
    values for such socket when GRO is not enabled/available.

    This change addresses the issue by moving the csum conversion to a
    common helper and using such helper in both the default and the
    early demux rx path.

    Fixes: 2abb7cdc0dc8 ("udp: Add support for doing checksum unnecessary conversion")
    Signed-off-by: Paolo Abeni
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Paolo Abeni
     
  • [ Upstream commit c56cae23c6b167acc68043c683c4573b80cbcc2c ]

    When splitting a GSO segment that consists of encapsulated packets, the
    skb->mac_len of the segments can end up being set wrong, causing packet
    drops in particular when using act_mirred and ifb interfaces in
    combination with a qdisc that splits GSO packets.

    This happens because at the time skb_segment() is called, network_header
    will point to the inner header, throwing off the calculation in
    skb_reset_mac_len(). The network_header is subsequently adjusted by the
    outer IP gso_segment handlers, but they don't set the mac_len.

    Fix this by adding skb_reset_mac_len() calls to both the IPv4 and IPv6
    gso_segment handlers, after they modify the network_header.

    Many thanks to Eric Dumazet for his help in identifying the cause of
    the bug.

    Acked-by: Dave Taht
    Reviewed-by: Eric Dumazet
    Signed-off-by: Toke Høiland-Jørgensen
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Toke Høiland-Jørgensen
     

26 Sep, 2018

3 commits

  • [ Upstream commit 5cf4a8532c992bb22a9ecd5f6d93f873f4eaccc2 ]

    According to the documentation in msg_zerocopy.rst, the SO_ZEROCOPY
    flag was introduced because send(2) ignores unknown message flags and
    any legacy application which was accidentally passing the equivalent of
    MSG_ZEROCOPY earlier should not see any new behaviour.

    Before commit f214f915e7db ("tcp: enable MSG_ZEROCOPY"), a send(2) call
    which passed the equivalent of MSG_ZEROCOPY without setting SO_ZEROCOPY
    would succeed. However, after that commit, it fails with -ENOBUFS. So
    it appears that the SO_ZEROCOPY flag fails to fulfill its intended
    purpose. Fix it.

    Fixes: f214f915e7db ("tcp: enable MSG_ZEROCOPY")
    Signed-off-by: Vincent Whitchurch
    Acked-by: Willem de Bruijn
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Vincent Whitchurch
     
  • [ Upstream commit 5a64506b5c2c3cdb29d817723205330378075448 ]

    If the erspan tunnel hasn't been established, we should send an icmp
    port unreachable message after receiving erspan packets.

    Fixes: 84e54fe0a5ea ("gre: introduce native tunnel support for ERSPAN")
    Cc: William Tu
    Signed-off-by: Haishuang Yan
    Acked-by: William Tu
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Haishuang Yan
     
  • [ Upstream commit 51dc63e3911fbb1f0a7a32da2fe56253e2040ea4 ]

    When processing an icmp unreachable message for an erspan tunnel, the
    tunnel id should be erspan_net_id instead of ipgre_net_id.

    Fixes: 84e54fe0a5ea ("gre: introduce native tunnel support for ERSPAN")
    Cc: William Tu
    Signed-off-by: Haishuang Yan
    Acked-by: William Tu
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Haishuang Yan
     

20 Sep, 2018

6 commits

  • commit 5d407b071dc369c26a38398326ee2be53651cfe4 upstream

    A kernel crash occurs when a defragmented packet is re-fragmented
    in ip_do_fragment().
    In the defragment routine, skb_orphan() is called and
    skb->ip_defrag_offset is set. But skb->sk and
    skb->ip_defrag_offset are members of the same union, so
    frag->sk ends up non-NULL.
    Hence the crash occurs in the skb->sk check in ip_do_fragment() when
    the defragmented packet is fragmented.

    test commands:
    %iptables -t nat -I POSTROUTING -j MASQUERADE
    %hping3 192.168.4.2 -s 1000 -p 2000 -d 60000

    splat looks like:
    [ 261.069429] kernel BUG at net/ipv4/ip_output.c:636!
    [ 261.075753] invalid opcode: 0000 [#1] SMP DEBUG_PAGEALLOC KASAN PTI
    [ 261.083854] CPU: 1 PID: 1349 Comm: hping3 Not tainted 4.19.0-rc2+ #3
    [ 261.100977] RIP: 0010:ip_do_fragment+0x1613/0x2600
    [ 261.106945] Code: e8 e2 38 e3 fe 4c 8b 44 24 18 48 8b 74 24 08 e9 92 f6 ff ff 80 3c 02 00 0f 85 da 07 00 00 48 8b b5 d0 00 00 00 e9 25 f6 ff ff 0b 0f 0b 44 8b 54 24 58 4c 8b 4c 24 18 4c 8b 5c 24 60 4c 8b 6c
    [ 261.127015] RSP: 0018:ffff8801031cf2c0 EFLAGS: 00010202
    [ 261.134156] RAX: 1ffff1002297537b RBX: ffffed0020639e6e RCX: 0000000000000004
    [ 261.142156] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff880114ba9bd8
    [ 261.150157] RBP: ffff880114ba8a40 R08: ffffed0022975395 R09: ffffed0022975395
    [ 261.158157] R10: 0000000000000001 R11: ffffed0022975394 R12: ffff880114ba9ca4
    [ 261.166159] R13: 0000000000000010 R14: ffff880114ba9bc0 R15: dffffc0000000000
    [ 261.174169] FS: 00007fbae2199700(0000) GS:ffff88011b400000(0000) knlGS:0000000000000000
    [ 261.183012] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    [ 261.189013] CR2: 00005579244fe000 CR3: 0000000119bf4000 CR4: 00000000001006e0
    [ 261.198158] Call Trace:
    [ 261.199018] ? dst_output+0x180/0x180
    [ 261.205011] ? save_trace+0x300/0x300
    [ 261.209018] ? ip_copy_metadata+0xb00/0xb00
    [ 261.213034] ? sched_clock_local+0xd4/0x140
    [ 261.218158] ? kill_l4proto+0x120/0x120 [nf_conntrack]
    [ 261.223014] ? rt_cpu_seq_stop+0x10/0x10
    [ 261.227014] ? find_held_lock+0x39/0x1c0
    [ 261.233008] ip_finish_output+0x51d/0xb50
    [ 261.237006] ? ip_fragment.constprop.56+0x220/0x220
    [ 261.243011] ? nf_ct_l4proto_register_one+0x5b0/0x5b0 [nf_conntrack]
    [ 261.250152] ? rcu_is_watching+0x77/0x120
    [ 261.255010] ? nf_nat_ipv4_out+0x1e/0x2b0 [nf_nat_ipv4]
    [ 261.261033] ? nf_hook_slow+0xb1/0x160
    [ 261.265007] ip_output+0x1c7/0x710
    [ 261.269005] ? ip_mc_output+0x13f0/0x13f0
    [ 261.273002] ? __local_bh_enable_ip+0xe9/0x1b0
    [ 261.278152] ? ip_fragment.constprop.56+0x220/0x220
    [ 261.282996] ? nf_hook_slow+0xb1/0x160
    [ 261.287007] raw_sendmsg+0x21f9/0x4420
    [ 261.291008] ? dst_output+0x180/0x180
    [ 261.297003] ? sched_clock_cpu+0x126/0x170
    [ 261.301003] ? find_held_lock+0x39/0x1c0
    [ 261.306155] ? stop_critical_timings+0x420/0x420
    [ 261.311004] ? check_flags.part.36+0x450/0x450
    [ 261.315005] ? _raw_spin_unlock_irq+0x29/0x40
    [ 261.320995] ? _raw_spin_unlock_irq+0x29/0x40
    [ 261.326142] ? cyc2ns_read_end+0x10/0x10
    [ 261.330139] ? raw_bind+0x280/0x280
    [ 261.334138] ? sched_clock_cpu+0x126/0x170
    [ 261.338995] ? check_flags.part.36+0x450/0x450
    [ 261.342991] ? __lock_acquire+0x4500/0x4500
    [ 261.348994] ? inet_sendmsg+0x11c/0x500
    [ 261.352989] ? dst_output+0x180/0x180
    [ 261.357012] inet_sendmsg+0x11c/0x500
    [ ... ]

    v2:
    - clear skb->sk in the reassembly routine (Eric Dumazet)

    Fixes: fa0f527358bd ("ip: use rb trees for IP frag queue.")
    Suggested-by: Eric Dumazet
    Signed-off-by: Taehee Yoo
    Reviewed-by: Eric Dumazet
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Taehee Yoo
     
  • This patch changes the runtime behavior of the IP defrag queue:
    incoming in-order fragments are appended to the tail of the current
    list/"run" of in-order fragments.

    On some workloads, UDP stream performance is substantially improved:

    RX: ./udp_stream -F 10 -T 2 -l 60
    TX: ./udp_stream -c -H -F 10 -T 5 -l 60

    with this patchset applied on a 10Gbps receiver:

    throughput=9524.18
    throughput_units=Mbit/s

    upstream (net-next):

    throughput=4608.93
    throughput_units=Mbit/s

    Reported-by: Willem de Bruijn
    Signed-off-by: Peter Oskolkov
    Cc: Eric Dumazet
    Cc: Florian Westphal
    Signed-off-by: David S. Miller
    (cherry picked from commit a4fd284a1f8fd4b6c59aa59db2185b1e17c5c11c)
    Signed-off-by: Greg Kroah-Hartman

    Peter Oskolkov
     
  • This patch introduces several helper functions/macros that will be
    used in the follow-up patch. No runtime changes yet.

    The new logic (fully implemented in the second patch) is as follows:

    * Nodes in the rb-tree will now contain not single fragments, but lists
    of consecutive fragments ("runs").

    * At each point in time, the current "active" run at the tail is
    maintained/tracked. Fragments that arrive in-order, adjacent
    to the previous tail fragment, are added to this tail run without
    triggering the re-balancing of the rb-tree.

    * If a fragment arrives out of order with the offset _before_ the tail run,
    it is inserted into the rb-tree as a single fragment.

    * If a fragment arrives after the current tail fragment (with a gap),
    it starts a new "tail" run, and is inserted into the rb-tree
    at the end as the head of the new run.

    skb->cb is used to store additional information
    needed here (suggested by Eric Dumazet).

    Reported-by: Willem de Bruijn
    Signed-off-by: Peter Oskolkov
    Cc: Eric Dumazet
    Cc: Florian Westphal
    Signed-off-by: David S. Miller
    (cherry picked from commit 353c9cb360874e737fb000545f783df756c06f9a)
    Signed-off-by: Greg Kroah-Hartman

    Peter Oskolkov
     
  • We accidentally removed the parentheses here, but they are required
    because '!' has higher precedence than '&'.

    Fixes: fa0f527358bd ("ip: use rb trees for IP frag queue.")
    Signed-off-by: Dan Carpenter
    Signed-off-by: David S. Miller
    (cherry picked from commit 70837ffe3085c9a91488b52ca13ac84424da1042)
    Signed-off-by: Greg Kroah-Hartman

    Dan Carpenter
     
  • commit bffa72cf7f9df842f0016ba03586039296b4caaf upstream

    skb->rbnode shares space with skb->next, skb->prev and skb->dev

    Current uses (TCP receive ofo queue and netem) need to save/restore
    tstamp, while skb->dev is either NULL (TCP) or a constant for a given
    queue (netem).

    Since we plan using an RB tree for TCP retransmit queue to speedup SACK
    processing with large BDP, this patch exchanges skb->dev and
    skb->tstamp.

    This saves some overhead in both TCP and netem.

    v2: removes the swtstamp field from struct tcp_skb_cb

    Signed-off-by: Eric Dumazet
    Cc: Soheil Hassas Yeganeh
    Cc: Wei Wang
    Cc: Willem de Bruijn
    Acked-by: Soheil Hassas Yeganeh
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Eric Dumazet
     
  • Generalize private netem_rb_to_skb()

    TCP rtx queue will soon be converted to rb-tree,
    so we will need skb_rbtree_walk() helpers.

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller
    (cherry picked from commit 18a4c0eab2623cc95be98a1e6af1ad18e7695977)
    Signed-off-by: Greg Kroah-Hartman

    Eric Dumazet