18 Oct, 2018

6 commits

  • [ Upstream commit 2ab2ddd301a22ca3c5f0b743593e4ad2953dfa53 ]

    Timer handlers do not imply rcu_read_lock(), so my recent fix
    triggered a LOCKDEP warning when a SYNACK is retransmitted.

    Let's add rcu_read_lock()/rcu_read_unlock() pairs around ireq->ireq_opt
    usages instead of guessing what is done by callers, since it is
    not worth the pain.

    Get rid of the ireq_opt_deref() helper: it hides the logic
    without real benefit, now that it is a standard rcu_dereference().

    Fixes: 1ad98e9d1bdf ("tcp/dccp: fix lockdep issue when SYN is backlogged")
    Signed-off-by: Eric Dumazet
    Reported-by: Willem de Bruijn
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Eric Dumazet
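
    A minimal sketch of the pattern the commit describes: take the RCU read lock
    locally around the rcu_dereference() of ireq->ireq_opt when sending the SYNACK,
    instead of relying on the caller's context. The surrounding function is
    simplified and illustrative, not the exact upstream diff.

    static int tcp_v4_send_synack_sketch(const struct sock *sk,
                                         struct request_sock *req,
                                         struct sk_buff *skb)
    {
            struct inet_request_sock *ireq = inet_rsk(req);
            int err;

            rcu_read_lock();
            err = ip_build_and_send_pkt(skb, sk, ireq->ir_loc_addr,
                                        ireq->ir_rmt_addr,
                                        rcu_dereference(ireq->ireq_opt));
            rcu_read_unlock();

            return net_xmit_eval(err);
    }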
     
  • [ Upstream commit 1ad98e9d1bdf4724c0a8532fabd84bf3c457c2bc ]

    In normal SYN processing, packets are handled without listener
    lock and in RCU protected ingress path.

    But syzkaller is known to be able to trick us, and SYN
    packets might be processed in process context after being
    queued into the socket backlog.

    In commit 06f877d613be ("tcp/dccp: fix other lockdep splats
    accessing ireq_opt") I made a very stupid fix that happened
    to work mostly because the regular path is RCU protected.

    Really, the thing protecting ireq->ireq_opt is the RCU read lock,
    and the pseudo request refcnt is not relevant.

    This patch extends what I did in commit 449809a66c1d ("tcp/dccp:
    block BH for SYN processing") by adding an extra rcu_read_{lock|unlock}
    pair in the paths that might be taken when processing SYN from
    the socket backlog (thus possibly in process context).

    Fixes: 06f877d613be ("tcp/dccp: fix other lockdep splats accessing ireq_opt")
    Signed-off-by: Eric Dumazet
    Reported-by: syzbot
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Eric Dumazet
     
  • [ Upstream commit 7e823644b60555f70f241274b8d0120dd919269a ]

    Commit 2276f58ac589 ("udp: use a separate rx queue for packet reception")
    turned the static inline __skb_recv_udp() from being a trivial helper around
    __skb_recv_datagram() into a UDP-specific implementation, making it
    EXPORT_SYMBOL_GPL() at the same time.

    There are external modules that got broken by __skb_recv_udp() not being
    visible to them. Let's unbreak them by making __skb_recv_udp EXPORT_SYMBOL().

    Rationale (one of several) for why this is actually the "technically correct"
    thing to do: __skb_recv_udp() used to be an inline wrapper around
    __skb_recv_datagram(), which itself (still, and correctly so, I believe)
    is EXPORT_SYMBOL().

    Cc: Paolo Abeni
    Cc: Eric Dumazet
    Fixes: 2276f58ac589 ("udp: use a separate rx queue for packet reception")
    Signed-off-by: Jiri Kosina
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Jiri Kosina
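
    The fix itself is a one-line change to the export annotation; a sketch of the
    before/after (the function body and signature are elided):

    /* before: only GPL modules could resolve the symbol */
    EXPORT_SYMBOL_GPL(__skb_recv_udp);

    /* after: any module can, matching __skb_recv_datagram() */
    EXPORT_SYMBOL(__skb_recv_udp);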
     
  • [ Upstream commit af7d6cce53694a88d6a1bb60c9a239a6a5144459 ]

    Since commit 5aad1de5ea2c ("ipv4: use separate genid for next hop
    exceptions"), exceptions get deprecated separately from cached
    routes. In particular, administrative changes don't clear PMTU anymore.

    As Stefano described in commit e9fa1495d738 ("ipv6: Reflect MTU changes
    on PMTU of exceptions for MTU-less routes"), the PMTU discovered before
    the local MTU change can become stale:
    - if the local MTU is now lower than the PMTU, that PMTU is now
    incorrect
    - if the local MTU was the lowest value in the path, and is increased,
    we might discover a higher PMTU

    Similarly to what commit e9fa1495d738 did for IPv6, update PMTU in those
    cases.

    If the exception was locked, the discovered PMTU was smaller than the
    minimal accepted PMTU. In that case, if the new local MTU is smaller
    than the current PMTU, let PMTU discovery figure out if locking of the
    exception is still needed.

    To do this, we need to know the old link MTU in the NETDEV_CHANGEMTU
    notifier. By the time the notifier is called, dev->mtu has been
    changed. This patch adds the old MTU as additional information in the
    notifier structure, and a new call_netdevice_notifiers_u32() function.

    Fixes: 5aad1de5ea2c ("ipv4: use separate genid for next hop exceptions")
    Signed-off-by: Sabrina Dubroca
    Reviewed-by: Stefano Brivio
    Reviewed-by: David Ahern
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Sabrina Dubroca
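
    A sketch of how a NETDEV_CHANGEMTU consumer could read the old MTU from the
    extended notifier info this patch introduces; the handler name is illustrative
    and the struct layout is assumed from the commit description.

    static int fnhe_netdev_event_sketch(struct notifier_block *nb,
                                        unsigned long event, void *ptr)
    {
            struct net_device *dev = netdev_notifier_info_to_dev(ptr);

            if (event == NETDEV_CHANGEMTU) {
                    struct netdev_notifier_info_ext *info_ext = ptr;
                    u32 old_mtu = info_ext->ext.mtu;  /* MTU before the change */

                    /* dev->mtu already holds the new value at this point */
                    pr_debug("%s: mtu %u -> %u\n", dev->name, old_mtu, dev->mtu);
            }
            return NOTIFY_DONE;
    }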
     
  • [ Upstream commit 64199fc0a46ba211362472f7f942f900af9492fd ]

    Caching ip_hdr(skb) before a call to pskb_may_pull() is buggy;
    do not do it.

    Fixes: 2efd4fca703a ("ip: in cmsg IP(V6)_ORIGDSTADDR call pskb_may_pull")
    Signed-off-by: Eric Dumazet
    Cc: Willem de Bruijn
    Reported-by: syzbot
    Acked-by: Willem de Bruijn
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Eric Dumazet
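
    The hazard, sketched: pskb_may_pull() may reallocate the skb's header, so any
    ip_hdr(skb) pointer taken before the call can end up pointing into freed memory.
    Re-read the header pointer after the pull (function name illustrative):

    static void origdstaddr_sketch(struct sk_buff *skb)
    {
            const struct iphdr *iph;

            /* wrong: an ip_hdr(skb) pointer cached here could go stale */
            if (!pskb_may_pull(skb, sizeof(struct iphdr) + sizeof(struct udphdr)))
                    return;

            iph = ip_hdr(skb);      /* right: dereference only after the pull */
            /* ... use iph->daddr, udp_hdr(skb)->dest ... */
    }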
     
  • [ Upstream commit ccfec9e5cb2d48df5a955b7bf47f7782157d3bc2 ]

    Cong noted that we need the same checks introduced by commit 76c0ddd8c3a6
    ("ip6_tunnel: be careful when accessing the inner header")
    even for ipv4 tunnels.

    Fixes: c54419321455 ("GRE: Refactor GRE tunneling code.")
    Suggested-by: Cong Wang
    Signed-off-by: Paolo Abeni
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Paolo Abeni
     

29 Sep, 2018

2 commits

  • [ Upstream commit 2b5a921740a55c00223a797d075b9c77c42cb171 ]

    commit 2abb7cdc0dc8 ("udp: Add support for doing checksum
    unnecessary conversion") left out the early demux path for
    connected sockets. As a result, IP_CMSG_CHECKSUM gives wrong
    values for such sockets when GRO is not enabled/available.

    This change addresses the issue by moving the csum conversion to a
    common helper and using such helper in both the default and the
    early demux rx path.

    Fixes: 2abb7cdc0dc8 ("udp: Add support for doing checksum unnecessary conversion")
    Signed-off-by: Paolo Abeni
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Paolo Abeni
     
  • [ Upstream commit c56cae23c6b167acc68043c683c4573b80cbcc2c ]

    When splitting a GSO segment that consists of encapsulated packets, the
    skb->mac_len of the segments can end up being set wrong, causing packet
    drops in particular when using act_mirred and ifb interfaces in
    combination with a qdisc that splits GSO packets.

    This happens because at the time skb_segment() is called, network_header
    will point to the inner header, throwing off the calculation in
    skb_reset_mac_len(). The network_header is subsequently adjusted by the
    outer IP gso_segment handlers, but they don't set the mac_len.

    Fix this by adding skb_reset_mac_len() calls to both the IPv4 and IPv6
    gso_segment handlers, after they modify the network_header.

    Many thanks to Eric Dumazet for his help in identifying the cause of
    the bug.

    Acked-by: Dave Taht
    Reviewed-by: Eric Dumazet
    Signed-off-by: Toke Høiland-Jørgensen
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Toke Høiland-Jørgensen
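
    For reference, skb_reset_mac_len() simply recomputes mac_len from the current
    header offsets, which is why it only gives the right answer once the outer
    gso_segment handler has restored network_header; the fix adds calls to it at
    that point. Definition as found in include/linux/skbuff.h:

    static inline void skb_reset_mac_len(struct sk_buff *skb)
    {
            skb->mac_len = skb->network_header - skb->mac_header;
    }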
     

26 Sep, 2018

3 commits

  • [ Upstream commit 5cf4a8532c992bb22a9ecd5f6d93f873f4eaccc2 ]

    According to the documentation in msg_zerocopy.rst, the SO_ZEROCOPY
    flag was introduced because send(2) ignores unknown message flags and
    any legacy application which was accidentally passing the equivalent of
    MSG_ZEROCOPY earlier should not see any new behaviour.

    Before commit f214f915e7db ("tcp: enable MSG_ZEROCOPY"), a send(2) call
    which passed the equivalent of MSG_ZEROCOPY without setting SO_ZEROCOPY
    would succeed. However, after that commit, it fails with -ENOBUFS. So
    it appears that the SO_ZEROCOPY flag fails to fulfill its intended
    purpose. Fix it.

    Fixes: f214f915e7db ("tcp: enable MSG_ZEROCOPY")
    Signed-off-by: Vincent Whitchurch
    Acked-by: Willem de Bruijn
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Vincent Whitchurch
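
    A small userspace illustration of the restored behaviour: with SO_ZEROCOPY never
    set, a send(2) carrying MSG_ZEROCOPY is expected to behave like a plain copying
    send() instead of failing with ENOBUFS. The loopback setup is just scaffolding
    for the example, and the MSG_ZEROCOPY fallback define covers older libc headers.

    #include <sys/socket.h>
    #include <netinet/in.h>
    #include <arpa/inet.h>
    #include <stdio.h>

    #ifndef MSG_ZEROCOPY
    #define MSG_ZEROCOPY 0x4000000  /* value from include/uapi/linux/socket.h */
    #endif

    int main(void)
    {
            struct sockaddr_in addr = { .sin_family = AF_INET };
            socklen_t alen = sizeof(addr);
            int lfd = socket(AF_INET, SOCK_STREAM, 0);
            int cfd = socket(AF_INET, SOCK_STREAM, 0);
            char buf[] = "hello";

            addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
            bind(lfd, (struct sockaddr *)&addr, sizeof(addr));
            listen(lfd, 1);
            getsockname(lfd, (struct sockaddr *)&addr, &alen);
            connect(cfd, (struct sockaddr *)&addr, sizeof(addr));

            /* SO_ZEROCOPY was never enabled on cfd, so MSG_ZEROCOPY should be
             * ignored (plain copy) rather than rejected with ENOBUFS. */
            if (send(cfd, buf, sizeof(buf), MSG_ZEROCOPY) < 0)
                    perror("send");
            else
                    puts("send with MSG_ZEROCOPY succeeded");
            return 0;
    }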
     
  • [ Upstream commit 5a64506b5c2c3cdb29d817723205330378075448 ]

    If the erspan tunnel hasn't been established, we'd better send an ICMP port
    unreachable message when receiving erspan packets.

    Fixes: 84e54fe0a5ea ("gre: introduce native tunnel support for ERSPAN")
    Cc: William Tu
    Signed-off-by: Haishuang Yan
    Acked-by: William Tu
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Haishuang Yan
     
  • [ Upstream commit 51dc63e3911fbb1f0a7a32da2fe56253e2040ea4 ]

    When processing an ICMP unreachable message for an erspan tunnel, the tunnel id
    should be erspan_net_id instead of ipgre_net_id.

    Fixes: 84e54fe0a5ea ("gre: introduce native tunnel support for ERSPAN")
    Cc: William Tu
    Signed-off-by: Haishuang Yan
    Acked-by: William Tu
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Haishuang Yan
     

20 Sep, 2018

19 commits

  • commit 5d407b071dc369c26a38398326ee2be53651cfe4 upstream

    A kernel crash occurs when a defragmented packet is fragmented
    in ip_do_fragment().
    In the defragment routine, skb_orphan() is called and
    skb->ip_defrag_offset is set. But skb->sk and
    skb->ip_defrag_offset are the same union member, so
    frag->sk is not NULL.
    Hence the crash occurs in the skb->sk check in ip_do_fragment() when
    the defragmented packet is fragmented.

    test commands:
    %iptables -t nat -I POSTROUTING -j MASQUERADE
    %hping3 192.168.4.2 -s 1000 -p 2000 -d 60000

    splat looks like:
    [ 261.069429] kernel BUG at net/ipv4/ip_output.c:636!
    [ 261.075753] invalid opcode: 0000 [#1] SMP DEBUG_PAGEALLOC KASAN PTI
    [ 261.083854] CPU: 1 PID: 1349 Comm: hping3 Not tainted 4.19.0-rc2+ #3
    [ 261.100977] RIP: 0010:ip_do_fragment+0x1613/0x2600
    [ 261.106945] Code: e8 e2 38 e3 fe 4c 8b 44 24 18 48 8b 74 24 08 e9 92 f6 ff ff 80 3c 02 00 0f 85 da 07 00 00 48 8b b5 d0 00 00 00 e9 25 f6 ff ff 0b 0f 0b 44 8b 54 24 58 4c 8b 4c 24 18 4c 8b 5c 24 60 4c 8b 6c
    [ 261.127015] RSP: 0018:ffff8801031cf2c0 EFLAGS: 00010202
    [ 261.134156] RAX: 1ffff1002297537b RBX: ffffed0020639e6e RCX: 0000000000000004
    [ 261.142156] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff880114ba9bd8
    [ 261.150157] RBP: ffff880114ba8a40 R08: ffffed0022975395 R09: ffffed0022975395
    [ 261.158157] R10: 0000000000000001 R11: ffffed0022975394 R12: ffff880114ba9ca4
    [ 261.166159] R13: 0000000000000010 R14: ffff880114ba9bc0 R15: dffffc0000000000
    [ 261.174169] FS: 00007fbae2199700(0000) GS:ffff88011b400000(0000) knlGS:0000000000000000
    [ 261.183012] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    [ 261.189013] CR2: 00005579244fe000 CR3: 0000000119bf4000 CR4: 00000000001006e0
    [ 261.198158] Call Trace:
    [ 261.199018] ? dst_output+0x180/0x180
    [ 261.205011] ? save_trace+0x300/0x300
    [ 261.209018] ? ip_copy_metadata+0xb00/0xb00
    [ 261.213034] ? sched_clock_local+0xd4/0x140
    [ 261.218158] ? kill_l4proto+0x120/0x120 [nf_conntrack]
    [ 261.223014] ? rt_cpu_seq_stop+0x10/0x10
    [ 261.227014] ? find_held_lock+0x39/0x1c0
    [ 261.233008] ip_finish_output+0x51d/0xb50
    [ 261.237006] ? ip_fragment.constprop.56+0x220/0x220
    [ 261.243011] ? nf_ct_l4proto_register_one+0x5b0/0x5b0 [nf_conntrack]
    [ 261.250152] ? rcu_is_watching+0x77/0x120
    [ 261.255010] ? nf_nat_ipv4_out+0x1e/0x2b0 [nf_nat_ipv4]
    [ 261.261033] ? nf_hook_slow+0xb1/0x160
    [ 261.265007] ip_output+0x1c7/0x710
    [ 261.269005] ? ip_mc_output+0x13f0/0x13f0
    [ 261.273002] ? __local_bh_enable_ip+0xe9/0x1b0
    [ 261.278152] ? ip_fragment.constprop.56+0x220/0x220
    [ 261.282996] ? nf_hook_slow+0xb1/0x160
    [ 261.287007] raw_sendmsg+0x21f9/0x4420
    [ 261.291008] ? dst_output+0x180/0x180
    [ 261.297003] ? sched_clock_cpu+0x126/0x170
    [ 261.301003] ? find_held_lock+0x39/0x1c0
    [ 261.306155] ? stop_critical_timings+0x420/0x420
    [ 261.311004] ? check_flags.part.36+0x450/0x450
    [ 261.315005] ? _raw_spin_unlock_irq+0x29/0x40
    [ 261.320995] ? _raw_spin_unlock_irq+0x29/0x40
    [ 261.326142] ? cyc2ns_read_end+0x10/0x10
    [ 261.330139] ? raw_bind+0x280/0x280
    [ 261.334138] ? sched_clock_cpu+0x126/0x170
    [ 261.338995] ? check_flags.part.36+0x450/0x450
    [ 261.342991] ? __lock_acquire+0x4500/0x4500
    [ 261.348994] ? inet_sendmsg+0x11c/0x500
    [ 261.352989] ? dst_output+0x180/0x180
    [ 261.357012] inet_sendmsg+0x11c/0x500
    [ ... ]

    v2:
    - clear skb->sk in the reassembly routine. (Eric Dumazet)

    Fixes: fa0f527358bd ("ip: use rb trees for IP frag queue.")
    Suggested-by: Eric Dumazet
    Signed-off-by: Taehee Yoo
    Reviewed-by: Eric Dumazet
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Taehee Yoo
     
  • This patch changes the runtime behavior of IP defrag queue:
    incoming in-order fragments are added to the end of the current
    list/"run" of in-order fragments at the tail.

    On some workloads, UDP stream performance is substantially improved:

    RX: ./udp_stream -F 10 -T 2 -l 60
    TX: ./udp_stream -c -H -F 10 -T 5 -l 60

    with this patchset applied on a 10Gbps receiver:

    throughput=9524.18
    throughput_units=Mbit/s

    upstream (net-next):

    throughput=4608.93
    throughput_units=Mbit/s

    Reported-by: Willem de Bruijn
    Signed-off-by: Peter Oskolkov
    Cc: Eric Dumazet
    Cc: Florian Westphal
    Signed-off-by: David S. Miller
    (cherry picked from commit a4fd284a1f8fd4b6c59aa59db2185b1e17c5c11c)
    Signed-off-by: Greg Kroah-Hartman

    Peter Oskolkov
     
  • This patch introduces several helper functions/macros that will be
    used in the follow-up patch. No runtime changes yet.

    The new logic (fully implemented in the second patch) is as follows:

    * Nodes in the rb-tree will now contain not single fragments, but lists
    of consecutive fragments ("runs").

    * At each point in time, the current "active" run at the tail is
    maintained/tracked. Fragments that arrive in-order, adjacent
    to the previous tail fragment, are added to this tail run without
    triggering the re-balancing of the rb-tree.

    * If a fragment arrives out of order with the offset _before_ the tail run,
    it is inserted into the rb-tree as a single fragment.

    * If a fragment arrives after the current tail fragment (with a gap),
    it starts a new "tail" run, and is inserted into the rb-tree
    at the end as the head of the new run.

    skb->cb is used to store additional information
    needed here (suggested by Eric Dumazet).

    Reported-by: Willem de Bruijn
    Signed-off-by: Peter Oskolkov
    Cc: Eric Dumazet
    Cc: Florian Westphal
    Signed-off-by: David S. Miller
    (cherry picked from commit 353c9cb360874e737fb000545f783df756c06f9a)
    Signed-off-by: Greg Kroah-Hartman

    Peter Oskolkov
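
    An illustrative sketch (names and structure hypothetical, not the upstream
    helpers) of the bookkeeping described above: per-fragment run metadata lives in
    skb->cb, and an in-order fragment extends the current tail run without touching
    the rb-tree.

    struct frag_run_cb {
            struct sk_buff *next_frag;    /* next skb within the same run */
            int             frag_run_len; /* total run length, kept on the run head */
    };
    #define RUN_CB(skb)   ((struct frag_run_cb *)((skb)->cb))

    /* Called when 'skb' is adjacent to the current tail fragment:
     * append it to the tail run instead of inserting a new rb-tree node. */
    static void frag_append_to_tail_run(struct sk_buff *run_head,
                                        struct sk_buff *tail,
                                        struct sk_buff *skb)
    {
            RUN_CB(skb)->next_frag = NULL;
            RUN_CB(skb)->frag_run_len = skb->len;

            RUN_CB(run_head)->frag_run_len += skb->len; /* account on the run head */
            RUN_CB(tail)->next_frag = skb;              /* link after the old tail */
    }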
     
  • We accidentally removed the parentheses here, but they are required
    because '!' has higher precedence than '&'.

    Fixes: fa0f527358bd ("ip: use rb trees for IP frag queue.")
    Signed-off-by: Dan Carpenter
    Signed-off-by: David S. Miller
    (cherry picked from commit 70837ffe3085c9a91488b52ca13ac84424da1042)
    Signed-off-by: Greg Kroah-Hartman

    Dan Carpenter
     
  • commit bffa72cf7f9df842f0016ba03586039296b4caaf upstream

    skb->rbnode shares space with skb->next, skb->prev and skb->tstamp.

    Current uses (TCP receive ofo queue and netem) need to save/restore
    tstamp, while skb->dev is either NULL (TCP) or a constant for a given
    queue (netem).

    Since we plan to use an RB tree for the TCP retransmit queue to speed up SACK
    processing with large BDP, this patch exchanges skb->dev and
    skb->tstamp.

    This saves some overhead in both TCP and netem.

    v2: removes the swtstamp field from struct tcp_skb_cb

    Signed-off-by: Eric Dumazet
    Cc: Soheil Hassas Yeganeh
    Cc: Wei Wang
    Cc: Willem de Bruijn
    Acked-by: Soheil Hassas Yeganeh
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Eric Dumazet
     
    Generalize the private netem_rb_to_skb() helper.

    TCP rtx queue will soon be converted to rb-tree,
    so we will need skb_rbtree_walk() helpers.

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller
    (cherry picked from commit 18a4c0eab2623cc95be98a1e6af1ad18e7695977)
    Signed-off-by: Greg Kroah-Hartman

    Eric Dumazet
     
    This behavior is required in IPv6, and there is little need
    to tolerate overlapping fragments in IPv4. This change
    simplifies the code and eliminates potential DDoS attack vectors.

    Tested: ran the ip_defrag selftest (not yet available upstream).

    Suggested-by: David S. Miller
    Signed-off-by: Peter Oskolkov
    Signed-off-by: Eric Dumazet
    Cc: Florian Westphal
    Acked-by: Stephen Hemminger
    Signed-off-by: David S. Miller
    (cherry picked from commit 7969e5c40dfd04799d4341f1b7cd266b6e47f227)
    Signed-off-by: Greg Kroah-Hartman

    Peter Oskolkov
     
    Giving an integer to proc_doulongvec_minmax() is dangerous on 64-bit arches,
    since the linker might place a non-zero value next to it, preventing a change
    to ip6frag_low_thresh.

    ip6frag_low_thresh is not used anymore in the kernel, but we do not
    want to prematurely break user scripts wanting to change it.

    Since specifying a minimal value of 0 for proc_doulongvec_minmax()
    is moot, let's remove these zero values in all defrag units.

    Fixes: 6e00f7dd5e4e ("ipv6: frags: fix /proc/sys/net/ipv6/ip6frag_low_thresh")
    Signed-off-by: Eric Dumazet
    Reported-by: Maciej Żenczykowski
    Signed-off-by: David S. Miller
    (cherry picked from commit 3d23401283e80ceb03f765842787e0e79ff598b7)
    Signed-off-by: Greg Kroah-Hartman

    Eric Dumazet
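
    The hazard, sketched as a sysctl table entry: proc_doulongvec_minmax() reads and
    writes an unsigned long, so .data (and any extra1/extra2 bounds) must be unsigned
    long as well, or adjacent memory gets involved on 64-bit. Names and values here
    are illustrative.

    static unsigned long ipfrag_low_thresh_sketch = 3UL * 1024 * 1024;

    static struct ctl_table frag_table_sketch[] = {
            {
                    .procname     = "ipfrag_low_thresh",
                    .data         = &ipfrag_low_thresh_sketch,
                    .maxlen       = sizeof(unsigned long), /* not sizeof(int) */
                    .mode         = 0644,
                    .proc_handler = proc_doulongvec_minmax,
                    /* no .extra1/.extra2: a zero minimum is moot anyway */
            },
            { }
    };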
     
  • ip_defrag uses skb->cb[] to store the fragment offset, and unfortunately
    this integer is currently in a different cache line than skb->next,
    meaning that we use two cache lines per skb when finding the insertion point.

    By aliasing skb->ip_defrag_offset and skb->dev, we pack all the fields
    in a single cache line and save precious memory bandwidth.

    Note that after the fast path added by Changli Gao in commit
    d6bebca92c66 ("fragment: add fast path for in-order fragments")
    this change won't help the fast path, since we still need
    to access prev->len (2nd cache line), but it will show great
    benefits when the slow path is entered, since we perform
    a linear scan of a potentially long list.

    Also, note that this potentially long list is an attack vector;
    we might consider also using an rb-tree there eventually.

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller
    (cherry picked from commit bf66337140c64c27fa37222b7abca7e49d63fb57)
    Signed-off-by: Greg Kroah-Hartman

    Eric Dumazet
     
    An skb_clone() was added in commit ec4fbd64751d ("inet: frag: release
    spinlock before calling icmp_send()").

    While fixing the bug at that time, it also added a very high cost
    for DDoS frags, as the ICMP rate limit is applied after this
    expensive operation (skb_clone() + consume_skb(), implying memory
    allocations, copy, and freeing).

    We can use skb_get(head) here; all we want is to make sure the skb
    won't be freed by another cpu.

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller
    (cherry picked from commit 1eec5d5670084ee644597bd26c25e22c69b9f748)
    Signed-off-by: Greg Kroah-Hartman

    Eric Dumazet
     
    Some users are willing to provision huge amounts of memory to be able
    to perform reassembly reasonably well under pressure.

    Current memory tracking is using one atomic_t and integers.

    Switch to atomic_long_t so that 64-bit arches can use more than 2GB,
    without any cost for 32-bit arches.

    Note that this patch avoids an overflow error if high_thresh was set
    to ~2GB, since this test in inet_frag_alloc() was never true:

    if (... || frag_mem_limit(nf) > nf->high_thresh)

    Tested:

    $ echo 16000000000 >/proc/sys/net/ipv4/ipfrag_high_thresh

    $ grep FRAG /proc/net/sockstat
    FRAG: inuse 14705885 memory 16000002880

    $ nstat -n ; sleep 1 ; nstat | grep Reas
    IpReasmReqds 3317150 0.0
    IpReasmFails 3317112 0.0

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller
    (cherry picked from commit 3e67f106f619dcfaf6f4e2039599bdb69848c714)
    Signed-off-by: Greg Kroah-Hartman

    Eric Dumazet
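
    A sketch of the accounting switch: the per-netns counter becomes an
    atomic_long_t so 64-bit hosts can budget more than 2GB. The helper shapes follow
    the commit description and are simplified; the counter field name (mem) is
    assumed.

    static inline long frag_mem_limit(struct netns_frags *nf)
    {
            return atomic_long_read(&nf->mem);
    }

    static inline void add_frag_mem_limit(struct netns_frags *nf, long val)
    {
            atomic_long_add(val, &nf->mem);
    }

    static inline void sub_frag_mem_limit(struct netns_frags *nf, long val)
    {
            atomic_long_sub(val, &nf->mem);
    }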
     
  • This function is obsolete, after rhashtable addition to inet defrag.

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller
    (cherry picked from commit 2d44ed22e607f9a285b049de2263e3840673a260)
    Signed-off-by: Greg Kroah-Hartman

    Eric Dumazet
     
    This refactors ip_expire(), removing one indentation level.

    Note: in the future, we should try hard to avoid the skb_clone(),
    since it is a serious performance cost.
    Under DDoS, the ICMP message won't be sent because of rate limits.

    The fact that ip6_expire_frag_queue() does not use skb_clone() is
    disturbing too. Presumably IPv6 has the same
    issue as the one we fixed in commit ec4fbd64751d
    ("inet: frag: release spinlock before calling icmp_send()").

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller
    (cherry picked from commit 399d1404be660d355192ff4df5ccc3f4159ec1e4)
    Signed-off-by: Greg Kroah-Hartman

    Eric Dumazet
     
    Remove sum_frag_mem_limit(), ip_frag_mem() & ip6_frag_mem().

    Also, since we use an rhashtable, we can bring back the number of fragments
    in "grep FRAG /proc/net/sockstat /proc/net/sockstat6" that was
    removed in commit 434d305405ab ("inet: frag: don't account number
    of fragment queues").

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller
    (cherry picked from commit 6befe4a78b1553edb6eed3a78b4bcd9748526672)
    Signed-off-by: Greg Kroah-Hartman

    Eric Dumazet
     
    Some applications still rely on IP fragmentation, and, to be fair, the Linux
    reassembly unit does not work well under any serious load.

    It uses static hash tables of 1024 buckets, and up to 128 items per bucket (!!!).

    A work queue is supposed to garbage collect items when the host is under memory
    pressure, and to perform a hash rebuild, changing the seed used in hash computations.

    This work queue blocks softirqs for up to 25 ms when doing a hash rebuild,
    occurring every 5 seconds if the host is under fire.

    Then there is the problem of sharing this hash table for all netns.

    It is time to switch to rhashtables, and allocate one of them per netns
    to speed up netns dismantle, since this is a critical metric these days.

    Lookup is now using RCU. A followup patch will even remove
    the refcount hold/release left from the prior implementation and save
    a couple of atomic operations.

    Before this patch, 16 cpus (16 RX queue NIC) could not handle more
    than 1 Mpps of frag DDoS.

    After the patch, I reach 9 Mpps without any tuning, and can use up to 2GB
    of storage for the fragments (the exact number depends on frags being evicted
    after timeout).

    $ grep FRAG /proc/net/sockstat
    FRAG: inuse 1966916 memory 2140004608

    A followup patch will change the limits for 64bit arches.

    Signed-off-by: Eric Dumazet
    Cc: Kirill Tkhai
    Cc: Herbert Xu
    Cc: Florian Westphal
    Cc: Jesper Dangaard Brouer
    Cc: Alexander Aring
    Cc: Stefan Schmidt
    Signed-off-by: David S. Miller
    (cherry picked from commit 648700f76b03b7e8149d13cc2bdb3355035258a9)
    Signed-off-by: Greg Kroah-Hartman

    Eric Dumazet
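
    A heavily simplified sketch of the per-netns rhashtable setup the text
    describes; the parameter block is illustrative of the rhashtable API (lookups
    then go through rhashtable_lookup() under rcu_read_lock()), and the key layout
    shown here is an assumption, not the exact upstream struct.

    /* one table per netns, keyed on the fragment identification tuple */
    static const struct rhashtable_params frag_rhash_params_sketch = {
            .head_offset         = offsetof(struct inet_frag_queue, node),
            .key_offset          = offsetof(struct inet_frag_queue, key),
            .key_len             = sizeof(struct frag_v4_compare_key),
            .automatic_shrinking = true,
    };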
     
  • In preparation for unconditionally passing the struct timer_list pointer to
    all timer callbacks, switch to using the new timer_setup() and from_timer()
    to pass the timer pointer explicitly.

    Cc: Alexander Aring
    Cc: Stefan Schmidt
    Cc: "David S. Miller"
    Cc: Alexey Kuznetsov
    Cc: Hideaki YOSHIFUJI
    Cc: Pablo Neira Ayuso
    Cc: Jozsef Kadlecsik
    Cc: Florian Westphal
    Cc: linux-wpan@vger.kernel.org
    Cc: netdev@vger.kernel.org
    Cc: netfilter-devel@vger.kernel.org
    Cc: coreteam@netfilter.org
    Signed-off-by: Kees Cook
    Acked-by: Stefan Schmidt # for ieee802154
    Signed-off-by: David S. Miller
    (cherry picked from commit 78802011fbe34331bdef6f2dfb1634011f0e4c32)
    Signed-off-by: Greg Kroah-Hartman

    Kees Cook
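
    A generic sketch of the timer_setup()/from_timer() conversion pattern this patch
    applies; the struct and function names here are illustrative, not the actual
    frag-queue ones. The callback now receives the struct timer_list pointer and
    recovers its container with from_timer().

    struct frag_like_object {
            struct timer_list timer;
            int               budget;
    };

    static void frag_like_timer_fn(struct timer_list *t)
    {
            struct frag_like_object *obj = from_timer(obj, t, timer);

            pr_debug("timer expired, budget=%d\n", obj->budget);
    }

    static void frag_like_init(struct frag_like_object *obj)
    {
            timer_setup(&obj->timer, frag_like_timer_fn, 0);
            mod_timer(&obj->timer, jiffies + HZ);
    }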
     
    We need to call inet_frags_init() before register_pernet_subsys(),
    as a prerequisite for the following patch ("inet: frags: use rhashtables for reassembly units").

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller
    (cherry picked from commit 483a6e4fa055123142d8956866fe2aa9c98d546d)
    Signed-off-by: Greg Kroah-Hartman

    Eric Dumazet
     
  • In order to simplify the API, add a pointer to struct inet_frags.
    This will allow us to make things less complex.

    These functions no longer have a struct inet_frags parameter:

    inet_frag_destroy(struct inet_frag_queue *q /*, struct inet_frags *f */)
    inet_frag_put(struct inet_frag_queue *q /*, struct inet_frags *f */)
    inet_frag_kill(struct inet_frag_queue *q /*, struct inet_frags *f */)
    inet_frags_exit_net(struct netns_frags *nf /*, struct inet_frags *f */)
    ip6_expire_frag_queue(struct net *net, struct frag_queue *fq)

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller
    (cherry picked from commit 093ba72914b696521e4885756a68a3332782c8de)
    Signed-off-by: Greg Kroah-Hartman

    Eric Dumazet
     
  • We will soon initialize one rhashtable per struct netns_frags
    in inet_frags_init_net().

    This patch changes the return value to eventually propagate an
    error.

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller
    (cherry picked from commit 787bea7748a76130566f881c2342a0be4127d182)
    Signed-off-by: Greg Kroah-Hartman

    Eric Dumazet
     

15 Sep, 2018

3 commits

  • [ Upstream commit 037b0b86ecf5646f8eae777d8b52ff8b401692ec ]

    Let's not turn the TCP ULP lookup into an arbitrary module loader, as
    we only intend to load ULP modules through this mechanism, not other
    unrelated kernel modules:

    [root@bar]# cat foo.c
    #include <sys/types.h>
    #include <sys/socket.h>
    #include <netinet/in.h>
    #include <linux/tcp.h>

    int main(void)
    {
            int sock = socket(PF_INET, SOCK_STREAM, 0);
            setsockopt(sock, IPPROTO_TCP, TCP_ULP, "sctp", sizeof("sctp"));
            return 0;
    }

    [root@bar]# gcc foo.c -O2 -Wall
    [root@bar]# lsmod | grep sctp
    [root@bar]# ./a.out
    [root@bar]# lsmod | grep sctp
    sctp 1077248 4
    libcrc32c 16384 3 nf_conntrack,nf_nat,sctp
    [root@bar]#

    Fix it by adding a module alias to TCP ULP modules, so probing a module
    via request_module() will be limited to tcp-ulp-[name]. Existing
    modules like kTLS will load fine given the tcp-ulp-tls alias, but others
    will fail to load:

    [root@bar]# lsmod | grep sctp
    [root@bar]# ./a.out
    [root@bar]# lsmod | grep sctp
    [root@bar]#

    Sockmap is not affected by this since it's either built-in or not.

    Fixes: 734942cc4ea6 ("tcp: ULP infrastructure")
    Signed-off-by: Daniel Borkmann
    Acked-by: John Fastabend
    Acked-by: Song Liu
    Signed-off-by: Alexei Starovoitov
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Daniel Borkmann
     
  • [ Upstream commit 63cc357f7bba6729869565a12df08441a5995d9a ]

    RFC 1337 says:
    ''Ignore RST segments in TIME-WAIT state.
    If the 2 minute MSL is enforced, this fix avoids all three hazards.''

    So with net.ipv4.tcp_rfc1337=1, the expected behaviour is to have the TIME-WAIT sk
    expire rather than removing it instantly when a reset is received.

    However, Linux will also re-start the TIME-WAIT timer.

    This causes connect to fail when trying to re-use ports, or very long
    delays (until the SYN retry interval exceeds the MSL).

    packetdrill test case:
    // Demonstrate bogus rearming of TIME-WAIT timer in rfc1337 mode.
    `sysctl net.ipv4.tcp_rfc1337=1`

    0.000 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
    0.000 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
    0.000 bind(3, ..., ...) = 0
    0.000 listen(3, 1) = 0

    0.100 < S 0:0(0) win 29200
    0.100 > S. 0:0(0) ack 1
    0.200 < . 1:1(0) ack 1 win 257
    0.200 accept(3, ..., ...) = 4

    // Receive first segment
    0.310 < P. 1:1001(1000) ack 1 win 46

    // Send one ACK
    0.310 > . 1:1(0) ack 1001

    // read 1000 byte
    0.310 read(4, ..., 1000) = 1000

    // Application writes 100 bytes
    0.350 write(4, ..., 100) = 100
    0.350 > P. 1:101(100) ack 1001

    // ACK
    0.500 < . 1001:1001(0) ack 101 win 257

    // close the connection
    0.600 close(4) = 0
    0.600 > F. 101:101(0) ack 1001 win 244

    // Our side is in FIN_WAIT_1 & waits for ack to fin
    0.7 < . 1001:1001(0) ack 102 win 244

    // Our side is in FIN_WAIT_2 with no outstanding data.
    0.8 < F. 1001:1001(0) ack 102 win 244
    0.8 > . 102:102(0) ack 1002 win 244

    // Our side is now in TIME_WAIT state, send ack for fin.
    0.9 < F. 1002:1002(0) ack 102 win 244
    0.9 > . 102:102(0) ack 1002 win 244

    // Peer reopens with in-window SYN:
    1.000 < S 1000:1000(0) win 9200

    // Therefore, reply with ACK.
    1.000 > . 102:102(0) ack 1002 win 244

    // Peer sends RST for this ACK. Normally this RST results
    // in tw socket removal, but rfc1337=1 setting prevents this.
    1.100 < R 1002:1002(0) win 244

    // second syn. Due to rfc1337=1 expect another pure ACK.
    31.0 < S 1000:1000(0) win 9200
    31.0 > . 102:102(0) ack 1002 win 244

    // .. and another RST from peer.
    31.1 < R 1002:1002(0) win 244
    31.2 `echo no timer restart;ss -m -e -a -i -n -t -o state TIME-WAIT`

    // third syn after one minute. Time-Wait socket should have expired by now.
    63.0 < S 1000:1000(0) win 9200

    // so we expect a syn-ack & 3whs to proceed from here on.
    63.0 > S. 0:0(0) ack 1

    Without this patch, 'ss' shows restarts of the tw timer and the last packet is
    thus just another pure ACK, more than one minute later.

    This restores the original code from commit 283fd6cf0be690a83
    ("Merge in ANK networking jumbo patch") in netdev-vger-cvs.git.

    For some reason the else branch was removed/lost in 1f28b683339f7
    ("Merge in TCP/UDP optimizations and [..]") and the timer restart became
    unconditional.

    Reported-by: Michal Tesar
    Signed-off-by: Florian Westphal
    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Florian Westphal
     
  • [ Upstream commit 431280eebed9f5079553daf003011097763e71fd ]

    TCP uses per-cpu (and per-namespace) sockets (net->ipv4.tcp_sk) internally
    to send some control packets.

    1) RST packets, through tcp_v4_send_reset()
    2) ACK packets in SYN-RECV and TIME-WAIT state, through tcp_v4_send_ack()

    These packets assert IP_DF, and also use the hashed IP ident generator
    to provide an IPv4 ID number.

    Geoff Alexander reported this could be used to build off-path attacks.

    These packets should not be fragmented, since their size is smaller than
    IPV4_MIN_MTU. Only some tunneled paths could eventually have to fragment,
    regardless of inner IPID.

    We really can use a zero IPID to address the flaw, and, as a bonus,
    avoid a couple of atomic operations in ip_idents_reserve().

    Signed-off-by: Eric Dumazet
    Reported-by: Geoff Alexander
    Tested-by: Geoff Alexander
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Eric Dumazet
     

24 Aug, 2018

3 commits

  • [ Upstream commit e56b8ce363a36fb7b74b80aaa5cc9084f2c908b4 ]

    Attempt to make cryptic TCP seq number error messages clearer by
    (1) identifying the source of the message as "TCP", (2) identifying the
    errors as "seq # bug", and (3) grouping the field identifiers and values
    by separating them with commas.

    E.g., the following message is changed from:

    recvmsg bug 2: copied 73BCB6CD seq 70F17CBE rcvnxt 73BCB9AA fl 0
    WARNING: CPU: 2 PID: 1501 at /linux/net/ipv4/tcp.c:1881 tcp_recvmsg+0x649/0xb90

    to:

    TCP recvmsg seq # bug 2: copied 73BCB6CD, seq 70F17CBE, rcvnxt 73BCB9AA, fl 0
    WARNING: CPU: 2 PID: 1501 at /linux/net/ipv4/tcp.c:2011 tcp_recvmsg+0x694/0xba0

    Suggested-by: 積丹尼 Dan Jacobson
    Signed-off-by: Randy Dunlap
    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Randy Dunlap
     
  • [ Upstream commit a69258f7aa2623e0930212f09c586fd06674ad79 ]

    After fixing the way DCTCP tracks delayed ACKs, the delayed-ACK
    related callbacks are no longer needed.

    Signed-off-by: Yuchung Cheng
    Signed-off-by: Eric Dumazet
    Acked-by: Neal Cardwell
    Acked-by: Lawrence Brakmo
    Signed-off-by: David S. Miller
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Yuchung Cheng
     
  • [ Upstream commit d376bef9c29b3c65aeee4e785fffcd97ef0a9a81 ]

    nft_compat relies on xt_request_find_match to increment the
    refcount of the module that provides the match/target.

    The (builtin) icmp matches didn't set the module owner, so it
    was possible to rmmod ip(6)tables while the icmp extensions were still in use.

    Signed-off-by: Florian Westphal
    Signed-off-by: Pablo Neira Ayuso
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Florian Westphal
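
    Roughly, the fix amounts to setting the module owner on the builtin icmp
    match/target structs so xt_request_find_match() can pin the module. A sketch of
    the relevant field; the other initializers mirror the usual builtin icmp match
    and may differ from the exact upstream struct.

    static struct xt_match icmp_match_sketch __read_mostly = {
            .name       = "icmp",
            .match      = icmp_match,
            .matchsize  = sizeof(struct ipt_icmp),
            .checkentry = icmp_checkentry,
            .proto      = IPPROTO_ICMP,
            .family     = NFPROTO_IPV4,
            .me         = THIS_MODULE,   /* previously left unset */
    };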
     

06 Aug, 2018

2 commits

  • [ Upstream commit 4672694bd4f1aebdab0ad763ae4716e89cb15221 ]

    ip_frag_queue() might call pskb_pull() on one skb that
    is already in the fragment queue.

    We need to take care of a possible truesize change, or we
    might have an imbalance in the netns frags memory usage.

    IPv6 is immune to this bug, because RFC 5722, Section 4,
    amended by Errata ID 3089, states:

    When reassembling an IPv6 datagram, if
    one or more of its constituent fragments is determined to be an
    overlapping fragment, the entire datagram (and any constituent
    fragments) MUST be silently discarded.

    Fixes: 158f323b9868 ("net: adjust skb->truesize in pskb_expand_head()")
    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Eric Dumazet
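
    A sketch of the accounting pattern described above: capture truesize around the
    pull and feed any delta back into the netns frag memory counter. This is a
    simplified, hypothetical helper, not the exact upstream diff.

    /* illustrative helper: pull 'len' bytes from an skb that is already
     * charged to the netns frag memory limit, keeping the accounting exact */
    static int frag_pull_charged_sketch(struct netns_frags *nf,
                                        struct sk_buff *skb, unsigned int len)
    {
            int delta = -skb->truesize;

            if (!pskb_pull(skb, len))
                    return -ENOMEM;

            delta += skb->truesize;        /* pskb_pull() may expand the head */
            if (delta)
                    add_frag_mem_limit(nf, delta);
            return 0;
    }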
     
  • [ Upstream commit 56e2c94f055d328f5f6b0a5c1721cca2f2d4e0a1 ]

    We currently check the frags memory usage only when
    a new frag queue is created. This allows attackers to first
    consume the memory budget (default: 4 MB) by creating thousands
    of frag queues, then sending tiny skbs to exceed the high_thresh
    limit by 2 to 3 orders of magnitude.

    Note that before commit 648700f76b03 ("inet: frags: use rhashtables
    for reassembly units"), the work queue could be starved under DoS,
    getting no cpu cycles.
    After commit 648700f76b03, only the per-frag-queue timer can eventually
    remove an incomplete frag queue and its skbs.

    Fixes: b13d3cbfb8e8 ("inet: frag: move eviction of queues to work queue")
    Signed-off-by: Eric Dumazet
    Reported-by: Jann Horn
    Cc: Florian Westphal
    Cc: Peter Oskolkov
    Cc: Paolo Abeni
    Acked-by: Florian Westphal
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Eric Dumazet
     

03 Aug, 2018

2 commits

  • [ Upstream commit 15ecbe94a45ef88491ca459b26efdd02f91edb6d ]

    Larry Brakmo's proposal (https://patchwork.ozlabs.org/patch/935233/,
    "tcp: force cwnd at least 2 in tcp_cwnd_reduction") made us rethink
    our recent patch removing ~16 quick acks after ECN events.

    tcp_enter_quickack_mode(sk, 1) makes sure one immediate ack is sent,
    but in the case where the sender cwnd was lowered to 1, we do not want
    to have a delayed ack for the next packet we will receive.

    Fixes: 522040ea5fdd ("tcp: do not aggressively quick ack after ECN events")
    Signed-off-by: Eric Dumazet
    Reported-by: Neal Cardwell
    Cc: Lawrence Brakmo
    Acked-by: Neal Cardwell
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Eric Dumazet
     
  • [ Upstream commit f4c9f85f3b2cb7669830cd04d0be61192a4d2436 ]

    Refactor tcp_ecn_check_ce and __tcp_ecn_check_ce to accept struct sock*
    instead of tcp_sock* to clean up type casts. This is a pure refactor
    patch.

    Signed-off-by: Yousuk Seung
    Signed-off-by: Neal Cardwell
    Signed-off-by: Yuchung Cheng
    Signed-off-by: Eric Dumazet
    Acked-by: Soheil Hassas Yeganeh
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Yousuk Seung