20 Nov, 2020

1 commit

  • IPV6=m
    NF_DEFRAG_IPV6=y

    ld: net/ipv6/netfilter/nf_conntrack_reasm.o: in function
    `nf_ct_frag6_gather':
    net/ipv6/netfilter/nf_conntrack_reasm.c:462: undefined reference to
    `ipv6_frag_thdr_truncated'

    Netfilter is depending on ipv6 symbol ipv6_frag_thdr_truncated. This
    dependency is forcing IPV6=y.

    Remove this dependency by moving ipv6_frag_thdr_truncated out of ipv6. This
    is the same solution as used for a similar issue; see
    commit 70b095c843266 ("ipv6: remove dependency of nf_defrag_ipv6 on ipv6
    module").

    Fixes: 9d9e937b1c8b ("ipv6/netfilter: Discard first fragment not including all headers")
    Reported-by: Randy Dunlap
    Reported-by: kernel test robot
    Signed-off-by: Georg Kohmann
    Acked-by: Pablo Neira Ayuso
    Acked-by: Randy Dunlap # build-tested
    Link: https://lore.kernel.org/r/20201119095833.8409-1-geokohma@cisco.com
    Signed-off-by: Jakub Kicinski

    Georg Kohmann
     

17 Nov, 2020

1 commit

  • Packets are processed even though the first fragment doesn't include all
    headers through the upper layer header. This breaks TAHI IPv6 Core
    Conformance Test v6LC.1.3.6.

    Referring to RFC 8200, Section 4.5: "If the first fragment does not include
    all headers through an Upper-Layer header, then that fragment should be
    discarded and an ICMP Parameter Problem, Code 3, message should be sent to
    the source of the fragment, with the Pointer field set to zero."

    The fragment needs to be validated the same way it is done in
    commit 2efdaaaf883a ("IPv6: reply ICMP error if the first fragment don't
    include all headers") for ipv6. Wrap the validation into a common function,
    ipv6_frag_thdr_truncated() to check for truncation in the upper layer
    header. This validation does not fulfill all aspects of RFC 8200,
    section 4.5, but is at the moment sufficient to pass the mentioned TAHI test.

    In netfilter, utilize the fragment offset returned by find_prev_fhdr() to
    let ipv6_frag_thdr_truncated() start its traversal from the fragment
    header.

    Return 0 to drop the fragment in netfilter. This is the same behaviour
    as used for other protocol errors in this function, e.g. when
    nf_ct_frag6_queue() returns -EPROTO. The fragment will later be picked up
    by ipv6_frag_rcv() in reassembly.c. ipv6_frag_rcv() will then send an
    appropriate ICMP Parameter Problem message back to the source.

    References commit 2efdaaaf883a ("IPv6: reply ICMP error if the first
    fragment don't include all headers")

    Signed-off-by: Georg Kohmann
    Acked-by: Pablo Neira Ayuso
    Link: https://lore.kernel.org/r/20201111115025.28879-1-geokohma@cisco.com
    Signed-off-by: Jakub Kicinski

    Georg Kohmann
     

01 Nov, 2020

1 commit

  • Based on RFC 8200, Section 4.5 Fragment Header:

    - If the first fragment does not include all headers through an
    Upper-Layer header, then that fragment should be discarded and
    an ICMP Parameter Problem, Code 3, message should be sent to
    the source of the fragment, with the Pointer field set to zero.

    Checking each packet header in IPv6 fast path will have performance impact,
    so I put the checking in ipv6_frag_rcv().

    As the packet may be any kind of L4 protocol, I only checked some common
    protocols' header length and handle others by (offset + 1) > skb->len.
    Also use !(frag_off & htons(IP6_OFFSET)) to catch atomic fragments
    (fragmented packet with only one fragment).

    When sending the ICMP error message, if the first truncated fragment is an
    ICMP message, icmp6_send() will bail out because is_ineligible() returns
    true. So I added a check in is_ineligible() to let a fragment with nexthdr
    ICMP but no ICMP header return false.

    Signed-off-by: Hangbin Liu
    Signed-off-by: Jakub Kicinski

    Hangbin Liu
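
The check described in the commit above can be sketched in userspace C. This is a minimal illustration under stated assumptions, not the kernel code: the `SKETCH_*` constants and both function names are hypothetical, the header sizes assume option-less TCP (20 bytes), UDP (8 bytes) and ICMPv6 (8 bytes) headers, and `pkt_len` stands in for skb->len.

```c
#include <arpa/inet.h>  /* htons() */
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Local names for well-known values; none of these come from kernel headers. */
#define SKETCH_IP6_OFFSET   0xFFF8u /* fragment offset bits of frag_off */
#define SKETCH_NEXTHDR_TCP  6
#define SKETCH_NEXTHDR_UDP  17
#define SKETCH_NEXTHDR_ICMP 58

/* frag_off_be is the Fragment header field in network byte order. */
static bool sketch_is_first_fragment(uint16_t frag_off_be)
{
    /* Offset 0 covers both real first fragments and atomic fragments. */
    return !(frag_off_be & htons(SKETCH_IP6_OFFSET));
}

/* True if the packet ends before the upper-layer header at `offset` is complete. */
static bool sketch_upper_header_truncated(uint8_t nexthdr, size_t offset,
                                          size_t pkt_len)
{
    size_t need;

    switch (nexthdr) {
    case SKETCH_NEXTHDR_TCP:  need = 20; break;
    case SKETCH_NEXTHDR_UDP:  need = 8;  break;
    case SKETCH_NEXTHDR_ICMP: need = 8;  break;
    default:                  need = 1;  break; /* the (offset + 1) > len fallback */
    }
    return offset + need > pkt_len;
}
```

A caller would apply `sketch_upper_header_truncated()` only when `sketch_is_first_fragment()` is true, which keeps the extra work out of the path taken by non-fragmented packets.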
     

09 Aug, 2019

1 commit

  • Before commit d4289fcc9b16 ("net: IP6 defrag: use rbtrees for IPv6
    defrag"), a netperf UDP_STREAM test[0] using big IPv6 datagrams (thus
    generating many fragments) and running over an IPsec tunnel, reported
    more than 6Gbps throughput. After that patch, the same test gets only
    9Mbps when receiving on a be2net nic (driver can make a big difference
    here, for example, ixgbe doesn't seem to be affected).

    By reusing the IPv4 defragmentation code, IPv6 lost fragment coalescing
    (IPv4 fragment coalescing was dropped by commit 14fe22e33462 ("Revert
    "ipv4: use skb coalescing in defragmentation"")).

    Without fragment coalescing, be2net runs out of Rx ring entries and
    starts to drop frames (ethtool reports rx_drops_no_frags errors). Since
    the netperf traffic is only composed of UDP fragments, any lost packet
    prevents reassembly of the full datagram. Therefore, fragments which
    have no possibility to ever get reassembled pile up in the reassembly
    queue, until the memory accounting exceeds the threshold. At that point
    no fragment is accepted anymore, which effectively discards all
    netperf traffic.

    When reassembly timeout expires, some stale fragments are removed from
    the reassembly queue, so a few packets can be received, reassembled
    and delivered to the netperf receiver. But the nic still drops frames
    and soon the reassembly queue gets filled again with stale fragments.
    These long time frames where no datagram can be received explain why
    the performance drop is so significant.

    Re-introducing fragment coalescing is enough to get the initial
    performance back (6.6Gbps with be2net): the driver no longer drops
    frames (no more rx_drops_no_frags errors) and the reassembly engine
    works at full speed.

    This patch is quite conservative and only coalesces skbs for local
    IPv4 and IPv6 delivery (in order to avoid changing skb geometry when
    forwarding). Coalescing could be extended in the future if need be, as
    more scenarios would probably benefit from it.

    [0]: Test configuration
    Sender:
    ip xfrm policy flush
    ip xfrm state flush
    ip xfrm state add src fc00:1::1 dst fc00:2::1 proto esp spi 0x1000 aead 'rfc4106(gcm(aes))' 0x0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b 96 mode transport sel src fc00:1::1 dst fc00:2::1
    ip xfrm policy add src fc00:1::1 dst fc00:2::1 dir in tmpl src fc00:1::1 dst fc00:2::1 proto esp mode transport action allow
    ip xfrm state add src fc00:2::1 dst fc00:1::1 proto esp spi 0x1001 aead 'rfc4106(gcm(aes))' 0x0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b 96 mode transport sel src fc00:2::1 dst fc00:1::1
    ip xfrm policy add src fc00:2::1 dst fc00:1::1 dir out tmpl src fc00:2::1 dst fc00:1::1 proto esp mode transport action allow
    netserver -D -L fc00:2::1

    Receiver:
    ip xfrm policy flush
    ip xfrm state flush
    ip xfrm state add src fc00:2::1 dst fc00:1::1 proto esp spi 0x1001 aead 'rfc4106(gcm(aes))' 0x0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b 96 mode transport sel src fc00:2::1 dst fc00:1::1
    ip xfrm policy add src fc00:2::1 dst fc00:1::1 dir in tmpl src fc00:2::1 dst fc00:1::1 proto esp mode transport action allow
    ip xfrm state add src fc00:1::1 dst fc00:2::1 proto esp spi 0x1000 aead 'rfc4106(gcm(aes))' 0x0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b 96 mode transport sel src fc00:1::1 dst fc00:2::1
    ip xfrm policy add src fc00:1::1 dst fc00:2::1 dir out tmpl src fc00:1::1 dst fc00:2::1 proto esp mode transport action allow
    netperf -H fc00:2::1 -f k -P 0 -L fc00:1::1 -l 60 -t UDP_STREAM -I 99,5 -i 5,5 -T5,5 -6

    Signed-off-by: Guillaume Nault
    Acked-by: Florian Westphal
    Signed-off-by: David S. Miller

    Guillaume Nault
     

19 Jun, 2019

1 commit

  • syzbot reported another issue caused by my recent patches. [1]

    The issue here is that fqdir_exit() is initiating a work queue
    and immediately returns. A bit later cleanup_net() was able
    to free the MIB (percpu data) and the whole struct net was freed,
    but we had active frag timers that fired and triggered use-after-free.

    We need to make sure that timers can catch fqdir->dead being set,
    to bail out.

    Since RCU is used for the reader side, this means
    we want to respect an RCU grace period between these operations :

    1) fqdir->dead = 1;

    2) netns dismantle (freeing of various data structure)

    This patch uses the new (struct pernet_operations)->pre_exit
    infrastructure to ensure a full RCU grace period
    happens between fqdir_pre_exit() and fqdir_exit().

    This also means we can use a regular work queue, we no
    longer need rcu_work.

    Tested:

    $ time for i in {1..1000}; do unshare -n /bin/false;done

    real 0m2.585s
    user 0m0.160s
    sys 0m2.214s

    [1]

    BUG: KASAN: use-after-free in ip_expire+0x73e/0x800 net/ipv4/ip_fragment.c:152
    Read of size 8 at addr ffff88808b9fe330 by task syz-executor.4/11860

    CPU: 1 PID: 11860 Comm: syz-executor.4 Not tainted 5.2.0-rc2+ #22
    Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
    Call Trace:

    __dump_stack lib/dump_stack.c:77 [inline]
    dump_stack+0x172/0x1f0 lib/dump_stack.c:113
    print_address_description.cold+0x7c/0x20d mm/kasan/report.c:188
    __kasan_report.cold+0x1b/0x40 mm/kasan/report.c:317
    kasan_report+0x12/0x20 mm/kasan/common.c:614
    __asan_report_load8_noabort+0x14/0x20 mm/kasan/generic_report.c:132
    ip_expire+0x73e/0x800 net/ipv4/ip_fragment.c:152
    call_timer_fn+0x193/0x720 kernel/time/timer.c:1322
    expire_timers kernel/time/timer.c:1366 [inline]
    __run_timers kernel/time/timer.c:1685 [inline]
    __run_timers kernel/time/timer.c:1653 [inline]
    run_timer_softirq+0x66f/0x1740 kernel/time/timer.c:1698
    __do_softirq+0x25c/0x94c kernel/softirq.c:293
    invoke_softirq kernel/softirq.c:374 [inline]
    irq_exit+0x180/0x1d0 kernel/softirq.c:414
    exiting_irq arch/x86/include/asm/apic.h:536 [inline]
    smp_apic_timer_interrupt+0x13b/0x550 arch/x86/kernel/apic/apic.c:1068
    apic_timer_interrupt+0xf/0x20 arch/x86/entry/entry_64.S:806

    RIP: 0010:tomoyo_domain_quota_is_ok+0x131/0x540 security/tomoyo/util.c:1035
    Code: 24 4c 3b 65 d0 0f 84 9c 00 00 00 e8 19 1d 73 fe 49 8d 7c 24 18 48 ba 00 00 00 00 00 fc ff df 48 89 f8 48 c1 e8 03 0f b6 04 10 89 fa 83 e2 07 38 d0 7f 08 84 c0 0f 85 69 03 00 00 41 0f b6 5c
    RSP: 0018:ffff88806ae079c0 EFLAGS: 00000a02 ORIG_RAX: ffffffffffffff13
    RAX: 0000000000000000 RBX: 0000000000000010 RCX: ffffc9000e655000
    RDX: dffffc0000000000 RSI: ffffffff82fd88a7 RDI: ffff888086202398
    RBP: ffff88806ae07a00 R08: ffff88808b6c8700 R09: ffffed100d5c0f4d
    R10: ffffed100d5c0f4c R11: 0000000000000000 R12: ffff888086202380
    R13: 0000000000000030 R14: 00000000000000d3 R15: 0000000000000000
    tomoyo_supervisor+0x2e8/0xef0 security/tomoyo/common.c:2087
    tomoyo_audit_path_number_log security/tomoyo/file.c:235 [inline]
    tomoyo_path_number_perm+0x42f/0x520 security/tomoyo/file.c:734
    tomoyo_file_ioctl+0x23/0x30 security/tomoyo/tomoyo.c:335
    security_file_ioctl+0x77/0xc0 security/security.c:1370
    ksys_ioctl+0x57/0xd0 fs/ioctl.c:711
    __do_sys_ioctl fs/ioctl.c:720 [inline]
    __se_sys_ioctl fs/ioctl.c:718 [inline]
    __x64_sys_ioctl+0x73/0xb0 fs/ioctl.c:718
    do_syscall_64+0xfd/0x680 arch/x86/entry/common.c:301
    entry_SYSCALL_64_after_hwframe+0x49/0xbe
    RIP: 0033:0x4592c9
    Code: fd b7 fb ff c3 66 2e 0f 1f 84 00 00 00 00 00 66 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 3d 01 f0 ff ff 0f 83 cb b7 fb ff c3 66 2e 0f 1f 84 00 00 00 00
    RSP: 002b:00007f8db5e44c78 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
    RAX: ffffffffffffffda RBX: 0000000000000003 RCX: 00000000004592c9
    RDX: 0000000020000080 RSI: 00000000000089f1 RDI: 0000000000000006
    RBP: 000000000075bf20 R08: 0000000000000000 R09: 0000000000000000
    R10: 0000000000000000 R11: 0000000000000246 R12: 00007f8db5e456d4
    R13: 00000000004cc770 R14: 00000000004d5cd8 R15: 00000000ffffffff

    Allocated by task 9047:
    save_stack+0x23/0x90 mm/kasan/common.c:71
    set_track mm/kasan/common.c:79 [inline]
    __kasan_kmalloc mm/kasan/common.c:489 [inline]
    __kasan_kmalloc.constprop.0+0xcf/0xe0 mm/kasan/common.c:462
    kasan_slab_alloc+0xf/0x20 mm/kasan/common.c:497
    slab_post_alloc_hook mm/slab.h:437 [inline]
    slab_alloc mm/slab.c:3326 [inline]
    kmem_cache_alloc+0x11a/0x6f0 mm/slab.c:3488
    kmem_cache_zalloc include/linux/slab.h:732 [inline]
    net_alloc net/core/net_namespace.c:386 [inline]
    copy_net_ns+0xed/0x340 net/core/net_namespace.c:426
    create_new_namespaces+0x400/0x7b0 kernel/nsproxy.c:107
    unshare_nsproxy_namespaces+0xc2/0x200 kernel/nsproxy.c:206
    ksys_unshare+0x440/0x980 kernel/fork.c:2692
    __do_sys_unshare kernel/fork.c:2760 [inline]
    __se_sys_unshare kernel/fork.c:2758 [inline]
    __x64_sys_unshare+0x31/0x40 kernel/fork.c:2758
    do_syscall_64+0xfd/0x680 arch/x86/entry/common.c:301
    entry_SYSCALL_64_after_hwframe+0x49/0xbe

    Freed by task 2541:
    save_stack+0x23/0x90 mm/kasan/common.c:71
    set_track mm/kasan/common.c:79 [inline]
    __kasan_slab_free+0x102/0x150 mm/kasan/common.c:451
    kasan_slab_free+0xe/0x10 mm/kasan/common.c:459
    __cache_free mm/slab.c:3432 [inline]
    kmem_cache_free+0x86/0x260 mm/slab.c:3698
    net_free net/core/net_namespace.c:402 [inline]
    net_drop_ns.part.0+0x70/0x90 net/core/net_namespace.c:409
    net_drop_ns net/core/net_namespace.c:408 [inline]
    cleanup_net+0x538/0x960 net/core/net_namespace.c:571
    process_one_work+0x989/0x1790 kernel/workqueue.c:2269
    worker_thread+0x98/0xe40 kernel/workqueue.c:2415
    kthread+0x354/0x420 kernel/kthread.c:255
    ret_from_fork+0x24/0x30 arch/x86/entry/entry_64.S:352

    The buggy address belongs to the object at ffff88808b9fe100
    which belongs to the cache net_namespace of size 6784
    The buggy address is located 560 bytes inside of
    6784-byte region [ffff88808b9fe100, ffff88808b9ffb80)
    The buggy address belongs to the page:
    page:ffffea00022e7f80 refcount:1 mapcount:0 mapping:ffff88821b6f60c0 index:0x0 compound_mapcount: 0
    flags: 0x1fffc0000010200(slab|head)
    raw: 01fffc0000010200 ffffea000256f288 ffffea0001bbef08 ffff88821b6f60c0
    raw: 0000000000000000 ffff88808b9fe100 0000000100000001 0000000000000000
    page dumped because: kasan: bad access detected

    Memory state around the buggy address:
    ffff88808b9fe200: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
    ffff88808b9fe280: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
    >ffff88808b9fe300: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
    ^
    ffff88808b9fe380: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
    ffff88808b9fe400: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb

    Fixes: 3c8fc8782044 ("inet: frags: rework rhashtable dismantle")
    Signed-off-by: Eric Dumazet
    Reported-by: syzbot
    Signed-off-by: David S. Miller

    Eric Dumazet
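
The bail-out described in the commit above can be illustrated with a small userspace sketch. It only models the `dead` flag check with C11 atomics. It does not implement RCU or a grace period, and `struct sketch_fqdir`, `sketch_pre_exit()` and `sketch_timer_fire()` are hypothetical names, not the kernel's.

```c
#include <stdatomic.h>
#include <stdbool.h>

struct sketch_fqdir {
    atomic_bool dead;   /* set once teardown has started */
    int expired;        /* work done by timers that were allowed to run */
};

/* Step 1 of teardown: mark the directory dead. In the kernel, a full RCU
 * grace period would have to elapse after this before any data is freed. */
static void sketch_pre_exit(struct sketch_fqdir *fqdir)
{
    atomic_store_explicit(&fqdir->dead, true, memory_order_release);
}

/* Timer callback: returns false and touches nothing else once teardown began. */
static bool sketch_timer_fire(struct sketch_fqdir *fqdir)
{
    if (atomic_load_explicit(&fqdir->dead, memory_order_acquire))
        return false;
    fqdir->expired++;
    return true;
}
```

The sketch shows why the flag alone is not enough: a timer that read `dead` just before `sketch_pre_exit()` ran is still executing, so the freeing step has to wait until all such readers have finished.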
     

18 Jun, 2019

1 commit


13 Jun, 2019

1 commit

  • Get the ingress interface and increment ICMP counters based on that
    instead of skb->dev when the dev is a VRF device.

    This is a follow up on the following message:
    https://www.spinics.net/lists/netdev/msg560268.html

    v2: Avoid changing skb->dev since it has unintended effect for local
    delivery (David Ahern).
    Signed-off-by: Stephen Suryaputra
    Reviewed-by: David Ahern
    Signed-off-by: David S. Miller

    Stephen Suryaputra
     

08 Jun, 2019

1 commit


31 May, 2019

1 commit

  • Based on 1 normalized pattern(s):

    this program is free software you can redistribute it and or modify
    it under the terms of the gnu general public license as published by
    the free software foundation either version 2 of the license or at
    your option any later version

    extracted by the scancode license scanner the SPDX license identifier

    GPL-2.0-or-later

    has been chosen to replace the boilerplate/reference in 3029 file(s).

    Signed-off-by: Thomas Gleixner
    Reviewed-by: Allison Randal
    Cc: linux-spdx@vger.kernel.org
    Link: https://lkml.kernel.org/r/20190527070032.746973796@linutronix.de
    Signed-off-by: Greg Kroah-Hartman

    Thomas Gleixner
     

29 May, 2019

1 commit

  • Both IPv6 and 6lowpan are calling inet_frags_fini() too soon.

    inet_frags_fini() is dismantling a kmem_cache, that might be needed
    later when unregister_pernet_subsys() eventually has to remove
    frags queues from hash tables and free them.

    This fixes potential use-after-free, and is a prereq for the following patch.

    Fixes: d4ad4d22e7ac ("inet: frags: use kmem_cache for inet_frag_queue")
    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

27 May, 2019

7 commits


27 Feb, 2019

1 commit


26 Jan, 2019

1 commit

  • Currently, IPv6 defragmentation code drops non-last fragments that
    are smaller than 1280 bytes: see
    commit 0ed4229b08c1 ("ipv6: defrag: drop non-last frags smaller than min mtu")

    This behavior is not specified in IPv6 RFCs and appears to break
    compatibility with some IPv6 implementations, as reported here:
    https://www.spinics.net/lists/netdev/msg543846.html

    This patch re-uses common IP defragmentation queueing and reassembly
    code in IPv6, removing the 1280 byte restriction.

    Signed-off-by: Peter Oskolkov
    Reported-by: Tom Herbert
    Cc: Eric Dumazet
    Cc: Florian Westphal
    Signed-off-by: David S. Miller

    Peter Oskolkov
     

31 Dec, 2018

1 commit


21 Dec, 2018

1 commit

  • It was reported that IPsec would crash when it encounters an IPv6
    reassembled packet because skb->sk is non-zero and not a valid
    pointer.

    This is because skb->sk is now a union with ip_defrag_offset.

    This patch fixes this by resetting skb->sk when exiting from
    the reassembly code.

    Reported-by: Xiumei Mu
    Fixes: 219badfaade9 ("ipv6: frags: get rid of ip6frag_skb_cb/...")
    Signed-off-by: Herbert Xu
    Signed-off-by: David S. Miller

    Herbert Xu
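
The aliasing problem fixed by the commit above can be shown with a toy structure. This is a sketch under assumptions: `struct sketch_skb` keeps only the two union members the commit message names, and the helper functions are hypothetical.

```c
#include <stddef.h>

struct sketch_sock;  /* opaque stand-in for struct sock */

struct sketch_skb {
    union {
        struct sketch_sock *sk;   /* owner socket, expected NULL or valid */
        int ip_defrag_offset;     /* reused while the skb sits in a frag queue */
    };
};

/* While queued, the offset overwrites the storage that sk also uses. */
static void sketch_queue_fragment(struct sketch_skb *skb, int offset)
{
    skb->ip_defrag_offset = offset;
}

/* On leaving the reassembly code, clear the field so later users of
 * skb->sk do not mistake a stale offset for a socket pointer. */
static void sketch_reasm_finish(struct sketch_skb *skb)
{
    skb->sk = NULL;
}
```

Without the reset, any code that tests `skb->sk` after reassembly sees a non-NULL value that was never a pointer.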
     

06 Dec, 2018

1 commit

  • The *_frag_reasm() functions are susceptible to miscalculating the byte
    count of packet fragments in case the truesize of a head buffer changes.
    The truesize member may be changed by the call to skb_unclone(), leaving
    the fragment memory limit counter unbalanced even if all fragments are
    processed. This miscalculation goes unnoticed as long as the network
    namespace which holds the counter is not destroyed.

    Should an attempt be made to destroy a network namespace that holds an
    unbalanced fragment memory limit counter the cleanup of the namespace
    never finishes. The thread handling the cleanup gets stuck in
    inet_frags_exit_net() waiting for the percpu counter to reach zero. The
    thread is usually in running state with a stacktrace similar to:

    PID: 1073 TASK: ffff880626711440 CPU: 1 COMMAND: "kworker/u48:4"
    #5 [ffff880621563d48] _raw_spin_lock at ffffffff815f5480
    #6 [ffff880621563d48] inet_evict_bucket at ffffffff8158020b
    #7 [ffff880621563d80] inet_frags_exit_net at ffffffff8158051c
    #8 [ffff880621563db0] ops_exit_list at ffffffff814f5856
    #9 [ffff880621563dd8] cleanup_net at ffffffff814f67c0
    #10 [ffff880621563e38] process_one_work at ffffffff81096f14

    It is not possible to create new network namespaces, and processes
    that call unshare() end up being stuck in uninterruptible sleep state
    waiting to acquire the net_mutex.

    The bug was observed in the IPv6 netfilter code by Per Sundstrom.
    I thank him for his analysis of the problem. The parts of this patch
    that apply to IPv4 and IPv6 fragment reassembly are preemptive measures.

    Signed-off-by: Jiri Wiesner
    Reported-by: Per Sundstrom
    Acked-by: Peter Oskolkov
    Signed-off-by: David S. Miller

    Jiri Wiesner
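
The imbalance described in the commit above can be reproduced with plain arithmetic. The following is a toy model, not the kernel code: `sketch_reasm_leftover()` is a hypothetical function, the counter is an ordinary `long`, and the effect of skb_unclone() is reduced to "the head buffer reports a different truesize afterwards".

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Returns the memory counter left over after reassembling n fragments;
 * truesize[0] belongs to the head buffer. */
static long sketch_reasm_leftover(const long *truesize, size_t n,
                                  long head_truesize_after_unclone,
                                  bool account_delta)
{
    long counter = 0;
    long head, delta;
    size_t i;

    assert(n >= 1);
    head = truesize[0];

    /* Queueing: every fragment is charged with its truesize at that time. */
    for (i = 0; i < n; i++)
        counter += truesize[i];

    /* Stand-in for skb_unclone(): the head may come back with a new truesize. */
    delta = head_truesize_after_unclone - head;
    head = head_truesize_after_unclone;
    if (account_delta)
        counter += delta;   /* charge the difference before uncharging */

    /* Reassembly: uncharge what the buffers report now. */
    counter -= head;
    for (i = 1; i < n; i++)
        counter -= truesize[i];

    return counter;  /* anything but 0 means the counter is unbalanced */
}
```

With two fragments of truesize 768 and a head that grows to 1280, the unpatched path leaves the counter at -512, while accounting for the delta brings it back to 0.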
     

22 Sep, 2018

2 commits

  • Currently, ip[6]frag_high_thresh sysctl values in new namespaces are
    hard-limited to those of the root/init ns.

    There are at least two use cases when it would be desirable to
    set the high_thresh values higher in a child namespace vs the global hard
    limit:

    - a security/ddos protection policy may lower the thresholds in the
    root/init ns but allow for a special exception in a child namespace
    - testing: a test running in a namespace may want to set these
    thresholds higher in its namespace than what is in the root/init ns

    The new behavior:

    # ip netns add testns
    # ip netns exec testns bash

    # sysctl -w net.ipv4.ipfrag_high_thresh=9000000
    net.ipv4.ipfrag_high_thresh = 9000000

    # sysctl net.ipv4.ipfrag_high_thresh
    net.ipv4.ipfrag_high_thresh = 9000000

    # sysctl -w net.ipv6.ip6frag_high_thresh=9000000
    net.ipv6.ip6frag_high_thresh = 9000000

    # sysctl net.ipv6.ip6frag_high_thresh
    net.ipv6.ip6frag_high_thresh = 9000000

    The old behavior:

    # ip netns add testns
    # ip netns exec testns bash

    # sysctl -w net.ipv4.ipfrag_high_thresh=9000000
    net.ipv4.ipfrag_high_thresh = 9000000

    # sysctl net.ipv4.ipfrag_high_thresh
    net.ipv4.ipfrag_high_thresh = 4194304

    # sysctl -w net.ipv6.ip6frag_high_thresh=9000000
    net.ipv6.ip6frag_high_thresh = 9000000

    # sysctl net.ipv6.ip6frag_high_thresh
    net.ipv6.ip6frag_high_thresh = 4194304

    Signed-off-by: Peter Oskolkov
    Signed-off-by: David S. Miller

    Peter Oskolkov
     
  • This is similar to how ipv4 now behaves:
    commit 0ff89efb5246 ("ip: fail fast on IP defrag errors").

    Signed-off-by: Peter Oskolkov
    Signed-off-by: David S. Miller

    Peter Oskolkov
     

11 Sep, 2018

1 commit


06 Aug, 2018

2 commits

  • Don't bother with pathological cases; they only waste cycles.
    IPv6 requires a minimum MTU of 1280, so we should never see fragments
    smaller than this (except the last fragment).

    v3: don't use awkward "-offset + len"
    v2: drop IPv4 part, which added same check w. IPV4_MIN_MTU (68).
    There were concerns that there could be even smaller frags
    generated by intermediate nodes, e.g. on radio networks.

    Cc: Peter Oskolkov
    Cc: Eric Dumazet
    Signed-off-by: Florian Westphal
    Signed-off-by: David S. Miller

    Florian Westphal
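
The rule introduced by the commit above amounts to a two-part test, sketched here in userspace C. The names are hypothetical and `pkt_len` is assumed to be the length of the fragment packet. Note that the 26 Jan, 2019 entry in this log removes the restriction again.

```c
#include <arpa/inet.h>  /* htons() */
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_IPV6_MIN_MTU 1280
#define SKETCH_IP6_MF       0x0001u /* "more fragments" bit of frag_off */

/* frag_off_be is the Fragment header field in network byte order. */
static bool sketch_drop_small_fragment(size_t pkt_len, uint16_t frag_off_be)
{
    bool more_fragments = (frag_off_be & htons(SKETCH_IP6_MF)) != 0;

    /* Only non-last fragments are held to the minimum MTU. */
    return more_fragments && pkt_len < SKETCH_IPV6_MIN_MTU;
}
```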
     
  • Similar to TCP OOO RX queue, it makes sense to use rb trees to store
    IP fragments, so that OOO fragments are inserted faster.

    Tested:

    - a follow-up patch contains a rather comprehensive ip defrag
    self-test (functional)
    - ran neper `udp_stream -c -H -F 100 -l 300 -T 20`:
    netstat --statistics
    Ip:
    282078937 total packets received
    0 forwarded
    0 incoming packets discarded
    946760 incoming packets delivered
    18743456 requests sent out
    101 fragments dropped after timeout
    282077129 reassemblies required
    944952 packets reassembled ok
    262734239 packet reassembles failed
    (The numbers/stats above are somewhat better re:
    reassemblies vs a kernel without this patchset. More
    comprehensive performance testing TBD).

    Reported-by: Jann Horn
    Reported-by: Juha-Matti Tilli
    Suggested-by: Eric Dumazet
    Signed-off-by: Peter Oskolkov
    Signed-off-by: Eric Dumazet
    Cc: Florian Westphal
    Signed-off-by: David S. Miller

    Peter Oskolkov
     

18 Jul, 2018

1 commit

  • IPV6=m
    DEFRAG_IPV6=m
    CONNTRACK=y yields:

    net/netfilter/nf_conntrack_proto.o: In function `nf_ct_netns_do_get':
    net/netfilter/nf_conntrack_proto.c:802: undefined reference to `nf_defrag_ipv6_enable'
    net/netfilter/nf_conntrack_proto.o:(.rodata+0x640): undefined reference to `nf_conntrack_l4proto_icmpv6'

    Setting DEFRAG_IPV6=y causes undefined references to ip6_rhash_params,
    ip6_frag_init and ip6_expire_frag_queue, so it would be necessary to force
    IPV6=y too.

    This patch gets rid of the 'followup linker error' by removing
    the dependency of ipv6.ko symbols from netfilter ipv6 defrag.

    Shared code is placed into a header, then used from both.

    Signed-off-by: Florian Westphal
    Signed-off-by: Pablo Neira Ayuso

    Florian Westphal
     

19 Apr, 2018

1 commit

  • lockdep does not know that the locks used by IPv4 defrag
    and IPv6 reassembly units are of different classes.

    It complains because of following chains :

    1) sch_direct_xmit() (lock txq->_xmit_lock)
        dev_hard_start_xmit()
         xmit_one()
          dev_queue_xmit_nit()
           packet_rcv_fanout()
            ip_check_defrag()
             ip_defrag()
              spin_lock() (lock frag queue spinlock)

    2) ip6_input_finish()
        ipv6_frag_rcv() (lock frag queue spinlock)
         ip6_frag_queue()
          icmpv6_param_prob() (lock txq->_xmit_lock at some point)

    We could add lockdep annotations, but we also can make sure IPv6
    calls icmpv6_param_prob() only after the release of the frag queue spinlock,
    since this naturally makes frag queue spinlock a leaf in lock hierarchy.

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
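
The reordering described in the commit above follows a common pattern: record the error while holding the lock, and act on it only after the lock is released. The sketch below models that with a boolean instead of a real spinlock. All names are hypothetical, and the out-parameter is an assumption about how the error offset could be handed back to the caller.

```c
#include <stdbool.h>
#include <stdint.h>

struct sketch_fq {
    bool locked;  /* toy stand-in for the frag queue spinlock */
};

static bool sketch_icmp_sent_under_lock;  /* what the "ICMP" path observed */
static uint32_t sketch_icmp_pointer;

/* Stand-in for the ICMP error path, which may take other locks. */
static void sketch_param_prob(const struct sketch_fq *fq, uint32_t pointer)
{
    sketch_icmp_sent_under_lock = fq->locked;
    sketch_icmp_pointer = pointer;
}

/* Runs under the lock: only records the error offset, sends nothing. */
static int sketch_frag_queue(bool bad, uint32_t bad_offset, uint32_t *prob_offset)
{
    if (bad) {
        *prob_offset = bad_offset;
        return -1;
    }
    return 0;
}

static void sketch_frag_rcv(struct sketch_fq *fq, bool bad, uint32_t bad_offset)
{
    uint32_t prob_offset = 0;
    int ret;

    fq->locked = true;                       /* lock */
    ret = sketch_frag_queue(bad, bad_offset, &prob_offset);
    fq->locked = false;                      /* unlock */

    if (ret < 0)                             /* report only after unlocking */
        sketch_param_prob(fq, prob_offset);
}
```

Because nothing else is acquired while `locked` is set, the queue lock ends up as a leaf in the lock hierarchy.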
     

18 Apr, 2018

1 commit


05 Apr, 2018

1 commit

  • Giving an integer to proc_doulongvec_minmax() is dangerous on 64bit arches,
    since the linker might place a non-zero value next to it, preventing a
    change to ip6frag_low_thresh.

    ip6frag_low_thresh is not used anymore in the kernel, but we do not
    want to prematurely break user scripts wanting to change it.

    Since specifying a minimal value of 0 for proc_doulongvec_minmax()
    is moot, let's remove these zero values in all defrag units.

    Fixes: 6e00f7dd5e4e ("ipv6: frags: fix /proc/sys/net/ipv6/ip6frag_low_thresh")
    Signed-off-by: Eric Dumazet
    Reported-by: Maciej Żenczykowski
    Signed-off-by: David S. Miller

    Eric Dumazet
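
The hazard described in the commit above can be demonstrated without kernel code. In this sketch, a struct of two ints models "an int with something non-zero placed next to it", and `memcpy()` models a reader that expects an unsigned long at that address. The struct and function names are hypothetical.

```c
#include <string.h>

struct sketch_vars {
    int zero;       /* intended minimum, declared as int */
    int neighbour;  /* whatever happens to sit right after it */
};

/* Reads sizeof(long) bytes starting at v->zero, as a long-based handler would. */
static long sketch_min_as_long(const struct sketch_vars *v)
{
    long min = 0;
    size_t n = sizeof(min) < sizeof(*v) ? sizeof(min) : sizeof(*v);

    memcpy(&min, v, n);
    return min;
}
```

Where long is 8 bytes, the neighbouring value becomes part of the "minimum", so a bound meant to be 0 can turn into a large number and cause small writes to be rejected. Where long is 4 bytes, only `zero` is read and the problem does not appear.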
     

02 Apr, 2018

1 commit


01 Apr, 2018

8 commits

  • ip6_frag_queue uses skb->cb[] to store the fragment offset, meaning that
    we could use two cache lines per skb when finding the insertion point,
    if for some reason inet6_skb_parm size is increased in the future.

    By using skb->ip_defrag_offset instead of skb->cb[], we pack all
    the fields in a single cache line, matching what we did for IPv4.

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     
  • Make it similar to IPv4 ip_expire(), and release the lock
    before calling icmp functions.

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     
  • Some users are willing to provision huge amounts of memory to be able
    to perform reassembly reasonably well under pressure.

    Current memory tracking is using one atomic_t and integers.

    Switch to atomic_long_t so that 64bit arches can use more than 2GB,
    without any cost for 32bit arches.

    Note that this patch avoids an overflow error, if high_thresh was set
    to ~2GB, since this test in inet_frag_alloc() was never true :

    if (... || frag_mem_limit(nf) > nf->high_thresh)

    Tested:

    $ echo 16000000000 >/proc/sys/net/ipv4/ipfrag_high_thresh

    $ grep FRAG /proc/net/sockstat
    FRAG: inuse 14705885 memory 16000002880

    $ nstat -n ; sleep 1 ; nstat | grep Reas
    IpReasmReqds 3317150 0.0
    IpReasmFails 3317112 0.0

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
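
The "never true" test mentioned in the commit above is easy to see with two small functions. This is a sketch only: the function names are hypothetical, and `int64_t` is used in place of the kernel's long-based counter so the example behaves the same on any build host.

```c
#include <limits.h>
#include <stdbool.h>
#include <stdint.h>

/* With int accounting and a threshold near INT_MAX, no int value can exceed it. */
static bool sketch_over_limit_int(int mem, int high_thresh)
{
    return mem > high_thresh;
}

/* With a wider type, thresholds and usage beyond 2GB compare as expected. */
static bool sketch_over_limit_wide(int64_t mem, int64_t high_thresh)
{
    return mem > high_thresh;
}
```

Using the figures from the test output above, `sketch_over_limit_wide(16000002880, 16000000000)` reports that the limit is exceeded, whereas `sketch_over_limit_int(INT_MAX, INT_MAX)` can never do so.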
     
  • This function is obsolete, after rhashtable addition to inet defrag.

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     
  • This refactors ip_expire() since one indentation level is removed.

    Note: in the future, we should try hard to avoid the skb_clone()
    since this is a serious performance cost.
    Under DDOS, the ICMP message won't be sent because of rate limits.

    The fact that ip6_expire_frag_queue() does not use skb_clone() is
    disturbing too. Presumably IPv6 should have the same
    issue as the one we fixed in commit ec4fbd64751d
    ("inet: frag: release spinlock before calling icmp_send()")

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     
  • Some applications still rely on IP fragmentation, and to be fair the Linux
    reassembly unit does not work under any serious load.

    It uses static hash tables of 1024 buckets, and up to 128 items per bucket (!!!)

    A work queue is supposed to garbage collect items when the host is under
    memory pressure, and to do a hash rebuild, changing the seed used in hash
    computations.

    This work queue blocks softirqs for up to 25 ms when doing a hash rebuild,
    occurring every 5 seconds if the host is under fire.

    Then there is the problem of sharing this hash table for all netns.

    It is time to switch to rhashtables, and allocate one of them per netns
    to speedup netns dismantle, since this is a critical metric these days.

    Lookup is now using RCU. A followup patch will even remove
    the refcount hold/release left from prior implementation and save
    a couple of atomic operations.

    Before this patch, 16 cpus (16 RX queue NIC) could not handle more
    than 1 Mpps frags DDOS.

    After the patch, I reach 9 Mpps without any tuning, and can use up to 2GB
    of storage for the fragments (exact number depends on frags being evicted
    after timeout)

    $ grep FRAG /proc/net/sockstat
    FRAG: inuse 1966916 memory 2140004608

    A followup patch will change the limits for 64bit arches.

    Signed-off-by: Eric Dumazet
    Cc: Kirill Tkhai
    Cc: Herbert Xu
    Cc: Florian Westphal
    Cc: Jesper Dangaard Brouer
    Cc: Alexander Aring
    Cc: Stefan Schmidt
    Signed-off-by: David S. Miller

    Eric Dumazet
     
  • We want to call inet_frags_init() earlier.

    This is a prereq to "inet: frags: use rhashtables for reassembly units"

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     
  • In order to simplify the API, add a pointer to struct inet_frags.
    This will allow us to make things less complex.

    These functions no longer have a struct inet_frags parameter :

    inet_frag_destroy(struct inet_frag_queue *q /*, struct inet_frags *f */)
    inet_frag_put(struct inet_frag_queue *q /*, struct inet_frags *f */)
    inet_frag_kill(struct inet_frag_queue *q /*, struct inet_frags *f */)
    inet_frags_exit_net(struct netns_frags *nf /*, struct inet_frags *f */)
    ip6_expire_frag_queue(struct net *net, struct frag_queue *fq)

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
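
The API simplification in the commit above relies on a back-pointer: once the per-namespace structure points at its `inet_frags`, a function that receives only a queue can still reach the shared operations. A minimal sketch, with hypothetical `sketch_*` types that keep only the fields needed to show the idea:

```c
#include <stddef.h>

struct sketch_frag_queue;

struct sketch_inet_frags {
    void (*destructor)(struct sketch_frag_queue *q);
};

struct sketch_netns_frags {
    struct sketch_inet_frags *f;  /* the added back-pointer */
};

struct sketch_frag_queue {
    struct sketch_netns_frags *net;
    int destroyed;
};

/* Example destructor used to show that the callback was reached. */
static void sketch_mark_destroyed(struct sketch_frag_queue *q)
{
    q->destroyed = 1;
}

/* No struct sketch_inet_frags parameter: it is found through q->net->f. */
static void sketch_frag_destroy(struct sketch_frag_queue *q)
{
    struct sketch_inet_frags *f = q->net->f;

    if (f->destructor)
        f->destructor(q);
}
```

Callers then pass a single argument, which is what allows the signatures listed above to drop their second parameter.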