24 Aug, 2020

1 commit

  • Replace the existing /* fall through */ comments and their variants with
    the new pseudo-keyword macro fallthrough[1]. Also, remove unnecessary
    fall-through markings where applicable.

    [1] https://www.kernel.org/doc/html/v5.7/process/deprecated.html?highlight=fallthrough#implicit-switch-case-fall-through
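
    As an illustration of this kind of conversion, here is a minimal
    userspace sketch. The macro approximates the fallthrough pseudo-keyword
    for compilers that support the attribute and falls back to an empty
    statement otherwise; steps_from() is a hypothetical function, not code
    touched by this commit.

```c
/* Userspace approximation of the fallthrough pseudo-keyword (assumption:
 * the compiler understands __has_attribute; otherwise use a no-op). */
#if defined(__has_attribute)
# if __has_attribute(__fallthrough__)
#  define fallthrough __attribute__((__fallthrough__))
# endif
#endif
#ifndef fallthrough
# define fallthrough do {} while (0) /* fallthrough */
#endif

/* Hypothetical example: count the setup steps that run from a given stage. */
static int steps_from(int stage)
{
	int steps = 0;

	switch (stage) {
	case 0:
		steps++;
		fallthrough;	/* was: fall through comment */
	case 1:
		steps++;
		fallthrough;
	case 2:
		steps++;
		break;		/* last case: no marking needed */
	default:
		break;
	}
	return steps;
}
```

    Unlike a comment, the statement form can be checked by the compiler, so
    a missing marking can be reported with -Wimplicit-fallthrough.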

    Signed-off-by: Gustavo A. R. Silva

    Gustavo A. R. Silva
     

14 Jun, 2020

1 commit

  • Since commit 84af7a6194e4 ("checkpatch: kconfig: prefer 'help' over
    '---help---'"), the number of '---help---' has been gradually
    decreasing, but there are still more than 2400 instances.

    This commit finishes the conversion. While I touched the lines,
    I also fixed the indentation.

    There are a variety of indentation styles found.

    a) 4 spaces + '---help---'
    b) 7 spaces + '---help---'
    c) 8 spaces + '---help---'
    d) 1 space + 1 tab + '---help---'
    e) 1 tab + '---help---' (correct indentation)
    f) 1 tab + 1 space + '---help---'
    g) 1 tab + 2 spaces + '---help---'

    In order to convert all of them to 1 tab + 'help', I ran the
    following command:

    $ find . -name 'Kconfig*' | xargs sed -i 's/^[[:space:]]*---help---/\thelp/'
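
    For readers who prefer to see the substitution spelled out, the sketch
    below applies the same rewrite to a single line in C. It is only an
    illustration of what the sed expression does; convert_help_line() is a
    hypothetical helper and was not part of the conversion.

```c
#include <ctype.h>
#include <stdio.h>
#include <string.h>

/*
 * Rewrite one Kconfig line (without its trailing newline) the way the sed
 * expression does: if the line is optional whitespace followed by
 * "---help---", replace that prefix with a tab and "help". Any text after
 * the marker is kept. Returns 1 if the line was converted, 0 if it was
 * copied unchanged.
 */
static int convert_help_line(const char *in, char *out, size_t outsz)
{
	static const char marker[] = "---help---";
	const size_t mlen = sizeof(marker) - 1;
	const char *p = in;

	/* Skip any mix of leading spaces and tabs, like ^[[:space:]]* */
	while (*p && isspace((unsigned char)*p))
		p++;

	if (strncmp(p, marker, mlen) != 0) {
		snprintf(out, outsz, "%s", in);
		return 0;
	}
	snprintf(out, outsz, "\thelp%s", p + mlen);
	return 1;
}
```

    All seven indentation styles listed above reduce to the same output,
    because the leading whitespace is discarded before the tab is written.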

    Signed-off-by: Masahiro Yamada

    Masahiro Yamada
     

09 May, 2020

1 commit


05 May, 2020

1 commit

  • This patch reverts the following commits:

    commit 064ff66e2bef84f1153087612032b5b9eab005bd
    "bonding: add missing netdev_update_lockdep_key()"

    commit 53d374979ef147ab51f5d632dfe20b14aebeccd0
    "net: avoid updating qdisc_xmit_lock_key in netdev_update_lockdep_key()"

    commit 1f26c0d3d24125992ab0026b0dab16c08df947c7
    "net: fix kernel-doc warning in "

    commit ab92d68fc22f9afab480153bd82a20f6e2533769
    "net: core: add generic lockdep keys"

    but keeps addr_list_lock_key because we still take addr_list_lock
    in a nested fashion on stacked devices. Unlike xmit_lock, this is
    safe because we don't take addr_list_lock on any fast path.

    Reported-and-tested-by: syzbot+aaa6fa4949cc5d9b7b25@syzkaller.appspotmail.com
    Cc: Dmitry Vyukov
    Cc: Taehee Yoo
    Signed-off-by: Cong Wang
    Acked-by: Taehee Yoo
    Signed-off-by: David S. Miller

    Cong Wang
     

25 Oct, 2019

1 commit

  • Some interface types can be nested
    (VLAN, BONDING, TEAM, MACSEC, MACVLAN, IPVLAN, VIRT_WIFI, VXLAN, etc.).
    These interface types should set a lockdep class because, without a
    lockdep class key, lockdep always warns about nonexistent circular
    locking.

    In the current code, these interfaces have their own lockdep class keys
    and manage them themselves, so there is a lot of duplicated code around
    /drivers/net and /net/.
    This patch adds new generic lockdep keys and some helper functions for
    them.

    This patch makes the following changes:
    a) Add lockdep class keys in struct net_device
    - qdisc_running, xmit, addr_list, qdisc_busylock
    - these keys are used as dynamic lockdep key.
    b) When net_device is being allocated, lockdep keys are registered.
    - alloc_netdev_mqs()
    c) When net_device is being freed, lockdep keys are unregistered.
    - free_netdev()
    d) Add generic lockdep key helper function
    - netdev_register_lockdep_key()
    - netdev_unregister_lockdep_key()
    - netdev_update_lockdep_key()
    e) Remove unnecessary generic lockdep macro and functions
    f) Remove unnecessary lockdep code from each interface.

    After this patch, interface modules no longer need to maintain
    their own lockdep keys.

    Signed-off-by: Taehee Yoo
    Signed-off-by: David S. Miller

    Taehee Yoo
     

09 Aug, 2019

1 commit

  • Before commit d4289fcc9b16 ("net: IP6 defrag: use rbtrees for IPv6
    defrag"), a netperf UDP_STREAM test[0] using big IPv6 datagrams (thus
    generating many fragments) and running over an IPsec tunnel, reported
    more than 6Gbps throughput. After that patch, the same test gets only
    9Mbps when receiving on a be2net nic (driver can make a big difference
    here, for example, ixgbe doesn't seem to be affected).

    By reusing the IPv4 defragmentation code, IPv6 lost fragment coalescing
    (IPv4 fragment coalescing was dropped by commit 14fe22e33462 ("Revert
    "ipv4: use skb coalescing in defragmentation"")).

    Without fragment coalescing, be2net runs out of Rx ring entries and
    starts to drop frames (ethtool reports rx_drops_no_frags errors). Since
    the netperf traffic is only composed of UDP fragments, any lost packet
    prevents reassembly of the full datagram. Therefore, fragments which
    have no possibility to ever get reassembled pile up in the reassembly
    queue, until the memory accounting exceeds the threshold. At that point
    no fragment is accepted anymore, which effectively discards all
    netperf traffic.

    When reassembly timeout expires, some stale fragments are removed from
    the reassembly queue, so a few packets can be received, reassembled
    and delivered to the netperf receiver. But the nic still drops frames
    and soon the reassembly queue gets filled again with stale fragments.
    These long time frames where no datagram can be received explain why
    the performance drop is so significant.

    Re-introducing fragment coalescing is enough to get the initial
    performances again (6.6Gbps with be2net): driver doesn't drop frames
    anymore (no more rx_drops_no_frags errors) and the reassembly engine
    works at full speed.

    This patch is quite conservative and only coalesces skbs for local
    IPv4 and IPv6 delivery (in order to avoid changing skb geometry when
    forwarding). Coalescing could be extended in the future if need be, as
    more scenarios would probably benefit from it.

    [0]: Test configuration
    Sender:
    ip xfrm policy flush
    ip xfrm state flush
    ip xfrm state add src fc00:1::1 dst fc00:2::1 proto esp spi 0x1000 aead 'rfc4106(gcm(aes))' 0x0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b 96 mode transport sel src fc00:1::1 dst fc00:2::1
    ip xfrm policy add src fc00:1::1 dst fc00:2::1 dir in tmpl src fc00:1::1 dst fc00:2::1 proto esp mode transport action allow
    ip xfrm state add src fc00:2::1 dst fc00:1::1 proto esp spi 0x1001 aead 'rfc4106(gcm(aes))' 0x0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b 96 mode transport sel src fc00:2::1 dst fc00:1::1
    ip xfrm policy add src fc00:2::1 dst fc00:1::1 dir out tmpl src fc00:2::1 dst fc00:1::1 proto esp mode transport action allow
    netserver -D -L fc00:2::1

    Receiver:
    ip xfrm policy flush
    ip xfrm state flush
    ip xfrm state add src fc00:2::1 dst fc00:1::1 proto esp spi 0x1001 aead 'rfc4106(gcm(aes))' 0x0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b 96 mode transport sel src fc00:2::1 dst fc00:1::1
    ip xfrm policy add src fc00:2::1 dst fc00:1::1 dir in tmpl src fc00:2::1 dst fc00:1::1 proto esp mode transport action allow
    ip xfrm state add src fc00:1::1 dst fc00:2::1 proto esp spi 0x1000 aead 'rfc4106(gcm(aes))' 0x0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b 96 mode transport sel src fc00:1::1 dst fc00:2::1
    ip xfrm policy add src fc00:1::1 dst fc00:2::1 dir out tmpl src fc00:1::1 dst fc00:2::1 proto esp mode transport action allow
    netperf -H fc00:2::1 -f k -P 0 -L fc00:1::1 -l 60 -t UDP_STREAM -I 99,5 -i 5,5 -T5,5 -6

    Signed-off-by: Guillaume Nault
    Acked-by: Florian Westphal
    Signed-off-by: David S. Miller

    Guillaume Nault
     

19 Jun, 2019

1 commit

  • syzbot reported another issue caused by my recent patches. [1]

    The issue here is that fqdir_exit() is initiating a work queue
    and immediately returns. A bit later cleanup_net() was able
    to free the MIB (percpu data) and the whole struct net was freed,
    but we had active frag timers that fired and triggered use-after-free.

    We need to make sure that timers can see fqdir->dead being set,
    and bail out.

    Since RCU is used for the reader side, this means
    we want to respect an RCU grace period between these operations :

    1) fqdir->dead = 1;

    2) netns dismantle (freeing of various data structure)

    This patch uses the new (struct pernet_operations)->pre_exit
    infrastructure to ensure that a full RCU grace period
    happens between fqdir_pre_exit() and fqdir_exit().

    This also means we can use a regular work queue, we no
    longer need rcu_work.
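
    The ordering described above can be sketched in userspace as follows.
    This is a simplified model under stated assumptions: C11 atomics stand
    in for the kernel primitives, the RCU grace period between the two
    steps is not modelled at all, and every name ending in _sketch or
    starting with sketch_ is hypothetical.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for the per-netns fragment directory. */
struct fqdir_sketch {
	atomic_bool dead;	/* set before netns dismantle starts */
	int reasm_fails;	/* stands in for per-netns data freed later */
};

/* Step 1: mark the directory dead (the role given to fqdir_pre_exit()). */
static void sketch_pre_exit(struct fqdir_sketch *fqdir)
{
	atomic_store(&fqdir->dead, true);
}

/*
 * Timer handler: bail out when the directory is dead, so that per-netns
 * data is never touched once dismantle may have started. Returns true if
 * the handler did its normal work, false if it bailed out.
 */
static bool sketch_frag_expire(struct fqdir_sketch *fqdir)
{
	if (atomic_load(&fqdir->dead))
		return false;

	fqdir->reasm_fails++;
	return true;
}
```

    In the real code the flag alone is not sufficient: a timer that read
    the flag just before it was set may still be running, which is why a
    full grace period has to elapse before the data is freed.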

    Tested:

    $ time for i in {1..1000}; do unshare -n /bin/false;done

    real 0m2.585s
    user 0m0.160s
    sys 0m2.214s

    [1]

    BUG: KASAN: use-after-free in ip_expire+0x73e/0x800 net/ipv4/ip_fragment.c:152
    Read of size 8 at addr ffff88808b9fe330 by task syz-executor.4/11860

    CPU: 1 PID: 11860 Comm: syz-executor.4 Not tainted 5.2.0-rc2+ #22
    Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
    Call Trace:

    __dump_stack lib/dump_stack.c:77 [inline]
    dump_stack+0x172/0x1f0 lib/dump_stack.c:113
    print_address_description.cold+0x7c/0x20d mm/kasan/report.c:188
    __kasan_report.cold+0x1b/0x40 mm/kasan/report.c:317
    kasan_report+0x12/0x20 mm/kasan/common.c:614
    __asan_report_load8_noabort+0x14/0x20 mm/kasan/generic_report.c:132
    ip_expire+0x73e/0x800 net/ipv4/ip_fragment.c:152
    call_timer_fn+0x193/0x720 kernel/time/timer.c:1322
    expire_timers kernel/time/timer.c:1366 [inline]
    __run_timers kernel/time/timer.c:1685 [inline]
    __run_timers kernel/time/timer.c:1653 [inline]
    run_timer_softirq+0x66f/0x1740 kernel/time/timer.c:1698
    __do_softirq+0x25c/0x94c kernel/softirq.c:293
    invoke_softirq kernel/softirq.c:374 [inline]
    irq_exit+0x180/0x1d0 kernel/softirq.c:414
    exiting_irq arch/x86/include/asm/apic.h:536 [inline]
    smp_apic_timer_interrupt+0x13b/0x550 arch/x86/kernel/apic/apic.c:1068
    apic_timer_interrupt+0xf/0x20 arch/x86/entry/entry_64.S:806

    RIP: 0010:tomoyo_domain_quota_is_ok+0x131/0x540 security/tomoyo/util.c:1035
    Code: 24 4c 3b 65 d0 0f 84 9c 00 00 00 e8 19 1d 73 fe 49 8d 7c 24 18 48 ba 00 00 00 00 00 fc ff df 48 89 f8 48 c1 e8 03 0f b6 04 10 89 fa 83 e2 07 38 d0 7f 08 84 c0 0f 85 69 03 00 00 41 0f b6 5c
    RSP: 0018:ffff88806ae079c0 EFLAGS: 00000a02 ORIG_RAX: ffffffffffffff13
    RAX: 0000000000000000 RBX: 0000000000000010 RCX: ffffc9000e655000
    RDX: dffffc0000000000 RSI: ffffffff82fd88a7 RDI: ffff888086202398
    RBP: ffff88806ae07a00 R08: ffff88808b6c8700 R09: ffffed100d5c0f4d
    R10: ffffed100d5c0f4c R11: 0000000000000000 R12: ffff888086202380
    R13: 0000000000000030 R14: 00000000000000d3 R15: 0000000000000000
    tomoyo_supervisor+0x2e8/0xef0 security/tomoyo/common.c:2087
    tomoyo_audit_path_number_log security/tomoyo/file.c:235 [inline]
    tomoyo_path_number_perm+0x42f/0x520 security/tomoyo/file.c:734
    tomoyo_file_ioctl+0x23/0x30 security/tomoyo/tomoyo.c:335
    security_file_ioctl+0x77/0xc0 security/security.c:1370
    ksys_ioctl+0x57/0xd0 fs/ioctl.c:711
    __do_sys_ioctl fs/ioctl.c:720 [inline]
    __se_sys_ioctl fs/ioctl.c:718 [inline]
    __x64_sys_ioctl+0x73/0xb0 fs/ioctl.c:718
    do_syscall_64+0xfd/0x680 arch/x86/entry/common.c:301
    entry_SYSCALL_64_after_hwframe+0x49/0xbe
    RIP: 0033:0x4592c9
    Code: fd b7 fb ff c3 66 2e 0f 1f 84 00 00 00 00 00 66 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 3d 01 f0 ff ff 0f 83 cb b7 fb ff c3 66 2e 0f 1f 84 00 00 00 00
    RSP: 002b:00007f8db5e44c78 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
    RAX: ffffffffffffffda RBX: 0000000000000003 RCX: 00000000004592c9
    RDX: 0000000020000080 RSI: 00000000000089f1 RDI: 0000000000000006
    RBP: 000000000075bf20 R08: 0000000000000000 R09: 0000000000000000
    R10: 0000000000000000 R11: 0000000000000246 R12: 00007f8db5e456d4
    R13: 00000000004cc770 R14: 00000000004d5cd8 R15: 00000000ffffffff

    Allocated by task 9047:
    save_stack+0x23/0x90 mm/kasan/common.c:71
    set_track mm/kasan/common.c:79 [inline]
    __kasan_kmalloc mm/kasan/common.c:489 [inline]
    __kasan_kmalloc.constprop.0+0xcf/0xe0 mm/kasan/common.c:462
    kasan_slab_alloc+0xf/0x20 mm/kasan/common.c:497
    slab_post_alloc_hook mm/slab.h:437 [inline]
    slab_alloc mm/slab.c:3326 [inline]
    kmem_cache_alloc+0x11a/0x6f0 mm/slab.c:3488
    kmem_cache_zalloc include/linux/slab.h:732 [inline]
    net_alloc net/core/net_namespace.c:386 [inline]
    copy_net_ns+0xed/0x340 net/core/net_namespace.c:426
    create_new_namespaces+0x400/0x7b0 kernel/nsproxy.c:107
    unshare_nsproxy_namespaces+0xc2/0x200 kernel/nsproxy.c:206
    ksys_unshare+0x440/0x980 kernel/fork.c:2692
    __do_sys_unshare kernel/fork.c:2760 [inline]
    __se_sys_unshare kernel/fork.c:2758 [inline]
    __x64_sys_unshare+0x31/0x40 kernel/fork.c:2758
    do_syscall_64+0xfd/0x680 arch/x86/entry/common.c:301
    entry_SYSCALL_64_after_hwframe+0x49/0xbe

    Freed by task 2541:
    save_stack+0x23/0x90 mm/kasan/common.c:71
    set_track mm/kasan/common.c:79 [inline]
    __kasan_slab_free+0x102/0x150 mm/kasan/common.c:451
    kasan_slab_free+0xe/0x10 mm/kasan/common.c:459
    __cache_free mm/slab.c:3432 [inline]
    kmem_cache_free+0x86/0x260 mm/slab.c:3698
    net_free net/core/net_namespace.c:402 [inline]
    net_drop_ns.part.0+0x70/0x90 net/core/net_namespace.c:409
    net_drop_ns net/core/net_namespace.c:408 [inline]
    cleanup_net+0x538/0x960 net/core/net_namespace.c:571
    process_one_work+0x989/0x1790 kernel/workqueue.c:2269
    worker_thread+0x98/0xe40 kernel/workqueue.c:2415
    kthread+0x354/0x420 kernel/kthread.c:255
    ret_from_fork+0x24/0x30 arch/x86/entry/entry_64.S:352

    The buggy address belongs to the object at ffff88808b9fe100
    which belongs to the cache net_namespace of size 6784
    The buggy address is located 560 bytes inside of
    6784-byte region [ffff88808b9fe100, ffff88808b9ffb80)
    The buggy address belongs to the page:
    page:ffffea00022e7f80 refcount:1 mapcount:0 mapping:ffff88821b6f60c0 index:0x0 compound_mapcount: 0
    flags: 0x1fffc0000010200(slab|head)
    raw: 01fffc0000010200 ffffea000256f288 ffffea0001bbef08 ffff88821b6f60c0
    raw: 0000000000000000 ffff88808b9fe100 0000000100000001 0000000000000000
    page dumped because: kasan: bad access detected

    Memory state around the buggy address:
    ffff88808b9fe200: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
    ffff88808b9fe280: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
    >ffff88808b9fe300: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
    ^
    ffff88808b9fe380: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
    ffff88808b9fe400: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb

    Fixes: 3c8fc8782044 ("inet: frags: rework rhashtable dismantle")
    Signed-off-by: Eric Dumazet
    Reported-by: syzbot
    Signed-off-by: David S. Miller

    Eric Dumazet
     

08 Jun, 2019

1 commit


31 May, 2019

2 commits

  • Based on 1 normalized pattern(s):

    this program is free software you can redistribute it and or modify
    it under the terms of the gnu general public license version 2 as
    published by the free software foundation this program is
    distributed in the hope that it will be useful but without any
    warranty without even the implied warranty of merchantability or
    fitness for a particular purpose see the gnu general public license
    for more details

    extracted by the scancode license scanner the SPDX license identifier

    GPL-2.0-only

    has been chosen to replace the boilerplate/reference in 655 file(s).

    Signed-off-by: Thomas Gleixner
    Reviewed-by: Allison Randal
    Reviewed-by: Kate Stewart
    Reviewed-by: Richard Fontana
    Cc: linux-spdx@vger.kernel.org
    Link: https://lkml.kernel.org/r/20190527070034.575739538@linutronix.de
    Signed-off-by: Greg Kroah-Hartman

    Thomas Gleixner
     
  • Based on 1 normalized pattern(s):

    this program is free software you can redistribute it and or modify
    it under the terms of the gnu general public license as published by
    the free software foundation either version 2 of the license or at
    your option any later version

    extracted by the scancode license scanner the SPDX license identifier

    GPL-2.0-or-later

    has been chosen to replace the boilerplate/reference in 3029 file(s).

    Signed-off-by: Thomas Gleixner
    Reviewed-by: Allison Randal
    Cc: linux-spdx@vger.kernel.org
    Link: https://lkml.kernel.org/r/20190527070032.746973796@linutronix.de
    Signed-off-by: Greg Kroah-Hartman

    Thomas Gleixner
     

29 May, 2019

1 commit

  • Both IPv6 and 6lowpan are calling inet_frags_fini() too soon.

    inet_frags_fini() is dismantling a kmem_cache, that might be needed
    later when unregister_pernet_subsys() eventually has to remove
    frags queues from hash tables and free them.

    This fixes potential use-after-free, and is a prereq for the following patch.

    Fixes: d4ad4d22e7ac ("inet: frags: use kmem_cache for inet_frag_queue")
    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

27 May, 2019

7 commits


21 May, 2019

1 commit


27 Feb, 2019

1 commit


19 Feb, 2019

1 commit

  • This patch aligns IP defragmentation logic in 6lowpan with that
    of IPv4 and IPv6: see
    commit d4289fcc9b16 ("net: IP6 defrag: use rbtrees for IPv6 defrag")

    Modifying the ip_defrag selftest seemed like overkill, as I suspect
    most kernel test setups do not have 6lowpan hwsim enabled. So I ran
    the following code/script manually:

    insmod ./mac802154_hwsim.ko

    iwpan dev wpan0 set pan_id 0xbeef
    ip link add link wpan0 name lowpan0 type lowpan
    ip link set wpan0 up
    ip link set lowpan0 up

    iwpan dev wpan1 set pan_id 0xbeef
    ip netns add foo
    iwpan phy1 set netns name foo
    ip netns exec foo ip link add link wpan1 name lowpan1 type lowpan
    ip netns exec foo ip link set wpan1 up
    ip netns exec foo ip link set lowpan1 up

    ip -6 addr add "fb01::1/128" nodad dev lowpan0
    ip -netns foo -6 addr add "fb02::1/128" nodad dev lowpan1

    ip -6 route add "fb02::1/128" dev lowpan0
    ip -netns foo -6 route add "fb01::1/128" dev lowpan1

    # then in term1:
    ip netns exec foo bash
    ./udp_stream -6

    # in term2:
    ./udp_stream -c -6 -H fb02::1

    # pr_warn_once showed that the code changed by this patch
    # was invoked.

    Signed-off-by: Peter Oskolkov
    Acked-by: Alexander Aring
    Signed-off-by: Stefan Schmidt

    Peter Oskolkov
     

25 Dec, 2018

1 commit


22 Sep, 2018

1 commit

  • Currently, ip[6]frag_high_thresh sysctl values in new namespaces are
    hard-limited to those of the root/init ns.

    There are at least two use cases when it would be desirable to
    set the high_thresh values higher in a child namespace vs the global hard
    limit:

    - a security/ddos protection policy may lower the thresholds in the
    root/init ns but allow for a special exception in a child namespace
    - testing: a test running in a namespace may want to set these
    thresholds higher in its namespace than what is in the root/init ns

    The new behavior:

    # ip netns add testns
    # ip netns exec testns bash

    # sysctl -w net.ipv4.ipfrag_high_thresh=9000000
    net.ipv4.ipfrag_high_thresh = 9000000

    # sysctl net.ipv4.ipfrag_high_thresh
    net.ipv4.ipfrag_high_thresh = 9000000

    # sysctl -w net.ipv6.ip6frag_high_thresh=9000000
    net.ipv6.ip6frag_high_thresh = 9000000

    # sysctl net.ipv6.ip6frag_high_thresh
    net.ipv6.ip6frag_high_thresh = 9000000

    The old behavior:

    # ip netns add testns
    # ip netns exec testns bash

    # sysctl -w net.ipv4.ipfrag_high_thresh=9000000
    net.ipv4.ipfrag_high_thresh = 9000000

    # sysctl net.ipv4.ipfrag_high_thresh
    net.ipv4.ipfrag_high_thresh = 4194304

    # sysctl -w net.ipv6.ip6frag_high_thresh=9000000
    net.ipv6.ip6frag_high_thresh = 9000000

    # sysctl net.ipv6.ip6frag_high_thresh
    net.ipv6.ip6frag_high_thresh = 4194304

    Signed-off-by: Peter Oskolkov
    Signed-off-by: David S. Miller

    Peter Oskolkov
     

11 Sep, 2018

1 commit


06 Aug, 2018

2 commits

  • Pointers fq and net are assigned but never used; hence they are
    redundant and can be removed.

    Cleans up clang warnings:
    warning: variable 'fq' set but not used [-Wunused-but-set-variable]
    warning: variable 'net' set but not used [-Wunused-but-set-variable]

    Signed-off-by: Colin Ian King
    Signed-off-by: Stefan Schmidt

    Colin Ian King
     
  • This patch adds handling to take care of tailroom and headroom for
    single 6lowpan frames. We need to be sure we have an skb with the right
    headroom and tailroom for single frames. This patch does so by using
    skb_copy_expand() if the upper layer did not allocate enough headroom
    and tailroom.

    Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=195059
    Reported-by: David Palma
    Reported-by: Rabi Narayan Sahoo
    Cc: stable@vger.kernel.org
    Signed-off-by: Alexander Aring
    Signed-off-by: Stefan Schmidt

    Alexander Aring
     

21 Jul, 2018

1 commit

  • Pablo Neira Ayuso says:

    ====================
    Netfilter/IPVS updates for net-next

    The following patchset contains Netfilter/IPVS updates for your net-next
    tree:

    1) No need to set ttl from reject action for the bridge family, from
    Taehee Yoo.

    2) Use a fixed timeout for flow that are passed up from the flowtable
    to conntrack, from Florian Westphal.

    3) More preparation patches for tproxy support for nf_tables, from Mate
    Eckl.

    4) Remove unnecessary indirection in core IPv6 checksum function, from
    Florian Westphal.

    5) Use nf_ct_get_tuplepr() from openvswitch, instead of opencoding it.
    From Florian Westphal.

    6) socket match now selects socket infrastructure, instead of depending
    on it. From Mate Eckl.

    7) Patch series to simplify conntrack tuple building/parsing from packet
    path and ctnetlink, from Florian Westphal.

    8) Fetch timeout policy from protocol helpers, instead of doing it from
    core, from Florian Westphal.

    9) Merge IPv4 and IPv6 protocol trackers into conntrack core, from
    Florian Westphal.

    10) Depend on CONFIG_NF_TABLES_IPV6 and CONFIG_IP6_NF_IPTABLES
    respectively, instead of IPV6. Patch from Mate Eckl.

    11) Add specific function for garbage collection in conncount,
    from Yi-Hung Wei.

    12) Catch number of elements in the connlimit list, from Yi-Hung Wei.

    13) Move locking to nf_conncount, from Yi-Hung Wei.

    14) Series of patches to add lockless tree traversal in nf_conncount,
    from Yi-Hung Wei.

    15) Resolve clash in matching conntracks when race happens, from
    Martynas Pumputis.

    16) If connection entry times out, remove template entry from the
    ip_vs_conn_tab table to improve behaviour under flood, from
    Julian Anastasov.

    17) Remove useless parameter from nf_ct_helper_ext_add(), from Gao feng.

    18) Call abort from 2-phase commit protocol before requesting modules,
    make sure this is done under the mutex, from Florian Westphal.

    19) Grab module reference when starting transaction, also from Florian.

    20) Dynamically allocate expression info array for pre-parsing, from
    Florian.

    21) Add per netns mutex for nf_tables, from Florian Westphal.

    22) A couple of patches to simplify and refactor nf_osf code to prepare
    for nft_osf support.

    23) Break evaluation on missing socket, from Mate Eckl.

    24) Allow to match socket mark from nft_socket, from Mate Eckl.

    25) Remove dependency on nf_defrag_ipv6, now that IPv6 tracker is
    built-in into nf_conntrack. From Florian Westphal.
    ====================

    Signed-off-by: David S. Miller

    David S. Miller
     

18 Jul, 2018

1 commit

  • IPV6=m
    DEFRAG_IPV6=m
    CONNTRACK=y yields:

    net/netfilter/nf_conntrack_proto.o: In function `nf_ct_netns_do_get':
    net/netfilter/nf_conntrack_proto.c:802: undefined reference to `nf_defrag_ipv6_enable'
    net/netfilter/nf_conntrack_proto.o:(.rodata+0x640): undefined reference to `nf_conntrack_l4proto_icmpv6'

    Setting DEFRAG_IPV6=y causes undefined references to ip6_rhash_params
    ip6_frag_init and ip6_expire_frag_queue so it would be needed to force
    IPV6=y too.

    This patch gets rid of the 'followup linker error' by removing
    the dependency of ipv6.ko symbols from netfilter ipv6 defrag.

    Shared code is placed into a header, then used from both.

    Signed-off-by: Florian Westphal
    Signed-off-by: Pablo Neira Ayuso

    Florian Westphal
     

05 Jul, 2018

1 commit


24 Apr, 2018

1 commit

  • This patch initializes the stack variables used in
    frag_lowpan_compare_key to zero. In my case there are padding bytes in
    the ieee802154_addr structures as well as in frag_lowpan_compare_key.
    Without the initialization, the key variable contains random bytes, so
    comparing two keys with memcmp gives incorrect results.
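
    The problem can be reproduced in miniature with any structure that has
    a padding hole. The sketch below uses a hypothetical struct sketch_key
    rather than the real 6lowpan key, and assumes (as GCC and Clang do in
    practice) that storing to a member leaves padding bytes untouched.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical key: most ABIs put 7 padding bytes between tag and addr. */
struct sketch_key {
	uint8_t  tag;
	uint64_t addr;
};

/* Zero the whole object first so padding bytes are deterministic. */
static void sketch_key_init(struct sketch_key *key, uint8_t tag, uint64_t addr)
{
	memset(key, 0, sizeof(*key));
	key->tag = tag;
	key->addr = addr;
}

/* Byte-wise comparison: only meaningful if padding was zeroed. */
static int sketch_key_equal(const struct sketch_key *a,
			    const struct sketch_key *b)
{
	return memcmp(a, b, sizeof(*a)) == 0;
}
```

    Without the memset(), two keys with identical members could still
    compare as different, because memcmp also looks at whatever bytes the
    stack happened to contain in the padding.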

    Fixes: 648700f76b03 ("inet: frags: use rhashtables for reassembly units")
    Signed-off-by: Alexander Aring
    Reported-by: Stefan Schmidt
    Signed-off-by: Stefan Schmidt

    Alexander Aring
     

05 Apr, 2018

1 commit

  • Giving an integer to proc_doulongvec_minmax() is dangerous on 64bit arches,
    since the linker might place a nonzero value next to it, preventing a
    change to ip6frag_low_thresh.

    ip6frag_low_thresh is not used anymore in the kernel, but we do not
    want to prematurely break user scripts wanting to change it.

    Since specifying a minimal value of 0 for proc_doulongvec_minmax()
    is moot, let's remove these zero values in all defrag units.
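
    To see why an int minimum is dangerous, consider what a handler that
    expects an unsigned long reads on a 64-bit arch. The sketch below is a
    userspace model with hypothetical names: two adjacent 32-bit fields
    stand in for the int zero and for whatever object the linker placed
    after it.

```c
#include <stdint.h>
#include <string.h>

/* 'zero' is the intended minimum; 'neighbour' is unrelated adjacent data. */
struct sketch_layout {
	int32_t zero;
	int32_t neighbour;
};

_Static_assert(sizeof(struct sketch_layout) == sizeof(uint64_t),
	       "sketch assumes the two fields are adjacent, without padding");

/* Read 64 bits starting at 'zero', as a 64-bit long-based handler would. */
static uint64_t sketch_read_min_as_ulong(const struct sketch_layout *l)
{
	uint64_t min;

	memcpy(&min, l, sizeof(min));
	return min;
}
```

    When the neighbour is nonzero, the minimum seen by the handler is no
    longer 0, so a write of a small value can be refused even though the
    int itself is zero.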

    Fixes: 6e00f7dd5e4e ("ipv6: frags: fix /proc/sys/net/ipv6/ip6frag_low_thresh")
    Signed-off-by: Eric Dumazet
    Reported-by: Maciej Żenczykowski
    Signed-off-by: David S. Miller

    Eric Dumazet
     

01 Apr, 2018

6 commits

  • Some users are willing to provision huge amounts of memory to be able
    to perform reassembly reasonably well under pressure.

    Current memory tracking is using one atomic_t and integers.

    Switch to atomic_long_t so that 64bit arches can use more than 2GB,
    without any cost for 32bit arches.

    Note that this patch avoids an overflow error, if high_thresh was set
    to ~2GB, since this test in inet_frag_alloc() was never true :

    if (... || frag_mem_limit(nf) > nf->high_thresh)
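
    The reason the test could never be true is plain integer range. A
    small sketch, with hypothetical helper names and fixed-width types
    standing in for the int and long counters:

```c
#include <stdbool.h>
#include <stdint.h>

/* 32-bit accounting: no value of mem can exceed a threshold of INT32_MAX. */
static bool over_limit_32(int32_t mem, int32_t high_thresh)
{
	return mem > high_thresh;
}

/* 64-bit accounting: thresholds above 2GB become meaningful. */
static bool over_limit_64(int64_t mem, int64_t high_thresh)
{
	return mem > high_thresh;
}
```

    With the values from the test run below, over_limit_64(16000002880,
    16000000000) is true, whereas over_limit_32() stays false for every
    possible counter value once the threshold is INT32_MAX.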

    Tested:

    $ echo 16000000000 >/proc/sys/net/ipv4/ipfrag_high_thresh

    $ grep FRAG /proc/net/sockstat
    FRAG: inuse 14705885 memory 16000002880

    $ nstat -n ; sleep 1 ; nstat | grep Reas
    IpReasmReqds 3317150 0.0
    IpReasmFails 3317112 0.0

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     
  • This function is obsolete, after rhashtable addition to inet defrag.

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     
  • Some applications still rely on IP fragmentation, and to be fair linux
    reassembly unit is not working under any serious load.

    It uses static hash tables of 1024 buckets, and up to 128 items per bucket (!!!)

    A work queue is supposed to garbage collect items when the host is under
    memory pressure, and to do a hash rebuild, changing the seed used in
    hash computations.

    This work queue blocks softirqs for up to 25 ms when doing a hash rebuild,
    occurring every 5 seconds if host is under fire.

    Then there is the problem of sharing this hash table for all netns.

    It is time to switch to rhashtables, and allocate one of them per netns
    to speedup netns dismantle, since this is a critical metric these days.

    Lookup is now using RCU. A followup patch will even remove
    the refcount hold/release left from prior implementation and save
    a couple of atomic operations.

    Before this patch, 16 cpus (16 RX queue NIC) could not handle more
    than 1 Mpps frags DDOS.

    After the patch, I reach 9 Mpps without any tuning, and can use up to 2GB
    of storage for the fragments (exact number depends on frags being evicted
    after timeout)

    $ grep FRAG /proc/net/sockstat
    FRAG: inuse 1966916 memory 2140004608

    A followup patch will change the limits for 64bit arches.

    Signed-off-by: Eric Dumazet
    Cc: Kirill Tkhai
    Cc: Herbert Xu
    Cc: Florian Westphal
    Cc: Jesper Dangaard Brouer
    Cc: Alexander Aring
    Cc: Stefan Schmidt
    Signed-off-by: David S. Miller

    Eric Dumazet
     
  • We want to call lowpan_net_frag_init() earlier.
    Similar to commit "inet: frags: refactor ipv6_frag_init()"

    This is a prereq to "inet: frags: use rhashtables for reassembly units"

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     
  • In order to simplify the API, add a pointer to struct inet_frags.
    This will allow us to make things less complex.

    These functions no longer have a struct inet_frags parameter :

    inet_frag_destroy(struct inet_frag_queue *q /*, struct inet_frags *f */)
    inet_frag_put(struct inet_frag_queue *q /*, struct inet_frags *f */)
    inet_frag_kill(struct inet_frag_queue *q /*, struct inet_frags *f */)
    inet_frags_exit_net(struct netns_frags *nf /*, struct inet_frags *f */)
    ip6_expire_frag_queue(struct net *net, struct frag_queue *fq)
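
    The shape of this simplification can be sketched as follows. The
    sketch assumes the new pointer is stored in the per-netns structure
    and reached through the queue; all sketch_ names are hypothetical
    stand-ins, not the kernel definitions.

```c
#include <stddef.h>

struct sketch_queue;

/* Per-protocol operations (the role of struct inet_frags). */
struct sketch_frags {
	void (*destructor)(struct sketch_queue *q);
};

/* Per-netns state (the role of struct netns_frags) with the new pointer. */
struct sketch_netns_frags {
	struct sketch_frags *f;
};

struct sketch_queue {
	struct sketch_netns_frags *net;
	int destroyed;
};

/* No 'f' parameter any more: it is found through the queue itself. */
static void sketch_frag_destroy(struct sketch_queue *q)
{
	struct sketch_frags *f = q->net->f;

	if (f->destructor != NULL)
		f->destructor(q);
}

/* Example destructor used to observe that the callback was reached. */
static void sketch_mark_destroyed(struct sketch_queue *q)
{
	q->destroyed = 1;
}
```

    Callers then only pass the queue, which removes the risk of handing in
    an inet_frags that does not match the queue.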

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     
  • We will soon initialize one rhashtable per struct netns_frags
    in inet_frags_init_net().

    This patch changes the return value to eventually propagate an
    error.

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

28 Mar, 2018

1 commit


23 Mar, 2018

1 commit

  • Fun set of conflict resolutions here...

    For the mac80211 stuff, these were fortunately just parallel
    adds. Trivially resolved.

    In drivers/net/phy/phy.c we had a bug fix in 'net' that moved the
    function phy_disable_interrupts() earlier in the file, whilst in
    'net-next' the phy_error() call from this function was removed.

    In net/ipv4/xfrm4_policy.c, David Ahern's changes to remove the
    'rt_table_id' member of rtable collided with a bug fix in 'net' that
    added a new struct member "rt_mtu_locked" which needs to be copied
    over here.

    The mlxsw driver conflict consisted of net-next separating
    the span code and definitions into separate files, whilst
    a 'net' bug fix made some changes to that moved code.

    The mlx5 infiniband conflict resolution was quite non-trivial,
    the RDMA tree's merge commit was used as a guide here, and
    here are their notes:

    ====================

    Due to bug fixes found by the syzkaller bot and taken into the for-rc
    branch after development for the 4.17 merge window had already started
    being taken into the for-next branch, there were fairly non-trivial
    merge issues that would need to be resolved between the for-rc branch
    and the for-next branch. This merge resolves those conflicts and
    provides a unified base upon which ongoing development for 4.17 can
    be based.

    Conflicts:
    drivers/infiniband/hw/mlx5/main.c - Commit 42cea83f9524
    (IB/mlx5: Fix cleanup order on unload) added to for-rc and
    commit b5ca15ad7e61 (IB/mlx5: Add proper representors support)
    added as part of the devel cycle both needed to modify the
    init/de-init functions used by mlx5. To support the new
    representors, the new functions added by the cleanup patch
    needed to be made non-static, and the init/de-init list
    added by the representors patch needed to be modified to
    match the init/de-init list changes made by the cleanup
    patch.
    Updates:
    drivers/infiniband/hw/mlx5/mlx5_ib.h - Update function
    prototypes added by representors patch to reflect new function
    names as changed by cleanup patch
    drivers/infiniband/hw/mlx5/ib_rep.c - Update init/de-init
    stage list to match new order from cleanup patch
    ====================

    Signed-off-by: David S. Miller

    David S. Miller
     

22 Mar, 2018

1 commit

  • These pernet_operations register and unregister sysctl.
    Also, there is inet_frags_exit_net() called in exit method,
    which has to be safe after a560002437d3 "net: Fix hlist
    corruptions in inet_evict_bucket()".

    Signed-off-by: Kirill Tkhai
    Signed-off-by: David S. Miller

    Kirill Tkhai