10 Nov, 2020

1 commit

  • Due to the legacy usage of hard_header_len for SIT tunnels while
    already using infrastructure from net/ipv4/ip_tunnel.c the
    calculation of the path MTU in tnl_update_pmtu is incorrect.
    This leads to unnecessary creation of MTU exceptions for any
    flow going over a SIT tunnel.

    As SIT tunnels do not have a header themselves other than their
    transport (L3, L2) headers we're leaving hard_header_len set to zero
    as tnl_update_pmtu is already taking care of the transport headers
    sizes.

    This will also help avoid unnecessary IPv6 GC runs and the spinlock
    contention seen when using SIT tunnels with more than
    net.ipv6.route.gc_thresh flows.
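
    The effect of counting a header length twice can be sketched with plain
    arithmetic. This is a simplified userspace model, not the kernel's
    tnl_update_pmtu: the helper name tnl_payload_mtu, the 20-byte outer IPv4
    header and the sample hard_header_len of 20 are assumptions made for
    illustration only.

```c
/* Hypothetical, simplified model: MTU left for the inner packet once the
 * outer header and any per-device header lengths are subtracted. */
int tnl_payload_mtu(int path_mtu, int hard_header_len, int tunnel_hlen)
{
    const int outer_iphdr_len = 20; /* assumed: IPv4 header without options */

    return path_mtu - outer_iphdr_len - hard_header_len - tunnel_hlen;
}
```

    With a 1500-byte path MTU, a hard_header_len of zero leaves 1480 bytes,
    while a nonzero value shrinks the result further and makes ordinary
    packets look too large.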

    Fixes: c54419321455 ("GRE: Refactor GRE tunneling code.")
    Signed-off-by: Oliver Herms
    Acked-by: Willem de Bruijn
    Link: https://lore.kernel.org/r/20201103104133.GA1573211@tws
    Signed-off-by: Jakub Kicinski

    Oliver Herms
     

01 Jul, 2020

1 commit

  • Sit uses skb->protocol to determine packet type, and bails out if it's
    not set. For AF_PACKET injection, we need to support its call chain of:

    packet_sendmsg -> packet_snd -> packet_parse_headers ->
    dev_parse_header_protocol -> parse_protocol

    Without a valid parse_protocol, this returns zero, and sit rejects the
    skb. So, this wires up the ip_tunnel handler for layer 3 packets for
    that case.
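
    For a layer 3 device the protocol can be guessed from the version nibble
    of the first byte. The following is a userspace sketch of that idea, not
    the kernel handler itself; the function name l3_parse_protocol and the
    PROTO_* constants are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

#define PROTO_IPV4 0x0800u /* EtherType for IPv4 */
#define PROTO_IPV6 0x86DDu /* EtherType for IPv6 */

/* Userspace illustration: derive the layer 3 protocol of a raw packet from
 * its version nibble. Returns 0 when the buffer is too short or the version
 * is unknown, which is the value that makes the sender reject the packet. */
uint16_t l3_parse_protocol(const uint8_t *pkt, size_t len)
{
    if (pkt == NULL || len == 0)
        return 0;

    switch (pkt[0] >> 4) {
    case 4:
        return len >= 20 ? PROTO_IPV4 : 0; /* minimal IPv4 header */
    case 6:
        return len >= 40 ? PROTO_IPV6 : 0; /* fixed IPv6 header */
    default:
        return 0;
    }
}
```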

    Reported-by: Willem de Bruijn
    Signed-off-by: Jason A. Donenfeld
    Signed-off-by: David S. Miller

    Jason A. Donenfeld
     

20 May, 2020

2 commits


25 Dec, 2019

1 commit

  • When we do an IPv6 tunnel PMTU update and call __ip6_rt_update_pmtu() in
    the end, we should not call dst_confirm_neigh() as there is no two-way
    communication.

    v5: No change.
    v4: No change.
    v3: Do not remove dst_confirm_neigh, but add a new bool parameter in
    dst_ops.update_pmtu to control whether we should do neighbor confirm.
    Also split the big patch to small ones for each area.
    v2: Remove dst_confirm_neigh in __ip6_rt_update_pmtu.

    Reviewed-by: Guillaume Nault
    Acked-by: David Ahern
    Signed-off-by: Hangbin Liu
    Signed-off-by: David S. Miller

    Hangbin Liu
     

15 Jul, 2019

1 commit


31 May, 2019

1 commit

  • Based on 1 normalized pattern(s):

    this program is free software you can redistribute it and or modify
    it under the terms of the gnu general public license as published by
    the free software foundation either version 2 of the license or at
    your option any later version

    extracted by the scancode license scanner the SPDX license identifier

    GPL-2.0-or-later

    has been chosen to replace the boilerplate/reference in 3029 file(s).

    Signed-off-by: Thomas Gleixner
    Reviewed-by: Allison Randal
    Cc: linux-spdx@vger.kernel.org
    Link: https://lkml.kernel.org/r/20190527070032.746973796@linutronix.de
    Signed-off-by: Greg Kroah-Hartman

    Thomas Gleixner
     

08 May, 2019

1 commit

  • A VRF netdev's MTU typically isn't set, so it has an MTU of 65536. When
    the link of a tunnel is set, the tunnel MTU is changed from 1480 to the
    link MTU minus the tunnel header. In the case where a VRF netdev is the
    link, the tunnel MTU becomes 65516. So, fix it by not setting the tunnel
    MTU in this case.
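
    The numbers above follow from 65536 - 20 = 65516. A minimal sketch of the
    decision, assuming a 20-byte tunnel header; sit_mtu_on_bind and its
    parameters are hypothetical names, not the kernel function:

```c
#include <stdbool.h>

/* Hypothetical helper: choose the tunnel MTU when a link device is bound.
 * link_is_l3_master models "the link is a VRF device". */
int sit_mtu_on_bind(int current_mtu, int link_mtu, int t_hlen,
                    bool link_is_l3_master)
{
    if (link_is_l3_master)
        return current_mtu; /* keep e.g. 1480 instead of 65516 */
    return link_mtu - t_hlen;
}
```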

    Signed-off-by: Stephen Suryaputra
    Reviewed-by: David Ahern
    Signed-off-by: David S. Miller

    Stephen Suryaputra
     

05 Apr, 2019

1 commit

  • ipip6 tunnels run iptunnel_pull_header on received skbs. This can
    cause the following use-after-free when accessing the iph pointer, since
    the packet will be 'uncloned' by running pskb_expand_head if it is a
    cloned gso skb (e.g. if the packet has been sent through a veth device).

    [ 706.369655] BUG: KASAN: use-after-free in ipip6_rcv+0x1678/0x16e0 [sit]
    [ 706.449056] Read of size 1 at addr ffffe01b6bd855f5 by task ksoftirqd/1/=
    [ 706.669494] Hardware name: HPE ProLiant m400 Server/ProLiant m400 Server, BIOS U02 08/19/2016
    [ 706.771839] Call trace:
    [ 706.801159] dump_backtrace+0x0/0x2f8
    [ 706.845079] show_stack+0x24/0x30
    [ 706.884833] dump_stack+0xe0/0x11c
    [ 706.925629] print_address_description+0x68/0x260
    [ 706.982070] kasan_report+0x178/0x340
    [ 707.025995] __asan_report_load1_noabort+0x30/0x40
    [ 707.083481] ipip6_rcv+0x1678/0x16e0 [sit]
    [ 707.132623] tunnel64_rcv+0xd4/0x200 [tunnel4]
    [ 707.185940] ip_local_deliver_finish+0x3b8/0x988
    [ 707.241338] ip_local_deliver+0x144/0x470
    [ 707.289436] ip_rcv_finish+0x43c/0x14b0
    [ 707.335447] ip_rcv+0x628/0x1138
    [ 707.374151] __netif_receive_skb_core+0x1670/0x2600
    [ 707.432680] __netif_receive_skb+0x28/0x190
    [ 707.482859] process_backlog+0x1d0/0x610
    [ 707.529913] net_rx_action+0x37c/0xf68
    [ 707.574882] __do_softirq+0x288/0x1018
    [ 707.619852] run_ksoftirqd+0x70/0xa8
    [ 707.662734] smpboot_thread_fn+0x3a4/0x9e8
    [ 707.711875] kthread+0x2c8/0x350
    [ 707.750583] ret_from_fork+0x10/0x18

    [ 707.811302] Allocated by task 16982:
    [ 707.854182] kasan_kmalloc.part.1+0x40/0x108
    [ 707.905405] kasan_kmalloc+0xb4/0xc8
    [ 707.948291] kasan_slab_alloc+0x14/0x20
    [ 707.994309] __kmalloc_node_track_caller+0x158/0x5e0
    [ 708.053902] __kmalloc_reserve.isra.8+0x54/0xe0
    [ 708.108280] __alloc_skb+0xd8/0x400
    [ 708.150139] sk_stream_alloc_skb+0xa4/0x638
    [ 708.200346] tcp_sendmsg_locked+0x818/0x2b90
    [ 708.251581] tcp_sendmsg+0x40/0x60
    [ 708.292376] inet_sendmsg+0xf0/0x520
    [ 708.335259] sock_sendmsg+0xac/0xf8
    [ 708.377096] sock_write_iter+0x1c0/0x2c0
    [ 708.424154] new_sync_write+0x358/0x4a8
    [ 708.470162] __vfs_write+0xc4/0xf8
    [ 708.510950] vfs_write+0x12c/0x3d0
    [ 708.551739] ksys_write+0xcc/0x178
    [ 708.592533] __arm64_sys_write+0x70/0xa0
    [ 708.639593] el0_svc_handler+0x13c/0x298
    [ 708.686646] el0_svc+0x8/0xc

    [ 708.739019] Freed by task 17:
    [ 708.774597] __kasan_slab_free+0x114/0x228
    [ 708.823736] kasan_slab_free+0x10/0x18
    [ 708.868703] kfree+0x100/0x3d8
    [ 708.905320] skb_free_head+0x7c/0x98
    [ 708.948204] skb_release_data+0x320/0x490
    [ 708.996301] pskb_expand_head+0x60c/0x970
    [ 709.044399] __iptunnel_pull_header+0x3b8/0x5d0
    [ 709.098770] ipip6_rcv+0x41c/0x16e0 [sit]
    [ 709.146873] tunnel64_rcv+0xd4/0x200 [tunnel4]
    [ 709.200195] ip_local_deliver_finish+0x3b8/0x988
    [ 709.255596] ip_local_deliver+0x144/0x470
    [ 709.303692] ip_rcv_finish+0x43c/0x14b0
    [ 709.349705] ip_rcv+0x628/0x1138
    [ 709.388413] __netif_receive_skb_core+0x1670/0x2600
    [ 709.446943] __netif_receive_skb+0x28/0x190
    [ 709.497120] process_backlog+0x1d0/0x610
    [ 709.544169] net_rx_action+0x37c/0xf68
    [ 709.589131] __do_softirq+0x288/0x1018

    [ 709.651938] The buggy address belongs to the object at ffffe01b6bd85580
    which belongs to the cache kmalloc-1024 of size 1024
    [ 709.804356] The buggy address is located 117 bytes inside of
    1024-byte region [ffffe01b6bd85580, ffffe01b6bd85980)
    [ 709.946340] The buggy address belongs to the page:
    [ 710.003824] page:ffff7ff806daf600 count:1 mapcount:0 mapping:ffffe01c4001f600 index:0x0
    [ 710.099914] flags: 0xfffff8000000100(slab)
    [ 710.149059] raw: 0fffff8000000100 dead000000000100 dead000000000200 ffffe01c4001f600
    [ 710.242011] raw: 0000000000000000 0000000000380038 00000001ffffffff 0000000000000000
    [ 710.334966] page dumped because: kasan: bad access detected

    Fix it by resetting the iph pointer after iptunnel_pull_header.
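
    The general pattern, a cached pointer going stale once the data area is
    reallocated, can be sketched in userspace. The struct pkt type and both
    functions below are hypothetical stand-ins and do not mirror the skb API.

```c
#include <stdlib.h>
#include <string.h>

struct pkt {
    unsigned char *head; /* start of the packet data */
    size_t len;
};

/* Stand-in for a helper that may reallocate the data area: it always moves
 * the data to a fresh allocation and frees the old one. */
int pkt_unclone(struct pkt *p)
{
    unsigned char *copy = malloc(p->len);

    if (copy == NULL)
        return -1;
    memcpy(copy, p->head, p->len);
    free(p->head);
    p->head = copy;
    return 0;
}

/* Read the first header byte after a call that may move the data. */
int pkt_first_byte_after_unclone(struct pkt *p)
{
    const unsigned char *hdr = p->head; /* cached before the call */

    if (pkt_unclone(p) != 0)
        return -1;
    hdr = p->head; /* reload: the old value now points at freed memory */
    return hdr[0];
}
```

    Dropping the reload line would make hdr[0] read from the freed buffer,
    which is the kind of access KASAN reports.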

    Fixes: a09a4c8dd1ec ("tunnels: Remove encapsulation offloads on decap")
    Tested-by: Jianlin Shi
    Signed-off-by: Lorenzo Bianconi
    Signed-off-by: David S. Miller

    Lorenzo Bianconi
     

12 Mar, 2019

1 commit

  • In function check_6rd, tunnel->ip6rd.relay_prefixlen may be equal to
    32, so UBSAN complains about it.
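
    Shifting a 32-bit value by 32 is undefined in C. One way to guard such a
    shift is sketched below; shr32 is a hypothetical helper name and this is
    not necessarily the exact change made in sit.c.

```c
#include <stdint.h>

/* Right shift that is well defined for every count: shifting a 32-bit value
 * by 32 or more is undefined in C, so return 0 explicitly in that case. */
uint32_t shr32(uint32_t value, unsigned int count)
{
    return count < 32 ? value >> count : 0;
}
```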

    UBSAN: Undefined behaviour in net/ipv6/sit.c:781:47
    shift exponent 32 is too large for 32-bit type 'unsigned int'
    CPU: 6 PID: 20036 Comm: syz-executor.0 Not tainted 4.19.27 #2
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1ubuntu1
    04/01/2014
    Call Trace:
    __dump_stack lib/dump_stack.c:77 [inline]
    dump_stack+0xca/0x13e lib/dump_stack.c:113
    ubsan_epilogue+0xe/0x81 lib/ubsan.c:159
    __ubsan_handle_shift_out_of_bounds+0x293/0x2e8 lib/ubsan.c:425
    check_6rd.constprop.9+0x433/0x4e0 net/ipv6/sit.c:781
    try_6rd net/ipv6/sit.c:806 [inline]
    ipip6_tunnel_xmit net/ipv6/sit.c:866 [inline]
    sit_tunnel_xmit+0x141c/0x2720 net/ipv6/sit.c:1033
    __netdev_start_xmit include/linux/netdevice.h:4300 [inline]
    netdev_start_xmit include/linux/netdevice.h:4309 [inline]
    xmit_one net/core/dev.c:3243 [inline]
    dev_hard_start_xmit+0x17c/0x780 net/core/dev.c:3259
    __dev_queue_xmit+0x1656/0x2500 net/core/dev.c:3829
    neigh_output include/net/neighbour.h:501 [inline]
    ip6_finish_output2+0xa36/0x2290 net/ipv6/ip6_output.c:120
    ip6_finish_output+0x3e7/0xa20 net/ipv6/ip6_output.c:154
    NF_HOOK_COND include/linux/netfilter.h:278 [inline]
    ip6_output+0x1e2/0x720 net/ipv6/ip6_output.c:171
    dst_output include/net/dst.h:444 [inline]
    ip6_local_out+0x99/0x170 net/ipv6/output_core.c:176
    ip6_send_skb+0x9d/0x2f0 net/ipv6/ip6_output.c:1697
    ip6_push_pending_frames+0xc0/0x100 net/ipv6/ip6_output.c:1717
    rawv6_push_pending_frames net/ipv6/raw.c:616 [inline]
    rawv6_sendmsg+0x2435/0x3530 net/ipv6/raw.c:946
    inet_sendmsg+0xf8/0x5c0 net/ipv4/af_inet.c:798
    sock_sendmsg_nosec net/socket.c:621 [inline]
    sock_sendmsg+0xc8/0x110 net/socket.c:631
    ___sys_sendmsg+0x6cf/0x890 net/socket.c:2114
    __sys_sendmsg+0xf0/0x1b0 net/socket.c:2152
    do_syscall_64+0xc8/0x580 arch/x86/entry/common.c:290
    entry_SYSCALL_64_after_hwframe+0x49/0xbe

    Signed-off-by: linmiaohe
    Signed-off-by: David S. Miller

    Miaohe Lin
     

02 Mar, 2019

1 commit

  • If register_netdev() fails to register sitn->fb_tunnel_dev, it will
    go to err_reg_dev and forget to free the netdev (sitn->fb_tunnel_dev).
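
    The shape of such an error path can be sketched in userspace. Everything
    here (struct fake_dev, fake_alloc, fake_register, fake_free,
    init_net_sketch, live_devs) is a hypothetical stand-in; only the label
    name err_reg_dev is taken from the description above.

```c
#include <stdlib.h>

struct fake_dev { int registered; };

int live_devs; /* number of allocated, not yet freed devices */

struct fake_dev *fake_alloc(void)
{
    struct fake_dev *dev = calloc(1, sizeof(*dev));

    if (dev != NULL)
        live_devs++;
    return dev;
}

void fake_free(struct fake_dev *dev)
{
    if (dev != NULL) {
        free(dev);
        live_devs--;
    }
}

/* Stand-in for a registration call that can be made to fail. */
int fake_register(struct fake_dev *dev, int should_fail)
{
    if (should_fail)
        return -1;
    dev->registered = 1;
    return 0;
}

int init_net_sketch(struct fake_dev **out, int register_fails)
{
    struct fake_dev *dev = fake_alloc();
    int err;

    if (dev == NULL)
        return -1;
    err = fake_register(dev, register_fails);
    if (err)
        goto err_reg_dev;
    *out = dev;
    return 0;

err_reg_dev:
    fake_free(dev); /* the step a leaking error path would skip */
    return err;
}
```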

    BUG: memory leak
    unreferenced object 0xffff888378daad00 (size 512):
    comm "syz-executor.1", pid 4006, jiffies 4295121142 (age 16.115s)
    hex dump (first 32 bytes):
    00 e6 ed c0 83 88 ff ff 00 00 00 00 00 00 00 00 ................
    00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
    backtrace:
    [] kvmalloc include/linux/mm.h:577 [inline]
    [] kvzalloc include/linux/mm.h:585 [inline]
    [] netif_alloc_netdev_queues net/core/dev.c:8380 [inline]
    [] alloc_netdev_mqs+0x600/0xcc0 net/core/dev.c:8970
    [] sit_init_net+0x295/0xa40 net/ipv6/sit.c:1848
    [] ops_init+0xad/0x3e0 net/core/net_namespace.c:129
    [] setup_net+0x2ba/0x690 net/core/net_namespace.c:314
    [] copy_net_ns+0x1dc/0x330 net/core/net_namespace.c:437
    [] create_new_namespaces+0x382/0x730 kernel/nsproxy.c:107
    [] copy_namespaces+0x2ed/0x3d0 kernel/nsproxy.c:165
    [] copy_process.part.27+0x231e/0x6db0 kernel/fork.c:1919
    [] copy_process kernel/fork.c:1713 [inline]
    [] _do_fork+0x1bc/0xe90 kernel/fork.c:2224
    [] do_syscall_64+0xc8/0x580 arch/x86/entry/common.c:290
    [] entry_SYSCALL_64_after_hwframe+0x49/0xbe
    [] 0xffffffffffffffff

    Signed-off-by: Mao Wenan
    Signed-off-by: David S. Miller

    Mao Wenan
     

08 Feb, 2019

1 commit

  • If we disabled IPv6 from the kernel command line (ipv6.disable=1), we should
    not call ip6_err_gen_icmpv6_unreach(). This:

    ip link add sit1 type sit local 192.0.2.1 remote 192.0.2.2 ttl 1
    ip link set sit1 up
    ip addr add 198.51.100.1/24 dev sit1
    ping 198.51.100.2

    if IPv6 is disabled at boot time, will crash the kernel.

    v2: there's no need to use in6_dev_get(), use __in6_dev_get() instead,
    as we only need to check that idev exists and we are under
    rcu_read_lock() (from netif_receive_skb_internal()).

    Reported-by: Jianlin Shi
    Fixes: ca15a078bd90 ("sit: generate icmpv6 error when receiving icmpv4 error")
    Cc: Oussama Ghorbel
    Signed-off-by: Hangbin Liu
    Reviewed-by: Stefano Brivio
    Signed-off-by: David S. Miller

    Hangbin Liu
     

02 Jan, 2019

1 commit

  • KMSAN detected read beyond end of buffer in vti and sit devices when
    passing truncated packets with PF_PACKET. The issue affects additional
    ip tunnel devices.

    Extend commit 76c0ddd8c3a6 ("ip6_tunnel: be careful when accessing the
    inner header") and commit ccfec9e5cb2d ("ip_tunnel: be careful when
    accessing the inner header").

    Move the check to a separate helper and call at the start of each
    ndo_start_xmit function in net/ipv4 and net/ipv6.

    Minor changes:
    - convert dev_kfree_skb to kfree_skb on error path,
    as dev_kfree_skb calls consume_skb which is not for error paths.
    - use pskb_network_may_pull even though that is pedantic here,
    as the same as pskb_may_pull for devices without llheaders.
    - do not cache ipv6 hdrs if used only once
    (unsafe across pskb_may_pull, was more relevant to earlier patch)

    Reported-by: syzbot
    Signed-off-by: Willem de Bruijn
    Signed-off-by: David S. Miller

    Willem de Bruijn
     

27 Sep, 2018

2 commits


02 Jun, 2018

1 commit

  • I don't know where this value comes from (probably a copy and paste and
    paste and paste ...).
    Let's use standard values which are a bit greater.

    Link: https://git.kernel.org/pub/scm/linux/kernel/git/davem/netdev-vger-cvs.git/commit/?id=e5afd356a411a
    Signed-off-by: Nicolas Dichtel
    Signed-off-by: David S. Miller

    Nicolas Dichtel
     

06 Apr, 2018

1 commit

  • Use dev_valid_name() to make sure user does not provide illegal
    device name.
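
    A userspace sketch of this kind of name check follows. It is not the
    kernel's dev_valid_name(); name_is_valid and NAME_BUF_SIZE are
    hypothetical, with NAME_BUF_SIZE assumed to mirror the 16-byte IFNAMSIZ.

```c
#include <ctype.h>
#include <stdbool.h>
#include <string.h>

#define NAME_BUF_SIZE 16 /* assumed to mirror IFNAMSIZ */

/* Accept a name only if it is non-empty, fits the buffer together with its
 * terminating NUL, is not "." or "..", and has no '/', ':' or whitespace. */
bool name_is_valid(const char *name)
{
    size_t len = 0;

    if (name[0] == '\0')
        return false;
    if (strcmp(name, ".") == 0 || strcmp(name, "..") == 0)
        return false;
    for (; name[len] != '\0'; len++) {
        if (len + 1 >= NAME_BUF_SIZE)
            return false; /* would not fit together with the NUL */
        if (name[len] == '/' || name[len] == ':' ||
            isspace((unsigned char)name[len]))
            return false;
    }
    return true;
}
```

    Rejecting over-long names before copying is what keeps a fixed-size
    destination buffer from being overrun.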

    syzbot caught the following bug :

    BUG: KASAN: stack-out-of-bounds in strlcpy include/linux/string.h:300 [inline]
    BUG: KASAN: stack-out-of-bounds in ipip6_tunnel_locate+0x63b/0xaa0 net/ipv6/sit.c:254
    Write of size 33 at addr ffff8801b64076d8 by task syzkaller932654/4453

    CPU: 0 PID: 4453 Comm: syzkaller932654 Not tainted 4.16.0+ #1
    Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
    Call Trace:
    __dump_stack lib/dump_stack.c:17 [inline]
    dump_stack+0x1b9/0x29f lib/dump_stack.c:53
    print_address_description+0x6c/0x20b mm/kasan/report.c:256
    kasan_report_error mm/kasan/report.c:354 [inline]
    kasan_report.cold.7+0xac/0x2f5 mm/kasan/report.c:412
    check_memory_region_inline mm/kasan/kasan.c:260 [inline]
    check_memory_region+0x13e/0x1b0 mm/kasan/kasan.c:267
    memcpy+0x37/0x50 mm/kasan/kasan.c:303
    strlcpy include/linux/string.h:300 [inline]
    ipip6_tunnel_locate+0x63b/0xaa0 net/ipv6/sit.c:254
    ipip6_tunnel_ioctl+0xe71/0x241b net/ipv6/sit.c:1221
    dev_ifsioc+0x43e/0xb90 net/core/dev_ioctl.c:334
    dev_ioctl+0x69a/0xcc0 net/core/dev_ioctl.c:525
    sock_ioctl+0x47e/0x680 net/socket.c:1015
    vfs_ioctl fs/ioctl.c:46 [inline]
    file_ioctl fs/ioctl.c:500 [inline]
    do_vfs_ioctl+0x1cf/0x1650 fs/ioctl.c:684
    ksys_ioctl+0xa9/0xd0 fs/ioctl.c:701
    SYSC_ioctl fs/ioctl.c:708 [inline]
    SyS_ioctl+0x24/0x30 fs/ioctl.c:706
    do_syscall_64+0x29e/0x9d0 arch/x86/entry/common.c:287
    entry_SYSCALL_64_after_hwframe+0x42/0xb7

    Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
    Signed-off-by: Eric Dumazet
    Reported-by: syzbot
    Signed-off-by: David S. Miller

    Eric Dumazet
     

28 Mar, 2018

1 commit


10 Mar, 2018

1 commit

  • fallback tunnels (like tunl0, gre0, gretap0, erspan0, sit0,
    ip6tnl0, ip6gre0) are automatically created when the corresponding
    module is loaded.

    These tunnels are also automatically created when a new network
    namespace is created, at a great cost.

    In many cases, netns are used for isolation purposes, and these
    extra network devices are a waste of resources. We are using
    thousands of netns per host, and hit the netns creation/delete
    bottleneck a lot. (Many thanks to Kirill for recent work on this)

    Add a new sysctl so that we can opt-out from this automatic creation.

    Note that these tunnels are still created for the initial namespace,
    to be the least intrusive for typical setups.

    Tested:
    lpk43:~# cat add_del_unshare.sh
    for i in `seq 1 40`
    do
    (for j in `seq 1 100` ; do unshare -n /bin/true >/dev/null ; done) &
    done
    wait

    lpk43:~# echo 0 >/proc/sys/net/core/fb_tunnels_only_for_init_net
    lpk43:~# time ./add_del_unshare.sh

    real 0m37.521s
    user 0m0.886s
    sys 7m7.084s
    lpk43:~# echo 1 >/proc/sys/net/core/fb_tunnels_only_for_init_net
    lpk43:~# time ./add_del_unshare.sh

    real 0m4.761s
    user 0m0.851s
    sys 1m8.343s
    lpk43:~#

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

06 Mar, 2018

1 commit


28 Feb, 2018

2 commits

  • Commit 128bb975dc3c ("ip6_gre: init dev->mtu and dev->hard_header_len
    correctly") fixed IFLA_MTU ignored on NEWLINK for ip6_gre. The same
    mtu fix is also needed for sit.

    Note that dev->hard_header_len setting for sit works fine, no need to
    fix it. sit is actually ipv4 tunnel, it can't call ip6_tnl_change_mtu
    to set mtu.

    Reported-by: Jianlin Shi
    Signed-off-by: Xin Long
    Signed-off-by: David S. Miller

    Xin Long
     
  • These pernet_operations are similar to ip6_tnl_net_ops. The exit method
    unregisters all net sit devices, and it looks like other
    pernet_operations are not interested in the foreign net sit list.
    The init method registers a netdevice. So, it's possible to mark them async.

    Signed-off-by: Kirill Tkhai
    Signed-off-by: David S. Miller

    Kirill Tkhai
     

23 Feb, 2018

1 commit

  • gcc-8 has a new warning that detects overlapping input and output arguments
    in memcpy(). It triggers for sit_init_net() calling ipip6_tunnel_clone_6rd(),
    which is actually correct:

    net/ipv6/sit.c: In function 'sit_init_net':
    net/ipv6/sit.c:192:3: error: 'memcpy' source argument is the same as destination [-Werror=restrict]

    The problem here is that the logic detecting the memcpy() arguments finds them
    to be the same, but the conditional that tests for the input and output of
    ipip6_tunnel_clone_6rd() to be identical is not a compile-time constant.

    We know that netdev_priv(t->dev) is the same as t for a tunnel device,
    and comparing "dev" directly here lets the compiler figure out as well
    that 'dev == sitn->fb_tunnel_dev' when called from sit_init_net(), so
    it no longer warns.
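
    The underlying pattern is a clone helper that skips the copy when source
    and destination are the same object. A minimal sketch, with a hypothetical
    struct rd_params and function name that do not match the sit.c types:

```c
#include <string.h>

struct rd_params {
    unsigned int prefixlen;
    unsigned int relay_prefixlen;
};

/* Copy src into dst unless both name the same object; memcpy() is only
 * specified for objects that do not overlap. */
void rd_params_clone(struct rd_params *dst, const struct rd_params *src)
{
    if (dst == src)
        return;
    memcpy(dst, src, sizeof(*dst));
}
```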

    This code is old, so Cc stable to make sure that we don't get the warning
    for older kernels built with new gcc.

    Cc: Martin Sebor
    Link: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=83456
    Signed-off-by: Arnd Bergmann
    Signed-off-by: David S. Miller

    Arnd Bergmann
     

26 Jan, 2018

1 commit

  • Some dst_ops (e.g. md_dst_ops) don't set this handler. It may result in:
    "BUG: unable to handle kernel NULL pointer dereference at (null)"

    Let's add a helper to check if update_pmtu is available before calling it.
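
    Such a helper boils down to a NULL check on a function pointer before the
    call. The sketch below uses hypothetical types and names
    (dst_ops_sketch, dst_sketch, dst_update_pmtu_checked, store_mtu) rather
    than the kernel's structures and signatures.

```c
#include <stddef.h>

struct dst_ops_sketch {
    void (*update_pmtu)(int *cached_mtu, int new_mtu); /* may be NULL */
};

struct dst_sketch {
    const struct dst_ops_sketch *ops;
    int mtu;
};

/* Call the update_pmtu handler only when the ops table provides one. */
void dst_update_pmtu_checked(struct dst_sketch *dst, int new_mtu)
{
    if (dst != NULL && dst->ops != NULL && dst->ops->update_pmtu != NULL)
        dst->ops->update_pmtu(&dst->mtu, new_mtu);
}

/* Example handler: simply store the new value. */
void store_mtu(int *cached_mtu, int new_mtu)
{
    *cached_mtu = new_mtu;
}
```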

    Fixes: 52a589d51f10 ("geneve: update skb dst pmtu on tx path")
    Fixes: a93bf0ff4490 ("vxlan: update skb dst pmtu on tx path")
    CC: Roman Kapl
    CC: Xin Long
    Signed-off-by: Nicolas Dichtel
    Signed-off-by: David S. Miller

    Nicolas Dichtel
     

30 Nov, 2017

1 commit

  • After parsing the sit netlink change info, we forget to update frag_off in
    ipip6_tunnel_update(). Fix it by assigning frag_off with new value.

    Reported-by: Jianlin Shi
    Signed-off-by: Hangbin Liu
    Acked-by: Nicolas Dichtel
    Signed-off-by: David S. Miller

    Hangbin Liu
     

01 Nov, 2017

1 commit

  • Using SIT tunnels with VRFs works fine if the underlay device is in a
    VRF and the link parameter is set to the VRF device. e.g.,

    ip tunnel add jtun mode sit remote local dev myvrf

    Update the device check to allow the link to be the enslaved device as
    well. e.g.,

    ip tunnel add jtun mode sit remote local dev eth4

    where eth4 is enslaved to myvrf.

    Reported-by: Jeff Barnhill
    Signed-off-by: David Ahern
    Signed-off-by: David S. Miller

    David Ahern
     

20 Sep, 2017

1 commit

  • Implement exit_batch() method to dismantle more devices
    per round.

    (rtnl_lock() ...
    unregister_netdevice_many() ...
    rtnl_unlock())

    Tested:
    $ cat add_del_unshare.sh
    for i in `seq 1 40`
    do
    (for j in `seq 1 100` ; do unshare -n /bin/true >/dev/null ; done) &
    done
    wait ; grep net_namespace /proc/slabinfo

    Before patch :
    $ time ./add_del_unshare.sh
    net_namespace 110 267 5504 1 2 : tunables 8 4 0 : slabdata 110 267 0

    real 3m25.292s
    user 0m0.644s
    sys 0m40.153s

    After patch:

    $ time ./add_del_unshare.sh
    net_namespace 126 282 5504 1 2 : tunables 8 4 0 : slabdata 126 282 0

    real 1m38.965s
    user 0m0.688s
    sys 0m37.017s

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

01 Jul, 2017

1 commit


27 Jun, 2017

3 commits


24 Jun, 2017

1 commit


08 Jun, 2017

1 commit

  • Network devices can allocate resources and private memory using
    netdev_ops->ndo_init(). However, the release of these resources
    can occur in one of two different places.

    Either netdev_ops->ndo_uninit() or netdev->destructor().

    The decision of which operation frees the resources depends upon
    whether it is necessary for all netdev refs to be released before it
    is safe to perform the freeing.

    netdev_ops->ndo_uninit() presumably can occur right after the
    NETDEV_UNREGISTER notifier completes and the unicast and multicast
    address lists are flushed.

    netdev->destructor(), on the other hand, does not run until the
    netdev references all go away.

    Further complicating the situation is that netdev->destructor()
    almost universally also does a free_netdev().

    This creates a problem for the logic in register_netdevice().
    Because all callers of register_netdevice() manage the freeing
    of the netdev, and invoke free_netdev(dev) if register_netdevice()
    fails.

    If netdev_ops->ndo_init() succeeds, but something else fails inside
    of register_netdevice(), it does call ndo_ops->ndo_uninit(). But
    it is not able to invoke netdev->destructor().

    This is because netdev->destructor() will do a free_netdev() and
    then the caller of register_netdevice() will do the same.

    However, this means that the resources that would normally be released
    by netdev->destructor() will not be.

    Over the years drivers have added local hacks to deal with this, by
    invoking their destructor parts by hand when register_netdevice()
    fails.

    Many drivers do not try to deal with this, and instead we have leaks.

    Let's close this hole by formalizing the distinction between what
    private things need to be freed up by netdev->destructor() and whether
    the driver needs unregister_netdevice() to perform the free_netdev().

    netdev->priv_destructor() performs all actions to free up the private
    resources that used to be freed by netdev->destructor(), except for
    free_netdev().

    netdev->needs_free_netdev is a boolean that indicates whether
    free_netdev() should be done at the end of unregister_netdevice().

    Now, register_netdevice() can sanely release all resources after
    ndo_ops->ndo_init() succeeds, by invoking both ndo_ops->ndo_uninit()
    and netdev->priv_destructor().

    And at the end of unregister_netdevice(), we invoke
    netdev->priv_destructor() and optionally call free_netdev().
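
    The split between the two new members can be modelled in a few lines of
    userspace C. The struct netdev_sketch type and both functions are
    hypothetical; only the member names priv_destructor and needs_free_netdev
    come from the text above.

```c
#include <stdbool.h>
#include <stdlib.h>

struct netdev_sketch {
    void *priv; /* driver-private resource */
    void (*priv_destructor)(struct netdev_sketch *dev);
    bool needs_free_netdev;
};

/* Frees only the private resource, never the device itself. */
void sketch_priv_destructor(struct netdev_sketch *dev)
{
    free(dev->priv);
    dev->priv = NULL;
}

/* Tail of an unregister path: run the private destructor, then free the
 * device only if the driver asked for it. Returns true if dev was freed. */
bool sketch_unregister_tail(struct netdev_sketch *dev)
{
    if (dev->priv_destructor != NULL)
        dev->priv_destructor(dev);
    if (dev->needs_free_netdev) {
        free(dev);
        return true;
    }
    return false;
}
```

    Because the private destructor never frees the device, a failed
    registration can call it and still leave the final free to the caller.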

    Signed-off-by: David S. Miller

    David S. Miller
     

22 Apr, 2017

1 commit

  • This feature allows the administrator to set an fwmark for
    packets traversing a tunnel. This allows the use of independent
    routing tables for tunneled packets without the use of iptables.

    There is no concept of per-packet routing decisions through IPv4
    tunnels, so this implementation does not need to work with
    per-packet route lookups as the v6 implementation may
    (with IP6_TNL_F_USE_ORIG_FWMARK).

    Further, since the v4 tunnel ioctls share datastructures
    (which can not be trivially modified) with the kernel's internal
    tunnel configuration structures, the mark attribute must be stored
    in the tunnel structure itself and passed as a parameter when
    creating or changing tunnel attributes.

    Signed-off-by: Craig Gallek
    Signed-off-by: David S. Miller

    Craig Gallek
     

09 Feb, 2017

1 commit

  • Dmitry reported a double free in sit_init_net():

    kernel BUG at mm/percpu.c:689!
    invalid opcode: 0000 [#1] SMP KASAN
    Dumping ftrace buffer:
    (ftrace buffer empty)
    Modules linked in:
    CPU: 0 PID: 15692 Comm: syz-executor1 Not tainted 4.10.0-rc6-next-20170206 #1
    Hardware name: Google Google Compute Engine/Google Compute Engine,
    BIOS Google 01/01/2011
    task: ffff8801c9cc27c0 task.stack: ffff88017d1d8000
    RIP: 0010:pcpu_free_area+0x68b/0x810 mm/percpu.c:689
    RSP: 0018:ffff88017d1df488 EFLAGS: 00010046
    RAX: 0000000000010000 RBX: 00000000000007c0 RCX: ffffc90002829000
    RDX: 0000000000010000 RSI: ffffffff81940efb RDI: ffff8801db841d94
    RBP: ffff88017d1df590 R08: dffffc0000000000 R09: 1ffffffff0bb3bdd
    R10: dffffc0000000000 R11: 00000000000135dd R12: ffff8801db841d80
    R13: 0000000000038e40 R14: 00000000000007c0 R15: 00000000000007c0
    FS: 00007f6ea608f700(0000) GS:ffff8801dbe00000(0000) knlGS:0000000000000000
    CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: 000000002000aff8 CR3: 00000001c8d44000 CR4: 00000000001426f0
    DR0: 0000000020000000 DR1: 0000000020000000 DR2: 0000000000000000
    DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000600
    Call Trace:
    free_percpu+0x212/0x520 mm/percpu.c:1264
    ipip6_dev_free+0x43/0x60 net/ipv6/sit.c:1335
    sit_init_net+0x3cb/0xa10 net/ipv6/sit.c:1831
    ops_init+0x10a/0x530 net/core/net_namespace.c:115
    setup_net+0x2ed/0x690 net/core/net_namespace.c:291
    copy_net_ns+0x26c/0x530 net/core/net_namespace.c:396
    create_new_namespaces+0x409/0x860 kernel/nsproxy.c:106
    unshare_nsproxy_namespaces+0xae/0x1e0 kernel/nsproxy.c:205
    SYSC_unshare kernel/fork.c:2281 [inline]
    SyS_unshare+0x64e/0xfc0 kernel/fork.c:2231
    entry_SYSCALL_64_fastpath+0x1f/0xc2

    This is because when tunnel->dst_cache init fails, we free dev->tstats
    once in ipip6_tunnel_init() and twice in sit_init_net(). This looks
    redundant but its ndo_uninit() does not seem enough to clean up everything
    here. So avoid this by setting dev->tstats to NULL after the first free,
    at least for -net.
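
    Clearing a pointer after freeing it makes a second cleanup pass harmless.
    A userspace sketch with a hypothetical struct dev_sketch, using free() in
    place of the per-CPU free:

```c
#include <stdlib.h>

struct dev_sketch { int *stats; };

/* Release the stats area and clear the pointer so that a second cleanup
 * pass sees NULL; free(NULL) is defined to do nothing. */
void dev_free_stats(struct dev_sketch *dev)
{
    free(dev->stats);
    dev->stats = NULL;
}
```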

    Reported-by: Dmitry Vyukov
    Signed-off-by: Cong Wang
    Signed-off-by: David S. Miller

    WANG Cong
     

25 Dec, 2016

1 commit


18 Nov, 2016

1 commit

  • Make struct pernet_operations::id unsigned.

    There are 2 reasons to do so:

    1)
    This field is really an index into a zero-based array and
    thus is an unsigned entity. Using a negative value is an out-of-bounds
    access by definition.

    2)
    On x86_64 unsigned 32-bit data which are mixed with pointers
    via array indexing or offsets added or subtracted to pointers
    are preferred to signed 32-bit data.

    "int" being used as an array index needs to be sign-extended
    to 64-bit before being used.

    void f(long *p, int i)
    {
            g(p[i]);
    }

    roughly translates to

    movsx rsi, esi
    mov rdi, [rsi+...]
    call g

    MOVSX is 3 byte instruction which isn't necessary if the variable is
    unsigned because x86_64 is zero extending by default.

    Now, there is net_generic() function which, you guessed it right, uses
    "int" as an array index:

    static inline void *net_generic(const struct net *net, int id)
    {
            ...
            ptr = ng->ptr[id - 1];
            ...
    }

    And this function is used a lot, so those sign extensions add up.

    Patch snipes ~1730 bytes on allyesconfig kernel (without all junk
    messing with code generation):

    add/remove: 0/0 grow/shrink: 70/598 up/down: 396/-2126 (-1730)

    Unfortunately some functions actually grow bigger.
    This is a seemingly random artefact of code generation with register
    allocator being used differently. gcc decides that some variable
    needs to live in new r8+ registers and every access now requires REX
    prefix. Or it is shifted into r12, so [r12+0] addressing mode has to be
    used which is longer than [r8]

    However, overall balance is in negative direction:

    add/remove: 0/0 grow/shrink: 70/598 up/down: 396/-2126 (-1730)
    function old new delta
    nfsd4_lock 3886 3959 +73
    tipc_link_build_proto_msg 1096 1140 +44
    mac80211_hwsim_new_radio 2776 2808 +32
    tipc_mon_rcv 1032 1058 +26
    svcauth_gss_legacy_init 1413 1429 +16
    tipc_bcbase_select_primary 379 392 +13
    nfsd4_exchange_id 1247 1260 +13
    nfsd4_setclientid_confirm 782 793 +11
    ...
    put_client_renew_locked 494 480 -14
    ip_set_sockfn_get 730 716 -14
    geneve_sock_add 829 813 -16
    nfsd4_sequence_done 721 703 -18
    nlmclnt_lookup_host 708 686 -22
    nfsd4_lockt 1085 1063 -22
    nfs_get_client 1077 1050 -27
    tcf_bpf_init 1106 1076 -30
    nfsd4_encode_fattr 5997 5930 -67
    Total: Before=154856051, After=154854321, chg -0.00%

    Signed-off-by: Alexey Dobriyan
    Signed-off-by: David S. Miller

    Alexey Dobriyan
     

21 Oct, 2016

1 commit

  • ipv4/ip_tunnel:
    - min_mtu = 68, max_mtu = 0xFFF8 - dev->hard_header_len - t_hlen
    - preserve all ndo_change_mtu checks for now to prevent regressions

    ipv6/ip6_tunnel:
    - min_mtu = 68, max_mtu = 0xFFF8 - dev->hard_header_len
    - preserve all ndo_change_mtu checks for now to prevent regressions

    ipv6/ip6_vti:
    - min_mtu = 1280, max_mtu = 65535
    - remove redundant vti6_change_mtu

    ipv6/sit:
    - min_mtu = 1280, max_mtu = 0xFFF8 - t_hlen
    - remove redundant ipip6_tunnel_change_mtu
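
    With bounds declared up front, an MTU change reduces to a range check. A
    sketch using the sit values listed above; the macro and function names are
    hypothetical and t_hlen is passed in by the caller:

```c
#include <stdbool.h>

#define SIT_MIN_MTU 1280 /* IPv6 minimum link MTU */
#define SIT_MAX_MTU(t_hlen) (0xFFF8 - (t_hlen))

/* Range check using the sit bounds: min 1280, max 0xFFF8 - t_hlen. */
bool sit_mtu_in_range(int new_mtu, int t_hlen)
{
    return new_mtu >= SIT_MIN_MTU && new_mtu <= SIT_MAX_MTU(t_hlen);
}
```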

    CC: netdev@vger.kernel.org
    CC: "David S. Miller"
    CC: Alexey Kuznetsov
    CC: James Morris
    CC: Hideaki YOSHIFUJI
    CC: Patrick McHardy
    Signed-off-by: Jarod Wilson
    Signed-off-by: David S. Miller

    Jarod Wilson
     

13 Aug, 2016

1 commit


11 Aug, 2016

1 commit

  • This is a preparatory patch for converting qdisc linked list into a
    hashtable. As we'll need to include hashtable.h in netdevice.h, we first
    have to make sure that this will not introduce symbol conflicts for any of
    the netdevice.h users.

    Reviewed-by: Cong Wang
    Signed-off-by: Jiri Kosina
    Signed-off-by: David S. Miller

    Jiri Kosina