25 Jul, 2020

2 commits


20 Jul, 2020

1 commit


21 Jun, 2020

1 commit


29 May, 2020

5 commits


08 Dec, 2019

1 commit

  • syzbot was once again able to crash a host by setting a very small mtu
    on loopback device.

    Let's make inetdev_valid_mtu() available in include/net/ip.h,
    and use it in ip_setup_cork(), so that we protect both ip_append_page()
    and __ip_append_data()

    Also add a READ_ONCE() when the device mtu is read.

    Pair this lockless read with one WRITE_ONCE() in __dev_set_mtu(),
    even if other code paths might write over this field.

    Add a big comment in include/linux/netdevice.h about dev->mtu
    needing READ_ONCE()/WRITE_ONCE() annotations.

    Hopefully we will add the missing ones in followup patches.

    [1]

    refcount_t: saturated; leaking memory.
    WARNING: CPU: 0 PID: 9464 at lib/refcount.c:22 refcount_warn_saturate+0x138/0x1f0 lib/refcount.c:22
    Kernel panic - not syncing: panic_on_warn set ...
    CPU: 0 PID: 9464 Comm: syz-executor850 Not tainted 5.4.0-syzkaller #0
    Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
    Call Trace:
    __dump_stack lib/dump_stack.c:77 [inline]
    dump_stack+0x197/0x210 lib/dump_stack.c:118
    panic+0x2e3/0x75c kernel/panic.c:221
    __warn.cold+0x2f/0x3e kernel/panic.c:582
    report_bug+0x289/0x300 lib/bug.c:195
    fixup_bug arch/x86/kernel/traps.c:174 [inline]
    fixup_bug arch/x86/kernel/traps.c:169 [inline]
    do_error_trap+0x11b/0x200 arch/x86/kernel/traps.c:267
    do_invalid_op+0x37/0x50 arch/x86/kernel/traps.c:286
    invalid_op+0x23/0x30 arch/x86/entry/entry_64.S:1027
    RIP: 0010:refcount_warn_saturate+0x138/0x1f0 lib/refcount.c:22
    Code: 06 31 ff 89 de e8 c8 f5 e6 fd 84 db 0f 85 6f ff ff ff e8 7b f4 e6 fd 48 c7 c7 e0 71 4f 88 c6 05 56 a6 a4 06 01 e8 c7 a8 b7 fd 0b e9 50 ff ff ff e8 5c f4 e6 fd 0f b6 1d 3d a6 a4 06 31 ff 89
    RSP: 0018:ffff88809689f550 EFLAGS: 00010286
    RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
    RDX: 0000000000000000 RSI: ffffffff815e4336 RDI: ffffed1012d13e9c
    RBP: ffff88809689f560 R08: ffff88809c50a3c0 R09: fffffbfff15d31b1
    R10: fffffbfff15d31b0 R11: ffffffff8ae98d87 R12: 0000000000000001
    R13: 0000000000040100 R14: ffff888099041104 R15: ffff888218d96e40
    refcount_add include/linux/refcount.h:193 [inline]
    skb_set_owner_w+0x2b6/0x410 net/core/sock.c:1999
    sock_wmalloc+0xf1/0x120 net/core/sock.c:2096
    ip_append_page+0x7ef/0x1190 net/ipv4/ip_output.c:1383
    udp_sendpage+0x1c7/0x480 net/ipv4/udp.c:1276
    inet_sendpage+0xdb/0x150 net/ipv4/af_inet.c:821
    kernel_sendpage+0x92/0xf0 net/socket.c:3794
    sock_sendpage+0x8b/0xc0 net/socket.c:936
    pipe_to_sendpage+0x2da/0x3c0 fs/splice.c:458
    splice_from_pipe_feed fs/splice.c:512 [inline]
    __splice_from_pipe+0x3ee/0x7c0 fs/splice.c:636
    splice_from_pipe+0x108/0x170 fs/splice.c:671
    generic_splice_sendpage+0x3c/0x50 fs/splice.c:842
    do_splice_from fs/splice.c:861 [inline]
    direct_splice_actor+0x123/0x190 fs/splice.c:1035
    splice_direct_to_actor+0x3b4/0xa30 fs/splice.c:990
    do_splice_direct+0x1da/0x2a0 fs/splice.c:1078
    do_sendfile+0x597/0xd00 fs/read_write.c:1464
    __do_sys_sendfile64 fs/read_write.c:1525 [inline]
    __se_sys_sendfile64 fs/read_write.c:1511 [inline]
    __x64_sys_sendfile64+0x1dd/0x220 fs/read_write.c:1511
    do_syscall_64+0xfa/0x790 arch/x86/entry/common.c:294
    entry_SYSCALL_64_after_hwframe+0x49/0xbe
    RIP: 0033:0x441409
    Code: e8 ac e8 ff ff 48 83 c4 18 c3 0f 1f 80 00 00 00 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 3d 01 f0 ff ff 0f 83 eb 08 fc ff c3 66 2e 0f 1f 84 00 00 00 00
    RSP: 002b:00007fffb64c4f78 EFLAGS: 00000246 ORIG_RAX: 0000000000000028
    RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 0000000000441409
    RDX: 0000000000000000 RSI: 0000000000000006 RDI: 0000000000000005
    RBP: 0000000000073b8a R08: 0000000000000010 R09: 0000000000000010
    R10: 0000000000010001 R11: 0000000000000246 R12: 0000000000402180
    R13: 0000000000402210 R14: 0000000000000000 R15: 0000000000000000
    Kernel Offset: disabled
    Rebooting in 86400 seconds..

    Fixes: 1470ddf7f8ce ("inet: Remove explicit write references to sk/inet in ip_append_data")
    Signed-off-by: Eric Dumazet
    Reported-by: syzbot
    Signed-off-by: David S. Miller

    Eric Dumazet
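
The check described in this commit can be modelled in user space. This is a minimal sketch, not the kernel code: `READ_ONCE_U32`, `WRITE_ONCE_U32`, `struct fake_dev`, `model_set_mtu()` and `setup_cork_fragsize()` are hypothetical stand-ins, and the 68-byte threshold (the IPv4 minimum MTU from RFC 791) is an assumption about what `inetdev_valid_mtu()` enforces.

```c
#include <errno.h>
#include <stdbool.h>

/* Assumption: IPv4 minimum MTU per RFC 791. */
#define MODEL_IPV4_MIN_MTU 68u

/* User-space approximations of READ_ONCE()/WRITE_ONCE(): a volatile
 * access makes the compiler emit exactly one load or store. */
#define READ_ONCE_U32(x)     (*(const volatile unsigned int *)&(x))
#define WRITE_ONCE_U32(x, v) (*(volatile unsigned int *)&(x) = (v))

/* Hypothetical stand-in for the device structure; only the MTU matters. */
struct fake_dev {
    unsigned int mtu;
};

static bool model_valid_mtu(unsigned int mtu)
{
    return mtu >= MODEL_IPV4_MIN_MTU;
}

/* Writer side: models the single annotated store in the MTU setter. */
static void model_set_mtu(struct fake_dev *dev, unsigned int new_mtu)
{
    WRITE_ONCE_U32(dev->mtu, new_mtu);
}

/* Reader side: snapshot the MTU once, then validate the snapshot, so a
 * concurrent writer cannot change the value between check and use. */
static int setup_cork_fragsize(struct fake_dev *dev, unsigned int *fragsize)
{
    unsigned int mtu = READ_ONCE_U32(dev->mtu);

    if (!model_valid_mtu(mtu))
        return -ENETUNREACH;
    *fragsize = mtu;
    return 0;
}
```

Validating a local snapshot rather than re-reading the field is the point of the annotation: both the check and the later use see the same value.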
     

27 Nov, 2019

2 commits

  • Any argument outside of that range would result in an out of bound
    memory access, since the accessed array is 65536 bits long.

    Signed-off-by: Maciej Żenczykowski
    Signed-off-by: David S. Miller

    Maciej Żenczykowski
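
A minimal sketch of why the argument type matters, assuming a 16-bit `unsigned short`. The names `struct port_bitmap`, `port_bitmap_set()` and `port_bitmap_test()` are hypothetical, not the kernel's. With a 65536-bit array, an `unsigned short` index can never fall outside the bitmap, so no separate range check is needed.

```c
#include <limits.h>
#include <stdbool.h>

#define PORT_BITS     65536u
#define BITS_PER_WORD (sizeof(unsigned long) * CHAR_BIT)

/* The sketch relies on unsigned short covering exactly 0..65535. */
_Static_assert(USHRT_MAX == 65535, "sketch assumes 16-bit unsigned short");

/* Hypothetical stand-in for a reserved-port bitmap, one bit per port. */
struct port_bitmap {
    unsigned long words[PORT_BITS / BITS_PER_WORD];
};

/* Every value of 'port' maps to a valid bit, by construction of the type. */
static void port_bitmap_set(struct port_bitmap *map, unsigned short port)
{
    map->words[port / BITS_PER_WORD] |= 1UL << (port % BITS_PER_WORD);
}

static bool port_bitmap_test(const struct port_bitmap *map, unsigned short port)
{
    return (map->words[port / BITS_PER_WORD] >> (port % BITS_PER_WORD)) & 1UL;
}
```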
     
  • Note that the sysctl write accessor functions guarantee that the
    net->ipv4.sysctl_ip_prot_sock <= net->ipv4.ip_local_ports.range[0]
    invariant is maintained, and as such the max() in selinux hooks is actually spurious.

    i.e. even though
    if (snum < max(inet_prot_sock(sock_net(sk)), low) || snum > high) {
    per logic is the same as
    if (snum < inet_prot_sock(sock_net(sk)) || snum < low || snum > high) {
    it is actually functionally equivalent to:
    if (snum < low || snum > high) {
    because the first clause is spurious when the invariant holds.

    But we want to hold on to it in case we ever want to change what
    inet_port_requires_bind_service() means (for example by changing
    it from a, by default, [0..1024) range to some sort of set).

    Test: builds, git 'grep inet_prot_sock' finds no other references
    Cc: Eric Dumazet
    Signed-off-by: Maciej Żenczykowski
    Signed-off-by: David S. Miller

    Maciej Żenczykowski
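
The equivalence argued in this commit can be checked exhaustively over all 16-bit ports. This is a sketch with hypothetical helper names; `prot_sock`, `low` and `high` are plain integers standing in for the sysctl values, and the equivalence is only expected to hold when `prot_sock <= low`.

```c
#include <stdbool.h>

static int max_int(int a, int b)
{
    return a > b ? a : b;
}

/* Form using max(), as in the selinux hook quoted in the commit. */
static bool reject_with_max(int snum, int prot_sock, int low, int high)
{
    return snum < max_int(prot_sock, low) || snum > high;
}

/* Form that ignores prot_sock entirely. */
static bool reject_range_only(int snum, int low, int high)
{
    return snum < low || snum > high;
}

/* Form that keeps the privileged-port clause as a separate test. */
static bool reject_explicit(int snum, int prot_sock, int low, int high)
{
    return snum < prot_sock || snum < low || snum > high;
}

/* Returns true when all three forms agree for every port 0..65535. */
static bool forms_agree(int prot_sock, int low, int high)
{
    for (int snum = 0; snum <= 65535; snum++) {
        bool a = reject_with_max(snum, prot_sock, low, high);
        bool b = reject_range_only(snum, low, high);
        bool c = reject_explicit(snum, prot_sock, low, high);

        if (a != b || b != c)
            return false;
    }
    return true;
}
```

For example, `forms_agree(1024, 32768, 60999)` holds, while `forms_agree(2000, 1000, 60999)` does not, because the second call violates the invariant.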
     

23 Nov, 2019

1 commit


22 Oct, 2019

1 commit

  • This patch removes the iph field from the state structure, which is not
    properly initialized. Instead, add a new field to make the "do we want
    to set DF" be the state bit and move the code to set the DF flag from
    ip_frag_next().

    Joint work with Pablo and Linus.

    Fixes: 19c3401a917b ("net: ipv4: place control buffer handling away from fragmentation iterators")
    Reported-by: Patrick Schönthaler
    Signed-off-by: Eric Dumazet
    Signed-off-by: Pablo Neira Ayuso
    Signed-off-by: Linus Torvalds
    Signed-off-by: David S. Miller

    Eric Dumazet
     

14 Sep, 2019

1 commit

  • Enable setting skb->mark for UDP and RAW sockets using cmsg.

    This is analogous to existing support for TOS, TTL, txtime, etc.

    Packet sockets already support this as of commit c7d39e32632e
    ("packet: support per-packet fwmark for af_packet sendmsg").

    Similar to other fields, implement by
    1. initialize the sockcm_cookie.mark from socket option sk_mark
    2. optionally overwrite this in ip_cmsg_send/ip6_datagram_send_ctl
    3. initialize inet_cork.mark from sockcm_cookie.mark
    4. initialize each (usually just one) skb->mark from inet_cork.mark

    Step 1 is handled in one location for most protocols by ipcm_init_sk
    as of commit 351782067b6b ("ipv4: ipcm_cookie initializers").

    Signed-off-by: Willem de Bruijn
    Signed-off-by: David S. Miller

    Willem de Bruijn
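
The four steps listed in this commit can be sketched as a simple value hand-off. The structures below are simplified models with hypothetical `model_` names; only the `mark` fields named in the commit message are represented, and the `has_cmsg_mark` flag is an assumption used to model an optional per-call override.

```c
#include <stdbool.h>
#include <stdint.h>

struct model_sock   { uint32_t sk_mark; }; /* socket option value */
struct model_cookie { uint32_t mark; };    /* models sockcm_cookie.mark */
struct model_cork   { uint32_t mark; };    /* models inet_cork.mark */
struct model_packet { uint32_t mark; };    /* models skb->mark */

/* Returns the mark that ends up on the packet for one send call. */
static uint32_t model_send_mark(const struct model_sock *sk,
                                bool has_cmsg_mark, uint32_t cmsg_mark)
{
    struct model_cookie cookie;
    struct model_cork cork;
    struct model_packet pkt;

    cookie.mark = sk->sk_mark;      /* step 1: start from the socket option */
    if (has_cmsg_mark)
        cookie.mark = cmsg_mark;    /* step 2: optional per-call override */
    cork.mark = cookie.mark;        /* step 3: copy into the cork */
    pkt.mark = cork.mark;           /* step 4: copy onto each packet */
    return pkt.mark;
}
```

The override in step 2 affects only the call that carries it; the socket-level value is left untouched for later sends.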
     

15 Jun, 2019

1 commit

  • If we want to set an EDT time for the skb we want to send
    via ip_send_unicast_reply(), we have to pass a new parameter
    and initialize ipc.sockc.transmit_time with it.

    This fixes the EDT time for ACK/RST packets sent on behalf of
    a TIME_WAIT socket.

    Fixes: a842fe1425cb ("tcp: add optional per socket transmit delay")
    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

08 Jun, 2019

1 commit


04 Jun, 2019

1 commit

  • syzbot reported nasty use-after-free [1]

    Let's remove the frag_list field from structs ip_fraglist_iter
    and ip6_fraglist_iter. It seems not to be needed anyway.

    [1] :
    BUG: KASAN: use-after-free in kfree_skb_list+0x5d/0x60 net/core/skbuff.c:706
    Read of size 8 at addr ffff888085a3cbc0 by task syz-executor303/8947

    CPU: 0 PID: 8947 Comm: syz-executor303 Not tainted 5.2.0-rc2+ #12
    Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
    Call Trace:
    __dump_stack lib/dump_stack.c:77 [inline]
    dump_stack+0x172/0x1f0 lib/dump_stack.c:113
    print_address_description.cold+0x7c/0x20d mm/kasan/report.c:188
    __kasan_report.cold+0x1b/0x40 mm/kasan/report.c:317
    kasan_report+0x12/0x20 mm/kasan/common.c:614
    __asan_report_load8_noabort+0x14/0x20 mm/kasan/generic_report.c:132
    kfree_skb_list+0x5d/0x60 net/core/skbuff.c:706
    ip6_fragment+0x1ef4/0x2680 net/ipv6/ip6_output.c:882
    __ip6_finish_output+0x577/0xaa0 net/ipv6/ip6_output.c:144
    ip6_finish_output+0x38/0x1f0 net/ipv6/ip6_output.c:156
    NF_HOOK_COND include/linux/netfilter.h:294 [inline]
    ip6_output+0x235/0x7f0 net/ipv6/ip6_output.c:179
    dst_output include/net/dst.h:433 [inline]
    ip6_local_out+0xbb/0x1b0 net/ipv6/output_core.c:179
    ip6_send_skb+0xbb/0x350 net/ipv6/ip6_output.c:1796
    ip6_push_pending_frames+0xc8/0xf0 net/ipv6/ip6_output.c:1816
    rawv6_push_pending_frames net/ipv6/raw.c:617 [inline]
    rawv6_sendmsg+0x2993/0x35e0 net/ipv6/raw.c:947
    inet_sendmsg+0x141/0x5d0 net/ipv4/af_inet.c:802
    sock_sendmsg_nosec net/socket.c:652 [inline]
    sock_sendmsg+0xd7/0x130 net/socket.c:671
    ___sys_sendmsg+0x803/0x920 net/socket.c:2292
    __sys_sendmsg+0x105/0x1d0 net/socket.c:2330
    __do_sys_sendmsg net/socket.c:2339 [inline]
    __se_sys_sendmsg net/socket.c:2337 [inline]
    __x64_sys_sendmsg+0x78/0xb0 net/socket.c:2337
    do_syscall_64+0xfd/0x680 arch/x86/entry/common.c:301
    entry_SYSCALL_64_after_hwframe+0x49/0xbe
    RIP: 0033:0x44add9
    Code: e8 7c e6 ff ff 48 83 c4 18 c3 0f 1f 80 00 00 00 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 3d 01 f0 ff ff 0f 83 1b 05 fc ff c3 66 2e 0f 1f 84 00 00 00 00
    RSP: 002b:00007f826f33bce8 EFLAGS: 00000246 ORIG_RAX: 000000000000002e
    RAX: ffffffffffffffda RBX: 00000000006e7a18 RCX: 000000000044add9
    RDX: 0000000000000000 RSI: 0000000020000240 RDI: 0000000000000005
    RBP: 00000000006e7a10 R08: 0000000000000000 R09: 0000000000000000
    R10: 0000000000000000 R11: 0000000000000246 R12: 00000000006e7a1c
    R13: 00007ffcec4f7ebf R14: 00007f826f33c9c0 R15: 20c49ba5e353f7cf

    Allocated by task 8947:
    save_stack+0x23/0x90 mm/kasan/common.c:71
    set_track mm/kasan/common.c:79 [inline]
    __kasan_kmalloc mm/kasan/common.c:489 [inline]
    __kasan_kmalloc.constprop.0+0xcf/0xe0 mm/kasan/common.c:462
    kasan_slab_alloc+0xf/0x20 mm/kasan/common.c:497
    slab_post_alloc_hook mm/slab.h:437 [inline]
    slab_alloc_node mm/slab.c:3269 [inline]
    kmem_cache_alloc_node+0x131/0x710 mm/slab.c:3579
    __alloc_skb+0xd5/0x5e0 net/core/skbuff.c:199
    alloc_skb include/linux/skbuff.h:1058 [inline]
    __ip6_append_data.isra.0+0x2a24/0x3640 net/ipv6/ip6_output.c:1519
    ip6_append_data+0x1e5/0x320 net/ipv6/ip6_output.c:1688
    rawv6_sendmsg+0x1467/0x35e0 net/ipv6/raw.c:940
    inet_sendmsg+0x141/0x5d0 net/ipv4/af_inet.c:802
    sock_sendmsg_nosec net/socket.c:652 [inline]
    sock_sendmsg+0xd7/0x130 net/socket.c:671
    ___sys_sendmsg+0x803/0x920 net/socket.c:2292
    __sys_sendmsg+0x105/0x1d0 net/socket.c:2330
    __do_sys_sendmsg net/socket.c:2339 [inline]
    __se_sys_sendmsg net/socket.c:2337 [inline]
    __x64_sys_sendmsg+0x78/0xb0 net/socket.c:2337
    do_syscall_64+0xfd/0x680 arch/x86/entry/common.c:301
    entry_SYSCALL_64_after_hwframe+0x49/0xbe

    Freed by task 8947:
    save_stack+0x23/0x90 mm/kasan/common.c:71
    set_track mm/kasan/common.c:79 [inline]
    __kasan_slab_free+0x102/0x150 mm/kasan/common.c:451
    kasan_slab_free+0xe/0x10 mm/kasan/common.c:459
    __cache_free mm/slab.c:3432 [inline]
    kmem_cache_free+0x86/0x260 mm/slab.c:3698
    kfree_skbmem net/core/skbuff.c:625 [inline]
    kfree_skbmem+0xc5/0x150 net/core/skbuff.c:619
    __kfree_skb net/core/skbuff.c:682 [inline]
    kfree_skb net/core/skbuff.c:699 [inline]
    kfree_skb+0xf0/0x390 net/core/skbuff.c:693
    kfree_skb_list+0x44/0x60 net/core/skbuff.c:708
    __dev_xmit_skb net/core/dev.c:3551 [inline]
    __dev_queue_xmit+0x3034/0x36b0 net/core/dev.c:3850
    dev_queue_xmit+0x18/0x20 net/core/dev.c:3914
    neigh_direct_output+0x16/0x20 net/core/neighbour.c:1532
    neigh_output include/net/neighbour.h:511 [inline]
    ip6_finish_output2+0x1034/0x2550 net/ipv6/ip6_output.c:120
    ip6_fragment+0x1ebb/0x2680 net/ipv6/ip6_output.c:863
    __ip6_finish_output+0x577/0xaa0 net/ipv6/ip6_output.c:144
    ip6_finish_output+0x38/0x1f0 net/ipv6/ip6_output.c:156
    NF_HOOK_COND include/linux/netfilter.h:294 [inline]
    ip6_output+0x235/0x7f0 net/ipv6/ip6_output.c:179
    dst_output include/net/dst.h:433 [inline]
    ip6_local_out+0xbb/0x1b0 net/ipv6/output_core.c:179
    ip6_send_skb+0xbb/0x350 net/ipv6/ip6_output.c:1796
    ip6_push_pending_frames+0xc8/0xf0 net/ipv6/ip6_output.c:1816
    rawv6_push_pending_frames net/ipv6/raw.c:617 [inline]
    rawv6_sendmsg+0x2993/0x35e0 net/ipv6/raw.c:947
    inet_sendmsg+0x141/0x5d0 net/ipv4/af_inet.c:802
    sock_sendmsg_nosec net/socket.c:652 [inline]
    sock_sendmsg+0xd7/0x130 net/socket.c:671
    ___sys_sendmsg+0x803/0x920 net/socket.c:2292
    __sys_sendmsg+0x105/0x1d0 net/socket.c:2330
    __do_sys_sendmsg net/socket.c:2339 [inline]
    __se_sys_sendmsg net/socket.c:2337 [inline]
    __x64_sys_sendmsg+0x78/0xb0 net/socket.c:2337
    do_syscall_64+0xfd/0x680 arch/x86/entry/common.c:301
    entry_SYSCALL_64_after_hwframe+0x49/0xbe

    The buggy address belongs to the object at ffff888085a3cbc0
    which belongs to the cache skbuff_head_cache of size 224
    The buggy address is located 0 bytes inside of
    224-byte region [ffff888085a3cbc0, ffff888085a3cca0)
    The buggy address belongs to the page:
    page:ffffea0002168f00 refcount:1 mapcount:0 mapping:ffff88821b6f63c0 index:0x0
    flags: 0x1fffc0000000200(slab)
    raw: 01fffc0000000200 ffffea00027bbf88 ffffea0002105b88 ffff88821b6f63c0
    raw: 0000000000000000 ffff888085a3c080 000000010000000c 0000000000000000
    page dumped because: kasan: bad access detected

    Memory state around the buggy address:
    ffff888085a3ca80: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
    ffff888085a3cb00: 00 00 00 00 00 00 00 00 00 00 00 00 fc fc fc fc
    >ffff888085a3cb80: fc fc fc fc fc fc fc fc fb fb fb fb fb fb fb fb
    ^
    ffff888085a3cc00: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
    ffff888085a3cc80: fb fb fb fb fc fc fc fc fc fc fc fc fc fc fc fc

    Fixes: 0feca6190f88 ("net: ipv6: add skbuff fraglist splitter")
    Fixes: c8b17be0b7a4 ("net: ipv4: add skbuff fraglist splitter")
    Signed-off-by: Eric Dumazet
    Cc: Pablo Neira Ayuso
    Acked-by: Pablo Neira Ayuso
    Signed-off-by: David S. Miller

    Eric Dumazet
     

31 May, 2019

3 commits

  • This patch exposes a new API to refragment a skbuff. This allows you to
    split either a linear skbuff or to force the refragmentation of an
    existing fraglist using a different mtu. The API consists of:

    * ip_frag_init(), that initializes the internal state of the transformer.
    * ip_frag_next(), that allows you to fetch the next fragment. This function
    internally allocates the skbuff that represents the fragment, it pushes
    the IPv4 header, and it also copies the payload for each fragment.

    The ip_frag_state object stores the internal state of the splitter.

    This code has been extracted from ip_do_fragment(). Symbols are also
    exported so that the bridge codepath can reuse this iterator to
    build its own refragmentation routine on the existing codebase.

    Signed-off-by: Pablo Neira Ayuso
    Signed-off-by: David S. Miller

    Pablo Neira Ayuso
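
The init/next shape of this API can be illustrated with a user-space model of the length and offset arithmetic only. The `model_` names are hypothetical, no buffers are allocated and no headers are built, and the sketch assumes `mtu - hlen` is at least 8 bytes. It applies the IPv4 rule that every fragment except the last carries a payload that is a multiple of 8 bytes.

```c
#include <stddef.h>

/* Internal state of the model splitter. */
struct model_frag_state {
    size_t hlen;    /* IPv4 header length in bytes */
    size_t mtu;     /* MTU, header included */
    size_t left;    /* payload bytes not yet emitted */
    size_t offset;  /* payload offset of the next fragment */
};

/* Description of one fragment produced by the model. */
struct model_frag {
    size_t offset;       /* payload offset in bytes */
    size_t len;          /* payload bytes carried */
    int more_fragments;  /* 1 unless this is the last fragment */
};

static void model_frag_init(struct model_frag_state *st, size_t payload_len,
                            size_t hlen, size_t mtu)
{
    st->hlen = hlen;
    st->mtu = mtu;
    st->left = payload_len;
    st->offset = 0;
}

/* The caller must check st->left > 0 before each call. */
static struct model_frag model_frag_next(struct model_frag_state *st)
{
    struct model_frag f;
    size_t room = st->mtu - st->hlen;
    size_t len = st->left;

    if (len > room)
        len = room & ~(size_t)7; /* non-final fragment: round down to 8 */

    f.offset = st->offset;
    f.len = len;
    st->left -= len;
    st->offset += len;
    f.more_fragments = st->left > 0;
    return f;
}
```

A caller would loop with `while (st.left > 0) f = model_frag_next(&st);`, mirroring the fetch-next-fragment pattern the commit describes.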
     
  • This patch adds the skbuff fraglist splitter. This API provides an
    iterator to transform the fraglist into single skbuff objects, it
    consists of:

    * ip_fraglist_init(), that initializes the internal state of the
    fraglist splitter.
    * ip_fraglist_prepare(), that restores the IPv4 header on the
    fragments.
    * ip_fraglist_next(), that retrieves the fragment from the fraglist and
    it updates the internal state of the splitter to point to the next
    fragment skbuff in the fraglist.

    The ip_fraglist_iter object stores the internal state of the iterator.

    This code has been extracted from ip_do_fragment(). Symbols are also
    exported so that the bridge codepath can reuse this iterator to
    build its own refragmentation routine on the existing codebase.

    Signed-off-by: Pablo Neira Ayuso
    Signed-off-by: David S. Miller

    Pablo Neira Ayuso
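
The iterator idea can be sketched with a plain singly linked list. The `frag_node` and `model_fraglist_` names are hypothetical, and the header restoration done by the prepare step is left out. The next operation returns the current fragment, detaches it from the list, and advances the iterator.

```c
#include <stddef.h>

/* Hypothetical stand-in for a buffer chained on a fragment list. */
struct frag_node {
    int id;
    struct frag_node *next;
};

/* Iterator state: the fragment that the next call will return. */
struct model_fraglist_iter {
    struct frag_node *frag;
};

static void model_fraglist_init(struct frag_node *first,
                                struct model_fraglist_iter *iter)
{
    iter->frag = first;
}

/* The caller must check iter->frag != NULL before each call. */
static struct frag_node *model_fraglist_next(struct model_fraglist_iter *iter)
{
    struct frag_node *node = iter->frag;

    iter->frag = node->next; /* advance to the following fragment */
    node->next = NULL;       /* the returned fragment stands alone */
    return node;
}
```

Because the iterator keeps only the pointer to the next fragment, a fragment that has been handed out and freed is never touched again by the iterator.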
     
  • Based on 1 normalized pattern(s):

    this program is free software you can redistribute it and or modify
    it under the terms of the gnu general public license as published by
    the free software foundation either version 2 of the license or at
    your option any later version

    extracted by the scancode license scanner the SPDX license identifier

    GPL-2.0-or-later

    has been chosen to replace the boilerplate/reference in 3029 file(s).

    Signed-off-by: Thomas Gleixner
    Reviewed-by: Allison Randal
    Cc: linux-spdx@vger.kernel.org
    Link: https://lkml.kernel.org/r/20190527070032.746973796@linutronix.de
    Signed-off-by: Greg Kroah-Hartman

    Thomas Gleixner
     

06 Apr, 2019

1 commit


02 Apr, 2019

1 commit

  • Configuration check to accept source route IP options should be made on
    the incoming netdevice when the skb->dev is an l3mdev master. The route
    lookup for the source route next hop also needs the incoming netdev.

    v2->v3:
    - Simplify by passing the original netdevice down the stack (per David
    Ahern).

    Signed-off-by: Stephen Suryaputra
    Reviewed-by: David Ahern
    Signed-off-by: David S. Miller

    Stephen Suryaputra
     

22 Mar, 2019

1 commit

  • fib_trie implementation calls synchronize_rcu when a certain amount of
    pages are dirty from freed entries. The number of pages was determined
    experimentally in 2009 (commit c3059477fce2d).

    At the current setting, synchronize_rcu is called often -- 51 times in a
    second in one test with an average of an 8 msec delay adding a fib entry.
    The total impact is a lot of slow down modifying the fib. This is seen
    in the output of 'time' - the difference between real time and sys+user.
    For example, using 720,022 single path routes and 'ip -batch'[1]:

    $ time ./ip -batch ipv4/routes-1-hops
    real 0m14.214s
    user 0m2.513s
    sys 0m6.783s

    So roughly 35% of the actual time to install the routes is from the ip
    command getting scheduled out, most notably due to synchronize_rcu (this
    is observed using 'perf sched timehist').

    This patch makes the amount of dirty memory configurable between 64k where
    the synchronize_rcu is called often (small, low end systems that are memory
    sensitive) to 64M where synchronize_rcu is called rarely during a large
    FIB change (for high end systems with lots of memory). The default is 512kB
    which corresponds to the current setting of 128 pages with a 4kB page size.

    As an example, at 16MB the worst interval shows 4 calls to synchronize_rcu
    in a second blocking for up to 30 msec in a single instance, and a total
    of almost 100 msec across the 4 calls in the second. The trade off is
    allowing FIB entries to consume more memory in a given time window
    but with much better fib insertion rates (~30% increase in prefixes/sec).
    With this patch and net.ipv4.fib_sync_mem set to 16MB, the same batch
    file runs in:

    $ time ./ip -batch ipv4/routes-1-hops
    real 0m9.692s
    user 0m2.491s
    sys 0m6.769s

    So the dead time is reduced to about 1/2 second or
    Signed-off-by: David S. Miller

    David Ahern
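
The "dead time" in this commit is the difference between real time and user plus sys time. The arithmetic can be reproduced with a small sketch; `struct time_sample` and the two helpers are hypothetical names, and the inputs are the figures quoted from the `time` output in the commit message.

```c
/* One run of 'time': wall-clock, user and system seconds. */
struct time_sample {
    double real;
    double user;
    double sys;
};

/* Seconds during which the process was neither in user nor kernel code. */
static double dead_time(const struct time_sample *t)
{
    return t->real - (t->user + t->sys);
}

/* Dead time as a fraction of wall-clock time. */
static double dead_fraction(const struct time_sample *t)
{
    return dead_time(t) / t->real;
}
```

With the default setting (14.214 / 2.513 / 6.783) this gives 4.918 s, about 34.6% of real time, which matches the "roughly 35%" in the message. With 16MB (9.692 / 2.491 / 6.769) it gives 0.432 s, about 4.5% of real time.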
     

02 Mar, 2019

1 commit

  • For ip rules, we need to use 'ipproto ipv6-icmp' to match ICMPv6 headers.
    But for ip -6 route, currently we only support tcp, udp and icmp.

    Add ICMPv6 support so we can match ipv6-icmp rules for route lookup.

    v2: As David Ahern and Sabrina Dubroca suggested, Add an argument to
    rtm_getroute_parse_ip_proto() to handle ICMP/ICMPv6 with different family.

    Reported-by: Jianlin Shi
    Fixes: eacb9384a3fe ("ipv6: support sport, dport and ip_proto in RTM_GETROUTE")
    Signed-off-by: Hangbin Liu
    Signed-off-by: David S. Miller

    Hangbin Liu
     

26 Feb, 2019

1 commit


08 Nov, 2018

1 commit


07 Nov, 2018

1 commit


05 Oct, 2018

4 commits


07 Jul, 2018

2 commits

  • skb_shinfo(skb)->tx_flags is derived from sk->sk_tsflags, possibly
    after modification by __sock_cmsg_send, by calling sock_tx_timestamp.

    The IPv4 and IPv6 paths do this conversion differently. In IPv4, the
    individual protocols that support tx timestamps call this function
    and store the result in ipc.tx_flags. In IPv6, sock_tx_timestamp is
    called in __ip6_append_data.

    There is no need to store both tx_flags and ts_flags in the cookie
    as one is derived from the other. Convert when setting up the cork
    and remove the redundant field. This is similar to IPv6, except that
    the conversion happens only once per datagram, in ip(6)_setup_cork.

    Also change __ip6_append_data to match __ip_append_data. Only update
    tskey if timestamping is enabled with OPT_ID. The SOCK_.. test is
    redundant: only valid protocols can have non-zero cork->tx_flags.

    After this change the IPv4 and IPv6 logic is the same.

    Signed-off-by: Willem de Bruijn
    Signed-off-by: David S. Miller

    Willem de Bruijn
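
The idea of deriving one flag set from the other once, at cork setup, can be sketched as below. Every name and bit value here is a hypothetical stand-in chosen for illustration; these are not the kernel's constants, and `model_setup_cork()` only models the single conversion step.

```c
#include <stdint.h>

/* Illustrative socket-level timestamping request bits (not kernel values). */
#define MODEL_TS_TX_HARDWARE (1u << 0)
#define MODEL_TS_TX_SOFTWARE (1u << 1)
#define MODEL_TS_TX_SCHED    (1u << 2)
#define MODEL_TS_OPT_ID      (1u << 3) /* an option bit, not a tx request */

/* Illustrative per-packet tx flag bits (not kernel values). */
#define MODEL_TX_HW    (1u << 0)
#define MODEL_TX_SW    (1u << 1)
#define MODEL_TX_SCHED (1u << 2)

struct model_ts_cork {
    uint8_t tx_flags;
};

/* Derive per-packet tx flags from the socket-level request bits. */
static uint8_t model_tx_flags(uint32_t tsflags)
{
    uint8_t flags = 0;

    if (tsflags & MODEL_TS_TX_HARDWARE)
        flags |= MODEL_TX_HW;
    if (tsflags & MODEL_TS_TX_SOFTWARE)
        flags |= MODEL_TX_SW;
    if (tsflags & MODEL_TS_TX_SCHED)
        flags |= MODEL_TX_SCHED;
    return flags;
}

/* Convert once per datagram and store only the derived value. */
static void model_setup_cork(struct model_ts_cork *cork, uint32_t tsflags)
{
    cork->tx_flags = model_tx_flags(tsflags);
}
```

Storing only the derived value avoids carrying two fields that could drift apart.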
     
  • Initialize the cookie in one location to reduce code duplication and
    avoid bugs from inconsistent initialization, such as that fixed in
    commit 9887cba19978 ("ip: limit use of gso_size to udp").

    Signed-off-by: Willem de Bruijn
    Signed-off-by: David S. Miller

    Willem de Bruijn
     

04 Jul, 2018

2 commits

  • Also involved adding a way to run a netfilter hook over a list of packets.
    Rather than attempting to make netfilter know about lists (which would be
    a major project in itself) we just let it call the regular okfn (in this
    case ip_rcv_finish()) for any packets it steals, and have it give us back
    a list of packets it's synchronously accepted (which normally NF_HOOK
    would automatically call okfn() on, but we want to be able to potentially
    pass the list to a listified version of okfn().)
    The netfilter hooks themselves are indirect calls that still happen per-
    packet (see nf_hook_entry_hookfn()), but again, changing that can be left
    for future work.

    There is potential for out-of-order receives if the netfilter hook ends up
    synchronously stealing packets, as they will be processed before any
    accepts earlier in the list. However, it was already possible for an
    asynchronous accept to cause out-of-order receives, so presumably this is
    considered OK.

    Signed-off-by: Edward Cree
    Signed-off-by: David S. Miller

    Edward Cree
     
  • This patch introduces __ip_queue_xmit(), through which the callers
    can pass tos param into it without having to set inet->tos. For
    ipv6, ip6_xmit() already allows passing tclass parameter.

    It's needed when some transport protocol doesn't use inet->tos,
    like sctp's per transport dscp, which will be added in next patch.

    Signed-off-by: Xin Long
    Signed-off-by: David S. Miller

    Xin Long
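
The pattern of adding a variant that takes the value as a parameter can be sketched as follows. All names are hypothetical models. That the existing entry point becomes a thin wrapper passing the socket's own TOS is an assumption of this sketch, not something the commit message states.

```c
#include <stdint.h>

struct model_inet_sock { uint8_t tos; }; /* socket-level TOS setting */
struct model_out_pkt   { uint8_t tos; }; /* TOS written to the packet */

/* Models the new variant: the caller chooses the TOS explicitly. */
static void model_queue_xmit_tos(const struct model_inet_sock *inet,
                                 struct model_out_pkt *pkt, uint8_t tos)
{
    (void)inet;     /* the socket's own TOS is deliberately not consulted */
    pkt->tos = tos;
}

/* Models the existing entry point: behaviour is unchanged for callers
 * that want the socket-level TOS. */
static void model_queue_xmit(const struct model_inet_sock *inet,
                             struct model_out_pkt *pkt)
{
    model_queue_xmit_tos(inet, pkt, inet->tos);
}
```

A protocol that keeps its own per-transport value can call the explicit variant without modifying shared socket state.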
     

24 May, 2018

1 commit


27 Apr, 2018

2 commits

  • Support generic segmentation offload for udp datagrams. Callers can
    concatenate and send at once the payload of multiple datagrams with
    the same destination.

    To set segment size, the caller sets socket option UDP_SEGMENT to the
    length of each discrete payload. This value must be smaller than or
    equal to the relevant MTU.

    A follow-up patch adds cmsg UDP_SEGMENT to specify segment size on a
    per send call basis.

    Total byte length may then exceed MTU. If not an exact multiple of
    segment size, the last segment will be shorter.

    The implementation adds a gso_size field to the udp socket, ip(v6)
    cmsg cookie and inet_cork structure to be able to set the value at
    setsockopt or cmsg time and to work with both lockless and corked
    paths.

    Initial benchmark numbers show UDP GSO about as expensive as TCP GSO.

    tcp tso
    3197 MB/s 54232 msg/s 54232 calls/s
    6,457,754,262 cycles

    tcp gso
    1765 MB/s 29939 msg/s 29939 calls/s
    11,203,021,806 cycles

    tcp without tso/gso *
    739 MB/s 12548 msg/s 12548 calls/s
    11,205,483,630 cycles

    udp
    876 MB/s 14873 msg/s 624666 calls/s
    11,205,777,429 cycles

    udp gso
    2139 MB/s 36282 msg/s 36282 calls/s
    11,204,374,561 cycles

    [*] after reverting commit 0a6b2a1dc2a2
    ("tcp: switch to GSO being always on")

    Measured total system cycles ('-a') for one core while pinning both
    the network receive path and benchmark process to that core:

    perf stat -a -C 12 -e cycles \
    ./udpgso_bench_tx -C 12 -4 -D "$DST" -l 4

    Note the reduction in calls/s with GSO. Bytes per syscall
    increases from 1470 to 61818.

    Signed-off-by: Willem de Bruijn
    Signed-off-by: David S. Miller

    Willem de Bruijn
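
The segment arithmetic described in this commit (total length may exceed the MTU, and the last segment is shorter when the total is not an exact multiple of the segment size) can be sketched as below. The helper names are hypothetical, and both arguments are assumed to be greater than zero.

```c
#include <stddef.h>

/* Number of datagrams produced from 'total' payload bytes. */
static size_t gso_segment_count(size_t total, size_t gso_size)
{
    return (total + gso_size - 1) / gso_size; /* round up */
}

/* Payload length of the final datagram. */
static size_t gso_last_segment_len(size_t total, size_t gso_size)
{
    size_t rem = total % gso_size;

    return rem ? rem : gso_size; /* exact multiple: last one is full size */
}
```

For instance, 4000 bytes with a segment size of 1472 yields three datagrams of 1472, 1472 and 1056 bytes.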
     
  • UDP segmentation offload needs access to inet_cork in the udp layer.
    Pass the struct to ip(6)_make_skb instead of allocating it on the
    stack in that function itself.

    This patch is a noop otherwise.

    Signed-off-by: Willem de Bruijn
    Signed-off-by: David S. Miller

    Willem de Bruijn
     

18 Apr, 2018

1 commit

  • Move logic of fib_convert_metrics into ip_metrics_convert. This allows
    the code that converts netlink attributes into metrics struct to be
    re-used in a later patch by IPv6.

    This is mostly a code move with the following changes to variable names:
    - fi->fib_net becomes net
    - fc_mx and fc_mx_len are passed as inputs pulled from fib_config
    - metrics array is passed as an input from fi->fib_metrics->metrics

    Signed-off-by: David Ahern
    Signed-off-by: David S. Miller

    David Ahern