27 Feb, 2020

1 commit

  • IPV6_ADDRFORM can transform an IPv6 socket into an IPv4 one.
    While this operation sounds illogical, we have to support it.

    One of the things it does for a TCP socket is to switch sk->sk_prot
    to tcp_prot.

    We now have other layers playing with sk->sk_prot, so we should make
    sure not to interfere with them.

    This patch makes sure sk_prot is the default pointer for a TCP IPv6 socket.

    syzbot reported :
    BUG: kernel NULL pointer dereference, address: 0000000000000000
    PGD a0113067 P4D a0113067 PUD a8771067 PMD 0
    Oops: 0010 [#1] PREEMPT SMP KASAN
    CPU: 0 PID: 10686 Comm: syz-executor.0 Not tainted 5.6.0-rc2-syzkaller #0
    Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
    RIP: 0010:0x0
    Code: Bad RIP value.
    RSP: 0018:ffffc9000281fce0 EFLAGS: 00010246
    RAX: 1ffffffff15f48ac RBX: ffffffff8afa4560 RCX: dffffc0000000000
    RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff8880a69a8f40
    RBP: ffffc9000281fd10 R08: ffffffff86ed9b0c R09: ffffed1014d351f5
    R10: ffffed1014d351f5 R11: 0000000000000000 R12: ffff8880920d3098
    R13: 1ffff1101241a613 R14: ffff8880a69a8f40 R15: 0000000000000000
    FS: 00007f2ae75db700(0000) GS:ffff8880aea00000(0000) knlGS:0000000000000000
    CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: ffffffffffffffd6 CR3: 00000000a3b85000 CR4: 00000000001406f0
    DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
    DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
    Call Trace:
    inet_release+0x165/0x1c0 net/ipv4/af_inet.c:427
    __sock_release net/socket.c:605 [inline]
    sock_close+0xe1/0x260 net/socket.c:1283
    __fput+0x2e4/0x740 fs/file_table.c:280
    ____fput+0x15/0x20 fs/file_table.c:313
    task_work_run+0x176/0x1b0 kernel/task_work.c:113
    tracehook_notify_resume include/linux/tracehook.h:188 [inline]
    exit_to_usermode_loop arch/x86/entry/common.c:164 [inline]
    prepare_exit_to_usermode+0x480/0x5b0 arch/x86/entry/common.c:195
    syscall_return_slowpath+0x113/0x4a0 arch/x86/entry/common.c:278
    do_syscall_64+0x11f/0x1c0 arch/x86/entry/common.c:304
    entry_SYSCALL_64_after_hwframe+0x49/0xbe
    RIP: 0033:0x45c429
    Code: ad b6 fb ff c3 66 2e 0f 1f 84 00 00 00 00 00 66 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 3d 01 f0 ff ff 0f 83 7b b6 fb ff c3 66 2e 0f 1f 84 00 00 00 00
    RSP: 002b:00007f2ae75dac78 EFLAGS: 00000246 ORIG_RAX: 0000000000000036
    RAX: 0000000000000000 RBX: 00007f2ae75db6d4 RCX: 000000000045c429
    RDX: 0000000000000001 RSI: 000000000000011a RDI: 0000000000000004
    RBP: 000000000076bf20 R08: 0000000000000038 R09: 0000000000000000
    R10: 0000000020000180 R11: 0000000000000246 R12: 00000000ffffffff
    R13: 0000000000000a9d R14: 00000000004ccfb4 R15: 000000000076bf2c
    Modules linked in:
    CR2: 0000000000000000
    ---[ end trace 82567b5207e87bae ]---
    RIP: 0010:0x0
    Code: Bad RIP value.
    RSP: 0018:ffffc9000281fce0 EFLAGS: 00010246
    RAX: 1ffffffff15f48ac RBX: ffffffff8afa4560 RCX: dffffc0000000000
    RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff8880a69a8f40
    RBP: ffffc9000281fd10 R08: ffffffff86ed9b0c R09: ffffed1014d351f5
    R10: ffffed1014d351f5 R11: 0000000000000000 R12: ffff8880920d3098
    R13: 1ffff1101241a613 R14: ffff8880a69a8f40 R15: 0000000000000000
    FS: 00007f2ae75db700(0000) GS:ffff8880aea00000(0000) knlGS:0000000000000000
    CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: ffffffffffffffd6 CR3: 00000000a3b85000 CR4: 00000000001406f0
    DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
    DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400

    Fixes: 604326b41a6f ("bpf, sockmap: convert to generic sk_msg interface")
    Signed-off-by: Eric Dumazet
    Reported-by: syzbot+1938db17e275e85dc328@syzkaller.appspotmail.com
    Cc: Daniel Borkmann
    Signed-off-by: David S. Miller

    Eric Dumazet
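
The guard described in the IPV6_ADDRFORM commit above can be illustrated with a small userspace model. This is only a sketch: the `Sock` class and `ipv6_addrform()` function are hypothetical, and returning EBUSY when `sk_prot` is not the default pointer is an assumption made for the illustration, not a quote of the patch.

```python
import errno

# Stand-ins for the kernel's protocol tables (names mirror the commit text).
TCP_PROT = "tcp_prot"
TCPV6_PROT = "tcpv6_prot"


class Sock:
    """Hypothetical, minimal stand-in for struct sock."""

    def __init__(self, sk_prot=TCPV6_PROT):
        self.sk_prot = sk_prot


def ipv6_addrform(sk):
    """Model of IPV6_ADDRFORM on a TCP socket.

    Returns 0 on success or a negative errno, kernel style.
    """
    # Another layer has replaced sk_prot, so leave its pointer alone.
    if sk.sk_prot != TCPV6_PROT:
        return -errno.EBUSY  # assumed error code for this sketch
    sk.sk_prot = TCP_PROT
    return 0
```

In this model a socket whose `sk_prot` was swapped by another layer keeps that pointer, and the transformation is refused instead of silently overwriting it.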
     

21 Feb, 2020

1 commit

  • Variables declared in a switch statement before any case statements
    cannot be automatically initialized with compiler instrumentation (as
    they are not part of any execution flow). With GCC's proposed automatic
    stack variable initialization feature, this triggers a warning (and they
    don't get initialized). Clang's automatic stack variable initialization
    (via CONFIG_INIT_STACK_ALL=y) doesn't throw a warning, but it also
    doesn't initialize such variables[1]. Note that these warnings (or silent
    skipping) happen before the dead-store elimination optimization phase,
    so even when the automatic initializations are later elided in favor of
    direct initializations, the warnings remain.

    To avoid these problems, move such variables into the "case" where
    they're used or lift them up into the main function body.

    net/ipv6/ip6_gre.c: In function ‘ip6gre_err’:
    net/ipv6/ip6_gre.c:440:32: warning: statement will never be executed [-Wswitch-unreachable]
    440 | struct ipv6_tlv_tnl_enc_lim *tel;
    | ^~~

    net/ipv6/ip6_tunnel.c: In function ‘ip6_tnl_err’:
    net/ipv6/ip6_tunnel.c:520:32: warning: statement will never be executed [-Wswitch-unreachable]
    520 | struct ipv6_tlv_tnl_enc_lim *tel;
    | ^~~

    [1] https://bugs.llvm.org/show_bug.cgi?id=44916

    Signed-off-by: Kees Cook
    Signed-off-by: David S. Miller

    Kees Cook
     

17 Feb, 2020

2 commits

  • When splitting an RTA_MULTIPATH request into multiple routes and adding the
    second and later components, we must not simply remove NLM_F_REPLACE but
    instead replace it by NLM_F_CREATE. Otherwise, it may look like the netlink
    message was malformed.

    For example,
    ip route add 2001:db8::1/128 dev dummy0
    ip route change 2001:db8::1/128 nexthop via fe80::30:1 dev dummy0 \
    nexthop via fe80::30:2 dev dummy0
    results in the following warnings:
    [ 1035.057019] IPv6: RTM_NEWROUTE with no NLM_F_CREATE or NLM_F_REPLACE
    [ 1035.057517] IPv6: NLM_F_CREATE should be set when creating new route

    This patch makes the nlmsg sequence look equivalent for __ip6_ins_rt() to
    what it would get if the multipath route had been added in multiple netlink
    operations:
    ip route add 2001:db8::1/128 dev dummy0
    ip route change 2001:db8::1/128 nexthop via fe80::30:1 dev dummy0
    ip route append 2001:db8::1/128 nexthop via fe80::30:2 dev dummy0

    Fixes: 27596472473a ("ipv6: fix ECMP route replacement")
    Signed-off-by: Benjamin Poirier
    Reviewed-by: Michal Kubecek
    Reviewed-by: David Ahern
    Signed-off-by: David S. Miller

    Benjamin Poirier
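
The flag handling described in the RTA_MULTIPATH commit above can be sketched as follows. The flag values are the ones from the netlink uapi header; the function name is hypothetical, and clearing NLM_F_EXCL alongside NLM_F_REPLACE is an assumption of this sketch rather than something the commit text states.

```python
# Flag values from include/uapi/linux/netlink.h.
NLM_F_REPLACE = 0x100
NLM_F_EXCL = 0x200
NLM_F_CREATE = 0x400


def per_nexthop_flags(nlmsg_flags, nexthop_count):
    """Return the flags seen by each route split out of one RTA_MULTIPATH request."""
    flags = []
    current = nlmsg_flags
    for _ in range(nexthop_count):
        flags.append(current)
        # Second and later components must look like a create, not a replace.
        current &= ~(NLM_F_EXCL | NLM_F_REPLACE)
        current |= NLM_F_CREATE
    return flags
```

With `ip route change` (NLM_F_REPLACE only), the first component keeps NLM_F_REPLACE and each later one carries NLM_F_CREATE, so the sequence no longer looks malformed.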
     
  • After commit 27596472473a ("ipv6: fix ECMP route replacement") it is no
    longer possible to replace an ECMP-able route by a non ECMP-able route.
    For example,
    ip route add 2001:db8::1/128 via fe80::1 dev dummy0
    ip route replace 2001:db8::1/128 dev dummy0
    does not work as expected.

    Tweak the replacement logic so that point 3 in the log of the above commit
    becomes:
    3. If the new route is not ECMP-able, and no matching non-ECMP-able route
    exists, replace matching ECMP-able route (if any) or add the new route.

    We can now summarize the entire replace semantics as:
    When doing a replace, prefer replacing a matching route of the same
    "ECMP-able-ness" as the replace argument. If there is no such candidate,
    fall back to the first route found.

    Fixes: 27596472473a ("ipv6: fix ECMP route replacement")
    Signed-off-by: Benjamin Poirier
    Reviewed-by: Michal Kubecek
    Signed-off-by: David S. Miller

    Benjamin Poirier
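
The summarized replace semantics above can be written as a short selection routine. This is a minimal sketch; the function name and the list-of-bools representation of the matching routes are hypothetical.

```python
def pick_replace_target(existing_ecmp_able, new_is_ecmp_able):
    """Return the index of the route to replace, or None to add a new route.

    existing_ecmp_able: list of bools, one per matching route, in table order.
    """
    # Prefer a candidate with the same "ECMP-able-ness" as the new route.
    for index, ecmp_able in enumerate(existing_ecmp_able):
        if ecmp_able == new_is_ecmp_able:
            return index
    # Otherwise fall back to the first matching route, if there is one.
    return 0 if existing_ecmp_able else None
```

For the example in the commit message, the only existing route is ECMP-able and the replacement is not, so the fallback branch picks index 0 and the replace succeeds.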
     

14 Feb, 2020

2 commits

  • With ipip, it is possible to create an extra interface explicitly
    attached to a given physical interface:

    # ip link show tunl0
    4: tunl0@NONE: mtu 1480 qdisc noop state DOWN mode DEFAULT group default qlen 1000
    link/ipip 0.0.0.0 brd 0.0.0.0
    # ip link add tunl1 type ipip dev eth0
    # ip link show tunl1
    6: tunl1@eth0: mtu 1480 qdisc noop state DOWN mode DEFAULT group default qlen 1000
    link/ipip 0.0.0.0 brd 0.0.0.0

    But it is not possible with ip6tnl:

    # ip link show ip6tnl0
    5: ip6tnl0@NONE: mtu 1452 qdisc noop state DOWN mode DEFAULT group default qlen 1000
    link/tunnel6 :: brd ::
    # ip link add ip6tnl1 type ip6tnl dev eth0
    RTNETLINK answers: File exists

    This patch aims to make it possible by adding link comparison in both
    the tunnel locate and lookup functions; we also modify the mtu calculation
    when attached to an interface with a lower mtu.

    This makes it possible to use x-netns communication by moving the newly
    created tunnel into a given netns.

    Signed-off-by: William Dauchy
    Reviewed-by: Nicolas Dichtel
    Signed-off-by: David S. Miller

    William Dauchy
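
The effect of the link comparison described in the ip6tnl commit above can be modelled with a simple lookup. This is a sketch under stated assumptions: the dict layout, the `tunnel_exists()` name and the `compare_link` switch are hypothetical and only illustrate why the old lookup reported "File exists".

```python
def tunnel_exists(tunnels, local, remote, link, compare_link=True):
    """Return True if an equivalent tunnel is already configured.

    tunnels: list of dicts with "local", "remote" and "link" keys, where
    "link" is the ifindex of the underlying device (0 when unattached).
    """
    for tunnel in tunnels:
        if tunnel["local"] != local or tunnel["remote"] != remote:
            continue
        # Without the link comparison, a tunnel with the same endpoints
        # shadows the new one even if it is attached to another device.
        if compare_link and tunnel["link"] != link:
            continue
        return True
    return False
```

With only ip6tnl0 (`::` to `::`, no link) configured, a request for the same endpoints on another device collides when the link is ignored and passes when it is compared.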
     
  • This introduces a helper function to be called only by network drivers
    that wraps calls to icmp[v6]_send in a conntrack transformation, in case
    NAT has been used. We don't want to pollute the non-driver path, though,
    so we introduce this as a helper to be called by places that actually
    make use of this, as suggested by Florian.

    Signed-off-by: Jason A. Donenfeld
    Cc: Florian Westphal
    Signed-off-by: David S. Miller

    Jason A. Donenfeld
     

08 Feb, 2020

1 commit

  • __in6_dev_get(dev) called from inet6_set_link_af() can return NULL.

    The needed check has been recently removed, let's add it back.

    While do_setlink() does call validate_linkmsg() :
    ...
    err = validate_linkmsg(dev, tb); /* OK at this point */
    ...

    It is possible that the following call happening before the
    ->set_link_af() removes IPv6 if MTU is less than 1280 :

    if (tb[IFLA_MTU]) {
            err = dev_set_mtu_ext(dev, nla_get_u32(tb[IFLA_MTU]), extack);
            if (err < 0)
                    goto errout;
            status |= DO_SETLINK_MODIFIED;
    }
    ...

    if (tb[IFLA_AF_SPEC]) {
            ...
            err = af_ops->set_link_af(dev, af);
                    ->inet6_set_link_af() // CRASH because idev is NULL

    Please note that IPv4 is immune to the bug since inet_set_link_af() does :

    struct in_device *in_dev = __in_dev_get_rcu(dev);
    if (!in_dev)
            return -EAFNOSUPPORT;

    This problem has been mentioned in commit cf7afbfeb8ce ("rtnl: make
    link af-specific updates atomic") changelog :

    This method is not fail proof, while it is currently sufficient
    to make set_link_af() inerrable and thus 100% atomic, the
    validation function method will not be able to detect all error
    scenarios in the future, there will likely always be errors
    depending on states which are f.e. not protected by rtnl_mutex
    and thus may change between validation and setting.

    IPv6: ADDRCONF(NETDEV_CHANGE): lo: link becomes ready
    general protection fault, probably for non-canonical address 0xdffffc0000000056: 0000 [#1] PREEMPT SMP KASAN
    KASAN: null-ptr-deref in range [0x00000000000002b0-0x00000000000002b7]
    CPU: 0 PID: 9698 Comm: syz-executor712 Not tainted 5.5.0-syzkaller #0
    Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
    RIP: 0010:inet6_set_link_af+0x66e/0xae0 net/ipv6/addrconf.c:5733
    Code: 38 d0 7f 08 84 c0 0f 85 20 03 00 00 48 8d bb b0 02 00 00 45 0f b6 64 24 04 48 b8 00 00 00 00 00 fc ff df 48 89 fa 48 c1 ea 03 b6 04 02 84 c0 74 08 3c 03 0f 8e 1a 03 00 00 44 89 a3 b0 02 00
    RSP: 0018:ffffc90005b06d40 EFLAGS: 00010206
    RAX: dffffc0000000000 RBX: 0000000000000000 RCX: ffffffff86df39a6
    RDX: 0000000000000056 RSI: ffffffff86df3e74 RDI: 00000000000002b0
    RBP: ffffc90005b06e70 R08: ffff8880a2ac0380 R09: ffffc90005b06db0
    R10: fffff52000b60dbe R11: ffffc90005b06df7 R12: 0000000000000000
    R13: 0000000000000000 R14: ffff8880a1fcc424 R15: dffffc0000000000
    FS: 0000000000c46880(0000) GS:ffff8880ae800000(0000) knlGS:0000000000000000
    CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: 000055f0494ca0d0 CR3: 000000009e4ac000 CR4: 00000000001406f0
    DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
    DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
    Call Trace:
    do_setlink+0x2a9f/0x3720 net/core/rtnetlink.c:2754
    rtnl_group_changelink net/core/rtnetlink.c:3103 [inline]
    __rtnl_newlink+0xdd1/0x1790 net/core/rtnetlink.c:3257
    rtnl_newlink+0x69/0xa0 net/core/rtnetlink.c:3377
    rtnetlink_rcv_msg+0x45e/0xaf0 net/core/rtnetlink.c:5438
    netlink_rcv_skb+0x177/0x450 net/netlink/af_netlink.c:2477
    rtnetlink_rcv+0x1d/0x30 net/core/rtnetlink.c:5456
    netlink_unicast_kernel net/netlink/af_netlink.c:1302 [inline]
    netlink_unicast+0x59e/0x7e0 net/netlink/af_netlink.c:1328
    netlink_sendmsg+0x91c/0xea0 net/netlink/af_netlink.c:1917
    sock_sendmsg_nosec net/socket.c:652 [inline]
    sock_sendmsg+0xd7/0x130 net/socket.c:672
    ____sys_sendmsg+0x753/0x880 net/socket.c:2343
    ___sys_sendmsg+0x100/0x170 net/socket.c:2397
    __sys_sendmsg+0x105/0x1d0 net/socket.c:2430
    __do_sys_sendmsg net/socket.c:2439 [inline]
    __se_sys_sendmsg net/socket.c:2437 [inline]
    __x64_sys_sendmsg+0x78/0xb0 net/socket.c:2437
    do_syscall_64+0xfa/0x790 arch/x86/entry/common.c:294
    entry_SYSCALL_64_after_hwframe+0x49/0xbe
    RIP: 0033:0x4402e9
    Code: 18 89 d0 c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 3d 01 f0 ff ff 0f 83 fb 13 fc ff c3 66 2e 0f 1f 84 00 00 00 00
    RSP: 002b:00007fffd62fbcf8 EFLAGS: 00000246 ORIG_RAX: 000000000000002e
    RAX: ffffffffffffffda RBX: 00000000004002c8 RCX: 00000000004402e9
    RDX: 0000000000000000 RSI: 0000000020000080 RDI: 0000000000000003
    RBP: 00000000006ca018 R08: 0000000000000008 R09: 00000000004002c8
    R10: 0000000000000005 R11: 0000000000000246 R12: 0000000000401b70
    R13: 0000000000401c00 R14: 0000000000000000 R15: 0000000000000000
    Modules linked in:
    ---[ end trace cfa7664b8fdcdff3 ]---
    RIP: 0010:inet6_set_link_af+0x66e/0xae0 net/ipv6/addrconf.c:5733
    Code: 38 d0 7f 08 84 c0 0f 85 20 03 00 00 48 8d bb b0 02 00 00 45 0f b6 64 24 04 48 b8 00 00 00 00 00 fc ff df 48 89 fa 48 c1 ea 03 b6 04 02 84 c0 74 08 3c 03 0f 8e 1a 03 00 00 44 89 a3 b0 02 00
    RSP: 0018:ffffc90005b06d40 EFLAGS: 00010206
    RAX: dffffc0000000000 RBX: 0000000000000000 RCX: ffffffff86df39a6
    RDX: 0000000000000056 RSI: ffffffff86df3e74 RDI: 00000000000002b0
    RBP: ffffc90005b06e70 R08: ffff8880a2ac0380 R09: ffffc90005b06db0
    R10: fffff52000b60dbe R11: ffffc90005b06df7 R12: 0000000000000000
    R13: 0000000000000000 R14: ffff8880a1fcc424 R15: dffffc0000000000
    FS: 0000000000c46880(0000) GS:ffff8880ae900000(0000) knlGS:0000000000000000
    CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: 0000000020000004 CR3: 000000009e4ac000 CR4: 00000000001406e0
    DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
    DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400

    Fixes: 7dc2bccab0ee ("Validate required parameters in inet6_validate_link_af")
    Signed-off-by: Eric Dumazet
    Bisected-and-reported-by: syzbot
    Cc: Maxim Mikityanskiy
    Signed-off-by: David S. Miller

    Eric Dumazet
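
The ordering problem described in the inet6_set_link_af() commit above can be reproduced in a toy model. This is a sketch, not kernel code: the `NetDevice` class is hypothetical, and the error value simply mirrors the IPv4 snippet quoted in the commit message.

```python
import errno

IPV6_MIN_MTU = 1280


class NetDevice:
    """Hypothetical stand-in for a net_device with optional IPv6 state."""

    def __init__(self, mtu=1500):
        self.mtu = mtu
        self.idev = {"token": None}  # models the inet6_dev pointer

    def set_mtu(self, mtu):
        self.mtu = mtu
        if mtu < IPV6_MIN_MTU:
            self.idev = None  # IPv6 is removed from the device


def inet6_set_link_af(dev, token):
    # Re-check here: validation ran earlier and the state may have changed.
    if dev.idev is None:
        return -errno.EAFNOSUPPORT
    dev.idev["token"] = token
    return 0


def do_setlink(dev, new_mtu, token):
    if dev.idev is None:  # validation step, OK at this point
        return -errno.EAFNOSUPPORT
    dev.set_mtu(new_mtu)  # IFLA_MTU is handled first
    return inet6_set_link_af(dev, token)  # IFLA_AF_SPEC is handled after
```

Lowering the MTU below 1280 in the same request makes the later step see no IPv6 state, so it has to return an error rather than dereference a missing pointer.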
     

30 Jan, 2020

2 commits

  • If CONFIG_MPTCP=y, CONFIG_MPTCP_IPV6=n, and CONFIG_IPV6=m:

    ERROR: "mptcp_handle_ipv6_mapped" [net/ipv6/ipv6.ko] undefined!

    This does not happen if CONFIG_MPTCP_IPV6=y, as CONFIG_MPTCP_IPV6
    selects CONFIG_IPV6, and thus forces CONFIG_IPV6 builtin.

    As exporting a symbol for an empty function would be a bit wasteful, fix
    this by providing a dummy version of mptcp_handle_ipv6_mapped() for the
    CONFIG_MPTCP_IPV6=n case.

    Rename mptcp_handle_ipv6_mapped() to mptcpv6_handle_mapped(), to make it
    clear this is a pure-IPV6 function, just like mptcpv6_init().

    Fixes: cec37a6e41aae7bf ("mptcp: Handle MP_CAPABLE options for outgoing connections")
    Signed-off-by: Geert Uytterhoeven
    Signed-off-by: David S. Miller

    Geert Uytterhoeven
     
  • We can't deal with syncookie mode yet: the syncookie rx path creates a
    tcp reqsk, so we get an OOB access because we treat the tcp reqsk as an
    mptcp reqsk:

    TCP: SYN flooding on port 20002. Sending cookies.
    BUG: KASAN: slab-out-of-bounds in subflow_syn_recv_sock+0x451/0x4d0 net/mptcp/subflow.c:191
    Read of size 1 at addr ffff8881167bc148 by task syz-executor099/2120
    subflow_syn_recv_sock+0x451/0x4d0 net/mptcp/subflow.c:191
    tcp_get_cookie_sock+0xcf/0x520 net/ipv4/syncookies.c:209
    cookie_v6_check+0x15a5/0x1e90 net/ipv6/syncookies.c:252
    tcp_v6_cookie_check net/ipv6/tcp_ipv6.c:1123 [inline]
    [..]

    Bug can be reproduced via "sysctl net.ipv4.tcp_syncookies=2".

    Note that MPTCP should work with syncookies (4th ack would carry needed
    state), but it appears better to sort that out in -next so do tcp
    fallback for now.

    I removed the MPTCP ifdef for tcp_rsk "is_mptcp" member because
    if (IS_ENABLED()) is easier to read than "#ifdef IS_ENABLED()/#endif" pair.

    Cc: Eric Dumazet
    Fixes: cec37a6e41aae7bf ("mptcp: Handle MP_CAPABLE options for outgoing connections")
    Reported-by: Christoph Paasch
    Tested-by: Christoph Paasch
    Signed-off-by: Florian Westphal
    Signed-off-by: David S. Miller

    Florian Westphal
     

27 Jan, 2020

1 commit

  • This patch extends UDP GRO to support fraglist GRO/GSO
    by using the previously introduced infrastructure.
    If the feature is enabled, all UDP packets are going to
    fraglist GRO (local input and forward).

    After validating the csum, we mark ip_summed as
    CHECKSUM_UNNECESSARY for fraglist GRO packets to
    make sure that the csum is not touched.

    Signed-off-by: Steffen Klassert
    Reviewed-by: Willem de Bruijn
    Signed-off-by: David S. Miller

    Steffen Klassert
     

26 Jan, 2020

1 commit


24 Jan, 2020

3 commits

  • Add hooks to tcp_output.c to add MP_CAPABLE to an outgoing SYN request,
    to capture the MP_CAPABLE in the received SYN-ACK, and to add MP_CAPABLE
    to the final ACK of the three-way handshake.

    Use the .sk_rx_dst_set() handler in the subflow proto to capture when the
    responding SYN-ACK is received and notify the MPTCP connection layer.

    Co-developed-by: Paolo Abeni
    Signed-off-by: Paolo Abeni
    Co-developed-by: Florian Westphal
    Signed-off-by: Florian Westphal
    Signed-off-by: Peter Krystad
    Signed-off-by: Christoph Paasch
    Signed-off-by: David S. Miller

    Peter Krystad
     
  • Implements the infrastructure for MPTCP sockets.

    MPTCP sockets open one in-kernel TCP socket per subflow. These subflow
    sockets are only managed by the MPTCP socket that owns them and are not
    visible from userspace. This commit allows a userspace program to open
    an MPTCP socket with:

    sock = socket(AF_INET, SOCK_STREAM, IPPROTO_MPTCP);

    The resulting socket is simply a wrapper around a single regular TCP
    socket, without any of the MPTCP protocol implemented over the wire.

    Co-developed-by: Florian Westphal
    Signed-off-by: Florian Westphal
    Co-developed-by: Peter Krystad
    Signed-off-by: Peter Krystad
    Co-developed-by: Matthieu Baerts
    Signed-off-by: Matthieu Baerts
    Co-developed-by: Paolo Abeni
    Signed-off-by: Paolo Abeni
    Signed-off-by: Mat Martineau
    Signed-off-by: Christoph Paasch
    Signed-off-by: David S. Miller

    Mat Martineau
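
The socket() call shown in the MPTCP infrastructure commit above has a direct Python equivalent. The sketch below assumes IPPROTO_MPTCP is 262 on Linux (newer Python versions export it from the socket module); the `open_stream_socket()` helper and its fallback to plain TCP are additions for illustration, not part of the commit.

```python
import socket

# IPPROTO_MPTCP is 262 on Linux; older Python versions do not export it.
IPPROTO_MPTCP = getattr(socket, "IPPROTO_MPTCP", 262)


def open_stream_socket():
    """Try to open an MPTCP socket, falling back to plain TCP.

    Returns (sock, is_mptcp).
    """
    try:
        sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM, IPPROTO_MPTCP)
        return sock, True
    except OSError:
        # Kernel without MPTCP support, or MPTCP disabled on this system.
        sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM, socket.IPPROTO_TCP)
        return sock, False
```

Either way the caller gets an ordinary stream socket, which matches the commit's description of the MPTCP socket as a wrapper around a regular TCP socket at this stage.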
     
  • If the seq_file .next function does not change the position index,
    a read after some lseek can generate unexpected output.

    https://bugzilla.kernel.org/show_bug.cgi?id=206283
    Signed-off-by: Vasily Averin
    Signed-off-by: David S. Miller

    Vasily Averin
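
The seq_file issue above can be pictured with a deliberately simplified iterator model. This is a toy sketch, not the seq_file implementation: `read_records()` and the `next_updates_pos` switch are hypothetical, and the model only captures the idea that a .next which leaves the position index untouched at the end causes the last record to be shown again.

```python
def read_records(records, pos, next_updates_pos):
    """One read pass over a toy record iterator, starting at `pos`.

    Returns (output, new_pos).
    """
    out = []
    while pos < len(records):
        out.append(records[pos])
        # .next: advance to the following record.
        if pos + 1 < len(records) or next_updates_pos:
            pos += 1
        else:
            break  # buggy .next: reports the end without bumping pos
    return out, pos
```

A second pass that resumes from the returned position yields nothing when the index is updated, and repeats the last record when it is not.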
     

23 Jan, 2020

1 commit

  • In the same manner as commit d0f418516022 ("net, ip_tunnel: fix
    namespaces move"), fix namespace moving, which has been broken since
    commit 8d79266bc48c ("ip6_tunnel: add collect_md mode to IPv6 tunnel"),
    but for ipv6 this time; there is no reason to keep it for ip6_tunnel.

    Fixes: 8d79266bc48c ("ip6_tunnel: add collect_md mode to IPv6 tunnel")
    Signed-off-by: William Dauchy
    Acked-by: Nicolas Dichtel
    Signed-off-by: David S. Miller

    William Dauchy
     

21 Jan, 2020

2 commits

  • After LRO/GRO is applied, SRv6 encapsulated packets have the
    SKB_GSO_IPXIP6 feature flag, and this flag must be removed right after
    the decapsulation procedure.

    Currently, the SKB_GSO_IPXIP6 flag is not removed on End.D* actions, which
    creates an inconsistent packet state, that is, normal TCP/IP packets
    have the SKB_GSO_IPXIP6 flag. This behavior can cause unexpected
    fallback to GSO on routing to netdevices that do not support
    SKB_GSO_IPXIP6. For example, on inter-VRF forwarding, decapsulated
    packets are split into small packets by GSO because VRF devices do not
    support TSO for packets with the SKB_GSO_IPXIP6 flag, and this degrades
    forwarding performance.

    This patch removes encapsulation related GSO flags from the skb right
    after the End.D* action is applied.

    Fixes: d7a669dd2f8b ("ipv6: sr: add helper functions for seg6local")
    Signed-off-by: Yuki Taguchi
    Signed-off-by: David S. Miller

    Yuki Taguchi
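
The flag clearing described in the SRv6 commit above amounts to masking bits out of the GSO type. The sketch below uses illustrative bit positions, which are an assumption here (the authoritative values live in the kernel's skbuff header), and a hypothetical `strip_encap_gso_flags()` helper covering only the IP-in-IP flags.

```python
# Illustrative bit positions only; take the real values from
# include/linux/skbuff.h for the kernel version in use.
SKB_GSO_TCPV6 = 1 << 4
SKB_GSO_IPXIP4 = 1 << 8
SKB_GSO_IPXIP6 = 1 << 9

# Subset of encapsulation-related flags chosen for this sketch.
ENCAP_GSO_FLAGS = SKB_GSO_IPXIP4 | SKB_GSO_IPXIP6


def strip_encap_gso_flags(gso_type):
    """Clear encapsulation-related GSO flags after decapsulation."""
    return gso_type & ~ENCAP_GSO_FLAGS
```

After the mask is applied, a decapsulated TCP/IPv6 packet carries only its TCP flag, so devices without SKB_GSO_IPXIP6 support no longer force a software GSO fallback.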
     
  • Steffen Klassert says:

    ====================
    pull request (net): ipsec 2020-01-21

    1) Fix packet tx through bpf_redirect() for xfrm and vti
    interfaces. From Nicolas Dichtel.

    2) Do not confirm neighbor when doing a pmtu update on a virtual
    xfrm interface. From Xu Wang.

    3) Support output_mark for offload ESP packets, this was
    forgotten when the output_mark was added initially.
    From Ulrich Weber.

    Please pull or let me know if there are problems.
    ====================

    Signed-off-by: David S. Miller

    David S. Miller
     

20 Jan, 2020

1 commit

  • Support for moving IPv4 GRE tunnels between namespaces was added in
    commit b57708add314 ("gre: add x-netns support"). The respective change
    for IPv6 tunnels, commit 22f08069e8b4 ("ip6gre: add x-netns support")
    did not drop NETIF_F_NETNS_LOCAL flag so moving them from one netns to
    another is still denied in IPv6 case. Drop NETIF_F_NETNS_LOCAL flag from
    ip6gre tunnels to allow moving ip6gre tunnel endpoints between network
    namespaces.

    Signed-off-by: Niko Kortstrom
    Acked-by: Nicolas Dichtel
    Acked-by: William Tu
    Signed-off-by: David S. Miller

    Niko Kortstrom
     

15 Jan, 2020

3 commits

  • Commit 9b42c1f179a6 ("xfrm: Extend the output_mark") added output_mark
    support but missed ESP offload support.

    xfrm_smark_get() is not called within xfrm_input() for packets coming
    from esp4_gro_receive() or esp6_gro_receive(). Therefore call
    xfrm_smark_get() directly within these functions.

    Fixes: 9b42c1f179a6 ("xfrm: Extend the output_mark to support input direction and masking.")
    Signed-off-by: Ulrich Weber
    Signed-off-by: Steffen Klassert

    Ulrich Weber
     
  • In a similar fashion to previous patch, add "offload" and "trap"
    indication to IPv6 routes.

    This is done by using two unused bits in 'struct fib6_info' to hold
    these indications. Capable drivers are expected to set these when
    processing the various in-kernel route notifications.

    Signed-off-by: Ido Schimmel
    Reviewed-by: Jiri Pirko
    Reviewed-by: David Ahern
    Acked-by: Roopa Prabhu
    Signed-off-by: David S. Miller

    Ido Schimmel
     
  • This is a straight-forward conversion case for the new function,
    iterating over the return value from udp_rcv_segment, which actually is
    a wrapper around skb_gso_segment.

    Signed-off-by: Jason A. Donenfeld
    Signed-off-by: David S. Miller

    Jason A. Donenfeld
     

14 Jan, 2020

1 commit

  • With an ebpf program that redirects packets through a vti[6] interface,
    the packets are dropped because no dst is attached.

    This could also be reproduced with an AF_PACKET socket, with the following
    python script (vti1 is an ip_vti interface):

    import socket
    send_s = socket.socket(socket.AF_PACKET, socket.SOCK_RAW, 0)
    # scapy
    # p = IP(src='10.100.0.2', dst='10.200.0.1')/ICMP(type='echo-request')
    # raw(p)
    req = b'E\x00\x00\x1c\x00\x01\x00\x00@\x01e\xb2\nd\x00\x02\n\xc8\x00\x01\x08\x00\xf7\xff\x00\x00\x00\x00'
    send_s.sendto(req, ('vti1', 0x800, 0, 0))

    Signed-off-by: Nicolas Dichtel
    Signed-off-by: Steffen Klassert

    Nicolas Dichtel
     

10 Jan, 2020

1 commit

  • MPTCP will make use of tcp_send_mss() and tcp_push() when sending
    data to specific TCP subflows.

    tcp_request_sock_ipvX_ops and ipvX_specific will be referenced
    during TCP subflow creation.

    Co-developed-by: Peter Krystad
    Signed-off-by: Peter Krystad
    Reviewed-by: Eric Dumazet
    Signed-off-by: Mat Martineau
    Signed-off-by: David S. Miller

    Mat Martineau
     

04 Jan, 2020

1 commit


03 Jan, 2020

3 commits

  • Add support for userspace to specify a device index to limit the scope
    of an entry via the TCP_MD5SIG_EXT setsockopt. The existing __tcpm_pad
    is renamed to tcpm_ifindex and the new field is only checked if the new
    TCP_MD5SIG_FLAG_IFINDEX is set in tcpm_flags. For now, the device index
    must point to an L3 master device (e.g., VRF). The API and error
    handling are set up to allow the constraint to be relaxed in the future
    to any device index.

    Signed-off-by: David Ahern
    Signed-off-by: David S. Miller

    David Ahern
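
The TCP_MD5SIG_EXT interface described above takes a packed `struct tcp_md5sig`. The sketch below shows one way to build that buffer from userspace. The constants and the field layout are assumptions based on the uapi TCP header (a 128-byte address, then flags, prefix length, key length, the new tcpm_ifindex, and an 80-byte key); `build_md5sig()` is a hypothetical helper.

```python
import socket
import struct

# Constants assumed from include/uapi/linux/tcp.h.
TCP_MD5SIG_EXT = 15
TCP_MD5SIG_FLAG_IFINDEX = 0x2
TCP_MD5SIG_MAXKEYLEN = 80
SOCKADDR_STORAGE_SIZE = 128


def build_md5sig(peer_ipv4, key, ifindex):
    """Pack a struct tcp_md5sig scoped to the device index `ifindex`."""
    if len(key) > TCP_MD5SIG_MAXKEYLEN:
        raise ValueError("key too long")
    # struct sockaddr_in (family, port, address), padded to storage size.
    addr = struct.pack("=H", socket.AF_INET)
    addr += struct.pack("!H", 0)
    addr += socket.inet_aton(peer_ipv4)
    addr = addr.ljust(SOCKADDR_STORAGE_SIZE, b"\0")
    # tcpm_flags, tcpm_prefixlen, tcpm_keylen, tcpm_ifindex (former __tcpm_pad).
    header = struct.pack("=BBHi", TCP_MD5SIG_FLAG_IFINDEX, 0, len(key), ifindex)
    return addr + header + key.ljust(TCP_MD5SIG_MAXKEYLEN, b"\0")
```

A caller would pass the result to `setsockopt(IPPROTO_TCP, TCP_MD5SIG_EXT, ...)` on a TCP socket; per the commit text, the index must currently refer to an L3 master device such as a VRF.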
     
  • Add l3index to tcp_md5sig_key to represent the L3 domain of a key, and
    add l3index to tcp_md5_do_add and tcp_md5_do_del to fill in the key.

    With the key now based on an l3index, add the new parameter to the
    lookup functions and consider the l3index when looking for a match.

    The l3index comes from the skb when processing ingress packets leveraging
    the helpers created for socket lookups, tcp_v4_sdif and inet_iif (and the
    v6 variants). When the sdif index is set it means the packet ingressed a
    device that is part of an L3 domain and inet_iif points to the VRF device.
    For egress, the L3 domain is determined from the socket binding and
    sk_bound_dev_if.

    Signed-off-by: David Ahern
    Signed-off-by: David S. Miller

    David Ahern
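
The l3index handling described above can be sketched in two small functions. Both names are hypothetical, and the exact-match policy in the lookup is a simplification assumed for this sketch; the kernel's matching rules may be more permissive.

```python
def ingress_l3index(dif, sdif):
    """L3 domain of an ingress packet.

    When sdif is set, the packet arrived on a device that is part of an
    L3 domain, and dif (inet_iif) already names the VRF device.
    """
    return dif if sdif else 0


def lookup_md5_key(keys, addr, l3index):
    """Return the first key matching both the peer address and the L3 domain.

    keys: iterable of (addr, l3index, secret) tuples.
    """
    for key_addr, key_l3index, secret in keys:
        if key_addr == addr and key_l3index == l3index:
            return secret
    return None
```

The point of the extra parameter is that the same peer address can carry different keys in different L3 domains without the entries shadowing each other.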
     
  • The original ingress device index is saved to the cb space of the skb
    and the cb is moved during tcp processing. Since tcp_v6_inbound_md5_hash
    can be called before and after the cb move, pass dif and sdif to it so
    the caller can save both prior to the cb move. Both are used by a later
    patch.

    Signed-off-by: David Ahern
    Signed-off-by: David S. Miller

    David Ahern
     

01 Jan, 2020

1 commit


25 Dec, 2019

12 commits

  • Now that mlxsw is converted to use the new FIB notifications it is
    possible to delete the old ones and use the new replace / append /
    delete notifications.

    Signed-off-by: Ido Schimmel
    Reviewed-by: Jiri Pirko
    Reviewed-by: David Ahern
    Signed-off-by: David S. Miller

    Ido Schimmel
     
  • When an entire multipath route is deleted, only emit a notification if
    it is the first route in the node. Emit a replace notification in case
    the last sibling is followed by another route. Otherwise, emit a delete
    notification.

    Signed-off-by: Ido Schimmel
    Reviewed-by: Jiri Pirko
    Reviewed-by: David Ahern
    Signed-off-by: David S. Miller

    Ido Schimmel
     
  • For the purpose of route offload, when a single route is deleted, it is
    only of interest if it is the first route in the node or if it is
    sibling to such a route.

    In the first case, distinguish between several possibilities:

    1. Route is the last route in the node. Emit a delete notification

    2. Route is followed by a non-multipath route. Emit a replace
    notification for the non-multipath route.

    3. Route is followed by a multipath route. Emit a replace notification
    for the multipath route.

    In the second case, only emit a delete notification to ensure the route
    is no longer used as a valid nexthop.

    Signed-off-by: Ido Schimmel
    Reviewed-by: Jiri Pirko
    Reviewed-by: David Ahern
    Signed-off-by: David S. Miller

    Ido Schimmel
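
The case analysis in the single-route deletion commit above maps onto a small decision function. This is a sketch only; the function name, the string labels and the tuple return value are hypothetical.

```python
def delete_notification(is_first_in_node, is_sibling_of_first, next_route):
    """Decide what to notify when a single route is deleted.

    next_route: None, "single" or "multipath", describing the route that
    follows the deleted one in the node.
    Returns (event, subject), or None when the deletion is of no interest.
    """
    if is_first_in_node:
        if next_route is None:
            return ("delete", "deleted route")
        # The following route becomes the one in use, so announce it.
        return ("replace", next_route)
    if is_sibling_of_first:
        # Ensure the route is no longer used as a valid nexthop.
        return ("delete", "deleted route")
    return None
```

Routes further down the node that are neither first nor siblings of the first route produce no notification at all in this model.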
     
  • When a new listener is registered to the FIB notification chain it
    receives a dump of all the available routes in the system. Instead, make
    sure to only replay the IPv6 routes that are actually used in the data
    path and are of any interest to the new listener.

    This is done by iterating over all the routing tables in the given
    namespace, but from each traversed node only the first route ('leaf') is
    notified. Multipath routes are notified in a single notification instead
    of one for each nexthop.

    Add fib6_rt_dump_tmp() to do that. Later on in the patch set it will be
    renamed to fib6_rt_dump() instead of the existing one.

    Signed-off-by: Ido Schimmel
    Reviewed-by: Jiri Pirko
    Reviewed-by: David Ahern
    Signed-off-by: David S. Miller

    Ido Schimmel
     
  • In a similar fashion to previous patches, only notify the new multipath
    route if it is the first route in the node or if it was appended to such
    route.

    The type of the notification (replace vs. append) is determined based on
    the number of routes added ('nhn') and the number of sibling routes. If
    the two do not match, then an append notification should be sent.

    Signed-off-by: Ido Schimmel
    Reviewed-by: Jiri Pirko
    Reviewed-by: David Ahern
    Signed-off-by: David S. Miller

    Ido Schimmel
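
The replace versus append decision in the multipath commit above can be sketched as a one-line comparison. The function name is hypothetical, and treating the sibling count as excluding the notified route itself (hence `nhn - 1`) is an assumption of this sketch.

```python
def multipath_event(nhn, nsiblings):
    """Pick the notification type for a newly added multipath route.

    nhn: number of routes (nexthops) added by this request.
    nsiblings: siblings of the notified route, not counting the route itself.
    """
    # If every route in the group came from this request, it is a new group.
    return "replace" if nsiblings == nhn - 1 else "append"
```

Adding two nexthops that end up as a group of two yields a replace, while adding one nexthop to an existing group yields an append.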
     
  • Similar to the corresponding IPv4 patch, only notify the new route if it
    is replacing the currently offloaded one. Meaning, the one pointed to by
    'fn->leaf'.

    Signed-off-by: Ido Schimmel
    Reviewed-by: Jiri Pirko
    Reviewed-by: David Ahern
    Signed-off-by: David S. Miller

    Ido Schimmel
     
  • fib6_add_rt2node() takes care of adding a single route ('struct
    fib6_info') to a FIB node. The route in question should only be notified
    in case it is added as the first route in the node (lowest metric) or if
    it is added as a sibling route to the first route in the node.

    The first criterion can be tested by checking if the route is pointed to
    by 'fn->leaf'. The second criterion can be tested by checking the new
    'notify_sibling_rt' variable that is set when the route is added as a
    sibling to the first route in the node.

    Signed-off-by: Ido Schimmel
    Reviewed-by: Jiri Pirko
    Reviewed-by: David Ahern
    Signed-off-by: David S. Miller

    Ido Schimmel
     
  • When an IPv6 tunnel PMTU update ends up calling __ip6_rt_update_pmtu(),
    we should not call dst_confirm_neigh() as there is no two-way communication.

    v5: No change.
    v4: No change.
    v3: Do not remove dst_confirm_neigh, but add a new bool parameter in
    dst_ops.update_pmtu to control whether we should do neighbor confirm.
    Also split the big patch to small ones for each area.
    v2: Remove dst_confirm_neigh in __ip6_rt_update_pmtu.

    Reviewed-by: Guillaume Nault
    Acked-by: David Ahern
    Signed-off-by: Hangbin Liu
    Signed-off-by: David S. Miller

    Hangbin Liu
     
  • When an IPv6 tunnel PMTU update ends up calling __ip6_rt_update_pmtu(),
    we should not call dst_confirm_neigh() as there is no two-way communication.

    Although vti and vti6 are immune to this problem because they are IFF_NOARP
    interfaces, as Guillaume pointed out, there is still no sense in confirming
    the neighbour here.

    v5: Update commit description.
    v4: No change.
    v3: Do not remove dst_confirm_neigh, but add a new bool parameter in
    dst_ops.update_pmtu to control whether we should do neighbor confirm.
    Also split the big patch to small ones for each area.
    v2: Remove dst_confirm_neigh in __ip6_rt_update_pmtu.

    Reviewed-by: Guillaume Nault
    Acked-by: David Ahern
    Signed-off-by: Hangbin Liu
    Signed-off-by: David S. Miller

    Hangbin Liu
     
  • When a tunnel PMTU update ends up calling __ip6_rt_update_pmtu(),
    we should not call dst_confirm_neigh() as there is no two-way communication.

    v5: No Change.
    v4: Update commit description
    v3: Do not remove dst_confirm_neigh, but add a new bool parameter in
    dst_ops.update_pmtu to control whether we should do neighbor confirm.
    Also split the big patch to small ones for each area.
    v2: Remove dst_confirm_neigh in __ip6_rt_update_pmtu.

    Fixes: 0dec879f636f ("net: use dst_confirm_neigh for UDP, RAW, ICMP, L2TP")
    Reviewed-by: Guillaume Nault
    Tested-by: Guillaume Nault
    Acked-by: David Ahern
    Signed-off-by: Hangbin Liu
    Signed-off-by: David S. Miller

    Hangbin Liu
     
  • When we do an ipv6 gre pmtu update, we currently also do a neigh confirm.
    This causes the neigh cache to be refreshed and set to REACHABLE before
    xmit.

    But if the remote mac address changed, e.g. the device is deleted and
    recreated, we will not be able to notice this and will still use the old
    mac address as the neigh cache is REACHABLE.

    Fix this by disabling neigh confirm when doing a pmtu update.

    v5: No change.
    v4: No change.
    v3: Do not remove dst_confirm_neigh, but add a new bool parameter in
    dst_ops.update_pmtu to control whether we should do neighbor confirm.
    Also split the big patch to small ones for each area.
    v2: Remove dst_confirm_neigh in __ip6_rt_update_pmtu.

    Reported-by: Jianlin Shi
    Reviewed-by: Guillaume Nault
    Acked-by: David Ahern
    Signed-off-by: Hangbin Liu
    Signed-off-by: David S. Miller

    Hangbin Liu
     
  • The MTU update code is supposed to be invoked in response to real
    networking events that update the PMTU. In IPv6 PMTU update function
    __ip6_rt_update_pmtu() we called dst_confirm_neigh() to update neighbor
    confirmed time.

    But the tunnel code calls the pmtu update before xmit, like:
    - tnl_update_pmtu()
    - skb_dst_update_pmtu()
    - ip6_rt_update_pmtu()
    - __ip6_rt_update_pmtu()
    - dst_confirm_neigh()

    If the tunnel remote dst mac address changed and we still do the neigh
    confirm, we will not be able to update the neigh cache and ping6 to the
    remote will fail.

    So for this ip_tunnel_xmit() case, _EVEN_ if the MTU is changed, we
    should not be invoking dst_confirm_neigh() as we have no evidence
    of successful two-way communication at this point.

    On the other hand it is also important to keep the neigh reachability fresh
    for TCP flows, so we cannot remove this dst_confirm_neigh() call.

    To fix the issue, we have to add a new bool parameter to dst_ops.update_pmtu
    to choose whether we should do the neigh update or not. I will add the
    parameter in this patch and set all the callers to true to preserve the
    previous behaviour, and fix the tunnel code one by one in later patches.

    v5: No change.
    v4: No change.
    v3: Do not remove dst_confirm_neigh, but add a new bool parameter in
    dst_ops.update_pmtu to control whether we should do neighbor confirm.
    Also split the big patch to small ones for each area.
    v2: Remove dst_confirm_neigh in __ip6_rt_update_pmtu.

    Suggested-by: David Miller
    Reviewed-by: Guillaume Nault
    Acked-by: David Ahern
    Signed-off-by: Hangbin Liu
    Signed-off-by: David S. Miller

    Hangbin Liu
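
The new bool parameter described in the update_pmtu commit above can be pictured with a small model. This is a sketch, not the kernel implementation: the `Dst` class, the confirmation counter and `tunnel_xmit_update_pmtu()` are hypothetical names used only to show how a caller opts out of the neighbour confirmation.

```python
class Dst:
    """Hypothetical stand-in for a dst entry with neighbour state."""

    def __init__(self, mtu):
        self.mtu = mtu
        self.neigh_confirmations = 0


def update_pmtu(dst, mtu, confirm_neigh):
    """Model of dst_ops.update_pmtu with the new bool parameter."""
    if confirm_neigh:
        # Only appropriate with evidence of two-way communication.
        dst.neigh_confirmations += 1
    if mtu < dst.mtu:
        dst.mtu = mtu


def tunnel_xmit_update_pmtu(dst, mtu):
    # Tunnel transmit path: no evidence of reachability, so do not confirm.
    update_pmtu(dst, mtu, confirm_neigh=False)
```

Callers that react to real networking events keep passing True, while the tunnel transmit path passes False and leaves the neighbour state untouched.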