05 Mar, 2020

2 commits

  • [ Upstream commit afecdb376bd81d7e16578f0cfe82a1aec7ae18f3 ]

    When splitting an RTA_MULTIPATH request into multiple routes and adding the
    second and later components, we must not simply remove NLM_F_REPLACE but
    instead replace it by NLM_F_CREATE. Otherwise, it may look like the netlink
    message was malformed.

    For example,
    ip route add 2001:db8::1/128 dev dummy0
    ip route change 2001:db8::1/128 nexthop via fe80::30:1 dev dummy0 \
    nexthop via fe80::30:2 dev dummy0
    results in the following warnings:
    [ 1035.057019] IPv6: RTM_NEWROUTE with no NLM_F_CREATE or NLM_F_REPLACE
    [ 1035.057517] IPv6: NLM_F_CREATE should be set when creating new route

    This patch makes the nlmsg sequence look equivalent for __ip6_ins_rt() to
    what it would get if the multipath route had been added in multiple netlink
    operations:
    ip route add 2001:db8::1/128 dev dummy0
    ip route change 2001:db8::1/128 nexthop via fe80::30:1 dev dummy0
    ip route append 2001:db8::1/128 nexthop via fe80::30:2 dev dummy0
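
    The flag handling described above can be sketched as follows. This is a
    minimal Python model, not the kernel code: the helper name is hypothetical,
    and only the NLM_F_* values are taken from the netlink UAPI header.

```python
# Modifier flags for netlink "new" requests (include/uapi/linux/netlink.h).
NLM_F_REPLACE = 0x100
NLM_F_EXCL = 0x200
NLM_F_CREATE = 0x400


def flags_for_later_nexthops(flags: int) -> int:
    """Flags to use for the second and later nexthops of a split request.

    Hypothetical helper: the first nexthop keeps the caller's flags, while
    the remaining ones must look like plain additions to the existing route.
    """
    flags &= ~(NLM_F_EXCL | NLM_F_REPLACE)  # the old behaviour stopped here
    flags |= NLM_F_CREATE                   # the fix: mark it as a creation
    return flags


# "ip route change" sends NLM_F_REPLACE without NLM_F_CREATE.
change_flags = NLM_F_REPLACE
later_flags = flags_for_later_nexthops(change_flags)
```

    Without the second step, later_flags would be 0, which is what triggers
    the "no NLM_F_CREATE or NLM_F_REPLACE" warning quoted above.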

    Fixes: 27596472473a ("ipv6: fix ECMP route replacement")
    Signed-off-by: Benjamin Poirier
    Reviewed-by: Michal Kubecek
    Reviewed-by: David Ahern
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Benjamin Poirier
     
  • [ Upstream commit e404b8c7cfb31654c9024d497cec58a501501692 ]

    After commit 27596472473a ("ipv6: fix ECMP route replacement") it is no
    longer possible to replace an ECMP-able route by a non ECMP-able route.
    For example,
    ip route add 2001:db8::1/128 via fe80::1 dev dummy0
    ip route replace 2001:db8::1/128 dev dummy0
    does not work as expected.

    Tweak the replacement logic so that point 3 in the log of the above commit
    becomes:
    3. If the new route is not ECMP-able, and no matching non-ECMP-able route
    exists, replace matching ECMP-able route (if any) or add the new route.

    We can now summarize the entire replace semantics as:
    When doing a replace, prefer replacing a matching route of the same
    "ECMP-able-ness" as the replace argument. If there is no such candidate,
    fall back to the first route found.
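
    The summarized semantics can be modelled roughly as below. The Route type
    and the function name are hypothetical simplifications for illustration;
    the real logic lives in the IPv6 FIB insertion path.

```python
from typing import NamedTuple, Optional, Sequence


class Route(NamedTuple):
    # Hypothetical, simplified stand-in for a FIB entry.
    name: str
    ecmp_able: bool


def pick_replace_target(matching: Sequence[Route],
                        new_is_ecmp_able: bool) -> Optional[Route]:
    """Choose which matching route a replace request should overwrite."""
    for route in matching:
        if route.ecmp_able == new_is_ecmp_able:
            return route  # same "ECMP-able-ness": preferred candidate
    # No such candidate: fall back to the first route found, if any.
    return matching[0] if matching else None
```

    A result of None means there is nothing to replace, in which case the new
    route is simply added.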

    Fixes: 27596472473a ("ipv6: fix ECMP route replacement")
    Signed-off-by: Benjamin Poirier
    Reviewed-by: Michal Kubecek
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Benjamin Poirier
     

11 Feb, 2020

1 commit

  • [ Upstream commit db3fa271022dacb9f741b96ea4714461a8911bb9 ]

    __in6_dev_get(dev) called from inet6_set_link_af() can return NULL.

    The needed check has been recently removed, let's add it back.

    While do_setlink() does call validate_linkmsg() :
    ...
    err = validate_linkmsg(dev, tb); /* OK at this point */
    ...

    It is possible that the following call happening before the
    ->set_link_af() removes IPv6 if MTU is less than 1280 :

    if (tb[IFLA_MTU]) {
        err = dev_set_mtu_ext(dev, nla_get_u32(tb[IFLA_MTU]), extack);
        if (err < 0)
            goto errout;
        status |= DO_SETLINK_MODIFIED;
    }
    ...

    if (tb[IFLA_AF_SPEC]) {
        ...
        err = af_ops->set_link_af(dev, af);
            ->inet6_set_link_af() // CRASH because idev is NULL

    Please note that IPv4 is immune to the bug since inet_set_link_af() does :

    struct in_device *in_dev = __in_dev_get_rcu(dev);
    if (!in_dev)
        return -EAFNOSUPPORT;
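
    The sequence described above (validation passes, the MTU change removes
    IPv6, then the AF-specific setter runs) can be modelled in a few lines.
    The Device class and set_link_af() below are hypothetical stand-ins.
    Returning -EAFNOSUPPORT mirrors the IPv4 check quoted above; whether the
    IPv6 fix uses the same error code is an assumption of this sketch.

```python
import errno
from typing import Optional

IPV6_MIN_MTU = 1280  # minimum link MTU required by IPv6


class Device:
    """Hypothetical stand-in for a net_device with optional IPv6 state."""

    def __init__(self, mtu: int) -> None:
        self.mtu = mtu
        self.idev: Optional[dict] = {"token": None}

    def set_mtu(self, mtu: int) -> None:
        self.mtu = mtu
        if mtu < IPV6_MIN_MTU:
            self.idev = None  # IPv6 is removed from the device


def set_link_af(dev: Device, token: str) -> int:
    """Apply an AF-specific setting, re-checking state instead of
    trusting an earlier validation step."""
    if dev.idev is None:
        return -errno.EAFNOSUPPORT
    dev.idev["token"] = token
    return 0
```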

    This problem has been mentioned in commit cf7afbfeb8ce ("rtnl: make
    link af-specific updates atomic") changelog :

    This method is not fail proof, while it is currently sufficient
    to make set_link_af() inerrable and thus 100% atomic, the
    validation function method will not be able to detect all error
    scenarios in the future, there will likely always be errors
    depending on states which are f.e. not protected by rtnl_mutex
    and thus may change between validation and setting.

    IPv6: ADDRCONF(NETDEV_CHANGE): lo: link becomes ready
    general protection fault, probably for non-canonical address 0xdffffc0000000056: 0000 [#1] PREEMPT SMP KASAN
    KASAN: null-ptr-deref in range [0x00000000000002b0-0x00000000000002b7]
    CPU: 0 PID: 9698 Comm: syz-executor712 Not tainted 5.5.0-syzkaller #0
    Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
    RIP: 0010:inet6_set_link_af+0x66e/0xae0 net/ipv6/addrconf.c:5733
    Code: 38 d0 7f 08 84 c0 0f 85 20 03 00 00 48 8d bb b0 02 00 00 45 0f b6 64 24 04 48 b8 00 00 00 00 00 fc ff df 48 89 fa 48 c1 ea 03 b6 04 02 84 c0 74 08 3c 03 0f 8e 1a 03 00 00 44 89 a3 b0 02 00
    RSP: 0018:ffffc90005b06d40 EFLAGS: 00010206
    RAX: dffffc0000000000 RBX: 0000000000000000 RCX: ffffffff86df39a6
    RDX: 0000000000000056 RSI: ffffffff86df3e74 RDI: 00000000000002b0
    RBP: ffffc90005b06e70 R08: ffff8880a2ac0380 R09: ffffc90005b06db0
    R10: fffff52000b60dbe R11: ffffc90005b06df7 R12: 0000000000000000
    R13: 0000000000000000 R14: ffff8880a1fcc424 R15: dffffc0000000000
    FS: 0000000000c46880(0000) GS:ffff8880ae800000(0000) knlGS:0000000000000000
    CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: 000055f0494ca0d0 CR3: 000000009e4ac000 CR4: 00000000001406f0
    DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
    DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
    Call Trace:
    do_setlink+0x2a9f/0x3720 net/core/rtnetlink.c:2754
    rtnl_group_changelink net/core/rtnetlink.c:3103 [inline]
    __rtnl_newlink+0xdd1/0x1790 net/core/rtnetlink.c:3257
    rtnl_newlink+0x69/0xa0 net/core/rtnetlink.c:3377
    rtnetlink_rcv_msg+0x45e/0xaf0 net/core/rtnetlink.c:5438
    netlink_rcv_skb+0x177/0x450 net/netlink/af_netlink.c:2477
    rtnetlink_rcv+0x1d/0x30 net/core/rtnetlink.c:5456
    netlink_unicast_kernel net/netlink/af_netlink.c:1302 [inline]
    netlink_unicast+0x59e/0x7e0 net/netlink/af_netlink.c:1328
    netlink_sendmsg+0x91c/0xea0 net/netlink/af_netlink.c:1917
    sock_sendmsg_nosec net/socket.c:652 [inline]
    sock_sendmsg+0xd7/0x130 net/socket.c:672
    ____sys_sendmsg+0x753/0x880 net/socket.c:2343
    ___sys_sendmsg+0x100/0x170 net/socket.c:2397
    __sys_sendmsg+0x105/0x1d0 net/socket.c:2430
    __do_sys_sendmsg net/socket.c:2439 [inline]
    __se_sys_sendmsg net/socket.c:2437 [inline]
    __x64_sys_sendmsg+0x78/0xb0 net/socket.c:2437
    do_syscall_64+0xfa/0x790 arch/x86/entry/common.c:294
    entry_SYSCALL_64_after_hwframe+0x49/0xbe
    RIP: 0033:0x4402e9
    Code: 18 89 d0 c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 3d 01 f0 ff ff 0f 83 fb 13 fc ff c3 66 2e 0f 1f 84 00 00 00 00
    RSP: 002b:00007fffd62fbcf8 EFLAGS: 00000246 ORIG_RAX: 000000000000002e
    RAX: ffffffffffffffda RBX: 00000000004002c8 RCX: 00000000004402e9
    RDX: 0000000000000000 RSI: 0000000020000080 RDI: 0000000000000003
    RBP: 00000000006ca018 R08: 0000000000000008 R09: 00000000004002c8
    R10: 0000000000000005 R11: 0000000000000246 R12: 0000000000401b70
    R13: 0000000000401c00 R14: 0000000000000000 R15: 0000000000000000
    Modules linked in:
    ---[ end trace cfa7664b8fdcdff3 ]---
    RIP: 0010:inet6_set_link_af+0x66e/0xae0 net/ipv6/addrconf.c:5733
    Code: 38 d0 7f 08 84 c0 0f 85 20 03 00 00 48 8d bb b0 02 00 00 45 0f b6 64 24 04 48 b8 00 00 00 00 00 fc ff df 48 89 fa 48 c1 ea 03 b6 04 02 84 c0 74 08 3c 03 0f 8e 1a 03 00 00 44 89 a3 b0 02 00
    RSP: 0018:ffffc90005b06d40 EFLAGS: 00010206
    RAX: dffffc0000000000 RBX: 0000000000000000 RCX: ffffffff86df39a6
    RDX: 0000000000000056 RSI: ffffffff86df3e74 RDI: 00000000000002b0
    RBP: ffffc90005b06e70 R08: ffff8880a2ac0380 R09: ffffc90005b06db0
    R10: fffff52000b60dbe R11: ffffc90005b06df7 R12: 0000000000000000
    R13: 0000000000000000 R14: ffff8880a1fcc424 R15: dffffc0000000000
    FS: 0000000000c46880(0000) GS:ffff8880ae900000(0000) knlGS:0000000000000000
    CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: 0000000020000004 CR3: 000000009e4ac000 CR4: 00000000001406e0
    DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
    DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400

    Fixes: 7dc2bccab0ee ("Validate required parameters in inet6_validate_link_af")
    Signed-off-by: Eric Dumazet
    Bisected-and-reported-by: syzbot
    Cc: Maxim Mikityanskiy
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Eric Dumazet
     

06 Feb, 2020

1 commit

  • [ Upstream commit 95224166a9032ff5d08fca633d37113078ce7d01 ]

    With an ebpf program that redirects packets through a vti[6] interface,
    the packets are dropped because no dst is attached.

    This could also be reproduced with an AF_PACKET socket, with the following
    python script (vti1 is an ip_vti interface):

    import socket
    send_s = socket.socket(socket.AF_PACKET, socket.SOCK_RAW, 0)
    # scapy
    # p = IP(src='10.100.0.2', dst='10.200.0.1')/ICMP(type='echo-request')
    # raw(p)
    req = b'E\x00\x00\x1c\x00\x01\x00\x00@\x01e\xb2\nd\x00\x02\n\xc8\x00\x01\x08\x00\xf7\xff\x00\x00\x00\x00'
    send_s.sendto(req, ('vti1', 0x800, 0, 0))
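
    As a sanity check on the raw bytes used above, the following sketch
    decodes req with the standard struct module and verifies both checksums.
    It only inspects the packet and does not need a vti interface; the
    checksum() helper is written for this example.

```python
import socket
import struct

req = b'E\x00\x00\x1c\x00\x01\x00\x00@\x01e\xb2\nd\x00\x02\n\xc8\x00\x01\x08\x00\xf7\xff\x00\x00\x00\x00'


def checksum(data: bytes) -> int:
    """Internet checksum (RFC 1071) over an even-length byte string."""
    total = sum(struct.unpack("!%dH" % (len(data) // 2), data))
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF


# Fixed 20-byte IPv4 header (IHL is 5, so there are no options).
(ver_ihl, tos, total_len, ident, frag, ttl, proto,
 ip_csum, src, dst) = struct.unpack("!BBHHHBBH4s4s", req[:20])

# 8-byte ICMP echo header that follows the IPv4 header.
icmp_type, icmp_code, icmp_csum, icmp_id, icmp_seq = struct.unpack(
    "!BBHHH", req[20:28])

src_addr = socket.inet_ntoa(src)
dst_addr = socket.inet_ntoa(dst)
```

    A checksum computed over a header that already contains a valid checksum
    field comes out as zero, which is what the checks rely on.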

    Signed-off-by: Nicolas Dichtel
    Signed-off-by: Steffen Klassert
    Signed-off-by: Sasha Levin

    Nicolas Dichtel
     

29 Jan, 2020

4 commits

  • commit 4e4362d2bf2a49ff44dbbc9585207977ca3d71d0 upstream.

    Commit 9b42c1f179a6 ("xfrm: Extend the output_mark") added output_mark
    support but missed ESP offload support.

    xfrm_smark_get() is not called within xfrm_input() for packets coming
    from esp4_gro_receive() or esp6_gro_receive(). Therefore call
    xfrm_smark_get() directly within these functions.

    Fixes: 9b42c1f179a6 ("xfrm: Extend the output_mark to support input direction and masking.")
    Signed-off-by: Ulrich Weber
    Signed-off-by: Steffen Klassert
    Signed-off-by: Greg Kroah-Hartman

    Ulrich Weber
     
  • [ Upstream commit 5311a69aaca30fa849c3cc46fb25f75727fb72d0 ]

    In the same manner as commit d0f418516022 ("net, ip_tunnel: fix
    namespaces move"), fix namespace moving for IPv6 this time, as it has
    been broken since commit 8d79266bc48c ("ip6_tunnel: add collect_md mode
    to IPv6 tunnel"); there is no reason to keep it for ip6_tunnel.

    Fixes: 8d79266bc48c ("ip6_tunnel: add collect_md mode to IPv6 tunnel")
    Signed-off-by: William Dauchy
    Acked-by: Nicolas Dichtel
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    William Dauchy
     
  • [ Upstream commit 690afc165bb314354667f67157c1a1aea7dc797a ]

    Support for moving IPv4 GRE tunnels between namespaces was added in
    commit b57708add314 ("gre: add x-netns support"). The respective change
    for IPv6 tunnels, commit 22f08069e8b4 ("ip6gre: add x-netns support")
    did not drop NETIF_F_NETNS_LOCAL flag so moving them from one netns to
    another is still denied in IPv6 case. Drop NETIF_F_NETNS_LOCAL flag from
    ip6gre tunnels to allow moving ip6gre tunnel endpoints between network
    namespaces.

    Signed-off-by: Niko Kortstrom
    Acked-by: Nicolas Dichtel
    Acked-by: William Tu
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Niko Kortstrom
     
  • [ Upstream commit 62ebaeaedee7591c257543d040677a60e35c7aec ]

    After LRO/GRO is applied, SRv6 encapsulated packets have the
    SKB_GSO_IPXIP6 feature flag, and this flag must be removed right after
    the decapsulation procedure.

    Currently, the SKB_GSO_IPXIP6 flag is not removed on End.D* actions, which
    creates an inconsistent packet state: normal TCP/IP packets carry the
    SKB_GSO_IPXIP6 flag. This behavior can cause an unexpected fallback to
    GSO when routing to netdevices that do not support SKB_GSO_IPXIP6. For
    example, on inter-VRF forwarding, decapsulated packets are split into
    small packets by GSO because VRF devices do not support TSO for packets
    with the SKB_GSO_IPXIP6 flag, and this degrades forwarding performance.

    This patch removes encapsulation related GSO flags from the skb right
    after the End.D* action is applied.

    Fixes: d7a669dd2f8b ("ipv6: sr: add helper functions for seg6local")
    Signed-off-by: Yuki Taguchi
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Yuki Taguchi
     

05 Jan, 2020

6 commits

  • [ Upstream commit 2beb6d2901a3f73106485d560c49981144aeacb1 ]

    In commit 4b1373de73a3 ("net: ipv6: addr: perform strict checks also for
    doit handlers") we added strict checks for inet6_rtm_getaddr(). But we did
    the invalid header values check before checking whether NETLINK_F_STRICT_CHK
    is set. This may break backwards compatibility if a user already sets
    ifm->ifa_prefixlen, ifm->ifa_flags, or ifm->ifa_scope in their netlink code.

    I didn't move the nlmsg_len check because I thought it's a valid check.
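
    The resulting order of checks can be sketched as below. This is a
    hypothetical Python model: the function name and parameters are invented
    for illustration, and msg_len stands for the payload length only.

```python
import errno

IFADDRMSG_LEN = 8  # sizeof(struct ifaddrmsg): four u8 fields plus a u32 index


def valid_getaddr_req(msg_len: int, strict: bool,
                      prefixlen: int, flags: int, scope: int) -> int:
    """Model of the request validation order after the fix."""
    if msg_len < IFADDRMSG_LEN:
        return -errno.EINVAL  # the length check applies to every caller
    if not strict:
        return 0              # legacy callers: header values are ignored
    if prefixlen or flags or scope:
        return -errno.EINVAL  # strict callers must leave these zero
    return 0
```

    A legacy caller that fills in ifa_prefixlen is accepted again, while a
    caller that opted in to strict checking still gets an error.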

    Reported-by: Jianlin Shi
    Fixes: 4b1373de73a3 ("net: ipv6: addr: perform strict checks also for doit handlers")
    Signed-off-by: Hangbin Liu
    Reviewed-by: David Ahern
    Signed-off-by: Jakub Kicinski
    Signed-off-by: Greg Kroah-Hartman

    Hangbin Liu
     
  • [ Upstream commit 4d42df46d6372ece4cb4279870b46c2ea7304a47 ]

    When an IPv6 tunnel does a PMTU update and ends up calling
    __ip6_rt_update_pmtu(), we should not call dst_confirm_neigh(), as there
    is no two-way communication.

    v5: No change.
    v4: No change.
    v3: Do not remove dst_confirm_neigh, but add a new bool parameter in
    dst_ops.update_pmtu to control whether we should do neighbor confirm.
    Also split the big patch to small ones for each area.
    v2: Remove dst_confirm_neigh in __ip6_rt_update_pmtu.

    Reviewed-by: Guillaume Nault
    Acked-by: David Ahern
    Signed-off-by: Hangbin Liu
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Hangbin Liu
     
  • [ Upstream commit 8247a79efa2f28b44329f363272550c1738377de ]

    When an IPv6 tunnel does a PMTU update and ends up calling
    __ip6_rt_update_pmtu(), we should not call dst_confirm_neigh(), as there
    is no two-way communication.

    vti and vti6 are immune to this problem because they are IFF_NOARP
    interfaces, as Guillaume pointed out, but it still makes no sense to
    confirm the neighbour here.

    v5: Update commit description.
    v4: No change.
    v3: Do not remove dst_confirm_neigh, but add a new bool parameter in
    dst_ops.update_pmtu to control whether we should do neighbor confirm.
    Also split the big patch to small ones for each area.
    v2: Remove dst_confirm_neigh in __ip6_rt_update_pmtu.

    Reviewed-by: Guillaume Nault
    Acked-by: David Ahern
    Signed-off-by: Hangbin Liu
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Hangbin Liu
     
  • [ Upstream commit 7a1592bcb15d71400a98632727791d1e68ea0ee8 ]

    When a tunnel does a PMTU update and ends up calling
    __ip6_rt_update_pmtu(), we should not call dst_confirm_neigh(), as there
    is no two-way communication.

    v5: No Change.
    v4: Update commit description
    v3: Do not remove dst_confirm_neigh, but add a new bool parameter in
    dst_ops.update_pmtu to control whether we should do neighbor confirm.
    Also split the big patch to small ones for each area.
    v2: Remove dst_confirm_neigh in __ip6_rt_update_pmtu.

    Fixes: 0dec879f636f ("net: use dst_confirm_neigh for UDP, RAW, ICMP, L2TP")
    Reviewed-by: Guillaume Nault
    Tested-by: Guillaume Nault
    Acked-by: David Ahern
    Signed-off-by: Hangbin Liu
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Hangbin Liu
     
  • [ Upstream commit 675d76ad0ad5bf41c9a129772ef0aba8f57ea9a7 ]

    When we do an IPv6 GRE PMTU update, we currently also confirm the neighbour.
    This causes the neigh cache to be refreshed and set to REACHABLE before
    xmit.

    But if the remote MAC address has changed, e.g. the device was deleted and
    recreated, we will not notice this and will keep using the old MAC address,
    because the neigh cache entry is REACHABLE.

    Fix this by disabling neigh confirm when doing a PMTU update.

    v5: No change.
    v4: No change.
    v3: Do not remove dst_confirm_neigh, but add a new bool parameter in
    dst_ops.update_pmtu to control whether we should do neighbor confirm.
    Also split the big patch to small ones for each area.
    v2: Remove dst_confirm_neigh in __ip6_rt_update_pmtu.

    Reported-by: Jianlin Shi
    Reviewed-by: Guillaume Nault
    Acked-by: David Ahern
    Signed-off-by: Hangbin Liu
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Hangbin Liu
     
  • [ Upstream commit bd085ef678b2cc8c38c105673dfe8ff8f5ec0c57 ]

    The MTU update code is supposed to be invoked in response to real
    networking events that update the PMTU. In IPv6 PMTU update function
    __ip6_rt_update_pmtu() we called dst_confirm_neigh() to update neighbor
    confirmed time.

    But the tunnel code updates the PMTU before xmit, like:
    - tnl_update_pmtu()
    - skb_dst_update_pmtu()
    - ip6_rt_update_pmtu()
    - __ip6_rt_update_pmtu()
    - dst_confirm_neigh()

    If the tunnel remote dst MAC address has changed and we still do the neigh
    confirm, we will not be able to update the neigh cache, and ping6 to the
    remote will fail.

    So for this ip_tunnel_xmit() case, _EVEN_ if the MTU is changed, we
    should not be invoking dst_confirm_neigh() as we have no evidence
    of successful two-way communication at this point.

    On the other hand it is also important to keep the neigh reachability fresh
    for TCP flows, so we cannot remove this dst_confirm_neigh() call.

    To fix the issue, we have to add a new bool parameter for dst_ops.update_pmtu
    to choose whether we should do neigh update or not. I will add the parameter
    in this patch and set all the callers to true to comply with the previous
    way, and fix the tunnel code one by one on later patches.
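
    The effect of the new parameter can be illustrated with a small model.
    The Dst and Neighbour classes are hypothetical stand-ins; only the idea
    of a confirm_neigh flag on update_pmtu comes from the text above.

```python
class Neighbour:
    """Hypothetical neighbour cache entry that only counts confirmations."""

    def __init__(self) -> None:
        self.confirmations = 0

    def confirm(self) -> None:
        self.confirmations += 1


class Dst:
    """Hypothetical route cache entry with a PMTU and a neighbour."""

    def __init__(self, mtu: int) -> None:
        self.mtu = mtu
        self.neigh = Neighbour()

    def update_pmtu(self, mtu: int, confirm_neigh: bool) -> None:
        if confirm_neigh:
            self.neigh.confirm()  # only with evidence of two-way traffic
        if mtu < self.mtu:
            self.mtu = mtu        # a PMTU update can only lower the MTU
```

    Existing callers pass True and behave as before; a tunnel transmit path
    would pass False so that the PMTU is lowered without refreshing the
    neighbour entry.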

    v5: No change.
    v4: No change.
    v3: Do not remove dst_confirm_neigh, but add a new bool parameter in
    dst_ops.update_pmtu to control whether we should do neighbor confirm.
    Also split the big patch to small ones for each area.
    v2: Remove dst_confirm_neigh in __ip6_rt_update_pmtu.

    Suggested-by: David Miller
    Reviewed-by: Guillaume Nault
    Acked-by: David Ahern
    Signed-off-by: Hangbin Liu
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Hangbin Liu
     

18 Dec, 2019

2 commits

  • [ Upstream commit 6c8991f41546c3c472503dff1ea9daaddf9331c2 ]

    ipv6_stub uses the ip6_dst_lookup function to allow other modules to
    perform IPv6 lookups. However, this function skips the XFRM layer
    entirely.

    All users of ipv6_stub->ip6_dst_lookup use ip_route_output_flow (via the
    ip_route_output_key and ip_route_output helpers) for their IPv4 lookups,
    which calls xfrm_lookup_route(). This patch fixes this inconsistent
    behavior by switching the stub to ip6_dst_lookup_flow, which also calls
    xfrm_lookup_route().

    This requires some changes in all the callers, as these two functions
    take different arguments and have different return types.

    Fixes: 5f81bd2e5d80 ("ipv6: export a stub for IPv6 symbols used by vxlan")
    Reported-by: Xiumei Mu
    Signed-off-by: Sabrina Dubroca
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Sabrina Dubroca
     
  • [ Upstream commit c4e85f73afb6384123e5ef1bba3315b2e3ad031e ]

    This will be used in the conversion of ipv6_stub to ip6_dst_lookup_flow,
    as some modules currently pass a net argument without a socket to
    ip6_dst_lookup. This is equivalent to commit 343d60aada5a ("ipv6: change
    ipv6_stub_impl.ipv6_dst_lookup to take net argument").

    Signed-off-by: Sabrina Dubroca
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Sabrina Dubroca
     

22 Nov, 2019

1 commit


21 Nov, 2019

1 commit

  • Previously we returned directly if (!rt || !rt->fib6_nh.fib_nh_gw_family)
    in function rt6_probe(), but after commit cc3a86c802f0
    ("ipv6: Change rt6_probe to take a fib6_nh"), the logic changed to
    return if fib_nh_gw_family is set. Restore the original behavior and
    return if there is no fib_nh_gw_family.
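
    The inverted condition is easy to see side by side. Both functions below
    are hypothetical Python models of the early-return test, taking the
    gateway family as an integer (0 meaning "no gateway").

```python
import socket


def skip_probe_regressed(gw_family: int) -> bool:
    # Regressed logic as described above: bail out when a gateway IS set.
    return bool(gw_family)


def skip_probe_fixed(gw_family: int) -> bool:
    # Original intent: bail out when there is NO gateway to probe.
    return not gw_family


has_gateway = int(socket.AF_INET6)
no_gateway = 0
```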

    Fixes: cc3a86c802f0 ("ipv6: Change rt6_probe to take a fib6_nh")
    Signed-off-by: Hangbin Liu
    Reviewed-by: David Ahern
    Signed-off-by: David S. Miller

    Hangbin Liu
     

17 Nov, 2019

2 commits

  • In the receive path (more precisely in ip6_rcv_core()) the
    skb->transport_header is set to skb->network_header + sizeof(*hdr). As a
    consequence, after routing operations, destination input expects to find
    skb->transport_header correctly set to the next protocol (or extension
    header) that follows the network protocol. However, decap behaviors (DX*,
    DT*) remove the outer IPv6 and SRH extension and do not set the
    skb->transport_header pointer correctly again. For this reason, the patch
    sets the skb->transport_header to skb->network_header + sizeof(*hdr) in
    each DX* and DT* behavior.

    Signed-off-by: Andrea Mayer
    Signed-off-by: David S. Miller

    Andrea Mayer
     
  • pskb_may_pull may change pointers in header. For this reason, it is
    mandatory to reload any pointer that points into skb header.

    Signed-off-by: Andrea Mayer
    Signed-off-by: David S. Miller

    Andrea Mayer
     

08 Nov, 2019

1 commit

  • While looking at a syzbot KCSAN report [1], I found multiple
    issues in this code :

    1) fib6_nh->last_probe has an initial value of 0.

    While probably okay on 64bit kernels, this causes an issue
    on 32bit kernels since the time_after(jiffies, 0 + interval)
    might be false ~24 days after boot (for HZ=1000)

    2) The data-race found by KCSAN
    I could use READ_ONCE() and WRITE_ONCE(), but we also can
    take the opportunity of not piling-up too many rt6_probe_deferred()
    works by using instead cmpxchg() so that only one cpu wins the race.
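
    Issue 1 can be reproduced with plain integer arithmetic. The sketch below
    models time_after() for a 32-bit unsigned long in Python; the helper
    names are made up for this example, and the jiffies values are simply
    counter values chosen to show the wraparound.

```python
HZ = 1000
MASK32 = 0xFFFFFFFF


def to_signed32(value: int) -> int:
    """Reinterpret the low 32 bits of value as a signed integer."""
    value &= MASK32
    return value - (1 << 32) if value & 0x80000000 else value


def time_after32(a: int, b: int) -> bool:
    """Model of time_after(a, b) on a 32-bit kernel: (long)(b - a) < 0."""
    return to_signed32(b - a) < 0


interval = 100
last_probe = 0  # the problematic initial value

early = time_after32(1000, last_probe + interval)
late = time_after32((1 << 31) + 200, last_probe + interval)

# Time needed for a 32-bit counter to advance by 2**31 ticks.
days_until_half_wrap = (1 << 31) / HZ / 86400
```

    Once the counter is more than 2**31 ticks ahead of last_probe, the
    comparison flips to false even though the interval elapsed long ago,
    which matches the "~24 days" figure for HZ=1000.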

    [1]
    BUG: KCSAN: data-race in find_match / find_match

    write to 0xffff8880bb7aabe8 of 8 bytes by interrupt on cpu 1:
    rt6_probe net/ipv6/route.c:663 [inline]
    find_match net/ipv6/route.c:757 [inline]
    find_match+0x5bd/0x790 net/ipv6/route.c:733
    __find_rr_leaf+0xe3/0x780 net/ipv6/route.c:831
    find_rr_leaf net/ipv6/route.c:852 [inline]
    rt6_select net/ipv6/route.c:896 [inline]
    fib6_table_lookup+0x383/0x650 net/ipv6/route.c:2164
    ip6_pol_route+0xee/0x5c0 net/ipv6/route.c:2200
    ip6_pol_route_output+0x48/0x60 net/ipv6/route.c:2452
    fib6_rule_lookup+0x3d6/0x470 net/ipv6/fib6_rules.c:117
    ip6_route_output_flags_noref+0x16b/0x230 net/ipv6/route.c:2484
    ip6_route_output_flags+0x50/0x1a0 net/ipv6/route.c:2497
    ip6_dst_lookup_tail+0x25d/0xc30 net/ipv6/ip6_output.c:1049
    ip6_dst_lookup_flow+0x68/0x120 net/ipv6/ip6_output.c:1150
    inet6_csk_route_socket+0x2f7/0x420 net/ipv6/inet6_connection_sock.c:106
    inet6_csk_xmit+0x91/0x1f0 net/ipv6/inet6_connection_sock.c:121
    __tcp_transmit_skb+0xe81/0x1d60 net/ipv4/tcp_output.c:1169
    tcp_transmit_skb net/ipv4/tcp_output.c:1185 [inline]
    tcp_xmit_probe_skb+0x19b/0x1d0 net/ipv4/tcp_output.c:3735

    read to 0xffff8880bb7aabe8 of 8 bytes by interrupt on cpu 0:
    rt6_probe net/ipv6/route.c:657 [inline]
    find_match net/ipv6/route.c:757 [inline]
    find_match+0x521/0x790 net/ipv6/route.c:733
    __find_rr_leaf+0xe3/0x780 net/ipv6/route.c:831
    find_rr_leaf net/ipv6/route.c:852 [inline]
    rt6_select net/ipv6/route.c:896 [inline]
    fib6_table_lookup+0x383/0x650 net/ipv6/route.c:2164
    ip6_pol_route+0xee/0x5c0 net/ipv6/route.c:2200
    ip6_pol_route_output+0x48/0x60 net/ipv6/route.c:2452
    fib6_rule_lookup+0x3d6/0x470 net/ipv6/fib6_rules.c:117
    ip6_route_output_flags_noref+0x16b/0x230 net/ipv6/route.c:2484
    ip6_route_output_flags+0x50/0x1a0 net/ipv6/route.c:2497
    ip6_dst_lookup_tail+0x25d/0xc30 net/ipv6/ip6_output.c:1049
    ip6_dst_lookup_flow+0x68/0x120 net/ipv6/ip6_output.c:1150
    inet6_csk_route_socket+0x2f7/0x420 net/ipv6/inet6_connection_sock.c:106
    inet6_csk_xmit+0x91/0x1f0 net/ipv6/inet6_connection_sock.c:121
    __tcp_transmit_skb+0xe81/0x1d60 net/ipv4/tcp_output.c:1169

    Reported by Kernel Concurrency Sanitizer on:
    CPU: 0 PID: 18894 Comm: udevd Not tainted 5.4.0-rc3+ #0
    Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011

    Fixes: cc3a86c802f0 ("ipv6: Change rt6_probe to take a fib6_nh")
    Fixes: f547fac624be ("ipv6: rate-limit probes for neighbourless routes")
    Signed-off-by: Eric Dumazet
    Reported-by: syzbot
    Reviewed-by: David Ahern
    Signed-off-by: David S. Miller

    Eric Dumazet
     

31 Oct, 2019

1 commit

  • This socket field can be read and written by concurrent cpus.

    Use READ_ONCE() and WRITE_ONCE() annotations to document this,
    and avoid some compiler 'optimizations'.

    KCSAN reported :

    BUG: KCSAN: data-race in tcp_v4_rcv / tcp_v4_rcv

    write to 0xffff88812220763c of 4 bytes by interrupt on cpu 0:
    sk_incoming_cpu_update include/net/sock.h:953 [inline]
    tcp_v4_rcv+0x1b3c/0x1bb0 net/ipv4/tcp_ipv4.c:1934
    ip_protocol_deliver_rcu+0x4d/0x420 net/ipv4/ip_input.c:204
    ip_local_deliver_finish+0x110/0x140 net/ipv4/ip_input.c:231
    NF_HOOK include/linux/netfilter.h:305 [inline]
    NF_HOOK include/linux/netfilter.h:299 [inline]
    ip_local_deliver+0x133/0x210 net/ipv4/ip_input.c:252
    dst_input include/net/dst.h:442 [inline]
    ip_rcv_finish+0x121/0x160 net/ipv4/ip_input.c:413
    NF_HOOK include/linux/netfilter.h:305 [inline]
    NF_HOOK include/linux/netfilter.h:299 [inline]
    ip_rcv+0x18f/0x1a0 net/ipv4/ip_input.c:523
    __netif_receive_skb_one_core+0xa7/0xe0 net/core/dev.c:5010
    __netif_receive_skb+0x37/0xf0 net/core/dev.c:5124
    process_backlog+0x1d3/0x420 net/core/dev.c:5955
    napi_poll net/core/dev.c:6392 [inline]
    net_rx_action+0x3ae/0xa90 net/core/dev.c:6460
    __do_softirq+0x115/0x33f kernel/softirq.c:292
    do_softirq_own_stack+0x2a/0x40 arch/x86/entry/entry_64.S:1082
    do_softirq.part.0+0x6b/0x80 kernel/softirq.c:337
    do_softirq kernel/softirq.c:329 [inline]
    __local_bh_enable_ip+0x76/0x80 kernel/softirq.c:189

    read to 0xffff88812220763c of 4 bytes by interrupt on cpu 1:
    sk_incoming_cpu_update include/net/sock.h:952 [inline]
    tcp_v4_rcv+0x181a/0x1bb0 net/ipv4/tcp_ipv4.c:1934
    ip_protocol_deliver_rcu+0x4d/0x420 net/ipv4/ip_input.c:204
    ip_local_deliver_finish+0x110/0x140 net/ipv4/ip_input.c:231
    NF_HOOK include/linux/netfilter.h:305 [inline]
    NF_HOOK include/linux/netfilter.h:299 [inline]
    ip_local_deliver+0x133/0x210 net/ipv4/ip_input.c:252
    dst_input include/net/dst.h:442 [inline]
    ip_rcv_finish+0x121/0x160 net/ipv4/ip_input.c:413
    NF_HOOK include/linux/netfilter.h:305 [inline]
    NF_HOOK include/linux/netfilter.h:299 [inline]
    ip_rcv+0x18f/0x1a0 net/ipv4/ip_input.c:523
    __netif_receive_skb_one_core+0xa7/0xe0 net/core/dev.c:5010
    __netif_receive_skb+0x37/0xf0 net/core/dev.c:5124
    process_backlog+0x1d3/0x420 net/core/dev.c:5955
    napi_poll net/core/dev.c:6392 [inline]
    net_rx_action+0x3ae/0xa90 net/core/dev.c:6460
    __do_softirq+0x115/0x33f kernel/softirq.c:292
    run_ksoftirqd+0x46/0x60 kernel/softirq.c:603
    smpboot_thread_fn+0x37d/0x4a0 kernel/smpboot.c:165

    Reported by Kernel Concurrency Sanitizer on:
    CPU: 1 PID: 16 Comm: ksoftirqd/1 Not tainted 5.4.0-rc3+ #0
    Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011

    Signed-off-by: Eric Dumazet
    Reported-by: syzbot
    Signed-off-by: David S. Miller

    Eric Dumazet
     

30 Oct, 2019

1 commit

  • The check for !md doesn't really work for ip_tunnel_info_opts(info), which
    only does info + 1. Also, to avoid out-of-bounds access on info, the code
    should ensure options_len is not less than the size of erspan_metadata in
    both erspan_xmit() and ip6erspan_tunnel_xmit().

    Fixes: 1a66a836da ("gre: add collect_md mode to ERSPAN tunnel")
    Signed-off-by: Xin Long
    Signed-off-by: David S. Miller

    Xin Long
     

23 Oct, 2019

1 commit

  • Include for the missing declarations of
    various functions. Fixes the following sparse warnings:

    net/ipv6/addrconf_core.c:94:5: warning: symbol 'register_inet6addr_notifier' was not declared. Should it be static?
    net/ipv6/addrconf_core.c:100:5: warning: symbol 'unregister_inet6addr_notifier' was not declared. Should it be static?
    net/ipv6/addrconf_core.c:106:5: warning: symbol 'inet6addr_notifier_call_chain' was not declared. Should it be static?
    net/ipv6/addrconf_core.c:112:5: warning: symbol 'register_inet6addr_validator_notifier' was not declared. Should it be static?
    net/ipv6/addrconf_core.c:118:5: warning: symbol 'unregister_inet6addr_validator_notifier' was not declared. Should it be static?
    net/ipv6/addrconf_core.c:125:5: warning: symbol 'inet6addr_validator_notifier_call_chain' was not declared. Should it be static?
    net/ipv6/addrconf_core.c:237:6: warning: symbol 'in6_dev_finish_destroy' was not declared. Should it be static?

    Signed-off-by: Ben Dooks (Codethink)
    Signed-off-by: Jakub Kicinski

    Ben Dooks (Codethink)
     

19 Oct, 2019

1 commit

  • Thomas found that some forwarded packets would be stuck
    in FQ packet scheduler because their skb->tstamp contained
    timestamps far in the future.

    We thought we addressed this point in commit 8203e2d844d3
    ("net: clear skb->tstamp in forwarding paths") but there
    is still an issue when/if a packet needs to be fragmented.

    In order to meet EDT requirements, we have to make sure all
    fragments get the original skb->tstamp.

    Note that this original skb->tstamp should be zero in
    forwarding path, but might have a non zero value in
    output path if user decided so.
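
    The requirement that every fragment inherits the original timestamp can
    be sketched as follows. Packet is a hypothetical, heavily simplified
    stand-in for an skb, and fragment() only splits the payload.

```python
from typing import List, NamedTuple


class Packet(NamedTuple):
    # Hypothetical, simplified stand-in for an skb.
    payload: bytes
    tstamp: int  # delivery time used by the scheduler (0 means unset)


def fragment(pkt: Packet, max_payload: int) -> List[Packet]:
    """Split a packet and give every fragment the original tstamp."""
    return [
        Packet(pkt.payload[off:off + max_payload], pkt.tstamp)
        for off in range(0, len(pkt.payload), max_payload)
    ]
```

    In the forwarding path the original tstamp is expected to be zero, so all
    fragments carry zero; in the output path a user-chosen value is copied to
    each fragment unchanged.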

    Fixes: fb420d5d91c1 ("tcp/fq: move back to CLOCK_MONOTONIC")
    Signed-off-by: Eric Dumazet
    Reported-by: Thomas Bartschies
    Signed-off-by: David S. Miller

    Eric Dumazet
     

17 Oct, 2019

1 commit

  • This reverts commit b0818f80c8c1bc215bba276bd61c216014fab23b.

    Started seeing weird behavior after this patch especially in
    the IPv6 code path. Haven't root caused it, but since this was
    applied to net branch, taking a precautionary measure to revert
    it and look / analyze those failures

    Revert this now and I'll send a better fix after analysing / fixing
    the weirdness observed.

    CC: Eric Dumazet
    CC: Wei Wang
    CC: David S. Miller
    Signed-off-by: Mahesh Bandewar
    Signed-off-by: David S. Miller

    Mahesh Bandewar
     

16 Oct, 2019

1 commit

  • While invalidating the dst, we assign blackhole_netdev instead of the
    loopback device. However, this device does not have an idev pointer
    and hence no ip6_ptr, even if IPv6 is enabled. This has possibly
    triggered the syzbot-reported crash.

    The syzbot report does not have a reproducer; however, this is the
    only device that doesn't have a matching idev created.

    Crash instruction is :

    static inline bool ip6_ignore_linkdown(const struct net_device *dev)
    {
        const struct inet6_dev *idev = __in6_dev_get(dev);

        return !!idev->cnf.ignore_routes_with_linkdown;
    }

    Mahesh Bandewar
     

14 Oct, 2019

4 commits

  • There are a few places where we fetch tp->write_seq while
    this field can change from IRQ or another CPU.

    We need to add READ_ONCE() annotations, and also make
    sure write sides use corresponding WRITE_ONCE() to avoid
    store-tearing.

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     
  • There are a few places where we fetch tp->copied_seq while
    this field can change from IRQ or another CPU.

    We need to add READ_ONCE() annotations, and also make
    sure write sides use corresponding WRITE_ONCE() to avoid
    store-tearing.

    Note that tcp_inq_hint() was already using READ_ONCE(tp->copied_seq)

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     
  • There are a few places where we fetch tp->rcv_nxt while
    this field can change from IRQ context or on another CPU.

    We need to add READ_ONCE() annotations, and also make
    sure the write sides use the corresponding WRITE_ONCE() to avoid
    store-tearing.

    Note that tcp_inq_hint() was already using READ_ONCE(tp->rcv_nxt)

    syzbot reported :

    BUG: KCSAN: data-race in tcp_poll / tcp_queue_rcv

    write to 0xffff888120425770 of 4 bytes by interrupt on cpu 0:
    tcp_rcv_nxt_update net/ipv4/tcp_input.c:3365 [inline]
    tcp_queue_rcv+0x180/0x380 net/ipv4/tcp_input.c:4638
    tcp_rcv_established+0xbf1/0xf50 net/ipv4/tcp_input.c:5616
    tcp_v4_do_rcv+0x381/0x4e0 net/ipv4/tcp_ipv4.c:1542
    tcp_v4_rcv+0x1a03/0x1bf0 net/ipv4/tcp_ipv4.c:1923
    ip_protocol_deliver_rcu+0x51/0x470 net/ipv4/ip_input.c:204
    ip_local_deliver_finish+0x110/0x140 net/ipv4/ip_input.c:231
    NF_HOOK include/linux/netfilter.h:305 [inline]
    NF_HOOK include/linux/netfilter.h:299 [inline]
    ip_local_deliver+0x133/0x210 net/ipv4/ip_input.c:252
    dst_input include/net/dst.h:442 [inline]
    ip_rcv_finish+0x121/0x160 net/ipv4/ip_input.c:413
    NF_HOOK include/linux/netfilter.h:305 [inline]
    NF_HOOK include/linux/netfilter.h:299 [inline]
    ip_rcv+0x18f/0x1a0 net/ipv4/ip_input.c:523
    __netif_receive_skb_one_core+0xa7/0xe0 net/core/dev.c:5004
    __netif_receive_skb+0x37/0xf0 net/core/dev.c:5118
    netif_receive_skb_internal+0x59/0x190 net/core/dev.c:5208
    napi_skb_finish net/core/dev.c:5671 [inline]
    napi_gro_receive+0x28f/0x330 net/core/dev.c:5704
    receive_buf+0x284/0x30b0 drivers/net/virtio_net.c:1061

    read to 0xffff888120425770 of 4 bytes by task 7254 on cpu 1:
    tcp_stream_is_readable net/ipv4/tcp.c:480 [inline]
    tcp_poll+0x204/0x6b0 net/ipv4/tcp.c:554
    sock_poll+0xed/0x250 net/socket.c:1256
    vfs_poll include/linux/poll.h:90 [inline]
    ep_item_poll.isra.0+0x90/0x190 fs/eventpoll.c:892
    ep_send_events_proc+0x113/0x5c0 fs/eventpoll.c:1749
    ep_scan_ready_list.constprop.0+0x189/0x500 fs/eventpoll.c:704
    ep_send_events fs/eventpoll.c:1793 [inline]
    ep_poll+0xe3/0x900 fs/eventpoll.c:1930
    do_epoll_wait+0x162/0x180 fs/eventpoll.c:2294
    __do_sys_epoll_pwait fs/eventpoll.c:2325 [inline]
    __se_sys_epoll_pwait fs/eventpoll.c:2311 [inline]
    __x64_sys_epoll_pwait+0xcd/0x170 fs/eventpoll.c:2311
    do_syscall_64+0xcf/0x2f0 arch/x86/entry/common.c:296
    entry_SYSCALL_64_after_hwframe+0x44/0xa9

    Reported by Kernel Concurrency Sanitizer on:
    CPU: 1 PID: 7254 Comm: syz-fuzzer Not tainted 5.3.0+ #0
    Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011

    Signed-off-by: Eric Dumazet
    Reported-by: syzbot
    Signed-off-by: David S. Miller

    Eric Dumazet
     
  • Both tcp_v4_err() and tcp_v6_err() do the following operations
    while they do not own the socket lock :

    fastopen = tp->fastopen_rsk;
    snd_una = fastopen ? tcp_rsk(fastopen)->snt_isn : tp->snd_una;

    The problem is that without an appropriate barrier, the compiler
    might reload tp->fastopen_rsk after the NULL test and trigger a
    NULL deref.

    Request sockets are protected by RCU, so we can simply add
    the missing annotations and barriers to solve the issue.
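
    The hazard and its remedy can be modelled in userspace: load the
    pointer exactly once into a local, then test and dereference only
    that local. READ_ONCE() here is a simplified volatile-cast stand-in
    written for this sketch (GCC/Clang __typeof__ extension); in the
    kernel an RCU accessor would play this role, and the struct and
    function names are hypothetical, with tcp_rsk() flattened away.

    ```c
    #include <stddef.h>
    #include <stdint.h>

    /* Simplified stand-in for the kernel macro (assumption). */
    #define READ_ONCE(x) (*(const volatile __typeof__(x) *)&(x))

    /* Hypothetical reduced request socket and TCP socket. */
    struct request_sock_model {
        uint32_t snt_isn;
    };

    struct tcp_sock_model {
        struct request_sock_model *fastopen_rsk; /* may be cleared concurrently */
        uint32_t snd_una;
    };

    /* One load of fastopen_rsk; the NULL test and the dereference both
     * use the same local copy, so the compiler cannot re-fetch it. */
    uint32_t model_err_snd_una(struct tcp_sock_model *tp)
    {
        struct request_sock_model *fastopen = READ_ONCE(tp->fastopen_rsk);

        return fastopen ? fastopen->snt_isn : tp->snd_una;
    }
    ```

    This only addresses the reload; keeping the pointed-to object alive
    is what the RCU protection mentioned above is for.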

    Fixes: 168a8f58059a ("tcp: TCP Fast Open Server - main code path")
    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

09 Oct, 2019

1 commit

  • The ip6erspan driver calls ether_setup(). After commit 61e84623ace3
    ("net: centralize net_device min/max MTU checking"), the allowed MTU
    range is [min_mtu, max_mtu], which is [68, 1500] by default.

    As a result, the MTU of the erspan device cannot be set higher
    than 1500; this limit is not correct for the ip6erspan tap
    device.
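
    The effect of the centralized range check can be sketched in
    userspace. The struct, the MODEL_* constants and both functions are
    hypothetical stand-ins built from the defaults cited above; the
    limit the patch actually chooses for ip6erspan is not stated here.

    ```c
    #include <errno.h>

    #define MODEL_ETH_MIN_MTU  68    /* default min_mtu cited above */
    #define MODEL_ETH_DATA_LEN 1500  /* default max_mtu cited above */

    /* Hypothetical reduced net_device with only the MTU fields. */
    struct net_device_model {
        unsigned int mtu;
        unsigned int min_mtu;
        unsigned int max_mtu;
    };

    /* Stand-in for ether_setup(): installs the Ethernet defaults. */
    void model_ether_setup(struct net_device_model *dev)
    {
        dev->mtu = MODEL_ETH_DATA_LEN;
        dev->min_mtu = MODEL_ETH_MIN_MTU;
        dev->max_mtu = MODEL_ETH_DATA_LEN;
    }

    /* Centralized check: reject anything outside [min_mtu, max_mtu]. */
    int model_change_mtu(struct net_device_model *dev, unsigned int new_mtu)
    {
        if (new_mtu < dev->min_mtu || new_mtu > dev->max_mtu)
            return -EINVAL;
        dev->mtu = new_mtu;
        return 0;
    }
    ```

    A driver that needs larger MTUs has to raise max_mtu itself after
    calling the setup helper; otherwise every request above 1500 fails.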

    Fixes: 61e84623ace3 ("net: centralize net_device min/max MTU checking")
    Signed-off-by: Haishuang Yan
    Acked-by: William Tu
    Signed-off-by: Jakub Kicinski

    Haishuang Yan
     

05 Oct, 2019

2 commits

  • Rajendra reported a kernel panic when a link was taken down:

    [ 6870.263084] BUG: unable to handle kernel NULL pointer dereference at 00000000000000a8
    [ 6870.271856] IP: [] __ipv6_ifa_notify+0x154/0x290

    [ 6870.570501] Call Trace:
    [ 6870.573238] [] ? ipv6_ifa_notify+0x26/0x40
    [ 6870.579665] [] ? addrconf_dad_completed+0x4c/0x2c0
    [ 6870.586869] [] ? ipv6_dev_mc_inc+0x196/0x260
    [ 6870.593491] [] ? addrconf_dad_work+0x10a/0x430
    [ 6870.600305] [] ? __switch_to_asm+0x34/0x70
    [ 6870.606732] [] ? process_one_work+0x18a/0x430
    [ 6870.613449] [] ? worker_thread+0x4d/0x490
    [ 6870.619778] [] ? process_one_work+0x430/0x430
    [ 6870.626495] [] ? kthread+0xd9/0xf0
    [ 6870.632145] [] ? __switch_to_asm+0x34/0x70
    [ 6870.638573] [] ? kthread_park+0x60/0x60
    [ 6870.644707] [] ? ret_from_fork+0x57/0x70
    [ 6870.650936] Code: 31 c0 31 d2 41 b9 20 00 08 02 b9 09 00 00 0

    addrconf_dad_work is kicked to be scheduled when a device is brought
    up. There is a race between addrconf_dad_work getting scheduled and
    taking the rtnl lock and a process taking the link down (under rtnl).
    The latter removes the host route from the inet6_addr as part of
    addrconf_ifdown which is run for NETDEV_DOWN. The former attempts
    to use the host route in __ipv6_ifa_notify. If the down event removes
    the host route due to the race to the rtnl, then the BUG listed above
    occurs.

    Since the DAD sequence can not be aborted, add a check for the missing
    host route in __ipv6_ifa_notify. The only way this should happen is due
    to the previously mentioned race. The host route is created when the
    address is added to an interface; it is only removed on a down event
    where the address is kept. Add a warning if the host route is missing
    AND the device is up; this is a situation that should never happen.
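
    The decision described above can be written down as a small
    userspace model. All names are hypothetical and the structures keep
    only what the paragraph mentions: whether the host route is present
    and whether the device is up.

    ```c
    #include <stdbool.h>
    #include <stddef.h>

    /* Hypothetical reduced host route and address entry. */
    struct route_model {
        int id;
    };

    struct inet6_ifaddr_model {
        struct route_model *rt;  /* host route, NULL once removed on down */
        bool dev_up;             /* whether the device is currently up */
    };

    enum notify_result {
        NOTIFY_ROUTE_USED,           /* host route present, proceed */
        NOTIFY_SKIPPED,              /* route gone because of the down race */
        NOTIFY_SKIPPED_WITH_WARNING  /* route gone although device is up */
    };

    /* Check for the missing host route before using it; warn only in
     * the case that should never happen (missing route, device up). */
    enum notify_result model_ifa_notify(const struct inet6_ifaddr_model *ifp)
    {
        if (!ifp->rt)
            return ifp->dev_up ? NOTIFY_SKIPPED_WITH_WARNING : NOTIFY_SKIPPED;
        return NOTIFY_ROUTE_USED;
    }
    ```

    The DAD work itself is left alone; only the consumer of the host
    route tolerates its absence.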

    Fixes: f1705ec197e7 ("net: ipv6: Make address flushing on ifdown optional")
    Reported-by: Rajendra Dendukuri
    Signed-off-by: David Ahern
    Reviewed-by: Eric Dumazet
    Signed-off-by: David S. Miller

    David Ahern
     
  • This reverts commit a3ce2a21bb8969ae27917281244fa91bf5f286d7.

    Eric reported test failures with this commit. After digging into it,
    the bottom line is that the DAD sequence is not to be messed with.
    There are too many cases that are expected to proceed regardless
    of whether a device is up.

    Revert the patch and I will send a different solution for the
    problem Rajendra reported.

    Signed-off-by: David Ahern
    Cc: Eric Dumazet
    Reviewed-by: Eric Dumazet
    Signed-off-by: David S. Miller

    David Ahern
     

03 Oct, 2019

4 commits

  • Prior to this change an application sending 1 [...] even
    if the application has enabled segmentation. I've also updated the
    relevant udpgso selftests.

    Fixes: bec1f6f69736 ("udp: generate gso with UDP_SEGMENT")
    Signed-off-by: Josh Hunt
    Reviewed-by: Willem de Bruijn
    Reviewed-by: Alexander Duyck
    Signed-off-by: David S. Miller

    Josh Hunt
     
  • Commit dfec0ee22c0a ("udp: Record gso_segs when supporting UDP segmentation offload")
    added the gso_segs calculation, but incorrectly took sizeof() of the
    pointer and not of the underlying data type. In addition, let's fix
    the v6 case.
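
    The sizeof() mix-up is easy to reproduce in userspace. The sketch
    assumes datalen counts the UDP header plus payload; struct
    udphdr_model and model_gso_segs() are hypothetical names, and
    DIV_ROUND_UP is redefined locally.

    ```c
    #include <stddef.h>
    #include <stdint.h>

    /* Hypothetical stand-in for the UDP header: four 16-bit fields. */
    struct udphdr_model {
        uint16_t source;
        uint16_t dest;
        uint16_t len;
        uint16_t check;
    };

    #define DIV_ROUND_UP(n, d) (((n) + (d) - 1) / (d))

    /* Number of segments needed for the payload. sizeof(*uh) is the
     * size of the header structure; sizeof(uh) would be the size of a
     * pointer, which depends on the architecture. */
    unsigned int model_gso_segs(const struct udphdr_model *uh,
                                size_t datalen, size_t gso_size)
    {
        return (unsigned int)DIV_ROUND_UP(datalen - sizeof(*uh), gso_size);
    }
    ```

    On a typical 64-bit build a pointer is also 8 bytes, so both
    spellings give the same result there and the mistake only shows up
    where pointers are 4 bytes wide.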

    Fixes: bec1f6f69736 ("udp: generate gso with UDP_SEGMENT")
    Fixes: dfec0ee22c0a ("udp: Record gso_segs when supporting UDP segmentation offload")
    Signed-off-by: Josh Hunt
    Reviewed-by: Alexander Duyck
    Acked-by: Willem de Bruijn
    Signed-off-by: David S. Miller

    Josh Hunt
     
  • This began with a syzbot report. syzkaller was injecting
    IPv6 TCP SYN packets having a v4mapped source address.

    After an unsuccessful 4-tuple lookup, TCP creates a request
    socket (SYN_RECV) and calls reqsk_queue_hash_req()

    reqsk_queue_hash_req() calls sk_ehashfn(sk)

    At this point we have AF_INET6 sockets, and the heuristic
    used by sk_ehashfn() to either hash the IPv4 or IPv6 addresses
    is to use ipv6_addr_v4mapped(&sk->sk_v6_daddr)

    For the particular spoofed packet, we end up hashing V4 addresses
    which were not initialized by the TCP IPv6 stack, so KMSAN fired
    a warning.

    I first fixed sk_ehashfn() to test both source and destination addresses,
    but then faced various problems, including user-space programs
    like packetdrill that had similar assumptions.

    Instead of trying to fix the whole ecosystem, it is better
    to admit that we have a dual stack behavior, and that we
    can not build linux kernels without V4 stack anyway.

    The dual stack API automatically forces the traffic to be IPv4
    if v4mapped addresses are used at bind() or connect(), so it makes
    no sense to allow IPv6 traffic to use the same v4mapped class.
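
    A v4mapped address is ::ffff:a.b.c.d, that is 80 zero bits, 16 one
    bits, then the IPv4 address. The test and the resulting receive
    decision can be sketched in userspace as below. The struct and both
    function names are hypothetical, only the source address is
    examined (as in the report above), and where the kernel places the
    check is not shown in this message.

    ```c
    #include <stdbool.h>
    #include <stdint.h>
    #include <string.h>

    /* Hypothetical stand-in for struct in6_addr: 16 raw bytes. */
    struct in6_addr_model {
        uint8_t bytes[16];
    };

    /* True when the address has the ::ffff:0:0/96 prefix. */
    bool model_addr_v4mapped(const struct in6_addr_model *a)
    {
        static const uint8_t prefix[12] = {
            0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0xff, 0xff
        };

        return memcmp(a->bytes, prefix, sizeof(prefix)) == 0;
    }

    /* Receive-side policy: native IPv6 traffic must not carry a
     * v4mapped source address, so such packets are dropped. */
    bool model_should_drop_rx(const struct in6_addr_model *saddr)
    {
        return model_addr_v4mapped(saddr);
    }
    ```

    For example, ::ffff:192.0.2.1 as a source address is dropped, while
    2001:db8::1 is accepted.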

    Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
    Signed-off-by: Eric Dumazet
    Cc: Florian Westphal
    Cc: Hannes Frederic Sowa
    Reported-by: syzbot
    Signed-off-by: David S. Miller

    Eric Dumazet
     
  • Pablo Neira Ayuso says:

    ====================
    Netfilter fixes for net

    The following patchset contains Netfilter fixes for net:

    1) Remove the skb_ext_del from nf_reset, and rename it to the more
    fitting nf_reset_ct(). Patch from Florian Westphal.

    2) Fix deadlock in nft_connlimit between packet path updates and
    the garbage collector.
    ====================

    Signed-off-by: David S. Miller

    David S. Miller
     

02 Oct, 2019

2 commits

  • Rajendra reported a kernel panic when a link was taken down:

    [ 6870.263084] BUG: unable to handle kernel NULL pointer dereference at 00000000000000a8
    [ 6870.271856] IP: [] __ipv6_ifa_notify+0x154/0x290

    [ 6870.570501] Call Trace:
    [ 6870.573238] [] ? ipv6_ifa_notify+0x26/0x40
    [ 6870.579665] [] ? addrconf_dad_completed+0x4c/0x2c0
    [ 6870.586869] [] ? ipv6_dev_mc_inc+0x196/0x260
    [ 6870.593491] [] ? addrconf_dad_work+0x10a/0x430
    [ 6870.600305] [] ? __switch_to_asm+0x34/0x70
    [ 6870.606732] [] ? process_one_work+0x18a/0x430
    [ 6870.613449] [] ? worker_thread+0x4d/0x490
    [ 6870.619778] [] ? process_one_work+0x430/0x430
    [ 6870.626495] [] ? kthread+0xd9/0xf0
    [ 6870.632145] [] ? __switch_to_asm+0x34/0x70
    [ 6870.638573] [] ? kthread_park+0x60/0x60
    [ 6870.644707] [] ? ret_from_fork+0x57/0x70
    [ 6870.650936] Code: 31 c0 31 d2 41 b9 20 00 08 02 b9 09 00 00 0

    addrconf_dad_work is kicked to be scheduled when a device is brought
    up. There is a race between addrconf_dad_work getting scheduled and
    taking the rtnl lock and a process taking the link down (under rtnl).
    The latter removes the host route from the inet6_addr as part of
    addrconf_ifdown which is run for NETDEV_DOWN. The former attempts
    to use the host route in ipv6_ifa_notify. If the down event removes
    the host route due to the race to the rtnl, then the BUG listed above
    occurs.

    This scenario does not occur when the ipv6 address is not kept
    (net.ipv6.conf.all.keep_addr_on_down = 0) as addrconf_ifdown sets the
    state of the ifp to DEAD. Handle when the addresses are kept by checking
    IF_READY which is reset by addrconf_ifdown.

    The 'dead' flag for an inet6_addr is set only under rtnl, in
    addrconf_ifdown and it means the device is getting removed (or IPv6 is
    disabled). The interesting cases for changing the idev flag are
    addrconf_notify (NETDEV_UP and NETDEV_CHANGE) and addrconf_ifdown
    (reset the flag). The former does not have the idev lock - only rtnl;
    the latter has both. Based on that the existing dead + IF_READY check
    can be moved to right after the rtnl_lock in addrconf_dad_work.

    Fixes: f1705ec197e7 ("net: ipv6: Make address flushing on ifdown optional")
    Reported-by: Rajendra Dendukuri
    Signed-off-by: David Ahern
    Reviewed-by: Eric Dumazet
    Signed-off-by: David S. Miller

    David Ahern
     
  • commit 174e23810cd31
    ("sk_buff: drop all skb extensions on free and skb scrubbing") made napi
    recycle always drop skb extensions. The additional skb_ext_del() that is
    performed via nf_reset on napi skb recycle is not needed anymore.

    Most nf_reset() calls in the stack are there so queued skb won't block
    'rmmod nf_conntrack' indefinitely.

    This removes the skb_ext_del from nf_reset, and renames it to a more
    fitting nf_reset_ct().

    In a few selected places, add a call to skb_ext_reset to make sure that
    no active extensions remain.

    I am submitting this for "net", because we're still early in the release
    cycle. The patch applies to net-next too, but I think the rename causes
    needless divergence between those trees.

    Suggested-by: Eric Dumazet
    Signed-off-by: Florian Westphal
    Signed-off-by: Pablo Neira Ayuso

    Florian Westphal