29 May, 2018

1 commit

  • SCTP sockets originated in a VRF can improve their performance if CRC32c
    computation is delegated to underlying devices: update device features,
    setting NETIF_F_SCTP_CRC. Iterating the following command in the topology
    proposed with [1],

    # ip vrf exec vrf-h2 netperf -H 192.0.2.1 -t SCTP_STREAM -- -m 10K

    the measured throughput in Mbit/s improved from 2395 ± 1% to 2720 ± 1%.

    [1] https://www.spinics.net/lists/netdev/msg486007.html

    Signed-off-by: Davide Caratti
    Reviewed-by: Marcelo Ricardo Leitner
    Acked-by: David Ahern
    Signed-off-by: David S. Miller

    Davide Caratti
     

18 Apr, 2018

1 commit

  • A later patch removes rt6i_table from rt6_info. Save the IPv6
    table for a VRF in net_vrf. FIB tables cannot be deleted, so
    no reference counting or locking is required.

    Signed-off-by: David Ahern
    Signed-off-by: David S. Miller

    David Ahern
     

02 Apr, 2018

1 commit


31 Mar, 2018

1 commit

  • Miguel reported an skb use-after-free / double free in vrf_finish_output
    when neigh_output returns an error. The vrf driver should return right
    after the call to neigh_output, because neigh_output takes over the skb
    on the error path as well.

    Patch is a simplified version of Miguel's patch which was written for 4.9,
    and updated to top of tree.

    Fixes: 8f58336d3f78a ("net: Add ethernet header for pass through VRF device")
    Signed-off-by: Miguel Fadon Perlines
    Signed-off-by: David Ahern
    Signed-off-by: David S. Miller

    David Ahern
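
The ownership rule behind this fix can be sketched in userspace C. Everything here (`struct pkt`, `consume_output`, `finish_output`) is hypothetical and only mimics the contract described above, where the callee takes over the buffer on the success path and on the error path. It is not the kernel code.

```c
#include <stdlib.h>

/* Hypothetical stand-in for an skb; not a kernel type. */
struct pkt {
    int len;
};

int pkt_frees; /* counts how many times a packet was released */

void pkt_free(struct pkt *p)
{
    pkt_frees++;
    free(p);
}

/* Mimics a callee that owns the packet on every path: it releases
 * the packet itself, whether it fails or succeeds. */
int consume_output(struct pkt *p, int fail)
{
    pkt_free(p);
    return fail ? -1 : 0;
}

/* Correct caller: return right after handing the packet over.
 * Freeing p again on error would be a double free. */
int finish_output(int len, int fail)
{
    struct pkt *p = malloc(sizeof(*p));

    if (!p)
        return -1;
    p->len = len;
    return consume_output(p, fail);
}
```

Calling `finish_output(64, 1)` returns the error and leaves exactly one release recorded in `pkt_frees`, which is the behaviour the fix restores.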
     

28 Mar, 2018

1 commit


05 Mar, 2018

1 commit

  • IPv6 does path selection for multipath routes deep in the lookup
    functions. The next patch adds L4 hash option and needs the skb
    for the forward path. To get the skb to the relevant FIB lookup
    functions it needs to go through the fib rules layer, so add a
    lookup_data argument to the fib_lookup_arg struct.

    Signed-off-by: David Ahern
    Reviewed-by: Ido Schimmel
    Reviewed-by: Nikolay Aleksandrov
    Signed-off-by: David S. Miller

    David Ahern
     

28 Feb, 2018

1 commit

  • These pernet_operations perform fairly simple actions,
    such as variable initialization on init and debug checks
    on exit, so they can obviously be executed in parallel
    with any others:

    vrf_net_ops
    lockd_net_ops
    grace_net_ops
    xfrm6_tunnel_net_ops
    kcm_net_ops
    tcf_net_ops

    Signed-off-by: Kirill Tkhai
    Signed-off-by: David S. Miller

    Kirill Tkhai
     

24 Feb, 2018

1 commit

  • For ages iproute2 has used `struct rtmsg` as the ancillary header for
    FIB rules and in the process set the protocol value to RTPROT_BOOT.
    Until cac56209a66 ("net: Allow a rule to track originating protocol")
    the kernel rules code ignored the protocol value sent from userspace
    and always returned 0 in notifications. To avoid incompatibility with
    existing iproute2, send the protocol as a new attribute.

    Fixes: cac56209a66 ("net: Allow a rule to track originating protocol")
    Signed-off-by: Donald Sharp
    Signed-off-by: David S. Miller

    Donald Sharp
     

22 Feb, 2018

1 commit

  • Allow a rule that is being added/deleted/modified or
    dumped to contain the originating protocol's id.

    The protocol is handled just like a route's originating
    protocol. This is especially useful because a growing
    number of different user space programs are adding
    rules.

    Allow the vrf device to specify that the kernel is the originator
    of the rule created for this device.

    Signed-off-by: Donald Sharp
    Signed-off-by: David S. Miller

    Donald Sharp
     

16 Feb, 2018

1 commit

  • Remove rt_table_id from rtable. It was added for getroute to return the
    table id that was hit in the lookup. With the changes for fibmatch the
    table id can be extracted from the fib_info returned in the fib_result
    so it no longer needs to be in rtable directly.

    Signed-off-by: David Ahern
    Signed-off-by: David S. Miller

    David Ahern
     

26 Jan, 2018

1 commit

  • Sukumar reported that sends to the local broadcast address
    (255.255.255.255) are broken. Check for the address in vrf driver
    and do not redirect to the VRF device - similar to multicast
    packets.

    With this change, sockets can use SO_BINDTODEVICE to specify an
    egress interface and receive responses. Note: the egress interface
    cannot be a VRF device; it needs to be the enslaved device.

    https://bugzilla.kernel.org/show_bug.cgi?id=198521

    Reported-by: Sukumar Gopalakrishnan
    Signed-off-by: David Ahern
    Signed-off-by: David S. Miller

    David Ahern
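
A minimal sketch of the address check described above, written as userspace C. The helper names and the choice of host byte order are assumptions made for illustration; the driver itself works on kernel address types.

```c
#include <stdbool.h>
#include <stdint.h>

/* Addresses are taken in host byte order to keep the sketch portable. */
#define LIMITED_BROADCAST 0xffffffffu /* 255.255.255.255 */

bool is_limited_broadcast(uint32_t addr)
{
    return addr == LIMITED_BROADCAST;
}

bool is_multicast(uint32_t addr)
{
    /* 224.0.0.0/4: the top four bits are 1110 */
    return (addr & 0xf0000000u) == 0xe0000000u;
}

/* Hypothetical decision helper: true means "redirect to the VRF device". */
bool redirect_to_vrf(uint32_t daddr)
{
    return !is_multicast(daddr) && !is_limited_broadcast(daddr);
}
```

With these helpers, `redirect_to_vrf(0xffffffffu)` is false, while an ordinary unicast destination such as 192.0.2.1 (`0xc0000201u`) is still redirected.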
     

04 Nov, 2017

1 commit


02 Nov, 2017

1 commit

  • FRA_L3MDEV is defined as U8, but is being added as a U32 attribute. On
    big-endian architectures, this results in the l3mdev entry not being
    added to the FIB rules.

    Fixes: 1aa6c4f6b8cd8 ("net: vrf: Add l3mdev rules on first device create")
    Signed-off-by: Jeff Barnhill
    Acked-by: David Ahern
    Signed-off-by: David S. Miller

    Jeff Barnhill
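
Why the width mismatch only bites on big-endian machines can be shown with a small userspace sketch. It assumes, for illustration, that the reader of a U8 attribute looks only at the first payload byte; the function names are hypothetical.

```c
#include <stdint.h>

/* Serialize v into buf in the requested byte order. */
void put_u32(uint8_t buf[4], uint32_t v, int big_endian)
{
    for (int i = 0; i < 4; i++) {
        int shift = big_endian ? 8 * (3 - i) : 8 * i;

        buf[i] = (uint8_t)(v >> shift);
    }
}

/* A reader that expects a u8 attribute looks only at the first byte. */
uint8_t read_as_u8(const uint8_t buf[4])
{
    return buf[0];
}

/* Value seen by the u8 reader when the writer stored a u32. */
uint8_t flag_seen_by_u8_reader(uint32_t v, int big_endian)
{
    uint8_t payload[4];

    put_u32(payload, v, big_endian);
    return read_as_u8(payload);
}
```

A flag value of 1 written as a 32-bit quantity is read back as 1 in little-endian layout but as 0 in big-endian layout, so the flag appears unset.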
     

05 Oct, 2017

3 commits


22 Sep, 2017

1 commit


16 Sep, 2017

1 commit

  • When building an allmodconfig kernel with gcc-4.6, we get a rather
    odd warning:

    drivers/net/vrf.c: In function ‘vrf_ip6_input_dst’:
    drivers/net/vrf.c:964:3: error: initialized field with side-effects overwritten [-Werror]
    drivers/net/vrf.c:964:3: error: (near initialization for ‘fl6’) [-Werror]

    I have no idea what this warning is even trying to say, but it does
    seem like a false positive. Reordering the initialization to match
    the structure definition gets rid of the warning, and might also avoid
    whatever gcc thinks is wrong here.

    Fixes: 9ff74384600a ("net: vrf: Handle ipv6 multicast and link-local addresses")
    Signed-off-by: Arnd Bergmann
    Signed-off-by: David S. Miller

    Arnd Bergmann
     

14 Aug, 2017

1 commit


08 Aug, 2017

1 commit

  • Add extack error messages for failure paths creating vrf devices. Once
    extack support is added to iproute2, we go from the unhelpful:
    $ ip li add foobar type vrf
    RTNETLINK answers: Invalid argument

    to:
    $ ip li add foobar type vrf
    Error: VRF table id is missing

    Signed-off-by: David Ahern
    Signed-off-by: David S. Miller

    David Ahern
     

06 Jul, 2017

1 commit

  • When destroying a VRF device we clean up the slaves in its ndo_uninit()
    function, but that causes packets to be switched (skb->dev == vrf being
    destroyed) even though we're past the point where the VRF should be
    receiving any packets while it is being dismantled. This causes a BUG_ON
    to trigger if we have raw sockets (trace below).
    The reason is that the inetdev of the VRF has been destroyed but we're
    still sending packets up the stack with it, so let's free the slaves in
    the dellink callback as David Ahern suggested.

    Note that this fix doesn't prevent packets from going up when the VRF
    device is admin down.

    [ 35.631371] ------------[ cut here ]------------
    [ 35.631603] kernel BUG at net/ipv4/fib_frontend.c:285!
    [ 35.631854] invalid opcode: 0000 [#1] SMP
    [ 35.631977] Modules linked in:
    [ 35.632081] CPU: 2 PID: 22 Comm: ksoftirqd/2 Not tainted 4.12.0-rc7+ #45
    [ 35.632247] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.7.5-20140531_083030-gandalf 04/01/2014
    [ 35.632477] task: ffff88005ad68000 task.stack: ffff88005ad64000
    [ 35.632632] RIP: 0010:fib_compute_spec_dst+0xfc/0x1ee
    [ 35.632769] RSP: 0018:ffff88005ad67978 EFLAGS: 00010202
    [ 35.632910] RAX: 0000000000000001 RBX: ffff880059a7f200 RCX: 0000000000000000
    [ 35.633084] RDX: 0000000000000000 RSI: 0000000000000001 RDI: ffffffff82274af0
    [ 35.633256] RBP: ffff88005ad679f8 R08: 000000000001ef70 R09: 0000000000000046
    [ 35.633430] R10: ffff88005ad679f8 R11: ffff880037731cb0 R12: 0000000000000001
    [ 35.633603] R13: ffff8800599e3000 R14: 0000000000000000 R15: ffff8800599cb852
    [ 35.634114] FS: 0000000000000000(0000) GS:ffff88005d900000(0000) knlGS:0000000000000000
    [ 35.634306] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    [ 35.634456] CR2: 00007f3563227095 CR3: 000000000201d000 CR4: 00000000000406e0
    [ 35.634632] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
    [ 35.634865] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
    [ 35.635055] Call Trace:
    [ 35.635271] ? __lock_acquire+0xf0d/0x1117
    [ 35.635522] ipv4_pktinfo_prepare+0x82/0x151
    [ 35.635831] raw_rcv_skb+0x17/0x3c
    [ 35.636062] raw_rcv+0xe5/0xf7
    [ 35.636287] raw_local_deliver+0x169/0x1d9
    [ 35.636534] ip_local_deliver_finish+0x87/0x1c4
    [ 35.636820] ip_local_deliver+0x63/0x7f
    [ 35.637058] ip_rcv_finish+0x340/0x3a1
    [ 35.637295] ip_rcv+0x314/0x34a
    [ 35.637525] __netif_receive_skb_core+0x49f/0x7c5
    [ 35.637780] ? lock_acquire+0x13f/0x1d7
    [ 35.638018] ? lock_acquire+0x15e/0x1d7
    [ 35.638259] __netif_receive_skb+0x1e/0x94
    [ 35.638502] ? __netif_receive_skb+0x1e/0x94
    [ 35.638748] netif_receive_skb_internal+0x74/0x300
    [ 35.639002] ? dev_gro_receive+0x2ed/0x411
    [ 35.639246] ? lock_is_held_type+0xc4/0xd2
    [ 35.639491] napi_gro_receive+0x105/0x1a0
    [ 35.639736] receive_buf+0xc32/0xc74
    [ 35.639965] ? detach_buf+0x67/0x153
    [ 35.640201] ? virtqueue_get_buf_ctx+0x120/0x176
    [ 35.640453] virtnet_poll+0x128/0x1c5
    [ 35.640690] net_rx_action+0x103/0x343
    [ 35.640932] __do_softirq+0x1c7/0x4b7
    [ 35.641171] run_ksoftirqd+0x23/0x5c
    [ 35.641403] smpboot_thread_fn+0x24f/0x26d
    [ 35.641646] ? sort_range+0x22/0x22
    [ 35.641878] kthread+0x129/0x131
    [ 35.642104] ? __list_add+0x31/0x31
    [ 35.642335] ? __list_add+0x31/0x31
    [ 35.642568] ret_from_fork+0x2a/0x40
    [ 35.642804] Code: 05 bd 87 a3 00 01 e8 1f ef 98 ff 4d 85 f6 48 c7 c7 f0 4a 27 82 41 0f 94 c4 31 c9 31 d2 41 0f b6 f4 e8 04 71 a1 ff 45 84 e4 74 02 0b 0f b7 93 c4 00 00 00 4d 8b a5 80 05 00 00 48 03 93 d0 00
    [ 35.644342] RIP: fib_compute_spec_dst+0xfc/0x1ee RSP: ffff88005ad67978

    Fixes: 193125dbd8eb ("net: Introduce VRF device driver")
    Reported-by: Chris Cormier
    Signed-off-by: Nikolay Aleksandrov
    Acked-by: David Ahern
    Signed-off-by: David S. Miller

    Nikolay Aleksandrov
     

27 Jun, 2017

2 commits


18 Jun, 2017

2 commits

  • The DST_NOCACHE flag check was removed from dst_release() and
    dst_hold_safe() in a previous patch because all dst entries are now
    reference counted properly and can be released based on refcnt only.
    Looking at the remaining DST_NOCACHE uses, all of them can now be
    removed or replaced with other checks.
    So this patch gets rid of all the DST_NOCACHE usage and removes the flag
    completely.

    Signed-off-by: Wei Wang
    Acked-by: Martin KaFai Lau
    Signed-off-by: David S. Miller

    Wei Wang
     
  • In IPv6 routing code, struct rt6_info is created for each static route
    and RTF_CACHE route and inserted into fib6 tree. In both cases, dst
    ref count is not taken.
    As explained in the previous patch, this is why the dst garbage
    collector is needed.

    This patch holds ref count of dst before inserting the route into fib6
    tree and properly releases the dst when deleting it from the fib6 tree
    as a preparation in order to fully get rid of dst gc later.

    Also, correct fib6_age() logic to check dst->__refcnt to be 1 to indicate
    no user is referencing the dst.

    And remove dst_hold() in vrf_rt6_create() as ip6_dst_alloc() already puts
    dst->__refcnt to 1.

    Signed-off-by: Wei Wang
    Acked-by: Martin KaFai Lau
    Signed-off-by: David S. Miller

    Wei Wang
     

16 Jun, 2017

1 commit

  • It seems like a historic accident that these return unsigned char *,
    and in many places that means casts are required, more often than not.

    Make these functions return void * and remove all the casts across
    the tree, adding a (u8 *) cast only where the unsigned char pointer
    was used directly, all done with the following spatch:

    @@
    expression SKB, LEN;
    typedef u8;
    identifier fn = { skb_push, __skb_push, skb_push_rcsum };
    @@
    - *(fn(SKB, LEN))
    + *(u8 *)fn(SKB, LEN)

    @@
    expression E, SKB, LEN;
    identifier fn = { skb_push, __skb_push, skb_push_rcsum };
    type T;
    @@
    - E = ((T *)(fn(SKB, LEN)))
    + E = fn(SKB, LEN)

    @@
    expression SKB, LEN;
    identifier fn = { skb_push, __skb_push, skb_push_rcsum };
    @@
    - fn(SKB, LEN)[0]
    + *(u8 *)fn(SKB, LEN)

    Note that the last part there converts from push(...)[0] to the
    more idiomatic *(u8 *)push(...).

    Signed-off-by: Johannes Berg
    Signed-off-by: David S. Miller

    Johannes Berg
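
The effect of the return type change can be sketched outside the kernel. The buffer type and `buf_push` below are hypothetical stand-ins for an skb and `skb_push`, and the sketch performs no headroom checks.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical buffer with headroom; data grows toward lower addresses. */
struct buf {
    uint8_t storage[64];
    uint8_t *data;
};

void buf_init(struct buf *b)
{
    b->data = b->storage + sizeof(b->storage);
}

/* Returning void * lets callers assign to any header type without a cast. */
void *buf_push(struct buf *b, size_t len)
{
    b->data -= len;
    return b->data;
}

struct hdr {
    uint8_t type;
    uint8_t flags;
};

uint8_t push_example(void)
{
    struct buf b;
    struct hdr *h;

    buf_init(&b);
    h = buf_push(&b, sizeof(*h));      /* no (struct hdr *) cast needed */
    h->type = 7;
    h->flags = 0;
    *(uint8_t *)buf_push(&b, 1) = 42;  /* single byte: explicit u8 cast */
    return b.data[0];
}
```

The two call sites mirror the two outcomes of the semantic patch: a plain assignment loses its cast, and a direct byte store gains a `(u8 *)` style cast.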
     

09 Jun, 2017

1 commit

  • Commit 1aa6c4f6b8cd8 ("net: vrf: Add l3mdev rules on first device create")
    adds the l3mdev FIB rule the first time a VRF device is created. However,
    it only creates the rule once and only in the namespace the first device
    is created - which may not be init_net. Fix by using the net_generic
    capability to make the add_fib_rules flag per network namespace.

    Fixes: 1aa6c4f6b8cd8 ("net: vrf: Add l3mdev rules on first device create")
    Reported-by: Petr Machata
    Signed-off-by: David Ahern
    Signed-off-by: David S. Miller

    David Ahern
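
The difference between a single global flag and a per-namespace flag can be sketched as follows. The table indexed by a small integer is an assumption that stands in for the net_generic lookup; none of these names come from the driver.

```c
#include <stdbool.h>

#define MAX_NS 4 /* arbitrary size for the sketch */

/* Hypothetical per-namespace private data, indexed by a namespace id. */
struct ns_state {
    bool add_fib_rules;   /* true until the rule has been installed */
    int rules_installed;
};

struct ns_state ns_table[MAX_NS];

void ns_init(int ns)
{
    ns_table[ns].add_fib_rules = true;
    ns_table[ns].rules_installed = 0;
}

/* Called for every device created in namespace ns. */
void create_device(int ns)
{
    struct ns_state *st = &ns_table[ns];

    if (st->add_fib_rules) {
        st->rules_installed++;
        st->add_fib_rules = false;
    }
}
```

Creating the first device in namespace 1 no longer prevents namespace 0 from getting its own rule, and each namespace still installs the rule only once.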
     

08 Jun, 2017

1 commit

  • Network devices can allocate resources and private memory using
    netdev_ops->ndo_init(). However, the release of these resources
    can occur in one of two different places.

    Either netdev_ops->ndo_uninit() or netdev->destructor().

    The decision of which operation frees the resources depends upon
    whether it is necessary for all netdev refs to be released before it
    is safe to perform the freeing.

    netdev_ops->ndo_uninit() presumably can occur right after the
    NETDEV_UNREGISTER notifier completes and the unicast and multicast
    address lists are flushed.

    netdev->destructor(), on the other hand, does not run until the
    netdev references all go away.

    Further complicating the situation is that netdev->destructor()
    almost universally also does a free_netdev().

    This creates a problem for the logic in register_netdevice(),
    because all callers of register_netdevice() manage the freeing
    of the netdev and invoke free_netdev(dev) if register_netdevice()
    fails.

    If netdev_ops->ndo_init() succeeds, but something else fails inside
    of register_netdevice(), it does call netdev_ops->ndo_uninit(). But
    it is not able to invoke netdev->destructor().

    This is because netdev->destructor() will do a free_netdev() and
    then the caller of register_netdevice() will do the same.

    However, this means that the resources that would normally be released
    by netdev->destructor() will not be.

    Over the years drivers have added local hacks to deal with this, by
    invoking their destructor parts by hand when register_netdevice()
    fails.

    Many drivers do not try to deal with this, and instead we have leaks.

    Let's close this hole by formalizing the distinction between what
    private things need to be freed up by netdev->destructor() and whether
    the driver needs unregister_netdevice() to perform the free_netdev().

    netdev->priv_destructor() performs all actions to free up the private
    resources that used to be freed by netdev->destructor(), except for
    free_netdev().

    netdev->needs_free_netdev is a boolean that indicates whether
    free_netdev() should be done at the end of unregister_netdevice().

    Now, register_netdevice() can sanely release all resources after
    netdev_ops->ndo_init() succeeds, by invoking both
    netdev_ops->ndo_uninit() and netdev->priv_destructor().

    And at the end of unregister_netdevice(), we invoke
    netdev->priv_destructor() and optionally call free_netdev().

    Signed-off-by: David S. Miller

    David S. Miller
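
The split described above can be modelled with a heavily reduced userspace sketch. The structure, the callbacks and the `late_failure` switch are all hypothetical; only the division of work between uninit, the private destructor and the final free follows the message.

```c
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical, heavily reduced model of a network device. */
struct dev {
    void *priv;                            /* allocated by init */
    bool needs_free;                       /* models needs_free_netdev */
    int (*init)(struct dev *);
    void (*uninit)(struct dev *);
    void (*priv_destructor)(struct dev *);
};

int priv_frees; /* number of times private resources were released */

int demo_init(struct dev *d)
{
    d->priv = malloc(16);
    return d->priv ? 0 : -1;
}

void demo_uninit(struct dev *d)
{
    (void)d; /* nothing to undo in this sketch */
}

void demo_priv_destructor(struct dev *d)
{
    free(d->priv);
    d->priv = NULL;
    priv_frees++;
}

/* On a late failure, undo init fully but leave freeing d to the caller. */
int register_dev(struct dev *d, bool late_failure)
{
    if (d->init && d->init(d))
        return -1;
    if (late_failure) {
        if (d->uninit)
            d->uninit(d);
        if (d->priv_destructor)
            d->priv_destructor(d);
        return -1;
    }
    return 0;
}

/* Normal teardown: private cleanup first, then the optional final free. */
void unregister_dev(struct dev *d)
{
    if (d->uninit)
        d->uninit(d);
    if (d->priv_destructor)
        d->priv_destructor(d);
    if (d->needs_free)
        free(d);
}
```

Because the private destructor never frees the device itself, the failure path in `register_dev` can run it safely and the caller can still free the device exactly once.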
     

12 May, 2017

1 commit

  • The current code only handles the case where the skb is dropped. It
    can hit a use-after-free when NF_HOOK returns 0, which means the skb
    was stolen by a netfilter rule or hook.

    When a netfilter rule or hook steals the skb and returns NF_STOLEN,
    the skb belongs to that rule and no other module may touch it again.
    The rule may have queued the skb or freed it directly.

    Use nf_hook instead of NF_HOOK to get the netfilter result, and check
    the return value of nf_hook. The skb may continue only when the value
    is 1; otherwise set the skb to NULL.

    Because vrf_rcv_finish is an empty function, there is no need to
    invoke it even when nf_hook returns 1. It still has to be modified
    to deal with the NF_STOLEN case.

    There are two cases when the skb is stolen:
    1. The skb is stolen and freed directly.
       Nothing needs to be done, and vrf_rcv_finish is not invoked.
    2. The skb is queued and reinjected later.
       vrf_rcv_finish is then invoked as okfn, so it must free the
       skb.

    Signed-off-by: Gao Feng
    Signed-off-by: David S. Miller

    Gao Feng
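
The return-value contract described above can be sketched in userspace C. `run_hook` and `rcv` are hypothetical; the sketch only models the rule that the caller keeps the packet when the hook runner returns 1 and must forget it otherwise.

```c
#include <stdlib.h>

enum verdict { V_ACCEPT, V_DROP, V_STOLEN };

/* Hypothetical stand-in for an skb. */
struct pkt {
    int id;
};

/* Hypothetical hook runner: returns 1 only when the caller may continue
 * with the packet; for any other value the packet is no longer the
 * caller's to touch. */
int run_hook(struct pkt *p, enum verdict v)
{
    switch (v) {
    case V_ACCEPT:
        return 1;
    case V_DROP:
        free(p);    /* dropped packets are freed by the hook layer */
        return -1;
    case V_STOLEN:
    default:
        free(p);    /* the thief owns it; here it frees it directly */
        return 0;
    }
}

/* Caller keeps the pointer only on 1, otherwise resets it to NULL. */
struct pkt *rcv(struct pkt *p, enum verdict v)
{
    if (run_hook(p, v) != 1)
        p = NULL;
    return p;
}
```

Resetting the pointer to NULL on every value other than 1 covers both the dropped and the stolen case, so later code cannot touch memory it no longer owns.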
     

28 Apr, 2017

1 commit

  • Moving the loopback into a VRF breaks networking for the default VRF.
    Since the VRF device is the loopback for VRF domains, there is no
    reason to move the loopback. Given the repercussions, block attempts
    to set lo into a VRF.

    Signed-off-by: David Ahern
    Reviewed-by: Greg Rose
    Signed-off-by: David S. Miller

    David Ahern
     

20 Apr, 2017

1 commit


18 Apr, 2017

2 commits


24 Mar, 2017

1 commit


23 Mar, 2017

2 commits

  • The VRF driver allows users to implement device based features for an
    entire domain. For example, a qdisc or netfilter rules can be attached
    to a VRF device or tcpdump can be used to view packets for all devices
    in the L3 domain.

    The device-based features come with a performance penalty, most
    notably in the Tx path. The VRF driver uses the l3mdev_l3_out hook
    to switch the dst on an skb to its private dst. This allows the skb
    to traverse the xmit stack with the device set to the VRF device
    which in turn enables the netfilter and qdisc features. The VRF
    driver then performs the FIB lookup again and reinserts the packet.

    This patch avoids the redirect for IPv6 packets if a qdisc has not
    been attached to a VRF device which is the default config. In this
    case the netfilter hooks and network taps are directly traversed in
    the l3mdev_l3_out handler. If a qdisc is attached to a VRF device,
    then the redirect using the vrf dst is done.

    Additional overhead is removed by only checking packet taps if a
    socket is open on the device (vrf_dev->ptype_all list is not empty).
    Packet sockets bound to any device will still get a copy of the
    packet via the real ingress or egress interface.

    The end result of this change is a decrease in the overhead of VRF
    for the default, baseline case (i.e., no netfilter rules, no packet
    sockets, no qdisc). Results range from a +3% improvement for UDP,
    which has a lookup per packet (VRF being better than no l3mdev), to
    a ~2% loss for TCP_CRR, which connects a socket for each
    request-response.

    Signed-off-by: David Ahern
    Signed-off-by: David S. Miller

    David Ahern
     
  • The VRF driver allows users to implement device based features for an
    entire domain. For example, a qdisc or netfilter rules can be attached
    to a VRF device or tcpdump can be used to view packets for all devices
    in the L3 domain.

    The device-based features come with a performance penalty, most
    notably in the Tx path. The VRF driver uses the l3mdev_l3_out hook
    to switch the dst on an skb to its private dst. This allows the skb
    to traverse the xmit stack with the device set to the VRF device
    which in turn enables the netfilter and qdisc features. The VRF
    driver then performs the FIB lookup again and reinserts the packet.

    This patch avoids the redirect for IPv4 packets if a qdisc has not
    been attached to a VRF device which is the default config. In this
    case the netfilter hooks and network taps are directly traversed in
    the l3mdev_l3_out handler. If a qdisc is attached to a VRF device,
    then the redirect using the vrf dst is done.

    Additional overhead is removed by only checking packet taps if a
    socket is open on the device (vrf_dev->ptype_all list is not empty).
    Packet sockets bound to any device will still get a copy of the
    packet via the real ingress or egress interface.

    The end result of this change is a decrease in the overhead of VRF
    for the default, baseline case (i.e., no netfilter rules, no packet
    sockets, no qdisc) to ~3% for UDP, which has a lookup per packet, and
    < 1% overhead for connected sockets that leverage early demux and
    avoid FIB lookups.

    Signed-off-by: David Ahern
    Signed-off-by: David S. Miller

    David Ahern
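
The decision that both patches of this date describe can be summarized in a small sketch. The state structure and function names are hypothetical; only the two conditions (qdisc attached, packet sockets open on the VRF device) come from the messages.

```c
#include <stdbool.h>

/* Hypothetical snapshot of the VRF device state relevant to the Tx path. */
struct vrf_state {
    bool has_qdisc;   /* a qdisc is attached to the VRF device */
    int tap_sockets;  /* packet sockets open on the VRF device */
};

enum tx_path {
    TX_REDIRECT, /* switch to the VRF dst, then a second FIB lookup */
    TX_DIRECT    /* run hooks and taps in place, no redirect */
};

enum tx_path choose_tx_path(const struct vrf_state *s)
{
    return s->has_qdisc ? TX_REDIRECT : TX_DIRECT;
}

/* On the direct path, taps are only walked when someone is listening. */
bool should_deliver_to_taps(const struct vrf_state *s)
{
    return choose_tx_path(s) == TX_DIRECT && s->tap_sockets > 0;
}
```

In the default configuration (no qdisc, no packet sockets) the sketch picks the direct path and skips tap delivery, which is where the reduced overhead comes from.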
     

22 Mar, 2017

1 commit

  • The VRF driver takes a reference to the inet6_dev on the VRF device for
    its rt6_local dst when handling local traffic through the VRF device as
    a loopback. When the device is deleted the driver does a put on the idev
    but does not reset rt6i_idev in the rt6_info struct. When the dst is
    destroyed, dst_destroy calls ip6_dst_destroy which does a second put for
    what is essentially the same reference causing it to be prematurely freed.
    Reset rt6i_idev after the put in the vrf driver.

    Fixes: b4869aa2f881e ("net: vrf: ipv6 support for local traffic to local addresses")
    Signed-off-by: David Ahern
    Signed-off-by: David S. Miller

    David Ahern
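
The double put and its fix can be illustrated with a plain counter. The types and functions below are hypothetical and only model the pattern of clearing the pointer after dropping the reference.

```c
#include <stddef.h>

/* Hypothetical reference-counted object and a route that points at it. */
struct idev {
    int refcnt;
};

struct route {
    struct idev *idev;
};

void idev_put(struct idev *i)
{
    i->refcnt--;
}

/* Device teardown: drop the reference and clear the pointer so that
 * nobody can drop the same reference a second time. */
void teardown(struct route *rt)
{
    if (rt->idev) {
        idev_put(rt->idev);
        rt->idev = NULL;
    }
}

/* Route destruction: puts the reference only if one is still recorded. */
void route_destroy(struct route *rt)
{
    if (rt->idev) {
        idev_put(rt->idev);
        rt->idev = NULL;
    }
}
```

Starting from a count of 2 (one reference held by the route, one by another user), running `teardown` and then `route_destroy` leaves the count at 1. Without the reset in `teardown`, the second put would drop it to 0 while the other user still holds the object.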
     

17 Mar, 2017

1 commit

  • Allow listeners of the subsequent CHANGEUPPER notification to retrieve
    the VRF's table ID by calling l3mdev_fib_table() with the slave netdev.
    Without this change, the netdev won't be considered an L3 slave and the
    function would return 0.

    This is consistent with other master devices such as bridge and bond that
    set the slave's private flag before linking. It also makes
    do_vrf_{add,del}_slave() symmetric.

    Signed-off-by: Ido Schimmel
    Acked-by: David Ahern
    Signed-off-by: Jiri Pirko
    Signed-off-by: David S. Miller

    Ido Schimmel
     

09 Mar, 2017

1 commit

  • KASAN detected a use-after-free:

    [ 269.467067] BUG: KASAN: use-after-free in vrf_xmit+0x7f1/0x827 [vrf] at addr ffff8800350a21c0
    [ 269.467067] Read of size 4 by task ssh/1879
    [ 269.467067] CPU: 1 PID: 1879 Comm: ssh Not tainted 4.10.0+ #249
    [ 269.467067] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.7.5-20140531_083030-gandalf 04/01/2014
    [ 269.467067] Call Trace:
    [ 269.467067] dump_stack+0x81/0xb6
    [ 269.467067] kasan_object_err+0x21/0x78
    [ 269.467067] kasan_report+0x2f7/0x450
    [ 269.467067] ? vrf_xmit+0x7f1/0x827 [vrf]
    [ 269.467067] ? ip_output+0xa4/0xdb
    [ 269.467067] __asan_load4+0x6b/0x6d
    [ 269.467067] vrf_xmit+0x7f1/0x827 [vrf]
    ...

    This corresponds to the skb access after xmit handling. Fix by saving
    skb->len and using the saved value to update stats.

    Fixes: 193125dbd8eb2 ("net: Introduce VRF device driver")
    Signed-off-by: David Ahern
    Signed-off-by: David S. Miller

    David Ahern
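
The fix pattern, reading the length before the packet is handed off, can be sketched as follows. `xmit` here is a hypothetical routine that always consumes the packet; the statistics structure is likewise an assumption for illustration.

```c
#include <stdlib.h>

/* Hypothetical stand-ins for an skb and for device statistics. */
struct pkt {
    unsigned int len;
};

struct stats {
    unsigned long tx_packets;
    unsigned long tx_bytes;
};

/* Hypothetical transmit routine that consumes (frees) the packet. */
int xmit(struct pkt *p)
{
    free(p);
    return 0;
}

/* Read the length before handing the packet off; p must not be
 * dereferenced once xmit() has returned. */
int xmit_and_count(struct pkt *p, struct stats *st)
{
    unsigned int len = p->len;
    int ret = xmit(p);

    if (ret == 0) {
        st->tx_packets++;
        st->tx_bytes += len;
    }
    return ret;
}
```

Using the saved `len` keeps the statistics update independent of the packet's lifetime.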
     

12 Feb, 2017

1 commit