15 Jan, 2017

5 commits

  • [ Upstream commit 24c63bbc18e25d5d8439422aa5fd2d66390b88eb ]

    Frank reported that vrf devices can be created with a table id of 0.
    This breaks many of the run time table id checks and should not be
    allowed. Detect this condition at create time and fail with EINVAL.

    Fixes: 193125dbd8eb ("net: Introduce VRF device driver")
    Reported-by: Frank Kellermann
    Signed-off-by: David Ahern
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    David Ahern
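
The create-time check described in this commit can be sketched in userspace C as follows. This is a minimal illustration, not the driver's code: the helper name `sketch_vrf_validate_table` and the macro `SKETCH_RT_TABLE_UNSPEC` are hypothetical, and the sketch assumes only that a table id of 0 is the value to reject.

```c
#include <errno.h>
#include <stdint.h>

/* Stand-in for the "unspecified" table id (0), redefined here so the
 * sketch builds in userspace without kernel headers. */
#define SKETCH_RT_TABLE_UNSPEC 0u

/* Hypothetical helper (not a kernel function): validate the table id
 * supplied when the VRF device is created. */
static int sketch_vrf_validate_table(uint32_t table_id)
{
    if (table_id == SKETCH_RT_TABLE_UNSPEC)
        return -EINVAL;  /* table 0 would break the run-time table id checks */
    return 0;
}
```

Rejecting the value at create time means later code paths can assume a non-zero table id instead of re-checking it.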
     
  • [ Upstream commit 7a18c5b9fb31a999afc62b0e60978aa896fc89e9 ]

    fib_select_path does not call fib_select_multipath if oif is set in the
    flow struct. For VRF use cases oif is always set, so multipath route
    selection is bypassed. Use the FLOWI_FLAG_SKIP_NH_OIF to skip the oif
    check similar to what is done in fib_table_lookup.

    Add saddr and proto to the flow struct for the fib lookup done by the
    VRF driver to better match hash computation for a flow.

    Fixes: 613d09b30f8b ("net: Use VRF device index for lookups on TX")
    Signed-off-by: David Ahern
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    David Ahern
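
The change to the multipath decision can be modelled with a small userspace sketch. The struct, the flag value, and both function names below are hypothetical stand-ins; only the idea (a flow flag that bypasses the "oif must be unset" check) comes from the commit message.

```c
#include <stdbool.h>

/* Hypothetical flag value for this sketch; it stands in for the kernel's
 * FLOWI_FLAG_SKIP_NH_OIF, whose real value is not assumed here. */
#define SKETCH_FLAG_SKIP_NH_OIF 0x1u

struct sketch_flow {
    int oif;             /* output interface index, 0 if unset */
    unsigned int flags;
};

/* Behaviour before the fix: multipath selection only when oif is unset. */
static bool sketch_use_multipath_old(int nexthops, const struct sketch_flow *fl)
{
    return nexthops > 1 && fl->oif == 0;
}

/* Behaviour described by the commit: the flag bypasses the oif check. */
static bool sketch_use_multipath_new(int nexthops, const struct sketch_flow *fl)
{
    bool oif_ok = fl->oif == 0 || (fl->flags & SKETCH_FLAG_SKIP_NH_OIF) != 0;

    return nexthops > 1 && oif_ok;
}
```

With a VRF-style flow (oif set, flag set) the old predicate is false for any number of nexthops, while the new one allows selection whenever more than one nexthop exists.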
     
  • [ Upstream commit 926d93a33e59b2729afdbad357233c17184de9d2 ]

    The move from rx-handler to L3 receive handler inadvertently dropped the
    rx counters. Restore them.

    Fixes: 74b20582ac38 ("net: l3mdev: Add hook in ip and ipv6")
    Reported-by: Dinesh Dutt
    Signed-off-by: David Ahern
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    David Ahern
     
  • [ Upstream commit eb63ecc1706b3e094d0f57438b6c2067cfc299f2 ]

    Locally originated traffic in a VRF fails in the presence of a POSTROUTING
    rule. For example,

    $ iptables -t nat -A POSTROUTING -s 11.1.1.0/24 -j MASQUERADE
    $ ping -I red -c1 11.1.1.3
    ping: Warning: source address might be selected on device other than red.
    PING 11.1.1.3 (11.1.1.3) from 11.1.1.2 red: 56(84) bytes of data.
    ping: sendmsg: Operation not permitted

    Worse, the above causes random corruption resulting in a panic in random
    places (I have not seen a consistent backtrace).

    Call nf_reset to drop the conntrack info following the pass through the
    VRF device. The nf_reset is needed on Tx but not Rx because of the order
    in which NF_HOOK's are hit: on Rx the VRF device is after the real ingress
    device and on Tx it is before the real egress device. Connection
    tracking should be tied to the real egress device and not the VRF device.

    Fixes: 8f58336d3f78a ("net: Add ethernet header for pass through VRF device")
    Fixes: 35402e3136634 ("net: Add IPv6 support to VRF device")
    Signed-off-by: David Ahern
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    David Ahern
     
  • [ Upstream commit a0f37efa82253994b99623dbf41eea8dd0ba169b ]

    Connection tracking with VRF is broken because the pass through the VRF
    device drops the connection tracking info. Removing the call to nf_reset
    allows DNAT and MASQUERADE to work across interfaces within a VRF.

    Fixes: 73e20b761acf ("net: vrf: Add support for PREROUTING rules on vrf device")
    Signed-off-by: David Ahern
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    David Ahern
     

17 Oct, 2016

1 commit

  • Currently, socket lookups for l3mdev (vrf) use cases can match a socket
    that is bound to a port but not a device (i.e., a global socket). If the
    sysctl tcp_l3mdev_accept is not set, this leads to ACK packets going out
    based on the main table even though the packet came in from an L3 domain.
    The end result is that the connection is never established, which creates
    confusion for users since the service is running and a socket shows in
    ss output. Fix by requiring an exact dif to sk_bound_dev_if match if the
    skb came through an interface enslaved to an l3mdev device and the
    tcp_l3mdev_accept is not set.

    skb's through an l3mdev interface are marked by setting a flag in
    inet{6}_skb_parm. The IPv6 variant is already set; this patch adds the
    flag for IPv4. Using an skb flag avoids a device lookup on the dif. The
    flag is set in the VRF driver using the IP{6}CB macros. For IPv4, the
    inet_skb_parm struct is moved in the cb per commit 971f10eca186, so the
    match function in the TCP stack needs to use TCP_SKB_CB. For IPv6, the
    move is done after the socket lookup, so IP6CB is used.

    The flags field in inet_skb_parm struct needs to be increased to add
    another flag. There is currently a 1-byte hole following the flags,
    so it can be expanded to u16 without increasing the size of the struct.

    Fixes: 193125dbd8eb ("net: Introduce VRF device driver")
    Signed-off-by: David Ahern
    Signed-off-by: David S. Miller

    David Ahern
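
The matching rule in this commit can be sketched as two small predicates. This is a userspace model under stated assumptions: `struct sketch_sock` and both function names are hypothetical, `dif` is the device index the lookup sees for the packet, and a bound device index of 0 means the socket is not bound to any device.

```c
#include <stdbool.h>

struct sketch_sock {
    int bound_dev_if;   /* 0 means not bound to a device (global socket) */
};

/* An exact device match is required when the accept sysctl is off and
 * the packet came through an interface enslaved to an l3mdev device. */
static bool sketch_exact_dif_match(bool l3mdev_accept, bool skb_from_l3_slave)
{
    return !l3mdev_accept && skb_from_l3_slave;
}

/* Device part of the socket match only; ports and addresses are omitted. */
static bool sketch_sock_matches(const struct sketch_sock *sk, int dif,
                                bool exact_dif)
{
    if (sk->bound_dev_if != 0 || exact_dif)
        return sk->bound_dev_if == dif;
    return true;   /* unbound socket and no exact match required */
}
```

Under this model a global socket no longer matches a packet that arrived in an L3 domain unless tcp_l3mdev_accept is set, while a socket bound to the matching device index still matches in both cases.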
     

17 Sep, 2016

1 commit


11 Sep, 2016

6 commits


06 Jul, 2016

1 commit


18 Jun, 2016

1 commit

  • IPv6 source address selection needs to consider the real egress route.
    Similar to IPv4, implement a get_saddr6 method which is called if the
    source address has not been set. The get_saddr6 method does a full
    lookup, which means pulling a route from the VRF FIB table and properly
    considering link-local/multicast destination addresses. Lookup failures
    (e.g., unreachable) then cause the source address selection to fail,
    which gets propagated back to the caller.

    Signed-off-by: David Ahern
    Signed-off-by: David S. Miller

    David Ahern
     

16 Jun, 2016

4 commits

  • Attempting to delete a VRF device with a socket bound to it can stall:

    unregister_netdevice: waiting for red to become free. Usage count = 1

    The unregister is waiting for the dst to be released and, with it, the
    references to the vrf device. Similar to dst_ifdown, switch the dst
    dev to loopback on delete for all of the dst's for the vrf device
    and release the references to the vrf device.

    Fixes: 193125dbd8eb2 ("net: Introduce VRF device driver")
    Fixes: 35402e3136634 ("net: Add IPv6 support to VRF device")
    Signed-off-by: David Ahern
    Signed-off-by: David S. Miller

    David Ahern
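
The reference handling behind this fix can be illustrated with a toy refcount model. Everything here is hypothetical (the structs, the plain integer counter, the function name); it only shows why repointing a dst at loopback lets the VRF device's usage count drop to zero.

```c
struct sketch_dev {
    const char *name;
    int refcnt;               /* usage count that blocks unregister while > 0 */
};

struct sketch_dst {
    struct sketch_dev *dev;   /* the dst holds one reference on dev */
};

/* Repoint the dst at the loopback device: take a reference on loopback
 * first, then drop the one held on the old device. */
static void sketch_dst_switch_to_loopback(struct sketch_dst *dst,
                                          struct sketch_dev *loopback)
{
    struct sketch_dev *old = dst->dev;

    if (old == loopback)
        return;
    loopback->refcnt++;
    dst->dev = loopback;
    old->refcnt--;
}
```

In this model, a socket that still caches the dst keeps a valid device pointer (loopback), and the deleted device is no longer pinned by it.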
     
  • 1. Default VRF devices to not having a qdisc (IFF_NO_QUEUE). Users
    can add one as desired.

    2. Disable adding a VLAN to a VRF device.

    3. Enable offloads and hardware features similar to other logical
    devices (e.g., dummy, veth).

    This change provides a significant boost in TCP stream Tx performance,
    from ~2,700 Mbps to ~18,100 Mbps, and brings throughput close to the
    performance without a VRF (18,500 Mbps). Numbers are from the netperf
    TCP_STREAM benchmark using qemu with virtio+vhost for the NICs.

    Signed-off-by: David Ahern
    Signed-off-by: David S. Miller

    David Ahern
     
  • IPv6 multicast and link-local addresses require special handling by the
    VRF driver:
    1. Rather than using the VRF device index and full FIB lookups,
    packets to/from these addresses should use direct FIB lookups based on
    the VRF device table.

    2. Fail sends/receives on a VRF device to/from a multicast address
    (e.g., make ping6 ff02::1% fail).

    3. Move the setting of the flow oif to the first dst lookup and revert
    the change in icmpv6_echo_reply made in ca254490c8dfd ("net: Add VRF
    support to IPv6 stack"). Linklocal/mcast addresses require use of the
    skb->dev.

    With this change, connections into and out of a VRF enslaved device work
    for multicast and link-local addresses (icmp, tcp, and udp), e.g.,

    1. packets into VM with VRF config:
    ping6 -c3 fe80::e0:f9ff:fe1c:b974%br1
    ping6 -c3 ff02::1%br1

    ssh -6 fe80::e0:f9ff:fe1c:b974%br1

    2. packets going out a VRF enslaved device:
    ping6 -c3 fe80::18f8:83ff:fe4b:7a2e%eth1
    ping6 -c3 ff02::1%eth1
    ssh -6 root@fe80::18f8:83ff:fe4b:7a2e%eth1

    Signed-off-by: David Ahern
    Signed-off-by: David S. Miller

    David Ahern
     
  • Allow drivers to pass flow arg to functions where the arg is not const
    and allow the driver to make updates as needed (e.g., setting oif).

    Signed-off-by: David Ahern
    Signed-off-by: David S. Miller

    David Ahern
     

10 Jun, 2016

2 commits

  • Frank Kellermann reported a kernel crash with 4.5.0 when IPv6 is
    disabled at boot using the kernel option ipv6.disable=1. Using
    current net-next with the boot option:

    $ ip link add red type vrf table 1001

    Generates:
    [12210.919584] BUG: unable to handle kernel NULL pointer dereference at 0000000000000748
    [12210.921341] IP: [] fib6_get_table+0x2c/0x5a
    [12210.922537] PGD b79e3067 PUD bb32b067 PMD 0
    [12210.923479] Oops: 0000 [#1] SMP
    [12210.924001] Modules linked in: ipvlan 8021q garp mrp stp llc
    [12210.925130] CPU: 3 PID: 1177 Comm: ip Not tainted 4.7.0-rc1+ #235
    [12210.926168] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.7.5-20140531_083030-gandalf 04/01/2014
    [12210.928065] task: ffff8800b9ac4640 ti: ffff8800bacac000 task.ti: ffff8800bacac000
    [12210.929328] RIP: 0010:[] [] fib6_get_table+0x2c/0x5a
    [12210.930697] RSP: 0018:ffff8800bacaf888 EFLAGS: 00010202
    [12210.931563] RAX: 0000000000000748 RBX: ffffffff81a9e280 RCX: ffff8800b9ac4e28
    [12210.932688] RDX: 00000000000000e9 RSI: 0000000000000002 RDI: 0000000000000286
    [12210.933820] RBP: ffff8800bacaf898 R08: ffff8800b9ac4df0 R09: 000000000052001b
    [12210.934941] R10: 00000000657c0000 R11: 000000000000c649 R12: 00000000000003e9
    [12210.936032] R13: 00000000000003e9 R14: ffff8800bace7800 R15: ffff8800bb3ec000
    [12210.937103] FS: 00007faa1766c700(0000) GS:ffff88013ac00000(0000) knlGS:0000000000000000
    [12210.938321] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    [12210.939166] CR2: 0000000000000748 CR3: 00000000b79d6000 CR4: 00000000000406e0
    [12210.940278] Stack:
    [12210.940603] ffff8800bb3ec000 ffffffff81a9e280 ffff8800bacaf8c8 ffffffff814b3135
    [12210.941818] ffff8800bb3ec000 ffffffff81a9e280 ffffffff81a9e280 ffff8800bace7800
    [12210.943040] ffff8800bacaf8f0 ffffffff81397c88 ffff8800bb3ec000 ffffffff81a9e280
    [12210.944288] Call Trace:
    [12210.944688] [] fib6_new_table+0x24/0x8a
    [12210.945516] [] vrf_dev_init+0xd4/0x162
    [12210.946328] [] register_netdevice+0x100/0x396
    [12210.947209] [] vrf_newlink+0x40/0xb3
    [12210.948001] [] rtnl_newlink+0x5d3/0x6d5
    ...

    The problem above is due to the fact that the fib hash table is not
    allocated when IPv6 is disabled at boot.

    As for the VRF driver it should not do any IPv6 initializations if IPv6
    is disabled, so it needs to know if IPv6 is disabled at boot. The disable
    parameter is private to the IPv6 module, so provide an accessor for
    modules to determine if IPv6 was disabled at boot time.

    Fixes: 35402e3136634 ("net: Add IPv6 support to VRF device")
    Signed-off-by: David Ahern
    Signed-off-by: David S. Miller

    David Ahern
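
The shape of this fix, an accessor that lets another module skip IPv6 setup, can be sketched as below. All names are hypothetical and the model runs in userspace; the static flag stands in for the IPv6 module's private disable parameter.

```c
#include <stdbool.h>

/* Stand-in for the IPv6 module's private "disable" parameter
 * (as set by ipv6.disable=1 at boot). */
static bool sketch_ipv6_disabled;

/* Hypothetical accessor exposing the private parameter to other modules. */
static bool sketch_ipv6_mod_enabled(void)
{
    return !sketch_ipv6_disabled;
}

struct sketch_vrf {
    bool v4_ready;
    bool v6_ready;
};

/* Device init: always set up IPv4, set up IPv6 only if it is enabled. */
static int sketch_vrf_dev_init(struct sketch_vrf *vrf)
{
    vrf->v4_ready = true;
    vrf->v6_ready = false;

    if (!sketch_ipv6_mod_enabled())
        return 0;   /* skip IPv6 setup: its tables were never allocated */

    vrf->v6_ready = true;
    return 0;
}
```

Keeping the parameter private and exporting only a read accessor means the driver cannot change the setting, only query it.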
     
  • In case a qdisc is used on a vrf device, we need to use different
    lockdep classes to avoid false positives.

    Use the new netdev_lockdep_set_classes() generic helper.

    Reported-by: David Ahern
    Signed-off-by: Eric Dumazet
    Tested-by: David Ahern
    Signed-off-by: David S. Miller

    Eric Dumazet
     

09 Jun, 2016

1 commit


08 Jun, 2016

3 commits

  • Add support for locally originated traffic to VRF-local IPv6 addresses.
    Similar to IPv4 a local dst is set on the skb and the packet is
    reinserted with a call to netif_rx. With this patch, ping, tcp and udp
    packets to a local IPv6 address are successfully routed:

    $ ip addr show dev eth1
    4: eth1: mtu 1500 qdisc pfifo_fast master red state UP group default qlen 1000
    link/ether 02:e0:f9:1c:b9:74 brd ff:ff:ff:ff:ff:ff
    inet 10.100.1.1/24 brd 10.100.1.255 scope global eth1
    valid_lft forever preferred_lft forever
    inet6 2100:1::1/120 scope global
    valid_lft forever preferred_lft forever
    inet6 fe80::e0:f9ff:fe1c:b974/64 scope link
    valid_lft forever preferred_lft forever

    $ ping6 -c1 -I red 2100:1::1
    ping6: Warning: source address might be selected on device other than red.
    PING 2100:1::1(2100:1::1) from 2100:1::1 red: 56 data bytes
    64 bytes from 2100:1::1: icmp_seq=1 ttl=64 time=0.098 ms

    ip6_input is exported so the VRF driver can use it for the dst input
    function. The dst_alloc function for IPv4 defaults to setting the input and
    output functions; IPv6's does not. VRF does not need to duplicate the Rx path
    so just export the ipv6 input function.

    Signed-off-by: David Ahern
    Signed-off-by: David S. Miller

    David Ahern
     
  • Add support for locally originated traffic to VRF-local addresses. If
    destination device for an skb is the loopback or VRF device then set
    its dst to a local version of the VRF cached dst_entry and call netif_rx
    to insert the packet onto the rx queue - similar to what is done for
    loopback. This patch handles IPv4 support; follow on patch handles IPv6.

    With this patch, ping, tcp and udp packets to a local IPv4 address are
    successfully routed:

    $ ip addr show dev eth1
    4: eth1: mtu 1500 qdisc pfifo_fast master red state UP group default qlen 1000
    link/ether 02:e0:f9:1c:b9:74 brd ff:ff:ff:ff:ff:ff
    inet 10.100.1.1/24 brd 10.100.1.255 scope global eth1
    valid_lft forever preferred_lft forever
    inet6 2100:1::1/120 scope global
    valid_lft forever preferred_lft forever
    inet6 fe80::e0:f9ff:fe1c:b974/64 scope link
    valid_lft forever preferred_lft forever

    $ ping -c1 -I red 10.100.1.1
    ping: Warning: source address might be selected on device other than red.
    PING 10.100.1.1 (10.100.1.1) from 10.100.1.1 red: 56(84) bytes of data.
    64 bytes from 10.100.1.1: icmp_seq=1 ttl=64 time=0.057 ms

    This patch also enables use of IPv4 loopback address on the VRF device:
    $ ip addr add dev red 127.0.0.1/8

    $ ping -c1 -I red 127.0.0.1
    PING 127.0.0.1 (127.0.0.1) from 127.0.0.1 red: 56(84) bytes of data.
    64 bytes from 127.0.0.1: icmp_seq=1 ttl=64 time=0.058 ms

    Signed-off-by: David Ahern
    Signed-off-by: David S. Miller

    David Ahern
     
  • Move the stripping of the ethernet header from is_ip_tx_frame into the
    ipv4 and ipv6 outbound functions and collapse vrf_send_v4_prep into
    vrf_process_v4_outbound.

    Signed-off-by: David Ahern
    Signed-off-by: David S. Miller

    David Ahern
     

07 Jun, 2016

4 commits

  • This reverts commit 2fb7ea455d57e22110c54fc2de0656b6f744263c.

    It results in build errors because ip6_input is not a
    symbol exported to modules.

    Signed-off-by: David S. Miller

    David S. Miller
     
  • Add support for locally originated traffic to VRF-local IPv6 addresses.
    Similar to IPv4 a local dst is set on the skb and the packet is
    reinserted with a call to netif_rx. With this patch, ping, tcp and udp
    packets to a local IPv6 address are successfully routed:

    $ ip addr show dev eth1
    4: eth1: mtu 1500 qdisc pfifo_fast master red state UP group default qlen 1000
    link/ether 02:e0:f9:1c:b9:74 brd ff:ff:ff:ff:ff:ff
    inet 10.100.1.1/24 brd 10.100.1.255 scope global eth1
    valid_lft forever preferred_lft forever
    inet6 2100:1::1/120 scope global
    valid_lft forever preferred_lft forever
    inet6 fe80::e0:f9ff:fe1c:b974/64 scope link
    valid_lft forever preferred_lft forever

    $ ping6 -c1 -I red 2100:1::1
    ping6: Warning: source address might be selected on device other than red.
    PING 2100:1::1(2100:1::1) from 2100:1::1 red: 56 data bytes
    64 bytes from 2100:1::1: icmp_seq=1 ttl=64 time=0.098 ms

    ip6_input is exported so the VRF driver can use it for the dst input
    function. The dst_alloc function for IPv4 defaults to setting the input and
    output functions; IPv6's does not. VRF does not need to duplicate the Rx path
    so just export the ipv6 input function.

    Signed-off-by: David Ahern
    Signed-off-by: David S. Miller

    David Ahern
     
  • Add support for locally originated traffic to VRF-local addresses. If
    destination device for an skb is the loopback or VRF device then set
    its dst to a local version of the VRF cached dst_entry and call netif_rx
    to insert the packet onto the rx queue - similar to what is done for
    loopback. This patch handles IPv4 support; follow on patch handles IPv6.

    With this patch, ping, tcp and udp packets to a local IPv4 address are
    successfully routed:

    $ ip addr show dev eth1
    4: eth1: mtu 1500 qdisc pfifo_fast master red state UP group default qlen 1000
    link/ether 02:e0:f9:1c:b9:74 brd ff:ff:ff:ff:ff:ff
    inet 10.100.1.1/24 brd 10.100.1.255 scope global eth1
    valid_lft forever preferred_lft forever
    inet6 2100:1::1/120 scope global
    valid_lft forever preferred_lft forever
    inet6 fe80::e0:f9ff:fe1c:b974/64 scope link
    valid_lft forever preferred_lft forever

    $ ping -c1 -I red 10.100.1.1
    ping: Warning: source address might be selected on device other than red.
    PING 10.100.1.1 (10.100.1.1) from 10.100.1.1 red: 56(84) bytes of data.
    64 bytes from 10.100.1.1: icmp_seq=1 ttl=64 time=0.057 ms

    This patch also enables use of IPv4 loopback address on the VRF device:
    $ ip addr add dev red 127.0.0.1/8

    $ ping -c1 -I red 127.0.0.1
    PING 127.0.0.1 (127.0.0.1) from 127.0.0.1 red: 56(84) bytes of data.
    64 bytes from 127.0.0.1: icmp_seq=1 ttl=64 time=0.058 ms

    Signed-off-by: David Ahern
    Signed-off-by: David S. Miller

    David Ahern
     
  • Move the stripping of the ethernet header from is_ip_tx_frame into the
    ipv4 and ipv6 outbound functions. If the packet is destined to a local
    address the header is retained since the packet is sent back to netif_rx.

    Collapse vrf_send_v4_prep into vrf_process_v4_outbound.

    Signed-off-by: David Ahern
    Signed-off-by: David S. Miller

    David Ahern
     

03 Jun, 2016

1 commit

  • The VRF device exists to define L3 domains and guide FIB lookups. As
    such its operstate is not relevant. Seeing 'state UNKNOWN' in the
    output of 'ip link show' can be confusing, so set operstate at link
    create.

    Similarly, the MTU for a VRF device is not used; any fragmentation
    of the payload is done on the output path based on the real egress
    device. An MTU of 1500 on the VRF device while enslaved devices
    have a higher MTU can lead to confusion. Since the VRF MTU is not
    relevant, set it to 64k, similar to what is done for loopback.

    Signed-off-by: David Ahern
    Signed-off-by: David S. Miller

    David Ahern
     

17 May, 2016

1 commit

  • One cpu can be processing packets which includes using the cached route
    entries in the vrf device's private data and on another cpu the device
    gets deleted which releases the routes and sets the pointers in net_vrf
    to NULL. This results in datapath dereferencing a NULL pointer.

    Fix by protecting access to dst's with rcu.

    Fixes: 193125dbd8eb ("net: Introduce VRF device driver")
    Fixes: 35402e313663 ("net: Add IPv6 support to VRF device")
    Signed-off-by: David Ahern
    Signed-off-by: David S. Miller

    David Ahern
     

12 May, 2016

1 commit

  • Currently the VRF driver uses the rx_handler to switch the skb device
    to the VRF device. Switching the dev prior to the ip / ipv6 layer
    means the VRF driver has to duplicate IP/IPv6 processing which adds
    overhead and makes features such as retaining the ingress device index
    more complicated than necessary.

    This patch moves the hook to the L3 layer just after the first NF_HOOK
    for PRE_ROUTING. This location makes exposing the original ingress device
    trivial (next patch) and allows adding other NF_HOOKs to the VRF driver
    in the future.

    dev_queue_xmit_nit is exported so that the VRF driver can cycle the skb
    with the switched device through the packet taps to maintain current
    behavior (tcpdump can be used on either the vrf device or the enslaved
    devices).

    Signed-off-by: David Ahern
    Signed-off-by: David S. Miller

    David Ahern
     

10 May, 2016

1 commit

  • Allow udp and raw sockets to send by oif that is an enslaved interface
    versus the l3mdev/VRF device. For example, this allows BFD to use ifindex
    from IP_PKTINFO on a receive to send a response without the need to
    convert to the VRF index. It also allows ping and ping6 to work when
    specifying an enslaved interface (e.g., ping -I swp1 ) which is
    a natural use case.

    Signed-off-by: David Ahern
    Signed-off-by: David S. Miller

    David Ahern
     

07 May, 2016

1 commit


12 Apr, 2016

1 commit

  • Vivek reported a kernel exception deleting a VRF with an active
    connection through it. The root cause is that the socket has a cached
    reference to a dst that is destroyed. Converting the dst_destroy to
    dst_release and letting proper reference counting kick in does not
    work as the dst has a reference to the device which needs to be released
    as well.

    I talked to Hannes about this at netdev and he pointed out the ipv4 and
    ipv6 dst handling has dst_ifdown for just this scenario. Rather than
    continuing with the reinvented dst wheel in VRF just remove it and
    leverage the ipv4 and ipv6 versions.

    Fixes: 193125dbd8eb2 ("net: Introduce VRF device driver")
    Fixes: 35402e3136634 ("net: Add IPv6 support to VRF device")

    Signed-off-by: David Ahern
    Signed-off-by: David S. Miller

    David Ahern
     

09 Mar, 2016

1 commit


26 Feb, 2016

1 commit

  • Nik pointed out that the VRF driver should be using skb_header_pointer
    instead of directly accessing skb->data and the bytes beyond it, which can
    be garbage.

    Fixes: 35402e313663 ("net: Add IPv6 support to VRF device")
    Cc: Nikolay Aleksandrov
    Signed-off-by: David Ahern
    Signed-off-by: Nikolay Aleksandrov
    Signed-off-by: David S. Miller

    David Ahern
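
The difference between reading packet bytes directly and going through a bounds-checked accessor can be shown with a flat-buffer analogue. This is not skb_header_pointer itself: `sketch_header_pointer` is a hypothetical helper that works on a plain byte array and always copies, whereas a real skb may be non-linear.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Copy len bytes starting at offset into buffer and return buffer,
 * or return NULL if the request runs past the end of the packet. */
static const void *sketch_header_pointer(const uint8_t *data, size_t data_len,
                                         size_t offset, size_t len,
                                         void *buffer)
{
    if (offset > data_len || len > data_len - offset)
        return NULL;   /* header not fully present: caller must bail out */
    memcpy(buffer, data + offset, len);
    return buffer;
}
```

A caller that checks for NULL before parsing never reads bytes that lie beyond the data actually received, which is the failure mode the commit describes.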
     

11 Feb, 2016

1 commit


08 Feb, 2016

1 commit


07 Jan, 2016

1 commit