17 Aug, 2018

1 commit

  • David Ahern reported a memory leak in veth.

    =======================================================================
    $ cat /sys/kernel/debug/kmemleak
    unreferenced object 0xffff8800354d5c00 (size 1024):
    comm "ip", pid 836, jiffies 4294722952 (age 25.904s)
    hex dump (first 32 bytes):
    00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
    00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
    backtrace:
    [] kmemleak_alloc+0x70/0x94
    [] slab_post_alloc_hook+0x42/0x52
    [] __kmalloc+0x101/0x142
    [] kmalloc_array.constprop.20+0x1e/0x26 [veth]
    [] veth_newlink+0x147/0x3ac [veth]
    ...
    unreferenced object 0xffff88002e009c00 (size 1024):
    comm "ip", pid 836, jiffies 4294722958 (age 25.898s)
    hex dump (first 32 bytes):
    00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
    00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
    backtrace:
    [] kmemleak_alloc+0x70/0x94
    [] slab_post_alloc_hook+0x42/0x52
    [] __kmalloc+0x101/0x142
    [] kmalloc_array.constprop.20+0x1e/0x26 [veth]
    [] veth_newlink+0x219/0x3ac [veth]
    =======================================================================

    veth_rq allocated in veth_newlink() was not freed on dellink.

    We need to free them up after veth_close() so that any packets will not
    reference the queues afterwards. Thus free them in veth_dev_free() in
    the same way as freeing stats structure (vstats).

    Also move queues allocation to veth_dev_init() to be in line with stats
    allocation.
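The pairing described above (allocate queues where stats are allocated, free them where stats are freed) can be sketched in userspace C. This is only a model: `dev_model`, `rq_model`, `dev_model_init()` and `dev_model_free()` are hypothetical stand-ins, not the driver's structures or hooks.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical userspace stand-ins; not the kernel's definitions. */
struct rq_model {
    int napi_enabled;
};

struct dev_model {
    struct rq_model *rq;    /* per-queue array, modelled on the veth_rq array */
    unsigned long *stats;   /* stands in for the stats structure (vstats) */
    size_t num_rx_queues;
};

/* Models the init hook: queues and stats are allocated in one place. */
int dev_model_init(struct dev_model *dev, size_t num_rx_queues)
{
    assert(num_rx_queues > 0);
    dev->num_rx_queues = num_rx_queues;
    dev->rq = calloc(num_rx_queues, sizeof(*dev->rq));
    if (!dev->rq)
        return -1;
    dev->stats = calloc(1, sizeof(*dev->stats));
    if (!dev->stats) {
        free(dev->rq);
        dev->rq = NULL;
        return -1;
    }
    return 0;
}

/* Models the free hook: runs after close and releases both allocations. */
void dev_model_free(struct dev_model *dev)
{
    free(dev->rq);
    free(dev->stats);
    dev->rq = NULL;
    dev->stats = NULL;
}
```

Keeping allocation and release in matching hooks means every successful init has exactly one corresponding free, which is the property the leak report showed was missing.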

    Fixes: 638264dc90227 ("veth: Support per queue XDP ring")
    Reported-by: David Ahern
    Signed-off-by: Toshiaki Makita
    Reviewed-by: David Ahern
    Tested-by: David Ahern
    Signed-off-by: David S. Miller

    Toshiaki Makita
     

10 Aug, 2018

6 commits

  • Move XDP and napi related fields from veth_priv to newly created veth_rq
    structure.

    When xdp_frames are enqueued from ndo_xdp_xmit and XDP_TX, rxq is
    selected by current cpu.

    When skbs are enqueued from the peer device, rxq is a one-to-one mapping
    of its peer txq. This imposes the restriction that the number of rxqs
    must not be less than the number of peer txqs, but leaves open the
    possibility of bulk skb xmit in the future, because the txq lock would
    make it possible to remove the rxq ptr_ring lock.
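The two selection rules above can be illustrated with a small userspace sketch. The function names are hypothetical, and the modulo used to fold a CPU number onto the available queues is an assumption; the commit message only says the rxq is selected by the current CPU.

```c
#include <assert.h>

/* xdp_frame path: the queue follows the CPU doing the enqueue.
 * Folding with a modulo is an assumption made for this sketch. */
unsigned int select_rxq_by_cpu(unsigned int cpu, unsigned int num_rxqs)
{
    assert(num_rxqs > 0);
    return cpu % num_rxqs;
}

/* skb path: peer txq N feeds rxq N. Returns -1 when no such rxq exists,
 * which is why the number of rxqs must not be less than the peer's txqs. */
int select_rxq_by_txq(unsigned int peer_txq, unsigned int num_rxqs)
{
    return peer_txq < num_rxqs ? (int)peer_txq : -1;
}
```

With 4 rxqs, a frame enqueued from CPU 5 would land on rxq 1 under this model, while an skb from peer txq 4 has no matching rxq.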

    v3:
    - Add extack messages.
    - Fix array overrun in veth_xmit.

    Signed-off-by: Toshiaki Makita
    Signed-off-by: Daniel Borkmann

    Toshiaki Makita
     
  • This allows further redirection of xdp_frames like

    NIC -> veth--veth -> veth--veth
    (XDP) (XDP) (XDP)

    The intermediate XDP, redirecting packets from NIC to the other veth,
    reuses xdp_mem_info from NIC so that page recycling of the NIC works on
    the destination veth's XDP.
    In this way return_frame is not fully guarded by NAPI, since another
    NAPI handler on another cpu may use the same xdp_mem_info concurrently.
    Thus disable napi_direct by xdp_set_return_frame_no_direct() during the
    NAPI context.

    v8:
    - Don't use xdp_frame pointer address for data_hard_start of xdp_buff.

    v4:
    - Use xdp_[set|clear]_return_frame_no_direct() instead of a flag in
    xdp_mem_info.

    v3:
    - Fix double free when veth_xdp_tx() returns a positive value.
    - Convert xdp_xmit and xdp_redir variables into flags.

    Signed-off-by: Toshiaki Makita
    Signed-off-by: Daniel Borkmann

    Toshiaki Makita
     
  • This allows NIC's XDP to redirect packets to veth. The destination veth
    device enqueues redirected packets to the napi ring of its peer, then
    they are processed by XDP on its peer veth device.
    This can be thought of as one XDP program calling another XDP program
    using REDIRECT, when the peer enables driver XDP.

    Note that when the peer veth device does not set driver xdp, redirected
    packets will be dropped because the peer is not ready for NAPI.

    v4:
    - Don't use xdp_ok_fwd_dev() because checking IFF_UP is not necessary.
    Add comments about it and check only MTU.

    v2:
    - Drop the part converting xdp_frame into skb when XDP is not enabled.
    - Implement bulk interface of ndo_xdp_xmit.
    - Implement XDP_XMIT_FLUSH bit and drop ndo_xdp_flush.

    Signed-off-by: Toshiaki Makita
    Acked-by: John Fastabend
    Signed-off-by: Daniel Borkmann

    Toshiaki Makita
     
  • This is preparation for XDP TX and ndo_xdp_xmit.
    This allows napi handler to handle xdp_frames through xdp ring as well
    as sk_buff.
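One common way to let a single ring carry two kinds of pointer is to tag the low bit of the pointer, which is free because the objects are aligned. The sketch below shows that technique in userspace C; the macro and function names are hypothetical, and it is an assumption that the flag mentioned in this commit's notes is encoded this way.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define ENTRY_XDP_FLAG ((uintptr_t)1)   /* hypothetical name */

/* Tag a frame pointer before it is placed on the ring. */
void *entry_from_xdp(void *frame)
{
    /* The low bit must be free, i.e. the object is at least 2-byte aligned. */
    assert(((uintptr_t)frame & ENTRY_XDP_FLAG) == 0);
    return (void *)((uintptr_t)frame | ENTRY_XDP_FLAG);
}

/* True if the ring entry was tagged as an xdp_frame. */
bool entry_is_xdp(const void *entry)
{
    return ((uintptr_t)entry & ENTRY_XDP_FLAG) != 0;
}

/* Strip the tag to recover the original pointer (works for both kinds). */
void *entry_to_ptr(void *entry)
{
    return (void *)((uintptr_t)entry & ~ENTRY_XDP_FLAG);
}
```

The consumer checks `entry_is_xdp()` on each dequeued entry and then dispatches to the frame or skb handling path after stripping the tag.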

    v8:
    - Don't use xdp_frame pointer address to calculate skb->head and
    headroom.

    v7:
    - Use xdp_scrub_frame() instead of memset().

    v3:
    - Revert v2 change around rings and use a flag to differentiate skb and
    xdp_frame, since bulk skb xmit makes little performance difference
    for now.

    v2:
    - Use another ring instead of using flag to differentiate skb and
    xdp_frame. This approach makes bulk skb transmit possible in
    veth_xmit later.
    - Clear xdp_frame fields in skb->head.
    - Implement adjust_tail.

    Signed-off-by: Toshiaki Makita
    Acked-by: John Fastabend
    Acked-by: Jesper Dangaard Brouer
    Signed-off-by: Daniel Borkmann

    Toshiaki Makita
     
  • Oversized packets including GSO packets can be dropped if XDP is
    enabled on receiver side, so don't send such packets from peer.

    Drop TSO and SCTP fragmentation features so that veth devices themselves
    segment packets with XDP enabled. Also cap MTU accordingly.

    v4:
    - Don't auto-adjust MTU but cap max MTU.

    Signed-off-by: Toshiaki Makita
    Signed-off-by: Daniel Borkmann

    Toshiaki Makita
     
  • This is the basic implementation of veth driver XDP.

    Incoming packets are sent from the peer veth device in the form of skb,
    so this is generally doing the same thing as generic XDP.

    This by itself is not so useful, but it is a starting point to implement
    other useful veth XDP features like TX and REDIRECT.

    This introduces NAPI when XDP is enabled, because XDP now relies heavily
    on NAPI context. Use ptr_ring to emulate a NIC ring. The Tx function
    enqueues packets to the ring and the peer NAPI handler drains the ring.

    Currently only one ring is allocated for each veth device, so it does
    not scale in a multiqueue environment. This can be resolved by allocating
    rings on a per-queue basis later.

    Note that NAPI is not used but netif_rx is used when XDP is not loaded,
    so this does not change the default behaviour.
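The enqueue-then-drain arrangement described above can be modelled with a small pointer ring in which a NULL slot means "empty". This is a single-threaded userspace sketch with hypothetical names and an illustrative ring size; it leaves out the locking and memory ordering a real ptr_ring provides.

```c
#include <assert.h>
#include <stddef.h>

#define RING_SIZE 4   /* illustrative; not the driver's ring size */

struct ring_model {
    void *slot[RING_SIZE];  /* NULL marks an empty slot */
    size_t head;            /* next slot the producer fills */
    size_t tail;            /* next slot the consumer drains */
};

/* Tx side: returns 0 on success, -1 if the ring is full (packet dropped). */
int ring_produce(struct ring_model *r, void *pkt)
{
    assert(pkt != NULL);
    if (r->slot[r->head] != NULL)
        return -1;
    r->slot[r->head] = pkt;
    r->head = (r->head + 1) % RING_SIZE;
    return 0;
}

/* Poll side: drain up to budget entries and return how many were handled. */
int ring_poll(struct ring_model *r, int budget, void (*handle)(void *pkt))
{
    int done = 0;

    while (done < budget && r->slot[r->tail] != NULL) {
        void *pkt = r->slot[r->tail];

        r->slot[r->tail] = NULL;
        r->tail = (r->tail + 1) % RING_SIZE;
        if (handle)
            handle(pkt);
        done++;
    }
    return done;
}
```

A poll that returns fewer entries than its budget has emptied the ring, which is the usual signal for a NAPI-style handler to stop polling until more work arrives.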

    v6:
    - Check skb->len only when allocation is needed.
    - Add __GFP_NOWARN to alloc_page() as it can be triggered by external
    events.

    v3:
    - Fix race on closing the device.
    - Add extack messages in ndo_bpf.

    v2:
    - Squashed with the patch adding NAPI.
    - Implement adjust_tail.
    - Don't acquire consumer lock because it is guarded by NAPI.
    - Make poll_controller noop since it is unnecessary.
    - Register rxq_info on enabling XDP rather than on opening the device.

    Signed-off-by: Toshiaki Makita
    Signed-off-by: Daniel Borkmann

    Toshiaki Makita
     

09 Dec, 2017

1 commit

  • When new veth is created, and GSO values have been configured
    on one device, clone those values to the peer.

    For example:
    # ip link add dev vm1 gso_max_size 65530 type veth peer name vm2

    This should create vm1 and vm2, both having the GSO maximum
    size set to 65530.

    Signed-off-by: Stephen Hemminger
    Signed-off-by: David S. Miller

    Stephen Hemminger
     

01 Jul, 2017

1 commit


27 Jun, 2017

2 commits


22 Jun, 2017

1 commit

  • There are a number of problems with configuring the peer
    network device in the absence of IFLA_VETH_PEER attributes,
    where attributes for the main network device are shared with
    the peer.

    First, it is not feasible to configure both network
    devices with the same MAC address, since this makes
    communication in such a configuration problematic.

    This case can be reproduced with the following sequence:

    # ip link add address 02:11:22:33:44:55 type veth
    # ip li sh
    ...
    26: veth0@veth1: mtu 1500 qdisc \
    noop state DOWN mode DEFAULT qlen 1000
    link/ether 00:11:22:33:44:55 brd ff:ff:ff:ff:ff:ff
    27: veth1@veth0: mtu 1500 qdisc \
    noop state DOWN mode DEFAULT qlen 1000
    link/ether 00:11:22:33:44:55 brd ff:ff:ff:ff:ff:ff

    Second, it is not possible to register both main and
    peer network devices with the same name, which happens
    when the name for the main interface is given with IFLA_IFNAME
    and the same attribute is reused for the peer.

    This case can be reproduced with the following sequence:

    # ip link add dev veth1a type veth
    RTNETLINK answers: File exists

    To fix both cases, take the corresponding netlink
    attributes from peer_tb when they are valid; otherwise use a
    name based on the rtnl ops kind and a random address.

    Signed-off-by: Serhey Popovych
    Signed-off-by: David S. Miller

    Serhey Popovych
     

08 Jun, 2017

1 commit

  • Network devices can allocate resources and private memory using
    netdev_ops->ndo_init(). However, the release of these resources
    can occur in one of two different places.

    Either netdev_ops->ndo_uninit() or netdev->destructor().

    The decision of which operation frees the resources depends upon
    whether it is necessary for all netdev refs to be released before it
    is safe to perform the freeing.

    netdev_ops->ndo_uninit() presumably can occur right after the
    NETDEV_UNREGISTER notifier completes and the unicast and multicast
    address lists are flushed.

    netdev->destructor(), on the other hand, does not run until the
    netdev references all go away.

    Further complicating the situation is that netdev->destructor()
    almost universally also does a free_netdev().

    This creates a problem for the logic in register_netdevice(),
    because all callers of register_netdevice() manage the freeing
    of the netdev and invoke free_netdev(dev) if register_netdevice()
    fails.

    If netdev_ops->ndo_init() succeeds, but something else fails inside
    of register_netdevice(), it does call ndo_ops->ndo_uninit(). But
    it is not able to invoke netdev->destructor().

    This is because netdev->destructor() will do a free_netdev() and
    then the caller of register_netdevice() will do the same.

    However, this means that the resources that would normally be released
    by netdev->destructor() will not be.

    Over the years drivers have added local hacks to deal with this, by
    invoking their destructor parts by hand when register_netdevice()
    fails.

    Many drivers do not try to deal with this, and instead we have leaks.

    Let's close this hole by formalizing the distinction between what
    private things need to be freed up by netdev->destructor() and whether
    the driver needs unregister_netdevice() to perform the free_netdev().

    netdev->priv_destructor() performs all actions to free up the private
    resources that used to be freed by netdev->destructor(), except for
    free_netdev().

    netdev->needs_free_netdev is a boolean that indicates whether
    free_netdev() should be done at the end of unregister_netdevice().

    Now, register_netdevice() can sanely release all resources after
    ndo_ops->ndo_init() succeeds, by invoking both ndo_ops->ndo_uninit()
    and netdev->priv_destructor().

    And at the end of unregister_netdevice(), we invoke
    netdev->priv_destructor() and optionally call free_netdev().
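The split described above can be modelled in userspace C. Everything here is a hypothetical stand-in (`netdev_model`, `register_model()`, `unregister_model()`, the example callbacks); it only illustrates which party frees what, not the kernel's actual code paths.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical model of a netdev with the two new fields. */
struct netdev_model {
    void *priv;                                   /* allocated by init */
    bool needs_free_netdev;
    int (*init)(struct netdev_model *dev);
    void (*uninit)(struct netdev_model *dev);
    void (*priv_destructor)(struct netdev_model *dev);
};

void free_netdev_model(struct netdev_model *dev)
{
    free(dev);
}

/* If a later step fails after init succeeded, release private resources
 * but leave the netdev itself for the caller to free. */
int register_model(struct netdev_model *dev, bool later_step_fails)
{
    assert(dev != NULL);
    if (dev->init && dev->init(dev) != 0)
        return -1;
    if (later_step_fails) {
        if (dev->uninit)
            dev->uninit(dev);
        if (dev->priv_destructor)
            dev->priv_destructor(dev);
        return -1;
    }
    return 0;
}

/* At unregister time: private cleanup, then an optional free of the netdev. */
void unregister_model(struct netdev_model *dev)
{
    if (dev->uninit)
        dev->uninit(dev);
    if (dev->priv_destructor)
        dev->priv_destructor(dev);
    if (dev->needs_free_netdev)
        free_netdev_model(dev);
}

/* Example driver callbacks for the model. */
int example_init(struct netdev_model *dev)
{
    dev->priv = malloc(16);
    return dev->priv ? 0 : -1;
}

void example_priv_destructor(struct netdev_model *dev)
{
    free(dev->priv);
    dev->priv = NULL;
}
```

Because the private destructor no longer frees the netdev, the failure path can call it safely and the caller's own free of the netdev does not become a double free.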

    Signed-off-by: David S. Miller

    David S. Miller
     

14 Apr, 2017

1 commit


30 Mar, 2017

1 commit


09 Jan, 2017

1 commit

  • The network device operation for reading statistics is only called
    in one place, and it ignores the return value. Having a structure
    return value is potentially confusing because some future driver could
    incorrectly assume that the return value was used.

    Fix all drivers with ndo_get_stats64 to have a void function.

    Signed-off-by: Stephen Hemminger
    Signed-off-by: David S. Miller

    stephen hemminger
     

21 Oct, 2016

1 commit

  • geneve:
    - Merge __geneve_change_mtu back into geneve_change_mtu, set max_mtu
    - This one isn't quite as straight-forward as others, could use some
    closer inspection and testing

    macvlan:
    - set min/max_mtu

    tun:
    - set min/max_mtu, remove tun_net_change_mtu

    vxlan:
    - Merge __vxlan_change_mtu back into vxlan_change_mtu
    - Set max_mtu to IP_MAX_MTU and retain dynamic MTU range checks in
    change_mtu function
    - This one is also not as straight-forward and could use closer inspection
    and testing from vxlan folks

    bridge:
    - set max_mtu of IP_MAX_MTU and retain dynamic MTU range checks in
    change_mtu function

    openvswitch:
    - set min/max_mtu, remove internal_dev_change_mtu
    - note: max_mtu wasn't checked previously, it's been set to 65535, which
    is the largest possible size supported

    sch_teql:
    - set min/max_mtu (note: max_mtu previously unchecked, used max of 65535)

    macsec:
    - min_mtu = 0, max_mtu = 65535

    macvlan:
    - min_mtu = 0, max_mtu = 65535

    ntb_netdev:
    - min_mtu = 0, max_mtu = 65535

    veth:
    - min_mtu = 68, max_mtu = 65535

    8021q:
    - min_mtu = 0, max_mtu = 65535
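With per-device limits declared as data, the range check that each driver used to carry can live in one place. The following userspace sketch shows the idea using the veth values listed above; `mtu_limits` and `mtu_in_range()` are hypothetical names, not the kernel's API.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical container for the per-device limits. */
struct mtu_limits {
    unsigned int min_mtu;
    unsigned int max_mtu;
};

/* Values listed for veth in this commit message. */
const struct mtu_limits veth_limits = { 68, 65535 };

/* One shared check in place of a per-driver change_mtu function. */
bool mtu_in_range(const struct mtu_limits *lim, unsigned int new_mtu)
{
    assert(lim->min_mtu <= lim->max_mtu);
    return new_mtu >= lim->min_mtu && new_mtu <= lim->max_mtu;
}
```

Drivers that still need dynamic checks, such as those noted for vxlan and bridge above, would keep their own function on top of a static range like this one.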

    CC: netdev@vger.kernel.org
    CC: Nicolas Dichtel
    CC: Hannes Frederic Sowa
    CC: Tom Herbert
    CC: Daniel Borkmann
    CC: Alexander Duyck
    CC: Paolo Abeni
    CC: Jiri Benc
    CC: WANG Cong
    CC: Roopa Prabhu
    CC: Pravin B Shelar
    CC: Sabrina Dubroca
    CC: Patrick McHardy
    CC: Stephen Hemminger
    CC: Pravin Shelar
    CC: Maxim Krasnyansky
    Signed-off-by: Jarod Wilson
    Signed-off-by: David S. Miller

    Jarod Wilson
     

31 Aug, 2016

1 commit

  • veth does not really transmit packets; it only moves the skb from one
    netdev to another, so GSO and checksumming are not really needed. Add
    the features to mpls_features to get the same benefit and performance
    with MPLS as without it.

    Reported-by: Lennert Buytenhek
    Signed-off-by: David Ahern
    Signed-off-by: David S. Miller

    David Ahern
     

27 Aug, 2016

1 commit

  • Commit b17c706987fa ("loopback: sctp: add NETIF_F_SCTP_CSUM to device
    features") added NETIF_F_SCTP_CRC to device features for lo device to
    improve the performance of sctp over lo.

    This patch is to add NETIF_F_SCTP_CRC to device features for veth to
    improve the performance of sctp over veth.

    Before this patch:
    ip netns exec cs_client netperf -H 10.167.12.2 -t SCTP_STREAM -- -m 10K
    Recv Send Send
    Socket Socket Message Elapsed
    Size Size Size Time Throughput
    bytes bytes bytes secs. 10^6bits/sec

    212992 212992 10240 10.00 1117.16

    After this patch:
    ip netns exec cs_client netperf -H 10.167.12.2 -t SCTP_STREAM -- -m 10K
    Recv Send Send
    Socket Socket Message Elapsed
    Size Size Size Time Throughput
    bytes bytes bytes secs. 10^6bits/sec

    212992 212992 10240 10.20 1415.22

    Tested-by: Li Shuang
    Signed-off-by: Xin Long
    Signed-off-by: David S. Miller

    Xin Long
     

22 Apr, 2016

1 commit


02 Mar, 2016

1 commit

  • The rx headroom for a veth dev is the peer device's needed_headroom.
    Avoid ping-pong updates by setting the private flag IFF_PHONY_HEADROOM.

    This avoids skb head reallocation when forwarding from a veth dev
    towards a device adding some kind of encapsulation.

    When transmitting frames below the MTU size towards a vxlan device,
    this gives about 10% performance speed-up when OVS is used to connect
    the veth and the vxlan device and a little more when using a
    plain Linux bridge.

    Signed-off-by: Paolo Abeni
    Signed-off-by: David S. Miller

    Paolo Abeni
     

23 Dec, 2015

1 commit

  • Packets that arrive from real hardware devices have ip_summed ==
    CHECKSUM_UNNECESSARY if the hardware verified the checksums, or
    CHECKSUM_NONE if the packet is bad or it was unable to verify it. The
    current version of veth will replace CHECKSUM_NONE with
    CHECKSUM_UNNECESSARY, which causes corrupt packets routed from hardware to
    a veth device to be delivered to the application. This caused applications
    at Twitter to receive corrupt data when network hardware was corrupting
    packets.

    We believe this was added as an optimization to skip computing and
    verifying checksums for communication between containers. However, locally
    generated packets have ip_summed == CHECKSUM_PARTIAL, so the code as
    written does nothing for them. As far as we can tell, after removing this
    code, these packets are transmitted from one stack to another unmodified
    (tcpdump shows invalid checksums on both sides, as expected), and they are
    delivered correctly to applications. We didn’t test every possible network
    configuration, but we tried a few common ones such as bridging containers,
    using NAT between the host and a container, and routing from hardware
    devices to containers. We have effectively deployed this in production at
    Twitter (by disabling RX checksum offloading on veth devices).

    This code dates back to the first version of the driver, commit
    ("[NET]: Virtual ethernet device driver"), so I
    suspect this bug occurred mostly because the driver API has evolved
    significantly since then. Commit ("net/veth: Fix
    packet checksumming") (in December 2010) fixed this for packets that get
    created locally and sent to hardware devices, by not changing
    CHECKSUM_PARTIAL. However, the same issue still occurs for packets coming
    in from hardware devices.
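The behaviour change can be summarised with a small userspace model of the checksum state. The enum and function names are hypothetical, and gating the old rewrite on an RX checksum offload setting is an assumption drawn from the production workaround mentioned above.

```c
#include <assert.h>
#include <stdbool.h>

/* Model of the ip_summed states discussed in the commit message. */
enum csum_state { CSUM_NONE, CSUM_UNNECESSARY, CSUM_PARTIAL };

/* Old behaviour: an unverified packet is relabelled as verified, so a
 * corrupt packet from hardware is never checked by the receiving stack. */
enum csum_state old_rx_csum(enum csum_state in, bool rx_csum_offload)
{
    assert(in == CSUM_NONE || in == CSUM_UNNECESSARY || in == CSUM_PARTIAL);
    if (in == CSUM_NONE && rx_csum_offload)
        return CSUM_UNNECESSARY;
    return in;
}

/* New behaviour: the state is passed through unchanged. */
enum csum_state new_rx_csum(enum csum_state in)
{
    return in;
}
```

Locally generated packets arrive as `CSUM_PARTIAL`, so both versions leave them alone; only the packets that hardware could not verify are treated differently.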

    Co-authored-by: Evan Jones
    Signed-off-by: Evan Jones
    Cc: Nicolas Dichtel
    Cc: Phil Sutter
    Cc: Toshiaki Makita
    Cc: netdev@vger.kernel.org
    Cc: linux-kernel@vger.kernel.org
    Signed-off-by: Vijay Pandurangan
    Acked-by: Cong Wang
    Signed-off-by: David S. Miller

    Vijay Pandurangan
     

19 Aug, 2015

1 commit


04 Aug, 2015

1 commit


03 Apr, 2015

1 commit

  • Now that the peer netns is advertised in rtnl messages, we can set this property
    so that IFLA_LINK will advertise the peer ifindex. It allows the userland to get
    the full veth configuration.

    Signed-off-by: Nicolas Dichtel
    Signed-off-by: David S. Miller

    Nicolas Dichtel
     

24 Jan, 2015

1 commit


16 Jul, 2014

1 commit


26 Jun, 2014

1 commit

  • It is trivial to add netpoll support to veth: since
    it is not a stacked device, we don't need to set up and
    clean up netpoll.

    Reported-by: Stefan Priebe
    Cc: "David S. Miller"
    Cc: Neil Horman
    Acked-by: Neil Horman
    Signed-off-by: Cong Wang
    Acked-by: Neil Horman
    Signed-off-by: David S. Miller

    WANG Cong
     

30 Mar, 2014

1 commit


29 Mar, 2014

1 commit


15 Mar, 2014

1 commit

  • Replace the bh safe variant with the hard irq safe variant.

    We need a hard irq safe variant to deal with netpoll transmitting
    packets from hard irq context, and we need it in most if not all of
    the places using the bh safe variant.

    Except on 32-bit uniprocessor systems the code is exactly the same, so don't
    bother with a bh variant, just have a hard irq safe variant that
    everyone can use.

    Signed-off-by: "Eric W. Biederman"
    Signed-off-by: David S. Miller

    Eric W. Biederman
     

06 Mar, 2014

1 commit

  • Conflicts:
    drivers/net/wireless/ath/ath9k/recv.c
    drivers/net/wireless/mwifiex/pcie.c
    net/ipv6/sit.c

    The SIT driver conflict consists of a bug fix being done by hand
    in 'net' (missing u64_stats_init()) whilst in 'net-next' a helper
    was created (netdev_alloc_pcpu_stats()) which takes care of this.

    The two wireless conflicts were overlapping changes.

    Signed-off-by: David S. Miller

    David S. Miller
     

20 Feb, 2014

1 commit

  • Even if we create a stacked vlan interface such as veth0.10.20, it sends
    single tagged frames (tagged with only vid 10).
    Because vlan_features of a veth interface has the
    NETIF_F_HW_VLAN_[CTAG/STAG]_TX bits, veth0.10 also has that feature, so
    dev_hard_start_xmit(veth0.10) doesn't call __vlan_put_tag() and
    vlan_dev_hard_start_xmit(veth0.10) overwrites vlan_tci.
    This prevents us from using a combination of 802.1ad and 802.1Q
    in containers, etc.

    Signed-off-by: Toshiaki Makita
    Acked-by: Flavio Leitner
    Signed-off-by: David S. Miller

    Toshiaki Makita
     

19 Feb, 2014

1 commit


15 Feb, 2014

1 commit


14 Nov, 2013

1 commit

  • Pull core locking changes from Ingo Molnar:
    "The biggest changes:

    - add lockdep support for seqcount/seqlocks structures, this
    unearthed both bugs and required extra annotation.

    - move the various kernel locking primitives to the new
    kernel/locking/ directory"

    * 'core-locking-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (21 commits)
    block: Use u64_stats_init() to initialize seqcounts
    locking/lockdep: Mark __lockdep_count_forward_deps() as static
    lockdep/proc: Fix lock-time avg computation
    locking/doc: Update references to kernel/mutex.c
    ipv6: Fix possible ipv6 seqlock deadlock
    cpuset: Fix potential deadlock w/ set_mems_allowed
    seqcount: Add lockdep functionality to seqcount/seqlock structures
    net: Explicitly initialize u64_stats_sync structures for lockdep
    locking: Move the percpu-rwsem code to kernel/locking/
    locking: Move the lglocks code to kernel/locking/
    locking: Move the rwsem code to kernel/locking/
    locking: Move the rtmutex code to kernel/locking/
    locking: Move the semaphore core to kernel/locking/
    locking: Move the spinlock code to kernel/locking/
    locking: Move the lockdep code to kernel/locking/
    locking: Move the mutex code to kernel/locking/
    hung_task debugging: Add tracepoint to report the hang
    x86/locking/kconfig: Update paravirt spinlock Kconfig description
    lockstat: Report avg wait and hold times
    lockdep, x86/alternatives: Drop ancient lockdep fixup message
    ...

    Linus Torvalds
     

06 Nov, 2013

1 commit

  • In order to enable lockdep on seqcount/seqlock structures, we
    must explicitly initialize any locks.

    The u64_stats_sync structure uses a seqcount, and thus we need
    to introduce a u64_stats_init() function and use it to initialize
    the structure.

    This unfortunately adds a lot of fairly trivial initialization code
    to a number of drivers. But the benefit of ensuring correctness makes
    this worthwhile.

    Because these changes are required for lockdep to be enabled, and the
    changes are quite trivial, I've not yet split this patch out into 30-some
    separate patches, as I figured it would be better to get the various
    maintainers' thoughts on how to best merge this change along with
    the seqcount lockdep enablement.

    Feedback would be appreciated!

    Signed-off-by: John Stultz
    Acked-by: Julian Anastasov
    Signed-off-by: Peter Zijlstra
    Cc: Alexey Kuznetsov
    Cc: "David S. Miller"
    Cc: Eric Dumazet
    Cc: Hideaki YOSHIFUJI
    Cc: James Morris
    Cc: Jesse Gross
    Cc: Mathieu Desnoyers
    Cc: "Michael S. Tsirkin"
    Cc: Mirko Lindner
    Cc: Patrick McHardy
    Cc: Roger Luethi
    Cc: Rusty Russell
    Cc: Simon Horman
    Cc: Stephen Hemminger
    Cc: Steven Rostedt
    Cc: Thomas Petazzoni
    Cc: Wensong Zhang
    Cc: netdev@vger.kernel.org
    Link: http://lkml.kernel.org/r/1381186321-4906-2-git-send-email-john.stultz@linaro.org
    Signed-off-by: Ingo Molnar

    John Stultz
     

28 Oct, 2013

1 commit

  • While investigating a recent vxlan regression, I found veth
    was using a zero features set for vxlan tunnels.

    We have to segment GSO frames, copy the payload, and do the checksum.

    This patch brings a ~200% performance increase.

    We probably have to add hw_enc_features support
    on other virtual devices.

    Signed-off-by: Eric Dumazet
    Cc: Alexei Starovoitov
    Signed-off-by: David S. Miller

    Eric Dumazet
     

10 Oct, 2013

1 commit

  • We can only set up a multicast address for a network device when
    net_device_ops->ndo_set_rx_mode is not null.

    Some configurations need to add a multicast address for a net
    device, such as the netfilter cluster match module.

    Add a fake ndo_set_rx_mode function to allow this operation.
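A userspace sketch of why an empty hook is enough: the caller only checks that the function pointer is set before proceeding. The structure and function names are hypothetical, and the exact check the core performs is an assumption based on the description above.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical subset of a device's operations table. */
struct dev_ops_model {
    void (*set_rx_mode)(void *dev);
};

/* Hypothetical core-side check: refuse the request when the driver
 * provides no rx-mode hook, otherwise let the driver apply the change. */
int mc_add_model(const struct dev_ops_model *ops, void *dev)
{
    assert(ops != NULL);
    if (ops->set_rx_mode == NULL)
        return -1;
    ops->set_rx_mode(dev);
    return 0;
}

/* The "fake" hook: a virtual device has no hardware filter to program. */
void noop_set_rx_mode(void *dev)
{
    (void)dev;
}
```

Installing `noop_set_rx_mode` turns the refused request into a successful one without the driver having to do any filtering work.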

    Signed-off-by: Gao feng
    Signed-off-by: David S. Miller

    Gao feng
     

09 Oct, 2013

1 commit