21 Dec, 2011

1 commit


24 Nov, 2011

1 commit

  • rcu_assign_pointer(ptr, NULL) can be safely replaced by
    RCU_INIT_POINTER(ptr, NULL)

    (old rcu_assign_pointer() macro was testing the NULL value and could
    omit the smp_wmb(), but this had to be removed because of compiler
    warnings)
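
    The distinction can be sketched in userspace with C11 atomics. This is
    only an analogy under stated assumptions: publish_item(), clear_item(),
    read_item() and struct item are hypothetical stand-ins, not the kernel
    macros. Publishing a non-NULL pointer needs release ordering so that a
    reader who sees the pointer also sees the initialised fields; storing
    NULL publishes no data, so no barrier is needed.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical payload type, for illustration only. */
struct item {
    int value;
};

/* Hypothetical shared pointer that readers dereference. */
static _Atomic(struct item *) current_item;

/*
 * Analogue of rcu_assign_pointer(): a release store, so the writes that
 * initialised *it are visible before the pointer itself.
 */
static void publish_item(struct item *it)
{
    assert(it != NULL);
    atomic_store_explicit(&current_item, it, memory_order_release);
}

/*
 * Analogue of RCU_INIT_POINTER(ptr, NULL): a NULL pointer exposes no
 * data to readers, so a relaxed store is enough.
 */
static void clear_item(void)
{
    atomic_store_explicit(&current_item, (struct item *)NULL,
                          memory_order_relaxed);
}

/* Reader side: acquire load pairs with the release store above. */
static struct item *read_item(void)
{
    return atomic_load_explicit(&current_item, memory_order_acquire);
}
```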

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

21 Oct, 2011

5 commits

  • On systems that create and delete lots of dynamic devices the
    31bit linux ifindex fails to fit in the 16bit macvtap minor,
    resulting in unusable macvtap devices. I have systems running
    automated tests that hit this condition in just a few days.

    Use a linux idr allocator to track which macvtap minor numbers
    are available and to track the association between macvtap
    minor numbers and macvtap network devices.

    Remove the unnecessary check to see if the network
    device we have found is indeed a macvtap device. With macvtap
    specific data structures it is impossible to find any other
    kind of networking device.

    Increase the macvtap minor range from 65536 to the full 20 bits
    that is supported by linux device numbers. It doesn't solve the
    original problem but there is no penalty for a larger minor
    device range.
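
    The idea of decoupling the minor number from the ifindex can be
    sketched in userspace as follows. This is not the idr API: the toy_*
    names and the fixed-size table are assumptions made for illustration,
    and the table is far smaller than the 20-bit range the commit uses.
    The point is that minors are handed out from a small recyclable pool
    and map back to the device, however large the ifindex has grown.

```c
#include <assert.h>
#include <stddef.h>

/* Toy range; the commit itself uses the full 20-bit minor space. */
#define TOY_MINOR_COUNT 64

/* Hypothetical stand-in for a macvtap network device. */
struct toy_netdev {
    int ifindex;   /* may grow far beyond what a minor can hold */
    int minor;     /* 0 means "no minor assigned" */
};

/* minor -> device association, playing the role of the idr. */
static struct toy_netdev *minor_table[TOY_MINOR_COUNT];

/* Allocate the lowest free minor (starting at 1); -1 if none is left. */
static int toy_get_minor(struct toy_netdev *dev)
{
    assert(dev != NULL);
    for (int m = 1; m < TOY_MINOR_COUNT; m++) {
        if (minor_table[m] == NULL) {
            minor_table[m] = dev;
            dev->minor = m;
            return m;
        }
    }
    return -1;
}

/* Release the minor so a later device can reuse it. */
static void toy_free_minor(struct toy_netdev *dev)
{
    if (dev->minor > 0 && dev->minor < TOY_MINOR_COUNT) {
        minor_table[dev->minor] = NULL;
        dev->minor = 0;
    }
}

/* Look up the device for a minor; only macvtap devices are ever stored. */
static struct toy_netdev *toy_dev_from_minor(int minor)
{
    if (minor <= 0 || minor >= TOY_MINOR_COUNT)
        return NULL;
    return minor_table[minor];
}
```

    Because the table holds nothing but macvtap devices, a successful
    lookup needs no further type check, which mirrors the check removed
    by this commit.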

    Signed-off-by: Eric W. Biederman
    Signed-off-by: David S. Miller

    Eric W. Biederman
     
  • Place macvlan_common_newlink at the end of macvtap_newlink because
    failing in newlink after registering your network device is not
    supported.

    Move device_create into a netdevice creation notifier. The network device
    notifier is the only hook that is called after the network device has been
    registered with the device layer and before register_network_device returns
    success.

    Signed-off-by: Eric W. Biederman
    Signed-off-by: David S. Miller

    Eric W. Biederman
     
  • To avoid leaking packets in the receive queue, add a socket destructor
    that will run whenever a macvtap socket is destroyed.

    Signed-off-by: Eric W. Biederman
    Signed-off-by: David S. Miller

    Eric W. Biederman
     
  • To see if it is appropriate to enable the macvtap zero copy feature
    don't test the lowerdev network device flags. Instead test the
    macvtap network device flags which are a direct copy of the lowerdev
    flags. This is important because nothing holds a reference to lowerdev
    and on a very bad day lowerdev could be a pointer to stale memory.

    Signed-off-by: Eric W. Biederman
    Signed-off-by: David S. Miller

    Eric W. Biederman
     
  • There is a small window in macvtap_open between looking up a
    networking device and calling macvtap_set_queue in which
    macvtap_del_queues can be called from macvtap_dellink. After
    macvtap_del_queues has been called it is totally incorrect to
    allow macvtap_set_queue to proceed, so prevent success by
    reporting that all of the available queues are in use.
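
    A minimal userspace sketch of the approach, assuming a simple queue
    counter: the toy_* names and TOY_MAX_QUEUES are hypothetical, and the
    locking that serialises the two paths in the real driver is left out.
    Once the delete path has run, the counter reads as "full", so any
    late attach attempt fails the same way a genuinely full device would.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical queue limit for the sketch. */
#define TOY_MAX_QUEUES 4

/* Hypothetical stand-in for the per-device state. */
struct toy_vlan {
    int numvtaps;   /* number of attached queues */
};

/* Attach one more queue; fails with -EBUSY when all slots are in use. */
static int toy_set_queue(struct toy_vlan *vlan)
{
    assert(vlan != NULL);
    if (vlan->numvtaps >= TOY_MAX_QUEUES)
        return -EBUSY;
    vlan->numvtaps++;
    return 0;
}

/*
 * Detach all queues on device deletion, then report every slot as taken
 * so that a racing toy_set_queue() cannot succeed afterwards.
 */
static void toy_del_queues(struct toy_vlan *vlan)
{
    assert(vlan != NULL);
    /* ... queue teardown would happen here ... */
    vlan->numvtaps = TOY_MAX_QUEUES;
}
```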

    Signed-off-by: Eric W. Biederman
    Signed-off-by: David S. Miller

    Eric W. Biederman
     

21 Sep, 2011

1 commit


16 Sep, 2011

1 commit


07 Jul, 2011

1 commit


12 Jun, 2011

1 commit

  • There's no need for the guest to validate the checksum if it has been
    validated by host nics. So this patch introduces a new flag -
    VIRTIO_NET_HDR_F_DATA_VALID which is used to bypass the checksum
    examination in the guest. The backend (tap/macvtap) may set this flag when
    it meets skbs with CHECKSUM_UNNECESSARY to save cpu utilization.

    No feature negotiation is needed as old drivers just ignore this flag.
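
    The flag handling can be sketched in userspace as below. The two
    VIRTIO_NET_HDR_F_* values match <linux/virtio_net.h>; everything
    prefixed toy_ or TOY_ is a hypothetical stand-in (for skb->ip_summed
    and the header struct), not driver code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define VIRTIO_NET_HDR_F_NEEDS_CSUM 1   /* checksum still to be filled in */
#define VIRTIO_NET_HDR_F_DATA_VALID 2   /* checksum already verified */

/* Hypothetical stand-ins for the skb->ip_summed states. */
enum toy_ip_summed {
    TOY_CHECKSUM_NONE,
    TOY_CHECKSUM_UNNECESSARY,
    TOY_CHECKSUM_PARTIAL
};

/* Hypothetical reduced header carrying only the flags byte. */
struct toy_vnet_hdr {
    uint8_t flags;
};

/* Backend side: translate the skb checksum state into header flags. */
static void toy_fill_hdr(struct toy_vnet_hdr *hdr, enum toy_ip_summed state)
{
    assert(hdr != NULL);
    hdr->flags = 0;
    if (state == TOY_CHECKSUM_PARTIAL)
        hdr->flags |= VIRTIO_NET_HDR_F_NEEDS_CSUM;
    else if (state == TOY_CHECKSUM_UNNECESSARY)
        hdr->flags |= VIRTIO_NET_HDR_F_DATA_VALID;
}

/* Guest side: validation can be skipped only when DATA_VALID is set. */
static int toy_guest_can_skip_validation(const struct toy_vnet_hdr *hdr)
{
    assert(hdr != NULL);
    return (hdr->flags & VIRTIO_NET_HDR_F_DATA_VALID) != 0;
}
```

    An old guest driver never tests the new bit, so it simply keeps
    validating checksums as before.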

    Iperf shows 12%-30% performance improvement for UDP traffic. For TCP,
    there is no difference when gro is on, as it produces skbs with partial
    checksum. But when gro is disabled, 20% or even higher improvement
    could be measured by netperf.

    Signed-off-by: Jason Wang
    Acked-by: Michael S. Tsirkin
    Signed-off-by: David S. Miller

    Jason Wang
     

11 Mar, 2011

1 commit


08 Mar, 2011

1 commit


28 Jan, 2011

1 commit


14 Jan, 2011

1 commit

  • After recent changes (percpu stats on vlan/tunnels...), we no longer
    need the per struct netdev_queue tx_bytes/tx_packets/tx_dropped counters.

    Only remaining users are ixgbe, sch_teql, gianfar & macvlan :

    1) ixgbe can be converted to use existing tx_ring counters.

    2) macvlan incremented txq->tx_dropped, it can use the
    dev->stats.tx_dropped counter.

    3) sch_teql : almost revert ab35cd4b8f42 (Use net_device internal stats)
    Now we have ndo_get_stats64(), use it, even for "unsigned long"
    fields (No need to bring back a struct net_device_stats)

    4) gianfar adds a stats structure per tx queue to hold
    tx_bytes/tx_packets

    This removes a lockdep warning (and possible lockup) in rndis gadget,
    calling dev_get_stats() from hard IRQ context.

    Ref: http://www.spinics.net/lists/netdev/msg149202.html

    Reported-by: Neil Jones
    Signed-off-by: Eric Dumazet
    CC: Jarek Poplawski
    CC: Alexander Duyck
    CC: Jeff Kirsher
    CC: Sandeep Gopalpet
    CC: Michal Nazarewicz
    Signed-off-by: David S. Miller

    Eric Dumazet
     

17 Dec, 2010

1 commit


17 Aug, 2010

1 commit


28 Jul, 2010

1 commit


23 Jul, 2010

1 commit

  • Mark Wagner reported OOM symptoms when sending UDP traffic over
    a macvtap link to a kvm receiver.

    This appears to be caused by the fact that macvtap packet queues
    are unlimited in length. This means that if the receiver can't
    keep up with the rate of flow, then we will hit OOM. Of course
    it gets worse if the OOM killer then decides to kill the receiver.

    This patch imposes a cap on the packet queue length, in the same
    way as the tuntap driver, using the device TX queue length.

    Please note that macvtap currently has no way of giving congestion
    notification, that means the software device TX queue cannot be
    used and packets will always be dropped once the macvtap driver
    queue fills up.

    This shouldn't be a great problem for the scenario where macvtap
    is used to feed a kvm receiver, as the traffic is most likely
    external in origin so congestion notification can't be applied
    anyway.

    Of course, if anybody decides to complain about guest-to-guest
    UDP packet loss down the track, then we may have to revisit this.

    Incidentally, this patch also fixes a real memory leak when
    macvtap_get_queue fails.

    Chris Wright noticed that for this patch to work, we need a
    non-zero TX queue length. This patch includes his work to change
    the default macvtap TX queue length to 500.
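
    The cap amounts to a length check before queueing. A minimal
    userspace sketch, assuming a counter in place of the real skb queue
    (struct toy_queue and the toy_* functions are hypothetical; 500 is
    the default named above):

```c
#include <assert.h>
#include <stddef.h>

/* Default TX queue length introduced by this commit. */
#define TOY_TX_QUEUE_LEN 500

/* Hypothetical stand-in for the per-queue receive state. */
struct toy_queue {
    size_t len;              /* packets currently queued */
    size_t limit;            /* cap, taken from the device TX queue length */
    unsigned long dropped;   /* packets dropped because the queue was full */
};

/* Queue one packet; returns 1 if queued, 0 if dropped at the cap. */
static int toy_enqueue(struct toy_queue *q)
{
    assert(q != NULL);
    if (q->len >= q->limit) {
        q->dropped++;   /* no congestion notification, so just drop */
        return 0;
    }
    q->len++;
    return 1;
}

/* The receiver consumed one packet, freeing a slot. */
static void toy_dequeue(struct toy_queue *q)
{
    assert(q != NULL);
    if (q->len > 0)
        q->len--;
}
```

    With a limit of zero every packet would be dropped, which is why a
    non-zero default queue length is required for the cap to be usable.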

    Reported-by: Mark Wagner
    Signed-off-by: Herbert Xu
    Acked-by: Chris Wright
    Acked-by: Arnd Bergmann
    Signed-off-by: David S. Miller

    Herbert Xu
     

11 Jul, 2010

1 commit


04 May, 2010

1 commit


02 May, 2010

1 commit

  • sk_callback_lock rwlock actually protects sk->sk_sleep pointer, so we
    need two atomic operations (and associated dirtying) per incoming
    packet.

    RCU conversion is pretty much needed :

    1) Add a new structure, called "struct socket_wq" to hold all fields
    that will need rcu_read_lock() protection (currently: a
    wait_queue_head_t and a struct fasync_struct pointer).

    [Future patch will add a list anchor for wakeup coalescing]

    2) Attach one of such structure to each "struct socket" created in
    sock_alloc_inode().

    3) Respect RCU grace period when freeing a "struct socket_wq"

    4) Change sk_sleep pointer in "struct sock" by sk_wq, pointer to "struct
    socket_wq"

    5) Change sk_sleep() function to use new sk->sk_wq instead of
    sk->sk_sleep

    6) Change sk_has_sleeper() to wq_has_sleeper() that must be used inside
    a rcu_read_lock() section.

    7) Change all sk_has_sleeper() callers to :
    - Use rcu_read_lock() instead of read_lock(&sk->sk_callback_lock)
    - Use wq_has_sleeper() to eventually wakeup tasks.
    - Use rcu_read_unlock() instead of read_unlock(&sk->sk_callback_lock)

    8) sock_wake_async() is modified to use rcu protection as well.

    9) Exceptions :
    macvtap, drivers/net/tun.c, af_unix use integrated "struct socket_wq"
    instead of dynamically allocated ones. They don't need rcu freeing.

    Some cleanups or followups are probably needed, (possible
    sk_callback_lock conversion to a spinlock for example...).

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

27 Apr, 2010

1 commit


21 Apr, 2010

1 commit

  • Define a new function to return the waitqueue of a "struct sock".

    static inline wait_queue_head_t *sk_sleep(struct sock *sk)
    {
            return sk->sk_sleep;
    }

    Change all read occurrences of sk_sleep by a call to this function.

    Needed for a future RCU conversion. sk_sleep won't be a field directly
    available.

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

30 Mar, 2010

1 commit

  • …it slab.h inclusion from percpu.h

    percpu.h is included by sched.h and module.h and thus ends up being
    included when building most .c files. percpu.h includes slab.h which
    in turn includes gfp.h making everything defined by the two files
    universally available and complicating inclusion dependencies.

    percpu.h -> slab.h dependency is about to be removed. Prepare for
    this change by updating users of gfp and slab facilities include those
    headers directly instead of assuming availability. As this conversion
    needs to touch large number of source files, the following script is
    used as the basis of conversion.

    http://userweb.kernel.org/~tj/misc/slabh-sweep.py

    The script does the following.

    * Scan files for gfp and slab usages and update includes such that
    only the necessary includes are there. ie. if only gfp is used,
    gfp.h, if slab is used, slab.h.

    * When the script inserts a new include, it looks at the include
    blocks and tries to put the new include such that its order conforms
    to its surrounding. It's put in the include block which contains
    core kernel includes, in the same order that the rest are ordered -
    alphabetical, Christmas tree, rev-Xmas-tree or at the end if there
    doesn't seem to be any matching order.

    * If the script can't find a place to put a new include (mostly
    because the file doesn't have fitting include block), it prints out
    an error message indicating which .h file needs to be added to the
    file.

    The conversion was done in the following steps.

    1. The initial automatic conversion of all .c files updated slightly
    over 4000 files, deleting around 700 includes and adding ~480 gfp.h
    and ~3000 slab.h inclusions. The script emitted errors for ~400
    files.

    2. Each error was manually checked. Some didn't need the inclusion,
    some needed manual addition while adding it to implementation .h or
    embedding .c file was more appropriate for others. This step added
    inclusions to around 150 files.

    3. The script was run again and the output was compared to the edits
    from #2 to make sure no file was left behind.

    4. Several build tests were done and a couple of problems were fixed.
    e.g. lib/decompress_*.c used malloc/free() wrappers around slab
    APIs requiring slab.h to be added manually.

    5. The script was run on all .h files but without automatically
    editing them as sprinkling gfp.h and slab.h inclusions around .h
    files could easily lead to inclusion dependency hell. Most gfp.h
    inclusion directives were ignored as stuff from gfp.h was usually
    widely available and often used in preprocessor macros. Each
    slab.h inclusion directive was examined and added manually as
    necessary.

    6. percpu.h was updated not to include slab.h.

    7. Build tests were done on the following configurations and failures
    were fixed. CONFIG_GCOV_KERNEL was turned off for all tests (as my
    distributed build env didn't work with gcov compiles) and a few
    more options had to be turned off depending on archs to make things
    build (like ipr on powerpc/64 which failed due to missing writeq).

    * x86 and x86_64 UP and SMP allmodconfig and a custom test config.
    * powerpc and powerpc64 SMP allmodconfig
    * sparc and sparc64 SMP allmodconfig
    * ia64 SMP allmodconfig
    * s390 SMP allmodconfig
    * alpha SMP allmodconfig
    * um on x86_64 SMP allmodconfig

    8. percpu.h modifications were reverted so that it could be applied as
    a separate patch and serve as bisection point.

    Given the fact that I had only a couple of failures from tests on step
    6, I'm fairly confident about the coverage of this conversion patch.
    If there is a breakage, it's likely to be something in one of the arch
    headers which should be easily discoverable on most builds of
    the specific arch.

    Signed-off-by: Tejun Heo <tj@kernel.org>
    Guess-its-ok-by: Christoph Lameter <cl@linux-foundation.org>
    Cc: Ingo Molnar <mingo@redhat.com>
    Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>

    Tejun Heo
     

19 Feb, 2010

3 commits

  • Added flags field to macvtap_queue to enable/disable processing of
    virtio_net_hdr via IFF_VNET_HDR. This flag is checked to prepend virtio_net_hdr
    in the receive path and process/skip virtio_net_hdr in the send path.
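
    The effect of the flag on the two paths can be sketched in userspace
    as below. IFF_VNET_HDR is the value from <linux/if_tun.h> and 10 is
    the size of the legacy struct virtio_net_hdr; the toy_* names and
    the reduced queue struct are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>

#define IFF_VNET_HDR     0x4000   /* value from <linux/if_tun.h> */
#define TOY_VNET_HDR_LEN 10       /* size of the legacy struct virtio_net_hdr */

/* Hypothetical reduced queue state: only the new flags field. */
struct toy_macvtap_queue {
    unsigned int flags;
};

/* Receive path: bytes handed to userspace for a frame of payload_len. */
static size_t toy_rx_total_len(const struct toy_macvtap_queue *q,
                               size_t payload_len)
{
    assert(q != NULL);
    if (q->flags & IFF_VNET_HDR)
        return TOY_VNET_HDR_LEN + payload_len;   /* header is prepended */
    return payload_len;
}

/* Send path: offset at which the Ethernet frame starts in the buffer. */
static size_t toy_tx_frame_offset(const struct toy_macvtap_queue *q)
{
    assert(q != NULL);
    if (q->flags & IFF_VNET_HDR)
        return TOY_VNET_HDR_LEN;   /* header is parsed, then skipped */
    return 0;
}
```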

    Original patch by Sridhar, further changes by Arnd.

    Signed-off-by: Sridhar Samudrala
    Signed-off-by: Arnd Bergmann
    Signed-off-by: David S. Miller

    Arnd Bergmann
     
  • This adds support for passing a macvtap file descriptor into
    vhost-net, much like we already do for tun/tap.

    Most of the new code is taken from the respective patch
    in the tun driver and may get consolidated in the future.

    Signed-off-by: Arnd Bergmann
    Acked-by: Sridhar Samudrala
    Signed-off-by: David S. Miller

    Arnd Bergmann
     
  • This reworks the change done by the previous patch
    in a more complete way.

    The original macvtap code has a number of problems
    resulting from the use of RCU for protecting the
    access to struct macvtap_queue from open files.

    This includes
    - need for GFP_ATOMIC allocations for skbs
    - potential deadlocks when copy_*_user sleeps
    - inability to work with vhost-net

    Changing the lifetime of macvtap_queue to always
    depend on the open file solves all these. The
    RCU reference simply moves one step down to
    the reference on the macvlan_dev, which we
    only need for nonblocking operations.

    Signed-off-by: Arnd Bergmann
    Acked-by: Sridhar Samudrala
    Signed-off-by: David S. Miller

    Arnd Bergmann
     

16 Feb, 2010

1 commit

  • The RCU usage in the original code was broken because
    there are cases where we possibly sleep with rcu_read_lock
    held. As a fix, change the macvtap_file_get_queue to
    get a reference on the socket and the netdev instead of
    taking the full rcu_read_lock.

    Also, change macvtap_file_get_queue failure case to
    not require a subsequent macvtap_file_put_queue, as
    pointed out by Ed Swierk.

    Signed-off-by: Arnd Bergmann
    Cc: Ed Swierk
    Cc: Sridhar Samudrala
    Acked-by: Sridhar Samudrala
    Acked-by: Ed Swierk
    Signed-off-by: David S. Miller

    Arnd Bergmann
     

04 Feb, 2010

1 commit

  • In order to use macvlan with qemu and other tools that require
    a tap file descriptor, the macvtap driver adds a small backend
    with a character device with the same interface as the tun
    driver, with a minimum set of features.

    Macvtap interfaces are created in the same way as macvlan
    interfaces using ip link, but the netif is just used as a
    handle for configuration and accounting, while the data
    goes through the chardev. Each macvtap interface has its
    own character device, simplifying permission management
    significantly over the generic tun/tap driver.

    Cc: Patrick McHardy
    Cc: Stephen Hemminger
    Cc: "David S. Miller"
    Cc: "Michael S. Tsirkin"
    Cc: Herbert Xu
    Cc: Or Gerlitz
    Cc: netdev@vger.kernel.org
    Cc: bridge@lists.linux-foundation.org
    Cc: linux-kernel@vger.kernel.org
    Signed-off-by: Arnd Bergmann
    Signed-off-by: David S. Miller

    Arnd Bergmann