18 Feb, 2017

1 commit

  • [ Upstream commit e1edab87faf6ca30cd137e0795bc73aa9a9a22ec ]

    When IFF_VNET_HDR is enabled, a virtio_net header must precede the data.
    The data length is verified to be greater than or equal to the expected
    header length tun->vnet_hdr_sz before copying.

    Read this value once and cache it locally, as it can be updated between
    the test and the use (TOCTOU).
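
    A minimal user-space sketch of the read-once pattern (the names shared_cfg
    and read_header are illustrative; an atomic load stands in for the kernel's
    READ_ONCE()):

    #include <stdatomic.h>
    #include <stddef.h>
    #include <string.h>

    struct shared_cfg {
            _Atomic int vnet_hdr_sz;        /* may change concurrently, like tun->vnet_hdr_sz */
    };

    static int read_header(struct shared_cfg *cfg, const char *data,
                           size_t len, char *hdr, size_t hdr_cap)
    {
            int hdr_sz = atomic_load(&cfg->vnet_hdr_sz);    /* read once */

            if (hdr_sz < 0 || (size_t)hdr_sz > hdr_cap || len < (size_t)hdr_sz)
                    return -1;                      /* check against the cached value */
            memcpy(hdr, data, (size_t)hdr_sz);      /* use the same cached value */
            return hdr_sz;
    }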

    Signed-off-by: Willem de Bruijn
    Reported-by: Dmitry Vyukov
    CC: Eric Dumazet
    Acked-by: Eric Dumazet
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Willem de Bruijn
     

04 Feb, 2017

1 commit

  • [ Upstream commit 6391a4481ba0796805d6581e42f9f0418c099e34 ]

    Commit 501db511397f ("virtio: don't set VIRTIO_NET_HDR_F_DATA_VALID on
    xmit") in fact disables VIRTIO_NET_HDR_F_DATA_VALID on the receiving
    path too. Fix this by adding a hint (has_data_valid) and setting it
    only on the receiving path.
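
    A rough sketch of the flag logic (the names vnet_hdr_demo and
    fill_vnet_hdr are illustrative, not the kernel helper):

    #include <stdbool.h>
    #include <stdint.h>

    #define VIRTIO_NET_HDR_F_DATA_VALID 2   /* "checksum already validated" hint */

    struct vnet_hdr_demo { uint8_t flags; };

    static void fill_vnet_hdr(struct vnet_hdr_demo *hdr, bool csum_valid,
                              bool has_data_valid)
    {
            hdr->flags = 0;
            /* Only receive-path callers pass has_data_valid == true, so the
             * hint is never advertised towards the device on transmit. */
            if (has_data_valid && csum_valid)
                    hdr->flags |= VIRTIO_NET_HDR_F_DATA_VALID;
    }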

    Cc: Rolf Neugebauer
    Signed-off-by: Jason Wang
    Acked-by: Rolf Neugebauer
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Jason Wang
     

01 Dec, 2016

1 commit

  • We trigger uarg->callback() immediately after we decide to do a data
    copy, even if the caller wanted zerocopy. This causes the callback
    (vhost_net_zerocopy_callback) to decrease the refcount. But when we
    hit an error afterwards, the error handling in vhost handle_tx() will
    try to decrease it again. This is wrong; fix it by delaying
    uarg->callback() until we're sure there are no errors.
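
    A self-contained toy model of the intended control flow (all names are
    illustrative, not vhost's or tun's):

    #include <stdbool.h>
    #include <stddef.h>

    struct ubuf_info_toy {
            void (*callback)(struct ubuf_info_toy *u);  /* drops one reference */
    };

    static int build_and_queue(const char *buf, size_t len)
    {
            return len ? 0 : -1;            /* stand-in for the real error paths */
    }

    /* Decide to copy even though zerocopy was requested, but defer the
     * "done with your pages" callback until no error path remains, so the
     * reference is dropped exactly once. */
    static int do_tx(struct ubuf_info_toy *uarg, const char *buf, size_t len,
                     bool zerocopy_requested)
    {
            bool copied_instead = zerocopy_requested && len < 128;  /* toy policy */
            int err = build_and_queue(buf, len);

            if (err)
                    return err;             /* caller's error handling drops the ref */
            if (copied_instead)
                    uarg->callback(uarg);   /* now safe: single drop */
            return 0;
    }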

    Reported-by: wangyunjian
    Signed-off-by: Jason Wang
    Acked-by: Michael S. Tsirkin
    Signed-off-by: David S. Miller

    Jason Wang
     

30 Aug, 2016

1 commit


24 Aug, 2016

1 commit

  • Instead of using sock_tx_timestamp, use skb_tx_timestamp to record
    software transmit timestamp of a packet.

    sock_tx_timestamp resets and overrides the tx_flags of the skb.
    The function is intended to be called from within the protocol
    layer when creating the skb, not from a device driver. This is
    inconsistent with other drivers and will cause issues for TCP.

    In TCP, we intend to sample the timestamps for the last byte
    for each sendmsg/sendpage. For that reason, tcp_sendmsg calls
    tcp_tx_timestamp only with the last skb that it generates.
    For example, if a 128KB message is split into two 64KB packets
    we want to sample the SND timestamp of the last packet. The current
    code in the tun driver, however, will result in sampling the SND
    timestamp for both packets.

    Also, when the last packet is split into smaller packets for
    retransmission (see tcp_fragment), the tun driver will record
    timestamps for all of the retransmitted packets and not only the
    last packet.
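
    A minimal sketch of the driver-side usage (kernel context assumed;
    demo_xmit is illustrative):

    static netdev_tx_t demo_xmit(struct sk_buff *skb, struct net_device *dev)
    {
            /* Reports a software timestamp only if the protocol layer already
             * set SKBTX_SW_TSTAMP on this particular skb, instead of rewriting
             * tx_flags from sk->sk_tsflags for every packet the way
             * sock_tx_timestamp() would. */
            skb_tx_timestamp(skb);

            /* ... hand the packet to the reader / tx queue here ... */
            return NETDEV_TX_OK;
    }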

    Fixes: eda297729171 (tun: Support software transmit time stamping.)
    Signed-off-by: Soheil Hassas Yeganeh
    Signed-off-by: Francis Yan
    Acked-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Soheil Hassas Yeganeh
     

21 Aug, 2016

2 commits


09 Jul, 2016

1 commit

  • The referenced change added a netlink notifier for processing
    device queue size events. These events are fired for all devices
    but the registered callback assumed they only occurred for tun
    devices. This fix adds a check (borrowed from macvtap.c) to discard
    non-tun device events.
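
    The guard looks roughly like this (kernel context assumed; the body is
    illustrative):

    static int tun_device_event_sketch(struct notifier_block *nb,
                                       unsigned long event, void *ptr)
    {
            struct net_device *dev = netdev_notifier_info_to_dev(ptr);

            if (dev->rtnl_link_ops != &tun_link_ops)
                    return NOTIFY_DONE;     /* not a tun device: ignore the event */

            /* ... handle queue-length changes for tun devices ... */
            return NOTIFY_DONE;
    }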

    For reference, this fixes the following splat:
    [ 71.505935] BUG: unable to handle kernel NULL pointer dereference at 0000000000000010
    [ 71.513870] IP: [] tun_device_event+0x110/0x340
    [ 71.519906] PGD 3f41f56067 PUD 3f264b7067 PMD 0
    [ 71.524497] Oops: 0002 [#1] SMP DEBUG_PAGEALLOC
    [ 71.529374] gsmi: Log Shutdown Reason 0x03
    [ 71.533417] Modules linked in:[ 71.533826] mlx4_en: eth1: Link Up

    [ 71.539616] bonding w1_therm wire cdc_acm ehci_pci ehci_hcd mlx4_en ib_uverbs mlx4_ib ib_core mlx4_core
    [ 71.549282] CPU: 12 PID: 7915 Comm: set.ixion-haswe Not tainted 4.7.0-dbx-DEV #8
    [ 71.556586] Hardware name: Intel Grantley,Wellsburg/Ixion_IT_15, BIOS 2.58.0 05/03/2016
    [ 71.564495] task: ffff887f00bb20c0 ti: ffff887f00798000 task.ti: ffff887f00798000
    [ 71.571894] RIP: 0010:[] [] tun_device_event+0x110/0x340
    [ 71.580327] RSP: 0018:ffff887f0079bbd8 EFLAGS: 00010202
    [ 71.585576] RAX: fffffffffffffae8 RBX: ffff887ef6d03378 RCX: 0000000000000000
    [ 71.592624] RDX: 0000000000000000 RSI: 0000000000000028 RDI: 0000000000000000
    [ 71.599675] RBP: ffff887f0079bc48 R08: 0000000000000000 R09: 0000000000000001
    [ 71.606730] R10: 0000000000000004 R11: 0000000000000000 R12: 0000000000000010
    [ 71.613780] R13: 0000000000000000 R14: 0000000000000001 R15: ffff887f0079bd00
    [ 71.620832] FS: 00007f5cdc581700(0000) GS:ffff883f7f700000(0000) knlGS:0000000000000000
    [ 71.628826] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    [ 71.634500] CR2: 0000000000000010 CR3: 0000003f3eb62000 CR4: 00000000001406e0
    [ 71.641549] Stack:
    [ 71.643533] ffff887f0079bc08 0000000000000246 000000000000001e ffff887ef6d00000
    [ 71.650871] ffff887f0079bd00 0000000000000000 0000000000000000 ffffffff00000000
    [ 71.658210] ffff887f0079bc48 ffffffff81d24070 00000000fffffff9 ffffffff81cec7a0
    [ 71.665549] Call Trace:
    [ 71.667975] [] notifier_call_chain+0x5d/0x80
    [ 71.673823] [] ? show_tx_maxrate+0x30/0x30
    [ 71.679502] [] __raw_notifier_call_chain+0xe/0x10
    [ 71.685778] [] raw_notifier_call_chain+0x16/0x20
    [ 71.691976] [] call_netdevice_notifiers_info+0x40/0x70
    [ 71.698681] [] call_netdevice_notifiers+0x16/0x20
    [ 71.704956] [] change_tx_queue_len+0x66/0x90
    [ 71.710807] [] netdev_store.isra.5+0xbf/0xd0
    [ 71.716658] [] tx_queue_len_store+0x50/0x60
    [ 71.722431] [] dev_attr_store+0x18/0x30
    [ 71.727857] [] sysfs_kf_write+0x4f/0x70
    [ 71.733274] [] kernfs_fop_write+0x147/0x1d0
    [ 71.739045] [] ? rcu_read_lock_sched_held+0x8f/0xa0
    [ 71.745499] [] __vfs_write+0x28/0x120
    [ 71.750748] [] ? percpu_down_read+0x57/0x90
    [ 71.756516] [] ? __sb_start_write+0xc8/0xe0
    [ 71.762278] [] ? __sb_start_write+0xc8/0xe0
    [ 71.768038] [] vfs_write+0xbe/0x1b0
    [ 71.773113] [] SyS_write+0x52/0xa0
    [ 71.778110] [] entry_SYSCALL_64_fastpath+0x18/0xa8
    [ 71.784472] Code: 45 31 f6 48 8b 93 78 33 00 00 48 81 c3 78 33 00 00 48 39 d3 48 8d 82 e8 fa ff ff 74 25 48 8d b0 40 05 00 00 49 63 d6 41 83 c6 01 89 34 d4 48 8b 90 18 05 00 00 48 39 d3 48 8d 82 e8 fa ff ff
    [ 71.803655] RIP [] tun_device_event+0x110/0x340
    [ 71.809769] RSP
    [ 71.813213] CR2: 0000000000000010
    [ 71.816512] ---[ end trace 4db6449606319f73 ]---

    Fixes: 1576d9860599 ("tun: switch to use skb array for tx")
    Signed-off-by: Craig Gallek
    Acked-by: Jason Wang
    Signed-off-by: David S. Miller

    Craig Gallek
     

05 Jul, 2016

1 commit

  • Stephen Rothwell reports a build warning (powerpc ppc64_defconfig):

    drivers/net/tun.c: In function 'tun_do_read.part.5':
    /home/sfr/next/next/drivers/net/tun.c:1491:6: warning: 'err' may be
    used uninitialized in this function [-Wmaybe-uninitialized]
    int err;

    This is because tun_ring_recv() may return an uninitialized err; fix this.

    Reported-by: Stephen Rothwell
    Signed-off-by: Jason Wang
    Signed-off-by: David S. Miller

    Jason Wang
     

01 Jul, 2016

1 commit

  • We used to queue tx packets in sk_receive_queue. This is less
    efficient since it requires spinlocks to synchronize between producer
    and consumer.

    This patch tries to address this by:

    - switching from sk_receive_queue to an skb_array, and resizing it
    when tx_queue_len is changed.
    - introducing a new proto_ops hook, peek_len, which is used for
    peeking the skb length.
    - implementing a tun version of peek_len for vhost_net to use, and
    converting vhost_net to use peek_len where possible (sketched below).
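
    A rough sketch of the new hook (kernel context assumed; the tun_file
    field names, tx_array and socket, are illustrative):

    static int tun_peek_len_sketch(struct socket *sock)
    {
            struct tun_file *tfile = container_of(sock, struct tun_file, socket);

            /* Report the length of the head skb without dequeuing it, so
             * vhost_net can size the next descriptor. */
            return skb_array_peek_len(&tfile->tx_array);
    }

    static const struct proto_ops tun_socket_ops_sketch = {
            /* ... existing sendmsg/recvmsg ops ... */
            .peek_len = tun_peek_len_sketch,
    };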

    Pktgen test shows about 15.3% improvement on guest receiving pps for small
    buffers:

    Before: ~1300000pps
    After : ~1500000pps

    Signed-off-by: Jason Wang
    Signed-off-by: David S. Miller

    Jason Wang
     

16 Jun, 2016

1 commit

  • Commit 34166093639b ("tuntap: use common code for virtio_net_hdr
    and skb GSO conversion") replaced the tun code for header manipulation
    with the generic helpers. While doing so, it implicitly moved the
    skb_partial_csum_set() invocation after eth_type_trans(), which
    invalidates the current gso start/offset values.
    Fix it by moving the helper invocation before the mac pulling.
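
    A condensed sketch of the required ordering (kernel context assumed;
    demo_rx_finish is illustrative):

    static int demo_rx_finish(struct tun_struct *tun, struct sk_buff *skb,
                              const struct virtio_net_hdr *gso)
    {
            /* csum_start/csum_offset recorded by skb_partial_csum_set() are
             * relative to skb->data, so convert the virtio header while
             * skb->data still points at the Ethernet header... */
            if (virtio_net_hdr_to_skb(skb, gso, tun_is_little_endian(tun)))
                    return -EINVAL;

            /* ...and only then pull the MAC header. */
            skb->protocol = eth_type_trans(skb, tun->dev);
            return 0;
    }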

    Fixes: 34166093639 ("tuntap: use common code for virtio_net_hdr and skb GSO conversion")
    Signed-off-by: Paolo Abeni
    Acked-by: Mike Rapoport
    Signed-off-by: David S. Miller

    Paolo Abeni
     

11 Jun, 2016

1 commit


21 May, 2016

1 commit

  • We used to check dev->reg_state against NETREG_REGISTERED each time
    we were woken up. But after commit 9e641bdcfa4e ("net-tun:
    restructure tun_do_read for better sleep/wakeup efficiency"), it uses
    skb_recv_datagram(), which does not check dev->reg_state. As a
    result, if we delete a tun/tap device while a process is blocked in a
    read, the device will wait forever for the reference count held by
    that process.

    Fix this by setting RCV_SHUTDOWN, which is checked in
    skb_recv_datagram(), before trying to wake up the process during
    uninit.
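
    A minimal sketch of the wakeup side (kernel context assumed;
    demo_uninit_wakeup is illustrative):

    static void demo_uninit_wakeup(struct sock *sk)
    {
            /* skb_recv_datagram() re-checks sk_shutdown before sleeping again,
             * so the blocked reader returns and drops its device reference. */
            sk->sk_shutdown |= RCV_SHUTDOWN;
            sk->sk_data_ready(sk);
    }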

    Fixes: 9e641bdcfa4e ("net-tun: restructure tun_do_read for better
    sleep/wakeup efficiency")
    Cc: Eric Dumazet
    Cc: Xi Wang
    Cc: Michael S. Tsirkin
    Signed-off-by: Jason Wang
    Acked-by: Eric Dumazet
    Acked-by: Michael S. Tsirkin
    Signed-off-by: David S. Miller

    Jason Wang
     

29 Apr, 2016

1 commit

  • There's no need to calculate the RPS hash if RPS is not enabled. So
    this patch exports rps_needed and checks it before trying to get the
    RPS hash. Tests (using pktgen to inject packets into the guest) show
    this can improve pps by about 13% (when RPS is disabled).

    Before:
    ~1150000 pps
    After:
    ~1300000 pps
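
    Roughly, on the receive path (kernel context assumed; demo_rx_hash is
    illustrative):

    static u32 demo_rx_hash(struct sk_buff *skb)
    {
            u32 rxhash = 0;

    #ifdef CONFIG_RPS
            if (static_key_false(&rps_needed))      /* rps_needed is now exported */
                    rxhash = skb_get_hash(skb);     /* only hash when RPS is in use */
    #endif
            return rxhash;
    }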

    Cc: Michael S. Tsirkin
    Signed-off-by: Jason Wang
    ----
    Changes from V1:
    - Fix build when CONFIG_RPS is not set
    Signed-off-by: David S. Miller

    Jason Wang
     

19 Apr, 2016

1 commit

  • The current tun_net_xmit() implementation doesn't need any external
    lock, since it relies on RCU protection for the tun data structure
    and on the socket queue lock for skb queuing.

    This patch sets the NETIF_F_LLTX feature bit on the tun device, so
    that on xmit, in the absence of a qdisc, no serialization lock is
    acquired by the caller.

    User space can remove the default tun qdisc with:

    tc qdisc replace dev <tun device name> root noqueue
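
    The driver-side change is essentially one feature bit (kernel context
    assumed; demo_setup is illustrative):

    static void demo_setup(struct net_device *dev)
    {
            /* With LLTX the core does not take HARD_TX_LOCK for this device;
             * tun_net_xmit() relies on RCU plus the queue's own locking. */
            dev->features |= NETIF_F_LLTX;
    }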

    Signed-off-by: Paolo Abeni
    Acked-by: Hannes Frederic Sowa
    Acked-by: Eric Dumazet
    Acked-by: Michael S. Tsirkin
    Signed-off-by: David S. Miller

    Paolo Abeni
     

15 Apr, 2016

1 commit

  • Currently the tun device accounting uses dev->stats without applying
    any kind of protection, even though accounting happens in preemptible
    process context.
    This patch moves the tun stats to a per-CPU data structure, and
    protects the updates with u64_stats_update_begin()/u64_stats_update_end()
    or this_cpu_inc(), according to the stat type. The per-CPU stats are
    aggregated by the newly added ndo_get_stats64 op.
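
    A condensed sketch of the scheme (kernel context assumed; the struct
    and function names are illustrative):

    struct demo_pcpu_stats {
            u64 rx_packets;
            u64 rx_bytes;
            struct u64_stats_sync syncp;
    };

    static void demo_rx_account(struct demo_pcpu_stats __percpu *stats,
                                unsigned int len)
    {
            struct demo_pcpu_stats *s = this_cpu_ptr(stats);

            u64_stats_update_begin(&s->syncp);      /* writer side of the seqcount */
            s->rx_packets++;
            s->rx_bytes += len;
            u64_stats_update_end(&s->syncp);
            /* ndo_get_stats64 later sums every CPU's copy under
             * u64_stats_fetch_begin/retry. */
    }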

    Signed-off-by: Paolo Abeni
    Signed-off-by: David S. Miller

    Paolo Abeni
     

10 Apr, 2016

1 commit


09 Apr, 2016

1 commit

  • After commit f84bb1eac027 ("net: fix IFF_NO_QUEUE for drivers using
    alloc_netdev"), the default qdisc was changed to noqueue because
    tuntap does not set tx_queue_len during .setup(). This patch restores
    the default qdisc by setting tx_queue_len in tun_setup().

    Fixes: f84bb1eac027 ("net: fix IFF_NO_QUEUE for drivers using alloc_netdev")
    Cc: Phil Sutter
    Signed-off-by: Jason Wang
    Acked-by: Michael S. Tsirkin
    Acked-by: Phil Sutter
    Signed-off-by: David S. Miller

    Jason Wang
     

08 Apr, 2016

1 commit

  • This reverts commit 5a5abb1fa3b05dd ("tun, bpf: fix suspicious RCU usage
    in tun_{attach, detach}_filter") and instead uses lock_sock around
    sk_{attach,detach}_filter. The checks inside filter.c are updated with
    lockdep_sock_is_held to check for the proper socket locks.

    This keeps the code cleaner by ensuring that only one lock governs the
    socket filter instead of two independent locks.
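
    In sketch form (kernel context assumed; demo_attach_filter is
    illustrative):

    static int demo_attach_filter(struct sock *sk, struct sock_fprog *prog)
    {
            int ret;

            lock_sock(sk);
            /* sk_attach_filter()'s internal rcu_dereference_protected() checks
             * are now satisfied via lockdep_sock_is_held(). */
            ret = sk_attach_filter(prog, sk);
            release_sock(sk);
            return ret;
    }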

    Cc: Daniel Borkmann
    Signed-off-by: Hannes Frederic Sowa
    Signed-off-by: David S. Miller

    Hannes Frederic Sowa
     

05 Apr, 2016

1 commit

  • Currently, SO_TIMESTAMPING can only be enabled using setsockopt.
    This is very costly when users want to sample writes to gather
    tx timestamps.

    Add support for enabling SO_TIMESTAMPING via control messages by
    using the tsflags added to `struct sockcm_cookie` (in the previous
    patches in this series) to set the tx_flags of the last skb created in
    a sendmsg. With this patch, the timestamp recording bits in the
    skbuff's tx_flags are overridden if SO_TIMESTAMPING is passed in a cmsg.

    Please note that this only overrides the timestamp recording flags.
    Users should enable timestamp reporting (e.g.,
    SOF_TIMESTAMPING_SOFTWARE | SOF_TIMESTAMPING_OPT_ID) using
    socket options and then ask for SOF_TIMESTAMPING_TX_* via control
    messages per sendmsg to sample timestamps for each write.
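
    A user-space sketch of sampling a single write this way (assumes
    reporting was already enabled with setsockopt as described above;
    send_sampled is illustrative):

    #include <sys/socket.h>
    #include <sys/uio.h>
    #include <linux/net_tstamp.h>

    static ssize_t send_sampled(int fd, const void *buf, size_t len)
    {
            char ctrl[CMSG_SPACE(sizeof(__u32))] = { 0 };
            struct iovec iov = { .iov_base = (void *)buf, .iov_len = len };
            struct msghdr msg = {
                    .msg_iov = &iov, .msg_iovlen = 1,
                    .msg_control = ctrl, .msg_controllen = sizeof(ctrl),
            };
            struct cmsghdr *cmsg = CMSG_FIRSTHDR(&msg);

            cmsg->cmsg_level = SOL_SOCKET;
            cmsg->cmsg_type = SO_TIMESTAMPING;
            cmsg->cmsg_len = CMSG_LEN(sizeof(__u32));
            *(__u32 *)CMSG_DATA(cmsg) = SOF_TIMESTAMPING_TX_SOFTWARE;  /* this write only */

            return sendmsg(fd, &msg, 0);
    }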

    Signed-off-by: Soheil Hassas Yeganeh
    Acked-by: Willem de Bruijn
    Signed-off-by: David S. Miller

    Soheil Hassas Yeganeh
     

02 Apr, 2016

1 commit

  • Sasha Levin reported a suspicious rcu_dereference_protected() warning
    found while fuzzing with trinity that is similar to this one:

    [ 52.765684] net/core/filter.c:2262 suspicious rcu_dereference_protected() usage!
    [ 52.765688] other info that might help us debug this:
    [ 52.765695] rcu_scheduler_active = 1, debug_locks = 1
    [ 52.765701] 1 lock held by a.out/1525:
    [ 52.765704] #0: (rtnl_mutex){+.+.+.}, at: [] rtnl_lock+0x17/0x20
    [ 52.765721] stack backtrace:
    [ 52.765728] CPU: 1 PID: 1525 Comm: a.out Not tainted 4.5.0+ #264
    [...]
    [ 52.765768] Call Trace:
    [ 52.765775] [] dump_stack+0x85/0xc8
    [ 52.765784] [] lockdep_rcu_suspicious+0xd5/0x110
    [ 52.765792] [] sk_detach_filter+0x82/0x90
    [ 52.765801] [] tun_detach_filter+0x35/0x90 [tun]
    [ 52.765810] [] __tun_chr_ioctl+0x354/0x1130 [tun]
    [ 52.765818] [] ? selinux_file_ioctl+0x130/0x210
    [ 52.765827] [] tun_chr_ioctl+0x13/0x20 [tun]
    [ 52.765834] [] do_vfs_ioctl+0x96/0x690
    [ 52.765843] [] ? security_file_ioctl+0x43/0x60
    [ 52.765850] [] SyS_ioctl+0x79/0x90
    [ 52.765858] [] do_syscall_64+0x62/0x140
    [ 52.765866] [] entry_SYSCALL64_slow_path+0x25/0x25

    Same can be triggered with PROVE_RCU (+ PROVE_RCU_REPEATEDLY) enabled
    from tun_attach_filter() when user space calls ioctl(tun_fd, TUN{ATTACH,
    DETACH}FILTER, ...) for adding/removing a BPF filter on tap devices.

    Since the fix in f91ff5b9ff52 ("net: sk_{detach|attach}_filter() rcu
    fixes") sk_attach_filter()/sk_detach_filter() now dereferences the
    filter with rcu_dereference_protected(), checking whether socket lock
    is held in control path.

    Since its introduction in 994051625981 ("tun: socket filter support"),
    tap filters have been managed under the RTNL lock from __tun_chr_ioctl().
    Thus the sock_owned_by_user(sk) check doesn't apply in this specific
    case and therefore triggers the false positive.

    Extend the BPF API with __sk_attach_filter()/__sk_detach_filter() pair
    that is used by tap filters and pass in lockdep_rtnl_is_held() for the
    rcu_dereference_protected() checks instead.

    Reported-by: Sasha Levin
    Signed-off-by: Daniel Borkmann
    Signed-off-by: David S. Miller

    Daniel Borkmann
     

02 Mar, 2016

1 commit

  • ndo_set_rx_headroom controls the align value used by tun devices to
    allocate skbs on frame reception.
    When the xmit device adds a large encapsulation, this avoids an skb
    head reallocation on forwarding.

    The measured improvement when forwarding towards a vxlan dev with a
    frame size below the egress device MTU is as follows:

    vxlan over ipv6, bridged: +6%
    vxlan over ipv6, ovs: +7%

    In case of ipv4 tunnels there is no improvement, since the tun
    device default alignment provides enough headroom to avoid the skb
    head reallocation.
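
    Conceptually the hook is small (kernel context assumed; the clamp and
    the align field are illustrative of how tun could use the value):

    static void demo_set_rx_headroom(struct net_device *dev, int new_hr)
    {
            struct tun_struct *tun = netdev_priv(dev);

            if (new_hr < NET_SKB_PAD)
                    new_hr = NET_SKB_PAD;   /* never go below the default reserve */
            tun->align = new_hr;            /* used when allocating rx skbs */
    }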

    Signed-off-by: Paolo Abeni
    Signed-off-by: David S. Miller

    Paolo Abeni
     

18 Dec, 2015

1 commit

  • If a tun interface is turned down, we should not allow packet injection
    into the kernel.

    The kernel already does not send packets to the tun device.

    TUNATTACHFILTER cannot be used for this, since only tun_net_xmit() takes
    care of it.
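
    The check itself is a one-liner in the write path (kernel context
    assumed; demo_get_user is illustrative):

    static ssize_t demo_get_user(struct tun_struct *tun, size_t len)
    {
            if (!(tun->dev->flags & IFF_UP))
                    return -EIO;            /* interface down: refuse injection */

            /* ... normal packet build and netif_rx() path ... */
            return len;
    }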

    Reported-by: Curt Wohlgemuth
    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

02 Dec, 2015

1 commit

  • This patch is a cleanup to make the following patch easier to
    review.

    The goal is to move SOCK_ASYNC_NOSPACE and SOCK_ASYNC_WAITDATA
    from (struct socket)->flags to (struct socket_wq)->flags,
    to benefit from RCU protection in sock_wake_async().

    To ease backports, we rename both constants.

    Two new helpers, sk_set_bit(int nr, struct sock *sk)
    and sk_clear_bit(int nr, struct sock *sk), are added so that the
    following patch can change their implementation.
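
    The helpers are deliberately thin for now (kernel context assumed),
    roughly:

    static inline void sk_set_bit(int nr, struct sock *sk)
    {
            set_bit(nr, &sk->sk_socket->flags);
    }

    static inline void sk_clear_bit(int nr, struct sock *sk)
    {
            clear_bit(nr, &sk->sk_socket->flags);
    }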

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

13 Oct, 2015

1 commit

  • Timewait or request sockets are small and do not contain sk->sk_tsflags.

    Without this fix, we might read garbage, and crash later in

    __skb_complete_tx_timestamp()
    -> sock_queue_err_skb()

    (These pseudo sockets do not have an error queue either)

    Fixes: ca6fb0651883 ("tcp: attach SYNACK messages to request sockets instead of listener")
    Signed-off-by: Eric Dumazet
    Cc: Willem de Bruijn
    Acked-by: Michael S. Tsirkin
    Signed-off-by: David S. Miller

    Eric Dumazet
     

04 Aug, 2015

1 commit


04 Jul, 2015

1 commit

  • Pull virtio/vhost cross endian support from Michael Tsirkin:
    "I have just queued some more bugfix patches today but none fix
    regressions and none are related to these ones, so it looks like a
    good time for a merge for -rc1.

    The motivation for this is support for legacy BE guests on the new LE
    hosts. There are two redeeming properties that made me merge this:

    - It's a trivial amount of code: since we wrap host/guest accesses
    anyway, almost all of it is well hidden from drivers.

    - Sane platforms would never set flags like VHOST_CROSS_ENDIAN_LEGACY,
    and when it's clear, there's zero overhead (at some point it was
    tested by compiling with and without the patches, and we got the same
    stripped binary).

    Maybe we could create a Kconfig symbol to enforce the second point:
    prevent people from enabling it eg on x86. I will look into this"

    * tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost:
    virtio-pci: alloc only resources actually used.
    macvtap/tun: cross-endian support for little-endian hosts
    vhost: cross-endian support for legacy devices
    virtio: add explicit big-endian support to memory accessors
    vhost: introduce vhost_is_little_endian() helper
    vringh: introduce vringh_is_little_endian() helper
    macvtap: introduce macvtap_is_little_endian() helper
    tun: add tun_is_little_endian() helper
    virtio: introduce virtio_is_little_endian() helper

    Linus Torvalds
     

01 Jun, 2015

3 commits

  • The VNET_LE flag was introduced to fix accesses to virtio 1.0 headers
    that are always little-endian. It can also be used to handle the special
    case of a legacy little-endian device implemented by a big-endian host.

    Let's add a flag and ioctls for big-endian devices as well. If both flags
    are set, little-endian wins.

    Since this isn't a common use case, the feature is controlled by a kernel
    config option (not set by default).

    Both macvtap and tun are covered by this patch since they share the same
    API with userland.
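
    The precedence rule can be sketched as (kernel context assumed; the
    helper is illustrative):

    static bool demo_is_little_endian(unsigned int flags)
    {
            if (flags & TUN_VNET_LE)
                    return true;                    /* little-endian wins */
            if (flags & TUN_VNET_BE)
                    return false;                   /* explicit big-endian device */
            return virtio_legacy_is_little_endian(); /* fall back to native rules */
    }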

    Signed-off-by: Greg Kurz

    Signed-off-by: Michael S. Tsirkin
    Reviewed-by: David Gibson

    Greg Kurz
     
  • The current memory accessor logic is:
    - little endian if little_endian
    - native endian (i.e. no byteswap) if !little_endian

    If we want to fully support cross-endian vhost, we also need to be
    able to convert to big endian.

    Instead of changing the little_endian argument to some 3-value enum, this
    patch changes the logic to:
    - little endian if little_endian
    - big endian if !little_endian

    The native endian case is handled by all users with a trivial helper. This
    patch doesn't change any functionality, nor does it add overhead.
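
    In sketch form, for one accessor (kernel context assumed;
    demo_virtio16_to_cpu is illustrative of the pattern):

    static inline u16 demo_virtio16_to_cpu(bool little_endian, __virtio16 val)
    {
            /* "native endian" callers simply pass
             * little_endian = virtio_legacy_is_little_endian() */
            if (little_endian)
                    return le16_to_cpu((__force __le16)val);
            else
                    return be16_to_cpu((__force __be16)val);
    }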

    Signed-off-by: Greg Kurz

    Signed-off-by: Michael S. Tsirkin
    Reviewed-by: Cornelia Huck
    Reviewed-by: David Gibson

    Greg Kurz
     
  • Signed-off-by: Greg Kurz

    Signed-off-by: Michael S. Tsirkin
    Acked-by: Cornelia Huck
    Reviewed-by: David Gibson

    Greg Kurz
     

11 May, 2015

2 commits

  • In preparation for changing how struct net is refcounted
    on kernel sockets pass the knowledge that we are creating
    a kernel socket from sock_create_kern through to sk_alloc.

    Signed-off-by: "Eric W. Biederman"
    Signed-off-by: David S. Miller

    Eric W. Biederman
     
  • There is no need for tun to do the weird network namespace refcounting.
    The existing network namespace refcounting in tfile has almost exactly
    the same lifetime. So rewrite the code to use the struct sock network
    namespace refcounting and remove the unnecessary hand-rolled network
    namespace refcounting and the unnecessary tfile->net.

    This change allows the tun code to directly call sock_put bypassing
    sock_release and making SOCK_EXTERNALLY_ALLOCATED unnecessary.

    Remove the now unnecessary tun_release so that if anything tries to use
    the sock_release code path the kernel will oops, and let us know about
    the bug.

    The macvtap code already uses its internal socket this way.

    Signed-off-by: "Eric W. Biederman"
    Signed-off-by: David S. Miller

    Eric W. Biederman
     

12 Apr, 2015

1 commit

  • All places outside of core VFS that checked ->read and ->write for being NULL or
    called the methods directly are gone now, so NULL {read,write} with non-NULL
    {read,write}_iter will do the right thing in all cases.

    Signed-off-by: Al Viro

    Al Viro
     

03 Mar, 2015

1 commit

  • Now that TIPC no longer depends on the iocb argument in its internal
    implementations of the sendmsg() and recvmsg() hooks defined in the
    proto structure, no user of these hooks uses the iocb argument at all.
    We can therefore drop the redundant iocb argument completely from all
    implementations of both sendmsg() and recvmsg() across the entire
    networking stack.

    Cc: Christoph Hellwig
    Suggested-by: Al Viro
    Signed-off-by: Ying Xue
    Signed-off-by: David S. Miller

    Ying Xue
     

09 Feb, 2015

1 commit

  • Receive Flow Steering is a nice solution but suffers from
    hash collisions when a mix of connected and unconnected traffic
    is received on the host and the flow hash table is populated.

    Also, clearing the flow in inet_release() makes RFS not very good
    for short-lived flows, as many packets can follow close()
    (FIN, ACK packets, ...).

    This patch extends the information stored into global hash table
    to not only include cpu number, but upper part of the hash value.

    I use a 32bit value, and dynamically split it in two parts.

    For hosts with fewer than 64 possible cpus, this gives 6 bits for the
    cpu number and 26 (32 - 6) bits for the upper part of the hash.

    Since hash bucket selection uses the low-order bits of the hash, we have
    a full hash match if /proc/sys/net/core/rps_sock_flow_entries is big
    enough.

    If the hash found in flow table does not match, we fallback to RPS (if
    it is enabled for the rxqueue).

    This means that a packet for a non-connected flow can avoid the
    IPI through an unrelated/victim CPU.

    This also means we no longer have to clear the table at socket
    close time, and this helps short-lived flow performance.
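
    A user-space sketch of the entry encoding described above (names and
    the mask derivation are illustrative):

    #include <stdint.h>

    /* cpu_mask = roundup_pow_of_two(nr_cpus) - 1, e.g. 0x3f for up to 64 CPUs,
     * leaving the remaining 26 bits for the upper part of the flow hash. */
    static uint32_t make_entry(uint32_t hash, uint32_t cpu, uint32_t cpu_mask)
    {
            return (hash & ~cpu_mask) | (cpu & cpu_mask);
    }

    static int entry_matches(uint32_t entry, uint32_t hash, uint32_t cpu_mask)
    {
            /* full hash match only if the stored upper bits agree */
            return (entry & ~cpu_mask) == (hash & ~cpu_mask);
    }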

    Signed-off-by: Eric Dumazet
    Acked-by: Tom Herbert
    Signed-off-by: David S. Miller

    Eric Dumazet
     

06 Feb, 2015

1 commit

  • Conflicts:
    drivers/net/vxlan.c
    drivers/vhost/net.c
    include/linux/if_vlan.h
    net/core/dev.c

    The net/core/dev.c conflict was the overlap of one commit marking an
    existing function static whilst another was adding a new function.

    In the include/linux/if_vlan.h case, the type used for a local
    variable was changed in 'net', whereas the function got rewritten
    to fix a stacked vlan bug in 'net-next'.

    In drivers/vhost/net.c, Al Viro's iov_iter conversions in 'net-next'
    overlapped with an endianness fix for VHOST 1.0 in 'net'.

    In drivers/net/vxlan.c, vxlan_find_vni() added a 'flags' parameter
    in 'net-next' whereas in 'net' there was a bug fix to pass in the
    correct network namespace pointer in calls to this function.

    Signed-off-by: David S. Miller

    David S. Miller
     

05 Feb, 2015

1 commit


04 Feb, 2015

2 commits

  • This reverts commit 3d0ad09412ffe00c9afa201d01effdb6023d09b4.

    Now that the GSO functionality can correctly track whether the fragment
    id has been selected and can select a fragment id if necessary,
    we can re-enable UFO on tap/macvtap and virtio devices.

    Signed-off-by: Vladislav Yasevich
    Signed-off-by: David S. Miller

    Vlad Yasevich
     
  • This reverts commit 5188cd44c55db3e92cd9e77a40b5baa7ed4340f7.

    Now that the GSO layer can track whether a fragment id has been selected
    and can allocate one if necessary, we don't need to do this in
    tap and macvtap. This reverts most of the code and only keeps
    the new ipv6 fragment id generation function that is still needed.

    Fixes: 3d0ad09412ff (drivers/net: Disable UFO through virtio)
    Signed-off-by: Vladislav Yasevich
    Signed-off-by: David S. Miller

    Vlad Yasevich
     

14 Jan, 2015

1 commit