18 Feb, 2017

1 commit

  • [ Upstream commit e1edab87faf6ca30cd137e0795bc73aa9a9a22ec ]

    When IFF_VNET_HDR is enabled, a virtio_net header must precede the data.
    The data length is verified to be greater than or equal to the expected
    header length tun->vnet_hdr_sz before copying.

    Read this value once and cache it locally, as it can be updated between
    the test and the use (TOCTOU).
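
    A minimal user-space sketch of the read-once pattern (the names shared_cfg
    and read_header are illustrative; an atomic load stands in for the kernel's
    READ_ONCE()):

    #include <stdatomic.h>
    #include <stddef.h>
    #include <string.h>

    struct shared_cfg {
            _Atomic int vnet_hdr_sz;        /* may change concurrently, like tun->vnet_hdr_sz */
    };

    static int read_header(struct shared_cfg *cfg, const char *data,
                           size_t len, char *hdr, size_t hdr_cap)
    {
            int hdr_sz = atomic_load(&cfg->vnet_hdr_sz);    /* read once */

            if (hdr_sz < 0 || (size_t)hdr_sz > hdr_cap || len < (size_t)hdr_sz)
                    return -1;                      /* check against the cached value */
            memcpy(hdr, data, (size_t)hdr_sz);      /* use the same cached value */
            return hdr_sz;
    }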

    Signed-off-by: Willem de Bruijn
    Reported-by: Dmitry Vyukov
    CC: Eric Dumazet
    Acked-by: Eric Dumazet
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Willem de Bruijn
     

04 Feb, 2017

1 commit

  • [ Upstream commit 6391a4481ba0796805d6581e42f9f0418c099e34 ]

    Commit 501db511397f ("virtio: don't set VIRTIO_NET_HDR_F_DATA_VALID on
    xmit") in fact disables VIRTIO_NET_HDR_F_DATA_VALID on the receiving
    path too. Fix this by adding a hint (has_data_valid) and setting it
    only on the receiving path.
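
    A rough sketch of the flag logic (the names vnet_hdr_demo and
    fill_vnet_hdr are illustrative, not the kernel helper):

    #include <stdbool.h>
    #include <stdint.h>

    #define VIRTIO_NET_HDR_F_DATA_VALID 2   /* "checksum already validated" hint */

    struct vnet_hdr_demo { uint8_t flags; };

    static void fill_vnet_hdr(struct vnet_hdr_demo *hdr, bool csum_valid,
                              bool has_data_valid)
    {
            hdr->flags = 0;
            /* Only receive-path callers pass has_data_valid == true, so the
             * hint is never advertised towards the device on transmit. */
            if (has_data_valid && csum_valid)
                    hdr->flags |= VIRTIO_NET_HDR_F_DATA_VALID;
    }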

    Cc: Rolf Neugebauer
    Signed-off-by: Jason Wang
    Acked-by: Rolf Neugebauer
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Jason Wang
     

01 Dec, 2016

1 commit

  • We trigger uarg->callback() immediately after we decide to do a data
    copy, even if the caller wanted zerocopy. This causes the callback
    (vhost_net_zerocopy_callback) to decrease the refcount. But when we
    hit an error afterwards, the error handling in vhost handle_tx() will
    try to decrease it again. This is wrong; fix it by delaying
    uarg->callback() until we're sure there are no errors.
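
    A self-contained toy model of the intended control flow (all names are
    illustrative, not vhost's or tun's):

    #include <stdbool.h>
    #include <stddef.h>

    struct ubuf_info_toy {
            void (*callback)(struct ubuf_info_toy *u);  /* drops one reference */
    };

    static int build_and_queue(const char *buf, size_t len)
    {
            return len ? 0 : -1;            /* stand-in for the real error paths */
    }

    /* Decide to copy even though zerocopy was requested, but defer the
     * "done with your pages" callback until no error path remains, so the
     * reference is dropped exactly once. */
    static int do_tx(struct ubuf_info_toy *uarg, const char *buf, size_t len,
                     bool zerocopy_requested)
    {
            bool copied_instead = zerocopy_requested && len < 128;  /* toy policy */
            int err = build_and_queue(buf, len);

            if (err)
                    return err;             /* caller's error handling drops the ref */
            if (copied_instead)
                    uarg->callback(uarg);   /* now safe: single drop */
            return 0;
    }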

    Reported-by: wangyunjian
    Signed-off-by: Jason Wang
    Acked-by: Michael S. Tsirkin
    Signed-off-by: David S. Miller

    Jason Wang
     

30 Aug, 2016

1 commit


24 Aug, 2016

1 commit

  • Instead of using sock_tx_timestamp, use skb_tx_timestamp to record
    software transmit timestamp of a packet.

    sock_tx_timestamp resets and overrides the tx_flags of the skb.
    The function is intended to be called from within the protocol
    layer when creating the skb, not from a device driver. This is
    inconsistent with other drivers and will cause issues for TCP.

    In TCP, we intend to sample the timestamps for the last byte
    for each sendmsg/sendpage. For that reason, tcp_sendmsg calls
    tcp_tx_timestamp only with the last skb that it generates.
    For example, if a 128KB message is split into two 64KB packets
    we want to sample the SND timestamp of the last packet. The current
    code in the tun driver, however, will result in sampling the SND
    timestamp for both packets.

    Also, when the last packet is split into smaller packets for
    retransmission (see tcp_fragment), the tun driver will record
    timestamps for all of the retransmitted packets and not only the
    last packet.
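
    A minimal sketch of the driver-side usage (kernel context assumed;
    demo_xmit is illustrative):

    static netdev_tx_t demo_xmit(struct sk_buff *skb, struct net_device *dev)
    {
            /* Reports a software timestamp only if the protocol layer already
             * set SKBTX_SW_TSTAMP on this particular skb, instead of rewriting
             * tx_flags from sk->sk_tsflags for every packet the way
             * sock_tx_timestamp() would. */
            skb_tx_timestamp(skb);

            /* ... hand the packet to the reader / tx queue here ... */
            return NETDEV_TX_OK;
    }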

    Fixes: eda297729171 (tun: Support software transmit time stamping.)
    Signed-off-by: Soheil Hassas Yeganeh
    Signed-off-by: Francis Yan
    Acked-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Soheil Hassas Yeganeh
     

21 Aug, 2016

2 commits


09 Jul, 2016

1 commit

  • The referenced change added a netlink notifier for processing
    device queue size events. These events are fired for all devices
    but the registered callback assumed they only occurred for tun
    devices. This fix adds a check (borrowed from macvtap.c) to discard
    non-tun device events.
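
    The guard looks roughly like this (kernel context assumed; the body is
    illustrative):

    static int tun_device_event_sketch(struct notifier_block *nb,
                                       unsigned long event, void *ptr)
    {
            struct net_device *dev = netdev_notifier_info_to_dev(ptr);

            if (dev->rtnl_link_ops != &tun_link_ops)
                    return NOTIFY_DONE;     /* not a tun device: ignore the event */

            /* ... handle queue-length changes for tun devices ... */
            return NOTIFY_DONE;
    }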

    For reference, this fixes the following splat:
    [ 71.505935] BUG: unable to handle kernel NULL pointer dereference at 0000000000000010
    [ 71.513870] IP: [] tun_device_event+0x110/0x340
    [ 71.519906] PGD 3f41f56067 PUD 3f264b7067 PMD 0
    [ 71.524497] Oops: 0002 [#1] SMP DEBUG_PAGEALLOC
    [ 71.529374] gsmi: Log Shutdown Reason 0x03
    [ 71.533417] Modules linked in:[ 71.533826] mlx4_en: eth1: Link Up

    [ 71.539616] bonding w1_therm wire cdc_acm ehci_pci ehci_hcd mlx4_en ib_uverbs mlx4_ib ib_core mlx4_core
    [ 71.549282] CPU: 12 PID: 7915 Comm: set.ixion-haswe Not tainted 4.7.0-dbx-DEV #8
    [ 71.556586] Hardware name: Intel Grantley,Wellsburg/Ixion_IT_15, BIOS 2.58.0 05/03/2016
    [ 71.564495] task: ffff887f00bb20c0 ti: ffff887f00798000 task.ti: ffff887f00798000
    [ 71.571894] RIP: 0010:[] [] tun_device_event+0x110/0x340
    [ 71.580327] RSP: 0018:ffff887f0079bbd8 EFLAGS: 00010202
    [ 71.585576] RAX: fffffffffffffae8 RBX: ffff887ef6d03378 RCX: 0000000000000000
    [ 71.592624] RDX: 0000000000000000 RSI: 0000000000000028 RDI: 0000000000000000
    [ 71.599675] RBP: ffff887f0079bc48 R08: 0000000000000000 R09: 0000000000000001
    [ 71.606730] R10: 0000000000000004 R11: 0000000000000000 R12: 0000000000000010
    [ 71.613780] R13: 0000000000000000 R14: 0000000000000001 R15: ffff887f0079bd00
    [ 71.620832] FS: 00007f5cdc581700(0000) GS:ffff883f7f700000(0000) knlGS:0000000000000000
    [ 71.628826] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    [ 71.634500] CR2: 0000000000000010 CR3: 0000003f3eb62000 CR4: 00000000001406e0
    [ 71.641549] Stack:
    [ 71.643533] ffff887f0079bc08 0000000000000246 000000000000001e ffff887ef6d00000
    [ 71.650871] ffff887f0079bd00 0000000000000000 0000000000000000 ffffffff00000000
    [ 71.658210] ffff887f0079bc48 ffffffff81d24070 00000000fffffff9 ffffffff81cec7a0
    [ 71.665549] Call Trace:
    [ 71.667975] [] notifier_call_chain+0x5d/0x80
    [ 71.673823] [] ? show_tx_maxrate+0x30/0x30
    [ 71.679502] [] __raw_notifier_call_chain+0xe/0x10
    [ 71.685778] [] raw_notifier_call_chain+0x16/0x20
    [ 71.691976] [] call_netdevice_notifiers_info+0x40/0x70
    [ 71.698681] [] call_netdevice_notifiers+0x16/0x20
    [ 71.704956] [] change_tx_queue_len+0x66/0x90
    [ 71.710807] [] netdev_store.isra.5+0xbf/0xd0
    [ 71.716658] [] tx_queue_len_store+0x50/0x60
    [ 71.722431] [] dev_attr_store+0x18/0x30
    [ 71.727857] [] sysfs_kf_write+0x4f/0x70
    [ 71.733274] [] kernfs_fop_write+0x147/0x1d0
    [ 71.739045] [] ? rcu_read_lock_sched_held+0x8f/0xa0
    [ 71.745499] [] __vfs_write+0x28/0x120
    [ 71.750748] [] ? percpu_down_read+0x57/0x90
    [ 71.756516] [] ? __sb_start_write+0xc8/0xe0
    [ 71.762278] [] ? __sb_start_write+0xc8/0xe0
    [ 71.768038] [] vfs_write+0xbe/0x1b0
    [ 71.773113] [] SyS_write+0x52/0xa0
    [ 71.778110] [] entry_SYSCALL_64_fastpath+0x18/0xa8
    [ 71.784472] Code: 45 31 f6 48 8b 93 78 33 00 00 48 81 c3 78 33 00 00 48 39 d3 48 8d 82 e8 fa ff ff 74 25 48 8d b0 40 05 00 00 49 63 d6 41 83 c6 01 89 34 d4 48 8b 90 18 05 00 00 48 39 d3 48 8d 82 e8 fa ff ff
    [ 71.803655] RIP [] tun_device_event+0x110/0x340
    [ 71.809769] RSP
    [ 71.813213] CR2: 0000000000000010
    [ 71.816512] ---[ end trace 4db6449606319f73 ]---

    Fixes: 1576d9860599 ("tun: switch to use skb array for tx")
    Signed-off-by: Craig Gallek
    Acked-by: Jason Wang
    Signed-off-by: David S. Miller

    Craig Gallek
     

05 Jul, 2016

1 commit

  • Stephen Rothwell reports a build warning (powerpc ppc64_defconfig):

    drivers/net/tun.c: In function 'tun_do_read.part.5':
    /home/sfr/next/next/drivers/net/tun.c:1491:6: warning: 'err' may be
    used uninitialized in this function [-Wmaybe-uninitialized]
    int err;

    This is because tun_ring_recv() may return an uninitialized err; fix this.

    Reported-by: Stephen Rothwell
    Signed-off-by: Jason Wang
    Signed-off-by: David S. Miller

    Jason Wang
     

01 Jul, 2016

1 commit

  • We used to queue tx packets in sk_receive_queue. This is less
    efficient since it requires spinlocks to synchronize between producer
    and consumer.

    This patch tries to address this by:

    - switching from sk_receive_queue to an skb_array, and resizing it
    when tx_queue_len is changed.
    - introducing a new proto_ops hook, peek_len, which is used for
    peeking the skb length.
    - implementing a tun version of peek_len for vhost_net to use, and
    converting vhost_net to use peek_len where possible (sketched below).
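
    A rough sketch of the new hook (kernel context assumed; the tun_file
    field names, tx_array and socket, are illustrative):

    static int tun_peek_len_sketch(struct socket *sock)
    {
            struct tun_file *tfile = container_of(sock, struct tun_file, socket);

            /* Report the length of the head skb without dequeuing it, so
             * vhost_net can size the next descriptor. */
            return skb_array_peek_len(&tfile->tx_array);
    }

    static const struct proto_ops tun_socket_ops_sketch = {
            /* ... existing sendmsg/recvmsg ops ... */
            .peek_len = tun_peek_len_sketch,
    };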

    Pktgen test shows about 15.3% improvement on guest receiving pps for small
    buffers:

    Before: ~1300000pps
    After : ~1500000pps

    Signed-off-by: Jason Wang
    Signed-off-by: David S. Miller

    Jason Wang
     

16 Jun, 2016

1 commit

  • Commit 34166093639b ("tuntap: use common code for virtio_net_hdr
    and skb GSO conversion") replaced the tun code for header manipulation
    with the generic helpers. While doing so, it implicitly moved the
    skb_partial_csum_set() invocation after eth_type_trans(), which
    invalidates the current gso start/offset values.
    Fix it by moving the helper invocation before the mac pulling.
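
    A condensed sketch of the required ordering (kernel context assumed;
    demo_rx_finish is illustrative):

    static int demo_rx_finish(struct tun_struct *tun, struct sk_buff *skb,
                              const struct virtio_net_hdr *gso)
    {
            /* csum_start/csum_offset recorded by skb_partial_csum_set() are
             * relative to skb->data, so convert the virtio header while
             * skb->data still points at the Ethernet header... */
            if (virtio_net_hdr_to_skb(skb, gso, tun_is_little_endian(tun)))
                    return -EINVAL;

            /* ...and only then pull the MAC header. */
            skb->protocol = eth_type_trans(skb, tun->dev);
            return 0;
    }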

    Fixes: 34166093639 ("tuntap: use common code for virtio_net_hdr and skb GSO conversion")
    Signed-off-by: Paolo Abeni
    Acked-by: Mike Rapoport
    Signed-off-by: David S. Miller

    Paolo Abeni
     

11 Jun, 2016

1 commit


21 May, 2016

1 commit

  • We used to check dev->reg_state against NETREG_REGISTERED each time
    we were woken up. But after commit 9e641bdcfa4e ("net-tun:
    restructure tun_do_read for better sleep/wakeup efficiency"), it uses
    skb_recv_datagram(), which does not check dev->reg_state. As a
    result, if we delete a tun/tap device while a process is blocked in a
    read, the device will wait forever for the reference count held by
    that process.

    Fix this by setting RCV_SHUTDOWN, which is checked in
    skb_recv_datagram(), before trying to wake up the process during
    uninit.
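
    A minimal sketch of the wakeup side (kernel context assumed;
    demo_uninit_wakeup is illustrative):

    static void demo_uninit_wakeup(struct sock *sk)
    {
            /* skb_recv_datagram() re-checks sk_shutdown before sleeping again,
             * so the blocked reader returns and drops its device reference. */
            sk->sk_shutdown |= RCV_SHUTDOWN;
            sk->sk_data_ready(sk);
    }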

    Fixes: 9e641bdcfa4e ("net-tun: restructure tun_do_read for better
    sleep/wakeup efficiency")
    Cc: Eric Dumazet
    Cc: Xi Wang
    Cc: Michael S. Tsirkin
    Signed-off-by: Jason Wang
    Acked-by: Eric Dumazet
    Acked-by: Michael S. Tsirkin
    Signed-off-by: David S. Miller

    Jason Wang
     

29 Apr, 2016

1 commit

  • There's no need to calculate the RPS hash if RPS is not enabled. So
    this patch exports rps_needed and checks it before trying to get the
    RPS hash. Tests (using pktgen to inject packets into the guest) show
    this can improve pps by about 13% (when RPS is disabled).

    Before:
    ~1150000 pps
    After:
    ~1300000 pps
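
    Roughly, on the receive path (kernel context assumed; demo_rx_hash is
    illustrative):

    static u32 demo_rx_hash(struct sk_buff *skb)
    {
            u32 rxhash = 0;

    #ifdef CONFIG_RPS
            if (static_key_false(&rps_needed))      /* rps_needed is now exported */
                    rxhash = skb_get_hash(skb);     /* only hash when RPS is in use */
    #endif
            return rxhash;
    }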

    Cc: Michael S. Tsirkin
    Signed-off-by: Jason Wang
    ----
    Changes from V1:
    - Fix build when CONFIG_RPS is not set
    Signed-off-by: David S. Miller

    Jason Wang
     

19 Apr, 2016

1 commit

  • The current tun_net_xmit() implementation doesn't need any external
    lock, since it relies on RCU protection for the tun data structure
    and on the socket queue lock for skb queuing.

    This patch sets the NETIF_F_LLTX feature bit on the tun device, so
    that on xmit, in the absence of a qdisc, no serialization lock is
    acquired by the caller.

    User space can remove the default tun qdisc with:

    tc qdisc replace dev <tun device name> root noqueue
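
    The driver-side change is essentially one feature bit (kernel context
    assumed; demo_setup is illustrative):

    static void demo_setup(struct net_device *dev)
    {
            /* With LLTX the core does not take HARD_TX_LOCK for this device;
             * tun_net_xmit() relies on RCU plus the queue's own locking. */
            dev->features |= NETIF_F_LLTX;
    }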

    Signed-off-by: Paolo Abeni
    Acked-by: Hannes Frederic Sowa
    Acked-by: Eric Dumazet
    Acked-by: Michael S. Tsirkin
    Signed-off-by: David S. Miller

    Paolo Abeni
     

15 Apr, 2016

1 commit

  • Currently the tun device accounting uses dev->stats without applying
    any kind of protection, even though accounting happens in preemptible
    process context.
    This patch moves the tun stats to a per-CPU data structure, and
    protects the updates with u64_stats_update_begin()/u64_stats_update_end()
    or this_cpu_inc(), according to the stat type. The per-CPU stats are
    aggregated by the newly added ndo_get_stats64 op.
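
    A condensed sketch of the scheme (kernel context assumed; the struct
    and function names are illustrative):

    struct demo_pcpu_stats {
            u64 rx_packets;
            u64 rx_bytes;
            struct u64_stats_sync syncp;
    };

    static void demo_rx_account(struct demo_pcpu_stats __percpu *stats,
                                unsigned int len)
    {
            struct demo_pcpu_stats *s = this_cpu_ptr(stats);

            u64_stats_update_begin(&s->syncp);      /* writer side of the seqcount */
            s->rx_packets++;
            s->rx_bytes += len;
            u64_stats_update_end(&s->syncp);
            /* ndo_get_stats64 later sums every CPU's copy under
             * u64_stats_fetch_begin/retry. */
    }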

    Signed-off-by: Paolo Abeni
    Signed-off-by: David S. Miller

    Paolo Abeni
     

10 Apr, 2016

1 commit


09 Apr, 2016

1 commit

  • After commit f84bb1eac027 ("net: fix IFF_NO_QUEUE for drivers using
    alloc_netdev"), the default qdisc was changed to noqueue because
    tuntap does not set tx_queue_len during .setup(). This patch restores
    the default qdisc by setting tx_queue_len in tun_setup().

    Fixes: f84bb1eac027 ("net: fix IFF_NO_QUEUE for drivers using alloc_netdev")
    Cc: Phil Sutter
    Signed-off-by: Jason Wang
    Acked-by: Michael S. Tsirkin
    Acked-by: Phil Sutter
    Signed-off-by: David S. Miller

    Jason Wang
     

08 Apr, 2016

1 commit

  • This reverts commit 5a5abb1fa3b05dd ("tun, bpf: fix suspicious RCU usage
    in tun_{attach, detach}_filter") and instead uses lock_sock around
    sk_{attach,detach}_filter. The checks inside filter.c are updated with
    lockdep_sock_is_held to check for the proper socket locks.

    This keeps the code cleaner by ensuring that only one lock governs the
    socket filter instead of two independent locks.
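
    In sketch form (kernel context assumed; demo_attach_filter is
    illustrative):

    static int demo_attach_filter(struct sock *sk, struct sock_fprog *prog)
    {
            int ret;

            lock_sock(sk);
            /* sk_attach_filter()'s internal rcu_dereference_protected() checks
             * are now satisfied via lockdep_sock_is_held(). */
            ret = sk_attach_filter(prog, sk);
            release_sock(sk);
            return ret;
    }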

    Cc: Daniel Borkmann
    Signed-off-by: Hannes Frederic Sowa
    Signed-off-by: David S. Miller

    Hannes Frederic Sowa
     

05 Apr, 2016

1 commit

  • Currently, SO_TIMESTAMPING can only be enabled using setsockopt.
    This is very costly when users want to sample writes to gather
    tx timestamps.

    Add support for enabling SO_TIMESTAMPING via control messages by
    using the tsflags added to `struct sockcm_cookie` (in the previous
    patches in this series) to set the tx_flags of the last skb created in
    a sendmsg. With this patch, the timestamp recording bits in the
    skbuff's tx_flags are overridden if SO_TIMESTAMPING is passed in a cmsg.

    Please note that this only overrides the timestamp recording flags.
    Users should enable timestamp reporting (e.g.,
    SOF_TIMESTAMPING_SOFTWARE | SOF_TIMESTAMPING_OPT_ID) using
    socket options and then ask for SOF_TIMESTAMPING_TX_* via control
    messages per sendmsg to sample timestamps for each write.
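
    A user-space sketch of sampling a single write this way (assumes
    reporting was already enabled with setsockopt as described above;
    send_sampled is illustrative):

    #include <sys/socket.h>
    #include <sys/uio.h>
    #include <linux/net_tstamp.h>

    static ssize_t send_sampled(int fd, const void *buf, size_t len)
    {
            char ctrl[CMSG_SPACE(sizeof(__u32))] = { 0 };
            struct iovec iov = { .iov_base = (void *)buf, .iov_len = len };
            struct msghdr msg = {
                    .msg_iov = &iov, .msg_iovlen = 1,
                    .msg_control = ctrl, .msg_controllen = sizeof(ctrl),
            };
            struct cmsghdr *cmsg = CMSG_FIRSTHDR(&msg);

            cmsg->cmsg_level = SOL_SOCKET;
            cmsg->cmsg_type = SO_TIMESTAMPING;
            cmsg->cmsg_len = CMSG_LEN(sizeof(__u32));
            *(__u32 *)CMSG_DATA(cmsg) = SOF_TIMESTAMPING_TX_SOFTWARE;  /* this write only */

            return sendmsg(fd, &msg, 0);
    }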

    Signed-off-by: Soheil Hassas Yeganeh
    Acked-by: Willem de Bruijn
    Signed-off-by: David S. Miller

    Soheil Hassas Yeganeh
     

02 Apr, 2016

1 commit

  • Sasha Levin reported a suspicious rcu_dereference_protected() warning
    found while fuzzing with trinity that is similar to this one:

    [ 52.765684] net/core/filter.c:2262 suspicious rcu_dereference_protected() usage!
    [ 52.765688] other info that might help us debug this:
    [ 52.765695] rcu_scheduler_active = 1, debug_locks = 1
    [ 52.765701] 1 lock held by a.out/1525:
    [ 52.765704] #0: (rtnl_mutex){+.+.+.}, at: [] rtnl_lock+0x17/0x20
    [ 52.765721] stack backtrace:
    [ 52.765728] CPU: 1 PID: 1525 Comm: a.out Not tainted 4.5.0+ #264
    [...]
    [ 52.765768] Call Trace:
    [ 52.765775] [] dump_stack+0x85/0xc8
    [ 52.765784] [] lockdep_rcu_suspicious+0xd5/0x110
    [ 52.765792] [] sk_detach_filter+0x82/0x90
    [ 52.765801] [] tun_detach_filter+0x35/0x90 [tun]
    [ 52.765810] [] __tun_chr_ioctl+0x354/0x1130 [tun]
    [ 52.765818] [] ? selinux_file_ioctl+0x130/0x210
    [ 52.765827] [] tun_chr_ioctl+0x13/0x20 [tun]
    [ 52.765834] [] do_vfs_ioctl+0x96/0x690
    [ 52.765843] [] ? security_file_ioctl+0x43/0x60
    [ 52.765850] [] SyS_ioctl+0x79/0x90
    [ 52.765858] [] do_syscall_64+0x62/0x140
    [ 52.765866] [] entry_SYSCALL64_slow_path+0x25/0x25

    Same can be triggered with PROVE_RCU (+ PROVE_RCU_REPEATEDLY) enabled
    from tun_attach_filter() when user space calls ioctl(tun_fd, TUN{ATTACH,
    DETACH}FILTER, ...) for adding/removing a BPF filter on tap devices.

    Since the fix in f91ff5b9ff52 ("net: sk_{detach|attach}_filter() rcu
    fixes") sk_attach_filter()/sk_detach_filter() now dereferences the
    filter with rcu_dereference_protected(), checking whether socket lock
    is held in control path.

    Since its introduction in 994051625981 ("tun: socket filter support"),
    tap filters have been managed under the RTNL lock from __tun_chr_ioctl().
    Thus the sock_owned_by_user(sk) check doesn't apply in this specific
    case and therefore triggers the false positive.

    Extend the BPF API with __sk_attach_filter()/__sk_detach_filter() pair
    that is used by tap filters and pass in lockdep_rtnl_is_held() for the
    rcu_dereference_protected() checks instead.

    Reported-by: Sasha Levin
    Signed-off-by: Daniel Borkmann
    Signed-off-by: David S. Miller

    Daniel Borkmann
     

02 Mar, 2016

1 commit

  • ndo_set_rx_headroom controls the align value used by tun devices to
    allocate skbs on frame reception.
    When the xmit device adds a large encapsulation, this avoids an skb
    head reallocation on forwarding.

    The measured improvement when forwarding towards a vxlan dev with a
    frame size below the egress device MTU is as follows:

    vxlan over ipv6, bridged: +6%
    vxlan over ipv6, ovs: +7%

    In case of ipv4 tunnels there is no improvement, since the tun
    device default alignment provides enough headroom to avoid the skb
    head reallocation.
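
    Conceptually the hook is small (kernel context assumed; the clamp and
    the align field are illustrative of how tun could use the value):

    static void demo_set_rx_headroom(struct net_device *dev, int new_hr)
    {
            struct tun_struct *tun = netdev_priv(dev);

            if (new_hr < NET_SKB_PAD)
                    new_hr = NET_SKB_PAD;   /* never go below the default reserve */
            tun->align = new_hr;            /* used when allocating rx skbs */
    }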

    Signed-off-by: Paolo Abeni
    Signed-off-by: David S. Miller

    Paolo Abeni
     

18 Dec, 2015

1 commit

  • If a tun interface is turned down, we should not allow packet injection
    into the kernel.

    The kernel already does not send packets to the tun device.

    TUNATTACHFILTER cannot be used for this, since only tun_net_xmit() takes
    care of it.
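
    The check itself is a one-liner in the write path (kernel context
    assumed; demo_get_user is illustrative):

    static ssize_t demo_get_user(struct tun_struct *tun, size_t len)
    {
            if (!(tun->dev->flags & IFF_UP))
                    return -EIO;            /* interface down: refuse injection */

            /* ... normal packet build and netif_rx() path ... */
            return len;
    }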

    Reported-by: Curt Wohlgemuth
    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

02 Dec, 2015

1 commit

  • This patch is a cleanup to make the following patch easier to
    review.

    The goal is to move SOCK_ASYNC_NOSPACE and SOCK_ASYNC_WAITDATA
    from (struct socket)->flags to (struct socket_wq)->flags,
    to benefit from RCU protection in sock_wake_async().

    To ease backports, we rename both constants.

    Two new helpers, sk_set_bit(int nr, struct sock *sk)
    and sk_clear_bit(int nr, struct sock *sk), are added so that the
    following patch can change their implementation.
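
    The helpers are deliberately thin for now (kernel context assumed),
    roughly:

    static inline void sk_set_bit(int nr, struct sock *sk)
    {
            set_bit(nr, &sk->sk_socket->flags);
    }

    static inline void sk_clear_bit(int nr, struct sock *sk)
    {
            clear_bit(nr, &sk->sk_socket->flags);
    }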

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

13 Oct, 2015

1 commit

  • Timewait or request sockets are small and do not contain sk->sk_tsflags.

    Without this fix, we might read garbage, and crash later in

    __skb_complete_tx_timestamp()
    -> sock_queue_err_skb()

    (These pseudo sockets do not have an error queue either)

    Fixes: ca6fb0651883 ("tcp: attach SYNACK messages to request sockets instead of listener")
    Signed-off-by: Eric Dumazet
    Cc: Willem de Bruijn
    Acked-by: Michael S. Tsirkin
    Signed-off-by: David S. Miller

    Eric Dumazet
     

04 Aug, 2015

1 commit


04 Jul, 2015

1 commit

  • Pull virtio/vhost cross endian support from Michael Tsirkin:
    "I have just queued some more bugfix patches today but none fix
    regressions and none are related to these ones, so it looks like a
    good time for a merge for -rc1.

    The motivation for this is support for legacy BE guests on the new LE
    hosts. There are two redeeming properties that made me merge this:

    - It's a trivial amount of code: since we wrap host/guest accesses
    anyway, almost all of it is well hidden from drivers.

    - Sane platforms would never set flags like VHOST_CROSS_ENDIAN_LEGACY,
    and when it's clear, there's zero overhead (at some point it was
    tested by compiling with and without the patches, and we got the same
    stripped binary).

    Maybe we could create a Kconfig symbol to enforce the second point:
    prevent people from enabling it eg on x86. I will look into this"

    * tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost:
    virtio-pci: alloc only resources actually used.
    macvtap/tun: cross-endian support for little-endian hosts
    vhost: cross-endian support for legacy devices
    virtio: add explicit big-endian support to memory accessors
    vhost: introduce vhost_is_little_endian() helper
    vringh: introduce vringh_is_little_endian() helper
    macvtap: introduce macvtap_is_little_endian() helper
    tun: add tun_is_little_endian() helper
    virtio: introduce virtio_is_little_endian() helper

    Linus Torvalds
     

01 Jun, 2015

3 commits

  • The VNET_LE flag was introduced to fix accesses to virtio 1.0 headers
    that are always little-endian. It can also be used to handle the special
    case of a legacy little-endian device implemented by a big-endian host.

    Let's add a flag and ioctls for big-endian devices as well. If both flags
    are set, little-endian wins.

    Since this isn't a common use case, the feature is controlled by a kernel
    config option (not set by default).

    Both macvtap and tun are covered by this patch since they share the same
    API with userland.
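
    The precedence rule can be sketched as (kernel context assumed; the
    helper is illustrative):

    static bool demo_is_little_endian(unsigned int flags)
    {
            if (flags & TUN_VNET_LE)
                    return true;                    /* little-endian wins */
            if (flags & TUN_VNET_BE)
                    return false;                   /* explicit big-endian device */
            return virtio_legacy_is_little_endian(); /* fall back to native rules */
    }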

    Signed-off-by: Greg Kurz

    Signed-off-by: Michael S. Tsirkin
    Reviewed-by: David Gibson

    Greg Kurz
     
  • The current memory accessor logic is:
    - little endian if little_endian
    - native endian (i.e. no byteswap) if !little_endian

    If we want to fully support cross-endian vhost, we also need to be
    able to convert to big endian.

    Instead of changing the little_endian argument to some 3-value enum, this
    patch changes the logic to:
    - little endian if little_endian
    - big endian if !little_endian

    The native endian case is handled by all users with a trivial helper. This
    patch doesn't change any functionality, nor does it add overhead.
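
    In sketch form, for one accessor (kernel context assumed;
    demo_virtio16_to_cpu is illustrative of the pattern):

    static inline u16 demo_virtio16_to_cpu(bool little_endian, __virtio16 val)
    {
            /* "native endian" callers simply pass
             * little_endian = virtio_legacy_is_little_endian() */
            if (little_endian)
                    return le16_to_cpu((__force __le16)val);
            else
                    return be16_to_cpu((__force __be16)val);
    }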

    Signed-off-by: Greg Kurz

    Signed-off-by: Michael S. Tsirkin
    Reviewed-by: Cornelia Huck
    Reviewed-by: David Gibson

    Greg Kurz
     
  • Signed-off-by: Greg Kurz

    Signed-off-by: Michael S. Tsirkin
    Acked-by: Cornelia Huck
    Reviewed-by: David Gibson

    Greg Kurz
     

11 May, 2015

2 commits

  • In preparation for changing how struct net is refcounted
    on kernel sockets pass the knowledge that we are creating
    a kernel socket from sock_create_kern through to sk_alloc.

    Signed-off-by: "Eric W. Biederman"
    Signed-off-by: David S. Miller

    Eric W. Biederman
     
  • There is no need for tun to do the weird network namespace refcounting.
    The existing network namespace refcounting in tfile has almost exactly
    the same lifetime. So rewrite the code to use the struct sock network
    namespace refcounting and remove the unnecessary hand-rolled network
    namespace refcounting and the unnecessary tfile->net.

    This change allows the tun code to directly call sock_put bypassing
    sock_release and making SOCK_EXTERNALLY_ALLOCATED unnecessary.

    Remove the now unnecessary tun_release so that if anything tries to use
    the sock_release code path the kernel will oops, and let us know about
    the bug.

    The macvtap code already uses its internal socket this way.

    Signed-off-by: "Eric W. Biederman"
    Signed-off-by: David S. Miller

    Eric W. Biederman
     

12 Apr, 2015

1 commit

  • All places outside of core VFS that checked ->read and ->write for being NULL or
    called the methods directly are gone now, so NULL {read,write} with non-NULL
    {read,write}_iter will do the right thing in all cases.

    Signed-off-by: Al Viro

    Al Viro
     

03 Mar, 2015

1 commit

  • Now that TIPC no longer depends on the iocb argument in its internal
    implementations of the sendmsg() and recvmsg() hooks defined in the
    proto structure, no user of these hooks uses the iocb argument at all.
    We can therefore drop the redundant iocb argument completely from all
    implementations of both sendmsg() and recvmsg() across the entire
    networking stack.

    Cc: Christoph Hellwig
    Suggested-by: Al Viro
    Signed-off-by: Ying Xue
    Signed-off-by: David S. Miller

    Ying Xue
     

09 Feb, 2015

1 commit

  • Receive Flow Steering is a nice solution but suffers from
    hash collisions when a mix of connected and unconnected traffic
    is received on the host and the flow hash table is populated.

    Also, clearing the flow in inet_release() makes RFS not very good
    for short-lived flows, as many packets can follow close()
    (FIN, ACK packets, ...).

    This patch extends the information stored into global hash table
    to not only include cpu number, but upper part of the hash value.

    I use a 32bit value, and dynamically split it in two parts.

    For hosts with fewer than 64 possible cpus, this gives 6 bits for the
    cpu number and 26 (32 - 6) bits for the upper part of the hash.

    Since hash bucket selection uses the low-order bits of the hash, we have
    a full hash match if /proc/sys/net/core/rps_sock_flow_entries is big
    enough.

    If the hash found in flow table does not match, we fallback to RPS (if
    it is enabled for the rxqueue).

    This means that a packet for a non-connected flow can avoid the
    IPI through an unrelated/victim CPU.

    This also means we no longer have to clear the table at socket
    close time, and this helps short-lived flow performance.
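
    A user-space sketch of the entry encoding described above (names and
    the mask derivation are illustrative):

    #include <stdint.h>

    /* cpu_mask = roundup_pow_of_two(nr_cpus) - 1, e.g. 0x3f for up to 64 CPUs,
     * leaving the remaining 26 bits for the upper part of the flow hash. */
    static uint32_t make_entry(uint32_t hash, uint32_t cpu, uint32_t cpu_mask)
    {
            return (hash & ~cpu_mask) | (cpu & cpu_mask);
    }

    static int entry_matches(uint32_t entry, uint32_t hash, uint32_t cpu_mask)
    {
            /* full hash match only if the stored upper bits agree */
            return (entry & ~cpu_mask) == (hash & ~cpu_mask);
    }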

    Signed-off-by: Eric Dumazet
    Acked-by: Tom Herbert
    Signed-off-by: David S. Miller

    Eric Dumazet
     

06 Feb, 2015

1 commit

  • Conflicts:
    drivers/net/vxlan.c
    drivers/vhost/net.c
    include/linux/if_vlan.h
    net/core/dev.c

    The net/core/dev.c conflict was the overlap of one commit marking an
    existing function static whilst another was adding a new function.

    In the include/linux/if_vlan.h case, the type used for a local
    variable was changed in 'net', whereas the function got rewritten
    to fix a stacked vlan bug in 'net-next'.

    In drivers/vhost/net.c, Al Viro's iov_iter conversions in 'net-next'
    overlapped with an endianness fix for VHOST 1.0 in 'net'.

    In drivers/net/vxlan.c, vxlan_find_vni() added a 'flags' parameter
    in 'net-next' whereas in 'net' there was a bug fix to pass in the
    correct network namespace pointer in calls to this function.

    Signed-off-by: David S. Miller

    David S. Miller
     

05 Feb, 2015

1 commit


04 Feb, 2015

2 commits

  • This reverts commit 3d0ad09412ffe00c9afa201d01effdb6023d09b4.

    Now that the GSO functionality can correctly track whether the fragment
    id has been selected and can select a fragment id if necessary,
    we can re-enable UFO on tap/macvtap and virtio devices.

    Signed-off-by: Vladislav Yasevich
    Signed-off-by: David S. Miller

    Vlad Yasevich
     
  • This reverts commit 5188cd44c55db3e92cd9e77a40b5baa7ed4340f7.

    Now that the GSO layer can track whether a fragment id has been selected
    and can allocate one if necessary, we don't need to do this in
    tap and macvtap. This reverts most of the code and only keeps
    the new ipv6 fragment id generation function that is still needed.

    Fixes: 3d0ad09412ff (drivers/net: Disable UFO through virtio)
    Signed-off-by: Vladislav Yasevich
    Signed-off-by: David S. Miller

    Vlad Yasevich
     

14 Jan, 2015

1 commit