Eric Lee / smarc-fsl-linux-kernel

15 Nov, 2020

1 commit

65b422d9b vsock: forward all packets to the host when no H2G is registered ... Browse Code »

Before commit c0cfa2d8a788 ("vsock: add multi-transports support"),
if a G2H transport was loaded (e.g. virtio transport), every packets
was forwarded to the host, regardless of the destination CID.
The H2G transports implemented until then (vhost-vsock, VMCI) always
responded with an error, if the destination CID was not
VMADDR_CID_HOST.

From that commit, we are using the remote CID to decide which
transport to use, so packets with remote CID > VMADDR_CID_HOST(2)
are sent only through H2G transport. If no H2G is available, packets
are discarded directly in the guest.

Some use cases (e.g. Nitro Enclaves [1]) rely on the old behaviour
to implement sibling VMs communication, so we restore the old
behavior when no H2G is registered.
It will be up to the host to discard packets if the destination is
not the right one. As it was already implemented before adding
multi-transport support.

Tested with nested QEMU/KVM by me and Nitro Enclaves by Andra.

[1] Documentation/virt/ne_overview.rst

Cc: Jorgen Hansen
Cc: Dexuan Cui
Fixes: c0cfa2d8a788 ("vsock: add multi-transports support")
Reported-by: Andra Paraschiv
Tested-by: Andra Paraschiv
Signed-off-by: Stefano Garzarella
Link: https://lore.kernel.org/r/20201112133837.34183-1-sgarzare@redhat.com
Signed-off-by: Jakub Kicinski

Stefano Garzarella
2020-11-15 03:33:39 +0800

27 Oct, 2020

1 commit

af545bb5e vsock: use ns_capable_noaudit() on socket create ... Browse Code »

During __vsock_create() CAP_NET_ADMIN is used to determine if the
vsock_sock->trusted should be set to true. This value is used later
for determing if a remote connection should be allowed to connect
to a restricted VM. Unfortunately, if the caller doesn't have
CAP_NET_ADMIN, an audit message such as an selinux denial is
generated even if the caller does not want a trusted socket.

Logging errors on success is confusing. To avoid this, switch the
capable(CAP_NET_ADMIN) check to the noaudit version.

Reported-by: Roman Kiryanov
https://android-review.googlesource.com/c/device/generic/goldfish/+/1468545/
Signed-off-by: Jeff Vander Stoep
Reviewed-by: James Morris
Link: https://lore.kernel.org/r/20201023143757.377574-1-jeffv@google.com
Signed-off-by: Jakub Kicinski

Jeff Vander Stoep
2020-10-27 07:22:42 +0800

13 Aug, 2020

1 commit

1980c0584 vsock: fix potential null pointer dereference in vsock_poll() ... Browse Code »

syzbot reported this issue where in the vsock_poll() we find the
socket state at TCP_ESTABLISHED, but 'transport' is null:
general protection fault, probably for non-canonical address 0xdffffc0000000012: 0000 [#1] PREEMPT SMP KASAN
KASAN: null-ptr-deref in range [0x0000000000000090-0x0000000000000097]
CPU: 0 PID: 8227 Comm: syz-executor.2 Not tainted 5.8.0-rc7-syzkaller #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
RIP: 0010:vsock_poll+0x75a/0x8e0 net/vmw_vsock/af_vsock.c:1038
Call Trace:
sock_poll+0x159/0x460 net/socket.c:1266
vfs_poll include/linux/poll.h:90 [inline]
do_pollfd fs/select.c:869 [inline]
do_poll fs/select.c:917 [inline]
do_sys_poll+0x607/0xd40 fs/select.c:1011
__do_sys_poll fs/select.c:1069 [inline]
__se_sys_poll fs/select.c:1057 [inline]
__x64_sys_poll+0x18c/0x440 fs/select.c:1057
do_syscall_64+0x60/0xe0 arch/x86/entry/common.c:384
entry_SYSCALL_64_after_hwframe+0x44/0xa9

This issue can happen if the TCP_ESTABLISHED state is set after we read
the vsk->transport in the vsock_poll().

We could put barriers to synchronize, but this can only happen during
connection setup, so we can simply check that 'transport' is valid.

Fixes: c0cfa2d8a788 ("vsock: add multi-transports support")
Reported-and-tested-by: syzbot+a61bac2fcc1a7c6623fe@syzkaller.appspotmail.com
Signed-off-by: Stefano Garzarella
Reviewed-by: Jorgen Hansen
Signed-off-by: David S. Miller

Stefano Garzarella
2020-08-13 03:56:06 +0800

25 Jul, 2020

1 commit

a7b75c5a8 net: pass a sockptr_t into ->setsockopt ... Browse Code »

Rework the remaining setsockopt code to pass a sockptr_t instead of a
plain user pointer. This removes the last remaining set_fs(KERNEL_DS)
outside of architecture specific code.

Signed-off-by: Christoph Hellwig
Acked-by: Stefan Schmidt [ieee802154]
Acked-by: Matthieu Baerts
Signed-off-by: David S. Miller

Christoph Hellwig
2020-07-25 06:41:54 +0800

20 Jul, 2020

1 commit

a44d9e721 net: make ->{get,set}sockopt in proto_ops optional ... Browse Code »

Just check for a NULL method instead of wiring up
sock_no_{get,set}sockopt.

Signed-off-by: Christoph Hellwig
Acked-by: Marc Kleine-Budde
Signed-off-by: David S. Miller

Christoph Hellwig
2020-07-20 09:16:41 +0800

28 May, 2020

1 commit

7e0afbdfd vsock: fix timeout in vsock_accept() ... Browse Code »

The accept(2) is an "input" socket interface, so we should use
SO_RCVTIMEO instead of SO_SNDTIMEO to set the timeout.

So this patch replace sock_sndtimeo() with sock_rcvtimeo() to
use the right timeout in the vsock_accept().

Fixes: d021c344051a ("VSOCK: Introduce VM Sockets")
Signed-off-by: Stefano Garzarella
Reviewed-by: Jorgen Hansen
Signed-off-by: David S. Miller

Stefano Garzarella
2020-05-28 02:20:23 +0800

28 Feb, 2020

1 commit

3f74957fc vsock: fix potential deadlock in transport->release() ... Browse Code »

Some transports (hyperv, virtio) acquire the sock lock during the
.release() callback.

In the vsock_stream_connect() we call vsock_assign_transport(); if
the socket was previously assigned to another transport, the
vsk->transport->release() is called, but the sock lock is already
held in the vsock_stream_connect(), causing a deadlock reported by
syzbot:

INFO: task syz-executor280:9768 blocked for more than 143 seconds.
Not tainted 5.6.0-rc1-syzkaller #0
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
syz-executor280 D27912 9768 9766 0x00000000
Call Trace:
context_switch kernel/sched/core.c:3386 [inline]
__schedule+0x934/0x1f90 kernel/sched/core.c:4082
schedule+0xdc/0x2b0 kernel/sched/core.c:4156
__lock_sock+0x165/0x290 net/core/sock.c:2413
lock_sock_nested+0xfe/0x120 net/core/sock.c:2938
virtio_transport_release+0xc4/0xd60 net/vmw_vsock/virtio_transport_common.c:832
vsock_assign_transport+0xf3/0x3b0 net/vmw_vsock/af_vsock.c:454
vsock_stream_connect+0x2b3/0xc70 net/vmw_vsock/af_vsock.c:1288
__sys_connect_file+0x161/0x1c0 net/socket.c:1857
__sys_connect+0x174/0x1b0 net/socket.c:1874
__do_sys_connect net/socket.c:1885 [inline]
__se_sys_connect net/socket.c:1882 [inline]
__x64_sys_connect+0x73/0xb0 net/socket.c:1882
do_syscall_64+0xfa/0x790 arch/x86/entry/common.c:294
entry_SYSCALL_64_after_hwframe+0x49/0xbe

To avoid this issue, this patch remove the lock acquiring in the
.release() callback of hyperv and virtio transports, and it holds
the lock when we call vsk->transport->release() in the vsock core.

Reported-by: syzbot+731710996d79d0d58fbc@syzkaller.appspotmail.com
Fixes: 408624af4c89 ("vsock: use local transport when it is loaded")
Signed-off-by: Stefano Garzarella
Reviewed-by: Stefan Hajnoczi
Signed-off-by: David S. Miller

Stefano Garzarella
2020-02-28 04:03:56 +0800

12 Dec, 2019

2 commits

408624af4 vsock: use local transport when it is loaded ... Browse Code »

Now that we have a transport that can handle the local communication,
we can use it when it is loaded.

A socket will use the local transport (loopback) when the remote
CID is:
- equal to VMADDR_CID_LOCAL
- or equal to transport_g2h->get_local_cid(), if transport_g2h
is loaded (this allows us to keep the same behavior implemented
by virtio and vmci transports)
- or equal to VMADDR_CID_HOST, if transport_g2h is not loaded

Signed-off-by: Stefano Garzarella
Signed-off-by: David S. Miller

Stefano Garzarella
2019-12-12 07:01:23 +0800
0e1219057 vsock: add local transport support in the vsock core ... Browse Code »

This patch allows to register a transport able to handle
local communication (loopback).

Reviewed-by: Stefan Hajnoczi
Signed-off-by: Stefano Garzarella
Signed-off-by: David S. Miller

Stefano Garzarella
2019-12-12 07:01:23 +0800

22 Nov, 2019

1 commit

039fcccae vsock: avoid to assign transport if its initialization fails ... Browse Code »

If transport->init() fails, we can't assign the transport to the
socket, because it's not initialized correctly, and any future
calls to the transport callbacks would have an unexpected behavior.

Fixes: c0cfa2d8a788 ("vsock: add multi-transports support")
Reported-and-tested-by: syzbot+e2e5c07bf353b2f79daa@syzkaller.appspotmail.com
Signed-off-by: Stefano Garzarella
Reviewed-by: Jorgen Hansen
Signed-off-by: David S. Miller

Stefano Garzarella
2019-11-22 03:37:16 +0800

15 Nov, 2019

9 commits

36c5b48b9 vsock: fix bind() behaviour taking care of CID ... Browse Code »

When we are looking for a socket bound to a specific address,
we also have to take into account the CID.

This patch is useful with multi-transports support because it
allows the binding of the same port with different CID, and
it prevents a connection to a wrong socket bound to the same
port, but with different CID.

Reviewed-by: Stefan Hajnoczi
Reviewed-by: Jorgen Hansen
Signed-off-by: Stefano Garzarella
Signed-off-by: David S. Miller

Stefano Garzarella
2019-11-15 10:12:18 +0800
6a2c09621 vsock: prevent transport modules unloading ... Browse Code »

This patch adds 'module' member in the 'struct vsock_transport'
in order to get/put the transport module. This prevents the
module unloading while sockets are assigned to it.

We increase the module refcnt when a socket is assigned to a
transport, and we decrease the module refcnt when the socket
is destructed.

Reviewed-by: Stefan Hajnoczi
Reviewed-by: Jorgen Hansen
Signed-off-by: Stefano Garzarella
Signed-off-by: David S. Miller

Stefano Garzarella
2019-11-15 10:12:18 +0800
c0cfa2d8a vsock: add multi-transports support ... Browse Code »

This patch adds the support of multiple transports in the
VSOCK core.

With the multi-transports support, we can use vsock with nested VMs
(using also different hypervisors) loading both guest->host and
host->guest transports at the same time.

Major changes:
- vsock core module can be loaded regardless of the transports
- vsock_core_init() and vsock_core_exit() are renamed to
vsock_core_register() and vsock_core_unregister()
- vsock_core_register() has a feature parameter (H2G, G2H, DGRAM)
to identify which directions the transport can handle and if it's
support DGRAM (only vmci)
- each stream socket is assigned to a transport when the remote CID
is set (during the connect() or when we receive a connection request
on a listener socket).
The remote CID is used to decide which transport to use:
- remote CID host transport;
- remote CID == local_cid (guest->host transport) will use guest->host
transport for loopback (host->guest transports don't support loopback);
- remote CID > VMADDR_CID_HOST will use host->guest transport;
- listener sockets are not bound to any transports since no transport
operations are done on it. In this way we can create a listener
socket, also if the transports are not loaded or with VMADDR_CID_ANY
to listen on all transports.
- DGRAM sockets are handled as before, since only the vmci_transport
provides this feature.

Signed-off-by: Stefano Garzarella
Signed-off-by: David S. Miller

Stefano Garzarella
2019-11-15 10:12:18 +0800
55f3e149b vsock: move vsock_insert_unbound() in the vsock_create() ... Browse Code »

vsock_insert_unbound() was called only when 'sock' parameter of
__vsock_create() was not null. This only happened when
__vsock_create() was called by vsock_create().

In order to simplify the multi-transports support, this patch
moves vsock_insert_unbound() at the end of vsock_create().

Reviewed-by: Dexuan Cui
Reviewed-by: Stefan Hajnoczi
Reviewed-by: Jorgen Hansen
Signed-off-by: Stefano Garzarella
Signed-off-by: David S. Miller

Stefano Garzarella
2019-11-15 10:12:18 +0800
b9ca2f5ff vsock: add vsock_create_connected() called by transports ... Browse Code »

All transports call __vsock_create() with the same parameters,
most of them depending on the parent socket. In order to simplify
the VSOCK core APIs exposed to the transports, this patch adds
the vsock_create_connected() callable from transports to create
a new socket when a connection request is received.
We also unexported the __vsock_create().

Suggested-by: Stefan Hajnoczi
Reviewed-by: Stefan Hajnoczi
Reviewed-by: Jorgen Hansen
Signed-off-by: Stefano Garzarella
Signed-off-by: David S. Miller

Stefano Garzarella
2019-11-15 10:12:18 +0800
b9f2b0ffd vsock: handle buffer_size sockopts in the core ... Browse Code »

virtio_transport and vmci_transport handle the buffer_size
sockopts in a very similar way.

In order to support multiple transports, this patch moves this
handling in the core to allow the user to change the options
also if the socket is not yet assigned to any transport.

This patch also adds the '.notify_buffer_size' callback in the
'struct virtio_transport' in order to inform the transport,
when the buffer_size is changed by the user. It is also useful
to limit the 'buffer_size' requested (e.g. virtio transports).

Acked-by: Dexuan Cui
Reviewed-by: Stefan Hajnoczi
Reviewed-by: Jorgen Hansen
Signed-off-by: Stefano Garzarella
Signed-off-by: David S. Miller

Stefano Garzarella
2019-11-15 10:12:18 +0800
daabfbca3 vsock: add 'struct vsock_sock *' param to vsock_core_get_transport() ... Browse Code »

Since now the 'struct vsock_sock' object contains a pointer to
the transport, this patch adds a parameter to the
vsock_core_get_transport() to return the right transport
assigned to the socket.

This patch modifies also the virtio_transport_get_ops(), that
uses the vsock_core_get_transport(), adding the
'struct vsock_sock *' parameter.

Reviewed-by: Stefan Hajnoczi
Reviewed-by: Jorgen Hansen
Signed-off-by: Stefano Garzarella
Signed-off-by: David S. Miller

Stefano Garzarella
2019-11-15 10:12:18 +0800
fe502c4a3 vsock: add 'transport' member in the struct vsock_sock ... Browse Code »

As a preparation to support multiple transports, this patch adds
the 'transport' member at the 'struct vsock_sock'.
This new field is initialized during the creation in the
__vsock_create() function.

This patch also renames the global 'transport' pointer to
'transport_single', since for now we're only supporting a single
transport registered at run-time.

Reviewed-by: Stefan Hajnoczi
Reviewed-by: Jorgen Hansen
Signed-off-by: Stefano Garzarella
Signed-off-by: David S. Miller

Stefano Garzarella
2019-11-15 10:12:18 +0800
db205c766 vsock: remove vm_sockets_get_local_cid() ... Browse Code »

vm_sockets_get_local_cid() is only used in virtio_transport_common.c.
We can replace it calling the virtio_transport_get_ops() and
using the get_local_cid() callback registered by the transport.

Reviewed-by: Stefan Hajnoczi
Reviewed-by: Jorgen Hansen
Signed-off-by: Stefano Garzarella
Signed-off-by: David S. Miller

Stefano Garzarella
2019-11-15 10:12:17 +0800

07 Nov, 2019

1 commit

7976a11b3 net: use helpers to change sk_ack_backlog ... Browse Code »

Writers are holding a lock, but many readers do not.

Following patch will add appropriate barriers in
sk_acceptq_removed() and sk_acceptq_added().

Signed-off-by: Eric Dumazet
Signed-off-by: David S. Miller

Eric Dumazet
2019-11-07 08:14:48 +0800

06 Nov, 2019

1 commit

3b7ad08b5 vsock: Simplify '__vsock_release()' ... Browse Code »

Use 'skb_queue_purge()' instead of re-implementing it.

Signed-off-by: Christophe JAILLET
Reviewed-by: Stefano Garzarella
Reviewed-by: Stefan Hajnoczi
Signed-off-by: David S. Miller

Christophe JAILLET
2019-11-06 09:55:47 +0800

29 Oct, 2019

1 commit

3ef7cf57c net: use skb_queue_empty_lockless() in poll() handlers ... Browse Code »

Many poll() handlers are lockless. Using skb_queue_empty_lockless()
instead of skb_queue_empty() is more appropriate.

Signed-off-by: Eric Dumazet
Signed-off-by: David S. Miller

Eric Dumazet
2019-10-29 04:33:41 +0800

02 Oct, 2019

1 commit

0d9138ffa vsock: Fix a lockdep warning in __vsock_release() ... Browse Code »

Lockdep is unhappy if two locks from the same class are held.

Fix the below warning for hyperv and virtio sockets (vmci socket code
doesn't have the issue) by using lock_sock_nested() when __vsock_release()
is called recursively:

============================================
WARNING: possible recursive locking detected
5.3.0+ #1 Not tainted
--------------------------------------------
server/1795 is trying to acquire lock:
ffff8880c5158990 (sk_lock-AF_VSOCK){+.+.}, at: hvs_release+0x10/0x120 [hv_sock]

but task is already holding lock:
ffff8880c5158150 (sk_lock-AF_VSOCK){+.+.}, at: __vsock_release+0x2e/0xf0 [vsock]

other info that might help us debug this:
Possible unsafe locking scenario:

CPU0
----
lock(sk_lock-AF_VSOCK);
lock(sk_lock-AF_VSOCK);

*** DEADLOCK ***

May be due to missing lock nesting notation

2 locks held by server/1795:
#0: ffff8880c5d05ff8 (&sb->s_type->i_mutex_key#10){+.+.}, at: __sock_release+0x2d/0xa0
#1: ffff8880c5158150 (sk_lock-AF_VSOCK){+.+.}, at: __vsock_release+0x2e/0xf0 [vsock]

stack backtrace:
CPU: 5 PID: 1795 Comm: server Not tainted 5.3.0+ #1
Call Trace:
dump_stack+0x67/0x90
__lock_acquire.cold.67+0xd2/0x20b
lock_acquire+0xb5/0x1c0
lock_sock_nested+0x6d/0x90
hvs_release+0x10/0x120 [hv_sock]
__vsock_release+0x24/0xf0 [vsock]
__vsock_release+0xa0/0xf0 [vsock]
vsock_release+0x12/0x30 [vsock]
__sock_release+0x37/0xa0
sock_close+0x14/0x20
__fput+0xc1/0x250
task_work_run+0x98/0xc0
do_exit+0x344/0xc60
do_group_exit+0x47/0xb0
get_signal+0x15c/0xc50
do_signal+0x30/0x720
exit_to_usermode_loop+0x50/0xa0
do_syscall_64+0x24e/0x270
entry_SYSCALL_64_after_hwframe+0x49/0xbe
RIP: 0033:0x7f4184e85f31

Tested-by: Stefano Garzarella
Signed-off-by: Dexuan Cui
Reviewed-by: Stefano Garzarella
Signed-off-by: David S. Miller

Dexuan Cui
2019-10-02 09:23:35 +0800

18 Jun, 2019

1 commit

13091aa30 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net ... Browse Code »

Honestly all the conflicts were simple overlapping changes,
nothing really interesting to report.

Signed-off-by: David S. Miller

David S. Miller
2019-06-18 11:20:36 +0800

15 Jun, 2019

1 commit

d5afa82c9 vsock: correct removal of socket from the list ... Browse Code »

The current vsock code for removal of socket from the list is both
subject to race and inefficient. It takes the lock, checks whether
the socket is in the list, drops the lock and if the socket was on the
list, deletes it from the list. This is subject to race because as soon
as the lock is dropped once it is checked for presence, that condition
cannot be relied upon for any decision. It is also inefficient because
if the socket is present in the list, it takes the lock twice.

Signed-off-by: Sunil Muthuswamy
Signed-off-by: David S. Miller

Sunil Muthuswamy
2019-06-15 10:20:20 +0800

05 Jun, 2019

1 commit

685a6bf84 treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 321 ... Browse Code »

Based on 1 normalized pattern(s):

this program is free software you can redistribute it and or modify
it under the terms of the gnu general public license as published by
the free software foundation version 2 and no later version this
program is distributed in the hope that it will be useful but
without any warranty without even the implied warranty of
merchantability or fitness for a particular purpose see the gnu
general public license for more details

extracted by the scancode license scanner the SPDX license identifier

GPL-2.0-only

has been chosen to replace the boilerplate/reference in 33 file(s).

Signed-off-by: Thomas Gleixner
Reviewed-by: Kate Stewart
Reviewed-by: Alexios Zavras
Reviewed-by: Allison Randal
Cc: linux-spdx@vger.kernel.org
Link: https://lkml.kernel.org/r/20190530000435.345978407@linutronix.de
Signed-off-by: Greg Kroah-Hartman

Thomas Gleixner
2019-06-05 23:37:05 +0800

04 Feb, 2019

1 commit

fe0c72f3d socket: move compat timeout handling into sock.c ... Browse Code »

This is a cleanup to prepare for the addition of 64-bit time_t
in O_SNDTIMEO/O_RCVTIMEO. The existing compat handler seems
unnecessarily complex and error-prone, moving it all into the
main setsockopt()/getsockopt() implementation requires half
as much code and is easier to extend.

32-bit user space can now use old_timeval32 on both 32-bit
and 64-bit machines, while 64-bit code can use
__old_kernel_timeval.

Signed-off-by: Arnd Bergmann
Signed-off-by: Deepa Dinamani
Acked-by: Willem de Bruijn
Signed-off-by: David S. Miller

Arnd Bergmann
2019-02-04 03:17:30 +0800

16 Jan, 2019

1 commit

a22d32514 Fix ERROR:do not initialise statics to 0 in af_vsock.c ... Browse Code »

Found by scripts/checkpatch.pl
Reviewed-by: Stefan Hajnoczi

Signed-off-by: David S. Miller

Lepton Wu
2019-01-16 12:38:29 +0800

15 Dec, 2018

1 commit

8236b08cf VSOCK: bind to random port for VMADDR_PORT_ANY ... Browse Code »

The old code always starts from fixed port for VMADDR_PORT_ANY. Sometimes
when VMM crashed, there is still orphaned vsock which is waiting for
close timer, then it could cause connection time out for new started VM
if they are trying to connect to same port with same guest cid since the
new packets could hit that orphaned vsock. We could also fix this by doing
more in vhost_vsock_reset_orphans, but any way, it should be better to start
from a random local port instead of a fixed one.

Signed-off-by: Lepton Wu
Reviewed-by: Jorgen Hansen
Signed-off-by: David S. Miller

Lepton Wu
2018-12-15 06:40:19 +0800

08 Aug, 2018

1 commit

455f05ecd vsock: split dwork to avoid reinitializations ... Browse Code »

syzbot reported that we reinitialize an active delayed
work in vsock_stream_connect():

ODEBUG: init active (active state 0) object type: timer_list hint:
delayed_work_timer_fn+0x0/0x90 kernel/workqueue.c:1414
WARNING: CPU: 1 PID: 11518 at lib/debugobjects.c:329
debug_print_object+0x16a/0x210 lib/debugobjects.c:326

The pattern is apparently wrong, we should only initialize
the dealyed work once and could repeatly schedule it. So we
have to move out the initializations to allocation side.
And to avoid confusion, we can split the shared dwork
into two, instead of re-using the same one.

Fixes: d021c344051a ("VSOCK: Introduce VM Sockets")
Reported-by:
Cc: Andy king
Cc: Stefan Hajnoczi
Cc: Jorgen Hansen
Signed-off-by: Cong Wang
Signed-off-by: David S. Miller

Cong Wang
2018-08-08 03:39:13 +0800

29 Jun, 2018

1 commit

a11e1d432 Revert changes to convert to ->poll_mask() and aio IOCB_CMD_POLL ... Browse Code »

The poll() changes were not well thought out, and completely
unexplained. They also caused a huge performance regression, because
"->poll()" was no longer a trivial file operation that just called down
to the underlying file operations, but instead did at least two indirect
calls.

Indirect calls are sadly slow now with the Spectre mitigation, but the
performance problem could at least be largely mitigated by changing the
"->get_poll_head()" operation to just have a per-file-descriptor pointer
to the poll head instead. That gets rid of one of the new indirections.

But that doesn't fix the new complexity that is completely unwarranted
for the regular case. The (undocumented) reason for the poll() changes
was some alleged AIO poll race fixing, but we don't make the common case
slower and more complex for some uncommon special case, so this all
really needs way more explanations and most likely a fundamental
redesign.

[ This revert is a revert of about 30 different commits, not reverted
individually because that would just be unnecessarily messy - Linus ]

Cc: Al Viro
Cc: Christoph Hellwig
Signed-off-by: Linus Torvalds

Linus Torvalds
2018-06-29 01:40:47 +0800

26 May, 2018

1 commit

31f50b557 net/vmw_vsock: convert to ->poll_mask ... Browse Code »

Signed-off-by: Christoph Hellwig

Christoph Hellwig
2018-05-26 15:16:44 +0800

17 Apr, 2018

1 commit

05e489b15 VSOCK: make af_vsock.ko removable again ... Browse Code »

Commit c1eef220c1760762753b602c382127bfccee226d ("vsock: always call
vsock_init_tables()") introduced a module_init() function without a
corresponding module_exit() function.

Modules with an init function can only be removed if they also have an
exit function. Therefore the vsock module was considered "permanent"
and could not be removed.

This patch adds an empty module_exit() function so that "rmmod vsock"
works. No explicit cleanup is required because:

1. Transports call vsock_core_exit() upon exit and cannot be removed
while sockets are still alive.
2. vsock_diag.ko does not perform any action that requires cleanup by
vsock.ko.

Fixes: c1eef220c176 ("vsock: always call vsock_init_tables()")
Reported-by: Xiumei Mu
Cc: Cong Wang
Cc: Jorgen Hansen
Signed-off-by: Stefan Hajnoczi
Reviewed-by: Jorgen Hansen
Signed-off-by: David S. Miller

Stefan Hajnoczi
2018-04-17 21:44:30 +0800

13 Feb, 2018

1 commit

9b2c45d47 net: make getname() functions return length rather than use int* parameter ... Browse Code »

Changes since v1:
Added changes in these files:
drivers/infiniband/hw/usnic/usnic_transport.c
drivers/staging/lustre/lnet/lnet/lib-socket.c
drivers/target/iscsi/iscsi_target_login.c
drivers/vhost/net.c
fs/dlm/lowcomms.c
fs/ocfs2/cluster/tcp.c
security/tomoyo/network.c

Before:
All these functions either return a negative error indicator,
or store length of sockaddr into "int *socklen" parameter
and return zero on success.

"int *socklen" parameter is awkward. For example, if caller does not
care, it still needs to provide on-stack storage for the value
it does not need.

None of the many FOO_getname() functions of various protocols
ever used old value of *socklen. They always just overwrite it.

This change drops this parameter, and makes all these functions, on success,
return length of sockaddr. It's always >= 0 and can be differentiated
from an error.

Tests in callers are changed from "if (err)" to "if (err < 0)", where needed.

rpc_sockname() lost "int buflen" parameter, since its only use was
to be passed to kernel_getsockname() as &buflen and subsequently
not used in any way.

Userspace API is not changed.

text data bss dec hex filename
30108430 2633624 873672 33615726 200ef6e vmlinux.before.o
30108109 2633612 873672 33615393 200ee21 vmlinux.o

Signed-off-by: Denys Vlasenko
CC: David S. Miller
CC: linux-kernel@vger.kernel.org
CC: netdev@vger.kernel.org
CC: linux-bluetooth@vger.kernel.org
CC: linux-decnet-user@lists.sourceforge.net
CC: linux-wireless@vger.kernel.org
CC: linux-rdma@vger.kernel.org
CC: linux-sctp@vger.kernel.org
CC: linux-nfs@vger.kernel.org
CC: linux-x25@vger.kernel.org
Signed-off-by: David S. Miller

Denys Vlasenko
2018-02-13 03:15:04 +0800

12 Feb, 2018

1 commit

a9a08845e vfs: do bulk POLL* -> EPOLL* replacement ... Browse Code »

This is the mindless scripted replacement of kernel use of POLL*
variables as described by Al, done by this script:

for V in IN OUT PRI ERR RDNORM RDBAND WRNORM WRBAND HUP RDHUP NVAL MSG; do
L=`git grep -l -w POLL$V | grep -v '^t' | grep -v /um/ | grep -v '^sa' | grep -v '/poll.h$'|grep -v '^D'`
for f in $L; do sed -i "-es/^$[^\"]*$$\$/\\1E\\2/" $f; done
done

with de-mangling cleanups yet to come.

NOTE! On almost all architectures, the EPOLL* constants have the same
values as the POLL* constants do. But they keyword here is "almost".
For various bad reasons they aren't the same, and epoll() doesn't
actually work quite correctly in some cases due to this on Sparc et al.

The next patch from Al will sort out the final differences, and we
should be all done.

Scripted-by: Al Viro
Signed-off-by: Linus Torvalds

Linus Torvalds
2018-02-12 06:34:03 +0800

31 Jan, 2018

1 commit

168fe32a0 Merge branch 'misc.poll' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs ... Browse Code »

Pull poll annotations from Al Viro:
"This introduces a __bitwise type for POLL### bitmap, and propagates
the annotations through the tree. Most of that stuff is as simple as
'make ->poll() instances return __poll_t and do the same to local
variables used to hold the future return value'.

Some of the obvious brainos found in process are fixed (e.g. POLLIN
misspelled as POLL_IN). At that point the amount of sparse warnings is
low and most of them are for genuine bugs - e.g. ->poll() instance
deciding to return -EINVAL instead of a bitmap. I hadn't touched those
in this series - it's large enough as it is.

Another problem it has caught was eventpoll() ABI mess; select.c and
eventpoll.c assumed that corresponding POLL### and EPOLL### were
equal. That's true for some, but not all of them - EPOLL### are
arch-independent, but POLL### are not.

The last commit in this series separates userland POLL### values from
the (now arch-independent) kernel-side ones, converting between them
in the few places where they are copied to/from userland. AFAICS, this
is the least disruptive fix preserving poll(2) ABI and making epoll()
work on all architectures.

As it is, it's simply broken on sparc - try to give it EPOLLWRNORM and
it will trigger only on what would've triggered EPOLLWRBAND on other
architectures. EPOLLWRBAND and EPOLLRDHUP, OTOH, are never triggered
at all on sparc. With this patch they should work consistently on all
architectures"

* 'misc.poll' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (37 commits)
make kernel-side POLL... arch-independent
eventpoll: no need to mask the result of epi_item_poll() again
eventpoll: constify struct epoll_event pointers
debugging printk in sg_poll() uses %x to print POLL... bitmap
annotate poll(2) guts
9p: untangle ->poll() mess
->si_band gets POLL... bitmap stored into a user-visible long field
ring_buffer_poll_wait() return value used as return value of ->poll()
the rest of drivers/*: annotate ->poll() instances
media: annotate ->poll() instances
fs: annotate ->poll() instances
ipc, kernel, mm: annotate ->poll() instances
net: annotate ->poll() instances
apparmor: annotate ->poll() instances
tomoyo: annotate ->poll() instances
sound: annotate ->poll() instances
acpi: annotate ->poll() instances
crypto: annotate ->poll() instances
block: annotate ->poll() instances
x86: annotate ->poll() instances
...

Linus Torvalds
2018-01-31 09:58:07 +0800

27 Jan, 2018

1 commit

ba3169fc7 VSOCK: set POLLOUT | POLLWRNORM for TCP_CLOSING ... Browse Code »

select(2) with wfds but no rfds must return when the socket is shut down
by the peer. This way userspace notices socket activity and gets -EPIPE
from the next write(2).

Currently select(2) does not return for virtio-vsock when a SEND+RCV
shutdown packet is received. This is because vsock_poll() only sets
POLLOUT | POLLWRNORM for TCP_CLOSE, not the TCP_CLOSING state that the
socket is in when the shutdown is received.

Signed-off-by: Stefan Hajnoczi
Signed-off-by: David S. Miller

Stefan Hajnoczi
2018-01-27 00:16:27 +0800

28 Nov, 2017

1 commit

ade994f4f net: annotate ->poll() instances ... Browse Code »

Signed-off-by: Al Viro

Al Viro
2017-11-28 05:20:04 +0800

26 Oct, 2017

1 commit

c1eef220c vsock: always call vsock_init_tables() ... Browse Code »

Although CONFIG_VSOCKETS_DIAG depends on CONFIG_VSOCKETS,
vsock_init_tables() is not always called, it is called only
if other modules call its caller. Therefore if we only
enable CONFIG_VSOCKETS_DIAG, it would crash kernel on uninitialized
vsock_bind_table.

This patch fixes it by moving vsock_init_tables() to its own
module_init().

Fixes: 413a4317aca7 ("VSOCK: add sock_diag interface")
Reported-by: syzkaller bot
Cc: Stefan Hajnoczi
Cc: Jorgen Hansen
Signed-off-by: Cong Wang
Reviewed-by: Stefan Hajnoczi
Signed-off-by: David S. Miller

Cong Wang
2017-10-26 16:45:58 +0800

06 Oct, 2017

1 commit

3b4477d2d VSOCK: use TCP state constants for sk_state ... Browse Code »

There are two state fields: socket->state and sock->sk_state. The
socket->state field uses SS_UNCONNECTED, SS_CONNECTED, etc while the
sock->sk_state typically uses values that match TCP state constants
(TCP_CLOSE, TCP_ESTABLISHED). AF_VSOCK does not follow this convention
and instead uses SS_* constants for both fields.

The sk_state field will be exposed to userspace through the vsock_diag
interface for ss(8), netstat(8), and other programs.

This patch switches sk_state to TCP state constants so that the meaning
of this field is consistent with other address families. Not just
AF_INET and AF_INET6 use the TCP constants, AF_UNIX and others do too.

The following mapping was used to convert the code:

SS_FREE -> TCP_CLOSE
SS_UNCONNECTED -> TCP_CLOSE
SS_CONNECTING -> TCP_SYN_SENT
SS_CONNECTED -> TCP_ESTABLISHED
SS_DISCONNECTING -> TCP_CLOSING
VSOCK_SS_LISTEN -> TCP_LISTEN

In __vsock_create() the sk_state initialization was dropped because
sock_init_data() already initializes sk_state to TCP_CLOSE.

Signed-off-by: Stefan Hajnoczi
Signed-off-by: David S. Miller

Stefan Hajnoczi
2017-10-06 09:44:17 +0800