10 Dec, 2015
1 commit
-
The backport of 1f770c0a09da855a2b51af6d19de97fb955eca85 ("netlink:
Fix autobind race condition that leads to zero port ID") missed a
goto statement, which causes netlink to break subtly.This was discovered by Stefan Priebe .
Fixes: 4e2776241766 ("netlink: Fix autobind race condition that...")
Reported-by: Stefan Priebe
Reported-by: Philipp Hahn
Signed-off-by: Herbert Xu
Acked-by: David S. Miller
Signed-off-by: Greg Kroah-Hartman
27 Oct, 2015
1 commit
-
[ Upstream commit db65a3aaf29ecce2e34271d52e8d2336b97bd9fe ]
netlink_dump() allocates skb based on the calculated min_dump_alloc or
a per socket max_recvmsg_len.
min_alloc_size is maximum space required for any single netdev
attributes as calculated by rtnl_calcit().
max_recvmsg_len tracks the user provided buffer to netlink_recvmsg.
It is capped at 16KiB.
The intention is to avoid small allocations and to minimize the number
of calls required to obtain dump information for all net devices.netlink_dump packs as many small messages as could fit within an skb
that was sized for the largest single netdev information. The actual
space available within an skb is larger than what is requested. It could
be much larger and up to near 2x with align to next power of 2 approach.Allowing netlink_dump to use all the space available within the
allocated skb increases the buffer size a user has to provide to avoid
truncaion (i.e. MSG_TRUNG flag set).It was observed that with many VLANs configured on at least one netdev,
a larger buffer of near 64KiB was necessary to avoid "Message truncated"
error in "ip link" or "bridge [-c[ompressvlans]] vlan show" when
min_alloc_size was only little over 32KiB.This patch trims skb to allocated size in order to allow the user to
avoid truncation with more reasonable buffer size.Signed-off-by: Ronen Arad
Signed-off-by: David S. Miller
Signed-off-by: Greg Kroah-Hartman
03 Oct, 2015
3 commits
-
[ Upstream commit da314c9923fed553a007785a901fd395b7eb6c19 ]
On Mon, Sep 21, 2015 at 02:20:22PM -0400, Tejun Heo wrote:
>
> store_release and load_acquire are different from the usual memory
> barriers and can't be paired this way. You have to pair store_release
> and load_acquire. Besides, it isn't a particularly good idea toOK I've decided to drop the acquire/release helpers as they don't
help us at all and simply pessimises the code by using full memory
barriers (on some architectures) where only a write or read barrier
is needed.> depend on memory barriers embedded in other data structures like the
> above. Here, especially, rhashtable_insert() would have write barrier
> *before* the entry is hashed not necessarily *after*, which means that
> in the above case, a socket which appears to have set bound to a
> reader might not visible when the reader tries to look up the socket
> on the hashtable.But you are right we do need an explicit write barrier here to
ensure that the hashing is visible.> There's no reason to be overly smart here. This isn't a crazy hot
> path, write barriers tend to be very cheap, store_release more so.
> Please just do smp_store_release() and note what it's paired with.It's not about being overly smart. It's about actually understanding
what's going on with the code. I've seen too many instances of
people simply sprinkling synchronisation primitives around without
any knowledge of what is happening underneath, which is just a recipe
for creating hard-to-debug races.> > @@ -1539,7 +1546,7 @@ static int netlink_bind(struct socket *sock, struct sockaddr *addr,
> > }
> > }
> >
> > - if (!nlk->portid) {
> > + if (!nlk->bound) {
>
> I don't think you can skip load_acquire here just because this is the
> second deref of the variable. That doesn't change anything. Race
> condition could still happen between the first and second tests and
> skipping the second would lead to the same kind of bug.The reason this one is OK is because we do not use nlk->portid or
try to get nlk from the hash table before we return to user-space.However, there is a real bug here that none of these acquire/release
helpers discovered. The two bound tests here used to be a single
one. Now that they are separate it is entirely possible for another
thread to come in the middle and bind the socket. So we need to
repeat the portid check in order to maintain consistency.> > @@ -1587,7 +1594,7 @@ static int netlink_connect(struct socket *sock, struct sockaddr *addr,
> > !netlink_allowed(sock, NL_CFG_F_NONROOT_SEND))
> > return -EPERM;
> >
> > - if (!nlk->portid)
> > + if (!nlk->bound)
>
> Don't we need load_acquire here too? Is this path holding a lock
> which makes that unnecessary?Ditto.
---8bound once in netlink_bind fixes
a race where two threads that bind the socket at the same time
with different port IDs may both succeed.Fixes: 1f770c0a09da ("netlink: Fix autobind race condition that leads to zero port ID")
Reported-by: Tejun Heo
Reported-by: Linus Torvalds
Signed-off-by: Herbert Xu
Nacked-by: Tejun Heo
Signed-off-by: David S. Miller
Signed-off-by: Greg Kroah-Hartman -
[ Upstream commit 1f770c0a09da855a2b51af6d19de97fb955eca85 ]
The commit c0bb07df7d981e4091432754e30c9c720e2c0c78 ("netlink:
Reset portid after netlink_insert failure") introduced a race
condition where if two threads try to autobind the same socket
one of them may end up with a zero port ID. This led to kernel
deadlocks that were observed by multiple people.This patch reverts that commit and instead fixes it by introducing
a separte rhash_portid variable so that the real portid is only set
after the socket has been successfully hashed.Fixes: c0bb07df7d98 ("netlink: Reset portid after netlink_insert failure")
Reported-by: Tejun Heo
Reported-by: Linus Torvalds
Signed-off-by: Herbert Xu
Signed-off-by: David S. Miller
Signed-off-by: Greg Kroah-Hartman -
[ Upstream commit 1853c949646005b5959c483becde86608f548f24 ]
Ken-ichirou reported that running netlink in mmap mode for receive in
combination with nlmon will throw a NULL pointer dereference in
__kfree_skb() on nlmon_xmit(), in my case I can also trigger an "unable
to handle kernel paging request". The problem is the skb_clone() in
__netlink_deliver_tap_skb() for skbs that are mmaped.I.e. the cloned skb doesn't have a destructor, whereas the mmap netlink
skb has it pointed to netlink_skb_destructor(), set in the handler
netlink_ring_setup_skb(). There, skb->head is being set to NULL, so
that in such cases, __kfree_skb() doesn't perform a skb_release_data()
via skb_release_all(), where skb->head is possibly being freed through
kfree(head) into slab allocator, although netlink mmap skb->head points
to the mmap buffer. Similarly, the same has to be done also for large
netlink skbs where the data area is vmalloced. Therefore, as discussed,
make a copy for these rather rare cases for now. This fixes the issue
on my and Ken-ichirou's test-cases.Reference: http://thread.gmane.org/gmane.linux.network/371129
Fixes: bcbde0d449ed ("net: netlink: virtual tap device management")
Reported-by: Ken-ichirou MATSUZAWA
Signed-off-by: Daniel Borkmann
Tested-by: Ken-ichirou MATSUZAWA
Signed-off-by: David S. Miller
Signed-off-by: Greg Kroah-Hartman
30 Sep, 2015
2 commits
-
[ Upstream commit 4e7c1330689e27556de407d3fdadc65ffff5eb12 ]
Linus reports the following deadlock on rtnl_mutex; triggered only
once so far (extract):[12236.694209] NetworkManager D 0000000000013b80 0 1047 1 0x00000000
[12236.694218] ffff88003f902640 0000000000000000 ffffffff815d15a9 0000000000000018
[12236.694224] ffff880119538000 ffff88003f902640 ffffffff81a8ff84 00000000ffffffff
[12236.694230] ffffffff81a8ff88 ffff880119c47f00 ffffffff815d133a ffffffff81a8ff80
[12236.694235] Call Trace:
[12236.694250] [] ? schedule_preempt_disabled+0x9/0x10
[12236.694257] [] ? schedule+0x2a/0x70
[12236.694263] [] ? schedule_preempt_disabled+0x9/0x10
[12236.694271] [] ? __mutex_lock_slowpath+0x7f/0xf0
[12236.694280] [] ? mutex_lock+0x16/0x30
[12236.694291] [] ? rtnetlink_rcv+0x10/0x30
[12236.694299] [] ? netlink_unicast+0xfb/0x180
[12236.694309] [] ? rtnl_getlink+0x113/0x190
[12236.694319] [] ? rtnetlink_rcv_msg+0x7a/0x210
[12236.694331] [] ? sock_has_perm+0x5c/0x70
[12236.694339] [] ? rtnetlink_rcv+0x30/0x30
[12236.694346] [] ? netlink_rcv_skb+0x9c/0xc0
[12236.694354] [] ? rtnetlink_rcv+0x1f/0x30
[12236.694360] [] ? netlink_unicast+0xfb/0x180
[12236.694367] [] ? netlink_sendmsg+0x484/0x5d0
[12236.694376] [] ? __wake_up+0x2f/0x50
[12236.694387] [] ? sock_sendmsg+0x33/0x40
[12236.694396] [] ? ___sys_sendmsg+0x22e/0x240
[12236.694405] [] ? ___sys_recvmsg+0x135/0x1a0
[12236.694415] [] ? eventfd_write+0x82/0x210
[12236.694423] [] ? fsnotify+0x32e/0x4c0
[12236.694429] [] ? wake_up_q+0x60/0x60
[12236.694434] [] ? __sys_sendmsg+0x39/0x70
[12236.694440] [] ? entry_SYSCALL_64_fastpath+0x12/0x6aIt seems so far plausible that the recursive call into rtnetlink_rcv()
looks suspicious. One way, where this could trigger is that the senders
NETLINK_CB(skb).portid was wrongly 0 (which is rtnetlink socket), so
the rtnl_getlink() request's answer would be sent to the kernel instead
to the actual user process, thus grabbing rtnl_mutex() twice.One theory would be that netlink_autobind() triggered via netlink_sendmsg()
internally overwrites the -EBUSY error to 0, but where it is wrongly
originating from __netlink_insert() instead. That would reset the
socket's portid to 0, which is then filled into NETLINK_CB(skb).portid
later on. As commit d470e3b483dc ("[NETLINK]: Fix two socket hashing bugs.")
also puts it, -EBUSY should not be propagated from netlink_insert().It looks like it's very unlikely to reproduce. We need to trigger the
rhashtable_insert_rehash() handler under a situation where rehashing
currently occurs (one /rare/ way would be to hit ht->elasticity limits
while not filled enough to expand the hashtable, but that would rather
require a specifically crafted bind() sequence with knowledge about
destination slots, seems unlikely). It probably makes sense to guard
__netlink_insert() in any case and remap that error. It was suggested
that EOVERFLOW might be better than an already overloaded ENOMEM.Reference: http://thread.gmane.org/gmane.linux.network/372676
Reported-by: Linus Torvalds
Signed-off-by: Daniel Borkmann
Acked-by: Herbert Xu
Acked-by: Thomas Graf
Signed-off-by: David S. Miller
Signed-off-by: Greg Kroah-Hartman -
[ Upstream commit 0470eb99b4721586ccac954faac3fa4472da0845 ]
Kirill A. Shutemov says:
This simple test-case trigers few locking asserts in kernel:
int main(int argc, char **argv)
{
unsigned int block_size = 16 * 4096;
struct nl_mmap_req req = {
.nm_block_size = block_size,
.nm_block_nr = 64,
.nm_frame_size = 16384,
.nm_frame_nr = 64 * block_size / 16384,
};
unsigned int ring_size;
int fd;fd = socket(AF_NETLINK, SOCK_RAW, NETLINK_GENERIC);
if (setsockopt(fd, SOL_NETLINK, NETLINK_RX_RING, &req, sizeof(req)) < 0)
exit(1);
if (setsockopt(fd, SOL_NETLINK, NETLINK_TX_RING, &req, sizeof(req)) < 0)
exit(1);ring_size = req.nm_block_nr * req.nm_block_size;
mmap(NULL, 2 * ring_size, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0);
return 0;
}+++ exited with 0 +++
BUG: sleeping function called from invalid context at /home/kas/git/public/linux-mm/kernel/locking/mutex.c:616
in_atomic(): 1, irqs_disabled(): 0, pid: 1, name: init
3 locks held by init/1:
#0: (reboot_mutex){+.+...}, at: [] SyS_reboot+0xa9/0x220
#1: ((reboot_notifier_list).rwsem){.+.+..}, at: [] __blocking_notifier_call_chain+0x39/0x70
#2: (rcu_callback){......}, at: [] rcu_do_batch.isra.49+0x160/0x10c0
Preemption disabled at:[] __delay+0xf/0x20CPU: 1 PID: 1 Comm: init Not tainted 4.1.0-00009-gbddf4c4818e0 #253
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS Debian-1.8.2-1 04/01/2014
ffff88017b3d8000 ffff88027bc03c38 ffffffff81929ceb 0000000000000102
0000000000000000 ffff88027bc03c68 ffffffff81085a9d 0000000000000002
ffffffff81ca2a20 0000000000000268 0000000000000000 ffff88027bc03c98
Call Trace:
[] dump_stack+0x4f/0x7b
[] ___might_sleep+0x16d/0x270
[] __might_sleep+0x4d/0x90
[] mutex_lock_nested+0x2f/0x430
[] ? _raw_spin_unlock_irqrestore+0x5d/0x80
[] ? __this_cpu_preempt_check+0x13/0x20
[] netlink_set_ring+0x1ed/0x350
[] ? netlink_undo_bind+0x70/0x70
[] netlink_sock_destruct+0x80/0x150
[] __sk_free+0x1d/0x160
[] sk_free+0x19/0x20
[..]Cong Wang says:
We can't hold mutex lock in a rcu callback, [..]
Thomas Graf says:
The socket should be dead at this point. It might be simpler to
add a netlink_release_ring() function which doesn't require
locking at all.Reported-by: "Kirill A. Shutemov"
Diagnosed-by: Cong Wang
Suggested-by: Thomas Graf
Signed-off-by: Florian Westphal
Signed-off-by: David S. Miller
Signed-off-by: Greg Kroah-Hartman
17 May, 2015
1 commit
-
The commit c5adde9468b0714a051eac7f9666f23eb10b61f7 ("netlink:
eliminate nl_sk_hash_lock") breaks the autobind retry mechanism
because it doesn't reset portid after a failed netlink_insert.This means that should autobind fail the first time around, then
the socket will be stuck in limbo as it can never be bound again
since it already has a non-zero portid.Fixes: c5adde9468b0 ("netlink: eliminate nl_sk_hash_lock")
Signed-off-by: Herbert Xu
Signed-off-by: David S. Miller
15 May, 2015
1 commit
-
netlink sockets creation and deletion heavily modify nl_table_users
and nl_table_lock.If nl_table is sharing one cache line with one of them, netlink
performance is really bad on SMP.ffffffff81ff5f00 B nl_table
ffffffff81ff5f0c b nl_table_usersPutting nl_table in read_mostly section increased performance
of my open/delete netlink sockets test by about 80 %This came up while diagnosing a getaddrinfo() problem.
Signed-off-by: Eric Dumazet
Signed-off-by: David S. Miller
04 May, 2015
1 commit
-
We currently limit the hash table size to 64K which is very bad
as even 10 years ago it was relatively easy to generate millions
of sockets.Since the hash table is naturally limited by memory allocation
failure, we don't really need an explicit limit so this patch
removes it.Signed-off-by: Herbert Xu
Acked-by: Thomas Graf
Signed-off-by: David S. Miller
26 Apr, 2015
1 commit
-
When I added pfmemalloc support in build_skb(), I forgot netlink
was using build_skb() with a vmalloc() area.In this patch I introduce __build_skb() for netlink use,
and build_skb() is a wrapper handling both skb->head_frag and
skb->pfmemallocThis means netlink no longer has to hack skb->head_frag
[ 1567.700067] kernel BUG at arch/x86/mm/physaddr.c:26!
[ 1567.700067] invalid opcode: 0000 [#1] PREEMPT SMP KASAN
[ 1567.700067] Dumping ftrace buffer:
[ 1567.700067] (ftrace buffer empty)
[ 1567.700067] Modules linked in:
[ 1567.700067] CPU: 9 PID: 16186 Comm: trinity-c182 Not tainted 4.0.0-next-20150424-sasha-00037-g4796e21 #2167
[ 1567.700067] task: ffff880127efb000 ti: ffff880246770000 task.ti: ffff880246770000
[ 1567.700067] RIP: __phys_addr (arch/x86/mm/physaddr.c:26 (discriminator 3))
[ 1567.700067] RSP: 0018:ffff8802467779d8 EFLAGS: 00010202
[ 1567.700067] RAX: 000041000ed8e000 RBX: ffffc9008ed8e000 RCX: 000000000000002c
[ 1567.700067] RDX: 0000000000000004 RSI: 0000000000000000 RDI: ffffffffb3fd6049
[ 1567.700067] RBP: ffff8802467779f8 R08: 0000000000000019 R09: ffff8801d0168000
[ 1567.700067] R10: ffff8801d01680c7 R11: ffffed003a02d019 R12: ffffc9000ed8e000
[ 1567.700067] R13: 0000000000000f40 R14: 0000000000001180 R15: ffffc9000ed8e000
[ 1567.700067] FS: 00007f2a7da3f700(0000) GS:ffff8801d1000000(0000) knlGS:0000000000000000
[ 1567.700067] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1567.700067] CR2: 0000000000738308 CR3: 000000022e329000 CR4: 00000000000007e0
[ 1567.700067] Stack:
[ 1567.700067] ffffc9000ed8e000 ffff8801d0168000 ffffc9000ed8e000 ffff8801d0168000
[ 1567.700067] ffff880246777a28 ffffffffad7c0a21 0000000000001080 ffff880246777c08
[ 1567.700067] ffff88060d302e68 ffff880246777b58 ffff880246777b88 ffffffffad9a6821
[ 1567.700067] Call Trace:
[ 1567.700067] build_skb (include/linux/mm.h:508 net/core/skbuff.c:316)
[ 1567.700067] netlink_sendmsg (net/netlink/af_netlink.c:1633 net/netlink/af_netlink.c:2329)
[ 1567.774369] ? sched_clock_cpu (kernel/sched/clock.c:311)
[ 1567.774369] ? netlink_unicast (net/netlink/af_netlink.c:2273)
[ 1567.774369] ? netlink_unicast (net/netlink/af_netlink.c:2273)
[ 1567.774369] sock_sendmsg (net/socket.c:614 net/socket.c:623)
[ 1567.774369] sock_write_iter (net/socket.c:823)
[ 1567.774369] ? sock_sendmsg (net/socket.c:806)
[ 1567.774369] __vfs_write (fs/read_write.c:479 fs/read_write.c:491)
[ 1567.774369] ? get_lock_stats (kernel/locking/lockdep.c:249)
[ 1567.774369] ? default_llseek (fs/read_write.c:487)
[ 1567.774369] ? vtime_account_user (kernel/sched/cputime.c:701)
[ 1567.774369] ? rw_verify_area (fs/read_write.c:406 (discriminator 4))
[ 1567.774369] vfs_write (fs/read_write.c:539)
[ 1567.774369] SyS_write (fs/read_write.c:586 fs/read_write.c:577)
[ 1567.774369] ? SyS_read (fs/read_write.c:577)
[ 1567.774369] ? __this_cpu_preempt_check (lib/smp_processor_id.c:63)
[ 1567.774369] ? trace_hardirqs_on_caller (kernel/locking/lockdep.c:2594 kernel/locking/lockdep.c:2636)
[ 1567.774369] ? trace_hardirqs_on_thunk (arch/x86/lib/thunk_64.S:42)
[ 1567.774369] system_call_fastpath (arch/x86/kernel/entry_64.S:261)Fixes: 79930f5892e ("net: do not deplete pfmemalloc reserve")
Signed-off-by: Eric Dumazet
Reported-by: Sasha Levin
Signed-off-by: David S. Miller
26 Mar, 2015
1 commit
-
nftables sets will be converted to use so called setextensions, moving
the key to a non-fixed position. To hash it, the obj_hashfn must be used,
however it so far doesn't receive the length parameter.Pass the key length to obj_hashfn() and convert existing users.
Signed-off-by: Patrick McHardy
Signed-off-by: Pablo Neira Ayuso
25 Mar, 2015
1 commit
-
Introduce a new bool automatic_shrinking to require the
user to explicitly opt-in to automatic shrinking of tables.Signed-off-by: Thomas Graf
Signed-off-by: David S. Miller
24 Mar, 2015
1 commit
-
This patch removes the explicit jhash value for the hashfn parameter
of rhashtable. As the key length is a multiple of 4, this means that
we will actually end up using jhash2.Signed-off-by: Herbert Xu
Acked-by: Thomas Graf
Signed-off-by: David S. Miller
21 Mar, 2015
2 commits
-
Instead of computing the offset from trailer, this patch computes
netlink_compare_arg_len from the offset of portid and then adds 4
to it. This allows trailer to be removed.Reported-by: David Miller
Signed-off-by: Herbert Xu
Signed-off-by: David S. Miller -
Currently the name space is a de facto key because it has to match
before we find an object in the hash table. However, it isn't in
the hash value so all objects from different name spaces with the
same port ID hash to the same bucket.This is bad as the number of name spaces is unbounded.
This patch fixes this by using the namespace when doing the hash.
Because the namespace field doesn't lie next to the portid field
in the netlink socket, this patch switches over to the rhashtable
interface without a fixed key.This patch also uses the new inlined rhashtable interface where
possible.Signed-off-by: Herbert Xu
Signed-off-by: David S. Miller
19 Mar, 2015
1 commit
-
This patch converts netlink to use rhashtable max_size instead
of the obsolete max_shift.Signed-off-by: Herbert Xu
Signed-off-by: David S. Miller
04 Mar, 2015
1 commit
-
Conflicts:
drivers/net/ethernet/rocker/rocker.cThe rocker commit was two overlapping changes, one to rename
the ->vport member to ->pport, and another making the bitmask
expression use '1ULL' instead of plain '1'.Signed-off-by: David S. Miller
03 Mar, 2015
1 commit
-
After TIPC doesn't depend on iocb argument in its internal
implementations of sendmsg() and recvmsg() hooks defined in proto
structure, no any user is using iocb argument in them at all now.
Then we can drop the redundant iocb argument completely from kinds of
implementations of both sendmsg() and recvmsg() in the entire
networking stack.Cc: Christoph Hellwig
Suggested-by: Al Viro
Signed-off-by: Ying Xue
Signed-off-by: David S. Miller
28 Feb, 2015
1 commit
-
Currently, all real users of rhashtable default their grow and shrink
decision functions to rht_grow_above_75() and rht_shrink_below_30(),
so that there's currently no need to have this explicitly selectable.It can/should be generic and private inside rhashtable until a real
use case pops up. Since we can make this private, we'll save us this
additional indirection layer and can improve insertion/deletion time
as well.Reference: http://patchwork.ozlabs.org/patch/443040/
Suggested-by: David S. Miller
Signed-off-by: Daniel Borkmann
Acked-by: Thomas Graf
Signed-off-by: David S. Miller
06 Feb, 2015
1 commit
-
Conflicts:
drivers/net/vxlan.c
drivers/vhost/net.c
include/linux/if_vlan.h
net/core/dev.cThe net/core/dev.c conflict was the overlap of one commit marking an
existing function static whilst another was adding a new function.In the include/linux/if_vlan.h case, the type used for a local
variable was changed in 'net', whereas the function got rewritten
to fix a stacked vlan bug in 'net-next'.In drivers/vhost/net.c, Al Viro's iov_iter conversions in 'net-next'
overlapped with an endainness fix for VHOST 1.0 in 'net'.In drivers/net/vxlan.c, vxlan_find_vni() added a 'flags' parameter
in 'net-next' whereas in 'net' there was a bug fix to pass in the
correct network namespace pointer in calls to this function.Signed-off-by: David S. Miller
05 Feb, 2015
2 commits
-
More iov_iter work from Al Viro.
Signed-off-by: David S. Miller
-
This patch gets rid of the manual rhashtable walk in netlink
which touches rhashtable internals that should not be exposed.
It does so by using the rhashtable iterator primitives.In fact the existing code was very buggy. Some sockets weren't
shown at all while others were shown more than once.Signed-off-by: Herbert Xu
Signed-off-by: David S. Miller
04 Feb, 2015
1 commit
-
As it is, zero msg_iovlen means that the first iovec in the kernel
array of iovecs is left uninitialized, so checking if its ->iov_base
is NULL is random. Since the real users of that thing are doing
sendto(fd, NULL, 0, ...), they are getting msg_iovlen = 1 and
msg_iov[0] = {NULL, 0}, which is what this test is trying to catch.
As suggested by davem, let's just check that msg_iovlen was 1 and
msg_iov[0].iov_base was NULL - _that_ is well-defined and it catches
what we want to catch.Signed-off-by: Al Viro
31 Jan, 2015
1 commit
-
The subscription bitmask passed via struct sockaddr_nl is converted to
the group number when calling the netlink_bind() and netlink_unbind()
callbacks.The conversion is however incorrect since bitmask (1 << 0) needs to be
mapped to group number 1. Note that you cannot specify the group number 0
(usually known as _NONE) from setsockopt() using NETLINK_ADD_MEMBERSHIP
since this is rejected through -EINVAL.This problem became noticeable since 97840cb ("netfilter: nfnetlink:
fix insufficient validation in nfnetlink_bind") when binding to bitmask
(1 << 0) in ctnetlink.Reported-by: Andre Tomt
Reported-by: Ivan Delalande
Signed-off-by: Pablo Neira Ayuso
Signed-off-by: David S. Miller
29 Jan, 2015
1 commit
-
The sock_iocb structure is allocate on stack for each read/write-like
operation on sockets, and contains various fields of which only the
embedded msghdr and sometimes a pointer to the scm_cookie is ever used.
Get rid of the sock_iocb and put a msghdr directly on the stack and pass
the scm_cookie explicitly to netlink_mmap_sendmsg.Signed-off-by: Christoph Hellwig
Signed-off-by: David S. Miller
28 Jan, 2015
1 commit
-
Conflicts:
arch/arm/boot/dts/imx6sx-sdb.dts
net/sched/cls_bpf.cTwo simple sets of overlapping changes.
Signed-off-by: David S. Miller
27 Jan, 2015
1 commit
-
The socket already carries the net namespace with it so there is
no need to be passing another net around.Signed-off-by: Herbert Xu
Signed-off-by: David S. Miller
18 Jan, 2015
1 commit
-
Contrary to common expectations for an "int" return, these functions
return only a positive value -- if used correctly they cannot even
return 0 because the message header will necessarily be in the skb.This makes the very common pattern of
if (genlmsg_end(...) < 0) { ... }
be a whole bunch of dead code. Many places also simply do
return nlmsg_end(...);
and the caller is expected to deal with it.
This also commonly (at least for me) causes errors, because it is very
common to writeif (my_function(...))
/* error condition */and if my_function() does "return nlmsg_end()" this is of course wrong.
Additionally, there's not a single place in the kernel that actually
needs the message length returned, and if anyone needs it later then
it'll be very easy to just use skb->len there.Remove this, and make the functions void. This removes a bunch of dead
code as described above. The patch adds lines because I did- return nlmsg_end(...);
+ nlmsg_end(...);
+ return 0;I could have preserved all the function's return values by returning
skb->len, but instead I've audited all the places calling the affected
functions and found that none cared. A few places actually compared
the return value with < 0 with no change in behaviour, so I opted for the more
efficient version.One instance of the error I've made numerous times now is also present
in net/phonet/pn_netlink.c in the route_dumpit() function - it didn't
check for
Signed-off-by: David S. Miller
17 Jan, 2015
2 commits
-
In addition to the problem Jeff Layton reported, I looked at the code
and reproduced the same warning by subscribing and removing the genl
family with a socket still open. This is a fairly tricky race which
originates in the fact that generic netlink allows the family to go
away while sockets are still open - unlike regular netlink which has
a module refcount for every open socket so in general this cannot be
triggered.Trying to resolve this issue by the obvious locking isn't possible as
it will result in deadlocks between unregistration and group unbind
notification (which incidentally lockdep doesn't find due to the home
grown locking in the netlink table.)To really resolve this, introduce a "closing socket" reference counter
(for generic netlink only, as it's the only affected family) in the
core netlink code and use that in generic netlink to wait for all the
sockets that are being closed at the same time as a generic netlink
family is removed.This fixes the race that when a socket is closed, it will should call
the unbind, but if the family is removed at the same time the unbind
will not find it, leading to the warning. The real problem though is
that in this case the unbind could actually find a new family that is
registered to have a multicast group with the same ID, and call its
mcast_unbind() leading to confusing.Also remove the warning since it would still trigger, but is now no
longer a problem.This also moves the code in af_netlink.c to before unreferencing the
module to avoid having the same problem in the normal non-genl case.Signed-off-by: Johannes Berg
Signed-off-by: David S. Miller -
Jeff Layton reported that he could trigger the multicast unbind warning
in generic netlink using trinity. I originally thought it was a race
condition between unregistering the generic netlink family and closing
the socket, but there's a far simpler explanation: genetlink currently
allows subscribing to groups that don't (yet) exist, and the warning is
triggered when unsubscribing again while the group still doesn't exist.Originally, I had a warning in the subscribe case and accepted it out of
userspace API concerns, but the warning was of course wrong and removed
later.However, I now think that allowing userspace to subscribe to groups that
don't exist is wrong and could possibly become a security problem:
Consider a (new) genetlink family implementing a permission check in
the mcast_bind() function similar to the like the audit code does today;
it would be possible to bypass the permission check by guessing the ID
and subscribing to the group it exists. This is only possible in case a
family like that would be dynamically loaded, but it doesn't seem like a
huge stretch, for example wireless may be loaded when you plug in a USB
device.To avoid this reject such subscription attempts.
If this ends up causing userspace issues we may need to add a workaround
in af_netlink to deny such requests but not return an error.Reported-by: Jeff Layton
Signed-off-by: Johannes Berg
Signed-off-by: David S. Miller
16 Jan, 2015
1 commit
-
The patch c5adde9468b0714a051eac7f9666f23eb10b61f7 ("netlink:
eliminate nl_sk_hash_lock") introduced a bug where the EADDRINUSE
error has been replaced by ENOMEM. This patch rectifies that
problem.Signed-off-by: Herbert Xu
Acked-by: Ying Xue
Signed-off-by: David S. Miller
14 Jan, 2015
1 commit
-
As rhashtable_lookup_compare_insert() can guarantee the process
of search and insertion is atomic, it's safe to eliminate the
nl_sk_hash_lock. After this, object insertion or removal will
be protected with per bucket lock on write side while object
lookup is guarded with rcu read lock on read side.Signed-off-by: Ying Xue
Cc: Thomas Graf
Acked-by: Thomas Graf
Signed-off-by: David S. Miller
04 Jan, 2015
4 commits
-
Defers the release of the socket reference using call_rcu() to
allow using an RCU read-side protected call to rhashtable_lookup()This restores behaviour and performance gains as previously
introduced by e341694 ("netlink: Convert netlink_lookup() to use
RCU protected hash table") without the side effect of severely
delayed socket destruction.Signed-off-by: Thomas Graf
Signed-off-by: David S. Miller -
Introduces an array of spinlocks to protect bucket mutations. The number
of spinlocks per CPU is configurable and selected based on the hash of
the bucket. This allows for parallel insertions and removals of entries
which do not share a lock.The patch also defers expansion and shrinking to a worker queue which
allows insertion and removal from atomic context. Insertions and
deletions may occur in parallel to it and are only held up briefly
while the particular bucket is linked or unzipped.Mutations of the bucket table pointer is protected by a new mutex, read
access is RCU protected.In the event of an expansion or shrinking, the new bucket table allocated
is exposed as a so called future table as soon as the resize process
starts. Lookups, deletions, and insertions will briefly use both tables.
The future table becomes the main table after an RCU grace period and
initial linking of the old to the new table was performed. Optimization
of the chains to make use of the new number of buckets follows only the
new table is in use.The side effect of this is that during that RCU grace period, a bucket
traversal using any rht_for_each() variant on the main table will not see
any insertions performed during the RCU grace period which would at that
point land in the future table. The lookup will see them as it searches
both tables if needed.Having multiple insertions and removals occur in parallel requires nelems
to become an atomic counter.Signed-off-by: Thomas Graf
Signed-off-by: David S. Miller -
This patch is in preparation to introduce per bucket spinlocks. It
extends all iterator macros to take the bucket table and bucket
index. It also introduces a new rht_dereference_bucket() to
handle protected accesses to buckets.It introduces a barrier() to the RCU iterators to the prevent
the compiler from caching the first element.The lockdep verifier is introduced as stub which always succeeds
and properly implement in the next patch when the locks are
introduced.Signed-off-by: Thomas Graf
Signed-off-by: David S. Miller -
Hash the key inside of rhashtable_lookup_compare() like
rhashtable_lookup() does. This allows to simplify the hashing
functions and keep them private.Signed-off-by: Thomas Graf
Cc: netfilter-devel@vger.kernel.org
Signed-off-by: David S. Miller
30 Dec, 2014
1 commit
-
Users can request to bind to arbitrary multicast groups, so warning
when the requested group number is out of range is not appropriate.And with the warning removed, and the 'err' variable properly given
an initial value, we can remove 'found' altogether.Reported-by: Sedat Dilek
Signed-off-by: David S. Miller
27 Dec, 2014
2 commits
-
Netlink families can exist in multiple namespaces, and for the most
part multicast subscriptions are per network namespace. Thus it only
makes sense to have bind/unbind notifications per network namespace.To achieve this, pass the network namespace of a given client socket
to the bind/unbind functions.Also do this in generic netlink, and there also make sure that any
bind for multicast groups that only exist in init_net is rejected.
This isn't really a problem if it is accepted since a client in a
different namespace will never receive any notifications from such
a group, but it can confuse the family if not rejected (it's also
possible to silently (without telling the family) accept it, but it
would also have to be ignored on unbind so families that take any
kind of action on bind/unbind won't do unnecessary work for invalid
clients like that.Signed-off-by: Johannes Berg
Signed-off-by: David S. Miller -
In order to make the newly fixed multicast bind/unbind
functionality in generic netlink, pass them down to the
appropriate family.Signed-off-by: Johannes Berg
Signed-off-by: David S. Miller