10 Dec, 2015

1 commit

  • The backport of 1f770c0a09da855a2b51af6d19de97fb955eca85 ("netlink:
    Fix autobind race condition that leads to zero port ID") missed a
    goto statement, which causes netlink to break subtly.

    This was discovered by Stefan Priebe.

    Fixes: 4e2776241766 ("netlink: Fix autobind race condition that...")
    Reported-by: Stefan Priebe
    Reported-by: Philipp Hahn
    Signed-off-by: Herbert Xu
    Acked-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Herbert Xu
     

27 Oct, 2015

1 commit

  • [ Upstream commit db65a3aaf29ecce2e34271d52e8d2336b97bd9fe ]

    netlink_dump() allocates the skb based on the calculated min_dump_alloc
    or a per-socket max_recvmsg_len.
    min_alloc_size is the maximum space required for any single netdev's
    attributes, as calculated by rtnl_calcit().
    max_recvmsg_len tracks the user-provided buffer passed to
    netlink_recvmsg. It is capped at 16KiB.
    The intention is to avoid small allocations and to minimize the number
    of calls required to obtain dump information for all net devices.

    netlink_dump packs as many small messages as fit within an skb that
    was sized for the largest single netdev's information. The actual
    space available within an skb is larger than what was requested; it
    can be much larger, up to nearly 2x, with the align-to-next-power-of-2
    allocation approach.
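    The near-2x figure comes from allocator rounding. A userspace sketch of
    the effect (illustrative helpers, not kernel code):

```c
#include <assert.h>
#include <stddef.h>

/* Illustration only: kmalloc-style size classes round a request up to
 * the next power of two, so the usable space in an skb's data area can
 * approach 2x what netlink_dump asked for. */
static size_t next_pow2(size_t n)
{
    size_t p = 1;

    while (p < n)
        p <<= 1;
    return p;
}

/* Usable slack left over after rounding. */
static size_t slack(size_t requested)
{
    return next_pow2(requested) - requested;
}
```

    For a request just over 32KiB, the rounded allocation is 64KiB, leaving
    nearly 32KiB of slack that netlink_dump could fill.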

    Allowing netlink_dump to use all the space available within the
    allocated skb increases the buffer size a user has to provide to avoid
    truncation (i.e. the MSG_TRUNC flag being set).

    It was observed that with many VLANs configured on at least one netdev,
    a buffer of nearly 64KiB was necessary to avoid the "Message truncated"
    error in "ip link" or "bridge [-c[ompressvlans]] vlan show", even
    though min_alloc_size was only a little over 32KiB.

    This patch trims the skb to the allocated size in order to allow the
    user to avoid truncation with a more reasonable buffer size.

    Signed-off-by: Ronen Arad
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Arad, Ronen
     

03 Oct, 2015

3 commits

  • [ Upstream commit da314c9923fed553a007785a901fd395b7eb6c19 ]

    On Mon, Sep 21, 2015 at 02:20:22PM -0400, Tejun Heo wrote:
    >
    > store_release and load_acquire are different from the usual memory
    > barriers and can't be paired this way. You have to pair store_release
    > and load_acquire. Besides, it isn't a particularly good idea to

    OK I've decided to drop the acquire/release helpers as they don't
    help us at all and simply pessimise the code by using full memory
    barriers (on some architectures) where only a write or read barrier
    is needed.

    > depend on memory barriers embedded in other data structures like the
    > above. Here, especially, rhashtable_insert() would have write barrier
    > *before* the entry is hashed not necessarily *after*, which means that
    > in the above case, a socket which appears to have set bound to a
    > reader might not be visible when the reader tries to look up the socket
    > on the hashtable.

    But you are right we do need an explicit write barrier here to
    ensure that the hashing is visible.

    > There's no reason to be overly smart here. This isn't a crazy hot
    > path, write barriers tend to be very cheap, store_release more so.
    > Please just do smp_store_release() and note what it's paired with.

    It's not about being overly smart. It's about actually understanding
    what's going on with the code. I've seen too many instances of
    people simply sprinkling synchronisation primitives around without
    any knowledge of what is happening underneath, which is just a recipe
    for creating hard-to-debug races.

    > > @@ -1539,7 +1546,7 @@ static int netlink_bind(struct socket *sock, struct sockaddr *addr,
    > > }
    > > }
    > >
    > > - if (!nlk->portid) {
    > > + if (!nlk->bound) {
    >
    > I don't think you can skip load_acquire here just because this is the
    > second deref of the variable. That doesn't change anything. Race
    > condition could still happen between the first and second tests and
    > skipping the second would lead to the same kind of bug.

    The reason this one is OK is because we do not use nlk->portid or
    try to get nlk from the hash table before we return to user-space.

    However, there is a real bug here that none of these acquire/release
    helpers discovered. The two bound tests here used to be a single
    one. Now that they are separate it is entirely possible for another
    thread to come in the middle and bind the socket. So we need to
    repeat the portid check in order to maintain consistency.

    > > @@ -1587,7 +1594,7 @@ static int netlink_connect(struct socket *sock, struct sockaddr *addr,
    > > !netlink_allowed(sock, NL_CFG_F_NONROOT_SEND))
    > > return -EPERM;
    > >
    > > - if (!nlk->portid)
    > > + if (!nlk->bound)
    >
    > Don't we need load_acquire here too? Is this path holding a lock
    > which makes that unnecessary?

    Ditto.

    ---8<---
    Re-checking the portid once nlk->bound is set in netlink_bind fixes
    a race where two threads that bind the socket at the same time
    with different port IDs may both succeed.
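    The pairing discussed above can be sketched with C11 atomics
    (illustrative names; the kernel uses smp_store_release() and
    smp_load_acquire() on nlk->bound):

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Sketch, not kernel code: publish the portid with a plain store, then
 * set a "bound" flag with release semantics; readers check the flag
 * with acquire semantics before trusting portid. */
struct fake_sock {
    unsigned int portid;
    atomic_bool bound;
};

static void publish_bind(struct fake_sock *sk, unsigned int portid)
{
    sk->portid = portid;                    /* plain store of the data */
    atomic_store_explicit(&sk->bound, true, /* release store: pairs    */
                          memory_order_release); /* with acquire below */
}

static int read_portid(struct fake_sock *sk, unsigned int *out)
{
    if (!atomic_load_explicit(&sk->bound, memory_order_acquire))
        return -1;          /* not bound yet */
    *out = sk->portid;      /* ordered after the acquire load */
    return 0;
}
```

    The release store guarantees that a reader which observes bound == true
    also observes the portid written before it.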

    Fixes: 1f770c0a09da ("netlink: Fix autobind race condition that leads to zero port ID")
    Reported-by: Tejun Heo
    Reported-by: Linus Torvalds
    Signed-off-by: Herbert Xu
    Nacked-by: Tejun Heo
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Herbert Xu
     
  • [ Upstream commit 1f770c0a09da855a2b51af6d19de97fb955eca85 ]

    The commit c0bb07df7d981e4091432754e30c9c720e2c0c78 ("netlink:
    Reset portid after netlink_insert failure") introduced a race
    condition where if two threads try to autobind the same socket
    one of them may end up with a zero port ID. This led to kernel
    deadlocks that were observed by multiple people.

    This patch reverts that commit and instead fixes it by introducing
    a separate rhash_portid variable so that the real portid is only set
    after the socket has been successfully hashed.

    Fixes: c0bb07df7d98 ("netlink: Reset portid after netlink_insert failure")
    Reported-by: Tejun Heo
    Reported-by: Linus Torvalds
    Signed-off-by: Herbert Xu
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Herbert Xu
     
  • [ Upstream commit 1853c949646005b5959c483becde86608f548f24 ]

    Ken-ichirou reported that running netlink in mmap mode for receive in
    combination with nlmon will throw a NULL pointer dereference in
    __kfree_skb() on nlmon_xmit(), in my case I can also trigger an "unable
    to handle kernel paging request". The problem is the skb_clone() in
    __netlink_deliver_tap_skb() for skbs that are mmaped.

    I.e. the cloned skb doesn't have a destructor, whereas the mmap netlink
    skb has it pointed to netlink_skb_destructor(), set in the handler
    netlink_ring_setup_skb(). There, skb->head is being set to NULL, so
    that in such cases, __kfree_skb() doesn't perform a skb_release_data()
    via skb_release_all(), where skb->head is possibly being freed through
    kfree(head) into slab allocator, although netlink mmap skb->head points
    to the mmap buffer. Similarly, the same has to be done also for large
    netlink skbs where the data area is vmalloced. Therefore, as discussed,
    make a copy for these rather rare cases for now. This fixes the issue
    on my and Ken-ichirou's test-cases.

    Reference: http://thread.gmane.org/gmane.linux.network/371129
    Fixes: bcbde0d449ed ("net: netlink: virtual tap device management")
    Reported-by: Ken-ichirou MATSUZAWA
    Signed-off-by: Daniel Borkmann
    Tested-by: Ken-ichirou MATSUZAWA
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Daniel Borkmann
     

30 Sep, 2015

2 commits

  • [ Upstream commit 4e7c1330689e27556de407d3fdadc65ffff5eb12 ]

    Linus reports the following deadlock on rtnl_mutex; triggered only
    once so far (extract):

    [12236.694209] NetworkManager D 0000000000013b80 0 1047 1 0x00000000
    [12236.694218] ffff88003f902640 0000000000000000 ffffffff815d15a9 0000000000000018
    [12236.694224] ffff880119538000 ffff88003f902640 ffffffff81a8ff84 00000000ffffffff
    [12236.694230] ffffffff81a8ff88 ffff880119c47f00 ffffffff815d133a ffffffff81a8ff80
    [12236.694235] Call Trace:
    [12236.694250] [] ? schedule_preempt_disabled+0x9/0x10
    [12236.694257] [] ? schedule+0x2a/0x70
    [12236.694263] [] ? schedule_preempt_disabled+0x9/0x10
    [12236.694271] [] ? __mutex_lock_slowpath+0x7f/0xf0
    [12236.694280] [] ? mutex_lock+0x16/0x30
    [12236.694291] [] ? rtnetlink_rcv+0x10/0x30
    [12236.694299] [] ? netlink_unicast+0xfb/0x180
    [12236.694309] [] ? rtnl_getlink+0x113/0x190
    [12236.694319] [] ? rtnetlink_rcv_msg+0x7a/0x210
    [12236.694331] [] ? sock_has_perm+0x5c/0x70
    [12236.694339] [] ? rtnetlink_rcv+0x30/0x30
    [12236.694346] [] ? netlink_rcv_skb+0x9c/0xc0
    [12236.694354] [] ? rtnetlink_rcv+0x1f/0x30
    [12236.694360] [] ? netlink_unicast+0xfb/0x180
    [12236.694367] [] ? netlink_sendmsg+0x484/0x5d0
    [12236.694376] [] ? __wake_up+0x2f/0x50
    [12236.694387] [] ? sock_sendmsg+0x33/0x40
    [12236.694396] [] ? ___sys_sendmsg+0x22e/0x240
    [12236.694405] [] ? ___sys_recvmsg+0x135/0x1a0
    [12236.694415] [] ? eventfd_write+0x82/0x210
    [12236.694423] [] ? fsnotify+0x32e/0x4c0
    [12236.694429] [] ? wake_up_q+0x60/0x60
    [12236.694434] [] ? __sys_sendmsg+0x39/0x70
    [12236.694440] [] ? entry_SYSCALL_64_fastpath+0x12/0x6a

    So far it seems plausible that the recursive call into rtnetlink_rcv()
    is what's suspicious. One way this could trigger is if the sender's
    NETLINK_CB(skb).portid was wrongly 0 (which is the rtnetlink socket), so
    the rtnl_getlink() request's answer would be sent to the kernel instead
    of to the actual user process, thus grabbing rtnl_mutex() twice.

    One theory is that netlink_autobind(), triggered via netlink_sendmsg(),
    internally overwrites an -EBUSY error with 0, where that error
    wrongly originates from __netlink_insert() instead. That would reset
    the socket's portid to 0, which is then filled into NETLINK_CB(skb).portid
    later on. As commit d470e3b483dc ("[NETLINK]: Fix two socket hashing bugs.")
    also puts it, -EBUSY should not be propagated from netlink_insert().

    It looks like it's very unlikely to reproduce. We need to trigger the
    rhashtable_insert_rehash() handler under a situation where rehashing
    currently occurs (one /rare/ way would be to hit ht->elasticity limits
    while not filled enough to expand the hashtable, but that would rather
    require a specifically crafted bind() sequence with knowledge about
    destination slots, seems unlikely). It probably makes sense to guard
    __netlink_insert() in any case and remap that error. It was suggested
    that EOVERFLOW might be better than an already overloaded ENOMEM.
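    The remap suggested here can be sketched as follows (fake_insert() and
    __fake_insert() are stand-ins, not the kernel's __netlink_insert()):

```c
#include <assert.h>
#include <errno.h>

/* Stand-in for the internal insert, which may fail with various
 * errors, including -EBUSY during a concurrent rehash. */
static int __fake_insert(int simulated_err)
{
    return simulated_err;
}

/* Guard the internal insert: never let -EBUSY escape, since the
 * autobind path treats it as "retry with a new port id".  Remap it to
 * an unambiguous error instead, as suggested in the thread. */
static int fake_insert(int simulated_err)
{
    int err = __fake_insert(simulated_err);

    if (err == -EBUSY)
        err = -EOVERFLOW;
    return err;
}
```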

    Reference: http://thread.gmane.org/gmane.linux.network/372676
    Reported-by: Linus Torvalds
    Signed-off-by: Daniel Borkmann
    Acked-by: Herbert Xu
    Acked-by: Thomas Graf
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Daniel Borkmann
     
  • [ Upstream commit 0470eb99b4721586ccac954faac3fa4472da0845 ]

    Kirill A. Shutemov says:

    This simple test case triggers a few locking asserts in the kernel:

    int main(int argc, char **argv)
    {
            unsigned int block_size = 16 * 4096;
            struct nl_mmap_req req = {
                    .nm_block_size = block_size,
                    .nm_block_nr = 64,
                    .nm_frame_size = 16384,
                    .nm_frame_nr = 64 * block_size / 16384,
            };
            unsigned int ring_size;
            int fd;

            fd = socket(AF_NETLINK, SOCK_RAW, NETLINK_GENERIC);
            if (setsockopt(fd, SOL_NETLINK, NETLINK_RX_RING, &req, sizeof(req)) < 0)
                    exit(1);
            if (setsockopt(fd, SOL_NETLINK, NETLINK_TX_RING, &req, sizeof(req)) < 0)
                    exit(1);

            ring_size = req.nm_block_nr * req.nm_block_size;
            mmap(NULL, 2 * ring_size, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0);
            return 0;
    }

    +++ exited with 0 +++
    BUG: sleeping function called from invalid context at /home/kas/git/public/linux-mm/kernel/locking/mutex.c:616
    in_atomic(): 1, irqs_disabled(): 0, pid: 1, name: init
    3 locks held by init/1:
    #0: (reboot_mutex){+.+...}, at: [] SyS_reboot+0xa9/0x220
    #1: ((reboot_notifier_list).rwsem){.+.+..}, at: [] __blocking_notifier_call_chain+0x39/0x70
    #2: (rcu_callback){......}, at: [] rcu_do_batch.isra.49+0x160/0x10c0
    Preemption disabled at:[] __delay+0xf/0x20

    CPU: 1 PID: 1 Comm: init Not tainted 4.1.0-00009-gbddf4c4818e0 #253
    Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS Debian-1.8.2-1 04/01/2014
    ffff88017b3d8000 ffff88027bc03c38 ffffffff81929ceb 0000000000000102
    0000000000000000 ffff88027bc03c68 ffffffff81085a9d 0000000000000002
    ffffffff81ca2a20 0000000000000268 0000000000000000 ffff88027bc03c98
    Call Trace:
    [] dump_stack+0x4f/0x7b
    [] ___might_sleep+0x16d/0x270
    [] __might_sleep+0x4d/0x90
    [] mutex_lock_nested+0x2f/0x430
    [] ? _raw_spin_unlock_irqrestore+0x5d/0x80
    [] ? __this_cpu_preempt_check+0x13/0x20
    [] netlink_set_ring+0x1ed/0x350
    [] ? netlink_undo_bind+0x70/0x70
    [] netlink_sock_destruct+0x80/0x150
    [] __sk_free+0x1d/0x160
    [] sk_free+0x19/0x20
    [..]

    Cong Wang says:

    We can't hold mutex lock in a rcu callback, [..]

    Thomas Graf says:

    The socket should be dead at this point. It might be simpler to
    add a netlink_release_ring() function which doesn't require
    locking at all.

    Reported-by: "Kirill A. Shutemov"
    Diagnosed-by: Cong Wang
    Suggested-by: Thomas Graf
    Signed-off-by: Florian Westphal
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Florian Westphal
     

17 May, 2015

1 commit

  • The commit c5adde9468b0714a051eac7f9666f23eb10b61f7 ("netlink:
    eliminate nl_sk_hash_lock") breaks the autobind retry mechanism
    because it doesn't reset portid after a failed netlink_insert.

    This means that should autobind fail the first time around, then
    the socket will be stuck in limbo as it can never be bound again
    since it already has a non-zero portid.
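    A minimal sketch of the retry-breaking bug and its fix (toy types and
    names, not the kernel's):

```c
#include <assert.h>
#include <errno.h>

struct toy_sock { unsigned int portid; };

/* Stand-ins for the hashtable insert step. */
static int insert_fail(unsigned int id) { (void)id; return -EBUSY; }
static int insert_ok(unsigned int id)   { (void)id; return 0; }

static int toy_bind(struct toy_sock *sk, unsigned int portid,
                    int (*try_insert)(unsigned int))
{
    int err;

    sk->portid = portid;        /* provisionally set before insert */
    err = try_insert(portid);
    if (err)
        sk->portid = 0;         /* the reset this commit restores;   */
                                /* without it the socket looks bound */
                                /* forever and autobind can't retry  */
    return err;
}
```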

    Fixes: c5adde9468b0 ("netlink: eliminate nl_sk_hash_lock")
    Signed-off-by: Herbert Xu
    Signed-off-by: David S. Miller

    Herbert Xu
     

15 May, 2015

1 commit

    Netlink socket creation and deletion heavily modify nl_table_users
    and nl_table_lock.

    If nl_table shares a cache line with one of them, netlink
    performance is really bad on SMP.

    ffffffff81ff5f00 B nl_table
    ffffffff81ff5f0c b nl_table_users

    Putting nl_table in the read_mostly section increased the performance
    of my open/delete netlink sockets test by about 80%.

    This came up while diagnosing a getaddrinfo() problem.
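    The fix amounts to cache-line separation: keep read-mostly data away
    from frequently written fields. A sketch, assuming a 64-byte cache
    line (names illustrative):

```c
#include <assert.h>
#include <stdalign.h>
#include <stddef.h>

/* Assumed cache-line size; real kernels use L1_CACHE_BYTES. */
#define CACHELINE 64

/* Give the read-mostly pointer its own cache line so that constant
 * stores to the hot fields beside it don't keep invalidating the line
 * every reader pulls. */
struct tables {
    alignas(CACHELINE) void *table;  /* read-mostly */
    alignas(CACHELINE) int   users;  /* written on every create/delete */
    alignas(CACHELINE) int   lock;   /* written on every create/delete */
};
```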

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

04 May, 2015

1 commit

  • We currently limit the hash table size to 64K which is very bad
    as even 10 years ago it was relatively easy to generate millions
    of sockets.

    Since the hash table is naturally limited by memory allocation
    failure, we don't really need an explicit limit so this patch
    removes it.

    Signed-off-by: Herbert Xu
    Acked-by: Thomas Graf
    Signed-off-by: David S. Miller

    Herbert Xu
     

26 Apr, 2015

1 commit

  • When I added pfmemalloc support in build_skb(), I forgot netlink
    was using build_skb() with a vmalloc() area.

    In this patch I introduce __build_skb() for netlink use,
    and build_skb() becomes a wrapper handling both skb->head_frag and
    skb->pfmemalloc.

    This means netlink no longer has to hack skb->head_frag.

    [ 1567.700067] kernel BUG at arch/x86/mm/physaddr.c:26!
    [ 1567.700067] invalid opcode: 0000 [#1] PREEMPT SMP KASAN
    [ 1567.700067] Dumping ftrace buffer:
    [ 1567.700067] (ftrace buffer empty)
    [ 1567.700067] Modules linked in:
    [ 1567.700067] CPU: 9 PID: 16186 Comm: trinity-c182 Not tainted 4.0.0-next-20150424-sasha-00037-g4796e21 #2167
    [ 1567.700067] task: ffff880127efb000 ti: ffff880246770000 task.ti: ffff880246770000
    [ 1567.700067] RIP: __phys_addr (arch/x86/mm/physaddr.c:26 (discriminator 3))
    [ 1567.700067] RSP: 0018:ffff8802467779d8 EFLAGS: 00010202
    [ 1567.700067] RAX: 000041000ed8e000 RBX: ffffc9008ed8e000 RCX: 000000000000002c
    [ 1567.700067] RDX: 0000000000000004 RSI: 0000000000000000 RDI: ffffffffb3fd6049
    [ 1567.700067] RBP: ffff8802467779f8 R08: 0000000000000019 R09: ffff8801d0168000
    [ 1567.700067] R10: ffff8801d01680c7 R11: ffffed003a02d019 R12: ffffc9000ed8e000
    [ 1567.700067] R13: 0000000000000f40 R14: 0000000000001180 R15: ffffc9000ed8e000
    [ 1567.700067] FS: 00007f2a7da3f700(0000) GS:ffff8801d1000000(0000) knlGS:0000000000000000
    [ 1567.700067] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    [ 1567.700067] CR2: 0000000000738308 CR3: 000000022e329000 CR4: 00000000000007e0
    [ 1567.700067] Stack:
    [ 1567.700067] ffffc9000ed8e000 ffff8801d0168000 ffffc9000ed8e000 ffff8801d0168000
    [ 1567.700067] ffff880246777a28 ffffffffad7c0a21 0000000000001080 ffff880246777c08
    [ 1567.700067] ffff88060d302e68 ffff880246777b58 ffff880246777b88 ffffffffad9a6821
    [ 1567.700067] Call Trace:
    [ 1567.700067] build_skb (include/linux/mm.h:508 net/core/skbuff.c:316)
    [ 1567.700067] netlink_sendmsg (net/netlink/af_netlink.c:1633 net/netlink/af_netlink.c:2329)
    [ 1567.774369] ? sched_clock_cpu (kernel/sched/clock.c:311)
    [ 1567.774369] ? netlink_unicast (net/netlink/af_netlink.c:2273)
    [ 1567.774369] ? netlink_unicast (net/netlink/af_netlink.c:2273)
    [ 1567.774369] sock_sendmsg (net/socket.c:614 net/socket.c:623)
    [ 1567.774369] sock_write_iter (net/socket.c:823)
    [ 1567.774369] ? sock_sendmsg (net/socket.c:806)
    [ 1567.774369] __vfs_write (fs/read_write.c:479 fs/read_write.c:491)
    [ 1567.774369] ? get_lock_stats (kernel/locking/lockdep.c:249)
    [ 1567.774369] ? default_llseek (fs/read_write.c:487)
    [ 1567.774369] ? vtime_account_user (kernel/sched/cputime.c:701)
    [ 1567.774369] ? rw_verify_area (fs/read_write.c:406 (discriminator 4))
    [ 1567.774369] vfs_write (fs/read_write.c:539)
    [ 1567.774369] SyS_write (fs/read_write.c:586 fs/read_write.c:577)
    [ 1567.774369] ? SyS_read (fs/read_write.c:577)
    [ 1567.774369] ? __this_cpu_preempt_check (lib/smp_processor_id.c:63)
    [ 1567.774369] ? trace_hardirqs_on_caller (kernel/locking/lockdep.c:2594 kernel/locking/lockdep.c:2636)
    [ 1567.774369] ? trace_hardirqs_on_thunk (arch/x86/lib/thunk_64.S:42)
    [ 1567.774369] system_call_fastpath (arch/x86/kernel/entry_64.S:261)

    Fixes: 79930f5892e ("net: do not deplete pfmemalloc reserve")
    Signed-off-by: Eric Dumazet
    Reported-by: Sasha Levin
    Signed-off-by: David S. Miller

    Eric Dumazet
     

26 Mar, 2015

1 commit

    nftables sets will be converted to use so-called set extensions,
    moving the key to a non-fixed position. To hash it, obj_hashfn must
    be used; however, so far it doesn't receive the length parameter.

    Pass the key length to obj_hashfn() and convert existing users.

    Signed-off-by: Patrick McHardy
    Signed-off-by: Pablo Neira Ayuso

    Patrick McHardy
     

25 Mar, 2015

1 commit


24 Mar, 2015

1 commit

  • This patch removes the explicit jhash value for the hashfn parameter
    of rhashtable. As the key length is a multiple of 4, this means that
    we will actually end up using jhash2.
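    The dispatch rule can be sketched like this (hash_words() is a toy
    mixer standing in for the real jhash2(), which hashes the key as an
    array of u32 words):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Toy word-at-a-time mixer, a stand-in for jhash2(). */
static uint32_t hash_words(const uint32_t *key, size_t nwords,
                           uint32_t seed)
{
    uint32_t h = seed;

    for (size_t i = 0; i < nwords; i++)
        h = h * 2654435761u ^ key[i];
    return h;
}

/* jhash2 is applicable only when the key is a whole number of
 * 32-bit words; otherwise byte-at-a-time jhash must be used. */
static int can_use_word_hash(size_t keylen)
{
    return keylen % 4 == 0;
}
```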

    Signed-off-by: Herbert Xu
    Acked-by: Thomas Graf
    Signed-off-by: David S. Miller

    Herbert Xu
     

21 Mar, 2015

2 commits

  • Instead of computing the offset from trailer, this patch computes
    netlink_compare_arg_len from the offset of portid and then adds 4
    to it. This allows trailer to be removed.
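    A sketch of the offset computation (the struct layout is illustrative,
    not the kernel's netlink_compare_arg):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative key layout: the comparison key runs from the start of
 * the struct up to and including the 4-byte portid field. */
struct compare_arg {
    void     *net;      /* namespace pointer, part of the key */
    uint32_t  portid;   /* last key field */
};

/* Compare length = offset of portid + its 4 bytes; no trailer
 * marker field is needed to delimit the key. */
#define COMPARE_ARG_LEN \
    (offsetof(struct compare_arg, portid) + sizeof(uint32_t))
```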

    Reported-by: David Miller
    Signed-off-by: Herbert Xu
    Signed-off-by: David S. Miller

    Herbert Xu
     
    Currently the namespace is a de facto key because it has to match
    before we find an object in the hash table. However, it isn't
    included in the hash value, so all objects from different namespaces
    with the same port ID hash to the same bucket.

    This is bad as the number of namespaces is unbounded.

    This patch fixes this by using the namespace when doing the hash.

    Because the namespace field doesn't lie next to the portid field
    in the netlink socket, this patch switches over to the rhashtable
    interface without a fixed key.

    This patch also uses the new inlined rhashtable interface where
    possible.
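    Folding the namespace into the hash can be sketched as follows (toy
    mixer, not the kernel's actual hash function):

```c
#include <assert.h>
#include <stdint.h>

/* Toy hash step, a stand-in for the real mixing function. */
static uint32_t mix(uint32_t h, uint32_t v)
{
    return (h ^ v) * 2654435761u;
}

/* Hash both the namespace pointer and the portid, so sockets with the
 * same portid in different namespaces land in different buckets. */
static uint32_t hash_sock(const void *ns, uint32_t portid, uint32_t seed)
{
    uint32_t h = mix(seed, (uint32_t)(uintptr_t)ns);

    return mix(h, portid);
}
```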

    Signed-off-by: Herbert Xu
    Signed-off-by: David S. Miller

    Herbert Xu
     

19 Mar, 2015

1 commit


04 Mar, 2015

1 commit


03 Mar, 2015

1 commit

    Now that TIPC no longer depends on the iocb argument in its internal
    implementations of the sendmsg() and recvmsg() hooks defined in the
    proto structure, no user of the argument remains.
    We can therefore drop the redundant iocb argument completely from all
    implementations of both sendmsg() and recvmsg() across the entire
    networking stack.

    Cc: Christoph Hellwig
    Suggested-by: Al Viro
    Signed-off-by: Ying Xue
    Signed-off-by: David S. Miller

    Ying Xue
     

28 Feb, 2015

1 commit

  • Currently, all real users of rhashtable default their grow and shrink
    decision functions to rht_grow_above_75() and rht_shrink_below_30(),
    so that there's currently no need to have this explicitly selectable.

    It can/should be generic and private inside rhashtable until a real
    use case pops up. Since we can make this private, we'll save us this
    additional indirection layer and can improve insertion/deletion time
    as well.
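    The default thresholds can be sketched as pure predicates (a sketch of
    the 75%/30% rule in integer arithmetic, not the kernel's exact code):

```c
#include <assert.h>
#include <stddef.h>

/* Grow when occupancy exceeds 75% of the bucket count. */
static int grow_above_75(size_t nelems, size_t buckets)
{
    return nelems > buckets * 3 / 4;
}

/* Shrink when occupancy drops below 30% of the bucket count. */
static int shrink_below_30(size_t nelems, size_t buckets)
{
    return nelems * 10 < buckets * 3;
}
```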

    Reference: http://patchwork.ozlabs.org/patch/443040/
    Suggested-by: David S. Miller
    Signed-off-by: Daniel Borkmann
    Acked-by: Thomas Graf
    Signed-off-by: David S. Miller

    Daniel Borkmann
     

06 Feb, 2015

1 commit

  • Conflicts:
    drivers/net/vxlan.c
    drivers/vhost/net.c
    include/linux/if_vlan.h
    net/core/dev.c

    The net/core/dev.c conflict was the overlap of one commit marking an
    existing function static whilst another was adding a new function.

    In the include/linux/if_vlan.h case, the type used for a local
    variable was changed in 'net', whereas the function got rewritten
    to fix a stacked vlan bug in 'net-next'.

    In drivers/vhost/net.c, Al Viro's iov_iter conversions in 'net-next'
    overlapped with an endianness fix for VHOST 1.0 in 'net'.

    In drivers/net/vxlan.c, vxlan_find_vni() added a 'flags' parameter
    in 'net-next' whereas in 'net' there was a bug fix to pass in the
    correct network namespace pointer in calls to this function.

    Signed-off-by: David S. Miller

    David S. Miller
     

05 Feb, 2015

2 commits


04 Feb, 2015

1 commit

  • As it is, zero msg_iovlen means that the first iovec in the kernel
    array of iovecs is left uninitialized, so checking if its ->iov_base
    is NULL is random. Since the real users of that thing are doing
    sendto(fd, NULL, 0, ...), they are getting msg_iovlen = 1 and
    msg_iov[0] = {NULL, 0}, which is what this test is trying to catch.
    As suggested by davem, let's just check that msg_iovlen was 1 and
    msg_iov[0].iov_base was NULL - _that_ is well-defined and it catches
    what we want to catch.
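    The well-defined check being adopted can be sketched as:

```c
#include <assert.h>
#include <stddef.h>
#include <sys/uio.h>

/* Treat the message as "no data requested" only when there is exactly
 * one iovec and its base is NULL, which is what sendto(fd, NULL, 0, ...)
 * produces.  With msg_iovlen == 0 the first iovec is uninitialized, so
 * inspecting it would be reading garbage. */
static int is_null_data(size_t msg_iovlen, const struct iovec *iov)
{
    return msg_iovlen == 1 && iov[0].iov_base == NULL;
}
```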

    Signed-off-by: Al Viro

    Al Viro
     

31 Jan, 2015

1 commit

  • The subscription bitmask passed via struct sockaddr_nl is converted to
    the group number when calling the netlink_bind() and netlink_unbind()
    callbacks.

    The conversion is however incorrect, since bitmask (1 << 0) needs to be
    mapped to group number 1. Note that you cannot specify the group number 0
    (usually known as _NONE) from setsockopt() using NETLINK_ADD_MEMBERSHIP,
    since it is rejected with -EINVAL.
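    The intended mapping can be sketched with ffs(), which returns the
    1-based index of the lowest set bit:

```c
#include <assert.h>
#include <strings.h>   /* ffs() */

/* Map a subscription bitmask to a group number: bit (1 << 0) is group
 * 1, bit (1 << 4) is group 5, and 0 means no group at all. */
static int bitmask_to_group(unsigned int mask)
{
    return ffs((int)mask);
}
```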

    This problem became noticeable since 97840cb ("netfilter: nfnetlink:
    fix insufficient validation in nfnetlink_bind") when binding to bitmask
    (1 << 0) in ctnetlink.

    Reported-by: Andre Tomt
    Reported-by: Ivan Delalande
    Signed-off-by: Pablo Neira Ayuso
    Signed-off-by: David S. Miller

    Pablo Neira
     

29 Jan, 2015

1 commit

  • The sock_iocb structure is allocate on stack for each read/write-like
    operation on sockets, and contains various fields of which only the
    embedded msghdr and sometimes a pointer to the scm_cookie is ever used.
    Get rid of the sock_iocb and put a msghdr directly on the stack and pass
    the scm_cookie explicitly to netlink_mmap_sendmsg.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: David S. Miller

    Christoph Hellwig
     

28 Jan, 2015

1 commit


27 Jan, 2015

1 commit


18 Jan, 2015

1 commit

  • Contrary to common expectations for an "int" return, these functions
    return only a positive value -- if used correctly they cannot even
    return 0 because the message header will necessarily be in the skb.

    This makes the very common pattern of

    if (genlmsg_end(...) < 0) { ... }

    be a whole bunch of dead code. Many places also simply do

    return nlmsg_end(...);

    and the caller is expected to deal with it.

    This also commonly (at least for me) causes errors, because it is very
    common to write

    if (my_function(...))
    /* error condition */

    and if my_function() does "return nlmsg_end()" this is of course wrong.

    Additionally, there's not a single place in the kernel that actually
    needs the message length returned, and if anyone needs it later then
    it'll be very easy to just use skb->len there.

    Remove this, and make the functions void. This removes a bunch of dead
    code as described above. The patch adds lines because I did

    - return nlmsg_end(...);
    + nlmsg_end(...);
    + return 0;

    I could have preserved all the function's return values by returning
    skb->len, but instead I've audited all the places calling the affected
    functions and found that none cared. A few places actually compared
    the return value with < 0 with no change in behaviour, so I opted for the more
    efficient version.
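    The conversion pattern can be sketched like this (end_msg() and
    toy_msg are stand-ins for nlmsg_end() and the skb):

```c
#include <assert.h>

struct toy_msg { int len; int finalized; };

/* After the change: finalize the message and return nothing, instead
 * of returning a positive length that callers mistook for an error. */
static void end_msg(struct toy_msg *m)
{
    m->finalized = 1;   /* was: return m->len; */
}

/* A dump callback converted accordingly: callers that write
 * "if (dump_one(...))" to detect errors now behave correctly. */
static int dump_one(struct toy_msg *m)
{
    end_msg(m);
    return 0;
}
```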

    One instance of the error I've made numerous times now is also present
    in net/phonet/pn_netlink.c in the route_dumpit() function.

    Signed-off-by: Johannes Berg
    Signed-off-by: David S. Miller

    Johannes Berg
     

17 Jan, 2015

2 commits

  • In addition to the problem Jeff Layton reported, I looked at the code
    and reproduced the same warning by subscribing and removing the genl
    family with a socket still open. This is a fairly tricky race which
    originates in the fact that generic netlink allows the family to go
    away while sockets are still open - unlike regular netlink which has
    a module refcount for every open socket so in general this cannot be
    triggered.

    Trying to resolve this issue by the obvious locking isn't possible as
    it will result in deadlocks between unregistration and group unbind
    notification (which incidentally lockdep doesn't find due to the home
    grown locking in the netlink table.)

    To really resolve this, introduce a "closing socket" reference counter
    (for generic netlink only, as it's the only affected family) in the
    core netlink code and use that in generic netlink to wait for all the
    sockets that are being closed at the same time as a generic netlink
    family is removed.

    This fixes the race in which, when a socket is closed, it should call
    the unbind, but if the family is removed at the same time the unbind
    will not find it, leading to the warning. The real problem though is
    that in this case the unbind could actually find a new family that is
    registered to have a multicast group with the same ID, and call its
    mcast_unbind(), leading to confusion.

    Also remove the warning since it would still trigger, but is now no
    longer a problem.

    This also moves the code in af_netlink.c to before unreferencing the
    module to avoid having the same problem in the normal non-genl case.

    Signed-off-by: Johannes Berg
    Signed-off-by: David S. Miller

    Johannes Berg
     
  • Jeff Layton reported that he could trigger the multicast unbind warning
    in generic netlink using trinity. I originally thought it was a race
    condition between unregistering the generic netlink family and closing
    the socket, but there's a far simpler explanation: genetlink currently
    allows subscribing to groups that don't (yet) exist, and the warning is
    triggered when unsubscribing again while the group still doesn't exist.

    Originally, I had a warning in the subscribe case and accepted it out of
    userspace API concerns, but the warning was of course wrong and removed
    later.

    However, I now think that allowing userspace to subscribe to groups that
    don't exist is wrong and could possibly become a security problem:
    Consider a (new) genetlink family implementing a permission check in
    the mcast_bind() function similar to what the audit code does today;
    it would be possible to bypass the permission check by guessing the ID
    and subscribing to the group before it exists. This is only possible if
    such a family were dynamically loaded, but that doesn't seem like a
    huge stretch; for example, wireless may be loaded when you plug in a USB
    device.

    To avoid this, reject such subscription attempts.

    If this ends up causing userspace issues we may need to add a workaround
    in af_netlink to deny such requests but not return an error.

    Reported-by: Jeff Layton
    Signed-off-by: Johannes Berg
    Signed-off-by: David S. Miller

    Johannes Berg
     

16 Jan, 2015

1 commit

  • The patch c5adde9468b0714a051eac7f9666f23eb10b61f7 ("netlink:
    eliminate nl_sk_hash_lock") introduced a bug where the EADDRINUSE
    error has been replaced by ENOMEM. This patch rectifies that
    problem.

    Signed-off-by: Herbert Xu
    Acked-by: Ying Xue
    Signed-off-by: David S. Miller

    Herbert Xu
     

14 Jan, 2015

1 commit

    As rhashtable_lookup_compare_insert() guarantees that search and
    insertion happen atomically, it's safe to eliminate the
    nl_sk_hash_lock. After this, object insertion or removal will
    be protected with a per-bucket lock on the write side, while object
    lookup is guarded with the RCU read lock on the read side.

    Signed-off-by: Ying Xue
    Cc: Thomas Graf
    Acked-by: Thomas Graf
    Signed-off-by: David S. Miller

    Ying Xue
     

04 Jan, 2015

4 commits

    Defers the release of the socket reference using call_rcu() to
    allow using an RCU read-side protected call to rhashtable_lookup().

    This restores behaviour and performance gains as previously
    introduced by e341694 ("netlink: Convert netlink_lookup() to use
    RCU protected hash table") without the side effect of severely
    delayed socket destruction.

    Signed-off-by: Thomas Graf
    Signed-off-by: David S. Miller

    Thomas Graf
     
  • Introduces an array of spinlocks to protect bucket mutations. The number
    of spinlocks per CPU is configurable and selected based on the hash of
    the bucket. This allows for parallel insertions and removals of entries
    which do not share a lock.

    The patch also defers expansion and shrinking to a worker queue which
    allows insertion and removal from atomic context. Insertions and
    deletions may occur in parallel to it and are only held up briefly
    while the particular bucket is linked or unzipped.

    Mutations of the bucket table pointer are protected by a new mutex;
    read access is RCU protected.

    In the event of an expansion or shrinking, the new bucket table allocated
    is exposed as a so called future table as soon as the resize process
    starts. Lookups, deletions, and insertions will briefly use both tables.
    The future table becomes the main table after an RCU grace period,
    once initial linking of the old to the new table has been performed.
    Optimization of the chains to make use of the new number of buckets
    follows only once the new table is in use.

    The side effect of this is that during that RCU grace period, a bucket
    traversal using any rht_for_each() variant on the main table will not see
    any insertions performed during the RCU grace period which would at that
    point land in the future table. The lookup will see them as it searches
    both tables if needed.

    Having multiple insertions and removals occur in parallel requires nelems
    to become an atomic counter.
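    The per-bucket lock selection can be sketched as follows (the lock
    count, the plain-int "lock" type, and the masking scheme are all
    illustrative, not the kernel's):

```c
#include <assert.h>
#include <stdint.h>

/* Fixed pool of locks; NLOCKS is a power of two so the bucket index
 * can be mapped to a lock with a cheap mask. */
#define NLOCKS 16

static int bucket_locks[NLOCKS];

/* Two different buckets usually map to different locks, so their
 * mutations can proceed in parallel; buckets that share a lock are
 * simply serialized against each other. */
static int *lock_for_bucket(uint32_t bucket)
{
    return &bucket_locks[bucket & (NLOCKS - 1)];
}
```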

    Signed-off-by: Thomas Graf
    Signed-off-by: David S. Miller

    Thomas Graf
     
  • This patch is in preparation to introduce per bucket spinlocks. It
    extends all iterator macros to take the bucket table and bucket
    index. It also introduces a new rht_dereference_bucket() to
    handle protected accesses to buckets.

    It introduces a barrier() in the RCU iterators to prevent
    the compiler from caching the first element.

    The lockdep verifier is introduced as a stub which always succeeds;
    it is properly implemented in the next patch, when the locks are
    introduced.

    Signed-off-by: Thomas Graf
    Signed-off-by: David S. Miller

    Thomas Graf
     
    Hash the key inside of rhashtable_lookup_compare() like
    rhashtable_lookup() does. This allows us to simplify the hashing
    functions and keep them private.

    Signed-off-by: Thomas Graf
    Cc: netfilter-devel@vger.kernel.org
    Signed-off-by: David S. Miller

    Thomas Graf
     

30 Dec, 2014

1 commit


27 Dec, 2014

2 commits

  • Netlink families can exist in multiple namespaces, and for the most
    part multicast subscriptions are per network namespace. Thus it only
    makes sense to have bind/unbind notifications per network namespace.

    To achieve this, pass the network namespace of a given client socket
    to the bind/unbind functions.

    Also do this in generic netlink, and there also make sure that any
    bind for multicast groups that only exist in init_net is rejected.
    This isn't really a problem if it is accepted, since a client in a
    different namespace will never receive any notifications from such
    a group, but it can confuse the family if not rejected. (It's also
    possible to silently (without telling the family) accept it, but it
    would also have to be ignored on unbind, so families that take any
    kind of action on bind/unbind won't do unnecessary work for invalid
    clients like that.)

    Signed-off-by: Johannes Berg
    Signed-off-by: David S. Miller

    Johannes Berg
     
    In order to make use of the newly fixed multicast bind/unbind
    functionality in generic netlink, pass the bind/unbind calls down
    to the appropriate family.

    Signed-off-by: Johannes Berg
    Signed-off-by: David S. Miller

    Johannes Berg