10 Dec, 2015

1 commit

  • The backport of 1f770c0a09da855a2b51af6d19de97fb955eca85 ("netlink:
    Fix autobind race condition that leads to zero port ID") missed a
    goto statement, which causes netlink to break subtly.

    This was discovered by Stefan Priebe.

    Fixes: 4e2776241766 ("netlink: Fix autobind race condition that...")
    Reported-by: Stefan Priebe
    Reported-by: Philipp Hahn
    Signed-off-by: Herbert Xu
    Acked-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Herbert Xu
     

27 Oct, 2015

1 commit

  • [ Upstream commit db65a3aaf29ecce2e34271d52e8d2336b97bd9fe ]

    netlink_dump() allocates the skb based on the calculated min_dump_alloc
    or a per-socket max_recvmsg_len.
    min_alloc_size is the maximum space required for any single netdev's
    attributes, as calculated by rtnl_calcit().
    max_recvmsg_len tracks the user-provided buffer passed to
    netlink_recvmsg. It is capped at 16KiB.
    The intention is to avoid small allocations and to minimize the number
    of calls required to obtain dump information for all net devices.

    netlink_dump packs as many small messages as fit within an skb that
    was sized for the largest single netdev's information. The actual
    space available within an skb is larger than what was requested; it
    can be much larger, up to nearly 2x, with the align-to-next-power-of-2
    allocation approach.
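    The near-2x figure comes from allocator rounding. A userspace sketch of
    the effect (illustrative helpers, not kernel code):

```c
#include <assert.h>
#include <stddef.h>

/* Illustration only: kmalloc-style size classes round a request up to
 * the next power of two, so the usable space in an skb's data area can
 * approach 2x what netlink_dump asked for. */
static size_t next_pow2(size_t n)
{
    size_t p = 1;

    while (p < n)
        p <<= 1;
    return p;
}

/* Usable slack left over after rounding. */
static size_t slack(size_t requested)
{
    return next_pow2(requested) - requested;
}
```

    For a request just over 32KiB, the rounded allocation is 64KiB, leaving
    nearly 32KiB of slack that netlink_dump could fill.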

    Allowing netlink_dump to use all the space available within the
    allocated skb increases the buffer size a user has to provide to avoid
    truncation (i.e. the MSG_TRUNC flag being set).

    It was observed that with many VLANs configured on at least one netdev,
    a buffer of nearly 64KiB was necessary to avoid the "Message truncated"
    error in "ip link" or "bridge [-c[ompressvlans]] vlan show", even
    though min_alloc_size was only a little over 32KiB.

    This patch trims the skb to the allocated size in order to allow the
    user to avoid truncation with a more reasonable buffer size.

    Signed-off-by: Ronen Arad
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Arad, Ronen
     

03 Oct, 2015

3 commits

  • [ Upstream commit da314c9923fed553a007785a901fd395b7eb6c19 ]

    On Mon, Sep 21, 2015 at 02:20:22PM -0400, Tejun Heo wrote:
    >
    > store_release and load_acquire are different from the usual memory
    > barriers and can't be paired this way. You have to pair store_release
    > and load_acquire. Besides, it isn't a particularly good idea to

    OK I've decided to drop the acquire/release helpers as they don't
    help us at all and simply pessimise the code by using full memory
    barriers (on some architectures) where only a write or read barrier
    is needed.

    > depend on memory barriers embedded in other data structures like the
    > above. Here, especially, rhashtable_insert() would have write barrier
    > *before* the entry is hashed not necessarily *after*, which means that
    > in the above case, a socket which appears to have set bound to a
    > reader might not be visible when the reader tries to look up the socket
    > on the hashtable.

    But you are right we do need an explicit write barrier here to
    ensure that the hashing is visible.

    > There's no reason to be overly smart here. This isn't a crazy hot
    > path, write barriers tend to be very cheap, store_release more so.
    > Please just do smp_store_release() and note what it's paired with.

    It's not about being overly smart. It's about actually understanding
    what's going on with the code. I've seen too many instances of
    people simply sprinkling synchronisation primitives around without
    any knowledge of what is happening underneath, which is just a recipe
    for creating hard-to-debug races.

    > > @@ -1539,7 +1546,7 @@ static int netlink_bind(struct socket *sock, struct sockaddr *addr,
    > > }
    > > }
    > >
    > > - if (!nlk->portid) {
    > > + if (!nlk->bound) {
    >
    > I don't think you can skip load_acquire here just because this is the
    > second deref of the variable. That doesn't change anything. Race
    > condition could still happen between the first and second tests and
    > skipping the second would lead to the same kind of bug.

    The reason this one is OK is because we do not use nlk->portid or
    try to get nlk from the hash table before we return to user-space.

    However, there is a real bug here that none of these acquire/release
    helpers discovered. The two bound tests here used to be a single
    one. Now that they are separate it is entirely possible for another
    thread to come in the middle and bind the socket. So we need to
    repeat the portid check in order to maintain consistency.

    > > @@ -1587,7 +1594,7 @@ static int netlink_connect(struct socket *sock, struct sockaddr *addr,
    > > !netlink_allowed(sock, NL_CFG_F_NONROOT_SEND))
    > > return -EPERM;
    > >
    > > - if (!nlk->portid)
    > > + if (!nlk->bound)
    >
    > Don't we need load_acquire here too? Is this path holding a lock
    > which makes that unnecessary?

    Ditto.

    ---8<---
    Re-checking the portid once nlk->bound is set in netlink_bind fixes
    a race where two threads that bind the socket at the same time
    with different port IDs may both succeed.
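    The pairing discussed above can be sketched with C11 atomics
    (illustrative names; the kernel uses smp_store_release() and
    smp_load_acquire() on nlk->bound):

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Sketch, not kernel code: publish the portid with a plain store, then
 * set a "bound" flag with release semantics; readers check the flag
 * with acquire semantics before trusting portid. */
struct fake_sock {
    unsigned int portid;
    atomic_bool bound;
};

static void publish_bind(struct fake_sock *sk, unsigned int portid)
{
    sk->portid = portid;                    /* plain store of the data */
    atomic_store_explicit(&sk->bound, true, /* release store: pairs    */
                          memory_order_release); /* with acquire below */
}

static int read_portid(struct fake_sock *sk, unsigned int *out)
{
    if (!atomic_load_explicit(&sk->bound, memory_order_acquire))
        return -1;          /* not bound yet */
    *out = sk->portid;      /* ordered after the acquire load */
    return 0;
}
```

    The release store guarantees that a reader which observes bound == true
    also observes the portid written before it.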

    Fixes: 1f770c0a09da ("netlink: Fix autobind race condition that leads to zero port ID")
    Reported-by: Tejun Heo
    Reported-by: Linus Torvalds
    Signed-off-by: Herbert Xu
    Nacked-by: Tejun Heo
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Herbert Xu
     
  • [ Upstream commit 1f770c0a09da855a2b51af6d19de97fb955eca85 ]

    The commit c0bb07df7d981e4091432754e30c9c720e2c0c78 ("netlink:
    Reset portid after netlink_insert failure") introduced a race
    condition where if two threads try to autobind the same socket
    one of them may end up with a zero port ID. This led to kernel
    deadlocks that were observed by multiple people.

    This patch reverts that commit and instead fixes it by introducing
    a separate rhash_portid variable so that the real portid is only set
    after the socket has been successfully hashed.

    Fixes: c0bb07df7d98 ("netlink: Reset portid after netlink_insert failure")
    Reported-by: Tejun Heo
    Reported-by: Linus Torvalds
    Signed-off-by: Herbert Xu
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Herbert Xu
     
  • [ Upstream commit 1853c949646005b5959c483becde86608f548f24 ]

    Ken-ichirou reported that running netlink in mmap mode for receive in
    combination with nlmon will throw a NULL pointer dereference in
    __kfree_skb() on nlmon_xmit(), in my case I can also trigger an "unable
    to handle kernel paging request". The problem is the skb_clone() in
    __netlink_deliver_tap_skb() for skbs that are mmaped.

    I.e. the cloned skb doesn't have a destructor, whereas the mmap netlink
    skb has it pointed to netlink_skb_destructor(), set in the handler
    netlink_ring_setup_skb(). There, skb->head is being set to NULL, so
    that in such cases, __kfree_skb() doesn't perform a skb_release_data()
    via skb_release_all(), where skb->head is possibly being freed through
    kfree(head) into slab allocator, although netlink mmap skb->head points
    to the mmap buffer. Similarly, the same has to be done also for large
    netlink skbs where the data area is vmalloced. Therefore, as discussed,
    make a copy for these rather rare cases for now. This fixes the issue
    on my and Ken-ichirou's test-cases.

    Reference: http://thread.gmane.org/gmane.linux.network/371129
    Fixes: bcbde0d449ed ("net: netlink: virtual tap device management")
    Reported-by: Ken-ichirou MATSUZAWA
    Signed-off-by: Daniel Borkmann
    Tested-by: Ken-ichirou MATSUZAWA
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Daniel Borkmann
     

30 Sep, 2015

2 commits

  • [ Upstream commit 4e7c1330689e27556de407d3fdadc65ffff5eb12 ]

    Linus reports the following deadlock on rtnl_mutex; triggered only
    once so far (extract):

    [12236.694209] NetworkManager D 0000000000013b80 0 1047 1 0x00000000
    [12236.694218] ffff88003f902640 0000000000000000 ffffffff815d15a9 0000000000000018
    [12236.694224] ffff880119538000 ffff88003f902640 ffffffff81a8ff84 00000000ffffffff
    [12236.694230] ffffffff81a8ff88 ffff880119c47f00 ffffffff815d133a ffffffff81a8ff80
    [12236.694235] Call Trace:
    [12236.694250] [] ? schedule_preempt_disabled+0x9/0x10
    [12236.694257] [] ? schedule+0x2a/0x70
    [12236.694263] [] ? schedule_preempt_disabled+0x9/0x10
    [12236.694271] [] ? __mutex_lock_slowpath+0x7f/0xf0
    [12236.694280] [] ? mutex_lock+0x16/0x30
    [12236.694291] [] ? rtnetlink_rcv+0x10/0x30
    [12236.694299] [] ? netlink_unicast+0xfb/0x180
    [12236.694309] [] ? rtnl_getlink+0x113/0x190
    [12236.694319] [] ? rtnetlink_rcv_msg+0x7a/0x210
    [12236.694331] [] ? sock_has_perm+0x5c/0x70
    [12236.694339] [] ? rtnetlink_rcv+0x30/0x30
    [12236.694346] [] ? netlink_rcv_skb+0x9c/0xc0
    [12236.694354] [] ? rtnetlink_rcv+0x1f/0x30
    [12236.694360] [] ? netlink_unicast+0xfb/0x180
    [12236.694367] [] ? netlink_sendmsg+0x484/0x5d0
    [12236.694376] [] ? __wake_up+0x2f/0x50
    [12236.694387] [] ? sock_sendmsg+0x33/0x40
    [12236.694396] [] ? ___sys_sendmsg+0x22e/0x240
    [12236.694405] [] ? ___sys_recvmsg+0x135/0x1a0
    [12236.694415] [] ? eventfd_write+0x82/0x210
    [12236.694423] [] ? fsnotify+0x32e/0x4c0
    [12236.694429] [] ? wake_up_q+0x60/0x60
    [12236.694434] [] ? __sys_sendmsg+0x39/0x70
    [12236.694440] [] ? entry_SYSCALL_64_fastpath+0x12/0x6a

    So far it seems plausible that the recursive call into rtnetlink_rcv()
    is what's suspicious. One way this could trigger is if the sender's
    NETLINK_CB(skb).portid was wrongly 0 (which is the rtnetlink socket), so
    the rtnl_getlink() request's answer would be sent to the kernel instead
    of to the actual user process, thus grabbing rtnl_mutex() twice.

    One theory is that netlink_autobind(), triggered via netlink_sendmsg(),
    internally overwrites an -EBUSY error with 0, where that error
    wrongly originates from __netlink_insert() instead. That would reset
    the socket's portid to 0, which is then filled into NETLINK_CB(skb).portid
    later on. As commit d470e3b483dc ("[NETLINK]: Fix two socket hashing bugs.")
    also puts it, -EBUSY should not be propagated from netlink_insert().

    It looks like it's very unlikely to reproduce. We need to trigger the
    rhashtable_insert_rehash() handler under a situation where rehashing
    currently occurs (one /rare/ way would be to hit ht->elasticity limits
    while not filled enough to expand the hashtable, but that would rather
    require a specifically crafted bind() sequence with knowledge about
    destination slots, seems unlikely). It probably makes sense to guard
    __netlink_insert() in any case and remap that error. It was suggested
    that EOVERFLOW might be better than an already overloaded ENOMEM.
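    The remap suggested here can be sketched as follows (fake_insert() and
    __fake_insert() are stand-ins, not the kernel's __netlink_insert()):

```c
#include <assert.h>
#include <errno.h>

/* Stand-in for the internal insert, which may fail with various
 * errors, including -EBUSY during a concurrent rehash. */
static int __fake_insert(int simulated_err)
{
    return simulated_err;
}

/* Guard the internal insert: never let -EBUSY escape, since the
 * autobind path treats it as "retry with a new port id".  Remap it to
 * an unambiguous error instead, as suggested in the thread. */
static int fake_insert(int simulated_err)
{
    int err = __fake_insert(simulated_err);

    if (err == -EBUSY)
        err = -EOVERFLOW;
    return err;
}
```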

    Reference: http://thread.gmane.org/gmane.linux.network/372676
    Reported-by: Linus Torvalds
    Signed-off-by: Daniel Borkmann
    Acked-by: Herbert Xu
    Acked-by: Thomas Graf
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Daniel Borkmann
     
  • [ Upstream commit 0470eb99b4721586ccac954faac3fa4472da0845 ]

    Kirill A. Shutemov says:

    This simple test case triggers a few locking asserts in the kernel:

    int main(int argc, char **argv)
    {
            unsigned int block_size = 16 * 4096;
            struct nl_mmap_req req = {
                    .nm_block_size = block_size,
                    .nm_block_nr = 64,
                    .nm_frame_size = 16384,
                    .nm_frame_nr = 64 * block_size / 16384,
            };
            unsigned int ring_size;
            int fd;

            fd = socket(AF_NETLINK, SOCK_RAW, NETLINK_GENERIC);
            if (setsockopt(fd, SOL_NETLINK, NETLINK_RX_RING, &req, sizeof(req)) < 0)
                    exit(1);
            if (setsockopt(fd, SOL_NETLINK, NETLINK_TX_RING, &req, sizeof(req)) < 0)
                    exit(1);

            ring_size = req.nm_block_nr * req.nm_block_size;
            mmap(NULL, 2 * ring_size, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0);
            return 0;
    }

    +++ exited with 0 +++
    BUG: sleeping function called from invalid context at /home/kas/git/public/linux-mm/kernel/locking/mutex.c:616
    in_atomic(): 1, irqs_disabled(): 0, pid: 1, name: init
    3 locks held by init/1:
    #0: (reboot_mutex){+.+...}, at: [] SyS_reboot+0xa9/0x220
    #1: ((reboot_notifier_list).rwsem){.+.+..}, at: [] __blocking_notifier_call_chain+0x39/0x70
    #2: (rcu_callback){......}, at: [] rcu_do_batch.isra.49+0x160/0x10c0
    Preemption disabled at:[] __delay+0xf/0x20

    CPU: 1 PID: 1 Comm: init Not tainted 4.1.0-00009-gbddf4c4818e0 #253
    Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS Debian-1.8.2-1 04/01/2014
    ffff88017b3d8000 ffff88027bc03c38 ffffffff81929ceb 0000000000000102
    0000000000000000 ffff88027bc03c68 ffffffff81085a9d 0000000000000002
    ffffffff81ca2a20 0000000000000268 0000000000000000 ffff88027bc03c98
    Call Trace:
    [] dump_stack+0x4f/0x7b
    [] ___might_sleep+0x16d/0x270
    [] __might_sleep+0x4d/0x90
    [] mutex_lock_nested+0x2f/0x430
    [] ? _raw_spin_unlock_irqrestore+0x5d/0x80
    [] ? __this_cpu_preempt_check+0x13/0x20
    [] netlink_set_ring+0x1ed/0x350
    [] ? netlink_undo_bind+0x70/0x70
    [] netlink_sock_destruct+0x80/0x150
    [] __sk_free+0x1d/0x160
    [] sk_free+0x19/0x20
    [..]

    Cong Wang says:

    We can't hold mutex lock in a rcu callback, [..]

    Thomas Graf says:

    The socket should be dead at this point. It might be simpler to
    add a netlink_release_ring() function which doesn't require
    locking at all.

    Reported-by: "Kirill A. Shutemov"
    Diagnosed-by: Cong Wang
    Suggested-by: Thomas Graf
    Signed-off-by: Florian Westphal
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Florian Westphal
     

17 May, 2015

1 commit

  • The commit c5adde9468b0714a051eac7f9666f23eb10b61f7 ("netlink:
    eliminate nl_sk_hash_lock") breaks the autobind retry mechanism
    because it doesn't reset portid after a failed netlink_insert.

    This means that should autobind fail the first time around, then
    the socket will be stuck in limbo as it can never be bound again
    since it already has a non-zero portid.
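    A minimal sketch of the retry-breaking bug and its fix (toy types and
    names, not the kernel's):

```c
#include <assert.h>
#include <errno.h>

struct toy_sock { unsigned int portid; };

/* Stand-ins for the hashtable insert step. */
static int insert_fail(unsigned int id) { (void)id; return -EBUSY; }
static int insert_ok(unsigned int id)   { (void)id; return 0; }

static int toy_bind(struct toy_sock *sk, unsigned int portid,
                    int (*try_insert)(unsigned int))
{
    int err;

    sk->portid = portid;        /* provisionally set before insert */
    err = try_insert(portid);
    if (err)
        sk->portid = 0;         /* the reset this commit restores;   */
                                /* without it the socket looks bound */
                                /* forever and autobind can't retry  */
    return err;
}
```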

    Fixes: c5adde9468b0 ("netlink: eliminate nl_sk_hash_lock")
    Signed-off-by: Herbert Xu
    Signed-off-by: David S. Miller

    Herbert Xu
     

15 May, 2015

1 commit

    Netlink socket creation and deletion heavily modify nl_table_users
    and nl_table_lock.

    If nl_table shares a cache line with one of them, netlink
    performance is really bad on SMP.

    ffffffff81ff5f00 B nl_table
    ffffffff81ff5f0c b nl_table_users

    Putting nl_table in the read_mostly section increased the performance
    of my open/delete netlink sockets test by about 80%.

    This came up while diagnosing a getaddrinfo() problem.
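    The fix amounts to cache-line separation: keep read-mostly data away
    from frequently written fields. A sketch, assuming a 64-byte cache
    line (names illustrative):

```c
#include <assert.h>
#include <stdalign.h>
#include <stddef.h>

/* Assumed cache-line size; real kernels use L1_CACHE_BYTES. */
#define CACHELINE 64

/* Give the read-mostly pointer its own cache line so that constant
 * stores to the hot fields beside it don't keep invalidating the line
 * every reader pulls. */
struct tables {
    alignas(CACHELINE) void *table;  /* read-mostly */
    alignas(CACHELINE) int   users;  /* written on every create/delete */
    alignas(CACHELINE) int   lock;   /* written on every create/delete */
};
```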

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

04 May, 2015

1 commit

  • We currently limit the hash table size to 64K which is very bad
    as even 10 years ago it was relatively easy to generate millions
    of sockets.

    Since the hash table is naturally limited by memory allocation
    failure, we don't really need an explicit limit so this patch
    removes it.

    Signed-off-by: Herbert Xu
    Acked-by: Thomas Graf
    Signed-off-by: David S. Miller

    Herbert Xu
     

26 Apr, 2015

1 commit

  • When I added pfmemalloc support in build_skb(), I forgot netlink
    was using build_skb() with a vmalloc() area.

    In this patch I introduce __build_skb() for netlink use,
    and build_skb() becomes a wrapper handling both skb->head_frag and
    skb->pfmemalloc.

    This means netlink no longer has to hack skb->head_frag.

    [ 1567.700067] kernel BUG at arch/x86/mm/physaddr.c:26!
    [ 1567.700067] invalid opcode: 0000 [#1] PREEMPT SMP KASAN
    [ 1567.700067] Dumping ftrace buffer:
    [ 1567.700067] (ftrace buffer empty)
    [ 1567.700067] Modules linked in:
    [ 1567.700067] CPU: 9 PID: 16186 Comm: trinity-c182 Not tainted 4.0.0-next-20150424-sasha-00037-g4796e21 #2167
    [ 1567.700067] task: ffff880127efb000 ti: ffff880246770000 task.ti: ffff880246770000
    [ 1567.700067] RIP: __phys_addr (arch/x86/mm/physaddr.c:26 (discriminator 3))
    [ 1567.700067] RSP: 0018:ffff8802467779d8 EFLAGS: 00010202
    [ 1567.700067] RAX: 000041000ed8e000 RBX: ffffc9008ed8e000 RCX: 000000000000002c
    [ 1567.700067] RDX: 0000000000000004 RSI: 0000000000000000 RDI: ffffffffb3fd6049
    [ 1567.700067] RBP: ffff8802467779f8 R08: 0000000000000019 R09: ffff8801d0168000
    [ 1567.700067] R10: ffff8801d01680c7 R11: ffffed003a02d019 R12: ffffc9000ed8e000
    [ 1567.700067] R13: 0000000000000f40 R14: 0000000000001180 R15: ffffc9000ed8e000
    [ 1567.700067] FS: 00007f2a7da3f700(0000) GS:ffff8801d1000000(0000) knlGS:0000000000000000
    [ 1567.700067] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    [ 1567.700067] CR2: 0000000000738308 CR3: 000000022e329000 CR4: 00000000000007e0
    [ 1567.700067] Stack:
    [ 1567.700067] ffffc9000ed8e000 ffff8801d0168000 ffffc9000ed8e000 ffff8801d0168000
    [ 1567.700067] ffff880246777a28 ffffffffad7c0a21 0000000000001080 ffff880246777c08
    [ 1567.700067] ffff88060d302e68 ffff880246777b58 ffff880246777b88 ffffffffad9a6821
    [ 1567.700067] Call Trace:
    [ 1567.700067] build_skb (include/linux/mm.h:508 net/core/skbuff.c:316)
    [ 1567.700067] netlink_sendmsg (net/netlink/af_netlink.c:1633 net/netlink/af_netlink.c:2329)
    [ 1567.774369] ? sched_clock_cpu (kernel/sched/clock.c:311)
    [ 1567.774369] ? netlink_unicast (net/netlink/af_netlink.c:2273)
    [ 1567.774369] ? netlink_unicast (net/netlink/af_netlink.c:2273)
    [ 1567.774369] sock_sendmsg (net/socket.c:614 net/socket.c:623)
    [ 1567.774369] sock_write_iter (net/socket.c:823)
    [ 1567.774369] ? sock_sendmsg (net/socket.c:806)
    [ 1567.774369] __vfs_write (fs/read_write.c:479 fs/read_write.c:491)
    [ 1567.774369] ? get_lock_stats (kernel/locking/lockdep.c:249)
    [ 1567.774369] ? default_llseek (fs/read_write.c:487)
    [ 1567.774369] ? vtime_account_user (kernel/sched/cputime.c:701)
    [ 1567.774369] ? rw_verify_area (fs/read_write.c:406 (discriminator 4))
    [ 1567.774369] vfs_write (fs/read_write.c:539)
    [ 1567.774369] SyS_write (fs/read_write.c:586 fs/read_write.c:577)
    [ 1567.774369] ? SyS_read (fs/read_write.c:577)
    [ 1567.774369] ? __this_cpu_preempt_check (lib/smp_processor_id.c:63)
    [ 1567.774369] ? trace_hardirqs_on_caller (kernel/locking/lockdep.c:2594 kernel/locking/lockdep.c:2636)
    [ 1567.774369] ? trace_hardirqs_on_thunk (arch/x86/lib/thunk_64.S:42)
    [ 1567.774369] system_call_fastpath (arch/x86/kernel/entry_64.S:261)

    Fixes: 79930f5892e ("net: do not deplete pfmemalloc reserve")
    Signed-off-by: Eric Dumazet
    Reported-by: Sasha Levin
    Signed-off-by: David S. Miller

    Eric Dumazet
     

26 Mar, 2015

1 commit

    nftables sets will be converted to use so-called set extensions,
    moving the key to a non-fixed position. To hash it, obj_hashfn must
    be used; however, so far it doesn't receive the length parameter.

    Pass the key length to obj_hashfn() and convert existing users.

    Signed-off-by: Patrick McHardy
    Signed-off-by: Pablo Neira Ayuso

    Patrick McHardy
     

25 Mar, 2015

1 commit


24 Mar, 2015

1 commit

  • This patch removes the explicit jhash value for the hashfn parameter
    of rhashtable. As the key length is a multiple of 4, this means that
    we will actually end up using jhash2.
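    The dispatch rule can be sketched like this (hash_words() is a toy
    mixer standing in for the real jhash2(), which hashes the key as an
    array of u32 words):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Toy word-at-a-time mixer, a stand-in for jhash2(). */
static uint32_t hash_words(const uint32_t *key, size_t nwords,
                           uint32_t seed)
{
    uint32_t h = seed;

    for (size_t i = 0; i < nwords; i++)
        h = h * 2654435761u ^ key[i];
    return h;
}

/* jhash2 is applicable only when the key is a whole number of
 * 32-bit words; otherwise byte-at-a-time jhash must be used. */
static int can_use_word_hash(size_t keylen)
{
    return keylen % 4 == 0;
}
```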

    Signed-off-by: Herbert Xu
    Acked-by: Thomas Graf
    Signed-off-by: David S. Miller

    Herbert Xu
     

21 Mar, 2015

2 commits

  • Instead of computing the offset from trailer, this patch computes
    netlink_compare_arg_len from the offset of portid and then adds 4
    to it. This allows trailer to be removed.
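    A sketch of the offset computation (the struct layout is illustrative,
    not the kernel's netlink_compare_arg):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative key layout: the comparison key runs from the start of
 * the struct up to and including the 4-byte portid field. */
struct compare_arg {
    void     *net;      /* namespace pointer, part of the key */
    uint32_t  portid;   /* last key field */
};

/* Compare length = offset of portid + its 4 bytes; no trailer
 * marker field is needed to delimit the key. */
#define COMPARE_ARG_LEN \
    (offsetof(struct compare_arg, portid) + sizeof(uint32_t))
```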

    Reported-by: David Miller
    Signed-off-by: Herbert Xu
    Signed-off-by: David S. Miller

    Herbert Xu
     
    Currently the namespace is a de facto key because it has to match
    before we find an object in the hash table. However, it isn't
    included in the hash value, so all objects from different namespaces
    with the same port ID hash to the same bucket.

    This is bad as the number of namespaces is unbounded.

    This patch fixes this by using the namespace when doing the hash.

    Because the namespace field doesn't lie next to the portid field
    in the netlink socket, this patch switches over to the rhashtable
    interface without a fixed key.

    This patch also uses the new inlined rhashtable interface where
    possible.
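    Folding the namespace into the hash can be sketched as follows (toy
    mixer, not the kernel's actual hash function):

```c
#include <assert.h>
#include <stdint.h>

/* Toy hash step, a stand-in for the real mixing function. */
static uint32_t mix(uint32_t h, uint32_t v)
{
    return (h ^ v) * 2654435761u;
}

/* Hash both the namespace pointer and the portid, so sockets with the
 * same portid in different namespaces land in different buckets. */
static uint32_t hash_sock(const void *ns, uint32_t portid, uint32_t seed)
{
    uint32_t h = mix(seed, (uint32_t)(uintptr_t)ns);

    return mix(h, portid);
}
```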

    Signed-off-by: Herbert Xu
    Signed-off-by: David S. Miller

    Herbert Xu
     

19 Mar, 2015

1 commit


04 Mar, 2015

1 commit


03 Mar, 2015

1 commit

    Now that TIPC no longer depends on the iocb argument in its internal
    implementations of the sendmsg() and recvmsg() hooks defined in the
    proto structure, no user of the argument remains.
    We can therefore drop the redundant iocb argument completely from all
    implementations of both sendmsg() and recvmsg() across the entire
    networking stack.

    Cc: Christoph Hellwig
    Suggested-by: Al Viro
    Signed-off-by: Ying Xue
    Signed-off-by: David S. Miller

    Ying Xue
     

28 Feb, 2015

1 commit

  • Currently, all real users of rhashtable default their grow and shrink
    decision functions to rht_grow_above_75() and rht_shrink_below_30(),
    so that there's currently no need to have this explicitly selectable.

    It can/should be generic and private inside rhashtable until a real
    use case pops up. Since we can make this private, we'll save us this
    additional indirection layer and can improve insertion/deletion time
    as well.
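    The default thresholds can be sketched as pure predicates (a sketch of
    the 75%/30% rule in integer arithmetic, not the kernel's exact code):

```c
#include <assert.h>
#include <stddef.h>

/* Grow when occupancy exceeds 75% of the bucket count. */
static int grow_above_75(size_t nelems, size_t buckets)
{
    return nelems > buckets * 3 / 4;
}

/* Shrink when occupancy drops below 30% of the bucket count. */
static int shrink_below_30(size_t nelems, size_t buckets)
{
    return nelems * 10 < buckets * 3;
}
```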

    Reference: http://patchwork.ozlabs.org/patch/443040/
    Suggested-by: David S. Miller
    Signed-off-by: Daniel Borkmann
    Acked-by: Thomas Graf
    Signed-off-by: David S. Miller

    Daniel Borkmann
     

06 Feb, 2015

1 commit

  • Conflicts:
    drivers/net/vxlan.c
    drivers/vhost/net.c
    include/linux/if_vlan.h
    net/core/dev.c

    The net/core/dev.c conflict was the overlap of one commit marking an
    existing function static whilst another was adding a new function.

    In the include/linux/if_vlan.h case, the type used for a local
    variable was changed in 'net', whereas the function got rewritten
    to fix a stacked vlan bug in 'net-next'.

    In drivers/vhost/net.c, Al Viro's iov_iter conversions in 'net-next'
    overlapped with an endianness fix for VHOST 1.0 in 'net'.

    In drivers/net/vxlan.c, vxlan_find_vni() added a 'flags' parameter
    in 'net-next' whereas in 'net' there was a bug fix to pass in the
    correct network namespace pointer in calls to this function.

    Signed-off-by: David S. Miller

    David S. Miller
     

05 Feb, 2015

2 commits


04 Feb, 2015

1 commit

  • As it is, zero msg_iovlen means that the first iovec in the kernel
    array of iovecs is left uninitialized, so checking if its ->iov_base
    is NULL is random. Since the real users of that thing are doing
    sendto(fd, NULL, 0, ...), they are getting msg_iovlen = 1 and
    msg_iov[0] = {NULL, 0}, which is what this test is trying to catch.
    As suggested by davem, let's just check that msg_iovlen was 1 and
    msg_iov[0].iov_base was NULL - _that_ is well-defined and it catches
    what we want to catch.
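    The well-defined check being adopted can be sketched as:

```c
#include <assert.h>
#include <stddef.h>
#include <sys/uio.h>

/* Treat the message as "no data requested" only when there is exactly
 * one iovec and its base is NULL, which is what sendto(fd, NULL, 0, ...)
 * produces.  With msg_iovlen == 0 the first iovec is uninitialized, so
 * inspecting it would be reading garbage. */
static int is_null_data(size_t msg_iovlen, const struct iovec *iov)
{
    return msg_iovlen == 1 && iov[0].iov_base == NULL;
}
```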

    Signed-off-by: Al Viro

    Al Viro
     

31 Jan, 2015

1 commit

  • The subscription bitmask passed via struct sockaddr_nl is converted to
    the group number when calling the netlink_bind() and netlink_unbind()
    callbacks.

    The conversion is however incorrect, since bitmask (1 << 0) needs to be
    mapped to group number 1. Note that you cannot specify the group number 0
    (usually known as _NONE) from setsockopt() using NETLINK_ADD_MEMBERSHIP,
    since it is rejected with -EINVAL.
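    The intended mapping can be sketched with ffs(), which returns the
    1-based index of the lowest set bit:

```c
#include <assert.h>
#include <strings.h>   /* ffs() */

/* Map a subscription bitmask to a group number: bit (1 << 0) is group
 * 1, bit (1 << 4) is group 5, and 0 means no group at all. */
static int bitmask_to_group(unsigned int mask)
{
    return ffs((int)mask);
}
```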

    This problem became noticeable since 97840cb ("netfilter: nfnetlink:
    fix insufficient validation in nfnetlink_bind") when binding to bitmask
    (1 << 0) in ctnetlink.

    Reported-by: Andre Tomt
    Reported-by: Ivan Delalande
    Signed-off-by: Pablo Neira Ayuso
    Signed-off-by: David S. Miller

    Pablo Neira
     

29 Jan, 2015

1 commit

  • The sock_iocb structure is allocate on stack for each read/write-like
    operation on sockets, and contains various fields of which only the
    embedded msghdr and sometimes a pointer to the scm_cookie is ever used.
    Get rid of the sock_iocb and put a msghdr directly on the stack and pass
    the scm_cookie explicitly to netlink_mmap_sendmsg.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: David S. Miller

    Christoph Hellwig
     

28 Jan, 2015

1 commit


27 Jan, 2015

1 commit


18 Jan, 2015

1 commit

  • Contrary to common expectations for an "int" return, these functions
    return only a positive value -- if used correctly they cannot even
    return 0 because the message header will necessarily be in the skb.

    This makes the very common pattern of

    if (genlmsg_end(...) < 0) { ... }

    be a whole bunch of dead code. Many places also simply do

    return nlmsg_end(...);

    and the caller is expected to deal with it.

    This also commonly (at least for me) causes errors, because it is very
    common to write

    if (my_function(...))
    /* error condition */

    and if my_function() does "return nlmsg_end()" this is of course wrong.

    Additionally, there's not a single place in the kernel that actually
    needs the message length returned, and if anyone needs it later then
    it'll be very easy to just use skb->len there.

    Remove this, and make the functions void. This removes a bunch of dead
    code as described above. The patch adds lines because I did

    - return nlmsg_end(...);
    + nlmsg_end(...);
    + return 0;

    I could have preserved all the function's return values by returning
    skb->len, but instead I've audited all the places calling the affected
    functions and found that none cared. A few places actually compared
    the return value with < 0 with no change in behaviour, so I opted for the more
    efficient version.
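    The conversion pattern can be sketched like this (end_msg() and
    toy_msg are stand-ins for nlmsg_end() and the skb):

```c
#include <assert.h>

struct toy_msg { int len; int finalized; };

/* After the change: finalize the message and return nothing, instead
 * of returning a positive length that callers mistook for an error. */
static void end_msg(struct toy_msg *m)
{
    m->finalized = 1;   /* was: return m->len; */
}

/* A dump callback converted accordingly: callers that write
 * "if (dump_one(...))" to detect errors now behave correctly. */
static int dump_one(struct toy_msg *m)
{
    end_msg(m);
    return 0;
}
```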

    One instance of the error I've made numerous times now is also present
    in net/phonet/pn_netlink.c in the route_dumpit() function.

    Signed-off-by: Johannes Berg
    Signed-off-by: David S. Miller

    Johannes Berg
     

17 Jan, 2015

2 commits

  • In addition to the problem Jeff Layton reported, I looked at the code
    and reproduced the same warning by subscribing and removing the genl
    family with a socket still open. This is a fairly tricky race which
    originates in the fact that generic netlink allows the family to go
    away while sockets are still open - unlike regular netlink which has
    a module refcount for every open socket so in general this cannot be
    triggered.

    Trying to resolve this issue by the obvious locking isn't possible as
    it will result in deadlocks between unregistration and group unbind
    notification (which incidentally lockdep doesn't find due to the home
    grown locking in the netlink table.)

    To really resolve this, introduce a "closing socket" reference counter
    (for generic netlink only, as it's the only affected family) in the
    core netlink code and use that in generic netlink to wait for all the
    sockets that are being closed at the same time as a generic netlink
    family is removed.

    This fixes the race in which, when a socket is closed, it should call
    the unbind, but if the family is removed at the same time the unbind
    will not find it, leading to the warning. The real problem though is
    that in this case the unbind could actually find a new family that is
    registered to have a multicast group with the same ID, and call its
    mcast_unbind(), leading to confusion.

    Also remove the warning since it would still trigger, but is now no
    longer a problem.

    This also moves the code in af_netlink.c to before unreferencing the
    module to avoid having the same problem in the normal non-genl case.

    Signed-off-by: Johannes Berg
    Signed-off-by: David S. Miller

    Johannes Berg
     
  • Jeff Layton reported that he could trigger the multicast unbind warning
    in generic netlink using trinity. I originally thought it was a race
    condition between unregistering the generic netlink family and closing
    the socket, but there's a far simpler explanation: genetlink currently
    allows subscribing to groups that don't (yet) exist, and the warning is
    triggered when unsubscribing again while the group still doesn't exist.

    Originally, I had a warning in the subscribe case and accepted it out of
    userspace API concerns, but the warning was of course wrong and removed
    later.

    However, I now think that allowing userspace to subscribe to groups that
    don't exist is wrong and could possibly become a security problem:
    Consider a (new) genetlink family implementing a permission check in
    the mcast_bind() function similar to what the audit code does today;
    it would be possible to bypass the permission check by guessing the ID
    and subscribing to the group before it exists. This is only possible if
    such a family were dynamically loaded, but that doesn't seem like a
    huge stretch; for example, wireless may be loaded when you plug in a USB
    device.

    To avoid this, reject such subscription attempts.

    If this ends up causing userspace issues we may need to add a workaround
    in af_netlink to deny such requests but not return an error.

    Reported-by: Jeff Layton
    Signed-off-by: Johannes Berg
    Signed-off-by: David S. Miller

    Johannes Berg
     

16 Jan, 2015

1 commit

  • The patch c5adde9468b0714a051eac7f9666f23eb10b61f7 ("netlink:
    eliminate nl_sk_hash_lock") introduced a bug where the EADDRINUSE
    error has been replaced by ENOMEM. This patch rectifies that
    problem.

    Signed-off-by: Herbert Xu
    Acked-by: Ying Xue
    Signed-off-by: David S. Miller

    Herbert Xu
     

14 Jan, 2015

1 commit

    As rhashtable_lookup_compare_insert() guarantees that search and
    insertion happen atomically, it's safe to eliminate the
    nl_sk_hash_lock. After this, object insertion or removal will
    be protected with a per-bucket lock on the write side, while object
    lookup is guarded with the RCU read lock on the read side.

    Signed-off-by: Ying Xue
    Cc: Thomas Graf
    Acked-by: Thomas Graf
    Signed-off-by: David S. Miller

    Ying Xue
     

04 Jan, 2015

4 commits

    Defers the release of the socket reference using call_rcu() to
    allow using an RCU read-side protected call to rhashtable_lookup().

    This restores behaviour and performance gains as previously
    introduced by e341694 ("netlink: Convert netlink_lookup() to use
    RCU protected hash table") without the side effect of severely
    delayed socket destruction.

    Signed-off-by: Thomas Graf
    Signed-off-by: David S. Miller

    Thomas Graf
     
  • Introduces an array of spinlocks to protect bucket mutations. The number
    of spinlocks per CPU is configurable and selected based on the hash of
    the bucket. This allows for parallel insertions and removals of entries
    which do not share a lock.

    The patch also defers expansion and shrinking to a worker queue which
    allows insertion and removal from atomic context. Insertions and
    deletions may occur in parallel to it and are only held up briefly
    while the particular bucket is linked or unzipped.

    Mutations of the bucket table pointer are protected by a new mutex;
    read access is RCU protected.

    In the event of an expansion or shrinking, the new bucket table allocated
    is exposed as a so called future table as soon as the resize process
    starts. Lookups, deletions, and insertions will briefly use both tables.
    The future table becomes the main table after an RCU grace period,
    once initial linking of the old to the new table has been performed.
    Optimization of the chains to make use of the new number of buckets
    follows only once the new table is in use.

    The side effect of this is that during that RCU grace period, a bucket
    traversal using any rht_for_each() variant on the main table will not see
    any insertions performed during the RCU grace period which would at that
    point land in the future table. The lookup will see them as it searches
    both tables if needed.

    Having multiple insertions and removals occur in parallel requires nelems
    to become an atomic counter.
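    The per-bucket lock selection can be sketched as follows (the lock
    count, the plain-int "lock" type, and the masking scheme are all
    illustrative, not the kernel's):

```c
#include <assert.h>
#include <stdint.h>

/* Fixed pool of locks; NLOCKS is a power of two so the bucket index
 * can be mapped to a lock with a cheap mask. */
#define NLOCKS 16

static int bucket_locks[NLOCKS];

/* Two different buckets usually map to different locks, so their
 * mutations can proceed in parallel; buckets that share a lock are
 * simply serialized against each other. */
static int *lock_for_bucket(uint32_t bucket)
{
    return &bucket_locks[bucket & (NLOCKS - 1)];
}
```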

    Signed-off-by: Thomas Graf
    Signed-off-by: David S. Miller

    Thomas Graf
     
  • This patch is in preparation to introduce per bucket spinlocks. It
    extends all iterator macros to take the bucket table and bucket
    index. It also introduces a new rht_dereference_bucket() to
    handle protected accesses to buckets.

    It introduces a barrier() in the RCU iterators to prevent
    the compiler from caching the first element.

    The lockdep verifier is introduced as a stub which always succeeds;
    it is properly implemented in the next patch, when the locks are
    introduced.

    Signed-off-by: Thomas Graf
    Signed-off-by: David S. Miller

    Thomas Graf
     
    Hash the key inside of rhashtable_lookup_compare() like
    rhashtable_lookup() does. This allows us to simplify the hashing
    functions and keep them private.

    Signed-off-by: Thomas Graf
    Cc: netfilter-devel@vger.kernel.org
    Signed-off-by: David S. Miller

    Thomas Graf
     

30 Dec, 2014

1 commit


27 Dec, 2014

2 commits

  • Netlink families can exist in multiple namespaces, and for the most
    part multicast subscriptions are per network namespace. Thus it only
    makes sense to have bind/unbind notifications per network namespace.

    To achieve this, pass the network namespace of a given client socket
    to the bind/unbind functions.

    Also do this in generic netlink, and there also make sure that any
    bind for multicast groups that only exist in init_net is rejected.
    This isn't really a problem if it is accepted, since a client in a
    different namespace will never receive any notifications from such
    a group, but it can confuse the family if not rejected. (It's also
    possible to silently (without telling the family) accept it, but it
    would also have to be ignored on unbind, so families that take any
    kind of action on bind/unbind won't do unnecessary work for invalid
    clients like that.)

    Signed-off-by: Johannes Berg
    Signed-off-by: David S. Miller

    Johannes Berg
     
    In order to make use of the newly fixed multicast bind/unbind
    functionality in generic netlink, pass the bind/unbind calls down
    to the appropriate family.

    Signed-off-by: Johannes Berg
    Signed-off-by: David S. Miller

    Johannes Berg