17 May, 2016

1 commit

  • When we free cb->skb after a dump, we do it after releasing the
    lock. This means that a new dump could have started in the meantime
    and we'll end up freeing their skb instead of ours.

    This patch saves the skb and module before we unlock so we free
    the right memory.
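
    The pattern of the fix can be sketched in plain userspace C (a minimal
    illustration, not the kernel code: a pthread mutex and free() stand in
    for the kernel's callback lock and skb freeing, and struct cb is
    simplified):

```c
/* Snapshot the shared pointer while the lock is held, clear the slot,
 * unlock, then free the snapshot.  Freeing cb->skb itself after the
 * unlock could free the skb of a dump that started in the meantime. */
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>

struct cb { void *skb; };

static pthread_mutex_t cb_lock = PTHREAD_MUTEX_INITIALIZER;

void finish_dump(struct cb *cb)
{
	pthread_mutex_lock(&cb_lock);
	void *skb = cb->skb;	/* save ours while still locked   */
	cb->skb = NULL;		/* slot reusable by the next dump */
	pthread_mutex_unlock(&cb_lock);
	free(skb);		/* safe: no one else sees this pointer */
}
```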

    Fixes: 16b304f3404f ("netlink: Eliminate kmalloc in netlink dump operation.")
    Reported-by: Baozeng Ding
    Signed-off-by: Herbert Xu
    Acked-by: Cong Wang
    Signed-off-by: David S. Miller

    Herbert Xu
     

24 Apr, 2016

1 commit


11 Apr, 2016

1 commit

  • All existing users of NETLINK_URELEASE use it to clean up resources that
    were previously allocated to a socket via some command. As a result, no
    users require getting this notification for unbound sockets.

    Sending it for unbound sockets, however, is a problem because any user
    (including unprivileged users) can create a socket that uses the same ID
    as an existing socket. Binding this new socket will fail, but if the
    NETLINK_URELEASE notification is generated for such sockets, the users
    thereof will be tricked into thinking the socket that they allocated the
    resources for is closed.

    In the nl80211 case, this will cause destruction of virtual interfaces
    that still belong to an existing hostapd process; this is the case that
    Dmitry noticed. In the NFC case, it will cause a poll abort. In the case
    of netlink log/queue it will cause them to stop reporting events, as if
    NFULNL_CFG_CMD_UNBIND/NFQNL_CFG_CMD_UNBIND had been called.

    Fix this problem by checking that the socket is bound before generating
    the NETLINK_URELEASE notification.
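
    The collision scenario above can be demonstrated from unprivileged
    userspace (a hedged sketch; the nl_pid value is arbitrary): the second
    bind to an already-owned ID fails, leaving that socket unbound, and
    before this fix closing it still generated NETLINK_URELEASE for the
    shared ID.

```c
/* Two netlink sockets, both bound to the same explicit nl_pid: the
 * second bind fails with EADDRINUSE and the socket stays unbound. */
#include <assert.h>
#include <errno.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>
#include <linux/netlink.h>

/* Returns 1 if the second bind to the same nl_pid fails as expected. */
int second_bind_fails(void)
{
	struct sockaddr_nl addr = {
		.nl_family = AF_NETLINK,
		.nl_pid    = 0x4a5b6c7d,	/* arbitrary unused ID */
	};
	int a = socket(AF_NETLINK, SOCK_RAW, NETLINK_ROUTE);
	int b = socket(AF_NETLINK, SOCK_RAW, NETLINK_ROUTE);
	int ok = 0;

	if (a < 0 || b < 0) {
		if (a >= 0) close(a);
		if (b >= 0) close(b);
		return -1;
	}
	if (bind(a, (struct sockaddr *)&addr, sizeof(addr)) == 0 &&
	    bind(b, (struct sockaddr *)&addr, sizeof(addr)) < 0 &&
	    errno == EADDRINUSE)
		ok = 1;		/* b is still unbound */
	close(a);
	close(b);
	return ok;
}
```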

    Cc: stable@vger.kernel.org
    Signed-off-by: Dmitry Ivanov
    Signed-off-by: Johannes Berg
    Signed-off-by: David S. Miller

    Dmitry Ivanov
     

05 Apr, 2016

1 commit

  • In certain cases, the 802.11 mesh pathtable code wants to
    iterate over all of the entries in the forwarding table from
    the receive path, which is inside an RCU read-side critical
    section. Enable walks inside atomic sections by allowing
    GFP_ATOMIC allocations for the walker state.

    Change all existing callsites to pass in GFP_KERNEL.

    Acked-by: Thomas Graf
    Signed-off-by: Bob Copeland
    [also adjust gfs2/glock.c and rhashtable tests]
    Signed-off-by: Johannes Berg

    Bob Copeland
     

23 Mar, 2016

1 commit

  • By returning -ENOIOCTLCMD, sock_do_ioctl() falls back to calling
    dev_ioctl(), which provides support for NIC driver ioctls, which
    includes ethtool support. This is similar to the way ioctls are handled
    in udp.c or tcp.c.

    This removes the requirement that ethtool for example be tied to the
    support of a specific L3 protocol (ethtool uses an AF_INET socket
    today).

    Signed-off-by: David Decotigny
    Signed-off-by: David S. Miller

    David Decotigny
     

19 Feb, 2016

3 commits

  • This reverts commit 3ab1f683bf8b ("nfnetlink: add support for memory mapped
    netlink").

    Like previous commits in the series, remove wrappers that are not needed
    after mmapped netlink removal.

    Signed-off-by: Florian Westphal
    Signed-off-by: David S. Miller

    Florian Westphal
     
  • This reverts commit bb9b18fb55b0 ("genl: Add genlmsg_new_unicast() for
    unicast message allocation").

    Nothing wrong with it; it's no longer needed, since it existed only for
    mmapped netlink support.

    Signed-off-by: Florian Westphal
    Signed-off-by: David S. Miller

    Florian Westphal
     
  • mmapped netlink has a number of unresolved issues:

    - TX zerocopy support had to be disabled more than a year ago via
    commit 4682a0358639b29cf ("netlink: Always copy on mmap TX.")
    because the content of the mmapped area can change after netlink
    attribute validation but before message processing.

    - RX support was implemented mainly to speed up nfqueue dumping packet
    payload to userspace. However, since commit ae08ce0021087a5d812d2
    ("netfilter: nfnetlink_queue: zero copy support") we avoid one copy
    with the socket-based interface too (via the skb_zerocopy helper).

    The other problem is that skbs attached to an mmaped netlink socket
    behave differently from normal skbs:

    - they don't have a shinfo area, so all functions that use skb_shinfo()
    (e.g. skb_clone) cannot be used.

    - reserving headroom prevents userspace from seeing the content as
    it expects message to start at skb->head.
    See for instance
    commit aa3a022094fa ("netlink: not trim skb for mmaped socket when dump").

    - skbs handed e.g. to netlink_ack must have non-NULL skb->sk, else we
    crash because it needs the sk to check if a tx ring is attached.

    This is also not obvious and leads to non-intuitive bug fixes such as
    7c7bdf359 ("netfilter: nfnetlink: use original skbuff when acking batches").

    mmaped netlink also didn't play nicely with the skb_zerocopy helper
    used by nfqueue and openvswitch. Daniel Borkmann fixed this via
    commit 6bb0fef489f6 ("netlink, mmap: fix edge-case leakages in nf queue
    zero-copy") but at the cost of also needing to provide remaining
    length to the allocation function.

    nfqueue also has problems when used with mmaped rx netlink:
    - mmaped netlink doesn't allow use of nfqueue batch verdict messages.
    The problem is that in the mmap case, allocation time also determines
    the ordering in which the frame will be seen by userspace (A
    allocating before B means that A is located in an earlier ring slot,
    but B might still get a lower sequence number than A, since the seqno
    is decided later). To fix this we would need to extend the
    spinlocked region to also cover the allocation and message setup, which
    isn't desirable.
    - nfqueue can now be configured to queue large (GSO) skbs to userspace.
    Queueing GSO packets is faster than forcing a software segmentation
    in the kernel, so this is a desirable option. However, with an mmap based
    ring one has to use 64kb per ring slot element, else mmap has to fall back
    to the socket path (NL_MMAP_STATUS_COPY) for all large packets.

    To use the mmap interface, userspace not only has to probe for mmap netlink
    support, it also has to implement a recv/socket receive path in order to
    handle messages that exceed the size of an rx ring element.

    Cc: Daniel Borkmann
    Cc: Ken-ichirou MATSUZAWA
    Cc: Pablo Neira Ayuso
    Cc: Patrick McHardy
    Cc: Thomas Graf
    Signed-off-by: Florian Westphal
    Signed-off-by: David S. Miller

    Florian Westphal
     

11 Feb, 2016

1 commit

  • Operations with the GENL_ADMIN_PERM flag fail permissions checks because
    this flag means we call netlink_capable, which uses the init user ns.

    Instead, let's introduce a new flag, GENL_UNS_ADMIN_PERM for operations
    which should be allowed inside a user namespace.

    The motivation for this is to be able to run openvswitch in unprivileged
    containers. I've tested this and it seems to work, but I really have no
    idea about the security consequences of this patch, so thoughts would be
    much appreciated.

    v2: use the GENL_UNS_ADMIN_PERM flag instead of a check in each function
    v3: use separate ifs for UNS_ADMIN_PERM and ADMIN_PERM, instead of one
    massive one
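
    The "separate ifs" dispatch of v3 can be mocked in plain C (flag
    values and the two capability predicates are stand-ins for
    illustration; the real code checks netlink_capable() against the init
    namespace and a namespace-aware check for GENL_UNS_ADMIN_PERM):

```c
/* Two independent permission gates: one per flag, instead of one
 * combined condition. */
#include <assert.h>
#include <errno.h>

#define GENL_ADMIN_PERM		0x01
#define GENL_UNS_ADMIN_PERM	0x10	/* value assumed for the sketch */

static int init_ns_admin;	/* stand-in for init-ns CAP_NET_ADMIN     */
static int user_ns_admin;	/* stand-in for user-ns CAP_NET_ADMIN     */

int genl_perm_check(unsigned int flags)
{
	if ((flags & GENL_ADMIN_PERM) && !init_ns_admin)
		return -EPERM;
	if ((flags & GENL_UNS_ADMIN_PERM) && !user_ns_admin)
		return -EPERM;
	return 0;
}
```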

    Reported-by: James Page
    Signed-off-by: Tycho Andersen
    CC: Eric Biederman
    CC: Pravin Shelar
    CC: Justin Pettit
    CC: "David S. Miller"
    Acked-by: Pravin B Shelar
    Signed-off-by: David S. Miller

    Tycho Andersen
     

30 Jan, 2016

1 commit

  • We should not trim the skb for an mmaped socket, since its buffer size
    is fixed and userspace reads each frame expecting the data to start at
    skb->head. An mmaped socket never calls recvmsg, so max_recvmsg_len is
    0 and skb_reserve was not called before commit db65a3aaf29e.

    Fixes: db65a3aaf29e (netlink: Trim skb to alloc size to avoid MSG_TRUNC)
    Signed-off-by: Ken-ichirou MATSUZAWA
    Signed-off-by: David S. Miller

    Ken-ichirou MATSUZAWA
     

13 Jan, 2016

3 commits


16 Dec, 2015

1 commit


07 Nov, 2015

1 commit

  • …d avoiding waking kswapd

    __GFP_WAIT has been used to identify atomic context in callers that hold
    spinlocks or are in interrupts. They are expected to be high priority and
    have access one of two watermarks lower than "min" which can be referred
    to as the "atomic reserve". __GFP_HIGH users get access to the first
    lower watermark and can be called the "high priority reserve".

    Over time, callers had a requirement to not block when fallback options
    were available. Some have abused __GFP_WAIT, leading to a situation where
    an optimistic allocation with a fallback option can access atomic
    reserves.

    This patch uses __GFP_ATOMIC to identify callers that are truly atomic,
    cannot sleep and have no alternative. High priority users continue to use
    __GFP_HIGH. __GFP_DIRECT_RECLAIM identifies callers that can sleep and
    are willing to enter direct reclaim. __GFP_KSWAPD_RECLAIM identifies
    callers that want to wake kswapd for background reclaim. __GFP_WAIT is
    redefined as a caller that is willing to enter direct reclaim and wake
    kswapd for background reclaim.

    This patch then converts a number of sites:

    o __GFP_ATOMIC is used by callers that are high priority and have memory
    pools for those requests. GFP_ATOMIC uses this flag.

    o Callers that have a limited mempool to guarantee forward progress clear
    __GFP_DIRECT_RECLAIM but keep __GFP_KSWAPD_RECLAIM. bio allocations fall
    into this category where kswapd will still be woken but atomic reserves
    are not used as there is a one-entry mempool to guarantee progress.

    o Callers that are checking if they are non-blocking should use the
    helper gfpflags_allow_blocking() where possible. This is because
    checking for __GFP_WAIT as was done historically now can trigger false
    positives. Some exceptions like dm-crypt.c exist where the code intent
    is clearer if __GFP_DIRECT_RECLAIM is used instead of the helper due to
    flag manipulations.

    o Callers that built their own GFP flags instead of starting with GFP_KERNEL
    and friends now also need to specify __GFP_KSWAPD_RECLAIM.

    The first key hazard to watch out for is callers that removed __GFP_WAIT
    and were depending on access to atomic reserves for inconspicuous reasons.
    In some cases it may be appropriate for them to use __GFP_HIGH.

    The second key hazard is callers that assembled their own combination of
    GFP flags instead of starting with something like GFP_KERNEL. They may
    now wish to specify __GFP_KSWAPD_RECLAIM. It's almost certainly harmless
    if it's missed in most cases as other activity will wake kswapd.
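
    The new semantics can be sketched with a toy model (constant values
    here are assumed for illustration; the real definitions live in
    include/linux/gfp.h, and GFP_KERNEL additionally carries IO/FS bits):
    blocking is keyed off __GFP_DIRECT_RECLAIM alone, which is exactly why
    testing the old __GFP_WAIT mask now gives false positives.

```c
/* gfpflags_allow_blocking() per the description above: GFP_ATOMIC
 * still wakes kswapd (__GFP_KSWAPD_RECLAIM) but never blocks, so only
 * __GFP_DIRECT_RECLAIM decides whether the caller may sleep. */
#include <assert.h>
#include <stdbool.h>

#define __GFP_HIGH		0x20u		/* values assumed */
#define __GFP_ATOMIC		0x80000u
#define __GFP_DIRECT_RECLAIM	0x400000u
#define __GFP_KSWAPD_RECLAIM	0x2000000u

#define __GFP_RECLAIM	(__GFP_DIRECT_RECLAIM | __GFP_KSWAPD_RECLAIM)
#define GFP_ATOMIC	(__GFP_HIGH | __GFP_ATOMIC | __GFP_KSWAPD_RECLAIM)
#define GFP_KERNEL	__GFP_RECLAIM	/* plus IO/FS bits in reality */
#define GFP_NOWAIT	__GFP_KSWAPD_RECLAIM

static bool gfpflags_allow_blocking(unsigned int gfp_flags)
{
	return gfp_flags & __GFP_DIRECT_RECLAIM;
}
```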

    Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
    Acked-by: Vlastimil Babka <vbabka@suse.cz>
    Acked-by: Michal Hocko <mhocko@suse.com>
    Acked-by: Johannes Weiner <hannes@cmpxchg.org>
    Cc: Christoph Lameter <cl@linux.com>
    Cc: David Rientjes <rientjes@google.com>
    Cc: Vitaly Wool <vitalywool@gmail.com>
    Cc: Rik van Riel <riel@redhat.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

    Mel Gorman
     

24 Oct, 2015

1 commit

  • Conflicts:
    net/ipv6/xfrm6_output.c
    net/openvswitch/flow_netlink.c
    net/openvswitch/vport-gre.c
    net/openvswitch/vport-vxlan.c
    net/openvswitch/vport.c
    net/openvswitch/vport.h

    The openvswitch conflicts were overlapping changes. One was
    the egress tunnel info fix in 'net' and the other was the
    vport ->send() op simplification in 'net-next'.

    The xfrm6_output.c conflicts was also a simplification
    overlapping a bug fix.

    Signed-off-by: David S. Miller

    David S. Miller
     

22 Oct, 2015

1 commit

  • Currently, NETLINK_LIST_MEMBERSHIPS grabs the netlink table while copying
    the membership state to user-space. However, grabbing the netlink table is
    effectively a write_lock_irq(), and as such we should not be triggering
    page-faults in the critical section.

    This can be easily reproduced by the following snippet:

        int s = socket(AF_NETLINK, SOCK_RAW, NETLINK_ROUTE);
        void *p = mmap(0, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANON, -1, 0);
        int r = getsockopt(s, 0x10e, 9, p, (void*)((char*)p + 4092));

    This should work just fine, but currently triggers EFAULT and a possible
    WARN_ON below handle_mm_fault().

    Fix this by reducing locking of NETLINK_LIST_MEMBERSHIPS to a read-side
    lock. The write-lock was overkill in the first place, and the read-lock
    allows page-faults just fine.

    Reported-by: Dmitry Vyukov
    Signed-off-by: David Herrmann
    Signed-off-by: David S. Miller

    David Herrmann
     

20 Oct, 2015

1 commit


19 Oct, 2015

1 commit

  • netlink_dump() allocates skb based on the calculated min_dump_alloc or
    a per socket max_recvmsg_len.
    min_alloc_size is the maximum space required for any single netdev's
    attributes, as calculated by rtnl_calcit().
    max_recvmsg_len tracks the user provided buffer to netlink_recvmsg.
    It is capped at 16KiB.
    The intention is to avoid small allocations and to minimize the number
    of calls required to obtain dump information for all net devices.

    netlink_dump packs as many small messages as can fit within an skb
    that was sized for the largest single netdev's information. The actual
    space available within an skb is larger than what is requested; it can
    be much larger, up to nearly 2x, due to the align-to-next-power-of-2
    approach.

    Allowing netlink_dump to use all the space available within the
    allocated skb increases the buffer size a user has to provide to avoid
    truncation (i.e. the MSG_TRUNC flag being set).

    It was observed that with many VLANs configured on at least one netdev,
    a larger buffer of nearly 64KiB was necessary to avoid a "Message truncated"
    error in "ip link" or "bridge [-c[ompressvlans]] vlan show" when
    min_alloc_size was only a little over 32KiB.

    This patch trims skb to allocated size in order to allow the user to
    avoid truncation with more reasonable buffer size.
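
    On the userspace side, the usual way to avoid guessing the buffer size
    is to peek the datagram first (a hedged sketch; requires a working
    NETLINK_ROUTE socket):

```c
/* recv() with MSG_PEEK | MSG_TRUNC returns the full length of the next
 * queued dump datagram even though only one byte of buffer is given,
 * and leaves the datagram queued for the real read. */
#include <assert.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>
#include <linux/netlink.h>
#include <linux/rtnetlink.h>

/* Returns the byte size of the first reply to an RTM_GETLINK dump. */
ssize_t peek_dump_size(void)
{
	struct {
		struct nlmsghdr  nlh;
		struct ifinfomsg ifm;
	} req;
	int s = socket(AF_NETLINK, SOCK_RAW, NETLINK_ROUTE);
	char byte;
	ssize_t n;

	if (s < 0)
		return -1;
	memset(&req, 0, sizeof(req));
	req.nlh.nlmsg_len   = sizeof(req);
	req.nlh.nlmsg_type  = RTM_GETLINK;
	req.nlh.nlmsg_flags = NLM_F_REQUEST | NLM_F_DUMP;
	req.nlh.nlmsg_seq   = 1;
	if (send(s, &req, sizeof(req), 0) < 0) {
		close(s);
		return -1;
	}
	n = recv(s, &byte, 1, MSG_PEEK | MSG_TRUNC);
	close(s);
	return n;
}
```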

    Signed-off-by: Ronen Arad
    Signed-off-by: David S. Miller

    Arad, Ronen
     

09 Oct, 2015

1 commit


27 Sep, 2015

1 commit


25 Sep, 2015

2 commits

  • The genl_notify function has too many arguments for no real reason - all
    callers use genl_info to get them anyway. Just pass the genl_info down to
    genl_notify.

    Signed-off-by: Jiri Benc
    Signed-off-by: David S. Miller

    Jiri Benc
     
  • On Mon, Sep 21, 2015 at 02:20:22PM -0400, Tejun Heo wrote:
    >
    > store_release and load_acquire are different from the usual memory
    > barriers and can't be paired this way. You have to pair store_release
    > and load_acquire. Besides, it isn't a particularly good idea to

    OK I've decided to drop the acquire/release helpers as they don't
    help us at all and simply pessimises the code by using full memory
    barriers (on some architectures) where only a write or read barrier
    is needed.

    > depend on memory barriers embedded in other data structures like the
    > above. Here, especially, rhashtable_insert() would have write barrier
    > *before* the entry is hashed not necessarily *after*, which means that
    > in the above case, a socket which appears to have set bound to a
    > reader might not visible when the reader tries to look up the socket
    > on the hashtable.

    But you are right we do need an explicit write barrier here to
    ensure that the hashing is visible.

    > There's no reason to be overly smart here. This isn't a crazy hot
    > path, write barriers tend to be very cheap, store_release more so.
    > Please just do smp_store_release() and note what it's paired with.

    It's not about being overly smart. It's about actually understanding
    what's going on with the code. I've seen too many instances of
    people simply sprinkling synchronisation primitives around without
    any knowledge of what is happening underneath, which is just a recipe
    for creating hard-to-debug races.

    > > @@ -1539,7 +1546,7 @@ static int netlink_bind(struct socket *sock, struct sockaddr *addr,
    > > }
    > > }
    > >
    > > - if (!nlk->portid) {
    > > + if (!nlk->bound) {
    >
    > I don't think you can skip load_acquire here just because this is the
    > second deref of the variable. That doesn't change anything. Race
    > condition could still happen between the first and second tests and
    > skipping the second would lead to the same kind of bug.

    The reason this one is OK is because we do not use nlk->portid or
    try to get nlk from the hash table before we return to user-space.

    However, there is a real bug here that none of these acquire/release
    helpers discovered. The two bound tests here used to be a single
    one. Now that they are separate it is entirely possible for another
    thread to come in the middle and bind the socket. So we need to
    repeat the portid check in order to maintain consistency.

    > > @@ -1587,7 +1594,7 @@ static int netlink_connect(struct socket *sock, struct sockaddr *addr,
    > > !netlink_allowed(sock, NL_CFG_F_NONROOT_SEND))
    > > return -EPERM;
    > >
    > > - if (!nlk->portid)
    > > + if (!nlk->bound)
    >
    > Don't we need load_acquire here too? Is this path holding a lock
    > which makes that unnecessary?

    Ditto.

    ---8<---
    Checking nlk->bound once in netlink_bind fixes
    a race where two threads that bind the socket at the same time
    with different port IDs may both succeed.
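
    The store_release/load_acquire pairing discussed in the thread can be
    illustrated with C11 atomics (a standalone sketch, not the kernel
    code; "portid" and "bound" here are local stand-ins): a reader that
    observes bound != 0 via an acquire load is guaranteed to also see the
    portid written before the release store.

```c
/* Publish a plain payload behind an atomic flag: release on the
 * writer, acquire on the reader. */
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>

static int portid;		/* payload, plain write */
static atomic_int bound;	/* publication flag     */

static void *binder(void *arg)
{
	(void)arg;
	portid = 4242;		/* write payload first...           */
	atomic_store_explicit(&bound, 1, memory_order_release);
	return NULL;		/* ...then publish                  */
}

int observe_portid(void)
{
	pthread_t t;
	int seen;

	pthread_create(&t, NULL, binder, NULL);
	while (!atomic_load_explicit(&bound, memory_order_acquire))
		;		/* spin until the release store lands */
	seen = portid;		/* acquire pairs with the release     */
	pthread_join(t, NULL);
	return seen;
}
```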

    Fixes: 1f770c0a09da ("netlink: Fix autobind race condition that leads to zero port ID")
    Reported-by: Tejun Heo
    Reported-by: Linus Torvalds
    Signed-off-by: Herbert Xu
    Nacked-by: Tejun Heo
    Signed-off-by: David S. Miller

    Herbert Xu
     

21 Sep, 2015

1 commit

  • The commit c0bb07df7d981e4091432754e30c9c720e2c0c78 ("netlink:
    Reset portid after netlink_insert failure") introduced a race
    condition where if two threads try to autobind the same socket
    one of them may end up with a zero port ID. This led to kernel
    deadlocks that were observed by multiple people.

    This patch reverts that commit and instead fixes it by introducing
    a separate rhash_portid variable so that the real portid is only set
    after the socket has been successfully hashed.

    Fixes: c0bb07df7d98 ("netlink: Reset portid after netlink_insert failure")
    Reported-by: Tejun Heo
    Reported-by: Linus Torvalds
    Signed-off-by: Herbert Xu
    Signed-off-by: David S. Miller

    Herbert Xu
     

12 Sep, 2015

1 commit

  • Ken-ichirou reported that running netlink in mmap mode for receive in
    combination with nlmon will throw a NULL pointer dereference in
    __kfree_skb() on nlmon_xmit(), in my case I can also trigger an "unable
    to handle kernel paging request". The problem is the skb_clone() in
    __netlink_deliver_tap_skb() for skbs that are mmaped.

    I.e. the cloned skb doesn't have a destructor, whereas the mmap netlink
    skb has it pointed to netlink_skb_destructor(), set in the handler
    netlink_ring_setup_skb(). There, skb->head is being set to NULL, so
    that in such cases, __kfree_skb() doesn't perform a skb_release_data()
    via skb_release_all(), where skb->head is possibly being freed through
    kfree(head) into slab allocator, although netlink mmap skb->head points
    to the mmap buffer. Similarly, the same has to be done also for large
    netlink skbs where the data area is vmalloced. Therefore, as discussed,
    make a copy for these rather rare cases for now. This fixes the issue
    on my and Ken-ichirou's test-cases.

    Reference: http://thread.gmane.org/gmane.linux.network/371129
    Fixes: bcbde0d449ed ("net: netlink: virtual tap device management")
    Reported-by: Ken-ichirou MATSUZAWA
    Signed-off-by: Daniel Borkmann
    Tested-by: Ken-ichirou MATSUZAWA
    Signed-off-by: David S. Miller

    Daniel Borkmann
     

10 Sep, 2015

2 commits

  • When netlink mmap on receive side is the consumer of nf queue data,
    it can happen that in some edge cases, we write skb shared info into
    the user space mmap buffer:

    Assume a possible rx ring frame size of only 4096, and the network skb,
    which is being zero-copied into the netlink skb, contains page frags
    with an overall skb->len larger than the linear part of the netlink
    skb.

    skb_zerocopy(), which is generic and thus not aware of the fact that
    shared info cannot be accessed for such skbs then tries to write and
    fill frags, thus leaking kernel data/pointers and in some corner cases
    possibly writing out of bounds of the mmap area (when filling the
    last slot in the ring buffer this way).

    I.e. the ring buffer slot is then of status NL_MMAP_STATUS_VALID, has
    an advertised length larger than 4096, where the linear part is visible
    at the slot beginning, and the leaked sizeof(struct skb_shared_info)
    has been written to the beginning of the next slot (also corrupting
    the struct nl_mmap_hdr slot header incl. status etc), since skb->end
    points to skb->data + ring->frame_size - NL_MMAP_HDRLEN.

    The fix adds and lets __netlink_alloc_skb() take the actual needed
    linear room for the network skb + meta data into account. It's completely
    irrelevant for non-mmaped netlink sockets, but in case mmap sockets
    are used, it can be decided whether the available skb_tailroom() is
    really large enough for the buffer, or whether it needs to internally
    fallback to a normal alloc_skb().

    From the nf queue side, the information whether the destination port is
    an mmap RX ring is not really available without extra port-to-socket
    lookup, thus it can only be determined in lower layers i.e. when
    __netlink_alloc_skb() is called that checks internally for this. I
    chose to add the extra ldiff parameter as mmap will then still work:
    We have data_len and hlen in nfqnl_build_packet_message(), data_len
    is the full length (capped at queue->copy_range) for skb_zerocopy()
    and hlen some possible part of data_len that needs to be copied; the
    rem_len variable indicates the needed remaining linear mmap space.

    The only other workaround in nf queue internally would be after
    allocation time by f.e. cap'ing the data_len to the skb_tailroom()
    iff we deal with an mmap skb, but that would 1) expose the fact that
    we use a mmap skb to upper layers, and 2) trim the skb where we
    otherwise could just have moved the full skb into the normal receive
    queue.

    After the patch, in my test case the ring slot doesn't fit and therefore
    shows NL_MMAP_STATUS_COPY, where a full skb carries all the data and
    thus needs to be picked up via recv().

    Fixes: 3ab1f683bf8b ("nfnetlink: add support for memory mapped netlink")
    Signed-off-by: Daniel Borkmann
    Signed-off-by: David S. Miller

    Daniel Borkmann
     
  • In case of netlink mmap, there can be situations where received frames
    have to be placed into the normal receive queue. The ring buffer indicates
    this through NL_MMAP_STATUS_COPY, so the user is asked to pick them up
    via recvmsg(2) syscall, and to put the slot back to NL_MMAP_STATUS_UNUSED.

    Commit 0ef707700f1c ("netlink: rx mmap: fix POLLIN condition") changed
    polling, so that we walk in the worst case the whole ring through the
    new netlink_has_valid_frame(), for example, when the ring would have no
    NL_MMAP_STATUS_VALID, but at least one NL_MMAP_STATUS_COPY frame.

    Since we do a datagram_poll() already earlier to pick up a mask that could
    possibly contain POLLIN | POLLRDNORM already (due to NL_MMAP_STATUS_COPY),
    we can skip checking the rx ring entirely.

    In case the kernel is compiled with !CONFIG_NETLINK_MMAP, then all this is
    irrelevant anyway as netlink_poll() is just defined as datagram_poll().

    Signed-off-by: Daniel Borkmann
    Signed-off-by: David S. Miller

    Daniel Borkmann
     

31 Aug, 2015

1 commit

  • Poll() returns immediately after user space sets the kernel's current
    frame (ring->head) to SKIP, even though there is no new frame. And in
    the case where all frames are VALID, if the user space program
    unintentionally sets (only) the kernel's current frame to UNUSED and
    then calls poll(), it will not return immediately even though there
    are VALID frames.

    To avoid situations like the above, we need to scan all frames at
    poll() to find VALID frames, the way netlink_alloc_skb() /
    netlink_forward_ring() find an UNUSED frame at skb allocation.

    Signed-off-by: Ken-ichirou MATSUZAWA
    Signed-off-by: David S. Miller

    Ken-ichirou MATSUZAWA
     

29 Aug, 2015

2 commits

  • __netlink_lookup_frame() was always called with the same "pos" value
    in netlink_forward_ring(), so it looked at the same ring entry header
    over and over again, every time through the loop. It then cycled
    through the whole ring, advancing ring->head but not "pos", until the
    "ring->head != head" loop test failed.

    Signed-off-by: Ken-ichirou MATSUZAWA
    Signed-off-by: David S. Miller

    Ken-ichirou MATSUZAWA
     
  • Since commit c05cdb1b864f ("netlink: allow large data transfers from
    user-space"), the kernel may fail to allocate the necessary room for the
    acknowledgment message back to userspace. This patch introduces a new
    socket option that trims off the payload of the original netlink message.

    The netlink message header is still included, so the user can guess from
    the sequence number what is the message that has triggered the
    acknowledgment.
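
    From userspace the new option is a one-line setsockopt (a hedged
    sketch; assumes a kernel with this patch, and the constant is defined
    here as a fallback for older headers):

```c
/* With NETLINK_CAP_ACK enabled, error acks carry only the netlink
 * header of the offending message, not its full payload, so the
 * kernel never has to allocate a huge ack skb. */
#include <assert.h>
#include <sys/socket.h>
#include <unistd.h>
#include <linux/netlink.h>

#ifndef NETLINK_CAP_ACK
#define NETLINK_CAP_ACK 10
#endif

int enable_cap_ack(void)
{
	int one = 1;
	int s = socket(AF_NETLINK, SOCK_RAW, NETLINK_ROUTE);
	int r;

	if (s < 0)
		return -1;
	r = setsockopt(s, SOL_NETLINK, NETLINK_CAP_ACK, &one, sizeof(one));
	close(s);
	return r == 0;
}
```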

    Signed-off-by: Pablo Neira Ayuso
    Signed-off-by: Christophe Ricard
    Signed-off-by: David S. Miller

    Christophe Ricard
     

24 Aug, 2015

1 commit

  • I can't send a netlink message via an mmaped netlink socket since

    commit: a8866ff6a5bce7d0ec465a63bc482a85c09b0d39
    netlink: make the check for "send from tx_ring" deterministic

    msg->msg_iter.type is set to WRITE (1) at

    SYSCALL_DEFINE6(sendto, ...
    import_single_range(WRITE, ...
    iov_iter_init(1, WRITE, ...

    call path, so we need to check the type with iter_is_iovec()
    to accept the WRITE.

    Signed-off-by: Ken-ichirou MATSUZAWA
    Signed-off-by: David S. Miller

    Ken-ichirou MATSUZAWA
     

11 Aug, 2015

1 commit

  • Linus reports the following deadlock on rtnl_mutex; triggered only
    once so far (extract):

    [12236.694209] NetworkManager D 0000000000013b80 0 1047 1 0x00000000
    [12236.694218] ffff88003f902640 0000000000000000 ffffffff815d15a9 0000000000000018
    [12236.694224] ffff880119538000 ffff88003f902640 ffffffff81a8ff84 00000000ffffffff
    [12236.694230] ffffffff81a8ff88 ffff880119c47f00 ffffffff815d133a ffffffff81a8ff80
    [12236.694235] Call Trace:
    [12236.694250] [] ? schedule_preempt_disabled+0x9/0x10
    [12236.694257] [] ? schedule+0x2a/0x70
    [12236.694263] [] ? schedule_preempt_disabled+0x9/0x10
    [12236.694271] [] ? __mutex_lock_slowpath+0x7f/0xf0
    [12236.694280] [] ? mutex_lock+0x16/0x30
    [12236.694291] [] ? rtnetlink_rcv+0x10/0x30
    [12236.694299] [] ? netlink_unicast+0xfb/0x180
    [12236.694309] [] ? rtnl_getlink+0x113/0x190
    [12236.694319] [] ? rtnetlink_rcv_msg+0x7a/0x210
    [12236.694331] [] ? sock_has_perm+0x5c/0x70
    [12236.694339] [] ? rtnetlink_rcv+0x30/0x30
    [12236.694346] [] ? netlink_rcv_skb+0x9c/0xc0
    [12236.694354] [] ? rtnetlink_rcv+0x1f/0x30
    [12236.694360] [] ? netlink_unicast+0xfb/0x180
    [12236.694367] [] ? netlink_sendmsg+0x484/0x5d0
    [12236.694376] [] ? __wake_up+0x2f/0x50
    [12236.694387] [] ? sock_sendmsg+0x33/0x40
    [12236.694396] [] ? ___sys_sendmsg+0x22e/0x240
    [12236.694405] [] ? ___sys_recvmsg+0x135/0x1a0
    [12236.694415] [] ? eventfd_write+0x82/0x210
    [12236.694423] [] ? fsnotify+0x32e/0x4c0
    [12236.694429] [] ? wake_up_q+0x60/0x60
    [12236.694434] [] ? __sys_sendmsg+0x39/0x70
    [12236.694440] [] ? entry_SYSCALL_64_fastpath+0x12/0x6a

    It seems so far plausible that the recursive call into rtnetlink_rcv()
    looks suspicious. One way, where this could trigger is that the senders
    NETLINK_CB(skb).portid was wrongly 0 (which is rtnetlink socket), so
    the rtnl_getlink() request's answer would be sent to the kernel instead
    to the actual user process, thus grabbing rtnl_mutex() twice.

    One theory would be that netlink_autobind() triggered via netlink_sendmsg()
    internally overwrites the -EBUSY error to 0, but where it is wrongly
    originating from __netlink_insert() instead. That would reset the
    socket's portid to 0, which is then filled into NETLINK_CB(skb).portid
    later on. As commit d470e3b483dc ("[NETLINK]: Fix two socket hashing bugs.")
    also puts it, -EBUSY should not be propagated from netlink_insert().

    It looks like it's very unlikely to reproduce. We need to trigger the
    rhashtable_insert_rehash() handler under a situation where rehashing
    currently occurs (one /rare/ way would be to hit ht->elasticity limits
    while not filled enough to expand the hashtable, but that would rather
    require a specifically crafted bind() sequence with knowledge about
    destination slots, seems unlikely). It probably makes sense to guard
    __netlink_insert() in any case and remap that error. It was suggested
    that EOVERFLOW might be better than an already overloaded ENOMEM.

    Reference: http://thread.gmane.org/gmane.linux.network/372676
    Reported-by: Linus Torvalds
    Signed-off-by: Daniel Borkmann
    Acked-by: Herbert Xu
    Acked-by: Thomas Graf
    Signed-off-by: David S. Miller

    Daniel Borkmann
     

22 Jul, 2015

1 commit

  • Kirill A. Shutemov says:

    This simple test-case trigers few locking asserts in kernel:

    int main(int argc, char **argv)
    {
            unsigned int block_size = 16 * 4096;
            struct nl_mmap_req req = {
                    .nm_block_size = block_size,
                    .nm_block_nr = 64,
                    .nm_frame_size = 16384,
                    .nm_frame_nr = 64 * block_size / 16384,
            };
            unsigned int ring_size;
            int fd;

            fd = socket(AF_NETLINK, SOCK_RAW, NETLINK_GENERIC);
            if (setsockopt(fd, SOL_NETLINK, NETLINK_RX_RING, &req, sizeof(req)) < 0)
                    exit(1);
            if (setsockopt(fd, SOL_NETLINK, NETLINK_TX_RING, &req, sizeof(req)) < 0)
                    exit(1);

            ring_size = req.nm_block_nr * req.nm_block_size;
            mmap(NULL, 2 * ring_size, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0);
            return 0;
    }

    +++ exited with 0 +++
    BUG: sleeping function called from invalid context at /home/kas/git/public/linux-mm/kernel/locking/mutex.c:616
    in_atomic(): 1, irqs_disabled(): 0, pid: 1, name: init
    3 locks held by init/1:
    #0: (reboot_mutex){+.+...}, at: [] SyS_reboot+0xa9/0x220
    #1: ((reboot_notifier_list).rwsem){.+.+..}, at: [] __blocking_notifier_call_chain+0x39/0x70
    #2: (rcu_callback){......}, at: [] rcu_do_batch.isra.49+0x160/0x10c0
    Preemption disabled at:[] __delay+0xf/0x20

    CPU: 1 PID: 1 Comm: init Not tainted 4.1.0-00009-gbddf4c4818e0 #253
    Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS Debian-1.8.2-1 04/01/2014
    ffff88017b3d8000 ffff88027bc03c38 ffffffff81929ceb 0000000000000102
    0000000000000000 ffff88027bc03c68 ffffffff81085a9d 0000000000000002
    ffffffff81ca2a20 0000000000000268 0000000000000000 ffff88027bc03c98
    Call Trace:
    [] dump_stack+0x4f/0x7b
    [] ___might_sleep+0x16d/0x270
    [] __might_sleep+0x4d/0x90
    [] mutex_lock_nested+0x2f/0x430
    [] ? _raw_spin_unlock_irqrestore+0x5d/0x80
    [] ? __this_cpu_preempt_check+0x13/0x20
    [] netlink_set_ring+0x1ed/0x350
    [] ? netlink_undo_bind+0x70/0x70
    [] netlink_sock_destruct+0x80/0x150
    [] __sk_free+0x1d/0x160
    [] sk_free+0x19/0x20
    [..]

    Cong Wang says:

    We can't hold mutex lock in a rcu callback, [..]

    Thomas Graf says:

    The socket should be dead at this point. It might be simpler to
    add a netlink_release_ring() function which doesn't require
    locking at all.

    Reported-by: "Kirill A. Shutemov"
    Diagnosed-by: Cong Wang
    Suggested-by: Thomas Graf
    Signed-off-by: Florian Westphal
    Signed-off-by: David S. Miller

    Florian Westphal
     

04 Jul, 2015

1 commit


22 Jun, 2015

1 commit

  • This patch adds getsockopt(SOL_NETLINK, NETLINK_LIST_MEMBERSHIPS) to
    retrieve all groups a socket is a member of. Currently, we have to use
    getsockname() and look at the nl.nl_groups bitmask. However, this mask is
    limited to 32 groups. Hence, similar to NETLINK_ADD_MEMBERSHIP and
    NETLINK_DROP_MEMBERSHIP, this adds a separate sockopt to manage group
    IDs higher than 32.

    This new NETLINK_LIST_MEMBERSHIPS option takes a pointer to __u32 and the
    size of the array. The array is filled with the full membership-set of the
    socket, and the required array size is returned in optlen. Hence,
    user-space can retry with a properly sized array in case it was too small.
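
    The query/retry pattern described above can be sketched in user space
    roughly as follows (an illustrative sketch, not taken from the patch;
    the fallback defines cover older headers, and error handling is
    abbreviated):

    ```c
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/socket.h>
    #include <linux/netlink.h>

    #ifndef SOL_NETLINK
    #define SOL_NETLINK 270
    #endif
    #ifndef NETLINK_LIST_MEMBERSHIPS
    #define NETLINK_LIST_MEMBERSHIPS 9
    #endif

    /* Query pass: with a NULL buffer and optlen 0, the kernel reports the
     * required array size back through optlen. */
    static socklen_t query_membership_size(int fd)
    {
        socklen_t len = 0;
        getsockopt(fd, SOL_NETLINK, NETLINK_LIST_MEMBERSHIPS, NULL, &len);
        return len;
    }

    int main(void)
    {
        int fd = socket(AF_NETLINK, SOCK_RAW, NETLINK_ROUTE);
        if (fd < 0) { perror("socket"); return 1; }

        socklen_t len = query_membership_size(fd);

        /* Retry with a properly sized array. */
        unsigned int *groups = calloc(len ? len : sizeof(unsigned int), 1);
        if (getsockopt(fd, SOL_NETLINK, NETLINK_LIST_MEMBERSHIPS,
                       groups, &len) == 0)
            printf("membership array holds %zu 32-bit words\n",
                   (size_t)(len / sizeof(unsigned int)));

        free(groups);
        return 0;
    }
    ```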

    Signed-off-by: David Herrmann
    Signed-off-by: David S. Miller

    David Herrmann
     

23 May, 2015

1 commit

  • Conflicts:
    drivers/net/ethernet/cadence/macb.c
    drivers/net/phy/phy.c
    include/linux/skbuff.h
    net/ipv4/tcp.c
    net/switchdev/switchdev.c

    Switchdev was a case of RTNH_F_{EXTERNAL --> OFFLOAD}
    renaming overlapping with net-next changes of various
    sorts.

    phy.c was a case of two changes, one adding a local
    variable to a function whilst the second was removing
    one.

    tcp.c overlapped a deadlock fix with the addition of new tcp_info
    statistic values.

    macb.c involved the addition of two zyncq device entries.

    skbuff.h involved adding back ipv4_daddr to nf_bridge_info
    whilst net-next changes put two other existing members of
    that struct into a union.

    Signed-off-by: David S. Miller

    David S. Miller
     

18 May, 2015

1 commit

  • Currently we use a global rover to select a port ID that is unique.
    This used to work consistently when it was protected with a global
    lock. However as we're now lockless, the global rover can exhibit
    pathological behaviour should multiple threads all stomp on it at
    the same time.

    Granted, this will eventually resolve itself, but the process is
    suboptimal.

    This patch replaces the global rover with a pseudorandom starting
    point to avoid this issue.
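
    The idea can be modeled in plain C (an illustrative sketch, not the
    kernel code; is_port_in_use() is a stand-in for the netlink
    hash-table lookup, and the constants are hypothetical):

    ```c
    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    /* Stand-in for the kernel's hash-table lookup; pretend exactly one
     * port is already taken. */
    static int is_port_in_use(int32_t portid)
    {
        return portid == -5000;
    }

    /* Kernel-assigned portids are negative. Instead of every thread
     * advancing one shared global rover, each autobind attempt begins at
     * its own pseudorandom point and probes downward past collisions, so
     * concurrent threads no longer stomp on shared state. */
    static int32_t autobind_pick_port(void)
    {
        int32_t portid = -4097 - (int32_t)(rand() % 100000);

        while (is_port_in_use(portid))
            portid--;               /* linear probe past the collision */
        return portid;
    }

    int main(void)
    {
        srand((unsigned)time(NULL));
        printf("picked portid %d\n", autobind_pick_port());
        return 0;
    }
    ```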

    Signed-off-by: Herbert Xu
    Signed-off-by: David S. Miller

    Herbert Xu
     

17 May, 2015

1 commit

  • The commit c5adde9468b0714a051eac7f9666f23eb10b61f7 ("netlink:
    eliminate nl_sk_hash_lock") breaks the autobind retry mechanism
    because it doesn't reset portid after a failed netlink_insert.

    This means that should autobind fail the first time around, then
    the socket will be stuck in limbo as it can never be bound again
    since it already has a non-zero portid.

    Fixes: c5adde9468b0 ("netlink: eliminate nl_sk_hash_lock")
    Signed-off-by: Herbert Xu
    Signed-off-by: David S. Miller

    Herbert Xu
     

15 May, 2015

1 commit

    Netlink socket creation and deletion heavily modify nl_table_users
    and nl_table_lock.

    If nl_table is sharing one cache line with one of them, netlink
    performance is really bad on SMP.

    ffffffff81ff5f00 B nl_table
    ffffffff81ff5f0c b nl_table_users

    Putting nl_table in the read_mostly section increased the performance
    of my open/delete netlink socket test by about 80%.

    This came up while diagnosing a getaddrinfo() problem.
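
    The effect being avoided here is false sharing. A minimal user-space
    illustration (hypothetical names; explicit cache-line alignment is
    used in place of the kernel's __read_mostly section annotation):

    ```c
    #include <stdint.h>
    #include <stdio.h>

    /* Read-mostly data next to a write-heavy counter: if both land on the
     * same cache line, every counter update invalidates that line on the
     * other CPUs and slows down readers of the table. Forcing 64-byte
     * alignment keeps them on separate lines, which is what moving
     * nl_table into the read-mostly section accomplishes in the kernel. */
    static _Alignas(64) struct { void *hash[32]; } table;   /* read-mostly */
    static _Alignas(64) unsigned long table_users;          /* write-heavy */

    int main(void)
    {
        uintptr_t a = (uintptr_t)&table;
        uintptr_t b = (uintptr_t)&table_users;

        /* Distinct 64-byte-aligned objects can never share a cache line. */
        printf("same cache line: %d\n", (a / 64) == (b / 64));
        return 0;
    }
    ```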

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

14 May, 2015

1 commit

  • Four minor merge conflicts:

    1) qca_spi.c renamed the local variable used for the SPI device
    from spi_device to spi, meanwhile the spi_set_drvdata() call
    got moved further up in the probe function.

    2) Two changes were both adding new members to codel params
    structure, and thus we had overlapping changes to the
    initializer function.

    3) 'net' was making a fix to sk_release_kernel() which is
    completely removed in 'net-next'.

    4) In net_namespace.c, the rtnl_net_fill() call for GET operations
    had the command value fixed, meanwhile 'net-next' adjusted the
    argument signature a bit.

    This also matches example merge resolutions posted by Stephen
    Rothwell over the past two days.

    Signed-off-by: David S. Miller

    David S. Miller