24 Nov, 2017

1 commit

  • [ Upstream commit 0642840b8bb008528dbdf929cec9f65ac4231ad0 ]

    The way people generally use netlink_dump is that they fill in the skb
    as much as possible, breaking when nla_put returns an error. Then, they
    get called again and start filling out the next skb, and again, and so
    forth. The mechanism at work here is the ability for the iterative
    dumping function to detect when the skb is filled up and not fill it
    past the brim, waiting for a fresh skb for the rest of the data.

    However, if the attributes are small and nicely packed, it is possible
    that a dump callback function successfully fills in attributes until the
    skb is of size 4080 (libmnl's default page-sized receive buffer size).
    The dump function completes, satisfied, and then, if it happens to be
    that this is actually the last skb, and no further ones are to be sent,
    then netlink_dump will add on the NLMSG_DONE part:

    nlh = nlmsg_put_answer(skb, cb, NLMSG_DONE, sizeof(len), NLM_F_MULTI);

    It is very important that netlink_dump does this, of course. However, in
    this example, that call to nlmsg_put_answer will fail, because the
    previous filling by the dump function did not leave it enough room. And
    how could it possibly have done so? All of the nla_put variety of
    functions simply check to see if the skb has enough tailroom,
    independent of the context it is in.

    In order to keep the important assumptions of all netlink dump users, it
    is therefore important to give them an skb that has this end part of the
    tail already reserved, so that the call to nlmsg_put_answer does not
    fail. Otherwise, library authors are forced to find some bizarre sized
    receive buffer that has a large modulo relative to the common sizes of
    messages received, which is ugly and buggy.

    This patch thus saves the NLMSG_DONE for an additional message, for the
    case that things are dangerously close to the brim. This requires
    keeping track of the errno from ->dump() across calls.
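
    The deferral this patch introduces boils down to a tailroom check; a
    minimal sketch with illustrative sizes (the kernel's actual bookkeeping
    lives in netlink_dump() and differs in detail):

```c
#include <stdbool.h>
#include <stddef.h>

/* Illustrative numbers, not the kernel's exact arithmetic: a
 * page-sized receive buffer as used by libmnl, and the room an
 * NLMSG_DONE message needs (16-byte nlmsghdr + 4-byte errno,
 * rounded up to 4-byte alignment -> 20 bytes). */
#define RECV_BUF_SIZE 4096
#define DONE_MSG_SIZE 20

/* After ->dump() has filled `used` bytes of the skb, decide whether
 * NLMSG_DONE still fits or must be deferred to a fresh skb. */
static bool must_defer_done(size_t used)
{
    return RECV_BUF_SIZE - used < DONE_MSG_SIZE;
}
```

    With the commit's own example of a dump that fills the skb to 4080
    bytes, only 16 bytes remain, so the DONE message is deferred.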

    Signed-off-by: Jason A. Donenfeld
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Jason A. Donenfeld
     

02 Nov, 2017

1 commit

  • Many source files in the tree are missing licensing information, which
    makes it harder for compliance tools to determine the correct license.

    By default all files without license information are under the default
    license of the kernel, which is GPL version 2.

    Update the files which contain no license information with the 'GPL-2.0'
    SPDX license identifier. The SPDX identifier is a legally binding
    shorthand, which can be used instead of the full boiler plate text.

    This patch is based on work done by Thomas Gleixner, Kate Stewart,
    and Philippe Ombredanne.

    How this work was done:

    Patches were generated and checked against linux-4.14-rc6 for a subset of
    the use cases:
    - file had no licensing information in it,
    - file was a */uapi/* one with no licensing information in it,
    - file was a */uapi/* one with existing licensing information.

    Further patches will be generated in subsequent months to fix up cases
    where non-standard license headers were used, and references to license
    had to be inferred by heuristics based on keywords.

    The analysis to determine which SPDX License Identifier should be
    applied to a file was done in a spreadsheet of side-by-side results
    from the output of two independent scanners (ScanCode & Windriver)
    producing SPDX tag:value files, created by Philippe Ombredanne.
    Philippe prepared the base worksheet and did an initial spot review
    of a few thousand files.

    The 4.13 kernel was the starting point of the analysis with 60,537 files
    assessed. Kate Stewart did a file-by-file comparison of the scanner
    results in the spreadsheet to determine which SPDX license
    identifier(s) should be applied to each file. She confirmed any
    determination that was not immediately clear with lawyers working
    with the Linux Foundation.

    Criteria used to select files for SPDX license identifier tagging were:
    - Files considered eligible had to be source code files.
    - Make and config files were included as candidates if they contained >5
      lines of source.
    - File already had some variant of a license header in it (even if <5
      lines).
    Reviewed-by: Philippe Ombredanne
    Reviewed-by: Thomas Gleixner
    Signed-off-by: Greg Kroah-Hartman

    Greg Kroah-Hartman
     

14 Apr, 2017

1 commit

  • Add the base infrastructure and UAPI for netlink extended ACK
    reporting. All "manual" calls to netlink_ack() pass NULL for now and
    thus don't get extended ACK reporting.

    Big thanks goes to Pablo Neira Ayuso for not only bringing up the
    whole topic at netconf (again) but also coming up with the nlattr
    passing trick and various other ideas.
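
    The core of the new UAPI, as it appears in today's linux/netlink.h, is
    a flag on the NLMSG_ERROR header telling userspace that extended-ACK
    TLVs follow the usual nlmsgerr payload. A self-contained sketch of
    that check, with the structures mirrored locally rather than taken
    from the kernel headers:

```c
#include <stdbool.h>
#include <stdint.h>

/* Local mirrors of the UAPI structures so the sketch is
 * self-contained; the real definitions live in linux/netlink.h. */
struct nlmsghdr {
    uint32_t nlmsg_len;
    uint16_t nlmsg_type;
    uint16_t nlmsg_flags;
    uint32_t nlmsg_seq;
    uint32_t nlmsg_pid;
};
struct nlmsgerr {
    int error;
    struct nlmsghdr msg;    /* header of the offending request */
};
#define NLMSG_ERROR     0x2
#define NLM_F_ACK_TLVS  0x200  /* extended-ACK attributes follow */

/* Does this ACK carry extended-ACK attributes after the payload? */
static bool has_ext_ack(const struct nlmsghdr *nlh)
{
    return nlh->nlmsg_type == NLMSG_ERROR &&
           (nlh->nlmsg_flags & NLM_F_ACK_TLVS);
}
```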

    Signed-off-by: Johannes Berg
    Reviewed-by: David Ahern
    Signed-off-by: David S. Miller

    Johannes Berg
     

05 Apr, 2017

1 commit

    cb_running is reported in /proc/self/net/netlink, and the ss tool
    reports it when it gets its information from the proc files.

    sock_diag is a newer interface that is used instead of the proc
    files, so it is reasonable that this interface should report no less
    information about sockets than the proc files do.

    We use these flags to dump and restore netlink sockets.

    Signed-off-by: Andrei Vagin
    Signed-off-by: David S. Miller

    Andrey Vagin
     

30 Nov, 2016

1 commit

  • The cb->done interface expects to be called in process context.
    This was broken by the netlink RCU conversion. This patch fixes
    it by adding a worker struct to make the cb->done call where
    necessary.

    Fixes: 21e4902aea80 ("netlink: Lockless lookup with RCU grace...")
    Reported-by: Subash Abhinov Kasiviswanathan
    Signed-off-by: Herbert Xu
    Acked-by: Cong Wang
    Signed-off-by: David S. Miller

    Herbert Xu
     

10 Jun, 2016

1 commit


19 Feb, 2016

1 commit

  • mmapped netlink has a number of unresolved issues:

    - TX zerocopy support had to be disabled more than a year ago via
    commit 4682a0358639b29cf ("netlink: Always copy on mmap TX.")
    because the content of the mmapped area can change after netlink
    attribute validation but before message processing.

    - RX support was implemented mainly to speed up nfqueue dumping packet
    payload to userspace. However, since commit ae08ce0021087a5d812d2
    ("netfilter: nfnetlink_queue: zero copy support") we avoid one copy
    with the socket-based interface too (via the skb_zerocopy helper).

    The other problem is that skbs attached to a mmaped netlink socket
    behave differently from normal skbs:

    - they don't have a shinfo area, so all functions that use skb_shinfo()
    (e.g. skb_clone) cannot be used.

    - reserving headroom prevents userspace from seeing the content as
    it expects the message to start at skb->head.
    See for instance
    commit aa3a022094fa ("netlink: not trim skb for mmaped socket when dump").

    - skbs handed e.g. to netlink_ack must have non-NULL skb->sk, else we
    crash because it needs the sk to check if a tx ring is attached.

    This is not obvious and leads to non-intuitive bug fixes such as
    commit 7c7bdf359 ("netfilter: nfnetlink: use original skbuff when
    acking batches").

    mmaped netlink also didn't play nicely with the skb_zerocopy helper
    used by nfqueue and openvswitch. Daniel Borkmann fixed this via
    commit 6bb0fef489f6 ("netlink, mmap: fix edge-case leakages in nf
    queue zero-copy"), but at the cost of also needing to provide the
    remaining length to the allocation function.

    nfqueue also has problems when used with mmaped rx netlink:
    - mmaped netlink doesn't allow use of nfqueue batch verdict messages.
    Problem is that in the mmap case, the allocation time also determines
    the ordering in which the frame will be seen by userspace (A
    allocating before B means that A is located in an earlier ring slot,
    but B might still get a lower sequence number than A, since the
    seqno is decided later). To fix this we would need to extend the
    spinlocked region to also cover the allocation and message setup,
    which isn't desirable.
    - nfqueue can now be configured to queue large (GSO) skbs to userspace.
    Queueing GSO packets is faster than having to force a software segmentation
    in the kernel, so this is a desirable option. However, with a mmap based
    ring one has to use 64kb per ring slot element, else mmap has to fall back
    to the socket path (NL_MMAP_STATUS_COPY) for all large packets.

    To use the mmap interface, userspace not only has to probe for mmap netlink
    support, it also has to implement a recv/socket receive path in order to
    handle messages that exceed the size of an rx ring element.
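
    The slot-versus-seqno inversion described in the first nfqueue point
    can be modelled in a few lines (a toy model, not kernel code):

```c
/* Tiny model of the ordering problem: ring slot order is fixed at
 * allocation time, but sequence numbers are assigned later during
 * message setup, so slot order and seqno order can invert. */
struct frame { int slot; int seqno; };

static void simulate_inversion(struct frame *a, struct frame *b)
{
    int next_slot = 0, next_seq = 0;
    a->slot = next_slot++;   /* A allocates its ring slot first...  */
    b->slot = next_slot++;
    b->seqno = next_seq++;   /* ...but B finishes message setup first */
    a->seqno = next_seq++;
}
```

    Userspace walks the ring in slot order, so it sees A before B even
    though A carries the higher sequence number.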

    Cc: Daniel Borkmann
    Cc: Ken-ichirou MATSUZAWA
    Cc: Pablo Neira Ayuso
    Cc: Patrick McHardy
    Cc: Thomas Graf
    Signed-off-by: Florian Westphal
    Signed-off-by: David S. Miller

    Florian Westphal
     

25 Sep, 2015

1 commit

  • On Mon, Sep 21, 2015 at 02:20:22PM -0400, Tejun Heo wrote:
    >
    > store_release and load_acquire are different from the usual memory
    > barriers and can't be paired this way. You have to pair store_release
    > and load_acquire. Besides, it isn't a particularly good idea to

    OK I've decided to drop the acquire/release helpers as they don't
    help us at all and simply pessimises the code by using full memory
    barriers (on some architectures) where only a write or read barrier
    is needed.

    > depend on memory barriers embedded in other data structures like the
    > above. Here, especially, rhashtable_insert() would have write barrier
    > *before* the entry is hashed not necessarily *after*, which means that
    > in the above case, a socket which appears to have set bound to a
    > reader might not visible when the reader tries to look up the socket
    > on the hashtable.

    But you are right we do need an explicit write barrier here to
    ensure that the hashing is visible.

    > There's no reason to be overly smart here. This isn't a crazy hot
    > path, write barriers tend to be very cheap, store_release more so.
    > Please just do smp_store_release() and note what it's paired with.

    It's not about being overly smart. It's about actually understanding
    what's going on with the code. I've seen too many instances of
    people simply sprinkling synchronisation primitives around without
    any knowledge of what is happening underneath, which is just a recipe
    for creating hard-to-debug races.

    > > @@ -1539,7 +1546,7 @@ static int netlink_bind(struct socket *sock, struct sockaddr *addr,
    > > }
    > > }
    > >
    > > - if (!nlk->portid) {
    > > + if (!nlk->bound) {
    >
    > I don't think you can skip load_acquire here just because this is the
    > second deref of the variable. That doesn't change anything. Race
    > condition could still happen between the first and second tests and
    > skipping the second would lead to the same kind of bug.

    The reason this one is OK is because we do not use nlk->portid or
    try to get nlk from the hash table before we return to user-space.

    However, there is a real bug here that none of these acquire/release
    helpers discovered. The two bound tests here used to be a single
    one. Now that they are separate it is entirely possible for another
    thread to come in the middle and bind the socket. So we need to
    repeat the portid check in order to maintain consistency.

    > > @@ -1587,7 +1594,7 @@ static int netlink_connect(struct socket *sock, struct sockaddr *addr,
    > > !netlink_allowed(sock, NL_CFG_F_NONROOT_SEND))
    > > return -EPERM;
    > >
    > > - if (!nlk->portid)
    > > + if (!nlk->bound)
    >
    > Don't we need load_acquire here too? Is this path holding a lock
    > which makes that unnecessary?

    Ditto.

    ---8<---
    Repeating the portid check in netlink_bind fixes a race where two
    threads that bind the socket at the same time with different port
    IDs may both succeed.
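
    The release/acquire pairing debated above can be sketched in userspace
    with C11 atomics standing in for the kernel's smp_store_release() and
    smp_load_acquire(); the struct and names are illustrative, not the
    kernel's:

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Userspace analogy of the nlk->bound publication discussed above.
 * The portid must be fully written before the release-store of
 * `bound`; readers must acquire-load `bound` before trusting portid. */
struct nlk_sock {
    unsigned int portid;
    atomic_bool bound;
};

static void publish_bind(struct nlk_sock *nlk, unsigned int portid)
{
    nlk->portid = portid;                    /* plain store ... */
    atomic_store_explicit(&nlk->bound, true, /* ... published by release */
                          memory_order_release);
}

static bool read_bound(struct nlk_sock *nlk, unsigned int *portid)
{
    if (!atomic_load_explicit(&nlk->bound, memory_order_acquire))
        return false;
    *portid = nlk->portid;  /* safe: acquire pairs with the release above */
    return true;
}
```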

    Fixes: 1f770c0a09da ("netlink: Fix autobind race condition that leads to zero port ID")
    Reported-by: Tejun Heo
    Reported-by: Linus Torvalds
    Signed-off-by: Herbert Xu
    Nacked-by: Tejun Heo
    Signed-off-by: David S. Miller

    Herbert Xu
     

21 Sep, 2015

1 commit

  • The commit c0bb07df7d981e4091432754e30c9c720e2c0c78 ("netlink:
    Reset portid after netlink_insert failure") introduced a race
    condition where if two threads try to autobind the same socket
    one of them may end up with a zero port ID. This led to kernel
    deadlocks that were observed by multiple people.

    This patch reverts that commit and instead fixes it by introducing
    a separate rhash_portid variable so that the real portid is only set
    after the socket has been successfully hashed.

    Fixes: c0bb07df7d98 ("netlink: Reset portid after netlink_insert failure")
    Reported-by: Tejun Heo
    Reported-by: Linus Torvalds
    Signed-off-by: Herbert Xu
    Signed-off-by: David S. Miller

    Herbert Xu
     

12 Sep, 2015

1 commit

  • Ken-ichirou reported that running netlink in mmap mode for receive in
    combination with nlmon will throw a NULL pointer dereference in
    __kfree_skb() on nlmon_xmit(), in my case I can also trigger an "unable
    to handle kernel paging request". The problem is the skb_clone() in
    __netlink_deliver_tap_skb() for skbs that are mmaped.

    I.e. the cloned skb doesn't have a destructor, whereas the mmap netlink
    skb has it pointed to netlink_skb_destructor(), set in the handler
    netlink_ring_setup_skb(). There, skb->head is being set to NULL, so
    that in such cases, __kfree_skb() doesn't perform a skb_release_data()
    via skb_release_all(), where skb->head is possibly being freed through
    kfree(head) into slab allocator, although netlink mmap skb->head points
    to the mmap buffer. Similarly, the same has to be done also for large
    netlink skbs where the data area is vmalloced. Therefore, as discussed,
    make a copy for these rather rare cases for now. This fixes the issue
    on my and Ken-ichirou's test-cases.

    Reference: http://thread.gmane.org/gmane.linux.network/371129
    Fixes: bcbde0d449ed ("net: netlink: virtual tap device management")
    Reported-by: Ken-ichirou MATSUZAWA
    Signed-off-by: Daniel Borkmann
    Tested-by: Ken-ichirou MATSUZAWA
    Signed-off-by: David S. Miller

    Daniel Borkmann
     

28 Jan, 2015

1 commit


17 Jan, 2015

1 commit

  • In addition to the problem Jeff Layton reported, I looked at the code
    and reproduced the same warning by subscribing and removing the genl
    family with a socket still open. This is a fairly tricky race which
    originates in the fact that generic netlink allows the family to go
    away while sockets are still open - unlike regular netlink which has
    a module refcount for every open socket so in general this cannot be
    triggered.

    Trying to resolve this issue by the obvious locking isn't possible as
    it will result in deadlocks between unregistration and group unbind
    notification (which incidentally lockdep doesn't find due to the home
    grown locking in the netlink table.)

    To really resolve this, introduce a "closing socket" reference counter
    (for generic netlink only, as it's the only affected family) in the
    core netlink code and use that in generic netlink to wait for all the
    sockets that are being closed at the same time as a generic netlink
    family is removed.

    This fixes the race in which, when a socket is closed, it should
    call unbind, but if the family is removed at the same time the
    unbind will not find it, leading to the warning. The real problem
    though is that in this case the unbind could actually find a new
    family that is registered to have a multicast group with the same
    ID, and call its mcast_unbind(), leading to confusion.

    Also remove the warning since it would still trigger, but is now no
    longer a problem.

    This also moves the code in af_netlink.c to before unreferencing the
    module to avoid having the same problem in the normal non-genl case.

    Signed-off-by: Johannes Berg
    Signed-off-by: David S. Miller

    Johannes Berg
     

14 Jan, 2015

1 commit

  • As rhashtable_lookup_compare_insert() can guarantee the process
    of search and insertion is atomic, it's safe to eliminate the
    nl_sk_hash_lock. After this, object insertion or removal will
    be protected with per bucket lock on write side while object
    lookup is guarded with rcu read lock on read side.

    Signed-off-by: Ying Xue
    Cc: Thomas Graf
    Acked-by: Thomas Graf
    Signed-off-by: David S. Miller

    Ying Xue
     

04 Jan, 2015

1 commit

    Defers the release of the socket reference using call_rcu() to
    allow using an RCU read-side protected call to rhashtable_lookup().

    This restores behaviour and performance gains as previously
    introduced by e341694 ("netlink: Convert netlink_lookup() to use
    RCU protected hash table") without the side effect of severely
    delayed socket destruction.

    Signed-off-by: Thomas Graf
    Signed-off-by: David S. Miller

    Thomas Graf
     

27 Dec, 2014

1 commit

  • Netlink families can exist in multiple namespaces, and for the most
    part multicast subscriptions are per network namespace. Thus it only
    makes sense to have bind/unbind notifications per network namespace.

    To achieve this, pass the network namespace of a given client socket
    to the bind/unbind functions.

    Also do this in generic netlink, and there also make sure that any
    bind for multicast groups that only exist in init_net is rejected.
    This isn't really a problem if it is accepted, since a client in a
    different namespace will never receive any notifications from such
    a group, but it can confuse the family if not rejected. (It's also
    possible to silently accept it, without telling the family, but it
    would then also have to be ignored on unbind, so families that take
    any kind of action on bind/unbind would do unnecessary work for
    invalid clients like that.)

    Signed-off-by: Johannes Berg
    Signed-off-by: David S. Miller

    Johannes Berg
     

07 Aug, 2014

1 commit

  • Although RCU protection would be possible during diag dump, doing
    so allows for concurrent table mutations which can render the
    in-table offset between individual Netlink messages invalid and
    thus cause legitimate sockets to be skipped in the dump.

    Since the diag dump is relatively low volume and consistency is
    more important than performance, the table mutex is held during
    dump.

    Reported-by: Andrey Wagin
    Signed-off-by: Thomas Graf
    Fixes: e341694e3eb57fc ("netlink: Convert netlink_lookup() to use RCU protected hash table")
    Signed-off-by: David S. Miller

    Thomas Graf
     

03 Aug, 2014

1 commit

  • Heavy Netlink users such as Open vSwitch spend a considerable amount of
    time in netlink_lookup() due to the read-lock on nl_table_lock. Use of
    RCU relieves the lock contention.

    Makes use of the new resizable hash table to avoid locking on the
    lookup.

    The hash table will grow if entries exceed 75% of the table size, up
    to a total table size of 64K. It will automatically shrink if usage
    falls below 30%.

    Also splits nl_table_lock into a separate mutex to protect hash table
    mutations and allow synchronize_rcu() to sleep while waiting for readers
    during expansion and shrinking.

    Before:
    9.16% kpktgend_0 [openvswitch] [k] masked_flow_lookup
    6.42% kpktgend_0 [pktgen] [k] mod_cur_headers
    6.26% kpktgend_0 [pktgen] [k] pktgen_thread_worker
    6.23% kpktgend_0 [kernel.kallsyms] [k] memset
    4.79% kpktgend_0 [kernel.kallsyms] [k] netlink_lookup
    4.37% kpktgend_0 [kernel.kallsyms] [k] memcpy
    3.60% kpktgend_0 [openvswitch] [k] ovs_flow_extract
    2.69% kpktgend_0 [kernel.kallsyms] [k] jhash2

    After:
    15.26% kpktgend_0 [openvswitch] [k] masked_flow_lookup
    8.12% kpktgend_0 [pktgen] [k] pktgen_thread_worker
    7.92% kpktgend_0 [pktgen] [k] mod_cur_headers
    5.11% kpktgend_0 [kernel.kallsyms] [k] memset
    4.11% kpktgend_0 [openvswitch] [k] ovs_flow_extract
    4.06% kpktgend_0 [kernel.kallsyms] [k] _raw_spin_lock
    3.90% kpktgend_0 [kernel.kallsyms] [k] jhash2
    [...]
    0.67% kpktgend_0 [kernel.kallsyms] [k] netlink_lookup
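
    The stated grow/shrink policy can be sketched as a load-factor check
    (a simplification; the real logic lives inside the resizable hash
    table implementation):

```c
#include <stddef.h>

/* Resize policy from the text: grow when load exceeds 75% (up to a
 * 64K-bucket table), shrink when usage falls below 30%. */
#define MAX_BUCKETS 65536

static int resize_decision(size_t entries, size_t buckets)
{
    if (entries * 100 > buckets * 75 && buckets < MAX_BUCKETS)
        return 1;    /* grow   */
    if (entries * 100 < buckets * 30)
        return -1;   /* shrink */
    return 0;        /* keep   */
}
```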

    Signed-off-by: Thomas Graf
    Reviewed-by: Nikolay Aleksandrov
    Signed-off-by: David S. Miller

    Thomas Graf
     

23 Apr, 2014

1 commit

  • Have the netlink per-protocol optional bind function return an int error code
    rather than void to signal a failure.

    This will enable netlink protocols to perform extra checks including
    capabilities and permissions verifications when updating memberships in
    multicast groups.

    In netlink_bind() and netlink_setsockopt() the call to the per-protocol bind
    function was moved above the multicast group update to prevent any access to
    the multicast socket groups before checking with the per-protocol bind
    function. This will enable the per-protocol bind function to be used to check
    permissions which could be denied before making them available, and to avoid
    the messy job of undoing the addition should the per-protocol bind function
    fail.

    The netfilter subsystem seems to be the only one currently using the
    per-protocol bind function.
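
    The ordering change can be modelled like this; the names and types
    are illustrative, not the kernel's:

```c
/* Model of the new ordering: the per-protocol bind hook returns int
 * and runs before the group membership update, so a denial never
 * leaves anything to undo. */
struct nl_sock_model {
    int (*proto_bind)(struct nl_sock_model *sk, unsigned int group);
    unsigned int groups;   /* bitmask of joined multicast groups */
};

static int join_group(struct nl_sock_model *sk, unsigned int group)
{
    if (sk->proto_bind) {
        int err = sk->proto_bind(sk, group);
        if (err)
            return err;        /* denied before any state changed */
    }
    sk->groups |= 1u << group; /* only reached on success */
    return 0;
}

/* Example hook: e.g. a capability check that always denies. */
static int deny_bind(struct nl_sock_model *sk, unsigned int group)
{
    (void)sk; (void)group;
    return -1;
}
```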

    Signed-off-by: Richard Guy Briggs
    Signed-off-by: David S. Miller

    Richard Guy Briggs
     

11 Mar, 2014

1 commit

  • One known problem with netlink is the fact that NLMSG_GOODSIZE is
    really small on PAGE_SIZE==4096 architectures, and it is difficult
    to know in advance what buffer size is used by the application.

    This patch adds an automatic learning of the size.

    First netlink message will still be limited to ~4K, but if user used
    bigger buffers, then following messages will be able to use up to 16KB.

    This speeds up dump() operations by a large factor and should be
    safe for legacy applications.
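
    A sketch of the learning, with illustrative names; only the ~4K
    default and the 16KB cap come from the text:

```c
#include <stddef.h>

/* Buffer-size learning: remember the largest buffer the application
 * has passed in, capped at 16KB; fall back to the small default
 * until something larger has been seen. */
#define NL_DEFAULT_ALLOC 4096
#define NL_MAX_LEARNED   16384

static size_t next_alloc_size(size_t *max_recvmsg_len, size_t user_buf_len)
{
    if (user_buf_len > *max_recvmsg_len)
        *max_recvmsg_len = user_buf_len;    /* learn from the app */
    if (*max_recvmsg_len > NL_MAX_LEARNED)
        *max_recvmsg_len = NL_MAX_LEARNED;  /* but cap at 16KB */
    return *max_recvmsg_len ? *max_recvmsg_len : NL_DEFAULT_ALLOC;
}
```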

    Signed-off-by: Eric Dumazet
    Cc: Thomas Graf
    Acked-by: Thomas Graf
    Signed-off-by: David S. Miller

    Eric Dumazet
     

16 Aug, 2013

1 commit

    This patch stores struct netlink_callback in netlink_sock to avoid
    allocating and freeing it on every netlink dump message. Only one
    dump operation is allowed on a given socket at a time, so we can
    safely convert the cb pointer into an embedded cb struct inside
    netlink_sock.

    Signed-off-by: Pravin B Shelar
    Signed-off-by: David S. Miller

    Pravin B Shelar
     

11 Jun, 2013

1 commit

    As we know, netlink sockets are a private resource of a net
    namespace: they can communicate with each other only when they are
    in the same net namespace. This works well until we try to add
    namespace support for other subsystems that use netlink.

    Unlike ipv4 and the routing table, some subsystems are not suited
    to belonging to a net namespace. Subsystems such as audit and
    crypto are better suited to the user namespace.

    So we must have the ability to let netlink sockets in the same user
    namespace communicate with each other.

    This patch adds a new function pointer, "compare", to netlink_table,
    so we can decide whether two netlink sockets can communicate with
    each other through the netlink_table's self-defined compare
    function.

    The behaviour is unchanged if we don't provide a compare function
    for the netlink_table.
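
    The compare hook can be modelled as follows; the types and ids are
    illustrative, not the kernel's:

```c
#include <stdbool.h>
#include <stddef.h>

/* Model of the per-table compare hook: each socket records a net
 * namespace id; a table may override the default same-netns test,
 * for instance to compare user namespaces instead. */
struct nl_sock_m  { int net_id; int user_ns_id; };
struct nl_table_m {
    bool (*compare)(const struct nl_sock_m *a, const struct nl_sock_m *b);
};

static bool default_compare(const struct nl_sock_m *a,
                            const struct nl_sock_m *b)
{
    return a->net_id == b->net_id;     /* unchanged behaviour */
}

static bool can_communicate(const struct nl_table_m *t,
                            const struct nl_sock_m *a,
                            const struct nl_sock_m *b)
{
    return t->compare ? t->compare(a, b) : default_compare(a, b);
}

/* A table that instead matches sockets in the same user namespace. */
static bool user_ns_compare(const struct nl_sock_m *a,
                            const struct nl_sock_m *b)
{
    return a->user_ns_id == b->user_ns_id;
}
```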

    Signed-off-by: Gao feng
    Acked-by: Serge E. Hallyn
    Signed-off-by: David S. Miller

    Gao feng
     

20 Apr, 2013

1 commit

  • Add support for mmap'ed RX and TX ring setup and teardown based on the
    af_packet.c code. The following patches will use this to add the real
    mmap'ed receive and transmit functionality.

    Signed-off-by: Patrick McHardy
    Signed-off-by: David S. Miller

    Patrick McHardy
     

22 Mar, 2013

1 commit