09 Aug, 2018

1 commit

  • commit 91874ecf32e41b5d86a4cb9d60e0bee50d828058 upstream.

    It's legal to have 64 groups for netlink_sock.

    As user-supplied nladdr->nl_groups is __u32, it's possible to subscribe
    only to first 32 groups.

    The check for correctness of .bind() userspace supplied parameter
    is done by applying mask made from ngroups shift. Which broke Android
    as they have 64 groups and the shift for mask resulted in an overflow.

    Fixes: 61f4b23769f0 ("netlink: Don't shift with UB on nlk->ngroups")
    Cc: "David S. Miller"
    Cc: Herbert Xu
    Cc: Steffen Klassert
    Cc: netdev@vger.kernel.org
    Cc: stable@vger.kernel.org
    Reported-and-Tested-by: Nathan Chancellor
    Signed-off-by: Dmitry Safonov
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Dmitry Safonov
     

06 Aug, 2018

1 commit

  • [ Upstream commit bc5b6c0b62b932626a135f516a41838c510c6eba ]

    'protocol' is a user-controlled value, so sanitize it after the bounds
    check to avoid using it for speculative out-of-bounds access to arrays
    indexed by it.

    This addresses the following accesses detected with the help of smatch:

    * net/netlink/af_netlink.c:654 __netlink_create() warn: potential
    spectre issue 'nlk_cb_mutex_keys' [w]

    * net/netlink/af_netlink.c:654 __netlink_create() warn: potential
    spectre issue 'nlk_cb_mutex_key_strings' [w]

    * net/netlink/af_netlink.c:685 netlink_create() warn: potential spectre
    issue 'nl_table' [w] (local cap)

    Cc: Josh Poimboeuf
    Signed-off-by: Jeremy Cline
    Reviewed-by: Josh Poimboeuf
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Jeremy Cline
     

03 Aug, 2018

2 commits

  • [ Upstream commit 61f4b23769f0cc72ae62c9a81cf08f0397d40da8 ]

    On i386 nlk->ngroups might be 32 or 0. Which leads to UB, resulting in
    hang during boot.
    Check for 0 ngroups and use (unsigned long long) as a type to shift.

    Fixes: 7acf9d4237c4 ("netlink: Do not subscribe to non-existent groups").
    Reported-by: kernel test robot
    Signed-off-by: Dmitry Safonov
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Dmitry Safonov
     
  • [ Upstream commit 7acf9d4237c46894e0fa0492dd96314a41742e84 ]

    Make ABI more strict about subscribing to group > ngroups.
    Code doesn't check for that and it looks bogus.
    (one can subscribe to non-existing group)
    Still, it's possible to bind() to all possible groups with (-1)

    Cc: "David S. Miller"
    Cc: Herbert Xu
    Cc: Steffen Klassert
    Cc: netdev@vger.kernel.org
    Signed-off-by: Dmitry Safonov
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Dmitry Safonov
     

16 May, 2018

1 commit

  • commit 6091f09c2f79730d895149bcfe3d66140288cd0e upstream.

    syzbot reported :

    BUG: KMSAN: uninit-value in ffs arch/x86/include/asm/bitops.h:432 [inline]
    BUG: KMSAN: uninit-value in netlink_sendmsg+0xb26/0x1310 net/netlink/af_netlink.c:1851

    Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
    Signed-off-by: Eric Dumazet
    Reported-by: syzbot
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Eric Dumazet
     

12 Apr, 2018

1 commit

  • [ Upstream commit 7880287981b60a6808f39f297bb66936e8bdf57a ]

    KMSAN reports use of uninitialized memory in the case when |alen| is
    smaller than sizeof(struct sockaddr_nl), and therefore |nladdr| isn't
    fully copied from the userspace.

    Signed-off-by: Alexander Potapenko
    Fixes: 1da177e4c3f41524 ("Linux-2.6.12-rc2")
    Reviewed-by: Eric Dumazet
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Alexander Potapenko
     

09 Mar, 2018

1 commit

  • [ Upstream commit b87b6194be631c94785fe93398651e804ed43e28 ]

    Before, if cb->start() failed, the module reference would never be put,
    because cb->cb_running is intentionally false at this point. Users are
    generally annoyed by this because they can no longer unload modules that
    leak references. Also, it may be possible to tediously wrap a reference
    counter back to zero, especially since module.c still uses atomic_inc
    instead of refcount_inc.

    This patch expands the error path to simply call module_put if
    cb->start() fails.

    Fixes: 41c87425a1ac ("netlink: do not set cb_running if dump's start() errs")
    Signed-off-by: Jason A. Donenfeld
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Jason A. Donenfeld
     

31 Jan, 2018

2 commits

  • [ Upstream commit cd443f1e91ca600a092e780e8250cd6a2954b763 ]

    Move up the extack reset/initialization in netlink_rcv_skb, so that
    those 'goto ack' will not skip it. Otherwise, later on netlink_ack
    may use the uninitialized extack and cause kernel crash.

    Fixes: cbbdf8433a5f ("netlink: extack needs to be reset each time through loop")
    Reported-by: syzbot+03bee3680a37466775e7@syzkaller.appspotmail.com
    Signed-off-by: Xin Long
    Acked-by: David Ahern
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Xin Long
     
  • [ Upstream commit cbbdf8433a5f117b1a2119ea30fc651b61ef7570 ]

    syzbot triggered the WARN_ON in netlink_ack testing the bad_attr value.
    The problem is that netlink_rcv_skb loops over the skb repeatedly invoking
    the callback and without resetting the extack leaving potentially stale
    data. Initializing each time through avoids the WARN_ON.

    Fixes: 2d4bc93368f5a ("netlink: extended ACK reporting")
    Reported-by: syzbot+315fa6766d0f7c359327@syzkaller.appspotmail.com
    Signed-off-by: David Ahern
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    David Ahern
     

03 Jan, 2018

1 commit

  • [ Upstream commit 93c647643b48f0131f02e45da3bd367d80443291 ]

    Currently, a nlmon link inside a child namespace can observe systemwide
    netlink activity. Filter the traffic so that nlmon can only sniff
    netlink messages from its own netns.

    Test case:

    vpnns -- bash -c "ip link add nlmon0 type nlmon; \
    ip link set nlmon0 up; \
    tcpdump -i nlmon0 -q -w /tmp/nlmon.pcap -U" &
    sudo ip xfrm state add src 10.1.1.1 dst 10.1.1.2 proto esp \
    spi 0x1 mode transport \
    auth sha1 0x6162633132330000000000000000000000000000 \
    enc aes 0x00000000000000000000000000000000
    grep --binary abc123 /tmp/nlmon.pcap

    Signed-off-by: Kevin Cernekee
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Kevin Cernekee
     

24 Nov, 2017

1 commit

  • [ Upstream commit 0642840b8bb008528dbdf929cec9f65ac4231ad0 ]

    The way people generally use netlink_dump is that they fill in the skb
    as much as possible, breaking when nla_put returns an error. Then, they
    get called again and start filling out the next skb, and again, and so
    forth. The mechanism at work here is the ability for the iterative
    dumping function to detect when the skb is filled up and not fill it
    past the brim, waiting for a fresh skb for the rest of the data.

    However, if the attributes are small and nicely packed, it is possible
    that a dump callback function successfully fills in attributes until the
    skb is of size 4080 (libmnl's default page-sized receive buffer size).
    The dump function completes, satisfied, and then, if it happens to be
    that this is actually the last skb, and no further ones are to be sent,
    then netlink_dump will add on the NLMSG_DONE part:

    nlh = nlmsg_put_answer(skb, cb, NLMSG_DONE, sizeof(len), NLM_F_MULTI);

    It is very important that netlink_dump does this, of course. However, in
    this example, that call to nlmsg_put_answer will fail, because the
    previous filling by the dump function did not leave it enough room. And
    how could it possibly have done so? All of the nla_put variety of
    functions simply check to see if the skb has enough tailroom,
    independent of the context it is in.

    In order to keep the important assumptions of all netlink dump users, it
    is therefore important to give them an skb that has this end part of the
    tail already reserved, so that the call to nlmsg_put_answer does not
    fail. Otherwise, library authors are forced to find some bizarre sized
    receive buffer that has a large modulo relative to the common sizes of
    messages received, which is ugly and buggy.

    This patch thus saves the NLMSG_DONE for an additional message, for the
    case that things are dangerously close to the brim. This requires
    keeping track of the errno from ->dump() across calls.

    Signed-off-by: Jason A. Donenfeld
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Jason A. Donenfeld
     

18 Oct, 2017

1 commit

  • It seems that it's possible to toggle NETLINK_F_EXT_ACK
    through setsockopt() while another thread/CPU is building
    a message inside netlink_ack(), which could then trigger
    the WARN_ON()s I added since if it goes from being turned
    off to being turned on between allocating and filling the
    message, the skb could end up being too small.

    Avoid this whole situation by storing the value of this
    flag in a separate variable and using that throughout the
    function instead.

    Fixes: 2d4bc93368f5 ("netlink: extended ACK reporting")
    Signed-off-by: Johannes Berg
    Signed-off-by: David S. Miller

    Johannes Berg
     

10 Oct, 2017

1 commit

  • It turns out that multiple places can call netlink_dump(), which means
    it's still possible to dereference partially initialized values in
    dump() that were the result of a faulty returned start().

    This fixes the issue by calling start() _before_ setting cb_running to
    true, so that there's no chance at all of hitting the dump() function
    through any indirect paths.

    It also moves the call to start() to be when the mutex is held. This has
    the nice side effect of serializing invocations to start(), which is
    likely desirable anyway. It also prevents any possible other races that
    might come out of this logic.

    In testing this with several different pieces of tricky code to trigger
    these issues, this commit fixes all avenues that I'm aware of.

    Signed-off-by: Jason A. Donenfeld
    Cc: Johannes Berg
    Reviewed-by: Johannes Berg
    Signed-off-by: David S. Miller

    Jason A. Donenfeld
     

30 Sep, 2017

1 commit

  • Drivers that use the start method for netlink dumping rely on dumpit not
    being called if start fails. For example, ila_xlat.c allocates memory
    and assigns it to cb->args[0] in its start() function. It might fail to
    do that and return -ENOMEM instead. However, even when returning an
    error, dumpit will be called, which, in the example above, quickly
    dereferences the memory in cb->args[0], which will OOPS the kernel. This
    is but one example of how this goes wrong.

    Since start() has always been a function with an int return type, it
    therefore makes sense to use it properly, rather than ignoring it. This
    patch thus returns early and does not call dumpit() when start() fails.

    Signed-off-by: Jason A. Donenfeld
    Cc: Johannes Berg
    Reviewed-by: Johannes Berg
    Signed-off-by: David S. Miller

    Jason A. Donenfeld
     

07 Sep, 2017

2 commits

  • Now there is no lock protecting nlk ngroups/groups' accessing in
    netlink bind and getname. It's safe from nlk groups' setting in
    netlink_release, but not from netlink_realloc_groups called by
    netlink_setsockopt.

    netlink_lock_table is needed in both netlink bind and getname when
    accessing nlk groups.

    Acked-by: Florian Westphal
    Signed-off-by: Xin Long
    Signed-off-by: David S. Miller

    Xin Long
     
  • ChunYu found a netlink use-after-free issue by syzkaller:

    [28448.842981] BUG: KASAN: use-after-free in __nla_put+0x37/0x40 at addr ffff8807185e2378
    [28448.969918] Call Trace:
    [...]
    [28449.117207] __nla_put+0x37/0x40
    [28449.132027] nla_put+0xf5/0x130
    [28449.146261] sk_diag_fill.isra.4.constprop.5+0x5a0/0x750 [netlink_diag]
    [28449.176608] __netlink_diag_dump+0x25a/0x700 [netlink_diag]
    [28449.202215] netlink_diag_dump+0x176/0x240 [netlink_diag]
    [28449.226834] netlink_dump+0x488/0xbb0
    [28449.298014] __netlink_dump_start+0x4e8/0x760
    [28449.317924] netlink_diag_handler_dump+0x261/0x340 [netlink_diag]
    [28449.413414] sock_diag_rcv_msg+0x207/0x390
    [28449.432409] netlink_rcv_skb+0x149/0x380
    [28449.467647] sock_diag_rcv+0x2d/0x40
    [28449.484362] netlink_unicast+0x562/0x7b0
    [28449.564790] netlink_sendmsg+0xaa8/0xe60
    [28449.661510] sock_sendmsg+0xcf/0x110
    [28449.865631] __sys_sendmsg+0xf3/0x240
    [28450.000964] SyS_sendmsg+0x32/0x50
    [28450.016969] do_syscall_64+0x25c/0x6c0
    [28450.154439] entry_SYSCALL64_slow_path+0x25/0x25

    It was caused by no protection between nlk groups' free in netlink_release
    and nlk groups' accessing in sk_diag_dump_groups. The similar issue also
    exists in netlink_seq_show().

    This patch is to defer nlk groups' free in deferred_put_nlk_sk.

    Reported-by: ChunYu Wang
    Acked-by: Florian Westphal
    Signed-off-by: Xin Long
    Signed-off-by: David S. Miller

    Xin Long
     

01 Jul, 2017

3 commits

  • refcount_t type and corresponding API should be
    used instead of atomic_t when the variable is used as
    a reference counter. This allows to avoid accidental
    refcounter overflows that might lead to use-after-free
    situations.

    This patch uses refcount_inc_not_zero() instead of
    atomic_inc_not_zero_hint() due to absense of a _hint()
    version of refcount API. If the hint() version must
    be used, we might need to revisit API.

    Signed-off-by: Elena Reshetova
    Signed-off-by: Hans Liljestrand
    Signed-off-by: Kees Cook
    Signed-off-by: David Windsor
    Signed-off-by: David S. Miller

    Reshetova, Elena
     
  • refcount_t type and corresponding API should be
    used instead of atomic_t when the variable is used as
    a reference counter. This allows to avoid accidental
    refcounter overflows that might lead to use-after-free
    situations.

    Signed-off-by: Elena Reshetova
    Signed-off-by: Hans Liljestrand
    Signed-off-by: Kees Cook
    Signed-off-by: David Windsor
    Signed-off-by: David S. Miller

    Reshetova, Elena
     
  • refcount_t type and corresponding API should be
    used instead of atomic_t when the variable is used as
    a reference counter. This allows to avoid accidental
    refcounter overflows that might lead to use-after-free
    situations.

    Signed-off-by: Elena Reshetova
    Signed-off-by: Hans Liljestrand
    Signed-off-by: Kees Cook
    Signed-off-by: David Windsor
    Signed-off-by: David S. Miller

    Reshetova, Elena
     

16 Jun, 2017

2 commits

  • It seems like a historic accident that these return unsigned char *,
    and in many places that means casts are required, more often than not.

    Make these functions (skb_put, __skb_put and pskb_put) return void *
    and remove all the casts across the tree, adding a (u8 *) cast only
    where the unsigned char pointer was used directly, all done with the
    following spatch:

    @@
    expression SKB, LEN;
    typedef u8;
    identifier fn = { skb_put, __skb_put };
    @@
    - *(fn(SKB, LEN))
    + *(u8 *)fn(SKB, LEN)

    @@
    expression E, SKB, LEN;
    identifier fn = { skb_put, __skb_put };
    type T;
    @@
    - E = ((T *)(fn(SKB, LEN)))
    + E = fn(SKB, LEN)

    which actually doesn't cover pskb_put since there are only three
    users overall.

    A handful of stragglers were converted manually, notably a macro in
    drivers/isdn/i4l/isdn_bsdcomp.c and, oddly enough, one of the many
    instances in net/bluetooth/hci_sock.c. In the former file, I also
    had to fix one whitespace problem spatch introduced.

    Signed-off-by: Johannes Berg
    Signed-off-by: David S. Miller

    Johannes Berg
     
  • A common pattern with skb_put() is to just want to memcpy()
    some data into the new space, introduce skb_put_data() for
    this.

    An spatch similar to the one for skb_put_zero() converts many
    of the places using it:

    @@
    identifier p, p2;
    expression len, skb, data;
    type t, t2;
    @@
    (
    -p = skb_put(skb, len);
    +p = skb_put_data(skb, data, len);
    |
    -p = (t)skb_put(skb, len);
    +p = skb_put_data(skb, data, len);
    )
    (
    p2 = (t2)p;
    -memcpy(p2, data, len);
    |
    -memcpy(p, data, len);
    )

    @@
    type t, t2;
    identifier p, p2;
    expression skb, data;
    @@
    t *p;
    ...
    (
    -p = skb_put(skb, sizeof(t));
    +p = skb_put_data(skb, data, sizeof(t));
    |
    -p = (t *)skb_put(skb, sizeof(t));
    +p = skb_put_data(skb, data, sizeof(t));
    )
    (
    p2 = (t2)p;
    -memcpy(p2, data, sizeof(*p));
    |
    -memcpy(p, data, sizeof(*p));
    )

    @@
    expression skb, len, data;
    @@
    -memcpy(skb_put(skb, len), data, len);
    +skb_put_data(skb, data, len);

    (again, manually post-processed to retain some comments)

    Reviewed-by: Stephen Hemminger
    Signed-off-by: Johannes Berg
    Signed-off-by: David S. Miller

    Johannes Berg
     

01 Jun, 2017

1 commit

  • The NETLINK_F_LISTEN_ALL_NSID otion enables to listen all netns that have a
    nsid assigned into the netns where the netlink socket is opened.
    The nsid is sent as metadata to userland, but the existence of this nsid is
    checked only for netns that are different from the socket netns. Thus, if
    no nsid is assigned to the socket netns, NETNSA_NSID_NOT_ASSIGNED is
    reported to the userland. This value is confusing and useless.
    After this patch, only valid nsid are sent to userland.

    Reported-by: Flavio Leitner
    Signed-off-by: Nicolas Dichtel
    Signed-off-by: David S. Miller

    Nicolas Dichtel
     

14 Apr, 2017

2 commits

  • Now that we have extended error reporting and a new message format for
    netlink ACK messages, also extend this to be able to return arbitrary
    cookie data on success.

    This will allow, for example, nl80211 to not send an extra message for
    cookies identifying newly created objects, but return those directly
    in the ACK message.

    The cookie data size is currently limited to 20 bytes (since Jamal
    talked about using SHA1 for identifiers.)

    Thanks to Jamal Hadi Salim for bringing up this idea during the
    discussions.

    Signed-off-by: Johannes Berg
    Reviewed-by: Jiri Pirko
    Signed-off-by: David S. Miller

    Johannes Berg
     
  • Add the base infrastructure and UAPI for netlink extended ACK
    reporting. All "manual" calls to netlink_ack() pass NULL for now and
    thus don't get extended ACK reporting.

    Big thanks goes to Pablo Neira Ayuso for not only bringing up the
    whole topic at netconf (again) but also coming up with the nlattr
    passing trick and various other ideas.

    Signed-off-by: Johannes Berg
    Reviewed-by: David Ahern
    Signed-off-by: David S. Miller

    Johannes Berg
     

05 Apr, 2017

1 commit

  • cb_running is reported in /proc/self/net/netlink and it is reported by
    the ss tool, when it gets information from the proc files.

    sock_diag is a new interface which is used instead of proc files, so it
    looks reasonable that this interface has to report no less information
    about sockets than proc files.

    We use these flags to dump and restore netlink sockets.

    Signed-off-by: Andrei Vagin
    Signed-off-by: David S. Miller

    Andrey Vagin
     

22 Mar, 2017

1 commit

  • On Tue, Mar 14, 2017 at 10:44:10AM +0100, Dmitry Vyukov wrote:
    >
    > Yes, please.
    > Disregarding some reports is not a good way long term.

    Please try this patch.

    ---8cb_mutex are annotated by lockdep
    as a single class. This causes a false lcokdep cycle involving
    genl and crypto_user.

    This patch fixes it by dividing cb_mutex into individual classes
    based on the netlink protocol. As genl and crypto_user do not
    use the same netlink protocol this breaks the false dependency
    loop.

    Reported-by: Dmitry Vyukov
    Signed-off-by: Herbert Xu
    Signed-off-by: David S. Miller

    Herbert Xu
     

28 Jan, 2017

1 commit

  • Slava Shwartsman reported a warning in skb_try_coalesce(), when we
    detect skb->truesize is completely wrong.

    In his case, issue came from IPv6 reassembly coping with malicious
    datagrams, that forced various pskb_may_pull() to reallocate a bigger
    skb->head than the one allocated by NIC driver before entering GRO
    layer.

    Current code does not change skb->truesize, leaving this burden to
    callers if they care enough.

    Blindly changing skb->truesize in pskb_expand_head() is not
    easy, as some producers might track skb->truesize, for example
    in xmit path for back pressure feedback (sk->sk_wmem_alloc)

    We can detect the cases where it should be safe to change
    skb->truesize :

    1) skb is not attached to a socket.
    2) If it is attached to a socket, destructor is sock_edemux()

    My audit gave only two callers doing their own skb->truesize
    manipulation.

    I had to remove skb parameter in sock_edemux macro when
    CONFIG_INET is not set to avoid a compile error.

    Signed-off-by: Eric Dumazet
    Reported-by: Slava Shwartsman
    Signed-off-by: David S. Miller

    Eric Dumazet
     

17 Jan, 2017

1 commit

  • In commit d35c99ff77ecb ("netlink: do not enter direct reclaim from
    netlink_dump()") we made sure to not trigger expensive memory reclaim.

    Problem is that a bit later, netlink_trim() might be called and
    trigger memory reclaim.

    netlink_trim() should be best effort, and really as fast as possible.
    Under memory pressure, it is fine to not trim this skb.

    Signed-off-by: Eric Dumazet
    Acked-by: Alexei Starovoitov
    Signed-off-by: David S. Miller

    Eric Dumazet
     

25 Dec, 2016

1 commit


11 Dec, 2016

1 commit


06 Dec, 2016

1 commit

  • It is wrong to schedule a work from sk_destruct using the socket
    as the memory reserve because the socket will be freed immediately
    after the return from sk_destruct.

    Instead we should do the deferral prior to sk_free.

    This patch does just that.

    Fixes: 707693c8a498 ("netlink: Call cb->done from a worker thread")
    Signed-off-by: Herbert Xu
    Tested-by: Andrey Konovalov
    Signed-off-by: David S. Miller

    Herbert Xu
     

30 Nov, 2016

1 commit

  • The cb->done interface expects to be called in process context.
    This was broken by the netlink RCU conversion. This patch fixes
    it by adding a worker struct to make the cb->done call where
    necessary.

    Fixes: 21e4902aea80 ("netlink: Lockless lookup with RCU grace...")
    Reported-by: Subash Abhinov Kasiviswanathan
    Signed-off-by: Herbert Xu
    Acked-by: Cong Wang
    Signed-off-by: David S. Miller

    Herbert Xu
     

07 Oct, 2016

1 commit

  • Since linux-3.15, netlink_dump() can use up to 16384 bytes skb
    allocations.

    Due to struct skb_shared_info ~320 bytes overhead, we end up using
    order-3 (on x86) page allocations, that might trigger direct reclaim and
    add stress.

    The intent was really to attempt a large allocation but immediately
    fallback to a smaller one (order-1 on x86) in case of memory stress.

    On recent kernels (linux-4.4), we can remove __GFP_DIRECT_RECLAIM to
    meet the goal. Old kernels would need to remove __GFP_WAIT

    While we are at it, since we do an order-3 allocation, allow to use
    all the allocated bytes instead of 16384 to reduce syscalls during
    large dumps.

    iproute2 already uses 32KB recvmsg() buffer sizes.

    Alexei provided an initial patch downsizing to SKB_WITH_OVERHEAD(16384)

    Fixes: 9063e21fb026 ("netlink: autosize skb lengthes")
    Signed-off-by: Eric Dumazet
    Reported-by: Alexei Starovoitov
    Cc: Greg Thelen
    Reviewed-by: Greg Rose
    Acked-by: Alexei Starovoitov
    Signed-off-by: David S. Miller

    Eric Dumazet
     

17 May, 2016

1 commit

  • When we free cb->skb after a dump, we do it after releasing the
    lock. This means that a new dump could have started in the time
    being and we'll end up freeing their skb instead of ours.

    This patch saves the skb and module before we unlock so we free
    the right memory.

    Fixes: 16b304f3404f ("netlink: Eliminate kmalloc in netlink dump operation.")
    Reported-by: Baozeng Ding
    Signed-off-by: Herbert Xu
    Acked-by: Cong Wang
    Signed-off-by: David S. Miller

    Herbert Xu
     

24 Apr, 2016

1 commit


11 Apr, 2016

1 commit

  • All existing users of NETLINK_URELEASE use it to clean up resources that
    were previously allocated to a socket via some command. As a result, no
    users require getting this notification for unbound sockets.

    Sending it for unbound sockets, however, is a problem because any user
    (including unprivileged users) can create a socket that uses the same ID
    as an existing socket. Binding this new socket will fail, but if the
    NETLINK_URELEASE notification is generated for such sockets, the users
    thereof will be tricked into thinking the socket that they allocated the
    resources for is closed.

    In the nl80211 case, this will cause destruction of virtual interfaces
    that still belong to an existing hostapd process; this is the case that
    Dmitry noticed. In the NFC case, it will cause a poll abort. In the case
    of netlink log/queue it will cause them to stop reporting events, as if
    NFULNL_CFG_CMD_UNBIND/NFQNL_CFG_CMD_UNBIND had been called.

    Fix this problem by checking that the socket is bound before generating
    the NETLINK_URELEASE notification.

    Cc: stable@vger.kernel.org
    Signed-off-by: Dmitry Ivanov
    Signed-off-by: Johannes Berg
    Signed-off-by: David S. Miller

    Dmitry Ivanov
     

05 Apr, 2016

1 commit

  • In certain cases, the 802.11 mesh pathtable code wants to
    iterate over all of the entries in the forwarding table from
    the receive path, which is inside an RCU read-side critical
    section. Enable walks inside atomic sections by allowing
    GFP_ATOMIC allocations for the walker state.

    Change all existing callsites to pass in GFP_KERNEL.

    Acked-by: Thomas Graf
    Signed-off-by: Bob Copeland
    [also adjust gfs2/glock.c and rhashtable tests]
    Signed-off-by: Johannes Berg

    Bob Copeland
     

23 Mar, 2016

1 commit

  • By returning -ENOIOCTLCMD, sock_do_ioctl() falls back to calling
    dev_ioctl(), which provides support for NIC driver ioctls, which
    includes ethtool support. This is similar to the way ioctls are handled
    in udp.c or tcp.c.

    This removes the requirement that ethtool for example be tied to the
    support of a specific L3 protocol (ethtool uses an AF_INET socket
    today).

    Signed-off-by: David Decotigny
    Signed-off-by: David S. Miller

    David Decotigny
     

19 Feb, 2016

2 commits

  • reverts commit 3ab1f683bf8b ("nfnetlink: add support for memory mapped
    netlink")'

    Like previous commits in the series, remove wrappers that are not needed
    after mmapped netlink removal.

    Signed-off-by: Florian Westphal
    Signed-off-by: David S. Miller

    Florian Westphal
     
  • mmapped netlink has a number of unresolved issues:

    - TX zerocopy support had to be disabled more than a year ago via
    commit 4682a0358639b29cf ("netlink: Always copy on mmap TX.")
    because the content of the mmapped area can change after netlink
    attribute validation but before message processing.

    - RX support was implemented mainly to speed up nfqueue dumping packet
    payload to userspace. However, since commit ae08ce0021087a5d812d2
    ("netfilter: nfnetlink_queue: zero copy support") we avoid one copy
    with the socket-based interface too (via the skb_zerocopy helper).

    The other problem is that skbs attached to mmaped netlink socket
    behave different from normal skbs:

    - they don't have a shinfo area, so all functions that use skb_shinfo()
    (e.g. skb_clone) cannot be used.

    - reserving headroom prevents userspace from seeing the content as
    it expects message to start at skb->head.
    See for instance
    commit aa3a022094fa ("netlink: not trim skb for mmaped socket when dump").

    - skbs handed e.g. to netlink_ack must have non-NULL skb->sk, else we
    crash because it needs the sk to check if a tx ring is attached.

    Also not obvious, leads to non-intuitive bug fixes such as 7c7bdf359
    ("netfilter: nfnetlink: use original skbuff when acking batches").

    mmaped netlink also didn't play nicely with the skb_zerocopy helper
    used by nfqueue and openvswitch. Daniel Borkmann fixed this via
    commit 6bb0fef489f6 ("netlink, mmap: fix edge-case leakages in nf queue
    zero-copy")' but at the cost of also needing to provide remaining
    length to the allocation function.

    nfqueue also has problems when used with mmaped rx netlink:
    - mmaped netlink doesn't allow use of nfqueue batch verdict messages.
    Problem is that in the mmap case, the allocation time also determines
    the ordering in which the frame will be seen by userspace (A
    allocating before B means that A is located in earlier ring slot,
    but this also means that B might get a lower sequence number then A
    since seqno is decided later. To fix this we would need to extend the
    spinlocked region to also cover the allocation and message setup which
    isn't desirable.
    - nfqueue can now be configured to queue large (GSO) skbs to userspace.
    Queing GSO packets is faster than having to force a software segmentation
    in the kernel, so this is a desirable option. However, with a mmap based
    ring one has to use 64kb per ring slot element, else mmap has to fall back
    to the socket path (NL_MMAP_STATUS_COPY) for all large packets.

    To use the mmap interface, userspace not only has to probe for mmap netlink
    support, it also has to implement a recv/socket receive path in order to
    handle messages that exceed the size of an rx ring element.

    Cc: Daniel Borkmann
    Cc: Ken-ichirou MATSUZAWA
    Cc: Pablo Neira Ayuso
    Cc: Patrick McHardy
    Cc: Thomas Graf
    Signed-off-by: Florian Westphal
    Signed-off-by: David S. Miller

    Florian Westphal