19 Feb, 2016

1 commit

  • mmapped netlink has a number of unresolved issues:

    - TX zerocopy support had to be disabled more than a year ago via
    commit 4682a0358639b29cf ("netlink: Always copy on mmap TX.")
    because the content of the mmapped area can change after netlink
    attribute validation but before message processing.

    - RX support was implemented mainly to speed up nfqueue dumping packet
    payload to userspace. However, since commit ae08ce0021087a5d812d2
    ("netfilter: nfnetlink_queue: zero copy support") we avoid one copy
    with the socket-based interface too (via the skb_zerocopy helper).
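
    The TX problem above is a check-then-use race on memory that userspace
    can still write. A minimal userspace model of it, not kernel code (all
    demo_* names and the message layout are hypothetical), might look like
    this:

```c
#include <assert.h>
#include <string.h>

/* Hypothetical message layout standing in for a netlink message. */
struct demo_msg {
    unsigned int len;            /* claimed payload length */
    unsigned char payload[16];
};

/* Validation step: accept only lengths that fit the payload array. */
static int demo_validate(const struct demo_msg *m)
{
    return m->len <= sizeof(m->payload);
}

/* Stand-in for userspace rewriting the shared area after validation. */
static void demo_racer(struct demo_msg *m)
{
    m->len = 4096;
}

/*
 * In-place variant: validate the shared memory, then read it again.
 * The value used may no longer be the value that was validated.
 */
static int demo_process_inplace(struct demo_msg *shared,
                                void (*racer)(struct demo_msg *))
{
    if (!demo_validate(shared))
        return -1;
    if (racer)
        racer(shared);           /* simulated concurrent write */
    return (int)shared->len;
}

/*
 * Copy-first variant: snapshot the shared message into private memory,
 * then validate and use only the snapshot. Later writes to *shared
 * cannot change the result.
 */
static int demo_process_copy(struct demo_msg *shared,
                             void (*racer)(struct demo_msg *))
{
    struct demo_msg priv;

    memcpy(&priv, shared, sizeof(priv));
    if (!demo_validate(&priv))
        return -1;
    if (racer)
        racer(shared);           /* simulated concurrent write */
    return (int)priv.len;
}
```

    With a message whose len is 8, the in-place variant ends up using 4096
    once demo_racer has run, while the copy-first variant still uses 8.
    This is the trade that always copying on TX makes: one extra copy in
    exchange for a stable view of the message.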

    The other problem is that skbs attached to a mmaped netlink socket
    behave differently from normal skbs:

    - they don't have a shinfo area, so all functions that use skb_shinfo()
    (e.g. skb_clone) cannot be used.

    - reserving headroom prevents userspace from seeing the content, as
    it expects the message to start at skb->head.
    See for instance
    commit aa3a022094fa ("netlink: not trim skb for mmaped socket when dump").

    - skbs handed e.g. to netlink_ack must have non-NULL skb->sk, else we
    crash because it needs the sk to check if a tx ring is attached.

    This is also not obvious and leads to non-intuitive bug fixes such as 7c7bdf359
    ("netfilter: nfnetlink: use original skbuff when acking batches").
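
    The headroom issue in the list above can be illustrated with a small
    userspace model. This is only a sketch: struct demo_buf and the demo_*
    helpers are hypothetical stand-ins for an skb with head and data
    pointers, not the kernel API.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical buffer with the two pointers that matter here. */
struct demo_buf {
    unsigned char *head;   /* start of the backing memory (the ring frame) */
    unsigned char *data;   /* where the message payload starts */
    size_t len;            /* bytes of payload written so far */
};

static void demo_init(struct demo_buf *b, unsigned char *mem)
{
    b->head = mem;
    b->data = mem;
    b->len = 0;
}

/* Reserve headroom: only legal while the buffer is still empty. */
static void demo_reserve(struct demo_buf *b, size_t headroom)
{
    assert(b->len == 0);
    b->data += headroom;
}

/* Append n bytes; the caller must guarantee the backing memory is large enough. */
static void demo_put(struct demo_buf *b, const void *src, size_t n)
{
    memcpy(b->data + b->len, src, n);
    b->len += n;
}

/* A ring reader starts at the frame start, i.e. at head, not at data. */
static int demo_reader_sees(const struct demo_buf *b, const void *expected,
                            size_t n)
{
    return memcmp(b->head, expected, n) == 0;
}
```

    Without demo_reserve() the reader finds the message at head. After
    reserving 8 bytes, the same write lands 8 bytes into the frame and the
    reader sees whatever was in the headroom instead.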

    mmaped netlink also didn't play nicely with the skb_zerocopy helper
    used by nfqueue and openvswitch. Daniel Borkmann fixed this via
    commit 6bb0fef489f6 ("netlink, mmap: fix edge-case leakages in nf queue
    zero-copy"), but at the cost of also needing to provide the remaining
    length to the allocation function.

    nfqueue also has problems when used with mmaped rx netlink:
    - mmaped netlink doesn't allow use of nfqueue batch verdict messages.
    The problem is that in the mmap case, the allocation time also determines
    the order in which the frame will be seen by userspace (A
    allocating before B means that A is located in an earlier ring slot,
    but this also means that B might get a lower sequence number than A
    since seqno is decided later). To fix this we would need to extend the
    spinlocked region to also cover the allocation and message setup, which
    isn't desirable.
    - nfqueue can now be configured to queue large (GSO) skbs to userspace.
    Queuing GSO packets is faster than having to force a software segmentation
    in the kernel, so this is a desirable option. However, with an mmap-based
    ring one has to use 64 KB per ring slot element, else mmap has to fall back
    to the socket path (NL_MMAP_STATUS_COPY) for all large packets.
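
    The batch verdict ordering problem can be reproduced in a small model.
    This is a sketch under assumptions: struct demo_ring and its helpers
    are hypothetical, and the two steps that run at different times in the
    kernel are simply called in a chosen order here.

```c
#include <assert.h>

#define DEMO_RING_SLOTS 4

struct demo_ring {
    unsigned int next_slot;             /* next free slot, taken at allocation time */
    unsigned int next_seq;              /* last sequence number handed out */
    unsigned int seq[DEMO_RING_SLOTS];  /* sequence number stored in each slot */
};

/* Step 1: allocation picks the ring slot, and with it the delivery order. */
static unsigned int demo_alloc_slot(struct demo_ring *r)
{
    assert(r->next_slot < DEMO_RING_SLOTS);
    return r->next_slot++;
}

/* Step 2: the sequence number is decided later (under the lock in the kernel). */
static void demo_assign_seq(struct demo_ring *r, unsigned int slot)
{
    assert(slot < r->next_slot);
    r->seq[slot] = ++r->next_seq;
}

/* Returns 1 if sequence numbers are strictly increasing in slot order. */
static int demo_ring_in_order(const struct demo_ring *r)
{
    unsigned int i;

    for (i = 1; i < r->next_slot; i++)
        if (r->seq[i] <= r->seq[i - 1])
            return 0;
    return 1;
}
```

    If A allocates slot 0, B allocates slot 1, and B then reaches
    demo_assign_seq() first, the ring holds sequence 2 in slot 0 and
    sequence 1 in slot 1. Userspace reading in slot order then sees ids out
    of order, which breaks a verdict that is meant to cover all packets up
    to a given id.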

    To use the mmap interface, userspace not only has to probe for mmap netlink
    support, it also has to implement a recv/socket receive path in order to
    handle messages that exceed the size of an rx ring element.
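
    The cost of this dual receive path can be made concrete with a small
    calculation. This is a sketch only: the demo_* names and the
    per-frame header length parameter are hypothetical and do not mirror
    the kernel's actual frame layout.

```c
#include <assert.h>
#include <stddef.h>

enum demo_path {
    DEMO_PATH_RING,     /* message fits in one ring frame */
    DEMO_PATH_SOCKET    /* message must be fetched with recv() instead */
};

/* Decide which receive path a message of msg_len bytes would take. */
static enum demo_path demo_pick_path(size_t frame_size, size_t hdr_len,
                                     size_t msg_len)
{
    assert(hdr_len <= frame_size);
    return msg_len <= frame_size - hdr_len ? DEMO_PATH_RING
                                           : DEMO_PATH_SOCKET;
}

/* Memory needed for a ring of `frames` slots of `frame_size` bytes each. */
static size_t demo_ring_bytes(size_t frame_size, size_t frames)
{
    return frame_size * frames;
}
```

    With 2048-byte frames and an assumed 16-byte header, a 1500-byte
    message stays on the ring, but a 65000-byte GSO packet has to take the
    socket path. Sizing frames at 64 KB avoids that, at a cost of 16 MiB
    for a 256-slot ring.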

    Cc: Daniel Borkmann
    Cc: Ken-ichirou MATSUZAWA
    Cc: Pablo Neira Ayuso
    Cc: Patrick McHardy
    Cc: Thomas Graf
    Signed-off-by: Florian Westphal
    Signed-off-by: David S. Miller

    Florian Westphal
     

02 May, 2013

1 commit

  • Currently, in menuconfig, Netlink's new mmaped IO is the very first
    entry under the ``Networking support'' item and comes even before
    ``Networking options'':

    [ ] Netlink: mmaped IO
    Networking options --->
    ...

    Let's move this into ``Networking options'' under netlink's Kconfig,
    since this might be more appropriate. Introduced by commit ccdfcc398
    (``netlink: mmaped netlink: ring setup'').

    Signed-off-by: Daniel Borkmann
    Signed-off-by: David S. Miller

    Daniel Borkmann
     

22 Mar, 2013

1 commit

    The netlink_diag can be built as a module, just as is done for
    unix sockets.

    The core dumping message carries the basic info about netlink sockets:
    family, type and protocol, portid, dst_group, dst_portid, state.

    Groups can be received as an optional parameter NETLINK_DIAG_GROUPS.

    Netlink sockets can be filtered by protocols.

    The socket inode number and cookie are reserved for future per-socket info
    retrieval. The per-protocol filtering is also reserved for future use by
    requiring the sdiag_protocol to be zero.

    The file /proc/net/netlink doesn't provide enough information for
    dumping netlink sockets. It doesn't provide dst_group, dst_portid,
    or groups above 32.
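
    The "groups above 32" limitation can be sketched as follows, assuming
    the proc file exposes group membership as a single 32-bit mask. The
    demo_* names and the 64-group limit are hypothetical and chosen only
    for illustration.

```c
#include <assert.h>
#include <stdint.h>

#define DEMO_MAX_GROUPS 64

/* Group membership as a bitmap of 32-bit words; group numbers start at 1. */
struct demo_groups {
    uint32_t words[DEMO_MAX_GROUPS / 32];
};

static void demo_join(struct demo_groups *g, unsigned int group)
{
    assert(group >= 1 && group <= DEMO_MAX_GROUPS);
    g->words[(group - 1) / 32] |= UINT32_C(1) << ((group - 1) % 32);
}

static int demo_is_member(const struct demo_groups *g, unsigned int group)
{
    assert(group >= 1 && group <= DEMO_MAX_GROUPS);
    return (g->words[(group - 1) / 32] >> ((group - 1) % 32)) & 1;
}

/* The view a single 32-bit field can offer: the first word only. */
static uint32_t demo_mask32(const struct demo_groups *g)
{
    return g->words[0];
}
```

    A socket that joined groups 1 and 33 shows up as 0x00000001 in the
    32-bit view, so membership in group 33 is lost. Carrying the whole
    bitmap in an optional attribute, as NETLINK_DIAG_GROUPS does, avoids
    that truncation.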

    v2: fix NETLINK_DIAG_MAX. Now it's equal to the last constant.

    Acked-by: Pavel Emelyanov
    Cc: "David S. Miller"
    Cc: Eric Dumazet
    Cc: Pablo Neira Ayuso
    Cc: "Eric W. Biederman"
    Cc: Gao feng
    Cc: Thomas Graf
    Signed-off-by: Andrey Vagin
    Signed-off-by: David S. Miller

    Andrey Vagin