17 Sep, 2018

16 commits

  • I see no reason for them: label or timer cannot be NULL, and if they
    were, we would crash with a NULL dereference anyway.

    For skb_header_pointer() failure, just set hotdrop to true and toss
    such a packet.
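
    A minimal sketch of the resulting pattern in an xtables match (the match
    and header names are illustrative, not the actual diff):

    struct example_hdr {
            __u8 type;
    };

    static bool example_mt(const struct sk_buff *skb,
                           struct xt_action_param *par)
    {
            struct example_hdr _hdr;
            const struct example_hdr *hdr;

            hdr = skb_header_pointer(skb, par->thoff, sizeof(_hdr), &_hdr);
            if (!hdr) {
                    par->hotdrop = true;    /* toss the malformed packet */
                    return false;
            }

            return hdr->type == 0;          /* illustrative match condition */
    }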

    Signed-off-by: Florian Westphal
    Signed-off-by: Pablo Neira Ayuso

    Florian Westphal
     
  • None of these spots really needs to crash the kernel. In one or two
    cases we can just report an error to userspace; in the other cases we
    can just use WARN_ON (and leak memory instead).
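
    The general shape of the change, as a hedged sketch (the context struct
    and condition are illustrative):

    struct example_ctx;

    static int example_setup(struct example_ctx *ctx)
    {
            /* was: BUG_ON(ctx == NULL); */
            if (WARN_ON(ctx == NULL))
                    return -EINVAL; /* report the error to userspace instead */

            return 0;
    }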

    Signed-off-by: Florian Westphal
    Signed-off-by: Pablo Neira Ayuso

    Florian Westphal
     
  • The cgroup v2 path field is PATH_MAX, which is too large. This places
    too much pressure on memory allocation for people with many rules doing
    cgroup v1 classid matching; side effects of this are bug reports like:

    https://bugzilla.kernel.org/show_bug.cgi?id=200639

    This patch registers a new revision that shrinks the cgroup path to 512
    bytes, which is the same approach we follow in similar extensions that
    have a path field.
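
    A hedged sketch of what the smaller revision can look like (macro and
    field names are assumptions, not the exact uapi structure):

    #define EXAMPLE_CGROUP_PATH_MAX 512     /* down from PATH_MAX (4096) */

    struct example_cgroup_info_v2 {
            __u8 has_path;
            __u8 has_classid;
            __u8 invert_path;
            __u8 invert_classid;
            union {
                    char    path[EXAMPLE_CGROUP_PATH_MAX];
                    __u32   classid;
            };

            /* kernel-internal: looked-up cgroup */
            void *priv __attribute__((aligned(8)));
    };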

    Cc: Tejun Heo
    Signed-off-by: Pablo Neira Ayuso
    Acked-by: Tejun Heo

    Pablo Neira Ayuso
     
  • The same connection mark can be set on flows belonging to different
    address families. This commit adds support for filtering on the L3
    protocol when flushing connection track entries. If no protocol is
    specified, then all L3 protocols match.

    In order to avoid code duplication and a redundant check, the protocol
    comparison in ctnetlink_dump_table() has been removed. Instead, a filter
    is created if the GET-message triggering the dump contains an address
    family. ctnetlink_filter_match() is then used to compare the L3
    protocols.
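
    A minimal sketch of the per-entry comparison (the filter struct and the
    helper name are illustrative):

    struct example_ct_filter {
            u8 family;      /* NFPROTO_UNSPEC means "match all families" */
    };

    static bool example_filter_match(struct nf_conn *ct, void *data)
    {
            const struct example_ct_filter *filter = data;

            if (!filter)
                    return true;    /* no filter: dump/flush every entry */

            if (filter->family != NFPROTO_UNSPEC &&
                nf_ct_l3num(ct) != filter->family)
                    return false;   /* wrong L3 protocol, skip this entry */

            return true;
    }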

    Signed-off-by: Kristian Evensen
    Signed-off-by: Pablo Neira Ayuso

    Kristian Evensen
     
  • Supports fetching saddr/daddr of tunnel mode states, the request id and
    the SPI. If the direction is 'in', use the inbound skb secpath,
    otherwise dst->xfrm.
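
    A hedged sketch of the direction handling (helper and variable names are
    illustrative, not the actual expression code):

    static struct xfrm_state *example_pick_state(struct sk_buff *skb,
                                                 bool dir_is_in)
    {
            if (dir_is_in) {
                    /* inbound: state recorded in the skb's secpath */
                    struct sec_path *sp = skb_sec_path(skb);

                    return (sp && sp->len > 0) ? sp->xvec[0] : NULL;
            }

            /* outbound: state attached to the route */
            return skb_dst(skb) ? skb_dst(skb)->xfrm : NULL;
    }

    With a tunnel mode state in hand (x->props.mode == XFRM_MODE_TUNNEL),
    saddr/daddr, the request id and the SPI are available via x->props.saddr,
    x->id.daddr, x->props.reqid and x->id.spi.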

    Joint work with Máté Eckl.

    Signed-off-by: Florian Westphal
    Signed-off-by: Pablo Neira Ayuso

    Florian Westphal
     
  • As of a0ae2562c6c4b27 ("netfilter: conntrack: remove l3proto
    abstraction") there are no users anymore.

    Signed-off-by: Florian Westphal
    Signed-off-by: Pablo Neira Ayuso

    Florian Westphal
     
  • Release the committed transaction log from a work queue, moving the
    expensive synchronize_rcu out of the locked section and providing an
    opportunity to batch this.
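
    A minimal sketch of the pattern (names are illustrative; the real code
    tracks its own transaction objects): the commit path queues the finished
    log and schedules a work item, and the work item pays for a single
    synchronize_rcu() covering everything queued so far.

    struct example_trans {
            struct list_head list;
    };

    static LIST_HEAD(example_destroy_list);
    static DEFINE_SPINLOCK(example_destroy_list_lock);

    static void example_trans_destroy_work(struct work_struct *work)
    {
            struct example_trans *t, *next;
            LIST_HEAD(head);

            spin_lock(&example_destroy_list_lock);
            list_splice_init(&example_destroy_list, &head);
            spin_unlock(&example_destroy_list_lock);

            if (list_empty(&head))
                    return;

            synchronize_rcu();      /* one grace period for the whole batch */

            list_for_each_entry_safe(t, next, &head, list) {
                    list_del(&t->list);
                    kfree(t);
            }
    }
    static DECLARE_WORK(example_trans_destroy_w, example_trans_destroy_work);

    /* commit path: splice the committed log onto example_destroy_list under
     * the lock, then schedule_work(&example_trans_destroy_w). */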

    On my test machine this cuts runtime of nft-test.py in half.
    Based on earlier patch from Pablo Neira Ayuso.

    Signed-off-by: Florian Westphal
    Signed-off-by: Pablo Neira Ayuso

    Florian Westphal
     
  • ->destroy is only allowed to free data, or do other cleanups that do not
    have side effects on other state, such as visibility to other netlink
    requests.

    Such things need to be done in ->deactivate.
    As a transaction can fail, we need to make sure we can undo such
    operations; therefore ->activate() has to be provided too.

    So print a warning and refuse registration if expr->ops provides
    only one of the two operations.
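
    A hedged sketch of the registration-time check (roughly what the check
    boils down to; exact context omitted):

    static int example_expr_check_ops(const struct nft_expr_ops *ops)
    {
            if (!ops)
                    return -EINVAL;

            /* ->activate and ->deactivate must be provided in pairs */
            if (WARN_ON_ONCE(!!ops->activate ^ !!ops->deactivate))
                    return -EINVAL;

            return 0;
    }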

    v2: fix nft_expr_check_ops to not repeat the same check twice (Jones Desougi)

    Signed-off-by: Florian Westphal
    Signed-off-by: Pablo Neira Ayuso

    Florian Westphal
     
  • Split unbind_set into destroy_set and an unbinding operation.

    Unbinding removes the set from the lists (so a new transaction will not
    find it anymore) but keeps the memory allocated (so the packet path
    continues to work).

    A rebind function is added to allow unrolling in case the transaction
    that wants to remove the set is aborted.

    A destroy function is added to free the memory, but this could occur
    outside of the transaction in the future.

    Signed-off-by: Florian Westphal
    Signed-off-by: Pablo Neira Ayuso

    Florian Westphal
     
  • Useful e.g. to avoid NATting inner headers of to-be-encrypted packets.

    Signed-off-by: Florian Westphal
    Signed-off-by: Pablo Neira Ayuso

    Florian Westphal
     
  • Same as ip_gre, use gre_parse_header() to parse the GRE header in the
    GRE error handler code.

    Signed-off-by: Haishuang Yan
    Signed-off-by: David S. Miller

    Haishuang Yan
     
  • gre_parse_header() stops parsing when a checksum error is encountered,
    which means tpi->key is undefined and ip_tunnel_lookup() will improperly
    return NULL.

    This patch allows passing a NULL pointer as the csum_err parameter. Even
    when a checksum error is encountered, gre_parse_header() will then not
    return an error and will continue parsing the GRE header as expected.
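
    A hedged caller-side sketch (illustrative handler, not the exact diff):
    the GRE error handler passes NULL so a checksum failure no longer aborts
    parsing and tpi.key stays valid for the tunnel lookup.

    static void example_gre_err(struct sk_buff *skb, u32 info)
    {
            struct tnl_ptk_info tpi;

            if (gre_parse_header(skb, &tpi, NULL, htons(ETH_P_IP), 0) < 0)
                    return;

            /* ... look up the tunnel using tpi.key and handle the error ... */
    }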

    Fixes: 9f57c67c379d ("gre: Remove support for sharing GRE protocol hook.")
    Reported-by: Jiri Benc
    Signed-off-by: Haishuang Yan
    Signed-off-by: David S. Miller

    Haishuang Yan
     
  • PHY_POLL is defined as -1, which means that we would be setting all
    flags of the PHY driver; this is also not a valid flag to tell PHYLIB
    about. Just remove it.

    Signed-off-by: Florian Fainelli
    Reviewed-by: Andrew Lunn
    Signed-off-by: David S. Miller

    Florian Fainelli
     
  • Davide Caratti says:

    ====================
    net/sched: act_police: lockless data path

    the data path of the 'police' action can be faster if we avoid using
    spinlocks:
    - patch 1 converts act_police to use per-cpu counters
    - patch 2 lets act_police use RCU to access its configuration data.

    test procedure (using pktgen from https://github.com/netoptimizer):
    # ip link add name eth1 type dummy
    # ip link set dev eth1 up
    # tc qdisc add dev eth1 clsact
    # tc filter add dev eth1 egress matchall action police \
    > rate 2gbit burst 100k conform-exceed pass/pass index 100
    # for c in 1 2 4; do
    > ./pktgen_bench_xmit_mode_queue_xmit.sh -v -s 64 -t $c -n 5000000 -i eth1
    > done

    test results (avg. pps/thread):

    $c  | before patch | after patch | improvement
    ----+--------------+-------------+------------
     1  |      3518448 |     3591240 | irrelevant
     2  |      3070065 |     3383393 | 10%
     4  |      1540969 |     3238385 | 110%
    ====================

    Signed-off-by: David S. Miller

    David S. Miller
     
  • Use RCU instead of spinlocks to protect concurrent read/write access to
    the act_police configuration. This reduces the effects of contention in
    the data path when multiple readers are present.
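
    A hedged sketch of the pattern (struct and field names are illustrative,
    not the exact act_police types):

    struct example_police_params {
            struct rcu_head rcu;
            u64 rate;
            u64 burst;
    };

    struct example_police {
            struct example_police_params __rcu *params;
    };

    /* data path (reader): no spinlock, only an RCU read-side section */
    static u64 example_police_rate(const struct example_police *p)
    {
            const struct example_police_params *params;
            u64 rate;

            rcu_read_lock();
            params = rcu_dereference(p->params);
            rate = params->rate;
            rcu_read_unlock();

            return rate;
    }

    /* control path (writer): publish a new copy, free the old one later */
    static void example_police_update(struct example_police *p,
                                      struct example_police_params *new)
    {
            struct example_police_params *old;

            old = rcu_dereference_protected(p->params, 1);  /* init lock held */
            rcu_assign_pointer(p->params, new);
            if (old)
                    kfree_rcu(old, rcu);
    }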

    Signed-off-by: Davide Caratti
    Signed-off-by: David S. Miller

    Davide Caratti
     
  • Use per-CPU counters instead of sharing a single set of stats with all
    cores. This removes the need for a spinlock when statistics are read or
    updated.
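
    A hedged sketch of the per-CPU counter pattern (illustrative struct and
    helper, not the actual act_police code):

    struct example_stats {
            u64                     packets;
            u64                     bytes;
            struct u64_stats_sync   syncp;
    };

    /* allocation (control path): stats = alloc_percpu(struct example_stats); */

    /* fast path, per packet: each core touches only its own counters */
    static void example_stats_update(struct example_stats __percpu *stats,
                                     unsigned int len)
    {
            struct example_stats *s = this_cpu_ptr(stats);

            u64_stats_update_begin(&s->syncp);
            s->packets++;
            s->bytes += len;
            u64_stats_update_end(&s->syncp);
    }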

    Signed-off-by: Davide Caratti
    Signed-off-by: David S. Miller

    Davide Caratti
     

14 Sep, 2018

23 commits

  • - In the CXGB4_DCB_STATE_FW_INCOMPLETE state, check if the DCB
    version has changed and update the supported DCB version.

    - Also, fill in the priority code point value for priority-based
    flow control.

    Signed-off-by: Ganesh Goudar
    Signed-off-by: David S. Miller

    Ganesh Goudar
     
  • Print per-RX-queue packet errors in sge_qinfo.

    Signed-off-by: Casey Leedom
    Signed-off-by: Ganesh Goudar
    Signed-off-by: David S. Miller

    Ganesh Goudar
     
  • Do not put a host-endian 0 or 1 into a big-endian field.

    Reported-by: Al Viro
    Signed-off-by: Ganesh Goudar
    Signed-off-by: David S. Miller

    Ganesh Goudar
     
  • pcpu_lstats is defined in several files, so unify the definitions into
    one and move it to a header file.

    Signed-off-by: Zhang Yu
    Signed-off-by: Li RongQing
    Signed-off-by: David S. Miller

    Li RongQing
     
  • In the quest to remove all stack VLA usage from the kernel[1], this
    removes the VLA used for the emac xaht registers size. Since the number
    of registers can only ever be 4 or 8, as detected in emac_init_config(),
    the maximum can be hardcoded and a runtime test added for robustness.

    [1] https://lkml.kernel.org/r/CA+55aFzCG-zNmZwX4A2FQpadafLfEzK6CC=qPXydAacU1RqZWA@mail.gmail.com
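
    A hedged sketch of the general pattern (names are illustrative, not the
    emac driver's actual identifiers):

    #define EXAMPLE_XAHT_MAX 8              /* hardware has either 4 or 8 slots */

    struct example_dev {
            unsigned int xaht_regs;         /* detected at init time: 4 or 8 */
    };

    static void example_program_xaht(struct example_dev *dev)
    {
            u32 regs[EXAMPLE_XAHT_MAX] = { 0 }; /* was: u32 regs[dev->xaht_regs]; */

            if (WARN_ON(dev->xaht_regs > EXAMPLE_XAHT_MAX))
                    return;                 /* runtime test for robustness */

            /* fill regs[0 .. dev->xaht_regs - 1] and program the hardware */
    }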

    Cc: "David S. Miller"
    Cc: Christian Lamparter
    Cc: Ivan Mikhaylov
    Cc: netdev@vger.kernel.org
    Co-developed-by: Benjamin Herrenschmidt
    Signed-off-by: Kees Cook
    Signed-off-by: David S. Miller

    Kees Cook
     
  • Replace "fallthru" with a proper "fall through" annotation.

    This fix is part of the ongoing effort to enable -Wimplicit-fallthrough.

    Signed-off-by: Gustavo A. R. Silva
    Signed-off-by: David S. Miller

    Gustavo A. R. Silva
     
  • Replace "fallthru" with a proper "fall through" annotation.

    This fix is part of the ongoing effort to enable -Wimplicit-fallthrough.

    Signed-off-by: Gustavo A. R. Silva
    Signed-off-by: David S. Miller

    Gustavo A. R. Silva
     
  • When splitting a GSO segment that consists of encapsulated packets, the
    skb->mac_len of the segments can end up being set wrong, causing packet
    drops in particular when using act_mirred and ifb interfaces in
    combination with a qdisc that splits GSO packets.

    This happens because at the time skb_segment() is called, network_header
    will point to the inner header, throwing off the calculation in
    skb_reset_mac_len(). The network_header is subsequently adjusted by the
    outer IP gso_segment handlers, but they don't set the mac_len.

    Fix this by adding skb_reset_mac_len() calls to both the IPv4 and IPv6
    gso_segment handlers, after they modify the network_header.
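
    A hedged sketch of where the fix lands (not the exact diff): in the
    per-segment loop of the outer gso_segment handler, once the network
    header points back at the outer header, recompute mac_len too.

    static void example_fixup_segs(struct sk_buff *segs)
    {
            struct sk_buff *skb;

            for (skb = segs; skb; skb = skb->next) {
                    /* the handler has restored skb->network_header to the
                     * outer header at this point */
                    skb_reset_mac_len(skb); /* mac_len = network - mac offset */
            }
    }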

    Many thanks to Eric Dumazet for his help in identifying the cause of
    the bug.

    Acked-by: Dave Taht
    Reviewed-by: Eric Dumazet
    Signed-off-by: Toke Høiland-Jørgensen
    Signed-off-by: David S. Miller

    Toke Høiland-Jørgensen
     
  • Remove duplicated include.

    Signed-off-by: YueHaibing
    Signed-off-by: David S. Miller

    YueHaibing
     
  • When the Device Tree does not provide the per-port interrupts, do not
    fail in b53_srab_irq_enable() but instead bail out gracefully. The SRAB
    driver is used on the BCM5301X (Northstar) platforms, which do not yet
    have the SRAB interrupts wired up.

    Fixes: 16994374a6fc ("net: dsa: b53: Make SRAB driver manage port interrupts")
    Signed-off-by: Florian Fainelli
    Signed-off-by: David S. Miller

    Florian Fainelli
     
  • Jason Wang says:

    ====================
    vhost_net TX batching

    This series tries to batch submitting packets to the underlying socket
    through msg_control during sendmsg(). This is done by:

    1) Doing the userspace copy inside vhost_net
    2) Building an XDP buff
    3) Batching at most 64 (VHOST_NET_BATCH) XDP buffs and submitting them
    at once through msg_control during sendmsg()
    4) Letting the underlying sockets use the XDP buffs directly when XDP
    is enabled, or build skbs based on the XDP buffs

    For packets that cannot be built easily with XDP, or for cases where
    batch submission is hard (e.g. sndbuf is limited), we will go for the
    previous slow path, passing the iov iterator to the underlying socket
    through sendmsg() once per packet.

    This can help to improve cache utilization and avoid lots of indirect
    calls with sendmsg(). It can also co-operate with the batching support
    of the underlying sockets (e.g. the case of XDP redirection through
    maps).

    Testpmd (txonly) in the guest shows obvious improvements:

    Test                /+pps%
    XDP_DROP on TAP     /+44.8%
    XDP_REDIRECT on TAP /+29%
    macvtap (skb)       /+26%

    Netperf TCP_STREAM TX from the guest shows obvious improvements for
    small packets:

    size/session/+thu%/+normalize%
    64/ 1/ +2%/ 0%
    64/ 2/ +3%/ +1%
    64/ 4/ +7%/ +5%
    64/ 8/ +8%/ +6%
    256/ 1/ +3%/ 0%
    256/ 2/ +10%/ +7%
    256/ 4/ +26%/ +22%
    256/ 8/ +27%/ +23%
    512/ 1/ +3%/ +2%
    512/ 2/ +19%/ +14%
    512/ 4/ +43%/ +40%
    512/ 8/ +45%/ +41%
    1024/ 1/ +4%/ 0%
    1024/ 2/ +27%/ +21%
    1024/ 4/ +38%/ +73%
    1024/ 8/ +15%/ +24%
    2048/ 1/ +10%/ +7%
    2048/ 2/ +16%/ +12%
    2048/ 4/ 0%/ +2%
    2048/ 8/ 0%/ +2%
    4096/ 1/ +36%/ +60%
    4096/ 2/ -11%/ -26%
    4096/ 4/ 0%/ +14%
    4096/ 8/ 0%/ +4%
    16384/ 1/ -1%/ +5%
    16384/ 2/ 0%/ +2%
    16384/ 4/ 0%/ -3%
    16384/ 8/ 0%/ +4%
    65535/ 1/ 0%/ +10%
    65535/ 2/ 0%/ +8%
    65535/ 4/ 0%/ +1%
    65535/ 8/ 0%/ +3%

    Please review.
    ====================

    Signed-off-by: David S. Miller

    David S. Miller
     
  • This patch implements XDP batching for vhost_net. The idea is to first
    try to do the userspace copy and build the XDP buff directly in vhost.
    Instead of submitting each packet immediately, vhost_net batches them in
    an array and submits every 64 (VHOST_NET_BATCH) packets to the
    underlying socket through msg_control of sendmsg().

    When XDP is enabled on the TUN/TAP device, TUN/TAP can process XDP
    inside a loop without caring about GUP and can thus do batched map
    flushing. When XDP is not enabled or not supported, the underlying
    socket needs to build an skb and pass it to the network core. The
    batched packet submission allows us to do batching like
    netif_receive_skb_list() in the future.

    This saves lots of indirect calls for better cache utilization. For
    cases where we can't do batching, e.g. when sndbuf is limited or the
    packet size is too large, we fall back to the usual
    one-packet-per-sendmsg() path.

    Doing testpmd on various setups gives us:

    Test                /+pps%
    XDP_DROP on TAP     /+44.8%
    XDP_REDIRECT on TAP /+29%
    macvtap (skb)       /+26%

    Netperf tests show obvious improvements for small packet transmission:

    size/session/+thu%/+normalize%
    64/ 1/ +2%/ 0%
    64/ 2/ +3%/ +1%
    64/ 4/ +7%/ +5%
    64/ 8/ +8%/ +6%
    256/ 1/ +3%/ 0%
    256/ 2/ +10%/ +7%
    256/ 4/ +26%/ +22%
    256/ 8/ +27%/ +23%
    512/ 1/ +3%/ +2%
    512/ 2/ +19%/ +14%
    512/ 4/ +43%/ +40%
    512/ 8/ +45%/ +41%
    1024/ 1/ +4%/ 0%
    1024/ 2/ +27%/ +21%
    1024/ 4/ +38%/ +73%
    1024/ 8/ +15%/ +24%
    2048/ 1/ +10%/ +7%
    2048/ 2/ +16%/ +12%
    2048/ 4/ 0%/ +2%
    2048/ 8/ 0%/ +2%
    4096/ 1/ +36%/ +60%
    4096/ 2/ -11%/ -26%
    4096/ 4/ 0%/ +14%
    4096/ 8/ 0%/ +4%
    16384/ 1/ -1%/ +5%
    16384/ 2/ 0%/ +2%
    16384/ 4/ 0%/ -3%
    16384/ 8/ 0%/ +4%
    65535/ 1/ 0%/ +10%
    65535/ 2/ 0%/ +8%
    65535/ 4/ 0%/ +1%
    65535/ 8/ 0%/ +3%

    Signed-off-by: Jason Wang
    Signed-off-by: David S. Miller

    Jason Wang
     
  • This patch implements the TUN_MSG_PTR msg_control type. This type allows
    the caller to pass an array of XDP buffs to tuntap through the ptr field
    of struct tun_msg_ctl. Tap will build skbs from those XDP buffers.

    This avoids lots of indirect calls, improves icache utilization and
    allows batched XDP flushing when doing XDP redirection.

    Signed-off-by: Jason Wang
    Signed-off-by: David S. Miller

    Jason Wang
     
  • This patch implements the TUN_MSG_PTR msg_control type. This type allows
    the caller to pass an array of XDP buffs to tuntap through the ptr field
    of struct tun_msg_ctl. If an XDP program is attached, tuntap can run the
    XDP program directly. If not, tuntap will build an skb and do a fast
    receive, since part of the work has already been done by vhost_net.

    This avoids lots of indirect calls, improves icache utilization and
    allows batched XDP flushing when doing XDP redirection.

    Signed-off-by: Jason Wang
    Signed-off-by: David S. Miller

    Jason Wang
     
  • This patch introduces a new tun/tap-specific msg_control:

    #define TUN_MSG_UBUF 1
    #define TUN_MSG_PTR  2

    struct tun_msg_ctl {
            int type;
            void *ptr;
    };

    This allows us to pass different kinds of msg_control through
    sendmsg(). The first supported type is ubuf (TUN_MSG_UBUF), which will
    be used by the existing vhost_net zerocopy code. The second is an XDP
    buff (TUN_MSG_PTR), which allows vhost_net to pass XDP buffs to TUN.
    This could be used to implement accepting an array of XDP buffs from
    vhost_net in the following patches.
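
    A hedged sketch of the caller side (illustrative helper; the real
    vhost_net code manages the buffer array and its length differently):

    static int example_submit_batch(struct socket *sock,
                                    struct xdp_buff *xdp_array)
    {
            struct tun_msg_ctl ctl = {
                    .type = TUN_MSG_PTR,
                    .ptr  = xdp_array,      /* XDP buffs prepared by the caller */
            };
            struct msghdr msg = {
                    .msg_control    = &ctl,
                    .msg_controllen = sizeof(ctl),
            };

            /* one sendmsg() hands the whole batch to the tap socket */
            return sock->ops->sendmsg(sock, &msg, 0);
    }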

    Signed-off-by: Jason Wang
    Signed-off-by: David S. Miller

    Jason Wang
     
  • This will allow adding batch flushing on top.

    Signed-off-by: Jason Wang
    Signed-off-by: David S. Miller

    Jason Wang
     
  • This patch splits the XDP logic out into a single function so that it
    can be reused by the XDP batching path in the following patch.

    Signed-off-by: Jason Wang
    Signed-off-by: David S. Miller

    Jason Wang
     
  • If we're sure not to go through native XDP, there's no need for several
    things like the bh and RCU handling. So this patch introduces a helper
    to build the skb and hold the page refcount. When we know we will go
    through the skb path, build the skb directly.

    Signed-off-by: Jason Wang
    Signed-off-by: David S. Miller

    Jason Wang
     
  • There's no need to duplicate the page-get logic in each action. So this
    patch gets the page and calculates the offset before processing XDP
    actions (except for XDP_DROP), and undoes this on errors (we don't care
    about performance on the error path). This will be used for factoring
    out the XDP logic.

    Signed-off-by: Jason Wang
    Signed-off-by: David S. Miller

    Jason Wang
     
  • This patch moves the bh enabling a little bit earlier; this will be
    used for factoring out the core XDP logic of tuntap.

    Acked-by: Michael S. Tsirkin
    Signed-off-by: Jason Wang
    Signed-off-by: David S. Miller

    Jason Wang
     
  • Acked-by: Michael S. Tsirkin
    Signed-off-by: Jason Wang
    Signed-off-by: David S. Miller

    Jason Wang
     
  • This patch introduces a new sock flag, SOCK_XDP. It will be used to
    notify the upper layer that an XDP program is attached to the lower
    socket, which requires extra headroom.

    TUN will be the first user.
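
    A hedged sketch of the intended use (illustrative call sites, not the
    exact tun/vhost code):

    /* lower device, on XDP program attach/detach */
    static void example_mark_xdp(struct sock *sk, bool xdp_attached)
    {
            if (xdp_attached)
                    sock_set_flag(sk, SOCK_XDP);    /* tell the upper layer */
            else
                    sock_reset_flag(sk, SOCK_XDP);
    }

    /* upper layer (e.g. vhost_net), when sizing buffers */
    static unsigned int example_headroom(const struct sock *sk)
    {
            return sock_flag(sk, SOCK_XDP) ? XDP_PACKET_HEADROOM : NET_SKB_PAD;
    }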

    Signed-off-by: Jason Wang
    Signed-off-by: David S. Miller

    Jason Wang
     
  • llc_sap_close() is called by llc_sap_put(), which could be called in
    BH context from llc_rcv(). We can't block in BH.

    There is no reason to block here; kfree_rcu() should be sufficient.
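
    The pattern, as a hedged sketch (illustrative struct, not the exact llc
    code):

    struct example_sap {
            struct rcu_head rcu;
            /* ... other fields ... */
    };

    static void example_sap_release(struct example_sap *sap)
    {
            /* was: synchronize_rcu(); kfree(sap);  (blocks, illegal in BH) */
            kfree_rcu(sap, rcu);    /* frees after a grace period, never blocks */
    }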

    Signed-off-by: Cong Wang
    Signed-off-by: David S. Miller

    Cong Wang
     

13 Sep, 2018

1 commit

  • The socket option will be enabled by default to ensure current behaviour
    is not changed. This is the same as for the IPv4 version.

    A socket bound to in6addr_any and a specific port will receive all
    traffic on that port. Analogous to IP_MULTICAST_ALL, disabling this
    option makes a socket that has joined one or more multicast groups only
    pass on multicast traffic from groups which were explicitly joined via
    this socket.

    Unless this option is disabled, a socket (or even a system) joined to
    multiple multicast groups is very hard to get right: filtering by
    destination address has to take place in user space to avoid receiving
    multicast traffic from other multicast groups, which might have traffic
    on the same port.

    Extending the IP_MULTICAST_ALL socket option to also apply to IPv6 was
    not done, to avoid changing the behaviour of current applications.
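
    A hedged userspace sketch of disabling the new option on a bound UDP
    socket (the fallback #define is only needed with older uapi headers):

    #include <netinet/in.h>
    #include <sys/socket.h>

    #ifndef IPV6_MULTICAST_ALL
    #define IPV6_MULTICAST_ALL 29
    #endif

    /* Only groups explicitly joined on this socket will be delivered. */
    static int example_restrict_to_joined_groups(int fd)
    {
            int off = 0;

            return setsockopt(fd, IPPROTO_IPV6, IPV6_MULTICAST_ALL,
                              &off, sizeof(off));
    }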

    Signed-off-by: Andre Naujoks
    Acked-By: YOSHIFUJI Hideaki
    Signed-off-by: David S. Miller

    Andre Naujoks