03 Nov, 2016

27 commits

  • nf_iterate() has become rather simple, we can integrate this code into
    nf_hook_slow() to reduce the amount of LOC in the core path.

    However, we still need nf_iterate() around for nf_queue packet handling,
    so move this function there, the only place where we need it. I think it
    should be possible to refactor the nf_queue code to get rid of it for
    good, but given this is slow path anyway, let's have a look at this later.

    Signed-off-by: Pablo Neira Ayuso

    Pablo Neira Ayuso
     
  • This field is only useful for nf_queue, so store it in the
    nf_queue_entry structure instead, away from the core path. Pass
    hook_head to nf_hook_slow().

    Since we always have a valid entry on the first iteration in
    nf_iterate(), we can use 'do { ... } while (entry)' loop instead.
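
    The loop shape described above can be sketched in user-space C. This is
    an illustration only: struct hook_entry, run_hooks() and the SKETCH_*
    values are hypothetical stand-ins, not the kernel's definitions, and the
    early exit on a non-accept verdict is an assumption of the sketch.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a hook entry; not the kernel's struct. */
struct hook_entry {
    int verdict;                /* verdict this hook returns */
    struct hook_entry *next;    /* next hook, or NULL at the end */
};

enum { SKETCH_DROP = 0, SKETCH_ACCEPT = 1 };

/*
 * Walk the hooks until one returns something other than ACCEPT.
 * The caller guarantees a valid first entry, so the body can run
 * before the loop condition is tested.
 */
static int run_hooks(const struct hook_entry *entry, unsigned int *visited)
{
    int verdict;

    assert(entry != NULL);
    *visited = 0;
    do {
        verdict = entry->verdict;
        (*visited)++;
        entry = entry->next;
    } while (entry != NULL && verdict == SKETCH_ACCEPT);

    return verdict;
}
```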

    Signed-off-by: Pablo Neira Ayuso

    Pablo Neira Ayuso
     
  • Use switch() for verdict handling and add explicit handling for
    NF_STOLEN and other non-conventional verdicts.
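
    A minimal sketch of switch-based verdict handling, assuming a simplified
    caller. The SK_* values are chosen to mirror the netfilter uapi verdict
    numbers; the sketch_action names and handle_verdict() are hypothetical
    and only illustrate one explicit case per verdict.

```c
#include <assert.h>

/* Local copies for the sketch; values mirror the uapi verdict numbers. */
enum sketch_verdict {
    SK_DROP = 0,
    SK_ACCEPT = 1,
    SK_STOLEN = 2,
    SK_QUEUE = 3,
    SK_REPEAT = 4
};

/* Hypothetical outcomes the caller would act on. */
enum sketch_action {
    ACT_FREE_PACKET,
    ACT_NEXT_HOOK,
    ACT_PACKET_GONE,
    ACT_ENQUEUE,
    ACT_UNEXPECTED
};

static enum sketch_action handle_verdict(unsigned int verdict)
{
    /* The low byte carries the verdict; upper bits carry extra data. */
    switch (verdict & 0xffu) {
    case SK_ACCEPT:
        return ACT_NEXT_HOOK;
    case SK_DROP:
        return ACT_FREE_PACKET;
    case SK_STOLEN:
        return ACT_PACKET_GONE;     /* the hook now owns the packet */
    case SK_QUEUE:
        return ACT_ENQUEUE;
    default:
        return ACT_UNEXPECTED;      /* unconventional verdict */
    }
}
```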

    Signed-off-by: Pablo Neira Ayuso

    Pablo Neira Ayuso
     
  • Don't copy relevant fields from hook state structure, instead use the
    one that is already available in struct xt_action_param.

    This patch also adds a set of new wrapper functions to fetch relevant
    hook state structure fields.

    Signed-off-by: Pablo Neira Ayuso

    Pablo Neira Ayuso
     
  • Place pointer to hook state in xt_action_param structure instead of
    copying the fields that we need. After this change xt_action_param fits
    into one cacheline.

    This patch also adds a set of new wrapper functions to fetch relevant
    hook state structure fields.
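
    The idea can be sketched as follows. All struct and function names here
    are hypothetical; the real xt_action_param layout and wrapper names are
    not reproduced.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the per-hook state. */
struct sketch_hook_state {
    unsigned int hook;
    unsigned char pf;
    const char *in_name;
    const char *out_name;
};

/* Before: each needed field is copied into the per-packet parameters. */
struct sketch_param_copy {
    const void *match_info;
    unsigned int hook;
    unsigned char pf;
    const char *in_name;
    const char *out_name;
};

/* After: a single pointer replaces the copied fields. */
struct sketch_param_ptr {
    const void *match_info;
    const struct sketch_hook_state *state;
};

/* Wrappers hide the extra indirection from callers. */
static inline unsigned int sketch_hooknum(const struct sketch_param_ptr *par)
{
    return par->state->hook;
}

static inline const char *sketch_inname(const struct sketch_param_ptr *par)
{
    return par->state->in_name;
}
```

    Callers keep reading fields through the wrappers, so the parameter
    block can shrink without touching every user.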

    Signed-off-by: Pablo Neira Ayuso

    Pablo Neira Ayuso
     
  • NF_STOP is only used by br_netfilter these days, and it can be emulated
    with a combination of NF_STOLEN plus explicit call to the ->okfn()
    function as Florian suggests.
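
    A rough sketch of that emulation, using hypothetical types: the hook
    calls the continuation itself and then reports the packet as stolen, so
    the core neither runs further hooks nor touches the packet again.

```c
#include <assert.h>

/* Values mirror the uapi verdict numbers; names are local to the sketch. */
enum { SK_ACCEPT = 1, SK_STOLEN = 2 };

/* Hypothetical packet; only counts how often it was delivered. */
struct sketch_pkt {
    int delivered;
};

typedef int (*sketch_okfn)(struct sketch_pkt *);

/* Hypothetical continuation, standing in for ->okfn(). */
static int sketch_finish(struct sketch_pkt *pkt)
{
    pkt->delivered++;
    return 0;
}

/* Hook that wants to skip the remaining hooks but still deliver. */
static int sketch_hook(struct sketch_pkt *pkt, sketch_okfn okfn)
{
    okfn(pkt);          /* continue the packet's journey ourselves */
    return SK_STOLEN;   /* the core must not process pkt any further */
}
```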

    To retain binary compatibility with userspace nf_queue applications, we
    have to keep NF_STOP around, so libnetfilter_queue userspace
    applications still work if they use NF_STOP for some exotic reason.

    Out of tree modules using NF_STOP would break, but we don't care about
    those.

    Signed-off-by: Pablo Neira Ayuso

    Pablo Neira Ayuso
     
  • Patch c5136b15ea36 ("netfilter: bridge: add and use br_nf_hook_thresh")
    introduced br_nf_hook_thresh().

    Replace NF_HOOK_THRESH() with br_nf_hook_thresh() in
    br_nf_forward_finish(), so we have no more callers for this macro.

    As a result, state->thresh and the explicit thresh parameter in the hook
    state structure are not required anymore. We can also get rid of the
    skip-hook-under-thresh loop in nf_iterate() in the core path, which is
    only used by br_netfilter to search for the filter hook.

    Suggested-by: Florian Westphal
    Signed-off-by: Pablo Neira Ayuso

    Pablo Neira Ayuso
     
  • We cannot block/sleep on nf_iterate because netfilter runs under rcu
    read lock these days, where blocking is well-known to be illegal. So
    let's remove these old comments.

    Signed-off-by: Pablo Neira Ayuso

    Pablo Neira Ayuso
     
  • This patch removes the compile-time code that catches unconventional
    verdicts. We have better ways to handle this case these days, e.g.
    pr_debug(), but even so I don't think this is useful at all, so let's
    remove it.

    Signed-off-by: Pablo Neira Ayuso

    Pablo Neira Ayuso
     
  • The driver sets the skb l3/l4 hash type based on NIC_CFG_RSS_HASH_TYPE_*,
    which is a bit mask. This is wrong: the hardware actually provides an enum.
    Use CQ_ENET_RQ_DESC_RSS_TYPE_* to set the l3 and l4 hash type.
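
    The class of bug can be illustrated with a small sketch. The numeric
    values and names below are hypothetical and do not reproduce the enic
    definitions; they only show why testing an enum value with a mask bit
    misclassifies packets.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical values: the descriptor carries ONE of these, not a mask. */
enum sketch_rss_type {
    SK_RSS_NONE = 0,
    SK_RSS_IPV4 = 1,
    SK_RSS_TCP_IPV4 = 2,
    SK_RSS_IPV6 = 3,
    SK_RSS_TCP_IPV6 = 4
};

/* Hypothetical configuration bit; only meaningful inside a bit mask. */
#define SK_CFG_TCP_IPV4 (1u << 1)

enum sketch_hash_level { SK_HASH_NONE, SK_HASH_L3, SK_HASH_L4 };

/* Buggy: tests an enum value as if it were a bit mask. */
static bool is_l4_by_mask(unsigned int type)
{
    return (type & SK_CFG_TCP_IPV4) != 0;
}

/* Fixed: compares against the enum values directly. */
static enum sketch_hash_level hash_level(enum sketch_rss_type type)
{
    switch (type) {
    case SK_RSS_TCP_IPV4:
    case SK_RSS_TCP_IPV6:
        return SK_HASH_L4;
    case SK_RSS_IPV4:
    case SK_RSS_IPV6:
        return SK_HASH_L3;
    default:
        return SK_HASH_NONE;
    }
}
```

    With these values, plain IPv6 (3) has the mask bit set and is wrongly
    treated as L4, while TCP over IPv6 (4) is missed.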

    Fixes: bf751ba802fe ("driver/net: enic: record q_number and rss_hash for skb")
    Signed-off-by: Govindarajulu Varadarajan
    Signed-off-by: David S. Miller

    Govindarajulu Varadarajan
     
  • The ethtool api {get|set}_settings is deprecated.
    We move this driver to new api {get|set}_link_ksettings.

    Signed-off-by: Philippe Reynes
    Reviewed-by: David Dillow
    Signed-off-by: David S. Miller

    Philippe Reynes
     
  • Commit ca26893f05e86 ("rhashtable: Add rhlist interface")
    added a field to rhashtable_iter so that its length became 56 bytes
    and would exceed the size of args in netlink_callback (which is
    48 bytes). The netlink diag dump function has already been
    allocating an iter structure and storing the pointer to it
    in the args of netlink_callback. ila_xlat also uses
    rhashtable_iter but still puts it directly in
    the arg block. Now that the size of rhashtable_iter has increased,
    we write beyond the end of the structure. The next field
    happens to be the cb_mutex pointer in netlink_sock, hence the crash.

    The fix is to allocate the rhashtable_iter and save a pointer to it
    in args.
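
    A user-space sketch of the pattern, under the assumption of a small
    fixed scratch array: the iterator lives on the heap and only its address
    is stored in the callback. All names are hypothetical; intptr_t is used
    here so the cast is portable.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical callback with a small fixed scratch area. */
struct sketch_cb {
    intptr_t args[6];
};

/* Hypothetical iterator that may grow past the scratch area. */
struct sketch_iter {
    unsigned char state[56];
    unsigned int position;
};

/* Allocate the iterator and remember only its address. */
static int sketch_dump_start(struct sketch_cb *cb)
{
    struct sketch_iter *iter = calloc(1, sizeof(*iter));

    if (!iter)
        return -1;
    cb->args[0] = (intptr_t)iter;
    return 0;
}

static struct sketch_iter *sketch_dump_iter(const struct sketch_cb *cb)
{
    return (struct sketch_iter *)cb->args[0];
}

/* Release the iterator when the dump is finished. */
static void sketch_dump_done(struct sketch_cb *cb)
{
    free(sketch_dump_iter(cb));
    cb->args[0] = 0;
}
```

    Growing the iterator later cannot overflow the scratch area, because
    only one pointer-sized slot is ever written.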

    Tested:

    modprobe ila
    ./ip ila add loc 3333:0:0:0 loc_match 2222:0:0:1,
    ./ip ila list # NO crash now

    Signed-off-by: Tom Herbert
    Signed-off-by: David S. Miller

    Tom Herbert
     
  • While preparing patches for killing raw sockets via the
    diag netlink interface I noticed that my runs got stuck:

    | [root@pcs7 ~]# cat /proc/`pidof ss`/stack
    | [] __lock_sock+0x80/0xc4
    | [] lock_sock_nested+0x47/0x95
    | [] udp_disconnect+0x19/0x33
    | [] raw_abort+0x33/0x42
    | [] sock_diag_destroy+0x4d/0x52

    which has not been the case before. I narrowed it down to the commit

    | commit 286c72deabaa240b7eebbd99496ed3324d69f3c0
    | Author: Eric Dumazet
    | Date: Thu Oct 20 09:39:40 2016 -0700
    |
    | udp: must lock the socket in udp_disconnect()

    where we started locking the socket for a different reason.

    So raw_abort escaped the renaming, and we have to
    fix this by using __udp_disconnect instead.
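
    The locked/unlocked split can be sketched like this. The lock is a
    plain flag that reports a second acquisition instead of blocking, and
    every name is hypothetical; sketch_disconnect_locked() plays the role of
    the double-underscore variant.

```c
#include <assert.h>

/* Hypothetical socket with a non-recursive lock modelled as a flag. */
struct sketch_sock {
    int locked;
    int connected;
};

/* Returns 0 on success, -1 if already held (a real lock would hang). */
static int sketch_lock(struct sketch_sock *sk)
{
    if (sk->locked)
        return -1;
    sk->locked = 1;
    return 0;
}

static void sketch_unlock(struct sketch_sock *sk)
{
    sk->locked = 0;
}

/* Unlocked worker: the caller must already hold the lock. */
static int sketch_disconnect_locked(struct sketch_sock *sk)
{
    assert(sk->locked);
    sk->connected = 0;
    return 0;
}

/* Locking wrapper for callers that do not hold the lock. */
static int sketch_disconnect(struct sketch_sock *sk)
{
    int err;

    if (sketch_lock(sk))
        return -1;
    err = sketch_disconnect_locked(sk);
    sketch_unlock(sk);
    return err;
}

/* Abort path: takes the lock itself, so it must use the worker. */
static int sketch_abort(struct sketch_sock *sk)
{
    int err;

    if (sketch_lock(sk))
        return -1;
    err = sketch_disconnect_locked(sk); /* not sketch_disconnect() */
    sketch_unlock(sk);
    return err;
}
```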

    Fixes: 286c72deabaa ("udp: must lock the socket in udp_disconnect()")
    CC: David S. Miller
    CC: Eric Dumazet
    CC: David Ahern
    CC: Alexey Kuznetsov
    CC: James Morris
    CC: Hideaki YOSHIFUJI
    CC: Patrick McHardy
    CC: Andrey Vagin
    CC: Stephen Hemminger
    Signed-off-by: Cyrill Gorcunov
    Acked-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Cyrill Gorcunov
     
  • To make full use of phylib's interrupt support, rather than handling some
    of the PHY work in the MAC driver, create an irq_domain for the PHY
    interrupt signalled via the USB interrupt EP and pass the irq number to
    phy_connect_direct() instead of PHY_IGNORE_INTERRUPT.

    Idea comes from drivers/gpio/gpio-dl2.c

    Signed-off-by: Woojung Huh
    Signed-off-by: David S. Miller

    Woojung Huh
     
  • As Ilya Lesokhin suggested, we can collapse two skbs at retransmit
    time even if the skb at the right has fragments.

    We simply have to use more generic skb_copy_bits() instead of
    skb_copy_from_linear_data() in tcp_collapse_retrans()

    We also need to guard this skb_copy_bits() call in case there is nothing
    to copy, otherwise skb_put() could panic if the left skb has frags.
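
    The difference between a linear-only copy and a generic copy can be
    sketched in user space. The buffer layout and both helpers below are
    hypothetical and much simpler than an skb; they only show a copy that
    crosses from the linear part into a fragment, plus the empty-copy guard.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical buffer: a linear head plus one fragment. */
struct sketch_buf {
    const unsigned char *head;
    size_t head_len;
    const unsigned char *frag;
    size_t frag_len;
};

/* Copy len bytes from offset, crossing from head into frag if needed. */
static int sketch_copy_bits(const struct sketch_buf *b, size_t offset,
                            unsigned char *to, size_t len)
{
    size_t total = b->head_len + b->frag_len;

    if (offset > total || len > total - offset)
        return -1;
    if (offset < b->head_len) {
        size_t n = b->head_len - offset;

        if (n > len)
            n = len;
        memcpy(to, b->head + offset, n);
        to += n;
        offset += n;
        len -= n;
    }
    if (len > 0)
        memcpy(to, b->frag + (offset - b->head_len), len);
    return 0;
}

/* Append src's payload to dst; dst_cap must be >= *dst_len. */
static int sketch_collapse(unsigned char *dst, size_t *dst_len,
                           size_t dst_cap, const struct sketch_buf *src)
{
    size_t n = src->head_len + src->frag_len;

    if (n == 0)
        return 0;   /* nothing to copy: do not reserve tail room */
    if (n > dst_cap - *dst_len)
        return -1;
    if (sketch_copy_bits(src, 0, dst + *dst_len, n))
        return -1;
    *dst_len += n;
    return 0;
}
```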

    Tested:

    Used the following packetdrill test:

    // Establish a connection.
    0.000 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
    +0 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
    +0 bind(3, ..., ...) = 0
    +0 listen(3, 1) = 0

    +0 < S 0:0(0) win 32792
    +0 > S. 0:0(0) ack 1
    +.100 < . 1:1(0) ack 1 win 257
    +0 accept(3, ..., ...) = 4

    +0 setsockopt(4, SOL_TCP, TCP_NODELAY, [1], 4) = 0
    +0 write(4, ..., 200) = 200
    +0 > P. 1:201(200) ack 1
    +.001 write(4, ..., 200) = 200
    +0 > P. 201:401(200) ack 1
    +.001 write(4, ..., 200) = 200
    +0 > P. 401:601(200) ack 1
    +.001 write(4, ..., 200) = 200
    +0 > P. 601:801(200) ack 1
    +.001 write(4, ..., 200) = 200
    +0 > P. 801:1001(200) ack 1
    +.001 write(4, ..., 100) = 100
    +0 > P. 1001:1101(100) ack 1
    +.001 write(4, ..., 100) = 100
    +0 > P. 1101:1201(100) ack 1
    +.001 write(4, ..., 100) = 100
    +0 > P. 1201:1301(100) ack 1
    +.001 write(4, ..., 100) = 100
    +0 > P. 1301:1401(100) ack 1

    +.100 < . 1:1(0) ack 1 win 257
    // Check that TCP collapse works :
    +0 > P. 1:1001(1000) ack 1

    Reported-by: Ilya Lesokhin
    Signed-off-by: Eric Dumazet
    Cc: Neal Cardwell
    Cc: Yuchung Cheng
    Cc: Ilpo Järvinen
    Acked-by: Neal Cardwell
    Signed-off-by: David S. Miller

    Eric Dumazet
     
  • The ethtool api {get|set}_settings is deprecated.
    We move this driver to new api {get|set}_link_ksettings.

    Signed-off-by: Philippe Reynes
    Signed-off-by: David S. Miller

    Philippe Reynes
     
  • The ethtool api {get|set}_settings is deprecated.
    We move this driver to new api {get|set}_link_ksettings.

    Signed-off-by: Philippe Reynes
    Signed-off-by: David S. Miller

    Philippe Reynes
     
  • The old ethtool api (get_settings and set_settings) has generic mii
    functions mii_ethtool_sset and mii_ethtool_gset.

    To support the new ethtool api ({get|set}_link_ksettings), we add
    two generic mii functions, mii_ethtool_{get|set}_link_ksettings.

    Signed-off-by: Philippe Reynes
    Signed-off-by: David S. Miller

    Philippe Reynes
     
  • Tariq Toukan says:

    ====================
    mlx4 XDP TX refactor

    This patchset refactors the XDP forwarding case, so that
    its dedicated transmit queues are managed in a complete
    separation from the other regular ones.

    It also adds ethtool counters for XDP cases.

    Series generated against net-next commit:
    22ca904ad70a genetlink: fix error return code in genl_register_family()

    Thanks,
    Tariq.

    v3:
    * Exposed per ring counters.

    v2:
    * Added ethtool counters.
    * Rebased, now patch 2 reverts Brenden's fix, as the bug no longer exists:
    958b3d396d7f ("net/mlx4_en: fixup xdp tx irq to match rx")
    * Updated commit message of patch 2.
    ====================

    Signed-off-by: David S. Miller

    David S. Miller
     
  • XDP statistics are reported in ethtool, in total and per ring,
    as follows:
    - xdp_drop: the number of packets dropped by xdp.
    - xdp_tx: the number of packets forwarded by xdp.
    - xdp_tx_full: the number of times an xdp forward failed
    due to a full tx xdp ring.

    In addition, all packets that are dropped/forwarded by XDP
    are no longer accounted in rx_packets/rx_bytes of the ring,
    so that they count traffic that is passed to the stack.
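
    The accounting rule can be sketched as below. The struct, enum and
    function names are hypothetical, and the handling of a failed forward
    (counted only in xdp_tx_full here) is an assumption of the sketch, not a
    statement about the driver.

```c
#include <assert.h>

/* Hypothetical XDP outcomes for one received packet. */
enum sketch_xdp_action { SK_XDP_PASS, SK_XDP_DROP, SK_XDP_TX };

/* Hypothetical per-ring counters, named after the ethtool statistics. */
struct sketch_ring_stats {
    unsigned long rx_packets;
    unsigned long rx_bytes;
    unsigned long xdp_drop;
    unsigned long xdp_tx;
    unsigned long xdp_tx_full;
};

static void sketch_account(struct sketch_ring_stats *s,
                           enum sketch_xdp_action act,
                           unsigned int len, int tx_ring_full)
{
    switch (act) {
    case SK_XDP_DROP:
        s->xdp_drop++;
        break;
    case SK_XDP_TX:
        if (tx_ring_full)
            s->xdp_tx_full++;   /* forward failed: ring was full */
        else
            s->xdp_tx++;
        break;
    case SK_XDP_PASS:
    default:
        /* Only traffic handed to the stack counts as rx. */
        s->rx_packets++;
        s->rx_bytes += len;
        break;
    }
}
```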

    Signed-off-by: Tariq Toukan
    Signed-off-by: David S. Miller

    Tariq Toukan
     
  • Separately manage the two types of TX rings: regular ones, and XDP.
    Upon an XDP set, do not borrow regular TX rings and convert them
    into XDP ones, but allocate new ones, unless we hit the max number
    of rings.
    This means that on systems with fewer cores we do not consume the
    existing TX rings for XDP, as long as we stay within the TX ring limit.

    XDP TX rings counters are not shown in ethtool statistics.
    Instead, XDP counters will be added to the respective RX rings
    in a downstream patch.

    This has no performance implications.

    Signed-off-by: Tariq Toukan
    Signed-off-by: David S. Miller

    Tariq Toukan
     
  • Support XDP CQ type, and refactor the CQ type enum.
    Rename the is_tx field to match the change.

    Signed-off-by: Tariq Toukan
    Signed-off-by: David S. Miller

    Tariq Toukan
     
  • After adding sctp gso, sctp_packet_transmit has become quite a big function.

    This patch extracts the code for packing a packet from
    sctp_packet_transmit into sctp_packet_pack, adds some comments, and
    simplifies the err path by freeing the auth chunk when freeing the packet
    chunk_list in the out path and by freeing the head skb early if packing
    the packet fails.

    Signed-off-by: Xin Long
    Acked-by: Marcelo Ricardo Leitner
    Acked-by: Neil Horman
    Signed-off-by: David S. Miller

    Xin Long
     
  • Roi Dayan says:

    ====================
    misc TC/flower changes

    This series includes two small changes to the TC flower classifier.
    ====================

    Signed-off-by: David S. Miller

    David S. Miller
     
  • Move common code from fl_delete and fl_destroy to __fl_delete.

    Signed-off-by: Roi Dayan
    Acked-by: Jiri Pirko
    Signed-off-by: David S. Miller

    Roi Dayan
     
  • tcf_unbind was called in fl_delete but was missing in fl_destroy when
    force deleting flows.

    Fixes: 77b9900ef53a ('tc: introduce Flower classifier')
    Signed-off-by: Roi Dayan
    Reviewed-by: Jiri Pirko
    Signed-off-by: David S. Miller

    Roi Dayan
     
  • Pablo Neira Ayuso says:

    ====================
    Netfilter updates for net-next

    The following patchset contains Netfilter updates for your net-next
    tree. This includes better integration with the routing subsystem for
    nf_tables, explicit notrack support and smaller updates. More
    specifically, they are:

    1) Add fib lookup expression for nf_tables, from Florian Westphal. This
    new expression provides a native replacement for iptables addrtype
    and rp_filter matches. This is more flexible though, since we can
    populate the kernel flowi representation used to query the fib to
    accommodate new use cases, such as RTBH through skb mark.

    2) Introduce rt expression for nf_tables, from Anders K. Pedersen. This
    new expression allows you to access skbuff route metadata, more
    specifically nexthop and classid fields.

    3) Add notrack support for nf_tables, to skip conntracking, requested by
    many users already.

    4) Add boilerplate code to allow to use nf_log infrastructure from
    nf_tables ingress.

    5) Allow to mangle pkttype from nf_tables prerouting chain, to emulate
    the xtables cluster match, from Liping Zhang.

    6) Move socket lookup code into generic nf_socket_* infrastructure so
    we can provide a native replacement for the xtables socket match.

    7) Make sure nfnetlink_queue data that is updated on every packet is
    placed in a different cacheline from read-only data, from Florian Westphal.

    8) Handle NF_STOLEN from nf_tables core, also from Florian Westphal.

    9) Start round robin number generation in nft_numgen from zero,
    instead of n-1, for consistency with xtables statistics match,
    patch from Liping Zhang.

    10) Set GFP_NOWARN flag in skbuff netlink allocations in nfnetlink_log,
    given we retry with a smaller allocation on failure, from Calvin Owens.

    11) Cleanup xt_multiport to use switch(), from Gao feng.

    12) Remove superfluous check in nft_immediate and nft_cmp, from
    Liping Zhang.
    ====================
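
    Item 9 above can be illustrated with a small round-robin generator whose
    first result is zero. This is a single-threaded sketch with hypothetical
    names; it does not reproduce nft_numgen.

```c
#include <assert.h>

/* Hypothetical generator state. */
struct sketch_numgen {
    unsigned int counter;   /* next value to hand out */
    unsigned int modulus;   /* results lie in [0, modulus) */
};

/* Return the current value, then advance; the first result is 0. */
static unsigned int sketch_numgen_next(struct sketch_numgen *g)
{
    unsigned int value;

    assert(g->modulus > 0);
    value = g->counter;
    g->counter = (g->counter + 1) % g->modulus;
    return value;
}
```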

    Signed-off-by: David S. Miller

    David S. Miller
     

02 Nov, 2016

7 commits

  • As the comment indicates, the data at the end of nfqnl_instance struct is
    written on every queue/dequeue, so it should reside in its own cacheline.

    Before this change, 'lock' was in first cacheline so we dirtied both.
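
    The layout idea can be sketched in C11. The field names and the 64-byte
    cacheline size are assumptions for illustration; the kernel uses its own
    alignment annotations rather than alignas.

```c
#include <assert.h>
#include <stdalign.h>
#include <stddef.h>

#define SKETCH_CACHELINE 64     /* assumed cacheline size */

/* Hypothetical instance: read-mostly fields first, hot fields after. */
struct sketch_instance {
    /* Read-mostly configuration. */
    unsigned int queue_maxlen;
    unsigned int copy_range;
    unsigned short queue_num;

    /* Written on every enqueue/dequeue: starts a fresh cacheline. */
    alignas(SKETCH_CACHELINE) int lock;
    unsigned int queue_total;
    unsigned int id_sequence;
};
```

    With this layout, taking the lock and bumping the counters dirties only
    the second cacheline, while readers of the configuration stay on the
    first.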

    Signed-off-by: Florian Westphal
    Signed-off-by: Pablo Neira Ayuso

    Florian Westphal
     
  • After calling nft_data_init, size is already validated and desc.len will
    not exceed sizeof(struct nft_data), i.e. 16 bytes. So it will never
    exceed U8_MAX.

    Furthermore, in nft_immediate_init, we forget to call nft_data_uninit
    when desc.len exceeds U8_MAX. This cannot happen, but it is still
    a logical mistake.

    Now remove this redundant validation introduced by commit 36b701fae12a
    ("netfilter: nf_tables: validate maximum value of u32 netlink attributes")

    Signed-off-by: Liping Zhang
    Signed-off-by: Pablo Neira Ayuso

    Liping Zhang
     
  • Introduces an nftables rt expression for routing related data with support
    for nexthop (i.e. the directly connected IP address that an outgoing packet
    is sent to), which can be used either for matching or accounting, eg.

    # nft add rule filter postrouting \
    ip daddr 192.168.1.0/24 rt nexthop != 192.168.0.1 drop

    This will drop any traffic to 192.168.1.0/24 that is not routed via
    192.168.0.1.

    # nft add rule filter postrouting \
    flow table acct { rt nexthop timeout 600s counter }
    # nft add rule ip6 filter postrouting \
    flow table acct { rt nexthop timeout 600s counter }

    These rules count outgoing traffic per nexthop. Note that the timeout
    releases an entry if no traffic is seen for this nexthop within 10 minutes.

    # nft add rule inet filter postrouting \
    ether type ip \
    flow table acct { rt nexthop timeout 600s counter }
    # nft add rule inet filter postrouting \
    ether type ip6 \
    flow table acct { rt nexthop timeout 600s counter }

    Same as above, but via the inet family, where the ether type must be
    specified explicitly.

    "rt classid" is also implemented identically to "meta rtclassid", since it
    is more logical to have this match in the routing expression going forward.

    Signed-off-by: Anders K. Pedersen
    Signed-off-by: Pablo Neira Ayuso

    Anders K. Pedersen
     
  • We need this split to reuse existing codebase for the upcoming nf_tables
    socket expression.

    Signed-off-by: Pablo Neira Ayuso

    Pablo Neira Ayuso
     
  • Move layer 2 packet logging into nf_log_l2packet() that resides in
    nf_log_common.c, so this can be shared by both bridge and netdev
    families.

    This patch adds the boilerplate code to register the netdev logging
    family.

    Signed-off-by: Pablo Neira Ayuso

    Pablo Neira Ayuso
     
  • Add FIB expression, supported for ipv4, ipv6 and inet family (the latter
    just dispatches to ipv4 or ipv6 one based on nfproto).

    Currently supports fetching output interface index/name and the
    rtm_type associated with an address.

    This can be used for adding path filtering. rtm_type is useful
    to e.g. enforce a strong-end host model where packets
    are only accepted if daddr is configured on the interface the
    packet arrived on.

    The fib expression is a native nftables alternative to the
    xtables addrtype and rp_filter matches.

    FIB result order for oif/oifname retrieval is as follows:
    - if packet is local (skb has rtable, RTF_LOCAL set, this
    will also catch looped-back multicast packets), set oif to
    the loopback interface.
    - if fib lookup returns an error, or result points to local,
    store zero result. This means '--local' option of -m rpfilter
    is not supported. It is possible to use 'fib type local' or add
    explicit saddr/daddr matching rules to create exceptions if this
    is really needed.
    - store result in the destination register.
    In case of multiple routes, search set for desired oif in case
    strict matching is requested.
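
    The result order listed above can be sketched as a plain decision
    function. The input struct, the function name and the loopback ifindex
    of 1 are assumptions; the strict multi-route search is left out.

```c
#include <assert.h>
#include <stdbool.h>

#define SKETCH_LOOPBACK_IFINDEX 1   /* assumed ifindex of lo */

/* Hypothetical summary of what the lookup path knows about a packet. */
struct sketch_fib_input {
    bool packet_is_local;   /* skb already carries a local route */
    int lookup_err;         /* 0 on success, negative on failure */
    bool result_is_local;   /* lookup resolved to a local address */
    int result_oif;         /* output interface from the lookup */
};

/* Apply the three steps in order and return the oif to store. */
static int sketch_fib_oif(const struct sketch_fib_input *in)
{
    if (in->packet_is_local)
        return SKETCH_LOOPBACK_IFINDEX;
    if (in->lookup_err != 0 || in->result_is_local)
        return 0;
    return in->result_oif;
}
```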

    ipv4 and ipv6 fib expressions are supposed to behave the same.

    [ I have collapsed Arnd Bergmann's ("netfilter: nf_tables: fib warnings")

    http://patchwork.ozlabs.org/patch/688615/

    to address fallout from this patch after rebasing nf-next, that was
    posted to address compilation warnings. --pablo ]

    Signed-off-by: Florian Westphal
    Signed-off-by: Pablo Neira Ayuso

    Florian Westphal
     
  • Fix to return a negative error code from the idr_alloc() error handling
    case instead of 0, as done elsewhere in this function.

    Also fix the return value check of idr_alloc(), since idr_alloc() returns
    a negative error code on failure, not zero.
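
    The corrected check can be sketched with a stand-in allocator that, like
    idr_alloc(), returns an ID >= 0 on success and a negative errno value on
    failure. sketch_id_alloc() and sketch_register() are hypothetical.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical allocator: ID >= 0 on success, negative errno on failure. */
static int sketch_id_alloc(int *next, int limit)
{
    if (*next >= limit)
        return -ENOSPC;
    return (*next)++;
}

/* Register something and propagate the allocator's error code. */
static int sketch_register(int *next, int limit, int *id_out)
{
    int id = sketch_id_alloc(next, limit);

    if (id < 0)     /* not `if (!id)`: 0 is a valid ID */
        return id;  /* return the negative error, not 0 */
    *id_out = id;
    return 0;
}
```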

    Fixes: 2ae0f17df1cd ("genetlink: use idr to track families")
    Signed-off-by: Wei Yongjun
    Signed-off-by: David S. Miller

    Wei Yongjun
     

01 Nov, 2016

6 commits

  • Enable support for IPv4 multicast:
    - similar to unicast the flow struct is updated to L3 master device
    if relevant prior to calling fib_rules_lookup. The table id is saved
    to the lookup arg so the rule action for ipmr can return the table
    associated with the device.

    - ip_mr_forward needs to check for master device mismatch as well
    since the skb->dev is set to it

    - allow multicast address on VRF device for Rx by checking for the
    daddr in the VRF device as well as the original ingress device

    - on Tx need to drop to __mkroute_output when FIB lookup fails for
    multicast destination address.

    - if CONFIG_IP_MROUTE_MULTIPLE_TABLES is enabled VRF driver creates
    IPMR FIB rules on first device create similar to FIB rules. In
    addition the VRF driver does not divert IPv4 multicast packets:
    it breaks on Tx since the fib lookup fails on the mcast address.

    With this patch, ipmr forwarding and local rx/tx work.

    Signed-off-by: David Ahern
    Signed-off-by: David S. Miller

    David Ahern
     
  • Parthasarathy Bhuvaragan says:

    ====================
    tipc: socket layer improvements

    The following issues with the current socket layer hinder the
    implementation of socket diagnostics, which led to this patch series.

    1. tipc socket state is derived from multiple variables like
    sock->state, tsk->probing_state and tsk->connected. This style forces
    us to export multiple attributes to the user space, which has to be
    backward compatible.

    2. Abuse of sock->state cannot be exported to user-space without
    requiring tipc specific hacks in the user-space.
    - For connection less (CL) sockets sock->state is overloaded to
    tipc state SS_READY.
    - For connection oriented (CO) listening socket sock->state is
    overloaded to tipc state SS_LISTEN.

    This series is split into four:
    1. Bug fixes in patch #1,2,3.
    2. Minor cleanups in patch#4-5.
    3. Express all tipc states using a single variable in patch#6-8.
    4. Migrate the new tipc states to sk->sk_state in patch#9-16.

    The figures below represent the FSM after this series:

    Stream Server Listening Socket:
    +-----------+ +-------------+
    | TIPC_OPEN |------>| TIPC_LISTEN |
    +-----------+ +-------------+

    Stream Server Data Socket:
    +-----------+ +------------------+
    | TIPC_OPEN |------>| TIPC_ESTABLISHED |
    +-----------+ +------------------+
    ^ |
    | |
    | v
    +--------------------+
    | TIPC_DISCONNECTING |
    +--------------------+

    Stream Socket Client:
    +-----------+ +-----------------+
    | TIPC_OPEN |------>| TIPC_CONNECTING |------+
    +-----------+ +-----------------+ |
    | |
    | |
    v |
    +------------------+ |
    | TIPC_ESTABLISHED | |
    +------------------+ |
    ^ | |
    | | |
    | v |
    +--------------------+ |
    | TIPC_DISCONNECTING |

    David S. Miller
     
  • In this commit, we replace references to sock->state SS_CONNECTED
    with sk_state TIPC_ESTABLISHED.

    Finally, the sock->state is no longer explicitly used by tipc.
    The FSM below is for various types of connection oriented sockets.

    Stream Server Listening Socket:
    +-----------+ +-------------+
    | TIPC_OPEN |------>| TIPC_LISTEN |
    +-----------+ +-------------+

    Stream Server Data Socket:
    +-----------+ +------------------+
    | TIPC_OPEN |------>| TIPC_ESTABLISHED |
    +-----------+ +------------------+
    ^ |
    | |
    | v
    +--------------------+
    | TIPC_DISCONNECTING |
    +--------------------+

    Stream Socket Client:
    +-----------+ +-----------------+
    | TIPC_OPEN |------>| TIPC_CONNECTING |------+
    +-----------+ +-----------------+ |
    | |
    | |
    v |
    +------------------+ |
    | TIPC_ESTABLISHED | |
    +------------------+ |
    ^ | |
    | | |
    | v |
    +--------------------+ |
    | TIPC_DISCONNECTING |
    Signed-off-by: David S. Miller

    Parthasarathy Bhuvaragan
     
  • In this commit, we create a new tipc socket state TIPC_CONNECTING
    by primarily replacing the SS_CONNECTING with TIPC_CONNECTING.

    There is no functional change in this commit.

    Signed-off-by: Parthasarathy Bhuvaragan
    Signed-off-by: David S. Miller

    Parthasarathy Bhuvaragan
     
  • In this commit, we replace the references to SS_DISCONNECTING with
    the combination of sk_state TIPC_DISCONNECTING and flags set in
    sk_shutdown.
    We introduce a new function _tipc_shutdown(), which provides
    the common code required by tipc_release() and tipc_shutdown().

    Signed-off-by: Parthasarathy Bhuvaragan
    Signed-off-by: David S. Miller

    Parthasarathy Bhuvaragan
     
  • In this commit, we create a new tipc socket state TIPC_DISCONNECTING in
    sk_state. TIPC_DISCONNECTING is replacing the socket connection status
    update using SS_DISCONNECTING.
    TIPC_DISCONNECTING is set for connection oriented sockets at:
    - tipc_shutdown()
    - connection probe timeout
    - when we receive an error message on the connection.

    There is no functional change in this commit.

    Signed-off-by: Parthasarathy Bhuvaragan
    Signed-off-by: David S. Miller

    Parthasarathy Bhuvaragan