18 Jan, 2019

40 commits

  • No need to get/put module owner reference, none of these can be removed
    anymore.

    Signed-off-by: Florian Westphal
    Signed-off-by: Pablo Neira Ayuso

    Florian Westphal
     
  • Only used by icmp(v6). Prefer a direct call and remove this
    function from the l4proto struct.

    Signed-off-by: Florian Westphal
    Signed-off-by: Pablo Neira Ayuso

    Florian Westphal
     
  • GRE is now builtin, so we can handle it via direct call and
    remove the callback.

    Signed-off-by: Florian Westphal
    Signed-off-by: Pablo Neira Ayuso

    Florian Westphal
     
  • No users anymore.

    Signed-off-by: Florian Westphal
    Signed-off-by: Pablo Neira Ayuso

    Florian Westphal
     
  • This makes the last of the modular l4 trackers 'bool'.

    After this, all infrastructure to handle dynamic l4 protocol registration
    becomes obsolete and can be removed in followup patches.

    Old:
    302824 net/netfilter/nf_conntrack.ko
    21504 net/netfilter/nf_conntrack_proto_gre.ko

    New:
    313728 net/netfilter/nf_conntrack.ko

    Old:
       text    data     bss     dec     hex filename
       6281    1732       4    8017    1f51 nf_conntrack_proto_gre.ko
     108356   20613     236  129205   1f8b5 nf_conntrack.ko
    New:
     112095   21381     240  133716   20a54 nf_conntrack.ko

    The size increase is only temporary.

    Signed-off-by: Florian Westphal
    Signed-off-by: Pablo Neira Ayuso

    Florian Westphal
     
  • We can use gre. Lock is only needed when a new expectation is added.

    In case a single spinlock proves to be problematic we can either add one
    per netns or use an array of locks combined with net_hash_mix() or similar
    to pick the 'correct' one.

    But given this is only needed for an expectation rather than per packet
    a single one should be ok.

    Signed-off-by: Florian Westphal
    Signed-off-by: Pablo Neira Ayuso

    Florian Westphal
     
  • rather than handling them via indirect call, use a direct one instead.
    This leaves GRE as the last user of this indirect call facility.

    Signed-off-by: Florian Westphal
    Signed-off-by: Pablo Neira Ayuso

    Florian Westphal
     
  • The l4 protocol trackers are invoked via indirect call: l4proto->packet().

    With one exception (gre), all l4trackers are builtin, so we can make
    .packet optional and use a direct call for most protocols.

    Signed-off-by: Florian Westphal
    Signed-off-by: Pablo Neira Ayuso

    Florian Westphal
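
The optional-callback idea above can be illustrated with a small userspace sketch. All names here (`struct l4proto`, `handle_packet`, the `*_packet` helpers, the `PROTO_*` constants) are hypothetical stand-ins rather than the kernel's definitions, and the arithmetic in the helpers exists only so the chosen path is observable.

```c
#include <stddef.h>

/* Hypothetical protocol numbers for the sketch (they match the IANA values). */
enum { PROTO_TCP = 6, PROTO_UDP = 17, PROTO_GRE = 47 };

/* Simplified stand-in for a tracker descriptor. */
struct l4proto {
	/* Optional: only set by trackers that are not built in. */
	int (*packet)(int payload);
};

int tcp_packet(int payload)     { return payload + 1; }
int udp_packet(int payload)     { return payload + 2; }
int gre_packet(int payload)     { return payload + 3; }
int generic_packet(int payload) { return payload; }

/* GRE still registers a callback; TCP relies on the direct call. */
const struct l4proto gre_l4proto = { .packet = gre_packet };
const struct l4proto tcp_l4proto = { .packet = NULL };

int handle_packet(const struct l4proto *l4proto, int protonum, int payload)
{
	/* Indirect call only when a callback was registered. */
	if (l4proto->packet)
		return l4proto->packet(payload);

	/* Built-in trackers are reached by direct calls. */
	switch (protonum) {
	case PROTO_TCP:
		return tcp_packet(payload);
	case PROTO_UDP:
		return udp_packet(payload);
	}
	return generic_packet(payload);
}
```

With this layout only the one modular tracker pays for the indirect call; everything else is dispatched through the switch.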
     
  • To allow for a batch to contain rules in arbitrary ordering, introduce
    NFTA_RULE_POSITION_ID attribute which works just like NFTA_RULE_POSITION
    but contains the ID of another rule within the same batch. This helps
    iptables-nft-restore handling dumps with mixed insert/append commands
    correctly.

    Note that NFTA_RULE_POSITION takes precedence over
    NFTA_RULE_POSITION_ID, so if the former is present, the latter is
    ignored.

    Signed-off-by: Phil Sutter
    Signed-off-by: Pablo Neira Ayuso

    Phil Sutter
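
The precedence rule described above can be sketched as follows. This is not the nf_tables code: `struct rule_request`, `struct anchor` and `resolve_anchor()` are hypothetical names that only model "NFTA_RULE_POSITION wins over NFTA_RULE_POSITION_ID when both are present".

```c
#include <stdint.h>

/* Hypothetical, already-parsed view of the two attributes. */
struct rule_request {
	int has_position;      /* NFTA_RULE_POSITION was present */
	uint64_t position;     /* handle of an existing rule */
	int has_position_id;   /* NFTA_RULE_POSITION_ID was present */
	uint32_t position_id;  /* ID of another rule in the same batch */
};

enum anchor_kind { ANCHOR_NONE, ANCHOR_HANDLE, ANCHOR_BATCH_ID };

struct anchor {
	enum anchor_kind kind;
	uint64_t value;
};

struct anchor resolve_anchor(const struct rule_request *req)
{
	struct anchor a = { ANCHOR_NONE, 0 };

	if (req->has_position) {
		/* The handle takes precedence; any position ID is ignored. */
		a.kind = ANCHOR_HANDLE;
		a.value = req->position;
	} else if (req->has_position_id) {
		a.kind = ANCHOR_BATCH_ID;
		a.value = req->position_id;
	}
	return a;
}
```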
     
  • Following command:
    iptables -D FORWARD -m physdev ...
    causes connectivity loss in some setups.

    The reason is that iptables userspace will probe the kernel for the module
    revision of the physdev match, and physdev has an artificial dependency on
    br_netfilter (xt_physdev use makes no sense unless a br_netfilter module
    is loaded).

    This causes the "physdev" module to be loaded, which in turn enables the
    "call-iptables" infrastructure.

    Bridged packets might then get dropped by the iptables ruleset.

    The better fix would be to change the "call-iptables" defaults to 0 and
    enforce explicit setting to 1, but that breaks backwards compatibility.

    This does the next best thing: add a request_module call to checkentry.
    This way, a stray '-D ... -m physdev' won't activate br_netfilter
    anymore.

    Signed-off-by: Florian Westphal
    Signed-off-by: Pablo Neira Ayuso

    Florian Westphal
     
  • place them into the confirm one.

    Old:
    hook (300): ipv4/6_help() first call helper, then seqadj.
    hook (INT_MAX): confirm

    Now:
    hook (INT_MAX): confirm, first call helper, then seqadj, then confirm

    Not having the extra call is noticeable in benchmarks.

    Signed-off-by: Florian Westphal
    Signed-off-by: Pablo Neira Ayuso

    Florian Westphal
     
  • With CONFIG_RETPOLINE it's faster to add an if (ptr == &foo_func)
    check and use direct calls for all the built-in expressions.

    ~15% improvement in pathological cases.

    checkpatch doesn't like the X macro due to the embedded return statement,
    but the macro has a very limited scope so I don't think it's a problem.

    I would like to avoid bugs of the form
        if (e->ops->eval == (unsigned long)nft_foo_eval)
                nft_bar_eval();

    and an open-coded if ()/else if ()/else cascade, thus the macro.

    Signed-off-by: Florian Westphal
    Signed-off-by: Pablo Neira Ayuso

    Florian Westphal
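
A minimal userspace sketch of the pointer-comparison technique described above, assuming hypothetical names throughout (`struct expr`, `expr_call`, `foo_eval`, `bar_eval`, `modular_eval`); it is not the nf_tables implementation. The macro pairs the comparison and the direct call with the same function name, which rules out the mismatched compare/call bug the message mentions.

```c
struct expr;
typedef int (*eval_fn)(const struct expr *e, int reg);

/* Hypothetical expression: an eval callback plus one operand. */
struct expr {
	eval_fn eval;
	int arg;
};

/* Two "built-in" evaluators and one that stays behind the pointer. */
int foo_eval(const struct expr *e, int reg)     { return reg + e->arg; }
int bar_eval(const struct expr *e, int reg)     { return reg * e->arg; }
int modular_eval(const struct expr *e, int reg) { return reg - e->arg; }

int expr_call(const struct expr *e, int reg)
{
	/* Compare and call use the same token, so they cannot diverge. */
#define X(fn) do { if (e->eval == fn) return fn(e, reg); } while (0)
	X(foo_eval);
	X(bar_eval);
#undef X
	/* Anything not listed falls back to the indirect call. */
	return e->eval(e, reg);
}
```

The macro is defined and undefined inside the one function that uses it, which keeps the embedded return statement confined to a very small scope.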
     
  • Instead of linear search, use rhlist interface to look up the objects.
    This fixes rulesets with thousands of named objects (quota, counters and
    the like).

    We only use a single table for this and consider the address of the
    table we're doing the lookup in as a part of the key.

    This reduces restore time of a sample ruleset with ~20k named counters
    from 37 seconds to 0.8 seconds.

    Signed-off-by: Florian Westphal
    Signed-off-by: Pablo Neira Ayuso

    Florian Westphal
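
To illustrate the single-table lookup keyed by table address plus name, here is a small chained hash table in plain C. It is only a sketch under stated assumptions: the kernel uses the rhlist interface, whereas this example uses a fixed-size bucket array with FNV-1a hashing, and every identifier (`struct object`, `obj_insert`, `obj_lookup`, `NBUCKETS`) is hypothetical.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define NBUCKETS 64

struct table { int unused; };

/* Hypothetical named object; the key is (table address, name). */
struct object {
	const struct table *table;
	const char *name;
	struct object *next;
};

static struct object *buckets[NBUCKETS];

/* FNV-1a over the bytes of the table pointer, then over the name. */
static size_t obj_hash(const struct table *table, const char *name)
{
	uint64_t h = 0xcbf29ce484222325ULL;
	uintptr_t t = (uintptr_t)table;
	size_t i;

	for (i = 0; i < sizeof(t); i++) {
		h ^= (t >> (8 * i)) & 0xff;
		h *= 0x100000001b3ULL;
	}
	for (; *name; name++) {
		h ^= (unsigned char)*name;
		h *= 0x100000001b3ULL;
	}
	return (size_t)(h % NBUCKETS);
}

void obj_insert(struct object *obj)
{
	size_t b = obj_hash(obj->table, obj->name);

	obj->next = buckets[b];
	buckets[b] = obj;
}

struct object *obj_lookup(const struct table *table, const char *name)
{
	struct object *obj;

	/* Both halves of the key must match, not just the name. */
	for (obj = buckets[obj_hash(table, name)]; obj; obj = obj->next)
		if (obj->table == table && strcmp(obj->name, name) == 0)
			return obj;
	return NULL;
}
```

Because the table address is part of the key, two tables can each hold an object with the same name without colliding logically.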
     
  • Add a 'key' structure for object, so we can look them up by name + table
    combination (the name can be the same in each table).

    Signed-off-by: Florian Westphal
    Signed-off-by: Pablo Neira Ayuso

    Florian Westphal
     
  • Eric Dumazet says:

    ====================
    tcp: remove code from tcp_create_openreq_child()

    tcp_create_openreq_child() is essentially cloning a listener, then
    must initialize some fields that can not be inherited.

    Listeners are either fresh sockets, or sockets that came through
    tcp_disconnect() after a session that dirtied many fields.

    By moving code to tcp_disconnect(), we can shorten time taken
    to create a clone, since tcp_disconnect() operation is very
    unlikely.
    ====================

    Acked-by: Yuchung Cheng
    Acked-by: Soheil Hassas Yeganeh
    Signed-off-by: David S. Miller

    David S. Miller
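
The idea in the cover letter above, that a clone inherits whatever the listener already holds, can be sketched in userspace as follows. The struct, its fields and the `sock_*` helpers are hypothetical simplifications, and `TIMEOUT_INIT` is an arbitrary placeholder value, not the kernel's TCP_TIMEOUT_INIT.

```c
#include <string.h>

#define TIMEOUT_INIT 1000u  /* placeholder value for the sketch */

/* Hypothetical, heavily reduced socket state. */
struct sock {
	int is_listener;
	unsigned int rto;
	unsigned int packets_out;
};

/* Fresh sockets start with the proper initial values. */
void sock_init(struct sock *sk)
{
	memset(sk, 0, sizeof(*sk));
	sk->rto = TIMEOUT_INIT;
}

/* The rare path: restore everything a past session may have dirtied. */
void sock_disconnect(struct sock *sk)
{
	sk->rto = TIMEOUT_INIT;
	sk->packets_out = 0;
}

/* The hot path: a plain copy, with no per-field re-initialization. */
void sock_clone(struct sock *child, const struct sock *listener)
{
	*child = *listener;
	child->is_listener = 0;
}
```

The reset work is paid once in the seldom-used disconnect path instead of on every new passive connection.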
     
  • If we make sure all listeners have these fields cleared, then a clone
    will also inherit zero values.

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     
  • If we make sure all listeners have proper tp->rack value,
    then a clone will also inherit proper initial value.

    Note that fresh sockets init tp->rack from tcp_init_sock()

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     
  • If we make sure all listeners have app_limited set to ~0U,
    then a clone will also inherit proper initial value.

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     
  • If we make sure all listeners have these fields cleared, then a clone
    will also inherit zero values.

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     
  • All listeners have this field cleared already, since tcp_disconnect()
    clears it and newly created sockets also have a zero value here.

    So a clone will inherit a zero value here.

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     
  • Passive connections can inherit proper value by cloning,
    if we make sure all listeners have the proper values there.

    tcp_disconnect() was setting snd_cwnd to 2, which seems
    quite obsolete since IW10 adoption.

    Also remove an obsolete comment.

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     
  • If we make sure a listener always has its mdev_us
    field set to TCP_TIMEOUT_INIT, we do not need to rewrite
    this field after a new clone is created.

    tcp_disconnect() is very seldom used in real applications.

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     
  • All listeners have this field cleared already, since tcp_disconnect()
    clears it and newly created sockets also have a zero value here.

    So a clone will inherit a zero value here.

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     
  • New sockets have this field cleared, and tcp_disconnect()
    calls tcp_write_queue_purge() which among other things
    also clears tp->packets_out.

    So a listener is guaranteed to have this field cleared.

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     
  • If we make sure a listener always has its icsk_rto
    field set to TCP_TIMEOUT_INIT, we do not need to rewrite
    this field after a new clone is created.

    tcp_disconnect() is very seldom used in real applications.

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     
  • New sockets get the field set to TCP_INFINITE_SSTHRESH in tcp_init_sock().
    In case a socket had this field changed and transitions to TCP_LISTEN
    state, tcp_disconnect() also makes sure snd_ssthresh is set to
    TCP_INFINITE_SSTHRESH.

    So a listener has this field set to TCP_INFINITE_SSTHRESH already.

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     
  • Remove unneeded semicolon.

    Signed-off-by: YueHaibing
    Reviewed-by: Leon Romanovsky
    Signed-off-by: David S. Miller

    YueHaibing
     
  • Remove unneeded semicolon.

    Signed-off-by: YueHaibing
    Signed-off-by: David S. Miller

    YueHaibing
     
  • Remove unneeded semicolon

    Signed-off-by: YueHaibing
    Signed-off-by: David S. Miller

    YueHaibing
     
  • Remove duplicated include.

    Signed-off-by: YueHaibing
    Acked-by: Denis Bolotin
    Signed-off-by: David S. Miller

    YueHaibing
     
  • There is an if statement and a return statement that are incorrectly
    indented. Fix these. Also replace the assignment-in-if statements
    with an assignment followed by an if to keep to the coding style.

    Signed-off-by: Colin Ian King
    Signed-off-by: David S. Miller

    Colin Ian King
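
For readers unfamiliar with the style point above, a small illustration follows. `name_length()` is a hypothetical function invented for this sketch and is unrelated to the driver being patched.

```c
#include <stddef.h>
#include <string.h>

/*
 * Discouraged form (assignment inside the condition):
 *
 *	if ((len = strlen(s)) == 0)
 *		return -1;
 *
 * Preferred form: assign first, then test.
 */
int name_length(const char *s)
{
	size_t len;

	len = strlen(s);
	if (len == 0)
		return -1;

	return (int)len;
}
```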
     
  • In some testing scenarios, dst/route cache can fill up so quickly
    that even an explicit GC call occasionally fails to clean it up. This leads
    to sporadically failing calls to dst_alloc and "network unreachable" errors
    to the user, which is confusing.

    This patch adds a diagnostic message to make the cause of the failure
    easier to determine.

    Signed-off-by: Peter Oskolkov
    Signed-off-by: David S. Miller

    Peter Oskolkov
     
  • In the current implementation, on interface down we disabled NAPI and
    then manually drained any remaining ingress frames. This could lead
    to a situation when, under heavy traffic, the data availability
    notification for some of the channels would not get rearmed correctly.

    Change the implementation such that we let all remaining ingress frames
    be processed as usual and only disable NAPI once the hardware queues
    are empty.

    We also add a wait on the Tx side, to allow hardware time to process
    all in-flight Tx frames before issuing the disable command.

    Signed-off-by: Ioana Radulescu
    Signed-off-by: David S. Miller

    Ioana Ciocoi Radulescu
     
  • There are some lines that have indentation issues; fix these.

    Signed-off-by: Colin Ian King
    Signed-off-by: David S. Miller

    Colin Ian King
     
  • Petr Machata says:

    ====================
    vxlan: Allow vetoing FDB operations

    mlxsw does not implement handling of the more advanced types of VXLAN
    FDB entries. In order to provide visibility to users, it is important to
    be able to reject such FDB entries, ideally with an explanation passed
    in extended ack. This patch set implements this.

    In patches #1-#4, vxlan is gradually transformed to support vetoing of
    FDB entries added (or modified) through vxlan_fdb_update(), and the
    default FDB entry added in __vxlan_dev_create().

    Patches #5-#7 deal with vxlan_changelink(). The existing code recognizes
    that vxlan_fdb_update() may fail, but doesn't attempt to keep things
    intact if it does. These patches change the function in several steps to
    gracefully handle vetoes (or other failures).

    Then in patches #8-#11, extack arguments are added, respectively, to
    ndo_fdb_add(), mlxsw's mlxsw_sp_nve_ops.fdb_replay, the functions that
    connect to the VXLAN vetoing code, and call_switchdev_notifiers(). Note
    that call_switchdev_blocking_notifiers() already does support extack.

    Finally in patch #12, mlxsw is extended to add extack messages to
    rejected FDB entries. In patch #13, the functionality is tested.
    ====================

    Signed-off-by: David S. Miller

    David S. Miller
     
  • mlxsw doesn't implement offloading of all types of FDB entries that the
    VXLAN driver supports. Test that such FDB entries are rejected. That
    makes sure that the decision made by the existing validation code in
    mlxsw propagates up the stack. It also exercises rollback functionality
    in VXLAN, and tests that extack is returned.

    Signed-off-by: Petr Machata
    Signed-off-by: David S. Miller

    Petr Machata
     
  • Annotate the rejections in mlxsw_sp_switchdev_vxlan_work_prepare() with
    textual reasons.

    Because this code ends up being invoked for FDB replay as well, drop the
    default message from there, so that the more accurate error message is
    not overwritten.

    Signed-off-by: Petr Machata
    Signed-off-by: David S. Miller

    Petr Machata
     
  • A follow-up patch will enable vetoing of FDB entries. Make it possible
    to communicate details of why an FDB entry is not acceptable back to the
    user.

    Signed-off-by: Petr Machata
    Signed-off-by: David S. Miller

    Petr Machata
     
  • There are four sources of VXLAN switchdev notifier calls:

    - the changelink() link operation, which already supports extack,
    - ndo_fdb_add() which got extack support in a previous patch,
    - FDB updates due to packet forwarding,
    - and vxlan_fdb_replay().

    Extend vxlan_fdb_switchdev_call_notifiers() to include extack in the
    switchdev message that it sends, and propagate the argument upwards to
    the callers. For the first two cases, pass in the extack gotten through
    the operation. For case #3, pass in NULL.

    To cover the last case, extend vxlan_fdb_replay() to take extack
    argument, which might come from whatever operation necessitated the FDB
    replay.

    Signed-off-by: Petr Machata
    Signed-off-by: David S. Miller

    Petr Machata
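
The propagation described above, including the case where no extack is available and NULL is passed, might look roughly like this in a userspace model. `struct extack` stands in for the kernel's netlink extended-ack structure, and `driver_notifier()`, `fdb_add()` and `struct fdb` are hypothetical names; the rollback step is likewise only a sketch of the idea.

```c
#include <errno.h>
#include <stddef.h>

/* Stand-in for the extended-ack structure: just a message pointer. */
struct extack {
	const char *msg;
};

typedef int (*fdb_notifier)(int entry_type, struct extack *extack);

/* Hypothetical driver that only supports entry type 0. */
int driver_notifier(int entry_type, struct extack *extack)
{
	if (entry_type != 0) {
		/* extack may be NULL, e.g. for updates caused by forwarding. */
		if (extack)
			extack->msg = "unsupported FDB entry type";
		return -EOPNOTSUPP;
	}
	return 0;
}

struct fdb {
	int count;
};

int fdb_add(struct fdb *fdb, int entry_type, fdb_notifier notify,
	    struct extack *extack)
{
	int err;

	fdb->count++;                 /* tentatively add the entry */
	err = notify(entry_type, extack);
	if (err)
		fdb->count--;         /* vetoed: roll the addition back */
	return err;
}
```

Callers that have an extack from the user request pass it down; callers without one pass NULL and still get the error code.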
     
  • A follow-up patch will extend vxlan_fdb_replay() with an extack
    argument. Extend the fdb_replay callback in mlxsw likewise so that the
    argument is ready for the vxlan conversion.

    Signed-off-by: Petr Machata
    Signed-off-by: David S. Miller

    Petr Machata