04 Aug, 2020

1 commit


25 Jul, 2020

1 commit

  • The previous patch introduced a deadlock. This patch fixes it by making
    sure the work is canceled without holding the global ovs lock. This is
    done by moving the reorder processing one layer up to the netns level.

    Fixes: eac87c413bf9 ("net: openvswitch: reorder masks array based on usage")
    Reported-by: syzbot+2c4ff3614695f75ce26c@syzkaller.appspotmail.com
    Reported-by: syzbot+bad6507e5db05017b008@syzkaller.appspotmail.com
    Reviewed-by: Paolo
    Signed-off-by: Eelco Chaudron
    Signed-off-by: David S. Miller

    Eelco Chaudron
     

18 Jul, 2020

1 commit

  • This patch reorders the masks array every 4 seconds based on their
    usage count. This greatly reduces the number of masks tried per packet
    hit, and hence improves overall performance, especially in the OVS/OVN
    case for OpenShift.

    Here are some results from the OVS/OVN OpenShift test, which uses
    8 pods, each pod having 512 uperf connections, each connection
    sends a 64-byte request and gets a 1024-byte response (TCP).
    All uperf clients are on 1 worker node while all uperf servers are
    on the other worker node.

    Kernel without this patch : 7.71 Gbps
    Kernel with this patch applied: 14.52 Gbps

    We also ran some tests to verify that the rebalance activity does not
    lower the flow insertion rate, and it does not.

    Signed-off-by: Eelco Chaudron
    Tested-by: Andrew Theurer
    Reviewed-by: Paolo Abeni
    Signed-off-by: David S. Miller

    Eelco Chaudron
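The reordering idea in the commit above can be sketched in user-space C. This is only an illustration under stated assumptions: `struct mask_entry`, `rebalance_masks` and the counter reset are hypothetical names and simplifications, not the kernel's data structures, which the commit message does not show.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-in for one flow mask and its hit counter. */
struct mask_entry {
    int mask_id;
    uint64_t usage; /* hits seen since the last rebalance */
};

/* qsort comparator: masks with more hits sort first. */
static int cmp_usage_desc(const void *a, const void *b)
{
    const struct mask_entry *ma = a;
    const struct mask_entry *mb = b;

    if (ma->usage == mb->usage)
        return 0;
    return ma->usage > mb->usage ? -1 : 1;
}

/*
 * Meant to run periodically (every 4 seconds in the patch). Sorts the
 * array so the most frequently hit masks are probed first, then clears
 * the counters for the next interval (the reset is an assumption of
 * this sketch).
 */
void rebalance_masks(struct mask_entry *masks, size_t n)
{
    assert(masks != NULL || n == 0);
    if (n == 0)
        return;

    qsort(masks, n, sizeof(*masks), cmp_usage_desc);
    for (size_t i = 0; i < n; i++)
        masks[i].usage = 0;
}
```

A lookup that walks the array from index 0 then reaches the hot masks after fewer probes, which is where the throughput gain reported above would come from.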
     

24 Apr, 2020

1 commit

  • In the kernel datapath of Open vSwitch, there are only 1024
    meter buckets in one datapath. Installing more than
    1024 (e.g. 8192) meters may lead to a performance drop.
    But in some cases, for example when Open vSwitch is used as an edge
    gateway, at least 20K meters are needed, where meters are used for
    per-IP-address bandwidth limiting.

    [Open vSwitch userspace datapath has this issue too.]

    To make meters more scalable, this patch uses a meter array instead of
    hash tables, and expands/shrinks the array when necessary, so that
    more meters than before can be installed in the datapath.
    With the introduction of struct *dp_meter_instance, it is easy to
    expand the meter array by changing the *ti pointer in struct
    *dp_meter_table.

    Cc: Pravin B Shelar
    Cc: Andy Zhou
    Signed-off-by: Tonghao Zhang
    Acked-by: Pravin B Shelar
    Signed-off-by: David S. Miller

    Tonghao Zhang
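The array-plus-pointer-swap design described in the meter commit above can be sketched as follows. The struct names only loosely mirror `dp_meter_instance` and `dp_meter_table`; the direct indexing by meter id, the doubling policy and `METER_MAX_SLOTS` are assumptions of this sketch, and shrinking and locking are left out.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define METER_MAX_SLOTS (1u << 20) /* hypothetical upper bound */

struct meter {
    uint32_t id;
    uint64_t rate_kbps;
};

/* Loosely mirrors dp_meter_instance: one flat array of meter pointers. */
struct meter_instance {
    uint32_t n_slots;
    struct meter *slots[]; /* flexible array member */
};

/* Loosely mirrors dp_meter_table: resizing only swaps the ti pointer. */
struct meter_table {
    struct meter_instance *ti;
    uint32_t count;
};

static struct meter_instance *instance_alloc(uint32_t n_slots)
{
    struct meter_instance *ti =
        calloc(1, sizeof(*ti) + (size_t)n_slots * sizeof(ti->slots[0]));

    if (ti)
        ti->n_slots = n_slots;
    return ti;
}

int meter_table_init(struct meter_table *tbl, uint32_t n_slots)
{
    assert(n_slots > 0 && n_slots <= METER_MAX_SLOTS);
    tbl->ti = instance_alloc(n_slots);
    tbl->count = 0;
    return tbl->ti ? 0 : -1;
}

/* Allocate a bigger instance, copy the pointers, swap ti, free the old one. */
static int meter_table_grow(struct meter_table *tbl, uint32_t n_slots)
{
    struct meter_instance *old = tbl->ti;
    struct meter_instance *bigger = instance_alloc(n_slots);

    if (!bigger)
        return -1;
    for (uint32_t i = 0; i < old->n_slots; i++)
        bigger->slots[i] = old->slots[i];
    tbl->ti = bigger;
    free(old);
    return 0;
}

int meter_insert(struct meter_table *tbl, struct meter *m)
{
    if (m->id >= METER_MAX_SLOTS)
        return -1;
    while (m->id >= tbl->ti->n_slots) {
        uint32_t bigger = tbl->ti->n_slots * 2;

        if (bigger > METER_MAX_SLOTS)
            bigger = METER_MAX_SLOTS;
        if (meter_table_grow(tbl, bigger) != 0)
            return -1;
    }
    if (tbl->ti->slots[m->id])
        return -1; /* id already installed */
    tbl->ti->slots[m->id] = m;
    tbl->count++;
    return 0;
}

struct meter *meter_lookup(const struct meter_table *tbl, uint32_t id)
{
    return id < tbl->ti->n_slots ? tbl->ti->slots[id] : NULL;
}
```

Lookup is a bounds check plus one array access, so it does not slow down as more meters are installed, unlike walking a chain in a fixed-size hash table.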
     

15 Nov, 2019

1 commit

  • When using the kernel datapath, the upcall doesn't
    include the related skb hash info. That introduces
    some problems, because the skb hash is important
    in the kernel stack. For example, the VXLAN module uses
    it to select the UDP src port. The tx queue selection
    may also use the hash in the stack.

    The hash is computed in different ways. It is random
    for a TCP socket, and it may be computed in hardware
    or in the software stack. Recalculating the hash is not easy.

    Hash of TCP socket is computed:
    tcp_v4_connect
    -> sk_set_txhash (is random)

    __tcp_transmit_skb
    -> skb_set_hash_from_sk

    For the first packet of a TCP session, there will be one upcall
    to ovs-vswitchd without the skb hash information. The remaining
    packets will be processed in the Open vSwitch module, with the hash
    kept. If this TCP session is forwarded to the VXLAN module, then
    the UDP src port of the first TCP packet differs from that of the
    remaining packets.

    TCP packets may come to Open vSwitch from the host or from Docker
    containers. To fix it, we store the hash info in the upcall, and
    restore the hash when packets are sent back.

    +---------------+          +-------------------------+
    |   Docker/VMs  |          |     ovs-vswitchd        |
    +----+----------+          +-+--------------------+--+
         |                       ^                    |
         |                       |                    |
         |                       |  upcall            v restore packet hash (not recalculate)
         |                     +-+--------------------+--+
         |  tap netdev         |                         |   vxlan module
         +--------------->     +-->  Open vSwitch ko  +-->
           or internal type    |                         |
                               +-------------------------+

    Reported-at: https://mail.openvswitch.org/pipermail/ovs-dev/2019-October/364062.html
    Signed-off-by: Tonghao Zhang
    Acked-by: Pravin B Shelar
    Signed-off-by: David S. Miller

    Tonghao Zhang
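A minimal sketch of the store-and-restore idea in the commit above. All types and function names here are hypothetical, and the port mapping is a simplified stand-in for whatever the tunnel code does with the hash. The point is only that carrying the original hash through the upcall keeps the first packet on the same UDP source port as the rest of the session.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, reduced view of a packet's hash state. */
struct packet {
    uint32_t hash;
    int hash_valid;
};

/* Hypothetical upcall message that carries the hash to userspace and back. */
struct upcall_msg {
    uint32_t saved_hash;
    int has_hash;
};

/* Simplified mapping from a flow hash to a UDP source port (49152-65535). */
uint16_t udp_src_port_from_hash(uint32_t hash)
{
    const uint32_t min = 49152, max = 65535;

    return (uint16_t)(min + hash % (max - min + 1));
}

/* On upcall: record the hash instead of dropping it. */
void upcall_store_hash(struct upcall_msg *msg, const struct packet *pkt)
{
    assert(msg != NULL && pkt != NULL);
    msg->has_hash = pkt->hash_valid;
    msg->saved_hash = pkt->hash_valid ? pkt->hash : 0;
}

/* When the packet is sent back for execution: restore, do not recalculate. */
void execute_restore_hash(struct packet *pkt, const struct upcall_msg *msg)
{
    assert(msg != NULL && pkt != NULL);
    if (msg->has_hash) {
        pkt->hash = msg->saved_hash;
        pkt->hash_valid = 1;
    }
}
```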
     

06 Sep, 2019

1 commit

  • Offloaded OvS datapath rules are translated one to one to tc rules,
    for example the following simplified OvS rule:

    recirc_id(0),in_port(dev1),eth_type(0x0800),ct_state(-trk) actions:ct(),recirc(2)

    Will be translated to the following tc rule:

    $ tc filter add dev dev1 ingress \
        prio 1 chain 0 proto ip \
        flower tcp ct_state -trk \
        action ct pipe \
        action goto chain 2

    Received packets will first travel through tc, and if they aren't stolen
    by it, like in the above rule, they will continue to the OvS datapath.
    Since we already did some actions (action ct in this case) which might
    modify the packets, and updated action stats, we would like to continue
    the processing with the correct recirc_id in OvS (here recirc_id(2))
    where we left off.

    To support this, introduce a new skb extension for tc, which
    will be used for translating tc chain to ovs recirc_id to
    handle these miss cases. Last tc chain index will be set
    by tc goto chain action and read by OvS datapath.

    Signed-off-by: Paul Blakey
    Signed-off-by: Vlad Buslov
    Acked-by: Jiri Pirko
    Acked-by: Pravin B Shelar
    Signed-off-by: David S. Miller

    Paul Blakey
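The hand-off described in the commit above can be sketched as two small steps: the tc goto chain action records the last chain index on the packet, and the datapath seeds recirc_id from it on a miss. The struct and function names below are hypothetical; the real skb extension API is not shown in the commit message.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-packet extension holding the last tc chain index. */
struct tc_chain_ext {
    uint32_t chain;
};

struct packet {
    struct tc_chain_ext ext;
    int has_ext; /* 0 when tc never attached the extension */
};

struct flow_key {
    uint32_t recirc_id;
};

/* tc side: "action goto chain N" records where processing stopped. */
void tc_goto_chain(struct packet *pkt, uint32_t chain)
{
    assert(pkt != NULL);
    pkt->ext.chain = chain;
    pkt->has_ext = 1;
}

/* OvS side: on a tc miss, resume at the matching recirc_id, else start at 0. */
void ovs_key_init(struct flow_key *key, const struct packet *pkt)
{
    assert(key != NULL && pkt != NULL);
    key->recirc_id = pkt->has_ext ? pkt->ext.chain : 0;
}
```

With the example rule above, a packet that went through `action ct` and `goto chain 2` in tc would be looked up with recirc_id(2) in the datapath rather than starting over at recirc_id(0).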
     

05 Jun, 2019

1 commit

  • Based on 1 normalized pattern(s):

    this program is free software you can redistribute it and or modify
    it under the terms of version 2 of the gnu general public license as
    published by the free software foundation this program is
    distributed in the hope that it will be useful but without any
    warranty without even the implied warranty of merchantability or
    fitness for a particular purpose see the gnu general public license
    for more details you should have received a copy of the gnu general
    public license along with this program if not write to the free
    software foundation inc 51 franklin street fifth floor boston ma
    02110 1301 usa

    extracted by the scancode license scanner the SPDX license identifier

    GPL-2.0-only

    has been chosen to replace the boilerplate/reference in 21 file(s).

    Signed-off-by: Thomas Gleixner
    Reviewed-by: Alexios Zavras
    Reviewed-by: Allison Randal
    Reviewed-by: Richard Fontana
    Cc: linux-spdx@vger.kernel.org
    Link: https://lkml.kernel.org/r/20190529141334.228102212@linutronix.de
    Signed-off-by: Greg Kroah-Hartman

    Thomas Gleixner
     

26 May, 2018

1 commit

  • Currently, nf_conntrack_max is used to limit the maximum number of
    conntrack entries in the conntrack table for every network namespace.
    VMs and containers that reside in the same namespace
    share the same conntrack table, and the total number of conntrack entries
    for all the VMs and containers is limited by nf_conntrack_max. In this
    case, if one of the VMs/containers abuses the conntrack entries,
    it blocks the others from committing valid conntrack entries into the
    conntrack table. Even if we can possibly put the VMs in different network
    namespaces, the current nf_conntrack_max configuration is rigid in
    that we cannot limit different VMs/containers to different numbers of
    conntrack entries.

    To address the aforementioned issue, this patch proposes a
    fine-grained mechanism that can further limit the number of conntrack
    entries per zone. For example, we can assign a different zone to each VM,
    and set a conntrack limit for each zone. With this isolation, a
    misbehaving VM only consumes the conntrack entries in its own zone, and
    it will not influence other well-behaved VMs. Moreover, users can
    set different conntrack limits for different zones based on their
    preference.

    The proposed implementation utilizes Netfilter's nf_conncount backend
    to count the number of connections in a particular zone. If the number of
    connections is above the configured limit, ovs will return ENOMEM to
    userspace. If userspace does not configure the zone limit, the limit
    defaults to zero, which means no limit and is backward compatible with
    the behavior without this patch.

    The following high-level APIs are provided to userspace:
    - OVS_CT_LIMIT_CMD_SET:
      * set default connection limit for all zones
      * set the connection limit for a particular zone
    - OVS_CT_LIMIT_CMD_DEL:
      * remove the connection limit for a particular zone
    - OVS_CT_LIMIT_CMD_GET:
      * get the default connection limit for all zones
      * get the connection limit for a particular zone

    Signed-off-by: Yi-Hung Wei
    Acked-by: Pravin B Shelar
    Signed-off-by: David S. Miller

    Yi-Hung Wei
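The SET/DEL/GET semantics and the ENOMEM behavior described in the commit above can be modeled with a small table. This is a sketch only: the fixed-size arrays, `SKETCH_MAX_ZONES` and every function name are hypothetical, and the plain counter stands in for the nf_conncount backend that the patch actually uses.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

#define SKETCH_MAX_ZONES 8 /* hypothetical; kept tiny for the example */

struct ct_limit_table {
    uint32_t default_limit;               /* 0 means no limit */
    uint32_t limit[SKETCH_MAX_ZONES];
    int has_limit[SKETCH_MAX_ZONES];      /* per-zone override present? */
    uint32_t count[SKETCH_MAX_ZONES];     /* committed connections */
};

/* SET: default limit for all zones. */
void ct_limit_set_default(struct ct_limit_table *t, uint32_t limit)
{
    t->default_limit = limit;
}

/* SET: limit for one zone. */
void ct_limit_set(struct ct_limit_table *t, uint16_t zone, uint32_t limit)
{
    assert(zone < SKETCH_MAX_ZONES);
    t->limit[zone] = limit;
    t->has_limit[zone] = 1;
}

/* DEL: drop the per-zone limit so the zone falls back to the default. */
void ct_limit_del(struct ct_limit_table *t, uint16_t zone)
{
    assert(zone < SKETCH_MAX_ZONES);
    t->has_limit[zone] = 0;
}

/* GET: effective limit for a zone. */
uint32_t ct_limit_get(const struct ct_limit_table *t, uint16_t zone)
{
    assert(zone < SKETCH_MAX_ZONES);
    return t->has_limit[zone] ? t->limit[zone] : t->default_limit;
}

/* Commit a new connection; refuse with -ENOMEM once the zone is full. */
int ct_commit(struct ct_limit_table *t, uint16_t zone)
{
    uint32_t limit = ct_limit_get(t, zone);

    if (limit != 0 && t->count[zone] >= limit)
        return -ENOMEM;
    t->count[zone]++;
    return 0;
}
```

Because the counters are per zone, a zone that has hit its limit does not prevent commits in any other zone, which is the isolation property the commit message argues for.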
     

13 Nov, 2017

3 commits


05 Nov, 2017

1 commit

  • This patch allows reliable identification of netdevice interfaces connected
    to openvswitch bridges. In particular, user space queries the netdev
    interfaces belonging to the ports for statistics, up/down state, etc.
    Datapath dump needs to provide enough information for the user space to be
    able to do that.

    Currently, only interface names are returned. This is not sufficient, as
    openvswitch allows its ports to be in different name spaces and the
    interface name is valid only in its name space. What is needed and generally
    used in other netlink APIs, is the pair ifindex+netnsid.

    The solution is the addition of the ifindex+netnsid pair (or only ifindex
    if in the same name space) to the vport get/dump operation.

    On request side, ideally the ifindex+netnsid pair could be used to
    get/set/del the corresponding vport. This is not implemented by this patch
    and can be added later if needed.

    Signed-off-by: Jiri Benc
    Signed-off-by: David S. Miller

    Jiri Benc
     

17 Aug, 2017

1 commit

  • For sw_flow_actions, the actions_len only represents the kernel part's
    size, and when we dump the actions to userspace, we do the
    conversions, so its true size may become bigger than the actions_len.

    But unfortunately, for OVS_PACKET_ATTR_ACTIONS, we use the actions_len
    to allocate the skbuff, so the user_skb's size may become insufficient and
    an oops will happen like this:
    skbuff: skb_over_panic: text:ffffffff8148fabf len:1749 put:157 head:
    ffff881300f39000 data:ffff881300f39000 tail:0x6d5 end:0x6c0 dev:
    ------------[ cut here ]------------
    kernel BUG at net/core/skbuff.c:129!
    [...]
    Call Trace:

    [] skb_put+0x43/0x44
    [] skb_zerocopy+0x6c/0x1f4
    [] queue_userspace_packet+0x3a3/0x448 [openvswitch]
    [] ovs_dp_upcall+0x30/0x5c [openvswitch]
    [] output_userspace+0x132/0x158 [openvswitch]
    [] ? ip6_rcv_finish+0x74/0x77 [ipv6]
    [] do_execute_actions+0xcc1/0xdc8 [openvswitch]
    [] ovs_execute_actions+0x74/0x106 [openvswitch]
    [] ovs_dp_process_packet+0xe1/0xfd [openvswitch]
    [] ? key_extract+0x63c/0x8d5 [openvswitch]
    [] ovs_vport_receive+0xa1/0xc3 [openvswitch]
    [...]

    We can also see that actions_len is much smaller than orig_len:
    crash> struct sw_flow_actions 0xffff8812f539d000
    struct sw_flow_actions {
      rcu = {
        next = 0xffff8812f5398800,
        func = 0xffffe3b00035db32
      },
      orig_len = 1384,
      actions_len = 592,
      actions = 0xffff8812f539d01c
    }

    So as a quick fix, use the orig_len instead of the actions_len to
    allocate the user_skb.

    Last, this oops happened on our system running a relatively old kernel,
    but the same risk still exists on mainline, since we have used the wrong
    actions_len from the beginning.

    Fixes: ccea74457bbd ("openvswitch: include datapath actions with sampled-packet upcall to userspace")
    Cc: Neil McKee
    Signed-off-by: Liping Zhang
    Acked-by: Pravin B Shelar
    Signed-off-by: David S. Miller

    Liping Zhang
     

03 Jul, 2017

1 commit


23 Mar, 2017

1 commit

  • With the introduction of the OpenFlow 'clone' action, the OVS user space
    can now translate the 'clone' action into the kernel datapath 'sample'
    action, with 100% probability, to ensure that the clone semantics,
    which is that the packet seen by the clone action is the same as the
    packet seen by the action after clone, is faithfully carried out
    in the datapath.

    While the sample action in the datapath has the matching semantics,
    its implementation is only optimized for its original use.
    Specifically, there are two limitations. First, there is a 3-level
    nesting restriction, enforced at flow downloading time. This
    limit turns out to be too restrictive for the 'clone' use case.
    Second, the implementation avoids a recursive call only if the sample
    action list has a single userspace action.

    The main optimization implemented in this series removes the static
    nesting limit check and instead implements a run-time recursion limit
    check, with recursion avoidance similar to that of the 'recirc' action.
    This optimization solves both issues #1 and #2 above.

    One related optimization attempts to avoid copying the flow key as
    long as the enclosed actions do not change the flow key. The
    detection is performed only once, at flow downloading time.

    Another related optimization is to rewrite the action list
    at flow downloading time in order to save the fast path from parsing
    the sample action list in its original form repeatedly.

    Signed-off-by: Andy Zhou
    Acked-by: Pravin B Shelar
    Signed-off-by: David S. Miller

    andy zhou
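The move from a static nesting check to a run-time recursion limit, as described in the commit above, can be illustrated with a small recursive executor. The action types, `SKETCH_RECURSION_LIMIT` and the error value are hypothetical choices for this sketch and are not taken from the kernel sources.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_RECURSION_LIMIT 4 /* hypothetical run-time limit */

enum action_type { ACT_OUTPUT, ACT_CLONE };

/* A clone/sample action carries a nested action list. */
struct action {
    enum action_type type;
    const struct action *nested;
    size_t n_nested;
};

/*
 * Execute an action list. The nesting depth is checked while running,
 * so a flow with deep nesting can be installed and only fails if the
 * limit is actually exceeded at execution time.
 */
int execute_actions(const struct action *acts, size_t n, int level,
                    int *n_outputs)
{
    assert(n_outputs != NULL);
    if (level > SKETCH_RECURSION_LIMIT)
        return -1;

    for (size_t i = 0; i < n; i++) {
        if (acts[i].type == ACT_OUTPUT) {
            (*n_outputs)++;
        } else {
            int err = execute_actions(acts[i].nested, acts[i].n_nested,
                                      level + 1, n_outputs);
            if (err)
                return err;
        }
    }
    return 0;
}
```

The kernel change goes further and avoids real recursion where it can, similar to 'recirc'; this sketch shows only the run-time depth check.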
     

18 Nov, 2016

1 commit

  • Make struct pernet_operations::id unsigned.

    There are 2 reasons to do so:

    1)
    This field is really an index into a zero-based array and
    thus is an unsigned entity. Using a negative value is an out-of-bounds
    access by definition.

    2)
    On x86_64 unsigned 32-bit data which are mixed with pointers
    via array indexing or offsets added or subtracted to pointers
    are preferred to signed 32-bit data.

    "int" being used as an array index needs to be sign-extended
    to 64-bit before being used.

    void f(long *p, int i)
    {
            g(p[i]);
    }

    roughly translates to

            movsx   rsi, esi
            mov     rdi, [rsi+...]
            call    g

    MOVSX is a 3-byte instruction which isn't necessary if the variable is
    unsigned, because x86_64 zero-extends by default.

    Now, there is net_generic() function which, you guessed it right, uses
    "int" as an array index:

    static inline void *net_generic(const struct net *net, int id)
    {
            ...
            ptr = ng->ptr[id - 1];
            ...
    }

    And this function is used a lot, so those sign extensions add up.

    Patch snipes ~1730 bytes on allyesconfig kernel (without all junk
    messing with code generation):

    add/remove: 0/0 grow/shrink: 70/598 up/down: 396/-2126 (-1730)

    Unfortunately some functions actually grow bigger.
    This is a seemingly random artefact of code generation with the register
    allocator being used differently. gcc decides that some variable
    needs to live in the new r8+ registers and every access now requires a REX
    prefix. Or it is shifted into r12, so the [r12+0] addressing mode has to be
    used, which is longer than [r8].

    However, overall balance is in negative direction:

    add/remove: 0/0 grow/shrink: 70/598 up/down: 396/-2126 (-1730)
    function                                     old     new   delta
    nfsd4_lock                                  3886    3959     +73
    tipc_link_build_proto_msg                   1096    1140     +44
    mac80211_hwsim_new_radio                    2776    2808     +32
    tipc_mon_rcv                                1032    1058     +26
    svcauth_gss_legacy_init                     1413    1429     +16
    tipc_bcbase_select_primary                   379     392     +13
    nfsd4_exchange_id                           1247    1260     +13
    nfsd4_setclientid_confirm                    782     793     +11
    ...
    put_client_renew_locked                      494     480     -14
    ip_set_sockfn_get                            730     716     -14
    geneve_sock_add                              829     813     -16
    nfsd4_sequence_done                          721     703     -18
    nlmclnt_lookup_host                          708     686     -22
    nfsd4_lockt                                 1085    1063     -22
    nfs_get_client                              1077    1050     -27
    tcf_bpf_init                                1106    1076     -30
    nfsd4_encode_fattr                          5997    5930     -67
    Total: Before=154856051, After=154854321, chg -0.00%

    Signed-off-by: Alexey Dobriyan
    Signed-off-by: David S. Miller

    Alexey Dobriyan
     

11 Jun, 2016

1 commit

  • The patch adds a new OVS action, OVS_ACTION_ATTR_TRUNC, in order to
    truncate packets. A 'max_len' is added for setting the maximum
    packet size, and a 'cutlen' field records the number of bytes
    to trim from the packet when it is output to a port or
    sent to userspace.

    Signed-off-by: William Tu
    Cc: Pravin Shelar
    Acked-by: Pravin B Shelar
    Signed-off-by: David S. Miller

    William Tu
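The relationship between 'max_len' and 'cutlen' in the commit above can be sketched like this. The struct and function names are hypothetical, and clearing cutlen after each output is an assumption of this sketch rather than something the commit message states.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-packet state: current length plus pending trim. */
struct pkt_ctx {
    uint32_t len;
    uint32_t cutlen; /* bytes to trim on the next output */
};

/* The trunc action only records how much to cut; the packet is untouched. */
void action_trunc(struct pkt_ctx *ctx, uint32_t max_len)
{
    assert(ctx != NULL);
    if (ctx->len > max_len)
        ctx->cutlen = ctx->len - max_len;
}

/*
 * Output to a port or to userspace: apply the pending trim to the
 * emitted length, then clear it so later outputs see the full packet
 * (assumed behavior in this sketch).
 */
uint32_t output_len(struct pkt_ctx *ctx)
{
    assert(ctx != NULL && ctx->cutlen <= ctx->len);

    uint32_t out = ctx->len - ctx->cutlen;

    ctx->cutlen = 0;
    return out;
}
```

For example, a 1500-byte packet with max_len 64 gets cutlen 1436, so the next output emits 64 bytes.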
     

02 Mar, 2016

1 commit

  • This patch implements bookkeeping support to compute the maximum
    headroom for all the devices in each datapath. When said value
    changes, the underlying devs are notified via the
    ndo_set_rx_headroom method.

    This also increases the internal vports xmit performance.

    Signed-off-by: Paolo Abeni
    Signed-off-by: David S. Miller

    Paolo Abeni
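The bookkeeping described in the commit above amounts to tracking a maximum and notifying devices only when it changes. A minimal sketch, with hypothetical struct names and a plain function standing in for the ndo_set_rx_headroom callback:

```c
#include <assert.h>
#include <stddef.h>

struct port_dev {
    unsigned int needed_headroom; /* what this device needs */
    unsigned int rx_headroom;     /* last value it was notified with */
};

struct datapath_sketch {
    struct port_dev **ports;
    size_t n_ports;
    unsigned int max_headroom;
};

/* Stand-in for invoking ndo_set_rx_headroom on one device. */
static void notify_rx_headroom(struct port_dev *dev, unsigned int headroom)
{
    dev->rx_headroom = headroom;
}

/*
 * Recompute the maximum headroom over all ports of the datapath.
 * Returns 1 if the value changed and every port was notified, else 0.
 */
int dp_update_headroom(struct datapath_sketch *dp)
{
    unsigned int max = 0;

    assert(dp != NULL);
    for (size_t i = 0; i < dp->n_ports; i++)
        if (dp->ports[i]->needed_headroom > max)
            max = dp->ports[i]->needed_headroom;

    if (max == dp->max_headroom)
        return 0;

    dp->max_headroom = max;
    for (size_t i = 0; i < dp->n_ports; i++)
        notify_rx_headroom(dp->ports[i], max);
    return 1;
}
```

Such a function would be called whenever a port is added to or removed from the datapath.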
     

23 Oct, 2015

1 commit

  • While transitioning to netdev based vport we broke the OVS
    feature which allows the user to retrieve tunnel packet egress
    information for lwtunnel devices. The following patch fixes it
    by introducing an ndo operation to get the tunnel egress info.
    The same ndo operation can be used for lwtunnel devices and compat
    ovs-tnl-vport devices. So after adding such a device operation
    we can remove the similar operation from ovs-vport.

    Fixes: 614732eaa12d ("openvswitch: Use regular VXLAN net_device device").
    Signed-off-by: Pravin B Shelar
    Signed-off-by: David S. Miller

    Pravin B Shelar
     

01 Sep, 2015

1 commit

  • Currently the tun-info options pointer is used in a few cases to
    pass options around. But tunnel options can be accessed using the
    ip_tunnel_info_opts() API without using the pointer. The following
    patch removes the redundant pointer and consistently makes use
    of the API.

    Signed-off-by: Pravin B Shelar
    Acked-by: Thomas Graf
    Reviewed-by: Jesse Gross
    Signed-off-by: David S. Miller

    Pravin B Shelar
     

30 Aug, 2015

2 commits


28 Aug, 2015

3 commits

  • Allow matching and setting the ct_label field. As with ct_mark, this is
    populated by executing the CT action. The label field may be modified by
    specifying a label and mask nested under the CT action. It is stored as
    metadata attached to the connection. Label modification occurs after
    lookup, and will only persist when the conntrack entry is committed by
    providing the COMMIT flag to the CT action. Labels are currently fixed
    to 128 bits in size.

    Signed-off-by: Joe Stringer
    Acked-by: Thomas Graf
    Acked-by: Pravin B Shelar
    Signed-off-by: David S. Miller

    Joe Stringer
     
  • Expose the kernel connection tracker via OVS. Userspace components can
    make use of the CT action to populate the connection state (ct_state)
    field for a flow. This state can be subsequently matched.

    Exposed connection states are OVS_CS_F_*:
    - NEW (0x01) - Beginning of a new connection.
    - ESTABLISHED (0x02) - Part of an existing connection.
    - RELATED (0x04) - Related to an established connection.
    - INVALID (0x20) - Could not track the connection for this packet.
    - REPLY_DIR (0x40) - This packet is in the reply direction for the flow.
    - TRACKED (0x80) - This packet has been sent through conntrack.

    When the CT action is executed by itself, it will send the packet
    through the connection tracker and populate the ct_state field with one
    or more of the connection state flags above. The CT action will always
    set the TRACKED bit.

    When the COMMIT flag is passed to the conntrack action, this specifies
    that information about the connection should be stored. This allows
    subsequent packets for the same (or related) connections to be
    correlated with this connection. Sending subsequent packets for the
    connection through conntrack allows the connection tracker to consider
    the packets as ESTABLISHED, RELATED, and/or REPLY_DIR.

    The CT action may optionally take a zone to track the flow within. This
    allows connections with the same 5-tuple to be kept logically separate
    from connections in other zones. If the zone is specified, then the
    "ct_zone" match field will be subsequently populated with the zone id.

    IP fragments are handled by transparently assembling them as part of the
    CT action. The maximum received unit (MRU) size is tracked so that
    refragmentation can occur during output.

    IP frag handling contributed by Andy Zhou.

    Based on original design by Justin Pettit.

    Signed-off-by: Joe Stringer
    Signed-off-by: Justin Pettit
    Signed-off-by: Andy Zhou
    Acked-by: Thomas Graf
    Acked-by: Pravin B Shelar
    Signed-off-by: David S. Miller

    Joe Stringer
     
  • This will allow the ovs-conntrack code to reuse these macros.

    Signed-off-by: Joe Stringer
    Acked-by: Thomas Graf
    Acked-by: Pravin B Shelar
    Signed-off-by: David S. Miller

    Joe Stringer
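The OVS_CS_F_* flags listed in the connection tracker commit above form a bitmask, so matching on ct_state is a masked comparison. The sketch below uses the flag values exactly as given in that commit message; the two helper functions are hypothetical and only illustrate the semantics.

```c
#include <assert.h>
#include <stdint.h>

/* Flag values as listed in the commit message above. */
#define OVS_CS_F_NEW         0x01
#define OVS_CS_F_ESTABLISHED 0x02
#define OVS_CS_F_RELATED     0x04
#define OVS_CS_F_INVALID     0x20
#define OVS_CS_F_REPLY_DIR   0x40
#define OVS_CS_F_TRACKED     0x80

/* Hypothetical helper: the CT action always sets the TRACKED bit. */
uint8_t ct_state_after_ct(uint8_t conn_flags)
{
    return (uint8_t)(conn_flags | OVS_CS_F_TRACKED);
}

/*
 * Hypothetical masked match: bits inside the mask must equal value,
 * bits outside the mask are ignored.
 */
int ct_state_matches(uint8_t state, uint8_t value, uint8_t mask)
{
    assert((value & ~mask) == 0); /* value may only use masked bits */
    return (state & mask) == value;
}
```

A match on "tracked, established, not new" would pass value `TRACKED | ESTABLISHED` with mask `TRACKED | ESTABLISHED | NEW`, while a packet that never went through the CT action fails any match that requires TRACKED.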
     

22 Jul, 2015

1 commit

  • Rename the tunnel metadata data structures currently internal to
    OVS and make them generic for use by all IP tunnels.

    Both structures are kernel internal and will stay that way. Their
    members are exposed to user space through individual Netlink
    attributes by OVS. It will therefore be possible to extend/modify
    these structures without affecting user ABI.

    Signed-off-by: Thomas Graf
    Signed-off-by: David S. Miller

    Thomas Graf
     

02 Jun, 2015

1 commit

  • If new optional attribute OVS_USERSPACE_ATTR_ACTIONS is added to an
    OVS_ACTION_ATTR_USERSPACE action, then include the datapath actions
    in the upcall.

    This directly associates the sampled packet with the path it takes
    through the virtual switch. Path information currently includes mangling,
    encapsulation and decapsulation actions for tunneling protocols GRE,
    VXLAN, Geneve, MPLS and QinQ, but this extension requires no further
    changes to accommodate datapath actions that may be added in the
    future.

    Adding path information enhances visibility into complex virtual
    networks.

    Signed-off-by: Neil McKee
    Signed-off-by: David S. Miller

    Neil McKee
     

13 Mar, 2015

1 commit

  • Having to say
    > #ifdef CONFIG_NET_NS
    > struct net *net;
    > #endif

    in structures is a little bit wordy and a little bit error prone.

    Instead it is possible to say:
    > typedef struct {
    > #ifdef CONFIG_NET_NS
    > struct net *net;
    > #endif
    > } possible_net_t;

    And then in a header say:

    > possible_net_t net;

    Which is cleaner and easier to use and easier to test, as the
    possible_net_t is always there no matter what the compile options.

    Further this allows read_pnet and write_pnet to be functions in all
    cases which is better at catching typos.

    This change adds possible_net_t, updates the definitions of read_pnet
    and write_pnet, updates optional struct net * variables that
    write_pnet uses on to have the type possible_net_t, and finally fixes
    up the b0rked users of read_pnet and write_pnet.

    Signed-off-by: "Eric W. Biederman"
    Acked-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric W. Biederman
     

10 Nov, 2014

4 commits


06 Nov, 2014

1 commit


06 Oct, 2014

1 commit

  • Currently, the flow information that is matched for tunnels and
    the tunnel data passed around with packets is the same. However,
    as additional information is added this is not necessarily desirable,
    as in the case of pointers.

    This adds a new structure for tunnel metadata which currently contains
    only the existing struct. This change is purely internal to the kernel
    since the current OVS_KEY_ATTR_IPV4_TUNNEL is simply a compressed version
    of OVS_KEY_ATTR_TUNNEL that is translated at flow setup.

    Signed-off-by: Jesse Gross
    Signed-off-by: Andy Zhou
    Acked-by: Pravin B Shelar
    Signed-off-by: David S. Miller

    Jesse Gross
     

16 Sep, 2014

4 commits

  • The recirc action allows a packet to reenter openvswitch processing.
    Currently openvswitch looks up the flow for a received packet and
    executes a set of actions on that packet. With the help of the recirc
    action we can process/modify the packet and recirculate it back into
    openvswitch for another pass.

    The OVS hash action calculates a 5-tuple hash and sets the hash in the
    flow key. This can be used along with recirculation for distributing
    packets among different ports for bond devices.
    For example:
    OVS bonding can use the following actions:
    Match on: bond flow; Action: hash, recirc(id)
    Match on: recirc-id == id and hash lower bits == a;
    Action: output port_bond_a

    Signed-off-by: Andy Zhou
    Acked-by: Jesse Gross
    Signed-off-by: Pravin B Shelar

    Andy Zhou
     
  • Currently tun_key is used for passing tunnel information
    on both the ingress and egress paths, which causes confusion. The
    following patch removes its use on the ingress path and makes it an
    egress-only parameter.

    Signed-off-by: Pravin B Shelar
    Acked-by: Andy Zhou

    Pravin B Shelar
     
  • OVS flow extract is called on the packet receive or packet
    execute code path. The following patch defines a separate API
    for extracting the flow key in the packet execute code path.

    Signed-off-by: Pravin B Shelar
    Acked-by: Andy Zhou

    Pravin B Shelar
     
  • OVS keeps a pointer to the packet key in skb->cb, but the packet key is
    stored on the stack. This could make the code a bit tricky, so it is
    better to get rid of the pointer.

    Signed-off-by: Pravin B Shelar

    Pravin B Shelar
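The bonding example from the recirc/hash commit in this group (hash, recirc(id), then match on recirc-id and the lower hash bits) can be sketched as two passes. Everything here is illustrative: the FNV-1a hash is a stand-in chosen for brevity and is not the datapath's hash function, and the struct and function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

struct five_tuple {
    uint32_t src_ip, dst_ip;
    uint16_t src_port, dst_port;
    uint8_t proto;
};

struct flow_key {
    uint32_t recirc_id;
    uint32_t hash;
};

/* Mix the low nbytes of v into an FNV-1a hash, one byte at a time. */
static uint32_t fnv1a_mix(uint32_t h, uint32_t v, int nbytes)
{
    for (int i = 0; i < nbytes; i++) {
        h ^= (v >> (8 * i)) & 0xff;
        h *= 16777619u;
    }
    return h;
}

/* Stand-in 5-tuple hash (FNV-1a), used only for this illustration. */
uint32_t hash_5tuple(const struct five_tuple *t)
{
    uint32_t h = 2166136261u;

    h = fnv1a_mix(h, t->src_ip, 4);
    h = fnv1a_mix(h, t->dst_ip, 4);
    h = fnv1a_mix(h, t->src_port, 2);
    h = fnv1a_mix(h, t->dst_port, 2);
    h = fnv1a_mix(h, t->proto, 1);
    return h;
}

/* First pass: "Action: hash, recirc(id)". */
void act_hash_and_recirc(struct flow_key *key, const struct five_tuple *t,
                         uint32_t id)
{
    key->hash = hash_5tuple(t);
    key->recirc_id = id;
}

/*
 * Second pass: match on recirc-id == id and the lower hash bits.
 * Returns the bond member index, or -1 if the recirc id does not match.
 * n_members must be a power of two so the mask selects the lower bits.
 */
int bond_lookup(const struct flow_key *key, uint32_t id,
                unsigned int n_members)
{
    assert(n_members > 0 && (n_members & (n_members - 1)) == 0);
    if (key->recirc_id != id)
        return -1;
    return (int)(key->hash & (n_members - 1));
}
```

Because the hash depends only on the 5-tuple, every packet of one connection lands on the same bond member, while different connections spread across members.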
     

31 Jul, 2014

1 commit


17 May, 2014

1 commit