13 Mar, 2019

1 commit

  • Patch series "generic radix trees; drop flex arrays".

    This patch (of 7):

    There was no real need for this code to be using flexarrays; it's just
    implementing a hash table. Ideally it would be using rhashtables, but
    that conversion would be significantly more complicated.

    Link: http://lkml.kernel.org/r/20181217131929.11727-2-kent.overstreet@gmail.com
    Signed-off-by: Kent Overstreet
    Reviewed-by: Matthew Wilcox
    Cc: Pravin B Shelar
    Cc: Alexey Dobriyan
    Cc: Al Viro
    Cc: Dave Hansen
    Cc: Eric Paris
    Cc: Marcelo Ricardo Leitner
    Cc: Neil Horman
    Cc: Paul Moore
    Cc: Shaohua Li
    Cc: Stephen Smalley
    Cc: Vlad Yasevich
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kent Overstreet
     

09 Nov, 2018

1 commit


08 Nov, 2017

1 commit

  • v16->17
    - Fixed disputed check code: keep them in nsh_push and nsh_pop
    but also add them in __ovs_nla_copy_actions

    v15->v16
    - Add csum recalculation for nsh_push, nsh_pop and set_nsh
    pointed out by Pravin
    - Move nsh key into the union with ipv4 and ipv6 and add
    check for nsh key in match_validate pointed out by Pravin
    - Add nsh check in validate_set and __ovs_nla_copy_actions

    v14->v15
    - Check size in nsh_hdr_from_nlattr
    - Fixed four small issues pointed out by Jiri and Eric

    v13->v14
    - Rename skb_push_nsh to nsh_push per Dave's comment
    - Rename skb_pop_nsh to nsh_pop per Dave's comment

    v12->v13
    - Fix NSH header length check in set_nsh

    v11->v12
    - Fix missed changes that older comments pointed out
    - Fix new comments for v11

    v10->v11
    - Fix the three remaining disputed comments for v9
    that were not fixed in v10.

    v9->v10
    - Change struct ovs_key_nsh to
    struct ovs_nsh_key_base base;
    __be32 context[NSH_MD1_CONTEXT_SIZE];
    - Fix new comments for v9

    v8->v9
    - Fix build error reported by daily intel build
    because nsh module isn't selected by openvswitch

    v7->v8
    - Rework nested value and mask for OVS_KEY_ATTR_NSH
    - Change pop_nsh to adapt to nsh kernel module
    - Fix many issues per comments from Jiri Benc

    v6->v7
    - Remove NSH GSO patches in v6 because Jiri Benc
    reworked it as another patch series and they have
    been merged.
    - Change it to adapt to nsh kernel module added by NSH
    GSO patch series

    v5->v6
    - Fix the remaining comments for v4.
    - Add NSH GSO support for VxLAN-gpe + NSH and
    Eth + NSH.

    v4->v5
    - Fix many comments by Jiri Benc and Eric Garver
    for v4.

    v3->v4
    - Add new NSH match field ttl
    - Update NSH header to the latest format
    which will be final format and won't change
    per its author's confirmation.
    - Fix comments for v3.

    v2->v3
    - Change OVS_KEY_ATTR_NSH to a nested key to handle
    fixed-length attributes and variable-length
    attributes more flexibly.
    - Remove struct ovs_action_push_nsh completely
    - Add code to handle nested attribute for SET_MASKED
    - Change PUSH_NSH to use the nested OVS_KEY_ATTR_NSH
    to transfer NSH header data.
    - Fix comments and coding style issues by Jiri and Eric

    v1->v2
    - Change encap_nsh and decap_nsh to push_nsh and pop_nsh
    - Dynamically allocate struct ovs_action_push_nsh for
    length-variable metadata.

    OVS master and the 2.8 branch have merged the NSH userspace
    patch series. This patch enables NSH support in the kernel
    datapath so that OVS can support NSH in compat mode by
    porting this.

    Signed-off-by: Yi Yang
    Acked-by: Jiri Benc
    Acked-by: Eric Garver
    Acked-by: Pravin Shelar
    Signed-off-by: David S. Miller

    Yi Yang
     

20 Jul, 2017

1 commit

  • When flow_free() frees a flow, it calls cpumask_next() once for each
    CPU in cpu_possible_mask (e.g. 128 by default). This consumes
    noticeable CPU time if flow_free() is called frequently, for example
    when all packets are sent to userspace via upcall and OvS sends them
    back via netlink to ovs_packet_cmd_execute(), which calls flow_free().

    The test topology is shown below. VM01 sends TCP packets to VM02,
    and OvS forwards the packets. During the test, we use perf to report
    the system performance.

    VM01 --- OvS-VM --- VM02

    Without this patch, perf-top shows the following; flow_free()
    accounts for 3.02% of CPU usage.

    4.23% [kernel] [k] _raw_spin_unlock_irqrestore
    3.62% [kernel] [k] __do_softirq
    3.16% [kernel] [k] __memcpy
    3.02% [kernel] [k] flow_free
    2.42% libc-2.17.so [.] __memcpy_ssse3_back
    2.18% [kernel] [k] copy_user_generic_unrolled
    2.17% [kernel] [k] find_next_bit

    With this patch applied, perf-top shows the following; flow_free()
    no longer appears in the list.

    4.11% [kernel] [k] _raw_spin_unlock_irqrestore
    3.79% [kernel] [k] __do_softirq
    3.46% [kernel] [k] __memcpy
    2.73% libc-2.17.so [.] __memcpy_ssse3_back
    2.25% [kernel] [k] copy_user_generic_unrolled
    1.89% libc-2.17.so [.] _int_malloc
    1.53% ovs-vswitchd [.] xlate_actions

    With this patch, the TCP throughput between VMs (we don't use the
    Megaflow Cache + Microflow Cache) rises from 1.18 Gb/s to 1.30 Gb/s
    (roughly a 10% performance improvement).

    This patch adds a cpumask struct, cpu_used_mask, which stores the IDs
    of the CPUs that the flow has used. We then only check the flow_stats
    on the CPUs that were actually used; it is unnecessary to check all
    possible CPUs when getting, clearing, and updating the flow_stats.
    Adding cpu_used_mask to the sw_flow struct doesn't increase the
    number of cache lines.
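
    The idea can be sketched in plain userspace C. This is a minimal
    illustration, not the kernel code: the struct layout, MAX_CPUS and all
    function names are assumptions made for the example, and a plain
    uint64_t array stands in for the kernel cpumask type.

```c
#include <assert.h>
#include <stdint.h>

#define MAX_CPUS   128              /* assumption: the "128 as default" figure */
#define MASK_WORDS (MAX_CPUS / 64)

struct flow_stats {
    uint64_t packets;
    uint64_t bytes;
};

struct flow {
    struct flow_stats stats[MAX_CPUS];
    uint64_t cpu_used_mask[MASK_WORDS]; /* bit n set: CPU n has updated stats */
};

/* Index of the lowest set bit; the caller guarantees v != 0. */
static unsigned lowest_set_bit(uint64_t v)
{
    unsigned n = 0;

    while (!(v & 1)) {
        v >>= 1;
        n++;
    }
    return n;
}

/* Record a packet on one CPU and remember that this CPU was used. */
static void flow_stats_update(struct flow *f, unsigned cpu, uint64_t len)
{
    assert(cpu < MAX_CPUS);
    f->stats[cpu].packets++;
    f->stats[cpu].bytes += len;
    f->cpu_used_mask[cpu / 64] |= UINT64_C(1) << (cpu % 64);
}

/* Sum the stats, visiting only CPUs whose bit is set in the mask. */
static struct flow_stats flow_stats_get(const struct flow *f, unsigned *visited)
{
    struct flow_stats total = { 0, 0 };

    *visited = 0;
    for (unsigned w = 0; w < MASK_WORDS; w++) {
        uint64_t bits = f->cpu_used_mask[w];

        while (bits) {
            unsigned cpu = w * 64 + lowest_set_bit(bits);

            total.packets += f->stats[cpu].packets;
            total.bytes += f->stats[cpu].bytes;
            bits &= bits - 1;       /* clear the bit just handled */
            (*visited)++;
        }
    }
    return total;
}
```

    A flow touched by two CPUs costs two stats reads here instead of one
    per possible CPU.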

    Signed-off-by: Tonghao Zhang
    Acked-by: Pravin B Shelar
    Signed-off-by: David S. Miller

    Tonghao Zhang
     

10 Feb, 2017

2 commits

  • struct sw_flow_key has two 16-bit holes. Move the most matched
    conntrack match fields there. In some typical cases this halves the
    size of the key that needs to be hashed, so that it fits into one
    cache line.

    Signed-off-by: Jarno Rajahalme
    Acked-by: Joe Stringer
    Acked-by: Pravin B Shelar
    Signed-off-by: David S. Miller

    Jarno Rajahalme
     
  • Add the fields of the conntrack original direction 5-tuple to struct
    sw_flow_key. The new fields are initially marked as non-existent, and
    are populated whenever a conntrack action is executed and either finds
    or generates a conntrack entry. This means that these fields exist
    for all packets that were not rejected by conntrack as untrackable.

    The original tuple fields in the sw_flow_key are filled from the
    original direction tuple of the conntrack entry relating to the
    current packet, or from the original direction tuple of the master
    conntrack entry, if the current conntrack entry has a master.
    Generally, expected connections of connections having an assigned
    helper (e.g., FTP), have a master conntrack entry.

    The main purpose of the new conntrack original tuple fields is to
    allow matching on them for policy decision purposes, with the premise
    that the admissibility of tracked connections' reply packets (as well
    as original direction packets), and both direction packets of any
    related connections may be based on ACL rules applying to the master
    connection's original direction 5-tuple. This also makes it easier to
    make policy decisions when the actual packet headers might have been
    transformed by NAT, as the original direction 5-tuple represents the
    packet headers before any such transformation.

    When using the original direction 5-tuple the admissibility of return
    and/or related packets need not be based on the mere existence of a
    conntrack entry, allowing separation of admission policy from the
    established conntrack state. While existence of a conntrack entry is
    required for admission of the return or related packets, policy
    changes can render connections that were initially admitted to be
    rejected or dropped afterwards. If the admission of the return and
    related packets was based on mere conntrack state (e.g., connection
    being in an established state), a policy change that would make the
    connection rejected or dropped would need to find and delete all
    conntrack entries affected by such a change. When using the original
    direction 5-tuple matching, the affected conntrack entries can be
    allowed to time out instead, as the established state of the
    connection would not need to be the basis for packet admission any
    more.

    It should be noted that the directionality of related connections may
    be the same or different than that of the master connection, and
    neither the original direction 5-tuple nor the conntrack state bits
    carry this information. If needed, the directionality of the master
    connection can be stored in master's conntrack mark or labels, which
    are automatically inherited by the expected related connections.

    The fact that neither ARP nor ND packets are trackable by conntrack
    allows mutual exclusion between ARP/ND and the new conntrack original
    tuple fields. Hence, the IP addresses are overlaid in union with ARP
    and ND fields. This allows the sw_flow_key to not grow much due to
    this patch, but it also means that we must be careful to never use the
    new key fields with ARP or ND packets. ARP is easy to distinguish and
    keep mutually exclusive based on the ethernet type, but ND being an
    ICMPv6 protocol requires a bit more attention.

    Signed-off-by: Jarno Rajahalme
    Acked-by: Joe Stringer
    Acked-by: Pravin B Shelar
    Signed-off-by: David S. Miller

    Jarno Rajahalme
     

13 Nov, 2016

1 commit

  • Use a hole in the structure. We support only Ethernet so far and will add
    support for L2-less packets shortly. We could use a bool to indicate
    whether the Ethernet header is present or not but the approach with the
    mac_proto field is more generic and occupies the same number of bytes in the
    struct, while allowing later extensibility. It also makes the code in the
    next patches more self-explanatory.

    It would be nice to use ARPHRD_ constants but those are u16, which would be
    wasteful. Thus define our own constants.

    Another upside of this is that we can overload this new field to also denote
    whether the flow key is valid. This has the advantage that on
    refragmentation, we don't have to reparse the packet but can rely on the
    stored eth.type. This is especially important for the next patches in this
    series - instead of adding another branch for L2-less packets before calling
    ovs_fragment, we can just remove all those branches completely.

    Signed-off-by: Jiri Benc
    Acked-by: Pravin B Shelar
    Signed-off-by: David S. Miller

    Jiri Benc
     

19 Sep, 2016

1 commit

  • Instead of using flow stats per NUMA node, use it per CPU. When using
    megaflows, the stats lock can be a bottleneck in scalability.

    On a E5-2690 12-core system, usual throughput went from ~4Mpps to
    ~15Mpps when forwarding between two 40GbE ports with a single flow
    configured on the datapath.

    This has been tested on a system with possible CPUs 0-7,16-23. After
    module removal, there was no corruption on the slab cache.

    Signed-off-by: Thadeu Lima de Souza Cascardo
    Cc: pravin shelar
    Acked-by: Pravin B Shelar
    Signed-off-by: David S. Miller

    Thadeu Lima de Souza Cascardo
     

09 Sep, 2016

1 commit

  • Add support for 802.1ad including the ability to push and pop double
    tagged vlans. Add support for 802.1ad to netlink parsing and flow
    conversion. Use double-nested encap attributes to represent
    double-tagged VLANs. The inner TPID is encoded along with ctci in
    nested attributes.

    This is based on Thomas F Herbert's original v20 patch. I made some
    small clean ups and bug fixes.

    Signed-off-by: Thomas F Herbert
    Signed-off-by: Eric Garver
    Acked-by: Pravin B Shelar
    Signed-off-by: David S. Miller

    Eric Garver
     

19 Mar, 2016

1 commit


20 Oct, 2015

1 commit


07 Oct, 2015

1 commit

  • Store tunnel protocol (AF_INET or AF_INET6) in sw_flow_key. This field now
    also acts as an indicator whether the flow contains tunnel data (this was
    previously indicated by tun_key.u.ipv4.dst being set but with IPv6 addresses
    in a union with IPv4 ones this won't work anymore).

    The new field was added to a hole in sw_flow_key.

    Signed-off-by: Jiri Benc
    Acked-by: Pravin B Shelar
    Acked-by: Thomas Graf
    Signed-off-by: David S. Miller

    Jiri Benc
     

05 Oct, 2015

1 commit

  • Conntrack LABELS (plural) are exposed by conntrack; rename the OVS name
    for these to be consistent with conntrack.

    Fixes: c2ac667 "openvswitch: Allow matching on conntrack label"
    Signed-off-by: Joe Stringer
    Acked-by: Pravin B Shelar
    Signed-off-by: David S. Miller

    Joe Stringer
     

28 Aug, 2015

4 commits

  • Allow matching and setting the ct_label field. As with ct_mark, this is
    populated by executing the CT action. The label field may be modified by
    specifying a label and mask nested under the CT action. It is stored as
    metadata attached to the connection. Label modification occurs after
    lookup, and will only persist when the conntrack entry is committed by
    providing the COMMIT flag to the CT action. Labels are currently fixed
    to 128 bits in size.
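
    A label-and-mask update of this kind can be sketched as follows. This
    is a minimal userspace illustration under assumptions: struct ct_labels
    and ct_labels_set_masked() are hypothetical names, and only the 128-bit
    size is taken from the text above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define CT_LABELS_LEN 16   /* 128 bits, as stated in the text above */

struct ct_labels {
    uint8_t b[CT_LABELS_LEN];
};

/* For every bit set in mask, copy the bit from value into cur;
 * bits that are clear in mask keep their current contents. */
static void ct_labels_set_masked(struct ct_labels *cur,
                                 const struct ct_labels *value,
                                 const struct ct_labels *mask)
{
    assert(cur && value && mask);
    for (size_t i = 0; i < CT_LABELS_LEN; i++)
        cur->b[i] = (uint8_t)((cur->b[i] & ~mask->b[i]) |
                              (value->b[i] & mask->b[i]));
}
```

    In this sketch an all-zero mask leaves the stored labels untouched,
    which is why a mask travels with the label value.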

    Signed-off-by: Joe Stringer
    Acked-by: Thomas Graf
    Acked-by: Pravin B Shelar
    Signed-off-by: David S. Miller

    Joe Stringer
     
  • Allow matching and setting the ct_mark field. As with ct_state and
    ct_zone, these fields are populated when the CT action is executed. To
    write to this field, a value and mask can be specified as a nested
    attribute under the CT action. This data is stored with the conntrack
    entry, and is executed after the lookup occurs for the CT action. The
    conntrack entry itself must be committed using the COMMIT flag in the CT
    action flags for this change to persist.

    Signed-off-by: Justin Pettit
    Signed-off-by: Joe Stringer
    Acked-by: Thomas Graf
    Acked-by: Pravin B Shelar
    Signed-off-by: David S. Miller

    Joe Stringer
     
  • Expose the kernel connection tracker via OVS. Userspace components can
    make use of the CT action to populate the connection state (ct_state)
    field for a flow. This state can be subsequently matched.

    Exposed connection states are OVS_CS_F_*:
    - NEW (0x01) - Beginning of a new connection.
    - ESTABLISHED (0x02) - Part of an existing connection.
    - RELATED (0x04) - Related to an established connection.
    - INVALID (0x20) - Could not track the connection for this packet.
    - REPLY_DIR (0x40) - This packet is in the reply direction for the flow.
    - TRACKED (0x80) - This packet has been sent through conntrack.

    When the CT action is executed by itself, it will send the packet
    through the connection tracker and populate the ct_state field with one
    or more of the connection state flags above. The CT action will always
    set the TRACKED bit.
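
    As a rough illustration of matching on these bits, the sketch below
    uses the flag values exactly as listed in this message. The helper
    ct_state_matches() is a hypothetical name, not an OVS function, and
    treating a value with bits outside the mask as a caller error is an
    assumption of the example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Flag values as listed in the commit message above. */
enum {
    OVS_CS_F_NEW         = 0x01,
    OVS_CS_F_ESTABLISHED = 0x02,
    OVS_CS_F_RELATED     = 0x04,
    OVS_CS_F_INVALID     = 0x20,
    OVS_CS_F_REPLY_DIR   = 0x40,
    OVS_CS_F_TRACKED     = 0x80
};

/* True if the packet's ct_state agrees with value on every bit in mask. */
static bool ct_state_matches(uint8_t state, uint8_t value, uint8_t mask)
{
    assert((value & ~mask) == 0);   /* value must not set unmasked bits */
    return ((state ^ value) & mask) == 0;
}
```

    A packet that has not yet passed through the CT action has the TRACKED
    bit clear, so a match with value 0 and mask OVS_CS_F_TRACKED selects
    untracked packets.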

    When the COMMIT flag is passed to the conntrack action, this specifies
    that information about the connection should be stored. This allows
    subsequent packets for the same (or related) connections to be
    correlated with this connection. Sending subsequent packets for the
    connection through conntrack allows the connection tracker to consider
    the packets as ESTABLISHED, RELATED, and/or REPLY_DIR.

    The CT action may optionally take a zone to track the flow within. This
    allows connections with the same 5-tuple to be kept logically separate
    from connections in other zones. If the zone is specified, then the
    "ct_zone" match field will be subsequently populated with the zone id.

    IP fragments are handled by transparently assembling them as part of the
    CT action. The maximum received unit (MRU) size is tracked so that
    refragmentation can occur during output.

    IP frag handling contributed by Andy Zhou.

    Based on original design by Justin Pettit.

    Signed-off-by: Joe Stringer
    Signed-off-by: Justin Pettit
    Signed-off-by: Andy Zhou
    Acked-by: Thomas Graf
    Acked-by: Pravin B Shelar
    Signed-off-by: David S. Miller

    Joe Stringer
     
  • Previously, we used the kernel-internal netlink actions length to
    calculate the size of messages to serialize back to userspace.
    However, the sw_flow_actions may not be formatted exactly the same as the
    actions on the wire, so store the original actions length when
    de-serializing and re-use the original length when serializing.

    Signed-off-by: Joe Stringer
    Acked-by: Pravin B Shelar
    Acked-by: Thomas Graf
    Signed-off-by: David S. Miller

    Joe Stringer
     

22 Jul, 2015

2 commits

  • Utilize the new metadata dst to attach encapsulation instructions to
    the skb. The existing egress_tun_info via the OVS_CB() is left in
    place until all tunnel vports have been converted to the new method.

    Signed-off-by: Thomas Graf
    Signed-off-by: Pravin B Shelar
    Signed-off-by: David S. Miller

    Thomas Graf
     
  • Rename the tunnel metadata data structures currently internal to
    OVS and make them generic for use by all IP tunnels.

    Both structures are kernel internal and will stay that way. Their
    members are exposed to user space through individual Netlink
    attributes by OVS. It will therefore be possible to extend/modify
    these structures without affecting user ABI.

    Signed-off-by: Thomas Graf
    Signed-off-by: David S. Miller

    Thomas Graf
     

27 Jan, 2015

1 commit

  • Previously, flows were manipulated by userspace specifying a full,
    unmasked flow key. This adds significant burden onto flow
    serialization/deserialization, particularly when dumping flows.

    This patch adds an alternative way to refer to flows using a
    variable-length "unique flow identifier" (UFID). At flow setup time,
    userspace may specify a UFID for a flow, which is stored with the flow
    and inserted into a separate table for lookup, in addition to the
    standard flow table. Flows created using a UFID must be fetched or
    deleted using the UFID.

    All flow dump operations may now be made more terse with OVS_UFID_F_*
    flags. For example, the OVS_UFID_F_OMIT_KEY flag allows responses to
    omit the flow key from a datapath operation if the flow has a
    corresponding UFID. This significantly reduces the time spent assembling
    and transacting netlink messages. With all OVS_UFID_F_OMIT_* flags
    enabled, the datapath only returns the UFID and statistics for each flow
    during flow dump, increasing ovs-vswitchd revalidator performance by 40%
    or more.

    Signed-off-by: Joe Stringer
    Acked-by: Pravin B Shelar
    Signed-off-by: David S. Miller

    Joe Stringer
     

15 Jan, 2015

1 commit

  • Also factors out Geneve validation code into a new separate function
    validate_and_copy_geneve_opts().

    A subsequent patch will introduce VXLAN options. Rename the existing
    GENEVE_TUN_OPTS() to reflect its extended purpose of carrying generic
    tunnel metadata options.

    Signed-off-by: Thomas Graf
    Signed-off-by: David S. Miller

    Thomas Graf
     

10 Nov, 2014

3 commits


06 Nov, 2014

1 commit

  • Allow datapath to recognize and extract MPLS labels into flow keys
    and execute actions which push, pop, and set labels on packets.

    Based heavily on work by Leo Alterman, Ravi K, Isaku Yamahata and Joe Stringer.

    Cc: Ravi K
    Cc: Leo Alterman
    Cc: Isaku Yamahata
    Cc: Joe Stringer
    Signed-off-by: Simon Horman
    Signed-off-by: Jesse Gross
    Signed-off-by: Pravin B Shelar

    Simon Horman
     

06 Oct, 2014

2 commits

  • The Openvswitch implementation is completely agnostic to the options
    that are in use and can handle newly defined options without
    further work. It does this by simply matching on a byte array
    of options and allowing userspace to setup flows on this array.

    Signed-off-by: Jesse Gross
    Signed-off-by: Ansis Atteka
    Signed-off-by: Andy Zhou
    Acked-by: Thomas Graf
    Acked-by: Pravin B Shelar
    Signed-off-by: David S. Miller

    Jesse Gross
     
  • Currently, the flow information that is matched for tunnels and
    the tunnel data passed around with packets is the same. However,
    as additional information is added this is not necessarily desirable,
    as in the case of pointers.

    This adds a new structure for tunnel metadata which currently contains
    only the existing struct. This change is purely internal to the kernel
    since the current OVS_KEY_ATTR_IPV4_TUNNEL is simply a compressed version
    of OVS_KEY_ATTR_TUNNEL that is translated at flow setup.

    Signed-off-by: Jesse Gross
    Signed-off-by: Andy Zhou
    Acked-by: Pravin B Shelar
    Signed-off-by: David S. Miller

    Jesse Gross
     

16 Sep, 2014

3 commits

  • The recirc action allows a packet to reenter openvswitch processing.
    Currently openvswitch looks up the flow for a received packet and
    executes a set of actions on that packet. With the help of the recirc
    action we can process/modify the packet and recirculate it back into
    openvswitch for another pass.

    The OVS hash action calculates a 5-tuple hash and sets the hash in
    the flow key. This can be used along with recirculation to distribute
    packets among different ports for bond devices.
    For example:
    OVS bonding can use the following actions:
    Match on: bond flow; Action: hash, recirc(id)
    Match on: recirc-id == id and hash lower bits == a;
    Action: output port_bond_a
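
    The two-pass bonding example above can be sketched in userspace C.
    Everything here is illustrative: the structs and function names are
    hypothetical, FNV-1a is used only as a stand-in hash (the message does
    not say which hash the datapath uses), and the port count is assumed
    to be a power of two so that the lower hash bits select a port.

```c
#include <assert.h>
#include <stdint.h>

struct five_tuple {
    uint32_t src_ip, dst_ip;
    uint16_t src_port, dst_port;
    uint8_t proto;
};

struct packet {
    struct five_tuple key;
    uint32_t hash;        /* filled in by the hash action */
    uint32_t recirc_id;   /* 0 until the packet is recirculated */
};

/* Mix the low nbytes bytes of v into an FNV-1a hash state. */
static uint32_t fnv1a_mix(uint32_t h, uint32_t v, unsigned nbytes)
{
    for (unsigned i = 0; i < nbytes; i++) {
        h ^= (v >> (8 * i)) & 0xffu;
        h *= 16777619u;
    }
    return h;
}

/* First pass: "Action: hash, recirc(id)". */
static void action_hash_recirc(struct packet *p, uint32_t id)
{
    uint32_t h = 2166136261u;

    h = fnv1a_mix(h, p->key.src_ip, 4);
    h = fnv1a_mix(h, p->key.dst_ip, 4);
    h = fnv1a_mix(h, p->key.src_port, 2);
    h = fnv1a_mix(h, p->key.dst_port, 2);
    h = fnv1a_mix(h, p->key.proto, 1);
    p->hash = h;
    p->recirc_id = id;
}

/* Second pass: match on recirc id and the lower hash bits, then output.
 * Returns -1 when the packet does not carry the expected recirc id. */
static int bond_select_port(const struct packet *p, uint32_t id,
                            const int *ports, unsigned n_ports)
{
    assert(n_ports != 0 && (n_ports & (n_ports - 1)) == 0);
    if (p->recirc_id != id)
        return -1;
    return ports[p->hash & (n_ports - 1)];
}
```

    Because the hash depends only on the 5-tuple, every packet of one
    connection lands on the same bond member.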

    Signed-off-by: Andy Zhou
    Acked-by: Jesse Gross
    Signed-off-by: Pravin B Shelar

    Andy Zhou
     
  • Currently tun_key is used for passing tunnel information
    on both the ingress and egress paths, which causes confusion. This
    patch removes its use on the ingress path, making it an egress-only
    parameter.

    Signed-off-by: Pravin B Shelar
    Acked-by: Andy Zhou

    Pravin B Shelar
     
  • OVS flow extract is called on the packet receive or packet
    execute code path. This patch defines a separate API
    for extracting the flow key in the packet execute code path.

    Signed-off-by: Pravin B Shelar
    Acked-by: Andy Zhou

    Pravin B Shelar
     

30 Jun, 2014

1 commit

  • Flow statistics need to take into account the TCP flags from the packet
    currently being processed (in 'key'), not the TCP flags matched by the
    flow found in the kernel flow table (in 'flow').

    This bug made the Open vSwitch userspace fin_timeout action have no effect
    in many cases.
    This bug was introduced by commit 88d73f6c411ac2f0578 (openvswitch: Use
    TCP flags in the flow key for stats.)
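
    The difference between the two sources of TCP flags can be shown with
    a small sketch. The structs and function names below are hypothetical
    simplifications, not the kernel's definitions; the point is only which
    key the flags are read from.

```c
#include <assert.h>
#include <stdint.h>

#define TCP_FLAG_FIN 0x001
#define TCP_FLAG_ACK 0x010

struct flow_key {
    uint16_t tcp_flags;
};

struct flow {
    struct flow_key key;      /* masked key stored in the flow table */
    uint16_t tcp_flags_seen;  /* flags accumulated for statistics */
    uint64_t packets;
};

/* Buggy variant: reads the flags the flow matches on. When tcp_flags is
 * wildcarded in the flow, this contributes nothing. */
static void stats_update_buggy(struct flow *flow, const struct flow_key *pkt)
{
    assert(flow && pkt);
    flow->tcp_flags_seen |= flow->key.tcp_flags;
    flow->packets++;
}

/* Fixed variant: reads the flags of the packet being processed. */
static void stats_update_fixed(struct flow *flow, const struct flow_key *pkt)
{
    assert(flow && pkt);
    flow->tcp_flags_seen |= pkt->tcp_flags;
    flow->packets++;
}
```

    With the buggy variant a FIN never reaches the accumulated flags of a
    flow that wildcards tcp_flags, which is consistent with fin_timeout
    having no effect.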

    Reported-by: Len Gao
    Signed-off-by: Ben Pfaff
    Acked-by: Jarno Rajahalme
    Acked-by: Jesse Gross
    Signed-off-by: Pravin B Shelar

    Ben Pfaff
     

23 May, 2014

2 commits

  • For ovs_flow_stats_get() using ovsl_dereference() was wrong, since
    flow dumps call this with RCU read lock.

    ovs_flow_stats_clear() is always called with ovs_mutex, so can use
    ovsl_dereference().

    Also, make the ovs_flow_stats_get() 'flow' argument const to make
    later patches cleaner.

    Signed-off-by: Jarno Rajahalme
    Signed-off-by: Pravin B Shelar

    Jarno Rajahalme
     
  • Minimize padding in sw_flow_key and move 'tp' to the main struct.
    These changes simplify code when accessing the transport port numbers
    and the tcp flags, and make the sw_flow_key 8 bytes smaller on 64-bit
    systems (128->120 bytes). These changes also make the keys for IPv4
    packets fit in one cache line.

    There is a valid concern for safety of packing the struct
    ovs_key_ipv4_tunnel, as it would be possible to take the address of
    the tun_id member as a __be64 * which could result in unaligned access
    in some systems. However:

    - sw_flow_key itself is 64-bit aligned, so the tun_id within is
      always 64-bit aligned.
    - We never make arrays of ovs_key_ipv4_tunnel (which would force
      every second tun_key to be misaligned).
    - We never take the address of the tun_id into a __be64 *.
    - Wherever we use struct ovs_key_ipv4_tunnel outside the
      sw_flow_key, it is on the stack (in tunnel input functions), where
      the compiler has full control of the alignment.

    Signed-off-by: Jarno Rajahalme
    Signed-off-by: Pravin B Shelar

    Jarno Rajahalme
     

17 May, 2014

2 commits

  • Keep kernel flow stats for each NUMA node rather than each (logical)
    CPU. This avoids using the per-CPU allocator and removes most of the
    kernel-side OVS locking overhead otherwise on the top of perf reports
    and allows OVS to scale better with higher number of threads.

    With 9 handlers and 4 revalidators netperf TCP_CRR test flow setup
    rate doubles on a server with two hyper-threaded physical CPUs (16
    logical cores each) compared to the current OVS master. Tested with
    non-trivial flow table with a TCP port match rule forcing all new
    connections with unique port numbers to OVS userspace. The IP
    addresses are still wildcarded, so the kernel flows are not considered
    as exact match 5-tuple flows. This type of flow can be expected to
    appear in large numbers as the result of more effective wildcarding
    made possible by improvements in OVS userspace flow classifier.

    Perf results for this test (master):

    Events: 305K cycles
    + 8.43% ovs-vswitchd [kernel.kallsyms] [k] mutex_spin_on_owner
    + 5.64% ovs-vswitchd [kernel.kallsyms] [k] __ticket_spin_lock
    + 4.75% ovs-vswitchd ovs-vswitchd [.] find_match_wc
    + 3.32% ovs-vswitchd libpthread-2.15.so [.] pthread_mutex_lock
    + 2.61% ovs-vswitchd [kernel.kallsyms] [k] pcpu_alloc_area
    + 2.19% ovs-vswitchd ovs-vswitchd [.] flow_hash_in_minimask_range
    + 2.03% swapper [kernel.kallsyms] [k] intel_idle
    + 1.84% ovs-vswitchd libpthread-2.15.so [.] pthread_mutex_unlock
    + 1.64% ovs-vswitchd ovs-vswitchd [.] classifier_lookup
    + 1.58% ovs-vswitchd libc-2.15.so [.] 0x7f4e6
    + 1.07% ovs-vswitchd [kernel.kallsyms] [k] memset
    + 1.03% netperf [kernel.kallsyms] [k] __ticket_spin_lock
    + 0.92% swapper [kernel.kallsyms] [k] __ticket_spin_lock
    ...

    And after this patch:

    Events: 356K cycles
    + 6.85% ovs-vswitchd ovs-vswitchd [.] find_match_wc
    + 4.63% ovs-vswitchd libpthread-2.15.so [.] pthread_mutex_lock
    + 3.06% ovs-vswitchd [kernel.kallsyms] [k] __ticket_spin_lock
    + 2.81% ovs-vswitchd ovs-vswitchd [.] flow_hash_in_minimask_range
    + 2.51% ovs-vswitchd libpthread-2.15.so [.] pthread_mutex_unlock
    + 2.27% ovs-vswitchd ovs-vswitchd [.] classifier_lookup
    + 1.84% ovs-vswitchd libc-2.15.so [.] 0x15d30f
    + 1.74% ovs-vswitchd [kernel.kallsyms] [k] mutex_spin_on_owner
    + 1.47% swapper [kernel.kallsyms] [k] intel_idle
    + 1.34% ovs-vswitchd ovs-vswitchd [.] flow_hash_in_minimask
    + 1.33% ovs-vswitchd ovs-vswitchd [.] rule_actions_unref
    + 1.16% ovs-vswitchd ovs-vswitchd [.] hindex_node_with_hash
    + 1.16% ovs-vswitchd ovs-vswitchd [.] do_xlate_actions
    + 1.09% ovs-vswitchd ovs-vswitchd [.] ofproto_rule_ref
    + 1.01% netperf [kernel.kallsyms] [k] __ticket_spin_lock
    ...

    There is a small increase in kernel spinlock overhead due to the same
    spinlock being shared between multiple cores of the same physical CPU,
    but that is barely visible in the netperf TCP_CRR test performance
    (maybe ~1% performance drop, hard to tell exactly due to variance in
    the test results), when testing for kernel module throughput (with no
    userspace activity, handful of kernel flows).

    On flow setup, a single stats instance is allocated (for the NUMA node
    0). As CPUs from multiple NUMA nodes start updating stats, new
    NUMA-node specific stats instances are allocated. This allocation on
    the packet processing code path is made to never block or look for
    emergency memory pools, minimizing the allocation latency. If the
    allocation fails, the existing preallocated stats instance is used.
    Also, if only CPUs from one NUMA-node are updating the preallocated
    stats instance, no additional stats instances are allocated. This
    eliminates the need to pre-allocate stats instances that will not be
    used, also relieving the stats reader from the burden of reading stats
    that are never used.
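
    The allocation policy described above can be approximated in userspace
    C. This is a simplified sketch, not the kernel implementation: locking
    and RCU are omitted, calloc() stands in for the non-blocking kernel
    allocation, and MAX_NODES, last_writer and the allow_alloc parameter
    (used to simulate allocation failure) are assumptions of the example.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define MAX_NODES 4     /* assumption: small fixed node count */
#define NO_NODE   (-1)

struct node_stats {
    uint64_t packets;
    uint64_t bytes;
};

struct flow {
    struct node_stats *stats[MAX_NODES]; /* stats[0] is preallocated */
    int last_writer;                     /* node that last used stats[0] */
};

static int flow_init(struct flow *f)
{
    for (int i = 0; i < MAX_NODES; i++)
        f->stats[i] = NULL;
    f->last_writer = NO_NODE;
    f->stats[0] = calloc(1, sizeof(*f->stats[0]));
    return f->stats[0] ? 0 : -1;
}

static void flow_stats_update(struct flow *f, int node, uint64_t len,
                              int allow_alloc)
{
    struct node_stats *s;

    assert(node >= 0 && node < MAX_NODES);
    s = f->stats[node];
    if (node == 0) {
        f->last_writer = 0;
    } else if (!s) {
        s = f->stats[0];        /* default: share the preallocated instance */
        if (f->last_writer != node) {
            /* Another node already wrote here: try a node-local instance. */
            if (f->last_writer != NO_NODE && allow_alloc) {
                struct node_stats *n = calloc(1, sizeof(*n));

                if (n)
                    s = f->stats[node] = n;
            }
            if (s == f->stats[0])
                f->last_writer = node;  /* still sharing: record the writer */
        }
    }
    s->packets++;
    s->bytes += len;
}

/* Readers only visit instances that were actually allocated. */
static struct node_stats flow_stats_get(const struct flow *f)
{
    struct node_stats total = { 0, 0 };

    for (int i = 0; i < MAX_NODES; i++) {
        if (f->stats[i]) {
            total.packets += f->stats[i]->packets;
            total.bytes += f->stats[i]->bytes;
        }
    }
    return total;
}

static void flow_destroy(struct flow *f)
{
    for (int i = 0; i < MAX_NODES; i++) {
        free(f->stats[i]);
        f->stats[i] = NULL;
    }
}
```

    In this sketch a flow updated from a single node never allocates a
    second instance, and a failed allocation silently falls back to the
    preallocated one.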

    Signed-off-by: Jarno Rajahalme
    Acked-by: Pravin B Shelar
    Signed-off-by: Jesse Gross

    Jarno Rajahalme
     
  • The 5-tuple optimization becomes unnecessary with a later per-NUMA
    node stats patch. Remove it first to make the changes easier to
    grasp.

    Signed-off-by: Jarno Rajahalme
    Signed-off-by: Jesse Gross

    Jarno Rajahalme
     

07 Jan, 2014

2 commits

  • With the megaflow implementation, an OVS flow can be shared between
    multiple CPUs, which makes stats updates a highly contended
    operation. This patch uses per-CPU stats in cases where a flow
    is likely to be shared (if there is a wildcard in the 5-tuple
    and therefore likely to be spread by RSS). In other situations,
    it uses the current strategy, saving memory and allocation time.

    Signed-off-by: Pravin B Shelar
    Signed-off-by: Jesse Gross

    Pravin B Shelar
     
  • We won't normally have a ton of flow masks but using a size_t to store
    values no bigger than sizeof(struct sw_flow_key) seems excessive.

    This reduces sw_flow_key_range and sw_flow_mask by 4 bytes on 32-bit
    systems. On 64-bit systems it shrinks sw_flow_key_range by 12 bytes but
    sw_flow_mask only by 8 bytes due to padding.

    Compile tested only.

    Signed-off-by: Ben Pfaff
    Acked-by: Andy Zhou
    Signed-off-by: Jesse Gross

    Ben Pfaff
     

02 Nov, 2013

2 commits

  • tcp_flags=flags/mask
    Bitwise match on TCP flags. The flags and mask are 16-bit
    numbers written in decimal or in hexadecimal prefixed by 0x. Each
    1-bit in mask requires that the corresponding bit in flags must
    match. Each 0-bit in mask causes the corresponding bit to be
    ignored.

    TCP protocol currently defines 9 flag bits, and additional 3
    bits are reserved (must be transmitted as zero), see RFCs 793,
    3168, and 3540. The flag bits are, numbering from the least
    significant bit:

    0: FIN No more data from sender.

    1: SYN Synchronize sequence numbers.

    2: RST Reset the connection.

    3: PSH Push function.

    4: ACK Acknowledgement field significant.

    5: URG Urgent pointer field significant.

    6: ECE ECN Echo.

    7: CWR Congestion Window Reduced.

    8: NS Nonce Sum.

    9-11: Reserved.

    12-15: Not matchable, must be zero.
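
    The flags/mask semantics and the bit numbering above can be sketched
    as follows. The constant and function names are hypothetical, chosen
    for the example; only the bit positions and the rule that bits 12-15
    must be zero come from the text.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Bit positions follow the list above (bit 0 is least significant). */
enum {
    TCP_F_FIN = 1 << 0,
    TCP_F_SYN = 1 << 1,
    TCP_F_RST = 1 << 2,
    TCP_F_PSH = 1 << 3,
    TCP_F_ACK = 1 << 4,
    TCP_F_URG = 1 << 5,
    TCP_F_ECE = 1 << 6,
    TCP_F_CWR = 1 << 7,
    TCP_F_NS  = 1 << 8
};

#define TCP_FLAGS_MATCHABLE 0x0fffu   /* bits 0-11; bits 12-15 must be zero */

/* A flags/mask pair is acceptable only if bits 12-15 are clear in both. */
static bool tcp_flags_spec_valid(uint16_t flags, uint16_t mask)
{
    return ((flags | mask) & ~TCP_FLAGS_MATCHABLE) == 0;
}

/* Each 1-bit in mask must agree between pkt and flags; 0-bits are ignored. */
static bool tcp_flags_match(uint16_t pkt, uint16_t flags, uint16_t mask)
{
    assert(tcp_flags_spec_valid(flags, mask));
    return ((pkt ^ flags) & mask) == 0;
}
```

    For example, flags 0x002 with mask 0x012 selects packets that have SYN
    set and ACK clear, regardless of the other bits.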

    Signed-off-by: Jarno Rajahalme
    Signed-off-by: Jesse Gross

    Jarno Rajahalme
     
  • Widen TCP flags handling from 7 bits (uint8_t) to 12 bits (uint16_t).
    The kernel interface remains at 8 bits, which makes no functional
    difference now, as none of the higher bits is currently of interest
    to the userspace.

    Signed-off-by: Jarno Rajahalme
    Signed-off-by: Jesse Gross

    Jarno Rajahalme
     

04 Oct, 2013

1 commit

  • Over time, datapath.c and flow.c have become pretty large files.
    This patch restructures their functionality into three
    different components:

    flow.c: contains flow extract.
    flow_netlink.c: netlink flow api.
    flow_table.c: flow table api.

    This patch restructures code without changing logic.

    Signed-off-by: Pravin B Shelar
    Signed-off-by: Jesse Gross

    Pravin B Shelar