13 Oct, 2016

1 commit

  • This code is called whenever a flow key is extracted from a packet.
    The packet is as likely to be VLAN tagged as not.

    Fixes: 018c1dda5ff1 ("openvswitch: 802.1AD Flow handling, actions, vlan parsing, netlink attributes")
    Signed-off-by: Jiri Benc
    Acked-by: Pravin B Shelar
    Acked-by: Eric Garver
    Signed-off-by: David S. Miller

    Jiri Benc
     

03 Oct, 2016

1 commit

  • After commit 48d2ab609b6b ("net: mpls: Fixups for GSO"), MPLS handling in
    openvswitch was changed to have network header pointing to the start of the
    MPLS headers and inner_network_header pointing after the MPLS headers.

    However, key_extract was missed by the mentioned commit, causing incorrect
    headers to be set when an MPLS packet first enters the bridge or after it is
    recirculated.

    Fixes: 48d2ab609b6b ("net: mpls: Fixups for GSO")
    Signed-off-by: Jiri Benc
    Acked-by: Pravin B Shelar
    Signed-off-by: David S. Miller

    Jiri Benc
     

21 Sep, 2016

1 commit


19 Sep, 2016

2 commits

  • Instead of keeping flow stats per NUMA node, keep them per CPU. When using
    megaflows, the stats lock can be a scalability bottleneck.

    On an E5-2690 12-core system, usual throughput went from ~4Mpps to
    ~15Mpps when forwarding between two 40GbE ports with a single flow
    configured on the datapath.

    This has been tested on a system with possible CPUs 0-7,16-23. After
    module removal, there was no corruption on the slab cache.

    Signed-off-by: Thadeu Lima de Souza Cascardo
    Cc: pravin shelar
    Acked-by: Pravin B Shelar
    Signed-off-by: David S. Miller

    Thadeu Lima de Souza Cascardo
     
  • On a system with only node 1 as possible, all statistics are going to be
    accounted on node 0, as it will have a single writer.

    However, when getting and clearing the statistics, node 0 is not going
    to be considered, as it's not a possible node.

    Tested that statistics are not zero on a system with only node 1
    possible. Also compile-tested with CONFIG_NUMA off.
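
    A minimal userspace sketch of the read side described above, not the
    kernel code itself: node 0 is read unconditionally, and other nodes are
    read only if they are marked possible. The names demo_total_packets,
    DEMO_MAX_NODES and possible_mask are hypothetical stand-ins for the
    kernel's stats getter and possible-node map.

    ```c
    #include <assert.h>
    #include <stddef.h>
    #include <stdint.h>

    #define DEMO_MAX_NODES 8 /* hypothetical upper bound on NUMA nodes */

    /* Sum per-node packet counters. Node 0 is always included because it
     * holds the statistics when there is a single writer; any other node is
     * included only if its bit is set in possible_mask. */
    static uint64_t demo_total_packets(const uint64_t *per_node,
                                       unsigned int possible_mask)
    {
        uint64_t total;

        assert(per_node != NULL);
        total = per_node[0];
        for (int node = 1; node < DEMO_MAX_NODES; node++)
            if (possible_mask & (1u << node))
                total += per_node[node];
        return total;
    }
    ```

    With only node 1 marked possible, counters accounted on node 0 still
    show up in the total instead of being silently skipped.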

    Signed-off-by: Thadeu Lima de Souza Cascardo
    Acked-by: Pravin B Shelar
    Signed-off-by: David S. Miller

    Thadeu Lima de Souza Cascardo
     

09 Sep, 2016

1 commit

  • Add support for 802.1ad, including the ability to push and pop double
    tagged vlans. Add support for 802.1ad to netlink parsing and flow
    conversion. Double nested encap attributes are used to represent a double
    tagged vlan. The inner TPID is encoded along with ctci in nested attributes.
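
    As an illustration of what double tagging means on the wire, here is a
    small userspace sketch that walks up to two VLAN tags in an Ethernet
    frame. It assumes the standard TPIDs (0x88A8 for 802.1ad, 0x8100 for
    802.1Q); the struct layout and the demo_* names are hypothetical and do
    not mirror the kernel's sw_flow_key.

    ```c
    #include <assert.h>
    #include <stddef.h>
    #include <stdint.h>

    #define DEMO_TPID_8021Q  0x8100 /* customer tag (C-VLAN) */
    #define DEMO_TPID_8021AD 0x88A8 /* service tag (S-VLAN) */
    #define DEMO_ETH_ALEN    6
    #define DEMO_VLAN_HLEN   4

    /* Hypothetical key layout, for illustration only. */
    struct demo_vlan { uint16_t tpid; uint16_t tci; }; /* tpid 0: no tag */
    struct demo_key {
        struct demo_vlan outer;
        struct demo_vlan inner;
        uint16_t eth_type;
    };

    static uint16_t demo_be16(const uint8_t *p)
    {
        return (uint16_t)((p[0] << 8) | p[1]);
    }

    static int demo_is_vlan(uint16_t type)
    {
        return type == DEMO_TPID_8021Q || type == DEMO_TPID_8021AD;
    }

    /* Returns 0 on success, -1 if the frame is too short. */
    static int demo_parse_vlans(const uint8_t *frame, size_t len,
                                struct demo_key *key)
    {
        struct demo_vlan *slots[2];
        size_t off = 2 * DEMO_ETH_ALEN; /* skip dst and src MAC */

        assert(frame != NULL && key != NULL);
        slots[0] = &key->outer;
        slots[1] = &key->inner;
        key->outer.tpid = key->outer.tci = 0;
        key->inner.tpid = key->inner.tci = 0;

        for (int i = 0; i < 2; i++) {
            uint16_t type;

            if (len < off + 2)
                return -1;
            type = demo_be16(frame + off);
            if (!demo_is_vlan(type))
                break; /* untagged, or no further tag */
            if (len < off + DEMO_VLAN_HLEN)
                return -1;
            slots[i]->tpid = type;
            slots[i]->tci = demo_be16(frame + off + 2);
            off += DEMO_VLAN_HLEN;
        }
        if (len < off + 2)
            return -1;
        key->eth_type = demo_be16(frame + off);
        return 0;
    }
    ```

    Untagged and single-tagged frames leave the unused slots zeroed, so the
    same routine covers all three cases.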

    This is based on Thomas F Herbert's original v20 patch. I made some
    small clean ups and bug fixes.

    Signed-off-by: Thomas F Herbert
    Signed-off-by: Eric Garver
    Acked-by: Pravin B Shelar
    Signed-off-by: David S. Miller

    Eric Garver
     

07 Oct, 2015

1 commit

  • Store the tunnel protocol (AF_INET or AF_INET6) in sw_flow_key. This field
    now also indicates whether the flow contains tunnel data. Previously this
    was indicated by tun_key.u.ipv4.dst being set, but with IPv6 addresses
    in a union with IPv4 ones that no longer works.
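
    A small sketch of why an explicit protocol field is needed once the
    addresses share a union. The structures below are hypothetical and much
    simpler than sw_flow_key, and the DEMO_PROTO_* values stand in for
    AF_INET and AF_INET6.

    ```c
    #include <assert.h>
    #include <stdint.h>
    #include <string.h>

    /* Stand-ins for "no tunnel", AF_INET and AF_INET6. */
    enum { DEMO_PROTO_NONE = 0, DEMO_PROTO_IPV4 = 4, DEMO_PROTO_IPV6 = 6 };

    struct demo_tun_key {
        union {
            struct { uint32_t src, dst; } ipv4;
            struct { uint8_t src[16], dst[16]; } ipv6;
        } u;
    };

    struct demo_flow_key {
        struct demo_tun_key tun_key;
        uint8_t tun_proto; /* DEMO_PROTO_NONE when there is no tunnel data */
    };

    /* The explicit field answers the question regardless of address family. */
    static int demo_has_tunnel(const struct demo_flow_key *key)
    {
        assert(key != NULL);
        return key->tun_proto != DEMO_PROTO_NONE;
    }

    static void demo_set_ipv6_tunnel(struct demo_flow_key *key,
                                     const uint8_t src[16],
                                     const uint8_t dst[16])
    {
        memcpy(key->tun_key.u.ipv6.src, src, 16);
        memcpy(key->tun_key.u.ipv6.dst, dst, 16);
        key->tun_proto = DEMO_PROTO_IPV6;
    }
    ```

    With an IPv6 source such as 2001:db8::1, the bytes that overlap
    u.ipv4.dst are all zero, so a check on u.ipv4.dst alone would wrongly
    report that no tunnel data is present.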

    The new field was added to a hole in sw_flow_key.

    Signed-off-by: Jiri Benc
    Acked-by: Pravin B Shelar
    Acked-by: Thomas Graf
    Signed-off-by: David S. Miller

    Jiri Benc
     

01 Sep, 2015

1 commit

  • Currently the tun-info options pointer is used in a few cases to
    pass options around. But tunnel options can be accessed using the
    ip_tunnel_info_opts() API without using the pointer. This
    patch removes the redundant pointer and consistently uses
    the API.

    Signed-off-by: Pravin B Shelar
    Acked-by: Thomas Graf
    Reviewed-by: Jesse Gross
    Signed-off-by: David S. Miller

    Pravin B Shelar
     

30 Aug, 2015

3 commits

  • This structure is not used anymore.

    Signed-off-by: Pravin B Shelar
    Signed-off-by: David S. Miller

    Pravin B Shelar
     
  • When an error occurs while skipping IPv6 extension headers, retain the
    already parsed IP protocol and IPv6 addresses in the flow. Also assume that
    the packet is not a fragment in the absence of information to the contrary;
    that is, always use the frag_off value set by ipv6_skip_exthdr().

    This allows matching on the IP protocol and IPv6 addresses of packets
    with malformed extension headers.

    Signed-off-by: Simon Horman
    Signed-off-by: David S. Miller

    Simon Horman
     
  • There's currently nothing preventing directing packets with IPv6
    encapsulation data to IPv4 tunnels (and vice versa). If this happens,
    IPv6 addresses are incorrectly interpreted as IPv4 ones.

    Track whether the given ip_tunnel_key contains IPv4 or IPv6 data. Store this
    in ip_tunnel_info. Reject packets at appropriate places if they are supposed
    to be encapsulated into an incompatible protocol.

    Signed-off-by: Jiri Benc
    Acked-by: Alexei Starovoitov
    Acked-by: Thomas Graf
    Acked-by: Pravin B Shelar
    Signed-off-by: David S. Miller

    Jiri Benc
     

28 Aug, 2015

2 commits

  • Allow matching and setting the ct_label field. As with ct_mark, this is
    populated by executing the CT action. The label field may be modified by
    specifying a label and mask nested under the CT action. It is stored as
    metadata attached to the connection. Label modification occurs after
    lookup, and will only persist when the conntrack entry is committed by
    providing the COMMIT flag to the CT action. Labels are currently fixed
    to 128 bits in size.

    Signed-off-by: Joe Stringer
    Acked-by: Thomas Graf
    Acked-by: Pravin B Shelar
    Signed-off-by: David S. Miller

    Joe Stringer
     
  • Expose the kernel connection tracker via OVS. Userspace components can
    make use of the CT action to populate the connection state (ct_state)
    field for a flow. This state can be subsequently matched.

    Exposed connection states are OVS_CS_F_*:
    - NEW (0x01) - Beginning of a new connection.
    - ESTABLISHED (0x02) - Part of an existing connection.
    - RELATED (0x04) - Related to an established connection.
    - INVALID (0x20) - Could not track the connection for this packet.
    - REPLY_DIR (0x40) - This packet is in the reply direction for the flow.
    - TRACKED (0x80) - This packet has been sent through conntrack.

    When the CT action is executed by itself, it will send the packet
    through the connection tracker and populate the ct_state field with one
    or more of the connection state flags above. The CT action will always
    set the TRACKED bit.
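
    The flag handling described above can be sketched in a few lines of
    userspace C. The flag values are copied from the list in this commit
    message; the DEMO_* names and both helper functions are hypothetical and
    only illustrate the bit arithmetic, not the datapath implementation.

    ```c
    #include <assert.h>
    #include <stdint.h>

    /* Values as listed in the commit message. */
    #define DEMO_CS_F_NEW         0x01
    #define DEMO_CS_F_ESTABLISHED 0x02
    #define DEMO_CS_F_RELATED     0x04
    #define DEMO_CS_F_INVALID     0x20
    #define DEMO_CS_F_REPLY_DIR   0x40
    #define DEMO_CS_F_TRACKED     0x80

    /* Hypothetical model of the CT action: whatever the tracker reports,
     * the TRACKED bit is always set in the resulting ct_state. */
    static uint8_t demo_ct_execute(uint8_t tracker_flags)
    {
        return (uint8_t)(tracker_flags | DEMO_CS_F_TRACKED);
    }

    /* Hypothetical masked match on ct_state: only bits selected by 'mask'
     * take part in the comparison. */
    static int demo_ct_state_matches(uint8_t ct_state, uint8_t value,
                                     uint8_t mask)
    {
        return (ct_state & mask) == (value & mask);
    }
    ```

    A flow that wants "tracked, established, not new" would match with
    value TRACKED|ESTABLISHED under mask TRACKED|ESTABLISHED|NEW; a packet
    that has not been through the CT action has ct_state 0 and never matches
    a rule requiring TRACKED.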

    When the COMMIT flag is passed to the conntrack action, this specifies
    that information about the connection should be stored. This allows
    subsequent packets for the same (or related) connections to be
    correlated with this connection. Sending subsequent packets for the
    connection through conntrack allows the connection tracker to consider
    the packets as ESTABLISHED, RELATED, and/or REPLY_DIR.

    The CT action may optionally take a zone to track the flow within. This
    allows connections with the same 5-tuple to be kept logically separate
    from connections in other zones. If the zone is specified, then the
    "ct_zone" match field will be subsequently populated with the zone id.

    IP fragments are handled by transparently assembling them as part of the
    CT action. The maximum received unit (MRU) size is tracked so that
    refragmentation can occur during output.

    IP frag handling contributed by Andy Zhou.

    Based on original design by Justin Pettit.

    Signed-off-by: Joe Stringer
    Signed-off-by: Justin Pettit
    Signed-off-by: Andy Zhou
    Acked-by: Thomas Graf
    Acked-by: Pravin B Shelar
    Signed-off-by: David S. Miller

    Joe Stringer
     

22 Jul, 2015

1 commit

  • Rename the tunnel metadata data structures currently internal to
    OVS and make them generic for use by all IP tunnels.

    Both structures are kernel internal and will stay that way. Their
    members are exposed to user space through individual Netlink
    attributes by OVS. It will therefore be possible to extend/modify
    these structures without affecting user ABI.

    Signed-off-by: Thomas Graf
    Signed-off-by: David S. Miller

    Thomas Graf
     

06 May, 2015

1 commit


15 Apr, 2015

1 commit

  • NOTE: this is not about __GFP_THISNODE, this is only about GFP_THISNODE.

    GFP_THISNODE is a secret combination of gfp bits that have different
    behavior than expected. It is a combination of __GFP_THISNODE,
    __GFP_NORETRY, and __GFP_NOWARN and is special-cased in the page
    allocator slowpath to fail without trying reclaim even though it may be
    used in combination with __GFP_WAIT.

    An example of the problem this creates: commit e97ca8e5b864 ("mm: fix
    GFP_THISNODE callers and clarify") fixed up many users of GFP_THISNODE
    that really just wanted __GFP_THISNODE. The problem doesn't end there,
    however, because even that fix was a no-op for alloc_misplaced_dst_page(),
    which also sets __GFP_NORETRY and __GFP_NOWARN, and
    migrate_misplaced_transhuge_page(), where __GFP_NORETRY and __GFP_NOWAIT
    are set in GFP_TRANSHUGE. Converting GFP_THISNODE to __GFP_THISNODE is a
    no-op in these cases since the page allocator special-cases
    __GFP_THISNODE && __GFP_NORETRY && __GFP_NOWARN.

    It's time to just remove GFP_THISNODE entirely. We leave __GFP_THISNODE
    to restrict an allocation to a local node, but remove GFP_THISNODE and
    its obscurity. Instead, we require that a caller clear __GFP_WAIT if it
    wants to avoid reclaim.

    This allows the aforementioned functions to actually reclaim as they
    should. It also enables any future callers that want to do
    __GFP_THISNODE but also __GFP_NORETRY && __GFP_NOWARN to reclaim. The
    rule is simple: if you don't want to reclaim, then don't set __GFP_WAIT.

    Aside: ovs_flow_stats_update() really wants to avoid reclaim as well, so
    it is unchanged.

    Signed-off-by: David Rientjes
    Acked-by: Vlastimil Babka
    Cc: Christoph Lameter
    Acked-by: Pekka Enberg
    Cc: Joonsoo Kim
    Acked-by: Johannes Weiner
    Cc: Mel Gorman
    Cc: Pravin Shelar
    Cc: Jarno Rajahalme
    Cc: Li Zefan
    Cc: Greg Thelen
    Cc: Tejun Heo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     

12 Feb, 2015

1 commit

  • The userspace packet execute command passes down the flow key for the
    given packet, but userspace may omit parameters whose value is zero.
    Therefore the kernel needs to initialize the key metadata to zero.

    Fixes: 0714812134 ("openvswitch: Eliminate memset() from flow_extract.")
    Signed-off-by: Pravin B Shelar
    Signed-off-by: David S. Miller

    Pravin B Shelar
     

15 Jan, 2015

1 commit

  • Also factors out Geneve validation code into a new separate function
    validate_and_copy_geneve_opts().

    A subsequent patch will introduce VXLAN options. Rename the existing
    GENEVE_TUN_OPTS() to reflect its extended purpose of carrying generic
    tunnel metadata options.

    Signed-off-by: Thomas Graf
    Signed-off-by: David S. Miller

    Thomas Graf
     

14 Jan, 2015

1 commit


03 Jan, 2015

1 commit

  • Until now, when VLAN acceleration was in use, the bytes of the VLAN header
    were not included in port or flow byte counters. They were however
    included when VLAN acceleration was not used. This commit corrects the
    inconsistency, by always including the VLAN header in byte counters.
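
    The accounting rule can be sketched as follows. This is a userspace
    illustration with a hypothetical packet descriptor; it assumes that with
    VLAN acceleration the 4-byte tag is carried out of band and is therefore
    missing from the packet length.

    ```c
    #include <assert.h>
    #include <stdbool.h>
    #include <stddef.h>

    #define DEMO_VLAN_HLEN 4 /* size of one 802.1Q tag in bytes */

    /* Hypothetical packet descriptor: 'len' excludes a tag that is held
     * out of band by VLAN acceleration. */
    struct demo_pkt {
        size_t len;
        bool accel_tag_present;
    };

    /* Bytes to add to port and flow counters: always include the VLAN
     * header, whether it is in the packet data or held out of band. */
    static size_t demo_counted_bytes(const struct demo_pkt *pkt)
    {
        assert(pkt != NULL);
        return pkt->len + (pkt->accel_tag_present ? DEMO_VLAN_HLEN : 0);
    }
    ```

    The same tagged frame then yields the same byte count on both paths.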

    Previous discussion at
    http://openvswitch.org/pipermail/dev/2014-December/049521.html

    Reported-by: Motonori Shindo
    Signed-off-by: Ben Pfaff
    Reviewed-by: Flavio Leitner
    Acked-by: Pravin B Shelar
    Signed-off-by: David S. Miller

    Ben Pfaff
     

10 Nov, 2014

2 commits


06 Nov, 2014

1 commit

  • Allow datapath to recognize and extract MPLS labels into flow keys
    and execute actions which push, pop, and set labels on packets.

    Based heavily on work by Leo Alterman, Ravi K, Isaku Yamahata and Joe Stringer.

    Cc: Ravi K
    Cc: Leo Alterman
    Cc: Isaku Yamahata
    Cc: Joe Stringer
    Signed-off-by: Simon Horman
    Signed-off-by: Jesse Gross
    Signed-off-by: Pravin B Shelar

    Simon Horman
     

18 Oct, 2014

2 commits

  • This patch adds the missing memset() calls required to initialize
    flow key members. For example, for an IP flow, ip.frag must be
    initialized in all cases.

    Found by inspection.

    This bug was introduced by commit 0714812134d7dcadeb7ecfbfeb18788aa7e1eaac
    ("openvswitch: Eliminate memset() from flow_extract").

    Signed-off-by: Pravin B Shelar
    Signed-off-by: David S. Miller

    Pravin B Shelar
     
  • pskb_may_pull(), called by arphdr_ok(), can change skb->data, so set the
    arp pointer after arphdr_ok() to avoid using freed memory.

    Fixes: 0714812134d7d ("openvswitch: Eliminate memset() from flow_extract.")
    Cc: Jesse Gross
    Cc: Eric Dumazet
    Signed-off-by: Li RongQing
    Acked-by: Jesse Gross
    Signed-off-by: David S. Miller

    Li RongQing
     

06 Oct, 2014

3 commits

  • The Openvswitch implementation is completely agnostic to the options
    that are in use and can handle newly defined options without
    further work. It does this by simply matching on a byte array
    of options and allowing userspace to set up flows on this array.

    Signed-off-by: Jesse Gross
    Signed-off-by: Ansis Atteka
    Signed-off-by: Andy Zhou
    Acked-by: Thomas Graf
    Acked-by: Pravin B Shelar
    Signed-off-by: David S. Miller

    Jesse Gross
     
  • Currently, the flow information that is matched for tunnels and
    the tunnel data passed around with packets is the same. However,
    as additional information is added this is not necessarily desirable,
    as in the case of pointers.

    This adds a new structure for tunnel metadata which currently contains
    only the existing struct. This change is purely internal to the kernel
    since the current OVS_KEY_ATTR_IPV4_TUNNEL is simply a compressed version
    of OVS_KEY_ATTR_TUNNEL that is translated at flow setup.

    Signed-off-by: Jesse Gross
    Signed-off-by: Andy Zhou
    Acked-by: Pravin B Shelar
    Signed-off-by: David S. Miller

    Jesse Gross
     
  • As new protocols are added, the size of the flow key tends to
    increase although few protocols care about all of the fields. In
    order to optimize this for hashing and matching, OVS uses a variable
    length portion of the key. However, when fields are extracted from
    the packet we must still zero out the entire key.

    This is no longer necessary now that OVS implements masking. Any
    fields (or holes in the structure) which are not part of a given
    protocol will be by definition not part of the mask and zeroed out
    during lookup. Furthermore, since masking already uses variable
    length keys this zeroing operation automatically benefits as well.

    In principle, the only thing that needs to be done at this point
    is to remove the memset() at the beginning of flow extraction. However, some
    fields assume that they are initialized to zero, which now must be
    done explicitly. In addition, in the event of an error we must also
    zero out corresponding fields to signal that there is no valid data
    present. These increase the total amount of code but very little of
    it is executed in non-error situations.
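
    The reason unmasked bytes no longer need to be zeroed can be seen in a
    masked comparison. The sketch below is a byte-wise userspace
    illustration with hypothetical names; it is not the datapath's lookup
    code.

    ```c
    #include <assert.h>
    #include <stddef.h>
    #include <stdint.h>

    /* Compare packet key and flow key over the byte range [start, end),
     * looking only at bits selected by 'mask'. Bytes with a zero mask can
     * hold anything in the packet key without affecting the result. */
    static int demo_masked_equal(const uint8_t *pkt, const uint8_t *flow,
                                 const uint8_t *mask,
                                 size_t start, size_t end)
    {
        uint8_t diff = 0;

        assert(pkt != NULL && flow != NULL && mask != NULL);
        for (size_t i = start; i < end; i++)
            diff |= (uint8_t)((pkt[i] ^ flow[i]) & mask[i]);
        return diff == 0;
    }
    ```

    Restricting the loop to [start, end) mirrors the variable-length key
    idea: fields outside the range relevant to a protocol are never read.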

    Removing the memset() reduces the profile of ovs_flow_extract()
    from 0.64% to 0.56% when tested with large packets on a 10G link.

    Suggested-by: Pravin Shelar
    Signed-off-by: Jesse Gross
    Signed-off-by: Andy Zhou
    Acked-by: Pravin B Shelar
    Signed-off-by: David S. Miller

    Jesse Gross
     

16 Sep, 2014

3 commits

  • The recirc action allows a packet to reenter openvswitch processing.
    Currently openvswitch looks up the flow for a received packet and
    executes a set of actions on that packet. With the help of the recirc
    action we can process or modify the packet and recirculate it back into
    openvswitch for another pass.

    The OVS hash action calculates a 5-tuple hash and stores it in the
    flow key. This can be used along with recirculation for distributing
    packets among different ports for bond devices.

    For example, OVS bonding can use the following flows:

      Match on: bond flow; Action: hash, recirc(id)
      Match on: recirc-id == id and hash lower bits == a; Action: output port_bond_a
    Signed-off-by: Andy Zhou
    Acked-by: Jesse Gross
    Signed-off-by: Pravin B Shelar

    Andy Zhou
     
  • Currently tun_key is used for passing tunnel information on both the
    ingress and egress paths, which causes confusion. This patch removes
    its use on the ingress path, making it an egress-only parameter.

    Signed-off-by: Pravin B Shelar
    Acked-by: Andy Zhou

    Pravin B Shelar
     
  • OVS flow extract is called on the packet receive and packet
    execute code paths. This patch defines a separate API
    for extracting the flow key in the packet execute code path.

    Signed-off-by: Pravin B Shelar
    Acked-by: Andy Zhou

    Pravin B Shelar
     

23 Aug, 2014

1 commit


30 Jun, 2014

1 commit

  • Flow statistics need to take into account the TCP flags from the packet
    currently being processed (in 'key'), not the TCP flags matched by the
    flow found in the kernel flow table (in 'flow').

    This bug made the Open vSwitch userspace fin_timeout action have no effect
    in many cases.

    This bug was introduced by commit 88d73f6c411ac2f0578 (openvswitch: Use
    TCP flags in the flow key for stats.)
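
    A minimal sketch of the corrected behaviour, using hypothetical
    structures: the stats update takes the TCP flags from the key extracted
    from the current packet, so a FIN is recorded even when the matched
    flow wildcards the TCP flags.

    ```c
    #include <assert.h>
    #include <stddef.h>
    #include <stdint.h>

    #define DEMO_TCP_FIN 0x01
    #define DEMO_TCP_ACK 0x10

    /* Hypothetical, heavily simplified key and stats structures. */
    struct demo_key { uint16_t tcp_flags; };
    struct demo_stats { uint64_t packets, bytes; uint16_t tcp_flags; };

    /* Accumulate stats using the flags of the packet being processed
     * ('pkt_key'), not the flags stored in the matched flow's key. */
    static void demo_stats_update(struct demo_stats *stats,
                                  const struct demo_key *pkt_key, size_t len)
    {
        assert(stats != NULL && pkt_key != NULL);
        stats->packets++;
        stats->bytes += len;
        stats->tcp_flags |= pkt_key->tcp_flags;
    }
    ```

    Had the flags been taken from a flow key whose TCP flags are masked to
    zero, the accumulated tcp_flags would stay zero and a FIN could never be
    reported to userspace.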

    Reported-by: Len Gao
    Signed-off-by: Ben Pfaff
    Acked-by: Jarno Rajahalme
    Acked-by: Jesse Gross
    Signed-off-by: Pravin B Shelar

    Ben Pfaff
     

23 May, 2014

3 commits

  • For ovs_flow_stats_get() using ovsl_dereference() was wrong, since
    flow dumps call this with RCU read lock.

    ovs_flow_stats_clear() is always called with ovs_mutex, so can use
    ovsl_dereference().

    Also, make the ovs_flow_stats_get() 'flow' argument const to make
    later patches cleaner.

    Signed-off-by: Jarno Rajahalme
    Signed-off-by: Pravin B Shelar

    Jarno Rajahalme
     
  • Remove unnecessary locking from functions that are always called with
    appropriate locking.

    Signed-off-by: Jarno Rajahalme
    Signed-off-by: Thomas Graf
    Signed-off-by: Pravin B Shelar

    Jarno Rajahalme
     
  • Minimize padding in sw_flow_key and move 'tp' to the main struct.
    These changes simplify code when accessing the transport port numbers
    and the tcp flags, and make the sw_flow_key 8 bytes smaller on 64-bit
    systems (128->120 bytes). These changes also make the keys for IPv4
    packets fit in one cache line.

    There is a valid concern for safety of packing the struct
    ovs_key_ipv4_tunnel, as it would be possible to take the address of
    the tun_id member as a __be64 * which could result in unaligned access
    in some systems. However:

    - sw_flow_key itself is 64-bit aligned, so the tun_id within is
      always 64-bit aligned.
    - We never make arrays of ovs_key_ipv4_tunnel (which would force
      every second tun_key to be misaligned).
    - We never take the address of the tun_id into a __be64 *.
    - Wherever we use struct ovs_key_ipv4_tunnel outside the
      sw_flow_key, it is on the stack (in tunnel input functions), where
      the compiler has full control of the alignment.

    Signed-off-by: Jarno Rajahalme
    Signed-off-by: Pravin B Shelar

    Jarno Rajahalme
     

17 May, 2014

4 commits

  • We already extract the TCP flags for the key, might as well use that
    for stats.

    Signed-off-by: Jarno Rajahalme
    Acked-by: Pravin B Shelar
    Signed-off-by: Jesse Gross

    Jarno Rajahalme
     
  • Keep kernel flow stats for each NUMA node rather than each (logical)
    CPU. This avoids using the per-CPU allocator and removes most of the
    kernel-side OVS locking overhead otherwise on the top of perf reports
    and allows OVS to scale better with higher number of threads.

    With 9 handlers and 4 revalidators netperf TCP_CRR test flow setup
    rate doubles on a server with two hyper-threaded physical CPUs (16
    logical cores each) compared to the current OVS master. Tested with
    non-trivial flow table with a TCP port match rule forcing all new
    connections with unique port numbers to OVS userspace. The IP
    addresses are still wildcarded, so the kernel flows are not considered
    as exact match 5-tuple flows. This type of flows can be expected to
    appear in large numbers as the result of more effective wildcarding
    made possible by improvements in OVS userspace flow classifier.

    Perf results for this test (master):

    Events: 305K cycles
    + 8.43% ovs-vswitchd [kernel.kallsyms] [k] mutex_spin_on_owner
    + 5.64% ovs-vswitchd [kernel.kallsyms] [k] __ticket_spin_lock
    + 4.75% ovs-vswitchd ovs-vswitchd [.] find_match_wc
    + 3.32% ovs-vswitchd libpthread-2.15.so [.] pthread_mutex_lock
    + 2.61% ovs-vswitchd [kernel.kallsyms] [k] pcpu_alloc_area
    + 2.19% ovs-vswitchd ovs-vswitchd [.] flow_hash_in_minimask_range
    + 2.03% swapper [kernel.kallsyms] [k] intel_idle
    + 1.84% ovs-vswitchd libpthread-2.15.so [.] pthread_mutex_unlock
    + 1.64% ovs-vswitchd ovs-vswitchd [.] classifier_lookup
    + 1.58% ovs-vswitchd libc-2.15.so [.] 0x7f4e6
    + 1.07% ovs-vswitchd [kernel.kallsyms] [k] memset
    + 1.03% netperf [kernel.kallsyms] [k] __ticket_spin_lock
    + 0.92% swapper [kernel.kallsyms] [k] __ticket_spin_lock
    ...

    And after this patch:

    Events: 356K cycles
    + 6.85% ovs-vswitchd ovs-vswitchd [.] find_match_wc
    + 4.63% ovs-vswitchd libpthread-2.15.so [.] pthread_mutex_lock
    + 3.06% ovs-vswitchd [kernel.kallsyms] [k] __ticket_spin_lock
    + 2.81% ovs-vswitchd ovs-vswitchd [.] flow_hash_in_minimask_range
    + 2.51% ovs-vswitchd libpthread-2.15.so [.] pthread_mutex_unlock
    + 2.27% ovs-vswitchd ovs-vswitchd [.] classifier_lookup
    + 1.84% ovs-vswitchd libc-2.15.so [.] 0x15d30f
    + 1.74% ovs-vswitchd [kernel.kallsyms] [k] mutex_spin_on_owner
    + 1.47% swapper [kernel.kallsyms] [k] intel_idle
    + 1.34% ovs-vswitchd ovs-vswitchd [.] flow_hash_in_minimask
    + 1.33% ovs-vswitchd ovs-vswitchd [.] rule_actions_unref
    + 1.16% ovs-vswitchd ovs-vswitchd [.] hindex_node_with_hash
    + 1.16% ovs-vswitchd ovs-vswitchd [.] do_xlate_actions
    + 1.09% ovs-vswitchd ovs-vswitchd [.] ofproto_rule_ref
    + 1.01% netperf [kernel.kallsyms] [k] __ticket_spin_lock
    ...

    There is a small increase in kernel spinlock overhead due to the same
    spinlock being shared between multiple cores of the same physical CPU,
    but that is barely visible in the netperf TCP_CRR test performance
    (maybe ~1% performance drop, hard to tell exactly due to variance in
    the test results), when testing for kernel module throughput (with no
    userspace activity, handful of kernel flows).

    On flow setup, a single stats instance is allocated (for the NUMA node
    0). As CPUs from multiple NUMA nodes start updating stats, new
    NUMA-node specific stats instances are allocated. This allocation on
    the packet processing code path is made to never block or look for
    emergency memory pools, minimizing the allocation latency. If the
    allocation fails, the existing preallocated stats instance is used.
    Also, if only CPUs from one NUMA-node are updating the preallocated
    stats instance, no additional stats instances are allocated. This
    eliminates the need to pre-allocate stats instances that will not be
    used, also relieving the stats reader from the burden of reading stats
    that are never used.
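
    The allocation scheme in the last paragraph can be sketched in
    userspace as follows. This is a single-threaded illustration with
    hypothetical names: calloc() stands in for the kernel's non-blocking
    allocation, and the per-instance locking is omitted.

    ```c
    #include <assert.h>
    #include <stdint.h>
    #include <stdlib.h>

    #define DEMO_MAX_NODES 4 /* hypothetical number of NUMA nodes */

    struct demo_stats { uint64_t packets, bytes; };
    struct demo_flow { struct demo_stats *stats[DEMO_MAX_NODES]; };

    /* Flow setup: preallocate a single instance, for node 0. */
    static int demo_flow_init(struct demo_flow *f)
    {
        for (int i = 0; i < DEMO_MAX_NODES; i++)
            f->stats[i] = NULL;
        f->stats[0] = calloc(1, sizeof(*f->stats[0]));
        return f->stats[0] ? 0 : -1;
    }

    /* Packet path: allocate the node's instance on first use and fall
     * back to the preallocated node 0 instance if allocation fails. */
    static void demo_stats_update(struct demo_flow *f, int node, uint64_t len)
    {
        struct demo_stats *s;

        assert(node >= 0 && node < DEMO_MAX_NODES);
        s = f->stats[node];
        if (!s) {
            s = calloc(1, sizeof(*s));
            if (s)
                f->stats[node] = s;
            else
                s = f->stats[0];
        }
        s->packets++;
        s->bytes += len;
    }

    /* Reader: only instances that were actually allocated are visited. */
    static struct demo_stats demo_stats_get(const struct demo_flow *f)
    {
        struct demo_stats total = {0, 0};

        for (int i = 0; i < DEMO_MAX_NODES; i++) {
            if (f->stats[i]) {
                total.packets += f->stats[i]->packets;
                total.bytes += f->stats[i]->bytes;
            }
        }
        return total;
    }

    static void demo_flow_destroy(struct demo_flow *f)
    {
        for (int i = 0; i < DEMO_MAX_NODES; i++) {
            free(f->stats[i]);
            f->stats[i] = NULL;
        }
    }
    ```

    If all updates come from one node, only the preallocated instance ever
    exists, and the reader skips the unused slots.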

    Signed-off-by: Jarno Rajahalme
    Acked-by: Pravin B Shelar
    Signed-off-by: Jesse Gross

    Jarno Rajahalme
     
  • The 5-tuple optimization becomes unnecessary with a later per-NUMA
    node stats patch. Remove it first to make the changes easier to
    grasp.

    Signed-off-by: Jarno Rajahalme
    Signed-off-by: Jesse Gross

    Jarno Rajahalme
     
  • It's slightly smaller/faster for some architectures.

    Signed-off-by: Joe Perches
    Signed-off-by: Jesse Gross

    Joe Perches