19 May, 2018

1 commit

  • [ Upstream commit 72f17baf2352ded6a1d3f4bb2d15da8c678cd2cb ]

    If an OVS_ATTR_NESTED attribute type is found while walking
    through netlink attributes, we call nlattr_set() recursively
    passing the length table for the following nested attributes, if
    different from the current one.

    However, once we're done with those sub-nested attributes, we
    should continue walking through attributes using the current
    table, instead of using the one related to the sub-nested
    attributes.

    For example, given this sequence:

    1 OVS_KEY_ATTR_PRIORITY
    2 OVS_KEY_ATTR_TUNNEL
    3 OVS_TUNNEL_KEY_ATTR_ID
    4 OVS_TUNNEL_KEY_ATTR_IPV4_SRC
    5 OVS_TUNNEL_KEY_ATTR_IPV4_DST
    6 OVS_TUNNEL_KEY_ATTR_TTL
    7 OVS_TUNNEL_KEY_ATTR_TP_SRC
    8 OVS_TUNNEL_KEY_ATTR_TP_DST
    9 OVS_KEY_ATTR_IN_PORT
    10 OVS_KEY_ATTR_SKB_MARK
    11 OVS_KEY_ATTR_MPLS

    we switch to the 'ovs_tunnel_key_lens' table on attribute #3,
    and we don't switch back to 'ovs_key_lens' while setting
    attributes #9 to #11 in the sequence. As OVS_KEY_ATTR_MPLS
    evaluates to 21, and the array size of 'ovs_tunnel_key_lens' is
    15, we also get this kind of KASan splat while accessing the
    wrong table:

    [ 7654.586496] ==================================================================
    [ 7654.594573] BUG: KASAN: global-out-of-bounds in nlattr_set+0x164/0xde9 [openvswitch]
    [ 7654.603214] Read of size 4 at addr ffffffffc169ecf0 by task handler29/87430
    [ 7654.610983]
    [ 7654.612644] CPU: 21 PID: 87430 Comm: handler29 Kdump: loaded Not tainted 3.10.0-866.el7.test.x86_64 #1
    [ 7654.623030] Hardware name: Dell Inc. PowerEdge R730/072T6D, BIOS 2.1.7 06/16/2016
    [ 7654.631379] Call Trace:
    [ 7654.634108] [] dump_stack+0x19/0x1b
    [ 7654.639843] [] print_address_description+0x33/0x290
    [ 7654.647129] [] ? nlattr_set+0x164/0xde9 [openvswitch]
    [ 7654.654607] [] kasan_report.part.3+0x242/0x330
    [ 7654.661406] [] __asan_report_load4_noabort+0x34/0x40
    [ 7654.668789] [] nlattr_set+0x164/0xde9 [openvswitch]
    [ 7654.676076] [] ovs_nla_get_match+0x10c8/0x1900 [openvswitch]
    [ 7654.684234] [] ? genl_rcv+0x28/0x40
    [ 7654.689968] [] ? netlink_unicast+0x3f3/0x590
    [ 7654.696574] [] ? ovs_nla_put_tunnel_info+0xb0/0xb0 [openvswitch]
    [ 7654.705122] [] ? unwind_get_return_address+0xb0/0xb0
    [ 7654.712503] [] ? system_call_fastpath+0x1c/0x21
    [ 7654.719401] [] ? update_stack_state+0x229/0x370
    [ 7654.726298] [] ? update_stack_state+0x229/0x370
    [ 7654.733195] [] ? kasan_unpoison_shadow+0x35/0x50
    [ 7654.740187] [] ? kasan_kmalloc+0xaa/0xe0
    [ 7654.746406] [] ? kasan_slab_alloc+0x12/0x20
    [ 7654.752914] [] ? memset+0x31/0x40
    [ 7654.758456] [] ovs_flow_cmd_new+0x2b2/0xf00 [openvswitch]

    [snip]

    [ 7655.132484] The buggy address belongs to the variable:
    [ 7655.138226] ovs_tunnel_key_lens+0xf0/0xffffffffffffd400 [openvswitch]
    [ 7655.145507]
    [ 7655.147166] Memory state around the buggy address:
    [ 7655.152514] ffffffffc169eb80: 00 00 00 00 00 00 00 00 00 00 fa fa fa fa fa fa
    [ 7655.160585] ffffffffc169ec00: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
    [ 7655.168644] >ffffffffc169ec80: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 fa fa
    [ 7655.176701] ^
    [ 7655.184372] ffffffffc169ed00: fa fa fa fa 00 00 00 00 fa fa fa fa 00 00 00 05
    [ 7655.192431] ffffffffc169ed80: fa fa fa fa 00 00 00 00 00 00 00 00 00 00 00 00
    [ 7655.200490] ==================================================================
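The table-switching mistake described in this commit can be modelled in user space. The sketch below is not the kernel's nlattr_set(): the `tbl` and `attr` structures, the `walk()` helper, the outer table size of 32 and every attribute number except OVS_KEY_ATTR_MPLS = 21 are assumptions made for illustration. Only the tunnel table size of 15 and the value 21 come from the message above.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified length table: one attribute type may carry nested attributes. */
struct tbl {
    int size;               /* number of valid attribute types */
    int nested_type;        /* type with a nested payload, or -1 */
    const struct tbl *next; /* table describing that nested payload */
};

struct attr {
    int type;
    const struct attr *kids; /* nested attributes, if any */
    int nkids;
};

/* Hypothetical tables: only the tunnel table size (15) is from the report. */
static const struct tbl tunnel_tbl = { 15, -1, NULL };
static const struct tbl key_tbl = { 32, 16, &tunnel_tbl };

/*
 * Walk the attributes and count lookups that would index past the end of
 * the table in use. With keep_switched != 0 the walker keeps using the
 * nested table after returning from the nested attribute (the bug).
 */
static int walk(const struct attr *a, int n, const struct tbl *tbl,
                int keep_switched)
{
    int oob = 0;

    assert(tbl != NULL);
    for (int i = 0; i < n; i++) {
        if (a[i].type >= tbl->size) {
            oob++;
            continue;
        }
        if (a[i].type == tbl->nested_type) {
            const struct tbl *sub = tbl->next ? tbl->next : tbl;

            oob += walk(a[i].kids, a[i].nkids, sub, keep_switched);
            if (keep_switched)
                tbl = sub;
        }
    }
    return oob;
}

/* Six tunnel sub-attributes with made-up type numbers below 15. */
static const struct attr tunnel_kids[] = {
    { 0, NULL, 0 }, { 1, NULL, 0 }, { 2, NULL, 0 },
    { 4, NULL, 0 }, { 9, NULL, 0 }, { 10, NULL, 0 },
};

/* priority, tunnel (nested), in_port, skb_mark, mpls (= 21) */
static const struct attr seq[] = {
    { 2, NULL, 0 }, { 16, tunnel_kids, 6 }, { 3, NULL, 0 },
    { 15, NULL, 0 }, { 21, NULL, 0 },
};
```

Calling `walk(seq, 5, &key_tbl, 1)` keeps the tunnel table for the trailing key attributes and counts out-of-range lookups, while `walk(seq, 5, &key_tbl, 0)` returns to the outer table after the nested attribute and counts none.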

    Reported-by: Hangbin Liu
    Fixes: 982b52700482 ("openvswitch: Fix mask generation for nested attributes.")
    Signed-off-by: Stefano Brivio
    Reviewed-by: Sabrina Dubroca
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Stefano Brivio
     

26 Apr, 2018

1 commit

  • [ Upstream commit 9382fe71c0058465e942a633869629929102843d ]

    IPv4 and IPv6 packets may arrive with lower-layer padding that is not
    included in the L3 length. For example, a short IPv4 packet may have
    up to 6 bytes of padding following the IP payload when received on an
    Ethernet device with a minimum packet length of 64 bytes.

    Higher-layer processing functions in netfilter (e.g. nf_ip_checksum(),
    and help() in nf_conntrack_ftp) assume skb->len reflects the length of
    the L3 header and payload, rather than referring back to
    ip_hdr->tot_len or ipv6_hdr->payload_len, and get confused by
    lower-layer padding.

    In the normal IPv4 receive path, ip_rcv() trims the packet to
    ip_hdr->tot_len before invoking netfilter hooks. In the IPv6 receive
    path, ip6_rcv() does the same using ipv6_hdr->payload_len. Similarly
    in the br_netfilter receive path, br_validate_ipv4() and
    br_validate_ipv6() trim the packet to the L3 length before invoking
    netfilter hooks.

    Currently in the OVS conntrack receive path, ovs_ct_execute() pulls
    the skb to the L3 header but does not trim it to the L3 length before
    calling nf_conntrack_in(NF_INET_PRE_ROUTING). When
    nf_conntrack_proto_tcp encounters a packet with lower-layer padding,
    nf_ip_checksum() fails causing a "nf_ct_tcp: bad TCP checksum" log
    message. While extra zero bytes don't affect the checksum, the length
    in the IP pseudoheader does. That length is based on skb->len, and
    without trimming, it doesn't match the length the sender used when
    computing the checksum.

    In ovs_ct_execute(), trim the skb to the L3 length before higher-layer
    processing.
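To see why the untrimmed length breaks the TCP checksum even though the padding bytes are zero, here is a small user-space sketch of the IPv4 pseudo-header checksum. It is not the netfilter code; the helper names, the addresses and the ports are made up for illustration, and the 6 bytes of padding follow the Ethernet example given earlier in the message.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Add a byte buffer to a running sum as big-endian 16-bit words. */
static uint32_t sum16(const uint8_t *p, size_t len, uint32_t sum)
{
    size_t i;

    for (i = 0; i + 1 < len; i += 2)
        sum += (uint32_t)p[i] << 8 | p[i + 1];
    if (len & 1)
        sum += (uint32_t)p[len - 1] << 8;
    return sum;
}

/* Fold carries back in and return the one's complement. */
static uint16_t fold(uint32_t sum)
{
    while (sum >> 16)
        sum = (sum & 0xffff) + (sum >> 16);
    return (uint16_t)~sum;
}

/* TCP checksum over the IPv4 pseudo-header plus seg_len bytes of segment. */
static uint16_t tcp_check(const uint8_t saddr[4], const uint8_t daddr[4],
                          const uint8_t *seg, uint16_t seg_len)
{
    uint32_t sum = 0;

    sum = sum16(saddr, 4, sum);
    sum = sum16(daddr, 4, sum);
    sum += 6;       /* zero byte followed by protocol number (TCP) */
    sum += seg_len; /* this is where the assumed length enters */
    sum = sum16(seg, seg_len, sum);
    return fold(sum);
}

/*
 * Build a 20-byte TCP header followed by 6 zero padding bytes, checksum it
 * as the sender would (length 20), then verify it with the given length.
 * A result of 0 means the checksum verifies.
 */
static uint16_t verify_with_len(uint16_t seg_len)
{
    static const uint8_t saddr[4] = { 192, 0, 2, 1 };
    static const uint8_t daddr[4] = { 192, 0, 2, 2 };
    uint8_t buf[26] = { 0 };
    uint16_t c;

    assert(seg_len >= 20 && seg_len <= sizeof(buf));
    buf[0] = 0x30; buf[1] = 0x39; /* source port 12345 */
    buf[2] = 0x00; buf[3] = 0x50; /* destination port 80 */
    buf[12] = 0x50;               /* data offset: 5 words */
    buf[13] = 0x02;               /* SYN */
    c = tcp_check(saddr, daddr, buf, 20);
    buf[16] = (uint8_t)(c >> 8);
    buf[17] = (uint8_t)(c & 0xff);
    return tcp_check(saddr, daddr, buf, seg_len);
}
```

`verify_with_len(20)` models the trimmed packet and verifies, whereas `verify_with_len(26)` models a length that still includes the padding and does not.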

    Signed-off-by: Ed Swierk
    Acked-by: Pravin B Shelar
    Signed-off-by: David S. Miller
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Ed Swierk
     

04 Feb, 2018

1 commit

  • [ Upstream commit 67c8d22a73128ff910e2287567132530abcf5b71 ]

    If we want to add a datapath flow that has more than 500 vxlan output
    actions, we get the following error reports:
    openvswitch: netlink: Flow action size 32832 bytes exceeds max
    openvswitch: netlink: Flow action size 32832 bytes exceeds max
    openvswitch: netlink: Actions may not be safe on all matching packets
    ... ...

    It seems that we can simply enlarge the MAX_ACTIONS_BUFSIZE to fix it, but
    this is not the root cause. For example, for a vxlan output action, we need
    about 60 bytes for the nlattr, but after it is converted to the flow
    action, it only occupies 24 bytes. This means that we can still support
    more than 1000 vxlan output actions for a single datapath flow under
    the current 32k max limitation.

    So even if the nla_len(attr) is larger than MAX_ACTIONS_BUFSIZE, we
    shouldn't report EINVAL but carry on, as the size check can be
    done by reserve_sfa_size().
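The arithmetic behind that claim can be checked with a few lines of C. This is only a back-of-the-envelope sketch: the helper name is hypothetical, the per-action sizes (about 60 bytes as nlattr, 24 bytes once converted) are the approximate figures quoted above, and the 32k limit is taken to mean 32 * 1024 bytes.

```c
#include <assert.h>

/* How many fixed-size actions fit under a byte limit. */
static unsigned int actions_that_fit(unsigned int limit,
                                     unsigned int per_action)
{
    assert(per_action > 0);
    return limit / per_action;
}
```

With these figures, `actions_that_fit(32 * 1024, 60)` gives the count that fits when the limit is applied to the netlink encoding, and `actions_that_fit(32 * 1024, 24)` the count that fits once the actions are in their converted form.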

    Signed-off-by: zhangliping
    Acked-by: Pravin B Shelar
    Signed-off-by: David S. Miller
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    zhangliping
     

03 Jan, 2018

1 commit

  • [ Upstream commit c48e74736fccf25fb32bb015426359e1c2016e3b ]

    skb_vlan_pop() expects skb->protocol to be a valid TPID for double
    tagged frames. So set skb->protocol to the TPID and let skb_vlan_pop()
    shift the true ethertype into position for us.

    Fixes: 5108bbaddc37 ("openvswitch: add processing of L3 packets")
    Signed-off-by: Eric Garver
    Reviewed-by: Jiri Benc
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Eric Garver
     

17 Dec, 2017

2 commits

  • [ Upstream commit 2734166e89639c973c6e125ac8bcfc2d9db72b70 ]

    gso_type is being used in binary AND operations together with SKB_GSO_UDP.
    The issue is that variable gso_type is of type unsigned short and
    SKB_GSO_UDP expands to more than 16 bits:

    SKB_GSO_UDP = 1 << 16

    This makes any binary AND operation between gso_type and SKB_GSO_UDP
    always evaluate to zero, hence making some code unreachable and likely
    causing undesired behavior.

    Fix this by changing the data type of variable gso_type to unsigned int.
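The truncation can be reproduced outside the kernel. In the sketch below, `GSO_UDP` mirrors the value quoted above, the function names are hypothetical, and `uint16_t` stands in for an `unsigned short` that is 16 bits wide.

```c
#include <assert.h>
#include <stdint.h>

#define GSO_UDP (1u << 16) /* value quoted in the commit message */

/* Test the flag after storing the type in a 16-bit variable. */
static int udp_gso_seen_u16(unsigned int gso_type)
{
    uint16_t narrowed = (uint16_t)gso_type; /* bit 16 is lost here */

    return (narrowed & GSO_UDP) != 0;
}

/* Test the flag with the full-width type, as after the fix. */
static int udp_gso_seen_uint(unsigned int gso_type)
{
    assert(sizeof(unsigned int) * 8 > 16); /* bit 16 must exist */
    return (gso_type & GSO_UDP) != 0;
}
```

Passing `GSO_UDP` to the 16-bit variant never reports the flag, while the `unsigned int` variant does.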

    Addresses-Coverity-ID: 1462223
    Fixes: 0c19f846d582 ("net: accept UFO datagrams from tuntap and packet")
    Signed-off-by: Gustavo A. R. Silva
    Acked-by: Willem de Bruijn
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Gustavo A. R. Silva
     
  • [ Upstream commit 0c19f846d582af919db66a5914a0189f9f92c936 ]

    Tuntap and similar devices can inject GSO packets. Accept type
    VIRTIO_NET_HDR_GSO_UDP, even though not generating UFO natively.

    Processes are expected to use feature negotiation such as TUNSETOFFLOAD
    to detect supported offload types and refrain from injecting other
    packets. This process breaks down with live migration: guest kernels
    do not renegotiate flags, so destination hosts need to expose all
    features that the source host does.

    Partially revert the UFO removal from 182e0b6b5846~1..d9d30adf5677.
    This patch introduces nearly(*) no new code to simplify verification.
    It brings back verbatim tuntap UFO negotiation, VIRTIO_NET_HDR_GSO_UDP
    insertion and software UFO segmentation.

    It does not reinstate protocol stack support, hardware offload
    (NETIF_F_UFO), SKB_GSO_UDP tunneling in SKB_GSO_SOFTWARE or reception
    of VIRTIO_NET_HDR_GSO_UDP packets in tuntap.

    To support SKB_GSO_UDP reappearing in the stack, also reinstate
    logic in act_csum and openvswitch. Achieve equivalence with v4.13 HEAD
    by squashing in commit 939912216fa8 ("net: skb_needs_check() removes
    CHECKSUM_UNNECESSARY check for tx.") and reverting commit 8d63bee643f1
    ("net: avoid skb_warn_bad_offload false positives on UFO").

    (*) To avoid having to bring back skb_shinfo(skb)->ip6_frag_id,
    ipv6_proxy_select_ident is changed to return a __be32 and this is
    assigned directly to the frag_hdr. Also, SKB_GSO_UDP is inserted
    at the end of the enum to minimize code churn.

    Tested
    Booted a v4.13 guest kernel with QEMU. On a host kernel before this
    patch `ethtool -k eth0` shows UFO disabled. After the patch, it is
    enabled, same as on a v4.13 host kernel.

    A UFO packet sent from the guest appears on the tap device:
    host:
    nc -l -p -u 8000 &
    tcpdump -n -i tap0

    guest:
    dd if=/dev/zero of=payload.txt bs=1 count=2000
    nc -u 192.16.1.1 8000 < payload.txt

    Direct tap to tap transmission of VIRTIO_NET_HDR_GSO_UDP succeeds,
    packets arriving fragmented:

    ./with_tap_pair.sh ./tap_send_ufo tap0 tap1
    (from https://github.com/wdebruij/kerneltools/tree/master/tests)

    Changes
    v1 -> v2
    - simplified set_offload change (review comment)
    - documented test procedure

    Link: http://lkml.kernel.org/r/
    Fixes: fb652fdfe837 ("macvlan/macvtap: Remove NETIF_F_UFO advertisement.")
    Reported-by: Michal Kubecek
    Signed-off-by: Willem de Bruijn
    Acked-by: Jason Wang
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Willem de Bruijn
     

02 Nov, 2017

1 commit

  • Many source files in the tree are missing licensing information, which
    makes it harder for compliance tools to determine the correct license.

    By default all files without license information are under the default
    license of the kernel, which is GPL version 2.

    Update the files which contain no license information with the 'GPL-2.0'
    SPDX license identifier. The SPDX identifier is a legally binding
    shorthand, which can be used instead of the full boiler plate text.

    This patch is based on work done by Thomas Gleixner and Kate Stewart and
    Philippe Ombredanne.

    How this work was done:

    Patches were generated and checked against linux-4.14-rc6 for a subset of
    the use cases:
    - file had no licensing information in it.
    - file was a */uapi/* one with no licensing information in it,
    - file was a */uapi/* one with existing licensing information,

    Further patches will be generated in subsequent months to fix up cases
    where non-standard license headers were used, and references to license
    had to be inferred by heuristics based on keywords.

    The analysis to determine which SPDX License Identifier to be applied to
    a file was done in a spreadsheet of side by side results from the
    output of two independent scanners (ScanCode & Windriver) producing SPDX
    tag:value files created by Philippe Ombredanne. Philippe prepared the
    base worksheet, and did an initial spot review of a few 1000 files.

    The 4.13 kernel was the starting point of the analysis with 60,537 files
    assessed. Kate Stewart did a file by file comparison of the scanner
    results in the spreadsheet to determine which SPDX license identifier(s)
    to be applied to the file. She confirmed any determination that was not
    immediately clear with lawyers working with the Linux Foundation.

    Criteria used to select files for SPDX license identifier tagging was:
    - Files considered eligible had to be source code files.
    - Make and config files were included as candidates if they contained >5
    lines of source
    - File already had some variant of a license header in it (even if
    Reviewed-by: Philippe Ombredanne
    Reviewed-by: Thomas Gleixner
    Signed-off-by: Greg Kroah-Hartman

    Greg Kroah-Hartman
     

13 Sep, 2017

1 commit


04 Sep, 2017

1 commit

  • Pablo Neira Ayuso says:

    ====================
    Netfilter updates for net-next

    The following patchset contains Netfilter updates for your net-next
    tree. Basically, updates to the conntrack core, enhancements for
    nf_tables, conversion of netfilter hooks from linked list to array to
    improve memory locality and assorted improvements for the Netfilter
    codebase. More specifically, they are:

    1) Add expectation to hashes after timer initialization to prevent
    access from another CPU that walks on the hashes and calls
    del_timer(), from Florian Westphal.

    2) Don't update nf_tables chain counters from hot path, this is only
    used by the x_tables compatibility layer.

    3) Get rid of nested rcu_read_lock() calls from netfilter hook path.
    Hooks are always guaranteed to run from rcu read side, so remove
    nested rcu_read_lock() where possible. Patch from Taehee Yoo.

    4) nf_tables new ruleset generation notifications include PID and name
    of the process that has updated the ruleset, from Phil Sutter.

    5) Use skb_header_pointer() from nft_fib, so we can reuse this code from
    the nf_family netdev family. Patch from Pablo M. Bermudo.

    6) Add support for nft_fib in nf_tables netdev family, also from Pablo.

    7) Use deferrable workqueue for conntrack garbage collection, to reduce
    power consumption, patch from Subash Abhinov Kasiviswanathan.

    8) Add nf_ct_expect_iterate_net() helper and use it. From Florian
    Westphal.

    9) Call nf_ct_unconfirmed_destroy only from cttimeout, from Florian.

    10) Drop references on conntrack removal path when skbuffs have escaped via
    nfqueue, from Florian.

    11) Don't queue packets to nfqueue with dying conntrack, from Florian.

    12) Constify nf_hook_ops structure, from Florian.

    13) Remove needless branch in nf_tables trace code, from Phil Sutter.

    14) Add nla_strdup(), from Phil Sutter.

    15) Raise nf_tables objects name size up to 255 chars, people want to use
    DNS names, so increase this according to what RFC 1035 specifies.
    Patch series from Phil Sutter.

    16) Kill nf_conntrack_default_on, it's broken. Default on conntrack hook
    registration on demand, suggested by Eric Dumazet, patch from Florian.

    17) Remove unused variables in compat_copy_entry_from_user both in
    ip_tables and arp_tables code. Patch from Taehee Yoo.

    18) Constify struct nf_conntrack_l4proto, from Julia Lawall.

    19) Constify nf_loginfo structure, also from Julia.

    20) Use a single rb root in connlimit, from Taehee Yoo.

    21) Remove unused netfilter_queue_init() prototype, from Taehee Yoo.

    22) Use audit_log() instead of open-coding it, from Geliang Tang.

    23) Allow to mangle tcp options via nft_exthdr, from Florian.

    24) Allow to fetch TCP MSS from nft_rt, from Florian. This includes
    a fix for a miscalculation of the minimal length.

    25) Simplify branch logic in h323 helper, from Nick Desaulniers.

    26) Calculate netlink attribute size for conntrack tuple at compile
    time, from Florian.

    27) Remove protocol name field from nf_conntrack_{l3,l4}proto structure.
    From Florian.

    28) Remove holes in nf_conntrack_l4proto structure, so it becomes
    smaller. From Florian.

    29) Get rid of print_tuple() indirection for /proc conntrack listing.
    Place all the code in net/netfilter/nf_conntrack_standalone.c.
    Patch from Florian.

    30) Do not build in print_conntrack() if CONFIG_NF_CONNTRACK_PROCFS is
    off. From Florian.

    31) Constify most nf_conntrack_{l3,l4}proto helper functions, from
    Florian.

    32) Fix broken indentation in ebtables extensions, from Colin Ian King.

    33) Fix several harmless sparse warnings, from Florian.

    34) Convert netfilter hook infrastructure to use array for better memory
    locality, joint work done by Florian and Aaron Conole. Moreover, add
    some instrumentation to debug this.

    35) Batch nf_unregister_net_hooks() calls, to call synchronize_net once
    per batch, from Florian.

    36) Get rid of noisy logging in ICMPv6 conntrack helper, from Florian.

    37) Get rid of obsolete NFDEBUG() instrumentation, from Varsha Rao.

    38) Remove unused code in the generic protocol tracker, from Davide
    Caratti.

    I think I will have material for a second Netfilter batch in my queue if
    time allows to make it fit in this merge window.
    ====================

    Signed-off-by: David S. Miller

    David S. Miller
     

25 Aug, 2017

1 commit


22 Aug, 2017

1 commit


17 Aug, 2017

1 commit

  • For sw_flow_actions, the actions_len only represents the kernel part's
    size, and when we dump the actions to the userspace, we will do the
    conversions, so its true size may become bigger than the actions_len.

    But unfortunately, for OVS_PACKET_ATTR_ACTIONS, we use the actions_len
    to alloc the skbuff, so the user_skb's size may become insufficient and
    oops will happen like this:
    skbuff: skb_over_panic: text:ffffffff8148fabf len:1749 put:157 head:
    ffff881300f39000 data:ffff881300f39000 tail:0x6d5 end:0x6c0 dev:
    ------------[ cut here ]------------
    kernel BUG at net/core/skbuff.c:129!
    [...]
    Call Trace:

    [] skb_put+0x43/0x44
    [] skb_zerocopy+0x6c/0x1f4
    [] queue_userspace_packet+0x3a3/0x448 [openvswitch]
    [] ovs_dp_upcall+0x30/0x5c [openvswitch]
    [] output_userspace+0x132/0x158 [openvswitch]
    [] ? ip6_rcv_finish+0x74/0x77 [ipv6]
    [] do_execute_actions+0xcc1/0xdc8 [openvswitch]
    [] ovs_execute_actions+0x74/0x106 [openvswitch]
    [] ovs_dp_process_packet+0xe1/0xfd [openvswitch]
    [] ? key_extract+0x63c/0x8d5 [openvswitch]
    [] ovs_vport_receive+0xa1/0xc3 [openvswitch]
    [...]

    Also we can find that the actions_len is much smaller than the orig_len:
    crash> struct sw_flow_actions 0xffff8812f539d000
    struct sw_flow_actions {
    rcu = {
    next = 0xffff8812f5398800,
    func = 0xffffe3b00035db32
    },
    orig_len = 1384,
    actions_len = 592,
    actions = 0xffff8812f539d01c
    }

    So as a quick fix, use the orig_len instead of the actions_len to alloc
    the user_skb.
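The skb_over_panic line above is the result of a simple bounds check failing. A minimal user-space sketch of that check follows; `struct buf` and `put_fits()` are hypothetical stand-ins rather than the skbuff API, and the example numbers assume that the reported tail (0x6d5) is the value after the failed 157-byte put.

```c
#include <assert.h>
#include <stddef.h>

/* Minimal stand-in for the bookkeeping of an skb's linear buffer. */
struct buf {
    size_t tail; /* bytes used so far */
    size_t end;  /* bytes allocated */
};

/* Returns 1 if appending n bytes fits, 0 if it would run past end. */
static int put_fits(const struct buf *b, size_t n)
{
    assert(b->tail <= b->end);
    return n <= b->end - b->tail;
}
```

With `tail = 0x6d5 - 157` and `end = 0x6c0`, the 157-byte put does not fit, which matches the oops. Sizing the buffer from the larger length avoids it.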

    Last, this oops happened on our system running a relatively old kernel, but
    the same risk still exists on the mainline, since we use the wrong
    actions_len from the beginning.

    Fixes: ccea74457bbd ("openvswitch: include datapath actions with sampled-packet upcall to userspace")
    Cc: Neil McKee
    Signed-off-by: Liping Zhang
    Acked-by: Pravin B Shelar
    Signed-off-by: David S. Miller

    Liping Zhang
     

12 Aug, 2017

1 commit


02 Aug, 2017

1 commit


25 Jul, 2017

1 commit


21 Jul, 2017

1 commit


20 Jul, 2017

2 commits

  • When flow_free() frees a flow, it calls cpumask_next() once for every
    possible CPU (cpu_possible_mask, e.g. 128 by default). That consumes
    noticeable CPU time if flow_free() is called frequently, for example
    when all packets are put to userspace via upcall and OvS sends them
    back via netlink to ovs_packet_cmd_execute(), which calls flow_free().

    The test topo is shown as below. VM01 sends TCP packets to VM02,
    and OvS forwards the packets. When testing, we use perf to report the
    system performance.

    VM01 --- OvS-VM --- VM02

    Without this patch, perf top shows the following; flow_free()
    accounts for 3.02% of CPU usage.

    4.23% [kernel] [k] _raw_spin_unlock_irqrestore
    3.62% [kernel] [k] __do_softirq
    3.16% [kernel] [k] __memcpy
    3.02% [kernel] [k] flow_free
    2.42% libc-2.17.so [.] __memcpy_ssse3_back
    2.18% [kernel] [k] copy_user_generic_unrolled
    2.17% [kernel] [k] find_next_bit

    With this patch applied, perf top shows the following; flow_free()
    no longer appears in the list.

    4.11% [kernel] [k] _raw_spin_unlock_irqrestore
    3.79% [kernel] [k] __do_softirq
    3.46% [kernel] [k] __memcpy
    2.73% libc-2.17.so [.] __memcpy_ssse3_back
    2.25% [kernel] [k] copy_user_generic_unrolled
    1.89% libc-2.17.so [.] _int_malloc
    1.53% ovs-vswitchd [.] xlate_actions

    With this patch, the TCP throughput (we don't use Megaflow Cache
    + Microflow Cache) between VMs goes from 1.18Gbs/sec up to 1.30Gbs/sec
    (roughly a 10% performance improvement).

    This patch adds a cpumask struct; the cpu_used_mask stores the ids of
    the CPUs that the flow used. We then only check the flow_stats on the
    CPUs that were used, since it is unnecessary to check all possible CPUs
    when getting, cleaning, and updating the flow_stats. Adding the
    cpu_used_mask to the sw_flow struct doesn't increase the cacheline number.
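The idea of tracking used CPUs in a mask and visiting only those can be sketched in plain C. This is a user-space model under assumptions, not the sw_flow code: `struct flow`, its fields and the helpers are hypothetical, and a pair of 64-bit words stands in for the kernel cpumask.

```c
#include <assert.h>
#include <stdint.h>

#define NR_CPUS 128

struct flow {
    uint64_t used[NR_CPUS / 64]; /* one bit per CPU that touched the flow */
    uint64_t packets[NR_CPUS];   /* per-CPU packet counters */
};

/* Index of the lowest set bit; x must be non-zero. */
static unsigned int lowest_bit(uint64_t x)
{
    unsigned int n = 0;

    assert(x != 0);
    while (!(x & 1)) {
        x >>= 1;
        n++;
    }
    return n;
}

/* Account n packets on a CPU and remember that this CPU was used. */
static void flow_account(struct flow *f, unsigned int cpu, uint64_t n)
{
    assert(cpu < NR_CPUS);
    f->used[cpu / 64] |= (uint64_t)1 << (cpu % 64);
    f->packets[cpu] += n;
}

/* Sum counters over used CPUs only; *visited reports how many were read. */
static uint64_t flow_total(const struct flow *f, unsigned int *visited)
{
    uint64_t total = 0;

    *visited = 0;
    for (unsigned int w = 0; w < NR_CPUS / 64; w++) {
        uint64_t bits = f->used[w];

        while (bits) {
            unsigned int cpu = w * 64 + lowest_bit(bits);

            total += f->packets[cpu];
            (*visited)++;
            bits &= bits - 1; /* clear the bit just handled */
        }
    }
    return total;
}
```

For a flow touched on two CPUs, `flow_total()` reads two counters instead of all 128, which is the saving the patch is after.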

    Signed-off-by: Tonghao Zhang
    Acked-by: Pravin B Shelar
    Signed-off-by: David S. Miller

    Tonghao Zhang
     
  • In ovs_flow_stats_update(), the node variable is only used to
    allocate the flow_stats struct. As this is not the common case,
    it is unnecessary to call numa_node_id() every time. This patch
    is not a bugfix, but there may be a small performance increase.

    Signed-off-by: Tonghao Zhang
    Signed-off-by: David S. Miller

    Tonghao Zhang
     

18 Jul, 2017

1 commit


16 Jul, 2017

1 commit

  • When there is an established connection in direction A->B, it is
    possible to receive a packet on port B which then executes
    ct(commit,force) without first performing ct() - ie, a lookup.
    In this case, we would expect that this packet can delete the existing
    entry so that we can commit a connection with direction B->A. However,
    currently we only perform a check in skb_nfct_cached() for whether
    OVS_CS_F_TRACKED is set and OVS_CS_F_INVALID is not set, ie that a
    lookup previously occurred. In the above scenario, a lookup has not
    occurred but we should still be able to statelessly look up the
    existing entry and potentially delete the entry if it is in the
    opposite direction.

    This patch extends the check to also hint that if the action has the
    force flag set, then we will lookup the existing entry so that the
    force check at the end of skb_nfct_cached has the ability to delete
    the connection.
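The change to the check can be summarised as a predicate. The sketch below is a simplification under assumptions and not the body of skb_nfct_cached(): the function names are hypothetical and the flag values are illustrative stand-ins for OVS_CS_F_TRACKED and OVS_CS_F_INVALID.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Illustrative flag values; the real ones live in the OVS uapi header. */
#define CS_F_INVALID 0x10
#define CS_F_TRACKED 0x20

/* Before: only look up an entry if a previous ct() lookup happened. */
static bool lookup_before(uint8_t ct_state)
{
    return (ct_state & CS_F_TRACKED) && !(ct_state & CS_F_INVALID);
}

/* After: the force flag is also enough to trigger the lookup. */
static bool lookup_after(uint8_t ct_state, bool force)
{
    return lookup_before(ct_state) || force;
}
```

For a packet that has not gone through ct() (state 0), `lookup_after(0, true)` is true, so the existing entry can be found and deleted if its direction is reversed.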

    Fixes: dd41d330b03 ("openvswitch: Add force commit.")
    CC: Pravin Shelar
    CC: dev@openvswitch.org
    Signed-off-by: Joe Stringer
    Signed-off-by: Greg Rose
    Signed-off-by: David S. Miller

    Greg Rose
     

03 Jul, 2017

1 commit


02 Jul, 2017

1 commit

  • When compiling OvS-master on 4.4.0-81 kernel,
    there is a warning:

    CC [M] /root/ovs/datapath/linux/datapath.o
    /root/ovs/datapath/linux/datapath.c: In function
    'ovs_flow_cmd_set':
    /root/ovs/datapath/linux/datapath.c:1221:1: warning:
    the frame size of 1040 bytes is larger than 1024 bytes
    [-Wframe-larger-than=]

    This patch factors out match-init and action-copy to avoid the
    "-Wframe-larger-than=1024" warning. Because mask is only
    used to get actions, we add a new function to save some
    stack space.

    Signed-off-by: Tonghao Zhang
    Acked-by: Pravin B Shelar
    Signed-off-by: David S. Miller

    Tonghao Zhang
     

25 Jun, 2017

1 commit

  • Switches and modern SR-IOV enabled NICs may multiplex traffic from port
    representors and control messages over a single set of hardware queues.
    Control messages and muxed traffic may need ordered delivery.

    Those requirements make it hard to comfortably use TC infrastructure today
    unless we have a way of attaching metadata to skbs at the upper device.
    Because a single set of queues is used for many netdevs, stopping TC/sched
    queues of all of them reliably is impossible and the lower device has to
    retreat to returning NETDEV_TX_BUSY and usually has to take extra locks on
    the fastpath.

    This patch attempts to enable port/representative devs to attach metadata
    to skbs which carry port id. This way representatives can be queueless and
    all queuing can be performed at the lower netdev in the usual way.

    Traffic arriving on the port/representative interfaces will have
    metadata attached and will subsequently be queued to the lower device for
    transmission. The lower device should recognize the metadata and translate
    it to HW specific format which is most likely either a special header
    inserted before the network headers or descriptor/metadata fields.

    Metadata is associated with the lower device by storing the netdev pointer
    along with port id so that if TC decides to redirect or mirror the new
    netdev will not try to interpret it.

    This is mostly for SR-IOV devices since switches don't have lower netdevs
    today.

    Signed-off-by: Jakub Kicinski
    Signed-off-by: Sridhar Samudrala
    Signed-off-by: Simon Horman
    Signed-off-by: David S. Miller

    Jakub Kicinski
     

21 Jun, 2017

1 commit


16 Jun, 2017

1 commit

  • There were many places that my previous spatch didn't find,
    as pointed out by yuan linyu in various patches.

    The following spatch found many more and also removes the
    now unnecessary casts:

    @@
    identifier p, p2;
    expression len;
    expression skb;
    type t, t2;
    @@
    (
    -p = skb_put(skb, len);
    +p = skb_put_zero(skb, len);
    |
    -p = (t)skb_put(skb, len);
    +p = skb_put_zero(skb, len);
    )
    ... when != p
    (
    p2 = (t2)p;
    -memset(p2, 0, len);
    |
    -memset(p, 0, len);
    )

    @@
    type t, t2;
    identifier p, p2;
    expression skb;
    @@
    t *p;
    ...
    (
    -p = skb_put(skb, sizeof(t));
    +p = skb_put_zero(skb, sizeof(t));
    |
    -p = (t *)skb_put(skb, sizeof(t));
    +p = skb_put_zero(skb, sizeof(t));
    )
    ... when != p
    (
    p2 = (t2)p;
    -memset(p2, 0, sizeof(*p));
    |
    -memset(p, 0, sizeof(*p));
    )

    @@
    expression skb, len;
    @@
    -memset(skb_put(skb, len), 0, len);
    +skb_put_zero(skb, len);

    Apply it to the tree (with one manual fixup to keep the
    comment in vxlan.c, which spatch removed.)
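What the transformation buys can be shown with a small user-space model. `struct buf`, `buf_put()` and `buf_put_zero()` below are hypothetical stand-ins for an skb and the skb_put()/skb_put_zero() pair, not the kernel functions themselves.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

struct buf {
    unsigned char data[64];
    size_t len; /* bytes in use */
};

/* Reserve n bytes at the tail and return a pointer to them, as is. */
static void *buf_put(struct buf *b, size_t n)
{
    void *p;

    assert(n <= sizeof(b->data) - b->len);
    p = b->data + b->len;
    b->len += n;
    return p;
}

/* Same as buf_put(), but hands back zeroed memory in one call. */
static void *buf_put_zero(struct buf *b, size_t n)
{
    void *p = buf_put(b, n);

    memset(p, 0, n);
    return p;
}
```

A caller that used to write `p = buf_put(b, n); memset(p, 0, n);` can call `buf_put_zero(b, n)` instead. Because the helper returns `void *`, no cast is needed on assignment.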

    Signed-off-by: Johannes Berg
    Signed-off-by: David S. Miller

    Johannes Berg
     

15 Jun, 2017

1 commit


08 Jun, 2017

1 commit

  • Network devices can allocate resources and private memory using
    netdev_ops->ndo_init(). However, the release of these resources
    can occur in one of two different places.

    Either netdev_ops->ndo_uninit() or netdev->destructor().

    The decision of which operation frees the resources depends upon
    whether it is necessary for all netdev refs to be released before it
    is safe to perform the freeing.

    netdev_ops->ndo_uninit() presumably can occur right after the
    NETDEV_UNREGISTER notifier completes and the unicast and multicast
    address lists are flushed.

    netdev->destructor(), on the other hand, does not run until the
    netdev references all go away.

    Further complicating the situation is that netdev->destructor()
    almost universally also does a free_netdev().

    This creates a problem for the logic in register_netdevice(),
    because all callers of register_netdevice() manage the freeing
    of the netdev, and invoke free_netdev(dev) if register_netdevice()
    fails.

    If netdev_ops->ndo_init() succeeds, but something else fails inside
    of register_netdevice(), it does call ndo_ops->ndo_uninit(). But
    it is not able to invoke netdev->destructor().

    This is because netdev->destructor() will do a free_netdev() and
    then the caller of register_netdevice() will do the same.

    However, this means that the resources that would normally be released
    by netdev->destructor() will not be.

    Over the years drivers have added local hacks to deal with this, by
    invoking their destructor parts by hand when register_netdevice()
    fails.

    Many drivers do not try to deal with this, and instead we have leaks.

    Let's close this hole by formalizing the distinction between what
    private things need to be freed up by netdev->destructor() and whether
    the driver needs unregister_netdevice() to perform the free_netdev().

    netdev->priv_destructor() performs all actions to free up the private
    resources that used to be freed by netdev->destructor(), except for
    free_netdev().

    netdev->needs_free_netdev is a boolean that indicates whether
    free_netdev() should be done at the end of unregister_netdevice().

    Now, register_netdevice() can sanely release all resources after
    ndo_ops->ndo_init() succeeds, by invoking both ndo_ops->ndo_uninit()
    and netdev->priv_destructor().

    And at the end of unregister_netdevice(), we invoke
    netdev->priv_destructor() and optionally call free_netdev().
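The resulting ownership rules can be modelled in a few lines. This is a user-space sketch under assumptions, not the networking core: every `model_` function and the counters in `struct netdev` are hypothetical, and ndo_init()/ndo_uninit() are reduced to comments.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct netdev {
    void (*priv_destructor)(struct netdev *);
    bool needs_free_netdev;
    int private_frees; /* times private resources were released */
    int netdev_frees;  /* times the netdev itself was freed */
};

static void model_free_netdev(struct netdev *dev)
{
    dev->netdev_frees++;
}

static void release_private(struct netdev *dev)
{
    dev->private_frees++;
}

/* Registration that can fail after ndo_init() has succeeded. */
static int model_register(struct netdev *dev, bool fail_late)
{
    assert(dev != NULL);
    /* ndo_init() would allocate private resources here. */
    if (fail_late) {
        /* ndo_uninit() would run here, then the private destructor. */
        if (dev->priv_destructor)
            dev->priv_destructor(dev);
        return -1; /* the caller still owns and frees the netdev */
    }
    return 0;
}

/* Unregistration: release private state, then optionally the netdev. */
static void model_unregister(struct netdev *dev)
{
    assert(dev != NULL);
    if (dev->priv_destructor)
        dev->priv_destructor(dev);
    if (dev->needs_free_netdev)
        model_free_netdev(dev);
}
```

In both the failed-registration path (caller frees the netdev) and the normal unregister path, private resources and the netdev are each released exactly once.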

    Signed-off-by: David S. Miller

    David S. Miller
     

23 May, 2017

1 commit


20 May, 2017

1 commit


15 May, 2017

1 commit


03 May, 2017

1 commit

  • Pablo Neira Ayuso says:

    ====================
    Netfilter/IPVS/OVS fixes for net

    The following patchset contains a rather large batch of Netfilter, IPVS
    and OVS fixes for your net tree. This includes fixes for ctnetlink, the
    userspace conntrack helper infrastructure, conntrack OVS support,
    ebtables DNAT target, several leaks in error path among others. More
    specifically, they are:

    1) Fix reference count leak in the CT target error path, from Gao Feng.

    2) Remove conntrack entry clashing with a matching expectation, patch
    from Jarno Rajahalme.

    3) Fix bogus EEXIST when registering two different userspace helpers,
    from Liping Zhang.

    4) Don't leak dummy elements in the new bitmap set type in nf_tables,
    from Liping Zhang.

    5) Get rid of module autoload from conntrack update path in ctnetlink,
    we don't need autoload at this late stage and it is happening with
    rcu read lock held which is not good. From Liping Zhang.

    6) Fix deadlock due to double-acquire of the expect_lock from conntrack
    update path, this fixes a bug that was introduced when the central
    spinlock got removed. Again from Liping Zhang.

    7) Safe ct->status update from ctnetlink path, from Liping. The expect_lock
    protection that was selected when the central spinlock was removed was
    not really protecting anything at all.

    8) Protect sequence adjustment under ct->lock.

    9) Missing socket match with IPv6, from Peter Tirsek.

    10) Adjust skb->pkt_type of DNAT'ed frames from ebtables, from
    Linus Luessing.

    11) Don't give up on evaluating the expression on new entries added via
    dynset expression in nf_tables, from Liping Zhang.

    12) Use skb_checksum() when mangling icmpv6 in IPv6 NAT as this deals
    with non-linear skbuffs.

    13) Don't allow IPv6 service in IPVS if no IPv6 support is available,
    from Paolo Abeni.

    14) Missing mutex release in error path of xt_find_table_lock(), from
    Dan Carpenter.

    15) Update maintainers files, Netfilter section. Add Florian to the
    file, refer to nftables.org and change project status from Supported
    to Maintained.

    16) Bail out on mismatching extensions in element updates in nf_tables.
    ====================

    Signed-off-by: David S. Miller

    David S. Miller
     

01 May, 2017

1 commit

  • Pablo Neira Ayuso says:

    ====================
    Netfilter/IPVS updates for net-next

    The following patchset contains Netfilter updates for your net-next
    tree. A large bunch of code cleanups, simplify the conntrack extension
    codebase, get rid of the fake conntrack object, speed up netns by
    selective synchronize_net() calls. More specifically, they are:

    1) Check for ct->status bit instead of using nfct_nat() from IPVS and
    Netfilter codebase, patch from Florian Westphal.

    2) Use kcalloc() wherever possible in the IPVS code, from Varsha Rao.

    3) Simplify FTP IPVS helper module registration path, from Arushi Singhal.

    4) Introduce nft_is_base_chain() helper function.

    5) Enforce expectation limit from userspace conntrack helper,
    from Gao Feng.

    6) Add nf_ct_remove_expect() helper function, from Gao Feng.

    7) NAT mangle helper function return boolean, from Gao Feng.

    8) ctnetlink_alloc_expect() should only work for conntrack with
    helpers, from Gao Feng.

    9) Add nfnl_msg_type() helper function to nfnetlink to build the
    netlink message type.

    10) Get rid of unnecessary cast on void, from simran singhal.

    11) Use seq_puts()/seq_putc() instead of seq_printf() where possible,
    also from simran singhal.

    12) Use list_prev_entry() from nf_tables, from simran singhal.

    13) Remove unnecessary & on pointer function in the Netfilter and IPVS
    code.

    14) Remove obsolete comment on set of rules per CPU in ip6_tables,
    no longer true. From Arushi Singhal.

    15) Remove duplicated nf_conntrack_l4proto_udplite4, from Gao Feng.

    16) Remove unnecessary nested rcu_read_lock() in
    __nf_nat_decode_session(). Code running from hooks is already
    guaranteed to run under the RCU read side.

    17) Remove deadcode in nf_tables_getobj(), from Aaron Conole.

    18) Remove double assignment in nf_ct_l4proto_pernet_unregister_one(),
    also from Aaron.

    19) Get rid of unused __ip_set_get_netlink(), from Aaron Conole.

    20) Don't propagate NF_DROP error to userspace via ctnetlink in
    __nf_nat_alloc_null_binding() function, from Gao Feng.

    21) Revisit nf_ct_deliver_cached_events() to remove unnecessary checks,
    from Gao Feng.

    22) Kill the fake untracked conntrack objects, use ctinfo instead to
    annotate that a conntrack object is untracked, from Florian Westphal.

    23) Remove nf_ct_is_untracked(), now obsolete since we have no
    conntrack template anymore, from Florian.

    24) Add event mask support to nft_ct, also from Florian.

    25) Move nf_conn_help structure to
    include/net/netfilter/nf_conntrack_helper.h.

    26) Add a fixed 32 bytes scratchpad area for conntrack helpers.
    Thus, we don't deal with variable conntrack extensions anymore.
    Make sure userspace conntrack helper doesn't go over that size.
    Remove variable size ct extension infrastructure now this code
    got no more clients. From Florian Westphal.

    27) Restore offset and length of nf_ct_ext structure to 8 bytes now
    that wraparound is not possible any longer, also from Florian.

    28) Allow to get rid of unassured flows under stress in conntrack,
    this applies to DCCP, SCTP and TCP protocols, from Florian.

    29) Shrink size of nf_conntrack_ecache structure, from Florian.

    30) Use TCP_MAX_WSCALE instead of hardcoded 14 in TCP tracker,
    from Gao Feng.

    31) Register SYNPROXY hooks on demand, from Florian Westphal.

    32) Use pernet hook whenever possible, instead of global hook
    registration, from Florian Westphal.

    33) Pass hook structure to ebt_register_table() to consolidate some
    infrastructure code, from Florian Westphal.

    34) Use consume_skb() and return NF_STOLEN, instead of NF_DROP in the
    SYNPROXY code, to make sure device stats are not fooled, patch
    from Gao Feng.

    35) Remove NF_CT_EXT_F_PREALLOC; this kills quite some code that we
    don't need anymore if we just select a fixed size instead of an
    expensive runtime calculation of it. From Florian.

    36) Constify nf_ct_extend_register() and nf_ct_extend_unregister(),
    from Florian.

    37) Simplify nf_ct_ext_add(), this kills nf_ct_ext_create(), from
    Florian.

    38) Attach NAT extension on-demand from masquerade and pptp helper
    path, from Florian.

    39) Get rid of useless ip_vs_set_state_timeout(), from Aaron Conole.

    40) Speed up netns by selective calls of synchronize_net(), from
    Florian Westphal.

    41) Silence gcc stack size warning on 32-bit arch in snmp helper,
    from Florian.

    42) Unconditionally call nf_ct_ext_destroy(), even if we have no
    extensions, to deal with the NF_NAT_MANIP_SRC case. Patch from
    Liping Zhang.
    ====================

    Signed-off-by: David S. Miller

    David S. Miller
     

25 Apr, 2017

3 commits

  • Conntrack helpers do not check for a potentially clashing conntrack
    entry when creating a new expectation. Also, nf_conntrack_in() will
    check expectations (via init_conntrack()) only if a conntrack entry
    can not be found. The expectation for a packet which also matches an
    existing conntrack entry will not be removed by conntrack, and is
    currently handled inconsistently by OVS, as OVS expects the
    expectation to be removed when the connection tracking entry matching
    that expectation is confirmed.

    It should be noted that normally an IP stack would not allow reuse of
    a 5-tuple of an old (possibly lingering) connection for a new data
    connection, so this is a somewhat unlikely corner case. However, it is
    possible that a misbehaving source could cause conntrack entries to be
    created that could then interfere with new related connections.

    Fix this in the OVS module by deleting the clashing conntrack entry
    after an expectation has been matched. This causes the following
    nf_conntrack_in() call to also find the expectation and remove it when
    creating the new conntrack entry, and causes the forthcoming reply
    direction packets to match the new related connection instead of the
    old clashing conntrack entry.

    Fixes: 7f8a436eaa2c ("openvswitch: Add conntrack action")
    Reported-by: Yang Song
    Signed-off-by: Jarno Rajahalme
    Acked-by: Joe Stringer
    Signed-off-by: Pablo Neira Ayuso

    Jarno Rajahalme
     
  • Add a new optional conntrack action attribute OVS_CT_ATTR_EVENTMASK,
    which can be used in conjunction with the commit flag
    (OVS_CT_ATTR_COMMIT) to set the mask of bits specifying which
    conntrack events (IPCT_*) should be delivered via the Netfilter
    netlink multicast groups. Default behavior depends on the system
    configuration, but typically a lot of events are delivered. This can be
    very chatty for the NFNLGRP_CONNTRACK_UPDATE group, even if only some
    types of events are of interest.

    Netfilter core init_conntrack() adds the event cache extension, so we
    only need to set the ctmask value. However, if the system is
    configured without support for events, the setting will be skipped
    because the extension is not found.

    Signed-off-by: Jarno Rajahalme
    Reviewed-by: Greg Rose
    Acked-by: Joe Stringer
    Signed-off-by: David S. Miller

    Jarno Rajahalme
     
  • Fix typo in a comment.

    Signed-off-by: Jarno Rajahalme
    Acked-by: Greg Rose
    Signed-off-by: David S. Miller

    Jarno Rajahalme
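
The OVS_CT_ATTR_EVENTMASK commit in this group describes a mask of bits that selects which conntrack events are delivered, and notes that the setting is skipped when the event cache extension is absent. A minimal userspace sketch of that idea follows. It is a toy model, not the kernel code: every `demo_` name, the struct layout, and the enum values are assumptions introduced for illustration and are not copied from the kernel headers.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for IPCT_* event numbers (illustrative values). */
enum demo_ct_event {
	DEMO_IPCT_NEW,
	DEMO_IPCT_RELATED,
	DEMO_IPCT_DESTROY,
	DEMO_IPCT_REPLY,
	DEMO_IPCT_ASSURED,
	DEMO_IPCT_MARK,
};

/* Toy model of a per-connection event cache extension. */
struct demo_ecache {
	uint16_t ctmask;	/* bit N set => deliver event number N */
};

/* Turn one event number into its mask bit. */
static inline uint16_t demo_event_bit(enum demo_ct_event ev)
{
	return (uint16_t)(1u << ev);
}

/*
 * Apply the requested mask. When the extension is missing (NULL here),
 * the setting is skipped and false is returned.
 */
static bool demo_set_eventmask(struct demo_ecache *ecache, uint16_t mask)
{
	if (!ecache)
		return false;
	ecache->ctmask = mask;
	return true;
}

/* An event is delivered only if its bit is set in the mask. */
static bool demo_should_deliver(const struct demo_ecache *ecache,
				enum demo_ct_event ev)
{
	return ecache && (ecache->ctmask & demo_event_bit(ev));
}
```

With a mask of `demo_event_bit(DEMO_IPCT_NEW) | demo_event_bit(DEMO_IPCT_DESTROY)`, only creation and destruction events pass `demo_should_deliver()`, which is the kind of filtering that keeps an update-heavy multicast group quiet.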
     

15 Apr, 2017

1 commit


14 Apr, 2017

1 commit


06 Apr, 2017

1 commit


02 Apr, 2017

1 commit

  • ovs_flow_key_update() is called when the flow key is invalid, and it is
    used to update and revalidate the flow key. Commit 329f45bc4f19
    ("openvswitch: add mac_proto field to the flow key") introduced the
    mac_proto field to the flow key and uses it to determine whether the
    flow key is valid. However, that commit did not update the code path in
    ovs_flow_key_update() to revalidate the flow key, which may trigger a
    BUG_ON() in execute_recirc(). This patch addresses that issue.

    Fixes: 329f45bc4f19 ("openvswitch: add mac_proto field to the flow key")
    Signed-off-by: Yi-Hung Wei
    Acked-by: Jiri Benc
    Signed-off-by: David S. Miller

    Yi-Hung Wei
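
The revalidation problem described in the commit above can be pictured with a small toy model in which validity is tracked by a flag bit stored inside the mac_proto field. This is a sketch under stated assumptions: the flag value, the `demo_` names, and passing the extraction result in as a parameter are all hypothetical and do not reproduce the kernel's definitions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical flag bit marking the key as invalid. */
#define DEMO_KEY_INVALID 0x80u

struct demo_flow_key {
	uint8_t mac_proto;	/* protocol value plus the invalid flag */
};

static bool demo_key_is_valid(const struct demo_flow_key *key)
{
	return !(key->mac_proto & DEMO_KEY_INVALID);
}

static void demo_key_invalidate(struct demo_flow_key *key)
{
	key->mac_proto |= DEMO_KEY_INVALID;
}

/*
 * Re-extract the key and, only on success, clear the invalid flag.
 * extract_result stands in for the return value of the real extraction
 * step (0 on success, negative on error). Forgetting to clear the flag
 * leaves the key looking invalid to later validity checks.
 */
static int demo_key_update(struct demo_flow_key *key, int extract_result)
{
	if (extract_result == 0)
		key->mac_proto &= (uint8_t)~DEMO_KEY_INVALID;
	return extract_result;
}
```

In this model, a caller that asserts `demo_key_is_valid()` after a successful `demo_key_update()` only passes because the update path clears the flag, which mirrors the missing step the commit message describes.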
     

29 Mar, 2017

1 commit

  • The reference count held for skb needs to be released when the skb's
    nfct pointer is cleared, regardless of whether nf_ct_delete() is called
    or not.

    Failing to release the skb's reference count led to deferred conntrack
    cleanup spinning forever within nf_conntrack_cleanup_net_list() when
    cleaning up a network namespace:

       kworker/u16:0-19025 [004] 45981067.173642: sched_switch: kworker/u16:0:19025 [120] R ==> rcu_preempt:7 [120]
       kworker/u16:0-19025 [004] 45981067.173651: kernel_stack:
    => ___preempt_schedule (ffffffffa001ed36)
    => _raw_spin_unlock_bh (ffffffffa0713290)
    => nf_ct_iterate_cleanup (ffffffffc00a4454)
    => nf_conntrack_cleanup_net_list (ffffffffc00a5e1e)
    => nf_conntrack_pernet_exit (ffffffffc00a63dd)
    => ops_exit_list.isra.1 (ffffffffa06075f3)
    => cleanup_net (ffffffffa0607df0)
    => process_one_work (ffffffffa0084c31)
    => worker_thread (ffffffffa008592b)
    => kthread (ffffffffa008bee2)
    => ret_from_fork (ffffffffa071b67c)

    Fixes: dd41d33f0b03 ("openvswitch: Add force commit.")
    Reported-by: Yang Song
    Signed-off-by: Jarno Rajahalme
    Acked-by: Joe Stringer
    Signed-off-by: David S. Miller

    Jarno Rajahalme
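
The rule stated in the commit above (drop the skb's reference whenever its conntrack pointer is cleared, whether or not the entry is deleted) can be sketched with a toy reference counter. The types and `demo_` functions below are hypothetical and only model the pattern; they are not the kernel's skb or conntrack APIs.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy conntrack entry with a plain reference counter. */
struct demo_ct {
	int refs;
	bool deleted;
};

/* Toy packet holding one reference on its conntrack entry. */
struct demo_skb {
	struct demo_ct *ct;
};

static void demo_ct_put(struct demo_ct *ct)
{
	assert(ct->refs > 0);
	ct->refs--;
}

/*
 * Detach the entry from the packet. Deleting the entry is optional,
 * but releasing the packet's reference is not: it happens on every
 * path that clears the pointer.
 */
static void demo_skb_clear_ct(struct demo_skb *skb, bool delete_entry)
{
	struct demo_ct *ct = skb->ct;

	if (!ct)
		return;
	if (delete_entry)
		ct->deleted = true;
	demo_ct_put(ct);	/* unconditional release */
	skb->ct = NULL;
}
```

If the put were placed inside the `delete_entry` branch, the non-deleting path would leave the count one too high, and any cleanup loop waiting for the count to drain would never finish, which matches the spinning cleanup shown in the trace.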