08 Sep, 2022

1 commit

  • [ Upstream commit a87406f4adee9c53b311d8a1ba2849c69e29a6d0 ]

    ovs_dp_cmd_new()->ovs_dp_change()->ovs_dp_set_upcall_portids()
    allocates array via kmalloc.
    If for some reason new_vport() fails during ovs_dp_cmd_new()
    dp->upcall_portids must be freed.
    Add missing kfree.

    Kmemleak example:
    unreferenced object 0xffff88800c382500 (size 64):
    comm "dump_state", pid 323, jiffies 4294955418 (age 104.347s)
    hex dump (first 32 bytes):
    5e c2 79 e4 1f 7a 38 c7 09 21 38 0c 80 88 ff ff ^.y..z8..!8.....
    03 00 00 00 0a 00 00 00 14 00 00 00 28 00 00 00 ............(...
    backtrace:
    [] ovs_dp_set_upcall_portids+0x38/0xa0
    [] ovs_dp_change+0x63/0xe0
    [] ovs_dp_cmd_new+0x1f0/0x380
    [] genl_family_rcv_msg_doit+0xea/0x150
    [] genl_rcv_msg+0xdc/0x1e0
    [] netlink_rcv_skb+0x50/0x100
    [] genl_rcv+0x24/0x40
    [] netlink_unicast+0x23e/0x360
    [] netlink_sendmsg+0x24e/0x4b0
    [] sock_sendmsg+0x62/0x70
    [] ____sys_sendmsg+0x230/0x270
    [] ___sys_sendmsg+0x88/0xd0
    [] __sys_sendmsg+0x59/0xa0
    [] do_syscall_64+0x3b/0x90
    [] entry_SYSCALL_64_after_hwframe+0x63/0xcd

    Fixes: b83d23a2a38b ("openvswitch: Introduce per-cpu upcall dispatch")
    Acked-by: Aaron Conole
    Signed-off-by: Andrey Zhadchenko
    Link: https://lore.kernel.org/r/20220825020326.664073-1-andrey.zhadchenko@virtuozzo.com
    Signed-off-by: Jakub Kicinski
    Signed-off-by: Sasha Levin

    Andrey Zhadchenko
     

29 Jun, 2022

1 commit

  • commit 12378a5a75e33f34f8586706eb61cca9e6d4690c upstream.

    When a packet enters the OVS datapath and does not match any existing
    flows installed in the kernel flow cache, the packet will be sent to
    userspace to be parsed, and a new flow will be created. The kernel and
    OVS rely on each other to parse packet fields in the same way so that
    packets will be handled properly.

    As per the design document linked below, OVS expects all later IPv6
    fragments to have nw_proto=44 in the flow key, so they can be correctly
    matched on OpenFlow rules. OpenFlow controllers create pipelines based
    on this design.

    This behavior was changed by the commit in the Fixes tag so that
    nw_proto equals the next_header field of the last extension header.
    However, there is no counterpart for this change in OVS userspace,
    meaning that this field is parsed differently between OVS and the
    kernel. This is a problem because OVS creates actions based on what is
    parsed in userspace, but the kernel-provided flow key is used as a match
    criteria, as described in Documentation/networking/openvswitch.rst. This
    leads to issues such as packets incorrectly matching on a flow and thus
    the wrong list of actions being applied to the packet. Such changes in
    packet parsing cannot be implemented without breaking the userspace.

    The offending commit is partially reverted to restore the expected
    behavior.

    The change technically made sense and there is a good reason that it was
    implemented, but it does not comply with the original design of OVS.
    If in the future someone wants to implement such a change, then it must
    be user-configurable and disabled by default to preserve backwards
    compatibility with existing OVS versions.

    Cc: stable@vger.kernel.org
    Fixes: fa642f08839b ("openvswitch: Derive IP protocol number for IPv6 later frags")
    Link: https://docs.openvswitch.org/en/latest/topics/design/#fragments
    Signed-off-by: Rosemarie O'Riorden
    Acked-by: Eelco Chaudron
    Link: https://lore.kernel.org/r/20220621204845.9721-1-roriorden@redhat.com
    Signed-off-by: Paolo Abeni
    Signed-off-by: Greg Kroah-Hartman

    Rosemarie O'Riorden
     

15 Jun, 2022

1 commit

  • commit 2061ecfdf2350994e5b61c43e50e98a7a70e95ee upstream.

    If packet headers changed, the cached nfct is no longer relevant
    for the packet and attempt to re-use it leads to the incorrect packet
    classification.

    This issue is causing broken connectivity in OpenStack deployments
    with OVS/OVN due to hairpin traffic being unexpectedly dropped.

    The setup has datapath flows with several conntrack actions and tuple
    changes between them:

    actions:ct(commit,zone=8,mark=0/0x1,nat(src)),
    set(eth(src=00:00:00:00:00:01,dst=00:00:00:00:00:06)),
    set(ipv4(src=172.18.2.10,dst=192.168.100.6,ttl=62)),
    ct(zone=8),recirc(0x4)

    After the first ct() action the packet headers are almost fully
    re-written. The next ct() tries to re-use the existing nfct entry
    and marks the packet as invalid, so it gets dropped later in the
    pipeline.

    Clearing the cached conntrack entry whenever packet tuple is changed
    to avoid the issue.

    The flow key should not be cleared though, because we should still
    be able to match on the ct_state if the recirculation happens after
    the tuple change but before the next ct() action.

    Cc: stable@vger.kernel.org
    Fixes: 7f8a436eaa2c ("openvswitch: Add conntrack action")
    Reported-by: Frode Nordahl
    Link: https://mail.openvswitch.org/pipermail/ovs-discuss/2022-May/051829.html
    Link: https://bugs.launchpad.net/ubuntu/+source/ovn/+bug/1967856
    Signed-off-by: Ilya Maximets
    Link: https://lore.kernel.org/r/20220606221140.488984-1-i.maximets@ovn.org
    Signed-off-by: Jakub Kicinski
    Signed-off-by: Greg Kroah-Hartman

    Ilya Maximets
     

27 Apr, 2022

2 commits

  • commit 719774377622bc4025d2a74f551b5dc2158c6c30 upstream.

    Convert nf_conn reference counting from atomic_t to refcount_t based api.
    refcount_t api provides more runtime sanity checks and will warn on
    certain constructs, e.g. refcount_inc() on a zero reference count, which
    usually indicates use-after-free.

    For this reason template allocation is changed to init the refcount to
    1, the subsequenct add operations are removed.

    Likewise, init_conntrack() is changed to set the initial refcount to 1
    instead refcount_inc().

    This is safe because the new entry is not (yet) visible to other cpus.

    Signed-off-by: Florian Westphal
    Signed-off-by: Pablo Neira Ayuso
    Signed-off-by: Greg Kroah-Hartman

    Florian Westphal
     
  • commit cefa91b2332d7009bc0be5d951d6cbbf349f90f8 upstream.

    Given a sufficiently large number of actions, while copying and
    reserving memory for a new action of a new flow, if next_offset is
    greater than MAX_ACTIONS_BUFSIZE, the function reserve_sfa_size() does
    not return -EMSGSIZE as expected, but it allocates MAX_ACTIONS_BUFSIZE
    bytes increasing actions_len by req_size. This can then lead to an OOB
    write access, especially when further actions need to be copied.

    Fix it by rearranging the flow action size check.

    KASAN splat below:

    ==================================================================
    BUG: KASAN: slab-out-of-bounds in reserve_sfa_size+0x1ba/0x380 [openvswitch]
    Write of size 65360 at addr ffff888147e4001c by task handler15/836

    CPU: 1 PID: 836 Comm: handler15 Not tainted 5.18.0-rc1+ #27
    ...
    Call Trace:

    dump_stack_lvl+0x45/0x5a
    print_report.cold+0x5e/0x5db
    ? __lock_text_start+0x8/0x8
    ? reserve_sfa_size+0x1ba/0x380 [openvswitch]
    kasan_report+0xb5/0x130
    ? reserve_sfa_size+0x1ba/0x380 [openvswitch]
    kasan_check_range+0xf5/0x1d0
    memcpy+0x39/0x60
    reserve_sfa_size+0x1ba/0x380 [openvswitch]
    __add_action+0x24/0x120 [openvswitch]
    ovs_nla_add_action+0xe/0x20 [openvswitch]
    ovs_ct_copy_action+0x29d/0x1130 [openvswitch]
    ? __kernel_text_address+0xe/0x30
    ? unwind_get_return_address+0x56/0xa0
    ? create_prof_cpu_mask+0x20/0x20
    ? ovs_ct_verify+0xf0/0xf0 [openvswitch]
    ? prep_compound_page+0x198/0x2a0
    ? __kasan_check_byte+0x10/0x40
    ? kasan_unpoison+0x40/0x70
    ? ksize+0x44/0x60
    ? reserve_sfa_size+0x75/0x380 [openvswitch]
    __ovs_nla_copy_actions+0xc26/0x2070 [openvswitch]
    ? __zone_watermark_ok+0x420/0x420
    ? validate_set.constprop.0+0xc90/0xc90 [openvswitch]
    ? __alloc_pages+0x1a9/0x3e0
    ? __alloc_pages_slowpath.constprop.0+0x1da0/0x1da0
    ? unwind_next_frame+0x991/0x1e40
    ? __mod_node_page_state+0x99/0x120
    ? __mod_lruvec_page_state+0x2e3/0x470
    ? __kasan_kmalloc_large+0x90/0xe0
    ovs_nla_copy_actions+0x1b4/0x2c0 [openvswitch]
    ovs_flow_cmd_new+0x3cd/0xb10 [openvswitch]
    ...

    Cc: stable@vger.kernel.org
    Fixes: f28cd2af22a0 ("openvswitch: fix flow actions reallocation")
    Signed-off-by: Paolo Valerio
    Acked-by: Eelco Chaudron
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Paolo Valerio
     

14 Apr, 2022

2 commits

  • [ Upstream commit 1f30fb9166d4f15a1aa19449b9da871fe0ed4796 ]

    While parsing user-provided actions, openvswitch module may dynamically
    allocate memory and store pointers in the internal copy of the actions.
    So this memory has to be freed while destroying the actions.

    Currently there are only two such actions: ct() and set(). However,
    there are many actions that can hold nested lists of actions and
    ovs_nla_free_flow_actions() just jumps over them leaking the memory.

    For example, removal of the flow with the following actions will lead
    to a leak of the memory allocated by nf_ct_tmpl_alloc():

    actions:clone(ct(commit),0)

    Non-freed set() action may also leak the 'dst' structure for the
    tunnel info including device references.

    Under certain conditions with a high rate of flow rotation that may
    cause significant memory leak problem (2MB per second in reporter's
    case). The problem is also hard to mitigate, because the user doesn't
    have direct control over the datapath flows generated by OVS.

    Fix that by iterating over all the nested actions and freeing
    everything that needs to be freed recursively.

    New build time assertion should protect us from this problem if new
    actions will be added in the future.

    Unfortunately, openvswitch module doesn't use NLA_F_NESTED, so all
    attributes has to be explicitly checked. sample() and clone() actions
    are mixing extra attributes into the user-provided action list. That
    prevents some code generalization too.

    Fixes: 34ae932a4036 ("openvswitch: Make tunnel set action attach a metadata dst")
    Link: https://mail.openvswitch.org/pipermail/ovs-dev/2022-March/392922.html
    Reported-by: Stéphane Graber
    Signed-off-by: Ilya Maximets
    Acked-by: Aaron Conole
    Signed-off-by: David S. Miller
    Signed-off-by: Sasha Levin

    Ilya Maximets
     
  • [ Upstream commit 3f2a3050b4a3e7f32fc0ea3c9b0183090ae00522 ]

    'OVS_CLONE_ATTR_EXEC' is an internal attribute that is used for
    performance optimization inside the kernel. It's added by the kernel
    while parsing user-provided actions and should not be sent during the
    flow dump as it's not part of the uAPI.

    The issue doesn't cause any significant problems to the ovs-vswitchd
    process, because reported actions are not really used in the
    application lifecycle and only supposed to be shown to a human via
    ovs-dpctl flow dump. However, the action list is still incorrect
    and causes the following error if the user wants to look at the
    datapath flows:

    # ovs-dpctl add-dp system@ovs-system
    # ovs-dpctl add-flow "" "clone(ct(commit),0)"
    # ovs-dpctl dump-flows
    , packets:0, bytes:0, used:never,
    actions:clone(bad length 4, expected -1 for: action0(01 00 00 00),
    ct(commit),0)

    With the fix:

    # ovs-dpctl dump-flows
    , packets:0, bytes:0, used:never,
    actions:clone(ct(commit),0)

    Additionally fixed an incorrect attribute name in the comment.

    Fixes: b233504033db ("openvswitch: kernel datapath clone action")
    Signed-off-by: Ilya Maximets
    Acked-by: Aaron Conole
    Link: https://lore.kernel.org/r/20220404104150.2865736-1-i.maximets@ovn.org
    Signed-off-by: Jakub Kicinski
    Signed-off-by: Sasha Levin

    Ilya Maximets
     

08 Apr, 2022

3 commits

  • commit f19c44452b58a84d95e209b847f5495d91c9983a upstream.

    IPv6 nd target mask was not getting populated in flow dump.

    In the function __ovs_nla_put_key the icmp code mask field was checked
    instead of icmp code key field to classify the flow as neighbour discovery.

    ufid:bdfbe3e5-60c2-43b0-a5ff-dfcac1c37328, recirc_id(0),dp_hash(0/0),
    skb_priority(0/0),in_port(ovs-nm1),skb_mark(0/0),ct_state(0/0),
    ct_zone(0/0),ct_mark(0/0),ct_label(0/0),
    eth(src=00:00:00:00:00:00/00:00:00:00:00:00,
    dst=00:00:00:00:00:00/00:00:00:00:00:00),
    eth_type(0x86dd),
    ipv6(src=::/::,dst=::/::,label=0/0,proto=58,tclass=0/0,hlimit=0/0,frag=no),
    icmpv6(type=135,code=0),
    nd(target=2001::2/::,
    sll=00:00:00:00:00:00/00:00:00:00:00:00,
    tll=00:00:00:00:00:00/00:00:00:00:00:00),
    packets:10, bytes:860, used:0.504s, dp:ovs, actions:ovs-nm2

    Fixes: e64457191a25 (openvswitch: Restructure datapath.c and flow.c)
    Signed-off-by: Martin Varghese
    Link: https://lore.kernel.org/r/20220328054148.3057-1-martinvarghesenokia@gmail.com
    Signed-off-by: Paolo Abeni
    Signed-off-by: Greg Kroah-Hartman

    Martin Varghese
     
  • [ Upstream commit 408bdcfce8dfd6902f75fbcd3b99d8b24b506597 ]

    Its the same as nf_conntrack_put(), but without the
    need for an indirect call. The downside is a module dependency on
    nf_conntrack, but all of these already depend on conntrack anyway.

    Cc: Paul Blakey
    Cc: dev@openvswitch.org
    Signed-off-by: Florian Westphal
    Signed-off-by: Pablo Neira Ayuso
    Signed-off-by: Sasha Levin

    Florian Westphal
     
  • [ Upstream commit 60b44ca6bd7518dd38fa2719bc9240378b6172c3 ]

    During NAT, a tuple collision may occur. When this happens, openvswitch
    will make a second pass through NAT which will perform additional packet
    modification. This will update the skb data, but not the flow key that
    OVS uses. This means that future flow lookups, and packet matches will
    have incorrect data. This has been supported since
    5d50aa83e2c8 ("openvswitch: support asymmetric conntrack").

    That commit failed to properly update the sw_flow_key attributes, since
    it only called the ovs_ct_nat_update_key once, rather than each time
    ovs_ct_nat_execute was called. As these two operations are linked, the
    ovs_ct_nat_execute() function should always make sure that the
    sw_flow_key is updated after a successful call through NAT infrastructure.

    Fixes: 5d50aa83e2c8 ("openvswitch: support asymmetric conntrack")
    Cc: Dumitru Ceara
    Cc: Numan Siddique
    Signed-off-by: Aaron Conole
    Acked-by: Eelco Chaudron
    Link: https://lore.kernel.org/r/20220318124319.3056455-1-aconole@redhat.com
    Signed-off-by: Jakub Kicinski
    Signed-off-by: Sasha Levin

    Aaron Conole
     

02 Mar, 2022

1 commit

  • commit d9b5ae5c1b241b91480aa30408be12fe91af834a upstream.

    Ipv6 ttl, label and tos fields are modified without first
    pulling/pushing the ipv6 header, which would have updated
    the hw csum (if available). This might cause csum validation
    when sending the packet to the stack, as can be seen in
    the trace below.

    Fix this by updating skb->csum if available.

    Trace resulted by ipv6 ttl dec and then sending packet
    to conntrack [actions: set(ipv6(hlimit=63)),ct(zone=99)]:
    [295241.900063] s_pf0vf2: hw csum failure
    [295241.923191] Call Trace:
    [295241.925728]
    [295241.927836] dump_stack+0x5c/0x80
    [295241.931240] __skb_checksum_complete+0xac/0xc0
    [295241.935778] nf_conntrack_tcp_packet+0x398/0xba0 [nf_conntrack]
    [295241.953030] nf_conntrack_in+0x498/0x5e0 [nf_conntrack]
    [295241.958344] __ovs_ct_lookup+0xac/0x860 [openvswitch]
    [295241.968532] ovs_ct_execute+0x4a7/0x7c0 [openvswitch]
    [295241.979167] do_execute_actions+0x54a/0xaa0 [openvswitch]
    [295242.001482] ovs_execute_actions+0x48/0x100 [openvswitch]
    [295242.006966] ovs_dp_process_packet+0x96/0x1d0 [openvswitch]
    [295242.012626] ovs_vport_receive+0x6c/0xc0 [openvswitch]
    [295242.028763] netdev_frame_hook+0xc0/0x180 [openvswitch]
    [295242.034074] __netif_receive_skb_core+0x2ca/0xcb0
    [295242.047498] netif_receive_skb_internal+0x3e/0xc0
    [295242.052291] napi_gro_receive+0xba/0xe0
    [295242.056231] mlx5e_handle_rx_cqe_mpwrq_rep+0x12b/0x250 [mlx5_core]
    [295242.062513] mlx5e_poll_rx_cq+0xa0f/0xa30 [mlx5_core]
    [295242.067669] mlx5e_napi_poll+0xe1/0x6b0 [mlx5_core]
    [295242.077958] net_rx_action+0x149/0x3b0
    [295242.086762] __do_softirq+0xd7/0x2d6
    [295242.090427] irq_exit+0xf7/0x100
    [295242.093748] do_IRQ+0x7f/0xd0
    [295242.096806] common_interrupt+0xf/0xf
    [295242.100559]
    [295242.102750] RIP: 0033:0x7f9022e88cbd
    [295242.125246] RSP: 002b:00007f9022282b20 EFLAGS: 00000246 ORIG_RAX: ffffffffffffffda
    [295242.132900] RAX: 0000000000000005 RBX: 0000000000000010 RCX: 0000000000000000
    [295242.140120] RDX: 00007f9022282ba8 RSI: 00007f9022282a30 RDI: 00007f9014005c30
    [295242.147337] RBP: 00007f9014014d60 R08: 0000000000000020 R09: 00007f90254a8340
    [295242.154557] R10: 00007f9022282a28 R11: 0000000000000246 R12: 0000000000000000
    [295242.161775] R13: 00007f902308c000 R14: 000000000000002b R15: 00007f9022b71f40

    Fixes: 3fdbd1ce11e5 ("openvswitch: add ipv6 'set' action")
    Signed-off-by: Paul Blakey
    Link: https://lore.kernel.org/r/20220223163416.24096-1-paulb@nvidia.com
    Signed-off-by: Jakub Kicinski
    Signed-off-by: Greg Kroah-Hartman

    Paul Blakey
     

27 Jan, 2022

2 commits

  • [ Upstream commit 6f022c2ddbcefaee79502ce5386dfe351d457070 ]

    Netfilter conntrack maintains NAT flags per connection indicating
    whether NAT was configured for the connection. Openvswitch maintains
    NAT flags on the per packet flow key ct_state field, indicating
    whether NAT was actually executed on the packet.

    When a packet misses from tc to ovs the conntrack NAT flags are set.
    However, NAT was not necessarily executed on the packet because the
    connection's state might still be in NEW state. As such, openvswitch
    wrongly assumes that NAT was executed and sets an incorrect flow key
    NAT flags.

    Fix this, by flagging to openvswitch which NAT was actually done in
    act_ct via tc_skb_ext and tc_skb_cb to the openvswitch module, so
    the packet flow key NAT flags will be correctly set.

    Fixes: b57dc7c13ea9 ("net/sched: Introduce action ct")
    Signed-off-by: Paul Blakey
    Acked-by: Jamal Hadi Salim
    Link: https://lore.kernel.org/r/20220106153804.26451-1-paulb@nvidia.com
    Signed-off-by: Jakub Kicinski
    Signed-off-by: Sasha Levin

    Paul Blakey
     
  • [ Upstream commit 635d448a1cce4b4ebee52b351052c70434fa90ea ]

    Zone id is not restored if we passed ct and ct rejected the connection,
    as there is no ct info on the skb.

    Save the zone from tc skb cb to tc skb extension and pass it on to
    ovs, use that info to restore the zone id for invalid connections.

    Fixes: d29334c15d33 ("net/sched: act_api: fix miss set post_ct for ovs after do conntrack in act_ct")
    Signed-off-by: Paul Blakey
    Signed-off-by: Jakub Kicinski
    Signed-off-by: Sasha Levin

    Paul Blakey
     

20 Aug, 2021

1 commit


18 Aug, 2021

1 commit

  • fq qdisc requires tstamp to be cleared in the forwarding path. Now ovs
    doesn't clear skb->tstamp. We encountered a problem with linux
    version 5.4.56 and ovs version 2.14.1, and packets failed to
    dequeue from qdisc when fq qdisc was attached to ovs port.

    Fixes: fb420d5d91c1 ("tcp/fq: move back to CLOCK_MONOTONIC")
    Signed-off-by: kaixi.fan
    Signed-off-by: xiexiaohui
    Reviewed-by: Cong Wang
    Signed-off-by: David S. Miller

    kaixi.fan
     

13 Aug, 2021

1 commit

  • Conflicts:

    drivers/net/ethernet/broadcom/bnxt/bnxt_ptp.h
    9e26680733d5 ("bnxt_en: Update firmware call to retrieve TX PTP timestamp")
    9e518f25802c ("bnxt_en: 1PPS functions to configure TSIO pins")
    099fdeda659d ("bnxt_en: Event handler for PPS events")

    kernel/bpf/helpers.c
    include/linux/bpf-cgroup.h
    a2baf4e8bb0f ("bpf: Fix potentially incorrect results with bpf_get_local_storage()")
    c7603cfa04e7 ("bpf: Add ambient BPF runtime context stored in current")

    drivers/net/ethernet/mellanox/mlx5/core/pci_irq.c
    5957cc557dc5 ("net/mlx5: Set all field of mlx5_irq before inserting it to the xarray")
    2d0b41a37679 ("net/mlx5: Refcount mlx5_irq with integer")

    MAINTAINERS
    7b637cd52f02 ("MAINTAINERS: fix Microchip CAN BUS Analyzer Tool entry typo")
    7d901a1e878a ("net: phy: add Maxlinear GPY115/21x/24x driver")

    Signed-off-by: Jakub Kicinski

    Jakub Kicinski
     

10 Aug, 2021

1 commit

  • Repair kernel-doc notation in a few places to make it conform to
    the expected format.

    Fixes the following kernel-doc warnings:

    flow.c:296: warning: This comment starts with '/**', but isn't a kernel-doc comment. Refer Documentation/doc-guide/kernel-doc.rst
    * Parse vlan tag from vlan header.
    flow.c:296: warning: missing initial short description on line:
    * Parse vlan tag from vlan header.
    flow.c:537: warning: No description found for return value of 'key_extract_l3l4'
    flow.c:769: warning: No description found for return value of 'key_extract'

    Signed-off-by: Randy Dunlap
    Cc: Pravin B Shelar
    Cc: dev@openvswitch.org
    Link: https://lore.kernel.org/r/20210808190834.23362-1-rdunlap@infradead.org
    Signed-off-by: Jakub Kicinski

    Randy Dunlap
     

27 Jul, 2021

2 commits

  • fix incorrect type in argument 1 (different address spaces)

    ../net/openvswitch/datapath.c:169:17: warning: incorrect type in argument 1 (different address spaces)
    ../net/openvswitch/datapath.c:169:17: expected void const *
    ../net/openvswitch/datapath.c:169:17: got struct dp_nlsk_pids [noderef] __rcu *upcall_portids

    Found at: https://patchwork.kernel.org/project/netdevbpf/patch/20210630095350.817785-1-mark.d.gray@redhat.com/#24285159

    Signed-off-by: Mark Gray
    Signed-off-by: David S. Miller

    Mark Gray
     
  • Signed-off-by: Mark Gray
    Signed-off-by: David S. Miller

    Mark Gray
     

17 Jul, 2021

1 commit

  • The Open vSwitch kernel module uses the upcall mechanism to send
    packets from kernel space to user space when it misses in the kernel
    space flow table. The upcall sends packets via a Netlink socket.
    Currently, a Netlink socket is created for every vport. In this way,
    there is a 1:1 mapping between a vport and a Netlink socket.
    When a packet is received by a vport, if it needs to be sent to
    user space, it is sent via the corresponding Netlink socket.

    This mechanism, with various iterations of the corresponding user
    space code, has seen some limitations and issues:

    * On systems with a large number of vports, there is a correspondingly
    large number of Netlink sockets which can limit scaling.
    (https://bugzilla.redhat.com/show_bug.cgi?id=1526306)
    * Packet reordering on upcalls.
    (https://bugzilla.redhat.com/show_bug.cgi?id=1844576)
    * A thundering herd issue.
    (https://bugzilla.redhat.com/show_bug.cgi?id=1834444)

    This patch introduces an alternative, feature-negotiated, upcall
    mode using a per-cpu dispatch rather than a per-vport dispatch.

    In this mode, the Netlink socket to be used for the upcall is
    selected based on the CPU of the thread that is executing the upcall.
    In this way, it resolves the issues above as:

    a) The number of Netlink sockets scales with the number of CPUs
    rather than the number of vports.
    b) Ordering per-flow is maintained as packets are distributed to
    CPUs based on mechanisms such as RSS and flows are distributed
    to a single user space thread.
    c) Packets from a flow can only wake up one user space thread.

    The corresponding user space code can be found at:
    https://mail.openvswitch.org/pipermail/ovs-dev/2021-July/385139.html

    Bugzilla: https://bugzilla.redhat.com/1844576
    Signed-off-by: Mark Gray
    Acked-by: Flavio Leitner
    Acked-by: Pravin B Shelar
    Signed-off-by: David S. Miller

    Mark Gray
     

02 Jul, 2021

1 commit

  • In the current implement when comparing two flow keys, we will return
    result after comparing the whole key from start to end.

    In our optimization, we will return result in the first none-zero
    comparison, then we will improve the flow table looking up efficiency.

    Signed-off-by: Baowen Zheng
    Signed-off-by: Louis Peens
    Signed-off-by: Simon Horman
    Signed-off-by: David S. Miller

    Baowen Zheng
     

23 Jun, 2021

1 commit

  • This makes openvswitch module use the event tracing framework
    to log the upcall interface and action execution pipeline. When
    using openvswitch as the packet forwarding engine, some types of
    debugging are made possible simply by using the ovs-vswitchd's
    ofproto/trace command. However, such a command has some
    limitations:

    1. When trying to trace packets that go through the CT action,
    the state of the packet can't be determined, and probably
    would be potentially wrong.

    2. Deducing problem packets can sometimes be difficult as well
    even if many of the flows are known

    3. It's possible to use the openvswitch module even without
    the ovs-vswitchd (although, not common use).

    Introduce the event tracing points here to make it possible for
    working through these problems in kernel space. The style is
    copied from the mac80211 driver-trace / trace code for
    consistency - this creates some checkpatch splats, but the
    official 'guide' for adding tracepoints, as well as the existing
    examples all add the same splats so it seems acceptable.

    Signed-off-by: Aaron Conole
    Signed-off-by: David S. Miller

    Aaron Conole
     

28 May, 2021

1 commit


14 May, 2021

1 commit

  • We have observed meters working unexpected if traffic is 3+Gbit/s
    with multiple connections.

    now_ms is not pretected by meter->lock, we may get a negative
    long_delta_ms when another cpu updated meter->used, then:
    delta_ms = (u32)long_delta_ms;
    which will be a large value.

    band->bucket += delta_ms * band->rate;
    then we get a wrong band->bucket.

    OpenVswitch userspace datapath has fixed the same issue[1] some
    time ago, and we port the implementation to kernel datapath.

    [1] https://patchwork.ozlabs.org/project/openvswitch/patch/20191025114436.9746-1-i.maximets@ovn.org/

    Fixes: 96fbc13d7e77 ("openvswitch: Add meter infrastructure")
    Signed-off-by: Tao Liu
    Suggested-by: Ilya Maximets
    Reviewed-by: Ilya Maximets
    Signed-off-by: David S. Miller

    Tao Liu
     

11 May, 2021

1 commit


30 Apr, 2021

1 commit

  • running openvswitch on kernels built with KASAN, it's possible to see the
    following splat while testing fragmentation of IPv4 packets:

    BUG: KASAN: stack-out-of-bounds in ip_do_fragment+0x1b03/0x1f60
    Read of size 1 at addr ffff888112fc713c by task handler2/1367

    CPU: 0 PID: 1367 Comm: handler2 Not tainted 5.12.0-rc6+ #418
    Hardware name: Red Hat KVM, BIOS 1.11.1-4.module+el8.1.0+4066+0f1aadab 04/01/2014
    Call Trace:
    dump_stack+0x92/0xc1
    print_address_description.constprop.7+0x1a/0x150
    kasan_report.cold.13+0x7f/0x111
    ip_do_fragment+0x1b03/0x1f60
    ovs_fragment+0x5bf/0x840 [openvswitch]
    do_execute_actions+0x1bd5/0x2400 [openvswitch]
    ovs_execute_actions+0xc8/0x3d0 [openvswitch]
    ovs_packet_cmd_execute+0xa39/0x1150 [openvswitch]
    genl_family_rcv_msg_doit.isra.15+0x227/0x2d0
    genl_rcv_msg+0x287/0x490
    netlink_rcv_skb+0x120/0x380
    genl_rcv+0x24/0x40
    netlink_unicast+0x439/0x630
    netlink_sendmsg+0x719/0xbf0
    sock_sendmsg+0xe2/0x110
    ____sys_sendmsg+0x5ba/0x890
    ___sys_sendmsg+0xe9/0x160
    __sys_sendmsg+0xd3/0x170
    do_syscall_64+0x33/0x40
    entry_SYSCALL_64_after_hwframe+0x44/0xae
    RIP: 0033:0x7f957079db07
    Code: c3 66 90 41 54 41 89 d4 55 48 89 f5 53 89 fb 48 83 ec 10 e8 eb ec ff ff 44 89 e2 48 89 ee 89 df 41 89 c0 b8 2e 00 00 00 0f 05 3d 00 f0 ff ff 77 35 44 89 c7 48 89 44 24 08 e8 24 ed ff ff 48
    RSP: 002b:00007f956ce35a50 EFLAGS: 00000293 ORIG_RAX: 000000000000002e
    RAX: ffffffffffffffda RBX: 0000000000000019 RCX: 00007f957079db07
    RDX: 0000000000000000 RSI: 00007f956ce35ae0 RDI: 0000000000000019
    RBP: 00007f956ce35ae0 R08: 0000000000000000 R09: 00007f9558006730
    R10: 0000000000000000 R11: 0000000000000293 R12: 0000000000000000
    R13: 00007f956ce37308 R14: 00007f956ce35f80 R15: 00007f956ce35ae0

    The buggy address belongs to the page:
    page:00000000af2a1d93 refcount:0 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x112fc7
    flags: 0x17ffffc0000000()
    raw: 0017ffffc0000000 0000000000000000 dead000000000122 0000000000000000
    raw: 0000000000000000 0000000000000000 00000000ffffffff 0000000000000000
    page dumped because: kasan: bad access detected

    addr ffff888112fc713c is located in stack of task handler2/1367 at offset 180 in frame:
    ovs_fragment+0x0/0x840 [openvswitch]

    this frame has 2 objects:
    [32, 144) 'ovs_dst'
    [192, 424) 'ovs_rt'

    Memory state around the buggy address:
    ffff888112fc7000: f3 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
    ffff888112fc7080: 00 f1 f1 f1 f1 00 00 00 00 00 00 00 00 00 00 00
    >ffff888112fc7100: 00 00 00 f2 f2 f2 f2 f2 f2 00 00 00 00 00 00 00
    ^
    ffff888112fc7180: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
    ffff888112fc7200: 00 00 00 00 00 00 f2 f2 f2 00 00 00 00 00 00 00

    for IPv4 packets, ovs_fragment() uses a temporary struct dst_entry. Then,
    in the following call graph:

    ip_do_fragment()
    ip_skb_dst_mtu()
    ip_dst_mtu_maybe_forward()
    ip_mtu_locked()

    the pointer to struct dst_entry is used as pointer to struct rtable: this
    turns the access to struct members like rt_mtu_locked into an OOB read in
    the stack. Fix this changing the temporary variable used for IPv4 packets
    in ovs_fragment(), similarly to what is done for IPv6 few lines below.

    Fixes: d52e5a7e7ca4 ("ipv4: lock mtu in fnhe when received PMTU < net.ipv4.route.min_pmt")
    Cc:
    Acked-by: Eelco Chaudron
    Signed-off-by: Davide Caratti
    Signed-off-by: David S. Miller

    Davide Caratti
     

27 Apr, 2021

1 commit


24 Apr, 2021

1 commit

  • Implementation of meters supposed to be a classic token bucket with 2
    typical parameters: rate and burst size.

    Burst size in this schema is the maximum number of bytes/packets that
    could pass without being rate limited.

    Recent changes to userspace datapath made meter implementation to be
    in line with the kernel one, and this uncovered several issues.

    The main problem is that maximum bucket size for unknown reason
    accounts not only burst size, but also the numerical value of rate.
    This creates a lot of confusion around behavior of meters.

    For example, if rate is configured as 1000 pps and burst size set to 1,
    this should mean that meter will tolerate bursts of 1 packet at most,
    i.e. not a single packet above the rate should pass the meter.
    However, current implementation calculates maximum bucket size as
    (rate + burst size), so the effective bucket size will be 1001. This
    means that first 1000 packets will not be rate limited and average
    rate might be twice as high as the configured rate. This also makes
    it practically impossible to configure meter that will have burst size
    lower than the rate, which might be a desirable configuration if the
    rate is high.

    Inability to configure low values of a burst size and overall inability
    for a user to predict what will be a maximum and average rate from the
    configured parameters of a meter without looking at the OVS and kernel
    code might be also classified as a security issue, because drop meters
    are frequently used as a way of protection from DoS attacks.

    This change removes rate from the calculation of a bucket size, making
    it in line with the classic token bucket algorithm and essentially
    making the rate and burst tolerance being predictable from a users'
    perspective.

    Same change proposed for the userspace implementation.

    Fixes: 96fbc13d7e77 ("openvswitch: Add meter infrastructure")
    Signed-off-by: Ilya Maximets
    Signed-off-by: David S. Miller

    Ilya Maximets
     

10 Apr, 2021

1 commit

  • Conflicts:

    MAINTAINERS
    - keep Chandrasekar
    drivers/net/ethernet/mellanox/mlx5/core/en_main.c
    - simple fix + trust the code re-added to param.c in -next is fine
    include/linux/bpf.h
    - trivial
    include/linux/ethtool.h
    - trivial, fix kdoc while at it
    include/linux/skmsg.h
    - move to relevant place in tcp.c, comment re-wrapped
    net/core/skmsg.c
    - add the sk = sk // sk = NULL around calls
    net/tipc/crypto.c
    - trivial

    Signed-off-by: Jakub Kicinski

    Jakub Kicinski
     

06 Apr, 2021

1 commit

  • 'struct ovs_zone_limit' has more members than initialized in
    ovs_ct_limit_get_default_limit(). The rest of the memory is a random
    kernel stack content that ends up being sent to userspace.

    Fix that by using designated initializer that will clear all
    non-specified fields.

    Fixes: 11efd5cb04a1 ("openvswitch: Support conntrack zone limit")
    Signed-off-by: Ilya Maximets
    Acked-by: Tonghao Zhang
    Signed-off-by: David S. Miller

    Ilya Maximets
     

04 Apr, 2021

1 commit


26 Mar, 2021

1 commit


23 Mar, 2021

1 commit


17 Mar, 2021

2 commits

  • It is not unusual to have the bridge port down. Sometimes
    it has the old MTU, which is fine since it's not being used.

    However, the kernel spams the log with a warning message
    when a packet is going to be sent over such port. Fix that
    by warning only if the interface is UP.

    Signed-off-by: Flavio Leitner
    Signed-off-by: David S. Miller

    Flavio Leitner
     
  • When openvswitch conntrack offload with act_ct action. The first rule
    do conntrack in the act_ct in tc subsystem. And miss the next rule in
    the tc and fallback to the ovs datapath but miss set post_ct flag
    which will lead the ct_state_key with -trk flag.

    Fixes: 7baf2429a1a9 ("net/sched: cls_flower add CT_FLAGS_INVALID flag support")
    Signed-off-by: wenxu
    Reviewed-by: Marcelo Ricardo Leitner
    Signed-off-by: David S. Miller

    wenxu
     

15 Jan, 2021

1 commit

  • As requested by upstream OVS, added some error messages in the
    validate_and_copy_dec_ttl function.

    Includes a small cleanup, which removes an unnecessary parameter
    from the dec_ttl_exception_handler() function.

    Reported-by: Flavio Leitner
    Signed-off-by: Eelco Chaudron
    Acked-by: Flavio Leitner
    Link: https://lore.kernel.org/r/161054576573.26637.18396634650212670580.stgit@ebuild
    Signed-off-by: Jakub Kicinski

    Eelco Chaudron
     

15 Dec, 2020

1 commit

  • Currently, the exception actions are not processed correctly as the wrong
    dataset is passed. This change fixes this, including the misleading
    comment.

    In addition, a check was added to make sure we work on an IPv4 packet,
    and not just assume if it's not IPv6 it's IPv4.

    This was all tested using OVS with patch,
    https://patchwork.ozlabs.org/project/openvswitch/list/?series=21639,
    applied and sending packets with a TTL of 1 (and 0), both with IPv4
    and IPv6.

    Fixes: 69929d4c49e1 ("net: openvswitch: fix TTL decrement action netlink message format")
    Signed-off-by: Eelco Chaudron
    Link: https://lore.kernel.org/r/160733569860.3007.12938188180387116741.stgit@wsfd-netdev64.ntdv.lab.eng.bos.redhat.com
    Signed-off-by: Jakub Kicinski

    Eelco Chaudron
     

12 Dec, 2020

1 commit


09 Dec, 2020

1 commit


05 Dec, 2020

1 commit

  • Fix to return a negative error code from the error handling
    case instead of 0, as done elsewhere in this function.

    Changing 'return start' to 'return action_start' can fix this bug.

    Fixes: 69929d4c49e1 ("net: openvswitch: fix TTL decrement action netlink message format")
    Reported-by: Hulk Robot
    Signed-off-by: Wang Hai
    Reviewed-by: Eelco Chaudron
    Link: https://lore.kernel.org/r/20201204114314.1596-1-wanghai38@huawei.com
    Signed-off-by: Jakub Kicinski

    Wang Hai