Eric Lee / smarc-fsl-linux-kernel

19 May, 2018

1 commit

c1ce5f359 openvswitch: Don't swap table in nlattr_set() after OVS_ATTR_NESTED is found ... Browse Code »

[ Upstream commit 72f17baf2352ded6a1d3f4bb2d15da8c678cd2cb ]

If an OVS_ATTR_NESTED attribute type is found while walking
through netlink attributes, we call nlattr_set() recursively
passing the length table for the following nested attributes, if
different from the current one.

However, once we're done with those sub-nested attributes, we
should continue walking through attributes using the current
table, instead of using the one related to the sub-nested
attributes.

For example, given this sequence:

1 OVS_KEY_ATTR_PRIORITY
2 OVS_KEY_ATTR_TUNNEL
3 OVS_TUNNEL_KEY_ATTR_ID
4 OVS_TUNNEL_KEY_ATTR_IPV4_SRC
5 OVS_TUNNEL_KEY_ATTR_IPV4_DST
6 OVS_TUNNEL_KEY_ATTR_TTL
7 OVS_TUNNEL_KEY_ATTR_TP_SRC
8 OVS_TUNNEL_KEY_ATTR_TP_DST
9 OVS_KEY_ATTR_IN_PORT
10 OVS_KEY_ATTR_SKB_MARK
11 OVS_KEY_ATTR_MPLS

we switch to the 'ovs_tunnel_key_lens' table on attribute #3,
and we don't switch back to 'ovs_key_lens' while setting
attributes #9 to #11 in the sequence. As OVS_KEY_ATTR_MPLS
evaluates to 21, and the array size of 'ovs_tunnel_key_lens' is
15, we also get this kind of KASan splat while accessing the
wrong table:

[ 7654.586496] ==================================================================
[ 7654.594573] BUG: KASAN: global-out-of-bounds in nlattr_set+0x164/0xde9 [openvswitch]
[ 7654.603214] Read of size 4 at addr ffffffffc169ecf0 by task handler29/87430
[ 7654.610983]
[ 7654.612644] CPU: 21 PID: 87430 Comm: handler29 Kdump: loaded Not tainted 3.10.0-866.el7.test.x86_64 #1
[ 7654.623030] Hardware name: Dell Inc. PowerEdge R730/072T6D, BIOS 2.1.7 06/16/2016
[ 7654.631379] Call Trace:
[ 7654.634108] [] dump_stack+0x19/0x1b
[ 7654.639843] [] print_address_description+0x33/0x290
[ 7654.647129] [] ? nlattr_set+0x164/0xde9 [openvswitch]
[ 7654.654607] [] kasan_report.part.3+0x242/0x330
[ 7654.661406] [] __asan_report_load4_noabort+0x34/0x40
[ 7654.668789] [] nlattr_set+0x164/0xde9 [openvswitch]
[ 7654.676076] [] ovs_nla_get_match+0x10c8/0x1900 [openvswitch]
[ 7654.684234] [] ? genl_rcv+0x28/0x40
[ 7654.689968] [] ? netlink_unicast+0x3f3/0x590
[ 7654.696574] [] ? ovs_nla_put_tunnel_info+0xb0/0xb0 [openvswitch]
[ 7654.705122] [] ? unwind_get_return_address+0xb0/0xb0
[ 7654.712503] [] ? system_call_fastpath+0x1c/0x21
[ 7654.719401] [] ? update_stack_state+0x229/0x370
[ 7654.726298] [] ? update_stack_state+0x229/0x370
[ 7654.733195] [] ? kasan_unpoison_shadow+0x35/0x50
[ 7654.740187] [] ? kasan_kmalloc+0xaa/0xe0
[ 7654.746406] [] ? kasan_slab_alloc+0x12/0x20
[ 7654.752914] [] ? memset+0x31/0x40
[ 7654.758456] [] ovs_flow_cmd_new+0x2b2/0xf00 [openvswitch]

[snip]

[ 7655.132484] The buggy address belongs to the variable:
[ 7655.138226] ovs_tunnel_key_lens+0xf0/0xffffffffffffd400 [openvswitch]
[ 7655.145507]
[ 7655.147166] Memory state around the buggy address:
[ 7655.152514] ffffffffc169eb80: 00 00 00 00 00 00 00 00 00 00 fa fa fa fa fa fa
[ 7655.160585] ffffffffc169ec00: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
[ 7655.168644] >ffffffffc169ec80: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 fa fa
[ 7655.176701] ^
[ 7655.184372] ffffffffc169ed00: fa fa fa fa 00 00 00 00 fa fa fa fa 00 00 00 05
[ 7655.192431] ffffffffc169ed80: fa fa fa fa 00 00 00 00 00 00 00 00 00 00 00 00
[ 7655.200490] ==================================================================

Reported-by: Hangbin Liu
Fixes: 982b52700482 ("openvswitch: Fix mask generation for nested attributes.")
Signed-off-by: Stefano Brivio
Reviewed-by: Sabrina Dubroca
Signed-off-by: David S. Miller
Signed-off-by: Greg Kroah-Hartman

Stefano Brivio
7 years ago

26 Apr, 2018

1 commit

6eddea4ba openvswitch: Remove padding from packet before L3+ conntrack processing ... Browse Code »

[ Upstream commit 9382fe71c0058465e942a633869629929102843d ]

IPv4 and IPv6 packets may arrive with lower-layer padding that is not
included in the L3 length. For example, a short IPv4 packet may have
up to 6 bytes of padding following the IP payload when received on an
Ethernet device with a minimum packet length of 64 bytes.

Higher-layer processing functions in netfilter (e.g. nf_ip_checksum(),
and help() in nf_conntrack_ftp) assume skb->len reflects the length of
the L3 header and payload, rather than referring back to
ip_hdr->tot_len or ipv6_hdr->payload_len, and get confused by
lower-layer padding.

In the normal IPv4 receive path, ip_rcv() trims the packet to
ip_hdr->tot_len before invoking netfilter hooks. In the IPv6 receive
path, ip6_rcv() does the same using ipv6_hdr->payload_len. Similarly
in the br_netfilter receive path, br_validate_ipv4() and
br_validate_ipv6() trim the packet to the L3 length before invoking
netfilter hooks.

Currently in the OVS conntrack receive path, ovs_ct_execute() pulls
the skb to the L3 header but does not trim it to the L3 length before
calling nf_conntrack_in(NF_INET_PRE_ROUTING). When
nf_conntrack_proto_tcp encounters a packet with lower-layer padding,
nf_ip_checksum() fails causing a "nf_ct_tcp: bad TCP checksum" log
message. While extra zero bytes don't affect the checksum, the length
in the IP pseudoheader does. That length is based on skb->len, and
without trimming, it doesn't match the length the sender used when
computing the checksum.

In ovs_ct_execute(), trim the skb to the L3 length before higher-layer
processing.

Signed-off-by: Ed Swierk
Acked-by: Pravin B Shelar
Signed-off-by: David S. Miller
Signed-off-by: Sasha Levin
Signed-off-by: Greg Kroah-Hartman

Ed Swierk
7 years ago

04 Feb, 2018

1 commit

9be97a9ab openvswitch: fix the incorrect flow action alloc size ... Browse Code »

[ Upstream commit 67c8d22a73128ff910e2287567132530abcf5b71 ]

If we want to add a datapath flow, which has more than 500 vxlan outputs'
action, we will get the following error reports:
openvswitch: netlink: Flow action size 32832 bytes exceeds max
openvswitch: netlink: Flow action size 32832 bytes exceeds max
openvswitch: netlink: Actions may not be safe on all matching packets
... ...

It seems that we can simply enlarge the MAX_ACTIONS_BUFSIZE to fix it, but
this is not the root cause. For example, for a vxlan output action, we need
about 60 bytes for the nlattr, but after it is converted to the flow
action, it only occupies 24 bytes. This means that we can still support
more than 1000 vxlan output actions for a single datapath flow under the
the current 32k max limitation.

So even if the nla_len(attr) is larger than MAX_ACTIONS_BUFSIZE, we
shouldn't report EINVAL and keep it move on, as the judgement can be
done by the reserve_sfa_size.

Signed-off-by: zhangliping
Acked-by: Pravin B Shelar
Signed-off-by: David S. Miller
Signed-off-by: Sasha Levin
Signed-off-by: Greg Kroah-Hartman

zhangliping
7 years ago

03 Jan, 2018

1 commit

a6cc63e12 openvswitch: Fix pop_vlan action for double tagged frames ... Browse Code »

[ Upstream commit c48e74736fccf25fb32bb015426359e1c2016e3b ]

skb_vlan_pop() expects skb->protocol to be a valid TPID for double
tagged frames. So set skb->protocol to the TPID and let skb_vlan_pop()
shift the true ethertype into position for us.

Fixes: 5108bbaddc37 ("openvswitch: add processing of L3 packets")
Signed-off-by: Eric Garver
Reviewed-by: Jiri Benc
Signed-off-by: David S. Miller
Signed-off-by: Greg Kroah-Hartman

Eric Garver
8 years ago

17 Dec, 2017

2 commits

87ff3fb30 net: openvswitch: datapath: fix data type in queue_gso_packets ... Browse Code »

[ Upstream commit 2734166e89639c973c6e125ac8bcfc2d9db72b70 ]

gso_type is being used in binary AND operations together with SKB_GSO_UDP.
The issue is that variable gso_type is of type unsigned short and
SKB_GSO_UDP expands to more than 16 bits:

SKB_GSO_UDP = 1 << 16

this makes any binary AND operation between gso_type and SKB_GSO_UDP to
be always zero, hence making some code unreachable and likely causing
undesired behavior.

Fix this by changing the data type of variable gso_type to unsigned int.

Addresses-Coverity-ID: 1462223
Fixes: 0c19f846d582 ("net: accept UFO datagrams from tuntap and packet")
Signed-off-by: Gustavo A. R. Silva
Acked-by: Willem de Bruijn
Signed-off-by: David S. Miller
Signed-off-by: Greg Kroah-Hartman

Gustavo A. R. Silva
8 years ago
60335608e net: accept UFO datagrams from tuntap and packet ... Browse Code »

[ Upstream commit 0c19f846d582af919db66a5914a0189f9f92c936 ]

Tuntap and similar devices can inject GSO packets. Accept type
VIRTIO_NET_HDR_GSO_UDP, even though not generating UFO natively.

Processes are expected to use feature negotiation such as TUNSETOFFLOAD
to detect supported offload types and refrain from injecting other
packets. This process breaks down with live migration: guest kernels
do not renegotiate flags, so destination hosts need to expose all
features that the source host does.

Partially revert the UFO removal from 182e0b6b5846~1..d9d30adf5677.
This patch introduces nearly(*) no new code to simplify verification.
It brings back verbatim tuntap UFO negotiation, VIRTIO_NET_HDR_GSO_UDP
insertion and software UFO segmentation.

It does not reinstate protocol stack support, hardware offload
(NETIF_F_UFO), SKB_GSO_UDP tunneling in SKB_GSO_SOFTWARE or reception
of VIRTIO_NET_HDR_GSO_UDP packets in tuntap.

To support SKB_GSO_UDP reappearing in the stack, also reinstate
logic in act_csum and openvswitch. Achieve equivalence with v4.13 HEAD
by squashing in commit 939912216fa8 ("net: skb_needs_check() removes
CHECKSUM_UNNECESSARY check for tx.") and reverting commit 8d63bee643f1
("net: avoid skb_warn_bad_offload false positives on UFO").

(*) To avoid having to bring back skb_shinfo(skb)->ip6_frag_id,
ipv6_proxy_select_ident is changed to return a __be32 and this is
assigned directly to the frag_hdr. Also, SKB_GSO_UDP is inserted
at the end of the enum to minimize code churn.

Tested
Booted a v4.13 guest kernel with QEMU. On a host kernel before this
patch `ethtool -k eth0` shows UFO disabled. After the patch, it is
enabled, same as on a v4.13 host kernel.

A UFO packet sent from the guest appears on the tap device:
host:
nc -l -p -u 8000 &
tcpdump -n -i tap0

guest:
dd if=/dev/zero of=payload.txt bs=1 count=2000
nc -u 192.16.1.1 8000 < payload.txt

Direct tap to tap transmission of VIRTIO_NET_HDR_GSO_UDP succeeds,
packets arriving fragmented:

./with_tap_pair.sh ./tap_send_ufo tap0 tap1
(from https://github.com/wdebruij/kerneltools/tree/master/tests)

Changes
v1 -> v2
- simplified set_offload change (review comment)
- documented test procedure

Link: http://lkml.kernel.org/r/
Fixes: fb652fdfe837 ("macvlan/macvtap: Remove NETIF_F_UFO advertisement.")
Reported-by: Michal Kubecek
Signed-off-by: Willem de Bruijn
Acked-by: Jason Wang
Signed-off-by: David S. Miller
Signed-off-by: Greg Kroah-Hartman

Willem de Bruijn
8 years ago

02 Nov, 2017

1 commit

b24413180 License cleanup: add SPDX GPL-2.0 license identifier to files with no license ... Browse Code »

Many source files in the tree are missing licensing information, which
makes it harder for compliance tools to determine the correct license.

By default all files without license information are under the default
license of the kernel, which is GPL version 2.

Update the files which contain no license information with the 'GPL-2.0'
SPDX license identifier. The SPDX identifier is a legally binding
shorthand, which can be used instead of the full boiler plate text.

This patch is based on work done by Thomas Gleixner and Kate Stewart and
Philippe Ombredanne.

How this work was done:

Patches were generated and checked against linux-4.14-rc6 for a subset of
the use cases:
- file had no licensing information it it.
- file was a */uapi/* one with no licensing information in it,
- file was a */uapi/* one with existing licensing information,

Further patches will be generated in subsequent months to fix up cases
where non-standard license headers were used, and references to license
had to be inferred by heuristics based on keywords.

The analysis to determine which SPDX License Identifier to be applied to
a file was done in a spreadsheet of side by side results from of the
output of two independent scanners (ScanCode & Windriver) producing SPDX
tag:value files created by Philippe Ombredanne. Philippe prepared the
base worksheet, and did an initial spot review of a few 1000 files.

The 4.13 kernel was the starting point of the analysis with 60,537 files
assessed. Kate Stewart did a file by file comparison of the scanner
results in the spreadsheet to determine which SPDX license identifier(s)
to be applied to the file. She confirmed any determination that was not
immediately clear with lawyers working with the Linux Foundation.

Criteria used to select files for SPDX license identifier tagging was:
- Files considered eligible had to be source code files.
- Make and config files were included as candidates if they contained >5
lines of source
- File already had some variant of a license header in it (even if
Reviewed-by: Philippe Ombredanne
Reviewed-by: Thomas Gleixner
Signed-off-by: Greg Kroah-Hartman

Greg Kroah-Hartman
8 years ago

13 Sep, 2017

1 commit

5829e62ac openvswitch: Fix an error handling path in 'ovs_nla_init_match_and_action()' ... Browse Code »

All other error handling paths in this function go through the 'error'
label. This one should do the same.

Fixes: 9cc9a5cb176c ("datapath: Avoid using stack larger than 1024.")
Signed-off-by: Christophe JAILLET
Acked-by: Pravin B Shelar
Signed-off-by: David S. Miller

Christophe JAILLET
8 years ago

04 Sep, 2017

1 commit

b63f6044d Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-next ... Browse Code »

Pablo Neira Ayuso says:

====================
Netfilter updates for net-next

The following patchset contains Netfilter updates for your net-next
tree. Basically, updates to the conntrack core, enhancements for
nf_tables, conversion of netfilter hooks from linked list to array to
improve memory locality and asorted improvements for the Netfilter
codebase. More specifically, they are:

1) Add expection to hashes after timer initialization to prevent
access from another CPU that walks on the hashes and calls
del_timer(), from Florian Westphal.

2) Don't update nf_tables chain counters from hot path, this is only
used by the x_tables compatibility layer.

3) Get rid of nested rcu_read_lock() calls from netfilter hook path.
Hooks are always guaranteed to run from rcu read side, so remove
nested rcu_read_lock() where possible. Patch from Taehee Yoo.

4) nf_tables new ruleset generation notifications include PID and name
of the process that has updated the ruleset, from Phil Sutter.

5) Use skb_header_pointer() from nft_fib, so we can reuse this code from
the nf_family netdev family. Patch from Pablo M. Bermudo.

6) Add support for nft_fib in nf_tables netdev family, also from Pablo.

7) Use deferrable workqueue for conntrack garbage collection, to reduce
power consumption, from Patch from Subash Abhinov Kasiviswanathan.

8) Add nf_ct_expect_iterate_net() helper and use it. From Florian
Westphal.

9) Call nf_ct_unconfirmed_destroy only from cttimeout, from Florian.

10) Drop references on conntrack removal path when skbuffs has escaped via
nfqueue, from Florian.

11) Don't queue packets to nfqueue with dying conntrack, from Florian.

12) Constify nf_hook_ops structure, from Florian.

13) Remove neededlessly branch in nf_tables trace code, from Phil Sutter.

14) Add nla_strdup(), from Phil Sutter.

15) Rise nf_tables objects name size up to 255 chars, people want to use
DNS names, so increase this according to what RFC 1035 specifies.
Patch series from Phil Sutter.

16) Kill nf_conntrack_default_on, it's broken. Default on conntrack hook
registration on demand, suggested by Eric Dumazet, patch from Florian.

17) Remove unused variables in compat_copy_entry_from_user both in
ip_tables and arp_tables code. Patch from Taehee Yoo.

18) Constify struct nf_conntrack_l4proto, from Julia Lawall.

19) Constify nf_loginfo structure, also from Julia.

20) Use a single rb root in connlimit, from Taehee Yoo.

21) Remove unused netfilter_queue_init() prototype, from Taehee Yoo.

22) Use audit_log() instead of open-coding it, from Geliang Tang.

23) Allow to mangle tcp options via nft_exthdr, from Florian.

24) Allow to fetch TCP MSS from nft_rt, from Florian. This includes
a fix for a miscalculation of the minimal length.

25) Simplify branch logic in h323 helper, from Nick Desaulniers.

26) Calculate netlink attribute size for conntrack tuple at compile
time, from Florian.

27) Remove protocol name field from nf_conntrack_{l3,l4}proto structure.
From Florian.

28) Remove holes in nf_conntrack_l4proto structure, so it becomes
smaller. From Florian.

29) Get rid of print_tuple() indirection for /proc conntrack listing.
Place all the code in net/netfilter/nf_conntrack_standalone.c.
Patch from Florian.

30) Do not built in print_conntrack() if CONFIG_NF_CONNTRACK_PROCFS is
off. From Florian.

31) Constify most nf_conntrack_{l3,l4}proto helper functions, from
Florian.

32) Fix broken indentation in ebtables extensions, from Colin Ian King.

33) Fix several harmless sparse warning, from Florian.

34) Convert netfilter hook infrastructure to use array for better memory
locality, joint work done by Florian and Aaron Conole. Moreover, add
some instrumentation to debug this.

35) Batch nf_unregister_net_hooks() calls, to call synchronize_net once
per batch, from Florian.

36) Get rid of noisy logging in ICMPv6 conntrack helper, from Florian.

37) Get rid of obsolete NFDEBUG() instrumentation, from Varsha Rao.

38) Remove unused code in the generic protocol tracker, from Davide
Caratti.

I think I will have material for a second Netfilter batch in my queue if
time allow to make it fit in this merge window.
====================

Signed-off-by: David S. Miller

David S. Miller
8 years ago

25 Aug, 2017

1 commit

b3480fe05 netfilter: conntrack: make protocol tracker pointers const ... Browse Code »

Doesn't change generated code, but will make it easier to eventually
make the actual trackers themselvers const.

Signed-off-by: Florian Westphal
Signed-off-by: Pablo Neira Ayuso

Florian Westphal
8 years ago

22 Aug, 2017

1 commit

e2a7c34fb Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net Browse Code »

David S. Miller
8 years ago

17 Aug, 2017

1 commit

494bea39f openvswitch: fix skb_panic due to the incorrect actions attrlen ... Browse Code »

For sw_flow_actions, the actions_len only represents the kernel part's
size, and when we dump the actions to the userspace, we will do the
convertions, so it's true size may become bigger than the actions_len.

But unfortunately, for OVS_PACKET_ATTR_ACTIONS, we use the actions_len
to alloc the skbuff, so the user_skb's size may become insufficient and
oops will happen like this:
skbuff: skb_over_panic: text:ffffffff8148fabf len:1749 put:157 head:
ffff881300f39000 data:ffff881300f39000 tail:0x6d5 end:0x6c0 dev:
------------[ cut here ]------------
kernel BUG at net/core/skbuff.c:129!
[...]
Call Trace:

[] skb_put+0x43/0x44
[] skb_zerocopy+0x6c/0x1f4
[] queue_userspace_packet+0x3a3/0x448 [openvswitch]
[] ovs_dp_upcall+0x30/0x5c [openvswitch]
[] output_userspace+0x132/0x158 [openvswitch]
[] ? ip6_rcv_finish+0x74/0x77 [ipv6]
[] do_execute_actions+0xcc1/0xdc8 [openvswitch]
[] ovs_execute_actions+0x74/0x106 [openvswitch]
[] ovs_dp_process_packet+0xe1/0xfd [openvswitch]
[] ? key_extract+0x63c/0x8d5 [openvswitch]
[] ovs_vport_receive+0xa1/0xc3 [openvswitch]
[...]

Also we can find that the actions_len is much little than the orig_len:
crash> struct sw_flow_actions 0xffff8812f539d000
struct sw_flow_actions {
rcu = {
next = 0xffff8812f5398800,
func = 0xffffe3b00035db32
},
orig_len = 1384,
actions_len = 592,
actions = 0xffff8812f539d01c
}

So as a quick fix, use the orig_len instead of the actions_len to alloc
the user_skb.

Last, this oops happened on our system running a relative old kernel, but
the same risk still exists on the mainline, since we use the wrong
actions_len from the beginning.

Fixes: ccea74457bbd ("openvswitch: include datapath actions with sampled-packet upcall to userspace")
Cc: Neil McKee
Signed-off-by: Liping Zhang
Acked-by: Pravin B Shelar
Signed-off-by: David S. Miller

Liping Zhang
8 years ago

12 Aug, 2017

1 commit

0ed80da51 openvswitch: Remove unnecessary newlines from OVS_NLERR uses ... Browse Code »

OVS_NLERR already adds a newline so these just add blank
lines to the logging.

Signed-off-by: Joe Perches
Acked-by: Joe Stringer
Signed-off-by: David S. Miller

Joe Perches
8 years ago

02 Aug, 2017

1 commit

29fda25a2 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net ... Browse Code »

Two minor conflicts in virtio_net driver (bug fix overlapping addition
of a helper) and MAINTAINERS (new driver edit overlapping revamp of
PHY entry).

Signed-off-by: David S. Miller

David S. Miller
8 years ago

25 Jul, 2017

1 commit

69ec932e3 openvswitch: fix potential out of bound access in parse_ct ... Browse Code »

Before the 'type' is validated, we shouldn't use it to fetch the
ovs_ct_attr_lens's minlen and maxlen, else, out of bound access
may happen.

Fixes: 7f8a436eaa2c ("openvswitch: Add conntrack action")
Signed-off-by: Liping Zhang
Acked-by: Pravin B Shelar
Signed-off-by: David S. Miller

Liping Zhang
8 years ago

21 Jul, 2017

1 commit

7a68ada6e Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net Browse Code »

David S. Miller
8 years ago

20 Jul, 2017

2 commits

c4b2bf6b4 openvswitch: Optimize operations for OvS flow_stats. ... Browse Code »

When calling the flow_free() to free the flow, we call many times
(cpu_possible_mask, eg. 128 as default) cpumask_next(). That will
take up our CPU usage if we call the flow_free() frequently.
When we put all packets to userspace via upcall, and OvS will send
them back via netlink to ovs_packet_cmd_execute(will call flow_free).

The test topo is shown as below. VM01 sends TCP packets to VM02,
and OvS forward packtets. When testing, we use perf to report the
system performance.

VM01 --- OvS-VM --- VM02

Without this patch, perf-top show as below: The flow_free() is
3.02% CPU usage.

4.23% [kernel] [k] _raw_spin_unlock_irqrestore
3.62% [kernel] [k] __do_softirq
3.16% [kernel] [k] __memcpy
3.02% [kernel] [k] flow_free
2.42% libc-2.17.so [.] __memcpy_ssse3_back
2.18% [kernel] [k] copy_user_generic_unrolled
2.17% [kernel] [k] find_next_bit

When applied this patch, perf-top show as below: Not shown on
the list anymore.

4.11% [kernel] [k] _raw_spin_unlock_irqrestore
3.79% [kernel] [k] __do_softirq
3.46% [kernel] [k] __memcpy
2.73% libc-2.17.so [.] __memcpy_ssse3_back
2.25% [kernel] [k] copy_user_generic_unrolled
1.89% libc-2.17.so [.] _int_malloc
1.53% ovs-vswitchd [.] xlate_actions

With this patch, the TCP throughput(we dont use Megaflow Cache
+ Microflow Cache) between VMs is 1.18Gbs/sec up to 1.30Gbs/sec
(maybe ~10% performance imporve).

This patch adds cpumask struct, the cpu_used_mask stores the cpu_id
that the flow used. And we only check the flow_stats on the cpu we
used, and it is unncessary to check all possible cpu when getting,
cleaning, and updating the flow_stats. Adding the cpu_used_mask to
sw_flow struct does’t increase the cacheline number.

Signed-off-by: Tonghao Zhang
Acked-by: Pravin B Shelar
Signed-off-by: David S. Miller

Tonghao Zhang
8 years ago
c57c054eb openvswitch: Optimize updating for OvS flow_stats. ... Browse Code »

In the ovs_flow_stats_update(), we only use the node
var to alloc flow_stats struct. But this is not a
common case, it is unnecessary to call the numa_node_id()
everytime. This patch is not a bugfix, but there maybe
a small increase.

Signed-off-by: Tonghao Zhang
Signed-off-by: David S. Miller

Tonghao Zhang
8 years ago

18 Jul, 2017

1 commit

880388aa3 net: Remove all references to SKB_GSO_UDP. ... Browse Code »

Such packets are no longer possible.

Signed-off-by: David S. Miller

David S. Miller
8 years ago

16 Jul, 2017

1 commit

8b97ac5bd openvswitch: Fix for force/commit action failures ... Browse Code »

When there is an established connection in direction A->B, it is
possible to receive a packet on port B which then executes
ct(commit,force) without first performing ct() - ie, a lookup.
In this case, we would expect that this packet can delete the existing
entry so that we can commit a connection with direction B->A. However,
currently we only perform a check in skb_nfct_cached() for whether
OVS_CS_F_TRACKED is set and OVS_CS_F_INVALID is not set, ie that a
lookup previously occurred. In the above scenario, a lookup has not
occurred but we should still be able to statelessly look up the
existing entry and potentially delete the entry if it is in the
opposite direction.

This patch extends the check to also hint that if the action has the
force flag set, then we will lookup the existing entry so that the
force check at the end of skb_nfct_cached has the ability to delete
the connection.

Fixes: dd41d330b03 ("openvswitch: Add force commit.")
CC: Pravin Shelar
CC: dev@openvswitch.org
Signed-off-by: Joe Stringer
Signed-off-by: Greg Rose
Signed-off-by: David S. Miller

Greg Rose
8 years ago

03 Jul, 2017

1 commit

52427fa06 openvswitch: fix mis-ordered comment lines for ovs_skb_cb ... Browse Code »

I was trying to wrap my head around meaning of mru, and realised
that the second line of the comment defining it had somehow
ended up after the line defining cutlen, leading to much confusion.

Reorder the lines to make sense.

Signed-off-by: Daniel Axtens
Signed-off-by: David S. Miller

Daniel Axtens
8 years ago

02 Jul, 2017

1 commit

9cc9a5cb1 datapath: Avoid using stack larger than 1024. ... Browse Code »

When compiling OvS-master on 4.4.0-81 kernel,
there is a warning:

CC [M] /root/ovs/datapath/linux/datapath.o
/root/ovs/datapath/linux/datapath.c: In function
'ovs_flow_cmd_set':
/root/ovs/datapath/linux/datapath.c:1221:1: warning:
the frame size of 1040 bytes is larger than 1024 bytes
[-Wframe-larger-than=]

This patch factors out match-init and action-copy to avoid
"Wframe-larger-than=1024" warning. Because mask is only
used to get actions, we new a function to save some
stack space.

Signed-off-by: Tonghao Zhang
Acked-by: Pravin B Shelar
Signed-off-by: David S. Miller

Tonghao Zhang
8 years ago

25 Jun, 2017

1 commit

3fcece12b net: store port/representator id in metadata_dst ... Browse Code »

Switches and modern SR-IOV enabled NICs may multiplex traffic from Port
representators and control messages over single set of hardware queues.
Control messages and muxed traffic may need ordered delivery.

Those requirements make it hard to comfortably use TC infrastructure today
unless we have a way of attaching metadata to skbs at the upper device.
Because single set of queues is used for many netdevs stopping TC/sched
queues of all of them reliably is impossible and lower device has to
retreat to returning NETDEV_TX_BUSY and usually has to take extra locks on
the fastpath.

This patch attempts to enable port/representative devs to attach metadata
to skbs which carry port id. This way representatives can be queueless and
all queuing can be performed at the lower netdev in the usual way.

Traffic arriving on the port/representative interfaces will be have
metadata attached and will subsequently be queued to the lower device for
transmission. The lower device should recognize the metadata and translate
it to HW specific format which is most likely either a special header
inserted before the network headers or descriptor/metadata fields.

Metadata is associated with the lower device by storing the netdev pointer
along with port id so that if TC decides to redirect or mirror the new
netdev will not try to interpret it.

This is mostly for SR-IOV devices since switches don't have lower netdevs
today.

Signed-off-by: Jakub Kicinski
Signed-off-by: Sridhar Samudrala
Signed-off-by: Simon Horman
Signed-off-by: David S. Miller

Jakub Kicinski
8 years ago

21 Jun, 2017

1 commit

dc5321d79 vxlan: get rid of redundant vxlan_dev.flags ... Browse Code »

There is no good reason to keep the flags twice in vxlan_dev and
vxlan_config.

Signed-off-by: Matthias Schiffer
Signed-off-by: David S. Miller

Matthias Schiffer
8 years ago

16 Jun, 2017

1 commit

b080db585 networking: convert many more places to skb_put_zero() ... Browse Code »

There were many places that my previous spatch didn't find,
as pointed out by yuan linyu in various patches.

The following spatch found many more and also removes the
now unnecessary casts:

@@
identifier p, p2;
expression len;
expression skb;
type t, t2;
@@
(
-p = skb_put(skb, len);
+p = skb_put_zero(skb, len);
|
-p = (t)skb_put(skb, len);
+p = skb_put_zero(skb, len);
)
... when != p
(
p2 = (t2)p;
-memset(p2, 0, len);
|
-memset(p, 0, len);
)

@@
type t, t2;
identifier p, p2;
expression skb;
@@
t *p;
...
(
-p = skb_put(skb, sizeof(t));
+p = skb_put_zero(skb, sizeof(t));
|
-p = (t *)skb_put(skb, sizeof(t));
+p = skb_put_zero(skb, sizeof(t));
)
... when != p
(
p2 = (t2)p;
-memset(p2, 0, sizeof(*p));
|
-memset(p, 0, sizeof(*p));
)

@@
expression skb, len;
@@
-memset(skb_put(skb, len), 0, len);
+skb_put_zero(skb, len);

Apply it to the tree (with one manual fixup to keep the
comment in vxlan.c, which spatch removed.)

Signed-off-by: Johannes Berg
Signed-off-by: David S. Miller

Johannes Berg
8 years ago

15 Jun, 2017

1 commit

0ddead90b Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net ... Browse Code »

The conflicts were two cases of overlapping changes in
batman-adv and the qed driver.

Signed-off-by: David S. Miller

David S. Miller
8 years ago

08 Jun, 2017

1 commit

cf124db56 net: Fix inconsistent teardown and release of private netdev state. ... Browse Code »

Network devices can allocate reasources and private memory using
netdev_ops->ndo_init(). However, the release of these resources
can occur in one of two different places.

Either netdev_ops->ndo_uninit() or netdev->destructor().

The decision of which operation frees the resources depends upon
whether it is necessary for all netdev refs to be released before it
is safe to perform the freeing.

netdev_ops->ndo_uninit() presumably can occur right after the
NETDEV_UNREGISTER notifier completes and the unicast and multicast
address lists are flushed.

netdev->destructor(), on the other hand, does not run until the
netdev references all go away.

Further complicating the situation is that netdev->destructor()
almost universally does also a free_netdev().

This creates a problem for the logic in register_netdevice().
Because all callers of register_netdevice() manage the freeing
of the netdev, and invoke free_netdev(dev) if register_netdevice()
fails.

If netdev_ops->ndo_init() succeeds, but something else fails inside
of register_netdevice(), it does call ndo_ops->ndo_uninit(). But
it is not able to invoke netdev->destructor().

This is because netdev->destructor() will do a free_netdev() and
then the caller of register_netdevice() will do the same.

However, this means that the resources that would normally be released
by netdev->destructor() will not be.

Over the years drivers have added local hacks to deal with this, by
invoking their destructor parts by hand when register_netdevice()
fails.

Many drivers do not try to deal with this, and instead we have leaks.

Let's close this hole by formalizing the distinction between what
private things need to be freed up by netdev->destructor() and whether
the driver needs unregister_netdevice() to perform the free_netdev().

netdev->priv_destructor() performs all actions to free up the private
resources that used to be freed by netdev->destructor(), except for
free_netdev().

netdev->needs_free_netdev is a boolean that indicates whether
free_netdev() should be done at the end of unregister_netdevice().

Now, register_netdevice() can sanely release all resources after
ndo_ops->ndo_init() succeeds, by invoking both ndo_ops->ndo_uninit()
and netdev->priv_destructor().

And at the end of unregister_netdevice(), we invoke
netdev->priv_destructor() and optionally call free_netdev().

Signed-off-by: David S. Miller

David S. Miller
8 years ago

23 May, 2017

1 commit

218b6a5b2 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net Browse Code »

David S. Miller
8 years ago

20 May, 2017

1 commit

7529390d0 openvswitch: more accurate checksumming in queue_userspace_packet() ... Browse Code »

if skb carries an SCTP packet and ip_summed is CHECKSUM_PARTIAL, it needs
CRC32c in place of Internet Checksum: use skb_csum_hwoffload_help to avoid
corrupting such packets while queueing them towards userspace.

Signed-off-by: Davide Caratti
Signed-off-by: David S. Miller

Davide Caratti
8 years ago

15 May, 2017

1 commit

d91fc59cd netfilter: introduce nf_conntrack_helper_put helper function ... Browse Code »

And convert module_put invocation to nf_conntrack_helper_put, this is
prepared for the followup patch, which will add a refcnt for cthelper,
so we can reject the deleting request when cthelper is in use.

Signed-off-by: Liping Zhang
Signed-off-by: Pablo Neira Ayuso

Liping Zhang
8 years ago

03 May, 2017

1 commit

4d89ac2dd Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf ... Browse Code »

Pablo Neira Ayuso says:

====================
Netfilter/IPVS/OVS fixes for net

The following patchset contains a rather large batch of Netfilter, IPVS
and OVS fixes for your net tree. This includes fixes for ctnetlink, the
userspace conntrack helper infrastructure, conntrack OVS support,
ebtables DNAT target, several leaks in error path among other. More
specifically, they are:

1) Fix reference count leak in the CT target error path, from Gao Feng.

2) Remove conntrack entry clashing with a matching expectation, patch
from Jarno Rajahalme.

3) Fix bogus EEXIST when registering two different userspace helpers,
from Liping Zhang.

4) Don't leak dummy elements in the new bitmap set type in nf_tables,
from Liping Zhang.

5) Get rid of module autoload from conntrack update path in ctnetlink,
we don't need autoload at this late stage and it is happening with
rcu read lock held which is not good. From Liping Zhang.

6) Fix deadlock due to double-acquire of the expect_lock from conntrack
update path, this fixes a bug that was introduced when the central
spinlock got removed. Again from Liping Zhang.

7) Safe ct->status update from ctnetlink path, from Liping. The expect_lock
protection that was selected when the central spinlock was removed was
not really protecting anything at all.

8) Protect sequence adjustment under ct->lock.

9) Missing socket match with IPv6, from Peter Tirsek.

10) Adjust skb->pkt_type of DNAT'ed frames from ebtables, from
Linus Luessing.

11) Don't give up on evaluating the expression on new entries added via
dynset expression in nf_tables, from Liping Zhang.

12) Use skb_checksum() when mangling icmpv6 in IPv6 NAT as this deals
with non-linear skbuffs.

13) Don't allow IPv6 service in IPVS if no IPv6 support is available,
from Paolo Abeni.

14) Missing mutex release in error path of xt_find_table_lock(), from
Dan Carpenter.

15) Update maintainers files, Netfilter section. Add Florian to the
file, refer to nftables.org and change project status from Supported
to Maintained.

16) Bail out on mismatching extensions in element updates in nf_tables.
====================

Signed-off-by: David S. Miller

David S. Miller
8 years ago

01 May, 2017

1 commit

a01aa920b Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-next ... Browse Code »

Pablo Neira Ayuso says:

====================
Netfilter/IPVS updates for net-next

The following patchset contains Netfilter updates for your net-next
tree. A large bunch of code cleanups, simplify the conntrack extension
codebase, get rid of the fake conntrack object, speed up netns by
selective synchronize_net() calls. More specifically, they are:

1) Check for ct->status bit instead of using nfct_nat() from IPVS and
Netfilter codebase, patch from Florian Westphal.

2) Use kcalloc() wherever possible in the IPVS code, from Varsha Rao.

3) Simplify FTP IPVS helper module registration path, from Arushi Singhal.

4) Introduce nft_is_base_chain() helper function.

5) Enforce expectation limit from userspace conntrack helper,
from Gao Feng.

6) Add nf_ct_remove_expect() helper function, from Gao Feng.

7) NAT mangle helper function return boolean, from Gao Feng.

8) ctnetlink_alloc_expect() should only work for conntrack with
helpers, from Gao Feng.

9) Add nfnl_msg_type() helper function to nfnetlink to build the
netlink message type.

10) Get rid of unnecessary cast on void, from simran singhal.

11) Use seq_puts()/seq_putc() instead of seq_printf() where possible,
also from simran singhal.

12) Use list_prev_entry() from nf_tables, from simran signhal.

13) Remove unnecessary & on pointer function in the Netfilter and IPVS
code.

14) Remove obsolete comment on set of rules per CPU in ip6_tables,
no longer true. From Arushi Singhal.

15) Remove duplicated nf_conntrack_l4proto_udplite4, from Gao Feng.

16) Remove unnecessary nested rcu_read_lock() in
__nf_nat_decode_session(). Code running from hooks are already
guaranteed to run under RCU read side.

17) Remove deadcode in nf_tables_getobj(), from Aaron Conole.

18) Remove double assignment in nf_ct_l4proto_pernet_unregister_one(),
also from Aaron.

19) Get rid of unsed __ip_set_get_netlink(), from Aaron Conole.

20) Don't propagate NF_DROP error to userspace via ctnetlink in
__nf_nat_alloc_null_binding() function, from Gao Feng.

21) Revisit nf_ct_deliver_cached_events() to remove unnecessary checks,
from Gao Feng.

22) Kill the fake untracked conntrack objects, use ctinfo instead to
annotate a conntrack object is untracked, from Florian Westphal.

23) Remove nf_ct_is_untracked(), now obsolete since we have no
conntrack template anymore, from Florian.

24) Add event mask support to nft_ct, also from Florian.

25) Move nf_conn_help structure to
include/net/netfilter/nf_conntrack_helper.h.

26) Add a fixed 32 bytes scratchpad area for conntrack helpers.
Thus, we don't deal with variable conntrack extensions anymore.
Make sure userspace conntrack helper doesn't go over that size.
Remove variable size ct extension infrastructure now this code
got no more clients. From Florian Westphal.

27) Restore offset and length of nf_ct_ext structure to 8 bytes now
that wraparound is not possible any longer, also from Florian.

28) Allow to get rid of unassured flows under stress in conntrack,
this applies to DCCP, SCTP and TCP protocols, from Florian.

29) Shrink size of nf_conntrack_ecache structure, from Florian.

30) Use TCP_MAX_WSCALE instead of hardcoded 14 in TCP tracker,
from Gao Feng.

31) Register SYNPROXY hooks on demand, from Florian Westphal.

32) Use pernet hook whenever possible, instead of global hook
registration, from Florian Westphal.

33) Pass hook structure to ebt_register_table() to consolidate some
infrastructure code, from Florian Westphal.

34) Use consume_skb() and return NF_STOLEN, instead of NF_DROP in the
SYNPROXY code, to make sure device stats are not fooled, patch
from Gao Feng.

35) Remove NF_CT_EXT_F_PREALLOC this kills quite some code that we
don't need anymore if we just select a fixed size instead of
expensive runtime time calculation of this. From Florian.

36) Constify nf_ct_extend_register() and nf_ct_extend_unregister(),
from Florian.

37) Simplify nf_ct_ext_add(), this kills nf_ct_ext_create(), from
Florian.

38) Attach NAT extension on-demand from masquerade and pptp helper
path, from Florian.

39) Get rid of useless ip_vs_set_state_timeout(), from Aaron Conole.

40) Speed up netns by selective calls of synchronize_net(), from
Florian Westphal.

41) Silence stack size warning gcc in 32-bit arch in snmp helper,
from Florian.

42) Inconditionally call nf_ct_ext_destroy(), even if we have no
extensions, to deal with the NF_NAT_MANIP_SRC case. Patch from
Liping Zhang.
====================

Signed-off-by: David S. Miller

David S. Miller
8 years ago

25 Apr, 2017

3 commits

cf5d70918 openvswitch: Delete conntrack entry clashing with an expectation. ... Browse Code »

Conntrack helpers do not check for a potentially clashing conntrack
entry when creating a new expectation. Also, nf_conntrack_in() will
check expectations (via init_conntrack()) only if a conntrack entry
can not be found. The expectation for a packet which also matches an
existing conntrack entry will not be removed by conntrack, and is
currently handled inconsistently by OVS, as OVS expects the
expectation to be removed when the connection tracking entry matching
that expectation is confirmed.

It should be noted that normally an IP stack would not allow reuse of
a 5-tuple of an old (possibly lingering) connection for a new data
connection, so this is somewhat unlikely corner case. However, it is
possible that a misbehaving source could cause conntrack entries be
created that could then interfere with new related connections.

Fix this in the OVS module by deleting the clashing conntrack entry
after an expectation has been matched. This causes the following
nf_conntrack_in() call also find the expectation and remove it when
creating the new conntrack entry, as well as the forthcoming reply
direction packets to match the new related connection instead of the
old clashing conntrack entry.

Fixes: 7f8a436eaa2c ("openvswitch: Add conntrack action")
Reported-by: Yang Song
Signed-off-by: Jarno Rajahalme
Acked-by: Joe Stringer
Signed-off-by: Pablo Neira Ayuso

Jarno Rajahalme
8 years ago
120645513 openvswitch: Add eventmask support to CT action. ... Browse Code »

Add a new optional conntrack action attribute OVS_CT_ATTR_EVENTMASK,
which can be used in conjunction with the commit flag
(OVS_CT_ATTR_COMMIT) to set the mask of bits specifying which
conntrack events (IPCT_*) should be delivered via the Netfilter
netlink multicast groups. Default behavior depends on the system
configuration, but typically a lot of events are delivered. This can be
very chatty for the NFNLGRP_CONNTRACK_UPDATE group, even if only some
types of events are of interest.

Netfilter core init_conntrack() adds the event cache extension, so we
only need to set the ctmask value. However, if the system is
configured without support for events, the setting will be skipped due
to extension not being found.

Signed-off-by: Jarno Rajahalme
Reviewed-by: Greg Rose
Acked-by: Joe Stringer
Signed-off-by: David S. Miller

Jarno Rajahalme
8 years ago
abd0a4f2b openvswitch: Typo fix. ... Browse Code »

Fix typo in a comment.

Signed-off-by: Jarno Rajahalme
Acked-by: Greg Rose
Signed-off-by: David S. Miller

Jarno Rajahalme
8 years ago

15 Apr, 2017

1 commit

ab8bc7ed8 netfilter: remove nf_ct_is_untracked ... Browse Code »

This function is now obsolete and always returns false.
This change has no effect on generated code.

Signed-off-by: Florian Westphal
Signed-off-by: Pablo Neira Ayuso

Florian Westphal
8 years ago

14 Apr, 2017

1 commit

fceb6435e netlink: pass extended ACK struct to parsing functions ... Browse Code »

Pass the new extended ACK reporting struct to all of the generic
netlink parsing functions. For now, pass NULL in almost all callers
(except for some in the core.)

Signed-off-by: Johannes Berg
Signed-off-by: David S. Miller

Johannes Berg
8 years ago

06 Apr, 2017

1 commit

6f14f443d Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net ... Browse Code »

Mostly simple cases of overlapping changes (adding code nearby,
a function whose name changes, for example).

Signed-off-by: David S. Miller

David S. Miller
8 years ago

02 Apr, 2017

1 commit

6f56f6186 openvswitch: Fix ovs_flow_key_update() ... Browse Code »

ovs_flow_key_update() is called when the flow key is invalid, and it is
used to update and revalidate the flow key. Commit 329f45bc4f19
("openvswitch: add mac_proto field to the flow key") introduces mac_proto
field to flow key and use it to determine whether the flow key is valid.
However, the commit does not update the code path in ovs_flow_key_update()
to revalidate the flow key which may cause BUG_ON() on execute_recirc().
This patch addresses the aforementioned issue.

Fixes: 329f45bc4f19 ("openvswitch: add mac_proto field to the flow key")
Signed-off-by: Yi-Hung Wei
Acked-by: Jiri Benc
Signed-off-by: David S. Miller

Yi-Hung Wei
8 years ago

29 Mar, 2017

1 commit

b768b16de openvswitch: Fix refcount leak on force commit. ... Browse Code »

The reference count held for skb needs to be released when the skb's
nfct pointer is cleared regardless of if nf_ct_delete() is called or
not.

Failing to release the skb's reference cound led to deferred conntrack
cleanup spinning forever within nf_conntrack_cleanup_net_list() when
cleaning up a network namespace:

kworker/u16:0-19025 [004] 45981067.173642: sched_switch: kworker/u16:0:19025 [120] R ==> rcu_preempt:7 [120]
kworker/u16:0-19025 [004] 45981067.173651: kernel_stack:
=> ___preempt_schedule (ffffffffa001ed36)
=> _raw_spin_unlock_bh (ffffffffa0713290)
=> nf_ct_iterate_cleanup (ffffffffc00a4454)
=> nf_conntrack_cleanup_net_list (ffffffffc00a5e1e)
=> nf_conntrack_pernet_exit (ffffffffc00a63dd)
=> ops_exit_list.isra.1 (ffffffffa06075f3)
=> cleanup_net (ffffffffa0607df0)
=> process_one_work (ffffffffa0084c31)
=> worker_thread (ffffffffa008592b)
=> kthread (ffffffffa008bee2)
=> ret_from_fork (ffffffffa071b67c)

Fixes: dd41d33f0b03 ("openvswitch: Add force commit.")
Reported-by: Yang Song
Signed-off-by: Jarno Rajahalme
Acked-by: Joe Stringer
Signed-off-by: David S. Miller

Jarno Rajahalme
8 years ago