17 Sep, 2018

16 commits

  • I see no reason for them: label or timer cannot be NULL, and if they
    were, we would crash with a NULL dereference anyway.

    For skb_header_pointer() failure, just set hotdrop to true and toss
    such a packet.
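
    A minimal sketch of the resulting pattern in an xtables match (the match
    and header names are illustrative, not the actual diff):

    struct example_hdr {
            __u8 type;
    };

    static bool example_mt(const struct sk_buff *skb,
                           struct xt_action_param *par)
    {
            struct example_hdr _hdr;
            const struct example_hdr *hdr;

            hdr = skb_header_pointer(skb, par->thoff, sizeof(_hdr), &_hdr);
            if (!hdr) {
                    par->hotdrop = true;    /* toss the malformed packet */
                    return false;
            }

            return hdr->type == 0;          /* illustrative match condition */
    }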

    Signed-off-by: Florian Westphal
    Signed-off-by: Pablo Neira Ayuso

    Florian Westphal
     
  • None of these spots really needs to crash the kernel. In one or two
    cases we can just report an error to userspace; in the other cases we
    can just use WARN_ON (and leak memory instead).
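
    The general shape of the change, as a hedged sketch (the context struct
    and condition are illustrative):

    struct example_ctx;

    static int example_setup(struct example_ctx *ctx)
    {
            /* was: BUG_ON(ctx == NULL); */
            if (WARN_ON(ctx == NULL))
                    return -EINVAL; /* report the error to userspace instead */

            return 0;
    }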

    Signed-off-by: Florian Westphal
    Signed-off-by: Pablo Neira Ayuso

    Florian Westphal
     
  • The cgroup v2 path field is PATH_MAX, which is too large. This places
    too much pressure on memory allocation for people with many rules doing
    cgroup v1 classid matching; side effects of this are bug reports like:

    https://bugzilla.kernel.org/show_bug.cgi?id=200639

    This patch registers a new revision that shrinks the cgroup path to 512
    bytes, which is the same approach we follow in similar extensions that
    have a path field.
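
    A hedged sketch of what the smaller revision can look like (macro and
    field names are assumptions, not the exact uapi structure):

    #define EXAMPLE_CGROUP_PATH_MAX 512     /* down from PATH_MAX (4096) */

    struct example_cgroup_info_v2 {
            __u8 has_path;
            __u8 has_classid;
            __u8 invert_path;
            __u8 invert_classid;
            union {
                    char    path[EXAMPLE_CGROUP_PATH_MAX];
                    __u32   classid;
            };

            /* kernel-internal: looked-up cgroup */
            void *priv __attribute__((aligned(8)));
    };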

    Cc: Tejun Heo
    Signed-off-by: Pablo Neira Ayuso
    Acked-by: Tejun Heo

    Pablo Neira Ayuso
     
  • The same connection mark can be set on flows belonging to different
    address families. This commit adds support for filtering on the L3
    protocol when flushing connection track entries. If no protocol is
    specified, then all L3 protocols match.

    In order to avoid code duplication and a redundant check, the protocol
    comparison in ctnetlink_dump_table() has been removed. Instead, a filter
    is created if the GET-message triggering the dump contains an address
    family. ctnetlink_filter_match() is then used to compare the L3
    protocols.
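
    A minimal sketch of the per-entry comparison (the filter struct and the
    helper name are illustrative):

    struct example_ct_filter {
            u8 family;      /* NFPROTO_UNSPEC means "match all families" */
    };

    static bool example_filter_match(struct nf_conn *ct, void *data)
    {
            const struct example_ct_filter *filter = data;

            if (!filter)
                    return true;    /* no filter: dump/flush every entry */

            if (filter->family != NFPROTO_UNSPEC &&
                nf_ct_l3num(ct) != filter->family)
                    return false;   /* wrong L3 protocol, skip this entry */

            return true;
    }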

    Signed-off-by: Kristian Evensen
    Signed-off-by: Pablo Neira Ayuso

    Kristian Evensen
     
  • Supports fetching saddr/daddr of tunnel mode states, the request id and
    the SPI. If the direction is 'in', use the inbound skb secpath,
    otherwise dst->xfrm.
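
    A hedged sketch of the direction handling (helper and variable names are
    illustrative, not the actual expression code):

    static struct xfrm_state *example_pick_state(struct sk_buff *skb,
                                                 bool dir_is_in)
    {
            if (dir_is_in) {
                    /* inbound: state recorded in the skb's secpath */
                    struct sec_path *sp = skb_sec_path(skb);

                    return (sp && sp->len > 0) ? sp->xvec[0] : NULL;
            }

            /* outbound: state attached to the route */
            return skb_dst(skb) ? skb_dst(skb)->xfrm : NULL;
    }

    With a tunnel mode state in hand (x->props.mode == XFRM_MODE_TUNNEL),
    saddr/daddr, the request id and the SPI are available via x->props.saddr,
    x->id.daddr, x->props.reqid and x->id.spi.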

    Joint work with Máté Eckl.

    Signed-off-by: Florian Westphal
    Signed-off-by: Pablo Neira Ayuso

    Florian Westphal
     
  • As of a0ae2562c6c4b27 ("netfilter: conntrack: remove l3proto
    abstraction") there are no users anymore.

    Signed-off-by: Florian Westphal
    Signed-off-by: Pablo Neira Ayuso

    Florian Westphal
     
  • Release the committed transaction log from a work queue, moving the
    expensive synchronize_rcu out of the locked section and providing an
    opportunity to batch this.
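
    A minimal sketch of the pattern (names are illustrative; the real code
    tracks its own transaction objects): the commit path queues the finished
    log and schedules a work item, and the work item pays for a single
    synchronize_rcu() covering everything queued so far.

    struct example_trans {
            struct list_head list;
    };

    static LIST_HEAD(example_destroy_list);
    static DEFINE_SPINLOCK(example_destroy_list_lock);

    static void example_trans_destroy_work(struct work_struct *work)
    {
            struct example_trans *t, *next;
            LIST_HEAD(head);

            spin_lock(&example_destroy_list_lock);
            list_splice_init(&example_destroy_list, &head);
            spin_unlock(&example_destroy_list_lock);

            if (list_empty(&head))
                    return;

            synchronize_rcu();      /* one grace period for the whole batch */

            list_for_each_entry_safe(t, next, &head, list) {
                    list_del(&t->list);
                    kfree(t);
            }
    }
    static DECLARE_WORK(example_trans_destroy_w, example_trans_destroy_work);

    /* commit path: splice the committed log onto example_destroy_list under
     * the lock, then schedule_work(&example_trans_destroy_w). */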

    On my test machine this cuts runtime of nft-test.py in half.
    Based on earlier patch from Pablo Neira Ayuso.

    Signed-off-by: Florian Westphal
    Signed-off-by: Pablo Neira Ayuso

    Florian Westphal
     
  • ->destroy is only allowed to free data, or do other cleanups that do not
    have side effects on other state, such as visibility to other netlink
    requests.

    Such things need to be done in ->deactivate.
    As a transaction can fail, we need to make sure we can undo such
    operations; therefore ->activate() has to be provided too.

    So print a warning and refuse registration if expr->ops provides
    only one of the two operations.
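
    A hedged sketch of the registration-time check (roughly what the check
    boils down to; exact context omitted):

    static int example_expr_check_ops(const struct nft_expr_ops *ops)
    {
            if (!ops)
                    return -EINVAL;

            /* ->activate and ->deactivate must be provided in pairs */
            if (WARN_ON_ONCE(!!ops->activate ^ !!ops->deactivate))
                    return -EINVAL;

            return 0;
    }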

    v2: fix nft_expr_check_ops to not repeat the same check twice (Jones Desougi)

    Signed-off-by: Florian Westphal
    Signed-off-by: Pablo Neira Ayuso

    Florian Westphal
     
  • Split unbind_set into destroy_set and an unbinding operation.

    Unbinding removes the set from the lists (so a new transaction will not
    find it anymore) but keeps the memory allocated (so the packet path
    continues to work).

    A rebind function is added to allow unrolling in case the transaction
    that wants to remove the set is aborted.

    A destroy function is added to free the memory, but this could occur
    outside of the transaction in the future.

    Signed-off-by: Florian Westphal
    Signed-off-by: Pablo Neira Ayuso

    Florian Westphal
     
  • Useful e.g. to avoid NATting inner headers of to-be-encrypted packets.

    Signed-off-by: Florian Westphal
    Signed-off-by: Pablo Neira Ayuso

    Florian Westphal
     
  • Same as ip_gre, use gre_parse_header() to parse the GRE header in the
    GRE error handler code.

    Signed-off-by: Haishuang Yan
    Signed-off-by: David S. Miller

    Haishuang Yan
     
  • gre_parse_header() stops parsing when a checksum error is encountered,
    which means tpi->key is undefined and ip_tunnel_lookup() will improperly
    return NULL.

    This patch allows passing a NULL pointer as the csum_err parameter. Even
    when a checksum error is encountered, gre_parse_header() will then not
    return an error and will continue parsing the GRE header as expected.
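
    A hedged caller-side sketch (illustrative handler, not the exact diff):
    the GRE error handler passes NULL so a checksum failure no longer aborts
    parsing and tpi.key stays valid for the tunnel lookup.

    static void example_gre_err(struct sk_buff *skb, u32 info)
    {
            struct tnl_ptk_info tpi;

            if (gre_parse_header(skb, &tpi, NULL, htons(ETH_P_IP), 0) < 0)
                    return;

            /* ... look up the tunnel using tpi.key and handle the error ... */
    }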

    Fixes: 9f57c67c379d ("gre: Remove support for sharing GRE protocol hook.")
    Reported-by: Jiri Benc
    Signed-off-by: Haishuang Yan
    Signed-off-by: David S. Miller

    Haishuang Yan
     
  • PHY_POLL is defined as -1, which means that we would be setting all
    flags of the PHY driver; this is also not a valid flag to tell PHYLIB
    about. Just remove it.

    Signed-off-by: Florian Fainelli
    Reviewed-by: Andrew Lunn
    Signed-off-by: David S. Miller

    Florian Fainelli
     
  • Davide Caratti says:

    ====================
    net/sched: act_police: lockless data path

    the data path of the 'police' action can be faster if we avoid using
    spinlocks:
    - patch 1 converts act_police to use per-cpu counters
    - patch 2 lets act_police use RCU to access its configuration data.

    test procedure (using pktgen from https://github.com/netoptimizer):
    # ip link add name eth1 type dummy
    # ip link set dev eth1 up
    # tc qdisc add dev eth1 clsact
    # tc filter add dev eth1 egress matchall action police \
    > rate 2gbit burst 100k conform-exceed pass/pass index 100
    # for c in 1 2 4; do
    > ./pktgen_bench_xmit_mode_queue_xmit.sh -v -s 64 -t $c -n 5000000 -i eth1
    > done

    test results (avg. pps/thread):

    $c  | before patch | after patch | improvement
    ----+--------------+-------------+------------
     1  |      3518448 |     3591240 | irrelevant
     2  |      3070065 |     3383393 | 10%
     4  |      1540969 |     3238385 | 110%
    ====================

    Signed-off-by: David S. Miller

    David S. Miller
     
  • Use RCU instead of spinlocks to protect concurrent read/write access to
    the act_police configuration. This reduces the effects of contention in
    the data path when multiple readers are present.
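
    A hedged sketch of the pattern (struct and field names are illustrative,
    not the exact act_police types):

    struct example_police_params {
            struct rcu_head rcu;
            u64 rate;
            u64 burst;
    };

    struct example_police {
            struct example_police_params __rcu *params;
    };

    /* data path (reader): no spinlock, only an RCU read-side section */
    static u64 example_police_rate(const struct example_police *p)
    {
            const struct example_police_params *params;
            u64 rate;

            rcu_read_lock();
            params = rcu_dereference(p->params);
            rate = params->rate;
            rcu_read_unlock();

            return rate;
    }

    /* control path (writer): publish a new copy, free the old one later */
    static void example_police_update(struct example_police *p,
                                      struct example_police_params *new)
    {
            struct example_police_params *old;

            old = rcu_dereference_protected(p->params, 1);  /* init lock held */
            rcu_assign_pointer(p->params, new);
            if (old)
                    kfree_rcu(old, rcu);
    }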

    Signed-off-by: Davide Caratti
    Signed-off-by: David S. Miller

    Davide Caratti
     
  • Use per-CPU counters instead of sharing a single set of stats with all
    cores. This removes the need for a spinlock when statistics are read or
    updated.
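
    A hedged sketch of the per-CPU counter pattern (illustrative struct and
    helper, not the actual act_police code):

    struct example_stats {
            u64                     packets;
            u64                     bytes;
            struct u64_stats_sync   syncp;
    };

    /* allocation (control path): stats = alloc_percpu(struct example_stats); */

    /* fast path, per packet: each core touches only its own counters */
    static void example_stats_update(struct example_stats __percpu *stats,
                                     unsigned int len)
    {
            struct example_stats *s = this_cpu_ptr(stats);

            u64_stats_update_begin(&s->syncp);
            s->packets++;
            s->bytes += len;
            u64_stats_update_end(&s->syncp);
    }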

    Signed-off-by: Davide Caratti
    Signed-off-by: David S. Miller

    Davide Caratti
     

14 Sep, 2018

23 commits

  • - In the CXGB4_DCB_STATE_FW_INCOMPLETE state, check if the DCB
    version has changed and update the supported DCB version.

    - Also, fill in the priority code point value for priority-based
    flow control.

    Signed-off-by: Ganesh Goudar
    Signed-off-by: David S. Miller

    Ganesh Goudar
     
  • Print per-RX-queue packet errors in sge_qinfo.

    Signed-off-by: Casey Leedom
    Signed-off-by: Ganesh Goudar
    Signed-off-by: David S. Miller

    Ganesh Goudar
     
  • Do not put a host-endian 0 or 1 into a big-endian field.

    Reported-by: Al Viro
    Signed-off-by: Ganesh Goudar
    Signed-off-by: David S. Miller

    Ganesh Goudar
     
  • pcpu_lstats is defined in several files, so unify the definitions into
    one and move it to a header file.

    Signed-off-by: Zhang Yu
    Signed-off-by: Li RongQing
    Signed-off-by: David S. Miller

    Li RongQing
     
  • In the quest to remove all stack VLA usage from the kernel[1], this
    removes the VLA used for the emac xaht registers size. Since the number
    of registers can only ever be 4 or 8, as detected in emac_init_config(),
    the maximum can be hardcoded and a runtime test added for robustness.

    [1] https://lkml.kernel.org/r/CA+55aFzCG-zNmZwX4A2FQpadafLfEzK6CC=qPXydAacU1RqZWA@mail.gmail.com
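
    A hedged sketch of the general pattern (names are illustrative, not the
    emac driver's actual identifiers):

    #define EXAMPLE_XAHT_MAX 8              /* hardware has either 4 or 8 slots */

    struct example_dev {
            unsigned int xaht_regs;         /* detected at init time: 4 or 8 */
    };

    static void example_program_xaht(struct example_dev *dev)
    {
            u32 regs[EXAMPLE_XAHT_MAX] = { 0 }; /* was: u32 regs[dev->xaht_regs]; */

            if (WARN_ON(dev->xaht_regs > EXAMPLE_XAHT_MAX))
                    return;                 /* runtime test for robustness */

            /* fill regs[0 .. dev->xaht_regs - 1] and program the hardware */
    }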

    Cc: "David S. Miller"
    Cc: Christian Lamparter
    Cc: Ivan Mikhaylov
    Cc: netdev@vger.kernel.org
    Co-developed-by: Benjamin Herrenschmidt
    Signed-off-by: Kees Cook
    Signed-off-by: David S. Miller

    Kees Cook
     
  • Replace "fallthru" with a proper "fall through" annotation.

    This fix is part of the ongoing effort to enable -Wimplicit-fallthrough.

    Signed-off-by: Gustavo A. R. Silva
    Signed-off-by: David S. Miller

    Gustavo A. R. Silva
     
  • Replace "fallthru" with a proper "fall through" annotation.

    This fix is part of the ongoing effort to enable -Wimplicit-fallthrough.

    Signed-off-by: Gustavo A. R. Silva
    Signed-off-by: David S. Miller

    Gustavo A. R. Silva
     
  • When splitting a GSO segment that consists of encapsulated packets, the
    skb->mac_len of the segments can end up being set wrong, causing packet
    drops in particular when using act_mirred and ifb interfaces in
    combination with a qdisc that splits GSO packets.

    This happens because at the time skb_segment() is called, network_header
    will point to the inner header, throwing off the calculation in
    skb_reset_mac_len(). The network_header is subsequently adjusted by the
    outer IP gso_segment handlers, but they don't set the mac_len.

    Fix this by adding skb_reset_mac_len() calls to both the IPv4 and IPv6
    gso_segment handlers, after they modify the network_header.
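
    A hedged sketch of where the fix lands (not the exact diff): in the
    per-segment loop of the outer gso_segment handler, once the network
    header points back at the outer header, recompute mac_len too.

    static void example_fixup_segs(struct sk_buff *segs)
    {
            struct sk_buff *skb;

            for (skb = segs; skb; skb = skb->next) {
                    /* the handler has restored skb->network_header to the
                     * outer header at this point */
                    skb_reset_mac_len(skb); /* mac_len = network - mac offset */
            }
    }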

    Many thanks to Eric Dumazet for his help in identifying the cause of
    the bug.

    Acked-by: Dave Taht
    Reviewed-by: Eric Dumazet
    Signed-off-by: Toke Høiland-Jørgensen
    Signed-off-by: David S. Miller

    Toke Høiland-Jørgensen
     
  • Remove duplicated include.

    Signed-off-by: YueHaibing
    Signed-off-by: David S. Miller

    YueHaibing
     
  • When the Device Tree does not provide the per-port interrupts, do not
    fail in b53_srab_irq_enable() but instead bail out gracefully. The SRAB
    driver is used on the BCM5301X (Northstar) platforms, which do not yet
    have the SRAB interrupts wired up.

    Fixes: 16994374a6fc ("net: dsa: b53: Make SRAB driver manage port interrupts")
    Signed-off-by: Florian Fainelli
    Signed-off-by: David S. Miller

    Florian Fainelli
     
  • Jason Wang says:

    ====================
    vhost_net TX batching

    This series tries to batch submitting packets to the underlying socket
    through msg_control during sendmsg(). This is done by:

    1) Doing the userspace copy inside vhost_net
    2) Building an XDP buff
    3) Batching at most 64 (VHOST_NET_BATCH) XDP buffs and submitting them
    at once through msg_control during sendmsg()
    4) Letting the underlying sockets use the XDP buffs directly when XDP
    is enabled, or build skbs based on the XDP buffs

    For packets that cannot be built easily with XDP, or for cases where
    batch submission is hard (e.g. sndbuf is limited), we will go for the
    previous slow path, passing the iov iterator to the underlying socket
    through sendmsg() once per packet.

    This can help to improve cache utilization and avoid lots of indirect
    calls with sendmsg(). It can also co-operate with the batching support
    of the underlying sockets (e.g. the case of XDP redirection through
    maps).

    Testpmd (txonly) in the guest shows obvious improvements:

    Test                /+pps%
    XDP_DROP on TAP     /+44.8%
    XDP_REDIRECT on TAP /+29%
    macvtap (skb)       /+26%

    Netperf TCP_STREAM TX from the guest shows obvious improvements for
    small packets:

    size/session/+thu%/+normalize%
    64/ 1/ +2%/ 0%
    64/ 2/ +3%/ +1%
    64/ 4/ +7%/ +5%
    64/ 8/ +8%/ +6%
    256/ 1/ +3%/ 0%
    256/ 2/ +10%/ +7%
    256/ 4/ +26%/ +22%
    256/ 8/ +27%/ +23%
    512/ 1/ +3%/ +2%
    512/ 2/ +19%/ +14%
    512/ 4/ +43%/ +40%
    512/ 8/ +45%/ +41%
    1024/ 1/ +4%/ 0%
    1024/ 2/ +27%/ +21%
    1024/ 4/ +38%/ +73%
    1024/ 8/ +15%/ +24%
    2048/ 1/ +10%/ +7%
    2048/ 2/ +16%/ +12%
    2048/ 4/ 0%/ +2%
    2048/ 8/ 0%/ +2%
    4096/ 1/ +36%/ +60%
    4096/ 2/ -11%/ -26%
    4096/ 4/ 0%/ +14%
    4096/ 8/ 0%/ +4%
    16384/ 1/ -1%/ +5%
    16384/ 2/ 0%/ +2%
    16384/ 4/ 0%/ -3%
    16384/ 8/ 0%/ +4%
    65535/ 1/ 0%/ +10%
    65535/ 2/ 0%/ +8%
    65535/ 4/ 0%/ +1%
    65535/ 8/ 0%/ +3%

    Please review.
    ====================

    Signed-off-by: David S. Miller

    David S. Miller
     
  • This patch implements XDP batching for vhost_net. The idea is to first
    try to do the userspace copy and build the XDP buff directly in vhost.
    Instead of submitting each packet immediately, vhost_net batches them in
    an array and submits every 64 (VHOST_NET_BATCH) packets to the
    underlying socket through msg_control of sendmsg().

    When XDP is enabled on the TUN/TAP device, TUN/TAP can process XDP
    inside a loop without caring about GUP and can thus do batched map
    flushing. When XDP is not enabled or not supported, the underlying
    socket needs to build an skb and pass it to the network core. The
    batched packet submission allows us to do batching like
    netif_receive_skb_list() in the future.

    This saves lots of indirect calls for better cache utilization. For
    cases where we can't do batching, e.g. when sndbuf is limited or the
    packet size is too large, we fall back to the usual
    one-packet-per-sendmsg() path.

    Doing testpmd on various setups gives us:

    Test                /+pps%
    XDP_DROP on TAP     /+44.8%
    XDP_REDIRECT on TAP /+29%
    macvtap (skb)       /+26%

    Netperf tests show obvious improvements for small packet transmission:

    size/session/+thu%/+normalize%
    64/ 1/ +2%/ 0%
    64/ 2/ +3%/ +1%
    64/ 4/ +7%/ +5%
    64/ 8/ +8%/ +6%
    256/ 1/ +3%/ 0%
    256/ 2/ +10%/ +7%
    256/ 4/ +26%/ +22%
    256/ 8/ +27%/ +23%
    512/ 1/ +3%/ +2%
    512/ 2/ +19%/ +14%
    512/ 4/ +43%/ +40%
    512/ 8/ +45%/ +41%
    1024/ 1/ +4%/ 0%
    1024/ 2/ +27%/ +21%
    1024/ 4/ +38%/ +73%
    1024/ 8/ +15%/ +24%
    2048/ 1/ +10%/ +7%
    2048/ 2/ +16%/ +12%
    2048/ 4/ 0%/ +2%
    2048/ 8/ 0%/ +2%
    4096/ 1/ +36%/ +60%
    4096/ 2/ -11%/ -26%
    4096/ 4/ 0%/ +14%
    4096/ 8/ 0%/ +4%
    16384/ 1/ -1%/ +5%
    16384/ 2/ 0%/ +2%
    16384/ 4/ 0%/ -3%
    16384/ 8/ 0%/ +4%
    65535/ 1/ 0%/ +10%
    65535/ 2/ 0%/ +8%
    65535/ 4/ 0%/ +1%
    65535/ 8/ 0%/ +3%

    Signed-off-by: Jason Wang
    Signed-off-by: David S. Miller

    Jason Wang
     
  • This patch implements the TUN_MSG_PTR msg_control type. This type allows
    the caller to pass an array of XDP buffs to tuntap through the ptr field
    of struct tun_msg_ctl. Tap will build skbs from those XDP buffers.

    This avoids lots of indirect calls, improves icache utilization and
    allows batched XDP flushing when doing XDP redirection.

    Signed-off-by: Jason Wang
    Signed-off-by: David S. Miller

    Jason Wang
     
  • This patch implements the TUN_MSG_PTR msg_control type. This type allows
    the caller to pass an array of XDP buffs to tuntap through the ptr field
    of struct tun_msg_ctl. If an XDP program is attached, tuntap can run the
    XDP program directly. If not, tuntap will build an skb and do a fast
    receive, since part of the work has already been done by vhost_net.

    This avoids lots of indirect calls, improves icache utilization and
    allows batched XDP flushing when doing XDP redirection.

    Signed-off-by: Jason Wang
    Signed-off-by: David S. Miller

    Jason Wang
     
  • This patch introduces a new tun/tap-specific msg_control:

    #define TUN_MSG_UBUF 1
    #define TUN_MSG_PTR  2

    struct tun_msg_ctl {
            int type;
            void *ptr;
    };

    This allows us to pass different kinds of msg_control through
    sendmsg(). The first supported type is ubuf (TUN_MSG_UBUF), which will
    be used by the existing vhost_net zerocopy code. The second is an XDP
    buff (TUN_MSG_PTR), which allows vhost_net to pass XDP buffs to TUN.
    This could be used to implement accepting an array of XDP buffs from
    vhost_net in the following patches.
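
    A hedged sketch of the caller side (illustrative helper; the real
    vhost_net code manages the buffer array and its length differently):

    static int example_submit_batch(struct socket *sock,
                                    struct xdp_buff *xdp_array)
    {
            struct tun_msg_ctl ctl = {
                    .type = TUN_MSG_PTR,
                    .ptr  = xdp_array,      /* XDP buffs prepared by the caller */
            };
            struct msghdr msg = {
                    .msg_control    = &ctl,
                    .msg_controllen = sizeof(ctl),
            };

            /* one sendmsg() hands the whole batch to the tap socket */
            return sock->ops->sendmsg(sock, &msg, 0);
    }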

    Signed-off-by: Jason Wang
    Signed-off-by: David S. Miller

    Jason Wang
     
  • This will allow adding batch flushing on top.

    Signed-off-by: Jason Wang
    Signed-off-by: David S. Miller

    Jason Wang
     
  • This patch splits the XDP logic out into a single function so that it
    can be reused by the XDP batching path in the following patch.

    Signed-off-by: Jason Wang
    Signed-off-by: David S. Miller

    Jason Wang
     
  • If we're sure not to go through native XDP, there's no need for several
    things like the bh and RCU handling. So this patch introduces a helper
    to build the skb and hold the page refcount. When we know we will go
    through the skb path, build the skb directly.

    Signed-off-by: Jason Wang
    Signed-off-by: David S. Miller

    Jason Wang
     
  • There's no need to duplicate the page-get logic in each action. So this
    patch gets the page and calculates the offset before processing XDP
    actions (except for XDP_DROP), and undoes this on errors (we don't care
    about performance on the error path). This will be used for factoring
    out the XDP logic.

    Signed-off-by: Jason Wang
    Signed-off-by: David S. Miller

    Jason Wang
     
  • This patch moves the bh enabling a little bit earlier; this will be
    used for factoring out the core XDP logic of tuntap.

    Acked-by: Michael S. Tsirkin
    Signed-off-by: Jason Wang
    Signed-off-by: David S. Miller

    Jason Wang
     
  • Acked-by: Michael S. Tsirkin
    Signed-off-by: Jason Wang
    Signed-off-by: David S. Miller

    Jason Wang
     
  • This patch introduces a new sock flag, SOCK_XDP. It will be used to
    notify the upper layer that an XDP program is attached to the lower
    socket, which requires extra headroom.

    TUN will be the first user.
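
    A hedged sketch of the intended use (illustrative call sites, not the
    exact tun/vhost code):

    /* lower device, on XDP program attach/detach */
    static void example_mark_xdp(struct sock *sk, bool xdp_attached)
    {
            if (xdp_attached)
                    sock_set_flag(sk, SOCK_XDP);    /* tell the upper layer */
            else
                    sock_reset_flag(sk, SOCK_XDP);
    }

    /* upper layer (e.g. vhost_net), when sizing buffers */
    static unsigned int example_headroom(const struct sock *sk)
    {
            return sock_flag(sk, SOCK_XDP) ? XDP_PACKET_HEADROOM : NET_SKB_PAD;
    }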

    Signed-off-by: Jason Wang
    Signed-off-by: David S. Miller

    Jason Wang
     
  • llc_sap_close() is called by llc_sap_put(), which could be called in
    BH context from llc_rcv(). We can't block in BH.

    There is no reason to block here; kfree_rcu() should be sufficient.
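
    The pattern, as a hedged sketch (illustrative struct, not the exact llc
    code):

    struct example_sap {
            struct rcu_head rcu;
            /* ... other fields ... */
    };

    static void example_sap_release(struct example_sap *sap)
    {
            /* was: synchronize_rcu(); kfree(sap);  (blocks, illegal in BH) */
            kfree_rcu(sap, rcu);    /* frees after a grace period, never blocks */
    }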

    Signed-off-by: Cong Wang
    Signed-off-by: David S. Miller

    Cong Wang
     

13 Sep, 2018

1 commit

  • The socket option will be enabled by default to ensure current behaviour
    is not changed. This is the same as for the IPv4 version.

    A socket bound to in6addr_any and a specific port will receive all
    traffic on that port. Analogous to IP_MULTICAST_ALL, disabling this
    option makes a socket that has joined one or more multicast groups only
    pass on multicast traffic from groups which were explicitly joined via
    this socket.

    Unless this option is disabled, a socket (or even a system) joined to
    multiple multicast groups is very hard to get right: filtering by
    destination address has to take place in user space to avoid receiving
    multicast traffic from other multicast groups, which might have traffic
    on the same port.

    Extending the IP_MULTICAST_ALL socket option to also apply to IPv6 was
    not done, to avoid changing the behaviour of current applications.
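
    A hedged userspace sketch of disabling the new option on a bound UDP
    socket (the fallback #define is only needed with older uapi headers):

    #include <netinet/in.h>
    #include <sys/socket.h>

    #ifndef IPV6_MULTICAST_ALL
    #define IPV6_MULTICAST_ALL 29
    #endif

    /* Only groups explicitly joined on this socket will be delivered. */
    static int example_restrict_to_joined_groups(int fd)
    {
            int off = 0;

            return setsockopt(fd, IPPROTO_IPV6, IPV6_MULTICAST_ALL,
                              &off, sizeof(off));
    }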

    Signed-off-by: Andre Naujoks
    Acked-By: YOSHIFUJI Hideaki
    Signed-off-by: David S. Miller

    Andre Naujoks