Eric Lee / smarc-fsl-linux-kernel

10 Nov, 2020

1 commit

77a2d673d tunnels: Fix off-by-one in lower MTU bounds for ICMP/ICMPv6 replies

Jianlin reports that a bridged IPv6 VXLAN endpoint, carrying IPv6
packets over a link with a PMTU estimation of exactly 1350 bytes,
won't trigger ICMPv6 Packet Too Big replies when the encapsulated
datagrams exceed said PMTU value. VXLAN over IPv6 adds 70 bytes of
overhead, so an ICMPv6 reply indicating 1280 bytes as inner MTU
would be legitimate and expected.

This comes from an off-by-one error I introduced in checks added
as part of commit 4cb47a8644cc ("tunnels: PMTU discovery support
for directly bridged IP packets"), whose purpose was to prevent
sending ICMPv6 Packet Too Big messages with an MTU lower than the
smallest permissible IPv6 link MTU, i.e. 1280 bytes.

In iptunnel_pmtud_check_icmpv6(), avoid triggering a reply only if
the advertised MTU would be less than, and not equal to, 1280 bytes.

Also fix the analogous comparison for IPv4, that is, skip the ICMP
reply only if the resulting MTU is strictly less than 576 bytes.

This becomes apparent while running the net/pmtu.sh bridged VXLAN
or GENEVE selftests with adjusted lower-link MTU values. Using
e.g. GENEVE, setting ll_mtu to the values reported below, in the
test_pmtu_ipvX_over_bridged_vxlanY_or_geneveY_exception() test
function, we can see failures on the following tests:

  test                           | ll_mtu
  -------------------------------|--------
  pmtu_ipv4_br_geneve4_exception |   626
  pmtu_ipv6_br_geneve4_exception |  1330
  pmtu_ipv6_br_geneve6_exception |  1350

owing to the different tunneling overheads implied by the
corresponding configurations.
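The boundary condition described above can be sketched in plain C. This is only an illustration, not the kernel code: the helper names and the EX_ constants are hypothetical, and the overheads used in the usage note (50 bytes for GENEVE over IPv4, 70 bytes over IPv6) are inferred from the table values minus the 576 and 1280 byte minimums.

```c
#include <stdbool.h>

/* Smallest MTUs a reply may advertise, as stated in the commit text. */
#define EX_IPV4_MIN_MTU 576
#define EX_IPV6_MIN_MTU 1280

/* Hypothetical model of the check before the fix: the reply is also
 * skipped when the inner MTU is exactly equal to the minimum. */
static bool ex_skip_reply_old(int ll_mtu, int overhead, int min_mtu)
{
	return ll_mtu - overhead <= min_mtu;
}

/* Hypothetical model of the check after the fix: the reply is skipped
 * only when the inner MTU is strictly below the minimum. */
static bool ex_skip_reply_fixed(int ll_mtu, int overhead, int min_mtu)
{
	return ll_mtu - overhead < min_mtu;
}
```

With ll_mtu = 1350 and 70 bytes of overhead, the old comparison skips the reply while the fixed one sends it with an inner MTU of 1280; the same holds for 626 and 1330 with 50 bytes of overhead.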

Reported-by: Jianlin Shi
Fixes: 4cb47a8644cc ("tunnels: PMTU discovery support for directly bridged IP packets")
Signed-off-by: Stefano Brivio
Link: https://lore.kernel.org/r/4f5fc2f33bfdf8409549fafd4f952b008bf04d63.1604681709.git.sbrivio@redhat.com
Signed-off-by: Jakub Kicinski

Stefano Brivio
2020-11-10 07:39:39 +0800

14 Oct, 2020

1 commit

cf89f18fa iptunnel: use new function dev_fetch_sw_netstats

Simplify the code by using new function dev_fetch_sw_netstats().

Signed-off-by: Heiner Kallweit
Link: https://lore.kernel.org/r/050f9a83-b195-a3d6-edbd-91a59040be21@gmail.com
Signed-off-by: Jakub Kicinski

Heiner Kallweit
2020-10-14 08:33:49 +0800

15 Sep, 2020

1 commit

681d2cfb7 lwtunnel: only keep the available bits when setting vxlan md->gbp

As we can see from vxlan_build_gbp_hdr() and vxlan_parse_gbp_hdr(), when
processing metadata on the vxlan rx/tx path, only the dont_learn,
policy_applied and policy_id fields can be set in or parsed from the
packet for the vxlan gbp option.

So apply the mask when setting it in lwtunnel, as is done in
act_tunnel_key and cls_flower.
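A minimal sketch of the masking step, assuming a bit layout in which dont_learn and policy_applied are single flag bits above a 16-bit policy ID. The EX_ names and exact bit positions are assumptions made for illustration and should be checked against the kernel's vxlan header definitions.

```c
#include <stdint.h>

/* Assumed layout of the GBP metadata word (illustrative only). */
#define EX_GBP_DONT_LEARN     (1u << 22)  /* "don't learn" flag */
#define EX_GBP_POLICY_APPLIED (1u << 19)  /* "policy applied" flag */
#define EX_GBP_ID_MASK        0xFFFFu     /* 16-bit policy ID */

#define EX_GBP_MASK \
	(EX_GBP_DONT_LEARN | EX_GBP_POLICY_APPLIED | EX_GBP_ID_MASK)

/* Hypothetical helper: keep only the bits that can actually be carried
 * in the packet, dropping anything else the user passed in. */
static uint32_t ex_sanitize_gbp(uint32_t user_gbp)
{
	return user_gbp & EX_GBP_MASK;
}
```

A value that already fits in the policy ID, such as 456, passes through unchanged, while stray high bits are silently cleared.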

Signed-off-by: Xin Long
Signed-off-by: David S. Miller

Xin Long
2020-09-15 07:49:39 +0800

06 Aug, 2020

1 commit

8ed54f167 ip_tunnel_core: Fix build for archs without _HAVE_ARCH_IPV6_CSUM

On architectures defining _HAVE_ARCH_IPV6_CSUM, we get
csum_ipv6_magic() defined by means of arch checksum.h headers. On
other architectures, we actually need to include net/ip6_checksum.h
to be able to use it.

Without this include, building with defconfig breaks at least for
s390.

Reported-by: Stephen Rothwell
Fixes: 4cb47a8644cc ("tunnels: PMTU discovery support for directly bridged IP packets")
Signed-off-by: Stefano Brivio
Signed-off-by: David S. Miller

Stefano Brivio
2020-08-06 03:30:36 +0800

05 Aug, 2020

1 commit

4cb47a864 tunnels: PMTU discovery support for directly bridged IP packets

It's currently possible to bridge Ethernet tunnels carrying IP
packets directly to external interfaces without assigning them
addresses and routes on the bridged network itself: this is the case
for UDP tunnels bridged with a standard bridge or by Open vSwitch.

PMTU discovery is currently broken with those configurations, because
the encapsulation effectively decreases the MTU of the link, and
while we are able to account for this using PMTU discovery on the
lower layer, we don't have a way to relay ICMP or ICMPv6 messages
needed by the sender, because we don't have valid routes to it.

On the other hand, as a tunnel endpoint, we can't fragment packets
as a general approach: this is for instance clearly forbidden for
VXLAN by RFC 7348, section 4.3:

  VTEPs MUST NOT fragment VXLAN packets. Intermediate routers may
  fragment encapsulated VXLAN packets due to the larger frame size.
  The destination VTEP MAY silently discard such VXLAN fragments.

The same paragraph recommends that the MTU over the physical network
accommodates for encapsulations, but this isn't a practical option for
complex topologies, especially for typical Open vSwitch use cases.

Further, it states that:

  Other techniques like Path MTU discovery (see [RFC1191] and
  [RFC1981]) MAY be used to address this requirement as well.

Now, PMTU discovery already works for routed interfaces, we get
route exceptions created by the encapsulation device as they receive
ICMP Fragmentation Needed and ICMPv6 Packet Too Big messages, and
we already rebuild those messages with the appropriate MTU and route
them back to the sender.

Add the missing bits for bridged cases:

- checks in skb_tunnel_check_pmtu() to understand if it's appropriate
  to trigger a reply according to RFC 1122 section 3.2.2 for ICMP and
  RFC 4443 section 2.4 for ICMPv6. This function is already called by
  UDP tunnels

- a new function generating those ICMP or ICMPv6 replies. We can't
  reuse icmp_send() and icmp6_send() as we don't see the sender as a
  valid destination. This doesn't need to be generic, as we don't
  cover any other type of ICMP errors given that we only provide an
  encapsulation function to the sender

While at it, make the MTU check in skb_tunnel_check_pmtu() accurate:
we might receive GSO buffers here, and the passed headroom already
includes the inner MAC length, so we don't have to account for it
a second time (that would imply three MAC headers on the wire, but
there are just two).

This issue became visible while bridging IPv6 packets with 4500 bytes
of payload over GENEVE using IPv4 with a PMTU of 4000. Given the 50
bytes of encapsulation headroom, we would advertise MTU as 3950, and
we would reject fragmented IPv6 datagrams of 3958 bytes size on the
wire. We're exclusively dealing with network MTU here, though, so we
could get Ethernet frames up to 3964 octets in that case.
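The arithmetic in this example can be written out as a short C sketch. The helper names are hypothetical; the only value not taken from the text above is the 14-byte Ethernet header length, which is the standard size of an untagged Ethernet header.

```c
/* Standard untagged Ethernet header: 6 + 6 bytes of MAC addresses
 * plus a 2-byte EtherType. */
#define EX_ETH_HLEN 14

/* Hypothetical helper: MTU advertised to the sender, given the PMTU of
 * the lower link and the encapsulation headroom. */
static int ex_inner_mtu(int lower_pmtu, int headroom)
{
	return lower_pmtu - headroom;
}

/* Hypothetical helper: largest inner Ethernet frame that still fits,
 * since the advertised value is a network-layer MTU. */
static int ex_max_inner_frame(int inner_mtu)
{
	return inner_mtu + EX_ETH_HLEN;
}
```

For the case described above, a PMTU of 4000 and 50 bytes of headroom give an advertised MTU of 3950 and inner Ethernet frames of up to 3964 octets.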

v2:
- moved skb_tunnel_check_pmtu() to ip_tunnel_core.c (David Ahern)
- split IPv4/IPv6 functions (David Ahern)

Signed-off-by: Stefano Brivio
Reviewed-by: David Ahern
Signed-off-by: David S. Miller

Stefano Brivio
2020-08-05 04:01:45 +0800

01 Jul, 2020

1 commit

2606aff91 net: ip_tunnel: add header_ops for layer 3 devices

Some devices that take straight up layer 3 packets benefit from having a
shared header_ops so that AF_PACKET sockets can inject packets that are
recognized. This shared infrastructure will be used by other drivers
that currently can't inject packets using AF_PACKET. It also exposes the
parser function, as it is useful in standalone form too.

Signed-off-by: Jason A. Donenfeld
Acked-by: Willem de Bruijn
Signed-off-by: David S. Miller

Jason A. Donenfeld
2020-07-01 03:29:39 +0800

30 Mar, 2020

1 commit

faee67694 net: add net available in build_state

The build_state callback of lwtunnel doesn't contain the net namespace
structure yet. This patch will add it so we can check on specific
address configuration at creation time of rpl source routes.

Signed-off-by: Alexander Aring
Signed-off-by: David S. Miller

Alexander Aring
2020-03-30 13:30:57 +0800

22 Nov, 2019

2 commits

1841b9829 lwtunnel: check erspan options before allocating tun_info

As Jakub suggested on another patch, it's better to do the check
on erspan options before allocating memory.

Signed-off-by: Xin Long
Signed-off-by: David S. Miller

Xin Long
2019-11-22 03:47:39 +0800

7b6a70f73 lwtunnel: be STRICT to validate the new LWTUNNEL_IP(6)_OPTS

LWTUNNEL_IP(6)_OPTS are the new items in ip(6)_tun_policy, which
are parsed by nla_parse_nested_deprecated(). We should check it
strictly by setting .strict_start_type = LWTUNNEL_IP(6)_OPTS.

This patch also adds missing LWTUNNEL_IP6_OPTS in ip6_tun_policy.

Fixes: 4ece47787077 ("lwtunnel: add options setting and dumping for geneve")
Signed-off-by: Xin Long
Signed-off-by: David S. Miller

Xin Long
2019-11-22 03:47:23 +0800

21 Nov, 2019

1 commit

2f1d370b9 lwtunnel: add support for multiple geneve opts

geneve RFC (draft-ietf-nvo3-geneve-14) allows a geneve packet to carry
multiple geneve opts, so it's necessary for lwtunnel to support adding
multiple geneve opts in one lwtunnel route. But vxlan and erspan opts
are still only allowed to add one option.

With this patch, iproute2 could make it like:

  # ip r a 1.1.1.0/24 encap ip id 1 geneve_opts 0:0:12121212,1:2:12121212 \
      dst 10.1.0.2 dev geneve1

  # ip r a 1.1.1.0/24 encap ip id 1 vxlan_opts 456 \
      dst 10.1.0.2 dev erspan1

  # ip r a 1.1.1.0/24 encap ip id 1 erspan_opts 1:123:0:0 \
      dst 10.1.0.2 dev erspan1

Which are pretty much like cls_flower and act_tunnel_key.

Signed-off-by: Xin Long
Signed-off-by: David S. Miller

Xin Long
2019-11-21 02:15:13 +0800

19 Nov, 2019

1 commit

3132174b4 lwtunnel: change to use nla_put_u8 for LWTUNNEL_IP_OPT_ERSPAN_VER

LWTUNNEL_IP_OPT_ERSPAN_VER is u8 type, and nla_put_u8 should have
been used instead of nla_put_u32(). This is a copy-paste error.

Fixes: b0a21810bd5e ("lwtunnel: add options setting and dumping for erspan")
Signed-off-by: Xin Long
Reviewed-by: Simon Horman
Signed-off-by: David S. Miller

Xin Long
2019-11-19 09:19:56 +0800

12 Nov, 2019

3 commits

0c06d166e lwtunnel: ignore any TUNNEL_OPTIONS_PRESENT flags set by users

TUNNEL_OPTIONS_PRESENT (TUNNEL_GENEVE_OPT|TUNNEL_VXLAN_OPT|
TUNNEL_ERSPAN_OPT) flags should be set only according to
tb[LWTUNNEL_IP_OPTS], which is done in ip_tun_parse_opts().

When setting info key.tun_flags, the TUNNEL_OPTIONS_PRESENT
bits in tb[LWTUNNEL_IP(6)_FLAGS] passed from users should
be ignored.

While at it, replace all (TUNNEL_GENEVE_OPT|TUNNEL_VXLAN_OPT|
TUNNEL_ERSPAN_OPT) with 'TUNNEL_OPTIONS_PRESENT'.

Fixes: 3093fbe7ff4b ("route: Per route IP tunnel metadata via lightweight tunnel")
Fixes: 32a2b002ce61 ("ipv6: route: per route IP tunnel metadata via lightweight tunnel")
Signed-off-by: Xin Long
Reviewed-by: Simon Horman
Signed-off-by: David S. Miller

Xin Long
2019-11-12 06:43:02 +0800

58e8494eb lwtunnel: get nlsize for erspan options properly

erspan v1 has OPT_ERSPAN_INDEX while erspan v2 has OPT_ERSPAN_DIR and
OPT_ERSPAN_HWID attributes, and they require different nlsize when
dumping.

So this patch is to get nlsize for erspan options properly according
to erspan version.
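A rough sketch of a version-dependent size calculation, under stated assumptions: each netlink attribute has a 4-byte header and is padded to a 4-byte boundary, the version, direction and hardware ID attributes are one byte each, and the index is four bytes. The function names are hypothetical and the attribute widths should be checked against the kernel sources.

```c
#include <stddef.h>

#define EX_NLA_ALIGNTO 4u  /* netlink attributes are 4-byte aligned */
#define EX_NLA_HDRLEN  4u  /* length + type, two bytes each */

/* Size of one attribute: header plus payload, rounded up to alignment. */
static size_t ex_nla_total_size(size_t payload)
{
	size_t len = EX_NLA_HDRLEN + payload;

	return (len + EX_NLA_ALIGNTO - 1) & ~(size_t)(EX_NLA_ALIGNTO - 1);
}

/* Hypothetical helper: space needed to dump erspan options. */
static size_t ex_erspan_opts_nlsize(int version)
{
	size_t len = ex_nla_total_size(0)   /* enclosing nest header */
		   + ex_nla_total_size(1);  /* version (assumed u8) */

	if (version == 1)
		len += ex_nla_total_size(4);    /* index (assumed 32 bits) */
	else
		len += ex_nla_total_size(1)     /* direction (assumed u8) */
		     + ex_nla_total_size(1);    /* hardware ID (assumed u8) */

	return len;
}
```

Under these assumptions the two versions need different amounts of space, which is why a single fixed estimate cannot be right for both.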

Fixes: b0a21810bd5e ("lwtunnel: add options setting and dumping for erspan")
Signed-off-by: Xin Long
Reviewed-by: Simon Horman
Signed-off-by: David S. Miller

Xin Long
2019-11-12 06:42:38 +0800

ed02551f5 lwtunnel: change to use nla_parse_nested on new options

As the new options added in kernel, all should always use strict
parsing from the beginning with nla_parse_nested(), instead of
nla_parse_nested_deprecated().

Fixes: b0a21810bd5e ("lwtunnel: add options setting and dumping for erspan")
Fixes: edf31cbb1502 ("lwtunnel: add options setting and dumping for vxlan")
Fixes: 4ece47787077 ("lwtunnel: add options setting and dumping for geneve")
Signed-off-by: Xin Long
Reviewed-by: Simon Horman
Signed-off-by: David S. Miller

Xin Long
2019-11-12 06:42:14 +0800

07 Nov, 2019

5 commits

b0a21810b lwtunnel: add options setting and dumping for erspan

Based on the code framework built on the last patch, to
support setting and dumping for erspan, we only need to
add ip_tun_parse_opts_erspan() for .build_state and
ip_tun_fill_encap_opts_erspan() for .fill_encap and
if (tun_flags & TUNNEL_ERSPAN_OPT) for .get_encap_size.

Signed-off-by: Xin Long
Signed-off-by: David S. Miller

Xin Long
2019-11-07 13:14:22 +0800

edf31cbb1 lwtunnel: add options setting and dumping for vxlan

Based on the code framework built on the last patch, to
support setting and dumping for vxlan, we only need to
add ip_tun_parse_opts_vxlan() for .build_state and
ip_tun_fill_encap_opts_vxlan() for .fill_encap and
if (tun_flags & TUNNEL_VXLAN_OPT) for .get_encap_size.

Signed-off-by: Xin Long
Signed-off-by: David S. Miller

Xin Long
2019-11-07 13:14:21 +0800

4ece47787 lwtunnel: add options setting and dumping for geneve

To add options setting and dumping, .build_state(), .fill_encap() and
.get_encap_size() in ip_tun_lwt_ops need to be extended:

  ip_tun_build_state():
    ip_tun_parse_opts():
      ip_tun_parse_opts_geneve()

  ip_tun_fill_encap_info():
    ip_tun_fill_encap_opts():
      ip_tun_fill_encap_opts_geneve()

  ip_tun_encap_nlsize()
    ip_tun_opts_nlsize():
      if (tun_flags & TUNNEL_GENEVE_OPT)

ip_tun_parse_opts(), ip_tun_fill_encap_opts() and ip_tun_opts_nlsize()
process LWTUNNEL_IP_OPTS.

ip_tun_parse_opts_geneve(), ip_tun_fill_encap_opts_geneve() and
if (tun_flags & TUNNEL_GENEVE_OPT) process LWTUNNEL_IP_OPTS_GENEVE.

Signed-off-by: Xin Long
Signed-off-by: David S. Miller

Xin Long
2019-11-07 13:14:21 +0800

0eb8eb2f9 lwtunnel: add options process for cmp_encap

When comparing two tun_info, dst_cache member should have been skipped,
as dst_cache is a per cpu pointer and they are always different values
even in two tun_info with the same keys.

So skip the dst_cache member and compare only the key, mode and
options_len. To prepare for the future options setting support, also
compare the options.
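The comparison described here can be illustrated with a simplified structure. Everything below is a hypothetical stand-in rather than the kernel's tunnel info layout: the point is only that the cache pointer is left out, while key, mode, options length and option bytes are compared.

```c
#include <stdint.h>
#include <string.h>

/* Simplified stand-in for a tunnel key (all fields hypothetical). */
struct ex_tun_key {
	uint32_t tun_id;
	uint32_t src;
	uint32_t dst;
	uint32_t flags;
};

struct ex_tun_info {
	struct ex_tun_key key;
	void *dst_cache;      /* per-CPU pointer: differs even for equal entries */
	uint8_t options_len;
	uint8_t mode;
	uint8_t options[8];
};

/* Returns nonzero if the two entries differ, ignoring dst_cache. */
static int ex_tun_info_differs(const struct ex_tun_info *a,
			       const struct ex_tun_info *b)
{
	return memcmp(&a->key, &b->key, sizeof(a->key)) ||
	       a->mode != b->mode ||
	       a->options_len != b->options_len ||
	       memcmp(a->options, b->options, a->options_len);
}
```

Comparing the whole structure with a single memcmp() would report two otherwise identical entries as different as soon as their cache pointers differ, which is the problem the commit addresses.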

Fixes: 2d79849903e0 ("lwtunnel: ip tunnel: fix multiple routes with different encap")
Signed-off-by: Xin Long
Signed-off-by: David S. Miller

Xin Long
2019-11-07 13:14:21 +0800

f52f11ec8 lwtunnel: add options process for arp request

Without options copied to the dst tun_info in iptunnel_metadata_reply()
called by arp_process for handling arp_request, the generated arp_reply
packet may be dropped or sent out with wrong options for some tunnels
like erspan and vxlan, and the traffic will break.

Fixes: 63d008a4e9ee ("ipv4: send arp replies to the correct tunnel")
Signed-off-by: Xin Long
Signed-off-by: David S. Miller

Xin Long
2019-11-07 13:14:21 +0800

19 Jun, 2019

1 commit

5684abf70 ip_tunnel: allow not to count pkts on tstats by setting skb's dev to NULL

iptunnel_xmit() works as a common function, also used by a udp tunnel
which doesn't have to have a tunnel device, like how TIPC works with
udp media.

In these cases, we should allow not to count pkts on dev's tstats, so
that udp tunnel can work with no tunnel device safely.
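As a small illustration of the idea, the following sketch updates transmit counters only when a device is present. The structures and function name are hypothetical simplifications and do not mirror the kernel's per-CPU statistics.

```c
#include <stddef.h>

/* Hypothetical, simplified counters. */
struct ex_tstats {
	unsigned long tx_packets;
	unsigned long tx_bytes;
};

struct ex_dev {
	struct ex_tstats tstats;
};

/* Account a transmitted packet, unless there is no tunnel device. */
static void ex_count_tx(struct ex_dev *dev, unsigned int pkt_len)
{
	if (!dev)
		return;  /* caller has no tunnel device: nothing to count */

	dev->tstats.tx_packets++;
	dev->tstats.tx_bytes += pkt_len;
}
```

A caller without a tunnel device simply passes NULL and the accounting step becomes a no-op.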

Signed-off-by: Xin Long
Signed-off-by: David S. Miller

Xin Long
2019-06-19 08:48:45 +0800

05 Jun, 2019

1 commit

c94229992 treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 269

Based on 1 normalized pattern(s):

this program is free software you can redistribute it and or modify
it under the terms of version 2 of the gnu general public license as
published by the free software foundation this program is
distributed in the hope that it will be useful but without any
warranty without even the implied warranty of merchantability or
fitness for a particular purpose see the gnu general public license
for more details you should have received a copy of the gnu general
public license along with this program if not write to the free
software foundation inc 51 franklin street fifth floor boston ma
02110 1301 usa

extracted by the scancode license scanner the SPDX license identifier

GPL-2.0-only

has been chosen to replace the boilerplate/reference in 21 file(s).

Signed-off-by: Thomas Gleixner
Reviewed-by: Alexios Zavras
Reviewed-by: Allison Randal
Reviewed-by: Richard Fontana
Cc: linux-spdx@vger.kernel.org
Link: https://lkml.kernel.org/r/20190529141334.228102212@linutronix.de
Signed-off-by: Greg Kroah-Hartman

Thomas Gleixner
2019-06-05 23:30:29 +0800

28 Apr, 2019

1 commit

8cb081746 netlink: make validation more configurable for future strictness

We currently have two levels of strict validation:

 1) liberal (default)
     - undefined (type >= max) & NLA_UNSPEC attributes accepted
     - attribute length >= expected accepted
     - garbage at end of message accepted
 2) strict (opt-in)
     - NLA_UNSPEC attributes accepted
     - attribute length >= expected accepted

Split out parsing strictness into four different options:
 * TRAILING     - check that there's no trailing data after parsing
                  attributes (in message or nested)
 * MAXTYPE      - reject attrs > max known type
 * UNSPEC       - reject attributes with NLA_UNSPEC policy entries
 * STRICT_ATTRS - strictly validate attribute size

The default for future things should be *everything*.
The current *_strict() is a combination of TRAILING and MAXTYPE,
and is renamed to _deprecated_strict().
The current regular parsing has none of this, and is renamed to
*_parse_deprecated().
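The relationship between the four options and the old levels can be sketched with bit flags. The EX_ names, bit positions and the toy message structure are assumptions for illustration only; they are not the kernel's definitions.

```c
#include <stdbool.h>

enum {
	EX_VALIDATE_LIBERAL      = 0,
	EX_VALIDATE_TRAILING     = 1 << 0,
	EX_VALIDATE_MAXTYPE      = 1 << 1,
	EX_VALIDATE_UNSPEC       = 1 << 2,
	EX_VALIDATE_STRICT_ATTRS = 1 << 3,
};

/* The old opt-in strict level: TRAILING and MAXTYPE only. */
#define EX_VALIDATE_DEPRECATED_STRICT \
	(EX_VALIDATE_TRAILING | EX_VALIDATE_MAXTYPE)

/* The default for new users: everything. */
#define EX_VALIDATE_STRICT \
	(EX_VALIDATE_TRAILING | EX_VALIDATE_MAXTYPE | \
	 EX_VALIDATE_UNSPEC | EX_VALIDATE_STRICT_ATTRS)

/* Toy description of what is wrong with a message (hypothetical). */
struct ex_msg {
	bool trailing_data;
	bool type_above_max;
	bool unspec_policy;
	bool oversized_attr;
};

/* Returns true if the message passes validation at the given level. */
static bool ex_accepts(unsigned int flags, const struct ex_msg *m)
{
	if ((flags & EX_VALIDATE_TRAILING) && m->trailing_data)
		return false;
	if ((flags & EX_VALIDATE_MAXTYPE) && m->type_above_max)
		return false;
	if ((flags & EX_VALIDATE_UNSPEC) && m->unspec_policy)
		return false;
	if ((flags & EX_VALIDATE_STRICT_ATTRS) && m->oversized_attr)
		return false;
	return true;
}
```

In this model a message with trailing garbage passes the liberal level but fails the deprecated strict one, while an attribute with an NLA_UNSPEC policy entry is only rejected once all four flags are set.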

Additionally it allows us to selectively set one of the new flags
even on old policies. Notably, the UNSPEC flag could be useful in
this case, since it can be arranged (by filling in the policy) to
not be an incompatible userspace ABI change, but would then going
forward prevent forgetting attribute entries. Similar can apply
to the POLICY flag.

We end up with the following renames:
* nla_parse -> nla_parse_deprecated
* nla_parse_strict -> nla_parse_deprecated_strict
* nlmsg_parse -> nlmsg_parse_deprecated
* nlmsg_parse_strict -> nlmsg_parse_deprecated_strict
* nla_parse_nested -> nla_parse_nested_deprecated
* nla_validate_nested -> nla_validate_nested_deprecated

Using spatch, of course:
@@
expression TB, MAX, HEAD, LEN, POL, EXT;
@@
-nla_parse(TB, MAX, HEAD, LEN, POL, EXT)
+nla_parse_deprecated(TB, MAX, HEAD, LEN, POL, EXT)

@@
expression NLH, HDRLEN, TB, MAX, POL, EXT;
@@
-nlmsg_parse(NLH, HDRLEN, TB, MAX, POL, EXT)
+nlmsg_parse_deprecated(NLH, HDRLEN, TB, MAX, POL, EXT)

@@
expression NLH, HDRLEN, TB, MAX, POL, EXT;
@@
-nlmsg_parse_strict(NLH, HDRLEN, TB, MAX, POL, EXT)
+nlmsg_parse_deprecated_strict(NLH, HDRLEN, TB, MAX, POL, EXT)

@@
expression TB, MAX, NLA, POL, EXT;
@@
-nla_parse_nested(TB, MAX, NLA, POL, EXT)
+nla_parse_nested_deprecated(TB, MAX, NLA, POL, EXT)

@@
expression START, MAX, POL, EXT;
@@
-nla_validate_nested(START, MAX, POL, EXT)
+nla_validate_nested_deprecated(START, MAX, POL, EXT)

@@
expression NLH, HDRLEN, MAX, POL, EXT;
@@
-nlmsg_validate(NLH, HDRLEN, MAX, POL, EXT)
+nlmsg_validate_deprecated(NLH, HDRLEN, MAX, POL, EXT)

For this patch, don't actually add the strict, non-renamed versions
yet so that it breaks compile if I get it wrong.

Also, while at it, make nla_validate and nla_parse go down to a
common __nla_validate_parse() function to avoid code duplication.

Ultimately, this allows us to have very strict validation for every
new caller of nla_parse()/nlmsg_parse() etc as re-introduced in the
next patch, while existing things will continue to work as is.

In effect then, this adds fully strict validation for any new command.

Signed-off-by: Johannes Berg
Signed-off-by: David S. Miller

Johannes Berg
2019-04-28 05:07:21 +0800

25 Feb, 2019

1 commit

3d25eabbb ip_tunnel: Add dst_cache support in lwtunnel_state of ip tunnel

The lwtunnel_state does not initialize the dst_cache, so
ip_md_tunnel_xmit() can't use it and has to look up the
route table for every packet.

Signed-off-by: wenxu
Signed-off-by: David S. Miller

wenxu
2019-02-25 14:13:49 +0800

25 Dec, 2018

2 commits

90cadbbf3 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net

Pull in bug fixes before respinning my net-next pull
request.

Signed-off-by: David S. Miller

David S. Miller
2018-12-25 08:19:56 +0800

7bdca378b iptunnel: Set tun_flags in the iptunnel_metadata_reply from src

ip l add tun type gretap external
ip r a 10.0.0.2 encap ip id 1000 dst 172.168.0.2 key dev tun
ip a a 10.0.0.1/24 dev tun

The peer sends an arp request to 10.0.0.1 with a tunnel_id, but the arp
reply only sets the tun_id and not the TUNNEL_KEY bit in tun_flags, so
the arp reply packet doesn't contain the tun_id field.

Signed-off-by: wenxu
Signed-off-by: David S. Miller

wenxu
2018-12-25 06:20:25 +0800

20 Nov, 2018

1 commit

f2be6d710 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net

David S. Miller
2018-11-20 02:55:00 +0800

18 Nov, 2018

1 commit

16f7eb2b7 ip_tunnel: don't force DF when MTU is locked

The various types of tunnels running over IPv4 can ask to set the DF
bit to do PMTU discovery. However, PMTU discovery is subject to the
threshold set by the net.ipv4.route.min_pmtu sysctl, and is also
disabled on routes with "mtu lock". In those cases, we shouldn't set
the DF bit.

This patch makes setting the DF bit conditional on the route's MTU
locking state.
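The conditional can be sketched as follows. The helper is hypothetical and works on host-order values for simplicity; 0x4000 is the Don't Fragment bit in the IPv4 flags and fragment offset field.

```c
#include <stdbool.h>
#include <stdint.h>

/* Don't Fragment bit of the IPv4 flags/fragment-offset field
 * (host byte order here, for simplicity). */
#define EX_IP_DF 0x4000u

/* Hypothetical helper: DF value to put in the outer header, given what
 * the tunnel asked for and whether the route's MTU is locked. */
static uint16_t ex_effective_df(uint16_t requested_df, bool mtu_locked)
{
	/* With a locked MTU, PMTU discovery is off, so DF must not be set. */
	return mtu_locked ? 0 : requested_df;
}
```

A tunnel that requests DF keeps it on an ordinary route and loses it on a route with "mtu lock"; a tunnel that never asked for DF is unaffected either way.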

This issue seems to be older than git history.

Signed-off-by: Sabrina Dubroca
Reviewed-by: Stefano Brivio
Signed-off-by: David S. Miller

Sabrina Dubroca
2018-11-18 13:50:55 +0800

09 Nov, 2018

1 commit

3e2ed0c25 ipv4/tunnel: use __vlan_hwaccel helpers

Signed-off-by: Michał Mirosław
Signed-off-by: David S. Miller

Michał Mirosław
2018-11-09 12:45:04 +0800

11 May, 2018

1 commit

5263a98f1 net/ipv4: Update ip_tunnel_metadata_cnt static key to modern api

No changes in refcount semantics -- key init is false; replace

static_key_slow_inc|dec with static_branch_inc|dec
static_key_false with static_branch_unlikely

Signed-off-by: Davidlohr Bueso
Signed-off-by: David S. Miller

Davidlohr Bueso
2018-05-11 03:13:33 +0800

25 Jun, 2017

1 commit

3fcece12b net: store port/representator id in metadata_dst

Switches and modern SR-IOV enabled NICs may multiplex traffic from port
representators and control messages over a single set of hardware queues.
Control messages and muxed traffic may need ordered delivery.

Those requirements make it hard to comfortably use TC infrastructure today
unless we have a way of attaching metadata to skbs at the upper device.
Because a single set of queues is used for many netdevs, reliably stopping
the TC/sched queues of all of them is impossible, and the lower device has
to retreat to returning NETDEV_TX_BUSY and usually has to take extra locks
on the fastpath.

This patch attempts to enable port/representative devs to attach metadata
to skbs which carry port id. This way representatives can be queueless and
all queuing can be performed at the lower netdev in the usual way.

Traffic arriving on the port/representative interfaces will have
metadata attached and will subsequently be queued to the lower device for
transmission. The lower device should recognize the metadata and translate
it to HW specific format which is most likely either a special header
inserted before the network headers or descriptor/metadata fields.

Metadata is associated with the lower device by storing the netdev pointer
along with port id so that if TC decides to redirect or mirror the new
netdev will not try to interpret it.

This is mostly for SR-IOV devices since switches don't have lower netdevs
today.

Signed-off-by: Jakub Kicinski
Signed-off-by: Sridhar Samudrala
Signed-off-by: Simon Horman
Signed-off-by: David S. Miller

Jakub Kicinski
2017-06-25 23:42:01 +0800

30 May, 2017

1 commit

9ae287274 net: add extack arg to lwtunnel build state

Pass extack arg down to lwtunnel_build_state and the build_state callbacks.
Add messages for failures in lwtunnel_build_state, and add the extack to
nla_parse where possible in the build_state callbacks.

Signed-off-by: David Ahern
Signed-off-by: David S. Miller

David Ahern
2017-05-30 23:55:32 +0800

14 Apr, 2017

1 commit

fceb6435e netlink: pass extended ACK struct to parsing functions

Pass the new extended ACK reporting struct to all of the generic
netlink parsing functions. For now, pass NULL in almost all callers
(except for some in the core.)

Signed-off-by: Johannes Berg
Signed-off-by: David S. Miller

Johannes Berg
2017-04-14 01:58:22 +0800

31 Jan, 2017

1 commit

30357d7d8 lwtunnel: remove device arg to lwtunnel_build_state

Nothing about lwt state requires a device reference, so remove the
input argument.

Signed-off-by: David Ahern
Signed-off-by: David S. Miller

David Ahern
2017-01-31 04:14:22 +0800

28 Jan, 2017

1 commit

4e8f2fc1a Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net

Two trivial overlapping changes conflicts in MPLS and mlx5.

Signed-off-by: David S. Miller

David S. Miller
2017-01-28 23:33:06 +0800

25 Jan, 2017

1 commit

88ff7334f net: Specify the owning module for lwtunnel ops

Modules implementing lwtunnel ops should not be allowed to unload
while there is state alive using those ops, so specify the owning
module for all lwtunnel ops.

Signed-off-by: Robert Shearman
Signed-off-by: David S. Miller

Robert Shearman
2017-01-25 05:21:36 +0800

09 Jan, 2017

1 commit

bc1f44709 net: make ndo_get_stats64 a void function

The network device operation for reading statistics is only called
in one place, and it ignores the return value. Having a structure
return value is potentially confusing because some future driver could
incorrectly assume that the return value was used.

Fix all drivers with ndo_get_stats64 to have a void function.

Signed-off-by: Stephen Hemminger
Signed-off-by: David S. Miller

stephen hemminger
2017-01-09 06:51:44 +0800

04 Nov, 2016

1 commit

9ee6c5dc8 ipv4: allow local fragmentation in ip_finish_output_gso()

Some configurations (e.g. geneve interface with default
MTU of 1500 over an ethernet interface with 1500 MTU) result
in the transmission of packets that exceed the configured MTU.
While this should be considered to be a "bad" configuration,
it is still allowed and should not result in the sending
of packets that exceed the configured MTU.

Fix by dropping the assumption in ip_finish_output_gso() that
locally originated gso packets will never need fragmentation.
Basic testing using iperf (observing CPU usage and bandwidth)
has shown no measurable performance impact for traffic not
requiring fragmentation.

Fixes: c7ba65d7b649 ("net: ip: push gso skb forwarding handling down the stack")
Reported-by: Jan Tluka
Signed-off-by: Lance Richardson
Acked-by: Hannes Frederic Sowa
Signed-off-by: David S. Miller

Lance Richardson
2016-11-04 04:10:26 +0800

10 Sep, 2016

1 commit

bf8d85d4f ip_tunnel: do not clear l4 hashes

If skb has a valid l4 hash, there is no point in clearing the hash and
forcing a further flow dissection when a tunnel encapsulation is added.

Signed-off-by: Eric Dumazet
Signed-off-by: David S. Miller

Eric Dumazet
2016-09-10 10:33:11 +0800

23 Aug, 2016

1 commit

c0451fe1f net: ip_finish_output_gso: Allow fragmenting segments of tunneled skbs if their DF is unset

In b8247f095e,

"net: ip_finish_output_gso: If skb_gso_network_seglen exceeds MTU, allow segmentation for local udp tunneled skbs"

gso skbs arriving from an ingress interface that go through UDP
tunneling, are allowed to be fragmented if the resulting encapsulated
segments exceed the dst mtu of the egress interface.

This aligned the behavior of gso skbs to non-gso skbs going through udp
encapsulation path.

However the non-gso vs gso anomaly is present also in the following
cases of a GRE tunnel:
- ip_gre in collect_md mode, where TUNNEL_DONT_FRAGMENT is not set
(e.g. OvS vport-gre with df_default=false)
- ip_gre in nopmtudisc mode, where IFLA_GRE_IGNORE_DF is set

In both of the above cases, the non-gso skbs get fragmented, whereas the
gso skbs (having skb_gso_network_seglen that exceeds dst mtu) get dropped,
as they don't go through the segment+fragment code path.

Fix: Setting IPSKB_FRAG_SEGS if the tunnel specified IP_DF bit is NOT set.

Tunnels that do set IP_DF, will not go to fragmentation of segments.
This preserves behavior of ip_gre in (the default) pmtudisc mode.

Fixes: b8247f095e ("net: ip_finish_output_gso: If skb_gso_network_seglen exceeds MTU, allow segmentation for local udp tunneled skbs")
Reported-by: wenxu
Cc: Hannes Frederic Sowa
Signed-off-by: Shmulik Ladkani
Tested-by: wenxu
Acked-by: Hannes Frederic Sowa
Signed-off-by: David S. Miller

Shmulik Ladkani
2016-08-23 08:11:01 +0800

20 Jul, 2016

1 commit

b8247f095 net: ip_finish_output_gso: If skb_gso_network_seglen exceeds MTU, allow segmentation for local udp tunneled skbs

Given:
- tap0 and vxlan0 are bridged
- vxlan0 stacked on eth0, eth0 having small mtu (e.g. 1400)

Assume GSO skbs arriving from tap0 having a gso_size as determined by
user-provided virtio_net_hdr (e.g. 1460 corresponding to VM mtu of 1500).

After encapsulation these skbs have skb_gso_network_seglen that exceed
eth0's ip_skb_dst_mtu.

These skbs are accidentally passed to ip_finish_output2 AS IS.
Alas, each final segment (segmented either by validate_xmit_skb or by
hardware UFO) would be larger than eth0 mtu.
As a result, those above-mtu segments get dropped on certain networks.

This behavior is not aligned with the NON-GSO case:
Assume a non-gso 1500-sized IP packet arrives from tap0. After
encapsulation, the vxlan datagram is fragmented normally at the
ip_finish_output-->ip_fragment code path.

The expected behavior for the GSO case would be segmenting the
"gso-oversized" skb first, then fragmenting each segment according to
dst mtu, and finally passing the resulting fragments to ip_finish_output2.

'ip_finish_output_gso' already supports this "Slowpath" behavior,
according to the IPSKB_FRAG_SEGS flag, which is only set during ipv4
forwarding (not set in the bridged case).

In order to support the bridged case, we'll mark skbs arriving from an
ingress interface that get udp-encapsulated as "allowed to be fragmented",
causing their network_seglen to be validated by 'ip_finish_output_gso'
(and fragment if needed).

Note the TUNNEL_DONT_FRAGMENT tun_flag is still honoured (both in the
gso and non-gso cases), which serves users wishing to forbid
fragmentation at the udp tunnel endpoint.
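The decision described above can be modelled in a few lines of C. This is a simplified sketch with hypothetical names: a single boolean stands in for the "allowed to be fragmented" mark, and the caller is assumed to have already cleared it when TUNNEL_DONT_FRAGMENT is set.

```c
#include <stdbool.h>

enum ex_gso_action {
	EX_PASS_AS_IS,            /* fast path: hand the GSO skb on unchanged */
	EX_SEGMENT_THEN_FRAGMENT, /* slow path: segment, then fragment to MTU */
};

/* Hypothetical model of the output decision for a GSO skb. */
static enum ex_gso_action ex_gso_output(unsigned int network_seglen,
					unsigned int dst_mtu,
					bool frag_segs_allowed)
{
	/* Not marked: keep the historical behaviour and pass it through. */
	if (!frag_segs_allowed)
		return EX_PASS_AS_IS;

	/* Marked, and every segment fits: nothing extra to do. */
	if (network_seglen <= dst_mtu)
		return EX_PASS_AS_IS;

	/* Marked, and segments would exceed the MTU: take the slow path. */
	return EX_SEGMENT_THEN_FRAGMENT;
}
```

In the bridged tap0/vxlan0 scenario, encapsulated segments larger than the 1400-byte MTU of eth0 would take the slow path once the skb is marked, instead of being passed on and dropped later.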

Cc: Hannes Frederic Sowa <hannes@stressinduktion.org>
Cc: Florian Westphal <fw@strlen.de>
Signed-off-by: Shmulik Ladkani <shmulik.ladkani@gmail.com>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

Shmulik Ladkani
2016-07-20 07:40:22 +0800