Eric Lee / smarc-fsl-linux-kernel

26 Mar, 2020

1 commit

2c64605b5 net: Fix CONFIG_NET_CLS_ACT=n and CONFIG_NFT_FWD_NETDEV={y, m} build

net/netfilter/nft_fwd_netdev.c: In function ‘nft_fwd_netdev_eval’:
net/netfilter/nft_fwd_netdev.c:32:10: error: ‘struct sk_buff’ has no member named ‘tc_redirected’
pkt->skb->tc_redirected = 1;
^~
net/netfilter/nft_fwd_netdev.c:33:10: error: ‘struct sk_buff’ has no member named ‘tc_from_ingress’
pkt->skb->tc_from_ingress = 1;
^~

To avoid a direct dependency with tc actions from netfilter, wrap the
redirect bits around CONFIG_NET_REDIRECT and move helpers to
include/linux/skbuff.h. Turn on this toggle from the ifb driver, the
only existing client of these bits in the tree.

This patch adds skb_set_redirected(), which sets the redirected bit
on the skbuff, records whether the packet was redirected from ingress,
and resets the timestamp (the timestamp reset was originally missing in
the netfilter bugfix).
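
As an illustration of the helper described above, here is a minimal userspace sketch. The struct toy_skb type and the toy_skb_set_redirected() name are hypothetical stand-ins, not the kernel's struct sk_buff or its helper, and the sketch assumes the timestamp is cleared unconditionally, which the message does not spell out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the few sk_buff fields mentioned above. */
struct toy_skb {
	uint64_t tstamp;             /* packet timestamp */
	unsigned int redirected:1;   /* packet has been redirected */
	unsigned int from_ingress:1; /* redirect happened on the ingress path */
};

/*
 * Mark the buffer as redirected, record the direction, and clear the
 * timestamp. Assumption: the reset is unconditional in this sketch.
 */
static inline void toy_skb_set_redirected(struct toy_skb *skb, bool from_ingress)
{
	assert(skb != NULL);
	skb->redirected = 1;
	skb->from_ingress = from_ingress;
	skb->tstamp = 0;
}
```

Keeping all three steps in one helper means a caller such as a forwarding expression cannot set the bits and forget the timestamp reset.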

Fixes: bcfabee1afd99484 ("netfilter: nft_fwd_netdev: allow to redirect to ifb via ingress")
Reported-by: noreply@ellerman.id.au
Reported-by: Geert Uytterhoeven
Signed-off-by: Pablo Neira Ayuso
Signed-off-by: David S. Miller

Pablo Neira Ayuso
2020-03-26 03:24:33 +0800

14 Mar, 2020

2 commits

7d7587db0 afs: Fix client call Rx-phase signal handling

Fix the handling of signals in client rxrpc calls made by the afs
filesystem. Ignore signals completely, leaving call abandonment or
connection loss to be detected by timeouts inside AF_RXRPC.

Allowing a filesystem call to be interrupted after the entire request has
been transmitted and an abort sent means that the server may or may not
have done the action - and we don't know. It may even be worse than that
for older servers.

Fixes: bc5e3a546d55 ("rxrpc: Use MSG_WAITALL to tell sendmsg() to temporarily ignore signals")
Signed-off-by: David Howells

David Howells
2020-03-14 07:04:35 +0800

e138aa7d3 rxrpc: Fix call interruptibility handling

Fix the interruptibility of kernel-initiated client calls so that they're
either only interruptible when they're waiting for a call slot to come
available or they're not interruptible at all. Either way, they're not
interruptible during transmission.

This should help prevent StoreData calls from being interrupted when
writeback is in progress. It doesn't, however, handle interruption during
the receive phase.

Userspace-initiated calls are still interruptible. After the signal has
been handled, sendmsg() will return the amount of data copied out of the
buffer and userspace can perform another sendmsg() call to continue
transmission.

Fixes: bc5e3a546d55 ("rxrpc: Use MSG_WAITALL to tell sendmsg() to temporarily ignore signals")
Signed-off-by: David Howells

David Howells
2020-03-14 07:04:30 +0800

04 Mar, 2020

1 commit

4c16d64ea fib: add missing attribute validation for tun_id

Add missing netlink policy entry for FRA_TUN_ID.

Fixes: e7030878fc84 ("fib: Add fib rule match on tunnel id")
Signed-off-by: Jakub Kicinski
Reviewed-by: David Ahern
Signed-off-by: David S. Miller

Jakub Kicinski
2020-03-04 05:28:48 +0800

18 Feb, 2020

1 commit

8a9093c79 net: sched: correct flower port blocking

tc flower rules that are based on src or dst port blocking are sometimes
ineffective due to uninitialized stack data. __skb_flow_dissect() extracts
ports from the skb for tc flower to match against. However, the port
dissection is not done when the FLOW_DIS_IS_FRAGMENT bit is set in
key_control->flags. All callers of __skb_flow_dissect() zero out the
key_control field except for fl_classify() as used by the flower
classifier. Thus, the FLOW_DIS_IS_FRAGMENT may be set on entry to
__skb_flow_dissect(), since key_control is allocated on the stack
and may not be initialized.

Since key_basic and key_control are present for all flow keys, let's
make sure they are initialized.
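
The failure mode and the fix can be sketched in userspace C. Every name below (toy_flow_keys, toy_dissect, toy_classify, TOY_IS_FRAGMENT) is hypothetical and only mirrors the shape of the problem described above; this sketch zeroes the whole key structure, whereas the message only requires key_basic and key_control to be initialized.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define TOY_IS_FRAGMENT 0x1u /* hypothetical stand-in for FLOW_DIS_IS_FRAGMENT */

struct toy_key_control { uint32_t flags; };
struct toy_key_ports { uint16_t src, dst; };

struct toy_flow_keys {
	struct toy_key_control control;
	struct toy_key_ports ports;
};

/* Like the dissector described above: skip ports if the flag is set on entry. */
static void toy_dissect(struct toy_flow_keys *keys, uint16_t src, uint16_t dst)
{
	assert(keys != NULL);
	if (keys->control.flags & TOY_IS_FRAGMENT)
		return;
	keys->ports.src = src;
	keys->ports.dst = dst;
}

/* The caller owns the on-stack keys, so the caller must clear them first. */
static struct toy_key_ports toy_classify(uint16_t src, uint16_t dst)
{
	struct toy_flow_keys keys;

	memset(&keys, 0, sizeof(keys)); /* the fix: no stale stack bits */
	toy_dissect(&keys, src, dst);
	return keys.ports;
}
```

Without the memset(), a stale flag bit left on the stack could make toy_dissect() return early and leave the ports unset, which is the symptom the commit describes for port-based rules.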

Fixes: 62230715fd24 ("flow_dissector: do not dissect l4 ports for fragments")
Co-developed-by: Eric Dumazet
Signed-off-by: Eric Dumazet
Acked-by: Cong Wang
Signed-off-by: Jason Baron
Signed-off-by: David S. Miller

Jason Baron
2020-02-18 13:33:28 +0800

17 Feb, 2020

1 commit

66256e0b1 net/sock.h: fix all kernel-doc warnings

Fix all kernel-doc warnings for <net/sock.h>.
This fixes the following warnings:

../include/net/sock.h:232: warning: Function parameter or member 'skc_addrpair' not described in 'sock_common'
../include/net/sock.h:232: warning: Function parameter or member 'skc_portpair' not described in 'sock_common'
../include/net/sock.h:232: warning: Function parameter or member 'skc_ipv6only' not described in 'sock_common'
../include/net/sock.h:232: warning: Function parameter or member 'skc_net_refcnt' not described in 'sock_common'
../include/net/sock.h:232: warning: Function parameter or member 'skc_v6_daddr' not described in 'sock_common'
../include/net/sock.h:232: warning: Function parameter or member 'skc_v6_rcv_saddr' not described in 'sock_common'
../include/net/sock.h:232: warning: Function parameter or member 'skc_cookie' not described in 'sock_common'
../include/net/sock.h:232: warning: Function parameter or member 'skc_listener' not described in 'sock_common'
../include/net/sock.h:232: warning: Function parameter or member 'skc_tw_dr' not described in 'sock_common'
../include/net/sock.h:232: warning: Function parameter or member 'skc_rcv_wnd' not described in 'sock_common'
../include/net/sock.h:232: warning: Function parameter or member 'skc_tw_rcv_nxt' not described in 'sock_common'

../include/net/sock.h:498: warning: Function parameter or member 'sk_rx_skb_cache' not described in 'sock'
../include/net/sock.h:498: warning: Function parameter or member 'sk_wq_raw' not described in 'sock'
../include/net/sock.h:498: warning: Function parameter or member 'tcp_rtx_queue' not described in 'sock'
../include/net/sock.h:498: warning: Function parameter or member 'sk_tx_skb_cache' not described in 'sock'
../include/net/sock.h:498: warning: Function parameter or member 'sk_route_forced_caps' not described in 'sock'
../include/net/sock.h:498: warning: Function parameter or member 'sk_txtime_report_errors' not described in 'sock'
../include/net/sock.h:498: warning: Function parameter or member 'sk_validate_xmit_skb' not described in 'sock'
../include/net/sock.h:498: warning: Function parameter or member 'sk_bpf_storage' not described in 'sock'

../include/net/sock.h:2024: warning: No description found for return value of 'sk_wmem_alloc_get'
../include/net/sock.h:2035: warning: No description found for return value of 'sk_rmem_alloc_get'
../include/net/sock.h:2046: warning: No description found for return value of 'sk_has_allocations'
../include/net/sock.h:2082: warning: No description found for return value of 'skwq_has_sleeper'
../include/net/sock.h:2244: warning: No description found for return value of 'sk_page_frag'
../include/net/sock.h:2444: warning: Function parameter or member 'tcp_rx_skb_cache_key' not described in 'DECLARE_STATIC_KEY_FALSE'
../include/net/sock.h:2444: warning: Excess function parameter 'sk' description in 'DECLARE_STATIC_KEY_FALSE'
../include/net/sock.h:2444: warning: Excess function parameter 'skb' description in 'DECLARE_STATIC_KEY_FALSE'

Signed-off-by: Randy Dunlap
Signed-off-by: David S. Miller

Randy Dunlap
2020-02-17 11:43:06 +0800

14 Feb, 2020

3 commits

b32cb6fcf Merge tag 'mac80211-for-net-2020-02-14' of git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211

Johannes Berg says:

====================
Just a few fixes:
* avoid running out of tracking space for frames that need
to be reported to userspace by using more bits
* fix beacon handling suppression by adding some relevant
elements to the CRC calculation
* fix quiet mode in action frames
* fix crash in ethtool for virt_wifi and similar
* add a missing policy entry
* fix 160 & 80+80 bandwidth to take local capabilities into
account
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

David S. Miller
2020-02-14 23:16:08 +0800

0b41713b6 icmp: introduce helper for nat'd source address in network device context

This introduces a helper function to be called only by network drivers
that wraps calls to icmp[v6]_send in a conntrack transformation, in case
NAT has been used. We don't want to pollute the non-driver path, though,
so we introduce this as a helper to be called by places that actually
make use of this, as suggested by Florian.

Signed-off-by: Jason A. Donenfeld
Cc: Florian Westphal
Signed-off-by: David S. Miller

Jason A. Donenfeld
2020-02-14 06:19:00 +0800

6ee2deb6f net/flow_dissector: remove unexist field description

@thoff has moved to struct flow_dissector_key_control.

Fixes: 42aecaa9bb2b ("net: Get skb hash over flow_keys structure")
Signed-off-by: Hangbin Liu
Signed-off-by: David S. Miller

Hangbin Liu
2020-02-14 06:13:27 +0800

07 Feb, 2020

1 commit

f2b18baca mac80211: use more bits for ack_frame_id

It turns out that this wasn't a good idea, I hit a test failure in
hwsim due to this. That particular failure was easily worked around,
but it raised questions: if an AP needs to, for example, send action
frames to each connected station, the current limit is nowhere near
enough (especially if those stations are sleeping and the frames are
queued for a while.)

Shuffle around some bits to make more room for ack_frame_id to allow
up to 8192 queued up frames, that's enough for queueing 4 frames to
each connected station, even at the maximum of 2007 stations on a
single AP.

We take the bits from band (which currently only needs 2, but I leave 3 in
case we add another band) and from the hw_queue, which can only need
4 since it has a limit of 16 queues.
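
The arithmetic behind these widths can be checked with a small sketch. The struct toy_tx_info layout is hypothetical; only the widths come from the text above (3 bits for band, 4 for hw_queue, and 13 for ack_frame_id, since 2^13 = 8192).

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical layout: only the three fields discussed above. */
struct toy_tx_info {
	unsigned int band:3;          /* 2 bits needed today, 3 kept as headroom */
	unsigned int hw_queue:4;      /* up to 16 hardware queues */
	unsigned int ack_frame_id:13; /* up to 8192 tracked frames */
};

enum {
	TOY_ACK_ID_BITS = 13,
	TOY_ACK_ID_COUNT = 1 << TOY_ACK_ID_BITS, /* 8192 */
	TOY_MAX_STATIONS = 2007,
	TOY_FRAMES_PER_STATION = 4,
};

/* True if every station can have its frames tracked at the same time. */
static int toy_ack_ids_suffice(void)
{
	return TOY_MAX_STATIONS * TOY_FRAMES_PER_STATION <= TOY_ACK_ID_COUNT;
}

/* Store an ID, rejecting values that would not fit in the bit-field. */
static void toy_set_ack_id(struct toy_tx_info *info, unsigned int id)
{
	assert(info != NULL);
	assert(id < TOY_ACK_ID_COUNT);
	info->ack_frame_id = id;
}
```

With these numbers, 2007 stations times 4 frames is 8028, which fits within the 8192 available IDs.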

Fixes: 6912daed05e1 ("mac80211: Shrink the size of ack_frame_id to make room for tx_time_est")
Signed-off-by: Johannes Berg
Acked-by: Toke Høiland-Jørgensen
Link: https://lore.kernel.org/r/20200115122549.b9a4ef9f4980.Ied52ed90150220b83a280009c590b65d125d087c@changeid
Signed-off-by: Johannes Berg

Johannes Berg
2020-02-07 18:32:27 +0800

05 Feb, 2020

1 commit

38f88c454 bonding/alb: properly access headers in bond_alb_xmit()

syzbot managed to send an IPX packet through bond_alb_xmit()
and af_packet and triggered a use-after-free.

First, bond_alb_xmit() was using ipx_hdr() helper to reach
the IPX header, but ipx_hdr() was using the transport offset
instead of the network offset. In the particular syzbot
report, the transport offset was 0xFFFF.

This patch removes ipx_hdr() since it was only (mis)used from bonding.

Then we need to make sure IPv4/IPv6/IPX headers are pulled
in skb->head before dereferencing anything.

BUG: KASAN: use-after-free in bond_alb_xmit+0x153a/0x1590 drivers/net/bonding/bond_alb.c:1452
Read of size 2 at addr ffff8801ce56dfff by task syz-executor.2/18108
(if (ipx_hdr(skb)->ipx_checksum != IPX_NO_CHECKSUM) ...)

Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Call Trace:
__dump_stack lib/dump_stack.c:17 [inline]
dump_stack+0x14d/0x20b lib/dump_stack.c:53
print_address_description+0x6f/0x20b mm/kasan/report.c:282
kasan_report_error mm/kasan/report.c:380 [inline]
kasan_report mm/kasan/report.c:438 [inline]
kasan_report.cold+0x8c/0x2a0 mm/kasan/report.c:422
__asan_report_load_n_noabort+0xf/0x20 mm/kasan/report.c:469
bond_alb_xmit+0x153a/0x1590 drivers/net/bonding/bond_alb.c:1452
__bond_start_xmit drivers/net/bonding/bond_main.c:4199 [inline]
bond_start_xmit+0x4f4/0x1570 drivers/net/bonding/bond_main.c:4224
__netdev_start_xmit include/linux/netdevice.h:4525 [inline]
netdev_start_xmit include/linux/netdevice.h:4539 [inline]
xmit_one net/core/dev.c:3611 [inline]
dev_hard_start_xmit+0x168/0x910 net/core/dev.c:3627
__dev_queue_xmit+0x1f55/0x33b0 net/core/dev.c:4238
dev_queue_xmit+0x18/0x20 net/core/dev.c:4278
packet_snd net/packet/af_packet.c:3226 [inline]
packet_sendmsg+0x4919/0x70b0 net/packet/af_packet.c:3252
sock_sendmsg_nosec net/socket.c:673 [inline]
sock_sendmsg+0x12c/0x160 net/socket.c:684
__sys_sendto+0x262/0x380 net/socket.c:1996
SYSC_sendto net/socket.c:2008 [inline]
SyS_sendto+0x40/0x60 net/socket.c:2004

Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
Signed-off-by: Eric Dumazet
Reported-by: syzbot
Cc: Jay Vosburgh
Cc: Veaceslav Falico
Cc: Andy Gospodarek
Signed-off-by: David S. Miller

Eric Dumazet
2020-02-05 21:28:09 +0800

30 Jan, 2020

2 commits

31484d56c mptcp: Fix undefined mptcp_handle_ipv6_mapped for modular IPV6

If CONFIG_MPTCP=y, CONFIG_MPTCP_IPV6=n, and CONFIG_IPV6=m:

ERROR: "mptcp_handle_ipv6_mapped" [net/ipv6/ipv6.ko] undefined!

This does not happen if CONFIG_MPTCP_IPV6=y, as CONFIG_MPTCP_IPV6
selects CONFIG_IPV6, and thus forces CONFIG_IPV6 builtin.

As exporting a symbol for an empty function would be a bit wasteful, fix
this by providing a dummy version of mptcp_handle_ipv6_mapped() for the
CONFIG_MPTCP_IPV6=n case.

Rename mptcp_handle_ipv6_mapped() to mptcpv6_handle_mapped(), to make it
clear this is a pure-IPV6 function, just like mptcpv6_init().
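
The dummy-version pattern can be sketched as follows. TOY_MPTCP_IPV6, struct toy_sock, toy_mptcpv6_handle_mapped() and toy_ipv6_caller() are hypothetical names used only for illustration; the macro stands in for the Kconfig symbol and is left undefined to model the =n case.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical socket type with one field the real handler might touch. */
struct toy_sock { bool v6_mapped_handled; };

#ifdef TOY_MPTCP_IPV6
/* Real version, defined in another translation unit when enabled. */
void toy_mptcpv6_handle_mapped(struct toy_sock *sk, bool mapped);
#else
/* Dummy version: callers compile and link without any exported symbol. */
static inline void toy_mptcpv6_handle_mapped(struct toy_sock *sk, bool mapped)
{
	(void)sk;
	(void)mapped;
}
#endif

/* A caller that may live in a separate module; it needs no #ifdef of its own. */
static void toy_ipv6_caller(struct toy_sock *sk, bool mapped)
{
	assert(sk != NULL);
	toy_mptcpv6_handle_mapped(sk, mapped);
}
```

Because the stub is static inline in the header, the calling module never references an external symbol when the feature is disabled, so nothing has to be exported.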

Fixes: cec37a6e41aae7bf ("mptcp: Handle MP_CAPABLE options for outgoing connections")
Signed-off-by: Geert Uytterhoeven
Signed-off-by: David S. Miller

Geert Uytterhoeven
2020-01-30 17:55:54 +0800

d0208bf4d udp: document udp_rcv_segment special case for looped packets

Commit 6cd021a58c18a ("udp: segment looped gso packets correctly")
fixes an issue with rare udp gso multicast packets looped onto the
receive path.

The stable backport makes the narrowest change to target only these
packets, when needed, as opposed to, say, expanding __udp_gso_segment,
which is harder to prove free of unintended side effects.

But the resulting code is hardly self-describing.
Document its purpose and rationale.

Signed-off-by: Willem de Bruijn
Signed-off-by: David S. Miller

Willem de Bruijn
2020-01-30 16:59:53 +0800

28 Jan, 2020

1 commit

6cd021a58 udp: segment looped gso packets correctly

Multicast and broadcast packets can be looped from egress to ingress
pre segmentation with dev_loopback_xmit. That function unconditionally
sets ip_summed to CHECKSUM_UNNECESSARY.

udp_rcv_segment segments gso packets in the udp rx path. Segmentation
usually executes on egress, and does not expect packets of this type.
__udp_gso_segment interprets !CHECKSUM_PARTIAL as CHECKSUM_NONE. But
the offsets are not correct for gso_make_checksum.

UDP GSO packets are of type CHECKSUM_PARTIAL, with their uh->check set
to the correct pseudo header checksum. Reset ip_summed to this type.
(CHECKSUM_PARTIAL is allowed on ingress, see comments in skbuff.h)

Reported-by: syzbot
Fixes: cf329aa42b66 ("udp: cope with UDP GRO packet misdirection")
Signed-off-by: Willem de Bruijn
Signed-off-by: David S. Miller

Willem de Bruijn
2020-01-28 17:56:51 +0800

27 Jan, 2020

7 commits

41c0d917d devlink: add macro for "fw.roce"

Add definition and documentation for the new generic info "fw.roce".

v2: Remove board.nvm_cfg since fw.psid is similar.

Cc: Jiri Pirko
Cc: Jakub Kicinski
Signed-off-by: Vasundhara Volam
Signed-off-by: Michael Chan
Signed-off-by: David S. Miller

Vasundhara Volam
2020-01-27 18:33:29 +0800

c4c57b974 Merge branch 'for-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth-next

Johan Hedberg says:

====================
pull request: bluetooth-next 2020-01-26

Here's (probably) the last bluetooth-next pull request for the 5.6 kernel.

- Initial pieces of Bluetooth 5.2 Isochronous Channels support
- mgmt: Various cleanups and a new Set Blocked Keys command
- btusb: Added support for 04ca:3021 QCA_ROME device
- hci_qca: Multiple fixes & cleanups
- hci_bcm: Fixes & improved device tree support
- Fixed attempts to create duplicate debugfs entries

Please let me know if there are any issues pulling. Thanks.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

David S. Miller
2020-01-27 18:24:46 +0800

9fd1ff5d2 udp: Support UDP fraglist GRO/GSO.

This patch extends UDP GRO to support fraglist GRO/GSO
by using the previously introduced infrastructure.
If the feature is enabled, all UDP packets are going to
fraglist GRO (local input and forward).

After validating the csum, we mark ip_summed as
CHECKSUM_UNNECESSARY for fraglist GRO packets to
make sure that the csum is not touched.

Signed-off-by: Steffen Klassert
Reviewed-by: Willem de Bruijn
Signed-off-by: David S. Miller

Steffen Klassert
2020-01-27 18:00:21 +0800

2e24cd755 net_sched: fix ops->bind_class() implementations

The current implementations of ops->bind_class() are merely
searching for classid and updating class in the struct tcf_result,
without invoking either of cl_ops->bind_tcf() or
cl_ops->unbind_tcf(). This breaks their design, as qdiscs
like cbq use them to count filters too. This is why syzbot triggered
the warning in cbq_destroy_class().

In order to fix this, we have to call cl_ops->bind_tcf() and
cl_ops->unbind_tcf() like the filter binding path. This patch does
so by refactoring out two helper functions __tcf_bind_filter()
and __tcf_unbind_filter(), which are lockless and accept a Qdisc
pointer, then teaching each implementation to call them correctly.

Note, we merely pass the Qdisc pointer as an opaque pointer to
each filter, they only need to pass it down to the helper
functions without understanding it at all.

Fixes: 07d79fc7d94e ("net_sched: add reverse binding for tc class")
Reported-and-tested-by: syzbot+0a0596220218fcb603a8@syzkaller.appspotmail.com
Reported-and-tested-by: syzbot+63bdb6006961d8c917c6@syzkaller.appspotmail.com
Cc: Jamal Hadi Salim
Cc: Jiri Pirko
Signed-off-by: Cong Wang
Signed-off-by: David S. Miller

Cong Wang
2020-01-27 17:51:43 +0800

3c4287f62 nf_tables: Add set type for arbitrary concatenation of ranges

This new set type allows for intervals in concatenated fields,
which are expressed in the usual way, that is, simple byte
concatenation with padding to 32 bits for single fields, and
given as ranges by specifying start and end elements containing,
each, the full concatenation of start and end values for the
single fields.

Ranges are expanded to composing netmasks, for each field: these
are inserted as rules in per-field lookup tables. Bits to be
classified are divided in 4-bit groups, and for each group, the
lookup table contains 2^4 = 16 buckets, representing all the possible
values of a bit group. This approach was inspired by the Grouper
algorithm:
http://www.cse.usf.edu/~ligatti/projects/grouper/

Matching is performed by a sequence of AND operations between
bucket values, with buckets selected according to the value of
packet bits, for each group. The result of this sequence tells
us which rules matched for a given field.
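
The per-field lookup can be illustrated with a reduced sketch: a single 8-bit field split into two 4-bit groups, 16 buckets per group, and at most 32 rules stored as bits of a 32-bit word. All toy_* names are hypothetical, and the sketch omits range-to-netmask expansion and the mapping arrays between fields; it is not the nft_set_pipapo.c implementation.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_GROUPS  2  /* one 8-bit field = two 4-bit groups */
#define TOY_BUCKETS 16 /* all possible values of a 4-bit group */

struct toy_field {
	/* lt[g][b]: bit r is set if rule r accepts value b in group g */
	uint32_t lt[TOY_GROUPS][TOY_BUCKETS];
	unsigned int nrules;
};

/* Insert one netmask rule (value/mask_len over 8 bits); returns its index. */
static unsigned int toy_insert(struct toy_field *f, uint8_t value,
			       unsigned int mask_len)
{
	assert(f->nrules < 32 && mask_len <= 8);

	unsigned int rule = f->nrules++;
	uint8_t mask = mask_len ? (uint8_t)(0xffu << (8 - mask_len)) : 0;

	for (unsigned int g = 0; g < TOY_GROUPS; g++) {
		unsigned int shift = 4 * (TOY_GROUPS - 1 - g); /* group 0 = high bits */
		unsigned int gmask = (mask >> shift) & 0xf;
		unsigned int gval = (value >> shift) & 0xf;

		for (unsigned int b = 0; b < TOY_BUCKETS; b++)
			if ((b & gmask) == (gval & gmask))
				f->lt[g][b] |= 1u << rule;
	}
	return rule;
}

/* AND together the buckets selected by the key; result = bitmap of matches. */
static uint32_t toy_lookup(const struct toy_field *f, uint8_t key)
{
	uint32_t res = ~0u;

	for (unsigned int g = 0; g < TOY_GROUPS; g++) {
		unsigned int shift = 4 * (TOY_GROUPS - 1 - g);

		res &= f->lt[g][(key >> shift) & 0xf];
	}
	return res;
}
```

Lookup cost depends on the number of bit groups rather than the number of rules, and the inner loop is a plain sequence of ANDs, which is what makes the approach amenable to SIMD.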

In order to concatenate several ranged fields, per-field rules
are mapped using mapping arrays, one per field, that specify
which rules should be considered while matching the next field.
The mapping array for the last field contains a reference to
the element originally inserted.

The notes in nft_set_pipapo.c cover the algorithm in deeper
detail.

A pure hash-based approach is of no use here, as ranges need
to be classified. An implementation based on "proxying" the
existing red-black tree set type, creating a tree for each
field, was considered, but deemed impractical due to the fact
that elements would need to be shared between trees, at least
as long as we want to keep UAPI changes to a minimum.

A stand-alone implementation of this algorithm is available at:
https://pipapo.lameexcu.se
together with notes about possible future optimisations
(in pipapo.c).

This algorithm was designed with data locality in mind, and can
be highly optimised for SIMD instruction sets, as the bulk of
the matching work is done with repetitive, simple bitwise
operations.

At this point, without further optimisations, nft_concat_range.sh
reports, for one AMD Epyc 7351 thread (2.9GHz, 512 KiB L1D$, 8 MiB
L2$):

TEST: performance
net,port [ OK ]
baseline (drop from netdev hook): 10190076pps
baseline hash (non-ranged entries): 6179564pps
baseline rbtree (match on first field only): 2950341pps
set with 1000 full, ranged entries: 2304165pps
port,net [ OK ]
baseline (drop from netdev hook): 10143615pps
baseline hash (non-ranged entries): 6135776pps
baseline rbtree (match on first field only): 4311934pps
set with 100 full, ranged entries: 4131471pps
net6,port [ OK ]
baseline (drop from netdev hook): 9730404pps
baseline hash (non-ranged entries): 4809557pps
baseline rbtree (match on first field only): 1501699pps
set with 1000 full, ranged entries: 1092557pps
port,proto [ OK ]
baseline (drop from netdev hook): 10812426pps
baseline hash (non-ranged entries): 6929353pps
baseline rbtree (match on first field only): 3027105pps
set with 30000 full, ranged entries: 284147pps
net6,port,mac [ OK ]
baseline (drop from netdev hook): 9660114pps
baseline hash (non-ranged entries): 3778877pps
baseline rbtree (match on first field only): 3179379pps
set with 10 full, ranged entries: 2082880pps
net6,port,mac,proto [ OK ]
baseline (drop from netdev hook): 9718324pps
baseline hash (non-ranged entries): 3799021pps
baseline rbtree (match on first field only): 1506689pps
set with 1000 full, ranged entries: 783810pps
net,mac [ OK ]
baseline (drop from netdev hook): 10190029pps
baseline hash (non-ranged entries): 5172218pps
baseline rbtree (match on first field only): 2946863pps
set with 1000 full, ranged entries: 1279122pps

v4:
- fix build for 32-bit architectures: 64-bit division needs
div_u64() (kbuild test robot)
v3:
- rework interface for field length specification,
NFT_SET_SUBKEY disappears and information is stored in
description
- remove scratch area to store closing element of ranges,
as elements now come with an actual attribute to specify
the upper range limit (Pablo Neira Ayuso)
- also remove pointer to 'start' element from mapping table,
closing key is now accessible via extension data
- use bytes right away instead of bits for field lengths,
this way we can also double the inner loop of the lookup
function to take care of upper and lower bits in a single
iteration (minor performance improvement)
- make it clearer that set operations are actually atomic
API-wise, but we can't e.g. implement flush() as one-shot
action
- fix type for 'dup' in nft_pipapo_insert(), check for
duplicates only in the next generation, and in general take
care of differentiating generation mask cases depending on
the operation (Pablo Neira Ayuso)
- report C implementation matching rate in commit message, so
that AVX2 implementation can be compared (Pablo Neira Ayuso)
v2:
- protect access to scratch maps in nft_pipapo_lookup() with
local_bh_disable/enable() (Florian Westphal)
- drop rcu_read_lock/unlock() from nft_pipapo_lookup(), it's
already implied (Florian Westphal)
- explain why partial allocation failures don't need handling
in pipapo_realloc_scratch(), rename 'm' to clone and update
related kerneldoc to make it clear we're not operating on
the live copy (Florian Westphal)
- add explicit check for priv->start_elem in
nft_pipapo_insert() to avoid ending up in nft_pipapo_walk()
with a NULL start element, and also zero it out in every
operation that might make it invalid, so that insertion
doesn't proceed with an invalid element (Florian Westphal)

Signed-off-by: Stefano Brivio
Signed-off-by: Pablo Neira Ayuso

Stefano Brivio
2020-01-27 15:54:30 +0800

f3a2181e1 netfilter: nf_tables: Support for sets with multiple ranged fields

Introduce a new nested netlink attribute, NFTA_SET_DESC_CONCAT, used
to specify the length of each field in a set concatenation.

This allows set implementations to support concatenation of multiple
ranged items, as they can divide the input key into matching data for
every single field. Such set implementations would be selected as
they specify support for NFT_SET_INTERVAL and allow desc->field_count
to be greater than one. Explicitly disallow this for nft_set_rbtree.

In order to specify the interval for a set entry, userspace would
include in NFTA_SET_DESC_CONCAT attributes field lengths, and pass
range endpoints as two separate keys, represented by attributes
NFTA_SET_ELEM_KEY and NFTA_SET_ELEM_KEY_END.

While at it, export the number of 32-bit registers available for
packet matching, as nftables will need this to know the maximum
number of field lengths that can be specified.

For example, "packets with an IPv4 address between 192.0.2.0 and
192.0.2.42, with destination port between 22 and 25", can be
expressed as two concatenated elements:

NFTA_SET_ELEM_KEY: 192.0.2.0 . 22
NFTA_SET_ELEM_KEY_END: 192.0.2.42 . 25

and NFTA_SET_DESC_CONCAT attribute would contain:

NFTA_LIST_ELEM
NFTA_SET_FIELD_LEN: 4
NFTA_LIST_ELEM
NFTA_SET_FIELD_LEN: 2
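
To show how the field lengths let a set implementation split the two keys, here is a minimal sketch of a per-field range check. The struct toy_elem type and toy_match() are hypothetical; the sketch assumes fields are stored back to back in network byte order and ignores the padding of single fields to 32 bits.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical element: start and end keys, each a full concatenation. */
struct toy_elem {
	const uint8_t *key;     /* like NFTA_SET_ELEM_KEY */
	const uint8_t *key_end; /* like NFTA_SET_ELEM_KEY_END */
};

/*
 * Return 1 if every field of pkt lies within [key, key_end] for that field.
 * With big-endian field bytes, memcmp() orders fields numerically.
 */
static int toy_match(const struct toy_elem *e, const uint8_t *pkt,
		     const size_t *field_len, size_t nfields)
{
	size_t off = 0;

	assert(e != NULL && pkt != NULL && field_len != NULL);
	for (size_t i = 0; i < nfields; i++) {
		if (memcmp(pkt + off, e->key + off, field_len[i]) < 0 ||
		    memcmp(pkt + off, e->key_end + off, field_len[i]) > 0)
			return 0;
		off += field_len[i];
	}
	return 1;
}
```

The check is done field by field rather than on the whole concatenation, so an address inside the range with a port outside it does not match.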

v4: No changes
v3: Complete rework, NFTA_SET_DESC_CONCAT instead of NFTA_SET_SUBKEY
v2: No changes

Signed-off-by: Stefano Brivio
Signed-off-by: Pablo Neira Ayuso

Stefano Brivio
2020-01-27 15:54:30 +0800

7b225d0b5 netfilter: nf_tables: add NFTA_SET_ELEM_KEY_END attribute

Add NFTA_SET_ELEM_KEY_END attribute to convey the closing element of the
interval between kernel and userspace.

This patch also adds the NFT_SET_EXT_KEY_END extension to store the
closing element value in this interval.

v4: No changes
v3: New patch

[sbrivio: refactor error paths and labels; add corresponding
nft_set_ext_type for new key; rebase]
Signed-off-by: Pablo Neira Ayuso

Pablo Neira Ayuso
2020-01-27 15:54:30 +0800

26 Jan, 2020

1 commit

4d8773b68 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net

Minor conflict in mlx5 because changes happened to code that has
moved meanwhile.

Signed-off-by: David S. Miller

David S. Miller
2020-01-26 17:40:21 +0800

25 Jan, 2020

3 commits

ef6aadcc7 net: sched: Make TBF Qdisc offloadable

Invoke ndo_setup_tc as appropriate to signal init / replacement, destroying
and dumping of TBF Qdisc.

Signed-off-by: Petr Machata
Acked-by: Jiri Pirko
Signed-off-by: Ido Schimmel
Signed-off-by: David S. Miller

Petr Machata
2020-01-25 17:56:31 +0800

e42f1ac62 mptcp: do not inherit inet proto ops

We need to initialise the struct ourselves, else we expose tcp-specific
callbacks such as tcp_splice_read, which will then trigger a splat because
the socket is an mptcp one:

BUG: KASAN: slab-out-of-bounds in tcp_mstamp_refresh+0x80/0xa0 net/ipv4/tcp_output.c:57
Write of size 8 at addr ffff888116aa21d0 by task syz-executor.0/5478

CPU: 1 PID: 5478 Comm: syz-executor.0 Not tainted 5.5.0-rc6 #3
Call Trace:
tcp_mstamp_refresh+0x80/0xa0 net/ipv4/tcp_output.c:57
tcp_rcv_space_adjust+0x72/0x7f0 net/ipv4/tcp_input.c:612
tcp_read_sock+0x622/0x990 net/ipv4/tcp.c:1674
tcp_splice_read+0x20b/0xb40 net/ipv4/tcp.c:791
do_splice+0x1259/0x1560 fs/splice.c:1205

To prevent a build error with ipv6, add the recv/sendmsg function
declaration to ipv6.h. The functions are already accessible "thanks"
to retpoline related work, but they are currently only made visible
by socket.c specific INDIRECT_CALLABLE macros.

Reported-by: Christoph Paasch
Signed-off-by: Florian Westphal
Signed-off-by: Mat Martineau
Signed-off-by: David S. Miller

Florian Westphal
2020-01-25 15:14:39 +0800

eb014de4f netfilter: nf_tables: autoload modules from the abort path

This patch introduces a list of pending module requests. This new module
list is composed of nft_module_request objects that contain the module
name and one status field that tells if the module has been already
loaded (the 'done' field).

In the first pass, from the preparation phase, the netlink command finds
that a module is missing on this list. Then, a module request is
allocated and added to this list and nft_request_module() returns
-EAGAIN. This triggers the abort path with the autoload parameter set on
from nfnetlink, request_module() is called and the module request enters
the 'done' state. Since the mutex is released when loading modules from
the abort phase, the module list is zapped so this iteration occurs
over a local list. Therefore, the request_module() calls happen when
object lists are in a consistent state (after fully aborting the
transaction) and the commit list is empty.

On the second pass, the netlink command will find that it already tried
to load the module, so it does not request it again and
nft_request_module() returns 0. Then, there is a look up to find the
object that the command was missing. If the module was successfully
loaded, the command proceeds normally since it finds the missing object
in place, otherwise -ENOENT is reported to userspace.
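
The two-pass flow can be modelled in userspace as follows. The toy_* types and functions are hypothetical and use a fixed array in place of the kernel's list; the point where request_module() would run is only marked by a comment, and the final object lookup is left out.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

#define TOY_MAX_REQ 8

/* Hypothetical counterpart of the module request object described above. */
struct toy_module_request {
	char name[32];
	bool done; /* a load has already been attempted */
};

struct toy_ctx {
	struct toy_module_request req[TOY_MAX_REQ];
	unsigned int nreq;
};

/* Preparation phase: 0 if the load was already tried, -EAGAIN otherwise. */
static int toy_request_module(struct toy_ctx *ctx, const char *name)
{
	for (unsigned int i = 0; i < ctx->nreq; i++)
		if (strcmp(ctx->req[i].name, name) == 0)
			return ctx->req[i].done ? 0 : -EAGAIN;

	assert(ctx->nreq < TOY_MAX_REQ);
	snprintf(ctx->req[ctx->nreq].name, sizeof(ctx->req[0].name), "%s", name);
	ctx->req[ctx->nreq].done = false;
	ctx->nreq++;
	return -EAGAIN; /* triggers the abort path with autoload set */
}

/* Abort phase with autoload: attempt each pending load and mark it done. */
static unsigned int toy_abort_autoload(struct toy_ctx *ctx)
{
	unsigned int attempted = 0;

	for (unsigned int i = 0; i < ctx->nreq; i++) {
		if (ctx->req[i].done)
			continue;
		/* the kernel would call request_module() at this point */
		ctx->req[i].done = true;
		attempted++;
	}
	return attempted;
}
```

On the second pass toy_request_module() returns 0 whether or not the load succeeded; the caller then looks the object up again and reports -ENOENT if it is still missing.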

This patch also updates nfnetlink to include the reason to enter the
abort phase, which is required for this new autoload module rationale.

Fixes: ec7470b834fe ("netfilter: nf_tables: store transaction list locally while requesting module")
Reported-by: syzbot+29125d208b3dae9a7019@syzkaller.appspotmail.com
Signed-off-by: Pablo Neira Ayuso

Pablo Neira Ayuso
2020-01-25 03:54:29 +0800

24 Jan, 2020

6 commits

cc7972ea1 mptcp: parse and emit MP_CAPABLE option according to v1 spec

This implements MP_CAPABLE options parsing and writing according
to RFC 6824 bis / RFC 8684: MPTCP v1.

Local key is sent on syn/ack, and both keys are sent on 3rd ack.
MP_CAPABLE message lengths are updated accordingly. We need the skbuff to
correctly emit the above, so we push the skbuff struct as an argument
all the way from tcp code to the relevant mptcp callbacks.

When processing incoming MP_CAPABLE + data, build a full blown DSS-like
map info, to simplify later processing. On child socket creation, we
need to record the remote key, if available.

Signed-off-by: Christoph Paasch
Signed-off-by: David S. Miller

Christoph Paasch
2020-01-24 20:44:08 +0800

648ef4b88 mptcp: Implement MPTCP receive path

Parses incoming DSS options and populates outgoing MPTCP ACK
fields. MPTCP fields are parsed from the TCP option header and placed in
an skb extension, allowing the upper MPTCP layer to access MPTCP
options after the skb has gone through the TCP stack.

The subflow implements its own data_ready() ops, which ensures that the
pending data is in sequence - according to MPTCP seq number - dropping
out-of-seq skbs. The DATA_READY bit flag is set if this is the case.
This allows the MPTCP socket layer to determine if more data is
available without having to consult the individual subflows.

It additionally validates the current mapping and propagates EoF events
to the connection socket.

Co-developed-by: Paolo Abeni
Signed-off-by: Paolo Abeni
Co-developed-by: Peter Krystad
Signed-off-by: Peter Krystad
Co-developed-by: Davide Caratti
Signed-off-by: Davide Caratti
Co-developed-by: Matthieu Baerts
Signed-off-by: Matthieu Baerts
Co-developed-by: Florian Westphal
Signed-off-by: Florian Westphal
Signed-off-by: Mat Martineau
Signed-off-by: Christoph Paasch
Signed-off-by: David S. Miller

Mat Martineau
2020-01-24 20:44:07 +0800

6d0060f60 mptcp: Write MPTCP DSS headers to outgoing data packets

Per-packet metadata required to write the MPTCP DSS option is written to
the skb_ext area. One write to the socket may contain more than one
packet of data, which is copied to page fragments and mapped in to MPTCP
DSS segments with size determined by the available page fragments and
the maximum mapping length allowed by the MPTCP specification. If
do_tcp_sendpages() splits a DSS segment in to multiple skbs, that's ok -
the later skbs can either have duplicated DSS mapping information or
none at all, and the receiver can handle that.

The current implementation uses the subflow frag cache and tcp
sendpages to avoid excessive code duplication. More work is required to
ensure that it works correctly under memory pressure and to support
MPTCP-level retransmissions.

The MPTCP DSS checksum is not yet implemented.

Co-developed-by: Paolo Abeni
Signed-off-by: Paolo Abeni
Co-developed-by: Peter Krystad
Signed-off-by: Peter Krystad
Co-developed-by: Florian Westphal
Signed-off-by: Florian Westphal
Signed-off-by: Mat Martineau
Signed-off-by: Christoph Paasch
Signed-off-by: David S. Miller

Mat Martineau
2020-01-24 20:44:07 +0800
cec37a6e4 mptcp: Handle MP_CAPABLE options for outgoing connections

Add hooks to tcp_output.c to add MP_CAPABLE to an outgoing SYN request,
to capture the MP_CAPABLE in the received SYN-ACK, and to add MP_CAPABLE
to the final ACK of the three-way handshake.

Use the .sk_rx_dst_set() handler in the subflow proto to capture when the
responding SYN-ACK is received and notify the MPTCP connection layer.

Co-developed-by: Paolo Abeni
Signed-off-by: Paolo Abeni
Co-developed-by: Florian Westphal
Signed-off-by: Florian Westphal
Signed-off-by: Peter Krystad
Signed-off-by: Christoph Paasch
Signed-off-by: David S. Miller

Peter Krystad
2020-01-24 20:44:07 +0800
eda7acddf mptcp: Handle MPTCP TCP options

Add hooks to parse and format the MP_CAPABLE option.

This option is handled according to MPTCP version 0 (RFC6824).
MPTCP version 1 MP_CAPABLE (RFC6824bis/RFC8684) will be added later in
coordination with related code changes.
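
To make "parse and format" concrete, here is a userspace sketch of the
12-byte MP_CAPABLE option carried on a SYN under MPTCP version 0, using
the layout from RFC 6824 (kind 30, subtype 0, a flags byte, then the
sender's 64-bit key). The function and macro names are hypothetical and
do not correspond to the kernel helpers added by this commit.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TCPOPT_MPTCP_KIND        30   /* TCP option kind used by MPTCP */
#define MPTCP_SUB_MP_CAPABLE     0x0  /* MP_CAPABLE subtype */
#define MPTCP_VERSION_0          0x0  /* RFC 6824 */
#define MPCAPABLE_SYN_LEN        12   /* kind, len, subtype/version, flags, key */
#define MPCAPABLE_FLAG_HMAC_SHA1 0x01 /* flag bit "H" */

/* Write an MP_CAPABLE SYN option; returns bytes written, 0 if too small. */
static size_t format_mp_capable_syn(uint8_t *buf, size_t buflen,
				    uint64_t sender_key, uint8_t flags)
{
	assert(buf != NULL);

	if (buflen < MPCAPABLE_SYN_LEN)
		return 0;

	buf[0] = TCPOPT_MPTCP_KIND;
	buf[1] = MPCAPABLE_SYN_LEN;
	buf[2] = (uint8_t)((MPTCP_SUB_MP_CAPABLE << 4) | MPTCP_VERSION_0);
	buf[3] = flags;
	for (int i = 0; i < 8; i++) /* key in network byte order */
		buf[4 + i] = (uint8_t)(sender_key >> (56 - 8 * i));
	return MPCAPABLE_SYN_LEN;
}

/* Parse the same option; returns 0 on success, -1 if malformed. */
static int parse_mp_capable_syn(const uint8_t *buf, size_t len,
				uint64_t *sender_key, uint8_t *flags)
{
	uint64_t key = 0;

	if (len < MPCAPABLE_SYN_LEN || buf[0] != TCPOPT_MPTCP_KIND ||
	    buf[1] != MPCAPABLE_SYN_LEN)
		return -1;
	if ((buf[2] >> 4) != MPTCP_SUB_MP_CAPABLE ||
	    (buf[2] & 0x0f) != MPTCP_VERSION_0)
		return -1;

	for (int i = 0; i < 8; i++)
		key = (key << 8) | buf[4 + i];
	*sender_key = key;
	*flags = buf[3];
	return 0;
}
```

The third ACK of the handshake carries a 20-byte variant with both keys,
which this sketch leaves out.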

Co-developed-by: Matthieu Baerts
Signed-off-by: Matthieu Baerts
Co-developed-by: Florian Westphal
Signed-off-by: Florian Westphal
Co-developed-by: Davide Caratti
Signed-off-by: Davide Caratti
Signed-off-by: Peter Krystad
Signed-off-by: Christoph Paasch
Signed-off-by: David S. Miller

Peter Krystad
2020-01-24 20:44:07 +0800
f870fa0b5 mptcp: Add MPTCP socket stubs

Implements the infrastructure for MPTCP sockets.

MPTCP sockets open one in-kernel TCP socket per subflow. These subflow
sockets are only managed by the MPTCP socket that owns them and are not
visible from userspace. This commit allows a userspace program to open
an MPTCP socket with:

    sock = socket(AF_INET, SOCK_STREAM, IPPROTO_MPTCP);

The resulting socket is simply a wrapper around a single regular TCP
socket, without any of the MPTCP protocol implemented over the wire.
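
A userspace program might wrap that call as in the sketch below, which
tries IPPROTO_MPTCP first and falls back to plain TCP if the kernel
rejects the protocol. The helper name open_stream_socket() is
hypothetical; the fallback define covers libc headers that do not yet
provide IPPROTO_MPTCP (262 is the value Linux uses).

```c
#include <assert.h>
#include <netinet/in.h>
#include <stddef.h>
#include <sys/socket.h>
#include <unistd.h> /* close(), for callers */

#ifndef IPPROTO_MPTCP
#define IPPROTO_MPTCP 262
#endif

/*
 * Hypothetical helper: open an MPTCP socket if the running kernel
 * supports it, otherwise a regular TCP socket.
 * Sets *used_mptcp to 1 or 0 accordingly.
 * Returns a file descriptor, or -1 with errno set.
 */
static int open_stream_socket(int *used_mptcp)
{
	int fd;

	assert(used_mptcp != NULL);

	fd = socket(AF_INET, SOCK_STREAM, IPPROTO_MPTCP);
	if (fd >= 0) {
		*used_mptcp = 1;
		return fd;
	}

	/* Kernel built without MPTCP, or MPTCP disabled: use plain TCP. */
	*used_mptcp = 0;
	return socket(AF_INET, SOCK_STREAM, IPPROTO_TCP);
}
```

The returned descriptor is used with the usual connect(), send(), recv()
and close() calls in either case.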

Co-developed-by: Florian Westphal
Signed-off-by: Florian Westphal
Co-developed-by: Peter Krystad
Signed-off-by: Peter Krystad
Co-developed-by: Matthieu Baerts
Signed-off-by: Matthieu Baerts
Co-developed-by: Paolo Abeni
Signed-off-by: Paolo Abeni
Signed-off-by: Mat Martineau
Signed-off-by: Christoph Paasch
Signed-off-by: David S. Miller

Mat Martineau
2020-01-24 20:44:07 +0800

23 Jan, 2020

9 commits

ec97ecf1e net: sched: add Flow Queue PIE packet scheduler

Principles:
  - Packets are classified on flows.
  - This is a stochastic model (as we use a hash, several flows might
    be hashed to the same slot).
  - Each flow has a PIE managed queue.
  - Flows are linked onto two (Round Robin) lists,
    so that new flows have priority over old ones.
  - For a given flow, packets are not reordered.
  - Drops during enqueue only.
  - ECN capability is off by default.
  - ECN threshold (if ECN is enabled) is at 10% by default.
  - Uses timestamps to calculate queue delay by default.
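
The stochastic classification in the list above can be illustrated with
a small userspace sketch: a flow key is hashed and reduced to one of a
fixed number of slots, so two distinct flows can land in the same slot.
The key layout and the FNV-1a hash are assumptions made for the example;
the qdisc itself relies on the kernel's own flow hashing.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical flow key; not a kernel structure. */
struct flow_key_sketch {
	uint32_t saddr, daddr;
	uint16_t sport, dport;
	uint8_t proto;
};

/* 32-bit FNV-1a, used here only as a stand-in hash function. */
static uint32_t fnv1a(const uint8_t *data, size_t len)
{
	uint32_t h = 2166136261u;

	for (size_t i = 0; i < len; i++) {
		h ^= data[i];
		h *= 16777619u;
	}
	return h;
}

/* Map a flow to a slot in [0, flows). Different flows may collide. */
static uint32_t flow_slot(const struct flow_key_sketch *k, uint32_t flows)
{
	/* Serialize field by field so struct padding is never hashed. */
	uint8_t b[13] = {
		(uint8_t)(k->saddr >> 24), (uint8_t)(k->saddr >> 16),
		(uint8_t)(k->saddr >> 8),  (uint8_t)k->saddr,
		(uint8_t)(k->daddr >> 24), (uint8_t)(k->daddr >> 16),
		(uint8_t)(k->daddr >> 8),  (uint8_t)k->daddr,
		(uint8_t)(k->sport >> 8),  (uint8_t)k->sport,
		(uint8_t)(k->dport >> 8),  (uint8_t)k->dport,
		k->proto,
	};

	assert(flows > 0);
	return fnv1a(b, sizeof(b)) % flows;
}
```

Every packet of a flow maps to the same slot, which is why packets
within one flow are not reordered.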

Usage:
tc qdisc ... fq_pie [ limit PACKETS ] [ flows NUMBER ]
                    [ target TIME ] [ tupdate TIME ]
                    [ alpha NUMBER ] [ beta NUMBER ]
                    [ quantum BYTES ] [ memory_limit BYTES ]
                    [ ecnprob PERCENTAGE ] [ [no]ecn ]
                    [ [no]bytemode ] [ [no_]dq_rate_estimator ]

defaults:
  limit: 10240 packets, flows: 1024
  target: 15 ms, tupdate: 15 ms (in jiffies)
  alpha: 1/8, beta : 5/4
  quantum: device MTU, memory_limit: 32 MB
  ecnprob: 10%, ecn: off
  bytemode: off, dq_rate_estimator: off
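
To show how target, alpha and beta interact, the sketch below applies
the basic PIE control law from RFC 8033 once per tupdate interval: the
drop probability moves by alpha times the delay error plus beta times
the change in delay. It uses floating point with delays in seconds and
omits the auto-tuning of alpha and beta, so it is an illustration only
and not the scaled-integer arithmetic used in sch_pie. All names are
hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-queue state for the sketch. */
struct pie_sketch {
	double prob;       /* current drop probability, in [0, 1] */
	double qdelay_old; /* queue delay seen at the previous update, s */
};

/*
 * One periodic update step.
 * qdelay and target are in seconds; alpha and beta are in 1/second.
 * Returns the new drop probability.
 */
static double pie_update_prob(struct pie_sketch *s, double qdelay,
			      double target, double alpha, double beta)
{
	double delta;

	assert(s != NULL);

	delta = alpha * (qdelay - target) + beta * (qdelay - s->qdelay_old);

	s->prob += delta;
	if (s->prob < 0.0)
		s->prob = 0.0;
	if (s->prob > 1.0)
		s->prob = 1.0;

	s->qdelay_old = qdelay;
	return s->prob;
}
```

With the defaults above (target 15 ms, alpha 1/8, beta 5/4), a queue
delay that jumps from 0 to 30 ms raises the probability by
0.125 * 0.015 + 1.25 * 0.030 = 0.039375 in a single step.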

Signed-off-by: Mohit P. Tahiliani
Signed-off-by: Sachin D. Patil
Signed-off-by: V. Saicharan
Signed-off-by: Mohit Bhasi
Signed-off-by: Leslie Monis
Signed-off-by: Gautam Ramakrishnan
Signed-off-by: David S. Miller

Mohit P. Tahiliani
2020-01-23 18:38:31 +0800
5205ea00c net: sched: pie: export symbols to be reused by FQ-PIE

This patch makes the drop_early(), calculate_probability() and
pie_process_dequeue() functions generic enough to be used by
both PIE and FQ-PIE (to be added in a future commit). The major
change here is in the way the functions take in arguments. This
patch exports these functions and makes FQ-PIE dependent on
sch_pie.

Signed-off-by: Mohit P. Tahiliani
Signed-off-by: Leslie Monis
Signed-off-by: Gautam Ramakrishnan
Signed-off-by: David S. Miller

Mohit P. Tahiliani
2020-01-23 18:38:31 +0800
b42a3d7c7 pie: improve comments and commenting style

Improve the comments along with the commenting style used to
describe the members of the structures and their initial values
in the init functions.

Signed-off-by: Mohit P. Tahiliani
Signed-off-by: Leslie Monis
Signed-off-by: Gautam Ramakrishnan
Signed-off-by: David S. Miller

Mohit P. Tahiliani
2020-01-23 18:38:31 +0800
2dfb1952a pie: rearrange structure members and their initializations

Rearrange the members of the structure such that closely
referenced members appear together and/or fit in the same
cacheline. Also, change the order of their initializations to
match the order in which they appear in the structure.

Signed-off-by: Mohit P. Tahiliani
Signed-off-by: Leslie Monis
Signed-off-by: Gautam Ramakrishnan
Signed-off-by: David S. Miller

Mohit P. Tahiliani
2020-01-23 18:38:31 +0800
1dbfc5e07 pie: use u8 instead of bool in pie_vars

Linux best practice recommends using u8 for true/false values in
structures.

Signed-off-by: Mohit P. Tahiliani
Signed-off-by: Leslie Monis
Signed-off-by: Gautam Ramakrishnan
Signed-off-by: David S. Miller

Mohit P. Tahiliani
2020-01-23 18:38:31 +0800
cf4eeee5f pie: rearrange macros in order of length

Rearrange macros in order of length and align the values to
improve readability.

Signed-off-by: Mohit P. Tahiliani
Signed-off-by: Leslie Monis
Signed-off-by: Gautam Ramakrishnan
Signed-off-by: David S. Miller

Mohit P. Tahiliani
2020-01-23 18:38:31 +0800
805a5a23a pie: use U64_MAX to denote (2^64 - 1)

Use the U64_MAX macro to denote the constant (2^64 - 1).

Signed-off-by: Mohit P. Tahiliani
Signed-off-by: Leslie Monis
Signed-off-by: Gautam Ramakrishnan
Signed-off-by: David S. Miller

Mohit P. Tahiliani
2020-01-23 18:38:31 +0800
84bf557fb net: sched: pie: move common code to pie.h

This patch moves macros, structures and small functions common
to PIE and FQ-PIE (to be added in a future commit) from the file
net/sched/sch_pie.c to the header file include/net/pie.h.
All the moved functions are made inline.

Signed-off-by: Mohit P. Tahiliani
Signed-off-by: Leslie Monis
Signed-off-by: Gautam Ramakrishnan
Signed-off-by: David S. Miller

Mohit P. Tahiliani
2020-01-23 18:38:30 +0800
954b3c439 Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next

Alexei Starovoitov says:

====================
pull-request: bpf-next 2020-01-22

The following pull-request contains BPF updates for your *net-next* tree.

We've added 92 non-merge commits during the last 16 day(s) which contain
a total of 320 files changed, 7532 insertions(+), 1448 deletions(-).

The main changes are:

1) function by function verification and program extensions from Alexei.

2) massive cleanup of selftests/bpf from Toke and Andrii.

3) batched bpf map operations from Brian and Yonghong.

4) tcp congestion control in bpf from Martin.

5) bulking for non-map xdp_redirect from Toke.

6) bpf_send_signal_thread helper from Yonghong.
====================

Signed-off-by: David S. Miller

David S. Miller
2020-01-23 15:10:16 +0800