09 May, 2019

1 commit

  • Avoid freeing cls_mall.rule twice when failing to set up the flow_action
    offload used in the hardware intermediate representation. This is
    achieved by returning 0 when the setup fails but the skip-software
    flag has not been set.

    Fixes: f00cbf196814 ("net/sched: use the hardware intermediate representation for matchall")
    Reported-by: Dan Carpenter
    Signed-off-by: Pieter Jansen van Vuuren
    Reviewed-by: Jakub Kicinski
    Signed-off-by: David S. Miller

    Pieter Jansen van Vuuren
     

08 May, 2019

5 commits

  • Minor conflict with the DSA legacy code removal.

    Signed-off-by: David S. Miller

    David S. Miller
     
  • Based on feedback from Jiri, avoid carrying a pointer to the tcf_block
    structure in the tc_cls_common_offload structure. Instead, store
    a flag in driver private data which indicates whether offloads apply
    to a shared block at block binding time.

    Suggested-by: Jiri Pirko
    Signed-off-by: Pieter Jansen van Vuuren
    Reviewed-by: Jakub Kicinski
    Acked-by: Jiri Pirko
    Signed-off-by: David S. Miller

    Pieter Jansen van Vuuren
     
  • The call to nla_nest_start_noflag() can return a NULL pointer, and
    currently this is not checked, which can lead to a NULL pointer
    dereference when the NULL sched_nest pointer is passed to
    nla_nest_end(). Fix this by adding a NULL pointer check.

    Addresses-Coverity: ("Dereference null return value")
    Fixes: a3d43c0d56f1 ("taprio: Add support adding an admin schedule")
    Signed-off-by: Colin Ian King
    Acked-by: Cong Wang
    Signed-off-by: David S. Miller

    Colin Ian King
     
  • The FQ packet scheduler assumed that packets could be classified
    based on their owning socket.

    This means that if a UDP server uses one UDP socket to send
    packets to different destinations, the packets all land
    in one FQ flow.

    This is unfair, since each TCP flow has a unique bucket, meaning
    that under pressure (fully utilised uplink), TCP flows
    get a larger share of the bandwidth.

    If we instead detect unconnected sockets, we can use a stochastic
    hash based on the 4-tuple hash.

    This also means a QUIC server using one UDP socket will properly
    spread the outgoing packets to different buckets, and in-kernel
    pacing based on EDT model will no longer risk having big rb-tree on
    one flow.

    Note that a UDP application might provide the skb->hash in an
    ancillary message at sendmsg() time to avoid the cost of a dissection
    in the fq packet scheduler.
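
    The stochastic classification described above can be sketched as follows
    (a toy model only: the bucket count, the hash function, and the string key
    are all illustrative; the kernel reuses skb->hash / flow dissector output):

```python
import hashlib

FQ_BUCKETS = 1024  # illustrative bucket count, not fq's actual default

def flow_bucket(saddr, sport, daddr, dport):
    """Toy model: map a packet's 4-tuple to a flow bucket, as fq does
    for packets from unconnected sockets."""
    key = f"{saddr}:{sport}->{daddr}:{dport}".encode()
    return int.from_bytes(hashlib.sha256(key).digest()[:4], "big") % FQ_BUCKETS
```

    With this scheme, a single unconnected UDP (or QUIC) socket sending to two
    destinations yields two different 4-tuples and therefore, in general, two
    different buckets, instead of one oversized flow.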

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     
  • The TCP stack makes sure packets for a given flow are monotonically
    increasing, but we want to allow UDP packets to use EDT as
    well, so that QUIC servers can use in-kernel pacing.

    This patch adds a per-flow rb-tree on which packets might
    be stored. We still try to use the linear list for the
    typical cases where packets are queued with monotonically
    increasing skb->tstamp, since queueing/dequeuing packets on
    a standard list is O(1).

    Note that the ability to store packets in arbitrary EDT
    order will later allow us to implement a per-TCP-socket
    mechanism adding delays (eventually with jitter) and reorders,
    to implement convenient network emulators.
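
    The two-level queue described above can be sketched in Python (a simplified
    model: the kernel's per-flow rb-tree keyed on skb->tstamp is stood in for by
    a sorted list, and the class and field names here are made up):

```python
import bisect

class FlowQueue:
    """Sketch of a flow queue: a FIFO list for monotonically increasing
    timestamps, plus a sorted overflow for out-of-order ones."""
    def __init__(self):
        self.fifo = []   # O(1) append while tstamps stay monotonic
        self.ooo = []    # sorted list standing in for the rb-tree

    def enqueue(self, tstamp, pkt):
        if not self.fifo or tstamp >= self.fifo[-1][0]:
            self.fifo.append((tstamp, pkt))
        else:
            bisect.insort(self.ooo, (tstamp, pkt))

    def dequeue(self):
        # deliver earliest-deadline-first across both structures
        if self.ooo and (not self.fifo or self.ooo[0][0] <= self.fifo[0][0]):
            return self.ooo.pop(0)
        return self.fifo.pop(0) if self.fifo else None
```

    The common case (monotonic timestamps) never touches the sorted structure,
    which is the point of keeping the linear list around.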

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

06 May, 2019

10 commits

  • In commit 3c75f6ee139d ("net_sched: sch_htb: add per class overlimits counter")
    we added an overlimits counter for each HTB class which could
    properly reflect how many times we use up all the bandwidth
    on each class. However, the overlimits counter in the HTB qdisc
    does not; it is far bigger than the sum of the per-class counters.
    In fact, this qdisc overlimits counter increases when we have
    no skb to dequeue, which happens more often than running out of
    bandwidth.

    It makes more sense to make this qdisc overlimits counter simply
    the sum of the per-class counters, so that people are no longer confused.

    I have verified this patch with one single HTB class, where HTB
    qdisc counters now always match HTB class counters as expected.

    Eric suggested we could fold this field into 'direct_pkts', as
    we only use 32 bits of it on 64-bit CPUs; this saves one cache line.

    Cc: Eric Dumazet
    Signed-off-by: Cong Wang
    Reviewed-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Cong Wang
     
  • Some actions, like the police action, are stateful and could share state
    between devices. This is incompatible with offloading to multiple devices,
    and drivers might want to test for shared blocks when offloading.
    Store a pointer to the tcf_block structure in the tc_cls_common_offload
    structure to allow drivers to determine when offloads apply to a shared
    block.

    Signed-off-by: Pieter Jansen van Vuuren
    Reviewed-by: Jakub Kicinski
    Signed-off-by: David S. Miller

    Pieter Jansen van Vuuren
     
  • Implement the stats_update callback for the police action that
    will be used by drivers for hardware offload.

    Signed-off-by: Pieter Jansen van Vuuren
    Reviewed-by: Jakub Kicinski
    Acked-by: Jiri Pirko
    Signed-off-by: David S. Miller

    Pieter Jansen van Vuuren
     
  • Introduce a new command for matchall classifiers that allows hardware
    to update statistics.

    Signed-off-by: Pieter Jansen van Vuuren
    Reviewed-by: Jakub Kicinski
    Acked-by: Jiri Pirko
    Signed-off-by: David S. Miller

    Pieter Jansen van Vuuren
     
  • Add police action to the hardware intermediate representation which
    would subsequently allow it to be used by drivers for offload.

    Signed-off-by: Pieter Jansen van Vuuren
    Reviewed-by: Jakub Kicinski
    Acked-by: Jiri Pirko
    Signed-off-by: David S. Miller

    Pieter Jansen van Vuuren
     
  • Move the tcf_police_params, tcf_police and tc_police_compat structures to a
    header, making them usable by other code, for example drivers that would
    offload police actions to hardware.

    Signed-off-by: Pieter Jansen van Vuuren
    Reviewed-by: Jakub Kicinski
    Signed-off-by: David S. Miller

    Pieter Jansen van Vuuren
     
  • Clean up unused functions and variables after porting to the newer
    intermediate representation.

    Signed-off-by: Pieter Jansen van Vuuren
    Reviewed-by: Jakub Kicinski
    Acked-by: Jiri Pirko
    Signed-off-by: David S. Miller

    Pieter Jansen van Vuuren
     
  • Extend matchall offload to make use of the hardware intermediate
    representation. More specifically, this patch moves the native TC
    actions in cls_matchall offload to the newer flow_action
    representation. This ultimately allows us to avoid a direct
    dependency on native TC actions for matchall.

    Signed-off-by: Pieter Jansen van Vuuren
    Reviewed-by: Jakub Kicinski
    Acked-by: Jiri Pirko
    Signed-off-by: David S. Miller

    Pieter Jansen van Vuuren
     
  • Add sample action to the hardware intermediate representation model which
    would subsequently allow it to be used by drivers for offload.

    Signed-off-by: Pieter Jansen van Vuuren
    Reviewed-by: Jakub Kicinski
    Acked-by: Jiri Pirko
    Signed-off-by: David S. Miller

    Pieter Jansen van Vuuren
     
  • When a cgroup classifier is added, there is a small time interval in
    which tp->root is NULL. If we receive a packet in this small time slice,
    a NULL pointer dereference will happen, leading to a kernel panic:

    # mkdir /sys/fs/cgroup/net_cls/0
    # echo 0x100001 > /sys/fs/cgroup/net_cls/0/net_cls.classid
    # echo $$ >/sys/fs/cgroup/net_cls/0/tasks
    # ping -qfb 255.255.255.255 -I eth0 &>/dev/null &
    # tc qdisc add dev eth0 root handle 10: htb
    # while : ; do
    > tc filter add dev eth0 parent 10: protocol ip prio 10 handle 1: cgroup
    > tc filter delete dev eth0
    > done
    Unable to handle kernel NULL pointer dereference at virtual address 0000000000000028
    Mem abort info:
    ESR = 0x96000005
    Exception class = DABT (current EL), IL = 32 bits
    SET = 0, FnV = 0
    EA = 0, S1PTW = 0
    Data abort info:
    ISV = 0, ISS = 0x00000005
    CM = 0, WnR = 0
    user pgtable: 4k pages, 39-bit VAs, pgdp = 0000000098a7ff91
    [0000000000000028] pgd=0000000000000000, pud=0000000000000000
    Internal error: Oops: 96000005 [#1] SMP
    Modules linked in: sch_htb cls_cgroup algif_hash af_alg nls_iso8859_1 nls_cp437 vfat fat xhci_plat_hcd m25p80 spi_nor xhci_hcd mtd usbcore usb_common spi_orion sfp i2c_mv64xxx phy_generic mdio_i2c marvell10g i2c_core mvpp2 mvmdio phylink sbsa_gwdt ip_tables x_tables autofs4
    Process ping (pid: 5421, stack limit = 0x00000000b20b1505)
    CPU: 3 PID: 5421 Comm: ping Not tainted 5.1.0-rc6 #31
    Hardware name: Marvell 8040 MACCHIATOBin Double-shot (DT)
    pstate: 60000005 (nZCv daif -PAN -UAO)
    pc : cls_cgroup_classify+0x80/0xec [cls_cgroup]
    lr : cls_cgroup_classify+0x34/0xec [cls_cgroup]
    sp : ffffff8012e6b850
    x29: ffffff8012e6b850 x28: ffffffc423dd3c00
    x27: ffffff801093ebc0 x26: ffffffc425a85b00
    x25: 0000000020000000 x24: 0000000000000000
    x23: ffffff8012e6b910 x22: ffffffc428db4900
    x21: ffffff8012e6b910 x20: 0000000000100001
    x19: 0000000000000000 x18: 0000000000000000
    x17: 0000000000000000 x16: 0000000000000000
    x15: 0000000000000000 x14: 0000000000000000
    x13: 0000000000000000 x12: 000000000000001c
    x11: 0000000000000018 x10: ffffff8012e6b840
    x9 : 0000000000003580 x8 : 000000000000009d
    x7 : 0000000000000002 x6 : ffffff8012e6b860
    x5 : 000000007cd66ffe x4 : 000000009742a193
    x3 : ffffff800865b4d8 x2 : ffffff8012e6b910
    x1 : 0000000000000400 x0 : ffffffc42c38f300
    Call trace:
    cls_cgroup_classify+0x80/0xec [cls_cgroup]
    tcf_classify+0x78/0x138
    htb_enqueue+0x74/0x320 [sch_htb]
    __dev_queue_xmit+0x3e4/0x9d0
    dev_queue_xmit+0x24/0x30
    ip_finish_output2+0x2e4/0x4d0
    ip_finish_output+0x1d8/0x270
    ip_mc_output+0xa8/0x240
    ip_local_out+0x58/0x68
    ip_send_skb+0x2c/0x88
    ip_push_pending_frames+0x44/0x50
    raw_sendmsg+0x458/0x830
    inet_sendmsg+0x54/0xe8
    sock_sendmsg+0x34/0x50
    __sys_sendto+0xd0/0x120
    __arm64_sys_sendto+0x30/0x40
    el0_svc_common.constprop.0+0x88/0xf8
    el0_svc_handler+0x2c/0x38
    el0_svc+0x8/0xc
    Code: 39496001 360002a1 b9425c14 34000274 (79405260)

    Fixes: ed76f5edccc9 ("net: sched: protect filter_chain list with filter_chain_lock mutex")
    Suggested-by: Cong Wang
    Signed-off-by: Matteo Croce
    Signed-off-by: David S. Miller

    Matteo Croce
     

04 May, 2019

3 commits

  • When a matchall classifier is added, there is a small time interval in
    which tp->root is NULL. If we receive a packet in this small time slice,
    a NULL pointer dereference will happen, leading to a kernel panic:

    # tc qdisc replace dev eth0 ingress
    # tc filter add dev eth0 parent ffff: matchall action gact drop
    Unable to handle kernel NULL pointer dereference at virtual address 0000000000000034
    Mem abort info:
    ESR = 0x96000005
    Exception class = DABT (current EL), IL = 32 bits
    SET = 0, FnV = 0
    EA = 0, S1PTW = 0
    Data abort info:
    ISV = 0, ISS = 0x00000005
    CM = 0, WnR = 0
    user pgtable: 4k pages, 39-bit VAs, pgdp = 00000000a623d530
    [0000000000000034] pgd=0000000000000000, pud=0000000000000000
    Internal error: Oops: 96000005 [#1] SMP
    Modules linked in: cls_matchall sch_ingress nls_iso8859_1 nls_cp437 vfat fat m25p80 spi_nor mtd xhci_plat_hcd xhci_hcd phy_generic sfp mdio_i2c usbcore i2c_mv64xxx marvell10g mvpp2 usb_common spi_orion mvmdio i2c_core sbsa_gwdt phylink ip_tables x_tables autofs4
    Process ksoftirqd/0 (pid: 9, stack limit = 0x0000000009de7d62)
    CPU: 0 PID: 9 Comm: ksoftirqd/0 Not tainted 5.1.0-rc6 #21
    Hardware name: Marvell 8040 MACCHIATOBin Double-shot (DT)
    pstate: 40000005 (nZcv daif -PAN -UAO)
    pc : mall_classify+0x28/0x78 [cls_matchall]
    lr : tcf_classify+0x78/0x138
    sp : ffffff80109db9d0
    x29: ffffff80109db9d0 x28: ffffffc426058800
    x27: 0000000000000000 x26: ffffffc425b0dd00
    x25: 0000000020000000 x24: 0000000000000000
    x23: ffffff80109dbac0 x22: 0000000000000001
    x21: ffffffc428ab5100 x20: ffffffc425b0dd00
    x19: ffffff80109dbac0 x18: 0000000000000000
    x17: 0000000000000000 x16: 0000000000000000
    x15: 0000000000000000 x14: 0000000000000000
    x13: ffffffbf108ad288 x12: dead000000000200
    x11: 00000000f0000000 x10: 0000000000000001
    x9 : ffffffbf1089a220 x8 : 0000000000000001
    x7 : ffffffbebffaa950 x6 : 0000000000000000
    x5 : 000000442d6ba000 x4 : 0000000000000000
    x3 : ffffff8008735ad8 x2 : ffffff80109dbac0
    x1 : ffffffc425b0dd00 x0 : ffffff8010592078
    Call trace:
    mall_classify+0x28/0x78 [cls_matchall]
    tcf_classify+0x78/0x138
    __netif_receive_skb_core+0x29c/0xa20
    __netif_receive_skb_one_core+0x34/0x60
    __netif_receive_skb+0x28/0x78
    netif_receive_skb_internal+0x2c/0xc0
    napi_gro_receive+0x1a0/0x1d8
    mvpp2_poll+0x928/0xb18 [mvpp2]
    net_rx_action+0x108/0x378
    __do_softirq+0x128/0x320
    run_ksoftirqd+0x44/0x60
    smpboot_thread_fn+0x168/0x1b0
    kthread+0x12c/0x130
    ret_from_fork+0x10/0x1c
    Code: aa0203f3 aa1e03e0 d503201f f9400684 (b9403480)
    ---[ end trace fc71e2ef7b8ab5a5 ]---
    Kernel panic - not syncing: Fatal exception in interrupt
    SMP: stopping secondary CPUs
    Kernel Offset: disabled
    CPU features: 0x002,00002000
    Memory Limit: none
    Rebooting in 1 seconds..

    Fix this by adding a NULL check in mall_classify().

    Fixes: ed76f5edccc9 ("net: sched: protect filter_chain list with filter_chain_lock mutex")
    Signed-off-by: Matteo Croce
    Acked-by: Cong Wang
    Signed-off-by: David S. Miller

    Matteo Croce
     
  • Make use of the struct_size() helper instead of an open-coded version
    in order to avoid any potential type mistakes, in particular in the
    context in which this code is being used.

    So, replace code of the following form:

    sizeof(*s) + s->nkeys*sizeof(struct tc_u32_key)

    with:

    struct_size(s, keys, s->nkeys)

    This code was detected with the help of Coccinelle.
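
    The helper's semantics can be modeled in Python (a sketch only: the real
    struct_size() is a C macro in include/linux/overflow.h that also type-checks
    its arguments; the saturating-on-overflow behaviour shown here is the key
    property the open-coded form lacks):

```python
SIZE_MAX = 2**64 - 1  # assuming a 64-bit size_t

def struct_size(base_size, elem_size, count):
    """Model of struct_size(): header size plus 'count' trailing array
    elements, saturating to SIZE_MAX on overflow instead of wrapping."""
    total = base_size + elem_size * count
    return total if total <= SIZE_MAX else SIZE_MAX
```

    A saturated SIZE_MAX result makes the subsequent allocation fail cleanly,
    whereas the open-coded multiplication could silently wrap to a small value.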

    Signed-off-by: Gustavo A. R. Silva
    Signed-off-by: David S. Miller

    Gustavo A. R. Silva
     
  • Although devlink health report does a nice job of reporting TX
    timeouts and other NIC errors, it unfortunately requires driver
    support, and currently only mlx5 has implemented it.
    Until other drivers catch up, it is useful to have a
    generic tracepoint to monitor this kind of TX timeout. We have
    been suffering TX timeouts with different drivers; we plan to
    start monitoring them with rasdaemon, which just needs a new tracepoint.

    Sample output:

    ksoftirqd/1-16 [001] ..s2 144.043173: net_dev_xmit_timeout: dev=ens3 driver=e1000 queue=0

    Cc: Eran Ben Elisha
    Cc: Jiri Pirko
    Signed-off-by: Cong Wang
    Acked-by: Jiri Pirko
    Reviewed-by: Eran Ben Elisha
    Signed-off-by: David S. Miller

    Cong Wang
     

01 May, 2019

4 commits

  • IEEE 802.1Q-2018 defines the concept of a cycle-time extension, so the
    last entry of a schedule before the start of a new schedule can be
    extended so that "too-short" entries are avoided.

    Signed-off-by: Vinicius Costa Gomes
    Signed-off-by: David S. Miller

    Vinicius Costa Gomes
     
  • IEEE 802.1Q-2018 defines that the cycle-time of a schedule may be
    overridden, so the schedule is truncated to a determined "width".

    Signed-off-by: Vinicius Costa Gomes
    Signed-off-by: David S. Miller

    Vinicius Costa Gomes
     
  • IEEE 802.1Q-2018 defines two "types" of schedules, the "Oper" (from
    operational?) and "Admin" ones. Up until now, 'taprio' only had
    support for the "Oper" one, added when the qdisc is created. This adds
    support for the "Admin" one, which allows the .change() operation to
    be supported.

    Just for clarification, some quick (and dirty) definitions: the "Oper"
    schedule is the currently (as in this instant) running one, and it is
    read-only. The "Admin" one is the one that the system configurator has
    installed; it can be changed, and it will be "promoted" to "Oper" when
    its 'base-time' is reached.
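
    The promotion rule can be sketched as a toy state model (the dict layout
    and field names here are made up, not taprio's actual data structures):

```python
def advance_schedules(state, now):
    """Promote the admin schedule to oper once its base-time is reached.
    'state' is a dict with optional "oper" and "admin" entries."""
    admin = state.get("admin")
    if admin is not None and now >= admin["base_time"]:
        state["oper"] = state.pop("admin")   # old oper can now be freed
    return state.get("oper")
```
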

    The idea behind this patch is that calling something like the below
    (after taprio is already configured with an initial schedule):

    $ tc qdisc change taprio dev IFACE parent root \
    base-time X \
    sched-entry \
    ...

    Will cause a new admin schedule to be created and programmed to be
    "promoted" to "Oper" at instant X. If an "Admin" schedule already
    exists, it will be overwritten with the new parameters.

    Up until now, there was some code that had been added to ease the support
    of changing a single entry of a schedule, but it was ultimately unused.
    Now that we have support for "change" with better-thought-out
    semantics, updating a single entry seems to be less useful.

    So we remove what is in practice dead code, and return a "not
    supported" error if the user tries to use it. If changing a single
    entry would make the user's life easier we may resurrect this idea,
    but at this point, removing it simplifies the code.

    For now, only the schedule specific bits are allowed to be added for a
    new schedule, that means that 'clockid', 'num_tc', 'map' and 'queues'
    cannot be modified.

    Example:

    $ tc qdisc change dev IFACE parent root handle 100 taprio \
    base-time $BASE_TIME \
    sched-entry S 00 500000 \
    sched-entry S 0f 500000 \
    clockid CLOCK_TAI

    The only change in the netlink API introduced by this patch is the
    introduction of an "admin" type in the response to a dump request;
    that type allows userspace to separate the "oper" schedule from the
    "admin" schedule. If userspace doesn't support the "admin" type, it
    will only display the "oper" schedule.

    Signed-off-by: Vinicius Costa Gomes
    Signed-off-by: David S. Miller

    Vinicius Costa Gomes
     
  • Right now, this isn't a problem, but the next commit allows schedules
    to be added during runtime. When a new schedule transitions from the
    inactive to the active state ("admin" -> "oper"), the previous one can
    be freed; if it's freed just after the RCU read lock is released, we
    may access an invalid entry.

    So we should take care to protect the dequeue() flow, so that all the
    places that access the entries are protected by the RCU read lock.

    Signed-off-by: Vinicius Costa Gomes
    Signed-off-by: David S. Miller

    Vinicius Costa Gomes
     

28 Apr, 2019

2 commits

  • We currently have two levels of strict validation:

    1) liberal (default)
    - undefined (type >= max) & NLA_UNSPEC attributes accepted
    - attribute length >= expected accepted
    - garbage at end of message accepted
    2) strict (opt-in)
    - NLA_UNSPEC attributes accepted
    - attribute length >= expected accepted

    Split out parsing strictness into four different options:
    * TRAILING - check that there's no trailing data after parsing
    attributes (in message or nested)
    * MAXTYPE - reject attrs > max known type
    * UNSPEC - reject attributes with NLA_UNSPEC policy entries
    * STRICT_ATTRS - strictly validate attribute size

    The default for future things should be *everything*.
    The current *_strict() is a combination of TRAILING and MAXTYPE,
    and is renamed to _deprecated_strict().
    The current regular parsing has none of this, and is renamed to
    *_parse_deprecated().
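
    The four options and the two legacy combinations can be modeled with a flag
    enum (a sketch of the relationships only: the names are shortened and the
    bit values illustrative, not the kernel's actual NL_VALIDATE_* constants):

```python
from enum import Flag

class Validate(Flag):
    LIBERAL = 0
    TRAILING = 1       # reject trailing data after attributes
    MAXTYPE = 2        # reject attrs > max known type
    UNSPEC = 4         # reject NLA_UNSPEC policy entries
    STRICT_ATTRS = 8   # strictly validate attribute size

# old *_strict() parsing, now *_deprecated_strict()
DEPRECATED_STRICT = Validate.TRAILING | Validate.MAXTYPE
# old default parsing, now *_parse_deprecated()
DEPRECATED = Validate.LIBERAL
# the default for future commands: everything
STRICT = (Validate.TRAILING | Validate.MAXTYPE
          | Validate.UNSPEC | Validate.STRICT_ATTRS)
```
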

    Additionally, this allows us to selectively set one of the new flags
    even on old policies. Notably, the UNSPEC flag could be useful in
    this case, since it can be arranged (by filling in the policy) to
    not be an incompatible userspace ABI change, but going forward would
    prevent forgetting attribute entries. Similar can apply
    to the POLICY flag.

    We end up with the following renames:
    * nla_parse -> nla_parse_deprecated
    * nla_parse_strict -> nla_parse_deprecated_strict
    * nlmsg_parse -> nlmsg_parse_deprecated
    * nlmsg_parse_strict -> nlmsg_parse_deprecated_strict
    * nla_parse_nested -> nla_parse_nested_deprecated
    * nla_validate_nested -> nla_validate_nested_deprecated

    Using spatch, of course:
    @@
    expression TB, MAX, HEAD, LEN, POL, EXT;
    @@
    -nla_parse(TB, MAX, HEAD, LEN, POL, EXT)
    +nla_parse_deprecated(TB, MAX, HEAD, LEN, POL, EXT)

    @@
    expression NLH, HDRLEN, TB, MAX, POL, EXT;
    @@
    -nlmsg_parse(NLH, HDRLEN, TB, MAX, POL, EXT)
    +nlmsg_parse_deprecated(NLH, HDRLEN, TB, MAX, POL, EXT)

    @@
    expression NLH, HDRLEN, TB, MAX, POL, EXT;
    @@
    -nlmsg_parse_strict(NLH, HDRLEN, TB, MAX, POL, EXT)
    +nlmsg_parse_deprecated_strict(NLH, HDRLEN, TB, MAX, POL, EXT)

    @@
    expression TB, MAX, NLA, POL, EXT;
    @@
    -nla_parse_nested(TB, MAX, NLA, POL, EXT)
    +nla_parse_nested_deprecated(TB, MAX, NLA, POL, EXT)

    @@
    expression START, MAX, POL, EXT;
    @@
    -nla_validate_nested(START, MAX, POL, EXT)
    +nla_validate_nested_deprecated(START, MAX, POL, EXT)

    @@
    expression NLH, HDRLEN, MAX, POL, EXT;
    @@
    -nlmsg_validate(NLH, HDRLEN, MAX, POL, EXT)
    +nlmsg_validate_deprecated(NLH, HDRLEN, MAX, POL, EXT)

    For this patch, don't actually add the strict, non-renamed versions
    yet so that it breaks compile if I get it wrong.

    Also, while at it, make nla_validate and nla_parse go down to a
    common __nla_validate_parse() function to avoid code duplication.

    Ultimately, this allows us to have very strict validation for every
    new caller of nla_parse()/nlmsg_parse() etc as re-introduced in the
    next patch, while existing things will continue to work as is.

    In effect then, this adds fully strict validation for any new command.

    Signed-off-by: Johannes Berg
    Signed-off-by: David S. Miller

    Johannes Berg
     
  • Even though the NLA_F_NESTED flag was introduced more than 11 years ago, most
    netlink-based interfaces (including recently added ones) still do not
    set it in kernel-generated messages. Without the flag, message parsers
    not aware of attribute semantics (e.g. the wireshark dissector or libmnl's
    mnl_nlmsg_fprintf()) cannot recognize nested attributes and won't display
    the structure of their contents.

    Unfortunately we cannot just add the flag everywhere as there may be
    userspace applications which check nlattr::nla_type directly rather than
    through a helper masking out the flags. Therefore the patch renames
    nla_nest_start() to nla_nest_start_noflag() and introduces nla_nest_start()
    as a wrapper adding NLA_F_NESTED. The calls which add NLA_F_NESTED manually
    are rewritten to use nla_nest_start().
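
    The rename can be sketched with a toy message model (the bit position of
    NLA_F_NESTED matches the netlink uapi, but the list-based message
    representation here is made up for illustration):

```python
NLA_F_NESTED = 1 << 15   # flag bit from include/uapi/linux/netlink.h

def nla_nest_start_noflag(msg, attrtype):
    """Open a nested attribute using nla_type exactly as given
    (the behaviour of the old nla_nest_start())."""
    msg.append(("nest_start", attrtype))
    return attrtype

def nla_nest_start(msg, attrtype):
    """New wrapper: always set NLA_F_NESTED on the attribute type."""
    return nla_nest_start_noflag(msg, attrtype | NLA_F_NESTED)
```

    Callers that previously passed `attrtype | NLA_F_NESTED` by hand can use the
    new nla_nest_start() unchanged, which is exactly what the semantic patch does.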

    Except for changes in include/net/netlink.h, the patch was generated using
    this semantic patch:

    @@ expression E1, E2; @@
    -nla_nest_start(E1, E2)
    +nla_nest_start_noflag(E1, E2)

    @@ expression E1, E2; @@
    -nla_nest_start_noflag(E1, E2 | NLA_F_NESTED)
    +nla_nest_start(E1, E2)

    Signed-off-by: Michal Kubecek
    Acked-by: Jiri Pirko
    Acked-by: David Ahern
    Signed-off-by: David S. Miller

    Michal Kubecek
     

25 Apr, 2019

1 commit

  • Recent changes that introduced unlocked flower did not properly account for
    the case when reoffload is initiated concurrently with filter updates. To fix
    the issue, extend flower with a 'hw_filters' list that is used to store
    filters that don't have the 'skip_hw' flag set. A filter is added to the list
    when it is inserted into hardware and only removed from it after being
    unoffloaded from all drivers that the parent block is attached to. This ensures
    that concurrent reoffload can still access a filter that is being deleted and
    prevents a race condition where the driver callback could be removed while the
    filter is no longer accessible through the idr but is still present in hardware.

    Refactor fl_change() to respect the new filter reference counter and to release
    the filter reference with __fl_put() in case of error, instead of directly
    deallocating the filter memory. This allows for concurrent access to the filter
    from fl_reoffload() and protects it with reference counting. Refactor
    fl_reoffload() to iterate over the hw_filters list instead of the idr. Implement
    the fl_get_next_hw_filter() helper function, which is used to iterate over the
    hw_filters list with reference counting and skips filters that are being
    concurrently deleted.
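
    The iteration can be sketched as follows (a simplified model: a dict field
    stands in for the kernel's refcount_t, and the function name mirrors the
    commit while the data layout is made up):

```python
def get_next_hw_filter(hw_filters, cursor=0):
    """Walk hw_filters from 'cursor', take a reference on the first
    filter not concurrently being deleted, return (next_cursor, filter)."""
    for i in range(cursor, len(hw_filters)):
        f = hw_filters[i]
        if f["refcnt"] > 0:          # refcount_inc_not_zero() analogue
            f["refcnt"] += 1         # caller must drop this reference
            return i + 1, f
    return len(hw_filters), None
```
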

    Fixes: 92149190067d ("net: sched: flower: set unlocked flag for flower proto ops")
    Signed-off-by: Vlad Buslov
    Reviewed-by: Saeed Mahameed
    Signed-off-by: David S. Miller

    Vlad Buslov
     

24 Apr, 2019

5 commits

  • In case we don't have 'guard' or 'budget' to transmit the skb, we should
    continue traversing the qdisc list since the remaining guard/budget
    might be enough to transmit an skb from the other child qdiscs.

    Fixes: 5a781ccbd19e ("tc: Add support for configuring the taprio scheduler")
    Signed-off-by: Andre Guedes
    Signed-off-by: David S. Miller

    Andre Guedes
     
  • While traversing taprio's list of child qdiscs, if the gate is closed for
    a given traffic class, we should continue traversing the list since the
    remaining qdiscs may have skbs ready for transmission.

    This patch also takes this opportunity and changes the function to use
    the TAPRIO_ALL_GATES_OPEN macro instead of the magic number '-1'.

    Fixes: 5a781ccbd19e ("tc: Add support for configuring the taprio scheduler")
    Signed-off-by: Andre Guedes
    Signed-off-by: David S. Miller

    Andre Guedes
     
  • The 'entry' argument of should_restart_cycle() cannot be NULL since it
    is already checked by the caller, so the WARN_ON() within
    should_restart_cycle() can be removed. By doing that, the function becomes
    a dummy wrapper around list_is_last(), so this patch simply gets rid of it
    and calls list_is_last() within advance_sched() instead.

    Signed-off-by: Andre Guedes
    Signed-off-by: David S. Miller

    Andre Guedes
     
  • This patch refactors the taprio_get_start_time() function
    to improve readability and report errors properly.

    If 'base' time is later than 'now', the start time is equal to 'base'
    and taprio_get_start_time() is done. That's the natural case, so we move
    that code to the beginning of the function. Also, if the 'cycle' calculation
    is zero, something went really wrong with taprio and we should log that
    internal error properly.

    Signed-off-by: Andre Guedes
    Signed-off-by: David S. Miller

    Andre Guedes
     
  • This patch removes a pointless variable assignment in taprio_change().
    The 'err' variable is not used between this assignment and the next one, so
    this patch removes it.

    Signed-off-by: Andre Guedes
    Signed-off-by: David S. Miller

    Andre Guedes
     

19 Apr, 2019

1 commit

  • Recent changes to taprio did not use the correct div64 helpers,
    leading to:

    net/sched/sch_taprio.o: In function `taprio_dequeue':
    sch_taprio.c:(.text+0x34a): undefined reference to `__divdi3'
    net/sched/sch_taprio.o: In function `advance_sched':
    sch_taprio.c:(.text+0xa0b): undefined reference to `__divdi3'
    net/sched/sch_taprio.o: In function `taprio_init':
    sch_taprio.c:(.text+0x1450): undefined reference to `__divdi3'
    /home/jkicinski/devel/linux/Makefile:1032: recipe for target 'vmlinux' failed

    Use math64 helpers.

    Fixes: 7b9eba7ba0c1 ("net/sched: taprio: fix picos_per_byte miscalculation")
    Signed-off-by: Jakub Kicinski
    Reviewed-by: Dirk van der Merwe
    Acked-by: Vinicius Costa Gomes
    Signed-off-by: David S. Miller

    Jakub Kicinski
     

12 Apr, 2019

2 commits

  • Fix net reference counting in fl_change() and remove the redundant call to
    tcf_exts_get_net() from __fl_delete(). __fl_put() already tries to get net
    before releasing exts and deallocating a filter, so this code caused the
    flower classifier to obtain net twice per filter being deleted.

    The implementation of __fl_delete() called tcf_exts_get_net() to pass its
    result as the 'async' flag to fl_mask_put(). However, the 'async' flag is
    redundant and only complicates the fl_mask_put() implementation. This
    functionality seems to have been copied from the filter cleanup code, where
    it was added by Cong with the following explanation:

    This patchset tries to fix the race between call_rcu() and
    cleanup_net() again. Without holding the netns refcnt the
    tc_action_net_exit() in netns workqueue could be called before
    filter destroy works in tc filter workqueue. This patchset
    moves the netns refcnt from tc actions to tcf_exts, without
    breaking per-netns tc actions.

    This doesn't apply to flower mask, which doesn't call any tc action code
    during cleanup. Simplify fl_mask_put() by removing the flag parameter and
    always use tcf_queue_work() to free mask objects.

    Fixes: 061775583e35 ("net: sched: flower: introduce reference counting for filters")
    Fixes: 1f17f7742eeb ("net: sched: flower: insert filter to ht before offloading it to hw")
    Fixes: 05cd271fd61a ("cls_flower: Support multiple masks per priority")
    Reported-by: Ido Schimmel
    Signed-off-by: Vlad Buslov
    Signed-off-by: David S. Miller

    Vlad Buslov
     
  • The implementation of rhashtable_insert_fast() checks whether its internal
    helper function __rhashtable_insert_fast() returns a non-NULL pointer and
    seemingly returns -EEXIST in that case. However, since
    __rhashtable_insert_fast() is called with a NULL key pointer, it never
    actually checks for duplicates, which means that -EEXIST is never returned
    to the user. Use the rhashtable_lookup_insert_fast() hash table API instead.
    In order to verify that it works as expected and prevent the problem from
    happening in the future, extend tc-tests with a new test that verifies that
    no new filters with an existing key can be inserted into the flower classifier.
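
    The difference in semantics can be sketched with a toy dict-backed model
    (the real rhashtable signatures differ; this only contrasts the
    no-duplicate-check path with the lookup-then-insert path):

```python
def insert_fast(table, key, value):
    """Models the buggy path: the key is never compared, so
    duplicates are silently inserted."""
    table.setdefault(key, []).append(value)   # no duplicate check
    return 0

def lookup_insert_fast(table, key, value):
    """Models rhashtable_lookup_insert_fast(): fail with -EEXIST if an
    entry with the same key is already present."""
    if key in table:
        return -17   # -EEXIST
    table[key] = [value]
    return 0
```
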

    Fixes: 1f17f7742eeb ("net: sched: flower: insert filter to ht before offloading it to hw")
    Signed-off-by: Vlad Buslov
    Reviewed-by: Jakub Kicinski
    Signed-off-by: David S. Miller

    Vlad Buslov
     

11 Apr, 2019

5 commits

  • The Credit Based Shaper heavily depends on the link speed to calculate
    the scheduling credits; we can't properly calculate the credits if the
    device has failed to report the link speed.

    In that case we can't dequeue packets assuming a wrong port rate that would
    result in an inconsistent credit distribution.

    This patch makes sure we fail to dequeue in case:

    1) __ethtool_get_link_ksettings() reports an error, or 2) the ethernet driver
    failed to set the ksettings' speed value (setting the link speed to
    SPEED_UNKNOWN).

    Additionally, we properly recalculate the port rate whenever the link speed
    is changed.
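
    The guard described above can be sketched as a hypothetical helper
    (SPEED_UNKNOWN matches ethtool's sentinel value; the function name and
    return convention are illustrative):

```python
SPEED_UNKNOWN = -1   # ethtool sentinel: driver couldn't report the speed

def port_rate_bytes_per_sec(speed_mbit):
    """Return the port rate, or None when the link speed is unknown, so
    that dequeue can bail out instead of using a bogus credit rate."""
    if speed_mbit is None or speed_mbit <= 0 or speed_mbit == SPEED_UNKNOWN:
        return None
    return speed_mbit * 1_000_000 // 8   # Mbit/s -> bytes/s
```
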

    Fixes: 3d0bd028ffb4a ("net/sched: Add support for HW offloading for CBS")
    Signed-off-by: Leandro Dorileo
    Reviewed-by: Vedang Patel
    Signed-off-by: David S. Miller

    Leandro Dorileo
     
  • The Time Aware Priority Scheduler is heavily dependent on the link speed;
    it relies on it to calculate transmission bytes per cycle, and we can't
    properly calculate the so-called budget if the device has failed
    to report the link speed.

    In that case we can't dequeue packets assuming a wrong budget.
    This patch makes sure we fail to dequeue in case:

    1) __ethtool_get_link_ksettings() reports an error, or 2) the ethernet
    driver failed to set the ksettings' speed value (setting the link speed
    to SPEED_UNKNOWN).

    Additionally, we recalculate the budget whenever the link speed is
    changed.

    Fixes: 5a781ccbd19e4 ("tc: Add support for configuring the taprio scheduler")
    Signed-off-by: Leandro Dorileo
    Reviewed-by: Vedang Patel
    Signed-off-by: David S. Miller

    Leandro Dorileo
     
  • This reverts commit 46b1c18f9deb ("net: sched: put back q.qlen into
    a single location").
    After the previous patch, when a NOLOCK qdisc is enslaved to a
    locking qdisc it switches to global stats accounting. As a consequence,
    when a classful qdisc directly accesses a child qdisc's qlen, such a
    qdisc is not doing per-CPU accounting and the qlen value is consistent.

    In the control path nobody uses qlen directly since commit
    e5f0e8f8e45 ("net: sched: introduce and use qdisc tree flush/purge
    helpers"), so we can remove the contended atomic ops from the
    datapath.

    v1 -> v2:
    - complete the qdisc_qstats_atomic_qlen_dec() ->
    qdisc_qstats_cpu_qlen_dec() replacement, fix build issue
    - more descriptive commit message

    Signed-off-by: Paolo Abeni
    Signed-off-by: David S. Miller

    Paolo Abeni
     
  • Since stats updating is always consistent with the TCQ_F_CPUSTATS flag,
    we can disable it at qdisc creation time by flipping that bit.

    In my experiments, if the NOLOCK flag is cleared, per-CPU stats
    accounting does not give any measurable performance gain, but it
    wastes some memory.

    Let's clear TCQ_F_CPUSTATS together with NOLOCK when enslaving
    a NOLOCK qdisc to a locking one.

    Use the stats update helper inside pfifo_fast to cope correctly with
    the TCQ_F_CPUSTATS flag change.

    As a side effect, the q.qlen value for any child qdisc is always
    consistent for all locking classful qdiscs.

    Signed-off-by: Paolo Abeni
    Signed-off-by: David S. Miller

    Paolo Abeni
     
  • The core sched implementation independently checks the NOLOCK flag
    to acquire/release the root spin lock and qdisc_is_percpu_stats()
    to account per-CPU values in many places.

    This change updates the last few places checking TCQ_F_NOLOCK to
    do per-CPU stats accounting according to the qdisc_is_percpu_stats()
    value.

    The above allows us to clean up the dev_requeue_skb() implementation a bit
    and makes stats updates always consistent with a single flag.

    v1 -> v2:
    - do not move qdisc_is_empty definition, fix build issue

    Signed-off-by: Paolo Abeni
    Signed-off-by: David S. Miller

    Paolo Abeni
     

08 Apr, 2019

1 commit

  • John reports:

    Recent refactoring of fl_change aims to use the classifier spinlock to
    avoid the need for the rtnl lock. In doing so, the fl_hw_replace_filter()
    function was moved to before the lock is taken. This can create problems
    for drivers if duplicate filters are created (common in ovs tc offload
    due to filters being triggered by user-space matches).

    Drivers registered for such filters will now receive multiple copies of
    the same rule, each with a different cookie value. This means that the
    drivers would need to do a full match field lookup to determine
    duplicates, repeating work that will happen in flower __fl_lookup().
    Currently, drivers do not expect to receive duplicate filters.

    To fix this, verify that a filter with the same key is not present in the
    flower classifier hash table and insert the new filter into the flower hash
    table before offloading it to hardware. Implement the helper function
    fl_ht_insert_unique() to atomically verify/insert a filter.

    This change makes the filter visible to the fast path at the beginning of
    the fl_change() function, which means it can no longer be freed directly in
    case of error. Refactor the fl_change() error handling code to deallocate the
    filter with an rcu timeout.

    Fixes: 620da4860827 ("net: sched: flower: refactor fl_change")
    Reported-by: John Hurley
    Signed-off-by: Vlad Buslov
    Acked-by: Jiri Pirko
    Signed-off-by: David S. Miller

    Vlad Buslov