18 Feb, 2017

1 commit

  • [ Upstream commit fd62d9f5c575f0792f150109f1fd24a0d4b3f854 ]

    In the current version, the matchall internal state is split into two
    structs: cls_matchall_head and cls_matchall_filter. This makes little
    sense, as a matchall instance supports only one filter, and there is no
    situation where one exists and the other does not. In addition, the split
    led to races when a filter was deleted while a packet was being processed.

    Unify the two structs into one, simplifying matchall creation and
    deletion. As a result, the new, delete and get callbacks have dummy
    implementations, with all the work done in the destroy and change
    callbacks, as was done in cls_cgroup (see the sketch after this entry).

    Fixes: bf3994d2ed31 ("net/sched: introduce Match-all classifier")
    Reported-by: Daniel Borkmann
    Signed-off-by: Yotam Gigi
    Acked-by: Jiri Pirko
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Yotam Gigi
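
    A minimal userspace-style sketch of the idea (illustrative names and
    fields only, not the actual cls_matchall definitions): since a matchall
    instance holds at most one filter, the per-instance "head" and the
    per-filter state can live in one allocation with one lifetime.

        /* Illustrative sketch only -- a simplified stand-in, not the kernel's
         * cls_matchall code.  One struct now carries both the per-instance
         * and the per-filter state, so there is no window where one exists
         * without the other. */
        #include <stdlib.h>

        struct mall_state {
            unsigned int handle;    /* user-visible filter handle            */
            unsigned int flags;     /* e.g. offload flags                    */
            void        *actions;   /* attached actions (tcf_exts in-kernel) */
        };

        static struct mall_state *mall_create(unsigned int handle)
        {
            struct mall_state *m = calloc(1, sizeof(*m));

            if (m)
                m->handle = handle;
            return m;               /* created and destroyed as one unit */
        }

        static void mall_destroy(struct mall_state *m)
        {
            free(m);
        }

        int main(void)
        {
            mall_destroy(mall_create(0x1));
            return 0;
        }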
     

04 Feb, 2017

1 commit

  • [ Upstream commit 0faa9cb5b3836a979864a6357e01d2046884ad52 ]

    Demonstrating the issue:

    ... add a drop action
    $ sudo $TC actions add action drop index 10

    .. retrieve it
    $ sudo $TC -s actions get action gact index 10

    action order 1: gact action drop
    random type none pass val 0
    index 10 ref 2 bind 0 installed 29 sec used 29 sec
    Action statistics:
    Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
    backlog 0b 0p requeues 0

    ... bug 1 above: the reference count is reported as two.
    It is actually 1, but we forget to subtract 1.

    ... do a GET again and we see the same issue
    try a few times and nothing changes
    ~$ sudo $TC -s actions get action gact index 10

    action order 1: gact action drop
    random type none pass val 0
    index 10 ref 2 bind 0 installed 31 sec used 31 sec
    Action statistics:
    Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
    backlog 0b 0p requeues 0

    ... let's try to bind the action to a filter...
    $ sudo $TC qdisc add dev lo ingress
    $ sudo $TC filter add dev lo parent ffff: protocol ip prio 1 \
    u32 match ip dst 127.0.0.1/32 flowid 1:1 action gact index 10

    ... and now a few GETs:
    $ sudo $TC -s actions get action gact index 10

    action order 1: gact action drop
    random type none pass val 0
    index 10 ref 3 bind 1 installed 204 sec used 204 sec
    Action statistics:
    Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
    backlog 0b 0p requeues 0

    $ sudo $TC -s actions get action gact index 10

    action order 1: gact action drop
    random type none pass val 0
    index 10 ref 4 bind 1 installed 206 sec used 206 sec
    Action statistics:
    Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
    backlog 0b 0p requeues 0

    $ sudo $TC -s actions get action gact index 10

    action order 1: gact action drop
    random type none pass val 0
    index 10 ref 5 bind 1 installed 235 sec used 235 sec
    Action statistics:
    Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
    backlog 0b 0p requeues 0

    ... as can be observed, the reference count keeps going up with every
    GET (a sketch of the get/put pairing follows this entry).

    After the fix

    $ sudo $TC actions add action drop index 10
    $ sudo $TC -s actions get action gact index 10

    action order 1: gact action drop
    random type none pass val 0
    index 10 ref 1 bind 0 installed 4 sec used 4 sec
    Action statistics:
    Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
    backlog 0b 0p requeues 0

    $ sudo $TC -s actions get action gact index 10

    action order 1: gact action drop
    random type none pass val 0
    index 10 ref 1 bind 0 installed 6 sec used 6 sec
    Action statistics:
    Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
    backlog 0b 0p requeues 0

    $ sudo $TC qdisc add dev lo ingress
    $ sudo $TC filter add dev lo parent ffff: protocol ip prio 1 \
    u32 match ip dst 127.0.0.1/32 flowid 1:1 action gact index 10

    $ sudo $TC -s actions get action gact index 10

    action order 1: gact action drop
    random type none pass val 0
    index 10 ref 2 bind 1 installed 32 sec used 32 sec
    Action statistics:
    Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
    backlog 0b 0p requeues 0

    $ sudo $TC -s actions get action gact index 10

    action order 1: gact action drop
    random type none pass val 0
    index 10 ref 2 bind 1 installed 33 sec used 33 sec
    Action statistics:
    Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
    backlog 0b 0p requeues 0

    Fixes: aecc5cefc389 ("net sched actions: fix GETing actions")
    Signed-off-by: Jamal Hadi Salim
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Jamal Hadi Salim
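
    A hypothetical userspace sketch of the invariant involved (not the actual
    act_api code; the names are made up): a lookup that pins an action with a
    temporary reference while a GET reply is built must drop that reference
    once the reply is done, and must not report the temporary pin, otherwise
    the count shown grows by one on every GET.

        /* Hypothetical sketch, not kernel code. */
        #include <stdio.h>

        struct action {
            int refcnt;     /* 1 == only the action table holds it */
            int bind;       /* how many filters are bound to it    */
        };

        static void act_get(struct action *a) { a->refcnt++; }
        static void act_put(struct action *a) { a->refcnt--; }

        static void handle_get(struct action *a)
        {
            act_get(a);         /* pin the action while the reply is built */
            /* Report without counting the temporary pin taken just above. */
            printf("ref %d bind %d\n", a->refcnt - 1, a->bind);
            act_put(a);         /* the missing release is what leaked      */
        }

        int main(void)
        {
            struct action a = { .refcnt = 1, .bind = 0 };

            handle_get(&a);     /* prints "ref 1 bind 0"           */
            handle_get(&a);     /* still "ref 1 bind 0" -- no leak */
            return 0;
        }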
     

15 Jan, 2017

2 commits

  • [ Upstream commit 0df0f207aab4f42e5c96a807adf9a6845b69e984 ]

    Since we now use a non-zero mask on addr_type, we are matching on its
    value (IPV4/IPV6). So before this fix, matching on enc_src_ip/enc_dst_ip
    failed in the SW/classify path, since the stored value was zero.
    This patch sets the proper value of addr_type for encapsulated packets
    (see the masked-match sketch after this entry).

    Fixes: 970bfcd09791 ('net/sched: cls_flower: Use mask for addr_type')
    Signed-off-by: Paul Blakey
    Reviewed-by: Hadar Hen Zion
    Acked-by: Jiri Pirko
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Paul Blakey
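
    A standalone sketch of the masked-match behaviour described above
    (illustrative constants, not the flower key layout): once the mask on a
    field is non-zero, a rule whose stored value was left at zero no longer
    matches packets where that field is set.

        /* Illustrative only -- not the cls_flower code. */
        #include <stdio.h>

        int main(void)
        {
            unsigned int mask     = 0xff;  /* non-zero mask: value now matters */
            unsigned int rule_val = 0;     /* bug: left at zero for encap keys */
            unsigned int pkt_val  = 1;     /* e.g. "IPv4 addresses present"    */

            printf("match: %s\n",
                   (pkt_val & mask) == (rule_val & mask) ? "yes" : "no"); /* no  */

            rule_val = 1;                  /* fix: set the proper addr_type    */
            printf("match: %s\n",
                   (pkt_val & mask) == (rule_val & mask) ? "yes" : "no"); /* yes */
            return 0;
        }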
     
  • [ Upstream commit 628185cfddf1dfb701c4efe2cfd72cf5b09f5702 ]

    Shahar reported a soft lockup in tc_classify(), where we run into an
    endless loop when walking the classifier chain due to tp->next == tp
    which is a state we should never run into. The issue only seems to
    trigger under load in the tc control path.

    What happens is that in tc_ctl_tfilter(), thread A allocates a new
    tp, initializes it, sets tp_created to 1, and calls into tp->ops->change()
    with it. In that classifier callback we had to unlock/lock the rtnl
    mutex and returned with -EAGAIN. One reason why we need to drop there
    is, for example, that we need to request an action module to be loaded.

    This happens via tcf_exts_validate() -> tcf_action_init/_1(), meaning
    that after we loaded and found the requested action, we need to redo the
    whole request so we don't race against others. While rtnl was unlocked
    during that time, thread B's request was processed next on that CPU.
    Thread B successfully added a new tp instance to the classifier chain.
    When thread A returned, grabbing the rtnl mutex again, propagating -EAGAIN
    and destroying its tp instance which never got linked, we goto replay
    and redo A's request.

    This time when walking the classifier chain in tc_ctl_tfilter() for
    checking for existing tp instances we had a priority match and found
    the tp instance that was created and linked by thread B. Now calling
    again into tp->ops->change() with that tp was successful and returned
    without error.

    tp_created was never cleared in the second round, thus the kernel thinks
    that we need to link it into the classifier chain (once again). tp and
    *back point to the same object due to the match we had earlier on. Thus
    for thread B's already public tp, we reset tp->next to tp itself and
    link it into the chain, which eventually causes the mentioned endless
    loop in tc_classify() once a packet hits the data path.

    The fix is to clear tp_created at the beginning of each request, including
    when we replay it (see the control-flow sketch after this entry). On the
    paths that can cause -EAGAIN we already destroy the original tp instance
    we had, and on replay we really need to start from scratch. It seems that
    this issue was first introduced in commit 12186be7d2e1 ("net_cls: fix
    unconfigured struct tcf_proto keeps chaining and avoid kernel panic when
    we use cls_cgroup").

    Fixes: 12186be7d2e1 ("net_cls: fix unconfigured struct tcf_proto keeps chaining and avoid kernel panic when we use cls_cgroup")
    Reported-by: Shahar Klein
    Signed-off-by: Daniel Borkmann
    Cc: Cong Wang
    Acked-by: Eric Dumazet
    Tested-by: Shahar Klein
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Daniel Borkmann
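
    A hypothetical control-flow sketch of the fix (standalone, not the
    tc_ctl_tfilter() code; change_filter() here just simulates a callback
    that fails once with -EAGAIN): the "this request created a new instance"
    flag must be reset on every pass, so a replay never links an instance
    that another thread already made public.

        /* Illustrative only -- not kernel code. */
        #include <stdbool.h>
        #include <stdio.h>

        #define EAGAIN 11

        /* Pretend classifier callback: fails with -EAGAIN once (e.g. it had
         * to drop a lock to load an action module), then succeeds. */
        static int change_filter(void)
        {
            static int calls;

            return calls++ == 0 ? -EAGAIN : 0;
        }

        static int handle_request(void)
        {
            int err;

        replay:
            {
                bool created = false;  /* reset on every pass, incl. replays */

                /* ... lookup; assume that on the replay another thread has
                 * already linked an instance, so this pass never sets
                 * created = true ... */

                err = change_filter();
                if (err == -EAGAIN)
                    goto replay;       /* redo the whole request from scratch */
                if (!err && created)
                    printf("link the instance this pass created\n");
            }
            return err;
        }

        int main(void)
        {
            return handle_request();
        }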
     

30 Nov, 2016

2 commits


28 Nov, 2016

1 commit

  • Roi reported a crash in flower where tp->root was NULL in ->classify()
    callbacks. Reason is that in ->destroy() tp->root is set to NULL via
    RCU_INIT_POINTER(). It's problematic for some of the classifiers, because
    this doesn't respect RCU grace period for them, and as a result, still
    outstanding readers from tc_classify() will try to blindly dereference
    a NULL tp->root.

    The tp->root object is strictly private to the classifier implementation
    and holds internal data that the core, such as tc_ctl_tfilter(), doesn't
    know about. Within some classifiers, such as cls_bpf, cls_basic, etc.,
    tp->root is only checked for NULL in the ->get() callback, but nowhere
    else. This is misleading and seems to have been copied from old classifier
    code that was not cleaned up properly. For example, d3fa76ee6b4a
    ("[NET_SCHED]: cls_basic: fix NULL pointer dereference") moved tp->root
    initialization into the ->init() routine, where before it was part of
    ->change(), so back then ->get() had to deal with tp->root being NULL;
    that was indeed a valid case before d3fa76ee6b4a, but not really anymore.
    We used to set tp->root to NULL long ago in ->destroy(), see 47a1a1d4be29
    ("pkt_sched: remove unnecessary xchg() in packet classifiers"); the
    NULLifying was reintroduced with the RCUification, but it is not correct
    for every classifier implementation.

    In the cases fixed here, with the one exception of cls_cgroup, the tp->root
    object is allocated and initialized inside the ->init() callback, which is
    always performed after we allocate a new tp, which means tp and thus
    tp->root were not globally visible in the tp chain yet (see tc_ctl_tfilter()).
    Also, on destruction tp->root is strictly kfree_rcu()'ed in the ->destroy()
    handler, same for the tp, which is kfree_rcu()'ed right when we return
    from ->destroy() in tcf_destroy(). This means the head object's lifetime
    for such classifiers is always tied to the tp lifetime. The RCU callback
    invocations for the two kfree_rcu() calls could be out of order, but that's
    fine since both objects are independent.

    Dropping the RCU_INIT_POINTER(tp->root, NULL) for these classifiers here
    means that 1) we don't need a useless NULL check in the fast path, and
    2) outstanding readers of that tp in tc_classify() can still execute under
    the protection of the RCU grace period, as is actually expected.

    Things that haven't been touched here: cls_fw and cls_route. They each
    handle tp->root being NULL in the ->classify() path for historic reasons,
    so their ->destroy() implementations can stay as is. If someone actually
    cares, they could get cleaned up at some point to avoid the test in the
    fast path. cls_u32 doesn't set tp->root to NULL. For cls_rsvp, I just added
    a !head check should anyone actually be using/testing it, so it at least
    aligns with cls_fw and cls_route. For cls_flower we additionally need to
    defer the rhashtable destruction (to a sleepable context) until after the
    RCU grace period, as concurrent readers might still access it. (Note that
    in this case we need to hold a module reference to keep the work callback
    address intact, since we only wait on module unload for all call_rcu()s to
    finish.)

    This fixes one race to bring RCU grace period guarantees back. Next step
    as worked on by Cong however is to fix 1e052be69d04 ("net_sched: destroy
    proto tp when all filters are gone") to get the order of unlinking the tp
    in tc_ctl_tfilter() for the RTM_DELTFILTER case right by moving
    RCU_INIT_POINTER() before tcf_destroy() and let the notification for
    removal be done through the prior ->delete() callback. Both are independent
    issues. Once we have that right, we can then clean tp->root up for a number
    of classifiers by not making them RCU pointers, which requires a new callback
    (->uninit) triggered from tp's RCU callback, where we then just kfree()
    tp->root.

    Fixes: 1f947bf151e9 ("net: sched: rcu'ify cls_bpf")
    Fixes: 9888faefe132 ("net: sched: cls_basic use RCU")
    Fixes: 70da9f0bf999 ("net: sched: cls_flow use RCU")
    Fixes: 77b9900ef53a ("tc: introduce Flower classifier")
    Fixes: bf3994d2ed31 ("net/sched: introduce Match-all classifier")
    Fixes: 952313bd6258 ("net: sched: cls_cgroup use RCU")
    Reported-by: Roi Dayan
    Signed-off-by: Daniel Borkmann
    Cc: Cong Wang
    Cc: John Fastabend
    Cc: Roi Dayan
    Cc: Jiri Pirko
    Acked-by: John Fastabend
    Acked-by: Cong Wang
    Signed-off-by: David S. Miller

    Daniel Borkmann
     

25 Nov, 2016

1 commit


18 Nov, 2016

1 commit


30 Oct, 2016

1 commit


28 Oct, 2016

1 commit

  • Daniel says:

    While trying out [1][2], I noticed that tc monitor doesn't show the
    correct handle on delete:

    $ tc monitor
    qdisc clsact ffff: dev eno1 parent ffff:fff1
    filter dev eno1 ingress protocol all pref 49152 bpf handle 0x2a [...]
    deleted filter dev eno1 ingress protocol all pref 49152 bpf handle 0xf3be0c80

    Some context to explain the above:
    The user identity of any tc filter is represented by a 32-bit
    identifier encoded in tcm->tcm_handle, for example 0x2a in the bpf filter
    above. A user wishing to delete, get or even modify a specific filter
    uses this handle to reference it.
    Every classifier is free to provide its own semantics for the 32-bit
    handle. For example, classifiers like u32 use schemes like 800:1:801 to
    describe their filters in terms of hash table, bucket and node ids, etc.
    Classifiers also have an internal per-filter representation which is
    different from this externally visible identity. Most classifiers set this
    internal representation to be a pointer address (which allows fast retrieval
    of said filters in their implementations). This internal representation
    is referenced with the "fh" variable in the kernel control code.

    When a user successfully deletes a specific filter, by specifying the correct
    tcm->tcm_handle, an event is generated to user space which indicates
    which specific filter was deleted.

    Before this patch, the "fh" value was sent to user space as the identity.
    As an example, what is shown in the sample bpf filter delete event above
    is 0xf3be0c80. This is in fact a 32-bit truncation of 0xffff8807f3be0c80,
    which happens to be the 64-bit memory address of the internal filter
    representation (the address of the corresponding filter's struct
    cls_bpf_prog); see the truncation illustration after this entry.

    After this patch the appropriate user-identifiable handle, as encoded
    in the originating request's tcm->tcm_handle, is reported in the event.
    One of the cardinal rules of netlink is that one should be able to take
    an event (such as a delete in this case), reflect it back to the
    kernel, and successfully delete the filter. This patch achieves that.

    Note, this issue has existed since the original TC action
    infrastructure code patch back in 2004 as found in:
    https://git.kernel.org/cgit/linux/kernel/git/history/history.git/commit/

    [1] http://patchwork.ozlabs.org/patch/682828/
    [2] http://patchwork.ozlabs.org/patch/682829/

    Fixes: 4e54c4816bfe ("[NET]: Add tc extensions infrastructure.")
    Reported-by: Daniel Borkmann
    Acked-by: Cong Wang
    Signed-off-by: Jamal Hadi Salim
    Signed-off-by: David S. Miller

    Jamal Hadi Salim
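
    A standalone illustration of the truncation (userspace C, not kernel
    code; the two constants are taken from the example above): stuffing a
    64-bit internal pointer into the 32-bit tcm_handle field silently drops
    the upper half, which is why the event showed 0xf3be0c80 instead of the
    user-assigned handle 0x2a.

        #include <stdint.h>
        #include <stdio.h>

        int main(void)
        {
            uint64_t internal_fh = 0xffff8807f3be0c80ULL; /* internal "fh" (an address) */
            uint32_t user_handle = 0x2a;                  /* what the user assigned     */

            uint32_t reported = (uint32_t)internal_fh;    /* what was sent pre-fix      */

            printf("before the fix: handle 0x%x\n", reported);     /* 0xf3be0c80 */
            printf("after the fix : handle 0x%x\n", user_handle);  /* 0x2a       */
            return 0;
        }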
     

20 Oct, 2016

1 commit

    The stats_update callback is called by NIC drivers doing hardware
    offloading of the mirred action. Lastuse is passed as an argument
    to specify when the stats were actually last updated, which is not
    always the current time.

    Fixes: 9798e6fe4f9b ('net: act_mirred: allow statistic updates from offloaded actions')
    Signed-off-by: Paul Blakey
    Signed-off-by: Or Gerlitz
    Signed-off-by: David S. Miller

    Paul Blakey
     

13 Oct, 2016

2 commits

  • Krister reported a kernel NULL pointer dereference after
    tcf_action_init_1() invokes a_o->init(). It is a race condition
    where one thread calls tcf_register_action() and initializes
    the netns data only after putting the act ops in the global list,
    while the other thread searches the list and then calls
    a_o->init(net, ...).

    Fix this by moving the pernet ops registration before making
    the action ops visible (see the ordering sketch after this entry).
    This is fine because: a) we don't rely on act_base in the pernet
    ops->init(), and b) in the worst case we have a fully initialized
    netns but the ops are still not visible, so new actions still
    can't be created.

    Reported-by: Krister Johansen
    Tested-by: Krister Johansen
    Cc: Jamal Hadi Salim
    Signed-off-by: Cong Wang
    Acked-by: Jamal Hadi Salim
    Signed-off-by: David S. Miller

    WANG Cong
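
    A hypothetical single-threaded sketch of the ordering rule (not the
    act_api code; struct and function names are made up): finish all setup
    before the object becomes findable through the global list, so a
    concurrent lookup can never call into half-initialized state. Real code
    would also need the appropriate locking or memory barriers around the
    publish step.

        #include <stdio.h>

        struct ops {
            const char *name;
            int ready;                /* stands in for the per-netns data */
        };

        static struct ops *global_list;   /* what concurrent lookups search */

        static void register_ops(struct ops *o)
        {
            o->ready = 1;             /* 1) initialize (pernet registration)... */
            global_list = o;          /* 2) ...then publish for lookups         */
        }

        int main(void)
        {
            static struct ops gact = { .name = "gact" };

            register_ops(&gact);
            printf("%s ready=%d\n", global_list->name, global_list->ready);
            return 0;
        }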
     
  • There are two ways to get tc filters from kernel to user space.

    1) Full dump (tc_dump_tfilter())
    2) RTM_GETTFILTER to get one precise filter, reducing overhead.

    The second operation unfortunately broadcasts its result,
    polluting "tc monitor" users.

    This patch makes sure only the requester gets the result, using
    netlink_unicast() instead of rtnetlink_send().

    Jamal cooked an iproute2 patch to implement "tc filter get" operation,
    but other user space libraries already use RTM_GETTFILTER when a single
    filter is queried, instead of dumping all filters.

    Signed-off-by: Eric Dumazet
    Cc: Jamal Hadi Salim
    Acked-by: Jamal Hadi Salim
    Signed-off-by: David S. Miller

    Eric Dumazet
     

04 Oct, 2016

1 commit

  • Generic skb_vlan_push/skb_vlan_pop functions don't properly handle the
    case where the input skb data pointer does not point at the mac header:

    - They do the push/pop, but fail to properly unwind data back to its
    original location.
    For example, in the skb_vlan_push case, any subsequent
    'skb_push(skb, skb->mac_len)' call makes skb->data point 4 bytes
    BEFORE the start of the frame, leading to bogus frames that may be
    transmitted.

    - They update rcsum for the added/removed 4-byte tag.
    Alas, if data originally points after the vlan/eth headers, then these
    bytes were already pulled out of the csum.

    OTOH, calling skb_vlan_push/skb_vlan_pop with skb->data at mac_header
    presents no issues.

    act_vlan is the only caller of skb_vlan_*() that has skb->data pointing
    at the network header (upon ingress).
    Other callers (ovs, bpf) already adjust skb->data to the mac_header.

    This patch fixes act_vlan to point skb->data at the mac_header prior to
    calling the skb_vlan_*() functions, as other callers do (see the sketch
    after this entry).

    Signed-off-by: Shmulik Ladkani
    Cc: Daniel Borkmann
    Cc: Pravin Shelar
    Cc: Jiri Pirko
    Signed-off-by: David S. Miller

    Shmulik Ladkani
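
    A hypothetical userspace sketch of the invariant (not the actual act_vlan
    change; the fake_skb struct and offsets are made up): when a helper
    assumes the data pointer sits at the mac header, a caller whose data
    pointer is at the network header must rewind it first and restore it
    afterwards.

        #include <stdio.h>

        struct fake_skb {
            int data;      /* offset of the data pointer within the buffer */
            int mac_len;   /* length of the link-layer header              */
        };

        /* Stands in for skb_vlan_push/pop: only valid with data at the
         * mac header. */
        static void vlan_helper(struct fake_skb *skb)
        {
            printf("helper sees data at offset %d (expects the mac header)\n",
                   skb->data);
        }

        int main(void)
        {
            /* Ingress: data sits at the network header, 14 bytes past the
             * mac header which starts at offset 64. */
            struct fake_skb skb = { .data = 64 + 14, .mac_len = 14 };

            skb.data -= skb.mac_len;   /* rewind to the mac header ("push") */
            vlan_helper(&skb);
            skb.data += skb.mac_len;   /* restore for the rest of ingress ("pull") */

            printf("data restored to offset %d\n", skb.data);
            return 0;
        }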
     

03 Oct, 2016

1 commit


28 Sep, 2016

1 commit


27 Sep, 2016

2 commits

  • On the ife encode side, the action stores the different tlvs inside the ife
    header, where each tlv length field should refer to the length of the
    whole tlv (without additional padding) and not just the data length.

    On the ife decode side, the action iterates over the tlvs in the ife header
    and parses them one by one, where in each iteration the current pointer is
    advanced according to the tlv size.

    Before, the encoder put only the data length inside the tlv, which led
    to false parsing of the ife header. In addition, because the loop counter
    was unsigned, it could lead to an infinite parsing loop.

    This fix changes the loop counter to be signed and fixes the encoding to
    take the tlv type and size into account (see the TLV sketch after this
    entry).

    Fixes: 28a10c426e81 ("net sched: fix encoding to use real length")
    Acked-by: Jamal Hadi Salim
    Signed-off-by: Yotam Gigi
    Signed-off-by: David S. Miller

    Yotam Gigi
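
    A standalone TLV sketch of the two points above (illustrative layout, not
    the IFE wire format): the length field of each TLV must cover the whole
    TLV (type + length + data) so the decoder advances correctly, and the
    walker keeps a signed remaining-length so a malformed length makes it
    stop instead of wrapping around.

        #include <stdint.h>
        #include <stdio.h>
        #include <string.h>

        struct tlv { uint16_t type; uint16_t len; };  /* len covers header + data */

        static size_t tlv_put(uint8_t *buf, uint16_t type,
                              const void *data, uint16_t dlen)
        {
            struct tlv t = { type, (uint16_t)(sizeof(t) + dlen) };

            memcpy(buf, &t, sizeof(t));
            memcpy(buf + sizeof(t), data, dlen);
            return t.len;
        }

        static void tlv_walk(const uint8_t *buf, int total)  /* signed! */
        {
            while (total > 0) {
                struct tlv t;

                memcpy(&t, buf, sizeof(t));
                if (t.len < sizeof(t) || t.len > (unsigned)total)
                    return;                            /* malformed: stop */
                printf("tlv type %u, %u data bytes\n",
                       (unsigned)t.type, (unsigned)(t.len - sizeof(t)));
                buf   += t.len;
                total -= t.len;
            }
        }

        int main(void)
        {
            uint8_t buf[64];
            uint16_t tcindex = 0x11;
            size_t used = tlv_put(buf, 1, &tcindex, sizeof(tcindex));

            used += tlv_put(buf + used, 2, "ab", 2);
            tlv_walk(buf, (int)used);
            return 0;
        }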
     
  • On the ife encode side, the external mac header is copied from the original
    packet and may be overridden if the user requests. Before, the mac header
    copy was done from a memory region that might not be accessible anymore,
    as skb_cow_head might free it and copy the packet. This led to random
    values in the external mac header when the values were not set by the
    user.

    This fix takes the internal mac header from the packet after the call to
    skb_cow_head (see the illustration after this entry).

    Fixes: ef6980b6becb ("net sched: introduce IFE action")
    Acked-by: Jamal Hadi Salim
    Signed-off-by: Yotam Gigi
    Signed-off-by: David S. Miller

    Yotam Gigi
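
    A standalone userspace illustration of the underlying rule (realloc()
    stands in for skb_cow_head; this is not the act_ife code): any pointer
    taken before a call that may reallocate and copy the buffer is stale
    afterwards, so the header must be re-derived from the buffer after the
    call.

        #include <stdio.h>
        #include <stdlib.h>
        #include <string.h>

        int main(void)
        {
            char *pkt = malloc(16);

            if (!pkt)
                return 1;
            memcpy(pkt, "\xaa\xbb\xcc\xdd\xee\xff", 6);  /* pretend mac header */

            /* WRONG: a "char *mac = pkt;" taken here would dangle below. */

            char *grown = realloc(pkt, 4096); /* may move the data, like skb_cow_head */
            if (!grown) {
                free(pkt);
                return 1;
            }
            pkt = grown;

            char *mac = pkt;                  /* correct: derive it after the call */
            printf("mac header starts with %02x\n", (unsigned)(unsigned char)mac[0]);
            free(pkt);
            return 0;
        }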
     

23 Sep, 2016

4 commits

  • It looks like the following patch can make FQ very precise, even in VMs
    or on stressed hosts. It matters at high pacing rates.

    We take into account the difference between the time that was programmed
    when the last packet was sent and the current time (a drift of tens of
    usecs is often observed).

    Add an EWMA of the unthrottle latency to help diagnostics (an EWMA sketch
    follows this entry).

    This latency is the difference between the current time and the oldest
    packet in the delayed RB-tree. This accounts for the high-resolution timer
    latency, but can differ under stress, as fq_check_throttled() can
    opportunistically be called from a dequeue() that follows an enqueue()
    for a different flow.

    Tested:
    // Start a 10Gbit flow
    $ netperf --google-pacing-rate 1250000000 -H lpaa24 -l 10000 -- -K bbr &

    Before patch :
    $ sar -n DEV 10 5 | grep eth0 | grep Average
    Average: eth0 17106.04 756876.84 1102.75 1119049.02 0.00 0.00 0.52

    After patch :
    $ sar -n DEV 10 5 | grep eth0 | grep Average
    Average: eth0 17867.00 800245.90 1151.77 1183172.12 0.00 0.00 0.52

    A new iproute2 tc can output the 'unthrottle latency' :

    $ tc -s qd sh dev eth0 | grep latency
    0 gc, 0 highprio, 32490767 throttled, 2382 ns latency

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
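
    A standalone sketch of an exponentially weighted moving average over
    latency samples, the kind of diagnostic counter described above (the 1/8
    weight and the sample values are illustrative, not the qdisc's actual
    parameters).

        #include <stdio.h>

        static long ewma_ns;                   /* running average, in ns */

        static void ewma_update(long sample_ns)
        {
            /* new = old + (sample - old) / 8, i.e. each sample gets a 1/8 weight */
            ewma_ns += (sample_ns - ewma_ns) >> 3;
        }

        int main(void)
        {
            long samples[] = { 1500, 2400, 2100, 3000, 2382 };

            for (unsigned int i = 0; i < sizeof(samples) / sizeof(samples[0]); i++) {
                ewma_update(samples[i]);
                printf("sample %ld ns -> ewma %ld ns\n", samples[i], ewma_ns);
            }
            return 0;
        }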
     
  • Fixes: 2ccccf5fb43f ("net_sched: update hierarchical backlog too")
    Cc: Jamal Hadi Salim
    Signed-off-by: Cong Wang
    Acked-by: Jamal Hadi Salim
    Signed-off-by: David S. Miller

    WANG Cong
     
  • Reported-by: Stas Nichiporovich
    Fixes: 2ccccf5fb43f ("net_sched: update hierarchical backlog too")
    Cc: Jamal Hadi Salim
    Signed-off-by: Cong Wang
    Acked-by: Jamal Hadi Salim
    Signed-off-by: David S. Miller

    WANG Cong
     
  • On the error path in route4_change(), 'f' could be NULL,
    so we should check for NULL before calling tcf_exts_destroy()
    (see the sketch after this entry).

    Fixes: b9a24bb76bf6 ("net_sched: properly handle failure case of tcf_exts_init()")
    Reported-by: kbuild test robot
    Cc: Jamal Hadi Salim
    Signed-off-by: Cong Wang
    Acked-by: Jamal Hadi Salim
    Signed-off-by: David S. Miller

    WANG Cong
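
    A trivial standalone sketch of the error-path rule (illustrative names,
    not the cls_route code): a failure path that can run before the object
    was allocated must check for NULL before tearing the object down.

        #include <stdlib.h>

        struct filter { int refs; };

        static void filter_cleanup(struct filter *f)
        {
            f->refs = 0;        /* would dereference NULL without the check */
            free(f);
        }

        int main(void)
        {
            struct filter *f = NULL;   /* the error was taken before allocation */

            /* error path: */
            if (f)                     /* the missing check caused the crash */
                filter_cleanup(f);
            return 0;
        }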
     

22 Sep, 2016

6 commits


21 Sep, 2016

2 commits

  • This commit adds to the fq module a low_rate_threshold parameter to
    insert a delay after all packets if the socket requests a pacing rate
    below the threshold.

    This helps achieve more precise control of the sending rate with
    low-rate paths, especially policers. The basic issue is that if a
    congestion control module detects a policer at a certain rate, it may
    want fq to be able to shape to that policed rate. That way the sender
    can avoid policer drops by having the packets arrive at the policer at
    or just under the policed rate (a back-of-the-envelope sketch of the
    per-packet spacing follows this entry).

    The default threshold of 550Kbps was chosen analytically so that for
    policers or links at 500Kbps or 512Kbps fq would very likely invoke
    this mechanism, even if the pacing rate was briefly slightly above the
    available bandwidth. This value was then empirically validated with
    two years of production testing on YouTube video servers.

    Signed-off-by: Van Jacobson
    Signed-off-by: Neal Cardwell
    Signed-off-by: Yuchung Cheng
    Signed-off-by: Nandita Dukkipati
    Signed-off-by: Eric Dumazet
    Signed-off-by: Soheil Hassas Yeganeh
    Signed-off-by: David S. Miller

    Eric Dumazet
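
    A back-of-the-envelope standalone sketch of what pacing below such a
    threshold amounts to (the numbers are illustrative and the formula is the
    generic serialization delay, not the fq implementation): each packet is
    spaced out by roughly len * 8 / rate.

        #include <stdio.h>

        int main(void)
        {
            unsigned long long threshold_bps = 550000;  /* default mentioned above  */
            unsigned long long rate_bps      = 512000;  /* e.g. a 512 kbit/s policer */
            unsigned long long pkt_len       = 1500;    /* bytes */

            if (rate_bps < threshold_bps) {
                unsigned long long gap_ns =
                        pkt_len * 8ULL * 1000000000ULL / rate_bps;

                printf("pace one %llu-byte packet every %llu us\n",
                       pkt_len, gap_ns / 1000);         /* ~23437 us */
            }
            return 0;
        }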
     
  • With the batch changes that translated transient actions into
    a temporary list, what got lost in the translation was the fact that
    tcf_action_destroy() will eventually delete the action from
    its permanent location if the refcount is zero.

    Example of what broke:
    ...add a gact action to drop
    sudo $TC actions add action drop index 10
    ...now retrieve it, looks good
    sudo $TC actions get action gact index 10
    ...retrieve it again and find it is gone!
    sudo $TC actions get action gact index 10

    Fixes: 22dc13c837c3 ("net_sched: convert tcf_exts from list to pointer array")
    Fixes: 824a7e8863b3 ("net_sched: remove an unnecessary list_del()")
    Fixes: f07fed82ad79 ("net_sched: remove the leftover cleanup_a()")

    Acked-by: Cong Wang
    Signed-off-by: Jamal Hadi Salim
    Signed-off-by: David S. Miller

    Jamal Hadi Salim
     

20 Sep, 2016

4 commits

  • Signed-off-by: Jamal Hadi Salim
    Signed-off-by: David S. Miller

    Jamal Hadi Salim
     
  • Setting the conforming action to drop is a valid policy.
    When it is set, we need to at least see the stats indicating it
    for debugging.

    Signed-off-by: Roman Mashak
    Signed-off-by: Jamal Hadi Salim
    Signed-off-by: David S. Miller

    Roman Mashak
     
  • Sample use case of how this is encoded:
    user space via tuntap (or a connected VM/Machine/container)
    encodes the tcindex TLV.

    Sample use case of decoding:
    IFE action decodes it and the skb->tc_index is then used to classify.
    So something like this for encoded ICMP packets:

    ... first decode, then reclassify... skb->tc_index will be set
    sudo $TC filter add dev $ETH parent ffff: prio 2 protocol 0xbeef \
    u32 match u32 0 0 flowid 1:1 \
    action ife decode reclassify

    ...next match the decode icmp packet...
    sudo $TC filter add dev $ETH parent ffff: prio 4 protocol ip \
    u32 match ip protocol 1 0xff flowid 1:1 \
    action continue

    ... last, classify it using the tcindex classifier and do some action...
    sudo $TC filter add dev $ETH parent ffff: prio 5 protocol ip \
    handle 0x11 tcindex classid 1:1 \
    action blah..

    Signed-off-by: Jamal Hadi Salim
    Signed-off-by: David S. Miller

    Jamal Hadi Salim
     
    Encoder and checker for 16-bit metadata.

    Signed-off-by: Jamal Hadi Salim
    Signed-off-by: David S. Miller

    Jamal Hadi Salim
     

19 Sep, 2016

5 commits

  • This change replaces the sk_buff_head struct in Qdiscs with the new
    qdisc_skb_head.

    It's similar to the sk_buff_head API, but does not use skb->prev pointers.

    Qdiscs will commonly enqueue at the tail of a list and dequeue at the head.
    While sk_buff_head works fine for this, enqueue/dequeue also needs to
    adjust the prev pointer of the next element.

    The ->prev pointer is not required for qdiscs, so we can just leave
    it undefined and avoid one cacheline write access per en/dequeue (see the
    sketch after this entry).

    Suggested-by: Eric Dumazet
    Signed-off-by: Florian Westphal
    Signed-off-by: David S. Miller

    Florian Westphal
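
    An illustrative userspace sketch of such a head-and-tail FIFO (not the
    kernel's qdisc_skb_head or its helpers): nodes are linked only through
    ->next, so enqueueing at the tail and dequeueing at the head never touch
    a ->prev field.

        #include <stdio.h>

        struct node { struct node *next; int id; };

        struct fifo {
            struct node *head;
            struct node *tail;
            unsigned int qlen;
        };

        static void enqueue_tail(struct fifo *q, struct node *n)
        {
            n->next = NULL;
            if (q->tail)
                q->tail->next = n;   /* only the old tail is written */
            else
                q->head = n;
            q->tail = n;
            q->qlen++;
        }

        static struct node *dequeue_head(struct fifo *q)
        {
            struct node *n = q->head;

            if (!n)
                return NULL;
            q->head = n->next;       /* no prev pointer to fix up */
            if (!q->head)
                q->tail = NULL;
            q->qlen--;
            return n;
        }

        int main(void)
        {
            struct fifo q = { 0 };
            struct node a = { .id = 1 }, b = { .id = 2 };

            enqueue_tail(&q, &a);
            enqueue_tail(&q, &b);
            printf("dequeued %d\n", dequeue_head(&q)->id);  /* 1 */
            printf("dequeued %d\n", dequeue_head(&q)->id);  /* 2 */
            return 0;
        }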
     
  • After the previous patch these functions are identical.
    Replace __skb_dequeue in qdiscs with __qdisc_dequeue_head.

    The next patch will then make __qdisc_dequeue_head handle a
    singly-linked list instead of a struct sk_buff_head argument.

    Doesn't change generated code.

    Signed-off-by: Florian Westphal
    Signed-off-by: David S. Miller

    Florian Westphal
     
  • Moves qdisc stat accounting to qdisc_dequeue_head.

    The only direct caller of the __qdisc_dequeue_head version open-codes
    this now.

    This allows us to later use __qdisc_dequeue_head as a replacement
    for __skb_dequeue() (which operates on an sk_buff_head list).

    Signed-off-by: Florian Westphal
    Signed-off-by: David S. Miller

    Florian Westphal
     
  • A followup change will replace the sk_buff_head in the qdisc
    struct with a slightly different list.

    Use of the sk_buff_head helpers will thus cause compiler
    warnings.

    Open-code these accesses in an extra change to ease review.

    Signed-off-by: Florian Westphal
    Signed-off-by: David S. Miller

    Florian Westphal
     
  • Doesn't change generated code.

    Signed-off-by: Florian Westphal
    Signed-off-by: David S. Miller

    Florian Westphal