Eric Lee / smarc-fsl-linux-kernel

14 Nov, 2018

1 commit

dfba40921 net: sched: Remove TCA_OPTIONS from policy ... Browse Code »

commit e72bde6b66299602087c8c2350d36a525e75d06e upstream.

Marco reported an error with hfsc:
root@Calimero:~# tc qdisc add dev eth0 root handle 1:0 hfsc default 1
Error: Attribute failed policy validation.

Apparently a few implementations pass TCA_OPTIONS as a binary instead
of nested attribute, so drop TCA_OPTIONS from the policy.

Fixes: 8b4c3cdd9dd8 ("net: sched: Add policy validation for tc attributes")
Reported-by: Marco Berizzi
Signed-off-by: David Ahern
Signed-off-by: David S. Miller
Signed-off-by: Greg Kroah-Hartman

David Ahern
2018-11-14 03:15:18 +0800

04 Nov, 2018

1 commit

fd090ba39 net: sched: Fix for duplicate class dump ... Browse Code »

[ Upstream commit 3c53ed8fef6881a864f0ee8240ed2793ef73ad0d ]

When dumping classes by parent, kernel would return classes twice:

| # tc qdisc add dev lo root prio
| # tc class show dev lo
| class prio 8001:1 parent 8001:
| class prio 8001:2 parent 8001:
| class prio 8001:3 parent 8001:
| # tc class show dev lo parent 8001:
| class prio 8001:1 parent 8001:
| class prio 8001:2 parent 8001:
| class prio 8001:3 parent 8001:
| class prio 8001:1 parent 8001:
| class prio 8001:2 parent 8001:
| class prio 8001:3 parent 8001:

This comes from qdisc_match_from_root() potentially returning the root
qdisc itself if its handle matched. Though in that case, root's classes
were already dumped a few lines above.

Fixes: cb395b2010879 ("net: sched: optimize class dumps")
Signed-off-by: Phil Sutter
Reviewed-by: Jiri Pirko
Reviewed-by: Eric Dumazet
Signed-off-by: David S. Miller
Signed-off-by: Greg Kroah-Hartman

Phil Sutter
2018-11-04 21:52:50 +0800

18 Oct, 2018

1 commit

d90574233 net: sched: Add policy validation for tc attributes ... Browse Code »

[ Upstream commit 8b4c3cdd9dd8290343ce959a132d3b334062c5b9 ]

A number of TC attributes are processed without proper validation
(e.g., length checks). Add a tca policy for all input attributes and use
when invoking nlmsg_parse.

The 2 Fixes tags below cover the latest additions. The other attributes
are a string (KIND), nested attribute (OPTIONS which does seem to have
validation in most cases), for dumps only or a flag.

Fixes: 5bc1701881e39 ("net: sched: introduce multichain support for filters")
Fixes: d47a6b0e7c492 ("net: sched: introduce ingress/egress block index attributes for qdisc")
Signed-off-by: David Ahern
Signed-off-by: David S. Miller
Signed-off-by: Greg Kroah-Hartman

David Ahern
2018-10-18 15:16:18 +0800

29 Oct, 2017

1 commit

50317fce2 net_sched: avoid matching qdisc with zero handle ... Browse Code »

Davide found the following script triggers a NULL pointer
dereference:

ip l a name eth0 type dummy
tc q a dev eth0 parent :1 handle 1: htb

This is because for a freshly created netdevice noop_qdisc
is attached and when passing 'parent :1', kernel actually
tries to match the major handle which is 0 and noop_qdisc
has handle 0 so is matched by mistake. Commit 69012ae425d7
tries to fix a similar bug but still misses this case.

Handle 0 is not a valid one, should be just skipped. In
fact, kernel uses it as TC_H_UNSPEC.

Fixes: 69012ae425d7 ("net: sched: fix handling of singleton qdiscs with qdisc_hash")
Fixes: 59cc1f61f09c ("net: sched:convert qdisc linked list to hashtable")
Reported-by: Davide Caratti
Cc: Jiri Kosina
Cc: Eric Dumazet
Cc: Jamal Hadi Salim
Signed-off-by: Cong Wang
Signed-off-by: David S. Miller

Cong Wang
2017-10-29 16:55:03 +0800

02 Sep, 2017

1 commit

6026e043d Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net ... Browse Code »

Three cases of simple overlapping changes.

Signed-off-by: David S. Miller

David S. Miller
2017-09-02 08:42:05 +0800

01 Sep, 2017

1 commit

07d79fc7d net_sched: add reverse binding for tc class ... Browse Code »

TC filters when used as classifiers are bound to TC classes.
However, there is a hidden difference when adding them in different
orders:

1. If we add tc classes before its filters, everything is fine.
Logically, the classes exist before we specify their ID's in
filters, it is easy to bind them together, just as in the current
code base.

2. If we add tc filters before the tc classes they bind, we have to
do dynamic lookup in fast path. What's worse, this happens all
the time not just once, because on fast path tcf_result is passed
on stack, there is no way to propagate back to the one in tc filters.

This hidden difference hurts performance silently if we have many tc
classes in hierarchy.

This patch intends to close this gap by doing the reverse binding when
we create a new class, in this case we can actually search all the
filters in its parent, match and fixup by classid. And because
tcf_result is specific to each type of tc filter, we have to introduce
a new ops for each filter to tell how to bind the class.

Note, we still can NOT totally get rid of those class lookup in
->enqueue() because cgroup and flow filters have no way to determine
the classid at setup time, they still have to go through dynamic lookup.

Cc: Jamal Hadi Salim
Signed-off-by: Cong Wang
Signed-off-by: David S. Miller

Cong Wang
2017-09-01 02:40:52 +0800

26 Aug, 2017

3 commits

143976ce9 net_sched: remove tc class reference counting ... Browse Code »

For TC classes, their ->get() and ->put() are always paired, and the
reference counting is completely useless, because:

1) For class modification and dumping paths, we already hold RTNL lock,
so all of these ->get(),->change(),->put() are atomic.

2) For filter bindiing/unbinding, we use other reference counter than
this one, and they should have RTNL lock too.

3) For ->qlen_notify(), it is special because it is called on ->enqueue()
path, but we already hold qdisc tree lock there, and we hold this
tree lock when graft or delete the class too, so it should not be gone
or changed until we release the tree lock.

Therefore, this patch removes ->get() and ->put(), but:

1) Adds a new ->find() to find the pointer to a class by classid, no
refcnt.

2) Move the original class destroy upon the last refcnt into ->delete(),
right after releasing tree lock. This is fine because the class is
already removed from hash when holding the lock.

For those who also use ->put() as ->unbind(), just rename them to reflect
this change.

Cc: Jamal Hadi Salim
Signed-off-by: Cong Wang
Acked-by: Jiri Pirko
Acked-by: Jamal Hadi Salim
Signed-off-by: David S. Miller

WANG Cong
2017-08-26 08:19:10 +0800
14546ba1e net_sched: introduce tclass_del_notify() ... Browse Code »

Like for TC actions, ->delete() is a special case,
we have to prepare and fill the notification before delete
otherwise would get use-after-free after we remove the
reference count.

Signed-off-by: Cong Wang
Acked-by: Jamal Hadi Salim
Signed-off-by: David S. Miller

WANG Cong
2017-08-26 08:19:10 +0800
27d7f07c4 net_sched: get rid of more forward declarations ... Browse Code »

This is not needed if we move them up properly.

Signed-off-by: Cong Wang
Acked-by: Jiri Pirko
Acked-by: Jamal Hadi Salim
Signed-off-by: David S. Miller

WANG Cong
2017-08-26 08:19:10 +0800

25 Aug, 2017

1 commit

551143d8d net_sched: fix a refcount_t issue with noop_qdisc ... Browse Code »

syzkaller reported a refcount_t warning [1]

Issue here is that noop_qdisc refcnt was never really considered as
a true refcount, since qdisc_destroy() found TCQ_F_BUILTIN set :

if (qdisc->flags & TCQ_F_BUILTIN ||
!refcount_dec_and_test(&qdisc->refcnt)))
return;

Meaning that all atomic_inc() we did on noop_qdisc.refcnt were not
really needed, but harmless until refcount_t came.

To fix this problem, we simply need to not increment noop_qdisc.refcnt,
since we never decrement it.

[1]
refcount_t: increment on 0; use-after-free.
------------[ cut here ]------------
WARNING: CPU: 0 PID: 21754 at lib/refcount.c:152 refcount_inc+0x47/0x50 lib/refcount.c:152
Kernel panic - not syncing: panic_on_warn set ...

CPU: 0 PID: 21754 Comm: syz-executor7 Not tainted 4.13.0-rc6+ #20
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Call Trace:
__dump_stack lib/dump_stack.c:16 [inline]
dump_stack+0x194/0x257 lib/dump_stack.c:52
panic+0x1e4/0x417 kernel/panic.c:180
__warn+0x1c4/0x1d9 kernel/panic.c:541
report_bug+0x211/0x2d0 lib/bug.c:183
fixup_bug+0x40/0x90 arch/x86/kernel/traps.c:190
do_trap_no_signal arch/x86/kernel/traps.c:224 [inline]
do_trap+0x260/0x390 arch/x86/kernel/traps.c:273
do_error_trap+0x120/0x390 arch/x86/kernel/traps.c:310
do_invalid_op+0x1b/0x20 arch/x86/kernel/traps.c:323
invalid_op+0x1e/0x30 arch/x86/entry/entry_64.S:846
RIP: 0010:refcount_inc+0x47/0x50 lib/refcount.c:152
RSP: 0018:ffff8801c43477a0 EFLAGS: 00010282
RAX: 000000000000002b RBX: ffffffff86093c14 RCX: 0000000000000000
RDX: 000000000000002b RSI: ffffffff8159314e RDI: ffffed0038868ee8
RBP: ffff8801c43477a8 R08: 0000000000000001 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000000 R12: ffffffff86093ac0
R13: 0000000000000001 R14: ffff8801d0f3bac0 R15: dffffc0000000000
attach_default_qdiscs net/sched/sch_generic.c:792 [inline]
dev_activate+0x7d3/0xaa0 net/sched/sch_generic.c:833
__dev_open+0x227/0x330 net/core/dev.c:1380
__dev_change_flags+0x695/0x990 net/core/dev.c:6726
dev_change_flags+0x88/0x140 net/core/dev.c:6792
dev_ifsioc+0x5a6/0x930 net/core/dev_ioctl.c:256
dev_ioctl+0x2bc/0xf90 net/core/dev_ioctl.c:554
sock_do_ioctl+0x94/0xb0 net/socket.c:968
sock_ioctl+0x2c2/0x440 net/socket.c:1058
vfs_ioctl fs/ioctl.c:45 [inline]
do_vfs_ioctl+0x1b1/0x1520 fs/ioctl.c:685
SYSC_ioctl fs/ioctl.c:700 [inline]
SyS_ioctl+0x8f/0xc0 fs/ioctl.c:691

Fixes: 7b9364050246 ("net, sched: convert Qdisc.refcnt from atomic_t to refcount_t")
Signed-off-by: Eric Dumazet
Reported-by: Dmitry Vyukov
Cc: Reshetova, Elena
Signed-off-by: David S. Miller

Eric Dumazet
2017-08-25 12:28:24 +0800

23 Aug, 2017

1 commit

9695fe6f2 net: sched: use kvmalloc() for class hash tables ... Browse Code »

High order GFP_KERNEL allocations can stress the host badly.

Use modern kvmalloc_array()/kvfree() instead of custom
allocations.

Signed-off-by: Eric Dumazet
Acked-by: Alexei Starovoitov
Signed-off-by: David S. Miller

Eric Dumazet
2017-08-23 05:37:24 +0800

17 Aug, 2017

1 commit

959466588 net_sched: call qlen_notify only if child qdisc is empty ... Browse Code »

This callback is used for deactivating class in parent qdisc.
This is cheaper to test queue length right here.

Also this allows to catch draining screwed backlog and prevent
second deactivation of already inactive parent class which will
crash kernel for sure. Kernel with print warning at destruction
of child qdisc where no packets but backlog is not zero.

Signed-off-by: Konstantin Khlebnikov
Signed-off-by: David S. Miller

Konstantin Khlebnikov
2017-08-17 01:55:34 +0800

16 Aug, 2017

2 commits

463910e2d Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net Browse Code »

David S. Miller
2017-08-16 11:23:23 +0800
c90e95147 net_sched: remove warning from qdisc_hash_add ... Browse Code »

It was added in commit e57a784d8cae ("pkt_sched: set root qdisc
before change() in attach_default_qdiscs()") to hide duplicates
from "tc qdisc show" for incative deivices.

After 59cc1f61f ("net: sched: convert qdisc linked list to hashtable")
it triggered when classful qdisc is added to inactive device because
default qdiscs are added before switching root qdisc.

Anyway after commit ea3274695353 ("net: sched: avoid duplicates in
qdisc dump") duplicates are filtered right in dumper.

Signed-off-by: Konstantin Khlebnikov
Signed-off-by: David S. Miller

Konstantin Khlebnikov
2017-08-16 08:16:39 +0800

10 Aug, 2017

1 commit

b97bac64a rtnetlink: make rtnl_register accept a flags parameter ... Browse Code »

This change allows us to later indicate to rtnetlink core that certain
doit functions should be called without acquiring rtnl_mutex.

This change should have no effect, we simply replace the last (now
unused) calcit argument with the new flag.

Signed-off-by: Florian Westphal
Reviewed-by: Hannes Frederic Sowa
Signed-off-by: David S. Miller

Florian Westphal
2017-08-10 07:57:38 +0800

05 Jul, 2017

1 commit

7b9364050 net, sched: convert Qdisc.refcnt from atomic_t to refcount_t ... Browse Code »

refcount_t type and corresponding API should be
used instead of atomic_t when the variable is used as
a reference counter. This allows to avoid accidental
refcounter overflows that might lead to use-after-free
situations.

Signed-off-by: Elena Reshetova
Signed-off-by: Hans Liljestrand
Signed-off-by: Kees Cook
Signed-off-by: David Windsor
Signed-off-by: David S. Miller

Reshetova, Elena
2017-07-05 05:35:16 +0800

01 Jul, 2017

1 commit

b07911593 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net ... Browse Code »

A set of overlapping changes in macvlan and the rocker
driver, nothing serious.

Signed-off-by: David S. Miller

David S. Miller
2017-07-01 00:43:08 +0800

30 Jun, 2017

1 commit

c1a4872eb net: sched: Fix one possible panic when no destroy callback ... Browse Code »

When qdisc fail to init, qdisc_create would invoke the destroy callback
to cleanup. But there is no check if the callback exists really. So it
would cause the panic if there is no real destroy callback like the qdisc
codel, fq, and so on.

Take codel as an example following:
When a malicious user constructs one invalid netlink msg, it would cause
codel_init->codel_change->nla_parse_nested failed.
Then kernel would invoke the destroy callback directly but qdisc codel
doesn't define one. It causes one panic as a result.

Now add one the check for destroy to avoid the possible panic.

Fixes: 87b60cfacf9f ("net_sched: fix error recovery at qdisc creation")
Signed-off-by: Gao Feng
Acked-by: Eric Dumazet
Signed-off-by: David S. Miller

Gao Feng
2017-06-30 00:55:12 +0800

18 May, 2017

2 commits

6529eaba3 net: sched: introduce tcf block infractructure ... Browse Code »

Currently, the filter chains are direcly put into the private structures
of qdiscs. In order to be able to have multiple chains per qdisc and to
allow filter chains sharing among qdiscs, there is a need for common
object that would hold the chains. This introduces such object and calls
it "tcf_block".

Helpers to get and put the blocks are provided to be called from
individual qdisc code. Also, the original filter_list pointers are left
in qdisc privs to allow the entry into tcf_block processing without any
added overhead of possible multiple pointer dereference on fast path.

Signed-off-by: Jiri Pirko
Acked-by: Jamal Hadi Salim
Signed-off-by: David S. Miller

Jiri Pirko
2017-05-18 03:22:13 +0800
87d83093b net: sched: move tc_classify function to cls_api.c ... Browse Code »

Move tc_classify function to cls_api.c where it belongs, rename it to
fit the namespace.

Signed-off-by: Jiri Pirko
Acked-by: Jamal Hadi Salim
Signed-off-by: David S. Miller

Jiri Pirko
2017-05-18 03:22:13 +0800

12 May, 2017

1 commit

cb395b201 net: sched: optimize class dumps ... Browse Code »

In commit 59cc1f61f09c ("net: sched: convert qdisc linked list to
hashtable") we missed the opportunity to considerably speed up
tc_dump_tclass_root() if a qdisc handle is provided by user.

Instead of iterating all the qdiscs, use qdisc_match_from_root()
to directly get the one we look for.

Signed-off-by: Eric Dumazet
Cc: Jiri Kosina
Cc: Jamal Hadi Salim
Cc: Cong Wang
Cc: Jiri Pirko
Signed-off-by: David S. Miller

Eric Dumazet
2017-05-12 09:37:40 +0800

18 Apr, 2017

2 commits

c21ef3e34 net: rtnetlink: plumb extended ack to doit function ... Browse Code »

Add netlink_ext_ack arg to rtnl_doit_func. Pass extack arg to nlmsg_parse
for doit functions that call it directly.

This is the first step to using extended error reporting in rtnetlink.
>From here individual subsystems can be updated to set netlink_ext_ack as
needed.

Signed-off-by: David Ahern
Signed-off-by: David S. Miller

David Ahern
2017-04-18 03:35:38 +0800
8ea3e4391 Subject: net: allow configuring default qdisc ... Browse Code »

Since 3.12 it has been possible to configure the default queuing
discipline via sysctl. This patch adds ability to configure the
default queue discipline in kernel configuration. This is useful for
environments where configuring the value from userspace is difficult
to manage.

The default is still the same as before (pfifo_fast) and it is
possible to change after kernel init with sysctl. This is similar
to how TCP congestion control works.

Signed-off-by: Stephen Hemminger
Signed-off-by: David S. Miller

stephen hemminger
2017-04-18 01:23:06 +0800

14 Apr, 2017

1 commit

fceb6435e netlink: pass extended ACK struct to parsing functions ... Browse Code »

Pass the new extended ACK reporting struct to all of the generic
netlink parsing functions. For now, pass NULL in almost all callers
(except for some in the core.)

Signed-off-by: Johannes Berg
Signed-off-by: David S. Miller

Johannes Berg
2017-04-14 01:58:22 +0800

13 Mar, 2017

1 commit

49b499718 net: sched: make default fifo qdiscs appear in the dump ... Browse Code »

The original reason [1] for having hidden qdiscs (potential scalability
issues in qdisc_match_from_root() with single linked list in case of large
amount of qdiscs) has been invalidated by 59cc1f61f0 ("net: sched: convert
qdisc linked list to hashtable").

This allows us for bringing more clarity and determinism into the dump by
making default pfifo qdiscs visible.

We're not turning this on by default though, at it was deemed [2] too
intrusive / unnecessary change of default behavior towards userspace.
Instead, TCA_DUMP_INVISIBLE netlink attribute is introduced, which allows
applications to request complete qdisc hierarchy dump, including the
ones that have always been implicit/invisible.

Singleton noop_qdisc stays invisible, as teaching the whole infrastructure
about singletons would require quite some surgery with very little gain
(seeing no qdisc or seeing noop qdisc in the dump is probably setting
the same user expectation).

[1] http://lkml.kernel.org/r/1460732328.10638.74.camel@edumazet-glaptop3.roam.corp.google.com
[2] http://lkml.kernel.org/r/20161021.105935.1907696543877061916.davem@davemloft.net

Signed-off-by: Jiri Kosina
Signed-off-by: David S. Miller

Jiri Kosina
2017-03-13 13:53:02 +0800

18 Feb, 2017

1 commit

806a83765 pkt_sched: Remove useless qdisc_stab_lock ... Browse Code »

The qdisc_stab_lock is used in qdisc_get_stab and qdisc_put_stab.
These two functions are invoked in qdisc_create, qdisc_change, and
qdisc_destroy which run fully under RTNL.

So it already makes sure only one could access the qdisc_stab_list at
the same time. Then it is unnecessary to use qdisc_stab_lock now.

Signed-off-by: Gao Feng
Acked-by: Eric Dumazet
Signed-off-by: David S. Miller

Gao Feng
2017-02-18 04:10:18 +0800

12 Feb, 2017

1 commit

87b60cfac net_sched: fix error recovery at qdisc creation ... Browse Code »

Dmitry reported uses after free in qdisc code [1]

The problem here is that ops->init() can return an error.

qdisc_create_dflt() then call ops->destroy(),
while qdisc_create() does _not_ call it.

Four qdisc chose to call their own ops->destroy(), assuming their caller
would not.

This patch makes sure qdisc_create() calls ops->destroy()
and fixes the four qdisc to avoid double free.

[1]
BUG: KASAN: use-after-free in mq_destroy+0x242/0x290 net/sched/sch_mq.c:33 at addr ffff8801d415d440
Read of size 8 by task syz-executor2/5030
CPU: 0 PID: 5030 Comm: syz-executor2 Not tainted 4.3.5-smp-DEV #119
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
0000000000000046 ffff8801b435b870 ffffffff81bbbed4 ffff8801db000400
ffff8801d415d440 ffff8801d415dc40 ffff8801c4988510 ffff8801b435b898
ffffffff816682b1 ffff8801b435b928 ffff8801d415d440 ffff8801c49880c0
Call Trace:
[] __dump_stack lib/dump_stack.c:15 [inline]
[] dump_stack+0x6c/0x98 lib/dump_stack.c:51
[] kasan_object_err+0x21/0x70 mm/kasan/report.c:158
[] print_address_description mm/kasan/report.c:196 [inline]
[] kasan_report_error+0x1b4/0x4b0 mm/kasan/report.c:285
[] kasan_report mm/kasan/report.c:305 [inline]
[] __asan_report_load8_noabort+0x43/0x50 mm/kasan/report.c:326
[] mq_destroy+0x242/0x290 net/sched/sch_mq.c:33
[] qdisc_destroy+0x12d/0x290 net/sched/sch_generic.c:953
[] qdisc_create_dflt+0xf0/0x120 net/sched/sch_generic.c:848
[] attach_default_qdiscs net/sched/sch_generic.c:1029 [inline]
[] dev_activate+0x6ad/0x880 net/sched/sch_generic.c:1064
[] __dev_open+0x221/0x320 net/core/dev.c:1403
[] __dev_change_flags+0x15e/0x3e0 net/core/dev.c:6858
[] dev_change_flags+0x8e/0x140 net/core/dev.c:6926
[] dev_ifsioc+0x446/0x890 net/core/dev_ioctl.c:260
[] dev_ioctl+0x1ba/0xb80 net/core/dev_ioctl.c:546
[] sock_do_ioctl+0x99/0xb0 net/socket.c:879
[] sock_ioctl+0x2a0/0x390 net/socket.c:958
[] vfs_ioctl fs/ioctl.c:44 [inline]
[] do_vfs_ioctl+0x8a8/0xe50 fs/ioctl.c:611
[] SYSC_ioctl fs/ioctl.c:626 [inline]
[] SyS_ioctl+0x94/0xc0 fs/ioctl.c:617
[] entry_SYSCALL_64_fastpath+0x12/0x17

Signed-off-by: Eric Dumazet
Reported-by: Dmitry Vyukov
Signed-off-by: David S. Miller

Eric Dumazet
2017-02-12 10:38:58 +0800

11 Feb, 2017

2 commits

cf1facda2 sched: move tcf_proto_destroy and tcf_destroy_chain helpers into cls_api ... Browse Code »

Creation is done in this file, move destruction to be at the same place.

Signed-off-by: Jiri Pirko
Acked-by: Jamal Hadi Salim
Signed-off-by: David S. Miller

Jiri Pirko
2017-02-11 00:38:08 +0800
79112c26f sched: rename tcf_destroy to tcf_destroy_proto ... Browse Code »

This function destroys TC filter protocol, not TC filter. So name it
accordingly.

Signed-off-by: Jiri Pirko
Acked-by: Jamal Hadi Salim
Signed-off-by: David S. Miller

Jiri Pirko
2017-02-11 00:38:08 +0800

09 Jan, 2017

1 commit

d6264071c net-tc: make MAX_RECLASSIFY_LOOP local ... Browse Code »

This field is no longer kept in tc_verd. Remove it from the global
definition of that struct.

Signed-off-by: Willem de Bruijn
Signed-off-by: David S. Miller

Willem de Bruijn
2017-01-09 09:58:52 +0800

06 Dec, 2016

1 commit

1c0d32fde net_sched: gen_estimator: complete rewrite of rate estimators ... Browse Code »

1) Old code was hard to maintain, due to complex lock chains.
(We probably will be able to remove some kfree_rcu() in callers)

2) Using a single timer to update all estimators does not scale.

3) Code was buggy on 32bit kernel (WRITE_ONCE() on 64bit quantity
is not supposed to work well)

In this rewrite :

- I removed the RB tree that had to be scanned in
gen_estimator_active(). qdisc dumps should be much faster.

- Each estimator has its own timer.

- Estimations are maintained in net_rate_estimator structure,
instead of dirtying the qdisc. Minor, but part of the simplification.

- Reading the estimator uses RCU and a seqcount to provide proper
support for 32bit kernels.

- We reduce memory need when estimators are not used, since
we store a pointer, instead of the bytes/packets counters.

- xt_rateest_mt() no longer has to grab a spinlock.
(In the future, xt_rateest_tg() could be switched to per cpu counters)

Signed-off-by: Eric Dumazet
Signed-off-by: David S. Miller

Eric Dumazet
2016-12-06 04:21:59 +0800

08 Nov, 2016

1 commit

84c46dd86 qdisc: catch misconfig of attaching qdisc to tx_queue_len zero device ... Browse Code »

It is a clear misconfiguration to attach a qdisc to a device with
tx_queue_len zero, because some qdisc's (namely, pfifo, bfifo, gred,
htb, plug and sfb) inherit/copy this value as their queue length.

Why should the kernel catch such a misconfiguration? Because prior to
introducing the IFF_NO_QUEUE device flag, userspace found a loophole
in the qdisc config system that allowed them to achieve the equivalent
of IFF_NO_QUEUE, which is to remove the qdisc code path entirely from
a device. The loophole on older kernels is setting tx_queue_len=0,
*prior* to device qdisc init (the config time is significant, simply
setting tx_queue_len=0 doesn't trigger the loophole).

This loophole is currently used by Docker[1] to get better performance
and scalability out of the veth device. The Docker developers were
warned[1] that they needed to adjust the tx_queue_len if ever
attaching a qdisc. The OpenShift project didn't remember this warning
and attached a qdisc, this were caught and fixed in[2].

[1] https://github.com/docker/libcontainer/pull/193
[2] https://github.com/openshift/origin/pull/11126

Instead of fixing every userspace program that used this loophole, and
forgot to reset the tx_queue_len, prior to attaching a qdisc. Let's
catch the misconfiguration on the kernel side.

Signed-off-by: Jesper Dangaard Brouer
Signed-off-by: David S. Miller

Jesper Dangaard Brouer
2016-11-08 09:15:55 +0800

20 Sep, 2016

1 commit

5a7a5555a net sched: stylistic cleanups ... Browse Code »

Signed-off-by: Jamal Hadi Salim
Signed-off-by: David S. Miller

Jamal Hadi Salim
2016-09-20 10:04:14 +0800

19 Aug, 2016

2 commits

ea3274695 net: sched: avoid duplicates in qdisc dump ... Browse Code »

tc_dump_qdisc() performs dumping of the per-device qdiscs in two phases;
first, the "standard" dev->qdisc is being dumped. Second, if there is/are
ingress queue(s), they are being dumped as well.

After conversion of netdevice's qdisc linked-list into hashtable, these
two sets are not in two disjunctive sets/lists any more, but are both
"reachable" directly from netdevice's hashtable. As a consequence, the
"full-depth" dump of the ingress qdiscs results in immediately hitting the
netdevice hashtable again, and duplicating the dump that has already been
performed for dev->qdisc.
What in fact needs to be dumped in case of ingress queue is "just" the
top-level ingress qdisc, as everything else has been dumped already.

Fix this by extending tc_dump_qdisc_root() in a way that it can be instructed
whether it should (while performing the "full" per-netdev qdisc dump) perform
the whole recursion, or just dump "additional" top-level (ingress) qdiscs
without performing any kind of recursion.

This fixes duplicate dumps such as

qdisc mq 0: root
qdisc pfifo_fast 0: parent :4 bands 3 priomap 1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
qdisc pfifo_fast 0: parent :3 bands 3 priomap 1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
qdisc pfifo_fast 0: parent :2 bands 3 priomap 1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
qdisc pfifo_fast 0: parent :1 bands 3 priomap 1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
qdisc clsact ffff: parent ffff:fff1
qdisc pfifo_fast 0: parent :4 bands 3 priomap 1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
qdisc pfifo_fast 0: parent :3 bands 3 priomap 1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
qdisc pfifo_fast 0: parent :2 bands 3 priomap 1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
qdisc pfifo_fast 0: parent :1 bands 3 priomap 1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1

Fixes: 59cc1f61f ("net: sched: convert qdisc linked list to hashtable")
Reported-by: Daniel Borkmann
Tested-by: Daniel Borkmann
Signed-off-by: Jiri Kosina
Signed-off-by: David S. Miller

Jiri Kosina
2016-08-19 12:19:08 +0800
69012ae42 net: sched: fix handling of singleton qdiscs with qdisc_hash ... Browse Code »

qdisc_match_from_root() is now iterating over per-netdevice qdisc
hashtable instead of going through a linked-list of qdiscs (independently
on the actual underlying netdev), which was the case before the switch to
hashtable for qdiscs.

For singleton qdiscs, there is no underlying netdev associated though, and
therefore dumping a singleton qdisc will panic, as qdisc_dev(root) will
always be NULL.

BUG: unable to handle kernel NULL pointer dereference at 0000000000000410
IP: [] qdisc_match_from_root+0x2c/0x70
PGD 1aceba067 PUD 1aceb7067 PMD 0
Oops: 0000 [#1] PREEMPT SMP
[ ... ]
task: ffff8801ec996e00 task.stack: ffff8801ec934000
RIP: 0010:[] [] qdisc_match_from_root+0x2c/0x70
RSP: 0018:ffff8801ec937ab0 EFLAGS: 00010203
RAX: 0000000000000408 RBX: ffff88025e612000 RCX: ffffffffffffffd8
RDX: 0000000000000000 RSI: 00000000ffff0000 RDI: ffffffff81cf8100
RBP: ffff8801ec937ab0 R08: 000000000001c160 R09: ffff8802668032c0
R10: ffffffff81cf8100 R11: 0000000000000030 R12: 00000000ffff0000
R13: ffff88025e612000 R14: ffffffff81cf3140 R15: 0000000000000000
FS: 00007f24b9af6740(0000) GS:ffff88026f280000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000410 CR3: 00000001aceec000 CR4: 00000000001406e0
Stack:
ffff8801ec937ad0 ffffffff81681210 ffff88025dd51a00 00000000fffffff1
ffff8801ec937b88 ffffffff81681e4e ffffffff81c42bc0 ffff880262431500
ffffffff81cf3140 ffff88025dd51a10 ffff88025dd51a24 00000000ec937b38
Call Trace:
[] qdisc_lookup+0x40/0x50
[] tc_modify_qdisc+0x21e/0x550
[] rtnetlink_rcv_msg+0x95/0x220
[] ? __kmalloc_track_caller+0x172/0x230
[] ? rtnl_newlink+0x870/0x870
[] netlink_rcv_skb+0xa7/0xc0
[] rtnetlink_rcv+0x28/0x30
[] netlink_unicast+0x15b/0x210
[] netlink_sendmsg+0x319/0x390
[] sock_sendmsg+0x38/0x50
[] ___sys_sendmsg+0x256/0x260
[] ? __pagevec_lru_add_fn+0x135/0x280
[] ? pagevec_lru_move_fn+0xd0/0xf0
[] ? trace_event_raw_event_mm_lru_insertion+0x180/0x180
[] ? __lru_cache_add+0x75/0xb0
[] ? _raw_spin_unlock+0x16/0x40
[] ? handle_mm_fault+0x39f/0x1160
[] __sys_sendmsg+0x45/0x80
[] SyS_sendmsg+0x12/0x20
[] do_syscall_64+0x57/0xb0

Fix this by special-casing singleton qdiscs (those that don't have
underlying netdevice) and introduce immediate handling of those rather
than trying to go over an underlying netdevice. We're in the same
situation in tc_dump_qdisc_root() and tc_dump_tclass_root().

Ultimately, this will have to be slightly reworked so that we are actually
able to show singleton qdiscs (noop) in the dump properly; but we're not
currently doing that anyway, so no regression there, and better do this in
a gradual manner.

Fixes: 59cc1f61f ("net: sched: convert qdisc linked list to hashtable")
Reported-by: Daniel Borkmann
Tested-by: Daniel Borkmann
Reported-by: David Ahern
Tested-by: David Ahern
Signed-off-by: Jiri Kosina
Signed-off-by: David S. Miller

Jiri Kosina
2016-08-19 12:19:08 +0800

11 Aug, 2016

1 commit

59cc1f61f net: sched: convert qdisc linked list to hashtable ... Browse Code »

Convert the per-device linked list into a hashtable. The primary
motivation for this change is that currently, we're not tracking all the
qdiscs in hierarchy (e.g. excluding default qdiscs), as the lookup
performed over the linked list by qdisc_match_from_root() is rather
expensive.

The ultimate goal is to get rid of hidden qdiscs completely, which will
bring much more determinism in user experience.

Reviewed-by: Cong Wang
Signed-off-by: Jiri Kosina
Signed-off-by: David S. Miller

Jiri Kosina
2016-08-11 08:19:02 +0800

13 Jun, 2016

1 commit

99860208b sched: remove NET_XMIT_POLICED ... Browse Code »

sch_atm returns this when TC_ACT_SHOT classification occurs.

But all other schedulers that use tc_classify
(htb, hfsc, drr, fq_codel ...) return NET_XMIT_SUCCESS | __BYPASS
in this case so just do that in atm.

BATMAN uses it as an intermediate return value to signal
forwarding vs. buffering, but it did not return POLICED to
callers outside of BATMAN.

Reviewed-by: Sven Eckelmann
Signed-off-by: Florian Westphal
Signed-off-by: David S. Miller

Florian Westphal
2016-06-13 10:02:11 +0800

11 Jun, 2016

1 commit

45f50bed1 net_sched: remove generic throttled management ... Browse Code »

__QDISC_STATE_THROTTLED bit manipulation is rather expensive
for HTB and few others.

I already removed it for sch_fq in commit f2600cf02b5b
("net: sched: avoid costly atomic operation in fq_dequeue()")
and so far nobody complained.

When one ore more packets are stuck in one or more throttled
HTB class, a htb dequeue() performs two atomic operations
to clear/set __QDISC_STATE_THROTTLED bit, while root qdisc
lock is held.

Removing this pair of atomic operations bring me a 8 % performance
increase on 200 TCP_RR tests, in presence of throttled classes.

This patch has no side effect, since nothing actually uses
disc_is_throttled() anymore.

Signed-off-by: Eric Dumazet
Signed-off-by: David S. Miller

Eric Dumazet
2016-06-11 14:58:21 +0800

08 Jun, 2016

1 commit

edb09eb17 net: sched: do not acquire qdisc spinlock in qdisc/class stats dump ... Browse Code »

Large tc dumps (tc -s {qdisc|class} sh dev ethX) done by Google BwE host
agent [1] are problematic at scale :

For each qdisc/class found in the dump, we currently lock the root qdisc
spinlock in order to get stats. Sampling stats every 5 seconds from
thousands of HTB classes is a challenge when the root qdisc spinlock is
under high pressure. Not only the dumps take time, they also slow
down the fast path (queue/dequeue packets) by 10 % to 20 % in some cases.

An audit of existing qdiscs showed that sch_fq_codel is the only qdisc
that might need the qdisc lock in fq_codel_dump_stats() and
fq_codel_dump_class_stats()

In v2 of this patch, I now use the Qdisc running seqcount to provide
consistent reads of packets/bytes counters, regardless of 32/64 bit arches.

I also changed rate estimators to use the same infrastructure
so that they no longer need to lock root qdisc lock.

[1]
http://static.googleusercontent.com/media/research.google.com/en//pubs/archive/43838.pdf

Signed-off-by: Eric Dumazet
Cc: Cong Wang
Cc: Jamal Hadi Salim
Cc: John Fastabend
Cc: Kevin Athey
Cc: Xiaotian Pei
Signed-off-by: David S. Miller

Eric Dumazet
2016-06-08 07:37:14 +0800

25 May, 2016

1 commit

a9efad8b2 net_sched: avoid too many hrtimer_start() calls ... Browse Code »

I found a serious performance bug in packet schedulers using hrtimers.

sch_htb and sch_fq are definitely impacted by this problem.

We constantly rearm high resolution timers if some packets are throttled
in one (or more) class, and other packets are flying through qdisc on
another (non throttled) class.

hrtimer_start() does not have the mod_timer() trick of doing nothing if
expires value does not change :

if (timer_pending(timer) &&
timer->expires == expires)
return 1;

This issue is particularly visible when multiple cpus can queue/dequeue
packets on the same qdisc, as hrtimer code has to lock a remote base.

I used following fix :

1) Change htb to use qdisc_watchdog_schedule_ns() instead of open-coding
it.

2) Cache watchdog prior expiration. hrtimer might provide this, but I
prefer to not rely on some hrtimer internal.

Signed-off-by: Eric Dumazet
Signed-off-by: David S. Miller

Eric Dumazet
2016-05-25 05:49:14 +0800