14 Nov, 2018

1 commit

  • commit e72bde6b66299602087c8c2350d36a525e75d06e upstream.

    Marco reported an error with hfsc:
    root@Calimero:~# tc qdisc add dev eth0 root handle 1:0 hfsc default 1
    Error: Attribute failed policy validation.

    Apparently a few implementations pass TCA_OPTIONS as a binary instead
    of nested attribute, so drop TCA_OPTIONS from the policy.

    Fixes: 8b4c3cdd9dd8 ("net: sched: Add policy validation for tc attributes")
    Reported-by: Marco Berizzi
    Signed-off-by: David Ahern
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    David Ahern
     

04 Nov, 2018

1 commit

  • [ Upstream commit 3c53ed8fef6881a864f0ee8240ed2793ef73ad0d ]

    When dumping classes by parent, kernel would return classes twice:

    | # tc qdisc add dev lo root prio
    | # tc class show dev lo
    | class prio 8001:1 parent 8001:
    | class prio 8001:2 parent 8001:
    | class prio 8001:3 parent 8001:
    | # tc class show dev lo parent 8001:
    | class prio 8001:1 parent 8001:
    | class prio 8001:2 parent 8001:
    | class prio 8001:3 parent 8001:
    | class prio 8001:1 parent 8001:
    | class prio 8001:2 parent 8001:
    | class prio 8001:3 parent 8001:

    This comes from qdisc_match_from_root() potentially returning the root
    qdisc itself if its handle matched. Though in that case, root's classes
    were already dumped a few lines above.

    Fixes: cb395b2010879 ("net: sched: optimize class dumps")
    Signed-off-by: Phil Sutter
    Reviewed-by: Jiri Pirko
    Reviewed-by: Eric Dumazet
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Phil Sutter
     

18 Oct, 2018

1 commit

  • [ Upstream commit 8b4c3cdd9dd8290343ce959a132d3b334062c5b9 ]

    A number of TC attributes are processed without proper validation
    (e.g., length checks). Add a tca policy for all input attributes and use
    when invoking nlmsg_parse.

    The 2 Fixes tags below cover the latest additions. The other attributes
    are a string (KIND), nested attribute (OPTIONS which does seem to have
    validation in most cases), for dumps only or a flag.

    Fixes: 5bc1701881e39 ("net: sched: introduce multichain support for filters")
    Fixes: d47a6b0e7c492 ("net: sched: introduce ingress/egress block index attributes for qdisc")
    Signed-off-by: David Ahern
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    David Ahern
     

29 Oct, 2017

1 commit

  • Davide found the following script triggers a NULL pointer
    dereference:

    ip l a name eth0 type dummy
    tc q a dev eth0 parent :1 handle 1: htb

    This is because for a freshly created netdevice noop_qdisc
    is attached and when passing 'parent :1', kernel actually
    tries to match the major handle which is 0 and noop_qdisc
    has handle 0 so is matched by mistake. Commit 69012ae425d7
    tries to fix a similar bug but still misses this case.

    Handle 0 is not a valid one, should be just skipped. In
    fact, kernel uses it as TC_H_UNSPEC.

    Fixes: 69012ae425d7 ("net: sched: fix handling of singleton qdiscs with qdisc_hash")
    Fixes: 59cc1f61f09c ("net: sched:convert qdisc linked list to hashtable")
    Reported-by: Davide Caratti
    Cc: Jiri Kosina
    Cc: Eric Dumazet
    Cc: Jamal Hadi Salim
    Signed-off-by: Cong Wang
    Signed-off-by: David S. Miller

    Cong Wang
     

02 Sep, 2017

1 commit


01 Sep, 2017

1 commit

  • TC filters when used as classifiers are bound to TC classes.
    However, there is a hidden difference when adding them in different
    orders:

    1. If we add tc classes before its filters, everything is fine.
    Logically, the classes exist before we specify their ID's in
    filters, it is easy to bind them together, just as in the current
    code base.

    2. If we add tc filters before the tc classes they bind, we have to
    do dynamic lookup in fast path. What's worse, this happens all
    the time not just once, because on fast path tcf_result is passed
    on stack, there is no way to propagate back to the one in tc filters.

    This hidden difference hurts performance silently if we have many tc
    classes in hierarchy.

    This patch intends to close this gap by doing the reverse binding when
    we create a new class, in this case we can actually search all the
    filters in its parent, match and fixup by classid. And because
    tcf_result is specific to each type of tc filter, we have to introduce
    a new ops for each filter to tell how to bind the class.

    Note, we still can NOT totally get rid of those class lookup in
    ->enqueue() because cgroup and flow filters have no way to determine
    the classid at setup time, they still have to go through dynamic lookup.

    Cc: Jamal Hadi Salim
    Signed-off-by: Cong Wang
    Signed-off-by: David S. Miller

    Cong Wang
     

26 Aug, 2017

3 commits

  • For TC classes, their ->get() and ->put() are always paired, and the
    reference counting is completely useless, because:

    1) For class modification and dumping paths, we already hold RTNL lock,
    so all of these ->get(),->change(),->put() are atomic.

    2) For filter bindiing/unbinding, we use other reference counter than
    this one, and they should have RTNL lock too.

    3) For ->qlen_notify(), it is special because it is called on ->enqueue()
    path, but we already hold qdisc tree lock there, and we hold this
    tree lock when graft or delete the class too, so it should not be gone
    or changed until we release the tree lock.

    Therefore, this patch removes ->get() and ->put(), but:

    1) Adds a new ->find() to find the pointer to a class by classid, no
    refcnt.

    2) Move the original class destroy upon the last refcnt into ->delete(),
    right after releasing tree lock. This is fine because the class is
    already removed from hash when holding the lock.

    For those who also use ->put() as ->unbind(), just rename them to reflect
    this change.

    Cc: Jamal Hadi Salim
    Signed-off-by: Cong Wang
    Acked-by: Jiri Pirko
    Acked-by: Jamal Hadi Salim
    Signed-off-by: David S. Miller

    WANG Cong
     
  • Like for TC actions, ->delete() is a special case,
    we have to prepare and fill the notification before delete
    otherwise would get use-after-free after we remove the
    reference count.

    Signed-off-by: Cong Wang
    Acked-by: Jamal Hadi Salim
    Signed-off-by: David S. Miller

    WANG Cong
     
  • This is not needed if we move them up properly.

    Signed-off-by: Cong Wang
    Acked-by: Jiri Pirko
    Acked-by: Jamal Hadi Salim
    Signed-off-by: David S. Miller

    WANG Cong
     

25 Aug, 2017

1 commit

  • syzkaller reported a refcount_t warning [1]

    Issue here is that noop_qdisc refcnt was never really considered as
    a true refcount, since qdisc_destroy() found TCQ_F_BUILTIN set :

    if (qdisc->flags & TCQ_F_BUILTIN ||
    !refcount_dec_and_test(&qdisc->refcnt)))
    return;

    Meaning that all atomic_inc() we did on noop_qdisc.refcnt were not
    really needed, but harmless until refcount_t came.

    To fix this problem, we simply need to not increment noop_qdisc.refcnt,
    since we never decrement it.

    [1]
    refcount_t: increment on 0; use-after-free.
    ------------[ cut here ]------------
    WARNING: CPU: 0 PID: 21754 at lib/refcount.c:152 refcount_inc+0x47/0x50 lib/refcount.c:152
    Kernel panic - not syncing: panic_on_warn set ...

    CPU: 0 PID: 21754 Comm: syz-executor7 Not tainted 4.13.0-rc6+ #20
    Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
    Call Trace:
    __dump_stack lib/dump_stack.c:16 [inline]
    dump_stack+0x194/0x257 lib/dump_stack.c:52
    panic+0x1e4/0x417 kernel/panic.c:180
    __warn+0x1c4/0x1d9 kernel/panic.c:541
    report_bug+0x211/0x2d0 lib/bug.c:183
    fixup_bug+0x40/0x90 arch/x86/kernel/traps.c:190
    do_trap_no_signal arch/x86/kernel/traps.c:224 [inline]
    do_trap+0x260/0x390 arch/x86/kernel/traps.c:273
    do_error_trap+0x120/0x390 arch/x86/kernel/traps.c:310
    do_invalid_op+0x1b/0x20 arch/x86/kernel/traps.c:323
    invalid_op+0x1e/0x30 arch/x86/entry/entry_64.S:846
    RIP: 0010:refcount_inc+0x47/0x50 lib/refcount.c:152
    RSP: 0018:ffff8801c43477a0 EFLAGS: 00010282
    RAX: 000000000000002b RBX: ffffffff86093c14 RCX: 0000000000000000
    RDX: 000000000000002b RSI: ffffffff8159314e RDI: ffffed0038868ee8
    RBP: ffff8801c43477a8 R08: 0000000000000001 R09: 0000000000000000
    R10: 0000000000000000 R11: 0000000000000000 R12: ffffffff86093ac0
    R13: 0000000000000001 R14: ffff8801d0f3bac0 R15: dffffc0000000000
    attach_default_qdiscs net/sched/sch_generic.c:792 [inline]
    dev_activate+0x7d3/0xaa0 net/sched/sch_generic.c:833
    __dev_open+0x227/0x330 net/core/dev.c:1380
    __dev_change_flags+0x695/0x990 net/core/dev.c:6726
    dev_change_flags+0x88/0x140 net/core/dev.c:6792
    dev_ifsioc+0x5a6/0x930 net/core/dev_ioctl.c:256
    dev_ioctl+0x2bc/0xf90 net/core/dev_ioctl.c:554
    sock_do_ioctl+0x94/0xb0 net/socket.c:968
    sock_ioctl+0x2c2/0x440 net/socket.c:1058
    vfs_ioctl fs/ioctl.c:45 [inline]
    do_vfs_ioctl+0x1b1/0x1520 fs/ioctl.c:685
    SYSC_ioctl fs/ioctl.c:700 [inline]
    SyS_ioctl+0x8f/0xc0 fs/ioctl.c:691

    Fixes: 7b9364050246 ("net, sched: convert Qdisc.refcnt from atomic_t to refcount_t")
    Signed-off-by: Eric Dumazet
    Reported-by: Dmitry Vyukov
    Cc: Reshetova, Elena
    Signed-off-by: David S. Miller

    Eric Dumazet
     

23 Aug, 2017

1 commit


17 Aug, 2017

1 commit

  • This callback is used for deactivating class in parent qdisc.
    This is cheaper to test queue length right here.

    Also this allows to catch draining screwed backlog and prevent
    second deactivation of already inactive parent class which will
    crash kernel for sure. Kernel with print warning at destruction
    of child qdisc where no packets but backlog is not zero.

    Signed-off-by: Konstantin Khlebnikov
    Signed-off-by: David S. Miller

    Konstantin Khlebnikov
     

16 Aug, 2017

2 commits


10 Aug, 2017

1 commit

  • This change allows us to later indicate to rtnetlink core that certain
    doit functions should be called without acquiring rtnl_mutex.

    This change should have no effect, we simply replace the last (now
    unused) calcit argument with the new flag.

    Signed-off-by: Florian Westphal
    Reviewed-by: Hannes Frederic Sowa
    Signed-off-by: David S. Miller

    Florian Westphal
     

05 Jul, 2017

1 commit

  • refcount_t type and corresponding API should be
    used instead of atomic_t when the variable is used as
    a reference counter. This allows to avoid accidental
    refcounter overflows that might lead to use-after-free
    situations.

    Signed-off-by: Elena Reshetova
    Signed-off-by: Hans Liljestrand
    Signed-off-by: Kees Cook
    Signed-off-by: David Windsor
    Signed-off-by: David S. Miller

    Reshetova, Elena
     

01 Jul, 2017

1 commit


30 Jun, 2017

1 commit

  • When qdisc fail to init, qdisc_create would invoke the destroy callback
    to cleanup. But there is no check if the callback exists really. So it
    would cause the panic if there is no real destroy callback like the qdisc
    codel, fq, and so on.

    Take codel as an example following:
    When a malicious user constructs one invalid netlink msg, it would cause
    codel_init->codel_change->nla_parse_nested failed.
    Then kernel would invoke the destroy callback directly but qdisc codel
    doesn't define one. It causes one panic as a result.

    Now add one the check for destroy to avoid the possible panic.

    Fixes: 87b60cfacf9f ("net_sched: fix error recovery at qdisc creation")
    Signed-off-by: Gao Feng
    Acked-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Gao Feng
     

18 May, 2017

2 commits

  • Currently, the filter chains are direcly put into the private structures
    of qdiscs. In order to be able to have multiple chains per qdisc and to
    allow filter chains sharing among qdiscs, there is a need for common
    object that would hold the chains. This introduces such object and calls
    it "tcf_block".

    Helpers to get and put the blocks are provided to be called from
    individual qdisc code. Also, the original filter_list pointers are left
    in qdisc privs to allow the entry into tcf_block processing without any
    added overhead of possible multiple pointer dereference on fast path.

    Signed-off-by: Jiri Pirko
    Acked-by: Jamal Hadi Salim
    Signed-off-by: David S. Miller

    Jiri Pirko
     
  • Move tc_classify function to cls_api.c where it belongs, rename it to
    fit the namespace.

    Signed-off-by: Jiri Pirko
    Acked-by: Jamal Hadi Salim
    Signed-off-by: David S. Miller

    Jiri Pirko
     

12 May, 2017

1 commit

  • In commit 59cc1f61f09c ("net: sched: convert qdisc linked list to
    hashtable") we missed the opportunity to considerably speed up
    tc_dump_tclass_root() if a qdisc handle is provided by user.

    Instead of iterating all the qdiscs, use qdisc_match_from_root()
    to directly get the one we look for.

    Signed-off-by: Eric Dumazet
    Cc: Jiri Kosina
    Cc: Jamal Hadi Salim
    Cc: Cong Wang
    Cc: Jiri Pirko
    Signed-off-by: David S. Miller

    Eric Dumazet
     

18 Apr, 2017

2 commits

  • Add netlink_ext_ack arg to rtnl_doit_func. Pass extack arg to nlmsg_parse
    for doit functions that call it directly.

    This is the first step to using extended error reporting in rtnetlink.
    >From here individual subsystems can be updated to set netlink_ext_ack as
    needed.

    Signed-off-by: David Ahern
    Signed-off-by: David S. Miller

    David Ahern
     
  • Since 3.12 it has been possible to configure the default queuing
    discipline via sysctl. This patch adds ability to configure the
    default queue discipline in kernel configuration. This is useful for
    environments where configuring the value from userspace is difficult
    to manage.

    The default is still the same as before (pfifo_fast) and it is
    possible to change after kernel init with sysctl. This is similar
    to how TCP congestion control works.

    Signed-off-by: Stephen Hemminger
    Signed-off-by: David S. Miller

    stephen hemminger
     

14 Apr, 2017

1 commit


13 Mar, 2017

1 commit

  • The original reason [1] for having hidden qdiscs (potential scalability
    issues in qdisc_match_from_root() with single linked list in case of large
    amount of qdiscs) has been invalidated by 59cc1f61f0 ("net: sched: convert
    qdisc linked list to hashtable").

    This allows us for bringing more clarity and determinism into the dump by
    making default pfifo qdiscs visible.

    We're not turning this on by default though, at it was deemed [2] too
    intrusive / unnecessary change of default behavior towards userspace.
    Instead, TCA_DUMP_INVISIBLE netlink attribute is introduced, which allows
    applications to request complete qdisc hierarchy dump, including the
    ones that have always been implicit/invisible.

    Singleton noop_qdisc stays invisible, as teaching the whole infrastructure
    about singletons would require quite some surgery with very little gain
    (seeing no qdisc or seeing noop qdisc in the dump is probably setting
    the same user expectation).

    [1] http://lkml.kernel.org/r/1460732328.10638.74.camel@edumazet-glaptop3.roam.corp.google.com
    [2] http://lkml.kernel.org/r/20161021.105935.1907696543877061916.davem@davemloft.net

    Signed-off-by: Jiri Kosina
    Signed-off-by: David S. Miller

    Jiri Kosina
     

18 Feb, 2017

1 commit

  • The qdisc_stab_lock is used in qdisc_get_stab and qdisc_put_stab.
    These two functions are invoked in qdisc_create, qdisc_change, and
    qdisc_destroy which run fully under RTNL.

    So it already makes sure only one could access the qdisc_stab_list at
    the same time. Then it is unnecessary to use qdisc_stab_lock now.

    Signed-off-by: Gao Feng
    Acked-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Gao Feng
     

12 Feb, 2017

1 commit

  • Dmitry reported uses after free in qdisc code [1]

    The problem here is that ops->init() can return an error.

    qdisc_create_dflt() then call ops->destroy(),
    while qdisc_create() does _not_ call it.

    Four qdisc chose to call their own ops->destroy(), assuming their caller
    would not.

    This patch makes sure qdisc_create() calls ops->destroy()
    and fixes the four qdisc to avoid double free.

    [1]
    BUG: KASAN: use-after-free in mq_destroy+0x242/0x290 net/sched/sch_mq.c:33 at addr ffff8801d415d440
    Read of size 8 by task syz-executor2/5030
    CPU: 0 PID: 5030 Comm: syz-executor2 Not tainted 4.3.5-smp-DEV #119
    Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
    0000000000000046 ffff8801b435b870 ffffffff81bbbed4 ffff8801db000400
    ffff8801d415d440 ffff8801d415dc40 ffff8801c4988510 ffff8801b435b898
    ffffffff816682b1 ffff8801b435b928 ffff8801d415d440 ffff8801c49880c0
    Call Trace:
    [] __dump_stack lib/dump_stack.c:15 [inline]
    [] dump_stack+0x6c/0x98 lib/dump_stack.c:51
    [] kasan_object_err+0x21/0x70 mm/kasan/report.c:158
    [] print_address_description mm/kasan/report.c:196 [inline]
    [] kasan_report_error+0x1b4/0x4b0 mm/kasan/report.c:285
    [] kasan_report mm/kasan/report.c:305 [inline]
    [] __asan_report_load8_noabort+0x43/0x50 mm/kasan/report.c:326
    [] mq_destroy+0x242/0x290 net/sched/sch_mq.c:33
    [] qdisc_destroy+0x12d/0x290 net/sched/sch_generic.c:953
    [] qdisc_create_dflt+0xf0/0x120 net/sched/sch_generic.c:848
    [] attach_default_qdiscs net/sched/sch_generic.c:1029 [inline]
    [] dev_activate+0x6ad/0x880 net/sched/sch_generic.c:1064
    [] __dev_open+0x221/0x320 net/core/dev.c:1403
    [] __dev_change_flags+0x15e/0x3e0 net/core/dev.c:6858
    [] dev_change_flags+0x8e/0x140 net/core/dev.c:6926
    [] dev_ifsioc+0x446/0x890 net/core/dev_ioctl.c:260
    [] dev_ioctl+0x1ba/0xb80 net/core/dev_ioctl.c:546
    [] sock_do_ioctl+0x99/0xb0 net/socket.c:879
    [] sock_ioctl+0x2a0/0x390 net/socket.c:958
    [] vfs_ioctl fs/ioctl.c:44 [inline]
    [] do_vfs_ioctl+0x8a8/0xe50 fs/ioctl.c:611
    [] SYSC_ioctl fs/ioctl.c:626 [inline]
    [] SyS_ioctl+0x94/0xc0 fs/ioctl.c:617
    [] entry_SYSCALL_64_fastpath+0x12/0x17

    Signed-off-by: Eric Dumazet
    Reported-by: Dmitry Vyukov
    Signed-off-by: David S. Miller

    Eric Dumazet
     

11 Feb, 2017

2 commits


09 Jan, 2017

1 commit


06 Dec, 2016

1 commit

  • 1) Old code was hard to maintain, due to complex lock chains.
    (We probably will be able to remove some kfree_rcu() in callers)

    2) Using a single timer to update all estimators does not scale.

    3) Code was buggy on 32bit kernel (WRITE_ONCE() on 64bit quantity
    is not supposed to work well)

    In this rewrite :

    - I removed the RB tree that had to be scanned in
    gen_estimator_active(). qdisc dumps should be much faster.

    - Each estimator has its own timer.

    - Estimations are maintained in net_rate_estimator structure,
    instead of dirtying the qdisc. Minor, but part of the simplification.

    - Reading the estimator uses RCU and a seqcount to provide proper
    support for 32bit kernels.

    - We reduce memory need when estimators are not used, since
    we store a pointer, instead of the bytes/packets counters.

    - xt_rateest_mt() no longer has to grab a spinlock.
    (In the future, xt_rateest_tg() could be switched to per cpu counters)

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

08 Nov, 2016

1 commit

  • It is a clear misconfiguration to attach a qdisc to a device with
    tx_queue_len zero, because some qdisc's (namely, pfifo, bfifo, gred,
    htb, plug and sfb) inherit/copy this value as their queue length.

    Why should the kernel catch such a misconfiguration? Because prior to
    introducing the IFF_NO_QUEUE device flag, userspace found a loophole
    in the qdisc config system that allowed them to achieve the equivalent
    of IFF_NO_QUEUE, which is to remove the qdisc code path entirely from
    a device. The loophole on older kernels is setting tx_queue_len=0,
    *prior* to device qdisc init (the config time is significant, simply
    setting tx_queue_len=0 doesn't trigger the loophole).

    This loophole is currently used by Docker[1] to get better performance
    and scalability out of the veth device. The Docker developers were
    warned[1] that they needed to adjust the tx_queue_len if ever
    attaching a qdisc. The OpenShift project didn't remember this warning
    and attached a qdisc, this were caught and fixed in[2].

    [1] https://github.com/docker/libcontainer/pull/193
    [2] https://github.com/openshift/origin/pull/11126

    Instead of fixing every userspace program that used this loophole, and
    forgot to reset the tx_queue_len, prior to attaching a qdisc. Let's
    catch the misconfiguration on the kernel side.

    Signed-off-by: Jesper Dangaard Brouer
    Signed-off-by: David S. Miller

    Jesper Dangaard Brouer
     

20 Sep, 2016

1 commit


19 Aug, 2016

2 commits

  • tc_dump_qdisc() performs dumping of the per-device qdiscs in two phases;
    first, the "standard" dev->qdisc is being dumped. Second, if there is/are
    ingress queue(s), they are being dumped as well.

    After conversion of netdevice's qdisc linked-list into hashtable, these
    two sets are not in two disjunctive sets/lists any more, but are both
    "reachable" directly from netdevice's hashtable. As a consequence, the
    "full-depth" dump of the ingress qdiscs results in immediately hitting the
    netdevice hashtable again, and duplicating the dump that has already been
    performed for dev->qdisc.
    What in fact needs to be dumped in case of ingress queue is "just" the
    top-level ingress qdisc, as everything else has been dumped already.

    Fix this by extending tc_dump_qdisc_root() in a way that it can be instructed
    whether it should (while performing the "full" per-netdev qdisc dump) perform
    the whole recursion, or just dump "additional" top-level (ingress) qdiscs
    without performing any kind of recursion.

    This fixes duplicate dumps such as

    qdisc mq 0: root
    qdisc pfifo_fast 0: parent :4 bands 3 priomap 1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
    qdisc pfifo_fast 0: parent :3 bands 3 priomap 1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
    qdisc pfifo_fast 0: parent :2 bands 3 priomap 1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
    qdisc pfifo_fast 0: parent :1 bands 3 priomap 1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
    qdisc clsact ffff: parent ffff:fff1
    qdisc pfifo_fast 0: parent :4 bands 3 priomap 1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
    qdisc pfifo_fast 0: parent :3 bands 3 priomap 1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
    qdisc pfifo_fast 0: parent :2 bands 3 priomap 1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
    qdisc pfifo_fast 0: parent :1 bands 3 priomap 1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1

    Fixes: 59cc1f61f ("net: sched: convert qdisc linked list to hashtable")
    Reported-by: Daniel Borkmann
    Tested-by: Daniel Borkmann
    Signed-off-by: Jiri Kosina
    Signed-off-by: David S. Miller

    Jiri Kosina
     
  • qdisc_match_from_root() is now iterating over per-netdevice qdisc
    hashtable instead of going through a linked-list of qdiscs (independently
    on the actual underlying netdev), which was the case before the switch to
    hashtable for qdiscs.

    For singleton qdiscs, there is no underlying netdev associated though, and
    therefore dumping a singleton qdisc will panic, as qdisc_dev(root) will
    always be NULL.

    BUG: unable to handle kernel NULL pointer dereference at 0000000000000410
    IP: [] qdisc_match_from_root+0x2c/0x70
    PGD 1aceba067 PUD 1aceb7067 PMD 0
    Oops: 0000 [#1] PREEMPT SMP
    [ ... ]
    task: ffff8801ec996e00 task.stack: ffff8801ec934000
    RIP: 0010:[] [] qdisc_match_from_root+0x2c/0x70
    RSP: 0018:ffff8801ec937ab0 EFLAGS: 00010203
    RAX: 0000000000000408 RBX: ffff88025e612000 RCX: ffffffffffffffd8
    RDX: 0000000000000000 RSI: 00000000ffff0000 RDI: ffffffff81cf8100
    RBP: ffff8801ec937ab0 R08: 000000000001c160 R09: ffff8802668032c0
    R10: ffffffff81cf8100 R11: 0000000000000030 R12: 00000000ffff0000
    R13: ffff88025e612000 R14: ffffffff81cf3140 R15: 0000000000000000
    FS: 00007f24b9af6740(0000) GS:ffff88026f280000(0000) knlGS:0000000000000000
    CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: 0000000000000410 CR3: 00000001aceec000 CR4: 00000000001406e0
    Stack:
    ffff8801ec937ad0 ffffffff81681210 ffff88025dd51a00 00000000fffffff1
    ffff8801ec937b88 ffffffff81681e4e ffffffff81c42bc0 ffff880262431500
    ffffffff81cf3140 ffff88025dd51a10 ffff88025dd51a24 00000000ec937b38
    Call Trace:
    [] qdisc_lookup+0x40/0x50
    [] tc_modify_qdisc+0x21e/0x550
    [] rtnetlink_rcv_msg+0x95/0x220
    [] ? __kmalloc_track_caller+0x172/0x230
    [] ? rtnl_newlink+0x870/0x870
    [] netlink_rcv_skb+0xa7/0xc0
    [] rtnetlink_rcv+0x28/0x30
    [] netlink_unicast+0x15b/0x210
    [] netlink_sendmsg+0x319/0x390
    [] sock_sendmsg+0x38/0x50
    [] ___sys_sendmsg+0x256/0x260
    [] ? __pagevec_lru_add_fn+0x135/0x280
    [] ? pagevec_lru_move_fn+0xd0/0xf0
    [] ? trace_event_raw_event_mm_lru_insertion+0x180/0x180
    [] ? __lru_cache_add+0x75/0xb0
    [] ? _raw_spin_unlock+0x16/0x40
    [] ? handle_mm_fault+0x39f/0x1160
    [] __sys_sendmsg+0x45/0x80
    [] SyS_sendmsg+0x12/0x20
    [] do_syscall_64+0x57/0xb0

    Fix this by special-casing singleton qdiscs (those that don't have
    underlying netdevice) and introduce immediate handling of those rather
    than trying to go over an underlying netdevice. We're in the same
    situation in tc_dump_qdisc_root() and tc_dump_tclass_root().

    Ultimately, this will have to be slightly reworked so that we are actually
    able to show singleton qdiscs (noop) in the dump properly; but we're not
    currently doing that anyway, so no regression there, and better do this in
    a gradual manner.

    Fixes: 59cc1f61f ("net: sched: convert qdisc linked list to hashtable")
    Reported-by: Daniel Borkmann
    Tested-by: Daniel Borkmann
    Reported-by: David Ahern
    Tested-by: David Ahern
    Signed-off-by: Jiri Kosina
    Signed-off-by: David S. Miller

    Jiri Kosina
     

11 Aug, 2016

1 commit

  • Convert the per-device linked list into a hashtable. The primary
    motivation for this change is that currently, we're not tracking all the
    qdiscs in hierarchy (e.g. excluding default qdiscs), as the lookup
    performed over the linked list by qdisc_match_from_root() is rather
    expensive.

    The ultimate goal is to get rid of hidden qdiscs completely, which will
    bring much more determinism in user experience.

    Reviewed-by: Cong Wang
    Signed-off-by: Jiri Kosina
    Signed-off-by: David S. Miller

    Jiri Kosina
     

13 Jun, 2016

1 commit

  • sch_atm returns this when TC_ACT_SHOT classification occurs.

    But all other schedulers that use tc_classify
    (htb, hfsc, drr, fq_codel ...) return NET_XMIT_SUCCESS | __BYPASS
    in this case so just do that in atm.

    BATMAN uses it as an intermediate return value to signal
    forwarding vs. buffering, but it did not return POLICED to
    callers outside of BATMAN.

    Reviewed-by: Sven Eckelmann
    Signed-off-by: Florian Westphal
    Signed-off-by: David S. Miller

    Florian Westphal
     

11 Jun, 2016

1 commit

  • __QDISC_STATE_THROTTLED bit manipulation is rather expensive
    for HTB and few others.

    I already removed it for sch_fq in commit f2600cf02b5b
    ("net: sched: avoid costly atomic operation in fq_dequeue()")
    and so far nobody complained.

    When one ore more packets are stuck in one or more throttled
    HTB class, a htb dequeue() performs two atomic operations
    to clear/set __QDISC_STATE_THROTTLED bit, while root qdisc
    lock is held.

    Removing this pair of atomic operations bring me a 8 % performance
    increase on 200 TCP_RR tests, in presence of throttled classes.

    This patch has no side effect, since nothing actually uses
    disc_is_throttled() anymore.

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

08 Jun, 2016

1 commit

  • Large tc dumps (tc -s {qdisc|class} sh dev ethX) done by Google BwE host
    agent [1] are problematic at scale :

    For each qdisc/class found in the dump, we currently lock the root qdisc
    spinlock in order to get stats. Sampling stats every 5 seconds from
    thousands of HTB classes is a challenge when the root qdisc spinlock is
    under high pressure. Not only the dumps take time, they also slow
    down the fast path (queue/dequeue packets) by 10 % to 20 % in some cases.

    An audit of existing qdiscs showed that sch_fq_codel is the only qdisc
    that might need the qdisc lock in fq_codel_dump_stats() and
    fq_codel_dump_class_stats()

    In v2 of this patch, I now use the Qdisc running seqcount to provide
    consistent reads of packets/bytes counters, regardless of 32/64 bit arches.

    I also changed rate estimators to use the same infrastructure
    so that they no longer need to lock root qdisc lock.

    [1]
    http://static.googleusercontent.com/media/research.google.com/en//pubs/archive/43838.pdf

    Signed-off-by: Eric Dumazet
    Cc: Cong Wang
    Cc: Jamal Hadi Salim
    Cc: John Fastabend
    Cc: Kevin Athey
    Cc: Xiaotian Pei
    Signed-off-by: David S. Miller

    Eric Dumazet
     

25 May, 2016

1 commit

  • I found a serious performance bug in packet schedulers using hrtimers.

    sch_htb and sch_fq are definitely impacted by this problem.

    We constantly rearm high resolution timers if some packets are throttled
    in one (or more) class, and other packets are flying through qdisc on
    another (non throttled) class.

    hrtimer_start() does not have the mod_timer() trick of doing nothing if
    expires value does not change :

    if (timer_pending(timer) &&
    timer->expires == expires)
    return 1;

    This issue is particularly visible when multiple cpus can queue/dequeue
    packets on the same qdisc, as hrtimer code has to lock a remote base.

    I used following fix :

    1) Change htb to use qdisc_watchdog_schedule_ns() instead of open-coding
    it.

    2) Cache watchdog prior expiration. hrtimer might provide this, but I
    prefer to not rely on some hrtimer internal.

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet