07 Dec, 2019

1 commit

  • When a user runs a command like
    tc qdisc add dev eth1 root mqprio
    a KASAN stack-out-of-bounds warning is emitted.
    Currently, the NLA_ALIGN macro used in mqprio_dump passes too large
    a buffer size as the argument to nla_put and to memcpy down the call stack.
    The flow looks like this:
    1. nla_put expects exact object size as an argument;
    2. Later it provides this size to memcpy;
    3. To calculate correct padding for SKB, nla_put applies NLA_ALIGN
    macro itself.

    Therefore, NLA_ALIGN should not be applied to the nla_put parameter.
    Otherwise it will lead to out-of-bounds memory access in memcpy.
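
    The flow above can be sketched in userspace C. This is only an
    illustration: sketch_nla_put() is a simplified stand-in for nla_put(),
    and struct sketch_opt is a hypothetical 6-byte payload chosen so that
    its size is not a multiple of 4. Only the rounding rule mirrors
    NLA_ALIGN.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Same rounding rule as NLA_ALIGN: round up to a multiple of 4 bytes. */
#define SKETCH_NLA_ALIGNTO 4
#define SKETCH_NLA_ALIGN(len) \
	(((len) + SKETCH_NLA_ALIGNTO - 1) & ~(size_t)(SKETCH_NLA_ALIGNTO - 1))

/* Hypothetical payload whose size (6) is deliberately not 4-byte aligned. */
struct sketch_opt {
	unsigned char bytes[6];
};

/*
 * Simplified stand-in for nla_put(): reads exactly attrlen bytes from data,
 * adds the alignment padding itself, and returns the space used in dst.
 */
static size_t sketch_nla_put(unsigned char *dst, size_t dst_size,
			     const void *data, size_t attrlen)
{
	size_t padded = SKETCH_NLA_ALIGN(attrlen);

	assert(padded <= dst_size);
	memcpy(dst, data, attrlen);                  /* reads attrlen bytes */
	memset(dst + attrlen, 0, padded - attrlen);  /* padding is added here */
	return padded;
}

/* Bytes that would be read past the object if the caller pre-aligned. */
static size_t sketch_overread(size_t object_size)
{
	return SKETCH_NLA_ALIGN(object_size) - object_size;
}
```

    Passing sizeof(struct sketch_opt) keeps the copy inside the object and
    the padding is still applied; passing the pre-aligned size would make
    the copy read sketch_overread() bytes beyond it.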

    Fixes: 4e8b86c06269 ("mqprio: Introduce new hardware offload mode and shaper in mqprio")
    Signed-off-by: Vladyslav Tarasiuk
    Signed-off-by: David S. Miller

    Vladyslav Tarasiuk
     

04 Dec, 2019

1 commit


01 Dec, 2019

1 commit

  • When a classful qdisc's child qdisc has set the flag
    TCQ_F_CPUSTATS (pfifo_fast for example), the child qdisc's
    cpu_bstats should be passed to gnet_stats_copy_basic(),
    but many classful qdiscs didn't do that. As a result,
    `tc -s class show dev DEV` always returns 0 for bytes and
    packets in this case.

    Pass the child qdisc's cpu_bstats to gnet_stats_copy_basic()
    to fix this issue.

    The qstats also has this problem, but it has been fixed
    in 5dd431b6b9 ("net: sched: introduce and use qstats read...")
    and bstats still remains buggy.

    Fixes: 22e0f8b9322c ("net: sched: make bstats per cpu and estimator RCU safe")
    Signed-off-by: Dust Li
    Signed-off-by: Tony Lu
    Acked-by: Cong Wang
    Signed-off-by: David S. Miller

    Dust Li
     

19 Jun, 2019

1 commit

  • Based on 2 normalized pattern(s):

    this program is free software you can redistribute it and or modify
    it under the terms of the gnu general public license version 2 as
    published by the free software foundation

    this program is free software you can redistribute it and or modify
    it under the terms of the gnu general public license version 2 as
    published by the free software foundation #

    extracted by the scancode license scanner the SPDX license identifier

    GPL-2.0-only

    has been chosen to replace the boilerplate/reference in 4122 file(s).

    Signed-off-by: Thomas Gleixner
    Reviewed-by: Enrico Weigelt
    Reviewed-by: Kate Stewart
    Reviewed-by: Allison Randal
    Cc: linux-spdx@vger.kernel.org
    Link: https://lkml.kernel.org/r/20190604081206.933168790@linutronix.de
    Signed-off-by: Greg Kroah-Hartman

    Thomas Gleixner
     

28 Apr, 2019

2 commits

  • We currently have two levels of strict validation:

    1) liberal (default)
    - undefined (type >= max) & NLA_UNSPEC attributes accepted
    - attribute length >= expected accepted
    - garbage at end of message accepted
    2) strict (opt-in)
    - NLA_UNSPEC attributes accepted
    - attribute length >= expected accepted

    Split out parsing strictness into four different options:
    * TRAILING - check that there's no trailing data after parsing
    attributes (in message or nested)
    * MAXTYPE - reject attrs > max known type
    * UNSPEC - reject attributes with NLA_UNSPEC policy entries
    * STRICT_ATTRS - strictly validate attribute size

    The default for future things should be *everything*.
    The current *_strict() is a combination of TRAILING and MAXTYPE,
    and is renamed to _deprecated_strict().
    The current regular parsing has none of this, and is renamed to
    *_parse_deprecated().
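
    The four options behave like independent bits in a mask. The sketch
    below uses hypothetical SKETCH_* names (not the identifiers from the
    patch) and models only the MAXTYPE and TRAILING checks, to show how
    the liberal, deprecated-strict and fully strict modes differ.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical flag names, one bit per strictness option. */
enum sketch_validate_flags {
	SKETCH_VALIDATE_TRAILING     = 1 << 0,
	SKETCH_VALIDATE_MAXTYPE      = 1 << 1,
	SKETCH_VALIDATE_UNSPEC       = 1 << 2,
	SKETCH_VALIDATE_STRICT_ATTRS = 1 << 3,
};

/* Old regular parsing: none of the checks. */
#define SKETCH_VALIDATE_LIBERAL 0
/* Old *_strict(): TRAILING and MAXTYPE only. */
#define SKETCH_VALIDATE_DEPRECATED_STRICT \
	(SKETCH_VALIDATE_TRAILING | SKETCH_VALIDATE_MAXTYPE)
/* Default for new code: everything. */
#define SKETCH_VALIDATE_STRICT \
	(SKETCH_VALIDATE_TRAILING | SKETCH_VALIDATE_MAXTYPE | \
	 SKETCH_VALIDATE_UNSPEC | SKETCH_VALIDATE_STRICT_ATTRS)

/* MAXTYPE: attributes above the highest known type are rejected. */
static bool sketch_accepts_type(int validate, int type, int maxtype)
{
	assert(maxtype >= 0);
	if (type <= maxtype)
		return true;
	return !(validate & SKETCH_VALIDATE_MAXTYPE);
}

/* TRAILING: leftover bytes after the last attribute are rejected. */
static bool sketch_accepts_trailing(int validate, size_t remaining)
{
	return remaining == 0 || !(validate & SKETCH_VALIDATE_TRAILING);
}
```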

    Additionally it allows us to selectively set one of the new flags
    even on old policies. Notably, the UNSPEC flag could be useful in
    this case, since it can be arranged (by filling in the policy) to
    not be an incompatible userspace ABI change, but would then going
    forward prevent forgetting attribute entries. Similar can apply
    to the POLICY flag.

    We end up with the following renames:
    * nla_parse -> nla_parse_deprecated
    * nla_parse_strict -> nla_parse_deprecated_strict
    * nlmsg_parse -> nlmsg_parse_deprecated
    * nlmsg_parse_strict -> nlmsg_parse_deprecated_strict
    * nla_parse_nested -> nla_parse_nested_deprecated
    * nla_validate_nested -> nla_validate_nested_deprecated

    Using spatch, of course:
    @@
    expression TB, MAX, HEAD, LEN, POL, EXT;
    @@
    -nla_parse(TB, MAX, HEAD, LEN, POL, EXT)
    +nla_parse_deprecated(TB, MAX, HEAD, LEN, POL, EXT)

    @@
    expression NLH, HDRLEN, TB, MAX, POL, EXT;
    @@
    -nlmsg_parse(NLH, HDRLEN, TB, MAX, POL, EXT)
    +nlmsg_parse_deprecated(NLH, HDRLEN, TB, MAX, POL, EXT)

    @@
    expression NLH, HDRLEN, TB, MAX, POL, EXT;
    @@
    -nlmsg_parse_strict(NLH, HDRLEN, TB, MAX, POL, EXT)
    +nlmsg_parse_deprecated_strict(NLH, HDRLEN, TB, MAX, POL, EXT)

    @@
    expression TB, MAX, NLA, POL, EXT;
    @@
    -nla_parse_nested(TB, MAX, NLA, POL, EXT)
    +nla_parse_nested_deprecated(TB, MAX, NLA, POL, EXT)

    @@
    expression START, MAX, POL, EXT;
    @@
    -nla_validate_nested(START, MAX, POL, EXT)
    +nla_validate_nested_deprecated(START, MAX, POL, EXT)

    @@
    expression NLH, HDRLEN, MAX, POL, EXT;
    @@
    -nlmsg_validate(NLH, HDRLEN, MAX, POL, EXT)
    +nlmsg_validate_deprecated(NLH, HDRLEN, MAX, POL, EXT)

    For this patch, don't actually add the strict, non-renamed versions
    yet so that it breaks compile if I get it wrong.

    Also, while at it, make nla_validate and nla_parse go down to a
    common __nla_validate_parse() function to avoid code duplication.

    Ultimately, this allows us to have very strict validation for every
    new caller of nla_parse()/nlmsg_parse() etc as re-introduced in the
    next patch, while existing things will continue to work as is.

    In effect then, this adds fully strict validation for any new command.

    Signed-off-by: Johannes Berg
    Signed-off-by: David S. Miller

    Johannes Berg
     
  • Even if the NLA_F_NESTED flag was introduced more than 11 years ago, most
    netlink based interfaces (including recently added ones) are still not
    setting it in kernel generated messages. Without the flag, message parsers
    not aware of attribute semantics (e.g. wireshark dissector or libmnl's
    mnl_nlmsg_fprintf()) cannot recognize nested attributes and won't display
    the structure of their contents.

    Unfortunately we cannot just add the flag everywhere as there may be
    userspace applications which check nlattr::nla_type directly rather than
    through a helper masking out the flags. Therefore the patch renames
    nla_nest_start() to nla_nest_start_noflag() and introduces nla_nest_start()
    as a wrapper adding NLA_F_NESTED. The calls which add NLA_F_NESTED manually
    are rewritten to use nla_nest_start().

    Except for changes in include/net/netlink.h, the patch was generated using
    this semantic patch:

    @@ expression E1, E2; @@
    -nla_nest_start(E1, E2)
    +nla_nest_start_noflag(E1, E2)

    @@ expression E1, E2; @@
    -nla_nest_start_noflag(E1, E2 | NLA_F_NESTED)
    +nla_nest_start(E1, E2)
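
    The userspace compatibility concern described above can be
    illustrated with a small sketch. The flag and mask values follow the
    netlink UAPI definitions; the struct and the sketch_* helpers are
    hypothetical stand-ins, not the kernel functions themselves.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Flag bits carried in the upper bits of nla_type. */
#define SKETCH_NLA_F_NESTED        (1 << 15)
#define SKETCH_NLA_F_NET_BYTEORDER (1 << 14)
#define SKETCH_NLA_TYPE_MASK \
	(~(SKETCH_NLA_F_NESTED | SKETCH_NLA_F_NET_BYTEORDER))

struct sketch_nlattr {
	uint16_t nla_len;
	uint16_t nla_type;
};

/* Type field as written by the old behaviour (no flag). */
static uint16_t sketch_nest_type_noflag(int attrtype)
{
	assert(attrtype >= 0 && attrtype < SKETCH_NLA_F_NET_BYTEORDER);
	return (uint16_t)attrtype;
}

/* Type field as written by the new wrapper (flag added). */
static uint16_t sketch_nest_type(int attrtype)
{
	assert(attrtype >= 0 && attrtype < SKETCH_NLA_F_NET_BYTEORDER);
	return (uint16_t)(attrtype | SKETCH_NLA_F_NESTED);
}

/* Fragile userspace check: compares the raw field. */
static bool sketch_match_raw(const struct sketch_nlattr *a, int want)
{
	return a->nla_type == want;
}

/* Robust userspace check: masks out the flag bits first. */
static bool sketch_match_masked(const struct sketch_nlattr *a, int want)
{
	return (a->nla_type & SKETCH_NLA_TYPE_MASK) == want;
}
```

    A parser using the raw comparison stops recognizing the attribute as
    soon as the flag is set, which is why existing callers were moved to
    the _noflag variant instead of silently gaining the flag.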

    Signed-off-by: Michal Kubecek
    Acked-by: Jiri Pirko
    Acked-by: David Ahern
    Signed-off-by: David S. Miller

    Michal Kubecek
     

02 Apr, 2019

1 commit

  • Classful qdiscs can't access directly the child qdiscs backlog
    length: if such qdisc is NOLOCK, per CPU values should be
    accounted instead.

    Most qdiscs do not respect the above. As a result, qstats fetching
    for most classful qdisc is currently incorrect: if the child qdisc is
    NOLOCK, it always reports 0 len backlog.

    This change introduces a pair of helpers to safely fetch
    both backlog and qlen and use them in stats class dumping
    functions, fixing the above issue and cleaning a bit the code.

    DRR needs also to access the child qdisc queue length, so it
    needs custom handling.

    Fixes: c5ad119fb6c0 ("net: sched: pfifo_fast use skb_array")
    Signed-off-by: Paolo Abeni
    Signed-off-by: David S. Miller

    Paolo Abeni
     

26 Sep, 2018

1 commit

  • The current implementation of qdisc_destroy() decrements the Qdisc
    reference counter and only actually destroys the Qdisc if the counter
    reaches zero. Rename qdisc_destroy() to qdisc_put() so that the name better
    describes how this function is currently implemented and used.

    Extract code that deallocates Qdisc into new private qdisc_destroy()
    function. It is intended to be shared between regular qdisc_put() and its
    unlocked version that is introduced in next patch in this series.
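
    A minimal sketch of the resulting split, assuming a plain
    (non-atomic) counter and a hypothetical struct sketch_qdisc; the
    real code uses the kernel's reference counting primitives and frees
    memory rather than setting a flag.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for struct Qdisc. */
struct sketch_qdisc {
	int refcnt;
	bool destroyed;
};

/* Private teardown step; plays the role of the new qdisc_destroy(). */
static void sketch_qdisc_destroy(struct sketch_qdisc *q)
{
	q->destroyed = true;
}

/* Drop one reference; tear down only when the last one goes away. */
static void sketch_qdisc_put(struct sketch_qdisc *q)
{
	assert(q->refcnt > 0);
	if (--q->refcnt > 0)
		return;
	sketch_qdisc_destroy(q);
}
```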

    Signed-off-by: Vlad Buslov
    Acked-by: Jiri Pirko
    Signed-off-by: David S. Miller

    Vlad Buslov
     

22 Dec, 2017

3 commits

  • This patch adds extack support for the function qdisc_create_dflt, which
    is a commonly used function in the tc subsystem. Callers that are
    interested in the error can pass extack to get more detailed information
    about why qdisc_create_dflt failed. The function qdisc_create_dflt
    also calls an init callback, which can fail in any per-qdisc specific
    handling.

    Cc: David Ahern
    Acked-by: Jamal Hadi Salim
    Signed-off-by: Alexander Aring
    Signed-off-by: David S. Miller

    Alexander Aring
     
  • This patch adds extack support for graft callback to prepare per-qdisc
    specific changes for extack.

    Cc: David Ahern
    Acked-by: Jamal Hadi Salim
    Signed-off-by: Alexander Aring
    Signed-off-by: David S. Miller

    Alexander Aring
     
  • This patch adds extack support for init callback to prepare per-qdisc
    specific changes for extack.

    Cc: David Ahern
    Acked-by: Jamal Hadi Salim
    Signed-off-by: Alexander Aring
    Signed-off-by: David S. Miller

    Alexander Aring
     

09 Dec, 2017

1 commit

  • The sch_mqprio qdisc creates a sub-qdisc per tx queue which are then
    called independently for enqueue and dequeue operations. However
    statistics are aggregated and pushed up to the "master" qdisc.

    This patch adds support for any of the sub-qdiscs to be per cpu
    statistic qdiscs. To handle this case add a check when calculating
    stats and aggregate the per cpu stats if needed.
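
    The check can be sketched as follows. All names are hypothetical:
    the percpu field stands in for the per cpu statistics flag on the
    sub-qdisc, and a fixed-size array replaces real per-CPU storage.

```c
#include <assert.h>
#include <stdbool.h>

#define SKETCH_NR_CPUS 4

struct sketch_bstats {
	unsigned long long bytes;
	unsigned long long packets;
};

/* Hypothetical sub-qdisc with either a single counter or per-CPU ones. */
struct sketch_child {
	bool percpu;
	struct sketch_bstats bstats;
	struct sketch_bstats cpu_bstats[SKETCH_NR_CPUS];
};

/* Fold one sub-qdisc's counters into the "master" totals. */
static void sketch_add_child(struct sketch_bstats *total,
			     const struct sketch_child *child)
{
	int cpu;

	assert(total && child);
	if (!child->percpu) {
		total->bytes += child->bstats.bytes;
		total->packets += child->bstats.packets;
		return;
	}
	/* Per-CPU case: the single counter is unused, sum every CPU. */
	for (cpu = 0; cpu < SKETCH_NR_CPUS; cpu++) {
		total->bytes += child->cpu_bstats[cpu].bytes;
		total->packets += child->cpu_bstats[cpu].packets;
	}
}
```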

    Signed-off-by: John Fastabend
    Signed-off-by: David S. Miller

    John Fastabend
     

08 Nov, 2017

1 commit


28 Oct, 2017

1 commit

  • When replacing a child qdisc from mqprio, tc_modify_qdisc() must fetch
    the netdev_queue pointer that the current child qdisc is associated
    with before creating the new qdisc.

    Currently, when using mqprio as root qdisc, the kernel will end up
    getting the queue #0 pointer from the mqprio (root qdisc), which leaves
    any new child qdisc with a possibly wrong netdev_queue pointer.

    Implementing the Qdisc_class_ops select_queue() on mqprio fixes this
    issue and avoids an inconsistent state when child qdiscs are replaced.

    Signed-off-by: Jesus Sanchez-Palencia
    Tested-by: Henrik Austad
    Signed-off-by: Jeff Kirsher

    Jesus Sanchez-Palencia
     

19 Oct, 2017

1 commit

  • The pointer opt has a null check; however, before this check opt is
    dereferenced when len is initialized, hence we potentially have a null
    pointer dereference on opt. Avoid this by checking for a null opt before
    dereferencing it.
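
    The corrected ordering looks roughly like this. The struct, the
    helper name and the choice of -EINVAL as the error value are
    assumptions made for the sketch, not taken from the patch.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for a netlink attribute header. */
struct sketch_nlattr {
	unsigned short nla_len;
	unsigned short nla_type;
};

/*
 * Compute the remaining payload length only after opt is known to be
 * non-NULL. Before the fix, the subtraction ran in the initializer of
 * len, ahead of the NULL check.
 */
static int sketch_payload_len(const struct sketch_nlattr *opt, int hdrlen)
{
	int len;

	assert(hdrlen >= 0);
	if (!opt)
		return -EINVAL;

	len = opt->nla_len - hdrlen;
	return len;
}
```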

    Detected by CoverityScan, CID#1458234 ("Dereference before null check")

    Fixes: 4e8b86c06269 ("mqprio: Introduce new hardware offload mode and shaper in mqprio")
    Signed-off-by: Colin Ian King
    Signed-off-by: David S. Miller

    Colin Ian King
     

17 Oct, 2017

1 commit

  • This patch makes a slight tweak to mqprio in order to bring the
    classid values used back in line with what is used for mq. The general idea
    is to reserve values :ffe0 - :ffef to identify hardware traffic classes
    normally reported via dev->num_tc. By doing this we can maintain a
    consistent behavior with mq for classid where :1 - :ffdf will represent a
    physical qdisc mapped onto a Tx queue represented by classid - 1, and the
    traffic classes will be mapped onto a known subset of classid values
    reserved for our virtual qdiscs.

    Note I reserved the range from :fff0 - :ffff since this way we might be
    able to reuse these classid values with clsact and ingress which would mean
    that for mq, mqprio, ingress, and clsact we should be able to maintain a
    similar classid layout.
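
    The layout can be summarized by a small classifier over the minor
    part of the classid. The enum and helper names are hypothetical, and
    mapping a traffic class to (minor - 0xffe0) is an assumption of this
    sketch; the ranges themselves are the ones given above.

```c
#include <assert.h>

enum sketch_class_kind {
	SKETCH_CLASS_NONE,      /* :0, not a class */
	SKETCH_CLASS_TXQ,       /* :1    - :ffdf, qdisc on a Tx queue */
	SKETCH_CLASS_TC,        /* :ffe0 - :ffef, hardware traffic class */
	SKETCH_CLASS_RESERVED,  /* :fff0 - :ffff, kept for clsact/ingress */
};

static enum sketch_class_kind sketch_classify(unsigned int minor)
{
	assert(minor <= 0xffff);
	if (minor == 0)
		return SKETCH_CLASS_NONE;
	if (minor <= 0xffdf)
		return SKETCH_CLASS_TXQ;
	if (minor <= 0xffef)
		return SKETCH_CLASS_TC;
	return SKETCH_CLASS_RESERVED;
}

/* Tx queue index for a physical class, or -1 otherwise. */
static int sketch_txq_index(unsigned int minor)
{
	if (sketch_classify(minor) != SKETCH_CLASS_TXQ)
		return -1;
	return (int)minor - 1;
}

/* Traffic class index for a virtual class, or -1 otherwise. */
static int sketch_tc_index(unsigned int minor)
{
	if (sketch_classify(minor) != SKETCH_CLASS_TC)
		return -1;
	return (int)minor - 0xffe0;
}
```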

    Signed-off-by: Alexander Duyck
    Tested-by: Jesus Sanchez-Palencia
    Signed-off-by: David S. Miller

    Alexander Duyck
     

14 Oct, 2017

1 commit

  • The offload types currently supported in mqprio are 0 (no offload) and
    1 (offload only TCs) by setting these values for the 'hw' option. If
    offloads are supported by setting the 'hw' option to 1, the default
    offload mode is 'dcb' where only the TC values are offloaded to the
    device. This patch introduces a new hardware offload mode called
    'channel' with 'hw' set to 1 in mqprio which makes full use of the
    mqprio options, the TCs, the queue configurations and the QoS parameters
    for the TCs. This is achieved through a new netlink attribute for the
    'mode' option which takes values such as 'dcb' (default) and 'channel'.
    The 'channel' mode also supports QoS attributes for traffic class such as
    minimum and maximum values for bandwidth rate limits.

    This patch enables configuring additional HW shaper attributes associated
    with a traffic class. Currently the shaper for bandwidth rate limiting is
    supported which takes options such as minimum and maximum bandwidth rates
    and are offloaded to the hardware in the 'channel' mode. The min and max
    limits for bandwidth rates are provided by the user along with the TCs
    and the queue configurations when creating the mqprio qdisc. The interface
    can be extended to support new HW shapers in future through the 'shaper'
    attribute.

    Introduces a new data structure 'tc_mqprio_qopt_offload' for offloading
    mqprio queue options and use this to be shared between the kernel and
    device driver. This contains a copy of the existing data structure
    for mqprio queue options. This new data structure can be extended when
    adding new attributes for traffic class such as mode, shaper, shaper
    parameters (bandwidth rate limits). The existing data structure for mqprio
    queue options will be shared between the kernel and userspace.

    Example:
    queues 4@0 4@4 hw 1 mode channel shaper bw_rlimit\
    min_rate 1Gbit 2Gbit max_rate 4Gbit 5Gbit

    To dump the bandwidth rates:

    qdisc mqprio 804a: root tc 2 map 0 0 0 0 1 1 1 1 0 0 0 0 0 0 0 0
    queues:(0:3) (4:7)
    mode:channel
    shaper:bw_rlimit min_rate:1Gbit 2Gbit max_rate:4Gbit 5Gbit

    Signed-off-by: Amritha Nambiar
    Tested-by: Andrew Bowers
    Signed-off-by: Jeff Kirsher

    Amritha Nambiar
     

26 Aug, 2017

1 commit

  • For TC classes, their ->get() and ->put() are always paired, and the
    reference counting is completely useless, because:

    1) For class modification and dumping paths, we already hold RTNL lock,
    so all of these ->get(),->change(),->put() are atomic.

    2) For filter binding/unbinding, we use other reference counter than
    this one, and they should have RTNL lock too.

    3) For ->qlen_notify(), it is special because it is called on ->enqueue()
    path, but we already hold qdisc tree lock there, and we hold this
    tree lock when graft or delete the class too, so it should not be gone
    or changed until we release the tree lock.

    Therefore, this patch removes ->get() and ->put(), but:

    1) Adds a new ->find() to find the pointer to a class by classid, no
    refcnt.

    2) Move the original class destroy upon the last refcnt into ->delete(),
    right after releasing tree lock. This is fine because the class is
    already removed from hash when holding the lock.

    For those who also use ->put() as ->unbind(), just rename them to reflect
    this change.

    Cc: Jamal Hadi Salim
    Signed-off-by: Cong Wang
    Acked-by: Jiri Pirko
    Acked-by: Jamal Hadi Salim
    Signed-off-by: David S. Miller

    WANG Cong
     

08 Aug, 2017

3 commits


08 Jun, 2017

1 commit

  • We need to push the chain index down to the drivers, so they have the
    information to which chain the rule belongs. For now, no driver supports
    multichain offload, so only chain 0 is supported. This is needed to
    prevent chain squashes during offload for now. Later this will be used
    to implement multichain offload.

    Signed-off-by: Jiri Pirko
    Signed-off-by: David S. Miller

    Jiri Pirko
     

16 Mar, 2017

2 commits

  • The configurable priority to traffic class mapping and the user specified
    queue ranges are used to configure the traffic class, overriding the
    hardware defaults when the 'hw' option is set to 0. However, when the 'hw'
    option is non-zero, the hardware QOS defaults are used.

    This patch makes it so that we can pass the data the user provided to
    ndo_setup_tc. This allows us to pull in the queue configuration if the
    user requested it as well as any additional hardware offload type
    requested by using a value other than 1 for the hw value.

    Finally it also provides a means for the device driver to return the level
    supported for the offload type via the qopt->hw value. Previously we were
    just always assuming the value to be 1, in the future values beyond just 1
    may be supported.

    Signed-off-by: Amritha Nambiar
    Signed-off-by: Alexander Duyck
    Signed-off-by: David S. Miller

    Amritha Nambiar
     
  • This patch is meant to allow for support of multiple hardware offload
    types for a single device. There is currently no bounds checking for the
    hw member of the mqprio_qopt structure. This results in us being able to
    pass values from 1 to 255 with all being treated the same. On retrieving
    the value it is returned as 1 for anything 1 or greater being set.

    With this change we are currently adding limited bounds checking by
    defining an enum and using those values to limit the reported hardware
    offloads.
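
    One way to read "limited bounds checking" is a clamp against the
    highest value the enum defines. The enum members and the helper
    below are hypothetical names for this sketch, not the identifiers
    added by the patch.

```c
#include <assert.h>

/* Hypothetical offload types; the last entry only marks the end. */
enum sketch_hw_offload {
	SKETCH_HW_OFFLOAD_NONE,   /* hw 0: no offload */
	SKETCH_HW_OFFLOAD_TCS,    /* hw 1: offload traffic classes */
	SKETCH_HW_OFFLOAD_END
};

#define SKETCH_HW_OFFLOAD_MAX (SKETCH_HW_OFFLOAD_END - 1)

/* Limit a user-supplied hw value (0..255) to the known offload types. */
static unsigned char sketch_limit_hw(unsigned char requested)
{
	assert(SKETCH_HW_OFFLOAD_MAX >= 0);
	if (requested > SKETCH_HW_OFFLOAD_MAX)
		return SKETCH_HW_OFFLOAD_MAX;
	return requested;
}
```

    Adding a new offload type then only requires a new enum entry before
    the end marker; the limit follows automatically.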

    Signed-off-by: Alexander Duyck
    Signed-off-by: David S. Miller

    Alexander Duyck
     

13 Mar, 2017

1 commit

  • The original reason [1] for having hidden qdiscs (potential scalability
    issues in qdisc_match_from_root() with single linked list in case of large
    amount of qdiscs) has been invalidated by 59cc1f61f0 ("net: sched: convert
    qdisc linked list to hashtable").

    This allows us for bringing more clarity and determinism into the dump by
    making default pfifo qdiscs visible.

    We're not turning this on by default though, as it was deemed [2] too
    intrusive / unnecessary change of default behavior towards userspace.
    Instead, TCA_DUMP_INVISIBLE netlink attribute is introduced, which allows
    applications to request complete qdisc hierarchy dump, including the
    ones that have always been implicit/invisible.

    Singleton noop_qdisc stays invisible, as teaching the whole infrastructure
    about singletons would require quite some surgery with very little gain
    (seeing no qdisc or seeing noop qdisc in the dump is probably setting
    the same user expectation).

    [1] http://lkml.kernel.org/r/1460732328.10638.74.camel@edumazet-glaptop3.roam.corp.google.com
    [2] http://lkml.kernel.org/r/20161021.105935.1907696543877061916.davem@davemloft.net

    Signed-off-by: Jiri Kosina
    Signed-off-by: David S. Miller

    Jiri Kosina
     

12 Feb, 2017

1 commit

  • Dmitry reported uses after free in qdisc code [1]

    The problem here is that ops->init() can return an error.

    qdisc_create_dflt() then calls ops->destroy(),
    while qdisc_create() does _not_ call it.

    Four qdiscs chose to call their own ops->destroy(), assuming their caller
    would not.

    This patch makes sure qdisc_create() calls ops->destroy()
    and fixes the four qdiscs to avoid double free.

    [1]
    BUG: KASAN: use-after-free in mq_destroy+0x242/0x290 net/sched/sch_mq.c:33 at addr ffff8801d415d440
    Read of size 8 by task syz-executor2/5030
    CPU: 0 PID: 5030 Comm: syz-executor2 Not tainted 4.3.5-smp-DEV #119
    Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
    0000000000000046 ffff8801b435b870 ffffffff81bbbed4 ffff8801db000400
    ffff8801d415d440 ffff8801d415dc40 ffff8801c4988510 ffff8801b435b898
    ffffffff816682b1 ffff8801b435b928 ffff8801d415d440 ffff8801c49880c0
    Call Trace:
    [] __dump_stack lib/dump_stack.c:15 [inline]
    [] dump_stack+0x6c/0x98 lib/dump_stack.c:51
    [] kasan_object_err+0x21/0x70 mm/kasan/report.c:158
    [] print_address_description mm/kasan/report.c:196 [inline]
    [] kasan_report_error+0x1b4/0x4b0 mm/kasan/report.c:285
    [] kasan_report mm/kasan/report.c:305 [inline]
    [] __asan_report_load8_noabort+0x43/0x50 mm/kasan/report.c:326
    [] mq_destroy+0x242/0x290 net/sched/sch_mq.c:33
    [] qdisc_destroy+0x12d/0x290 net/sched/sch_generic.c:953
    [] qdisc_create_dflt+0xf0/0x120 net/sched/sch_generic.c:848
    [] attach_default_qdiscs net/sched/sch_generic.c:1029 [inline]
    [] dev_activate+0x6ad/0x880 net/sched/sch_generic.c:1064
    [] __dev_open+0x221/0x320 net/core/dev.c:1403
    [] __dev_change_flags+0x15e/0x3e0 net/core/dev.c:6858
    [] dev_change_flags+0x8e/0x140 net/core/dev.c:6926
    [] dev_ifsioc+0x446/0x890 net/core/dev_ioctl.c:260
    [] dev_ioctl+0x1ba/0xb80 net/core/dev_ioctl.c:546
    [] sock_do_ioctl+0x99/0xb0 net/socket.c:879
    [] sock_ioctl+0x2a0/0x390 net/socket.c:958
    [] vfs_ioctl fs/ioctl.c:44 [inline]
    [] do_vfs_ioctl+0x8a8/0xe50 fs/ioctl.c:611
    [] SYSC_ioctl fs/ioctl.c:626 [inline]
    [] SyS_ioctl+0x94/0xc0 fs/ioctl.c:617
    [] entry_SYSCALL_64_fastpath+0x12/0x17

    Signed-off-by: Eric Dumazet
    Reported-by: Dmitry Vyukov
    Signed-off-by: David S. Miller

    Eric Dumazet
     

11 Aug, 2016

1 commit

  • Convert the per-device linked list into a hashtable. The primary
    motivation for this change is that currently, we're not tracking all the
    qdiscs in hierarchy (e.g. excluding default qdiscs), as the lookup
    performed over the linked list by qdisc_match_from_root() is rather
    expensive.

    The ultimate goal is to get rid of hidden qdiscs completely, which will
    bring much more determinism in user experience.

    Reviewed-by: Cong Wang
    Signed-off-by: Jiri Kosina
    Signed-off-by: David S. Miller

    Jiri Kosina
     

08 Jun, 2016

1 commit

  • Large tc dumps (tc -s {qdisc|class} sh dev ethX) done by Google BwE host
    agent [1] are problematic at scale :

    For each qdisc/class found in the dump, we currently lock the root qdisc
    spinlock in order to get stats. Sampling stats every 5 seconds from
    thousands of HTB classes is a challenge when the root qdisc spinlock is
    under high pressure. Not only the dumps take time, they also slow
    down the fast path (queue/dequeue packets) by 10 % to 20 % in some cases.

    An audit of existing qdiscs showed that sch_fq_codel is the only qdisc
    that might need the qdisc lock in fq_codel_dump_stats() and
    fq_codel_dump_class_stats()

    In v2 of this patch, I now use the Qdisc running seqcount to provide
    consistent reads of packets/bytes counters, regardless of 32/64 bit arches.

    I also changed rate estimators to use the same infrastructure
    so that they no longer need to lock root qdisc lock.

    [1]
    http://static.googleusercontent.com/media/research.google.com/en//pubs/archive/43838.pdf

    Signed-off-by: Eric Dumazet
    Cc: Cong Wang
    Cc: Jamal Hadi Salim
    Cc: John Fastabend
    Cc: Kevin Athey
    Cc: Xiaotian Pei
    Signed-off-by: David S. Miller

    Eric Dumazet
     

04 Mar, 2016

1 commit


02 Mar, 2016

1 commit

  • CC [M] net/sched/sch_mqprio.o
    net/sched/sch_mqprio.c: In function 'mqprio_init':
    net/sched/sch_mqprio.c:145: error: unknown field 'tc' specified in initializer
    net/sched/sch_mqprio.c:145: warning: missing braces around initializer
    net/sched/sch_mqprio.c:145: warning: (near initialization for 'tc.')
    make[2]: *** [net/sched/sch_mqprio.o] Error 1
    make[1]: *** [net/sched] Error 2
    make: *** [net] Error 2

    Several people reported this, surround the unnamed union
    member initialization with braces to fix.

    Signed-off-by: David S. Miller

    David S. Miller
     

17 Feb, 2016

2 commits

  • This patch updates setup_tc so we can pass additional parameters into
    the ndo op in a generic way. To do this we provide structured union
    and type flag.

    This lets each classifier and qdisc provide its own set of attributes
    without having to add new ndo ops or grow the signature of the
    callback.

    Signed-off-by: John Fastabend
    Acked-by: Jiri Pirko
    Acked-by: Jamal Hadi Salim
    Signed-off-by: David S. Miller

    John Fastabend
     
  • The ndo_setup_tc() op was added to support drivers offloading tx
    qdiscs however only support for mqprio was ever added. So we
    only ever added support for passing the number of traffic classes
    to the driver.

    This patch generalizes the ndo_setup_tc op so that a handle can
    be provided to indicate if the offload is for ingress or egress
    or potentially even child qdiscs.

    CC: Murali Karicheri
    CC: Shradha Shah
    CC: Or Gerlitz
    CC: Ariel Elior
    CC: Jeff Kirsher
    CC: Bruce Allan
    CC: Jesse Brandeburg
    CC: Don Skidmore
    Signed-off-by: John Fastabend
    Acked-by: Jiri Pirko
    Acked-by: Jamal Hadi Salim
    Signed-off-by: David S. Miller

    John Fastabend
     

04 Dec, 2015

1 commit

  • qdisc_tree_decrease_qlen() suffers from two problems on multiqueue
    devices.

    One problem is that it updates sch->q.qlen and sch->qstats.drops
    on the mq/mqprio root qdisc, while it should not: Daniele
    reported underflow errors:
    [ 681.774821] PAX: sch->q.qlen: 0 n: 1
    [ 681.774825] PAX: size overflow detected in function qdisc_tree_decrease_qlen net/sched/sch_api.c:769 cicus.693_49 min, count: 72, decl: qlen; num: 0; context: sk_buff_head;
    [ 681.774954] CPU: 2 PID: 19 Comm: ksoftirqd/2 Tainted: G O 4.2.6.201511282239-1-grsec #1
    [ 681.774955] Hardware name: ASUSTeK COMPUTER INC. X302LJ/X302LJ, BIOS X302LJ.202 03/05/2015
    [ 681.774956] ffffffffa9a04863 0000000000000000 0000000000000000 ffffffffa990ff7c
    [ 681.774959] ffffc90000d3bc38 ffffffffa95d2810 0000000000000007 ffffffffa991002b
    [ 681.774960] ffffc90000d3bc68 ffffffffa91a44f4 0000000000000001 0000000000000001
    [ 681.774962] Call Trace:
    [ 681.774967] [] dump_stack+0x4c/0x7f
    [ 681.774970] [] report_size_overflow+0x34/0x50
    [ 681.774972] [] qdisc_tree_decrease_qlen+0x152/0x160
    [ 681.774976] [] fq_codel_dequeue+0x7b1/0x820 [sch_fq_codel]
    [ 681.774978] [] ? qdisc_peek_dequeued+0xa0/0xa0 [sch_fq_codel]
    [ 681.774980] [] __qdisc_run+0x4d/0x1d0
    [ 681.774983] [] net_tx_action+0xc2/0x160
    [ 681.774985] [] __do_softirq+0xf1/0x200
    [ 681.774987] [] run_ksoftirqd+0x1e/0x30
    [ 681.774989] [] smpboot_thread_fn+0x150/0x260
    [ 681.774991] [] ? sort_range+0x40/0x40
    [ 681.774992] [] kthread+0xe4/0x100
    [ 681.774994] [] ? kthread_worker_fn+0x170/0x170
    [ 681.774995] [] ret_from_fork+0x3e/0x70

    mq/mqprio have their own ways to report qlen/drops by folding stats on
    all their queues, with appropriate locking.

    A second problem is that qdisc_tree_decrease_qlen() calls qdisc_lookup()
    without proper locking : concurrent qdisc updates could corrupt the list
    that qdisc_match_from_root() parses to find a qdisc given its handle.

    Fix the first problem by adding a TCQ_F_NOPARENT qdisc flag that
    qdisc_tree_decrease_qlen() can use to abort its tree traversal
    as soon as it meets a child of an mq/mqprio qdisc.

    Second problem can be fixed by RCU protection.
    Qdisc are already freed after RCU grace period, so qdisc_list_add() and
    qdisc_list_del() simply have to use appropriate rcu list variants.

    A future patch will add a per struct netdev_queue list anchor, so that
    qdisc_tree_decrease_qlen() can have more efficient lookups.

    Reported-by: Daniele Fucini
    Signed-off-by: Eric Dumazet
    Cc: Cong Wang
    Cc: Jamal Hadi Salim
    Signed-off-by: David S. Miller

    Eric Dumazet
     

30 Sep, 2014

3 commits

  • After previous patches to simplify qstats the qstats can be
    made per cpu with a packed union in Qdisc struct.

    Signed-off-by: John Fastabend
    Signed-off-by: David S. Miller

    John Fastabend
     
  • This removes the use of qstats->qlen variable from the classifiers
    and makes it an explicit argument to gnet_stats_copy_queue().

    The qlen represents the qdisc queue length and is packed into
    the qstats at the last moment before passing to user space. By
    handling it explicitly we avoid, in the percpu stats case, having
    to figure out which per_cpu variable to put it in.

    It would probably be best to remove it from qstats completely
    but qstats is a user space ABI and can't be broken. A future
    patch could make an internal only qstats structure that would
    avoid having to allocate an additional u32 variable on the
    Qdisc struct. This would make the qstats struct 128bits instead
    of 128+32.

    Signed-off-by: John Fastabend
    Signed-off-by: David S. Miller

    John Fastabend
     
  • In order to run qdiscs without locking, statistics and estimators
    need to be handled correctly.

    To resolve bstats, make the statistics per cpu. And because this is
    only needed for qdiscs that are running without locks, which is not
    the case for most qdiscs in the near future, only create percpu
    stats when qdiscs set the TCQ_F_CPUSTATS flag.

    Next because estimators use the bstats to calculate packets per
    second and bytes per second the estimator code paths are updated
    to use the per cpu statistics.

    Signed-off-by: John Fastabend
    Signed-off-by: David S. Miller

    John Fastabend
     

14 Sep, 2014

1 commit

  • Add __rcu notation to qdisc handling; by doing this we can make
    smatch output more legible. And anyway, some of the cases should
    be using rcu_dereference(); see qdisc_all_tx_empty(),
    qdisc_tx_changing(), and so on.

    Also *wake_queue() API is commonly called from driver timer routines
    without rcu lock or rtnl lock. So I added rcu_read_lock() blocks
    around netif_wake_subqueue and netif_tx_wake_queue.

    Signed-off-by: John Fastabend
    Acked-by: Eric Dumazet
    Signed-off-by: David S. Miller

    John Fastabend
     

10 Dec, 2013

1 commit

  • Commit 6da7c8fcbcbd ("qdisc: allow setting default queuing discipline")
    added the ability to change default qdisc from pfifo_fast to say fq

    But as most modern ethernet devices are multiqueue, we can't really
    see all the statistics from "tc -s qdisc show", as the default root
    qdisc is mq.

    This patch adds the calls to qdisc_list_add() to mq and mqprio

    Signed-off-by: Eric Dumazet
    Cc: Stephen Hemminger
    Signed-off-by: David S. Miller

    Eric Dumazet
     

31 Aug, 2013

1 commit

  • The pfifo_fast queue discipline has been used by default
    for all devices. But we have better choices now.

    This patch allows setting the default queueing discipline with sysctl.
    This allows easy use of better queueing disciplines on all devices
    without having to use tc qdisc scripts. It is intended to allow
    an easy path for distributions to make fq_codel or sfq the default
    qdisc.

    This patch also makes pfifo_fast more of a first class qdisc, since
    it is now possible to manually override the default and explicitly
    use pfifo_fast. The behavior for systems that do not use the sysctl
    is unchanged; they still get pfifo_fast.

    Also removes leftover random # in sysctl net core.

    Signed-off-by: Stephen Hemminger
    Acked-by: Eric Dumazet
    Signed-off-by: David S. Miller

    stephen hemminger
     

12 Dec, 2012

1 commit

  • With BQL being deployed, we can more likely have following behavior :

    We dequeue a packet from qdisc in dequeue_skb(), then we realize target
    tx queue is in XOFF state in sch_direct_xmit(), and we have to hold the
    skb into gso_skb for later.

    This shows in stats (tc -s qdisc dev eth0) as requeues.

    Problem of these requeues is that high priority packets can not be
    dequeued as long as this (possibly low prio and big TSO packet) is not
    removed from gso_skb.

    At 1Gbps speed, a full size TSO packet is 500 us of extra latency.

    In some cases, we know that all packets dequeued from a qdisc are
    for a particular and known txq :

    - If device is non multi queue
    - For all MQ/MQPRIO slave qdiscs

    This patch introduces a new qdisc flag, TCQ_F_ONETXQUEUE to mark
    this capability, so that dequeue_skb() is allowed to dequeue a packet
    only if the associated txq is not stopped.
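
    The effect of the flag on the dequeue decision can be sketched like
    this. The structs and the helper are hypothetical; packets are
    reduced to a counter and the txq state to a single boolean.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical Tx queue: only the stopped (XOFF) state matters here. */
struct sketch_txq {
	bool stopped;
};

/* Hypothetical qdisc feeding exactly one known txq when one_txqueue is set. */
struct sketch_sched {
	bool one_txqueue;               /* stands in for TCQ_F_ONETXQUEUE */
	const struct sketch_txq *txq;
	int queued;                     /* packets waiting in the qdisc */
};

/* Returns true if a packet was taken out of the qdisc. */
static bool sketch_dequeue(struct sketch_sched *q)
{
	assert(q->queued >= 0);
	/*
	 * With a single known txq, leave the packet in the qdisc while the
	 * queue is stopped, so nothing has to be parked and requeued and a
	 * higher priority packet can still be picked later.
	 */
	if (q->one_txqueue && q->txq->stopped)
		return false;
	if (q->queued == 0)
		return false;
	q->queued--;
	return true;
}
```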

    This indeed reduces latencies for high prio packets (or improves fairness
    with sfq/fq_codel), and almost removes qdisc 'requeues'.

    Signed-off-by: Eric Dumazet
    Cc: Jamal Hadi Salim
    Cc: John Fastabend
    Signed-off-by: David S. Miller

    Eric Dumazet