26 Aug, 2017

1 commit

  • For TC classes, their ->get() and ->put() are always paired, and the
    reference counting is completely useless, because:

    1) For class modification and dumping paths, we already hold RTNL lock,
    so all of these ->get(),->change(),->put() are atomic.

    2) For filter bindiing/unbinding, we use other reference counter than
    this one, and they should have RTNL lock too.

    3) For ->qlen_notify(), it is special because it is called on ->enqueue()
    path, but we already hold qdisc tree lock there, and we hold this
    tree lock when graft or delete the class too, so it should not be gone
    or changed until we release the tree lock.

    Therefore, this patch removes ->get() and ->put(), but:

    1) Adds a new ->find() to find the pointer to a class by classid, no
    refcnt.

    2) Move the original class destroy upon the last refcnt into ->delete(),
    right after releasing tree lock. This is fine because the class is
    already removed from hash when holding the lock.

    For those who also use ->put() as ->unbind(), just rename them to reflect
    this change.

    Cc: Jamal Hadi Salim
    Signed-off-by: Cong Wang
    Acked-by: Jiri Pirko
    Acked-by: Jamal Hadi Salim
    Signed-off-by: David S. Miller

    WANG Cong
     

08 Aug, 2017

3 commits


08 Jun, 2017

1 commit

  • We need to push the chain index down to the drivers, so they have the
    information to which chain the rule belongs. For now, no driver supports
    multichain offload, so only chain 0 is supported. This is needed to
    prevent chain squashes during offload for now. Later this will be used
    to implement multichain offload.

    Signed-off-by: Jiri Pirko
    Signed-off-by: David S. Miller

    Jiri Pirko
     

16 Mar, 2017

2 commits

  • The configurable priority to traffic class mapping and the user specified
    queue ranges are used to configure the traffic class, overriding the
    hardware defaults when the 'hw' option is set to 0. However, when the 'hw'
    option is non-zero, the hardware QOS defaults are used.

    This patch makes it so that we can pass the data the user provided to
    ndo_setup_tc. This allows us to pull in the queue configuration if the
    user requested it as well as any additional hardware offload type
    requested by using a value other than 1 for the hw value.

    Finally it also provides a means for the device driver to return the level
    supported for the offload type via the qopt->hw value. Previously we were
    just always assuming the value to be 1, in the future values beyond just 1
    may be supported.

    Signed-off-by: Amritha Nambiar
    Signed-off-by: Alexander Duyck
    Signed-off-by: David S. Miller

    Amritha Nambiar
     
  • This patch is meant to allow for support of multiple hardware offload type
    for a single device. There is currently no bounds checking for the hw
    member of the mqprio_qopt structure. This results in us being able to pass
    values from 1 to 255 with all being treated the same. On retreiving the
    value it is returned as 1 for anything 1 or greater being set.

    With this change we are currently adding limited bounds checking by
    defining an enum and using those values to limit the reported hardware
    offloads.

    Signed-off-by: Alexander Duyck
    Signed-off-by: David S. Miller

    Alexander Duyck
     

13 Mar, 2017

1 commit

  • The original reason [1] for having hidden qdiscs (potential scalability
    issues in qdisc_match_from_root() with single linked list in case of large
    amount of qdiscs) has been invalidated by 59cc1f61f0 ("net: sched: convert
    qdisc linked list to hashtable").

    This allows us for bringing more clarity and determinism into the dump by
    making default pfifo qdiscs visible.

    We're not turning this on by default though, at it was deemed [2] too
    intrusive / unnecessary change of default behavior towards userspace.
    Instead, TCA_DUMP_INVISIBLE netlink attribute is introduced, which allows
    applications to request complete qdisc hierarchy dump, including the
    ones that have always been implicit/invisible.

    Singleton noop_qdisc stays invisible, as teaching the whole infrastructure
    about singletons would require quite some surgery with very little gain
    (seeing no qdisc or seeing noop qdisc in the dump is probably setting
    the same user expectation).

    [1] http://lkml.kernel.org/r/1460732328.10638.74.camel@edumazet-glaptop3.roam.corp.google.com
    [2] http://lkml.kernel.org/r/20161021.105935.1907696543877061916.davem@davemloft.net

    Signed-off-by: Jiri Kosina
    Signed-off-by: David S. Miller

    Jiri Kosina
     

12 Feb, 2017

1 commit

  • Dmitry reported uses after free in qdisc code [1]

    The problem here is that ops->init() can return an error.

    qdisc_create_dflt() then call ops->destroy(),
    while qdisc_create() does _not_ call it.

    Four qdisc chose to call their own ops->destroy(), assuming their caller
    would not.

    This patch makes sure qdisc_create() calls ops->destroy()
    and fixes the four qdisc to avoid double free.

    [1]
    BUG: KASAN: use-after-free in mq_destroy+0x242/0x290 net/sched/sch_mq.c:33 at addr ffff8801d415d440
    Read of size 8 by task syz-executor2/5030
    CPU: 0 PID: 5030 Comm: syz-executor2 Not tainted 4.3.5-smp-DEV #119
    Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
    0000000000000046 ffff8801b435b870 ffffffff81bbbed4 ffff8801db000400
    ffff8801d415d440 ffff8801d415dc40 ffff8801c4988510 ffff8801b435b898
    ffffffff816682b1 ffff8801b435b928 ffff8801d415d440 ffff8801c49880c0
    Call Trace:
    [] __dump_stack lib/dump_stack.c:15 [inline]
    [] dump_stack+0x6c/0x98 lib/dump_stack.c:51
    [] kasan_object_err+0x21/0x70 mm/kasan/report.c:158
    [] print_address_description mm/kasan/report.c:196 [inline]
    [] kasan_report_error+0x1b4/0x4b0 mm/kasan/report.c:285
    [] kasan_report mm/kasan/report.c:305 [inline]
    [] __asan_report_load8_noabort+0x43/0x50 mm/kasan/report.c:326
    [] mq_destroy+0x242/0x290 net/sched/sch_mq.c:33
    [] qdisc_destroy+0x12d/0x290 net/sched/sch_generic.c:953
    [] qdisc_create_dflt+0xf0/0x120 net/sched/sch_generic.c:848
    [] attach_default_qdiscs net/sched/sch_generic.c:1029 [inline]
    [] dev_activate+0x6ad/0x880 net/sched/sch_generic.c:1064
    [] __dev_open+0x221/0x320 net/core/dev.c:1403
    [] __dev_change_flags+0x15e/0x3e0 net/core/dev.c:6858
    [] dev_change_flags+0x8e/0x140 net/core/dev.c:6926
    [] dev_ifsioc+0x446/0x890 net/core/dev_ioctl.c:260
    [] dev_ioctl+0x1ba/0xb80 net/core/dev_ioctl.c:546
    [] sock_do_ioctl+0x99/0xb0 net/socket.c:879
    [] sock_ioctl+0x2a0/0x390 net/socket.c:958
    [] vfs_ioctl fs/ioctl.c:44 [inline]
    [] do_vfs_ioctl+0x8a8/0xe50 fs/ioctl.c:611
    [] SYSC_ioctl fs/ioctl.c:626 [inline]
    [] SyS_ioctl+0x94/0xc0 fs/ioctl.c:617
    [] entry_SYSCALL_64_fastpath+0x12/0x17

    Signed-off-by: Eric Dumazet
    Reported-by: Dmitry Vyukov
    Signed-off-by: David S. Miller

    Eric Dumazet
     

11 Aug, 2016

1 commit

  • Convert the per-device linked list into a hashtable. The primary
    motivation for this change is that currently, we're not tracking all the
    qdiscs in hierarchy (e.g. excluding default qdiscs), as the lookup
    performed over the linked list by qdisc_match_from_root() is rather
    expensive.

    The ultimate goal is to get rid of hidden qdiscs completely, which will
    bring much more determinism in user experience.

    Reviewed-by: Cong Wang
    Signed-off-by: Jiri Kosina
    Signed-off-by: David S. Miller

    Jiri Kosina
     

08 Jun, 2016

1 commit

  • Large tc dumps (tc -s {qdisc|class} sh dev ethX) done by Google BwE host
    agent [1] are problematic at scale :

    For each qdisc/class found in the dump, we currently lock the root qdisc
    spinlock in order to get stats. Sampling stats every 5 seconds from
    thousands of HTB classes is a challenge when the root qdisc spinlock is
    under high pressure. Not only the dumps take time, they also slow
    down the fast path (queue/dequeue packets) by 10 % to 20 % in some cases.

    An audit of existing qdiscs showed that sch_fq_codel is the only qdisc
    that might need the qdisc lock in fq_codel_dump_stats() and
    fq_codel_dump_class_stats()

    In v2 of this patch, I now use the Qdisc running seqcount to provide
    consistent reads of packets/bytes counters, regardless of 32/64 bit arches.

    I also changed rate estimators to use the same infrastructure
    so that they no longer need to lock root qdisc lock.

    [1]
    http://static.googleusercontent.com/media/research.google.com/en//pubs/archive/43838.pdf

    Signed-off-by: Eric Dumazet
    Cc: Cong Wang
    Cc: Jamal Hadi Salim
    Cc: John Fastabend
    Cc: Kevin Athey
    Cc: Xiaotian Pei
    Signed-off-by: David S. Miller

    Eric Dumazet
     

04 Mar, 2016

1 commit


02 Mar, 2016

1 commit

  • CC [M] net/sched/sch_mqprio.o
    net/sched/sch_mqprio.c: In function ?mqprio_init?:
    net/sched/sch_mqprio.c:145: error: unknown field ?tc? specified in initializer
    net/sched/sch_mqprio.c:145: warning: missing braces around initializer
    net/sched/sch_mqprio.c:145: warning: (near initialization for ?tc.?)
    make[2]: *** [net/sched/sch_mqprio.o] Error 1
    make[1]: *** [net/sched] Error 2
    make: *** [net] Error 2

    Several people reported this, surround the unnamed union
    member initialization with braces to fix.

    Signed-off-by: David S. Miller

    David S. Miller
     

17 Feb, 2016

2 commits

  • This patch updates setup_tc so we can pass additional parameters into
    the ndo op in a generic way. To do this we provide structured union
    and type flag.

    This lets each classifier and qdisc provide its own set of attributes
    without having to add new ndo ops or grow the signature of the
    callback.

    Signed-off-by: John Fastabend
    Acked-by: Jiri Pirko
    Acked-by: Jamal Hadi Salim
    Signed-off-by: David S. Miller

    John Fastabend
     
  • The ndo_setup_tc() op was added to support drivers offloading tx
    qdiscs however only support for mqprio was ever added. So we
    only ever added support for passing the number of traffic classes
    to the driver.

    This patch generalizes the ndo_setup_tc op so that a handle can
    be provided to indicate if the offload is for ingress or egress
    or potentially even child qdiscs.

    CC: Murali Karicheri
    CC: Shradha Shah
    CC: Or Gerlitz
    CC: Ariel Elior
    CC: Jeff Kirsher
    CC: Bruce Allan
    CC: Jesse Brandeburg
    CC: Don Skidmore
    Signed-off-by: John Fastabend
    Acked-by: Jiri Pirko
    Acked-by: Jamal Hadi Salim
    Signed-off-by: David S. Miller

    John Fastabend
     

04 Dec, 2015

1 commit

  • qdisc_tree_decrease_qlen() suffers from two problems on multiqueue
    devices.

    One problem is that it updates sch->q.qlen and sch->qstats.drops
    on the mq/mqprio root qdisc, while it should not : Daniele
    reported underflows errors :
    [ 681.774821] PAX: sch->q.qlen: 0 n: 1
    [ 681.774825] PAX: size overflow detected in function qdisc_tree_decrease_qlen net/sched/sch_api.c:769 cicus.693_49 min, count: 72, decl: qlen; num: 0; context: sk_buff_head;
    [ 681.774954] CPU: 2 PID: 19 Comm: ksoftirqd/2 Tainted: G O 4.2.6.201511282239-1-grsec #1
    [ 681.774955] Hardware name: ASUSTeK COMPUTER INC. X302LJ/X302LJ, BIOS X302LJ.202 03/05/2015
    [ 681.774956] ffffffffa9a04863 0000000000000000 0000000000000000 ffffffffa990ff7c
    [ 681.774959] ffffc90000d3bc38 ffffffffa95d2810 0000000000000007 ffffffffa991002b
    [ 681.774960] ffffc90000d3bc68 ffffffffa91a44f4 0000000000000001 0000000000000001
    [ 681.774962] Call Trace:
    [ 681.774967] [] dump_stack+0x4c/0x7f
    [ 681.774970] [] report_size_overflow+0x34/0x50
    [ 681.774972] [] qdisc_tree_decrease_qlen+0x152/0x160
    [ 681.774976] [] fq_codel_dequeue+0x7b1/0x820 [sch_fq_codel]
    [ 681.774978] [] ? qdisc_peek_dequeued+0xa0/0xa0 [sch_fq_codel]
    [ 681.774980] [] __qdisc_run+0x4d/0x1d0
    [ 681.774983] [] net_tx_action+0xc2/0x160
    [ 681.774985] [] __do_softirq+0xf1/0x200
    [ 681.774987] [] run_ksoftirqd+0x1e/0x30
    [ 681.774989] [] smpboot_thread_fn+0x150/0x260
    [ 681.774991] [] ? sort_range+0x40/0x40
    [ 681.774992] [] kthread+0xe4/0x100
    [ 681.774994] [] ? kthread_worker_fn+0x170/0x170
    [ 681.774995] [] ret_from_fork+0x3e/0x70

    mq/mqprio have their own ways to report qlen/drops by folding stats on
    all their queues, with appropriate locking.

    A second problem is that qdisc_tree_decrease_qlen() calls qdisc_lookup()
    without proper locking : concurrent qdisc updates could corrupt the list
    that qdisc_match_from_root() parses to find a qdisc given its handle.

    Fix first problem adding a TCQ_F_NOPARENT qdisc flag that
    qdisc_tree_decrease_qlen() can use to abort its tree traversal,
    as soon as it meets a mq/mqprio qdisc children.

    Second problem can be fixed by RCU protection.
    Qdisc are already freed after RCU grace period, so qdisc_list_add() and
    qdisc_list_del() simply have to use appropriate rcu list variants.

    A future patch will add a per struct netdev_queue list anchor, so that
    qdisc_tree_decrease_qlen() can have more efficient lookups.

    Reported-by: Daniele Fucini
    Signed-off-by: Eric Dumazet
    Cc: Cong Wang
    Cc: Jamal Hadi Salim
    Signed-off-by: David S. Miller

    Eric Dumazet
     

30 Sep, 2014

3 commits

  • After previous patches to simplify qstats the qstats can be
    made per cpu with a packed union in Qdisc struct.

    Signed-off-by: John Fastabend
    Signed-off-by: David S. Miller

    John Fastabend
     
  • This removes the use of qstats->qlen variable from the classifiers
    and makes it an explicit argument to gnet_stats_copy_queue().

    The qlen represents the qdisc queue length and is packed into
    the qstats at the last moment before passnig to user space. By
    handling it explicitely we avoid, in the percpu stats case, having
    to figure out which per_cpu variable to put it in.

    It would probably be best to remove it from qstats completely
    but qstats is a user space ABI and can't be broken. A future
    patch could make an internal only qstats structure that would
    avoid having to allocate an additional u32 variable on the
    Qdisc struct. This would make the qstats struct 128bits instead
    of 128+32.

    Signed-off-by: John Fastabend
    Signed-off-by: David S. Miller

    John Fastabend
     
  • In order to run qdisc's without locking statistics and estimators
    need to be handled correctly.

    To resolve bstats make the statistics per cpu. And because this is
    only needed for qdiscs that are running without locks which is not
    the case for most qdiscs in the near future only create percpu
    stats when qdiscs set the TCQ_F_CPUSTATS flag.

    Next because estimators use the bstats to calculate packets per
    second and bytes per second the estimator code paths are updated
    to use the per cpu statistics.

    Signed-off-by: John Fastabend
    Signed-off-by: David S. Miller

    John Fastabend
     

14 Sep, 2014

1 commit

  • Add __rcu notation to qdisc handling by doing this we can make
    smatch output more legible. And anyways some of the cases should
    be using rcu_dereference() see qdisc_all_tx_empty(),
    qdisc_tx_chainging(), and so on.

    Also *wake_queue() API is commonly called from driver timer routines
    without rcu lock or rtnl lock. So I added rcu_read_lock() blocks
    around netif_wake_subqueue and netif_tx_wake_queue.

    Signed-off-by: John Fastabend
    Acked-by: Eric Dumazet
    Signed-off-by: David S. Miller

    John Fastabend
     

10 Dec, 2013

1 commit

  • Commit 6da7c8fcbcbd ("qdisc: allow setting default queuing discipline")
    added the ability to change default qdisc from pfifo_fast to say fq

    But as most modern ethernet devices are multiqueue, we cant really
    see all the statistics from "tc -s qdisc show", as the default root
    qdisc is mq.

    This patch adds the calls to qdisc_list_add() to mq and mqprio

    Signed-off-by: Eric Dumazet
    Cc: Stephen Hemminger
    Signed-off-by: David S. Miller

    Eric Dumazet
     

31 Aug, 2013

1 commit

  • By default, the pfifo_fast queue discipline has been used by default
    for all devices. But we have better choices now.

    This patch allow setting the default queueing discipline with sysctl.
    This allows easy use of better queueing disciplines on all devices
    without having to use tc qdisc scripts. It is intended to allow
    an easy path for distributions to make fq_codel or sfq the default
    qdisc.

    This patch also makes pfifo_fast more of a first class qdisc, since
    it is now possible to manually override the default and explicitly
    use pfifo_fast. The behavior for systems who do not use the sysctl
    is unchanged, they still get pfifo_fast

    Also removes leftover random # in sysctl net core.

    Signed-off-by: Stephen Hemminger
    Acked-by: Eric Dumazet
    Signed-off-by: David S. Miller

    stephen hemminger
     

12 Dec, 2012

1 commit

  • With BQL being deployed, we can more likely have following behavior :

    We dequeue a packet from qdisc in dequeue_skb(), then we realize target
    tx queue is in XOFF state in sch_direct_xmit(), and we have to hold the
    skb into gso_skb for later.

    This shows in stats (tc -s qdisc dev eth0) as requeues.

    Problem of these requeues is that high priority packets can not be
    dequeued as long as this (possibly low prio and big TSO packet) is not
    removed from gso_skb.

    At 1Gbps speed, a full size TSO packet is 500 us of extra latency.

    In some cases, we know that all packets dequeued from a qdisc are
    for a particular and known txq :

    - If device is non multi queue
    - For all MQ/MQPRIO slave qdiscs

    This patch introduces a new qdisc flag, TCQ_F_ONETXQUEUE to mark
    this capability, so that dequeue_skb() is allowed to dequeue a packet
    only if the associated txq is not stopped.

    This indeed reduce latencies for high prio packets (or improve fairness
    with sfq/fq_codel), and almost remove qdisc 'requeues'.

    Signed-off-by: Eric Dumazet
    Cc: Jamal Hadi Salim
    Cc: John Fastabend
    Signed-off-by: David S. Miller

    Eric Dumazet
     

02 Apr, 2012

1 commit


23 Dec, 2011

1 commit


01 Nov, 2011

1 commit


24 Feb, 2011

1 commit

  • * make qdisc_ops local
    * add sparse annotation about expected unlock/unlock in dump_class_stats
    * fix indentation

    Signed-off-by: Stephen Hemminger
    Signed-off-by: David S. Miller

    stephen hemminger
     

15 Feb, 2011

1 commit


27 Jan, 2011

1 commit


22 Jan, 2011

1 commit

  • Now qdisc stab is handled before TCQ_F_CAN_BYPASS test in
    __dev_xmit_skb(), we can generalize TCQ_F_CAN_BYPASS to other qdiscs
    than pfifo_fast : pfifo, bfifo, pfifo_head_drop and sfq

    SFQ is special because it can have external classifiers, and in these
    cases, we cannot bypass queue discipline (packet could be dropped by
    classifier) without admin asking it, or further changes.

    Its worth doing this, especially for SFQ, avoiding dirtying memory in
    case no packets are already waiting in queue.

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

20 Jan, 2011

1 commit

  • This implements a mqprio queueing discipline that by default creates
    a pfifo_fast qdisc per tx queue and provides the needed configuration
    interface.

    Using the mqprio qdisc the number of tcs currently in use along
    with the range of queues alloted to each class can be configured. By
    default skbs are mapped to traffic classes using the skb priority.
    This mapping is configurable.

    Configurable parameters,

    struct tc_mqprio_qopt {
    __u8 num_tc;
    __u8 prio_tc_map[TC_BITMASK + 1];
    __u8 hw;
    __u16 count[TC_MAX_QUEUE];
    __u16 offset[TC_MAX_QUEUE];
    };

    Here the count/offset pairing give the queue alignment and the
    prio_tc_map gives the mapping from skb->priority to tc.

    The hw bit determines if the hardware should configure the count
    and offset values. If the hardware bit is set then the operation
    will fail if the hardware does not implement the ndo_setup_tc
    operation. This is to avoid undetermined states where the hardware
    may or may not control the queue mapping. Also minimal bounds
    checking is done on the count/offset to verify a queue does not
    exceed num_tx_queues and that queue ranges do not overlap. Otherwise
    it is left to user policy or hardware configuration to create
    useful mappings.

    It is expected that hardware QOS schemes can be implemented by
    creating appropriate mappings of queues in ndo_tc_setup().

    One expected use case is drivers will use the ndo_setup_tc to map
    queue ranges onto 802.1Q traffic classes. This provides a generic
    mechanism to map network traffic onto these traffic classes and
    removes the need for lower layer drivers to know specifics about
    traffic types.

    Signed-off-by: John Fastabend
    Signed-off-by: David S. Miller

    John Fastabend