19 Sep, 2016

2 commits

  • This change replaces the sk_buff_head struct in Qdiscs with a new qdisc_skb_head.

    It's similar to the sk_buff_head API, but does not use skb->prev pointers.

    Qdiscs will commonly enqueue at the tail of a list and dequeue at the head.
    While sk_buff_head works fine for this, enqueue/dequeue also needs to
    adjust the prev pointer of the next element.

    The ->prev pointer is not required for qdiscs, so we can just leave
    it undefined and avoid one cacheline write access per enqueue/dequeue.

    Suggested-by: Eric Dumazet
    Signed-off-by: Florian Westphal
    Signed-off-by: David S. Miller

    Florian Westphal
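
    A minimal sketch of the replacement head structure and a tail enqueue; the
    field set and the helper name are assumed from the description rather than
    taken from the actual patch:

    #include <linux/skbuff.h>

    struct qdisc_skb_head {
            struct sk_buff  *head;
            struct sk_buff  *tail;
            __u32           qlen;
            spinlock_t      lock;
    };

    /* enqueue at tail: only the old tail's ->next is written;
     * skb->prev is left untouched (and undefined) */
    static inline void qdisc_skb_head_enqueue(struct qdisc_skb_head *qh,
                                              struct sk_buff *skb)
    {
            skb->next = NULL;
            if (qh->tail)
                    qh->tail->next = skb;
            else
                    qh->head = skb;
            qh->tail = skb;
            qh->qlen++;
    }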
     
  • Moves qdisc stat accounting to qdisc_dequeue_head.

    The only direct caller of the __qdisc_dequeue_head version open-codes
    this now.

    This allows us to later use __qdisc_dequeue_head as a replacement
    of __skb_dequeue() (which operates on sk_buff_head list).

    Signed-off-by: Florian Westphal
    Signed-off-by: David S. Miller

    Florian Westphal
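
    A hedged sketch of where the accounting ends up after this change (kernel
    context assumed via net/sch_generic.h; helper bodies simplified, and the
    __qdisc_dequeue_head() signature follows the later qdisc_skb_head change):

    #include <net/sch_generic.h>

    static inline struct sk_buff *qdisc_dequeue_head_sketch(struct Qdisc *sch)
    {
            struct sk_buff *skb = __qdisc_dequeue_head(&sch->q);

            if (likely(skb != NULL)) {
                    /* stat accounting now lives here, not in the __ variant */
                    qdisc_qstats_backlog_dec(sch, skb);
                    qdisc_bstats_update(sch, skb);
            }

            return skb;
    }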
     


11 Aug, 2016

1 commit

  • Convert the per-device linked list into a hashtable. The primary
    motivation for this change is that currently we're not tracking all the
    qdiscs in the hierarchy (e.g. we exclude default qdiscs), as the lookup
    performed over the linked list by qdisc_match_from_root() is rather
    expensive.

    The ultimate goal is to get rid of hidden qdiscs completely, which will
    bring much more determinism to the user experience.

    Reviewed-by: Cong Wang
    Signed-off-by: Jiri Kosina
    Signed-off-by: David S. Miller

    Jiri Kosina
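
    A hedged sketch of the post-conversion lookup; the per-device table name
    (qdisc_hash) and the hlist member name (hash) are assumptions for
    illustration, and the caller is expected to hold RTNL or rcu_read_lock():

    #include <linux/hashtable.h>
    #include <net/sch_generic.h>

    static struct Qdisc *qdisc_lookup_by_handle(struct net_device *dev, u32 handle)
    {
            struct Qdisc *q;

            /* walk one hash bucket instead of scanning a per-device list */
            hash_for_each_possible_rcu(dev->qdisc_hash, q, hash, handle) {
                    if (q->handle == handle)
                            return q;
            }
            return NULL;
    }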
     

26 Jun, 2016

2 commits

  • When qdisc bulk dequeue was added in linux-3.18 (commit
    5772e9a3463b "qdisc: bulk dequeue support for qdiscs
    with TCQ_F_ONETXQUEUE"), it was constrained to some
    specific qdiscs.

    With some extra care, we can extend this to all qdiscs,
    so that typical traffic shaping solutions can benefit from
    small batches (8 packets in this patch).

    For example, HTB is often used on multi-queue devices,
    and bonding/team devices are multi-queue...

    The idea is to bulk-dequeue packets mapping to the same transmit queue.

    This brings a 35 to 80% performance increase in an HTB setup
    under pressure on a bonding setup:

    1) NUMA node contention : 610,000 pps -> 1,110,000 pps
    2) No node contention : 1,380,000 pps -> 1,930,000 pps

    Now we should work to add batches on the enqueue() side ;)

    Signed-off-by: Eric Dumazet
    Cc: John Fastabend
    Cc: Jesper Dangaard Brouer
    Cc: Hannes Frederic Sowa
    Cc: Florian Westphal
    Cc: Daniel Borkmann
    Acked-by: Jesper Dangaard Brouer
    Signed-off-by: David S. Miller

    Eric Dumazet
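
    A hedged sketch of the batching idea; the function name is made up and the
    peek-then-dequeue structure is a simplification (the real code instead
    stashes a packet that maps to a different txq rather than leaving it queued):

    #include <net/sch_generic.h>

    static void bulk_dequeue_same_txq(struct Qdisc *q, struct sk_buff *head)
    {
            int mapping = skb_get_queue_mapping(head);
            struct sk_buff *skb = head;
            int budget = 8;         /* batch size mentioned above */

            while (--budget > 0) {
                    struct sk_buff *nskb = q->ops->peek ? q->ops->peek(q) : NULL;

                    /* stop the batch at a packet that maps to another txq */
                    if (!nskb || skb_get_queue_mapping(nskb) != mapping)
                            break;

                    nskb = q->dequeue(q);
                    if (!nskb)
                            break;
                    skb->next = nskb;       /* chain for dev_hard_start_xmit() */
                    skb = nskb;
            }
            skb->next = NULL;
    }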
     
  • Qdisc performance suffers when packets are dropped at enqueue()
    time because drops (kfree_skb()) are done while the qdisc lock is held,
    delaying a dequeue() from draining the queue.

    Nominal throughput can be reduced by 50% when this happens,
    at a time when we would like the dequeue() to proceed as fast as possible.

    Even FQ is vulnerable to this problem, although one of FQ's goals was
    to provide some flow isolation.

    This patch adds a 'struct sk_buff **to_free' parameter to all
    qdisc->enqueue() implementations and to the qdisc_drop() helper.

    I measured a performance increase of up to 12%, but this patch
    is mainly a prerequisite so that future batches in enqueue() can fly.

    Signed-off-by: Eric Dumazet
    Acked-by: Jesper Dangaard Brouer
    Signed-off-by: David S. Miller

    Eric Dumazet
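
    A hedged sketch of the pattern: enqueue-time drops are chained onto *to_free
    while the qdisc lock is held, and only freed after it is released (helper
    name made up, stats handling simplified):

    #include <net/sch_generic.h>

    static inline int qdisc_drop_deferred(struct sk_buff *skb, struct Qdisc *sch,
                                          struct sk_buff **to_free)
    {
            skb->next = *to_free;           /* defer the kfree_skb() */
            *to_free = skb;
            qdisc_qstats_drop(sch);
            return NET_XMIT_DROP;
    }

    /* caller side, once the root lock has been released:
     *
     *      if (unlikely(to_free))
     *              kfree_skb_list(to_free);
     */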
     

16 Jun, 2016

1 commit

    Qdiscs are changed under RTNL protection, and often
    while blocking BH and holding the root qdisc spinlock.

    When lots of skbs need to be dropped, we free
    them under these locks causing TX/RX freezes,
    and more generally latency spikes.

    This commit adds rtnl_kfree_skbs(), used to queue
    skbs for deferred freeing.

    Actual freeing happens right after RTNL is released,
    with appropriate scheduling points.

    rtnl_qdisc_drop() can also be used in place
    of qdisc_drop() when RTNL is held.

    qdisc_reset_queue() and __qdisc_reset_queue() get
    the new behavior, so standard qdiscs like pfifo, pfifo_fast...
    have their ->reset() method automatically handled.

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
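
    A hedged sketch of how a fifo-style ->reset() can use the new helper while
    RTNL is held (qdisc internals simplified; sch->q is still an sk_buff_head
    at this point):

    #include <net/sch_generic.h>
    #include <linux/rtnetlink.h>

    static void fifo_reset_deferred(struct Qdisc *sch)
    {
            struct sk_buff *skb;

            while ((skb = __skb_dequeue(&sch->q)) != NULL)
                    rtnl_kfree_skbs(skb, skb);  /* (head, tail) of a 1-skb segment */

            sch->qstats.backlog = 0;
    }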
     

11 Jun, 2016

2 commits

    __QDISC_STATE_THROTTLED bit manipulation is rather expensive
    for HTB and a few others.

    I already removed it for sch_fq in commit f2600cf02b5b
    ("net: sched: avoid costly atomic operation in fq_dequeue()")
    and so far nobody complained.

    When one or more packets are stuck in one or more throttled
    HTB classes, an HTB dequeue() performs two atomic operations
    to clear/set the __QDISC_STATE_THROTTLED bit, while the root qdisc
    lock is held.

    Removing this pair of atomic operations brings an 8% performance
    increase on 200 TCP_RR tests, in the presence of throttled classes.

    This patch has no side effect, since nothing actually uses
    qdisc_is_throttled() anymore.

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     
  • Conflicts:
    net/sched/act_police.c
    net/sched/sch_drr.c
    net/sched/sch_hfsc.c
    net/sched/sch_prio.c
    net/sched/sch_red.c
    net/sched/sch_tbf.c

    In net-next the drop methods of the packet schedulers got removed, so
    the bug fixes to them in 'net' are irrelevant.

    A packet action unload crash fix conflicts with the addition of the
    new firstuse timestamp.

    Signed-off-by: David S. Miller

    David S. Miller
     

10 Jun, 2016

1 commit

  • 1) qdisc_run_begin() is really using the equivalent of a trylock.
    Instead of using write_seqcount_begin(), use a combination of
    raw_write_seqcount_begin() and correct lockdep annotation.

    2) sch_direct_xmit() should use regular spin_lock(root_lock)

    Fixes: f9eb8aea2a1e ("net_sched: transform qdisc running bit into a seqcount")
    Signed-off-by: Eric Dumazet
    Reported-by: David Ahern
    Signed-off-by: David S. Miller

    Eric Dumazet
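
    A hedged sketch of the trylock-style begin described in point 1 (it mirrors
    the intent of the fix; exact annotations may differ):

    #include <net/sch_generic.h>

    static inline bool qdisc_run_begin_trylock(struct Qdisc *qdisc)
    {
            if (qdisc_is_running(qdisc))
                    return false;
            /* write_seqcount_begin() would assert that a lock is held; this
             * path behaves like a trylock, so use the raw variant and tell
             * lockdep about the acquisition explicitly. */
            raw_write_seqcount_begin(&qdisc->running);
            seqcount_acquire(&qdisc->running.dep_map, 0, 1, _RET_IP_);
            return true;
    }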
     

09 Jun, 2016

4 commits

  • Earlier commits removed two members from struct Qdisc, which placed
    next_sched/gso_skb in a different cacheline than ->state.

    This restores the struct layout to what it was before the removal:
    move the two members, then add an annotation so they all reside in the
    same cacheline.

    This adds a 16-byte hole after cpu_qstats.

    The hole could be closed, but as that doesn't decrease the total struct
    size, just do it this way.

    Reported-by: Eric Dumazet
    Signed-off-by: Florian Westphal
    Signed-off-by: David S. Miller

    Florian Westphal
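
    A hedged, heavily abridged sketch of the layout intent (field list and
    ordering are approximate, not the actual struct definition):

    struct Qdisc {
            /* ... configuration and slow-path fields ... */
            struct gnet_stats_queue __percpu *cpu_qstats;

            /* the 16-byte hole mentioned above ends up here */

            /* fields touched on every dequeue, kept in one cacheline: */
            struct Qdisc            *next_sched ____cacheline_aligned_in_smp;
            struct sk_buff          *gso_skb;
            unsigned long           state;
            struct sk_buff_head     q;
            /* ... */
    };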
     
  • After the removal of TCA_CBQ_OVL_STRATEGY from the cbq scheduler, there are no
    more callers of ->drop() outside of other ->drop() functions, i.e.
    nothing calls them.

    Signed-off-by: Florian Westphal
    Signed-off-by: David S. Miller

    Florian Westphal
     
  • After the removal of TCA_CBQ_POLICE from the cbq scheduler, qdisc->reshape_fail
    is always NULL, i.e. qdisc_reshape_fail() is now the same as qdisc_drop().

    Signed-off-by: Florian Westphal
    Signed-off-by: David S. Miller

    Florian Westphal
     
  • iproute2 doesn't implement any cbq option that results in this attribute
    being sent to the kernel.

    To make use of it, the user would have to:

    - patch iproute2
    - add a class
    - attach a qdisc to the class (default pfifo doesn't work as
    q->handle is 0 and cbq_set_police() is a no-op in this case)
    - re-'add' the same class (tc class change ...) again
    - the user must also specify a defmap (e.g. 'split 1:0 defmap 3f'), since
    this 'police' feature relies on its presence
    - the added qdisc must be one of bfifo, pfifo or netem

    If all of these conditions are met and _some_ leaf qdiscs, namely
    p/bfifo, netem, plug or tbf would drop a packet, kernel calls back into
    cbq, which will attempt to re-queue the skb into a different class
    as indicated by the parents' defmap entry for TC_PRIO_BESTEFFORT.

    [ i.e. we behave as if tc_classify returned TC_ACT_RECLASSIFY ].

    This feature, which isn't documented or implemented in iproute2,
    and isn't implemented consistently (most qdiscs, like sfq, codel, etc.,
    drop right away instead of attempting this reclassification), is the
    sole reason for the reshape_fail and __parent members in the Qdisc struct.

    So remove TCA_CBQ_POLICE support from the kernel, reject it via EOPNOTSUPP
    so userspace knows we don't support it, and then remove the no-longer-needed
    infrastructure in a followup commit.

    Signed-off-by: Florian Westphal
    Signed-off-by: David S. Miller

    Florian Westphal
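
    A hedged sketch of the userspace-visible rejection, as it could appear in
    cbq's class change path (function name made up):

    #include <net/netlink.h>
    #include <linux/pkt_sched.h>

    static int cbq_reject_police(struct nlattr **tb)
    {
            if (tb[TCA_CBQ_POLICE])
                    return -EOPNOTSUPP;     /* tell userspace the feature is gone */
            return 0;
    }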
     

08 Jun, 2016

3 commits

  • When offloading classifiers such as u32 or flower to hardware, and the
    qdisc is clsact (TC_H_CLSACT), we need to differentiate its classes,
    since not all of them handle ingress; those must stay in the
    software path. Add a .tcf_cl_offload() callback so we can handle them
    generically; tested on ixgbe.

    Fixes: 10cbc6843446 ("net/sched: cls_flower: Hardware offloaded filters statistics support")
    Fixes: 5b33f48842fa ("net/flower: Introduce hardware offload support")
    Fixes: a1b7c5fd7fe9 ("net: sched: add cls_u32 offload hooks for netdevs")
    Signed-off-by: Daniel Borkmann
    Acked-by: John Fastabend
    Signed-off-by: David S. Miller

    Daniel Borkmann
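
    A hedged sketch of what such a callback can look like for clsact, which can
    offload only its ingress class (callback name and exact shape assumed):

    #include <linux/pkt_sched.h>

    static bool clsact_cl_offload(u32 classid)
    {
            /* only the ingress class is offloadable; the egress side of
             * clsact stays in the software path */
            return TC_H_MIN(classid) == TC_H_MIN(TC_H_MIN_INGRESS);
    }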
     
  • Large tc dumps (tc -s {qdisc|class} sh dev ethX) done by the Google BwE host
    agent [1] are problematic at scale:

    For each qdisc/class found in the dump, we currently lock the root qdisc
    spinlock in order to get stats. Sampling stats every 5 seconds from
    thousands of HTB classes is a challenge when the root qdisc spinlock is
    under high pressure. Not only do the dumps take time, they also slow
    down the fast path (enqueue/dequeue of packets) by 10% to 20% in some cases.

    An audit of existing qdiscs showed that sch_fq_codel is the only qdisc
    that might need the qdisc lock in fq_codel_dump_stats() and
    fq_codel_dump_class_stats().

    In v2 of this patch, I now use the Qdisc running seqcount to provide
    consistent reads of packets/bytes counters, regardless of 32/64 bit arches.

    I also changed the rate estimators to use the same infrastructure
    so that they no longer need to take the root qdisc lock.

    [1]
    http://static.googleusercontent.com/media/research.google.com/en//pubs/archive/43838.pdf

    Signed-off-by: Eric Dumazet
    Cc: Cong Wang
    Cc: Jamal Hadi Salim
    Cc: John Fastabend
    Cc: Kevin Athey
    Cc: Xiaotian Pei
    Signed-off-by: David S. Miller

    Eric Dumazet
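
    A hedged sketch of the lock-free read side: the running seqcount lets a dump
    retry if it raced with the transmit path, instead of taking the root qdisc
    lock (names simplified):

    #include <linux/seqlock.h>
    #include <net/gen_stats.h>

    static void read_bstats_consistent(const struct gnet_stats_basic_packed *b,
                                       const seqcount_t *running,
                                       u64 *bytes, u64 *packets)
    {
            unsigned int seq;

            do {
                    seq = read_seqcount_begin(running);
                    *bytes   = b->bytes;
                    *packets = b->packets;
            } while (read_seqcount_retry(running, seq));
    }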
     
  • Instead of using a single bit (__QDISC___STATE_RUNNING)
    in sch->__state, use a seqcount.

    This adds lockdep support, but more importantly it will allow us
    to sample qdisc/class statistics without having to grab the qdisc root lock.

    Signed-off-by: Eric Dumazet
    Cc: Cong Wang
    Cc: Jamal Hadi Salim
    Signed-off-by: David S. Miller

    Eric Dumazet
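
    A hedged sketch of the 'running' test on top of a seqcount (member name
    assumed; an odd sequence value means the transmit path is inside the
    section):

    #include <net/sch_generic.h>

    static inline bool qdisc_is_running_sketch(const struct Qdisc *qdisc)
    {
            return (raw_read_seqcount(&qdisc->running) & 1) ? true : false;
    }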
     

07 Jun, 2016

1 commit

  • For gso_skb we only update qlen; the backlog should be updated too.

    Note, it is correct to just update these stats at one layer,
    because the gso_skb is cached there.

    Reported-by: Stas Nichiporovich
    Fixes: 2ccccf5fb43f ("net_sched: update hierarchical backlog too")
    Cc: Jamal Hadi Salim
    Signed-off-by: Cong Wang
    Signed-off-by: David S. Miller

    WANG Cong
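
    A hedged sketch of the pairing the fix is about: whenever the cached gso_skb
    changes qlen, the backlog now moves with it (function names made up):

    #include <net/sch_generic.h>

    static inline void requeue_gso_skb(struct Qdisc *q, struct sk_buff *skb)
    {
            q->gso_skb = skb;
            qdisc_qstats_backlog_inc(q, skb);       /* new: backlog too */
            q->q.qlen++;
    }

    static inline struct sk_buff *dequeue_gso_skb(struct Qdisc *q)
    {
            struct sk_buff *skb = q->gso_skb;

            if (skb) {
                    q->gso_skb = NULL;
                    qdisc_qstats_backlog_dec(q, skb); /* new: backlog too */
                    q->q.qlen--;
            }
            return skb;
    }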
     

17 May, 2016

1 commit

  • Introduce a stats_update callback. A netdev driver can call it for offloaded
    actions to update the basic statistics (packets, bytes and last use).
    Since bstats_update() and bstats_cpu_update() use an skb as an argument to
    get the counters, _bstats_update() and _bstats_cpu_update(), which take
    bytes and packets as arguments, were added.

    Signed-off-by: Amir Vadai
    Signed-off-by: David S. Miller

    Amir Vadai
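
    A hedged sketch of the byte/packet-argument variant the entry describes, and
    how the existing skb-based helper can be expressed on top of it:

    #include <net/sch_generic.h>

    static inline void _bstats_update_sketch(struct gnet_stats_basic_packed *bstats,
                                             __u64 bytes, __u32 packets)
    {
            bstats->bytes   += bytes;
            bstats->packets += packets;
    }

    static inline void bstats_update_sketch(struct gnet_stats_basic_packed *bstats,
                                            const struct sk_buff *skb)
    {
            _bstats_update_sketch(bstats, qdisc_pkt_len(skb),
                                  skb_is_gso(skb) ? skb_shinfo(skb)->gso_segs : 1);
    }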
     


01 Mar, 2016

2 commits

  • When the bottom qdisc decides to, for example, drop some packet,
    it calls qdisc_tree_decrease_qlen() to update the queue length
    for all its ancestors; we need to update the backlog too to
    keep the stats on the root qdisc accurate.

    Cc: Jamal Hadi Salim
    Acked-by: Jamal Hadi Salim
    Signed-off-by: Cong Wang
    Signed-off-by: David S. Miller

    WANG Cong
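
    A hedged sketch of the propagation (class notification, ingress handling and
    locking details omitted; function name and exact signature are assumptions):

    #include <net/pkt_sched.h>

    static void tree_reduce_backlog_sketch(struct Qdisc *sch, unsigned int n,
                                           unsigned int len)
    {
            u32 parentid;

            while ((parentid = sch->parent)) {
                    sch = qdisc_lookup(qdisc_dev(sch), TC_H_MAJ(parentid));
                    if (sch == NULL)
                            break;
                    sch->q.qlen -= n;               /* queue length, as before */
                    sch->qstats.backlog -= len;     /* new: keep backlog in sync */
            }
    }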
     
  • Remove nearly duplicated code and prepare for the following patch.

    Cc: Jamal Hadi Salim
    Acked-by: Jamal Hadi Salim
    Signed-off-by: Cong Wang
    Signed-off-by: David S. Miller

    WANG Cong
     


04 Dec, 2015

1 commit

  • qdisc_tree_decrease_qlen() suffers from two problems on multiqueue
    devices.

    One problem is that it updates sch->q.qlen and sch->qstats.drops
    on the mq/mqprio root qdisc, while it should not: Daniele
    reported underflow errors:
    [ 681.774821] PAX: sch->q.qlen: 0 n: 1
    [ 681.774825] PAX: size overflow detected in function qdisc_tree_decrease_qlen net/sched/sch_api.c:769 cicus.693_49 min, count: 72, decl: qlen; num: 0; context: sk_buff_head;
    [ 681.774954] CPU: 2 PID: 19 Comm: ksoftirqd/2 Tainted: G O 4.2.6.201511282239-1-grsec #1
    [ 681.774955] Hardware name: ASUSTeK COMPUTER INC. X302LJ/X302LJ, BIOS X302LJ.202 03/05/2015
    [ 681.774956] ffffffffa9a04863 0000000000000000 0000000000000000 ffffffffa990ff7c
    [ 681.774959] ffffc90000d3bc38 ffffffffa95d2810 0000000000000007 ffffffffa991002b
    [ 681.774960] ffffc90000d3bc68 ffffffffa91a44f4 0000000000000001 0000000000000001
    [ 681.774962] Call Trace:
    [ 681.774967] [] dump_stack+0x4c/0x7f
    [ 681.774970] [] report_size_overflow+0x34/0x50
    [ 681.774972] [] qdisc_tree_decrease_qlen+0x152/0x160
    [ 681.774976] [] fq_codel_dequeue+0x7b1/0x820 [sch_fq_codel]
    [ 681.774978] [] ? qdisc_peek_dequeued+0xa0/0xa0 [sch_fq_codel]
    [ 681.774980] [] __qdisc_run+0x4d/0x1d0
    [ 681.774983] [] net_tx_action+0xc2/0x160
    [ 681.774985] [] __do_softirq+0xf1/0x200
    [ 681.774987] [] run_ksoftirqd+0x1e/0x30
    [ 681.774989] [] smpboot_thread_fn+0x150/0x260
    [ 681.774991] [] ? sort_range+0x40/0x40
    [ 681.774992] [] kthread+0xe4/0x100
    [ 681.774994] [] ? kthread_worker_fn+0x170/0x170
    [ 681.774995] [] ret_from_fork+0x3e/0x70

    mq/mqprio have their own ways to report qlen/drops by folding stats on
    all their queues, with appropriate locking.

    A second problem is that qdisc_tree_decrease_qlen() calls qdisc_lookup()
    without proper locking: concurrent qdisc updates could corrupt the list
    that qdisc_match_from_root() parses to find a qdisc given its handle.

    Fix the first problem by adding a TCQ_F_NOPARENT qdisc flag that
    qdisc_tree_decrease_qlen() can use to abort its tree traversal
    as soon as it meets an mq/mqprio qdisc child.

    The second problem can be fixed by RCU protection.
    Qdiscs are already freed after an RCU grace period, so qdisc_list_add() and
    qdisc_list_del() simply have to use the appropriate rcu list variants.

    A future patch will add a per struct netdev_queue list anchor, so that
    qdisc_tree_decrease_qlen() can have more efficient lookups.

    Reported-by: Daniele Fucini
    Signed-off-by: Eric Dumazet
    Cc: Cong Wang
    Cc: Jamal Hadi Salim
    Signed-off-by: David S. Miller

    Eric Dumazet
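
    A hedged fragment of the traversal abort for the first problem (simplified
    from the description; stats and notification details omitted):

    #include <net/pkt_sched.h>

    static void tree_decrease_qlen_sketch(struct Qdisc *sch, unsigned int n)
    {
            u32 parentid;

            while ((parentid = sch->parent)) {
                    /* mq/mqprio children carry TCQ_F_NOPARENT: never walk up
                     * into the mq/mqprio root, it folds stats itself */
                    if (sch->flags & TCQ_F_NOPARENT)
                            break;
                    sch = qdisc_lookup(qdisc_dev(sch), TC_H_MAJ(parentid));
                    if (sch == NULL)
                            break;
                    sch->q.qlen -= n;
            }
    }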
     

18 Sep, 2015

2 commits

  • Existing bpf_clone_redirect() helper clones skb before redirecting
    it to RX or TX of destination netdev.
    Introduce bpf_redirect() helper that does that without cloning.

    Benchmarked with two hosts using 10G ixgbe NICs.
    One host is doing line rate pktgen.
    Another host is configured as:
    $ tc qdisc add dev $dev ingress
    $ tc filter add dev $dev root pref 10 u32 match u32 0 0 flowid 1:2 \
    action bpf run object-file tcbpf1_kern.o section clone_redirect_xmit drop
    so it receives the packet on $dev and immediately xmits it on $dev + 1
    The section 'clone_redirect_xmit' in tcbpf1_kern.o file has the program
    that does bpf_clone_redirect() and performance is 2.0 Mpps

    $ tc filter add dev $dev root pref 10 u32 match u32 0 0 flowid 1:2 \
    action bpf run object-file tcbpf1_kern.o section redirect_xmit drop
    which is using bpf_redirect() - 2.4 Mpps

    and using cls_bpf with integrated actions as:
    $ tc filter add dev $dev root pref 10 \
    bpf run object-file tcbpf1_kern.o section redirect_xmit integ_act classid 1
    performance is 2.5 Mpps

    To summarize:
    u32+act_bpf using clone_redirect - 2.0 Mpps
    u32+act_bpf using redirect - 2.4 Mpps
    cls_bpf using redirect - 2.5 Mpps

    For comparison linux bridge in this setup is doing 2.1 Mpps
    and ixgbe rx + drop in ip_rcv - 7.8 Mpps

    Signed-off-by: Alexei Starovoitov
    Acked-by: Daniel Borkmann
    Acked-by: John Fastabend
    Signed-off-by: David S. Miller

    Alexei Starovoitov
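
    A hedged sketch of what a 'redirect_xmit'-style section can look like; the
    header names follow current libbpf conventions and the ifindex arithmetic is
    only for illustration, not taken from tcbpf1_kern.o:

    #include <linux/bpf.h>
    #include <bpf/bpf_helpers.h>

    SEC("redirect_xmit")
    int redirect_xmit_prog(struct __sk_buff *skb)
    {
            /* transmit on the next ifindex without cloning the skb */
            return bpf_redirect(skb->ifindex + 1, 0 /* egress */);
    }

    char _license[] SEC("license") = "GPL";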
     
  • Often cls_bpf classifier is used with single action drop attached.
    Optimize this use case and let cls_bpf return both classid and action.
    For backwards compatibility reasons enable this feature under
    TCA_BPF_FLAG_ACT_DIRECT flag.

    Then more interesting programs like the following are easier to write:
    int cls_bpf_prog(struct __sk_buff *skb)
    {
            /* classify arp, ip, ipv6 into different traffic classes
             * and drop all other packets
             */
            switch (skb->protocol) {
            case htons(ETH_P_ARP):
                    skb->tc_classid = 1;
                    break;
            case htons(ETH_P_IP):
                    skb->tc_classid = 2;
                    break;
            case htons(ETH_P_IPV6):
                    skb->tc_classid = 3;
                    break;
            default:
                    return TC_ACT_SHOT;
            }

            return TC_ACT_OK;
    }

    Joint work with Daniel Borkmann.

    Signed-off-by: Daniel Borkmann
    Signed-off-by: Alexei Starovoitov
    Signed-off-by: David S. Miller

    Daniel Borkmann
     


09 Jul, 2015

1 commit

  • qdisc_bstats_update_cpu() and other helpers were added to support
    percpu stats for qdisc.

    We want to add percpu stats for tc action, so this patch add common
    helpers.

    qdisc_bstats_update_cpu() is renamed to qdisc_bstats_cpu_update()
    qdisc_qstats_drop_cpu() is renamed to qdisc_qstats_cpu_drop()

    Signed-off-by: Eric Dumazet
    Cc: Alexei Starovoitov
    Acked-by: Jamal Hadi Salim
    Acked-by: John Fastabend
    Acked-by: Alexei Starovoitov
    Signed-off-by: David S. Miller

    Eric Dumazet
     

14 May, 2015

1 commit

  • It seems all we want here is to avoid an endless 'goto reclassify' loop;
    tc_classify_compat even resets this counter when something other
    than TC_ACT_RECLASSIFY is returned, so this skb counter doesn't
    break hypothetical loops induced by something other than perpetual
    TC_ACT_RECLASSIFY return values.

    skb_act_clone is now identical to skb_clone, so just use that.

    Tested with following (bogus) filter:
    tc filter add dev eth0 parent ffff: \
    protocol ip u32 match u32 0 0 police rate 10Kbit burst \
    64000 mtu 1500 action reclassify

    Acked-by: Daniel Borkmann
    Signed-off-by: Florian Westphal
    Acked-by: Alexei Starovoitov
    Acked-by: Jamal Hadi Salim
    Signed-off-by: David S. Miller

    Florian Westphal
     


03 May, 2015

1 commit

  • Not used.

    pedit sets TC_MUNGED when packet content was altered, but all the core
    does is unset MUNGED again and then set OK2MUNGE.

    And the latter isn't tested anywhere. So let's remove both
    TC_MUNGED and TC_OK2MUNGE.

    Signed-off-by: Florian Westphal
    Acked-by: Alexei Starovoitov
    Acked-by: Daniel Borkmann
    Acked-by: Jamal Hadi Salim
    Signed-off-by: David S. Miller

    Florian Westphal
     

10 Mar, 2015

1 commit

  • The kernel automatically creates a tp (with handle 0) for each
    (kind, protocol, priority) tuple when we add a new filter,
    but it is still left there after we remove our own, unless we
    omit the handle (which literally means all the filters under
    the tuple). For example, this one is left:

    # tc filter show dev eth0
    filter parent 8001: protocol arp pref 49152 basic

    It is hard for user space to clean these up for the kernel
    because filters like u32 are organized in a complex way.
    So the kernel is responsible for removing them after all filters
    are gone. Each type of filter has its own way to
    store the filters, so each type has to provide its own
    way to check whether all filters are gone.

    Cc: Jamal Hadi Salim
    Signed-off-by: Cong Wang
    Signed-off-by: Cong Wang
    Acked-by: Jamal Hadi Salim
    Signed-off-by: David S. Miller

    Cong Wang
     

01 Feb, 2015

1 commit

  • Running the following commands on a non-idle network device
    panics the box instantly, because cpu_bstats gets overwritten
    by stats.

    tc qdisc add dev eth0 root
    ... some traffic (one packet is enough) ...
    tc qdisc replace dev eth0 root est 1sec 4sec

    [ 325.355596] BUG: unable to handle kernel paging request at ffff8841dc5a074c
    [ 325.362609] IP: [] __gnet_stats_copy_basic+0x3e/0x90
    [ 325.369158] PGD 1fa7067 PUD 0
    [ 325.372254] Oops: 0000 [#1] SMP
    [ 325.375514] Modules linked in: ...
    [ 325.398346] CPU: 13 PID: 14313 Comm: tc Not tainted 3.19.0-smp-DEV #1163
    [ 325.412042] task: ffff8800793ab5d0 ti: ffff881ff2fa4000 task.ti: ffff881ff2fa4000
    [ 325.419518] RIP: 0010:[] [] __gnet_stats_copy_basic+0x3e/0x90
    [ 325.428506] RSP: 0018:ffff881ff2fa7928 EFLAGS: 00010286
    [ 325.433824] RAX: 000000000000000c RBX: ffff881ff2fa796c RCX: 000000000000000c
    [ 325.440988] RDX: ffff8841dc5a0744 RSI: 0000000000000060 RDI: 0000000000000060
    [ 325.448120] RBP: ffff881ff2fa7948 R08: ffffffff81cd4f80 R09: 0000000000000000
    [ 325.455268] R10: ffff883ff223e400 R11: 0000000000000000 R12: 000000015cba0744
    [ 325.462405] R13: ffffffff81cd4f80 R14: ffff883ff223e460 R15: ffff883feea0722c
    [ 325.469536] FS: 00007f2ee30fa700(0000) GS:ffff88407fa20000(0000) knlGS:0000000000000000
    [ 325.477630] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    [ 325.483380] CR2: ffff8841dc5a074c CR3: 0000003feeae9000 CR4: 00000000001407e0
    [ 325.490510] Stack:
    [ 325.492524] ffff883feea0722c ffff883fef719dc0 ffff883feea0722c ffff883ff223e4a0
    [ 325.499990] ffff881ff2fa79a8 ffffffff815424ee ffff883ff223e49c 000000015cba0744
    [ 325.507460] 00000000f2fa7978 0000000000000000 ffff881ff2fa79a8 ffff883ff223e4a0
    [ 325.514956] Call Trace:
    [ 325.517412] [] gen_new_estimator+0x8e/0x230
    [ 325.523250] [] gen_replace_estimator+0x4a/0x60
    [ 325.529349] [] tc_modify_qdisc+0x52b/0x590
    [ 325.535117] [] rtnetlink_rcv_msg+0xa0/0x240
    [ 325.540963] [] ? __rtnl_unlock+0x20/0x20
    [ 325.546532] [] netlink_rcv_skb+0xb1/0xc0
    [ 325.552145] [] rtnetlink_rcv+0x25/0x40
    [ 325.557558] [] netlink_unicast+0x168/0x220
    [ 325.563317] [] netlink_sendmsg+0x2ec/0x3e0

    Let's play it safe and not use a union: percpu pointers are mostly read
    anyway, and we typically have few qdiscs per host.

    Signed-off-by: Eric Dumazet
    Cc: John Fastabend
    Fixes: 22e0f8b9322c ("net: sched: make bstats per cpu and estimator RCU safe")
    Signed-off-by: David S. Miller

    Eric Dumazet
     


04 Oct, 2014

1 commit

  • Based on DaveM's recent API work on dev_hard_start_xmit(), that allows
    sending/processing an entire skb list.

    This patch implements qdisc bulk dequeue, by allowing multiple packets
    to be dequeued in dequeue_skb().

    The optimization principle for this is twofold: (1) amortize
    locking cost and (2) avoid expensive tailptr updates for notifying HW.
    (1) Several packets are dequeued while holding the qdisc root_lock,
    amortizing locking cost over several packets. The dequeued SKB list is
    processed under the TXQ lock in dev_hard_start_xmit(), thus also
    amortizing the cost of the TXQ lock.
    (2) Furthermore, dev_hard_start_xmit() will utilize the skb->xmit_more
    API to delay the HW tailptr update, which also reduces the cost per
    packet.

    One restriction of the new API is that every SKB must belong to the
    same TXQ. This patch takes the easy way out, by restricting bulk
    dequeue to qdiscs with the TCQ_F_ONETXQUEUE flag, which specifies that
    the qdisc has only a single TXQ attached.

    Some detail about the flow: dev_hard_start_xmit() will process the skb
    list, and transmit packets individually towards the driver (see
    xmit_one()). In case the driver stops midway in the list, the
    remaining skb list is returned by dev_hard_start_xmit(). In
    sch_direct_xmit() this returned list is requeued by dev_requeue_skb().

    To avoid overshooting the HW limits, which results in requeuing, the
    patch limits the amount of bytes dequeued, based on the driver's BQL
    limits. In effect, bulking will only happen for BQL-enabled drivers.

    Small amounts of extra HoL blocking (2x MTU/0.24ms) were
    measured at 100Mbit/s when bulking 8 packets, but the
    oscillating nature of the measurement indicates that something
    like sched latency might be causing this effect. More comparisons
    show that this oscillation goes away occasionally. Thus, we
    disregard this artifact completely and remove any "magic" bulking
    limit.

    For now, as a conservative approach, stop bulking when seeing TSO and
    segmented GSO packets. They already benefit from bulking on their own.
    A followup patch adds this, to allow easier bisectability for finding
    regressions.

    Joint work with Hannes, Daniel and Florian.

    Signed-off-by: Jesper Dangaard Brouer
    Signed-off-by: Hannes Frederic Sowa
    Signed-off-by: Daniel Borkmann
    Signed-off-by: Florian Westphal
    Signed-off-by: David S. Miller

    Jesper Dangaard Brouer
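
    A hedged sketch of the BQL-bounded bulk dequeue (tail-pointer bookkeeping,
    GSO byte accounting and validation steps simplified):

    #include <net/sch_generic.h>

    static void try_bulk_dequeue_sketch(struct Qdisc *q, struct sk_buff *skb,
                                        const struct netdev_queue *txq)
    {
            int bytelimit = qdisc_avail_bulklimit(txq) - skb->len;

            while (bytelimit > 0) {
                    struct sk_buff *nskb = q->dequeue(q);

                    if (!nskb)
                            break;

                    bytelimit -= nskb->len;
                    skb->next = nskb;       /* build the list xmit_one() walks */
                    skb = nskb;
            }
            skb->next = NULL;
    }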
     

30 Sep, 2014

3 commits

  • After the previous patches to simplify qstats, the qstats can be
    made per-cpu with a packed union in the Qdisc struct.

    Signed-off-by: John Fastabend
    Signed-off-by: David S. Miller

    John Fastabend
     
  • This adds helpers to manipulate qstats logic and replaces locations
    that touch the counters directly. This simplifies future patches
    to push qstats onto per cpu counters.

    Signed-off-by: John Fastabend
    Signed-off-by: David S. Miller

    John Fastabend
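
    A couple of hedged examples of the helper style this refers to:

    #include <net/sch_generic.h>

    static inline void qdisc_qstats_drop_sketch(struct Qdisc *sch)
    {
            sch->qstats.drops++;
    }

    static inline void qdisc_qstats_backlog_dec_sketch(struct Qdisc *sch,
                                                       const struct sk_buff *skb)
    {
            sch->qstats.backlog -= qdisc_pkt_len(skb);
    }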
     
  • In order to run qdiscs without locking, statistics and estimators
    need to be handled correctly.

    To resolve bstats, make the statistics per-cpu. And because this is
    only needed for qdiscs that are running without locks, which will not be
    the case for most qdiscs in the near future, only create percpu
    stats when qdiscs set the TCQ_F_CPUSTATS flag.

    Next, because estimators use the bstats to calculate packets per
    second and bytes per second, the estimator code paths are updated
    to use the per-cpu statistics.

    Signed-off-by: John Fastabend
    Signed-off-by: David S. Miller

    John Fastabend
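
    A hedged sketch of the per-cpu update path, taken only when the qdisc was
    created with TCQ_F_CPUSTATS set (u64_stats synchronization as in gen_stats):

    #include <net/sch_generic.h>

    static inline void bstats_cpu_update_sketch(struct Qdisc *sch,
                                                const struct sk_buff *skb)
    {
            struct gnet_stats_basic_cpu *b = this_cpu_ptr(sch->cpu_bstats);

            u64_stats_update_begin(&b->syncp);
            bstats_update(&b->bstats, skb); /* bytes += len, packets += gso segs */
            u64_stats_update_end(&b->syncp);
    }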
     


23 Sep, 2014

1 commit

  • We cannot make struct qdisc_skb_cb bigger without impacting IPoIB,
    or increasing skb->cb[] size.

    Commit e0f31d849867 ("flow_keys: Record IP layer protocol in
    skb_flow_dissect()") broke IPoIB.

    The only current offender is sch_choke, and it does not need an
    absolutely precise flow key.

    If we store 17 bytes of the flow key, it's more than enough. (It's the actual
    size of flow_keys if it were a packed structure, but we might add new
    fields at the end of it later.)

    Signed-off-by: Eric Dumazet
    Fixes: e0f31d849867 ("flow_keys: Record IP layer protocol in skb_flow_dissect()")
    Signed-off-by: David S. Miller

    Eric Dumazet
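
    A hedged sketch of the shrunken control block and the kind of compile-time
    guard that keeps private qdisc cb data (such as choke's partial flow key)
    within skb->cb[]; field names and sizes here are assumptions:

    #include <linux/skbuff.h>
    #include <linux/bug.h>

    struct qdisc_skb_cb_sketch {
            unsigned int    pkt_len;
            u16             slave_dev_queue_mapping;
            u16             _pad;
            unsigned char   data[20];       /* room for a 17-byte flow key */
    };

    static inline void qdisc_cb_validate_sketch(const struct sk_buff *skb, int sz)
    {
            struct qdisc_skb_cb_sketch *qcb;

            BUILD_BUG_ON(sizeof(skb->cb) <
                         offsetof(struct qdisc_skb_cb_sketch, data) + sz);
            BUILD_BUG_ON(sizeof(qcb->data) < sz);
    }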