24 Mar, 2011

1 commit

  • commit fd245a4adb52 (net_sched: move TCQ_F_THROTTLED flag)
    added a race.

    qdisc_watchdog() runs from softirq, so special care must be taken,
    or we can lose a state transition (THROTTLED/RUNNING).

    Prior to fd245a4adb52, we were manipulating q->flags (qdisc->flags &=
    ~TCQ_F_THROTTLED;), and this manipulation could only race with
    qdisc_warn_nonwc().

    Since we want to avoid atomic ops in the qdisc fast path - that was
    the point of commit 371121057607e (QDISC_STATE_RUNNING dont need
    atomic bit ops) - the fix is to move the THROTTLED bit into the
    'state' field, which is manipulated with SMP- and IRQ-safe
    operations (see the sketch below).

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
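
    A minimal sketch of the fix's shape, assuming SMP/IRQ-safe helpers
    over the atomic 'state' field; the helper names follow the
    description and are illustrative, not copied from the patch:

        static inline bool qdisc_is_throttled(const struct Qdisc *qdisc)
        {
                return test_bit(__QDISC_STATE_THROTTLED, &qdisc->state);
        }

        static inline void qdisc_throttled(struct Qdisc *qdisc)
        {
                set_bit(__QDISC_STATE_THROTTLED, &qdisc->state);
        }

        static inline void qdisc_unthrottled(struct Qdisc *qdisc)
        {
                clear_bit(__QDISC_STATE_THROTTLED, &qdisc->state);
        }

    Because set_bit()/clear_bit() are atomic, a softirq transition in
    qdisc_watchdog() can no longer be lost to a concurrent
    read-modify-write of q->flags.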
     


21 Jan, 2011

3 commits

  • In commit 44b8288308ac9d (net_sched: pfifo_head_drop problem), we fixed
    a problem where pfifo_head_drop incorrectly decreased
    sch->bstats.bytes and sch->bstats.packets.

    Several qdiscs (CHOKe, SFQ, pfifo_head, ...) are able to drop a
    previously enqueued packet, and bstats cannot be adjusted afterwards,
    so bstats/rates are not accurate (overestimated).

    This patch changes the qdisc_bstats updates to be done at dequeue()
    time instead of enqueue() time (see the sketch below). bstats counters
    no longer account for dropped frames, and rates are more correct,
    since enqueue() bursts don't affect the dequeue() rate.

    Signed-off-by: Eric Dumazet
    Acked-by: Stephen Hemminger
    Signed-off-by: David S. Miller

    Eric Dumazet
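
    A minimal sketch of the change's effect on a simple FIFO-style
    dequeue, assuming the qdisc_bstats_update() helper; the function is
    illustrative, not taken from the patch:

        static struct sk_buff *example_dequeue(struct Qdisc *sch)
        {
                struct sk_buff *skb = __skb_dequeue(&sch->q);

                /* stats were previously bumped at enqueue time */
                if (skb)
                        qdisc_bstats_update(sch, skb);
                return skb;
        }

    A packet dropped while sitting in the queue now simply never hits
    the counters.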
     
  • This patch converts stab qdisc management to RCU, so that we can
    perform the qdisc_calculate_pkt_len() call before taking the qdisc
    lock, as sketched below.

    This shortens the time the lock is held in __dev_xmit_skb().

    This permits more qdiscs to get TCQ_F_CAN_BYPASS status, avoiding a
    lot of cache misses and thus reducing latencies.

    Signed-off-by: Eric Dumazet
    CC: Patrick McHardy
    CC: Jesper Dangaard Brouer
    CC: Jarek Poplawski
    CC: Jamal Hadi Salim
    CC: Stephen Hemminger
    Signed-off-by: David S. Miller

    Eric Dumazet
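
    A sketch of the described ordering, assuming the stab pointer is
    RCU-protected and a helper computes the packet length from it; names
    are illustrative:

        static void example_xmit_prep(struct sk_buff *skb, struct Qdisc *q)
        {
                struct qdisc_size_table *stab;

                /* done before spin_lock(qdisc_lock(q)), shrinking the
                 * locked section in __dev_xmit_skb() */
                rcu_read_lock();
                stab = rcu_dereference(q->stab);
                if (stab)
                        __qdisc_calculate_pkt_len(skb, stab);
                rcu_read_unlock();
        }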
     
  • In commit 371121057607e (net: QDISC_STATE_RUNNING dont need atomic bit
    ops) I moved the QDISC_STATE_RUNNING flag to the __state container,
    located in the cache line containing the qdisc lock and often-dirtied
    fields.

    I now move the TCQ_F_THROTTLED bit too, so that the first cache line
    stays read-mostly and shared by all cpus. This should speed up
    HTB/CBQ, for example.

    Not using test_bit()/__clear_bit()/__test_and_set_bit() allows the
    __state container to be an "unsigned int", reducing Qdisc size by
    8 bytes.

    Introduce helpers to hide the implementation details (see the sketch
    below).

    Signed-off-by: Eric Dumazet
    CC: Patrick McHardy
    CC: Jesper Dangaard Brouer
    CC: Jarek Poplawski
    CC: Jamal Hadi Salim
    CC: Stephen Hemminger
    Signed-off-by: David S. Miller

    Eric Dumazet
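
    A sketch of the kind of helpers described, assuming a plain
    "unsigned int" __state container manipulated without atomic ops
    (this is the variant whose THROTTLED handling is fixed again in the
    24 Mar, 2011 entry above); names are illustrative:

        static inline bool qdisc_is_throttled(const struct Qdisc *qdisc)
        {
                return qdisc->__state & __QDISC___STATE_THROTTLED;
        }

        static inline void qdisc_throttled(struct Qdisc *qdisc)
        {
                qdisc->__state |= __QDISC___STATE_THROTTLED;
        }

        static inline void qdisc_unthrottled(struct Qdisc *qdisc)
        {
                qdisc->__state &= ~__QDISC___STATE_THROTTLED;
        }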
     

11 Jan, 2011

1 commit

  • HTB already takes into account that an skb may be segmented (GSO)
    when updating stats. Generalize this to all schedulers.

    They should use the qdisc_bstats_update() helper instead of
    manipulating bstats.bytes and bstats.packets directly.

    Add a bstats_update() helper too, for classes that use
    gnet_stats_basic_packed fields. Both are sketched below.

    Note : right now, the TCQ_F_CAN_BYPASS shortcut can be taken only if
    no stab is set up on the qdisc.

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
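
    A sketch of the two helpers described, counting GSO segments rather
    than one packet per skb; based on the description:

        static inline void bstats_update(struct gnet_stats_basic_packed *bstats,
                                         const struct sk_buff *skb)
        {
                bstats->bytes += qdisc_pkt_len(skb);
                bstats->packets += skb_is_gso(skb) ?
                                   skb_shinfo(skb)->gso_segs : 1;
        }

        static inline void qdisc_bstats_update(struct Qdisc *sch,
                                               const struct sk_buff *skb)
        {
                bstats_update(&sch->bstats, skb);
        }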
     


17 Dec, 2010

1 commit

  • Add dev_close_many and dev_deactivate_many to factor out another
    sync-rcu operation on the netdevice unregister path, as sketched
    below.

    $ modprobe dummy numdummies=10000
    $ ip link set dev dummy* up
    $ time rmmod dummy

             Without the patch    With the patch

    real     0m 24.63s            0m 5.15s
    user     0m 0.00s             0m 0.00s
    sys      0m 6.05s             0m 5.14s

    Signed-off-by: Octavian Purdila
    Signed-off-by: David S. Miller

    Octavian Purdila
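
    A sketch of the batching idea: pay one RCU grace period for a whole
    list of devices instead of one per device. The per-device steps are
    hypothetical placeholders; only the structure is the point:

        void dev_deactivate_many(struct list_head *head)
        {
                struct net_device *dev;

                list_for_each_entry(dev, head, unreg_list)
                        deactivate_queues(dev);     /* hypothetical per-device step */

                synchronize_rcu();                  /* single grace period for all */

                list_for_each_entry(dev, head, unreg_list)
                        wait_and_reset_qdiscs(dev); /* hypothetical cleanup */
        }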
     


03 Jul, 2010

2 commits

  • Reducing real_num_tx_queues needs to flush the qdisc, otherwise
    skbs with queue_mappings greater than real_num_tx_queues can
    be sent to the underlying driver.

    The flow for this is:

    dev_queue_xmit()
        dev_pick_tx()
            skb_tx_hash()            => hash using real_num_tx_queues
            skb_set_queue_mapping()
        ...
        qdisc_enqueue_root()         => enqueue skb on txq from hash
    ...
    dev->real_num_tx_queues -= n
    ...
    sch_direct_xmit()
        dev_hard_start_xmit()
            ndo_start_xmit(skb, dev) => skb queue set with old hash

    skbs are enqueued on the qdisc with skb->queue_mapping set in the
    range [0, real_num_tx_queues). When the driver decreases
    real_num_tx_queues, skbs may be dequeued from the qdisc with a
    queue_mapping greater than real_num_tx_queues.

    This fixes a case in ixgbe where this was occurring with DCB
    and FCoE. Because the driver uses queue_mapping to map skbs to
    tx descriptor rings, we could potentially map skbs to rings that
    no longer exist.

    Signed-off-by: John Fastabend
    Tested-by: Ross Brattain
    Signed-off-by: Jeff Kirsher
    Signed-off-by: David S. Miller

    John Fastabend
     
  • When calling qdisc_reset() the qdisc lock needs to be held. There
    is at least one driver, i4l, which was calling it without holding
    the lock. Add the locking here, as sketched below.

    Signed-off-by: John Fastabend
    Signed-off-by: Jeff Kirsher
    Signed-off-by: David S. Miller

    John Fastabend
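
    A minimal sketch of the locking pattern described, taking the
    qdisc's root lock around qdisc_reset():

        static void example_reset(struct Qdisc *qdisc)
        {
                spinlock_t *root_lock = qdisc_lock(qdisc);

                spin_lock_bh(root_lock);
                qdisc_reset(qdisc);
                spin_unlock_bh(root_lock);
        }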
     

29 Jun, 2010

1 commit

  • don't clone skb when skb isn't shared

    When the tcf_action is TC_ACT_STOLEN and the skb isn't shared, we
    don't need to clone a new skb: since the skb would otherwise be freed
    after this function returns, we can use it freely once we take a
    reference to it (see the sketch below).

    Signed-off-by: Changli Gao
    Signed-off-by: Jamal Hadi Salim
    Signed-off-by: David S. Miller

    Changli Gao
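
    A sketch of the optimization described: when the action steals the
    packet and the skb is not shared, take a reference instead of
    copying. The helper name is illustrative:

        static inline struct sk_buff *act_clone_sketch(struct sk_buff *skb,
                                                       gfp_t gfp_mask, int action)
        {
                if (action == TC_ACT_STOLEN && !skb_shared(skb))
                        return skb_get(skb);    /* refcount bump, no copy */
                return skb_clone(skb, gfp_mask);
        }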
     

02 Jun, 2010

3 commits

  • When many cpus compete for sending frames on a given qdisc, the qdisc
    spinlock suffers from very high contention.

    The cpu owning the __QDISC_STATE_RUNNING bit has the same priority as
    the others when acquiring the lock, and cannot dequeue packets fast
    enough, since it must wait for this lock for each dequeued packet.

    One solution to this problem is to force all cpus to spin on a second
    lock before trying to get the main lock, when/if they see
    __QDISC_STATE_RUNNING already set (see the sketch below).

    The owning cpu then competes with at most one other cpu for the main
    lock, allowing for a higher dequeue rate.

    Based on a previous patch from Alexander Duyck. I added the heuristic
    to avoid the atomic op in the fast path, and put the new lock far
    away from the cache line used by the dequeue worker. Also try to
    release the busylock as late as possible.

    Tests with the following script gave a boost from ~50,000 pps to
    ~600,000 pps on a dual quad core machine (E5450 @3.00GHz), tg3 driver.
    (A single netperf flow can reach ~800,000 pps on this platform.)

    for j in `seq 0 3`; do
        for i in `seq 0 7`; do
            netperf -H 192.168.0.1 -t UDP_STREAM -l 60 -N -T $i -- -m 6 &
        done
    done

    Signed-off-by: Eric Dumazet
    Acked-by: Alexander Duyck
    Signed-off-by: David S. Miller

    Eric Dumazet
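
    A sketch of the two-lock idea as it would sit in __dev_xmit_skb();
    the 'busylock' field and the contention heuristic follow the
    description, details are illustrative:

        static void example_xmit(struct sk_buff *skb, struct Qdisc *q,
                                 spinlock_t *root_lock)
        {
                /* heuristic: only touch busylock when the qdisc is
                 * already being run by another cpu */
                bool contended = qdisc_is_running(q);

                if (unlikely(contended))
                        spin_lock(&q->busylock); /* all but one contender park here */

                spin_lock(root_lock);
                /* ... enqueue skb and/or run the qdisc ... */
                spin_unlock(root_lock);

                if (unlikely(contended))
                        spin_unlock(&q->busylock); /* released as late as possible */
        }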
     
  • __QDISC_STATE_RUNNING is always changed while the qdisc lock is held.

    We can avoid two atomic operations in the xmit path if we move this
    bit into a new __state container.

    The location of this __state container is carefully chosen so that
    the fast path only dirties one qdisc cache line.

    The THROTTLED bit could later be moved into this __state location
    too, to avoid dirtying the first qdisc cache line.

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     
  • Define three helpers to manipulate the QDISC_STATE_RUNNING flag,
    which a second patch will move to another location (see the sketch
    below).

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
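
    A sketch of the three helpers described; at this stage they wrap the
    existing atomic bit in q->state (the follow-up patch above moves it
    into the non-atomic __state container):

        static inline bool qdisc_is_running(struct Qdisc *qdisc)
        {
                return test_bit(__QDISC_STATE_RUNNING, &qdisc->state);
        }

        static inline bool qdisc_run_begin(struct Qdisc *qdisc)
        {
                return !test_and_set_bit(__QDISC_STATE_RUNNING,
                                         &qdisc->state);
        }

        static inline void qdisc_run_end(struct Qdisc *qdisc)
        {
                clear_bit(__QDISC_STATE_RUNNING, &qdisc->state);
        }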
     

02 Apr, 2010

1 commit

  • One of my test machines got a deadlock during "tc" sessions,
    adding/deleting classes & filters, using traffic estimators.

    After some analysis, I believe we have a potential use-after-free
    case in est_timer():

        spin_lock(e->stats_lock);   /* << HERE >> */
        read_lock(&est_lock);
        if (e->bstats == NULL)      /* << TEST >> */
                goto skip;

    The test is done a bit late, because after the estimator is killed,
    and before the RCU grace period has elapsed, we might already have
    freed/reused the memory that e->stats_lock points to (some
    qdisc->q.lock).

    A possible fix is to respect an RCU grace period at Qdisc dismantle
    time.

    On 64bit, sizeof(struct Qdisc) is exactly 192 bytes. Adding 16 bytes
    to it (for struct rcu_head) is a problem because it might change
    performance, given QDISC_ALIGNTO is 32 bytes.

    This is why I also change QDISC_ALIGNTO to 64 bytes, to satisfy most
    current alignment requirements.

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

29 Jan, 2010

1 commit

  • This adds an additional queueing strategy, called pfifo_head_drop,
    that on overflow removes the oldest skb in the queue - the head
    element - instead of the newly arriving skb (tail drop). Removing the
    oldest skb in congested situations is useful for sensor network
    environments, where newer packets carry the more up-to-date
    information. See the sketch below.

    Reviewed-by: Florian Westphal
    Acked-by: Patrick McHardy
    Signed-off-by: Hagen Paul Pfeifer
    Signed-off-by: David S. Miller

    Hagen Paul Pfeifer
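
    A sketch of head-drop enqueue semantics, assuming a simple skb-list
    backed fifo with a packet limit; stats bookkeeping is elided:

        static int head_drop_enqueue_sketch(struct sk_buff *skb,
                                            struct Qdisc *sch, u32 limit)
        {
                if (unlikely(skb_queue_len(&sch->q) >= limit)) {
                        /* queue full: drop the oldest packet (head)... */
                        kfree_skb(__skb_dequeue(&sch->q));
                        sch->qstats.drops++;
                }
                /* ...so the newest packet always gets in */
                return qdisc_enqueue_tail(skb, sch);
        }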
     

04 Nov, 2009

1 commit

  • This cleanup patch puts struct/union/enum opening braces on the
    first line, to ease grep games.

    struct something
    {

    becomes:

    struct something {

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

15 Sep, 2009

1 commit

  • After the recent mq change there is the new select_queue qdisc class
    method used in tc_modify_qdisc, but it works correctly only for direct
    children of the mq qdisc. Grandchildren always get the first tx queue,
    which would give wrong qdisc_root etc. results (e.g. for sch_htb as a
    child of sch_prio). This patch fixes it by using the parent's
    dev_queue for such grandchild qdiscs. The select_queue method's return
    type is changed, BTW.

    With feedback from: Patrick McHardy

    Signed-off-by: Jarek Poplawski
    Signed-off-by: David S. Miller

    Jarek Poplawski
     

10 Sep, 2009

1 commit

  • When new child qdiscs are attached to the mq qdisc, they are actually
    attached as root qdiscs to the device queues. The lock selection for
    new estimators incorrectly picks the root lock of the existing,
    to-be-replaced qdisc, which results in a use-after-free once the old
    qdisc has been destroyed.

    Mark mq qdisc instances with a new flag and treat qdiscs attached to
    mq as children, similar to regular root qdiscs.

    Additionally, prevent estimators from being attached to the mq qdisc
    itself, since it only updates its byte and packet counters during
    dumps.

    Signed-off-by: Patrick McHardy
    Signed-off-by: David S. Miller

    Patrick McHardy
     

06 Sep, 2009

2 commits

  • This patch adds a classful dummy scheduler which can be used as the
    root qdisc for multiqueue devices and exposes each device queue as a
    child class.

    This allows queues to be addressed individually and grafted similarly
    to regular classes. Additionally it presents an accumulated view of
    the statistics of all real root qdiscs in the dummy root.

    Two new callbacks are added to qdisc_ops and qdisc_class_ops (see the
    sketch below):

    - cl_ops->select_queue selects the tx queue number for new child
      classes.

    - qdisc_ops->attach() overrides root qdisc device grafting to attach
      non-shared qdiscs to the queues.

    Signed-off-by: Patrick McHardy
    Signed-off-by: David S. Miller

    David S. Miller
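
    A sketch of where the two new hooks live; the ops structures are
    heavily simplified stand-ins, not the kernel's full definitions:

        struct Qdisc;
        struct tcmsg;
        struct netdev_queue;

        struct Qdisc_class_ops_sketch {
                /* pick the tx queue for new child classes */
                struct netdev_queue *(*select_queue)(struct Qdisc *sch,
                                                     struct tcmsg *tcm);
        };

        struct Qdisc_ops_sketch {
                /* override root-qdisc device grafting to attach
                 * non-shared qdiscs to the individual queues */
                void (*attach)(struct Qdisc *sch);
        };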
     
  • It will be used in a following patch by the multiqueue qdisc.

    Signed-off-by: Patrick McHardy
    Signed-off-by: David S. Miller

    Patrick McHardy
     


18 Aug, 2009

1 commit

  • In 5e140dfc1fe87eae27846f193086724806b33c7d "net: reorder struct Qdisc
    for better SMP performance" the definition of struct gnet_stats_basic
    changed incompatibly, as copies of this struct are shipped to
    userland via netlink.

    Restoring the old behavior is not welcome, for performance reasons.

    The fix is to use a private structure in the kernel and teach
    gnet_stats_copy_basic() to convert from the kernel layout to the
    userland one, using the legacy structure (struct gnet_stats_basic),
    as sketched below.

    Based on a report and initial patch from Michael Spang.

    Reported-by: Michael Spang
    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
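
    A sketch of the conversion described: the kernel keeps its own
    layout and copies the counters into the legacy netlink struct before
    shipping them to userland. The copy helper is illustrative:

        static void copy_basic_to_legacy(const struct gnet_stats_basic_packed *b,
                                         struct gnet_stats_basic *legacy)
        {
                legacy->bytes   = b->bytes;
                legacy->packets = b->packets;
        }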
     

07 Aug, 2009

1 commit

  • dev_queue_xmit() enqueues an skb and calls qdisc_run(), which
    dequeues the skb and xmits it. In most cases, the skb that is
    enqueued is the same one that is dequeued (unless the queue gets
    stopped, or multiple cpus write to the same queue and end up racing
    with qdisc_run). For the default qdisc, we can remove the redundant
    enqueue/dequeue and simply xmit the skb, since the default qdisc is
    work-conserving.

    The patch uses a new flag - TCQ_F_CAN_BYPASS - to identify the
    default fast queue. The controversial part of the patch is
    incrementing qlen when an skb is requeued; this is to avoid checks
    like the second (">>") line below:

    + } else if ((q->flags & TCQ_F_CAN_BYPASS) && !qdisc_qlen(q) &&
    >>           !q->gso_skb &&
    +            !test_and_set_bit(__QDISC_STATE_RUNNING, &q->state)) {

    Results of 2 hours of testing for multiple netperf sessions (1, 2,
    4, 8, 12 sessions on a 4-cpu system-X). The BW numbers are aggregate
    Mb/s across iterations, tested with this version on System-X boxes
    with Chelsio 10gbps cards:

    ----------------------------------
    Size   |  ORG BW     NEW BW   |
    ----------------------------------
    128K   |  156964     159381   |
    256K   |  158650     162042   |
    ----------------------------------

    Changes from ver1:

    1. Move the sch_direct_xmit declaration from sch_generic.h to
       pkt_sched.h.
    2. Update qdisc basic statistics for the direct xmit path.
    3. Set qlen to zero in qdisc_reset.
    4. Changed some function names to more meaningful ones.

    A sketch of the resulting bypass path follows below.

    Signed-off-by: Krishna Kumar
    Signed-off-by: David S. Miller

    Krishna Kumar
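
    A sketch of the bypass fast path, in the style of __dev_xmit_skb();
    details are illustrative, not copied from the patch:

        if ((q->flags & TCQ_F_CAN_BYPASS) && !qdisc_qlen(q) &&
            !test_and_set_bit(__QDISC_STATE_RUNNING, &q->state)) {
                /* queue empty and we own RUNNING: xmit directly,
                 * skipping the enqueue/dequeue round trip (basic
                 * stats are updated for this path too) */
                if (sch_direct_xmit(skb, q, dev, txq, root_lock))
                        __qdisc_run(q); /* more work arrived meanwhile */
                else
                        clear_bit(__QDISC_STATE_RUNNING, &q->state);
                rc = NET_XMIT_SUCCESS;
        } else {
                rc = q->enqueue(skb, q);
                qdisc_run(q);
        }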
     

20 Mar, 2009

1 commit

  • dev_queue_xmit() needs to dirty the fields "state", "q", "bstats"
    and "qstats".

    On x86_64, they currently span three cache lines, involving more
    cache line ping-pongs than necessary and making the queue spinlock
    held for longer.

    We can reduce this to one cache line by grouping all read-mostly
    fields at the beginning of the structure. (Or should I say, all
    highly modified fields at the end. :))

    Before patch :

    offsetof(struct Qdisc, state)=0x38
    offsetof(struct Qdisc, q)=0x48
    offsetof(struct Qdisc, bstats)=0x80
    offsetof(struct Qdisc, qstats)=0x90
    sizeof(struct Qdisc)=0xc8

    After patch :

    offsetof(struct Qdisc, state)=0x80
    offsetof(struct Qdisc, q)=0x88
    offsetof(struct Qdisc, bstats)=0xa0
    offsetof(struct Qdisc, qstats)=0xac
    sizeof(struct Qdisc)=0xc0

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     


14 Nov, 2008

1 commit

  • After implementing qdisc->ops->peek() and changing sch_netem into a
    classless qdisc, there are no more qdisc->ops->requeue() users. This
    patch removes this method with its wrappers (qdisc_requeue()), and
    also the now-unused qdisc->requeue structure. There are a few minor
    fixes of warnings (htb_enqueue()) and comments, BTW.

    The idea to kill ->requeue() and a similar patch were first developed
    by David S. Miller.

    Signed-off-by: Jarek Poplawski
    Signed-off-by: David S. Miller

    Jarek Poplawski
     

06 Nov, 2008

1 commit

  • A packet dequeued and stored as gso_skb in qdisc_peek_dequeued()
    should be seen as part of the queue for sch->q.qlen queries until it
    is really dequeued with qdisc_dequeue_peeked(), so qlen needs
    additional updating in these functions. (Updating qstats.backlog
    shouldn't matter here.)

    Signed-off-by: Jarek Poplawski
    Signed-off-by: David S. Miller

    Jarek Poplawski
     

31 Oct, 2008

2 commits

  • This patch adds a qdisc_peek_dequeued() wrapper that emulates the
    peek method by calling qdisc->dequeue() and storing the "peeked" skb
    in qdisc->gso_skb until it is actually dequeued. This is mainly for
    compatibility reasons, not to break some unusual configs, because
    peeking is expected by non-work-conserving parent qdiscs to query
    work-conserving child qdiscs (see the sketch below).

    This implementation requires using the qdisc_dequeue_peeked() wrapper
    instead of calling qdisc->dequeue() directly for all qdiscs ever
    queried with qdisc->ops->peek() or qdisc_peek_dequeued().

    Signed-off-by: Jarek Poplawski
    Signed-off-by: David S. Miller

    Jarek Poplawski
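
    A sketch of the two wrappers described, including the qlen
    adjustment from the follow-up fix above (the peeked skb parked in
    gso_skb still counts as queued):

        static inline struct sk_buff *qdisc_peek_dequeued(struct Qdisc *sch)
        {
                /* gso_skb can be reused here: peek isn't called for
                 * requeued skbs */
                if (!sch->gso_skb) {
                        sch->gso_skb = sch->dequeue(sch);
                        if (sch->gso_skb)
                                sch->q.qlen++;
                }
                return sch->gso_skb;
        }

        static inline struct sk_buff *qdisc_dequeue_peeked(struct Qdisc *sch)
        {
                struct sk_buff *skb = sch->gso_skb;

                if (skb) {
                        sch->gso_skb = NULL;
                        sch->q.qlen--;
                } else {
                        skb = sch->dequeue(sch);
                }
                return skb;
        }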
     
  • From: Patrick McHardy

    Just as a demonstration of how easy adding a peek operation to the
    work-conserving qdiscs actually is. In many cases it doesn't need to
    keep or change any internal state, thanks to the guarantee that the
    packet will either be dequeued or, if another packet arrives, the
    upper qdisc will immediately ->peek again to reevaluate the state.

    (This is Patrick's patch, only slightly modified.)

    Signed-off-by: Jarek Poplawski
    Signed-off-by: David S. Miller

    Patrick McHardy
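
    For a simple fifo, for example, the peek operation reduces to
    looking at the list head (illustrative):

        static struct sk_buff *fifo_peek_sketch(struct Qdisc *sch)
        {
                return skb_peek(&sch->q);
        }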