03 Jul, 2010

2 commits

  • Reducing real_num_tx_queues needs to flush the qdisc, otherwise
    skbs with a queue_mapping greater than real_num_tx_queues can
    be sent to the underlying driver.

    The flow for this is:

    dev_queue_xmit()
      dev_pick_tx()
        skb_tx_hash()            => hash using real_num_tx_queues
        skb_set_queue_mapping()
      ...
      qdisc_enqueue_root()       => enqueue skb on txq from hash
    ...
    dev->real_num_tx_queues -= n
    ...
    sch_direct_xmit()
      dev_hard_start_xmit()
        ndo_start_xmit(skb, dev) => skb queue set with old hash

    skbs are enqueued on the qdisc with skb->queue_mapping set so
    that 0 <= queue_mapping < real_num_tx_queues. When the driver
    decreases real_num_tx_queues, skbs may be dequeued from the
    qdisc with a queue_mapping greater than the new
    real_num_tx_queues.

    This fixes a case in ixgbe where this was occurring with DCB
    and FCoE. Because the driver uses queue_mapping to map skbs to
    tx descriptor rings, we could otherwise map skbs to rings that
    no longer exist. A toy model of the failure follows below.

    Signed-off-by: John Fastabend
    Tested-by: Ross Brattain
    Signed-off-by: Jeff Kirsher
    Signed-off-by: David S. Miller

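    A minimal user-space model of the failure (toy code: skb_tx_hash()
    here is just a modulo stand-in for the kernel's hash, and the
    variables only mirror the changelog's names):

    #include <stdio.h>

    static unsigned int real_num_tx_queues = 8;

    /* stand-in for skb_tx_hash(): pick a queue from the current count */
    static unsigned int skb_tx_hash(unsigned int flow_hash)
    {
        return flow_hash % real_num_tx_queues;
    }

    int main(void)
    {
        /* enqueue time: hash against 8 queues -> mapping 6 */
        unsigned int queue_mapping = skb_tx_hash(14);

        /* driver shrinks the queue count while the skb waits in the qdisc */
        real_num_tx_queues = 4;

        /* dequeue time: the stale mapping now exceeds the queue count */
        if (queue_mapping >= real_num_tx_queues)
            printf("stale queue_mapping %u >= real_num_tx_queues %u\n",
                   queue_mapping, real_num_tx_queues);
        return 0;
    }
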
  • When calling qdisc_reset() the qdisc lock needs to be held. At
    least one driver (i4l) is currently using this without holding
    the lock. Add the locking here; a toy sketch of the rule
    follows below.

    Signed-off-by: John Fastabend
    Signed-off-by: Jeff Kirsher
    Signed-off-by: David S. Miller

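    A toy user-space model of the rule the patch enforces: any path
    that resets the queue must hold the same lock the enqueue/dequeue
    paths take (pthread mutexes stand in for the qdisc spinlock; not
    the kernel code):

    #include <pthread.h>
    #include <stdio.h>

    struct toy_qdisc {
        pthread_mutex_t lock;   /* stand-in for qdisc_lock(qdisc) */
        int qlen;
    };

    static void toy_qdisc_reset(struct toy_qdisc *q)
    {
        pthread_mutex_lock(&q->lock);   /* the locking this patch adds */
        q->qlen = 0;                    /* safe: no concurrent dequeuer */
        pthread_mutex_unlock(&q->lock);
    }

    int main(void)
    {
        struct toy_qdisc q = { PTHREAD_MUTEX_INITIALIZER, 3 };

        toy_qdisc_reset(&q);
        printf("qlen after reset: %d\n", q.qlen);
        return 0;
    }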

29 Jun, 2010

1 commit

  • don't clone skb when skb isn't shared

    When the tcf_action is TC_ACT_STOLEN and the skb isn't shared,
    we don't need to clone a new skb. As the skb will be freed after
    this function returns, we can use it freely once we take a
    reference to it (a toy model of the check follows below).

    Signed-off-by: Changli Gao
    ----
    include/net/sch_generic.h | 11 +++++++++--
    net/sched/act_mirred.c | 6 +++---
    2 files changed, 12 insertions(+), 5 deletions(-)
    Signed-off-by: Jamal Hadi Salim
    Signed-off-by: David S. Miller

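    A toy model of the clone-avoidance check (a plain refcount stands
    in for skb->users; skb_shared() and the clone are modeled, not the
    kernel functions):

    #include <stdio.h>

    struct toy_skb {
        int users;                     /* models skb->users */
    };

    static int skb_shared(const struct toy_skb *skb)
    {
        return skb->users > 1;
    }

    static struct toy_skb *grab_for_mirror(struct toy_skb *skb)
    {
        if (!skb_shared(skb)) {
            skb->users++;              /* sole owner: just take a ref */
            return skb;
        }
        printf("shared: must clone\n");
        return NULL;                   /* clone elided in this sketch */
    }

    int main(void)
    {
        struct toy_skb skb = { 1 };

        if (grab_for_mirror(&skb))
            printf("reused without cloning, users=%d\n", skb.users);
        return 0;
    }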

02 Jun, 2010

3 commits

  • When many cpus compete for sending frames on a given qdisc, the qdisc
    spinlock suffers from very high contention.

    The cpu owning the __QDISC_STATE_RUNNING bit has the same
    priority as the others to acquire the lock, and cannot dequeue
    packets fast enough, since it must wait for this lock for each
    dequeued packet.

    One solution to this problem is to force all cpus to spin on a
    second lock before trying to take the main lock, when/if they
    see __QDISC_STATE_RUNNING already set.

    The owning cpu then competes with at most one other cpu for the
    main lock, allowing a higher dequeue rate.

    Based on a previous patch from Alexander Duyck. I added the
    heuristic to avoid the atomic in the fast path, and put the new
    lock far away from the cache line used by the dequeue worker.
    The busylock is also released as late as possible; see the
    sketch after this entry.

    Tests with the following script gave a boost from ~50,000 pps
    to ~600,000 pps on a dual quad core machine (E5450 @3.00GHz),
    tg3 driver. (A single netperf flow can reach ~800,000 pps on
    this platform.)

    for j in `seq 0 3`; do
      for i in `seq 0 7`; do
        netperf -H 192.168.0.1 -t UDP_STREAM -l 60 -N -T $i -- -m 6 &
      done
    done

    Signed-off-by: Eric Dumazet
    Acked-by: Alexander Duyck
    Signed-off-by: David S. Miller

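    A user-space sketch of the two-lock idea, with pthread mutexes
    standing in for the kernel spinlocks (toy code; the 'contended'
    flag and toy_xmit() are illustrative, not the kernel API):

    #include <pthread.h>

    struct toy_qdisc {
        pthread_mutex_t lock;      /* main qdisc lock (heavily contended) */
        pthread_mutex_t busylock;  /* contenders serialize here first */
    };

    /* 'contended' models the heuristic: only take busylock when the
     * RUNNING owner is already busy dequeueing */
    static void toy_xmit(struct toy_qdisc *q, int contended)
    {
        if (contended)
            pthread_mutex_lock(&q->busylock);

        pthread_mutex_lock(&q->lock);
        /* ... enqueue the skb, maybe run the dequeue loop ... */
        pthread_mutex_unlock(&q->lock);

        /* released as late as possible, per the changelog */
        if (contended)
            pthread_mutex_unlock(&q->busylock);
    }

    int main(void)
    {
        struct toy_qdisc q = {
            PTHREAD_MUTEX_INITIALIZER, PTHREAD_MUTEX_INITIALIZER
        };

        toy_xmit(&q, 1);
        return 0;
    }
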
  • __QDISC_STATE_RUNNING is always changed while the qdisc lock is
    held.

    We can avoid two atomic operations in the xmit path if we move
    this bit into a new __state container.

    The location of this __state container is carefully chosen so
    that the fast path only dirties one qdisc cache line.

    The THROTTLED bit could later be moved into this __state
    location too, to avoid dirtying the first qdisc cache line.

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

  • Define three helpers to manipulate the QDISC_STATE_RUNNING
    flag, which a second patch will move to another location. A
    user-space model of the helpers follows below.

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

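    In the kernel these helpers are qdisc_is_running(),
    qdisc_run_begin() and qdisc_run_end(), built on test_bit()/
    test_and_set_bit()/clear_bit() against qdisc->state. Below is a
    user-space model using C11 atomics on a toy struct:

    #include <stdatomic.h>
    #include <stdbool.h>
    #include <stdio.h>

    struct toy_qdisc { atomic_bool running; };

    static bool qdisc_is_running(struct toy_qdisc *q)
    {
        return atomic_load(&q->running);
    }

    /* true if the caller just became the single dequeue owner */
    static bool qdisc_run_begin(struct toy_qdisc *q)
    {
        return !atomic_exchange(&q->running, true);
    }

    static void qdisc_run_end(struct toy_qdisc *q)
    {
        atomic_store(&q->running, false);
    }

    int main(void)
    {
        struct toy_qdisc q = { false };

        if (qdisc_run_begin(&q)) {          /* we own the dequeue loop */
            printf("running: %d\n", qdisc_is_running(&q));
            qdisc_run_end(&q);
        }
        return 0;
    }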

02 Apr, 2010

1 commit

  • One of my test machines got a deadlock during "tc" sessions,
    adding/deleting classes & filters, using traffic estimators.

    After some analysis, I believe we have a potential use after
    free case in est_timer():

    spin_lock(e->stats_lock);  << HERE >>
    read_lock(&est_lock);
    if (e->bstats == NULL)     << TEST >>
        goto skip;

    The test is done a bit late: after the estimator is killed, and
    before the RCU grace period has elapsed, we might already have
    freed/reused the memory that e->stats_lock points to (some
    qdisc->q.lock).

    A possible fix is to respect an RCU grace period at Qdisc
    dismantle time.

    On 64bit, sizeof(struct Qdisc) is exactly 192 bytes. Adding 16
    bytes to it (for struct rcu_head) is a problem because it might
    change performance, given QDISC_ALIGNTO is 32 bytes.

    This is why I also change QDISC_ALIGNTO to 64 bytes, to satisfy
    most current alignment requirements. The arithmetic is shown
    below.

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

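    The alignment arithmetic from the last two paragraphs, as a small
    standalone program (the macro shape mirrors the kernel's
    QDISC_ALIGN(); the numbers are the ones quoted above):

    #include <stdio.h>

    #define ALIGN_TO(len, align) (((len) + (align) - 1) & ~((align) - 1))

    int main(void)
    {
        unsigned int sz = 192 + 16;   /* struct Qdisc + struct rcu_head */

        /* with the old 32-byte QDISC_ALIGNTO */
        printf("208 aligned to 32: %u\n", ALIGN_TO(sz, 32u));   /* 224 */
        /* with the new 64-byte QDISC_ALIGNTO */
        printf("208 aligned to 64: %u\n", ALIGN_TO(sz, 64u));   /* 256 */
        return 0;
    }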

29 Jan, 2010

1 commit

  • This adds an additional queueing strategy, called
    pfifo_head_drop, which on overflow removes the oldest skb in
    the queue - the head element - instead of the last skb (the
    tail). Removing the oldest skb in congested situations is
    useful for sensor-network environments, where newer packets
    carry the more valuable information. A toy contrast of the two
    drop policies follows below.

    Reviewed-by: Florian Westphal
    Acked-by: Patrick McHardy
    Signed-off-by: Hagen Paul Pfeifer
    Signed-off-by: David S. Miller

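    A toy fixed-size FIFO showing the head-drop policy: on overflow,
    evict the oldest element so the newest data survives (plain
    arrays stand in for the skb queue; not the kernel code):

    #include <stdio.h>

    #define LIMIT 4

    static int q[LIMIT];
    static int head, qlen;

    static void enqueue_head_drop(int pkt)
    {
        if (qlen == LIMIT) {              /* full: drop the head (oldest) */
            head = (head + 1) % LIMIT;
            qlen--;
        }
        q[(head + qlen) % LIMIT] = pkt;   /* append at the tail */
        qlen++;
    }

    int main(void)
    {
        int pkt, i;

        for (pkt = 1; pkt <= 6; pkt++)    /* 6 packets into 4 slots */
            enqueue_head_drop(pkt);

        for (i = 0; i < qlen; i++)        /* prints 3 4 5 6: newest kept */
            printf("%d ", q[(head + i) % LIMIT]);
        printf("\n");
        return 0;
    }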

04 Nov, 2009

1 commit

  • This cleanup patch puts struct/union/enum opening braces,
    in first line to ease grep games.

    struct something
    {

    becomes :

    struct something {

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller


15 Sep, 2009

1 commit

  • After the recent mq change there is a new select_queue qdisc
    class method, used in tc_modify_qdisc(), but it works correctly
    only for direct children of the mq qdisc. Grandchildren always
    get the first tx queue, which would give wrong qdisc_root()
    etc. results (e.g. for sch_htb as a child of sch_prio). This
    patch fixes it by using the parent's dev_queue for such
    grandchild qdiscs. The select_queue method's return type is
    changed as well.

    With feedback from: Patrick McHardy

    Signed-off-by: Jarek Poplawski
    Signed-off-by: David S. Miller


10 Sep, 2009

1 commit

  • When new child qdiscs are attached to the mq qdisc, they are actually
    attached as root qdiscs to the device queues. The lock selection for
    new estimators incorrectly picks the root lock of the existing and
    to be replaced qdisc, which results in a use-after-free once the old
    qdisc has been destroyed.

    Mark mq qdisc instances with a new flag and treat qdiscs attached to
    mq as children similar to regular root qdiscs.

    Additionally prevent estimators from being attached to the mq qdisc
    itself since it only updates its byte and packet counters during dumps.

    Signed-off-by: Patrick McHardy
    Signed-off-by: David S. Miller


06 Sep, 2009

2 commits

  • This patch adds a classful dummy scheduler which can be used as root qdisc
    for multiqueue devices and exposes each device queue as a child class.

    This allows addressing queues individually and grafting them
    similarly to regular classes. Additionally it presents an
    accumulated view of the statistics of all real root qdiscs in
    the dummy root.

    Two new callbacks are added to the qdisc_ops and qdisc_class_ops:

    - cl_ops->select_queue selects the tx queue number for new child classes.

    - qdisc_ops->attach() overrides root qdisc device grafting to attach
    non-shared qdiscs to the queues.

    Signed-off-by: Patrick McHardy
    Signed-off-by: David S. Miller

  • It will be used by the multiqueue qdisc in a following patch.

    Signed-off-by: Patrick McHardy
    Signed-off-by: David S. Miller


18 Aug, 2009

1 commit

  • In 5e140dfc1fe87eae27846f193086724806b33c7d "net: reorder
    struct Qdisc for better SMP performance" the definition of
    struct gnet_stats_basic changed incompatibly, as copies of this
    struct are shipped to userland via netlink.

    Restoring the old behavior is not welcome, for performance
    reasons.

    The fix is to use a private structure for the kernel, and to
    teach gnet_stats_copy_basic() to convert from the kernel layout
    to the userland one, using the legacy structure (struct
    gnet_stats_basic). A sketch of the shape of this fix follows
    below.

    Based on a report and initial patch from Michael Spang.

    Reported-by: Michael Spang
    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

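    A sketch of the fix's shape. The userland struct below matches
    the legacy struct gnet_stats_basic layout (u64 bytes, u32
    packets); the kernel-side struct and the copy helper are
    illustrative stand-ins, not the actual kernel code:

    #include <stdint.h>
    #include <stdio.h>

    struct gnet_stats_basic {          /* legacy ABI shipped via netlink */
        uint64_t bytes;
        uint32_t packets;
    };

    struct kernel_stats_basic {        /* kernel-private; free to change */
        uint64_t bytes;
        uint64_t packets;
    };

    /* convert at the ABI boundary, as gnet_stats_copy_basic() now does */
    static void copy_basic(struct gnet_stats_basic *d,
                           const struct kernel_stats_basic *s)
    {
        d->bytes = s->bytes;
        d->packets = (uint32_t)s->packets;
    }

    int main(void)
    {
        struct kernel_stats_basic ks = { 123456, 789 };
        struct gnet_stats_basic user;

        copy_basic(&user, &ks);
        printf("bytes=%llu packets=%u\n",
               (unsigned long long)user.bytes, user.packets);
        return 0;
    }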

07 Aug, 2009

1 commit

  • dev_queue_xmit() enqueues an skb and calls qdisc_run(), which
    dequeues the skb and xmits it. In most cases, the skb that is
    enqueued is the same one that is dequeued (unless the queue
    gets stopped, or multiple cpus write to the same queue and race
    with qdisc_run). For default qdiscs, we can remove the
    redundant enqueue/dequeue and simply xmit the skb, since the
    default qdisc is work-conserving.

    The patch uses a new flag - TCQ_F_CAN_BYPASS - to identify the
    default fast queue. The controversial part of the patch is
    incrementing qlen when an skb is requeued - this is to avoid
    checks like the second line below:

    + } else if ((q->flags & TCQ_F_CAN_BYPASS) && !qdisc_qlen(q) &&
    >> !q->gso_skb &&
    + !test_and_set_bit(__QDISC_STATE_RUNNING, &q->state)) {

    Results of 2 hours of testing with multiple netperf sessions
    (1, 2, 4, 8, 12 sessions on a 4-cpu system-X). The BW numbers
    are aggregate Mb/s across iterations, tested with this version
    on System-X boxes with Chelsio 10gbps cards (a toy model of the
    bypass test follows after this entry):

    ---------------------------
    Size |  ORG BW  |  NEW BW
    ---------------------------
    128K |  156964  |  159381
    256K |  158650  |  162042
    ---------------------------

    Changes from ver1:

    1. Move sch_direct_xmit declaration from sch_generic.h to
    pkt_sched.h
    2. Update qdisc basic statistics for direct xmit path.
    3. Set qlen to zero in qdisc_reset.
    4. Changed some function names to more meaningful ones.

    Signed-off-by: Krishna Kumar
    Signed-off-by: David S. Miller

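    A user-space model of the bypass test quoted above: if the
    qdisc may bypass, is empty, has no stashed gso_skb, and nobody
    owns the RUNNING bit, transmit directly instead of
    enqueue-then-dequeue (toy types; the flag value and the plain
    bool are illustrative):

    #include <stdbool.h>
    #include <stdio.h>

    #define TCQ_F_CAN_BYPASS 0x1       /* illustrative value */

    struct toy_qdisc {
        unsigned int flags;
        int qlen;
        void *gso_skb;
        bool running;                  /* models __QDISC_STATE_RUNNING */
    };

    static bool try_bypass(struct toy_qdisc *q)
    {
        if ((q->flags & TCQ_F_CAN_BYPASS) && q->qlen == 0 &&
            !q->gso_skb && !q->running) {
            q->running = true;         /* kernel: test_and_set_bit() */
            /* ... sch_direct_xmit() the skb, skipping the queue ... */
            q->running = false;
            return true;
        }
        return false;                  /* slow path: enqueue + qdisc_run() */
    }

    int main(void)
    {
        struct toy_qdisc q = { TCQ_F_CAN_BYPASS, 0, NULL, false };

        printf("bypassed: %d\n", try_bypass(&q));
        return 0;
    }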

20 Mar, 2009

1 commit

  • dev_queue_xmit() needs to dirty the fields "state", "q",
    "bstats" and "qstats".

    On x86_64, they currently span three cache lines, causing more
    cache line ping-pongs than necessary and lengthening the hold
    time of the queue spinlock.

    We can reduce this to one cache line by grouping all read-mostly
    fields at the beginning of the structure (or, should I say, all
    highly modified fields at the end :)). A toy illustration
    follows below.

    Before patch:

    offsetof(struct Qdisc, state)=0x38
    offsetof(struct Qdisc, q)=0x48
    offsetof(struct Qdisc, bstats)=0x80
    offsetof(struct Qdisc, qstats)=0x90
    sizeof(struct Qdisc)=0xc8

    After patch:

    offsetof(struct Qdisc, state)=0x80
    offsetof(struct Qdisc, q)=0x88
    offsetof(struct Qdisc, bstats)=0xa0
    offsetof(struct Qdisc, qstats)=0xac
    sizeof(struct Qdisc)=0xc0

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

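    A user-space illustration of the grouping idea: read-mostly
    fields first, with the per-packet-dirtied fields starting on
    their own 64-byte cache line so the xmit path touches a single
    line (toy struct; the real layout is struct Qdisc in
    sch_generic.h):

    #include <stddef.h>
    #include <stdio.h>

    struct toy_qdisc {
        /* read-mostly in the fast path */
        const void *ops;
        unsigned int flags;
        unsigned int limit;

        /* dirtied on every enqueue/dequeue */
        _Alignas(64) unsigned long state;
        unsigned int qlen;                    /* q */
        unsigned long long bytes, packets;    /* bstats */
        unsigned int drops;                   /* qstats */
    };

    int main(void)
    {
        printf("hot fields start at %#zx, sizeof=%zu\n",
               offsetof(struct toy_qdisc, state),
               sizeof(struct toy_qdisc));
        return 0;
    }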

14 Nov, 2008

1 commit

  • After implementing qdisc->ops->peek() and changing sch_netem
    into a classless qdisc, there are no more qdisc->ops->requeue()
    users. This patch removes this method with its wrappers
    (qdisc_requeue()), and also the now unused qdisc->requeue
    structure. A few minor fixes of warnings (htb_enqueue()) and
    comments are included as well.

    The idea to kill ->requeue() and a similar patch were first developed
    by David S. Miller.

    Signed-off-by: Jarek Poplawski
    Signed-off-by: David S. Miller


06 Nov, 2008

1 commit

  • A packet dequeued and stored as gso_skb in
    qdisc_peek_dequeued() should be seen as part of the queue for
    sch->q.qlen queries until it is really dequeued with
    qdisc_dequeue_peeked(), so qlen needs additional updating in
    these functions. (Updating qstats.backlog shouldn't matter
    here.)

    Signed-off-by: Jarek Poplawski
    Signed-off-by: David S. Miller


31 Oct, 2008

3 commits

  • This patch adds a qdisc_peek_dequeued() wrapper that emulates
    the peek method by calling qdisc->dequeue() and storing the
    "peeked" skb in qdisc->gso_skb until the real dequeue. This is
    mainly for compatibility, to avoid breaking some unusual
    configs, because non-work-conserving parent qdiscs are expected
    to peek at their work-conserving children.

    This implementation requires using the qdisc_dequeue_peeked()
    wrapper instead of calling qdisc->dequeue() directly for all
    qdiscs ever queried with qdisc->ops->peek() or
    qdisc_peek_dequeued(). A user-space model of the wrapper pair
    follows below.

    Signed-off-by: Jarek Poplawski
    Signed-off-by: David S. Miller

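    A user-space model of the wrapper pair (toy types; only the
    field name gso_skb follows the changelog). It also shows the
    qlen adjustment from the 06 Nov entry above: a packet stashed
    by peek still counts as queued until it is really dequeued:

    #include <stdio.h>

    struct toy_qdisc {
        int *gso_skb;                       /* stashed "peeked" packet */
        int qlen;
        int *(*dequeue)(struct toy_qdisc *);
    };

    static int *qdisc_peek_dequeued(struct toy_qdisc *sch)
    {
        if (!sch->gso_skb) {
            sch->gso_skb = sch->dequeue(sch);  /* emulate peek by dequeue */
            if (sch->gso_skb)
                sch->qlen++;                   /* still part of the queue */
        }
        return sch->gso_skb;
    }

    static int *qdisc_dequeue_peeked(struct toy_qdisc *sch)
    {
        int *skb = sch->gso_skb;

        if (skb) {                             /* hand out the stashed one */
            sch->gso_skb = NULL;
            sch->qlen--;
        } else {
            skb = sch->dequeue(sch);
        }
        return skb;
    }

    static int pkt = 7;
    static int *toy_dequeue(struct toy_qdisc *sch)
    {
        (void)sch;
        return &pkt;                           /* one-packet "queue" */
    }

    int main(void)
    {
        struct toy_qdisc q = { NULL, 0, toy_dequeue };

        printf("peek %d, qlen %d\n", *qdisc_peek_dequeued(&q), q.qlen);
        printf("deq  %d, qlen %d\n", *qdisc_dequeue_peeked(&q), q.qlen);
        return 0;
    }
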
  • From: Patrick McHardy

    Just a demonstration of how easy adding a peek operation to the
    work-conserving qdiscs actually is. In many cases it doesn't
    need to keep or change any internal state, thanks to the
    guarantee that the packet will either be dequeued or, if
    another packet arrives, the upper qdisc will immediately ->peek
    again to reevaluate the state.

    (This is Patrick's patch, only slightly modified.)

    Signed-off-by: Jarek Poplawski
    Signed-off-by: David S. Miller

  • Add Qdisc_ops peek() method in order to replace requeuing.

    Based on ideas and patches of Herbert Xu, Patrick McHardy and
    David S. Miller.

    Signed-off-by: Jarek Poplawski
    Signed-off-by: David S. Miller


07 Oct, 2008

1 commit

  • Jay Cliburn noticed and diagnosed a bug triggered in
    dev_gso_skb_destructor() after the recent change from
    qdisc->gso_skb to the qdisc->requeue list. Since GSO-segmented
    skbs can't be queued to another list, this patch brings back
    qdisc->gso_skb for them.

    Reported-by: Jay Cliburn
    Signed-off-by: Jarek Poplawski
    Signed-off-by: David S. Miller


27 Aug, 2008

2 commits

  • Use the new qdisc_root_sleeping_lock() instead of
    qdisc_root_lock() as sch_tree_lock(), because this lock can be
    used while the dev is deactivated, and we never need to use it
    with noop_qdisc as the root.

    Signed-off-by: Jarek Poplawski
    Signed-off-by: David S. Miller

  • While a qdisc root lock is passed to gen_new_estimator() and
    gen_replace_estimator(), the dev could be deactivated, or the
    proper root qdisc might not yet be grafted as qdisc_sleeping
    (e.g. in qdisc_create), so using qdisc_root_lock() is not
    enough. This patch adds qdisc_root_sleeping_lock() for this,
    plus additional checks where necessary.

    Signed-off-by: Jarek Poplawski
    Signed-off-by: David S. Miller


21 Aug, 2008

1 commit

  • dev_deactivate() can miss a qdisc being rescheduled by
    qdisc_watchdog() or another timer calling netif_schedule()
    after dev_queue_deactivate(). We prevent this by checking
    aliveness before scheduling the timer. Since during
    deactivation the root qdisc is available only as
    qdisc_sleeping, an additional accessor, qdisc_root_sleeping(),
    is created.

    With feedback from Herbert Xu

    Signed-off-by: Jarek Poplawski
    Signed-off-by: David S. Miller


18 Aug, 2008

2 commits

  • We can now kill them synchronously with all of the
    previous dev_deactivate() cures.

    This makes netdev destruction and shutdown saner as
    the qdiscs hold references to the device.

    Signed-off-by: David S. Miller

  • This new state lets dev_deactivate() mark a qdisc as having been
    deactivated.

    dev_queue_xmit() and ing_filter() check for this bit and do not
    try to process the qdisc if the bit is set.

    dev_deactivate() polls the qdisc after setting the bit, waiting
    for both __QDISC_STATE_RUNNING and __QDISC_STATE_SCHED to clear
    (a toy model of this handshake follows below).

    This isn't perfect yet, but subsequent changesets will make it
    so. This part is just one piece of the puzzle.

    Signed-off-by: David S. Miller

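    A user-space model of the handshake described above: set the
    deactivated bit, then poll until neither RUNNING nor SCHED is
    set (C11 atomics and the bit values are illustrative stand-ins
    for the kernel's qdisc->state bitops):

    #include <stdatomic.h>
    #include <stdio.h>

    #define QDISC_DEACTIVATED (1u << 0)
    #define QDISC_RUNNING     (1u << 1)   /* models __QDISC_STATE_RUNNING */
    #define QDISC_SCHED       (1u << 2)   /* models __QDISC_STATE_SCHED */

    static atomic_uint state;

    static int toy_dev_queue_xmit(void)
    {
        if (atomic_load(&state) & QDISC_DEACTIVATED)
            return -1;                    /* don't touch the dying qdisc */
        /* ... enqueue and run the qdisc ... */
        return 0;
    }

    static void toy_dev_deactivate(void)
    {
        atomic_fetch_or(&state, QDISC_DEACTIVATED);

        /* poll until no cpu runs the qdisc or has it scheduled */
        while (atomic_load(&state) & (QDISC_RUNNING | QDISC_SCHED))
            ;                             /* the kernel yields/waits here */
    }

    int main(void)
    {
        toy_dev_deactivate();
        printf("xmit after deactivate: %d\n", toy_dev_queue_xmit());
        return 0;
    }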

05 Aug, 2008

2 commits

  • Patrick McHardy noticed that it would be nice to handle
    NET_XMIT_BYPASS as NET_XMIT_SUCCESS with an internal qdisc flag
    __NET_XMIT_BYPASS, and to remove the mapping from
    dev_queue_xmit().

    David Miller spotted a serious bug in the first version of this
    patch.

    Signed-off-by: Jarek Poplawski
    Signed-off-by: David S. Miller

  • Patrick McHardy noticed:
    "The other problem that affects all qdiscs supporting actions is
    TC_ACT_QUEUED/TC_ACT_STOLEN getting mapped to NET_XMIT_SUCCESS
    even though the packet is not queued, corrupting upper qdiscs'
    qlen counters."

    and later explained:
    "The reason why it translates it at all seems to be to not increase
    the drops counter. Within a single qdisc this could be avoided by
    other means easily, upper qdiscs would still increase the counter
    when we return anything besides NET_XMIT_SUCCESS though.

    This means we need a new NET_XMIT return value to indicate this to
    the upper qdiscs. So I'd suggest to introduce NET_XMIT_STOLEN,
    return that to upper qdiscs and translate it to NET_XMIT_SUCCESS
    in dev_queue_xmit, similar to NET_XMIT_BYPASS."

    David Miller noticed:
    "Maybe these NET_XMIT_* values being passed around should be a set of
    bits. They could be composed of base meanings, combined with specific
    attributes.

    So you could say "NET_XMIT_DROP | __NET_XMIT_NO_DROP_COUNT"

    The attributes get masked out by the top-level ->enqueue() caller,
    such that the base meanings are the only thing that make their
    way up into the stack. If it's only about communication within the
    qdisc tree, let's simply code it that way."

    This patch is trying to realize these ideas; a toy rendering of
    the bit composition follows below.

    Signed-off-by: Jarek Poplawski
    Signed-off-by: David S. Miller

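    A toy rendering of the bit-composition idea quoted above: a
    base verdict plus qdisc-internal attribute bits, with the
    top-level caller masking the attributes off (all values are
    illustrative, not the kernel's):

    #include <stdio.h>

    #define NET_XMIT_SUCCESS   0x00
    #define NET_XMIT_DROP      0x01
    #define NET_XMIT_MASK      0x0f   /* base meanings seen by the stack */

    #define __NET_XMIT_STOLEN  0x10   /* attribute: qdisc tree only */

    /* what an enqueue path might return when an action stole the skb */
    static int toy_enqueue_stolen(void)
    {
        return NET_XMIT_SUCCESS | __NET_XMIT_STOLEN;
    }

    int main(void)
    {
        int ret = toy_enqueue_stolen();

        if (ret & __NET_XMIT_STOLEN)       /* parent qdiscs: not queued, */
            printf("stolen: skip qlen update, no drop count\n");

        ret &= NET_XMIT_MASK;              /* masked at the top level */
        printf("stack sees %#x (success)\n", ret);
        return 0;
    }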
