03 Jun, 2013

1 commit

  • commit 56b765b79 ("htb: improved accuracy at high rates")
    broke the "overhead xxx" handling, as well as the "linklayer atm"
    attribute.

    tc class add ... htb rate X ceil Y linklayer atm overhead 10

    This patch restores the "overhead xxx" handling for htb, tbf
    and act_police.

    The "linklayer atm" thing needs a separate fix.

    Reported-by: Jesper Dangaard Brouer
    Signed-off-by: Eric Dumazet
    Cc: Vimalkumar
    Cc: Jiri Pirko
    Signed-off-by: David S. Miller

    Eric Dumazet
     

28 Feb, 2013

1 commit

  • I'm not sure why, but the hlist for-each-entry iterators were conceived
    differently from the list ones, which look like this:

    list_for_each_entry(pos, head, member)

    The hlist ones were greedy and wanted an extra parameter:

    hlist_for_each_entry(tpos, pos, head, member)

    Why did they need an extra pos parameter? I'm not quite sure. Not only
    do they not really need it, it also prevents the iterator from looking
    exactly like the list iterator, which is unfortunate.
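    For a hypothetical struct foo linked into an hlist bucket 'head' through
    a member named 'node', the conversion amounts to dropping that scratch
    variable (sketch, not taken verbatim from the patch):

    struct foo {
            int key;
            struct hlist_node node;
    };

    struct foo *obj;
    struct hlist_node *pos;         /* scratch node the old iterator required */

    /* before: extra 'pos' parameter */
    hlist_for_each_entry(obj, pos, head, node)
            pr_info("key=%d\n", obj->key);

    /* after: same shape as list_for_each_entry(); the node pointer is
     * derived internally via container_of()/hlist_entry_safe() */
    hlist_for_each_entry(obj, head, node)
            pr_info("key=%d\n", obj->key);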

    Besides the semantic patch, there was some manual work required:

    - Fix up the actual hlist iterators in linux/list.h
    - Fix up the declaration of other iterators based on the hlist ones.
    - A very small number of places were using the 'node' parameter; these
    were modified to use 'obj->member' instead.
    - Coccinelle didn't handle the hlist_for_each_entry_safe iterator
    properly, so those had to be fixed up manually.

    The semantic patch which is mostly the work of Peter Senna Tschudin is here:

    @@
    iterator name hlist_for_each_entry, hlist_for_each_entry_continue,
        hlist_for_each_entry_from, hlist_for_each_entry_rcu,
        hlist_for_each_entry_rcu_bh, hlist_for_each_entry_continue_rcu_bh,
        for_each_busy_worker, ax25_uid_for_each, ax25_for_each,
        inet_bind_bucket_for_each, sctp_for_each_hentry, sk_for_each,
        sk_for_each_rcu, sk_for_each_from, sk_for_each_safe, sk_for_each_bound,
        hlist_for_each_entry_safe, hlist_for_each_entry_continue_rcu,
        nr_neigh_for_each, nr_neigh_for_each_safe, nr_node_for_each,
        nr_node_for_each_safe, for_each_gfn_indirect_valid_sp, for_each_gfn_sp,
        for_each_host;

    type T;
    expression a,c,d,e;
    identifier b;
    statement S;
    @@

    -T b;

    [akpm@linux-foundation.org: drop bogus change from net/ipv4/raw.c]
    [akpm@linux-foundation.org: drop bogus hunk from net/ipv6/raw.c]
    [akpm@linux-foundation.org: checkpatch fixes]
    [akpm@linux-foundation.org: fix warnings]
    [akpm@linux-foundation.org: redo intrusive kvm changes]
    Tested-by: Peter Senna Tschudin
    Acked-by: Paul E. McKenney
    Signed-off-by: Sasha Levin
    Cc: Wu Fengguang
    Cc: Marcelo Tosatti
    Cc: Gleb Natapov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sasha Levin
     

15 Jan, 2013

1 commit

  • Eric Dumazet pointed out that act_mirred needs to find the current net_ns,
    and struct net pointer is not provided in the call chain. His original
    patch made use of current->nsproxy->net_ns to find the network namespace,
    but this fails to work correctly for userspace code that makes use of
    netlink sockets in different network namespaces. Instead, pass the
    "struct net *" down along the call chain to where it is needed.

    This version removes the ifb changes as Eric has submitted that patch
    separately, but is otherwise identical to the previous version.

    Signed-off-by: Benjamin LaHaise
    Tested-by: Eric Dumazet
    Acked-by: Jamal Hadi Salim
    Signed-off-by: David S. Miller

    Benjamin LaHaise
     

12 Dec, 2012

1 commit

  • With BQL being deployed, we are more likely to see the following behavior:

    We dequeue a packet from qdisc in dequeue_skb(), then we realize target
    tx queue is in XOFF state in sch_direct_xmit(), and we have to hold the
    skb into gso_skb for later.

    This shows in stats (tc -s qdisc dev eth0) as requeues.

    The problem with these requeues is that high-priority packets cannot be
    dequeued as long as this (possibly low-priority, big TSO) packet is not
    removed from gso_skb.

    At 1 Gbps, a full-size TSO packet represents 500 us of extra latency.

    In some cases, we know that all packets dequeued from a qdisc are
    destined for a particular, known txq:

    - if the device is not multiqueue
    - for all MQ/MQPRIO slave qdiscs

    This patch introduces a new qdisc flag, TCQ_F_ONETXQUEUE to mark
    this capability, so that dequeue_skb() is allowed to dequeue a packet
    only if the associated txq is not stopped.

    This indeed reduces latencies for high-priority packets (and improves
    fairness with sfq/fq_codel), and almost removes qdisc 'requeues'.
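    A condensed sketch of the resulting dequeue_skb() logic follows; it is
    reconstructed from the description above, so treat names and details as
    approximate:

    static struct sk_buff *dequeue_skb(struct Qdisc *q)
    {
            struct sk_buff *skb = q->gso_skb;
            const struct netdev_queue *txq = q->dev_queue;

            if (unlikely(skb)) {
                    /* a requeued packet: hand it back only once its txq woke up */
                    txq = netdev_get_tx_queue(txq->dev, skb_get_queue_mapping(skb));
                    if (!netif_xmit_frozen_or_stopped(txq)) {
                            q->gso_skb = NULL;
                            q->q.qlen--;
                    } else {
                            skb = NULL;
                    }
            } else {
                    /* TCQ_F_ONETXQUEUE: every packet goes to q->dev_queue, so
                     * do not even dequeue while that queue is stopped */
                    if (!(q->flags & TCQ_F_ONETXQUEUE) ||
                        !netif_xmit_frozen_or_stopped(txq))
                            skb = q->dequeue(q);
            }

            return skb;
    }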

    Signed-off-by: Eric Dumazet
    Cc: Jamal Hadi Salim
    Cc: John Fastabend
    Signed-off-by: David S. Miller

    Eric Dumazet
     

15 Aug, 2012

1 commit

  • cls_flow.c plays with uids and gids. Unless I misread that
    code it is possible for classifiers to depend on the specific uid and
    gid values. Therefore I need to know the user namespace of the
    netlink socket that is installing the packet classifiers. Pass
    in the rtnetlink skb so I can access the NETLINK_CB of the passed
    packet. In particular I want access to sk_user_ns(NETLINK_CB(in_skb).ssk).

    Pass in not the user namespace but the incoming rtnetlink skb to the
    classifier change routines, as that is generally the more useful
    parameter.

    Cc: Jamal Hadi Salim
    Acked-by: David S. Miller
    Acked-by: Serge Hallyn
    Signed-off-by: Eric W. Biederman

    Eric W. Biederman
     

13 Jun, 2012

1 commit

  • In the transmit path of the bonding driver, skb->cb is used to
    stash the skb->queue_mapping so that the bonding device can set its
    own queue mapping. This value becomes corrupted since the skb->cb is
    also used in __dev_xmit_skb.

    When transmitting through bonding driver, bond_select_queue is
    called from dev_queue_xmit. In bond_select_queue the original
    skb->queue_mapping is copied into skb->cb (via bond_queue_mapping)
    and skb->queue_mapping is overwritten with the bond driver queue.

    Subsequently in dev_queue_xmit, __dev_xmit_skb is called which writes
    the packet length into skb->cb, thereby overwriting the stashed
    queue mapping. In bond_dev_queue_xmit (called from hard_start_xmit),
    the queue mapping for the skb is set to the stashed value which is now
    the skb length and hence is an invalid queue for the slave device.

    If we want to save skb->queue_mapping into skb->cb[], the best place is to
    add a field in struct qdisc_skb_cb, to make sure it won't conflict with
    other layers (e.g. Qdisc, Infiniband...)

    This patch also makes sure (struct qdisc_skb_cb)->data is aligned on 8
    bytes:

    netem qdisc for example assumes it can store an u64 in it, without
    misalignment penalty.

    Note : we only have 20 bytes left in (struct qdisc_skb_cb)->data[].
    The largest user is CHOKe and it fills it.
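    The resulting layout is approximately the following; the name of the
    stash field is quoted from memory and may differ slightly from the patch:

    struct qdisc_skb_cb {
            unsigned int    pkt_len;
            u16             slave_dev_queue_mapping; /* bonding's stashed mapping */
            u16             _pad;                    /* keeps data[] 8-byte aligned */
            unsigned char   data[20];                /* private area (netem, CHOKe, ...) */
    };

    static inline struct qdisc_skb_cb *qdisc_skb_cb(const struct sk_buff *skb)
    {
            return (struct qdisc_skb_cb *)skb->cb;
    }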

    Based on a previous patch from Tom Herbert.

    Signed-off-by: Eric Dumazet
    Reported-by: Tom Herbert
    Cc: John Fastabend
    Cc: Roland Dreier
    Acked-by: Neil Horman
    Signed-off-by: David S. Miller

    Eric Dumazet
     

10 Feb, 2012

1 commit

  • Just like skb->cb[], so that qdisc_skb_cb can be encapsulated inside
    of other data structures.

    This is intended to be used by IPoIB so that it can remember
    addressing information stored at hard_header_ops->create() time that
    it can fetch when the packet gets to the transmit routine.
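    The intended usage pattern looks roughly like this; IPoIB's actual
    structure may differ in detail:

    /* the owner's private cb area embeds qdisc_skb_cb as its first member,
     * so qdisc code keeps working on skb->cb while IPoIB keeps the
     * addressing information recorded at header-creation time */
    struct ipoib_cb {
            struct qdisc_skb_cb     qdisc_cb;       /* must stay first */
            u8                      hwaddr[INFINIBAND_ALEN];
    };

    static inline struct ipoib_cb *ipoib_skb_cb(struct sk_buff *skb)
    {
            BUILD_BUG_ON(sizeof(skb->cb) < sizeof(struct ipoib_cb));
            return (struct ipoib_cb *)skb->cb;
    }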

    Signed-off-by: David S. Miller

    David S. Miller
     

24 Mar, 2011

1 commit

  • commit fd245a4adb52 (net_sched: move TCQ_F_THROTTLED flag)
    added a race.

    qdisc_watchdog() is run from softirq, so special care should be taken or
    we can lose one state transition (THROTTLED/RUNNING)

    Prior to fd245a4adb52, we were manipulating q->flags (qdisc->flags &=
    ~TCQ_F_THROTTLED;) and this manipulation could only race with
    qdisc_warn_nonwc().

    Since we want to avoid atomic ops in the qdisc fast path - that was the
    point of commit 371121057607e (QDISC_STATE_RUNNING dont need atomic
    bit ops) - the fix is to move the THROTTLED bit into the 'state' field,
    which is manipulated with SMP- and IRQ-safe operations.
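    In other words, the helpers go back to real bit operations on
    qdisc->state, along these lines (names as I recall them, shown as a
    sketch):

    static inline void qdisc_throttled(struct Qdisc *qdisc)
    {
            set_bit(__QDISC_STATE_THROTTLED, &qdisc->state);
    }

    static inline void qdisc_unthrottled(struct Qdisc *qdisc)
    {
            /* atomic, so a concurrent qdisc_watchdog() softirq can no
             * longer lose the THROTTLED/RUNNING transition */
            clear_bit(__QDISC_STATE_THROTTLED, &qdisc->state);
    }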

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

21 Jan, 2011

3 commits

  • In commit 44b8288308ac9d (net_sched: pfifo_head_drop problem), we fixed
    a problem with pfifo_head drops that incorrectly decreased
    sch->bstats.bytes and sch->bstats.packets

    Several qdiscs (CHOKe, SFQ, pfifo_head, ...) are able to drop a
    previously enqueued packet, and bstats cannot be adjusted at that point,
    so bstats/rates are not accurate (they are over-estimated).

    This patch changes the qdisc_bstats updates to be done at dequeue() time
    instead of enqueue() time. bstats counters no longer account for dropped
    frames, and rates are more correct, since enqueue() bursts don't have an
    effect on the dequeue() rate.
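    In practice this means the common dequeue helpers do the accounting
    instead of each qdisc's ->enqueue(); roughly:

    static inline struct sk_buff *qdisc_dequeue_head(struct Qdisc *sch)
    {
            struct sk_buff *skb = __qdisc_dequeue_head(sch, &sch->q);

            if (likely(skb != NULL))
                    qdisc_bstats_update(sch, skb);  /* moved here from enqueue time */

            return skb;
    }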

    Signed-off-by: Eric Dumazet
    Acked-by: Stephen Hemminger
    Signed-off-by: David S. Miller

    Eric Dumazet
     
  • This patch converts stab qdisc management to RCU, so that we can perform
    the qdisc_calculate_pkt_len() call before taking the qdisc lock.

    This shortens the time the lock is held in __dev_xmit_skb().

    This permits more qdiscs to get TCQ_F_CAN_BYPASS status, avoiding a lot
    of cache misses and so reducing latencies.

    Signed-off-by: Eric Dumazet
    CC: Patrick McHardy
    CC: Jesper Dangaard Brouer
    CC: Jarek Poplawski
    CC: Jamal Hadi Salim
    CC: Stephen Hemminger
    Signed-off-by: David S. Miller

    Eric Dumazet
     
  • In commit 371121057607e (net: QDISC_STATE_RUNNING dont need atomic bit
    ops) I moved the QDISC_STATE_RUNNING flag to the __state container,
    located in the cache line containing the qdisc lock and often-dirtied
    fields.

    I now move the TCQ_F_THROTTLED bit too, so that the first cache line
    stays read-mostly and can be shared by all cpus. This should speed up
    HTB/CBQ, for example.

    Not using test_bit()/__clear_bit()/__test_and_set_bit() allows us to use
    an "unsigned int" for the __state container, reducing Qdisc size by 8
    bytes.

    Introduce helpers to hide implementation details.
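    In this version the helpers are plain (non-atomic) operations on the new
    __state word, roughly as sketched below; the 24 Mar 2011 entry above
    explains why the THROTTLED manipulation later had to become atomic again:

    enum qdisc___state_t {
            __QDISC___STATE_RUNNING   = 1,
            __QDISC___STATE_THROTTLED = 2,
    };

    static inline bool qdisc_is_throttled(const struct Qdisc *qdisc)
    {
            return (qdisc->__state & __QDISC___STATE_THROTTLED) ? true : false;
    }

    static inline void qdisc_throttled(struct Qdisc *qdisc)
    {
            qdisc->__state |= __QDISC___STATE_THROTTLED;
    }

    static inline void qdisc_unthrottled(struct Qdisc *qdisc)
    {
            qdisc->__state &= ~__QDISC___STATE_THROTTLED;
    }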

    Signed-off-by: Eric Dumazet
    CC: Patrick McHardy
    CC: Jesper Dangaard Brouer
    CC: Jarek Poplawski
    CC: Jamal Hadi Salim
    CC: Stephen Hemminger
    Signed-off-by: David S. Miller

    Eric Dumazet
     

11 Jan, 2011

1 commit

  • HTB takes into account that an skb may be segmented (GSO) when updating
    stats. Generalize this to all schedulers.

    They should use the qdisc_bstats_update() helper instead of manipulating
    bstats.bytes and bstats.packets directly.

    Add bstats_update() helper too for classes that use
    gnet_stats_basic_packed fields.

    Note : Right now, the TCQ_F_CAN_BYPASS shortcut can be taken only if no
    stab is set up on the qdisc.
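    The helpers are essentially the following (paraphrased):

    static inline void bstats_update(struct gnet_stats_basic_packed *bstats,
                                     const struct sk_buff *skb)
    {
            bstats->bytes += qdisc_pkt_len(skb);
            /* count every segment of a GSO packet, not just one frame */
            bstats->packets += skb_is_gso(skb) ? skb_shinfo(skb)->gso_segs : 1;
    }

    static inline void qdisc_bstats_update(struct Qdisc *sch,
                                           const struct sk_buff *skb)
    {
            bstats_update(&sch->bstats, skb);
    }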

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

17 Dec, 2010

1 commit

  • Add dev_close_many and dev_deactivate_many to factorize another
    sync-rcu operation on the netdevice unregister path.

    $ modprobe dummy numdummies=10000
    $ ip link set dev dummy* up
    $ time rmmod dummy

                  Without the patch    With the patch

    real          0m 24.63s            0m 5.15s
    user          0m 0.00s             0m 0.00s
    sys           0m 6.05s             0m 5.14s

    Signed-off-by: Octavian Purdila
    Signed-off-by: David S. Miller

    Octavian Purdila
     

03 Jul, 2010

2 commits

  • Reducing real_num_tx_queues needs to flush the qdisc, otherwise
    skbs with queue_mappings greater than real_num_tx_queues can
    be sent to the underlying driver.

    The flow for this is,

    dev_queue_xmit()
        dev_pick_tx()
            skb_tx_hash()  => hash using real_num_tx_queues
            skb_set_queue_mapping()
        ...
        qdisc_enqueue_root() => enqueue skb on txq from hash
    ...
    dev->real_num_tx_queues -= n
    ...
    sch_direct_xmit()
        dev_hard_start_xmit()
            ndo_start_xmit(skb,dev) => skb queue set with old hash

    skbs are enqueued on the qdisc with skb->queue_mapping set so that
    0 < queue_mapping < real_num_tx_queues. When the driver
    decreases real_num_tx_queues, skbs may be dequeued from the
    qdisc with a queue_mapping greater than real_num_tx_queues.

    This fixes a case in ixgbe where this was occurring with DCB
    and FCoE. Because the driver is using queue_mapping to map
    skbs to tx descriptor rings we can potentially map skbs to
    rings that no longer exist.

    Signed-off-by: John Fastabend
    Tested-by: Ross Brattain
    Signed-off-by: Jeff Kirsher
    Signed-off-by: David S. Miller

    John Fastabend
     
  • When calling qdisc_reset() the qdisc lock needs to be held. In
    this case there is at least one driver, i4l, which is using this
    without holding the lock. Add the locking here.
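    The rule being enforced is simply the following pattern (illustrative
    helper, not the patch itself):

    static void locked_qdisc_reset(struct Qdisc *qdisc)
    {
            /* qdisc_reset() walks and frees queued skbs, so it must run
             * under the qdisc lock */
            spin_lock_bh(qdisc_lock(qdisc));
            qdisc_reset(qdisc);
            spin_unlock_bh(qdisc_lock(qdisc));
    }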

    Signed-off-by: John Fastabend
    Signed-off-by: Jeff Kirsher
    Signed-off-by: David S. Miller

    John Fastabend
     

29 Jun, 2010

1 commit

  • don't clone skb when skb isn't shared

    When the tcf_action is TC_ACT_STOLEN, and the skb isn't shared, we don't need
    to clone a new skb. As the skb will be freed after this function returns, we
    can use it freely once we get a reference to it.
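    The helper act_mirred relies on then looks roughly like this; the name
    and the exact set of actions covered are from memory, so treat it as a
    sketch:

    static inline struct sk_buff *skb_act_clone(struct sk_buff *skb,
                                                gfp_t gfp_mask, int action)
    {
            /* a stolen skb that nobody else holds can simply be
             * re-referenced instead of copied */
            if (action == TC_ACT_STOLEN && !skb_shared(skb))
                    return skb_get(skb);

            return skb_clone(skb, gfp_mask);
    }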

    Signed-off-by: Changli Gao
    Signed-off-by: Jamal Hadi Salim
    Signed-off-by: David S. Miller

    Changli Gao
     

02 Jun, 2010

3 commits

  • When many cpus compete for sending frames on a given qdisc, the qdisc
    spinlock suffers from very high contention.

    The cpu owning the __QDISC_STATE_RUNNING bit has the same priority when
    acquiring the lock, and cannot dequeue packets fast enough, since it must
    wait for this lock for each dequeued packet.

    One solution to this problem is to force all cpus to spin on a second
    lock before trying to get the main lock, when/if they see
    __QDISC_STATE_RUNNING already set.

    The owning cpu then competes with at most one other cpu for the main
    lock, allowing for a higher dequeue rate.

    Based on a previous patch from Alexander Duyck. I added the heuristic to
    avoid the atomic operation in the fast path, and put the new lock far
    away from the cache line used by the dequeue worker. I also try to
    release the busylock as late as possible.

    Tests with the following script gave a boost from ~50,000 pps to ~600,000
    pps on a dual quad-core machine (E5450 @3.00GHz) with the tg3 driver.
    (A single netperf flow can reach ~800,000 pps on this platform.)

    for j in `seq 0 3`; do
        for i in `seq 0 7`; do
            netperf -H 192.168.0.1 -t UDP_STREAM -l 60 -N -T $i -- -m 6 &
        done
    done
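    The core of the change in __dev_xmit_skb() boils down to the following
    pattern (heavily abridged sketch; the enqueue/dequeue work itself is
    elided):

    static int __dev_xmit_skb_sketch(struct sk_buff *skb, struct Qdisc *q)
    {
            spinlock_t *root_lock = qdisc_lock(q);
            bool contended = qdisc_is_running(q);  /* heuristic: skip busylock when idle */

            if (unlikely(contended))
                    spin_lock(&q->busylock);       /* at most one waiter reaches root_lock */

            spin_lock(root_lock);
            /* ... normal enqueue + __qdisc_run() processing happens here ... */
            spin_unlock(root_lock);

            if (unlikely(contended))
                    spin_unlock(&q->busylock);     /* released as late as possible */

            return NET_XMIT_SUCCESS;
    }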

    Signed-off-by: Eric Dumazet
    Acked-by: Alexander Duyck
    Signed-off-by: David S. Miller

    Eric Dumazet
     
  • __QDISC_STATE_RUNNING is always changed while qdisc lock is held.

    We can avoid two atomic operations in the xmit path if we move this bit
    into a new __state container.

    The location of this __state container is carefully chosen so that the
    fast path only dirties one qdisc cache line.

    The THROTTLED bit could later be moved into this __state location too, to
    avoid dirtying the first qdisc cache line.

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     
  • Define three helpers to manipulate the QDISC_STATE_RUNNING flag, which a
    second patch will move to another location.
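    The three helpers are approximately the following (at this point still
    atomic bit operations on qdisc->state; the follow-up patch relocates the
    flag):

    static inline bool qdisc_is_running(struct Qdisc *qdisc)
    {
            return test_bit(__QDISC_STATE_RUNNING, &qdisc->state);
    }

    static inline bool qdisc_run_begin(struct Qdisc *qdisc)
    {
            /* returns true only for the cpu that grabbed RUNNING */
            return !test_and_set_bit(__QDISC_STATE_RUNNING, &qdisc->state);
    }

    static inline void qdisc_run_end(struct Qdisc *qdisc)
    {
            clear_bit(__QDISC_STATE_RUNNING, &qdisc->state);
    }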

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

02 Apr, 2010

1 commit

  • One of my test machines got a deadlock during "tc" sessions,
    adding/deleting classes & filters, using traffic estimators.

    After some analysis, I believe we have a potential use-after-free case
    in est_timer():

    spin_lock(e->stats_lock);   << HERE >>
    read_lock(&est_lock);
    if (e->bstats == NULL)      << TEST >>
            goto skip;

    The test is done a bit late, because after the estimator is killed, and
    before the rcu grace period has elapsed, we might already have freed or
    reused the memory that e->stats_lock points to (some qdisc->q.lock).

    A possible fix is to respect a rcu grace period at Qdisc dismantle time.

    On 64bit, sizeof(struct Qdisc) is exactly 192 bytes. Adding 16 bytes to
    it (for struct rcu_head) is a problem because it might change
    performance, given QDISC_ALIGNTO is 32 bytes.

    This is why I also change QDISC_ALIGNTO to 64 bytes, to satisfy most
    current alignment requirements.
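    Concretely, deferring the free looks something like the following sketch
    (helper name illustrative):

    /* struct Qdisc gains a struct rcu_head, and the final kfree() is
     * deferred until a grace period has elapsed */
    static void qdisc_rcu_free(struct rcu_head *head)
    {
            struct Qdisc *qdisc = container_of(head, struct Qdisc, rcu_head);

            kfree((char *)qdisc - qdisc->padded);
    }

    void qdisc_destroy(struct Qdisc *qdisc)
    {
            /* ... ops->destroy(), module_put(), dev_put() etc. ... */

            /* was: kfree((char *)qdisc - qdisc->padded);
             * now est_timer() can no longer touch freed memory */
            call_rcu(&qdisc->rcu_head, qdisc_rcu_free);
    }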

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

29 Jan, 2010

1 commit

  • This adds an additional queuing strategy, called pfifo_head_drop, that
    removes the oldest skb in the case of an overflow within the queue -
    the head element - instead of the last skb (tail). Removing the oldest
    skb in congested situations is useful for sensor-network environments
    where newer packets carry the more valuable information.
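    The enqueue path of such a head-drop fifo is essentially the following
    (a sketch modelled on the pfifo/bfifo code of that time, not the patch
    verbatim):

    struct fifo_sched_data {
            u32 limit;
    };

    static int pfifo_head_drop_enqueue(struct sk_buff *skb, struct Qdisc *sch)
    {
            struct fifo_sched_data *q = qdisc_priv(sch);
            struct sk_buff *old;

            if (likely(skb_queue_len(&sch->q) < q->limit))
                    return qdisc_enqueue_tail(skb, sch);

            /* queue full: evict the oldest (head) skb instead of refusing
             * the newest one */
            old = qdisc_dequeue_head(sch);
            sch->qstats.drops++;
            kfree_skb(old);

            return qdisc_enqueue_tail(skb, sch);
    }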

    Reviewed-by: Florian Westphal
    Acked-by: Patrick McHardy
    Signed-off-by: Hagen Paul Pfeifer
    Signed-off-by: David S. Miller

    Hagen Paul Pfeifer
     

04 Nov, 2009

1 commit

  • This cleanup patch puts struct/union/enum opening braces on the first
    line to ease grep games.

    struct something
    {

    becomes :

    struct something {

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

15 Sep, 2009

1 commit

  • After the recent mq change there is the new select_queue qdisc class
    method used in tc_modify_qdisc, but it works OK only for direct child
    qdiscs of the mq qdisc. Grandchildren always get the first tx queue,
    which would give wrong qdisc_root etc. results (e.g. for sch_htb as a
    child of sch_prio). This patch fixes it by using the parent's dev_queue
    for such grandchild qdiscs. The select_queue method's return type is
    changed BTW.

    With feedback from: Patrick McHardy

    Signed-off-by: Jarek Poplawski
    Signed-off-by: David S. Miller

    Jarek Poplawski
     

10 Sep, 2009

1 commit

  • When new child qdiscs are attached to the mq qdisc, they are actually
    attached as root qdiscs to the device queues. The lock selection for
    new estimators incorrectly picks the root lock of the existing and
    to be replaced qdisc, which results in a use-after-free once the old
    qdisc has been destroyed.

    Mark mq qdisc instances with a new flag and treat qdiscs attached to
    mq as children similar to regular root qdiscs.

    Additionally prevent estimators from being attached to the mq qdisc
    itself since it only updates its byte and packet counters during dumps.

    Signed-off-by: Patrick McHardy
    Signed-off-by: David S. Miller

    Patrick McHardy
     

06 Sep, 2009

2 commits

  • This patch adds a classful dummy scheduler which can be used as root qdisc
    for multiqueue devices and exposes each device queue as a child class.

    This allows queues to be addressed individually and grafted like regular
    classes. Additionally it presents an accumulated view of the statistics of
    all real root qdiscs in the dummy root.

    Two new callbacks are added to the qdisc_ops and qdisc_class_ops:

    - cl_ops->select_queue selects the tx queue number for new child classes.

    - qdisc_ops->attach() overrides root qdisc device grafting to attach
    non-shared qdiscs to the queues (both hooks are sketched below).
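    The shapes of the two hooks are roughly the following (illustrative
    signatures; the select_queue return type was later changed, see the
    15 Sep 2009 entry above):

    struct Qdisc_class_ops {
            /* ... */
            /* pick the device tx queue a new child class attaches to */
            unsigned int    (*select_queue)(struct Qdisc *sch, struct tcmsg *tcm);
            /* ... */
    };

    struct Qdisc_ops {
            /* ... */
            /* called once all per-queue child qdiscs exist: graft them onto
             * the device queues instead of installing one shared root qdisc */
            void            (*attach)(struct Qdisc *sch);
            /* ... */
    };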

    Signed-off-by: Patrick McHardy
    Signed-off-by: David S. Miller

    David S. Miller
     
  • It will be used in a following patch by the multiqueue qdisc.

    Signed-off-by: Patrick McHardy
    Signed-off-by: David S. Miller

    Patrick McHardy