13 Jan, 2012

1 commit

  • Adds optional Random Early Detection (RED) on each SFQ flow queue.

    Traditional SFQ limits only the count of packets, while RED also
    permits controlling the number of bytes per flow, and adds ECN
    capability as well.

    1) We don't handle idle time management in this RED implementation,
    since each 'new flow' begins with a null qavg. We really want to
    address backlogged flows.

    2) If headdrop is selected, we try to ECN-mark the first packet instead
    of the currently enqueued packet. This gives faster feedback to TCP
    flows than traditional RED [ marking the last packet in the queue ].
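
    A condensed sketch of the head-marking idea (hedged: a simplified
    excerpt in the spirit of the enqueue path, with names following
    include/net/red.h; not the literal diff, and the hard-mark case is
    omitted):

    if (q->red_parms) {
        slot->vars.qavg = red_calc_qavg_no_idle_time(q->red_parms,
                                                     &slot->vars,
                                                     slot->backlog);
        switch (red_action(q->red_parms, &slot->vars, slot->vars.qavg)) {
        case RED_DONT_MARK:
            break;

        case RED_PROB_MARK:
            sch->qstats.overlimits++;
            if (sfq_prob_mark(q)) {
                /* ECN flow: mark the packet at the head of the flow
                 * queue, so TCP sees the signal one RTT earlier */
                if (sfq_headdrop(q) &&
                    INET_ECN_set_ce(slot->skblist_next)) {
                    q->stats.prob_mark_head++;
                    break;
                }
                if (INET_ECN_set_ce(skb)) {
                    q->stats.prob_mark++;
                    break;
                }
            }
            /* non-ECN flow: fall back to dropping (label abbreviated) */
            q->stats.prob_drop++;
            goto congestion_drop;
        }
    }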

    Example of use:

    tc qdisc add dev $DEV parent 1:1 handle 10: est 1sec 4sec sfq \
    limit 3000 headdrop flows 512 divisor 16384 \
    redflowlimit 100000 min 8000 max 60000 probability 0.20 ecn

    qdisc sfq 10: parent 1:1 limit 3000p quantum 1514b depth 127 headdrop
    flows 512/16384 divisor 16384
    ewma 6 min 8000b max 60000b probability 0.2 ecn
    prob_mark 0 prob_mark_head 4876 prob_drop 6131
    forced_mark 0 forced_mark_head 0 forced_drop 0
    Sent 1175211782 bytes 777537 pkt (dropped 6131, overlimits 11007 requeues 0)
    rate 99483Kbit 8219pps backlog 689392b 456p requeues 0

    In this test, with 64 netperf TCP_STREAM sessions, 50% of them using
    ECN-enabled flows, we can see that the number of CE-marked packets is
    smaller than the number of drops (for the non-ECN flows).

    If the same test is run without RED, we can see the backlog is much
    bigger:

    qdisc sfq 10: parent 1:1 limit 3000p quantum 1514b depth 127 headdrop
    flows 512/16384 divisor 16384
    Sent 1148683617 bytes 795006 pkt (dropped 0, overlimits 0 requeues 0)
    rate 98429Kbit 8521pps backlog 1221290b 841p requeues 0

    Signed-off-by: Eric Dumazet
    CC: Stephen Hemminger
    CC: Dave Taht
    Tested-by: Dave Taht
    Signed-off-by: David S. Miller

    Eric Dumazet
     

06 Jan, 2012

1 commit

  • SFQ as implemented in Linux is very limited, with at most 127 flows
    and a limit of 127 packets. [ So if 127 flows are active, we have one
    packet per flow. ]

    This patch brings the following features to SFQ to cope with modern
    needs.

    - Ability to specify a smaller per-flow limit of inflight packets
    (the default value being 127 packets)

    - Ability to have up to 65408 active flows (instead of 127)

    - Ability to have head drops instead of tail drops, to drop old
    packets from a flow (see the sketch below)
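
    A rough sketch of the head-drop path at enqueue time (hedged: a
    condensed excerpt, assuming the per-flow slot layout of sch_sfq.c):

    if (slot->qlen >= q->maxdepth) {
        struct sk_buff *head;

        if (!q->headdrop)
            return qdisc_drop(skb, sch);        /* classic tail drop */

        /* head drop: free the oldest packet of this flow and
         * enqueue the new one in its place */
        head = slot_dequeue_head(slot);
        sch->qstats.backlog -= qdisc_pkt_len(head);
        qdisc_drop(head, sch);

        sch->qstats.backlog += qdisc_pkt_len(skb);
        slot_queue_add(slot, skb);
        return NET_XMIT_CN;
    }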

    Example of use: no more than 20 packets per flow, at most 8000 flows,
    at most 20000 packets in the SFQ qdisc, and a hash table of 65536
    slots.

    tc qdisc add ... sfq \
    flows 8000 \
    depth 20 \
    headdrop \
    limit 20000 \
    divisor 65536

    RAM usage:

    2 bytes per hash table entry (instead of the previous 1 byte/entry).
    32 bytes per flow on 64-bit arches, instead of 384 for QFQ, giving a
    much better cache hit ratio.

    Signed-off-by: Eric Dumazet
    CC: Dave Taht
    Signed-off-by: David S. Miller

    Eric Dumazet
     

05 Jan, 2012

2 commits

  • SFQ's q->perturbation is used in sfq_hash() as an input to the Jenkins
    hash.

    We currently randomize this 32-bit value only if a perturbation timer
    is set up.

    It's much better to always initialize it, to defeat attackers who
    could otherwise predict very well what kind of packets they have to
    forge to hit a particular flow.
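
    The change amounts to seeding the hash input unconditionally at init
    time; a minimal sketch (assuming the net_random() helper of that era):

    /* in sfq_init(), whether or not a perturbation timer is armed: */
    q->perturbation = net_random();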

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     
  • Since commit 817fb15dfd98 (net_sched: sfq: allow divisor to be a
    parameter), we can leave the perturbation timer armed if a memory
    allocation error aborts sfq_init().

    Memory containing an active struct timer_list is then freed, and the
    kernel can crash.

    Call sfq_destroy() from sfq_init() to properly dismantle the qdisc.
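
    A sketch of the fixed error path (hedged; the allocation site is
    abbreviated):

    q->ht = kmalloc(q->divisor * sizeof(q->ht[0]), GFP_KERNEL);
    if (!q->ht) {
        /* dismantle everything set up so far, including the
         * already-armed perturbation timer */
        sfq_destroy(sch);
        return -ENOMEM;
    }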

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

04 Jan, 2012

1 commit

  • The SFQ enqueue algorithm puts a new flow _behind_ all pre-existing
    flows in the circular list. In fact this is probably an old SFQ
    implementation bug.

    100 Mbit/s = ~8333 full-size frames per second, or ~8 frames per ms.

    With 50 flows, it means your "new flow" will have to wait for 50
    packets to be sent before its own packet; at ~8 frames per ms, that is
    the observed ~6 ms.

    We certainly can change SFQ to give a priority advantage to new flows,
    so that the next dequeued packet is taken from a new flow, not an old
    one.
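
    A hedged sketch of the change in sfq_enqueue() (the linking code
    follows the sch_sfq.c of that era; the comment marks the change):

    if (slot->qlen == 1) {              /* The flow is new */
        if (q->tail == NULL) {          /* It is the first flow */
            slot->next = x;
        } else {
            slot->next = q->tail->next;
            q->tail->next = x;
        }
        /* The change: do NOT advance q->tail to the new slot, so the
         * new flow sits right after the tail of the round robin and
         * its packet is dequeued next, instead of waiting behind all
         * pre-existing flows. */
    }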

    Reported-by: Dave Taht
    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

22 Dec, 2011

1 commit

  • A known Out Of Order (OOO) problem hurts SFQ when the timer changes
    the perturbation value, since all new packets delivered to SFQ enqueue
    might end up in different slots than previous in-flight packets.

    With round-robin delivery, we can thus deliver packets in a different
    order.

    Since SFQ is limited to a small number of in-flight packets, we can
    rehash those packets so that this OOO problem is fixed.

    This rehashing is performed only if internal flow classifier is in use.

    We now store the "struct flow_keys" in skb->cb[] so that we don't call
    skb_flow_dissect() again while rehashing.
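
    A hedged sketch of the cb[] storage (assuming the qdisc_skb_cb helper;
    the BUILD_BUG_ON guards against overflowing skb->cb[]):

    struct sfq_skb_cb {
        struct flow_keys keys;
    };

    static inline struct sfq_skb_cb *sfq_skb_cb(const struct sk_buff *skb)
    {
        BUILD_BUG_ON(sizeof(skb->cb) <
                     sizeof(struct qdisc_skb_cb) + sizeof(struct sfq_skb_cb));
        return (struct sfq_skb_cb *)qdisc_skb_cb(skb)->data;
    }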

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

01 Aug, 2011

1 commit

  • commit 8efa88540635 (sch_sfq: avoid giving spurious NET_XMIT_CN signals)
    forgot to call qdisc_tree_decrease_qlen() to signal upper levels that a
    packet (from another flow) was dropped, leading to various problems.
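
    After the fix, the tail of sfq_enqueue() looks roughly like this (a
    sketch, not the literal diff):

    qlen = slot->qlen;
    sfq_drop(sch);          /* drops a packet from the fattest flow */

    /* Return Congestion Notification only if we dropped a packet
     * from this flow. */
    if (qlen != slot->qlen)
        return NET_XMIT_CN;

    /* We dropped a packet from another flow: let upper levels
     * know their qlen shrank. */
    qdisc_tree_decrease_qlen(sch, 1);
    return NET_XMIT_SUCCESS;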

    With help from Michal Soltys and Michal Pokrywka, who did a bisection.

    Bugzilla ref: https://bugzilla.kernel.org/show_bug.cgi?id=39372
    Debian ref: http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=631945

    Reported-by: Lucas Bocchi
    Reported-and-bisected-by: Michal Pokrywka
    Signed-off-by: Eric Dumazet
    CC: Michal Soltys
    Acked-by: Patrick McHardy
    Signed-off-by: David S. Miller

    Eric Dumazet
     

26 May, 2011

1 commit

  • Since commit eeaeb068f139 (sch_sfq: allow big packets and be fair),
    sfq_peek() can return a different skb than the one that would normally
    be dequeued by sfq_dequeue() [ if the current slot->allot is negative ].

    Use the generic qdisc_peek_dequeued() instead of a custom
    implementation, to get a consistent result.
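
    For reference, the generic helper (a sketch of the era's
    include/net/sch_generic.h) really dequeues and parks the skb, so the
    peek and the following dequeue always agree:

    static inline struct sk_buff *qdisc_peek_dequeued(struct Qdisc *sch)
    {
        /* we can reuse ->gso_skb because peek isn't called for root qdiscs */
        if (!sch->gso_skb) {
            sch->gso_skb = sch->dequeue(sch);
            /* the parked skb still counts as queued */
            if (sch->gso_skb)
                sch->q.qlen++;
        }
        return sch->gso_skb;
    }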

    Signed-off-by: Eric Dumazet
    CC: Jarek Poplawski
    CC: Patrick McHardy
    CC: Jesper Dangaard Brouer
    Signed-off-by: David S. Miller

    Eric Dumazet
     

24 May, 2011

1 commit

  • While chasing a possible net_sched bug, I found that IP fragments have
    little chance to pass a congested SFQ qdisc:

    - Say the SFQ qdisc is full because one flow is non-responsive.
    - ip_fragment() wants to send two fragments belonging to an idle flow.
    - sfq_enqueue() queues the first packet, but sees the queue limit
    reached.
    - sfq_enqueue() drops one packet from the 'big consumer', and returns
    NET_XMIT_CN.
    - ip_fragment() cancels the remaining fragments.

    This patch restores fairness, making sure we return NET_XMIT_CN only
    if we dropped a packet from the same flow.
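
    The core of the check, sketched (hedged; sfq_drop() removes a packet
    from the currently fattest flow):

    slot = &q->slots[x];
    qlen = slot->qlen;      /* depth of this flow before the drop */
    sfq_drop(sch);
    /* only signal congestion to the sender if its own flow lost a packet */
    return (qlen != slot->qlen) ? NET_XMIT_CN : NET_XMIT_SUCCESS;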

    Signed-off-by: Eric Dumazet
    CC: Patrick McHardy
    CC: Jarek Poplawski
    CC: Jamal Hadi Salim
    CC: Stephen Hemminger
    Signed-off-by: David S. Miller

    Eric Dumazet
     

03 Feb, 2011

1 commit

  • The change to allow divisor to be a parameter (in 2.6.38-rc1),
    commit 817fb15dfd988d8dda916ee04fa506f0c466b9d6,
    introduced a possible deadlock, caught by sparse.

    The scheduler tree lock was left held in the case of an incorrect
    divisor value. The simplest fix is to move the test outside of the
    lock, which also solves the problem of a partial update.
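
    A sketch of the reordering in sfq_change() (hedged; abbreviated):

    static int sfq_change(struct Qdisc *sch, struct nlattr *opt)
    {
        struct sfq_sched_data *q = qdisc_priv(sch);
        struct tc_sfq_qopt *ctl = nla_data(opt);

        if (opt->nla_len < nla_attr_size(sizeof(*ctl)))
            return -EINVAL;

        /* validate before taking the tree lock: an invalid divisor can
         * no longer return with the lock held, nor leave the qdisc
         * partially updated */
        if (ctl->divisor &&
            (!is_power_of_2(ctl->divisor) || ctl->divisor > 65536))
            return -EINVAL;

        sch_tree_lock(sch);
        /* ... apply the new parameters under the lock ... */
        sch_tree_unlock(sch);
        return 0;
    }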

    Signed-off-by: Stephen Hemminger
    Acked-by: Eric Dumazet
    Signed-off-by: David S. Miller

    stephen hemminger
     

22 Jan, 2011

1 commit

  • Now that the qdisc stab is handled before the TCQ_F_CAN_BYPASS test in
    __dev_xmit_skb(), we can generalize TCQ_F_CAN_BYPASS to qdiscs other
    than pfifo_fast: pfifo, bfifo, pfifo_head_drop and sfq.

    SFQ is special because it can have external classifiers, and in these
    cases we cannot bypass the queue discipline (a packet could be dropped
    by a classifier) without the admin asking for it, or further changes.

    It's worth doing this, especially for SFQ, as it avoids dirtying
    memory when no packets are waiting in the queue.
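
    For context, the bypass test in __dev_xmit_skb() is roughly (a hedged,
    abbreviated sketch of the caller, not part of this patch):

    if ((q->flags & TCQ_F_CAN_BYPASS) && !qdisc_qlen(q) &&
        qdisc_run_begin(q)) {
        /* queue is empty: transmit directly, no enqueue/dequeue */
        if (sch_direct_xmit(skb, q, dev, txq, root_lock))
            __qdisc_run(q);
        else
            qdisc_run_end(q);
        rc = NET_XMIT_SUCCESS;
    } else {
        rc = q->enqueue(skb, q);
        qdisc_run(q);
    }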

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

21 Jan, 2011

2 commits

  • In commit 44b8288308ac9d (net_sched: pfifo_head_drop problem), we
    fixed a problem where pfifo_head_drop incorrectly decreased
    sch->bstats.bytes and sch->bstats.packets.

    Several qdiscs (CHOKe, SFQ, pfifo_head_drop, ...) are able to drop a
    previously enqueued packet, and its bstats contribution cannot be
    undone, so bstats/rates are not accurate (overestimated).

    This patch changes the qdisc_bstats updates to be done at dequeue()
    time instead of enqueue() time. bstats counters no longer account for
    dropped frames, and rates are more correct, since enqueue() bursts
    don't have an effect on the dequeue() rate.
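
    In SFQ, for example, the accounting moves into the dequeue path; a
    hedged, abbreviated sketch:

    static struct sk_buff *sfq_dequeue(struct Qdisc *sch)
    {
        struct sfq_sched_data *q = qdisc_priv(sch);
        struct sfq_slot *slot;
        struct sk_buff *skb;

        /* ... pick the next active slot into 'slot' ... */
        skb = slot_dequeue_head(slot);
        qdisc_bstats_update(sch, skb);  /* account here, not at enqueue */
        sch->q.qlen--;
        /* ... allot bookkeeping and round-robin advance ... */
        return skb;
    }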

    Signed-off-by: Eric Dumazet
    Acked-by: Stephen Hemminger
    Signed-off-by: David S. Miller

    Eric Dumazet
     
  • SFQ currently uses a 1024-slot hash table, and its internal structure
    (sfq_sched_data) allocation needs an order-1 page on x86_64.

    Allow the tc command to specify a divisor value (hash table size)
    between 1 and 65536.
    If no value is provided, assume the default size of 1024.

    This allows admins to set up smaller (or bigger) SFQ hash tables for
    specific needs.

    This also brings sfq_sched_data allocations back to order-0 ones,
    saving 3KB per SFQ qdisc.

    Jesper uses ~55,000 SFQ qdiscs on one machine; this patch should free
    165 MB of memory there.
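
    A hedged sketch of the separate, divisor-sized hash table allocation
    in sfq_init():

    q->ht = kmalloc(q->divisor * sizeof(q->ht[0]), GFP_KERNEL);
    if (!q->ht)
        return -ENOMEM;
    for (i = 0; i < q->divisor; i++)
        q->ht[i] = SFQ_EMPTY_SLOT;      /* no flow hashed here yet */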

    Signed-off-by: Eric Dumazet
    CC: Patrick McHardy
    CC: Jesper Dangaard Brouer
    CC: Jarek Poplawski
    CC: Jamal Hadi Salim
    CC: Stephen Hemminger
    Signed-off-by: David S. Miller

    Eric Dumazet
     

11 Jan, 2011

1 commit

  • HTB takes into account that an skb may be segmented (GSO) in its stats
    updates. Generalize this to all schedulers.

    They should use the qdisc_bstats_update() helper instead of
    manipulating bstats.bytes and bstats.packets directly.

    Also add a bstats_update() helper for classes that use
    gnet_stats_basic_packed fields.
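
    The helpers look roughly like this (a sketch matching the description
    above; gso_segs is the number of segments of a GSO skb):

    static inline void bstats_update(struct gnet_stats_basic_packed *bstats,
                                     const struct sk_buff *skb)
    {
        bstats->bytes += qdisc_pkt_len(skb);
        bstats->packets += skb_is_gso(skb) ? skb_shinfo(skb)->gso_segs : 1;
    }

    static inline void qdisc_bstats_update(struct Qdisc *sch,
                                           const struct sk_buff *skb)
    {
        bstats_update(&sch->bstats, skb);
    }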

    Note: right now, the TCQ_F_CAN_BYPASS shortcut can be taken only if no
    stab is set up on the qdisc.

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

01 Jan, 2011

2 commits

  • slot_dequeue_head() should make sure the slot skb chain is correct in
    both directions, or we can crash if all possible flows are in use.

    Jarek pointed out that slot_queue_init() can now be done once in
    sfq_init(), instead of each time a flow is set up.

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     
  • SFQ is currently 'limited' to small packets, because it uses a 15-bit
    allotment number per flow. Introduce a scaling by 8, so that we can
    handle full-size TSO/GRO packets.

    Use appropriate handling to make sure allot is positive before a new
    packet is dequeued, so that fairness is respected.
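
    The scaling itself is tiny; a sketch with the names used in sch_sfq.c
    (allot is kept in units of 8 bytes, so the 15-bit field can hold
    credits large enough for a 64KB TSO packet):

    #define SFQ_ALLOT_SHIFT     3
    #define SFQ_ALLOT_SIZE(X)   DIV_ROUND_UP(X, 1 << SFQ_ALLOT_SHIFT)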

    Signed-off-by: Eric Dumazet
    Acked-by: Jarek Poplawski
    Cc: Patrick McHardy
    Signed-off-by: David S. Miller

    Eric Dumazet
     

23 Dec, 2010

1 commit

  • sfq_walk() runs without the qdisc lock. By the time it selects a
    non-empty hash slot and sfq_dump_class_stats() is run (with the lock
    held), the slot might have been freed: we then access
    q->slots[SFQ_EMPTY_SLOT], out of bounds, and crash in
    slot_queue_walk().

    On previous kernels the bug is present too, but the out-of-bounds
    qs[SFQ_DEPTH] and allot[SFQ_DEPTH] are located inside struct
    sfq_sched_data, so no illegal memory access happens; only possibly
    wrong data is reported to the user.

    Also, slot_dequeue_tail() should make sure the slot skb chain is
    correctly terminated, or sfq_dump_class_stats() can access freed skbs.
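
    A hedged sketch of the guard in sfq_dump_class_stats():

    sfq_index idx = q->ht[cl - 1];

    /* the slot may have been emptied and freed since sfq_walk()
     * selected it; only dereference it while it is still in use */
    if (idx != SFQ_EMPTY_SLOT) {
        const struct sfq_slot *slot = &q->slots[idx];

        xstats.allot = slot->allot;
        qs.qlen = slot->qlen;
    }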

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

21 Dec, 2010

4 commits

  • Here is a respin of the patch.

    I'll send a short patch to make SFQ more fair in the presence of large
    packets as well.

    Thanks

    [PATCH v3 net-next-2.6] net_sched: sch_sfq: better struct layouts

    This patch shrinks sizeof(struct sfq_sched_data) from 0x14f8 (or more
    if spinlocks are bigger) to 0x1180 bytes, and reduces text size as
    well.

    text    data   bss   dec    hex   filename
    4821    152    0     4973   136d  old/net/sched/sch_sfq.o
    4627    136    0     4763   129b  new/net/sched/sch_sfq.o

    All data for a slot/flow is now grouped in a compact and cache-friendly
    structure, instead of being spread across many different places.

    struct sfq_slot {
        struct sk_buff  *skblist_next;
        struct sk_buff  *skblist_prev;
        sfq_index       qlen;   /* number of skbs in skblist */
        sfq_index       next;   /* next slot in sfq chain */
        struct sfq_head dep;    /* anchor in dep[] chains */
        unsigned short  hash;   /* hash value (index in ht[]) */
        short           allot;  /* credit for this slot */
    };

    Signed-off-by: Eric Dumazet
    Cc: Jarek Poplawski
    Cc: Patrick McHardy
    Signed-off-by: David S. Miller

    Eric Dumazet
     
  • David S. Miller
     
  • When deploying SFQ/IFB here at work, I found the allot management was
    pretty wrong in SFQ, even after changing allot from short to int...

    We should init allot for each new flow, not reuse a previous value
    found in the slot.

    Before the patch, I saw bursts of several packets per flow, apparently
    defying the default "quantum 1514" limit I had on my SFQ class.

    class sfq 11:1 parent 11:
    (dropped 0, overlimits 0 requeues 0)
    backlog 0b 7p requeues 0
    allot 11546

    class sfq 11:46 parent 11:
    (dropped 0, overlimits 0 requeues 0)
    backlog 0b 1p requeues 0
    allot -23873

    class sfq 11:78 parent 11:
    (dropped 0, overlimits 0 requeues 0)
    backlog 0b 5p requeues 0
    allot 11393

    After the patch, there is better fairness among flows, the allot limit
    being respected and allot staying positive:

    class sfq 11:e parent 11:
    (dropped 0, overlimits 0 requeues 86)
    backlog 0b 3p requeues 86
    allot 596

    class sfq 11:94 parent 11:
    (dropped 0, overlimits 0 requeues 0)
    backlog 0b 3p requeues 0
    allot 1468

    class sfq 11:a4 parent 11:
    (dropped 0, overlimits 0 requeues 0)
    backlog 0b 4p requeues 0
    allot 650

    class sfq 11:bb parent 11:
    (dropped 0, overlimits 0 requeues 0)
    backlog 0b 3p requeues 0
    allot 596
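
    The core of the fix, sketched (hedged): give a flow fresh credit when
    it (re)enters the round robin, instead of reusing whatever stale allot
    the slot last held:

    if (slot->qlen == 1) {          /* The flow is new */
        if (q->tail == NULL) {      /* It is the first flow */
            slot->next = x;
        } else {
            slot->next = q->tail->next;
            q->tail->next = x;
        }
        q->tail = slot;
        slot->allot = q->quantum;   /* fresh credit, not a leftover */
    }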

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     
  • We currently return, for each active SFQ slot, the number of packets
    in the queue. We can also report the number of bytes accounted for by
    these packets.

    tc -s class show dev ifb0

    Before patch :

    class sfq 11:3d9 parent 11:
    (dropped 0, overlimits 0 requeues 0)
    backlog 0b 3p requeues 0
    allot 1266

    After patch :

    class sfq 11:3e4 parent 11:
    (dropped 0, overlimits 0 requeues 0)
    backlog 4380b 3p requeues 0
    allot 1212
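
    A sketch of the byte accounting (hedged; slot_queue_walk() iterates
    the skbs of one flow):

    if (idx != SFQ_EMPTY_SLOT) {
        const struct sfq_slot *slot = &q->slots[idx];
        struct sk_buff *skb;

        qs.qlen = slot->qlen;
        slot_queue_walk(slot, skb)
            qs.backlog += qdisc_pkt_len(skb);   /* bytes of this flow */
    }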

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

08 Aug, 2010

1 commit

  • Since the ->tcf_chain() method was added to the sch_sfq class options
    without ->bind_tcf(), there is an oops when a filter is added with the
    classid parameter.

    Fixes commit 7d2681a6ff4f9ab5e48d02550b4c6338f1638998
    netdev thread: null pointer at cls_api.c
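
    A hedged sketch of the shape of the fix: provide a trivial
    ->bind_tcf() so tcf_bind_filter() has something to call (the exact
    ops wiring is abbreviated):

    static unsigned long sfq_bind(struct Qdisc *sch, unsigned long parent,
                                  u32 classid)
    {
        /* SFQ classes are implicit; nothing to bind, just don't oops */
        return 0;
    }

    static const struct Qdisc_class_ops sfq_class_ops = {
        .tcf_chain  = sfq_find_tcf,
        .bind_tcf   = sfq_bind,
        /* ... remaining ops unchanged ... */
    };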

    Signed-off-by: Jarek Poplawski
    Reported-by: Franchoze Eric
    Signed-off-by: David S. Miller

    Jarek Poplawski
     

30 Mar, 2010

1 commit

  • Update gfp.h and slab.h includes to prepare for breaking implicit
    slab.h inclusion from percpu.h

    percpu.h is included by sched.h and module.h, and thus ends up being
    included when building most .c files. percpu.h includes slab.h, which
    in turn includes gfp.h, making everything defined by the two files
    universally available and complicating inclusion dependencies.

    The percpu.h -> slab.h dependency is about to be removed. Prepare for
    this change by updating users of gfp and slab facilities to include
    those headers directly instead of assuming availability. As this
    conversion needs to touch a large number of source files, the
    following script is used as the basis of conversion.

    http://userweb.kernel.org/~tj/misc/slabh-sweep.py

    The script does the following.

    * Scan files for gfp and slab usages and update includes such that
    only the necessary includes are there, i.e. if only gfp is used,
    gfp.h; if slab is used, slab.h.

    * When the script inserts a new include, it looks at the include
    blocks and tries to put the new include such that its order conforms
    to its surroundings. It's put in the include block which contains
    core kernel includes, in the same order that the rest are ordered:
    alphabetical, Christmas tree, rev-Xmas-tree, or at the end if there
    doesn't seem to be any matching order.

    * If the script can't find a place to put a new include (mostly
    because the file doesn't have a fitting include block), it prints out
    an error message indicating which .h file needs to be added to the
    file.

    The conversion was done in the following steps.

    1. The initial automatic conversion of all .c files updated slightly
    over 4000 files, deleting around 700 includes and adding ~480 gfp.h
    and ~3000 slab.h inclusions. The script emitted errors for ~400
    files.

    2. Each error was manually checked. Some didn't need the inclusion,
    some needed manual addition, while adding it to the implementation .h
    or embedding .c file was more appropriate for others. This step added
    inclusions to around 150 files.

    3. The script was run again and the output was compared to the edits
    from #2 to make sure no file was left behind.

    4. Several build tests were done and a couple of problems were fixed,
    e.g. lib/decompress_*.c used malloc/free() wrappers around slab
    APIs, requiring slab.h to be added manually.

    5. The script was run on all .h files, but without automatically
    editing them, as sprinkling gfp.h and slab.h inclusions around .h
    files could easily lead to inclusion dependency hell. Most gfp.h
    inclusion directives were ignored, as stuff from gfp.h was usually
    widely available and often used in preprocessor macros. Each slab.h
    inclusion directive was examined and added manually as necessary.

    6. percpu.h was updated not to include slab.h.

    7. Build tests were done on the following configurations and failures
    were fixed. CONFIG_GCOV_KERNEL was turned off for all tests (as my
    distributed build env didn't work with gcov compiles) and a few more
    options had to be turned off depending on the arch to make things
    build (like ipr on powerpc/64, which failed due to missing writeq).

    * x86 and x86_64 UP and SMP allmodconfig and a custom test config.
    * powerpc and powerpc64 SMP allmodconfig
    * sparc and sparc64 SMP allmodconfig
    * ia64 SMP allmodconfig
    * s390 SMP allmodconfig
    * alpha SMP allmodconfig
    * um on x86_64 SMP allmodconfig

    8. percpu.h modifications were reverted so that it could be applied as
    a separate patch and serve as bisection point.

    Given the fact that I had only a couple of failures from tests on step
    6, I'm fairly confident about the coverage of this conversion patch.
    If there is a breakage, it's likely to be something in one of the arch
    headers, which should be easily discoverable on most builds of the
    specific arch.

    Signed-off-by: Tejun Heo <tj@kernel.org>
    Guess-its-ok-by: Christoph Lameter <cl@linux-foundation.org>
    Cc: Ingo Molnar <mingo@redhat.com>
    Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>

    Tejun Heo
     

06 Sep, 2009

1 commit

  • Some schedulers don't support creating, changing or deleting classes.
    Make the respective callbacks optional and consistently return
    -EOPNOTSUPP for unsupported operations, instead of currently either
    -EOPNOTSUPP, -ENOSYS or no error.

    In case of sch_prio and sch_multiq, the removed operations additionally
    checked for an invalid class. This is not necessary, since the class
    argument can only originate from ->get(), or in case of ->change() is
    0 for creation of new classes, in which case ->change() incorrectly
    returned -ENOENT.

    As a side effect, this patch fixes a possible (root-only) NULL pointer
    function call in sch_ingress, which didn't implement the so far
    mandatory ->delete() operation.
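
    A hedged sketch of the caller-side convention this establishes
    (hypothetical excerpt, not the literal diff):

    /* e.g. in tc_ctl_tclass(), before invoking an optional callback: */
    err = -EOPNOTSUPP;
    if (cops->delete)
        err = cops->delete(q, cl);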

    Signed-off-by: Patrick McHardy
    Signed-off-by: David S. Miller

    Patrick McHardy
     

03 Jun, 2009

1 commit

  • Define three accessors to get/set the dst attached to an skb:

    struct dst_entry *skb_dst(const struct sk_buff *skb)

    void skb_dst_set(struct sk_buff *skb, struct dst_entry *dst)

    void skb_dst_drop(struct sk_buff *skb)

    The last one should replace occurrences of:

    dst_release(skb->dst);
    skb->dst = NULL;

    Delete the skb->dst field.
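
    A minimal sketch of what such accessors look like (hedged: assuming a
    plain pointer field named dst; the real field name and representation
    changed over time):

    static inline struct dst_entry *skb_dst(const struct sk_buff *skb)
    {
        return skb->dst;
    }

    static inline void skb_dst_set(struct sk_buff *skb, struct dst_entry *dst)
    {
        skb->dst = dst;
    }

    static inline void skb_dst_drop(struct sk_buff *skb)
    {
        dst_release(skb_dst(skb));      /* dst_release(NULL) is a no-op */
        skb_dst_set(skb, NULL);
    }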

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

09 Jan, 2009

1 commit

  • Cc: Ingo Molnar
    Cc: Thomas Gleixner
    Acked-by: Theodore Ts'o
    Acked-by: Mark Fasheh
    Acked-by: David S. Miller
    Cc: James Morris
    Acked-by: Casey Schaufler
    Acked-by: Takashi Iwai
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Fernando Carrijo
     

14 Nov, 2008

1 commit

  • After implementing qdisc->ops->peek() and changing sch_netem into a
    classless qdisc, there are no more qdisc->ops->requeue() users. This
    patch removes this method with its wrappers (qdisc_requeue()), and
    also the unused qdisc->requeue structure. There are a few minor fixes
    of warnings (htb_enqueue()) and comments, btw.

    The idea to kill ->requeue() and a similar patch were first developed
    by David S. Miller.

    Signed-off-by: Jarek Poplawski
    Signed-off-by: David S. Miller

    Jarek Poplawski
     

31 Oct, 2008

1 commit

  • From: Patrick McHardy

    Just a demonstration of how easy adding a peek operation to the
    work-conserving qdiscs actually is. It doesn't need to keep or change
    any internal state in many cases, thanks to the guarantee that the
    packet will either be dequeued or, if another packet arrives, the
    upper qdisc will immediately ->peek again to reevaluate the state.

    (This is Patrick's patch, only slightly modified.)
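
    SFQ's peek from that patch is indeed tiny; a sketch (hedged, using the
    pre-rework field names q->tail, q->next[] and q->qs[]):

    static struct sk_buff *sfq_peek(struct Qdisc *sch)
    {
        struct sfq_sched_data *q = qdisc_priv(sch);
        sfq_index a;

        /* no active slots */
        if (q->tail == SFQ_DEPTH)
            return NULL;

        a = q->next[q->tail];
        return skb_peek(&q->qs[a]);     /* look, don't dequeue */
    }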

    Signed-off-by: Jarek Poplawski
    Signed-off-by: David S. Miller

    Patrick McHardy