13 Jan, 2012

1 commit

  • Adds an optional Random Early Detection on each SFQ flow queue.

    Traditional SFQ limits the count of packets, while RED also permits
    controlling the number of bytes per flow, and adds ECN capability as
    well.

    1) We don't handle the idle time management in this RED
    implementation, since each 'new flow' begins with a null qavg. We
    really want to address backlogged flows.

    2) If headdrop is selected, we try to ECN-mark the first packet
    instead of the currently enqueued packet. This gives faster feedback
    for TCP flows compared to traditional RED [ which marks the last
    packet in the queue ].

    Example of use :

    tc qdisc add dev $DEV parent 1:1 handle 10: est 1sec 4sec sfq \
    limit 3000 headdrop flows 512 divisor 16384 \
    redflowlimit 100000 min 8000 max 60000 probability 0.20 ecn

    qdisc sfq 10: parent 1:1 limit 3000p quantum 1514b depth 127 headdrop
    flows 512/16384 divisor 16384
    ewma 6 min 8000b max 60000b probability 0.2 ecn
    prob_mark 0 prob_mark_head 4876 prob_drop 6131
    forced_mark 0 forced_mark_head 0 forced_drop 0
    Sent 1175211782 bytes 777537 pkt (dropped 6131, overlimits 11007
    requeues 0)
    rate 99483Kbit 8219pps backlog 689392b 456p requeues 0

    In this test, with 64 netperf TCP_STREAM sessions, 50% of them using
    ECN-enabled flows, we can see that the number of CE-marked packets is
    smaller than the number of drops (for the non-ECN flows).

    If the same test is run without RED, we can see that the backlog is
    much bigger.

    qdisc sfq 10: parent 1:1 limit 3000p quantum 1514b depth 127 headdrop
    flows 512/16384 divisor 16384
    Sent 1148683617 bytes 795006 pkt (dropped 0, overlimits 0 requeues 0)
    rate 98429Kbit 8521pps backlog 1221290b 841p requeues 0

    Signed-off-by: Eric Dumazet
    CC: Stephen Hemminger
    CC: Dave Taht
    Tested-by: Dave Taht
    Signed-off-by: David S. Miller

    Eric Dumazet
     

06 Jan, 2012

3 commits

  • This patch splits the red_parms structure into two components.

    One holds the RED 'constant' parameters, and the other contains the
    variables.

    This permits a size reduction of the GRED qdisc, and is a preliminary
    step toward adding an optional RED unit to SFQ.

    SFQRED will have a single red_parms structure shared by all flows, and a
    private red_vars per flow.
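
    A minimal compilable sketch of the split (illustrative field names
    drawn from the classic RED algorithm, not copied verbatim from the
    kernel header):

    #include <stdio.h>
    #include <stdint.h>

    struct red_parms {                 /* 'constant' config, shareable */
        uint32_t qth_min, qth_max;     /* thresholds (scaled)          */
        uint32_t max_P;                /* max probability, Q0.32       */
        uint8_t  Wlog, Plog, Scell_log;
    };

    struct red_vars {                  /* mutable state, one per flow  */
        unsigned long qavg;            /* EWMA of queue length         */
        uint32_t qcount;               /* packets since last random event */
        uint32_t qR;                   /* cached random value          */
        uint64_t qidlestart;           /* when the queue went idle     */
    };

    int main(void)
    {
        /* SFQRED keeps one shared red_parms plus a small red_vars per flow */
        printf("parms: %zu bytes (shared), vars: %zu bytes (per flow)\n",
               sizeof(struct red_parms), sizeof(struct red_vars));
        return 0;
    }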

    Signed-off-by: Eric Dumazet
    CC: Dave Taht
    CC: Stephen Hemminger
    Signed-off-by: David S. Miller

    Eric Dumazet
     
  • SFQ as implemented in Linux is very limited, with at most 127 flows
    and a limit of 127 packets. [ So if 127 flows are active, we have one
    packet per flow. ]

    This patch brings the following features to SFQ to cope with modern
    needs.

    - Ability to specify a smaller per-flow limit of in-flight packets.
    (the default value being 127 packets)

    - Ability to have up to 65408 active flows (instead of 127)

    - Ability to have head drops instead of tail drops
    (to drop old packets from a flow)

    Example of use : No more than 20 packets per flow, max 8000 flows, max
    20000 packets in SFQ qdisc, hash table of 65536 slots.

    tc qdisc add ... sfq \
    flows 8000 \
    depth 20 \
    headdrop \
    limit 20000 \
    divisor 65536

    RAM usage :

    2 bytes per hash table entry (instead of the previous 1 byte/entry)
    32 bytes per flow on 64bit arches, instead of 384 for QFQ, giving a
    much better cache hit ratio.
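
    A back-of-the-envelope check of those numbers for the example above (a
    sketch; the 2 and 32 byte figures come straight from this text):

    #include <stdio.h>

    int main(void)
    {
        unsigned long divisor = 65536, flows = 8000;
        unsigned long ht = divisor * 2;   /* 2 bytes per hash table entry  */
        unsigned long fl = flows * 32;    /* 32 bytes per flow, 64bit arch */

        printf("hash table %lu bytes + flows %lu bytes = ~%lu KB\n",
               ht, fl, (ht + fl) / 1024);          /* ~378 KB */
        return 0;
    }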

    Signed-off-by: Eric Dumazet
    CC: Dave Taht
    Signed-off-by: David S. Miller

    Eric Dumazet
     
  • Not now, but it looks like you are correct. q->qdisc is NULL until
    another additional qdisc is attached (besides tfifo). See
    50612537e9ab2969312. The following patch should work.

    From: Hagen Paul Pfeifer

    netem: catch NULL pointer by updating the real qdisc statistic

    Reported-by: Vijay Subramanian
    Signed-off-by: Hagen Paul Pfeifer
    Acked-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Hagen Paul Pfeifer
     

04 Jan, 2012

4 commits

  • When trying to allocate ~32768 qdiscs using the autohandle mechanism,
    we can fill the space managed by the kernel (handles in the
    [8000-FFFF]:0000 range).

    But the O(N^2) qdisc_alloc_handle() loops 0x10000 times instead of
    0x8000.

    time tc qdisc add dev eth0 parent 10:7fff pfifo limit 10
    RTNETLINK answers: Cannot allocate memory
    real 1m54.826s
    user 0m0.000s
    sys 0m0.004s

    INFO: rcu_sched_state detected stall on CPU 0 (t=60000 jiffies)

    Halve the number of loops, and add a cond_resched() call; we hold only
    rtnl (a mutex) at this point, so rescheduling is allowed.
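
    A userspace sketch of the fixed loop shape (qdisc_lookup() and
    cond_resched() are stubbed here, and the code is a reconstruction from
    the description above, not the literal diff):

    #include <stdint.h>
    #include <stdbool.h>
    #include <sched.h>

    #define TC_H_MAKE(maj, min) (((maj) & 0xFFFF0000U) | ((min) & 0xFFFFU))
    #define TC_H_ROOT 0xFFFFFFFFU

    static bool qdisc_lookup(uint32_t h) { (void)h; return false; } /* stub */
    static void cond_resched(void) { sched_yield(); }  /* stand-in */

    static uint32_t qdisc_alloc_handle(void)
    {
        static uint32_t autohandle = TC_H_MAKE(0x80000000U, 0);
        int i = 0x8000;              /* only 0x8000 majors in 8000-FFFF */

        do {
            autohandle += TC_H_MAKE(0x10000U, 0);        /* next major */
            if (autohandle == TC_H_MAKE(TC_H_ROOT, 0))
                autohandle = TC_H_MAKE(0x80000000U, 0);  /* wrap       */
            if (!qdisc_lookup(autohandle))
                return autohandle;   /* free handle found */
            cond_resched();          /* we hold only rtnl (a mutex), so
                                        rescheduling avoids RCU stalls */
        } while (--i > 0);
        return 0;                    /* handle space exhausted */
    }

    int main(void) { return qdisc_alloc_handle() ? 0 : 1; }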

    Signed-off-by: Eric Dumazet
    CC: Dave Taht
    Signed-off-by: David S. Miller

    Eric Dumazet
     
  • We can underestimate q->wsum in case of "tc class replace ... qfq"
    and/or qdisc_create_dflt() error.

    wsum is not really used in the fast path, only at qfq qdisc/class
    setup, to catch user errors.

    Signed-off-by: Eric Dumazet
    CC: Stephen Hemminger
    Signed-off-by: David S. Miller

    Eric Dumazet
     
    grp->slot_shift is between 22 and 41, so using 32-bit wide variables
    is probably a typo.

    This could explain the QFQ hangs Dave reported to me after ~2^23
    packets (23 = 64 - 41).
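
    The suspected truncation, illustrated (hypothetical values;
    slot_shift = 41 as mentioned above):

    #include <stdio.h>
    #include <stdint.h>

    int main(void)
    {
        int slot_shift = 41;
        uint64_t wide   = 1ULL << slot_shift;             /* what QFQ needs */
        uint32_t narrow = (uint32_t)(1ULL << slot_shift); /* 32bit variable */

        printf("64bit: %llu, truncated to 32bit: %u\n",
               (unsigned long long)wide, narrow);         /* ..., 0 */
        return 0;
    }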

    Reported-by: Dave Taht
    Signed-off-by: Eric Dumazet
    CC: Stephen Hemminger
    CC: Dave Taht
    Signed-off-by: David S. Miller

    Eric Dumazet
     
  • The SFQ enqueue algorithm puts a new flow _behind_ all pre-existing
    flows in the circular list. In fact this is probably an old SFQ
    implementation bug.

    100 Mbit/s = ~8333 full frames per second, or ~8 frames per ms.

    With 50 flows, it means your "new flow" will have to wait for 50
    packets to be sent before its own packet. That's the ~6ms.
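
    The arithmetic, spelled out (assuming 1500 byte full frames):

    #include <stdio.h>

    int main(void)
    {
        double rate_bps = 100e6;            /* 100 Mbit/s           */
        double frame_bits = 1500 * 8.0;     /* one full-size frame  */
        double fps = rate_bps / frame_bits; /* ~8333 frames/second  */
        int flows = 50;

        printf("%.0f frames/s -> a new flow waits ~%.1f ms behind %d flows\n",
               fps, flows * 1000.0 / fps, flows);         /* ~6.0 ms */
        return 0;
    }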

    We certainly can change SFQ to give a priority advantage to new flows,
    so that the next dequeued packet is taken from a new flow, not an old
    one.

    Reported-by: Dave Taht
    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

31 Dec, 2011

2 commits

  • Commit 10f6dfcfde (Revert "sch_netem: Remove classful functionality")
    reintroduced classful functionality to netem, but broke basic netem
    behavior :

    netem uses a t(ime)fifo queue and stores timestamps in skb->cb[].

    If the qdisc is changed, time constraints are no longer respected and
    the other qdisc can destroy skb->cb[] and block netem at dequeue time.

    Fix this by always using the internal tfifo, and optionally attaching
    a child qdisc to netem (or a tree of qdiscs).

    Example of use :

    DEV=eth3
    tc qdisc del dev $DEV root
    tc qdisc add dev $DEV root handle 30: est 1sec 8sec netem delay 20ms 10ms
    tc qdisc add dev $DEV handle 40:0 parent 30:0 tbf \
    burst 20480 limit 20480 mtu 1514 rate 32000bps

    qdisc netem 30: root refcnt 18 limit 1000 delay 20.0ms 10.0ms
    Sent 190792 bytes 413 pkt (dropped 0, overlimits 0 requeues 0)
    rate 18416bit 3pps backlog 0b 0p requeues 0
    qdisc tbf 40: parent 30: rate 256000bit burst 20Kb/8 mpu 0b lat 0us
    Sent 190792 bytes 413 pkt (dropped 6, overlimits 10 requeues 0)
    backlog 0b 5p requeues 0

    Signed-off-by: Eric Dumazet
    CC: Stephen Hemminger
    Signed-off-by: David S. Miller

    Eric Dumazet
     
  • David S. Miller
     

30 Dec, 2011

1 commit

  • Provide child qdisc backlog (byte count) information so that "tc -s
    qdisc" can report it to the user.

    qdisc netem 30: root refcnt 18 limit 1000 delay 20.0ms 10.0ms
    Sent 948517 bytes 898 pkt (dropped 0, overlimits 0 requeues 1)
    rate 175056bit 16pps backlog 114b 1p requeues 1
    qdisc tbf 40: parent 30: rate 256000bit burst 20Kb/8 mpu 0b lat 0us
    Sent 948517 bytes 898 pkt (dropped 15, overlimits 611 requeues 0)
    backlog 18168b 12p requeues 0

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

25 Dec, 2011

1 commit

  • commit 6373a9a286 (netem: use vmalloc for distribution table) added a
    regression, since vfree() is called while holding a spinlock and with
    BH disabled.

    Fix this by doing the pointer swap in the critical section, and
    freeing after the spinlock is released.

    Also add __GFP_NOWARN to the kmalloc() attempt, since we fall back to
    vmalloc().
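
    The pattern, as a userspace sketch (generic names; the kernel code
    uses the qdisc tree lock and vfree() instead):

    #include <pthread.h>
    #include <stdlib.h>

    static pthread_mutex_t table_lock = PTHREAD_MUTEX_INITIALIZER;
    static int *dist_table;

    static void replace_table(int *new_table)
    {
        int *old;

        pthread_mutex_lock(&table_lock);
        old = dist_table;            /* swap inside the critical section */
        dist_table = new_table;
        pthread_mutex_unlock(&table_lock);

        free(old);                   /* free only after the lock (and, in
                                        the kernel case, BH) is released */
    }

    int main(void)
    {
        replace_table(malloc(1024 * sizeof(int)));
        replace_table(NULL);         /* frees the previous table */
        return 0;
    }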

    Signed-off-by: Eric Dumazet
    Acked-by: Stephen Hemminger
    Signed-off-by: David S. Miller

    Eric Dumazet
     

24 Dec, 2011

3 commits

  • Conflicts:
    net/bluetooth/l2cap_core.c

    Just two overlapping changes, one added an initialization of
    a local variable, and another change added a new local variable.

    Signed-off-by: David S. Miller

    David S. Miller
     
  • The new netem loss model is configured with nested netlink messages.
    This code is being overly strict about sizes, and is easily confused
    by padding (or possible future expansion). Also, the message for
    gemodel is incorrect.

    Signed-off-by: Stephen Hemminger
    Signed-off-by: David S. Miller

    stephen hemminger
     
  • Add backlog (byte count) information to hfsc classes and the qdisc,
    so that "tc -s" can report it to the user, instead of 0 values :

    qdisc hfsc 1: root refcnt 6 default 20
    Sent 45141660 bytes 30545 pkt (dropped 0, overlimits 91751 requeues 0)
    rate 1492Kbit 126pps backlog 103226b 74p requeues 0
    ...
    class hfsc 1:20 parent 1:1 leaf 1201: rt m1 0bit d 0us m2 400000bit ls m1 0bit d 0us m2 200000bit
    Sent 49534912 bytes 33519 pkt (dropped 0, overlimits 0 requeues 0)
    backlog 81822b 56p requeues 0
    period 23 work 49451576 bytes rtwork 13277552 bytes level 0
    ...

    Signed-off-by: Eric Dumazet
    CC: John A. Sullivan III
    Signed-off-by: David S. Miller

    Eric Dumazet
     

22 Dec, 2011

1 commit

  • A known Out-Of-Order (OOO) problem hurts SFQ when the timer changes
    the perturbation value, since new packets delivered to SFQ enqueue
    might end up in different slots than previous in-flight packets.

    With round robin delivery, we can thus deliver packets in a different
    order.

    Since SFQ is limited to a small number of in-flight packets, we can
    rehash packets so that this OOO problem is fixed.

    This rehashing is performed only if the internal flow classifier is in
    use.

    We now store the "struct flow_keys" in skb->cb[] so that we don't call
    skb_flow_dissect() again while rehashing.

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

17 Dec, 2011

1 commit

  • In the control path, it's better to use GFP_KERNEL allocations where
    possible.

    Before taking the qdisc spinlock, we preallocate memory just in case
    we'll need it in gred_change_vq().

    This is a followup to commit 3f1e6d3fd37b (sch_gred: should not use
    GFP_KERNEL while holding a spinlock)

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

15 Dec, 2011

1 commit

  • It's better to use a predefined size for this small automatic
    variable.

    This removes a sparse error as well :

    net/sched/cls_flow.c:288:13: error: bad constant expression

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

13 Dec, 2011

2 commits

  • This extension can be used to simulate special link layer
    characteristics. "Simulate" because packet data is not modified; only
    the calculation base is changed, to delay a packet based on the
    original packet size and artificial cell information.

    packet_overhead can be used to simulate a link layer header
    compression scheme (e.g. set packet_overhead to -20), or, with a
    positive packet_overhead value, an additional MAC header can be
    simulated. It is also possible to "replace" the 14 byte Ethernet
    header with something else.

    cell_size and cell_overhead can be used to simulate link layer schemes
    based on cells, like some TDMA schemes. Another application area is
    MAC schemes using link layer fragmentation with a (small) header on
    each fragment. Cell size is the maximum amount of data bytes within
    one cell. Cell overhead is an additional variable to change the
    per-cell-overhead (e.g. a 5 byte header per fragment).

    Example (5 kbit/s, 20 byte per packet overhead, cell-size 100 byte, per
    cell overhead 5 byte):

    tc qdisc add dev eth0 root netem rate 5kbit 20 100 5
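
    The resulting length/delay computation, sketched for this
    configuration (assuming, per the description above, that each started
    cell is accounted at full cell size plus its per-cell overhead):

    #include <stdio.h>

    int main(void)
    {
        unsigned rate = 625;            /* 5 kbit/s = 625 byte/s      */
        unsigned packet_overhead = 20, cell_size = 100, cell_overhead = 5;
        unsigned len = 256;             /* example payload, in bytes  */

        len += packet_overhead;                             /* 276      */
        unsigned cells = (len + cell_size - 1) / cell_size; /* 3 cells  */
        len = cells * (cell_size + cell_overhead);          /* 315      */

        printf("wire-equivalent %u bytes -> %.3f s at %u byte/s\n",
               len, (double)len / rate, rate);              /* ~0.504 s */
        return 0;
    }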

    Signed-off-by: Hagen Paul Pfeifer
    Signed-off-by: Florian Westphal
    Acked-by: Stephen Hemminger
    Signed-off-by: David S. Miller

    Hagen Paul Pfeifer
     
  • gred_change_vq() is called under sch_tree_lock(sch).

    This means a spinlock is held, and we are not allowed to sleep in this
    context.

    We might pre-allocate memory using GFP_KERNEL before taking the
    spinlock, but this is not suitable for stable material.

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

10 Dec, 2011

1 commit

  • Now that RED uses a Q0.32 number to store max_p (max probability),
    allow RED/GRED/CHOKE to use/report the full resolution at config/dump
    time.

    Old tc binaries are not aware of the new attributes, and still set/get
    Plog.

    New tc binaries set/get both Plog and max_p for backward
    compatibility, and display the "probability value" if they get max_p
    from new kernels.

    # tc -d qdisc show dev ...
    ...
    qdisc red 10: parent 1:1 limit 360Kb min 30Kb max 90Kb ecn ewma 5
    probability 0.09 Scell_log 15

    Make sure we avoid a potential divide by 0 in reciprocal_value() if
    (max_th - min_th) is big.
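
    For reference, the Q0.32 encoding, sketched (max_p = 0.09 as in the
    dump above; the stored value is simply the probability scaled by 2^32,
    so it must stay below 1.0):

    #include <stdio.h>
    #include <stdint.h>

    int main(void)
    {
        double p = 0.09;
        uint32_t max_P = (uint32_t)(p * 4294967296.0);   /* p * 2^32 */

        printf("max_P = %u (Q0.32), decodes back to %f\n",
               max_P, max_P / 4294967296.0);
        return 0;
    }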

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

09 Dec, 2011

1 commit

  • Adaptive RED AQM for Linux, based on the paper by Sally Floyd,
    Ramakrishna Gummadi, and Scott Shenker, August 2001 :

    http://icir.org/floyd/papers/adaptiveRed.pdf

    The goal of Adaptive RED is to make max_p a dynamic value between 1%
    and 50%, so that the average queue size reaches the target
    (min_th + max_th) / 2, midway between the thresholds.

    Every 500 ms:
      if (avg > target and max_p <= 0.5)
        increase max_p : max_p += alpha;
      else if (avg < target and max_p >= 0.01)
        decrease max_p : max_p *= beta;

    target : [min_th + 0.4*(max_th - min_th),
              min_th + 0.6*(max_th - min_th)]
    alpha : min(0.01, max_p / 4)
    beta : 0.9

    max_P is a Q0.32 fixed point number (unsigned, with 32 bits mantissa).
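
    The adaptation step, as a runnable floating-point sketch (the kernel
    operates on the Q0.32 value instead; the thresholds below match the
    example that follows):

    #include <stdio.h>

    /* One 500 ms Adaptive RED step: nudge max_p so the average queue
     * length converges toward the middle of [min_th, max_th]. */
    static double adapt_max_p(double max_p, double avg, double target)
    {
        double alpha = (max_p / 4 < 0.01) ? max_p / 4 : 0.01;

        if (avg > target && max_p <= 0.5)
            max_p += alpha;         /* queue too long: mark/drop more */
        else if (avg < target && max_p >= 0.01)
            max_p *= 0.9;           /* beta: queue short: back off    */
        return max_p;
    }

    int main(void)
    {
        double min_th = 30000, max_th = 90000;
        double target = (min_th + max_th) / 2;       /* 60000 */
        double max_p = 0.02;

        for (int i = 0; i < 5; i++)                  /* avg stuck high */
            max_p = adapt_max_p(max_p, 80000, target);
        printf("max_p grew to %.6f\n", max_p);
        return 0;
    }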

    Changes against our RED implementation are :

    max_p is no longer a negative power of two (1/(2^Plog)), but a Q0.32
    fixed point number, to allow the full range described in the Adaptive
    RED paper.

    To deliver a random number, we now use a reciprocal divide (that's
    really a multiply), but this operation is done once per marked/dropped
    packet when in the RED_BETWEEN_TRESH window, so the added cost
    (compared to the previous AND operation) is near zero.

    The dump operation reports the current max_p value in a new
    TCA_RED_MAX_P attribute.

    Example on a 10Mbit link :

    tc qdisc add dev $DEV parent 1:1 handle 10: est 1sec 8sec red \
    limit 400000 min 30000 max 90000 avpkt 1000 \
    burst 55 ecn adaptative bandwidth 10Mbit

    # tc -s -d qdisc show dev eth3
    ...
    qdisc red 10: parent 1:1 limit 400000b min 30000b max 90000b ecn
    adaptative ewma 5 max_p=0.113335 Scell_log 15
    Sent 50414282 bytes 34504 pkt (dropped 35, overlimits 1392 requeues 0)
    rate 9749Kbit 831pps backlog 72056b 16p requeues 0
    marked 1357 early 35 pdrop 0 other 0

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

02 Dec, 2011

2 commits

  • On Wednesday 30 November 2011 at 14:36 -0800, Stephen Hemminger wrote:

    > (Almost) nobody uses RED because they can't figure it out.
    > According to Wikipedia, VJ says that:
    > "there are not one, but two bugs in classic RED."

    RED is useful for high-throughput routers; I doubt many Linux machines
    act as such devices.

    I was considering adding Adaptive RED (Sally Floyd, Ramakrishna
    Gummadi, Scott Shenker, August 2001).

    In this version, max_p is dynamic (from 1% to 50%), and the user only
    has to set up min_th (the target average queue size); max_th and wq
    ('burst' in Linux RED) are set up automatically.
    By the way, it seems we have a small bug in red_change() :

    if (skb_queue_empty(&sch->q))
            red_end_of_idle_period(&q->parms);

    First, if the queue is empty, we should call
    red_start_of_idle_period(&q->parms) instead.

    Second, since we no longer use sch->q but q->qdisc, the test is
    meaningless.

    Oh well...

    [PATCH] sch_red: fix red_change()

    Now that RED is classful, we must check q->qdisc->q.qlen, and if the
    queue is empty, we start an idle period instead of ending it.
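
    i.e. the check becomes (a sketch of the resulting test, not the
    verbatim diff):

    if (!q->qdisc->q.qlen)
            red_start_of_idle_period(&q->parms);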

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     
  • ERROR: "__udivdi3" [net/sched/sch_netem.ko] undefined!

    (A plain u64 division in the new netem rate code pulls in the libgcc
    helper __udivdi3, which the kernel does not provide on 32-bit arches;
    the fix is to use do_div() instead.)

    Signed-off-by: Eric Dumazet
    Acked-by: Hagen Paul Pfeifer
    Signed-off-by: David S. Miller

    Eric Dumazet
     

01 Dec, 2011

2 commits

  • Currently netem does not have the ability to emulate channel
    bandwidth. Only a static delay (and optional random jitter) can be
    configured.

    To emulate the channel rate, the token bucket filter (sch_tbf) can be
    used. But TBF has some major emulation flaws. The buffer (token bucket
    depth/rate) cannot be 0. Also, the idea behind TBF is that the credit
    (tokens in the bucket) fills up if no packet is transmitted, so that
    there is always a "positive" credit for new packets. In real life this
    behavior contradicts the laws of nature, where nothing can travel
    faster than the speed of light. E.g.: on an emulated 1000 byte/s link,
    a small IPv4/TCP SYN packet of ~50 bytes requires ~0.05 seconds - not
    0 seconds.
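
    The SYN example, worked through (a sketch of the per-packet delay a
    rate feature has to derive):

    #include <stdio.h>

    int main(void)
    {
        unsigned rate = 1000;   /* emulated link, in byte/s      */
        unsigned len = 50;      /* small IPv4/TCP SYN, ~50 bytes */

        /* the packet occupies the link for len/rate seconds */
        printf("%u bytes at %u byte/s -> %.3f s\n",
               len, rate, (double)len / rate);   /* 0.050 s */
        return 0;
    }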

    Netem is an excellent place to implement a rate limiting feature: static
    delay is already implemented, tfifo already has time information and the
    user can skip TBF configuration completely.

    This patch implements the rate feature, which can be configured via
    tc, e.g.:

    tc qdisc add dev eth0 root netem rate 10kbit

    To emulate a link of 5000 byte/s with an additional static delay of
    10ms:

    tc qdisc add dev eth0 root netem delay 10ms rate 5KBps

    Note: similar to TBF, the rate extension is bound to the kernel timing
    system. Depending on the architecture's timer granularity, higher
    rates (e.g. 10 Mbit/s and above) tend to produce transmission bursts.
    Also note: further queues live in network adapters; see ethtool(8).

    Signed-off-by: Hagen Paul Pfeifer
    Acked-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Hagen Paul Pfeifer
     
  • We need rcu_read_lock() protection before using dst_get_neighbour(),
    and we must cache its value (passing it to __teql_resolve()).

    teql_master_xmit() is called under rcu_read_lock_bh() protection; that
    is not enough.

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

29 Nov, 2011

2 commits

  • The current SFB double hashing does not fulfill SFB theory if two
    flows share the same rxhash value.

    Using skb_flow_dissect() gives much better hash dispersion, and
    tunnelling support as well.

    The double hashing point was mentioned by Florian Westphal.

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     
  • Instead of using a custom flow dissector, use skb_flow_dissect() and
    benefit from tunnelling support.

    This lack of tunnelling support was mentioned by Dan Siemon.

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

17 Nov, 2011

1 commit

  • This adds the /sys/class/net/DEV/queues/Q/tx_timeout attribute
    containing the total number of timeout events on the given queue. It
    is always available with CONFIG_SYSFS, independently of
    CONFIG_RPS/XPS.

    Credits to Stephen Hemminger for a preliminary version of this patch.

    Tested:
    without CONFIG_SYSFS (compilation only)
    with sysfs and without CONFIG_RPS & CONFIG_XPS
    with sysfs and without CONFIG_RPS
    with sysfs and without CONFIG_XPS
    with defaults

    Signed-off-by: David Decotigny
    Signed-off-by: David S. Miller

    david decotigny
     

09 Nov, 2011

1 commit

  • Remove the assumption that skb_get_rxhash() makes the IP header and
    ports linear, and use skb_header_pointer() instead in
    choke_match_flow().

    This permits __skb_get_rxhash() to eventually use skb_header_pointer().

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet