13 Jan, 2012
1 commit
-
Adds an optional Random Early Detection on each SFQ flow queue.
Traditional SFQ limits count of packets, while RED permits to also
control number of bytes per flow, and adds ECN capability as well.1) We dont handle the idle time management in this RED implementation,
since each 'new flow' begins with a null qavg. We really want to address
backlogged flows.2) if headdrop is selected, we try to ecn mark first packet instead of
currently enqueued packet. This gives faster feedback for tcp flows
compared to traditional RED [ marking the last packet in queue ]Example of use :
tc qdisc add dev $DEV parent 1:1 handle 10: est 1sec 4sec sfq \
limit 3000 headdrop flows 512 divisor 16384 \
redflowlimit 100000 min 8000 max 60000 probability 0.20 ecnqdisc sfq 10: parent 1:1 limit 3000p quantum 1514b depth 127 headdrop
flows 512/16384 divisor 16384
ewma 6 min 8000b max 60000b probability 0.2 ecn
prob_mark 0 prob_mark_head 4876 prob_drop 6131
forced_mark 0 forced_mark_head 0 forced_drop 0
Sent 1175211782 bytes 777537 pkt (dropped 6131, overlimits 11007
requeues 0)
rate 99483Kbit 8219pps backlog 689392b 456p requeues 0In this test, with 64 netperf TCP_STREAM sessions, 50% using ECN enabled
flows, we can see number of packets CE marked is smaller than number of
drops (for non ECN flows)If same test is run, without RED, we can check backlog is much bigger.
qdisc sfq 10: parent 1:1 limit 3000p quantum 1514b depth 127 headdrop
flows 512/16384 divisor 16384
Sent 1148683617 bytes 795006 pkt (dropped 0, overlimits 0 requeues 0)
rate 98429Kbit 8521pps backlog 1221290b 841p requeues 0Signed-off-by: Eric Dumazet
CC: Stephen Hemminger
CC: Dave Taht
Tested-by: Dave Taht
Signed-off-by: David S. Miller
06 Jan, 2012
1 commit
-
SFQ as implemented in Linux is very limited, with at most 127 flows
and limit of 127 packets. [ So if 127 flows are active, we have one
packet per flow ]This patch brings to SFQ following features to cope with modern needs.
- Ability to specify a smaller per flow limit of inflight packets.
(default value being at 127 packets)- Ability to have up to 65408 active flows (instead of 127)
- Ability to have head drops instead of tail drops
(to drop old packets from a flow)Example of use : No more than 20 packets per flow, max 8000 flows, max
20000 packets in SFQ qdisc, hash table of 65536 slots.tc qdisc add ... sfq \
flows 8000 \
depth 20 \
headdrop \
limit 20000 \
divisor 65536Ram usage :
2 bytes per hash table entry (instead of previous 1 byte/entry)
32 bytes per flow on 64bit arches, instead of 384 for QFQ, so much
better cache hit ratio.Signed-off-by: Eric Dumazet
CC: Dave Taht
Signed-off-by: David S. Miller
05 Jan, 2012
2 commits
-
SFQ q->perturbation is used in sfq_hash() as an input to Jenkins hash.
We currently randomize this 32bit value only if a perturbation timer is
setup.Its much better to always initialize it to defeat attackers, or else
they can predict very well what kind of packets they have to forge to
hit a particular flow.Signed-off-by: Eric Dumazet
Signed-off-by: David S. Miller -
Since commit 817fb15dfd98 (net_sched: sfq: allow divisor to be a
parameter), we can leave perturbation timer armed if a memory allocation
error aborts sfq_init().Memory containing active struct timer_list is freed and kernel can
crash.Call sfq_destroy() from sfq_init() to properly dismantle qdisc.
Signed-off-by: Eric Dumazet
Signed-off-by: David S. Miller
04 Jan, 2012
1 commit
-
SFQ enqueue algo puts a new flow _behind_ all pre-existing flows in the
circular list. In fact this is probably an old SFQ implementation bug.100 Mbits = ~8333 full frames per second, or ~8 frames per ms.
With 50 flows, it means your "new flow" will have to wait 50 packets
being sent before its own packet. Thats the ~6ms.We certainly can change SFQ to give a priority advantage to new flows,
so that next dequeued packet is taken from a new flow, not an old one.Reported-by: Dave Taht
Signed-off-by: Eric Dumazet
Signed-off-by: David S. Miller
22 Dec, 2011
1 commit
-
A known Out Of Order (OOO) problem hurts SFQ when timer changes
perturbation value, since all new packets delivered to SFQ enqueue might
end on different slots than previous in-flight packets.With round robin delivery, we can thus deliver packets in a different
order.Since SFQ is limited to small amount of in-flight packets, we can rehash
packets so that this OOO problem is fixed.This rehashing is performed only if internal flow classifier is in use.
We now store in skb->cb[] the "struct flow_keys" so that we dont call
skb_flow_dissect() again while rehashing.Signed-off-by: Eric Dumazet
Signed-off-by: David S. Miller
30 Nov, 2011
1 commit
-
Instead of using a custom flow dissector, use skb_flow_dissect() and
benefit from tunnelling support.Signed-off-by: Eric Dumazet
Signed-off-by: David S. Miller
01 Aug, 2011
1 commit
-
commit 8efa88540635 (sch_sfq: avoid giving spurious NET_XMIT_CN signals)
forgot to call qdisc_tree_decrease_qlen() to signal upper levels that a
packet (from another flow) was dropped, leading to various problems.With help from Michal Soltys and Michal Pokrywka, who did a bisection.
Bugzilla ref: https://bugzilla.kernel.org/show_bug.cgi?id=39372
Debian ref: http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=631945Reported-by: Lucas Bocchi
Reported-and-bisected-by: Michal Pokrywka
Signed-off-by: Eric Dumazet
CC: Michal Soltys
Acked-by: Patrick McHardy
Signed-off-by: David S. Miller
22 Jun, 2011
1 commit
-
There are enough instances of this:
iph->frag_off & htons(IP_MF | IP_OFFSET)
that a helper function is probably warranted.
Signed-off-by: Paul Gortmaker
Signed-off-by: David S. Miller
26 May, 2011
1 commit
-
Since commit eeaeb068f139 (sch_sfq: allow big packets and be fair),
sfq_peek() can return a different skb that would be normally dequeued by
sfq_dequeue() [ if current slot->allot is negative ]Use generic qdisc_peek_dequeued() instead of custom implementation, to
get consistent result.Signed-off-by: Eric Dumazet
CC: Jarek Poplawski
CC: Patrick McHardy
CC: Jesper Dangaard Brouer
Signed-off-by: David S. Miller
24 May, 2011
1 commit
-
While chasing a possible net_sched bug, I found that IP fragments have
litle chance to pass a congestioned SFQ qdisc :- Say SFQ qdisc is full because one flow is non responsive.
- ip_fragment() wants to send two fragments belonging to an idle flow.
- sfq_enqueue() queues first packet, but see queue limit reached :
- sfq_enqueue() drops one packet from 'big consumer', and returns
NET_XMIT_CN.
- ip_fragment() cancel remaining fragments.This patch restores fairness, making sure we return NET_XMIT_CN only if
we dropped a packet from the same flow.Signed-off-by: Eric Dumazet
CC: Patrick McHardy
CC: Jarek Poplawski
CC: Jamal Hadi Salim
CC: Stephen Hemminger
Signed-off-by: David S. Miller
23 Apr, 2011
1 commit
-
Add const qualifiers to structs iphdr, ipv6hdr and in6_addr pointers
where possible, to make code intention more obvious.Signed-off-by: Eric Dumazet
Signed-off-by: David S. Miller
03 Feb, 2011
1 commit
-
The change to allow divisor to be a parameter (in 2.6.38-rc1)
commit 817fb15dfd988d8dda916ee04fa506f0c466b9d6
introduced a possible deadlock caught by sparse.The scheduler tree lock was left locked in the case of an incorrect
divisor value. Simplest fix is to move test outside of lock
which also solves problem of partial update.Signed-off-by: Stephen Hemminger
Acked-by: Eric Dumazet
Signed-off-by: David S. Miller
25 Jan, 2011
1 commit
-
Conflicts:
net/sched/sch_hfsc.c
net/sched/sch_htb.c
net/sched/sch_tbf.c
22 Jan, 2011
1 commit
-
Now qdisc stab is handled before TCQ_F_CAN_BYPASS test in
__dev_xmit_skb(), we can generalize TCQ_F_CAN_BYPASS to other qdiscs
than pfifo_fast : pfifo, bfifo, pfifo_head_drop and sfqSFQ is special because it can have external classifiers, and in these
cases, we cannot bypass queue discipline (packet could be dropped by
classifier) without admin asking it, or further changes.Its worth doing this, especially for SFQ, avoiding dirtying memory in
case no packets are already waiting in queue.Signed-off-by: Eric Dumazet
Signed-off-by: David S. Miller
21 Jan, 2011
2 commits
-
In commit 44b8288308ac9d (net_sched: pfifo_head_drop problem), we fixed
a problem with pfifo_head drops that incorrectly decreased
sch->bstats.bytes and sch->bstats.packetsSeveral qdiscs (CHOKe, SFQ, pfifo_head, ...) are able to drop a
previously enqueued packet, and bstats cannot be changed, so
bstats/rates are not accurate (over estimated)This patch changes the qdisc_bstats updates to be done at dequeue() time
instead of enqueue() time. bstats counters no longer account for dropped
frames, and rates are more correct, since enqueue() bursts dont have
effect on dequeue() rate.Signed-off-by: Eric Dumazet
Acked-by: Stephen Hemminger
Signed-off-by: David S. Miller -
SFQ currently uses a 1024 slots hash table, and its internal structure
(sfq_sched_data) allocation needs order-1 page on x86_64Allow tc command to specify a divisor value (hash table size), between 1
and 65536.
If no value is provided, assume the 1024 default size.This allows admins to setup smaller (or bigger) SFQ for specific needs.
This also brings back sfq_sched_data allocations to order-0 ones, saving
3KB per SFQ qdisc.Jesper uses ~55.000 SFQ in one machine, this patch should free 165 MB of
memory.Signed-off-by: Eric Dumazet
CC: Patrick McHardy
CC: Jesper Dangaard Brouer
CC: Jarek Poplawski
CC: Jamal Hadi Salim
CC: Stephen Hemminger
Signed-off-by: David S. Miller
20 Jan, 2011
1 commit
-
Cleanup net/sched code to current CodingStyle and practices.
Reduce inline abuse
Signed-off-by: Eric Dumazet
Signed-off-by: David S. Miller
11 Jan, 2011
1 commit
-
HTB takes into account skb is segmented in stats updates.
Generalize this to all schedulers.They should use qdisc_bstats_update() helper instead of manipulating
bstats.bytes and bstats.packetsAdd bstats_update() helper too for classes that use
gnet_stats_basic_packed fields.Note : Right now, TCQ_F_CAN_BYPASS shortcurt can be taken only if no
stab is setup on qdisc.Signed-off-by: Eric Dumazet
Signed-off-by: David S. Miller
01 Jan, 2011
2 commits
-
slot_dequeue_head() should make sure slot skb chain is correct in both
ways, or we can crash if all possible flows are in use.Jarek pointed out slot_queue_init() can now be done in sfq_init() once,
instead each time a flow is setup.Signed-off-by: Eric Dumazet
Signed-off-by: David S. Miller -
SFQ is currently 'limited' to small packets, because it uses a 15bit
allotment number per flow. Introduce a scale by 8, so that we can handle
full size TSO/GRO packets.Use appropriate handling to make sure allot is positive before a new
packet is dequeued, so that fairness is respected.Signed-off-by: Eric Dumazet
Acked-by: Jarek Poplawski
Cc: Patrick McHardy
Signed-off-by: David S. Miller
23 Dec, 2010
1 commit
-
sfq_walk() runs without qdisc lock. By the time it selects a non empty
hash slot and sfq_dump_class_stats() is run (with lock held), slot might
have been freed : We then access q->slots[SFQ_EMPTY_SLOT], out of
bounds, and crash in slot_queue_walk()On previous kernels, bug is here but out of bounds qs[SFQ_DEPTH] and
allot[SFQ_DEPTH] are located in struct sfq_sched_data, so no illegal
memory access happens, only possibly wrong data reported to user.Also, slot_dequeue_tail() should make sure slot skb chain is correctly
terminated, or sfq_dump_class_stats() can access freed skbs.Signed-off-by: Eric Dumazet
Signed-off-by: David S. Miller
21 Dec, 2010
4 commits
-
Here is a respin of patch.
I'll send a short patch to make SFQ more fair in presence of large
packets as well.Thanks
[PATCH v3 net-next-2.6] net_sched: sch_sfq: better struct layouts
This patch shrinks sizeof(struct sfq_sched_data)
from 0x14f8 (or more if spinlocks are bigger) to 0x1180 bytes, and
reduce text size as well.text data bss dec hex filename
4821 152 0 4973 136d old/net/sched/sch_sfq.o
4627 136 0 4763 129b new/net/sched/sch_sfq.oAll data for a slot/flow is now grouped in a compact and cache friendly
structure, instead of being spreaded in many different points.struct sfq_slot {
struct sk_buff *skblist_next;
struct sk_buff *skblist_prev;
sfq_index qlen; /* number of skbs in skblist */
sfq_index next; /* next slot in sfq chain */
struct sfq_head dep; /* anchor in dep[] chains */
unsigned short hash; /* hash value (index in ht[]) */
short allot; /* credit for this slot */
};Signed-off-by: Eric Dumazet
Cc: Jarek Poplawski
Cc: Patrick McHardy
Signed-off-by: David S. Miller -
When deploying SFQ/IFB here at work, I found the allot management was
pretty wrong in sfq, even changing allot from short to int...We should init allot for each new flow, not using a previous value found
in slot.Before patch, I saw bursts of several packets per flow, apparently
denying the default "quantum 1514" limit I had on my SFQ class.class sfq 11:1 parent 11:
(dropped 0, overlimits 0 requeues 0)
backlog 0b 7p requeues 0
allot 11546class sfq 11:46 parent 11:
(dropped 0, overlimits 0 requeues 0)
backlog 0b 1p requeues 0
allot -23873class sfq 11:78 parent 11:
(dropped 0, overlimits 0 requeues 0)
backlog 0b 5p requeues 0
allot 11393After patch, better fairness among each flow, allot limit being
respected, allot is positive :class sfq 11:e parent 11:
(dropped 0, overlimits 0 requeues 86)
backlog 0b 3p requeues 86
allot 596class sfq 11:94 parent 11:
(dropped 0, overlimits 0 requeues 0)
backlog 0b 3p requeues 0
allot 1468class sfq 11:a4 parent 11:
(dropped 0, overlimits 0 requeues 0)
backlog 0b 4p requeues 0
allot 650class sfq 11:bb parent 11:
(dropped 0, overlimits 0 requeues 0)
backlog 0b 3p requeues 0
allot 596Signed-off-by: Eric Dumazet
Signed-off-by: David S. Miller -
We currently return for each active SFQ slot the number of packets in
queue. We can also give number of bytes accounted for these packets.tc -s class show dev ifb0
Before patch :
class sfq 11:3d9 parent 11:
(dropped 0, overlimits 0 requeues 0)
backlog 0b 3p requeues 0
allot 1266After patch :
class sfq 11:3e4 parent 11:
(dropped 0, overlimits 0 requeues 0)
backlog 4380b 3p requeues 0
allot 1212Signed-off-by: Eric Dumazet
Signed-off-by: David S. Miller
20 Aug, 2010
1 commit
-
Signed-off-by: Changli Gao
Signed-off-by: David S. Miller
11 Aug, 2010
1 commit
-
sch_sfq as a classful qdisc needs the .leaf handler. Otherwise, there
is an oops possible in tc_modify_qdisc()/check_loop().Fixes commit 7d2681a6ff4f9ab5e48d02550b4c6338f1638998
Signed-off-by: Jarek Poplawski
Signed-off-by: David S. Miller
10 Aug, 2010
2 commits
-
This is based on work originally done by Patric McHardy.
Signed-off-by: Ben Greear
Signed-off-by: David S. Miller -
…dummy bind/unbind handles
Add dummy .unbind_tcf and .put qdisc class ops for easier verification.
(All other schedulers have it like this.)Signed-off-by: Jarek Poplawski <jarkao2@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
08 Aug, 2010
1 commit
-
Since there was added ->tcf_chain() method without ->bind_tcf() to
sch_sfq class options, there is oops when a filter is added with
the classid parameter.Fixes commit 7d2681a6ff4f9ab5e48d02550b4c6338f1638998
netdev thread: null pointer at cls_api.cSigned-off-by: Jarek Poplawski
Reported-by: Franchoze Eric
Signed-off-by: David S. Miller
05 Aug, 2010
1 commit
-
The packet length should be checked before the packet data is dereferenced.
Signed-off-by: Changli Gao
Signed-off-by: David S. Miller
21 Apr, 2010
1 commit
-
Sparse can help us find endianness bugs, but we need to make some
cleanups to be able to more easily spot real bugs.Signed-off-by: Eric Dumazet
Signed-off-by: David S. Miller
30 Mar, 2010
1 commit
-
…it slab.h inclusion from percpu.h
percpu.h is included by sched.h and module.h and thus ends up being
included when building most .c files. percpu.h includes slab.h which
in turn includes gfp.h making everything defined by the two files
universally available and complicating inclusion dependencies.percpu.h -> slab.h dependency is about to be removed. Prepare for
this change by updating users of gfp and slab facilities include those
headers directly instead of assuming availability. As this conversion
needs to touch large number of source files, the following script is
used as the basis of conversion.http://userweb.kernel.org/~tj/misc/slabh-sweep.py
The script does the followings.
* Scan files for gfp and slab usages and update includes such that
only the necessary includes are there. ie. if only gfp is used,
gfp.h, if slab is used, slab.h.* When the script inserts a new include, it looks at the include
blocks and try to put the new include such that its order conforms
to its surrounding. It's put in the include block which contains
core kernel includes, in the same order that the rest are ordered -
alphabetical, Christmas tree, rev-Xmas-tree or at the end if there
doesn't seem to be any matching order.* If the script can't find a place to put a new include (mostly
because the file doesn't have fitting include block), it prints out
an error message indicating which .h file needs to be added to the
file.The conversion was done in the following steps.
1. The initial automatic conversion of all .c files updated slightly
over 4000 files, deleting around 700 includes and adding ~480 gfp.h
and ~3000 slab.h inclusions. The script emitted errors for ~400
files.2. Each error was manually checked. Some didn't need the inclusion,
some needed manual addition while adding it to implementation .h or
embedding .c file was more appropriate for others. This step added
inclusions to around 150 files.3. The script was run again and the output was compared to the edits
from #2 to make sure no file was left behind.4. Several build tests were done and a couple of problems were fixed.
e.g. lib/decompress_*.c used malloc/free() wrappers around slab
APIs requiring slab.h to be added manually.5. The script was run on all .h files but without automatically
editing them as sprinkling gfp.h and slab.h inclusions around .h
files could easily lead to inclusion dependency hell. Most gfp.h
inclusion directives were ignored as stuff from gfp.h was usually
wildly available and often used in preprocessor macros. Each
slab.h inclusion directive was examined and added manually as
necessary.6. percpu.h was updated not to include slab.h.
7. Build test were done on the following configurations and failures
were fixed. CONFIG_GCOV_KERNEL was turned off for all tests (as my
distributed build env didn't work with gcov compiles) and a few
more options had to be turned off depending on archs to make things
build (like ipr on powerpc/64 which failed due to missing writeq).* x86 and x86_64 UP and SMP allmodconfig and a custom test config.
* powerpc and powerpc64 SMP allmodconfig
* sparc and sparc64 SMP allmodconfig
* ia64 SMP allmodconfig
* s390 SMP allmodconfig
* alpha SMP allmodconfig
* um on x86_64 SMP allmodconfig8. percpu.h modifications were reverted so that it could be applied as
a separate patch and serve as bisection point.Given the fact that I had only a couple of failures from tests on step
6, I'm fairly confident about the coverage of this conversion patch.
If there is a breakage, it's likely to be something in one of the arch
headers which should be easily discoverable easily on most builds of
the specific arch.Signed-off-by: Tejun Heo <tj@kernel.org>
Guess-its-ok-by: Christoph Lameter <cl@linux-foundation.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
06 Sep, 2009
1 commit
-
Some schedulers don't support creating, changing or deleting classes.
Make the respective callbacks optionally and consistently return
-EOPNOTSUPP for unsupported operations, instead of currently either
-EOPNOTSUPP, -ENOSYS or no error.In case of sch_prio and sch_multiq, the removed operations additionally
checked for an invalid class. This is not necessary since the class
argument can only orginate from ->get() or in case of ->change is 0
for creation of new classes, in which case ->change() incorrectly
returned -ENOENT.As a side-effect, this patch fixes a possible (root-only) NULL pointer
function call in sch_ingress, which didn't implement a so far mandatory
->delete() operation.Signed-off-by: Patrick McHardy
Signed-off-by: David S. Miller
03 Jun, 2009
1 commit
-
Define three accessors to get/set dst attached to a skb
struct dst_entry *skb_dst(const struct sk_buff *skb)
void skb_dst_set(struct sk_buff *skb, struct dst_entry *dst)
void skb_dst_drop(struct sk_buff *skb)
This one should replace occurrences of :
dst_release(skb->dst)
skb->dst = NULL;Delete skb->dst field
Signed-off-by: Eric Dumazet
Signed-off-by: David S. Miller
09 Jan, 2009
1 commit
-
Cc: Ingo Molnar
Cc: Thomas Gleixner
Acked-by: Theodore Ts'o
Acked-by: Mark Fasheh
Acked-by: David S. Miller
Cc: James Morris
Acked-by: Casey Schaufler
Acked-by: Takashi Iwai
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
22 Dec, 2008
1 commit
-
Some gcc versions warn that ret may be used uninitialized in
sfq_enqueue(). It's a false positive, so let's annotate this.Signed-off-by: Jarek Poplawski
Signed-off-by: David S. Miller
14 Nov, 2008
1 commit
-
After implementing qdisc->ops->peek() and changing sch_netem into
classless qdisc there are no more qdisc->ops->requeue() users. This
patch removes this method with its wrappers (qdisc_requeue()), and
also unused qdisc->requeue structure. There are a few minor fixes of
warnings (htb_enqueue()) and comments btw.The idea to kill ->requeue() and a similar patch were first developed
by David S. Miller.Signed-off-by: Jarek Poplawski
Signed-off-by: David S. Miller
31 Oct, 2008
1 commit
-
From: Patrick McHardy
Just as a demonstration how easy adding a peek operation to the
work-conserving qdiscs actually is. It doesn't need to keep or change
any internal state in many cases thanks to the guarantee that the
packet will either be dequeued or, if another packet arrives, the
upper qdisc will immediately ->peek again to reevaluate the state.(This is only slightly modified Patrick's patch.)
Signed-off-by: Jarek Poplawski
Signed-off-by: David S. Miller