07 Jan, 2014

1 commit

  • Proportional Integral controller Enhanced (PIE) is a scheduler to address the
    bufferbloat problem.

    From the IETF draft below:
    " Bufferbloat is a phenomenon where excess buffers in the network cause high
    latency and jitter. As more and more interactive applications (e.g. voice over
    IP, real time video streaming and financial transactions) run in the Internet,
    high latency and jitter degrade application performance. There is a pressing
    need to design intelligent queue management schemes that can control latency and
    jitter; and hence provide desirable quality of service to users.

    We present here a lightweight design, PIE (Proportional Integral controller
    Enhanced), that can effectively control the average queueing latency to a target
    value. Simulation results, theoretical analysis and Linux testbed results have
    shown that PIE can ensure low latency and achieve high link utilization under
    various congestion situations. The design does not require per-packet
    timestamp, so it incurs very small overhead and is simple enough to implement
    in both hardware and software. "

    Many thanks to Dave Taht for extensive feedback, reviews, testing and
    suggestions. Thanks also to Stephen Hemminger and Eric Dumazet for reviews and
    suggestions. Naeem Khademi and Dave Taht independently contributed to ECN
    support.

    For more information, please see the technical paper about PIE presented at
    the IEEE Conference on High Performance Switching and Routing 2013. A copy
    of the paper can be found at ftp://ftpeng.cisco.com/pie/.

    Please also refer to the IETF draft submission at
    http://tools.ietf.org/html/draft-pan-tsvwg-pie-00

    All relevant code, documents and test scripts and results can be found at
    ftp://ftpeng.cisco.com/pie/.
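
    For illustration, attaching and inspecting such a qdisc could look like
    this (a hedged sketch: the device and values are illustrative, and the
    option names assume the matching iproute2 tc-pie support):

    # eth0 and the limit/timing values below are illustrative only
    tc qdisc add dev eth0 root pie limit 1000 target 20ms tupdate 30ms ecn
    tc -s qdisc show dev eth0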

    For problems with the iproute2/tc or Linux kernel code, please contact Vijay
    Subramanian (vijaynsu@cisco.com or subramanian.vijay@gmail.com) or Mythili
    Prabhu (mysuryan@cisco.com).

    Signed-off-by: Vijay Subramanian
    Signed-off-by: Mythili Prabhu
    CC: Dave Taht
    Signed-off-by: David S. Miller

    Vijay Subramanian
     

06 Jan, 2014

1 commit

  • Pablo Neira Ayuso says:

    ====================
    netfilter/IPVS updates for net-next

    The following patchset contains Netfilter updates for your net-next tree,
    they are:

    * Add full port randomization support. Some crazy researchers found a way
    to reconstruct the secure ephemeral ports that are allocated in random mode
    by sending off-path bursts of UDP packets to overrun the socket buffer of
    the DNS resolver and trigger retransmissions. If the timing of the DNS
    resolution done by a client is then longer than usual, they conclude
    that the port that received the burst of UDP packets is the one that was
    opened. It seems a rather aggressive method to me, but it seems to work for
    them. As a result, Daniel Borkmann and Hannes Frederic Sowa came up with a
    new NAT mode to fully randomize ports using prandom.

    * Add a new classifier to x_tables based on the socket net_cls set via
    cgroups. This includes two patches to prepare the ground, as requested by
    Zefan Li. Also from Daniel Borkmann.

    * Use prandom instead of get_random_bytes in several locations of the
    netfilter code, from Florian Westphal.

    * Allow use of CTA_MARK_MASK in ctnetlink when mangling the conntrack
    mark, also from Florian Westphal.

    * Fix compilation warning due to unused variable in IPVS, from Geert
    Uytterhoeven.

    * Add support for UID/GID via nfnetlink_queue, from Valentina Giusti.

    * Add IPComp extension to x_tables, from Fan Du.
    ====================

    Signed-off-by: David S. Miller

    David S. Miller
     

04 Jan, 2014

1 commit

  • Zefan Li requested [1] to perform the following cleanup/refactoring:

    - Split cgroupfs classid handling into net core to better express a
    possible more generic use.

    - Disable module support for the cgroupfs bits, as the majority of other
    cgroupfs subsystems do not have it and it seems to be unwanted on the
    cgroup side. Zefan might want to follow up for netprio
    later on.

    - This also allows removing code that previously took care of
    functionality built when compiled as a module.

    cgroupfs bits are being placed under net/core/netclassid_cgroup.c, so
    that we are consistent with {netclassid,netprio}_cgroup naming that is
    under net/core/ as suggested by Zefan.

    There is no change in functionality; only code refactoring is being
    done here.

    [1] http://patchwork.ozlabs.org/patch/304825/
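
    For reference, the cgroupfs classid handling could be exercised along
    these lines (a hedged sketch: paths, names and the classid value are
    illustrative):

    # create a cgroup and tag its traffic with major:minor 10:1 (0x100001)
    mkdir /sys/fs/cgroup/net_cls/bulk
    echo 0x100001 > /sys/fs/cgroup/net_cls/bulk/net_cls.classid
    echo $$ > /sys/fs/cgroup/net_cls/bulk/tasks
    # let the cgroup classifier map the classid to a tc class
    tc filter add dev eth0 parent 10: protocol ip handle 1: cgroup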

    Suggested-by: Li Zefan
    Signed-off-by: Daniel Borkmann
    Cc: Zefan Li
    Cc: Thomas Graf
    Cc: cgroups@vger.kernel.org
    Acked-by: Li Zefan
    Signed-off-by: Pablo Neira Ayuso

    Daniel Borkmann
     

20 Dec, 2013

1 commit

  • This patch implements the first size-based qdisc that attempts to
    differentiate between small flows and heavy-hitters. The goal is to
    catch the heavy-hitters and move them to a separate queue with less
    priority so that bulk traffic does not affect the latency of critical
    traffic. Currently "less priority" means less weight (2:1 in
    particular) in a Weighted Deficit Round Robin (WDRR) scheduler.

    In essence, this patch addresses the "delay-bloat" problem due to
    bloated buffers. In some systems, large queues may be necessary for
    obtaining CPU efficiency, or due to the presence of unresponsive
    traffic like UDP, or just a large number of connections with each
    having a small amount of outstanding traffic. In these circumstances,
    HHF aims to reduce the HoL blocking for latency sensitive traffic,
    while not impacting the queues built up by bulk traffic. HHF can also
    be used in conjunction with other AQM mechanisms such as CoDel.

    To capture heavy-hitters, we implement the "multi-stage filter" design
    in the following paper:
    C. Estan and G. Varghese, "New Directions in Traffic Measurement and
    Accounting", in ACM SIGCOMM, 2002.

    Some configurable qdisc settings through 'tc':
    - hhf_reset_timeout: period to reset counter values in the multi-stage
    filter (default 40ms)
    - hhf_admit_bytes: threshold to classify heavy-hitters
    (default 128KB)
    - hhf_evict_timeout: threshold to evict idle heavy-hitters
    (default 1s)
    - hhf_non_hh_weight: Weighted Deficit Round Robin (WDRR) weight for
    non-heavy-hitters (default 2)
    - hh_flows_limit: max number of heavy-hitter flow entries
    (default 2048)

    Note that the ratio between hhf_admit_bytes and hhf_reset_timeout
    reflects the bandwidth of heavy-hitters that we attempt to capture
    (25Mbps with the above default settings).
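
    As an illustration, heavy-hitters above roughly 12.5Mbps could be
    targeted by halving the admit threshold (a hedged sketch: eth0 is
    illustrative, and the option spellings assume the matching iproute2
    tc support for hhf):

    # reset_timeout/admit_bytes/evict_timeout mirror the settings above
    tc qdisc add dev eth0 root hhf reset_timeout 40ms admit_bytes 64kb \
        evict_timeout 1s non_hh_weight 2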

    The false negative rate (heavy-hitter flows getting away unclassified)
    is zero by the design of the multi-stage filter algorithm.
    With 100 heavy-hitter flows, using four hashes and 4000 counters yields
    a false positive rate (non-heavy-hitters mistakenly classified as
    heavy-hitters) of less than 1e-4.

    Signed-off-by: Terry Lam
    Acked-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Terry Lam
     

30 Oct, 2013

1 commit

  • This work contains a lightweight BPF-based traffic classifier that can
    serve as a flexible alternative to ematch-based tree classification,
    especially now that the BPF filter engine can also be JITed in the
    kernel. Naturally, tc actions and policies are supported as well with
    cls_bpf. Multiple BPF programs/filters can be attached to a class, or
    they can just as well be written within a single BPF program; it is
    really up to users how they wish to run/optimize the code, e.g. also
    for inversion of verdicts etc.
    The notion of a BPF program's return/exit codes is being kept as follows:

    0: No match
    -1: Select classid given in "tc filter ..." command
    else: flowid, overwrite the default one

    As a minimal usage example with iproute2, we use a 3-band prio root qdisc
    on a router with an sfq leaf each, and assign ssh and icmp bpf-based
    filters to band 1, http traffic to band 2 and the rest to band 3. For the
    first two bands we load the bytecode from files; for the third we load it
    inline as an example:

    echo 1 > /proc/sys/net/core/bpf_jit_enable

    tc qdisc del dev em1 root
    tc qdisc add dev em1 root handle 1: prio bands 3 priomap 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1

    tc qdisc add dev em1 parent 1:1 sfq perturb 16
    tc qdisc add dev em1 parent 1:2 sfq perturb 16
    tc qdisc add dev em1 parent 1:3 sfq perturb 16

    tc filter add dev em1 parent 1: bpf run bytecode-file /etc/tc/ssh.bpf flowid 1:1
    tc filter add dev em1 parent 1: bpf run bytecode-file /etc/tc/icmp.bpf flowid 1:1
    tc filter add dev em1 parent 1: bpf run bytecode-file /etc/tc/http.bpf flowid 1:2
    tc filter add dev em1 parent 1: bpf run bytecode "`bpfc -f tc -i misc.ops`" flowid 1:3

    BPF programs can be easily created and passed to tc, either as inline
    'bytecode' or 'bytecode-file'. There are a couple of front-ends that can
    compile opcodes, for example:

    1) People familiar with tcpdump-like filters:

    tcpdump -i em1 -ddd port 22 | tr '\n' ',' > /etc/tc/ssh.bpf

    2) People that want to low-level program their filters or use BPF
    extensions that lack support by libpcap's compiler:

    bpfc -f tc -i ssh.ops > /etc/tc/ssh.bpf

    ssh.ops example code:
    ldh [12]
    jne #0x800, drop
    ldb [23]
    jneq #6, drop
    ldh [20]
    jset #0x1fff, drop
    ldxb 4 * ([14] & 0xf)
    ldh [%x + 14]
    jeq #0x16, pass
    ldh [%x + 16]
    jne #0x16, drop
    pass: ret #-1
    drop: ret #0

    Loading bytecode into tc was chosen because the reverse operation,
    tc filter list dev em1, is then able to show the exact commands again.
    Possible follow-up work could also include a small expression compiler
    for iproute2. Tested with the help of bmon. This idea came up during
    the Netfilter Workshop 2013 in Copenhagen. Also thanks to feedback from
    Eric Dumazet!

    Signed-off-by: Daniel Borkmann
    Cc: Thomas Graf
    Signed-off-by: David S. Miller

    Daniel Borkmann
     

30 Aug, 2013

1 commit

  • - Uses perfect flow match (not stochastic hash like SFQ/FQ_codel)
    - Uses the new_flow/old_flow separation from FQ_codel
    - New flows get an initial credit allowing IW10 without added delay.
    - Special FIFO queue for high prio packets (no need for PRIO + FQ)
    - Uses a hash table of RB trees to locate the flows at enqueue() time
    - Smart on demand gc (at enqueue() time, RB tree lookup evicts old
    unused flows)
    - Dynamic memory allocations.
    - Designed to allow millions of concurrent flows per Qdisc.
    - Small memory footprint : ~8K per Qdisc, and 104 bytes per flow.
    - Single high resolution timer for throttled flows (if any).
    - One RB tree to link throttled flows.
    - Ability to have a max rate per flow. We might add a socket option
    to add a per-socket limitation.

    Attempts have been made to add TCP pacing in TCP stack, but this
    seems to add complex code to an already complex stack.

    TCP pacing is welcome for flows having idle times, as the cwnd
    permits the TCP stack to queue a possibly large number of packets.

    This removes the 'slow start after idle' choice, which badly hits
    large-BDP flows and applications delivering chunks of data
    such as video streams.

    Nicely spaced packets :
    Here the interface is 10Gbit, but the flow bottleneck is ~20Mbit

    cwnd is big, yet FQ avoids the typical bursts generated by TCP
    (as in netperf TCP_RR -- -r 100000,100000)

    15:01:23.545279 IP A > B: . 78193:81089(2896) ack 65248 win 3125
    15:01:23.545394 IP B > A: . ack 81089 win 3668
    15:01:23.546488 IP A > B: . 81089:83985(2896) ack 65248 win 3125
    15:01:23.546565 IP B > A: . ack 83985 win 3668
    15:01:23.547713 IP A > B: . 83985:86881(2896) ack 65248 win 3125
    15:01:23.547778 IP B > A: . ack 86881 win 3668
    15:01:23.548911 IP A > B: . 86881:89777(2896) ack 65248 win 3125
    15:01:23.548949 IP B > A: . ack 89777 win 3668
    15:01:23.550116 IP A > B: . 89777:92673(2896) ack 65248 win 3125
    15:01:23.550182 IP B > A: . ack 92673 win 3668
    15:01:23.551333 IP A > B: . 92673:95569(2896) ack 65248 win 3125
    15:01:23.551406 IP B > A: . ack 95569 win 3668
    15:01:23.552539 IP A > B: . 95569:98465(2896) ack 65248 win 3125
    15:01:23.552576 IP B > A: . ack 98465 win 3668
    15:01:23.553756 IP A > B: . 98465:99913(1448) ack 65248 win 3125
    15:01:23.554138 IP A > B: P 99913:100001(88) ack 65248 win 3125
    15:01:23.554204 IP B > A: . ack 100001 win 3668
    15:01:23.554234 IP B > A: . 65248:68144(2896) ack 100001 win 3668
    15:01:23.555620 IP B > A: . 68144:71040(2896) ack 100001 win 3668
    15:01:23.557005 IP B > A: . 71040:73936(2896) ack 100001 win 3668
    15:01:23.558390 IP B > A: . 73936:76832(2896) ack 100001 win 3668
    15:01:23.559773 IP B > A: . 76832:79728(2896) ack 100001 win 3668
    15:01:23.561158 IP B > A: . 79728:82624(2896) ack 100001 win 3668
    15:01:23.562543 IP B > A: . 82624:85520(2896) ack 100001 win 3668
    15:01:23.563928 IP B > A: . 85520:88416(2896) ack 100001 win 3668
    15:01:23.565313 IP B > A: . 88416:91312(2896) ack 100001 win 3668
    15:01:23.566698 IP B > A: . 91312:94208(2896) ack 100001 win 3668
    15:01:23.568083 IP B > A: . 94208:97104(2896) ack 100001 win 3668
    15:01:23.569467 IP B > A: . 97104:100000(2896) ack 100001 win 3668
    15:01:23.570852 IP B > A: . 100000:102896(2896) ack 100001 win 3668
    15:01:23.572237 IP B > A: . 102896:105792(2896) ack 100001 win 3668
    15:01:23.573639 IP B > A: . 105792:108688(2896) ack 100001 win 3668
    15:01:23.575024 IP B > A: . 108688:111584(2896) ack 100001 win 3668
    15:01:23.576408 IP B > A: . 111584:114480(2896) ack 100001 win 3668
    15:01:23.577793 IP B > A: . 114480:117376(2896) ack 100001 win 3668

    TCP timestamps show that most packets from B were queued in the same ms
    timeframe (TSval 1159799{3,4}), but FQ managed to send them right
    in time to avoid a big burst.

    In slow start or steady state, very few packets are throttled [1]

    FQ gets a bunch of tunables :

    limit : max number of packets on whole Qdisc (default 10000)

    flow_limit : max number of packets per flow (default 100)

    quantum : the credit per RR round (default is 2 MTU)

    initial_quantum : initial credit for new flows (default is 10 MTU)

    maxrate : max per flow rate (default : unlimited)

    buckets : number of RB trees (default : 1024) in hash table.
    (consumes 8 bytes per bucket)

    [no]pacing : disable/enable pacing (default is enabled)

    All of them can be changed on a live qdisc.

    $ tc qd add dev eth0 root fq help
    Usage: ... fq [ limit PACKETS ] [ flow_limit PACKETS ]
    [ quantum BYTES ] [ initial_quantum BYTES ]
    [ maxrate RATE ] [ buckets NUMBER ]
    [ [no]pacing ]

    $ tc -s -d qd
    qdisc fq 8002: dev eth0 root refcnt 32 limit 10000p flow_limit 100p buckets 256 quantum 3028 initial_quantum 15140
    Sent 216532416 bytes 148395 pkt (dropped 0, overlimits 0 requeues 14)
    backlog 0b 0p requeues 14
    511 flows, 511 inactive, 0 throttled
    110 gc, 0 highprio, 0 retrans, 1143 throttled, 0 flows_plimit

    [1] Except if the initial srtt is overestimated, e.g. when using a
    cached srtt from tcp metrics. We'll provide a fix for this issue.

    Signed-off-by: Eric Dumazet
    Cc: Yuchung Cheng
    Cc: Neal Cardwell
    Signed-off-by: David S. Miller

    Eric Dumazet
     

12 Jul, 2012

1 commit

  • Can be used to match packets against netfilter ip sets created via ipset(8).
    skb->sk_iif is used as 'incoming interface', skb->dev is 'outgoing interface'.

    Since ipset is usually called from netfilter, the ematch
    initializes a fake xt_action_param, pulls the ip header into the
    linear area and also sets skb->data to the IP header (otherwise
    matching Layer 4 set types doesn't work).
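
    A hedged usage sketch (set name, device and classid are illustrative,
    and the syntax assumes the matching iproute2 em_ipset support):

    # match sources listed in an existing ipset named 'blacklist'
    ipset create blacklist hash:ip
    tc filter add dev eth0 parent 1: protocol ip basic \
        match 'ipset(blacklist src)' classid 1:10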

    Tested-by: Mr Dash Four
    Signed-off-by: Florian Westphal
    Signed-off-by: David S. Miller

    Florian Westphal
     

13 May, 2012

1 commit

  • Fair Queue Codel packet scheduler

    Principles :

    - Packets are classified on flows (by the internal classifier, or an
    external one).
    - This is a stochastic model (as we use a hash, several flows might be
    hashed to the same slot).
    - Each flow has a CoDel managed queue.
    - Flows are linked onto two (Round Robin) lists,
    so that new flows have priority on old ones.

    - For a given flow, packets are not reordered (CoDel uses a FIFO)
    - head drops only.
    - ECN capability is on by default.
    - Very low memory footprint (64 bytes per flow)

    tc qdisc ... fq_codel [ limit PACKETS ] [ flows number ]
    [ target TIME ] [ interval TIME ] [ noecn ]
    [ quantum BYTES ]

    defaults : 1024 flows, 10240 packets limit, quantum : device MTU
    target : 5ms (CoDel default)
    interval : 100ms (CoDel default)

    Impressive results on load :

    class htb 1:1 root leaf 10: prio 0 quantum 1514 rate 200000Kbit ceil 200000Kbit burst 1475b/8 mpu 0b overhead 0b cburst 1475b/8 mpu 0b overhead 0b level 0
    Sent 43304920109 bytes 33063109 pkt (dropped 0, overlimits 0 requeues 0)
    rate 201691Kbit 28595pps backlog 0b 312p requeues 0
    lended: 33063109 borrowed: 0 giants: 0
    tokens: -912 ctokens: -912

    class fq_codel 10:1735 parent 10:
    (dropped 1292, overlimits 0 requeues 0)
    backlog 15140b 10p requeues 0
    deficit 1514 count 1 lastcount 1 ldelay 7.1ms
    class fq_codel 10:4524 parent 10:
    (dropped 1291, overlimits 0 requeues 0)
    backlog 16654b 11p requeues 0
    deficit 1514 count 1 lastcount 1 ldelay 7.1ms
    class fq_codel 10:4e74 parent 10:
    (dropped 1290, overlimits 0 requeues 0)
    backlog 6056b 4p requeues 0
    deficit 1514 count 1 lastcount 1 ldelay 6.4ms dropping drop_next 92.0ms
    class fq_codel 10:628a parent 10:
    (dropped 1289, overlimits 0 requeues 0)
    backlog 7570b 5p requeues 0
    deficit 1514 count 1 lastcount 1 ldelay 5.4ms dropping drop_next 90.9ms
    class fq_codel 10:a4b3 parent 10:
    (dropped 302, overlimits 0 requeues 0)
    backlog 16654b 11p requeues 0
    deficit 1514 count 1 lastcount 1 ldelay 7.1ms
    class fq_codel 10:c3c2 parent 10:
    (dropped 1284, overlimits 0 requeues 0)
    backlog 13626b 9p requeues 0
    deficit 1514 count 1 lastcount 1 ldelay 5.9ms
    class fq_codel 10:d331 parent 10:
    (dropped 299, overlimits 0 requeues 0)
    backlog 15140b 10p requeues 0
    deficit 1514 count 1 lastcount 1 ldelay 7.0ms
    class fq_codel 10:d526 parent 10:
    (dropped 12160, overlimits 0 requeues 0)
    backlog 35870b 211p requeues 0
    deficit 1508 count 12160 lastcount 1 ldelay 15.3ms dropping drop_next 247us
    class fq_codel 10:e2c6 parent 10:
    (dropped 1288, overlimits 0 requeues 0)
    backlog 15140b 10p requeues 0
    deficit 1514 count 1 lastcount 1 ldelay 7.1ms
    class fq_codel 10:eab5 parent 10:
    (dropped 1285, overlimits 0 requeues 0)
    backlog 16654b 11p requeues 0
    deficit 1514 count 1 lastcount 1 ldelay 5.9ms
    class fq_codel 10:f220 parent 10:
    (dropped 1289, overlimits 0 requeues 0)
    backlog 15140b 10p requeues 0
    deficit 1514 count 1 lastcount 1 ldelay 7.1ms

    qdisc htb 1: root refcnt 6 r2q 10 default 1 direct_packets_stat 0 ver 3.17
    Sent 43331086547 bytes 33092812 pkt (dropped 0, overlimits 66063544 requeues 71)
    rate 201697Kbit 28602pps backlog 0b 260p requeues 71
    qdisc fq_codel 10: parent 1:1 limit 10240p flows 65536 target 5.0ms interval 100.0ms ecn
    Sent 43331086547 bytes 33092812 pkt (dropped 949359, overlimits 0 requeues 0)
    rate 201697Kbit 28602pps backlog 189352b 260p requeues 0
    maxpacket 1514 drop_overlimit 0 new_flow_count 5582 ecn_mark 125593
    new_flows_len 0 old_flows_len 11

    PING 172.30.42.18 (172.30.42.18) 56(84) bytes of data.
    64 bytes from 172.30.42.18: icmp_req=1 ttl=64 time=0.227 ms
    64 bytes from 172.30.42.18: icmp_req=2 ttl=64 time=0.165 ms
    64 bytes from 172.30.42.18: icmp_req=3 ttl=64 time=0.166 ms
    64 bytes from 172.30.42.18: icmp_req=4 ttl=64 time=0.151 ms
    64 bytes from 172.30.42.18: icmp_req=5 ttl=64 time=0.164 ms
    64 bytes from 172.30.42.18: icmp_req=6 ttl=64 time=0.172 ms
    64 bytes from 172.30.42.18: icmp_req=7 ttl=64 time=0.175 ms
    64 bytes from 172.30.42.18: icmp_req=8 ttl=64 time=0.183 ms
    64 bytes from 172.30.42.18: icmp_req=9 ttl=64 time=0.158 ms
    64 bytes from 172.30.42.18: icmp_req=10 ttl=64 time=0.200 ms

    10 packets transmitted, 10 received, 0% packet loss, time 8999ms
    rtt min/avg/max/mdev = 0.151/0.176/0.227/0.022 ms

    Much better than SFQ because of the priority given to new flows, and a
    fast path dirtying fewer cache lines.

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

11 May, 2012

1 commit

  • An implementation of CoDel AQM, from Kathleen Nichols and Van Jacobson.

    http://queue.acm.org/detail.cfm?id=2209336

    The main input of this AQM is no longer the queue size in bytes or
    packets, but the time packets spend in the (FIFO) queue.

    As we don't have infinite memory, we still can drop packets in enqueue()
    in case of massive load, but the essence of CoDel is to drop packets in
    dequeue(), using a control law based on two simple parameters :

    target : target sojourn time (default 5ms)
    interval : width of moving time window (default 100ms)

    Based on initial work from Dave Taht.

    Refactored to ease future inclusion of codel as a plugin for other Linux
    qdiscs (FQ_CODEL, ...), like RED.

    include/net/codel.h contains the codel algorithm, kept as close as
    possible to Kathleen's reference.

    net/sched/sch_codel.c contains the linux qdisc specific glue.

    Separate structures permit a memory-efficient implementation of fq_codel
    (to be sent as separate work) : each flow has its own struct
    codel_vars.

    Timestamps are taken at enqueue() time with 1024 ns precision, allowing
    a range of 2199 seconds in queue and support for 100Gb links. iproute2
    uses usec as its base unit.

    Selected packets are dropped, unless ECN is enabled, in which case
    packets get an ECN mark instead.

    Tested from 2Mb to 10Gb speeds with no particular problems, on ixgbe and
    tg3 drivers (BQL enabled).

    Usage: tc qdisc ... codel [ limit PACKETS ] [ target TIME ]
    [ interval TIME ] [ ecn ]

    qdisc codel 10: parent 1:1 limit 2000p target 3.0ms interval 60.0ms ecn
    Sent 13347099587 bytes 8815805 pkt (dropped 0, overlimits 0 requeues 0)
    rate 202365Kbit 16708pps backlog 113550b 75p requeues 0
    count 116 lastcount 98 ldelay 4.3ms dropping drop_next 816us
    maxpacket 1514 ecn_mark 84399 drop_overlimit 0

    CoDel must be seen as a base module, and should be used keeping in mind
    there is still a FIFO queue. So a typical setup will probably need a
    hierarchy of several qdiscs and packet classifiers to be able to meet
    whatever constraints a user might have.

    One possible example would be to use fq_codel, which combines Fair
    Queueing and CoDel, in replacement of sfq / sfq_red.

    Signed-off-by: Eric Dumazet
    Signed-off-by: Dave Taht
    Cc: Kathleen Nichols
    Cc: Van Jacobson
    Cc: Tom Herbert
    Cc: Matt Mathis
    Cc: Yuchung Cheng
    Cc: Stephen Hemminger
    Signed-off-by: David S. Miller

    Eric Dumazet
     

08 Feb, 2012

1 commit

  • The qdisc supports two operations - plug and unplug. When the
    qdisc receives a plug command via netlink request, packets arriving
    henceforth are buffered until a corresponding unplug command is received.
    Depending on the type of unplug command, the queue can be unplugged
    indefinitely or selectively.

    This qdisc can be used to implement output buffering, an essential
    functionality required for consistent recovery in checkpoint based
    fault-tolerance systems. Output buffering enables speculative execution
    by allowing generated network traffic to be rolled back. It is used to
    provide network protection for Xen Guests in the Remus high availability
    project, available as part of Xen.

    This module is generic enough to be used by any other system that wishes
    to add speculative execution and output buffering to its applications.

    This module was originally available in the linux 2.6.32 PV-OPS tree,
    used as dom0 for Xen.

    For more information, please refer to http://nss.cs.ubc.ca/remus/
    and http://wiki.xensource.com/xenwiki/Remus

    Changes in V3:
    * Removed debug output (printk) on queue overflow
    * Added TCQ_PLUG_RELEASE_INDEFINITE - allows the user to
    use this qdisc for simple plug/unplug operations.
    * Use of packet counts instead of pointers to keep track of
    the buffers in the queue.
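
    A hedged sketch of driving the qdisc from tc (the device and byte limit
    are illustrative; the keywords assume the matching iproute2 tc-plug
    support mapping to the TCQ_PLUG_* commands above):

    tc qdisc add dev eth0 root plug limit 32768
    tc qdisc change dev eth0 root plug block               # plug: buffer packets
    tc qdisc change dev eth0 root plug release_indefinite  # unplug everything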

    Signed-off-by: Shriram Rajagopalan
    Signed-off-by: Brendan Cully
    [author of the code in the linux 2.6.32 pvops tree]
    Signed-off-by: David S. Miller

    Shriram Rajagopalan
     

20 May, 2011

1 commit

  • IP_ROUTE_CLASSID depends on INET and NET_CLS_ROUTE4 selects
    IP_ROUTE_CLASSID, but when INET is not enabled, this kconfig warning
    is produced, so fix it by making NET_CLS_ROUTE4 depend on INET.

    warning: (NET_CLS_ROUTE4) selects IP_ROUTE_CLASSID which has unmet direct dependencies (NET && INET)

    Signed-off-by: Randy Dunlap
    Signed-off-by: David S. Miller

    Randy Dunlap
     

05 Apr, 2011

1 commit

  • This is an implementation of the Quick Fair Queue scheduler developed
    by Fabio Checconi. The same algorithm is already implemented in ipfw
    in FreeBSD. Fabio had an earlier version developed on Linux; I just
    cleaned it up. Thanks to Eric Dumazet for testing this under load.

    Signed-off-by: Stephen Hemminger
    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    stephen hemminger
     

24 Feb, 2011

1 commit

  • This is the Stochastic Fair Blue scheduler, based on work from :

    W. Feng, D. Kandlur, D. Saha, K. Shin. Blue: A New Class of Active Queue
    Management Algorithms. U. Michigan CSE-TR-387-99, April 1999.

    http://www.thefengs.com/wuchang/blue/CSE-TR-387-99.pdf

    This implementation is based on work done by Juliusz Chroboczek

    General SFB algorithm can be found in figure 14, page 15:

    B[l][n] : L x N array of bins (L levels, N bins per level)

    enqueue()
      Calculate hash function values h{0}, h{1}, .. h{L-1}
      Update bins at each level
      for i = 0 to L - 1
        if (B[i][h{i}].qlen > bin_size)
          B[i][h{i}].p_mark += p_increment;
        else if (B[i][h{i}].qlen == 0)
          B[i][h{i}].p_mark -= p_decrement;
      p_min = min(B[0][h{0}].p_mark ... B[L-1][h{L-1}].p_mark);
      if (p_min == 1.0)
        ratelimit();
      else
        mark/drop with probability p_min;

    I adapted Juliusz's code to meet current kernel standards and made
    various changes to address previous comments :

    http://thread.gmane.org/gmane.linux.network/90225
    http://thread.gmane.org/gmane.linux.network/90375

    Default flow classifier is the rxhash introduced by RPS in 2.6.35, but
    we can use an external flow classifier if wanted.

    tc qdisc add dev $DEV parent 1:11 handle 11: \
    est 0.5sec 2sec sfb limit 128

    tc filter add dev $DEV protocol ip parent 11: handle 3 \
    flow hash keys dst divisor 1024

    Notes:

    1) SFB's default child qdisc is pfifo_fast. It can be replaced by another
    qdisc, but a child qdisc MUST NOT drop a packet previously queued. This
    is because SFB needs to handle a dequeued packet in order to maintain
    its virtual queue states. pfifo_head_drop or CHOKe should not be used.

    2) ECN is enabled by default, unlike RED/CHOKe/GRED

    With help from Patrick McHardy & Andi Kleen

    Signed-off-by: Eric Dumazet
    CC: Juliusz Chroboczek
    CC: Stephen Hemminger
    CC: Patrick McHardy
    CC: Andi Kleen
    CC: John W. Linville
    Signed-off-by: David S. Miller

    Eric Dumazet
     

03 Feb, 2011

1 commit

  • CHOKe ("CHOose and Kill" or "CHOose and Keep") is an alternative
    packet scheduler based on the Random Exponential Drop (RED) algorithm.

    The core idea is:

    For every packet arrival:
      Calculate Qave
      if (Qave < minth)
        Queue the new packet
      else
        Select a packet at random from the queue
        if (both packets are from the same flow)
          Drop both packets
        else if (Qave > maxth)
          Drop the packet
        else
          Admit the packet with probability p (same as RED)

    See also:
    Rong Pan, Balaji Prabhakar, Konstantinos Psounis, "CHOKe: a stateless active
    queue management scheme for approximating fair bandwidth allocation",
    Proceeding of INFOCOM'2000, March 2000.
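
    A hedged configuration sketch (the device and the RED-style parameter
    values are illustrative, and the option set assumes the matching
    iproute2 support for choke):

    tc qdisc add dev eth0 parent 1:1 choke limit 1000 min 100 max 250 \
        avpkt 1000 burst 105 bandwidth 10mbit ecn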

    Help from:
    Eric Dumazet
    Patrick McHardy

    Signed-off-by: Stephen Hemminger
    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    stephen hemminger
     

20 Jan, 2011

3 commits

  • David S. Miller
     
  • This implements an mqprio queueing discipline that by default creates
    a pfifo_fast qdisc per tx queue and provides the needed configuration
    interface.

    Using the mqprio qdisc, the number of traffic classes (tcs) currently in
    use, along with the range of queues allotted to each class, can be
    configured. By default skbs are mapped to traffic classes using the skb
    priority. This mapping is configurable.

    Configurable parameters,

    struct tc_mqprio_qopt {
    __u8 num_tc;
    __u8 prio_tc_map[TC_BITMASK + 1];
    __u8 hw;
    __u16 count[TC_MAX_QUEUE];
    __u16 offset[TC_MAX_QUEUE];
    };

    Here the count/offset pairing give the queue alignment and the
    prio_tc_map gives the mapping from skb->priority to tc.

    The hw bit determines if the hardware should configure the count
    and offset values. If the hardware bit is set then the operation
    will fail if the hardware does not implement the ndo_setup_tc
    operation. This is to avoid undetermined states where the hardware
    may or may not control the queue mapping. Also minimal bounds
    checking is done on the count/offset to verify a queue does not
    exceed num_tx_queues and that queue ranges do not overlap. Otherwise
    it is left to user policy or hardware configuration to create
    useful mappings.

    It is expected that hardware QOS schemes can be implemented by
    creating appropriate mappings of queues in ndo_setup_tc().

    One expected use case is drivers will use the ndo_setup_tc to map
    queue ranges onto 802.1Q traffic classes. This provides a generic
    mechanism to map network traffic onto these traffic classes and
    removes the need for lower layer drivers to know specifics about
    traffic types.
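
    A hedged configuration sketch (the device, priority map and queue
    layout are illustrative):

    # 3 traffic classes; 16 skb priorities mapped onto them;
    # queues given as count@offset per class; hw 0 = software mapping
    tc qdisc add dev eth0 root mqprio num_tc 3 \
        map 0 0 0 0 1 1 1 1 2 2 2 2 2 2 2 2 \
        queues 4@0 4@4 8@8 hw 0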

    Signed-off-by: John Fastabend
    Signed-off-by: David S. Miller

    John Fastabend
     
  • Patrick McHardy
     

14 Jan, 2011

1 commit

  • Fix dependencies of netfilter realm match: it depends on NET_CLS_ROUTE,
    which itself depends on NET_SCHED; this dependency is missing from netfilter.

    Since matching on realms is also useful without having NET_SCHED enabled and
    the option really only controls whether the tclassid member is included in
    route and dst entries, rename the config option to IP_ROUTE_CLASSID and
    move it outside of the traffic scheduling context to get rid of the
    NET_SCHED dependency.

    Reported-by: Valdis Kletnieks
    Signed-off-by: Patrick McHardy

    Patrick McHardy
     

16 Nov, 2010

1 commit

  • Some of the documentation refers to web pages under
    the domain `osdl.org'. However, `osdl.org' now
    redirects to `linuxfoundation.org'.

    Rather than rely on redirections, this patch updates
    the addresses appropriately; for the most part, only
    documentation that is meant to be current has been
    updated.

    The patch should be pretty quick to scan and check;
    each new web-page URL was obtained by trying out the
    original URL in a browser and then simply copying the
    redirected URL (formatting as necessary).

    There is some conflict as to which one of these domain
    names is preferred:

    linuxfoundation.org
    linux-foundation.org

    So, I wrote:

    info@linuxfoundation.org

    and got this reply:

    Message-ID:
    Date: Mon, 15 Nov 2010 10:41:42 -0800
    From: David Ames

    ...

    linuxfoundation.org is preferred. The canonical name for our web site is
    www.linuxfoundation.org. Our list site is actually
    lists.linux-foundation.org.

    Regarding email linuxfoundation.org is preferred there are a few people
    who choose to use linux-foundation.org for their own reasons.

    Consequently, I used `linuxfoundation.org' for web pages and
    `lists.linux-foundation.org' for mailing-list web pages and email addresses;
    the only personal email address I updated from `@osdl.org' was that of
    Andrew Morton, who prefers `linux-foundation.org' according to `git log'.

    Signed-off-by: Michael Witten
    Signed-off-by: Jiri Kosina

    Michael Witten
     

20 Aug, 2010

1 commit

  • net/sched: add ACT_CSUM action to update packets checksums

    ACT_CSUM can be called just after ACT_PEDIT in order to re-compute some
    altered checksums in IPv4 and IPv6 packets. The following checksums are
    supported by this patch:
    - IPv4: IPv4 header, ICMP, IGMP, TCP, UDP & UDPLite
    - IPv6: ICMPv6, TCP, UDP & UDPLite
    It is possible to request updates of several kinds of checksums in the
    same action, e.g. if the packet flow mixes TCP, UDP and UDPLite.

    An example of usage is done in the associated iproute2 patch.
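
    A hedged illustration of what such usage could look like (the device
    and match are illustrative; the update-list keywords follow the action
    description above):

    # recompute IPv4 header and TCP checksums on matching packets
    tc filter add dev eth0 parent 1: protocol ip u32 \
        match ip protocol 6 0xff \
        action csum iph and tcp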

    Version 3 changes:
    - remove useless goto instructions
    - improve IPv6 hop options decoding

    Version 2 changes:
    - coding style correction
    - remove useless arguments of some functions
    - use stack in tcf_csum_dump()
    - add tcf_csum_skb_nextlayer() to factor code

    Signed-off-by: Gregoire Baron
    Acked-by: jamal
    Signed-off-by: David S. Miller

    Grégoire Baron
     

24 Mar, 2010

1 commit

  • Allows the net_cls cgroup subsystem to be compiled as a module

    This patch modifies net/sched/cls_cgroup.c to allow the net_cls subsystem
    to be optionally compiled as a module instead of builtin. The
    cgroup_subsys struct is moved around a bit to allow the subsys_id to be
    either declared as a compile-time constant by the cgroup_subsys.h include
    in cgroup.h, or, if it's a module, initialized within the struct by
    cgroup_load_subsys.

    Signed-off-by: Ben Blum
    Acked-by: Li Zefan
    Cc: Paul Menage
    Cc: "David S. Miller"
    Cc: KAMEZAWA Hiroyuki
    Cc: Lai Jiangshan
    Signed-off-by: Andrew Morton
    Signed-off-by: David S. Miller

    Ben Blum
     

20 Nov, 2008

1 commit

  • Add classful DRR scheduler as a more flexible replacement for SFQ.

    The main difference to the algorithm described in "Efficient Fair Queueing
    using Deficit Round Robin" is that this implementation doesn't drop packets
    from the longest queue on overrun, because it is classful and limits are
    handled by each individual child qdisc.
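
    A hedged setup sketch (the device, quantum and catch-all filter are
    illustrative); since DRR has no default class, a filter must assign
    every packet:

    tc qdisc add dev eth0 root handle 1: drr
    tc class add dev eth0 parent 1: classid 1:1 drr quantum 1514
    tc qdisc add dev eth0 parent 1:1 pfifo limit 100
    tc filter add dev eth0 parent 1: protocol ip u32 \
        match u32 0 0 classid 1:1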

    Signed-off-by: Patrick McHardy
    Signed-off-by: David S. Miller

    Patrick McHardy
     

08 Nov, 2008

1 commit

  • The classifier should cover the most common use case and will work
    without any special configuration.

    The principle of the classifier is to directly access the
    task_struct via get_current(). In order for this to work,
    classification requests from softirqs must be ignored. This is
    not a problem because the vast majority of packets in softirq
    context are not assigned to a task anyway. For this to work, a
    mechanism is needed to trace softirq context.

    For the sake of not adding too much complexity, this repost goes back
    to the method of relying on the number of nested bh disable calls, with
    the option to come up with something more reliable if actually needed.

    Signed-off-by: Thomas Graf
    Signed-off-by: David S. Miller

    Thomas Graf
     

13 Sep, 2008

2 commits

  • This new action will have the ability to change the priority and/or
    queue_mapping fields on an sk_buff.
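
    A hedged usage sketch (the device, match and field values are
    illustrative):

    # send ssh traffic to tx queue 3 with priority 1:2
    tc filter add dev eth0 parent 1: protocol ip u32 \
        match ip dport 22 0xffff \
        action skbedit queue_mapping 3 priority 1:2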

    Signed-off-by: Alexander Duyck
    Signed-off-by: Jeff Kirsher
    Signed-off-by: David S. Miller

    Alexander Duyck
     
  • This patch is intended to add a qdisc to support the new tx multiqueue
    architecture by providing a band for each hardware queue. By doing
    this it is possible to support a different qdisc per physical hardware
    queue.

    This qdisc uses the skb->queue_mapping to select which band to place
    the traffic onto. It then uses a round robin w/ a check to see if the
    subqueue is stopped to determine which band to dequeue the packet from.
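
    A hedged usage sketch (the device and match are illustrative); skbs can
    be steered between bands via their queue_mapping, e.g. with the skbedit
    action above:

    tc qdisc add dev eth0 root handle 1: multiq
    tc filter add dev eth0 parent 1: protocol ip u32 \
        match ip dport 80 0xffff \
        action skbedit queue_mapping 2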

    Signed-off-by: Alexander Duyck
    Signed-off-by: Jeff Kirsher
    Signed-off-by: David S. Miller

    Alexander Duyck
     

28 Jun, 2008

1 commit

  • Commit d62733c8e437fdb58325617c4b3331769ba82d70
    ([SCHED]: Qdisc changes and sch_rr added for multiqueue)
    added a NET_SCH_RR option that was unused since the code
    went unconditionally into sch_prio.

    Reported-by: Robert P. J. Day
    Signed-off-by: Adrian Bunk
    Signed-off-by: David S. Miller

    Adrian Bunk
     

01 Feb, 2008

2 commits

  • Add new "flow" classifier, which is meant to extend the SFQ hashing
    capabilities without hard-coding new hash functions and also allows
    deterministic mappings of keys to classes, replacing some out of tree
    iptables patches like IPCLASSIFY (maps IPs to classes), IPMARK (maps
    IPs to marks, with fw filters to classes), ...

    Some examples:

    - Classic SFQ hash:

    tc filter add ... flow hash \
    keys src,dst,proto,proto-src,proto-dst divisor 1024

    - Classic SFQ hash, but using information from conntrack to work properly in
    combination with NAT:

    tc filter add ... flow hash \
    keys nfct-src,nfct-dst,proto,nfct-proto-src,nfct-proto-dst divisor 1024

    - Map destination IPs of 192.168.0.0/24 to classids 1-257:

    tc filter add ... flow map \
    key dst addend -192.168.0.0 divisor 256

    - alternatively:

    tc filter add ... flow map \
    key dst and 0xff

    - similar, but reverse ordered:

    tc filter add ... flow map \
    key dst and 0xff xor 0xff

    Perturbation is currently not supported because we can't reliable kill the
    timer on destruction.

    Signed-off-by: Patrick McHardy
    Signed-off-by: David S. Miller

    Patrick McHardy
     
  • Since the old policer code is gone, TC actions are needed for policing.
    The ingress qdisc can get packets directly from netif_receive_skb()
    in case TC actions are enabled, or through netfilter otherwise; but
    since there is no policer without TC actions, the only thing it actually
    does in that case is count packets.

    Remove the netfilter support and always require TC actions.

    Signed-off-by: Patrick McHardy
    Acked-by: Jamal Hadi Salim
    Signed-off-by: David S. Miller

    Patrick McHardy
     

11 Oct, 2007

1 commit

  • Stateless NAT is useful in controlled environments where restrictions are
    placed on through traffic such that we don't need connection tracking to
    correctly NAT protocol-specific data.

    In particular, this is of interest when the number of flows or the number
    of addresses being NATed is large, or if connection tracking information
    has to be replicated and where it is not practical to do so.

    Previously we had stateless NAT functionality which was integrated into
    the IPv4 routing subsystem. This was a great solution as long as the NAT
    worked on a subnet to subnet basis such that the number of NAT rules was
    relatively small. The reason is that for SNAT the routing based system
    had to perform a linear scan through the rules.

    If the number of rules is large, then major renovations would have had
    to take place in the routing subsystem to make this practical.

    For the time being, the least intrusive way of achieving this is to use
    the u32 classifier written by Alexey Kuznetsov along with the actions
    infrastructure implemented by Jamal Hadi Salim.

    The following patch is an attempt at this problem. It creates a new nat
    action that can be invoked from u32 hash tables, allowing a large
    number of stateless NAT rules that can be used/updated in constant time.

    The actual NAT code is mostly based on the previous stateless NAT code
    written by Alexey. In future we might be able to utilise the protocol
    NAT code from netfilter to improve support for other protocols.
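
    A hedged example of such a u32-invoked nat action (the device and
    addresses are illustrative):

    # egress: rewrite source 192.168.1.100 to 203.0.113.5
    tc filter add dev eth0 parent 1: protocol ip u32 \
        match ip src 192.168.1.100/32 \
        action nat egress 192.168.1.100/32 203.0.113.5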

    Signed-off-by: Herbert Xu
    Signed-off-by: David S. Miller

    Herbert Xu
     

15 Jul, 2007

1 commit

  • The NET_CLS_ACT option is now a full replacement for NET_CLS_POLICE,
    remove the old code. The config option will be kept around to select
    the equivalent NET_CLS_ACT options for a short time to allow easier
    upgrades.

    Signed-off-by: Patrick McHardy
    Signed-off-by: David S. Miller

    Patrick McHardy