02 May, 2020

1 commit

  • Introduce an ingress frame gate control flow action.
    The tc gate action works like this:
    Assume there is a gate that allows specified ingress frames to pass
    at specific time slots and drops them at others. A tc filter selects
    the ingress frames, and the tc gate action specifies in which time
    slots those frames may be passed to the device and in which time
    slots they are dropped.
    The tc gate action provides an entry list that tells how long the
    gate keeps the open state and how long it keeps the close state. The
    gate action also assigns a start time that tells when the entry list
    starts. The driver then repeats the gate entry list cyclically.
    For the software simulation, the gate action requires the user to
    assign a time clock type.

    Below is an example configuration from user space. The tc filter
    matches a stream whose source IP address is 192.168.0.20, and the
    gate action owns two time slots: one lasting 200ms with the gate
    open, letting frames pass, and one lasting 100ms with the gate
    closed, dropping frames. When the ingress frames exceed 8000000
    bytes in total, the excess frames are dropped within that
    200000000ns time slot.

    > tc qdisc add dev eth0 ingress

    > tc filter add dev eth0 parent ffff: protocol ip \
    flower src_ip 192.168.0.20 \
    action gate index 2 clockid CLOCK_TAI \
    sched-entry open 200000000 -1 8000000 \
    sched-entry close 100000000 -1 -1

    > tc chain del dev eth0 ingress chain 0

    "sched-entry" follow the name taprio style. Gate state is
    "open"/"close". Follow with period nanosecond. Then next item is internal
    priority value means which ingress queue should put. "-1" means
    wildcard. The last value optional specifies the maximum number of
    MSDU octets that are permitted to pass the gate during the specified
    time interval.
    Base-time is not set will be 0 as default, as result start time would
    be ((N + 1) * cycletime) which is the minimal of future time.

    The example below filters a stream whose destination MAC address is
    10:00:80:00:00:00 and whose IP protocol is ICMP, followed by the
    gate action. The gate action runs with one close time slot, meaning
    the gate always stays closed. The total cycle time is 200000000ns.
    The base-time is calculated by:

    1357000000000 + (N + 1) * cycletime

    When the resulting value is in the future, it becomes the start
    time. The cycletime here is 200000000ns for this case.

    > tc filter add dev eth0 parent ffff: protocol ip \
    flower skip_hw ip_proto icmp dst_mac 10:00:80:00:00:00 \
    action gate index 12 base-time 1357000000000 \
    sched-entry close 200000000 -1 -1 \
    clockid CLOCK_TAI

    Signed-off-by: Po Liu
    Signed-off-by: David S. Miller

    Po Liu
     

23 Jan, 2020

1 commit

  • Principles:
    - Packets are classified on flows.
    - This is a Stochastic model (as we use a hash, several flows might
    be hashed to the same slot)
    - Each flow has a PIE managed queue.
    - Flows are linked onto two (Round Robin) lists,
    so that new flows have priority on old ones.
    - For a given flow, packets are not reordered.
    - Drops during enqueue only.
    - ECN capability is off by default.
    - ECN threshold (if ECN is enabled) is at 10% by default.
    - Uses timestamps to calculate queue delay by default.

    Usage:
    tc qdisc ... fq_pie [ limit PACKETS ] [ flows NUMBER ]
    [ target TIME ] [ tupdate TIME ]
    [ alpha NUMBER ] [ beta NUMBER ]
    [ quantum BYTES ] [ memory_limit BYTES ]
    [ ecnprob PERCENTAGE ] [ [no]ecn ]
    [ [no]bytemode ] [ [no_]dq_rate_estimator ]

    defaults:
    limit: 10240 packets, flows: 1024
    target: 15 ms, tupdate: 15 ms (in jiffies)
    alpha: 1/8, beta : 5/4
    quantum: device MTU, memory_limit: 32 MB
    ecnprob: 10%, ecn: off
    bytemode: off, dq_rate_estimator: off
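
    For instance, a minimal invocation sketch (hypothetical device name
    eth0; parameters taken from the synopsis above):

    tc qdisc add dev eth0 root fq_pie limit 10240 flows 1024 target 15ms ecn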

    Signed-off-by: Mohit P. Tahiliani
    Signed-off-by: Sachin D. Patil
    Signed-off-by: V. Saicharan
    Signed-off-by: Mohit Bhasi
    Signed-off-by: Leslie Monis
    Signed-off-by: Gautam Ramakrishnan
    Signed-off-by: David S. Miller

    Mohit P. Tahiliani
     

19 Dec, 2019

1 commit

  • Introduces a new Qdisc, which is based on 802.1Q-2014 wording. It is
    PRIO-like in how it is configured, meaning one needs to specify how many
    bands there are, how many are strict and how many are dwrr, quanta for the
    latter, and priomap.

    The new Qdisc operates like the PRIO / DRR combo would when configured as
    per the standard. The strict classes, if any, are tried for traffic first.
    When there's no traffic in any of the strict queues, the ETS ones (if any)
    are treated in the same way as in DRR.
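
    As an illustrative sketch (hypothetical device name eth0; syntax as
    documented for the iproute2 ets qdisc), three bands with one strict
    and two dwrr bands could be configured as:

    tc qdisc add dev eth0 root handle 1: ets bands 3 strict 1 \
    quanta 3000 1500 priomap 2 2 2 2 1 1 0 0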

    Signed-off-by: Petr Machata
    Acked-by: Jiri Pirko
    Signed-off-by: David S. Miller

    Petr Machata
     

10 Jul, 2019

1 commit

  • Allow sending a packet to the conntrack module for connection tracking.

    The packet will be marked with the conntrack connection's state and
    any metadata such as the conntrack mark and label. This state
    metadata can later be matched against with tc classifiers, for
    example with the flower classifier as below.

    In addition to committing new connections, the user can optionally
    specify a zone to track within, set a mark/label and configure NAT
    with an address range and port range.

    Usage is as follows:
    $ tc qdisc add dev ens1f0_0 ingress
    $ tc qdisc add dev ens1f0_1 ingress

    $ tc filter add dev ens1f0_0 ingress \
    prio 1 chain 0 proto ip \
    flower ip_proto tcp ct_state -trk \
    action ct zone 2 pipe \
    action goto chain 2
    $ tc filter add dev ens1f0_0 ingress \
    prio 1 chain 2 proto ip \
    flower ct_state +trk+new \
    action ct zone 2 commit mark 0xbb nat src addr 5.5.5.7 pipe \
    action mirred egress redirect dev ens1f0_1
    $ tc filter add dev ens1f0_0 ingress \
    prio 1 chain 2 proto ip \
    flower ct_zone 2 ct_mark 0xbb ct_state +trk+est \
    action ct nat pipe \
    action mirred egress redirect dev ens1f0_1

    $ tc filter add dev ens1f0_1 ingress \
    prio 1 chain 0 proto ip \
    flower ip_proto tcp ct_state -trk \
    action ct zone 2 pipe \
    action goto chain 1
    $ tc filter add dev ens1f0_1 ingress \
    prio 1 chain 1 proto ip \
    flower ct_zone 2 ct_mark 0xbb ct_state +trk+est \
    action ct nat pipe \
    action mirred egress redirect dev ens1f0_0

    Signed-off-by: Paul Blakey
    Signed-off-by: Marcelo Ricardo Leitner
    Signed-off-by: Yossi Kuperman
    Acked-by: Jiri Pirko

    Changelog:
    V5->V6:
    Added CONFIG_NF_DEFRAG_IPV6 in handle fragments ipv6 case
    V4->V5:
    Reordered nf_conntrack_put() in tcf_ct_skb_nfct_cached()
    V3->V4:
    Added strict_start_type for act_ct policy
    V2->V3:
    Addressed David's comments: removed an extra newline after rcu in
    tcf_ct_params, and fixed the indentation of a break in act_ct.c
    V1->V2:
    Fixed parsing of TCA_CT_NAT_IPV6_MAX ranges, as the 'else' case
    overwrote the ipv4 max
    Refactored NAT_PORT_MIN_MAX range handling as well
    Added ipv4/ipv6 defragmentation
    Removed an extra skb pull/push of the network offset in execute nat
    Refactored tcf_ct_skb_network_trim after pull
    Removed the TCA_ACT_CT define

    Signed-off-by: David S. Miller

    Paul Blakey
     

09 Jul, 2019

1 commit

  • Currently, TC offers the ability to match on the MPLS fields of a packet
    through the use of the flow_dissector_key_mpls struct. However, as yet, TC
    actions do not allow the modification or manipulation of such fields.

    Add a new module that registers TC action ops to allow manipulation of
    MPLS. This includes the ability to push and pop headers as well as modify
    the contents of new or existing headers. A further action to decrement the
    TTL field of an MPLS header is also provided with a new helper added to
    support this.

    Examples of the usage of the new action with flower rules to push and pop
    MPLS labels are:

    tc filter add dev eth0 protocol ip parent ffff: flower \
    action mpls push protocol mpls_uc label 123 \
    action mirred egress redirect dev eth1

    tc filter add dev eth0 protocol mpls_uc parent ffff: flower \
    action mpls pop protocol ipv4 \
    action mirred egress redirect dev eth1

    Signed-off-by: John Hurley
    Reviewed-by: Jakub Kicinski
    Reviewed-by: Simon Horman
    Reviewed-by: Willem de Bruijn
    Acked-by: Cong Wang
    Signed-off-by: David S. Miller

    John Hurley
     

30 May, 2019

1 commit

  • ctinfo is a new tc filter action module. It is designed to restore
    information contained in firewall conntrack marks to other packet fields
    and is typically used on packet ingress paths. At present it has two
    independent sub-functions or operating modes, DSCP restoration mode &
    skb mark restoration mode.

    The DSCP restore mode:

    This mode copies DSCP values that have been placed in the firewall
    conntrack mark back into the IPv4/v6 diffserv fields of relevant
    packets.

    DSCP restoration is intended for, and has been found useful in,
    restoring ingress classifications based on egress classifications
    across links that bleach or otherwise change DSCP, typically home
    ISP Internet links. Restoring DSCP on ingress on the WAN link allows
    qdiscs such as (but by no means limited to) CAKE to shape inbound
    packets according to policies that are easier to set & mark on
    egress.

    Ingress classification is traditionally a challenging task since
    iptables rules haven't yet run and tc filter/eBPF programs are pre-NAT
    lookups, hence are unable to see internal IPv4 addresses as used on the
    typical home masquerading gateway. Thus marking the connection in some
    manner on egress for later restoration of classification on ingress is
    easier to implement.

    Parameters related to DSCP restore mode:

    dscpmask - a 32 bit mask of 6 contiguous bits indicating which bits
    of the conntrack mark field contain the DSCP value to be restored.

    statemask - a 32 bit mask, usually 1 bit long, outside the area
    specified by dscpmask. This represents a conditional operation flag
    whereby the DSCP is only restored if the flag is set. This is useful
    to implement a 'one shot' iptables based classification where the
    'complicated' iptables rules are only run once, to classify the
    connection on the initial (egress) packet, and subsequent packets
    are all marked/restored with the same DSCP. A mask of zero disables
    the conditional behaviour, i.e. the conntrack mark DSCP bits are
    always restored to the IP diffserv field (assuming the conntrack
    entry is found & the skb is an IPv4/IPv6 type).

    e.g. dscpmask 0xfc000000 statemask 0x01000000

    |----0xFC----conntrack mark----000000---|
    | Bits 31-26 | bit 25 | bit24 |~~~ Bit 0|
    | DSCP       | unused | flag  | unused  |
    |-----------------------0x01---000000---|
          |                   |
          |                   |
          ---|             Conditional flag
             v             only restore if set
    |-ip diffserv-|
    | 6 bits      |
    |-------------|

    The skb mark restore mode (cpmark):

    This mode copies the firewall conntrack mark to the skb's mark field.
    It is completely the functional equivalent of the existing act_connmark
    action with the additional feature of being able to apply a mask to the
    restored value.

    Parameters related to skb mark restore mode:

    mask - a 32 bit mask applied to the firewall conntrack mark to mask
    out bits unwanted for restoration. This can be useful where the
    conntrack mark is being used for different purposes by different
    applications. If not specified, the whole mark field is copied by
    default (i.e. a default mask of 0xffffffff).

    e.g. mask 0x00ffffff to mask out the top 8 bits, which are being
    used by the aforementioned DSCP restore mode.

    |----0x00----conntrack mark----ffffff---|
    | Bits 31-24 |                          |
    | DSCP & flag|     some value here      |
    |---------------------------------------|
                        |
                        |
                        v
    |------------skb mark-------------------|
    |            |                          |
    |   zeroed   |                          |
    |---------------------------------------|

    Overall parameters:

    zone - conntrack zone

    control - action related control (reclassify | pipe | drop | continue |
    ok | goto chain )
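
    As a hedged example (hypothetical device name eth0; dscpmask and
    statemask values matching the diagram above, with the u32 catch-all
    idiom used elsewhere in this log), DSCP restore on ingress might be
    set up as:

    tc qdisc add dev eth0 handle ffff: ingress
    tc filter add dev eth0 parent ffff: protocol all prio 10 \
    u32 match u32 0 0 flowid 1:1 \
    action ctinfo dscp 0xfc000000 0x01000000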

    Signed-off-by: Kevin Darbyshire-Bryant
    Reviewed-by: Toke Høiland-Jørgensen
    Acked-by: Cong Wang
    Signed-off-by: David S. Miller

    Kevin 'ldir' Darbyshire-Bryant
     

05 Oct, 2018

1 commit

  • This traffic scheduler allows traffic classes states (transmission
    allowed/not allowed, in the simplest case) to be scheduled, according
    to a pre-generated time sequence. This is the basis of the IEEE
    802.1Qbv specification.

    Example configuration:

    tc qdisc replace dev enp3s0 parent root handle 100 taprio \
    num_tc 3 \
    map 2 2 1 0 2 2 2 2 2 2 2 2 2 2 2 2 \
    queues 1@0 1@1 2@2 \
    base-time 1528743495910289987 \
    sched-entry S 01 300000 \
    sched-entry S 02 300000 \
    sched-entry S 04 300000 \
    clockid CLOCK_TAI

    The configuration format is similar to mqprio. The main difference
    is the presence of a schedule, built by multiple "sched-entry"
    definitions, each entry has the following format:

    sched-entry <command> <gate mask> <interval>

    The only supported <command> is "S", which means "SetGateStates",
    following the IEEE 802.1Qbv-2015 definition (Table 8-6).
    <gate mask> is a bitmask where each bit is associated with a traffic
    class, so bit 0 (the least significant bit) being "on" means that
    traffic class 0 is "active" for that schedule entry. <interval> is a
    time duration in nanoseconds that specifies for how long the state
    defined by <command> and <gate mask> should be held before moving to
    the next entry.

    This schedule is circular, that is, after the last entry is executed
    it starts from the first one, indefinitely.

    The other parameters can be defined as follows:

    - base-time: specifies the instant when the schedule starts, if
    'base-time' is a time in the past, the schedule will start at

    base-time + (N * cycle-time)

    where N is the smallest integer so the resulting time is greater
    than "now", and "cycle-time" is the sum of all the intervals of the
    entries in the schedule;

    - clockid: specifies the reference clock to be used;

    The parameters should be similar to what the IEEE 802.1Q family of
    specifications defines.

    Signed-off-by: Vinicius Costa Gomes
    Signed-off-by: David S. Miller

    Vinicius Costa Gomes
     

25 Jul, 2018

2 commits

  • Skbprio (SKB Priority Queue) is a queueing discipline that prioritizes packets
    according to their skb->priority field. Under congestion, already-enqueued lower
    priority packets will be dropped to make space available for higher priority
    packets. Skbprio was conceived as a solution for denial-of-service defenses that
    need to route packets with different priorities as a means to overcome DoS
    attacks.
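
    A minimal configuration sketch (hypothetical device name eth0; the
    limit of 64 packets matches the default noted in v5 below):

    tc qdisc add dev eth0 root skbprio limit 64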

    v5
    *Do not reference qdisc_dev(sch)->tx_queue_len for setting limit. Instead set
    default sch->limit to 64.

    v4
    *Drop Documentation/networking/sch_skbprio.txt doc file to move it to tc man
    page for Skbprio, in iproute2.

    v3
    *Drop max_limit parameter in struct skbprio_sched_data and instead use
    sch->limit.

    *Reference qdisc_dev(sch)->tx_queue_len only once, during initialisation for
    qdisc (previously being referenced every time qdisc changes).

    *Move qdisc's detailed description from in-code to Documentation/networking.

    *When qdisc is saturated, enqueue incoming packet first before dequeueing
    lowest priority packet in queue - improves usage of call stack registers.

    *Introduce and use overlimit stat to keep track of number of dropped packets.

    v2
    *Use skb->priority field rather than DS field. Rename queueing discipline as
    SKB Priority Queue (previously Gatekeeper Priority Queue).

    *Queueing discipline is made classful to expose Skbprio's internal priority
    queues.

    Signed-off-by: Nishanth Devarajan
    Reviewed-by: Sachin Paryani
    Reviewed-by: Cody Doucette
    Reviewed-by: Michel Machado
    Acked-by: Cong Wang
    Signed-off-by: David S. Miller

    Nishanth Devarajan
     
  • Remove trailing whitespace and blank lines at EOF

    Signed-off-by: Stephen Hemminger
    Signed-off-by: David S. Miller

    Stephen Hemminger
     

11 Jul, 2018

1 commit

  • sch_cake targets the home router use case and is intended to squeeze the
    most bandwidth and latency out of even the slowest ISP links and routers,
    while presenting an API simple enough that even an ISP can configure it.

    Example of use on a cable ISP uplink:

    tc qdisc add dev eth0 cake bandwidth 20Mbit nat docsis ack-filter

    To shape a cable download link (ifb and tc-mirred setup elided)

    tc qdisc add dev ifb0 cake bandwidth 200mbit nat docsis ingress wash

    CAKE is filled with:

    * A hybrid Codel/Blue AQM algorithm, "Cobalt", tied to an FQ_Codel
    derived Flow Queuing system, which autoconfigures based on the bandwidth.
    * A novel "triple-isolate" mode (the default) which balances per-host
    and per-flow FQ even through NAT.
    * A deficit-based shaper, which can also be used in an unlimited mode.
    * 8 way set associative hashing to reduce flow collisions to a minimum.
    * A reasonable interpretation of various diffserv latency/loss tradeoffs.
    * Support for zeroing diffserv markings for entering and exiting traffic.
    * Support for interacting well with Docsis 3.0 shaper framing.
    * Extensive support for DSL framing types.
    * Support for ack filtering.
    * Extensive statistics for measuring loss, ECN markings, and latency
    variation.

    A paper describing the design of CAKE is available at
    https://arxiv.org/abs/1804.07617, and will be published at the 2018 IEEE
    International Symposium on Local and Metropolitan Area Networks (LANMAN).

    This patch adds the base shaper and packet scheduler, while subsequent
    commits add the optional (configurable) features. The full userspace API
    and most data structures are included in this commit, but options not
    understood in the base version will be ignored.

    Various versions have been baking as an out-of-tree build for kernel
    versions going back to 3.10, as the embedded router world has been
    running a few years behind mainline Linux. A stable version has been
    generally available on lede-17.01 and later.

    sch_cake replaces a combination of iptables, tc filter, htb and fq_codel
    in the sqm-scripts, with sane defaults and vastly simpler configuration.

    CAKE's principal author is Jonathan Morton, with contributions from
    Kevin Darbyshire-Bryant, Toke Høiland-Jørgensen, Sebastian Moeller,
    Ryan Mounce, Tony Ambardar, Dean Scarff, Nils Andreas Svee, Dave Täht,
    and Loganaden Velvindron.

    Testing from Pete Heist, Georgios Amanakis, and the many other members of
    the cake@lists.bufferbloat.net mailing list.

    tc -s qdisc show dev eth2
    qdisc cake 8017: root refcnt 2 bandwidth 1Gbit diffserv3 triple-isolate split-gso rtt 100.0ms noatm overhead 38 mpu 84
    Sent 51504294511 bytes 37724591 pkt (dropped 6, overlimits 64958695 requeues 12)
    backlog 0b 0p requeues 12
    memory used: 1053008b of 15140Kb
    capacity estimate: 970Mbit
    min/max network layer size: 28 / 1500
    min/max overhead-adjusted size: 84 / 1538
    average network hdr offset: 14
                   Bulk   Best Effort         Voice
    thresh    62500Kbit         1Gbit       250Mbit
    target        5.0ms         5.0ms         5.0ms
    interval    100.0ms       100.0ms       100.0ms
    pk_delay        5us           5us           6us
    av_delay        3us           2us           2us
    sp_delay        2us           1us           1us
    backlog          0b            0b            0b
    pkts        3164050      25030267       9530280
    bytes    3227519915   35396974782   12879808898
    way_inds          0             8             0
    way_miss         21           366            25
    way_cols          0             0             0
    drops             5             0             1
    marks             0             0             0
    ack_drop          0             0             0
    sp_flows          1             3             0
    bk_flows          0             1             1
    un_flows          0             0             0
    max_len       68130         68130         68130

    Tested-by: Pete Heist
    Tested-by: Georgios Amanakis
    Signed-off-by: Dave Taht
    Signed-off-by: Toke Høiland-Jørgensen
    Signed-off-by: David S. Miller

    Toke Høiland-Jørgensen
     

04 Jul, 2018

1 commit

  • The ETF (Earliest TxTime First) qdisc uses the information added
    earlier in this series (the socket option SO_TXTIME and the new
    role of sk_buff->tstamp) to schedule packets transmission based
    on absolute time.

    For some workloads, just bandwidth enforcement is not enough, and
    precise control of the transmission of packets is necessary.

    Example:

    $ tc qdisc replace dev enp2s0 parent root handle 100 mqprio num_tc 3 \
    map 2 2 1 0 2 2 2 2 2 2 2 2 2 2 2 2 queues 1@0 1@1 2@2 hw 0

    $ tc qdisc add dev enp2s0 parent 100:1 etf delta 100000 \
    clockid CLOCK_TAI

    In this example, the Qdisc will provide SW best-effort control of
    the transmission time to the network adapter; the timestamp in the
    socket will be in reference to the clockid CLOCK_TAI, and packets
    will leave the qdisc "delta" (100000) nanoseconds before their
    transmission time.

    The ETF qdisc will buffer packets sorted by their txtime. It will drop
    packets on enqueue() if their skbuff clockid does not match the clock
    reference of the Qdisc. Moreover, on dequeue(), a packet will be dropped
    if it expires while being enqueued.

    The qdisc also supports the SO_TXTIME deadline mode. For this mode, it
    will dequeue a packet as soon as possible and change the skb timestamp
    to 'now' during etf_dequeue().

    Note that both the qdisc's and the SO_TXTIME ABIs allow for a clockid
    to be configured, but it's been decided that usage of CLOCK_TAI should
    be enforced until we decide to allow for other clockids to be used.
    The rationale here is that PTP times are usually in the TAI scale, thus
    no other clocks should be necessary. For now, the qdisc will return
    EINVAL if any clocks other than CLOCK_TAI are used.

    Signed-off-by: Jesus Sanchez-Palencia
    Signed-off-by: Vinicius Costa Gomes
    Signed-off-by: David S. Miller

    Vinicius Costa Gomes
     

22 Feb, 2018

1 commit

  • This commit adds a new tc ematch for using netfilter xtables matches.

    This allows early classification, as well as mirroring/redirecting
    traffic based on logic implemented in netfilter extensions.

    The currently supported use case is classification based on the
    incoming IPsec state used during decapsulation, using the 'policy'
    iptables extension (xt_policy).

    The module dynamically fetches the netfilter match module and calls
    it using a fake xt_action_param structure based on validated userspace
    provided parameters.

    As the xt_policy match does not access skb->data, no skb modifications
    are needed on match.
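
    A hedged usage sketch (hypothetical device name eth0; the ematch is
    assumed to be invoked through the basic classifier as 'ipt' with
    xt_policy arguments):

    tc qdisc add dev eth0 ingress
    tc filter add dev eth0 parent ffff: basic \
    match 'ipt(-m policy --dir in --pol ipsec --reqid 1)' \
    classid 1:1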

    Signed-off-by: Eyal Birger
    Signed-off-by: David S. Miller

    Eyal Birger
     

04 Nov, 2017

1 commit


02 Nov, 2017

1 commit

  • Many source files in the tree are missing licensing information, which
    makes it harder for compliance tools to determine the correct license.

    By default all files without license information are under the default
    license of the kernel, which is GPL version 2.

    Update the files which contain no license information with the 'GPL-2.0'
    SPDX license identifier. The SPDX identifier is a legally binding
    shorthand, which can be used instead of the full boiler plate text.

    This patch is based on work done by Thomas Gleixner and Kate Stewart and
    Philippe Ombredanne.

    How this work was done:

    Patches were generated and checked against linux-4.14-rc6 for a subset of
    the use cases:
    - file had no licensing information in it.
    - file was a */uapi/* one with no licensing information in it,
    - file was a */uapi/* one with existing licensing information,

    Further patches will be generated in subsequent months to fix up cases
    where non-standard license headers were used, and references to license
    had to be inferred by heuristics based on keywords.

    The analysis to determine which SPDX License Identifier to be
    applied to a file was done in a spreadsheet of side-by-side results
    from the output of two independent scanners (ScanCode & Windriver)
    producing SPDX tag:value files created by Philippe Ombredanne.
    Philippe prepared the base worksheet, and did an initial spot review
    of a few thousand files.

    The 4.13 kernel was the starting point of the analysis with 60,537 files
    assessed. Kate Stewart did a file by file comparison of the scanner
    results in the spreadsheet to determine which SPDX license identifier(s)
    to be applied to the file. She confirmed any determination that was not
    immediately clear with lawyers working with the Linux Foundation.

    Criteria used to select files for SPDX license identifier tagging was:
    - Files considered eligible had to be source code files.
    - Make and config files were included as candidates if they contained >5
    lines of source
    - File already had some variant of a license header in it (even if
    <5 lines).
    Reviewed-by: Philippe Ombredanne
    Reviewed-by: Thomas Gleixner
    Signed-off-by: Greg Kroah-Hartman

    Greg Kroah-Hartman
     

28 Oct, 2017

1 commit

  • This queueing discipline implements the shaper algorithm defined by
    the 802.1Q-2014 Section 8.6.8.2 and detailed in Annex L.

    Its primary use is to apply some bandwidth reservation to user
    defined traffic classes, which are mapped to different queues via
    the mqprio qdisc.

    Only a simple software implementation is added for now.
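
    A software-mode sketch (hypothetical device name eth0; an mqprio
    root with handle 100: is assumed, and the slope/credit values are
    illustrative only):

    tc qdisc replace dev eth0 parent 100:1 cbs \
    idleslope 20000 sendslope -980000 hicredit 30 locredit -1470 \
    offload 0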

    Signed-off-by: Vinicius Costa Gomes
    Signed-off-by: Jesus Sanchez-Palencia
    Tested-by: Henrik Austad
    Signed-off-by: Jeff Kirsher

    Vinicius Costa Gomes
     

25 Jan, 2017

1 commit

  • This action allows the user to sample traffic matched by tc classifier.
    The sampling consists of choosing packets randomly and sampling them using
    the psample module. The user can configure the psample group number, the
    sampling rate and the packet's truncation (to save kernel-user traffic).

    Example:
    To sample ingress traffic from interface eth1, one may use the commands:

    tc qdisc add dev eth1 handle ffff: ingress

    tc filter add dev eth1 parent ffff: \
    matchall action sample rate 12 group 4

    Where the first command adds an ingress qdisc and the second starts
    sampling randomly with an average of one sampled packet per 12 packets on
    dev eth1 to psample group 4.

    Signed-off-by: Yotam Gigi
    Signed-off-by: Jiri Pirko
    Acked-by: Jamal Hadi Salim
    Reviewed-by: Simon Horman
    Signed-off-by: David S. Miller

    Yotam Gigi
     

20 Sep, 2016

1 commit

  • Sample use case of how this is encoded:
    user space via tuntap (or a connected VM/Machine/container)
    encodes the tcindex TLV.

    Sample use case of decoding:
    IFE action decodes it and the skb->tc_index is then used to classify.
    So something like this for encoded ICMP packets:

    .. first decode then reclassify... skb->tc_index will be set
    sudo $TC filter add dev $ETH parent ffff: prio 2 protocol 0xbeef \
    u32 match u32 0 0 flowid 1:1 \
    action ife decode reclassify

    ...next match the decode icmp packet...
    sudo $TC filter add dev $ETH parent ffff: prio 4 protocol ip \
    u32 match ip protocol 1 0xff flowid 1:1 \
    action continue

    ... last classify it using the tcindex classifier and do someaction..
    sudo $TC filter add dev $ETH parent ffff: prio 5 protocol ip \
    handle 0x11 tcindex classid 1:1 \
    action blah..

    Signed-off-by: Jamal Hadi Salim
    Signed-off-by: David S. Miller

    Jamal Hadi Salim
     

16 Sep, 2016

1 commit

  • This action is intended to be an upgrade over pedit from a
    usability perspective (as well as for operational debuggability).
    Compare this:

    sudo tc filter add dev $ETH parent 1: protocol ip prio 10 \
    u32 match ip protocol 1 0xff flowid 1:2 \
    action pedit munge offset -14 u8 set 0x02 \
    munge offset -13 u8 set 0x15 \
    munge offset -12 u8 set 0x15 \
    munge offset -11 u8 set 0x15 \
    munge offset -10 u16 set 0x1515 \
    pipe

    to:

    sudo tc filter add dev $ETH parent 1: protocol ip prio 10 \
    u32 match ip protocol 1 0xff flowid 1:2 \
    action skbmod dmac 02:15:15:15:15:15

    Also try to do a MAC address swap with pedit, or worse, try to debug
    a policy with destination MAC, source MAC and ethertype. Then make a
    few rules out of those and you'll get my point.

    In the future common use cases on pedit can be migrated to this action
    (as an example different fields in ip v4/6, transports like tcp/udp/sctp
    etc). For this first cut, this allows modifying basic ethernet header.

    The most important ethernet use case at the moment is when
    redirecting or mirroring packets to a remote machine. The dst MAC
    address needs a re-write so that the packet doesn't get dropped or
    confuse an interconnecting (learning) switch, or get dropped by a
    target machine (which looks at the dst MAC). And at times, when
    flipping the packet back, a swap of the MAC addresses is needed.

    Signed-off-by: Jamal Hadi Salim
    Signed-off-by: David S. Miller

    Jamal Hadi Salim
     

11 Sep, 2016

1 commit

  • This action could be used before redirecting packets to a shared
    tunnel device, or when redirecting packets arriving from such a
    device.

    The action will release the metadata created by the tunnel device
    (decap), or set the metadata with the specified values for encap
    operation.

    For example, the following flower filter will forward all ICMP
    packets destined to 11.11.11.2 through the shared vxlan device
    'vxlan0'. Before redirecting, metadata for the vxlan tunnel is
    created using the tunnel_key action and its arguments:

    $ tc filter add dev net0 protocol ip parent ffff: \
    flower \
    ip_proto 1 \
    dst_ip 11.11.11.2 \
    action tunnel_key set \
    src_ip 11.11.0.1 \
    dst_ip 11.11.0.2 \
    id 11 \
    action mirred egress redirect dev vxlan0

    Signed-off-by: Amir Vadai
    Signed-off-by: Hadar Hen Zion
    Reviewed-by: Shmulik Ladkani
    Acked-by: Jamal Hadi Salim
    Acked-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Amir Vadai
     

25 Jul, 2016

1 commit

  • The matchall classifier matches every packet and allows the user to apply
    actions on it. This filter is very useful in usecases where every packet
    should be matched, for example, packet mirroring (SPAN) can be setup very
    easily using that filter.
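
    For example, a hedged SPAN-style sketch (hypothetical devices eth0
    and eth1) mirroring all ingress traffic:

    tc qdisc add dev eth0 handle ffff: ingress
    tc filter add dev eth0 parent ffff: matchall \
    action mirred egress mirror dev eth1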

    Signed-off-by: Jiri Pirko
    Signed-off-by: Yotam Gigi
    Signed-off-by: David S. Miller

    Jiri Pirko
     

02 Mar, 2016

3 commits

  • Example usage:
    Set the skb priority using skbedit then allow it to be encoded

    sudo tc qdisc add dev $ETH root handle 1: prio
    sudo tc filter add dev $ETH parent 1: protocol ip prio 10 \
    u32 match ip protocol 1 0xff flowid 1:2 \
    action skbedit prio 17 \
    action ife encode \
    allow prio \
    dst 02:15:15:15:15:15

    Note: You don't need the skbedit action if you are already encoding
    the skb priority earlier. A zero skb priority will not be sent.

    Alternatively, hard-code a static priority of decimal 33 (unlike
    with skbedit) and a mark of 0x12 every time the filter matches:

    sudo $TC filter add dev $ETH parent 1: protocol ip prio 10 \
    u32 match ip protocol 1 0xff flowid 1:2 \
    action ife encode \
    type 0xDEAD \
    use prio 33 \
    use mark 0x12 \
    dst 02:15:15:15:15:15

    Signed-off-by: Jamal Hadi Salim
    Acked-by: Cong Wang

    Signed-off-by: David S. Miller

    Jamal Hadi Salim
     
  • Example usage:
    Set the skb mark using skbedit, then allow it to be encoded

    sudo tc qdisc add dev $ETH root handle 1: prio
    sudo tc filter add dev $ETH parent 1: protocol ip prio 10 \
    u32 match ip protocol 1 0xff flowid 1:2 \
    action skbedit mark 17 \
    action ife encode \
    allow mark \
    dst 02:15:15:15:15:15

    Note: You don't need the skbedit action if you are already encoding
    the skb mark earlier. A zero skb mark, when seen, will not be
    encoded.

    Alternatively, hard-code a static mark of 0x12 every time the filter
    matches:

    sudo $TC filter add dev $ETH parent 1: protocol ip prio 10 \
    u32 match ip protocol 1 0xff flowid 1:2 \
    action ife encode \
    type 0xDEAD \
    use mark 0x12 \
    dst 02:15:15:15:15:15

    Signed-off-by: Jamal Hadi Salim
    Acked-by: Cong Wang
    Signed-off-by: David S. Miller

    Jamal Hadi Salim
     
  • This action allows a sending side to encapsulate arbitrary
    metadata, which is decapsulated by the receiving end.
    The sender runs in encoding mode and the receiver in decode mode.
    Both sender and receiver must specify the same ethertype.
    At some point we hope to have a registered ethertype and we'll then
    provide a default so the user doesn't have to specify it.
    For now we require the user to specify it.

    Let's show example usage where we encode ICMP from a sender towards
    a receiver with an skb mark of 17; both sender and receiver use an
    ethertype of 0xdead to interoperate.

    YYYY: Let's start with the Receiver-side policy config:
    xxx: add an ingress qdisc
    sudo tc qdisc add dev $ETH ingress

    xxx: any packets with ethertype 0xdead will be subjected to ife decoding
    xxx: we then restart the classification so we can match on icmp at prio 3
    sudo $TC filter add dev $ETH parent ffff: prio 2 protocol 0xdead \
    u32 match u32 0 0 flowid 1:1 \
    action ife decode reclassify

    xxx: on restarting the classification from above if it was an icmp
    xxx: packet, then match it here and continue to the next rule at prio 4
    xxx: which will match based on skb mark of 17
    sudo tc filter add dev $ETH parent ffff: prio 3 protocol ip \
    u32 match ip protocol 1 0xff flowid 1:1 \
    action continue

    xxx: match on skbmark of 0x11 (decimal 17) and accept
    sudo tc filter add dev $ETH parent ffff: prio 4 protocol ip \
    handle 0x11 fw flowid 1:1 \
    action ok

    xxx: Let's show the decoding policy
    sudo tc -s filter ls dev $ETH parent ffff: protocol 0xdead
    xxx:
    filter pref 2 u32
    filter pref 2 u32 fh 800: ht divisor 1
    filter pref 2 u32 fh 800::800 order 2048 key ht 800 bkt 0 flowid 1:1 (rule hit 0 success 0)
    match 00000000/00000000 at 0 (success 0 )
    action order 1: ife decode action reclassify
    index 1 ref 1 bind 1 installed 14 sec used 14 sec
    type: 0x0
    Metadata: allow mark allow hash allow prio allow qmap
    Action statistics:
    Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
    backlog 0b 0p requeues 0
    xxx:
    Observe that the above lists all metadata it can decode. Typically
    these submodules will already be compiled into a monolithic kernel
    or loaded as modules.

    YYYY: Let's show the sender side now ..

    xxx: Add an egress qdisc on the sender netdev
    sudo tc qdisc add dev $ETH root handle 1: prio
    xxx:
    xxx: Match all icmp packets to 192.168.122.237/24, then
    xxx: tag the packet with skb mark of decimal 17, then
    xxx: Encode it with:
    xxx: ethertype 0xdead
    xxx: add skb->mark to whitelist of metadatum to send
    xxx: rewrite target dst MAC address to 02:15:15:15:15:15
    xxx:
    sudo $TC filter add dev $ETH parent 1: protocol ip prio 10 u32 \
    match ip dst 192.168.122.237/24 \
    match ip protocol 1 0xff \
    flowid 1:2 \
    action skbedit mark 17 \
    action ife encode \
    type 0xDEAD \
    allow mark \
    dst 02:15:15:15:15:15

    xxx: Let's show the encoding policy
    sudo tc -s filter ls dev $ETH parent 1: protocol ip
    xxx:
    filter pref 10 u32
    filter pref 10 u32 fh 800: ht divisor 1
    filter pref 10 u32 fh 800::800 order 2048 key ht 800 bkt 0 flowid 1:2 (rule hit 0 success 0)
    match c0a87aed/ffffffff at 16 (success 0 )
    match 00010000/00ff0000 at 8 (success 0 )

    action order 1: skbedit mark 17
    index 6 ref 1 bind 1
    Action statistics:
    Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
    backlog 0b 0p requeues 0

    action order 2: ife encode action pipe
    index 3 ref 1 bind 1
    dst MAC: 02:15:15:15:15:15 type: 0xDEAD
    Metadata: allow mark
    Action statistics:
    Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
    backlog 0b 0p requeues 0
    xxx:

    Test by sending a ping from the sender to the destination.

    Signed-off-by: Jamal Hadi Salim
    Acked-by: Cong Wang
    Signed-off-by: David S. Miller

    Jamal Hadi Salim
     

14 May, 2015

1 commit

  • This patch introduces a flow-based filter. So far, only the very
    essential packet fields are supported.

    This patch is only the first step. There are a lot of potential
    performance improvements possible to implement. Also, a lot of
    features are missing for now. They will be addressed in follow-up
    patches.
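
    A minimal matching sketch (hypothetical device name eth0 and
    addresses; only essential fields):

    tc qdisc add dev eth0 ingress
    tc filter add dev eth0 parent ffff: protocol ip flower \
    ip_proto tcp dst_ip 10.0.0.1 \
    action drop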

    Signed-off-by: Jiri Pirko
    Acked-by: Jamal Hadi Salim
    Signed-off-by: David S. Miller

    Jiri Pirko
     

20 Jan, 2015

1 commit

  • This tc action allows you to retrieve the connection tracking mark.
    This action has been used heavily by OpenWrt for a few years now.

    There are known limitations currently:

    doesn't work for initial packets, since we only query the ct table.
    Fine, given the use case is for returning packets

    no implicit defrag.
    Frags should be rare, so fix later..

    won't work for more complex tasks, e.g. lookup of other extensions,
    since we have no means to store results

    we still have a 2nd lookup later on via the normal conntrack path.
    This shouldn't break anything though, since skb->nfct isn't altered.
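
    A hedged usage sketch (hypothetical device name eth0; the restored
    skb mark on returning packets can then be matched, e.g. by the fw
    classifier):

    tc qdisc add dev eth0 handle ffff: ingress
    tc filter add dev eth0 parent ffff: protocol ip prio 1 \
    u32 match u32 0 0 flowid 1:1 \
    action connmark continue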

    V2:
    remove unnecessary braces (Jiri)
    change the action identifier to 14 (Jiri)
    Fix some stylistic issues caught by checkpatch
    V3:
    Move module params to bottom (Cong)
    Get rid of tcf_hashinfo_init and friends and conform to newer API (Cong)

    Acked-by: Jiri Pirko
    Signed-off-by: Felix Fietkau
    Signed-off-by: Jamal Hadi Salim
    Signed-off-by: David S. Miller

    Felix Fietkau
     

18 Jan, 2015

1 commit


22 Nov, 2014

1 commit

  • This tc action allows working with VLAN-tagged skbs. The two
    supported sub-actions are header pop and header push.
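
    Illustrative sketches of both sub-actions (hypothetical device name
    eth0 and VLAN id; u32 catch-all used for brevity):

    tc qdisc add dev eth0 handle ffff: ingress
    tc filter add dev eth0 parent ffff: protocol 802.1Q \
    u32 match u32 0 0 flowid 1:1 \
    action vlan pop
    tc filter add dev eth0 parent ffff: protocol ip \
    u32 match u32 0 0 flowid 1:1 \
    action vlan push id 5 priority 2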

    Signed-off-by: Jiri Pirko
    Signed-off-by: Jamal Hadi Salim
    Signed-off-by: David S. Miller

    Jiri Pirko
     

07 Jan, 2014

1 commit

  • Proportional Integral controller Enhanced (PIE) is a scheduler to address the
    bufferbloat problem.

    From the IETF draft below:
    " Bufferbloat is a phenomenon where excess buffers in the network cause high
    latency and jitter. As more and more interactive applications (e.g. voice over
    IP, real time video streaming and financial transactions) run in the Internet,
    high latency and jitter degrade application performance. There is a pressing
    need to design intelligent queue management schemes that can control latency and
    jitter; and hence provide desirable quality of service to users.

    We present here a lightweight design, PIE(Proportional Integral controller
    Enhanced) that can effectively control the average queueing latency to a target
    value. Simulation results, theoretical analysis and Linux testbed results have
    shown that PIE can ensure low latency and achieve high link utilization under
    various congestion situations. The design does not require per-packet
    timestamp, so it incurs very small overhead and is simple enough to implement
    in both hardware and software. "
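
    A minimal configuration sketch (hypothetical device name eth0;
    parameter names as exposed by the iproute2 pie qdisc):

    tc qdisc add dev eth0 root pie limit 1000 target 20ms tupdate 30ms ecn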

    Many thanks to Dave Taht for extensive feedback, reviews, testing and
    suggestions. Thanks also to Stephen Hemminger and Eric Dumazet for reviews and
    suggestions. Naeem Khademi and Dave Taht independently contributed to ECN
    support.

    For more information, please see technical paper about PIE in the IEEE
    Conference on High Performance Switching and Routing 2013. A copy of the paper
    can be found at ftp://ftpeng.cisco.com/pie/.

    Please also refer to the IETF draft submission at
    http://tools.ietf.org/html/draft-pan-tsvwg-pie-00

    All relevant code, documents and test scripts and results can be found at
    ftp://ftpeng.cisco.com/pie/.

    For problems with the iproute2/tc or Linux kernel code, please
    contact Vijay Subramanian (vijaynsu@cisco.com or
    subramanian.vijay@gmail.com) or Mythili Prabhu (mysuryan@cisco.com)

    Signed-off-by: Vijay Subramanian
    Signed-off-by: Mythili Prabhu
    CC: Dave Taht
    Signed-off-by: David S. Miller

    Vijay Subramanian
     

20 Dec, 2013

1 commit

  • This patch implements the first size-based qdisc that attempts to
    differentiate between small flows and heavy-hitters. The goal is to
    catch the heavy-hitters and move them to a separate queue with less
    priority so that bulk traffic does not affect the latency of critical
    traffic. Currently "less priority" means less weight (2:1 in
    particular) in a Weighted Deficit Round Robin (WDRR) scheduler.

    In essence, this patch addresses the "delay-bloat" problem due to
    bloated buffers. In some systems, large queues may be necessary for
    obtaining CPU efficiency, or due to the presence of unresponsive
    traffic like UDP, or just a large number of connections with each
    having a small amount of outstanding traffic. In these circumstances,
    HHF aims to reduce the HoL blocking for latency sensitive traffic,
    while not impacting the queues built up by bulk traffic. HHF can also
    be used in conjunction with other AQM mechanisms such as CoDel.

    To capture heavy-hitters, we implement the "multi-stage filter" design
    in the following paper:
    C. Estan and G. Varghese, "New Directions in Traffic Measurement and
    Accounting", in ACM SIGCOMM, 2002.

    Some configurable qdisc settings through 'tc':
    - hhf_reset_timeout: period to reset counter values in the multi-stage
    filter (default 40ms)
    - hhf_admit_bytes: threshold to classify heavy-hitters
    (default 128KB)
    - hhf_evict_timeout: threshold to evict idle heavy-hitters
    (default 1s)
    - hhf_non_hh_weight: Weighted Deficit Round Robin (WDRR) weight for
    non-heavy-hitters (default 2)
    - hh_flows_limit: max number of heavy-hitter flow entries
    (default 2048)

    Note that the ratio between hhf_admit_bytes and hhf_reset_timeout
    reflects the bandwidth of heavy-hitters that we attempt to capture
    (25Mbps with the above default settings).
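
    A hedged configuration sketch (hypothetical device name eth0; the
    iproute2 parameter names are assumed to drop the "hhf_" prefix used
    above):

    tc qdisc add dev eth0 root hhf hh_limit 2048 reset_timeout 40ms \
    admit_bytes 131072 evict_timeout 1s non_hh_weight 2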

    The false negative rate (heavy-hitter flows getting away unclassified)
    is zero by the design of the multi-stage filter algorithm.
    With 100 heavy-hitter flows, using four hashes and 4000 counters yields
    a false positive rate (non-heavy-hitters mistakenly classified as
    heavy-hitters) of less than 1e-4.

    Signed-off-by: Terry Lam
    Acked-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Terry Lam
     

30 Oct, 2013

1 commit

  • This work contains a lightweight BPF-based traffic classifier that can
    serve as a flexible alternative to ematch-based tree classification, i.e.
    now that BPF filter engine can also be JITed in the kernel. Naturally, tc
    actions and policies are supported as well with cls_bpf. Multiple BPF
    programs/filter can be attached for a class, or they can just as well be
    written within a single BPF program, that's really up to the user how he
    wishes to run/optimize the code, e.g. also for inversion of verdicts etc.
    The notion of a BPF program's return/exit codes is being kept as follows:

    0: No match
    -1: Select classid given in "tc filter ..." command
    else: flowid, overwrite the default one

    As a minimal usage example with iproute2, we use a 3-band prio root
    qdisc on a router, with an sfq leaf on each band, and assign ssh and
    icmp bpf-based filters to band 1, http traffic to band 2 and the
    rest to band 3. The first three filters load their bytecode from a
    file; the last one loads it inline as an example:

    echo 1 > /proc/sys/net/core/bpf_jit_enable

    tc qdisc del dev em1 root
    tc qdisc add dev em1 root handle 1: prio bands 3 priomap 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1

    tc qdisc add dev em1 parent 1:1 sfq perturb 16
    tc qdisc add dev em1 parent 1:2 sfq perturb 16
    tc qdisc add dev em1 parent 1:3 sfq perturb 16

    tc filter add dev em1 parent 1: bpf run bytecode-file /etc/tc/ssh.bpf flowid 1:1
    tc filter add dev em1 parent 1: bpf run bytecode-file /etc/tc/icmp.bpf flowid 1:1
    tc filter add dev em1 parent 1: bpf run bytecode-file /etc/tc/http.bpf flowid 1:2
    tc filter add dev em1 parent 1: bpf run bytecode "`bpfc -f tc -i misc.ops`" flowid 1:3

    BPF programs can be easily created and passed to tc, either as inline
    'bytecode' or 'bytecode-file'. There are a couple of front-ends that can
    compile opcodes, for example:

    1) People familiar with tcpdump-like filters:

    tcpdump -iem1 -ddd port 22 | tr '\n' ',' > /etc/tc/ssh.bpf

    2) People that want to low-level program their filters or use BPF
    extensions that lack support by libpcap's compiler:

    bpfc -f tc -i ssh.ops > /etc/tc/ssh.bpf

    ssh.ops example code:
    ldh [12]
    jne #0x800, drop
    ldb [23]
    jneq #6, drop
    ldh [20]
    jset #0x1fff, drop
    ldxb 4 * ([14] & 0xf)
    ldh [%x + 14]
    jeq #0x16, pass
    ldh [%x + 16]
    jne #0x16, drop
    pass: ret #-1
    drop: ret #0

    It was chosen to load bytecode into tc, since the reverse operation,
    tc filter list dev em1, is then able to show the exact commands again.
    Possible follow-up work could also include a small expression compiler
    for iproute2. Tested with the help of bmon. This idea came up during
    the Netfilter Workshop 2013 in Copenhagen. Also thanks to feedback from
    Eric Dumazet!

    Signed-off-by: Daniel Borkmann
    Cc: Thomas Graf
    Signed-off-by: David S. Miller

    Daniel Borkmann
     

30 Aug, 2013

1 commit

  • - Uses perfect flow match (not stochastic hash like SFQ/FQ_codel)
    - Uses the new_flow/old_flow separation from FQ_codel
    - New flows get an initial credit allowing IW10 without added delay.
    - Special FIFO queue for high prio packets (no need for PRIO + FQ)
    - Uses a hash table of RB trees to locate the flows at enqueue() time
    - Smart on demand gc (at enqueue() time, RB tree lookup evicts old
    unused flows)
    - Dynamic memory allocations.
    - Designed to allow millions of concurrent flows per Qdisc.
    - Small memory footprint : ~8K per Qdisc, and 104 bytes per flow.
    - Single high resolution timer for throttled flows (if any).
    - One RB tree to link throttled flows.
    - Ability to have a max rate per flow. We might add a socket option
    to add per socket limitation.

    Attempts have been made to add TCP pacing in TCP stack, but this
    seems to add complex code to an already complex stack.

    TCP pacing is welcomed for flows having idle times, as the cwnd
    permits TCP stack to queue a possibly large number of packets.

    This removes the 'slow start after idle' choice, hitting badly
    large BDP flows, and applications delivering chunks of data
    as video streams.

    Nicely spaced packets :
    Here the interface is 10Gbit, but the flow bottleneck is ~20Mbit.

    cwin is big, yet FQ avoids the typical bursts generated by TCP
    (as in netperf TCP_RR -- -r 100000,100000)

    15:01:23.545279 IP A > B: . 78193:81089(2896) ack 65248 win 3125
    15:01:23.545394 IP B > A: . ack 81089 win 3668
    15:01:23.546488 IP A > B: . 81089:83985(2896) ack 65248 win 3125
    15:01:23.546565 IP B > A: . ack 83985 win 3668
    15:01:23.547713 IP A > B: . 83985:86881(2896) ack 65248 win 3125
    15:01:23.547778 IP B > A: . ack 86881 win 3668
    15:01:23.548911 IP A > B: . 86881:89777(2896) ack 65248 win 3125
    15:01:23.548949 IP B > A: . ack 89777 win 3668
    15:01:23.550116 IP A > B: . 89777:92673(2896) ack 65248 win 3125
    15:01:23.550182 IP B > A: . ack 92673 win 3668
    15:01:23.551333 IP A > B: . 92673:95569(2896) ack 65248 win 3125
    15:01:23.551406 IP B > A: . ack 95569 win 3668
    15:01:23.552539 IP A > B: . 95569:98465(2896) ack 65248 win 3125
    15:01:23.552576 IP B > A: . ack 98465 win 3668
    15:01:23.553756 IP A > B: . 98465:99913(1448) ack 65248 win 3125
    15:01:23.554138 IP A > B: P 99913:100001(88) ack 65248 win 3125
    15:01:23.554204 IP B > A: . ack 100001 win 3668
    15:01:23.554234 IP B > A: . 65248:68144(2896) ack 100001 win 3668
    15:01:23.555620 IP B > A: . 68144:71040(2896) ack 100001 win 3668
    15:01:23.557005 IP B > A: . 71040:73936(2896) ack 100001 win 3668
    15:01:23.558390 IP B > A: . 73936:76832(2896) ack 100001 win 3668
    15:01:23.559773 IP B > A: . 76832:79728(2896) ack 100001 win 3668
    15:01:23.561158 IP B > A: . 79728:82624(2896) ack 100001 win 3668
    15:01:23.562543 IP B > A: . 82624:85520(2896) ack 100001 win 3668
    15:01:23.563928 IP B > A: . 85520:88416(2896) ack 100001 win 3668
    15:01:23.565313 IP B > A: . 88416:91312(2896) ack 100001 win 3668
    15:01:23.566698 IP B > A: . 91312:94208(2896) ack 100001 win 3668
    15:01:23.568083 IP B > A: . 94208:97104(2896) ack 100001 win 3668
    15:01:23.569467 IP B > A: . 97104:100000(2896) ack 100001 win 3668
    15:01:23.570852 IP B > A: . 100000:102896(2896) ack 100001 win 3668
    15:01:23.572237 IP B > A: . 102896:105792(2896) ack 100001 win 3668
    15:01:23.573639 IP B > A: . 105792:108688(2896) ack 100001 win 3668
    15:01:23.575024 IP B > A: . 108688:111584(2896) ack 100001 win 3668
    15:01:23.576408 IP B > A: . 111584:114480(2896) ack 100001 win 3668
    15:01:23.577793 IP B > A: . 114480:117376(2896) ack 100001 win 3668

    TCP timestamps show that most packets from B were queued in the same ms
    timeframe (TSval 1159799{3,4}), but FQ managed to send them right
    in time to avoid a big burst.

    In slow start or steady state, very few packets are throttled [1]

    FQ gets a bunch of tunables as :

    limit : max number of packets on whole Qdisc (default 10000)

    flow_limit : max number of packets per flow (default 100)

    quantum : the credit per RR round (default is 2 MTU)

    initial_quantum : initial credit for new flows (default is 10 MTU)

    maxrate : max per flow rate (default : unlimited)

    buckets : number of RB trees (default : 1024) in hash table.
    (consumes 8 bytes per bucket)

    [no]pacing : disable/enable pacing (default is enable)

    All of them can be changed on a live qdisc.

    $ tc qd add dev eth0 root fq help
    Usage: ... fq [ limit PACKETS ] [ flow_limit PACKETS ]
    [ quantum BYTES ] [ initial_quantum BYTES ]
    [ maxrate RATE ] [ buckets NUMBER ]
    [ [no]pacing ]
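
    For instance, a concrete invocation sketch (hypothetical device name
    eth0; tunables from the list above):

    tc qdisc add dev eth0 root fq limit 10000 flow_limit 100 maxrate 20mbit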

    $ tc -s -d qd
    qdisc fq 8002: dev eth0 root refcnt 32 limit 10000p flow_limit 100p buckets 256 quantum 3028 initial_quantum 15140
    Sent 216532416 bytes 148395 pkt (dropped 0, overlimits 0 requeues 14)
    backlog 0b 0p requeues 14
    511 flows, 511 inactive, 0 throttled
    110 gc, 0 highprio, 0 retrans, 1143 throttled, 0 flows_plimit

    [1] Except if initial srtt is overestimated, as if using
    cached srtt in tcp metrics. We'll provide a fix for this issue.

    Signed-off-by: Eric Dumazet
    Cc: Yuchung Cheng
    Cc: Neal Cardwell
    Signed-off-by: David S. Miller

    Eric Dumazet
     

12 Jul, 2012

1 commit

  • Can be used to match packets against netfilter ip sets created via ipset(8).
    skb->sk_iif is used as 'incoming interface', skb->dev is 'outgoing interface'.

    Since ipset is usually called from netfilter, the ematch
    initializes a fake xt_action_param, pulls the ip header into the
    linear area and also sets skb->data to the IP header (otherwise
    matching Layer 4 set types doesn't work).
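
    A hedged usage sketch (hypothetical device name eth0 and set name
    "bar", created beforehand with ipset(8); the ematch is used through
    the basic classifier):

    tc filter add dev eth0 parent 1: basic \
    match 'ipset(bar src)' flowid 1:1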

    Tested-by: Mr Dash Four
    Signed-off-by: Florian Westphal
    Signed-off-by: David S. Miller

    Florian Westphal
     

04 Jul, 2012

1 commit


13 May, 2012

1 commit

  • Fair Queue Codel packet scheduler

    Principles :

    - Packets are classified (internal classifier or external) on flows.
    - This is a Stochastic model (as we use a hash, several flows might
    be hashed on same slot)
    - Each flow has a CoDel managed queue.
    - Flows are linked onto two (Round Robin) lists,
    so that new flows have priority on old ones.

    - For a given flow, packets are not reordered (CoDel uses a FIFO)
    - head drops only.
    - ECN capability is on by default.
    - Very low memory footprint (64 bytes per flow)

    tc qdisc ... fq_codel [ limit PACKETS ] [ flows number ]
    [ target TIME ] [ interval TIME ] [ noecn ]
    [ quantum BYTES ]

    defaults : 1024 flows, 10240 packets limit, quantum : device MTU
    target : 5ms (CoDel default)
    interval : 100ms (CoDel default)

    Impressive results on load :

    class htb 1:1 root leaf 10: prio 0 quantum 1514 rate 200000Kbit ceil 200000Kbit burst 1475b/8 mpu 0b overhead 0b cburst 1475b/8 mpu 0b overhead 0b level 0
    Sent 43304920109 bytes 33063109 pkt (dropped 0, overlimits 0 requeues 0)
    rate 201691Kbit 28595pps backlog 0b 312p requeues 0
    lended: 33063109 borrowed: 0 giants: 0
    tokens: -912 ctokens: -912

    class fq_codel 10:1735 parent 10:
    (dropped 1292, overlimits 0 requeues 0)
    backlog 15140b 10p requeues 0
    deficit 1514 count 1 lastcount 1 ldelay 7.1ms
    class fq_codel 10:4524 parent 10:
    (dropped 1291, overlimits 0 requeues 0)
    backlog 16654b 11p requeues 0
    deficit 1514 count 1 lastcount 1 ldelay 7.1ms
    class fq_codel 10:4e74 parent 10:
    (dropped 1290, overlimits 0 requeues 0)
    backlog 6056b 4p requeues 0
    deficit 1514 count 1 lastcount 1 ldelay 6.4ms dropping drop_next 92.0ms
    class fq_codel 10:628a parent 10:
    (dropped 1289, overlimits 0 requeues 0)
    backlog 7570b 5p requeues 0
    deficit 1514 count 1 lastcount 1 ldelay 5.4ms dropping drop_next 90.9ms
    class fq_codel 10:a4b3 parent 10:
    (dropped 302, overlimits 0 requeues 0)
    backlog 16654b 11p requeues 0
    deficit 1514 count 1 lastcount 1 ldelay 7.1ms
    class fq_codel 10:c3c2 parent 10:
    (dropped 1284, overlimits 0 requeues 0)
    backlog 13626b 9p requeues 0
    deficit 1514 count 1 lastcount 1 ldelay 5.9ms
    class fq_codel 10:d331 parent 10:
    (dropped 299, overlimits 0 requeues 0)
    backlog 15140b 10p requeues 0
    deficit 1514 count 1 lastcount 1 ldelay 7.0ms
    class fq_codel 10:d526 parent 10:
    (dropped 12160, overlimits 0 requeues 0)
    backlog 35870b 211p requeues 0
    deficit 1508 count 12160 lastcount 1 ldelay 15.3ms dropping drop_next 247us
    class fq_codel 10:e2c6 parent 10:
    (dropped 1288, overlimits 0 requeues 0)
    backlog 15140b 10p requeues 0
    deficit 1514 count 1 lastcount 1 ldelay 7.1ms
    class fq_codel 10:eab5 parent 10:
    (dropped 1285, overlimits 0 requeues 0)
    backlog 16654b 11p requeues 0
    deficit 1514 count 1 lastcount 1 ldelay 5.9ms
    class fq_codel 10:f220 parent 10:
    (dropped 1289, overlimits 0 requeues 0)
    backlog 15140b 10p requeues 0
    deficit 1514 count 1 lastcount 1 ldelay 7.1ms

    qdisc htb 1: root refcnt 6 r2q 10 default 1 direct_packets_stat 0 ver 3.17
    Sent 43331086547 bytes 33092812 pkt (dropped 0, overlimits 66063544 requeues 71)
    rate 201697Kbit 28602pps backlog 0b 260p requeues 71
    qdisc fq_codel 10: parent 1:1 limit 10240p flows 65536 target 5.0ms interval 100.0ms ecn
    Sent 43331086547 bytes 33092812 pkt (dropped 949359, overlimits 0 requeues 0)
    rate 201697Kbit 28602pps backlog 189352b 260p requeues 0
    maxpacket 1514 drop_overlimit 0 new_flow_count 5582 ecn_mark 125593
    new_flows_len 0 old_flows_len 11

    PING 172.30.42.18 (172.30.42.18) 56(84) bytes of data.
    64 bytes from 172.30.42.18: icmp_req=1 ttl=64 time=0.227 ms
    64 bytes from 172.30.42.18: icmp_req=2 ttl=64 time=0.165 ms
    64 bytes from 172.30.42.18: icmp_req=3 ttl=64 time=0.166 ms
    64 bytes from 172.30.42.18: icmp_req=4 ttl=64 time=0.151 ms
    64 bytes from 172.30.42.18: icmp_req=5 ttl=64 time=0.164 ms
    64 bytes from 172.30.42.18: icmp_req=6 ttl=64 time=0.172 ms
    64 bytes from 172.30.42.18: icmp_req=7 ttl=64 time=0.175 ms
    64 bytes from 172.30.42.18: icmp_req=8 ttl=64 time=0.183 ms
    64 bytes from 172.30.42.18: icmp_req=9 ttl=64 time=0.158 ms
    64 bytes from 172.30.42.18: icmp_req=10 ttl=64 time=0.200 ms

    10 packets transmitted, 10 received, 0% packet loss, time 8999ms
    rtt min/avg/max/mdev = 0.151/0.176/0.227/0.022 ms

    Much better than SFQ because of priority given to new flows, and fast
    path dirtying less cache lines.

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

11 May, 2012

1 commit

  • An implementation of CoDel AQM, from Kathleen Nichols and Van Jacobson.

    http://queue.acm.org/detail.cfm?id=2209336

    The main input to this AQM is no longer the queue size in bytes or
    packets, but the time packets spend in the (FIFO) queue.

    As we don't have infinite memory, we can still drop packets in enqueue()
    under massive load, but the point of CoDel is to drop packets in
    dequeue(), using a control law based on two simple parameters:

    target : target sojourn time (default 5ms)
    interval : width of moving time window (default 100ms)
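
    To make the control law concrete, here is a minimal user-space sketch
    of the dropping-state logic. It is only a simplification: the kernel
    implementation in include/net/codel.h uses 1024 ns fixed-point time
    and a cached inverse square root, and resumes "count" across dropping
    phases; this sketch keeps only the two-parameter law.

    /* Sketch of CoDel's control law, all times in nanoseconds. */
    #include <stdint.h>
    #include <stdbool.h>
    #include <math.h>

    struct codel_vars {
        bool     dropping;     /* currently in the dropping state? */
        uint32_t count;        /* drops since entering that state */
        uint64_t first_above;  /* when sojourn first stayed above target */
        uint64_t drop_next;    /* time of the next scheduled drop */
    };

    /* Returns true if this packet should be dropped (or ECN-marked)
     * at dequeue time. */
    static bool codel_should_drop(struct codel_vars *v, uint64_t now,
                                  uint64_t sojourn, uint64_t target,
                                  uint64_t interval)
    {
        if (sojourn < target) {          /* good again: reset the state */
            v->dropping = false;
            v->first_above = 0;
            return false;
        }
        if (v->first_above == 0) {       /* arm the "above target" window */
            v->first_above = now + interval;
            return false;
        }
        if (!v->dropping && now >= v->first_above) {
            v->dropping = true;          /* above target for a full interval */
            v->count = 1;
            v->drop_next = now + interval;
        }
        if (v->dropping && now >= v->drop_next) {
            v->count++;
            /* control law: drops come closer together as count grows */
            v->drop_next = now + (uint64_t)(interval / sqrt(v->count));
            return true;
        }
        return false;
    }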

    Based on initial work from Dave Taht.

    Refactored to help future codel inclusion as a plugin for other linux
    qdisc (FQ_CODEL, ...), like RED.

    include/net/codel.h contains the codel algorithm, kept as close as
    possible to Kathleen's reference.

    net/sched/sch_codel.c contains the linux qdisc specific glue.

    Separate structures permit a memory-efficient implementation of
    fq_codel (to be sent as a separate work): each flow has its own
    struct codel_vars.

    Timestamps are taken at enqueue() time with 1024 ns precision, allowing
    a range of 2199 seconds in queue (2^31 × 1024 ns ≈ 2199 s) and support
    for 100Gb links. iproute2 uses microseconds as its base unit.
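
    A sketch of that fixed-point clock, substituting clock_gettime() for
    the kernel's internal time source (the kernel's own definitions live
    in include/net/codel.h and may differ in detail):

    /* 32-bit timestamps in units of 2^10 ns: the counter wraps only
     * after ~4398 s, so signed deltas stay valid for ~2199 s of
     * queue residency. */
    #include <stdint.h>
    #include <time.h>

    typedef uint32_t codel_time_t;
    #define CODEL_SHIFT 10               /* 2^10 ns = 1024 ns resolution */

    static codel_time_t codel_get_time(void)
    {
        struct timespec ts;

        clock_gettime(CLOCK_MONOTONIC, &ts);
        return (codel_time_t)(((uint64_t)ts.tv_sec * 1000000000ULL
                               + ts.tv_nsec) >> CODEL_SHIFT);
    }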

    Selected packets are dropped, unless ECN is enabled and packets can get
    ECN mark instead.

    Tested from 2Mb to 10Gb speeds with no particular problems, on ixgbe and
    tg3 drivers (BQL enabled).

    Usage: tc qdisc ... codel [ limit PACKETS ] [ target TIME ]
    [ interval TIME ] [ ecn ]

    qdisc codel 10: parent 1:1 limit 2000p target 3.0ms interval 60.0ms ecn
    Sent 13347099587 bytes 8815805 pkt (dropped 0, overlimits 0 requeues 0)
    rate 202365Kbit 16708pps backlog 113550b 75p requeues 0
    count 116 lastcount 98 ldelay 4.3ms dropping drop_next 816us
    maxpacket 1514 ecn_mark 84399 drop_overlimit 0

    CoDel must be seen as a base module, and should be used keeping in mind
    that there is still a FIFO queue underneath. So a typical setup will
    probably need a hierarchy of several qdiscs and packet classifiers to
    be able to meet whatever constraints a user might have.

    One possible example would be to use fq_codel, which combines Fair
    Queueing and CoDel, as a replacement for sfq / sfq_red.

    Signed-off-by: Eric Dumazet
    Signed-off-by: Dave Taht
    Cc: Kathleen Nichols
    Cc: Van Jacobson
    Cc: Tom Herbert
    Cc: Matt Mathis
    Cc: Yuchung Cheng
    Cc: Stephen Hemminger
    Signed-off-by: David S. Miller

    Eric Dumazet
     

08 Feb, 2012

1 commit

  • The qdisc supports two operations - plug and unplug. When the
    qdisc receives a plug command via netlink request, packets arriving
    henceforth are buffered until a corresponding unplug command is received.
    Depending on the type of unplug command, the queue can be unplugged
    indefinitely or selectively.

    This qdisc can be used to implement output buffering, an essential
    functionality required for consistent recovery in checkpoint-based
    fault-tolerance systems. Output buffering enables speculative execution
    by allowing generated network traffic to be rolled back. It is used to
    provide network protection for Xen Guests in the Remus high availability
    project, available as part of Xen.

    This module is generic enough to be used by any other system that wishes
    to add speculative execution and output buffering to its applications.

    This module was originally available in the linux 2.6.32 PV-OPS tree,
    used as dom0 for Xen.

    For more information, please refer to http://nss.cs.ubc.ca/remus/
    and http://wiki.xensource.com/xenwiki/Remus

    Changes in V3:
    * Removed debug output (printk) on queue overflow
    * Added TCQ_PLUG_RELEASE_INDEFINITE, which allows the user to use
    this qdisc for simple plug/unplug operations.
    * Use of packet counts instead of pointers to keep track of
    the buffers in the queue.
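
    A minimal sketch of what that packet-count bookkeeping could look
    like (all structure and function names here are hypothetical, not
    taken from the module):

    #include <stdbool.h>
    #include <stdint.h>

    struct plug_state {
        bool     release_indefinite; /* let everything pass from now on */
        uint32_t pkts_current;       /* enqueued since the last plug */
        uint32_t pkts_last_epoch;    /* completed buffer awaiting unplug */
        uint32_t pkts_to_release;    /* still allowed to leave dequeue() */
    };

    static void plug(struct plug_state *q)
    {
        /* Close the current buffer; packets arriving from now on form
         * the next buffer and stay queued until it too is released. */
        q->pkts_last_epoch = q->pkts_current;
        q->pkts_current = 0;
    }

    static void unplug(struct plug_state *q, bool indefinite)
    {
        if (indefinite)                  /* TCQ_PLUG_RELEASE_INDEFINITE */
            q->release_indefinite = true;
        else                             /* selective: drain one buffer */
            q->pkts_to_release += q->pkts_last_epoch;
    }

    /* dequeue() would then emit a packet only while pkts_to_release > 0
     * or release_indefinite is set. */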

    Signed-off-by: Shriram Rajagopalan
    Signed-off-by: Brendan Cully
    [author of the code in the linux 2.6.32 pvops tree]
    Signed-off-by: David S. Miller

    Shriram Rajagopalan
     

05 Apr, 2011

1 commit

  • This is an implementation of the Quick Fair Queue scheduler developed
    by Fabio Checconi. The same algorithm is already implemented in ipfw
    in FreeBSD. Fabio had an earlier version developed on Linux; I just
    cleaned it up. Thanks to Eric Dumazet for testing this under load.

    Signed-off-by: Stephen Hemminger
    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    stephen hemminger
     

24 Feb, 2011

1 commit

    This is the Stochastic Fair Blue scheduler, based on work from:

    W. Feng, D. Kandlur, D. Saha, K. Shin. Blue: A New Class of Active Queue
    Management Algorithms. U. Michigan CSE-TR-387-99, April 1999.

    http://www.thefengs.com/wuchang/blue/CSE-TR-387-99.pdf

    This implementation is based on work done by Juliusz Chroboczek

    General SFB algorithm can be found in figure 14, page 15:

    B[l][n] : L x N array of bins (L levels, N bins per level)

    enqueue()
      Calculate hash function values h{0}, h{1}, .. h{L-1}
      Update bins at each level
      for i = 0 to L - 1
        if (B[i][h{i}].qlen > bin_size)
          B[i][h{i}].p_mark += p_increment;
        else if (B[i][h{i}].qlen == 0)
          B[i][h{i}].p_mark -= p_decrement;
      p_min = min(B[0][h{0}].p_mark ... B[L-1][h{L-1}].p_mark);
      if (p_min == 1.0)
        ratelimit();
      else
        mark/drop with probability p_min;
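
    Translated into C, the per-level bin update might look like this
    sketch (levels, bin counts, and names are made up for illustration
    and are not taken from the kernel's sch_sfb.c):

    #define SFB_LEVELS 8
    #define SFB_BINS   16

    struct sfb_bin {
        unsigned qlen;    /* virtual queue length for this bin */
        double   p_mark;  /* marking probability, 0.0 .. 1.0 */
    };

    static struct sfb_bin bins[SFB_LEVELS][SFB_BINS];

    /* Update the flow's bin at every level and return p_min; the caller
     * rate-limits the flow if p_min == 1.0, otherwise marks/drops with
     * probability p_min. */
    static double sfb_enqueue_update(const unsigned h[SFB_LEVELS],
                                     unsigned bin_size,
                                     double p_inc, double p_dec)
    {
        double p_min = 1.0;

        for (int i = 0; i < SFB_LEVELS; i++) {
            struct sfb_bin *b = &bins[i][h[i]];

            if (b->qlen > bin_size)
                b->p_mark += p_inc;
            else if (b->qlen == 0)
                b->p_mark -= p_dec;

            if (b->p_mark < 0.0)      /* clamp into [0, 1] */
                b->p_mark = 0.0;
            if (b->p_mark > 1.0)
                b->p_mark = 1.0;

            if (b->p_mark < p_min)
                p_min = b->p_mark;
        }
        return p_min;
    }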

    I adapted Juliusz's code to meet current kernel standards and made
    various changes to address previous comments:

    http://thread.gmane.org/gmane.linux.network/90225
    http://thread.gmane.org/gmane.linux.network/90375

    The default flow classifier is the rxhash introduced by RPS in 2.6.35,
    but an external flow classifier can be used if desired.

    tc qdisc add dev $DEV parent 1:11 handle 11: \
    est 0.5sec 2sec sfb limit 128

    tc filter add dev $DEV protocol ip parent 11: handle 3 \
    flow hash keys dst divisor 1024

    Notes:

    1) SFB's default child qdisc is pfifo_fast. It can be replaced by
    another qdisc, but a child qdisc MUST NOT drop a packet it has
    previously queued. This is because SFB needs to handle a dequeued
    packet in order to maintain its virtual queue states. pfifo_head_drop
    or CHOKe should not be used.

    2) ECN is enabled by default, unlike RED/CHOKe/GRED

    With help from Patrick McHardy & Andi Kleen

    Signed-off-by: Eric Dumazet
    CC: Juliusz Chroboczek
    CC: Stephen Hemminger
    CC: Patrick McHardy
    CC: Andi Kleen
    CC: John W. Linville
    Signed-off-by: David S. Miller

    Eric Dumazet
     

03 Feb, 2011

1 commit

    CHOKe ("CHOose and Kill" or "CHOose and Keep") is an alternative
    packet scheduler based on the Random Early Detection (RED) algorithm.

    The core idea is:
    For every packet arrival:
      Calculate Qave
      if (Qave < minth)
        Queue the new packet
      else
        Select randomly a packet from the queue
        if (both packets from same flow)
          then Drop both the packets
        else if (Qave > maxth)
          Drop packet
        else
          Admit packet with probability p (same as RED)
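
    A compact C sketch of that arrival decision (names are illustrative,
    not taken from the kernel's sch_choke.c):

    enum choke_action {
        CHOKE_QUEUE,      /* queue the new packet */
        CHOKE_DROP_BOTH,  /* drop new packet and the random victim */
        CHOKE_DROP,       /* drop the new packet */
        CHOKE_RED,        /* admit with RED probability p */
    };

    static enum choke_action choke_decide(double qave,
                                          double minth, double maxth,
                                          unsigned new_flow,
                                          unsigned victim_flow)
    {
        if (qave < minth)
            return CHOKE_QUEUE;
        if (new_flow == victim_flow)   /* victim drawn at random */
            return CHOKE_DROP_BOTH;
        if (qave > maxth)
            return CHOKE_DROP;
        return CHOKE_RED;
    }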

    See also:
    Rong Pan, Balaji Prabhakar, Konstantinos Psounis, "CHOKe: a stateless active
    queue management scheme for approximating fair bandwidth allocation",
    Proceedings of INFOCOM'2000, March 2000.

    Help from:
    Eric Dumazet
    Patrick McHardy

    Signed-off-by: Stephen Hemminger
    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    stephen hemminger
     

20 Jan, 2011

1 commit

    This implements an mqprio queueing discipline that by default creates
    a pfifo_fast qdisc per tx queue and provides the needed configuration
    interface.

    Using the mqprio qdisc, the number of traffic classes (TCs) currently
    in use, along with the range of queues allotted to each class, can be
    configured. By default skbs are mapped to traffic classes using the
    skb priority. This mapping is configurable.

    Configurable parameters:

    struct tc_mqprio_qopt {
        __u8  num_tc;
        __u8  prio_tc_map[TC_BITMASK + 1];
        __u8  hw;
        __u16 count[TC_MAX_QUEUE];
        __u16 offset[TC_MAX_QUEUE];
    };

    Here each count/offset pair gives a traffic class's queue range, and
    prio_tc_map gives the mapping from skb->priority to TC.
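
    Concretely, resolving an skb to a tx queue under mqprio might look
    like the following sketch (netdev_get_prio_tc_map() is the kernel's
    real helper for the priority-to-TC step; everything else here is
    illustrative):

    #include <stdint.h>

    #define TC_BITMASK   15
    #define TC_MAX_QUEUE 16

    struct mqprio_map {
        uint8_t  prio_tc_map[TC_BITMASK + 1];
        uint16_t count[TC_MAX_QUEUE];   /* queues in each TC's range */
        uint16_t offset[TC_MAX_QUEUE];  /* first queue of each range */
    };

    /* Assumes count[tc] != 0 for every configured TC. */
    static unsigned pick_tx_queue(const struct mqprio_map *m,
                                  uint32_t skb_priority,
                                  uint32_t flow_hash)
    {
        unsigned tc = m->prio_tc_map[skb_priority & TC_BITMASK];

        /* Spread flows across the TC's range [offset, offset + count). */
        return m->offset[tc] + (flow_hash % m->count[tc]);
    }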

    The hw bit determines if the hardware should configure the count
    and offset values. If the hardware bit is set then the operation
    will fail if the hardware does not implement the ndo_setup_tc
    operation. This is to avoid undetermined states where the hardware
    may or may not control the queue mapping. Also minimal bounds
    checking is done on the count/offset to verify a queue does not
    exceed num_tx_queues and that queue ranges do not overlap. Otherwise
    it is left to user policy or hardware configuration to create
    useful mappings.

    It is expected that hardware QOS schemes can be implemented by
    creating appropriate mappings of queues in ndo_setup_tc().

    One expected use case is that drivers will use ndo_setup_tc to map
    queue ranges onto 802.1Q traffic classes. This provides a generic
    mechanism to map network traffic onto these traffic classes and
    removes the need for lower layer drivers to know specifics about
    traffic types.

    Signed-off-by: John Fastabend
    Signed-off-by: David S. Miller

    John Fastabend