02 Jul, 2020

1 commit

  • Similar to fq_codel and the other qdiscs that can be set as default,
    fq_pie is also suitable for general use without explicit configuration,
    which makes it a valid choice for this.

    This is useful in situations where a painless out-of-the-box solution
    for reducing bufferbloat is desired but fq_codel is not necessarily the
    best choice. For example, fq_pie can be better for DASH streaming, but
    there could be more cases where it's the better choice of the two simple
    AQMs available in the kernel.

    Signed-off-by: Danny Lin
    Signed-off-by: David S. Miller

    Danny Lin
     

14 Jun, 2020

1 commit

  • Since commit 84af7a6194e4 ("checkpatch: kconfig: prefer 'help' over
    '---help---'"), the number of '---help---' has been gradually
    decreasing, but there are still more than 2400 instances.

    This commit finishes the conversion. While I touched the lines,
    I also fixed the indentation.

    There are a variety of indentation styles found.

    a) 4 spaces + '---help---'
    b) 7 spaces + '---help---'
    c) 8 spaces + '---help---'
    d) 1 space + 1 tab + '---help---'
    e) 1 tab + '---help---' (correct indentation)
    f) 1 tab + 1 space + '---help---'
    g) 1 tab + 2 spaces + '---help---'

    In order to convert all of them to 1 tab + 'help', I ran the
    following command:

    $ find . -name 'Kconfig*' | xargs sed -i 's/^[[:space:]]*---help---/\thelp/'
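
    The effect of that substitution on the seven indentation styles listed
    above can be checked with a small sketch. This is not the command the
    author ran; convert_help is a hypothetical helper, and [ \t] stands in
    for the POSIX [[:space:]] class on the assumption that only spaces and
    tabs precede the keyword.

```python
import re

# Mirrors: sed 's/^[[:space:]]*---help---/\thelp/'
# Assumption: only spaces and tabs precede the keyword, so [ \t] replaces
# POSIX [[:space:]] (Python's re module has no POSIX character classes).
HELP_RE = re.compile(r"^[ \t]*---help---", flags=re.MULTILINE)


def convert_help(kconfig_text: str) -> str:
    """Rewrite every leading '---help---' to one tab followed by 'help'."""
    return HELP_RE.sub("\thelp", kconfig_text)


styles = [
    "    ---help---",      # a) 4 spaces
    "       ---help---",   # b) 7 spaces
    "        ---help---",  # c) 8 spaces
    " \t---help---",       # d) 1 space + 1 tab
    "\t---help---",        # e) 1 tab
    "\t ---help---",       # f) 1 tab + 1 space
    "\t  ---help---",      # g) 1 tab + 2 spaces
]
converted = [convert_help(s) for s in styles]
```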

    Signed-off-by: Masahiro Yamada

    Masahiro Yamada
     

02 May, 2020

1 commit

  • Introduce an ingress frame gate control flow action.
    The tc gate action works as follows:
    Assume there is a gate that lets the specified ingress frames pass
    during certain time slots and drops them during others. A tc filter
    selects the ingress frames, and the tc gate action specifies in which
    time slots those frames may pass to the device and in which time slots
    they are dropped.
    The tc gate action provides an entry list that tells how long the gate
    stays open and how long it stays closed. The gate action also assigns
    a start time that tells when the entry list starts. The driver then
    repeats the gate entry list cyclically.
    For the software simulation, the gate action requires the user to
    assign a clock type.

    Below is a configuration example from user space. The tc filter matches
    a stream whose source IP address is 192.168.0.20, and the gate action
    has two time slots. The first lasts 200ms with the gate open, letting
    frames pass; the second lasts 100ms with the gate closed, dropping
    frames. Once the ingress frames total more than 8000000 bytes, the
    excess frames are dropped within that 200000000ns time slot.

    > tc qdisc add dev eth0 ingress

    > tc filter add dev eth0 parent ffff: protocol ip \
    flower src_ip 192.168.0.20 \
    action gate index 2 clockid CLOCK_TAI \
    sched-entry open 200000000 -1 8000000 \
    sched-entry close 100000000 -1 -1

    > tc chain del dev eth0 ingress chain 0

    "sched-entry" follows the taprio naming style. The gate state is
    "open" or "close", followed by the period in nanoseconds. The next item
    is the internal priority value, which selects the ingress queue the
    frame is put on; "-1" means wildcard. The last value is optional and
    specifies the maximum number of MSDU octets that are permitted to pass
    the gate during the specified time interval.
    If base-time is not set, it defaults to 0, so the start time is
    ((N + 1) * cycletime), the earliest such time in the future.

    The example below filters a stream whose destination MAC address is
    10:00:80:00:00:00 and whose IP protocol is ICMP, followed by the gate
    action. The gate action runs with a single close time slot, which means
    the gate always stays closed. The cycle time is 200000000ns in total.
    The start time is calculated from the base-time as:

    1357000000000 + (N + 1) * cycletime

    The first such value that lies in the future becomes the start time.
    The cycletime is 200000000ns in this case.
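
    The start-time rule above can be written out as a short calculation.
    This is only a sketch of the arithmetic described in this message:
    gate_start_time and now_ns are illustrative names, and returning a
    base-time that is already in the future unchanged is an assumption.

```python
def gate_start_time(base_time_ns: int, cycle_time_ns: int, now_ns: int) -> int:
    """Earliest base + (N + 1) * cycle that lies after now_ns."""
    if cycle_time_ns <= 0:
        raise ValueError("cycle time must be positive")
    if base_time_ns > now_ns:
        # Assumption: a base-time already in the future is used as-is.
        return base_time_ns
    # N counts the whole cycles that have elapsed since base-time.
    n = (now_ns - base_time_ns) // cycle_time_ns
    return base_time_ns + (n + 1) * cycle_time_ns


# Values from the example: base-time 1357000000000, one 200000000ns entry.
base = 1357000000000
cycle = 200000000
start = gate_start_time(base, cycle, now_ns=base + 500000000)
```

    With "now" taken as 500ms after the base-time, two full cycles have
    elapsed, so N is 2 and the start time is the base-time plus three
    cycles.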

    > tc filter add dev eth0 parent ffff: protocol ip \
    flower skip_hw ip_proto icmp dst_mac 10:00:80:00:00:00 \
    action gate index 12 base-time 1357000000000 \
    sched-entry close 200000000 -1 -1 \
    clockid CLOCK_TAI

    Signed-off-by: Po Liu
    Signed-off-by: David S. Miller

    Po Liu
     

04 Mar, 2020

1 commit


23 Jan, 2020

1 commit

  • Principles:
    - Packets are classified on flows.
    - This is a Stochastic model (as we use a hash, several flows might
    be hashed to the same slot)
    - Each flow has a PIE managed queue.
    - Flows are linked onto two (Round Robin) lists,
    so that new flows have priority on old ones.
    - For a given flow, packets are not reordered.
    - Drops during enqueue only.
    - ECN capability is off by default.
    - ECN threshold (if ECN is enabled) is at 10% by default.
    - Uses timestamps to calculate queue delay by default.

    Usage:
    tc qdisc ... fq_pie [ limit PACKETS ] [ flows NUMBER ]
    [ target TIME ] [ tupdate TIME ]
    [ alpha NUMBER ] [ beta NUMBER ]
    [ quantum BYTES ] [ memory_limit BYTES ]
    [ ecnprob PERCENTAGE ] [ [no]ecn ]
    [ [no]bytemode ] [ [no_]dq_rate_estimator ]

    defaults:
    limit: 10240 packets, flows: 1024
    target: 15 ms, tupdate: 15 ms (in jiffies)
    alpha: 1/8, beta : 5/4
    quantum: device MTU, memory_limit: 32 Mb
    ecnprob: 10%, ecn: off
    bytemode: off, dq_rate_estimator: off
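
    The "stochastic" principle above can be illustrated with a toy
    classifier. This is a sketch, not the kernel's hash: zlib.crc32 over a
    5-tuple string and the flow_slot name are assumptions chosen to keep
    the example deterministic; only the default of 1024 flows comes from
    the list above.

```python
import zlib


def flow_slot(src_ip, dst_ip, src_port, dst_port, proto, flows=1024):
    """Map a 5-tuple to one of `flows` slots (hypothetical stand-in hash)."""
    key = f"{src_ip}|{dst_ip}|{src_port}|{dst_port}|{proto}".encode()
    return zlib.crc32(key) % flows


# Packets of one flow always land in the same slot, so they stay in order.
a = flow_slot("10.0.0.1", "10.0.0.2", 40000, 443, "tcp")
b = flow_slot("10.0.0.1", "10.0.0.2", 40000, 443, "tcp")

# With more distinct flows than slots, some flows must share a slot.
slots = [flow_slot("10.0.0.1", "10.0.0.2", port, 443, "tcp", flows=4)
         for port in range(40000, 40005)]
```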

    Signed-off-by: Mohit P. Tahiliani
    Signed-off-by: Sachin D. Patil
    Signed-off-by: V. Saicharan
    Signed-off-by: Mohit Bhasi
    Signed-off-by: Leslie Monis
    Signed-off-by: Gautam Ramakrishnan
    Signed-off-by: David S. Miller

    Mohit P. Tahiliani
     

19 Dec, 2019

1 commit

  • Introduces a new Qdisc, which is based on 802.1Q-2014 wording. It is
    PRIO-like in how it is configured, meaning one needs to specify how many
    bands there are, how many are strict and how many are dwrr, quanta for the
    latter, and priomap.

    The new Qdisc operates like the PRIO / DRR combo would when configured as
    per the standard. The strict classes, if any, are tried for traffic first.
    When there's no traffic in any of the strict queues, the ETS ones (if any)
    are treated in the same way as in DRR.
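
    The dequeue order described above can be sketched as follows. This is
    an illustration of strict-then-DRR selection, not the Qdisc's code: the
    class and method names are hypothetical, and the deficit handling
    follows the textbook deficit round robin scheme.

```python
from collections import deque


class EtsSketch:
    """Strict bands first, then deficit round robin over the ETS bands."""

    def __init__(self, strict_bands, quanta):
        if any(q <= 0 for q in quanta):
            raise ValueError("quanta must be positive")
        self.strict = [deque() for _ in range(strict_bands)]
        self.dwrr = [deque() for _ in quanta]
        self.quanta = list(quanta)
        self.deficit = [0] * len(quanta)
        self.active = deque()  # backlogged DWRR bands in service order

    def enqueue(self, band, name, size):
        if band < len(self.strict):
            self.strict[band].append((name, size))
            return
        i = band - len(self.strict)
        if not self.dwrr[i]:
            # A band that becomes backlogged starts with one quantum.
            self.deficit[i] = self.quanta[i]
            self.active.append(i)
        self.dwrr[i].append((name, size))

    def dequeue(self):
        for queue in self.strict:  # lowest band index has top priority
            if queue:
                return queue.popleft()[0]
        while self.active:
            i = self.active[0]
            name, size = self.dwrr[i][0]
            if size <= self.deficit[i]:
                self.deficit[i] -= size
                self.dwrr[i].popleft()
                if not self.dwrr[i]:
                    self.active.popleft()
                return name
            # Not enough deficit: top up and move the band to the tail.
            self.deficit[i] += self.quanta[i]
            self.active.rotate(-1)
        return None


ets = EtsSketch(strict_bands=1, quanta=[200, 100])
for pkt in ("a1", "a2", "a3", "a4"):
    ets.enqueue(1, pkt, 100)
for pkt in ("b1", "b2"):
    ets.enqueue(2, pkt, 100)
ets.enqueue(0, "s1", 100)
order = [ets.dequeue() for _ in range(8)]
```

    With quanta of 200 and 100 bytes and equal packet sizes, the first ETS
    band sends two packets for every one sent by the second.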

    Signed-off-by: Petr Machata
    Acked-by: Jiri Pirko
    Signed-off-by: David S. Miller

    Petr Machata
     

28 Sep, 2019

1 commit


26 Sep, 2019

1 commit


06 Sep, 2019

1 commit

  • Offloaded OvS datapath rules are translated one to one to tc rules,
    for example the following simplified OvS rule:

    recirc_id(0),in_port(dev1),eth_type(0x0800),ct_state(-trk) actions:ct(),recirc(2)

    Will be translated to the following tc rule:

    $ tc filter add dev dev1 ingress \
    prio 1 chain 0 proto ip \
    flower tcp ct_state -trk \
    action ct pipe \
    action goto chain 2

    Received packets will first travel through tc, and if they aren't stolen
    by it, like in the above rule, they will continue to the OvS datapath.
    Since we already did some actions (action ct in this case) which might
    modify the packets, and updated action stats, we would like to continue
    the processing with the correct recirc_id in OvS (here recirc_id(2))
    where we left off.

    To support this, introduce a new skb extension for tc, which
    will be used for translating tc chain to ovs recirc_id to
    handle these miss cases. Last tc chain index will be set
    by tc goto chain action and read by OvS datapath.

    Signed-off-by: Paul Blakey
    Signed-off-by: Vlad Buslov
    Acked-by: Jiri Pirko
    Acked-by: Pravin B Shelar
    Signed-off-by: David S. Miller

    Paul Blakey
     

18 Jul, 2019

1 commit

  • If NF_NAT is m and NET_ACT_CT is y, build fails:

    net/sched/act_ct.o: In function `tcf_ct_act':
    act_ct.c:(.text+0x21ac): undefined reference to `nf_ct_nat_ext_add'
    act_ct.c:(.text+0x229a): undefined reference to `nf_nat_icmp_reply_translation'
    act_ct.c:(.text+0x233a): undefined reference to `nf_nat_setup_info'
    act_ct.c:(.text+0x234a): undefined reference to `nf_nat_alloc_null_binding'
    act_ct.c:(.text+0x237c): undefined reference to `nf_nat_packet'

    Reported-by: Hulk Robot
    Fixes: b57dc7c13ea9 ("net/sched: Introduce action ct")
    Signed-off-by: YueHaibing
    Signed-off-by: David S. Miller

    YueHaibing
     

10 Jul, 2019

1 commit

  • Allow sending a packet to conntrack module for connection tracking.

    The packet will be marked with the conntrack connection's state and
    any metadata such as conntrack mark and label. This state metadata
    can later be matched against with tc classifiers, for example with the
    flower classifier as below.

    In addition to committing new connections, the user can optionally
    specify a zone to track within, set a mark/label and configure nat
    with an address range and port range.

    Usage is as follows:
    $ tc qdisc add dev ens1f0_0 ingress
    $ tc qdisc add dev ens1f0_1 ingress

    $ tc filter add dev ens1f0_0 ingress \
    prio 1 chain 0 proto ip \
    flower ip_proto tcp ct_state -trk \
    action ct zone 2 pipe \
    action goto chain 2
    $ tc filter add dev ens1f0_0 ingress \
    prio 1 chain 2 proto ip \
    flower ct_state +trk+new \
    action ct zone 2 commit mark 0xbb nat src addr 5.5.5.7 pipe \
    action mirred egress redirect dev ens1f0_1
    $ tc filter add dev ens1f0_0 ingress \
    prio 1 chain 2 proto ip \
    flower ct_zone 2 ct_mark 0xbb ct_state +trk+est \
    action ct nat pipe \
    action mirred egress redirect dev ens1f0_1

    $ tc filter add dev ens1f0_1 ingress \
    prio 1 chain 0 proto ip \
    flower ip_proto tcp ct_state -trk \
    action ct zone 2 pipe \
    action goto chain 1
    $ tc filter add dev ens1f0_1 ingress \
    prio 1 chain 1 proto ip \
    flower ct_zone 2 ct_mark 0xbb ct_state +trk+est \
    action ct nat pipe \
    action mirred egress redirect dev ens1f0_0

    Signed-off-by: Paul Blakey
    Signed-off-by: Marcelo Ricardo Leitner
    Signed-off-by: Yossi Kuperman
    Acked-by: Jiri Pirko

    Changelog:
    V5->V6:
    Added CONFIG_NF_DEFRAG_IPV6 in handle fragments ipv6 case
    V4->V5:
    Reordered nf_conntrack_put() in tcf_ct_skb_nfct_cached()
    V3->V4:
    Added strict_start_type for act_ct policy
    V2->V3:
    Fixed David's comments: removed extra newline after rcu in tcf_ct_params, and fixed indent of break in act_ct.c
    V1->V2:
    Fixed parsing of ranges TCA_CT_NAT_IPV6_MAX as 'else' case overwritten ipv4 max
    Refactored NAT_PORT_MIN_MAX range handling as well
    Added ipv4/ipv6 defragmentation
    Removed extra skb pull push of nw offset in execute nat
    Refactored tcf_ct_skb_network_trim after pull
    Removed TCA_ACT_CT define

    Signed-off-by: David S. Miller

    Paul Blakey
     

09 Jul, 2019

1 commit

  • Currently, TC offers the ability to match on the MPLS fields of a packet
    through the use of the flow_dissector_key_mpls struct. However, as yet, TC
    actions do not allow the modification or manipulation of such fields.

    Add a new module that registers TC action ops to allow manipulation of
    MPLS. This includes the ability to push and pop headers as well as modify
    the contents of new or existing headers. A further action to decrement the
    TTL field of an MPLS header is also provided with a new helper added to
    support this.

    Examples of the usage of the new action with flower rules to push and pop
    MPLS labels are:

    tc filter add dev eth0 protocol ip parent ffff: flower \
    action mpls push protocol mpls_uc label 123 \
    action mirred egress redirect dev eth1

    tc filter add dev eth0 protocol mpls_uc parent ffff: flower \
    action mpls pop protocol ipv4 \
    action mirred egress redirect dev eth1
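
    For readers unfamiliar with what push and pop manipulate, the sketch
    below encodes the 32-bit MPLS label stack entry (20-bit label, 3-bit
    traffic class, bottom-of-stack bit, 8-bit TTL). It is not the module's
    code: the helper names are hypothetical, the TTL of 64 is an arbitrary
    assumption, and the "packet" is just the bytes following the Ethernet
    header.

```python
import struct


def encode_lse(label, tc=0, bos=True, ttl=64):
    """Pack one MPLS label stack entry into 4 network-order bytes."""
    if not 0 <= label < (1 << 20):
        raise ValueError("label must fit in 20 bits")
    if not 0 <= tc < 8 or not 0 <= ttl < 256:
        raise ValueError("tc must fit in 3 bits and ttl in 8 bits")
    word = (label << 12) | (tc << 9) | (int(bos) << 8) | ttl
    return struct.pack("!I", word)


def decode_lse(data):
    """Unpack the first label stack entry found in `data`."""
    (word,) = struct.unpack("!I", data[:4])
    return {
        "label": word >> 12,
        "tc": (word >> 9) & 0x7,
        "bos": bool((word >> 8) & 1),
        "ttl": word & 0xFF,
    }


def push(payload, lse):
    """Prepend a label stack entry to the payload."""
    return lse + payload


def pop(payload):
    """Strip the outermost label stack entry."""
    return payload[4:]


lse = encode_lse(123)          # label 123, as in the push example above
ip_packet = b"\x45\x00\x00\x14"  # placeholder bytes, not a full IP header
labelled = push(ip_packet, lse)
```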

    Signed-off-by: John Hurley
    Reviewed-by: Jakub Kicinski
    Reviewed-by: Simon Horman
    Reviewed-by: Willem de Bruijn
    Acked-by: Cong Wang
    Signed-off-by: David S. Miller

    John Hurley
     

16 Jun, 2019

1 commit

  • This config option makes only a couple of lines optional:
    two small helpers and an int in a couple of cls structs.

    Remove the config option and always compile this in.
    This saves the user from surprises when they add
    a filter with an ingress device match that is silently ignored
    because the config option is not set.

    Signed-off-by: Jiri Pirko
    Signed-off-by: David S. Miller

    Jiri Pirko
     

30 May, 2019

1 commit

  • ctinfo is a new tc filter action module. It is designed to restore
    information contained in firewall conntrack marks to other packet fields
    and is typically used on packet ingress paths. At present it has two
    independent sub-functions or operating modes, DSCP restoration mode &
    skb mark restoration mode.

    The DSCP restore mode:

    This mode copies DSCP values that have been placed in the firewall
    conntrack mark back into the IPv4/v6 diffserv fields of relevant
    packets.

    DSCP restoration is intended for, and has been found useful for,
    restoring ingress classifications based on egress classifications across
    links that bleach or otherwise change DSCP, typically home ISP Internet
    links. Restoring DSCP on ingress on the WAN link allows qdiscs such as,
    but by no means limited to, CAKE to shape inbound packets according to
    policies that are easier to set & mark on egress.

    Ingress classification is traditionally a challenging task since
    iptables rules haven't yet run and tc filter/eBPF programs are pre-NAT
    lookups, hence are unable to see internal IPv4 addresses as used on the
    typical home masquerading gateway. Thus marking the connection in some
    manner on egress for later restoration of classification on ingress is
    easier to implement.

    Parameters related to DSCP restore mode:

    dscpmask - a 32 bit mask of 6 contiguous bits that indicates which bits
    of the conntrack mark field contain the DSCP value to be restored.

    statemask - a 32 bit mask of (usually) 1 bit length, outside the area
    specified by dscpmask. This represents a conditional operation flag
    whereby the DSCP is only restored if the flag is set. This is useful to
    implement a 'one shot' iptables based classification where the
    'complicated' iptables rules are only run once to classify the
    connection on initial (egress) packet and subsequent packets are all
    marked/restored with the same DSCP. A mask of zero disables the
    conditional behaviour, i.e. the conntrack mark DSCP bits are always
    restored to the IP diffserv field (assuming the conntrack entry is found
    and the skb is an IPv4/IPv6 type).

    e.g. dscpmask 0xfc000000 statemask 0x01000000

    |----0xFC----conntrack mark----000000---|
    | Bits 31-26 | bit 25 | bit24 |~~~ Bit 0|
    | DSCP       | unused | flag  |unused   |
    |-----------------------0x01---000000---|
          |                   |
          |                   |
          ---|             Conditional flag
             v             only restore if set
    |-ip diffserv-|
    | 6 bits      |
    |-------------|
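
    The mask arithmetic in the diagram can be sketched as below. This is
    an illustration of the described behaviour rather than the module's
    code; restore_dscp is a hypothetical name, and the defaults are the
    example masks given above.

```python
def restore_dscp(ct_mark, dscpmask=0xFC000000, statemask=0x01000000):
    """Return the DSCP value to restore, or None if the flag is not set."""
    if dscpmask == 0:
        raise ValueError("dscpmask must select at least one bit")
    # A statemask of zero disables the conditional behaviour.
    if statemask and not (ct_mark & statemask):
        return None
    # Shift by the position of the lowest set bit of dscpmask.
    shift = (dscpmask & -dscpmask).bit_length() - 1
    return (ct_mark & dscpmask) >> shift


# DSCP 46 stored in bits 31-26, conditional flag in bit 24.
flagged = (46 << 26) | 0x01000000
unflagged = 46 << 26
```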

    The skb mark restore mode (cpmark):

    This mode copies the firewall conntrack mark to the skb's mark field.
    It is completely the functional equivalent of the existing act_connmark
    action with the additional feature of being able to apply a mask to the
    restored value.

    Parameters related to skb mark restore mode:

    mask - a 32 bit mask applied to the firewall conntrack mark to mask out
    bits unwanted for restoration. This can be useful where the conntrack
    mark is being used for different purposes by different applications. If
    not specified and by default the whole mark field is copied (i.e.
    default mask of 0xffffffff)

    e.g. mask 0x00ffffff to mask out the top 8 bits being used by the
    aforementioned DSCP restore mode.

    |----0x00----conntrack mark----ffffff---|
    | Bits 31-24 |                          |
    | DSCP & flag|      some value here     |
    |---------------------------------------|
                        |
                        |
                        v
    |------------skb mark-------------------|
    |            |                          |
    |  zeroed    |                          |
    |---------------------------------------|

    Overall parameters:

    zone - conntrack zone

    control - action related control (reclassify | pipe | drop | continue |
    ok | goto chain )

    Signed-off-by: Kevin Darbyshire-Bryant
    Reviewed-by: Toke Høiland-Jørgensen
    Acked-by: Cong Wang
    Signed-off-by: David S. Miller

    Kevin 'ldir' Darbyshire-Bryant
     

21 May, 2019

1 commit


27 Mar, 2019

1 commit


05 Oct, 2018

1 commit

  • This traffic scheduler allows traffic class states (transmission
    allowed/not allowed, in the simplest case) to be scheduled according
    to a pre-generated time sequence. This is the basis of the IEEE
    802.1Qbv specification.

    Example configuration:

    tc qdisc replace dev enp3s0 parent root handle 100 taprio \
    num_tc 3 \
    map 2 2 1 0 2 2 2 2 2 2 2 2 2 2 2 2 \
    queues 1@0 1@1 2@2 \
    base-time 1528743495910289987 \
    sched-entry S 01 300000 \
    sched-entry S 02 300000 \
    sched-entry S 04 300000 \
    clockid CLOCK_TAI

    The configuration format is similar to mqprio. The main difference is
    the presence of a schedule, built by multiple "sched-entry"
    definitions, each entry has the following format:

    sched-entry <CMD> <GATE MASK> <INTERVAL>

    The only supported <CMD> is "S", which means "SetGateStates",
    following the IEEE 802.1Qbv-2015 definition (Table 8-6). <GATE MASK>
    is a bitmask where each bit is associated with a traffic class, so
    bit 0 (the least significant bit) being "on" means that traffic class
    0 is "active" for that schedule entry. <INTERVAL> is a time duration
    in nanoseconds that specifies for how long the state defined by <CMD>
    and <GATE MASK> should be held before moving to the next entry.

    This schedule is circular, that is, after the last entry is executed
    it starts from the first one, indefinitely.

    The other parameters can be defined as follows:

    - base-time: specifies the instant when the schedule starts, if
    'base-time' is a time in the past, the schedule will start at

    base-time + (N * cycle-time)

    where N is the smallest integer so the resulting time is greater
    than "now", and "cycle-time" is the sum of all the intervals of the
    entries in the schedule;

    - clockid: specifies the reference clock to be used;

    The parameters should be similar to what the IEEE 802.1Q family of
    specifications defines.
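
    The schedule arithmetic described above can be worked through for the
    example configuration. This is a sketch only: the function names and
    the "now" value are illustrative, and reading the gate masks 01, 02 and
    04 as hexadecimal is an assumption.

```python
# (command, gate mask, interval in ns) for the three example entries.
entries = [("S", 0x01, 300000), ("S", 0x02, 300000), ("S", 0x04, 300000)]


def open_classes(gate_mask, num_tc):
    """Traffic classes whose bit is set in the gate mask (bit 0 = class 0)."""
    return [tc for tc in range(num_tc) if gate_mask & (1 << tc)]


def cycle_time(schedule):
    """Sum of the intervals of all entries in the schedule."""
    return sum(interval for _, _, interval in schedule)


def schedule_start(base_time, cycle, now):
    """base-time if it is in the future, else base + N * cycle > now."""
    if base_time > now:
        return base_time
    n = (now - base_time) // cycle + 1  # smallest N giving a time after now
    return base_time + n * cycle


base_time = 1528743495910289987
cycle = cycle_time(entries)
start = schedule_start(base_time, cycle, now=base_time + 1000000)
```

    Each entry opens exactly one of the three traffic classes for 300000ns,
    so one cycle lasts 900000ns.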

    Signed-off-by: Vinicius Costa Gomes
    Signed-off-by: David S. Miller

    Vinicius Costa Gomes
     

25 Jul, 2018

2 commits

  • Skbprio (SKB Priority Queue) is a queueing discipline that prioritizes packets
    according to their skb->priority field. Under congestion, already-enqueued lower
    priority packets will be dropped to make space available for higher priority
    packets. Skbprio was conceived as a solution for denial-of-service defenses that
    need to route packets with different priorities as a means to overcome DoS
    attacks.
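
    The drop policy described above can be sketched with a handful of
    per-priority queues. This is an illustration, not the qdisc's code: the
    class name is hypothetical, and dropping from the tail of the
    lowest-priority queue is an assumption.

```python
from collections import deque


class SkbprioSketch:
    """Priority queues that evict the lowest priority under congestion."""

    def __init__(self, limit=64):
        if limit < 1:
            raise ValueError("limit must be at least 1")
        self.limit = limit
        self.queues = {}  # priority -> deque of packets
        self.qlen = 0
        self.dropped = []

    def enqueue(self, pkt, prio):
        if self.qlen >= self.limit:
            lowest = min(self.queues)
            if prio <= lowest:
                # Nothing queued is less important: drop the newcomer.
                self.dropped.append(pkt)
                return False
            # Assumption: the victim is the tail of the lowest queue.
            self.dropped.append(self.queues[lowest].pop())
            if not self.queues[lowest]:
                del self.queues[lowest]
            self.qlen -= 1
        self.queues.setdefault(prio, deque()).append(pkt)
        self.qlen += 1
        return True

    def dequeue(self):
        if not self.queues:
            return None
        highest = max(self.queues)
        pkt = self.queues[highest].popleft()
        if not self.queues[highest]:
            del self.queues[highest]
        self.qlen -= 1
        return pkt


q = SkbprioSketch(limit=2)
results = [q.enqueue("low1", 1), q.enqueue("low2", 1),
           q.enqueue("high", 5), q.enqueue("lower", 0)]
sent = [q.dequeue() for _ in range(3)]
```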

    v5
    *Do not reference qdisc_dev(sch)->tx_queue_len for setting limit. Instead set
    default sch->limit to 64.

    v4
    *Drop Documentation/networking/sch_skbprio.txt doc file to move it to tc man
    page for Skbprio, in iproute2.

    v3
    *Drop max_limit parameter in struct skbprio_sched_data and instead use
    sch->limit.

    *Reference qdisc_dev(sch)->tx_queue_len only once, during initialisation for
    qdisc (previously being referenced every time qdisc changes).

    *Move qdisc's detailed description from in-code to Documentation/networking.

    *When qdisc is saturated, enqueue incoming packet first before dequeueing
    lowest priority packet in queue - improves usage of call stack registers.

    *Introduce and use overlimit stat to keep track of number of dropped packets.

    v2
    *Use skb->priority field rather than DS field. Rename queueing discipline as
    SKB Priority Queue (previously Gatekeeper Priority Queue).

    *Queueing discipline is made classful to expose Skbprio's internal priority
    queues.

    Signed-off-by: Nishanth Devarajan
    Reviewed-by: Sachin Paryani
    Reviewed-by: Cody Doucette
    Reviewed-by: Michel Machado
    Acked-by: Cong Wang
    Signed-off-by: David S. Miller

    Nishanth Devarajan
     
  • Remove trailing whitespace and blank lines at EOF

    Signed-off-by: Stephen Hemminger
    Signed-off-by: David S. Miller

    Stephen Hemminger
     

11 Jul, 2018

1 commit

  • sch_cake targets the home router use case and is intended to squeeze the
    most bandwidth and latency out of even the slowest ISP links and routers,
    while presenting an API simple enough that even an ISP can configure it.

    Example of use on a cable ISP uplink:

    tc qdisc add dev eth0 cake bandwidth 20Mbit nat docsis ack-filter

    To shape a cable download link (ifb and tc-mirred setup elided)

    tc qdisc add dev ifb0 cake bandwidth 200mbit nat docsis ingress wash

    CAKE is filled with:

    * A hybrid Codel/Blue AQM algorithm, "Cobalt", tied to an FQ_Codel
    derived Flow Queuing system, which autoconfigures based on the bandwidth.
    * A novel "triple-isolate" mode (the default) which balances per-host
    and per-flow FQ even through NAT.
    * A deficit-based shaper that can also be used in an unlimited mode.
    * 8 way set associative hashing to reduce flow collisions to a minimum.
    * A reasonable interpretation of various diffserv latency/loss tradeoffs.
    * Support for zeroing diffserv markings for entering and exiting traffic.
    * Support for interacting well with Docsis 3.0 shaper framing.
    * Extensive support for DSL framing types.
    * Support for ack filtering.
    * Extensive statistics for measuring loss, ECN markings, and latency
    variation.

    A paper describing the design of CAKE is available at
    https://arxiv.org/abs/1804.07617, and will be published at the 2018 IEEE
    International Symposium on Local and Metropolitan Area Networks (LANMAN).

    This patch adds the base shaper and packet scheduler, while subsequent
    commits add the optional (configurable) features. The full userspace API
    and most data structures are included in this commit, but options not
    understood in the base version will be ignored.

    Various versions have been baking as out-of-tree builds for
    kernel versions going back to 3.10, as the embedded router world has been
    running a few years behind mainline Linux. A stable version has been
    generally available on lede-17.01 and later.

    sch_cake replaces a combination of iptables, tc filter, htb and fq_codel
    in the sqm-scripts, with sane defaults and vastly simpler configuration.

    CAKE's principal author is Jonathan Morton, with contributions from
    Kevin Darbyshire-Bryant, Toke Høiland-Jørgensen, Sebastian Moeller,
    Ryan Mounce, Tony Ambardar, Dean Scarff, Nils Andreas Svee, Dave Täht,
    and Loganaden Velvindron.

    Testing from Pete Heist, Georgios Amanakis, and the many other members of
    the cake@lists.bufferbloat.net mailing list.

    tc -s qdisc show dev eth2
    qdisc cake 8017: root refcnt 2 bandwidth 1Gbit diffserv3 triple-isolate split-gso rtt 100.0ms noatm overhead 38 mpu 84
    Sent 51504294511 bytes 37724591 pkt (dropped 6, overlimits 64958695 requeues 12)
    backlog 0b 0p requeues 12
    memory used: 1053008b of 15140Kb
    capacity estimate: 970Mbit
    min/max network layer size: 28 / 1500
    min/max overhead-adjusted size: 84 / 1538
    average network hdr offset: 14
                      Bulk  Best Effort        Voice
    thresh       62500Kbit        1Gbit      250Mbit
    target           5.0ms        5.0ms        5.0ms
    interval       100.0ms      100.0ms      100.0ms
    pk_delay           5us          5us          6us
    av_delay           3us          2us          2us
    sp_delay           2us          1us          1us
    backlog             0b           0b           0b
    pkts           3164050     25030267      9530280
    bytes       3227519915  35396974782  12879808898
    way_inds             0            8            0
    way_miss            21          366           25
    way_cols             0            0            0
    drops                5            0            1
    marks                0            0            0
    ack_drop             0            0            0
    sp_flows             1            3            0
    bk_flows             0            1            1
    un_flows             0            0            0
    max_len          68130        68130        68130

    Tested-by: Pete Heist
    Tested-by: Georgios Amanakis
    Signed-off-by: Dave Taht
    Signed-off-by: Toke Høiland-Jørgensen
    Signed-off-by: David S. Miller

    Toke Høiland-Jørgensen
     

04 Jul, 2018

1 commit

  • The ETF (Earliest TxTime First) qdisc uses the information added
    earlier in this series (the socket option SO_TXTIME and the new
    role of sk_buff->tstamp) to schedule packets transmission based
    on absolute time.

    For some workloads, just bandwidth enforcement is not enough, and
    precise control of the transmission of packets is necessary.

    Example:

    $ tc qdisc replace dev enp2s0 parent root handle 100 mqprio num_tc 3 \
    map 2 2 1 0 2 2 2 2 2 2 2 2 2 2 2 2 queues 1@0 1@1 2@2 hw 0

    $ tc qdisc add dev enp2s0 parent 100:1 etf delta 100000 \
    clockid CLOCK_TAI

    In this example, the Qdisc provides software best-effort control of
    the transmission time to the network adapter, the timestamp in the
    socket is in reference to the clockid CLOCK_TAI, and packets
    leave the qdisc "delta" (100000) nanoseconds before their transmission
    time.

    The ETF qdisc will buffer packets sorted by their txtime. It will drop
    packets on enqueue() if their skbuff clockid does not match the clock
    reference of the Qdisc. Moreover, on dequeue(), a packet will be dropped
    if it expires while being enqueued.

    The qdisc also supports the SO_TXTIME deadline mode. For this mode, it
    will dequeue a packet as soon as possible and change the skb timestamp
    to 'now' during etf_dequeue().

    Note that both the qdisc's and the SO_TXTIME ABIs allow for a clockid
    to be configured, but it's been decided that usage of CLOCK_TAI should
    be enforced until we decide to allow for other clockids to be used.
    The rationale here is that PTP times are usually in the TAI scale, thus
    no other clocks should be necessary. For now, the qdisc will return
    EINVAL if any clocks other than CLOCK_TAI are used.
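
    The buffering behaviour described above can be sketched with a heap
    ordered by txtime. This illustrates the rules stated in this message
    and is not the qdisc's implementation: the class and its method
    signatures are hypothetical, and deadline mode is left out.

```python
import heapq


class EtfSketch:
    """Packets sorted by txtime, released `delta` ns before it."""

    def __init__(self, delta_ns, clockid="CLOCK_TAI"):
        if clockid != "CLOCK_TAI":
            raise ValueError("EINVAL: only CLOCK_TAI is accepted")
        self.delta_ns = delta_ns
        self.clockid = clockid
        self.heap = []
        self.seq = 0  # tie-breaker that keeps equal txtimes in FIFO order
        self.dropped = []

    def enqueue(self, pkt, txtime_ns, clockid="CLOCK_TAI"):
        if clockid != self.clockid:
            self.dropped.append(pkt)  # clock reference mismatch
            return False
        heapq.heappush(self.heap, (txtime_ns, self.seq, pkt))
        self.seq += 1
        return True

    def dequeue(self, now_ns):
        while self.heap:
            txtime_ns, _, pkt = self.heap[0]
            if txtime_ns < now_ns:
                # The packet expired while it was queued: drop it.
                heapq.heappop(self.heap)
                self.dropped.append(pkt)
                continue
            if now_ns >= txtime_ns - self.delta_ns:
                heapq.heappop(self.heap)
                return pkt
            return None  # head of the queue is not due yet
        return None


etf = EtfSketch(delta_ns=100000)
etf.enqueue("b", 2000000)
etf.enqueue("a", 1000000)
rejected = etf.enqueue("x", 500, clockid="CLOCK_MONOTONIC")
too_early = etf.dequeue(now_ns=800000)
on_time = etf.dequeue(now_ns=900000)
expired = etf.dequeue(now_ns=2500000)
```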

    Signed-off-by: Jesus Sanchez-Palencia
    Signed-off-by: Vinicius Costa Gomes
    Signed-off-by: David S. Miller

    Vinicius Costa Gomes
     

22 Feb, 2018

1 commit

  • This commit adds a new tc ematch for using netfilter xtable matches.

    This allows early classification as well as mirroring/redirecting traffic
    based on logic implemented in netfilter extensions.

    The currently supported use case is classification based on the incoming
    IPsec state used during decapsulation, using the 'policy' iptables
    extension (xt_policy).

    The module dynamically fetches the netfilter match module and calls
    it using a fake xt_action_param structure based on validated userspace
    provided parameters.

    As the xt_policy match does not access skb->data, no skb modifications
    are needed on match.

    Signed-off-by: Eyal Birger
    Signed-off-by: David S. Miller

    Eyal Birger
     

31 Jan, 2018

1 commit

  • Blank help texts are probably either a typo, a Kconfig misunderstanding,
    or some kind of half-committing to adding a help text (in which case a
    TODO comment would be clearer, if the help text really can't be added
    right away).

    Best to remove them, IMO.

    Signed-off-by: Ulf Magnusson
    Signed-off-by: David S. Miller

    Ulf Magnusson
     

28 Oct, 2017

1 commit

  • This queueing discipline implements the shaper algorithm defined by
    the 802.1Q-2014 Section 8.6.8.2 and detailed in Annex L.

    Its primary usage is to apply some bandwidth reservation to user
    defined traffic classes, which are mapped to different queues via the
    mqprio qdisc.

    Only a simple software implementation is added for now.

    Signed-off-by: Vinicius Costa Gomes
    Signed-off-by: Jesus Sanchez-Palencia
    Tested-by: Henrik Austad
    Signed-off-by: Jeff Kirsher

    Vinicius Costa Gomes
     

05 Jun, 2017

1 commit

  • It really makes no sense to have cls_act enabled without cls. In that
    case, the cls_act code is dead. So select it.

    This also fixes an issue recently reported by kbuild robot:
    [linux-next:master 1326/4151] net/sched/act_api.c:37:18: error: implicit declaration of function 'tcf_chain_get'

    Reported-by: kbuild test robot
    Fixes: db50514f9a9c ("net: sched: add termination action to allow goto chain")
    Signed-off-by: Jiri Pirko
    Signed-off-by: David S. Miller

    Jiri Pirko
     

18 Apr, 2017

1 commit

  • Since 3.12 it has been possible to configure the default queuing
    discipline via sysctl. This patch adds the ability to configure the
    default queue discipline in kernel configuration. This is useful for
    environments where configuring the value from userspace is difficult
    to manage.

    The default is still the same as before (pfifo_fast) and it is
    possible to change after kernel init with sysctl. This is similar
    to how TCP congestion control works.

    Signed-off-by: Stephen Hemminger
    Signed-off-by: David S. Miller

    stephen hemminger
     

04 Feb, 2017

1 commit


25 Jan, 2017

1 commit

  • This action allows the user to sample traffic matched by a tc classifier.
    The sampling consists of choosing packets randomly and sampling them using
    the psample module. The user can configure the psample group number, the
    sampling rate and the packet's truncation (to save kernel-user traffic).

    Example:
    To sample ingress traffic from interface eth1, one may use the commands:

    tc qdisc add dev eth1 handle ffff: ingress

    tc filter add dev eth1 parent ffff: \
    matchall action sample rate 12 group 4

    Where the first command adds an ingress qdisc and the second starts
    sampling randomly with an average of one sampled packet per 12 packets on
    dev eth1 to psample group 4.
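
    The sampling rule (on average one packet in "rate", optionally
    truncated) can be sketched as below. This is an illustration only:
    sample_packets and its parameters are hypothetical names, and Python's
    random module stands in for whatever random source the action uses.

```python
import random


def sample_packets(packets, rate, rng, trunc_size=None):
    """Keep each packet with probability 1/rate, optionally truncated."""
    if rate < 1:
        raise ValueError("rate must be at least 1")
    sampled = []
    for pkt in packets:
        if rng.randrange(rate) == 0:
            sampled.append(pkt if trunc_size is None else pkt[:trunc_size])
    return sampled


packets = [bytes(64)] * 12000
picked = sample_packets(packets, rate=12, rng=random.Random(0), trunc_size=16)
everything = sample_packets(packets, rate=1, rng=random.Random(0))
```

    With rate 12, roughly 1000 of the 12000 packets are expected to be
    picked; the exact count depends on the random sequence.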

    Signed-off-by: Yotam Gigi
    Signed-off-by: Jiri Pirko
    Acked-by: Jamal Hadi Salim
    Reviewed-by: Simon Horman
    Signed-off-by: David S. Miller

    Yotam Gigi
     

10 Jan, 2017

1 commit


20 Sep, 2016

1 commit

  • Sample use case of how this is encoded:
    user space via tuntap (or a connected VM/Machine/container)
    encodes the tcindex TLV.

    Sample use case of decoding:
    IFE action decodes it and the skb->tc_index is then used to classify.
    So something like this for encoded ICMP packets:

    .. first decode then reclassify... skb->tcindex will be set
    sudo $TC filter add dev $ETH parent ffff: prio 2 protocol 0xbeef \
    u32 match u32 0 0 flowid 1:1 \
    action ife decode reclassify

    ...next match the decoded icmp packet...
    sudo $TC filter add dev $ETH parent ffff: prio 4 protocol ip \
    u32 match ip protocol 1 0xff flowid 1:1 \
    action continue

    ... last classify it using the tcindex classifier and do some action..
    sudo $TC filter add dev $ETH parent ffff: prio 5 protocol ip \
    handle 0x11 tcindex classid 1:1 \
    action blah..

    Signed-off-by: Jamal Hadi Salim
    Signed-off-by: David S. Miller

    Jamal Hadi Salim
     

16 Sep, 2016

1 commit

  • This action is intended to be an upgrade over pedit from a usability
    perspective (as well as operational debuggability).
    Compare this:

    sudo tc filter add dev $ETH parent 1: protocol ip prio 10 \
    u32 match ip protocol 1 0xff flowid 1:2 \
    action pedit munge offset -14 u8 set 0x02 \
    munge offset -13 u8 set 0x15 \
    munge offset -12 u8 set 0x15 \
    munge offset -11 u8 set 0x15 \
    munge offset -10 u16 set 0x1515 \
    pipe

    to:

    sudo tc filter add dev $ETH parent 1: protocol ip prio 10 \
    u32 match ip protocol 1 0xff flowid 1:2 \
    action skbmod dmac 02:15:15:15:15:15

    Also try to do a MAC address swap with pedit or worse
    try to debug a policy with destination mac, source mac and
    etherype. Then make few rules out of those and you'll get my point.

    In the future common use cases on pedit can be migrated to this action
    (as an example different fields in ip v4/6, transports like tcp/udp/sctp
    etc). For this first cut, this allows modifying basic ethernet header.

    The most important ethernet use case at the moment is when redirecting or
    mirroring packets to a remote machine. The dst mac address needs a re-write
    so that the packet doesn't get dropped by, or confuse, an interconnecting
    (learning) switch, or get dropped by the target machine (which looks at the
    dst mac). And at times when flipping back the packet a swap of the MAC
    addresses is needed.

    Signed-off-by: Jamal Hadi Salim
    Signed-off-by: David S. Miller

    Jamal Hadi Salim
     

11 Sep, 2016

1 commit

  • This action could be used before redirecting packets to a shared tunnel
    device, or when redirecting packets arriving from such a device.

    The action will release the metadata created by the tunnel device
    (decap), or set the metadata with the specified values for encap
    operation.

    For example, the following flower filter will forward all ICMP packets
    destined to 11.11.11.2 through the shared vxlan device 'vxlan0'. Before
    redirecting, metadata for the vxlan tunnel is created using the
    tunnel_key action and its arguments:

    $ tc filter add dev net0 protocol ip parent ffff: \
    flower \
    ip_proto 1 \
    dst_ip 11.11.11.2 \
    action tunnel_key set \
    src_ip 11.11.0.1 \
    dst_ip 11.11.0.2 \
    id 11 \
    action mirred egress redirect dev vxlan0
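
    As a rough illustration of where "id 11" ends up, the Python sketch below
    builds the 8-byte VXLAN header as laid out in RFC 7348 (flags byte with the
    VNI-valid bit, 24-bit VNI, reserved fields zero). It assumes the tunnel id
    from the metadata is what vxlan0 writes as the VNI; the function names are
    hypothetical.

```python
import struct

VXLAN_FLAG_VNI_VALID = 0x08  # "I" flag from RFC 7348


def vxlan_header(vni: int) -> bytes:
    """Return the 8-byte VXLAN header carrying the given 24-bit VNI."""
    if not 0 <= vni < (1 << 24):
        raise ValueError("VNI must fit in 24 bits")
    # Word 0: flags (8 bits) + 24 reserved bits; word 1: VNI (24 bits) + 8 reserved.
    return struct.pack("!II", VXLAN_FLAG_VNI_VALID << 24, vni << 8)


def vxlan_vni(header: bytes) -> int:
    """Extract the VNI from a VXLAN header."""
    _flags_word, vni_word = struct.unpack("!II", header[:8])
    return vni_word >> 8


header = vxlan_header(11)  # tunnel id used in the filter above
```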

    Signed-off-by: Amir Vadai
    Signed-off-by: Hadar Hen Zion
    Reviewed-by: Shmulik Ladkani
    Acked-by: Jamal Hadi Salim
    Acked-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Amir Vadai
     

25 Jul, 2016

1 commit

  • The matchall classifier matches every packet and allows the user to apply
    actions on it. This filter is very useful in use cases where every packet
    should be matched; for example, packet mirroring (SPAN) can be set up very
    easily using that filter.

    Signed-off-by: Jiri Pirko
    Signed-off-by: Yotam Gigi
    Signed-off-by: David S. Miller

    Jiri Pirko
     

02 Mar, 2016

3 commits

  • Example usage:
    Set the skb priority using skbedit then allow it to be encoded

    sudo tc qdisc add dev $ETH root handle 1: prio
    sudo tc filter add dev $ETH parent 1: protocol ip prio 10 \
    u32 match ip protocol 1 0xff flowid 1:2 \
    action skbedit prio 17 \
    action ife encode \
    allow prio \
    dst 02:15:15:15:15:15

    Note: You don't need the skbedit action if you are already encoding the
    skb priority earlier. A zero skb priority will not be sent.

    Alternatively, hard code a static priority of decimal 33 (unlike skbedit)
    and then a mark of 0x12 every time the filter matches

    sudo $TC filter add dev $ETH parent 1: protocol ip prio 10 \
    u32 match ip protocol 1 0xff flowid 1:2 \
    action ife encode \
    type 0xDEAD \
    use prio 33 \
    use mark 0x12 \
    dst 02:15:15:15:15:15

    Signed-off-by: Jamal Hadi Salim
    Acked-by: Cong Wang

    Signed-off-by: David S. Miller

    Jamal Hadi Salim
     
  • Example usage:
    Set the skb mark using skbedit then allow it to be encoded

    sudo tc qdisc add dev $ETH root handle 1: prio
    sudo tc filter add dev $ETH parent 1: protocol ip prio 10 \
    u32 match ip protocol 1 0xff flowid 1:2 \
    action skbedit mark 17 \
    action ife encode \
    allow mark \
    dst 02:15:15:15:15:15

    Note: You don't need the skbedit action if you are already encoding the
    skb mark earlier. A zero skb mark, when seen, will not be encoded.

    Alternatively, hard code a static mark of 0x12 every time the filter matches

    sudo $TC filter add dev $ETH parent 1: protocol ip prio 10 \
    u32 match ip protocol 1 0xff flowid 1:2 \
    action ife encode \
    type 0xDEAD \
    use mark 0x12 \
    dst 02:15:15:15:15:15

    Signed-off-by: Jamal Hadi Salim
    Acked-by: Cong Wang
    Signed-off-by: David S. Miller

    Jamal Hadi Salim
     
  • This action allows for a sending side to encapsulate arbitrary metadata
    which is decapsulated by the receiving end.
    The sender runs in encoding mode and the receiver in decode mode.
    Both sender and receiver must specify the same ethertype.
    At some point we hope to have a registered ethertype and we'll
    then provide a default so the user doesn't have to specify it.
    For now we require the user to specify it.
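
    A minimal Python sketch of the encode/decode idea follows. The layout is an
    assumption for illustration: each metadatum is packed as a 16-bit type, a
    16-bit length that counts the 4-byte TLV header, and a 32-bit value in
    network byte order, with type IDs chosen to mirror the kernel's IFE_META_*
    enum. The function names are hypothetical and the sketch only handles
    32-bit values.

```python
import struct

# Assumed metadata type IDs (mirroring the IFE_META_* enum).
IFE_META_SKBMARK = 1
IFE_META_PRIO = 3

TLV_HDR_LEN = 4


def encode_tlv(meta_type: int, value: int) -> bytes:
    """Pack one 32-bit metadatum as type | length | value, big-endian."""
    payload = struct.pack("!I", value)
    return struct.pack("!HH", meta_type, TLV_HDR_LEN + len(payload)) + payload


def decode_tlvs(buf: bytes) -> dict:
    """Walk a buffer of TLVs and return {type: value}."""
    out = {}
    offset = 0
    while offset < len(buf):
        if offset + TLV_HDR_LEN > len(buf):
            raise ValueError("truncated TLV header")
        meta_type, length = struct.unpack_from("!HH", buf, offset)
        if length != TLV_HDR_LEN + 4 or offset + length > len(buf):
            raise ValueError("this sketch only handles 32-bit metadata")
        (out[meta_type],) = struct.unpack_from("!I", buf, offset + TLV_HDR_LEN)
        offset += length
    return out


# Sender side: encode mark 17 and prio 33; receiver side: decode them again.
wire = encode_tlv(IFE_META_SKBMARK, 17) + encode_tlv(IFE_META_PRIO, 33)
decoded = decode_tlvs(wire)
```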

    Let's show example usage where we encode icmp from a sender towards
    a receiver with an skbmark of 17; both sender and receiver use
    an ethertype of 0xdead to interop.

    YYYY: Let's start with Receiver-side policy config:
    xxx: add an ingress qdisc
    sudo tc qdisc add dev $ETH ingress

    xxx: any packets with ethertype 0xdead will be subjected to ife decoding
    xxx: we then restart the classification so we can match on icmp at prio 3
    sudo $TC filter add dev $ETH parent ffff: prio 2 protocol 0xdead \
    u32 match u32 0 0 flowid 1:1 \
    action ife decode reclassify

    xxx: on restarting the classification from above if it was an icmp
    xxx: packet, then match it here and continue to the next rule at prio 4
    xxx: which will match based on skb mark of 17
    sudo tc filter add dev $ETH parent ffff: prio 3 protocol ip \
    u32 match ip protocol 1 0xff flowid 1:1 \
    action continue

    xxx: match on skbmark of 0x11 (decimal 17) and accept
    sudo tc filter add dev $ETH parent ffff: prio 4 protocol ip \
    handle 0x11 fw flowid 1:1 \
    action ok

    xxx: Let's show the decoding policy
    sudo tc -s filter ls dev $ETH parent ffff: protocol 0xdead
    xxx:
    filter pref 2 u32
    filter pref 2 u32 fh 800: ht divisor 1
    filter pref 2 u32 fh 800::800 order 2048 key ht 800 bkt 0 flowid 1:1 (rule hit 0 success 0)
    match 00000000/00000000 at 0 (success 0 )
    action order 1: ife decode action reclassify
    index 1 ref 1 bind 1 installed 14 sec used 14 sec
    type: 0x0
    Metadata: allow mark allow hash allow prio allow qmap
    Action statistics:
    Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
    backlog 0b 0p requeues 0
    xxx:
    Observe that the above lists all the metadata it can decode. Typically
    these submodules will already be compiled into a monolithic kernel or
    loaded as modules.
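
    The receiver-side flow (decode at prio 2, restart, match icmp at prio 3,
    accept on mark at prio 4) can be sketched in Python as below. This is a
    deliberately simplified model, not the kernel's classifier: packets are
    dicts, filters are tuples, and the names classify, ife_decode and FILTERS
    are hypothetical.

```python
ETH_P_IP = 0x0800
ETH_P_IFE = 0xDEAD  # ethertype chosen in the example above


def ife_decode(pkt):
    """Model of 'ife decode reclassify': restore ethertype, copy the mark."""
    pkt["ethertype"] = pkt.pop("inner_ethertype")
    pkt["mark"] = pkt.pop("ife_mark")
    return "reclassify"


FILTERS = [
    # (prio, protocol, match predicate, action)
    (2, ETH_P_IFE, lambda p: True, ife_decode),
    (3, ETH_P_IP, lambda p: p["ip_proto"] == 1, lambda p: "continue"),
    (4, ETH_P_IP, lambda p: p["mark"] == 0x11, lambda p: "ok"),
]


def classify(pkt, filters, max_restarts=4):
    """Walk filters in ascending prio; return (prio, verdict) of the final match."""
    ordered = sorted(filters, key=lambda f: f[0])
    for _ in range(max_restarts):
        restart = False
        for prio, protocol, match, action in ordered:
            if pkt["ethertype"] != protocol or not match(pkt):
                continue
            verdict = action(pkt)
            if verdict == "reclassify":
                restart = True  # start again from the lowest prio
                break
            if verdict != "continue":
                return prio, verdict
        if not restart:
            return None, "unmatched"
    return None, "restart limit reached"


def make_pkt(ife_mark):
    """Build a modelled IFE-encapsulated ICMP packet carrying the given mark."""
    return {"ethertype": ETH_P_IFE, "inner_ethertype": ETH_P_IP,
            "ip_proto": 1, "mark": 0, "ife_mark": ife_mark}
```

    In this model a packet carrying mark 17 is accepted at prio 4 after one
    restart, while a packet with any other mark falls through unmatched.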

    YYYY: Let's show the sender side now ..

    xxx: Add an egress qdisc on the sender netdev
    sudo tc qdisc add dev $ETH root handle 1: prio
    xxx:
    xxx: Match all icmp packets to 192.168.122.237/24, then
    xxx: tag the packet with skb mark of decimal 17, then
    xxx: Encode it with:
    xxx: ethertype 0xdead
    xxx: add skb->mark to whitelist of metadata to send
    xxx: rewrite target dst MAC address to 02:15:15:15:15:15
    xxx:
    sudo $TC filter add dev $ETH parent 1: protocol ip prio 10 u32 \
    match ip dst 192.168.122.237/24 \
    match ip protocol 1 0xff \
    flowid 1:2 \
    action skbedit mark 17 \
    action ife encode \
    type 0xDEAD \
    allow mark \
    dst 02:15:15:15:15:15

    xxx: Let's show the encoding policy
    sudo tc -s filter ls dev $ETH parent 1: protocol ip
    xxx:
    filter pref 10 u32
    filter pref 10 u32 fh 800: ht divisor 1
    filter pref 10 u32 fh 800::800 order 2048 key ht 800 bkt 0 flowid 1:2 (rule hit 0 success 0)
    match c0a87aed/ffffffff at 16 (success 0 )
    match 00010000/00ff0000 at 8 (success 0 )

    action order 1: skbedit mark 17
    index 6 ref 1 bind 1
    Action statistics:
    Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
    backlog 0b 0p requeues 0

    action order 2: ife encode action pipe
    index 3 ref 1 bind 1
    dst MAC: 02:15:15:15:15:15 type: 0xDEAD
    Metadata: allow mark
    Action statistics:
    Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
    backlog 0b 0p requeues 0
    xxx:
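
    The two "match" lines in the listing above are the u32 key/mask form of the
    ip matches. The Python sketch below shows how such values can be derived,
    assuming a plain IPv4 header (destination address at byte offset 16,
    protocol at byte 9 inside the 32-bit word at offset 8). The helper names
    are hypothetical.

```python
import ipaddress


def u32_ip_dst(cidr: str):
    """Return ("key/mask", offset) for an IPv4 destination match."""
    iface = ipaddress.ip_interface(cidr)
    mask = int(iface.network.netmask)
    key = int(iface.ip) & mask
    return f"{key:08x}/{mask:08x}", 16


def u32_ip_protocol(proto: int, mask: int = 0xFF):
    """Return ("key/mask", offset) for an IPv4 protocol match."""
    # Word at offset 8 is: TTL (8 bits) | protocol (8 bits) | checksum (16 bits),
    # so the protocol byte sits in bits 23..16 of that word.
    shift = 16
    return f"{proto << shift:08x}/{mask << shift:08x}", 8


host_match = u32_ip_dst("192.168.122.237/32")
net_match = u32_ip_dst("192.168.122.237/24")
icmp_match = u32_ip_protocol(1)
```

    Under these assumptions the listed c0a87aed/ffffffff corresponds to a full
    host match on 192.168.122.237, whereas a /24 prefix would come out as
    c0a87a00/ffffff00.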

    Test by sending a ping from the sender to the destination.

    Signed-off-by: Jamal Hadi Salim
    Acked-by: Cong Wang
    Signed-off-by: David S. Miller

    Jamal Hadi Salim
     

11 Jan, 2016

1 commit

  • This work adds a generalization of the ingress qdisc as a qdisc holding
    only classifiers. The clsact qdisc works on ingress, but also on egress.
    In both cases, its execution happens without taking the qdisc lock, and
    the main difference for the egress part compared to prior version of [1]
    is that this can be applied with _any_ underlying real egress qdisc (also
    classless ones).

    Besides solving the use-case of [1], that is, allowing for more programmability
    on assigning skb->priority for the mqprio case that is supported by most
    popular 10G+ NICs, it also opens up a lot more flexibility for other tc
    applications. The main work on classification can already be done at clsact
    egress time if the use-case allows and state stored for later retrieval
    f.e. again in skb->priority with major/minors (which is checked by most
    classful qdiscs before consulting tc_classify()) and/or in other skb fields
    like skb->tc_index for some light-weight post-processing to get to the
    eventual classid in case of a classful qdisc. Another use case is that
    the clsact egress part allows to have a central egress counterpart to
    the ingress classifiers, so that classifiers can easily share state (e.g.
    in cls_bpf via eBPF maps) for ingress and egress.

    Currently, default setups like mq + pfifo_fast would require for this to
    use, for example, prio qdisc instead (to get a tc_classify() run) and to
    duplicate the egress classifier for each queue. With clsact, it allows
    for leaving the setup as is, it can additionally assign skb->priority to
    put the skb in one of pfifo_fast's bands and it can share state with maps.
    Moreover, we can access the skb's dst entry (f.e. to retrieve tclassid)
    w/o the need to perform a skb_dst_force() to hold on to it any longer. In
    lwt case, we can also use this facility to set up dst metadata via cls_bpf
    (bpf_skb_set_tunnel_key()) without needing a real egress qdisc just for
    that (case of IFF_NO_QUEUE devices, for example).
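
    As a small illustration of the skb->priority to band mapping mentioned
    above, the Python sketch below uses the default priomap that appears in the
    'tc qdisc show' output later in this message. It assumes pfifo_fast indexes
    the map with the low four bits of skb->priority; the function name is
    hypothetical.

```python
# Default pfifo_fast priomap, as printed by 'tc qdisc show'.
PRIOMAP = [1, 2, 2, 2, 1, 2, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1]
TC_PRIO_MAX = 15


def pfifo_fast_band(skb_priority: int) -> int:
    """Return the band (0 is dequeued first) for a given skb->priority."""
    return PRIOMAP[skb_priority & TC_PRIO_MAX]


# Priorities that a clsact egress classifier could assign to reach each band.
bands = {prio: pfifo_fast_band(prio) for prio in (0, 1, 6)}
```

    With this map, assigning priority 6 at clsact egress would place the skb in
    band 0, priority 0 in band 1, and priority 1 in band 2.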

    The realization can be done without any changes to the scheduler core
    framework. All it takes is that we have two a-priori defined minors/child
    classes, where we can mux between ingress and egress classifier list
    (dev->ingress_cl_list and dev->egress_cl_list, latter stored close to
    dev->_tx to avoid extra cacheline miss for moderate loads). The egress
    part is modelled somewhat similarly to handle_ing() and patched to a noop
    in case the functionality is not used. Both handlers are now called
    sch_handle_ingress() and sch_handle_egress(); code sharing among the two
    doesn't seem practical as there are various minor differences in both
    paths, so that making them conditional in a single handler would rather
    slow things down.

    Full compatibility to ingress qdisc is provided as well. Since both
    piggyback on TC_H_CLSACT, only one of them (ingress/clsact) can exist
    per netdevice, and thus ingress qdisc specific behaviour can be retained
    for user space. This means, either a user does 'tc qdisc add dev foo ingress'
    and configures ingress qdisc as usual, or the 'tc qdisc add dev foo clsact'
    alternative, where both, ingress and egress classifier can be configured
    as in the below example. ingress qdisc supports attaching classifier to any
    minor number whereas clsact has two fixed minors for muxing between the
    lists, therefore to not break user space setups, they are better done as
    two separate qdiscs.

    I decided to extend the sch_ingress module with clsact functionality so
    that commonly used code can be reused, the module is being aliased with
    sch_clsact so that it can be auto-loaded properly. Alternative would have been
    to add a flag when initializing ingress to alter its behaviour plus aliasing
    to a different name (as it's more than just ingress). However, the first would
    end up, based on the flag, choosing the new/old behaviour by calling different
    function implementations to handle each anyway, the latter would require to
    register ingress qdisc once again under different alias. So, this really begs
    to provide a minimal, cleaner approach to have Qdisc_ops and Qdisc_class_ops
    by its own that share callbacks used by both.

    Example, adding qdisc:

    # tc qdisc add dev foo clsact
    # tc qdisc show dev foo
    qdisc mq 0: root
    qdisc pfifo_fast 0: parent :1 bands 3 priomap 1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
    qdisc pfifo_fast 0: parent :2 bands 3 priomap 1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
    qdisc pfifo_fast 0: parent :3 bands 3 priomap 1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
    qdisc pfifo_fast 0: parent :4 bands 3 priomap 1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
    qdisc clsact ffff: parent ffff:fff1

    Adding filters (deleting, etc. work analogously by specifying ingress/egress):

    # tc filter add dev foo ingress bpf da obj bar.o sec ingress
    # tc filter add dev foo egress bpf da obj bar.o sec egress
    # tc filter show dev foo ingress
    filter protocol all pref 49152 bpf
    filter protocol all pref 49152 bpf handle 0x1 bar.o:[ingress] direct-action
    # tc filter show dev foo egress
    filter protocol all pref 49152 bpf
    filter protocol all pref 49152 bpf handle 0x1 bar.o:[egress] direct-action

    A 'tc filter show dev foo' or 'tc filter show dev foo parent ffff:' will
    show an empty list for clsact. Either using the parent names (ingress/egress)
    or specifying the full major/minor will then show the related filter lists.
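
    The relationship between the parent names and the full major/minor handles
    can be sketched in Python as below. The constant values are written as I
    understand them from the uapi pkt_sched.h header and should be treated as
    assumptions; the helper names are hypothetical.

```python
TC_H_MAJ_MASK = 0xFFFF0000
TC_H_MIN_MASK = 0x0000FFFF

TC_H_CLSACT = 0xFFFFFFF1       # assumed: same value as TC_H_INGRESS
TC_H_MIN_INGRESS = 0xFFF2      # assumed minor for the ingress list
TC_H_MIN_EGRESS = 0xFFF3       # assumed minor for the egress list


def tc_h_make(major: int, minor: int) -> int:
    """Combine the major part of one handle with the minor part of another."""
    return (major & TC_H_MAJ_MASK) | (minor & TC_H_MIN_MASK)


def fmt_handle(handle: int) -> str:
    """Render a 32-bit handle in tc's major:minor hex notation."""
    return f"{handle >> 16:x}:{handle & TC_H_MIN_MASK:x}"


PARENTS = {
    "ingress": fmt_handle(tc_h_make(TC_H_CLSACT, TC_H_MIN_INGRESS)),
    "egress": fmt_handle(tc_h_make(TC_H_CLSACT, TC_H_MIN_EGRESS)),
}
```

    Under these assumptions 'ingress' and 'egress' stand for parents ffff:fff2
    and ffff:fff3, while ffff:fff1 is the qdisc's own parent as shown in the
    'tc qdisc show' output.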

    Prior work on a mqprio prequeue() facility [1] was done mainly by John Fastabend.

    [1] http://patchwork.ozlabs.org/patch/512949/

    Signed-off-by: Daniel Borkmann
    Acked-by: John Fastabend
    Signed-off-by: David S. Miller

    Daniel Borkmann
     

14 May, 2015

2 commits

  • This new config switch enables the ingress filtering infrastructure that is
    controlled through the ingress_needed static key. This prepares the
    introduction of the Netfilter ingress hook that resides under this unique
    static key.

    Note that CONFIG_SCH_INGRESS automatically selects this, that should be no
    problem since this also depends on CONFIG_NET_CLS_ACT.

    Signed-off-by: Pablo Neira Ayuso
    Acked-by: Alexei Starovoitov
    Signed-off-by: David S. Miller

    Pablo Neira
     
  • This patch introduces a flow-based filter. So far, the very essential
    packet fields are supported.

    This patch is only the first step. There are many potential performance
    improvements still to implement. Also, a lot of features are missing
    now. They will be addressed in follow-up patches.

    Signed-off-by: Jiri Pirko
    Acked-by: Jamal Hadi Salim
    Signed-off-by: David S. Miller

    Jiri Pirko
     

20 Feb, 2015

1 commit

  • Pull kconfig updates from Michal Marek:
    "Yann E Morin was supposed to take over kconfig maintainership, but
    this hasn't happened. So I'm sending a few kconfig patches that I
    collected:

    - Fix for missing va_end in kconfig
    - merge_config.sh displays usage if given too few arguments
    - s/boolean/bool/ in Kconfig files for consistency, with the plan to
    only support bool in the future"

    * 'kconfig' of git://git.kernel.org/pub/scm/linux/kernel/git/mmarek/kbuild:
    kconfig: use va_end to match corresponding va_start
    merge_config.sh: Display usage if given too few arguments
    kconfig: use bool instead of boolean for type definition attributes

    Linus Torvalds