07 Jan, 2014

1 commit

  • Proportional Integral controller Enhanced (PIE) is a scheduler to address the
    bufferbloat problem.

    From the IETF draft below:
    " Bufferbloat is a phenomenon where excess buffers in the network cause high
    latency and jitter. As more and more interactive applications (e.g. voice over
    IP, real time video streaming and financial transactions) run in the Internet,
    high latency and jitter degrade application performance. There is a pressing
    need to design intelligent queue management schemes that can control latency and
    jitter; and hence provide desirable quality of service to users.

    We present here a lightweight design, PIE (Proportional Integral controller
    Enhanced), that can effectively control the average queueing latency to a target
    value. Simulation results, theoretical analysis and Linux testbed results have
    shown that PIE can ensure low latency and achieve high link utilization under
    various congestion situations. The design does not require per-packet
    timestamp, so it incurs very small overhead and is simple enough to implement
    in both hardware and software. "

    Many thanks to Dave Taht for extensive feedback, reviews, testing and
    suggestions. Thanks also to Stephen Hemminger and Eric Dumazet for reviews and
    suggestions. Naeem Khademi and Dave Taht independently contributed to ECN
    support.

    For more information, please see the technical paper about PIE presented at
    the IEEE Conference on High Performance Switching and Routing 2013. A copy
    of the paper can be found at ftp://ftpeng.cisco.com/pie/.

    Please also refer to the IETF draft submission at
    http://tools.ietf.org/html/draft-pan-tsvwg-pie-00

    All relevant code, documents and test scripts and results can be found at
    ftp://ftpeng.cisco.com/pie/.
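
    For illustration, attaching and inspecting such a qdisc could look like
    this (a hedged sketch: the device and values are illustrative, and the
    option names assume the matching iproute2 tc-pie support):

    # eth0 and the limit/timing values below are illustrative only
    tc qdisc add dev eth0 root pie limit 1000 target 20ms tupdate 30ms ecn
    tc -s qdisc show dev eth0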

    For problems with the iproute2/tc or Linux kernel code, please contact Vijay
    Subramanian (vijaynsu@cisco.com or subramanian.vijay@gmail.com) or Mythili
    Prabhu (mysuryan@cisco.com).

    Signed-off-by: Vijay Subramanian
    Signed-off-by: Mythili Prabhu
    CC: Dave Taht
    Signed-off-by: David S. Miller

    Vijay Subramanian
     

06 Jan, 2014

1 commit

  • Pablo Neira Ayuso says:

    ====================
    netfilter/IPVS updates for net-next

    The following patchset contains Netfilter updates for your net-next tree,
    they are:

    * Add full port randomization support. Some crazy researchers found a way
    to reconstruct the secure ephemeral ports that are allocated in random mode
    by sending off-path bursts of UDP packets to overrun the socket buffer of
    the DNS resolver and trigger retransmissions. If the timing of the DNS
    resolution done by a client is then longer than usual, they conclude
    that the port that received the burst of UDP packets is the one that was
    opened. It seems a rather aggressive method to me, but it seems to work for
    them. As a result, Daniel Borkmann and Hannes Frederic Sowa came up with a
    new NAT mode to fully randomize ports using prandom.

    * Add a new classifier to x_tables based on the socket net_cls set via
    cgroups. This includes two patches to prepare the ground, as requested by
    Zefan Li. Also from Daniel Borkmann.

    * Use prandom instead of get_random_bytes in several locations of the
    netfilter code, from Florian Westphal.

    * Allow use of CTA_MARK_MASK in ctnetlink when mangling the conntrack
    mark, also from Florian Westphal.

    * Fix compilation warning due to unused variable in IPVS, from Geert
    Uytterhoeven.

    * Add support for UID/GID via nfnetlink_queue, from Valentina Giusti.

    * Add IPComp extension to x_tables, from Fan Du.
    ====================

    Signed-off-by: David S. Miller

    David S. Miller
     

04 Jan, 2014

1 commit

  • Zefan Li requested [1] to perform the following cleanup/refactoring:

    - Split cgroupfs classid handling into net core to better express a
    possible more generic use.

    - Disable module support for the cgroupfs bits, as the majority of other
    cgroupfs subsystems do not have it and it seems to be unwanted on the
    cgroup side. Zefan might want to follow up for netprio
    later on.

    - This also allows removing code that previously took care of
    functionality built when compiled as a module.

    cgroupfs bits are being placed under net/core/netclassid_cgroup.c, so
    that we are consistent with {netclassid,netprio}_cgroup naming that is
    under net/core/ as suggested by Zefan.

    There is no change in functionality; only code refactoring is being
    done here.

    [1] http://patchwork.ozlabs.org/patch/304825/
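
    For reference, the cgroupfs classid handling could be exercised along
    these lines (a hedged sketch: paths, names and the classid value are
    illustrative):

    # create a cgroup and tag its traffic with major:minor 10:1 (0x100001)
    mkdir /sys/fs/cgroup/net_cls/bulk
    echo 0x100001 > /sys/fs/cgroup/net_cls/bulk/net_cls.classid
    echo $$ > /sys/fs/cgroup/net_cls/bulk/tasks
    # let the cgroup classifier map the classid to a tc class
    tc filter add dev eth0 parent 10: protocol ip handle 1: cgroup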

    Suggested-by: Li Zefan
    Signed-off-by: Daniel Borkmann
    Cc: Zefan Li
    Cc: Thomas Graf
    Cc: cgroups@vger.kernel.org
    Acked-by: Li Zefan
    Signed-off-by: Pablo Neira Ayuso

    Daniel Borkmann
     

20 Dec, 2013

1 commit

  • This patch implements the first size-based qdisc that attempts to
    differentiate between small flows and heavy-hitters. The goal is to
    catch the heavy-hitters and move them to a separate queue with less
    priority so that bulk traffic does not affect the latency of critical
    traffic. Currently "less priority" means less weight (2:1 in
    particular) in a Weighted Deficit Round Robin (WDRR) scheduler.

    In essence, this patch addresses the "delay-bloat" problem due to
    bloated buffers. In some systems, large queues may be necessary for
    obtaining CPU efficiency, or due to the presence of unresponsive
    traffic like UDP, or just a large number of connections with each
    having a small amount of outstanding traffic. In these circumstances,
    HHF aims to reduce the HoL blocking for latency sensitive traffic,
    while not impacting the queues built up by bulk traffic. HHF can also
    be used in conjunction with other AQM mechanisms such as CoDel.

    To capture heavy-hitters, we implement the "multi-stage filter" design
    in the following paper:
    C. Estan and G. Varghese, "New Directions in Traffic Measurement and
    Accounting", in ACM SIGCOMM, 2002.

    Some configurable qdisc settings through 'tc':
    - hhf_reset_timeout: period to reset counter values in the multi-stage
    filter (default 40ms)
    - hhf_admit_bytes: threshold to classify heavy-hitters
    (default 128KB)
    - hhf_evict_timeout: threshold to evict idle heavy-hitters
    (default 1s)
    - hhf_non_hh_weight: Weighted Deficit Round Robin (WDRR) weight for
    non-heavy-hitters (default 2)
    - hh_flows_limit: max number of heavy-hitter flow entries
    (default 2048)

    Note that the ratio between hhf_admit_bytes and hhf_reset_timeout
    reflects the bandwidth of heavy-hitters that we attempt to capture
    (25Mbps with the above default settings).
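
    As an illustration, heavy-hitters above roughly 12.5Mbps could be
    targeted by halving the admit threshold (a hedged sketch: eth0 is
    illustrative, and the option spellings assume the matching iproute2
    tc support for hhf):

    # reset_timeout/admit_bytes/evict_timeout mirror the settings above
    tc qdisc add dev eth0 root hhf reset_timeout 40ms admit_bytes 64kb \
        evict_timeout 1s non_hh_weight 2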

    The false negative rate (heavy-hitter flows getting away unclassified)
    is zero by the design of the multi-stage filter algorithm.
    With 100 heavy-hitter flows, using four hashes and 4000 counters yields
    a false positive rate (non-heavy-hitters mistakenly classified as
    heavy-hitters) of less than 1e-4.

    Signed-off-by: Terry Lam
    Acked-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Terry Lam
     

30 Oct, 2013

1 commit

  • This work contains a lightweight BPF-based traffic classifier that can
    serve as a flexible alternative to ematch-based tree classification,
    especially now that the BPF filter engine can also be JITed in the
    kernel. Naturally, tc actions and policies are supported as well with
    cls_bpf. Multiple BPF programs/filters can be attached to a class, or
    they can just as well be written within a single BPF program; it is
    really up to users how they wish to run/optimize the code, e.g. also
    for inversion of verdicts etc.
    The notion of a BPF program's return/exit codes is being kept as follows:

    0: No match
    -1: Select classid given in "tc filter ..." command
    else: flowid, overwrite the default one

    As a minimal usage example with iproute2, we use a 3-band prio root qdisc
    on a router with an sfq leaf each, and assign ssh and icmp bpf-based
    filters to band 1, http traffic to band 2 and the rest to band 3. For the
    first two bands we load the bytecode from files; for the third we load it
    inline as an example:

    echo 1 > /proc/sys/net/core/bpf_jit_enable

    tc qdisc del dev em1 root
    tc qdisc add dev em1 root handle 1: prio bands 3 priomap 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1

    tc qdisc add dev em1 parent 1:1 sfq perturb 16
    tc qdisc add dev em1 parent 1:2 sfq perturb 16
    tc qdisc add dev em1 parent 1:3 sfq perturb 16

    tc filter add dev em1 parent 1: bpf run bytecode-file /etc/tc/ssh.bpf flowid 1:1
    tc filter add dev em1 parent 1: bpf run bytecode-file /etc/tc/icmp.bpf flowid 1:1
    tc filter add dev em1 parent 1: bpf run bytecode-file /etc/tc/http.bpf flowid 1:2
    tc filter add dev em1 parent 1: bpf run bytecode "`bpfc -f tc -i misc.ops`" flowid 1:3

    BPF programs can be easily created and passed to tc, either as inline
    'bytecode' or 'bytecode-file'. There are a couple of front-ends that can
    compile opcodes, for example:

    1) People familiar with tcpdump-like filters:

    tcpdump -i em1 -ddd port 22 | tr '\n' ',' > /etc/tc/ssh.bpf

    2) People that want to low-level program their filters or use BPF
    extensions that lack support by libpcap's compiler:

    bpfc -f tc -i ssh.ops > /etc/tc/ssh.bpf

    ssh.ops example code:
    ldh [12]
    jne #0x800, drop
    ldb [23]
    jneq #6, drop
    ldh [20]
    jset #0x1fff, drop
    ldxb 4 * ([14] & 0xf)
    ldh [%x + 14]
    jeq #0x16, pass
    ldh [%x + 16]
    jne #0x16, drop
    pass: ret #-1
    drop: ret #0

    Loading bytecode into tc was chosen because the reverse operation,
    tc filter list dev em1, is then able to show the exact commands again.
    Possible follow-up work could also include a small expression compiler
    for iproute2. Tested with the help of bmon. This idea came up during
    the Netfilter Workshop 2013 in Copenhagen. Also thanks to feedback from
    Eric Dumazet!

    Signed-off-by: Daniel Borkmann
    Cc: Thomas Graf
    Signed-off-by: David S. Miller

    Daniel Borkmann
     

30 Aug, 2013

1 commit

  • - Uses perfect flow match (not stochastic hash like SFQ/FQ_codel)
    - Uses the new_flow/old_flow separation from FQ_codel
    - New flows get an initial credit allowing IW10 without added delay.
    - Special FIFO queue for high prio packets (no need for PRIO + FQ)
    - Uses a hash table of RB trees to locate the flows at enqueue() time
    - Smart on demand gc (at enqueue() time, RB tree lookup evicts old
    unused flows)
    - Dynamic memory allocations.
    - Designed to allow millions of concurrent flows per Qdisc.
    - Small memory footprint : ~8K per Qdisc, and 104 bytes per flow.
    - Single high resolution timer for throttled flows (if any).
    - One RB tree to link throttled flows.
    - Ability to have a max rate per flow. We might add a socket option
    to add a per-socket limitation.

    Attempts have been made to add TCP pacing in TCP stack, but this
    seems to add complex code to an already complex stack.

    TCP pacing is welcome for flows having idle times, as the cwnd
    permits the TCP stack to queue a possibly large number of packets.

    This removes the 'slow start after idle' choice, which badly hits
    large-BDP flows and applications delivering chunks of data
    such as video streams.

    Nicely spaced packets :
    Here the interface is 10Gbit, but the flow bottleneck is ~20Mbit

    cwnd is big, yet FQ avoids the typical bursts generated by TCP
    (as in netperf TCP_RR -- -r 100000,100000)

    15:01:23.545279 IP A > B: . 78193:81089(2896) ack 65248 win 3125
    15:01:23.545394 IP B > A: . ack 81089 win 3668
    15:01:23.546488 IP A > B: . 81089:83985(2896) ack 65248 win 3125
    15:01:23.546565 IP B > A: . ack 83985 win 3668
    15:01:23.547713 IP A > B: . 83985:86881(2896) ack 65248 win 3125
    15:01:23.547778 IP B > A: . ack 86881 win 3668
    15:01:23.548911 IP A > B: . 86881:89777(2896) ack 65248 win 3125
    15:01:23.548949 IP B > A: . ack 89777 win 3668
    15:01:23.550116 IP A > B: . 89777:92673(2896) ack 65248 win 3125
    15:01:23.550182 IP B > A: . ack 92673 win 3668
    15:01:23.551333 IP A > B: . 92673:95569(2896) ack 65248 win 3125
    15:01:23.551406 IP B > A: . ack 95569 win 3668
    15:01:23.552539 IP A > B: . 95569:98465(2896) ack 65248 win 3125
    15:01:23.552576 IP B > A: . ack 98465 win 3668
    15:01:23.553756 IP A > B: . 98465:99913(1448) ack 65248 win 3125
    15:01:23.554138 IP A > B: P 99913:100001(88) ack 65248 win 3125
    15:01:23.554204 IP B > A: . ack 100001 win 3668
    15:01:23.554234 IP B > A: . 65248:68144(2896) ack 100001 win 3668
    15:01:23.555620 IP B > A: . 68144:71040(2896) ack 100001 win 3668
    15:01:23.557005 IP B > A: . 71040:73936(2896) ack 100001 win 3668
    15:01:23.558390 IP B > A: . 73936:76832(2896) ack 100001 win 3668
    15:01:23.559773 IP B > A: . 76832:79728(2896) ack 100001 win 3668
    15:01:23.561158 IP B > A: . 79728:82624(2896) ack 100001 win 3668
    15:01:23.562543 IP B > A: . 82624:85520(2896) ack 100001 win 3668
    15:01:23.563928 IP B > A: . 85520:88416(2896) ack 100001 win 3668
    15:01:23.565313 IP B > A: . 88416:91312(2896) ack 100001 win 3668
    15:01:23.566698 IP B > A: . 91312:94208(2896) ack 100001 win 3668
    15:01:23.568083 IP B > A: . 94208:97104(2896) ack 100001 win 3668
    15:01:23.569467 IP B > A: . 97104:100000(2896) ack 100001 win 3668
    15:01:23.570852 IP B > A: . 100000:102896(2896) ack 100001 win 3668
    15:01:23.572237 IP B > A: . 102896:105792(2896) ack 100001 win 3668
    15:01:23.573639 IP B > A: . 105792:108688(2896) ack 100001 win 3668
    15:01:23.575024 IP B > A: . 108688:111584(2896) ack 100001 win 3668
    15:01:23.576408 IP B > A: . 111584:114480(2896) ack 100001 win 3668
    15:01:23.577793 IP B > A: . 114480:117376(2896) ack 100001 win 3668

    TCP timestamps show that most packets from B were queued in the same ms
    timeframe (TSval 1159799{3,4}), but FQ managed to send them right
    in time to avoid a big burst.

    In slow start or steady state, very few packets are throttled [1]

    FQ gets a bunch of tunables :

    limit : max number of packets on whole Qdisc (default 10000)

    flow_limit : max number of packets per flow (default 100)

    quantum : the credit per RR round (default is 2 MTU)

    initial_quantum : initial credit for new flows (default is 10 MTU)

    maxrate : max per flow rate (default : unlimited)

    buckets : number of RB trees (default : 1024) in hash table.
    (consumes 8 bytes per bucket)

    [no]pacing : disable/enable pacing (default is enabled)

    All of them can be changed on a live qdisc.

    $ tc qd add dev eth0 root fq help
    Usage: ... fq [ limit PACKETS ] [ flow_limit PACKETS ]
    [ quantum BYTES ] [ initial_quantum BYTES ]
    [ maxrate RATE ] [ buckets NUMBER ]
    [ [no]pacing ]

    $ tc -s -d qd
    qdisc fq 8002: dev eth0 root refcnt 32 limit 10000p flow_limit 100p buckets 256 quantum 3028 initial_quantum 15140
    Sent 216532416 bytes 148395 pkt (dropped 0, overlimits 0 requeues 14)
    backlog 0b 0p requeues 14
    511 flows, 511 inactive, 0 throttled
    110 gc, 0 highprio, 0 retrans, 1143 throttled, 0 flows_plimit

    [1] Except if the initial srtt is overestimated, e.g. when using a
    cached srtt from tcp metrics. We'll provide a fix for this issue.

    Signed-off-by: Eric Dumazet
    Cc: Yuchung Cheng
    Cc: Neal Cardwell
    Signed-off-by: David S. Miller

    Eric Dumazet
     

12 Jul, 2012

1 commit

  • Can be used to match packets against netfilter ip sets created via ipset(8).
    skb->sk_iif is used as 'incoming interface', skb->dev is 'outgoing interface'.

    Since ipset is usually called from netfilter, the ematch
    initializes a fake xt_action_param, pulls the ip header into the
    linear area and also sets skb->data to the IP header (otherwise
    matching Layer 4 set types doesn't work).
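
    A hedged usage sketch (set name, device and classid are illustrative,
    and the syntax assumes the matching iproute2 em_ipset support):

    # match sources listed in an existing ipset named 'blacklist'
    ipset create blacklist hash:ip
    tc filter add dev eth0 parent 1: protocol ip basic \
        match 'ipset(blacklist src)' classid 1:10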

    Tested-by: Mr Dash Four
    Signed-off-by: Florian Westphal
    Signed-off-by: David S. Miller

    Florian Westphal
     

13 May, 2012

1 commit

  • Fair Queue Codel packet scheduler

    Principles :

    - Packets are classified on flows (by the internal classifier, or an
    external one).
    - This is a stochastic model (as we use a hash, several flows might be
    hashed to the same slot).
    - Each flow has a CoDel managed queue.
    - Flows are linked onto two (Round Robin) lists,
    so that new flows have priority on old ones.

    - For a given flow, packets are not reordered (CoDel uses a FIFO)
    - head drops only.
    - ECN capability is on by default.
    - Very low memory footprint (64 bytes per flow)

    tc qdisc ... fq_codel [ limit PACKETS ] [ flows number ]
    [ target TIME ] [ interval TIME ] [ noecn ]
    [ quantum BYTES ]

    defaults : 1024 flows, 10240 packets limit, quantum : device MTU
    target : 5ms (CoDel default)
    interval : 100ms (CoDel default)

    Impressive results on load :

    class htb 1:1 root leaf 10: prio 0 quantum 1514 rate 200000Kbit ceil 200000Kbit burst 1475b/8 mpu 0b overhead 0b cburst 1475b/8 mpu 0b overhead 0b level 0
    Sent 43304920109 bytes 33063109 pkt (dropped 0, overlimits 0 requeues 0)
    rate 201691Kbit 28595pps backlog 0b 312p requeues 0
    lended: 33063109 borrowed: 0 giants: 0
    tokens: -912 ctokens: -912

    class fq_codel 10:1735 parent 10:
    (dropped 1292, overlimits 0 requeues 0)
    backlog 15140b 10p requeues 0
    deficit 1514 count 1 lastcount 1 ldelay 7.1ms
    class fq_codel 10:4524 parent 10:
    (dropped 1291, overlimits 0 requeues 0)
    backlog 16654b 11p requeues 0
    deficit 1514 count 1 lastcount 1 ldelay 7.1ms
    class fq_codel 10:4e74 parent 10:
    (dropped 1290, overlimits 0 requeues 0)
    backlog 6056b 4p requeues 0
    deficit 1514 count 1 lastcount 1 ldelay 6.4ms dropping drop_next 92.0ms
    class fq_codel 10:628a parent 10:
    (dropped 1289, overlimits 0 requeues 0)
    backlog 7570b 5p requeues 0
    deficit 1514 count 1 lastcount 1 ldelay 5.4ms dropping drop_next 90.9ms
    class fq_codel 10:a4b3 parent 10:
    (dropped 302, overlimits 0 requeues 0)
    backlog 16654b 11p requeues 0
    deficit 1514 count 1 lastcount 1 ldelay 7.1ms
    class fq_codel 10:c3c2 parent 10:
    (dropped 1284, overlimits 0 requeues 0)
    backlog 13626b 9p requeues 0
    deficit 1514 count 1 lastcount 1 ldelay 5.9ms
    class fq_codel 10:d331 parent 10:
    (dropped 299, overlimits 0 requeues 0)
    backlog 15140b 10p requeues 0
    deficit 1514 count 1 lastcount 1 ldelay 7.0ms
    class fq_codel 10:d526 parent 10:
    (dropped 12160, overlimits 0 requeues 0)
    backlog 35870b 211p requeues 0
    deficit 1508 count 12160 lastcount 1 ldelay 15.3ms dropping drop_next 247us
    class fq_codel 10:e2c6 parent 10:
    (dropped 1288, overlimits 0 requeues 0)
    backlog 15140b 10p requeues 0
    deficit 1514 count 1 lastcount 1 ldelay 7.1ms
    class fq_codel 10:eab5 parent 10:
    (dropped 1285, overlimits 0 requeues 0)
    backlog 16654b 11p requeues 0
    deficit 1514 count 1 lastcount 1 ldelay 5.9ms
    class fq_codel 10:f220 parent 10:
    (dropped 1289, overlimits 0 requeues 0)
    backlog 15140b 10p requeues 0
    deficit 1514 count 1 lastcount 1 ldelay 7.1ms

    qdisc htb 1: root refcnt 6 r2q 10 default 1 direct_packets_stat 0 ver 3.17
    Sent 43331086547 bytes 33092812 pkt (dropped 0, overlimits 66063544 requeues 71)
    rate 201697Kbit 28602pps backlog 0b 260p requeues 71
    qdisc fq_codel 10: parent 1:1 limit 10240p flows 65536 target 5.0ms interval 100.0ms ecn
    Sent 43331086547 bytes 33092812 pkt (dropped 949359, overlimits 0 requeues 0)
    rate 201697Kbit 28602pps backlog 189352b 260p requeues 0
    maxpacket 1514 drop_overlimit 0 new_flow_count 5582 ecn_mark 125593
    new_flows_len 0 old_flows_len 11

    PING 172.30.42.18 (172.30.42.18) 56(84) bytes of data.
    64 bytes from 172.30.42.18: icmp_req=1 ttl=64 time=0.227 ms
    64 bytes from 172.30.42.18: icmp_req=2 ttl=64 time=0.165 ms
    64 bytes from 172.30.42.18: icmp_req=3 ttl=64 time=0.166 ms
    64 bytes from 172.30.42.18: icmp_req=4 ttl=64 time=0.151 ms
    64 bytes from 172.30.42.18: icmp_req=5 ttl=64 time=0.164 ms
    64 bytes from 172.30.42.18: icmp_req=6 ttl=64 time=0.172 ms
    64 bytes from 172.30.42.18: icmp_req=7 ttl=64 time=0.175 ms
    64 bytes from 172.30.42.18: icmp_req=8 ttl=64 time=0.183 ms
    64 bytes from 172.30.42.18: icmp_req=9 ttl=64 time=0.158 ms
    64 bytes from 172.30.42.18: icmp_req=10 ttl=64 time=0.200 ms

    10 packets transmitted, 10 received, 0% packet loss, time 8999ms
    rtt min/avg/max/mdev = 0.151/0.176/0.227/0.022 ms

    Much better than SFQ because of the priority given to new flows, and a
    fast path dirtying fewer cache lines.

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

11 May, 2012

1 commit

  • An implementation of CoDel AQM, from Kathleen Nichols and Van Jacobson.

    http://queue.acm.org/detail.cfm?id=2209336

    The main input of this AQM is no longer the queue size in bytes or
    packets, but the time packets spend in the (FIFO) queue.

    As we don't have infinite memory, we still can drop packets in enqueue()
    in case of massive load, but the essence of CoDel is to drop packets in
    dequeue(), using a control law based on two simple parameters :

    target : target sojourn time (default 5ms)
    interval : width of moving time window (default 100ms)

    Based on initial work from Dave Taht.

    Refactored to ease future inclusion of codel as a plugin for other Linux
    qdiscs (FQ_CODEL, ...), like RED.

    include/net/codel.h contains the codel algorithm, kept as close as
    possible to Kathleen's reference.

    net/sched/sch_codel.c contains the linux qdisc specific glue.

    Separate structures permit a memory-efficient implementation of fq_codel
    (to be sent as separate work) : each flow has its own struct
    codel_vars.

    Timestamps are taken at enqueue() time with 1024 ns precision, allowing
    a range of 2199 seconds in queue and support for 100Gb links. iproute2
    uses usec as its base unit.

    Selected packets are dropped, unless ECN is enabled, in which case
    packets get an ECN mark instead.

    Tested from 2Mb to 10Gb speeds with no particular problems, on ixgbe and
    tg3 drivers (BQL enabled).

    Usage: tc qdisc ... codel [ limit PACKETS ] [ target TIME ]
    [ interval TIME ] [ ecn ]

    qdisc codel 10: parent 1:1 limit 2000p target 3.0ms interval 60.0ms ecn
    Sent 13347099587 bytes 8815805 pkt (dropped 0, overlimits 0 requeues 0)
    rate 202365Kbit 16708pps backlog 113550b 75p requeues 0
    count 116 lastcount 98 ldelay 4.3ms dropping drop_next 816us
    maxpacket 1514 ecn_mark 84399 drop_overlimit 0

    CoDel must be seen as a base module, and should be used keeping in mind
    there is still a FIFO queue. So a typical setup will probably need a
    hierarchy of several qdiscs and packet classifiers to be able to meet
    whatever constraints a user might have.

    One possible example would be to use fq_codel, which combines Fair
    Queueing and CoDel, in replacement of sfq / sfq_red.

    Signed-off-by: Eric Dumazet
    Signed-off-by: Dave Taht
    Cc: Kathleen Nichols
    Cc: Van Jacobson
    Cc: Tom Herbert
    Cc: Matt Mathis
    Cc: Yuchung Cheng
    Cc: Stephen Hemminger
    Signed-off-by: David S. Miller

    Eric Dumazet
     

08 Feb, 2012

1 commit

  • The qdisc supports two operations - plug and unplug. When the
    qdisc receives a plug command via netlink request, packets arriving
    henceforth are buffered until a corresponding unplug command is received.
    Depending on the type of unplug command, the queue can be unplugged
    indefinitely or selectively.

    This qdisc can be used to implement output buffering, an essential
    functionality required for consistent recovery in checkpoint based
    fault-tolerance systems. Output buffering enables speculative execution
    by allowing generated network traffic to be rolled back. It is used to
    provide network protection for Xen Guests in the Remus high availability
    project, available as part of Xen.

    This module is generic enough to be used by any other system that wishes
    to add speculative execution and output buffering to its applications.

    This module was originally available in the linux 2.6.32 PV-OPS tree,
    used as dom0 for Xen.

    For more information, please refer to http://nss.cs.ubc.ca/remus/
    and http://wiki.xensource.com/xenwiki/Remus

    Changes in V3:
    * Removed debug output (printk) on queue overflow
    * Added TCQ_PLUG_RELEASE_INDEFINITE - allows the user to
    use this qdisc for simple plug/unplug operations.
    * Use of packet counts instead of pointers to keep track of
    the buffers in the queue.
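
    A hedged sketch of driving the qdisc from tc (the device and byte limit
    are illustrative; the keywords assume the matching iproute2 tc-plug
    support mapping to the TCQ_PLUG_* commands above):

    tc qdisc add dev eth0 root plug limit 32768
    tc qdisc change dev eth0 root plug block               # plug: buffer packets
    tc qdisc change dev eth0 root plug release_indefinite  # unplug everything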

    Signed-off-by: Shriram Rajagopalan
    Signed-off-by: Brendan Cully
    [author of the code in the linux 2.6.32 pvops tree]
    Signed-off-by: David S. Miller

    Shriram Rajagopalan
     

20 May, 2011

1 commit

  • IP_ROUTE_CLASSID depends on INET and NET_CLS_ROUTE4 selects
    IP_ROUTE_CLASSID, but when INET is not enabled, this kconfig warning
    is produced, so fix it by making NET_CLS_ROUTE4 depend on INET.

    warning: (NET_CLS_ROUTE4) selects IP_ROUTE_CLASSID which has unmet direct dependencies (NET && INET)

    Signed-off-by: Randy Dunlap
    Signed-off-by: David S. Miller

    Randy Dunlap
     

05 Apr, 2011

1 commit

  • This is an implementation of the Quick Fair Queue scheduler developed
    by Fabio Checconi. The same algorithm is already implemented in ipfw
    in FreeBSD. Fabio had an earlier version developed on Linux; I just
    cleaned it up. Thanks to Eric Dumazet for testing this under load.

    Signed-off-by: Stephen Hemminger
    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    stephen hemminger
     

24 Feb, 2011

1 commit

  • This is the Stochastic Fair Blue scheduler, based on work from :

    W. Feng, D. Kandlur, D. Saha, K. Shin. Blue: A New Class of Active Queue
    Management Algorithms. U. Michigan CSE-TR-387-99, April 1999.

    http://www.thefengs.com/wuchang/blue/CSE-TR-387-99.pdf

    This implementation is based on work done by Juliusz Chroboczek

    General SFB algorithm can be found in figure 14, page 15:

    B[l][n] : L x N array of bins (L levels, N bins per level)

    enqueue()
      Calculate hash function values h{0}, h{1}, .. h{L-1}
      Update bins at each level
      for i = 0 to L - 1
        if (B[i][h{i}].qlen > bin_size)
          B[i][h{i}].p_mark += p_increment;
        else if (B[i][h{i}].qlen == 0)
          B[i][h{i}].p_mark -= p_decrement;
      p_min = min(B[0][h{0}].p_mark ... B[L-1][h{L-1}].p_mark);
      if (p_min == 1.0)
        ratelimit();
      else
        mark/drop with probability p_min;

    I adapted Juliusz's code to meet current kernel standards and made
    various changes to address previous comments :

    http://thread.gmane.org/gmane.linux.network/90225
    http://thread.gmane.org/gmane.linux.network/90375

    Default flow classifier is the rxhash introduced by RPS in 2.6.35, but
    we can use an external flow classifier if wanted.

    tc qdisc add dev $DEV parent 1:11 handle 11: \
    est 0.5sec 2sec sfb limit 128

    tc filter add dev $DEV protocol ip parent 11: handle 3 \
    flow hash keys dst divisor 1024

    Notes:

    1) SFB's default child qdisc is pfifo_fast. It can be replaced by another
    qdisc, but a child qdisc MUST NOT drop a packet previously queued. This
    is because SFB needs to handle a dequeued packet in order to maintain
    its virtual queue states. pfifo_head_drop or CHOKe should not be used.

    2) ECN is enabled by default, unlike RED/CHOKe/GRED

    With help from Patrick McHardy & Andi Kleen

    Signed-off-by: Eric Dumazet
    CC: Juliusz Chroboczek
    CC: Stephen Hemminger
    CC: Patrick McHardy
    CC: Andi Kleen
    CC: John W. Linville
    Signed-off-by: David S. Miller

    Eric Dumazet
     

03 Feb, 2011

1 commit

  • CHOKe ("CHOose and Kill" or "CHOose and Keep") is an alternative
    packet scheduler based on the Random Exponential Drop (RED) algorithm.

    The core idea is:

    For every packet arrival:
      Calculate Qave
      if (Qave < minth)
        Queue the new packet
      else
        Select a packet at random from the queue
        if (both packets are from the same flow)
          Drop both packets
        else if (Qave > maxth)
          Drop the packet
        else
          Admit the packet with probability p (same as RED)

    See also:
    Rong Pan, Balaji Prabhakar, Konstantinos Psounis, "CHOKe: a stateless active
    queue management scheme for approximating fair bandwidth allocation",
    Proceeding of INFOCOM'2000, March 2000.
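
    A hedged configuration sketch (the device and the RED-style parameter
    values are illustrative, and the option set assumes the matching
    iproute2 support for choke):

    tc qdisc add dev eth0 parent 1:1 choke limit 1000 min 100 max 250 \
        avpkt 1000 burst 105 bandwidth 10mbit ecn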

    Help from:
    Eric Dumazet
    Patrick McHardy

    Signed-off-by: Stephen Hemminger
    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    stephen hemminger
     

20 Jan, 2011

3 commits

  • David S. Miller
     
  • This implements an mqprio queueing discipline that by default creates
    a pfifo_fast qdisc per tx queue and provides the needed configuration
    interface.

    Using the mqprio qdisc, the number of traffic classes (tcs) currently in
    use, along with the range of queues allotted to each class, can be
    configured. By default skbs are mapped to traffic classes using the skb
    priority. This mapping is configurable.

    Configurable parameters,

    struct tc_mqprio_qopt {
    __u8 num_tc;
    __u8 prio_tc_map[TC_BITMASK + 1];
    __u8 hw;
    __u16 count[TC_MAX_QUEUE];
    __u16 offset[TC_MAX_QUEUE];
    };

    Here the count/offset pairing give the queue alignment and the
    prio_tc_map gives the mapping from skb->priority to tc.

    The hw bit determines if the hardware should configure the count
    and offset values. If the hardware bit is set then the operation
    will fail if the hardware does not implement the ndo_setup_tc
    operation. This is to avoid undetermined states where the hardware
    may or may not control the queue mapping. Also minimal bounds
    checking is done on the count/offset to verify a queue does not
    exceed num_tx_queues and that queue ranges do not overlap. Otherwise
    it is left to user policy or hardware configuration to create
    useful mappings.

    It is expected that hardware QOS schemes can be implemented by
    creating appropriate mappings of queues in ndo_setup_tc().

    One expected use case is drivers will use the ndo_setup_tc to map
    queue ranges onto 802.1Q traffic classes. This provides a generic
    mechanism to map network traffic onto these traffic classes and
    removes the need for lower layer drivers to know specifics about
    traffic types.
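
    A hedged configuration sketch (the device, priority map and queue
    layout are illustrative):

    # 3 traffic classes; 16 skb priorities mapped onto them;
    # queues given as count@offset per class; hw 0 = software mapping
    tc qdisc add dev eth0 root mqprio num_tc 3 \
        map 0 0 0 0 1 1 1 1 2 2 2 2 2 2 2 2 \
        queues 4@0 4@4 8@8 hw 0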

    Signed-off-by: John Fastabend
    Signed-off-by: David S. Miller

    John Fastabend
     
  • Patrick McHardy
     

14 Jan, 2011

1 commit

  • Fix dependencies of netfilter realm match: it depends on NET_CLS_ROUTE,
    which itself depends on NET_SCHED; this dependency is missing from netfilter.

    Since matching on realms is also useful without having NET_SCHED enabled and
    the option really only controls whether the tclassid member is included in
    route and dst entries, rename the config option to IP_ROUTE_CLASSID and
    move it outside of the traffic scheduling context to get rid of the
    NET_SCHED dependency.

    Reported-by: Valdis Kletnieks
    Signed-off-by: Patrick McHardy

    Patrick McHardy
     

16 Nov, 2010

1 commit

  • Some of the documentation refers to web pages under
    the domain `osdl.org'. However, `osdl.org' now
    redirects to `linuxfoundation.org'.

    Rather than rely on redirections, this patch updates
    the addresses appropriately; for the most part, only
    documentation that is meant to be current has been
    updated.

    The patch should be pretty quick to scan and check;
    each new web-page URL was obtained by trying out the
    original URL in a browser and then simply copying the
    redirected URL (formatting as necessary).

    There is some conflict as to which one of these domain
    names is preferred:

    linuxfoundation.org
    linux-foundation.org

    So, I wrote:

    info@linuxfoundation.org

    and got this reply:

    Message-ID:
    Date: Mon, 15 Nov 2010 10:41:42 -0800
    From: David Ames

    ...

    linuxfoundation.org is preferred. The canonical name for our web site is
    www.linuxfoundation.org. Our list site is actually
    lists.linux-foundation.org.

    Regarding email linuxfoundation.org is preferred there are a few people
    who choose to use linux-foundation.org for their own reasons.

    Consequently, I used `linuxfoundation.org' for web pages and
    `lists.linux-foundation.org' for mailing-list web pages and email addresses;
    the only personal email address I updated from `@osdl.org' was that of
    Andrew Morton, who prefers `linux-foundation.org' according to `git log'.

    Signed-off-by: Michael Witten
    Signed-off-by: Jiri Kosina

    Michael Witten
     

20 Aug, 2010

1 commit

  • net/sched: add ACT_CSUM action to update packets checksums

    ACT_CSUM can be called just after ACT_PEDIT in order to re-compute some
    altered checksums in IPv4 and IPv6 packets. The following checksums are
    supported by this patch:
    - IPv4: IPv4 header, ICMP, IGMP, TCP, UDP & UDPLite
    - IPv6: ICMPv6, TCP, UDP & UDPLite
    It is possible to request updates of several kinds of checksums in the
    same action, e.g. if the packet flow mixes TCP, UDP and UDPLite.

    An example of usage is done in the associated iproute2 patch.
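
    A hedged illustration of what such usage could look like (the device
    and match are illustrative; the update-list keywords follow the action
    description above):

    # recompute IPv4 header and TCP checksums on matching packets
    tc filter add dev eth0 parent 1: protocol ip u32 \
        match ip protocol 6 0xff \
        action csum iph and tcp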

    Version 3 changes:
    - remove useless goto instructions
    - improve IPv6 hop options decoding

    Version 2 changes:
    - coding style correction
    - remove useless arguments of some functions
    - use stack in tcf_csum_dump()
    - add tcf_csum_skb_nextlayer() to factor code

    Signed-off-by: Gregoire Baron
    Acked-by: jamal
    Signed-off-by: David S. Miller

    Grégoire Baron
     

24 Mar, 2010

1 commit

  • Allows the net_cls cgroup subsystem to be compiled as a module

    This patch modifies net/sched/cls_cgroup.c to allow the net_cls subsystem
    to be optionally compiled as a module instead of builtin. The
    cgroup_subsys struct is moved around a bit to allow the subsys_id to be
    either declared as a compile-time constant by the cgroup_subsys.h include
    in cgroup.h, or, if it's a module, initialized within the struct by
    cgroup_load_subsys.

    Signed-off-by: Ben Blum
    Acked-by: Li Zefan
    Cc: Paul Menage
    Cc: "David S. Miller"
    Cc: KAMEZAWA Hiroyuki
    Cc: Lai Jiangshan
    Signed-off-by: Andrew Morton
    Signed-off-by: David S. Miller

    Ben Blum
     

20 Nov, 2008

1 commit

  • Add classful DRR scheduler as a more flexible replacement for SFQ.

    The main difference to the algorithm described in "Efficient Fair Queueing
    using Deficit Round Robin" is that this implementation doesn't drop packets
    from the longest queue on overrun, because it is classful and limits are
    handled by each individual child qdisc.
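
    A hedged setup sketch (the device, quantum and catch-all filter are
    illustrative); since DRR has no default class, a filter must assign
    every packet:

    tc qdisc add dev eth0 root handle 1: drr
    tc class add dev eth0 parent 1: classid 1:1 drr quantum 1514
    tc qdisc add dev eth0 parent 1:1 pfifo limit 100
    tc filter add dev eth0 parent 1: protocol ip u32 \
        match u32 0 0 classid 1:1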

    Signed-off-by: Patrick McHardy
    Signed-off-by: David S. Miller

    Patrick McHardy
     

08 Nov, 2008

1 commit

  • The classifier should cover the most common use case and will work
    without any special configuration.

    The principle of the classifier is to directly access the
    task_struct via get_current(). In order for this to work,
    classification requests from softirqs must be ignored. This is
    not a problem because the vast majority of packets in softirq
    context are not assigned to a task anyway. For this to work, a
    mechanism is needed to trace softirq context.

    For the sake of not adding too much complexity, this repost goes back
    to the method of relying on the number of nested bh disable calls, with
    the option to come up with something more reliable if actually needed.

    Signed-off-by: Thomas Graf
    Signed-off-by: David S. Miller

    Thomas Graf
     

13 Sep, 2008

2 commits

  • This new action will have the ability to change the priority and/or
    queue_mapping fields on an sk_buff.
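
    A hedged usage sketch (the device, match and field values are
    illustrative):

    # send ssh traffic to tx queue 3 with priority 1:2
    tc filter add dev eth0 parent 1: protocol ip u32 \
        match ip dport 22 0xffff \
        action skbedit queue_mapping 3 priority 1:2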

    Signed-off-by: Alexander Duyck
    Signed-off-by: Jeff Kirsher
    Signed-off-by: David S. Miller

    Alexander Duyck
     
  • This patch is intended to add a qdisc to support the new tx multiqueue
    architecture by providing a band for each hardware queue. By doing
    this it is possible to support a different qdisc per physical hardware
    queue.

    This qdisc uses the skb->queue_mapping to select which band to place
    the traffic onto. It then uses a round robin w/ a check to see if the
    subqueue is stopped to determine which band to dequeue the packet from.
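
    A hedged usage sketch (the device and match are illustrative); skbs can
    be steered between bands via their queue_mapping, e.g. with the skbedit
    action above:

    tc qdisc add dev eth0 root handle 1: multiq
    tc filter add dev eth0 parent 1: protocol ip u32 \
        match ip dport 80 0xffff \
        action skbedit queue_mapping 2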

    Signed-off-by: Alexander Duyck
    Signed-off-by: Jeff Kirsher
    Signed-off-by: David S. Miller

    Alexander Duyck
     

28 Jun, 2008

1 commit

  • Commit d62733c8e437fdb58325617c4b3331769ba82d70
    ([SCHED]: Qdisc changes and sch_rr added for multiqueue)
    added a NET_SCH_RR option that was unused since the code
    went unconditionally into sch_prio.

    Reported-by: Robert P. J. Day
    Signed-off-by: Adrian Bunk
    Signed-off-by: David S. Miller

    Adrian Bunk
     

01 Feb, 2008

2 commits

  • Add new "flow" classifier, which is meant to extend the SFQ hashing
    capabilities without hard-coding new hash functions and also allows
    deterministic mappings of keys to classes, replacing some out of tree
    iptables patches like IPCLASSIFY (maps IPs to classes), IPMARK (maps
    IPs to marks, with fw filters to classes), ...

    Some examples:

    - Classic SFQ hash:

    tc filter add ... flow hash \
    keys src,dst,proto,proto-src,proto-dst divisor 1024

    - Classic SFQ hash, but using information from conntrack to work properly in
    combination with NAT:

    tc filter add ... flow hash \
    keys nfct-src,nfct-dst,proto,nfct-proto-src,nfct-proto-dst divisor 1024

    - Map destination IPs of 192.168.0.0/24 to classids 1-257:

    tc filter add ... flow map \
    key dst addend -192.168.0.0 divisor 256

    - alternatively:

    tc filter add ... flow map \
    key dst and 0xff

    - similar, but reverse ordered:

    tc filter add ... flow map \
    key dst and 0xff xor 0xff

    Perturbation is currently not supported because we can't reliable kill the
    timer on destruction.

    Signed-off-by: Patrick McHardy
    Signed-off-by: David S. Miller

    Patrick McHardy
     
  • Since the old policer code is gone, TC actions are needed for policing.
    The ingress qdisc can get packets directly from netif_receive_skb()
    in case TC actions are enabled, or through netfilter otherwise; but
    since there is no policer without TC actions, the only thing it actually
    does in that case is count packets.

    Remove the netfilter support and always require TC actions.

    Signed-off-by: Patrick McHardy
    Acked-by: Jamal Hadi Salim
    Signed-off-by: David S. Miller

    Patrick McHardy
     

11 Oct, 2007

1 commit

  • Stateless NAT is useful in controlled environments where restrictions are
    placed on through traffic such that we don't need connection tracking to
    correctly NAT protocol-specific data.

    In particular, this is of interest when the number of flows or the number
    of addresses being NATed is large, or if connection tracking information
    has to be replicated and where it is not practical to do so.

    Previously we had stateless NAT functionality which was integrated into
    the IPv4 routing subsystem. This was a great solution as long as the NAT
    worked on a subnet to subnet basis such that the number of NAT rules was
    relatively small. The reason is that for SNAT the routing based system
    had to perform a linear scan through the rules.

    If the number of rules is large, then major renovations would have had
    to take place in the routing subsystem to make this practical.

    For the time being, the least intrusive way of achieving this is to use
    the u32 classifier written by Alexey Kuznetsov along with the actions
    infrastructure implemented by Jamal Hadi Salim.

    The following patch is an attempt at this problem. It creates a new nat
    action that can be invoked from u32 hash tables, allowing a large
    number of stateless NAT rules that can be used/updated in constant time.

    The actual NAT code is mostly based on the previous stateless NAT code
    written by Alexey. In future we might be able to utilise the protocol
    NAT code from netfilter to improve support for other protocols.
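
    A hedged example of such a u32-invoked nat action (the device and
    addresses are illustrative):

    # egress: rewrite source 192.168.1.100 to 203.0.113.5
    tc filter add dev eth0 parent 1: protocol ip u32 \
        match ip src 192.168.1.100/32 \
        action nat egress 192.168.1.100/32 203.0.113.5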

    Signed-off-by: Herbert Xu
    Signed-off-by: David S. Miller

    Herbert Xu
     

15 Jul, 2007

1 commit

  • The NET_CLS_ACT option is now a full replacement for NET_CLS_POLICE,
    remove the old code. The config option will be kept around to select
    the equivalent NET_CLS_ACT options for a short time to allow easier
    upgrades.

    Signed-off-by: Patrick McHardy
    Signed-off-by: David S. Miller

    Patrick McHardy