21 Jun, 2020

1 commit

  • The user tool modinfo is used to get information on kernel modules, including a
    description where it is available.

    This patch adds a brief MODULE_DESCRIPTION to the following modules:

    9p
    drop_monitor
    esp4_offload
    esp6_offload
    fou
    fou6
    ila
    sch_fq
    sch_fq_codel
    sch_hhf

    Signed-off-by: Rob Gill
    Signed-off-by: David S. Miller

    Rob Gill
     

05 May, 2020

1 commit

  • QUIC servers would like to use SO_TXTIME, without having CAP_NET_ADMIN,
    to efficiently pace UDP packets.

    As far as sch_fq is concerned, we need to add safety checks so
    that a buggy application does not fill the qdisc with packets
    having delivery times far in the future.

    This patch adds a configurable horizon (default: 10 seconds),
    and a configurable policy when a packet is beyond the horizon
    at enqueue() time:
    - either drop the packet (default policy)
    - or cap its delivery time to the horizon.

    $ tc -s -d qd sh dev eth0
    qdisc fq 8022: root refcnt 257 limit 10000p flow_limit 100p buckets 1024
    orphan_mask 1023 quantum 10Kb initial_quantum 51160b low_rate_threshold 550Kbit
    refill_delay 40.0ms timer_slack 10.000us horizon 10.000s
    Sent 1234215879 bytes 837099 pkt (dropped 21, overlimits 0 requeues 6)
    backlog 0b 0p requeues 6
    flows 1191 (inactive 1177 throttled 0)
    gc 0 highprio 0 throttled 692 latency 11.480us
    pkts_too_long 0 alloc_errors 0 horizon_drops 21 horizon_caps 0
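
    A minimal sketch of the enqueue-time decision described above (field and
    statistic names are approximations, not the kernel's actual fq_enqueue()
    code):

    u64 horizon = now + q->horizon;                 /* e.g. now + 10 seconds */

    if ((u64)ktime_to_ns(skb->tstamp) > horizon) {
            if (q->horizon_drop) {                  /* default policy: drop */
                    q->stat_horizon_drops++;
                    return qdisc_drop(skb, sch, to_free);
            }
            q->stat_horizon_caps++;                 /* alternative: cap */
            skb->tstamp = horizon;
    }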

    v2: fixed an overflow on 32bit kernels in fq_init(), reported
    by kbuild test robot

    Signed-off-by: Eric Dumazet
    Cc: Willem de Bruijn
    Signed-off-by: David S. Miller

    Eric Dumazet
     

04 May, 2020

5 commits

    After the refactoring in the prior patch, the prefetch() in fq_dequeue()
    can be issued a bit earlier.

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     
  • This refactors the code to not call fq_peek() from fq_dequeue_head()
    since the caller can provide the skb.

    Also rename fq_dequeue_head() to fq_dequeue_skb(), because 'head' is
    a bit vague given the skb could come from the t_root rb-tree.

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     
  • fq_gc() already builds a small array of pointers, so using
    kmem_cache_free_bulk() needs very little change.

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     
    sizeof(struct fq_flow) is 112 bytes on 64bit arches.

    This means that, depending on placement, half of them fit in two cache
    lines while the other half straddle three.

    This patch adds cache line alignment, and makes sure that only
    the first cache line is touched by fq_enqueue(), which is more
    expensive than fq_dequeue() in general.
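
    A simplified illustration of the idea (hypothetical field grouping, not
    the real struct fq_flow layout): the fields needed on the enqueue path
    are packed into the first 64-byte line, and the whole structure is
    cache-line aligned so that line never straddles a boundary.

    struct flow_example {
            /* hot on enqueue: keep within the first 64-byte cache line */
            struct rb_root   t_root;
            struct sk_buff   *head, *tail;
            struct sock      *sk;
            u32              socket_hash;
            int              qlen;
            int              credit;
            /* dequeue-only / colder fields land in the following lines */
            struct rb_node   rate_node;
            u64              time_next_packet;
    } ____cacheline_aligned_in_smp;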

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     
    A significant number of cpu cycles is spent in fq_gc().

    When fq_gc() does its lookup in the rb-tree, it needs the
    following fields from struct fq_flow :

    f->sk (lookup key in the rb-tree)
    f->fq_node (anchor in the rb-tree)
    f->next (used to determine if the flow is detached)
    f->age (used to determine if the flow is a candidate for gc)

    This unfortunately spans two cache lines (assuming 64-byte cache lines).

    We can avoid using f->next if we use the low order bit of f->{age|tail}.

    This low order bit is 0 if f->tail points to an sk_buff.
    We set the low order bit to 1 if the union contains a jiffies value.

    Combined with the following patch, this makes sure we only need
    to bring into cpu caches one cache line per flow.
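
    A sketch of the resulting helpers (simplified struct and illustrative
    names, following the description above rather than the exact kernel
    code):

    struct flow {
            struct sk_buff *head;
            union {
                    struct sk_buff *tail;   /* valid skb pointer: low bit is 0 */
                    unsigned long   age;    /* jiffies | 1UL: low bit is 1     */
            };
            /* ... */
    };

    static bool flow_is_detached(const struct flow *f)
    {
            return f->age & 1UL;            /* low bit set => union holds jiffies */
    }

    static void flow_set_detached(struct flow *f)
    {
            f->age = jiffies | 1UL;         /* tag the stored jiffies value */
    }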

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

18 Mar, 2020

1 commit

  • Add a new attribute to control the fq qdisc hrtimer slack.

    Default is set to 10 usec.

    When/if packets are throttled, fq sets up an hrtimer that can
    lead to one interrupt per packet in the throttled queue.

    By using a timer slack, we allow better use of timer interrupts,
    giving them a chance to call multiple timer callbacks
    at each hardware interrupt.

    Also, giving a slack allows FQ to dequeue batches of packets
    instead of a single one, thus increasing xmit_more efficiency.

    This has no negative effect on the rate a TCP flow can sustain,
    since each TCP flow maintains its own precise vtime (tp->tcp_wstamp_ns).

    v2: added strict netlink checking (as feedback from Jakub Kicinski)

    Tested:
    1000 concurrent flows all using paced packets.
    1,000,000 packets sent per second.

    Before the patch :

    $ vmstat 2 10
    procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
    r b swpd free buff cache si so bi bo in cs us sy id wa st
    0 0 0 60726784 23628 3485992 0 0 138 1 977 535 0 12 87 0 0
    0 0 0 60714700 23628 3485628 0 0 0 0 1568827 26462 0 22 78 0 0
    1 0 0 60716012 23628 3485656 0 0 0 0 1570034 26216 0 22 78 0 0
    0 0 0 60722420 23628 3485492 0 0 0 0 1567230 26424 0 22 78 0 0
    0 0 0 60727484 23628 3485556 0 0 0 0 1568220 26200 0 22 78 0 0
    2 0 0 60718900 23628 3485380 0 0 0 40 1564721 26630 0 22 78 0 0
    2 0 0 60718096 23628 3485332 0 0 0 0 1562593 26432 0 22 78 0 0
    0 0 0 60719608 23628 3485064 0 0 0 0 1563806 26238 0 22 78 0 0
    1 0 0 60722876 23628 3485236 0 0 0 130 1565874 26566 0 22 78 0 0
    1 0 0 60722752 23628 3484908 0 0 0 0 1567646 26247 0 22 78 0 0

    After the patch, slack of 10 usec, we can see a reduction of interrupts
    per second, and a small decrease of reported cpu usage.

    $ vmstat 2 10
    procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
    r b swpd free buff cache si so bi bo in cs us sy id wa st
    1 0 0 60722564 23628 3484728 0 0 133 1 696 545 0 13 87 0 0
    1 0 0 60722568 23628 3484824 0 0 0 0 977278 25469 0 20 80 0 0
    0 0 0 60716396 23628 3484764 0 0 0 0 979997 25326 0 20 80 0 0
    0 0 0 60713844 23628 3484960 0 0 0 0 981394 25249 0 20 80 0 0
    2 0 0 60720468 23628 3484916 0 0 0 0 982860 25062 0 20 80 0 0
    1 0 0 60721236 23628 3484856 0 0 0 0 982867 25100 0 20 80 0 0
    1 0 0 60722400 23628 3484456 0 0 0 8 982698 25303 0 20 80 0 0
    0 0 0 60715396 23628 3484428 0 0 0 0 981777 25176 0 20 80 0 0
    0 0 0 60716520 23628 3486544 0 0 0 36 978965 27857 0 21 79 0 0
    0 0 0 60719592 23628 3486516 0 0 0 22 977318 25106 0 20 80 0 0

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

09 Jan, 2020

1 commit

    As diagnosed by Florian :

    If TCA_FQ_QUANTUM is set to 0x80000000, fq_dequeue()
    can loop forever in :

    if (f->credit <= 0) {
        f->credit += q->quantum;
        goto begin;
    }

    ... because f->credit is either 0 or -2147483648.

    Let's limit TCA_FQ_QUANTUM to no more than 1 << 20 :
    This max value should limit risks of breaking user setups
    while fixing this bug.
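
    The arithmetic behind the hang, as a small standalone demonstration (not
    kernel code; on typical two's-complement platforms): with a quantum of
    0x80000000 the 32-bit signed credit only ever oscillates between 0 and
    INT_MIN, so it never becomes positive and the dequeue loop never exits.

    #include <stdio.h>

    int main(void)
    {
            unsigned int quantum = 0x80000000u;     /* TCA_FQ_QUANTUM value      */
            int credit = 0;                         /* f->credit is a signed int */

            for (int i = 0; i < 4; i++) {
                    if (credit <= 0)
                            credit += quantum;      /* wraps: 0 -> INT_MIN -> 0  */
                    printf("credit = %d\n", credit);
            }
            return 0;                               /* prints -2147483648, 0, ... */
    }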

    Fixes: afe4fd062416 ("pkt_sched: fq: Fair Queue packet scheduler")
    Signed-off-by: Eric Dumazet
    Diagnosed-by: Florian Westphal
    Reported-by: syzbot+dc9071cc5a85950bdfce@syzkaller.appspotmail.com
    Signed-off-by: David S. Miller

    Eric Dumazet
     

27 Dec, 2019

1 commit

  • If fq_classify() recycles a struct fq_flow because
    a socket structure has been reallocated, we do not
    set sk->sk_pacing_status immediately, but later if the
    flow becomes detached.

    This means that any flow requiring pacing (BBR, or SO_MAX_PACING_RATE)
    might fall back to TCP internal pacing, which requires a per-socket
    high resolution timer, and therefore more cpu cycles.

    Fixes: 218af599fa63 ("tcp: internal implementation for pacing")
    Signed-off-by: Eric Dumazet
    Cc: Soheil Hassas Yeganeh
    Cc: Neal Cardwell
    Acked-by: Soheil Hassas Yeganeh
    Signed-off-by: David S. Miller

    Eric Dumazet
     

31 May, 2019

1 commit

  • Based on 1 normalized pattern(s):

    this program is free software you can redistribute it and or modify
    it under the terms of the gnu general public license as published by
    the free software foundation either version 2 of the license or at
    your option any later version

    extracted by the scancode license scanner the SPDX license identifier

    GPL-2.0-or-later

    has been chosen to replace the boilerplate/reference in 3029 file(s).

    Signed-off-by: Thomas Gleixner
    Reviewed-by: Allison Randal
    Cc: linux-spdx@vger.kernel.org
    Link: https://lkml.kernel.org/r/20190527070032.746973796@linutronix.de
    Signed-off-by: Greg Kroah-Hartman

    Thomas Gleixner
     

08 May, 2019

2 commits

  • FQ packet scheduler assumed that packets could be classified
    based on their owning socket.

    This means that if a UDP server uses one UDP socket to send
    packets to different destinations, packets all land
    in one FQ flow.

    This is unfair, since each TCP flow has a unique bucket, meaning
    that in case of pressure (fully utilised uplink), TCP flows get
    a larger share of the bandwidth.

    If we instead detect unconnected sockets, we can use a stochastic
    hash based on the 4-tuple hash.

    This also means a QUIC server using one UDP socket will properly
    spread the outgoing packets to different buckets, and in-kernel
    pacing based on the EDT model will no longer risk building a big
    rb-tree on one flow.

    Note that a UDP application might provide the skb->hash in an
    ancillary message at sendmsg() time to avoid the cost of a dissection
    in the fq packet scheduler.
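
    A conceptual sketch of the classification change (helper and field names
    are hypothetical, including sk_is_unconnected(); the real fq_classify()
    differs in detail): connected sockets keep a per-socket bucket, while
    packets from unconnected sockets are spread by the flow-dissector hash.

    static u32 pick_bucket(struct sk_buff *skb, struct sock *sk, u32 orphan_mask)
    {
            /* Unconnected (or absent) socket: spread by the packet 4-tuple hash. */
            if (!sk || sk_is_unconnected(sk))
                    return skb_get_hash(skb) & orphan_mask;

            /* Connected socket: one bucket per socket, as before. */
            return hash_ptr(sk, 32) & orphan_mask;
    }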

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     
    The TCP stack makes sure packets for a given flow have monotonically
    increasing timestamps, but we want to allow UDP packets to use EDT as
    well, so that QUIC servers can use in-kernel pacing.

    This patch adds a per-flow rb-tree on which packets might
    be stored. We still try to use the linear list for the
    typical cases where packets are queued with monotonically
    increasing skb->tstamp, since queueing/dequeueing packets on
    a standard list is O(1).

    Note that the ability to store packets in arbitrary EDT
    order will later allow us to implement a per-TCP-socket
    mechanism adding delays (eventually with jitter) and reordering,
    to build convenient network emulators.

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

28 Apr, 2019

2 commits

  • We currently have two levels of strict validation:

    1) liberal (default)
    - undefined (type >= max) & NLA_UNSPEC attributes accepted
    - attribute length >= expected accepted
    - garbage at end of message accepted
    2) strict (opt-in)
    - NLA_UNSPEC attributes accepted
    - attribute length >= expected accepted

    Split out parsing strictness into four different options:
    * TRAILING - check that there's no trailing data after parsing
    attributes (in message or nested)
    * MAXTYPE - reject attrs > max known type
    * UNSPEC - reject attributes with NLA_UNSPEC policy entries
    * STRICT_ATTRS - strictly validate attribute size

    The default for future things should be *everything*.
    The current *_strict() is a combination of TRAILING and MAXTYPE,
    and is renamed to _deprecated_strict().
    The current regular parsing has none of this, and is renamed to
    *_parse_deprecated().

    Additionally it allows us to selectively set one of the new flags
    even on old policies. Notably, the UNSPEC flag could be useful in
    this case, since it can be arranged (by filling in the policy) to
    not be an incompatible userspace ABI change, but would then going
    forward prevent forgetting attribute entries. Similar can apply
    to the POLICY flag.

    We end up with the following renames:
    * nla_parse -> nla_parse_deprecated
    * nla_parse_strict -> nla_parse_deprecated_strict
    * nlmsg_parse -> nlmsg_parse_deprecated
    * nlmsg_parse_strict -> nlmsg_parse_deprecated_strict
    * nla_parse_nested -> nla_parse_nested_deprecated
    * nla_validate_nested -> nla_validate_nested_deprecated

    Using spatch, of course:
    @@
    expression TB, MAX, HEAD, LEN, POL, EXT;
    @@
    -nla_parse(TB, MAX, HEAD, LEN, POL, EXT)
    +nla_parse_deprecated(TB, MAX, HEAD, LEN, POL, EXT)

    @@
    expression NLH, HDRLEN, TB, MAX, POL, EXT;
    @@
    -nlmsg_parse(NLH, HDRLEN, TB, MAX, POL, EXT)
    +nlmsg_parse_deprecated(NLH, HDRLEN, TB, MAX, POL, EXT)

    @@
    expression NLH, HDRLEN, TB, MAX, POL, EXT;
    @@
    -nlmsg_parse_strict(NLH, HDRLEN, TB, MAX, POL, EXT)
    +nlmsg_parse_deprecated_strict(NLH, HDRLEN, TB, MAX, POL, EXT)

    @@
    expression TB, MAX, NLA, POL, EXT;
    @@
    -nla_parse_nested(TB, MAX, NLA, POL, EXT)
    +nla_parse_nested_deprecated(TB, MAX, NLA, POL, EXT)

    @@
    expression START, MAX, POL, EXT;
    @@
    -nla_validate_nested(START, MAX, POL, EXT)
    +nla_validate_nested_deprecated(START, MAX, POL, EXT)

    @@
    expression NLH, HDRLEN, MAX, POL, EXT;
    @@
    -nlmsg_validate(NLH, HDRLEN, MAX, POL, EXT)
    +nlmsg_validate_deprecated(NLH, HDRLEN, MAX, POL, EXT)

    For this patch, don't actually add the strict, non-renamed versions
    yet so that it breaks compile if I get it wrong.

    Also, while at it, make nla_validate and nla_parse go down to a
    common __nla_validate_parse() function to avoid code duplication.

    Ultimately, this allows us to have very strict validation for every
    new caller of nla_parse()/nlmsg_parse() etc as re-introduced in the
    next patch, while existing things will continue to work as is.

    In effect then, this adds fully strict validation for any new command.

    Signed-off-by: Johannes Berg
    Signed-off-by: David S. Miller

    Johannes Berg
     
    Even though the NLA_F_NESTED flag was introduced more than 11 years ago, most
    netlink based interfaces (including recently added ones) are still not
    setting it in kernel generated messages. Without the flag, message parsers
    not aware of attribute semantics (e.g. wireshark dissector or libmnl's
    mnl_nlmsg_fprintf()) cannot recognize nested attributes and won't display
    the structure of their contents.

    Unfortunately we cannot just add the flag everywhere as there may be
    userspace applications which check nlattr::nla_type directly rather than
    through a helper masking out the flags. Therefore the patch renames
    nla_nest_start() to nla_nest_start_noflag() and introduces nla_nest_start()
    as a wrapper adding NLA_F_NESTED. The calls which add NLA_F_NESTED manually
    are rewritten to use nla_nest_start().

    Except for changes in include/net/netlink.h, the patch was generated using
    this semantic patch:

    @@ expression E1, E2; @@
    -nla_nest_start(E1, E2)
    +nla_nest_start_noflag(E1, E2)

    @@ expression E1, E2; @@
    -nla_nest_start_noflag(E1, E2 | NLA_F_NESTED)
    +nla_nest_start(E1, E2)

    Signed-off-by: Michal Kubecek
    Acked-by: Jiri Pirko
    Acked-by: David Ahern
    Signed-off-by: David S. Miller

    Michal Kubecek
     

16 Nov, 2018

1 commit

    When EDT conversion happened, fq lost the ability to enforce a maxrate
    for all flows. It kept it for non-EDT flows.

    This commit restores the functionality.

    Tested:

    tc qd replace dev eth0 root fq maxrate 500Mbit
    netperf -P0 -H host -- -O THROUGHPUT
    489.75

    Fixes: ab408b6dc744 ("tcp: switch tcp and sch_fq to new earliest departure time model")
    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

12 Nov, 2018

1 commit

  • Similar to 80ba92fa1a92 ("codel: add ce_threshold attribute")

    After EDT adoption, it became easier to implement DCTCP-like CE marking.

    In many cases, queues are not building in the network fabric but on
    the hosts themselves.

    If packets leaving fq missed their Earliest Departure Time by XXX usec,
    we mark them with ECN CE. This gives feedback (after one RTT) to
    the sender to slow down and find a better operating mode.

    Example :

    tc qd replace dev eth0 root fq ce_threshold 2.5ms
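
    Conceptually, the dequeue-time check looks like this (a sketch with
    approximate field names, not the exact kernel code):

    /* 'now' and skb->tstamp share the same clock base. */
    if (q->ce_threshold &&
        now > (u64)ktime_to_ns(skb->tstamp) + q->ce_threshold) {
            INET_ECN_set_ce(skb);           /* mark Congestion Experienced */
            q->stat_ce_mark++;
    }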

    Signed-off-by: Eric Dumazet
    Acked-by: Neal Cardwell
    Signed-off-by: David S. Miller

    Eric Dumazet
     

16 Oct, 2018

2 commits

    With the new EDT model, sch_fq no longer has to special-case
    TCP pure acks, since their skb->tstamp will allow them
    to be sent without pacing delay.

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     
    sk_pacing_rate was introduced as a u32 field in 2013,
    effectively limiting per-flow pacing to 34Gbit.

    We believe it is time to allow TCP to pace high speed flows
    on 64bit hosts, as we now can reach 100Gbit on one TCP flow.

    This patch adds no cost for 32bit kernels.

    The tcpi_pacing_rate and tcpi_max_pacing_rate fields were already
    exported as 64bit, so the iproute2 ss command requires no changes.

    Unfortunately the SO_MAX_PACING_RATE socket option will stay
    32bit and we will need to add a new option to let applications
    control high pacing rates.
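
    For reference, the existing 32-bit option is set from userspace as in
    the following minimal sketch (SO_MAX_PACING_RATE takes bytes per second,
    so rates above roughly 34Gbit cannot be expressed through it):

    #include <sys/socket.h>
    #include <stdio.h>

    static int set_max_pacing_rate(int fd)
    {
            unsigned int rate = 125 * 1000 * 1000;  /* ~1 Gbit/s, in bytes/sec */

            if (setsockopt(fd, SOL_SOCKET, SO_MAX_PACING_RATE,
                           &rate, sizeof(rate)) < 0) {
                    perror("setsockopt(SO_MAX_PACING_RATE)");
                    return -1;
            }
            return 0;
    }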

    State Recv-Q Send-Q Local Address:Port Peer Address:Port
    ESTAB 0 1787144 10.246.9.76:49992 10.246.9.77:36741
    timer:(on,003ms,0) ino:91863 sk:2
    skmem:(r0,rb540000,t66440,tb2363904,f605944,w1822984,o0,bl0,d0)
    ts sack bbr wscale:8,8 rto:201 rtt:0.057/0.006 mss:1448
    rcvmss:536 advmss:1448
    cwnd:138 ssthresh:178 bytes_acked:256699822585 segs_out:177279177
    segs_in:3916318 data_segs_out:177279175
    bbr:(bw:31276.8Mbps,mrtt:0,pacing_gain:1.25,cwnd_gain:2)
    send 28045.5Mbps lastrcv:73333
    pacing_rate 38705.0Mbps delivery_rate 22997.6Mbps
    busy:73333ms unacked:135 retrans:0/157 rcv_space:14480
    notsent:2085120 minrtt:0.013

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

02 Oct, 2018

1 commit

    In the recent TCP/EDT patch series, I switched TCP and sch_fq
    clocks from MONOTONIC to TAI, in order to match the choice made
    earlier for the sch_etf packet scheduler.

    But sure enough, this broke some setups where the TAI clock
    jumps forward (by almost 50 years...), as reported
    by Leonard Crestez.

    If we want to converge later, we'll probably need to add
    an skb field to differentiate the clock bases, or a socket option.

    In the meantime, a UDP application will need to use the CLOCK_MONOTONIC
    base for its SCM_TXTIME timestamps if using the fq packet scheduler.
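
    A minimal userspace sketch of that combination (error handling omitted,
    connected UDP socket assumed): SO_TXTIME is enabled with a
    CLOCK_MONOTONIC base, and each sendmsg() attaches a delivery time via an
    SCM_TXTIME ancillary message.

    #include <linux/net_tstamp.h>   /* struct sock_txtime */
    #include <sys/socket.h>
    #include <stdint.h>
    #include <string.h>
    #include <time.h>

    static ssize_t send_paced(int fd, const void *buf, size_t len,
                              uint64_t delay_ns)
    {
            struct sock_txtime cfg = { .clockid = CLOCK_MONOTONIC };
            struct iovec iov = { .iov_base = (void *)buf, .iov_len = len };
            uint64_t txtime;
            char control[CMSG_SPACE(sizeof(txtime))];
            struct msghdr msg = {
                    .msg_iov = &iov, .msg_iovlen = 1,
                    .msg_control = control, .msg_controllen = sizeof(control),
            };
            struct cmsghdr *cm;
            struct timespec ts;

            /* One-time setup in a real program; shown inline for brevity. */
            setsockopt(fd, SOL_SOCKET, SO_TXTIME, &cfg, sizeof(cfg));

            /* Delivery time in nanoseconds, CLOCK_MONOTONIC base. */
            clock_gettime(CLOCK_MONOTONIC, &ts);
            txtime = ts.tv_sec * 1000000000ULL + ts.tv_nsec + delay_ns;

            cm = CMSG_FIRSTHDR(&msg);
            cm->cmsg_level = SOL_SOCKET;
            cm->cmsg_type = SCM_TXTIME;
            cm->cmsg_len = CMSG_LEN(sizeof(txtime));
            memcpy(CMSG_DATA(cm), &txtime, sizeof(txtime));

            return sendmsg(fd, &msg, 0);
    }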

    Fixes: 72b0094f9182 ("tcp: switch tcp_clock_ns() to CLOCK_TAI base")
    Fixes: 142537e41923 ("net_sched: sch_fq: switch to CLOCK_TAI")
    Fixes: fd2bca2aa789 ("tcp: switch internal pacing timer to CLOCK_TAI")
    Signed-off-by: Eric Dumazet
    Reported-by: Leonard Crestez
    Tested-by: Leonard Crestez
    Signed-off-by: David S. Miller

    Eric Dumazet
     

22 Sep, 2018

3 commits

    With the earliest departure time model, we no longer plan to
    special-case TCP retransmits. We therefore remove dead
    code (since most compilers understood skb_is_retransmit()
    was false).

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     
  • TCP keeps track of tcp_wstamp_ns by itself, meaning sch_fq
    no longer has to do it.

    Thanks to this model, TCP can get more accurate RTT samples,
    since pacing no longer inflates them.

    This has the nice effect of removing some delays caused by the FQ
    quantum mechanism, which inflated max/P99 latencies.

    Also we might relax TCP Small Queue tight limits in the future,
    since this new model allows TCP to build bigger batches, given that
    sch_fq (or a device with earliest departure time offload) ensures
    these packets will be delivered on time.

    Note that other protocols are not converted (they will probably
    never be), so sch_fq still has support for SO_MAX_PACING_RATE.

    Tested:

    Test showing FQ pacing quantum artifact for low-rate flows,
    adding unexpected throttles for RPC flows, inflating max and P99 latencies.

    The parameters chosen here are meant to show what happens typically when
    a TCP flow has a reduced pacing rate (this can be caused by a reduced
    cwnd after a few losses, and/or an rtt above a few ms).

    MIBS="MIN_LATENCY,MEAN_LATENCY,MAX_LATENCY,P99_LATENCY,STDDEV_LATENCY"
    Before :
    $ netperf -H 10.246.7.133 -t TCP_RR -Cc -T6,6 -- -q 2000000 -r 100,100 -o $MIBS
    MIGRATED TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 10.246.7.133 () port 0 AF_INET : first burst 0 : cpu bind
    Minimum Latency Microseconds,Mean Latency Microseconds,Maximum Latency Microseconds,99th Percentile Latency Microseconds,Stddev Latency Microseconds
    19,82.78,5279,3825,482.02

    After :
    $ netperf -H 10.246.7.133 -t TCP_RR -Cc -T6,6 -- -q 2000000 -r 100,100 -o $MIBS
    MIGRATED TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 10.246.7.133 () port 0 AF_INET : first burst 0 : cpu bind
    Minimum Latency Microseconds,Mean Latency Microseconds,Maximum Latency Microseconds,99th Percentile Latency Microseconds,Stddev Latency Microseconds
    20,49.94,128,63,3.18

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     
    TCP will soon provide the earliest departure time in each skb->tstamp,
    so that sch_fq does not have to determine the departure time by looking
    at the socket's sk_pacing_rate.

    In linux-4.19 we chose CLOCK_TAI as the clock base for transports,
    qdiscs, and NIC offloads.

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

03 May, 2018

1 commit

    Normally, a socket can not be freed/reused unless all its TX packets
    have left the qdisc and were TX-completed. However, connect(AF_UNSPEC)
    allows this to happen.

    With commit fc59d5bdf1e3 ("pkt_sched: fq: clear time_next_packet for
    reused flows") we cleared f->time_next_packet but took no special
    action if the flow was still in the throttled rb-tree.

    Since f->time_next_packet is the key used in the rb-tree searches,
    blindly clearing it might break rb-tree integrity. We need to make
    sure the flow is no longer in the rb-tree to avoid this problem.

    Fixes: fc59d5bdf1e3 ("pkt_sched: fq: clear time_next_packet for reused flows")
    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

13 Jul, 2017

1 commit

    __GFP_REPEAT was designed to allow a retry-but-eventually-fail semantic in
    the page allocator. This has been true, but only for allocation
    requests larger than PAGE_ALLOC_COSTLY_ORDER. It has always been
    ignored for smaller sizes. This is a bit unfortunate because there is
    no way to express the same semantic for those requests, and they are
    considered too important to fail, so they might end up looping in the
    page allocator forever, similarly to GFP_NOFAIL requests.

    Now that the whole tree has been cleaned up and accidental or misled
    usage of the __GFP_REPEAT flag has been removed for !costly requests, we can
    give the original flag a better name and, more importantly, a more useful
    semantic. Let's rename it to __GFP_RETRY_MAYFAIL, which tells the user
    that the allocator will try really hard but there is no promise of
    success. This works independently of the order and overrides the
    default allocator behavior. Page allocator users have several levels of
    guarantee vs. cost options (take GFP_KERNEL as an example):

    - GFP_KERNEL & ~__GFP_RECLAIM - optimistic allocation without _any_
    attempt to free memory at all. The most lightweight mode, which doesn't
    even kick the background reclaim. Should be used carefully because
    it might deplete the memory and the next user might hit the more
    aggressive reclaim.

    - GFP_KERNEL & ~__GFP_DIRECT_RECLAIM (or GFP_NOWAIT) - optimistic
    allocation without any attempt to free memory from the current
    context but can wake kswapd to reclaim memory if the zone is below
    the low watermark. Can be used from either atomic contexts or when
    the request is a performance optimization and there is another
    fallback for a slow path.

    - (GFP_KERNEL|__GFP_HIGH) & ~__GFP_DIRECT_RECLAIM (aka GFP_ATOMIC) -
    non sleeping allocation with an expensive fallback so it can access
    some portion of memory reserves. Usually used from interrupt/bh
    context with an expensive slow path fallback.

    - GFP_KERNEL - both background and direct reclaim are allowed and the
    _default_ page allocator behavior is used. That means that !costly
    allocation requests are basically nofail but there is no guarantee of
    that behavior so failures have to be checked properly by callers
    (e.g. OOM killer victim is allowed to fail currently).

    - GFP_KERNEL | __GFP_NORETRY - overrides the default allocator behavior
    and all allocation requests fail early rather than cause disruptive
    reclaim (one round of reclaim in this implementation). The OOM killer
    is not invoked.

    - GFP_KERNEL | __GFP_RETRY_MAYFAIL - overrides the default allocator
    behavior and all allocation requests try really hard. The request
    will fail if the reclaim cannot make any progress. The OOM killer
    won't be triggered.

    - GFP_KERNEL | __GFP_NOFAIL - overrides the default allocator behavior
    and all allocation requests will loop endlessly until they succeed.
    This might be really dangerous especially for larger orders.

    Existing users of __GFP_REPEAT are changed to __GFP_RETRY_MAYFAIL
    because they already had their semantic. No new users are added.
    __alloc_pages_slowpath is changed to bail out for __GFP_RETRY_MAYFAIL if
    there is no progress and we have already passed the OOM point.

    This means that all the reclaim opportunities have been exhausted except
    the most disruptive one (the OOM killer), and a user defined fallback
    behavior is more sensible than to keep retrying in the page allocator.

    [akpm@linux-foundation.org: fix arch/sparc/kernel/mdesc.c]
    [mhocko@suse.com: semantic fix]
    Link: http://lkml.kernel.org/r/20170626123847.GM11534@dhcp22.suse.cz
    [mhocko@kernel.org: address other thing spotted by Vlastimil]
    Link: http://lkml.kernel.org/r/20170626124233.GN11534@dhcp22.suse.cz
    Link: http://lkml.kernel.org/r/20170623085345.11304-3-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Acked-by: Vlastimil Babka
    Cc: Alex Belits
    Cc: Chris Wilson
    Cc: Christoph Hellwig
    Cc: Darrick J. Wong
    Cc: David Daney
    Cc: Johannes Weiner
    Cc: Mel Gorman
    Cc: NeilBrown
    Cc: Ralf Baechle
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     

17 May, 2017

1 commit

    BBR congestion control depends on pacing, and pacing is
    currently handled by the sch_fq packet scheduler for performance reasons,
    and also because implementing pacing with FQ was convenient to truly
    avoid bursts.

    However there are many cases where this packet scheduler constraint
    is not practical.
    - Many Linux hosts are not focused on handling thousands of TCP
    flows in the most efficient way.
    - Some routers use fq_codel or other AQM, but still would like
    to use BBR for the few TCP flows they initiate/terminate.

    This patch implements an automatic fallback to internal pacing.

    Pacing is requested either by BBR or use of SO_MAX_PACING_RATE option.

    If sch_fq happens to be in the egress path, pacing is delegated to
    the qdisc, otherwise pacing is done by TCP itself.

    One advantage of pacing from the TCP stack is getting more precise rtt
    estimations, and less work done from TX completion, since TCP Small
    Queue limits are not generally hit. Setups with a single TX queue but
    many cpus might even benefit from this.

    Note that unlike sch_fq, we do not take into account header sizes.
    Taking care of these headers would add additional complexity for
    no practical difference in behavior.

    Some performance numbers using 800 TCP_STREAM flows rate limited to
    ~48 Mbit per second on a 40Gbit NIC.

    If MQ+pfifo_fast is used on the NIC :

    $ sar -n DEV 1 5 | grep eth
    14:48:44 eth0 725743.00 2932134.00 46776.76 4335184.68 0.00 0.00 1.00
    14:48:45 eth0 725349.00 2932112.00 46751.86 4335158.90 0.00 0.00 0.00
    14:48:46 eth0 725101.00 2931153.00 46735.07 4333748.63 0.00 0.00 0.00
    14:48:47 eth0 725099.00 2931161.00 46735.11 4333760.44 0.00 0.00 1.00
    14:48:48 eth0 725160.00 2931731.00 46738.88 4334606.07 0.00 0.00 0.00
    Average: eth0 725290.40 2931658.20 46747.54 4334491.74 0.00 0.00 0.40
    $ vmstat 1 5
    procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
    r b swpd free buff cache si so bi bo in cs us sy id wa st
    4 0 0 259825920 45644 2708324 0 0 21 2 247 98 0 0 100 0 0
    4 0 0 259823744 45644 2708356 0 0 0 0 2400825 159843 0 19 81 0 0
    0 0 0 259824208 45644 2708072 0 0 0 0 2407351 159929 0 19 81 0 0
    1 0 0 259824592 45644 2708128 0 0 0 0 2405183 160386 0 19 80 0 0
    1 0 0 259824272 45644 2707868 0 0 0 32 2396361 158037 0 19 81 0 0

    Now use MQ+FQ :

    lpaa23:~# echo fq >/proc/sys/net/core/default_qdisc
    lpaa23:~# tc qdisc replace dev eth0 root mq

    $ sar -n DEV 1 5 | grep eth
    14:49:57 eth0 678614.00 2727930.00 43739.13 4033279.14 0.00 0.00 0.00
    14:49:58 eth0 677620.00 2723971.00 43674.69 4027429.62 0.00 0.00 1.00
    14:49:59 eth0 676396.00 2719050.00 43596.83 4020125.02 0.00 0.00 0.00
    14:50:00 eth0 675197.00 2714173.00 43518.62 4012938.90 0.00 0.00 1.00
    14:50:01 eth0 676388.00 2719063.00 43595.47 4020171.64 0.00 0.00 0.00
    Average: eth0 676843.00 2720837.40 43624.95 4022788.86 0.00 0.00 0.40
    $ vmstat 1 5
    procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
    r b swpd free buff cache si so bi bo in cs us sy id wa st
    2 0 0 259832240 46008 2710912 0 0 21 2 223 192 0 1 99 0 0
    1 0 0 259832896 46008 2710744 0 0 0 0 1702206 198078 0 17 82 0 0
    0 0 0 259830272 46008 2710596 0 0 0 0 1696340 197756 1 17 83 0 0
    4 0 0 259829168 46024 2710584 0 0 16 0 1688472 197158 1 17 82 0 0
    3 0 0 259830224 46024 2710408 0 0 0 0 1692450 197212 0 18 82 0 0

    As expected, number of interrupts per second is very different.

    Signed-off-by: Eric Dumazet
    Acked-by: Soheil Hassas Yeganeh
    Cc: Neal Cardwell
    Cc: Yuchung Cheng
    Cc: Van Jacobson
    Cc: Jerry Chu
    Signed-off-by: David S. Miller

    Eric Dumazet
     

09 May, 2017

1 commit

  • fq_alloc_node, alloc_netdev_mqs and netif_alloc* open code kmalloc with
    vmalloc fallback. Use the kvmalloc variant instead. Keep the
    __GFP_REPEAT flag based on explanation from Eric:

    "At the time, tests on the hardware I had in my labs showed that
    vmalloc() could deliver pages spread all over the memory and that was
    a small penalty (once memory is fragmented enough, not at boot time)"

    The way the code is constructed, however, means that we prefer to go
    and hit the OOM killer before we fall back to vmalloc for these
    requests.
    Acked-by: Vlastimil Babka
    Cc: Eric Dumazet
    Cc: David Miller
    Cc: Shakeel Butt
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     

18 Nov, 2016

1 commit

  • When I wrote sch_fq.c, hash_ptr() on 64bit arches was awful,
    and I chose hash_32().

    Linus Torvalds and George Spelvin fixed this issue, so we can
    use hash_ptr() to get more entropy on 64bit arches with Terabytes
    of memory, and avoid the cast games.

    Signed-off-by: Eric Dumazet
    Cc: Hugh Dickins
    Signed-off-by: David S. Miller

    Eric Dumazet
     

23 Sep, 2016

1 commit

    It looks like the following patch can make FQ very precise, even in VMs
    or on stressed hosts. It matters at high pacing rates.

    We take into account the difference between the time that was programmed
    when the last packet was sent, and the current time (a drift of tens of
    usecs is often observed).

    Add an EWMA of the unthrottle latency to help diagnostics.

    This latency is the difference between the current time and the oldest
    packet in the delayed RB-tree. This accounts for the high resolution
    timer latency, but can be different under stress, as fq_check_throttled()
    can opportunistically be called from a dequeue() that follows an
    enqueue() for a different flow.
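
    The shift-based EWMA update typically looks like the following sketch
    (field names are approximate):

    /* EWMA with weight 1/8: ewma <- ewma - ewma/8 + sample/8 */
    static void update_unthrottle_latency(struct fq_sched_data *q, u64 sample_ns)
    {
            q->unthrottle_latency_ns -= q->unthrottle_latency_ns >> 3;
            q->unthrottle_latency_ns += sample_ns >> 3;
    }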

    Tested:
    // Start a 10Gbit flow
    $ netperf --google-pacing-rate 1250000000 -H lpaa24 -l 10000 -- -K bbr &

    Before patch :
    $ sar -n DEV 10 5 | grep eth0 | grep Average
    Average: eth0 17106.04 756876.84 1102.75 1119049.02 0.00 0.00 0.52

    After patch :
    $ sar -n DEV 10 5 | grep eth0 | grep Average
    Average: eth0 17867.00 800245.90 1151.77 1183172.12 0.00 0.00 0.52

    A new iproute2 tc can output the 'unthrottle latency' :

    $ tc -s qd sh dev eth0 | grep latency
    0 gc, 0 highprio, 32490767 throttled, 2382 ns latency

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

21 Sep, 2016

1 commit

  • This commit adds to the fq module a low_rate_threshold parameter to
    insert a delay after all packets if the socket requests a pacing rate
    below the threshold.

    This helps achieve more precise control of the sending rate with
    low-rate paths, especially policers. The basic issue is that if a
    congestion control module detects a policer at a certain rate, it may
    want fq to be able to shape to that policed rate. That way the sender
    can avoid policer drops by having the packets arrive at the policer at
    or just under the policed rate.
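
    For a flow below the threshold, the per-packet gap is simply derived from
    the packet length and the pacing rate; a simplified sketch (the real fq
    code also handles quantum and credit interactions):

    /* Delay to apply after sending 'len' bytes at 'rate' bytes per second
     * (rate assumed non-zero). Applied after every packet once
     * rate < low_rate_threshold. */
    static u64 pkt_delay_ns(u32 len, u64 rate)
    {
            return div64_u64((u64)len * NSEC_PER_SEC, rate);
    }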

    The default threshold of 550Kbps was chosen analytically so that for
    policers or links at 500Kbps or 512Kbps fq would very likely invoke
    this mechanism, even if the pacing rate was briefly slightly above the
    available bandwidth. This value was then empirically validated with
    two years of production testing on YouTube video servers.

    Signed-off-by: Van Jacobson
    Signed-off-by: Neal Cardwell
    Signed-off-by: Yuchung Cheng
    Signed-off-by: Nandita Dukkipati
    Signed-off-by: Eric Dumazet
    Signed-off-by: Soheil Hassas Yeganeh
    Signed-off-by: David S. Miller

    Eric Dumazet
     

19 Sep, 2016

1 commit