24 Aug, 2020

1 commit

  • Replace the existing /* fall through */ comments and their variants with
    the new pseudo-keyword macro fallthrough[1]. Also, remove fall-through
    markings that are no longer necessary.

    [1] https://www.kernel.org/doc/html/v5.7/process/deprecated.html?highlight=fallthrough#implicit-switch-case-fall-through

    Signed-off-by: Gustavo A. R. Silva

    Gustavo A. R. Silva
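
    As a minimal illustration (a hypothetical switch statement, not code from
    the patch itself): where one case deliberately falls into the next, the
    old comment marking becomes the pseudo-keyword, which expands to the
    compiler's fall-through attribute where supported and to a no-op
    otherwise.

        switch (mode) {
        case MODE_A:
                setup_a();
                fallthrough;    /* was marked with a "fall through" comment */
        case MODE_B:
                setup_b();
                break;
        default:
                break;
        }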
     

06 Aug, 2020

1 commit

  • Pull networking updates from David Miller:

    1) Support 6 GHz band in ath11k driver, from Rajkumar Manoharan.

    2) Support UDP segmentation in core TSO code, from Eric Dumazet.

    3) Allow flashing different flash images in cxgb4 driver, from Vishal
    Kulkarni.

    4) Add drop frames counter and flow status to tc flower offloading,
    from Po Liu.

    5) Support n-tuple filters in cxgb4, from Vishal Kulkarni.

    6) Various new indirect call avoidance, from Eric Dumazet and Brian
    Vazquez.

    7) Fix BPF verifier failures on 32-bit pointer arithmetic, from
    Yonghong Song.

    8) Support querying and setting hardware address of a port function via
    devlink, use this in mlx5, from Parav Pandit.

    9) Support hw ipsec offload on bonding slaves, from Jarod Wilson.

    10) Switch qca8k driver over to phylink, from Jonathan McDowell.

    11) In bpftool, show list of processes holding BPF FD references to
    maps, programs, links, and btf objects. From Andrii Nakryiko.

    12) Several conversions over to generic power management, from Vaibhav
    Gupta.

    13) Add support for SO_KEEPALIVE et al. to bpf_setsockopt(), from Dmitry
    Yakunin.

    14) Various https url conversions, from Alexander A. Klimov.

    15) Timestamping and PHC support for mscc PHY driver, from Antoine
    Tenart.

    16) Support bpf iterating over tcp and udp sockets, from Yonghong Song.

    17) Support 5GBASE-T i40e NICs, from Aleksandr Loktionov.

    18) Add kTLS RX HW offload support to mlx5e, from Tariq Toukan.

    19) Fix the ->ndo_start_xmit() return type to be netdev_tx_t in several
    drivers. From Luc Van Oostenryck.

    20) XDP support for xen-netfront, from Denis Kirjanov.

    21) Support receive buffer autotuning in MPTCP, from Florian Westphal.

    22) Support EF100 chip in sfc driver, from Edward Cree.

    23) Add XDP support to mvpp2 driver, from Matteo Croce.

    24) Support MPTCP in sock_diag, from Paolo Abeni.

    25) Commonize UDP tunnel offloading code by creating udp_tunnel_nic
    infrastructure, from Jakub Kicinski.

    26) Several pci_ --> dma_ API conversions, from Christophe JAILLET.

    27) Add FLOW_ACTION_POLICE support to mlxsw, from Ido Schimmel.

    28) Add SK_LOOKUP bpf program type, from Jakub Sitnicki.

    29) Refactor a lot of networking socket option handling code in order to
    avoid set_fs() calls, from Christoph Hellwig.

    30) Add rfc4884 support to icmp code, from Willem de Bruijn.

    31) Support TBF offload in dpaa2-eth driver, from Ioana Ciornei.

    32) Support XDP_REDIRECT in qede driver, from Alexander Lobakin.

    33) Support PCI relaxed ordering in mlx5 driver, from Aya Levin.

    34) Support TCP syncookies in MPTCP, from Florian Westphal.

    35) Fix several tricky cases of PMTU handling wrt. bridging, from Stefano
    Brivio.

    * git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (2056 commits)
    net: thunderx: initialize VF's mailbox mutex before first usage
    usb: hso: remove bogus check for EINPROGRESS
    usb: hso: no complaint about kmalloc failure
    hso: fix bailout in error case of probe
    ip_tunnel_core: Fix build for archs without _HAVE_ARCH_IPV6_CSUM
    selftests/net: relax cpu affinity requirement in msg_zerocopy test
    mptcp: be careful on subflow creation
    selftests: rtnetlink: make kci_test_encap() return sub-test result
    selftests: rtnetlink: correct the final return value for the test
    net: dsa: sja1105: use detected device id instead of DT one on mismatch
    tipc: set ub->ifindex for local ipv6 address
    ipv6: add ipv6_dev_find()
    net: openvswitch: silence suspicious RCU usage warning
    Revert "vxlan: fix tos value before xmit"
    ptp: only allow phase values lower than 1 period
    farsync: switch from 'pci_' to 'dma_' API
    wan: wanxl: switch from 'pci_' to 'dma_' API
    hv_netvsc: do not use VF device if link is down
    dpaa2-eth: Fix passing zero to 'PTR_ERR' warning
    net: macb: Properly handle phylink on at91sam9x
    ...

    Linus Torvalds
     

05 Aug, 2020

1 commit

  • Pull uninitialized_var() macro removal from Kees Cook:
    "This is long overdue, and has hidden too many bugs over the years. The
    series has several "by hand" fixes, and then a trivial treewide
    replacement.

    - Clean up non-trivial uses of uninitialized_var()

    - Update documentation and checkpatch for uninitialized_var() removal

    - Treewide removal of uninitialized_var()"

    * tag 'uninit-macro-v5.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux:
    compiler: Remove uninitialized_var() macro
    treewide: Remove uninitialized_var() usage
    checkpatch: Remove awareness of uninitialized_var() macro
    mm/debug_vm_pgtable: Remove uninitialized_var() usage
    f2fs: Eliminate usage of uninitialized_var() macro
    media: sur40: Remove uninitialized_var() usage
    KVM: PPC: Book3S PR: Remove uninitialized_var() usage
    clk: spear: Remove uninitialized_var() usage
    clk: st: Remove uninitialized_var() usage
    spi: davinci: Remove uninitialized_var() usage
    ide: Remove uninitialized_var() usage
    rtlwifi: rtl8192cu: Remove uninitialized_var() usage
    b43: Remove uninitialized_var() usage
    drbd: Remove uninitialized_var() usage
    x86/mm/numa: Remove uninitialized_var() usage
    docs: deprecated.rst: Add uninitialized_var()

    Linus Torvalds
     

17 Jul, 2020

2 commits

  • This reverts commit aebe4426ccaa4838f36ea805cdf7d76503e65117.

    Signed-off-by: Petr Machata
    Signed-off-by: Jakub Kicinski

    Petr Machata
     
  • Using uninitialized_var() is dangerous as it papers over real bugs[1]
    (or can in the future), and suppresses unrelated compiler warnings
    (e.g. "unused variable"). If the compiler thinks it is uninitialized,
    either simply initialize the variable or make compiler changes.

    In preparation for removing[2] the[3] macro[4], remove all remaining
    needless uses with the following script:

    git grep '\buninitialized_var\b' | cut -d: -f1 | sort -u | \
    xargs perl -pi -e \
    's/\buninitialized_var\(([^\)]+)\)/\1/g;
    s:\s*/\* (GCC be quiet|to make compiler happy) \*/$::g;'

    drivers/video/fbdev/riva/riva_hw.c was manually tweaked to avoid
    pathological white-space.

    No outstanding warnings were found building allmodconfig with GCC 9.3.0
    for x86_64, i386, arm64, arm, powerpc, powerpc64le, s390x, mips, sparc64,
    alpha, and m68k.

    [1] https://lore.kernel.org/lkml/20200603174714.192027-1-glider@google.com/
    [2] https://lore.kernel.org/lkml/CA+55aFw+Vbj0i=1TGqCR5vQkCzWJ0QxK6CernOU6eedsudAixw@mail.gmail.com/
    [3] https://lore.kernel.org/lkml/CA+55aFwgbgqhbp1fkxvRKEpzyR5J8n1vKT1VZdz9knmPuXhOeg@mail.gmail.com/
    [4] https://lore.kernel.org/lkml/CA+55aFz2500WfbKXAx8s67wrm9=yVJu65TpLgN_ybYNv0VEOKA@mail.gmail.com/

    Reviewed-by: Leon Romanovsky # drivers/infiniband and mlx4/mlx5
    Acked-by: Jason Gunthorpe # IB
    Acked-by: Kalle Valo # wireless drivers
    Reviewed-by: Chao Yu # erofs
    Signed-off-by: Kees Cook

    Kees Cook
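
    For context, the removed macro was essentially the self-initialization
    trick below, which silenced "maybe-uninitialized" warnings without
    initializing anything (illustrative sketch; the variable name is made up):

        /* old definition (now removed) */
        #define uninitialized_var(x) x = x

        /* before: "int uninitialized_var(ret);" expands to "int ret = ret;"
         * and hides the warning */
        int uninitialized_var(ret);

        /* after: a plain declaration, so the compiler can flag a genuinely
         * missing initialization */
        int ret;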
     

11 Jul, 2020

1 commit


04 Jul, 2020

1 commit

  • There are a couple of places in net/sched/ that check skb->protocol and act
    on the value there. However, in the presence of VLAN tags, the value stored
    in skb->protocol can be inconsistent based on whether VLAN acceleration is
    enabled. The commit quoted in the Fixes tag below fixed the users of
    skb->protocol to use a helper that will always see the VLAN ethertype.

    However, most of the callers don't actually handle the VLAN ethertype, but
    expect to find the IP header type in the protocol field. This means that
    things like changing the ECN field, or parsing diffserv values, stops
    working if there's a VLAN tag, or if there are multiple nested VLAN
    tags (QinQ).

    To fix this, change the helper to take an argument that indicates whether
    the caller wants to skip the VLAN tags or not. When skipping VLAN tags, we
    make sure to skip all of them, so behaviour is consistent even in QinQ
    mode.

    To make the helper usable from the ECN code, move it to if_vlan.h instead
    of pkt_sched.h.

    v3:
    - Remove empty lines
    - Move vlan variable definitions inside loop in skb_protocol()
    - Also use skb_protocol() helper in IP{,6}_ECN_decapsulate() and
    bpf_skb_ecn_set_ce()

    v2:
    - Use eth_type_vlan() helper in skb_protocol()
    - Also fix code that reads skb->protocol directly
    - Change a couple of 'if/else if' statements to switch constructs to avoid
    calling the helper twice

    Reported-by: Ilya Ponetayev
    Fixes: d8b9605d2697 ("net: sched: fix skb->protocol use in case of accelerated vlan path")
    Signed-off-by: Toke Høiland-Jørgensen
    Signed-off-by: David S. Miller

    Toke Høiland-Jørgensen
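
    A sketch of what such a helper can look like (not the exact in-tree code;
    the real helper is added to if_vlan.h as described above). It walks any
    number of VLAN headers with read-only accesses and returns the inner
    ethertype:

        static inline __be16 skb_protocol(const struct sk_buff *skb, bool skip_vlan)
        {
                unsigned int offset = skb_mac_offset(skb) + sizeof(struct ethhdr);
                __be16 proto = skb->protocol;

                if (!skip_vlan)
                        return proto;

                while (eth_type_vlan(proto)) {
                        struct vlan_hdr vhdr, *vh;

                        vh = skb_header_pointer(skb, offset, sizeof(vhdr), &vhdr);
                        if (!vh)
                                break;

                        proto = vh->h_vlan_encapsulated_proto;
                        offset += sizeof(vhdr);
                }

                return proto;
        }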
     

30 Jun, 2020

1 commit

  • A following patch introduces qevents, points in the qdisc algorithm where
    a packet can be processed by user-defined filters. Should this processing
    lead to a situation where a new packet is to be enqueued on the same port,
    holding the root lock would lead to deadlocks. To solve the issue, qevent
    handler needs to unlock and relock the root lock when necessary.

    To that end, add the root lock argument to the qdisc op enqueue, and
    propagate throughout.

    Signed-off-by: Petr Machata
    Signed-off-by: David S. Miller

    Petr Machata
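
    Concretely, the qdisc enqueue operation is widened along these lines
    (sketch; the argument name is assumed):

        /* before */
        int (*enqueue)(struct sk_buff *skb, struct Qdisc *sch,
                       struct sk_buff **to_free);

        /* after: the root lock is passed down so a qevent handler can
         * release and re-acquire it when its filters trigger another
         * enqueue on the same port */
        int (*enqueue)(struct sk_buff *skb, struct Qdisc *sch,
                       spinlock_t *root_lock, struct sk_buff **to_free);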
     

26 Jun, 2020

5 commits

  • Minor overlapping changes in xfrm_device.c, between the double
    ESP trailing bug fix setting the XFRM_INIT flag and the changes
    in net-next preparing for bonding encryption support.

    Signed-off-by: David S. Miller

    David S. Miller
     
  • Change tin mapping on diffserv3, 4 & 8 for LE PHB support, in essence
    making LE a member of the Bulk tin.

    Bulk has the least priority and minimum of 1/16th total bandwidth in the
    face of higher priority traffic.

    NB: Diffserv 3 & 4 swap tin 0 & 1 priorities from the default order as
    found in diffserv8, in case anyone is wondering why it looks a bit odd.

    Signed-off-by: Kevin Darbyshire-Bryant
    [ reword commit message slightly ]
    Signed-off-by: Toke Høiland-Jørgensen
    Signed-off-by: David S. Miller

    Kevin Darbyshire-Bryant
     
  • I spotted a few nits when comparing the in-tree version of sch_cake with
    the out-of-tree one: A redundant error variable declaration shadowing an
    outer declaration, and an indentation alignment issue. Fix both of these.

    Fixes: 046f6fd5daef ("sched: Add Common Applications Kept Enhanced (cake) qdisc")
    Signed-off-by: Toke Høiland-Jørgensen
    Signed-off-by: David S. Miller

    Toke Høiland-Jørgensen
     
  • As a further optimisation of the diffserv parsing codepath, we can skip it
    entirely if CAKE is configured to neither use diffserv-based
    classification, nor to zero out the diffserv bits.

    Fixes: c87b4ecdbe8d ("sch_cake: Make sure we can write the IP header before changing DSCP bits")
    Signed-off-by: Toke Høiland-Jørgensen
    Signed-off-by: David S. Miller

    Toke Høiland-Jørgensen
     
  • cake_handle_diffserv() tries to linearize mac and network header parts of
    skb and to make it writable unconditionally. In some cases it leads to full
    skb reallocation, which reduces throughput and increases CPU load.
    Measurements of IPv4 forwarding + NAPT on a MIPS router with a 580 MHz
    single-core CPU were conducted. It appears that on kernel 4.9 skb_try_make_writable()
    reallocates skb, if skb was allocated in ethernet driver via so-called
    'build skb' method from page cache (it was discovered by strange increase
    of kmalloc-2048 slab at first).

    Obtain DSCP value via read-only skb_header_pointer() call, and leave
    linearization only for DSCP bleaching or ECN CE setting. And, as an
    additional optimisation, skip diffserv parsing entirely if it is not needed
    by the current configuration.

    Fixes: c87b4ecdbe8d ("sch_cake: Make sure we can write the IP header before changing DSCP bits")
    Signed-off-by: Ilya Ponetayev
    [ fix a few style issues, reflow commit message ]
    Signed-off-by: Toke Høiland-Jørgensen
    Signed-off-by: David S. Miller

    Ilya Ponetayev
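
    A sketch of the read-only pattern described above (hypothetical IPv4-only
    helper; the in-tree code also handles IPv6 and keeps the writable path for
    DSCP washing and ECN CE marking):

        static u8 dscp_read_only_ipv4(const struct sk_buff *skb)
        {
                struct iphdr _iph;
                const struct iphdr *iph;

                iph = skb_header_pointer(skb, skb_network_offset(skb),
                                         sizeof(_iph), &_iph);
                if (!iph)
                        return 0;

                /* upper six bits of the DS field, no skb reallocation */
                return ipv4_get_dsfield(iph) >> 2;
        }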
     

31 May, 2020

1 commit

  • While the other fq-based qdiscs take advantage of skb->hash and don't
    recompute it if it is already set, sch_cake does not.

    This was a deliberate choice because sch_cake hashes various parts of the
    packet header to support its advanced flow isolation modes. However,
    foregoing the use of skb->hash entirely loses a few important benefits:

    - When skb->hash is set by hardware, a few CPU cycles can be saved by not
    hashing again in software.

    - Tunnel encapsulations will generally preserve the value of skb->hash from
    before the encapsulation, which allows flow-based qdiscs to distinguish
    between flows even though the outer packet header no longer has flow
    information.

    It turns out that we can preserve these desirable properties in many cases,
    while still supporting the advanced flow isolation properties of sch_cake.
    This patch does so by reusing the skb->hash value as the flow_hash part of
    the hashing procedure in cake_hash() only in the following conditions:

    - If the skb->hash is marked as covering the flow headers (skb->l4_hash is
    set)

    AND

    - NAT header rewriting is either disabled, or did not change any values
    used for hashing. The latter is important to match local-origin packets
    such as those of a tunnel endpoint.

    The immediate motivation for fixing this was the recent patch to WireGuard
    to preserve the skb->hash on encapsulation. As such, this is also what I
    tested against; with this patch, added latency under load for competing
    flows drops from ~8 ms to sub-1ms on an RRUL test over a WireGuard tunnel
    going through a virtual link shaped to 1Gbps using sch_cake. This matches
    the results we saw with a similar setup using sch_fq_codel when testing the
    WireGuard patch.

    Fixes: 046f6fd5daef ("sched: Add Common Applications Kept Enhanced (cake) qdisc")
    Signed-off-by: Toke Høiland-Jørgensen
    Signed-off-by: David S. Miller

    Toke Høiland-Jørgensen
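
    Schematically, the flow-hash selection becomes something like the
    following (sketch; the function name and the nat_rewrote_tuple flag are
    assumptions, not the literal cake_hash() code):

        static u32 cake_pick_flow_hash(const struct sk_buff *skb,
                                       struct flow_keys *keys,
                                       bool nat_rewrote_tuple)
        {
                /* reuse the hardware/tunnel-provided hash only if it covers
                 * the flow headers and NAT did not change any hashed value */
                if (skb->l4_hash && !nat_rewrote_tuple)
                        return skb->hash;

                return flow_hash_from_keys(keys);
        }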
     

15 Jan, 2020

1 commit


10 Jan, 2020

1 commit


03 Jan, 2020

1 commit

  • The variable 'window_interval' is u64 and do_div()
    truncates it to 32 bits, which means it can test
    non-zero yet be truncated to zero for division.
    The unit of window_interval is nanoseconds,
    so its lower 32 bits are relatively easy to exceed.
    Fix this issue by using div64_u64() instead.

    Fixes: 7298de9cd725 ("sch_cake: Add ingress mode")
    Signed-off-by: Wen Yang
    Cc: Kevin Darbyshire-Bryant
    Cc: Toke Høiland-Jørgensen
    Cc: David S. Miller
    Cc: Cong Wang
    Cc: cake@lists.bufferbloat.net
    Cc: netdev@vger.kernel.org
    Cc: linux-kernel@vger.kernel.org
    Acked-by: Toke Høiland-Jørgensen
    Signed-off-by: David S. Miller

    Wen Yang
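
    An illustrative sketch of the difference (hypothetical rate helper, not
    the patch itself): do_div() only accepts a 32-bit divisor, so a nanosecond
    interval above roughly 4.29 seconds wraps, while div64_u64() divides two
    full 64-bit values.

        #include <linux/math64.h>
        #include <linux/time64.h>

        static u64 rate_bytes_per_sec(u64 bytes, u64 window_interval_ns)
        {
                if (!window_interval_ns)
                        return 0;

                /* do_div(x, y) would truncate window_interval_ns to 32 bits,
                 * possibly to zero; div64_u64() keeps the full divisor */
                return div64_u64(bytes * NSEC_PER_SEC, window_interval_ns);
        }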
     

19 Dec, 2019

1 commit

  • Turns out tin_quantum_prio isn't used anymore and is a leftover from a
    previous implementation of diffserv tins. Since the variable isn't used
    in any calculations it can be eliminated.

    Drop variable and places where it was set. Rename remaining variable
    and consolidate naming of intermediate variables that set it.

    Signed-off-by: Kevin Darbyshire-Bryant
    Acked-by: Toke Høiland-Jørgensen
    Signed-off-by: David S. Miller

    Kevin 'ldir' Darbyshire-Bryant
     

03 Dec, 2019

1 commit


28 Apr, 2019

2 commits

  • We currently have two levels of strict validation:

    1) liberal (default)
    - undefined (type >= max) & NLA_UNSPEC attributes accepted
    - attribute length >= expected accepted
    - garbage at end of message accepted
    2) strict (opt-in)
    - NLA_UNSPEC attributes accepted
    - attribute length >= expected accepted

    Split out parsing strictness into four different options:
    * TRAILING - check that there's no trailing data after parsing
    attributes (in message or nested)
    * MAXTYPE - reject attrs > max known type
    * UNSPEC - reject attributes with NLA_UNSPEC policy entries
    * STRICT_ATTRS - strictly validate attribute size

    The default for future things should be *everything*.
    The current *_strict() is a combination of TRAILING and MAXTYPE,
    and is renamed to _deprecated_strict().
    The current regular parsing has none of this, and is renamed to
    *_parse_deprecated().

    Additionally it allows us to selectively set one of the new flags
    even on old policies. Notably, the UNSPEC flag could be useful in
    this case, since it can be arranged (by filling in the policy) to
    not be an incompatible userspace ABI change, but would then, going
    forward, prevent forgetting attribute entries. The same can apply
    to the POLICY flag.

    We end up with the following renames:
    * nla_parse -> nla_parse_deprecated
    * nla_parse_strict -> nla_parse_deprecated_strict
    * nlmsg_parse -> nlmsg_parse_deprecated
    * nlmsg_parse_strict -> nlmsg_parse_deprecated_strict
    * nla_parse_nested -> nla_parse_nested_deprecated
    * nla_validate_nested -> nla_validate_nested_deprecated

    Using spatch, of course:
    @@
    expression TB, MAX, HEAD, LEN, POL, EXT;
    @@
    -nla_parse(TB, MAX, HEAD, LEN, POL, EXT)
    +nla_parse_deprecated(TB, MAX, HEAD, LEN, POL, EXT)

    @@
    expression NLH, HDRLEN, TB, MAX, POL, EXT;
    @@
    -nlmsg_parse(NLH, HDRLEN, TB, MAX, POL, EXT)
    +nlmsg_parse_deprecated(NLH, HDRLEN, TB, MAX, POL, EXT)

    @@
    expression NLH, HDRLEN, TB, MAX, POL, EXT;
    @@
    -nlmsg_parse_strict(NLH, HDRLEN, TB, MAX, POL, EXT)
    +nlmsg_parse_deprecated_strict(NLH, HDRLEN, TB, MAX, POL, EXT)

    @@
    expression TB, MAX, NLA, POL, EXT;
    @@
    -nla_parse_nested(TB, MAX, NLA, POL, EXT)
    +nla_parse_nested_deprecated(TB, MAX, NLA, POL, EXT)

    @@
    expression START, MAX, POL, EXT;
    @@
    -nla_validate_nested(START, MAX, POL, EXT)
    +nla_validate_nested_deprecated(START, MAX, POL, EXT)

    @@
    expression NLH, HDRLEN, MAX, POL, EXT;
    @@
    -nlmsg_validate(NLH, HDRLEN, MAX, POL, EXT)
    +nlmsg_validate_deprecated(NLH, HDRLEN, MAX, POL, EXT)

    For this patch, don't actually add the strict, non-renamed versions
    yet so that it breaks compile if I get it wrong.

    Also, while at it, make nla_validate and nla_parse go down to a
    common __nla_validate_parse() function to avoid code duplication.

    Ultimately, this allows us to have very strict validation for every
    new caller of nla_parse()/nlmsg_parse() etc as re-introduced in the
    next patch, while existing things will continue to work as is.

    In effect then, this adds fully strict validation for any new command.

    Signed-off-by: Johannes Berg
    Signed-off-by: David S. Miller

    Johannes Berg
     
  • Even though the NLA_F_NESTED flag was introduced more than 11 years ago, most
    netlink based interfaces (including recently added ones) are still not
    setting it in kernel generated messages. Without the flag, message parsers
    not aware of attribute semantics (e.g. wireshark dissector or libmnl's
    mnl_nlmsg_fprintf()) cannot recognize nested attributes and won't display
    the structure of their contents.

    Unfortunately we cannot just add the flag everywhere as there may be
    userspace applications which check nlattr::nla_type directly rather than
    through a helper masking out the flags. Therefore the patch renames
    nla_nest_start() to nla_nest_start_noflag() and introduces nla_nest_start()
    as a wrapper adding NLA_F_NESTED. The calls which add NLA_F_NESTED manually
    are rewritten to use nla_nest_start().

    Except for changes in include/net/netlink.h, the patch was generated using
    this semantic patch:

    @@ expression E1, E2; @@
    -nla_nest_start(E1, E2)
    +nla_nest_start_noflag(E1, E2)

    @@ expression E1, E2; @@
    -nla_nest_start_noflag(E1, E2 | NLA_F_NESTED)
    +nla_nest_start(E1, E2)

    Signed-off-by: Michal Kubecek
    Acked-by: Jiri Pirko
    Acked-by: David Ahern
    Signed-off-by: David S. Miller

    Michal Kubecek
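
    The resulting pair of helpers is essentially (sketch based on the
    description above):

        /* old behaviour, now explicit: the caller controls the flag */
        struct nlattr *nla_nest_start_noflag(struct sk_buff *skb, int attrtype);

        /* new default: nested attributes always carry NLA_F_NESTED */
        static inline struct nlattr *nla_nest_start(struct sk_buff *skb,
                                                    int attrtype)
        {
                return nla_nest_start_noflag(skb, attrtype | NLA_F_NESTED);
        }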
     

05 Apr, 2019

2 commits


16 Mar, 2019

1 commit

  • We initially interpreted the fwmark parameter as a flag that simply turned
    on the feature, using the whole skb->mark field as the index into the CAKE
    tin_order array. However, it is quite common for different applications to
    use different parts of the mark field for their own purposes, each using a
    different mask.

    Support this use of subsets of the mark by interpreting the TCA_CAKE_FWMARK
    parameter as a bitmask to apply to the fwmark field when reading it. The
    result will be right-shifted by the number of unset lower bits of the mask
    before looking up the tin.

    In the original commit message we also failed to credit Felix Resch with
    originally suggesting the fwmark feature back in 2017; so the Suggested-By
    in this commit covers the whole fwmark feature.

    Fixes: 0b5c7efdfc6e ("sch_cake: Permit use of connmarks as tin classifiers")
    Suggested-by: Felix Resch
    Signed-off-by: Toke Høiland-Jørgensen
    Signed-off-by: David S. Miller

    Toke Høiland-Jørgensen
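
    In effect, the tin lookup from the mark becomes (sketch; the helper name
    is hypothetical):

        static u32 cake_fwmark_tin_index(u32 mark, u32 fwmark_mask)
        {
                if (!fwmark_mask)
                        return 0;

                /* keep only the configured subset of the mark, then shift it
                 * down by the number of unset lower bits of the mask */
                return (mark & fwmark_mask) >> __ffs(fwmark_mask);
        }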
     

04 Mar, 2019

3 commits

  • With more modes added the logic in cake_select_tin() was getting a bit
    hairy, and it turns out we can actually simplify it quite a bit. This also
    allows us to get rid of one of the two diffserv parsing functions, which
    has the added benefit that already-zeroed DSCP fields won't get re-written.

    Suggested-by: Kevin Darbyshire-Bryant
    Signed-off-by: Toke Høiland-Jørgensen
    Signed-off-by: David S. Miller

    Toke Høiland-Jørgensen
     
  • Add flag 'FWMARK' to enable use of firewall connmarks as tin selector.
    The connmark (skbuff->mark) needs to be in the range 1->tin_cnt, i.e.
    for diffserv3 the mark needs to be 1->3.

    Background

    Typically CAKE uses DSCP as the basis for tin selection. DSCP values
    are relatively easily changed as part of the egress path, usually with
    iptables & the mangle table; ingress is more challenging. CAKE is often
    used on the WAN interface of a residential gateway where passthrough of
    DSCP from the ISP is either missing or set to unhelpful values, so using
    ingress DSCP values for tin selection isn't helpful in that
    environment.

    An approach to solving the ingress tin selection problem is to use
    CAKE's understanding of tc filters. Naive tc filters could match on
    source/destination port numbers and force tin selection that way, but
    multiple filters don't scale particularly well as each filter must be
    traversed whether it matches or not. e.g. a simple example to map 3
    firewall marks to tins:

    MAJOR=$( tc qdisc show dev $DEV | head -1 | awk '{print $3}' )
    tc filter add dev $DEV parent $MAJOR protocol all handle 0x01 fw action skbedit priority ${MAJOR}1
    tc filter add dev $DEV parent $MAJOR protocol all handle 0x02 fw action skbedit priority ${MAJOR}2
    tc filter add dev $DEV parent $MAJOR protocol all handle 0x03 fw action skbedit priority ${MAJOR}3

    Another option is to use eBPF cls_act with tc filters e.g.

    MAJOR=$( tc qdisc show dev $DEV | head -1 | awk '{print $3}' )
    tc filter add dev $DEV parent $MAJOR bpf da obj my-bpf-fwmark-to-class.o

    This has the disadvantages of a) needing someone to write & maintain
    the bpf program, b) a bpf toolchain to compile it and c) needing to
    hardcode the major number in the bpf program so it matches the cake
    instance (or forcing the cake instance to a particular major number)
    since the major number cannot be passed to the bpf program via tc
    command line.

    As already hinted at by the previous examples, it would be helpful
    to associate tins with something that survives the Internet path and
    ideally allows tin selection on both egress and ingress. Netfilter's
    conntrack permits setting an identifying mark on a connection which
    can also be restored to an ingress packet with tc action connmark e.g.

    tc filter add dev eth0 parent ffff: protocol all prio 10 u32 \
    match u32 0 0 flowid 1:1 action connmark action mirred egress redirect dev ifb1

    Since tc's connmark action restores the connmark into skb->mark, all of
    the previous solutions build on it and, in one form or another, copy that
    mark to the skb->priority field, where CAKE again picks it up.

    This change cuts out at least one of the (less intuitive &
    non-scalable) middlemen and permits direct access to skb->mark.

    Signed-off-by: Kevin Darbyshire-Bryant
    Signed-off-by: Toke Høiland-Jørgensen
    Signed-off-by: David S. Miller

    Kevin Darbyshire-Bryant
     
  • CAKE host fairness does not work well with TCP flows in dual-srchost and
    dual-dsthost setup. The reason is that ACKs generated by TCP flows are
    classified as sparse flows, and affect flow isolation from other hosts. Fix
    this by calculating host_load based only on the bulk flows a host
    generates. In a hash collision the host_bulk_flow_count values must be
    decremented on the old hosts and incremented on the new ones *if* the queue
    is in the bulk set.

    Reported-by: Pete Heist
    Signed-off-by: George Amanakis
    Signed-off-by: Toke Høiland-Jørgensen
    Signed-off-by: David S. Miller

    George Amanakis
     

16 Jan, 2019

1 commit


13 Oct, 2018

1 commit


06 Oct, 2018

1 commit

  • As done treewide earlier, this catches several more open-coded
    allocation size calculations that were added to the kernel during the
    merge window. This performs the following mechanical transformations
    using Coccinelle:

    kvmalloc(a * b, ...) -> kvmalloc_array(a, b, ...)
    kvzalloc(a * b, ...) -> kvcalloc(a, b, ...)
    devm_kzalloc(..., a * b, ...) -> devm_kcalloc(..., a, b, ...)

    Signed-off-by: Kees Cook

    Kees Cook
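
    A representative before/after pair covered by the transformation
    (hypothetical call site):

        /* before: the open-coded multiplication can overflow silently */
        table = kvzalloc(nents * sizeof(*table), GFP_KERNEL);

        /* after: kvcalloc() checks the multiplication for overflow */
        table = kvcalloc(nents, sizeof(*table), GFP_KERNEL);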
     

11 Sep, 2018

1 commit


23 Aug, 2018

1 commit

  • The TC filter flow mapping override completely skipped the call to
    cake_hash(); however, this meant that the internal state was not being
    updated, which could ultimately lead to deadlocks in some configurations. Fix
    that by passing the overridden flow ID into cake_hash() instead so it can
    react appropriately.

    In addition, the major number of the class ID can now be set to override
    the host mapping in host isolation mode. If both host and flow are
    overridden (or if the respective modes are disabled), flow dissection and
    hashing will be skipped entirely; otherwise, the hashing will be kept for
    the portions that are not set by the filter.

    Signed-off-by: Toke Høiland-Jørgensen
    Signed-off-by: David S. Miller

    Toke Høiland-Jørgensen
     

22 Aug, 2018

1 commit


28 Jul, 2018

1 commit

  • This patch restores cake's deployed behavior at line rate to always
    split gso, and makes gso splitting configurable from userspace.

    running cake unlimited (unshaped) at 1gigE, local traffic:

    no-split-gso bql limit: 131966
    split-gso bql limit: ~42392-45420

    On this 4-stream test, splitting gso apart results in halving the
    observed interpacket latency with no loss in throughput.

    Summary of tcp_nup test run 'gso-split' (at 2018-07-26 16:03:51.824728):

    avg median # data pts
    Ping (ms) ICMP : 0.83 0.81 ms 341
    TCP upload avg : 235.43 235.39 Mbits/s 301
    TCP upload sum : 941.71 941.56 Mbits/s 301
    TCP upload::1 : 235.45 235.43 Mbits/s 271
    TCP upload::2 : 235.45 235.41 Mbits/s 289
    TCP upload::3 : 235.40 235.40 Mbits/s 288
    TCP upload::4 : 235.41 235.40 Mbits/s 291

    versus

    Summary of tcp_nup test run 'no-split-gso' (at 2018-07-26 16:37:23.563960):

    avg median # data pts
    Ping (ms) ICMP : 1.67 1.73 ms 348
    TCP upload avg : 234.56 235.37 Mbits/s 301
    TCP upload sum : 938.24 941.49 Mbits/s 301
    TCP upload::1 : 234.55 235.38 Mbits/s 285
    TCP upload::2 : 234.57 235.37 Mbits/s 286
    TCP upload::3 : 234.58 235.37 Mbits/s 274
    TCP upload::4 : 234.54 235.42 Mbits/s 288

    Signed-off-by: David S. Miller

    Dave Taht
     

17 Jul, 2018

1 commit

  • In diffserv mode, CAKE stores tins in a different order internally than
    the logical order exposed to userspace. The order remapping was missing
    in the handling of 'tc filter' priority mappings through skb->priority,
    resulting in bulk and best effort mappings being reversed relative to
    how they are displayed.

    Fix this by adding the missing mapping when reading skb->priority.

    Fixes: 83f8fd69af4f ("sch_cake: Add DiffServ handling")
    Signed-off-by: Toke Høiland-Jørgensen
    Signed-off-by: David S. Miller

    Toke Høiland-Jørgensen
     

11 Jul, 2018

5 commits

  • At lower bandwidths, the transmission time of a single GSO segment can add
    an unacceptable amount of latency due to HOL blocking. Furthermore, with a
    software shaper, any tuning mechanism employed by the kernel to control the
    maximum size of GSO segments is thrown off by the artificial limit on
    bandwidth. For this reason, we split GSO segments into their individual
    packets iff the shaper is active and configured to a bandwidth <= 1 Gbps.

    Signed-off-by: Toke Høiland-Jørgensen
    Signed-off-by: David S. Miller

    Toke Høiland-Jørgensen
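
    A condensed sketch of the splitting path (based on the description above;
    locals, accounting and error handling are simplified):

        struct sk_buff *segs, *nskb;

        segs = skb_gso_segment(skb, netif_skb_features(skb) & ~NETIF_F_GSO_MASK);
        if (IS_ERR_OR_NULL(segs))
                return qdisc_drop(skb, sch, to_free);

        while (segs) {
                nskb = segs->next;
                segs->next = NULL;
                /* ...enqueue each segment as an individual packet... */
                segs = nskb;
        }
        consume_skb(skb);   /* release the original GSO super-packet */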
     
  • This commit adds configurable overhead compensation support to the rate
    shaper. With this feature, userspace can configure the actual bottleneck
    link overhead and encapsulation mode used, which will be used by the shaper
    to calculate the precise duration of each packet on the wire.

    This feature is needed because CAKE is often deployed one or two hops
    upstream of the actual bottleneck (which can be, e.g., inside a DSL or
    cable modem). In this case, the link layer characteristics and overhead
    reported by the kernel do not match the actual bottleneck. Being able to
    set the actual values in use makes it possible to configure the shaper rate
    much closer to the actual bottleneck rate (our experience shows it is
    possible to get within 0.1% of the actual physical bottleneck rate), thus
    keeping latency low without sacrificing bandwidth.

    The overhead compensation has three tunables: A fixed per-packet overhead
    size (which, if set, will be accounted from the IP packet header), a
    minimum packet size (MPU) and a framing mode supporting either ATM or PTM
    framing. We include a set of common keywords in TC to help users configure
    the right parameters. If no overhead value is set, the value reported by
    the kernel is used.

    Signed-off-by: Toke Høiland-Jørgensen
    Signed-off-by: David S. Miller

    Toke Høiland-Jørgensen
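
    As a worked sketch of the framing compensation (assumed helper; it follows
    the ATM convention of 53-byte cells carrying 48 payload bytes each, with
    PTM and the plain per-packet overhead case handled analogously):

        static u32 wire_len_atm(u32 ip_len, u32 overhead, u32 mpu)
        {
                u32 len = max(ip_len + overhead, mpu);

                /* round up to whole 53-byte ATM cells (48 payload bytes) */
                return DIV_ROUND_UP(len, 48) * 53;
        }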
     
  • This adds support for DiffServ-based priority queueing to CAKE. If the
    shaper is in use, each priority tier gets its own virtual clock, which
    limits that tier's rate to a fraction of the overall shaped rate, to
    discourage trying to game the priority mechanism.

    CAKE defaults to a simple, three-tier mode that interprets most code points
    as "best effort", but places CS1 traffic into a low-priority "bulk" tier
    which is assigned 1/16 of the total rate, and a few code points indicating
    latency-sensitive or control traffic (specifically TOS4, VA, EF, CS6, CS7)
    into a "latency sensitive" high-priority tier, which is assigned 1/4 rate.
    The other supported DiffServ modes are a 4-tier mode matching the 802.11e
    precedence rules, as well as two 8-tier modes, one of which implements
    strict precedence of the eight priority levels.

    This commit also adds an optional DiffServ 'wash' mode, which will zero out
    the DSCP fields of any packet passing through CAKE. While this can
    technically be done with other mechanisms in the kernel, having the feature
    available in CAKE significantly decreases configuration complexity; and the
    implementation cost is low on top of the other DiffServ-handling code.

    Filters and applications can set the skb->priority field to override the
    DSCP-based classification into tiers. If TC_H_MAJ(skb->priority) matches
    CAKE's qdisc handle, the minor number will be interpreted as a priority
    tier if it is less than or equal to the number of configured priority
    tiers.

    Signed-off-by: Toke Høiland-Jørgensen
    Signed-off-by: David S. Miller

    Toke Høiland-Jørgensen
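
    The skb->priority override described above boils down to a check of this
    shape (sketch; the tin_cnt/tin_order fields and the DSCP fallback helper
    are assumptions):

        u32 prio = skb->priority;

        if (TC_H_MAJ(prio) == sch->handle &&
            TC_H_MIN(prio) > 0 &&
            TC_H_MIN(prio) <= q->tin_cnt)
                tin = q->tin_order[TC_H_MIN(prio) - 1];
        else
                tin = tin_from_dscp(skb);   /* hypothetical DSCP-based path */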
     
  • When CAKE is deployed on a gateway that also performs NAT (which is a
    common deployment mode), the host fairness mechanism cannot distinguish
    internal hosts from each other, and so fails to work correctly.

    To fix this, we add an optional NAT awareness mode, which will query the
    kernel conntrack mechanism to obtain the pre-NAT addresses for each packet
    and use that in the flow and host hashing.

    When the shaper is enabled and the host is already performing NAT, the cost
    of this lookup is negligible. However, in unlimited mode with no NAT being
    performed, there is a significant CPU cost at higher bandwidths. For this
    reason, the feature is turned off by default.

    Cc: netfilter-devel@vger.kernel.org
    Signed-off-by: Toke Høiland-Jørgensen
    Signed-off-by: David S. Miller

    Toke Høiland-Jørgensen
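
    A sketch of the conntrack query (simplified; the real code also handles
    the reply direction, ports and IPv6, and the src_ip/dst_ip locals are
    assumptions):

        enum ip_conntrack_info ctinfo;
        const struct nf_conn *ct = nf_ct_get(skb, &ctinfo);

        if (ct) {
                const struct nf_conntrack_tuple *t =
                        &ct->tuplehash[IP_CT_DIR_ORIGINAL].tuple;

                /* pre-NAT addresses as seen from the internal host */
                src_ip = t->src.u3.ip;
                dst_ip = t->dst.u3.ip;
        }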
     
  • The ACK filter is an optional feature of CAKE which is designed to improve
    performance on links with very asymmetrical rate limits. On such links
    (which are unfortunately quite prevalent, especially for DSL and cable
    subscribers), the downstream throughput can be limited by the number of
    ACKs capable of being transmitted in the *upstream* direction.

    Filtering ACKs can, in general, have adverse effects on TCP performance
    because it interferes with ACK clocking (especially in slow start), and it
    reduces the flow's resiliency to ACKs being dropped further along the path.
    To alleviate these drawbacks, the ACK filter in CAKE tries its best to
    always keep enough ACKs queued to ensure forward progress in the TCP flow
    being filtered. It does this by only filtering redundant ACKs. In its
    default 'conservative' mode, the filter will always keep at least two
    redundant ACKs in the queue, while in 'aggressive' mode, it will filter
    down to a single ACK.

    The ACK filter works by inspecting the per-flow queue on every packet
    enqueue. Starting at the head of the queue, the filter looks for another
    eligible packet to drop (so the ACK being dropped is always closer to the
    head of the queue than the packet being enqueued). An ACK is eligible only
    if it ACKs *fewer* bytes than the new packet being enqueued, including any
    SACK options. This prevents duplicate ACKs from being filtered, to avoid
    interfering with retransmission logic. In addition, we check TCP header
    options and only drop those that are known to not interfere with sender
    state. In particular, packets with unknown option codes are never dropped.

    In aggressive mode, an eligible packet is always dropped, while in
    conservative mode, at least two ACKs are kept in the queue. Only pure ACKs
    (with no data segments) are considered eligible for dropping, but when an
    ACK with data segments is enqueued, this can cause another pure ACK to
    become eligible for dropping.

    The approach described above ensures that this ACK filter avoids most of
    the drawbacks of a naive filtering mechanism that only keeps flow state but
    does not inspect the queue. This is the rationale for including the ACK
    filter in CAKE itself rather than as separate module (as the TC filter, for
    instance).

    Our performance evaluation has shown that on a 30/1 Mbps link with a
    bidirectional traffic test (RRUL), turning on the ACK filter on the
    upstream link improves downstream throughput by ~20% (both modes) and
    upstream throughput by ~12% in conservative mode and ~40% in aggressive
    mode, at the cost of ~5ms of inter-flow latency due to the increased
    congestion.

    In *really* pathological cases, the effect can be a lot more; for instance,
    the ACK filter increases the achievable downstream throughput on a link
    with 100 Kbps in the upstream direction by an order of magnitude (from ~2.5
    Mbps to ~25 Mbps).

    Finally, even though we consider the ACK filter to be safer than most, we
    do not recommend turning it on everywhere: on more symmetrical link
    bandwidths the effect is negligible at best.

    Cc: Yuchung Cheng
    Cc: Neal Cardwell
    Signed-off-by: Toke Høiland-Jørgensen
    Signed-off-by: David S. Miller

    Toke Høiland-Jørgensen
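
    A reduced sketch of the eligibility test applied while walking the
    per-flow queue (assumed helper; the in-tree filter additionally compares
    SACK blocks and other TCP options before declaring an ACK redundant):

        static bool cake_ack_redundant(const struct tcphdr *queued,
                                       const struct tcphdr *incoming)
        {
                /* reject segments carrying SYN/FIN/RST/URG; the pure-ACK
                 * (no payload) check is omitted in this sketch */
                if (!queued->ack || queued->syn || queued->fin ||
                    queued->rst || queued->urg)
                        return false;

                /* the queued ACK must acknowledge strictly fewer bytes than
                 * the one being enqueued, so dropping it loses no information
                 * and duplicate ACKs are never filtered */
                return before(ntohl(queued->ack_seq), ntohl(incoming->ack_seq));
        }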