09 May, 2019

1 commit

  • Avoid freeing cls_mall.rule twice when failing to set up the flow_action
    offload used in the hardware intermediate representation. This is
    achieved by returning 0 when the setup fails but the skip-software
    flag has not been set.

    Fixes: f00cbf196814 ("net/sched: use the hardware intermediate representation for matchall")
    Reported-by: Dan Carpenter
    Signed-off-by: Pieter Jansen van Vuuren
    Reviewed-by: Jakub Kicinski
    Signed-off-by: David S. Miller

    Pieter Jansen van Vuuren
     

08 May, 2019

5 commits

  • Minor conflict with the DSA legacy code removal.

    Signed-off-by: David S. Miller

    David S. Miller
     
  • Based on feedback from Jiri, avoid carrying a pointer to the tcf_block
    structure in the tc_cls_common_offload structure. Instead, store
    a flag in driver private data which indicates whether offloads apply
    to a shared block at block binding time.

    Suggested-by: Jiri Pirko
    Signed-off-by: Pieter Jansen van Vuuren
    Reviewed-by: Jakub Kicinski
    Acked-by: Jiri Pirko
    Signed-off-by: David S. Miller

    Pieter Jansen van Vuuren
     
  • The call to nla_nest_start_noflag() can return a NULL pointer, and
    currently this is not checked, which can lead to a NULL pointer
    dereference when the NULL sched_nest pointer is passed to
    nla_nest_end(). Fix this by adding a NULL pointer check.

    Addresses-Coverity: ("Dereference null return value")
    Fixes: a3d43c0d56f1 ("taprio: Add support adding an admin schedule")
    Signed-off-by: Colin Ian King
    Acked-by: Cong Wang
    Signed-off-by: David S. Miller

    Colin Ian King
     
  • The FQ packet scheduler assumed that packets could be classified
    based on their owning socket.

    This means that if a UDP server uses one UDP socket to send
    packets to different destinations, the packets all land
    in one FQ flow.

    This is unfair, since each TCP flow has a unique bucket, meaning
    that under pressure (fully utilised uplink), TCP flows
    get a larger share of the bandwidth.

    If we instead detect unconnected sockets, we can use a stochastic
    hash based on the 4-tuple hash.

    This also means a QUIC server using one UDP socket will properly
    spread the outgoing packets to different buckets, and in-kernel
    pacing based on EDT model will no longer risk having big rb-tree on
    one flow.

    Note that a UDP application might provide the skb->hash in an
    ancillary message at sendmsg() time to avoid the cost of a dissection
    in the fq packet scheduler.
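
    The stochastic classification described above can be sketched as follows
    (a toy model only: the bucket count, the hash function, and the string key
    are all illustrative; the kernel reuses skb->hash / flow dissector output):

```python
import hashlib

FQ_BUCKETS = 1024  # illustrative bucket count, not fq's actual default

def flow_bucket(saddr, sport, daddr, dport):
    """Toy model: map a packet's 4-tuple to a flow bucket, as fq does
    for packets from unconnected sockets."""
    key = f"{saddr}:{sport}->{daddr}:{dport}".encode()
    return int.from_bytes(hashlib.sha256(key).digest()[:4], "big") % FQ_BUCKETS
```

    With this scheme, a single unconnected UDP (or QUIC) socket sending to two
    destinations yields two different 4-tuples and therefore, in general, two
    different buckets, instead of one oversized flow.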

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     
  • The TCP stack makes sure packets for a given flow are monotonically
    increasing, but we want to allow UDP packets to use EDT as
    well, so that QUIC servers can use in-kernel pacing.

    This patch adds a per-flow rb-tree on which packets might
    be stored. We still try to use the linear list for the
    typical cases where packets are queued with monotonically
    increasing skb->tstamp, since queueing/dequeuing packets on
    a standard list is O(1).

    Note that the ability to store packets in arbitrary EDT
    order will later allow us to implement a per-TCP-socket
    mechanism adding delays (eventually with jitter) and reorders,
    to implement convenient network emulators.
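
    The two-level queue described above can be sketched in Python (a simplified
    model: the kernel's per-flow rb-tree keyed on skb->tstamp is stood in for by
    a sorted list, and the class and field names here are made up):

```python
import bisect

class FlowQueue:
    """Sketch of a flow queue: a FIFO list for monotonically increasing
    timestamps, plus a sorted overflow for out-of-order ones."""
    def __init__(self):
        self.fifo = []   # O(1) append while tstamps stay monotonic
        self.ooo = []    # sorted list standing in for the rb-tree

    def enqueue(self, tstamp, pkt):
        if not self.fifo or tstamp >= self.fifo[-1][0]:
            self.fifo.append((tstamp, pkt))
        else:
            bisect.insort(self.ooo, (tstamp, pkt))

    def dequeue(self):
        # deliver earliest-deadline-first across both structures
        if self.ooo and (not self.fifo or self.ooo[0][0] <= self.fifo[0][0]):
            return self.ooo.pop(0)
        return self.fifo.pop(0) if self.fifo else None
```

    The common case (monotonic timestamps) never touches the sorted structure,
    which is the point of keeping the linear list around.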

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

06 May, 2019

10 commits

  • In commit 3c75f6ee139d ("net_sched: sch_htb: add per class overlimits counter")
    we added an overlimits counter for each HTB class which could
    properly reflect how many times we use up all the bandwidth
    on each class. However, the overlimits counter in the HTB qdisc
    does not; it is far bigger than the sum of the per-class counters.
    In fact, this qdisc overlimits counter increases when we have
    no skb to dequeue, which happens more often than running out of
    bandwidth.

    It makes more sense to make this qdisc overlimits counter simply
    the sum of the per-class counters, so that people are no longer confused.

    I have verified this patch with one single HTB class, where HTB
    qdisc counters now always match HTB class counters as expected.

    Eric suggested we could fold this field into 'direct_pkts', as
    we only use 32 bits of it on 64-bit CPUs; this saves one cache line.

    Cc: Eric Dumazet
    Signed-off-by: Cong Wang
    Reviewed-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Cong Wang
     
  • Some actions, like the police action, are stateful and could share state
    between devices. This is incompatible with offloading to multiple devices,
    and drivers might want to test for shared blocks when offloading.
    Store a pointer to the tcf_block structure in the tc_cls_common_offload
    structure to allow drivers to determine when offloads apply to a shared
    block.

    Signed-off-by: Pieter Jansen van Vuuren
    Reviewed-by: Jakub Kicinski
    Signed-off-by: David S. Miller

    Pieter Jansen van Vuuren
     
  • Implement the stats_update callback for the police action that
    will be used by drivers for hardware offload.

    Signed-off-by: Pieter Jansen van Vuuren
    Reviewed-by: Jakub Kicinski
    Acked-by: Jiri Pirko
    Signed-off-by: David S. Miller

    Pieter Jansen van Vuuren
     
  • Introduce a new command for matchall classifiers that allows hardware
    to update statistics.

    Signed-off-by: Pieter Jansen van Vuuren
    Reviewed-by: Jakub Kicinski
    Acked-by: Jiri Pirko
    Signed-off-by: David S. Miller

    Pieter Jansen van Vuuren
     
  • Add police action to the hardware intermediate representation which
    would subsequently allow it to be used by drivers for offload.

    Signed-off-by: Pieter Jansen van Vuuren
    Reviewed-by: Jakub Kicinski
    Acked-by: Jiri Pirko
    Signed-off-by: David S. Miller

    Pieter Jansen van Vuuren
     
  • Move the tcf_police_params, tcf_police and tc_police_compat structures to a
    header, making them usable by other code, for example drivers that would
    offload police actions to hardware.

    Signed-off-by: Pieter Jansen van Vuuren
    Reviewed-by: Jakub Kicinski
    Signed-off-by: David S. Miller

    Pieter Jansen van Vuuren
     
  • Clean up unused functions and variables after porting to the newer
    intermediate representation.

    Signed-off-by: Pieter Jansen van Vuuren
    Reviewed-by: Jakub Kicinski
    Acked-by: Jiri Pirko
    Signed-off-by: David S. Miller

    Pieter Jansen van Vuuren
     
  • Extend matchall offload to make use of the hardware intermediate
    representation. More specifically, this patch moves the native TC
    actions in cls_matchall offload to the newer flow_action
    representation. This ultimately allows us to avoid a direct
    dependency on native TC actions for matchall.

    Signed-off-by: Pieter Jansen van Vuuren
    Reviewed-by: Jakub Kicinski
    Acked-by: Jiri Pirko
    Signed-off-by: David S. Miller

    Pieter Jansen van Vuuren
     
  • Add sample action to the hardware intermediate representation model which
    would subsequently allow it to be used by drivers for offload.

    Signed-off-by: Pieter Jansen van Vuuren
    Reviewed-by: Jakub Kicinski
    Acked-by: Jiri Pirko
    Signed-off-by: David S. Miller

    Pieter Jansen van Vuuren
     
  • When a cgroup classifier is added, there is a small time interval in
    which tp->root is NULL. If we receive a packet in this small time slice,
    a NULL pointer dereference will happen, leading to a kernel panic:

    # mkdir /sys/fs/cgroup/net_cls/0
    # echo 0x100001 > /sys/fs/cgroup/net_cls/0/net_cls.classid
    # echo $$ >/sys/fs/cgroup/net_cls/0/tasks
    # ping -qfb 255.255.255.255 -I eth0 &>/dev/null &
    # tc qdisc add dev eth0 root handle 10: htb
    # while : ; do
    > tc filter add dev eth0 parent 10: protocol ip prio 10 handle 1: cgroup
    > tc filter delete dev eth0
    > done
    Unable to handle kernel NULL pointer dereference at virtual address 0000000000000028
    Mem abort info:
    ESR = 0x96000005
    Exception class = DABT (current EL), IL = 32 bits
    SET = 0, FnV = 0
    EA = 0, S1PTW = 0
    Data abort info:
    ISV = 0, ISS = 0x00000005
    CM = 0, WnR = 0
    user pgtable: 4k pages, 39-bit VAs, pgdp = 0000000098a7ff91
    [0000000000000028] pgd=0000000000000000, pud=0000000000000000
    Internal error: Oops: 96000005 [#1] SMP
    Modules linked in: sch_htb cls_cgroup algif_hash af_alg nls_iso8859_1 nls_cp437 vfat fat xhci_plat_hcd m25p80 spi_nor xhci_hcd mtd usbcore usb_common spi_orion sfp i2c_mv64xxx phy_generic mdio_i2c marvell10g i2c_core mvpp2 mvmdio phylink sbsa_gwdt ip_tables x_tables autofs4
    Process ping (pid: 5421, stack limit = 0x00000000b20b1505)
    CPU: 3 PID: 5421 Comm: ping Not tainted 5.1.0-rc6 #31
    Hardware name: Marvell 8040 MACCHIATOBin Double-shot (DT)
    pstate: 60000005 (nZCv daif -PAN -UAO)
    pc : cls_cgroup_classify+0x80/0xec [cls_cgroup]
    lr : cls_cgroup_classify+0x34/0xec [cls_cgroup]
    sp : ffffff8012e6b850
    x29: ffffff8012e6b850 x28: ffffffc423dd3c00
    x27: ffffff801093ebc0 x26: ffffffc425a85b00
    x25: 0000000020000000 x24: 0000000000000000
    x23: ffffff8012e6b910 x22: ffffffc428db4900
    x21: ffffff8012e6b910 x20: 0000000000100001
    x19: 0000000000000000 x18: 0000000000000000
    x17: 0000000000000000 x16: 0000000000000000
    x15: 0000000000000000 x14: 0000000000000000
    x13: 0000000000000000 x12: 000000000000001c
    x11: 0000000000000018 x10: ffffff8012e6b840
    x9 : 0000000000003580 x8 : 000000000000009d
    x7 : 0000000000000002 x6 : ffffff8012e6b860
    x5 : 000000007cd66ffe x4 : 000000009742a193
    x3 : ffffff800865b4d8 x2 : ffffff8012e6b910
    x1 : 0000000000000400 x0 : ffffffc42c38f300
    Call trace:
    cls_cgroup_classify+0x80/0xec [cls_cgroup]
    tcf_classify+0x78/0x138
    htb_enqueue+0x74/0x320 [sch_htb]
    __dev_queue_xmit+0x3e4/0x9d0
    dev_queue_xmit+0x24/0x30
    ip_finish_output2+0x2e4/0x4d0
    ip_finish_output+0x1d8/0x270
    ip_mc_output+0xa8/0x240
    ip_local_out+0x58/0x68
    ip_send_skb+0x2c/0x88
    ip_push_pending_frames+0x44/0x50
    raw_sendmsg+0x458/0x830
    inet_sendmsg+0x54/0xe8
    sock_sendmsg+0x34/0x50
    __sys_sendto+0xd0/0x120
    __arm64_sys_sendto+0x30/0x40
    el0_svc_common.constprop.0+0x88/0xf8
    el0_svc_handler+0x2c/0x38
    el0_svc+0x8/0xc
    Code: 39496001 360002a1 b9425c14 34000274 (79405260)

    Fixes: ed76f5edccc9 ("net: sched: protect filter_chain list with filter_chain_lock mutex")
    Suggested-by: Cong Wang
    Signed-off-by: Matteo Croce
    Signed-off-by: David S. Miller

    Matteo Croce
     

04 May, 2019

3 commits

  • When a matchall classifier is added, there is a small time interval in
    which tp->root is NULL. If we receive a packet in this small time slice,
    a NULL pointer dereference will happen, leading to a kernel panic:

    # tc qdisc replace dev eth0 ingress
    # tc filter add dev eth0 parent ffff: matchall action gact drop
    Unable to handle kernel NULL pointer dereference at virtual address 0000000000000034
    Mem abort info:
    ESR = 0x96000005
    Exception class = DABT (current EL), IL = 32 bits
    SET = 0, FnV = 0
    EA = 0, S1PTW = 0
    Data abort info:
    ISV = 0, ISS = 0x00000005
    CM = 0, WnR = 0
    user pgtable: 4k pages, 39-bit VAs, pgdp = 00000000a623d530
    [0000000000000034] pgd=0000000000000000, pud=0000000000000000
    Internal error: Oops: 96000005 [#1] SMP
    Modules linked in: cls_matchall sch_ingress nls_iso8859_1 nls_cp437 vfat fat m25p80 spi_nor mtd xhci_plat_hcd xhci_hcd phy_generic sfp mdio_i2c usbcore i2c_mv64xxx marvell10g mvpp2 usb_common spi_orion mvmdio i2c_core sbsa_gwdt phylink ip_tables x_tables autofs4
    Process ksoftirqd/0 (pid: 9, stack limit = 0x0000000009de7d62)
    CPU: 0 PID: 9 Comm: ksoftirqd/0 Not tainted 5.1.0-rc6 #21
    Hardware name: Marvell 8040 MACCHIATOBin Double-shot (DT)
    pstate: 40000005 (nZcv daif -PAN -UAO)
    pc : mall_classify+0x28/0x78 [cls_matchall]
    lr : tcf_classify+0x78/0x138
    sp : ffffff80109db9d0
    x29: ffffff80109db9d0 x28: ffffffc426058800
    x27: 0000000000000000 x26: ffffffc425b0dd00
    x25: 0000000020000000 x24: 0000000000000000
    x23: ffffff80109dbac0 x22: 0000000000000001
    x21: ffffffc428ab5100 x20: ffffffc425b0dd00
    x19: ffffff80109dbac0 x18: 0000000000000000
    x17: 0000000000000000 x16: 0000000000000000
    x15: 0000000000000000 x14: 0000000000000000
    x13: ffffffbf108ad288 x12: dead000000000200
    x11: 00000000f0000000 x10: 0000000000000001
    x9 : ffffffbf1089a220 x8 : 0000000000000001
    x7 : ffffffbebffaa950 x6 : 0000000000000000
    x5 : 000000442d6ba000 x4 : 0000000000000000
    x3 : ffffff8008735ad8 x2 : ffffff80109dbac0
    x1 : ffffffc425b0dd00 x0 : ffffff8010592078
    Call trace:
    mall_classify+0x28/0x78 [cls_matchall]
    tcf_classify+0x78/0x138
    __netif_receive_skb_core+0x29c/0xa20
    __netif_receive_skb_one_core+0x34/0x60
    __netif_receive_skb+0x28/0x78
    netif_receive_skb_internal+0x2c/0xc0
    napi_gro_receive+0x1a0/0x1d8
    mvpp2_poll+0x928/0xb18 [mvpp2]
    net_rx_action+0x108/0x378
    __do_softirq+0x128/0x320
    run_ksoftirqd+0x44/0x60
    smpboot_thread_fn+0x168/0x1b0
    kthread+0x12c/0x130
    ret_from_fork+0x10/0x1c
    Code: aa0203f3 aa1e03e0 d503201f f9400684 (b9403480)
    ---[ end trace fc71e2ef7b8ab5a5 ]---
    Kernel panic - not syncing: Fatal exception in interrupt
    SMP: stopping secondary CPUs
    Kernel Offset: disabled
    CPU features: 0x002,00002000
    Memory Limit: none
    Rebooting in 1 seconds..

    Fix this by adding a NULL check in mall_classify().

    Fixes: ed76f5edccc9 ("net: sched: protect filter_chain list with filter_chain_lock mutex")
    Signed-off-by: Matteo Croce
    Acked-by: Cong Wang
    Signed-off-by: David S. Miller

    Matteo Croce
     
  • Make use of the struct_size() helper instead of an open-coded version
    in order to avoid any potential type mistakes, in particular in the
    context in which this code is being used.

    So, replace code of the following form:

    sizeof(*s) + s->nkeys*sizeof(struct tc_u32_key)

    with:

    struct_size(s, keys, s->nkeys)

    This code was detected with the help of Coccinelle.
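
    The helper's semantics can be modeled in Python (a sketch only: the real
    struct_size() is a C macro in include/linux/overflow.h that also type-checks
    its arguments; the saturating-on-overflow behaviour shown here is the key
    property the open-coded form lacks):

```python
SIZE_MAX = 2**64 - 1  # assuming a 64-bit size_t

def struct_size(base_size, elem_size, count):
    """Model of struct_size(): header size plus 'count' trailing array
    elements, saturating to SIZE_MAX on overflow instead of wrapping."""
    total = base_size + elem_size * count
    return total if total <= SIZE_MAX else SIZE_MAX
```

    A saturated SIZE_MAX result makes the subsequent allocation fail cleanly,
    whereas the open-coded multiplication could silently wrap to a small value.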

    Signed-off-by: Gustavo A. R. Silva
    Signed-off-by: David S. Miller

    Gustavo A. R. Silva
     
  • Although devlink health report does a nice job of reporting TX
    timeouts and other NIC errors, it unfortunately requires driver
    support, and currently only mlx5 has implemented it.
    Until other drivers catch up, it is useful to have a
    generic tracepoint to monitor this kind of TX timeout. We have
    been suffering TX timeouts with different drivers; we plan to
    start monitoring them with rasdaemon, which just needs a new tracepoint.

    Sample output:

    ksoftirqd/1-16 [001] ..s2 144.043173: net_dev_xmit_timeout: dev=ens3 driver=e1000 queue=0

    Cc: Eran Ben Elisha
    Cc: Jiri Pirko
    Signed-off-by: Cong Wang
    Acked-by: Jiri Pirko
    Reviewed-by: Eran Ben Elisha
    Signed-off-by: David S. Miller

    Cong Wang
     

01 May, 2019

4 commits

  • IEEE 802.1Q-2018 defines the concept of a cycle-time extension, so the
    last entry of a schedule before the start of a new schedule can be
    extended so that "too-short" entries are avoided.

    Signed-off-by: Vinicius Costa Gomes
    Signed-off-by: David S. Miller

    Vinicius Costa Gomes
     
  • IEEE 802.1Q-2018 defines that the cycle-time of a schedule may be
    overridden, so the schedule is truncated to a determined "width".

    Signed-off-by: Vinicius Costa Gomes
    Signed-off-by: David S. Miller

    Vinicius Costa Gomes
     
  • IEEE 802.1Q-2018 defines two "types" of schedules, the "Oper" (from
    operational?) and "Admin" ones. Up until now, 'taprio' only had
    support for the "Oper" one, added when the qdisc is created. This adds
    support for the "Admin" one, which allows the .change() operation to
    be supported.

    Just for clarification, some quick (and dirty) definitions: the "Oper"
    schedule is the currently (as in this instant) running one, and it is
    read-only. The "Admin" one is the one that the system configurator has
    installed; it can be changed, and it will be "promoted" to "Oper" when
    its 'base-time' is reached.
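
    The promotion rule can be sketched as a toy state model (the dict layout
    and field names here are made up, not taprio's actual data structures):

```python
def advance_schedules(state, now):
    """Promote the admin schedule to oper once its base-time is reached.
    'state' is a dict with optional "oper" and "admin" entries."""
    admin = state.get("admin")
    if admin is not None and now >= admin["base_time"]:
        state["oper"] = state.pop("admin")   # old oper can now be freed
    return state.get("oper")
```
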

    The idea behind this patch is that calling something like the below
    (after taprio is already configured with an initial schedule):

    $ tc qdisc change taprio dev IFACE parent root \
    base-time X \
    sched-entry \
    ...

    Will cause a new admin schedule to be created and programmed to be
    "promoted" to "Oper" at instant X. If an "Admin" schedule already
    exists, it will be overwritten with the new parameters.

    Up until now, there was some code that had been added to ease the support
    of changing a single entry of a schedule, but it was ultimately unused.
    Now that we have support for "change" with better-thought-out
    semantics, updating a single entry seems to be less useful.

    So we remove what is in practice dead code, and return a "not
    supported" error if the user tries to use it. If changing a single
    entry would make the user's life easier we may resurrect this idea,
    but at this point, removing it simplifies the code.

    For now, only the schedule specific bits are allowed to be added for a
    new schedule, that means that 'clockid', 'num_tc', 'map' and 'queues'
    cannot be modified.

    Example:

    $ tc qdisc change dev IFACE parent root handle 100 taprio \
    base-time $BASE_TIME \
    sched-entry S 00 500000 \
    sched-entry S 0f 500000 \
    clockid CLOCK_TAI

    The only change in the netlink API introduced by this patch is the
    introduction of an "admin" type in the response to a dump request;
    that type allows userspace to separate the "oper" schedule from the
    "admin" schedule. If userspace doesn't support the "admin" type, it
    will only display the "oper" schedule.

    Signed-off-by: Vinicius Costa Gomes
    Signed-off-by: David S. Miller

    Vinicius Costa Gomes
     
  • Right now, this isn't a problem, but the next commit allows schedules
    to be added during runtime. When a new schedule transitions from the
    inactive to the active state ("admin" -> "oper"), the previous one can
    be freed; if it's freed just after the RCU read lock is released, we
    may access an invalid entry.

    So we should take care to protect the dequeue() flow, so that all the
    places that access the entries are protected by the RCU read lock.

    Signed-off-by: Vinicius Costa Gomes
    Signed-off-by: David S. Miller

    Vinicius Costa Gomes
     

28 Apr, 2019

2 commits

  • We currently have two levels of strict validation:

    1) liberal (default)
    - undefined (type >= max) & NLA_UNSPEC attributes accepted
    - attribute length >= expected accepted
    - garbage at end of message accepted
    2) strict (opt-in)
    - NLA_UNSPEC attributes accepted
    - attribute length >= expected accepted

    Split out parsing strictness into four different options:
    * TRAILING - check that there's no trailing data after parsing
    attributes (in message or nested)
    * MAXTYPE - reject attrs > max known type
    * UNSPEC - reject attributes with NLA_UNSPEC policy entries
    * STRICT_ATTRS - strictly validate attribute size

    The default for future things should be *everything*.
    The current *_strict() is a combination of TRAILING and MAXTYPE,
    and is renamed to _deprecated_strict().
    The current regular parsing has none of this, and is renamed to
    *_parse_deprecated().
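
    The four options and the two legacy combinations can be modeled with a flag
    enum (a sketch of the relationships only: the names are shortened and the
    bit values illustrative, not the kernel's actual NL_VALIDATE_* constants):

```python
from enum import Flag

class Validate(Flag):
    LIBERAL = 0
    TRAILING = 1       # reject trailing data after attributes
    MAXTYPE = 2        # reject attrs > max known type
    UNSPEC = 4         # reject NLA_UNSPEC policy entries
    STRICT_ATTRS = 8   # strictly validate attribute size

# old *_strict() parsing, now *_deprecated_strict()
DEPRECATED_STRICT = Validate.TRAILING | Validate.MAXTYPE
# old default parsing, now *_parse_deprecated()
DEPRECATED = Validate.LIBERAL
# the default for future commands: everything
STRICT = (Validate.TRAILING | Validate.MAXTYPE
          | Validate.UNSPEC | Validate.STRICT_ATTRS)
```
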

    Additionally, this allows us to selectively set one of the new flags
    even on old policies. Notably, the UNSPEC flag could be useful in
    this case, since it can be arranged (by filling in the policy) to
    not be an incompatible userspace ABI change, but going forward would
    prevent forgetting attribute entries. Similar can apply
    to the POLICY flag.

    We end up with the following renames:
    * nla_parse -> nla_parse_deprecated
    * nla_parse_strict -> nla_parse_deprecated_strict
    * nlmsg_parse -> nlmsg_parse_deprecated
    * nlmsg_parse_strict -> nlmsg_parse_deprecated_strict
    * nla_parse_nested -> nla_parse_nested_deprecated
    * nla_validate_nested -> nla_validate_nested_deprecated

    Using spatch, of course:
    @@
    expression TB, MAX, HEAD, LEN, POL, EXT;
    @@
    -nla_parse(TB, MAX, HEAD, LEN, POL, EXT)
    +nla_parse_deprecated(TB, MAX, HEAD, LEN, POL, EXT)

    @@
    expression NLH, HDRLEN, TB, MAX, POL, EXT;
    @@
    -nlmsg_parse(NLH, HDRLEN, TB, MAX, POL, EXT)
    +nlmsg_parse_deprecated(NLH, HDRLEN, TB, MAX, POL, EXT)

    @@
    expression NLH, HDRLEN, TB, MAX, POL, EXT;
    @@
    -nlmsg_parse_strict(NLH, HDRLEN, TB, MAX, POL, EXT)
    +nlmsg_parse_deprecated_strict(NLH, HDRLEN, TB, MAX, POL, EXT)

    @@
    expression TB, MAX, NLA, POL, EXT;
    @@
    -nla_parse_nested(TB, MAX, NLA, POL, EXT)
    +nla_parse_nested_deprecated(TB, MAX, NLA, POL, EXT)

    @@
    expression START, MAX, POL, EXT;
    @@
    -nla_validate_nested(START, MAX, POL, EXT)
    +nla_validate_nested_deprecated(START, MAX, POL, EXT)

    @@
    expression NLH, HDRLEN, MAX, POL, EXT;
    @@
    -nlmsg_validate(NLH, HDRLEN, MAX, POL, EXT)
    +nlmsg_validate_deprecated(NLH, HDRLEN, MAX, POL, EXT)

    For this patch, don't actually add the strict, non-renamed versions
    yet so that it breaks compile if I get it wrong.

    Also, while at it, make nla_validate and nla_parse go down to a
    common __nla_validate_parse() function to avoid code duplication.

    Ultimately, this allows us to have very strict validation for every
    new caller of nla_parse()/nlmsg_parse() etc as re-introduced in the
    next patch, while existing things will continue to work as is.

    In effect then, this adds fully strict validation for any new command.

    Signed-off-by: Johannes Berg
    Signed-off-by: David S. Miller

    Johannes Berg
     
  • Even though the NLA_F_NESTED flag was introduced more than 11 years ago, most
    netlink-based interfaces (including recently added ones) still do not
    set it in kernel-generated messages. Without the flag, message parsers
    not aware of attribute semantics (e.g. the wireshark dissector or libmnl's
    mnl_nlmsg_fprintf()) cannot recognize nested attributes and won't display
    the structure of their contents.

    Unfortunately we cannot just add the flag everywhere as there may be
    userspace applications which check nlattr::nla_type directly rather than
    through a helper masking out the flags. Therefore the patch renames
    nla_nest_start() to nla_nest_start_noflag() and introduces nla_nest_start()
    as a wrapper adding NLA_F_NESTED. The calls which add NLA_F_NESTED manually
    are rewritten to use nla_nest_start().
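
    The rename can be sketched with a toy message model (the bit position of
    NLA_F_NESTED matches the netlink uapi, but the list-based message
    representation here is made up for illustration):

```python
NLA_F_NESTED = 1 << 15   # flag bit from include/uapi/linux/netlink.h

def nla_nest_start_noflag(msg, attrtype):
    """Open a nested attribute using nla_type exactly as given
    (the behaviour of the old nla_nest_start())."""
    msg.append(("nest_start", attrtype))
    return attrtype

def nla_nest_start(msg, attrtype):
    """New wrapper: always set NLA_F_NESTED on the attribute type."""
    return nla_nest_start_noflag(msg, attrtype | NLA_F_NESTED)
```

    Callers that previously passed `attrtype | NLA_F_NESTED` by hand can use the
    new nla_nest_start() unchanged, which is exactly what the semantic patch does.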

    Except for changes in include/net/netlink.h, the patch was generated using
    this semantic patch:

    @@ expression E1, E2; @@
    -nla_nest_start(E1, E2)
    +nla_nest_start_noflag(E1, E2)

    @@ expression E1, E2; @@
    -nla_nest_start_noflag(E1, E2 | NLA_F_NESTED)
    +nla_nest_start(E1, E2)

    Signed-off-by: Michal Kubecek
    Acked-by: Jiri Pirko
    Acked-by: David Ahern
    Signed-off-by: David S. Miller

    Michal Kubecek
     

25 Apr, 2019

1 commit

  • Recent changes that introduced unlocked flower did not properly account for
    the case when reoffload is initiated concurrently with filter updates. To fix
    the issue, extend flower with a 'hw_filters' list that is used to store
    filters that don't have the 'skip_hw' flag set. A filter is added to the list
    when it is inserted into hardware and only removed from it after being
    unoffloaded from all drivers that the parent block is attached to. This ensures
    that concurrent reoffload can still access a filter that is being deleted and
    prevents a race condition where the driver callback could be removed while the
    filter is no longer accessible through the idr but is still present in hardware.

    Refactor fl_change() to respect the new filter reference counter and to release
    the filter reference with __fl_put() in case of error, instead of directly
    deallocating the filter memory. This allows for concurrent access to the filter
    from fl_reoffload() and protects it with reference counting. Refactor
    fl_reoffload() to iterate over the hw_filters list instead of the idr. Implement
    the fl_get_next_hw_filter() helper function, which is used to iterate over the
    hw_filters list with reference counting and skips filters that are being
    concurrently deleted.
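
    The iteration can be sketched as follows (a simplified model: a dict field
    stands in for the kernel's refcount_t, and the function name mirrors the
    commit while the data layout is made up):

```python
def get_next_hw_filter(hw_filters, cursor=0):
    """Walk hw_filters from 'cursor', take a reference on the first
    filter not concurrently being deleted, return (next_cursor, filter)."""
    for i in range(cursor, len(hw_filters)):
        f = hw_filters[i]
        if f["refcnt"] > 0:          # refcount_inc_not_zero() analogue
            f["refcnt"] += 1         # caller must drop this reference
            return i + 1, f
    return len(hw_filters), None
```
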

    Fixes: 92149190067d ("net: sched: flower: set unlocked flag for flower proto ops")
    Signed-off-by: Vlad Buslov
    Reviewed-by: Saeed Mahameed
    Signed-off-by: David S. Miller

    Vlad Buslov
     

24 Apr, 2019

5 commits

  • In case we don't have 'guard' or 'budget' to transmit the skb, we should
    continue traversing the qdisc list since the remaining guard/budget
    might be enough to transmit an skb from the other child qdiscs.

    Fixes: 5a781ccbd19e ("tc: Add support for configuring the taprio scheduler")
    Signed-off-by: Andre Guedes
    Signed-off-by: David S. Miller

    Andre Guedes
     
  • While traversing taprio's list of child qdiscs, if the gate is closed for
    a given traffic class, we should continue traversing the list since the
    remaining qdiscs may have skbs ready for transmission.

    This patch also takes this opportunity and changes the function to use
    the TAPRIO_ALL_GATES_OPEN macro instead of the magic number '-1'.

    Fixes: 5a781ccbd19e ("tc: Add support for configuring the taprio scheduler")
    Signed-off-by: Andre Guedes
    Signed-off-by: David S. Miller

    Andre Guedes
     
  • The 'entry' argument of should_restart_cycle() cannot be NULL since it
    is already checked by the caller, so the WARN_ON() within
    should_restart_cycle() can be removed. By doing that, the function becomes
    a dummy wrapper around list_is_last(), so this patch simply gets rid of it
    and calls list_is_last() within advance_sched() instead.

    Signed-off-by: Andre Guedes
    Signed-off-by: David S. Miller

    Andre Guedes
     
  • This patch refactors the taprio_get_start_time() function
    to improve readability and report errors properly.

    If 'base' time is later than 'now', the start time is equal to 'base'
    and taprio_get_start_time() is done. That's the natural case, so we move
    that code to the beginning of the function. Also, if the 'cycle' calculation
    is zero, something went really wrong with taprio and we should log that
    internal error properly.

    Signed-off-by: Andre Guedes
    Signed-off-by: David S. Miller

    Andre Guedes
     
  • This patch removes a pointless variable assignment in taprio_change().
    The 'err' variable is not used between this assignment and the next one, so
    this patch removes it.

    Signed-off-by: Andre Guedes
    Signed-off-by: David S. Miller

    Andre Guedes
     

19 Apr, 2019

1 commit

  • Recent changes to taprio did not use the correct div64 helpers,
    leading to:

    net/sched/sch_taprio.o: In function `taprio_dequeue':
    sch_taprio.c:(.text+0x34a): undefined reference to `__divdi3'
    net/sched/sch_taprio.o: In function `advance_sched':
    sch_taprio.c:(.text+0xa0b): undefined reference to `__divdi3'
    net/sched/sch_taprio.o: In function `taprio_init':
    sch_taprio.c:(.text+0x1450): undefined reference to `__divdi3'
    /home/jkicinski/devel/linux/Makefile:1032: recipe for target 'vmlinux' failed

    Use math64 helpers.

    Fixes: 7b9eba7ba0c1 ("net/sched: taprio: fix picos_per_byte miscalculation")
    Signed-off-by: Jakub Kicinski
    Reviewed-by: Dirk van der Merwe
    Acked-by: Vinicius Costa Gomes
    Signed-off-by: David S. Miller

    Jakub Kicinski
     

12 Apr, 2019

2 commits

  • Fix net reference counting in fl_change() and remove the redundant call to
    tcf_exts_get_net() from __fl_delete(). __fl_put() already tries to get net
    before releasing exts and deallocating a filter, so this code caused the
    flower classifier to obtain net twice per filter being deleted.

    The implementation of __fl_delete() called tcf_exts_get_net() to pass its
    result as the 'async' flag to fl_mask_put(). However, the 'async' flag is
    redundant and only complicates the fl_mask_put() implementation. This
    functionality seems to have been copied from the filter cleanup code, where
    it was added by Cong with the following explanation:

    This patchset tries to fix the race between call_rcu() and
    cleanup_net() again. Without holding the netns refcnt the
    tc_action_net_exit() in netns workqueue could be called before
    filter destroy works in tc filter workqueue. This patchset
    moves the netns refcnt from tc actions to tcf_exts, without
    breaking per-netns tc actions.

    This doesn't apply to flower mask, which doesn't call any tc action code
    during cleanup. Simplify fl_mask_put() by removing the flag parameter and
    always use tcf_queue_work() to free mask objects.

    Fixes: 061775583e35 ("net: sched: flower: introduce reference counting for filters")
    Fixes: 1f17f7742eeb ("net: sched: flower: insert filter to ht before offloading it to hw")
    Fixes: 05cd271fd61a ("cls_flower: Support multiple masks per priority")
    Reported-by: Ido Schimmel
    Signed-off-by: Vlad Buslov
    Signed-off-by: David S. Miller

    Vlad Buslov
     
  • The implementation of rhashtable_insert_fast() checks whether its internal
    helper function __rhashtable_insert_fast() returns a non-NULL pointer and
    seemingly returns -EEXIST in that case. However, since
    __rhashtable_insert_fast() is called with a NULL key pointer, it never
    actually checks for duplicates, which means that -EEXIST is never returned
    to the user. Use the rhashtable_lookup_insert_fast() hash table API instead.
    In order to verify that it works as expected and prevent the problem from
    happening in the future, extend tc-tests with a new test that verifies that
    no new filters with an existing key can be inserted into the flower classifier.
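
    The difference in semantics can be sketched with a toy dict-backed model
    (the real rhashtable signatures differ; this only contrasts the
    no-duplicate-check path with the lookup-then-insert path):

```python
def insert_fast(table, key, value):
    """Models the buggy path: the key is never compared, so
    duplicates are silently inserted."""
    table.setdefault(key, []).append(value)   # no duplicate check
    return 0

def lookup_insert_fast(table, key, value):
    """Models rhashtable_lookup_insert_fast(): fail with -EEXIST if an
    entry with the same key is already present."""
    if key in table:
        return -17   # -EEXIST
    table[key] = [value]
    return 0
```
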

    Fixes: 1f17f7742eeb ("net: sched: flower: insert filter to ht before offloading it to hw")
    Signed-off-by: Vlad Buslov
    Reviewed-by: Jakub Kicinski
    Signed-off-by: David S. Miller

    Vlad Buslov
     

11 Apr, 2019

5 commits

  • The Credit Based Shaper heavily depends on the link speed to calculate
    the scheduling credits; we can't properly calculate the credits if the
    device has failed to report the link speed.

    In that case we can't dequeue packets assuming a wrong port rate that would
    result in an inconsistent credit distribution.

    This patch makes sure we fail to dequeue in case:

    1) __ethtool_get_link_ksettings() reports an error, or 2) the ethernet driver
    failed to set the ksettings' speed value (setting the link speed to
    SPEED_UNKNOWN).

    Additionally, we properly recalculate the port rate whenever the link speed
    is changed.
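
    The guard described above can be sketched as a hypothetical helper
    (SPEED_UNKNOWN matches ethtool's sentinel value; the function name and
    return convention are illustrative):

```python
SPEED_UNKNOWN = -1   # ethtool sentinel: driver couldn't report the speed

def port_rate_bytes_per_sec(speed_mbit):
    """Return the port rate, or None when the link speed is unknown, so
    that dequeue can bail out instead of using a bogus credit rate."""
    if speed_mbit is None or speed_mbit <= 0 or speed_mbit == SPEED_UNKNOWN:
        return None
    return speed_mbit * 1_000_000 // 8   # Mbit/s -> bytes/s
```
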

    Fixes: 3d0bd028ffb4a ("net/sched: Add support for HW offloading for CBS")
    Signed-off-by: Leandro Dorileo
    Reviewed-by: Vedang Patel
    Signed-off-by: David S. Miller

    Leandro Dorileo
     
  • The Time Aware Priority Scheduler is heavily dependent on the link speed;
    it relies on it to calculate transmission bytes per cycle, and we can't
    properly calculate the so-called budget if the device has failed
    to report the link speed.

    In that case we can't dequeue packets assuming a wrong budget.
    This patch makes sure we fail to dequeue in case:

    1) __ethtool_get_link_ksettings() reports an error, or 2) the ethernet
    driver failed to set the ksettings' speed value (setting the link speed
    to SPEED_UNKNOWN).

    Additionally, we recalculate the budget whenever the link speed is
    changed.

    Fixes: 5a781ccbd19e4 ("tc: Add support for configuring the taprio scheduler")
    Signed-off-by: Leandro Dorileo
    Reviewed-by: Vedang Patel
    Signed-off-by: David S. Miller

    Leandro Dorileo
     
  • This reverts commit 46b1c18f9deb ("net: sched: put back q.qlen into
    a single location").
    After the previous patch, when a NOLOCK qdisc is enslaved to a
    locking qdisc it switches to global stats accounting. As a consequence,
    when a classful qdisc directly accesses a child qdisc's qlen, such a
    qdisc is not doing per-CPU accounting and the qlen value is consistent.

    In the control path nobody uses qlen directly since commit
    e5f0e8f8e45 ("net: sched: introduce and use qdisc tree flush/purge
    helpers"), so we can remove the contended atomic ops from the
    datapath.

    v1 -> v2:
    - complete the qdisc_qstats_atomic_qlen_dec() ->
    qdisc_qstats_cpu_qlen_dec() replacement, fix build issue
    - more descriptive commit message

    Signed-off-by: Paolo Abeni
    Signed-off-by: David S. Miller

    Paolo Abeni
     
  • Since stats updating is always consistent with the TCQ_F_CPUSTATS flag,
    we can disable it at qdisc creation time by flipping that bit.

    In my experiments, if the NOLOCK flag is cleared, per-CPU stats
    accounting does not give any measurable performance gain, but it
    wastes some memory.

    Let's clear TCQ_F_CPUSTATS together with NOLOCK when enslaving
    a NOLOCK qdisc to a locking one.

    Use the stats update helper inside pfifo_fast to cope correctly with
    the TCQ_F_CPUSTATS flag change.

    As a side effect, the q.qlen value for any child qdisc is always
    consistent for all locking classful qdiscs.

    Signed-off-by: Paolo Abeni
    Signed-off-by: David S. Miller

    Paolo Abeni
     
  • The core sched implementation independently checks the NOLOCK flag
    to acquire/release the root spin lock and qdisc_is_percpu_stats()
    to account per-CPU values in many places.

    This change updates the last few places checking TCQ_F_NOLOCK to
    do per-CPU stats accounting according to the qdisc_is_percpu_stats()
    value.

    The above allows us to clean up the dev_requeue_skb() implementation a bit
    and makes stats updates always consistent with a single flag.

    v1 -> v2:
    - do not move qdisc_is_empty definition, fix build issue

    Signed-off-by: Paolo Abeni
    Signed-off-by: David S. Miller

    Paolo Abeni
     

08 Apr, 2019

1 commit

  • John reports:

    Recent refactoring of fl_change aims to use the classifier spinlock to
    avoid the need for the rtnl lock. In doing so, the fl_hw_replace_filter()
    function was moved to before the lock is taken. This can create problems
    for drivers if duplicate filters are created (common in ovs tc offload
    due to filters being triggered by user-space matches).

    Drivers registered for such filters will now receive multiple copies of
    the same rule, each with a different cookie value. This means that the
    drivers would need to do a full match field lookup to determine
    duplicates, repeating work that will happen in flower __fl_lookup().
    Currently, drivers do not expect to receive duplicate filters.

    To fix this, verify that a filter with the same key is not present in the
    flower classifier hash table and insert the new filter into the flower hash
    table before offloading it to hardware. Implement the helper function
    fl_ht_insert_unique() to atomically verify/insert a filter.

    This change makes the filter visible to the fast path at the beginning of
    the fl_change() function, which means it can no longer be freed directly in
    case of error. Refactor the fl_change() error handling code to deallocate the
    filter with an rcu timeout.

    Fixes: 620da4860827 ("net: sched: flower: refactor fl_change")
    Reported-by: John Hurley
    Signed-off-by: Vlad Buslov
    Acked-by: Jiri Pirko
    Signed-off-by: David S. Miller

    Vlad Buslov