31 Mar, 2020

1 commit

  • It may be up to the driver (in case the ANY HW stats type is passed)
    to select which type of HW stats it is going to use. Add an
    infrastructure to expose this information to the user; a driver-side
    sketch follows this entry.

    $ tc filter add dev enp3s0np1 ingress proto ip handle 1 pref 1 flower dst_ip 192.168.1.1 action drop
    $ tc -s filter show dev enp3s0np1 ingress
    filter protocol ip pref 1 flower chain 0
    filter protocol ip pref 1 flower chain 0 handle 0x1
    eth_type ipv4
    dst_ip 192.168.1.1
    in_hw in_hw_count 2
    action order 1: gact action drop
    random type none pass val 0
    index 1 ref 1 bind 1 installed 10 sec used 10 sec
    Action statistics:
    Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
    backlog 0b 0p requeues 0
    used_hw_stats immediate <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<

    Signed-off-by: Jiri Pirko
    Signed-off-by: David S. Miller

    Jiri Pirko
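
    A driver-side sketch of reporting the chosen type: the names foo_priv
    and foo_hw_read_counters() are hypothetical, and the used_hw_stats
    argument of flow_stats_update() as well as the exact
    FLOW_ACTION_HW_STATS_IMMEDIATE enum name are assumed from this
    infrastructure.

    /* Hypothetical driver stats callback: report which HW stats type
     * was actually used when the filter was installed with ANY. */
    static int foo_flower_stats(struct foo_priv *priv,
                                struct flow_cls_offload *f)
    {
            u64 bytes, pkts, lastused;

            foo_hw_read_counters(priv, f->cookie, &bytes, &pkts, &lastused);

            /* Counters were read straight from HW: "immediate" stats. */
            flow_stats_update(&f->stats, bytes, pkts, lastused,
                              FLOW_ACTION_HW_STATS_IMMEDIATE);
            return 0;
    }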
     

18 Feb, 2020

2 commits

  • tc flower rules that block based on src or dst port are sometimes
    ineffective due to uninitialized stack data. __skb_flow_dissect()
    extracts ports from the skb for tc flower to match against. However,
    port dissection is skipped when the FLOW_DIS_IS_FRAGMENT bit is set in
    key_control->flags. All callers of __skb_flow_dissect() zero out the
    key_control field, except for fl_classify() as used by the flower
    classifier. Thus, FLOW_DIS_IS_FRAGMENT may already be set on entry to
    __skb_flow_dissect(), since key_control is allocated on the stack
    and may not be initialized.

    Since key_basic and key_control are present for all flow keys, let's
    make sure they are initialized; a sketch of the fix follows this entry.

    Fixes: 62230715fd24 ("flow_dissector: do not dissect l4 ports for fragments")
    Co-developed-by: Eric Dumazet
    Signed-off-by: Eric Dumazet
    Acked-by: Cong Wang
    Signed-off-by: Jason Baron
    Signed-off-by: David S. Miller

    Jason Baron
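
    The shape of the fix, as a sketch (not the exact upstream diff): clear
    the two always-present keys inside __skb_flow_dissect() instead of
    trusting the caller's storage.

    /* key_basic and key_control exist for every flow key, so zero them
     * up front; a stack-allocated key from fl_classify() may otherwise
     * carry a stale FLOW_DIS_IS_FRAGMENT bit. */
    key_control = skb_flow_dissector_target(flow_dissector,
                                            FLOW_DISSECTOR_KEY_CONTROL,
                                            target_container);
    memset(key_control, 0, sizeof(*key_control));

    key_basic = skb_flow_dissector_target(flow_dissector,
                                          FLOW_DISSECTOR_KEY_BASIC,
                                          target_container);
    memset(key_basic, 0, sizeof(*key_basic));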
     
  • Refactor the tc_setup_flow_action() function to not use the rtnl lock,
    and remove the 'rtnl_held' argument that is no longer needed.

    Signed-off-by: Vlad Buslov
    Acked-by: Jiri Pirko
    Signed-off-by: David S. Miller

    Vlad Buslov
     

14 Feb, 2020

1 commit

  • Unlike other classifiers that can be offloaded (i.e. users can set flags
    like 'skip_hw' and 'skip_sw'), 'cls_flower' doesn't validate the size of
    the netlink attribute 'TCA_FLOWER_FLAGS' provided by the user: add a
    proper entry to fl_policy, as sketched after this entry.

    Fixes: 5b33f48842fa ("net/flower: Introduce hardware offload support")
    Signed-off-by: Davide Caratti
    Acked-by: Jiri Pirko
    Signed-off-by: David S. Miller

    Davide Caratti
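
    The added policy entry is essentially one line; a sketch of its shape:

    /* In fl_policy: give TCA_FLOWER_FLAGS a type, so netlink parsing
     * rejects attributes shorter than a u32 instead of accepting any
     * length. */
    [TCA_FLOWER_FLAGS]              = { .type = NLA_U32 },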
     

27 Jan, 2020

1 commit

  • The current implementations of ops->bind_class() merely search for
    the classid and update the class in struct tcf_result, without
    invoking either cl_ops->bind_tcf() or cl_ops->unbind_tcf(). This
    breaks their design, as qdiscs like cbq use these callbacks to count
    filters too. This is why syzbot triggered the warning in
    cbq_destroy_class().

    In order to fix this, we have to call cl_ops->bind_tcf() and
    cl_ops->unbind_tcf() like the filter binding path. This patch does
    so by refactoring out two helper functions __tcf_bind_filter()
    and __tcf_unbind_filter(), which are lockless and accept a Qdisc
    pointer, then teaching each implementation to call them correctly
    (a sketch of the helper follows this entry).

    Note, we merely pass the Qdisc pointer as an opaque pointer to
    each filter, they only need to pass it down to the helper
    functions without understanding it at all.

    Fixes: 07d79fc7d94e ("net_sched: add reverse binding for tc class")
    Reported-and-tested-by: syzbot+0a0596220218fcb603a8@syzkaller.appspotmail.com
    Reported-and-tested-by: syzbot+63bdb6006961d8c917c6@syzkaller.appspotmail.com
    Cc: Jamal Hadi Salim
    Cc: Jiri Pirko
    Signed-off-by: Cong Wang
    Signed-off-by: David S. Miller

    Cong Wang
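
    A sketch of the refactored helper (the exact body is assumed from the
    description above):

    static inline void __tcf_bind_filter(struct Qdisc *q,
                                         struct tcf_result *r,
                                         unsigned long base)
    {
            unsigned long cl;

            /* Bind the new class, then unbind whatever was previously
             * recorded in r, keeping per-class filter counts (as used
             * by cbq) balanced. */
            cl = q->ops->cl_ops->bind_tcf(q, base, r->classid);
            cl = __cls_set_class(&r->class, cl);
            if (cl)
                    q->ops->cl_ops->unbind_tcf(q, cl);
    }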
     

31 Dec, 2019

1 commit

  • Revert "net/sched: cls_u32: fix refcount leak in the error path of
    u32_change()", and fix the u32 refcount leak in a more generic way that
    preserves the semantic of rule dumping.
    On tc filters that don't support lockless insertion/removal, there is no
    need to guard against concurrent insertion when a removal is in progress.
    Therefore, for most of them we can avoid a full walk() when deleting, and
    just decrease the refcount, like it was done on older Linux kernels.
    This fixes situations where walk() wrongly detected a non-empty
    filter, as happened with cls_u32 in the error path of change(), thus
    leading to failures in the following tdc selftests:

    6aa7: (filter, u32) Add/Replace u32 with source match and invalid indev
    6658: (filter, u32) Add/Replace u32 with custom hash table and invalid handle
    74c2: (filter, u32) Add/Replace u32 filter with invalid hash table id

    On cls_flower, and on (future) lockless filters, this check is necessary:
    move all the check_empty() logic into a callback so that each filter
    can have its own implementation. For cls_flower, it's sufficient to check
    that no IDRs have been allocated; a sketch follows this entry.

    This reverts commit 275c44aa194b7159d1191817b20e076f55f0e620.

    Changes since v1:
    - document the need for delete_empty() when TCF_PROTO_OPS_DOIT_UNLOCKED
    is used, thanks to Vlad Buslov
    - implement delete_empty() without doing fl_walk(), thanks to Vlad Buslov
    - squash revert and new fix in a single patch, to be nice with bisect
    tests that run tdc on u32 filter, thanks to Dave Miller

    Fixes: 275c44aa194b ("net/sched: cls_u32: fix refcount leak in the error path of u32_change()")
    Fixes: 6676d5e416ee ("net: sched: set dedicated tcf_walker flag when tp is empty")
    Suggested-by: Jamal Hadi Salim
    Suggested-by: Vlad Buslov
    Signed-off-by: Davide Caratti
    Reviewed-by: Vlad Buslov
    Tested-by: Jamal Hadi Salim
    Acked-by: Jamal Hadi Salim
    Signed-off-by: David S. Miller

    Davide Caratti
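
    For flower, the callback can reduce to an IDR emptiness check; a
    sketch, assuming filter handles live in head->handle_idr:

    static bool fl_delete_empty(struct tcf_proto *tp)
    {
            struct cls_fl_head *head = fl_head_dereference(tp);

            spin_lock(&tp->lock);
            /* No allocated IDRs means no filters: safe to delete tp. */
            tp->deleting = idr_is_empty(&head->handle_idr);
            spin_unlock(&tp->lock);

            return tp->deleting;
    }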
     

10 Dec, 2019

1 commit

  • Replace all occurrences of FIELD_SIZEOF() with sizeof_field(), except
    at the places where these are defined; the replacement macro is shown
    after this entry. Later patches will remove the unused definition of
    FIELD_SIZEOF().

    This patch was generated using the following script:

    EXCLUDE_FILES="include/linux/stddef.h|include/linux/kernel.h"

    git grep -l -e "\bFIELD_SIZEOF\b" | while read file;
    do
        if [[ "$file" =~ $EXCLUDE_FILES ]]; then
            continue
        fi
        sed -i -e 's/\bFIELD_SIZEOF\b/sizeof_field/g' $file;
    done

    Signed-off-by: Pankaj Bharadiya
    Link: https://lore.kernel.org/r/20190924105839.110713-3-pankaj.laxminarayan.bharadiya@intel.com
    Co-developed-by: Kees Cook
    Signed-off-by: Kees Cook
    Acked-by: David Miller # for net

    Pankaj Bharadiya
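
    For reference, the replacement macro is functionally identical to
    FIELD_SIZEOF(); its definition is essentially:

    /* Size of a struct member, without needing an instance of the
     * struct: dereference a NULL-typed pointer inside sizeof(), which
     * is never evaluated. */
    #define sizeof_field(TYPE, MEMBER) sizeof((((TYPE *)0)->MEMBER))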
     

04 Dec, 2019

1 commit

  • The recent commit 5c72299fba9d ("net: sched: cls_flower: Classify
    packets using port ranges") added filtering based on port ranges
    to tc flower. However, the commit missed the necessary changes in the
    hw-offload code, so the feature resulted in incorrect flow keys being
    offloaded to the NIC.

    A more detailed example is below:

    $ tc qdisc add dev eth0 ingress
    $ tc filter add dev eth0 ingress protocol ip flower ip_proto tcp \
    dst_port 100-200 action drop

    With the setup above, an exact match filter with dst_port == 0 will be
    installed in the NIC by hw-offload. IOW, the NIC will have a rule which
    is equivalent to the following one.

    $ tc qdisc add dev eth0 ingress
    $ tc filter add dev eth0 ingress protocol ip flower ip_proto tcp \
    dst_port 0 action drop

    The behavior was caused by the flow dissector, which extracts packet
    data into the flow key in tc flower. More specifically, regardless
    of exact match or specified port ranges, fl_init_dissector() set the
    FLOW_DISSECTOR_KEY_PORTS flag in struct flow_dissector to extract port
    numbers from the skb in skb_flow_dissect() called by fl_classify(). Note
    that device drivers received the same struct flow_dissector object as
    used in skb_flow_dissect(). Thus, offload drivers could not identify
    whether an exact match or a port range was requested, because the
    FLOW_DISSECTOR_KEY_PORTS flag was set in struct flow_dissector in
    either case.

    This patch adds the new FLOW_DISSECTOR_KEY_PORTS_RANGE flag and the new
    tp_range field in struct fl_flow_key so that offload drivers can
    recognize which kind of filter is applied. At this point, when filters
    based on port ranges are passed to drivers, the drivers return the
    EOPNOTSUPP error because they do not support the feature (the newly
    created FLOW_DISSECTOR_KEY_PORTS_RANGE flag); a minimal driver-side
    sketch follows this entry.

    Fixes: 5c72299fba9d ("net: sched: cls_flower: Classify packets using port ranges")
    Signed-off-by: Yoshiki Komachi
    Signed-off-by: David S. Miller

    Yoshiki Komachi
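
    On the driver side, the new key makes the situation detectable; a
    minimal sketch of the rejection:

    /* In a driver's flower offload parser: port ranges cannot be
     * expressed in this hardware, so refuse them rather than silently
     * installing an exact match on port 0. */
    if (flow_rule_match_key(rule, FLOW_DISSECTOR_KEY_PORTS_RANGE))
            return -EOPNOTSUPP;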
     

22 Nov, 2019

2 commits

  • This patch is to allow matching options in erspan.

    The options can be described in the form:
    VER:INDEX:DIR:HWID/VER:INDEX_MASK:DIR_MASK:HWID_MASK.
    When ver is set to 1, index will be applied while dir
    and hwid will be ignored, and when ver is set to 2,
    dir and hwid will be used while index will be ignored.

    Unlike geneve, only one option can be set. Also, geneve options,
    vxlan options and erspan options can't be set at the same time.

    # ip link add name erspan1 type erspan external
    # tc qdisc add dev erspan1 ingress
    # tc filter add dev erspan1 protocol ip parent ffff: \
    flower \
    enc_src_ip 10.0.99.192 \
    enc_dst_ip 10.0.99.193 \
    enc_key_id 11 \
    erspan_opts 1:12:0:0/1:ffff:0:0 \
    ip_proto udp \
    action mirred egress redirect dev eth0

    v1->v2:
    - improve some err msgs of extack.

    Signed-off-by: Xin Long
    Signed-off-by: David S. Miller

    Xin Long
     
  • This patch is to allow matching gbp option in vxlan.

    The options can be described in the form GBP/GBP_MASK,
    where GBP is represented as a 32bit hexadecimal value.
    Unlike geneve, only one option can be set. Also, geneve options
    and vxlan options can't be set at the same time.

    # ip link add name vxlan0 type vxlan dstport 0 external
    # tc qdisc add dev vxlan0 ingress
    # tc filter add dev vxlan0 protocol ip parent ffff: \
    flower \
    enc_src_ip 10.0.99.192 \
    enc_dst_ip 10.0.99.193 \
    enc_key_id 11 \
    vxlan_opts 01020304/ffffffff \
    ip_proto udp \
    action mirred egress redirect dev eth0

    v1->v2:
    - add .strict_start_type for enc_opts_policy as Jakub noticed.
    - use Duplicate instead of Wrong in err msg for extack as Jakub
    suggested.

    Signed-off-by: Xin Long
    Signed-off-by: David S. Miller

    Xin Long
     

27 Aug, 2019

5 commits

  • Don't manually take rtnl lock in flower classifier before calling cls
    hardware offloads API. Instead, pass rtnl lock status via 'rtnl_held'
    parameter.

    Signed-off-by: Vlad Buslov
    Acked-by: Jiri Pirko
    Signed-off-by: David S. Miller

    Vlad Buslov
     
  • In order to remove dependency on rtnl lock when calling hardware offload
    API, take reference to action mirred dev when initializing flow_action
    structure in tc_setup_flow_action(). Implement the function
    tc_cleanup_flow_action() and use it to release the device after the
    hardware offload API is done using it; a sketch follows this entry.

    Signed-off-by: Vlad Buslov
    Acked-by: Jiri Pirko
    Signed-off-by: David S. Miller

    Vlad Buslov
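
    Conceptually, the cleanup walks the flow_action entries and drops the
    device reference taken at setup time; a sketch (the exact upstream
    body may differ):

    static void tc_cleanup_flow_action(struct flow_action *flow_action)
    {
            struct flow_action_entry *entry;
            int i;

            flow_action_for_each(i, entry, flow_action) {
                    switch (entry->id) {
                    case FLOW_ACTION_REDIRECT:
                    case FLOW_ACTION_MIRRED:
                            /* Paired with the reference taken in
                             * tc_setup_flow_action(). */
                            if (entry->dev)
                                    dev_put(entry->dev);
                            break;
                    default:
                            break;
                    }
            }
    }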
     
  • In order to allow using new flow_action infrastructure from unlocked
    classifiers, modify tc_setup_flow_action() to accept new 'rtnl_held'
    argument. Take rtnl lock before accessing tc_action data. This is necessary
    to protect from concurrent action replace.

    Signed-off-by: Vlad Buslov
    Acked-by: Jiri Pirko
    Signed-off-by: David S. Miller

    Vlad Buslov
     
  • To remove the dependency on the rtnl lock, extend classifier ops with
    new ops->hw_add() and ops->hw_del() callbacks. Call them from the cls
    API, while holding cb_lock, every time a filter is successfully added
    to or deleted from hardware; the signatures are sketched after this
    entry.

    Implement the new API in flower classifier. Use it to manage hw_filters
    list under cb_lock protection, instead of relying on rtnl lock to
    synchronize with concurrent fl_reoffload() call.

    Signed-off-by: Vlad Buslov
    Acked-by: Jiri Pirko
    Signed-off-by: David S. Miller

    Vlad Buslov
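
    The new hooks on struct tcf_proto_ops are simple notification
    callbacks; their signatures, roughly:

    /* Called from the cls API under cb_lock whenever a filter has been
     * added to or deleted from hardware; flower uses them to maintain
     * its hw_filters list. */
    void    (*hw_add)(struct tcf_proto *tp, void *type_data);
    void    (*hw_del)(struct tcf_proto *tp, void *type_data);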
     
  • Without rtnl lock protection, filters can no longer safely manage the
    block offloads counter themselves. Refactor the cls API to protect the
    block offloadcnt with tcf_block->cb_lock, which is already used to
    protect the driver callback list and the nooffloaddevcnt counter. The
    counter can be modified by concurrent tasks through the new functions
    that execute block callbacks (which is safe with the previous patch
    that changed its type to atomic_t); however, block bind/unbind code
    that checks the counter value takes cb_lock in write mode to exclude
    any concurrent modifications. This approach prevents race conditions
    between bind/unbind and callback execution code, but allows
    concurrency on the tc rule update path.

    Move block offload counter, filter in hardware counter and filter flags
    management from classifiers into cls hardware offloads API. Make functions
    tcf_block_offload_{inc|dec}() and tc_cls_offload_cnt_update() to be cls API
    private. Implement following new cls API to be used instead:

    tc_setup_cb_add() - non-destructive filter add. If filter that wasn't
    already in hardware is successfully offloaded, increment block offloads
    counter, set filter in hardware counter and flag. On failure, previously
    offloaded filter is considered to be intact and offloads counter is not
    decremented.

    tc_setup_cb_replace() - destructive filter replace. Release existing
    filter block offload counter and reset its in hardware counter and flag.
    Set new filter in hardware counter and flag. On failure, previously
    offloaded filter is considered to be destroyed and offload counter is
    decremented.

    tc_setup_cb_destroy() - filter destroy. Unconditionally decrement block
    offloads counter.

    tc_setup_cb_reoffload() - reoffload filter to single cb. Execute cb() and
    call tc_cls_offload_cnt_update() if cb() didn't return an error.

    Refactor all offload-capable classifiers to atomically offload filters
    to hardware, change the block offload counter, and set the filter's
    in-hardware counter and flag by means of the new cls API functions;
    a usage sketch follows this entry.

    Signed-off-by: Vlad Buslov
    Acked-by: Jiri Pirko
    Signed-off-by: David S. Miller

    Vlad Buslov
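
    From a classifier's point of view, the add path then looks roughly
    like this (the argument list is assumed from the description above):

    /* Non-destructive add: on failure, previously offloaded state is
     * left intact and the block offloads counter is untouched. */
    err = tc_setup_cb_add(block, tp, TC_SETUP_CLSFLOWER, &cls_flower,
                          skip_sw, &f->flags, &f->in_hw_count, rtnl_held);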
     

02 Jul, 2019

1 commit

  • Similarly, other callers of idr_get_next_ul() suffer from the same
    overflow bug, as they don't handle it properly either.

    Introduce idr_for_each_entry_continue_ul() to help these callers
    iterate from a given ID; its rough shape is sketched after this entry.

    cls_flower needs more care here because it still overflows when it
    does arg->cookie++; we have to fold its nested loops into one
    and remove the arg->cookie++.

    Fixes: 01683a146999 ("net: sched: refactor flower walk to iterate over idr")
    Fixes: 12d6066c3b29 ("net/mlx5: Add flow counters idr")
    Reported-by: Li Shuang
    Cc: Davide Caratti
    Cc: Vlad Buslov
    Cc: Chris Mi
    Cc: Matthew Wilcox
    Signed-off-by: Cong Wang
    Tested-by: Davide Caratti
    Signed-off-by: David S. Miller

    Cong Wang
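
    The helper continues an IDR walk from a caller-supplied ID; its rough
    shape (the upstream macro may differ in detail):

    /* Iterate from id onward, letting idr_get_next_ul() skip holes;
     * callers no longer open-code the overflow-prone cookie++ loop. */
    #define idr_for_each_entry_continue_ul(idr, entry, id)          \
            for (entry = idr_get_next_ul(idr, &(id));               \
                 entry;                                             \
                 ++id, entry = idr_get_next_ul(idr, &(id)))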
     

16 Jun, 2019

1 commit

  • This config option makes only a couple of lines optional: two small
    helpers and an int in a couple of cls structs.

    Remove the config option and always compile this in. This saves the
    user from unexpected surprises when they add a filter with an ingress
    device match that is silently ignored because the config option is
    not set.

    Signed-off-by: Jiri Pirko
    Signed-off-by: David S. Miller

    Jiri Pirko
     

15 Jun, 2019

1 commit

  • The current flower mask creation code assumes that the temporary mask
    used when inserting a new filter is stack-allocated. To prevent a race
    condition with the data path, synchronize_rcu() is called every time
    fl_create_new_mask() replaces the temporary stack-allocated mask. As
    reported by Jiri, this increases the runtime of creating 20000 flower
    classifiers from 4 seconds to 163 seconds. However, this design is no
    longer necessary since the temporary mask was converted to be
    dynamically allocated by commit 2cddd2014782
    ("net/sched: cls_flower: allocate mask dynamically in fl_change()").

    Remove the synchronize_rcu() calls from the mask creation code.
    Instead, refactor fl_change() to always deallocate the temporary mask
    after an rcu grace period; the general pattern is sketched after this
    entry.

    Fixes: 195c234d15c9 ("net: sched: flower: handle concurrent mask insertion")
    Reported-by: Jiri Pirko
    Signed-off-by: Vlad Buslov
    Tested-by: Jiri Pirko
    Acked-by: Jiri Pirko
    Signed-off-by: David S. Miller

    Vlad Buslov
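
    The general pattern, as a generic sketch (tmp_mask, its rcu member and
    tmp_mask_free_rcu() are hypothetical names, not the flower ones):

    /* Before: each temporary-mask replacement stalls the inserting task
     * for a full RCU grace period. */
    synchronize_rcu();
    kfree(tmp_mask);

    /* After: defer the free until after the grace period instead, so
     * back-to-back filter inserts no longer serialize on it. */
    call_rcu(&tmp_mask->rcu, tmp_mask_free_rcu);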
     

31 May, 2019

1 commit

  • Based on 1 normalized pattern(s):

    this program is free software you can redistribute it and or modify
    it under the terms of the gnu general public license as published by
    the free software foundation either version 2 of the license or at
    your option any later version

    extracted by the scancode license scanner the SPDX license identifier

    GPL-2.0-or-later

    has been chosen to replace the boilerplate/reference in 3029 file(s).

    Signed-off-by: Thomas Gleixner
    Reviewed-by: Allison Randal
    Cc: linux-spdx@vger.kernel.org
    Link: https://lkml.kernel.org/r/20190527070032.746973796@linutronix.de
    Signed-off-by: Greg Kroah-Hartman

    Thomas Gleixner
     

08 May, 2019

1 commit

  • Based on feedback from Jiri, avoid carrying a pointer to the tcf_block
    structure in the tc_cls_common_offload structure. Instead store
    a flag in driver private data which indicates if offloads apply
    to a shared block at block binding time.

    Suggested-by: Jiri Pirko
    Signed-off-by: Pieter Jansen van Vuuren
    Reviewed-by: Jakub Kicinski
    Acked-by: Jiri Pirko
    Signed-off-by: David S. Miller

    Pieter Jansen van Vuuren
     

06 May, 2019

1 commit

  • Some actions like the police action are stateful and could share state
    between devices. This is incompatible with offloading to multiple devices
    and drivers might want to test for shared blocks when offloading.
    Store a pointer to the tcf_block structure in the tc_cls_common_offload
    structure to allow drivers to determine when offloads apply to a shared
    block.

    Signed-off-by: Pieter Jansen van Vuuren
    Reviewed-by: Jakub Kicinski
    Signed-off-by: David S. Miller

    Pieter Jansen van Vuuren
     

28 Apr, 2019

2 commits

  • We currently have two levels of strict validation:

    1) liberal (default)
    - undefined (type >= max) & NLA_UNSPEC attributes accepted
    - attribute length >= expected accepted
    - garbage at end of message accepted
    2) strict (opt-in)
    - NLA_UNSPEC attributes accepted
    - attribute length >= expected accepted

    Split out parsing strictness into four different options:
    * TRAILING - check that there's no trailing data after parsing
    attributes (in message or nested)
    * MAXTYPE - reject attrs > max known type
    * UNSPEC - reject attributes with NLA_UNSPEC policy entries
    * STRICT_ATTRS - strictly validate attribute size

    The default for future things should be *everything*.
    The current *_strict() is a combination of TRAILING and MAXTYPE,
    and is renamed to _deprecated_strict().
    The current regular parsing has none of this, and is renamed to
    *_parse_deprecated().

    Additionally, it allows us to selectively set one of the new flags
    even on old policies. Notably, the UNSPEC flag could be useful in
    this case, since it can be arranged (by filling in the policy) to
    not be an incompatible userspace ABI change, but would then, going
    forward, prevent forgetting attribute entries. The same can apply
    to the POLICY flag.

    We end up with the following renames:
    * nla_parse -> nla_parse_deprecated
    * nla_parse_strict -> nla_parse_deprecated_strict
    * nlmsg_parse -> nlmsg_parse_deprecated
    * nlmsg_parse_strict -> nlmsg_parse_deprecated_strict
    * nla_parse_nested -> nla_parse_nested_deprecated
    * nla_validate_nested -> nla_validate_nested_deprecated

    Using spatch, of course:
    @@
    expression TB, MAX, HEAD, LEN, POL, EXT;
    @@
    -nla_parse(TB, MAX, HEAD, LEN, POL, EXT)
    +nla_parse_deprecated(TB, MAX, HEAD, LEN, POL, EXT)

    @@
    expression NLH, HDRLEN, TB, MAX, POL, EXT;
    @@
    -nlmsg_parse(NLH, HDRLEN, TB, MAX, POL, EXT)
    +nlmsg_parse_deprecated(NLH, HDRLEN, TB, MAX, POL, EXT)

    @@
    expression NLH, HDRLEN, TB, MAX, POL, EXT;
    @@
    -nlmsg_parse_strict(NLH, HDRLEN, TB, MAX, POL, EXT)
    +nlmsg_parse_deprecated_strict(NLH, HDRLEN, TB, MAX, POL, EXT)

    @@
    expression TB, MAX, NLA, POL, EXT;
    @@
    -nla_parse_nested(TB, MAX, NLA, POL, EXT)
    +nla_parse_nested_deprecated(TB, MAX, NLA, POL, EXT)

    @@
    expression START, MAX, POL, EXT;
    @@
    -nla_validate_nested(START, MAX, POL, EXT)
    +nla_validate_nested_deprecated(START, MAX, POL, EXT)

    @@
    expression NLH, HDRLEN, MAX, POL, EXT;
    @@
    -nlmsg_validate(NLH, HDRLEN, MAX, POL, EXT)
    +nlmsg_validate_deprecated(NLH, HDRLEN, MAX, POL, EXT)

    For this patch, don't actually add the strict, non-renamed versions
    yet so that it breaks compile if I get it wrong.

    Also, while at it, make nla_validate and nla_parse go down to a
    common __nla_validate_parse() function to avoid code duplication.

    Ultimately, this allows us to have very strict validation for every
    new caller of nla_parse()/nlmsg_parse() etc as re-introduced in the
    next patch, while existing things will continue to work as is.

    In effect then, this adds fully strict validation for any new command.

    Signed-off-by: Johannes Berg
    Signed-off-by: David S. Miller

    Johannes Berg
     
  • Even though the NLA_F_NESTED flag was introduced more than 11 years
    ago, most
    netlink based interfaces (including recently added ones) are still not
    setting it in kernel generated messages. Without the flag, message parsers
    not aware of attribute semantics (e.g. wireshark dissector or libmnl's
    mnl_nlmsg_fprintf()) cannot recognize nested attributes and won't display
    the structure of their contents.

    Unfortunately we cannot just add the flag everywhere as there may be
    userspace applications which check nlattr::nla_type directly rather than
    through a helper masking out the flags. Therefore the patch renames
    nla_nest_start() to nla_nest_start_noflag() and introduces nla_nest_start()
    as a wrapper adding NLA_F_NESTED. The calls which add NLA_F_NESTED manually
    are rewritten to use nla_nest_start().

    Except for changes in include/net/netlink.h, the patch was generated using
    this semantic patch:

    @@ expression E1, E2; @@
    -nla_nest_start(E1, E2)
    +nla_nest_start_noflag(E1, E2)

    @@ expression E1, E2; @@
    -nla_nest_start_noflag(E1, E2 | NLA_F_NESTED)
    +nla_nest_start(E1, E2)

    Signed-off-by: Michal Kubecek
    Acked-by: Jiri Pirko
    Acked-by: David Ahern
    Signed-off-by: David S. Miller

    Michal Kubecek
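
    After the rename, the flag-adding wrapper is trivial; essentially:

    /* New nla_nest_start(): identical to the old call, but always sets
     * NLA_F_NESTED so generic parsers can recognize the nesting. */
    static inline struct nlattr *nla_nest_start(struct sk_buff *skb,
                                                int attrtype)
    {
            return nla_nest_start_noflag(skb, attrtype | NLA_F_NESTED);
    }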
     

25 Apr, 2019

1 commit

  • Recent changes that introduced unlocked flower did not properly account
    for the case when reoffload is initiated concurrently with filter
    updates. To fix the issue, extend flower with a 'hw_filters' list that
    is used to store filters that don't have the 'skip_hw' flag set. A
    filter is added to the list when it is inserted to hardware and only
    removed from it after being unoffloaded from all drivers that the
    parent block is attached to. This ensures that concurrent reoffload
    can still access a filter that is being deleted, and prevents a race
    condition where a driver callback could be removed while the filter is
    no longer accessible through idr but is still present in hardware.

    Refactor fl_change() to respect new filter reference counter and to release
    filter reference with __fl_put() in case of error, instead of directly
    deallocating filter memory. This allows for concurrent access to filter
    from fl_reoffload() and protects it with reference counting. Refactor
    fl_reoffload() to iterate over hw_filters list instead of idr. Implement
    fl_get_next_hw_filter() helper function that is used to iterate over
    hw_filters list with reference counting and skips filters that are being
    concurrently deleted.

    Fixes: 92149190067d ("net: sched: flower: set unlocked flag for flower proto ops")
    Signed-off-by: Vlad Buslov
    Reviewed-by: Saeed Mahameed
    Signed-off-by: David S. Miller

    Vlad Buslov
     

12 Apr, 2019

2 commits

  • Fix net reference counting in fl_change() and remove redundant call to
    tcf_exts_get_net() from __fl_delete(). __fl_put() already tries to get net
    before releasing exts and deallocating a filter, so this code caused
    the flower classifier to obtain net twice per filter being deleted.

    The implementation of __fl_delete() called tcf_exts_get_net() to pass
    its result as the 'async' flag to fl_mask_put(). However, the 'async'
    flag is redundant and only complicates the fl_mask_put()
    implementation. This functionality seems to have been copied from the
    filter cleanup code, where it was added by Cong with the following
    explanation:

    This patchset tries to fix the race between call_rcu() and
    cleanup_net() again. Without holding the netns refcnt the
    tc_action_net_exit() in netns workqueue could be called before
    filter destroy works in tc filter workqueue. This patchset
    moves the netns refcnt from tc actions to tcf_exts, without
    breaking per-netns tc actions.

    This doesn't apply to flower mask, which doesn't call any tc action code
    during cleanup. Simplify fl_mask_put() by removing the flag parameter and
    always use tcf_queue_work() to free mask objects.

    Fixes: 061775583e35 ("net: sched: flower: introduce reference counting for filters")
    Fixes: 1f17f7742eeb ("net: sched: flower: insert filter to ht before offloading it to hw")
    Fixes: 05cd271fd61a ("cls_flower: Support multiple masks per priority")
    Reported-by: Ido Schimmel
    Signed-off-by: Vlad Buslov
    Signed-off-by: David S. Miller

    Vlad Buslov
     
  • The implementation of rhashtable_insert_fast() checks whether its
    internal helper __rhashtable_insert_fast() returns a non-NULL pointer
    and seemingly returns -EEXIST in such a case. However, since
    __rhashtable_insert_fast() is called with a NULL key pointer, it never
    actually checks for duplicates, which means that -EEXIST is never
    returned to the user. Use the rhashtable_lookup_insert_fast() hash
    table API instead. In order to verify that it works as expected, and
    to prevent the problem from happening in the future, extend tc-tests
    with a new test that verifies that no new filters with an existing key
    can be inserted to the flower classifier.

    Fixes: 1f17f7742eeb ("net: sched: flower: insert filter to ht before offloading it to hw")
    Signed-off-by: Vlad Buslov
    Reviewed-by: Jakub Kicinski
    Signed-off-by: David S. Miller

    Vlad Buslov
     

08 Apr, 2019

1 commit

  • John reports:

    Recent refactoring of fl_change aims to use the classifier spinlock to
    avoid the need for the rtnl lock. In doing so, the fl_hw_replace_filter()
    function was moved to before the lock is taken. This can create problems
    for drivers if duplicate filters are created (common in ovs tc offload
    due to filters being triggered by user-space matches).

    Drivers registered for such filters will now receive multiple copies of
    the same rule, each with a different cookie value. This means that the
    drivers would need to do a full match field lookup to determine
    duplicates, repeating work that will happen in flower __fl_lookup().
    Currently, drivers do not expect to receive duplicate filters.

    To fix this, verify that a filter with the same key is not present in
    the flower classifier hash table and insert the new filter to the
    flower hash table before offloading it to hardware. Implement the
    helper function fl_ht_insert_unique() to atomically verify/insert a
    filter; a sketch follows this entry.

    This change makes filter visible to fast path at the beginning of
    fl_change() function, which means it can no longer be freed directly in
    case of error. Refactor fl_change() error handling code to deallocate the
    filter with rcu timeout.

    Fixes: 620da4860827 ("net: sched: flower: refactor fl_change")
    Reported-by: John Hurley
    Signed-off-by: Vlad Buslov
    Acked-by: Jiri Pirko
    Signed-off-by: David S. Miller

    Vlad Buslov
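
    A sketch of the helper's shape; note that, as the 12 Apr entry above
    describes, the duplicate check only became effective once
    rhashtable_lookup_insert_fast() was used, which is the form shown here
    (the error-handling details are assumed):

    static int fl_ht_insert_unique(struct cls_fl_filter *fnew,
                                   struct cls_fl_filter *fold, bool *in_ht)
    {
            struct rhashtable_params params =
                    fnew->mask->filter_ht_params;
            int err;

            /* Atomically verify that no filter with the same key exists
             * and insert the new one, before any hardware offload. */
            err = rhashtable_lookup_insert_fast(&fnew->mask->ht,
                                                &fnew->ht_node, params);
            if (err) {
                    *in_ht = false;
                    /* On replace, colliding with the filter being
                     * replaced is expected. */
                    if (fold && err == -EEXIST)
                            err = 0;
                    return err;
            }
            *in_ht = true;
            return 0;
    }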
     

05 Apr, 2019

1 commit

  • Recent changes to TC flower remove the requirement for rtnl lock when
    accessing and modifying filters. Refcounts now ensure access and deletion
    do not happen concurrently. However, the reoffload function which cycles
    through all filters and replays them to registered hw drivers is not
    protected.

    Use the fl_get_next_filter() function to cycle the filters for reoffload
    and ensure the ref taken by this function is put when done with each
    filter.

    Signed-off-by: John Hurley
    Reviewed-by: Jakub Kicinski
    Reviewed-by: Vlad Buslov
    Signed-off-by: David S. Miller

    John Hurley
     

22 Mar, 2019

2 commits

  • Set TCF_PROTO_OPS_DOIT_UNLOCKED for the flower classifier to indicate
    that its ops callbacks don't require the caller to hold the rtnl lock
    (see the snippet after this entry). Don't take the rtnl lock in
    fl_destroy_filter_work(), which is executed on a workqueue instead of
    being called by the cls API, and is therefore not affected by setting
    TCF_PROTO_OPS_DOIT_UNLOCKED. The rtnl mutex is still manually taken by
    the flower classifier before calling the hardware offloads API, which
    has not been updated for unlocked execution.

    Signed-off-by: Vlad Buslov
    Reviewed-by: Stefano Brivio
    Acked-by: Jiri Pirko
    Signed-off-by: David S. Miller

    Vlad Buslov
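
    Flower advertises this through the ops flags field; schematically:

    /* In cls_fl_ops: tell the cls API that flower's callbacks may be
     * invoked without the caller holding rtnl. */
    .flags          = TCF_PROTO_OPS_DOIT_UNLOCKED,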
     
  • Use 'rtnl_held' flag to track if caller holds rtnl lock. Propagate the flag
    to internal functions that need to know rtnl lock state. Take rtnl lock
    before calling tcf APIs that require it (hw offload, bind filter, etc.).

    Signed-off-by: Vlad Buslov
    Reviewed-by: Stefano Brivio
    Acked-by: Jiri Pirko
    Signed-off-by: David S. Miller

    Vlad Buslov