27 Jan, 2020

1 commit

  • The current implementations of ops->bind_class() are merely
    searching for classid and updating class in the struct tcf_result,
    without invoking either of cl_ops->bind_tcf() or
    cl_ops->unbind_tcf(). This breaks the design of them as qdisc's
    like cbq use them to count filters too. This is why syzbot triggered
    the warning in cbq_destroy_class().

    In order to fix this, we have to call cl_ops->bind_tcf() and
    cl_ops->unbind_tcf() like the filter binding path. This patch does
    so by refactoring out two helper functions __tcf_bind_filter()
    and __tcf_unbind_filter(), which are lockless and accept a Qdisc
    pointer, then teaching each implementation to call them correctly.

    Note, we merely pass the Qdisc pointer as an opaque pointer to
    each filter, they only need to pass it down to the helper
    functions without understanding it at all.

    Fixes: 07d79fc7d94e ("net_sched: add reverse binding for tc class")
    Reported-and-tested-by: syzbot+0a0596220218fcb603a8@syzkaller.appspotmail.com
    Reported-and-tested-by: syzbot+63bdb6006961d8c917c6@syzkaller.appspotmail.com
    Cc: Jamal Hadi Salim
    Cc: Jiri Pirko
    Signed-off-by: Cong Wang
    Signed-off-by: David S. Miller

    Cong Wang
     

31 May, 2019

1 commit

  • Based on 1 normalized pattern(s):

    this program is free software you can redistribute it and or modify
    it under the terms of the gnu general public license as published by
    the free software foundation either version 2 of the license or at
    your option any later version

    extracted by the scancode license scanner the SPDX license identifier

    GPL-2.0-or-later

    has been chosen to replace the boilerplate/reference in 3029 file(s).

    Signed-off-by: Thomas Gleixner
    Reviewed-by: Allison Randal
    Cc: linux-spdx@vger.kernel.org
    Link: https://lkml.kernel.org/r/20190527070032.746973796@linutronix.de
    Signed-off-by: Greg Kroah-Hartman

    Thomas Gleixner
     

28 Apr, 2019

2 commits

  • We currently have two levels of strict validation:

    1) liberal (default)
    - undefined (type >= max) & NLA_UNSPEC attributes accepted
    - attribute length >= expected accepted
    - garbage at end of message accepted
    2) strict (opt-in)
    - NLA_UNSPEC attributes accepted
    - attribute length >= expected accepted

    Split out parsing strictness into four different options:
    * TRAILING - check that there's no trailing data after parsing
    attributes (in message or nested)
    * MAXTYPE - reject attrs > max known type
    * UNSPEC - reject attributes with NLA_UNSPEC policy entries
    * STRICT_ATTRS - strictly validate attribute size

    The default for future things should be *everything*.
    The current *_strict() is a combination of TRAILING and MAXTYPE,
    and is renamed to _deprecated_strict().
    The current regular parsing has none of this, and is renamed to
    *_parse_deprecated().

    Additionally it allows us to selectively set one of the new flags
    even on old policies. Notably, the UNSPEC flag could be useful in
    this case, since it can be arranged (by filling in the policy) to
    not be an incompatible userspace ABI change, but would then going
    forward prevent forgetting attribute entries. Similar can apply
    to the POLICY flag.

    We end up with the following renames:
    * nla_parse -> nla_parse_deprecated
    * nla_parse_strict -> nla_parse_deprecated_strict
    * nlmsg_parse -> nlmsg_parse_deprecated
    * nlmsg_parse_strict -> nlmsg_parse_deprecated_strict
    * nla_parse_nested -> nla_parse_nested_deprecated
    * nla_validate_nested -> nla_validate_nested_deprecated

    Using spatch, of course:
    @@
    expression TB, MAX, HEAD, LEN, POL, EXT;
    @@
    -nla_parse(TB, MAX, HEAD, LEN, POL, EXT)
    +nla_parse_deprecated(TB, MAX, HEAD, LEN, POL, EXT)

    @@
    expression NLH, HDRLEN, TB, MAX, POL, EXT;
    @@
    -nlmsg_parse(NLH, HDRLEN, TB, MAX, POL, EXT)
    +nlmsg_parse_deprecated(NLH, HDRLEN, TB, MAX, POL, EXT)

    @@
    expression NLH, HDRLEN, TB, MAX, POL, EXT;
    @@
    -nlmsg_parse_strict(NLH, HDRLEN, TB, MAX, POL, EXT)
    +nlmsg_parse_deprecated_strict(NLH, HDRLEN, TB, MAX, POL, EXT)

    @@
    expression TB, MAX, NLA, POL, EXT;
    @@
    -nla_parse_nested(TB, MAX, NLA, POL, EXT)
    +nla_parse_nested_deprecated(TB, MAX, NLA, POL, EXT)

    @@
    expression START, MAX, POL, EXT;
    @@
    -nla_validate_nested(START, MAX, POL, EXT)
    +nla_validate_nested_deprecated(START, MAX, POL, EXT)

    @@
    expression NLH, HDRLEN, MAX, POL, EXT;
    @@
    -nlmsg_validate(NLH, HDRLEN, MAX, POL, EXT)
    +nlmsg_validate_deprecated(NLH, HDRLEN, MAX, POL, EXT)

    For this patch, don't actually add the strict, non-renamed versions
    yet so that it breaks compile if I get it wrong.

    Also, while at it, make nla_validate and nla_parse go down to a
    common __nla_validate_parse() function to avoid code duplication.

    Ultimately, this allows us to have very strict validation for every
    new caller of nla_parse()/nlmsg_parse() etc as re-introduced in the
    next patch, while existing things will continue to work as is.

    In effect then, this adds fully strict validation for any new command.

    Signed-off-by: Johannes Berg
    Signed-off-by: David S. Miller

    Johannes Berg
     
  • Even if the NLA_F_NESTED flag was introduced more than 11 years ago, most
    netlink based interfaces (including recently added ones) are still not
    setting it in kernel generated messages. Without the flag, message parsers
    not aware of attribute semantics (e.g. wireshark dissector or libmnl's
    mnl_nlmsg_fprintf()) cannot recognize nested attributes and won't display
    the structure of their contents.

    Unfortunately we cannot just add the flag everywhere as there may be
    userspace applications which check nlattr::nla_type directly rather than
    through a helper masking out the flags. Therefore the patch renames
    nla_nest_start() to nla_nest_start_noflag() and introduces nla_nest_start()
    as a wrapper adding NLA_F_NESTED. The calls which add NLA_F_NESTED manually
    are rewritten to use nla_nest_start().

    Except for changes in include/net/netlink.h, the patch was generated using
    this semantic patch:

    @@ expression E1, E2; @@
    -nla_nest_start(E1, E2)
    +nla_nest_start_noflag(E1, E2)

    @@ expression E1, E2; @@
    -nla_nest_start_noflag(E1, E2 | NLA_F_NESTED)
    +nla_nest_start(E1, E2)

    Signed-off-by: Michal Kubecek
    Acked-by: Jiri Pirko
    Acked-by: David Ahern
    Signed-off-by: David S. Miller

    Michal Kubecek
     

23 Feb, 2019

1 commit

  • For tcindex filter, it is too late to initialize the
    net pointer in tcf_exts_validate(), as tcf_exts_get_net()
    requires a non-NULL net pointer. We can just move its
    initialization into tcf_exts_init(), which just requires
    an additional parameter.

    This makes the code in tcindex_alloc_perfect_hash()
    prettier.

    Cc: Jamal Hadi Salim
    Cc: Jiri Pirko
    Signed-off-by: Cong Wang
    Signed-off-by: David S. Miller

    Cong Wang
     

13 Feb, 2019

2 commits

  • Add 'rtnl_held' flag to tcf proto change, delete, destroy, dump, walk
    functions to track rtnl lock status. Extend users of these function in cls
    API to propagate rtnl lock status to them. This allows classifiers to
    obtain rtnl lock when necessary and to pass rtnl lock status to extensions
    and driver offload callbacks.

    Add flags field to tcf proto ops. Add flag value to indicate that
    classifier doesn't require rtnl lock.

    Signed-off-by: Vlad Buslov
    Acked-by: Jiri Pirko
    Signed-off-by: David S. Miller

    Vlad Buslov
     
  • Actions API is already updated to not rely on rtnl lock for
    synchronization. However, it need to be provided with rtnl status when
    called from classifiers API in order to be able to correctly release the
    lock when loading kernel module.

    Extend extension validation function with 'rtnl_held' flag which is passed
    to actions API. Add new 'rtnl_held' parameter to tcf_exts_validate() in cls
    API. No classifier is currently updated to support unlocked execution, so
    pass hardcoded 'true' flag parameter value.

    Signed-off-by: Vlad Buslov
    Acked-by: Jiri Pirko
    Signed-off-by: David S. Miller

    Vlad Buslov
     

20 Jan, 2019

1 commit

  • Similar to u32 filter, it is useful to know how many times
    we reach each basic filter and how many times we pass the
    ematch attached to it.

    Sample output:

    filter protocol arp pref 49152 basic chain 0
    filter protocol arp pref 49152 basic chain 0 handle 0x1 (rule hit 3 success 3)
    action order 1: gact action pass
    random type none pass val 0
    index 1 ref 1 bind 1 installed 81 sec used 4 sec
    Action statistics:
    Sent 126 bytes 3 pkt (dropped 0, overlimits 0 requeues 0)
    backlog 0b 0p requeues 0

    Cc: Jamal Hadi Salim
    Cc: Jiri Pirko
    Signed-off-by: Cong Wang
    Signed-off-by: David S. Miller

    Cong Wang
     

25 Jul, 2018

1 commit


25 May, 2018

1 commit

  • Commit 05f0fe6b74db ("RCU, workqueue: Implement rcu_work") introduces
    new API's for dispatching work in a RCU callback. Now we can just
    switch to the new API's for tc filters. This could get rid of a lot
    of code.

    Cc: Tejun Heo
    Cc: "Paul E. McKenney"
    Cc: Jamal Hadi Salim
    Signed-off-by: Cong Wang
    Signed-off-by: David S. Miller

    Cong Wang
     

07 Feb, 2018

3 commits


25 Jan, 2018

1 commit


20 Jan, 2018

3 commits


10 Nov, 2017

1 commit


09 Nov, 2017

1 commit

  • Hold netns refcnt before call_rcu() and release it after
    the tcf_exts_destroy() is done.

    Note, on ->destroy() path we have to respect the return value
    of tcf_exts_get_net(), on other paths it should always return
    true, so we don't need to care.

    Cc: Lucas Bates
    Cc: Jamal Hadi Salim
    Cc: Jiri Pirko
    Signed-off-by: Cong Wang
    Signed-off-by: David S. Miller

    Cong Wang
     

30 Oct, 2017

1 commit

  • Several conflicts here.

    NFP driver bug fix adding nfp_netdev_is_nfp_repr() check to
    nfp_fl_output() needed some adjustments because the code block is in
    an else block now.

    Parallel additions to net/pkt_cls.h and net/sch_generic.h

    A bug fix in __tcp_retransmit_skb() conflicted with some of
    the rbtree changes in net-next.

    The tc action RCU callback fixes in 'net' had some overlap with some
    of the recent tcf_block reworking.

    Signed-off-by: David S. Miller

    David S. Miller
     

29 Oct, 2017

1 commit

  • Defer the tcf_exts_destroy() in RCU callback to
    tc filter workqueue and get RTNL lock.

    Reported-by: Chris Mi
    Cc: Daniel Borkmann
    Cc: Jiri Pirko
    Cc: John Fastabend
    Cc: Jamal Hadi Salim
    Cc: "Paul E. McKenney"
    Signed-off-by: Cong Wang
    Signed-off-by: David S. Miller

    Cong Wang
     

01 Oct, 2017

1 commit

  • The assignment of -EINVAL to variable ret is redundant as it
    is being overwritten on the following error exit paths or
    to the return value from the following call to basic_set_parms.
    Fix this up by removing it. Cleans up clang warning message:

    net/sched/cls_basic.c:185:2: warning: Value stored to 'err' is never read

    Fixes: 1d8134fea2eb ("net_sched: use idr to allocate basic filter handles")
    Signed-off-by: Colin Ian King
    Acked-by: Cong Wang
    Signed-off-by: David S. Miller

    Colin Ian King
     

29 Sep, 2017

1 commit


01 Sep, 2017

1 commit

  • TC filters when used as classifiers are bound to TC classes.
    However, there is a hidden difference when adding them in different
    orders:

    1. If we add tc classes before its filters, everything is fine.
    Logically, the classes exist before we specify their ID's in
    filters, it is easy to bind them together, just as in the current
    code base.

    2. If we add tc filters before the tc classes they bind, we have to
    do dynamic lookup in fast path. What's worse, this happens all
    the time not just once, because on fast path tcf_result is passed
    on stack, there is no way to propagate back to the one in tc filters.

    This hidden difference hurts performance silently if we have many tc
    classes in hierarchy.

    This patch intends to close this gap by doing the reverse binding when
    we create a new class, in this case we can actually search all the
    filters in its parent, match and fixup by classid. And because
    tcf_result is specific to each type of tc filter, we have to introduce
    a new ops for each filter to tell how to bind the class.

    Note, we still can NOT totally get rid of those class lookup in
    ->enqueue() because cgroup and flow filters have no way to determine
    the classid at setup time, they still have to go through dynamic lookup.

    Cc: Jamal Hadi Salim
    Signed-off-by: Cong Wang
    Signed-off-by: David S. Miller

    Cong Wang
     

08 Aug, 2017

1 commit

  • Now we use 'unsigned long fh' as a pointer in every place,
    it is safe to convert it to a void pointer now. This gets
    rid of many casts to pointer.

    Cc: Jamal Hadi Salim
    Cc: Jiri Pirko
    Signed-off-by: Cong Wang
    Acked-by: Jamal Hadi Salim
    Signed-off-by: David S. Miller

    WANG Cong
     

05 Aug, 2017

2 commits


22 Apr, 2017

1 commit

  • We could have a race condition where in ->classify() path we
    dereference tp->root and meanwhile a parallel ->destroy() makes it
    a NULL. Daniel cured this bug in commit d936377414fa
    ("net, sched: respect rcu grace period on cls destruction").

    This happens when ->destroy() is called for deleting a filter to
    check if we are the last one in tp, this tp is still linked and
    visible at that time. The root cause of this problem is the semantic
    of ->destroy(), it does two things (for non-force case):

    1) check if tp is empty
    2) if tp is empty we could really destroy it

    and its caller, if cares, needs to check its return value to see if it
    is really destroyed. Therefore we can't unlink tp unless we know it is
    empty.

    As suggested by Daniel, we could actually move the test logic to ->delete()
    so that we can safely unlink tp after ->delete() tells us the last one is
    just deleted and before ->destroy().

    Fixes: 1e052be69d04 ("net_sched: destroy proto tp when all filters are gone")
    Cc: Roi Dayan
    Cc: Daniel Borkmann
    Cc: John Fastabend
    Cc: Jamal Hadi Salim
    Signed-off-by: Cong Wang
    Acked-by: Daniel Borkmann
    Signed-off-by: David S. Miller

    WANG Cong
     

14 Apr, 2017

1 commit


28 Nov, 2016

1 commit

  • Roi reported a crash in flower where tp->root was NULL in ->classify()
    callbacks. Reason is that in ->destroy() tp->root is set to NULL via
    RCU_INIT_POINTER(). It's problematic for some of the classifiers, because
    this doesn't respect RCU grace period for them, and as a result, still
    outstanding readers from tc_classify() will try to blindly dereference
    a NULL tp->root.

    The tp->root object is strictly private to the classifier implementation
    and holds internal data the core such as tc_ctl_tfilter() doesn't know
    about. Within some classifiers, such as cls_bpf, cls_basic, etc, tp->root
    is only checked for NULL in ->get() callback, but nowhere else. This is
    misleading and seemed to be copied from old classifier code that was not
    cleaned up properly. For example, d3fa76ee6b4a ("[NET_SCHED]: cls_basic:
    fix NULL pointer dereference") moved tp->root initialization into ->init()
    routine, where before it was part of ->change(), so ->get() had to deal
    with tp->root being NULL back then, so that was indeed a valid case, after
    d3fa76ee6b4a, not really anymore. We used to set tp->root to NULL long
    ago in ->destroy(), see 47a1a1d4be29 ("pkt_sched: remove unnecessary xchg()
    in packet classifiers"); but the NULLifying was reintroduced with the
    RCUification, but it's not correct for every classifier implementation.

    In the cases that are fixed here with one exception of cls_cgroup, tp->root
    object is allocated and initialized inside ->init() callback, which is always
    performed at a point in time after we allocate a new tp, which means tp and
    thus tp->root was not globally visible in the tp chain yet (see tc_ctl_tfilter()).
    Also, on destruction tp->root is strictly kfree_rcu()'ed in ->destroy()
    handler, same for the tp which is kfree_rcu()'ed right when we return
    from ->destroy() in tcf_destroy(). This means, the head object's lifetime
    for such classifiers is always tied to the tp lifetime. The RCU callback
    invocation for the two kfree_rcu() could be out of order, but that's fine
    since both are independent.

    Dropping the RCU_INIT_POINTER(tp->root, NULL) for these classifiers here
    means that 1) we don't need a useless NULL check in fast-path and, 2) that
    outstanding readers of that tp in tc_classify() can still execute under
    respect with RCU grace period as it is actually expected.

    Things that haven't been touched here: cls_fw and cls_route. They each
    handle tp->root being NULL in ->classify() path for historic reasons, so
    their ->destroy() implementation can stay as is. If someone actually
    cares, they could get cleaned up at some point to avoid the test in fast
    path. cls_u32 doesn't set tp->root to NULL. For cls_rsvp, I just added a
    !head should anyone actually be using/testing it, so it at least aligns with
    cls_fw and cls_route. For cls_flower we additionally need to defer rhashtable
    destruction (to a sleepable context) after RCU grace period as concurrent
    readers might still access it. (Note that in this case we need to hold module
    reference to keep work callback address intact, since we only wait on module
    unload for all call_rcu()s to finish.)

    This fixes one race to bring RCU grace period guarantees back. Next step
    as worked on by Cong however is to fix 1e052be69d04 ("net_sched: destroy
    proto tp when all filters are gone") to get the order of unlinking the tp
    in tc_ctl_tfilter() for the RTM_DELTFILTER case right by moving
    RCU_INIT_POINTER() before tcf_destroy() and let the notification for
    removal be done through the prior ->delete() callback. Both are independant
    issues. Once we have that right, we can then clean tp->root up for a number
    of classifiers by not making them RCU pointers, which requires a new callback
    (->uninit) that is triggered from tp's RCU callback, where we just kfree()
    tp->root from there.

    Fixes: 1f947bf151e9 ("net: sched: rcu'ify cls_bpf")
    Fixes: 9888faefe132 ("net: sched: cls_basic use RCU")
    Fixes: 70da9f0bf999 ("net: sched: cls_flow use RCU")
    Fixes: 77b9900ef53a ("tc: introduce Flower classifier")
    Fixes: bf3994d2ed31 ("net/sched: introduce Match-all classifier")
    Fixes: 952313bd6258 ("net: sched: cls_cgroup use RCU")
    Reported-by: Roi Dayan
    Signed-off-by: Daniel Borkmann
    Cc: Cong Wang
    Cc: John Fastabend
    Cc: Roi Dayan
    Cc: Jiri Pirko
    Acked-by: John Fastabend
    Acked-by: Cong Wang
    Signed-off-by: David S. Miller

    Daniel Borkmann
     

23 Aug, 2016

1 commit

  • After commit 22dc13c837c3 ("net_sched: convert tcf_exts from list to pointer array")
    we do dynamic allocation in tcf_exts_init(), therefore we need
    to handle the ENOMEM case properly.

    Cc: Jamal Hadi Salim
    Signed-off-by: Cong Wang
    Acked-by: Jamal Hadi Salim
    Acked-by: Jamal Hadi Salim
    Signed-off-by: David S. Miller

    WANG Cong
     

10 Mar, 2015

1 commit

  • Kernel automatically creates a tp for each
    (kind, protocol, priority) tuple, which has handle 0,
    when we add a new filter, but it still is left there
    after we remove our own, unless we don't specify the
    handle (literally means all the filters under
    the tuple). For example this one is left:

    # tc filter show dev eth0
    filter parent 8001: protocol arp pref 49152 basic

    The user-space is hard to clean up these for kernel
    because filters like u32 are organized in a complex way.
    So kernel is responsible to remove it after all filters
    are gone. Each type of filter has its own way to
    store the filters, so each type has to provide its
    way to check if all filters are gone.

    Cc: Jamal Hadi Salim
    Signed-off-by: Cong Wang
    Signed-off-by: Cong Wang
    Acked-by: Jamal Hadi Salim
    Signed-off-by: David S. Miller

    Cong Wang
     

27 Jan, 2015

1 commit


10 Dec, 2014

2 commits


09 Dec, 2014

1 commit


07 Oct, 2014

2 commits

  • Using the tcf_proto pointer 'tp' from inside the classifiers callback
    is not valid because it may have been cleaned up by another call_rcu
    occuring on another CPU.

    'tp' is currently being used by tcf_unbind_filter() in this patch we
    move instances of tcf_unbind_filter outside of the call_rcu() context.
    This is safe to do because any running schedulers will either read the
    valid class field or it will be zeroed.

    And all schedulers today when the class is 0 do a lookup using the
    same call used by the tcf_exts_bind(). So even if we have a running
    classifier hit the null class pointer it will do a lookup and get
    to the same result. This is particularly fragile at the moment because
    the only way to verify this is to audit the schedulers call sites.

    Reported-by: Cong Wang
    Signed-off-by: John Fastabend
    Acked-by: Cong Wang
    Signed-off-by: David S. Miller

    John Fastabend
     
  • This removes the tcf_proto argument from the ematch code paths that
    only need it to reference the net namespace. This allows simplifying
    qdisc code paths especially when we need to tear down the ematch
    from an RCU callback. In this case we can not guarentee that the
    tcf_proto structure is still valid.

    Signed-off-by: John Fastabend
    Acked-by: Cong Wang
    Signed-off-by: David S. Miller

    John Fastabend
     

29 Sep, 2014

1 commit


14 Sep, 2014

1 commit

  • Enable basic classifier for RCU.

    Dereferencing tp->root may look a bit strange here but it is needed
    by my accounting because it is allocated at init time and needs to
    be kfree'd at destroy time. However because it may be referenced in
    the classify() path we must wait an RCU grace period before free'ing
    it. We use kfree_rcu() and rcu_ APIs to enforce this. This pattern
    is used in all the classifiers.

    Also the hgenerator can be incremented without concern because it
    is always incremented under RTNL.

    Signed-off-by: John Fastabend
    Acked-by: Eric Dumazet
    Signed-off-by: David S. Miller

    John Fastabend