18 Feb, 2017

1 commit

  • [ Upstream commit fd62d9f5c575f0792f150109f1fd24a0d4b3f854 ]

    In the current version, the matchall internal state is split into two
    structs: cls_matchall_head and cls_matchall_filter. This makes little
    sense, as a matchall instance supports only one filter, and there is no
    situation where one exists and the other does not. In addition, the split
    led to races when a filter was deleted while a packet was being processed.

    Unify the two structs into one, simplifying matchall creation and
    deletion. As a result, the new, delete and get callbacks have dummy
    implementations, with all the work done in the destroy and change
    callbacks, as was done in cls_cgroup (see the sketch after this entry).

    Fixes: bf3994d2ed31 ("net/sched: introduce Match-all classifier")
    Reported-by: Daniel Borkmann
    Signed-off-by: Yotam Gigi
    Acked-by: Jiri Pirko
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Yotam Gigi
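
    A minimal userspace-style sketch of the idea (illustrative names and
    fields only, not the actual cls_matchall definitions): since a matchall
    instance holds at most one filter, the per-instance "head" and the
    per-filter state can live in one allocation with one lifetime.

        /* Illustrative sketch only -- a simplified stand-in, not the kernel's
         * cls_matchall code.  One struct now carries both the per-instance
         * and the per-filter state, so there is no window where one exists
         * without the other. */
        #include <stdlib.h>

        struct mall_state {
            unsigned int handle;    /* user-visible filter handle            */
            unsigned int flags;     /* e.g. offload flags                    */
            void        *actions;   /* attached actions (tcf_exts in-kernel) */
        };

        static struct mall_state *mall_create(unsigned int handle)
        {
            struct mall_state *m = calloc(1, sizeof(*m));

            if (m)
                m->handle = handle;
            return m;               /* created and destroyed as one unit */
        }

        static void mall_destroy(struct mall_state *m)
        {
            free(m);
        }

        int main(void)
        {
            mall_destroy(mall_create(0x1));
            return 0;
        }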
     

04 Feb, 2017

1 commit

  • [ Upstream commit 0faa9cb5b3836a979864a6357e01d2046884ad52 ]

    Demonstrating the issue:

    ... add a drop action
    $ sudo $TC actions add action drop index 10

    .. retrieve it
    $ sudo $TC -s actions get action gact index 10

    action order 1: gact action drop
    random type none pass val 0
    index 10 ref 2 bind 0 installed 29 sec used 29 sec
    Action statistics:
    Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
    backlog 0b 0p requeues 0

    ... bug 1 above: the reference count is reported as two.
    It is actually 1, but we forget to subtract 1.

    ... do a GET again and we see the same issue
    try a few times and nothing changes
    ~$ sudo $TC -s actions get action gact index 10

    action order 1: gact action drop
    random type none pass val 0
    index 10 ref 2 bind 0 installed 31 sec used 31 sec
    Action statistics:
    Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
    backlog 0b 0p requeues 0

    ... let's try to bind the action to a filter...
    $ sudo $TC qdisc add dev lo ingress
    $ sudo $TC filter add dev lo parent ffff: protocol ip prio 1 \
    u32 match ip dst 127.0.0.1/32 flowid 1:1 action gact index 10

    ... and now a few GETs:
    $ sudo $TC -s actions get action gact index 10

    action order 1: gact action drop
    random type none pass val 0
    index 10 ref 3 bind 1 installed 204 sec used 204 sec
    Action statistics:
    Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
    backlog 0b 0p requeues 0

    $ sudo $TC -s actions get action gact index 10

    action order 1: gact action drop
    random type none pass val 0
    index 10 ref 4 bind 1 installed 206 sec used 206 sec
    Action statistics:
    Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
    backlog 0b 0p requeues 0

    $ sudo $TC -s actions get action gact index 10

    action order 1: gact action drop
    random type none pass val 0
    index 10 ref 5 bind 1 installed 235 sec used 235 sec
    Action statistics:
    Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
    backlog 0b 0p requeues 0

    ... as can be observed, the reference count keeps going up with every
    GET (a sketch of the get/put pairing follows this entry).

    After the fix

    $ sudo $TC actions add action drop index 10
    $ sudo $TC -s actions get action gact index 10

    action order 1: gact action drop
    random type none pass val 0
    index 10 ref 1 bind 0 installed 4 sec used 4 sec
    Action statistics:
    Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
    backlog 0b 0p requeues 0

    $ sudo $TC -s actions get action gact index 10

    action order 1: gact action drop
    random type none pass val 0
    index 10 ref 1 bind 0 installed 6 sec used 6 sec
    Action statistics:
    Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
    backlog 0b 0p requeues 0

    $ sudo $TC qdisc add dev lo ingress
    $ sudo $TC filter add dev lo parent ffff: protocol ip prio 1 \
    u32 match ip dst 127.0.0.1/32 flowid 1:1 action gact index 10

    $ sudo $TC -s actions get action gact index 10

    action order 1: gact action drop
    random type none pass val 0
    index 10 ref 2 bind 1 installed 32 sec used 32 sec
    Action statistics:
    Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
    backlog 0b 0p requeues 0

    $ sudo $TC -s actions get action gact index 10

    action order 1: gact action drop
    random type none pass val 0
    index 10 ref 2 bind 1 installed 33 sec used 33 sec
    Action statistics:
    Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
    backlog 0b 0p requeues 0

    Fixes: aecc5cefc389 ("net sched actions: fix GETing actions")
    Signed-off-by: Jamal Hadi Salim
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Jamal Hadi Salim
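
    A hypothetical userspace sketch of the invariant involved (not the actual
    act_api code; the names are made up): a lookup that pins an action with a
    temporary reference while a GET reply is built must drop that reference
    once the reply is done, and must not report the temporary pin, otherwise
    the count shown grows by one on every GET.

        /* Hypothetical sketch, not kernel code. */
        #include <stdio.h>

        struct action {
            int refcnt;     /* 1 == only the action table holds it */
            int bind;       /* how many filters are bound to it    */
        };

        static void act_get(struct action *a) { a->refcnt++; }
        static void act_put(struct action *a) { a->refcnt--; }

        static void handle_get(struct action *a)
        {
            act_get(a);         /* pin the action while the reply is built */
            /* Report without counting the temporary pin taken just above. */
            printf("ref %d bind %d\n", a->refcnt - 1, a->bind);
            act_put(a);         /* the missing release is what leaked      */
        }

        int main(void)
        {
            struct action a = { .refcnt = 1, .bind = 0 };

            handle_get(&a);     /* prints "ref 1 bind 0"           */
            handle_get(&a);     /* still "ref 1 bind 0" -- no leak */
            return 0;
        }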
     

15 Jan, 2017

2 commits

  • [ Upstream commit 0df0f207aab4f42e5c96a807adf9a6845b69e984 ]

    Since we now use a non-zero mask on addr_type, we are matching on its
    value (IPV4/IPV6). So before this fix, matching on enc_src_ip/enc_dst_ip
    failed in the SW/classify path, since the stored value was zero.
    This patch sets the proper value of addr_type for encapsulated packets
    (see the masked-match sketch after this entry).

    Fixes: 970bfcd09791 ('net/sched: cls_flower: Use mask for addr_type')
    Signed-off-by: Paul Blakey
    Reviewed-by: Hadar Hen Zion
    Acked-by: Jiri Pirko
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Paul Blakey
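
    A standalone sketch of the masked-match behaviour described above
    (illustrative constants, not the flower key layout): once the mask on a
    field is non-zero, a rule whose stored value was left at zero no longer
    matches packets where that field is set.

        /* Illustrative only -- not the cls_flower code. */
        #include <stdio.h>

        int main(void)
        {
            unsigned int mask     = 0xff;  /* non-zero mask: value now matters */
            unsigned int rule_val = 0;     /* bug: left at zero for encap keys */
            unsigned int pkt_val  = 1;     /* e.g. "IPv4 addresses present"    */

            printf("match: %s\n",
                   (pkt_val & mask) == (rule_val & mask) ? "yes" : "no"); /* no  */

            rule_val = 1;                  /* fix: set the proper addr_type    */
            printf("match: %s\n",
                   (pkt_val & mask) == (rule_val & mask) ? "yes" : "no"); /* yes */
            return 0;
        }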
     
  • [ Upstream commit 628185cfddf1dfb701c4efe2cfd72cf5b09f5702 ]

    Shahar reported a soft lockup in tc_classify(), where we run into an
    endless loop when walking the classifier chain due to tp->next == tp
    which is a state we should never run into. The issue only seems to
    trigger under load in the tc control path.

    What happens is that in tc_ctl_tfilter(), thread A allocates a new
    tp, initializes it, sets tp_created to 1, and calls into tp->ops->change()
    with it. In that classifier callback we had to unlock/lock the rtnl
    mutex and returned with -EAGAIN. One reason why we need to drop there
    is, for example, that we need to request an action module to be loaded.

    This happens via tcf_exts_validate() -> tcf_action_init/_1(), meaning
    that after we loaded and found the requested action, we need to redo the
    whole request so we don't race against others. While rtnl was unlocked
    during that time, thread B's request was processed next on that CPU.
    Thread B successfully added a new tp instance to the classifier chain.
    When thread A returned, grabbing the rtnl mutex again, propagating -EAGAIN
    and destroying its tp instance which never got linked, we goto replay
    and redo A's request.

    This time when walking the classifier chain in tc_ctl_tfilter() for
    checking for existing tp instances we had a priority match and found
    the tp instance that was created and linked by thread B. Now calling
    again into tp->ops->change() with that tp was successful and returned
    without error.

    tp_created was never cleared in the second round, thus the kernel thinks
    that we need to link it into the classifier chain (once again). tp and
    *back point to the same object due to the match we had earlier on. Thus
    for thread B's already public tp, we reset tp->next to tp itself and
    link it into the chain, which eventually causes the mentioned endless
    loop in tc_classify() once a packet hits the data path.

    The fix is to clear tp_created at the beginning of each request, including
    when we replay it (see the control-flow sketch after this entry). On the
    paths that can cause -EAGAIN we already destroy the original tp instance
    we had, and on replay we really need to start from scratch. It seems that
    this issue was first introduced in commit 12186be7d2e1 ("net_cls: fix
    unconfigured struct tcf_proto keeps chaining and avoid kernel panic when
    we use cls_cgroup").

    Fixes: 12186be7d2e1 ("net_cls: fix unconfigured struct tcf_proto keeps chaining and avoid kernel panic when we use cls_cgroup")
    Reported-by: Shahar Klein
    Signed-off-by: Daniel Borkmann
    Cc: Cong Wang
    Acked-by: Eric Dumazet
    Tested-by: Shahar Klein
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Daniel Borkmann
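
    A hypothetical control-flow sketch of the fix (standalone, not the
    tc_ctl_tfilter() code; change_filter() here just simulates a callback
    that fails once with -EAGAIN): the "this request created a new instance"
    flag must be reset on every pass, so a replay never links an instance
    that another thread already made public.

        /* Illustrative only -- not kernel code. */
        #include <stdbool.h>
        #include <stdio.h>

        #define EAGAIN 11

        /* Pretend classifier callback: fails with -EAGAIN once (e.g. it had
         * to drop a lock to load an action module), then succeeds. */
        static int change_filter(void)
        {
            static int calls;

            return calls++ == 0 ? -EAGAIN : 0;
        }

        static int handle_request(void)
        {
            int err;

        replay:
            {
                bool created = false;  /* reset on every pass, incl. replays */

                /* ... lookup; assume that on the replay another thread has
                 * already linked an instance, so this pass never sets
                 * created = true ... */

                err = change_filter();
                if (err == -EAGAIN)
                    goto replay;       /* redo the whole request from scratch */
                if (!err && created)
                    printf("link the instance this pass created\n");
            }
            return err;
        }

        int main(void)
        {
            return handle_request();
        }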
     

30 Nov, 2016

2 commits


28 Nov, 2016

1 commit

  • Roi reported a crash in flower where tp->root was NULL in ->classify()
    callbacks. Reason is that in ->destroy() tp->root is set to NULL via
    RCU_INIT_POINTER(). It's problematic for some of the classifiers, because
    this doesn't respect RCU grace period for them, and as a result, still
    outstanding readers from tc_classify() will try to blindly dereference
    a NULL tp->root.

    The tp->root object is strictly private to the classifier implementation
    and holds internal data that the core, such as tc_ctl_tfilter(), doesn't
    know about. Within some classifiers, such as cls_bpf, cls_basic, etc.,
    tp->root is only checked for NULL in the ->get() callback, but nowhere
    else. This is misleading and seems to have been copied from old classifier
    code that was not cleaned up properly. For example, d3fa76ee6b4a
    ("[NET_SCHED]: cls_basic: fix NULL pointer dereference") moved tp->root
    initialization into the ->init() routine, where before it was part of
    ->change(), so back then ->get() had to deal with tp->root being NULL;
    that was indeed a valid case before d3fa76ee6b4a, but not really anymore.
    We used to set tp->root to NULL long ago in ->destroy(), see 47a1a1d4be29
    ("pkt_sched: remove unnecessary xchg() in packet classifiers"); the
    NULLifying was reintroduced with the RCUification, but it is not correct
    for every classifier implementation.

    In the cases fixed here, with the one exception of cls_cgroup, the tp->root
    object is allocated and initialized inside the ->init() callback, which is
    always performed after we allocate a new tp, which means tp and thus
    tp->root were not globally visible in the tp chain yet (see tc_ctl_tfilter()).
    Also, on destruction tp->root is strictly kfree_rcu()'ed in the ->destroy()
    handler, same for the tp, which is kfree_rcu()'ed right when we return
    from ->destroy() in tcf_destroy(). This means the head object's lifetime
    for such classifiers is always tied to the tp lifetime. The RCU callback
    invocations for the two kfree_rcu() calls could be out of order, but that's
    fine since both objects are independent.

    Dropping the RCU_INIT_POINTER(tp->root, NULL) for these classifiers here
    means that 1) we don't need a useless NULL check in the fast path, and
    2) outstanding readers of that tp in tc_classify() can still execute under
    the protection of the RCU grace period, as is actually expected.

    Things that haven't been touched here: cls_fw and cls_route. They each
    handle tp->root being NULL in the ->classify() path for historic reasons,
    so their ->destroy() implementations can stay as is. If someone actually
    cares, they could get cleaned up at some point to avoid the test in the
    fast path. cls_u32 doesn't set tp->root to NULL. For cls_rsvp, I just added
    a !head check should anyone actually be using/testing it, so it at least
    aligns with cls_fw and cls_route. For cls_flower we additionally need to
    defer the rhashtable destruction (to a sleepable context) until after the
    RCU grace period, as concurrent readers might still access it. (Note that
    in this case we need to hold a module reference to keep the work callback
    address intact, since we only wait on module unload for all call_rcu()s to
    finish.)

    This fixes one race to bring RCU grace period guarantees back. Next step
    as worked on by Cong however is to fix 1e052be69d04 ("net_sched: destroy
    proto tp when all filters are gone") to get the order of unlinking the tp
    in tc_ctl_tfilter() for the RTM_DELTFILTER case right by moving
    RCU_INIT_POINTER() before tcf_destroy() and let the notification for
    removal be done through the prior ->delete() callback. Both are independent
    issues. Once we have that right, we can then clean tp->root up for a number
    of classifiers by not making them RCU pointers, which requires a new callback
    (->uninit) triggered from tp's RCU callback, where we then just kfree()
    tp->root.

    Fixes: 1f947bf151e9 ("net: sched: rcu'ify cls_bpf")
    Fixes: 9888faefe132 ("net: sched: cls_basic use RCU")
    Fixes: 70da9f0bf999 ("net: sched: cls_flow use RCU")
    Fixes: 77b9900ef53a ("tc: introduce Flower classifier")
    Fixes: bf3994d2ed31 ("net/sched: introduce Match-all classifier")
    Fixes: 952313bd6258 ("net: sched: cls_cgroup use RCU")
    Reported-by: Roi Dayan
    Signed-off-by: Daniel Borkmann
    Cc: Cong Wang
    Cc: John Fastabend
    Cc: Roi Dayan
    Cc: Jiri Pirko
    Acked-by: John Fastabend
    Acked-by: Cong Wang
    Signed-off-by: David S. Miller

    Daniel Borkmann
     

25 Nov, 2016

1 commit


18 Nov, 2016

1 commit


30 Oct, 2016

1 commit


28 Oct, 2016

1 commit

  • Daniel says:

    While trying out [1][2], I noticed that tc monitor doesn't show the
    correct handle on delete:

    $ tc monitor
    qdisc clsact ffff: dev eno1 parent ffff:fff1
    filter dev eno1 ingress protocol all pref 49152 bpf handle 0x2a [...]
    deleted filter dev eno1 ingress protocol all pref 49152 bpf handle 0xf3be0c80

    Some context to explain the above:
    The user identity of any tc filter is represented by a 32-bit
    identifier encoded in tcm->tcm_handle, for example 0x2a in the bpf filter
    above. A user wishing to delete, get or even modify a specific filter
    uses this handle to reference it.
    Every classifier is free to provide its own semantics for the 32-bit
    handle. For example, classifiers like u32 use schemes like 800:1:801 to
    describe their filters in terms of hash table, bucket and node ids, etc.
    Classifiers also have an internal per-filter representation which is
    different from this externally visible identity. Most classifiers set this
    internal representation to be a pointer address (which allows fast retrieval
    of said filters in their implementations). This internal representation
    is referenced with the "fh" variable in the kernel control code.

    When a user successfully deletes a specific filter, by specifying the correct
    tcm->tcm_handle, an event is generated to user space which indicates
    which specific filter was deleted.

    Before this patch, the "fh" value was sent to user space as the identity.
    As an example, what is shown in the sample bpf filter delete event above
    is 0xf3be0c80. This is in fact a 32-bit truncation of 0xffff8807f3be0c80,
    which happens to be the 64-bit memory address of the internal filter
    representation (the address of the corresponding filter's struct
    cls_bpf_prog); see the truncation illustration after this entry.

    After this patch the appropriate user-identifiable handle, as encoded
    in the originating request's tcm->tcm_handle, is reported in the event.
    One of the cardinal rules of netlink is that one should be able to take
    an event (such as a delete in this case), reflect it back to the
    kernel, and successfully delete the filter. This patch achieves that.

    Note, this issue has existed since the original TC action
    infrastructure code patch back in 2004 as found in:
    https://git.kernel.org/cgit/linux/kernel/git/history/history.git/commit/

    [1] http://patchwork.ozlabs.org/patch/682828/
    [2] http://patchwork.ozlabs.org/patch/682829/

    Fixes: 4e54c4816bfe ("[NET]: Add tc extensions infrastructure.")
    Reported-by: Daniel Borkmann
    Acked-by: Cong Wang
    Signed-off-by: Jamal Hadi Salim
    Signed-off-by: David S. Miller

    Jamal Hadi Salim
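
    A standalone illustration of the truncation (userspace C, not kernel
    code; the two constants are taken from the example above): stuffing a
    64-bit internal pointer into the 32-bit tcm_handle field silently drops
    the upper half, which is why the event showed 0xf3be0c80 instead of the
    user-assigned handle 0x2a.

        #include <stdint.h>
        #include <stdio.h>

        int main(void)
        {
            uint64_t internal_fh = 0xffff8807f3be0c80ULL; /* internal "fh" (an address) */
            uint32_t user_handle = 0x2a;                  /* what the user assigned     */

            uint32_t reported = (uint32_t)internal_fh;    /* what was sent pre-fix      */

            printf("before the fix: handle 0x%x\n", reported);     /* 0xf3be0c80 */
            printf("after the fix : handle 0x%x\n", user_handle);  /* 0x2a       */
            return 0;
        }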
     

20 Oct, 2016

1 commit

    The stats_update callback is called by NIC drivers doing hardware
    offloading of the mirred action. Lastuse is passed as an argument
    to specify when the stats were actually last updated, which is not
    always the current time.

    Fixes: 9798e6fe4f9b ('net: act_mirred: allow statistic updates from offloaded actions')
    Signed-off-by: Paul Blakey
    Signed-off-by: Or Gerlitz
    Signed-off-by: David S. Miller

    Paul Blakey
     

13 Oct, 2016

2 commits

  • Krister reported a kernel NULL pointer dereference after
    tcf_action_init_1() invokes a_o->init(). It is a race condition
    where one thread calls tcf_register_action() and initializes
    the netns data only after putting the act ops in the global list,
    while the other thread searches the list and then calls
    a_o->init(net, ...).

    Fix this by moving the pernet ops registration before making
    the action ops visible (see the ordering sketch after this entry).
    This is fine because: a) we don't rely on act_base in the pernet
    ops->init(), and b) in the worst case we have a fully initialized
    netns but the ops are still not visible, so new actions still
    can't be created.

    Reported-by: Krister Johansen
    Tested-by: Krister Johansen
    Cc: Jamal Hadi Salim
    Signed-off-by: Cong Wang
    Acked-by: Jamal Hadi Salim
    Signed-off-by: David S. Miller

    WANG Cong
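
    A hypothetical single-threaded sketch of the ordering rule (not the
    act_api code; struct and function names are made up): finish all setup
    before the object becomes findable through the global list, so a
    concurrent lookup can never call into half-initialized state. Real code
    would also need the appropriate locking or memory barriers around the
    publish step.

        #include <stdio.h>

        struct ops {
            const char *name;
            int ready;                /* stands in for the per-netns data */
        };

        static struct ops *global_list;   /* what concurrent lookups search */

        static void register_ops(struct ops *o)
        {
            o->ready = 1;             /* 1) initialize (pernet registration)... */
            global_list = o;          /* 2) ...then publish for lookups         */
        }

        int main(void)
        {
            static struct ops gact = { .name = "gact" };

            register_ops(&gact);
            printf("%s ready=%d\n", global_list->name, global_list->ready);
            return 0;
        }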
     
  • There are two ways to get tc filters from kernel to user space.

    1) Full dump (tc_dump_tfilter())
    2) RTM_GETTFILTER to get one precise filter, reducing overhead.

    The second operation unfortunately broadcasts its result,
    polluting "tc monitor" users.

    This patch makes sure only the requester gets the result, using
    netlink_unicast() instead of rtnetlink_send().

    Jamal cooked an iproute2 patch to implement "tc filter get" operation,
    but other user space libraries already use RTM_GETTFILTER when a single
    filter is queried, instead of dumping all filters.

    Signed-off-by: Eric Dumazet
    Cc: Jamal Hadi Salim
    Acked-by: Jamal Hadi Salim
    Signed-off-by: David S. Miller

    Eric Dumazet
     

04 Oct, 2016

1 commit

  • Generic skb_vlan_push/skb_vlan_pop functions don't properly handle the
    case where the input skb data pointer does not point at the mac header:

    - They do the push/pop, but fail to properly unwind data back to its
    original location.
    For example, in the skb_vlan_push case, any subsequent
    'skb_push(skb, skb->mac_len)' call makes skb->data point 4 bytes
    BEFORE the start of the frame, leading to bogus frames that may be
    transmitted.

    - They update rcsum for the added/removed 4-byte tag.
    Alas, if data originally points after the vlan/eth headers, then these
    bytes were already pulled out of the csum.

    OTOH, calling skb_vlan_push/skb_vlan_pop with skb->data at mac_header
    presents no issues.

    act_vlan is the only caller of skb_vlan_*() that has skb->data pointing
    at the network header (upon ingress).
    Other callers (ovs, bpf) already adjust skb->data to the mac_header.

    This patch fixes act_vlan to point skb->data at the mac_header prior to
    calling the skb_vlan_*() functions, as other callers do (see the sketch
    after this entry).

    Signed-off-by: Shmulik Ladkani
    Cc: Daniel Borkmann
    Cc: Pravin Shelar
    Cc: Jiri Pirko
    Signed-off-by: David S. Miller

    Shmulik Ladkani
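
    A hypothetical userspace sketch of the invariant (not the actual act_vlan
    change; the fake_skb struct and offsets are made up): when a helper
    assumes the data pointer sits at the mac header, a caller whose data
    pointer is at the network header must rewind it first and restore it
    afterwards.

        #include <stdio.h>

        struct fake_skb {
            int data;      /* offset of the data pointer within the buffer */
            int mac_len;   /* length of the link-layer header              */
        };

        /* Stands in for skb_vlan_push/pop: only valid with data at the
         * mac header. */
        static void vlan_helper(struct fake_skb *skb)
        {
            printf("helper sees data at offset %d (expects the mac header)\n",
                   skb->data);
        }

        int main(void)
        {
            /* Ingress: data sits at the network header, 14 bytes past the
             * mac header which starts at offset 64. */
            struct fake_skb skb = { .data = 64 + 14, .mac_len = 14 };

            skb.data -= skb.mac_len;   /* rewind to the mac header ("push") */
            vlan_helper(&skb);
            skb.data += skb.mac_len;   /* restore for the rest of ingress ("pull") */

            printf("data restored to offset %d\n", skb.data);
            return 0;
        }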
     

03 Oct, 2016

1 commit


28 Sep, 2016

1 commit


27 Sep, 2016

2 commits

  • On the ife encode side, the action stores the different tlvs inside the ife
    header, where each tlv length field should refer to the length of the
    whole tlv (without additional padding) and not just the data length.

    On the ife decode side, the action iterates over the tlvs in the ife header
    and parses them one by one, where in each iteration the current pointer is
    advanced according to the tlv size.

    Before, the encoder put only the data length inside the tlv, which led
    to false parsing of the ife header. In addition, because the loop counter
    was unsigned, it could lead to an infinite parsing loop.

    This fix changes the loop counter to be signed and fixes the encoding to
    take the tlv type and size into account (see the TLV sketch after this
    entry).

    Fixes: 28a10c426e81 ("net sched: fix encoding to use real length")
    Acked-by: Jamal Hadi Salim
    Signed-off-by: Yotam Gigi
    Signed-off-by: David S. Miller

    Yotam Gigi
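
    A standalone TLV sketch of the two points above (illustrative layout, not
    the IFE wire format): the length field of each TLV must cover the whole
    TLV (type + length + data) so the decoder advances correctly, and the
    walker keeps a signed remaining-length so a malformed length makes it
    stop instead of wrapping around.

        #include <stdint.h>
        #include <stdio.h>
        #include <string.h>

        struct tlv { uint16_t type; uint16_t len; };  /* len covers header + data */

        static size_t tlv_put(uint8_t *buf, uint16_t type,
                              const void *data, uint16_t dlen)
        {
            struct tlv t = { type, (uint16_t)(sizeof(t) + dlen) };

            memcpy(buf, &t, sizeof(t));
            memcpy(buf + sizeof(t), data, dlen);
            return t.len;
        }

        static void tlv_walk(const uint8_t *buf, int total)  /* signed! */
        {
            while (total > 0) {
                struct tlv t;

                memcpy(&t, buf, sizeof(t));
                if (t.len < sizeof(t) || t.len > (unsigned)total)
                    return;                            /* malformed: stop */
                printf("tlv type %u, %u data bytes\n",
                       (unsigned)t.type, (unsigned)(t.len - sizeof(t)));
                buf   += t.len;
                total -= t.len;
            }
        }

        int main(void)
        {
            uint8_t buf[64];
            uint16_t tcindex = 0x11;
            size_t used = tlv_put(buf, 1, &tcindex, sizeof(tcindex));

            used += tlv_put(buf + used, 2, "ab", 2);
            tlv_walk(buf, (int)used);
            return 0;
        }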
     
  • On the ife encode side, the external mac header is copied from the original
    packet and may be overridden if the user requests. Before, the mac header
    copy was done from a memory region that might not be accessible anymore,
    as skb_cow_head might free it and copy the packet. This led to random
    values in the external mac header when the values were not set by the
    user.

    This fix takes the internal mac header from the packet after the call to
    skb_cow_head (see the illustration after this entry).

    Fixes: ef6980b6becb ("net sched: introduce IFE action")
    Acked-by: Jamal Hadi Salim
    Signed-off-by: Yotam Gigi
    Signed-off-by: David S. Miller

    Yotam Gigi
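
    A standalone userspace illustration of the underlying rule (realloc()
    stands in for skb_cow_head; this is not the act_ife code): any pointer
    taken before a call that may reallocate and copy the buffer is stale
    afterwards, so the header must be re-derived from the buffer after the
    call.

        #include <stdio.h>
        #include <stdlib.h>
        #include <string.h>

        int main(void)
        {
            char *pkt = malloc(16);

            if (!pkt)
                return 1;
            memcpy(pkt, "\xaa\xbb\xcc\xdd\xee\xff", 6);  /* pretend mac header */

            /* WRONG: a "char *mac = pkt;" taken here would dangle below. */

            char *grown = realloc(pkt, 4096); /* may move the data, like skb_cow_head */
            if (!grown) {
                free(pkt);
                return 1;
            }
            pkt = grown;

            char *mac = pkt;                  /* correct: derive it after the call */
            printf("mac header starts with %02x\n", (unsigned)(unsigned char)mac[0]);
            free(pkt);
            return 0;
        }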
     

23 Sep, 2016

4 commits

  • It looks like the following patch can make FQ very precise, even in VMs
    or on stressed hosts. It matters at high pacing rates.

    We take into account the difference between the time that was programmed
    when the last packet was sent and the current time (a drift of tens of
    usecs is often observed).

    Add an EWMA of the unthrottle latency to help diagnostics (an EWMA sketch
    follows this entry).

    This latency is the difference between the current time and the oldest
    packet in the delayed RB-tree. This accounts for the high-resolution timer
    latency, but can differ under stress, as fq_check_throttled() can
    opportunistically be called from a dequeue() that follows an enqueue()
    for a different flow.

    Tested:
    // Start a 10Gbit flow
    $ netperf --google-pacing-rate 1250000000 -H lpaa24 -l 10000 -- -K bbr &

    Before patch :
    $ sar -n DEV 10 5 | grep eth0 | grep Average
    Average: eth0 17106.04 756876.84 1102.75 1119049.02 0.00 0.00 0.52

    After patch :
    $ sar -n DEV 10 5 | grep eth0 | grep Average
    Average: eth0 17867.00 800245.90 1151.77 1183172.12 0.00 0.00 0.52

    A new iproute2 tc can output the 'unthrottle latency' :

    $ tc -s qd sh dev eth0 | grep latency
    0 gc, 0 highprio, 32490767 throttled, 2382 ns latency

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
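
    A standalone sketch of an exponentially weighted moving average over
    latency samples, the kind of diagnostic counter described above (the 1/8
    weight and the sample values are illustrative, not the qdisc's actual
    parameters).

        #include <stdio.h>

        static long ewma_ns;                   /* running average, in ns */

        static void ewma_update(long sample_ns)
        {
            /* new = old + (sample - old) / 8, i.e. each sample gets a 1/8 weight */
            ewma_ns += (sample_ns - ewma_ns) >> 3;
        }

        int main(void)
        {
            long samples[] = { 1500, 2400, 2100, 3000, 2382 };

            for (unsigned int i = 0; i < sizeof(samples) / sizeof(samples[0]); i++) {
                ewma_update(samples[i]);
                printf("sample %ld ns -> ewma %ld ns\n", samples[i], ewma_ns);
            }
            return 0;
        }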
     
  • Fixes: 2ccccf5fb43f ("net_sched: update hierarchical backlog too")
    Cc: Jamal Hadi Salim
    Signed-off-by: Cong Wang
    Acked-by: Jamal Hadi Salim
    Signed-off-by: David S. Miller

    WANG Cong
     
  • Reported-by: Stas Nichiporovich
    Fixes: 2ccccf5fb43f ("net_sched: update hierarchical backlog too")
    Cc: Jamal Hadi Salim
    Signed-off-by: Cong Wang
    Acked-by: Jamal Hadi Salim
    Signed-off-by: David S. Miller

    WANG Cong
     
  • On the error path in route4_change(), 'f' could be NULL,
    so we should check for NULL before calling tcf_exts_destroy()
    (see the sketch after this entry).

    Fixes: b9a24bb76bf6 ("net_sched: properly handle failure case of tcf_exts_init()")
    Reported-by: kbuild test robot
    Cc: Jamal Hadi Salim
    Signed-off-by: Cong Wang
    Acked-by: Jamal Hadi Salim
    Signed-off-by: David S. Miller

    WANG Cong
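
    A trivial standalone sketch of the error-path rule (illustrative names,
    not the cls_route code): a failure path that can run before the object
    was allocated must check for NULL before tearing the object down.

        #include <stdlib.h>

        struct filter { int refs; };

        static void filter_cleanup(struct filter *f)
        {
            f->refs = 0;        /* would dereference NULL without the check */
            free(f);
        }

        int main(void)
        {
            struct filter *f = NULL;   /* the error was taken before allocation */

            /* error path: */
            if (f)                     /* the missing check caused the crash */
                filter_cleanup(f);
            return 0;
        }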
     

22 Sep, 2016

6 commits


21 Sep, 2016

2 commits

  • This commit adds to the fq module a low_rate_threshold parameter to
    insert a delay after all packets if the socket requests a pacing rate
    below the threshold.

    This helps achieve more precise control of the sending rate with
    low-rate paths, especially policers. The basic issue is that if a
    congestion control module detects a policer at a certain rate, it may
    want fq to be able to shape to that policed rate. That way the sender
    can avoid policer drops by having the packets arrive at the policer at
    or just under the policed rate (a back-of-the-envelope sketch of the
    per-packet spacing follows this entry).

    The default threshold of 550Kbps was chosen analytically so that for
    policers or links at 500Kbps or 512Kbps fq would very likely invoke
    this mechanism, even if the pacing rate was briefly slightly above the
    available bandwidth. This value was then empirically validated with
    two years of production testing on YouTube video servers.

    Signed-off-by: Van Jacobson
    Signed-off-by: Neal Cardwell
    Signed-off-by: Yuchung Cheng
    Signed-off-by: Nandita Dukkipati
    Signed-off-by: Eric Dumazet
    Signed-off-by: Soheil Hassas Yeganeh
    Signed-off-by: David S. Miller

    Eric Dumazet
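
    A back-of-the-envelope standalone sketch of what pacing below such a
    threshold amounts to (the numbers are illustrative and the formula is the
    generic serialization delay, not the fq implementation): each packet is
    spaced out by roughly len * 8 / rate.

        #include <stdio.h>

        int main(void)
        {
            unsigned long long threshold_bps = 550000;  /* default mentioned above  */
            unsigned long long rate_bps      = 512000;  /* e.g. a 512 kbit/s policer */
            unsigned long long pkt_len       = 1500;    /* bytes */

            if (rate_bps < threshold_bps) {
                unsigned long long gap_ns =
                        pkt_len * 8ULL * 1000000000ULL / rate_bps;

                printf("pace one %llu-byte packet every %llu us\n",
                       pkt_len, gap_ns / 1000);         /* ~23437 us */
            }
            return 0;
        }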
     
  • With the batch changes that translated transient actions into
    a temporary list, what got lost in the translation was the fact that
    tcf_action_destroy() will eventually delete the action from
    its permanent location if the refcount is zero.

    Example of what broke:
    ...add a gact action to drop
    sudo $TC actions add action drop index 10
    ...now retrieve it, looks good
    sudo $TC actions get action gact index 10
    ...retrieve it again and find it is gone!
    sudo $TC actions get action gact index 10

    Fixes: 22dc13c837c3 ("net_sched: convert tcf_exts from list to pointer array")
    Fixes: 824a7e8863b3 ("net_sched: remove an unnecessary list_del()")
    Fixes: f07fed82ad79 ("net_sched: remove the leftover cleanup_a()")

    Acked-by: Cong Wang
    Signed-off-by: Jamal Hadi Salim
    Signed-off-by: David S. Miller

    Jamal Hadi Salim
     

20 Sep, 2016

4 commits

  • Signed-off-by: Jamal Hadi Salim
    Signed-off-by: David S. Miller

    Jamal Hadi Salim
     
  • Setting the conforming action to drop is a valid policy.
    When it is set, we need to at least see the stats indicating it
    for debugging.

    Signed-off-by: Roman Mashak
    Signed-off-by: Jamal Hadi Salim
    Signed-off-by: David S. Miller

    Roman Mashak
     
  • Sample use case of how this is encoded:
    user space via tuntap (or a connected VM/Machine/container)
    encodes the tcindex TLV.

    Sample use case of decoding:
    IFE action decodes it and the skb->tc_index is then used to classify.
    So something like this for encoded ICMP packets:

    ... first decode, then reclassify... skb->tc_index will be set
    sudo $TC filter add dev $ETH parent ffff: prio 2 protocol 0xbeef \
    u32 match u32 0 0 flowid 1:1 \
    action ife decode reclassify

    ...next match the decode icmp packet...
    sudo $TC filter add dev $ETH parent ffff: prio 4 protocol ip \
    u32 match ip protocol 1 0xff flowid 1:1 \
    action continue

    ... last, classify it using the tcindex classifier and do some action...
    sudo $TC filter add dev $ETH parent ffff: prio 5 protocol ip \
    handle 0x11 tcindex classid 1:1 \
    action blah..

    Signed-off-by: Jamal Hadi Salim
    Signed-off-by: David S. Miller

    Jamal Hadi Salim
     
    Encoder and checker for 16-bit metadata.

    Signed-off-by: Jamal Hadi Salim
    Signed-off-by: David S. Miller

    Jamal Hadi Salim
     

19 Sep, 2016

5 commits

  • This change replaces the sk_buff_head struct in Qdiscs with the new
    qdisc_skb_head.

    It's similar to the sk_buff_head API, but does not use skb->prev pointers.

    Qdiscs will commonly enqueue at the tail of a list and dequeue at the head.
    While sk_buff_head works fine for this, enqueue/dequeue also needs to
    adjust the prev pointer of the next element.

    The ->prev pointer is not required for qdiscs, so we can just leave
    it undefined and avoid one cacheline write access per en/dequeue (see the
    sketch after this entry).

    Suggested-by: Eric Dumazet
    Signed-off-by: Florian Westphal
    Signed-off-by: David S. Miller

    Florian Westphal
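
    An illustrative userspace sketch of such a head-and-tail FIFO (not the
    kernel's qdisc_skb_head or its helpers): nodes are linked only through
    ->next, so enqueueing at the tail and dequeueing at the head never touch
    a ->prev field.

        #include <stdio.h>

        struct node { struct node *next; int id; };

        struct fifo {
            struct node *head;
            struct node *tail;
            unsigned int qlen;
        };

        static void enqueue_tail(struct fifo *q, struct node *n)
        {
            n->next = NULL;
            if (q->tail)
                q->tail->next = n;   /* only the old tail is written */
            else
                q->head = n;
            q->tail = n;
            q->qlen++;
        }

        static struct node *dequeue_head(struct fifo *q)
        {
            struct node *n = q->head;

            if (!n)
                return NULL;
            q->head = n->next;       /* no prev pointer to fix up */
            if (!q->head)
                q->tail = NULL;
            q->qlen--;
            return n;
        }

        int main(void)
        {
            struct fifo q = { 0 };
            struct node a = { .id = 1 }, b = { .id = 2 };

            enqueue_tail(&q, &a);
            enqueue_tail(&q, &b);
            printf("dequeued %d\n", dequeue_head(&q)->id);  /* 1 */
            printf("dequeued %d\n", dequeue_head(&q)->id);  /* 2 */
            return 0;
        }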
     
  • After the previous patch these functions are identical.
    Replace __skb_dequeue in qdiscs with __qdisc_dequeue_head.

    The next patch will then make __qdisc_dequeue_head handle a
    singly-linked list instead of a struct sk_buff_head argument.

    Doesn't change generated code.

    Signed-off-by: Florian Westphal
    Signed-off-by: David S. Miller

    Florian Westphal
     
  • Moves qdisc stat accounting to qdisc_dequeue_head.

    The only direct caller of the __qdisc_dequeue_head version open-codes
    this now.

    This allows us to later use __qdisc_dequeue_head as a replacement
    for __skb_dequeue() (which operates on an sk_buff_head list).

    Signed-off-by: Florian Westphal
    Signed-off-by: David S. Miller

    Florian Westphal
     
  • A followup change will replace the sk_buff_head in the qdisc
    struct with a slightly different list.

    Use of the sk_buff_head helpers will thus cause compiler
    warnings.

    Open-code these accesses in an extra change to ease review.

    Signed-off-by: Florian Westphal
    Signed-off-by: David S. Miller

    Florian Westphal
     
  • Doesn't change generated code.

    Signed-off-by: Florian Westphal
    Signed-off-by: David S. Miller

    Florian Westphal