Eric Lee / smarc-fsl-linux-kernel

30 Jan, 2016

1 commit

df3eb6cd6 net_sched: drr: check for NULL pointer in drr_dequeue ... Browse Code »

There are cases where qdisc_dequeue_peeked can return NULL, and the result
is dereferenced later on in the function.

Similarly to the other qdisc dequeue functions, check whether the skb
pointer is NULL and if it is, goto out.

Signed-off-by: Bernie Harris
Reviewed-by: Cong Wang
Signed-off-by: David S. Miller

Bernie Harris
2016-01-30 09:26:44 +0800

12 Jan, 2016

2 commits

9d367eddf Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net ... Browse Code »

Conflicts:
drivers/net/bonding/bond_main.c
drivers/net/ethernet/mellanox/mlxsw/spectrum.h
drivers/net/ethernet/mellanox/mlxsw/spectrum_switchdev.c

The bond_main.c and mellanox switch conflicts were cases of
overlapping changes.

Signed-off-by: David S. Miller

David S. Miller
2016-01-12 12:55:43 +0800
66530bdf8 sched,cls_flower: set key address type when present ... Browse Code »

only when user space passes the addresses should we consider their
presence

Signed-off-by: Jamal Hadi Salim
Acked-by: Jiri Pirko
Signed-off-by: David S. Miller

Jamal Hadi Salim
2016-01-12 06:27:30 +0800

11 Jan, 2016

2 commits

1f211a1b9 net, sched: add clsact qdisc ... Browse Code »

This work adds a generalization of the ingress qdisc as a qdisc holding
only classifiers. The clsact qdisc works on ingress, but also on egress.
In both cases, it's execution happens without taking the qdisc lock, and
the main difference for the egress part compared to prior version of [1]
is that this can be applied with _any_ underlying real egress qdisc (also
classless ones).

Besides solving the use-case of [1], that is, allowing for more programmability
on assigning skb->priority for the mqprio case that is supported by most
popular 10G+ NICs, it also opens up a lot more flexibility for other tc
applications. The main work on classification can already be done at clsact
egress time if the use-case allows and state stored for later retrieval
f.e. again in skb->priority with major/minors (which is checked by most
classful qdiscs before consulting tc_classify()) and/or in other skb fields
like skb->tc_index for some light-weight post-processing to get to the
eventual classid in case of a classful qdisc. Another use case is that
the clsact egress part allows to have a central egress counterpart to
the ingress classifiers, so that classifiers can easily share state (e.g.
in cls_bpf via eBPF maps) for ingress and egress.

Currently, default setups like mq + pfifo_fast would require for this to
use, for example, prio qdisc instead (to get a tc_classify() run) and to
duplicate the egress classifier for each queue. With clsact, it allows
for leaving the setup as is, it can additionally assign skb->priority to
put the skb in one of pfifo_fast's bands and it can share state with maps.
Moreover, we can access the skb's dst entry (f.e. to retrieve tclassid)
w/o the need to perform a skb_dst_force() to hold on to it any longer. In
lwt case, we can also use this facility to setup dst metadata via cls_bpf
(bpf_skb_set_tunnel_key()) without needing a real egress qdisc just for
that (case of IFF_NO_QUEUE devices, for example).

The realization can be done without any changes to the scheduler core
framework. All it takes is that we have two a-priori defined minors/child
classes, where we can mux between ingress and egress classifier list
(dev->ingress_cl_list and dev->egress_cl_list, latter stored close to
dev->_tx to avoid extra cacheline miss for moderate loads). The egress
part is a bit similar modelled to handle_ing() and patched to a noop in
case the functionality is not used. Both handlers are now called
sch_handle_ingress() and sch_handle_egress(), code sharing among the two
doesn't seem practical as there are various minor differences in both
paths, so that making them conditional in a single handler would rather
slow things down.

Full compatibility to ingress qdisc is provided as well. Since both
piggyback on TC_H_CLSACT, only one of them (ingress/clsact) can exist
per netdevice, and thus ingress qdisc specific behaviour can be retained
for user space. This means, either a user does 'tc qdisc add dev foo ingress'
and configures ingress qdisc as usual, or the 'tc qdisc add dev foo clsact'
alternative, where both, ingress and egress classifier can be configured
as in the below example. ingress qdisc supports attaching classifier to any
minor number whereas clsact has two fixed minors for muxing between the
lists, therefore to not break user space setups, they are better done as
two separate qdiscs.

I decided to extend the sch_ingress module with clsact functionality so
that commonly used code can be reused, the module is being aliased with
sch_clsact so that it can be auto-loaded properly. Alternative would have been
to add a flag when initializing ingress to alter its behaviour plus aliasing
to a different name (as it's more than just ingress). However, the first would
end up, based on the flag, choosing the new/old behaviour by calling different
function implementations to handle each anyway, the latter would require to
register ingress qdisc once again under different alias. So, this really begs
to provide a minimal, cleaner approach to have Qdisc_ops and Qdisc_class_ops
by its own that share callbacks used by both.

Example, adding qdisc:

# tc qdisc add dev foo clsact
# tc qdisc show dev foo
qdisc mq 0: root
qdisc pfifo_fast 0: parent :1 bands 3 priomap 1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
qdisc pfifo_fast 0: parent :2 bands 3 priomap 1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
qdisc pfifo_fast 0: parent :3 bands 3 priomap 1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
qdisc pfifo_fast 0: parent :4 bands 3 priomap 1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
qdisc clsact ffff: parent ffff:fff1

Adding filters (deleting, etc works analogous by specifying ingress/egress):

# tc filter add dev foo ingress bpf da obj bar.o sec ingress
# tc filter add dev foo egress bpf da obj bar.o sec egress
# tc filter show dev foo ingress
filter protocol all pref 49152 bpf
filter protocol all pref 49152 bpf handle 0x1 bar.o:[ingress] direct-action
# tc filter show dev foo egress
filter protocol all pref 49152 bpf
filter protocol all pref 49152 bpf handle 0x1 bar.o:[egress] direct-action

A 'tc filter show dev foo' or 'tc filter show dev foo parent ffff:' will
show an empty list for clsact. Either using the parent names (ingress/egress)
or specifying the full major/minor will then show the related filter lists.

Prior work on a mqprio prequeue() facility [1] was done mainly by John Fastabend.

[1] http://patchwork.ozlabs.org/patch/512949/

Signed-off-by: Daniel Borkmann
Acked-by: John Fastabend
Signed-off-by: David S. Miller

Daniel Borkmann
2016-01-11 11:13:15 +0800
fdc5432a7 net, sched: add skb_at_tc_ingress helper ... Browse Code »

Add a skb_at_tc_ingress() as this will be needed elsewhere as well and
can hide the ugly ifdef.

Signed-off-by: Daniel Borkmann
Acked-by: Alexei Starovoitov
Signed-off-by: David S. Miller

Daniel Borkmann
2016-01-11 06:54:28 +0800

06 Jan, 2016

1 commit

73c20a8b7 net: sched: fix missing free per cpu on qstats ... Browse Code »

When a qdisc is using per cpu stats (currently just the ingress
qdisc) only the bstats are being freed. This also free's the qstats.

Fixes: b0ab6f92752b9f9d8 ("net: sched: enable per cpu qstats")
Signed-off-by: John Fastabend
Acked-by: Eric Dumazet
Acked-by: Daniel Borkmann
Signed-off-by: David S. Miller

John Fastabend
2016-01-06 14:40:21 +0800

16 Dec, 2015

1 commit

225734de7 net_sched: make qdisc_tree_decrease_qlen() work for non mq ... Browse Code »

Stas Nichiporovich reported a regression in his HFSC qdisc setup
on a non multi queue device.

It turns out I mistakenly added a TCQ_F_NOPARENT flag on all qdisc
allocated in qdisc_create() for non multi queue devices, which was
rather buggy. I was clearly mislead by the TCQ_F_ONETXQUEUE that is
also set here for no good reason, since it only matters for the root
qdisc.

Fixes: 4eaf3b84f288 ("net_sched: fix qdisc_tree_decrease_qlen() races")
Reported-by: Stas Nichiporovich
Tested-by: Stas Nichiporovich
Signed-off-by: Eric Dumazet
Signed-off-by: David S. Miller

Eric Dumazet
2015-12-16 02:41:52 +0800

04 Dec, 2015

1 commit

4eaf3b84f net_sched: fix qdisc_tree_decrease_qlen() races ... Browse Code »

qdisc_tree_decrease_qlen() suffers from two problems on multiqueue
devices.

One problem is that it updates sch->q.qlen and sch->qstats.drops
on the mq/mqprio root qdisc, while it should not : Daniele
reported underflows errors :
[ 681.774821] PAX: sch->q.qlen: 0 n: 1
[ 681.774825] PAX: size overflow detected in function qdisc_tree_decrease_qlen net/sched/sch_api.c:769 cicus.693_49 min, count: 72, decl: qlen; num: 0; context: sk_buff_head;
[ 681.774954] CPU: 2 PID: 19 Comm: ksoftirqd/2 Tainted: G O 4.2.6.201511282239-1-grsec #1
[ 681.774955] Hardware name: ASUSTeK COMPUTER INC. X302LJ/X302LJ, BIOS X302LJ.202 03/05/2015
[ 681.774956] ffffffffa9a04863 0000000000000000 0000000000000000 ffffffffa990ff7c
[ 681.774959] ffffc90000d3bc38 ffffffffa95d2810 0000000000000007 ffffffffa991002b
[ 681.774960] ffffc90000d3bc68 ffffffffa91a44f4 0000000000000001 0000000000000001
[ 681.774962] Call Trace:
[ 681.774967] [] dump_stack+0x4c/0x7f
[ 681.774970] [] report_size_overflow+0x34/0x50
[ 681.774972] [] qdisc_tree_decrease_qlen+0x152/0x160
[ 681.774976] [] fq_codel_dequeue+0x7b1/0x820 [sch_fq_codel]
[ 681.774978] [] ? qdisc_peek_dequeued+0xa0/0xa0 [sch_fq_codel]
[ 681.774980] [] __qdisc_run+0x4d/0x1d0
[ 681.774983] [] net_tx_action+0xc2/0x160
[ 681.774985] [] __do_softirq+0xf1/0x200
[ 681.774987] [] run_ksoftirqd+0x1e/0x30
[ 681.774989] [] smpboot_thread_fn+0x150/0x260
[ 681.774991] [] ? sort_range+0x40/0x40
[ 681.774992] [] kthread+0xe4/0x100
[ 681.774994] [] ? kthread_worker_fn+0x170/0x170
[ 681.774995] [] ret_from_fork+0x3e/0x70

mq/mqprio have their own ways to report qlen/drops by folding stats on
all their queues, with appropriate locking.

A second problem is that qdisc_tree_decrease_qlen() calls qdisc_lookup()
without proper locking : concurrent qdisc updates could corrupt the list
that qdisc_match_from_root() parses to find a qdisc given its handle.

Fix first problem adding a TCQ_F_NOPARENT qdisc flag that
qdisc_tree_decrease_qlen() can use to abort its tree traversal,
as soon as it meets a mq/mqprio qdisc children.

Second problem can be fixed by RCU protection.
Qdisc are already freed after RCU grace period, so qdisc_list_add() and
qdisc_list_del() simply have to use appropriate rcu list variants.

A future patch will add a per struct netdev_queue list anchor, so that
qdisc_tree_decrease_qlen() can have more efficient lookups.

Reported-by: Daniele Fucini
Signed-off-by: Eric Dumazet
Cc: Cong Wang
Cc: Jamal Hadi Salim
Signed-off-by: David S. Miller

Eric Dumazet
2015-12-04 03:59:05 +0800

09 Nov, 2015

2 commits

02a56c81c net_sched: em_meta: use skb_to_full_sk() helper ... Browse Code »

SYNACK packets might be attached to request sockets.

Fixes: ca6fb0651883 ("tcp: attach SYNACK messages to request sockets instead of listener")
Signed-off-by: Eric Dumazet
Signed-off-by: David S. Miller

Eric Dumazet
2015-11-09 09:56:39 +0800
743b2a667 sched: cls_flow: use skb_to_full_sk() helper ... Browse Code »

SYNACK packets might be attached to request sockets.

Fixes: ca6fb0651883 ("tcp: attach SYNACK messages to request sockets instead of listener")
Signed-off-by: Eric Dumazet
Signed-off-by: David S. Miller

Eric Dumazet
2015-11-09 09:56:39 +0800

04 Nov, 2015

1 commit

2a4f41762 net: sched: kill dead code in sch_choke.c ... Browse Code »

It looks like this has never been used at all.

Signed-off-by: Phil Sutter
Signed-off-by: David S. Miller

Phil Sutter
2015-11-04 02:30:47 +0800

20 Oct, 2015

1 commit

26440c835 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net ... Browse Code »

Conflicts:
drivers/net/usb/asix_common.c
net/ipv4/inet_connection_sock.c
net/switchdev/switchdev.c

In the inet_connection_sock.c case the request socket hashing scheme
is completely different in net-next.

The other two conflicts were overlapping changes.

Signed-off-by: David S. Miller

David S. Miller
2015-10-20 21:08:27 +0800

11 Oct, 2015

2 commits

e446f9dfe net: synack packets can be attached to request sockets ... Browse Code »

selinux needs few changes to accommodate fact that SYNACK messages
can be attached to a request socket, lacking sk_security pointer

(Only syncookies are still attached to a TCP_LISTEN socket)

Adds a new sk_listener() helper, and use it in selinux and sch_fq

Fixes: ca6fb0651883 ("tcp: attach SYNACK messages to request sockets instead of listener")
Signed-off-by: Eric Dumazet
Reported by: kernel test robot
Cc: Paul Moore
Cc: Stephen Smalley
Cc: Eric Paris
Acked-by: Paul Moore
Signed-off-by: David S. Miller

Eric Dumazet
2015-10-11 20:05:06 +0800
6ac644a8a sch_hhf: fix return value of hhf_drop() ... Browse Code »

Similar to commit c0afd9ce4d6a ("fq_codel: fix return value of fq_codel_drop()")
->drop() is supposed to return the number of bytes it dropped,
but hhf_drop () returns the id of the bucket where it drops
a packet from.

Cc: Jamal Hadi Salim
Cc: Terry Lam
Signed-off-by: Cong Wang
Signed-off-by: Cong Wang
Signed-off-by: David S. Miller

WANG Cong
2015-10-11 19:49:33 +0800

09 Oct, 2015

1 commit

075640e36 net/sched: make sch_blackhole.c explicitly non-modular ... Browse Code »

The Kconfig currently controlling compilation of this code is:

net/sched/Kconfig:menuconfig NET_SCHED
net/sched/Kconfig: bool "QoS and/or fair queueing"

...meaning that it currently is not being built as a module by anyone.

Lets remove the modular code that is essentially orphaned, so that
when reading the driver there is no doubt it is builtin-only.

Since module_init translates to device_initcall in the non-modular
case, the init ordering remains unchanged with this commit. We can
change to one of the other priority initcalls (subsys?) at any later
date, if desired.

We also delete the MODULE_LICENSE tag since all that information
is already contained at the top of the file in the comments.

Cc: Jamal Hadi Salim
Cc: "David S. Miller"
Cc: netdev@vger.kernel.org
Signed-off-by: Paul Gortmaker
Signed-off-by: David S. Miller

Paul Gortmaker
2015-10-09 22:52:28 +0800

08 Oct, 2015

1 commit

d40496a56 act_mirred: clear sender cpu before sending to tx ... Browse Code »

Similar to commit c29390c6dfee ("xps: must clear sender_cpu before forwarding")
the skb->sender_cpu needs to be cleared when moving from Rx
Tx, otherwise kernel could crash.

Fixes: 2bd82484bb4c ("xps: fix xps for stacked devices")
Cc: Eric Dumazet
Cc: Jamal Hadi Salim
Signed-off-by: Cong Wang
Signed-off-by: Cong Wang
Acked-by: Eric Dumazet
Signed-off-by: David S. Miller

WANG Cong
2015-10-08 19:59:04 +0800

05 Oct, 2015

2 commits

215c90afb act_mirred: always release tcf hash ... Browse Code »

Align with other tc actions.

Cc: Jamal Hadi Salim
Signed-off-by: Cong Wang
Signed-off-by: Cong Wang
Acked-by: Jamal Hadi Salim
Signed-off-by: David S. Miller

WANG Cong
2015-10-05 21:30:34 +0800
6bd00b850 act_mirred: fix a race condition on mirred_list ... Browse Code »

After commit 1ce87720d456 ("net: sched: make cls_u32 lockless")
we began to release tc actions in a RCU callback. However,
mirred action relies on RTNL lock to protect the global
mirred_list, therefore we could have a race condition
between RCU callback and netdevice event, which caused
a list corruption as reported by Vinson.

Instead of relying on RTNL lock, introduce a spinlock to
protect this list.

Note, in non-bind case, it is still called with RTNL lock,
therefore should disable BH too.

Reported-by: Vinson Lee
Cc: John Fastabend
Cc: Jamal Hadi Salim
Signed-off-by: Cong Wang
Signed-off-by: Cong Wang
Acked-by: Jamal Hadi Salim
Signed-off-by: David S. Miller

WANG Cong
2015-10-05 21:30:33 +0800

03 Oct, 2015

2 commits

c46646d04 sched, bpf: add helper for retrieving routing realms ... Browse Code »

Using routing realms as part of the classifier is quite useful, it
can be viewed as a tag for one or multiple routing entries (think of
an analogy to net_cls cgroup for processes), set by user space routing
daemons or via iproute2 as an indicator for traffic classifiers and
later on processed in the eBPF program.

Unlike actions, the classifier can inspect device flags and enable
netif_keep_dst() if necessary. tc actions don't have that possibility,
but in case people know what they are doing, it can be used from there
as well (e.g. via devs that must keep dsts by design anyway).

If a realm is set, the handler returns the non-zero realm. User space
can set the full 32bit realm for the dst.

Signed-off-by: Daniel Borkmann
Acked-by: Alexei Starovoitov
Signed-off-by: David S. Miller

Daniel Borkmann
2015-10-03 20:02:41 +0800
ca6fb0651 tcp: attach SYNACK messages to request sockets instead of listener ... Browse Code »

If a listen backlog is very big (to avoid syncookies), then
the listener sk->sk_wmem_alloc is the main source of false
sharing, as we need to touch it twice per SYNACK re-transmit
and TX completion.

(One SYN packet takes listener lock once, but up to 6 SYNACK
are generated)

By attaching the skb to the request socket, we remove this
source of contention.

Tested:

listen(fd, 10485760); // single listener (no SO_REUSEPORT)
16 RX/TX queue NIC
Sustain a SYNFLOOD attack of ~320,000 SYN per second,
Sending ~1,400,000 SYNACK per second.
Perf profiles now show listener spinlock being next bottleneck.

20.29% [kernel] [k] queued_spin_lock_slowpath
10.06% [kernel] [k] __inet_lookup_established
5.12% [kernel] [k] reqsk_timer_handler
3.22% [kernel] [k] get_next_timer_interrupt
3.00% [kernel] [k] tcp_make_synack
2.77% [kernel] [k] ipt_do_table
2.70% [kernel] [k] run_timer_softirq
2.50% [kernel] [k] ip_finish_output
2.04% [kernel] [k] cascade

Signed-off-by: Eric Dumazet
Signed-off-by: David S. Miller

Eric Dumazet
2015-10-03 19:32:43 +0800

27 Sep, 2015

1 commit

4963ed48f Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net ... Browse Code »

Conflicts:
net/ipv4/arp.c

The net/ipv4/arp.c conflict was one commit adding a new
local variable while another commit was deleting one.

Signed-off-by: David S. Miller

David S. Miller
2015-09-27 07:08:27 +0800

25 Sep, 2015

1 commit

d8aecb101 net: revert "net_sched: move tp->root allocation into fw_init()" ... Browse Code »

fw filter uses tp->root==NULL to check if it is the old method,
so it doesn't need allocation at all in this case. This patch
reverts the offending commit and adds some comments for old
method to make it obvious.

Fixes: 33f8b9ecdb15 ("net_sched: move tp->root allocation into fw_init()")
Reported-by: Akshat Kakkar
Cc: Jamal Hadi Salim
Signed-off-by: Cong Wang
Acked-by: Jamal Hadi Salim
Signed-off-by: David S. Miller

WANG Cong
2015-09-25 05:33:30 +0800

24 Sep, 2015

3 commits

5cf8ca0e4 cls_bpf: further limit exec opcodes subset ... Browse Code »

Jamal suggested to further limit the currently allowed subset of opcodes
that may be used by a direct action return code as the intention is not
to replace the full action engine, but rather to have a minimal set that
can be used in the fast-path on things like ingress for some features
that cls_bpf supports.

Classifiers can, of course, still be chained together that have direct
action mode with those that have a full exec pass. For more complex
scenarios that go beyond this minimal set here, the full tcf_exts_exec()
path must be used.

Suggested-by: Jamal Hadi Salim
Signed-off-by: Daniel Borkmann
Acked-by: Alexei Starovoitov
Signed-off-by: David S. Miller

Daniel Borkmann
2015-09-24 05:29:02 +0800
ef146fa40 cls_bpf: make binding to classid optional ... Browse Code »

The binding to a particular classid was so far always mandatory for
cls_bpf, but it doesn't need to be. Therefore, lift this restriction
as similarly done in other classifiers.

Only a couple of qdiscs make use of class from the tcf_result, others
don't strictly care, so let the user choose his needs (those that read
out class can handle situations where it could be NULL).

An explicit check for tcf_unbind_filter() is also not needed here, as
the previous r->class was 0, so the xchg() will return that and
therefore a callback to the qdisc's unbind_tcf() is skipped.

Signed-off-by: Daniel Borkmann
Acked-by: Alexei Starovoitov
Signed-off-by: David S. Miller

Daniel Borkmann
2015-09-24 05:29:02 +0800
bf007d1c7 cls_bpf: also dump TCA_BPF_FLAGS ... Browse Code »

In commit 43388da42a49 ("cls_bpf: introduce integrated actions") we
have added TCA_BPF_FLAGS. We can also retrieve this information from
the prog, dump it back to user space as well. It's useful in tc when
displaying/dumping filter info.

Also, remove tp from cls_bpf_prog_from_efd(), came in as a conflict
from a rebase and it's unused here (later work may add it along with
a real user).

Signed-off-by: Daniel Borkmann
Acked-by: Alexei Starovoitov
Signed-off-by: David S. Miller

Daniel Borkmann
2015-09-24 05:29:01 +0800

19 Sep, 2015

3 commits

a31f1adc0 netfilter: nf_conntrack: Add a struct net parameter to l4_pkt_to_tuple ... Browse Code »

As gre does not have the srckey in the packet gre_pkt_to_tuple
needs to perform a lookup in it's per network namespace tables.

Pass in the proper network namespace to all pkt_to_tuple
implementations to ensure gre (and any similar protocols) can get this
right.

Signed-off-by: "Eric W. Biederman"
Signed-off-by: Pablo Neira Ayuso

Eric W. Biederman
2015-09-19 04:00:04 +0800
a4ffe319a act_connmark: Remember the struct net instead of guessing it. ... Browse Code »

Stop guessing the struct net instead of remember it. Guessing is just
silly and will be problematic in the future when I implement routes
between network namespaces.

Signed-off-by: "Eric W. Biederman"
Signed-off-by: Pablo Neira Ayuso

Eric W. Biederman
2015-09-19 03:59:31 +0800
156c196f6 netfilter: x_tables: Pass struct net in xt_action_param ... Browse Code »

As xt_action_param lives on the stack this does not bloat any
persistent data structures.

This is a first step in making netfilter code that needs to know
which network namespace it is executing in simpler.

Signed-off-by: "Eric W. Biederman"
Signed-off-by: Pablo Neira Ayuso

Eric W. Biederman
2015-09-19 03:58:14 +0800

18 Sep, 2015

3 commits

47bbbb30b sch_dsmark: improve memory locality ... Browse Code »

Memory placement in sch_dsmark is silly : Better place mask/value
in the same cache line.

Also, we can embed small arrays in the first cache line and
remove a potential cache miss.

Signed-off-by: Eric Dumazet
Signed-off-by: David S. Miller

Eric Dumazet
2015-09-18 13:37:19 +0800
27b29f630 bpf: add bpf_redirect() helper ... Browse Code »

Existing bpf_clone_redirect() helper clones skb before redirecting
it to RX or TX of destination netdev.
Introduce bpf_redirect() helper that does that without cloning.

Benchmarked with two hosts using 10G ixgbe NICs.
One host is doing line rate pktgen.
Another host is configured as:
$ tc qdisc add dev $dev ingress
$ tc filter add dev $dev root pref 10 u32 match u32 0 0 flowid 1:2 \
action bpf run object-file tcbpf1_kern.o section clone_redirect_xmit drop
so it receives the packet on $dev and immediately xmits it on $dev + 1
The section 'clone_redirect_xmit' in tcbpf1_kern.o file has the program
that does bpf_clone_redirect() and performance is 2.0 Mpps

$ tc filter add dev $dev root pref 10 u32 match u32 0 0 flowid 1:2 \
action bpf run object-file tcbpf1_kern.o section redirect_xmit drop
which is using bpf_redirect() - 2.4 Mpps

and using cls_bpf with integrated actions as:
$ tc filter add dev $dev root pref 10 \
bpf run object-file tcbpf1_kern.o section redirect_xmit integ_act classid 1
performance is 2.5 Mpps

To summarize:
u32+act_bpf using clone_redirect - 2.0 Mpps
u32+act_bpf using redirect - 2.4 Mpps
cls_bpf using redirect - 2.5 Mpps

For comparison linux bridge in this setup is doing 2.1 Mpps
and ixgbe rx + drop in ip_rcv - 7.8 Mpps

Signed-off-by: Alexei Starovoitov
Acked-by: Daniel Borkmann
Acked-by: John Fastabend
Signed-off-by: David S. Miller

Alexei Starovoitov
2015-09-18 12:09:07 +0800
045efa82f cls_bpf: introduce integrated actions ... Browse Code »

Often cls_bpf classifier is used with single action drop attached.
Optimize this use case and let cls_bpf return both classid and action.
For backwards compatibility reasons enable this feature under
TCA_BPF_FLAG_ACT_DIRECT flag.

Then more interesting programs like the following are easier to write:
int cls_bpf_prog(struct __sk_buff *skb)
{
/* classify arp, ip, ipv6 into different traffic classes
* and drop all other packets
*/
switch (skb->protocol) {
case htons(ETH_P_ARP):
skb->tc_classid = 1;
break;
case htons(ETH_P_IP):
skb->tc_classid = 2;
break;
case htons(ETH_P_IPV6):
skb->tc_classid = 3;
break;
default:
return TC_ACT_SHOT;
}

return TC_ACT_OK;
}

Joint work with Daniel Borkmann.

Signed-off-by: Daniel Borkmann
Signed-off-by: Alexei Starovoitov
Signed-off-by: David S. Miller

Daniel Borkmann
2015-09-18 12:09:06 +0800

02 Sep, 2015

1 commit

cd79a2382 flow_dissector: Add flags argument to skb_flow_dissector functions ... Browse Code »

The flags argument will allow control of the dissection process (for
instance whether to parse beyond L3).

Signed-off-by: Tom Herbert
Signed-off-by: David S. Miller

Tom Herbert
2015-09-02 06:06:22 +0800

29 Aug, 2015

1 commit

c1b3b1992 net: sched: don't break line in tc_classify loop notification ... Browse Code »

Just some minor noise follow-up to address some stylistic issues of
commit 3b3ae880266d ("net: sched: consolidate tc_classify{,_compat}").
Accidentally v1 instead of v2 of that commit got applied, so this
patch adds the relative diff.

Suggested-by: Alexei Starovoitov
Signed-off-by: Daniel Borkmann
Acked-by: Alexei Starovoitov
Signed-off-by: David S. Miller

Daniel Borkmann
2015-08-29 05:01:15 +0800

28 Aug, 2015

5 commits

0d36938bb Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net Browse Code »

David S. Miller
2015-08-28 12:45:31 +0800
3e692f215 net: sched: simplify attach_one_default_qdisc() ... Browse Code »

Now that noqueue qdisc can be attached just like any other qdisc, no
special treatment is necessary anymore when attaching it as default
qdisc.

This change has the added benefit that 'tc qdisc show' prints noqueue
instead of nothing for devices defaulting to noqueue.

Signed-off-by: Phil Sutter
Signed-off-by: David S. Miller

Phil Sutter
2015-08-28 08:14:30 +0800
d66d6c315 net: sched: register noqueue qdisc ... Browse Code »

This way users can attach noqueue just like any other qdisc using tc
without having to mess with tx_queue_len first.

Signed-off-by: Phil Sutter
Signed-off-by: David S. Miller

Phil Sutter
2015-08-28 08:14:30 +0800
db4094bca net: sched: ignore tx_queue_len when assigning default qdisc ... Browse Code »

Since alloc_netdev_mqs() sets IFF_NO_QUEUE for drivers not initializing
tx_queue_len, it is safe to assume that if tx_queue_len is zero,
dev->priv flags always contains IFF_NO_QUEUE.

Signed-off-by: Phil Sutter
Signed-off-by: David S. Miller

Phil Sutter
2015-08-28 08:14:29 +0800
3b3ae8802 net: sched: consolidate tc_classify{,_compat} ... Browse Code »

For classifiers getting invoked via tc_classify(), we always need an
extra function call into tc_classify_compat(), as both are being
exported as symbols and tc_classify() itself doesn't do much except
handling of reclassifications when tp->classify() returned with
TC_ACT_RECLASSIFY.

CBQ and ATM are the only qdiscs that directly call into tc_classify_compat(),
all others use tc_classify(). When tc actions are being configured
out in the kernel, tc_classify() effectively does nothing besides
delegating.

We could spare this layer and consolidate both functions. pktgen on
single CPU constantly pushing skbs directly into the netif_receive_skb()
path with a dummy classifier on ingress qdisc attached, improves
slightly from 22.3Mpps to 23.1Mpps.

Signed-off-by: Daniel Borkmann
Acked-by: Alexei Starovoitov
Signed-off-by: David S. Miller

Daniel Borkmann
2015-08-28 05:18:48 +0800

27 Aug, 2015

2 commits

cff82457c net_sched: act_bpf: remove spinlock in fast path ... Browse Code »

Similar to act_gact/act_mirred, act_bpf can be lockless in packet processing
with extra care taken to free bpf programs after rcu grace period.
Replacement of existing act_bpf (very rare) is done with synchronize_rcu()
and final destruction is done from tc_action_ops->cleanup() callback that is
called from tcf_exts_destroy()->tcf_action_destroy()->__tcf_hash_release() when
bind and refcnt reach zero which is only possible when classifier is destroyed.
Previous two patches fixed the last two classifiers (tcindex and rsvp) to
call tcf_exts_destroy() from rcu callback.

Similar to gact/mirred there is a race between prog->filter and
prog->tcf_action. Meaning that the program being replaced may use
previous default action if it happened to return TC_ACT_UNSPEC.
act_mirred race betwen tcf_action and tcfm_dev is similar.
In all cases the race is harmless.
Long term we may want to improve the situation by replacing the whole
tc_action->priv as single pointer instead of updating inner fields one by one.

Signed-off-by: Alexei Starovoitov
Acked-by: Daniel Borkmann
Signed-off-by: David S. Miller

Alexei Starovoitov
2015-08-27 02:01:45 +0800
9e528d891 net_sched: convert rsvp to call tcf_exts_destroy from rcu callback ... Browse Code »

Adjust destroy path of cls_rsvp to call tcf_exts_destroy() after
rcu grace period.

Signed-off-by: Alexei Starovoitov
Acked-by: Daniel Borkmann
Signed-off-by: David S. Miller

Alexei Starovoitov
2015-08-27 02:01:45 +0800