07 Jun, 2015

1 commit

  • eBPF programs attached to ingress and egress qdiscs see inconsistent
    skb->data: for ingress the L2 header is already pulled, whereas for
    egress it is still present. This is known to program writers, who are
    currently forced to use the BPF_LL_OFF workaround.

    Since programs don't change skb internal pointers, it is safe to do the
    pull/push right around the invocation of the program; earlier taps and
    the later pt->func() will not be affected (see the sketch at the end of
    this entry). Multiple taps via packet_rcv()/tpacket_rcv() do the same
    trick around run_filter/BPF_PROG_RUN even if the skb is shared.

    This fix finally allows programs to use optimized LD_ABS/IND instructions
    without BPF_LL_OFF for higher performance.
    tc ingress + cls_bpf + samples/bpf/tcbpf1_kern.o:

                w/o JIT    w/ JIT
        before  20.5       23.6   Mpps
        after   21.8       26.6   Mpps

    Old programs with BPF_LL_OFF will still work as-is.

    We can now undo most of the earlier workaround commit:
    a166151cbe33 ("bpf: fix bpf helpers to use skb->mac_header relative offsets")

    Signed-off-by: Alexei Starovoitov
    Acked-by: Jamal Hadi Salim
    Signed-off-by: David S. Miller

    Alexei Starovoitov
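
    A minimal sketch of the pull/push described above, with names taken
    from the commit text (the exact call sites in cls_bpf/act_bpf may
    differ):

        /* On ingress the L2 header was already pulled by eth_type_trans(),
         * so push it back just for the duration of the program run; on
         * egress it is still present, so the program can run directly. */
        if (at_ingress) {
                __skb_push(skb, skb->mac_len);
                filter_res = BPF_PROG_RUN(prog, skb);
                __skb_pull(skb, skb->mac_len);
        } else {
                filter_res = BPF_PROG_RUN(prog, skb);
        }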
     

05 Jun, 2015

2 commits

  • This patch adds full IPv6 addresses into flow_keys and uses them as
    input to the flow hash function. The implementation supports either
    IPv4 or IPv6 addresses in a union, and a selector is used to determine
    how many words to input to jhash2 (see the sketch below).

    We also add flow_get_u32_dst and flow_get_u32_src functions which are
    used to get a u32 representation of the source and destination
    addresses. For IPv6, ipv6_addr_hash is called. These functions preserve
    the legacy u32 values of src and dst in flow_keys.

    With this patch, Ethertype and IP protocol are now included in the
    flow hash input.

    Signed-off-by: Tom Herbert
    Signed-off-by: David S. Miller

    Tom Herbert
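
    A rough sketch of the union layout described above; field and selector
    names here are illustrative, not necessarily the kernel's:

        /* either address family in one union; addr_type says which */
        struct flow_keys_addrs {
                union {
                        struct {
                                __be32 src;
                                __be32 dst;
                        } v4;
                        struct {
                                struct in6_addr src;
                                struct in6_addr dst;
                        } v6;
                };
                u16 addr_type;  /* selects v4 vs. v6, and thereby how
                                 * many u32 words feed into jhash2() */
        };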
     
  • This patch changes flow hashing to use jhash2 over the flow_keys
    structure instead of just doing jhash_3words over src, dst, and ports.
    This method will allow us to take more input into the hashing function
    so that we can include full IPv6 addresses, VLAN, flow labels, etc.
    without needing to resort to xor'ing, which makes for a poor hash
    (see the sketch below).

    Acked-by: Jiri Pirko
    Signed-off-by: Tom Herbert
    Signed-off-by: David S. Miller

    Tom Herbert
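
    A hedged before/after sketch, assuming flow_keys is sized as a
    multiple of four bytes as jhash2() requires:

        /* before: only three fixed words went into the hash */
        hash = jhash_3words(keys.src, keys.dst, (__force u32)keys.ports,
                            hashrnd);

        /* after: hash the structure word-by-word, so new fields
         * (IPv6 addresses, VLAN, flow label) extend the input */
        hash = jhash2((const u32 *)&keys, sizeof(keys) / sizeof(u32),
                      hashrnd);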
     

02 Jun, 2015

1 commit

  • Conflicts:
    drivers/net/phy/amd-xgbe-phy.c
    drivers/net/wireless/iwlwifi/Kconfig
    include/net/mac80211.h

    iwlwifi/Kconfig and mac80211.h were both trivial overlapping
    changes.

    The drivers/net/phy/amd-xgbe-phy.c file got removed in 'net-next' and
    the bug fix that happened on the 'net' side is already integrated
    into the rest of the amd-xgbe driver.

    Signed-off-by: David S. Miller

    David S. Miller
     

28 May, 2015

1 commit

  • For the mq qdisc, we add per-tx-queue qdiscs to the root qdisc
    for display purposes; however, that happens too early,
    before the new dev->qdisc is finally set. This causes
    q->list to point to an old root qdisc which is going to be
    freed right before a new one is assigned.

    Fix this by moving ->attach() after setting dev->qdisc (see the
    sketch at the end of this entry).

    For the record, this fixes the following crash:

    ------------[ cut here ]------------
    WARNING: CPU: 1 PID: 975 at lib/list_debug.c:59 __list_del_entry+0x5a/0x98()
    list_del corruption. prev->next should be ffff8800d1998ae8, but was 6b6b6b6b6b6b6b6b
    CPU: 1 PID: 975 Comm: tc Not tainted 4.1.0-rc4+ #1019
    Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
    0000000000000009 ffff8800d73fb928 ffffffff81a44e7f 0000000047574756
    ffff8800d73fb978 ffff8800d73fb968 ffffffff810790da ffff8800cfc4cd20
    ffffffff814e725b ffff8800d1998ae8 ffffffff82381250 0000000000000000
    Call Trace:
    [] dump_stack+0x4c/0x65
    [] warn_slowpath_common+0x9c/0xb6
    [] ? __list_del_entry+0x5a/0x98
    [] warn_slowpath_fmt+0x46/0x48
    [] ? dev_graft_qdisc+0x5e/0x6a
    [] __list_del_entry+0x5a/0x98
    [] list_del+0xe/0x2d
    [] qdisc_list_del+0x1e/0x20
    [] qdisc_destroy+0x30/0xd6
    [] qdisc_graft+0x11d/0x243
    [] tc_get_qdisc+0x1a6/0x1d4
    [] ? mark_lock+0x2e/0x226
    [] rtnetlink_rcv_msg+0x181/0x194
    [] ? rtnl_lock+0x17/0x19
    [] ? rtnl_lock+0x17/0x19
    [] ? __rtnl_unlock+0x17/0x17
    [] netlink_rcv_skb+0x4d/0x93
    [] rtnetlink_rcv+0x26/0x2d
    [] netlink_unicast+0xcb/0x150
    [] ? might_fault+0x59/0xa9
    [] netlink_sendmsg+0x4fa/0x51c
    [] sock_sendmsg_nosec+0x12/0x1d
    [] sock_sendmsg+0x29/0x2e
    [] ___sys_sendmsg+0x1b4/0x23a
    [] ? native_sched_clock+0x35/0x37
    [] ? sched_clock_local+0x12/0x72
    [] ? sched_clock_cpu+0x9e/0xb7
    [] ? current_kernel_time+0xe/0x32
    [] ? lock_release_holdtime.part.29+0x71/0x7f
    [] ? read_seqcount_begin.constprop.27+0x5f/0x76
    [] ? trace_hardirqs_on_caller+0x17d/0x199
    [] ? __fget_light+0x50/0x78
    [] __sys_sendmsg+0x42/0x60
    [] SyS_sendmsg+0x12/0x1c
    [] system_call_fastpath+0x12/0x6f
    ---[ end trace ef29d3fb28e97ae7 ]---

    For the long term, we probably need to clean up the qdisc_graft() code
    in case it hides other bugs like this.

    Fixes: 95dc19299f74 ("pkt_sched: give visibility to mq slave qdiscs")
    Cc: Jamal Hadi Salim
    Signed-off-by: Cong Wang
    Acked-by: Eric Dumazet
    Signed-off-by: David S. Miller

    WANG Cong
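
    A minimal sketch of the reordering in the root-grafting path
    (simplified from qdisc_graft(); error handling elided):

        /* before (broken): ->attach() ran while dev->qdisc still
         * pointed at the old root, so q->list linked into a qdisc
         * that was about to be freed */

        /* after (fixed): publish the new root first, then attach
         * the per-tx-queue qdiscs for display */
        dev->qdisc = new ? new : &noop_qdisc;
        if (new && new->ops->attach)
                new->ops->attach(new);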
     

23 May, 2015

1 commit

  • Conflicts:
    drivers/net/ethernet/cadence/macb.c
    drivers/net/phy/phy.c
    include/linux/skbuff.h
    net/ipv4/tcp.c
    net/switchdev/switchdev.c

    Switchdev was a case of the RTNH_F_{EXTERNAL --> OFFLOAD}
    renaming overlapping with net-next changes of various
    sorts.

    phy.c was a case of two changes, one adding a local
    variable to a function whilst the second was removing
    one.

    tcp.c overlapped a deadlock fix with the addition of new tcp_info
    statistic values.

    macb.c involved the addition of two zynq device entries.

    skbuff.h involved adding back ipv4_daddr to nf_bridge_info
    whilst net-next changes put two other existing members of
    that struct into a union.

    Signed-off-by: David S. Miller

    David S. Miller
     

22 May, 2015

1 commit

  • Vijay reported that a loop as simple as ...

    while true; do
            tc qdisc add dev foo root handle 1: prio
            tc filter add dev foo parent 1: u32 match u32 0 0 flowid 1
            tc qdisc del dev foo root
            rmmod cls_u32
    done

    ... will panic the kernel. Moreover, he bisected the change
    apparently introducing it to 78fd1d0ab072 ("netlink: Re-add
    locking to netlink_lookup() and seq walker").

    The removal of synchronize_net() from the netlink socket
    triggering the qdisc to be removed seems to have uncovered
    an RCU vs. module reference count race in the tc API.
    Given that the RCU conversion was done after e341694e3eb5 ("netlink:
    Convert netlink_lookup() to use RCU protected hash table"),
    which added the synchronize_net() originally, the chance of
    hitting the bug was lower (though not impossible):

    When qdiscs that i) support attaching classifiers and
    ii) have at least one of them attached get deleted, they
    invoke tcf_destroy_chain(), and thus call into the ->destroy()
    handler from a classifier module.

    After the RCU conversion, all classifiers that have an internal
    prio list unlink them and initiate freeing via call_rcu()
    deferral.

    Meanwhile, tcf_destroy() already releases the reference to the
    tp->ops->owner module before the queued RCU callback handler
    has been invoked.

    A subsequent rmmod on the classifier module is then not prevented,
    since all module references are already dropped.

    By the time the kernel invokes the RCU callback handler from
    the module, that function address is then invalid.

    One way to fix it is to add an rcu_barrier() to
    unregister_tcf_proto_ops() to wait for all pending call_rcu()s
    to complete (see the sketch at the end of this entry).

    synchronize_rcu() is not appropriate, as under heavy RCU
    callback load registered call_rcu()s could be deferred
    longer than a grace period. In case we don't have any pending
    call_rcu()s, the barrier is allowed to return immediately.

    Since we came here via unregister_tcf_proto_ops(), there
    are no users of a given classifier anymore. Further nested
    call_rcu()s pointing into the module space are not being
    done anywhere.

    Only cls_bpf_delete_prog() may schedule a work item, to
    unlock pages eventually, but that is not in the range/context
    of cls_bpf anymore.

    Fixes: 25d8c0d55f24 ("net: rcu-ify tcf_proto")
    Fixes: 9888faefe132 ("net: sched: cls_basic use RCU")
    Reported-by: Vijay Subramanian
    Signed-off-by: Daniel Borkmann
    Cc: John Fastabend
    Cc: Eric Dumazet
    Cc: Thomas Graf
    Cc: Jamal Hadi Salim
    Cc: Alexei Starovoitov
    Tested-by: Vijay Subramanian
    Acked-by: Alexei Starovoitov
    Acked-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Daniel Borkmann
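
    A minimal sketch of the barrier placement (list handling in
    unregister_tcf_proto_ops() elided):

        void unregister_tcf_proto_ops(struct tcf_proto_ops *ops)
        {
                /* Wait for outstanding call_rcu() callbacks, which may
                 * still point into this module's text, to complete
                 * before rmmod can proceed; returns immediately when
                 * none are pending. */
                rcu_barrier();

                /* ... unlink ops from the classifier ops list ... */
        }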
     

15 May, 2015

1 commit

  • Fix compile error in net/sched/cls_flower.c

    net/sched/cls_flower.c: In function ‘fl_set_key’:
    net/sched/cls_flower.c:240:3: error: implicit declaration of
    function ‘tcf_change_indev’ [-Werror=implicit-function-declaration]
    err = tcf_change_indev(net, tb[TCA_FLOWER_INDEV]);

    Introduced in 77b9900ef53ae

    Fixes: 77b9900ef53ae ("tc: introduce Flower classifier")
    Signed-off-by: Brian Haley
    Signed-off-by: David S. Miller

    Brian Haley
     

14 May, 2015

7 commits

  • This new config switch enables the ingress filtering infrastructure that is
    controlled through the ingress_needed static key. This prepares for the
    introduction of the Netfilter ingress hook that resides under this unique
    static key.

    Note that CONFIG_SCH_INGRESS automatically selects this; that should be no
    problem, since it also depends on CONFIG_NET_CLS_ACT.

    Signed-off-by: Pablo Neira Ayuso
    Acked-by: Alexei Starovoitov
    Signed-off-by: David S. Miller

    Pablo Neira
     
  • This patch introduces a flow-based filter. So far, only the most essential
    packet fields are supported.

    This patch is only the first step. There is a lot of potential for
    performance improvements, and a lot of features are still missing.
    They will be addressed in follow-up patches.

    Signed-off-by: Jiri Pirko
    Acked-by: Jamal Hadi Salim
    Signed-off-by: David S. Miller

    Jiri Pirko
     
  • Signed-off-by: Jiri Pirko
    Signed-off-by: David S. Miller

    Jiri Pirko
     
  • Signed-off-by: Jiri Pirko
    Signed-off-by: David S. Miller

    Jiri Pirko
     
    Add a couple of empty lines on the way.

    Signed-off-by: Jiri Pirko
    Signed-off-by: David S. Miller

    Jiri Pirko
     
  • It seems all we want here is to avoid an endless 'goto reclassify' loop;
    tc_classify_compat even resets this counter when something other
    than TC_ACT_RECLASSIFY is returned, so this skb counter doesn't
    break hypothetical loops induced by something other than perpetual
    TC_ACT_RECLASSIFY return values (see the sketch at the end of this
    entry).

    skb_act_clone is now identical to skb_clone, so just use that.

    Tested with following (bogus) filter:
    tc filter add dev eth0 parent ffff: \
            protocol ip u32 match u32 0 0 police rate 10Kbit burst \
            64000 mtu 1500 action reclassify

    Acked-by: Daniel Borkmann
    Signed-off-by: Florian Westphal
    Acked-by: Alexei Starovoitov
    Acked-by: Jamal Hadi Salim
    Signed-off-by: David S. Miller

    Florian Westphal
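
    A hedged sketch of the guard that remains (simplified; the bound
    and variable names are illustrative):

        /* inside tc_classify_compat(): a plain local counter now
         * breaks perpetual TC_ACT_RECLASSIFY, instead of per-skb
         * state that had to survive skb_act_clone() */
        if (err == TC_ACT_RECLASSIFY) {
                if (unlikely(limit++ >= 4))
                        return TC_ACT_SHOT;
                goto reclassify;
        }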
     
  • Four minor merge conflicts:

    1) qca_spi.c renamed the local variable used for the SPI device
    from spi_device to spi, meanwhile the spi_set_drvdata() call
    got moved further up in the probe function.

    2) Two changes were both adding new members to codel params
    structure, and thus we had overlapping changes to the
    initializer function.

    3) 'net' was making a fix to sk_release_kernel() which is
    completely removed in 'net-next'.

    4) In net_namespace.c, the rtnl_net_fill() call for GET operations
    had the command value fixed, meanwhile 'net-next' adjusted the
    argument signature a bit.

    This also matches example merge resolutions posted by Stephen
    Rothwell over the past two days.

    Signed-off-by: David S. Miller

    David S. Miller
     

13 May, 2015

1 commit

  • In a GRED qdisc, if the default "virtual queue" (VQ) does not have drop
    parameters configured, then packets for the default VQ are not subjected
    to RED and are only dropped if the queue is larger than the net_device's
    tx_queue_len. This behavior is useful for WRED mode, since these packets
    will still influence the calculated average queue length and (therefore)
    the drop probability for all of the other VQs. However, for some drivers
    tx_queue_len is zero. In other cases the user may wish to make the limit
    the same for all VQs (including the default VQ with no drop parameters).

    This change adds a TCA_GRED_LIMIT attribute to set the GRED queue limit,
    in bytes, during qdisc setup. (This limit is in bytes to be consistent
    with the drop parameters.) The default limit is the same as for a bfifo
    queue (tx_queue_len * psched_mtu), as sketched below. If the drop
    parameters of any VQ are configured with a smaller limit than the GRED
    queue limit, that VQ will still observe the smaller limit instead.

    Signed-off-by: David Ward
    Signed-off-by: David S. Miller

    David Ward
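
    A hedged sketch of the setup-time default described above (netlink
    attribute parsing elided):

        if (tb[TCA_GRED_LIMIT])
                sch->limit = nla_get_u32(tb[TCA_GRED_LIMIT]);
        else
                sch->limit = qdisc_dev(sch)->tx_queue_len
                             * psched_mtu(qdisc_dev(sch));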
     

12 May, 2015

2 commits

  • The only remaining enqueue_root() user is netem, and it looks
    unnecessary:

    qdisc_skb_cb(skb)->pkt_len is preserved after one skb_clone().

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     
  • In WRED mode, the backlog for a single virtual queue (VQ) should not be
    used to determine queue behavior; instead the backlog is summed across
    all VQs. This sum is currently used when calculating the average queue
    lengths. It also needs to be used when determining if the queue's hard
    limit has been reached, and when reporting each VQ's backlog via netlink
    (see the sketch below). q->backlog will only be used if the queue
    switches out of WRED mode.

    Signed-off-by: David Ward
    Signed-off-by: David S. Miller

    David Ward
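
    One plausible shape for this, as a hedged sketch (the helper name
    is illustrative):

        /* in WRED mode all VQs share one accounting, so consult the
         * qdisc-wide backlog; otherwise use the per-VQ counter */
        static inline u32 gred_backlog(struct gred_sched *table,
                                       struct gred_sched_data *q,
                                       struct Qdisc *sch)
        {
                return gred_wred_mode(table) ? sch->qstats.backlog
                                             : q->backlog;
        }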
     

11 May, 2015

2 commits

  • Ingress qdisc has no other purpose than calling into tc_classify()
    that executes attached classifier(s) and action(s).

    It has a 1:1 relationship to dev->ingress_queue. After having commit
    087c1a601ad7 ("net: sched: run ingress qdisc without locks") removed
    the central ingress lock, one major contention point is gone.

    The extra indirection layers, however, are not necessary for calling
    into ingress qdisc. pktgen calling locally into netif_receive_skb()
    with a dummy u32, single CPU result on a Supermicro X10SLM-F, Xeon
    E3-1240: before ~21.1 Mpps, after patch ~22.9 Mpps.

    We can redirect the private classifier list to the netdev directly,
    without changing any classifier API bits (!) and execute on that from
    the handle_ing() side (see the sketch at the end of this entry). The
    __QDISC_STATE_DEACTIVATED test can be removed; ingress qdisc doesn't
    have a queue and thus dev_deactivate_queue() is also not applicable;
    ingress_cl_list provides similar behaviour.
    In other words, ingress qdisc acts like TCQ_F_BUILTIN qdisc.

    One next possible step is the removal of the dev's ingress (dummy)
    netdev_queue, and to only have the list member in the netdevice
    itself.

    Note, the filter chain is RCU protected and individual filter elements
    are being kfree'd by sched subsystem after RCU grace period. RCU read
    lock is being held by __netif_receive_skb_core().

    Joint work with Alexei Starovoitov.

    Signed-off-by: Daniel Borkmann
    Signed-off-by: Alexei Starovoitov
    Signed-off-by: David S. Miller

    Daniel Borkmann
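
    A hedged sketch of the direct execution path (heavily simplified;
    the real handle_ing() also deals with actions and statistics):

        static inline struct sk_buff *
        handle_ing(struct sk_buff *skb, struct packet_type **pt_prev,
                   int *ret, struct net_device *orig_dev)
        {
                /* the filter chain now hangs off the netdev itself,
                 * protected by the RCU read lock already held in
                 * __netif_receive_skb_core() */
                struct tcf_proto *cl =
                        rcu_dereference_bh(skb->dev->ingress_cl_list);

                if (!cl)
                        return skb;

                /* ... run tc_classify() on cl and act on the verdict ... */
                return skb;
        }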
     
  • For DCTCP or similar ECN based deployments on fabrics with shallow
    buffers, hosts are responsible for a good part of the buffering.

    This patch adds an optional ce_threshold to codel & fq_codel qdiscs,
    so that DCTCP can have feedback from queuing in the host.

    A DCTCP enabled egress port simply has a queue occupancy threshold
    above which ECT packets get CE marked.

    In codel language this translates to a sojourn time, so that one doesn't
    have to worry about bytes or bandwidth, only delays (see the sketch at
    the end of this entry).

    This makes the host an active participant in the health of the whole
    network.

    This also helps with experimenting with DCTCP on a setup without a
    DCTCP compliant fabric.

    In the following example, ce_threshold is set to 1ms, and we can see from
    'ldelay xxx us' that TCP is not trying to go around the 5ms codel
    target.

    Queue has more capacity to absorb inelastic bursts (say from UDP
    traffic), as queues are maintained to an optimal level.

    lpaa23:~# ./tc -s -d qd sh dev eth1
    qdisc mq 1: dev eth1 root
    Sent 87910654696 bytes 58065331 pkt (dropped 0, overlimits 0 requeues 42961)
    backlog 3108242b 364p requeues 42961
    qdisc codel 8063: dev eth1 parent 1:1 limit 1000p target 5.0ms ce_threshold 1.0ms interval 100.0ms
    Sent 7363778701 bytes 4863809 pkt (dropped 0, overlimits 0 requeues 5503)
    rate 2348Mbit 193919pps backlog 255866b 46p requeues 5503
    count 0 lastcount 0 ldelay 1.0ms drop_next 0us
    maxpacket 68130 ecn_mark 0 drop_overlimit 0 ce_mark 72384
    qdisc codel 8064: dev eth1 parent 1:2 limit 1000p target 5.0ms ce_threshold 1.0ms interval 100.0ms
    Sent 7636486190 bytes 5043942 pkt (dropped 0, overlimits 0 requeues 5186)
    rate 2319Mbit 191538pps backlog 207418b 64p requeues 5186
    count 0 lastcount 0 ldelay 694us drop_next 0us
    maxpacket 68130 ecn_mark 0 drop_overlimit 0 ce_mark 69873
    qdisc codel 8065: dev eth1 parent 1:3 limit 1000p target 5.0ms ce_threshold 1.0ms interval 100.0ms
    Sent 11569360142 bytes 7641602 pkt (dropped 0, overlimits 0 requeues 5554)
    rate 3041Mbit 251096pps backlog 210446b 59p requeues 5554
    count 0 lastcount 0 ldelay 889us drop_next 0us
    maxpacket 68130 ecn_mark 0 drop_overlimit 0 ce_mark 37780
    ...

    Signed-off-by: Eric Dumazet
    Cc: Florian Westphal
    Cc: Daniel Borkmann
    Cc: Glenn Judd
    Cc: Nandita Dukkipati
    Cc: Neal Cardwell
    Cc: Yuchung Cheng
    Acked-by: Neal Cardwell
    Signed-off-by: David S. Miller

    Eric Dumazet
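
    A minimal sketch of the marking rule, assuming codel's usual
    sojourn-time bookkeeping (field names approximate):

        /* at dequeue: CE-mark any ECT packet that sat in the queue
         * longer than ce_threshold; the normal target/interval drop
         * law is unchanged and still applies on top of this */
        if (skb && codel_time_after(now - codel_get_enqueue_time(skb),
                                    params->ce_threshold))
                INET_ECN_set_ce(skb);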
     

10 May, 2015

1 commit

  • When tcf_destroy() returns true, tp could already be destroyed, so
    we should not use tp->next after that (see the sketch at the end of
    this entry).

    For the long term, we probably should move the tp list to a list_head.

    Fixes: 1e052be69d04 ("net_sched: destroy proto tp when all filters are gone")
    Cc: Jamal Hadi Salim
    Signed-off-by: Cong Wang
    Acked-by: Jamal Hadi Salim
    Signed-off-by: David S. Miller

    WANG Cong
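
    One hedged sketch of the pattern (surrounding filter-walk code
    simplified, variable names illustrative):

        /* read the next pointer before tcf_destroy() can free tp */
        struct tcf_proto *next = rtnl_dereference(tp->next);

        if (tcf_destroy(tp, false))
                RCU_INIT_POINTER(*back, next);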
     

04 May, 2015

7 commits

  • Call make_flow_keys_digest to get a digest from flow keys, and
    use that to pass in the skbuff cb and for comparing flows.

    Signed-off-by: Tom Herbert
    Signed-off-by: David S. Miller

    Tom Herbert
     
  • Call skb_get_hash_perturb instead of doing skb_flow_dissect and then
    jhash by hand (the same pattern as the next three commits; a sketch
    follows this entry).

    Signed-off-by: Tom Herbert
    Signed-off-by: David S. Miller

    Tom Herbert
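
    The replacement pattern shared by these four qdisc conversions, as a
    hedged sketch (the perturbation field name varies per qdisc):

        /* before: open-coded dissect plus a hand-rolled jhash over
         * selected fields */
        struct flow_keys keys;

        skb_flow_dissect(skb, &keys);
        hash = jhash_3words((__force u32)keys.dst,
                            (__force u32)keys.src,
                            (__force u32)keys.ports, q->perturbation);

        /* after: one helper that dissects and hashes with a perturbed
         * key, so all callers share one canonical flow hash */
        hash = skb_get_hash_perturb(skb, q->perturbation);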
     
  • Call skb_get_hash_perturb instead of doing skb_flow_dissect and then
    jhash by hand.

    Signed-off-by: Tom Herbert
    Signed-off-by: David S. Miller

    Tom Herbert
     
  • Call skb_get_hash_perturb instead of doing skb_flow_dissect and then
    jhash by hand.

    Signed-off-by: Tom Herbert
    Signed-off-by: David S. Miller

    Tom Herbert
     
  • Call skb_get_hash_perturb instead of doing skb_flow_dissect and then
    jhash by hand.

    Signed-off-by: Tom Herbert
    Signed-off-by: David S. Miller

    Tom Herbert
     
  • TC classifiers/actions were converted to RCU by John in the series:
    http://thread.gmane.org/gmane.linux.network/329739/focus=329739
    and many follow on patches.
    This is the last patch from that series that finally drops
    ingress spin_lock.

    Single cpu ingress+u32 performance goes from 22.9 Mpps to 24.5 Mpps.

    In the two cpu case, when both cores are receiving traffic on the same
    device and go into the same ingress+u32, the performance jumps
    from 4.5 + 4.5 Mpps to 23.5 + 23.5 Mpps.

    Signed-off-by: John Fastabend
    Signed-off-by: Alexei Starovoitov
    Signed-off-by: Jamal Hadi Salim
    Acked-by: Daniel Borkmann
    Signed-off-by: David S. Miller

    Alexei Starovoitov
     
  • In the presence of TSO/GSO/GRO packets, codel at low rates can be quite
    useless. In the following example, not a single packet was ever dropped,
    while the average delay in the codel queue is ~100 ms!

    qdisc codel 0: parent 1:12 limit 16000p target 5.0ms interval 100.0ms
    Sent 134376498 bytes 88797 pkt (dropped 0, overlimits 0 requeues 0)
    backlog 13626b 3p requeues 0
    count 0 lastcount 0 ldelay 96.9ms drop_next 0us
    maxpacket 9084 ecn_mark 0 drop_overlimit 0

    This comes from a confusion about what the minimal backlog should be.
    It is pretty clear it is not 64KB or whatever max GSO packet ever
    reached the qdisc.

    codel's intent was to use the MTU of the device (see the sketch at the
    end of this entry).

    After the fix, we finally drop some packets, and rtt/cwnd of my single
    TCP flow are meeting our expectations.

    qdisc codel 0: parent 1:12 limit 16000p target 5.0ms interval 100.0ms
    Sent 102798497 bytes 67912 pkt (dropped 1365, overlimits 0 requeues 0)
    backlog 6056b 3p requeues 0
    count 1 lastcount 1 ldelay 36.3ms drop_next 0us
    maxpacket 10598 ecn_mark 0 drop_overlimit 0

    Signed-off-by: Eric Dumazet
    Cc: Kathleen Nichols
    Cc: Dave Taht
    Cc: Van Jacobson
    Signed-off-by: David S. Miller

    Eric Dumazet
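
    A hedged sketch of the adjusted emptiness test (simplified from
    codel_should_drop()):

        /* remember the device MTU at setup time */
        params->mtu = psched_mtu(qdisc_dev(sch));

        /* treat the queue as uncongested once the backlog is below
         * one MTU, instead of below one max-sized (GSO) packet */
        if (codel_time_before(vars->ldelay, params->target) ||
            sch->qstats.backlog <= params->mtu) {
                /* went below - stay below for at least interval */
                vars->first_above_time = 0;
                return false;
        }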
     

03 May, 2015

1 commit

  • Not used.

    pedit sets TC_MUNGED when packet content was altered, but all the core
    does is unset MUNGED again and then set OK2MUNGE.

    And the latter isn't tested anywhere. So let's remove both
    TC_MUNGED and TC_OK2MUNGE.

    Signed-off-by: Florian Westphal
    Acked-by: Alexei Starovoitov
    Acked-by: Daniel Borkmann
    Acked-by: Jamal Hadi Salim
    Signed-off-by: David S. Miller

    Florian Westphal
     

18 Apr, 2015

1 commit

  • When you redirect a VLAN device to any device, you end up with
    crap in af_packet on the xmit path, because hard_header_len is
    not equal to skb->mac_len. So the redirected packet contains
    four extra bytes at the start, which then get interpreted as
    part of the MAC address.

    This patch fixes this by only pushing skb->mac_len (see the sketch
    at the end of this entry). We also need to fix ifb because it tries
    to undo the pushing done by act_mirred.

    Signed-off-by: Herbert Xu
    Acked-by: Alexei Starovoitov
    Signed-off-by: David S. Miller

    Herbert Xu
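
    A minimal before/after sketch of the push in act_mirred:

        /* before: pushes hard_header_len, which for a VLAN device
         * counts 4 bytes that the skb's MAC header doesn't have */
        skb_push(skb2, skb2->dev->hard_header_len);

        /* after: push exactly the MAC header that is really there */
        skb_push(skb2, skb->mac_len);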
     

17 Apr, 2015

1 commit

  • For the short-term solution, let's fix the bpf helper functions to use
    skb->mac_header relative offsets instead of skb->data, in order to
    make the same eBPF programs with cls_bpf and act_bpf work on both the
    ingress and egress qdisc paths. We need to ensure that mac_header is
    set before calling into programs. This is effectively the first option
    from the below referenced discussion.

    A more long-term solution for LD_ABS|LD_IND instructions will be more
    intrusive but also more beneficial than this, and will be implemented
    later, as it's too risky at this point in time.

    I.e., we plan to look into the option of moving skb_pull() out of
    eth_type_trans() and into netif_receive_skb(), as has been suggested
    as the second option. Meanwhile, this solution ensures ingress can be
    used with eBPF, too, and that we won't run into ABI troubles later.
    For dealing with negative offsets inside eBPF helper functions,
    we've implemented bpf_skb_clone_unwritable() to test for unwriteable
    headers.

    Reference: http://thread.gmane.org/gmane.linux.network/359129/focus=359694
    Fixes: 608cd71a9c7c ("tc: bpf: generalize pedit action")
    Fixes: 91bc4822c3d6 ("tc: bpf: add checksum helpers")
    Signed-off-by: Alexei Starovoitov
    Signed-off-by: Daniel Borkmann
    Signed-off-by: David S. Miller

    Alexei Starovoitov
     

14 Apr, 2015

1 commit

  • Even if we make use of classifier and actions from the egress
    path, we're going into handle_ing() executing additional code
    on a per-packet cost for ingress qdisc, just to realize that
    nothing is attached on ingress.

    Instead, this can just be blinded out as a no-op entirely with
    the use of a static key. On input fast-path, we already make
    use of static keys in various places, e.g. skb time stamping,
    in RPS, etc. It makes sense to not waste time when we're assured
    that no ingress qdisc is attached anywhere.

    Enabling/disabling of that code path is done via two
    helpers, namely net_{inc,dec}_ingress_queue(), which are
    invoked under the RTNL mutex when an ingress qdisc is being either
    initialized or destructed. A sketch follows this entry.

    Signed-off-by: Daniel Borkmann
    Acked-by: Alexei Starovoitov
    Signed-off-by: David S. Miller

    Daniel Borkmann
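
    A hedged sketch of the static-key gating (helper names from the
    commit text; the receive-path code is simplified):

        static struct static_key ingress_needed __read_mostly;

        void net_inc_ingress_queue(void)
        {
                static_key_slow_inc(&ingress_needed);
        }

        void net_dec_ingress_queue(void)
        {
                static_key_slow_dec(&ingress_needed);
        }

        /* in __netif_receive_skb_core(): the branch is patched out at
         * runtime unless at least one ingress qdisc exists */
        if (static_key_false(&ingress_needed)) {
                skb = handle_ing(skb, &pt_prev, &ret, orig_dev);
                if (!skb)
                        goto out;
        }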
     

08 Apr, 2015

1 commit

  • Fixes byte backlog accounting for the first of two chained netem instances.
    Bytes backlog reported now corresponds to the number of queued packets.

    When two netem instances are chained, for instance to apply rate and queue
    limitation followed by packet delay, the number of backlogged bytes reported
    by the first netem instance is wrong. It reports the sum of bytes in the queues
    of the first and second netem. The first netem reports the correct number of
    backlogged packets but not bytes. This is shown in the example below.

    Consider a chain of two netem schedulers created using the following commands:

    $ tc -s qdisc replace dev veth2 root handle 1:0 netem rate 10000kbit limit 100
    $ tc -s qdisc add dev veth2 parent 1:0 handle 2: netem delay 50ms

    Start an iperf session to send packets out on the specified interface and
    monitor the backlog using tc:

    $ tc -s qdisc show dev veth2

    Output using unpatched netem:
    qdisc netem 1: root refcnt 2 limit 100 rate 10000Kbit
    Sent 98422639 bytes 65434 pkt (dropped 123, overlimits 0 requeues 0)
    backlog 172694b 73p requeues 0
    qdisc netem 2: parent 1: limit 1000 delay 50.0ms
    Sent 98422639 bytes 65434 pkt (dropped 0, overlimits 0 requeues 0)
    backlog 63588b 42p requeues 0

    The interface used to produce this output has an MTU of 1500. The output for
    backlogged bytes behind netem 1 is 172694b. This value is not correct. Consider
    the total number of sent bytes and packets. By dividing the number of sent
    bytes by the number of sent packets, we get an average packet size of ~=1504.
    If we divide the number of backlogged bytes by packets, we get ~=2365. This is
    due to the first netem incorrectly counting the 63588b which are in netem 2's
    queue as being in its own queue. To verify this is the case, we subtract them
    from the reported value and divide by the number of packets as follows:
    172694 - 63588 = 109106 bytes actually backlogged in netem 1
    109106 / 73 packets ~= 1494 bytes (which matches our MTU)

    The root cause is that the byte accounting is not done at the
    same time as the packet accounting. The solution is to update the
    backlog value every time the packet queue is updated (see the sketch
    at the end of this entry).

    Signed-off-by: Joseph D Beshay
    Acked-by: Hagen Paul Pfeifer
    Signed-off-by: David S. Miller

    Beshay, Joseph
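
    A hedged sketch of the accounting rule (exact call sites in
    netem_enqueue()/netem_dequeue() elided):

        /* whenever a packet leaves this netem's tfifo, its bytes must
         * leave the byte backlog together with the packet count */
        sch->q.qlen--;
        sch->qstats.backlog -= qdisc_pkt_len(skb);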
     

21 Mar, 2015

2 commits

  • This work extends the "classic" BPF programmable tc action by extending
    its scope also to native eBPF code!

    Together with commit e2e9b6541dd4 ("cls_bpf: add initial eBPF support
    for programmable classifiers"), this adds the facility to implement fully
    flexible classifiers and actions for tc that can be implemented in a C
    subset in user space, "safely" loaded into the kernel, and run at
    native speed when JITed.

    Also, since eBPF maps can be shared between eBPF programs, it offers the
    possibility that cls_bpf and act_bpf can share data 1) between themselves
    and 2) between user space applications. That means that, e.g., customized
    runtime statistics can be collected in user space, but also, more
    importantly, classifier and action behaviour could be altered based on
    map input from the user space application.

    For the remaining details on the workflow and integration, see the cls_bpf
    commit e2e9b6541dd4. Preliminary iproute2 part can be found under [1].

    [1] http://git.breakpoint.cc/cgit/dborkman/iproute2.git/log/?h=ebpf-act

    Signed-off-by: Daniel Borkmann
    Cc: Jamal Hadi Salim
    Cc: Jiri Pirko
    Acked-by: Jiri Pirko
    Acked-by: Alexei Starovoitov
    Signed-off-by: David S. Miller

    Daniel Borkmann
     
  • Conflicts:
    drivers/net/ethernet/emulex/benet/be_main.c
    net/core/sysctl_net_core.c
    net/ipv4/inet_diag.c

    The be_main.c conflict resolution was really tricky. The conflict
    hunks generated by GIT were very unhelpful, to say the least. It
    split functions in half and moved them around, when the real actual
    conflict only existed solely inside of one function, that being
    be_map_pci_bars().

    So instead, to resolve this, I checked out be_main.c from the top
    of net-next, then I applied the be_main.c changes from 'net' since
    the last time I merged. And this worked beautifully.

    The inet_diag.c and sysctl_net_core.c conflicts were simple
    overlapping changes, and were easy to resolve.

    Signed-off-by: David S. Miller

    David S. Miller
     

18 Mar, 2015

1 commit

  • Revisiting commit d23b8ad8ab23 ("tc: add BPF based action") with regard
    to eBPF support, I was thinking that it might be better to improve the
    return semantics of a BPF program invoked through BPF_PROG_RUN().

    Currently, in case filter_res is 0, we overwrite the default action
    opcode with TC_ACT_SHOT. A default action opcode configured through tc's
    m_bpf can be: TC_ACT_RECLASSIFY, TC_ACT_PIPE, TC_ACT_SHOT, TC_ACT_UNSPEC,
    TC_ACT_OK.

    In cls_bpf, we have the possibility to overwrite the default class
    associated with the classifier in case filter_res is _not_ 0xffffffff
    (-1).

    That allows us to fold multiple [e]BPF programs into a single one, where
    they would otherwise need to be defined as a separate classifier with
    its own classid, needlessly redoing parsing work, etc.

    Similarly, we could do better in act_bpf: since the above TC_ACT* opcodes
    are exported to UAPI anyway, we reuse them for the return-code-to-tc-opcode
    mapping, where we allow the above possibilities. Thus, as in cls_bpf,
    a filter_res of 0xffffffff (-1) means that the configured _default_ action
    is used. Any unknown return code from the BPF program would fail in
    tcf_bpf() with TC_ACT_UNSPEC (see the sketch at the end of this entry).

    Should we one day want to make use of TC_ACT_STOLEN or TC_ACT_QUEUED,
    which both have the same semantics, we have the option to either use
    that as a default action (filter_res of 0xffffffff) or non-default BPF
    return code.

    All that will allow us to transparently use tcf_bpf() for both BPF
    flavours.

    Signed-off-by: Daniel Borkmann
    Cc: Jiri Pirko
    Cc: Alexei Starovoitov
    Cc: Jamal Hadi Salim
    Acked-by: Jiri Pirko
    Signed-off-by: David S. Miller

    Daniel Borkmann
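
    A hedged sketch of the mapping described above (simplified from what
    tcf_bpf() would look like under this scheme):

        filter_res = BPF_PROG_RUN(prog, skb);

        switch (filter_res) {
        case TC_ACT_PIPE:
        case TC_ACT_RECLASSIFY:
        case TC_ACT_OK:
        case TC_ACT_SHOT:
                action = filter_res;            /* explicit opcode */
                break;
        case TC_ACT_UNSPEC:                     /* 0xffffffff / -1 */
                action = tcf_action;            /* configured default */
                break;
        default:
                action = TC_ACT_UNSPEC;         /* unknown return code */
                break;
        }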
     

13 Mar, 2015

1 commit

  • Currently, it is possible in cls_bpf to access eBPF maps only under
    rcu_read_lock_bh() variants: on the ingress side, that is, handle_ing(),
    the classifier would be called from __netif_receive_skb_core() under
    rcu_read_lock(); on the egress side, however, it's rcu_read_lock_bh()
    via __dev_queue_xmit().

    This rcu/rcu_bh mix doesn't work together with eBPF maps, as they
    require being called solely under rcu_read_lock(). eBPF maps could also
    be shared among various other eBPF programs (possibly even with other
    eBPF program types, f.e. tracing) and user space processes, so any
    context is assumed.

    Therefore, a possible fix for cls_bpf is to wrap/nest the eBPF program
    invocation under the non-bh RCU lock variant (see the sketch at the end
    of this entry).

    Fixes: e2e9b6541dd4 ("cls_bpf: add initial eBPF support for programmable classifiers")
    Signed-off-by: Daniel Borkmann
    Acked-by: Alexei Starovoitov
    Signed-off-by: David S. Miller

    Daniel Borkmann
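
    A minimal sketch of the wrapping (classification details elided):

        /* egress already runs under rcu_read_lock_bh(); nesting the
         * plain read lock makes map access valid on both paths */
        rcu_read_lock();
        filter_res = BPF_PROG_RUN(prog->filter, skb);
        rcu_read_unlock();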