Eric Lee / smarc-ti-linux-kernel | Embedian Git Server

17 Dec, 2014

1 commit

01da9f8b6 net: fix suspicious rcu_dereference_check in net/sched/sch_fq_codel.c ... Browse Code »

[ Upstream commit 69204cf7eb9c5a72067ce6922d4699378251d053 ]

commit 46e5da40ae (net: qdisc: use rcu prefix and silence
sparse warnings) triggers a spurious warning:

net/sched/sch_fq_codel.c:97 suspicious rcu_dereference_check() usage!

The code should be using the _bh variant of rcu_dereference.

Signed-off-by: Valdis Kletnieks
Acked-by: Eric Dumazet
Acked-by: John Fastabend
Signed-off-by: David S. Miller
Signed-off-by: Greg Kroah-Hartman

Valdis.Kletnieks@vt.edu
2014-12-17 01:39:06 +0800

30 Oct, 2014

1 commit

d56109020 sch_pie: schedule the timer after all init succeed ... Browse Code »

Cc: Vijay Subramanian
Cc: David S. Miller
Signed-off-by: Cong Wang
Acked-by: Eric Dumazet

WANG Cong
2014-10-30 02:28:01 +0800

22 Oct, 2014

1 commit

7c1c97d54 net: sched: initialize bstats syncp ... Browse Code »

Use netdev_alloc_pcpu_stats to allocate percpu stats and initialize syncp.

Fixes: 22e0f8b9322c "net: sched: make bstats per cpu and estimator RCU safe"
Signed-off-by: Sabrina Dubroca
Acked-by: Cong Wang
Signed-off-by: David S. Miller

Sabrina Dubroca
2014-10-22 09:45:21 +0800

10 Oct, 2014

1 commit

b8358d70c net_sched: restore qdisc quota fairness limits after bulk dequeue ... Browse Code »

Restore the quota fairness between qdisc's, that we broke with commit
5772e9a346 ("qdisc: bulk dequeue support for qdiscs with TCQ_F_ONETXQUEUE").

Before that commit, the quota in __qdisc_run() were in packets as
dequeue_skb() would only dequeue a single packet, that assumption
broke with bulk dequeue.

We choose not to account for the number of packets inside the TSO/GSO
packets (accessable via "skb_gso_segs"). As the previous fairness
also had this "defect". Thus, GSO/TSO packets counts as a single
packet.

Further more, we choose to slack on accuracy, by allowing a bulk
dequeue try_bulk_dequeue_skb() to exceed the "packets" limit, only
limited by the BQL bytelimit. This is done because BQL prefers to get
its full budget for appropriate feedback from TX completion.

In future, we might consider reworking this further and, if it allows,
switch to a time-based model, as suggested by Eric. Right now, we only
restore old semantics.

Joint work with Eric, Hannes, Daniel and Jesper. Hannes wrote the
first patch in cooperation with Daniel and Jesper. Eric rewrote the
patch.

Fixes: 5772e9a346 ("qdisc: bulk dequeue support for qdiscs with TCQ_F_ONETXQUEUE")
Signed-off-by: Eric Dumazet
Signed-off-by: Jesper Dangaard Brouer
Signed-off-by: Hannes Frederic Sowa
Signed-off-by: Daniel Borkmann
Signed-off-by: David S. Miller

Jesper Dangaard Brouer
2014-10-10 07:12:26 +0800

09 Oct, 2014

2 commits

64b1f00a0 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net Browse Code »

David S. Miller
2014-10-09 04:22:22 +0800
5301e3e11 net_sched: copy exts->type in tcf_exts_change() ... Browse Code »

We need to copy exts->type when committing the change, otherwise
it would be always 0. This is a quick fix for -net and -stable,
for net-next tcf_exts will be removed.

Fixes: commit 33be627159913b094bb578e83 ("net_sched: act: use standard struct list_head")
Reported-by: Jamal Hadi Salim
Cc: Jamal Hadi Salim
Cc: John Fastabend
Signed-off-by: Cong Wang
Acked-by: Jamal Hadi Salim
Signed-off-by: David S. Miller

WANG Cong
2014-10-09 03:41:27 +0800

08 Oct, 2014

1 commit

028758788 net: better IFF_XMIT_DST_RELEASE support ... Browse Code »

Testing xmit_more support with netperf and connected UDP sockets,
I found strange dst refcount false sharing.

Current handling of IFF_XMIT_DST_RELEASE is not optimal.

Dropping dst in validate_xmit_skb() is certainly too late in case
packet was queued by cpu X but dequeued by cpu Y

The logical point to take care of drop/force is in __dev_queue_xmit()
before even taking qdisc lock.

As Julian Anastasov pointed out, need for skb_dst() might come from some
packet schedulers or classifiers.

This patch adds new helper to cleanly express needs of various drivers
or qdiscs/classifiers.

Drivers that need skb_dst() in their ndo_start_xmit() should call
following helper in their setup instead of the prior :

dev->priv_flags &= ~IFF_XMIT_DST_RELEASE;
->
netif_keep_dst(dev);

Instead of using a single bit, we use two bits, one being
eventually rebuilt in bonding/team drivers.

The other one, is permanent and blocks IFF_XMIT_DST_RELEASE being
rebuilt in bonding/team. Eventually, we could add something
smarter later.

Signed-off-by: Eric Dumazet
Cc: Julian Anastasov
Signed-off-by: David S. Miller

Eric Dumazet
2014-10-08 01:22:11 +0800

07 Oct, 2014

3 commits

18cdb37eb net: sched: do not use tcf_proto 'tp' argument from call_rcu ... Browse Code »

Using the tcf_proto pointer 'tp' from inside the classifiers callback
is not valid because it may have been cleaned up by another call_rcu
occuring on another CPU.

'tp' is currently being used by tcf_unbind_filter() in this patch we
move instances of tcf_unbind_filter outside of the call_rcu() context.
This is safe to do because any running schedulers will either read the
valid class field or it will be zeroed.

And all schedulers today when the class is 0 do a lookup using the
same call used by the tcf_exts_bind(). So even if we have a running
classifier hit the null class pointer it will do a lookup and get
to the same result. This is particularly fragile at the moment because
the only way to verify this is to audit the schedulers call sites.

Reported-by: Cong Wang
Signed-off-by: John Fastabend
Acked-by: Cong Wang
Signed-off-by: David S. Miller

John Fastabend
2014-10-07 06:02:33 +0800
13990f815 net: sched: cls_cgroup tear down exts and ematch from rcu callback ... Browse Code »

It is not RCU safe to destroy the action chain while there
is a possibility of readers accessing it. Move this code
into the rcu callback using the same rcu callback used in the
code patch to make a change to head.

Signed-off-by: John Fastabend
Acked-by: Cong Wang
Signed-off-by: David S. Miller

John Fastabend
2014-10-07 06:02:32 +0800
82a470f11 net: sched: remove tcf_proto from ematch calls ... Browse Code »

This removes the tcf_proto argument from the ematch code paths that
only need it to reference the net namespace. This allows simplifying
qdisc code paths especially when we need to tear down the ematch
from an RCU callback. In this case we can not guarentee that the
tcf_proto structure is still valid.

Signed-off-by: John Fastabend
Acked-by: Cong Wang
Signed-off-by: David S. Miller

John Fastabend
2014-10-07 06:02:32 +0800

06 Oct, 2014

1 commit

f2600cf02 net: sched: avoid costly atomic operation in fq_dequeue() ... Browse Code »
2

Standard qdisc API to setup a timer implies an atomic operation on every
packet dequeue : qdisc_unthrottled()

It turns out this is not really needed for FQ, as FQ has no concept of
global qdisc throttling, being a qdisc handling many different flows,
some of them can be throttled, while others are not.

Fix is straightforward : add a 'bool throttle' to
qdisc_watchdog_schedule_ns(), and remove calls to qdisc_unthrottled()
in sch_fq.

Signed-off-by: Eric Dumazet
Signed-off-by: David S. Miller

Eric Dumazet
2014-10-06 12:55:10 +0800

05 Oct, 2014

2 commits

34a419d4e ematch: Fix early ending of inverted containers. ... Browse Code »

The result of a negated container has to be inverted before checking for
early ending.

This fixes my previous attempt (17c9c8232663a47f074b7452b9b034efda868ca7) to
make inverted containers work correctly.

Signed-off-by: Ignacy Gawędzki
Signed-off-by: David S. Miller

Ignacy Gawędzki
2014-10-05 08:49:46 +0800
1e203c1a2 net: sched: suspicious RCU usage in qdisc_watchdog ... Browse Code »
13

Suspicious RCU usage in qdisc_watchdog call needs to be done inside
rcu_read_lock/rcu_read_unlock. And then Qdisc destroy operations
need to ensure timer is cancelled before removing qdisc structure.

[ 3992.191339] ===============================
[ 3992.191340] [ INFO: suspicious RCU usage. ]
[ 3992.191343] 3.17.0-rc6net-next+ #72 Not tainted
[ 3992.191345] -------------------------------
[ 3992.191347] include/net/sch_generic.h:272 suspicious rcu_dereference_check() usage!
[ 3992.191348]
[ 3992.191348] other info that might help us debug this:
[ 3992.191348]
[ 3992.191351]
[ 3992.191351] rcu_scheduler_active = 1, debug_locks = 1
[ 3992.191353] no locks held by swapper/1/0.
[ 3992.191355]
[ 3992.191355] stack backtrace:
[ 3992.191358] CPU: 1 PID: 0 Comm: swapper/1 Not tainted 3.17.0-rc6net-next+ #72
[ 3992.191360] Hardware name: /DZ77RE-75K, BIOS GAZ7711H.86A.0060.2012.1115.1750 11/15/2012
[ 3992.191362] 0000000000000001 ffff880235803e48 ffffffff8178f92c 0000000000000000
[ 3992.191366] ffff8802322224a0 ffff880235803e78 ffffffff810c9966 ffff8800a5fe3000
[ 3992.191370] ffff880235803f30 ffff8802359cd768 ffff8802359cd6e0 ffff880235803e98
[ 3992.191374] Call Trace:
[ 3992.191376] [] dump_stack+0x4e/0x68
[ 3992.191387] [] lockdep_rcu_suspicious+0xe6/0x130
[ 3992.191392] [] qdisc_watchdog+0x8a/0xb0
[ 3992.191396] [] __run_hrtimer+0x72/0x420
[ 3992.191399] [] ? hrtimer_interrupt+0x7d/0x240
[ 3992.191403] [] ? tc_classify+0xc0/0xc0
[ 3992.191406] [] hrtimer_interrupt+0xff/0x240
[ 3992.191410] [] ? __atomic_notifier_call_chain+0x5/0x140
[ 3992.191415] [] local_apic_timer_interrupt+0x3b/0x60
[ 3992.191419] [] smp_apic_timer_interrupt+0x45/0x60
[ 3992.191422] [] apic_timer_interrupt+0x6f/0x80
[ 3992.191424] [] ? cpuidle_enter_state+0x73/0x2e0
[ 3992.191432] [] ? cpuidle_enter_state+0x6e/0x2e0
[ 3992.191437] [] cpuidle_enter+0x17/0x20
[ 3992.191441] [] cpu_startup_entry+0x3d1/0x4a0
[ 3992.191445] [] ? clockevents_config_and_register+0x26/0x30
[ 3992.191448] [] start_secondary+0x1b6/0x260

Fixes: b26b0d1e8b1 ("net: qdisc: use rcu prefix and silence sparse warnings")
Signed-off-by: John Fastabend
Acked-by: Cong Wang
Signed-off-by: David S. Miller

John Fastabend
2014-10-05 08:45:54 +0800

04 Oct, 2014

3 commits

55a93b3ea qdisc: validate skb without holding lock ... Browse Code »
5

Validation of skb can be pretty expensive :

GSO segmentation and/or checksum computations.

We can do this without holding qdisc lock, so that other cpus
can queue additional packets.

Trick is that requeued packets were already validated, so we carry
a boolean so that sch_direct_xmit() can validate a fresh skb list,
or directly use an old one.

Tested on 40Gb NIC (8 TX queues) and 200 concurrent flows, 48 threads
host.

Turning TSO on or off had no effect on throughput, only few more cpu
cycles. Lock contention on qdisc lock disappeared.

Same if disabling TX checksum offload.

Signed-off-by: Eric Dumazet
Signed-off-by: David S. Miller

Eric Dumazet
2014-10-04 06:36:11 +0800
808e7ac0b qdisc: dequeue bulking also pickup GSO/TSO packets ... Browse Code »

The TSO and GSO segmented packets already benefit from bulking
on their own.

The TSO packets have always taken advantage of the only updating
the tailptr once for a large packet.

The GSO segmented packets have recently taken advantage of
bulking xmit_more API, via merge commit 53fda7f7f9e8 ("Merge
branch 'xmit_list'"), specifically via commit 7f2e870f2a4 ("net:
Move main gso loop out of dev_hard_start_xmit() into helper.")
allowing qdisc requeue of remaining list. And via commit
ce93718fb7cd ("net: Don't keep around original SKB when we
software segment GSO frames.").

This patch allow further bulking of TSO/GSO packets together,
when dequeueing from the qdisc.

Testing:
Measuring HoL (Head-of-Line) blocking for TSO and GSO, with
netperf-wrapper. Bulking several TSO show no performance regressions
(requeues were in the area 32 requeues/sec).

Bulking several GSOs does show small regression or very small
improvement (requeues were in the area 8000 requeues/sec).

Using ixgbe 10Gbit/s with GSO bulking, we can measure some additional
latency. Base-case, which is "normal" GSO bulking, sees varying
high-prio queue delay between 0.38ms to 0.47ms. Bulking several GSOs
together, result in a stable high-prio queue delay of 0.50ms.

Using igb at 100Mbit/s with GSO bulking, shows an improvement.
Base-case sees varying high-prio queue delay between 2.23ms to 2.35ms

Signed-off-by: David S. Miller

Jesper Dangaard Brouer
2014-10-04 03:37:06 +0800
5772e9a34 qdisc: bulk dequeue support for qdiscs with TCQ_F_ONETXQUEUE ... Browse Code »
28

Based on DaveM's recent API work on dev_hard_start_xmit(), that allows
sending/processing an entire skb list.

This patch implements qdisc bulk dequeue, by allowing multiple packets
to be dequeued in dequeue_skb().

The optimization principle for this is two fold, (1) to amortize
locking cost and (2) avoid expensive tailptr update for notifying HW.
(1) Several packets are dequeued while holding the qdisc root_lock,
amortizing locking cost over several packet. The dequeued SKB list is
processed under the TXQ lock in dev_hard_start_xmit(), thus also
amortizing the cost of the TXQ lock.
(2) Further more, dev_hard_start_xmit() will utilize the skb->xmit_more
API to delay HW tailptr update, which also reduces the cost per
packet.

One restriction of the new API is that every SKB must belong to the
same TXQ. This patch takes the easy way out, by restricting bulk
dequeue to qdisc's with the TCQ_F_ONETXQUEUE flag, that specifies the
qdisc only have attached a single TXQ.

Some detail about the flow; dev_hard_start_xmit() will process the skb
list, and transmit packets individually towards the driver (see
xmit_one()). In case the driver stops midway in the list, the
remaining skb list is returned by dev_hard_start_xmit(). In
sch_direct_xmit() this returned list is requeued by dev_requeue_skb().

To avoid overshooting the HW limits, which results in requeuing, the
patch limits the amount of bytes dequeued, based on the drivers BQL
limits. In-effect bulking will only happen for BQL enabled drivers.

Small amounts for extra HoL blocking (2x MTU/0.24ms) were
measured at 100Mbit/s, with bulking 8 packets, but the
oscillating nature of the measurement indicate something, like
sched latency might be causing this effect. More comparisons
show, that this oscillation goes away occationally. Thus, we
disregard this artifact completely and remove any "magic" bulking
limit.

For now, as a conservative approach, stop bulking when seeing TSO and
segmented GSO packets. They already benefit from bulking on their own.
A followup patch add this, to allow easier bisect-ability for finding
regressions.

Jointed work with Hannes, Daniel and Florian.

Signed-off-by: Jesper Dangaard Brouer
Signed-off-by: Hannes Frederic Sowa
Signed-off-by: Daniel Borkmann
Signed-off-by: Florian Westphal
Signed-off-by: David S. Miller

Jesper Dangaard Brouer
2014-10-04 03:37:06 +0800

03 Oct, 2014

1 commit

739e4a758 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net ... Browse Code »

Conflicts:
drivers/net/usb/r8152.c
net/netfilter/nfnetlink.c

Both r8152 and nfnetlink conflicts were simple overlapping changes.

Signed-off-by: David S. Miller

David S. Miller
2014-10-03 02:25:43 +0800

02 Oct, 2014

2 commits

a0efb80ce net_sched: avoid calling tcf_unbind_filter() in call_rcu callback ... Browse Code »

This fixes the following crash:

[ 63.976822] general protection fault: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
[ 63.980094] CPU: 1 PID: 15 Comm: ksoftirqd/1 Not tainted 3.17.0-rc6+ #648
[ 63.980094] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
[ 63.980094] task: ffff880117dea690 ti: ffff880117dfc000 task.ti: ffff880117dfc000
[ 63.980094] RIP: 0010:[] [] u32_destroy_key+0x27/0x6d
[ 63.980094] RSP: 0018:ffff880117dffcc0 EFLAGS: 00010202
[ 63.980094] RAX: ffff880117dea690 RBX: ffff8800d02e0820 RCX: 0000000000000000
[ 63.980094] RDX: 0000000000000001 RSI: 0000000000000002 RDI: 6b6b6b6b6b6b6b6b
[ 63.980094] RBP: ffff880117dffcd0 R08: 0000000000000000 R09: 0000000000000000
[ 63.980094] R10: 00006c0900006ba8 R11: 00006ba100006b9d R12: 0000000000000001
[ 63.980094] R13: ffff8800d02e0898 R14: ffffffff817e6d4d R15: ffff880117387a30
[ 63.980094] FS: 0000000000000000(0000) GS:ffff88011a800000(0000) knlGS:0000000000000000
[ 63.980094] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[ 63.980094] CR2: 00007f07e6732fed CR3: 000000011665b000 CR4: 00000000000006e0
[ 63.980094] Stack:
[ 63.980094] ffff88011a9cd300 ffffffff82051ac0 ffff880117dffce0 ffffffff817e6d68
[ 63.980094] ffff880117dffd70 ffffffff810cb4c7 ffffffff810cb3cd ffff880117dfffd8
[ 63.980094] ffff880117dea690 ffff880117dea690 ffff880117dfffd8 000000000000000a
[ 63.980094] Call Trace:
[ 63.980094] [] u32_delete_key_freepf_rcu+0x1b/0x1d
[ 63.980094] [] rcu_process_callbacks+0x3bb/0x691
[ 63.980094] [] ? rcu_process_callbacks+0x2c1/0x691
[ 63.980094] [] ? u32_destroy_key+0x6d/0x6d
[ 63.980094] [] __do_softirq+0x142/0x323
[ 63.980094] [] run_ksoftirqd+0x23/0x53
[ 63.980094] [] smpboot_thread_fn+0x203/0x221
[ 63.980094] [] ? smpboot_unpark_thread+0x33/0x33
[ 63.980094] [] kthread+0xc9/0xd1
[ 63.980094] [] ? do_wait_for_common+0xf8/0x125
[ 63.980094] [] ? __kthread_parkme+0x61/0x61
[ 63.980094] [] ret_from_fork+0x7c/0xb0
[ 63.980094] [] ? __kthread_parkme+0x61/0x61

tp could be freed in call_rcu callback too, the order is not guaranteed.

John Fastabend says:

====================
Its worth noting why this is safe. Any running schedulers will either
read the valid class field or it will be zeroed.

All schedulers today when the class is 0 do a lookup using the
same call used by the tcf_exts_bind(). So even if we have a running
classifier hit the null class pointer it will do a lookup and get
to the same result. This is particularly fragile at the moment because
the only way to verify this is to audit the schedulers call sites.
====================

Cc: John Fastabend
Signed-off-by: Cong Wang
Acked-by: John Fastabend
Signed-off-by: David S. Miller

WANG Cong
2014-10-02 10:00:42 +0800
6e0565697 net_sched: fix another crash in cls_tcindex ... Browse Code »

This patch fixes the following crash:

[ 166.670795] BUG: unable to handle kernel NULL pointer dereference at (null)
[ 166.674230] IP: [] __list_del_entry+0x5c/0x98
[ 166.674230] PGD d0ea5067 PUD ce7fc067 PMD 0
[ 166.674230] Oops: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
[ 166.674230] CPU: 1 PID: 775 Comm: tc Not tainted 3.17.0-rc6+ #642
[ 166.674230] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
[ 166.674230] task: ffff8800d03c4d20 ti: ffff8800cae7c000 task.ti: ffff8800cae7c000
[ 166.674230] RIP: 0010:[] [] __list_del_entry+0x5c/0x98
[ 166.674230] RSP: 0018:ffff8800cae7f7d0 EFLAGS: 00010207
[ 166.674230] RAX: 0000000000000000 RBX: ffff8800cba8d700 RCX: ffff8800cba8d700
[ 166.674230] RDX: 0000000000000000 RSI: dead000000200200 RDI: ffff8800cba8d700
[ 166.674230] RBP: ffff8800cae7f7d0 R08: 0000000000000001 R09: 0000000000000001
[ 166.674230] R10: 0000000000000000 R11: 000000000000859a R12: ffffffffffffffe8
[ 166.674230] R13: ffff8800cba8c5b8 R14: 0000000000000001 R15: ffff8800cba8d700
[ 166.674230] FS: 00007fdb5f04a740(0000) GS:ffff88011a800000(0000) knlGS:0000000000000000
[ 166.674230] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[ 166.674230] CR2: 0000000000000000 CR3: 00000000cf929000 CR4: 00000000000006e0
[ 166.674230] Stack:
[ 166.674230] ffff8800cae7f7e8 ffffffff814b73e8 ffff8800cba8d6e8 ffff8800cae7f828
[ 166.674230] ffffffff817caeec 0000000000000046 ffff8800cba8c5b0 ffff8800cba8c5b8
[ 166.674230] 0000000000000000 0000000000000001 ffff8800cf8e33e8 ffff8800cae7f848
[ 166.674230] Call Trace:
[ 166.674230] [] list_del+0xd/0x2b
[ 166.674230] [] tcf_action_destroy+0x4c/0x71
[ 166.674230] [] tcf_exts_destroy+0x20/0x2d
[ 166.674230] [] tcindex_delete+0x196/0x1b7

struct list_head can not be simply copied and we should always init it.

Cc: John Fastabend
Signed-off-by: Cong Wang
Acked-by: John Fastabend
Signed-off-by: David S. Miller

WANG Cong
2014-10-02 10:00:42 +0800

30 Sep, 2014

5 commits

b0ab6f927 net: sched: enable per cpu qstats ... Browse Code »
5

After previous patches to simplify qstats the qstats can be
made per cpu with a packed union in Qdisc struct.

Signed-off-by: John Fastabend
Signed-off-by: David S. Miller

John Fastabend
2014-09-30 13:02:26 +0800
640158536 net: sched: restrict use of qstats qlen ... Browse Code »
2

This removes the use of qstats->qlen variable from the classifiers
and makes it an explicit argument to gnet_stats_copy_queue().

The qlen represents the qdisc queue length and is packed into
the qstats at the last moment before passnig to user space. By
handling it explicitely we avoid, in the percpu stats case, having
to figure out which per_cpu variable to put it in.

It would probably be best to remove it from qstats completely
but qstats is a user space ABI and can't be broken. A future
patch could make an internal only qstats structure that would
avoid having to allocate an additional u32 variable on the
Qdisc struct. This would make the qstats struct 128bits instead
of 128+32.

Signed-off-by: John Fastabend
Signed-off-by: David S. Miller

John Fastabend
2014-09-30 13:02:26 +0800
25331d6ce net: sched: implement qstat helper routines ... Browse Code »
18

This adds helpers to manipulate qstats logic and replaces locations
that touch the counters directly. This simplifies future patches
to push qstats onto per cpu counters.

Signed-off-by: John Fastabend
Signed-off-by: David S. Miller

John Fastabend
2014-09-30 13:02:26 +0800
22e0f8b93 net: sched: make bstats per cpu and estimator RCU safe ... Browse Code »
41

In order to run qdisc's without locking statistics and estimators
need to be handled correctly.

To resolve bstats make the statistics per cpu. And because this is
only needed for qdiscs that are running without locks which is not
the case for most qdiscs in the near future only create percpu
stats when qdiscs set the TCQ_F_CPUSTATS flag.

Next because estimators use the bstats to calculate packets per
second and bytes per second the estimator code paths are updated
to use the per cpu statistics.

Signed-off-by: John Fastabend
Signed-off-by: David S. Miller

John Fastabend
2014-09-30 13:02:26 +0800
17c9c8232 ematch: Fix matching of inverted containers. ... Browse Code »
13

Negated expressions and sub-expressions need to have their flags checked for
TCF_EM_INVERT and their result negated accordingly.

Signed-off-by: Ignacy Gawędzki
Signed-off-by: David S. Miller

Ignacy Gawędzki
2014-09-30 03:31:29 +0800

29 Sep, 2014

4 commits

53dfd5018 net: sched: cls_rcvp, complete rcu conversion ... Browse Code »

This completes the cls_rsvp conversion to RCU safe
copy, update semantics.

As a result all cases of tcf_exts_change occur on
empty lists now.

Signed-off-by: John Fastabend
Signed-off-by: David S. Miller

John Fastabend
2014-09-29 12:04:55 +0800
68f6a7c6c net_sched: fix another regression in cls_tcindex ... Browse Code »

Clearly the following change is not expected:

- if (!cp.perfect && !cp.h)
- cp.alloc_hash = cp.hash;
+ if (!cp->perfect && cp->h)
+ cp->alloc_hash = cp->hash;

Fixes: commit 331b72922c5f58d48fd ("net: sched: RCU cls_tcindex")
Cc: John Fastabend
Signed-off-by: Cong Wang
Acked-by: John Fastabend
Signed-off-by: David S. Miller

WANG Cong
2014-09-29 05:34:35 +0800
02c5e8441 net_sched: fix errno in tcindex_set_parms() ... Browse Code »

When kmemdup() fails, we should return -ENOMEM.

Cc: John Fastabend
Signed-off-by: Cong Wang
Acked-by: John Fastabend
Signed-off-by: David S. Miller

WANG Cong
2014-09-29 05:34:22 +0800
18d0264f6 net_sched: remove the first parameter from tcf_exts_destroy() ... Browse Code »

Cc: Jamal Hadi Salim
Signed-off-by: Cong Wang
Acked-by: Jamal Hadi Salim
Signed-off-by: David S. Miller

WANG Cong
2014-09-29 05:29:01 +0800

26 Sep, 2014

1 commit

4a8e320c9 net: sched: use pinned timers ... Browse Code »

While using a MQ + NETEM setup, I had confirmation that the default
timer migration ( /proc/sys/kernel/timer_migration ) is killing us.

Installing this on a receiver side of a TCP_STREAM test, (NIC has 8 TX
queues) :

EST="est 1sec 4sec"
for ETH in eth1
do
tc qd del dev $ETH root 2>/dev/null
tc qd add dev $ETH root handle 1: mq
tc qd add dev $ETH parent 1:1 $EST netem limit 70000 delay 6ms
tc qd add dev $ETH parent 1:2 $EST netem limit 70000 delay 8ms
tc qd add dev $ETH parent 1:3 $EST netem limit 70000 delay 10ms
tc qd add dev $ETH parent 1:4 $EST netem limit 70000 delay 12ms
tc qd add dev $ETH parent 1:5 $EST netem limit 70000 delay 14ms
tc qd add dev $ETH parent 1:6 $EST netem limit 70000 delay 16ms
tc qd add dev $ETH parent 1:7 $EST netem limit 80000 delay 18ms
tc qd add dev $ETH parent 1:8 $EST netem limit 90000 delay 20ms
done

We can see that timers get migrated into a single cpu, presumably idle
at the time timers are set up.
Then all qdisc dequeues run from this cpu and huge lock contention
happens. This single cpu is stuck in softirq mode and cannot dequeue
fast enough.

39.24% [kernel] [k] _raw_spin_lock
2.65% [kernel] [k] netem_enqueue
1.80% [kernel] [k] netem_dequeue
1.63% [kernel] [k] copy_user_enhanced_fast_string
1.45% [kernel] [k] _raw_spin_lock_bh

By pinning qdisc timers on the cpu running the qdisc, we respect proper
XPS setting and remove this lock contention.

5.84% [kernel] [k] netem_enqueue
4.83% [kernel] [k] _raw_spin_lock
2.92% [kernel] [k] copy_user_enhanced_fast_string

Current Qdiscs that benefit from this change are :

netem, cbq, fq, hfsc, tbf, htb.

Signed-off-by: Eric Dumazet
Signed-off-by: David S. Miller

Eric Dumazet
2014-09-26 12:26:48 +0800

24 Sep, 2014

1 commit

1f6d80358 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net ... Browse Code »

Conflicts:
arch/mips/net/bpf_jit.c
drivers/net/can/flexcan.c

Both the flexcan and MIPS bpf_jit conflicts were cases of simple
overlapping changes.

Signed-off-by: David S. Miller

David S. Miller
2014-09-24 00:09:27 +0800

23 Sep, 2014

4 commits

a2aeb02a8 net: sched: fix compile warning in cls_u32 ... Browse Code »

$ grep CONFIG_CLS_U32_MARK .config
# CONFIG_CLS_U32_MARK is not set

net/sched/cls_u32.c: In function 'u32_change':
net/sched/cls_u32.c:852:1: warning: label 'errout' defined but not used
[-Wunused-label]

Signed-off-by: Eric Dumazet
Signed-off-by: David S. Miller

Eric Dumazet
2014-09-23 04:47:19 +0800
de5df6322 net: sched: cls_u32 changes to knode must appear atomic to readers ... Browse Code »

Changes to the cls_u32 classifier must appear atomic to the
readers. Before this patch if a change is requested for both
the exts and ifindex, first the ifindex is updated then the
exts with tcf_exts_change(). This opens a small window where
a reader can have a exts chain with an incorrect ifindex. This
violates the the RCU semantics.

Here we resolve this by always passing u32_set_parms() a copy
of the tc_u_knode to work on and then inserting it into the hash
table after the updates have been successfully applied.

Tested with the following short script:

#tc filter add dev p3p2 parent 8001:0 protocol ip prio 99 handle 1: \
u32 divisor 256

#tc filter add dev p3p2 parent 8001:0 protocol ip prio 99 \
u32 link 1: hashkey mask ffffff00 at 12 \
match ip src 192.168.8.0/2

#tc filter add dev p3p2 parent 8001:0 protocol ip prio 102 \
handle 1::10 u32 classid 1:2 ht 1: \
match ip src 192.168.8.0/8 match ip tos 0x0a 1e

#tc filter change dev p3p2 parent 8001:0 protocol ip prio 102 \
handle 1::10 u32 classid 1:2 ht 1: \
match ip src 1.1.0.0/8 match ip tos 0x0b 1e

CC: Eric Dumazet
CC: Jamal Hadi Salim
Signed-off-by: John Fastabend
Acked-by: Eric Dumazet
Signed-off-by: David S. Miller

John Fastabend
2014-09-23 03:59:21 +0800
a1ddcfee2 net: cls_u32: fix missed pcpu_success free_percpu ... Browse Code »

This fixes a missed free_percpu in the unwind code path and when
keys are destroyed.

Signed-off-by: John Fastabend
Acked-by: Eric Dumazet
Signed-off-by: David S. Miller

John Fastabend
2014-09-23 03:59:21 +0800
257117862 net: sched: shrink struct qdisc_skb_cb to 28 bytes ... Browse Code »

We cannot make struct qdisc_skb_cb bigger without impacting IPoIB,
or increasing skb->cb[] size.

Commit e0f31d849867 ("flow_keys: Record IP layer protocol in
skb_flow_dissect()") broke IPoIB.

Only current offender is sch_choke, and this one do not need an
absolutely precise flow key.

If we store 17 bytes of flow key, its more than enough. (Its the actual
size of flow_keys if it was a packed structure, but we might add new
fields at the end of it later)

Signed-off-by: Eric Dumazet
Fixes: e0f31d849867 ("flow_keys: Record IP layer protocol in skb_flow_dissect()")
Signed-off-by: David S. Miller

Eric Dumazet
2014-09-23 02:21:47 +0800

20 Sep, 2014

2 commits

4e2840eee net: sched: cls_u32: rcu can not be last node ... Browse Code »

tc_u32_sel 'sel' in tc_u_knode expects to be the last element in the
structure and pads the structure with tc_u32_key fields for each key.

kzalloc(sizeof(*n) + s->nkeys*sizeof(struct tc_u32_key), GFP_KERNEL)

CC: Eric Dumazet
Signed-off-by: John Fastabend
Acked-by: Eric Dumazet
Signed-off-by: David S. Miller

John Fastabend
2014-09-20 05:05:45 +0800
ab34f6480 net: sched: use __skb_queue_head_init() where applicable ... Browse Code »

pfifo_fast and htb use skb lists, without needing their spinlocks.
(They instead use the standard qdisc lock)

We can use __skb_queue_head_init() instead of skb_queue_head_init()
to be consistent.

Signed-off-by: Eric Dumazet
Signed-off-by: David S. Miller

Eric Dumazet
2014-09-20 04:32:10 +0800

17 Sep, 2014

4 commits

9f6c38e70 net: sched: cls_cgroup need tcf_exts_init in all cases ... Browse Code »

This ensures the tcf_exts_init() is called for all cases.

Fixes: 952313bd62589cae216a57 ("net: sched: cls_cgroup use RCU")
Signed-off-by: John Fastabend
Acked-by: Cong Wang
Signed-off-by: David S. Miller

John Fastabend
2014-09-17 04:26:39 +0800
e1f93eb06 net: sched: cls_fw: add missing tcf_exts_init call in fw_change() ... Browse Code »

When allocating a new structure we also need to call tcf_exts_init
to initialize exts.

A follow up patch might be in order to remove some of this code
and do tcf_exts_assign(). With this we could remove the
tcf_exts_init/tcf_exts_change pattern for some of the classifiers.
As part of the future tcf_actions RCU series this will need to be
done. For now fix the call here.

Fixes e35a8ee5993ba81fd6c0 ("net: sched: fw use RCU")
Signed-off-by: John Fastabend
Acked-by: Cong Wang
Signed-off-by: David S. Miller

John Fastabend
2014-09-17 03:59:36 +0800
d14cbfc88 net: sched: cls_cgroup fix possible memory leak of 'new' ... Browse Code »

tree: git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next.git master
head: 54996b529ab70ca1d6f40677cd2698c4f7127e87
commit: c7953ef23042b7c4fc2be5ecdd216aacff6df5eb [625/646] net: sched: cls_cgroup use RCU

net/sched/cls_cgroup.c:130 cls_cgroup_change() warn: possible memory leak of 'new'
net/sched/cls_cgroup.c:135 cls_cgroup_change() warn: possible memory leak of 'new'
net/sched/cls_cgroup.c:139 cls_cgroup_change() warn: possible memory leak of 'new'

Fixes: c7953ef23042b7c4fc2be5ecdd216aac ("net: sched: cls_cgroup use RCU")
Reported-by: Dan Carpenter
Signed-off-by: John Fastabend
Acked-by: Cong Wang
Signed-off-by: David S. Miller

John Fastabend
2014-09-17 03:59:36 +0800
a96366bf2 net: sched: cls_u32 add missing rcu_assign_pointer and annotation ... Browse Code »

Add missing rcu_assign_pointer and missing annotation for ht_up
in cls_u32.c

Caught by kbuild bot,

>> net/sched/cls_u32.c:378:36: sparse: incorrect type in initializer (different address spaces)
net/sched/cls_u32.c:378:36: expected struct tc_u_hnode *ht
net/sched/cls_u32.c:378:36: got struct tc_u_hnode [noderef] *ht_up
>> net/sched/cls_u32.c:610:54: sparse: incorrect type in argument 4 (different address spaces)
net/sched/cls_u32.c:610:54: expected struct tc_u_hnode *ht
net/sched/cls_u32.c:610:54: got struct tc_u_hnode [noderef] *ht_up
>> net/sched/cls_u32.c:684:18: sparse: incorrect type in assignment (different address spaces)
net/sched/cls_u32.c:684:18: expected struct tc_u_hnode [noderef] *ht_up
net/sched/cls_u32.c:684:18: got struct tc_u_hnode *[assigned] ht
>> net/sched/cls_u32.c:359:18: sparse: dereference of noderef expression

Signed-off-by: John Fastabend
Signed-off-by: David S. Miller

John Fastabend
2014-09-17 03:59:36 +0800