06 Jan, 2016

1 commit

  • When a qdisc is using per cpu stats (currently just the ingress
    qdisc), only the bstats are being freed. This also frees the qstats.
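
    A minimal sketch of the idea (hedged: the field names follow the per
    cpu stats work, but this is not the exact upstream diff):

    static void qdisc_free_percpu_stats(struct Qdisc *qdisc)
    {
            /* release both per-cpu areas, not just the byte/packet stats */
            free_percpu(qdisc->cpu_bstats);
            free_percpu(qdisc->cpu_qstats);
    }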

    Fixes: b0ab6f92752b9f9d8 ("net: sched: enable per cpu qstats")
    Signed-off-by: John Fastabend
    Acked-by: Eric Dumazet
    Acked-by: Daniel Borkmann
    Signed-off-by: David S. Miller

    John Fastabend
     

04 Dec, 2015

1 commit

  • qdisc_tree_decrease_qlen() suffers from two problems on multiqueue
    devices.

    One problem is that it updates sch->q.qlen and sch->qstats.drops
    on the mq/mqprio root qdisc, while it should not: Daniele
    reported underflow errors:
    [ 681.774821] PAX: sch->q.qlen: 0 n: 1
    [ 681.774825] PAX: size overflow detected in function qdisc_tree_decrease_qlen net/sched/sch_api.c:769 cicus.693_49 min, count: 72, decl: qlen; num: 0; context: sk_buff_head;
    [ 681.774954] CPU: 2 PID: 19 Comm: ksoftirqd/2 Tainted: G O 4.2.6.201511282239-1-grsec #1
    [ 681.774955] Hardware name: ASUSTeK COMPUTER INC. X302LJ/X302LJ, BIOS X302LJ.202 03/05/2015
    [ 681.774956] ffffffffa9a04863 0000000000000000 0000000000000000 ffffffffa990ff7c
    [ 681.774959] ffffc90000d3bc38 ffffffffa95d2810 0000000000000007 ffffffffa991002b
    [ 681.774960] ffffc90000d3bc68 ffffffffa91a44f4 0000000000000001 0000000000000001
    [ 681.774962] Call Trace:
    [ 681.774967] [] dump_stack+0x4c/0x7f
    [ 681.774970] [] report_size_overflow+0x34/0x50
    [ 681.774972] [] qdisc_tree_decrease_qlen+0x152/0x160
    [ 681.774976] [] fq_codel_dequeue+0x7b1/0x820 [sch_fq_codel]
    [ 681.774978] [] ? qdisc_peek_dequeued+0xa0/0xa0 [sch_fq_codel]
    [ 681.774980] [] __qdisc_run+0x4d/0x1d0
    [ 681.774983] [] net_tx_action+0xc2/0x160
    [ 681.774985] [] __do_softirq+0xf1/0x200
    [ 681.774987] [] run_ksoftirqd+0x1e/0x30
    [ 681.774989] [] smpboot_thread_fn+0x150/0x260
    [ 681.774991] [] ? sort_range+0x40/0x40
    [ 681.774992] [] kthread+0xe4/0x100
    [ 681.774994] [] ? kthread_worker_fn+0x170/0x170
    [ 681.774995] [] ret_from_fork+0x3e/0x70

    mq/mqprio have their own ways to report qlen/drops by folding stats on
    all their queues, with appropriate locking.

    A second problem is that qdisc_tree_decrease_qlen() calls qdisc_lookup()
    without proper locking: concurrent qdisc updates could corrupt the list
    that qdisc_match_from_root() parses to find a qdisc given its handle.

    Fix the first problem by adding a TCQ_F_NOPARENT qdisc flag that
    qdisc_tree_decrease_qlen() can use to abort its tree traversal as
    soon as it meets an mq/mqprio qdisc child.

    The second problem can be fixed with RCU protection: qdiscs are
    already freed after an RCU grace period, so qdisc_list_add() and
    qdisc_list_del() simply have to use the appropriate RCU list
    variants.

    A future patch will add a per struct netdev_queue list anchor, so that
    qdisc_tree_decrease_qlen() can have more efficient lookups.
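
    A simplified sketch of both ideas (illustrative only; the traversal
    logic is condensed from the real code):

    /* qdisc_tree_decrease_qlen(): stop before touching an mq/mqprio root */
    while ((parentid = sch->parent)) {
            if (sch->flags & TCQ_F_NOPARENT)
                    break;          /* parent is mq/mqprio, leave it alone */
            sch = qdisc_lookup(qdisc_dev(sch), TC_H_MAJ(parentid));
            if (sch == NULL)
                    break;
            sch->q.qlen -= n;
    }

    /* qdisc_list_add()/qdisc_list_del(): RCU-safe list handling so that
     * lockless readers can walk the list during concurrent updates */
    list_add_tail_rcu(&q->list, &root->list);
    list_del_rcu(&q->list);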

    Reported-by: Daniele Fucini
    Signed-off-by: Eric Dumazet
    Cc: Cong Wang
    Cc: Jamal Hadi Salim
    Signed-off-by: David S. Miller

    Eric Dumazet
     

10 Oct, 2014

1 commit

  • Restore the quota fairness between qdiscs that we broke with commit
    5772e9a346 ("qdisc: bulk dequeue support for qdiscs with TCQ_F_ONETXQUEUE").

    Before that commit, the quota in __qdisc_run() was in packets, as
    dequeue_skb() would only dequeue a single packet; that assumption
    broke with bulk dequeue.

    We choose not to account for the number of packets inside TSO/GSO
    packets (accessible via "skb_gso_segs"), as the previous fairness
    also had this "defect". Thus, a GSO/TSO packet counts as a single
    packet.

    Furthermore, we choose to slack on accuracy by allowing a bulk
    dequeue in try_bulk_dequeue_skb() to exceed the "packets" limit,
    limited only by the BQL byte limit. This is done because BQL prefers
    to get its full budget for appropriate feedback from TX completion.

    In the future, we might consider reworking this further and, if it
    allows, switch to a time-based model, as suggested by Eric. Right
    now, we only restore the old semantics.
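
    A condensed sketch of the restored accounting (close to, but not
    necessarily identical to, the merged code):

    void __qdisc_run(struct Qdisc *q)
    {
            int quota = weight_p;           /* quota counted in packets */
            int packets;

            while (qdisc_restart(q, &packets)) {
                    quota -= packets;       /* bulk dequeue may return > 1 */
                    if (quota <= 0 || need_resched()) {
                            __netif_schedule(q);
                            break;
                    }
            }
            qdisc_run_end(q);
    }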

    Joint work with Eric, Hannes, Daniel and Jesper. Hannes wrote the
    first patch in cooperation with Daniel and Jesper. Eric rewrote the
    patch.

    Fixes: 5772e9a346 ("qdisc: bulk dequeue support for qdiscs with TCQ_F_ONETXQUEUE")
    Signed-off-by: Eric Dumazet
    Signed-off-by: Jesper Dangaard Brouer
    Signed-off-by: Hannes Frederic Sowa
    Signed-off-by: Daniel Borkmann
    Signed-off-by: David S. Miller

    Jesper Dangaard Brouer
     

08 Oct, 2014

1 commit

  • Testing xmit_more support with netperf and connected UDP sockets,
    I found strange dst refcount false sharing.

    Current handling of IFF_XMIT_DST_RELEASE is not optimal.

    Dropping the dst in validate_xmit_skb() is certainly too late in case
    the packet was queued by CPU X but dequeued by CPU Y.

    The logical point to take care of drop/force is in __dev_queue_xmit(),
    before even taking the qdisc lock.

    As Julian Anastasov pointed out, the need for skb_dst() might come from
    some packet schedulers or classifiers.

    This patch adds a new helper to cleanly express the needs of various
    drivers or qdiscs/classifiers.

    Drivers that need skb_dst() in their ndo_start_xmit() should call the
    following helper in their setup instead of the prior:

    dev->priv_flags &= ~IFF_XMIT_DST_RELEASE;
    ->
    netif_keep_dst(dev);

    Instead of using a single bit, we use two bits: one that can
    eventually be rebuilt in bonding/team drivers.

    The other one is permanent and blocks IFF_XMIT_DST_RELEASE from being
    rebuilt in bonding/team. Eventually, we could add something
    smarter later.
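
    The helper is essentially a one-liner (a sketch, assuming the two-bit
    scheme described above with the IFF_XMIT_DST_RELEASE_PERM companion
    flag):

    static inline void netif_keep_dst(struct net_device *dev)
    {
            /* clear both the rebuildable and the permanent release bits */
            dev->priv_flags &= ~(IFF_XMIT_DST_RELEASE | IFF_XMIT_DST_RELEASE_PERM);
    }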

    Signed-off-by: Eric Dumazet
    Cc: Julian Anastasov
    Signed-off-by: David S. Miller

    Eric Dumazet
     

04 Oct, 2014

3 commits

  • Validation of an skb can be pretty expensive:

    GSO segmentation and/or checksum computations.

    We can do this without holding the qdisc lock, so that other CPUs
    can queue additional packets.

    The trick is that requeued packets were already validated, so we carry
    a boolean so that sch_direct_xmit() can either validate a fresh skb
    list or directly use an old one.

    Tested on a 40Gb NIC (8 TX queues) with 200 concurrent flows on a
    48-thread host.

    Turning TSO on or off had no effect on throughput, only a few more CPU
    cycles. Lock contention on the qdisc lock disappeared.

    The same holds when disabling TX checksum offload.
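
    Where the boolean ends up (an abridged sketch of sch_direct_xmit(),
    whose signature gains the extra argument):

    /* int sch_direct_xmit(struct sk_buff *skb, struct Qdisc *q,
     *                     struct net_device *dev, struct netdev_queue *txq,
     *                     spinlock_t *root_lock, bool validate) */

    spin_unlock(root_lock);         /* drop the qdisc lock early */
    if (validate)                   /* requeued lists were already validated */
            skb = validate_xmit_skb_list(skb, dev);
    /* ... dev_hard_start_xmit() is then called under the txq lock only,
     * and root_lock is re-taken afterwards */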

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     
  • TSO and GSO segmented packets already benefit from bulking
    on their own.

    TSO packets have always taken advantage of updating the tail pointer
    only once for a large packet.

    GSO segmented packets have recently taken advantage of the bulking
    xmit_more API, via merge commit 53fda7f7f9e8 ("Merge
    branch 'xmit_list'"), specifically via commit 7f2e870f2a4 ("net:
    Move main gso loop out of dev_hard_start_xmit() into helper."),
    which allows qdisc requeue of the remaining list, and via commit
    ce93718fb7cd ("net: Don't keep around original SKB when we
    software segment GSO frames.").

    This patch allows further bulking of TSO/GSO packets together
    when dequeueing from the qdisc.

    Testing:
    Measuring HoL (Head-of-Line) blocking for TSO and GSO with
    netperf-wrapper. Bulking several TSO packets shows no performance
    regressions (requeues were in the area of 32 requeues/sec).

    Bulking several GSOs shows a small regression or a very small
    improvement (requeues were in the area of 8000 requeues/sec).

    Using ixgbe at 10Gbit/s with GSO bulking, we can measure some
    additional latency. The base case, which is "normal" GSO bulking, sees
    varying high-prio queue delay between 0.38ms and 0.47ms. Bulking
    several GSOs together results in a stable high-prio queue delay of
    0.50ms.

    Using igb at 100Mbit/s with GSO bulking shows an improvement.
    The base case sees varying high-prio queue delay between 2.23ms and 2.35ms

    Signed-off-by: David S. Miller

    Jesper Dangaard Brouer
     
  • Based on DaveM's recent API work on dev_hard_start_xmit(), which allows
    sending/processing an entire skb list.

    This patch implements qdisc bulk dequeue, by allowing multiple packets
    to be dequeued in dequeue_skb().

    The optimization principle for this is twofold: (1) amortize the
    locking cost and (2) avoid the expensive tail pointer update for
    notifying HW.
    (1) Several packets are dequeued while holding the qdisc root_lock,
    amortizing the locking cost over several packets. The dequeued SKB
    list is processed under the TXQ lock in dev_hard_start_xmit(), thus
    also amortizing the cost of the TXQ lock.
    (2) Furthermore, dev_hard_start_xmit() will utilize the skb->xmit_more
    API to delay the HW tail pointer update, which also reduces the cost
    per packet.

    One restriction of the new API is that every SKB must belong to the
    same TXQ. This patch takes the easy way out by restricting bulk
    dequeue to qdiscs with the TCQ_F_ONETXQUEUE flag, which specifies that
    the qdisc has only a single TXQ attached.

    Some detail about the flow: dev_hard_start_xmit() processes the skb
    list and transmits packets individually towards the driver (see
    xmit_one()). In case the driver stops midway through the list, the
    remaining skb list is returned by dev_hard_start_xmit(). In
    sch_direct_xmit() this returned list is requeued by dev_requeue_skb().

    To avoid overshooting the HW limits, which results in requeuing, the
    patch limits the number of bytes dequeued based on the driver's BQL
    limits. In effect, bulking will only happen for BQL-enabled drivers.

    Small amounts of extra HoL blocking (2x MTU/0.24ms) were
    measured at 100Mbit/s when bulking 8 packets, but the
    oscillating nature of the measurement indicates that something
    like scheduler latency might be causing this effect. More comparisons
    show that this oscillation goes away occasionally. Thus, we
    disregard this artifact completely and remove any "magic" bulking
    limit.

    For now, as a conservative approach, stop bulking when seeing TSO and
    segmented GSO packets. They already benefit from bulking on their own.
    A followup patch adds this, to allow easier bisect-ability for finding
    regressions.
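
    A condensed sketch of the bulk dequeue loop (simplified from the
    dequeue_skb()/try_bulk_dequeue_skb() pair this patch introduces):

    static void try_bulk_dequeue_skb(struct Qdisc *q, struct sk_buff *skb,
                                     const struct netdev_queue *txq)
    {
            /* budget comes from BQL; without BQL there is no bulking */
            int bytelimit = qdisc_avail_bulklimit(txq) - skb->len;

            while (bytelimit > 0) {
                    struct sk_buff *nskb = q->dequeue(q);

                    if (!nskb)
                            break;
                    bytelimit -= nskb->len;         /* covers GSO len too */
                    skb->next = nskb;               /* chain onto the list */
                    skb = nskb;
            }
            skb->next = NULL;
    }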

    Joint work with Hannes, Daniel and Florian.

    Signed-off-by: Jesper Dangaard Brouer
    Signed-off-by: Hannes Frederic Sowa
    Signed-off-by: Daniel Borkmann
    Signed-off-by: Florian Westphal
    Signed-off-by: David S. Miller

    Jesper Dangaard Brouer
     

30 Sep, 2014

1 commit

  • In order to run qdiscs without locking, statistics and estimators
    need to be handled correctly.

    To resolve bstats, make the statistics per-CPU. And because this is
    only needed for qdiscs that run without locks, which will not be
    the case for most qdiscs in the near future, only create per-CPU
    stats when a qdisc sets the TCQ_F_CPUSTATS flag.

    Next, because estimators use the bstats to calculate packets per
    second and bytes per second, the estimator code paths are updated
    to use the per-CPU statistics.
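
    A small sketch of the opt-in allocation (hedged: the flag is the one
    named above, but the allocation site shown here is condensed):

    if (sch->flags & TCQ_F_CPUSTATS) {
            /* per-cpu byte/packet counters, folded when stats are dumped */
            sch->cpu_bstats = alloc_percpu(struct gnet_stats_basic_cpu);
            if (!sch->cpu_bstats)
                    return -ENOMEM;
    }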

    Signed-off-by: John Fastabend
    Signed-off-by: David S. Miller

    John Fastabend
     

14 Sep, 2014

1 commit

  • Add __rcu notation to qdisc handling; by doing this we can make
    smatch output more legible. And anyway, some of the cases should
    be using rcu_dereference(); see qdisc_all_tx_empty(),
    qdisc_tx_changing(), and so on.

    Also, the *wake_queue() API is commonly called from driver timer
    routines without the RCU lock or rtnl lock, so I added rcu_read_lock()
    blocks around netif_wake_subqueue and netif_tx_wake_queue.
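
    The wake-path change looks roughly like this (a sketch; the real
    function differs slightly):

    void netif_wake_subqueue(struct net_device *dev, u16 queue_index)
    {
            struct netdev_queue *txq = netdev_get_tx_queue(dev, queue_index);

            if (test_and_clear_bit(__QUEUE_STATE_DRV_XOFF, &txq->state)) {
                    struct Qdisc *q;

                    rcu_read_lock();        /* callers may hold no RCU/rtnl lock */
                    q = rcu_dereference(txq->qdisc);
                    __netif_schedule(q);
                    rcu_read_unlock();
            }
    }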

    Signed-off-by: John Fastabend
    Acked-by: Eric Dumazet
    Signed-off-by: David S. Miller

    John Fastabend
     

04 Sep, 2014

1 commit

  • More minor fixes to merge commit 53fda7f7f9e ("Merge branch 'xmit_list'"),
    which allows us to work with a list of SKBs.

    Fix the exit cases in qdisc_reset() and qdisc_destroy(), where a
    leftover requeued SKB (qdisc->gso_skb) has the potential of
    being an skb list; thus use kfree_skb_list().

    This is a followup to commit 10770bc2d1 ("qdisc: adjustments for
    API allowing skb list xmits").
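
    Sketch of the reset-path change (simplified; qdisc_destroy() gets the
    same kfree_skb_list() treatment):

    void qdisc_reset(struct Qdisc *qdisc)
    {
            const struct Qdisc_ops *ops = qdisc->ops;

            if (ops->reset)
                    ops->reset(qdisc);

            if (qdisc->gso_skb) {
                    kfree_skb_list(qdisc->gso_skb);   /* gso_skb may be a list */
                    qdisc->gso_skb = NULL;
                    qdisc->q.qlen = 0;
            }
    }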

    Signed-off-by: Jesper Dangaard Brouer
    Signed-off-by: David S. Miller

    Jesper Dangaard Brouer
     

02 Jul, 2014

1 commit

  • In commit 371121057607e3127e19b3fa094330181b5b031e ("net:
    QDISC_STATE_RUNNING dont need atomic bit ops"),
    __QDISC_STATE_RUNNING was renamed to __QDISC___STATE_RUNNING,
    but the old name still appears in some comments and was not
    replaced with the new name completely.

    Signed-off-by: Ying Xue
    Signed-off-by: David S. Miller

    Ying Xue
     

01 Apr, 2014

1 commit

  • This allows monitoring carrier on/off transitions and detecting link
    flapping issues:
    - new /sys/class/net/X/carrier_changes
    - new rtnetlink IFLA_CARRIER_CHANGES (getlink)
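
    The counter itself is a simple atomic bump in the carrier helpers
    (sketch of the netif_carrier_on() side; netif_carrier_off() mirrors it):

    void netif_carrier_on(struct net_device *dev)
    {
            if (test_and_clear_bit(__LINK_STATE_NOCARRIER, &dev->state)) {
                    if (dev->reg_state == NETREG_UNINITIALIZED)
                            return;
                    atomic_inc(&dev->carrier_changes);  /* exposed via sysfs/netlink */
                    linkwatch_fire_event(dev);
                    if (netif_running(dev))
                            __netdev_watchdog_up(dev);
            }
    }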

    Tested:
    - grep . /sys/class/net/*/carrier_changes
    + ip link set dev X down/up
    + plug/unplug cable
    - updated iproute2: prints IFLA_CARRIER_CHANGES
    - iproute2 20121211-2 (debian): unchanged behavior

    Signed-off-by: David Decotigny
    Signed-off-by: David S. Miller

    david decotigny
     

11 Jan, 2014

1 commit

  • Currently, the tx queue is selected implicitly in ndo_dfwd_start_xmit(). This
    causes several issues:

    - NETIF_F_LLTX was removed for macvlan, so txq locking is done for macvlan
    instead of the lower device, which misses the necessary txq synchronization
    for the lower device, such as txq stopping or freezing required by the dev
    watchdog or the control path.
    - dev_hard_start_xmit() was called with a NULL txq, which bypasses the net
    device watchdog.
    - dev_hard_start_xmit() does not check txq everywhere, which will lead to a
    crash when TSO is disabled for the lower device.

    Fix this by explicitly introducing a new parameter to .ndo_select_queue() just
    for selecting queues in the case of l2 forwarding offload. netdev_pick_tx() was
    also extended to accept this parameter, and dev_queue_xmit_accel() was used to
    do the l2 forwarding transmission.

    With these fixes, NETIF_F_LLTX can be preserved for macvlan and there's no need
    to check txq against NULL in dev_hard_start_xmit(). Also, there's no need to
    keep a dedicated ndo_dfwd_start_xmit(); we can just reuse the code of
    dev_queue_xmit() to do the transmission.

    In the future, this is also required for macvtap l2 forwarding support, since
    it provides a necessary synchronization method.
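
    For reference, the shape of the changed hooks (a sketch; later kernels
    add a fallback argument on top of this):

    /* .ndo_select_queue() gains an accel_priv argument so l2 forwarding
     * offload can pick a queue on the lower device explicitly */
    u16 (*ndo_select_queue)(struct net_device *dev, struct sk_buff *skb,
                            void *accel_priv);

    /* the macvlan-style transmit then goes through the regular xmit path */
    int dev_queue_xmit_accel(struct sk_buff *skb, void *accel_priv);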

    Cc: John Fastabend
    Cc: Neil Horman
    Cc: e1000-devel@lists.sourceforge.net
    Signed-off-by: Jason Wang
    Acked-by: Neil Horman
    Acked-by: John Fastabend
    Signed-off-by: David S. Miller

    Jason Wang
     

14 Dec, 2013

1 commit

  • After commit 95dc19299f74 ("pkt_sched: give visibility to mq slave
    qdiscs") we call qdisc_list_add() while the device qdisc might be
    the noop_qdisc one.

    This shows up as duplicates in "tc qdisc show", as all inactive devices
    point to noop_qdisc.

    Fix this by setting dev->qdisc to the new qdisc before calling
    ops->change() in attach_default_qdiscs().

    Add a WARN_ON_ONCE() to catch any future similar problem.
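
    The guard is roughly (a sketch of qdisc_list_add() with the new check;
    details condensed):

    void qdisc_list_add(struct Qdisc *q)
    {
            if ((q->parent != TC_H_ROOT) && !(q->flags & TCQ_F_INGRESS)) {
                    struct Qdisc *root = qdisc_dev(q)->qdisc;

                    /* must never register against the noop placeholder */
                    WARN_ON_ONCE(root == &noop_qdisc);
                    list_add_tail(&q->list, &root->list);
            }
    }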

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

08 Nov, 2013

1 commit

  • Add an operations structure that allows a network interface to export
    the fact that it supports packet forwarding in hardware between
    physical interfaces and other MAC layer devices assigned to it (such
    as macvlans). This operations structure can be used by virtual MAC
    devices to bypass software switching so that forwarding can be done
    in hardware more efficiently.
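
    The hooks look like this (a sketch of the two entry points; drivers
    hand back an opaque per-station cookie):

    /* attach a macvlan-style upper device to the physical lower device */
    void *(*ndo_dfwd_add_station)(struct net_device *pdev,
                                  struct net_device *dev);

    /* detach it again, using the cookie returned by add_station */
    void (*ndo_dfwd_del_station)(struct net_device *pdev, void *priv);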

    Signed-off-by: John Fastabend
    Signed-off-by: Neil Horman
    CC: Andy Gospodarek
    CC: "David S. Miller"
    Signed-off-by: David S. Miller

    John Fastabend
     

08 Oct, 2013

1 commit

  • Separate the unreg_list and the close_list in dev_close_many, preventing
    dev_close_many from permuting the unreg_list. The permutations of the
    unreg_list have resulted in cases where the loopback device is accessed
    after it has been freed in code such as dst_ifdown, resulting in subtle
    memory corruption.

    This is the second bug from sharing the storage between the close_list
    and the unreg_list. The issues that crop up with sharing are
    apparently too subtle to show up in normal testing or usage, so let's
    forget about being clever and use two separate lists.

    v2: Make all callers pass in a close_list to dev_close_many
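
    The resulting calling convention, roughly (a sketch of the
    single-device case in dev_close(), using the dedicated close_list
    member):

    if (dev->flags & IFF_UP) {
            LIST_HEAD(single);

            list_add(&dev->close_list, &single);
            dev_close_many(&single);
            list_del(&single);
    }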

    Signed-off-by: "Eric W. Biederman"
    Signed-off-by: David S. Miller

    Eric W. Biederman
     

21 Sep, 2013

1 commit

  • Add an extra u64 rate parameter to psched_ratecfg_precompute()
    so that some qdiscs can opt in to 64-bit rates in the future,
    to overcome the ~34 Gbit limit.

    psched_ratecfg_getrate() reports a legacy structure to the
    tc utility, so if the actual rate is above the 32-bit rate field,
    cap it to the 34 Gbit limit.
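
    Sketch of the two pieces (close to the helpers in
    include/net/sch_generic.h, but condensed here):

    void psched_ratecfg_precompute(struct psched_ratecfg *r,
                                   const struct tc_ratespec *conf,
                                   u64 rate64);        /* new 64-bit rate */

    static inline void psched_ratecfg_getrate(struct tc_ratespec *res,
                                              const struct psched_ratecfg *r)
    {
            memset(res, 0, sizeof(*res));
            /* legacy u32 field: cap anything above 2^32 bytes/s (~34 Gbit) */
            res->rate = min_t(u64, r->rate_bytes_ps, ~0U);
            res->overhead = r->overhead;
    }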

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

31 Aug, 2013

1 commit

  • The pfifo_fast queue discipline has been used by default
    for all devices, but we have better choices now.

    This patch allows setting the default queueing discipline with a
    sysctl. This allows easy use of better queueing disciplines on all
    devices without having to use tc qdisc scripts. It is intended to
    provide an easy path for distributions to make fq_codel or sfq the
    default qdisc.

    This patch also makes pfifo_fast more of a first-class qdisc, since
    it is now possible to manually override the default and explicitly
    use pfifo_fast. The behavior for systems that do not use the sysctl
    is unchanged; they still get pfifo_fast.

    Also removes a leftover random # in sysctl net core.
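
    Minimal user-space illustration: the new knob is a plain sysctl
    (net.core.default_qdisc), so switching the default, e.g. to fq_codel,
    is just a write to /proc/sys/net/core/default_qdisc:

    #include <stdio.h>

    int main(void)
    {
            FILE *f = fopen("/proc/sys/net/core/default_qdisc", "w");

            if (!f) {
                    perror("default_qdisc");
                    return 1;
            }
            fputs("fq_codel\n", f);   /* qdiscs attached later use fq_codel */
            return fclose(f) ? 1 : 0;
    }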

    Signed-off-by: Stephen Hemminger
    Acked-by: Eric Dumazet
    Signed-off-by: David S. Miller

    stephen hemminger
     

15 Aug, 2013

1 commit

  • commit 56b765b79 ("htb: improved accuracy at high rates")
    broke the "linklayer atm" handling.

    tc class add ... htb rate X ceil Y linklayer atm

    The linklayer setting is implemented by modifying the rate table
    which is sent to the kernel. No direct parameter was
    transferred to the kernel indicating the linklayer setting.

    Commit 56b765b79 ("htb: improved accuracy at high rates")
    removed the use of the rate table system.

    To stay compatible with older iproute2 utilities, this patch detects
    the linklayer by parsing the rate table. It also supports future
    versions of iproute2 sending this linklayer parameter to the
    kernel directly. This is done by using the __reserved field in
    struct tc_ratespec to convey the chosen linklayer option, using
    only the lower 4 bits of this field.

    Linklayer detection is limited to speeds below 100Mbit/s, because
    at high rates the rtab gets too inaccurate, so bad that
    several fields contain the same values, resembling the ATM
    detection pattern. Fields even start to contain a "0" time to send,
    e.g. at 1000Mbit/s sending a 96-byte packet costs "0"; thus the rtab
    has been more broken than we first realized.
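
    The uapi side of this is small (a sketch of the pkt_sched.h additions;
    the in-kernel rate-table detection heuristic itself is omitted):

    /* lower 4 bits of tc_ratespec.__reserved carry the linklayer type */
    enum {
            TC_LINKLAYER_UNAWARE,   /* old iproute2: no linklayer info sent */
            TC_LINKLAYER_ETHERNET,
            TC_LINKLAYER_ATM,
    };
    #define TC_LINKLAYER_MASK 0x0F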

    Signed-off-by: Jesper Dangaard Brouer
    Signed-off-by: David S. Miller

    Jesper Dangaard Brouer
     

06 Aug, 2013

1 commit

  • VLAN devices are LLTX and don't update their own trans_start, so if
    dev_trans_start has to be called with a VLAN device, then 0 or a stale
    value will be returned. Currently bonding is the only such user, and
    it's needed for proper ARP monitoring when the slaves are VLANs.
    Fix this by extracting the VLAN's real device trans_start.
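
    Sketch of the change at the top of dev_trans_start() (the scan over
    the tx queues below it is the pre-existing code, shown for context):

    unsigned long dev_trans_start(struct net_device *dev)
    {
            unsigned long val, res;
            unsigned int i;

            if (is_vlan_dev(dev))
                    dev = vlan_dev_real_dev(dev);  /* use the underlying device */

            res = dev->trans_start;
            for (i = 0; i < dev->num_tx_queues; i++) {
                    val = netdev_get_tx_queue(dev, i)->trans_start;
                    if (val && time_after(val, res))
                            res = val;
            }
            return res;
    }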

    Suggested-by: David Miller
    Signed-off-by: Nikolay Aleksandrov
    Acked-by: Veaceslav Falico
    Signed-off-by: David S. Miller

    nikolay@redhat.com
     

12 Jun, 2013

1 commit


03 Jun, 2013

1 commit

  • commit 56b765b79 ("htb: improved accuracy at high rates")
    broke the "overhead xxx" handling, as well as the "linklayer atm"
    attribute.

    tc class add ... htb rate X ceil Y linklayer atm overhead 10

    This patch restores the "overhead xxx" handling for htb, tbf
    and act_police.

    The "linklayer atm" issue needs a separate fix.

    Reported-by: Jesper Dangaard Brouer
    Signed-off-by: Eric Dumazet
    Cc: Vimalkumar
    Cc: Jiri Pirko
    Signed-off-by: David S. Miller

    Eric Dumazet
     

28 Mar, 2013

1 commit

  • It seems that commit

    commit 292f1c7ff6cc10516076ceeea45ed11833bb71c7
    Author: Jiri Pirko
    Date: Tue Feb 12 00:12:03 2013 +0000

    sch: make htb_rate_cfg and functions around that generic

    introduces a small regression.

    Before:

    # tc qdisc add dev eth0 root handle 1: htb default ffff
    # tc class add dev eth0 classid 1:ffff htb rate 5Gbit
    # tc -s class show dev eth0
    class htb 1:ffff root prio 0 rate 5000Mbit ceil 5000Mbit burst 625b cburst
    625b
    Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
    rate 0bit 0pps backlog 0b 0p requeues 0
    lended: 0 borrowed: 0 giants: 0
    tokens: 31 ctokens: 31

    After:

    # tc qdisc add dev eth0 root handle 1: htb default ffff
    # tc class add dev eth0 classid 1:ffff htb rate 5Gbit
    # tc -s class show dev eth0
    class htb 1:ffff root prio 0 rate 1544Mbit ceil 1544Mbit burst 625b cburst
    625b
    Sent 5073 bytes 41 pkt (dropped 0, overlimits 0 requeues 0)
    rate 1976bit 2pps backlog 0b 0p requeues 0
    lended: 41 borrowed: 0 giants: 0
    tokens: 1802 ctokens: 1802

    This is probably due to a lost u64 cast of the rate parameter in
    psched_ratecfg_precompute() (net/sched/sch_generic.c).
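
    A self-contained illustration of why the cast matters (user-space
    code, not the kernel fix itself; it just reproduces the 32-bit wrap
    that truncates rates above 2^32 bit/s, roughly 4.3 Gbit/s):

    #include <stdint.h>
    #include <stdio.h>

    int main(void)
    {
            uint32_t rate_Bps = 625000000;                /* 5 Gbit/s in bytes/s */
            uint64_t wrapped  = rate_Bps << 3;            /* 32-bit shift, wraps */
            uint64_t correct  = (uint64_t)rate_Bps << 3;  /* widen first */

            printf("wrapped: %llu bit/s\ncorrect: %llu bit/s\n",
                   (unsigned long long)wrapped, (unsigned long long)correct);
            return 0;
    }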

    Signed-off-by: Sergey Popovich
    Acked-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Sergey Popovich
     

12 Dec, 2012

1 commit

  • With BQL being deployed, we are more likely to see the following behavior:

    We dequeue a packet from the qdisc in dequeue_skb(), then realize the
    target tx queue is in XOFF state in sch_direct_xmit(), and we have to
    hold the skb in gso_skb for later.

    This shows up in stats (tc -s qdisc dev eth0) as requeues.

    The problem with these requeues is that high-priority packets cannot be
    dequeued as long as this (possibly low-prio and big TSO) packet is not
    removed from gso_skb.

    At 1Gbps, a full-size TSO packet adds 500 us of extra latency.

    In some cases, we know that all packets dequeued from a qdisc are
    for a particular and known txq:

    - if the device is non multiqueue
    - for all MQ/MQPRIO slave qdiscs

    This patch introduces a new qdisc flag, TCQ_F_ONETXQUEUE, to mark
    this capability, so that dequeue_skb() is allowed to dequeue a packet
    only if the associated txq is not stopped.

    This indeed reduces latencies for high-prio packets (or improves
    fairness with sfq/fq_codel), and almost removes qdisc 'requeues'.
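
    Sketch of the check in dequeue_skb() (simplified; the real function
    also deals with an already-requeued gso_skb):

    static struct sk_buff *dequeue_skb(struct Qdisc *q)
    {
            const struct netdev_queue *txq = q->dev_queue;

            /* one qdisc <-> one txq: never dequeue into a stopped queue */
            if ((q->flags & TCQ_F_ONETXQUEUE) &&
                netif_xmit_frozen_or_stopped(txq))
                    return NULL;

            return q->dequeue(q);
    }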

    Signed-off-by: Eric Dumazet
    Cc: Jamal Hadi Salim
    Cc: John Fastabend
    Signed-off-by: David S. Miller

    Eric Dumazet
     

06 Sep, 2012

1 commit

  • It seems we need to provide the ability for stacked devices
    to use a specific lock_class_key for sch->busylock.

    We could instead default the l2tpeth tx_queue_len to 0 (no qdisc), but
    a user might use a qdisc anyway.

    (So the same fix is probably needed in other non-LLTX stacked drivers.)
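
    Mechanically this is a per-device lockdep class override (a sketch,
    assuming the qdisc_tx_busylock pointer this change adds to
    struct net_device):

    /* l2tp_eth setup: give its busylock its own lockdep class */
    static struct lock_class_key l2tp_eth_tx_busylock;
    dev->qdisc_tx_busylock = &l2tp_eth_tx_busylock;

    /* qdisc_alloc(): honour the override when initialising sch->busylock */
    spin_lock_init(&sch->busylock);
    lockdep_set_class(&sch->busylock,
                      dev->qdisc_tx_busylock ?: &qdisc_tx_busylock);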

    Noticed while stressing an L2TPV3 setup:

    ======================================================
    [ INFO: possible circular locking dependency detected ]
    3.6.0-rc3+ #788 Not tainted
    -------------------------------------------------------
    netperf/4660 is trying to acquire lock:
    (l2tpsock){+.-...}, at: [] l2tp_xmit_skb+0x172/0xa50 [l2tp_core]

    but task is already holding lock:
    (&(&sch->busylock)->rlock){+.-...}, at: [] dev_queue_xmit+0xd75/0xe00

    which lock already depends on the new lock.

    the existing dependency chain (in reverse order) is:

    -> #1 (&(&sch->busylock)->rlock){+.-...}:
    [] lock_acquire+0x90/0x200
    [] _raw_spin_lock_irqsave+0x4c/0x60
    [] __wake_up+0x32/0x70
    [] tty_wakeup+0x3e/0x80
    [] pty_write+0x73/0x80
    [] tty_put_char+0x3c/0x40
    [] process_echoes+0x142/0x330
    [] n_tty_receive_buf+0x8fb/0x1230
    [] flush_to_ldisc+0x142/0x1c0
    [] process_one_work+0x198/0x760
    [] worker_thread+0x186/0x4b0
    [] kthread+0x93/0xa0
    [] kernel_thread_helper+0x4/0x10

    -> #0 (l2tpsock){+.-...}:
    [] __lock_acquire+0x1628/0x1b10
    [] lock_acquire+0x90/0x200
    [] _raw_spin_lock+0x41/0x50
    [] l2tp_xmit_skb+0x172/0xa50 [l2tp_core]
    [] l2tp_eth_dev_xmit+0x32/0x60 [l2tp_eth]
    [] dev_hard_start_xmit+0x502/0xa70
    [] sch_direct_xmit+0xfe/0x290
    [] dev_queue_xmit+0x1e5/0xe00
    [] ip_finish_output+0x3d0/0x890
    [] ip_output+0x59/0xf0
    [] ip_local_out+0x2d/0xa0
    [] ip_queue_xmit+0x1c3/0x680
    [] tcp_transmit_skb+0x402/0xa60
    [] tcp_write_xmit+0x1f4/0xa30
    [] tcp_push_one+0x30/0x40
    [] tcp_sendmsg+0xe82/0x1040
    [] inet_sendmsg+0x125/0x230
    [] sock_sendmsg+0xdc/0xf0
    [] sys_sendto+0xfe/0x130
    [] system_call_fastpath+0x16/0x1b
    Possible unsafe locking scenario:

    CPU0 CPU1
    ---- ----
    lock(&(&sch->busylock)->rlock);
    lock(l2tpsock);
    lock(&(&sch->busylock)->rlock);
    lock(l2tpsock);

    *** DEADLOCK ***

    5 locks held by netperf/4660:
    #0: (sk_lock-AF_INET){+.+.+.}, at: [] tcp_sendmsg+0x2c/0x1040
    #1: (rcu_read_lock){.+.+..}, at: [] ip_queue_xmit+0x0/0x680
    #2: (rcu_read_lock_bh){.+....}, at: [] ip_finish_output+0x135/0x890
    #3: (rcu_read_lock_bh){.+....}, at: [] dev_queue_xmit+0x0/0xe00
    #4: (&(&sch->busylock)->rlock){+.-...}, at: [] dev_queue_xmit+0xd75/0xe00

    stack backtrace:
    Pid: 4660, comm: netperf Not tainted 3.6.0-rc3+ #788
    Call Trace:
    [] print_circular_bug+0x1fb/0x20c
    [] __lock_acquire+0x1628/0x1b10
    [] ? check_usage+0x9b/0x4d0
    [] ? __lock_acquire+0x2e4/0x1b10
    [] lock_acquire+0x90/0x200
    [] ? l2tp_xmit_skb+0x172/0xa50 [l2tp_core]
    [] _raw_spin_lock+0x41/0x50
    [] ? l2tp_xmit_skb+0x172/0xa50 [l2tp_core]
    [] l2tp_xmit_skb+0x172/0xa50 [l2tp_core]
    [] l2tp_eth_dev_xmit+0x32/0x60 [l2tp_eth]
    [] dev_hard_start_xmit+0x502/0xa70
    [] ? dev_hard_start_xmit+0x5e/0xa70
    [] ? dev_queue_xmit+0x141/0xe00
    [] sch_direct_xmit+0xfe/0x290
    [] dev_queue_xmit+0x1e5/0xe00
    [] ? dev_hard_start_xmit+0xa70/0xa70
    [] ip_finish_output+0x3d0/0x890
    [] ? ip_finish_output+0x135/0x890
    [] ip_output+0x59/0xf0
    [] ip_local_out+0x2d/0xa0
    [] ip_queue_xmit+0x1c3/0x680
    [] ? ip_local_out+0xa0/0xa0
    [] tcp_transmit_skb+0x402/0xa60
    [] ? tcp_md5_do_lookup+0x18e/0x1a0
    [] tcp_write_xmit+0x1f4/0xa30
    [] tcp_push_one+0x30/0x40
    [] tcp_sendmsg+0xe82/0x1040
    [] inet_sendmsg+0x125/0x230
    [] ? inet_create+0x6b0/0x6b0
    [] ? sock_update_classid+0xc2/0x3b0
    [] ? sock_update_classid+0x130/0x3b0
    [] sock_sendmsg+0xdc/0xf0
    [] ? fget_light+0x3f9/0x4f0
    [] sys_sendto+0xfe/0x130
    [] ? trace_hardirqs_on+0xd/0x10
    [] ? _raw_spin_unlock_irq+0x30/0x50
    [] ? finish_task_switch+0x83/0xf0
    [] ? finish_task_switch+0x46/0xf0
    [] ? sysret_check+0x1b/0x56
    [] system_call_fastpath+0x16/0x1b

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

15 Aug, 2012

1 commit

  • I believe net/core/dev.c is a better place for netif_notify_peers(),
    because other net event notify functions also live in this file.

    Also rename it to netdev_notify_peers().

    Cc: David S. Miller
    Cc: Ian Campbell
    Signed-off-by: Cong Wang
    Signed-off-by: David S. Miller

    Amerigo Wang