Doug / smarc-fsl-linux-kernel | Embedian Git Server

28 Feb, 2013

1 commit

b67bfe0d4 hlist: drop the node parameter from iterators ... Browse Code »

I'm not sure why, but the hlist for each entry iterators were conceived

list_for_each_entry(pos, head, member)

The hlist ones were greedy and wanted an extra parameter:

hlist_for_each_entry(tpos, pos, head, member)

Why did they need an extra pos parameter? I'm not quite sure. Not only
they don't really need it, it also prevents the iterator from looking
exactly like the list iterator, which is unfortunate.

Besides the semantic patch, there was some manual work required:

- Fix up the actual hlist iterators in linux/list.h
- Fix up the declaration of other iterators based on the hlist ones.
- A very small amount of places were using the 'node' parameter, this
was modified to use 'obj->member' instead.
- Coccinelle didn't handle the hlist_for_each_entry_safe iterator
properly, so those had to be fixed up manually.

The semantic patch which is mostly the work of Peter Senna Tschudin is here:

@@
iterator name hlist_for_each_entry, hlist_for_each_entry_continue, hlist_for_each_entry_from, hlist_for_each_entry_rcu, hlist_for_each_entry_rcu_bh, hlist_for_each_entry_continue_rcu_bh, for_each_busy_worker, ax25_uid_for_each, ax25_for_each, inet_bind_bucket_for_each, sctp_for_each_hentry, sk_for_each, sk_for_each_rcu, sk_for_each_from, sk_for_each_safe, sk_for_each_bound, hlist_for_each_entry_safe, hlist_for_each_entry_continue_rcu, nr_neigh_for_each, nr_neigh_for_each_safe, nr_node_for_each, nr_node_for_each_safe, for_each_gfn_indirect_valid_sp, for_each_gfn_sp, for_each_host;

type T;
expression a,c,d,e;
identifier b;
statement S;
@@

-T b;

[akpm@linux-foundation.org: drop bogus change from net/ipv4/raw.c]
[akpm@linux-foundation.org: drop bogus hunk from net/ipv6/raw.c]
[akpm@linux-foundation.org: checkpatch fixes]
[akpm@linux-foundation.org: fix warnings]
[akpm@linux-foudnation.org: redo intrusive kvm changes]
Tested-by: Peter Senna Tschudin
Acked-by: Paul E. McKenney
Signed-off-by: Sasha Levin
Cc: Wu Fengguang
Cc: Marcelo Tosatti
Cc: Gleb Natapov
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Sasha Levin
2013-02-28 11:10:24 +0800

19 Feb, 2013

2 commits

ece31ffd5 net: proc: change proc_net_remove to remove_proc_entry ... Browse Code »

proc_net_remove is only used to remove proc entries
that under /proc/net,it's not a general function for
removing proc entries of netns. if we want to remove
some proc entries which under /proc/net/stat/, we still
need to call remove_proc_entry.

this patch use remove_proc_entry to replace proc_net_remove.
we can remove proc_net_remove after this patch.

Signed-off-by: Gao feng
Signed-off-by: David S. Miller

Gao feng
2013-02-19 03:53:08 +0800
d4beaa66a net: proc: change proc_net_fops_create to proc_create ... Browse Code »

Right now, some modules such as bonding use proc_create
to create proc entries under /proc/net/, and other modules
such as ipv4 use proc_net_fops_create.

It looks a little chaos.this patch changes all of
proc_net_fops_create to proc_create. we can remove
proc_net_fops_create after this patch.

Signed-off-by: Gao feng
Signed-off-by: David S. Miller

Gao feng
2013-02-19 03:53:08 +0800

16 Feb, 2013

1 commit

14bbd6a56 net: Add skb_unclone() helper function. ... Browse Code »

This function will be used in next GRE_GSO patch. This patch does
not change any functionality.

Signed-off-by: Pravin B Shelar
Acked-by: Eric Dumazet

Pravin B Shelar
2013-02-16 04:10:37 +0800

13 Feb, 2013

10 commits

c6d14ff11 act_police: improved accuracy at high rates ... Browse Code »

Current act_police uses rate table computed by the "tc" userspace
program, which has the following issue:

The rate table has 256 entries to map packet lengths to token (time
units). With TSO sized packets, the 256 entry granularity leads to
loss/gain of rate, making the token bucket inaccurate.

Thus, instead of relying on rate table, this patch explicitly computes
the time and accounts for packet transmission times with nanosecond
granularity.

This is a followup to 56b765b79e9a78dc7d3f8850ba5e5567205a3ecd
("htb: improved accuracy at high rates").

Signed-off-by: Jiri Pirko
Signed-off-by: David S. Miller

Jiri Pirko
2013-02-13 07:59:45 +0800
0e243218b act_police: move struct tcf_police to act_police.c ... Browse Code »

It's not used anywhere else, so move it.

Signed-off-by: Jiri Pirko
Acked-by: Jamal Hadi Salim
Acked-by: Eric Dumazet
Signed-off-by: David S. Miller

Jiri Pirko
2013-02-13 07:59:45 +0800
b757c9336 tbf: improved accuracy at high rates ... Browse Code »

Current TBF uses rate table computed by the "tc" userspace program,
which has the following issue:

The rate table has 256 entries to map packet lengths to
token (time units). With TSO sized packets, the 256 entry granularity
leads to loss/gain of rate, making the token bucket inaccurate.

Thus, instead of relying on rate table, this patch explicitly computes
the time and accounts for packet transmission times with nanosecond
granularity.

This is a followup to 56b765b79e9a78dc7d3f8850ba5e5567205a3ecd
("htb: improved accuracy at high rates").

Signed-off-by: Jiri Pirko
Acked-by: Eric Dumazet
Signed-off-by: David S. Miller

Jiri Pirko
2013-02-13 07:59:45 +0800
34c5d292c sch_api: introduce qdisc_watchdog_schedule_ns() ... Browse Code »

tbf will need to schedule watchdog in ns. No need to convert it twice.

Signed-off-by: Jiri Pirko
Acked-by: Eric Dumazet
Signed-off-by: David S. Miller

Jiri Pirko
2013-02-13 07:59:45 +0800
292f1c7ff sch: make htb_rate_cfg and functions around that generic ... Browse Code »

As it is going to be used in tbf as well, push these to generic code.

Signed-off-by: Jiri Pirko
Acked-by: Eric Dumazet
Signed-off-by: David S. Miller

Jiri Pirko
2013-02-13 07:59:45 +0800
b9a7afdef htb: initialize cl->tokens and cl->ctokens correctly ... Browse Code »

These are in ns so convert from ticks to ns.

Signed-off-by: Jiri Pirko
Acked-by: Eric Dumazet
Signed-off-by: David S. Miller

Jiri Pirko
2013-02-13 07:59:44 +0800
bdd6998b1 htb: remove pointless first initialization of buffer and cbuffer ... Browse Code »

These are initialized correctly a couple of lines later in the
function.

Signed-off-by: Jiri Pirko
Acked-by: Eric Dumazet
Signed-off-by: David S. Miller

Jiri Pirko
2013-02-13 07:59:44 +0800
324f5aa52 htb: use PSCHED_TICKS2NS() ... Browse Code »

Signed-off-by: Jiri Pirko
Acked-by: Eric Dumazet
Signed-off-by: David S. Miller

Jiri Pirko
2013-02-13 07:59:44 +0800
9f6d98c29 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net ... Browse Code »

Conflicts:
drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.c

The bnx2x gso_type setting bug fix in 'net' conflicted with
changes in 'net-next' that broke the gso_* setting logic
out into a seperate function, which also fixes the bug in
question. Thus, use the 'net-next' version.

Signed-off-by: David S. Miller

David S. Miller
2013-02-13 07:58:28 +0800
9c10f4115 htb: fix values in opt dump ... Browse Code »

in htb_change_class() cl->buffer and cl->buffer are stored in ns.
So in dump, convert them back to psched ticks.

Note this was introduced by:
commit 56b765b79e9a78dc7d3f8850ba5e5567205a3ecd
htb: improved accuracy at high rates

Please consider this for -net/-stable.

Signed-off-by: Jiri Pirko
Acked-by: Eric Dumazet
Signed-off-by: David S. Miller

Jiri Pirko
2013-02-13 07:51:11 +0800

06 Feb, 2013

1 commit

188d1f76d Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net ... Browse Code »

Conflicts:
drivers/net/ethernet/intel/e1000e/ethtool.c
drivers/net/vmxnet3/vmxnet3_drv.c
drivers/net/wireless/iwlwifi/dvm/tx.c
net/ipv6/route.c

The ipv6 route.c conflict is simple, just ignore the 'net' side change
as we fixed the same problem in 'net-next' by eliminating cached
neighbours from ipv6 routes.

The e1000e conflict is an addition of a new statistic in the ethtool
code, trivial.

The vmxnet3 conflict is about one change in 'net' removing a guarding
conditional, whilst in 'net-next' we had a netdev_info() conversion.

The iwlwifi conflict is dealing with a WARN_ON() conversion in
'net-next' vs. a revert happening in 'net'.

Signed-off-by: David S. Miller

David S. Miller
2013-02-06 03:12:20 +0800

30 Jan, 2013

1 commit

a13d31047 netem: fix delay calculation in rate extension ... Browse Code »

The delay calculation with the rate extension introduces in v3.3 does
not properly work, if other packets are still queued for transmission.
For the delay calculation to work, both delay types (latency and delay
introduces by rate limitation) have to be handled differently. The
latency delay for a packet can overlap with the delay of other packets.
The delay introduced by the rate however is separate, and can only
start, once all other rate-introduced delays finished.

Latency delay is from same distribution for each packet, rate delay
depends on the packet size.

.: latency delay
-: rate delay
x: additional delay we have to wait since another packet is currently
transmitted

.....---- Packet 1
.....xx------ Packet 2
.....------ Packet 3
^^^^^
latency stacks
^^
rate delay doesn't stack
^^
latency stacks

-----> time

When a packet is enqueued, we first consider the latency delay. If other
packets are already queued, we can reduce the latency delay until the
last packet in the queue is send, however the latency delay cannot be

Acked-by: Hagen Paul Pfeifer
Signed-off-by: David S. Miller

Johannes Naab
2013-01-30 04:43:02 +0800

15 Jan, 2013

1 commit

c1b52739e pkt_sched: namespace aware act_mirred ... Browse Code »

Eric Dumazet pointed out that act_mirred needs to find the current net_ns,
and struct net pointer is not provided in the call chain. His original
patch made use of current->nsproxy->net_ns to find the network namespace,
but this fails to work correctly for userspace code that makes use of
netlink sockets in different network namespaces. Instead, pass the
"struct net *" down along the call chain to where it is needed.

This version removes the ifb changes as Eric has submitted that patch
separately, but is otherwise identical to the previous version.

Signed-off-by: Benjamin LaHaise
Tested-by: Eric Dumazet
Acked-by: Jamal Hadi Salim
Signed-off-by: David S. Miller

Benjamin LaHaise
2013-01-15 04:09:36 +0800

22 Dec, 2012

1 commit

d2fe85da5 net: sched: integer overflow fix ... Browse Code »

Fixed integer overflow in function htb_dequeue

Signed-off-by: Stefan Hasko
Acked-by: Eric Dumazet
Signed-off-by: David S. Miller

Stefan Hasko
2012-12-22 16:03:00 +0800

13 Dec, 2012

2 commits

6be35c700 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next ... Browse Code »

Pull networking changes from David Miller:

1) Allow to dump, monitor, and change the bridge multicast database
using netlink. From Cong Wang.

2) RFC 5961 TCP blind data injection attack mitigation, from Eric
Dumazet.

3) Networking user namespace support from Eric W. Biederman.

4) tuntap/virtio-net multiqueue support by Jason Wang.

5) Support for checksum offload of encapsulated packets (basically,
tunneled traffic can still be checksummed by HW). From Joseph
Gasparakis.

6) Allow BPF filter access to VLAN tags, from Eric Dumazet and
Daniel Borkmann.

7) Bridge port parameters over netlink and BPDU blocking support
from Stephen Hemminger.

8) Improve data access patterns during inet socket demux by rearranging
socket layout, from Eric Dumazet.

9) TIPC protocol updates and cleanups from Ying Xue, Paul Gortmaker, and
Jon Maloy.

10) Update TCP socket hash sizing to be more in line with current day
realities. The existing heurstics were choosen a decade ago.
From Eric Dumazet.

11) Fix races, queue bloat, and excessive wakeups in ATM and
associated drivers, from Krzysztof Mazur and David Woodhouse.

12) Support DOVE (Distributed Overlay Virtual Ethernet) extensions
in VXLAN driver, from David Stevens.

13) Add "oops_only" mode to netconsole, from Amerigo Wang.

14) Support set and query of VEB/VEPA bridge mode via PF_BRIDGE, also
allow DCB netlink to work on namespaces other than the initial
namespace. From John Fastabend.

15) Support PTP in the Tigon3 driver, from Matt Carlson.

16) tun/vhost zero copy fixes and improvements, plus turn it on
by default, from Michael S. Tsirkin.

17) Support per-association statistics in SCTP, from Michele
Baldessari.

And many, many, driver updates, cleanups, and improvements. Too
numerous to mention individually.

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1722 commits)
net/mlx4_en: Add support for destination MAC in steering rules
net/mlx4_en: Use generic etherdevice.h functions.
net: ethtool: Add destination MAC address to flow steering API
bridge: add support of adding and deleting mdb entries
bridge: notify mdb changes via netlink
ndisc: Unexport ndisc_{build,send}_skb().
uapi: add missing netconf.h to export list
pkt_sched: avoid requeues if possible
solos-pci: fix double-free of TX skb in DMA mode
bnx2: Fix accidental reversions.
bna: Driver Version Updated to 3.1.2.1
bna: Firmware update
bna: Add RX State
bna: Rx Page Based Allocation
bna: TX Intr Coalescing Fix
bna: Tx and Rx Optimizations
bna: Code Cleanup and Enhancements
ath9k: check pdata variable before dereferencing it
ath5k: RX timestamp is reported at end of frame
ath9k_htc: RX timestamp is reported at end of frame
...

Linus Torvalds
2012-12-13 10:07:07 +0800
d206e0903 Merge branch 'for-3.8' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup ... Browse Code »

Pull cgroup changes from Tejun Heo:
"A lot of activities on cgroup side. The big changes are focused on
making cgroup hierarchy handling saner.

- cgroup_rmdir() had peculiar semantics - it allowed cgroup
destruction to be vetoed by individual controllers and tried to
drain refcnt synchronously. The vetoing never worked properly and
caused good deal of contortions in cgroup. memcg was the last
reamining user. Michal Hocko removed the usage and cgroup_rmdir()
path has been simplified significantly. This was done in a
separate branch so that the memcg people can base further memcg
changes on top.

- The above allowed cleaning up cgroup lifecycle management and
implementation of generic cgroup iterators which are used to
improve hierarchy support.

- cgroup_freezer updated to allow migration in and out of a frozen
cgroup and handle hierarchy. If a cgroup is frozen, all descendant
cgroups are frozen.

- netcls_cgroup and netprio_cgroup updated to handle hierarchy
properly.

- Various fixes and cleanups.

- Two merge commits. One to pull in memcg and rmdir cleanups (needed
to build iterators). The other pulled in cgroup/for-3.7-fixes for
device_cgroup fixes so that further device_cgroup patches can be
stacked on top."

Fixed up a trivial conflict in mm/memcontrol.c as per Tejun (due to
commit bea8c150a7 ("memcg: fix hotplugged memory zone oops") in master
touching code close to commit 2ef37d3fe4 ("memcg: Simplify
mem_cgroup_force_empty_list error handling") in for-3.8)

* 'for-3.8' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: (65 commits)
cgroup: update Documentation/cgroups/00-INDEX
cgroup_rm_file: don't delete the uncreated files
cgroup: remove subsystem files when remounting cgroup
cgroup: use cgroup_addrm_files() in cgroup_clear_directory()
cgroup: warn about broken hierarchies only after css_online
cgroup: list_del_init() on removed events
cgroup: fix lockdep warning for event_control
cgroup: move list add after list head initilization
netprio_cgroup: allow nesting and inherit config on cgroup creation
netprio_cgroup: implement netprio[_set]_prio() helpers
netprio_cgroup: use cgroup->id instead of cgroup_netprio_state->prioidx
netprio_cgroup: reimplement priomap expansion
netprio_cgroup: shorten variable names in extend_netdev_table()
netprio_cgroup: simplify write_priomap()
netcls_cgroup: move config inheritance to ->css_online() and remove .broken_hierarchy marking
cgroup: remove obsolete guarantee from cgroup_task_migrate.
cgroup: add cgroup->id
cgroup, cpuset: remove cgroup_subsys->post_clone()
cgroup: s/CGRP_CLONE_CHILDREN/CGRP_CPUSET_CLONE_CHILDREN/
cgroup: rename ->create/post_create/pre_destroy/destroy() to ->css_alloc/online/offline/free()
...

Linus Torvalds
2012-12-13 00:18:24 +0800

12 Dec, 2012

1 commit

1abbe1394 pkt_sched: avoid requeues if possible ... Browse Code »

With BQL being deployed, we can more likely have following behavior :

We dequeue a packet from qdisc in dequeue_skb(), then we realize target
tx queue is in XOFF state in sch_direct_xmit(), and we have to hold the
skb into gso_skb for later.

This shows in stats (tc -s qdisc dev eth0) as requeues.

Problem of these requeues is that high priority packets can not be
dequeued as long as this (possibly low prio and big TSO packet) is not
removed from gso_skb.

At 1Gbps speed, a full size TSO packet is 500 us of extra latency.

In some cases, we know that all packets dequeued from a qdisc are
for a particular and known txq :

- If device is non multi queue
- For all MQ/MQPRIO slave qdiscs

This patch introduces a new qdisc flag, TCQ_F_ONETXQUEUE to mark
this capability, so that dequeue_skb() is allowed to dequeue a packet
only if the associated txq is not stopped.

This indeed reduce latencies for high prio packets (or improve fairness
with sfq/fq_codel), and almost remove qdisc 'requeues'.

Signed-off-by: Eric Dumazet
Cc: Jamal Hadi Salim
Cc: John Fastabend
Signed-off-by: David S. Miller

Eric Dumazet
2012-12-12 13:16:47 +0800

29 Nov, 2012

1 commit

462dbc910 pkt_sched: QFQ Plus: fair-queueing service at DRR cost ... Browse Code »

This patch turns QFQ into QFQ+, a variant of QFQ that provides the
following two benefits: 1) QFQ+ is faster than QFQ, 2) differently
from QFQ, QFQ+ correctly schedules also non-leaves classes in a
hierarchical setting. A detailed description of QFQ+, plus a
performance comparison with DRR and QFQ, can be found in [1].

[1] P. Valente, "Reducing the Execution Time of Fair-Queueing Schedulers"
http://algo.ing.unimo.it/people/paolo/agg-sched/agg-sched.pdf

Signed-off-by: Paolo Valente
Signed-off-by: David S. Miller

Paolo Valente
2012-11-29 00:19:35 +0800

26 Nov, 2012

1 commit

a303fbf3d net: sched: enable CAN Identifier to be build into kernel ... Browse Code »

This patch makes it possible to build the CAN Identifier into the kernel, even
if the CAN support is build as a module.

Signed-off-by: Marc Kleine-Budde
Signed-off-by: David S. Miller

Marc Kleine-Budde
2012-11-26 05:06:06 +0800

22 Nov, 2012

1 commit

0ba18f7a5 netcls_cgroup: move config inheritance to ->css_online() and remove .broken_hierarchy marking ... Browse Code »

It turns out that we'll have to live with attributes which are
inherited at cgroup creation time but not affected by further updates
to the parent afterwards - such attributes are already in wide use
e.g. for cpuset.

So, there's nothing to do for netcls_cgroup for hierarchy support.
Its current behavior - inherit only during creation - is good enough.

Move config inheriting from ->css_alloc() to ->css_online() for
consistency, which doesn't change behavior at all, and remove
.broken_hierarchy marking.

Signed-off-by: Tejun Heo
Tested-and-Acked-by: Daniel Wagner
Acked-by: David S. Miller

Tejun Heo
2012-11-22 23:32:46 +0800

20 Nov, 2012

1 commit

92fb97487 cgroup: rename ->create/post_create/pre_destroy/destroy() to ->css_alloc/online/offline/free() ... Browse Code »

Rename cgroup_subsys css lifetime related callbacks to better describe
what their roles are. Also, update documentation.

Signed-off-by: Tejun Heo
Acked-by: Li Zefan

Tejun Heo
2012-11-20 00:13:38 +0800

19 Nov, 2012

1 commit

dfc47ef86 net: Push capable(CAP_NET_ADMIN) into the rtnl methods ... Browse Code »

- In rtnetlink_rcv_msg convert the capable(CAP_NET_ADMIN) check
to ns_capable(net->user-ns, CAP_NET_ADMIN). Allowing unprivileged
users to make netlink calls to modify their local network
namespace.

- In the rtnetlink doit methods add capable(CAP_NET_ADMIN) so
that calls that are not safe for unprivileged users are still
protected.

Later patches will remove the extra capable calls from methods
that are safe for unprivilged users.

Acked-by: Serge Hallyn
Signed-off-by: "Eric W. Biederman"
Signed-off-by: David S. Miller

Eric W. Biederman
2012-11-19 09:32:44 +0800

11 Nov, 2012

1 commit

d4185bbf6 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net ... Browse Code »

Conflicts:
drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c

Minor conflict between the BCM_CNIC define removal in net-next
and a bug fix added to net. Based upon a conflict resolution
patch posted by Stephen Rothwell.

Signed-off-by: David S. Miller

David S. Miller
2012-11-11 07:32:51 +0800

08 Nov, 2012

1 commit

3015f3d2a pkt_sched: enable QFQ to support TSO/GSO ... Browse Code »

If the max packet size for some class (configured through tc) is
violated by the actual size of the packets of that class, then QFQ
would not schedule classes correctly, and the data structures
implementing the bucket lists may get corrupted. This problem occurs
with TSO/GSO even if the max packet size is set to the MTU, and is,
e.g., the cause of the failure reported in [1]. Two patches have been
proposed to solve this problem in [2], one of them is a preliminary
version of this patch.

This patch addresses the above issues by: 1) setting QFQ parameters to
proper values for supporting TSO/GSO (in particular, setting the
maximum possible packet size to 64KB), 2) automatically increasing the
max packet size for a class, lmax, when a packet with a larger size
than the current value of lmax arrives.

The drawback of the first point is that the maximum weight for a class
is now limited to 4096, which is equal to 1/16 of the maximum weight
sum.

Finally, this patch also forcibly caps the timestamps of a class if
they are too high to be stored in the bucket list. This capping, taken
from QFQ+ [3], handles the unfrequent case described in the comment to
the function slot_insert.

[1] http://marc.info/?l=linux-netdev&m=134968777902077&w=2
[2] http://marc.info/?l=linux-netdev&m=135096573507936&w=2
[3] http://marc.info/?l=linux-netdev&m=134902691421670&w=2

Signed-off-by: Paolo Valente
Tested-by: Cong Wang
Acked-by: Stephen Hemminger
Acked-by: Stephen Hemminger
Signed-off-by: David S. Miller

Paolo Valente
2012-11-08 04:37:04 +0800

07 Nov, 2012

1 commit

196d97f6b htb: fix two bugs ... Browse Code »

Commit 56b765b79e9 (htb: improved accuracy at high rates)
introduced two bugs :

1) one bstats_update() was inadvertently removed from
htb_dequeue_tree(), breaking statistics/rate estimation.

2) Missing qdisc_put_rtab() calls in htb_change_class(),
leaking kernel memory, now struct htb_class no longer
retains pointers to qdisc_rate_table structs.

Since only rate is used, dont use qdisc_get_rtab() calls
copying data we ignore anyway.

Signed-off-by: Eric Dumazet
Cc: Vimalkumar
Signed-off-by: David S. Miller

Eric Dumazet
2012-11-07 08:06:29 +0800

04 Nov, 2012

1 commit

56b765b79 htb: improved accuracy at high rates ... Browse Code »

Current HTB (and TBF) uses rate table computed by the "tc"
userspace program, which has the following issue:

The rate table has 256 entries to map packet lengths
to token (time units). With TSO sized packets, the
256 entry granularity leads to loss/gain of rate,
making the token bucket inaccurate.

Thus, instead of relying on rate table, this patch
explicitly computes the time and accounts for packet
transmission times with nanosecond granularity.

This greatly improves accuracy of HTB with a wide
range of packet sizes.

Example:

tc qdisc add dev $dev root handle 1: \
htb default 1

tc class add dev $dev classid 1:1 parent 1: \
rate 5Gbit mtu 64k

Here is an example of inaccuracy:

$ iperf -c host -t 10 -i 1

With old htb:
eth4: 34.76 Mb/s In 5827.98 Mb/s Out - 65836.0 p/s In 481273.0 p/s Out
[SUM] 9.0-10.0 sec 669 MBytes 5.61 Gbits/sec
[SUM] 0.0-10.0 sec 6.50 GBytes 5.58 Gbits/sec

With new htb:
eth4: 28.36 Mb/s In 5208.06 Mb/s Out - 53704.0 p/s In 430076.0 p/s Out
[SUM] 9.0-10.0 sec 594 MBytes 4.98 Gbits/sec
[SUM] 0.0-10.0 sec 5.80 GBytes 4.98 Gbits/sec

The bits per second on the wire is still 5200Mb/s with new HTB
because qdisc accounts for packet length using skb->len, which
is smaller than total bytes on the wire if GSO is used. But
that is for another patch regardless of how time is accounted.

Many thanks to Eric Dumazet for review and feedback.

Signed-off-by: Vimalkumar
Signed-off-by: David S. Miller

Vimalkumar
2012-11-04 03:24:01 +0800

26 Oct, 2012

1 commit

6a328d8c6 cgroup: net_cls: Rework update socket logic ... Browse Code »

The cgroup logic part of net_cls is very similar as the one in
net_prio. Let's stream line the net_cls logic with the net_prio one.

The net_prio update logic was changed by following commit (note there
were some changes necessary later on)

commit 406a3c638ce8b17d9704052c07955490f732c2b8
Author: John Fastabend
Date: Fri Jul 20 10:39:25 2012 +0000

net: netprio_cgroup: rework update socket logic

Instead of updating the sk_cgrp_prioidx struct field on every send
this only updates the field when a task is moved via cgroup
infrastructure.

This allows sockets that may be used by a kernel worker thread
to be managed. For example in the iscsi case today a user can
put iscsid in a netprio cgroup and control traffic will be sent
with the correct sk_cgrp_prioidx value set but as soon as data
is sent the kernel worker thread isssues a send and sk_cgrp_prioidx
is updated with the kernel worker threads value which is the
default case.

It seems more correct to only update the field when the user
explicitly sets it via control group infrastructure. This allows
the users to manage sockets that may be used with other threads.

Since classid is now updated when the task is moved between the
cgroups, we don't have to call sock_update_classid() from various
places to ensure we always using the latest classid value.

[v2: Use iterate_fd() instead of open coding]

Signed-off-by: Daniel Wagner
Cc: Li Zefan
Cc: "David S. Miller"
Cc: "Michael S. Tsirkin"
Cc: Jamal Hadi Salim
Cc: Joe Perches
Cc: John Fastabend
Cc: Neil Horman
Cc: Stanislav Kinsbursky
Cc: Tejun Heo
Cc:
Cc:
Acked-by: Neil Horman
Signed-off-by: David S. Miller

Daniel Wagner
2012-10-26 15:40:51 +0800

22 Oct, 2012

1 commit

46baac38e pkt_sched: use ns_to_ktime() helper ... Browse Code »

ns_to_ktime() seems better than ktime_set() + ktime_add_ns()

Signed-off-by: Eric Dumazet
Signed-off-by: David S. Miller

Eric Dumazet
2012-10-22 10:21:27 +0800

03 Oct, 2012

4 commits

aecdc33e1 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next ... Browse Code »

Pull networking changes from David Miller:

1) GRE now works over ipv6, from Dmitry Kozlov.

2) Make SCTP more network namespace aware, from Eric Biederman.

3) TEAM driver now works with non-ethernet devices, from Jiri Pirko.

4) Make openvswitch network namespace aware, from Pravin B Shelar.

5) IPV6 NAT implementation, from Patrick McHardy.

6) Server side support for TCP Fast Open, from Jerry Chu and others.

7) Packet BPF filter supports MOD and XOR, from Eric Dumazet and Daniel
Borkmann.

8) Increate the loopback default MTU to 64K, from Eric Dumazet.

9) Use a per-task rather than per-socket page fragment allocator for
outgoing networking traffic. This benefits processes that have very
many mostly idle sockets, which is quite common.

From Eric Dumazet.

10) Use up to 32K for page fragment allocations, with fallbacks to
smaller sizes when higher order page allocations fail. Benefits are
a) less segments for driver to process b) less calls to page
allocator c) less waste of space.

From Eric Dumazet.

11) Allow GRO to be used on GRE tunnels, from Eric Dumazet.

12) VXLAN device driver, one way to handle VLAN issues such as the
limitation of 4096 VLAN IDs yet still have some level of isolation.
From Stephen Hemminger.

13) As usual there is a large boatload of driver changes, with the scale
perhaps tilted towards the wireless side this time around.

Fix up various fairly trivial conflicts, mostly caused by the user
namespace changes.

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1012 commits)
hyperv: Add buffer for extended info after the RNDIS response message.
hyperv: Report actual status in receive completion packet
hyperv: Remove extra allocated space for recv_pkt_list elements
hyperv: Fix page buffer handling in rndis_filter_send_request()
hyperv: Fix the missing return value in rndis_filter_set_packet_filter()
hyperv: Fix the max_xfer_size in RNDIS initialization
vxlan: put UDP socket in correct namespace
vxlan: Depend on CONFIG_INET
sfc: Fix the reported priorities of different filter types
sfc: Remove EFX_FILTER_FLAG_RX_OVERRIDE_IP
sfc: Fix loopback self-test with separate_tx_channels=1
sfc: Fix MCDI structure field lookup
sfc: Add parentheses around use of bitfield macro arguments
sfc: Fix null function pointer in efx_sriov_channel_type
vxlan: virtual extensible lan
igmp: export symbol ip_mc_leave_group
netlink: add attributes to fdb interface
tg3: unconditionally select HWMON support when tg3 is enabled.
Revert "net: ti cpsw ethernet: allow reading phy interface mode from DT"
gre: fix sparse warning
...

Linus Torvalds
2012-10-03 04:38:27 +0800
437589a74 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace ... Browse Code »

Pull user namespace changes from Eric Biederman:
"This is a mostly modest set of changes to enable basic user namespace
support. This allows the code to code to compile with user namespaces
enabled and removes the assumption there is only the initial user
namespace. Everything is converted except for the most complex of the
filesystems: autofs4, 9p, afs, ceph, cifs, coda, fuse, gfs2, ncpfs,
nfs, ocfs2 and xfs as those patches need a bit more review.

The strategy is to push kuid_t and kgid_t values are far down into
subsystems and filesystems as reasonable. Leaving the make_kuid and
from_kuid operations to happen at the edge of userspace, as the values
come off the disk, and as the values come in from the network.
Letting compile type incompatible compile errors (present when user
namespaces are enabled) guide me to find the issues.

The most tricky areas have been the places where we had an implicit
union of uid and gid values and were storing them in an unsigned int.
Those places were converted into explicit unions. I made certain to
handle those places with simple trivial patches.

Out of that work I discovered we have generic interfaces for storing
quota by projid. I had never heard of the project identifiers before.
Adding full user namespace support for project identifiers accounts
for most of the code size growth in my git tree.

Ultimately there will be work to relax privlige checks from
"capable(FOO)" to "ns_capable(user_ns, FOO)" where it is safe allowing
root in a user names to do those things that today we only forbid to
non-root users because it will confuse suid root applications.

While I was pushing kuid_t and kgid_t changes deep into the audit code
I made a few other cleanups. I capitalized on the fact we process
netlink messages in the context of the message sender. I removed
usage of NETLINK_CRED, and started directly using current->tty.

Some of these patches have also made it into maintainer trees, with no
problems from identical code from different trees showing up in
linux-next.

After reading through all of this code I feel like I might be able to
win a game of kernel trivial pursuit."

Fix up some fairly trivial conflicts in netfilter uid/git logging code.

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: (107 commits)
userns: Convert the ufs filesystem to use kuid/kgid where appropriate
userns: Convert the udf filesystem to use kuid/kgid where appropriate
userns: Convert ubifs to use kuid/kgid
userns: Convert squashfs to use kuid/kgid where appropriate
userns: Convert reiserfs to use kuid and kgid where appropriate
userns: Convert jfs to use kuid/kgid where appropriate
userns: Convert jffs2 to use kuid and kgid where appropriate
userns: Convert hpfs to use kuid and kgid where appropriate
userns: Convert btrfs to use kuid/kgid where appropriate
userns: Convert bfs to use kuid/kgid where appropriate
userns: Convert affs to use kuid/kgid wherwe appropriate
userns: On alpha modify linux_to_osf_stat to use convert from kuids and kgids
userns: On ia64 deal with current_uid and current_gid being kuid and kgid
userns: On ppc convert current_uid from a kuid before printing.
userns: Convert s390 getting uid and gid system calls to use kuid and kgid
userns: Convert s390 hypfs to use kuid and kgid where appropriate
userns: Convert binder ipc to use kuids
userns: Teach security_path_chown to take kuids and kgids
userns: Add user namespace support to IMA
userns: Convert EVM to deal with kuids and kgids in it's hmac computation
...

Linus Torvalds
2012-10-03 02:11:09 +0800
68d47a137 Merge branch 'for-3.7-hierarchy' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup ... Browse Code »

Pull cgroup hierarchy update from Tejun Heo:
"Currently, different cgroup subsystems handle nested cgroups
completely differently. There's no consistency among subsystems and
the behaviors often are outright broken.

People at least seem to agree that the broken hierarhcy behaviors need
to be weeded out if any progress is gonna be made on this front and
that the fallouts from deprecating the broken behaviors should be
acceptable especially given that the current behaviors don't make much
sense when nested.

This patch makes cgroup emit warning messages if cgroups for
subsystems with broken hierarchy behavior are nested to prepare for
fixing them in the future. This was put in a separate branch because
more related changes were expected (didn't make it this round) and the
memory cgroup wanted to pull in this and make changes on top."

* 'for-3.7-hierarchy' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
cgroup: mark subsystems with broken hierarchy support and whine if cgroups are nested for them

Linus Torvalds
2012-10-03 01:52:28 +0800
c0e8a139a Merge branch 'for-3.7' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup ... Browse Code »

Pull cgroup updates from Tejun Heo:

- xattr support added. The implementation is shared with tmpfs. The
usage is restricted and intended to be used to manage per-cgroup
metadata by system software. tmpfs changes are routed through this
branch with Hugh's permission.

- cgroup subsystem ID handling simplified.

* 'for-3.7' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
cgroup: Define CGROUP_SUBSYS_COUNT according the configuration
cgroup: Assign subsystem IDs during compile time
cgroup: Do not depend on a given order when populating the subsys array
cgroup: Wrap subsystem selection macro
cgroup: Remove CGROUP_BUILTIN_SUBSYS_COUNT
cgroup: net_prio: Do not define task_netpioidx() when not selected
cgroup: net_cls: Do not define task_cls_classid() when not selected
cgroup: net_cls: Move sock_update_classid() declaration to cls_cgroup.h
cgroup: trivial fixes for Documentation/cgroups/cgroups.txt
xattr: mark variable as uninitialized to make both gcc and smatch happy
fs: add missing documentation to simple_xattr functions
cgroup: add documentation on extended attributes usage
cgroup: rename subsys_bits to subsys_mask
cgroup: add xattr support
cgroup: revise how we re-populate root directory
xattr: extract simple_xattr code from tmpfs

Linus Torvalds
2012-10-03 01:50:47 +0800

29 Sep, 2012

1 commit

6a06e5e1b Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net ... Browse Code »

Conflicts:
drivers/net/team/team.c
drivers/net/usb/qmi_wwan.c
net/batman-adv/bat_iv_ogm.c
net/ipv4/fib_frontend.c
net/ipv4/route.c
net/l2tp/l2tp_netlink.c

The team, fib_frontend, route, and l2tp_netlink conflicts were simply
overlapping changes.

qmi_wwan and bat_iv_ogm were of the "use HEAD" variety.

With help from Antonio Quartulli.

Signed-off-by: David S. Miller

David S. Miller
2012-09-29 02:40:49 +0800

28 Sep, 2012

1 commit

f54ba7798 pkt_sched: Fix warning false positives. ... Browse Code »

GCC refuses to recognize that all error control flows do in fact
set err to something.

Add an explicit initialization to shut it up.

net/sched/sch_drr.c: In function ‘drr_enqueue’:
net/sched/sch_drr.c:359:11: warning: ‘err’ may be used uninitialized in this function [-Wmaybe-uninitialized]
net/sched/sch_qfq.c: In function ‘qfq_enqueue’:
net/sched/sch_qfq.c:885:11: warning: ‘err’ may be used uninitialized in this function [-Wmaybe-uninitialized]

Signed-off-by: David S. Miller

David S. Miller
2012-09-28 06:35:47 +0800

25 Sep, 2012

1 commit

5640f7685 net: use a per task frag allocator ... Browse Code »

We currently use a per socket order-0 page cache for tcp_sendmsg()
operations.

This page is used to build fragments for skbs.

Its done to increase probability of coalescing small write() into
single segments in skbs still in write queue (not yet sent)

But it wastes a lot of memory for applications handling many mostly
idle sockets, since each socket holds one page in sk->sk_sndmsg_page

Its also quite inefficient to build TSO 64KB packets, because we need
about 16 pages per skb on arches where PAGE_SIZE = 4096, so we hit
page allocator more than wanted.

This patch adds a per task frag allocator and uses bigger pages,
if available. An automatic fallback is done in case of memory pressure.

(up to 32768 bytes per frag, thats order-3 pages on x86)

This increases TCP stream performance by 20% on loopback device,
but also benefits on other network devices, since 8x less frags are
mapped on transmit and unmapped on tx completion. Alexander Duyck
mentioned a probable performance win on systems with IOMMU enabled.

Its possible some SG enabled hardware cant cope with bigger fragments,
but their ndo_start_xmit() should already handle this, splitting a
fragment in sub fragments, since some arches have PAGE_SIZE=65536

Successfully tested on various ethernet devices.
(ixgbe, igb, bnx2x, tg3, mellanox mlx4)

Signed-off-by: Eric Dumazet
Cc: Ben Hutchings
Cc: Vijay Subramanian
Cc: Alexander Duyck
Tested-by: Vijay Subramanian
Signed-off-by: David S. Miller

Eric Dumazet
2012-09-25 04:31:37 +0800

20 Sep, 2012

1 commit

712619569 pkt_sched: fix virtual-start-time update in QFQ ... Browse Code »

If the old timestamps of a class, say cl, are stale when the class
becomes active, then QFQ may assign to cl a much higher start time
than the maximum value allowed. This may happen when QFQ assigns to
the start time of cl the finish time of a group whose classes are
characterized by a higher value of the ratio
max_class_pkt/weight_of_the_class with respect to that of
cl. Inserting a class with a too high start time into the bucket list
corrupts the data structure and may eventually lead to crashes.
This patch limits the maximum start time assigned to a class.

Signed-off-by: Paolo Valente
Signed-off-by: David S. Miller

Paolo Valente
2012-09-20 04:23:53 +0800