08 Nov, 2012

1 commit

  • If the max packet size for some class (configured through tc) is
    violated by the actual size of the packets of that class, then QFQ
    would not schedule classes correctly, and the data structures
    implementing the bucket lists may get corrupted. This problem occurs
    with TSO/GSO even if the max packet size is set to the MTU, and is,
    e.g., the cause of the failure reported in [1]. Two patches have been
    proposed to solve this problem in [2], one of them is a preliminary
    version of this patch.

    This patch addresses the above issues by: 1) setting QFQ parameters to
    proper values for supporting TSO/GSO (in particular, setting the
    maximum possible packet size to 64KB), 2) automatically increasing the
    max packet size for a class, lmax, when a packet with a larger size
    than the current value of lmax arrives.

    The drawback of the first point is that the maximum weight for a class
    is now limited to 4096, which is equal to 1/16 of the maximum weight
    sum.

    Finally, this patch also forcibly caps the timestamps of a class if
    they are too high to be stored in the bucket list. This capping, taken
    from QFQ+ [3], handles the infrequent case described in the comment to
    the function slot_insert.

    [1] http://marc.info/?l=linux-netdev&m=134968777902077&w=2
    [2] http://marc.info/?l=linux-netdev&m=135096573507936&w=2
    [3] http://marc.info/?l=linux-netdev&m=134902691421670&w=2

    Signed-off-by: Paolo Valente
    Tested-by: Cong Wang
    Acked-by: Stephen Hemminger
    Signed-off-by: David S. Miller

    Paolo Valente
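The lmax auto-increase described in point 2 can be sketched as follows. This is an illustrative Python model of the logic, not the kernel patch itself; the names (`enqueue_update_lmax`, `QFQ_MAX_LMAX`) are invented for the sketch.

```python
QFQ_MAX_LMAX = 1 << 16  # 64KB ceiling assumed here, per point 1 above

def enqueue_update_lmax(cls, pkt_len):
    """Return the (possibly increased) lmax of the class after this packet.

    If a packet larger than the class's configured lmax arrives, lmax is
    raised so the bucket-list invariants keep holding, instead of letting
    the slot computation go out of range.
    """
    if pkt_len > cls["lmax"]:
        cls["lmax"] = min(pkt_len, QFQ_MAX_LMAX)
    return cls["lmax"]
```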
     

03 Oct, 2012

4 commits

  • Pull networking changes from David Miller:

    1) GRE now works over ipv6, from Dmitry Kozlov.

    2) Make SCTP more network namespace aware, from Eric Biederman.

    3) TEAM driver now works with non-ethernet devices, from Jiri Pirko.

    4) Make openvswitch network namespace aware, from Pravin B Shelar.

    5) IPV6 NAT implementation, from Patrick McHardy.

    6) Server side support for TCP Fast Open, from Jerry Chu and others.

    7) Packet BPF filter supports MOD and XOR, from Eric Dumazet and Daniel
    Borkmann.

    8) Increase the loopback default MTU to 64K, from Eric Dumazet.

    9) Use a per-task rather than per-socket page fragment allocator for
    outgoing networking traffic. This benefits processes that have very
    many mostly idle sockets, which is quite common.

    From Eric Dumazet.

    10) Use up to 32K for page fragment allocations, with fallbacks to
    smaller sizes when higher order page allocations fail. Benefits are
    a) fewer segments for the driver to process, b) fewer calls to the page
    allocator, c) less waste of space.

    From Eric Dumazet.

    11) Allow GRO to be used on GRE tunnels, from Eric Dumazet.

    12) VXLAN device driver, one way to handle VLAN issues such as the
    limitation of 4096 VLAN IDs yet still have some level of isolation.
    From Stephen Hemminger.

    13) As usual there is a large boatload of driver changes, with the scale
    perhaps tilted towards the wireless side this time around.

    Fix up various fairly trivial conflicts, mostly caused by the user
    namespace changes.

    * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1012 commits)
    hyperv: Add buffer for extended info after the RNDIS response message.
    hyperv: Report actual status in receive completion packet
    hyperv: Remove extra allocated space for recv_pkt_list elements
    hyperv: Fix page buffer handling in rndis_filter_send_request()
    hyperv: Fix the missing return value in rndis_filter_set_packet_filter()
    hyperv: Fix the max_xfer_size in RNDIS initialization
    vxlan: put UDP socket in correct namespace
    vxlan: Depend on CONFIG_INET
    sfc: Fix the reported priorities of different filter types
    sfc: Remove EFX_FILTER_FLAG_RX_OVERRIDE_IP
    sfc: Fix loopback self-test with separate_tx_channels=1
    sfc: Fix MCDI structure field lookup
    sfc: Add parentheses around use of bitfield macro arguments
    sfc: Fix null function pointer in efx_sriov_channel_type
    vxlan: virtual extensible lan
    igmp: export symbol ip_mc_leave_group
    netlink: add attributes to fdb interface
    tg3: unconditionally select HWMON support when tg3 is enabled.
    Revert "net: ti cpsw ethernet: allow reading phy interface mode from DT"
    gre: fix sparse warning
    ...

    Linus Torvalds
     
  • Pull user namespace changes from Eric Biederman:
    "This is a mostly modest set of changes to enable basic user namespace
    support. This allows the code to compile with user namespaces
    enabled and removes the assumption there is only the initial user
    namespace. Everything is converted except for the most complex of the
    filesystems: autofs4, 9p, afs, ceph, cifs, coda, fuse, gfs2, ncpfs,
    nfs, ocfs2 and xfs as those patches need a bit more review.

    The strategy is to push kuid_t and kgid_t values as far down into
    subsystems and filesystems as reasonable. Leaving the make_kuid and
    from_kuid operations to happen at the edge of userspace, as the values
    come off the disk, and as the values come in from the network.
    Letting the incompatible-type compile errors (present when user
    namespaces are enabled) guide me to find the issues.

    The most tricky areas have been the places where we had an implicit
    union of uid and gid values and were storing them in an unsigned int.
    Those places were converted into explicit unions. I made certain to
    handle those places with simple trivial patches.

    Out of that work I discovered we have generic interfaces for storing
    quota by projid. I had never heard of the project identifiers before.
    Adding full user namespace support for project identifiers accounts
    for most of the code size growth in my git tree.

    Ultimately there will be work to relax privilege checks from
    "capable(FOO)" to "ns_capable(user_ns, FOO)" where it is safe, allowing
    root in a user namespace to do those things that today we only forbid to
    non-root users because it will confuse suid root applications.

    While I was pushing kuid_t and kgid_t changes deep into the audit code
    I made a few other cleanups. I capitalized on the fact we process
    netlink messages in the context of the message sender. I removed
    usage of NETLINK_CRED, and started directly using current->tty.

    Some of these patches have also made it into maintainer trees, with no
    problems from identical code from different trees showing up in
    linux-next.

    After reading through all of this code I feel like I might be able to
    win a game of kernel trivial pursuit."

    Fix up some fairly trivial conflicts in netfilter uid/gid logging code.

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: (107 commits)
    userns: Convert the ufs filesystem to use kuid/kgid where appropriate
    userns: Convert the udf filesystem to use kuid/kgid where appropriate
    userns: Convert ubifs to use kuid/kgid
    userns: Convert squashfs to use kuid/kgid where appropriate
    userns: Convert reiserfs to use kuid and kgid where appropriate
    userns: Convert jfs to use kuid/kgid where appropriate
    userns: Convert jffs2 to use kuid and kgid where appropriate
    userns: Convert hpfs to use kuid and kgid where appropriate
    userns: Convert btrfs to use kuid/kgid where appropriate
    userns: Convert bfs to use kuid/kgid where appropriate
    userns: Convert affs to use kuid/kgid where appropriate
    userns: On alpha modify linux_to_osf_stat to use convert from kuids and kgids
    userns: On ia64 deal with current_uid and current_gid being kuid and kgid
    userns: On ppc convert current_uid from a kuid before printing.
    userns: Convert s390 getting uid and gid system calls to use kuid and kgid
    userns: Convert s390 hypfs to use kuid and kgid where appropriate
    userns: Convert binder ipc to use kuids
    userns: Teach security_path_chown to take kuids and kgids
    userns: Add user namespace support to IMA
    userns: Convert EVM to deal with kuids and kgids in it's hmac computation
    ...

    Linus Torvalds
     
  • Pull cgroup hierarchy update from Tejun Heo:
    "Currently, different cgroup subsystems handle nested cgroups
    completely differently. There's no consistency among subsystems and
    the behaviors often are outright broken.

    People at least seem to agree that the broken hierarchy behaviors need
    to be weeded out if any progress is gonna be made on this front and
    that the fallouts from deprecating the broken behaviors should be
    acceptable especially given that the current behaviors don't make much
    sense when nested.

    This patch makes cgroup emit warning messages if cgroups for
    subsystems with broken hierarchy behavior are nested to prepare for
    fixing them in the future. This was put in a separate branch because
    more related changes were expected (didn't make it this round) and the
    memory cgroup wanted to pull in this and make changes on top."

    * 'for-3.7-hierarchy' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
    cgroup: mark subsystems with broken hierarchy support and whine if cgroups are nested for them

    Linus Torvalds
     
  • Pull cgroup updates from Tejun Heo:

    - xattr support added. The implementation is shared with tmpfs. The
    usage is restricted and intended to be used to manage per-cgroup
    metadata by system software. tmpfs changes are routed through this
    branch with Hugh's permission.

    - cgroup subsystem ID handling simplified.

    * 'for-3.7' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
    cgroup: Define CGROUP_SUBSYS_COUNT according to the configuration
    cgroup: Assign subsystem IDs during compile time
    cgroup: Do not depend on a given order when populating the subsys array
    cgroup: Wrap subsystem selection macro
    cgroup: Remove CGROUP_BUILTIN_SUBSYS_COUNT
    cgroup: net_prio: Do not define task_netpioidx() when not selected
    cgroup: net_cls: Do not define task_cls_classid() when not selected
    cgroup: net_cls: Move sock_update_classid() declaration to cls_cgroup.h
    cgroup: trivial fixes for Documentation/cgroups/cgroups.txt
    xattr: mark variable as uninitialized to make both gcc and smatch happy
    fs: add missing documentation to simple_xattr functions
    cgroup: add documentation on extended attributes usage
    cgroup: rename subsys_bits to subsys_mask
    cgroup: add xattr support
    cgroup: revise how we re-populate root directory
    xattr: extract simple_xattr code from tmpfs

    Linus Torvalds
     

29 Sep, 2012

1 commit

  • Conflicts:
    drivers/net/team/team.c
    drivers/net/usb/qmi_wwan.c
    net/batman-adv/bat_iv_ogm.c
    net/ipv4/fib_frontend.c
    net/ipv4/route.c
    net/l2tp/l2tp_netlink.c

    The team, fib_frontend, route, and l2tp_netlink conflicts were simply
    overlapping changes.

    qmi_wwan and bat_iv_ogm were of the "use HEAD" variety.

    With help from Antonio Quartulli.

    Signed-off-by: David S. Miller

    David S. Miller
     

28 Sep, 2012

1 commit

  • GCC refuses to recognize that all error control flows do in fact
    set err to something.

    Add an explicit initialization to shut it up.

    net/sched/sch_drr.c: In function ‘drr_enqueue’:
    net/sched/sch_drr.c:359:11: warning: ‘err’ may be used uninitialized in this function [-Wmaybe-uninitialized]
    net/sched/sch_qfq.c: In function ‘qfq_enqueue’:
    net/sched/sch_qfq.c:885:11: warning: ‘err’ may be used uninitialized in this function [-Wmaybe-uninitialized]

    Signed-off-by: David S. Miller

    David S. Miller
     

25 Sep, 2012

1 commit

  • We currently use a per socket order-0 page cache for tcp_sendmsg()
    operations.

    This page is used to build fragments for skbs.

    It's done to increase the probability of coalescing small write()s into
    single segments in skbs still in the write queue (not yet sent).

    But it wastes a lot of memory for applications handling many mostly
    idle sockets, since each socket holds one page in sk->sk_sndmsg_page.

    It's also quite inefficient for building TSO 64KB packets, because we
    need about 16 pages per skb on arches where PAGE_SIZE = 4096, so we hit
    the page allocator more than wanted.

    This patch adds a per task frag allocator and uses bigger pages,
    if available. An automatic fallback is done in case of memory pressure.

    (up to 32768 bytes per frag, that's order-3 pages on x86)

    This increases TCP stream performance by 20% on loopback device,
    but also benefits on other network devices, since 8x less frags are
    mapped on transmit and unmapped on tx completion. Alexander Duyck
    mentioned a probable performance win on systems with IOMMU enabled.

    It's possible some SG-enabled hardware can't cope with bigger fragments,
    but its ndo_start_xmit() should already handle this, splitting a
    fragment into sub-fragments, since some arches have PAGE_SIZE = 65536.

    Successfully tested on various ethernet devices.
    (ixgbe, igb, bnx2x, tg3, mellanox mlx4)

    Signed-off-by: Eric Dumazet
    Cc: Ben Hutchings
    Cc: Vijay Subramanian
    Cc: Alexander Duyck
    Tested-by: Vijay Subramanian
    Signed-off-by: David S. Miller

    Eric Dumazet
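The allocation strategy described above (try an order-3 page first, fall back under memory pressure) can be modeled roughly as below. This is a hedged Python sketch; `alloc_pages` is a stand-in for the kernel page allocator, not a real API call.

```python
PAGE_SIZE = 4096

def alloc_frag_page(alloc_pages, max_order=3):
    """Try a 32KB (order-3) page first, then fall back to smaller orders.

    `alloc_pages(order)` is a stand-in for the kernel allocator; here it
    returns None when an allocation of that order fails under pressure.
    Returns (page, usable_bytes) or (None, 0) if even order-0 fails.
    """
    for order in range(max_order, -1, -1):
        page = alloc_pages(order)
        if page is not None:
            return page, PAGE_SIZE << order  # bytes usable for frags
    return None, 0
```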
     

20 Sep, 2012

1 commit

  • If the old timestamps of a class, say cl, are stale when the class
    becomes active, then QFQ may assign to cl a much higher start time
    than the maximum value allowed. This may happen when QFQ assigns to
    the start time of cl the finish time of a group whose classes are
    characterized by a higher value of the ratio
    max_class_pkt/weight_of_the_class with respect to that of
    cl. Inserting a class with too high a start time into the bucket list
    corrupts the data structure and may eventually lead to crashes.
    This patch limits the maximum start time assigned to a class.

    Signed-off-by: Paolo Valente
    Signed-off-by: David S. Miller

    Paolo Valente
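The limiting step can be pictured with a small Python model. The bound expression is illustrative only (the real code derives it from the bucket-list geometry); the names are invented for the sketch.

```python
def cap_class_start(start, virtual_time, num_slots, slot_size):
    """Clamp a class's start time so its slot index fits in the bucket list.

    Anything beyond the last representable slot relative to the current
    virtual time is capped, preventing out-of-range bucket insertion.
    """
    limit = virtual_time + (num_slots - 1) * slot_size
    return min(start, limit)
```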
     

15 Sep, 2012

3 commits

  • Conflicts:
    net/netfilter/nfnetlink_log.c
    net/netfilter/xt_LOG.c

    Rather easy conflict resolution, the 'net' tree had bug fixes to make
    sure we checked if a socket is a time-wait one or not and elide the
    logging code if so.

    Whereas on the 'net-next' side we are calculating the UID and GID from
    the creds using different interfaces due to the user namespace changes
    from Eric Biederman.

    Signed-off-by: David S. Miller

    David S. Miller
     
  • Currently, cgroup hierarchy support is a mess. cpu related subsystems
    behave correctly - configuration, accounting and control on a parent
    properly cover its children. blkio and freezer completely ignore
    hierarchy and treat all cgroups as if they're directly under the root
    cgroup. Others show yet different behaviors.

    These differing interpretations of cgroup hierarchy make using cgroup
    confusing and make it impossible to co-mount controllers into the same
    hierarchy and obtain sane behavior.

    Eventually, we want full hierarchy support from all subsystems and
    probably a unified hierarchy. Having users rely on separate
    hierarchies for completely different behaviors depending on the mounted
    subsystem is detrimental to making any progress on this front.

    This patch adds cgroup_subsys.broken_hierarchy and sets it to %true
    for controllers which are lacking in hierarchy support. The goal of
    this patch is two-fold.

    * Move users away from using hierarchy on currently non-hierarchical
    subsystems, so that implementing proper hierarchy support on those
    doesn't surprise them.

    * Keep track of which controllers are broken how and nudge the
    subsystems to implement proper hierarchy support.

    For now, start with a single warning message. We can whine louder
    later on.

    v2: Fixed a typo spotted by Michal. Warning message updated.

    v3: Updated memcg part so that it doesn't generate warning in the
    cases where .use_hierarchy=false doesn't make the behavior
    different from root.use_hierarchy=true. Fixed a typo spotted by
    Glauber.

    v4: Check ->broken_hierarchy after cgroup creation is complete so that
    ->create() can affect the result per Michal. Dropped unnecessary
    memcg root handling per Michal.

    Signed-off-by: Tejun Heo
    Acked-by: Michal Hocko
    Acked-by: Li Zefan
    Acked-by: Serge E. Hallyn
    Cc: Glauber Costa
    Cc: Peter Zijlstra
    Cc: Paul Turner
    Cc: Johannes Weiner
    Cc: Thomas Graf
    Cc: Vivek Goyal
    Cc: Paul Mackerras
    Cc: Ingo Molnar
    Cc: Arnaldo Carvalho de Melo
    Cc: Neil Horman
    Cc: Aneesh Kumar K.V

    Tejun Heo
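The check the patch adds boils down to a simple condition. Sketched in Python with invented names, purely to illustrate when the warning fires:

```python
def should_warn_broken_hierarchy(subsys_broken, parent_is_root, warned_before):
    """Warn (once) when a cgroup is created below the root for a subsystem
    that declares broken hierarchy support -- a model of the behavior
    described above, checked after cgroup creation completes (v4)."""
    return subsys_broken and not parent_is_root and not warned_before
```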
     
  • WARNING: With this change it is impossible to load externally built
    controllers anymore.

    In the case where CONFIG_NETPRIO_CGROUP=m and CONFIG_NET_CLS_CGROUP=m
    are set, the corresponding subsys_id should also be a constant. Up to
    now, net_prio_subsys_id and net_cls_subsys_id were of type int and
    the value was assigned during runtime.

    By switching the macro definition IS_SUBSYS_ENABLED from IS_BUILTIN
    to IS_ENABLED, all *_subsys_id will have constant value. That means we
    need to remove all the code which assumes a value can be assigned to
    net_prio_subsys_id and net_cls_subsys_id.

    A close look is necessary at the RCU part, which was introduced by the
    following patch:

    commit f845172531fb7410c7fb7780b1a6e51ee6df7d52
    Author: Herbert Xu Mon May 24 09:12:34 2010
    Committer: David S. Miller Mon May 24 09:12:34 2010

    cls_cgroup: Store classid in struct sock

    This code was added to init_cgroup_cls()

    /* We can't use rcu_assign_pointer because this is an int. */
    smp_wmb();
    net_cls_subsys_id = net_cls_subsys.subsys_id;

    respectively to exit_cgroup_cls()

    net_cls_subsys_id = -1;
    synchronize_rcu();

    and in module version of task_cls_classid()

    rcu_read_lock();
    id = rcu_dereference(net_cls_subsys_id);
    if (id >= 0)
            classid = container_of(task_subsys_state(p, id),
                                   struct cgroup_cls_state, css)->classid;
    rcu_read_unlock();

    Without an explicit explanation of why the RCU part is needed. (The
    rcu_dereference() was fixed by exchanging it for
    rcu_dereference_index_check() in a later commit, but that is a minor
    detail.)

    So here is my pondering on why it was introduced and why it is safe to
    remove it now. Note that since this code was copied over to net_prio,
    the reasoning holds for that subsystem too.

    The idea behind the RCU use for net_cls_subsys_id is to make sure we
    get a valid pointer back from task_subsys_state(). task_subsys_state()
    is just blindly accessing the subsys array and returning the
    pointer. Obviously, passing in -1 as id into task_subsys_state()
    returns an invalid value (out of lower bound).

    So this code makes sure that the id is assigned only after the module
    is loaded and the subsystem registered.

    Before unregistering the module, all old readers must have left the
    critical section. This is done by assigning -1 to the id and issuing a
    synchronize_rcu(). Any new readers won't call task_subsys_state()
    anymore and therefore it is safe to unregister the subsystem.

    The new code relies on the same trick, but it looks at the subsys
    pointer returned by task_subsys_state() (remember the id is constant
    and therefore we always have a valid index into the subsys
    array).

    No precautions need to be taken during module loading. Eventually,
    all CPUs will get a valid pointer back from task_subsys_state(),
    because rebind_subsystem(), which is called after the module init()
    function, will assign subsys[net_cls_subsys_id] the newly loaded
    module's subsystem pointer.

    When the subsystem is about to be removed, rebind_subsystem() will be
    called before the module exit() function. In this case,
    rebind_subsys() will assign subsys[net_cls_subsys_id] a NULL pointer
    and then call synchronize_rcu(). By then, all old readers have left
    the critical section. Any new reader won't access the subsystem
    anymore. At this point we are safe to unregister the subsystem. No
    additional synchronize_rcu() call is needed.

    Signed-off-by: Daniel Wagner
    Signed-off-by: Tejun Heo
    Acked-by: Li Zefan
    Acked-by: Neil Horman
    Cc: "David S. Miller"
    Cc: "Paul E. McKenney"
    Cc: Andrew Morton
    Cc: Eric Dumazet
    Cc: Gao feng
    Cc: Glauber Costa
    Cc: Herbert Xu
    Cc: Jamal Hadi Salim
    Cc: John Fastabend
    Cc: Kamezawa Hiroyuki
    Cc: netdev@vger.kernel.org
    Cc: cgroups@vger.kernel.org

    Daniel Wagner
     

14 Sep, 2012

4 commits

  • gred_dequeue() and gred_drop() do not seem to get called when the
    queue is empty, meaning that we never start idling while in WRED
    mode. And since qidlestart is not stored by gred_store_wred_set(),
    we would never stop idling while in WRED mode if we ever started.
    This messes up the average queue size calculation that influences
    packet marking/dropping behavior.

    Now, we start WRED mode idling as we are removing the last packet
    from the queue. Also we now actually stop WRED mode idling when we
    are enqueuing a packet.

    Cc: Bruce Osler
    Signed-off-by: David Ward
    Acked-by: Jamal Hadi Salim
    Signed-off-by: David S. Miller

    David Ward
     
  • q->vars.qavg is a Wlog-scaled value, but q->backlog is not. In order
    to pass q->vars.qavg as the backlog value, we need to un-scale it.
    Additionally, the qave value returned via netlink should not be
    Wlog-scaled, so we need to un-scale the result of red_calc_qavg().

    This caused artificially high values for "Average Queue" to be shown
    by 'tc -s -d qdisc', but did not affect the actual operation of GRED.

    Signed-off-by: David Ward
    Acked-by: Jamal Hadi Salim
    Signed-off-by: David S. Miller

    David Ward
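The scaling issue is plain fixed-point arithmetic; a Python illustration follows. The shift amounts in the test are examples, not values GRED necessarily uses for a given configuration.

```python
def unscale_qavg(qavg_scaled, wlog):
    """red_calc_qavg() works on a value scaled by 2**Wlog; un-scale it
    before reporting it (e.g. via netlink)."""
    return qavg_scaled >> wlog

def scale_backlog(backlog, wlog):
    """Conversely, a raw backlog must be scaled up before being compared
    with (or substituted for) a Wlog-scaled average."""
    return backlog << wlog
```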
     
  • Each pair of DPs only needs to be compared once when searching for
    a non-unique prio value.

    Signed-off-by: David Ward
    Acked-by: Jamal Hadi Salim
    Signed-off-by: David S. Miller

    David Ward
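The optimization is the classic triangular loop. A Python sketch of checking DP prio uniqueness with each pair compared exactly once (function name invented for the sketch):

```python
def prio_is_unique(prios):
    """Compare each pair (i, j) with j > i exactly once, instead of
    visiting both (i, j) and (j, i) as a full double loop would."""
    for i in range(len(prios)):
        for j in range(i + 1, len(prios)):
            if prios[i] == prios[j]:
                return False
    return True
```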
     
  • Signed-off-by: David Ward
    Acked-by: Jamal Hadi Salim
    Signed-off-by: David S. Miller

    David Ward
     

12 Sep, 2012

1 commit

  • It's possible to set up a bad cbq configuration leading to
    an infinite loop in cbq_classify():

    DEV_OUT=eth0
    ICMP="match ip protocol 1 0xff"
    U32="protocol ip u32"
    DST="match ip dst"
    tc qdisc add dev $DEV_OUT root handle 1: cbq avpkt 1000 \
    bandwidth 100mbit
    tc class add dev $DEV_OUT parent 1: classid 1:1 cbq \
    rate 512kbit allot 1500 prio 5 bounded isolated
    tc filter add dev $DEV_OUT parent 1: prio 3 $U32 \
    $ICMP $DST 192.168.3.234 flowid 1:

    Reported-by: Denys Fedoryschenko
    Tested-by: Denys Fedoryschenko
    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

11 Sep, 2012

1 commit

  • It is a frequent mistake to confuse the netlink port identifier with a
    process identifier. Try to reduce this confusion by renaming the fields
    that hold port identifiers to portid instead of pid.

    I have carefully avoided changing the structures exported to
    userspace to avoid changing the userspace API.

    I have successfully built an allyesconfig kernel with this change.

    Signed-off-by: "Eric W. Biederman"
    Acked-by: Stephen Hemminger
    Signed-off-by: David S. Miller

    Eric W. Biederman
     

06 Sep, 2012

1 commit

  • It seems we need to provide the ability for stacked devices
    to use a specific lock_class_key for sch->busylock.

    We could instead default the l2tpeth tx_queue_len to 0 (no qdisc), but
    a user might use a qdisc anyway.

    (So the same fixes are probably needed on non-LLTX stacked drivers.)

    Noticed while stressing an L2TPv3 setup:

    ======================================================
    [ INFO: possible circular locking dependency detected ]
    3.6.0-rc3+ #788 Not tainted
    -------------------------------------------------------
    netperf/4660 is trying to acquire lock:
    (l2tpsock){+.-...}, at: [] l2tp_xmit_skb+0x172/0xa50 [l2tp_core]

    but task is already holding lock:
    (&(&sch->busylock)->rlock){+.-...}, at: [] dev_queue_xmit+0xd75/0xe00

    which lock already depends on the new lock.

    the existing dependency chain (in reverse order) is:

    -> #1 (&(&sch->busylock)->rlock){+.-...}:
    [] lock_acquire+0x90/0x200
    [] _raw_spin_lock_irqsave+0x4c/0x60
    [] __wake_up+0x32/0x70
    [] tty_wakeup+0x3e/0x80
    [] pty_write+0x73/0x80
    [] tty_put_char+0x3c/0x40
    [] process_echoes+0x142/0x330
    [] n_tty_receive_buf+0x8fb/0x1230
    [] flush_to_ldisc+0x142/0x1c0
    [] process_one_work+0x198/0x760
    [] worker_thread+0x186/0x4b0
    [] kthread+0x93/0xa0
    [] kernel_thread_helper+0x4/0x10

    -> #0 (l2tpsock){+.-...}:
    [] __lock_acquire+0x1628/0x1b10
    [] lock_acquire+0x90/0x200
    [] _raw_spin_lock+0x41/0x50
    [] l2tp_xmit_skb+0x172/0xa50 [l2tp_core]
    [] l2tp_eth_dev_xmit+0x32/0x60 [l2tp_eth]
    [] dev_hard_start_xmit+0x502/0xa70
    [] sch_direct_xmit+0xfe/0x290
    [] dev_queue_xmit+0x1e5/0xe00
    [] ip_finish_output+0x3d0/0x890
    [] ip_output+0x59/0xf0
    [] ip_local_out+0x2d/0xa0
    [] ip_queue_xmit+0x1c3/0x680
    [] tcp_transmit_skb+0x402/0xa60
    [] tcp_write_xmit+0x1f4/0xa30
    [] tcp_push_one+0x30/0x40
    [] tcp_sendmsg+0xe82/0x1040
    [] inet_sendmsg+0x125/0x230
    [] sock_sendmsg+0xdc/0xf0
    [] sys_sendto+0xfe/0x130
    [] system_call_fastpath+0x16/0x1b
    Possible unsafe locking scenario:

    CPU0                                CPU1
    ----                                ----
    lock(&(&sch->busylock)->rlock);
                                        lock(l2tpsock);
                                        lock(&(&sch->busylock)->rlock);
    lock(l2tpsock);

    *** DEADLOCK ***

    5 locks held by netperf/4660:
    #0: (sk_lock-AF_INET){+.+.+.}, at: [] tcp_sendmsg+0x2c/0x1040
    #1: (rcu_read_lock){.+.+..}, at: [] ip_queue_xmit+0x0/0x680
    #2: (rcu_read_lock_bh){.+....}, at: [] ip_finish_output+0x135/0x890
    #3: (rcu_read_lock_bh){.+....}, at: [] dev_queue_xmit+0x0/0xe00
    #4: (&(&sch->busylock)->rlock){+.-...}, at: [] dev_queue_xmit+0xd75/0xe00

    stack backtrace:
    Pid: 4660, comm: netperf Not tainted 3.6.0-rc3+ #788
    Call Trace:
    [] print_circular_bug+0x1fb/0x20c
    [] __lock_acquire+0x1628/0x1b10
    [] ? check_usage+0x9b/0x4d0
    [] ? __lock_acquire+0x2e4/0x1b10
    [] lock_acquire+0x90/0x200
    [] ? l2tp_xmit_skb+0x172/0xa50 [l2tp_core]
    [] _raw_spin_lock+0x41/0x50
    [] ? l2tp_xmit_skb+0x172/0xa50 [l2tp_core]
    [] l2tp_xmit_skb+0x172/0xa50 [l2tp_core]
    [] l2tp_eth_dev_xmit+0x32/0x60 [l2tp_eth]
    [] dev_hard_start_xmit+0x502/0xa70
    [] ? dev_hard_start_xmit+0x5e/0xa70
    [] ? dev_queue_xmit+0x141/0xe00
    [] sch_direct_xmit+0xfe/0x290
    [] dev_queue_xmit+0x1e5/0xe00
    [] ? dev_hard_start_xmit+0xa70/0xa70
    [] ip_finish_output+0x3d0/0x890
    [] ? ip_finish_output+0x135/0x890
    [] ip_output+0x59/0xf0
    [] ip_local_out+0x2d/0xa0
    [] ip_queue_xmit+0x1c3/0x680
    [] ? ip_local_out+0xa0/0xa0
    [] tcp_transmit_skb+0x402/0xa60
    [] ? tcp_md5_do_lookup+0x18e/0x1a0
    [] tcp_write_xmit+0x1f4/0xa30
    [] tcp_push_one+0x30/0x40
    [] tcp_sendmsg+0xe82/0x1040
    [] inet_sendmsg+0x125/0x230
    [] ? inet_create+0x6b0/0x6b0
    [] ? sock_update_classid+0xc2/0x3b0
    [] ? sock_update_classid+0x130/0x3b0
    [] sock_sendmsg+0xdc/0xf0
    [] ? fget_light+0x3f9/0x4f0
    [] sys_sendto+0xfe/0x130
    [] ? trace_hardirqs_on+0xd/0x10
    [] ? _raw_spin_unlock_irq+0x30/0x50
    [] ? finish_task_switch+0x83/0xf0
    [] ? finish_task_switch+0x46/0xf0
    [] ? sysret_check+0x1b/0x56
    [] system_call_fastpath+0x16/0x1b

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

04 Sep, 2012

1 commit

  • When fq_codel builds a new flow, it should not reset codel state.

    The Codel algorithm needs its previous values (lastcount, drop_next)
    to behave properly.

    Signed-off-by: Dave Taht
    Signed-off-by: Eric Dumazet
    Acked-by: Dave Taht
    Signed-off-by: David S. Miller

    Eric Dumazet
     

17 Aug, 2012

1 commit

  • We drop the packet unconditionally when we fail to mirror it. This is
    not intended in some cases. Consider a kvm guest: we may mirror the
    traffic of the bridge to a tap device used by a VM. When the kernel
    fails to mirror the packet, in conditions such as when qemu crashes or
    stops polling the tap, it's hard for the management software to detect
    such a condition and clean up the mirroring beforehand. This would
    cause all packets to the bridge to be dropped and break the network of
    the other virtual machines.

    To solve the issue, the patch does not drop packets when the kernel
    fails to mirror them, and only drops the redirected packets.

    Signed-off-by: Jason Wang
    Signed-off-by: Jamal Hadi Salim
    Signed-off-by: David S. Miller

    Jason Wang
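The resulting policy can be stated in a few lines. A hedged Python model follows; the return values are stand-ins for the kernel's TC_ACT_* codes, and the function name is invented.

```python
def mirred_verdict(xmit_failed, is_redirect):
    """After this change, a failed *redirect* still drops the packet,
    but a failed *mirror* lets the original packet continue on its way."""
    if xmit_failed and is_redirect:
        return "drop"
    return "continue"
```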
     

15 Aug, 2012

3 commits

  • The flow classifier can use the uids and gids of the sockets that
    are transmitting packets, and inserts those uids and gids
    into the packet classification calculation. I don't fully
    understand the details, but it appears that we can depend
    on specific uids and gids when making traffic classification
    decisions.

    To work with user namespaces enabled, map from kuids and kgids
    into uids and gids in the initial user namespace, giving raw
    integer values the code can play with and depend on.

    To avoid issues of userspace depending on uids and gids in
    packet classifiers installed from other user namespaces
    and getting confused, deny all packet classifiers that
    use uids or gids that are not coming from a netlink socket
    in the initial user namespace.

    Cc: Patrick McHardy
    Cc: Eric Dumazet
    Cc: Jamal Hadi Salim
    Cc: Changli Gao
    Acked-by: David S. Miller
    Acked-by: Serge Hallyn
    Signed-off-by: Eric W. Biederman

    Eric W. Biederman
     
  • cls_flow.c plays with uids and gids. Unless I misread that
    code it is possible for classifiers to depend on the specific uid and
    gid values. Therefore I need to know the user namespace of the
    netlink socket that is installing the packet classifiers. Pass
    in the rtnetlink skb so I can access the NETLINK_CB of the passed
    packet. In particular I want access to sk_user_ns(NETLINK_CB(in_skb).ssk).

    Pass in not the user namespace but the incoming rtnetlink skb into
    the classifier change routines, as that is generally the more useful
    parameter.

    Cc: Jamal Hadi Salim
    Acked-by: David S. Miller
    Acked-by: Serge Hallyn
    Signed-off-by: Eric W. Biederman

    Eric W. Biederman
     
  • I believe net/core/dev.c is a better place for netif_notify_peers(),
    because other net event notify functions also stay in this file.

    And rename it to netdev_notify_peers().

    Cc: David S. Miller
    Cc: Ian Campbell
    Signed-off-by: Cong Wang
    Signed-off-by: David S. Miller

    Amerigo Wang
     

09 Aug, 2012

1 commit

  • To speed up operations, QFQ internally divides classes into
    groups. Which group a class belongs to depends on the ratio between
    the maximum packet length and the weight of the class. Unfortunately
    the function qfq_change_class lacks the steps for changing the group
    of a class when the ratio max_pkt_len/weight of the class changes.

    For example, when the last of the following three commands is
    executed, the group of class 1:1 is not correctly changed:

    tc qdisc add dev XXX root handle 1: qfq
    tc class add dev XXX parent 1: qfq classid 1:1 weight 1
    tc class change dev XXX parent 1: classid 1:1 qfq weight 4

    Not changing the group of a class does not affect the long-term
    bandwidth guaranteed to the class, as the latter is independent of the
    maximum packet length, and correctly changes (only) if the weight of
    the class changes. In contrast, if the group of the class is not
    updated, the class is still guaranteed the short-term bandwidth and
    packet delay related to its old group, instead of the guarantees that
    it should receive according to its new weight and/or maximum packet
    length. This may also break service guarantees for other classes.
    This patch adds the missing operations.
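    The grouping criterion described above can be modeled in a short
    sketch (illustrative Python only, not the kernel's fixed-point code,
    which works with inverse weights): a class whose ratio lmax/weight
    falls in [2^i, 2^(i+1)) belongs to group i, so changing the weight
    from 1 to 4 must move the class two groups down.

```python
import math

def qfq_group_index(lmax, weight):
    # Illustrative model: group i holds classes whose ratio
    # lmax/weight lies in [2^i, 2^(i+1)).
    return int(math.floor(math.log2(lmax / weight)))

# With lmax = 1514 (a typical Ethernet frame):
# weight 1 -> group 10, weight 4 -> group 8, so qfq_change_class
# must move the class between bucket-list groups.
```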

    Signed-off-by: Paolo Valente
    Signed-off-by: David S. Miller

    Paolo Valente
     

07 Aug, 2012

1 commit


04 Aug, 2012

1 commit


24 Jul, 2012

1 commit

  • Use inet_iif() consistently, and for TCP record the input interface of
    cached RX dst in inet sock.

    rt->rt_iif is going to be encoded differently, so that we can
    legitimately cache input routes in the FIB info more aggressively.

    When the input interface is "use SKB device index" the rt->rt_iif will
    be set to zero.

    This forces us to move the TCP RX dst cache installation into the
    ipv4-specific code, and rightly so: doing the route caching for ipv6
    is pointless at the moment, since it is not inspected in the ipv6
    input paths yet.

    Also, remove the unlikely() on dst->obsolete; all ipv4 dsts have
    obsolete set to a non-zero value to force invocation of the check
    callback.
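    The rt_iif convention can be summarized in a tiny sketch (a
    hypothetical Python model of the selection rule, not the kernel's
    inet_iif() itself):

```python
def input_interface(rt_iif, skb_iif):
    # A cached rt_iif of 0 means "use the SKB device index",
    # so fall back to the skb's input interface in that case.
    return rt_iif if rt_iif != 0 else skb_iif
```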

    Signed-off-by: David S. Miller

    David S. Miller
     

20 Jul, 2012

1 commit


17 Jul, 2012

1 commit

  • netem does an early orphaning of skbs. Doing so breaks TCP Small
    Queues or any mechanism relying on socket sk_wmem_alloc feedback.

    Ideally, we should perform this orphaning after the rate module and
    before the delay module, to mimic what happens on a real link:

    skb orphaning is indeed normally done at TX completion, before the
    transit on the link.

    +-------+   +--------+  +---------------+  +-----------------+
    + Qdisc +---> Device +--> TX completion +--> links / hops    +->
    +       +   +  xmit  +  + skb orphaning +  +  propagation    +
    +-------+   +--------+  +---------------+  +-----------------+
      < rate limiting >       < delay, drops, reorders >

    If netem is used without the delay feature (only drops, reorders, or
    rate limiting), then we should avoid early skb orphaning, to keep
    pressure on sockets as long as packets are still in the qdisc queue.

    Ideally, netem should be refactored to implement the delay module as
    the last stage. The current algorithm merges the two phases (rate
    limiting + delay), so it's not correct.
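    The resulting policy can be sketched as follows (hypothetical helper
    name; the actual patch gates skb_orphan() on netem's configured
    delay/jitter):

```python
def should_orphan_early(latency, jitter):
    # Orphan skbs early only when netem will hold them for some time
    # (delay/jitter configured); otherwise keep socket backpressure
    # (sk_wmem_alloc feedback, TCP Small Queues) intact.
    return latency != 0 or jitter != 0
```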

    Signed-off-by: Eric Dumazet
    Cc: Hagen Paul Pfeifer
    Cc: Mark Gordon
    Cc: Andreas Terzis
    Cc: Yuchung Cheng
    Acked-by: Stephen Hemminger
    Signed-off-by: David S. Miller

    Eric Dumazet
     

12 Jul, 2012

2 commits

  • Resolves-bug: https://bugzilla.kernel.org/show_bug.cgi?id=44461

    Signed-off-by: Alan Cox
    Acked-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Alan Cox
     
  • Can be used to match packets against netfilter IP sets created via ipset(8).
    skb->skb_iif is used as the 'incoming interface', skb->dev as the 'outgoing interface'.

    Since ipset is usually called from netfilter, the ematch
    initializes a fake xt_action_param, pulls the ip header into the
    linear area and also sets skb->data to the IP header (otherwise
    matching Layer 4 set types doesn't work).
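    A hypothetical usage example (the set name is made up, and the exact
    ematch syntax may vary between iproute2 versions), classifying
    packets whose source address is in an ipset:

```shell
# Create a set, then match packets whose source address is in it.
ipset create blocklist hash:ip
tc filter add dev eth0 parent 1: protocol ip basic \
    match "ipset(blocklist src)" classid 1:1
```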

    Tested-by: Mr Dash Four
    Signed-off-by: Florian Westphal
    Signed-off-by: David S. Miller

    Florian Westphal
     

11 Jul, 2012

1 commit

  • Conflicts:
    net/batman-adv/bridge_loop_avoidance.c
    net/batman-adv/bridge_loop_avoidance.h
    net/batman-adv/soft-interface.c
    net/mac80211/mlme.c

    With merge help from Antonio Quartulli (batman-adv) and
    Stephen Rothwell (drivers/net/usb/qmi_wwan.c).

    The net/mac80211/mlme.c conflict seemed easy enough, accounting for a
    conversion to some new tracing macros.

    Signed-off-by: David S. Miller

    David S. Miller
     

09 Jul, 2012

1 commit

  • Fix two netem bugs:

    1) When a frame was dropped by tfifo_enqueue(), the drop counter
    was incremented twice.

    2) When reordering is triggered, we enqueue a packet without
    checking the queue limit. This can OOM pretty fast when repeated
    enough; since the skbs are orphaned, no socket limit can help in
    this situation.
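    A toy model of the two fixes (illustrative Python, not the kernel
    code): every enqueue path, including the reorder path, must check
    the queue limit first, and each dropped frame must bump the drop
    counter exactly once.

```python
def tfifo_enqueue_model(queue, pkt, limit, stats):
    # Check the limit on every path (normal and reorder enqueue alike).
    if len(queue) >= limit:
        stats["drops"] += 1   # counted once, not twice
        return False
    queue.append(pkt)
    return True
```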

    Signed-off-by: Eric Dumazet
    Cc: Mark Gordon
    Cc: Andreas Terzis
    Cc: Yuchung Cheng
    Cc: Hagen Paul Pfeifer
    Signed-off-by: David S. Miller

    Eric Dumazet
     

08 Jul, 2012

1 commit


05 Jul, 2012

1 commit


04 Jul, 2012

1 commit


27 Jun, 2012

1 commit