Doug / smarc-fsl-linux-kernel | Embedian Git Server

01 Oct, 2013

1 commit

b86783587 net: flow_dissector: fix thoff for IPPROTO_AH ... Browse Code »

In commit 8ed781668dd49 ("flow_keys: include thoff into flow_keys for
later usage"), we missed that existing code was using nhoff as a
temporary variable that could not always contain transport header
offset.

This is not a problem for TCP/UDP because port offset (@poff)
is 0 for these protocols.

Signed-off-by: Eric Dumazet
Cc: Daniel Borkmann
Cc: Nikolay Aleksandrov
Acked-by: Nikolay Aleksandrov
Acked-by: Daniel Borkmann
Signed-off-by: David S. Miller

Eric Dumazet
2013-10-01 03:32:05 +0800

29 Sep, 2013

2 commits

9a3bab6b0 net: net_secret should not depend on TCP ... Browse Code »

A host might need net_secret[] and never open a single socket.

Problem added in commit aebda156a570782
("net: defer net_secret[] initialization")

Based on prior patch from Hannes Frederic Sowa.

Reported-by: Hannes Frederic Sowa
Signed-off-by: Eric Dumazet
Acked-by: Hannes Frederic Sowa
Signed-off-by: David S. Miller

Eric Dumazet
2013-09-29 06:19:40 +0800
50624c934 net: Delay default_device_exit_batch until no devices are unregistering v2 ... Browse Code »

There is currently serialization network namespaces exiting and
network devices exiting as the final part of netdev_run_todo does not
happen under the rtnl_lock. This is compounded by the fact that the
only list of devices unregistering in netdev_run_todo is local to the
netdev_run_todo.

This lack of serialization in extreme cases results in network devices
unregistering in netdev_run_todo after the loopback device of their
network namespace has been freed (making dst_ifdown unsafe), and after
the their network namespace has exited (making the NETDEV_UNREGISTER,
and NETDEV_UNREGISTER_FINAL callbacks unsafe).

Add the missing serialization by a per network namespace count of how
many network devices are unregistering and having a wait queue that is
woken up whenever the count is decreased. The count and wait queue
allow default_device_exit_batch to wait until all of the unregistration
activity for a network namespace has finished before proceeding to
unregister the loopback device and then allowing the network namespace
to exit.

Only a single global wait queue is used because there is a single global
lock, and there is a single waiter, per network namespace wait queues
would be a waste of resources.

The per network namespace count of unregistering devices gives a
progress guarantee because the number of network devices unregistering
in an exiting network namespace must ultimately drop to zero (assuming
network device unregistration completes).

The basic logic remains the same as in v1. This patch is now half
comment and half rtnl_lock_unregistering an expanded version of
wait_event performs no extra work in the common case where no network
devices are unregistering when we get to default_device_exit_batch.

Reported-by: Francesco Ruggeri
Signed-off-by: "Eric W. Biederman"
Signed-off-by: David S. Miller

Eric W. Biederman
2013-09-29 06:09:15 +0800

20 Sep, 2013

1 commit

d0fe8c888 netpoll: fix NULL pointer dereference in netpoll_cleanup ... Browse Code »

I've been hitting a NULL ptr deref while using netconsole because the
np->dev check and the pointer manipulation in netpoll_cleanup are done
without rtnl and the following sequence happens when having a netconsole
over a vlan and we remove the vlan while disabling the netconsole:
CPU 1 CPU2
removes vlan and calls the notifier
enters store_enabled(), calls
netdev_cleanup which checks np->dev
and then waits for rtnl
executes the netconsole netdev
release notifier making np->dev
== NULL and releases rtnl
continues to dereference a member of
np->dev which at this point is == NULL

Signed-off-by: Nikolay Aleksandrov
Signed-off-by: David S. Miller

Nikolay Aleksandrov
2013-09-20 02:15:53 +0800

13 Sep, 2013

1 commit

b0dd663b6 netpoll: Should handle ETH_P_ARP other than ETH_P_IP in netpoll_neigh_reply ... Browse Code »

The received ARP request type in the Ethernet packet head is ETH_P_ARP other than ETH_P_IP.

[ Bug introduced by commit b7394d2429c198b1da3d46ac39192e891029ec0f
("netpoll: prepare for ipv6") ]

Signed-off-by: Sonic Zhang
Signed-off-by: David S. Miller

Sonic Zhang
2013-09-13 05:19:14 +0800

12 Sep, 2013

1 commit

50d1784ee net: fix multiqueue selection ... Browse Code »

commit 416186fbf8c5b4e4465 ("net: Split core bits of netdev_pick_tx
into __netdev_pick_tx") added a bug that disables caching of queue
index in the socket.

This is the source of packet reorders for TCP flows, and
again this is happening more often when using FQ pacing.

Old code was doing

if (queue_index != old_index)
sk_tx_queue_set(sk, queue_index);

Alexander renamed the variables but forgot to change sk_tx_queue_set()
2nd parameter.

if (queue_index != new_index)
sk_tx_queue_set(sk, queue_index);

This means we store -1 over and over in sk->sk_tx_queue_mapping

Signed-off-by: Eric Dumazet
Cc: Alexander Duyck
Acked-by: Alexander Duyck
Signed-off-by: David S. Miller

Eric Dumazet
2013-09-12 04:10:00 +0800

08 Sep, 2013

1 commit

c7c4591db Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace ... Browse Code »

Pull namespace changes from Eric Biederman:
"This is an assorted mishmash of small cleanups, enhancements and bug
fixes.

The major theme is user namespace mount restrictions. nsown_capable
is killed as it encourages not thinking about details that need to be
considered. A very hard to hit pid namespace exiting bug was finally
tracked and fixed. A couple of cleanups to the basic namespace
infrastructure.

Finally there is an enhancement that makes per user namespace
capabilities usable as capabilities, and an enhancement that allows
the per userns root to nice other processes in the user namespace"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace:
userns: Kill nsown_capable it makes the wrong thing easy
capabilities: allow nice if we are privileged
pidns: Don't have unshare(CLONE_NEWPID) imply CLONE_THREAD
userns: Allow PR_CAPBSET_DROP in a user namespace.
namespaces: Simplify copy_namespaces so it is clear what is going on.
pidns: Fix hang in zap_pid_ns_processes by sending a potentially extra wakeup
sysfs: Restrict mounting sysfs
userns: Better restrictions on when proc and sysfs can be mounted
vfs: Don't copy mount bind mounts of /proc//ns/mnt between namespaces
kernel/nsproxy.c: Improving a snippet of code.
proc: Restrict mounting the proc filesystem
vfs: Lock in place mounts from more privileged users

Linus Torvalds
2013-09-08 05:35:32 +0800

06 Sep, 2013

2 commits

cc998ff88 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next ... Browse Code »

Pull networking changes from David Miller:
"Noteworthy changes this time around:

1) Multicast rejoin support for team driver, from Jiri Pirko.

2) Centralize and simplify TCP RTT measurement handling in order to
reduce the impact of bad RTO seeding from SYN/ACKs. Also, when
both timestamps and local RTT measurements are available prefer
the later because there are broken middleware devices which
scramble the timestamp.

From Yuchung Cheng.

3) Add TCP_NOTSENT_LOWAT socket option to limit the amount of kernel
memory consumed to queue up unsend user data. From Eric Dumazet.

4) Add a "physical port ID" abstraction for network devices, from
Jiri Pirko.

5) Add a "suppress" operation to influence fib_rules lookups, from
Stefan Tomanek.

6) Add a networking development FAQ, from Paul Gortmaker.

7) Extend the information provided by tcp_probe and add ipv6 support,
from Daniel Borkmann.

8) Use RCU locking more extensively in openvswitch data paths, from
Pravin B Shelar.

9) Add SCTP support to openvswitch, from Joe Stringer.

10) Add EF10 chip support to SFC driver, from Ben Hutchings.

11) Add new SYNPROXY netfilter target, from Patrick McHardy.

12) Compute a rate approximation for sending in TCP sockets, and use
this to more intelligently coalesce TSO frames. Furthermore, add
a new packet scheduler which takes advantage of this estimate when
available. From Eric Dumazet.

13) Allow AF_PACKET fanouts with random selection, from Daniel
Borkmann.

14) Add ipv6 support to vxlan driver, from Cong Wang"

Resolved conflicts as per discussion.

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1218 commits)
openvswitch: Fix alignment of struct sw_flow_key.
netfilter: Fix build errors with xt_socket.c
tcp: Add missing braces to do_tcp_setsockopt
caif: Add missing braces to multiline if in cfctrl_linkup_request
bnx2x: Add missing braces in bnx2x:bnx2x_link_initialize
vxlan: Fix kernel panic on device delete.
net: mvneta: implement ->ndo_do_ioctl() to support PHY ioctls
net: mvneta: properly disable HW PHY polling and ensure adjust_link() works
icplus: Use netif_running to determine device state
ethernet/arc/arc_emac: Fix huge delays in large file copies
tuntap: orphan frags before trying to set tx timestamp
tuntap: purge socket error queue on detach
qlcnic: use standard NAPI weights
ipv6:introduce function to find route for redirect
bnx2x: VF RSS support - VF side
bnx2x: VF RSS support - PF side
vxlan: Notify drivers for listening UDP port changes
net: usbnet: update addr_assign_type if appropriate
driver/net: enic: update enic maintainers and driver
driver/net: enic: Exposing symbols for Cisco's low latency driver
...

Linus Torvalds
2013-09-06 05:54:29 +0800
06c54055b Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net ... Browse Code »

Conflicts:
drivers/net/ethernet/stmicro/stmmac/stmmac_platform.c
net/bridge/br_multicast.c
net/ipv6/sit.c

The conflicts were minor:

1) sit.c changes overlap with change to ip_tunnel_xmit() signature.

2) br_multicast.c had an overlap between computing max_delay using
msecs_to_jiffies and turning MLDV2_MRC() into an inline function
with a name using lowercase instead of uppercase letters.

3) stmmac had two overlapping changes, one which conditionally allocated
and hooked up a dma_cfg based upon the presence of the pbl OF property,
and another one handling store-and-forward DMA made. The latter of
which should not go into the new of_find_property() basic block.

Signed-off-by: David S. Miller

David S. Miller
2013-09-06 02:58:52 +0800

04 Sep, 2013

5 commits

82476b316 net: correctly interlink lower/upper devices ... Browse Code »

Currently we're linking upper devices to lower ones, which results in
upside-down relationship: upper devices seeing lower devices via its upper
lists.

Fix this by correctly linking lower devices to the upper ones.

CC: "David S. Miller"
CC: Eric Dumazet
CC: Jiri Pirko
CC: Alexander Duyck
CC: Cong Wang
Signed-off-by: Veaceslav Falico
Signed-off-by: David S. Miller

Veaceslav Falico
2013-09-04 12:27:26 +0800
8b27f2779 skb: allow skb_scrub_packet() to be used by tunnels ... Browse Code »

This function was only used when a packet was sent to another netns. Now, it can
also be used after tunnel encapsulation or decapsulation.

Only skb_orphan() should not be done when a packet is not crossing netns.

Signed-off-by: Nicolas Dichtel
Signed-off-by: David S. Miller

Nicolas Dichtel
2013-09-04 12:27:25 +0800
3e25c65ed net: neighbour: Remove CONFIG_ARPD ... Browse Code »

This config option is superfluous in that it only guards a call
to neigh_app_ns(). Enabling CONFIG_ARPD by default has no
change in behavior. There will now be call to __neigh_notify()
for each ARP resolution, which has no impact unless there is a
user space daemon waiting to receive the notification, i.e.,
the case for which CONFIG_ARPD was designed anyways.

Suggested-by: Eric W. Biederman
Cc: "David S. Miller"
Cc: Alexey Kuznetsov
Cc: James Morris
Cc: Hideaki YOSHIFUJI
Cc: Patrick McHardy
Cc: "Eric W. Biederman"
Cc: Gao feng
Cc: Joe Perches
Cc: Veaceslav Falico
Signed-off-by: Tim Gardner
Reviewed-by: "Eric W. Biederman"
Signed-off-by: David S. Miller

Tim Gardner
2013-09-04 09:41:43 +0800
32dad03d1 Merge branch 'for-3.12' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup ... Browse Code »

Pull cgroup updates from Tejun Heo:
"A lot of activities on the cgroup front. Most changes aren't visible
to userland at all at this point and are laying foundation for the
planned unified hierarchy.

- The biggest change is decoupling the lifetime management of css
(cgroup_subsys_state) from that of cgroup's. Because controllers
(cpu, memory, block and so on) will need to be dynamically enabled
and disabled, css which is the association point between a cgroup
and a controller may come and go dynamically across the lifetime of
a cgroup. Till now, css's were created when the associated cgroup
was created and stayed till the cgroup got destroyed.

Assumptions around this tight coupling permeated through cgroup
core and controllers. These assumptions are gradually removed,
which consists bulk of patches, and css destruction path is
completely decoupled from cgroup destruction path. Note that
decoupling of creation path is relatively easy on top of these
changes and the patchset is pending for the next window.

- cgroup has its own event mechanism cgroup.event_control, which is
only used by memcg. It is overly complex trying to achieve high
flexibility whose benefits seem dubious at best. Going forward,
new events will simply generate file modified event and the
existing mechanism is being made specific to memcg. This pull
request contains prepatory patches for such change.

- Various fixes and cleanups"

Fixed up conflict in kernel/cgroup.c as per Tejun.

* 'for-3.12' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: (69 commits)
cgroup: fix cgroup_css() invocation in css_from_id()
cgroup: make cgroup_write_event_control() use css_from_dir() instead of __d_cgrp()
cgroup: make cgroup_event hold onto cgroup_subsys_state instead of cgroup
cgroup: implement CFTYPE_NO_PREFIX
cgroup: make cgroup_css() take cgroup_subsys * instead and allow NULL subsys
cgroup: rename cgroup_css_from_dir() to css_from_dir() and update its syntax
cgroup: fix cgroup_write_event_control()
cgroup: fix subsystem file accesses on the root cgroup
cgroup: change cgroup_from_id() to css_from_id()
cgroup: use css_get() in cgroup_create() to check CSS_ROOT
cpuset: remove an unncessary forward declaration
cgroup: RCU protect each cgroup_subsys_state release
cgroup: move subsys file removal to kill_css()
cgroup: factor out kill_css()
cgroup: decouple cgroup_subsys_state destruction from cgroup destruction
cgroup: replace cgroup->css_kill_cnt with ->nr_css
cgroup: bounce cgroup_subsys_state ref kill confirmation to a work item
cgroup: move cgroup->subsys[] assignment to online_css()
cgroup: reorganize css init / exit paths
cgroup: add __rcu modifier to cgroup->subsys[]
...

Linus Torvalds
2013-09-04 09:25:03 +0800
542a086ac Merge tag 'driver-core-3.12-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core ... Browse Code »

Pull driver core patches from Greg KH:
"Here's the big driver core pull request for 3.12-rc1.

Lots of tiny changes here fixing up the way sysfs attributes are
created, to try to make drivers simpler, and fix a whole class race
conditions with creations of device attributes after the device was
announced to userspace.

All the various pieces are acked by the different subsystem
maintainers"

* tag 'driver-core-3.12-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core: (119 commits)
firmware loader: fix pending_fw_head list corruption
drivers/base/memory.c: introduce help macro to_memory_block
dynamic debug: line queries failing due to uninitialized local variable
sysfs: sysfs_create_groups returns a value.
debugfs: provide debugfs_create_x64() when disabled
rbd: convert bus code to use bus_groups
firmware: dcdbas: use binary attribute groups
sysfs: add sysfs_create/remove_groups for when SYSFS is not enabled
driver core: add #include to core files.
HID: convert bus code to use dev_groups
Input: serio: convert bus code to use drv_groups
Input: gameport: convert bus code to use drv_groups
driver core: firmware: use __ATTR_RW()
driver core: core: use DEVICE_ATTR_RO
driver core: bus: use DRIVER_ATTR_WO()
driver core: create write-only attribute macros for devices and drivers
sysfs: create __ATTR_WO()
driver-core: platform: convert bus code to use dev_groups
workqueue: convert bus code to use dev_groups
MEI: convert bus code to use dev_groups
...

Linus Torvalds
2013-09-04 02:37:15 +0800

31 Aug, 2013

3 commits

c7b96acf1 userns: Kill nsown_capable it makes the wrong thing easy ... Browse Code »

nsown_capable is a special case of ns_capable essentially for just CAP_SETUID and
CAP_SETGID. For the existing users it doesn't noticably simplify things and
from the suggested patches I have seen it encourages people to do the wrong
thing. So remove nsown_capable.

Acked-by: Serge Hallyn
Signed-off-by: "Eric W. Biederman"

Eric W. Biederman
2013-08-31 14:44:11 +0800
6da7c8fcb qdisc: allow setting default queuing discipline ... Browse Code »

By default, the pfifo_fast queue discipline has been used by default
for all devices. But we have better choices now.

This patch allow setting the default queueing discipline with sysctl.
This allows easy use of better queueing disciplines on all devices
without having to use tc qdisc scripts. It is intended to allow
an easy path for distributions to make fq_codel or sfq the default
qdisc.

This patch also makes pfifo_fast more of a first class qdisc, since
it is now possible to manually override the default and explicitly
use pfifo_fast. The behavior for systems who do not use the sysctl
is unchanged, they still get pfifo_fast

Also removes leftover random # in sysctl net core.

Signed-off-by: Stephen Hemminger
Acked-by: Eric Dumazet
Signed-off-by: David S. Miller

stephen hemminger
2013-08-31 12:32:32 +0800
702821f4e net: revert 8728c544a9c ("net: dev_pick_tx() fix") ... Browse Code »

commit 8728c544a9cbdc ("net: dev_pick_tx() fix") and commit
b6fe83e9525a ("bonding: refine IFF_XMIT_DST_RELEASE capability")
are quite incompatible : Queue selection is disabled because skb
dst was dropped before entering bonding device.

This causes major performance regression, mainly because TCP packets
for a given flow can be sent to multiple queues.

This is particularly visible when using the new FQ packet scheduler
with MQ + FQ setup on the slaves.

We can safely revert the first commit now that 416186fbf8c5b
("net: Split core bits of netdev_pick_tx into __netdev_pick_tx")
properly caps the queue_index.

Reported-by: Xi Wang
Diagnosed-by: Xi Wang
Signed-off-by: Eric Dumazet
Cc: Tom Herbert
Cc: Alexander Duyck
Cc: Denys Fedorysychenko
Signed-off-by: David S. Miller

Eric Dumazet
2013-08-31 05:48:04 +0800

30 Aug, 2013

4 commits

48311f468 net: add netdev_upper_get_next_dev_rcu(dev, iter) ... Browse Code »

This function returns the next dev in the dev->upper_dev_list after the
struct list_head **iter position, and updates *iter accordingly. Returns
NULL if there are no devices left.

Caller must hold RCU read lock.

CC: "David S. Miller"
CC: Eric Dumazet
CC: Jiri Pirko
CC: Alexander Duyck
CC: Cong Wang
Signed-off-by: Veaceslav Falico
Signed-off-by: David S. Miller

Veaceslav Falico
2013-08-30 04:19:42 +0800
620f3186c net: remove search_list from netdev_adjacent ... Browse Code »

We already don't need it cause we see every upper/lower device in the list
already.

CC: "David S. Miller"
CC: Eric Dumazet
CC: Jiri Pirko
CC: Alexander Duyck
CC: Cong Wang
Signed-off-by: Veaceslav Falico
Signed-off-by: David S. Miller

Veaceslav Falico
2013-08-30 04:19:42 +0800
5d261913c net: add lower_dev_list to net_device and make a full mesh ... Browse Code »

This patch adds lower_dev_list list_head to net_device, which is the same
as upper_dev_list, only for lower devices, and begins to use it in the same
way as the upper list.

It also changes the way the whole adjacent device lists work - now they
contain *all* of upper/lower devices, not only the first level. The first
level devices are distinguished by the bool neighbour field in
netdev_adjacent, also added by this patch.

There are cases when a device can be added several times to the adjacent
list, the simplest would be:

/---- eth0.10 ---\
eth0- --- bond0
\---- eth0.20 ---/

where both bond0 and eth0 'see' each other in the adjacent lists two times.
To avoid duplication of netdev_adjacent structures ref_nr is being kept as
the number of times the device was added to the list.

The 'full view' is achieved by adding, on link creation, all of the
upper_dev's upper_dev_list devices as upper devices to all of the
lower_dev's lower_dev_list devices (and to the lower_dev itself), and vice
versa. On unlink they are removed using the same logic.

I've tested it with thousands vlans/bonds/bridges, everything works ok and
no observable lags even on a huge number of interfaces.

Memory footprint for 128 devices interconnected with each other via both
upper and lower (which is impossible, but for the comparison) lists would be:

128*128*2*sizeof(netdev_adjacent) = 1.5MB

but in the real world we usualy have at most several devices with slaves
and a lot of vlans, so the footprint will be much lower.

CC: "David S. Miller"
CC: Eric Dumazet
CC: Jiri Pirko
CC: Alexander Duyck
CC: Cong Wang
Signed-off-by: Veaceslav Falico
Signed-off-by: David S. Miller

Veaceslav Falico
2013-08-30 04:19:42 +0800
aa9d85605 net: rename netdev_upper to netdev_adjacent ... Browse Code »

Rename the structure to reflect the upcoming addition of lower_dev_list.

CC: "David S. Miller"
CC: Eric Dumazet
CC: Jiri Pirko
CC: Alexander Duyck
CC: Cong Wang
Signed-off-by: Veaceslav Falico
Signed-off-by: David S. Miller

Veaceslav Falico
2013-08-30 04:19:41 +0800

29 Aug, 2013

1 commit

7dc5dbc87 sysfs: Restrict mounting sysfs ... Browse Code »

Don't allow mounting sysfs unless the caller has CAP_SYS_ADMIN rights
over the net namespace. The principle here is if you create or have
capabilities over it you can mount it, otherwise you get to live with
what other people have mounted.

Instead of testing this with a straight forward ns_capable call,
perform this check the long and torturous way with kobject helpers,
this keeps direct knowledge of namespaces out of sysfs, and preserves
the existing sysfs abstractions.

Acked-by: Greg Kroah-Hartman
Signed-off-by: "Eric W. Biederman"

Eric W. Biederman
2013-08-29 12:35:14 +0800

28 Aug, 2013

1 commit

d661684cf net: Check the correct namespace when spoofing pid over SCM_RIGHTS ... Browse Code »

This is a security bug.

The follow-up will fix nsproxy to discourage this type of issue from
happening again.

Cc: stable@vger.kernel.org
Signed-off-by: Andy Lutomirski
Reviewed-by: "Eric W. Biederman"
Signed-off-by: David S. Miller

Andy Lutomirski
2013-08-28 01:52:52 +0800

17 Aug, 2013

1 commit

2ff1cf12c Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net Browse Code »

David S. Miller
2013-08-17 06:37:26 +0800

15 Aug, 2013

2 commits

fce9b9be8 rtnetlink: remove an unneeded test ... Browse Code »

We know that "dev" is a valid pointer at this point, so we can remove
the test and clean up a little.

Signed-off-by: Dan Carpenter
Reviewed-by: "Eric W. Biederman"
Signed-off-by: David S. Miller

Dan Carpenter
2013-08-15 16:29:02 +0800
64261f230 dev: move skb_scrub_packet() after eth_type_trans() ... Browse Code »

skb_scrub_packet() was called before eth_type_trans() to let eth_type_trans()
set pkt_type.

In fact, we should force pkt_type to PACKET_HOST, so move the call after
eth_type_trans().

Signed-off-by: Nicolas Dichtel
Signed-off-by: David S. Miller

Nicolas Dichtel
2013-08-15 16:00:20 +0800

14 Aug, 2013

1 commit

3e805ad28 rtnetlink: rtnl_bridge_getlink: Call nlmsg_find_attr() with ifinfomsg header ... Browse Code »

Fix the iproute2 command `bridge vlan show`, after switching from
rtgenmsg to ifinfomsg.

Let's start with a little history:

Feb 20: Vlad Yasevich got his VLAN-aware bridge patchset included in
the 3.9 merge window.
In the kernel commit 6cbdceeb, he added attribute support to
bridge GETLINK requests sent with rtgenmsg.

Mar 6th: Vlad got this iproute2 reference implementation of the bridge
vlan netlink interface accepted (iproute2 9eff0e5c)

Apr 25th: iproute2 switched from using rtgenmsg to ifinfomsg (63338dca)
http://patchwork.ozlabs.org/patch/239602/
http://marc.info/?t=136680900700007

Apr 28th: Linus released 3.9

Apr 30th: Stephen released iproute2 3.9.0

The `bridge vlan show` command haven't been working since the switch to
ifinfomsg, or in a released version of iproute2. Since the kernel side
only supports rtgenmsg, which iproute2 switched away from just prior to
the iproute2 3.9.0 release.

I haven't been able to find any documentation, about neither rtgenmsg
nor ifinfomsg, and in which situation to use which, but kernel commit
88c5b5ce seams to suggest that ifinfomsg should be used.

Fixing this in kernel will break compatibility, but I doubt that anybody
have been using it due to this bug in the user space reference
implementation, at least not without noticing this bug. That said the
functionality is still fully functional in 3.9, when reversing iproute2
commit 63338dca.

This could also be fixed in iproute2, but thats an ugly patch that would
reintroduce rtgenmsg in iproute2, and from searching in netdev it seams
like rtgenmsg usage is discouraged. I'm assuming that the only reason
that Vlad implemented the kernel side to use rtgenmsg, was because
iproute2 was using it at the time.

Signed-off-by: Asbjoern Sloth Toennesen
Reviewed-by: Vlad Yasevich
Signed-off-by: David S. Miller

Asbjoern Sloth Toennesen
2013-08-14 10:09:29 +0800

10 Aug, 2013

3 commits

645359930 rtnetlink: Fix inverted check in ndo_dflt_fdb_del() ... Browse Code »

Fix inverted check when deleting an fdb entry.

Signed-off-by: Sridhar Samudrala
Signed-off-by: David S. Miller

Sridhar Samudrala
2013-08-10 16:24:12 +0800
28d642710 net: attempt high order allocations in sock_alloc_send_pskb() ... Browse Code »

Adding paged frags skbs to af_unix sockets introduced a performance
regression on large sends because of additional page allocations, even
if each skb could carry at least 100% more payload than before.

We can instruct sock_alloc_send_pskb() to attempt high order
allocations.

Most of the time, it does a single page allocation instead of 8.

I added an additional parameter to sock_alloc_send_pskb() to
let other users to opt-in for this new feature on followup patches.

Tested:

Before patch :

$ netperf -t STREAM_STREAM
STREAM STREAM TEST
Recv Send Send
Socket Socket Message Elapsed
Size Size Size Time Throughput
bytes bytes bytes secs. 10^6bits/sec

2304 212992 212992 10.00 46861.15

After patch :

$ netperf -t STREAM_STREAM
STREAM STREAM TEST
Recv Send Send
Socket Socket Message Elapsed
Size Size Size Time Throughput
bytes bytes bytes secs. 10^6bits/sec

2304 212992 212992 10.00 57981.11

Signed-off-by: Eric Dumazet
Cc: David Rientjes
Signed-off-by: David S. Miller

Eric Dumazet
2013-08-10 16:16:44 +0800
e11aada32 net: flow_dissector: add 802.1ad support ... Browse Code »

Same behavior than 802.1q : finds the encapsulated protocol and
skip 32bit header.

Signed-off-by: Eric Dumazet
Signed-off-by: David S. Miller

Eric Dumazet
2013-08-10 02:06:23 +0800

09 Aug, 2013

5 commits

d99c8727e cgroup: make cgroup_taskset deal with cgroup_subsys_state instead of cgroup ... Browse Code »

cgroup is in the process of converting to css (cgroup_subsys_state)
from cgroup as the principal subsystem interface handle. This is
mostly to prepare for the unified hierarchy support where css's will
be created and destroyed dynamically but also helps cleaning up
subsystem implementations as css is usually what they are interested
in anyway.

cgroup_taskset which is used by the subsystem attach methods is the
last cgroup subsystem API which isn't using css as the handle. Update
cgroup_taskset_cur_cgroup() to cgroup_taskset_cur_css() and
cgroup_taskset_for_each() to take @skip_css instead of @skip_cgrp.

The conversions are pretty mechanical. One exception is
cpuset::cgroup_cs(), which lost its last user and got removed.

This patch shouldn't introduce any functional changes.

Signed-off-by: Tejun Heo
Acked-by: Li Zefan
Acked-by: Daniel Wagner
Cc: Ingo Molnar
Cc: Matt Helsley
Cc: Steven Rostedt

Tejun Heo
2013-08-09 08:11:27 +0800
182446d08 cgroup: pass around cgroup_subsys_state instead of cgroup in file methods ... Browse Code »

cgroup is currently in the process of transitioning to using struct
cgroup_subsys_state * as the primary handle instead of struct cgroup.
Please see the previous commit which converts the subsystem methods
for rationale.

This patch converts all cftype file operations to take @css instead of
@cgroup. cftypes for the cgroup core files don't have their subsytem
pointer set. These will automatically use the dummy_css added by the
previous patch and can be converted the same way.

Most subsystem conversions are straight forwards but there are some
interesting ones.

* freezer: update_if_frozen() is also converted to take @css instead
of @cgroup for consistency. This will make the code look simpler
too once iterators are converted to use css.

* memory/vmpressure: mem_cgroup_from_css() needs to be exported to
vmpressure while mem_cgroup_from_cont() can be made static.
Updated accordingly.

* cpu: cgroup_tg() doesn't have any user left. Removed.

* cpuacct: cgroup_ca() doesn't have any user left. Removed.

* hugetlb: hugetlb_cgroup_form_cgroup() doesn't have any user left.
Removed.

* net_cls: cgrp_cls_state() doesn't have any user left. Removed.

Signed-off-by: Tejun Heo
Acked-by: Li Zefan
Acked-by: Michal Hocko
Acked-by: Vivek Goyal
Acked-by: Aristeu Rozanski
Acked-by: Daniel Wagner
Cc: Peter Zijlstra
Cc: Ingo Molnar
Cc: Johannes Weiner
Cc: Balbir Singh
Cc: Matt Helsley
Cc: Jens Axboe
Cc: Steven Rostedt

Tejun Heo
2013-08-09 08:11:24 +0800
eb95419b0 cgroup: pass around cgroup_subsys_state instead of cgroup in subsystem methods ... Browse Code »

cgroup is currently in the process of transitioning to using struct
cgroup_subsys_state * as the primary handle instead of struct cgroup *
in subsystem implementations for the following reasons.

* With unified hierarchy, subsystems will be dynamically bound and
unbound from cgroups and thus css's (cgroup_subsys_state) may be
created and destroyed dynamically over the lifetime of a cgroup,
which is different from the current state where all css's are
allocated and destroyed together with the associated cgroup. This
in turn means that cgroup_css() should be synchronized and may
return NULL, making it more cumbersome to use.

* Differing levels of per-subsystem granularity in the unified
hierarchy means that the task and descendant iterators should behave
differently depending on the specific subsystem the iteration is
being performed for.

* In majority of the cases, subsystems only care about its part in the
cgroup hierarchy - ie. the hierarchy of css's. Subsystem methods
often obtain the matching css pointer from the cgroup and don't
bother with the cgroup pointer itself. Passing around css fits
much better.

This patch converts all cgroup_subsys methods to take @css instead of
@cgroup. The conversions are mostly straight-forward. A few
noteworthy changes are

* ->css_alloc() now takes css of the parent cgroup rather than the
pointer to the new cgroup as the css for the new cgroup doesn't
exist yet. Knowing the parent css is enough for all the existing
subsystems.

* In kernel/cgroup.c::offline_css(), unnecessary open coded css
dereference is replaced with local variable access.

This patch shouldn't cause any behavior differences.

v2: Unnecessary explicit cgrp->subsys[] deref in css_online() replaced
with local variable @css as suggested by Li Zefan.

Rebased on top of new for-3.12 which includes for-3.11-fixes so
that ->css_free() invocation added by da0a12caff ("cgroup: fix a
leak when percpu_ref_init() fails") is converted too. Suggested
by Li Zefan.

Signed-off-by: Tejun Heo
Acked-by: Li Zefan
Acked-by: Michal Hocko
Acked-by: Vivek Goyal
Acked-by: Aristeu Rozanski
Acked-by: Daniel Wagner
Cc: Peter Zijlstra
Cc: Ingo Molnar
Cc: Johannes Weiner
Cc: Balbir Singh
Cc: Matt Helsley
Cc: Jens Axboe
Cc: Steven Rostedt

Tejun Heo
2013-08-09 08:11:23 +0800
6d37b9742 netprio_cgroup: pass around @css instead of @cgroup and kill struct cgroup_netprio_state ... Browse Code »

cgroup controller API will be converted to primarily use struct
cgroup_subsys_state instead of struct cgroup. In preparation, make
the internal functions of netprio_cgroup pass around @css instead of
@cgrp.

While at it, kill struct cgroup_netprio_state which only contained
struct cgroup_subsys_state without serving any purpose. All functions
are converted to deal with @css directly.

This patch shouldn't cause any behavior differences.

Signed-off-by: Tejun Heo
Acked-by: Li Zefan
Acked-by: Neil Horman
Acked-by: David S. Miller

Tejun Heo
2013-08-09 08:11:22 +0800
8af01f56a cgroup: s/cgroup_subsys_state/cgroup_css/ s/task_subsys_state/task_css/ ... Browse Code »

The names of the two struct cgroup_subsys_state accessors -
cgroup_subsys_state() and task_subsys_state() - are somewhat awkward.
The former clashes with the type name and the latter doesn't even
indicate it's somehow related to cgroup.

We're about to revamp large portion of cgroup API, so, let's rename
them so that they're less awkward. Most per-controller usages of the
accessors are localized in accessor wrappers and given the amount of
scheduled changes, this isn't gonna add any noticeable headache.

Rename cgroup_subsys_state() to cgroup_css() and task_subsys_state()
to task_css(). This patch is pure rename.

Signed-off-by: Tejun Heo
Acked-by: Li Zefan

Tejun Heo
2013-08-09 08:11:22 +0800

08 Aug, 2013

5 commits

3d9953a2e net: use skb_copy_datagram_from_iovec() in zerocopy_sg_from_iovec() ... Browse Code »

Use skb_copy_datagram_from_iovec() to avoid code duplication and make it easy to
be read. Also we can do the skipping inside the zero-copy loop.

Signed-off-by: Jason Wang
Signed-off-by: David S. Miller

Jason Wang
2013-08-08 07:52:38 +0800
0433547aa net: use release_pages() in zerocopy_sg_from_iovec() ... Browse Code »

To reduce the duplicated codes.

Signed-off-by: Jason Wang
Signed-off-by: David S. Miller

Jason Wang
2013-08-08 07:52:38 +0800
234a42670 net: remove the useless comment in zerocopy_sg_from_iovec() ... Browse Code »

Signed-off-by: Jason Wang
Signed-off-by: David S. Miller

Jason Wang
2013-08-08 07:52:37 +0800
0a57ec62d net: use skb_fill_page_desc() in zerocopy_sg_from_iovec() ... Browse Code »

Signed-off-by: Jason Wang
Signed-off-by: David S. Miller

Jason Wang
2013-08-08 07:52:35 +0800
c3bdeb5c7 net: move zerocopy_sg_from_iovec() to net/core/datagram.c ... Browse Code »

To let it be reused and reduce code duplication. Also document this function.

Signed-off-by: Jason Wang
Signed-off-by: David S. Miller

Jason Wang
2013-08-08 07:52:34 +0800