20 Aug, 2012

2 commits

  • Fix kernel-doc warning:

    Warning(net/core/dev.c:5745): No description found for parameter 'dev'

    Signed-off-by: Randy Dunlap
    Cc: "David S. Miller"
    Cc: netdev@vger.kernel.org
    Signed-off-by: David S. Miller

    Randy Dunlap
     
  • If a packet is emitted on one socket in one group of fanout sockets,
    it is transmitted again. It is thus read again on one of the sockets
of the fanout group. This results in a loop for software that
    generates packets when receiving one.
    This retransmission is not the intended behavior: a fanout group
    must behave like a single socket. The packet should not be
    transmitted on a socket if it originates from a socket belonging
    to the same fanout group.

    This patch fixes the issue by changing the transmission check to
    take the fanout group into account, as sketched below.

    Reported-by: Aleksandr Kotov
    Signed-off-by: Eric Leblond
    Signed-off-by: David S. Miller

    Eric Leblond
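
    A minimal sketch of such a check on the transmit-to-taps path (the
    shape follows the af_packet fanout code; treat the exact names and
    signatures as assumptions):

        static bool skb_loop_sk(struct packet_type *ptype, struct sk_buff *skb)
        {
                /* not a packet socket, or skb has no originating socket */
                if (!ptype->af_packet_priv || !skb->sk)
                        return false;

                /* fanout sockets supply an id_match() callback that returns
                 * true when skb->sk belongs to the same fanout group */
                if (ptype->id_match)
                        return ptype->id_match(ptype, skb->sk);

                return (struct sock *)ptype->af_packet_priv == skb->sk;
        }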
     

17 Aug, 2012

3 commits

  • A race exists where creating cgroups and also updating the priomap
    may result in losing a priomap update. This is because priomap
    writers are not protected by rtnl_lock.

    Move priority writer into rtnl_lock()/rtnl_unlock().

    CC: Neil Horman
    Reported-by: Al Viro
    Signed-off-by: John Fastabend
    Acked-by: Neil Horman
    Signed-off-by: David S. Miller

    John Fastabend
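
    A sketch of the locking change described above (write_priomap() stands
    in for the cgroup write handler; the body is an assumption):

        static int write_priomap(/* cgroup, device, new priority ... */)
        {
                rtnl_lock();    /* serialize with other priomap writers */
                /* ... update the per-device priomap ... */
                rtnl_unlock();
                return 0;
        }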
     
  • A socket fd passed in a SCM_RIGHTS datagram was not getting
    updated with the new task's cgrp prioidx. This leaves IO on
    the socket tagged with the old task's priority.

    To fix this, add a check in the scm recvmsg path to update the
    sock cgrp prioidx with the new task's value.

    Thanks to Al Viro for catching this.

    CC: Neil Horman
    Reported-by: Al Viro
    Signed-off-by: John Fastabend
    Acked-by: Neil Horman
    Signed-off-by: David S. Miller

    John Fastabend
     
  • Add lock to prevent a race with a file closing and also remove
    useless and ugly sscanf code. The extra code was never needed
    and the case it supposedly protected against is in fact handled
    correctly by sock_from_file, as pointed out by Al Viro.

    CC: Neil Horman
    Reported-by: Al Viro
    Signed-off-by: John Fastabend
    Acked-by: Neil Horman
    Signed-off-by: David S. Miller

    John Fastabend
     

15 Aug, 2012

6 commits

  • napi->poll() needs IRQs enabled, so we have to re-enable IRQs before
    calling it.

    Cc: David Miller
    Signed-off-by: Cong Wang
    Signed-off-by: David S. Miller

    Amerigo Wang
     
  • Without this patch, I can't get netconsole logs remotely over
    vlan. The reason is probably that we don't handle vlan tags in
    either the netpoll tx or rx path.

    I am not sure if I use these vlan functions correctly; at
    least this patch works.

    Cc: Benjamin LaHaise
    Cc: Patrick McHardy
    Cc: David Miller
    Signed-off-by: Cong Wang
    Signed-off-by: David S. Miller

    Amerigo Wang
     
  • This patch fixes several problems in the call path of
    netpoll_send_skb_on_dev():

    1. Disable IRQ's before calling netpoll_send_skb_on_dev().

    2. All the callees of netpoll_send_skb_on_dev() should use
    rcu_dereference_bh() to dereference ->npinfo.

    3. Rename arp_reply() to netpoll_arp_reply(), the former is too generic.

    Cc: "David S. Miller"
    Signed-off-by: Cong Wang
    Signed-off-by: David S. Miller

    Amerigo Wang
     
  • In __netpoll_rx(), it dereferences ->npinfo without rcu_dereference_bh(),
    this patch fixes it by using the 'npinfo' passed from netpoll_rx()
    where it is already dereferenced with rcu_dereference_bh().

    Cc: "David S. Miller"
    Signed-off-by: Cong Wang
    Signed-off-by: David S. Miller

    Amerigo Wang
     
  • Like the previous patch, slave_disable_netpoll() and __netpoll_cleanup()
    may be called with read_lock() held too, so we should make them
    non-blocking by moving the cleanup and kfree() into call_rcu_bh()
    callbacks, as sketched below.

    Cc: "David S. Miller"
    Signed-off-by: Cong Wang
    Signed-off-by: David S. Miller

    Amerigo Wang
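
    A sketch of the deferred-free pattern described above (the rcu_head
    field and callback name are assumptions):

        static void rcu_cleanup_netpoll_info(struct rcu_head *rcu)
        {
                kfree(container_of(rcu, struct netpoll_info, rcu));
        }

        /* instead of blocking for a grace period while read_lock() is held: */
        call_rcu_bh(&npinfo->rcu, rcu_cleanup_netpoll_info);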
     
  • slave_enable_netpoll() and __netpoll_setup() may be called
    with read_lock() held, so they should use GFP_ATOMIC to allocate
    memory. Eric suggested passing gfp flags to __netpoll_setup().

    Cc: Eric Dumazet
    Cc: "David S. Miller"
    Reported-by: Dan Carpenter
    Signed-off-by: Eric Dumazet
    Signed-off-by: Cong Wang
    Signed-off-by: David S. Miller

    Amerigo Wang
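
    A sketch of threading the gfp flags through, per the description
    above (the exact prototype and body are assumptions):

        int __netpoll_setup(struct netpoll *np, struct net_device *ndev, gfp_t gfp)
        {
                struct netpoll_info *npinfo;

                /* GFP_ATOMIC when the caller holds read_lock(), else GFP_KERNEL */
                npinfo = kmalloc(sizeof(*npinfo), gfp);
                if (!npinfo)
                        return -ENOMEM;
                /* ... */
        }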
     

09 Aug, 2012

2 commits

  • Do not leak memory by overwriting the pointer with a potentially NULL realloc return value.

    Found by Linux Driver Verification project (linuxtesting.org).

    Signed-off-by: Alexey Khoroshilov
    Signed-off-by: David S. Miller

    Alexey Khoroshilov
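
    The hazard generalizes beyond the kernel; in plain C the safe pattern
    keeps the old pointer until realloc() succeeds (set_alias() here is a
    hypothetical example, not the patched function):

        #include <stdlib.h>
        #include <string.h>

        char *set_alias(char *alias, const char *name, size_t len)
        {
                char *new_alias = realloc(alias, len + 1);

                if (!new_alias)
                        return NULL;    /* 'alias' is untouched and still owned */
                memcpy(new_alias, name, len);
                new_alias[len] = '\0';
                return new_alias;
        }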
     
  • While investigating network performance problems, I found this little
    gem:

    $ nm -v vmlinux | grep -1 dst_default_metrics
    ffffffff82736540 b busy.46605
    ffffffff82736560 B dst_default_metrics
    ffffffff82736598 b dst_busy_list

    Apparently, declaring a const array without an initializer puts it in
    the (writable) bss section, in the middle of possibly often-dirtied
    cache lines.

    Since we really want dst_default_metrics to be const, to avoid any
    possible false sharing and catch any buggy writes, I force a null
    initializer:

    ffffffff818a4c20 R dst_default_metrics

    Signed-off-by: Eric Dumazet
    Cc: Ben Hutchings
    Signed-off-by: David S. Miller

    Eric Dumazet
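
    The effect is easy to reproduce in userspace; compile this and compare
    the nm output (section letters can vary by toolchain):

        /* tentative (uninitialized) const definition: typically lands in
         * .bss or common -- i.e. writable memory ('b'/'B'/'C' in nm) */
        const unsigned long metrics_uninit[4];

        /* an explicit initializer forces it into .rodata ('R' in nm) */
        const unsigned long metrics_init[4] = { 0 };

        int main(void) { return 0; }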
     

02 Aug, 2012

2 commits

  • Cache the device gso_max_segs in sock::sk_gso_max_segs and use it to
    limit the size of TSO skbs. This avoids the need to fall back to
    software GSO for local TCP senders.

    Signed-off-by: Ben Hutchings
    Signed-off-by: David S. Miller

    Ben Hutchings
     
  • A peer (or local user) may cause TCP to use a nominal MSS of as little
    as 88 (actual MSS of 76 with timestamps). Given that we have a
    sufficiently prodigious local sender and the peer ACKs quickly enough,
    it is nevertheless possible to grow the window for such a connection
    to the point that we will try to send just under 64K at once. This
    results in a single skb that expands to 861 segments.

    In some drivers with TSO support, such an skb will require hundreds of
    DMA descriptors; a substantial fraction of a TX ring or even more than
    a full ring. The TX queue selected for the skb may stall and trigger
    the TX watchdog repeatedly (since the problem skb will be retried
    after the TX reset). This particularly affects sfc, for which the
    issue is designated as CVE-2012-3412.

    Therefore:
    1. Add the field net_device::gso_max_segs holding the device-specific
    limit.
    2. In netif_skb_features(), if the number of segments is too high then
    mask out GSO features to force fall back to software GSO.

    Signed-off-by: Ben Hutchings
    Signed-off-by: David S. Miller

    Ben Hutchings
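
    Per the description, the added check in netif_skb_features() amounts
    to (a sketch):

        /* too many segments for the device: force software GSO */
        if (skb_shinfo(skb)->gso_segs > skb->dev->gso_max_segs)
                features &= ~NETIF_F_GSO_MASK;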
     

01 Aug, 2012

8 commits

  • Merge Andrew's second set of patches:
    - MM
    - a few random fixes
    - a couple of RTC leftovers

    * emailed patches from Andrew Morton : (120 commits)
    rtc/rtc-88pm80x: remove unneed devm_kfree
    rtc/rtc-88pm80x: assign ret only when rtc_register_driver fails
    mm: hugetlbfs: close race during teardown of hugetlbfs shared page tables
    tmpfs: distribute interleave better across nodes
    mm: remove redundant initialization
    mm: warn if pg_data_t isn't initialized with zero
    mips: zero out pg_data_t when it's allocated
    memcg: fix memory accounting scalability in shrink_page_list
    mm/sparse: remove index_init_lock
    mm/sparse: more checks on mem_section number
    mm/sparse: optimize sparse_index_alloc
    memcg: add mem_cgroup_from_css() helper
    memcg: further prevent OOM with too many dirty pages
    memcg: prevent OOM with too many dirty pages
    mm: mmu_notifier: fix freed page still mapped in secondary MMU
    mm: memcg: only check anon swapin page charges for swap cache
    mm: memcg: only check swap cache pages for repeated charging
    mm: memcg: split swapin charge function into private and public part
    mm: memcg: remove needless !mm fixup to init_mm when charging
    mm: memcg: remove unneeded shmem charge type
    ...

    Linus Torvalds
     
  • Pull random subsystem patches from Ted Ts'o:
    "This patch series contains a major revamp of how we collect entropy
    from interrupts for /dev/random and /dev/urandom.

    The goal is to addresses weaknesses discussed in the paper "Mining
    your Ps and Qs: Detection of Widespread Weak Keys in Network Devices",
    by Nadia Heninger, Zakir Durumeric, Eric Wustrow, J. Alex Halderman,
    which will be published in the Proceedings of the 21st Usenix Security
    Symposium, August 2012. (See https://factorable.net for more
    information and an extended version of the paper.)"

    Fix up trivial conflicts due to nearby changes in
    drivers/{mfd/ab3100-core.c, usb/gadget/omap_udc.c}

    * tag 'random_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/random: (33 commits)
    random: mix in architectural randomness in extract_buf()
    dmi: Feed DMI table to /dev/random driver
    random: Add comment to random_initialize()
    random: final removal of IRQF_SAMPLE_RANDOM
    um: remove IRQF_SAMPLE_RANDOM which is now a no-op
    sparc/ldc: remove IRQF_SAMPLE_RANDOM which is now a no-op
    [ARM] pxa: remove IRQF_SAMPLE_RANDOM which is now a no-op
    board-palmz71: remove IRQF_SAMPLE_RANDOM which is now a no-op
    isp1301_omap: remove IRQF_SAMPLE_RANDOM which is now a no-op
    pxa25x_udc: remove IRQF_SAMPLE_RANDOM which is now a no-op
    omap_udc: remove IRQF_SAMPLE_RANDOM which is now a no-op
    goku_udc: remove IRQF_SAMPLE_RANDOM which was commented out
    uartlite: remove IRQF_SAMPLE_RANDOM which is now a no-op
    drivers: hv: remove IRQF_SAMPLE_RANDOM which is now a no-op
    xen-blkfront: remove IRQF_SAMPLE_RANDOM which is now a no-op
    n2_crypto: remove IRQF_SAMPLE_RANDOM which is now a no-op
    pda_power: remove IRQF_SAMPLE_RANDOM which is now a no-op
    i2c-pmcmsp: remove IRQF_SAMPLE_RANDOM which is now a no-op
    input/serio/hp_sdc.c: remove IRQF_SAMPLE_RANDOM which is now a no-op
    mfd: remove IRQF_SAMPLE_RANDOM which is now a no-op
    ...

    Linus Torvalds
     
  • This patch series is based on top of "Swap-over-NBD without deadlocking
    v15" as it depends on the same PF_MEMALLOC reserve logic.

    When a user or administrator requires swap for their application, they
    create a swap partition and file, format it with mkswap and activate it
    with swapon. In diskless systems this is not an option, so if swap is
    required then swapping over the network is considered. The two likely
    scenarios are blade servers used as part of a cluster, where the form
    factor or maintenance costs do not allow the use of disks, and thin
    clients.

    The Linux Terminal Server Project recommends the use of the Network Block
    Device (NBD) for swap but this is not always an option. There is no
    guarantee that the network attached storage (NAS) device is running Linux
    or supports NBD. However, it is likely that it supports NFS so there are
    users that want support for swapping over NFS despite any performance
    concern. Some distributions currently carry patches that support swapping
    over NFS but it would be preferable to support it in the mainline kernel.

    Patch 1 avoids a stream-specific deadlock that potentially affects TCP.

    Patch 2 is a small modification to SELinux to avoid using PFMEMALLOC
    reserves.

    Patch 3 adds three helpers for filesystems to handle swap cache pages.
    For example, page_file_mapping() returns page->mapping for
    file-backed pages and the address_space of the underlying
    swap file for swap cache pages.

    Patch 4 adds two address_space_operations to allow a filesystem
    to pin all metadata relevant to a swapfile in memory. Upon
    successful activation, the swapfile is marked SWP_FILE and
    the address space operation ->direct_IO is used for writing
    and ->readpage for reading in swap pages.

    Patch 5 notes that patch 3 bolts filesystem-specific swapfile
    support onto the side and that the default handlers have
    different information available to them than the filesystem
    does. This patch refactors the
    code so that there are generic handlers for each of the new
    address_space operations.

    Patch 6 adds an API to allow a vector of kernel addresses to be
    translated to struct pages and pinned for IO.

    Patch 7 adds support for using highmem pages for swap by kmapping
    the pages before calling the direct_IO handler.

    Patch 8 updates NFS to use the helpers from patch 3 where necessary.

    Patch 9 avoids setting PG_private on PG_swapcache pages within NFS.

    Patch 10 implements the new swapfile-related address_space operations
    for NFS and teaches the direct IO handler how to manage
    kernel addresses.

    Patch 11 prevents page allocator recursions in NFS by using GFP_NOIO
    where appropriate.

    Patch 12 fixes a NULL pointer dereference that occurs when using
    swap-over-NFS.

    With the patches applied, it is possible to mount a swapfile that is on an
    NFS filesystem. Swap performance is not great, with a swap stress test
    taking roughly twice as long to complete as when the swap device was
    backed by NBD.

    This patch: netvm: prevent a stream-specific deadlock

    It could happen that all !SOCK_MEMALLOC sockets have buffered so much data
    that we're over the global rmem limit. This will prevent SOCK_MEMALLOC
    buffers from receiving data, which will prevent userspace from running,
    which is needed to reduce the buffered data.

    Fix this by exempting the SOCK_MEMALLOC sockets from the rmem limit. Once
    this change is applied, it is important that sockets that set
    SOCK_MEMALLOC do not clear the flag until the socket is being torn down.
    If this does happen, a warning is generated and the tokens are reclaimed
    to avoid accounting errors until the bug is fixed.

    [davem@davemloft.net: Warning about clearing SOCK_MEMALLOC]
    Signed-off-by: Peter Zijlstra
    Signed-off-by: Mel Gorman
    Acked-by: David S. Miller
    Acked-by: Rik van Riel
    Cc: Trond Myklebust
    Cc: Neil Brown
    Cc: Christoph Hellwig
    Cc: Mike Christie
    Cc: Eric B Munson
    Cc: Sebastian Andrzej Siewior
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • In order to make sure pfmemalloc packets receive all memory needed to
    proceed, ensure processing of pfmemalloc SKBs happens under PF_MEMALLOC.
    This is limited to a subset of protocols that are expected to be used for
    writing to swap. Taps are not allowed to use PF_MEMALLOC as these are
    expected to communicate with userspace processes which could be paged out.

    [a.p.zijlstra@chello.nl: Ideas taken from various patches]
    [jslaby@suse.cz: Lock imbalance fix]
    Signed-off-by: Mel Gorman
    Acked-by: David S. Miller
    Cc: Neil Brown
    Cc: Peter Zijlstra
    Cc: Mike Christie
    Cc: Eric B Munson
    Cc: Eric Dumazet
    Cc: Sebastian Andrzej Siewior
    Cc: Mel Gorman
    Cc: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
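
    A sketch of the receive-path shape this describes (helper names follow
    the series; the exact placement is an assumption):

        if (sk_memalloc_socks() && skb_pfmemalloc(skb)) {
                unsigned long pflags = current->flags;

                /* process reserve-backed skbs with access to the reserves,
                 * so delivery to SOCK_MEMALLOC sockets cannot stall */
                current->flags |= PF_MEMALLOC;
                ret = __netif_receive_skb(skb);
                tsk_restore_flags(current, pflags, PF_MEMALLOC);
        } else {
                ret = __netif_receive_skb(skb);
        }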
     
  • Change the skb allocation API to indicate RX usage and use this to fall
    back to the PFMEMALLOC reserve when needed. SKBs allocated from the
    reserve are tagged in skb->pfmemalloc. If an SKB is allocated from the
    reserve and the socket is later found to be unrelated to page reclaim, the
    packet is dropped so that the memory remains available for page reclaim.
    Network protocols are expected to recover from this packet loss.

    [a.p.zijlstra@chello.nl: Ideas taken from various patches]
    [davem@davemloft.net: Use static branches, coding style corrections]
    [sebastian@breakpoint.cc: Avoid unnecessary cast, fix !CONFIG_NET build]
    Signed-off-by: Mel Gorman
    Acked-by: David S. Miller
    Cc: Neil Brown
    Cc: Peter Zijlstra
    Cc: Mike Christie
    Cc: Eric B Munson
    Cc: Eric Dumazet
    Cc: Sebastian Andrzej Siewior
    Cc: Mel Gorman
    Cc: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
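
    The drop described above happens early in socket input; the check is
    roughly (a sketch):

        /* an skb from the pfmemalloc reserves may only be consumed by a
         * socket that is itself helping reclaim (SOCK_MEMALLOC) */
        if (skb_pfmemalloc(skb) && !sock_flag(sk, SOCK_MEMALLOC))
                return -ENOMEM;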
     
  • Allow specific sockets to be tagged SOCK_MEMALLOC and use __GFP_MEMALLOC
    for their allocations. These sockets will be able to go below watermarks
    and allocate from the emergency reserve. Such sockets are to be used to
    service the VM (in other words, to swap over). They must be handled
    kernel side; exposing such a socket to user-space is a bug.

    There is a risk that the reserves will be depleted, so for now the
    administrator is responsible for increasing min_free_kbytes as necessary
    to prevent deadlock for their workloads.

    [a.p.zijlstra@chello.nl: Original patches]
    Signed-off-by: Mel Gorman
    Acked-by: David S. Miller
    Cc: Neil Brown
    Cc: Peter Zijlstra
    Cc: Mike Christie
    Cc: Eric B Munson
    Cc: Eric Dumazet
    Cc: Sebastian Andrzej Siewior
    Cc: Mel Gorman
    Cc: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
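
    A sketch of tagging such a socket (the helper name follows the series;
    the body is an assumption):

        int sk_set_memalloc(struct sock *sk)
        {
                sock_set_flag(sk, SOCK_MEMALLOC);       /* may go below watermarks */
                sk->sk_allocation |= __GFP_MEMALLOC;    /* allocate from reserves */
                return 0;
        }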
     
  • Sanity:

    CONFIG_CGROUP_MEM_RES_CTLR -> CONFIG_MEMCG
    CONFIG_CGROUP_MEM_RES_CTLR_SWAP -> CONFIG_MEMCG_SWAP
    CONFIG_CGROUP_MEM_RES_CTLR_SWAP_ENABLED -> CONFIG_MEMCG_SWAP_ENABLED
    CONFIG_CGROUP_MEM_RES_CTLR_KMEM -> CONFIG_MEMCG_KMEM

    [mhocko@suse.cz: fix missed bits]
    Cc: Glauber Costa
    Acked-by: Michal Hocko
    Cc: Johannes Weiner
    Cc: KAMEZAWA Hiroyuki
    Cc: Hugh Dickins
    Cc: Tejun Heo
    Cc: Aneesh Kumar K.V
    Cc: David Rientjes
    Cc: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     
  • commit 404e0a8b6a55 (net: ipv4: fix RCU races on dst refcounts) tried
    to solve a race but added a problem at device/fib dismantle time:

    We really want to call dst_free() as soon as possible, even if sockets
    still have dst entries in their cache.
    dst_release() calls in free_fib_info_rcu() are not welcome.

    The root of the problem is that, now that we also cache output routes
    (in nh_rth_output), we must use call_rcu() instead of call_rcu_bh() in
    rt_free(), because output route lookups are done in process context.

    This is based on feedback and an initial patch from David Miller (adding
    another call_rcu_bh() call in fib, but it appears that was not the right
    fix).

    I left the inet_sk_rx_dst_set() helper and added __rcu attributes
    to nh_rth_output and nh_rth_input to better document what is going on in
    this code.

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
     

31 Jul, 2012

1 commit

  • commit c6cffba4ffa2 (ipv4: Fix input route performance regression.)
    added various fatal races with dst refcounts.

    Crashes happen on TCP workloads if routes are added/deleted at the
    same time.

    The dst_free() calls from free_fib_info_rcu() are clearly racy.

    Instead we need regular dst refcounting (dst_release()) and must make
    sure dst_release() is aware of RCU grace periods:

    Add a DST_RCU_FREE flag so that dst_release() respects an RCU grace
    period before dst destruction for cached dsts.

    Introduce a new inet_sk_rx_dst_set() helper, using atomic_inc_not_zero()
    to make sure we don't increase a zero refcount (on a dst currently
    waiting out an RCU grace period before destruction).

    rt_cache_route() must take a reference on the new cached route, and
    release it if it was not able to install it.

    With this patch, my machines survive various benchmarks.

    Signed-off-by: Eric Dumazet
    Signed-off-by: David S. Miller

    Eric Dumazet
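
    A sketch of the helper as described (the refcount field name and exact
    body are assumptions):

        void inet_sk_rx_dst_set(struct sock *sk, const struct sk_buff *skb)
        {
                struct dst_entry *dst = skb_dst(skb);

                /* refuse a dst already waiting out its RCU grace period */
                if (dst && atomic_inc_not_zero(&dst->__refcnt))
                        sk->sk_rx_dst = dst;
        }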
     

30 Jul, 2012

1 commit

  • When userspace uses RTM_GETROUTE to dump the route table, an already
    expired route entry always yields an 'expires' value (2147157)
    calculated based on INT_MAX.

    The reason for this problem is the following statement:

    rt->dst.expires - jiffies < INT_MAX

    gcc promotes both operands of '<' to unsigned long, so a negative
    difference wraps to a huge positive value, the comparison is always
    false, and the reported value is always clamped based on INT_MAX.

    Signed-off-by: Li Wei
    Signed-off-by: David S. Miller

    Li Wei
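
    The promotion bug is easy to demonstrate in plain C:

        #include <stdio.h>
        #include <limits.h>

        int main(void)
        {
                unsigned long jiffies = 1000000;        /* "now" */
                unsigned long expires = 999000;         /* already expired */

                /* the subtraction is done in unsigned long: -1000 wraps to
                 * a huge positive value, so the comparison is always false */
                if (expires - jiffies < INT_MAX)
                        printf("normal expires\n");
                else
                        printf("clamped to INT_MAX: diff=%lu\n",
                               expires - jiffies);
                return 0;
        }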
     

28 Jul, 2012

1 commit

  • When device flags are set using rtnetlink, IFF_PROMISC and IFF_ALLMULTI
    flags are handled specially. Function dev_change_flags sets IFF_PROMISC and
    IFF_ALLMULTI bits in dev->gflags according to the passed value, but
    do_setlink passes the result of rtnl_dev_combine_flags, which takes
    those bits from dev->flags.

    This can easily be triggered by doing:

    tcpdump -i eth0 &
    ip l s up eth0

    ip sets IFF_UP flag in ifi_flags and ifi_change, which is combined with
    IFF_PROMISC by rtnl_dev_combine_flags, causing __dev_change_flags to set
    IFF_PROMISC in gflags.

    Reported-by: Max Matveev
    Signed-off-by: Jiri Benc
    Signed-off-by: David S. Miller

    Jiri Benc
     

25 Jul, 2012

2 commits

  • Pull trivial tree from Jiri Kosina:
    "Trivial updates all over the place as usual."

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (29 commits)
    Fix typo in include/linux/clk.h .
    pci: hotplug: Fix typo in pci
    iommu: Fix typo in iommu
    video: Fix typo in drivers/video
    Documentation: Add newline at end-of-file to files lacking one
    arm,unicore32: Remove obsolete "select MISC_DEVICES"
    module.c: spelling s/postition/position/g
    cpufreq: Fix typo in cpufreq driver
    trivial: typo in comment in mksysmap
    mach-omap2: Fix typo in debug message and comment
    scsi: aha152x: Fix sparse warning and make printing pointer address more portable.
    Change email address for Steve Glendinning
    Btrfs: fix typo in convert_extent_bit
    via: Remove bogus if check
    netprio_cgroup.c: fix comment typo
    backlight: fix memory leak on obscure error path
    Documentation: asus-laptop.txt references an obsolete Kconfig item
    Documentation: ManagementStyle: fixed typo
    mm/vmscan: cleanup comment error in balance_pgdat
    mm: cleanup on the comments of zone_reclaim_stat
    ...

    Linus Torvalds
     
  • Pull networking changes from David S Miller:

    1) Remove the ipv4 routing cache. Now lookups go directly into the FIB
    trie and use prebuilt routes cached there.

    No more garbage collection, no more rDOS attacks on the routing
    cache. Instead we now get predictable and consistent performance,
    no matter what the pattern of traffic we service.

    This has been almost 2 years in the making. Special thanks to
    Julian Anastasov, Eric Dumazet, Steffen Klassert, and others who
    have helped along the way.

    I'm sure that with a change of this magnitude there will be some
    kind of fallout, but such things ought to be simple to fix at this
    point. Luckily I'm not European so I'll be around all of August to
    fix things :-)

    The major stages of this work here are each fronted by a forced
    merge commit whose commit message contains a top-level description
    of the motivations and implementation issues.

    2) Pre-demux of established ipv4 TCP sockets, saves a route demux on
    input.

    3) TCP SYN/ACK performance tweaks from Eric Dumazet.

    4) Add namespace support for netfilter L4 conntrack helpers, from Gao
    Feng.

    5) Add config mechanism for Energy Efficient Ethernet to ethtool, from
    Yuval Mintz.

    6) Remove quadratic behavior from /proc/net/unix, from Eric Dumazet.

    7) Support for connection tracker helpers in userspace, from Pablo
    Neira Ayuso.

    8) Allow userspace driven TX load balancing functions in TEAM driver,
    from Jiri Pirko.

    9) Kill off NLMSG_PUT and RTA_PUT macros, more gross stuff with
    embedded gotos.

    10) TCP Small Queues, essentially minimize the amount of TCP data queued
    up in the packet scheduler layer. Whereas the existing BQL (Byte
    Queue Limits) limits the pkt_sched --> netdevice queuing levels,
    this controls the TCP --> pkt_sched queueing levels.

    From Eric Dumazet.

    11) Reduce the number of get_page/put_page ops done on SKB fragments,
    from Alexander Duyck.

    12) Implement protection against blind resets in TCP (RFC 5961), from
    Eric Dumazet.

    13) Support the client side of TCP Fast Open, basically the ability to
    send data in the SYN exchange, from Yuchung Cheng.

    Basically, the sender queues up data with a sendmsg() call using
    MSG_FASTOPEN, then they do the connect() which emits the queued up
    fastopen data.

    14) Avoid all the problems we get into in TCP when timers or PMTU events
    hit a locked socket. The TCP Small Queues changes added a
    tcp_release_cb() that allows us to queue work up to the
    release_sock() caller, and that's what we use here too. From Eric
    Dumazet.

    15) Zero copy on TX support for TUN driver, from Michael S. Tsirkin.

    * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1870 commits)
    genetlink: define lockdep_genl_is_held() when CONFIG_LOCKDEP
    r8169: revert "add byte queue limit support".
    ipv4: Change rt->rt_iif encoding.
    net: Make skb->skb_iif always track skb->dev
    ipv4: Prepare for change of rt->rt_iif encoding.
    ipv4: Remove all RTCF_DIRECTSRC handliing.
    ipv4: Really ignore ICMP address requests/replies.
    decnet: Don't set RTCF_DIRECTSRC.
    net/ipv4/ip_vti.c: Fix __rcu warnings detected by sparse.
    ipv4: Remove redundant assignment
    rds: set correct msg_namelen
    openvswitch: potential NULL deref in sample()
    tcp: dont drop MTU reduction indications
    bnx2x: Add new 57840 device IDs
    tcp: avoid oops in tcp_metrics and reset tcpm_stamp
    niu: Change niu_rbr_fill() to use unlikely() to check niu_rbr_add_page() return value
    niu: Fix to check for dma mapping errors.
    net: Fix references to out-of-scope variables in put_cmsg_compat()
    net: ethernet: davinci_emac: add pm_runtime support
    net: ethernet: davinci_emac: Remove unnecessary #include
    ...

    Linus Torvalds
     

24 Jul, 2012

2 commits

  • Make it follow device decapsulation, from things such as VLAN and
    bonding.

    The stuff that actually cares about pre-demuxed device pointers is
    handled by the "orig_dev" variable in __netif_receive_skb(). And
    the only consumer of that is the po->origdev feature of AF_PACKET
    sockets.

    Signed-off-by: David S. Miller

    David S. Miller
     
  • Pull the big VFS changes from Al Viro:
    "This one is *big* and changes quite a few things around VFS. What's in there:

    - the first of two really major architecture changes - death to open
    intents.

    The former is finally there; it was very long in making, but with
    Miklos getting through really hard and messy final push in
    fs/namei.c, we finally have it. Unlike his variant, this one
    doesn't introduce struct opendata; what we have instead is
    ->atomic_open() taking preallocated struct file * and passing
    everything via its fields.

    Instead of returning struct file *, it returns -E... on error, 0
    on success and 1 in "deal with it yourself" case (e.g. symlink
    found on server, etc.).

    See comments before fs/namei.c:atomic_open(). That made a lot of
    goodies finally possible and quite a few are in that pile:
    ->lookup(), ->d_revalidate() and ->create() do not get struct
    nameidata * anymore; ->lookup() and ->d_revalidate() get lookup
    flags instead, ->create() gets "do we want it exclusive" flag.

    With the introduction of new helper (kern_path_locked()) we are rid
    of all struct nameidata instances outside of fs/namei.c; it's still
    visible in namei.h, but not for long. Come the next cycle,
    declaration will move either to fs/internal.h or to fs/namei.c
    itself. [me, miklos, hch]

    - The second major change: behaviour of final fput(). Now we have
    __fput() done without any locks held by caller *and* not from deep
    in call stack.

    That obviously lifts a lot of constraints on the locking in there.
    Moreover, it's legal now to call fput() from atomic contexts (which
    has immediately simplified life for aio.c). We also don't need
    anti-recursion logics in __scm_destroy() anymore.

    There is a price, though - the damn thing has become partially
    asynchronous. For fput() from normal process we are guaranteed
    that pending __fput() will be done before the caller returns to
    userland, exits or gets stopped for ptrace.

    For kernel threads and atomic contexts it's done via
    schedule_work(), so theoretically we might need a way to make sure
    it's finished; so far only one such place had been found, but there
    might be more.

    There's flush_delayed_fput() (do all pending __fput()) and there's
    __fput_sync() (fput() analog doing __fput() immediately). I hope
    we won't need them often; see warnings in fs/file_table.c for
    details. [me, based on task_work series from Oleg merged last
    cycle]

    - sync series from Jan

    - large part of "death to sync_supers()" work from Artem; the only
    bits missing here are exofs and ext4 ones. As far as I understand,
    those are going via the exofs and ext4 trees resp.; once they are
    in, we can put ->write_super() to the rest, along with the thread
    calling it.

    - preparatory bits from unionmount series (from dhowells).

    - assorted cleanups and fixes all over the place, as usual.

    This is not the last pile for this cycle; there's at least jlayton's
    ESTALE work and fsfreeze series (the latter - in dire need of fixes,
    so I'm not sure it'll make the cut this cycle). I'll probably throw
    symlink/hardlink restrictions stuff from Kees into the next pile, too.
    Plus there's a lot of misc patches I hadn't thrown into that one -
    it's large enough as it is..."

    * 'for-linus-2' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (127 commits)
    ext4: switch EXT4_IOC_RESIZE_FS to mnt_want_write_file()
    btrfs: switch btrfs_ioctl_balance() to mnt_want_write_file()
    switch dentry_open() to struct path, make it grab references itself
    spufs: shift dget/mntget towards dentry_open()
    zoran: don't bother with struct file * in zoran_map
    ecryptfs: don't reinvent the wheels, please - use struct completion
    don't expose I_NEW inodes via dentry->d_inode
    tidy up namei.c a bit
    unobfuscate follow_up() a bit
    ext3: pass custom EOF to generic_file_llseek_size()
    ext4: use core vfs llseek code for dir seeks
    vfs: allow custom EOF in generic_file_llseek code
    vfs: Avoid unnecessary WB_SYNC_NONE writeback during sys_sync and reorder sync passes
    vfs: Remove unnecessary flushing of block devices
    vfs: Make sys_sync writeout also block device inodes
    vfs: Create function for iterating over block devices
    vfs: Reorder operations during sys_sync
    quota: Move quota syncing to ->sync_fs method
    quota: Split dquot_quota_sync() to writeback and cache flushing part
    vfs: Move noop_backing_dev_info check from sync into writeback
    ...

    Linus Torvalds
     

23 Jul, 2012

7 commits

  • The ipv4 routing cache is non-deterministic, performance-wise, and is
    subject to reasonably easy-to-launch denial of service attacks.

    The routing cache works great for well behaved traffic, and the world
    was a much friendlier place when the tradeoffs that led to the routing
    cache's design were considered.

    What it boils down to is that the performance of the routing cache is
    a product of the traffic patterns seen by a system rather than being a
    product of the contents of the routing tables, and the former is
    controllable by external entities.

    Even for "well behaved" legitimate traffic, high-volume sites can see
    hit rates in the routing cache of only ~10%.

    The general flow of this patch series is that first the routing cache
    is removed. We build a completely new rtable entry every lookup
    request.

    Next we make some simplifications due to the fact that removing the
    routing cache causes several members of struct rtable to become no
    longer necessary.

    Then we need to make some amends such that we can legally cache
    pre-constructed routes in the FIB nexthops. Firstly, we need to
    invalidate routes which are hit with nexthop exceptions. Secondly we
    have to change the semantics of rt->rt_gateway such that zero means
    that the destination is on-link and non-zero otherwise.

    Now that the preparations are ready, we start caching precomputed
    routes in the FIB nexthops. Output and input routes need different
    kinds of care when determining if we can legally do such caching or
    not. The details are in the commit log messages for those changes.

    The patch series then winds down with some more struct rtable
    simplifications and other tidy ups that remove unnecessary overhead.

    On a SPARC-T3 output route lookups are ~876 cycles. Input route
    lookups are ~1169 cycles with rpfilter disabled, and about ~1468
    cycles with rpfilter enabled.

    These measurements were taken with the kbench_mod test module in the
    net_test_tools GIT tree:

    git://git.kernel.org/pub/scm/linux/kernel/git/davem/net_test_tools.git

    That GIT tree also includes a udpflood tester tool and stresses
    route lookups on packet output.

    For example, on the same SPARC-T3 system we can run:

    time ./udpflood -l 10000000 10.2.2.11

    with routing cache:
    real 1m21.955s user 0m6.530s sys 1m15.390s

    without routing cache:
    real 1m31.678s user 0m6.520s sys 1m25.140s

    Performance undoubtedly can easily be improved further.

    For example fib_table_lookup() performs a lot of excessive
    computations with all the masking and shifting, some of it
    conditionalized to deal with edge cases.

    Also, Eric's no-ref optimization for input route lookups can be
    re-instated for the FIB nexthop caching code path. I would be really
    pleased if someone would work on that.

    In fact, anyone suitably motivated can just fire up perf on the loading
    of the net_test_tools benchmark kernel module. I spend much of
    my time going:

    bash# perf record insmod ./kbench_mod.ko dst=172.30.42.22 src=74.128.0.1 iif=2
    bash# perf report

    Thanks to helpful feedback from Joe Perches, Eric Dumazet, Ben
    Hutchings, and others.

    Signed-off-by: David S. Miller

    David S. Miller
     
  • Recursion in __scm_destroy() will be cut by delaying the final fput().

    Signed-off-by: Al Viro

    Al Viro
     
  • Instead of updating the sk_cgrp_prioidx struct field on every send
    this only updates the field when a task is moved via cgroup
    infrastructure.

    This allows sockets that may be used by a kernel worker thread
    to be managed. For example, in the iscsi case today a user can
    put iscsid in a netprio cgroup and control traffic will be sent
    with the correct sk_cgrp_prioidx value set, but as soon as data
    is sent the kernel worker thread issues a send and sk_cgrp_prioidx
    is updated with the kernel worker thread's value, which is the
    default case.

    It seems more correct to only update the field when the user
    explicitly sets it via control group infrastructure. This allows
    the users to manage sockets that may be used with other threads.

    Signed-off-by: John Fastabend
    Acked-by: Neil Horman
    Signed-off-by: David S. Miller

    John Fastabend
     
  • Export skb_copy_ubufs so that modules can orphan frags.

    Signed-off-by: Michael S. Tsirkin
    Signed-off-by: David S. Miller

    Michael S. Tsirkin
     
  • Zero copy packets are normally sent to the outside network, but
    bridging, tun etc. might loop them back to the host networking stack.
    If this happens, destructors will never be called, so orphan the
    frags immediately on receive.

    Signed-off-by: Michael S. Tsirkin
    Signed-off-by: David S. Miller

    Michael S. Tsirkin
     
  • Reduce code duplication a bit using the new helper.

    Signed-off-by: Michael S. Tsirkin
    Signed-off-by: David S. Miller

    Michael S. Tsirkin
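
    The new helper referred to here appears to be skb_orphan_frags();
    roughly (a sketch based on the surrounding commits):

        /* copy userspace (zerocopy) frags so their destructor can run,
         * but only when the skb actually carries zerocopy frags */
        static inline int skb_orphan_frags(struct sk_buff *skb, gfp_t gfp_mask)
        {
                if (likely(!(skb_shinfo(skb)->tx_flags & SKBTX_DEV_ZEROCOPY)))
                        return 0;
                return skb_copy_ubufs(skb, gfp_mask);
        }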
     
  • Commit 76ff5cc91935c51fcf1a6a99ffa28b97a6e7a884
    (rtnl: allow to specify number of rx and tx queues
    on device creation) added a reference to the net_device
    structure's 'num_rx_queues' member in

    net/core/rtnetlink.c:rtnl_fill_ifinfo()

    However, the definition for 'num_rx_queues' is surrounded
    by an '#ifdef CONFIG_RPS' while the new reference to it is
    not. This causes a compile error when CONFIG_RPS is not
    defined.

    Fix the compile error by surrounding the new reference to
    'num_rx_queues' by an '#ifdef CONFIG_RPS'.

    CC: Jiri Pirko
    Signed-off-by: Mark A. Greer
    Signed-off-by: David S. Miller

    Mark A. Greer
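
    The fix mirrors the guard around the definition (a sketch; the
    attribute name follows the referenced commit):

        #ifdef CONFIG_RPS
                if (nla_put_u32(skb, IFLA_NUM_RX_QUEUES, dev->num_rx_queues))
                        goto nla_put_failure;
        #endif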
     

21 Jul, 2012

3 commits