Doug / smarc-fsl-linux-kernel | Embedian Git Server

23 Aug, 2012

23 commits

807736f6e batman-adv: Split batadv_priv in sub-structures for features ... Browse Code »

The structure batadv_priv grows everytime a new feature is introduced. It gets
hard to find the parts of the struct that belongs to a specific feature. This
becomes even harder by the fact that not every feature uses a prefix in the
member name.

The variables for bridge loop avoidence, gateway handling, translation table
and visualization server are moved into separate structs that are included in
the bat_priv main struct.

Signed-off-by: Sven Eckelmann
Signed-off-by: Antonio Quartulli

Sven Eckelmann
2012-08-23 20:20:13 +0800
624463079 batman-adv: check batadv_orig_hash_add_if() return code ... Browse Code »

If this call fails, some of the orig_nodes spaces may have been
resized for the increased number of interface, and some may not.
If we would just continue with the larger number of interfaces,
this would lead to access to not allocated memory later.

We better check the return code, and don't add the interface if
no memory is available. OTOH, keeping some of the orig_nodes
with too much memory allocated should hurt no one (except for
a few too many bytes allocated).

Signed-off-by: Simon Wunderlich
Signed-off-by: Antonio Quartulli

Simon Wunderlich
2012-08-23 20:02:46 +0800
a51fb9b2a batman-adv: fix typos in comments ... Browse Code »

the word millisecond is misspelled in several comments. This patch fixes it.

Signed-off-by: Antonio Quartulli

Antonio Quartulli
2012-08-23 20:02:45 +0800
d657e621a batman-adv: add reference counting for type batadv_tt_orig_list_entry ... Browse Code »

The batadv_tt_orig_list_entry structure didn't have any refcounting mechanism so
far. This patch introduces it and makes the structure being usable in much more
complex context.

Signed-off-by: Antonio Quartulli

Antonio Quartulli
2012-08-23 20:02:44 +0800
3a7f291bf batman-adv: remove a misleading comment ... Browse Code »

As much as I'm happy to see LWN links sprinkled through the kernel by the
dozen, this one in particular reflects a very old state of reality; the
associated comment is now incorrect. So just delete it.

Signed-off-by: Jonathan Corbet
Signed-off-by: Antonio Quartulli

Jonathan Corbet
2012-08-23 20:02:44 +0800
1c9b0550f batman-adv: convert remaining packet counters to per_cpu_ptr() infrastructure ... Browse Code »

Signed-off-by: Marek Lindner
Acked-by: Martin Hundebøll
Signed-off-by: Antonio Quartulli

Marek Lindner
2012-08-23 20:02:43 +0800
3eb8773e3 batman-adv: rename bridge loop avoidance claim types ... Browse Code »

for consistency reasons within the code and with the documentation,
we should always call it "claim" and "unclaim".

Signed-off-by: Simon Wunderlich
Signed-off-by: Antonio Quartulli

Simon Wunderlich
2012-08-23 20:02:42 +0800
99e966fc9 batman-adv: correct comments in bridge loop avoidance ... Browse Code »

Signed-off-by: Simon Wunderlich
Signed-off-by: Antonio Quartulli

Simon Wunderlich
2012-08-23 20:02:41 +0800
536a23f11 batman-adv: Add the backbone gateway list to debugfs ... Browse Code »

This is especially useful if there are no claims yet, but we still want
to know which gateways are using bridge loop avoidance in the network.

Signed-off-by: Simon Wunderlich
Signed-off-by: Antonio Quartulli

Simon Wunderlich
2012-08-23 20:02:41 +0800
c70437289 batman-adv: move function arguments on one line ... Browse Code »

Signed-off-by: Antonio Quartulli

Antonio Quartulli
2012-08-23 20:02:40 +0800
0fa7fa98d packet: Protect packet sk list with mutex (v2) ... Browse Code »

Change since v1:

* Fixed inuse counters access spotted by Eric

In patch eea68e2f (packet: Report socket mclist info via diag module) I've
introduced a "scheduling in atomic" problem in packet diag module -- the
socket list is traversed under rcu_read_lock() while performed under it sk
mclist access requires rtnl lock (i.e. -- mutex) to be taken.

[152363.820563] BUG: scheduling while atomic: crtools/12517/0x10000002
[152363.820573] 4 locks held by crtools/12517:
[152363.820581] #0: (sock_diag_mutex){+.+.+.}, at: [] sock_diag_rcv+0x1f/0x3e
[152363.820613] #1: (sock_diag_table_mutex){+.+.+.}, at: [] sock_diag_rcv_msg+0xdb/0x11a
[152363.820644] #2: (nlk->cb_mutex){+.+.+.}, at: [] netlink_dump+0x23/0x1ab
[152363.820693] #3: (rcu_read_lock){.+.+..}, at: [] packet_diag_dump+0x0/0x1af

Similar thing was then re-introduced by further packet diag patches (fanount
mutex and pgvec mutex for rings) :(

Apart from being terribly sorry for the above, I propose to change the packet
sk list protection from spinlock to mutex. This lock currently protects two
modifications:

* sklist
* prot inuse counters

The sklist modifications can be just reprotected with mutex since they already
occur in a sleeping context. The inuse counters modifications are trickier -- the
__this_cpu_-s are used inside, thus requiring the caller to handle the potential
issues with contexts himself. Since packet sockets' counters are modified in two
places only (packet_create and packet_release) we only need to protect the context
from being preempted. BH disabling is not required in this case.

Signed-off-by: Pavel Emelyanov
Acked-by: Eric Dumazet
Signed-off-by: David S. Miller

Pavel Emelyanov
2012-08-23 13:58:27 +0800
b32607dd4 mdio: translation of MMD EEE registers to/from ethtool settings ... Browse Code »

The helper functions which translate IEEE MDIO Manageable Device (MMD)
Energy-Efficient Ethernet (EEE) registers 3.20, 7.60 and 7.61 to and from
the comparable ethtool supported/advertised settings will be needed by
drivers other than those in PHYLIB (e.g. e1000e in a follow-on patch).

In the same fashion as similar translation functions in linux/mii.h, move
these functions from the PHYLIB core to the linux/mdio.h header file so the
code will not have to be duplicated in each driver needing MMD-to-ethtool
(and vice-versa) translations. The function and some variable names have
been renamed to be more descriptive.

Not tested on the only hardware that currently calls the related functions,
stmmac, because I don't have access to any. Has been compile tested and
the translations have been tested on a locally modified version of e1000e.

Signed-off-by: Bruce Allan
Cc: Giuseppe Cavallaro
Signed-off-by: David S. Miller

Allan, Bruce W
2012-08-23 13:58:27 +0800
9e67030af af_packet: use define instead of constant ... Browse Code »

Instead of using a hard-coded value for the status variable, it would make
the code more readable to use its destined define from linux/if_packet.h.

Signed-off-by: daniel.borkmann@tik.ee.ethz.ch
Signed-off-by: David S. Miller

danborkmann@iogearbox.net
2012-08-23 13:58:27 +0800
bfdc587c5 rds: Don't disable BH on BH context ... Browse Code »

Since we have already in BH context when *_write_space(),
*_data_ready() as well as *_state_change() are called, it's
unnecessary to disable BH.

Signed-off-by: Ying Xue
Signed-off-by: David S. Miller

Ying Xue
2012-08-23 13:52:04 +0800
6b923cb71 bonding: support for IPv6 transmit hashing ... Browse Code »

Currently the "bonding" driver does not support load balancing outgoing
traffic in LACP mode for IPv6 traffic. IPv4 (and TCP or UDP over IPv4)
are currently supported; this patch adds transmit hashing for IPv6 (and
TCP or UDP over IPv6), bringing IPv6 up to par with IPv4 support in the
bonding driver. In addition, bounds checking has been added to all
transmit hashing functions.

The algorithm chosen (xor'ing the bottom three quads of the source and
destination addresses together, then xor'ing each byte of that result into
the bottom byte, finally xor'ing with the last bytes of the MAC addresses)
was selected after testing almost 400,000 unique IPv6 addresses harvested
from server logs. This algorithm had the most even distribution for both
big- and little-endian architectures while still using few instructions. Its
behavior also attempts to closely match that of the IPv4 algorithm.

The IPv6 flow label was intentionally not included in the hash as it appears
to be unset in the vast majority of IPv6 traffic sampled, and the current
algorithm not using the flow label already offers a very even distribution.

Fragmented IPv6 packets are handled the same way as fragmented IPv4 packets,
ie, they are not balanced based on layer 4 information. Additionally,
IPv6 packets with intermediate headers are not balanced based on layer
4 information. In practice these intermediate headers are not common and
this should not cause any problems, and the alternative (a packet-parsing
loop and look-up table) seemed slow and complicated for little gain.

Tested-by: John Eaglesham
Signed-off-by: John Eaglesham
Signed-off-by: David S. Miller

John Eaglesham
2012-08-23 13:49:30 +0800
b87fb39e3 ipv6: gre: fix ip6gre_err() ... Browse Code »

ip6gre_err() miscomputes grehlen (sizeof(ipv6h) is 4 or 8,
not 40 as expected), and should take into account 'offset' parameter.

Also uses pskb_may_pull() to cope with some fragged skbs

Signed-off-by: Eric Dumazet
Cc: Dmitry Kozlov
Signed-off-by: David S. Miller

Eric Dumazet
2012-08-23 13:48:32 +0800
ef8531b64 xfrm: fix RCU bugs ... Browse Code »

This patch reverts commit 56892261ed1a (xfrm: Use rcu_dereference_bh to
deference pointer protected by rcu_read_lock_bh), and fixes bugs
introduced in commit 418a99ac6ad ( Replace rwlock on xfrm_policy_afinfo
with rcu )

1) We properly use RCU variant in this file, not a mix of RCU/RCU_BH

2) We must defer some writes after the synchronize_rcu() call or a reader
can crash dereferencing NULL pointer.

3) Now we use the xfrm_policy_afinfo_lock spinlock only from process
context, we no longer need to block BH in xfrm_policy_register_afinfo()
and xfrm_policy_unregister_afinfo()

4) Can use RCU_INIT_POINTER() instead of rcu_assign_pointer() in
xfrm_policy_unregister_afinfo()

5) Remove a forward inline declaration (xfrm_policy_put_afinfo()),
and also move xfrm_policy_get_afinfo() declaration.

Signed-off-by: Eric Dumazet
Cc: Fan Du
Cc: Priyanka Jain
Signed-off-by: David S. Miller

Eric Dumazet
2012-08-23 13:39:46 +0800
0115e8e30 net: remove delay at device dismantle ... Browse Code »

I noticed extra one second delay in device dismantle, tracked down to
a call to dst_dev_event() while some call_rcu() are still in RCU queues.

These call_rcu() were posted by rt_free(struct rtable *rt) calls.

We then wait a little (but one second) in netdev_wait_allrefs() before
kicking again NETDEV_UNREGISTER.

As the call_rcu() are now completed, dst_dev_event() can do the needed
device swap on busy dst.

To solve this problem, add a new NETDEV_UNREGISTER_FINAL, called
after a rcu_barrier(), but outside of RTNL lock.

Use NETDEV_UNREGISTER_FINAL with care !

Change dst_dev_event() handler to react to NETDEV_UNREGISTER_FINAL

Also remove NETDEV_UNREGISTER_BATCH, as its not used anymore after
IP cache removal.

With help from Gao feng

Signed-off-by: Eric Dumazet
Cc: Tom Herbert
Cc: Mahesh Bandewar
Cc: "Eric W. Biederman"
Cc: Gao feng
Signed-off-by: David S. Miller

Eric Dumazet
2012-08-23 12:50:36 +0800
bf277b0cc Merge git://1984.lsi.us.es/nf-next ... Browse Code »

Pablo Neira Ayuso says:

====================
This is the first batch of Netfilter and IPVS updates for your
net-next tree. Mostly cleanups for the Netfilter side. They are:

* Remove unnecessary RTNL locking now that we have support
for namespace in nf_conntrack, from Patrick McHardy.

* Cleanup to eliminate unnecessary goto in the initialization
path of several Netfilter tables, from Jean Sacren.

* Another cleanup from Wu Fengguang, this time to PTR_RET instead
of if IS_ERR then return PTR_ERR.

* Use list_for_each_entry_continue_rcu in nf_iterate, from
Michael Wang.

* Add pmtu_disc sysctl option to disable PMTU in their tunneling
transmitter, from Julian Anastasov.

* Generalize application protocol registration in IPVS and modify
IPVS FTP helper to use it, from Julian Anastasov.

* update Kconfig. The IPVS FTP helper depends on the Netfilter FTP
helper for NAT support, from Julian Anastasov.

* Add logic to update PMTU for IPIP packets in IPVS, again
from Julian Anastasov.

* A couple of sparse warning fixes for IPVS and Netfilter from
Claudiu Ghioc and Patrick McHardy respectively.

Patrick's IPv6 NAT changes will follow after this batch, I need
to flush this batch first before refreshing my tree.
====================

Signed-off-by: David S. Miller

David S. Miller
2012-08-23 09:48:52 +0800
bba6ec7e4 Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/jkirsher/net-next ... Browse Code »

Jeff Kirsher says:

====================
This series contains updates to ethtool.h, e1000, e1000e, and igb to
implement MDI/MDIx control.
====================

Signed-off-by: David S. Miller

David S. Miller
2012-08-23 05:23:43 +0800
1304a7343 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net Browse Code »

David S. Miller
2012-08-23 05:21:38 +0800
90efbed18 netfilter: remove unnecessary goto statement for error recovery ... Browse Code »

Usually it's a good practice to use goto statement for error recovery
when initializing the module. This approach could be an overkill if:

1) there is only one fail case;
2) success and failure use the same return statement.

For a cleaner approach, remove the unnecessary goto statement and
directly implement error recovery.

Signed-off-by: Jean Sacren
Signed-off-by: Pablo Neira Ayuso

Jean Sacren
2012-08-23 01:17:38 +0800
6705e8672 netfilter: replace list_for_each_continue_rcu with new interface ... Browse Code »

This patch replaces list_for_each_continue_rcu() with
list_for_each_entry_continue_rcu() to allow removing
list_for_each_continue_rcu().

Signed-off-by: Michael Wang
Reviewed-by: Paul E. McKenney
Signed-off-by: Pablo Neira Ayuso

Michael Wang
2012-08-23 01:17:20 +0800

22 Aug, 2012

17 commits

23dcfa61b Merge branch 'akpm' (Andrew's patch-bomb) ... Browse Code »

Merge fixes from Andrew Morton.

Random drivers and some VM fixes.

* emailed patches from Andrew Morton : (17 commits)
mm: compaction: Abort async compaction if locks are contended or taking too long
mm: have order > 0 compaction start near a pageblock with free pages
rapidio/tsi721: fix unused variable compiler warning
rapidio/tsi721: fix inbound doorbell interrupt handling
drivers/rtc/rtc-rs5c348.c: fix hour decoding in 12-hour mode
mm: correct page->pfmemalloc to fix deactivate_slab regression
drivers/rtc/rtc-pcf2123.c: initialize dynamic sysfs attributes
mm/compaction.c: fix deferring compaction mistake
drivers/misc/sgi-xp/xpc_uv.c: SGI XPC fails to load when cpu 0 is out of IRQ resources
string: do not export memweight() to userspace
hugetlb: update hugetlbpage.txt
checkpatch: add control statement test to SINGLE_STATEMENT_DO_WHILE_MACRO
mm: hugetlbfs: correctly populate shared pmd
cciss: fix incorrect scsi status reporting
Documentation: update mount option in filesystem/vfat.txt
mm: change nr_ptes BUG_ON to WARN_ON
cs5535-clockevt: typo, it's MFGPT, not MFPGT

Linus Torvalds
2012-08-22 08:22:22 +0800
a484147a5 Merge branch 'v4l_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-media ... Browse Code »

Pull media fixes from Mauro Carvalho Chehab:
"For bug fixes, at soc_camera, si470x, uvcvideo, iguanaworks IR driver,
radio_shark Kbuild fixes, and at the V4L2 core (radio fixes)."

* 'v4l_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-media:
[media] media: soc_camera: don't clear pix->sizeimage in JPEG mode
[media] media: mx2_camera: Fix clock handling for i.MX27
[media] video: mx2_camera: Use clk_prepare_enable/clk_disable_unprepare
[media] video: mx1_camera: Use clk_prepare_enable/clk_disable_unprepare
[media] media: mx3_camera: buf_init() add buffer state check
[media] radio-shark2: Only compile led support when CONFIG_LED_CLASS is set
[media] radio-shark: Only compile led support when CONFIG_LED_CLASS is set
[media] radio-shark*: Call cancel_work_sync from disconnect rather then release
[media] radio-shark*: Remove work-around for dangling pointer in usb intfdata
[media] Add USB dependency for IguanaWorks USB IR Transceiver
[media] Add missing logging for rangelow/high of hwseek
[media] VIDIOC_ENUM_FREQ_BANDS fix
[media] mem2mem_testdev: fix querycap regression
[media] si470x: v4l2-compliance fixes
[media] DocBook: Remove a spurious character
[media] uvcvideo: Reset the bytesused field when recycling an erroneous buffer

Linus Torvalds
2012-08-22 07:54:38 +0800
8f8ba75ee Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net ... Browse Code »

Pull networking update from David Miller:
"A couple weeks of bug fixing in there. The largest chunk is all the
broken crap Amerigo Wang found in the netpoll layer."

1) netpoll and it's users has several serious bugs:
a) uses GFP_KERNEL with locks held
b) interfaces requiring interrupts disabled are called with them
enabled
c) and vice versa
d) VLAN tag demuxing, as per all other RX packet input paths, is not
applied

All from Amerigo Wang.

2) Hopefully cure the ipv4 mapped ipv6 address TCP early demux bugs for
good, from Neal Cardwell.

3) Unlike AF_UNIX, AF_PACKET sockets don't set a default credentials
when the user doesn't specify one explicitly during sendmsg().
Instead we attach an empty (zero) SCM credential block which is
definitely not what we want. Fix from Eric Dumazet.

4) IPv6 illegally invokes netdevice notifiers with RCU lock held, fix
from Ben Hutchings.

5) inet_csk_route_child_sock() checks wrong inet options pointer, fix
from Christoph Paasch.

6) When AF_PACKET is used for transmit, packet loopback doesn't behave
properly when a socket fanout is enabled, from Eric Leblond.

7) On bluetooth l2cap channel create failure, we leak the socket, from
Jaganath Kanakkassery.

8) Fix all the netprio file handling bugs found by Al Viro, from John
Fastabend.

9) Several error return and NULL deref bug fixes in networking drivers
from Julia Lawall.

10) A large smattering of struct padding et al. kernel memory leaks to
userspace found of Mathias Krause.

11) Conntrack expections in netfilter can access an uninitialized timer,
fix from Pablo Neira Ayuso.

12) Several netfilter SIP tracker bug fixes from Patrick McHardy.

13) IPSEC ipv6 routes are not initialized correctly all the time,
resulting in an OOPS in inet_putpeer(). Also from Patrick McHardy.

14) Bridging does rcu_dereference() outside of RCU protected area, from
Stephen Hemminger.

15) Fix routing cache removal performance regression when looking up
output routes that have a local destination. From Zheng Yan.

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (87 commits)
af_netlink: force credentials passing [CVE-2012-3520]
ipv4: fix ip header ident selection in __ip_make_skb()
ipv4: Use newinet->inet_opt in inet_csk_route_child_sock()
tcp: fix possible socket refcount problem
net: tcp: move sk_rx_dst_set call after tcp_create_openreq_child()
net/core/dev.c: fix kernel-doc warning
netconsole: remove a redundant netconsole_target_put()
net: ipv6: fix oops in inet_putpeer()
net/stmmac: fix issue of clk_get for Loongson1B.
caif: Do not dereference NULL in chnl_recv_cb()
af_packet: don't emit packet on orig fanout group
drivers/net/irda: fix error return code
drivers/net/wan/dscc4.c: fix error return code
drivers/net/wimax/i2400m/fw.c: fix error return code
smsc75xx: add missing entry to MAINTAINERS
net: qmi_wwan: new devices: UML290 and K5006-Z
net: sh_eth: Add eth support for R8A7779 device
netdev/phy: skip disabled mdio-mux nodes
dt: introduce for_each_available_child_of_node, of_get_next_available_child
net: netprio: fix cgrp create and write priomap race
...

Linus Torvalds
2012-08-22 07:46:08 +0800
c67fe3752 mm: compaction: Abort async compaction if locks are contended or taking too long ... Browse Code »

Jim Schutt reported a problem that pointed at compaction contending
heavily on locks. The workload is straight-forward and in his own words;

The systems in question have 24 SAS drives spread across 3 HBAs,
running 24 Ceph OSD instances, one per drive. FWIW these servers
are dual-socket Intel 5675 Xeons w/48 GB memory. I've got ~160
Ceph Linux clients doing dd simultaneously to a Ceph file system
backed by 12 of these servers.

Early in the test everything looks fine

procs -------------------memory------------------ ---swap-- -----io---- --system-- -----cpu-------
r b swpd free buff cache si so bi bo in cs us sy id wa st
31 15 0 287216 576 38606628 0 0 2 1158 2 14 1 3 95 0 0
27 15 0 225288 576 38583384 0 0 18 2222016 203357 134876 11 56 17 15 0
28 17 0 219256 576 38544736 0 0 11 2305932 203141 146296 11 49 23 17 0
6 18 0 215596 576 38552872 0 0 7 2363207 215264 166502 12 45 22 20 0
22 18 0 226984 576 38596404 0 0 3 2445741 223114 179527 12 43 23 22 0

and then it goes to pot

procs -------------------memory------------------ ---swap-- -----io---- --system-- -----cpu-------
r b swpd free buff cache si so bi bo in cs us sy id wa st
163 8 0 464308 576 36791368 0 0 11 22210 866 536 3 13 79 4 0
207 14 0 917752 576 36181928 0 0 712 1345376 134598 47367 7 90 1 2 0
123 12 0 685516 576 36296148 0 0 429 1386615 158494 60077 8 84 5 3 0
123 12 0 598572 576 36333728 0 0 1107 1233281 147542 62351 7 84 5 4 0
622 7 0 660768 576 36118264 0 0 557 1345548 151394 59353 7 85 4 3 0
223 11 0 283960 576 36463868 0 0 46 1107160 121846 33006 6 93 1 1 0

Note that system CPU usage is very high blocks being written out has
dropped by 42%. He analysed this with perf and found

perf record -g -a sleep 10
perf report --sort symbol --call-graph fractal,5
34.63% [k] _raw_spin_lock_irqsave
|
|--97.30%-- isolate_freepages
| compaction_alloc
| unmap_and_move
| migrate_pages
| compact_zone
| compact_zone_order
| try_to_compact_pages
| __alloc_pages_direct_compact
| __alloc_pages_slowpath
| __alloc_pages_nodemask
| alloc_pages_vma
| do_huge_pmd_anonymous_page
| handle_mm_fault
| do_page_fault
| page_fault
| |
| |--87.39%-- skb_copy_datagram_iovec
| | tcp_recvmsg
| | inet_recvmsg
| | sock_recvmsg
| | sys_recvfrom
| | system_call
| | __recv
| | |
| | --100.00%-- (nil)
| |
| --12.61%-- memcpy
--2.70%-- [...]

There was other data but primarily it is all showing that compaction is
contended heavily on the zone->lock and zone->lru_lock.

commit [b2eef8c0: mm: compaction: minimise the time IRQs are disabled
while isolating pages for migration] noted that it was possible for
migration to hold the lru_lock for an excessive amount of time. Very
broadly speaking this patch expands the concept.

This patch introduces compact_checklock_irqsave() to check if a lock
is contended or the process needs to be scheduled. If either condition
is true then async compaction is aborted and the caller is informed.
The page allocator will fail a THP allocation if compaction failed due
to contention. This patch also introduces compact_trylock_irqsave()
which will acquire the lock only if it is not contended and the process
does not need to schedule.

Reported-by: Jim Schutt
Tested-by: Jim Schutt
Signed-off-by: Mel Gorman
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Mel Gorman
2012-08-22 07:45:03 +0800
de74f1cc3 mm: have order > 0 compaction start near a pageblock with free pages ... Browse Code »

Commit 7db8889ab05b ("mm: have order > 0 compaction start off where it
left") introduced a caching mechanism to reduce the amount work the free
page scanner does in compaction. However, it has a problem. Consider
two process simultaneously scanning free pages

C
Process A M S F
|---------------------------------------|
Process B M FS

C is zone->compact_cached_free_pfn
S is cc->start_pfree_pfn
M is cc->migrate_pfn
F is cc->free_pfn

In this diagram, Process A has just reached its migrate scanner, wrapped
around and updated compact_cached_free_pfn accordingly.

Simultaneously, Process B finishes isolating in a block and updates
compact_cached_free_pfn again to the location of its free scanner.

Process A moves to "end_of_zone - one_pageblock" and runs this check

if (cc->order > 0 && (!cc->wrapped ||
zone->compact_cached_free_pfn >
cc->start_free_pfn))
pfn = min(pfn, zone->compact_cached_free_pfn);

compact_cached_free_pfn is above where it started so the free scanner
skips almost the entire space it should have scanned. When there are
multiple processes compacting it can end in a situation where the entire
zone is not being scanned at all. Further, it is possible for two
processes to ping-pong update to compact_cached_free_pfn which is just
random.

Overall, the end result wrecks allocation success rates.

There is not an obvious way around this problem without introducing new
locking and state so this patch takes a different approach.

First, it gets rid of the skip logic because it's not clear that it
matters if two free scanners happen to be in the same block but with
racing updates it's too easy for it to skip over blocks it should not.

Second, it updates compact_cached_free_pfn in a more limited set of
circumstances.

If a scanner has wrapped, it updates compact_cached_free_pfn to the end
of the zone. When a wrapped scanner isolates a page, it updates
compact_cached_free_pfn to point to the highest pageblock it
can isolate pages from.

If a scanner has not wrapped when it has finished isolated pages it
checks if compact_cached_free_pfn is pointing to the end of the
zone. If so, the value is updated to point to the highest
pageblock that pages were isolated from. This value will not
be updated again until a free page scanner wraps and resets
compact_cached_free_pfn.

This is not optimal and it can still race but the compact_cached_free_pfn
will be pointing to or very near a pageblock with free pages.

Signed-off-by: Mel Gorman
Reviewed-by: Rik van Riel
Reviewed-by: Minchan Kim
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Mel Gorman
2012-08-22 07:45:03 +0800
9a9a9a7ad rapidio/tsi721: fix unused variable compiler warning ... Browse Code »

Fix unused variable compiler warning when built with CONFIG_RAPIDIO_DEBUG
option off.

This patch is applicable to kernel versions starting from v3.2

Signed-off-by: Alexandre Bounine
Cc: Matt Porter
Cc:
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Alexandre Bounine
2012-08-22 07:45:03 +0800
3670e7e12 rapidio/tsi721: fix inbound doorbell interrupt handling ... Browse Code »

Make sure that there is no doorbell messages left behind due to disabled
interrupts during inbound doorbell processing.

The most common case for this bug is loss of rionet JOIN messages in
systems with three or more rionet participants and MSI or MSI-X enabled.
As result, requests for packet transfers may finish with "destination
unreachable" error message.

This patch is applicable to kernel versions starting from v3.2.

Signed-off-by: Alexandre Bounine
Cc: Matt Porter
Cc:
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Alexandre Bounine
2012-08-22 07:45:03 +0800
7dbfb315b drivers/rtc/rtc-rs5c348.c: fix hour decoding in 12-hour mode ... Browse Code »

Correct the offset by subtracting 20 from tm_hour before taking the
modulo 12.

[ "Why 20?" I hear you ask. Or at least I did.

Here's the reason why: RS5C348_BIT_PM is 32, and is - stupidly -
included in the RS5C348_HOURS_MASK define. So it's really subtracting
out that bit to get "hour+12". But then because it does things modulo
12, it needs to add the 12 in again afterwards anyway.

This code is confused. It would be much clearer if RS5C348_HOURS_MASK
just didn't include the RS5C348_BIT_PM bit at all, then it wouldn't
need to do the silly subtract either.

Whatever. It's all just math, the end result is the same. - Linus ]

Reported-by: James Nute
Tested-by: James Nute
Signed-off-by: Atsushi Nemoto
Cc:
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Atsushi Nemoto
2012-08-22 07:45:03 +0800
b121186ab mm: correct page->pfmemalloc to fix deactivate_slab regression ... Browse Code »

Commit cfd19c5a9ecf ("mm: only set page->pfmemalloc when
ALLOC_NO_WATERMARKS was used") tried to narrow down page->pfmemalloc
setting, but it missed some places the pfmemalloc should be set.

So, in __slab_alloc, the unalignment pfmemalloc and ALLOC_NO_WATERMARKS
cause incorrect deactivate_slab() on our core2 server:

64.73% fio [kernel.kallsyms] [k] _raw_spin_lock
|
--- _raw_spin_lock
|
|---0.34%-- deactivate_slab
| __slab_alloc
| kmem_cache_alloc
| |

That causes our fio sync write performance to have a 40% regression.

Move the checking in get_page_from_freelist() which resolves this issue.

Signed-off-by: Alex Shi
Acked-by: Mel Gorman
Cc: David Miller
Tested-by: Eric Dumazet
Tested-by: Sage Weil
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Alex Shi
2012-08-22 07:45:03 +0800
5ed12f128 drivers/rtc/rtc-pcf2123.c: initialize dynamic sysfs attributes ... Browse Code »

Dynamically allocated sysfs attributes must be initialized using
sysfs_attr_init(), otherwise lockdep complains: BUG: key
not in
.data!

Found by Linux Driver Verification project (linuxtesting.org).

Signed-off-by: Ilya Shchepetkov
Cc: Chris Verges
Cc: Christian Pellegrin
Cc: Alessandro Zummo
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Ilya Shchepetkov
2012-08-22 07:45:03 +0800
c81758fbe mm/compaction.c: fix deferring compaction mistake ... Browse Code »

Commit aff622495c9a ("vmscan: only defer compaction for failed order and
higher") fixed bad deferring policy but made mistake about checking
compact_order_failed in __compact_pgdat(). So it can't update
compact_order_failed with the new order. This ends up preventing
correct operation of policy deferral. This patch fixes it.

Signed-off-by: Minchan Kim
Reviewed-by: Rik van Riel
Acked-by: Mel Gorman
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Minchan Kim
2012-08-22 07:45:03 +0800
7838f994b drivers/misc/sgi-xp/xpc_uv.c: SGI XPC fails to load when cpu 0 is out of IRQ resources ... Browse Code »

On many of our larger systems, CPU 0 has had all of its IRQ resources
consumed before XPC loads. Worst cases on machines with multiple 10
GigE cards and multiple IB cards have depleted the entire first socket
of IRQs.

This patch makes selecting the node upon which IRQs are allocated (as
well as all the other GRU Message Queue structures) specifiable as a
module load param and has a default behavior of searching all nodes/cpus
for an available resources.

[akpm@linux-foundation.org: fix build: include cpu.h and module.h]
Signed-off-by: Robin Holt
Cc:
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Robin Holt
2012-08-22 07:45:03 +0800
c3a5ce041 string: do not export memweight() to userspace ... Browse Code »

Fix the following warning:

usr/include/linux/string.h:8: userspace cannot reference function or variable defined in the kernel

Signed-off-by: WANG Cong
Acked-by: Akinobu Mita
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

WANG Cong
2012-08-22 07:45:03 +0800
d46f3d86f hugetlb: update hugetlbpage.txt ... Browse Code »

Commit f0f57b2b1488 ("mm: move hugepage test examples to
tools/testing/selftests/vm") moved map_hugetlb.c, hugepage-shm.c and
hugepage-mmap.c tests into tools/testing/selftests/vm/ directory, but it
didn't update hugetlbpage.txt

Signed-off-by: Zhouping Liu
Acked-by: Dave Young
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Zhouping Liu
2012-08-22 07:45:03 +0800
ac8e97f8a checkpatch: add control statement test to SINGLE_STATEMENT_DO_WHILE_MACRO ... Browse Code »

Commit b13edf7ff2dd ("checkpatch: add checks for do {} while (0) macro
misuses") added a test that is overly simplistic for single statement
macros.

Macros that start with control tests should be enclosed in a do {} while
(0) loop.

Add the necessary control tests to the check.

Signed-off-by: Joe Perches
Acked-by: Andy Whitcroft
Tested-by: Franz Schrober
Cc: Stephen Rothwell
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Joe Perches
2012-08-22 07:45:02 +0800
eb48c0714 mm: hugetlbfs: correctly populate shared pmd ... Browse Code »

Each page mapped in a process's address space must be correctly
accounted for in _mapcount. Normally the rules for this are
straightforward but hugetlbfs page table sharing is different. The page
table pages at the PMD level are reference counted while the mapcount
remains the same.

If this accounting is wrong, it causes bugs like this one reported by
Larry Woodman:

kernel BUG at mm/filemap.c:135!
invalid opcode: 0000 [#1] SMP
CPU 22
Modules linked in: bridge stp llc sunrpc binfmt_misc dcdbas microcode pcspkr acpi_pad acpi]
Pid: 18001, comm: mpitest Tainted: G W 3.3.0+ #4 Dell Inc. PowerEdge R620/07NDJ2
RIP: 0010:[] [] __delete_from_page_cache+0x15d/0x170
Process mpitest (pid: 18001, threadinfo ffff880428972000, task ffff880428b5cc20)
Call Trace:
delete_from_page_cache+0x40/0x80
truncate_hugepages+0x115/0x1f0
hugetlbfs_evict_inode+0x18/0x30
evict+0x9f/0x1b0
iput_final+0xe3/0x1e0
iput+0x3e/0x50
d_kill+0xf8/0x110
dput+0xe2/0x1b0
__fput+0x162/0x240

During fork(), copy_hugetlb_page_range() detects if huge_pte_alloc()
shared page tables with the check dst_pte == src_pte. The logic is if
the PMD page is the same, they must be shared. This assumes that the
sharing is between the parent and child. However, if the sharing is
with a different process entirely then this check fails as in this
diagram:

parent
|
------------>pmd
src_pte----------> data page
^
other--------->pmd--------------------|
^
child-----------|
dst_pte

For this situation to occur, it must be possible for Parent and Other to
have faulted and failed to share page tables with each other. This is
possible due to the following style of race.

PROC A PROC B
copy_hugetlb_page_range copy_hugetlb_page_range
src_pte == huge_pte_offset src_pte == huge_pte_offset
!src_pte so no sharing !src_pte so no sharing

(time passes)

hugetlb_fault hugetlb_fault
huge_pte_alloc huge_pte_alloc
huge_pmd_share huge_pmd_share
LOCK(i_mmap_mutex)
find nothing, no sharing
UNLOCK(i_mmap_mutex)
LOCK(i_mmap_mutex)
find nothing, no sharing
UNLOCK(i_mmap_mutex)
pmd_alloc pmd_alloc
LOCK(instantiation_mutex)
fault
UNLOCK(instantiation_mutex)
LOCK(instantiation_mutex)
fault
UNLOCK(instantiation_mutex)

These two processes are not poing to the same data page but are not
sharing page tables because the opportunity was missed. When either
process later forks, the src_pte == dst pte is potentially insufficient.
As the check falls through, the wrong PTE information is copied in
(harmless but wrong) and the mapcount is bumped for a page mapped by a
shared page table leading to the BUG_ON.

This patch addresses the issue by moving pmd_alloc into huge_pmd_share
which guarantees that the shared pud is populated in the same critical
section as pmd. This also means that huge_pte_offset test in
huge_pmd_share is serialized correctly now which in turn means that the
success of the sharing will be higher as the racing tasks see the pud
and pmd populated together.

Race identified and changelog written mostly by Mel Gorman.

{akpm@linux-foundation.org: attempt to make the huge_pmd_share() comment comprehensible, clean up coding style]
Reported-by: Larry Woodman
Tested-by: Larry Woodman
Reviewed-by: Mel Gorman
Signed-off-by: Michal Hocko
Reviewed-by: Rik van Riel
Cc: David Gibson
Cc: Ken Chen
Cc: Cong Wang
Cc: Hillf Danton
Cc:
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Michal Hocko
2012-08-22 07:45:02 +0800
b0cf0b118 cciss: fix incorrect scsi status reporting ... Browse Code »

Delete code which sets SCSI status incorrectly as it's already been set
correctly above this incorrect code. The bug was introduced in 2009 by
commit b0e15f6db111 ("cciss: fix typo that causes scsi status to be
lost.")

Signed-off-by: Stephen M. Cameron
Reported-by: Roel van Meer
Tested-by: Roel van Meer
Cc: Jens Axboe
Cc:
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Stephen M. Cameron
2012-08-22 07:45:02 +0800