27 Apr, 2012

1 commit

  • Merge fixes from Andrew Morton:
    "13 fixes. The acerhdf patches aren't (really) fixes. But they've
    been stuck in my tree for up to two years, sent to Matthew multiple
    times and the developers are unhappy."

    * emailed from Andrew Morton: (13 patches)
    mm: fix NULL ptr dereference in move_pages
    mm: fix NULL ptr dereference in migrate_pages
    revert "proc: clear_refs: do not clear reserved pages"
    drivers/rtc/rtc-ds1307.c: fix BUG shown with lock debugging enabled
    arch/arm/mach-ux500/mbox-db5500.c: world-writable sysfs fifo file
    hugetlbfs: lockdep annotate root inode properly
    acerhdf: lowered default temp fanon/fanoff values
    acerhdf: add support for new hardware
    acerhdf: add support for Aspire 1410 BIOS v1.3314
    fs/buffer.c: remove BUG() in possible but rare condition
    mm: fix up the vmscan stat in vmstat
    epoll: clear the tfile_check_list on -ELOOP
    mm/hugetlb: fix warning in alloc_huge_page/dequeue_huge_page_vma

    Linus Torvalds
     

26 Apr, 2012

6 commits

  • Commit 3268c63 ("mm: fix move/migrate_pages() race on task struct") has
    added an odd construct where 'mm' is checked for being NULL, and if it
    is, it would get dereferenced anyway by mmput()ing it.
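
    A minimal sketch of the fixed pattern (not the literal diff; the error
    value is illustrative): bail out before the put rather than jumping to
    a cleanup label that still calls mmput(mm).

    mm = get_task_mm(task);
    if (!mm)
            return -EINVAL;     /* previously fell through to mmput(mm) */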

    Signed-off-by: Sasha Levin
    Cc: Dave Hansen
    Cc: Mel Gorman
    Cc: Johannes Weiner
    Cc: KOSAKI Motohiro
    Cc: KAMEZAWA Hiroyuki
    Cc: Hugh Dickins
    Acked-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sasha Levin
     
  • Commit 3268c63 ("mm: fix move/migrate_pages() race on task struct") has
    added an odd construct where 'mm' is checked for being NULL, and if it
    is, it would get dereferenced anyway by mmput()ing it.

    This would lead to the following NULL ptr deref and BUG() when calling
    migrate_pages() with a pid that has no mm struct:

    [25904.193704] BUG: unable to handle kernel NULL pointer dereference at 0000000000000050
    [25904.194235] IP: [] mmput+0x27/0xf0
    [25904.194235] PGD 773e6067 PUD 77da0067 PMD 0
    [25904.194235] Oops: 0002 [#1] PREEMPT SMP
    [25904.194235] CPU 2
    [25904.194235] Pid: 31608, comm: trinity Tainted: G W 3.4.0-rc2-next-20120412-sasha #69
    [25904.194235] RIP: 0010:[] [] mmput+0x27/0xf0
    [25904.194235] RSP: 0018:ffff880077d49e08 EFLAGS: 00010202
    [25904.194235] RAX: 0000000000000286 RBX: 0000000000000000 RCX: 0000000000000000
    [25904.194235] RDX: ffff880075ef8000 RSI: 000000000000023d RDI: 0000000000000286
    [25904.194235] RBP: ffff880077d49e18 R08: 0000000000000001 R09: 0000000000000001
    [25904.194235] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
    [25904.194235] R13: 00000000ffffffea R14: ffff880034287740 R15: ffff8800218d3010
    [25904.194235] FS: 00007fc8b244c700(0000) GS:ffff880029800000(0000) knlGS:0000000000000000
    [25904.194235] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    [25904.194235] CR2: 0000000000000050 CR3: 00000000767c6000 CR4: 00000000000406e0
    [25904.194235] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
    [25904.194235] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
    [25904.194235] Process trinity (pid: 31608, threadinfo ffff880077d48000, task ffff880075ef8000)
    [25904.194235] Stack:
    [25904.194235] ffff8800342876c0 0000000000000000 ffff880077d49f78 ffffffff811b8020
    [25904.194235] ffffffff811b7d91 ffff880075ef8000 ffff88002256d200 0000000000000000
    [25904.194235] 00000000000003ff 0000000000000000 0000000000000000 0000000000000000
    [25904.194235] Call Trace:
    [25904.194235] [] sys_migrate_pages+0x340/0x3a0
    [25904.194235] [] ? sys_migrate_pages+0xb1/0x3a0
    [25904.194235] [] system_call_fastpath+0x16/0x1b
    [25904.194235] Code: c9 c3 66 90 55 31 d2 48 89 e5 be 3d 02 00 00 48 83 ec 10 48 89 1c 24 4c 89 64 24 08 48 89 fb 48 c7 c7 cf 0e e1 82 e8 69 18 03 00 ff 4b 50 0f 94 c0 84 c0 0f 84 aa 00 00 00 48 89 df e8 72 f1
    [25904.194235] RIP [] mmput+0x27/0xf0
    [25904.194235] RSP
    [25904.194235] CR2: 0000000000000050
    [25904.348999] ---[ end trace a307b3ed40206b4b ]---

    Signed-off-by: Sasha Levin
    Cc: Dave Hansen
    Cc: Mel Gorman
    Cc: Johannes Weiner
    Cc: KOSAKI Motohiro
    Cc: KAMEZAWA Hiroyuki
    Cc: Hugh Dickins
    Cc: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sasha Levin
     
  • The "pgsteal" stat is confusing because it counts both direct reclaim as
    well as background reclaim. However, we have "kswapd_steal" which also
    counts background reclaim value.

    This patch fixes it and also makes it match the existing "pgscan_" stats.

    Test:
    pgsteal_kswapd_dma32 447623
    pgsteal_kswapd_normal 42272677
    pgsteal_kswapd_movable 0
    pgsteal_direct_dma32 2801
    pgsteal_direct_normal 44353270
    pgsteal_direct_movable 0

    Signed-off-by: Ying Han
    Reviewed-by: Rik van Riel
    Acked-by: Christoph Lameter
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Mel Gorman
    Acked-by: KAMEZAWA Hiroyuki
    Cc: Hillf Danton
    Cc: Hugh Dickins
    Cc: Dan Magenheimer
    Reviewed-by: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ying Han
     
  • Fix a gcc warning (and bug?) introduced in cc9a6c877 ("cpuset: mm: reduce
    large amounts of memory barrier related damage v3")

    Local variable "page" can be uninitialized if the nodemask from the vma
    policy does not intersect with the nodemask from the cpuset. Even if
    that never happens, it is better to initialize this variable explicitly
    than to introduce a kernel oops in a weird corner case.

    mm/hugetlb.c: In function `alloc_huge_page':
    mm/hugetlb.c:1135:5: warning: `page' may be used uninitialized in this function
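
    A sketch of why the explicit initializer matters (the loop shape is
    illustrative, not the exact alloc_huge_page() code):

    struct page *page = NULL;   /* may never be assigned below */

    for_each_zone_zonelist_nodemask(zone, z, zonelist,
                                    gfp_zone(gfp_mask), nodemask) {
            /* iterates zero times if the cpuset and vma nodemasks
             * don't intersect, leaving 'page' untouched */
            page = dequeue_huge_page_node(h, zone_to_nid(zone));
            if (page)
                    break;
    }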

    Signed-off-by: Konstantin Khlebnikov
    Acked-by: Mel Gorman
    Acked-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Konstantin Khlebnikov
     
  • None of the callsites actually need the page_cgroup descriptor
    themselves, so just pass the page and do the look up in there.

    We already had two bugs (6568d4a 'mm: memcg: update the correct soft
    limit tree during migration' and 'memcg: fix Bad page state after
    replace_page_cache') where the passed page and pc were not referring
    to the same page frame.

    Signed-off-by: Johannes Weiner
    Acked-by: Hugh Dickins
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • The comments above __alloc_bootmem_node() claim that the code will
    first try the allocation using 'goal' and if that fails it will
    try again but with the 'goal' requirement dropped.

    Unfortunately, this is not what the code does, so fix it to do so.

    This is important for nobootmem conversions to architectures such
    as sparc where MAX_DMA_ADDRESS is infinity.

    On such architectures all of the allocations done by generic spots,
    such as the sparse-vmemmap implementation, will pass in:

    __pa(MAX_DMA_ADDRESS)

    as the goal, and with the limit given as "-1" this will always fail
    unless we add the appropriate fallback logic here.
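
    A hedged sketch of the fallback being added; alloc_region() is a
    hypothetical stand-in for the underlying bootmem allocation:

    ptr = alloc_region(pgdat, size, align, goal, limit);
    if (ptr)
            return ptr;

    /* retry with the 'goal' requirement dropped, as the comments promise */
    return alloc_region(pgdat, size, align, 0, limit);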

    Signed-off-by: David S. Miller
    Acked-by: Yinghai Lu
    Signed-off-by: Linus Torvalds

    David Miller
     

24 Apr, 2012

1 commit

  • Mel reports a BUG_ON(slot == NULL) in radix_tree_tag_set() on s390
    3.0.13: called from __set_page_dirty_nobuffers() when page_remove_rmap()
    tries to transfer dirty flag from s390 storage key to struct page and
    radix_tree.

    That would be because of reclaim's shrink_page_list() calling
    add_to_swap() on this page at the same time: first PageSwapCache is set
    (causing page_mapping(page) to appear as &swapper_space), then
    page->private set, then tree_lock taken, then page inserted into
    radix_tree - so there's an interval before taking the lock when the
    radix_tree slot is empty.

    We could fix this by moving __add_to_swap_cache()'s spin_lock_irq up
    before the SetPageSwapCache. But a better fix is simply to do what's
    five years overdue: Ken Chen introduced __set_page_dirty_no_writeback()
    (if !PageDirty TestSetPageDirty) for tmpfs to skip all the radix_tree
    overhead, and swap is just the same - it ignores the radix_tree tag, and
    does not participate in dirty page accounting, so should be using
    __set_page_dirty_no_writeback() too.

    s390 testing now confirms that this does indeed fix the problem.
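
    The change amounts to rewiring swap's address_space operations, roughly
    (sketch of mm/swap_state.c; other ops unchanged):

    static const struct address_space_operations swap_aops = {
            .writepage      = swap_writepage,
            .set_page_dirty = __set_page_dirty_no_writeback,
            /* ... */
    };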

    Reported-by: Mel Gorman
    Signed-off-by: Hugh Dickins
    Acked-by: Mel Gorman
    Cc: Andrew Morton
    Cc: Martin Schwidefsky
    Cc: Heiko Carstens
    Cc: Rik van Riel
    Cc: Ken Chen
    Cc: stable@vger.kernel.org
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     

21 Apr, 2012

5 commits

  • it's always current->mm

    Signed-off-by: Al Viro

    Al Viro
     
  • This continues the theme started with vm_brk() and vm_munmap():
    vm_mmap() does the same thing as do_mmap(), but additionally does the
    required VM locking.

    This uninlines (and rewrites it to be clearer) do_mmap(), which sadly
    duplicates it in mm/mmap.c and mm/nommu.c. But that way we don't have
    to export our internal do_mmap_pgoff() function.

    Some day we hopefully don't have to export do_mmap() either, if all
    modular users can become the simpler vm_mmap() instead. We're actually
    very close to that already, with the notable exception of the (broken)
    use in i810, and a couple of stragglers in binfmt_elf.
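
    The wrapper pattern shared by vm_brk(), vm_munmap() and vm_mmap() looks
    roughly like this (sketch; the argument sanity checks are elided):

    unsigned long vm_mmap(struct file *file, unsigned long addr,
                          unsigned long len, unsigned long prot,
                          unsigned long flag, unsigned long offset)
    {
            unsigned long ret;
            struct mm_struct *mm = current->mm;

            down_write(&mm->mmap_sem);
            ret = do_mmap(file, addr, len, prot, flag, offset);
            up_write(&mm->mmap_sem);

            return ret;
    }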

    Signed-off-by: Linus Torvalds

    Linus Torvalds
     
  • Like the vm_brk() function, this is the same as "do_munmap()", except it
    does the VM locking for the caller.

    Signed-off-by: Linus Torvalds

    Linus Torvalds
     
  • It does the same thing as "do_brk()", except it handles the VM locking
    too.

    It turns out that all external callers want that anyway, so we can make
    do_brk() static to just mm/mmap.c while at it.

    Signed-off-by: Linus Torvalds

    Linus Torvalds
     
  • Commit 24aa07882b ("memblock, x86: Replace memblock_x86_reserve/
    free_range() with generic ones") replaced x86 specific memblock
    operations with the generic ones; unfortunately, it lost zero length
    operation handling in the process making the kernel panic if somebody
    tries to reserve zero length area.

    There isn't much to be gained by being cranky about zero length
    operations, and panicking is almost the worst response. Drop the
    BUG_ON() in memblock_reserve() and update
    memblock_add_region/isolate_range() so that all zero length operations
    are handled as noops.
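
    The shape of the noop handling (sketch, not the literal diff):

    static int memblock_add_region(struct memblock_type *type,
                                   phys_addr_t base, phys_addr_t size)
    {
            if (!size)
                    return 0;       /* zero-length: nothing to do */
            /* ... normal range insertion follows ... */
    }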

    Signed-off-by: Tejun Heo
    Cc: stable@vger.kernel.org
    Reported-by: Valere Monseur
    Bisected-by: Joseph Freeman
    Tested-by: Joseph Freeman
    Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=43098
    Signed-off-by: Linus Torvalds

    Tejun Heo
     

19 Apr, 2012

1 commit

  • My 9ce70c0240d0 "memcg: fix deadlock by inverting lrucare nesting" put a
    nasty little bug into v3.3's version of mem_cgroup_replace_page_cache(),
    sometimes used for FUSE. Replacing __mem_cgroup_commit_charge_lrucare()
    by __mem_cgroup_commit_charge(), I used the "pc" pointer set up earlier:
    but it's for oldpage, and needs now to be for newpage. Once oldpage was
    freed, its PageCgroupUsed bit (cleared above but set again here) caused
    "Bad page state" messages - and perhaps worse, being missed from newpage.
    (I didn't find this by using FUSE, but in reusing the function for tmpfs.)
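
    The essence of the fix, roughly (the commit-charge call's arguments are
    abbreviated here):

    -   /* pc still refers to oldpage's page_cgroup */
    +   pc = lookup_page_cgroup(newpage);
        __mem_cgroup_commit_charge(memcg, newpage, ..., pc, ...);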

    Signed-off-by: Hugh Dickins
    Cc: stable@vger.kernel.org [v3.3 only]
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     

13 Apr, 2012

4 commits

  • This reverts commit c38446cc65e1f2b3eb8630c53943b94c4f65f670.

    Before the commit, the code made sense to me, but not after it. The
    "nr_reclaimed" is the number of pages reclaimed by scanning through the
    memcg's lru lists. The "nr_to_reclaim" is the target value for the
    whole function: for example, we want to break out of reclaim early once
    32 pages have been reclaimed under direct reclaim (when not at
    DEF_PRIORITY).

    With the reverted commit applied, the target "nr_to_reclaim" was
    decremented each time by "nr_reclaimed", yet still used to compare
    against "nr_reclaimed". It just doesn't make sense to me.
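
    For reference, the comparison as it reads after the revert (sketch):

    /* break out of the lru scan loop early under direct reclaim
     * once the fixed target has been met */
    if (nr_reclaimed >= nr_to_reclaim && priority < DEF_PRIORITY)
            break;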

    Signed-off-by: Ying Han
    Acked-by: Hugh Dickins
    Cc: Rik van Riel
    Cc: Hillf Danton
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ying Han
     
  • The race is as follows:

    Suppose a multi-threaded task forks a new process (on cpu A), thus
    bumping up the ref count on all the pages. While the fork is occurring
    (and thus we have marked all the PTEs as read-only), another thread in
    the original process (on cpu B) tries to write to a huge page, taking an
    access violation from the write-protect and calling hugetlb_cow(). Now,
    suppose the fork() fails. It will undo the COW and decrement the ref
    count on the pages, so the ref count on the huge page drops back to 1.
    Meanwhile hugetlb_cow() also decrements the ref count by one on the
    original page, since the original address space doesn't need it any
    more, having copied a new page to replace the original page. This
    leaves the ref count at zero, and when we call unlock_page(), we panic.

    fork on CPU A                          fault on CPU B
    =============                          ==============
    ...
    down_write(&parent->mmap_sem);
    down_write_nested(&child->mmap_sem);
    ...
    while duplicating vmas
      if error
        break;
    ...
    up_write(&child->mmap_sem);
    up_write(&parent->mmap_sem);           ...
                                           down_read(&parent->mmap_sem);
                                           ...
                                           lock_page(page);
                                           handle COW
                                           page_mapcount(old_page) == 2
                                           alloc and prepare new_page
    ...
    handle error
    page_remove_rmap(page);
    put_page(page);
    ...
                                           fold new_page into pte
                                           page_remove_rmap(page);
                                           put_page(page);
                                           ...
                                  oops ==> unlock_page(page);
                                           up_read(&parent->mmap_sem);

    The solution is to take an extra reference to the page while we are
    holding the lock on it.
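
    A sketch of that extra reference in hugetlb_cow() (shape only):

    get_page(old_page);     /* extra ref: a failing fork() can no longer
                             * drop the last reference under us */
    /* ... COW handling, other refs may be dropped here ... */
    unlock_page(old_page);  /* now safe: not the final put */
    put_page(old_page);     /* release our extra reference */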

    Signed-off-by: Chris Metcalf
    Cc: Hillf Danton
    Cc: Michal Hocko
    Cc: KAMEZAWA Hiroyuki
    Cc: Hugh Dickins
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Chris Metcalf
     
  • We should use the accessor res_counter_read_u64 for that.

    Although a purely cosmetic change is sometimes better delayed, to avoid
    conflicting with other people's work, we are starting to have people
    touching this code as well, and reproducing the open-coded behavior
    because that's the standard =)

    Time to fix it, then.
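
    The change is essentially (sketch):

    -   val = counter->usage;
    +   val = res_counter_read_u64(counter, RES_USAGE);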

    Signed-off-by: Glauber Costa
    Cc: Johannes Weiner
    Acked-by: Michal Hocko
    Cc: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Glauber Costa
     
  • action != CPU_DEAD || action != CPU_DEAD_FROZEN is always true.
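
    The fix is the obvious one (sketch of the shape):

    -   if (action != CPU_DEAD || action != CPU_DEAD_FROZEN)
    +   if (action != CPU_DEAD && action != CPU_DEAD_FROZEN)
                return NOTIFY_OK;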

    Signed-off-by: Kirill A. Shutemov
    Acked-by: KAMEZAWA Hiroyuki
    Acked-by: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     

29 Mar, 2012

10 commits

  • Merge third batch of patches from Andrew Morton:
    - Some MM stragglers
    - core SMP library cleanups (on_each_cpu_mask)
    - Some IPI optimisations
    - kexec
    - kdump
    - IPMI
    - the radix-tree iterator work
    - various other misc bits.

    "That'll do for -rc1. I still have ~10 patches for 3.4, will send
    those along when they've baked a little more."

    * emailed from Andrew Morton: (35 commits)
    backlight: fix typo in tosa_lcd.c
    crc32: add help text for the algorithm select option
    mm: move hugepage test examples to tools/testing/selftests/vm
    mm: move slabinfo.c to tools/vm
    mm: move page-types.c from Documentation to tools/vm
    selftests/Makefile: make `run_tests' depend on `all'
    selftests: launch individual selftests from the main Makefile
    radix-tree: use iterators in find_get_pages* functions
    radix-tree: rewrite gang lookup using iterator
    radix-tree: introduce bit-optimized iterator
    fs/proc/namespaces.c: prevent crash when ns_entries[] is empty
    nbd: rename the nbd_device variable from lo to nbd
    pidns: add reboot_pid_ns() to handle the reboot syscall
    sysctl: use bitmap library functions
    ipmi: use locks on watchdog timeout set on reboot
    ipmi: simplify locking
    ipmi: fix message handling during panics
    ipmi: use a tasklet for handling received messages
    ipmi: increase KCS timeouts
    ipmi: decrease the IPMI message transaction time in interrupt mode
    ...

    Linus Torvalds
     
  • Replace radix_tree_gang_lookup_slot() and
    radix_tree_gang_lookup_tag_slot() in page-cache lookup functions with
    brand-new radix-tree direct iterating. This avoids the double-scanning
    and pointer copying.

    The iterator doesn't stop after nr_pages page-get failures in a row; it
    continues the lookup to the end of the radix-tree. Thus we can safely
    remove these restart conditions.

    Unfortunately, the old implementation didn't forbid nr_pages == 0. This
    corner case does not fit into the new code, so the patch adds an extra
    check at the beginning.
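
    The new iteration style looks roughly like this (sketch; retry and
    exceptional-entry handling elided):

    struct radix_tree_iter iter;
    void **slot;

    rcu_read_lock();
    radix_tree_for_each_slot(slot, &mapping->page_tree, &iter, start) {
            struct page *page = radix_tree_deref_slot(slot);
            /* ... speculative page_cache_get, retry handling ... */
    }
    rcu_read_unlock();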

    Signed-off-by: Konstantin Khlebnikov
    Tested-by: Hugh Dickins
    Cc: Christoph Hellwig
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Konstantin Khlebnikov
     
  • Calculate a cpumask of CPUs with per-cpu pages in any zone and only send
    an IPI requesting CPUs to drain these pages to the buddy allocator if they
    actually have pages when asked to flush.

    This patch saves 85%+ of the IPIs asking to drain per-cpu pages in case
    of severe memory pressure that leads to OOM. In these cases multiple,
    possibly concurrent, allocation requests end up in the direct reclaim
    code path, so the per-cpu pages end up reclaimed on the first
    allocation failure. For most of the subsequent allocation attempts,
    until the memory pressure is off (possibly via the OOM killer), there
    are no per-cpu pages left on most CPUs (and there can easily be
    hundreds of them).

    This also has the side effect of shortening the average latency of direct
    reclaim by 1 or more order of magnitude since waiting for all the CPUs to
    ACK the IPI takes a long time.
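
    The selective drain looks roughly like this (sketch; a single zone is
    shown, while the patch checks every zone per CPU):

    static cpumask_t cpus_with_pcps;
    int cpu;

    for_each_online_cpu(cpu) {
            struct per_cpu_pageset *pcp = per_cpu_ptr(zone->pageset, cpu);

            if (pcp->pcp.count)
                    cpumask_set_cpu(cpu, &cpus_with_pcps);
            else
                    cpumask_clear_cpu(cpu, &cpus_with_pcps);
    }
    on_each_cpu_mask(&cpus_with_pcps, drain_local_pages, NULL, 1);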

    Tested by running "hackbench 400" on an 8 CPU x86 VM and observing the
    difference between the number of direct reclaim attempts that end up in
    drain_all_pages() and those where more than 1/2 of the online CPUs had
    any per-cpu page in them, using the vmstat counters introduced in the
    next patch in the series and /proc/interrupts.

    In the test scenario, this was seen to save around 3600 global IPIs
    after triggering an OOM on a concurrent workload:

    $ cat /proc/vmstat | tail -n 2
    pcp_global_drain 0
    pcp_global_ipi_saved 0

    $ cat /proc/interrupts | grep CAL
    CAL: 1 2 1 2
    2 2 2 2 Function call interrupts

    $ hackbench 400
    [OOM messages snipped]

    $ cat /proc/vmstat | tail -n 2
    pcp_global_drain 3647
    pcp_global_ipi_saved 3642

    $ cat /proc/interrupts | grep CAL
    CAL: 6 13 6 3
    3 3 1 2 7 Function call interrupts

    Please note that if the global drain is removed from the direct reclaim
    path, as a patch from Mel Gorman currently suggests, this should be
    replaced with an on_each_cpu_cond invocation.

    Signed-off-by: Gilad Ben-Yossef
    Acked-by: Mel Gorman
    Cc: KOSAKI Motohiro
    Acked-by: Christoph Lameter
    Acked-by: Peter Zijlstra
    Cc: Pekka Enberg
    Cc: Rik van Riel
    Cc: Andi Kleen
    Acked-by: Michal Nazarewicz
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Gilad Ben-Yossef
     
  • flush_all() is called for each kmem_cache_destroy(). So every cache
    being destroyed dynamically ends up sending an IPI to each CPU in the
    system, regardless of whether the cache has ever been used there.

    For example, if you close the Infiniband ipath driver char device file,
    the close file ops calls kmem_cache_destroy(). So running some
    infiniband config tool on a single CPU dedicated to system tasks might
    interrupt the other 127 CPUs dedicated to some CPU intensive or latency
    sensitive task.

    I suspect there is a good chance that every line in the output of "git
    grep kmem_cache_destroy linux/ | grep '\->'" has a similar scenario.

    This patch attempts to rectify this issue by sending an IPI to flush the
    per cpu objects back to the free lists only to CPUs that seem to have such
    objects.

    The check of which CPUs to IPI is racy, but we don't care, since asking
    a CPU without per cpu objects to flush does no damage, and as far as I
    can tell the flush_all by itself is racy against allocs on remote CPUs
    anyway, so if you required the flush_all to be deterministic, you had
    to arrange for locking regardless.
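
    A sketch of the conditional flush (close to the shape described):

    static bool has_cpu_slab(int cpu, void *info)
    {
            struct kmem_cache *s = info;
            struct kmem_cache_cpu *c = per_cpu_ptr(s->cpu_slab, cpu);

            return c->page || c->partial;
    }

    static void flush_all(struct kmem_cache *s)
    {
            on_each_cpu_cond(has_cpu_slab, flush_cpu_slab, s, 1, GFP_ATOMIC);
    }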

    Without this patch the following artificial test case:

    $ cd /sys/kernel/slab
    $ for DIR in *; do cat $DIR/alloc_calls > /dev/null; done

    produces 166 IPIs on a cpuset-isolated CPU. With it, it produces none.

    The code path of memory allocation failure for the CPUMASK_OFFSTACK=y
    config was tested using the fault injection framework.

    Signed-off-by: Gilad Ben-Yossef
    Acked-by: Christoph Lameter
    Cc: Chris Metcalf
    Acked-by: Peter Zijlstra
    Cc: Frederic Weisbecker
    Cc: Russell King
    Cc: Pekka Enberg
    Cc: Matt Mackall
    Cc: Sasha Levin
    Cc: Rik van Riel
    Cc: Andi Kleen
    Cc: Mel Gorman
    Cc: Alexander Viro
    Cc: Avi Kivity
    Cc: Michal Nazarewicz
    Cc: Kosaki Motohiro
    Cc: Milton Miller
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Gilad Ben-Yossef
     
  • Most system calls taking flags first check that the flags passed in are
    valid, and that helps userspace to detect when new flags are supported.

    But swapon never did so: start checking now, to help if we ever want to
    support more swap_flags in future.

    It's difficult to get stray bits set in an int, and swapon is not widely
    used, so this is most unlikely to break any userspace; but we can just
    revert if it turns out to do so.
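
    The up-front validation amounts to (sketch; SWAP_FLAGS_VALID collects
    the currently-known flag bits):

    if (swap_flags & ~SWAP_FLAGS_VALID)
            return -EINVAL;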

    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • The size of coredump files is limited by RLIMIT_CORE, however, allocating
    large amounts of memory results in three negative consequences:

    - the coredumping process may be chosen for oom kill and quickly deplete
    all memory reserves in oom conditions preventing further progress from
    being made or tasks from exiting,

    - the coredumping process may cause other processes to be oom killed
    without fault of their own as the result of a SIGSEGV, for example, in
    the coredumping process, or

    - the coredumping process may result in a livelock while writing to the
    dump file if it needs memory to allocate while other threads are in
    the exit path waiting on the coredumper to complete.

    This is fixed by implying __GFP_NORETRY in the page allocator for
    coredumping processes when reclaim has failed so the allocations fail and
    the process continues to exit.
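
    In the allocator slowpath, after reclaim has made no progress, the
    check looks roughly like this (sketch):

    /* coredumps can quickly deplete all memory reserves */
    if ((current->flags & PF_DUMPCORE) && !(gfp_mask & __GFP_NOFAIL))
            goto nopage;    /* fail instead of retrying or OOM killing */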

    Signed-off-by: David Rientjes
    Cc: Mel Gorman
    Cc: KAMEZAWA Hiroyuki
    Cc: Minchan Kim
    Cc: Oleg Nesterov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • pmd_trans_unstable() should be called before pte_offset_map() in the
    locations where the mmap_sem is held for reading.
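
    The ordering in question, roughly (sketch of a reader-side walker):

    if (pmd_trans_unstable(pmd))
            return 0;       /* THP may be splitting under us */
    pte = pte_offset_map_lock(mm, pmd, addr, &ptl);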

    Signed-off-by: Andrea Arcangeli
    Cc: Mel Gorman
    Cc: Hugh Dickins
    Cc: Larry Woodman
    Cc: Ulrich Obergfell
    Cc: Rik van Riel
    Cc: Mark Salter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     
  • Holepunching filesystems ext4 and xfs are using truncate_inode_pages_range
    but forgetting to unmap pages first (ocfs2 remembers). This is not really
    a bug, since races already require truncate_inode_page() to handle that
    case once the page is locked; but it can be very inefficient if the file
    being punched happens to be mapped into many vmas.

    Provide a drop-in replacement truncate_pagecache_range() which does the
    unmapping pass first, handling the awkward mismatch between arguments to
    truncate_inode_pages_range() and arguments to unmap_mapping_range().
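
    The helper looks roughly like this (sketch; the page-size rounding of
    the unmap boundaries is elided):

    void truncate_pagecache_range(struct inode *inode,
                                  loff_t lstart, loff_t lend)
    {
            struct address_space *mapping = inode->i_mapping;

            unmap_mapping_range(mapping, lstart, 1 + lend - lstart, 0);
            truncate_inode_pages_range(mapping, lstart, lend);
    }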

    Note that holepunching does not unmap privately COWed pages in the range:
    POSIX requires that we do so when truncating, but it's hard to justify,
    difficult to implement without an i_size cutoff, and no filesystem is
    attempting to implement it.

    Signed-off-by: Hugh Dickins
    Cc: "Theodore Ts'o"
    Cc: Andreas Dilger
    Cc: Mark Fasheh
    Cc: Joel Becker
    Cc: Ben Myers
    Cc: Alex Elder
    Cc: Christoph Hellwig
    Cc: Dave Chinner
    Cc: Al Viro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Pull SLAB changes from Pekka Enberg:
    "There's the new kmalloc_array() API, minor fixes and performance
    improvements, but quite honestly, nothing terribly exciting."

    * 'slab/for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/penberg/linux:
    mm: SLAB Out-of-memory diagnostics
    slab: introduce kmalloc_array()
    slub: per cpu partial statistics change
    slub: include include for prefetch
    slub: Do not hold slub_lock when calling sysfs_slab_add()
    slub: prefetch next freelist pointer in slab_alloc()
    slab, cleanup: remove unneeded return
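
    For reference, the new kmalloc_array() listed above is an
    overflow-checked counterpart to kmalloc(n * size); a usage sketch
    ('struct foo' and 'n' are illustrative):

    struct foo *arr = kmalloc_array(n, sizeof(*arr), GFP_KERNEL);
    if (!arr)
            return -ENOMEM;     /* also NULL if n * size would overflow */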

    Linus Torvalds
     
  • Pull ext4 updates for 3.4 from Ted Ts'o:
    "Ext4 commits for 3.3 merge window; mostly cleanups and bug fixes

    The changes to export dirty_writeback_interval are from Artem's s_dirt
    cleanup patch series. The same is true of the change to remove the
    s_dirt helper functions which never got used by anyone in-tree. I've
    run these changes by Al Viro, and am carrying them so that Artem can
    more easily fix up the rest of the file systems during the next merge
    window. (Originally we had hoped to remove the use of s_dirt from
    ext4 during this merge window, but his patches had some bugs, so I
    ultimately ended up dropping them from the ext4 tree.)"

    * tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (66 commits)
    vfs: remove unused superblock helpers
    mm: export dirty_writeback_interval
    ext4: remove useless s_dirt assignment
    ext4: write superblock only once on unmount
    ext4: do not mark superblock as dirty unnecessarily
    ext4: correct ext4_punch_hole return codes
    ext4: remove restrictive checks for EOFBLOCKS_FL
    ext4: always set then trimmed blocks count into len
    ext4: fix trimmed block count accunting
    ext4: fix start and len arguments handling in ext4_trim_fs()
    ext4: update s_free_{inodes,blocks}_count during online resize
    ext4: change some printk() calls to use ext4_msg() instead
    ext4: avoid output message interleaving in ext4_error_()
    ext4: remove trailing newlines from ext4_msg() and ext4_error() messages
    ext4: add no_printk argument validation, fix fallout
    ext4: remove redundant "EXT4-fs: " from uses of ext4_msg
    ext4: give more helpful error message in ext4_ext_rm_leaf()
    ext4: remove unused code from ext4_ext_map_blocks()
    ext4: rewrite punch hole to use ext4_ext_remove_space()
    jbd2: cleanup journal tail after transaction commit
    ...

    Linus Torvalds
     

25 Mar, 2012

1 commit


24 Mar, 2012

4 commits

  • Since we no longer need the VM_ALWAYSDUMP flag, let's use the freed bit
    for the 'VM_NODUMP' flag. The idea is to add a new madvise() flag:
    MADV_DONTDUMP, which can be set by applications to specifically request
    memory regions which should not dump core.

    The specific application I have in mind is qemu: we can add a flag there
    that wouldn't dump all of guest memory when qemu dumps core. This flag
    might also be useful for security sensitive apps that want to absolutely
    make sure that parts of memory are not dumped. To clear the flag use:
    MADV_DODUMP.
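
    Userspace usage sketch ('guest_mem'/'guest_size' are illustrative):

    if (madvise(guest_mem, guest_size, MADV_DONTDUMP))
            perror("madvise(MADV_DONTDUMP)");
    /* ... and to include the region in dumps again: */
    madvise(guest_mem, guest_size, MADV_DODUMP);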

    [akpm@linux-foundation.org: s/MADV_NODUMP/MADV_DONTDUMP/, s/MADV_CLEAR_NODUMP/MADV_DODUMP/, per Roland]
    [akpm@linux-foundation.org: fix up the architectures which broke]
    Signed-off-by: Jason Baron
    Acked-by: Roland McGrath
    Cc: Chris Metcalf
    Cc: Avi Kivity
    Cc: Ralf Baechle
    Cc: Richard Henderson
    Cc: Ivan Kokshaysky
    Cc: Matt Turner
    Cc: "James E.J. Bottomley"
    Cc: Helge Deller
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jason Baron
     
  • The motivation for this patchset was that I was looking at a way for a
    qemu-kvm process, to exclude the guest memory from its core dump, which
    can be quite large. There are already a number of filter flags in
    /proc/<pid>/coredump_filter; however, these allow one to specify 'types'
    of kernel memory, not specific address ranges (which is needed in this
    case).

    Since there are no more vma flags available, the first patch eliminates
    the need for the 'VM_ALWAYSDUMP' flag. The flag is used internally by
    the kernel to mark vdso and vsyscall pages. However, it is simple
    enough to check if a vma covers a vdso or vsyscall page without the need
    for this flag.

    The second patch then replaces the 'VM_ALWAYSDUMP' flag with a new
    'VM_NODUMP' flag, which can be set by userspace using new madvise flags:
    'MADV_DONTDUMP', and unset via 'MADV_DODUMP'. The core dump filters
    continue to work the same as before unless 'MADV_DONTDUMP' is set on the
    region.

    The qemu code which implements this features is at:

    http://people.redhat.com/~jbaron/qemu-dump/qemu-dump.patch

    In my testing the qemu core dump shrank from 383MB -> 13MB with this
    patch.

    I also believe that the 'MADV_DONTDUMP' flag might be useful for
    security sensitive apps, which might want to select which areas are
    dumped.

    This patch:

    The VM_ALWAYSDUMP flag is currently used by the coredump code to
    indicate that a vma is part of a vsyscall or vdso section. However, we
    can determine if a vma is in one of these sections by checking it
    against the gate_vma and checking for a non-NULL return value from
    arch_vma_name(). Thus, freeing a valuable vma bit.
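
    The replacement check takes roughly this shape (sketch):

    static bool always_dump_vma(struct vm_area_struct *vma)
    {
            /* the gate vma is always dumped */
            if (vma == get_gate_vma(vma->vm_mm))
                    return true;
            /* vdso and similar arch-named mappings are always dumped */
            if (arch_vma_name(vma))
                    return true;
            return false;
    }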

    Signed-off-by: Jason Baron
    Acked-by: Roland McGrath
    Cc: Chris Metcalf
    Cc: Avi Kivity
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jason Baron
     
  • Change oom_kill_task() to use do_send_sig_info(SEND_SIG_FORCED) instead
    of force_sig(SIGKILL). With the recent changes we do not need force_ to
    kill the CLONE_NEWPID tasks.

    And this is more correct. force_sig() can race with the exiting thread
    even if oom_kill_task() checks p->mm != NULL, while
    do_send_sig_info(group => true) kills the whole process.
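
    The change boils down to (sketch):

    -   force_sig(SIGKILL, p);
    +   do_send_sig_info(SIGKILL, SEND_SIG_FORCED, p, true); /* group kill */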

    Signed-off-by: Oleg Nesterov
    Cc: Tejun Heo
    Cc: Anton Vorontsov
    Cc: "Eric W. Biederman"
    Cc: KOSAKI Motohiro
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • Fix code duplication in __unmap_hugepage_range(), such as pte_page() and
    huge_pte_none().

    Signed-off-by: Hillf Danton
    Cc: Michal Hocko
    Cc: KAMEZAWA Hiroyuki
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hillf Danton
     

23 Mar, 2012

4 commits

  • Adjusting cc715d99e529 "mm: vmscan: forcibly scan highmem if there are
    too many buffer_heads pinning highmem" for -stable reveals that it was
    slightly wrong once on top of fe2c2a106663 "vmscan: reclaim at order 0
    when compaction is enabled", which specifically adds testorder for the
    zone_watermark_ok_safe() test.

    Signed-off-by: Hugh Dickins
    Acked-by: Mel Gorman
    Acked-by: Rik van Riel
    Cc: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Pull cleancache changes from Konrad Rzeszutek Wilk:
    "This has some patches for the cleancache API that should have been
    submitted a _long_ time ago. They are basically cleanups:

    - rename of flush to invalidate

    - moving reporting of statistics into debugfs

    - use __read_mostly as necessary.

    Oh, and also the MAINTAINERS file change. The files (except the
    MAINTAINERS file) have been in #linux-next for months now. The late
    addition of MAINTAINERS file is a brain-fart on my side - didn't
    realize I needed that just until I was typing this up - and I based
    that patch on v3.3 - so the tree is on top of v3.3."

    * tag 'stable/for-linus-3.4' of git://git.kernel.org/pub/scm/linux/kernel/git/konrad/mm:
    MAINTAINERS: Adding cleancache API to the list.
    mm: cleancache: Use __read_mostly as appropiate.
    mm: cleancache: report statistics via debugfs instead of sysfs.
    mm: zcache/tmem/cleancache: s/flush/invalidate/
    mm: cleancache: s/flush/invalidate/

    Linus Torvalds
     
  • Pull MCE changes from Ingo Molnar.

    * 'x86-mce-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    x86/mce: Fix return value of mce_chrdev_read() when erst is disabled
    x86/mce: Convert static array of pointers to per-cpu variables
    x86/mce: Replace hard coded hex constants with symbolic defines
    x86/mce: Recognise machine check bank signature for data path error
    x86/mce: Handle "action required" errors
    x86/mce: Add mechanism to safely save information in MCE handler
    x86/mce: Create helper function to save addr/misc when needed
    HWPOISON: Add code to handle "action required" errors.
    HWPOISON: Clean up memory_failure() vs. __memory_failure()

    Linus Torvalds
     
  • Merge first batch of patches from Andrew Morton:
    "A few misc things and all the MM queue"

    * emailed from Andrew Morton: (92 commits)
    memcg: avoid THP split in task migration
    thp: add HPAGE_PMD_* definitions for !CONFIG_TRANSPARENT_HUGEPAGE
    memcg: clean up existing move charge code
    mm/memcontrol.c: remove unnecessary 'break' in mem_cgroup_read()
    mm/memcontrol.c: remove redundant BUG_ON() in mem_cgroup_usage_unregister_event()
    mm/memcontrol.c: s/stealed/stolen/
    memcg: fix performance of mem_cgroup_begin_update_page_stat()
    memcg: remove PCG_FILE_MAPPED
    memcg: use new logic for page stat accounting
    memcg: remove PCG_MOVE_LOCK flag from page_cgroup
    memcg: simplify move_account() check
    memcg: remove EXPORT_SYMBOL(mem_cgroup_update_page_stat)
    memcg: kill dead prev_priority stubs
    memcg: remove PCG_CACHE page_cgroup flag
    memcg: let css_get_next() rely upon rcu_read_lock()
    cgroup: revert ss_id_lock to spinlock
    idr: make idr_get_next() good for rcu_read_lock()
    memcg: remove unnecessary thp check in page stat accounting
    memcg: remove redundant returns
    memcg: enum lru_list lru
    ...

    Linus Torvalds
     

22 Mar, 2012

3 commits

  • Export 'dirty_writeback_interval' to make it visible to
    file-systems. We are going to push superblock management down to
    file-systems and get rid of the 'sync_supers' kernel thread completely.

    Signed-off-by: Artem Bityutskiy
    Cc: Al Viro
    Signed-off-by: "Theodore Ts'o"

    Artem Bityutskiy
     
  • Currently we can't do task migration among memory cgroups without THP
    split, which means processes heavily using THP experience large overhead
    in task migration. This patch introduces the code for moving charge of
    THP and makes THP more valuable.

    Signed-off-by: Naoya Horiguchi
    Acked-by: Hillf Danton
    Cc: Andrea Arcangeli
    Acked-by: KAMEZAWA Hiroyuki
    Cc: David Rientjes
    Cc: Daisuke Nishimura
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi
     
  • These macros will be used in a later patch, where all usages are expected
    to be optimized away without #ifdef CONFIG_TRANSPARENT_HUGEPAGE. But to
    detect unexpected usages, we convert the existing BUG() to BUILD_BUG().
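
    The stubs take roughly this shape:

    #else /* !CONFIG_TRANSPARENT_HUGEPAGE */
    #define HPAGE_PMD_SHIFT ({ BUILD_BUG(); 0; })
    #define HPAGE_PMD_MASK  ({ BUILD_BUG(); 0; })
    #define HPAGE_PMD_SIZE  ({ BUILD_BUG(); 0; })
    #endif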

    [akpm@linux-foundation.org: fix build in mm/pgtable-generic.c]
    Signed-off-by: Naoya Horiguchi
    Acked-by: Hillf Danton
    Reviewed-by: Andrea Arcangeli
    Reviewed-by: KAMEZAWA Hiroyuki
    Acked-by: David Rientjes
    Cc: Daisuke Nishimura
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi