08 Aug, 2014

1 commit

  • commit b104a35d32025ca740539db2808aa3385d0f30eb upstream.

    The page allocator relies on __GFP_WAIT to determine if ALLOC_CPUSET
    should be set in allocflags. ALLOC_CPUSET controls if a page allocation
    should be restricted only to the set of allowed cpuset mems.

    The transparent hugepage code clears __GFP_WAIT when defrag is disabled to
    prevent the fault path from using memory compaction or direct reclaim. As a
    side-effect, it is then unfairly able to allocate outside of its cpuset
    mems restriction.

    This patch ensures that ALLOC_CPUSET is only cleared when the gfp mask is
    truly GFP_ATOMIC by verifying it is also not a thp allocation.
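
    A minimal standalone sketch of the corrected test, using invented flag
    values rather than the kernel's definitions (the actual fix keys off
    __GFP_NO_KSWAPD, which THP allocations set, to tell a THP fault apart from
    a true GFP_ATOMIC allocation):

    #include <stdbool.h>
    #include <stdio.h>

    #define GFP_WAIT        0x01u   /* invented values, for illustration only */
    #define GFP_NO_KSWAPD   0x02u   /* set by THP allocations */
    #define ALLOC_CPUSET    0x10u

    static unsigned int alloc_flags_for(unsigned int gfp_mask)
    {
            unsigned int alloc_flags = ALLOC_CPUSET;
            /* Old check: !(gfp_mask & GFP_WAIT), which is also true for a THP
             * fault with defrag disabled.  New check: the allocation must be
             * atomic for a reason other than THP clearing GFP_WAIT. */
            bool atomic = !(gfp_mask & (GFP_WAIT | GFP_NO_KSWAPD));

            if (atomic)
                    alloc_flags &= ~ALLOC_CPUSET;   /* GFP_ATOMIC may ignore cpuset mems */
            return alloc_flags;
    }

    int main(void)
    {
            /* THP fault with defrag disabled: GFP_WAIT cleared, NO_KSWAPD set. */
            printf("thp fault:  cpuset %s\n",
                   alloc_flags_for(GFP_NO_KSWAPD) & ALLOC_CPUSET ? "enforced" : "ignored");
            /* Genuine GFP_ATOMIC: neither bit set. */
            printf("gfp_atomic: cpuset %s\n",
                   alloc_flags_for(0) & ALLOC_CPUSET ? "enforced" : "ignored");
            return 0;
    }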

    Signed-off-by: David Rientjes
    Reported-by: Alex Thorlton
    Tested-by: Alex Thorlton
    Cc: Bob Liu
    Cc: Dave Hansen
    Cc: Hedi Berriche
    Cc: Hugh Dickins
    Cc: Johannes Weiner
    Cc: Kirill A. Shutemov
    Cc: Mel Gorman
    Cc: Rik van Riel
    Cc: Srivatsa S. Bhat
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    David Rientjes
     

01 Aug, 2014

3 commits

  • commit 0253d634e0803a8376a0d88efee0bf523d8673f9 upstream.

    Commit 4a705fef9862 ("hugetlb: fix copy_hugetlb_page_range() to handle
    migration/hwpoisoned entry") changed the order of
    huge_ptep_set_wrprotect() and huge_ptep_get(), which leads to breakage
    in some workloads like hugepage-backed heap allocation via libhugetlbfs.
    This patch fixes it.

    The test program for the problem is shown below:

    $ cat heap.c
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>

    #define HPS 0x200000

    int main() {
            int i;
            char *p = malloc(HPS);
            memset(p, '1', HPS);
            for (i = 0; i < 5; i++) {
                    if (!fork()) {
                            memset(p, '2', HPS);
                            p = malloc(HPS);
                            memset(p, '3', HPS);
                            free(p);
                            return 0;
                    }
            }
            sleep(1);
            free(p);
            return 0;
    }

    $ export HUGETLB_MORECORE=yes ; export HUGETLB_NO_PREFAULT= ; hugectl --heap ./heap

    Fixes 4a705fef9862 ("hugetlb: fix copy_hugetlb_page_range() to handle
    migration/hwpoisoned entry"), so is applicable to -stable kernels which
    include it.

    Signed-off-by: Naoya Horiguchi
    Reported-by: Guillaume Morin
    Suggested-by: Guillaume Morin
    Acked-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Naoya Horiguchi
     
  • commit 694617474e33b8603fc76e090ed7d09376514b1a upstream.

    The patch 3e374919b314f20e2a04f641ebc1093d758f66a4 is supposed to fix the
    problem where kmem_cache_create incorrectly reports a duplicate cache name
    and fails. The problem is described in the header of that patch.

    However, the patch doesn't really fix the problem because of these
    reasons:

    * the logic of the debugging test is reversed. It was intended to perform
    the check only if SLUB debugging is enabled (which implies that caches
    with the same parameters are not merged). Therefore, the condition should
    have been
    #if !defined(CONFIG_SLUB) || defined(CONFIG_SLUB_DEBUG_ON)
    The current code has the condition reversed and performs the test when
    debugging is disabled.

    * SLUB debugging may be enabled or disabled from the kernel command line;
    CONFIG_SLUB_DEBUG_ON only provides the default setting. Therefore a test
    based on whether CONFIG_SLUB_DEBUG_ON is defined is unreliable.

    This patch fixes the problem by removing the test
    "!defined(CONFIG_SLUB_DEBUG_ON)". Therefore, duplicate names are never
    checked if the SLUB allocator is used.
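
    A standalone sketch of the resulting gating (the Kconfig symbol is the
    real one, but the function below is a stand-in for illustration, not
    mm/slab_common.c):

    #include <stdio.h>

    /* #define CONFIG_SLUB 1       <- uncomment to emulate a SLUB build */

    static void kmem_cache_sanity_check_sketch(const char *name)
    {
    #if !defined(CONFIG_SLUB)
            /* Allocators that never merge caches can insist on unique names. */
            printf("checking '%s' against existing cache names\n", name);
    #else
            /* SLUB may alias caches, so a reused name is legitimate. */
            (void)name;
    #endif
    }

    int main(void)
    {
            kmem_cache_sanity_check_sketch("raid5-deadbeef");
            return 0;
    }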

    Note to stable kernel maintainers: when backporting this patch, please
    also backport the patch 3e374919b314f20e2a04f641ebc1093d758f66a4.

    Acked-by: David Rientjes
    Acked-by: Christoph Lameter
    Signed-off-by: Mikulas Patocka
    Signed-off-by: Pekka Enberg
    Signed-off-by: Greg Kroah-Hartman

    Mikulas Patocka
     
  • commit 3e374919b314f20e2a04f641ebc1093d758f66a4 upstream.

    SLUB can alias multiple kmem_cache_create() requests to one slab cache to save
    memory and increase cache hotness. As a result the name of the slab can be
    stale. Only check the name for duplicates if we are in debug mode, where we do
    not merge multiple caches.

    This fixes the following problem reported by Jonathan Brassow:

    The problem with kmem_cache* is this:

    *) Assume CONFIG_SLUB is set
    1) kmem_cache_create(name="foo-a")
    - creates new kmem_cache structure
    2) kmem_cache_create(name="foo-b")
    - If identical cache characteristics, it will be merged with the previously
    created cache associated with "foo-a". The cache's refcount will be
    incremented and an alias will be created via sysfs_slab_alias().
    3) kmem_cache_destroy()
    - Attempting to destroy cache associated with "foo-a", but instead the
    refcount is simply decremented. I don't even think the sysfs aliases are
    ever removed...
    4) kmem_cache_create(name="foo-a")
    - This FAILS because kmem_cache_sanity_check collides with the existing
    name ("foo-a") associated with the non-removed cache.

    This is a problem for RAID (specifically dm-raid) because the name used
    for the kmem_cache_create is ("raid%d-%p", level, mddev). If the cache
    persists for long enough, the memory address of an old mddev will be
    reused for a new mddev - causing an identical formulation of the cache
    name. Even though kmem_cache_destroy had long ago been used to delete
    the old cache, the merging of caches has caused the name and cache of that
    old instance to be preserved and causes a collision (and thus failure) in
    kmem_cache_create(). I see this regularly in my testing.

    Reported-by: Jonathan Brassow
    Signed-off-by: Christoph Lameter
    Signed-off-by: Pekka Enberg
    Signed-off-by: Greg Kroah-Hartman

    Christoph Lameter
     

28 Jul, 2014

3 commits

  • commit b1a366500bd537b50c3aad26dc7df083ec03a448 upstream.

    shmem_fault() is the actual culprit in trinity's hole-punch starvation,
    and the most significant cause of such problems: since a page faulted is
    one that then appears page_mapped(), needing unmap_mapping_range() and
    i_mmap_mutex to be unmapped again.

    But it is not the only way in which a page can be brought into a hole in
    the radix_tree while that hole is being punched; and Vlastimil's testing
    implies that if enough other processors are busy filling in the hole,
    then shmem_undo_range() can be kept from completing indefinitely.

    shmem_file_splice_read() is the main other user of SGP_CACHE, which can
    instantiate shmem pagecache pages in the read-only case (without holding
    i_mutex, so perhaps concurrently with a hole-punch). Probably it's
    silly not to use SGP_READ already (using the ZERO_PAGE for holes): which
    ought to be safe, but might bring surprises - not a change to be rushed.

    shmem_read_mapping_page_gfp() is an internal interface used by
    drivers/gpu/drm GEM (and next by uprobes): it should be okay. And
    shmem_file_read_iter() uses the SGP_DIRTY variant of SGP_CACHE, when
    called internally by the kernel (perhaps for a stacking filesystem,
    which might rely on holes to be reserved): it's unclear whether it could
    be provoked to keep hole-punch busy or not.

    We could apply the same umbrella as now used in shmem_fault() to
    shmem_file_splice_read() and the others; but it looks ugly, and use over
    a range raises questions - should it actually be per page? can these get
    starved themselves?

    The origin of this part of the problem is my v3.1 commit d0823576bf4b
    ("mm: pincer in truncate_inode_pages_range"), once it was duplicated
    into shmem.c. It seemed like a nice idea at the time, to ensure
    (barring RCU lookup fuzziness) that there's an instant when the entire
    hole is empty; but the indefinitely repeated scans to ensure that make
    it vulnerable.

    Revert that "enhancement" to hole-punch from shmem_undo_range(), but
    retain the unproblematic rescanning when it's truncating; add a couple
    of comments there.

    Remove the "indices[0] >= end" test: that is now handled satisfactorily
    by the inner loop, and mem_cgroup_uncharge_start()/end() are too light
    to be worth avoiding here.

    But if we do not always loop indefinitely, we do need to handle the case
    of swap swizzled back to page before shmem_free_swap() gets it: add a
    retry for that case, as suggested by Konstantin Khlebnikov; and for the
    case of page swizzled back to swap, as suggested by Johannes Weiner.

    Signed-off-by: Hugh Dickins
    Reported-by: Sasha Levin
    Suggested-by: Vlastimil Babka
    Cc: Konstantin Khlebnikov
    Cc: Johannes Weiner
    Cc: Lukas Czerner
    Cc: Dave Jones
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Hugh Dickins
     
  • commit 8e205f779d1443a94b5ae81aa359cb535dd3021e upstream.

    Commit f00cdc6df7d7 ("shmem: fix faulting into a hole while it's
    punched") was buggy: Sasha sent a lockdep report to remind us that
    grabbing i_mutex in the fault path is a no-no (write syscall may already
    hold i_mutex while faulting user buffer).

    We tried a completely different approach (see following patch) but that
    proved inadequate: good enough for a rational workload, but not good
    enough against trinity - which forks off so many mappings of the object
    that contention on i_mmap_mutex while hole-puncher holds i_mutex builds
    into serious starvation when concurrent faults force the puncher to fall
    back to single-page unmap_mapping_range() searches of the i_mmap tree.

    So return to the original umbrella approach, but keep away from i_mutex
    this time. We really don't want to bloat every shmem inode with a new
    mutex or completion, just to protect this unlikely case from trinity.
    So extend the original with wait_queue_head on stack at the hole-punch
    end, and wait_queue item on the stack at the fault end.

    This involves further use of i_lock to guard against the races: lockdep
    has been happy so far, and I see fs/inode.c:unlock_new_inode() holds
    i_lock around wake_up_bit(), which is comparable to what we do here.
    i_lock is more convenient, but we could switch to shmem's info->lock.

    This issue has been tagged with CVE-2014-4171, which will require commit
    f00cdc6df7d7 and this and the following patch to be backported: we
    suggest to 3.1+, though in fact the trinity forkbomb effect might go
    back as far as 2.6.16, when madvise(,,MADV_REMOVE) came in - or might
    not, since much has changed, with i_mmap_mutex a spinlock before 3.0.
    Anyone running trinity on 3.0 and earlier? I don't think we need care.

    Signed-off-by: Hugh Dickins
    Reported-by: Sasha Levin
    Tested-by: Sasha Levin
    Cc: Vlastimil Babka
    Cc: Konstantin Khlebnikov
    Cc: Johannes Weiner
    Cc: Lukas Czerner
    Cc: Dave Jones
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Hugh Dickins
     
  • commit f00cdc6df7d7cfcabb5b740911e6788cb0802bdb upstream.

    Trinity finds that mmap access to a hole while it's punched from shmem
    can prevent the madvise(MADV_REMOVE) or fallocate(FALLOC_FL_PUNCH_HOLE)
    from completing, until the reader chooses to stop; with the puncher's
    hold on i_mutex locking out all other writers until it can complete.

    It appears that the tmpfs fault path is too light in comparison with its
    hole-punching path, lacking an i_data_sem to obstruct it; but we don't
    want to slow down the common case.

    Extend shmem_fallocate()'s existing range notification mechanism, so
    shmem_fault() can refrain from faulting pages into the hole while it's
    punched, waiting instead on i_mutex (when safe to sleep; or repeatedly
    faulting when not).
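
    A hedged, reproducer-style sketch of the two sides of that race, using a
    plain tmpfs file (trinity exercises the same paths far more aggressively):

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <sys/wait.h>
    #include <unistd.h>

    #define SZ (64UL << 20)                 /* 64MB of tmpfs-backed pages */

    int main(void)
    {
            int fd = open("/dev/shm/hole-punch-demo", O_CREAT | O_RDWR | O_TRUNC, 0600);
            char *p;
            int i;

            if (fd < 0 || ftruncate(fd, SZ) < 0)
                    return 1;
            p = mmap(NULL, SZ, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
            if (p == MAP_FAILED)
                    return 1;

            if (fork() == 0) {
                    for (i = 0; i < 100; i++)       /* keep faulting pages back in */
                            memset(p, 0x5a, SZ);
                    _exit(0);
            }
            for (i = 0; i < 100; i++)               /* keep punching the same range */
                    fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE, 0, SZ);

            wait(NULL);
            unlink("/dev/shm/hole-punch-demo");
            return 0;
    }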

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Hugh Dickins
    Reported-by: Sasha Levin
    Tested-by: Sasha Levin
    Cc: Dave Jones
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Hugh Dickins
     

18 Jul, 2014

1 commit

  • commit 391acf970d21219a2a5446282d3b20eace0c0d7a upstream.

    When running with kernel 3.15-rc7+, the following bug occurs:
    [ 9969.258987] BUG: sleeping function called from invalid context at kernel/locking/mutex.c:586
    [ 9969.359906] in_atomic(): 1, irqs_disabled(): 0, pid: 160655, name: python
    [ 9969.441175] INFO: lockdep is turned off.
    [ 9969.488184] CPU: 26 PID: 160655 Comm: python Tainted: G A 3.15.0-rc7+ #85
    [ 9969.581032] Hardware name: FUJITSU-SV PRIMEQUEST 1800E/SB, BIOS PRIMEQUEST 1000 Series BIOS Version 1.39 11/16/2012
    [ 9969.706052] ffffffff81a20e60 ffff8803e941fbd0 ffffffff8162f523 ffff8803e941fd18
    [ 9969.795323] ffff8803e941fbe0 ffffffff8109995a ffff8803e941fc58 ffffffff81633e6c
    [ 9969.884710] ffffffff811ba5dc ffff880405c6b480 ffff88041fdd90a0 0000000000002000
    [ 9969.974071] Call Trace:
    [ 9970.003403] [] dump_stack+0x4d/0x66
    [ 9970.065074] [] __might_sleep+0xfa/0x130
    [ 9970.130743] [] mutex_lock_nested+0x3c/0x4f0
    [ 9970.200638] [] ? kmem_cache_alloc+0x1bc/0x210
    [ 9970.272610] [] cpuset_mems_allowed+0x27/0x140
    [ 9970.344584] [] ? __mpol_dup+0x63/0x150
    [ 9970.409282] [] __mpol_dup+0xe5/0x150
    [ 9970.471897] [] ? __mpol_dup+0x63/0x150
    [ 9970.536585] [] ? copy_process.part.23+0x606/0x1d40
    [ 9970.613763] [] ? trace_hardirqs_on+0xd/0x10
    [ 9970.683660] [] ? monotonic_to_bootbased+0x2f/0x50
    [ 9970.759795] [] copy_process.part.23+0x670/0x1d40
    [ 9970.834885] [] do_fork+0xd8/0x380
    [ 9970.894375] [] ? __audit_syscall_entry+0x9c/0xf0
    [ 9970.969470] [] SyS_clone+0x16/0x20
    [ 9971.030011] [] stub_clone+0x69/0x90
    [ 9971.091573] [] ? system_call_fastpath+0x16/0x1b

    The cause is that cpuset_mems_allowed() tries to take
    mutex_lock(&callback_mutex) under rcu_read_lock (which is held in
    __mpol_dup()). In cpuset_mems_allowed(), the access to the cpuset is
    already under rcu_read_lock, so in __mpol_dup we can reduce the
    rcu_read_lock protection region to cover only the access to the cpuset
    in current_cpuset_is_being_rebound(), so that we can avoid this bug.

    This patch is a temporary solution that just addresses the bug
    mentioned above; it cannot fix the long-standing issue about cpuset.mems
    rebinding on fork():

    "When the forker's task_struct is duplicated (which includes
    ->mems_allowed) and it races with an update to cpuset_being_rebound
    in update_tasks_nodemask() then the task's mems_allowed doesn't get
    updated. And the child task's mems_allowed can be wrong if the
    cpuset's nodemask changes before the child has been added to the
    cgroup's tasklist."

    Signed-off-by: Gu Zheng
    Acked-by: Li Zefan
    Signed-off-by: Tejun Heo
    Signed-off-by: Greg Kroah-Hartman

    Gu Zheng
     

10 Jul, 2014

2 commits

  • commit d05f0cdcbe6388723f1900c549b4850360545201 upstream.

    In v2.6.34 commit 9d8cebd4bcd7 ("mm: fix mbind vma merge problem")
    introduced vma merging to mbind(), but it should have also changed the
    convention of passing start vma from queue_pages_range() (formerly
    check_range()) to new_vma_page(): vma merging may have already freed
    that structure, resulting in BUG at mm/mempolicy.c:1738 and probably
    worse crashes.
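
    For context, this is the kind of mbind() usage involved: rebinding a
    range with a single policy lets adjacent VMAs merge. Illustrative only,
    not a reproducer; link with -lnuma:

    #define _GNU_SOURCE
    #include <numaif.h>              /* mbind(), MPOL_*; link with -lnuma */
    #include <stdio.h>
    #include <sys/mman.h>

    int main(void)
    {
            size_t half = 2UL << 20;
            unsigned long nodemask = 1UL;             /* node 0 */
            unsigned long maxnode = 8 * sizeof(nodemask);
            char *base = mmap(NULL, 2 * half, PROT_READ | PROT_WRITE,
                              MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

            if (base == MAP_FAILED)
                    return 1;
            /* Binding only the first half splits the region into two VMAs. */
            if (mbind(base, half, MPOL_BIND, &nodemask, maxnode, 0))
                    perror("mbind (first half)");
            /* Rebinding the whole range with one policy lets the kernel merge
             * the VMAs again -- the merge that the fix above must survive. */
            if (mbind(base, 2 * half, MPOL_BIND, &nodemask, maxnode, 0))
                    perror("mbind (whole range)");
            puts("see /proc/self/numa_maps for the resulting layout");
            return 0;
    }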

    Fixes: 9d8cebd4bcd7 ("mm: fix mbind vma merge problem")
    Reported-by: Naoya Horiguchi
    Tested-by: Naoya Horiguchi
    Signed-off-by: Hugh Dickins
    Acked-by: Christoph Lameter
    Cc: KOSAKI Motohiro
    Cc: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Hugh Dickins
     
  • commit 4a705fef986231a3e7a6b1a6d3c37025f021f49f upstream.

    There's a race between fork() and hugepage migration, as a result we try
    to "dereference" a swap entry as a normal pte, causing kernel panic.
    The cause of the problem is that copy_hugetlb_page_range() can't handle
    "swap entry" family (migration entry and hwpoisoned entry) so let's fix
    it.

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Naoya Horiguchi
    Acked-by: Hugh Dickins
    Cc: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Naoya Horiguchi
     

01 Jul, 2014

5 commits

  • commit 71abdc15adf8c702a1dd535f8e30df50758848d2 upstream.

    When kswapd exits, it can end up taking locks that were previously held
    by allocating tasks while they waited for reclaim. Lockdep currently
    warns about this:

    On Wed, May 28, 2014 at 06:06:34PM +0800, Gu Zheng wrote:
    > inconsistent {RECLAIM_FS-ON-W} -> {IN-RECLAIM_FS-R} usage.
    > kswapd2/1151 [HC0[0]:SC0[0]:HE1:SE1] takes:
    > (&sig->group_rwsem){+++++?}, at: exit_signals+0x24/0x130
    > {RECLAIM_FS-ON-W} state was registered at:
    > mark_held_locks+0xb9/0x140
    > lockdep_trace_alloc+0x7a/0xe0
    > kmem_cache_alloc_trace+0x37/0x240
    > flex_array_alloc+0x99/0x1a0
    > cgroup_attach_task+0x63/0x430
    > attach_task_by_pid+0x210/0x280
    > cgroup_procs_write+0x16/0x20
    > cgroup_file_write+0x120/0x2c0
    > vfs_write+0xc0/0x1f0
    > SyS_write+0x4c/0xa0
    > tracesys+0xdd/0xe2
    > irq event stamp: 49
    > hardirqs last enabled at (49): _raw_spin_unlock_irqrestore+0x36/0x70
    > hardirqs last disabled at (48): _raw_spin_lock_irqsave+0x2b/0xa0
    > softirqs last enabled at (0): copy_process.part.24+0x627/0x15f0
    > softirqs last disabled at (0): (null)
    >
    > other info that might help us debug this:
    > Possible unsafe locking scenario:
    >
    > CPU0
    > ----
    > lock(&sig->group_rwsem);
    > <Interrupt>
    >   lock(&sig->group_rwsem);
    >
    > *** DEADLOCK ***
    >
    > no locks held by kswapd2/1151.
    >
    > stack backtrace:
    > CPU: 30 PID: 1151 Comm: kswapd2 Not tainted 3.10.39+ #4
    > Call Trace:
    > dump_stack+0x19/0x1b
    > print_usage_bug+0x1f7/0x208
    > mark_lock+0x21d/0x2a0
    > __lock_acquire+0x52a/0xb60
    > lock_acquire+0xa2/0x140
    > down_read+0x51/0xa0
    > exit_signals+0x24/0x130
    > do_exit+0xb5/0xa50
    > kthread+0xdb/0x100
    > ret_from_fork+0x7c/0xb0

    This is because the kswapd thread is still marked as a reclaimer at the
    time of exit. But because it is exiting, nobody is actually waiting on
    it to make reclaim progress anymore, and it's nothing but a regular
    thread at this point. Be tidy and strip it of all its powers
    (PF_MEMALLOC, PF_SWAPWRITE, PF_KSWAPD, and the lockdep reclaim state)
    before returning from the thread function.

    Signed-off-by: Johannes Weiner
    Reported-by: Gu Zheng
    Cc: Yasuaki Ishimatsu
    Cc: Tang Chen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Johannes Weiner
     
  • commit 7f39dda9d86fb4f4f17af0de170decf125726f8c upstream.

    Trinity reports BUG:

    sleeping function called from invalid context at kernel/locking/rwsem.c:47
    in_atomic(): 0, irqs_disabled(): 0, pid: 5787, name: trinity-c27

    __might_sleep < down_write < __put_anon_vma < page_get_anon_vma <
    migrate_pages < compact_zone < compact_zone_order < try_to_compact_pages ..

    Right, since conversion to mutex then rwsem, we should not put_anon_vma()
    from inside an rcu_read_lock()ed section: fix the two places that did so.
    And add might_sleep() to anon_vma_free(), as suggested by Peter Zijlstra.

    Fixes: 88c22088bf23 ("mm: optimize page_lock_anon_vma() fast-path")
    Reported-by: Dave Jones
    Signed-off-by: Hugh Dickins
    Cc: Peter Zijlstra
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Hugh Dickins
     
  • commit 74614de17db6fb472370c426d4f934d8d616edf2 upstream.

    When Linux sees an "action optional" machine check (where h/w has reported
    an error that is not in the current execution path) we generally do not
    want to signal a process, since most processes do not have a SIGBUS
    handler - we'd just prematurely terminate the process for a problem that
    they might never actually see.

    task_early_kill() decides whether to consider a process - and it checks
    whether this specific process has been marked for early signals with
    "prctl", or if the system administrator has requested early signals for
    all processes using /proc/sys/vm/memory_failure_early_kill.

    But for the MF_ACTION_REQUIRED case we must not defer. The error is in the
    execution path of the current thread so we must send the SIGBUS
    immediately.

    Fix by passing a flag argument through collect_procs*() to
    task_early_kill() so it knows whether we can defer or must take action.
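
    For reference, the per-process opt-in mentioned above is set from
    userspace like this (illustrative sketch; the fallback defines mirror the
    prctl uapi values):

    #include <stdio.h>
    #include <sys/prctl.h>

    #ifndef PR_MCE_KILL                      /* fallback for old headers */
    #define PR_MCE_KILL        33
    #define PR_MCE_KILL_SET     1
    #define PR_MCE_KILL_EARLY   1
    #endif

    int main(void)
    {
            /* Ask for an early SIGBUS even for "action optional" errors. */
            if (prctl(PR_MCE_KILL, PR_MCE_KILL_SET, PR_MCE_KILL_EARLY, 0, 0))
                    perror("prctl(PR_MCE_KILL)");
            else
                    puts("early machine-check kill enabled for this process");
            return 0;
    }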

    Signed-off-by: Tony Luck
    Signed-off-by: Naoya Horiguchi
    Cc: Andi Kleen
    Cc: Borislav Petkov
    Cc: Chen Gong
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Tony Luck
     
  • commit a70ffcac741d31a406c1d2b832ae43d658e7e1cf upstream.

    When a thread in a multi-threaded application hits a machine check because
    of an uncorrectable error in memory - we want to send the SIGBUS with
    si.si_code = BUS_MCEERR_AR to that thread. Currently we fail to do that
    if the active thread is not the primary thread in the process.
    collect_procs() just finds primary threads and this test:

    if ((flags & MF_ACTION_REQUIRED) && t == current) {

    will see that the thread we found isn't the current thread and so send a
    si.si_code = BUS_MCEERR_AO to the primary (and nothing to the active
    thread at this time).

    We can fix this by checking whether "current" shares the same mm with the
    process that collect_procs() said owned the page. If so, we send the
    SIGBUS to current (with code BUS_MCEERR_AR).
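
    From the application side, the two cases are distinguished by si_code in
    the SIGBUS siginfo; a minimal handler sketch (the fallback defines mirror
    the uapi si_code values):

    #include <signal.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    #ifndef BUS_MCEERR_AR
    #define BUS_MCEERR_AR 4          /* uncorrected error in our execution path */
    #define BUS_MCEERR_AO 5          /* advisory: a page we own was poisoned */
    #endif

    static void sigbus_handler(int sig, siginfo_t *si, void *ctx)
    {
            (void)sig; (void)ctx;
            /* stdio in a signal handler is only acceptable in a demo */
            if (si->si_code == BUS_MCEERR_AR)
                    fprintf(stderr, "action required: bad memory at %p\n", si->si_addr);
            else if (si->si_code == BUS_MCEERR_AO)
                    fprintf(stderr, "action optional: page at %p is poisoned\n", si->si_addr);
            _exit(1);
    }

    int main(void)
    {
            struct sigaction sa;

            memset(&sa, 0, sizeof(sa));
            sa.sa_sigaction = sigbus_handler;
            sa.sa_flags = SA_SIGINFO;
            sigaction(SIGBUS, &sa, NULL);
            pause();                 /* wait; a machine check would land here */
            return 0;
    }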

    Signed-off-by: Tony Luck
    Signed-off-by: Naoya Horiguchi
    Reported-by: Otto Bruggeman
    Cc: Andi Kleen
    Cc: Borislav Petkov
    Cc: Chen Gong
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Tony Luck
     
  • commit 675becce15f320337499bc1a9356260409a5ba29 upstream.

    throttle_direct_reclaim() is meant to trigger during swap-over-network
    during which the min watermark is treated as a pfmemalloc reserve. It
    throttles on the first node in the zonelist but this is flawed.

    The user-visible impact is that a process running on a CPU whose local
    memory node has no ZONE_NORMAL will stall for prolonged periods of time,
    possibly indefinitely. This is due to throttle_direct_reclaim thinking the
    pfmemalloc reserves are depleted when in fact they don't exist on that
    node.

    On a NUMA machine running a 32-bit kernel (I know) allocation requests
    from CPUs on node 1 would detect no pfmemalloc reserves and the process
    gets throttled. This patch adjusts throttling of direct reclaim to
    throttle based on the first node in the zonelist that has a usable
    ZONE_NORMAL or lower zone.

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Mel Gorman
     

17 Jun, 2014

3 commits

  • commit 49e068f0b73dd042c186ffa9b420a9943e90389a upstream.

    The compaction freepage scanner implementation in isolate_freepages()
    starts by taking the current cc->free_pfn value as the first pfn. In a
    for loop, it scans from this first pfn to the end of the pageblock, and
    then subtracts pageblock_nr_pages from the first pfn to obtain the first
    pfn for the next for loop iteration.

    This means that when cc->free_pfn starts at offset X rather than being
    aligned on a pageblock boundary, the scanner will start at offset X in every
    scanned pageblock, ignoring potentially many free pages. Currently this
    can happen when

    a) zone's end pfn is not pageblock aligned, or

    b) through zone->compact_cached_free_pfn with CONFIG_HOLES_IN_ZONE
    enabled and a hole spanning the beginning of a pageblock

    This patch fixes the problem by aligning the initial pfn in
    isolate_freepages() to pageblock boundary. This also permits replacing
    the end-of-pageblock alignment within the for loop with a simple
    pageblock_nr_pages increment.
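
    The alignment itself is a simple round-down; a standalone sketch with an
    illustrative pageblock size of 512 pages (2MB of 4K pages; the value is
    not taken from the patch):

    #include <stdio.h>

    #define PAGEBLOCK_NR_PAGES 512UL        /* illustrative value only */

    static unsigned long pageblock_start_pfn(unsigned long pfn)
    {
            return pfn & ~(PAGEBLOCK_NR_PAGES - 1);  /* round down to the block start */
    }

    int main(void)
    {
            unsigned long free_pfn = 1048571;        /* a pfn that is not pageblock aligned */

            printf("scan the block from pfn %lu rather than %lu\n",
                   pageblock_start_pfn(free_pfn), free_pfn);
            return 0;
    }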

    Signed-off-by: Vlastimil Babka
    Reported-by: Heesub Shin
    Acked-by: Minchan Kim
    Cc: Mel Gorman
    Acked-by: Joonsoo Kim
    Cc: Bartlomiej Zolnierkiewicz
    Cc: Michal Nazarewicz
    Cc: Naoya Horiguchi
    Cc: Christoph Lameter
    Acked-by: Rik van Riel
    Cc: Dongjun Shin
    Cc: Sunghwan Yun
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Vlastimil Babka
     
  • commit 7ed695e069c3cbea5e1fd08f84a04536da91f584 upstream.

    Compaction of a zone is finished when the migrate scanner (which begins
    at the zone's lowest pfn) meets the free page scanner (which begins at
    the zone's highest pfn). This is detected in compact_zone() and in the
    case of direct compaction, the compact_blockskip_flush flag is set so
    that kswapd later resets the cached scanner pfn's, and a new compaction
    may again start at the zone's borders.

    The meeting of the scanners can happen during either scanner's activity.
    However, it may currently fail to be detected when it occurs in the free
    page scanner, due to two problems. First, isolate_freepages() keeps
    free_pfn at the highest block where it isolated pages from, for the
    purposes of not missing the pages that are returned back to allocator
    when migration fails. Second, failing to isolate enough free pages due
    to scanners meeting results in -ENOMEM being returned by
    migrate_pages(), which makes compact_zone() bail out immediately without
    calling compact_finished() that would detect scanners meeting.

    This failure to detect scanners meeting might result in repeated
    attempts at compaction of a zone that keep starting from the cached
    pfn's close to the meeting point, and quickly failing through the
    -ENOMEM path, without the cached pfns being reset, over and over. This
    has been observed (through additional tracepoints) in the third phase of
    the mmtests stress-highalloc benchmark, where the allocator runs on an
    otherwise idle system. The problem was observed in the DMA32 zone,
    which was used as a fallback to the preferred Normal zone, but on the
    4GB system it was actually the largest zone. The problem is even
    amplified for such fallback zone - the deferred compaction logic, which
    could (after being fixed by a previous patch) reset the cached scanner
    pfn's, is only applied to the preferred zone and not for the fallbacks.

    The problem in the third phase of the benchmark was further amplified by
    commit 81c0a2bb515f ("mm: page_alloc: fair zone allocator policy") which
    resulted in a non-deterministic regression of the allocation success
    rate from ~85% to ~65%. This occurs in about half of benchmark runs,
    making bisection problematic. It is unlikely that the commit itself is
    buggy, but it should put more pressure on the DMA32 zone during phases 1
    and 2, which may leave it more fragmented in phase 3 and expose the bugs
    that this patch fixes.

    The fix is to make scanners meeting in isolate_freepages() stay that way,
    and to check in compact_zone() for scanners meeting when migrate_pages()
    returns -ENOMEM. The result is that compact_finished() also detects
    scanners meeting and sets the compact_blockskip_flush flag to make
    kswapd reset the scanner pfn's.

    The results in stress-highalloc benchmark show that the "regression" by
    commit 81c0a2bb515f in phase 3 no longer occurs, and phase 1 and 2
    allocation success rates are also significantly improved.

    Signed-off-by: Vlastimil Babka
    Cc: Mel Gorman
    Cc: Rik van Riel
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Vlastimil Babka
     
  • commit d3132e4b83e6bd383c74d716f7281d7c3136089c upstream.

    Compaction caches pfn's for its migrate and free scanners to avoid
    scanning the whole zone each time. In compact_zone(), the cached values
    are read to set up initial values for the scanners. There are several
    situations when these cached pfn's are reset to the first and last pfn
    of the zone, respectively. One of these situations is when a compaction
    has been deferred for a zone and is now being restarted during a direct
    compaction, which is also done in compact_zone().

    However, compact_zone() currently reads the cached pfn's *before*
    resetting them. This means the reset doesn't affect the compaction that
    performs it, and with good chance also subsequent compactions, as
    update_pageblock_skip() is likely to be called and update the cached
    pfn's to those being processed. Another chance for a successful reset
    is when a direct compaction detects that migration and free scanners
    meet (which has its own problems addressed by another patch) and sets
    update_pageblock_skip flag which kswapd uses to do the reset because it
    goes to sleep.

    This is clearly a bug that results in non-deterministic behavior, so
    this patch moves the cached pfn reset to be performed *before* the
    values are read.

    Signed-off-by: Vlastimil Babka
    Acked-by: Mel Gorman
    Acked-by: Rik van Riel
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Vlastimil Babka
     

12 Jun, 2014

2 commits

  • commit 624483f3ea82598ab0f62f1bdb9177f531ab1892 upstream.

    While working on the kernel address sanitizer I've discovered a
    use-after-free bug in __put_anon_vma.

    For the last anon_vma, anon_vma->root is freed before the child anon_vma.
    Later, anon_vma_free(anon_vma) references the already freed
    anon_vma->root to check its rwsem.

    This fixes it by freeing the child anon_vma before freeing
    anon_vma->root.

    Signed-off-by: Andrey Ryabinin
    Acked-by: Peter Zijlstra
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Andrey Ryabinin
     
  • commit 3e030ecc0fc7de10fd0da10c1c19939872a31717 upstream.

    When a memory error happens on an in-use page or (free and in-use)
    hugepage, the victim page is isolated with its refcount set to one.

    When you try to unpoison it later, unpoison_memory() calls put_page()
    for it twice in order to bring the page back to the free page pool (buddy
    or free hugepage list). However, if another memory error occurs on the
    page which we are unpoisoning, memory_failure() returns without
    releasing the refcount which was incremented in the same call at first,
    which results in a memory leak and inconsistent num_poisoned_pages
    statistics. This patch fixes it.

    Signed-off-by: Naoya Horiguchi
    Cc: Andi Kleen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Naoya Horiguchi
     

08 Jun, 2014

4 commits

  • commit 5a838c3b60e3a36ade764cf7751b8f17d7c9c2da upstream.

    pcpu_chunk_struct_size = sizeof(struct pcpu_chunk) +
    BITS_TO_LONGS(pcpu_unit_pages) * sizeof(unsigned long)

    It could hardly ever be bigger than PAGE_SIZE even for a large-scale machine,
    but for consistency with its counterpart pcpu_mem_zalloc(),
    use pcpu_mem_free() instead.
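
    Spelled out, the quoted size is tiny; a standalone sketch of the
    arithmetic (the struct and the pcpu_unit_pages value are invented
    stand-ins):

    #include <stdio.h>

    #define BITS_PER_LONG    (8 * sizeof(unsigned long))
    #define BITS_TO_LONGS(n) (((n) + BITS_PER_LONG - 1) / BITS_PER_LONG)

    struct chunk_stub { void *base_addr; int free_size; };   /* stand-in */

    int main(void)
    {
            unsigned long pcpu_unit_pages = 64;      /* invented for the example */
            size_t sz = sizeof(struct chunk_stub) +
                        BITS_TO_LONGS(pcpu_unit_pages) * sizeof(unsigned long);

            printf("pcpu_chunk_struct_size ~= %zu bytes, far below PAGE_SIZE\n", sz);
            return 0;
    }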

    Commit b4916cb17c26 ("percpu: make pcpu_free_chunk() use
    pcpu_mem_free() instead of kfree()") addressed this problem, but
    missed this one.

    tj: commit message updated

    Signed-off-by: Jianyu Zhan
    Signed-off-by: Tejun Heo
    Fixes: 099a19d91ca4 ("percpu: allow limited allocation before slab is online")
    Signed-off-by: Greg Kroah-Hartman

    Jianyu Zhan
     
  • commit b985194c8c0a130ed155b71662e39f7eaea4876f upstream.

    For handling a free hugepage in memory failure, the race will happen if
    another thread hwpoisoned this hugepage concurrently. So we need to
    check PageHWPoison instead of !PageHWPoison.

    If hwpoison_filter(p) returns true or a race happens, then we need to
    unlock_page(hpage).

    Signed-off-by: Chen Yucong
    Reviewed-by: Naoya Horiguchi
    Tested-by: Naoya Horiguchi
    Reviewed-by: Andi Kleen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Chen Yucong
     
  • commit dd18dbc2d42af75fffa60c77e0f02220bc329829 upstream.

    It's critical for split_huge_page() (and migration) to catch and freeze
    all PMDs on rmap walk. It gets tricky if there's concurrent fork() or
    mremap() since usually we copy/move page table entries on dup_mm() or
    move_page_tables() without the rmap lock taken. To make this work we rely on
    rmap walk order not to miss any entry: we expect to see the destination VMA
    after the source one.

    But after switching rmap implementation to interval tree it's not always
    possible to preserve expected walk order.

    It works fine for dup_mm() since new VMA has the same vma_start_pgoff()
    / vma_last_pgoff() and explicitly insert dst VMA after src one with
    vma_interval_tree_insert_after().

    But on move_vma() the destination VMA can be merged into an adjacent one and
    as a result shifted left in the interval tree. Fortunately, we can detect the
    situation and prevent the race with the rmap walk by moving page table entries
    under the rmap lock. See commit 38a76013ad80.

    Problem is that we miss the lock when we move transhuge PMD. Most
    likely this bug caused the crash[1].

    [1] http://thread.gmane.org/gmane.linux.kernel.mm/96473

    Fixes: 108d6642ad81 ("mm anon rmap: remove anon_vma_moveto_tail")

    Signed-off-by: Kirill A. Shutemov
    Reviewed-by: Andrea Arcangeli
    Cc: Rik van Riel
    Acked-by: Michel Lespinasse
    Cc: Dave Jones
    Cc: David Miller
    Acked-by: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Kirill A. Shutemov
     
  • commit 1b17844b29ae042576bea588164f2f1e9590a8bc upstream.

    fixup_user_fault() is used by the futex code when the direct user access
    fails, and the futex code wants it to either map in the page in a usable
    form or return an error. It relied on handle_mm_fault() to map the
    page, and correctly checked the error return from that, but while that
    does map the page, it doesn't actually guarantee that the page will be
    mapped with sufficient permissions to be then accessed.

    So do the appropriate tests of the vma access rights by hand.

    [ Side note: arguably handle_mm_fault() could just do that itself, but
    we have traditionally done it in the caller, because some callers -
    notably get_user_pages() - have been able to access pages even when
    they are mapped with PROT_NONE. Maybe we should re-visit that design
    decision, but in the meantime this is the minimal patch. ]

    Found by Dave Jones running his trinity tool.

    Reported-by: Dave Jones
    Acked-by: Hugh Dickins
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Linus Torvalds
     

31 May, 2014

1 commit

  • commit 7848a4bf51b34f41fcc9bd77e837126d99ae84e3 upstream.

    The soft lockup in freeing gigantic hugepages fixed by commit 55f67141a892
    ("mm: hugetlb: fix softlockup when a large number of hugepages are freed.")
    can also happen in return_unused_surplus_pages(), so let's fix it there too.

    Signed-off-by: Masayoshi Mizuma
    Signed-off-by: Naoya Horiguchi
    Cc: Joonsoo Kim
    Cc: Michal Hocko
    Cc: Aneesh Kumar
    Cc: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Mizuma, Masayoshi
     

06 May, 2014

2 commits

  • commit 55f67141a8927b2be3e51840da37b8a2320143ed upstream.

    When I decrease the value of nr_hugepages in procfs by a large amount, a
    softlockup happens. It is because there is no chance of a context switch
    during this process.

    On the other hand, when I allocate a large number of hugepages, there is
    some chance of a context switch, hence a softlockup doesn't happen during
    that process. So it's necessary to add a context switch to the freeing
    process, the same as in the allocating process, to avoid the softlockup.

    When I freed 12 TB of hugepages with kernel-2.6.32-358.el6, the freeing
    process occupied a CPU for over 150 seconds and the following softlockup
    message appeared twice or more.

    $ echo 6000000 > /proc/sys/vm/nr_hugepages
    $ cat /proc/sys/vm/nr_hugepages
    6000000
    $ grep ^Huge /proc/meminfo
    HugePages_Total: 6000000
    HugePages_Free: 6000000
    HugePages_Rsvd: 0
    HugePages_Surp: 0
    Hugepagesize: 2048 kB
    $ echo 0 > /proc/sys/vm/nr_hugepages

    BUG: soft lockup - CPU#16 stuck for 67s! [sh:12883] ...
    Pid: 12883, comm: sh Not tainted 2.6.32-358.el6.x86_64 #1
    Call Trace:
    free_pool_huge_page+0xb8/0xd0
    set_max_huge_pages+0x128/0x190
    hugetlb_sysctl_handler_common+0x113/0x140
    hugetlb_sysctl_handler+0x1e/0x20
    proc_sys_call_handler+0x97/0xd0
    proc_sys_write+0x14/0x20
    vfs_write+0xb8/0x1a0
    sys_write+0x51/0x90
    __audit_syscall_exit+0x265/0x290
    system_call_fastpath+0x16/0x1b

    I have not confirmed this problem with upstream kernels because I am not
    able to prepare the machine equipped with 12TB memory now. However I
    confirmed that the amount of decreasing hugepages was directly
    proportional to the amount of required time.

    I measured required times on a smaller machine. It showed 130-145
    hugepages decreased in a millisecond.

    Amount of decreasing     Required time    Decreasing rate
    hugepages                    (msec)        (pages/msec)
    ------------------------------------------------------------
    10,000 pages == 20GB        70 -  74          135-142
    30,000 pages == 60GB       208 - 229          131-144

    At this decreasing rate, freeing 6TB of hugepages would trigger a
    softlockup with the default threshold of 20 seconds.
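
    A quick arithmetic check of that claim, using only the 2MB hugepage size
    and the measured rate quoted above:

    #include <stdio.h>

    int main(void)
    {
            unsigned long long pages = (6ULL << 40) / (2ULL << 20);  /* 6TB of 2MB hugepages */
            double rate = 140.0;                  /* pages per msec, from the table above */

            printf("%llu pages / %.0f pages per msec = %.1f seconds (watchdog: 20s)\n",
                   pages, rate, pages / rate / 1000.0);
            return 0;
    }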

    Signed-off-by: Masayoshi Mizuma
    Cc: Joonsoo Kim
    Cc: Michal Hocko
    Cc: Wanpeng Li
    Cc: Aneesh Kumar
    Cc: KOSAKI Motohiro
    Cc: Naoya Horiguchi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Mizuma, Masayoshi
     
  • commit 57e68e9cd65b4b8eb4045a1e0d0746458502554c upstream.

    A BUG_ON(!PageLocked) was triggered in mlock_vma_page() by Sasha Levin
    fuzzing with trinity. The call site try_to_unmap_cluster() does not lock
    the pages other than its check_page parameter (which is already locked).

    The BUG_ON in mlock_vma_page() is not documented and its purpose is
    somewhat unclear, but apparently it serializes against page migration,
    which could otherwise fail to transfer the PG_mlocked flag. This would
    not be fatal, as the page would be eventually encountered again, but
    NR_MLOCK accounting would become distorted nevertheless. This patch adds
    a comment to the BUG_ON in mlock_vma_page() and munlock_vma_page() to that
    effect.

    The call site try_to_unmap_cluster() is fixed so that for page !=
    check_page, trylock_page() is attempted (to avoid possible deadlocks as we
    already have check_page locked) and mlock_vma_page() is performed only
    upon success. If the page lock cannot be obtained, the page is left
    without PG_mlocked, which is again not a problem in the whole unevictable
    memory design.

    Signed-off-by: Vlastimil Babka
    Signed-off-by: Bob Liu
    Reported-by: Sasha Levin
    Cc: Wanpeng Li
    Cc: Michel Lespinasse
    Cc: KOSAKI Motohiro
    Acked-by: Rik van Riel
    Cc: David Rientjes
    Cc: Mel Gorman
    Cc: Hugh Dickins
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Vlastimil Babka
     

27 Apr, 2014

2 commits

  • commit 5acda9d12dcf1ad0d9a5a2a7c646de3472fa7555 upstream.

    After commit 839a8e8660b6 ("writeback: replace custom worker pool
    implementation with unbound workqueue") when device is removed while we
    are writing to it we crash in bdi_writeback_workfn() ->
    set_worker_desc() because bdi->dev is NULL.

    This can happen because even though bdi_unregister() cancels all pending
    flushing work, nothing really prevents new ones from being queued from
    balance_dirty_pages() or other places.

    Fix the problem by clearing BDI_registered bit in bdi_unregister() and
    checking it before scheduling of any flushing work.

    Fixes: 839a8e8660b6777e7fe4e80af1a048aebe2b5977

    Reviewed-by: Tejun Heo
    Signed-off-by: Jan Kara
    Cc: Derek Basehore
    Cc: Jens Axboe
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Jan Kara
     
  • commit 6ca738d60c563d5c6cf6253ee4b8e76fa77b2b9e upstream.

    bdi_wakeup_thread_delayed() used the mod_delayed_work() function to
    schedule work to writeback dirty inodes. The problem with this is that
    it can delay work that is scheduled for immediate execution, such as the
    work from sync_inodes_sb(). This can happen since mod_delayed_work()
    can now steal work from a work_queue. This fixes the problem by using
    queue_delayed_work() instead. This is a regression caused by commit
    839a8e8660b6 ("writeback: replace custom worker pool implementation with
    unbound workqueue").

    The reason that this causes a problem is that laptop-mode will change
    the delay, dirty_writeback_centisecs, to 60000 (10 minutes) by default.
    In the case that bdi_wakeup_thread_delayed() races with
    sync_inodes_sb(), sync will be stopped for 10 minutes and trigger a hung
    task. Even if dirty_writeback_centisecs is not long enough to cause a
    hung task, we still don't want to delay sync for that long.

    We fix the problem by using queue_delayed_work() when we want to
    schedule writeback sometime in future. This function doesn't change the
    timer if it is already armed.

    For the same reason, we also change bdi_writeback_workfn() to
    immediately queue the work again in the case that the work_list is not
    empty. The same problem can happen if the sync work is run on the
    rescue worker.

    [jack@suse.cz: update changelog, add comment, use bdi_wakeup_thread_delayed()]
    Signed-off-by: Derek Basehore
    Reviewed-by: Jan Kara
    Cc: Alexander Viro
    Reviewed-by: Tejun Heo
    Cc: Greg Kroah-Hartman
    Cc: "Darrick J. Wong"
    Cc: Derek Basehore
    Cc: Kees Cook
    Cc: Benson Leung
    Cc: Sonny Rao
    Cc: Luigi Semenzato
    Cc: Jens Axboe
    Cc: Dave Chinner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Derek Basehore
     

04 Apr, 2014

1 commit

  • commit 668f9abbd4334e6c29fa8acd71635c4f9101caa7 upstream.

    Commit bf6bddf1924e ("mm: introduce compaction and migration for
    ballooned pages") introduces page_count(page) into memory compaction
    which dereferences page->first_page if PageTail(page).

    This results in a very rare NULL pointer dereference on the
    aforementioned page_count(page). Indeed, anything that does
    compound_head(), including page_count() is susceptible to racing with
    prep_compound_page() and seeing a NULL or dangling page->first_page
    pointer.

    This patch uses Andrea's implementation of compound_trans_head() that
    deals with such a race and makes it the default compound_head()
    implementation. This includes a read memory barrier that ensures that
    if PageTail(head) is true that we return a head page that is neither
    NULL nor dangling. The patch then adds a store memory barrier to
    prep_compound_page() to ensure page->first_page is set.

    This is the safest way to ensure we see the head page that we are
    expecting, PageTail(page) is already in the unlikely() path and the
    memory barriers are unfortunately required.

    Hugetlbfs is the exception, we don't enforce a store memory barrier
    during init since no race is possible.

    Signed-off-by: David Rientjes
    Cc: Holger Kiehl
    Cc: Christoph Lameter
    Cc: Rafael Aquini
    Cc: Vlastimil Babka
    Cc: Michal Hocko
    Cc: Mel Gorman
    Cc: Andrea Arcangeli
    Cc: Rik van Riel
    Cc: "Kirill A. Shutemov"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    David Rientjes
     

24 Mar, 2014

2 commits

  • commit 4fb1a86fb5e4209a7d4426d4e586c58e9edc74ac upstream.

    Sometimes the cleanup after memcg hierarchy testing gets stuck in
    mem_cgroup_reparent_charges(), unable to bring non-kmem usage down to 0.

    There may turn out to be several causes, but a major cause is this: the
    workitem to offline parent can get run before workitem to offline child;
    parent's mem_cgroup_reparent_charges() circles around waiting for the
    child's pages to be reparented to its lrus, but it's holding
    cgroup_mutex which prevents the child from reaching its
    mem_cgroup_reparent_charges().

    Further testing showed that an ordered workqueue for cgroup_destroy_wq
    is not always good enough: percpu_ref_kill_and_confirm's call_rcu_sched
    stage on the way can mess up the order before reaching the workqueue.

    Instead, when offlining a memcg, call mem_cgroup_reparent_charges() on
    all its children (and grandchildren, in the correct order) to have their
    charges reparented first.

    [The version for 3.10.34 (or perhaps now 3.10.35) is this below.
    Yes, more differences, and the old mem_cgroup_reparent_charges line
    is intentionally left in for 3.10 whereas it was removed for 3.12+:
    that's because the css/cgroup iterator changed in between, it used
    not to supply the root of the subtree, but nowadays it does - Hugh]

    Fixes: e5fca243abae ("cgroup: use a dedicated workqueue for cgroup destruction")
    Signed-off-by: Filipe Brandenburger
    Signed-off-by: Hugh Dickins
    Reviewed-by: Tejun Heo
    Acked-by: Michal Hocko
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Filipe Brandenburger
     
  • commit 2af120bc040c5ebcda156df6be6a66610ab6957f upstream.

    We received several reports of bad page state when freeing CMA pages
    previously allocated with alloc_contig_range:

    BUG: Bad page state in process Binder_A pfn:63202
    page:d21130b0 count:0 mapcount:1 mapping: (null) index:0x7dfbf
    page flags: 0x40080068(uptodate|lru|active|swapbacked)

    Based on the page state, it looks like the page was still in use. The
    page flags do not make sense for the use case though. Further debugging
    showed that despite alloc_contig_range returning success, at least one
    page in the range still remained in the buddy allocator.

    There is an issue with isolate_freepages_block. In strict mode (which
    CMA uses), if any pages in the range cannot be isolated,
    isolate_freepages_block should return failure (0). The current check
    keeps track of the total number of isolated pages and compares against
    the size of the range:

    if (strict && nr_strict_required > total_isolated)
            total_isolated = 0;

    After taking the zone lock, if one of the pages in the range is not in
    the buddy allocator, we continue through the loop and do not increment
    total_isolated. If in the last iteration of the loop we isolate more
    than one page (e.g. the last page needed is a higher-order page), the
    check on total_isolated may pass and we fail to detect that a page was
    skipped. The fix is to bail out of the loop immediately if we are in
    strict mode; there's no benefit to continuing anyway since we need all
    pages to be isolated. Additionally, drop the error checking based on
    nr_strict_required and just check the pfn ranges. This matches what
    isolate_freepages_range does.
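
    A toy sketch of the strict-mode behaviour described above, with an array
    of flags standing in for the pfn range (nothing here is taken from
    mm/compaction.c):

    #include <stdbool.h>
    #include <stdio.h>

    /* Return how many entries were isolated; in strict mode, give up and
     * report 0 as soon as a single entry in the range cannot be isolated. */
    static int isolate_range(const bool *page_is_free, int nr, bool strict)
    {
            int i, isolated = 0;

            for (i = 0; i < nr; i++) {
                    if (!page_is_free[i]) {
                            if (strict)
                                    return 0;       /* CMA needs the whole range */
                            continue;
                    }
                    isolated++;
            }
            return isolated;
    }

    int main(void)
    {
            bool range[8] = { true, true, true, false, true, true, true, true };

            printf("best effort: %d isolated\n", isolate_range(range, 8, false));
            printf("strict mode: %d isolated\n", isolate_range(range, 8, true));
            return 0;
    }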

    Signed-off-by: Laura Abbott
    Acked-by: Minchan Kim
    Cc: Mel Gorman
    Acked-by: Vlastimil Babka
    Cc: Joonsoo Kim
    Acked-by: Bartlomiej Zolnierkiewicz
    Acked-by: Michal Nazarewicz
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Laura Abbott
     

07 Mar, 2014

1 commit

  • commit ecc736fc3c71c411a9d201d8588c9e7e049e5d8c upstream.

    Hugh has reported an endless loop when the hardlimit reclaim sees the
    same group all the time. This might happen when the reclaim races with
    the memcg removal.

    shrink_zone
                                             [rmdir root]
      mem_cgroup_iter(root, NULL, reclaim)
        // prev = NULL
        rcu_read_lock()
        mem_cgroup_iter_load
          last_visited = iter->last_visited   // gets root || NULL
          css_tryget(last_visited)            // failed
          last_visited = NULL                 [1]
        memcg = root = __mem_cgroup_iter_next(root, NULL)
        mem_cgroup_iter_update
          iter->last_visited = root;
          reclaim->generation = iter->generation

      mem_cgroup_iter(root, root, reclaim)
        // prev = root
        rcu_read_lock
        mem_cgroup_iter_load
          last_visited = iter->last_visited   // gets root
          css_tryget(last_visited)            // failed
        [1]

    The issue seemed to be introduced by commit 5f5781619718 ("memcg: relax
    memcg iter caching") which has replaced unconditional css_get/css_put by
    css_tryget/css_put for the cached iterator.

    This patch fixes the issue by skipping css_tryget on the root of the
    tree walk in mem_cgroup_iter_load and symmetrically doesn't release it
    in mem_cgroup_iter_update.

    Signed-off-by: Michal Hocko
    Reported-by: Hugh Dickins
    Tested-by: Hugh Dickins
    Cc: Johannes Weiner
    Cc: Greg Thelen
    Cc: [3.10+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Michal Hocko
     

23 Feb, 2014

1 commit

  • commit 8d547ff4ac5927245e0833ac18528f939da0ee0e upstream.

    mce-test detected a test failure when injecting error to a thp tail
    page. This is because we take page refcount of the tail page in
    madvise_hwpoison() while the fix in commit a3e0f9e47d5e
    ("mm/memory-failure.c: transfer page count from head page to tail page
    after split thp") assumes that we always take refcount on the head page.

    When a real memory error happens we take refcount on the head page where
    memory_failure() is called without MF_COUNT_INCREASED set, so it seems
    to me that testing memory error on thp tail page using madvise makes
    little sense.

    This patch cancels moving refcount in !MF_COUNT_INCREASED for valid
    testing.

    [akpm@linux-foundation.org: s/&&/&/]
    Signed-off-by: Naoya Horiguchi
    Cc: Andi Kleen
    Cc: Wanpeng Li
    Cc: Chen Gong
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Naoya Horiguchi
     

21 Feb, 2014

3 commits

  • Based on c8721bbbdd36382de51cd6b7a56322e0acca2414 upstream, but only the
    bugfix portion pulled out.

    Hi Naoya or Greg,

    We found a bug in 3.10.x.
    The problem is that we accidentally have a hwpoisoned hugepage in the free
    hugepage list. It could happen in the following scenario:

    process A                              process B

     migrate_huge_page
      put_page (old hugepage)
       linked to free hugepage list
                                            hugetlb_fault
                                             hugetlb_no_page
                                              alloc_huge_page
                                               dequeue_huge_page_vma
                                                dequeue_huge_page_node
                                                 (steal hwpoisoned hugepage)
     set_page_hwpoison_huge_page
     dequeue_hwpoisoned_huge_page
      (fail to dequeue)

    I tested this bug: one process keeps allocating huge pages, and I
    use the sysfs interface to soft offline a huge page; I then received:
    "MCE: Killing UCP:2717 due to hardware memory corruption fault at 8200034"

    Upstream kernel is free from this bug because of these two commits:

    f15bdfa802bfa5eb6b4b5a241b97ec9fa1204a35
    mm/memory-failure.c: fix memory leak in successful soft offlining

    c8721bbbdd36382de51cd6b7a56322e0acca2414
    mm: memory-hotplug: enable memory hotplug to handle hugepage

    The first one, although it is about a memory leak, moves
    unset_migratetype_isolate(), which is important to avoid the race.
    The latter is not a bug fix and is too big, so I rewrote a small one.

    The following patch fixes this bug (please apply f15bdfa802bf first).

    Signed-off-by: Xishi Qiu
    Reviewed-by: Naoya Horiguchi
    Signed-off-by: Greg Kroah-Hartman

    Xishi Qiu
     
  • commit f15bdfa802bfa5eb6b4b5a241b97ec9fa1204a35 upstream.

    After a successful page migration by soft offlining, the source page is
    not properly freed and it's never reusable even if we unpoison it
    afterward.

    This is caused by the race between freeing page and setting PG_hwpoison.
    In successful soft offlining, the source page is put (and the refcount
    becomes 0) by putback_lru_page() in unmap_and_move(), where it's linked
    to pagevec and actual freeing back to buddy is delayed. So if
    PG_hwpoison is set for the page before freeing, the freeing does not
    function as expected (in that case freeing aborts in the
    free_pages_prepare() check).

    This patch tries to make sure to free the source page before setting
    PG_hwpoison on it. To avoid reallocating, the page keeps
    MIGRATE_ISOLATE until after setting PG_hwpoison.

    This patch also removes obsolete comments about "keeping elevated
    refcount" because what they say is not true. Unlike memory_failure(),
    soft_offline_page() uses no special page isolation code, and the
    soft-offlined pages have no elevated refcount.

    Signed-off-by: Naoya Horiguchi
    Cc: Andi Kleen
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Cc: Xishi Qiu
    Signed-off-by: Greg Kroah-Hartman

    Naoya Horiguchi
     
  • commit a85d9df1ea1d23682a0ed1e100e6965006595d06 upstream.

    During an aio stress test, we observed the following lockdep warning.
    This means AIO+numa_balancing is currently deadlockable.

    The problem is that aio_migratepage disables interrupts, but
    __set_page_dirty_nobuffers unintentionally enables them again.

    Generally, all helper functions should use spin_lock_irqsave() instead of
    spin_lock_irq() because they don't know their caller at all.

    other info that might help us debug this:
    Possible unsafe locking scenario:

    CPU0
    ----
    lock(&(&ctx->completion_lock)->rlock);
    <Interrupt>
      lock(&(&ctx->completion_lock)->rlock);

    *** DEADLOCK ***

    dump_stack+0x19/0x1b
    print_usage_bug+0x1f7/0x208
    mark_lock+0x21d/0x2a0
    mark_held_locks+0xb9/0x140
    trace_hardirqs_on_caller+0x105/0x1d0
    trace_hardirqs_on+0xd/0x10
    _raw_spin_unlock_irq+0x2c/0x50
    __set_page_dirty_nobuffers+0x8c/0xf0
    migrate_page_copy+0x434/0x540
    aio_migratepage+0xb1/0x140
    move_to_new_page+0x7d/0x230
    migrate_pages+0x5e5/0x700
    migrate_misplaced_page+0xbc/0xf0
    do_numa_page+0x102/0x190
    handle_pte_fault+0x241/0x970
    handle_mm_fault+0x265/0x370
    __do_page_fault+0x172/0x5a0
    do_page_fault+0x1a/0x70
    page_fault+0x28/0x30

    Signed-off-by: KOSAKI Motohiro
    Cc: Larry Woodman
    Cc: Rik van Riel
    Cc: Johannes Weiner
    Acked-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    KOSAKI Motohiro
     

14 Feb, 2014

3 commits

  • commit 778c14affaf94a9e4953179d3e13a544ccce7707 upstream.

    A 3% of system memory bonus is sometimes too excessive in comparison to
    other processes.

    With commit a63d83f427fb ("oom: badness heuristic rewrite"), the OOM
    killer tries to avoid killing privileged tasks by subtracting 3% of
    overall memory (system or cgroup) from their per-task consumption. But
    as a result, all root tasks that consume less than 3% of overall memory
    are considered equal, and so it only takes 33+ privileged tasks pushing
    the system out of memory for the OOM killer to do something stupid and
    kill dhclient or other root-owned processes. For example, on a 32G
    machine it can't tell the difference between the 1M agetty and the 10G
    fork bomb member.

    The changelog describes this 3% boost as the equivalent to the global
    overcommit limit being 3% higher for privileged tasks, but this is not
    the same as discounting 3% of overall memory from _every privileged task
    individually_ during OOM selection.

    Replace the 3% of system memory bonus with a 3% of current memory usage
    bonus.

    By giving root tasks a bonus that is proportional to their actual size,
    they remain comparable even when relatively small. In the example
    above, the OOM killer will discount the 1M agetty's 256 badness points
    down to 179, and the 10G fork bomb's 262144 points down to 183500 points
    and make the right choice, instead of discounting both to 0 and killing
    agetty because it's first in the task list.

    Signed-off-by: David Rientjes
    Reported-by: Johannes Weiner
    Acked-by: Johannes Weiner
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    David Rientjes
     
  • commit 8afb1474db4701d1ab80cd8251137a3260e6913e upstream.

    /sys/kernel/slab/:t-0000048 # cat cpu_slabs
    231 N0=16 N1=215
    /sys/kernel/slab/:t-0000048 # cat slabs
    145 N0=36 N1=109

    See, the number of slabs is smaller than that of cpu slabs.

    The bug was introduced by commit 49e2258586b423684f03c278149ab46d8f8b6700
    ("slub: per cpu cache for partial pages").

    We should use page->pages instead of page->pobjects when calculating
    the number of cpu partial slabs. This also fixes the mapping of slabs
    and nodes.

    As there's no variable storing the number of total/active objects in
    cpu partial slabs, and we don't have user interfaces requiring those
    statistics, I just add WARN_ON for those cases.

    Acked-by: Christoph Lameter
    Reviewed-by: Wanpeng Li
    Signed-off-by: Li Zefan
    Signed-off-by: Pekka Enberg
    Signed-off-by: Greg Kroah-Hartman

    Li Zefan
     
  • commit a1c3bfb2f67ef766de03f1f56bdfff9c8595ab14 upstream.

    The VM is currently heavily tuned to avoid swapping. Whether that is
    good or bad is a separate discussion, but as long as the VM won't swap
    to make room for dirty cache, we can not consider anonymous pages when
    calculating the amount of dirtyable memory, the baseline to which
    dirty_background_ratio and dirty_ratio are applied.

    A simple workload that occupies a significant size (40+%, depending on
    memory layout, storage speeds etc.) of memory with anon/tmpfs pages and
    uses the remainder for a streaming writer demonstrates this problem. In
    that case, the actual cache pages are a small fraction of what is
    considered dirtyable overall, which results in a relatively large
    portion of the cache pages being dirtied. As kswapd starts rotating
    these, random tasks enter direct reclaim and stall on IO.

    Only consider free pages and file pages dirtyable.
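
    A conceptual sketch of how the baseline changes (all page counts are
    invented; vm.dirty_ratio is left at its default of 20):

    #include <stdio.h>

    int main(void)
    {
            unsigned long free_pages  =  1UL << 18;   /*  1GB of 4K pages, invented */
            unsigned long file_pages  =  3UL << 18;   /*  3GB of page cache         */
            unsigned long anon_pages  = 12UL << 18;   /* 12GB of anonymous memory   */
            unsigned long dirty_ratio = 20;           /* vm.dirty_ratio default     */

            unsigned long old_base = free_pages + file_pages + anon_pages;
            unsigned long new_base = free_pages + file_pages;   /* after this patch */

            printf("old dirty threshold: %lu pages\n", old_base * dirty_ratio / 100);
            printf("new dirty threshold: %lu pages\n", new_base * dirty_ratio / 100);
            return 0;
    }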

    Signed-off-by: Johannes Weiner
    Reported-by: Tejun Heo
    Tested-by: Tejun Heo
    Reviewed-by: Rik van Riel
    Cc: Mel Gorman
    Cc: Wu Fengguang
    Reviewed-by: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Johannes Weiner