01 Jul, 2014

10 commits

  • commit 1c8349a17137b93f0a83f276c764a6df1b9a116e upstream.

    When we perform a data integrity sync we tag all the dirty pages with
    PAGECACHE_TAG_TOWRITE at the start of ext4_da_writepages. Later we check
    for this tag in write_cache_pages_da and create a struct
    mpage_da_data containing contiguously indexed pages tagged with this
    tag and sync these pages with a call to mpage_da_map_and_submit. This
    process is repeated in a while loop until all the PAGECACHE_TAG_TOWRITE
    pages are synced. We also do journal start and stop in each iteration.
    journal_stop could initiate a journal commit which would call
    ext4_writepage, which in turn calls ext4_bio_write_page even for
    delayed OR unwritten buffers. When ext4_bio_write_page is called for
    such buffers, even though it does not sync them, it clears the
    PAGECACHE_TAG_TOWRITE tag of the corresponding page, and hence these
    pages are not synced by the currently running data integrity sync. We
    end up with dirty pages even though the sync has completed.

    This could cause a potential data loss when the sync call is followed
    by a truncate_pagecache call, which is exactly the case in
    collapse_range. (It will cause generic/127 failure in xfstests)

    To avoid this issue, we can use set_page_writeback_keepwrite instead of
    set_page_writeback, which doesn't clear the TOWRITE tag.
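
    A minimal sketch of the idea (illustrative, not the exact upstream
    diff; the keep_towrite flag name is an assumption):

        /*
         * In ext4_bio_write_page(): when writeback was triggered by a
         * journal commit rather than by the integrity sync itself, keep
         * the TOWRITE tag so the running sync still finds the page.
         */
        if (keep_towrite)
            set_page_writeback_keepwrite(page);
        else
            set_page_writeback(page);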

    Signed-off-by: Namjae Jeon
    Signed-off-by: Ashish Sangwan
    Signed-off-by: "Theodore Ts'o"
    Reviewed-by: Jan Kara
    Signed-off-by: Greg Kroah-Hartman

    Namjae Jeon
     
  • commit 71abdc15adf8c702a1dd535f8e30df50758848d2 upstream.

    When kswapd exits, it can end up taking locks that were previously held
    by allocating tasks while they waited for reclaim. Lockdep currently
    warns about this:

    On Wed, May 28, 2014 at 06:06:34PM +0800, Gu Zheng wrote:
    > inconsistent {RECLAIM_FS-ON-W} -> {IN-RECLAIM_FS-R} usage.
    > kswapd2/1151 [HC0[0]:SC0[0]:HE1:SE1] takes:
    > (&sig->group_rwsem){+++++?}, at: exit_signals+0x24/0x130
    > {RECLAIM_FS-ON-W} state was registered at:
    > mark_held_locks+0xb9/0x140
    > lockdep_trace_alloc+0x7a/0xe0
    > kmem_cache_alloc_trace+0x37/0x240
    > flex_array_alloc+0x99/0x1a0
    > cgroup_attach_task+0x63/0x430
    > attach_task_by_pid+0x210/0x280
    > cgroup_procs_write+0x16/0x20
    > cgroup_file_write+0x120/0x2c0
    > vfs_write+0xc0/0x1f0
    > SyS_write+0x4c/0xa0
    > tracesys+0xdd/0xe2
    > irq event stamp: 49
    > hardirqs last enabled at (49): _raw_spin_unlock_irqrestore+0x36/0x70
    > hardirqs last disabled at (48): _raw_spin_lock_irqsave+0x2b/0xa0
    > softirqs last enabled at (0): copy_process.part.24+0x627/0x15f0
    > softirqs last disabled at (0): (null)
    >
    > other info that might help us debug this:
    > Possible unsafe locking scenario:
    >
    > CPU0
    > ----
    > lock(&sig->group_rwsem);
    >
    > lock(&sig->group_rwsem);
    >
    > *** DEADLOCK ***
    >
    > no locks held by kswapd2/1151.
    >
    > stack backtrace:
    > CPU: 30 PID: 1151 Comm: kswapd2 Not tainted 3.10.39+ #4
    > Call Trace:
    > dump_stack+0x19/0x1b
    > print_usage_bug+0x1f7/0x208
    > mark_lock+0x21d/0x2a0
    > __lock_acquire+0x52a/0xb60
    > lock_acquire+0xa2/0x140
    > down_read+0x51/0xa0
    > exit_signals+0x24/0x130
    > do_exit+0xb5/0xa50
    > kthread+0xdb/0x100
    > ret_from_fork+0x7c/0xb0

    This is because the kswapd thread is still marked as a reclaimer at the
    time of exit. But because it is exiting, nobody is actually waiting on
    it to make reclaim progress anymore, and it's nothing but a regular
    thread at this point. Be tidy and strip it of all its powers
    (PF_MEMALLOC, PF_SWAPWRITE, PF_KSWAPD, and the lockdep reclaim state)
    before returning from the thread function.
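
    A sketch of what that cleanup looks like at the end of the kswapd()
    thread function (helper names assumed from that era's vmscan code):

        /* kswapd is exiting: drop its reclaim privileges */
        tsk->flags &= ~(PF_MEMALLOC | PF_SWAPWRITE | PF_KSWAPD);
        current->reclaim_state = NULL;
        lockdep_clear_current_reclaim_state();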

    Signed-off-by: Johannes Weiner
    Reported-by: Gu Zheng
    Cc: Yasuaki Ishimatsu
    Cc: Tang Chen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Johannes Weiner
     
  • commit 7f39dda9d86fb4f4f17af0de170decf125726f8c upstream.

    Trinity reports BUG:

    sleeping function called from invalid context at kernel/locking/rwsem.c:47
    in_atomic(): 0, irqs_disabled(): 0, pid: 5787, name: trinity-c27

    __might_sleep < down_write < __put_anon_vma < page_get_anon_vma <
    migrate_pages < compact_zone < compact_zone_order < try_to_compact_pages ..

    Right, since conversion to mutex then rwsem, we should not put_anon_vma()
    from inside an rcu_read_lock()ed section: fix the two places that did so.
    And add might_sleep() to anon_vma_free(), as suggested by Peter Zijlstra.
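
    A sketch of the kind of reordering involved, e.g. in
    page_get_anon_vma() (illustrative, not the exact diff):

        if (!page_mapped(page)) {
            /*
             * Leave the RCU read-side section before put_anon_vma(),
             * which may now sleep on the anon_vma rwsem.
             */
            rcu_read_unlock();
            put_anon_vma(anon_vma);
            return NULL;
        }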

    Fixes: 88c22088bf23 ("mm: optimize page_lock_anon_vma() fast-path")
    Reported-by: Dave Jones
    Signed-off-by: Hugh Dickins
    Cc: Peter Zijlstra
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Hugh Dickins
     
  • commit 3ba08129e38437561df44c36b7ea9081185d5333 upstream.

    Currently the memory error handler handles action-optional errors in a
    deferred manner by default. If a recovery-aware application wants to
    handle them immediately, it can do so by setting the PF_MCE_EARLY flag.
    However, such a signal can be sent only to the main thread, which is
    problematic if the application wants to have a dedicated thread to
    handle such signals.

    So this patch adds dedicated thread support to the memory error
    handler. The PF_MCE_EARLY flag is kept per thread, so with this patch
    the AO signal is sent to the thread with PF_MCE_EARLY set, not to the
    main thread. If you want to implement a dedicated thread, you call
    prctl() to set PF_MCE_EARLY on that thread.
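
    For example, a dedicated handler thread could opt in roughly like this
    from userspace (PR_MCE_KILL_EARLY is the prctl-level counterpart of
    PF_MCE_EARLY):

        #include <stdio.h>
        #include <sys/prctl.h>

        /* Run this in the dedicated thread: request early SIGBUS delivery */
        if (prctl(PR_MCE_KILL, PR_MCE_KILL_SET, PR_MCE_KILL_EARLY, 0, 0))
            perror("prctl(PR_MCE_KILL)");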

    The memory error handler collects processes to be killed, so this patch
    lets it check the PF_MCE_EARLY flag on each thread in the collecting
    routines.

    No behavioral change for all non-early kill cases.

    Tony said:

    : The old behavior was crazy - someone with a multithreaded process might
    : well expect that if they call prctl(PF_MCE_EARLY) in just one thread, then
    : that thread would see the SIGBUS with si_code = BUS_MCEERR_AO - even if
    : that thread wasn't the main thread for the process.

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Naoya Horiguchi
    Reviewed-by: Tony Luck
    Cc: Kamil Iskra
    Cc: Andi Kleen
    Cc: Borislav Petkov
    Cc: Chen Gong
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Naoya Horiguchi
     
  • commit 74614de17db6fb472370c426d4f934d8d616edf2 upstream.

    When Linux sees an "action optional" machine check (where h/w has reported
    an error that is not in the current execution path) we generally do not
    want to signal a process, since most processes do not have a SIGBUS
    handler - we'd just prematurely terminate the process for a problem that
    they might never actually see.

    task_early_kill() decides whether to consider a process - and it checks
    whether this specific process has been marked for early signals with
    "prctl", or if the system administrator has requested early signals for
    all processes using /proc/sys/vm/memory_failure_early_kill.

    But for the MF_ACTION_REQUIRED case we must not defer: the error is in
    the execution path of the current thread, so we must send the SIGBUS
    immediately.

    Fix by passing a flag argument through collect_procs*() to
    task_early_kill() so it knows whether we can defer or must take action.
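
    Roughly, the check then becomes something like the sketch below (the
    force_early parameter name is illustrative):

        static bool task_early_kill(struct task_struct *tsk, int force_early)
        {
            if (!tsk->mm)
                return false;
            /* MF_ACTION_REQUIRED: never defer, signal right away */
            if (force_early)
                return true;
            if (tsk->flags & PF_MCE_PROCESS)
                return !!(tsk->flags & PF_MCE_EARLY);
            return sysctl_memory_failure_early_kill;
        }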

    Signed-off-by: Tony Luck
    Signed-off-by: Naoya Horiguchi
    Cc: Andi Kleen
    Cc: Borislav Petkov
    Cc: Chen Gong
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Tony Luck
     
  • commit a70ffcac741d31a406c1d2b832ae43d658e7e1cf upstream.

    When a thread in a multi-threaded application hits a machine check because
    of an uncorrectable error in memory - we want to send the SIGBUS with
    si.si_code = BUS_MCEERR_AR to that thread. Currently we fail to do that
    if the active thread is not the primary thread in the process.
    collect_procs() just finds primary threads and this test:

    if ((flags & MF_ACTION_REQUIRED) && t == current) {

    will see that the thread we found isn't the current thread and so sends
    a si.si_code = BUS_MCEERR_AO to the primary thread (and nothing to the
    active thread at this time).

    We can fix this by checking whether "current" shares the same mm with the
    process that collect_procs() said owned the page. If so, we send the
    SIGBUS to current (with code BUS_MCEERR_AR).
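
    A sketch of the resulting test in the signalling path (illustrative):

        if ((flags & MF_ACTION_REQUIRED) && t->mm == current->mm) {
            /* The faulting thread shares the mm: signal it directly */
            si.si_code = BUS_MCEERR_AR;
            ret = force_sig_info(SIGBUS, &si, current);
        }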

    Signed-off-by: Tony Luck
    Signed-off-by: Naoya Horiguchi
    Reported-by: Otto Bruggeman
    Cc: Andi Kleen
    Cc: Borislav Petkov
    Cc: Chen Gong
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Tony Luck
     
  • commit e58469bafd0524e848c3733bc3918d854595e20f upstream.

    The test_bit operations in the get/set pageblock flags helpers are
    expensive. This patch reads the bitmap on a word basis and uses shifts
    and masks to isolate the bits of interest. Similarly, masks are used to
    set a local copy of the bitmap, and cmpxchg is then used to update the
    bitmap if no other changes have been made in parallel.
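
    The update side then boils down to a compare-and-swap loop along these
    lines (a simplified sketch, not the exact upstream helpers):

        word = bitmap[word_bitidx];
        for (;;) {
            old_word = cmpxchg(&bitmap[word_bitidx], word,
                               (word & ~mask) | flags);
            if (old_word == word)
                break;            /* nobody changed the word meanwhile, done */
            word = old_word;      /* lost the race, retry against new value */
        }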

    In a test running dd onto tmpfs the overhead of the pageblock-related
    functions went from 1.27% in profiles to 0.5%.

    In addition to the performance benefits, this patch closes races that are
    possible between:

    a) get_ and set_pageblock_migratetype(), where get_pageblock_migratetype()
    reads part of the bits before and the other part of the bits after
    set_pageblock_migratetype() has updated them.

    b) set_pageblock_migratetype() and set_pageblock_skip(), where the
    non-atomic read-modify-update set-bit operation in set_pageblock_skip()
    will cause lost updates to some bits changed by set_pageblock_migratetype().

    Joonsoo Kim first reported case a) via code inspection. Vlastimil
    Babka's testing with a debug patch showed that either a) or b) occurs
    roughly once per mmtests' stress-highalloc benchmark (although not
    necessarily in the same pageblock). Furthermore, during development of
    unrelated compaction patches, it was observed that with frequent calls
    to {start,undo}_isolate_page_range() the race occurs several thousand
    times and results in NULL pointer dereferences in move_freepages()
    and free_one_page() in places where free_list[migratetype] is
    manipulated by e.g. list_move(). Further debugging confirmed that
    migratetype had an invalid value of 6, causing out-of-bounds access to
    the free_list array.

    That confirmed that the race exists, although it may be extremely rare,
    and currently only fatal where page isolation is performed due to
    memory hot-remove. Races on pageblocks being updated by
    set_pageblock_migratetype(), where both the old and new migratetype are
    lower than MIGRATE_RESERVE, currently cannot result in an invalid value
    being observed, although theoretically they may still lead to
    unexpected creation or destruction of MIGRATE_RESERVE pageblocks.
    Furthermore, things could get suddenly worse when memory isolation is
    used more, or when new migratetypes are added.

    After this patch, the race has not been observed in testing.

    Signed-off-by: Mel Gorman
    Acked-by: Vlastimil Babka
    Reported-by: Joonsoo Kim
    Reported-and-tested-by: Vlastimil Babka
    Cc: Johannes Weiner
    Cc: Jan Kara
    Cc: Michal Hocko
    Cc: Hugh Dickins
    Cc: Dave Hansen
    Cc: Theodore Ts'o
    Cc: "Paul E. McKenney"
    Cc: Oleg Nesterov
    Cc: Rik van Riel
    Cc: Peter Zijlstra
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Mel Gorman
     
  • commit d8dc595ce3909fbc131bdf5ab8c9808fe624b18d upstream.

    Eric has reported that he can see task(s) stuck in memcg OOM handler
    regularly. The only way out is to

    echo 0 > $GROUP/memory.oom_control

    His usecase is:

    - Setup a hierarchy with memory and the freezer (disable kernel oom and
    have a process watch for oom).

    - In that memory cgroup add a process with one thread per cpu.

    - In one thread slowly allocate memory once per second (I think it is
    16M of RAM) and mlock and dirty it (just to force the pages into RAM
    and keep them there).

    - When OOM is reached, loop:
    * attempt to freeze all of the tasks.
    * if frozen, send every task SIGKILL, unfreeze, remove the directory in
    cgroupfs.

    Eric has then pinpointed the issue to be memcg specific.

    All tasks are sitting on the memcg_oom_waitq when memcg oom is disabled.
    Those that have received fatal signal will bypass the charge and should
    continue on their way out. The tricky part is that the exit path might
    trigger a page fault (e.g. exit_robust_list), thus the memcg charge,
    while its memcg is still under OOM because nobody has released any charges
    yet.

    Unlike with the in-kernel OOM handler, the exiting task doesn't get
    TIF_MEMDIE set, so it doesn't short-circuit further charges of the
    killed task and falls into the memcg OOM again without any way out of
    it, as there are no fatal signals pending anymore.

    This patch fixes the issue by checking PF_EXITING early in
    mem_cgroup_try_charge and bypassing the charge, the same as if the task
    had a fatal signal pending or TIF_MEMDIE set.
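
    A sketch of the bypass condition in the charge path (illustrative):

        if (unlikely(test_thread_flag(TIF_MEMDIE) ||
                     fatal_signal_pending(current) ||
                     current->flags & PF_EXITING))
            goto bypass;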

    Normally exiting tasks (i.e. not killed) will now bypass the charge as
    well, but this should be OK: the task is leaving and will release
    memory, and increasing the memory pressure just to release it a moment
    later seems like a dubious waste of cycles. Besides, charges after
    exit_signals should be rare.

    I am bringing this patch again (rebased on the current mmotm tree). I
    hope we can finally move forward. If there is still opposition then
    I would really appreciate a concurrent approach so that we can discuss
    alternatives.

    http://comments.gmane.org/gmane.linux.kernel.stable/77650 is a reference
    to the followup discussion when the patch has been dropped from the mmotm
    last time.

    Reported-by: Eric W. Biederman
    Signed-off-by: Michal Hocko
    Acked-by: David Rientjes
    Acked-by: Johannes Weiner
    Cc: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Michal Hocko
     
  • commit 675becce15f320337499bc1a9356260409a5ba29 upstream.

    throttle_direct_reclaim() is meant to trigger during swap-over-network,
    during which the min watermark is treated as a pfmemalloc reserve. It
    throttles on the first node in the zonelist, but this is flawed.

    The user-visible impact is that a process running on a CPU whose local
    memory node has no ZONE_NORMAL will stall for prolonged periods of
    time, possibly indefinitely. This is due to throttle_direct_reclaim
    thinking the pfmemalloc reserves are depleted when in fact they don't
    exist on that node.

    On a NUMA machine running a 32-bit kernel (I know) allocation requests
    from CPUs on node 1 would detect no pfmemalloc reserves and the process
    gets throttled. This patch adjusts throttling of direct reclaim to
    throttle based on the first node in the zonelist that has a usable
    ZONE_NORMAL or lower zone.

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Mel Gorman
     
  • commit c177c81e09e517bbf75b67762cdab1b83aba6976 upstream.

    Currently hugepage migration is available for all archs which support
    pmd-level hugepages, but it has been tested only on x86_64 and there
    are bugs on other archs. So to avoid breaking such archs, this patch
    limits the availability strictly to x86_64 until developers of other
    archs get interested in enabling this feature.

    Simply disabling hugepage migration on non-x86_64 archs is not enough to
    fix the reported problem where sys_move_pages() hits the BUG_ON() in
    follow_page(FOLL_GET), so let's fix this by checking if hugepage
    migration is supported in vma_migratable().
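
    A sketch of the vma_migratable() part of the fix (illustrative; the
    other pre-existing flag checks are omitted):

        static inline int vma_migratable(struct vm_area_struct *vma)
        {
        #ifndef CONFIG_ARCH_ENABLE_HUGEPAGE_MIGRATION
            /* Refuse to migrate hugetlb mappings on unsupported archs */
            if (vma->vm_flags & VM_HUGETLB)
                return 0;
        #endif
            /* ... existing VM_IO/VM_PFNMAP etc. checks ... */
            return 1;
        }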

    Signed-off-by: Naoya Horiguchi
    Reported-by: Michael Ellerman
    Tested-by: Michael Ellerman
    Acked-by: Hugh Dickins
    Cc: Benjamin Herrenschmidt
    Cc: Tony Luck
    Cc: Russell King
    Cc: Martin Schwidefsky
    Cc: James Hogan
    Cc: Ralf Baechle
    Cc: David Miller
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Naoya Horiguchi
     

27 Jun, 2014

2 commits

  • commit 03787301420376ae41fbaf4267f4a6253d152ac5 upstream.

    Commit b1cb0982bdd6 ("change the management method of free objects of
    the slab") introduced a bug in the slab leak detector
    ('/proc/slab_allocators'). This detector works as follows:

    1. traverse all objects on all the slabs.
    2. determine whether each object is active or not.
    3. if active, print who allocated this object.

    But that commit changed the way free objects are managed, so the logic
    determining whether an object is active also changed. Before, we
    regarded objects in cpu caches as inactive, but with this commit we
    mistakenly regard objects in cpu caches as active.

    This introduces a kernel oops if DEBUG_PAGEALLOC is enabled. With
    DEBUG_PAGEALLOC, kernel_map_pages() is used to detect who corrupts free
    memory in the slab: it unmaps the page table mapping if an object is
    free and maps it if the object is active. When the slab leak detector
    checks an object in the cpu caches, it mistakenly thinks the object is
    active and so tries to access the object's memory to retrieve the
    caller of the allocation. At this point the page table mapping to this
    object doesn't exist, so an oops occurs.

    The following is the oops message reported by Dave.

    It blew up when something tried to read /proc/slab_allocators
    (just cat it, and you should see the oops below).

    Oops: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
    Modules linked in:
    [snip...]
    CPU: 1 PID: 9386 Comm: trinity-c33 Not tainted 3.14.0-rc5+ #131
    task: ffff8801aa46e890 ti: ffff880076924000 task.ti: ffff880076924000
    RIP: 0010:[] [] handle_slab+0x8a/0x180
    RSP: 0018:ffff880076925de0 EFLAGS: 00010002
    RAX: 0000000000001000 RBX: 0000000000000000 RCX: 000000005ce85ce7
    RDX: ffffea00079be100 RSI: 0000000000001000 RDI: ffff880107458000
    RBP: ffff880076925e18 R08: 0000000000000001 R09: 0000000000000000
    R10: 0000000000000000 R11: 000000000000000f R12: ffff8801e6f84000
    R13: ffffea00079be100 R14: ffff880107458000 R15: ffff88022bb8d2c0
    FS: 00007fb769e45740(0000) GS:ffff88024d040000(0000) knlGS:0000000000000000
    CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: ffff8801e6f84ff8 CR3: 00000000a22db000 CR4: 00000000001407e0
    DR0: 0000000002695000 DR1: 0000000002695000 DR2: 0000000000000000
    DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000070602
    Call Trace:
    leaks_show+0xce/0x240
    seq_read+0x28e/0x490
    proc_reg_read+0x3d/0x80
    vfs_read+0x9b/0x160
    SyS_read+0x58/0xb0
    tracesys+0xd4/0xd9
    Code: f5 00 00 00 0f 1f 44 00 00 48 63 c8 44 3b 0c 8a 0f 84 e3 00 00 00 83 c0 01 44 39 c0 72 eb 41 f6 47 1a 01 0f 84 e9 00 00 00 89 f0 8b 4c 04 f8 4d 85 c9 0f 84 88 00 00 00 49 8b 7e 08 4d 8d 46
    RIP handle_slab+0x8a/0x180

    To fix the problem, I introduce an object status buffer on each slab.
    With this, we can track object status precisely, so the slab leak
    detector will no longer access an object that is actually free, and no
    kernel oops will occur. The memory overhead caused by this fix is only
    imposed on CONFIG_DEBUG_SLAB_LEAK, which is mainly used for debugging,
    so the overhead isn't a big problem.

    Signed-off-by: Joonsoo Kim
    Reported-by: Dave Jones
    Reported-by: Tetsuo Handa
    Reviewed-by: Vladimir Davydov
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Joonsoo Kim
     
  • commit 13ace4d0d9db40e10ecd66dfda14e297571be813 upstream.

    I was well aware of FALLOC_FL_ZERO_RANGE and FALLOC_FL_COLLAPSE_RANGE
    support being added to fallocate(); but didn't realize until now that I
    had been too stupid to future-proof shmem_fallocate() against new
    additions. Fail with -EOPNOTSUPP instead of going on to ordinary
    fallocation.
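
    A sketch of the guard at the top of shmem_fallocate() (illustrative):

        /* Reject any fallocate mode bits this filesystem doesn't implement */
        if (mode & ~(FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE))
            return -EOPNOTSUPP;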

    Signed-off-by: Hugh Dickins
    Reviewed-by: Lukas Czerner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Hugh Dickins
     

07 Jun, 2014

1 commit

    The page table walker doesn't check for non-present hugetlb entries in
    the common path, so hugetlb_entry() callbacks must check for them
    themselves. The reason for this behavior is that some callers want to
    handle such entries in their own way.

    [ I think that reason is bogus, btw - it should just do what the regular
    code does, which is to call the "pte_hole()" function for such hugetlb
    entries - Linus]

    However, some callers don't check it now, which causes unpredictable
    results, for example when we have a race between migrating a hugepage
    and reading /proc/pid/numa_maps. This patch fixes it by adding
    !pte_present checks on the buggy callbacks.
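
    A sketch of the kind of check added to an affected hugetlb_entry()
    callback (illustrative; the numa_maps walker is used as the example):

        pte_t huge_pte = huge_ptep_get(pte);

        /* Skip migration entries and other non-present hugetlb entries */
        if (!pte_present(huge_pte))
            return 0;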

    This bug has existed for years and became visible with the introduction
    of hugepage migration.

    ChangeLog v2:
    - fix if condition (check !pte_present() instead of pte_present())

    Reported-by: Sasha Levin
    Signed-off-by: Naoya Horiguchi
    Cc: Rik van Riel
    Cc: [3.12+]
    Signed-off-by: Andrew Morton
    [ Backported to 3.15. Signed-off-by: Josh Boyer ]
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi
     

06 Jun, 2014

1 commit

    While working on the address sanitizer for the kernel I've discovered a
    use-after-free bug in __put_anon_vma.

    For the last anon_vma, anon_vma->root is freed before the child
    anon_vma. Later, in anon_vma_free(anon_vma), we reference the already
    freed anon_vma->root to check its rwsem.

    This fixes it by freeing the child anon_vma before freeing
    anon_vma->root.
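
    A sketch of the corrected ordering in __put_anon_vma() (illustrative):

        void __put_anon_vma(struct anon_vma *anon_vma)
        {
            struct anon_vma *root = anon_vma->root;

            /* Free the child first: anon_vma_free() still looks at ->root */
            anon_vma_free(anon_vma);
            if (root != anon_vma && atomic_dec_and_test(&root->refcount))
                anon_vma_free(root);
        }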

    Signed-off-by: Andrey Ryabinin
    Acked-by: Peter Zijlstra
    Cc: # v3.0+
    Signed-off-by: Linus Torvalds

    Andrey Ryabinin
     

24 May, 2014

5 commits

    When a memory error happens on an in-use page or a (free or in-use)
    hugepage, the victim page is isolated with its refcount set to one.

    When you try to unpoison it later, unpoison_memory() calls put_page()
    for it twice in order to bring the page back to the free page pool
    (buddy or free hugepage list). However, if another memory error occurs
    on the page which we are unpoisoning, memory_failure() returns without
    releasing the refcount which was incremented in the same call at first,
    which results in a memory leak and inconsistent num_poisoned_pages
    statistics. This patch fixes it.

    Signed-off-by: Naoya Horiguchi
    Cc: Andi Kleen
    Cc: [2.6.32+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi
     
  • Commit 284f39afeaa4 ("mm: memcg: push !mm handling out to page cache
    charge function") explicitly checks for page cache charges without any
    mm context (from kernel thread context[1]).

    This seemed to be the only possible case where memory could be charged
    without mm context so commit 03583f1a631c ("memcg: remove unnecessary
    !mm check from try_get_mem_cgroup_from_mm()") removed the mm check from
    get_mem_cgroup_from_mm(). This however caused another NULL ptr
    dereference during early boot when a loopback kernel thread splices to
    tmpfs, as reported by Stephan Kulow:

    BUG: unable to handle kernel NULL pointer dereference at 0000000000000360
    IP: get_mem_cgroup_from_mm.isra.42+0x2b/0x60
    Oops: 0000 [#1] SMP
    Modules linked in: btrfs dm_multipath dm_mod scsi_dh multipath raid10 raid456 async_raid6_recov async_memcpy async_pq raid6_pq async_xor xor async_tx raid1 raid0 md_mod parport_pc parport nls_utf8 isofs usb_storage iscsi_ibft iscsi_boot_sysfs arc4 ecb fan thermal nfs lockd fscache nls_iso8859_1 nls_cp437 sg st hid_generic usbhid af_packet sunrpc sr_mod cdrom ata_generic uhci_hcd virtio_net virtio_blk ehci_hcd usbcore ata_piix floppy processor button usb_common virtio_pci virtio_ring virtio edd squashfs loop ppa]
    CPU: 0 PID: 97 Comm: loop1 Not tainted 3.15.0-rc5-5-default #1
    Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
    Call Trace:
    __mem_cgroup_try_charge_swapin+0x40/0xe0
    mem_cgroup_charge_file+0x8b/0xd0
    shmem_getpage_gfp+0x66b/0x7b0
    shmem_file_splice_read+0x18f/0x430
    splice_direct_to_actor+0xa2/0x1c0
    do_lo_receive+0x5a/0x60 [loop]
    loop_thread+0x298/0x720 [loop]
    kthread+0xc6/0xe0
    ret_from_fork+0x7c/0xb0

    Also, Branimir Maksimovic reported the following oops, which is
    triggered for the swapcache charge path from the accounting code for
    kernel threads:

    CPU: 1 PID: 160 Comm: kworker/u8:5 Tainted: P OE 3.15.0-rc5-core2-custom #159
    Hardware name: System manufacturer System Product Name/MAXIMUSV GENE, BIOS 1903 08/19/2013
    task: ffff880404e349b0 ti: ffff88040486a000 task.ti: ffff88040486a000
    RIP: get_mem_cgroup_from_mm.isra.42+0x2b/0x60
    Call Trace:
    __mem_cgroup_try_charge_swapin+0x45/0xf0
    mem_cgroup_charge_file+0x9c/0xe0
    shmem_getpage_gfp+0x62c/0x770
    shmem_write_begin+0x38/0x40
    generic_perform_write+0xc5/0x1c0
    __generic_file_aio_write+0x1d1/0x3f0
    generic_file_aio_write+0x4f/0xc0
    do_sync_write+0x5a/0x90
    do_acct_process+0x4b1/0x550
    acct_process+0x6d/0xa0
    do_exit+0x827/0xa70
    kthread+0xc3/0xf0

    This patch fixes the issue by reintroducing the mm check into
    get_mem_cgroup_from_mm. We could do the same trick in
    __mem_cgroup_try_charge_swapin as we do for the regular page cache path,
    but it is not worth the trouble. The check is not that expensive and it
    is better to make get_mem_cgroup_from_mm more robust.

    [1] - http://marc.info/?l=linux-mm&m=139463617808941&w=2

    Fixes: 03583f1a631c ("memcg: remove unnecessary !mm check from try_get_mem_cgroup_from_mm()")
    Reported-and-tested-by: Stephan Kulow
    Reported-by: Branimir Maksimovic
    Signed-off-by: Michal Hocko
    Acked-by: Johannes Weiner
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • MADV_WILLNEED currently does not read swapped out shmem pages back in.

    Commit 0cd6144aadd2 ("mm + fs: prepare for non-page entries in page
    cache radix trees") made find_get_page() filter exceptional radix tree
    entries but failed to convert all find_get_page() callers that WANT
    exceptional entries over to find_get_entry(). One of them is shmem swap
    readahead in madvise, which now skips over any swap-out records.

    Convert it to find_get_entry().

    Fixes: 0cd6144aadd2 ("mm + fs: prepare for non-page entries in page cache radix trees")
    Signed-off-by: Johannes Weiner
    Reported-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • In some testing I ran today (some fio jobs that spread over two nodes),
    we end up spending 40% of the time in filemap_check_errors(). That
    smells fishy. Looking further, this is basically what happens:

    blkdev_aio_read()
      generic_file_aio_read()
        filemap_write_and_wait_range()
          if (!mapping->nr_pages)
            filemap_check_errors()

    and filemap_check_errors() always performs two test_and_clear_bit()
    operations on the mapping flags, thus dirtying that cacheline for every
    single invocation. The patch below tests each of these bits before
    clearing them, avoiding this issue. In my test case (4-socket box),
    performance went from 1.7M IOPS to 4.0M IOPS.
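
    A sketch of the test-before-clear pattern (close to what the change
    does):

        int filemap_check_errors(struct address_space *mapping)
        {
            int ret = 0;

            /* test_bit() first so the common error-free path stays read-only */
            if (test_bit(AS_ENOSPC, &mapping->flags) &&
                test_and_clear_bit(AS_ENOSPC, &mapping->flags))
                ret = -ENOSPC;
            if (test_bit(AS_EIO, &mapping->flags) &&
                test_and_clear_bit(AS_EIO, &mapping->flags))
                ret = -EIO;
            return ret;
        }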

    Signed-off-by: Jens Axboe
    Acked-by: Jeff Moyer
    Cc: Al Viro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jens Axboe
     
    When handling a free hugepage in memory failure, a race can happen if
    another thread hwpoisons this hugepage concurrently. So we need to
    check PageHWPoison instead of !PageHWPoison.

    If hwpoison_filter(p) returns true or a race happens, then we need to
    unlock_page(hpage).

    Signed-off-by: Chen Yucong
    Reviewed-by: Naoya Horiguchi
    Tested-by: Naoya Horiguchi
    Reviewed-by: Andi Kleen
    Cc: [2.6.36+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Chen Yucong
     

20 May, 2014

1 commit

  • Pull Metag architecture and related fixes from James Hogan:
    "Mostly fixes for metag and parisc relating to upgrowing stacks.

    - Fix missing compiler barriers in metag memory barriers.
    - Fix BUG_ON on metag when RLIMIT_STACK hard limit is increased
    beyond safe value.
    - Make maximum stack size configurable. This reduces the default
    user stack size back to 80MB (especially on parisc after their
    removal of _STK_LIM_MAX override). This only affects metag and
    parisc.
    - Remove metag _STK_LIM_MAX override to match other arches and follow
    parisc, now that it is safe to do so (due to the BUG_ON fix
    mentioned above).
    - Finally now that both metag and parisc _STK_LIM_MAX overrides have
    been removed, it makes sense to remove _STK_LIM_MAX altogether"

    * tag 'metag-for-v3.15-2' of git://git.kernel.org/pub/scm/linux/kernel/git/jhogan/metag:
    asm-generic: remove _STK_LIM_MAX
    metag: Remove _STK_LIM_MAX override
    parisc,metag: Do not hardcode maximum userspace stack size
    metag: Reduce maximum stack size to 256MB
    metag: fix memory barriers

    Linus Torvalds
     

15 May, 2014

1 commit

  • This patch affects only architectures where the stack grows upwards
    (currently parisc and metag only). On those do not hardcode the maximum
    initial stack size to 1GB for 32-bit processes, but make it configurable
    via a config option.

    The main problem with the hardcoded stack size is that we have two
    memory regions which grow upwards: stack and heap. To keep most of the
    memory available for heap in a flexmap memory layout, it makes no sense
    to hard-allocate up to 1GB of the memory for the stack, which then
    can't be used as heap.

    This patch makes the stack size for 32-bit processes configurable and
    uses 80MB as the default value, which has been in use during the last
    few years on parisc and which hasn't shown any problems yet.

    Signed-off-by: Helge Deller
    Signed-off-by: James Hogan
    Cc: "James E.J. Bottomley"
    Cc: linux-parisc@vger.kernel.org
    Cc: linux-metag@vger.kernel.org
    Cc: John David Anglin

    Helge Deller
     

13 May, 2014

1 commit

  • Pull a percpu fix from Tejun Heo:
    "Fix for a percpu allocator bug where it could try to kfree() a memory
    region allocated using vmalloc(). The bug has been there for years
    now and is unlikely to have ever triggered given the size of struct
    pcpu_chunk. It's still theoretically possible and the fix is simple
    and safe enough, so the patch is marked with -stable"

    * 'for-3.15-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu:
    percpu: make pcpu_alloc_chunk() use pcpu_mem_free() instead of kfree()

    Linus Torvalds
     

11 May, 2014

2 commits

    It's critical for split_huge_page() (and migration) to catch and freeze
    all PMDs on the rmap walk. It gets tricky if there's concurrent fork()
    or mremap() since usually we copy/move page table entries on dup_mm()
    or move_page_tables() without the rmap lock taken. To make this work we
    rely on the rmap walk order to not miss any entry: we expect to see the
    destination VMA after the source one.

    But after switching the rmap implementation to an interval tree it's
    not always possible to preserve the expected walk order.

    It works fine for dup_mm() since the new VMA has the same
    vma_start_pgoff() / vma_last_pgoff() and we explicitly insert the dst
    VMA after the src one with vma_interval_tree_insert_after().

    But on move_vma() the destination VMA can be merged into an adjacent
    one and, as a result, shifted left in the interval tree. Fortunately,
    we can detect this situation and prevent the race with the rmap walk by
    moving the page table entries under the rmap lock. See commit
    38a76013ad80.

    The problem is that we miss the lock when we move a transhuge PMD.
    Most likely this bug caused the crash [1].

    [1] http://thread.gmane.org/gmane.linux.kernel.mm/96473

    Fixes: 108d6642ad81 ("mm anon rmap: remove anon_vma_moveto_tail")

    Signed-off-by: Kirill A. Shutemov
    Reviewed-by: Andrea Arcangeli
    Cc: Rik van Riel
    Acked-by: Michel Lespinasse
    Cc: Dave Jones
    Cc: David Miller
    Acked-by: Johannes Weiner
    Cc: [3.7+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • Commit 8910ae896c8c ("kmemleak: change some global variables to int"),
    in addition to the atomic -> int conversion, moved the disabling of
    kmemleak_early_log to the beginning of the kmemleak_init() function,
    before the full kmemleak tracing is actually enabled. In this small
    window, kmem_cache_create() is called by kmemleak, which triggers
    additional memory allocations that are not traced. This patch restores
    the original logic with kmemleak_early_log disabling when kmemleak is
    fully functional.

    Fixes: 8910ae896c8c (kmemleak: change some global variables to int)

    Signed-off-by: Catalin Marinas
    Cc: Sasha Levin
    Cc: Li Zefan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Catalin Marinas
     

07 May, 2014

10 commits

  • Merge misc fixes from Andrew Morton:
    "13 fixes"

    * emailed patches from Andrew Morton :
    agp: info leak in agpioc_info_wrap()
    fs/affs/super.c: bugfix / double free
    fanotify: fix -EOVERFLOW with large files on 64-bit
    slub: use sysfs'es release mechanism for kmem_cache
    revert "mm: vmscan: do not swap anon pages just because free+file is low"
    autofs: fix lockref lookup
    mm: filemap: update find_get_pages_tag() to deal with shadow entries
    mm/compaction: make isolate_freepages start at pageblock boundary
    MAINTAINERS: zswap/zbud: change maintainer email address
    mm/page-writeback.c: fix divide by zero in pos_ratio_polynom
    hugetlb: ensure hugepage access is denied if hugepages are not supported
    slub: fix memcg_propagate_slab_attrs
    drivers/rtc/rtc-pcf8523.c: fix month definition

    Linus Torvalds
     
  • debugobjects warning during netfilter exit:

    ------------[ cut here ]------------
    WARNING: CPU: 6 PID: 4178 at lib/debugobjects.c:260 debug_print_object+0x8d/0xb0()
    ODEBUG: free active (active state 0) object type: timer_list hint: delayed_work_timer_fn+0x0/0x20
    Modules linked in:
    CPU: 6 PID: 4178 Comm: kworker/u16:2 Tainted: G W 3.11.0-next-20130906-sasha #3984
    Workqueue: netns cleanup_net
    Call Trace:
    dump_stack+0x52/0x87
    warn_slowpath_common+0x8c/0xc0
    warn_slowpath_fmt+0x46/0x50
    debug_print_object+0x8d/0xb0
    __debug_check_no_obj_freed+0xa5/0x220
    debug_check_no_obj_freed+0x15/0x20
    kmem_cache_free+0x197/0x340
    kmem_cache_destroy+0x86/0xe0
    nf_conntrack_cleanup_net_list+0x131/0x170
    nf_conntrack_pernet_exit+0x5d/0x70
    ops_exit_list+0x5e/0x70
    cleanup_net+0xfb/0x1c0
    process_one_work+0x338/0x550
    worker_thread+0x215/0x350
    kthread+0xe7/0xf0
    ret_from_fork+0x7c/0xb0

    Also during dcookie cleanup:

    WARNING: CPU: 12 PID: 9725 at lib/debugobjects.c:260 debug_print_object+0x8c/0xb0()
    ODEBUG: free active (active state 0) object type: timer_list hint: delayed_work_timer_fn+0x0/0x20
    Modules linked in:
    CPU: 12 PID: 9725 Comm: trinity-c141 Not tainted 3.15.0-rc2-next-20140423-sasha-00018-gc4ff6c4 #408
    Call Trace:
    dump_stack (lib/dump_stack.c:52)
    warn_slowpath_common (kernel/panic.c:430)
    warn_slowpath_fmt (kernel/panic.c:445)
    debug_print_object (lib/debugobjects.c:262)
    __debug_check_no_obj_freed (lib/debugobjects.c:697)
    debug_check_no_obj_freed (lib/debugobjects.c:726)
    kmem_cache_free (mm/slub.c:2689 mm/slub.c:2717)
    kmem_cache_destroy (mm/slab_common.c:363)
    dcookie_unregister (fs/dcookies.c:302 fs/dcookies.c:343)
    event_buffer_release (arch/x86/oprofile/../../../drivers/oprofile/event_buffer.c:153)
    __fput (fs/file_table.c:217)
    ____fput (fs/file_table.c:253)
    task_work_run (kernel/task_work.c:125 (discriminator 1))
    do_notify_resume (include/linux/tracehook.h:196 arch/x86/kernel/signal.c:751)
    int_signal (arch/x86/kernel/entry_64.S:807)

    Sysfs has a release mechanism. Use that to release the kmem_cache
    structure if CONFIG_SYSFS is enabled.

    Only slub is changed - slab currently only supports /proc/slabinfo and
    not /sys/kernel/slab/*. We talked about adding that and someone was
    working on it.

    [akpm@linux-foundation.org: fix CONFIG_SYSFS=n build]
    [akpm@linux-foundation.org: fix CONFIG_SYSFS=n build even more]
    Signed-off-by: Christoph Lameter
    Reported-by: Sasha Levin
    Tested-by: Sasha Levin
    Acked-by: Greg KH
    Cc: Thomas Gleixner
    Cc: Pekka Enberg
    Cc: Russell King
    Cc: Bart Van Assche
    Cc: Al Viro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • This reverts commit 0bf1457f0cfc ("mm: vmscan: do not swap anon pages
    just because free+file is low") because it introduced a regression in
    mostly-anonymous workloads, where reclaim would become ineffective and
    trap every allocating task in direct reclaim.

    The problem is that there is a runaway feedback loop in the scan balance
    between file and anon, where the balance tips heavily towards a tiny
    thrashing file LRU and anonymous pages are no longer being looked at.
    The commit in question removed the safeguard that would detect such
    situations and respond with forced anonymous reclaim.

    This commit was part of a series to fix premature swapping in loads with
    relatively little cache, and while it made a small difference, the cure
    is obviously worse than the disease. Revert it.

    Signed-off-by: Johannes Weiner
    Reported-by: Christian Borntraeger
    Acked-by: Christian Borntraeger
    Acked-by: Rafael Aquini
    Cc: Rik van Riel
    Cc: [3.12+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • Dave Jones reports the following crash when find_get_pages_tag() runs
    into an exceptional entry:

    kernel BUG at mm/filemap.c:1347!
    RIP: find_get_pages_tag+0x1cb/0x220
    Call Trace:
    find_get_pages_tag+0x36/0x220
    pagevec_lookup_tag+0x21/0x30
    filemap_fdatawait_range+0xbe/0x1e0
    filemap_fdatawait+0x27/0x30
    sync_inodes_sb+0x204/0x2a0
    sync_inodes_one_sb+0x19/0x20
    iterate_supers+0xb2/0x110
    sys_sync+0x44/0xb0
    ia32_do_call+0x13/0x13

    1343 /*
    1344 * This function is never used on a shmem/tmpfs
    1345 * mapping, so a swap entry won't be found here.
    1346 */
    1347 BUG();

    After commit 0cd6144aadd2 ("mm + fs: prepare for non-page entries in
    page cache radix trees") this comment and BUG() are out of date because
    exceptional entries can now appear in all mappings - as shadows of
    recently evicted pages.

    However, as Hugh Dickins notes,

    "it is truly surprising for a PAGECACHE_TAG_WRITEBACK (and probably
    any other PAGECACHE_TAG_*) to appear on an exceptional entry.

    I expect it comes down to an occasional race in RCU lookup of the
    radix_tree: lacking absolute synchronization, we might sometimes
    catch an exceptional entry, with the tag which really belongs with
    the unexceptional entry which was there an instant before."

    And indeed, not only is the tree walk lockless, the tags are also read
    in chunks, one radix tree node at a time. There is plenty of time for
    page reclaim to swoop in and replace a page that was already looked up
    as tagged with a shadow entry.

    Remove the BUG() and update the comment. While reviewing all other
    lookup sites for whether they properly deal with shadow entries of
    evicted pages, update all the comments and fix memcg file charge moving
    to not miss shmem/tmpfs swapcache pages.
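
    A sketch of how the tag lookup now treats such entries (illustrative):

        if (radix_tree_exception(page)) {
            if (radix_tree_deref_retry(page))
                goto restart;
            /*
             * A shadow entry of a recently evicted page: the tag was
             * observed in a racy lockless walk, so just skip over it.
             */
            continue;
        }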

    Fixes: 0cd6144aadd2 ("mm + fs: prepare for non-page entries in page cache radix trees")
    Signed-off-by: Johannes Weiner
    Reported-by: Dave Jones
    Acked-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • The compaction freepage scanner implementation in isolate_freepages()
    starts by taking the current cc->free_pfn value as the first pfn. In a
    for loop, it scans from this first pfn to the end of the pageblock, and
    then subtracts pageblock_nr_pages from the first pfn to obtain the first
    pfn for the next for loop iteration.

    This means that when cc->free_pfn starts at offset X rather than being
    aligned on a pageblock boundary, the scanner will start at offset X in
    every scanned pageblock, ignoring potentially many free pages. Currently
    this can happen when

    a) zone's end pfn is not pageblock aligned, or

    b) through zone->compact_cached_free_pfn with CONFIG_HOLES_IN_ZONE
    enabled and a hole spanning the beginning of a pageblock

    This patch fixes the problem by aligning the initial pfn in
    isolate_freepages() to pageblock boundary. This also permits replacing
    the end-of-pageblock alignment within the for loop with a simple
    pageblock_nr_pages increment.
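
    The core of the fix is just rounding the starting pfn down (a sketch):

        /* Start scanning at the pageblock containing cc->free_pfn */
        pfn = cc->free_pfn & ~(pageblock_nr_pages - 1);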

    Signed-off-by: Vlastimil Babka
    Reported-by: Heesub Shin
    Acked-by: Minchan Kim
    Cc: Mel Gorman
    Acked-by: Joonsoo Kim
    Cc: Bartlomiej Zolnierkiewicz
    Cc: Michal Nazarewicz
    Cc: Naoya Horiguchi
    Cc: Christoph Lameter
    Acked-by: Rik van Riel
    Cc: Dongjun Shin
    Cc: Sunghwan Yun
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     
  • It is possible for "limit - setpoint + 1" to equal zero, after getting
    truncated to a 32 bit variable, and resulting in a divide by zero error.

    Using the fully 64 bit divide functions avoids this problem. It also
    will cause pos_ratio_polynom() to return the correct value when
    (setpoint - limit) exceeds 2^32.

    Also uninline pos_ratio_polynom, at Andrew's request.

    Signed-off-by: Rik van Riel
    Reviewed-by: Michal Hocko
    Cc: Aneesh Kumar K.V
    Cc: Mel Gorman
    Cc: Nishanth Aravamudan
    Cc: Luiz Capitulino
    Cc: Masayoshi Mizuma
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rik van Riel
     
    Currently, I am seeing the following when I `mount -t hugetlbfs /none
    /dev/hugetlbfs`, and then simply do a `ls /dev/hugetlbfs`. I think it's
    related to the fact that hugetlbfs is probably not correctly setting
    itself up in this state?:

    Unable to handle kernel paging request for data at address 0x00000031
    Faulting instruction address: 0xc000000000245710
    Oops: Kernel access of bad area, sig: 11 [#1]
    SMP NR_CPUS=2048 NUMA pSeries
    ....

    In KVM guests on Power, in a guest not backed by hugepages, we see the
    following:

    AnonHugePages: 0 kB
    HugePages_Total: 0
    HugePages_Free: 0
    HugePages_Rsvd: 0
    HugePages_Surp: 0
    Hugepagesize: 64 kB

    HPAGE_SHIFT == 0 in this configuration, which indicates that hugepages
    are not supported at boot-time, but this is only checked in
    hugetlb_init(). Extract the check to a helper function, and use it in a
    few relevant places.

    This does make hugetlbfs not supported (not registered at all) in this
    environment. I believe this is fine, as there are no valid hugepages
    and that won't change at runtime.
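
    The helper is essentially a sketch like:

        /* HPAGE_SHIFT == 0 means hugepages are unsupported in this config */
        static inline bool hugepages_supported(void)
        {
            return HPAGE_SHIFT != 0;
        }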

    [akpm@linux-foundation.org: use pr_info(), per Mel]
    [akpm@linux-foundation.org: fix build when HPAGE_SHIFT is undefined]
    Signed-off-by: Nishanth Aravamudan
    Reviewed-by: Aneesh Kumar K.V
    Acked-by: Mel Gorman
    Cc: Randy Dunlap
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nishanth Aravamudan
     
  • After creating a cache for a memcg we should initialize its sysfs attrs
    with the values from its parent. That's what memcg_propagate_slab_attrs
    is for. Currently it's broken - we clearly muddled root-vs-memcg caches
    there. Let's fix it up.

    Signed-off-by: Vladimir Davydov
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: Michal Hocko
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     
  • Pull vfs fixes from Al Viro:
    "dcache fixes + kvfree() (uninlined, exported by mm/util.c) + posix_acl
    bugfix from hch"

    The dcache fixes are for a subtle LRU list corruption bug reported by
    Miklos Szeredi, where people inside IBM saw list corruptions with the
    LTP/host01 test.

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
    nick kvfree() from apparmor
    posix_acl: handle NULL ACL in posix_acl_equiv_mode
    dcache: don't need rcu in shrink_dentry_list()
    more graceful recovery in umount_collect()
    don't remove from shrink list in select_collect()
    dentry_kill(): don't try to remove from shrink list
    expand the call of dentry_lru_del() in dentry_kill()
    new helper: dentry_free()
    fold try_prune_one_dentry()
    fold d_kill() and d_free()
    fix races between __d_instantiate() and checks of dentry flags

    Linus Torvalds
     
  • too many places open-code it

    Signed-off-by: Al Viro

    Al Viro
     

06 May, 2014

2 commits

  • If freelist_idx_t is a byte, SLAB_OBJ_MAX_NUM should be 255 not 256, and
    likewise if freelist_idx_t is a short, then it should be 65535 not
    65536.

    This was leading to all kinds of random crashes on sparc64 where
    PAGE_SIZE is 8192. One problem shown was that if spinlock debugging was
    enabled, we'd get deadlocks in copy_pte_range() or do_wp_page() with the
    same cpu already holding a lock it shouldn't hold, or the lock belonging
    to a completely unrelated process.
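
    The off-by-one fix amounts to something like (a sketch):

        /* Largest index a freelist_idx_t can hold: 255 for 1 byte, 65535 for 2 */
        #define SLAB_OBJ_MAX_NUM \
            ((1 << sizeof(freelist_idx_t) * BITS_PER_BYTE) - 1)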

    Fixes: a41adfaa23df ("slab: introduce byte sized index for the freelist of a slab")
    Signed-off-by: David S. Miller
    Signed-off-by: Linus Torvalds

    David Miller
     
  • Commit a41adfaa23df ("slab: introduce byte sized index for the freelist
    of a slab") changes the size of freelist index and also changes
    prototype of accessor function to freelist index. And there was a
    mistake.

    The mistake is that although it changes the size of the freelist index
    correctly, it changes the size of the index into the freelist
    incorrectly. With the patch, a freelist index can be 1 byte or 2 bytes,
    which means that the number of objects on a slab can be more than 255.
    So we need more than 1 byte for the index used to find a free object on
    the freelist. But the above patch makes this index type 1 byte, so
    slabs which have more than 255 objects cannot work properly and, in
    consequence, the system cannot boot.

    This issue was reported by Steven King on m68knommu, which would use a
    2-byte freelist index:

    https://lkml.org/lkml/2014/4/16/433

    The fix is easy: changing the type of the index parameter in the
    accessor functions is enough. Although 2 bytes would be enough, I use
    4 bytes since it has no bad effect and makes things easier. This fix
    was suggested and tested by Steven in his original report.

    Signed-off-by: Joonsoo Kim
    Reported-and-acked-by: Steven King
    Acked-by: Christoph Lameter
    Tested-by: James Hogan
    Tested-by: David Miller
    Cc: Pekka Enberg
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     

29 Apr, 2014

1 commit

  • BUG_ON() is a big hammer, and should be used _only_ if there is some
    major corruption that you cannot possibly recover from, making it
    imperative that the current process (and possibly the whole machine) be
    terminated with extreme prejudice.

    The trivial sanity check in the vmacache code is *not* such a fatal
    error. Recovering from it is absolutely trivial, and using BUG_ON()
    just makes it harder to debug for no actual advantage.

    To make matters worse, the placement of the BUG_ON() (only if the range
    check matched) actually makes it harder to hit the sanity check to begin
    with, so _if_ there is a bug (and we just got a report from Srivatsa
    Bhat that this can indeed trigger), it is harder to debug not just
    because the machine is possibly dead, but because we don't have better
    coverage.

    BUG_ON() must *die*. Maybe we should add a checkpatch warning for it,
    because it is simply just about the worst thing you can ever do if you
    hit some "this cannot happen" situation.

    Reported-by: Srivatsa S. Bhat
    Cc: Davidlohr Bueso
    Cc: Andrew Morton
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

26 Apr, 2014

1 commit

  • The mmu-gather operation 'tlb_flush_mmu()' has done two things: the
    actual tlb flush operation, and the batched freeing of the pages that
    the TLB entries pointed at.

    This splits the operation into separate phases, so that the forced
    batched flushing done by zap_pte_range() can now do the actual TLB flush
    while still holding the page table lock, but delay the batched freeing
    of all the pages to after the lock has been dropped.

    This in turn allows us to avoid a race condition between
    set_page_dirty() (as called by zap_pte_range() when it finds a dirty
    shared memory pte) and page_mkclean(): because we now flush all the
    dirty page data from the TLB's while holding the pte lock,
    page_mkclean() will be held up walking the (recently cleaned) page
    tables until after the TLB entries have been flushed from all CPU's.

    Reported-by: Benjamin Herrenschmidt
    Tested-by: Dave Hansen
    Acked-by: Hugh Dickins
    Cc: Peter Zijlstra
    Cc: Russell King - ARM Linux
    Cc: Tony Luck
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

23 Apr, 2014

1 commit

  • fixup_user_fault() is used by the futex code when the direct user access
    fails, and the futex code wants it to either map in the page in a usable
    form or return an error. It relied on handle_mm_fault() to map the
    page, and correctly checked the error return from that, but while that
    does map the page, it doesn't actually guarantee that the page will be
    mapped with sufficient permissions to be then accessed.

    So do the appropriate tests of the vma access rights by hand.
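
    A sketch of the added access check in fixup_user_fault() (illustrative):

        vma = find_extend_vma(mm, address);
        if (!vma || address < vma->vm_start)
            return -EFAULT;

        /* Require the access type the caller (e.g. futex) actually needs */
        vm_flags = (fault_flags & FAULT_FLAG_WRITE) ? VM_WRITE : VM_READ;
        if (!(vm_flags & vma->vm_flags))
            return -EFAULT;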

    [ Side note: arguably handle_mm_fault() could just do that itself, but
    we have traditionally done it in the caller, because some callers -
    notably get_user_pages() - have been able to access pages even when
    they are mapped with PROT_NONE. Maybe we should re-visit that design
    decision, but in the meantime this is the minimal patch. ]

    Found by Dave Jones running his trinity tool.

    Reported-by: Dave Jones
    Acked-by: Hugh Dickins
    Cc: stable@vger.kernel.org
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

19 Apr, 2014

1 commit

    Sasha Levin has reported two THP BUGs[1][2]. I believe both of them
    have the same root cause. Let's look at them one by one.

    The first bug[1] is "kernel BUG at mm/huge_memory.c:1829!". It's
    BUG_ON(mapcount != page_mapcount(page)) in __split_huge_page(). From my
    testing I see that page_mapcount() is higher than mapcount here.

    I think it happens due to a race between zap_huge_pmd() and
    page_check_address_pmd(). page_check_address_pmd() misses the PMD which
    is under zap:

    CPU0                                CPU1
                                        zap_huge_pmd()
                                          pmdp_get_and_clear()
    __split_huge_page()
      anon_vma_interval_tree_foreach()
        __split_huge_page_splitting()
          page_check_address_pmd()
            mm_find_pmd()
              /*
               * We check if PMD present without taking ptl: no
               * serialization against zap_huge_pmd(). We miss this PMD,
               * it's not accounted to 'mapcount' in __split_huge_page().
               */
              pmd_present(pmd) == 0

      BUG_ON(mapcount != page_mapcount(page)) // CRASH!!!

                                        page_remove_rmap(page)
                                          atomic_add_negative(-1, &page->_mapcount)

    The second bug[2] is "kernel BUG at mm/huge_memory.c:1371!".
    It's VM_BUG_ON_PAGE(!PageHead(page), page) in zap_huge_pmd().

    This happens in similar way:

    CPU0                                CPU1
                                        zap_huge_pmd()
                                          pmdp_get_and_clear()
                                          page_remove_rmap(page)
                                            atomic_add_negative(-1, &page->_mapcount)
    __split_huge_page()
      anon_vma_interval_tree_foreach()
        __split_huge_page_splitting()
          page_check_address_pmd()
            mm_find_pmd()
              pmd_present(pmd) == 0 /* The same comment as above */
      /*
       * No crash this time since we already decremented page->_mapcount in
       * zap_huge_pmd().
       */
      BUG_ON(mapcount != page_mapcount(page))

      /*
       * We split the compound page here into small pages without
       * serialization against zap_huge_pmd()
       */
      __split_huge_page_refcount()
        VM_BUG_ON_PAGE(!PageHead(page), page); // CRASH!!!

    So my understanding is that the problem is the pmd_present() check in
    mm_find_pmd() without taking the page table lock.

    The bug was introduced by my commit 117b0791ac42. Sorry for that. :(

    Let's open-code mm_find_pmd() in page_check_address_pmd() and do the
    check under the page table lock.

    Note that __page_check_address() does the same for PTE entries
    if sync != 0.

    I've stress tested the split and zap code paths for 36+ hours by now
    and don't see crashes with the patch applied. Before the patch it took
    far less time to trigger.

    [2] https://lkml.kernel.org/g/

    Signed-off-by: Kirill A. Shutemov
    Reported-by: Sasha Levin
    Tested-by: Sasha Levin
    Cc: Bob Liu
    Cc: Andrea Arcangeli
    Cc: Rik van Riel
    Cc: Mel Gorman
    Cc: Michel Lespinasse
    Cc: Dave Jones
    Cc: Vlastimil Babka
    Cc: [3.13+]

    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov