04 Mar, 2014

5 commits

  • Jan Stancek reports manual page migration encountering allocation
    failures after some pages when there is still plenty of memory free, and
    bisected the problem down to commit 81c0a2bb515f ("mm: page_alloc: fair
    zone allocator policy").

    The problem is that GFP_THISNODE obeys the zone fairness allocation
    batches on one hand, but doesn't reset them and wake kswapd on the other
    hand. After a few of those allocations, the batches are exhausted and
    the allocations fail.

    Fixing this means either having GFP_THISNODE wake up kswapd, or
    GFP_THISNODE not participating in zone fairness at all. The latter
    seems safer as an acute bugfix; we can clean up later.
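
    A minimal sketch of that second option, assuming the 3.12-era
    __alloc_pages_nodemask()/ALLOC_FAIR structure (flag handling shown from
    memory, so treat the details as illustrative):

    int alloc_flags = ALLOC_WMARK_LOW | ALLOC_CPUSET;

    /*
     * GFP_THISNODE (__GFP_THISNODE | __GFP_NORETRY | __GFP_NOWARN)
     * callers never wake kswapd, so they must not consume the per-zone
     * fairness batches either -- otherwise the batches are drained but
     * never reset, and the allocations start failing.
     */
    if ((gfp_mask & GFP_THISNODE) != GFP_THISNODE)
            alloc_flags |= ALLOC_FAIR;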

    Reported-by: Jan Stancek
    Signed-off-by: Johannes Weiner
    Acked-by: Rik van Riel
    Acked-by: Mel Gorman
    Cc: [3.12+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • Daniel Borkmann reported a VM_BUG_ON assertion failing:

    ------------[ cut here ]------------
    kernel BUG at mm/mlock.c:528!
    invalid opcode: 0000 [#1] SMP
    Modules linked in: ccm arc4 iwldvm [...]
    video
    CPU: 3 PID: 2266 Comm: netsniff-ng Not tainted 3.14.0-rc2+ #8
    Hardware name: LENOVO 2429BP3/2429BP3, BIOS G4ET37WW (1.12 ) 05/29/2012
    task: ffff8801f87f9820 ti: ffff88002cb44000 task.ti: ffff88002cb44000
    RIP: 0010:[] [] munlock_vma_pages_range+0x2e0/0x2f0
    Call Trace:
    do_munmap+0x18f/0x3b0
    vm_munmap+0x41/0x60
    SyS_munmap+0x22/0x30
    system_call_fastpath+0x1a/0x1f
    RIP munlock_vma_pages_range+0x2e0/0x2f0
    ---[ end trace a0088dcf07ae10f2 ]---

    because munlock_vma_pages_range() thinks it's unexpectedly in the middle
    of a THP page. This can be reproduced with the default config on 3.11+
    kernels. A reproducer can be found in the kernel's selftest directory
    for networking [1] by running ./psock_tpacket.

    The problem is that an order=2 compound page (allocated by
    alloc_one_pg_vec_page()) is part of the munlocked VM_MIXEDMAP vma (mapped
    by packet_mmap()) and is mistaken for a THP page, assumed to be order=9.

    The checks for THP in munlock came with commit ff6a6da60b89 ("mm:
    accelerate munlock() treatment of THP pages"), i.e. since 3.9, but did
    not trigger a bug. It just makes munlock_vma_pages_range() skip such
    compound pages until the next 512-pages-aligned page, when it encounters
    a head page. This is however not a problem for vma's where mlocking has
    no effect anyway, but it can distort the accounting.

    Since commit 7225522bb429 ("mm: munlock: batch non-THP page isolation
    and munlock+putback using pagevec") this can trigger a VM_BUG_ON in the
    PageTransHuge() check.

    This patch fixes the issue by adding VM_MIXEDMAP flag to VM_SPECIAL, a
    list of flags that make vma's non-mlockable and non-mergeable. The
    reasoning is that VM_MIXEDMAP vma's are similar to VM_PFNMAP, which is
    already on the VM_SPECIAL list, and both are intended for non-LRU pages
    where mlocking makes no sense anyway. Related LKML discussion can be
    found in [2].

    [1] tools/testing/selftests/net/psock_tpacket
    [2] https://lkml.org/lkml/2014/1/10/427
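
    The core of the fix is a one-line addition to the VM_SPECIAL definition
    in include/linux/mm.h; the sketch below is written from memory rather
    than copied from the patch:

    /* flags that make a vma non-mergeable and exempt from mlocking */
    #define VM_SPECIAL (VM_IO | VM_DONTEXPAND | VM_PFNMAP | VM_MIXEDMAP)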

    Signed-off-by: Vlastimil Babka
    Signed-off-by: Daniel Borkmann
    Reported-by: Daniel Borkmann
    Tested-by: Daniel Borkmann
    Cc: Thomas Hellstrom
    Cc: John David Anglin
    Cc: HATAYAMA Daisuke
    Cc: Konstantin Khlebnikov
    Cc: Carsten Otte
    Cc: Jared Hulbert
    Tested-by: Hannes Frederic Sowa
    Cc: Kirill A. Shutemov
    Acked-by: Rik van Riel
    Cc: Andrea Arcangeli
    Cc: [3.11.x+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     
  • Sometimes the cleanup after memcg hierarchy testing gets stuck in
    mem_cgroup_reparent_charges(), unable to bring non-kmem usage down to 0.

    There may turn out to be several causes, but a major cause is this: the
    workitem to offline the parent can get run before the workitem to offline
    the child; the parent's mem_cgroup_reparent_charges() circles around
    waiting for the child's pages to be reparented to its lrus, but it is
    holding cgroup_mutex, which prevents the child from reaching its own
    mem_cgroup_reparent_charges().

    Further testing showed that an ordered workqueue for cgroup_destroy_wq
    is not always good enough: percpu_ref_kill_and_confirm's call_rcu_sched
    stage on the way can mess up the order before reaching the workqueue.

    Instead, when offlining a memcg, call mem_cgroup_reparent_charges() on
    all its children (and grandchildren, in the correct order) to have their
    charges reparented first.
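
    A sketch of that idea, assuming the 3.14-era css descendant iterators
    (the surrounding code is illustrative, not the exact mm/memcontrol.c
    diff):

    static void mem_cgroup_css_offline(struct cgroup_subsys_state *css)
    {
            struct mem_cgroup *memcg = mem_cgroup_from_css(css);
            struct cgroup_subsys_state *iter;

            /*
             * Reparent the deepest descendants first, finishing with
             * @css itself, so the parent never spins waiting for a
             * child whose own offline work cannot run.
             */
            css_for_each_descendant_post(iter, css)
                    mem_cgroup_reparent_charges(mem_cgroup_from_css(iter));
    }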

    Fixes: e5fca243abae ("cgroup: use a dedicated workqueue for cgroup destruction")
    Signed-off-by: Filipe Brandenburger
    Signed-off-by: Hugh Dickins
    Reviewed-by: Tejun Heo
    Acked-by: Michal Hocko
    Cc: Johannes Weiner
    Cc: [v3.10+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Filipe Brandenburger
     
  • Commit 0eef615665ed ("memcg: fix css reference leak and endless loop in
    mem_cgroup_iter") got the interaction with commit d8ad30559715 ("mm/memcg:
    iteration skip memcgs not yet fully initialized"), which landed a few
    commits earlier, slightly wrong, and we didn't notice at the time.

    It's elusive, and harder to get than the original, but for a couple of
    days before rc1, I several times saw an endless loop similar to that
    supposedly being fixed.

    This time it was a tighter loop in __mem_cgroup_iter_next(): we can get
    here when our root has already been offlined, and the ordering of
    conditions was such that we then just cycled around forever.

    Fixes: 0eef615665ed ("memcg: fix css reference leak and endless loop in mem_cgroup_iter").
    Signed-off-by: Hugh Dickins
    Acked-by: Michal Hocko
    Cc: Johannes Weiner
    Cc: Greg Thelen
    Cc: [3.12+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Commit bf6bddf1924e ("mm: introduce compaction and migration for
    ballooned pages") introduces page_count(page) into memory compaction
    which dereferences page->first_page if PageTail(page).

    This results in a very rare NULL pointer dereference on the
    aforementioned page_count(page). Indeed, anything that does
    compound_head(), including page_count() is susceptible to racing with
    prep_compound_page() and seeing a NULL or dangling page->first_page
    pointer.

    This patch uses Andrea's implementation of compound_trans_head() that
    deals with such a race and makes it the default compound_head()
    implementation. This includes a read memory barrier that ensures that,
    if PageTail(page) is still true, we return a head page that is neither
    NULL nor dangling. The patch then adds a store memory barrier to
    prep_compound_page() to ensure page->first_page is set.

    This is the safest way to ensure we see the head page that we are
    expecting; PageTail(page) is already in the unlikely() path, and the
    memory barriers are unfortunately required.

    Hugetlbfs is the exception: we don't enforce a store memory barrier
    during init since no race is possible.
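
    A sketch of the barrier pairing being described, modelled on Andrea's
    compound_trans_head() (simplified and written from memory, so treat it
    as illustrative):

    static inline struct page *compound_head(struct page *page)
    {
            if (unlikely(PageTail(page))) {
                    struct page *head = page->first_page;

                    /*
                     * Pairs with the smp_wmb() in prep_compound_page():
                     * if PageTail() is still set after reading first_page,
                     * the head we read is neither NULL nor dangling.
                     */
                    smp_rmb();
                    if (likely(PageTail(page)))
                            return head;
            }
            return page;
    }

    /* ...and in prep_compound_page(), for each tail page: */
            p->first_page = page;
            smp_wmb();      /* publish first_page before the tail flag */
            __SetPageTail(p);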

    Signed-off-by: David Rientjes
    Cc: Holger Kiehl
    Cc: Christoph Lameter
    Cc: Rafael Aquini
    Cc: Vlastimil Babka
    Cc: Michal Hocko
    Cc: Mel Gorman
    Cc: Andrea Arcangeli
    Cc: Rik van Riel
    Cc: "Kirill A. Shutemov"
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     

26 Feb, 2014

3 commits

  • Kirill has reported the following:

    Task in /test killed as a result of limit of /test
    memory: usage 10240kB, limit 10240kB, failcnt 51
    memory+swap: usage 10240kB, limit 10240kB, failcnt 0
    kmem: usage 0kB, limit 18014398509481983kB, failcnt 0
    Memory cgroup stats for /test:

    BUG: sleeping function called from invalid context at kernel/cpu.c:68
    in_atomic(): 1, irqs_disabled(): 0, pid: 66, name: memcg_test
    2 locks held by memcg_test/66:
    #0: (memcg_oom_lock#2){+.+...}, at: [] pagefault_out_of_memory+0x14/0x90
    #1: (oom_info_lock){+.+...}, at: [] mem_cgroup_print_oom_info+0x2a/0x390
    CPU: 2 PID: 66 Comm: memcg_test Not tainted 3.14.0-rc1-dirty #745
    Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS Bochs 01/01/2011
    Call Trace:
    __might_sleep+0x16a/0x210
    get_online_cpus+0x1c/0x60
    mem_cgroup_read_stat+0x27/0xb0
    mem_cgroup_print_oom_info+0x260/0x390
    dump_header+0x88/0x251
    ? trace_hardirqs_on+0xd/0x10
    oom_kill_process+0x258/0x3d0
    mem_cgroup_oom_synchronize+0x656/0x6c0
    ? mem_cgroup_charge_common+0xd0/0xd0
    pagefault_out_of_memory+0x14/0x90
    mm_fault_error+0x91/0x189
    __do_page_fault+0x48e/0x580
    do_page_fault+0xe/0x10
    page_fault+0x22/0x30

    The complaint is that mem_cgroup_read_stat() may sleep (it calls
    get_online_cpus()) and so cannot be called from atomic context, but
    mem_cgroup_print_oom_info() holds a spinlock while calling it. Change
    oom_info_lock to a mutex.

    This was introduced by 947b3dd1a84b ("memcg, oom: lock
    mem_cgroup_print_oom_info").
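
    A minimal sketch of the change (lock name as given in the changelog;
    surrounding code shown from memory):

    /* was: static DEFINE_SPINLOCK(oom_info_lock); */
    static DEFINE_MUTEX(oom_info_lock);

    void mem_cgroup_print_oom_info(struct mem_cgroup *memcg, struct task_struct *p)
    {
            mutex_lock(&oom_info_lock);     /* was spin_lock() */
            /* ... dump per-memcg stats; mem_cgroup_read_stat() may sleep ... */
            mutex_unlock(&oom_info_lock);   /* was spin_unlock() */
    }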

    Signed-off-by: Michal Hocko
    Reported-by: "Kirill A. Shutemov"
    Cc: Johannes Weiner
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • Masayoshi Mizuma reported an application hanging under the memcg limit.
    It happens on a write-protection fault to the huge zero page.

    If we successfully allocate a huge page to replace zero page but hit the
    memcg limit we need to split the zero page with split_huge_page_pmd()
    and fallback to small pages.

    The other part of the problem is that VM_FAULT_OOM has special meaning
    in the do_huge_pmd_wp_page() context. __handle_mm_fault() expects the
    page to be split if it sees VM_FAULT_OOM and it will retry page fault
    handling. This causes an infinite loop if the page was not split.

    do_huge_pmd_wp_zero_page_fallback() can return VM_FAULT_OOM if it failed
    to allocate one small page, so fallback to small pages will not help.

    The solution for this part is to replace VM_FAULT_OOM with
    VM_FAULT_FALLBACK if a fallback is required.
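
    A sketch of the caller-side contract this relies on, assuming the
    3.14-era THP write-protect path in __handle_mm_fault() (illustrative
    rather than the literal diff):

    if (pmd_trans_huge(orig_pmd)) {
            /* ... other THP handling elided ... */
            ret = do_huge_pmd_wp_page(mm, vma, address, pmd, orig_pmd);
            /*
             * VM_FAULT_FALLBACK means "retry with small pages"; treating
             * VM_FAULT_OOM the same way looped forever when the zero-page
             * pmd had not actually been split.
             */
            if (!(ret & VM_FAULT_FALLBACK))
                    return ret;
            /* fall through to pte-level fault handling */
    }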

    Signed-off-by: Kirill A. Shutemov
    Reported-by: Masayoshi Mizuma
    Reviewed-by: Michal Hocko
    Cc: Johannes Weiner
    Cc: Andrea Arcangeli
    Cc: David Rientjes
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • It seems we forget to release the page after detecting a HW error.

    Signed-off-by: Kirill A. Shutemov
    Cc: Mel Gorman
    Cc: Rik van Riel
    Cc: Andi Kleen
    Cc: Matthew Wilcox
    Cc: Dave Hansen
    Cc: Naoya Horiguchi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     

21 Feb, 2014

1 commit

  • Pull cgroup fixes from Tejun Heo:
    "Quite a few fixes this time.

    Three locking fixes, all marked for -stable. A couple error path
    fixes and some misc fixes. Hugh found a bug in memcg offlining
    sequence and we thought we could fix that from cgroup core side but
    that turned out to be insufficient and got reverted. A different fix
    has been applied to -mm"

    * 'for-3.14-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
    cgroup: update cgroup_enable_task_cg_lists() to grab siglock
    Revert "cgroup: use an ordered workqueue for cgroup destruction"
    cgroup: protect modifications to cgroup_idr with cgroup_mutex
    cgroup: fix locking in cgroup_cfts_commit()
    cgroup: fix error return from cgroup_create()
    cgroup: fix error return value in cgroup_mount()
    cgroup: use an ordered workqueue for cgroup destruction
    nfs: include xattr.h from fs/nfs/nfs3proc.c
    cpuset: update MAINTAINERS entry
    arm, pm, vmpressure: add missing slab.h includes

    Linus Torvalds
     

18 Feb, 2014

1 commit

  • Pull powerpc fixes from Ben Herrenschmidt:
    "Here are some more powerpc fixes for 3.14

    The main one is a nasty issue with the NUMA balancing support which
    requires a small generic change and the addition of a new accessor to
    set _PAGE_NUMA. Both have been reviewed and acked by Mel and Rik.

    The changelog should have plenty of details but basically, without
    this fix, we get random user segfaults and/or corruptions due to
    missing TLB/hash flushes. Aneesh series of 3 patches fixes it.

    We have some vDSO vs. perf fixes from Anton, some small EEH fixes
    from Gavin, a ppc32 regression vs the stack overflow detector, and a
    fix for the way we handle PCIe host bridge speed settings on pseries
    (which is needed for proper operations of AMD graphics cards on
    Power8)"

    * 'merge' of git://git.kernel.org/pub/scm/linux/kernel/git/benh/powerpc:
    powerpc/eeh: Disable EEH on reboot
    powerpc/eeh: Cleanup on eeh_subsystem_enabled
    powerpc/powernv: Rework EEH reset
    powerpc: Use unstripped VDSO image for more accurate profiling data
    powerpc: Link VDSOs at 0x0
    mm: Use ptep/pmdp_set_numa() for updating _PAGE_NUMA bit
    mm: Dirty accountable change only apply to non prot numa case
    powerpc/mm: Add new "set" flag argument to pte/pmd update function
    powerpc/pseries: Add Gen3 definitions for PCIE link speed
    powerpc/pseries: Fix regression on PCI link speed
    powerpc: Set the correct ksp_limit on ppc32 when switching to irq stack

    Linus Torvalds
     

17 Feb, 2014

2 commits

  • Archs like ppc64 don't do a TLB flush in the set_pte/pmd functions when
    using a hash table MMU, for various reasons (the flush is handled as part
    of the PTE modification when necessary).

    ppc64 thus doesn't implement flush_tlb_range for hash based MMUs.

    Additionally, ppc64 requires the TLB flushing to be batched within ptl locks.

    The reason to do that is to ensure that the hash page table is in sync with
    the Linux page table.

    We track the hpte index in the Linux pte, and if we clear it without flushing
    the hash and drop the ptl lock, another cpu can update the pte and we can end
    up with a duplicate entry in the hash table, which is fatal.

    We also want to keep set_pte_at() simple by not requiring it to do a hash
    flush, for performance reasons. We do that by assuming that set_pte_at() is
    never *ever* called on a PTE that is already valid.

    This was the case until the NUMA code went in which broke that assumption.

    Fix that by introducing a new pair of helpers to set _PAGE_NUMA in a
    way similar to ptep/pmdp_set_wrprotect(), with a generic implementation
    using set_pte_at() and a powerpc specific one using the appropriate
    mechanism needed to keep the hash table in sync.
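
    A sketch of the generic helper (the powerpc variant instead uses its
    pte_update()-style primitive so the hash table stays in sync); details
    written from memory, so treat them as illustrative:

    static inline void ptep_set_numa(struct mm_struct *mm, unsigned long addr,
                                     pte_t *ptep)
    {
            pte_t ptent = *ptep;

            ptent = pte_mknuma(ptent);
            set_pte_at(mm, addr, ptep, ptent);
    }

    /* pmdp_set_numa() is the same pattern using pmd_mknuma()/set_pmd_at() */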

    Acked-by: Mel Gorman
    Reviewed-by: Rik van Riel
    Signed-off-by: Aneesh Kumar K.V
    Signed-off-by: Benjamin Herrenschmidt

    Aneesh Kumar K.V
     
  • So move it within the if loop

    Acked-by: Mel Gorman
    Reviewed-by: Rik van Riel
    Signed-off-by: Aneesh Kumar K.V
    Signed-off-by: Benjamin Herrenschmidt

    Aneesh Kumar K.V
     

11 Feb, 2014

3 commits

  • mce-test detected a test failure when injecting an error into a THP tail
    page. This is because we take the page refcount of the tail page in
    madvise_hwpoison() while the fix in commit a3e0f9e47d5e
    ("mm/memory-failure.c: transfer page count from head page to tail page
    after split thp") assumes that we always take refcount on the head page.

    When a real memory error happens we take refcount on the head page where
    memory_failure() is called without MF_COUNT_INCREASED set, so it seems
    to me that testing memory error on thp tail page using madvise makes
    little sense.

    This patch keeps the refcount transfer only for the !MF_COUNT_INCREASED
    case, which is where such testing is valid.

    [akpm@linux-foundation.org: s/&&/&/]
    Signed-off-by: Naoya Horiguchi
    Cc: Andi Kleen
    Cc: Wanpeng Li
    Cc: Chen Gong
    Cc: [3.9+: a3e0f9e47d5e]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi
     
  • Vladimir reported the following issue:

    Commit c65c1877bd68 ("slub: use lockdep_assert_held") requires
    remove_partial() to be called with n->list_lock held, but free_partial()
    called from kmem_cache_close() on cache destruction does not follow this
    rule, leading to a warning:

    WARNING: CPU: 0 PID: 2787 at mm/slub.c:1536 __kmem_cache_shutdown+0x1b2/0x1f0()
    Modules linked in:
    CPU: 0 PID: 2787 Comm: modprobe Tainted: G W 3.14.0-rc1-mm1+ #1
    Hardware name:
    0000000000000600 ffff88003ae1dde8 ffffffff816d9583 0000000000000600
    0000000000000000 ffff88003ae1de28 ffffffff8107c107 0000000000000000
    ffff880037ab2b00 ffff88007c240d30 ffffea0001ee5280 ffffea0001ee52a0
    Call Trace:
    __kmem_cache_shutdown+0x1b2/0x1f0
    kmem_cache_destroy+0x43/0xf0
    xfs_destroy_zones+0x103/0x110 [xfs]
    exit_xfs_fs+0x38/0x4e4 [xfs]
    SyS_delete_module+0x19a/0x1f0
    system_call_fastpath+0x16/0x1b

    His solution was to add a spinlock in order to quiet lockdep. Although
    there would be no contention on the added lock, it also requires
    disabling interrupts, which will have a larger impact on the system.

    Instead of adding a spinlock to a location where it is not needed for
    lockdep, make a __remove_partial() function that does not test whether
    the list_lock is held, as no one can hold it while the cache is being
    freed.

    Also add a __add_partial() function that does not do the lock
    validation either, as it is not needed during creation of the cache.
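
    A sketch of the resulting split (mm/slub.c shapes from memory, so
    illustrative): the __-prefixed helpers do the list work, while the
    locked wrappers keep the lockdep assertion for all normal callers.

    static inline void __remove_partial(struct kmem_cache_node *n,
                                        struct page *page)
    {
            list_del(&page->lru);
            n->nr_partial--;
    }

    static void remove_partial(struct kmem_cache_node *n, struct page *page)
    {
            lockdep_assert_held(&n->list_lock);
            __remove_partial(n, page);
    }

    /*
     * free_partial() during cache destruction calls __remove_partial()
     * directly: nobody else can reach the dying cache, so no lock is taken.
     */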

    Signed-off-by: Steven Rostedt
    Reported-by: Vladimir Davydov
    Suggested-by: David Rientjes
    Acked-by: David Rientjes
    Acked-by: Vladimir Davydov
    Acked-by: Christoph Lameter
    Cc: Pekka Enberg
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Steven Rostedt
     
  • Commit c65c1877bd68 ("slub: use lockdep_assert_held") incorrectly
    required that add_full() and remove_full() hold n->list_lock. The lock
    is only taken when kmem_cache_debug(s), since that's the only time it
    actually does anything.

    Require that the lock only be taken under such a condition.
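
    A sketch of the resulting shape (from memory, illustrative): the full
    list is only maintained for debug caches, so the assertion is only
    reached when that path is actually taken.

    static void add_full(struct kmem_cache *s,
                         struct kmem_cache_node *n, struct page *page)
    {
            if (!(s->flags & SLAB_STORE_USER))
                    return;

            lockdep_assert_held(&n->list_lock);
            list_add(&page->lru, &n->full);
    }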

    Reported-by: Larry Finger
    Tested-by: Larry Finger
    Tested-by: Paul E. McKenney
    Acked-by: Christoph Lameter
    Cc: Pekka Enberg
    Signed-off-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     

10 Feb, 2014

2 commits

  • Pull vfs fixes from Al Viro:
    "A couple of fixes, both -stable fodder. The O_SYNC bug is fairly
    old..."

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
    fix a kmap leak in virtio_console
    fix O_SYNC|O_APPEND syncing the wrong range on write()

    Linus Torvalds
     
  • It actually goes back to 2004 ([PATCH] Concurrent O_SYNC write support)
    when sync_page_range() had been introduced; generic_file_write{,v}()
    correctly synced

        pos_after_write - written .. pos_after_write - 1

    but generic_file_aio_write() synced

        pos_before_write .. pos_before_write + written - 1

    instead. Which is not the same thing with O_APPEND, obviously. A couple
    of years later the correct variant was killed off when everything
    switched to use of generic_file_aio_write().

    All users of generic_file_aio_write() are affected, and the same bug
    has been copied into other instances of ->aio_write().

    The fix is trivial; the only subtle point is that generic_write_sync()
    ought to be inlined to avoid calculations useless for the majority of
    calls.
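
    A conceptual sketch of the corrected range calculation (not the literal
    fs patch); vfs_fsync_range() takes an inclusive byte range:

    /* after a successful write of 'written' bytes ending at pos_after_write */
    if (written > 0 &&
        ((file->f_flags & O_DSYNC) || IS_SYNC(file->f_mapping->host)))
            err = vfs_fsync_range(file,
                                  pos_after_write - written, /* start */
                                  pos_after_write - 1,       /* end, inclusive */
                                  (file->f_flags & __O_SYNC) ? 0 : 1);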

    Signed-off-by: Al Viro

    Al Viro
     

09 Feb, 2014

1 commit

  • Pull x86 fixes from Peter Anvin:
    "Quite a varied little collection of fixes. Most of them are
    relatively small or isolated; the biggest one is Mel Gorman's fixes
    for TLB range flushing.

    A couple of AMD-related fixes (including not crashing when given an
    invalid microcode image) and fix a crash when compiled with gcov"

    * 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    x86, microcode, AMD: Unify valid container checks
    x86, hweight: Fix BUG when booting with CONFIG_GCOV_PROFILE_ALL=y
    x86/efi: Allow mapping BGRT on x86-32
    x86: Fix the initialization of physnode_map
    x86, cpu hotplug: Fix stack frame warning in check_irq_vectors_for_cpu_disable()
    x86/intel/mid: Fix X86_INTEL_MID dependencies
    arch/x86/mm/srat: Skip NUMA_NO_NODE while parsing SLIT
    mm, x86: Revisit tlb_flushall_shift tuning for page flushes except on IvyBridge
    x86: mm: change tlb_flushall_shift for IvyBridge
    x86/mm: Eliminate redundant page table walk during TLB range flushing
    x86/mm: Clean up inconsistencies when flushing TLB ranges
    mm, x86: Account for TLB flushes only when debugging
    x86/AMD/NB: Fix amd_set_subcaches() parameter type
    x86/quirks: Add workaround for AMD F16h Erratum792
    x86, doc, kconfig: Fix dud URL for Microcode data

    Linus Torvalds
     

08 Feb, 2014

1 commit


07 Feb, 2014

3 commits

  • During an aio stress test, we observed the following lockdep warning.
    This means AIO+numa_balancing is currently deadlockable.

    The problem is that aio_migratepage() disables interrupts, but
    __set_page_dirty_nobuffers() unintentionally enables them again.

    Generally, all such helper functions should use spin_lock_irqsave()
    instead of spin_lock_irq() because they don't know their caller's
    context at all.

    other info that might help us debug this:
    Possible unsafe locking scenario:

    CPU0
    ----
    lock(&(&ctx->completion_lock)->rlock);
    <Interrupt>
      lock(&(&ctx->completion_lock)->rlock);

    *** DEADLOCK ***

    dump_stack+0x19/0x1b
    print_usage_bug+0x1f7/0x208
    mark_lock+0x21d/0x2a0
    mark_held_locks+0xb9/0x140
    trace_hardirqs_on_caller+0x105/0x1d0
    trace_hardirqs_on+0xd/0x10
    _raw_spin_unlock_irq+0x2c/0x50
    __set_page_dirty_nobuffers+0x8c/0xf0
    migrate_page_copy+0x434/0x540
    aio_migratepage+0xb1/0x140
    move_to_new_page+0x7d/0x230
    migrate_pages+0x5e5/0x700
    migrate_misplaced_page+0xbc/0xf0
    do_numa_page+0x102/0x190
    handle_pte_fault+0x241/0x970
    handle_mm_fault+0x265/0x370
    __do_page_fault+0x172/0x5a0
    do_page_fault+0x1a/0x70
    page_fault+0x28/0x30
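
    A sketch of the fix pattern in __set_page_dirty_nobuffers()
    (mapping->tree_lock as in 3.14; illustrative):

    unsigned long flags;

    /* preserve the caller's IRQ state instead of blindly re-enabling IRQs */
    spin_lock_irqsave(&mapping->tree_lock, flags);
    /* ... update radix-tree tags and dirty accounting ... */
    spin_unlock_irqrestore(&mapping->tree_lock, flags);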

    Signed-off-by: KOSAKI Motohiro
    Cc: Larry Woodman
    Cc: Rik van Riel
    Cc: Johannes Weiner
    Acked-by: David Rientjes
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     
  • swapoff clears swap_info's SWP_USED flag prematurely and frees its
    resources after that. A concurrent swapon will reuse this swap_info
    while its previous resources are not completely cleared.

    These late freed resources are:
    - p->percpu_cluster
    - swap_cgroup_ctrl[type]
    - block_device setting
    - inode->i_flags &= ~S_SWAPFILE

    This patch clears the SWP_USED flag after all its resources are freed,
    so that swapon can reuse this swap_info by alloc_swap_info() safely.

    [akpm@linux-foundation.org: tidy up code comment]
    Signed-off-by: Weijie Yang
    Acked-by: Hugh Dickins
    Cc: Krzysztof Kozlowski
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Weijie Yang
     
  • This is a patch to improve swap readahead algorithm. It's from Hugh and
    I slightly changed it.

    Hugh's original changelog:

    swapin readahead does a blind readahead, whether or not the swapin is
    sequential. This may be ok on harddisk, because large reads have
    relatively small costs, and if the readahead pages are unneeded they can
    be reclaimed easily - though, what if their allocation forced reclaim of
    useful pages? But on SSD devices large reads are more expensive than
    small ones: if the readahead pages are unneeded, reading them in caused
    significant overhead.

    This patch adds very simplistic random read detection. Stealing the
    PageReadahead technique from Konstantin Khlebnikov's patch, avoiding the
    vma/anon_vma sophistications of Shaohua Li's patch, swapin_nr_pages()
    simply looks at readahead's current success rate, and narrows or widens
    its readahead window accordingly. There is little science to its
    heuristic: it's about as stupid as can be whilst remaining effective.

    The table below shows elapsed times (in centiseconds) when running a
    single repetitive swapping load across a 1000MB mapping in 900MB ram
    with 1GB swap (the harddisk tests had taken painfully too long when I
    used mem=500M, but SSD shows similar results for that).

    Vanilla is the 3.6-rc7 kernel on which I started; Shaohua denotes his
    Sep 3 patch in mmotm and linux-next; HughOld denotes my Oct 1 patch
    which Shaohua showed to be defective; HughNew this Nov 14 patch, with
    page_cluster as usual at default of 3 (8-page reads); HughPC4 this same
    patch with page_cluster 4 (16-page reads); HughPC0 with page_cluster 0
    (1-page reads: no readahead).

    HDD for swapping to harddisk, SSD for swapping to VertexII SSD. Seq for
    sequential access to the mapping, cycling five times around; Rand for
    the same number of random touches. Anon for a MAP_PRIVATE anon mapping;
    Shmem for a MAP_SHARED anon mapping, equivalent to tmpfs.

    One weakness of Shaohua's vma/anon_vma approach was that it did not
    optimize Shmem: seen below. Konstantin's approach was perhaps mistuned,
    50% slower on Seq: did not compete and is not shown below.

    HDD         Vanilla  Shaohua  HughOld  HughNew  HughPC4  HughPC0
    Seq Anon      73921    76210    75611    76904    78191   121542
    Seq Shmem     73601    73176    73855    72947    74543   118322
    Rand Anon    895392   831243   871569   845197   846496   841680
    Rand Shmem  1058375  1053486   827935   764955   764376   756489

    SSD         Vanilla  Shaohua  HughOld  HughNew  HughPC4  HughPC0
    Seq Anon      24634    24198    24673    25107    21614    70018
    Seq Shmem     24959    24932    25052    25703    22030    69678
    Rand Anon     43014    26146    28075    25989    26935    25901
    Rand Shmem    45349    45215    28249    24268    24138    24332

    These tests are, of course, two extremes of a very simple case: under
    heavier mixed loads I've not yet observed any consistent improvement or
    degradation, and wider testing would be welcome.

    Shaohua Li:

    Testing shows Vanilla is slightly better in the sequential workload than
    Hugh's patch. I observed that with Hugh's patch the readahead size is
    sometimes shrunk too fast (from 8 to 1 immediately) in the sequential
    workload if there is no hit. And in such a case, continuing to do
    readahead is actually good.

    I didn't prepare a sophisticated algorithm for the sequential workload
    because so far we can't guarantee that sequentially accessed pages are
    swapped out sequentially. So I slightly changed Hugh's heuristic - don't
    shrink the readahead size too fast.
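
    A sketch of the resulting heuristic, assuming a swapin_readahead_hits
    counter that is bumped whenever a read-ahead page is actually faulted
    in (simplified from memory; details illustrative):

    static unsigned long swapin_nr_pages(unsigned long offset)
    {
            static unsigned long prev_offset;
            static atomic_t last_readahead_pages;
            unsigned int pages, max_pages, last_ra;

            max_pages = 1 << ACCESS_ONCE(page_cluster);
            if (max_pages <= 1)
                    return 1;

            /* size the window from how many readahead pages were hit */
            pages = atomic_xchg(&swapin_readahead_hits, 0) + 2;
            if (pages == 2) {
                    /* no hits to judge by: only read ahead if sequential */
                    if (offset != prev_offset + 1 && offset != prev_offset - 1)
                            pages = 1;
                    prev_offset = offset;
            } else {
                    unsigned int roundup = 4;

                    while (roundup < pages)
                            roundup <<= 1;
                    pages = roundup;
            }
            if (pages > max_pages)
                    pages = max_pages;

            /* don't shrink the readahead size too fast */
            last_ra = atomic_read(&last_readahead_pages) / 2;
            if (pages < last_ra)
                    pages = last_ra;
            atomic_set(&last_readahead_pages, pages);

            return pages;
    }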

    Here is my test result (unit second, 3 runs average):
              Vanilla   Hugh    New
    Seq           356    370    360
    Random       4525   2447   2444

    Attached graph is the swapin/swapout throughput I collected with 'vmstat
    2'. The first part is running a random workload (till around 1200 of
    the x-axis) and the second part is running a sequential workload.
    swapin and swapout throughput are almost identical in steady state in
    both workloads, which is the expected behavior, while in Vanilla swapin
    is much bigger than swapout, especially in the random workload (because
    of the wrong readahead).

    Original patches by: Shaohua Li and Konstantin Khlebnikov.

    [fengguang.wu@intel.com: swapin_nr_pages() can be static]
    Signed-off-by: Hugh Dickins
    Signed-off-by: Shaohua Li
    Signed-off-by: Fengguang Wu
    Cc: Rik van Riel
    Cc: Wu Fengguang
    Cc: Minchan Kim
    Cc: Konstantin Khlebnikov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Shaohua Li
     

04 Feb, 2014

1 commit

  • arch/arm/mach-tegra/pm.c, kernel/power/console.c and mm/vmpressure.c
    were somehow getting slab.h indirectly through cgroup.h which in turn
    was getting it indirectly through xattr.h. A scheduled cgroup change
    drops xattr.h inclusion from cgroup.h and breaks compilation of these
    three files. Add explicit slab.h includes to the three files.

    A pending cgroup patch depends on this change and it'd be great if
    this can be routed through cgroup/for-3.14-fixes branch.

    Signed-off-by: Tejun Heo
    Acked-by: Stephen Warren
    Cc: Thierry Reding
    Cc: linux-tegra@vger.kernel.org
    Cc: "Rafael J. Wysocki"
    Cc: linux-pm@vger.kernel.org
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Balbir Singh
    Cc: KAMEZAWA Hiroyuki
    Cc: cgroups@vger.kernel.org

    Tejun Heo
     

03 Feb, 2014

1 commit

  • Pull SLAB changes from Pekka Enberg:
    "Random bug fixes that have accumulated in my inbox over the past few
    months"

    * 'slab/next' of git://git.kernel.org/pub/scm/linux/kernel/git/penberg/linux:
    mm: Fix warning on make htmldocs caused by slab.c
    mm: slub: work around unneeded lockdep warning
    mm: sl[uo]b: fix misleading comments
    slub: Fix possible format string bug.
    slub: use lockdep_assert_held
    slub: Fix calculation of cpu slabs
    slab.h: remove duplicate kmalloc declaration and fix kernel-doc warnings

    Linus Torvalds
     

31 Jan, 2014

9 commits

  • This patch fixes the following warnings from make htmldocs:
    Warning(/mm/slab.c:1956): No description found for parameter 'page'
    Warning(/mm/slab.c:1956): Excess function parameter 'slabp' description in 'slab_destroy'

    The incorrect function parameter "slabp" was documented instead of "page".

    Acked-by: Christoph Lameter
    Signed-off-by: Masanari Iida
    Signed-off-by: Pekka Enberg

    Masanari Iida
     
  • The slub code does some setup during early boot in
    early_kmem_cache_node_alloc() with some local data. There is no
    possible way that another CPU can see this data, so the slub code
    doesn't unnecessarily lock it. However, some new lockdep asserts
    check to make sure that add_partial() _always_ has the list_lock
    held.

    Just add the locking, even though it is technically unnecessary.
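
    A sketch of the workaround in early_kmem_cache_node_alloc()
    (illustrative):

    /*
     * Taken purely to satisfy lockdep_assert_held() in add_partial():
     * no other CPU can see this kmem_cache_node yet, so there is no
     * actual race being protected against.
     */
    spin_lock(&n->list_lock);
    add_partial(n, page, DEACTIVATE_TO_HEAD);
    spin_unlock(&n->list_lock);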

    Cc: Peter Zijlstra
    Cc: Russell King
    Acked-by: David Rientjes
    Signed-off-by: Dave Hansen
    Signed-off-by: Pekka Enberg

    Dave Hansen
     
  • Commit 842e2873697e ("memcg: get rid of kmem_cache_dup()") introduced a
    mutex for memcg_create_kmem_cache() to protect the tmp_name buffer that
    holds the memcg name. It failed to unlock the mutex if this buffer
    could not be allocated.

    This patch fixes the issue by appropriately unlocking the mutex if the
    allocation fails.
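
    A sketch of the fixed error path (the static mutex guarding the
    tmp_name buffer is per the changelog; other details are illustrative):

    mutex_lock(&mutex);             /* protects the static tmp_name buffer */
    if (!tmp_name) {
            tmp_name = kmalloc(PATH_MAX, GFP_KERNEL);
            if (!tmp_name)
                    goto out;       /* previously: return NULL with the mutex held */
    }
    /* ... build the name and create the per-memcg cache ... */
    out:
    mutex_unlock(&mutex);
    return new;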

    Signed-off-by: Vladimir Davydov
    Cc: Michal Hocko
    Cc: Johannes Weiner
    Cc: Glauber Costa
    Acked-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     
  • A bonus of 3% of system memory is sometimes excessive in comparison to
    other processes.

    With commit a63d83f427fb ("oom: badness heuristic rewrite"), the OOM
    killer tries to avoid killing privileged tasks by subtracting 3% of
    overall memory (system or cgroup) from their per-task consumption. But
    as a result, all root tasks that consume less than 3% of overall memory
    are considered equal, and so it only takes 33+ privileged tasks pushing
    the system out of memory for the OOM killer to do something stupid and
    kill dhclient or other root-owned processes. For example, on a 32G
    machine it can't tell the difference between the 1M agetty and the 10G
    fork bomb member.

    The changelog describes this 3% boost as the equivalent to the global
    overcommit limit being 3% higher for privileged tasks, but this is not
    the same as discounting 3% of overall memory from _every privileged task
    individually_ during OOM selection.

    Replace the 3% of system memory bonus with a 3% of current memory usage
    bonus.

    By giving root tasks a bonus that is proportional to their actual size,
    they remain comparable even when relatively small. In the example
    above, the OOM killer will discount the 1M agetty's 256 badness points
    down to 179, and the 10G fork bomb's 262144 points down to 183500 points
    and make the right choice, instead of discounting both to 0 and killing
    agetty because it's first in the task list.
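
    A sketch of the arithmetic change in oom_badness() (the CAP_SYS_ADMIN
    check is how root is detected there; the "before" line is shown
    conceptually rather than as the literal old code):

    if (has_capability_noaudit(p, CAP_SYS_ADMIN)) {
            /* before (conceptually): points -= totalpages * 3 / 100;  */
            /* after: a bonus proportional to the task's own usage     */
            points -= points * 3 / 100;
    }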

    Signed-off-by: David Rientjes
    Reported-by: Johannes Weiner
    Acked-by: Johannes Weiner
    Cc: Michal Hocko
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • Commit abca7c496584 ("mm: fix slab->page _count corruption when using
    slub") notes that we can not _set_ a page->counters directly, except
    when using a real double-cmpxchg. Doing so can lose updates to
    ->_count.

    That is an absolute rule:

    You may not *set* page->counters except via a cmpxchg.

    Commit abca7c496584 fixed this for the folks who have the slub
    cmpxchg_double code turned off at compile time, but it left the bad case
    alone. It can still be reached, and the same bug triggered in two
    cases:

    1. Turning on slub debugging at runtime, which is available on
       the distro kernels that I looked at.
    2. On 64-bit CPUs with no CMPXCHG16B (some early AMD x86-64
       cpus, evidently)

    There are at least 3 ways we could fix this:

    1. Take all of the existing calls to cmpxchg_double_slab() and
       __cmpxchg_double_slab() and convert them to take an old, new
       and target 'struct page'.
    2. Do (1), but with the newly-introduced 'slub_data'.
    3. Do some magic inside the two cmpxchg...slab() functions to
       pull the counters out of new_counters and only set those
       fields in page->{inuse,frozen,objects}.

    I've done (2) as well, but it's a bunch more code. This patch is an
    attempt at (3). This was the most straightforward and foolproof way
    that I could think to do this.

    This would also technically allow us to get rid of the ugly

    #if defined(CONFIG_HAVE_CMPXCHG_DOUBLE) && \
    defined(CONFIG_HAVE_ALIGNED_STRUCT_PAGE)

    in 'struct page', but leaving it alone has the added benefit that
    'counters' stays 'unsigned' instead of 'unsigned long', so all the
    copies that the slub code does stay a bit smaller.
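
    A sketch of approach (3) above (helper name illustrative): decode the
    new value through a temporary struct page and copy only the slub-owned
    fields, leaving _count untouched.

    static inline void set_page_slub_counters(struct page *page,
                                              unsigned long counters_new)
    {
            struct page tmp;

            tmp.counters = counters_new;
            /*
             * page->counters overlaps frozen/inuse/objects *and* _count;
             * assigning ->counters wholesale can lose concurrent updates
             * to _count, so only write the fields slub actually owns.
             */
            page->frozen  = tmp.frozen;
            page->inuse   = tmp.inuse;
            page->objects = tmp.objects;
    }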

    Signed-off-by: Dave Hansen
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: Matt Mackall
    Cc: Pravin B Shelar
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dave Hansen
     
  • As a result of commit 5606e3877ad8 ("mm: numa: Migrate on reference
    policy"), /proc/<pid>/numa_maps prints the mempolicy for any <vma> as
    "prefer:N" for the local node, N, of the process reading the file.

    This should only be printed when the mempolicy of <vma> is
    MPOL_PREFERRED for node N.

    If the process is actually only using the default mempolicy for local
    node allocation, make sure "default" is printed as expected.

    Signed-off-by: David Rientjes
    Reported-by: Robert Lippert
    Cc: Peter Zijlstra
    Acked-by: Mel Gorman
    Cc: Ingo Molnar
    Cc: [3.7+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • Add my copyright to the zsmalloc source code which I maintain.

    Signed-off-by: Minchan Kim
    Cc: Nitin Gupta
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     
  • This patch moves zsmalloc under mm directory.

    Before that, the description below explains why we have needed a custom
    allocator.

    Zsmalloc is a new slab-based memory allocator for storing compressed
    pages. It is designed for low fragmentation and high allocation success
    rate on large object, but
    Acked-by: Nitin Gupta
    Reviewed-by: Konrad Rzeszutek Wilk
    Cc: Bob Liu
    Cc: Greg Kroah-Hartman
    Cc: Hugh Dickins
    Cc: Jens Axboe
    Cc: Luigi Semenzato
    Cc: Mel Gorman
    Cc: Pekka Enberg
    Cc: Rik van Riel
    Cc: Seth Jennings
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     
  • Pull core block IO changes from Jens Axboe:
    "The major piece in here is the immutable bio_ve series from Kent, the
    rest is fairly minor. It was supposed to go in last round, but
    various issues pushed it to this release instead. The pull request
    contains:

    - Various smaller blk-mq fixes from different folks. Nothing major
    here, just minor fixes and cleanups.

    - Fix for a memory leak in the error path in the block ioctl code
    from Christian Engelmayer.

    - Header export fix from CaiZhiyong.

    - Finally the immutable biovec changes from Kent Overstreet. This
    enables some nice future work on making arbitrarily sized bios
    possible, and splitting more efficient. Related fixes to immutable
    bio_vecs:

    - dm-cache immutable fixup from Mike Snitzer.
    - btrfs immutable fixup from Muthu Kumar.

    - bio-integrity fix from Nic Bellinger, which is also going to stable"

    * 'for-3.14/core' of git://git.kernel.dk/linux-block: (44 commits)
    xtensa: fixup simdisk driver to work with immutable bio_vecs
    block/blk-mq-cpu.c: use hotcpu_notifier()
    blk-mq: for_each_* macro correctness
    block: Fix memory leak in rw_copy_check_uvector() handling
    bio-integrity: Fix bio_integrity_verify segment start bug
    block: remove unrelated header files and export symbol
    blk-mq: uses page->list incorrectly
    blk-mq: use __smp_call_function_single directly
    btrfs: fix missing increment of bi_remaining
    Revert "block: Warn and free bio if bi_end_io is not set"
    block: Warn and free bio if bi_end_io is not set
    blk-mq: fix initializing request's start time
    block: blk-mq: don't export blk_mq_free_queue()
    block: blk-mq: make blk_sync_queue support mq
    block: blk-mq: support draining mq queue
    dm cache: increment bi_remaining when bi_end_io is restored
    block: fixup for generic bio chaining
    block: Really silence spurious compiler warnings
    block: Silence spurious compiler warnings
    block: Kill bio_pair_split()
    ...

    Linus Torvalds
     

30 Jan, 2014

7 commits

  • In the original bootmem wrapper for memblock, we had limit checking.

    Add it to memblock_virt_alloc to address the arm and x86 booting crashes.

    Signed-off-by: Yinghai Lu
    Cc: Ingo Molnar
    Cc: "H. Peter Anvin"
    Reported-by: Kevin Hilman
    Tested-by: Kevin Hilman
    Reported-by: Olof Johansson
    Tested-by: Olof Johansson
    Reported-by: Konrad Rzeszutek Wilk
    Tested-by: Konrad Rzeszutek Wilk
    Cc: Dave Hansen
    Cc: Santosh Shilimkar
    Cc: "Strashko, Grygorii"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Yinghai Lu
     
  • Commit 63d0f0a3c7e1 ("mm/readahead.c:do_readhead(): don't check for
    ->readpage") unintentionally made do_readahead return 0 for all valid
    files regardless of whether readahead was supported, rather than the
    expected -EINVAL. This gets forwarded on to userspace, and results in
    sys_readahead appearing to succeed in cases that don't make sense (e.g.
    when called on pipes or sockets). This issue is detected by the LTP
    readahead01 testcase.

    As the exact return value of force_page_cache_readahead is currently
    never used, we can simplify it to return only 0 or -EINVAL (when
    readpage or readpages is missing). With that in place we can simply
    forward on the return value of force_page_cache_readahead in
    do_readahead.

    This patch performs said change, restoring the expected semantics.
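
    A sketch of the resulting mm/readahead.c shape (from memory;
    illustrative):

    static ssize_t do_readahead(struct address_space *mapping, struct file *filp,
                                pgoff_t index, unsigned long nr)
    {
            if (!mapping || !mapping->a_ops)
                    return -EINVAL;

            /*
             * force_page_cache_readahead() now returns 0 or -EINVAL itself,
             * e.g. -EINVAL when neither ->readpage nor ->readpages exists.
             */
            return force_page_cache_readahead(mapping, filp, index, nr);
    }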

    Signed-off-by: Mark Rutland
    Acked-by: Kirill A. Shutemov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mark Rutland
     
  • Commit 309381feaee5 ("mm: dump page when hitting a VM_BUG_ON using
    VM_BUG_ON_PAGE") added a bunch of VM_BUG_ON_PAGE() calls.

    But, most of the ones in the slub code are for _temporary_ 'struct
    page's which are declared on the stack and likely have lots of gunk in
    them. Dumping their contents out will just confuse folks looking at
    bad_page() output. Plus, if we try to page_to_pfn() on them or
    something, we'll probably oops anyway.

    Turn them back into VM_BUG_ON()s.

    Signed-off-by: Dave Hansen
    Cc: Sasha Levin
    Cc: "Kirill A. Shutemov"
    Cc: Pekka Enberg
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dave Hansen
     
  • On kmem_cache_create_memcg() error path we set 'err', but leave 's' (the
    new cache ptr) undefined. The latter can be NULL if we could not
    allocate the cache, or pointing to a freed area if we failed somewhere
    later while trying to initialize it. Initially we checked 'err'
    immediately before exiting the function and returned NULL if it was set
    ignoring the value of 's':

    out_unlock:
            ...
            if (err) {
                    /* report error */
                    return NULL;
            }
            return s;

    Recently this check was, in fact, broken by commit f717eb3abb5e ("slab:
    do not panic if we fail to create memcg cache"), which turned it to:

    out_unlock:
            ...
            if (err && !memcg) {
                    /* report error */
                    return NULL;
            }
            return s;

    As a result, if we are failing creating a cache for a memcg, we will
    skip the check and return 's' that can contain crap. Obviously, commit
    f717eb3abb5e intended not to return crap on error allocating a cache for
    a memcg, but only to remove the error reporting in this case, so the
    check should look like this:

    out_unlock:
            ...
            if (err) {
                    if (!memcg)
                            return NULL;
                    /* report error */
                    return NULL;
            }
            return s;

    [rientjes@google.com: despaghettification]
    [vdavydov@parallels.com: patch monkeying]
    Signed-off-by: David Rientjes
    Signed-off-by: Vladimir Davydov
    Signed-off-by: Dave Jones
    Reported-by: Dave Jones
    Acked-by: Pekka Enberg
    Cc: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dave Jones
     
  • A few printk(KERN_*'s have snuck in there.

    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     
  • The command line parsing takes place before jump labels are initialised
    which generates a warning if numa_balancing= is specified and
    CONFIG_JUMP_LABEL is set.

    On older kernels before commit c4b2c0c5f647 ("static_key: WARN on usage
    before jump_label_init was called") the kernel would have crashed. This
    patch enables automatic numa balancing later in the initialisation
    process if numa_balancing= is specified.

    Signed-off-by: Mel Gorman
    Acked-by: Rik van Riel
    Cc: stable
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • The VM is currently heavily tuned to avoid swapping. Whether that is
    good or bad is a separate discussion, but as long as the VM won't swap
    to make room for dirty cache, we can not consider anonymous pages when
    calculating the amount of dirtyable memory, the baseline to which
    dirty_background_ratio and dirty_ratio are applied.

    A simple workload that occupies a significant size (40+%, depending on
    memory layout, storage speeds etc.) of memory with anon/tmpfs pages and
    uses the remainder for a streaming writer demonstrates this problem. In
    that case, the actual cache pages are a small fraction of what is
    considered dirtyable overall, which results in a relatively large
    portion of the cache pages being dirtied. As kswapd starts rotating
    these, random tasks enter direct reclaim and stall on IO.

    Only consider free pages and file pages dirtyable.

    Signed-off-by: Johannes Weiner
    Reported-by: Tejun Heo
    Tested-by: Tejun Heo
    Reviewed-by: Rik van Riel
    Cc: Mel Gorman
    Cc: Wu Fengguang
    Reviewed-by: Michal Hocko
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner