06 Nov, 2015

40 commits

  • get_mergeable_page() can only return NULL (also in case of errors) or the
    pinned mergeable page; it never returns an error value distinct from
    NULL. This optimizes away the unnecessary error check in the caller (see
    the sketch after this entry).

    Add a return after the "out:" label in the callee to make it more
    readable.

    Signed-off-by: Andrea Arcangeli
    Acked-by: Hugh Dickins
    Cc: Petr Holasek
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
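
    A sketch of the resulting caller-side simplification (assuming a call
    site like the one in unstable_tree_search_insert(); simplified, not the
    exact diff):

    struct page *tree_page;

    tree_page = get_mergeable_page(tree_rmap_item);
    if (!tree_page)         /* was: if (IS_ERR_OR_NULL(tree_page)) */
            return NULL;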
     
  • Doing the VM_MERGEABLE check after the page == kpage check won't provide
    any meaningful benefit. The !vma->anon_vma check of find_mergeable_vma is
    the only superfluous bit in using find_mergeable_vma because the !PageAnon
    check of try_to_merge_one_page() implicitly checks for that, but it still
    looks cleaner to share the same find_mergeable_vma().

    Signed-off-by: Andrea Arcangeli
    Acked-by: Hugh Dickins
    Cc: Petr Holasek
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     
  • This just uses the helper function to clean up the assumption about the
    hlist_node internals.

    Signed-off-by: Andrea Arcangeli
    Acked-by: Hugh Dickins
    Cc: Petr Holasek
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     
  • The stable_nodes can become stale at any time if the underlying pages get
    freed. The stable_node gets collected and removed from the stable rbtree
    if that is detected during the rbtree lookups.

    Don't fail the lookup if running into stale stable_nodes; just restart
    the lookup after collecting the stale stable_nodes. Otherwise the CPU
    time spent in the preparation stage is wasted and the lookup must be
    repeated at the next loop, potentially failing a second time on a second
    stale stable_node.

    If we don't prune aggressively we delay the merging of the unstable node
    candidates and at the same time we delay the freeing of the stale
    stable_nodes. Keeping stale stable_nodes around wastes memory and it
    can't provide any benefit.

    Signed-off-by: Andrea Arcangeli
    Acked-by: Hugh Dickins
    Cc: Petr Holasek
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     
  • Add a cond_resched() to the KSM rmap walk; while at it, add it to the
    file and anon rmap walks too.

    Signed-off-by: Andrea Arcangeli
    Acked-by: Hugh Dickins
    Cc: Petr Holasek
    Acked-by: Davidlohr Bueso
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     
  • Before the previous patch ("memcg: unify slab and other kmem pages
    charging"), __mem_cgroup_from_kmem had to handle two types of kmem - slab
    pages and pages allocated with alloc_kmem_pages - because only the latter
    stored the owning memcg in the page struct. Now we can unify it, and
    since that makes the function tiny, we can fold it into
    mem_cgroup_from_kmem.

    [hughd@google.com: move mem_cgroup_from_kmem into list_lru.c]
    Signed-off-by: Vladimir Davydov
    Acked-by: Michal Hocko
    Cc: Johannes Weiner
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     
  • We have memcg_kmem_charge and memcg_kmem_uncharge methods for charging and
    uncharging kmem pages to memcg, but currently they are not used for
    charging slab pages (i.e. they are only used for charging pages allocated
    with alloc_kmem_pages). The only reason why the slab subsystem uses
    special helpers, memcg_charge_slab and memcg_uncharge_slab, is that it
    needs to charge to the memcg of kmem cache while memcg_charge_kmem charges
    to the memcg that the current task belongs to.

    To remove this diversity, this patch adds an extra argument to
    __memcg_kmem_charge that can be a pointer to a memcg or NULL. If it is
    not NULL, the function tries to charge to the memcg it points to,
    otherwise it charges to the current context. Next, it makes the slab
    subsystem use this function to charge slab pages.

    Since memcg_charge_kmem and memcg_uncharge_kmem helpers are now used only
    in __memcg_kmem_charge and __memcg_kmem_uncharge, they are inlined. Since
    __memcg_kmem_charge stores a pointer to the memcg in the page struct, we
    don't need memcg_uncharge_slab anymore and can use free_kmem_pages.
    Besides, one can now detect which memcg a slab page belongs to by reading
    /proc/kpagecgroup.

    Note, this patch switches slab to charge-after-alloc design. Since this
    design is already used for all other memcg charges, it should not make any
    difference.

    [hannes@cmpxchg.org: better to have an outer function than a magic parameter for the memcg lookup]
    Signed-off-by: Vladimir Davydov
    Acked-by: Michal Hocko
    Signed-off-by: Johannes Weiner
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     
  • Charging kmem pages proceeds in two steps. First, we try to charge the
    allocation size to the memcg the current task belongs to, then we allocate
    a page and "commit" the charge storing the pointer to the memcg in the
    page struct.

    Such a design looks overcomplicated, because there is not much sense in
    trying to charge the allocation before actually allocating a page: we
    won't be able to consume much memory over the limit even if we charge
    after doing the actual allocation; besides, we already charge user pages
    post factum, so being pedantic with kmem pages just looks pointless.

    So this patch simplifies the design by merging the "charge" and the
    "commit" steps into the same function, which takes the allocated page
    (the resulting flow is sketched after this entry).

    Also, rename the charge and uncharge methods to memcg_kmem_charge and
    memcg_kmem_uncharge and make the charge method return error code instead
    of bool to conform to mem_cgroup_try_charge.

    Signed-off-by: Vladimir Davydov
    Acked-by: Michal Hocko
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
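
    A sketch of the resulting charge-after-alloc flow (assumed signatures:
    memcg_kmem_charge(page, gfp, order) returning 0 or -errno, with
    memcg_kmem_uncharge(page, order) as its counterpart; simplified, not the
    exact kernel source):

    struct page *page;

    page = alloc_pages(gfp_mask, order);
    if (!page)
            return NULL;

    /* Charge only after the allocation succeeded; on failure, undo it. */
    if (memcg_kmem_charge(page, gfp_mask, order)) {
            __free_pages(page, order);
            return NULL;
    }
    return page;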
     
  • If kernelcore was not specified, or the kernelcore size is zero
    (required_movablecore >= totalpages), or the kernelcore size is larger
    than totalpages, there is no ZONE_MOVABLE. We should fill the zone with
    both kernel memory and movable memory.

    Signed-off-by: Xishi Qiu
    Reviewed-by: Yasuaki Ishimatsu
    Cc: Mel Gorman
    Cc: David Rientjes
    Cc: Tang Chen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Xishi Qiu
     
  • This function is called in very hot paths and merely does a few loads
    for a validity check. Let's inline it so that we can save the function
    call overhead.

    (akpm: this is cosmetic - the compiler already inlines vmacache_valid_mm())

    Signed-off-by: Davidlohr Bueso
    Cc: Sergey Senozhatsky
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Davidlohr Bueso
     
  • Change HIGHMEM_ZONE to be the same as the DMA_ZONE macro.

    Signed-off-by: yalin wang
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    yalin wang
     
  • Srinivas Kandagatla reported bad page messages when trying to remove the
    bottom 2MB on an ARM based IFC6410 board

    BUG: Bad page state in process swapper pfn:fffa8
    page:ef7fb500 count:0 mapcount:0 mapping: (null) index:0x0
    flags: 0x96640253(locked|error|dirty|active|arch_1|reclaim|mlocked)
    page dumped because: PAGE_FLAGS_CHECK_AT_FREE flag(s) set
    bad because of flags:
    flags: 0x200041(locked|active|mlocked)
    Modules linked in:
    CPU: 0 PID: 0 Comm: swapper Not tainted 3.19.0-rc3-00007-g412f9ba-dirty #816
    Hardware name: Qualcomm (Flattened Device Tree)
    unwind_backtrace
    show_stack
    dump_stack
    bad_page
    free_pages_prepare
    free_hot_cold_page
    __free_pages
    free_highmem_page
    mem_init
    start_kernel
    Disabling lock debugging due to kernel taint

    Removing the lower 2MB left the start of the lowmem zone no longer page
    block aligned. IFC6410 uses CONFIG_FLATMEM, where alloc_node_mem_map
    allocates memory for the mem_map. alloc_node_mem_map will offset for
    unaligned nodes with the assumption the pfn/page translation functions
    will account for the offset. The functions for CONFIG_FLATMEM do not
    offset however, resulting in overrunning the memmap array. Just use the
    allocated memmap without any offset when running with CONFIG_FLATMEM to
    avoid the overrun.

    Signed-off-by: Laura Abbott
    Signed-off-by: Laura Abbott
    Reported-by: Srinivas Kandagatla
    Tested-by: Srinivas Kandagatla
    Acked-by: Vlastimil Babka
    Tested-by: Bjorn Andersson
    Cc: Santosh Shilimkar
    Cc: Russell King
    Cc: Kevin Hilman
    Cc: Arnd Bergman
    Cc: Stephen Boyd
    Cc: Andy Gross
    Cc: Mel Gorman
    Cc: Steven Rostedt
    Cc: Dave Hansen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Laura Abbott
     
  • With x86_64 (config http://ozlabs.org/~akpm/config-akpm2.txt) and old gcc
    (4.4.4), drivers/base/node.c:node_read_meminfo() is using 2344 bytes of
    stack. Uninlining node_page_state() reduces this to 440 bytes.

    The stack consumption issue is fixed by newer gcc (4.8.4) however with
    that compiler this patch reduces the node.o text size from 7314 bytes to
    4578.

    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     
  • Make the argument order of __install_special_mapping() match its caller,
    so the caller can pass its register arguments straight through to the
    callee untouched.

    On most architectures at least the first five arguments are passed in
    registers, so this change will have an effect on most architectures.

    With -O2, __install_special_mapping() may be inlined on most
    architectures, but with -Os it should not be. So this change can give a
    little better performance for -Os, at least.

    Signed-off-by: Chen Gang
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Chen Gang
     
  • (1) For !CONFIG_BUG cases, the bug call is a no-op, so we couldn't
    care less and the change is ok.

    (2) ppc and mips, which HAVE_ARCH_BUG_ON, do not rely on branch
    predictions as it seems to be pointless[1] and thus callers should not
    be trying to push an optimization in the first place.

    (3) For CONFIG_BUG and !HAVE_ARCH_BUG_ON cases, BUG_ON() already contains
    an unlikely compiler hint.

    Hence, we can drop the unlikely annotation around BUG_ON() conditions
    (see the sketch after this entry).

    [1] http://lkml.iu.edu/hypermail/linux/kernel/1101.3/02289.html

    Signed-off-by: Geliang Tang
    Acked-by: Davidlohr Bueso
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Geliang Tang
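
    For reference, a sketch of roughly how the generic (!HAVE_ARCH_BUG_ON)
    definition reads; the condition is already wrapped in unlikely(), so a
    caller writing BUG_ON(unlikely(cond)) adds nothing:

    #define BUG_ON(condition) \
            do { if (unlikely(condition)) BUG(); } while (0)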
     
  • When fget() fails we can return -EBADF directly.

    Signed-off-by: Chen Gang
    Acked-by: Oleg Nesterov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Chen Gang
     
  • It is still a little better to remove it, although it would be optimized
    away by "-O2" anyway.

    Signed-off-by: Chen Gang
    Acked-by: Oleg Nesterov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Chen Gang
     
  • This came up when implementing HIGHMEM/PAE40 for ARC. The kmap() /
    kmap_atomic() generated code seemed needlessly bloated due to the way the
    PageHighMem() macro is implemented. It derives the exact zone for the
    page and then does pointer subtraction with the first zone to infer the
    zone_type. The pointer arithmetic in turn generates the code bloat.

    PageHighMem(page)
    is_highmem(page_zone(page))
    zone_off = (char *)zone - (char *)zone->zone_pgdat->node_zones

    Instead, use is_highmem_idx() to work on the zone_type available in the
    page flags (a macro-level sketch follows this entry).

    ----- Before -----
    80756348: mov_s r13,r0
    8075634a: ld_s r2,[r13,0]
    8075634c: lsr_s r2,r2,30
    8075634e: mpy r2,r2,0x2a4
    80756352: add_s r2,r2,0x80aef880
    80756358: ld_s r3,[r2,28]
    8075635a: sub_s r2,r2,r3
    8075635c: breq r2,0x2a4,80756378
    80756364: breq r2,0x548,80756378

    ----- After -----
    80756330: mov_s r13,r0
    80756332: ld_s r2,[r13,0]
    80756334: lsr_s r2,r2,30
    80756336: sub_s r2,r2,1
    80756338: brlo r2,2,80756348

    For x86 defconfig build (32 bit only) it saves around 900 bytes.
    For ARC defconfig with HIGHMEM, it saved around 2K bytes.

    ---->8-------
    ./scripts/bloat-o-meter x86/vmlinux-defconfig-pre x86/vmlinux-defconfig-post
    add/remove: 0/0 grow/shrink: 0/36 up/down: 0/-934 (-934)
    function old new delta
    saveable_page 162 154 -8
    saveable_highmem_page 154 146 -8
    skb_gro_reset_offset 147 131 -16
    ...
    ...
    __change_page_attr_set_clr 1715 1678 -37
    setup_data_read 434 394 -40
    mon_bin_event 1967 1927 -40
    swsusp_save 1148 1105 -43
    _set_pages_array 549 493 -56
    ---->8-------

    e.g. For ARC kmap()

    Signed-off-by: Vineet Gupta
    Acked-by: Michal Hocko
    Cc: Naoya Horiguchi
    Cc: Hugh Dickins
    Cc: "Kirill A. Shutemov"
    Cc: Michal Hocko
    Cc: Jennifer Herbert
    Cc: Konstantin Khlebnikov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vineet Gupta
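
    A macro-level sketch of the change (the wrapper names below are
    illustrative, not the kernel's own):

    /* before: derive the zone pointer, then do pointer arithmetic on it */
    static inline int page_is_highmem_old(const struct page *page)
    {
            return is_highmem(page_zone(page));
    }

    /* after: use the zone index already encoded in page->flags */
    static inline int page_is_highmem_new(const struct page *page)
    {
            return is_highmem_idx(page_zonenum(page));
    }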
     
  • Both "child->mm == mm" and "p->mm != mm" checks in oom_kill_process() are
    wrong. task->mm can be NULL if the task is the exited group leader. This
    means in particular that "kill sharing same memory" loop can miss a
    process with a zombie leader which uses the same ->mm.

    Note: the process_shares_mm(child, p->mm) check is still not 100%
    correct, since p->mm can be NULL too. This is minor, but probably
    deserves a fix or a comment anyway (the helper is sketched after this
    entry).

    [akpm@linux-foundation.org: document process_shares_mm() a bit]
    Signed-off-by: Oleg Nesterov
    Acked-by: David Rientjes
    Acked-by: Michal Hocko
    Cc: Tetsuo Handa
    Cc: Kyle Walker
    Cc: Stanislav Kozina
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
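
    A sketch of the mm-sharing check this implies (assumed form): look at
    every thread of the process rather than only the group leader, because
    the leader's ->mm may already be NULL:

    static bool process_shares_mm(struct task_struct *p,
                                  struct mm_struct *mm)
    {
            struct task_struct *t;

            for_each_thread(p, t) {
                    struct mm_struct *t_mm = READ_ONCE(t->mm);

                    /* All live threads share one mm, so the first thread
                     * that still has an mm decides the answer. */
                    if (t_mm)
                            return t_mm == mm;
            }
            return false;
    }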
     
  • Purely cosmetic, but the complex "if" condition looks annoying to me.
    Especially because it is not consistent with OOM_SCORE_ADJ_MIN check
    which adds another if/continue.

    Signed-off-by: Oleg Nesterov
    Acked-by: David Rientjes
    Acked-by: Michal Hocko
    Acked-by: Hillf Danton
    Cc: Tetsuo Handa
    Cc: Kyle Walker
    Cc: Stanislav Kozina
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • The fatal_signal_pending() check was added to suppress an unnecessary
    "sharing same memory" message, but it can't fully help anyway because it
    can be a false negative: SIGKILL can already have been dequeued.

    And worse, it can be false-positive due to exec or coredump. exec is
    mostly fine, but coredump is not. It is possible that the group leader
    has the pending SIGKILL because its sub-thread originated the coredump, in
    this case we must not skip this process.

    We could probably add the additional ->group_exit_task check but this
    patch just removes the wrong check along with pr_info().

    Signed-off-by: Oleg Nesterov
    Acked-by: David Rientjes
    Acked-by: Tetsuo Handa
    Acked-by: Michal Hocko
    Cc: Kyle Walker
    Cc: Stanislav Kozina
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • Cosmetic, but expand_upwards() and expand_downwards() overuse
    vma->vm_mm; a local variable makes sense, imho.

    Signed-off-by: Oleg Nesterov
    Acked-by: Hugh Dickins
    Cc: Andrey Konovalov
    Cc: Davidlohr Bueso
    Cc: "Kirill A. Shutemov"
    Cc: Sasha Levin
    Cc: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • "mm->locked_vm += grow" and vm_stat_account() in acct_stack_growth() are
    not safe; multiple threads using the same ->mm can do this at the same
    time, trying to expand different vmas under down_read(mmap_sem). This
    means that one of the "locked_vm += grow" changes can be lost and we can
    miss munlock_vma_pages_all() later (the lost-update race is sketched
    after this entry).

    Move this code into the caller(s) under mm->page_table_lock. All other
    updates to ->locked_vm hold mmap_sem for writing.

    Signed-off-by: Oleg Nesterov
    Acked-by: Hugh Dickins
    Cc: Andrey Konovalov
    Cc: Davidlohr Bueso
    Cc: "Kirill A. Shutemov"
    Cc: Sasha Levin
    Cc: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
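
    The lost update above is the classic non-atomic read-modify-write race.
    A standalone userspace sketch (plain C with pthreads, not kernel code)
    showing increments getting lost when two threads update a shared counter
    without exclusive locking:

    #include <pthread.h>
    #include <stdio.h>

    static long locked_vm;                  /* stand-in for mm->locked_vm */

    static void *grow(void *arg)
    {
            int i;

            for (i = 0; i < 1000000; i++)
                    locked_vm += 1;         /* non-atomic: load, add, store */
            return NULL;
    }

    int main(void)
    {
            pthread_t a, b;

            pthread_create(&a, NULL, grow, NULL);
            pthread_create(&b, NULL, grow, NULL);
            pthread_join(a, NULL);
            pthread_join(b, NULL);

            /* Usually prints less than 2000000: some "+= 1" updates were
             * lost, just like concurrent "locked_vm += grow" done under
             * down_read(mmap_sem). */
            printf("locked_vm = %ld (expected 2000000)\n", locked_vm);
            return 0;
    }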
     
  • If the user sets "movablecore=xx" to a large number, the corepages
    calculation will overflow (wrap around). Fix the problem. A sketch of
    the arithmetic follows this entry.

    Signed-off-by: Xishi Qiu
    Reviewed-by: Yasuaki Ishimatsu
    Acked-by: Tang Chen
    Acked-by: David Rientjes
    Cc: Mel Gorman
    Cc: Tang Chen
    Cc: Zhang Yanfei
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Xishi Qiu
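
    A standalone sketch of the arithmetic (illustrative numbers and variable
    names, not the kernel source): subtracting an oversized movablecore
    request from totalpages wraps around, so the request must be clamped
    first:

    #include <stdio.h>

    int main(void)
    {
            unsigned long totalpages = 1UL << 20;           /* e.g. 4GB in 4K pages */
            unsigned long required_movablecore = 1UL << 22; /* user asked for more */
            unsigned long corepages;

            corepages = totalpages - required_movablecore;  /* unsigned wrap-around */
            printf("unclamped: corepages = %lu\n", corepages);

            /* The fix: never let the request exceed what actually exists. */
            if (required_movablecore > totalpages)
                    required_movablecore = totalpages;
            corepages = totalpages - required_movablecore;
            printf("clamped:   corepages = %lu\n", corepages);
            return 0;
    }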
     
  • In zone_reclaimable_pages(), `nr' is returned by a function which is
    declared as returning "unsigned long", so declare it such. Negative
    values are meaningless here.

    In zone_pagecache_reclaimable() we should also declare `delta' and
    `nr_pagecache_reclaimable' as being unsigned longs because they're used to
    store the values returned by zone_page_state() and
    zone_unmapped_file_pages() which also happen to return unsigned integers.

    [akpm@linux-foundation.org: make zone_pagecache_reclaimable() return ulong rather than long]
    Signed-off-by: Alexandru Moise
    Acked-by: Michal Hocko
    Cc: Vladimir Davydov
    Cc: Johannes Weiner
    Cc: Vlastimil Babka
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexandru Moise
     
  • The oom killer takes task_lock() in a couple of places solely to protect
    printing the task's comm.

    A process's comm, including current's comm, may change due to
    /proc/pid/comm or PR_SET_NAME.

    The comm will always be NULL-terminated, so the worst race scenario would
    only be during update. We can tolerate a comm being printed that is in
    the middle of an update to avoid taking the lock.

    Other locations in the kernel have already dropped task_lock() when
    printing comm, so this is consistent.

    Signed-off-by: David Rientjes
    Suggested-by: Oleg Nesterov
    Cc: Michal Hocko
    Cc: Vladimir Davydov
    Cc: Sergey Senozhatsky
    Acked-by: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • Compaction returns prematurely with COMPACT_PARTIAL when contended or has
    fatal signal pending. This is ok for the callers, but might be misleading
    in the traces, as the usual reason to return COMPACT_PARTIAL is that we
    think the allocation should succeed. After this patch we distinguish the
    premature ending condition in the mm_compaction_finished and
    mm_compaction_end tracepoints.

    The contended status covers the following reasons:
    - lock contention or need_resched() detected in async compaction
    - fatal signal pending
    - too many pages isolated in the zone (only for async compaction)
    Further distinguishing the exact reason seems unnecessary for now.

    Signed-off-by: Vlastimil Babka
    Cc: Joonsoo Kim
    Cc: Mel Gorman
    Cc: David Rientjes
    Cc: Steven Rostedt
    Cc: Ingo Molnar
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     
  • Some compaction tracepoints use zone->name to print which zone is being
    compacted. This works for in-kernel printing, but not userspace trace
    printing of raw captured trace such as via trace-cmd report.

    This patch uses zone_idx() instead of zone->name as the raw value, and
    when printing, converts the zone_type to string using the appropriate EM()
    macros and some ugly tricks to overcome the problem that half the values
    depend on CONFIG_ options and one does not simply use #ifdef inside of
    #define.

    trace-cmd output before:
    transhuge-stres-4235 [000] 453.149280: mm_compaction_finished: node=0
    zone=ffffffff81815d7a order=9 ret=partial

    after:
    transhuge-stres-4235 [000] 453.149280: mm_compaction_finished: node=0
    zone=Normal order=9 ret=partial

    Signed-off-by: Vlastimil Babka
    Reviewed-by: Steven Rostedt
    Cc: Joonsoo Kim
    Cc: Ingo Molnar
    Cc: Mel Gorman
    Cc: David Rientjes
    Cc: Valentin Rothberg
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     
  • Some compaction tracepoints convert the integer return values to strings
    using the compaction_status_string array. This works for in-kernel
    printing, but not userspace trace printing of raw captured trace such as
    via trace-cmd report.

    This patch converts the private array to the appropriate tracepoint
    macros that result in proper userspace support (the pattern is sketched
    after this entry).

    trace-cmd output before:
    transhuge-stres-4235 [000] 453.149280: mm_compaction_finished: node=0
    zone=ffffffff81815d7a order=9 ret=

    after:
    transhuge-stres-4235 [000] 453.149280: mm_compaction_finished: node=0
    zone=ffffffff81815d7a order=9 ret=partial

    Signed-off-by: Vlastimil Babka
    Reviewed-by: Steven Rostedt
    Cc: Joonsoo Kim
    Cc: Ingo Molnar
    Cc: Mel Gorman
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
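
    An abbreviated sketch of the usual tracepoint pattern for this (trimmed;
    the real header lists every compaction status value): define the
    value/name pairs once, export the enum values with TRACE_DEFINE_ENUM()
    so trace-cmd knows them, and print with __print_symbolic():

    #define COMPACTION_STATUS                       \
            EM(COMPACT_SKIPPED,     "skipped")      \
            EM(COMPACT_CONTINUE,    "continue")     \
            EM(COMPACT_PARTIAL,     "partial")      \
            EMe(COMPACT_COMPLETE,   "complete")

    /* First expansion: make the numeric values known to the tracing core. */
    #undef EM
    #undef EMe
    #define EM(a, b)        TRACE_DEFINE_ENUM(a);
    #define EMe(a, b)       TRACE_DEFINE_ENUM(a);

    COMPACTION_STATUS

    /* Second expansion: {value, "name"} pairs for printing. */
    #undef EM
    #undef EMe
    #define EM(a, b)        {a, b},
    #define EMe(a, b)       {a, b}

    /* ... and in TP_printk():
     *         __print_symbolic(__entry->status, COMPACTION_STATUS)
     */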
     
  • oom_kill_process() sends SIGKILL to other thread groups sharing victim's
    mm. But printing

    "Kill process %d (%s) sharing same memory\n"

    lines makes no sense if they already have pending SIGKILL. This patch
    reduces the "Kill process" lines by printing that line with info level
    only if SIGKILL is not pending.

    Signed-off-by: Tetsuo Handa
    Acked-by: Michal Hocko
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tetsuo Handa
     
  • At the for_each_process() loop in oom_kill_process(), we are comparing
    address of OOM victim's mm without holding a reference to that mm. If
    there are a lot of processes to compare or a lot of "Kill process %d (%s)
    sharing same memory" messages to print, for_each_process() loop could take
    very long time.

    It is possible that meanwhile the OOM victim exits and releases its mm,
    and then mm is allocated with the same address and assigned to some
    unrelated process. When we hit such a race, the unrelated process will
    be killed by error. To make sure that the OOM victim's mm does not go
    away until the for_each_process() loop finishes, get a reference on the
    OOM victim's mm before calling task_unlock(victim) (the pattern is
    sketched after this entry).

    [oleg@redhat.com: several fixes]
    Signed-off-by: Tetsuo Handa
    Acked-by: Michal Hocko
    Cc: David Rientjes
    Signed-off-by: Oleg Nesterov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tetsuo Handa
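
    A simplified sketch of the pattern (not the exact patch): pin the
    victim's mm with a reference before dropping task_lock(), so the mm
    cannot be freed and its address recycled while the loop runs; the
    sharing test can be a helper like the one sketched a few entries above
    or a raw p->mm comparison:

    mm = victim->mm;
    atomic_inc(&mm->mm_count);      /* pin; paired with mmdrop() below */
    task_unlock(victim);

    rcu_read_lock();
    for_each_process(p) {
            if (!process_shares_mm(p, mm))
                    continue;
            /* ... print the message and send SIGKILL to the sharer ... */
    }
    rcu_read_unlock();

    mmdrop(mm);                     /* release the pin */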
     
  • It was confirmed that a local unprivileged user can consume all memory
    reserves and hang up the system by exploiting the time lag between the
    OOM killer setting TIF_MEMDIE on an OOM victim and sending SIGKILL to
    that victim, because the printk() inside the for_each_process() loop in
    oom_kill_process() can consume many seconds when there are many thread
    groups sharing the same memory.

    Before starting oom-depleter process:

    Node 0 DMA: 3*4kB (UM) 6*8kB (U) 4*16kB (UEM) 0*32kB 0*64kB 1*128kB (M) 2*256kB (EM) 2*512kB (UE) 2*1024kB (EM) 1*2048kB (E) 1*4096kB (M) = 9980kB
    Node 0 DMA32: 31*4kB (UEM) 27*8kB (UE) 32*16kB (UE) 13*32kB (UE) 14*64kB (UM) 7*128kB (UM) 8*256kB (UM) 8*512kB (UM) 3*1024kB (U) 4*2048kB (UM) 362*4096kB (UM) = 1503220kB

    As of invoking the OOM killer:

    Node 0 DMA: 11*4kB (UE) 8*8kB (UEM) 6*16kB (UE) 2*32kB (EM) 0*64kB 1*128kB (U) 3*256kB (UEM) 2*512kB (UE) 3*1024kB (UEM) 1*2048kB (U) 0*4096kB = 7308kB
    Node 0 DMA32: 1049*4kB (UEM) 507*8kB (UE) 151*16kB (UE) 53*32kB (UEM) 83*64kB (UEM) 52*128kB (EM) 25*256kB (UEM) 11*512kB (M) 6*1024kB (UM) 1*2048kB (M) 0*4096kB = 44556kB

    Between the thread group leader got TIF_MEMDIE and receives SIGKILL:

    Node 0 DMA: 0*4kB 0*8kB 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 0kB
    Node 0 DMA32: 0*4kB 0*8kB 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 0kB

    The oom-depleter's thread group leader which got TIF_MEMDIE started
    memset() in user space after the OOM killer set TIF_MEMDIE, and it was
    free to abuse ALLOC_NO_WATERMARKS by TIF_MEMDIE for memset() in user space
    until SIGKILL is delivered. If SIGKILL is delivered before TIF_MEMDIE is
    set, the oom-depleter can terminate without touching memory reserves.

    Although the possibility of hitting this time lag is very small for 3.19
    and earlier kernels because TIF_MEMDIE is set immediately before sending
    SIGKILL, preemption or long interrupts (an extreme example is SysRq-t) can
    step between and allow memory allocations which are not needed for
    terminating the OOM victim.

    Fixes: 83363b917a29 ("oom: make sure that TIF_MEMDIE is set under task_lock")
    Signed-off-by: Tetsuo Handa
    Acked-by: Michal Hocko
    Cc: David Rientjes
    Cc: [4.0+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tetsuo Handa
     
  • Make mem_cgroup_inactive_anon_is_low return bool due to this particular
    function only using either one or zero as its return value.

    No functional change.

    Signed-off-by: Yaowei Bai
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Yaowei Bai
     
  • Make inactive_anon/file_is_low return bool due to these particular
    functions only using either one or zero as their return value.

    No functional change.

    Signed-off-by: Yaowei Bai
    Acked-by: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Yaowei Bai
     
  • max_ptes_swap specifies how many pages can be brought in from swap when
    collapsing a group of pages into a transparent huge page.

    /sys/kernel/mm/transparent_hugepage/khugepaged/max_ptes_swap

    A higher value can cause excessive swap IO and waste memory. A lower
    value can prevent THPs from being collapsed, resulting in fewer pages
    being collapsed into THPs, and lower memory access performance.

    Signed-off-by: Ebru Akagunduz
    Acked-by: Rik van Riel
    Acked-by: David Rientjes
    Cc: Oleg Nesterov
    Cc: Kirill A. Shutemov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ebru Akagunduz
     
  • Since commit 6539cc053869 ("mm: memcontrol: fold mem_cgroup_do_charge()"),
    the order to pass to mem_cgroup_oom() is calculated by passing the
    number of pages to get_order() instead of the expected size in bytes.
    AFAICT, it only affects the value displayed in the oom warning message.
    This patch fixes this (a worked example of the arithmetic follows this
    entry).

    Michal said:

    : We haven't noticed that just because the OOM is enabled only for page
    : faults of order-0 (single page) and get_order work just fine. Thanks for
    : noticing this. If we ever start triggering OOM on different orders this
    : would be broken.

    Signed-off-by: Jerome Marchand
    Acked-by: Michal Hocko
    Acked-by: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jerome Marchand
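
    A standalone worked example of the unit mix-up (this get_order() is a
    simplified userspace stand-in for the kernel helper): passing a page
    count where a byte size is expected yields order 0 for anything up to a
    page, which is why only the printed order was wrong for the usual
    single-page faults:

    #include <stdio.h>

    #define PAGE_SHIFT 12
    #define PAGE_SIZE  (1UL << PAGE_SHIFT)

    /* Smallest order such that 2^order pages hold 'size' bytes. */
    static int get_order(unsigned long size)
    {
            unsigned long pages = (size + PAGE_SIZE - 1) >> PAGE_SHIFT;
            int order = 0;

            while ((1UL << order) < pages)
                    order++;
            return order;
    }

    int main(void)
    {
            unsigned long nr_pages = 8;

            /* Bug: a page count passed where a byte size is expected. */
            printf("get_order(nr_pages)             = %d\n",
                   get_order(nr_pages));                    /* 0 */
            /* Intended: the size in bytes. */
            printf("get_order(nr_pages * PAGE_SIZE) = %d\n",
                   get_order(nr_pages * PAGE_SIZE));        /* 3 */
            /* For nr_pages == 1 both give 0, matching the note above. */
            return 0;
    }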
     
  • Currently the kernel prints out the result of every single unpoison
    event, which is not necessary because unpoison is purely a testing
    feature and testers can get little or no information from lots of lines
    of unpoison log storm. So this patch ratelimits the printk in
    unpoison_memory().

    This patch introduces a file-local ratelimit_state, which adds 64 bytes
    to memory-failure.o. If we applied pr_info_ratelimited() at each of the
    8 callsites instead, 256 bytes would be added, so it's a win. (The
    ratelimit pattern is sketched after this entry.)

    Signed-off-by: Naoya Horiguchi
    Cc: Andi Kleen
    Cc: Wanpeng Li
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi
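
    A minimal sketch of the pattern described (names are illustrative; the
    patch's actual helper macro may differ): one file-local ratelimit_state
    shared by the unpoison messages, checked before each pr_info():

    #include <linux/ratelimit.h>
    #include <linux/printk.h>

    /* At most DEFAULT_RATELIMIT_BURST messages per
     * DEFAULT_RATELIMIT_INTERVAL; the rest are dropped. */
    static DEFINE_RATELIMIT_STATE(unpoison_rs, DEFAULT_RATELIMIT_INTERVAL,
                                  DEFAULT_RATELIMIT_BURST);

    #define unpoison_pr_info(fmt, pfn)              \
    ({                                              \
            if (__ratelimit(&unpoison_rs))          \
                    pr_info(fmt, pfn);              \
    })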
     
  • filemap_fdatawait() is a function to wait for on-going writeback to
    complete but also consume and clear error status of the mapping set during
    writeback.

    The latter functionality is critical for applications to detect writeback
    error with system calls like fsync(2)/fdatasync(2).

    However filemap_fdatawait() is also used by sync(2) or FIFREEZE ioctl,
    which don't check error status of individual mappings.

    As a result, fsync() may not be able to detect writeback error if events
    happen in the following order:

    Application                          System admin
    ----------------------------------------------------------
    write data on page cache
                                         Run sync command
                                         writeback completes with error
                                         filemap_fdatawait() clears error
    fsync returns success
    (but the data is not on disk)

    This patch adds filemap_fdatawait_keep_errors() for call sites where
    writeback errors are not handled, so that they don't clear the error
    status (a sketch of such a helper follows this entry).

    Signed-off-by: Jun'ichi Nomura
    Acked-by: Andi Kleen
    Reviewed-by: Tejun Heo
    Cc: Fengguang Wu
    Cc: Dave Chinner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Junichi Nomura
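
    A sketch of what such a helper can look like (the body is an assumption,
    not the exact kernel source; it presumes an internal
    __filemap_fdatawait_range() that only waits): wait for writeback just
    like filemap_fdatawait(), but skip the step that reports and clears
    AS_EIO/AS_ENOSPC:

    void filemap_fdatawait_keep_errors(struct address_space *mapping)
    {
            loff_t i_size = i_size_read(mapping->host);

            if (i_size == 0)
                    return;

            /* Wait for writeback on the whole file ... */
            __filemap_fdatawait_range(mapping, 0, i_size - 1);
            /* ... but do NOT call filemap_check_errors(), so the
             * AS_EIO/AS_ENOSPC flags survive for a later fsync(). */
    }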
     
  • Introduce is_via_compact_memory() helper indicating compacting via
    /proc/sys/vm/compact_memory to improve readability.

    To catch this situation in __compaction_suitable, use order as parameter
    directly instead of using struct compact_control.

    This patch has no functional changes (the helper is sketched after this
    entry).

    Signed-off-by: Yaowei Bai
    Cc: Mel Gorman
    Acked-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Yaowei Bai
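
    A sketch of the helper (assumed form, relying on the convention that the
    /proc/sys/vm/compact_memory path requests compaction with order == -1):

    static inline bool is_via_compact_memory(int order)
    {
            /* The sysctl trigger asks to compact everything; it is
             * encoded as the otherwise impossible order of -1. */
            return order == -1;
    }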
     
  • Delete the unnecessary if/return to let inactive_anon_is_low_global()
    return the comparison directly (see the sketch after this entry).

    No functional changes.

    Signed-off-by: Yaowei Bai
    Acked-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Yaowei Bai
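
    A sketch of the simplified function (the body is abridged and assumes
    the usual inactive_ratio comparison):

    static bool inactive_anon_is_low_global(struct zone *zone)
    {
            unsigned long active = zone_page_state(zone, NR_ACTIVE_ANON);
            unsigned long inactive = zone_page_state(zone, NR_INACTIVE_ANON);

            /* Previously: if (cond) return true; return false; */
            return inactive * zone->inactive_ratio < active;
    }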