25 Jun, 2015

40 commits

  • The kmemleak memory scanning uses finer grained object->lock spinlocks
    primarily to avoid races with the memory block freeing. However, the
    pointer lookup in the rb tree requires the kmemleak_lock to be held.
    This is currently done in the find_and_get_object() function for each
    pointer-like location read during scanning. While this allows a low
    latency on kmemleak_*() callbacks on other CPUs, the memory scanning is
    slower.

    This patch moves the kmemleak_lock outside the scan_block() loop,
    acquiring/releasing it only once per scanned memory block. The
    allow_resched logic is moved outside scan_block() and a new
    scan_large_block() function is implemented which splits large blocks in
    MAX_SCAN_SIZE chunks with cond_resched() calls in-between. A redundant
    (object->flags & OBJECT_NO_SCAN) check is also removed from
    scan_object().

    With this patch, the kmemleak scanning performance is significantly
    improved: at least 50% with lock debugging disabled and over an order of
    magnitude with lock proving enabled (on an arm64 system).

    Signed-off-by: Catalin Marinas
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Catalin Marinas
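
    A rough sketch of the chunked scanning described above (illustrative only,
    not necessarily the exact patch; MAX_SCAN_SIZE and the scan_block()
    signature are assumed from the existing kmemleak code):

      /* scan a large block in MAX_SCAN_SIZE chunks, rescheduling in between */
      static void scan_large_block(void *start, void *end)
      {
              void *next;

              while (start < end) {
                      next = min(start + MAX_SCAN_SIZE, end);
                      scan_block(start, next, NULL); /* kmemleak_lock taken once here */
                      start = next;
                      cond_resched();
              }
      }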
     
  • While very unlikely (usually kmemleak or sl*b bug), the create_object()
    function in mm/kmemleak.c may fail to insert a newly allocated object into
    the rb tree. When this happens, kmemleak disables itself and prints
    additional information about the object already found in the rb tree.
    Such printing is done with the parent->lock acquired, however the
    kmemleak_lock is already held. This is a potential race with the scanning
    thread which acquires object->lock and kmemleak_lock in a different
    order, potentially leading to deadlock.

    This patch removes the locking around the 'parent' object information
    printing. Such object cannot be freed or removed from object_tree_root
    and object_list since kmemleak_lock is already held. There is a very
    small risk that some of the object data is being modified on another CPU
    but the only downside is inconsistent information printing.

    Signed-off-by: Catalin Marinas
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Catalin Marinas
     
  • The kmemleak_do_cleanup() work thread already waits for the kmemleak_scan
    thread to finish via kthread_stop(). Waiting in kthread_stop() while
    scan_mutex is held may lead to deadlock if kmemleak_scan_thread() also
    waits to acquire scan_mutex.

    Signed-off-by: Catalin Marinas
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Catalin Marinas
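
    One way to avoid such a deadlock, sketched below (a simplification, not
    necessarily the actual fix):

      static void kmemleak_do_cleanup(struct work_struct *work)
      {
              stop_scan_thread();          /* kthread_stop() with no locks held */

              mutex_lock(&scan_mutex);     /* safe now: the scan thread is gone */
              __kmemleak_do_cleanup();
              mutex_unlock(&scan_mutex);
      }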
     
  • Calling delete_object_*() on the same pointer is not a standard use case
    (unless there is a bug in the code calling kmemleak_free()). However,
    during kmemleak disabling (error or user triggered via /sys), there is a
    potential race between kmemleak_free() calls on a CPU and
    __kmemleak_do_cleanup() on a different CPU.

    The current delete_object_*() implementation first performs a look-up
    holding kmemleak_lock, increments the object->use_count and then
    re-acquires kmemleak_lock to remove the object from object_tree_root and
    object_list.

    This patch simplifies the delete_object_*() mechanism to both look up
    and remove an object from the object_tree_root and object_list
    atomically (guarded by kmemleak_lock). This allows safe concurrent
    calls to delete_object_*() on the same pointer without additional
    locking for synchronising the kmemleak_free_enabled flag.

    A side effect is a slight improvement in the delete_object_*() performance
    by avoiding acquiring kmemleak_lock twice and incrementing/decrementing
    object->use_count.

    Signed-off-by: Catalin Marinas
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Catalin Marinas
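
    A hypothetical sketch of the combined lookup-and-remove (simplified; the
    lock primitive and field names are assumptions):

      static struct kmemleak_object *
      find_and_remove_object(unsigned long ptr, int alias)
      {
              unsigned long flags;
              struct kmemleak_object *object;

              write_lock_irqsave(&kmemleak_lock, flags);
              object = lookup_object(ptr, alias);
              if (object) {
                      rb_erase(&object->rb_node, &object_tree_root);
                      list_del_rcu(&object->object_list);
              }
              write_unlock_irqrestore(&kmemleak_lock, flags);

              return object;     /* the caller frees it outside kmemleak_lock */
      }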
     
  • The kmemleak scanning thread can run for minutes. Callbacks like
    kmemleak_free() are allowed during this time, the race being taken care
    of by the object->lock spinlock. Such lock also prevents a memory block
    from being freed or unmapped while it is being scanned by blocking the
    kmemleak_free() -> ... -> __delete_object() function until the lock is
    released in scan_object().

    When a kmemleak error occurs (e.g. it fails to allocate its metadata),
    kmemleak_enabled is cleared and __delete_object() is no longer called on
    freed objects. If kmemleak_scan is running at the same time,
    kmemleak_free() no longer waits for the object scanning to complete,
    allowing the corresponding memory block to be freed or unmapped (in the
    case of vfree()). This leads to kmemleak_scan potentially triggering a
    page fault.

    This patch separates the kmemleak_free() enabling/disabling from the
    overall kmemleak_enabled knob so that we can defer the disabling of the
    object freeing tracking until the scanning thread has completed. The
    kmemleak_free_part() is deliberately ignored by this patch since this is
    only called during boot before the scanning thread started.

    Signed-off-by: Catalin Marinas
    Reported-by: Vignesh Radhakrishnan
    Tested-by: Vignesh Radhakrishnan
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Catalin Marinas
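
    Schematically, the freeing callback then gates on its own flag rather than
    on the global enable (a sketch, names taken from the changelog):

      void kmemleak_free(const void *ptr)
      {
              if (kmemleak_free_enabled && ptr && !IS_ERR(ptr))
                      delete_object_full((unsigned long)ptr);
      }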
     
  • memcg->under_oom tracks whether the memcg is under OOM conditions and is
    an atomic_t counter managed with mem_cgroup_[un]mark_under_oom(). While
    atomic_t appears to be simple synchronization-wise, when used as a
    synchronization construct like here, it's trickier and more error-prone
    due to weak memory ordering rules, especially around atomic_read(), and
    can give a false sense of security.

    For example, both non-trivial read sites of memcg->under_oom are a bit
    problematic, although not actually broken.

    * mem_cgroup_oom_register_event()

    It isn't explicit what guarantees the memory ordering between event
    addition and memcg->under_oom check. This isn't broken only because
    memcg_oom_lock is used for both event list and memcg->oom_lock.

    * memcg_oom_recover()

    The lockless test doesn't have any explanation why this would be
    safe.

    mem_cgroup_[un]mark_under_oom() are very cold paths and there's no point
    in avoiding locking memcg_oom_lock there. This patch converts
    memcg->under_oom from atomic_t to int, puts its modifications under
    memcg_oom_lock and documents why the lockless test in
    memcg_oom_recover() is safe.

    Signed-off-by: Tejun Heo
    Acked-by: Michal Hocko
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tejun Heo
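
    Roughly, the marking helpers become plain increments and decrements under
    the lock (a sketch based on the description above):

      static void mem_cgroup_mark_under_oom(struct mem_cgroup *memcg)
      {
              struct mem_cgroup *iter;

              spin_lock(&memcg_oom_lock);
              for_each_mem_cgroup_tree(iter, memcg)
                      iter->under_oom++;    /* plain int, modified under the lock */
              spin_unlock(&memcg_oom_lock);
      }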
     
  • Since commit 4942642080ea ("mm: memcg: handle non-error OOM situations
    more gracefully"), nobody uses mem_cgroup->oom_wakeups. Remove it.

    While at it, also fold memcg_wakeup_oom() into memcg_oom_recover() which
    is its only user. This cleanup was suggested by Michal.

    Signed-off-by: Tejun Heo
    Acked-by: Michal Hocko
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tejun Heo
     
  • Change frontswap single pointer to a singly linked list of frontswap
    implementations. Update Xen tmem implementation as register no longer
    returns anything.

    Frontswap only keeps track of a single implementation; any
    implementation that registers second (or later) will replace the
    previously registered implementation, and gets a pointer to the previous
    implementation that the new implementation is expected to pass all
    frontswap functions to if it can't handle the function itself. However
    that method doesn't really make much sense, as passing that work on to
    every implementation adds unnecessary work to implementations; instead,
    frontswap should simply keep a list of all registered implementations
    and try each implementation for any function. Most importantly, neither
    of the two currently existing frontswap implementations in the kernel
    actually do anything with any previous frontswap implementation that
    they replace when registering.

    This allows frontswap to successfully manage multiple implementations by
    keeping a list of them all.

    Signed-off-by: Dan Streetman
    Cc: Konrad Rzeszutek Wilk
    Cc: Boris Ostrovsky
    Cc: David Vrabel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dan Streetman
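
    A simplified sketch of the list-based dispatch (not the exact kernel code;
    the real registration also synchronises with already-enabled swap devices):

      static struct frontswap_ops *frontswap_ops;   /* head of registered backends */

      void frontswap_register_ops(struct frontswap_ops *ops)
      {
              ops->next = frontswap_ops;
              frontswap_ops = ops;
      }

      int __frontswap_store(struct page *page)
      {
              swp_entry_t entry = { .val = page_private(page) };
              int type = swp_type(entry);
              pgoff_t offset = swp_offset(entry);
              struct frontswap_ops *ops;
              int ret = -1;

              for (ops = frontswap_ops; ops != NULL; ops = ops->next) {
                      ret = ops->store(type, offset, page); /* first success wins */
                      if (ret == 0)
                              break;
              }
              return ret;
      }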
     
  • UEFI GetMemoryMap() uses a new attribute bit to mark mirrored memory
    address ranges. See UEFI 2.5 spec pages 157-158:

    http://www.uefi.org/sites/default/files/resources/UEFI%202_5.pdf

    On EFI enabled systems scan the memory map and tell memblock about any
    mirrored ranges.

    Signed-off-by: Tony Luck
    Cc: Xishi Qiu
    Cc: Hanjun Guo
    Cc: Xiexiuqi
    Cc: Ingo Molnar
    Cc: Thomas Gleixner
    Cc: "H. Peter Anvin"
    Cc: Yinghai Lu
    Cc: Naoya Horiguchi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tony Luck
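
    Roughly what the memory map scan does (simplified; EFI_MEMORY_MORE_RELIABLE
    is the UEFI 2.5 attribute bit mentioned above):

      void __init efi_find_mirror(void)
      {
              void *p;
              u64 mirror_size = 0;

              for (p = memmap.map; p < memmap.map_end; p += memmap.desc_size) {
                      efi_memory_desc_t *md = p;
                      unsigned long long start = md->phys_addr;
                      unsigned long long size = md->num_pages << EFI_PAGE_SHIFT;

                      if (md->attribute & EFI_MEMORY_MORE_RELIABLE) {
                              memblock_mark_mirror(start, size);
                              mirror_size += size;
                      }
              }
              if (mirror_size)
                      pr_info("Memory: %lluM mirrored memory\n", mirror_size >> 20);
      }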
     
  • Try to allocate all boot time kernel data structures from mirrored
    memory.

    If we run out of mirrored memory print warnings, but fall back to using
    non-mirrored memory to make sure that we still boot.

    By number of bytes, most of what we allocate at boot time is the page
    structures. 64 bytes per 4K page on x86_64 ... or about 1.5% of total
    system memory. For workloads where the bulk of memory is allocated to
    applications this may represent a useful improvement to system
    availability since 1.5% of total memory might be a third of the memory
    allocated to the kernel.

    Signed-off-by: Tony Luck
    Cc: Xishi Qiu
    Cc: Hanjun Guo
    Cc: Xiexiuqi
    Cc: Ingo Molnar
    Cc: Thomas Gleixner
    Cc: "H. Peter Anvin"
    Cc: Yinghai Lu
    Cc: Naoya Horiguchi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tony Luck
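
    A sketch of the "prefer mirrored, fall back to anything" pattern (a
    hypothetical helper; the real change lives inside the memblock allocator
    itself):

      static phys_addr_t __init
      alloc_boot_data(phys_addr_t size, phys_addr_t align, int nid)
      {
              phys_addr_t addr;

              addr = memblock_find_in_range_node(size, align, 0,
                                                 MEMBLOCK_ALLOC_ACCESSIBLE,
                                                 nid, MEMBLOCK_MIRROR);
              if (addr)
                      return addr;

              pr_warn_once("Not enough mirrored memory, using regular memory\n");
              return memblock_find_in_range_node(size, align, 0,
                                                 MEMBLOCK_ALLOC_ACCESSIBLE,
                                                 nid, MEMBLOCK_NONE);
      }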
     
  • Some high end Intel Xeon systems report uncorrectable memory errors as a
    recoverable machine check. Linux has included code for some time to
    process these and just signal the affected processes (or even recover
    completely if the error was in a read only page that can be replaced by
    reading from disk).

    But we have no recovery path for errors encountered during kernel code
    execution. Except for some very specific cases we are unlikely to ever
    be able to recover.

    Enter memory mirroring. Actually, the 3rd generation of memory mirroring.

    Gen1: All memory is mirrored
    Pro: No s/w enabling - h/w just gets good data from other side of the
    mirror
    Con: Halves effective memory capacity available to OS/applications

    Gen2: Partial memory mirror - just mirror memory behind some memory controllers
    Pro: Keep more of the capacity
    Con: Nightmare to enable. Have to choose between allocating from
    mirrored memory for safety vs. NUMA local memory for performance

    Gen3: Address range partial memory mirror - some mirror on each memory
    controller
    Pro: Can tune the amount of mirror and keep NUMA performance
    Con: I have to write memory management code to implement

    The current plan is just to use mirrored memory for kernel allocations.
    This has been broken into two phases:

    1) This patch series - find the mirrored memory, use it for boot time
    allocations

    2) Wade into mm/page_alloc.c and define a ZONE_MIRROR to pick up the
    unused mirrored memory from mm/memblock.c and only give it out to
    select kernel allocations (this is still being scoped because
    page_alloc.c is scary).

    This patch (of 3):

    Add extra "flags" to memblock to allow selection of memory based on
    attribute. No functional changes

    Signed-off-by: Tony Luck
    Cc: Xishi Qiu
    Cc: Hanjun Guo
    Cc: Xiexiuqi
    Cc: Ingo Molnar
    Cc: Thomas Gleixner
    Cc: "H. Peter Anvin"
    Cc: Yinghai Lu
    Cc: Naoya Horiguchi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tony Luck
     
  • page_cache_read, do_generic_file_read, __generic_file_splice_read and
    __ntfs_grab_cache_pages currently ignore mapping_gfp_mask when calling
    add_to_page_cache_lru which might cause recursion into fs down in the
    direct reclaim path if the mapping really relies on GFP_NOFS semantic.

    This doesn't seem to be the case now because page_cache_read (page fault
    path) doesn't seem to suffer from the reclaim recursion issues and
    do_generic_file_read and __generic_file_splice_read also shouldn't be
    called under fs locks which would deadlock in the reclaim path. Anyway it
    is better to obey the mapping gfp mask and prevent later breakage.

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Michal Hocko
    Cc: Dave Chinner
    Cc: Neil Brown
    Cc: Johannes Weiner
    Cc: Al Viro
    Cc: Mel Gorman
    Cc: Rik van Riel
    Cc: Tetsuo Handa
    Cc: Anton Altaparmakov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
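
    The kind of change involved, sketched for one of the callers (simplified):

      static int page_cache_read(struct file *file, pgoff_t offset)
      {
              struct address_space *mapping = file->f_mapping;
              struct page *page;
              int ret;

              do {
                      page = page_cache_alloc_cold(mapping);
                      if (!page)
                              return -ENOMEM;

                      /* was a plain GFP_KERNEL; now constrained by the mapping */
                      ret = add_to_page_cache_lru(page, mapping, offset,
                                      GFP_KERNEL & mapping_gfp_mask(mapping));
                      if (ret == 0)
                              ret = mapping->a_ops->readpage(file, page);
                      else if (ret == -EEXIST)
                              ret = 0; /* losing the race to add the page is fine */

                      page_cache_release(page);
              } while (ret == AOP_TRUNCATED_PAGE);

              return ret;
      }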
     
  • Signed-off-by: Shailendra Verma
    Acked-by: Michal Nazarewicz
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Shailendra Verma
     
  • In oom_kill_process(), the variable 'points' is unsigned int. Print it as
    such.

    Signed-off-by: Wang Long
    Acked-by: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wang Long
     
  • alloc_huge_page and hugetlb_reserve_pages use region_chg to calculate the
    number of pages which will be added to the reserve map. Subpool and
    global reserve counts are adjusted based on the output of region_chg.
    Before the pages are actually added to the reserve map, these routines
    could race and add fewer pages than expected. If this happens, the
    subpool and global reserve counts are not correct.

    Compare the number of pages actually added (region_add) to those expected
    to be added (region_chg). If fewer pages are actually added, this indicates
    a race and the counters are adjusted accordingly.

    Signed-off-by: Mike Kravetz
    Reviewed-by: Naoya Horiguchi
    Reviewed-by: Davidlohr Bueso
    Cc: David Rientjes
    Cc: Luiz Capitulino
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mike Kravetz
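
    A rough sketch of the adjustment in the reservation path (simplified;
    variable names are hypothetical and taken from the surrounding context):

      long chg, add;

      chg = region_chg(resv_map, from, to);   /* pages we expect to add */
      /* ... charge the subpool and the global reserve based on chg ... */
      add = region_add(resv_map, from, to);   /* pages actually added */

      if (add < chg) {
              /*
               * A racing task already added some of these entries; give back
               * the reservations charged for pages that were never added.
               */
              hugepage_subpool_put_pages(spool, chg - add);
              hugetlb_acct_memory(h, -(chg - add));
      }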
     
  • Modify region_add() to keep track of regions (pages) added to the reserve
    map and return this value. The return value can be compared to the return
    value of region_chg() to determine if the map was modified between calls.

    Make vma_commit_reservation() also pass along the return value of
    region_add(). In the normal case, we want vma_commit_reservation to
    return the same value as the preceding call to vma_needs_reservation.
    Create a common __vma_reservation_common routine to help keep the special
    case return values in sync.

    Signed-off-by: Mike Kravetz
    Cc: Naoya Horiguchi
    Cc: Davidlohr Bueso
    Cc: David Rientjes
    Cc: Luiz Capitulino
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mike Kravetz
     
  • While working on hugetlbfs fallocate support, I noticed the following race
    in the existing code. It is unlikely that this race is hit very often in
    the current code. However, if more functionality to add and remove pages
    to hugetlbfs mappings (such as fallocate) is added the likelihood of
    hitting this race will increase.

    alloc_huge_page and hugetlb_reserve_pages use information from the reserve
    map to determine if there are enough available huge pages to complete the
    operation, as well as adjust global reserve and subpool usage counts. The
    order of operations is as follows:

    - call region_chg() to determine the expected change based on reserve map
    - determine if enough resources are available for this operation
    - adjust global counts based on the expected change
    - call region_add() to update the reserve map

    The issue is that reserve map could change between the call to region_chg
    and region_add. In this case, the counters which were adjusted based on
    the output of region_chg will not be correct.

    In order to hit this race today, there must be an existing shared hugetlb
    mmap created with the MAP_NORESERVE flag. A page fault to allocate a huge
    page via this mapping must occur at the same time another task is mapping
    the same region without the MAP_NORESERVE flag.

    The patch set does not prevent the race from happening. Rather, it adds
    simple functionality to detect when the race has occurred. If a race is
    detected, then the incorrect counts are adjusted.

    Review comments pointed out the need for documentation of the existing
    region/reserve map routines. This patch set also adds documentation in
    this area.

    This patch (of 3):

    This is a documentation only patch and does not modify any code.
    Descriptions of the routines used for reserve map/region tracking are
    added.

    Signed-off-by: Mike Kravetz
    Cc: Naoya Horiguchi
    Cc: Davidlohr Bueso
    Cc: David Rientjes
    Cc: Luiz Capitulino
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mike Kravetz
     
  • There is a very subtle difference between mmap()+mlock() vs
    mmap(MAP_LOCKED) semantics. The former one fails if the population of the
    area fails while the latter one doesn't. This basically means that
    mmap(MAP_LOCKED) areas might see a major fault after the mmap syscall
    returns, which is not the case for mlock. The mmap man page has already
    been altered
    but Documentation/vm/unevictable-lru.txt deserves a clarification as well.

    Signed-off-by: Michal Hocko
    Reported-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
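
    A userspace illustration of the difference (a sketch; error handling kept
    minimal):

      #include <stddef.h>
      #include <sys/mman.h>

      /* succeeds only if the whole range could be populated and locked */
      static void *map_and_lock(size_t len)
      {
              void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                             MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
              if (p == MAP_FAILED)
                      return NULL;
              if (mlock(p, len) != 0) {   /* population failure reported here */
                      munmap(p, len);
                      return NULL;
              }
              return p;
      }

      /*
       * mmap(... MAP_LOCKED ...), by contrast, may return a valid mapping even
       * though locking/populating the pages failed, so the process can still
       * take major faults on that range later.
       */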
     
  • kenter/kleave/kdebug are wrapper macros to print function flow and debug
    information. This set was written before pr_devel() was introduced, so it
    was controlled by an "#if 0" construction. It is questionable if anyone is
    using them [1] now.

    This patch removes these macros, converts numerous printk(KERN_WARNING,
    ...) to use general pr_warn(...) and removes debug print line from
    validate_mmap_request() function.

    Signed-off-by: Leon Romanovsky
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Leon Romanovsky
     
  • We have confusing functions to clear pmd, pmd_clear_* and pmd_clear. Add
    _huge_ to pmdp_clear functions so that we are clear that they operate on
    hugepage pte.

    We don't bother about other functions like pmdp_set_wrprotect,
    pmdp_clear_flush_young, because they operate on PTE bits and hence
    indicate they are operating on hugepage ptes

    Signed-off-by: Aneesh Kumar K.V
    Acked-by: Kirill A. Shutemov
    Cc: Benjamin Herrenschmidt
    Cc: Paul Mackerras
    Cc: Michael Ellerman
    Cc: Andrea Arcangeli
    Cc: Martin Schwidefsky
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Aneesh Kumar K.V
     
  • Also move the pmd_trans_huge check to generic code.

    Signed-off-by: Aneesh Kumar K.V
    Acked-by: Kirill A. Shutemov
    Cc: Benjamin Herrenschmidt
    Cc: Paul Mackerras
    Cc: Michael Ellerman
    Cc: Andrea Arcangeli
    Cc: Martin Schwidefsky
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Aneesh Kumar K.V
     
  • Architectures like ppc64 [1] need to do special things while clearing pmd
    before a collapse. For them this operation is largely different from a
    normal hugepage pte clear. Hence add a separate function to clear pmd
    before collapse. After this patch pmdp_* functions operate only on
    hugepage pte, and not on regular pmd_t values pointing to page table.

    [1] ppc64 needs to invalidate all the normal page pte mappings we already
    have inserted in the hardware hash page table. But before doing that we
    need to make sure there are no parallel hash page table inserts going on.
    So we need to do a kick_all_cpus_sync() before flushing the older hash
    table entries. By moving this to a separate function we capture these
    details and mention how it is different from a hugepage pte clear.

    This patch is a cleanup and only does code movement for clarity. There
    should not be any change in functionality.

    Signed-off-by: Aneesh Kumar K.V
    Acked-by: Kirill A. Shutemov
    Cc: Benjamin Herrenschmidt
    Cc: Paul Mackerras
    Cc: Michael Ellerman
    Cc: Andrea Arcangeli
    Cc: Martin Schwidefsky
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Aneesh Kumar K.V
     
  • RAS user space tools like rasdaemon, which are based on trace events, can
    receive MCE error events but no memory recovery result event. So, I want
    to add this event to make the scenario complete.

    This patch adds an event to the ras group for memory-failure.

    The output looks like the following:
    # tracer: nop
    #
    # entries-in-buffer/entries-written: 2/2 #P:24
    #
    # _-----=> irqs-off
    # / _----=> need-resched
    # | / _---=> hardirq/softirq
    # || / _--=> preempt-depth
    # ||| / delay
    # TASK-PID CPU# |||| TIMESTAMP FUNCTION
    # | | | |||| | |
    mce-inject-13150 [001] .... 277.019359: memory_failure_event: pfn 0x19869: recovery action for free buddy page: Delayed

    [xiexiuqi@huawei.com: fix build error]
    Signed-off-by: Xie XiuQi
    Reviewed-by: Naoya Horiguchi
    Acked-by: Steven Rostedt
    Cc: Tony Luck
    Cc: Chen Gong
    Cc: Jim Davis
    Signed-off-by: Xie XiuQi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Xie XiuQi
     
  • Change the type of action_result()'s third parameter to an enum for type
    consistency, and rename mf_outcome to mf_result for clarity.

    Signed-off-by: Xie XiuQi
    Acked-by: Naoya Horiguchi
    Cc: Chen Gong
    Cc: Jim Davis
    Cc: Steven Rostedt
    Cc: Tony Luck
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Xie XiuQi
     
  • Export 'outcome' and 'action_page_type' to mm.h, so we can use these
    enums outside.

    This patch is preparation for adding trace events for memory-failure
    recovery action.

    Signed-off-by: Xie XiuQi
    Acked-by: Naoya Horiguchi
    Cc: Chen Gong
    Cc: Jim Davis
    Cc: Steven Rostedt
    Cc: Tony Luck
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Xie XiuQi
     
  • Historically memcg overhead was high even if memcg was unused. This has
    improved a lot but it still showed up in a profile summary as being a
    problem.

    /usr/src/linux-4.0-vanilla/mm/memcontrol.c 6.6441 395842
    mem_cgroup_try_charge 2.950% 175781
    __mem_cgroup_count_vm_event 1.431% 85239
    mem_cgroup_page_lruvec 0.456% 27156
    mem_cgroup_commit_charge 0.392% 23342
    uncharge_list 0.323% 19256
    mem_cgroup_update_lru_size 0.278% 16538
    memcg_check_events 0.216% 12858
    mem_cgroup_charge_statistics.isra.22 0.188% 11172
    try_charge 0.150% 8928
    commit_charge 0.141% 8388
    get_mem_cgroup_from_mm 0.121% 7184

    That is showing that 6.64% of system CPU cycles were in memcontrol.c and
    dominated by mem_cgroup_try_charge. The annotation shows that the bulk
    of the cost was checking PageSwapCache which is expected to be cache hot
    but is very expensive. The problem appears to be that __SetPageUptodate
    is called just before the check which is a write barrier. It is
    required to make sure struct page and page data is written before the
    PTE is updated and the data visible to userspace. memcg charging does
    not require or need the barrier but gets unfairly hit with the cost so
    this patch attempts the charging before the barrier. Aside from the
    accidental cost to memcg there is the added benefit that the barrier is
    avoided if the page cannot be charged. When applied the relevant
    profile summary is as follows.

    /usr/src/linux-4.0-chargefirst-v2r1/mm/memcontrol.c 3.7907 223277
    __mem_cgroup_count_vm_event 1.143% 67312
    mem_cgroup_page_lruvec 0.465% 27403
    mem_cgroup_commit_charge 0.381% 22452
    uncharge_list 0.332% 19543
    mem_cgroup_update_lru_size 0.284% 16704
    get_mem_cgroup_from_mm 0.271% 15952
    mem_cgroup_try_charge 0.237% 13982
    memcg_check_events 0.222% 13058
    mem_cgroup_charge_statistics.isra.22 0.185% 10920
    commit_charge 0.140% 8235
    try_charge 0.131% 7716

    That brings the overhead down to 3.79% and leaves the memcg fault
    accounting to the root cgroup but it's an improvement. The difference
    in headline performance of the page fault microbench is marginal as
    memcg is such a small component of it.

    pft faults
                               4.0.0 vanilla       4.0.0 chargefirst
    Hmean faults/cpu-1 1443258.1051 ( 0.00%) 1509075.7561 ( 4.56%)
    Hmean faults/cpu-3 1340385.9270 ( 0.00%) 1339160.7113 ( -0.09%)
    Hmean faults/cpu-5 875599.0222 ( 0.00%) 874174.1255 ( -0.16%)
    Hmean faults/cpu-7 601146.6726 ( 0.00%) 601370.9977 ( 0.04%)
    Hmean faults/cpu-8 510728.2754 ( 0.00%) 510598.8214 ( -0.03%)
    Hmean faults/sec-1 1432084.7845 ( 0.00%) 1497935.5274 ( 4.60%)
    Hmean faults/sec-3 3943818.1437 ( 0.00%) 3941920.1520 ( -0.05%)
    Hmean faults/sec-5 3877573.5867 ( 0.00%) 3869385.7553 ( -0.21%)
    Hmean faults/sec-7 3991832.0418 ( 0.00%) 3992181.4189 ( 0.01%)
    Hmean faults/sec-8 3987189.8167 ( 0.00%) 3986452.2204 ( -0.02%)

    It's only visible in the single-threaded case. The overhead is there at
    higher thread counts but other factors dominate.

    Signed-off-by: Mel Gorman
    Acked-by: Michal Hocko
    Acked-by: Johannes Weiner
    Cc: Tejun Heo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
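
    The kind of reordering involved, sketched for an anonymous fault path
    (simplified):

      /* before: the write barrier in __SetPageUptodate() is paid even when
       * the charge fails */
      __SetPageUptodate(page);
      if (mem_cgroup_try_charge(page, mm, GFP_KERNEL, &memcg))
              goto oom_free_page;

      /* after: charge first, so a failed charge skips the barrier entirely */
      if (mem_cgroup_try_charge(page, mm, GFP_KERNEL, &memcg))
              goto oom_free_page;
      __SetPageUptodate(page);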
     
  • hugetlb pages uses add_to_page_cache to track shared mappings. This is
    OK from the data structure point of view but it is less so from the
    NR_FILE_PAGES accounting:

    - huge pages are accounted as 4k which is clearly wrong
    - this counter is used as the amount of the reclaimable page
      cache which is incorrect as well because hugetlb pages are
      special and not reclaimable
    - the counter is then exported to userspace via /proc/meminfo
      (in Cached:), /proc/vmstat and /proc/zoneinfo as
      nr_file_pages which is confusing at least:
          Cached:           8883504 kB
          HugePages_Free:   8348
          ...
          Cached:           8916048 kB
          HugePages_Free:   156
          ...
      that's 8192 huge pages allocated which is ~16G accounted as 32M

    There are usually not that many huge pages in the system for this to
    make any visible difference e.g. by fooling __vm_enough_memory or
    zone_pagecache_reclaimable.

    Fix this by special casing huge pages in both __delete_from_page_cache
    and __add_to_page_cache_locked. replace_page_cache_page is currently
    only used by fuse and that shouldn't touch hugetlb pages AFAICS but it
    is more robust to check for special casing there as well.

    Hugetlb pages shouldn't get to any other paths where we do accounting:
    - migration - we have a special handling via
    hugetlbfs_migrate_page
    - shmem - doesn't handle hugetlb pages directly even for
    SHM_HUGETLB resp. MAP_HUGETLB
    - swapcache - hugetlb is not swapable

    This has a user visible effect but I believe it is reasonable because the
    previously exported number is simply bogus.

    An alternative would be to account hugetlb pages with their real size and
    treat them similar to shmem. But this has some drawbacks.

    First we would have to special case in kernel users of NR_FILE_PAGES and
    considering how hugetlb is special we would have to do it everywhere. We
    do not want Cached exported by /proc/meminfo to include it because the
    value would be even more misleading.

    __vm_enough_memory and zone_pagecache_reclaimable would have to do the
    same thing because those pages are simply not reclaimable. The correction
    is even not trivial because we would have to consider all active hugetlb
    page sizes properly. Users of the counter outside of the kernel would
    have to do the same.

    So the question is why to account something that needs to be basically
    excluded for each reasonable usage. This doesn't make much sense to me.

    It seems that this has been broken since hugetlb was introduced but I
    haven't checked the whole history.

    [akpm@linux-foundation.org: tweak comments]
    Signed-off-by: Michal Hocko
    Acked-by: Mel Gorman
    Tested-by: Mike Kravetz
    Acked-by: Johannes Weiner
    Reviewed-by: Naoya Horiguchi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
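
    The special casing amounts to guarding the NR_FILE_PAGES updates, roughly
    (a sketch of the two call sites named above):

      /* in __delete_from_page_cache(): */
      if (!PageHuge(page))
              __dec_zone_page_state(page, NR_FILE_PAGES);

      /* in __add_to_page_cache_locked(): */
      if (!PageHuge(page))
              __inc_zone_page_state(page, NR_FILE_PAGES);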
     
  • The should_alloc_retry() function was meant to encapsulate retry
    conditions of the allocator slowpath, but there are still checks
    remaining in the main function, and much of how the retrying is
    performed also depends on the OOM killer progress. The physical
    separation of those conditions makes the code hard to follow.

    Inline the should_alloc_retry() checks. Notes:

    - The __GFP_NOFAIL check is already done in __alloc_pages_may_oom(),
    replace it with looping on OOM killer progress

    - The pm_suspended_storage() check is meant to skip the OOM killer
    when reclaim has no IO available, move to __alloc_pages_may_oom()

    - The order
    Acked-by: Michal Hocko
    Cc: David Rientjes
    Cc: Tetsuo Handa
    Cc: Andrea Arcangeli
    Cc: Dave Chinner
    Cc: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • The zonelist locking and the oom_sem are two overlapping locks that are
    used to serialize global OOM killing against different things.

    The historical zonelist locking serializes OOM kills from allocations with
    overlapping zonelists against each other to prevent killing more tasks
    than necessary in the same memory domain. Only when neither tasklists nor
    zonelists from two concurrent OOM kills overlap (tasks in separate memcgs
    bound to separate nodes) are OOM kills allowed to execute in parallel.

    The younger oom_sem is a read-write lock to serialize OOM killing against
    the PM code trying to disable the OOM killer altogether.

    However, the OOM killer is a fairly cold error path, there is really no
    reason to optimize for highly performant and concurrent OOM kills. And
    the oom_sem is just flat-out redundant.

    Replace both locking schemes with a single global mutex serializing OOM
    kills regardless of context.

    Signed-off-by: Johannes Weiner
    Acked-by: Michal Hocko
    Acked-by: David Rientjes
    Cc: Tetsuo Handa
    Cc: Andrea Arcangeli
    Cc: Dave Chinner
    Cc: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • Disabling the OOM killer needs to exclude allocators from entering, not
    existing victims from exiting.

    Right now the only waiter is suspend code, which achieves quiescence by
    disabling the OOM killer. But later on we want to add waits that hold
    the lock instead to stop new victims from showing up.

    Signed-off-by: Johannes Weiner
    Acked-by: Michal Hocko
    Acked-by: David Rientjes
    Cc: Tetsuo Handa
    Cc: Andrea Arcangeli
    Cc: Dave Chinner
    Cc: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • It turns out that the mechanism to wait for exiting OOM victims is less
    generic than it looks: it won't issue wakeups unless the OOM killer is
    disabled.

    The reason this check was added was the thought that, since only the OOM
    disabling code would wait on this queue, wakeup operations could be
    saved when that specific consumer is known to be absent.

    However, this is quite the handgrenade. Later attempts to reuse the
    waitqueue for other purposes will lead to completely unexpected bugs and
    the failure mode will appear seemingly illogical. Generally, providers
    shouldn't make unnecessary assumptions about consumers.

    This could have been replaced with waitqueue_active(), but it only saves
    a few instructions in one of the coldest paths in the kernel. Simply
    remove it.

    Signed-off-by: Johannes Weiner
    Acked-by: Michal Hocko
    Acked-by: David Rientjes
    Cc: Tetsuo Handa
    Cc: Andrea Arcangeli
    Cc: Dave Chinner
    Cc: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • exit_oom_victim() already knows that TIF_MEMDIE is set, and nobody else
    can clear it concurrently. Use clear_thread_flag() directly.

    Signed-off-by: Johannes Weiner
    Acked-by: David Rientjes
    Acked-by: Michal Hocko
    Cc: Tetsuo Handa
    Cc: Andrea Arcangeli
    Cc: Dave Chinner
    Cc: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • Rename unmark_oom_victim() to exit_oom_victim(). Marking and unmarking
    are related in functionality, but the interface is not symmetrical at
    all: one is an internal OOM killer function used during the killing, the
    other is for an OOM victim to signal its own death on exit later on.
    This has locking implications, see follow-up changes.

    While at it, rename mark_tsk_oom_victim() to mark_oom_victim(), which
    is easier on the eye.

    Signed-off-by: Johannes Weiner
    Acked-by: David Rientjes
    Acked-by: Michal Hocko
    Cc: Tetsuo Handa
    Cc: Andrea Arcangeli
    Cc: Dave Chinner
    Cc: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • Setting oom_killer_disabled to false is atomic, there is no need for
    further synchronization with ongoing allocations trying to OOM-kill.

    Signed-off-by: Johannes Weiner
    Acked-by: Michal Hocko
    Acked-by: David Rientjes
    Cc: Tetsuo Handa
    Cc: Andrea Arcangeli
    Cc: Dave Chinner
    Cc: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • Init the zone's size when calculating node totalpages to avoid duplicated
    operations in free_area_init_core().

    Signed-off-by: Gu Zheng
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Gu Zheng
     
  • Currently the initial value of order in dissolve_free_huge_page is 64 or
    32, which leads to the following warning in static checker:

    mm/hugetlb.c:1203 dissolve_free_huge_pages()
    warn: potential right shift more than type allows '9,18,64'

    This is a potential risk of infinite loop, because 1 << order (== 0) is used
    in a for-loop like this:

    for (pfn = start_pfn; pfn < end_pfn; pfn += 1 << order)
    ...

    So this patch fixes it by using global minimum_order calculated at boot time.

    text data bss dec hex filename
    28313 469 84236 113018 1b97a mm/hugetlb.o
    28256 473 84236 112965 1b945 mm/hugetlb.o (patched)

    Fixes: c8721bbbdd36 ("mm: memory-hotplug: enable memory hotplug to handle hugepage")
    Reported-by: Dan Carpenter
    Signed-off-by: Naoya Horiguchi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi
     
  • As noted by Paul the compiler is free to store a temporary result in a
    variable on stack, heap or global unless it is explicitly marked as
    volatile, see:

    http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2015/n4455.html#sample-optimizations

    This can result in a race between do_wp_page() and shrink_active_list()
    as follows.

    In do_wp_page() we can call page_move_anon_rmap(), which sets
    page->mapping as follows:

    anon_vma = (void *) anon_vma + PAGE_MAPPING_ANON;
    page->mapping = (struct address_space *) anon_vma;

    The page in question may be on an LRU list, because nowhere in
    do_wp_page() we remove it from the list, neither do we take any LRU
    related locks. Although the page is locked, shrink_active_list() can
    still call page_referenced() on it concurrently, because the latter does
    not require an anonymous page to be locked:

    CPU0                                    CPU1
    ----                                    ----
    do_wp_page                              shrink_active_list
     lock_page                               page_referenced
                                              PageAnon->yes, so skip trylock_page
     page_move_anon_rmap
      page->mapping = anon_vma
                                              rmap_walk
                                               PageAnon->no
                                                rmap_walk_file
                                                 BUG
      page->mapping += PAGE_MAPPING_ANON

    This patch fixes this race by explicitly forbidding the compiler to split
    page->mapping store in page_move_anon_rmap() with the aid of WRITE_ONCE.

    [akpm@linux-foundation.org: tweak comment, per Minchan]
    Signed-off-by: Vladimir Davydov
    Cc: "Paul E. McKenney"
    Acked-by: Kirill A. Shutemov
    Acked-by: Rik van Riel
    Cc: Hugh Dickins
    Acked-by: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
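
    The fix, roughly (simplified; the real function also contains sanity
    checks, and the address argument is only used by those):

      void page_move_anon_rmap(struct page *page,
                               struct vm_area_struct *vma, unsigned long address)
      {
              struct anon_vma *anon_vma = vma->anon_vma;

              anon_vma = (void *)anon_vma + PAGE_MAPPING_ANON;
              /*
               * Make sure the pointer and the PAGE_MAPPING_ANON bit hit
               * page->mapping in a single store, so that a concurrent
               * rmap_walk() never sees a half-updated value.
               */
              WRITE_ONCE(page->mapping, (struct address_space *)anon_vma);
      }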
     
  • memory_failure() is supposed not to handle thp itself, but to split it.
    But if something were wrong and page_action() were called on a thp, it
    would be better for me_huge_page() (the action routine for hugepages) to
    take no action, rather than to take the wrong action prepared for hugetlb
    (which triggers BUG_ON()).

    This change is for potential problems, but makes sense to me because thp
    is an actively developing feature and this code path can be open in the
    future.

    Signed-off-by: Naoya Horiguchi
    Cc: Andi Kleen
    Cc: Tony Luck
    Cc: "Kirill A. Shutemov"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi
     
  • Stress testing showed that soft offline events for a process iterating
    "mmap-pagefault-munmap" loop can trigger
    VM_BUG_ON(PAGE_FLAGS_CHECK_AT_PREP) in __free_one_page():

    Soft offlining page 0x70fe1 at 0x70100008d000
    Soft offlining page 0x705fb at 0x70300008d000
    page:ffffea0001c3f840 count:0 mapcount:0 mapping: (null) index:0x2
    flags: 0x1fffff80800000(hwpoison)
    page dumped because: VM_BUG_ON_PAGE(page->flags & ((1 << 25) - 1))
    ------------[ cut here ]------------
    kernel BUG at /src/linux-dev/mm/page_alloc.c:585!
    invalid opcode: 0000 [#1] SMP DEBUG_PAGEALLOC
    Modules linked in: cfg80211 rfkill crc32c_intel microcode ppdev parport_pc pcspkr serio_raw virtio_balloon parport i2c_piix4 virtio_blk virtio_net ata_generic pata_acpi floppy
    CPU: 3 PID: 1779 Comm: test_base_madv_ Not tainted 4.0.0-v4.0-150511-1451-00009-g82360a3730e6 #139
    RIP: free_pcppages_bulk+0x52a/0x6f0
    Call Trace:
    drain_pages_zone+0x3d/0x50
    drain_local_pages+0x1d/0x30
    on_each_cpu_mask+0x46/0x80
    drain_all_pages+0x14b/0x1e0
    soft_offline_page+0x432/0x6e0
    SyS_madvise+0x73c/0x780
    system_call_fastpath+0x12/0x17
    Code: ff 89 45 b4 48 8b 45 c0 48 83 b8 a8 00 00 00 00 0f 85 e3 fb ff ff 0f 1f 00 0f 0b 48 8b 7d 90 48 c7 c6 e8 95 a6 81 e8 e6 32 02 00 0b 8b 45 cc 49 89 47 30 41 8b 47 18 83 f8 ff 0f 85 10 ff ff
    RIP [] free_pcppages_bulk+0x52a/0x6f0
    RSP
    ---[ end trace 53926436e76d1f35 ]---

    When soft offline successfully migrates page, the source page is supposed
    to be freed. But there is a race condition where a source page looks
    isolated (i.e. the refcount is 0 and PageHWPoison is set) but is somehow
    still linked to a pcplist. Then another soft offline event calls
    drain_all_pages() and tries to free such hwpoisoned page, which is
    forbidden.

    This odd page state seems to happen due to the race between put_page() in
    putback_lru_page() and __pagevec_lru_add_fn(). But I don't want to play
    with tweaking drain code as done in commit 9ab3b598d2df "mm: hwpoison:
    drop lru_add_drain_all() in __soft_offline_page()", or to change page
    freeing code for this soft offline's purpose.

    Instead, let's think about the difference between hard offline and soft
    offline. There is an interesting difference in how to isolate the in-use
    page between these, that is, hard offline marks PageHWPoison of the target
    page at first, and doesn't free it by keeping its refcount 1. OTOH, soft
    offline tries to free the target page then marks PageHWPoison. This
    difference might be the source of complexity and result in bugs like the
    above. So making soft offline isolate with keeping refcount can be a
    solution for this problem.

    We can pass to page migration code the "reason" which shows the caller, so
    let's use this more to avoid calling putback_lru_page() when called from
    soft offline, which effectively does the isolation for soft offline. With
    this change, target pages of soft offline are never reused without changing
    the migratetype, so this patch also removes the related code.

    Signed-off-by: Naoya Horiguchi
    Cc: Andi Kleen
    Cc: Tony Luck
    Cc: "Kirill A. Shutemov"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi
     
  • memory_failure() can run in 2 different modes (specified by
    MF_COUNT_INCREASED) from the page refcount perspective. When
    MF_COUNT_INCREASED is set, memory_failure() assumes that the caller
    takes a refcount of the target page. And if cleared, memory_failure()
    takes it on its own.

    In current code, however, refcounting is done differently in each caller.
    For example, madvise_hwpoison() uses get_user_pages_fast() and
    hwpoison_inject() uses get_page_unless_zero(). So this inconsistent
    refcounting causes refcount failure especially for thp tail pages.
    Typical user visible effects are like memory leak or
    VM_BUG_ON_PAGE(!page_count(page)) in isolate_lru_page().

    To fix this refcounting issue, this patch introduces get_hwpoison_page()
    to handle thp tail pages in the same manner for each caller of hwpoison
    code.

    memory_failure() might fail to split a thp, and in such a case it returns
    without completing page isolation. This is not good because PageHWPoison
    on the thp is still set and there's no easy way to unpoison such thps. So
    this patch tries to roll back any action on the thp in the "non anonymous
    thp" and "thp split failed" cases, expecting that an MCE (SRAR) generated
    by a later access will properly free such thps.

    [akpm@linux-foundation.org: fix CONFIG_HWPOISON_INJECT=m]
    Signed-off-by: Naoya Horiguchi
    Cc: Andi Kleen
    Cc: Tony Luck
    Cc: "Kirill A. Shutemov"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi
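
    The helper's intent, greatly simplified (the real one distinguishes
    hugetlb, thp head/tail and normal pages):

      int get_hwpoison_page(struct page *page)
      {
              struct page *head = compound_head(page);

              /* pin the compound page via its head so thp tails are handled
               * consistently for every hwpoison caller */
              return get_page_unless_zero(head);
      }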