05 Jun, 2014

40 commits

  • When adding pages to the LRU we clear the active bit unconditionally.
    As the page could be reachable from other paths we cannot use unlocked
    operations without risking corruption, e.g. from a parallel
    mark_page_accessed. This patch tests whether it is necessary to clear
    the active flag before using an atomic operation. This potentially
    opens a tiny race when PageActive is checked, as mark_page_accessed
    could be called after PageActive was checked. The race already exists
    but this patch changes it slightly. The consequence is that a page may
    be promoted to the active list that would have been left on the
    inactive list before the patch. The race is too tiny and the
    consequence too marginal to justify always using atomic operations.
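
    As an illustration, the pattern described boils down to something like
    the following sketch (caller context assumed, not the exact diff):

        /* Only pay for the atomic RMW when the bit is actually set. */
        if (PageActive(page))
                ClearPageActive(page);
        __lru_cache_add(page);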

    Signed-off-by: Mel Gorman
    Acked-by: Johannes Weiner
    Cc: Vlastimil Babka
    Cc: Jan Kara
    Cc: Michal Hocko
    Cc: Hugh Dickins
    Cc: Dave Hansen
    Cc: Theodore Ts'o
    Cc: "Paul E. McKenney"
    Cc: Oleg Nesterov
    Cc: Rik van Riel
    Cc: Peter Zijlstra
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • There should be no references to the page any more and a parallel mark
    should not be reordered against us. Use the non-locked variant to
    clear the page active flag.
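
    A minimal sketch of the change described (free-path context assumed):

        /* No other users can see the page here, so the non-atomic
         * variant is safe and cheaper than a locked bit operation. */
        __ClearPageActive(page);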

    Signed-off-by: Mel Gorman
    Acked-by: Rik van Riel
    Cc: Johannes Weiner
    Cc: Vlastimil Babka
    Cc: Jan Kara
    Cc: Michal Hocko
    Cc: Hugh Dickins
    Cc: Dave Hansen
    Cc: Theodore Ts'o
    Cc: "Paul E. McKenney"
    Cc: Oleg Nesterov
    Cc: Peter Zijlstra
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • shmem_getpage_gfp uses an atomic operation to set the SwapBacked field
    before the page is even added to the LRU or visible elsewhere. This is
    unnecessary as nothing can possibly race against it at that point. Use
    an unlocked variant.
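
    A minimal sketch, assuming the usual page-flag helper naming:

        /* The page is not yet on the LRU or visible to other CPUs,
         * so the non-atomic variant cannot race with anything. */
        __SetPageSwapBacked(page);      /* instead of SetPageSwapBacked() */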

    Signed-off-by: Mel Gorman
    Acked-by: Johannes Weiner
    Acked-by: Rik van Riel
    Cc: Vlastimil Babka
    Cc: Jan Kara
    Cc: Michal Hocko
    Cc: Hugh Dickins
    Cc: Dave Hansen
    Cc: Theodore Ts'o
    Cc: "Paul E. McKenney"
    Cc: Oleg Nesterov
    Cc: Peter Zijlstra
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • cold is used as a boolean, so make it one. Also make the likely case
    the "if" part of the block instead of the "else", as the optimisation
    manual says this is preferred.
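
    A sketch of the resulting shape in free_hot_cold_page() (surrounding
    code elided):

        static void free_hot_cold_page(struct page *page, bool cold)
        {
                /* ... per-cpu list setup elided ... */
                if (!cold)      /* hot pages are the common case */
                        list_add(&page->lru, &pcp->lists[migratetype]);
                else
                        list_add_tail(&page->lru, &pcp->lists[migratetype]);
                /* ... */
        }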

    Signed-off-by: Mel Gorman
    Acked-by: Rik van Riel
    Cc: Johannes Weiner
    Cc: Vlastimil Babka
    Cc: Jan Kara
    Cc: Michal Hocko
    Cc: Hugh Dickins
    Cc: Dave Hansen
    Cc: Theodore Ts'o
    Cc: "Paul E. McKenney"
    Cc: Oleg Nesterov
    Cc: Peter Zijlstra
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • X86 prefers the use of unsigned types for iterators and there is a
    tendency to mix whether a signed or unsigned type is used for page
    order. This converts a number of sites in mm/page_alloc.c to use
    unsigned int for order where possible.
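
    For example, one of the conversions might look like (signature shape
    assumed):

        static void __free_pages_ok(struct page *page, unsigned int order);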

    Signed-off-by: Mel Gorman
    Acked-by: Rik van Riel
    Cc: Johannes Weiner
    Cc: Vlastimil Babka
    Cc: Jan Kara
    Cc: Michal Hocko
    Cc: Hugh Dickins
    Cc: Dave Hansen
    Cc: Theodore Ts'o
    Cc: "Paul E. McKenney"
    Cc: Oleg Nesterov
    Cc: Peter Zijlstra
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • get_pageblock_migratetype() is called during free with IRQs disabled.
    This serves no purpose and keeps IRQs disabled for longer than
    necessary.
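
    A sketch of the reordering described (helper names from the free path
    of that era):

        /* Look up the migratetype while IRQs are still enabled... */
        migratetype = get_pageblock_migratetype(page);
        local_irq_save(flags);
        /* ...so only the freeing work runs with IRQs off. */
        __count_vm_event(PGFREE);
        /* ... actual freeing elided ... */
        local_irq_restore(flags);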

    Signed-off-by: Mel Gorman
    Acked-by: Rik van Riel
    Cc: Johannes Weiner
    Acked-by: Vlastimil Babka
    Cc: Jan Kara
    Cc: Michal Hocko
    Cc: Hugh Dickins
    Cc: Dave Hansen
    Cc: Theodore Ts'o
    Cc: "Paul E. McKenney"
    Cc: Oleg Nesterov
    Cc: Peter Zijlstra
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • In the free path we calculate page_to_pfn multiple times. Reduce that.
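
    For instance, computing the pfn once up front and passing it down
    (parameter shape assumed):

        static void __free_pages_ok(struct page *page, unsigned int order)
        {
                unsigned long pfn = page_to_pfn(page);  /* computed once */

                /* ... pfn is then passed down instead of being
                 * recomputed, e.g. free_one_page(page_zone(page), page,
                 * pfn, order, migratetype); */
        }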

    Signed-off-by: Mel Gorman
    Acked-by: Rik van Riel
    Cc: Johannes Weiner
    Acked-by: Vlastimil Babka
    Cc: Jan Kara
    Cc: Michal Hocko
    Cc: Hugh Dickins
    Cc: Dave Hansen
    Cc: Theodore Ts'o
    Cc: "Paul E. McKenney"
    Cc: Oleg Nesterov
    Cc: Peter Zijlstra
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • The test_bit operations in get/set pageblock flags are expensive. This
    patch reads the bitmap on a word basis and uses shifts and masks to
    isolate the bits of interest. Similarly, masks are used to set a local
    copy of the bitmap, and cmpxchg is then used to update the bitmap if
    there have been no other changes made in parallel.

    In a test running dd onto tmpfs the overhead of the pageblock-related
    functions went from 1.27% in profiles to 0.5%.

    In addition to the performance benefits, this patch closes races that
    are possible between:

    a) get_ and set_pageblock_migratetype(), where
    get_pageblock_migratetype() reads part of the bits before and the
    other part of the bits after set_pageblock_migratetype() has updated
    them.

    b) set_pageblock_migratetype() and set_pageblock_skip(), where the
    non-atomic read-modify-update set bit operation in
    set_pageblock_skip() will cause lost updates to some bits changed in
    set_pageblock_migratetype().

    Joonsoo Kim first reported case a) via code inspection. Vlastimil
    Babka's testing with a debug patch showed that either a) or b) occurs
    roughly once per mmtests' stress-highalloc benchmark (although not
    necessarily in the same pageblock). Furthermore, during development of
    unrelated compaction patches it was observed that with frequent calls
    to {start,undo}_isolate_page_range() the race occurs several thousand
    times and has resulted in NULL pointer dereferences in
    move_freepages() and free_one_page() in places where
    free_list[migratetype] is manipulated by e.g. list_move(). Further
    debugging confirmed that migratetype had the invalid value of 6,
    causing out-of-bounds access to the free_list array.

    That confirmed that the race exists, although it may be extremely
    rare, and is currently only fatal where page isolation is performed
    due to memory hot remove. Races on pageblocks being updated by
    set_pageblock_migratetype(), where both the old and new migratetype
    are lower than MIGRATE_RESERVE, currently cannot result in an invalid
    value being observed, although theoretically they may still lead to
    unexpected creation or destruction of MIGRATE_RESERVE pageblocks.
    Furthermore, things could get suddenly worse when memory isolation is
    used more, or when new migratetypes are added.

    After this patch, the race has not been observed in testing.
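
    A condensed sketch of the approach (index arithmetic simplified; the
    real code operates on the pageblock bitmap word holding the flags):

        /* Read: fetch the whole word, then shift and mask. */
        word = bitmap[word_bitidx];
        flags = (word >> bitidx) & mask;

        /* Write: retry with cmpxchg until no parallel update intervened. */
        word = ACCESS_ONCE(bitmap[word_bitidx]);
        for (;;) {
                old_word = cmpxchg(&bitmap[word_bitidx], word,
                                   (word & ~(mask << bitidx)) |
                                   (flags << bitidx));
                if (old_word == word)
                        break;
                word = old_word;
        }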

    Signed-off-by: Mel Gorman
    Acked-by: Vlastimil Babka
    Reported-by: Joonsoo Kim
    Reported-and-tested-by: Vlastimil Babka
    Cc: Johannes Weiner
    Cc: Jan Kara
    Cc: Michal Hocko
    Cc: Hugh Dickins
    Cc: Dave Hansen
    Cc: Theodore Ts'o
    Cc: "Paul E. McKenney"
    Cc: Oleg Nesterov
    Cc: Rik van Riel
    Cc: Peter Zijlstra
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • ALLOC_NO_WATERMARK is set in a few cases: always by kswapd, always for
    __GFP_MEMALLOC, sometimes for swap-over-nfs, tasks, etc. Each of these
    cases is a relatively rare event, but the ALLOC_NO_WATERMARK check is
    an unlikely branch in the fast path. This patch moves the check out of
    the fast path so it runs only after it has been determined that the
    watermarks have not been met. This helps the common fast path at the
    cost of making the slow path slower and hitting kswapd with a
    performance penalty. It's a reasonable tradeoff.
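
    A sketch of the move (names from the allocator of that era; exact
    surrounding logic elided):

        /* Fast path: no ALLOC_NO_WATERMARKS test any more. */
        if (!zone_watermark_ok(zone, order, mark, classzone_idx,
                               alloc_flags)) {
                /* Slow path: only now consider ignoring watermarks. */
                if (alloc_flags & ALLOC_NO_WATERMARKS)
                        goto try_this_zone;
                /* ... reclaim/zlc handling ... */
        }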

    Signed-off-by: Mel Gorman
    Acked-by: Johannes Weiner
    Reviewed-by: Rik van Riel
    Cc: Vlastimil Babka
    Cc: Jan Kara
    Cc: Michal Hocko
    Cc: Hugh Dickins
    Cc: Dave Hansen
    Cc: Theodore Ts'o
    Cc: "Paul E. McKenney"
    Cc: Oleg Nesterov
    Cc: Peter Zijlstra
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • Currently it's calculated once per zone in the zonelist.

    Signed-off-by: Mel Gorman
    Acked-by: Johannes Weiner
    Reviewed-by: Rik van Riel
    Cc: Vlastimil Babka
    Cc: Jan Kara
    Cc: Michal Hocko
    Cc: Hugh Dickins
    Cc: Dave Hansen
    Cc: Theodore Ts'o
    Cc: "Paul E. McKenney"
    Cc: Oleg Nesterov
    Cc: Peter Zijlstra
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • A node/zone index is used to check if pages are compatible for merging
    but this happens unconditionally even if the buddy page is not free. Defer
    the calculation as long as possible. Ideally we would check the zone boundary
    but nodes can overlap.
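
    A sketch of the deferred check, presumably in page_is_buddy()
    (structure assumed):

        if (PageBuddy(buddy) && page_order(buddy) == order) {
                /*
                 * The zone/node check is done late, so the ids are never
                 * calculated for pages that could not merge anyway.
                 */
                if (page_zone_id(page) != page_zone_id(buddy))
                        return 0;
                return 1;
        }
        return 0;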

    Signed-off-by: Mel Gorman
    Acked-by: Johannes Weiner
    Acked-by: Rik van Riel
    Cc: Vlastimil Babka
    Cc: Jan Kara
    Cc: Michal Hocko
    Cc: Hugh Dickins
    Cc: Dave Hansen
    Cc: Theodore Ts'o
    Cc: "Paul E. McKenney"
    Cc: Oleg Nesterov
    Cc: Peter Zijlstra
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • If cpusets are not in use then we still check a global variable on every
    page allocation. Use jump labels to avoid the overhead.
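
    A minimal sketch using the static_key API of the time:

        struct static_key cpusets_enabled_key = STATIC_KEY_INIT_FALSE;

        static inline bool cpusets_enabled(void)
        {
                /* Patched at runtime; a no-op branch while unused. */
                return static_key_false(&cpusets_enabled_key);
        }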

    Signed-off-by: Mel Gorman
    Reviewed-by: Rik van Riel
    Cc: Johannes Weiner
    Cc: Vlastimil Babka
    Cc: Jan Kara
    Cc: Michal Hocko
    Cc: Hugh Dickins
    Cc: Dave Hansen
    Cc: Theodore Ts'o
    Cc: "Paul E. McKenney"
    Cc: Oleg Nesterov
    Cc: Peter Zijlstra
    Cc: Stephen Rothwell
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • This patch exposes the jump_label reference count in preparation for the
    next patch. cpusets cares about both the jump_label being enabled and how
    many users of the cpusets there currently are.
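
    A plausible shape for the exposed helper, given the description:

        static inline int static_key_count(struct static_key *key)
        {
                return atomic_read(&key->enabled);
        }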

    Signed-off-by: Peter Zijlstra
    Signed-off-by: Mel Gorman
    Cc: Johannes Weiner
    Cc: Vlastimil Babka
    Cc: Jan Kara
    Cc: Michal Hocko
    Cc: Hugh Dickins
    Cc: Dave Hansen
    Cc: Theodore Ts'o
    Cc: "Paul E. McKenney"
    Cc: Oleg Nesterov
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • If a zone cannot be used for a dirty page then it gets marked "full" which
    is cached in the zlc and later potentially skipped by allocation requests
    that have nothing to do with dirty zones.
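
    A sketch of the fix: skip the zone for this request instead of
    poisoning the zlc (names from get_page_from_freelist(); exact
    condition assumed):

        if ((alloc_flags & ALLOC_WMARK_LOW) &&
            (gfp_mask & __GFP_WRITE) && !zone_dirty_ok(zone))
                continue;       /* was: goto this_zone_full; */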

    Signed-off-by: Mel Gorman
    Acked-by: Johannes Weiner
    Reviewed-by: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • The zlc is used on NUMA machines to quickly skip over zones that are
    full. However it is always updated, even for the first zone scanned,
    when the zlc might not even be active. As it's a write to a bitmap
    that potentially bounces a cache line, it's deceptively expensive and
    most machines will not care. Only update the zlc if it was active.
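
    The guard is essentially a one-liner (sketch):

        if (zlc_active)
                zlc_mark_zone_full(zonelist, z);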

    Signed-off-by: Mel Gorman
    Acked-by: Johannes Weiner
    Reviewed-by: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • Currently, in kmem_cache_destroy we delete the cache from the
    slab_list before __kmem_cache_shutdown, inserting it back into the
    list on failure. Initially this was done because we could release the
    slab_mutex in __kmem_cache_shutdown to delete the sysfs slub entry,
    but since commit 41a212859a4d ("slub: use sysfs'es release mechanism
    for kmem_cache") we remove the sysfs entry later in kmem_cache_destroy
    after dropping the slab_mutex, so no implementation of
    __kmem_cache_shutdown can ever release the lock. Therefore we can
    simplify the code a bit by moving the list_del after
    __kmem_cache_shutdown.
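
    A sketch of the simplified flow:

        mutex_lock(&slab_mutex);
        err = __kmem_cache_shutdown(s);
        if (!err)
                list_del(&s->list); /* unlink only after shutdown succeeds */
        mutex_unlock(&slab_mutex);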

    Signed-off-by: Vladimir Davydov
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Acked-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     
  • Current names are rather inconsistent. Let's try to improve them.

    Brief change log:

    ** old name **                        ** new name **

    kmem_cache_create_memcg               memcg_create_kmem_cache
    memcg_kmem_create_cache               memcg_register_cache
    memcg_kmem_destroy_cache              memcg_unregister_cache

    kmem_cache_destroy_memcg_children     memcg_cleanup_cache_params
    mem_cgroup_destroy_all_caches         memcg_unregister_all_caches

    create_work                           memcg_register_cache_work
    memcg_create_cache_work_func          memcg_register_cache_func
    memcg_create_cache_enqueue            memcg_schedule_register_cache

    Signed-off-by: Vladimir Davydov
    Acked-by: Michal Hocko
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     
  • Instead of calling an additional routine in dmam_pool_destroy() rely on
    what dmam_pool_release() is doing.

    Signed-off-by: Andy Shevchenko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andy Shevchenko
     
  • Some systems require a larger maximum PAGE_SIZE order for CMA allocations.
    To accommodate such systems, increase the upper-bound of the
    CMA_ALIGNMENT range to 12 (which ends up being 16MB on systems with 4K
    pages).

    Signed-off-by: Marc Carino
    Cc: Marek Szyprowski
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Marc Carino
     
  • Originally get_swap_page() started iterating through the singly-linked
    list of swap_info_structs using swap_list.next or
    highest_priority_index, both of which were intended to point to the
    highest priority active swap target that was not full. The first patch
    in this series changed the singly-linked list to a doubly-linked list,
    and removed the logic to start at the highest priority non-full entry;
    it starts scanning at the highest priority entry each time, even if
    the entry is full.

    Replace the manually ordered swap_list_head with a plist, swap_active_head.
    Add a new plist, swap_avail_head. The original swap_active_head plist
    contains all active swap_info_structs, as before, while the new
    swap_avail_head plist contains only swap_info_structs that are active and
    available, i.e. not full. Add a new spinlock, swap_avail_lock, to protect
    the swap_avail_head list.

    Mel Gorman suggested using plists since they internally handle
    ordering the list entries based on priority, which is exactly what
    swap was doing manually. All the ordering code is now removed, and
    swap_info_struct entries are simply added to their corresponding plist
    and automatically ordered correctly.

    Using a new plist for available swap_info_structs simplifies and
    optimizes get_swap_page(), which no longer has to iterate over full
    swap_info_structs. Using a new spinlock for the swap_avail_head plist
    allows each swap_info_struct to add or remove itself from the plist
    when it becomes full or not-full; previously it could not do so
    because the swap_info_struct->lock is held when it changes from full
    to not-full (and back), and the swap_lock protecting the main
    swap_active_head must be ordered before any swap_info_struct->lock.
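
    A hedged sketch of the resulting get_swap_page() scan (lock and list
    names as described above; field names and loop shape assumed):

        struct swap_info_struct *si, *next;

        spin_lock(&swap_avail_lock);
        plist_for_each_entry_safe(si, next, &swap_avail_head, avail_list) {
                /* Rotate si to the end of its priority run: round-robin
                 * use of equal-priority devices. */
                plist_requeue(&si->avail_list, &swap_avail_head);
                spin_unlock(&swap_avail_lock);
                /* ... try to allocate a slot from si; on failure retake
                 * swap_avail_lock and continue scanning ... */
        }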

    Signed-off-by: Dan Streetman
    Acked-by: Mel Gorman
    Cc: Shaohua Li
    Cc: Steven Rostedt
    Cc: Peter Zijlstra
    Cc: Hugh Dickins
    Cc: Dan Streetman
    Cc: Michal Hocko
    Cc: Christian Ehrhardt
    Cc: Weijie Yang
    Cc: Rik van Riel
    Cc: Johannes Weiner
    Cc: Bob Liu
    Cc: Paul Gortmaker
    Cc: Thomas Gleixner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dan Streetman
     
  • Add plist_requeue(), which moves the specified plist_node after all other
    same-priority plist_nodes in the list. This is essentially an optimized
    plist_del() followed by plist_add().

    This is needed by swap, which (with the next patch in this set) uses a
    plist of available swap devices. When a swap device (either a swap
    partition or swap file) is added to the system with swapon(), the
    device is added to a plist, ordered by the swap device's priority.
    When swap needs to allocate a page from one of the swap devices, it
    takes the page from the first swap device on the plist, which is the
    highest priority swap device. The swap device is left in the plist
    until all its pages are used, and then removed from the plist when it
    becomes full.

    However, as described in man 2 swapon, swap must allocate pages from swap
    devices with the same priority in round-robin order; to do this, on each
    swap page allocation, swap uses a page from the first swap device in the
    plist, and then calls plist_requeue() to move that swap device entry to
    after any other same-priority swap devices. The next swap page allocation
    will again use a page from the first swap device in the plist and requeue
    it, and so on, resulting in round-robin usage of equal-priority swap
    devices.

    Also add a plist_test_requeue() test function, for use by plist_test()
    to exercise plist_requeue().
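
    In effect plist_requeue() has the following contract (signature as
    described):

        /* Move @node to the end of its priority run in @head:
         * equivalent to plist_del() + plist_add(), but cheaper. */
        void plist_requeue(struct plist_node *node, struct plist_head *head);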

    Signed-off-by: Dan Streetman
    Cc: Steven Rostedt
    Cc: Peter Zijlstra
    Acked-by: Mel Gorman
    Cc: Paul Gortmaker
    Cc: Thomas Gleixner
    Cc: Shaohua Li
    Cc: Hugh Dickins
    Cc: Dan Streetman
    Cc: Michal Hocko
    Cc: Christian Ehrhardt
    Cc: Weijie Yang
    Cc: Rik van Riel
    Cc: Johannes Weiner
    Cc: Bob Liu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dan Streetman
     
  • Add PLIST_HEAD() to plist.h, equivalent to LIST_HEAD() from list.h, to
    define and initialize a struct plist_head.

    Add plist_for_each_continue() and plist_for_each_entry_continue(),
    equivalent to list_for_each_continue() and list_for_each_entry_continue(),
    to iterate over a plist continuing after the current position.

    Add plist_prev() and plist_next(), equivalent to (struct list_head*)->prev
    and ->next, implemented by list_prev_entry() and list_next_entry(), to
    access the prev/next struct plist_node entry. These are needed because
    unlike struct list_head, direct access of the prev/next struct plist_node
    isn't possible; the list must be navigated via the contained struct
    list_head. e.g. instead of accessing the prev by list_prev_entry(node,
    node_list) it can be accessed by plist_prev(node).
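
    Given the description, the additions are presumably thin wrappers
    along these lines (exact definitions assumed):

        #define PLIST_HEAD(head) \
                struct plist_head head = PLIST_HEAD_INIT(head)

        #define plist_next(pos) list_next_entry(pos, node_list)
        #define plist_prev(pos) list_prev_entry(pos, node_list)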

    Signed-off-by: Dan Streetman
    Acked-by: Mel Gorman
    Cc: Paul Gortmaker
    Cc: Steven Rostedt
    Cc: Thomas Gleixner
    Cc: Shaohua Li
    Cc: Hugh Dickins
    Cc: Dan Streetman
    Cc: Michal Hocko
    Cc: Christian Ehrhardt
    Cc: Weijie Yang
    Cc: Rik van Riel
    Cc: Johannes Weiner
    Cc: Bob Liu
    Cc: Peter Zijlstra
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dan Streetman
     
  • The logic controlling the singly-linked list of swap_info_struct entries
    for all active, i.e. swapon'ed, swap targets is rather complex, because:

    - it stores the entries in priority order
    - there is a pointer to the highest priority entry
    - there is a pointer to the highest priority not-full entry
    - there is a highest_priority_index variable set outside the swap_lock
    - swap entries of equal priority should be used equally

    This complexity leads to bugs such as
    https://lkml.org/lkml/2014/2/13/181, where different priority swap
    targets are incorrectly used equally.

    That bug probably could be solved with the existing singly-linked lists,
    but I think it would only add more complexity to the already difficult to
    understand get_swap_page() swap_list iteration logic.

    The first patch changes from a singly-linked list to a doubly-linked list
    using list_heads; the highest_priority_index and related code are removed
    and get_swap_page() starts each iteration at the highest priority
    swap_info entry, even if it's full. While this does introduce unnecessary
    list iteration (i.e. Schlemiel the painter's algorithm) in the case where
    one or more of the highest priority entries are full, the iteration and
    manipulation code is much simpler and behaves correctly re: the above bug;
    and the fourth patch removes the unnecessary iteration.

    The second patch adds some minor plist helper functions; nothing new
    really, just functions to match existing regular list functions. These
    are used by the next two patches.

    The third patch adds plist_requeue(), which is used by get_swap_page() in
    the next patch - it performs the requeueing of same-priority entries
    (which moves the entry to the end of its priority in the plist), so that
    all equal-priority swap_info_structs get used equally.

    The fourth patch converts the main list into a plist, and adds a new plist
    that contains only swap_info entries that are both active and not full.
    As Mel suggested using plists allows removing all the ordering code from
    swap - plists handle ordering automatically. The list naming is also
    clarified now that there are two lists, with the original list changed
    from swap_list_head to swap_active_head and the new list named
    swap_avail_head. A new spinlock is also added for the new list, so
    swap_info entries can be added or removed from the new list immediately as
    they become full or not full.

    This patch (of 4):

    Replace the singly-linked list tracking active, i.e. swapon'ed,
    swap_info_struct entries with a doubly-linked list using struct
    list_heads. Simplify the logic iterating and manipulating the list of
    entries, especially get_swap_page(), by using standard list_head
    functions, and removing the highest priority iteration logic.

    The change fixes the bug:
    https://lkml.org/lkml/2014/2/13/181
    in which different priority swap entries after the highest priority entry
    are incorrectly used equally in pairs. The swap behavior is now as
    advertised, i.e. different priority swap entries are used in order, and
    equal priority swap targets are used concurrently.
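
    A minimal sketch of the list-based scan that replaces the old
    singly-linked iteration (field names assumed):

        static LIST_HEAD(swap_list_head);

        struct swap_info_struct *si;

        list_for_each_entry(si, &swap_list_head, list) {
                if (!si->highest_bit)
                        continue;       /* full: try the next target */
                /* ... attempt allocation from si ... */
        }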

    Signed-off-by: Dan Streetman
    Acked-by: Mel Gorman
    Cc: Shaohua Li
    Cc: Hugh Dickins
    Cc: Dan Streetman
    Cc: Michal Hocko
    Cc: Christian Ehrhardt
    Cc: Weijie Yang
    Cc: Rik van Riel
    Cc: Johannes Weiner
    Cc: Bob Liu
    Cc: Steven Rostedt
    Cc: Peter Zijlstra
    Cc: Paul Gortmaker
    Cc: Thomas Gleixner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dan Streetman
     
  • In a previous commit ("mm: use the light version __mod_zone_page_state
    in mlocked_vma_newpage()") an IRQ-unsafe __mod_zone_page_state is
    used. As suggested by Andrew, to reduce the risk that new call sites
    use mlocked_vma_newpage() without realising they are introducing
    races, this patch folds mlocked_vma_newpage() into its only call site,
    page_add_new_anon_rmap, making it open-coded so that people can see
    what is going on.

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Jianyu Zhan
    Suggested-by: Andrew Morton
    Suggested-by: Hugh Dickins
    Acked-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jianyu Zhan
     
  • mlocked_vma_newpage() is called with the pte lock held (a spinlock),
    which implies preemption is disabled, and the vm stat counter is not
    modified from interrupt context, so we need not use the IRQ-safe
    mod_zone_page_state() here; the lightweight __mod_zone_page_state()
    is sufficient.

    This patch also documents __mod_zone_page_state() and some of its
    callsites. The comment above __mod_zone_page_state() is from Hugh
    Dickins, and acked by Christoph.

    Most of the credit goes to Hugh and Christoph for clarifying the usage
    of __mod_zone_page_state().
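
    For instance, the call site described can safely use the lightweight
    variant (counter name illustrative):

        /* pte lock held => preemption off; NR_MLOCK is never modified
         * from interrupt context, so no IRQ protection is needed. */
        __mod_zone_page_state(page_zone(page), NR_MLOCK,
                              hpage_nr_pages(page));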

    [akpm@linux-foundation.org: coding-style fixes]
    Suggested-by: Andrew Morton
    Acked-by: Hugh Dickins
    Signed-off-by: Jianyu Zhan
    Reviewed-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jianyu Zhan
     
  • The compaction free scanner in isolate_freepages() currently remembers PFN
    of the highest pageblock where it successfully isolates, to be used as the
    starting pageblock for the next invocation. The rationale behind this is
    that page migration might return free pages to the allocator when
    migration fails and we don't want to skip them if the compaction
    continues.

    Since migration now returns free pages back to compaction code where they
    can be reused, this is no longer a concern. This patch changes
    isolate_freepages() so that the PFN for restarting is updated with each
    pageblock where isolation is attempted. Using stress-highalloc from
    mmtests, this resulted in 10% reduction of the pages scanned by the free
    scanner.

    Note that the somewhat similar functionality that records the highest
    successful pageblock in zone->compact_cached_free_pfn remains
    unchanged. This cache is used when the whole compaction is restarted,
    not for multiple invocations of the free scanner during a single
    compaction.

    Signed-off-by: Vlastimil Babka
    Cc: Minchan Kim
    Cc: Mel Gorman
    Cc: Joonsoo Kim
    Cc: Bartlomiej Zolnierkiewicz
    Acked-by: Michal Nazarewicz
    Reviewed-by: Naoya Horiguchi
    Cc: Christoph Lameter
    Cc: Rik van Riel
    Acked-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     
  • During compaction, update_nr_listpages() has been used to count
    remaining non-migrated and free pages after a call to migrate_pages().
    The freepages counting has become unnecessary, and it turns out that
    the migratepages counting is also unnecessary in most cases.

    The only situation when it's needed to count cc->migratepages is when
    migrate_pages() returns with a negative error code. Otherwise, the
    non-negative return value is the number of pages that were not migrated,
    which is exactly the count of remaining pages in the cc->migratepages
    list.

    Furthermore, any non-zero count is only interesting for the tracepoint of
    mm_compaction_migratepages events, because after that all remaining
    unmigrated pages are put back and their count is set to 0.

    This patch therefore removes update_nr_listpages() completely, and changes
    the tracepoint definition so that the manual counting is done only when
    the tracepoint is enabled, and only when migrate_pages() returns a
    negative error code.

    Furthermore, migrate_pages() and the tracepoints won't be called when
    there's nothing to migrate. This potentially avoids some wasted cycles
    and reduces the volume of uninteresting mm_compaction_migratepages events
    where "nr_migrated=0 nr_failed=0". In the stress-highalloc mmtest, this
    was about 75% of the events. The mm_compaction_isolate_migratepages event
    is better for determining that nothing was isolated for migration, and
    this one was just duplicating the info.
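
    A sketch of the counting that now lives inside the tracepoint body,
    so it only runs when the event is enabled (plumbing elided):

        if (migrate_rc >= 0)
                nr_failed = migrate_rc; /* pages that were not migrated */
        else
                list_for_each(page_lru, migratepages)
                        nr_failed++;    /* error: count the list manually */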

    Signed-off-by: Vlastimil Babka
    Reviewed-by: Naoya Horiguchi
    Cc: Minchan Kim
    Cc: Mel Gorman
    Cc: Joonsoo Kim
    Cc: Bartlomiej Zolnierkiewicz
    Acked-by: Michal Nazarewicz
    Cc: Christoph Lameter
    Cc: Rik van Riel
    Acked-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     
  • Async compaction terminates prematurely when need_resched(), see
    compact_checklock_irqsave(). This can never trigger, however, if the
    cond_resched() in isolate_migratepages_range() always takes care of the
    scheduling.

    If the cond_resched() actually triggers, then terminate this pageblock
    scan for async compaction as well.
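
    A sketch of the described change in the migration scanner (field name
    assumed):

        if (cond_resched()) {
                /* We actually rescheduled: async compaction gives up. */
                if (cc->mode == MIGRATE_ASYNC)
                        return 0;
        }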

    Signed-off-by: David Rientjes
    Acked-by: Mel Gorman
    Acked-by: Vlastimil Babka
    Cc: Mel Gorman
    Cc: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • Synchronous memory compaction can be very expensive: it can iterate an
    enormous amount of memory without aborting, constantly rescheduling,
    waiting on page locks and lru_lock, etc, if a pageblock cannot be
    defragmented.

    Unfortunately, it's too expensive for transparent hugepage page faults
    and it's much better to simply fall back to pages. On 128GB machines,
    we find that synchronous memory compaction can take O(seconds) for a
    single thp fault.

    Now that async compaction remembers where it left off without strictly
    relying on sync compaction, this makes thp allocations best-effort without
    causing egregious latency during fault. We still need to retry async
    compaction after reclaim, but this won't stall for seconds.

    Signed-off-by: David Rientjes
    Acked-by: Mel Gorman
    Cc: Greg Thelen
    Cc: Naoya Horiguchi
    Cc: Vlastimil Babka
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • We're going to want to manipulate the migration mode for compaction in the
    page allocator, and currently compact_control's sync field is only a bool.

    Currently, we only do MIGRATE_ASYNC or MIGRATE_SYNC_LIGHT compaction
    depending on the value of this bool. Convert the bool to enum
    migrate_mode and pass the migration mode in directly. Later, we'll want
    to avoid MIGRATE_SYNC_LIGHT for thp allocations in the pagefault patch to
    avoid unnecessary latency.

    This also alters compaction triggered from sysfs, either for the entire
    system or for a node, to force MIGRATE_SYNC.
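
    The enum in question (comments paraphrased):

        enum migrate_mode {
                MIGRATE_ASYNC,          /* never block */
                MIGRATE_SYNC_LIGHT,     /* may block, but not on writeback */
                MIGRATE_SYNC,           /* may block, incl. on writeback */
        };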

    [akpm@linux-foundation.org: fix build]
    [iamjoonsoo.kim@lge.com: use MIGRATE_SYNC in alloc_contig_range()]
    Signed-off-by: David Rientjes
    Suggested-by: Mel Gorman
    Acked-by: Vlastimil Babka
    Cc: Greg Thelen
    Cc: Naoya Horiguchi
    Signed-off-by: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • Each zone has a cached migration scanner pfn for memory compaction so that
    subsequent calls to memory compaction can start where the previous call
    left off.

    Currently, the compaction migration scanner only updates the per-zone
    cached pfn when pageblocks were not skipped for async compaction.
    This creates a dependency on calling sync compaction to prevent
    subsequent calls to async compaction from scanning an enormous amount
    of non-MOVABLE pageblocks each time. On large machines, this could be
    potentially very expensive.

    This patch adds a per-zone cached migration scanner pfn only for
    async compaction. It is updated every time a pageblock has been
    scanned in its entirety and no pages from it were successfully
    isolated. The cached migration scanner pfn for sync compaction is
    updated only when called for sync compaction.
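
    A sketch of the per-mode cache (field shape assumed):

        struct zone {
                /* ... */
                /* pfn where the migration scanner resumes:
                 * [0] async compaction, [1] sync compaction */
                unsigned long compact_cached_migrate_pfn[2];
                /* ... */
        };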

    Signed-off-by: David Rientjes
    Acked-by: Vlastimil Babka
    Reviewed-by: Naoya Horiguchi
    Cc: Greg Thelen
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • Greg reported that he found isolated free pages were returned back to the
    VM rather than the compaction freelist. This will cause holes behind the
    free scanner and cause it to reallocate additional memory if necessary
    later.

    He detected the problem at runtime, seeing that ext4 metadata pages
    (especially the ones read by "sbi->s_group_desc[i] = sb_bread(sb,
    block)") were constantly visited by compaction calls of
    migrate_pages(). These pages had a non-zero b_count which caused
    fallback_migrate_page() -> try_to_release_page() ->
    try_to_free_buffers() to fail.

    Memory compaction works by having a "freeing scanner" scan from one end of
    a zone which isolates pages as migration targets while another "migrating
    scanner" scans from the other end of the same zone which isolates pages
    for migration.

    When page migration fails for an isolated page, the target page is
    returned to the system rather than to the freelist built by the
    freeing scanner. This may needlessly require the freeing scanner to
    continue scanning memory after suitable migration targets have
    already been returned to the system.

    This patch returns destination pages to the freeing scanner freelist when
    page migration fails. This prevents unnecessary work done by the freeing
    scanner but also encourages memory to be as compacted as possible at the
    end of the zone.
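
    A sketch of the callback that puts failed targets back on the
    compaction freelist (using the destination-freeing hook described in
    the next entry):

        static void compaction_free(struct page *page, unsigned long data)
        {
                struct compact_control *cc = (struct compact_control *)data;

                list_add(&page->lru, &cc->freepages);
                cc->nr_freepages++;
        }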

    Signed-off-by: David Rientjes
    Reported-by: Greg Thelen
    Acked-by: Mel Gorman
    Acked-by: Vlastimil Babka
    Reviewed-by: Naoya Horiguchi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • Memory migration uses a callback defined by the caller to determine how to
    allocate destination pages. When migration fails for a source page,
    however, it frees the destination page back to the system.

    This patch adds a memory migration callback defined by the caller to
    determine how to free destination pages. If a caller, such as memory
    compaction, builds its own freelist for migration targets, this can reuse
    already freed memory instead of scanning additional memory.

    If the caller provides a function to handle freeing of destination pages,
    it is called when page migration fails. If the caller passes NULL then
    freeing back to the system will be handled as usual. This patch
    introduces no functional change.
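
    The callback type and extended migrate_pages() signature presumably
    look roughly like:

        typedef void free_page_t(struct page *page, unsigned long private);

        int migrate_pages(struct list_head *from, new_page_t get_new_page,
                          free_page_t put_new_page, unsigned long private,
                          enum migrate_mode mode, int reason);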

    Signed-off-by: David Rientjes
    Reviewed-by: Naoya Horiguchi
    Acked-by: Mel Gorman
    Acked-by: Vlastimil Babka
    Cc: Greg Thelen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • It isn't worth complicating the code by allocating it on the first access,
    because it only takes 256 bytes.

    Signed-off-by: Vladimir Davydov
    Cc: Michal Hocko
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     
  • Instead of calling back to memcontrol.c from kmem_cache_create_memcg in
    order to just create the name of a per memcg cache, let's allocate it in
    place. We only need to pass the memcg name to kmem_cache_create_memcg for
    that - everything else can be done in slab_common.c.

    Signed-off-by: Vladimir Davydov
    Acked-by: Michal Hocko
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     
  • Signed-off-by: Qiang Huang
    Acked-by: Michal Hocko
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Qiang Huang
     
  • It is only used in __mem_cgroup_begin_update_page_stat(), the name is
    confusing, and two routines for one thing also confuse people, so
    folding this function in seems clearer.

    [akpm@linux-foundation.org: fix typo, per Michal]
    Signed-off-by: Qiang Huang
    Acked-by: Michal Hocko
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Qiang Huang
     
  • With ELF extended numbering, the 16-bit bound is no longer a hard
    limit.

    [akpm@linux-foundation.org: fix typo]
    Signed-off-by: Kirill A. Shutemov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • Signed-off-by: Emil Medve
    Cc: Ingo Molnar
    Cc: Thomas Gleixner
    Cc: "H. Peter Anvin"
    Cc: Yinghai Lu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Emil Medve
     
  • Also fixes a kernel-doc warning.

    Signed-off-by: Fabian Frederick
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Fabian Frederick