12 Feb, 2015

40 commits

  • This patch makes do_mincore() use walk_page_vma(), which removes many
    lines of code by reusing the common page table walk code. (A userspace
    sketch of the mincore() interface follows this entry.)

    [daeseok.youn@gmail.com: remove unneeded variable 'err']
    Signed-off-by: Naoya Horiguchi
    Acked-by: Johannes Weiner
    Cc: "Kirill A. Shutemov"
    Cc: Andrea Arcangeli
    Cc: Cyrill Gorcunov
    Cc: Dave Hansen
    Cc: Kirill A. Shutemov
    Cc: Pavel Emelyanov
    Cc: Benjamin Herrenschmidt
    Signed-off-by: Daeseok Youn
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi
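
    For illustration, a minimal userspace sketch of the mincore() interface
    that do_mincore() serves: it maps an anonymous region, faults in one
    page, and asks which pages are resident. This is a sketch only, not part
    of the patch, with error handling kept minimal.

    #define _GNU_SOURCE
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>
    #include <sys/mman.h>

    int main(void)
    {
        long psz = sysconf(_SC_PAGESIZE);
        size_t len = 16 * psz;
        unsigned char *vec = malloc(len / psz);   /* one byte per page */
        char *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

        if (buf == MAP_FAILED || !vec)
            return 1;
        buf[0] = 1;                     /* fault in the first page only */
        if (mincore(buf, len, vec))     /* kernel walks the page tables */
            return 1;
        for (size_t i = 0; i < len / psz; i++)
            printf("page %zu: %s\n", i,
                   (vec[i] & 1) ? "resident" : "not resident");
        return 0;
    }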
     
  • Currently the pagewalker splits all THP pages on any clear_refs request.
    That's not necessary: we can handle this at the PMD level.

    One side effect is that soft-dirty will potentially report more dirty
    memory, since we mark the whole THP page dirty at once.

    Sanity-checked with the CRIU test suite. More testing is required. (A
    userspace sketch of the soft-dirty interface follows this entry.)

    Signed-off-by: Kirill A. Shutemov
    Signed-off-by: Naoya Horiguchi
    Reviewed-by: Cyrill Gorcunov
    Cc: Pavel Emelyanov
    Cc: Andrea Arcangeli
    Cc: Dave Hansen
    Cc: "Kirill A. Shutemov"
    Cc: Benjamin Herrenschmidt
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
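
    For context, a hedged userspace sketch of the soft-dirty interface that
    clear_refs drives, per Documentation/vm/soft-dirty.txt: writing "4" to
    /proc/PID/clear_refs clears the soft-dirty bits, and bit 55 of each
    /proc/PID/pagemap entry reports them. Requires CONFIG_MEM_SOFT_DIRTY;
    illustrative only, not from the patch.

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <unistd.h>
    #include <sys/mman.h>

    static int soft_dirty(int pagemap_fd, void *addr)
    {
        uint64_t ent = 0;
        off_t off = ((uintptr_t)addr / sysconf(_SC_PAGESIZE)) * sizeof(ent);

        pread(pagemap_fd, &ent, sizeof(ent), off);
        return (ent >> 55) & 1;                 /* bit 55: soft-dirty */
    }

    int main(void)
    {
        char *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        int clear = open("/proc/self/clear_refs", O_WRONLY);
        int pm = open("/proc/self/pagemap", O_RDONLY);

        if (p == MAP_FAILED || clear < 0 || pm < 0)
            return 1;
        p[0] = 1;                               /* fault the page in */
        write(clear, "4", 1);                   /* "4": clear soft-dirty bits */
        printf("after clear: soft-dirty=%d\n", soft_dirty(pm, p));
        p[0] = 2;                               /* dirty it again */
        printf("after write: soft-dirty=%d\n", soft_dirty(pm, p));
        return 0;
    }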
     
  • walk_page_range() silently skips vmas with VM_PFNMAP set, which leads to
    undesirable behaviour for the caller. For example, in pagemap_read(),
    when no callbacks are invoked for a VM_PFNMAP vma, the pagemap data for
    the next virtual address range ends up at the wrong index. That can
    confuse and/or break userspace applications. (A short sketch of the
    pagemap index contract follows this entry.)

    This patch avoids this misbehavior for vma(VM_PFNMAP) as follows:
    - for pagemap_read(), which has its own ->pte_hole(), call ->pte_hole()
    over the vma(VM_PFNMAP),
    - for clear_refs and queue_pages, which have their own ->test_walk,
    just return 1 and skip the vma(VM_PFNMAP). This is not a problem because
    they are not interested in hole regions,
    - for other callers, just skip the vma(VM_PFNMAP) as the default
    behavior.

    Signed-off-by: Naoya Horiguchi
    Signed-off-by: Shiraz Hashim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi
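
    For reference, the indexing contract userspace relies on: the 64-bit
    pagemap entry describing a virtual address lives at file offset
    (vaddr / PAGE_SIZE) * 8, so if the walker silently dropped a VM_PFNMAP
    vma instead of reporting holes, entry i of a read would no longer
    describe start + i * PAGE_SIZE. A tiny illustrative helper (hypothetical,
    not from the patch):

    #include <stdint.h>
    #include <stdio.h>
    #include <unistd.h>
    #include <sys/types.h>

    /* Offset of the pagemap entry describing 'vaddr': one u64 per page. */
    static off_t pagemap_offset(uintptr_t vaddr)
    {
        return (off_t)(vaddr / sysconf(_SC_PAGESIZE)) * 8;
    }

    int main(void)
    {
        uintptr_t start = 0x7f0000000000UL;     /* arbitrary example address */
        long psz = sysconf(_SC_PAGESIZE);

        for (int i = 0; i < 4; i++)
            printf("vaddr %#lx -> pagemap offset %lld\n",
                   (unsigned long)(start + i * psz),
                   (long long)pagemap_offset(start + i * psz));
        return 0;
    }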
     
  • queue_pages_range() currently does page table walking in its own way,
    which duplicates code. This patch applies the common page table walker
    to reduce lines of code. (A userspace sketch of the mbind() call that
    exercises this path follows this entry.)

    queue_pages_range() has to do some prechecks to determine whether we
    really walk over the vma or just skip it. We now have the test_walk()
    callback in mm_walk for this purpose, so we can do this replacement
    cleanly. queue_pages_test_walk() depends not only on the current vma but
    also on the previous one, so queue_pages->prev is introduced to remember
    it.

    Signed-off-by: Naoya Horiguchi
    Cc: "Kirill A. Shutemov"
    Cc: Andrea Arcangeli
    Cc: Cyrill Gorcunov
    Cc: Dave Hansen
    Cc: Kirill A. Shutemov
    Cc: Pavel Emelyanov
    Cc: Benjamin Herrenschmidt
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi
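
    For context, the userspace path that ends up in queue_pages_range() is
    mbind(2). A hedged sketch, assuming libnuma's <numaif.h> is installed
    (link with -lnuma) and that node 0 exists:

    #define _GNU_SOURCE
    #include <numaif.h>
    #include <stdio.h>
    #include <sys/mman.h>

    int main(void)
    {
        size_t len = 64UL * 4096;
        unsigned long nodemask = 1UL << 0;      /* allow node 0 only */
        void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

        if (p == MAP_FAILED)
            return 1;
        /* MPOL_MF_MOVE makes the kernel walk the range's page tables
         * (queue_pages_range) to collect misplaced pages for migration. */
        if (mbind(p, len, MPOL_BIND, &nodemask, sizeof(nodemask) * 8,
                  MPOL_MF_MOVE))
            perror("mbind");
        return 0;
    }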
     
  • We don't have to use mm_walk->private to pass vma to the callback function
    because of mm_walk->vma. And walk_page_vma() is useful if we walk over a
    single vma.

    Signed-off-by: Naoya Horiguchi
    Acked-by: Kirill A. Shutemov
    Cc: "Kirill A. Shutemov"
    Cc: Andrea Arcangeli
    Cc: Cyrill Gorcunov
    Cc: Dave Hansen
    Cc: Pavel Emelyanov
    Cc: Benjamin Herrenschmidt
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi
     
  • pagewalk.c can handle the vma itself, so we don't have to pass the vma
    via walk->private. Both mem_cgroup_count_precharge() and
    mem_cgroup_move_charge() also do their own for-each-vma loops, but that
    is now done in pagewalk.c, so let's clean them up.

    Signed-off-by: Naoya Horiguchi
    Acked-by: Johannes Weiner
    Cc: "Kirill A. Shutemov"
    Cc: Andrea Arcangeli
    Cc: Cyrill Gorcunov
    Cc: Dave Hansen
    Cc: Kirill A. Shutemov
    Cc: Pavel Emelyanov
    Cc: Benjamin Herrenschmidt
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi
     
  • pagewalk.c can handle the vma itself, so we don't have to pass the vma
    via walk->private. And show_numa_map() walks pages on a per-vma basis,
    so using walk_page_vma() is preferable.

    Signed-off-by: Naoya Horiguchi
    Acked-by: Kirill A. Shutemov
    Cc: "Kirill A. Shutemov"
    Cc: Andrea Arcangeli
    Cc: Cyrill Gorcunov
    Cc: Dave Hansen
    Cc: Pavel Emelyanov
    Cc: Benjamin Herrenschmidt
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi
     
  • Just doing s/gather_hugetbl_stats/gather_hugetlb_stats/g, which makes
    the code grep-friendly.

    Signed-off-by: Naoya Horiguchi
    Acked-by: Kirill A. Shutemov
    Cc: "Kirill A. Shutemov"
    Cc: Andrea Arcangeli
    Cc: Cyrill Gorcunov
    Cc: Dave Hansen
    Cc: Pavel Emelyanov
    Cc: Benjamin Herrenschmidt
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi
     
  • The page table walker has the current vma in mm_walk, so we no longer
    have to call find_vma() in each pagemap_(pte|hugetlb)_range() call.
    pagemap_pte_range() currently does the vma loop itself, so this patch
    removes many lines of code.

    The NULL-vma check is omitted because we assume that we never run these
    callbacks on any address outside a vma. Even if that assumption were
    broken, the NULL pointer dereference would be detected, so we would get
    enough information for debugging.

    Signed-off-by: Naoya Horiguchi
    Cc: "Kirill A. Shutemov"
    Cc: Andrea Arcangeli
    Cc: Cyrill Gorcunov
    Cc: Dave Hansen
    Cc: Kirill A. Shutemov
    Cc: Pavel Emelyanov
    Cc: Benjamin Herrenschmidt
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi
     
  • clear_refs_write() has some prechecks to determine if we really walk over
    a given vma. Now we have a test_walk() callback to filter vmas, so let's
    utilize it.

    Signed-off-by: Naoya Horiguchi
    Acked-by: Kirill A. Shutemov
    Cc: "Kirill A. Shutemov"
    Cc: Andrea Arcangeli
    Cc: Cyrill Gorcunov
    Cc: Dave Hansen
    Cc: Pavel Emelyanov
    Cc: Benjamin Herrenschmidt
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi
     
  • pagewalk.c can handle the vma itself, so we don't have to pass the vma
    via walk->private. And show_smap() walks pages on a per-vma basis, so
    using walk_page_vma() is preferable.

    Signed-off-by: Naoya Horiguchi
    Acked-by: Kirill A. Shutemov
    Cc: "Kirill A. Shutemov"
    Cc: Andrea Arcangeli
    Cc: Cyrill Gorcunov
    Cc: Dave Hansen
    Cc: Pavel Emelyanov
    Cc: Benjamin Herrenschmidt
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi
     
  • Introduce walk_page_vma(), which is useful for callers that want to walk
    over a given vma. It is used by later patches.

    Signed-off-by: Naoya Horiguchi
    Acked-by: Kirill A. Shutemov
    Cc: "Kirill A. Shutemov"
    Cc: Andrea Arcangeli
    Cc: Cyrill Gorcunov
    Cc: Dave Hansen
    Cc: Pavel Emelyanov
    Cc: Benjamin Herrenschmidt
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi
     
  • The current implementation of the page table walker has a fundamental
    problem in vma handling, which started when we tried to handle
    vma(VM_HUGETLB). Because that handling is done inside the pgd loop,
    taking vma boundaries into account makes the code complicated and
    bug-prone.

    From the user's viewpoint, each user checks some vma-related conditions
    to determine whether it really does the page walk over the vma.

    To solve both problems, this patch moves the vma check outside the pgd
    loop and introduces a new callback, ->test_walk(). (A schematic sketch
    of the resulting interface follows this entry.)

    Signed-off-by: Naoya Horiguchi
    Acked-by: Kirill A. Shutemov
    Cc: "Kirill A. Shutemov"
    Cc: Andrea Arcangeli
    Cc: Cyrill Gorcunov
    Cc: Dave Hansen
    Cc: Pavel Emelyanov
    Cc: Benjamin Herrenschmidt
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi
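
    To make the new shape concrete, a schematic sketch of a walker user
    after this series. The my_* names are hypothetical, and this is
    kernel-style pseudocode not taken from or compile-tested against any
    particular tree:

    #include <linux/mm.h>

    /* ->test_walk() is called once per vma: return 1 to skip the vma
     * (but continue the walk), 0 to walk it, negative to abort. */
    static int my_test_walk(unsigned long start, unsigned long end,
                            struct mm_walk *walk)
    {
        if (walk->vma->vm_flags & VM_PFNMAP)
            return 1;
        return 0;
    }

    static int my_pte_entry(pte_t *pte, unsigned long addr,
                            unsigned long next, struct mm_walk *walk)
    {
        /* walk->vma is filled in by the core walker now; no more
         * find_vma() or walk->private tricks inside the callback. */
        return 0;
    }

    static void walk_one_vma(struct vm_area_struct *vma)
    {
        struct mm_walk walk = {
            .test_walk = my_test_walk,
            .pte_entry = my_pte_entry,
            .mm        = vma->vm_mm,
        };

        walk_page_vma(vma, &walk);
    }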
     
  • Currently no user of the page table walker sets ->pgd_entry() or
    ->pud_entry(), so checking for them in each loop iteration just wastes
    CPU cycles. Let's remove them to reduce overhead.

    Signed-off-by: Naoya Horiguchi
    Acked-by: Kirill A. Shutemov
    Cc: Andrea Arcangeli
    Cc: Cyrill Gorcunov
    Cc: Dave Hansen
    Cc: Kirill A. Shutemov
    Cc: Pavel Emelyanov
    Cc: Benjamin Herrenschmidt
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi
     
  • Lockless access to pte in pagemap_pte_range() might race with page
    migration and trigger BUG_ON(!PageLocked()) in migration_entry_to_page():

    CPU A (pagemap)                 CPU B (migration)
                                    lock_page()
                                    try_to_unmap(page, TTU_MIGRATION...)
                                        make_migration_entry()
                                        set_pte_at()
    pte_to_pagemap_entry()
                                    remove_migration_ptes()
                                    unlock_page()
    if (is_migration_entry())
        migration_entry_to_page()
            BUG_ON(!PageLocked(page))

    Also, a lockless read might be non-atomic if the pte is larger than the
    word size. Other pte walkers (smaps, numa_maps, clear_refs) already lock
    the ptes.

    Fixes: 052fb0d635df ("proc: report file/anon bit in /proc/pid/pagemap")
    Signed-off-by: Konstantin Khlebnikov
    Reported-by: Andrey Ryabinin
    Reviewed-by: Cyrill Gorcunov
    Acked-by: Naoya Horiguchi
    Acked-by: Kirill A. Shutemov
    Cc: [3.5+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Konstantin Khlebnikov
     
  • Use the more generic get_user_pages_unlocked, which has the additional
    benefit of passing FAULT_FLAG_ALLOW_RETRY on the very first page fault,
    allowing the first page fault in an unmapped area to block indefinitely
    because it is allowed to release the mmap_sem.

    Signed-off-by: Andrea Arcangeli
    Reviewed-by: Andres Lagar-Cavilla
    Reviewed-by: Kirill A. Shutemov
    Cc: Peter Feiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     
  • This allows those get_user_pages calls to pass FAULT_FLAG_ALLOW_RETRY to
    the page fault in order to release the mmap_sem during the I/O.

    Signed-off-by: Andrea Arcangeli
    Reviewed-by: Kirill A. Shutemov
    Cc: Andres Lagar-Cavilla
    Cc: Peter Feiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     
  • This allows the get_user_pages_fast slow path to release the mmap_sem
    before blocking.

    Signed-off-by: Andrea Arcangeli
    Reviewed-by: Kirill A. Shutemov
    Cc: Andres Lagar-Cavilla
    Cc: Peter Feiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     
  • Some callers (like KVM) may want to set gup_flags like FOLL_HWPOISON to
    get a proper -EHWPOISON retval instead of -EFAULT, so they can take a
    more appropriate action if get_user_pages runs into a memory failure.

    Signed-off-by: Andrea Arcangeli
    Reviewed-by: Kirill A. Shutemov
    Cc: Andres Lagar-Cavilla
    Cc: Peter Feiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     
  • FAULT_FLAG_ALLOW_RETRY allows the page fault to drop the mmap_sem held
    for reading, to reduce mmap_sem contention (for writers), for example
    while waiting for I/O completion. The problem is that right now
    practically no get_user_pages call uses FAULT_FLAG_ALLOW_RETRY, so we're
    not leveraging that nifty feature.

    Andres fixed it for the KVM page fault. However, get_user_pages_fast
    remains uncovered, and 99% of other get_user_pages callers aren't using
    it either (the only exception being FOLL_NOWAIT in KVM, which is really
    nonblocking and in fact doesn't even release the mmap_sem).

    So this patchset extends the optimization Andres did for the KVM page
    fault to the whole kernel. It makes the most important places (including
    gup_fast) use FAULT_FLAG_ALLOW_RETRY to reduce mmap_sem hold times
    during I/O.

    The few places that remain uncovered are drivers like v4l and other
    exceptions that tend to work on their own memory rather than on random
    user memory (unlike, for example, O_DIRECT, which uses gup_fast and is
    fully covered by this patch).

    A follow-up patch should probably also add a printk_once warning to
    get_user_pages, which should go obsolete and be phased out eventually.
    The "vmas" parameter of get_user_pages makes it fundamentally
    incompatible with FAULT_FLAG_ALLOW_RETRY (the vmas array becomes
    meaningless the moment the mmap_sem is released).

    While this is just an optimization, it becomes an absolute requirement
    for the userfaultfd feature http://lwn.net/Articles/615086/ .

    userfaultfd allows blocking the page fault, and in order to do so I need
    to drop the mmap_sem first. So this patch also ensures that, for all
    memory where userfaultfd could be registered by KVM, the very first
    fault (no matter whether it is a regular page fault or a get_user_pages)
    always has FAULT_FLAG_ALLOW_RETRY set. Then userfaultfd blocks and is
    woken only when the pagetable is already mapped. The second fault
    attempt after the wakeup doesn't need FAULT_FLAG_ALLOW_RETRY, so it's OK
    to retry without it.

    This patch (of 5):

    We can leverage the VM_FAULT_RETRY functionality in the page fault paths
    better by using either get_user_pages_locked or get_user_pages_unlocked.

    The former allows conversion of get_user_pages invocations that will have
    to pass a "&locked" parameter to know if the mmap_sem was dropped during
    the call. Example from:

    down_read(&mm->mmap_sem);
    do_something()
    get_user_pages(tsk, mm, ..., pages, NULL);
    up_read(&mm->mmap_sem);

    to:

    int locked = 1;
    down_read(&mm->mmap_sem);
    do_something()
    get_user_pages_locked(tsk, mm, ..., pages, &locked);
    if (locked)
        up_read(&mm->mmap_sem);

    The latter is suitable only as a drop in replacement of the form:

    down_read(&mm->mmap_sem);
    get_user_pages(tsk, mm, ..., pages, NULL);
    up_read(&mm->mmap_sem);

    into:

    get_user_pages_unlocked(tsk, mm, ..., pages);

    Where tsk, mm, the intermediate "..." parameters and "pages" can be any
    value as before. Just the last parameter of get_user_pages (vmas) must be
    NULL for get_user_pages_locked|unlocked to be usable (the latter original
    form wouldn't have been safe anyway if vmas wasn't null, for the former we
    just make it explicit by dropping the parameter).

    If vmas is not NULL these two methods cannot be used.

    Signed-off-by: Andrea Arcangeli
    Reviewed-by: Andres Lagar-Cavilla
    Reviewed-by: Peter Feiner
    Reviewed-by: Kirill A. Shutemov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     
  • The previous commit ("mm/thp: Allocate transparent hugepages on local
    node") introduced alloc_hugepage_vma() to mm/mempolicy.c to perform a
    special policy for THP allocations. The function has the same interface
    as alloc_pages_vma(), shares a lot of boilerplate code and a long
    comment.

    This patch merges the hugepage special case into alloc_pages_vma. The
    extra if condition should be a cheap enough price to pay. We also prevent
    a (however unlikely) race with parallel mems_allowed update, which could
    make hugepage allocation restart only within the fallback call to
    alloc_hugepage_vma() and not reconsider the special rule in
    alloc_hugepage_vma().

    Also by making sure mpol_cond_put(pol) is always called before actual
    allocation attempt, we can use a single exit path within the function.

    Also update the comment for missing node parameter and obsolete reference
    to mm_sem.

    Signed-off-by: Vlastimil Babka
    Cc: Aneesh Kumar K.V
    Cc: Kirill A. Shutemov
    Cc: Vlastimil Babka
    Cc: David Rientjes
    Cc: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     
  • This makes sure that we try to allocate hugepages from the local node if
    allowed by the mempolicy. If we can't, we fall back to small page
    allocation based on the mempolicy. This is based on the observation that
    allocating pages on the local node is more beneficial than allocating
    hugepages on a remote node.

    With this patch applied we may see transparent huge page allocation
    failures if the current node doesn't have enough free hugepages. Before
    this patch such failures resulted in retrying the allocation on other
    nodes in the NUMA node mask. (A userspace sketch for observing THP usage
    follows this entry.)

    [akpm@linux-foundation.org: fix comment, add CONFIG_TRANSPARENT_HUGEPAGE dependency]
    Signed-off-by: Aneesh Kumar K.V
    Acked-by: Kirill A. Shutemov
    Acked-by: Vlastimil Babka
    Cc: David Rientjes
    Cc: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Aneesh Kumar K.V
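
    One hedged way to observe the effect from userspace is to fault in an
    MADV_HUGEPAGE region and look at the AnonHugePages lines in
    /proc/self/smaps: non-zero values mean THPs were actually allocated.
    Sketch only, assuming THP is enabled and a 2 MiB huge page size:

    #define _GNU_SOURCE
    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>

    #define LEN (8UL << 20)                     /* 8 MiB */

    int main(void)
    {
        char line[256];
        FILE *smaps;
        char *p = mmap(NULL, LEN, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

        if (p == MAP_FAILED)
            return 1;
        madvise(p, LEN, MADV_HUGEPAGE);
        memset(p, 1, LEN);                      /* fault the region in */

        smaps = fopen("/proc/self/smaps", "r");
        while (smaps && fgets(line, sizeof(line), smaps))
            if (strstr(line, "AnonHugePages:"))
                fputs(line, stdout);            /* non-zero => THPs in use */
        if (smaps)
            fclose(smaps);
        return 0;
    }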
     
  • The compaction deferring logic is a heavy hammer that blocks the way to
    compaction. It doesn't consider overall system state, so it can falsely
    prevent a user from doing compaction. In other words, even if the system
    has a suitable range of memory to compact, compaction can be skipped due
    to the deferring logic. This patch adds a new tracepoint to understand
    how the deferring logic works. It will also help to check compaction
    success and failure. (A sketch for exercising compaction follows this
    entry.)

    Signed-off-by: Joonsoo Kim
    Cc: Vlastimil Babka
    Cc: Mel Gorman
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
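
    A blunt way to exercise compaction while the new tracepoints (or the
    compact_* counters in /proc/vmstat) are being watched is to poke
    /proc/sys/vm/compact_memory. A hedged sketch, needing root and
    CONFIG_COMPACTION:

    #include <stdio.h>
    #include <string.h>

    static void dump_compact_stats(void)
    {
        char line[128];
        FILE *f = fopen("/proc/vmstat", "r");

        while (f && fgets(line, sizeof(line), f))
            if (!strncmp(line, "compact_", 8))
                fputs(line, stdout);
        if (f)
            fclose(f);
    }

    int main(void)
    {
        FILE *trigger = fopen("/proc/sys/vm/compact_memory", "w");

        dump_compact_stats();
        if (trigger) {
            fputs("1\n", trigger);              /* force compaction of all zones */
            fclose(trigger);
        }
        dump_compact_stats();
        return 0;
    }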
     
  • It is not well understood when and why compaction starts and finishes,
    or fails to. With these new tracepoints, we can learn much more about
    the start/finish reasons of compaction. I found the following bug with
    these tracepoints:

    http://www.spinics.net/lists/linux-mm/msg81582.html

    Signed-off-by: Joonsoo Kim
    Cc: Vlastimil Babka
    Cc: Mel Gorman
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     
  • It'd be useful to know the current range where compaction works, for
    detailed analysis. With it, we can know which pageblock we actually scan
    and isolate from, how many pages we try in that pageblock, and roughly
    guess why it doesn't become a free page of pageblock order.

    Signed-off-by: Joonsoo Kim
    Acked-by: Vlastimil Babka
    Cc: Mel Gorman
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     
  • We now have a tracepoint for the compaction begin event, and it prints
    the start position of both scanners, but the tracepoint for the
    compaction end event doesn't print the finish position of both scanners.
    It'd also be useful to know the finish position of both scanners, so
    this patch adds it. It will help to find odd behavior or problems in the
    compaction internal logic.

    The mode is also added to both begin/end tracepoint output, since
    compaction behavior differs greatly depending on the mode.

    Lastly, the status format is changed from a status number to a string
    for readability.

    [akpm@linux-foundation.org: fix sparse warning]
    Signed-off-by: Joonsoo Kim
    Acked-by: Vlastimil Babka
    Cc: Mel Gorman
    Cc: David Rientjes
    Cc: Dan Carpenter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     
  • To check the range where compaction is working, the tracepoints print
    the start/end pfn of the zone and the start pfn of both scanners in
    decimal format. Since we manage all pages in power-of-2 units, which are
    better represented in hexadecimal, this patch changes the tracepoint
    format from decimal to hexadecimal. This improves readability: for
    example, it makes it easy to notice whether the current scanner is
    trying to compact a previously attempted pageblock.

    Signed-off-by: Joonsoo Kim
    Acked-by: Vlastimil Babka
    Cc: Mel Gorman
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     
  • The helper account_page_redirty() fixes the dirty-page counters for
    redirtied pages. This patch puts it after dirtying and prevents
    temporary underflows of the dirtied-page counters on the zone/bdi and in
    current->nr_dirtied.

    Signed-off-by: Konstantin Khlebnikov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Konstantin Khlebnikov
     
  • The problem is that we check nr_ptes/nr_pmds in exit_mmap(), which
    happens *before* pgd_free(). If an arch allocates pte/pmd tables in
    pgd_alloc() and frees them in pgd_free(), we see an offset in the
    counters by the time of the checks.

    We tried to work around this by offsetting the expected counter value
    according to FIRST_USER_ADDRESS for both nr_ptes and nr_pmds in
    exit_mmap(). But that doesn't work in some cases:

    1. ARM with LPAE enabled also has a non-zero USER_PGTABLES_CEILING, but
    the upper addresses are occupied by huge pmd entries, so the trick of
    offsetting the expected counter value gets really ugly: we would have
    to apply it to nr_pmds, but not to nr_ptes.

    2. Metag has a non-zero FIRST_USER_ADDRESS, but doesn't do pte/pmd page
    table allocation in pgd_alloc(); it just sets up a pgd entry which is
    allocated at boot and shared across all processes.

    The proposal is to move the check to check_mm(), which happens *after*
    pgd_free(), and to do proper accounting during pgd_alloc() and
    pgd_free(), which brings the counters to zero if nothing leaked.

    Signed-off-by: Kirill A. Shutemov
    Reported-by: Tyler Baker
    Tested-by: Tyler Baker
    Tested-by: Nishanth Menon
    Cc: Russell King
    Cc: James Hogan
    Cc: Guan Xuetao
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • Dave noticed that an unprivileged process can allocate a significant
    amount of memory -- >500 MiB on x86_64 -- and stay unnoticed by the
    oom-killer and memory cgroup. The trick is to allocate a lot of PMD page
    tables. The Linux kernel doesn't account PMD tables to the process, only
    PTE tables.

    The test case below uses a few tricks to allocate a lot of PMD page
    tables while keeping VmRSS and VmPTE low. The oom_score for the process
    will be 0.

    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>
    #include <sys/mman.h>
    #include <sys/prctl.h>

    #define PUD_SIZE (1UL << 30)
    #define PMD_SIZE (1UL << 21)

    #define NR_PUD 130000

    int main(void)
    {
        char *addr = NULL;
        unsigned long i;

        prctl(PR_SET_THP_DISABLE);
        for (i = 0; i < NR_PUD ; i++) {
            addr = mmap(addr + PUD_SIZE, PUD_SIZE, PROT_WRITE|PROT_READ,
                        MAP_ANONYMOUS|MAP_PRIVATE, -1, 0);
            if (addr == MAP_FAILED) {
                perror("mmap");
                break;
            }
            *addr = 'x';
            munmap(addr, PMD_SIZE);
            mmap(addr, PMD_SIZE, PROT_WRITE|PROT_READ,
                 MAP_ANONYMOUS|MAP_PRIVATE|MAP_FIXED, -1, 0);
            if (addr == MAP_FAILED)
                perror("re-mmap"), exit(1);
        }
        printf("PID %d consumed %lu KiB in PMD page tables\n",
               getpid(), i * 4096 >> 10);
        return pause();
    }

    The patch addresses the issue by accounting PMD tables to the process
    the same way we account PTE tables.

    The main places where PMD tables are accounted are __pmd_alloc() and
    free_pmd_range(). But there are a few corner cases:

    - HugeTLB can share PMD page tables. The patch handles this by
    accounting the table to all processes that share it.

    - x86 PAE pre-allocates a few PMD tables on fork.

    - Architectures with FIRST_USER_ADDRESS > 0. We need to adjust the
    sanity check on exit(2).

    Accounting only happens on configurations where the PMD page table
    level is present (PMD is not folded). As with nr_ptes, we use a per-mm
    counter. The counter value is used to calculate the baseline for the
    badness score by the oom-killer.

    Signed-off-by: Kirill A. Shutemov
    Reported-by: Dave Hansen
    Cc: Hugh Dickins
    Reviewed-by: Cyrill Gorcunov
    Cc: Pavel Emelyanov
    Cc: David Rientjes
    Tested-by: Sedat Dilek
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • ARM uses a custom implementation of PMD folding in the 2-level page
    table case. Generic code expects __PAGETABLE_PMD_FOLDED to be defined if
    the PMD is folded, but ARM doesn't do this. Let's fix it.

    Defining __PAGETABLE_PMD_FOLDED will drop out unused __pmd_alloc(). It
    also fixes problems with recently-introduced pmd accounting on ARM without
    LPAE.

    Signed-off-by: Kirill A. Shutemov
    Reported-by: Nishanth Menon
    Reported-by: Simon Horman
    Tested-by: Simon Horman
    Tested-by: Fabio Estevam
    Tested-by: Felipe Balbi
    Tested-by: Nishanth Menon
    Tested-by: Peter Ujfalusi
    Tested-by: Krzysztof Kozlowski
    Tested-by: Geert Uytterhoeven
    Cc: Dave Hansen
    Cc: Hugh Dickins
    Cc: Cyrill Gorcunov
    Cc: Pavel Emelyanov
    Cc: David Rientjes
    Cc: Russell King
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • If an architecture uses <asm-generic/4level-fixup.h>, the build fails if
    we try to use PUD_SHIFT in generic code:

    In file included from arch/microblaze/include/asm/bug.h:1:0,
    from include/linux/bug.h:4,
    from include/linux/thread_info.h:11,
    from include/asm-generic/preempt.h:4,
    from arch/microblaze/include/generated/asm/preempt.h:1,
    from include/linux/preempt.h:18,
    from include/linux/spinlock.h:50,
    from include/linux/mmzone.h:7,
    from include/linux/gfp.h:5,
    from include/linux/slab.h:14,
    from mm/mmap.c:12:
    mm/mmap.c: In function 'exit_mmap':
    >> mm/mmap.c:2858:46: error: 'PUD_SHIFT' undeclared (first use in this function)
    round_up(FIRST_USER_ADDRESS, PUD_SIZE) >> PUD_SHIFT);
    ^
    include/asm-generic/bug.h:86:25: note: in definition of macro 'WARN_ON'
    int __ret_warn_on = !!(condition); \
    ^
    mm/mmap.c:2858:46: note: each undeclared identifier is reported only once for each function it appears in
    round_up(FIRST_USER_ADDRESS, PUD_SIZE) >> PUD_SHIFT);
    ^
    include/asm-generic/bug.h:86:25: note: in definition of macro 'WARN_ON'
    int __ret_warn_on = !!(condition); \
    ^
    As with <asm-generic/pgtable-nopud.h>, let's define PUD_SHIFT to
    PGDIR_SHIFT.

    Signed-off-by: Kirill A. Shutemov
    Reported-by: Wu Fengguang
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • LKP has triggered a compiler warning after my recent patch "mm: account
    pmd page tables to the process":

    mm/mmap.c: In function 'exit_mmap':
    >> mm/mmap.c:2857:2: warning: right shift count >= width of type [enabled by default]

    The code:

    > 2857 WARN_ON(mm_nr_pmds(mm) >
    2858 round_up(FIRST_USER_ADDRESS, PUD_SIZE) >> PUD_SHIFT);

    Here, on tile, FIRST_USER_ADDRESS is defined as 0, a plain int, so
    round_up() also yields an int, and right-shifting it by PUD_SHIFT
    (which on tile is at least the width of int) triggers the warning. (A
    tiny snippet reproducing the warning class follows this entry.)

    I think the best way to fix it is to define FIRST_USER_ADDRESS as
    unsigned long, on every arch for consistency.

    Signed-off-by: Kirill A. Shutemov
    Reported-by: Wu Fengguang
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
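
    For illustration, a minimal stand-alone snippet (hypothetical constants,
    not from the patch) that triggers the same class of warning with gcc,
    mirroring an int FIRST_USER_ADDRESS shifted by a PUD_SHIFT of at least
    the width of int:

    #define FIRST_USER_ADDRESS 0    /* plain int, as on tile before the fix */
    #define PUD_SHIFT 39

    unsigned long pmd_limit(void)
    {
        /* gcc: warning: right shift count >= width of type */
        return FIRST_USER_ADDRESS >> PUD_SHIFT;
    }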
     
  • Microblaze uses a custom implementation of PMD folding, but doesn't
    define __PAGETABLE_PMD_FOLDED, which generic code expects to see. Let's
    fix it.

    Defining __PAGETABLE_PMD_FOLDED will drop out unused __pmd_alloc(). It
    also fixes problems with recently-introduced pmd accounting.

    Signed-off-by: Kirill A. Shutemov
    Reported-by: Guenter Roeck
    Tested-by: Guenter Roeck
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • The swap controller code is scattered all over the file. Gather all
    the code that isn't directly needed by the memory controller at the
    end of the file in its own CONFIG_MEMCG_SWAP section.

    Signed-off-by: Johannes Weiner
    Cc: Michal Hocko
    Reviewed-by: Vladimir Davydov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • The initialization code for the per-cpu charge stock and the soft
    limit tree is compact enough to inline it into mem_cgroup_init().

    Signed-off-by: Johannes Weiner
    Acked-by: Michal Hocko
    Reviewed-by: Vladimir Davydov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • - No need to test the node for N_MEMORY. node_online() is enough for
    node fallback to work in slab, use NUMA_NO_NODE for everything else.

    - Remove the BUG_ON() for allocation failure. A NULL pointer crash is
    just as descriptive, and the absent return value check is obvious.

    - Move local variables to the inner-most blocks.

    - Point to the tree structure after it's initialized, not before; it's
    just more logical that way.

    Signed-off-by: Johannes Weiner
    Cc: Michal Hocko
    Cc: Vladimir Davydov
    Cc: Guenter Roeck
    Cc: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • The totalcma_pages variable is not updated to account for CMA regions
    defined via device tree reserved-memory sub-nodes. Fix this omission by
    moving the calculation of totalcma_pages into cma_init_reserved_mem()
    instead of cma_declare_contiguous() such that it will include reserved
    memory used by all CMA regions.

    Signed-off-by: George G. Davis
    Cc: Marek Szyprowski
    Acked-by: Michal Nazarewicz
    Cc: Joonsoo Kim
    Cc: "Aneesh Kumar K.V"
    Cc: Laurent Pinchart
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    George G. Davis
     
  • Commit 5695be142e20 ("OOM, PM: OOM killed task shouldn't escape PM
    suspend") has left a race window when OOM killer manages to
    note_oom_kill after freeze_processes checks the counter. The race
    window is quite small and really unlikely, and a partial solution was
    deemed sufficient at the time of submission.

    Tejun wasn't happy with this partial solution, though, and insisted on a
    full one. That requires full exclusion between the OOM killer and the
    freezer's task freezing. This is done by this patch, which introduces an
    oom_sem RW lock and turns oom_killer_disable() into a full OOM barrier.

    The oom_killer_disabled check is moved from the allocation path to the
    OOM level, and we take oom_sem for reading for both the check and the
    whole OOM invocation.

    oom_killer_disable() takes oom_sem for writing, so it waits for all
    currently running OOM killer invocations. Then it disables all further
    OOMs by setting oom_killer_disabled and checks for any oom victims.
    Victims are counted via mark_tsk_oom_victim resp. unmark_oom_victim.
    The last victim wakes up all waiters enqueued by oom_killer_disable().
    Therefore this function acts as the full OOM barrier.

    The page fault path is now covered as well, although it was assumed to
    be safe before. As per Tejun, "We used to have freezing points deep in
    file system code which may be reacheable from page fault.", so it would
    be better and more robust not to rely on freezing points here. The same
    applies to the memcg OOM killer.

    out_of_memory tells the caller whether the OOM was allowed to trigger and
    the callers are supposed to handle the situation. The page allocation
    path simply fails the allocation the same as before. The page fault path
    will retry the fault (more on that later) and the SysRq OOM trigger will
    simply complain to the log.

    Normally there wouldn't be any unfrozen user tasks after
    try_to_freeze_tasks so the function will not block. But if there was an
    OOM killer racing with try_to_freeze_tasks and the OOM victim didn't
    finish yet then we have to wait for it. This should complete in a finite
    time, though, because

    - the victim cannot loop in the page fault handler (it would die
    on the way out from the exception)
    - it cannot loop in the page allocator because all the further
    allocation would fail and __GFP_NOFAIL allocations are not
    acceptable at this stage
    - it shouldn't be blocked on any locks held by frozen tasks
    (try_to_freeze expects lockless context) and kernel threads and
    work queues are not frozen yet

    Signed-off-by: Michal Hocko
    Suggested-by: Tejun Heo
    Cc: David Rientjes
    Cc: Johannes Weiner
    Cc: Oleg Nesterov
    Cc: Cong Wang
    Cc: "Rafael J. Wysocki"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • While touching this area, let's convert printk to pr_*. This also makes
    continuation lines print properly.

    Signed-off-by: Michal Hocko
    Acked-by: Tejun Heo
    Cc: David Rientjes
    Cc: Johannes Weiner
    Cc: Oleg Nesterov
    Cc: Cong Wang
    Cc: "Rafael J. Wysocki"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko