18 Jan, 2016

1 commit

  • Merge second patch-bomb from Andrew Morton:

    - more MM stuff:

    - Kirill's page-flags rework

    - Kirill's now-allegedly-fixed THP rework

    - MADV_FREE implementation

    - DAX feature work (msync/fsync). This isn't quite complete but DAX
    is new and it's good enough and the guys have a handle on what
    needs to be done - I expect this to be wrapped in the next week or
    two.

    - some vsprintf maintenance work

    - various other misc bits

    * emailed patches from Andrew Morton: (145 commits)
    printk: change recursion_bug type to bool
    lib/vsprintf: factor out %pN[F] handler as netdev_bits()
    lib/vsprintf: refactor duplicate code to special_hex_number()
    printk-formats.txt: remove unimplemented %pT
    printk: help pr_debug and pr_devel to optimize out arguments
    lib/test_printf.c: test dentry printing
    lib/test_printf.c: add test for large bitmaps
    lib/test_printf.c: account for kvasprintf tests
    lib/test_printf.c: add a few number() tests
    lib/test_printf.c: test precision quirks
    lib/test_printf.c: check for out-of-bound writes
    lib/test_printf.c: don't BUG
    lib/kasprintf.c: add sanity check to kvasprintf
    lib/vsprintf.c: warn about too large precisions and field widths
    lib/vsprintf.c: help gcc make number() smaller
    lib/vsprintf.c: expand field_width to 24 bits
    lib/vsprintf.c: eliminate potential race in string()
    lib/vsprintf.c: move string() below widen_string()
    lib/vsprintf.c: pull out padding code from dentry_name()
    printk: do cond_resched() between lines while outputting to consoles
    ...

    Linus Torvalds
     

16 Jan, 2016

1 commit

  • The patch updates Documentation/vm/transhuge.txt to reflect changes in
    THP design.

    Signed-off-by: Kirill A. Shutemov
    Acked-by: Jerome Marchand
    Cc: Sasha Levin
    Cc: Aneesh Kumar K.V
    Cc: Andrea Arcangeli
    Cc: Hugh Dickins
    Cc: Dave Hansen
    Cc: Mel Gorman
    Cc: Rik van Riel
    Cc: Naoya Horiguchi
    Cc: Steve Capper
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Christoph Lameter
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     

11 Dec, 2015

1 commit


07 Nov, 2015

2 commits

  • Hugh has pointed out that the compound_head() call can be unsafe in some
    contexts. Here's one example:

    CPU0                                    CPU1

    isolate_migratepages_block()
      page_count()
        compound_head()
          !!PageTail() == true
                                            put_page()
                                              tail->first_page = NULL
          head = tail->first_page
                                            alloc_pages(__GFP_COMP)
                                              prep_compound_page()
                                                tail->first_page = head
                                                __SetPageTail(p);
          !!PageTail() == true

    The race is purely theoretical. I don't think it's possible to trigger it
    in practice. But who knows.

    We can fix the race by changing how we encode PageTail() and
    compound_head() within struct page, so that they can be updated in one shot.

    The patch introduces page->compound_head into the third double word block,
    in front of compound_dtor and compound_order. Bit 0 encodes PageTail() and
    the remaining bits are a pointer to the head page if bit zero is set.
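
    A sketch of the resulting decode, in kernel context (this follows the
    description above rather than quoting the patch verbatim):

    static inline struct page *compound_head(struct page *page)
    {
        unsigned long head = READ_ONCE(page->compound_head);

        /* Bit 0 set: this is a tail page and the remaining bits hold the
         * head page pointer; bit 0 clear: the page is its own head. */
        if (unlikely(head & 1))
            return (struct page *) (head - 1);
        return page;
    }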

    The patch moves page->pmd_huge_pte out of that word, just in case an
    architecture defines pgtable_t as something that can have bit 0 set.

    hugetlb_cgroup uses page->lru.next in the second tail page to store a
    pointer to struct hugetlb_cgroup. The patch switches it to use
    page->private in the second tail page instead. The space is free since
    ->first_page is removed from the union.

    The patch also opens up the possibility of removing the
    HUGETLB_CGROUP_MIN_ORDER limitation, since there's now space in the first
    tail page to store the struct hugetlb_cgroup pointer. But that's outside
    the scope of this patch.

    That means page->compound_head shares storage space with:

    - page->lru.next;
    - page->next;
    - page->rcu_head.next;

    That's too long a list to be absolutely sure, but it looks like nobody
    uses bit 0 of that word.

    page->rcu_head.next is guaranteed[1] to have bit 0 clear as long as we use
    call_rcu(), call_rcu_bh(), call_rcu_sched(), or call_srcu(). But a future
    call_rcu_lazy() would not be allowed, as it makes use of that bit and we
    could get a false-positive PageTail().

    [1] http://lkml.kernel.org/g/20150827163634.GD4029@linux.vnet.ibm.com

    Signed-off-by: Kirill A. Shutemov
    Acked-by: Michal Hocko
    Reviewed-by: Andrea Arcangeli
    Cc: Hugh Dickins
    Cc: David Rientjes
    Cc: Vlastimil Babka
    Acked-by: Paul E. McKenney
    Cc: Aneesh Kumar K.V
    Cc: Andi Kleen
    Cc: Christoph Lameter
    Cc: Joonsoo Kim
    Cc: Sergey Senozhatsky
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • …d avoiding waking kswapd

    __GFP_WAIT has been used to identify atomic context in callers that hold
    spinlocks or are in interrupts. They are expected to be high priority and
    have access to one of two watermarks lower than "min", which can be
    referred to as the "atomic reserve". __GFP_HIGH users get access to the
    first lower watermark and can be called the "high priority reserve".

    Over time, callers had a requirement to not block when fallback options
    were available. Some have abused __GFP_WAIT, leading to a situation where
    an optimistic allocation with a fallback option can access atomic
    reserves.

    This patch uses __GFP_ATOMIC to identify callers that are truly atomic,
    cannot sleep, and have no alternative. High-priority users continue to use
    __GFP_HIGH. __GFP_DIRECT_RECLAIM identifies callers that can sleep and
    are willing to enter direct reclaim. __GFP_KSWAPD_RECLAIM identifies
    callers that want to wake kswapd for background reclaim. __GFP_WAIT is
    redefined as a caller that is willing to enter direct reclaim and wake
    kswapd for background reclaim.
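
    A sketch of the resulting flag relationships, in kernel context and using
    only the names described above (exact bit values omitted):

    /* __GFP_WAIT becomes "may enter direct reclaim and may wake kswapd". */
    #define __GFP_WAIT (__GFP_DIRECT_RECLAIM | __GFP_KSWAPD_RECLAIM)

    /* Callers should test blocking ability with this helper rather
     * than checking __GFP_WAIT directly. */
    static inline bool gfpflags_allow_blocking(const gfp_t gfp_flags)
    {
        return !!(gfp_flags & __GFP_DIRECT_RECLAIM);
    }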

    This patch then converts a number of sites:

    o __GFP_ATOMIC is used by callers that are high priority and have memory
    pools for those requests. GFP_ATOMIC uses this flag.

    o Callers that have a limited mempool to guarantee forward progress clear
    __GFP_DIRECT_RECLAIM but keep __GFP_KSWAPD_RECLAIM. bio allocations fall
    into this category where kswapd will still be woken but atomic reserves
    are not used as there is a one-entry mempool to guarantee progress.

    o Callers that are checking if they are non-blocking should use the
    helper gfpflags_allow_blocking() where possible. This is because
    checking for __GFP_WAIT, as was done historically, can now trigger
    false positives. Some exceptions like dm-crypt.c exist, where the code
    intent is clearer if __GFP_DIRECT_RECLAIM is used instead of the helper
    due to flag manipulations.

    o Callers that built their own GFP flags instead of starting with GFP_KERNEL
    and friends now also need to specify __GFP_KSWAPD_RECLAIM.

    The first key hazard to watch out for is callers that removed __GFP_WAIT
    and were depending on access to atomic reserves for inconspicuous reasons.
    In some cases it may be appropriate for them to use __GFP_HIGH.

    The second key hazard is callers that assembled their own combination of
    GFP flags instead of starting with something like GFP_KERNEL. They may
    now wish to specify __GFP_KSWAPD_RECLAIM. It's almost certainly harmless
    if it's missed in most cases as other activity will wake kswapd.

    Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
    Acked-by: Vlastimil Babka <vbabka@suse.cz>
    Acked-by: Michal Hocko <mhocko@suse.com>
    Acked-by: Johannes Weiner <hannes@cmpxchg.org>
    Cc: Christoph Lameter <cl@linux.com>
    Cc: David Rientjes <rientjes@google.com>
    Cc: Vitaly Wool <vitalywool@gmail.com>
    Cc: Rik van Riel <riel@redhat.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

    Mel Gorman
     

06 Nov, 2015

5 commits

  • We have had trouble in the past from the way in which page migration's
    newpage is initialized in dribs and drabs - see commit 8bdd63809160 ("mm:
    fix direct reclaim writeback regression") which proposed a cleanup.

    We have no actual problem now, but I think the procedure would be clearer
    (and alternative get_new_page pools safer to implement) if we assert that
    newpage is not touched until we are sure that it's going to be used -
    except for taking the trylock on it in __unmap_and_move().

    So shift the early initializations from move_to_new_page() into
    migrate_page_move_mapping(), mapping and NULL-mapping paths. Similarly
    migrate_huge_page_move_mapping(), but its NULL-mapping path can just be
    deleted: you cannot reach hugetlbfs_migrate_page() with a NULL mapping.

    Adjust stages 3 to 8 in the Documentation file accordingly.

    Signed-off-by: Hugh Dickins
    Cc: Christoph Lameter
    Cc: "Kirill A. Shutemov"
    Cc: Rik van Riel
    Cc: Vlastimil Babka
    Cc: Davidlohr Bueso
    Cc: Oleg Nesterov
    Cc: Sasha Levin
    Cc: Dmitry Vyukov
    Cc: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • KernelThreadSanitizer (ktsan) has shown that the down_read_trylock() of
    mmap_sem in try_to_unmap_one() (when going to set PageMlocked on a page
    found mapped in a VM_LOCKED vma) is ineffective against races with
    exit_mmap()'s munlock_vma_pages_all(), because mmap_sem is not held when
    tearing down an mm.

    But that's okay, those races are benign; and although we've believed for
    years in that ugly down_read_trylock(), it's unsuitable for the job, and
    frustrates the good intention of setting PageMlocked when it fails.

    It just doesn't matter if here we read vm_flags an instant before or after
    a racing mlock() or munlock() or exit_mmap() sets or clears VM_LOCKED: the
    syscalls (or exit) work their way up the address space (taking pt locks
    after updating vm_flags) to establish the final state.

    We do still need to be careful never to mark a page Mlocked (hence
    unevictable) by any race that will not be corrected shortly after. The
    page lock protects from many of the races, but not all (a page is not
    necessarily locked when it's unmapped). But the pte lock we just dropped
    is good to cover the rest (and serializes even with
    munlock_vma_pages_all(), so no special barriers required): now hold on to
    the pte lock while calling mlock_vma_page(). Is that lock ordering safe?
    Yes, that's how follow_page_pte() calls it, and how page_remove_rmap()
    calls the complementary clear_page_mlock().

    This fixes the following case (though not a case which anyone has
    complained of), which mmap_sem did not: truncation's preliminary
    unmap_mapping_range() is supposed to remove even the anonymous COWs of
    filecache pages, and that might race with try_to_unmap_one() on a
    VM_LOCKED vma, so that mlock_vma_page() sets PageMlocked just after
    zap_pte_range() unmaps the page, causing "Bad page state (mlocked)" when
    freed. The pte lock protects against this.

    You could say that it also protects against the more ordinary case, racing
    with the preliminary unmapping of a filecache page itself: but in our
    current tree, that's independently protected by i_mmap_rwsem; and that
    race would be why "Bad page state (mlocked)" was seen before commit
    48ec833b7851 ("Revert mm/memory.c: share the i_mmap_rwsem").

    Vlastimil Babka points out another race which this patch protects against.
    try_to_unmap_one() might reach its mlock_vma_page() TestSetPageMlocked a
    moment after munlock_vma_pages_all() did its Phase 1 TestClearPageMlocked:
    leaving PageMlocked and unevictable when it should be evictable. mmap_sem
    is ineffective because exit_mmap() does not hold it; page lock ineffective
    because __munlock_pagevec() only takes it afterwards, in Phase 2; pte lock
    is effective because __munlock_pagevec_fill() takes it to get the page,
    after VM_LOCKED was cleared from vm_flags, so visible to try_to_unmap_one.

    Kirill Shutemov points out that if the compiler chooses to implement a
    "vma->vm_flags &= VM_WHATEVER" or "vma->vm_flags |= VM_WHATEVER" operation
    with an intermediate store of unrelated bits set, since I'm here foregoing
    its usual protection by mmap_sem, try_to_unmap_one() might catch sight of
    a spurious VM_LOCKED in vm_flags, and make the wrong decision. This does
    not appear to be an immediate problem, but we may want to define vm_flags
    accessors in future, to guard against such a possibility.

    While we're here, make a related optimization in try_to_unmap_one(): if
    it's doing TTU_MUNLOCK, then there's no point at all in descending the
    page tables and getting the pt lock, unless the vma is VM_LOCKED. Yes,
    that can change racily, but it can change racily even without the
    optimization: it's not critical. Far better not to waste time here.

    Stopped short of separating try_to_munlock_one() from try_to_unmap_one()
    on this occasion, but that's probably the sensible next step - with a
    rename, given that try_to_munlock()'s business is to try to set Mlocked.

    Updated the unevictable-lru Documentation, to remove its reference to mmap
    semaphore, but found a few more updates needed in just that area.

    Signed-off-by: Hugh Dickins
    Cc: Christoph Lameter
    Cc: "Kirill A. Shutemov"
    Cc: Rik van Riel
    Acked-by: Vlastimil Babka
    Cc: Davidlohr Bueso
    Cc: Oleg Nesterov
    Cc: Sasha Levin
    Cc: Dmitry Vyukov
    Cc: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • While updating some mm Documentation, I came across a few straggling
    references to the non-linear vmas which were happily removed in v4.0.
    Delete them.

    Signed-off-by: Hugh Dickins
    Cc: Christoph Lameter
    Cc: "Kirill A. Shutemov"
    Cc: Rik van Riel
    Acked-by: Vlastimil Babka
    Cc: Davidlohr Bueso
    Cc: Oleg Nesterov
    Cc: Sasha Levin
    Cc: Dmitry Vyukov
    Cc: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • max_ptes_swap specifies how many pages can be brought in from swap when
    collapsing a group of pages into a transparent huge page.

    /sys/kernel/mm/transparent_hugepage/khugepaged/max_ptes_swap

    A higher value can cause excessive swap IO and waste memory. A lower
    value can prevent THPs from being collapsed, resulting in fewer pages
    being collapsed into THPs and lower memory access performance.

    Signed-off-by: Ebru Akagunduz
    Acked-by: Rik van Riel
    Acked-by: David Rientjes
    Cc: Oleg Nesterov
    Cc: Kirill A. Shutemov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ebru Akagunduz
     
  • Add documentation on how to use slabinfo-gnuplot.sh script.

    Signed-off-by: Sergey Senozhatsky
    Acked-by: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sergey Senozhatsky
     

11 Sep, 2015

4 commits

  • As noted by Minchan, a benefit of reading the idle flag from
    /proc/kpageflags is that one can easily filter dirty and/or unevictable
    pages while estimating the size of unused memory.

    Note that idle flag read from /proc/kpageflags may be stale in case the
    page was accessed via a PTE, because it would be too costly to iterate
    over all page mappings on each /proc/kpageflags read to provide an
    up-to-date value. To make sure the flag is up-to-date one has to read
    /sys/kernel/mm/page_idle/bitmap first.

    Signed-off-by: Vladimir Davydov
    Reviewed-by: Andres Lagar-Cavilla
    Cc: Minchan Kim
    Cc: Raghavendra K T
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Greg Thelen
    Cc: Michel Lespinasse
    Cc: David Rientjes
    Cc: Pavel Emelyanov
    Cc: Cyrill Gorcunov
    Cc: Jonathan Corbet
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     
  • Knowing the portion of memory that is not used by a certain application or
    memory cgroup (idle memory) can be useful for partitioning the system
    efficiently, e.g. by setting memory cgroup limits appropriately.
    Currently, the only means to estimate the amount of idle memory provided
    by the kernel is /proc/PID/{clear_refs,smaps}: the user can clear the
    access bit for all pages mapped to a particular process by writing 1 to
    clear_refs, wait for some time, and then count smaps:Referenced. However,
    this method has two serious shortcomings:

    - it does not count unmapped file pages
    - it affects the reclaimer logic

    To overcome these drawbacks, this patch introduces two new page flags,
    Idle and Young, and a new sysfs file, /sys/kernel/mm/page_idle/bitmap.
    A page's Idle flag can only be set from userspace, by setting the bit in
    /sys/kernel/mm/page_idle/bitmap at the offset corresponding to the page,
    and it is cleared whenever the page is accessed either through page tables
    (it is cleared in page_referenced() in this case) or using the read(2)
    system call (mark_page_accessed()). Thus by setting the Idle flag for
    pages of a particular workload, which can be found e.g. by reading
    /proc/PID/pagemap, waiting for some time to let the workload access its
    working set, and then reading the bitmap file, one can estimate the
    number of pages that are not used by the workload.
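
    A minimal sketch of that procedure for a single page, assuming the bitmap
    layout of one 64-bit word per 64 pages indexed by PFN (error handling
    abbreviated):

    #include <fcntl.h>
    #include <stdint.h>
    #include <sys/types.h>
    #include <unistd.h>

    /* Mark the page at 'pfn' idle; returns 0 on success. */
    static int set_page_idle(unsigned long pfn)
    {
        uint64_t bits = 1ULL << (pfn % 64);
        off_t off = pfn / 64 * sizeof(bits);
        int fd = open("/sys/kernel/mm/page_idle/bitmap", O_WRONLY);
        int ok;

        if (fd < 0)
            return -1;
        ok = pwrite(fd, &bits, sizeof(bits), off) == sizeof(bits);
        close(fd);
        return ok ? 0 : -1;
    }

    Reading the same word back later (O_RDONLY/pread) tells whether the page
    stayed idle.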

    The Young page flag is used to avoid interference with the memory
    reclaimer. A page's Young flag is set whenever the Access bit of a page
    table entry pointing to the page is cleared by writing to the bitmap file.
    If page_referenced() is called on a Young page, it will add 1 to its
    return value, therefore concealing the fact that the Access bit was
    cleared.

    Note, since there is no room for extra page flags on 32 bit, this feature
    uses extended page flags when compiled on 32 bit.

    [akpm@linux-foundation.org: fix build]
    [akpm@linux-foundation.org: kpageidle requires an MMU]
    [akpm@linux-foundation.org: decouple from page-flags rework]
    Signed-off-by: Vladimir Davydov
    Reviewed-by: Andres Lagar-Cavilla
    Cc: Minchan Kim
    Cc: Raghavendra K T
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Greg Thelen
    Cc: Michel Lespinasse
    Cc: David Rientjes
    Cc: Pavel Emelyanov
    Cc: Cyrill Gorcunov
    Cc: Jonathan Corbet
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     
  • /proc/kpagecgroup contains a 64-bit inode number of the memory cgroup each
    page is charged to, indexed by PFN. Having this information is useful for
    estimating a cgroup working set size.

    The file is present if CONFIG_PROC_PAGE_MONITOR && CONFIG_MEMCG.
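
    For illustration, a lookup for one PFN might be sketched like this (each
    entry is one 64-bit inode number, as described above):

    #include <fcntl.h>
    #include <stdint.h>
    #include <sys/types.h>
    #include <unistd.h>

    /* Return the memcg inode that 'pfn' is charged to, or 0 on error. */
    static uint64_t kpagecgroup_ino(unsigned long pfn)
    {
        uint64_t ino = 0;
        int fd = open("/proc/kpagecgroup", O_RDONLY);

        if (fd < 0)
            return 0;
        if (pread(fd, &ino, sizeof(ino), pfn * sizeof(ino)) != sizeof(ino))
            ino = 0;
        close(fd);
        return ino;
    }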

    Signed-off-by: Vladimir Davydov
    Reviewed-by: Andres Lagar-Cavilla
    Cc: Minchan Kim
    Cc: Raghavendra K T
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Greg Thelen
    Cc: Michel Lespinasse
    Cc: David Rientjes
    Cc: Pavel Emelyanov
    Cc: Cyrill Gorcunov
    Cc: Jonathan Corbet
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     
  • Change the Documentation/vm/zswap.txt doc to indicate that the "zpool" and
    "compressor" params are now changeable at runtime.

    Signed-off-by: Dan Streetman
    Cc: Seth Jennings
    Cc: Sergey Senozhatsky
    Cc: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dan Streetman
     

09 Sep, 2015

3 commits

  • The URL for libhugetlbfs has changed. Also, put a stronger emphasis on
    using libhugetlbfs for hugetlb regression testing.

    Signed-off-by: Mike Kravetz
    Acked-by: Naoya Horiguchi
    Cc: Joern Engel
    Cc: Davidlohr Bueso
    Cc: David Rientjes
    Cc: Shuah Khan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mike Kravetz
     
  • Notes about recent changes.

    [akpm@linux-foundation.org: various tweaks]
    Signed-off-by: Konstantin Khlebnikov
    Cc: Mark Williamson
    Cc: Naoya Horiguchi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Konstantin Khlebnikov
     
  • This patch sets bit 56 in pagemap if this page is mapped only once. It
    allows detecting exclusively used pages without exposing the PFN:

    present file exclusive   state
    0       0    0           non-present
    1       1    0           file page mapped somewhere else
    1       1    1           file page mapped only here
    1       0    0           anon non-CoWed page (shared with parent/child)
    1       0    1           anon CoWed page (or never forked)

    CoWed pages in (MAP_FILE | MAP_PRIVATE) areas are anon in this context.

    The mmap-exclusive bit doesn't reflect potential page-sharing via the
    swapcache: a page could be mapped once but have several swap-ptes pointing
    to it. An application can detect that via the swap bit in the pagemap
    entry and touch the pte via /proc/pid/mem to get the real information.
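
    A sketch of testing the new bit for an address in the calling process
    (the PM_MMAP_EXCLUSIVE name below is ours, for the example):

    #include <fcntl.h>
    #include <stdint.h>
    #include <sys/types.h>
    #include <unistd.h>

    #define PM_MMAP_EXCLUSIVE (1ULL << 56)  /* the bit added by this patch */

    /* 1 if the page backing 'addr' is mapped only here, 0 if not, -1 on error. */
    static int page_exclusive(const void *addr)
    {
        uint64_t ent;
        off_t off = (uintptr_t)addr / sysconf(_SC_PAGESIZE) * sizeof(ent);
        int fd = open("/proc/self/pagemap", O_RDONLY);
        int ok;

        if (fd < 0)
            return -1;
        ok = pread(fd, &ent, sizeof(ent), off) == sizeof(ent);
        close(fd);
        return ok ? !!(ent & PM_MMAP_EXCLUSIVE) : -1;
    }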

    See http://lkml.kernel.org/r/CAEVpBa+_RyACkhODZrRvQLs80iy0sqpdrd0AaP_-tgnX3Y9yNQ@mail.gmail.com

    Requested by Mark Williamson.

    [akpm@linux-foundation.org: fix spello]
    Signed-off-by: Konstantin Khlebnikov
    Reviewed-by: Mark Williamson
    Tested-by: Mark Williamson
    Reviewed-by: Naoya Horiguchi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Konstantin Khlebnikov
     

05 Sep, 2015

2 commits

  • I had requests to return the full address (not the page aligned one) to
    userland.

    It's not entirely clear how the page offset could be relevant, because
    userfaults aren't like SIGBUS, which can sigjump to a different place and
    actually skip resolving the fault depending on a page offset. There's
    currently no real way to skip the fault, especially because after a
    UFFDIO_COPY|ZEROPAGE the fault is optimized to be retried within the
    kernel without having to return to userland first (not even self-modifying
    code replacing the .text that touched the faulting address would prevent
    the fault from being repeated). Userland cannot skip repeating the fault
    even more so if the fault was triggered by a KVM secondary page fault, or
    any get_user_pages or copy-user inside some syscall that will return to
    kernel code. The second time, FAULT_FLAG_RETRY_NOWAIT won't be set,
    leading to a SIGBUS being raised because the userfault can't wait if it
    cannot release the mmap_sem first (and FAULT_FLAG_RETRY_NOWAIT is required
    for that).

    Still, returning a proper structure to userland during the read() on the
    uffd allows using the current UFFD_API for the future non-cooperative
    extensions too, and it looks cleaner as well. Once we get additional
    fields, there's no point in returning the fault address page-aligned
    anymore just to reuse the bits below PAGE_SHIFT.

    The only downside is that the read() syscall will read 32 bytes instead of
    8 bytes, but that's not going to be measurable overhead.

    The total number of new events that can be added, or of new future bits
    for already-shipped events, is limited to 64 by the features field of the
    uffdio_api structure. If more are needed, a bump of UFFD_API will be
    required.
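
    A sketch of consuming the structure, assuming the uffd_msg layout from
    <linux/userfaultfd.h>:

    #include <linux/userfaultfd.h>
    #include <unistd.h>

    /* Read one 32-byte event; returns the event type, or -1 on short read. */
    static int read_uffd_event(int uffd, unsigned long long *fault_addr)
    {
        struct uffd_msg msg;

        if (read(uffd, &msg, sizeof(msg)) != sizeof(msg))
            return -1;
        if (msg.event == UFFD_EVENT_PAGEFAULT)
            *fault_addr = msg.arg.pagefault.address; /* full, not page-aligned */
        return msg.event;
    }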

    [akpm@linux-foundation.org: use __packed]
    Signed-off-by: Andrea Arcangeli
    Acked-by: Pavel Emelyanov
    Cc: Sanidhya Kashyap
    Cc: zhang.zhanghailiang@huawei.com
    Cc: "Kirill A. Shutemov"
    Cc: Andres Lagar-Cavilla
    Cc: Dave Hansen
    Cc: Paolo Bonzini
    Cc: Rik van Riel
    Cc: Mel Gorman
    Cc: Andy Lutomirski
    Cc: Hugh Dickins
    Cc: Peter Feiner
    Cc: "Dr. David Alan Gilbert"
    Cc: Johannes Weiner
    Cc: "Huangpeng (Peter)"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     
  • This is the latest userfaultfd patchset. The postcopy live migration
    feature on the qemu side is mostly ready to be merged and it entirely
    depends on the userfaultfd syscall to be merged as well. So it'd be great
    if this patchset could be reviewed for merging in -mm.

    Userfaults allow implementing on-demand paging from userland and, more
    generally, they allow userland to take control of the behavior of page
    faults more efficiently than what was available before (PROT_NONE +
    SIGSEGV trap).

    The use cases are:

    1) KVM postcopy live migration (one form of cloud memory
    externalization).

    KVM postcopy live migration is the primary driver of this work:

    http://blog.zhaw.ch/icclab/setting-up-post-copy-live-migration-in-openstack/
    http://lists.gnu.org/archive/html/qemu-devel/2015-02/msg04873.html

    2) postcopy live migration of binaries inside linux containers:

    http://thread.gmane.org/gmane.linux.kernel.mm/132662

    3) KVM postcopy live snapshotting (allowing to limit/throttle the
    memory usage, unlike fork would, plus the avoidance of fork
    overhead in the first place).

    While the wrprotect tracking is not implemented yet, the syscall API
    already contemplates wrprotect fault tracking and is generic enough to
    allow its later implementation in a backwards compatible fashion.

    4) KVM userfaults on shared memory. The UFFDIO_COPY lowlevel method
    should be extended to work also on tmpfs and then the
    uffdio_register.ioctls will notify userland that UFFDIO_COPY is
    available even when the registered virtual memory range is tmpfs
    backed.

    5) alternate mechanism to notify web browsers or apps on embedded
    devices that volatile pages have been reclaimed. This basically
    avoids the need to run a syscall before the app can access with the
    CPU the virtual regions marked volatile. This depends on point 4)
    to be fulfilled first, as volatile pages happily apply to tmpfs.

    Even though there wasn't a real use case requesting it yet, it also
    allows implementing distributed shared memory in a way that read-only
    shared mappings can exist simultaneously on different hosts and become
    exclusive at the first wrprotect fault.

    This patch (of 22):

    Add documentation.

    Signed-off-by: Andrea Arcangeli
    Acked-by: Pavel Emelyanov
    Cc: Sanidhya Kashyap
    Cc: zhang.zhanghailiang@huawei.com
    Cc: "Kirill A. Shutemov"
    Cc: Andres Lagar-Cavilla
    Cc: Dave Hansen
    Cc: Paolo Bonzini
    Cc: Rik van Riel
    Cc: Mel Gorman
    Cc: Andy Lutomirski
    Cc: Hugh Dickins
    Cc: Peter Feiner
    Cc: "Dr. David Alan Gilbert"
    Cc: Johannes Weiner
    Cc: "Huangpeng (Peter)"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     

26 Jun, 2015

1 commit

  • Change the "enabled" parameter to be configurable at runtime. Remove the
    enabled check from init(), and move it to the frontswap store() function;
    when enabled, pages will be stored, and when disabled, pages won't be
    stored.

    This is almost identical to Seth's patch from 2 years ago:
    http://lkml.iu.edu/hypermail/linux/kernel/1307.2/04289.html

    [akpm@linux-foundation.org: tweak documentation]
    Signed-off-by: Dan Streetman
    Suggested-by: Seth Jennings
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dan Streetman
     

25 Jun, 2015

1 commit

  • There is a very subtle difference between mmap()+mlock() and
    mmap(MAP_LOCKED) semantics. The former fails if population of the area
    fails, while the latter doesn't. This basically means that
    mmap(MAP_LOCKED) areas might see a major fault after the mmap syscall
    returns, which is not the case for mlock(). The mmap man page has already
    been altered, but Documentation/vm/unevictable-lru.txt deserves a
    clarification as well.
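
    A sketch of the pattern to use when population failure must be detected:

    #include <stddef.h>
    #include <sys/mman.h>

    /* mlock() reports population failure; MAP_LOCKED alone would not. */
    static void *map_locked_or_null(size_t len)
    {
        void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

        if (p == MAP_FAILED)
            return NULL;
        if (mlock(p, len)) { /* fails if the pages can't be populated */
            munmap(p, len);
            return NULL;
        }
        return p;
    }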

    Signed-off-by: Michal Hocko
    Reported-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     

18 Apr, 2015

1 commit

  • Pull documentation updates from Jonathan Corbet:
    "Numerous fixes, the overdue removal of the i2o docs, some new Chinese
    translations, and, hopefully, the README fix that will end the flow of
    identical patches to that file"

    * tag 'docs-for-linus' of git://git.lwn.net/linux-2.6: (34 commits)
    Documentation/memcg: update memcg/kmem status
    Documentation: blackfin: Makefile: Typo building issue
    Documentation/vm/pagemap.txt: correct location of page-types tool
    Documentation/memory-barriers.txt: typo fix
    doc: Add guest_nice column to example output of `cat /proc/stat'
    Documentation/kernel-parameters: Move "eagerfpu" to its right place
    Documentation: gpio: Update ACPI part of the document to mention _DSD
    docs/completion.txt: Various tweaks and corrections
    doc: completion: context, scope and language fixes
    Documentation:Update Documentation/zh_CN/arm64/memory.txt
    Documentation:Update Documentation/zh_CN/arm64/booting.txt
    Documentation: Chinese translation of arm64/legacy_instructions.txt
    DocBook media: fix broken EIA hyperlink
    Documentation: tweak the maintainers entry
    README: Change gzip/bzip2 to xz compression format
    README: Update version number reference
    doc:pci: Fix typo in Documentation/PCI
    Documentation: drm: Use '->' when describing access through pointers.
    Documentation: Remove mentioning of block barriers
    Documentation/email-clients.txt: Fix one grammar mistake, add extra info about TB
    ...

    Linus Torvalds
     

16 Apr, 2015

4 commits

  • Create zsmalloc doc which explains design concept and stat information.

    Signed-off-by: Minchan Kim
    Cc: Juneho Choi
    Cc: Gunho Lee
    Cc: Luigi Semenzato
    Cc: Dan Streetman
    Cc: Seth Jennings
    Cc: Nitin Gupta
    Cc: Jerome Marchand
    Cc: Sergey Senozhatsky
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     
  • munmap(2) of hugetlb memory requires a length that is hugepage aligned,
    otherwise it may fail. Add this to the documentation.
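
    For example, a sketch that rounds the length up before unmapping
    (assuming 2MB huge pages; a real program should query the actual huge
    page size):

    #include <stddef.h>
    #include <sys/mman.h>

    #define HPAGE_SIZE (2UL << 20) /* assumed huge page size */

    static int munmap_hugetlb(void *addr, size_t len)
    {
        /* munmap() of hugetlb memory wants a hugepage-aligned length */
        len = (len + HPAGE_SIZE - 1) & ~(HPAGE_SIZE - 1);
        return munmap(addr, len);
    }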

    This also cleans up the documentation and separates it into logical units:
    one part refers to MAP_HUGETLB and another part refers to requirements for
    shared memory segments.

    Signed-off-by: David Rientjes
    Cc: Jonathan Corbet
    Cc: Davide Libenzi
    Cc: Luiz Capitulino
    Cc: Shuah Khan
    Acked-by: Hugh Dickins
    Cc: Andrea Arcangeli
    Cc: Joern Engel
    Cc: Jianguo Wu
    Cc: Eric B Munson
    Cc: Michael Ellerman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • Add the min_size mount option to the hugetlbfs documentation. Also, add
    the missing pagesize option and mention that size can be specified as
    bytes or as a percentage of the huge page pool.

    Signed-off-by: Mike Kravetz
    Cc: Davidlohr Bueso
    Cc: Aneesh Kumar
    Cc: Joonsoo Kim
    Cc: Andi Kleen
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mike Kravetz
     
  • …d the unevictable LRU

    The memory compaction code uses the migration code to do most of the
    work in compaction. However, the compaction code interacts with the
    unevictable LRU differently than migration code and this difference
    should be noted in the documentation.

    [akpm@linux-foundation.org: identify /proc/sys/vm/compact_unevictable directly]
    Signed-off-by: Eric B Munson <emunson@akamai.com>
    Cc: Michal Hocko <mhocko@suse.cz>
    Cc: Vlastimil Babka <vbabka@suse.cz>
    Cc: Christoph Lameter <cl@linux.com>
    Cc: David Rientjes <rientjes@google.com>
    Cc: Rik van Riel <riel@redhat.com>
    Cc: Thomas Gleixner <tglx@linutronix.de>
    Cc: Peter Zijlstra <peterz@infradead.org>
    Cc: Mel Gorman <mgorman@suse.de>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

    Eric B Munson
     

15 Apr, 2015

2 commits

  • Currently, cleancache_register_ops returns the previous value of
    cleancache_ops to allow chaining. However, chaining, as it is
    implemented now, is extremely dangerous due to possible pool id
    collisions. Suppose a new cleancache driver is registered after the
    previous one assigned an id to a super block. If the new driver assigns
    the same id to another super block, which is perfectly possible, we will
    have two different filesystems using the same id. No matter if the new
    driver implements chaining or not, we are likely to get data corruption
    with such a configuration eventually.

    This patch therefore disables the ability to override cleancache_ops
    altogether, as potentially dangerous. If a cleancache driver is already
    registered, all further calls to cleancache_register_ops will return
    EBUSY. Since no user of cleancache implements chaining, we only need to
    make minor changes to the code outside the cleancache core.
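
    The core of the change can be sketched, in kernel context, as an atomic
    claim of the single ops slot (a sketch, not the verbatim patch):

    /* Only the first registration wins; chaining is no longer possible. */
    int cleancache_register_ops(struct cleancache_ops *ops)
    {
        if (cmpxchg(&cleancache_ops, NULL, ops))
            return -EBUSY;
        return 0;
    }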

    Signed-off-by: Vladimir Davydov
    Cc: Konrad Rzeszutek Wilk
    Cc: Boris Ostrovsky
    Cc: David Vrabel
    Cc: Mark Fasheh
    Cc: Joel Becker
    Cc: Stefan Hengelein
    Cc: Florian Schmaus
    Cc: Andor Daam
    Cc: Dan Magenheimer
    Cc: Bob Liu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     
  • __mlock_vma_pages_range() doesn't necessarily mlock pages. It depends on
    vma flags. The same codepath is used for MAP_POPULATE.

    Let's rename __mlock_vma_pages_range() to populate_vma_page_range().

    This patch also drops mlock_vma_pages_range() references from the
    documentation. It went away in cea10a19b797 ("mm: directly use
    __mlock_vma_pages_range() in find_extend_vma()").

    Signed-off-by: Kirill A. Shutemov
    Acked-by: Linus Torvalds
    Acked-by: David Rientjes
    Cc: Michel Lespinasse
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     

11 Apr, 2015

1 commit


20 Mar, 2015

1 commit

  • max_ptes_none specifies how many extra small pages (that are
    not already mapped) can be allocated when collapsing a group
    of small pages into one large page.

    /sys/kernel/mm/transparent_hugepage/khugepaged/max_ptes_none

    A higher value causes programs to use additional memory. A lower
    value yields less THP performance gain. The CPU time spent on
    max_ptes_none itself is negligible, so it can be ignored.

    Signed-off-by: Ebru Akagunduz
    Reviewed-by: Rik van Riel
    Signed-off-by: Jonathan Corbet

    Ebru Akagunduz
     

12 Feb, 2015

1 commit

  • Add KPF_ZERO_PAGE flag for zero_page, so that userspace processes can
    detect zero_page in /proc/kpageflags, and then do memory analysis more
    accurately.

    Signed-off-by: Yalin Wang
    Acked-by: Kirill A. Shutemov
    Cc: Konstantin Khlebnikov
    Cc: Naoya Horiguchi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wang, Yalin
     

11 Feb, 2015

2 commits

  • Pull trivial tree changes from Jiri Kosina:
    "Patches from trivial.git that keep the world turning around.

    Mostly documentation and comment fixes, and two corner-case code
    fixes from Alan Cox"

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial:
    kexec, Kconfig: spell "architecture" properly
    mm: fix cleancache debugfs directory path
    blackfin: mach-common: ints-priority: remove unused function
    doubletalk: probe failure causes OOPS
    ARM: cache-l2x0.c: Make it clear that cache-l2x0 handles L310 cache controller
    msdos_fs.h: fix 'fields' in comment
    scsi: aic7xxx: fix comment
    ARM: l2c: fix comment
    ibmraid: fix writeable attribute with no store method
    dynamic_debug: fix comment
    doc: usbmon: fix spelling s/unpriviledged/unprivileged/
    x86: init_mem_mapping(): use capital BIOS in comment

    Linus Torvalds
     
  • remap_file_pages(2) was invented to efficiently map parts of a huge file
    into a limited 32-bit virtual address space, such as in database
    workloads.

    Nonlinear mappings are a pain to support, and it seems there are no
    legitimate use-cases nowadays since 64-bit systems are widely available.

    Let's drop it and get rid of all this special-cased code.

    The patch replaces the syscall with an emulation which creates a new VMA
    on each remap_file_pages(), unless it can be merged with an adjacent one.

    I didn't find *any* real code that uses remap_file_pages(2) to test the
    emulation impact on. I've checked Debian code search and the source of
    all packages in ALT Linux. No real users: libc wrappers, mentions in
    strace, gdb, valgrind, and that kind of stuff.

    There are a few basic tests for the syscall in LTP. They work just fine
    with the emulation.

    To test the performance impact, I've written a small test case which
    demonstrates pretty much the worst-case scenario: map a 4G shmfs file,
    write the page's pgoff at the beginning of every page, remap the pages in
    reverse order, read every page.

    The test creates 1 million VMAs when the emulation is in use, so I had to
    set vm.max_map_count to 1100000 to avoid -ENOMEM.

    Before: 23.3 ( +- 4.31% ) seconds
    After: 43.9 ( +- 0.85% ) seconds
    Slowdown: 1.88x

    I believe we can live with that.

    Test case:

    #define _GNU_SOURCE
    #include <assert.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/mman.h>

    #define MB (1024UL * 1024)
    #define SIZE (4096 * MB)

    int main(int argc, char **argv)
    {
        unsigned long *p;
        long i, pass;

        for (pass = 0; pass < 10; pass++) {
            p = mmap(NULL, SIZE, PROT_READ|PROT_WRITE,
                     MAP_SHARED | MAP_ANONYMOUS, -1, 0);
            if (p == MAP_FAILED) {
                perror("mmap");
                return -1;
            }

            /* write each page's pgoff at its start */
            for (i = 0; i < SIZE / 4096; i++)
                p[i * 4096 / sizeof(*p)] = i;

            /* remap the pages in reverse order */
            for (i = 0; i < SIZE / 4096; i++) {
                if (remap_file_pages(p + i * 4096 / sizeof(*p), 4096,
                                     0, (SIZE - 4096 * (i + 1)) >> 12, 0)) {
                    perror("remap_file_pages");
                    return -1;
                }
            }

            /* read every page back and check the reversal */
            for (i = SIZE / 4096 - 1; i >= 0; i--)
                assert(p[i * 4096 / sizeof(*p)] == SIZE / 4096 - i - 1);

            munmap(p, SIZE);
        }

        return 0;
    }

    [akpm@linux-foundation.org: fix spello]
    [sasha.levin@oracle.com: initialize populate before usage]
    [sasha.levin@oracle.com: grab file ref to prevent race while mmaping]
    Signed-off-by: "Kirill A. Shutemov"
    Cc: Peter Zijlstra
    Cc: Ingo Molnar
    Cc: Dave Jones
    Cc: Linus Torvalds
    Cc: Armin Rigo
    Signed-off-by: Sasha Levin
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     

20 Jan, 2015

1 commit


14 Dec, 2014

2 commits

  • Merge second patchbomb from Andrew Morton:
    - the rest of MM
    - misc fs fixes
    - add execveat() syscall
    - new ratelimit feature for fault-injection
    - decompressor updates
    - ipc/ updates
    - fallocate feature creep
    - fsnotify cleanups
    - a few other misc things

    * emailed patches from Andrew Morton: (99 commits)
    cgroups: Documentation: fix trivial typos and wrong paragraph numberings
    parisc: percpu: update comments referring to __get_cpu_var
    percpu: update local_ops.txt to reflect this_cpu operations
    percpu: remove __get_cpu_var and __raw_get_cpu_var macros
    fsnotify: remove destroy_list from fsnotify_mark
    fsnotify: unify inode and mount marks handling
    fallocate: create FAN_MODIFY and IN_MODIFY events
    mm/cma: make kmemleak ignore CMA regions
    slub: fix cpuset check in get_any_partial
    slab: fix cpuset check in fallback_alloc
    shmdt: use i_size_read() instead of ->i_size
    ipc/shm.c: fix overly aggressive shmdt() when calls span multiple segments
    ipc/msg: increase MSGMNI, remove scaling
    ipc/sem.c: increase SEMMSL, SEMMNI, SEMOPM
    ipc/sem.c: change memory barrier in sem_lock() to smp_rmb()
    lib/decompress.c: consistency of compress formats for kernel image
    decompress_bunzip2: off by one in get_next_block()
    usr/Kconfig: make initrd compression algorithm selection not expert
    fault-inject: add ratelimit option
    ratelimit: add initialization macro
    ...

    Linus Torvalds
     
  • Page owner is for tracking who allocated each page. This document
    explains what the page owner feature is and what its merits are. A
    simple HOW-TO is also included. See the document for detailed
    information.

    Signed-off-by: Joonsoo Kim
    Cc: Mel Gorman
    Cc: Johannes Weiner
    Cc: Minchan Kim
    Cc: Dave Hansen
    Cc: Michal Nazarewicz
    Cc: Jungsoo Son
    Cc: Ingo Molnar
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     

07 Nov, 2014

1 commit


23 Oct, 2014

1 commit


07 Jun, 2014

1 commit

  • The remap_file_pages() system call is used to create a nonlinear
    mapping, that is, a mapping in which the pages of the file are mapped
    into a nonsequential order in memory. The advantage of using
    remap_file_pages() over using repeated calls to mmap(2) is that the
    former approach does not require the kernel to create additional VMA
    (Virtual Memory Area) data structures.

    Supporting nonlinear mappings requires a significant amount of
    non-trivial code in the kernel virtual memory subsystem, including hot
    paths. Also, to make nonlinear mappings work, the kernel needs a way to
    distinguish normal page table entries from entries with a file offset
    (pte_file). The kernel reserves a flag in the PTE for this purpose. PTE
    flags are a scarce resource, especially on some CPU architectures. It
    would be nice to free up the flag for other usage.

    Fortunately, there are not many users of remap_file_pages() in the wild.
    It's only known that one enterprise RDBMS implementation uses the
    syscall on 32-bit systems to map files bigger than can linearly fit into
    32-bit virtual address space. This use-case is not critical anymore
    since 64-bit systems are widely available.

    The plan is to deprecate the syscall and replace it with an emulation.
    The emulation will create new VMAs instead of nonlinear mappings. It's
    going to work slower for rare users of remap_file_pages() but ABI is
    preserved.

    One side effect of the emulation (apart from performance) is that a user
    can hit the vm.max_map_count limit more easily due to the additional
    VMAs. See the comment at DEFAULT_MAX_MAP_COUNT for more details on the
    limit.

    [akpm@linux-foundation.org: fix spello]
    Signed-off-by: Kirill A. Shutemov
    Cc: Peter Zijlstra
    Cc: Ingo Molnar
    Cc: Dave Jones
    Cc: Armin Rigo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     

05 Jun, 2014

1 commit

  • Currently the memory error handler handles action-optional errors in a
    deferred manner by default. And if a recovery-aware application wants to
    handle it immediately, it can do so by setting the PF_MCE_EARLY flag.
    However, such a signal can be sent only to the main thread, so it's
    problematic if the application wants to have a dedicated thread to
    handle such signals.

    So this patch adds dedicated thread support to the memory error handler.
    We have PF_MCE_EARLY flags for each thread separately, so with this patch
    the AO signal is sent to the thread with the PF_MCE_EARLY flag set, not
    the main thread. If you want to implement a dedicated thread, you call
    prctl() to set PF_MCE_EARLY on that thread.

    The memory error handler collects processes to be killed, so this patch
    lets it check the PF_MCE_EARLY flag on each thread in the collecting
    routines.

    No behavioral change for all non-early kill cases.
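
    A sketch of what such a dedicated thread would do, using the existing
    PR_MCE_KILL prctl interface for this flag:

    #include <sys/prctl.h>

    /* Mark the calling thread to receive early (AO) memory-error signals. */
    static int enable_early_mce_kill(void)
    {
        return prctl(PR_MCE_KILL, PR_MCE_KILL_SET, PR_MCE_KILL_EARLY, 0, 0);
    }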

    Tony said:

    : The old behavior was crazy - someone with a multithreaded process might
    : well expect that if they call prctl(PF_MCE_EARLY) in just one thread, then
    : that thread would see the SIGBUS with si_code = BUS_MCEERR_A0 - even if
    : that thread wasn't the main thread for the process.

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Naoya Horiguchi
    Reviewed-by: Tony Luck
    Cc: Kamil Iskra
    Cc: Andi Kleen
    Cc: Borislav Petkov
    Cc: Chen Gong
    Cc: [3.2+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi