28 Jan, 2017

1 commit

  • Userspace processes often have multiple allocators that each do
    anonymous mmaps to get memory. When examining memory usage of
    individual processes or systems as a whole, it is useful to be
    able to break down the various heaps that were allocated by
    each layer and examine their size, RSS, and physical memory
    usage.

    This patch adds a user pointer to the shared union in
    vm_area_struct that points to a null terminated string inside
    the user process containing a name for the vma. vmas that
    point to the same address will be merged, but vmas that
    point to equivalent strings at different addresses will
    not be merged.

    Userspace can set the name for a region of memory by calling
    prctl(PR_SET_VMA, PR_SET_VMA_ANON_NAME, start, len, (unsigned long)name);
    Setting the name to NULL clears it.
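
    As a concrete illustration, here is a minimal userspace sketch of the
    interface described above (the PR_SET_VMA/PR_SET_VMA_ANON_NAME values are
    taken from this patch and are not in mainline uapi headers at this point,
    hence the fallback defines):

    #include <stdio.h>
    #include <sys/mman.h>
    #include <sys/prctl.h>

    #ifndef PR_SET_VMA
    #define PR_SET_VMA           0x53564d41
    #define PR_SET_VMA_ANON_NAME 0
    #endif

    int main(void)
    {
        size_t len = 1 << 20;
        void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (p == MAP_FAILED)
            return 1;

        /* Tag the region; it should show up in /proc/self/maps as a named
         * anonymous vma. The kernel keeps a pointer to this string, so it
         * must stay valid for the life of the mapping. */
        if (prctl(PR_SET_VMA, PR_SET_VMA_ANON_NAME,
                  (unsigned long)p, len, (unsigned long)"myheap"))
            perror("prctl(PR_SET_VMA)");

        /* Passing NULL (0) for the name clears it again. */
        prctl(PR_SET_VMA, PR_SET_VMA_ANON_NAME, (unsigned long)p, len, 0);
        return 0;
    }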

    The names of named anonymous vmas are shown in /proc/pid/maps
    as [anon:<name>] and in /proc/pid/smaps in a new "Name" field
    that is only present for named vmas. If the userspace pointer
    is no longer valid, all or part of the name will be replaced
    with "<fault>".

    The idea to store a userspace pointer to reduce the complexity
    within mm (at the expense of the complexity of reading
    /proc/pid/mem) came from Dave Hansen. This results in no
    runtime overhead in the mm subsystem other than comparing
    the anon_name pointers when considering vma merging. The pointer
    is stored in a union with fields that are only used on file-backed
    mappings, so it does not increase memory usage.

    Includes a fix from Jed Davis for a typo in
    prctl_set_vma_anon_name which could cause it to attempt to set
    the name across two vmas at the same time and potentially
    corrupt the vma list. Fix it to use tmp instead of end to limit
    the name setting to a single vma at a time.

    Change-Id: I9aa7b6b5ef536cd780599ba4e2fba8ceebe8b59f
    Signed-off-by: Dmitry Shmidt

    Colin Cross
     

19 Oct, 2016

1 commit


11 Oct, 2016

1 commit

  • Pull protection keys syscall interface from Thomas Gleixner:
    "This is the final step of Protection Keys support which adds the
    syscalls so user space can actually allocate keys and protect memory
    areas with them. Details and usage examples can be found in the
    documentation.

    The mm side of this has been acked by Mel"

    * 'mm-pkeys-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    x86/pkeys: Update documentation
    x86/mm/pkeys: Do not skip PKRU register if debug registers are not used
    x86/pkeys: Fix pkeys build breakage for some non-x86 arches
    x86/pkeys: Add self-tests
    x86/pkeys: Allow configuration of init_pkru
    x86/pkeys: Default to a restrictive init PKRU
    pkeys: Add details of system call use to Documentation/
    generic syscalls: Wire up memory protection keys syscalls
    x86: Wire up protection keys system calls
    x86/pkeys: Allocation/free syscalls
    x86/pkeys: Make mprotect_key() mask off additional vm_flags
    mm: Implement new pkey_mprotect() system call
    x86/pkeys: Add fault handling for PF_PK page fault bit

    Linus Torvalds
     

08 Oct, 2016

2 commits

    The rmap_walk can access vm_page_prot (and potentially vm_flags in the
    pte/pmd manipulations), so it's not safe to wait for the caller to update
    the vm_page_prot/vm_flags after vma_merge has returned, having potentially
    removed the "next" vma and extended the "current" vma over the
    next->vm_start,vm_end range but still with the "current" vma's
    vm_page_prot, after releasing the rmap locks.

    The vm_page_prot/vm_flags must be transferred from the "next" vma to the
    current vma while vma_merge still holds the rmap locks.

    The side effect of this race condition is pte corruption during migration:
    remove_migration_ptes, when run on an address of the "next" vma that
    got removed, used the vm_page_prot of the current vma.

    migrate                         mprotect
    ------------                    -------------
    migrating in "next" vma
                                    vma_merge() # removes "next" vma and
                                                # extends "current" vma
                                                # current vma is not with
                                                # vm_page_prot updated
    remove_migration_ptes
    read vm_page_prot of current "vma"
    establish pte with wrong permissions
                                    vm_set_page_prot(vma) # too late!
                                    change_protection in the old vma range
                                    only, next range is not updated

    This caused segmentation faults and potentially memory corruption in
    heavy mprotect loads with some light page migration caused by compaction
    in the background.

    Hugh Dickins pointed out the comment about the odd case 8 in vma_merge,
    which confirms that case 8 is the only buggy one where the race can
    trigger; in all other vma_merge cases the above cannot happen.

    This fix removes the oddness factor from case 8 and it converts it from:

        AAAA
    PPPPNNNNXXXX -> PPPPNNNNNNNN

    to:

        AAAA
    PPPPNNNNXXXX -> PPPPXXXXXXXX

    XXXX has the right vma properties for the whole merged vma returned by
    vma_adjust, so it solves the problem fully. It has the added benefit
    that the callers could stop updating vma properties when vma_merge
    succeeds; however, the callers are not updated by this patch (there are
    bits like VM_SOFTDIRTY that still need special care for the whole range,
    as the vma merging ignores them, but as long as they're not processed by
    rmap walks and instead are accessed with the mmap_sem held at least for
    reading, they are fine not to be updated within vma_adjust before
    releasing the rmap_locks).

    Link: http://lkml.kernel.org/r/1474309513-20313-1-git-send-email-aarcange@redhat.com
    Signed-off-by: Andrea Arcangeli
    Reported-by: Aditya Mandaleeka
    Cc: Rik van Riel
    Cc: Hugh Dickins
    Cc: Mel Gorman
    Cc: Jan Vorlicek
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     
    vma->vm_page_prot is read locklessly from the rmap_walk; it may be updated
    concurrently, and this change prevents the risk of reading intermediate
    values.

    Link: http://lkml.kernel.org/r/1474660305-19222-1-git-send-email-aarcange@redhat.com
    Signed-off-by: Andrea Arcangeli
    Cc: Rik van Riel
    Cc: Hugh Dickins
    Cc: Mel Gorman
    Cc: Jan Vorlicek
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     

09 Sep, 2016

3 commits

  • This patch adds two new system calls:

    int pkey_alloc(unsigned long flags, unsigned long init_access_rights);
    int pkey_free(int pkey);

    These implement an "allocator" for the protection keys
    themselves, which can be thought of as analogous to the allocator
    that the kernel has for file descriptors. The kernel tracks
    which numbers are in use, and only allows operations on keys that
    are valid. A key which was not obtained by pkey_alloc() may not,
    for instance, be passed to pkey_mprotect().

    These system calls are also very important given the kernel's use
    of pkeys to implement execute-only support. These help ensure
    that userspace can never assume that it has control of a key
    unless it first asks the kernel. The kernel does not promise to
    preserve PKRU (rights register) contents except for allocated
    pkeys.

    The 'init_access_rights' argument to pkey_alloc() specifies the
    rights that will be established for the returned pkey. For
    instance:

    pkey = pkey_alloc(flags, PKEY_DENY_WRITE);

    will allocate 'pkey', but also set the bits in PKRU[1] such that
    writing to 'pkey' is already denied.

    The kernel does not prevent pkey_free() from successfully freeing
    in-use pkeys (those still assigned to a memory range by
    pkey_mprotect()). It would be expensive to implement the checks
    for this, so we instead say, "Just don't do it" since sane
    software will never do it anyway.

    Any piece of userspace calling pkey_alloc() needs to be prepared
    for it to fail. Why? pkey_alloc() returns the same error code
    (ENOSPC) when there are no pkeys and when pkeys are unsupported.
    They can be unsupported for a whole host of reasons, so apps must
    be prepared for this. Also, libraries or LD_PRELOADs might steal
    keys before an application gets access to them.
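
    For illustration, a small self-contained sketch of that failure handling
    (assuming a libc that provides the pkey_alloc()/pkey_free() wrappers; the
    PKEY_DENY_WRITE name follows this patchset, later uapi headers spell it
    PKEY_DISABLE_WRITE, hence the fallback define):

    #include <stdio.h>
    #include <sys/mman.h>

    #ifndef PKEY_DENY_WRITE
    #define PKEY_DENY_WRITE 0x2
    #endif

    /* Returns an allocated write-deny pkey, or -1 so the caller can fall
     * back to plain mprotect(). ENOSPC covers both "no free keys" and
     * "pkeys unsupported", so both cases end up on the fallback path. */
    static int alloc_write_deny_pkey(void)
    {
        int pkey = pkey_alloc(0, PKEY_DENY_WRITE);
        if (pkey < 0)
            perror("pkey_alloc");
        return pkey;
    }

    int main(void)
    {
        int pkey = alloc_write_deny_pkey();
        printf("pkey = %d\n", pkey);
        if (pkey >= 0)
            pkey_free(pkey);
        return 0;
    }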

    This allocation mechanism could be implemented in userspace.
    Even if we did it in userspace, we would still need additional
    user/kernel interfaces to tell userspace which keys are being
    used by the kernel internally (such as for execute-only
    mappings). Having the kernel provide this facility completely
    removes the need for these additional interfaces, or having an
    implementation of this in userspace at all.

    Note that we have to make changes to all of the architectures
    that do not use mman-common.h because we use the new
    PKEY_DENY_ACCESS/WRITE macros in arch-independent code.

    1. PKRU is the Protection Key Rights User register. It is a
    usermode-accessible register that controls whether writes
    and/or access to each individual pkey is allowed or denied.

    Signed-off-by: Dave Hansen
    Acked-by: Mel Gorman
    Cc: linux-arch@vger.kernel.org
    Cc: Dave Hansen
    Cc: arnd@arndb.de
    Cc: linux-api@vger.kernel.org
    Cc: linux-mm@kvack.org
    Cc: luto@kernel.org
    Cc: akpm@linux-foundation.org
    Cc: torvalds@linux-foundation.org
    Link: http://lkml.kernel.org/r/20160729163015.444FE75F@viggo.jf.intel.com
    Signed-off-by: Thomas Gleixner

    Dave Hansen
     
  • Today, mprotect() takes 4 bits of data: PROT_READ/WRITE/EXEC/NONE.
    Three of those bits: READ/WRITE/EXEC get translated directly in to
    vma->vm_flags by calc_vm_prot_bits(). If a bit is unset in
    mprotect()'s 'prot' argument then it must be cleared in vma->vm_flags
    during the mprotect() call.

    We do this clearing today by first calculating the VMA flags we
    want set, then clearing the ones we do not want to inherit from
    the original VMA:

    vm_flags = calc_vm_prot_bits(prot, key);
    ...
    newflags = vm_flags;
    newflags |= (vma->vm_flags & ~(VM_READ | VM_WRITE | VM_EXEC));

    However, we *also* want to mask off the original VMA's vm_flags in
    which we store the protection key.

    To do that, this patch adds a new macro:

    ARCH_VM_PKEY_FLAGS

    which allows the architecture to specify additional bits that it would
    like cleared. We use that to ensure that the VM_PKEY_BIT* bits get
    cleared.
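
    To make the flag recomputation concrete, here is a self-contained sketch of
    the masking described above; the flag values are stand-ins for this example
    only (the real VM_* and ARCH_VM_PKEY_FLAGS definitions live in the kernel
    headers):

    #include <stdio.h>

    typedef unsigned long vm_flags_t;

    #define VM_READ            0x1UL
    #define VM_WRITE           0x2UL
    #define VM_EXEC            0x4UL
    #define ARCH_VM_PKEY_FLAGS 0xf00UL   /* stand-in for the VM_PKEY_BIT* mask */

    /* Keep everything from the old vma except the protection bits and the
     * pkey bits; both are fully re-derived from the new request. */
    static vm_flags_t recompute_flags(vm_flags_t old_flags, vm_flags_t wanted)
    {
        return wanted |
               (old_flags & ~(VM_READ | VM_WRITE | VM_EXEC | ARCH_VM_PKEY_FLAGS));
    }

    int main(void)
    {
        vm_flags_t old_flags = VM_READ | VM_WRITE | 0x600UL;  /* old pkey bits */
        vm_flags_t new_flags = recompute_flags(old_flags, VM_READ | 0x100UL);

        printf("new flags: %#lx\n", new_flags);  /* old pkey bits are gone */
        return 0;
    }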

    Signed-off-by: Dave Hansen
    Acked-by: Mel Gorman
    Reviewed-by: Thomas Gleixner
    Cc: linux-arch@vger.kernel.org
    Cc: Dave Hansen
    Cc: arnd@arndb.de
    Cc: linux-api@vger.kernel.org
    Cc: linux-mm@kvack.org
    Cc: luto@kernel.org
    Cc: akpm@linux-foundation.org
    Cc: torvalds@linux-foundation.org
    Link: http://lkml.kernel.org/r/20160729163013.E48D6981@viggo.jf.intel.com
    Signed-off-by: Thomas Gleixner

    Dave Hansen
     
  • pkey_mprotect() is just like mprotect, except it also takes a
    protection key as an argument. On systems that do not support
    protection keys, it still works, but requires that key=0.
    Otherwise it does exactly what mprotect does.

    I expect it to get used like this, if you want to guarantee that
    any mapping you create can *never* be accessed without the right
    protection keys set up.

    int real_prot = PROT_READ|PROT_WRITE;
    pkey = pkey_alloc(0, PKEY_DENY_ACCESS);
    ptr = mmap(NULL, PAGE_SIZE, PROT_NONE, MAP_ANONYMOUS|MAP_PRIVATE, -1, 0);
    ret = pkey_mprotect(ptr, PAGE_SIZE, real_prot, pkey);

    This way, there is *no* window where the mapping is accessible
    since it was always either PROT_NONE or had a protection key set
    that denied all access.
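
    A self-contained version of that flow, with error handling, might look like
    the sketch below (assuming a libc that provides the pkey_alloc(),
    pkey_mprotect() and pkey_free() wrappers; PKEY_DENY_ACCESS is the name used
    in this patchset, later spelled PKEY_DISABLE_ACCESS, hence the fallback
    define):

    #include <stdio.h>
    #include <sys/mman.h>
    #include <unistd.h>

    #ifndef PKEY_DENY_ACCESS
    #define PKEY_DENY_ACCESS 0x1
    #endif

    int main(void)
    {
        int real_prot = PROT_READ | PROT_WRITE;
        long pagesz = sysconf(_SC_PAGESIZE);

        int pkey = pkey_alloc(0, PKEY_DENY_ACCESS);
        if (pkey < 0) {
            perror("pkey_alloc");          /* no free keys, or no pkey support */
            return 1;
        }

        void *ptr = mmap(NULL, pagesz, PROT_NONE,
                         MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);
        if (ptr == MAP_FAILED)
            return 1;

        /* The mapping was never accessible: PROT_NONE first, then a pkey
         * whose PKRU bits deny all access. */
        if (pkey_mprotect(ptr, pagesz, real_prot, pkey)) {
            perror("pkey_mprotect");
            return 1;
        }

        munmap(ptr, pagesz);               /* drop the mapping before freeing the key */
        pkey_free(pkey);
        return 0;
    }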

    We settled on 'unsigned long' for the type of the key here. We
    only need 4 bits on x86 today, but I figured that other
    architectures might need some more space.

    Semantically, we have a bit of a problem if we combine this
    syscall with our previously-introduced execute-only support:
    What do we do when we mix execute-only pkey use with
    pkey_mprotect() use? For instance:

    pkey_mprotect(ptr, PAGE_SIZE, PROT_WRITE, 6); // set pkey=6
    mprotect(ptr, PAGE_SIZE, PROT_EXEC); // set pkey=X_ONLY_PKEY?
    mprotect(ptr, PAGE_SIZE, PROT_WRITE); // is pkey=6 again?

    To solve that, we make the plain-mprotect()-initiated execute-only
    support only apply to VMAs that have the default protection key (0)
    set on them.

    Proposed semantics:
    1. protection key 0 is special and represents the default,
    "unassigned" protection key. It is always allocated.
    2. mprotect() never affects a mapping's pkey_mprotect()-assigned
    protection key. A protection key of 0 (even if set explicitly)
    represents an unassigned protection key.
    2a. mprotect(PROT_EXEC) on a mapping with an assigned protection
    key may or may not result in a mapping with execute-only
    properties. pkey_mprotect() plus pkey_set() on all threads
    should be used to _guarantee_ execute-only semantics if this
    is not a strong enough semantic.
    3. mprotect(PROT_EXEC) may result in an "execute-only" mapping. The
    kernel will internally attempt to allocate and dedicate a
    protection key for the purpose of execute-only mappings. This
    may not be possible in cases where there are no free protection
    keys available. It can also happen, of course, in situations
    where there is no hardware support for protection keys.

    Signed-off-by: Dave Hansen
    Acked-by: Mel Gorman
    Cc: linux-arch@vger.kernel.org
    Cc: Dave Hansen
    Cc: arnd@arndb.de
    Cc: linux-api@vger.kernel.org
    Cc: linux-mm@kvack.org
    Cc: luto@kernel.org
    Cc: akpm@linux-foundation.org
    Cc: torvalds@linux-foundation.org
    Link: http://lkml.kernel.org/r/20160729163012.3DDD36C4@viggo.jf.intel.com
    Signed-off-by: Thomas Gleixner

    Dave Hansen
     

27 Jul, 2016

1 commit

  • split_huge_pmd() doesn't guarantee that the pmd is normal pmd pointing
    to pte entries, which can be checked with pmd_trans_unstable(). Some
    callers make this assertion and some do it differently and some not, so
    let's do it in a unified manner.

    Link: http://lkml.kernel.org/r/1464741400-12143-1-git-send-email-n-horiguchi@ah.jp.nec.com
    Signed-off-by: Naoya Horiguchi
    Cc: "Kirill A. Shutemov"
    Cc: Hugh Dickins
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi
     

24 May, 2016

1 commit

    This is a follow up work for oom_reaper [1]. As the async OOM killing
    depends on oom_sem for read, we would really appreciate it if a holder
    for write didn't stand in the way. This patchset changes many of the
    down_write calls to be killable, to help those cases when the writer is
    blocked and waiting for readers to release the lock and so help
    __oom_reap_task to process the oom victim.

    Most of the patches are really trivial because the lock is held from
    shallow syscall paths where we can return EINTR trivially and allow the
    current task to die (note that EINTR will never get to userspace as the
    task has a fatal signal pending). Others seem to be easy as well, as the
    callers are already handling fatal errors and bail out and return to
    userspace, which should be sufficient to handle the failure gracefully.
    I am not familiar with all those code paths so a deeper review is really
    appreciated.

    As this work is touching more areas which are not directly connected I
    have tried to keep the CC list as small as possible and people who I
    believed would be familiar are CCed only to the specific patches (all
    should have received the cover though).

    This patchset is based on linux-next and it depends on
    down_write_killable for rw_semaphores which got merged into tip
    locking/rwsem branch and it is merged into this next tree. I guess it
    would be easiest to route these patches via mmotm because of the
    dependency on the tip tree but if respective maintainers prefer other
    way I have no objections.

    I haven't covered all the mmap_write(mm->mmap_sem) instances here

    $ git grep "down_write(.*\)" next/master | wc -l
    98
    $ git grep "down_write(.*\)" | wc -l
    62

    I have tried to cover those which should be relatively easy to review in
    this series because this alone should be a nice improvement. Other
    places can be changed on top.

    [0] http://lkml.kernel.org/r/1456752417-9626-1-git-send-email-mhocko@kernel.org
    [1] http://lkml.kernel.org/r/1452094975-551-1-git-send-email-mhocko@kernel.org
    [2] http://lkml.kernel.org/r/1456750705-7141-1-git-send-email-mhocko@kernel.org

    This patch (of 18):

    This is the first step in making mmap_sem write waiters killable. It
    focuses on the trivial ones that take the lock early after entering the
    syscall and do not change any state before that.

    Therefore it is very easy to change them to use down_write_killable and
    immediately return with -EINTR. This will allow the waiter to pass away
    without blocking the mmap_sem, which might be required to make forward
    progress. E.g. the oom reaper will need the lock for reading to
    dismantle the OOM victim's address space.
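
    A minimal sketch of the conversion pattern (kernel-side code with a
    hypothetical helper name, not an exact excerpt from the series):

    /* Before: down_write(&mm->mmap_sem); ... unconditional, not killable. */
    static int example_mm_write_op(struct mm_struct *mm)
    {
            if (down_write_killable(&mm->mmap_sem))
                    return -EINTR;  /* fatal signal while waiting; nothing to undo yet */

            /* ... modify the address space under the write lock ... */

            up_write(&mm->mmap_sem);
            return 0;
    }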

    The only tricky function in this patch is vm_mmap_pgoff which has many
    call sites via vm_mmap. To reduce the risk keep vm_mmap with the
    original non-killable semantic for now.

    vm_munmap callers do not bother checking the return value so open code
    it into the munmap syscall path for now for simplicity.

    Signed-off-by: Michal Hocko
    Acked-by: Vlastimil Babka
    Cc: Mel Gorman
    Cc: "Kirill A. Shutemov"
    Cc: Konstantin Khlebnikov
    Cc: Hugh Dickins
    Cc: Andrea Arcangeli
    Cc: David Rientjes
    Cc: Dave Hansen
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     

23 Mar, 2016

1 commit

    mprotect(PROT_READ) fails when called by a READ_IMPLIES_EXEC binary on a
    memory mapped file located on a non-exec fs. mprotect does not check
    whether the fs is _executable_ or not, so the PROT_EXEC flag is set
    automatically even if the memory mapped file is located on a non-exec
    fs. Fix it by checking whether the memory mapped file is located on a
    non-exec fs; if so, PROT_EXEC is not implied by PROT_READ. The
    implementation uses the VM_MAYEXEC flag, which is set properly in mmap,
    so this is now consistent with mmap.

    I did the isolated tests (PT_GNU_STACK X/NX, multiple VMAs, X/NX fs). I
    also patched the official 3.19.0-47-generic Ubuntu 14.04 kernel and it
    seems to work.

    Signed-off-by: Piotr Kwapulinski
    Cc: Mel Gorman
    Cc: Kirill A. Shutemov
    Cc: Aneesh Kumar K.V
    Cc: Benjamin Herrenschmidt
    Cc: Konstantin Khlebnikov
    Cc: Dan Williams
    Cc: Dave Hansen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Piotr Kwapulinski
     

19 Feb, 2016

2 commits

  • Protection keys provide new page-based protection in hardware.
    But, they have an interesting attribute: they only affect data
    accesses and never affect instruction fetches. That means that
    if we set up some memory which is set as "access-disabled" via
    protection keys, we can still execute from it.

    This patch uses protection keys to set up mappings to do just that.
    If a user calls:

    mmap(..., PROT_EXEC);
    or
    mprotect(ptr, sz, PROT_EXEC);

    (note PROT_EXEC-only without PROT_READ/WRITE), the kernel will
    notice this, and set a special protection key on the memory. It
    also sets the appropriate bits in the Protection Keys User Rights
    (PKRU) register so that the memory becomes unreadable and
    unwritable.
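
    A quick userspace illustration of the behaviour described above (this is an
    assumption-laden demo, not part of the patch: on pkey-capable hardware with
    this feature the read should fault, while on older hardware PROT_EXEC still
    implies read and the program prints the byte):

    #include <stdio.h>
    #include <sys/mman.h>
    #include <unistd.h>

    int main(void)
    {
        long pagesz = sysconf(_SC_PAGESIZE);
        unsigned char *p = mmap(NULL, pagesz, PROT_READ | PROT_WRITE,
                                MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (p == MAP_FAILED)
            return 1;
        p[0] = 0xc3;                        /* put something in the page */

        if (mprotect(p, pagesz, PROT_EXEC)) /* execute-only: no PROT_READ/WRITE */
            return 1;

        printf("reading an execute-only page...\n");
        printf("read %#x (no pkey-based execute-only enforcement?)\n", p[0]);
        return 0;
    }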

    I haven't found any userspace that does this today. With this
    facility in place, we expect userspace to move to use it
    eventually. Userspace _could_ start doing this today. Any
    PROT_EXEC calls get converted to PROT_READ inside the kernel, and
    would transparently be upgraded to "true" PROT_EXEC with this
    code. IOW, userspace never has to do any PROT_EXEC runtime
    detection.

    This feature provides enhanced protection against leaking
    executable memory contents. This helps thwart attacks which are
    attempting to find ROP gadgets on the fly.

    But, the security provided by this approach is not comprehensive.
    The PKRU register which controls access permissions is a normal
    user register writable from unprivileged userspace. An attacker
    who can execute the 'wrpkru' instruction can easily disable the
    protection provided by this feature.

    The protection key that is used for execute-only support is
    permanently dedicated at compile time. This is fine for now
    because there is currently no API to set a protection key other
    than this one.

    Despite there being a constant PKRU value across the entire
    system, we do not set it unless this feature is in use in a
    process. That is to preserve the PKRU XSAVE 'init state',
    which can lead to faster context switches.

    PKRU *is* a user register and the kernel is modifying it. That
    means that code doing:

    pkru = rdpkru()
    pkru |= 0x100;
    mmap(..., PROT_EXEC);
    wrpkru(pkru);

    could lose the bits in PKRU that enforce execute-only
    permissions. To avoid this, we suggest avoiding ever calling
    mmap() or mprotect() when the PKRU value is expected to be
    unstable.
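
    The same hazard, written out as a runnable (x86-only, PKU hardware only)
    example using the compiler's PKRU intrinsics; this is an illustration of
    the pattern to avoid, not code from the patch:

    #include <immintrin.h>   /* _rdpkru_u32()/_wrpkru(); build with -mpku */
    #include <stdio.h>
    #include <sys/mman.h>

    int main(void)
    {
        unsigned int pkru = _rdpkru_u32();  /* save */
        pkru |= 0x100;                      /* locally tweak some pkey's bits */

        /* mmap(PROT_EXEC) may allocate the execute-only pkey and update
         * PKRU behind our back between the read above and the write below. */
        void *p = mmap(NULL, 4096, PROT_EXEC,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

        _wrpkru(pkru);                      /* blind restore clobbers that setup */
        printf("mapped %p, PKRU is now %#x\n", p, _rdpkru_u32());
        return 0;
    }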

    Signed-off-by: Dave Hansen
    Reviewed-by: Thomas Gleixner
    Cc: Andrea Arcangeli
    Cc: Andrew Morton
    Cc: Andy Lutomirski
    Cc: Andy Lutomirski
    Cc: Aneesh Kumar K.V
    Cc: Borislav Petkov
    Cc: Borislav Petkov
    Cc: Brian Gerst
    Cc: Chen Gang
    Cc: Dan Williams
    Cc: Dave Chinner
    Cc: Dave Hansen
    Cc: David Hildenbrand
    Cc: Denys Vlasenko
    Cc: H. Peter Anvin
    Cc: Kees Cook
    Cc: Kirill A. Shutemov
    Cc: Konstantin Khlebnikov
    Cc: Linus Torvalds
    Cc: Mel Gorman
    Cc: Oleg Nesterov
    Cc: Peter Zijlstra
    Cc: Piotr Kwapulinski
    Cc: Rik van Riel
    Cc: Stephen Smalley
    Cc: Vladimir Murzin
    Cc: Will Deacon
    Cc: keescook@google.com
    Cc: linux-kernel@vger.kernel.org
    Cc: linux-mm@kvack.org
    Link: http://lkml.kernel.org/r/20160212210240.CB4BB5CA@viggo.jf.intel.com
    Signed-off-by: Ingo Molnar

    Dave Hansen
     
  • This plumbs a protection key through calc_vm_flag_bits(). We
    could have done this in calc_vm_prot_bits(), but I did not feel
    super strongly which way to go. It was pretty arbitrary which
    one to use.

    Signed-off-by: Dave Hansen
    Reviewed-by: Thomas Gleixner
    Cc: Andrea Arcangeli
    Cc: Andrew Morton
    Cc: Andy Lutomirski
    Cc: Arve Hjønnevåg
    Cc: Benjamin Herrenschmidt
    Cc: Borislav Petkov
    Cc: Brian Gerst
    Cc: Chen Gang
    Cc: Dan Williams
    Cc: Dave Chinner
    Cc: Dave Hansen
    Cc: David Airlie
    Cc: Denys Vlasenko
    Cc: Eric W. Biederman
    Cc: Geliang Tang
    Cc: Greg Kroah-Hartman
    Cc: H. Peter Anvin
    Cc: Kirill A. Shutemov
    Cc: Konstantin Khlebnikov
    Cc: Leon Romanovsky
    Cc: Linus Torvalds
    Cc: Masahiro Yamada
    Cc: Maxime Coquelin
    Cc: Mel Gorman
    Cc: Michael Ellerman
    Cc: Oleg Nesterov
    Cc: Paul Gortmaker
    Cc: Paul Mackerras
    Cc: Peter Zijlstra
    Cc: Rik van Riel
    Cc: Riley Andrews
    Cc: Vladimir Davydov
    Cc: devel@driverdev.osuosl.org
    Cc: linux-api@vger.kernel.org
    Cc: linux-arch@vger.kernel.org
    Cc: linux-kernel@vger.kernel.org
    Cc: linux-mm@kvack.org
    Cc: linuxppc-dev@lists.ozlabs.org
    Link: http://lkml.kernel.org/r/20160212210231.E6F1F0D6@viggo.jf.intel.com
    Signed-off-by: Ingo Molnar

    Dave Hansen
     

12 Feb, 2016

1 commit

  • DAX implements split_huge_pmd() by clearing pmd. This simple approach
    reduces memory overhead, as we don't need to deposit page table on huge
    page mapping to make split_huge_pmd() never-fail. PTE table can be
    allocated and populated later on page fault from backing store.

    But one side effect is that we have to check whether the pmd is pmd_none()
    after split_huge_pmd(). In most places we do this already to deal with
    parallel MADV_DONTNEED.

    But I found two call sites which are not affected by MADV_DONTNEED (due
    to down_write(mmap_sem)), but need to have the check to work with DAX
    properly.

    Signed-off-by: Kirill A. Shutemov
    Cc: Dan Williams
    Cc: Matthew Wilcox
    Cc: Andrea Arcangeli
    Cc: Ross Zwisler
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     

16 Jan, 2016

2 commits

    A dax-huge-page mapping, while it uses some thp helpers, is ultimately not
    a transparent huge page. The distinction is especially important in the
    get_user_pages() path. pmd_devmap() is used to distinguish dax-pmds
    from pmd_huge() and pmd_trans_huge() which have slightly different
    semantics.

    Explicitly mark the pmd_trans_huge() helpers that dax needs by adding
    pmd_devmap() checks.

    [kirill.shutemov@linux.intel.com: fix regression in handling mlocked pages in __split_huge_pmd()]
    Signed-off-by: Dan Williams
    Cc: Dave Hansen
    Cc: Mel Gorman
    Cc: Peter Zijlstra
    Cc: Andrea Arcangeli
    Cc: Matthew Wilcox
    Signed-off-by: Kirill A. Shutemov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dan Williams
     
  • We are going to decouple splitting THP PMD from splitting underlying
    compound page.

    This patch renames split_huge_page_pmd*() functions to split_huge_pmd*()
    to reflect the fact that it doesn't imply page splitting, only PMD.

    Signed-off-by: Kirill A. Shutemov
    Tested-by: Sasha Levin
    Tested-by: Aneesh Kumar K.V
    Acked-by: Jerome Marchand
    Acked-by: Vlastimil Babka
    Cc: Andrea Arcangeli
    Cc: Hugh Dickins
    Cc: Dave Hansen
    Cc: Mel Gorman
    Cc: Rik van Riel
    Cc: Naoya Horiguchi
    Cc: Steve Capper
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Christoph Lameter
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     

15 Jan, 2016

1 commit

    While inspecting some vague code inside the prctl(PR_SET_MM_MEM) call
    (which tests the RLIMIT_DATA value to figure out if we're allowed to
    assign new @start_brk, @brk, @start_data, @end_data in mm_struct), it
    was noticed that RLIMIT_DATA, in the form it's implemented now, doesn't
    do anything useful, because most user-space libraries use the mmap()
    syscall for dynamic memory allocations.

    Linus suggested to convert RLIMIT_DATA rlimit into something suitable
    for anonymous memory accounting. But in this patch we go further, and
    the changes are bundled together as:

    * keep vma counting if CONFIG_PROC_FS=n, will be used for limits
    * replace mm->shared_vm with better defined mm->data_vm
    * account anonymous executable areas as executable
    * account file-backed growsdown/up areas as stack
    * drop struct file* argument from vm_stat_account
    * enforce RLIMIT_DATA for size of data areas

    This way code looks cleaner: now code/stack/data classification depends
    only on vm_flags state:

    VM_EXEC & ~VM_WRITE -> code (VmExe + VmLib in proc)
    VM_GROWSUP | VM_GROWSDOWN -> stack (VmStk)
    VM_WRITE & ~VM_SHARED & !stack -> data (VmData)

    The rest (VmSize - VmData - VmStk - VmExe - VmLib) could be called
    "shared", but that might be strange beast like readonly-private or VM_IO
    area.

    - RLIMIT_AS limits whole address space "VmSize"
    - RLIMIT_STACK limits stack "VmStk" (but each vma individually)
    - RLIMIT_DATA now limits "VmData"
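
    The code/stack/data classification above can be sketched as a few
    predicates over vm_flags; the flag values below are stand-ins for this
    self-contained example (the real VM_* constants are defined in
    include/linux/mm.h):

    #include <stdbool.h>
    #include <stdio.h>

    typedef unsigned long vm_flags_t;

    #define VM_READ      0x01UL
    #define VM_WRITE     0x02UL
    #define VM_EXEC      0x04UL
    #define VM_SHARED    0x08UL
    #define VM_GROWSDOWN 0x10UL
    #define VM_GROWSUP   0x20UL

    static bool is_stack(vm_flags_t f) { return f & (VM_GROWSUP | VM_GROWSDOWN); }
    static bool is_code(vm_flags_t f)  { return (f & VM_EXEC) && !(f & VM_WRITE); }
    static bool is_data(vm_flags_t f)
    {
        return (f & VM_WRITE) && !(f & VM_SHARED) && !is_stack(f);
    }

    int main(void)
    {
        vm_flags_t text = VM_READ | VM_EXEC;                  /* counted as VmExe/VmLib */
        vm_flags_t heap = VM_READ | VM_WRITE;                 /* counted as VmData */
        vm_flags_t stk  = VM_READ | VM_WRITE | VM_GROWSDOWN;  /* counted as VmStk */

        printf("text: code=%d data=%d stack=%d\n", is_code(text), is_data(text), is_stack(text));
        printf("heap: code=%d data=%d stack=%d\n", is_code(heap), is_data(heap), is_stack(heap));
        printf("stk:  code=%d data=%d stack=%d\n", is_code(stk), is_data(stk), is_stack(stk));
        return 0;
    }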

    Signed-off-by: Konstantin Khlebnikov
    Signed-off-by: Cyrill Gorcunov
    Cc: Quentin Casasnovas
    Cc: Vegard Nossum
    Acked-by: Linus Torvalds
    Cc: Willy Tarreau
    Cc: Andy Lutomirski
    Cc: Kees Cook
    Cc: Vladimir Davydov
    Cc: Pavel Emelyanov
    Cc: Peter Zijlstra
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Konstantin Khlebnikov
     

05 Sep, 2015

1 commit

  • vma->vm_userfaultfd_ctx is yet another vma parameter that vma_merge
    must be aware about so that we can merge vmas back like they were
    originally before arming the userfaultfd on some memory range.

    Signed-off-by: Andrea Arcangeli
    Acked-by: Pavel Emelyanov
    Cc: Sanidhya Kashyap
    Cc: zhang.zhanghailiang@huawei.com
    Cc: "Kirill A. Shutemov"
    Cc: Andres Lagar-Cavilla
    Cc: Dave Hansen
    Cc: Paolo Bonzini
    Cc: Rik van Riel
    Cc: Mel Gorman
    Cc: Andy Lutomirski
    Cc: Hugh Dickins
    Cc: Peter Feiner
    Cc: "Dr. David Alan Gilbert"
    Cc: Johannes Weiner
    Cc: "Huangpeng (Peter)"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     

25 Jun, 2015

1 commit

    On mlock(2) we trigger COW on private writable VMAs to avoid faults in
    the future.

    mm/gup.c:
    840 long populate_vma_page_range(struct vm_area_struct *vma,
    841                 unsigned long start, unsigned long end, int *nonblocking)
    842 {
    ...
    855          * We want to touch writable mappings with a write fault in order
    856          * to break COW, except for shared mappings because these don't COW
    857          * and we would not want to dirty them for nothing.
    858          */
    859         if ((vma->vm_flags & (VM_WRITE | VM_SHARED)) == VM_WRITE)
    860                 gup_flags |= FOLL_WRITE;

    But we miss this case when we make VM_LOCKED VMA writeable via
    mprotect(2). The test case:

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <sys/resource.h>
    #include <sys/stat.h>
    #include <sys/time.h>
    #include <sys/types.h>
    #include <unistd.h>

    #define PAGE_SIZE 4096

    int main(int argc, char **argv)
    {
            struct rusage usage;
            long before;
            char *p;
            int fd;

            /* Create a file and populate first page of page cache */
            fd = open("/tmp", O_TMPFILE | O_RDWR, S_IRUSR | S_IWUSR);
            write(fd, "1", 1);

            /* Create a *read-only* *private* mapping of the file */
            p = mmap(NULL, PAGE_SIZE, PROT_READ, MAP_PRIVATE, fd, 0);

            /*
             * Since the mapping is read-only, mlock() will populate the mapping
             * with PTEs pointing to page cache without triggering COW.
             */
            mlock(p, PAGE_SIZE);

            /*
             * Mapping became read-write, but it's still populated with PTEs
             * pointing to page cache.
             */
            mprotect(p, PAGE_SIZE, PROT_READ | PROT_WRITE);

            getrusage(RUSAGE_SELF, &usage);
            before = usage.ru_minflt;

            /* Trigger COW: fault in mlock()ed VMA. */
            *p = 1;

            getrusage(RUSAGE_SELF, &usage);
            printf("faults: %ld\n", usage.ru_minflt - before);

            return 0;
    }

    $ ./test
    faults: 1

    Let's fix it by triggering population of the VMA in mprotect_fixup() on
    this condition. We don't care about population errors, as we don't in
    other similar cases, i.e. mremap.

    [akpm@linux-foundation.org: tweak comment text]
    Signed-off-by: Kirill A. Shutemov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     

26 Mar, 2015

1 commit

  • Protecting a PTE to trap a NUMA hinting fault clears the writable bit
    and further faults are needed after trapping a NUMA hinting fault to set
    the writable bit again. This patch preserves the writable bit when
    trapping NUMA hinting faults. The impact is obvious from the number of
    minor faults trapped during the basis balancing benchmark and the system
    CPU usage;

    autonumabench
    4.0.0-rc4 4.0.0-rc4
    baseline preserve
    Time System-NUMA01 107.13 ( 0.00%) 103.13 ( 3.73%)
    Time System-NUMA01_THEADLOCAL 131.87 ( 0.00%) 83.30 ( 36.83%)
    Time System-NUMA02 8.95 ( 0.00%) 10.72 (-19.78%)
    Time System-NUMA02_SMT 4.57 ( 0.00%) 3.99 ( 12.69%)
    Time Elapsed-NUMA01 515.78 ( 0.00%) 517.26 ( -0.29%)
    Time Elapsed-NUMA01_THEADLOCAL 384.10 ( 0.00%) 384.31 ( -0.05%)
    Time Elapsed-NUMA02 48.86 ( 0.00%) 48.78 ( 0.16%)
    Time Elapsed-NUMA02_SMT 47.98 ( 0.00%) 48.12 ( -0.29%)

    4.0.0-rc4 4.0.0-rc4
    baseline preserve
    User 44383.95 43971.89
    System 252.61 201.24
    Elapsed 998.68 1000.94

    Minor Faults 2597249 1981230
    Major Faults 365 364

    There is a similar drop in system CPU usage using Dave Chinner's xfsrepair
    workload

    4.0.0-rc4 4.0.0-rc4
    baseline preserve
    Amean real-xfsrepair 454.14 ( 0.00%) 442.36 ( 2.60%)
    Amean syst-xfsrepair 277.20 ( 0.00%) 204.68 ( 26.16%)

    The patch looks hacky but the alternatives looked worse. The tidiest was
    to rewalk the page tables after a hinting fault but it was more complex
    than this approach and the performance was worse. It's not generally
    safe to just mark the page writable during the fault if it's a write
    fault as it may have been read-only for COW, so that approach was
    discarded.

    Signed-off-by: Mel Gorman
    Reported-by: Dave Chinner
    Tested-by: Dave Chinner
    Cc: Ingo Molnar
    Cc: Aneesh Kumar
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     

13 Feb, 2015

4 commits

    If a PTE or PMD is already marked NUMA when scanning to mark entries for
    NUMA hinting then it is not necessary to update the entry and incur a TLB
    flush penalty. Avoid the overhead where possible.

    Signed-off-by: Mel Gorman
    Cc: Aneesh Kumar K.V
    Cc: Benjamin Herrenschmidt
    Cc: Dave Jones
    Cc: Hugh Dickins
    Cc: Ingo Molnar
    Cc: Kirill Shutemov
    Cc: Linus Torvalds
    Cc: Paul Mackerras
    Cc: Rik van Riel
    Cc: Sasha Levin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • Faults on the huge zero page are pointless and there is a BUG_ON to catch
    them during fault time. This patch reintroduces a check that avoids
    marking the zero page PAGE_NONE.

    Signed-off-by: Mel Gorman
    Cc: Aneesh Kumar K.V
    Cc: Benjamin Herrenschmidt
    Cc: Dave Jones
    Cc: Hugh Dickins
    Cc: Ingo Molnar
    Cc: Kirill Shutemov
    Cc: Linus Torvalds
    Cc: Paul Mackerras
    Cc: Rik van Riel
    Cc: Sasha Levin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • With PROT_NONE, the traditional page table manipulation functions are
    sufficient.

    [andre.przywara@arm.com: fix compiler warning in pmdp_invalidate()]
    [akpm@linux-foundation.org: fix build with STRICT_MM_TYPECHECKS]
    Signed-off-by: Mel Gorman
    Acked-by: Linus Torvalds
    Acked-by: Aneesh Kumar
    Tested-by: Sasha Levin
    Cc: Benjamin Herrenschmidt
    Cc: Dave Jones
    Cc: Hugh Dickins
    Cc: Ingo Molnar
    Cc: Kirill Shutemov
    Cc: Paul Mackerras
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • Convert existing users of pte_numa and friends to the new helper. Note
    that the kernel is broken after this patch is applied until the other page
    table modifiers are also altered. This patch layout is to make review
    easier.

    Signed-off-by: Mel Gorman
    Acked-by: Linus Torvalds
    Acked-by: Aneesh Kumar
    Acked-by: Benjamin Herrenschmidt
    Tested-by: Sasha Levin
    Cc: Dave Jones
    Cc: Hugh Dickins
    Cc: Ingo Molnar
    Cc: Kirill Shutemov
    Cc: Paul Mackerras
    Cc: Rik van Riel
    Cc: Sasha Levin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     

11 Feb, 2015

1 commit


14 Oct, 2014

1 commit

  • For VMAs that don't want write notifications, PTEs created for read faults
    have their write bit set. If the read fault happens after VM_SOFTDIRTY is
    cleared, then the PTE's softdirty bit will remain clear after subsequent
    writes.

    Here's a simple code snippet to demonstrate the bug:

    char* m = mmap(NULL, getpagesize(), PROT_READ | PROT_WRITE,
    MAP_ANONYMOUS | MAP_SHARED, -1, 0);
    system("echo 4 > /proc/$PPID/clear_refs"); /* clear VM_SOFTDIRTY */
    assert(*m == '\0'); /* new PTE allows write access */
    assert(!soft_dirty(x));
    *m = 'x'; /* should dirty the page */
    assert(soft_dirty(x)); /* fails */
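
    For reference, a runnable version of that snippet; soft_dirty() here is one
    possible implementation of the pseudo-helper above, reading bit 55 of the
    /proc/self/pagemap entry (reading pagemap may require appropriate
    privileges on some kernels):

    #include <assert.h>
    #include <fcntl.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <unistd.h>

    /* Soft-dirty is bit 55 of the pagemap entry for the page. */
    static int soft_dirty(void *addr)
    {
        uint64_t entry = 0;
        long pagesz = sysconf(_SC_PAGESIZE);
        int fd = open("/proc/self/pagemap", O_RDONLY);

        if (fd < 0)
            return -1;
        pread(fd, &entry, sizeof(entry),
              ((uintptr_t)addr / pagesz) * sizeof(entry));
        close(fd);
        return (entry >> 55) & 1;
    }

    static void clear_soft_dirty(void)
    {
        int fd = open("/proc/self/clear_refs", O_WRONLY);
        write(fd, "4", 1);   /* clears VM_SOFTDIRTY and the per-PTE soft-dirty bits */
        close(fd);
    }

    int main(void)
    {
        long pagesz = sysconf(_SC_PAGESIZE);
        char *m = mmap(NULL, pagesz, PROT_READ | PROT_WRITE,
                       MAP_ANONYMOUS | MAP_SHARED, -1, 0);
        if (m == MAP_FAILED)
            return 1;

        clear_soft_dirty();
        assert(*m == '\0');                 /* read fault installs a writable PTE */
        assert(!soft_dirty(m));
        *m = 'x';                           /* should mark the page soft-dirty */
        printf("soft-dirty after write: %d\n", soft_dirty(m));  /* 0 before the fix, 1 after */
        return 0;
    }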

    With this patch, write notifications are enabled when VM_SOFTDIRTY is
    cleared. Furthermore, to avoid unnecessary faults, write notifications
    are disabled when VM_SOFTDIRTY is set.

    As a side effect of enabling and disabling write notifications with
    care, this patch fixes a bug in mprotect where vm_page_prot bits set by
    drivers were zapped on mprotect. An analogous bug was fixed in mmap by
    commit c9d0bf241451 ("mm: uncached vma support with writenotify").

    Signed-off-by: Peter Feiner
    Reported-by: Peter Feiner
    Suggested-by: Kirill A. Shutemov
    Cc: Cyrill Gorcunov
    Cc: Pavel Emelyanov
    Cc: Jamie Liu
    Cc: Hugh Dickins
    Cc: Naoya Horiguchi
    Cc: Bjorn Helgaas
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Peter Feiner
     

08 Apr, 2014

3 commits

  • The NUMA scanning code can end up iterating over many gigabytes of
    unpopulated memory, especially in the case of a freshly started KVM
    guest with lots of memory.

    This results in the mmu notifier code being called even when there are
    no mapped pages in a virtual address range. The amount of time wasted
    can be enough to trigger soft lockup warnings with very large KVM
    guests.

    This patch moves the mmu notifier call to the pmd level, which
    represents 1GB areas of memory on x86-64. Furthermore, the mmu notifier
    code is only called from the address in the PMD where present mappings
    are first encountered.

    The hugetlbfs code is left alone for now; hugetlb mappings are not
    relocatable, and as such are left alone by the NUMA code, and should
    never trigger this problem to begin with.

    Signed-off-by: Rik van Riel
    Acked-by: David Rientjes
    Cc: Peter Zijlstra
    Cc: Andrea Arcangeli
    Reported-by: Xing Gang
    Tested-by: Chegu Vinod
    Cc: Sasha Levin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rik van Riel
     
  • Sasha reported the following bug using trinity

    kernel BUG at mm/mprotect.c:149!
    invalid opcode: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
    Dumping ftrace buffer:
    (ftrace buffer empty)
    Modules linked in:
    CPU: 20 PID: 26219 Comm: trinity-c216 Tainted: G W 3.14.0-rc5-next-20140305-sasha-00011-ge06f5f3-dirty #105
    task: ffff8800b6c80000 ti: ffff880228436000 task.ti: ffff880228436000
    RIP: change_protection_range+0x3b3/0x500
    Call Trace:
    change_protection+0x25/0x30
    change_prot_numa+0x1b/0x30
    task_numa_work+0x279/0x360
    task_work_run+0xae/0xf0
    do_notify_resume+0x8e/0xe0
    retint_signal+0x4d/0x92

    The VM_BUG_ON was added in -mm by the patch "mm,numa: reorganize
    change_pmd_range". The race existed without the patch but was just
    harder to hit.

    The problem is that a transhuge check is made without holding the PTL.
    It's possible at the time of the check that a parallel fault clears the
    pmd and inserts a new one which then triggers the VM_BUG_ON check. This
    patch removes the VM_BUG_ON but fixes the race by rechecking transhuge
    under the PTL when marking page tables for NUMA hinting and bailing if a
    race occurred. It is not a problem for calls to mprotect() as they hold
    mmap_sem for write.

    Signed-off-by: Mel Gorman
    Reported-by: Sasha Levin
    Reviewed-by: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • Reorganize the order of ifs in change_pmd_range a little, in preparation
    for the next patch.

    [akpm@linux-foundation.org: fix indenting, per David]
    Signed-off-by: Rik van Riel
    Cc: Peter Zijlstra
    Cc: Andrea Arcangeli
    Reported-by: Xing Gang
    Tested-by: Chegu Vinod
    Acked-by: David Rientjes
    Cc: Sasha Levin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rik van Riel
     

17 Feb, 2014

2 commits

    Archs like ppc64 don't do a TLB flush in the set_pte/pmd functions when
    using a hash table MMU, for various reasons (the flush is handled as part
    of the PTE modification when necessary).

    ppc64 thus doesn't implement flush_tlb_range for hash based MMUs.

    Additionally, ppc64 requires the TLB flushing to be batched within ptl locks.

    The reason to do that is to ensure that the hash page table is in sync with
    the linux page table.

    We track the hpte index in the linux pte, and if we clear it without
    flushing the hash and drop the ptl lock, another cpu can update the pte and
    we can end up with a duplicate entry in the hash table, which is fatal.

    We also want to keep set_pte_at simpler by not requiring it to do a hash
    flush for performance reasons. We do that by assuming that set_pte_at() is
    never *ever* called on a PTE that is already valid.

    This was the case until the NUMA code went in which broke that assumption.

    Fix that by introducing a new pair of helpers to set _PAGE_NUMA in a
    way similar to ptep/pmdp_set_wrprotect(), with a generic implementation
    using set_pte_at() and a powerpc specific one using the appropriate
    mechanism needed to keep the hash table in sync.

    Acked-by: Mel Gorman
    Reviewed-by: Rik van Riel
    Signed-off-by: Aneesh Kumar K.V
    Signed-off-by: Benjamin Herrenschmidt

    Aneesh Kumar K.V
     
  • So move it within the if loop

    Acked-by: Mel Gorman
    Reviewed-by: Rik van Riel
    Signed-off-by: Aneesh Kumar K.V
    Signed-off-by: Benjamin Herrenschmidt

    Aneesh Kumar K.V
     

22 Jan, 2014

1 commit

  • KSM pages can be shared between tasks that are not necessarily related
    to each other from a NUMA perspective. This patch causes those pages to
    be ignored by automatic NUMA balancing so they do not migrate and do not
    cause unrelated tasks to be grouped together.

    Signed-off-by: Mel Gorman
    Reviewed-by: Rik van Riel
    Cc: Alex Thorlton
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     

19 Dec, 2013

3 commits

  • There are a few subtle races, between change_protection_range (used by
    mprotect and change_prot_numa) on one side, and NUMA page migration and
    compaction on the other side.

    The basic race is that there is a time window between when the PTE gets
    made non-present (PROT_NONE or NUMA), and the TLB is flushed.

    During that time, a CPU may continue writing to the page.

    This is fine most of the time, however compaction or the NUMA migration
    code may come in, and migrate the page away.

    When that happens, the CPU may continue writing, through the cached
    translation, to what is no longer the current memory location of the
    process.

    This only affects x86, which has a somewhat optimistic pte_accessible.
    All other architectures appear to be safe, and will either always flush,
    or flush whenever there is a valid mapping, even with no permissions
    (SPARC).

    The basic race looks like this:

    CPU A                       CPU B                       CPU C

                                                            load TLB entry
    make entry PTE/PMD_NUMA
                                                            fault on entry
                                read/write old page
                                start migrating page
                                change PTE/PMD to new page
                                                            read/write old page [*]
    flush TLB
                                                            reload TLB from new entry
                                                            read/write new page
                                                            lose data

    [*] the old page may belong to a new user at this point!

    The obvious fix is to flush remote TLB entries, by making sure that
    pte_accessible is aware of the fact that PROT_NONE and PROT_NUMA memory may
    still be accessible if there is a TLB flush pending for the mm.

    This should fix both NUMA migration and compaction.

    [mgorman@suse.de: fix build]
    Signed-off-by: Rik van Riel
    Signed-off-by: Mel Gorman
    Cc: Alex Thorlton
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rik van Riel
     
  • On a protection change it is no longer clear if the page should be still
    accessible. This patch clears the NUMA hinting fault bits on a
    protection change.

    Signed-off-by: Mel Gorman
    Reviewed-by: Rik van Riel
    Cc: Alex Thorlton
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • The TLB must be flushed if the PTE is updated but change_pte_range is
    clearing the PTE while marking PTEs pte_numa without necessarily
    flushing the TLB if it reinserts the same entry. Without the flush,
    it's conceivable that two processors have different TLBs for the same
    virtual address and at the very least it would generate spurious faults.

    This patch only unmaps the pages in change_pte_range for a full
    protection change.

    [riel@redhat.com: write pte_numa pte back to the page tables]
    Signed-off-by: Mel Gorman
    Signed-off-by: Rik van Riel
    Reviewed-by: Rik van Riel
    Cc: Alex Thorlton
    Cc: Chegu Vinod
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     

13 Nov, 2013

1 commit

  • Commit 0255d4918480 ("mm: Account for a THP NUMA hinting update as one
    PTE update") was added to account for the number of PTE updates when
    marking pages prot_numa. task_numa_work was using the old return value
    to track how much address space had been updated. Altering the return
    value causes the scanner to do more work than it is configured or
    documented to in a single unit of work.

    This patch reverts that commit and accounts for the number of THP
    updates separately in vmstat. It is up to the administrator to
    interpret the pair of values correctly. This is a straight-forward
    operation and likely to only be of interest when actively debugging NUMA
    balancing problems.

    The impact of this patch is that the NUMA PTE scanner will scan more slowly
    when THP is enabled and workloads may converge more slowly as a result. On
    the flip side, system CPU usage should be lower than recent tests
    reported. This is an illustrative example of a short single JVM specjbb
    test

    specjbb
    3.12.0 3.12.0
    vanilla acctupdates
    TPut 1 26143.00 ( 0.00%) 25747.00 ( -1.51%)
    TPut 7 185257.00 ( 0.00%) 183202.00 ( -1.11%)
    TPut 13 329760.00 ( 0.00%) 346577.00 ( 5.10%)
    TPut 19 442502.00 ( 0.00%) 460146.00 ( 3.99%)
    TPut 25 540634.00 ( 0.00%) 549053.00 ( 1.56%)
    TPut 31 512098.00 ( 0.00%) 519611.00 ( 1.47%)
    TPut 37 461276.00 ( 0.00%) 474973.00 ( 2.97%)
    TPut 43 403089.00 ( 0.00%) 414172.00 ( 2.75%)

    3.12.0 3.12.0
    vanilla acctupdates
    User 5169.64 5184.14
    System 100.45 80.02
    Elapsed 252.75 251.85

    Performance is similar but note the reduction in system CPU time. While
    this showed a performance gain, it will not be universal but at least
    it'll be behaving as documented. The vmstats are obviously different but
    here is an obvious interpretation of them from mmtests.

    3.12.0 3.12.0
    vanilla acctupdates
    NUMA page range updates 1408326 11043064
    NUMA huge PMD updates 0 21040
    NUMA PTE updates 1408326 291624

    "NUMA page range updates" == nr_pte_updates and is the value returned to
    the NUMA pte scanner. NUMA huge PMD updates were the number of THP
    updates which in combination can be used to calculate how many ptes were
    updated from userspace.

    Signed-off-by: Mel Gorman
    Reported-by: Alex Thorlton
    Reviewed-by: Rik van Riel
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     

01 Nov, 2013

1 commit

  • Resolve cherry-picking conflicts:

    Conflicts:
    mm/huge_memory.c
    mm/memory.c
    mm/mprotect.c

    See this upstream merge commit for more details:

    52469b4fcd4f Merge branch 'core-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

    Signed-off-by: Ingo Molnar

    Ingo Molnar
     

29 Oct, 2013

1 commit

    A THP PMD update is accounted for as 512 pages updated in vmstat. This is a
    large difference when estimating the cost of automatic NUMA balancing and
    can be misleading when comparing results that had collapsed versus split
    THP. This patch addresses the accounting issue.

    Signed-off-by: Mel Gorman
    Reviewed-by: Rik van Riel
    Cc: Andrea Arcangeli
    Cc: Johannes Weiner
    Cc: Srikar Dronamraju
    Cc:
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/1381141781-10992-10-git-send-email-mgorman@suse.de
    Signed-off-by: Ingo Molnar

    Mel Gorman
     

17 Oct, 2013

1 commit

    If page migration is turned on in the config and a page is migrating, we
    may lose the soft dirty bit. If fork or mprotect is called on migrating
    pages (once migration is complete), pages do not obtain the soft dirty
    bit in the corresponding pte entries. Fix it by adding an appropriate
    test on swap entries.

    Signed-off-by: Cyrill Gorcunov
    Cc: Pavel Emelyanov
    Cc: Andy Lutomirski
    Cc: Matt Mackall
    Cc: Xiao Guangrong
    Cc: Marcelo Tosatti
    Cc: KOSAKI Motohiro
    Cc: Stephen Rothwell
    Cc: Peter Zijlstra
    Cc: "Aneesh Kumar K.V"
    Cc: Naoya Horiguchi
    Cc: Mel Gorman
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Cyrill Gorcunov
     

09 Oct, 2013

1 commit

  • With the THP migration races closed it is still possible to occasionally
    see corruption. The problem is related to handling PMD pages in batch.
    When a page fault is handled it can be assumed that the page being
    faulted will also be flushed from the TLB. The same flushing does not
    happen when handling PMD pages in batch. Fixing this is straightforward
    but there are a number of reasons not to:

    1. Multiple TLB flushes may have to be sent depending on what pages get
    migrated
    2. The handling of PMDs in batch means that faults get accounted to
    the task that is handling the fault. While care is taken to only
    mark PMDs where the last CPU and PID match it can still have problems
    due to PID truncation when matching PIDs.
    3. Batching on the PMD level may reduce faults but setting pmd_numa
    requires taking a heavy lock that can contend with THP migration
    and handling the fault requires the release/acquisition of the PTL
    for every page migrated. It's still pretty heavy.

    PMD batch handling is not something that people have ever been happy
    with. This patch removes it and later patches will deal with the
    additional fault overhead using more intelligent migration rate adaptation.

    Signed-off-by: Mel Gorman
    Reviewed-by: Rik van Riel
    Cc: Andrea Arcangeli
    Cc: Johannes Weiner
    Cc: Srikar Dronamraju
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/1381141781-10992-48-git-send-email-mgorman@suse.de
    Signed-off-by: Ingo Molnar

    Mel Gorman