25 Oct, 2016

1 commit

  • This patch unexports the low-level __get_user_pages() function.

    Recent refactoring of the get_user_pages* functions allows flags to be
    passed through get_user_pages(), which eliminates the need for access to
    this function from its one user, kvm.

    We can see that the two calls to get_user_pages() which replace
    __get_user_pages() in kvm_main.c are equivalent by examining their call
    stacks:

    get_user_page_nowait():
    get_user_pages(start, 1, flags, page, NULL)
    __get_user_pages_locked(current, current->mm, start, 1, page, NULL, NULL,
    false, flags | FOLL_TOUCH)
    __get_user_pages(current, current->mm, start, 1,
    flags | FOLL_TOUCH | FOLL_GET, page, NULL, NULL)

    check_user_page_hwpoison():
    get_user_pages(addr, 1, flags, NULL, NULL)
    __get_user_pages_locked(current, current->mm, addr, 1, NULL, NULL, NULL,
    false, flags | FOLL_TOUCH)
    __get_user_pages(current, current->mm, addr, 1, flags | FOLL_TOUCH, NULL,
    NULL, NULL)

    Signed-off-by: Lorenzo Stoakes
    Acked-by: Paolo Bonzini
    Acked-by: Michal Hocko
    Signed-off-by: Linus Torvalds

    Lorenzo Stoakes
     

19 Oct, 2016

7 commits

  • This removes the 'write' and 'force' from get_user_pages_remote() and
    replaces them with 'gup_flags' to make the use of FOLL_FORCE explicit in
    callers as use of this flag can result in surprising behaviour (and
    hence bugs) within the mm subsystem.
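
    A minimal caller-side sketch of the change (illustrative only, not a hunk
    from the patch; tsk/mm/addr/pages are placeholders):

    /* old: 'write' and 'force' passed as separate int arguments */
    get_user_pages_remote(tsk, mm, addr, 1, 1 /* write */, 0 /* force */,
                          pages, NULL);
    /* new: the caller spells out the FOLL_* flags it actually wants */
    get_user_pages_remote(tsk, mm, addr, 1, FOLL_WRITE, pages, NULL);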

    Signed-off-by: Lorenzo Stoakes
    Acked-by: Michal Hocko
    Reviewed-by: Jan Kara
    Signed-off-by: Linus Torvalds

    Lorenzo Stoakes
     
  • This removes the 'write' and 'force' from get_user_pages() and replaces
    them with 'gup_flags' to make the use of FOLL_FORCE explicit in callers
    as use of this flag can result in surprising behaviour (and hence bugs)
    within the mm subsystem.
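
    Illustratively (a sketch, not taken from the patch), a caller that
    previously asked for a forced write now has to say so explicitly:

    /* old: */
    get_user_pages(addr, 1, 1 /* write */, 1 /* force */, pages, NULL);
    /* new: FOLL_FORCE is visible at the call site */
    get_user_pages(addr, 1, FOLL_WRITE | FOLL_FORCE, pages, NULL);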

    Signed-off-by: Lorenzo Stoakes
    Acked-by: Christian König
    Acked-by: Jesper Nilsson
    Acked-by: Michal Hocko
    Reviewed-by: Jan Kara
    Signed-off-by: Linus Torvalds

    Lorenzo Stoakes
     
  • This removes the 'write' and 'force' use from get_user_pages_locked()
    and replaces them with 'gup_flags' to make the use of FOLL_FORCE
    explicit in callers as use of this flag can result in surprising
    behaviour (and hence bugs) within the mm subsystem.

    Signed-off-by: Lorenzo Stoakes
    Acked-by: Michal Hocko
    Reviewed-by: Jan Kara
    Signed-off-by: Linus Torvalds

    Lorenzo Stoakes
     
  • This removes the 'write' and 'force' use from get_user_pages_unlocked()
    and replaces them with 'gup_flags' to make the use of FOLL_FORCE
    explicit in callers as use of this flag can result in surprising
    behaviour (and hence bugs) within the mm subsystem.

    Signed-off-by: Lorenzo Stoakes
    Reviewed-by: Jan Kara
    Acked-by: Michal Hocko
    Signed-off-by: Linus Torvalds

    Lorenzo Stoakes
     
  • This removes the redundant 'write' and 'force' parameters from
    __get_user_pages_unlocked() to make the use of FOLL_FORCE explicit in
    callers as use of this flag can result in surprising behaviour (and
    hence bugs) within the mm subsystem.

    Signed-off-by: Lorenzo Stoakes
    Acked-by: Paolo Bonzini
    Reviewed-by: Jan Kara
    Acked-by: Michal Hocko
    Signed-off-by: Linus Torvalds

    Lorenzo Stoakes
     
  • This removes the redundant 'write' and 'force' parameters from
    __get_user_pages_locked() to make the use of FOLL_FORCE explicit in
    callers as use of this flag can result in surprising behaviour (and
    hence bugs) within the mm subsystem.

    Signed-off-by: Lorenzo Stoakes
    Reviewed-by: Jan Kara
    Acked-by: Michal Hocko
    Signed-off-by: Linus Torvalds

    Lorenzo Stoakes
     
  • This is an ancient bug that was actually attempted to be fixed once
    (badly) by me eleven years ago in commit 4ceb5db9757a ("Fix
    get_user_pages() race for write access") but that was then undone due to
    problems on s390 by commit f33ea7f404e5 ("fix get_user_pages bug").

    In the meantime, the s390 situation has long been fixed, and we can now
    fix it by checking the pte_dirty() bit properly (and do it better). The
    s390 dirty bit was implemented in abf09bed3cce ("s390/mm: implement
    software dirty bits") which made it into v3.9. Earlier kernels will
    have to look at the page state itself.

    Also, the VM has become more scalable, and what used to be a purely
    theoretical race back then has become easier to trigger.

    To fix it, we introduce a new internal FOLL_COW flag to mark the "yes,
    we already did a COW" case, rather than play racy games with FOLL_WRITE
    (which is very fundamental), and then use the pte dirty flag to validate
    that the FOLL_COW flag is still valid.
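
    The resulting check is along these lines (a sketch of the described
    approach, as used in mm/gup.c):

    /*
     * FOLL_FORCE can write to even unwritable pte's, but only after we have
     * actually gone through a COW cycle, which the pte dirty bit confirms.
     */
    static inline bool can_follow_write_pte(pte_t pte, unsigned int flags)
    {
        return pte_write(pte) ||
            ((flags & FOLL_FORCE) && (flags & FOLL_COW) && pte_dirty(pte));
    }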

    Reported-and-tested-by: Phil "not Paul" Oester
    Acked-by: Hugh Dickins
    Reviewed-by: Michal Hocko
    Cc: Andy Lutomirski
    Cc: Kees Cook
    Cc: Oleg Nesterov
    Cc: Willy Tarreau
    Cc: Nick Piggin
    Cc: Greg Thelen
    Cc: stable@vger.kernel.org
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

03 Aug, 2016

1 commit

  • Pull KVM updates from Paolo Bonzini:

    - ARM: GICv3 ITS emulation and various fixes. Removal of the
    old VGIC implementation.

    - s390: support for trapping software breakpoints, nested
    virtualization (vSIE), the STHYI opcode, initial extensions
    for CPU model support.

    - MIPS: support for MIPS64 hosts (32-bit guests only) and lots
    of cleanups, preliminary to this and the upcoming support for
    hardware virtualization extensions.

    - x86: support for execute-only mappings in nested EPT; reduced
    vmexit latency for TSC deadline timer (by about 30%) on Intel
    hosts; support for more than 255 vCPUs.

    - PPC: bugfixes.

    * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (302 commits)
    KVM: PPC: Introduce KVM_CAP_PPC_HTM
    MIPS: Select HAVE_KVM for MIPS64_R{2,6}
    MIPS: KVM: Reset CP0_PageMask during host TLB flush
    MIPS: KVM: Fix ptr->int cast via KVM_GUEST_KSEGX()
    MIPS: KVM: Sign extend MFC0/RDHWR results
    MIPS: KVM: Fix 64-bit big endian dynamic translation
    MIPS: KVM: Fail if ebase doesn't fit in CP0_EBase
    MIPS: KVM: Use 64-bit CP0_EBase when appropriate
    MIPS: KVM: Set CP0_Status.KX on MIPS64
    MIPS: KVM: Make entry code MIPS64 friendly
    MIPS: KVM: Use kmap instead of CKSEG0ADDR()
    MIPS: KVM: Use virt_to_phys() to get commpage PFN
    MIPS: Fix definition of KSEGX() for 64-bit
    KVM: VMX: Add VMCS to CPU's loaded VMCSs before VMPTRLD
    kvm: x86: nVMX: maintain internal copy of current VMCS
    KVM: PPC: Book3S HV: Save/restore TM state in H_CEDE
    KVM: PPC: Book3S HV: Pull out TM state save/restore into separate procedures
    KVM: arm64: vgic-its: Simplify MAPI error handling
    KVM: arm64: vgic-its: Make vgic_its_cmd_handle_mapi similar to other handlers
    KVM: arm64: vgic-its: Turn device_id validation into generic ID validation
    ...

    Linus Torvalds
     

27 Jul, 2016

3 commits

  • Basic scheme is the same as for anon THP.

    Main differences:

    - File pages are on radix-tree, so we have head->_count offset by
    HPAGE_PMD_NR. The count got distributed to small pages during split.

    - mapping->tree_lock prevents non-lockless access to pages under split
    over radix-tree;

    - Lockless access is prevented by setting the head->_count to 0 during
    split;

    - After split, some pages can be beyond i_size. We drop them from
    radix-tree.

    - We don't set up migration entries. Just unmap pages. It helps handle
    cases when i_size is in the middle of the page: no need to unmap pages
    beyond i_size manually.

    Link: http://lkml.kernel.org/r/1466021202-61880-20-git-send-email-kirill.shutemov@linux.intel.com
    Signed-off-by: Kirill A. Shutemov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • We always have vma->vm_mm around.

    Link: http://lkml.kernel.org/r/1466021202-61880-8-git-send-email-kirill.shutemov@linux.intel.com
    Signed-off-by: Kirill A. Shutemov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • split_huge_pmd() doesn't guarantee that the pmd is normal pmd pointing
    to pte entries, which can be checked with pmd_trans_unstable(). Some
    callers make this assertion and some do it differently and some not, so
    let's do it in a unified manner.
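
    A sketch of the unified pattern (illustrative; not a specific hunk from
    the patch):

    split_huge_pmd(vma, pmd, addr);
    if (pmd_trans_unstable(pmd))
        return 0;   /* pmd is not a stable pte page table; caller falls back */
    pte = pte_offset_map_lock(mm, pmd, addr, &ptl);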

    Link: http://lkml.kernel.org/r/1464741400-12143-1-git-send-email-n-horiguchi@ah.jp.nec.com
    Signed-off-by: Naoya Horiguchi
    Cc: "Kirill A. Shutemov"
    Cc: Hugh Dickins
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi
     

05 Jul, 2016

1 commit

  • The vGPU folks would like to trap the first access to a BAR by setting
    vm_ops on the VMAs produced by mmap-ing a VFIO device. The fault handler
    then can use remap_pfn_range to place some non-reserved pages in the VMA.

    This kind of VM_PFNMAP mapping is not handled by KVM, but follow_pfn
    and fixup_user_fault together help support it. The patch also supports
    VM_MIXEDMAP vmas where the pfns are not reserved and thus subject to
    reference counting.
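
    A rough sketch of the lookup this enables (illustrative; the exact helper
    in kvm_main.c may differ, and write_fault is a placeholder):

    r = follow_pfn(vma, addr, &pfn);
    if (r) {
        /* no pte yet: run the vma's fault handler (which may call
         * remap_pfn_range()), then retry the lookup */
        r = fixup_user_fault(current, current->mm, addr,
                             write_fault ? FAULT_FLAG_WRITE : 0, NULL);
        if (!r)
            r = follow_pfn(vma, addr, &pfn);
    }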

    Cc: Xiao Guangrong
    Cc: Andrea Arcangeli
    Cc: Radim Krčmář
    Tested-by: Neo Jia
    Reported-by: Kirti Wankhede
    Signed-off-by: Paolo Bonzini

    Paolo Bonzini
     

15 Apr, 2016

1 commit


07 Apr, 2016

1 commit

  • The pkeys changes brought about a truly hideous set of macros in:

    cde70140fed8 ("mm/gup: Overload get_user_pages() functions")

    ... which macros are (ab-)using the fact that __VA_ARGS__ can be used
    to shift parameter positions in macro arguments without breaking the
    build and so can be used to call separate C functions depending on
    the number of arguments of the macro.

    This allowed easy migration of these 3 GUP APIs, as both these variants
    worked at the C level:

    old:
    ret = get_user_pages(current, current->mm, address, 1, 1, 0, &page, NULL);

    new:
    ret = get_user_pages(address, 1, 1, 0, &page, NULL);

    ... while we also generated a (functionally harmless but noticeable) build
    time warning if the old API was used. As there are over 300 uses of these
    APIs, this trick eased the migration of the API and avoided excessive
    migration pain in linux-next.

    Now, with its work done, get rid of all of that complication and ugliness:

    3 files changed, 16 insertions(+), 140 deletions(-)

    ... where the linecount of the migration hack was further inflated by the
    fact that there are NOMMU variants of these GUP APIs as well.

    Much of the conversion was done in linux-next over the past couple of months,
    and Linus recently removed all remaining old API uses from the upstream tree
    in the following upstream commit:

    cb107161df3c ("Convert straggling drivers to new six-argument get_user_pages()")

    There was one more old-API usage in mm/gup.c, in the CONFIG_HAVE_GENERIC_RCU_GUP
    code path that ARM, ARM64 and PowerPC use.

    After this commit any old API usage will break the build.

    [ Also fixed a PowerPC/HAVE_GENERIC_RCU_GUP warning reported by Stephen Rothwell. ]

    Cc: Andrew Morton
    Cc: Dave Hansen
    Cc: Dave Hansen
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Stephen Rothwell
    Cc: Thomas Gleixner
    Cc: linux-kernel@vger.kernel.org
    Cc: linux-mm@kvack.org
    Signed-off-by: Ingo Molnar

    Ingo Molnar
     

05 Apr, 2016

1 commit


19 Feb, 2016

2 commits

  • As discussed earlier, we attempt to enforce protection keys in
    software.

    However, the code checks all faults to ensure that they are not
    violating protection key permissions. It was assumed that all
    faults are either write faults where we check PKRU[key].WD (write
    disable) or read faults where we check the AD (access disable)
    bit.

    But, there is a third category of faults for protection keys:
    instruction faults. Instruction faults never run afoul of
    protection keys because they do not affect instruction fetches.

    So, plumb the PF_INSTR bit down into the
    arch_vma_access_permitted() function where we do the protection
    key checks.

    We also add a new FAULT_FLAG_INSTRUCTION. This is because
    handle_mm_fault() is not passed the architecture-specific
    error_code where we keep PF_INSTR, so we need to encode the
    instruction fetch information into the arch-generic fault
    flags.
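
    The plumbing boils down to something like this on the x86 side (a sketch
    of the idea, not the exact hunk):

    if (error_code & PF_INSTR)
        flags |= FAULT_FLAG_INSTRUCTION;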

    Signed-off-by: Dave Hansen
    Reviewed-by: Thomas Gleixner
    Cc: Andrew Morton
    Cc: Andy Lutomirski
    Cc: Borislav Petkov
    Cc: Brian Gerst
    Cc: Dave Hansen
    Cc: Denys Vlasenko
    Cc: H. Peter Anvin
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Rik van Riel
    Cc: linux-mm@kvack.org
    Link: http://lkml.kernel.org/r/20160212210224.96928009@viggo.jf.intel.com
    Signed-off-by: Ingo Molnar

    Dave Hansen
     
  • We try to enforce protection keys in software the same way that we
    do in hardware. (See long example below).

    But, we only want to do this when accessing our *own* process's
    memory. If GDB set PKRU[6].AD=1 (disable access to PKEY 6), then
    tried to PTRACE_POKE a target process which just happened to have
    some mprotect_pkey(pkey=6) memory, we do *not* want to deny the
    debugger access to that memory. PKRU is fundamentally a
    thread-local structure and we do not want to enforce it on access
    to _another_ thread's data.

    This gets especially tricky when we have workqueues or other
    delayed-work mechanisms that might run in a random process's context.
    We can check that we only enforce pkeys when operating on our *own* mm,
    but delayed work gets performed when a random user context is active.
    We might end up with a situation where a delayed-work gup fails when
    running randomly under its "own" task but succeeds when running under
    another process. We want to avoid that.

    To avoid that, we use the new GUP flag: FOLL_REMOTE and add a
    fault flag: FAULT_FLAG_REMOTE. They indicate that we are
    walking an mm which is not guaranteed to be the same as
    current->mm and should not be subject to protection key
    enforcement.

    Thanks to Jerome Glisse for pointing out this scenario.
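
    A minimal sketch of the intended check (illustrative; the exact signature
    of the arch hook varies across this series):

    bool foreign = (gup_flags & FOLL_REMOTE) ||
                   (fault_flags & FAULT_FLAG_REMOTE);
    /* pkeys are only enforced for accesses to our own mm */
    if (!foreign && !arch_vma_access_permitted(vma, write))
        return -EFAULT;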

    Signed-off-by: Dave Hansen
    Reviewed-by: Thomas Gleixner
    Cc: Alexey Kardashevskiy
    Cc: Andrea Arcangeli
    Cc: Andrew Morton
    Cc: Andy Lutomirski
    Cc: Andy Lutomirski
    Cc: Arnd Bergmann
    Cc: Benjamin Herrenschmidt
    Cc: Boaz Harrosh
    Cc: Borislav Petkov
    Cc: Brian Gerst
    Cc: Dan Williams
    Cc: Dave Chinner
    Cc: Dave Hansen
    Cc: David Gibson
    Cc: Denys Vlasenko
    Cc: Dominik Dingel
    Cc: Dominik Vogt
    Cc: Eric B Munson
    Cc: Geliang Tang
    Cc: Guan Xuetao
    Cc: H. Peter Anvin
    Cc: Heiko Carstens
    Cc: Hugh Dickins
    Cc: Jan Kara
    Cc: Jason Low
    Cc: Jerome Marchand
    Cc: Joerg Roedel
    Cc: Kirill A. Shutemov
    Cc: Konstantin Khlebnikov
    Cc: Laurent Dufour
    Cc: Linus Torvalds
    Cc: Martin Schwidefsky
    Cc: Matthew Wilcox
    Cc: Mel Gorman
    Cc: Michael Ellerman
    Cc: Michal Hocko
    Cc: Mikulas Patocka
    Cc: Minchan Kim
    Cc: Oleg Nesterov
    Cc: Paul Mackerras
    Cc: Peter Zijlstra
    Cc: Rik van Riel
    Cc: Sasha Levin
    Cc: Shachar Raindel
    Cc: Vlastimil Babka
    Cc: Xie XiuQi
    Cc: iommu@lists.linux-foundation.org
    Cc: linux-arch@vger.kernel.org
    Cc: linux-kernel@vger.kernel.org
    Cc: linux-mm@kvack.org
    Cc: linux-s390@vger.kernel.org
    Cc: linuxppc-dev@lists.ozlabs.org
    Signed-off-by: Ingo Molnar

    Dave Hansen
     

18 Feb, 2016

2 commits

  • Today, for normal faults and page table walks, we check the VMA
    and/or PTE to ensure that it is compatible with the action. For
    instance, if we get a write fault on a non-writeable VMA, we
    SIGSEGV.

    We try to do the same thing for protection keys. Basically, we
    try to make sure that if a user does this:

    mprotect(ptr, size, PROT_NONE);
    *ptr = foo;

    they see the same effects with protection keys when they do this:

    mprotect(ptr, size, PROT_READ|PROT_WRITE);
    set_pkey(ptr, size, 4);
    wrpkru(0xffffff3f); // access disable pkey 4
    *ptr = foo;

    The state to do that checking is in the VMA, but we also
    sometimes have to do it on the page tables only, like when doing
    a get_user_pages_fast() where we have no VMA.

    We add two functions and expose them to generic code:

    arch_pte_access_permitted(pte_flags, write)
    arch_vma_access_permitted(vma, write)

    These are, of course, backed up in x86 arch code with checks
    against the PTE or VMA's protection key.

    But, there are also cases where we do not want to respect
    protection keys. When we ptrace(), for instance, we do not want
    to apply the tracer's PKRU permissions to the PTEs from the
    process being traced.
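
    As a sketch (illustrative), the PTE-level hook is what a VMA-less walker
    such as get_user_pages_fast() would consult, with pte_flags standing for
    the entry's flag bits as in the prototype above:

    if (!arch_pte_access_permitted(pte_flags, write))
        return 0;   /* bail out; the slow path has the VMA and takes the fault */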

    Signed-off-by: Dave Hansen
    Reviewed-by: Thomas Gleixner
    Cc: Alexey Kardashevskiy
    Cc: Andrew Morton
    Cc: Andy Lutomirski
    Cc: Andy Lutomirski
    Cc: Aneesh Kumar K.V
    Cc: Arnd Bergmann
    Cc: Benjamin Herrenschmidt
    Cc: Boaz Harrosh
    Cc: Borislav Petkov
    Cc: Brian Gerst
    Cc: Dan Williams
    Cc: Dave Hansen
    Cc: David Gibson
    Cc: David Hildenbrand
    Cc: David Vrabel
    Cc: Denys Vlasenko
    Cc: Dominik Dingel
    Cc: Dominik Vogt
    Cc: Guan Xuetao
    Cc: H. Peter Anvin
    Cc: Heiko Carstens
    Cc: Hugh Dickins
    Cc: Jason Low
    Cc: Jerome Marchand
    Cc: Juergen Gross
    Cc: Kirill A. Shutemov
    Cc: Laurent Dufour
    Cc: Linus Torvalds
    Cc: Martin Schwidefsky
    Cc: Matthew Wilcox
    Cc: Mel Gorman
    Cc: Michael Ellerman
    Cc: Michal Hocko
    Cc: Mikulas Patocka
    Cc: Minchan Kim
    Cc: Paul Mackerras
    Cc: Peter Zijlstra
    Cc: Rik van Riel
    Cc: Sasha Levin
    Cc: Shachar Raindel
    Cc: Stephen Smalley
    Cc: Toshi Kani
    Cc: Vlastimil Babka
    Cc: linux-arch@vger.kernel.org
    Cc: linux-kernel@vger.kernel.org
    Cc: linux-mm@kvack.org
    Cc: linux-s390@vger.kernel.org
    Cc: linuxppc-dev@lists.ozlabs.org
    Link: http://lkml.kernel.org/r/20160212210219.14D5D715@viggo.jf.intel.com
    Signed-off-by: Ingo Molnar

    Dave Hansen
     
  • This code matches a fault condition up with the VMA and ensures
    that the VMA allows the fault to be handled instead of just
    erroring out.

    We will be extending this in a moment to comprehend protection
    keys.

    Signed-off-by: Dave Hansen
    Reviewed-by: Thomas Gleixner
    Cc: Andrew Morton
    Cc: Andy Lutomirski
    Cc: Aneesh Kumar K.V
    Cc: Borislav Petkov
    Cc: Brian Gerst
    Cc: Dan Williams
    Cc: Dave Hansen
    Cc: Denys Vlasenko
    Cc: Dominik Dingel
    Cc: Eric B Munson
    Cc: H. Peter Anvin
    Cc: Jason Low
    Cc: Kirill A. Shutemov
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Rik van Riel
    Cc: Sasha Levin
    Cc: linux-kernel@vger.kernel.org
    Cc: linux-mm@kvack.org
    Link: http://lkml.kernel.org/r/20160212210216.C3824032@viggo.jf.intel.com
    Signed-off-by: Ingo Molnar

    Dave Hansen
     

16 Feb, 2016

3 commits

  • We will soon modify the vanilla get_user_pages() so it can no
    longer be used on mm/tasks other than 'current/current->mm',
    which is by far the most common way it is called. For now,
    we allow the old-style calls, but warn when they are used.
    (implemented in previous patch)

    This patch switches all callers of:

    get_user_pages()
    get_user_pages_unlocked()
    get_user_pages_locked()

    to stop passing tsk/mm so they will no longer see the warnings.
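
    A representative conversion looks like this (sketch; note that 'write' and
    'force' were still separate arguments at this point):

    /* old style, now warned about: */
    get_user_pages(current, current->mm, addr, 1, write, force, pages, NULL);
    /* new style: */
    get_user_pages(addr, 1, write, force, pages, NULL);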

    Signed-off-by: Dave Hansen
    Reviewed-by: Thomas Gleixner
    Cc: Andrea Arcangeli
    Cc: Andrew Morton
    Cc: Andy Lutomirski
    Cc: Borislav Petkov
    Cc: Brian Gerst
    Cc: Dave Hansen
    Cc: Denys Vlasenko
    Cc: H. Peter Anvin
    Cc: Kirill A. Shutemov
    Cc: Linus Torvalds
    Cc: Naoya Horiguchi
    Cc: Peter Zijlstra
    Cc: Rik van Riel
    Cc: Srikar Dronamraju
    Cc: Vlastimil Babka
    Cc: jack@suse.cz
    Cc: linux-mm@kvack.org
    Link: http://lkml.kernel.org/r/20160212210156.113E9407@viggo.jf.intel.com
    Signed-off-by: Ingo Molnar

    Dave Hansen
     
  • The concept here was a suggestion from Ingo. The implementation
    horrors are all mine.

    This allows get_user_pages(), get_user_pages_unlocked(), and
    get_user_pages_locked() to be called with or without the
    leading tsk/mm arguments. We will give a compile-time warning
    about the old style being __deprecated and we will also
    WARN_ON() if the non-remote version is used for a remote-style
    access.

    Doing this, folks will get nice warnings and will not break the
    build. This should be nice for -next and will hopefully let
    developers fix up their own code instead of maintainers needing
    to do it at merge time.

    The way we do this is hideous. It uses the __VA_ARGS__ macro
    functionality to call different functions based on the number
    of arguments passed to the macro.

    There's an additional hack to ensure that our EXPORT_SYMBOL()
    of the deprecated symbols doesn't trigger a warning.

    We should be able to remove this mess as soon as -rc1 hits in
    the release after this is merged.
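
    The trick boils down to something like this (a simplified sketch, not the
    kernel's actual macro; get_user_pages6/get_user_pages8 are stand-in names
    for the new- and old-style C functions):

    #define GUP_SELECT(_1, _2, _3, _4, _5, _6, _7, _8, NAME, ...) NAME
    #define get_user_pages(...)                                        \
            GUP_SELECT(__VA_ARGS__, get_user_pages8, _bad_arg_count,   \
                       get_user_pages6)(__VA_ARGS__)
    /* 8 arguments select get_user_pages8(), 6 arguments select
     * get_user_pages6(); anything else fails to build. */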

    Signed-off-by: Dave Hansen
    Cc: Al Viro
    Cc: Alexander Kuleshov
    Cc: Andrea Arcangeli
    Cc: Andrew Morton
    Cc: Dan Williams
    Cc: Dave Hansen
    Cc: Dominik Dingel
    Cc: Geliang Tang
    Cc: Jan Kara
    Cc: Johannes Weiner
    Cc: Kirill A. Shutemov
    Cc: Konstantin Khlebnikov
    Cc: Leon Romanovsky
    Cc: Linus Torvalds
    Cc: Masahiro Yamada
    Cc: Mateusz Guzik
    Cc: Maxime Coquelin
    Cc: Michal Hocko
    Cc: Naoya Horiguchi
    Cc: Oleg Nesterov
    Cc: Paul Gortmaker
    Cc: Peter Zijlstra
    Cc: Srikar Dronamraju
    Cc: Thomas Gleixner
    Cc: Vladimir Davydov
    Cc: Vlastimil Babka
    Cc: Xie XiuQi
    Cc: linux-kernel@vger.kernel.org
    Cc: linux-mm@kvack.org
    Link: http://lkml.kernel.org/r/20160212210155.73222EE1@viggo.jf.intel.com
    Signed-off-by: Ingo Molnar

    Dave Hansen
     
  • For protection keys, we need to understand whether protections
    should be enforced in software or not. In general, we enforce
    protections when working on our own task, but not when on others.
    We call these "current" and "remote" operations.

    This patch introduces a new get_user_pages() variant:

    get_user_pages_remote()

    Which is a replacement for when get_user_pages() is called on
    non-current tsk/mm.

    We also introduce a new gup flag: FOLL_REMOTE which can be used
    for the "__" gup variants to get this new behavior.

    The uprobes is_trap_at_addr() location holds mmap_sem and
    calls get_user_pages(current->mm) on an instruction address. This
    makes it a pretty unique gup caller. Being an instruction access
    and also really originating from the kernel (vs. the app), I opted
    to consider this a 'remote' access where protection keys will not
    be enforced.

    Without protection keys, this patch should not change any behavior.
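
    A sketch of the intended split (illustrative; 'write' and 'force' are
    unchanged at this point):

    /* operating on another process's mm: keep passing tsk/mm explicitly */
    get_user_pages_remote(tsk, mm, addr, 1, write, force, pages, NULL);
    /* operating on current/current->mm: the tsk/mm arguments are dropped */
    get_user_pages(addr, 1, write, force, pages, NULL);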

    Signed-off-by: Dave Hansen
    Reviewed-by: Thomas Gleixner
    Cc: Andrea Arcangeli
    Cc: Andrew Morton
    Cc: Andy Lutomirski
    Cc: Borislav Petkov
    Cc: Brian Gerst
    Cc: Dave Hansen
    Cc: Denys Vlasenko
    Cc: H. Peter Anvin
    Cc: Kirill A. Shutemov
    Cc: Linus Torvalds
    Cc: Naoya Horiguchi
    Cc: Peter Zijlstra
    Cc: Rik van Riel
    Cc: Srikar Dronamraju
    Cc: Vlastimil Babka
    Cc: jack@suse.cz
    Cc: linux-mm@kvack.org
    Link: http://lkml.kernel.org/r/20160212210154.3F0E51EA@viggo.jf.intel.com
    Signed-off-by: Ingo Molnar

    Dave Hansen
     

04 Feb, 2016

1 commit

  • Trinity is now hitting the WARN_ON_ONCE we added in v3.15 commit
    cda540ace6a1 ("mm: get_user_pages(write,force) refuse to COW in shared
    areas"). The warning has served its purpose, nobody was harmed by that
    change, so just remove the warning to generate less noise from Trinity.

    Which reminds me of the comment I wrongly left behind with that commit
    (but was spotted at the time by Kirill), which has since moved into a
    separate function, and become even more obscure: delete it.

    Reported-by: Dave Jones
    Suggested-by: Kirill A. Shutemov
    Signed-off-by: Hugh Dickins
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     

16 Jan, 2016

9 commits

  • During Jason's work with postcopy migration support for s390 a problem
    regarding gmap faults was discovered.

    The gmap code will call fixup_user_fault, which will always end up in
    handle_mm_fault. Till now we never cared about retries, but as the
    userfaultfd code kind of relies on it, this needs some fix.

    This patchset does not take care of the futex code. I will now look
    closer at this.

    This patch (of 2):

    With the introduction of userfaultfd, kvm on s390 needs fixup_user_fault
    to pass in FAULT_FLAG_ALLOW_RETRY and give feedback if during the
    faulting we ever unlocked mmap_sem.

    This patch brings in the logic to handle retries and also cleans up the
    current documentation. fixup_user_fault did not have the same semantics
    as filemap_fault: it never indicated whether a retry happened, so a
    caller wasn't able to handle that case. We have therefore changed the
    behaviour to always retry a locked mmap_sem.
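
    A caller-side sketch of the new contract (illustrative; 'unlocked' reports
    whether mmap_sem was dropped and re-taken while handling the fault):

    bool unlocked = false;

    down_read(&mm->mmap_sem);
    ret = fixup_user_fault(current, mm, addr, FAULT_FLAG_WRITE, &unlocked);
    if (unlocked) {
        /* mmap_sem was released during the fault: any VMA pointers cached
         * before the call are stale and must be looked up again */
    }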

    Signed-off-by: Dominik Dingel
    Reviewed-by: Andrea Arcangeli
    Cc: "Kirill A. Shutemov"
    Cc: Martin Schwidefsky
    Cc: Christian Borntraeger
    Cc: "Jason J. Herne"
    Cc: David Rientjes
    Cc: Eric B Munson
    Cc: Naoya Horiguchi
    Cc: Mel Gorman
    Cc: Heiko Carstens
    Cc: Dominik Dingel
    Cc: Paolo Bonzini
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dominik Dingel
     
  • A dax mapping establishes a pte with _PAGE_DEVMAP set when the driver
    has established a devm_memremap_pages() mapping, i.e. when the pfn_t
    return from ->direct_access() has PFN_DEV and PFN_MAP set. Later, when
    encountering _PAGE_DEVMAP during a page table walk we lookup and pin a
    struct dev_pagemap instance to keep the result of pfn_to_page() valid
    until put_page().
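
    A sketch of the page-table-walk side (illustrative, not the exact hunk):

    if (pte_devmap(pte)) {
        pgmap = get_dev_pagemap(pte_pfn(pte), pgmap);
        if (!pgmap)
            return 0;   /* device mapping went away; fall back to slow path */
    }
    page = pte_page(pte);
    get_page(page);     /* keeps pfn_to_page() valid until put_page() */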

    Signed-off-by: Dan Williams
    Tested-by: Logan Gunthorpe
    Cc: Dave Hansen
    Cc: Mel Gorman
    Cc: Peter Zijlstra
    Cc: Andrea Arcangeli
    Cc: Thomas Gleixner
    Cc: Ingo Molnar
    Cc: "H. Peter Anvin"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dan Williams
     
  • Before THP refcounting rework, THP was not allowed to cross VMA
    boundary. So, if we have THP and we split it, PG_mlocked can be safely
    transferred to small pages.

    With new THP refcounting and naive approach to mlocking we can end up
    with this scenario:
    1. we have a mlocked THP, which belong to one VM_LOCKED VMA.
    2. the process does munlock() on the *part* of the THP:
    - the VMA is split into two, one of them VM_LOCKED;
    - huge PMD split into PTE table;
    - THP is still mlocked;
    3. split_huge_page():
    - it transfers PG_mlocked to *all* small pages regardless of whether they
    belong to any VM_LOCKED VMA.

    We probably could munlock() all small pages on split_huge_page(), but I
    think we already have an accounting issue at step two.

    Instead of forbidding mlocked pages altogether, we just avoid mlocking
    PTE-mapped THPs and munlock THPs on split_huge_pmd().

    This means PTE-mapped THPs will be on normal lru lists and will be split
    under memory pressure by vmscan. After the split vmscan will detect
    unevictable small pages and mlock them.

    With this approach we shouldn't hit situation like described above.

    Signed-off-by: Kirill A. Shutemov
    Cc: Sasha Levin
    Cc: Aneesh Kumar K.V
    Cc: Jerome Marchand
    Cc: Vlastimil Babka
    Cc: Andrea Arcangeli
    Cc: Hugh Dickins
    Cc: Dave Hansen
    Cc: Mel Gorman
    Cc: Rik van Riel
    Cc: Naoya Horiguchi
    Cc: Steve Capper
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Christoph Lameter
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • With new refcounting we don't need to mark PMDs splitting. Let's drop
    code to handle this.

    Signed-off-by: Kirill A. Shutemov
    Tested-by: Sasha Levin
    Tested-by: Aneesh Kumar K.V
    Acked-by: Vlastimil Babka
    Acked-by: Jerome Marchand
    Cc: Andrea Arcangeli
    Cc: Hugh Dickins
    Cc: Dave Hansen
    Cc: Mel Gorman
    Cc: Rik van Riel
    Cc: Naoya Horiguchi
    Cc: Steve Capper
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Christoph Lameter
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • Tail page refcounting is utterly complicated and painful to support.

    It uses ->_mapcount on tail pages to store how many times this page is
    pinned. get_page() bumps ->_mapcount on tail page in addition to
    ->_count on head. This information is required by split_huge_page() to
    be able to distribute pins from head of compound page to tails during
    the split.

    We will need ->_mapcount to account PTE mappings of subpages of the
    compound page. We eliminate the need for the current meaning of ->_mapcount
    in tail pages by forbidding the split entirely if the page is pinned.

    The only user of tail page refcounting is THP which is marked BROKEN for
    now.

    Let's drop all this mess. It makes get_page() and put_page() much
    simpler.

    Signed-off-by: Kirill A. Shutemov
    Tested-by: Sasha Levin
    Tested-by: Aneesh Kumar K.V
    Acked-by: Vlastimil Babka
    Acked-by: Jerome Marchand
    Cc: Andrea Arcangeli
    Cc: Hugh Dickins
    Cc: Dave Hansen
    Cc: Mel Gorman
    Cc: Rik van Riel
    Cc: Naoya Horiguchi
    Cc: Steve Capper
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Christoph Lameter
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • We are going to decouple splitting THP PMD from splitting underlying
    compound page.

    This patch renames split_huge_page_pmd*() functions to split_huge_pmd*()
    to reflect the fact that it doesn't imply page splitting, only PMD.

    Signed-off-by: Kirill A. Shutemov
    Tested-by: Sasha Levin
    Tested-by: Aneesh Kumar K.V
    Acked-by: Jerome Marchand
    Acked-by: Vlastimil Babka
    Cc: Andrea Arcangeli
    Cc: Hugh Dickins
    Cc: Dave Hansen
    Cc: Mel Gorman
    Cc: Rik van Riel
    Cc: Naoya Horiguchi
    Cc: Steve Capper
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Christoph Lameter
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • With the new refcounting a THP can belong to several VMAs. This makes it
    tricky to track THP pages when they are partially mlocked, and can lead to
    leaking mlocked pages to non-VM_LOCKED vmas and other problems.

    With this patch we will split all pages on mlock and avoid
    fault-in/collapse new THP in VM_LOCKED vmas.

    I've tried an alternative approach: do not mark THP pages mlocked and keep
    them on normal LRUs. This way vmscan could try to split huge pages under
    memory pressure and free up subpages which don't belong to VM_LOCKED vmas.
    But this is a user-visible change: it screws up the Mlocked accounting
    reported in meminfo, so I had to leave this approach aside.

    We can bring something better later, but this should be good enough for
    now.

    Signed-off-by: Kirill A. Shutemov
    Tested-by: Sasha Levin
    Tested-by: Aneesh Kumar K.V
    Acked-by: Jerome Marchand
    Cc: Vlastimil Babka
    Cc: Andrea Arcangeli
    Cc: Hugh Dickins
    Cc: Dave Hansen
    Cc: Mel Gorman
    Cc: Rik van Riel
    Cc: Naoya Horiguchi
    Cc: Steve Capper
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Christoph Lameter
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • With new refcounting we are going to see THP tail pages mapped with PTE.
    Generic fast GUP relies on page_cache_get_speculative() to obtain a
    reference on the page. page_cache_get_speculative() always fails on tail
    pages, because ->_count on tail pages is always zero.

    Let's handle tail pages in gup_pte_range().

    The new split_huge_page() will rely on migration entries to freeze the
    page's counts. Rechecking the PTE value after page_cache_get_speculative()
    on the head page should be enough to serialize against split.
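
    The resulting gup_pte_range() logic is roughly (sketch):

    page = pte_page(pte);
    head = compound_head(page);
    if (!page_cache_get_speculative(head))
        goto pte_unmap;                         /* couldn't pin the head page */
    if (unlikely(pte_val(pte) != pte_val(*ptep))) {
        put_page(head);                         /* raced with split/unmap */
        goto pte_unmap;
    }
    pages[*nr] = page;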

    Signed-off-by: Kirill A. Shutemov
    Tested-by: Sasha Levin
    Tested-by: Aneesh Kumar K.V
    Acked-by: Jerome Marchand
    Acked-by: Vlastimil Babka
    Cc: Andrea Arcangeli
    Cc: Hugh Dickins
    Cc: Dave Hansen
    Cc: Mel Gorman
    Cc: Rik van Riel
    Cc: Naoya Horiguchi
    Cc: Steve Capper
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Christoph Lameter
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • We need to prepare the kernel to allow transhuge pages to be mapped with
    ptes too. We need to handle FOLL_SPLIT in follow_page_pte().

    Also, we use split_huge_page() directly instead of split_huge_page_pmd();
    split_huge_page_pmd() will be gone.

    Signed-off-by: Kirill A. Shutemov
    Tested-by: Sasha Levin
    Tested-by: Aneesh Kumar K.V
    Acked-by: Vlastimil Babka
    Acked-by: Jerome Marchand
    Cc: Andrea Arcangeli
    Cc: Hugh Dickins
    Cc: Dave Hansen
    Cc: Mel Gorman
    Cc: Rik van Riel
    Cc: Naoya Horiguchi
    Cc: Steve Capper
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Christoph Lameter
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     

06 Nov, 2015

1 commit

  • The cost of faulting in all memory to be locked can be very high when
    working with large mappings. If only portions of the mapping will be used
    this can incur a high penalty for locking.

    For the example of a large file, this is the usage pattern for a large
    statistical language model (probably applies to other statistical or graphical
    models as well). For the security example, any application transacting in
    data that cannot be swapped out (credit card data, medical records, etc).

    This patch introduces the ability to request that pages are not
    pre-faulted, but are placed on the unevictable LRU when they are finally
    faulted in. The VM_LOCKONFAULT flag will be used together with VM_LOCKED
    and has no effect when set without VM_LOCKED. Setting the VM_LOCKONFAULT
    flag for a VMA will cause pages faulted into that VMA to be added to the
    unevictable LRU when they are faulted or if they are already present, but
    will not cause any missing pages to be faulted in.

    Exposing this new lock state means that we cannot overload the meaning of
    the FOLL_POPULATE flag any longer. Prior to this patch it was used to
    mean that the VMA for a fault was locked. This means we need the new
    FOLL_MLOCK flag to communicate the locked state of a VMA. FOLL_POPULATE
    will now only control if the VMA should be populated and in the case of
    VM_LOCKONFAULT, it will not be set.
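
    A sketch of how the flags are meant to be chosen when populating a locked
    VMA (illustrative, not necessarily the exact patch hunk):

    gup_flags = FOLL_TOUCH | FOLL_MLOCK;        /* the VMA is VM_LOCKED */
    if (!(vma->vm_flags & VM_LOCKONFAULT))
        gup_flags |= FOLL_POPULATE;             /* pre-fault only if requested */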

    Signed-off-by: Eric B Munson
    Acked-by: Kirill A. Shutemov
    Acked-by: Vlastimil Babka
    Cc: Michal Hocko
    Cc: Jonathan Corbet
    Cc: Catalin Marinas
    Cc: Geert Uytterhoeven
    Cc: Guenter Roeck
    Cc: Heiko Carstens
    Cc: Michael Kerrisk
    Cc: Ralf Baechle
    Cc: Shuah Khan
    Cc: Stephen Rothwell
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Eric B Munson
     

05 Sep, 2015

1 commit

  • With DAX, pfn mappings are becoming more common. The patch adjusts the GUP
    code to cover pfn mappings for cases when we don't need a struct page to
    proceed.

    To make it possible, let's change follow_page() code to return -EEXIST
    error code if proper page table entry exists, but no corresponding struct
    page. __get_user_pages() would ignore the error code and move to the next
    page frame.

    The immediate effect of the change is working MAP_POPULATE and mlock() on
    DAX mappings.

    [akpm@linux-foundation.org: fix arm64 build]
    Signed-off-by: Kirill A. Shutemov
    Reviewed-by: Toshi Kani
    Acked-by: Matthew Wilcox
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     

16 Apr, 2015

1 commit

  • Commit 38c5ce936a08 ("mm/gup: Replace ACCESS_ONCE with READ_ONCE")
    converted ACCESS_ONCE usage in gup_pmd_range() to READ_ONCE, since
    ACCESS_ONCE doesn't work reliably on non-scalar types.

    This patch also fixes the other ACCESS_ONCE usages in gup_pte_range()
    and __get_user_pages_fast() in mm/gup.c
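
    The conversion pattern is the same in each case (illustrative):

    pte_t pte = READ_ONCE(*ptep);       /* was: ACCESS_ONCE(*ptep) */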

    Signed-off-by: Jason Low
    Acked-by: Michal Hocko
    Acked-by: Davidlohr Bueso
    Acked-by: Rik van Riel
    Reviewed-by: Christian Borntraeger
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jason Low
     

15 Apr, 2015

2 commits

  • It's odd that we have populate_vma_page_range() and __mm_populate() in
    mm/mlock.c. They implement generic memory population, and mlocking is only
    one possible side effect, if VM_LOCKED is set.

    __get_user_pages() is core of the implementation. Let's move the code
    into mm/gup.c.

    Signed-off-by: Kirill A. Shutemov
    Acked-by: Linus Torvalds
    Acked-by: David Rientjes
    Cc: Michel Lespinasse
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • After commit a1fde08c74e9 ("VM: skip the stack guard page lookup in
    get_user_pages only for mlock") FOLL_MLOCK has lost its original
    meaning: we don't necessarily mlock the page if the flag is set -- we
    also take VM_LOCKED into consideration.

    Since we use the same codepath for __mm_populate(), let's rename
    FOLL_MLOCK to FOLL_POPULATE.

    Signed-off-by: Kirill A. Shutemov
    Acked-by: Linus Torvalds
    Acked-by: David Rientjes
    Cc: Michel Lespinasse
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     

15 Feb, 2015

1 commit

  • Pull ACCESS_ONCE() rule tightening from Christian Borntraeger:
    "Tighten rules for ACCESS_ONCE

    This series tightens the rules for ACCESS_ONCE to only work on scalar
    types. It also contains the necessary fixups as indicated by build
    bots of linux-next. Now everything is in place to prevent new
    non-scalar users of ACCESS_ONCE and we can continue to convert code to
    READ_ONCE/WRITE_ONCE"

    * tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/borntraeger/linux:
    kernel: Fix sparse warning for ACCESS_ONCE
    next: sh: Fix compile error
    kernel: tighten rules for ACCESS ONCE
    mm/gup: Replace ACCESS_ONCE with READ_ONCE
    x86/spinlock: Leftover conversion ACCESS_ONCE->READ_ONCE
    x86/xen/p2m: Replace ACCESS_ONCE with READ_ONCE
    ppc/hugetlbfs: Replace ACCESS_ONCE with READ_ONCE
    ppc/kvm: Replace ACCESS_ONCE with READ_ONCE

    Linus Torvalds
     

13 Feb, 2015

1 commit

  • Convert existing users of pte_numa and friends to the new helper. Note
    that the kernel is broken after this patch is applied until the other page
    table modifiers are also altered. This patch layout is to make review
    easier.

    Signed-off-by: Mel Gorman
    Acked-by: Linus Torvalds
    Acked-by: Aneesh Kumar
    Acked-by: Benjamin Herrenschmidt
    Tested-by: Sasha Levin
    Cc: Dave Jones
    Cc: Hugh Dickins
    Cc: Ingo Molnar
    Cc: Kirill Shutemov
    Cc: Paul Mackerras
    Cc: Rik van Riel
    Cc: Sasha Levin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman