19 Feb, 2016

2 commits

  • As discussed earlier, we attempt to enforce protection keys in
    software.

    However, the code checks all faults to ensure that they are not
    violating protection key permissions. It was assumed that all
    faults are either write faults where we check PKRU[key].WD (write
    disable) or read faults where we check the AD (access disable)
    bit.

    But, there is a third category of faults for protection keys:
    instruction faults. Instruction faults never run afoul of
    protection keys because protection keys do not affect instruction
    fetches.

    So, plumb the PF_INSTR bit down into the
    arch_vma_access_permitted() function where we do the protection
    key checks.

    We also add a new FAULT_FLAG_INSTRUCTION. This is because
    handle_mm_fault() is not passed the architecture-specific
    error_code where we keep PF_INSTR, so we need to encode the
    instruction fetch information into the arch-generic fault
    flags.

    Signed-off-by: Dave Hansen
    Reviewed-by: Thomas Gleixner
    Cc: Andrew Morton
    Cc: Andy Lutomirski
    Cc: Borislav Petkov
    Cc: Brian Gerst
    Cc: Dave Hansen
    Cc: Denys Vlasenko
    Cc: H. Peter Anvin
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Rik van Riel
    Cc: linux-mm@kvack.org
    Link: http://lkml.kernel.org/r/20160212210224.96928009@viggo.jf.intel.com
    Signed-off-by: Ingo Molnar

    Dave Hansen
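
    A minimal, self-contained sketch of the plumbing described above; the bit
    values and the helper name are illustrative assumptions, not taken from
    the patch:

    /* Illustrative stand-in values; the real definitions live in the x86
     * fault code and include/linux/mm.h. */
    #define PF_WRITE               (1u << 1)
    #define PF_INSTR               (1u << 4)
    #define FAULT_FLAG_WRITE       0x01u
    #define FAULT_FLAG_INSTRUCTION 0x100u

    /* Hypothetical helper: derive arch-generic fault flags from the x86
     * error code, so handle_mm_fault() -- which never sees error_code --
     * still learns that the fault was an instruction fetch and the pkey
     * check can be skipped. */
    static unsigned int fault_flags_from_error_code(unsigned long error_code)
    {
        unsigned int flags = 0;

        if (error_code & PF_WRITE)
            flags |= FAULT_FLAG_WRITE;
        if (error_code & PF_INSTR)
            flags |= FAULT_FLAG_INSTRUCTION;
        return flags;
    }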
     
  • We try to enforce protection keys in software the same way that we
    do in hardware. (See long example below).

    But, we only want to do this when accessing our *own* process's
    memory. If GDB set PKRU[6].AD=1 (disable access to PKEY 6), then
    tried to PTRACE_POKE a target process which just happened to have
    some mprotect_pkey(pkey=6) memory, we do *not* want to deny the
    debugger access to that memory. PKRU is fundamentally a
    thread-local structure and we do not want to enforce it on access
    to _another_ thread's data.

    This gets especially tricky when we have workqueues or other
    delayed-work mechanisms that might run in a random process's context.
    We can check that we only enforce pkeys when operating on our *own* mm,
    but delayed work gets performed when a random user context is active.
    We might end up with a situation where a delayed-work gup fails when
    running randomly under its "own" task but succeeds when running under
    another process. We want to avoid that.

    To avoid that, we use the new GUP flag: FOLL_REMOTE and add a
    fault flag: FAULT_FLAG_REMOTE. They indicate that we are
    walking an mm which is not guaranteed to be the same as
    current->mm and should not be subject to protection key
    enforcement.

    Thanks to Jerome Glisse for pointing out this scenario.

    Signed-off-by: Dave Hansen
    Reviewed-by: Thomas Gleixner
    Cc: Alexey Kardashevskiy
    Cc: Andrea Arcangeli
    Cc: Andrew Morton
    Cc: Andy Lutomirski
    Cc: Andy Lutomirski
    Cc: Arnd Bergmann
    Cc: Benjamin Herrenschmidt
    Cc: Boaz Harrosh
    Cc: Borislav Petkov
    Cc: Brian Gerst
    Cc: Dan Williams
    Cc: Dave Chinner
    Cc: Dave Hansen
    Cc: David Gibson
    Cc: Denys Vlasenko
    Cc: Dominik Dingel
    Cc: Dominik Vogt
    Cc: Eric B Munson
    Cc: Geliang Tang
    Cc: Guan Xuetao
    Cc: H. Peter Anvin
    Cc: Heiko Carstens
    Cc: Hugh Dickins
    Cc: Jan Kara
    Cc: Jason Low
    Cc: Jerome Marchand
    Cc: Joerg Roedel
    Cc: Kirill A. Shutemov
    Cc: Konstantin Khlebnikov
    Cc: Laurent Dufour
    Cc: Linus Torvalds
    Cc: Martin Schwidefsky
    Cc: Matthew Wilcox
    Cc: Mel Gorman
    Cc: Michael Ellerman
    Cc: Michal Hocko
    Cc: Mikulas Patocka
    Cc: Minchan Kim
    Cc: Oleg Nesterov
    Cc: Paul Mackerras
    Cc: Peter Zijlstra
    Cc: Rik van Riel
    Cc: Sasha Levin
    Cc: Shachar Raindel
    Cc: Vlastimil Babka
    Cc: Xie XiuQi
    Cc: iommu@lists.linux-foundation.org
    Cc: linux-arch@vger.kernel.org
    Cc: linux-kernel@vger.kernel.org
    Cc: linux-mm@kvack.org
    Cc: linux-s390@vger.kernel.org
    Cc: linuxppc-dev@lists.ozlabs.org
    Signed-off-by: Ingo Molnar

    Dave Hansen
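
    A hedged sketch of the rule the entry above describes; only FOLL_REMOTE
    and FAULT_FLAG_REMOTE come from the patch, while the helper names and the
    explicit pkru argument are inventions made to keep the example
    self-contained:

    #include <stdbool.h>

    /* PKRU packs two bits per key: bit 2*k is Access-Disable (AD) and
     * bit 2*k+1 is Write-Disable (WD). Helper names are made up. */
    static bool pkru_ad(unsigned int pkru, int pkey)
    {
        return pkru & (1u << (2 * pkey));
    }

    static bool pkru_wd(unsigned int pkru, int pkey)
    {
        return pkru & (1u << (2 * pkey + 1));
    }

    /* "remote" models FOLL_REMOTE/FAULT_FLAG_REMOTE: PKRU is thread-local,
     * so it must never be enforced while walking another process's mm
     * (ptrace, delayed work running in a borrowed context, ...). */
    static bool pkey_access_ok(unsigned int pkru, int pkey, bool write, bool remote)
    {
        if (remote)
            return true;
        if (pkru_ad(pkru, pkey))
            return false;
        if (write && pkru_wd(pkru, pkey))
            return false;
        return true;
    }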
     

18 Feb, 2016

2 commits

  • Today, for normal faults and page table walks, we check the VMA
    and/or PTE to ensure that it is compatible with the action. For
    instance, if we get a write fault on a non-writeable VMA, we
    SIGSEGV.

    We try to do the same thing for protection keys. Basically, we
    try to make sure that if a user does this:

    mprotect(ptr, size, PROT_NONE);
    *ptr = foo;

    they see the same effects with protection keys when they do this:

    mprotect(ptr, size, PROT_READ|PROT_WRITE);
    set_pkey(ptr, size, 4);
    wrpkru(0xffffff3f); // access disable pkey 4
    *ptr = foo;

    The state to do that checking is in the VMA, but we also
    sometimes have to do it on the page tables only, like when doing
    a get_user_pages_fast() where we have no VMA.

    We add two functions and expose them to generic code:

    arch_pte_access_permitted(pte_flags, write)
    arch_vma_access_permitted(vma, write)

    These are, of course, backed up in x86 arch code with checks
    against the PTE or VMA's protection key.

    But, there are also cases where we do not want to respect
    protection keys. When we ptrace(), for instance, we do not want
    to apply the tracer's PKRU permissions to the PTEs from the
    process being traced.

    Signed-off-by: Dave Hansen
    Reviewed-by: Thomas Gleixner
    Cc: Alexey Kardashevskiy
    Cc: Andrew Morton
    Cc: Andy Lutomirski
    Cc: Andy Lutomirski
    Cc: Aneesh Kumar K.V
    Cc: Arnd Bergmann
    Cc: Benjamin Herrenschmidt
    Cc: Boaz Harrosh
    Cc: Borislav Petkov
    Cc: Brian Gerst
    Cc: Dan Williams
    Cc: Dave Hansen
    Cc: David Gibson
    Cc: David Hildenbrand
    Cc: David Vrabel
    Cc: Denys Vlasenko
    Cc: Dominik Dingel
    Cc: Dominik Vogt
    Cc: Guan Xuetao
    Cc: H. Peter Anvin
    Cc: Heiko Carstens
    Cc: Hugh Dickins
    Cc: Jason Low
    Cc: Jerome Marchand
    Cc: Juergen Gross
    Cc: Kirill A. Shutemov
    Cc: Laurent Dufour
    Cc: Linus Torvalds
    Cc: Martin Schwidefsky
    Cc: Matthew Wilcox
    Cc: Mel Gorman
    Cc: Michael Ellerman
    Cc: Michal Hocko
    Cc: Mikulas Patocka
    Cc: Minchan Kim
    Cc: Paul Mackerras
    Cc: Peter Zijlstra
    Cc: Rik van Riel
    Cc: Sasha Levin
    Cc: Shachar Raindel
    Cc: Stephen Smalley
    Cc: Toshi Kani
    Cc: Vlastimil Babka
    Cc: linux-arch@vger.kernel.org
    Cc: linux-kernel@vger.kernel.org
    Cc: linux-mm@kvack.org
    Cc: linux-s390@vger.kernel.org
    Cc: linuxppc-dev@lists.ozlabs.org
    Link: http://lkml.kernel.org/r/20160212210219.14D5D715@viggo.jf.intel.com
    Signed-off-by: Ingo Molnar

    Dave Hansen
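
    A self-contained sketch in the spirit of the arch_pte_access_permitted(pte_flags,
    write) hook named above. The bit positions (the pkey in PTE bits 62:59 on
    x86-64), the function names, and the explicit pkru argument are
    illustrative assumptions, not code from the patch:

    #include <stdbool.h>
    #include <stdint.h>

    #define PTE_PKEY_SHIFT 59   /* x86-64 keeps the pkey in PTE bits 62:59 */
    #define PTE_PKEY_MASK  0xfULL

    static int pte_flags_pkey(uint64_t pte_flags)
    {
        return (int)((pte_flags >> PTE_PKEY_SHIFT) & PTE_PKEY_MASK);
    }

    /* Stand-in for the arch hook: check the key found in the PTE flags
     * against a PKRU value (passed in to keep the sketch self-contained). */
    static bool pte_access_permitted_sketch(uint64_t pte_flags, bool write,
                                            uint32_t pkru)
    {
        int pkey = pte_flags_pkey(pte_flags);

        if (pkru & (1u << (2 * pkey)))                /* AD: access disable */
            return false;
        if (write && (pkru & (1u << (2 * pkey + 1)))) /* WD: write disable */
            return false;
        return true;
    }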
     
  • This code matches a fault condition up with the VMA and ensures
    that the VMA allows the fault to be handled instead of just
    erroring out.

    We will be extending this in a moment to comprehend protection
    keys.

    Signed-off-by: Dave Hansen
    Reviewed-by: Thomas Gleixner
    Cc: Andrew Morton
    Cc: Andy Lutomirski
    Cc: Aneesh Kumar K.V
    Cc: Borislav Petkov
    Cc: Brian Gerst
    Cc: Dan Williams
    Cc: Dave Hansen
    Cc: Denys Vlasenko
    Cc: Dominik Dingel
    Cc: Eric B Munson
    Cc: H. Peter Anvin
    Cc: Jason Low
    Cc: Kirill A. Shutemov
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Rik van Riel
    Cc: Sasha Levin
    Cc: linux-kernel@vger.kernel.org
    Cc: linux-mm@kvack.org
    Link: http://lkml.kernel.org/r/20160212210216.C3824032@viggo.jf.intel.com
    Signed-off-by: Ingo Molnar

    Dave Hansen
     

16 Feb, 2016

3 commits

  • We will soon modify the vanilla get_user_pages() so it can no
    longer be used on mm/tasks other than 'current/current->mm',
    which is by far the most common way it is called. For now,
    we allow the old-style calls, but warn when they are used.
    (implemented in previous patch)

    This patch switches all callers of:

    get_user_pages()
    get_user_pages_unlocked()
    get_user_pages_locked()

    to stop passing tsk/mm so they will no longer see the warnings.

    Signed-off-by: Dave Hansen
    Reviewed-by: Thomas Gleixner
    Cc: Andrea Arcangeli
    Cc: Andrew Morton
    Cc: Andy Lutomirski
    Cc: Borislav Petkov
    Cc: Brian Gerst
    Cc: Dave Hansen
    Cc: Denys Vlasenko
    Cc: H. Peter Anvin
    Cc: Kirill A. Shutemov
    Cc: Linus Torvalds
    Cc: Naoya Horiguchi
    Cc: Peter Zijlstra
    Cc: Rik van Riel
    Cc: Srikar Dronamraju
    Cc: Vlastimil Babka
    Cc: jack@suse.cz
    Cc: linux-mm@kvack.org
    Link: http://lkml.kernel.org/r/20160212210156.113E9407@viggo.jf.intel.com
    Signed-off-by: Ingo Molnar

    Dave Hansen
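
    For orientation, a hedged before/after fragment of the kind of conversion
    this performs; the argument list reflects the write/force-style
    get_user_pages() prototype of this era and is a sketch, not a verbatim
    hunk from the patch:

    /* before: tsk/mm passed explicitly, now __deprecated */
    ret = get_user_pages(current, current->mm, start, nr_pages,
                         write, force, pages, NULL);

    /* after: current and current->mm are implied */
    ret = get_user_pages(start, nr_pages, write, force, pages, NULL);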
     
  • The concept here was a suggestion from Ingo. The implementation
    horrors are all mine.

    This allows get_user_pages(), get_user_pages_unlocked(), and
    get_user_pages_locked() to be called with or without the
    leading tsk/mm arguments. We will give a compile-time warning
    about the old style being __deprecated and we will also
    WARN_ON() if the non-remote version is used for a remote-style
    access.

    Doing this, folks will get nice warnings and will not break the
    build. This should be nice for -next and will hopefully let
    developers fix up their own code instead of maintainers needing
    to do it at merge time.

    The way we do this is hideous. It uses the __VA_ARGS__ macro
    functionality to call different functions based on the number
    of arguments passed to the macro.

    There's an additional hack to ensure that our EXPORT_SYMBOL()
    of the deprecated symbols doesn't trigger a warning.

    We should be able to remove this mess as soon as -rc1 hits in
    the release after this is merged.

    Signed-off-by: Dave Hansen
    Cc: Al Viro
    Cc: Alexander Kuleshov
    Cc: Andrea Arcangeli
    Cc: Andrew Morton
    Cc: Dan Williams
    Cc: Dave Hansen
    Cc: Dominik Dingel
    Cc: Geliang Tang
    Cc: Jan Kara
    Cc: Johannes Weiner
    Cc: Kirill A. Shutemov
    Cc: Konstantin Khlebnikov
    Cc: Leon Romanovsky
    Cc: Linus Torvalds
    Cc: Masahiro Yamada
    Cc: Mateusz Guzik
    Cc: Maxime Coquelin
    Cc: Michal Hocko
    Cc: Naoya Horiguchi
    Cc: Oleg Nesterov
    Cc: Paul Gortmaker
    Cc: Peter Zijlstra
    Cc: Srikar Dronamraju
    Cc: Thomas Gleixner
    Cc: Vladimir Davydov
    Cc: Vlastimil Babka
    Cc: Xie XiuQi
    Cc: linux-kernel@vger.kernel.org
    Cc: linux-mm@kvack.org
    Link: http://lkml.kernel.org/r/20160212210155.73222EE1@viggo.jf.intel.com
    Signed-off-by: Ingo Molnar

    Dave Hansen
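
    The underlying trick, shown as a small self-contained user-space program
    rather than the kernel's actual macro (the function names and signatures
    are simplified stand-ins): pick a dispatcher based on how many arguments
    the macro receives.

    #include <stdio.h>

    static int gup_new(int start, int nr)
    {
        return printf("new-style(%d, %d)\n", start, nr);
    }

    static int gup_old(void *tsk, void *mm, int start, int nr)
    {
        (void)tsk; (void)mm;
        return printf("old-style(%d, %d)\n", start, nr);
    }

    /* The 5th argument wins: two user arguments select gup_new, four
     * select gup_old. */
    #define GUP_DISPATCH(_1, _2, _3, _4, name, ...) name
    #define gup(...) \
        GUP_DISPATCH(__VA_ARGS__, gup_old, gup_old, gup_new, gup_new)(__VA_ARGS__)

    int main(void)
    {
        gup(0, 1);             /* expands to gup_new(0, 1) */
        gup(NULL, NULL, 0, 1); /* expands to gup_old(NULL, NULL, 0, 1) */
        return 0;
    }

    The kernel version additionally marks the old-style entry points
    __deprecated, so unconverted callers keep building and merely warn.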
     
  • For protection keys, we need to understand whether protections
    should be enforced in software or not. In general, we enforce
    protections when working on our own task, but not when on others.
    We call these "current" and "remote" operations.

    This patch introduces a new get_user_pages() variant:

    get_user_pages_remote()

    Which is a replacement for when get_user_pages() is called on
    non-current tsk/mm.

    We also introduce a new gup flag: FOLL_REMOTE which can be used
    for the "__" gup variants to get this new behavior.

    The uprobes is_trap_at_addr() location holds mmap_sem and
    calls get_user_pages(current->mm) on an instruction address. This
    makes it a pretty unique gup caller. Being an instruction access
    and also really originating from the kernel (vs. the app), I opted
    to consider this a 'remote' access where protection keys will not
    be enforced.

    Without protection keys, this patch should not change any behavior.

    Signed-off-by: Dave Hansen
    Reviewed-by: Thomas Gleixner
    Cc: Andrea Arcangeli
    Cc: Andrew Morton
    Cc: Andy Lutomirski
    Cc: Borislav Petkov
    Cc: Brian Gerst
    Cc: Dave Hansen
    Cc: Denys Vlasenko
    Cc: H. Peter Anvin
    Cc: Kirill A. Shutemov
    Cc: Linus Torvalds
    Cc: Naoya Horiguchi
    Cc: Peter Zijlstra
    Cc: Rik van Riel
    Cc: Srikar Dronamraju
    Cc: Vlastimil Babka
    Cc: jack@suse.cz
    Cc: linux-mm@kvack.org
    Link: http://lkml.kernel.org/r/20160212210154.3F0E51EA@viggo.jf.intel.com
    Signed-off-by: Ingo Molnar

    Dave Hansen
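
    A hedged illustration of the resulting split; the argument lists follow
    this era's write/force style and are a sketch, not guaranteed verbatim:

    /* own mm: protection keys are enforced */
    ret = get_user_pages(start, 1, write, 0, pages, NULL);

    /* someone else's mm (ptrace-style access): the _remote variant applies
     * FOLL_REMOTE, so pkeys are not enforced */
    ret = get_user_pages_remote(tsk, mm, start, 1, write, 0, pages, NULL);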
     

04 Feb, 2016

1 commit

  • Trinity is now hitting the WARN_ON_ONCE we added in v3.15 commit
    cda540ace6a1 ("mm: get_user_pages(write,force) refuse to COW in shared
    areas"). The warning has served its purpose, nobody was harmed by that
    change, so just remove the warning to generate less noise from Trinity.

    Which reminds me of the comment I wrongly left behind with that commit
    (but was spotted at the time by Kirill), which has since moved into a
    separate function, and become even more obscure: delete it.

    Reported-by: Dave Jones
    Suggested-by: Kirill A. Shutemov
    Signed-off-by: Hugh Dickins
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     

16 Jan, 2016

9 commits

  • During Jason's work with postcopy migration support for s390 a problem
    regarding gmap faults was discovered.

    The gmap code will call fixup_user_fault, which will always end up in
    handle_mm_fault. Till now we never cared about retries, but as the
    userfaultfd code kind of relies on it, this needs some fix.

    This patchset does not take care of the futex code. I will now look
    closer at this.

    This patch (of 2):

    With the introduction of userfaultfd, kvm on s390 needs fixup_user_fault
    to pass in FAULT_FLAG_ALLOW_RETRY and give feedback if during the
    faulting we ever unlocked mmap_sem.

    This patch brings in the logic to handle retries as well as it cleans up
    the current documentation. fixup_user_fault did not have the same
    semantics as filemap_fault: it never indicated if a retry happened, and
    so a caller wasn't able to handle that case. So we now changed the
    behaviour to always retry a locked mmap_sem.

    Signed-off-by: Dominik Dingel
    Reviewed-by: Andrea Arcangeli
    Cc: "Kirill A. Shutemov"
    Cc: Martin Schwidefsky
    Cc: Christian Borntraeger
    Cc: "Jason J. Herne"
    Cc: David Rientjes
    Cc: Eric B Munson
    Cc: Naoya Horiguchi
    Cc: Mel Gorman
    Cc: Heiko Carstens
    Cc: Dominik Dingel
    Cc: Paolo Bonzini
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dominik Dingel
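
    A hedged, self-contained sketch of the retry shape this entry describes;
    the flag values and the callback are stand-ins, and the real code also
    deals with FAULT_FLAG_TRIED, killable waits and re-taking mmap_sem:

    #include <stdbool.h>

    /* Illustrative stand-in values; the real flags live in linux/mm.h. */
    #define FAULT_FLAG_ALLOW_RETRY 0x04u
    #define VM_FAULT_RETRY         0x0400u

    /* If the fault handler dropped mmap_sem and asked for a retry, report
     * that the lock was dropped, re-take it (elided here) and fault again
     * without ALLOW_RETRY so the loop terminates. handle_fault is a
     * stand-in for handle_mm_fault(). */
    static unsigned int fixup_user_fault_sketch(bool *unlocked,
                                                unsigned int (*handle_fault)(unsigned int))
    {
        unsigned int flags = FAULT_FLAG_ALLOW_RETRY;
        unsigned int ret;

    retry:
        ret = handle_fault(flags);
        if (ret & VM_FAULT_RETRY) {
            if (unlocked)
                *unlocked = true;
            /* ... down_read(&mm->mmap_sem) again here ... */
            flags &= ~FAULT_FLAG_ALLOW_RETRY;
            goto retry;
        }
        return ret;
    }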
     
  • A dax mapping establishes a pte with _PAGE_DEVMAP set when the driver
    has established a devm_memremap_pages() mapping, i.e. when the pfn_t
    return from ->direct_access() has PFN_DEV and PFN_MAP set. Later, when
    encountering _PAGE_DEVMAP during a page table walk we lookup and pin a
    struct dev_pagemap instance to keep the result of pfn_to_page() valid
    until put_page().

    Signed-off-by: Dan Williams
    Tested-by: Logan Gunthorpe
    Cc: Dave Hansen
    Cc: Mel Gorman
    Cc: Peter Zijlstra
    Cc: Andrea Arcangeli
    Cc: Thomas Gleixner
    Cc: Ingo Molnar
    Cc: "H. Peter Anvin"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dan Williams
     
  • Before THP refcounting rework, THP was not allowed to cross VMA
    boundary. So, if we have THP and we split it, PG_mlocked can be safely
    transferred to small pages.

    With the new THP refcounting and a naive approach to mlocking we can
    end up with this scenario:
    1. we have a mlocked THP, which belongs to one VM_LOCKED VMA;
    2. the process does munlock() on a *part* of the THP:
       - the VMA is split into two, one of them VM_LOCKED;
       - the huge PMD is split into a PTE table;
       - the THP is still mlocked;
    3. split_huge_page():
       - it transfers PG_mlocked to *all* small pages regardless of whether
         they belong to any VM_LOCKED VMA.

    We probably could munlock() all small pages on split_huge_page(), but I
    think we have an accounting issue already at step two.

    Instead of forbidding mlocked pages altogether, we just avoid mlocking
    PTE-mapped THPs and munlock THPs on split_huge_pmd().

    This means PTE-mapped THPs will be on normal lru lists and will be split
    under memory pressure by vmscan. After the split vmscan will detect
    unevictable small pages and mlock them.

    With this approach we shouldn't hit situation like described above.

    Signed-off-by: Kirill A. Shutemov
    Cc: Sasha Levin
    Cc: Aneesh Kumar K.V
    Cc: Jerome Marchand
    Cc: Vlastimil Babka
    Cc: Andrea Arcangeli
    Cc: Hugh Dickins
    Cc: Dave Hansen
    Cc: Mel Gorman
    Cc: Rik van Riel
    Cc: Naoya Horiguchi
    Cc: Steve Capper
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Christoph Lameter
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
    With the new refcounting we don't need to mark PMDs as splitting.
    Let's drop the code to handle this.

    Signed-off-by: Kirill A. Shutemov
    Tested-by: Sasha Levin
    Tested-by: Aneesh Kumar K.V
    Acked-by: Vlastimil Babka
    Acked-by: Jerome Marchand
    Cc: Andrea Arcangeli
    Cc: Hugh Dickins
    Cc: Dave Hansen
    Cc: Mel Gorman
    Cc: Rik van Riel
    Cc: Naoya Horiguchi
    Cc: Steve Capper
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Christoph Lameter
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • Tail page refcounting is utterly complicated and painful to support.

    It uses ->_mapcount on tail pages to store how many times this page is
    pinned. get_page() bumps ->_mapcount on tail page in addition to
    ->_count on head. This information is required by split_huge_page() to
    be able to distribute pins from head of compound page to tails during
    the split.

    We will need ->_mapcount to account PTE mappings of subpages of the
    compound page. We eliminate the need for the current meaning of
    ->_mapcount in tail pages by forbidding the split entirely if the page
    is pinned.

    The only user of tail page refcounting is THP which is marked BROKEN for
    now.

    Let's drop all this mess. It makes get_page() and put_page() much
    simpler.

    Signed-off-by: Kirill A. Shutemov
    Tested-by: Sasha Levin
    Tested-by: Aneesh Kumar K.V
    Acked-by: Vlastimil Babka
    Acked-by: Jerome Marchand
    Cc: Andrea Arcangeli
    Cc: Hugh Dickins
    Cc: Dave Hansen
    Cc: Mel Gorman
    Cc: Rik van Riel
    Cc: Naoya Horiguchi
    Cc: Steve Capper
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Christoph Lameter
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • We are going to decouple splitting THP PMD from splitting underlying
    compound page.

    This patch renames the split_huge_page_pmd*() functions to
    split_huge_pmd*() to reflect the fact that they only split the PMD,
    not the underlying page.

    Signed-off-by: Kirill A. Shutemov
    Tested-by: Sasha Levin
    Tested-by: Aneesh Kumar K.V
    Acked-by: Jerome Marchand
    Acked-by: Vlastimil Babka
    Cc: Andrea Arcangeli
    Cc: Hugh Dickins
    Cc: Dave Hansen
    Cc: Mel Gorman
    Cc: Rik van Riel
    Cc: Naoya Horiguchi
    Cc: Steve Capper
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Christoph Lameter
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
    With the new refcounting a THP can belong to several VMAs. This makes
    it tricky to track THP pages when they are partially mlocked. It can
    lead to leaking mlocked pages to non-VM_LOCKED vmas and other problems.

    With this patch we will split all pages on mlock and avoid
    faulting in or collapsing new THPs in VM_LOCKED vmas.

    I've tried an alternative approach: do not mark THP pages mlocked and
    keep them on normal LRUs. This way vmscan could try to split huge pages
    under memory pressure and free up subpages which don't belong to
    VM_LOCKED vmas. But this is a user-visible change: it screws up the
    Mlocked accounting reported in meminfo, so I had to leave this approach
    aside.

    We can bring something better later, but this should be good enough for
    now.

    Signed-off-by: Kirill A. Shutemov
    Tested-by: Sasha Levin
    Tested-by: Aneesh Kumar K.V
    Acked-by: Jerome Marchand
    Cc: Vlastimil Babka
    Cc: Andrea Arcangeli
    Cc: Hugh Dickins
    Cc: Dave Hansen
    Cc: Mel Gorman
    Cc: Rik van Riel
    Cc: Naoya Horiguchi
    Cc: Steve Capper
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Christoph Lameter
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
    With the new refcounting we are going to see THP tail pages mapped with
    PTEs. Generic fast GUP relies on page_cache_get_speculative() to obtain
    a reference on the page. page_cache_get_speculative() always fails on
    tail pages, because ->_count on tail pages is always zero.

    Let's handle tail pages in gup_pte_range().

    The new split_huge_page() will rely on migration entries to freeze the
    page's counts. Rechecking the PTE value after page_cache_get_speculative()
    on the head page should be enough to serialize against the split.

    Signed-off-by: Kirill A. Shutemov
    Tested-by: Sasha Levin
    Tested-by: Aneesh Kumar K.V
    Acked-by: Jerome Marchand
    Acked-by: Vlastimil Babka
    Cc: Andrea Arcangeli
    Cc: Hugh Dickins
    Cc: Dave Hansen
    Cc: Mel Gorman
    Cc: Rik van Riel
    Cc: Naoya Horiguchi
    Cc: Steve Capper
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Christoph Lameter
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
    We need to prepare the kernel to allow transhuge pages to be mapped
    with PTEs too. We need to handle FOLL_SPLIT in follow_page_pte().

    Also, we use split_huge_page() directly instead of split_huge_page_pmd(),
    since split_huge_page_pmd() will be gone.

    Signed-off-by: Kirill A. Shutemov
    Tested-by: Sasha Levin
    Tested-by: Aneesh Kumar K.V
    Acked-by: Vlastimil Babka
    Acked-by: Jerome Marchand
    Cc: Andrea Arcangeli
    Cc: Hugh Dickins
    Cc: Dave Hansen
    Cc: Mel Gorman
    Cc: Rik van Riel
    Cc: Naoya Horiguchi
    Cc: Steve Capper
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Christoph Lameter
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     

06 Nov, 2015

1 commit

  • The cost of faulting in all memory to be locked can be very high when
    working with large mappings. If only portions of the mapping will be used
    this can incur a high penalty for locking.

    For the example of a large file, this is the usage pattern for a large
    statistical language model (it probably applies to other statistical or
    graphical models as well). For the security example, consider any
    application transacting in data that cannot be swapped out (credit card
    data, medical records, etc.).

    This patch introduces the ability to request that pages are not
    pre-faulted, but are placed on the unevictable LRU when they are finally
    faulted in. The VM_LOCKONFAULT flag will be used together with VM_LOCKED
    and has no effect when set without VM_LOCKED. Setting the VM_LOCKONFAULT
    flag for a VMA will cause pages faulted into that VMA to be added to the
    unevictable LRU when they are faulted or if they are already present, but
    will not cause any missing pages to be faulted in.

    Exposing this new lock state means that we cannot overload the meaning of
    the FOLL_POPULATE flag any longer. Prior to this patch it was used to
    mean that the VMA for a fault was locked. This means we need the new
    FOLL_MLOCK flag to communicate the locked state of a VMA. FOLL_POPULATE
    will now only control if the VMA should be populated and in the case of
    VM_LOCKONFAULT, it will not be set.

    Signed-off-by: Eric B Munson
    Acked-by: Kirill A. Shutemov
    Acked-by: Vlastimil Babka
    Cc: Michal Hocko
    Cc: Jonathan Corbet
    Cc: Catalin Marinas
    Cc: Geert Uytterhoeven
    Cc: Guenter Roeck
    Cc: Heiko Carstens
    Cc: Michael Kerrisk
    Cc: Ralf Baechle
    Cc: Shuah Khan
    Cc: Stephen Rothwell
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Eric B Munson
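
    From the user side this pairs with the mlock2() syscall added by the
    same series. A hedged example (it assumes headers that define
    SYS_mlock2, and the MLOCK_ONFAULT fallback value is an assumption):

    #define _GNU_SOURCE
    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    #ifndef MLOCK_ONFAULT
    #define MLOCK_ONFAULT 0x01   /* assumed value if the libc headers lack it */
    #endif

    int main(void)
    {
        size_t len = 1UL << 30;  /* large mapping, mostly never touched */
        char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

        if (p == MAP_FAILED) {
            perror("mmap");
            return 1;
        }

        /* Nothing is pre-faulted here; pages join the unevictable LRU as
         * they are faulted in (VM_LOCKONFAULT | VM_LOCKED on the VMA). */
        if (syscall(SYS_mlock2, p, len, MLOCK_ONFAULT) != 0)
            perror("mlock2");

        memset(p, 0, 4096);      /* faults and locks just this one page */
        return 0;
    }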
     

05 Sep, 2015

1 commit

    With DAX, pfn mappings are becoming more common. The patch adjusts the
    GUP code to cover pfn mappings for cases when we don't need a struct
    page to proceed.

    To make it possible, let's change the follow_page() code to return the
    -EEXIST error code if a proper page table entry exists but there is no
    corresponding struct page. __get_user_page() would ignore the error code
    and move on to the next page frame.

    The immediate effect of the change is working MAP_POPULATE and mlock() on
    DAX mappings.

    [akpm@linux-foundation.org: fix arm64 build]
    Signed-off-by: Kirill A. Shutemov
    Reviewed-by: Toshi Kani
    Acked-by: Matthew Wilcox
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
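
    A hedged sketch of the loop behaviour the entry describes; every name
    here is a made-up stand-in, not mm/gup.c code:

    #include <errno.h>

    #define SK_PAGE_SIZE 4096UL

    /* Stand-in for the page-table lookup: returns a page cookie, or NULL
     * with *err set (-EEXIST meaning "valid entry, but no struct page"). */
    typedef void *(*lookup_fn)(unsigned long addr, int *err);

    /* An -EEXIST frame is simply stepped over instead of aborting the walk. */
    static long walk_range_sketch(unsigned long start, unsigned long nr_pages,
                                  lookup_fn lookup)
    {
        long done = 0;

        while (nr_pages--) {
            int err = 0;
            void *page = lookup(start, &err);

            if (!page && err != -EEXIST)
                return done ? done : err;  /* report progress or the error */

            start += SK_PAGE_SIZE;         /* pfn-mapped or normal: move on */
            done++;
        }
        return done;
    }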
     

16 Apr, 2015

1 commit

  • Commit 38c5ce936a08 ("mm/gup: Replace ACCESS_ONCE with READ_ONCE")
    converted ACCESS_ONCE usage in gup_pmd_range() to READ_ONCE, since
    ACCESS_ONCE doesn't work reliably on non-scalar types.

    This patch also fixes the other ACCESS_ONCE usages in gup_pte_range()
    and __get_user_pages_fast() in mm/gup.c.

    Signed-off-by: Jason Low
    Acked-by: Michal Hocko
    Acked-by: Davidlohr Bueso
    Acked-by: Rik van Riel
    Reviewed-by: Christian Borntraeger
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jason Low
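
    The pattern being applied, as a self-contained illustration; READ_ONCE is
    re-defined locally only so the snippet stands alone, and the "present"
    bit is made up:

    #include <stdint.h>

    /* Local stand-in for the kernel's READ_ONCE() from <linux/compiler.h>. */
    #define READ_ONCE(x) (*(const volatile __typeof__(x) *)&(x))

    #define PTE_PRESENT 0x1ULL   /* illustrative bit, not the real layout */

    /* Snapshot the entry exactly once and test only the local copy, so the
     * compiler cannot re-load *ptep between checks while another CPU may be
     * changing the entry -- the property the gup fast path needs. */
    static int pte_usable_sketch(const uint64_t *ptep)
    {
        uint64_t pte = READ_ONCE(*ptep);

        if (!(pte & PTE_PRESENT))
            return 0;
        /* ...any further tests also use 'pte', never '*ptep' again... */
        return 1;
    }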
     

15 Apr, 2015

2 commits

    It's odd that we have populate_vma_page_range() and __mm_populate() in
    mm/mlock.c. They are an implementation of generic memory population,
    and mlocking is one possible side effect if VM_LOCKED is set.

    __get_user_pages() is core of the implementation. Let's move the code
    into mm/gup.c.

    Signed-off-by: Kirill A. Shutemov
    Acked-by: Linus Torvalds
    Acked-by: David Rientjes
    Cc: Michel Lespinasse
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • After commit a1fde08c74e9 ("VM: skip the stack guard page lookup in
    get_user_pages only for mlock") FOLL_MLOCK has lost its original
    meaning: we don't necessarily mlock the page if the flag is set -- we
    also take VM_LOCKED into consideration.

    Since we use the same codepath for __mm_populate(), let's rename
    FOLL_MLOCK to FOLL_POPULATE.

    Signed-off-by: Kirill A. Shutemov
    Acked-by: Linus Torvalds
    Acked-by: David Rientjes
    Cc: Michel Lespinasse
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     

15 Feb, 2015

1 commit

  • Pull ACCESS_ONCE() rule tightening from Christian Borntraeger:
    "Tighten rules for ACCESS_ONCE

    This series tightens the rules for ACCESS_ONCE to only work on scalar
    types. It also contains the necessary fixups as indicated by build
    bots of linux-next. Now everything is in place to prevent new
    non-scalar users of ACCESS_ONCE and we can continue to convert code to
    READ_ONCE/WRITE_ONCE"

    * tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/borntraeger/linux:
    kernel: Fix sparse warning for ACCESS_ONCE
    next: sh: Fix compile error
    kernel: tighten rules for ACCESS ONCE
    mm/gup: Replace ACCESS_ONCE with READ_ONCE
    x86/spinlock: Leftover conversion ACCESS_ONCE->READ_ONCE
    x86/xen/p2m: Replace ACCESS_ONCE with READ_ONCE
    ppc/hugetlbfs: Replace ACCESS_ONCE with READ_ONCE
    ppc/kvm: Replace ACCESS_ONCE with READ_ONCE

    Linus Torvalds
     

13 Feb, 2015

1 commit

  • Convert existing users of pte_numa and friends to the new helper. Note
    that the kernel is broken after this patch is applied until the other page
    table modifiers are also altered. This patch layout is to make review
    easier.

    Signed-off-by: Mel Gorman
    Acked-by: Linus Torvalds
    Acked-by: Aneesh Kumar
    Acked-by: Benjamin Herrenschmidt
    Tested-by: Sasha Levin
    Cc: Dave Jones
    Cc: Hugh Dickins
    Cc: Ingo Molnar
    Cc: Kirill Shutemov
    Cc: Paul Mackerras
    Cc: Rik van Riel
    Cc: Sasha Levin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     

12 Feb, 2015

4 commits

  • This allows the get_user_pages_fast slow path to release the mmap_sem
    before blocking.

    Signed-off-by: Andrea Arcangeli
    Reviewed-by: Kirill A. Shutemov
    Cc: Andres Lagar-Cavilla
    Cc: Peter Feiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     
    Some callers (like KVM) may want to set gup_flags like FOLL_HWPOISON
    to get a proper -EHWPOISON retval instead of -EFAULT, to take a more
    appropriate action if get_user_pages runs into a memory failure.

    Signed-off-by: Andrea Arcangeli
    Reviewed-by: Kirill A. Shutemov
    Cc: Andres Lagar-Cavilla
    Cc: Peter Feiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     
    FAULT_FLAG_ALLOW_RETRY allows the page fault to drop the mmap_sem for
    reading to reduce the mmap_sem contention (for writing), like while
    waiting for I/O completion. The problem is that right now practically no
    get_user_pages call uses FAULT_FLAG_ALLOW_RETRY, so we're not leveraging
    that nifty feature.

    Andres fixed it for the KVM page fault. However get_user_pages_fast
    remains uncovered, and 99% of other get_user_pages aren't using it either
    (the only exception being FOLL_NOWAIT in KVM which is really nonblocking
    and in fact it doesn't even release the mmap_sem).

    So this patchset extends the optimization Andres did in the KVM page
    fault to the whole kernel. It makes the most important places (including
    gup_fast) use FAULT_FLAG_ALLOW_RETRY to reduce the mmap_sem hold times
    during I/O.

    The only few places that remain uncovered are drivers like v4l and other
    exceptions that tend to work on their own memory rather than on random
    user memory (unlike, for example, O_DIRECT, which uses gup_fast and is
    fully covered by this patch).

    A follow-up patch should probably also add a printk_once warning to
    get_user_pages, which should go obsolete and be phased out eventually. The
    "vmas" parameter of get_user_pages makes it fundamentally incompatible
    with FAULT_FLAG_ALLOW_RETRY (the vmas array becomes meaningless the moment
    the mmap_sem is released).

    While this is just an optimization, this becomes an absolute requirement
    for the userfaultfd feature http://lwn.net/Articles/615086/ .

    Userfaultfd allows blocking the page fault, and in order to do so I
    need to drop the mmap_sem first. So this patch also ensures that, for all
    memory where userfaultfd could be registered by KVM, the very first fault
    (no matter if it is a regular page fault or a get_user_pages) always has
    FAULT_FLAG_ALLOW_RETRY set. Then the userfaultfd blocks and it is woken
    only when the pagetable is already mapped. The second fault attempt after
    the wakeup doesn't need FAULT_FLAG_ALLOW_RETRY, so it's ok to retry
    without it.

    This patch (of 5):

    We can leverage the VM_FAULT_RETRY functionality in the page fault paths
    better by using either get_user_pages_locked or get_user_pages_unlocked.

    The former allows conversion of get_user_pages invocations that will have
    to pass a "&locked" parameter to know if the mmap_sem was dropped during
    the call. Example from:

    down_read(&mm->mmap_sem);
    do_something();
    get_user_pages(tsk, mm, ..., pages, NULL);
    up_read(&mm->mmap_sem);

    to:

    int locked = 1;
    down_read(&mm->mmap_sem);
    do_something();
    get_user_pages_locked(tsk, mm, ..., pages, &locked);
    if (locked)
        up_read(&mm->mmap_sem);

    The latter is suitable only as a drop-in replacement of the form:

    down_read(&mm->mmap_sem);
    get_user_pages(tsk, mm, ..., pages, NULL);
    up_read(&mm->mmap_sem);

    into:

    get_user_pages_unlocked(tsk, mm, ..., pages);

    Where tsk, mm, the intermediate "..." parameters and "pages" can be any
    value as before. Just the last parameter of get_user_pages (vmas) must be
    NULL for get_user_pages_locked|unlocked to be usable (the latter original
    form wouldn't have been safe anyway if vmas wasn't NULL; for the former we
    just make it explicit by dropping the parameter).

    If vmas is not NULL these two methods cannot be used.

    Signed-off-by: Andrea Arcangeli
    Reviewed-by: Andres Lagar-Cavilla
    Reviewed-by: Peter Feiner
    Reviewed-by: Kirill A. Shutemov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     
  • We have a race condition between move_pages() and freeing hugepages, where
    move_pages() calls follow_page(FOLL_GET) for hugepages internally and
    tries to get its refcount without preventing concurrent freeing. This
    race crashes the kernel, so this patch fixes it by moving FOLL_GET code
    for hugepages into follow_huge_pmd(), taking the page table lock.

    This patch intentionally removes the page==NULL check after pte_page().
    This is justified because pte_page() never returns NULL for any
    architecture or configuration.

    This patch changes the behavior of follow_huge_pmd() for tail pages so
    that tail pages can be pinned/returned. So the caller must be changed to
    properly handle the returned tail pages.

    We could have a choice to add the similar locking to
    follow_huge_(addr|pud) for consistency, but it's not necessary because
    currently these functions don't support FOLL_GET flag, so let's leave it
    for future development.

    Here is the reproducer:

    $ cat movepages.c
    #include
    #include
    #include

    #define ADDR_INPUT 0x700000000000UL
    #define HPS 0x200000
    #define PS 0x1000

    int main(int argc, char *argv[]) {
        int i;
        int nr_hp = strtol(argv[1], NULL, 0);
        int nr_p = nr_hp * HPS / PS;
        int ret;
        void **addrs;
        int *status;
        int *nodes;
        pid_t pid;

        pid = strtol(argv[2], NULL, 0);
        addrs = malloc(sizeof(char *) * nr_p + 1);
        status = malloc(sizeof(char *) * nr_p + 1);
        nodes = malloc(sizeof(char *) * nr_p + 1);

        while (1) {
            for (i = 0; i < nr_p; i++) {
                addrs[i] = (void *)ADDR_INPUT + i * PS;
                nodes[i] = 1;
                status[i] = 0;
            }
            ret = numa_move_pages(pid, nr_p, addrs, nodes, status,
                                  MPOL_MF_MOVE_ALL);
            if (ret == -1)
                err("move_pages");

            for (i = 0; i < nr_p; i++) {
                addrs[i] = (void *)ADDR_INPUT + i * PS;
                nodes[i] = 0;
                status[i] = 0;
            }
            ret = numa_move_pages(pid, nr_p, addrs, nodes, status,
                                  MPOL_MF_MOVE_ALL);
            if (ret == -1)
                err("move_pages");
        }
        return 0;
    }

    $ cat hugepage.c
    #include
    #include
    #include

    #define ADDR_INPUT 0x700000000000UL
    #define HPS 0x200000

    int main(int argc, char *argv[]) {
        int nr_hp = strtol(argv[1], NULL, 0);
        char *p;

        while (1) {
            p = mmap((void *)ADDR_INPUT, nr_hp * HPS, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
            if (p != (void *)ADDR_INPUT) {
                perror("mmap");
                break;
            }
            memset(p, 0, nr_hp * HPS);
            munmap(p, nr_hp * HPS);
        }
    }

    $ sysctl vm.nr_hugepages=40
    $ ./hugepage 10 &
    $ ./movepages 10 $(pgrep -f hugepage)

    Fixes: e632a938d914 ("mm: migrate: add hugepage migration code to move_pages()")
    Signed-off-by: Naoya Horiguchi
    Reported-by: Hugh Dickins
    Cc: James Hogan
    Cc: David Rientjes
    Cc: Mel Gorman
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Rik van Riel
    Cc: Andrea Arcangeli
    Cc: Luiz Capitulino
    Cc: Nishanth Aravamudan
    Cc: Lee Schermerhorn
    Cc: Steve Capper
    Cc: [3.12+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi
     

30 Jan, 2015

1 commit

  • The core VM already knows about VM_FAULT_SIGBUS, but cannot return a
    "you should SIGSEGV" error, because the SIGSEGV case was generally
    handled by the caller - usually the architecture fault handler.

    That results in lots of duplication - all the architecture fault
    handlers end up doing very similar "look up vma, check permissions, do
    retries etc" - but it generally works. However, there are cases where
    the VM actually wants to SIGSEGV, and applications _expect_ SIGSEGV.

    In particular, when accessing the stack guard page, libsigsegv expects a
    SIGSEGV. And it usually got one, because the stack growth is handled by
    that duplicated architecture fault handler.

    However, when the generic VM layer started propagating the error return
    from the stack expansion in commit fee7e49d4514 ("mm: propagate error
    from stack expansion even for guard page"), that now exposed the
    existing VM_FAULT_SIGBUS result to user space. And user space really
    expected SIGSEGV, not SIGBUS.

    To fix that case, we need to add a VM_FAULT_SIGSEGV, and teach all those
    duplicate architecture fault handlers about it. They all already have
    the code to handle SIGSEGV, so it's about just tying that new return
    value to the existing code, but it's all a bit annoying.

    This is the mindless minimal patch to do this. A more extensive patch
    would be to try to gather up the mostly shared fault handling logic into
    one generic helper routine, and long-term we really should do that
    cleanup.

    Just from this patch, you can generally see that most architectures just
    copied (directly or indirectly) the old x86 way of doing things, but in
    the meantime that original x86 model has been improved to hold the VM
    semaphore for shorter times etc and to handle VM_FAULT_RETRY and other
    "newer" things, so it would be a good idea to bring all those
    improvements to the generic case and teach other architectures about
    them too.

    Reported-and-tested-by: Takashi Iwai
    Tested-by: Jan Engelhardt
    Acked-by: Heiko Carstens # "s390 still compiles and boots"
    Cc: linux-arch@vger.kernel.org
    Cc: stable@vger.kernel.org
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

19 Jan, 2015

1 commit

  • ACCESS_ONCE does not work reliably on non-scalar types. For
    example gcc 4.6 and 4.7 might remove the volatile tag for such
    accesses during the SRA (scalar replacement of aggregates) step
    (https://gcc.gnu.org/bugzilla/show_bug.cgi?id=58145)

    Fixup gup_pmd_range.

    Signed-off-by: Christian Borntraeger

    Christian Borntraeger
     

21 Dec, 2014

1 commit

  • Pull ACCESS_ONCE cleanup preparation from Christian Borntraeger:
    "kernel: Provide READ_ONCE and ASSIGN_ONCE

    As discussed on LKML http://marc.info/?i=54611D86.4040306%40de.ibm.com
    ACCESS_ONCE might fail with specific compilers for non-scalar
    accesses.

    Here is a set of patches to tackle that problem.

    The first patch introduce READ_ONCE and ASSIGN_ONCE. If the data
    structure is larger than the machine word size memcpy is used and a
    warning is emitted. The next patches fix up several in-tree users of
    ACCESS_ONCE on non-scalar types.

    This does not yet contain a patch that forces ACCESS_ONCE to work only
    on scalar types. This is targeted for the next merge window as Linux
    next already contains new offenders regarding ACCESS_ONCE vs.
    non-scalar types"

    * tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/borntraeger/linux:
    s390/kvm: REPLACE barrier fixup with READ_ONCE
    arm/spinlock: Replace ACCESS_ONCE with READ_ONCE
    arm64/spinlock: Replace ACCESS_ONCE READ_ONCE
    mips/gup: Replace ACCESS_ONCE with READ_ONCE
    x86/gup: Replace ACCESS_ONCE with READ_ONCE
    x86/spinlock: Replace ACCESS_ONCE with READ_ONCE
    mm: replace ACCESS_ONCE with READ_ONCE or barriers
    kernel: Provide READ_ONCE and ASSIGN_ONCE

    Linus Torvalds
     

18 Dec, 2014

1 commit

  • ACCESS_ONCE does not work reliably on non-scalar types. For
    example gcc 4.6 and 4.7 might remove the volatile tag for such
    accesses during the SRA (scalar replacement of aggregates) step
    (https://gcc.gnu.org/bugzilla/show_bug.cgi?id=58145)

    Let's change the code to access the page table elements with
    READ_ONCE, which does implicit scalar accesses, for the gup code.

    mm_find_pmd is tricky, because m68k and sparc(32bit) define pmd_t
    as array of longs. This code requires just that the pmd_present
    and pmd_trans_huge checks are done on the same value, so a barrier
    is sufficient.

    A similar case is in handle_pte_fault. On ppc44x the word size is
    32 bit, but a pte is 64 bit. A barrier is ok as well.

    Signed-off-by: Christian Borntraeger
    Cc: linux-mm@kvack.org
    Acked-by: Paul E. McKenney

    Christian Borntraeger
     

10 Oct, 2014

1 commit

  • This series implements general forms of get_user_pages_fast and
    __get_user_pages_fast in core code and activates them for arm and arm64.

    These are required for Transparent HugePages to function correctly, as a
    futex on a THP tail will otherwise result in an infinite loop (due to the
    core implementation of __get_user_pages_fast always returning 0).

    Unfortunately, a futex on THP tail can be quite common for certain
    workloads; thus THP is unreliable without a __get_user_pages_fast
    implementation.

    This series may also be beneficial for direct-IO heavy workloads and
    certain KVM workloads.

    This patch (of 6):

    get_user_pages_fast() attempts to pin user pages by walking the page
    tables directly and avoids taking locks. Thus the walker needs to be
    protected from page table pages being freed from under it, and needs to
    block any THP splits.

    One way to achieve this is to have the walker disable interrupts, and rely
    on IPIs from the TLB flushing code blocking before the page table pages
    are freed.

    On some platforms we have hardware broadcast of TLB invalidations, thus
    the TLB flushing code doesn't necessarily need to broadcast IPIs; and
    spuriously broadcasting IPIs can hurt system performance if done too
    often.

    This problem has been solved on PowerPC and Sparc by batching up page
    table pages belonging to more than one mm_user, then scheduling an
    rcu_sched callback to free the pages. This RCU page table free logic has
    been promoted to core code and is activated when one enables
    HAVE_RCU_TABLE_FREE. Unfortunately, these architectures implement their
    own get_user_pages_fast routines.

    The RCU page table free logic coupled with an IPI broadcast on THP split
    (which is a rare event), allows one to protect a page table walker by
    merely disabling the interrupts during the walk.

    This patch provides a general RCU implementation of get_user_pages_fast
    that can be used by architectures that perform hardware broadcast of TLB
    invalidations.

    It is based heavily on the PowerPC implementation by Nick Piggin.

    [akpm@linux-foundation.org: various comment fixes]
    Signed-off-by: Steve Capper
    Tested-by: Dann Frazier
    Reviewed-by: Catalin Marinas
    Acked-by: Hugh Dickins
    Cc: Russell King
    Cc: Mark Rutland
    Cc: Mel Gorman
    Cc: Will Deacon
    Cc: Christoffer Dall
    Cc: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Steve Capper
     

24 Sep, 2014

1 commit

  • When KVM handles a tdp fault it uses FOLL_NOWAIT. If the guest memory
    has been swapped out or is behind a filemap, this will trigger async
    readahead and return immediately. The rationale is that KVM will kick
    back the guest with an "async page fault" and allow for some other
    guest process to take over.

    If async PFs are enabled the fault is retried asap from an async
    workqueue. If not, it's retried immediately in the same code path. In
    either case the retry will not relinquish the mmap semaphore and will
    block on the IO. This is a bad thing, as other mmap semaphore users
    now stall as a function of swap or filemap latency.

    This patch ensures both the regular and async PF path re-enter the
    fault allowing for the mmap semaphore to be relinquished in the case
    of IO wait.

    Reviewed-by: Radim Krčmář
    Signed-off-by: Andres Lagar-Cavilla
    Acked-by: Andrew Morton
    Signed-off-by: Paolo Bonzini

    Andres Lagar-Cavilla
     

07 Aug, 2014

1 commit

  • Add a comment describing the circumstances in which
    __lock_page_or_retry() will or will not release the mmap_sem when
    returning 0.

    Add comments to lock_page_or_retry()'s callers (filemap_fault(),
    do_swap_page()) noting the impact on VM_FAULT_RETRY returns.

    Add comments on up the call tree, particularly replacing the false "We
    return with mmap_sem still held" comments.

    Signed-off-by: Paul Cassella
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Paul Cassella
     

05 Jun, 2014

3 commits