05 Apr, 2016

2 commits

  • Mostly direct substitution, with occasional adjustments or removal of
    outdated comments.

    Signed-off-by: Kirill A. Shutemov
    Acked-by: Michal Hocko
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • The PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} macros were introduced a *long*
    time ago with the promise that one day it would be possible to implement
    the page cache with bigger chunks than PAGE_SIZE.

    This promise never materialized, and it is unlikely it ever will.

    We have many places where PAGE_CACHE_SIZE is assumed to be equal to
    PAGE_SIZE, and it's a constant source of confusion whether the
    PAGE_CACHE_* or PAGE_* constants should be used in a particular case,
    especially on the border between fs and mm.

    Global switching to PAGE_CACHE_SIZE != PAGE_SIZE would cause too much
    breakage to be doable.

    Let's stop pretending that pages in page cache are special. They are
    not.

    The changes are pretty straightforward (an illustrative before/after
    fragment follows the list):

    - E << (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> E;

    - E >> (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> E;

    - PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} -> PAGE_{SIZE,SHIFT,MASK,ALIGN};

    - page_cache_get() -> get_page();

    - page_cache_release() -> put_page();
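
    As a rough before/after illustration of the rules above (the fragment is
    invented for this changelog; pos, page, buf and len are placeholders):

    /* Before: */
    pgoff_t index  = pos >> PAGE_CACHE_SHIFT;   /* page cache index of 'pos' */
    size_t  offset = pos & ~PAGE_CACHE_MASK;    /* offset within the page    */
    page_cache_get(page);                       /* pin the page while in use */
    memcpy(buf, (char *)kmap(page) + offset, len);
    kunmap(page);
    page_cache_release(page);

    /* After -- behaviour is identical since PAGE_CACHE_SHIFT == PAGE_SHIFT: */
    pgoff_t index  = pos >> PAGE_SHIFT;
    size_t  offset = pos & ~PAGE_MASK;
    get_page(page);
    memcpy(buf, (char *)kmap(page) + offset, len);
    kunmap(page);
    put_page(page);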

    This patch contains automated changes generated with coccinelle using the
    script below. For some reason, coccinelle doesn't patch header files, so
    I've called spatch on them manually.

    The only adjustment after coccinelle is a revert of the changes to the
    PAGE_CACHE_ALIGN definition: we are going to drop it later.

    There are a few places in the code that coccinelle didn't reach. I'll
    fix them manually in a separate patch. Comments and documentation will
    also be addressed in a separate patch.

    virtual patch

    @@
    expression E;
    @@
    - E << (PAGE_CACHE_SHIFT - PAGE_SHIFT)
    + E

    @@
    expression E;
    @@
    - E >> (PAGE_CACHE_SHIFT - PAGE_SHIFT)
    + E

    @@
    @@
    - PAGE_CACHE_SHIFT
    + PAGE_SHIFT

    @@
    @@
    - PAGE_CACHE_SIZE
    + PAGE_SIZE

    @@
    @@
    - PAGE_CACHE_MASK
    + PAGE_MASK

    @@
    expression E;
    @@
    - PAGE_CACHE_ALIGN(E)
    + PAGE_ALIGN(E)

    @@
    expression E;
    @@
    - page_cache_get(E)
    + get_page(E)

    @@
    expression E;
    @@
    - page_cache_release(E)
    + put_page(E)

    Signed-off-by: Kirill A. Shutemov
    Acked-by: Michal Hocko
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     

26 Mar, 2016

1 commit

  • This patch (of 5):

    This is based on the idea from Mel Gorman discussed during LSFMM 2015
    and independently brought up by Oleg Nesterov.

    The OOM killer currently allows killing only a single task, in the hope
    that the task will terminate in a reasonable time and free up its memory.
    Such a task (the OOM victim) gets access to memory reserves via
    mark_oom_victim to allow forward progress should there be a need for
    additional memory during the exit path.

    It has been shown (e.g. by Tetsuo Handa) that it is not that hard to
    construct workloads which break this core assumption, and the OOM victim
    might take an unbounded amount of time to exit because it might be
    blocked in the uninterruptible state waiting for an event (e.g. a lock)
    which is blocked by another task looping in the page allocator.

    This patch reduces the probability of such a lockup by introducing a
    specialized kernel thread (oom_reaper) which tries to reclaim additional
    memory by preemptively reaping the anonymous or swapped-out memory owned
    by the OOM victim, under the assumption that such memory won't be needed
    once its owner is killed and kicked out of userspace anyway. There is
    one notable exception, though: if the OOM victim was in the process of
    coredumping, the result would be incomplete. This is considered a
    reasonable constraint because overall system health is more important
    than the debuggability of a particular application.

    A kernel thread has been chosen because we need a reliable way of
    invocation; a workqueue context is not appropriate because all the
    workers might be busy (e.g. allocating memory). Kswapd, which sounds
    like another good fit, is not appropriate either because it might get
    blocked on locks during reclaim as well.

    oom_reaper has to take mmap_sem of the target task for reading, so the
    solution is not 100% reliable because the semaphore might be held or
    blocked for write, but the probability is reduced considerably compared
    to basically any lock blocking forward progress as described above. In
    order to prevent blocking on the lock without any forward progress, we
    use only a trylock and retry 10 times with a short sleep in between.
    Users of mmap_sem which need it for write should be carefully reviewed
    to use _killable waiting as much as possible, and to reduce allocation
    requests done with the lock held to the absolute minimum, to reduce the
    risk even further.

    The API between the OOM killer and the OOM reaper is quite trivial.
    wake_oom_reaper updates mm_to_reap with cmpxchg to guarantee only a
    NULL->mm transition, and oom_reaper clears it atomically once it is done
    with the work. This means that only a single mm_struct can be reaped at
    a time. As the operation is potentially disruptive, we try to limit it
    to the necessary minimum, and the reaper blocks any updates while it
    operates on an mm. The mm_struct is pinned by mm_count to allow parallel
    exit_mmap, and a race is detected by atomic_inc_not_zero(mm_users).
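
    A minimal sketch of that hand-off (the waitqueue name and details are
    illustrative, not necessarily the exact code of the patch):

    static struct mm_struct *mm_to_reap;

    static void wake_oom_reaper(struct mm_struct *mm)
    {
            struct mm_struct *old_mm;

            /* Pin the mm so it cannot be freed before the reaper sees it. */
            atomic_inc(&mm->mm_count);

            /* Only a NULL -> mm transition is allowed; back off otherwise. */
            old_mm = cmpxchg(&mm_to_reap, NULL, mm);
            if (!old_mm)
                    wake_up(&oom_reaper_wait);
            else
                    mmdrop(mm);
    }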

    Signed-off-by: Michal Hocko
    Suggested-by: Oleg Nesterov
    Suggested-by: Mel Gorman
    Acked-by: Mel Gorman
    Acked-by: David Rientjes
    Cc: Mel Gorman
    Cc: Tetsuo Handa
    Cc: Oleg Nesterov
    Cc: Hugh Dickins
    Cc: Andrea Arcangeli
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     

21 Mar, 2016

1 commit

  • Pull x86 protection key support from Ingo Molnar:
    "This tree adds support for a new memory protection hardware feature
    that is available in upcoming Intel CPUs: 'protection keys' (pkeys).

    There's a background article at LWN.net:

    https://lwn.net/Articles/643797/

    The gist is that protection keys allow the encoding of
    user-controllable permission masks in the pte. So instead of having a
    fixed protection mask in the pte (which needs a system call to change
    and works on a per-page basis), the user can map a (handful of)
    protection mask variants, and can change the masks at runtime
    relatively cheaply, without having to change every single page in the
    affected virtual memory range.

    This allows the dynamic switching of the protection bits of large
    amounts of virtual memory, via user-space instructions. It also
    allows more precise control of MMU permission bits: for example the
    executable bit is separate from the read bit (see more about that
    below).

    This tree adds the MM infrastructure and low level x86 glue needed for
    that, plus it adds a high level API to make use of protection keys -
    if a user-space application calls:

    mmap(..., PROT_EXEC);

    or

    mprotect(ptr, sz, PROT_EXEC);

    (note PROT_EXEC-only, without PROT_READ/WRITE), the kernel will notice
    this special case, and will set a special protection key on this
    memory range. It also sets the appropriate bits in the Protection
    Keys User Rights (PKRU) register so that the memory becomes unreadable
    and unwritable.

    So using protection keys the kernel is able to implement 'true'
    PROT_EXEC on x86 CPUs: without protection keys PROT_EXEC implies
    PROT_READ as well. Unreadable executable mappings have security
    advantages: they cannot be read via information leaks to figure out
    ASLR details, nor can they be scanned for ROP gadgets - and they
    cannot be used by exploits for data purposes either.

    We know of no user-space code that relies on pure PROT_EXEC
    mappings today, but binary loaders could start making use of this new
    feature to map binaries and libraries in a more secure fashion.

    There is other pending pkeys work that offers more high level system
    call APIs to manage protection keys - but those are not part of this
    pull request.

    Right now there's a Kconfig that controls this feature
    (CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS) that is default enabled
    (like most x86 CPU feature enablement code that has no runtime
    overhead), but it's not user-configurable at the moment. If there's
    any serious problem with this then we can make it configurable and/or
    flip the default"

    * 'mm-pkeys-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (38 commits)
    x86/mm/pkeys: Fix mismerge of protection keys CPUID bits
    mm/pkeys: Fix siginfo ABI breakage caused by new u64 field
    x86/mm/pkeys: Fix access_error() denial of writes to write-only VMA
    mm/core, x86/mm/pkeys: Add execute-only protection keys support
    x86/mm/pkeys: Create an x86 arch_calc_vm_prot_bits() for VMA flags
    x86/mm/pkeys: Allow kernel to modify user pkey rights register
    x86/fpu: Allow setting of XSAVE state
    x86/mm: Factor out LDT init from context init
    mm/core, x86/mm/pkeys: Add arch_validate_pkey()
    mm/core, arch, powerpc: Pass a protection key in to calc_vm_flag_bits()
    x86/mm/pkeys: Actually enable Memory Protection Keys in the CPU
    x86/mm/pkeys: Add Kconfig prompt to existing config option
    x86/mm/pkeys: Dump pkey from VMA in /proc/pid/smaps
    x86/mm/pkeys: Dump PKRU with other kernel registers
    mm/core, x86/mm/pkeys: Differentiate instruction fetches
    x86/mm/pkeys: Optimize fault handling in access_error()
    mm/core: Do not enforce PKEY permissions on remote mm access
    um, pkeys: Add UML arch_*_access_permitted() methods
    mm/gup, x86/mm/pkeys: Check VMAs and PTEs for protection keys
    x86/mm/gup: Simplify get_user_pages() PTE bit handling
    ...

    Linus Torvalds
     

18 Mar, 2016

2 commits

  • Most of the mm subsystem uses pr_<level> so make it consistent.

    Miscellanea:

    - Realign arguments
    - Add missing newline to format
    - kmemleak-test.c has a "kmemleak: " prefix added to the
    "Kmemleak testing" logging message via pr_fmt

    Signed-off-by: Joe Perches
    Acked-by: Tejun Heo [percpu]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joe Perches
     
  • There are a few things about the *pte_alloc*() helpers worth cleaning up:

    - 'vma' argument is unused, let's drop it;

    - most __pte_alloc() callers do a speculative check for pmd_none()
    before taking the ptl: let's introduce a pte_alloc() macro which does
    the check (a sketch follows the list).

    The only direct user of __pte_alloc() left is userfaultfd, which has
    different expectations about atomicity wrt the pmd.

    - pte_alloc_map() and pte_alloc_map_lock() are redefined using
    pte_alloc().
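
    A sketch of the resulting macros (close to the actual definitions; treat
    the exact spelling as illustrative):

    #define pte_alloc(mm, pmd, address)                                 \
            (unlikely(pmd_none(*(pmd))) && __pte_alloc(mm, pmd, address))

    #define pte_alloc_map(mm, pmd, address)                             \
            (pte_alloc(mm, pmd, address) ?                              \
                    NULL : pte_offset_map(pmd, address))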

    [sudeep.holla@arm.com: fix build for arm64 hugetlbpage]
    [sfr@canb.auug.org.au: fix arch/arm/mm/mmu.c some more]
    Signed-off-by: Kirill A. Shutemov
    Cc: Dave Hansen
    Signed-off-by: Sudeep Holla
    Acked-by: Kirill A. Shutemov
    Signed-off-by: Stephen Rothwell
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     

17 Mar, 2016

1 commit

  • Merge first patch-bomb from Andrew Morton:

    - some misc things

    - ocfs2 updates

    - about half of MM

    - checkpatch updates

    - autofs4 update

    * emailed patches from Andrew Morton: (120 commits)
    autofs4: fix string.h include in auto_dev-ioctl.h
    autofs4: use pr_xxx() macros directly for logging
    autofs4: change log print macros to not insert newline
    autofs4: make autofs log prints consistent
    autofs4: fix some white space errors
    autofs4: fix invalid ioctl return in autofs4_root_ioctl_unlocked()
    autofs4: fix coding style line length in autofs4_wait()
    autofs4: fix coding style problem in autofs4_get_set_timeout()
    autofs4: coding style fixes
    autofs: show pipe inode in mount options
    kallsyms: add support for relative offsets in kallsyms address table
    kallsyms: don't overload absolute symbol type for percpu symbols
    x86: kallsyms: disable absolute percpu symbols on !SMP
    checkpatch: fix another left brace warning
    checkpatch: improve UNSPECIFIED_INT test for bare signed/unsigned uses
    checkpatch: warn on bare unsigned or signed declarations without int
    checkpatch: exclude asm volatile from complex macro check
    mm: memcontrol: drop unnecessary lru locking from mem_cgroup_migrate()
    mm: migrate: consolidate mem_cgroup_migrate() calls
    mm/compaction: speed up pageblock_pfn_to_page() when zone is contiguous
    ...

    Linus Torvalds
     

16 Mar, 2016

2 commits


07 Mar, 2016

1 commit


28 Feb, 2016

1 commit

  • pmd_trans_unstable()/pmd_none_or_trans_huge_or_clear_bad() were
    introduced to locklessly (but atomically) detect when a pmd is a regular
    (stable) pmd, or when the pmd is unstable and can transition at any time
    between pmd_none() and pmd_trans_huge() from under us, while only holding
    the mmap_sem for reading (not for writing).

    While holding the mmap_sem only for reading, MADV_DONTNEED can run from
    under us, so before we can assume the pmd to be a regular stable pmd we
    need to compare it against pmd_none() and pmd_trans_huge() in an atomic
    way, with pmd_trans_unstable(). The old pmd_trans_huge() check left a
    tiny window for a race.

    Useful applications are unlikely to notice the difference as doing
    MADV_DONTNEED concurrently with a page fault would lead to undefined
    behavior.
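
    For page-table walkers that hold mmap_sem only for reading, the pattern
    looks roughly like this illustrative fragment:

    pmd = pmd_offset(pud, addr);
    if (pmd_trans_unstable(pmd))
            return 0;       /* pmd may flip between none and trans-huge */
    /* From here on the pmd is known to point to a stable page table. */
    pte = pte_offset_map_lock(mm, pmd, addr, &ptl);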

    [akpm@linux-foundation.org: tidy up comment grammar/layout]
    Signed-off-by: Andrea Arcangeli
    Reported-by: Kirill A. Shutemov
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     

19 Feb, 2016

2 commits

  • As discussed earlier, we attempt to enforce protection keys in
    software.

    However, the code checks all faults to ensure that they are not
    violating protection key permissions. It was assumed that all
    faults are either write faults where we check PKRU[key].WD (write
    disable) or read faults where we check the AD (access disable)
    bit.

    But, there is a third category of faults for protection keys:
    instruction faults. Instruction faults never run afoul of
    protection keys because protection keys do not affect
    instruction fetches.

    So, plumb the PF_INSTR bit down into the
    arch_vma_access_permitted() function where we do the protection
    key checks.

    We also add a new FAULT_FLAG_INSTRUCTION. This is because
    handle_mm_fault() is not passed the architecture-specific
    error_code where we keep PF_INSTR, so we need to encode the
    instruction-fetch information into the arch-generic fault
    flags.
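
    A condensed, illustrative sketch of what the x86 fault entry does with
    it (the other flags shown already exist):

    unsigned int flags = FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_KILLABLE;

    if (error_code & PF_WRITE)
            flags |= FAULT_FLAG_WRITE;
    if (error_code & PF_INSTR)
            flags |= FAULT_FLAG_INSTRUCTION;

    fault = handle_mm_fault(mm, vma, address, flags);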

    Signed-off-by: Dave Hansen
    Reviewed-by: Thomas Gleixner
    Cc: Andrew Morton
    Cc: Andy Lutomirski
    Cc: Borislav Petkov
    Cc: Brian Gerst
    Cc: Dave Hansen
    Cc: Denys Vlasenko
    Cc: H. Peter Anvin
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Rik van Riel
    Cc: linux-mm@kvack.org
    Link: http://lkml.kernel.org/r/20160212210224.96928009@viggo.jf.intel.com
    Signed-off-by: Ingo Molnar

    Dave Hansen
     
  • We try to enforce protection keys in software the same way that we
    do in hardware. (See long example below).

    But, we only want to do this when accessing our *own* process's
    memory. If GDB set PKRU[6].AD=1 (disable access to PKEY 6), then
    tried to PTRACE_POKE a target process which just happened to have
    some mprotect_pkey(pkey=6) memory, we do *not* want to deny the
    debugger access to that memory. PKRU is fundamentally a
    thread-local structure and we do not want to enforce it on access
    to _another_ thread's data.

    This gets especially tricky when we have workqueues or other
    delayed-work mechanisms that might run in a random process's context.
    We can check that we only enforce pkeys when operating on our *own* mm,
    but delayed work gets performed when a random user context is active.
    We might end up with a situation where a delayed-work gup fails when
    running randomly under its "own" task but succeeds when running under
    another process. We want to avoid that.

    To avoid that, we use the new GUP flag FOLL_REMOTE and add a
    fault flag, FAULT_FLAG_REMOTE. They indicate that we are
    walking an mm which is not guaranteed to be the same as
    current->mm and should not be subject to protection key
    enforcement.
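
    A sketch of how an architecture check can honour the new flag (the
    function and the pkru helper here are illustrative names, not the exact
    x86 code):

    static inline bool pkey_access_permitted(struct vm_area_struct *vma,
                                             bool write, bool remote)
    {
            if (remote)
                    return true;    /* PKRU is thread-local; don't apply
                                       it to another mm */
            return pkru_allows_access(vma_pkey(vma), write);
    }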

    Thanks to Jerome Glisse for pointing out this scenario.

    Signed-off-by: Dave Hansen
    Reviewed-by: Thomas Gleixner
    Cc: Alexey Kardashevskiy
    Cc: Andrea Arcangeli
    Cc: Andrew Morton
    Cc: Andy Lutomirski
    Cc: Andy Lutomirski
    Cc: Arnd Bergmann
    Cc: Benjamin Herrenschmidt
    Cc: Boaz Harrosh
    Cc: Borislav Petkov
    Cc: Brian Gerst
    Cc: Dan Williams
    Cc: Dave Chinner
    Cc: Dave Hansen
    Cc: David Gibson
    Cc: Denys Vlasenko
    Cc: Dominik Dingel
    Cc: Dominik Vogt
    Cc: Eric B Munson
    Cc: Geliang Tang
    Cc: Guan Xuetao
    Cc: H. Peter Anvin
    Cc: Heiko Carstens
    Cc: Hugh Dickins
    Cc: Jan Kara
    Cc: Jason Low
    Cc: Jerome Marchand
    Cc: Joerg Roedel
    Cc: Kirill A. Shutemov
    Cc: Konstantin Khlebnikov
    Cc: Laurent Dufour
    Cc: Linus Torvalds
    Cc: Martin Schwidefsky
    Cc: Matthew Wilcox
    Cc: Mel Gorman
    Cc: Michael Ellerman
    Cc: Michal Hocko
    Cc: Mikulas Patocka
    Cc: Minchan Kim
    Cc: Oleg Nesterov
    Cc: Paul Mackerras
    Cc: Peter Zijlstra
    Cc: Rik van Riel
    Cc: Sasha Levin
    Cc: Shachar Raindel
    Cc: Vlastimil Babka
    Cc: Xie XiuQi
    Cc: iommu@lists.linux-foundation.org
    Cc: linux-arch@vger.kernel.org
    Cc: linux-kernel@vger.kernel.org
    Cc: linux-mm@kvack.org
    Cc: linux-s390@vger.kernel.org
    Cc: linuxppc-dev@lists.ozlabs.org
    Signed-off-by: Ingo Molnar

    Dave Hansen
     

18 Feb, 2016

2 commits

  • Today, for normal faults and page table walks, we check the VMA
    and/or PTE to ensure that it is compatible with the action. For
    instance, if we get a write fault on a non-writeable VMA, we
    SIGSEGV.

    We try to do the same thing for protection keys. Basically, we
    try to make sure that if a user does this:

    mprotect(ptr, size, PROT_NONE);
    *ptr = foo;

    they see the same effects with protection keys when they do this:

    mprotect(ptr, size, PROT_READ|PROT_WRITE);
    set_pkey(ptr, size, 4);
    wrpkru(0xffffff3f); // access disable pkey 4
    *ptr = foo;

    The state to do that checking is in the VMA, but we also
    sometimes have to do it on the page tables only, like when doing
    a get_user_pages_fast() where we have no VMA.

    We add two functions and expose them to generic code:

    arch_pte_access_permitted(pte_flags, write)
    arch_vma_access_permitted(vma, write)

    These are, of course, backed up in x86 arch code with checks
    against the PTE or VMA's protection key.
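
    For architectures without protection keys the fallbacks can simply
    permit everything and leave the decision to the usual VMA/PTE permission
    checks; a minimal sketch (prototypes as quoted above):

    static inline bool arch_pte_access_permitted(pte_t pte, bool write)
    {
            return true;    /* no pkeys: the normal PTE bits decide */
    }

    static inline bool arch_vma_access_permitted(struct vm_area_struct *vma,
                                                 bool write)
    {
            return true;
    }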

    But, there are also cases where we do not want to respect
    protection keys. When we ptrace(), for instance, we do not want
    to apply the tracer's PKRU permissions to the PTEs from the
    process being traced.

    Signed-off-by: Dave Hansen
    Reviewed-by: Thomas Gleixner
    Cc: Alexey Kardashevskiy
    Cc: Andrew Morton
    Cc: Andy Lutomirski
    Cc: Andy Lutomirski
    Cc: Aneesh Kumar K.V
    Cc: Arnd Bergmann
    Cc: Benjamin Herrenschmidt
    Cc: Boaz Harrosh
    Cc: Borislav Petkov
    Cc: Brian Gerst
    Cc: Dan Williams
    Cc: Dave Hansen
    Cc: David Gibson
    Cc: David Hildenbrand
    Cc: David Vrabel
    Cc: Denys Vlasenko
    Cc: Dominik Dingel
    Cc: Dominik Vogt
    Cc: Guan Xuetao
    Cc: H. Peter Anvin
    Cc: Heiko Carstens
    Cc: Hugh Dickins
    Cc: Jason Low
    Cc: Jerome Marchand
    Cc: Juergen Gross
    Cc: Kirill A. Shutemov
    Cc: Laurent Dufour
    Cc: Linus Torvalds
    Cc: Martin Schwidefsky
    Cc: Matthew Wilcox
    Cc: Mel Gorman
    Cc: Michael Ellerman
    Cc: Michal Hocko
    Cc: Mikulas Patocka
    Cc: Minchan Kim
    Cc: Paul Mackerras
    Cc: Peter Zijlstra
    Cc: Rik van Riel
    Cc: Sasha Levin
    Cc: Shachar Raindel
    Cc: Stephen Smalley
    Cc: Toshi Kani
    Cc: Vlastimil Babka
    Cc: linux-arch@vger.kernel.org
    Cc: linux-kernel@vger.kernel.org
    Cc: linux-mm@kvack.org
    Cc: linux-s390@vger.kernel.org
    Cc: linuxppc-dev@lists.ozlabs.org
    Link: http://lkml.kernel.org/r/20160212210219.14D5D715@viggo.jf.intel.com
    Signed-off-by: Ingo Molnar

    Dave Hansen
     
  • Signed-off-by: Ingo Molnar

    Ingo Molnar
     

16 Feb, 2016

2 commits

  • For protection keys, we need to understand whether protections
    should be enforced in software or not. In general, we enforce
    protections when working on our own task, but not when working on
    others. We call these "current" and "remote" operations.

    This patch introduces a new get_user_pages() variant:

    get_user_pages_remote()

    Which is a replacement for when get_user_pages() is called on
    non-current tsk/mm.

    We also introduce a new gup flag: FOLL_REMOTE which can be used
    for the "__" gup variants to get this new behavior.

    The uprobes is_trap_at_addr() location holds mmap_sem and
    calls get_user_pages(current->mm) on an instruction address. This
    makes it a pretty unique gup caller. Being an instruction access
    and also really originating from the kernel (vs. the app), I opted
    to consider this a 'remote' access where protection keys will not
    be enforced.

    Without protection keys, this patch should not change any behavior.
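
    A usage sketch for the remote case (argument order as of this series;
    the prototype changed in later kernels):

    struct page *page;
    long ret;

    down_read(&mm->mmap_sem);
    ret = get_user_pages_remote(tsk, mm, addr, 1,   /* one page */
                                1,                  /* write    */
                                0,                  /* no force */
                                &page, NULL);
    up_read(&mm->mmap_sem);
    if (ret == 1)
            put_page(page);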

    Signed-off-by: Dave Hansen
    Reviewed-by: Thomas Gleixner
    Cc: Andrea Arcangeli
    Cc: Andrew Morton
    Cc: Andy Lutomirski
    Cc: Borislav Petkov
    Cc: Brian Gerst
    Cc: Dave Hansen
    Cc: Denys Vlasenko
    Cc: H. Peter Anvin
    Cc: Kirill A. Shutemov
    Cc: Linus Torvalds
    Cc: Naoya Horiguchi
    Cc: Peter Zijlstra
    Cc: Rik van Riel
    Cc: Srikar Dronamraju
    Cc: Vlastimil Babka
    Cc: jack@suse.cz
    Cc: linux-mm@kvack.org
    Link: http://lkml.kernel.org/r/20160212210154.3F0E51EA@viggo.jf.intel.com
    Signed-off-by: Ingo Molnar

    Dave Hansen
     
  • Provide a stable basis for the pkeys patches, which touch various
    x86 details.

    Signed-off-by: Ingo Molnar

    Ingo Molnar
     

04 Feb, 2016

1 commit

  • Trinity is now hitting the WARN_ON_ONCE we added in v3.15 commit
    cda540ace6a1 ("mm: get_user_pages(write,force) refuse to COW in shared
    areas"). The warning has served its purpose, nobody was harmed by that
    change, so just remove the warning to generate less noise from Trinity.

    Which reminds me of the comment I wrongly left behind with that commit
    (but was spotted at the time by Kirill), which has since moved into a
    separate function, and become even more obscure: delete it.

    Reported-by: Dave Jones
    Suggested-by: Kirill A. Shutemov
    Signed-off-by: Hugh Dickins
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     

01 Feb, 2016

1 commit

  • pfn_t_to_page() honors the flags in the pfn_t value to determine if a
    pfn is backed by a page. However, vm_insert_mixed() was originally
    written to use pfn_valid() to make this determination. To restore the
    old/correct behavior, ignore the pfn_t flags in the !pfn_t_devmap() case
    and fall back to trusting pfn_valid().
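
    The restored decision then looks roughly like this sketch (insert_page()
    and insert_pfn() are the existing internal helpers in mm/memory.c):

    if (!pfn_t_devmap(pfn) && pfn_t_valid(pfn)) {
            /* Not a devmap pfn: trust pfn_valid() and insert the page. */
            struct page *page = pfn_t_to_page(pfn);

            return insert_page(vma, addr, page, pgprot);
    }
    return insert_pfn(vma, addr, pfn, pgprot);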

    Fixes: 01c8f1c44b83 ("mm, dax, gpu: convert vm_insert_mixed to pfn_t")
    Cc: Dave Hansen
    Cc: David Airlie
    Reported-by: Tomi Valkeinen
    Tested-by: Tomi Valkeinen
    Signed-off-by: Dan Williams

    Dan Williams
     

29 Jan, 2016

1 commit


21 Jan, 2016

1 commit

  • Swap cache pages are freed aggressively if swap is nearly full (>50%
    currently), because otherwise we are likely to stop scanning anonymous
    pages when we near the swap limit even if there are plenty of freeable
    swap cache pages. We should follow the same trend in the case of a
    memory cgroup, which has its own swap limit.

    Signed-off-by: Vladimir Davydov
    Acked-by: Johannes Weiner
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     

16 Jan, 2016

12 commits

  • A dax-huge-page mapping, while it uses some thp helpers, is ultimately
    not a transparent huge page. The distinction is especially important in
    the get_user_pages() path. pmd_devmap() is used to distinguish dax-pmds
    from pmd_huge() and pmd_trans_huge(), which have slightly different
    semantics.

    Explicitly mark the pmd_trans_huge() helpers that dax needs by adding
    pmd_devmap() checks.

    [kirill.shutemov@linux.intel.com: fix regression in handling mlocked pages in __split_huge_pmd()]
    Signed-off-by: Dan Williams
    Cc: Dave Hansen
    Cc: Mel Gorman
    Cc: Peter Zijlstra
    Cc: Andrea Arcangeli
    Cc: Matthew Wilcox
    Signed-off-by: Kirill A. Shutemov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dan Williams
     
  • Similar to the conversion of vm_insert_mixed(), use pfn_t in
    vmf_insert_pfn_pmd() to tag the resulting pmd with _PAGE_DEVMAP when the
    pfn is backed by a devm_memremap_pages() mapping.

    Signed-off-by: Dan Williams
    Cc: Dave Hansen
    Cc: Matthew Wilcox
    Cc: Alexander Viro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dan Williams
     
  • Convert the raw unsigned long 'pfn' argument to pfn_t for the purpose of
    evaluating the PFN_MAP and PFN_DEV flags. When both are set it triggers
    _PAGE_DEVMAP to be set in the resulting pte.

    There are no functional changes to the gpu drivers as a result of this
    conversion.

    Signed-off-by: Dan Williams
    Cc: Dave Hansen
    Cc: David Airlie
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dan Williams
     
  • Before the THP refcounting rework, a THP was not allowed to cross a VMA
    boundary. So, if we have a THP and we split it, PG_mlocked can be safely
    transferred to the small pages.

    With the new THP refcounting and a naive approach to mlocking we can end
    up with this scenario:
    1. we have a mlocked THP, which belongs to one VM_LOCKED VMA.
    2. the process does munlock() on *part* of the THP:
    - the VMA is split into two, one of them VM_LOCKED;
    - the huge PMD is split into a PTE table;
    - the THP is still mlocked;
    3. split_huge_page():
    - it transfers PG_mlocked to *all* small pages regardless of whether
    they belong to any VM_LOCKED VMA.

    We probably could munlock() all small pages on split_huge_page(), but I
    think we already have an accounting issue at step two.

    Instead of forbidding mlocked pages altogether, we just avoid mlocking
    PTE-mapped THPs and munlock THPs on split_huge_pmd().

    This means PTE-mapped THPs will be on normal lru lists and will be split
    under memory pressure by vmscan. After the split vmscan will detect
    unevictable small pages and mlock them.

    With this approach we shouldn't hit a situation like the one described
    above.

    Signed-off-by: Kirill A. Shutemov
    Cc: Sasha Levin
    Cc: Aneesh Kumar K.V
    Cc: Jerome Marchand
    Cc: Vlastimil Babka
    Cc: Andrea Arcangeli
    Cc: Hugh Dickins
    Cc: Dave Hansen
    Cc: Mel Gorman
    Cc: Rik van Riel
    Cc: Naoya Horiguchi
    Cc: Steve Capper
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Christoph Lameter
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • We're going to have THPs mapped with PTEs. This will confuse NUMA
    balancing. Let's skip them for now.

    Signed-off-by: Kirill A. Shutemov
    Tested-by: Sasha Levin
    Tested-by: Aneesh Kumar K.V
    Acked-by: Jerome Marchand
    Cc: Vlastimil Babka
    Cc: Andrea Arcangeli
    Cc: Hugh Dickins
    Cc: Dave Hansen
    Cc: Mel Gorman
    Cc: Rik van Riel
    Cc: Naoya Horiguchi
    Cc: Steve Capper
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Christoph Lameter
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • We're going to allow mapping of individual 4k pages of a THP compound
    page. This means we need to track the mapcount on a per-small-page basis.

    The straightforward approach is to use ->_mapcount in all subpages to
    track how many times a subpage is mapped with PMDs or PTEs combined. But
    this is rather expensive: mapping or unmapping a THP page with a PMD
    would require HPAGE_PMD_NR atomic operations instead of the single one we
    have now.

    The idea is to store separately how many times the page was mapped as a
    whole -- compound_mapcount. This frees up ->_mapcount in the subpages to
    track the PTE mapcount.

    We use the same approach as with the compound page destructor and
    compound order to store compound_mapcount: use space in the first tail
    page, ->mapping this time.

    Any time we map/unmap a whole compound page (THP or hugetlb) we
    increment/decrement compound_mapcount. When we map part of a compound
    page with a PTE, we operate on ->_mapcount of the subpage.

    page_mapcount() counts both PTE and PMD mappings of the page.

    Basically, we have the mapcount for a subpage spread over two counters,
    which makes it tricky to detect when the last mapcount for a page goes
    away.

    We introduce PageDoubleMap() for this. When we split a THP PMD for the
    first time and there's another PMD mapping left, we offset ->_mapcount
    in all subpages by one and set PG_double_map on the compound page. These
    additional references go away with the last compound_mapcount.

    This approach provides a way to detect when the last mapcount goes away
    on a per-small-page basis without introducing new overhead for the most
    common cases.
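
    Roughly, the two counters combine as in the following sketch (close to
    the resulting page_mapcount(); treat the details as illustrative):

    static inline atomic_t *compound_mapcount_ptr(struct page *page)
    {
            return &page[1].compound_mapcount;  /* first tail page */
    }

    static inline int page_mapcount(struct page *page)
    {
            int ret = atomic_read(&page->_mapcount) + 1;  /* PTE maps */

            if (PageCompound(page)) {
                    page = compound_head(page);
                    ret += atomic_read(compound_mapcount_ptr(page)) + 1;
                    if (PageDoubleMap(page))
                            ret--;  /* undo the offset added at PMD split */
            }
            return ret;
    }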

    [akpm@linux-foundation.org: fix typo in comment]
    [mhocko@suse.com: ignore partial THP when moving task]
    Signed-off-by: Kirill A. Shutemov
    Tested-by: Aneesh Kumar K.V
    Acked-by: Jerome Marchand
    Cc: Sasha Levin
    Cc: Aneesh Kumar K.V
    Cc: Jerome Marchand
    Cc: Vlastimil Babka
    Cc: Andrea Arcangeli
    Cc: Hugh Dickins
    Cc: Dave Hansen
    Cc: Mel Gorman
    Cc: Rik van Riel
    Cc: Naoya Horiguchi
    Cc: Steve Capper
    Cc: Johannes Weiner
    Cc: Christoph Lameter
    Cc: David Rientjes
    Signed-off-by: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • With the new refcounting we don't need to mark PMDs as splitting. Let's
    drop the code that handles this.

    Signed-off-by: Kirill A. Shutemov
    Tested-by: Sasha Levin
    Tested-by: Aneesh Kumar K.V
    Acked-by: Vlastimil Babka
    Acked-by: Jerome Marchand
    Cc: Andrea Arcangeli
    Cc: Hugh Dickins
    Cc: Dave Hansen
    Cc: Mel Gorman
    Cc: Rik van Riel
    Cc: Naoya Horiguchi
    Cc: Steve Capper
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Christoph Lameter
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • We are going to decouple splitting a THP PMD from splitting the
    underlying compound page.

    This patch renames the split_huge_page_pmd*() functions to
    split_huge_pmd*() to reflect the fact that they don't imply page
    splitting, only PMD splitting.

    Signed-off-by: Kirill A. Shutemov
    Tested-by: Sasha Levin
    Tested-by: Aneesh Kumar K.V
    Acked-by: Jerome Marchand
    Acked-by: Vlastimil Babka
    Cc: Andrea Arcangeli
    Cc: Hugh Dickins
    Cc: Dave Hansen
    Cc: Mel Gorman
    Cc: Rik van Riel
    Cc: Naoya Horiguchi
    Cc: Steve Capper
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Christoph Lameter
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • With the new refcounting, a THP can belong to several VMAs. This makes
    it tricky to track THP pages when they are partially mlocked. It can
    lead to leaking mlocked pages into non-VM_LOCKED vmas and other problems.

    With this patch we split all pages on mlock and avoid faulting in or
    collapsing new THPs in VM_LOCKED vmas.

    I've tried an alternative approach: do not mark THP pages mlocked and
    keep them on the normal LRUs. This way vmscan could try to split huge
    pages under memory pressure and free up subpages which don't belong to
    VM_LOCKED vmas. But this is a user-visible change: we would screw up the
    Mlocked accounting reported in meminfo, so I had to leave this approach
    aside.

    We can bring something better later, but this should be good enough for
    now.

    Signed-off-by: Kirill A. Shutemov
    Tested-by: Sasha Levin
    Tested-by: Aneesh Kumar K.V
    Acked-by: Jerome Marchand
    Cc: Vlastimil Babka
    Cc: Andrea Arcangeli
    Cc: Hugh Dickins
    Cc: Dave Hansen
    Cc: Mel Gorman
    Cc: Rik van Riel
    Cc: Naoya Horiguchi
    Cc: Steve Capper
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Christoph Lameter
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • As with rmap, with the new refcounting we cannot rely on PageTransHuge()
    to check whether we need to charge the size of a huge page to the
    cgroup. We need information from the caller to know whether it was
    mapped with a PMD or a PTE.

    We uncharge when the last reference on the page is gone. At that point,
    if we see PageTransHuge() it means we need to uncharge the whole huge
    page.

    The tricky part is partial unmap -- when we try to unmap part of a huge
    page. We don't do any special handling of this situation, meaning we
    don't uncharge the part of the huge page unless the last user is gone or
    split_huge_page() is triggered. If cgroup memory pressure happens, the
    partially unmapped page will be split through the shrinker. This should
    be good enough.

    Signed-off-by: Kirill A. Shutemov
    Tested-by: Sasha Levin
    Tested-by: Aneesh Kumar K.V
    Acked-by: Vlastimil Babka
    Acked-by: Jerome Marchand
    Cc: Andrea Arcangeli
    Cc: Hugh Dickins
    Cc: Dave Hansen
    Cc: Mel Gorman
    Cc: Rik van Riel
    Cc: Naoya Horiguchi
    Cc: Steve Capper
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Christoph Lameter
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • We're going to allow mapping of individual 4k pages of a THP compound
    page. This means we cannot rely on a PageTransHuge() check to decide
    whether to map/unmap a small page or the whole THP.

    The patch adds a new argument to the rmap functions to indicate whether
    we want to operate on the whole compound page or only a small page (see
    the usage sketch below).
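
    A usage sketch of the new convention (the added bool means "operate on
    the whole compound page"; calls shown are illustrative):

    page_add_anon_rmap(page, vma, address, true);   /* whole THP via a PMD */
    page_add_anon_rmap(page, vma, address, false);  /* one 4k subpage, PTE */
    page_remove_rmap(page, false);                  /* unmap a single page */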

    [n-horiguchi@ah.jp.nec.com: fix mapcount mismatch in hugepage migration]
    Signed-off-by: Kirill A. Shutemov
    Tested-by: Sasha Levin
    Tested-by: Aneesh Kumar K.V
    Acked-by: Vlastimil Babka
    Acked-by: Jerome Marchand
    Cc: Andrea Arcangeli
    Cc: Hugh Dickins
    Cc: Dave Hansen
    Cc: Mel Gorman
    Cc: Rik van Riel
    Cc: Steve Capper
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Christoph Lameter
    Cc: David Rientjes
    Signed-off-by: Naoya Horiguchi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • We don't define the meaning of page->mapping for tail pages. Currently
    it's always NULL, which can be inconsistent with the head page and
    potentially lead to problems.

    Let's poison the pointer to catch all illegal uses.

    page_rmapping(), page_mapping() and page_anon_vma() are changed to look
    at the head page.

    The only illegal use I've caught so far is __GFP_COMP pages from the
    sound subsystem, mapped with PTEs. do_shared_fault() is changed to use
    page_rmapping() instead of direct access to fault_page->mapping.

    Signed-off-by: Kirill A. Shutemov
    Reviewed-by: Jérôme Glisse
    Cc: Andrea Arcangeli
    Cc: Hugh Dickins
    Cc: Dave Hansen
    Cc: Mel Gorman
    Cc: Rik van Riel
    Cc: Vlastimil Babka
    Cc: Christoph Lameter
    Cc: Naoya Horiguchi
    Cc: Steve Capper
    Cc: "Aneesh Kumar K.V"
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Jerome Marchand
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     

15 Jan, 2016

2 commits

  • page_cache_read has historically been using page_cache_alloc_cold to
    allocate a new page. This means that mapping_gfp_mask is used as the
    base for the gfp_mask. Many filesystems set this mask to GFP_NOFS to
    prevent fs recursion issues. page_cache_read is called from the
    vm_operations_struct::fault() context during a page fault. This context
    doesn't normally need the reclaim protection.

    ceph and ocfs2, which call filemap_fault from their fault handlers, seem
    to be OK because they do not take any fs lock before invoking the
    generic implementation. xfs, which takes XFS_MMAPLOCK_SHARED, is safe
    from the reclaim recursion POV because this lock serializes truncate and
    punch hole with the page faults and it doesn't get involved in the
    reclaim.

    There is simply no reason to deliberately use a weaker allocation
    context when __GFP_FS | __GFP_IO can be used. The GFP_NOFS protection
    might even be harmful. There is a push to fail GFP_NOFS allocations
    rather than loop within the allocator indefinitely with a very limited
    reclaim ability. Once we start failing those requests, the OOM killer
    might be triggered prematurely because the page cache allocation failure
    is propagated up the page fault path and ends up in
    pagefault_out_of_memory.

    We cannot play with mapping_gfp_mask directly because that would be racy
    wrt. parallel page faults, and it might interfere with other users who
    really rely on the NOFS semantics of the stored gfp_mask. The mask also
    properly belongs to the inode, so it would even be a layering violation.
    What we can do instead is push the gfp_mask into struct vm_fault and
    allow the fs layer to overwrite it should the callback need to be called
    with a different allocation context.

    Initialize the default to (mapping_gfp_mask | __GFP_FS | __GFP_IO)
    because this should normally be safe from the page fault path. Why do we
    care about mapping_gfp_mask at all, then? Because it doesn't hold only
    reclaim protection flags; it might also contain zone and movability
    restrictions (GFP_DMA32, __GFP_MOVABLE and others), so we have to
    respect those.
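
    A sketch of the default-mask computation described above (the helper
    name is illustrative):

    static inline gfp_t __get_fault_gfp_mask(struct vm_area_struct *vma)
    {
            struct file *vm_file = vma->vm_file;

            if (vm_file)
                    return mapping_gfp_mask(vm_file->f_mapping) |
                           __GFP_FS | __GFP_IO;

            /* Anonymous faults have no mapping gfp mask to respect. */
            return GFP_KERNEL;
    }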

    Signed-off-by: Michal Hocko
    Reported-by: Tetsuo Handa
    Acked-by: Jan Kara
    Acked-by: Vlastimil Babka
    Cc: Tetsuo Handa
    Cc: Mel Gorman
    Cc: Dave Chinner
    Cc: Mark Fasheh
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • Currently, looking at /proc/<pid>/status or statm, there is no way to
    distinguish shmem pages from pages mapped to a regular file (shmem pages
    are mapped to /dev/zero), even though their implication in actual memory
    use is quite different.

    The internal accounting currently counts shmem pages together with
    regular files. As a preparation to extend the userspace interfaces, this
    patch adds an MM_SHMEMPAGES counter to mm_rss_stat to account for shmem
    pages separately from MM_FILEPAGES. The next patch will expose it to
    userspace - this patch doesn't change the exported values yet: it adds
    up MM_SHMEMPAGES and MM_FILEPAGES at the places where MM_FILEPAGES was
    used before. The only user-visible change after this patch is the OOM
    killer message that separates the reported "shmem-rss" from "file-rss".
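
    The per-page counter choice then boils down to something like the
    following sketch (shmem/tmpfs pages are the swap-backed ones):

    static inline int mm_counter_file(struct page *page)
    {
            if (PageSwapBacked(page))
                    return MM_SHMEMPAGES;
            return MM_FILEPAGES;
    }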

    [vbabka@suse.cz: forward-porting, tweak changelog]
    Signed-off-by: Jerome Marchand
    Signed-off-by: Vlastimil Babka
    Acked-by: Konstantin Khlebnikov
    Acked-by: Michal Hocko
    Acked-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jerome Marchand
     

12 Jan, 2016

1 commit

  • The x86 vvar vma contains pages with differing cacheability
    flags. x86 currently implements this by manually inserting all
    the ptes using (io_)remap_pfn_range when the vma is set up.

    x86 wants to move to using .fault with VM_FAULT_NOPAGE to set up
    the mappings as needed. The correct API to use to insert a pfn
    in .fault is vm_insert_pfn(), but vm_insert_pfn() can't override the
    vma's cache mode, and the HPET page in particular needs to be
    uncached despite the fact that the rest of the VMA is cached.

    Add vm_insert_pfn_prot() to support varying cacheability within
    the same non-COW VMA in a more sane manner.

    x86 could alternatively use multiple VMAs, but that's messy,
    would break CRIU, and would create unnecessary VMAs that would
    waste memory.
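
    A .fault-handler sketch using the new helper to map one uncached page
    inside an otherwise cached VMA (hpet_pfn and the surrounding error
    handling are illustrative):

    if (vm_insert_pfn_prot(vma, (unsigned long)vmf->virtual_address,
                           hpet_pfn, pgprot_noncached(vma->vm_page_prot)))
            return VM_FAULT_SIGBUS;
    return VM_FAULT_NOPAGE;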

    Signed-off-by: Andy Lutomirski
    Reviewed-by: Kees Cook
    Acked-by: Andrew Morton
    Cc: Andy Lutomirski
    Cc: Borislav Petkov
    Cc: Dave Hansen
    Cc: Fenghua Yu
    Cc: H. Peter Anvin
    Cc: Linus Torvalds
    Cc: Oleg Nesterov
    Cc: Peter Zijlstra
    Cc: Quentin Casasnovas
    Cc: Thomas Gleixner
    Link: http://lkml.kernel.org/r/d2938d1eb37be7a5e4f86182db646551f11e45aa.1451446564.git.luto@kernel.org
    Signed-off-by: Ingo Molnar

    Andy Lutomirski
     

19 Nov, 2015

1 commit

  • DAX handling of COW faults has the wrong locking sequence:
    dax_fault does i_mmap_lock_read
    do_cow_fault does i_mmap_unlock_write

    Ross's commit[1] missed a fix[2] that Kirill added to Matthew's
    commit[3].

    Original COW locking logic was introduced by Matthew here[4].

    This should be applied to v4.3 as well.

    [1] 0f90cc6609c7 mm, dax: fix DAX deadlocks
    [2] 52a2b53ffde6 mm, dax: use i_mmap_unlock_write() in do_cow_fault()
    [3] 843172978bb9 dax: fix race between simultaneous faults
    [4] 2e4cdab0584f mm: allow page fault handlers to perform the COW

    Cc:
    Cc: Boaz Harrosh
    Cc: Alexander Viro
    Cc: Dave Chinner
    Cc: Jan Kara
    Cc: "Kirill A. Shutemov"
    Cc: Matthew Wilcox
    Acked-by: Ross Zwisler
    Signed-off-by: Yigal Korman
    Signed-off-by: Dan Williams

    Yigal Korman
     

17 Oct, 2015

1 commit

  • The following two locking commits in the DAX code:

    commit 843172978bb9 ("dax: fix race between simultaneous faults")
    commit 46c043ede471 ("mm: take i_mmap_lock in unmap_mapping_range() for DAX")

    introduced a number of deadlocks and other issues which need to be fixed
    for the v4.3 kernel. The list of issues in DAX after these commits
    (some newly introduced by the commits, some preexisting) can be found
    here:

    https://lkml.org/lkml/2015/9/25/602 (Subject: "Re: [PATCH] dax: fix deadlock in __dax_fault").

    This undoes most of the changes introduced by those two commits,
    essentially returning us to the DAX locking scheme that was used in
    v4.2.

    Signed-off-by: Ross Zwisler
    Cc: Alexander Viro
    Cc: Dan Williams
    Tested-by: Dave Chinner
    Cc: Jan Kara
    Cc: "Kirill A. Shutemov"
    Cc: Matthew Wilcox
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ross Zwisler
     

11 Sep, 2015

1 commit


09 Sep, 2015

1 commit