01 Feb, 2017

1 commit

  • commit d51e9894d27492783fc6d1b489070b4ba66ce969 upstream.

    Since commit be97a41b291e ("mm/mempolicy.c: merge alloc_hugepage_vma to
    alloc_pages_vma") alloc_pages_vma() can potentially free a mempolicy by
    mpol_cond_put() before accessing the embedded nodemask by
    __alloc_pages_nodemask(). The commit log says it's so "we can use a
    single exit path within the function" but that's clearly wrong. We can
    still do that when doing mpol_cond_put() after the allocation attempt.

    Make sure the mempolicy is not freed prematurely, otherwise
    __alloc_pages_nodemask() can end up using a bogus nodemask, which could
    lead e.g. to premature OOM.
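
    A minimal sketch of the ordering the fix restores, with abbreviated
    argument lists (illustrative, not the exact kernel signatures of that
    release):

    static struct page *alloc_pages_vma_sketch(gfp_t gfp, unsigned int order,
                                               struct vm_area_struct *vma,
                                               unsigned long addr)
    {
            struct mempolicy *pol = get_vma_policy(vma, addr);
            nodemask_t *nmask = policy_nodemask(gfp, pol);  /* may point into *pol */
            struct page *page;

            /* Pre-fix bug: mpol_cond_put(pol) was called around here, so *pol
             * (and the nodemask above) could already be freed by the time the
             * allocator dereferenced it. */
            page = __alloc_pages_nodemask(gfp, order,
                                          node_zonelist(numa_node_id(), gfp), nmask);

            /* Post-fix: drop the conditional reference only after the
             * allocation attempt has consumed the nodemask. */
            mpol_cond_put(pol);
            return page;
    }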

    Fixes: be97a41b291e ("mm/mempolicy.c: merge alloc_hugepage_vma to alloc_pages_vma")
    Link: http://lkml.kernel.org/r/20170118141124.8345-1-vbabka@suse.cz
    Signed-off-by: Vlastimil Babka
    Acked-by: Kirill A. Shutemov
    Acked-by: Michal Hocko
    Acked-by: David Rientjes
    Cc: Aneesh Kumar K.V
    Cc: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Vlastimil Babka
     

19 Oct, 2016

1 commit

  • This removes the 'write' and 'force' from get_user_pages() and replaces
    them with 'gup_flags' to make the use of FOLL_FORCE explicit in callers
    as use of this flag can result in surprising behaviour (and hence bugs)
    within the mm subsystem.
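
    A hedged before/after sketch of the call-site change (FOLL_* flags are the
    existing constants in <linux/mm.h>; the surrounding variables are
    illustrative):

    /* before: 'write' and 'force' were separate int arguments */
    ret = get_user_pages(start, nr_pages, 1 /* write */, 0 /* force */,
                         pages, vmas);

    /* after: a single gup_flags argument makes FOLL_FORCE explicit and
     * deliberate at every call site */
    ret = get_user_pages(start, nr_pages, FOLL_WRITE, pages, vmas);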

    Signed-off-by: Lorenzo Stoakes
    Acked-by: Christian König
    Acked-by: Jesper Nilsson
    Acked-by: Michal Hocko
    Reviewed-by: Jan Kara
    Signed-off-by: Linus Torvalds

    Lorenzo Stoakes
     

08 Oct, 2016

1 commit

  • Use the existing enums instead of hardcoded index when looking at the
    zonelist. This makes it more readable. No functionality change by this
    patch.
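
    A hedged sketch of the kind of change described (ZONELIST_FALLBACK is the
    existing enum value in <linux/mmzone.h>; the surrounding code is
    illustrative):

    /* before: hardcoded index into node_zonelists[] */
    zonelist = &NODE_DATA(nid)->node_zonelists[0];

    /* after: the existing enum documents which zonelist is meant */
    zonelist = &NODE_DATA(nid)->node_zonelists[ZONELIST_FALLBACK];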

    Link: http://lkml.kernel.org/r/1472227078-24852-1-git-send-email-aneesh.kumar@linux.vnet.ibm.com
    Signed-off-by: Aneesh Kumar K.V
    Reviewed-by: Anshuman Khandual
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Aneesh Kumar K.V
     

02 Sep, 2016

1 commit

  • KASAN allocates memory from the page allocator as part of
    kmem_cache_free(), and that can reference current->mempolicy through any
    number of allocation functions. It needs to be NULL'd out before the
    final reference is dropped to prevent a use-after-free bug:

    BUG: KASAN: use-after-free in alloc_pages_current+0x363/0x370 at addr ffff88010b48102c
    CPU: 0 PID: 15425 Comm: trinity-c2 Not tainted 4.8.0-rc2+ #140
    ...
    Call Trace:
    dump_stack
    kasan_object_err
    kasan_report_error
    __asan_report_load2_noabort
    alloc_pages_current

    Fix this by setting the task's mempolicy to NULL before dropping the
    final reference.
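
    A minimal sketch of the ordering the fix enforces on the task-exit path
    (names illustrative, not the exact kernel code):

    struct mempolicy *pol = tsk->mempolicy;

    tsk->mempolicy = NULL;  /* clear the pointer first so that allocations
                             * triggered while freeing (e.g. by KASAN) cannot
                             * look up a stale mempolicy */
    mpol_put(pol);          /* only now drop what may be the final reference */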

    Link: http://lkml.kernel.org/r/alpine.DEB.2.10.1608301442180.63329@chino.kir.corp.google.com
    Fixes: cd11016e5f52 ("mm, kasan: stackdepot implementation. Enable stackdepot for SLAB")
    Signed-off-by: David Rientjes
    Reported-by: Vegard Nossum
    Acked-by: Andrey Ryabinin
    Cc: Alexander Potapenko
    Cc: Dmitry Vyukov
    Cc: [4.6+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     

29 Jul, 2016

1 commit

  • This moves the LRU lists from the zone to the node and related data such
    as counters, tracing, congestion tracking and writeback tracking.

    Unfortunately, due to reclaim and compaction retry logic, it is
    necessary to account for the number of LRU pages at both the zone and
    the node level. Most reclaim logic is based on the node counters but the retry
    logic uses the zone counters which do not distinguish inactive and
    active sizes. It would be possible to leave the LRU counters on a
    per-zone basis but it's a heavier calculation across multiple cache
    lines that is much more frequent than the retry checks.

    Other than the LRU counters, this is mostly a mechanical patch but note
    that it introduces a number of anomalies. For example, the scans are
    per-zone but using per-node counters. We also mark a node as congested
    when a zone is congested. This causes weird problems that are fixed
    later but is easier to review.

    In the event that there is excessive overhead on 32-bit systems due to
    the LRU lists now being node-based, there are two potential solutions:

    1. Long-term isolation of highmem pages when reclaim is lowmem

    When pages are skipped, they are immediately added back onto the LRU
    list. If lowmem reclaim persisted for long periods of time, the same
    highmem pages get continually scanned. The idea would be that lowmem
    keeps those pages on a separate list until a reclaim for highmem pages
    arrives that splices the highmem pages back onto the LRU. It potentially
    could be implemented similar to the UNEVICTABLE list.

    That would reduce the skip rate; the potential corner case is that
    highmem pages have to be scanned and reclaimed to free lowmem slab pages.

    2. Linear scan lowmem pages if the initial LRU shrink fails

    This will break LRU ordering but may be preferable and faster during
    memory pressure than skipping LRU pages.

    Link: http://lkml.kernel.org/r/1467970510-21195-4-git-send-email-mgorman@techsingularity.net
    Signed-off-by: Mel Gorman
    Acked-by: Johannes Weiner
    Acked-by: Vlastimil Babka
    Cc: Hillf Danton
    Cc: Joonsoo Kim
    Cc: Michal Hocko
    Cc: Minchan Kim
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     

27 Jul, 2016

2 commits

  • Here's a basic implementation of huge pages support for shmem/tmpfs.

    It's all pretty straightforward:

    - shmem_getpage() allocates a huge page if it can and tries to insert it
    into the radix tree with shmem_add_to_page_cache();

    - shmem_add_to_page_cache() puts the page onto the radix-tree if there's
    space for it;

    - shmem_undo_range() removes huge pages if they are fully within the
    range. Partial truncation of a huge page zeroes out that part of the THP.

    This has a visible effect on fallocate(FALLOC_FL_PUNCH_HOLE)
    behaviour. As we don't really create a hole in this case,
    lseek(SEEK_HOLE) may have inconsistent results depending on what
    pages happened to be allocated (see the hedged userspace sketch
    after this list);

    - no need to change shmem_fault(): core-mm will map a compound page as
    huge if the VMA is suitable;
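
    A hypothetical userspace illustration of the FALLOC_FL_PUNCH_HOLE /
    SEEK_HOLE interaction mentioned above (assumes a tmpfs mount with huge
    pages enabled; file name and sizes are made up):

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
            int fd = open("/dev/shm/testfile", O_RDWR | O_CREAT, 0600);
            char buf[4096] = { 1 };

            /* Populate 4 MiB so the file may be backed by huge pages. */
            for (off_t off = 0; off < (4 << 20); off += sizeof(buf))
                    pwrite(fd, buf, sizeof(buf), off);

            /* Punch a hole inside what may be a single huge page... */
            fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE, 0, 1 << 20);
            /* ...lseek(SEEK_HOLE) may or may not report it, as described above. */
            printf("first hole at %lld\n", (long long)lseek(fd, 0, SEEK_HOLE));
            close(fd);
            return 0;
    }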

    Link: http://lkml.kernel.org/r/1466021202-61880-30-git-send-email-kirill.shutemov@linux.intel.com
    Signed-off-by: Kirill A. Shutemov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • split_huge_pmd() doesn't guarantee that the pmd is a normal pmd pointing
    to pte entries, which can be checked with pmd_trans_unstable(). Some
    callers assert this, some do it differently, and some don't check at all,
    so let's do it in a unified manner.
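
    A hedged sketch of the unified pattern inside a page-table walker (the
    surrounding context is illustrative):

    split_huge_pmd(vma, pmd, addr);
    if (pmd_trans_unstable(pmd))
            return 0;       /* not a stable pte table (huge or being split): skip */
    /* ... only now is it safe to map and walk the pte entries under pmd ... */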

    Link: http://lkml.kernel.org/r/1464741400-12143-1-git-send-email-n-horiguchi@ah.jp.nec.com
    Signed-off-by: Naoya Horiguchi
    Cc: "Kirill A. Shutemov"
    Cc: Hugh Dickins
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi
     

20 May, 2016

3 commits

  • The allocator fast path looks up the first usable zone in a zonelist and
    then get_page_from_freelist does the same job in the zonelist iterator.
    This patch preserves the necessary information.

                                 4.6.0-rc2         4.6.0-rc2
                            fastmark-v1r20    initonce-v1r20
    Min alloc-odr0-1      364.00 (  0.00%)  359.00 (  1.37%)
    Min alloc-odr0-2      262.00 (  0.00%)  260.00 (  0.76%)
    Min alloc-odr0-4      214.00 (  0.00%)  214.00 (  0.00%)
    Min alloc-odr0-8      186.00 (  0.00%)  186.00 (  0.00%)
    Min alloc-odr0-16     173.00 (  0.00%)  173.00 (  0.00%)
    Min alloc-odr0-32     165.00 (  0.00%)  165.00 (  0.00%)
    Min alloc-odr0-64     161.00 (  0.00%)  162.00 ( -0.62%)
    Min alloc-odr0-128    159.00 (  0.00%)  161.00 ( -1.26%)
    Min alloc-odr0-256    168.00 (  0.00%)  170.00 ( -1.19%)
    Min alloc-odr0-512    180.00 (  0.00%)  181.00 ( -0.56%)
    Min alloc-odr0-1024   190.00 (  0.00%)  190.00 (  0.00%)
    Min alloc-odr0-2048   196.00 (  0.00%)  196.00 (  0.00%)
    Min alloc-odr0-4096   202.00 (  0.00%)  202.00 (  0.00%)
    Min alloc-odr0-8192   206.00 (  0.00%)  205.00 (  0.49%)
    Min alloc-odr0-16384  206.00 (  0.00%)  205.00 (  0.49%)

    The benefit is negligible and the results are within the noise but each
    cycle counts.

    Signed-off-by: Mel Gorman
    Cc: Vlastimil Babka
    Cc: Jesper Dangaard Brouer
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • This code was pretty obscure, relying upon obscure side-effects of
    next_node(-1, ...) and upon NUMA_NO_NODE being equal to -1.

    Clean that all up and document the function's intent.

    Acked-by: Vlastimil Babka
    Cc: Xishi Qiu
    Cc: Joonsoo Kim
    Cc: David Rientjes
    Cc: Naoya Horiguchi
    Cc: Laura Abbott
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     
  • Lots of code does

    node = next_node(node, XXX);
    if (node == MAX_NUMNODES)
            node = first_node(XXX);

    so create next_node_in() to do this and use it in various places.
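
    A hedged sketch of the helper's intent (the in-tree version is implemented
    slightly differently, via __next_node_in()):

    /* Wrap back to the first node of the mask when next_node() runs past the end. */
    static inline int next_node_in_sketch(int node, const nodemask_t mask)
    {
            int ret = next_node(node, mask);

            if (ret == MAX_NUMNODES)
                    ret = first_node(mask);
            return ret;
    }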

    [mhocko@suse.com: use next_node_in() helper]
    Acked-by: Vlastimil Babka
    Acked-by: Michal Hocko
    Signed-off-by: Michal Hocko
    Cc: Xishi Qiu
    Cc: Joonsoo Kim
    Cc: David Rientjes
    Cc: Naoya Horiguchi
    Cc: Laura Abbott
    Cc: Hui Zhu
    Cc: Wang Xiaoqiang
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     

21 Mar, 2016

1 commit

  • Pull x86 protection key support from Ingo Molnar:
    "This tree adds support for a new memory protection hardware feature
    that is available in upcoming Intel CPUs: 'protection keys' (pkeys).

    There's a background article at LWN.net:

    https://lwn.net/Articles/643797/

    The gist is that protection keys allow the encoding of
    user-controllable permission masks in the pte. So instead of having a
    fixed protection mask in the pte (which needs a system call to change
    and works on a per page basis), the user can map a (handful of)
    protection mask variants and can change the masks at runtime relatively
    cheaply, without having to change every single page in the affected
    virtual memory range.

    This allows the dynamic switching of the protection bits of large
    amounts of virtual memory, via user-space instructions. It also
    allows more precise control of MMU permission bits: for example the
    executable bit is separate from the read bit (see more about that
    below).

    This tree adds the MM infrastructure and low level x86 glue needed for
    that, plus it adds a high level API to make use of protection keys -
    if a user-space application calls:

    mmap(..., PROT_EXEC);

    or

    mprotect(ptr, sz, PROT_EXEC);

    (note PROT_EXEC-only, without PROT_READ/WRITE), the kernel will notice
    this special case, and will set a special protection key on this
    memory range. It also sets the appropriate bits in the Protection
    Keys User Rights (PKRU) register so that the memory becomes unreadable
    and unwritable.

    So using protection keys the kernel is able to implement 'true'
    PROT_EXEC on x86 CPUs: without protection keys PROT_EXEC implies
    PROT_READ as well. Unreadable executable mappings have security
    advantages: they cannot be read via information leaks to figure out
    ASLR details, nor can they be scanned for ROP gadgets - and they
    cannot be used by exploits for data purposes either.

    We know about no user-space code that relies on pure PROT_EXEC
    mappings today, but binary loaders could start making use of this new
    feature to map binaries and libraries in a more secure fashion.

    There is other pending pkeys work that offers more high level system
    call APIs to manage protection keys - but those are not part of this
    pull request.

    Right now there's a Kconfig that controls this feature
    (CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS) that is default enabled
    (like most x86 CPU feature enablement code that has no runtime
    overhead), but it's not user-configurable at the moment. If there's
    any serious problem with this then we can make it configurable and/or
    flip the default"
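
    A hypothetical userspace sketch of the PROT_EXEC-only mapping described in
    the quoted text above (whether it becomes truly unreadable depends on
    pkeys-capable hardware and this kernel support):

    #include <stdio.h>
    #include <sys/mman.h>

    int main(void)
    {
            /* PROT_EXEC only, deliberately without PROT_READ/PROT_WRITE. */
            void *p = mmap(NULL, 4096, PROT_EXEC,
                           MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

            if (p == MAP_FAILED)
                    perror("mmap");
            /* Reading from p here would fault on hardware with protection keys. */
            return 0;
    }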

    * 'mm-pkeys-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (38 commits)
    x86/mm/pkeys: Fix mismerge of protection keys CPUID bits
    mm/pkeys: Fix siginfo ABI breakage caused by new u64 field
    x86/mm/pkeys: Fix access_error() denial of writes to write-only VMA
    mm/core, x86/mm/pkeys: Add execute-only protection keys support
    x86/mm/pkeys: Create an x86 arch_calc_vm_prot_bits() for VMA flags
    x86/mm/pkeys: Allow kernel to modify user pkey rights register
    x86/fpu: Allow setting of XSAVE state
    x86/mm: Factor out LDT init from context init
    mm/core, x86/mm/pkeys: Add arch_validate_pkey()
    mm/core, arch, powerpc: Pass a protection key in to calc_vm_flag_bits()
    x86/mm/pkeys: Actually enable Memory Protection Keys in the CPU
    x86/mm/pkeys: Add Kconfig prompt to existing config option
    x86/mm/pkeys: Dump pkey from VMA in /proc/pid/smaps
    x86/mm/pkeys: Dump PKRU with other kernel registers
    mm/core, x86/mm/pkeys: Differentiate instruction fetches
    x86/mm/pkeys: Optimize fault handling in access_error()
    mm/core: Do not enforce PKEY permissions on remote mm access
    um, pkeys: Add UML arch_*_access_permitted() methods
    mm/gup, x86/mm/pkeys: Check VMAs and PTEs for protection keys
    x86/mm/gup: Simplify get_user_pages() PTE bit handling
    ...

    Linus Torvalds
     

18 Mar, 2016

1 commit

  • Kernel style prefers a single string over split strings when the string is
    'user-visible'.

    Miscellanea:

    - Add a missing newline
    - Realign arguments
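
    A small illustration of the style rule with a made-up message:

    /* split string: the user-visible message is hard to grep for */
    pr_warn("mempolicy: unable to allocate "
            "the policy node mask\n");

    /* preferred: one string, even if the line runs long */
    pr_warn("mempolicy: unable to allocate the policy node mask\n");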

    Signed-off-by: Joe Perches
    Acked-by: Tejun Heo [percpu]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joe Perches
     

16 Mar, 2016

1 commit

  • VM_HUGETLB and VM_MIXEDMAP VMAs need to be excluded to avoid compound
    pages being marked for migration and unexpected COWs when handling
    hugetlb faults.

    Thanks to Naoya Horiguchi for reminding me of these checks.

    Signed-off-by: Liang Chen
    Signed-off-by: Gavin Guo
    Suggested-by: Naoya Horiguchi
    Acked-by: David Rientjes
    Cc: SeongJae Park
    Cc: Rik van Riel
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Liang Chen
     

10 Mar, 2016

1 commit

  • We don't have native support of THP migration, so we have to split a huge
    page into small pages in order to migrate it to a different node. This
    includes PTE-mapped huge pages.

    I made a mistake in the refcounting patchset: we don't actually split a
    PTE-mapped huge page in queue_pages_pte_range() if we step on its head
    page.

    The result is that the head page is queued for migration, but none of the
    tail pages are: putting the head page on the queue takes a pin on the
    page, and any subsequent attempt at split_huge_page() would fail, so we
    skip queuing the tail pages.

    unmap_and_move_huge_page() will eventually split the huge page, but
    only one of the 512 pages would get migrated.

    Let's fix the situation.

    Fixes: 248db92da13f2507 ("migrate_pages: try to split pages on queuing")
    Signed-off-by: Kirill A. Shutemov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     

16 Feb, 2016

1 commit

  • We will soon modify the vanilla get_user_pages() so it can no
    longer be used on mm/tasks other than 'current/current->mm',
    which is by far the most common way it is called. For now,
    we allow the old-style calls, but warn when they are used.
    (implemented in previous patch)

    This patch switches all callers of:

    get_user_pages()
    get_user_pages_unlocked()
    get_user_pages_locked()

    to stop passing tsk/mm so they will no longer see the warnings.
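
    A hedged before/after sketch of a converted call site (prototypes of that
    era; the surrounding variables are illustrative):

    /* before: the task and mm were passed explicitly */
    ret = get_user_pages(current, current->mm, start, nr_pages,
                         write, force, pages, vmas);

    /* after: the call implicitly operates on current/current->mm */
    ret = get_user_pages(start, nr_pages, write, force, pages, vmas);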

    Signed-off-by: Dave Hansen
    Reviewed-by: Thomas Gleixner
    Cc: Andrea Arcangeli
    Cc: Andrew Morton
    Cc: Andy Lutomirski
    Cc: Borislav Petkov
    Cc: Brian Gerst
    Cc: Dave Hansen
    Cc: Denys Vlasenko
    Cc: H. Peter Anvin
    Cc: Kirill A. Shutemov
    Cc: Linus Torvalds
    Cc: Naoya Horiguchi
    Cc: Peter Zijlstra
    Cc: Rik van Riel
    Cc: Srikar Dronamraju
    Cc: Vlastimil Babka
    Cc: jack@suse.cz
    Cc: linux-mm@kvack.org
    Link: http://lkml.kernel.org/r/20160212210156.113E9407@viggo.jf.intel.com
    Signed-off-by: Ingo Molnar

    Dave Hansen
     

06 Feb, 2016

1 commit

  • Maybe I'm missing something, but I don't see a reason why we try to queue
    pages from non-migratable VMAs.

    This testcase steps on VM_BUG_ON_PAGE() in isolate_lru_page():

    #include
    #include
    #include
    #include
    #include

    #define SIZE 0x2000

    int foo;

    int main()
    {
            int fd;
            char *p;
            unsigned long mask = 2;

            fd = open("/dev/sg0", O_RDWR);
            p = mmap(NULL, SIZE, PROT_READ | PROT_WRITE, MAP_PRIVATE, fd, 0);
            /* Faultin pages */
            foo = p[0] + p[0x1000];
            mbind(p, SIZE, MPOL_BIND, &mask, 4, MPOL_MF_MOVE | MPOL_MF_STRICT);
            return 0;
    }

    The only case when we can queue pages from such a VMA is MPOL_MF_STRICT
    plus MPOL_MF_MOVE or MPOL_MF_MOVE_ALL for a VMA which has pages on the
    LRU, but whose gfp mask is not suitable for migration (see the
    mapping_gfp_mask() check in vma_migratable()). That looks like a bug to me.

    Let's filter out non-migratable VMAs at the start of
    queue_pages_test_walk() and go to queue_pages_pte_range() only if the
    MPOL_MF_MOVE or MPOL_MF_MOVE_ALL flag is set.

    Signed-off-by: Kirill A. Shutemov
    Signed-off-by: Dmitry Vyukov
    Cc: Vlastimil Babka
    Cc: David Rientjes
    Cc: Naoya Horiguchi
    Cc: Michal Hocko
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     

16 Jan, 2016

3 commits

  • MPOL_MF_LAZY is not visible from userspace since a720094ded8c ("mm:
    mempolicy: Hide MPOL_NOOP and MPOL_MF_LAZY from userspace for now"), but
    it should still skip non-migratable VMAs such as VM_IO, VM_PFNMAP, and
    VM_HUGETLB VMAs, and avoid useless overhead of minor faults.
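
    A hedged sketch of the kind of VMA skip described above, as it might
    appear in the mbind() walk (the exact in-tree condition may differ):

    if (vma->vm_flags & (VM_IO | VM_PFNMAP) || is_vm_hugetlb_page(vma))
            return 1;       /* skip: lazy migration would only add minor-fault overhead */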

    Signed-off-by: Liang Chen
    Signed-off-by: Gavin Guo
    Acked-by: Rik van Riel
    Cc: Mel Gorman
    Cc: Andi Kleen
    Cc: Vlastimil Babka
    Cc: David Rientjes
    Cc: Naoya Horiguchi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Liang Chen
     
  • We are not able to migrate THPs, so it's not enough to split only the
    PMD on migration -- we need to split the compound page under it too.

    Signed-off-by: Kirill A. Shutemov
    Tested-by: Aneesh Kumar K.V
    Acked-by: Jerome Marchand
    Cc: Sasha Levin
    Cc: Vlastimil Babka
    Cc: Andrea Arcangeli
    Cc: Hugh Dickins
    Cc: Dave Hansen
    Cc: Mel Gorman
    Cc: Rik van Riel
    Cc: Naoya Horiguchi
    Cc: Steve Capper
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Christoph Lameter
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • We are going to decouple splitting a THP PMD from splitting the
    underlying compound page.

    This patch renames the split_huge_page_pmd*() functions to
    split_huge_pmd*() to reflect the fact that they only split the PMD, not
    the page.

    Signed-off-by: Kirill A. Shutemov
    Tested-by: Sasha Levin
    Tested-by: Aneesh Kumar K.V
    Acked-by: Jerome Marchand
    Acked-by: Vlastimil Babka
    Cc: Andrea Arcangeli
    Cc: Hugh Dickins
    Cc: Dave Hansen
    Cc: Mel Gorman
    Cc: Rik van Riel
    Cc: Naoya Horiguchi
    Cc: Steve Capper
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Christoph Lameter
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     

15 Jan, 2016

1 commit

  • When running the SPECint_rate gcc benchmark on some very large boxes it
    was noticed that the system was spending lots of time in
    mpol_shared_policy_lookup(). The gamess benchmark can also show it and
    is what I mostly used to chase down the issue, since I found its setup
    easier.

    To be clear the binaries were on tmpfs because of disk I/O requirements.
    We then used text replication to avoid icache misses and having all the
    copies banging on the memory where the instruction code resides. This
    results in us hitting a bottleneck in mpol_shared_policy_lookup() since
    lookup is serialised by the shared_policy lock.

    I have only reproduced this on very large (3k+ cores) boxes. The
    problem starts showing up at just a few hundred ranks getting worse
    until it threatens to livelock once it gets large enough. For example
    on the gamess benchmark at 128 ranks this area consumes only ~1% of
    time, at 512 ranks it consumes nearly 13%, and at 2k ranks it is over
    90%.

    To alleviate the contention in this area I converted the spinlock to an
    rwlock. This allows a large number of lookups to happen simultaneously.
    The results were quite good, reducing this consumption at max ranks to
    around 2%.
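
    A hedged sketch of the conversion (illustrative; the real structure and
    lookup live in include/linux/mempolicy.h and mm/mempolicy.c, and the
    lookup helper name below is hypothetical):

    struct shared_policy {
            struct rb_root root;
            rwlock_t lock;          /* was: spinlock_t lock; */
    };

    /* Lookups may now proceed in parallel under the read lock... */
    read_lock(&sp->lock);
    pol = sp_lookup_and_get(sp, idx);       /* hypothetical helper name */
    read_unlock(&sp->lock);

    /* ...while inserts/replacements still take write_lock(&sp->lock) exclusively. */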

    [akpm@linux-foundation.org: tidy up code comments]
    Signed-off-by: Nathan Zimmer
    Acked-by: David Rientjes
    Acked-by: Vlastimil Babka
    Cc: Nadia Yvette Chambers
    Cc: Naoya Horiguchi
    Cc: Mel Gorman
    Cc: "Aneesh Kumar K.V"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nathan Zimmer
     

09 Sep, 2015

2 commits

  • alloc_pages_exact_node() was introduced in commit 6484eb3e2a81 ("page
    allocator: do not check NUMA node ID when the caller knows the node is
    valid") as an optimized variant of alloc_pages_node(), that doesn't
    fallback to current node for nid == NUMA_NO_NODE. Unfortunately the
    name of the function can easily suggest that the allocation is
    restricted to the given node and fails otherwise. In truth, the node is
    only preferred, unless __GFP_THISNODE is passed among the gfp flags.

    The misleading name has led to mistakes in the past, see for example
    commits 5265047ac301 ("mm, thp: really limit transparent hugepage
    allocation to local node") and b360edb43f8e ("mm, mempolicy:
    migrate_to_node should only migrate to node").

    Another issue with the name is that there's a family of
    alloc_pages_exact*() functions where 'exact' means exact size (instead
    of page order), which leads to more confusion.

    To prevent further mistakes, this patch effectively renames
    alloc_pages_exact_node() to __alloc_pages_node() to better convey that
    it's an optimized variant of alloc_pages_node() not intended for general
    usage. Both functions get described in comments.

    It has been also considered to really provide a convenience function for
    allocations restricted to a node, but the major opinion seems to be that
    __GFP_THISNODE already provides that functionality and we shouldn't
    duplicate the API needlessly. The number of users would be small
    anyway.

    Existing callers of alloc_pages_exact_node() are simply converted to
    call __alloc_pages_node(), with the exception of sba_alloc_coherent()
    which open-codes the check for NUMA_NO_NODE, so it is converted to use
    alloc_pages_node() instead. This means it no longer performs some
    VM_BUG_ON checks, and since the current check for nid in
    alloc_pages_node() uses a 'nid < 0' comparison (which includes
    NUMA_NO_NODE), it may hide wrong values which would be previously
    exposed.

    Both differences will be rectified by the next patch.

    To sum up, this patch makes no functional changes, except temporarily
    hiding potentially buggy callers. Restricting the checks in
    alloc_pages_node() is left for the next patch which can in turn expose
    more existing buggy callers.
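
    A hedged sketch of the resulting relationship between the two helpers
    (simplified from include/linux/gfp.h of that era):

    /* Optimized variant: the caller guarantees nid is a valid node id. */
    static inline struct page *__alloc_pages_node(int nid, gfp_t gfp_mask,
                                                  unsigned int order)
    {
            VM_BUG_ON(nid < 0 || nid >= MAX_NUMNODES);
            return __alloc_pages(gfp_mask, order, node_zonelist(nid, gfp_mask));
    }

    /* General variant: a negative nid (which includes NUMA_NO_NODE) falls back
     * to the nearest memory node, hiding wrong values as noted above. */
    static inline struct page *alloc_pages_node(int nid, gfp_t gfp_mask,
                                                unsigned int order)
    {
            if (nid < 0)
                    nid = numa_mem_id();
            return __alloc_pages_node(nid, gfp_mask, order);
    }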

    Signed-off-by: Vlastimil Babka
    Acked-by: Johannes Weiner
    Acked-by: Robin Holt
    Acked-by: Michal Hocko
    Acked-by: Christoph Lameter
    Acked-by: Michael Ellerman
    Cc: Mel Gorman
    Cc: David Rientjes
    Cc: Greg Thelen
    Cc: Aneesh Kumar K.V
    Cc: Pekka Enberg
    Cc: Joonsoo Kim
    Cc: Naoya Horiguchi
    Cc: Tony Luck
    Cc: Fenghua Yu
    Cc: Arnd Bergmann
    Cc: Benjamin Herrenschmidt
    Cc: Paul Mackerras
    Cc: Gleb Natapov
    Cc: Paolo Bonzini
    Cc: Thomas Gleixner
    Cc: Ingo Molnar
    Cc: "H. Peter Anvin"
    Cc: Cliff Whickman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     
  • This check was introduced as part of
    6f4576e3687 ("mempolicy: apply page table walker on queue_pages_range()")

    and was then duplicated by
    48684a65b4e ("mm: pagewalk: fix misbehavior of walk_page_range for vma(VM_PFNMAP)"),

    which reintroduced it earlier in queue_pages_test_walk().

    Signed-off-by: Aristeu Rozanski
    Acked-by: Naoya Horiguchi
    Acked-by: Kirill A. Shutemov
    Cc: Andrea Arcangeli
    Acked-by: Cyrill Gorcunov
    Cc: Dave Hansen
    Cc: Kirill A. Shutemov
    Cc: Pavel Emelyanov
    Cc: Benjamin Herrenschmidt
    Acked-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Aristeu Rozanski
     

05 Sep, 2015

1 commit

  • vma->vm_userfaultfd_ctx is yet another vma parameter that vma_merge
    must be aware of, so that we can merge vmas back the way they were
    originally before arming the userfaultfd on some memory range.

    Signed-off-by: Andrea Arcangeli
    Acked-by: Pavel Emelyanov
    Cc: Sanidhya Kashyap
    Cc: zhang.zhanghailiang@huawei.com
    Cc: "Kirill A. Shutemov"
    Cc: Andres Lagar-Cavilla
    Cc: Dave Hansen
    Cc: Paolo Bonzini
    Cc: Rik van Riel
    Cc: Mel Gorman
    Cc: Andy Lutomirski
    Cc: Hugh Dickins
    Cc: Peter Feiner
    Cc: "Dr. David Alan Gilbert"
    Cc: Johannes Weiner
    Cc: "Huangpeng (Peter)"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     

25 Jun, 2015

1 commit

  • Since commit 077fcf116c8c ("mm/thp: allocate transparent hugepages on
    local node"), we handle THP allocations on page fault in a special way -
    for non-interleave memory policies, the allocation is only attempted on
    the node local to the current CPU, if the policy's nodemask allows the
    node.

    This is motivated by the assumption that THP benefits cannot offset the
    cost of remote accesses, so it's better to fallback to base pages on the
    local node (which might still be available, while huge pages are not due
    to fragmentation) than to allocate huge pages on a remote node.

    The nodemask check prevents us from violating e.g. MPOL_BIND policies
    where the local node is not among the allowed nodes. However, the
    current implementation can still give surprising results for the
    MPOL_PREFERRED policy when the preferred node is different than the
    current CPU's local node.

    In such case we should honor the preferred node and not use the local
    node, which is what this patch does. If hugepage allocation on the
    preferred node fails, we fall back to base pages and don't try other
    nodes, with the same motivation as is done for the local node hugepage
    allocations. The patch also moves the MPOL_INTERLEAVE check around to
    simplify the hugepage specific test.

    The difference can be demonstrated using the in-tree transhuge-stress
    test on the following 2-node machine, where half of the memory on one
    node was occupied to show the difference.

    > numactl --hardware
    available: 2 nodes (0-1)
    node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 24 25 26 27 28 29 30 31 32 33 34 35
    node 0 size: 7878 MB
    node 0 free: 3623 MB
    node 1 cpus: 12 13 14 15 16 17 18 19 20 21 22 23 36 37 38 39 40 41 42 43 44 45 46 47
    node 1 size: 8045 MB
    node 1 free: 7818 MB
    node distances:
    node   0   1
      0:  10  21
      1:  21  10

    Before the patch:
    > numactl -p0 -C0 ./transhuge-stress
    transhuge-stress: 2.197 s/loop, 0.276 ms/page, 7249.168 MiB/s 7962 succeed, 0 failed, 1786 different pages

    > numactl -p0 -C12 ./transhuge-stress
    transhuge-stress: 2.962 s/loop, 0.372 ms/page, 5376.172 MiB/s 7962 succeed, 0 failed, 3873 different pages

    Number of successful THP allocations corresponds to free memory on node 0 in
    the first case and node 1 in the second case, i.e. -p parameter is ignored and
    cpu binding "wins".

    After the patch:
    > numactl -p0 -C0 ./transhuge-stress
    transhuge-stress: 2.183 s/loop, 0.274 ms/page, 7295.516 MiB/s 7962 succeed, 0 failed, 1760 different pages

    > numactl -p0 -C12 ./transhuge-stress
    transhuge-stress: 2.878 s/loop, 0.361 ms/page, 5533.638 MiB/s 7962 succeed, 0 failed, 1750 different pages

    > numactl -p1 -C0 ./transhuge-stress
    transhuge-stress: 4.628 s/loop, 0.581 ms/page, 3440.893 MiB/s 7962 succeed, 0 failed, 3918 different pages

    The -p parameter is respected regardless of cpu binding.

    > numactl -C0 ./transhuge-stress
    transhuge-stress: 2.202 s/loop, 0.277 ms/page, 7230.003 MiB/s 7962 succeed, 0 failed, 1750 different pages

    > numactl -C12 ./transhuge-stress
    transhuge-stress: 3.020 s/loop, 0.379 ms/page, 5273.324 MiB/s 7962 succeed, 0 failed, 3916 different pages

    Without -p parameter, hugepage restriction to CPU-local node works as before.

    Fixes: 077fcf116c8c ("mm/thp: allocate transparent hugepages on local node")
    Signed-off-by: Vlastimil Babka
    Cc: Aneesh Kumar K.V
    Acked-by: David Rientjes
    Cc: Kirill A. Shutemov
    Cc: Andrea Arcangeli
    Cc: Michal Hocko
    Cc: [4.0+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     

15 May, 2015

1 commit

  • NUMA balancing is meant to be disabled by default on UMA machines but
    the check is using nr_node_ids (highest node) instead of
    num_online_nodes (online nodes).

    The consequences are that a UMA machine with a node ID of 1 or higher
    will enable NUMA balancing. This will incur useless overhead due to
    minor faults, with the impact depending on the workload. This is the
    impact on the stats when running a kernel build on a single node machine
    whose node ID happened to be 1:

                                  vanilla   patched
    NUMA base PTE updates         5113158         0
    NUMA huge PMD updates             643         0
    NUMA page range updates       5442374         0
    NUMA hint faults              2109622         0
    NUMA hint local faults        2109622         0
    NUMA hint local percent           100       100
    NUMA pages migrated                 0         0
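
    A hedged sketch of the check change the description implies (the
    surrounding initialisation code is illustrative):

    /* before: enabled whenever the highest node id was above 0, even on UMA */
    if (nr_node_ids > 1)
            set_numabalancing_state(true);

    /* after: enable only when more than one node is actually online */
    if (num_online_nodes() > 1)
            set_numabalancing_state(true);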

    Signed-off-by: Mel Gorman
    Reviewed-by: Rik van Riel
    Cc: [3.8+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     

15 Apr, 2015

2 commits

  • Commit 077fcf116c8c ("mm/thp: allocate transparent hugepages on local
    node") restructured alloc_hugepage_vma() with the intent of only
    allocating transparent hugepages locally when there was not an effective
    interleave mempolicy.

    alloc_pages_exact_node() does not limit the allocation to the single node,
    however, but rather prefers it. This is because __GFP_THISNODE is not set
    which would cause the node-local nodemask to be passed. Without it, only
    a nodemask that prefers the local node is passed.

    Fix this by passing __GFP_THISNODE and falling back to small pages when
    the allocation fails.

    Commit 9f1b868a13ac ("mm: thp: khugepaged: add policy for finding target
    node") suffers from a similar problem for khugepaged, which is also fixed.

    Fixes: 077fcf116c8c ("mm/thp: allocate transparent hugepages on local node")
    Fixes: 9f1b868a13ac ("mm: thp: khugepaged: add policy for finding target node")
    Signed-off-by: David Rientjes
    Acked-by: Vlastimil Babka
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: Joonsoo Kim
    Cc: Johannes Weiner
    Cc: Mel Gorman
    Cc: Pravin Shelar
    Cc: Jarno Rajahalme
    Cc: Li Zefan
    Cc: Greg Thelen
    Cc: Tejun Heo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • migrate_to_node() is intended to migrate a page from one source node to
    a target node.

    Today, migrate_to_node() could end up migrating to any node, not only
    the target node. This is because the page migration allocator,
    new_node_page(), does not pass __GFP_THISNODE to
    alloc_pages_exact_node(). This causes the target node to be preferred
    but allows fallback to any other node in order of affinity.

    Prevent this by allocating with __GFP_THISNODE. If memory is not
    available, -ENOMEM will be returned as appropriate.

    Signed-off-by: David Rientjes
    Reviewed-by: Naoya Horiguchi
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     

14 Feb, 2015

1 commit

  • printk and friends can now format bitmaps using '%*pb[l]'. cpumask
    and nodemask also provide cpumask_pr_args() and nodemask_pr_args()
    respectively which can be used to generate the two printf arguments
    necessary to format the specified cpu/nodemask.
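
    A hedged usage sketch of the new format together with the helper macros
    (the mask variables are made up):

    nodemask_t nodes = NODE_MASK_NONE;

    node_set(0, nodes);
    node_set(2, nodes);
    pr_info("allowed nodes: %*pbl\n", nodemask_pr_args(&nodes)); /* "0,2" */
    pr_info("online cpus:   %*pb\n", cpumask_pr_args(cpu_online_mask));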

    Signed-off-by: Tejun Heo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tejun Heo
     

13 Feb, 2015

1 commit

  • With PROT_NONE, the traditional page table manipulation functions are
    sufficient.

    [andre.przywara@arm.com: fix compiler warning in pmdp_invalidate()]
    [akpm@linux-foundation.org: fix build with STRICT_MM_TYPECHECKS]
    Signed-off-by: Mel Gorman
    Acked-by: Linus Torvalds
    Acked-by: Aneesh Kumar
    Tested-by: Sasha Levin
    Cc: Benjamin Herrenschmidt
    Cc: Dave Jones
    Cc: Hugh Dickins
    Cc: Ingo Molnar
    Cc: Kirill Shutemov
    Cc: Paul Mackerras
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     

12 Feb, 2015

4 commits

  • walk_page_range() silently skips VMAs having VM_PFNMAP set, which leads
    to undesirable behaviour for the caller of walk_page_range(). For
    example, for pagemap_read(), when no callbacks are called against a
    VM_PFNMAP vma, pagemap_read() may prepare pagemap data for the next
    virtual address range at the wrong index. That could confuse and/or
    break userspace applications.

    This patch avoids this misbehavior caused by vma(VM_PFNMAP) as follows:
    - for pagemap_read(), which has its own ->pte_hole(), call the ->pte_hole()
    over vma(VM_PFNMAP);
    - for clear_refs and queue_pages, which have their own ->test_walk,
    just return 1 and skip vma(VM_PFNMAP). This is no problem because
    these are not interested in hole regions;
    - for other callers, just skip the vma(VM_PFNMAP) as the default behavior.

    Signed-off-by: Naoya Horiguchi
    Signed-off-by: Shiraz Hashim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi
     
  • queue_pages_range() does page table walking in its own way now, but there
    is some code duplication. This patch applies the generic page table
    walker to reduce lines of code.

    queue_pages_range() has to do some prechecking to determine whether we
    really walk over the vma or just skip it. Now we have the test_walk()
    callback in mm_walk for this purpose, so we can do this replacement
    cleanly. queue_pages_test_walk() depends not only on the current vma but
    also on the previous one, so queue_pages->prev is introduced to remember
    it.
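
    A hedged sketch of a ->test_walk() callback in the reworked walker (the
    skip condition is illustrative):

    static int queue_pages_test_walk_sketch(unsigned long start, unsigned long end,
                                            struct mm_walk *walk)
    {
            struct vm_area_struct *vma = walk->vma;

            if (vma->vm_flags & VM_PFNMAP)
                    return 1;       /* non-zero: skip this vma, continue the walk */
            return 0;               /* zero: walk this vma's page tables */
    }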

    Signed-off-by: Naoya Horiguchi
    Cc: "Kirill A. Shutemov"
    Cc: Andrea Arcangeli
    Cc: Cyrill Gorcunov
    Cc: Dave Hansen
    Cc: Kirill A. Shutemov
    Cc: Pavel Emelyanov
    Cc: Benjamin Herrenschmidt
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi
     
  • The previous commit ("mm/thp: Allocate transparent hugepages on local
    node") introduced alloc_hugepage_vma() to mm/mempolicy.c to perform a
    special policy for THP allocations. The function has the same interface
    as alloc_pages_vma(), shares a lot of boilerplate code and a long
    comment.

    This patch merges the hugepage special case into alloc_pages_vma(). The
    extra if condition should be a cheap enough price to pay. We also prevent
    a (however unlikely) race with parallel mems_allowed update, which could
    make hugepage allocation restart only within the fallback call to
    alloc_hugepage_vma() and not reconsider the special rule in
    alloc_hugepage_vma().

    Also by making sure mpol_cond_put(pol) is always called before actual
    allocation attempt, we can use a single exit path within the function.

    Also update the comment for missing node parameter and obsolete reference
    to mm_sem.

    Signed-off-by: Vlastimil Babka
    Cc: Aneesh Kumar K.V
    Cc: Kirill A. Shutemov
    Cc: Vlastimil Babka
    Cc: David Rientjes
    Cc: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     
  • This makes sure that we try to allocate hugepages from the local node if
    allowed by the mempolicy. If we can't, we fall back to small page
    allocation based on the mempolicy. This is based on the observation that
    allocating pages on the local node is more beneficial than allocating
    hugepages on a remote node.

    With this patch applied we may find transparent huge page allocation
    failures if the current node doesn't have enough free hugepages. Before
    this patch such failures resulted in us retrying the allocation on other
    nodes in the NUMA node mask.

    [akpm@linux-foundation.org: fix comment, add CONFIG_TRANSPARENT_HUGEPAGE dependency]
    Signed-off-by: Aneesh Kumar K.V
    Acked-by: Kirill A. Shutemov
    Acked-by: Vlastimil Babka
    Cc: David Rientjes
    Cc: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Aneesh Kumar K.V
     

20 Dec, 2014

1 commit

  • Pull vfs pile #3 from Al Viro:
    "Assorted fixes and patches from the last cycle"

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
    [regression] chunk lost from bd9b51
    vfs: make mounts and mountstats honor root dir like mountinfo does
    vfs: cleanup show_mountinfo
    init: fix read-write root mount
    unfuck binfmt_misc.c (broken by commit e6084d4)
    vm_area_operations: kill ->migrate()
    new helper: iter_is_iovec()
    move_extent_per_page(): get rid of unused w_flags
    lustre: get rid of playing with ->fs
    btrfs: filp_open() returns ERR_PTR() on failure, not NULL...

    Linus Torvalds
     

19 Dec, 2014

1 commit


17 Dec, 2014

1 commit


10 Oct, 2014

4 commits

  • PROT_NUMA VMAs are skipped to avoid problems distinguishing between
    present, prot_none and special entries. MPOL_MF_LAZY is not visible from
    userspace since commit a720094ded8c ("mm: mempolicy: Hide MPOL_NOOP and
    MPOL_MF_LAZY from userspace for now") but it should still skip VMAs the
    same way task_numa_work does.

    Signed-off-by: Mel Gorman
    Acked-by: Rik van Riel
    Acked-by: Hugh Dickins
    Acked-by: Peter Zijlstra
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • - get_vma_policy(task) is not safe if task != current, remove this
    argument.

    - get_vma_policy() no longer has callers outside of mempolicy.c,
    make it static.

    Signed-off-by: Oleg Nesterov
    Cc: KAMEZAWA Hiroyuki
    Cc: David Rientjes
    Cc: KOSAKI Motohiro
    Cc: Alexander Viro
    Cc: Cyrill Gorcunov
    Cc: "Eric W. Biederman"
    Cc: "Kirill A. Shutemov"
    Cc: Peter Zijlstra
    Cc: Hugh Dickins
    Cc: Andi Kleen
    Cc: Naoya Horiguchi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • Remove down_write(&mm->mmap_sem) in do_set_mempolicy(). This logic
    was never correct and it is no longer needed, see the previous patch.

    Signed-off-by: Oleg Nesterov
    Cc: KAMEZAWA Hiroyuki
    Cc: David Rientjes
    Cc: KOSAKI Motohiro
    Cc: Alexander Viro
    Cc: Cyrill Gorcunov
    Cc: "Eric W. Biederman"
    Cc: "Kirill A. Shutemov"
    Cc: Peter Zijlstra
    Cc: Hugh Dickins
    Cc: Andi Kleen
    Cc: Naoya Horiguchi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • Extract the code which looks for vma's policy from get_vma_policy()
    into the new helper, __get_vma_policy(). Export get_task_policy().

    Signed-off-by: Oleg Nesterov
    Cc: KAMEZAWA Hiroyuki
    Cc: David Rientjes
    Cc: KOSAKI Motohiro
    Cc: Alexander Viro
    Cc: Cyrill Gorcunov
    Cc: "Eric W. Biederman"
    Cc: "Kirill A. Shutemov"
    Cc: Peter Zijlstra
    Cc: Hugh Dickins
    Cc: Andi Kleen
    Cc: Naoya Horiguchi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov