16 Jan, 2016

3 commits

  • MPOL_MF_LAZY is not visible from userspace since a720094ded8c ("mm:
    mempolicy: Hide MPOL_NOOP and MPOL_MF_LAZY from userspace for now"), but
    it should still skip non-migratable VMAs, such as those with VM_IO,
    VM_PFNMAP, or VM_HUGETLB set, to avoid the useless overhead of minor
    faults.
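
    A hedged sketch of the kind of flags test involved (the real check also
    considers the VMA's protections, so treat this as illustration only):

        /* mappings like these cannot be (lazily) migrated; skip them */
        if (vma->vm_flags & (VM_IO | VM_PFNMAP | VM_HUGETLB))
                return 1;       /* tell the page table walker to skip this vma */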

    Signed-off-by: Liang Chen
    Signed-off-by: Gavin Guo
    Acked-by: Rik van Riel
    Cc: Mel Gorman
    Cc: Andi Kleen
    Cc: Vlastimil Babka
    Cc: David Rientjes
    Cc: Naoya Horiguchi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Liang Chen
     
  • We are not able to migrate THPs, which means it is not enough to split
    only the PMD on migration -- we need to split the compound page under it
    too.

    Signed-off-by: Kirill A. Shutemov
    Tested-by: Aneesh Kumar K.V
    Acked-by: Jerome Marchand
    Cc: Sasha Levin
    Cc: Vlastimil Babka
    Cc: Andrea Arcangeli
    Cc: Hugh Dickins
    Cc: Dave Hansen
    Cc: Mel Gorman
    Cc: Rik van Riel
    Cc: Naoya Horiguchi
    Cc: Steve Capper
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Christoph Lameter
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • We are going to decouple splitting a THP PMD from splitting the
    underlying compound page.

    This patch renames the split_huge_page_pmd*() functions to
    split_huge_pmd*() to reflect the fact that they do not imply page
    splitting, only PMD splitting.
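
    An illustrative call-site sketch (the argument order shown is an
    assumption based on the interfaces of this era, not a quote of the patch):

        /* before: the name suggested the compound page was split as well */
        split_huge_page_pmd(vma, address, pmd);

        /* after: only the PMD mapping is split; the compound page stays */
        split_huge_pmd(vma, pmd, address);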

    Signed-off-by: Kirill A. Shutemov
    Tested-by: Sasha Levin
    Tested-by: Aneesh Kumar K.V
    Acked-by: Jerome Marchand
    Acked-by: Vlastimil Babka
    Cc: Andrea Arcangeli
    Cc: Hugh Dickins
    Cc: Dave Hansen
    Cc: Mel Gorman
    Cc: Rik van Riel
    Cc: Naoya Horiguchi
    Cc: Steve Capper
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Christoph Lameter
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     

15 Jan, 2016

1 commit

  • When running the SPECint_rate gcc benchmark on some very large boxes, it
    was noticed that the system was spending lots of time in
    mpol_shared_policy_lookup(). The gamess benchmark can also show it, and
    it is what I mostly used to chase down the issue since I found its setup
    easier.

    To be clear, the binaries were on tmpfs because of disk I/O requirements.
    We then used text replication to avoid icache misses and to keep all the
    copies from banging on the memory where the instruction code resides.
    This results in us hitting a bottleneck in mpol_shared_policy_lookup(),
    since lookup is serialised by the shared_policy lock.

    I have only reproduced this on very large (3k+ core) boxes. The problem
    starts showing up at just a few hundred ranks and gets worse, until it
    threatens to livelock once the rank count is large enough. For example,
    on the gamess benchmark at 128 ranks this area consumes only ~1% of
    time, at 512 ranks it consumes nearly 13%, and at 2k ranks it is over
    90%.

    To alleviate the contention in this area I converted the spinlock to an
    rwlock. This allows a large number of lookups to happen simultaneously.
    The results were quite good, reducing this consumption at max ranks to
    around 2%.
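
    A minimal sketch of the conversion (the real patch also updates every
    site in mm/mempolicy.c that takes the shared_policy lock):

        struct shared_policy {
                struct rb_root root;
                rwlock_t lock;          /* was: spinlock_t lock; */
        };

        /* lookups can now run in parallel */
        read_lock(&sp->lock);
        sn = sp_lookup(sp, idx, idx + 1);
        read_unlock(&sp->lock);

        /* insertions and replacements still serialise against everyone */
        write_lock(&sp->lock);
        sp_insert(sp, new);
        write_unlock(&sp->lock);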

    [akpm@linux-foundation.org: tidy up code comments]
    Signed-off-by: Nathan Zimmer
    Acked-by: David Rientjes
    Acked-by: Vlastimil Babka
    Cc: Nadia Yvette Chambers
    Cc: Naoya Horiguchi
    Cc: Mel Gorman
    Cc: "Aneesh Kumar K.V"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nathan Zimmer
     

09 Sep, 2015

2 commits

  • alloc_pages_exact_node() was introduced in commit 6484eb3e2a81 ("page
    allocator: do not check NUMA node ID when the caller knows the node is
    valid") as an optimized variant of alloc_pages_node() that doesn't fall
    back to the current node for nid == NUMA_NO_NODE. Unfortunately the name
    of the function can easily suggest that the allocation is restricted to
    the given node and fails otherwise. In truth, the node is only
    preferred, unless __GFP_THISNODE is passed among the gfp flags.

    The misleading name has led to mistakes in the past, see for example
    commits 5265047ac301 ("mm, thp: really limit transparent hugepage
    allocation to local node") and b360edb43f8e ("mm, mempolicy:
    migrate_to_node should only migrate to node").

    Another issue with the name is that there's a family of
    alloc_pages_exact*() functions where 'exact' means exact size (instead
    of page order), which leads to more confusion.

    To prevent further mistakes, this patch effectively renames
    alloc_pages_exact_node() to __alloc_pages_node() to better convey that
    it's an optimized variant of alloc_pages_node() not intended for general
    usage. Both functions get described in comments.

    We also considered providing a real convenience function for allocations
    restricted to a node, but the prevailing opinion seems to be that
    __GFP_THISNODE already provides that functionality and we shouldn't
    duplicate the API needlessly. The number of users would be small anyway.

    Existing callers of alloc_pages_exact_node() are simply converted to
    call __alloc_pages_node(), with the exception of sba_alloc_coherent(),
    which open-codes the check for NUMA_NO_NODE and is therefore converted
    to use alloc_pages_node() instead. This means it no longer performs some
    VM_BUG_ON checks, and since the current check for nid in
    alloc_pages_node() uses a 'nid < 0' comparison (which includes
    NUMA_NO_NODE), it may hide wrong values that would previously have been
    exposed.

    Both differences will be rectified by the next patch.

    To sum up, this patch makes no functional changes, except temporarily
    hiding potentially buggy callers. Restricting the checks in
    alloc_pages_node() is left for the next patch which can in turn expose
    more existing buggy callers.

    Signed-off-by: Vlastimil Babka
    Acked-by: Johannes Weiner
    Acked-by: Robin Holt
    Acked-by: Michal Hocko
    Acked-by: Christoph Lameter
    Acked-by: Michael Ellerman
    Cc: Mel Gorman
    Cc: David Rientjes
    Cc: Greg Thelen
    Cc: Aneesh Kumar K.V
    Cc: Pekka Enberg
    Cc: Joonsoo Kim
    Cc: Naoya Horiguchi
    Cc: Tony Luck
    Cc: Fenghua Yu
    Cc: Arnd Bergmann
    Cc: Benjamin Herrenschmidt
    Cc: Paul Mackerras
    Cc: Gleb Natapov
    Cc: Paolo Bonzini
    Cc: Thomas Gleixner
    Cc: Ingo Molnar
    Cc: "H. Peter Anvin"
    Cc: Cliff Whickman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     
  • This check was introduced as part of commit 6f4576e3687 ("mempolicy:
    apply page table walker on queue_pages_range()") and was then duplicated
    by commit 48684a65b4e ("mm: pagewalk: fix misbehavior of walk_page_range
    for vma(VM_PFNMAP)"), which reintroduced it earlier, in
    queue_pages_test_walk().

    Signed-off-by: Aristeu Rozanski
    Acked-by: Naoya Horiguchi
    Acked-by: Kirill A. Shutemov
    Cc: Andrea Arcangeli
    Acked-by: Cyrill Gorcunov
    Cc: Dave Hansen
    Cc: Kirill A. Shutemov
    Cc: Pavel Emelyanov
    Cc: Benjamin Herrenschmidt
    Acked-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Aristeu Rozanski
     

05 Sep, 2015

1 commit

  • vma->vm_userfaultfd_ctx is yet another vma parameter that vma_merge must
    be aware of, so that we can merge vmas back the way they originally were
    before arming userfaultfd on some memory range.

    Signed-off-by: Andrea Arcangeli
    Acked-by: Pavel Emelyanov
    Cc: Sanidhya Kashyap
    Cc: zhang.zhanghailiang@huawei.com
    Cc: "Kirill A. Shutemov"
    Cc: Andres Lagar-Cavilla
    Cc: Dave Hansen
    Cc: Paolo Bonzini
    Cc: Rik van Riel
    Cc: Mel Gorman
    Cc: Andy Lutomirski
    Cc: Hugh Dickins
    Cc: Peter Feiner
    Cc: "Dr. David Alan Gilbert"
    Cc: Johannes Weiner
    Cc: "Huangpeng (Peter)"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     

25 Jun, 2015

1 commit

  • Since commit 077fcf116c8c ("mm/thp: allocate transparent hugepages on
    local node"), we handle THP allocations on page fault in a special way -
    for non-interleave memory policies, the allocation is only attempted on
    the node local to the current CPU, if the policy's nodemask allows the
    node.

    This is motivated by the assumption that THP benefits cannot offset the
    cost of remote accesses, so it's better to fall back to base pages on
    the local node (which might still be available, while huge pages are
    not, due to fragmentation) than to allocate huge pages on a remote node.

    The nodemask check prevents us from violating e.g. MPOL_BIND policies
    where the local node is not among the allowed nodes. However, the
    current implementation can still give surprising results for the
    MPOL_PREFERRED policy when the preferred node is different from the
    current CPU's local node.

    In such a case we should honor the preferred node and not use the local
    node, which is what this patch does. If hugepage allocation on the
    preferred node fails, we fall back to base pages and don't try other
    nodes, with the same motivation as for the local-node hugepage
    allocations. The patch also moves the MPOL_INTERLEAVE check around to
    simplify the hugepage-specific test.
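
    A hedged sketch of the node selection in the THP fault path (field names
    as in the mempolicy code of this era; the surrounding logic is omitted):

        int hpage_node = numa_node_id();        /* default: local node */

        if (pol->mode == MPOL_PREFERRED && !(pol->flags & MPOL_F_LOCAL))
                hpage_node = pol->v.preferred_node;

        /* try the hugepage only on hpage_node; on failure use base pages */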

    The difference can be demonstrated using the in-tree transhuge-stress
    test on the following 2-node machine, where half of the memory on one
    node was occupied in order to demonstrate the effect.

    > numactl --hardware
    available: 2 nodes (0-1)
    node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 24 25 26 27 28 29 30 31 32 33 34 35
    node 0 size: 7878 MB
    node 0 free: 3623 MB
    node 1 cpus: 12 13 14 15 16 17 18 19 20 21 22 23 36 37 38 39 40 41 42 43 44 45 46 47
    node 1 size: 8045 MB
    node 1 free: 7818 MB
    node distances:
    node   0   1
      0:  10  21
      1:  21  10

    Before the patch:
    > numactl -p0 -C0 ./transhuge-stress
    transhuge-stress: 2.197 s/loop, 0.276 ms/page, 7249.168 MiB/s 7962 succeed, 0 failed, 1786 different pages

    > numactl -p0 -C12 ./transhuge-stress
    transhuge-stress: 2.962 s/loop, 0.372 ms/page, 5376.172 MiB/s 7962 succeed, 0 failed, 3873 different pages

    The number of successful THP allocations corresponds to the free memory
    on node 0 in the first case and on node 1 in the second case, i.e. the
    -p parameter is ignored and cpu binding "wins".

    After the patch:
    > numactl -p0 -C0 ./transhuge-stress
    transhuge-stress: 2.183 s/loop, 0.274 ms/page, 7295.516 MiB/s 7962 succeed, 0 failed, 1760 different pages

    > numactl -p0 -C12 ./transhuge-stress
    transhuge-stress: 2.878 s/loop, 0.361 ms/page, 5533.638 MiB/s 7962 succeed, 0 failed, 1750 different pages

    > numactl -p1 -C0 ./transhuge-stress
    transhuge-stress: 4.628 s/loop, 0.581 ms/page, 3440.893 MiB/s 7962 succeed, 0 failed, 3918 different pages

    The -p parameter is respected regardless of cpu binding.

    > numactl -C0 ./transhuge-stress
    transhuge-stress: 2.202 s/loop, 0.277 ms/page, 7230.003 MiB/s 7962 succeed, 0 failed, 1750 different pages

    > numactl -C12 ./transhuge-stress
    transhuge-stress: 3.020 s/loop, 0.379 ms/page, 5273.324 MiB/s 7962 succeed, 0 failed, 3916 different pages

    Without -p parameter, hugepage restriction to CPU-local node works as before.

    Fixes: 077fcf116c8c ("mm/thp: allocate transparent hugepages on local node")
    Signed-off-by: Vlastimil Babka
    Cc: Aneesh Kumar K.V
    Acked-by: David Rientjes
    Cc: Kirill A. Shutemov
    Cc: Andrea Arcangeli
    Cc: Michal Hocko
    Cc: [4.0+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     

15 May, 2015

1 commit

  • NUMA balancing is meant to be disabled by default on UMA machines, but
    the check uses nr_node_ids (based on the highest possible node ID)
    instead of num_online_nodes() (the number of online nodes).
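
    A hedged sketch of the intended check (not the literal patch):

        /* before: also true on a UMA box whose only node has an id >= 1 */
        if (nr_node_ids > 1)
                set_numabalancing_state(true);

        /* after: enable balancing only when more than one node is online */
        if (num_online_nodes() > 1)
                set_numabalancing_state(true);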

    The consequence is that a UMA machine with a node ID of 1 or higher will
    enable NUMA balancing. This incurs useless overhead due to minor faults,
    with the impact depending on the workload. This is the impact on the
    stats when running a kernel build on a single-node machine whose node ID
    happened to be 1:

                                vanilla  patched
    NUMA base PTE updates       5113158        0
    NUMA huge PMD updates           643        0
    NUMA page range updates     5442374        0
    NUMA hint faults            2109622        0
    NUMA hint local faults      2109622        0
    NUMA hint local percent         100      100
    NUMA pages migrated               0        0

    Signed-off-by: Mel Gorman
    Reviewed-by: Rik van Riel
    Cc: [3.8+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     

15 Apr, 2015

2 commits

  • Commit 077fcf116c8c ("mm/thp: allocate transparent hugepages on local
    node") restructured alloc_hugepage_vma() with the intent of only
    allocating transparent hugepages locally when there was not an effective
    interleave mempolicy.

    alloc_pages_exact_node() does not limit the allocation to the single
    node, however, but rather prefers it. This is because __GFP_THISNODE,
    which would cause the node-local nodemask to be passed, is not set;
    without it, only a nodemask that prefers the local node is passed.

    Fix this by passing __GFP_THISNODE and falling back to small pages when
    the allocation fails.
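
    A hedged sketch of the strict-node attempt (gfp composition simplified;
    HPAGE_PMD_ORDER is the transparent hugepage order):

        /* attempt the hugepage strictly on the intended node ... */
        page = alloc_pages_exact_node(node, gfp | __GFP_THISNODE,
                                      HPAGE_PMD_ORDER);
        /* ... and on failure fall back to base pages via the normal path */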

    Commit 9f1b868a13ac ("mm: thp: khugepaged: add policy for finding target
    node") suffers from a similar problem for khugepaged, which is also fixed.

    Fixes: 077fcf116c8c ("mm/thp: allocate transparent hugepages on local node")
    Fixes: 9f1b868a13ac ("mm: thp: khugepaged: add policy for finding target node")
    Signed-off-by: David Rientjes
    Acked-by: Vlastimil Babka
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: Joonsoo Kim
    Cc: Johannes Weiner
    Cc: Mel Gorman
    Cc: Pravin Shelar
    Cc: Jarno Rajahalme
    Cc: Li Zefan
    Cc: Greg Thelen
    Cc: Tejun Heo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • migrate_to_node() is intended to migrate a page from one source node to
    a target node.

    Today, migrate_to_node() could end up migrating to any node, not only
    the target node. This is because the page migration allocator,
    new_node_page(), does not pass __GFP_THISNODE to
    alloc_pages_exact_node(). This causes the target node to be preferred,
    but allows fallback to any other node in order of affinity.

    Prevent this by allocating with __GFP_THISNODE. If memory is not
    available, -ENOMEM will be returned as appropriate.
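
    A hedged sketch of the allocation inside the migration callback (flags
    as commonly used for migrating user pages; not a quote of the patch):

        /* allocate the destination page strictly on the target node */
        return alloc_pages_exact_node(node,
                        GFP_HIGHUSER_MOVABLE | __GFP_THISNODE, 0);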

    Signed-off-by: David Rientjes
    Reviewed-by: Naoya Horiguchi
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     

14 Feb, 2015

1 commit

  • printk and friends can now format bitmaps using '%*pb[l]'. cpumask
    and nodemask also provide cpumask_pr_args() and nodemask_pr_args()
    respectively which can be used to generate the two printf arguments
    necessary to format the specified cpu/nodemask.
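
    For example, a small usage sketch (the mask here is hypothetical):

        nodemask_t nodes = NODE_MASK_NONE;

        node_set(0, nodes);
        node_set(2, nodes);
        pr_info("allowed nodes: %*pbl\n", nodemask_pr_args(&nodes));
        /* prints: allowed nodes: 0,2 */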

    Signed-off-by: Tejun Heo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tejun Heo
     

13 Feb, 2015

1 commit

  • With PROT_NONE, the traditional page table manipulation functions are
    sufficient.

    [andre.przywara@arm.com: fix compiler warning in pmdp_invalidate()]
    [akpm@linux-foundation.org: fix build with STRICT_MM_TYPECHECKS]
    Signed-off-by: Mel Gorman
    Acked-by: Linus Torvalds
    Acked-by: Aneesh Kumar
    Tested-by: Sasha Levin
    Cc: Benjamin Herrenschmidt
    Cc: Dave Jones
    Cc: Hugh Dickins
    Cc: Ingo Molnar
    Cc: Kirill Shutemov
    Cc: Paul Mackerras
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     

12 Feb, 2015

4 commits

  • walk_page_range() silently skips VMAs that have VM_PFNMAP set, which
    leads to undesirable behaviour for the caller of walk_page_range(). For
    example, when no callbacks are invoked for a VM_PFNMAP vma,
    pagemap_read() may prepare pagemap data for the next virtual address
    range at the wrong index. That could confuse and/or break userspace
    applications.

    This patch avoids the misbehaviour caused by vma(VM_PFNMAP) as follows:
    - for pagemap_read(), which has its own ->pte_hole(), call ->pte_hole()
    over vma(VM_PFNMAP),
    - for clear_refs and queue_pages, which have their own ->test_walk,
    just return 1 and skip vma(VM_PFNMAP). This is no problem because
    these are not interested in hole regions,
    - for other callers, just skip the vma(VM_PFNMAP) as the default
    behaviour.

    Signed-off-by: Naoya Horiguchi
    Signed-off-by: Shiraz Hashim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi
     
  • queue_pages_range() does page table walking in its own way now, but
    there is some code duplication. This patch applies the page table walker
    to reduce the line count.

    queue_pages_range() has to do some prechecking to determine whether we
    really walk over the vma or just skip it. Now we have the test_walk()
    callback in mm_walk for this purpose, so we can do this replacement
    cleanly. queue_pages_test_walk() depends not only on the current vma but
    also on the previous one, so queue_pages->prev is introduced to remember
    it.
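
    A hedged sketch of the resulting walker setup (struct layout per the
    mm_walk API of this era; field and callback names follow the commit's
    naming but are approximate):

        struct queue_pages {
                struct list_head *pagelist;
                unsigned long flags;
                nodemask_t *nmask;
                struct vm_area_struct *prev;    /* remembers the previous vma */
        };

        struct queue_pages qp = {
                .pagelist = pagelist,
                .flags    = flags,
                .nmask    = nodes,
                .prev     = NULL,
        };
        struct mm_walk queue_pages_walk = {
                .hugetlb_entry  = queue_pages_hugetlb,
                .pmd_entry      = queue_pages_pte_range,
                .test_walk      = queue_pages_test_walk,
                .mm             = mm,
                .private        = &qp,
        };

        walk_page_range(start, end, &queue_pages_walk);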

    Signed-off-by: Naoya Horiguchi
    Cc: "Kirill A. Shutemov"
    Cc: Andrea Arcangeli
    Cc: Cyrill Gorcunov
    Cc: Dave Hansen
    Cc: Kirill A. Shutemov
    Cc: Pavel Emelyanov
    Cc: Benjamin Herrenschmidt
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi
     
  • The previous commit ("mm/thp: Allocate transparent hugepages on local
    node") introduced alloc_hugepage_vma() to mm/mempolicy.c to perform a
    special policy for THP allocations. The function has the same interface
    as alloc_pages_vma(), shares a lot of boilerplate code and a long
    comment.

    This patch merges the hugepage special case into alloc_pages_vma. The
    extra if condition should be a cheap enough price to pay. We also
    prevent a (however unlikely) race with a parallel mems_allowed update,
    which could make hugepage allocation restart only within the fallback
    call to alloc_hugepage_vma() and not reconsider the special rule in
    alloc_hugepage_vma().

    Also, by making sure mpol_cond_put(pol) is always called before the
    actual allocation attempt, we can use a single exit path within the
    function.

    Also update the comment to document the missing node parameter and fix
    the obsolete reference to mm_sem.

    Signed-off-by: Vlastimil Babka
    Cc: Aneesh Kumar K.V
    Cc: Kirill A. Shutemov
    Cc: Vlastimil Babka
    Cc: David Rientjes
    Cc: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     
  • This makes sure that we try to allocate hugepages from the local node if
    allowed by the mempolicy. If we can't, we fall back to small page
    allocation based on the mempolicy. This is based on the observation that
    allocating pages on the local node is more beneficial than allocating
    hugepages on a remote node.

    With this patch applied we may see transparent huge page allocation
    failures if the current node doesn't have enough free huge pages.
    Before this patch such failures resulted in us retrying the allocation
    on other nodes in the NUMA node mask.

    [akpm@linux-foundation.org: fix comment, add CONFIG_TRANSPARENT_HUGEPAGE dependency]
    Signed-off-by: Aneesh Kumar K.V
    Acked-by: Kirill A. Shutemov
    Acked-by: Vlastimil Babka
    Cc: David Rientjes
    Cc: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Aneesh Kumar K.V
     

20 Dec, 2014

1 commit

  • Pull vfs pile #3 from Al Viro:
    "Assorted fixes and patches from the last cycle"

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
    [regression] chunk lost from bd9b51
    vfs: make mounts and mountstats honor root dir like mountinfo does
    vfs: cleanup show_mountinfo
    init: fix read-write root mount
    unfuck binfmt_misc.c (broken by commit e6084d4)
    vm_area_operations: kill ->migrate()
    new helper: iter_is_iovec()
    move_extent_per_page(): get rid of unused w_flags
    lustre: get rid of playing with ->fs
    btrfs: filp_open() returns ERR_PTR() on failure, not NULL...

    Linus Torvalds
     

19 Dec, 2014

1 commit


17 Dec, 2014

1 commit


10 Oct, 2014

8 commits

  • PROT_NUMA VMAs are skipped to avoid problems distinguishing between
    present, prot_none and special entries. MPOL_MF_LAZY is not visible from
    userspace since commit a720094ded8c ("mm: mempolicy: Hide MPOL_NOOP and
    MPOL_MF_LAZY from userspace for now") but it should still skip VMAs the
    same way task_numa_work does.

    Signed-off-by: Mel Gorman
    Acked-by: Rik van Riel
    Acked-by: Hugh Dickins
    Acked-by: Peter Zijlstra
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • - get_vma_policy(task) is not safe if task != current, remove this
    argument.

    - get_vma_policy() no longer has callers outside of mempolicy.c,
    make it static.

    Signed-off-by: Oleg Nesterov
    Cc: KAMEZAWA Hiroyuki
    Cc: David Rientjes
    Cc: KOSAKI Motohiro
    Cc: Alexander Viro
    Cc: Cyrill Gorcunov
    Cc: "Eric W. Biederman"
    Cc: "Kirill A. Shutemov"
    Cc: Peter Zijlstra
    Cc: Hugh Dickins
    Cc: Andi Kleen
    Cc: Naoya Horiguchi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • Remove down_write(&mm->mmap_sem) in do_set_mempolicy(). This logic
    was never correct and it is no longer needed, see the previous patch.

    Signed-off-by: Oleg Nesterov
    Cc: KAMEZAWA Hiroyuki
    Cc: David Rientjes
    Cc: KOSAKI Motohiro
    Cc: Alexander Viro
    Cc: Cyrill Gorcunov
    Cc: "Eric W. Biederman"
    Cc: "Kirill A. Shutemov"
    Cc: Peter Zijlstra
    Cc: Hugh Dickins
    Cc: Andi Kleen
    Cc: Naoya Horiguchi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • Extract the code which looks for vma's policy from get_vma_policy()
    into the new helper, __get_vma_policy(). Export get_task_policy().

    Signed-off-by: Oleg Nesterov
    Cc: KAMEZAWA Hiroyuki
    Cc: David Rientjes
    Cc: KOSAKI Motohiro
    Cc: Alexander Viro
    Cc: Cyrill Gorcunov
    Cc: "Eric W. Biederman"
    Cc: "Kirill A. Shutemov"
    Cc: Peter Zijlstra
    Cc: Hugh Dickins
    Cc: Andi Kleen
    Cc: Naoya Horiguchi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • 1. vma_policy_mof(task) is simply not safe unless task == current,
    it can race with do_exit()->mpol_put(). Remove this arg and update
    its single caller.

    2. vma can not be NULL, remove this check and simplify the code.

    Signed-off-by: Oleg Nesterov
    Cc: KAMEZAWA Hiroyuki
    Cc: David Rientjes
    Cc: KOSAKI Motohiro
    Cc: Alexander Viro
    Cc: Cyrill Gorcunov
    Cc: "Eric W. Biederman"
    Cc: "Kirill A. Shutemov"
    Cc: Peter Zijlstra
    Cc: Hugh Dickins
    Cc: Andi Kleen
    Cc: Naoya Horiguchi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • Cleanup + preparation. Every user of get_task_policy() calls it
    unconditionally, even if it is not going to use the result.

    get_task_policy() is cheap, but this still does not look clean; the code
    is also simpler if get_task_policy() is called only when the result is
    really needed.

    Note: I hope this is correct, but it is not clear why vma_policy_mof()
    doesn't fall back to get_task_policy() if ->get_policy() returns NULL.

    Signed-off-by: Oleg Nesterov
    Cc: KAMEZAWA Hiroyuki
    Cc: David Rientjes
    Cc: KOSAKI Motohiro
    Cc: Alexander Viro
    Cc: Cyrill Gorcunov
    Cc: "Eric W. Biederman"
    Cc: "Kirill A. Shutemov"
    Cc: Peter Zijlstra
    Cc: Hugh Dickins
    Cc: Andi Kleen
    Cc: Naoya Horiguchi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • Every caller of get_task_policy() falls back to default_policy if it
    returns NULL. Change get_task_policy() to do this.

    Signed-off-by: Oleg Nesterov
    Cc: KAMEZAWA Hiroyuki
    Cc: David Rientjes
    Cc: KOSAKI Motohiro
    Cc: Alexander Viro
    Cc: Cyrill Gorcunov
    Cc: "Eric W. Biederman"
    Cc: "Kirill A. Shutemov"
    Cc: Peter Zijlstra
    Cc: Hugh Dickins
    Cc: Andi Kleen
    Cc: Naoya Horiguchi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • Trivial cleanup. alloc_pages_vma() can use mpol_cond_put().

    Signed-off-by: Oleg Nesterov
    Cc: KAMEZAWA Hiroyuki
    Cc: David Rientjes
    Cc: KOSAKI Motohiro
    Cc: Alexander Viro
    Cc: Cyrill Gorcunov
    Cc: "Eric W. Biederman"
    Cc: "Kirill A. Shutemov"
    Cc: Peter Zijlstra
    Cc: Hugh Dickins
    Cc: Andi Kleen
    Cc: Naoya Horiguchi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     

11 Jul, 2014

1 commit

  • Pull cgroup fixes from Tejun Heo:
    "Mostly fixes for the fallouts from the recent cgroup core changes.

    The decoupled nature of cgroup dynamic hierarchy management
    (hierarchies are created dynamically on mount but may or may not be
    reused once unmounted depending on remaining usages) led to more
    ugliness being added to kernfs.

    Hopefully, this is the last of it"

    * 'for-3.16-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
    cpuset: break kernfs active protection in cpuset_write_resmask()
    cgroup: fix a race between cgroup_mount() and cgroup_kill_sb()
    kernfs: introduce kernfs_pin_sb()
    cgroup: fix mount failure in a corner case
    cpuset,mempolicy: fix sleeping function called from invalid context
    cgroup: fix broken css_has_online_children()

    Linus Torvalds
     

25 Jun, 2014

1 commit

  • When running with the kernel (3.15-rc7+), the following bug occurs:
    [ 9969.258987] BUG: sleeping function called from invalid context at kernel/locking/mutex.c:586
    [ 9969.359906] in_atomic(): 1, irqs_disabled(): 0, pid: 160655, name: python
    [ 9969.441175] INFO: lockdep is turned off.
    [ 9969.488184] CPU: 26 PID: 160655 Comm: python Tainted: G A 3.15.0-rc7+ #85
    [ 9969.581032] Hardware name: FUJITSU-SV PRIMEQUEST 1800E/SB, BIOS PRIMEQUEST 1000 Series BIOS Version 1.39 11/16/2012
    [ 9969.706052] ffffffff81a20e60 ffff8803e941fbd0 ffffffff8162f523 ffff8803e941fd18
    [ 9969.795323] ffff8803e941fbe0 ffffffff8109995a ffff8803e941fc58 ffffffff81633e6c
    [ 9969.884710] ffffffff811ba5dc ffff880405c6b480 ffff88041fdd90a0 0000000000002000
    [ 9969.974071] Call Trace:
    [ 9970.003403] [] dump_stack+0x4d/0x66
    [ 9970.065074] [] __might_sleep+0xfa/0x130
    [ 9970.130743] [] mutex_lock_nested+0x3c/0x4f0
    [ 9970.200638] [] ? kmem_cache_alloc+0x1bc/0x210
    [ 9970.272610] [] cpuset_mems_allowed+0x27/0x140
    [ 9970.344584] [] ? __mpol_dup+0x63/0x150
    [ 9970.409282] [] __mpol_dup+0xe5/0x150
    [ 9970.471897] [] ? __mpol_dup+0x63/0x150
    [ 9970.536585] [] ? copy_process.part.23+0x606/0x1d40
    [ 9970.613763] [] ? trace_hardirqs_on+0xd/0x10
    [ 9970.683660] [] ? monotonic_to_bootbased+0x2f/0x50
    [ 9970.759795] [] copy_process.part.23+0x670/0x1d40
    [ 9970.834885] [] do_fork+0xd8/0x380
    [ 9970.894375] [] ? __audit_syscall_entry+0x9c/0xf0
    [ 9970.969470] [] SyS_clone+0x16/0x20
    [ 9971.030011] [] stub_clone+0x69/0x90
    [ 9971.091573] [] ? system_call_fastpath+0x16/0x1b

    The cause is that cpuset_mems_allowed() tries to take
    mutex_lock(&callback_mutex) under rcu_read_lock (which is held in
    __mpol_dup()). Since the access to the cpuset in cpuset_mems_allowed()
    is itself under rcu_read_lock, we can shrink the rcu_read_lock
    protection region in __mpol_dup() to cover only the cpuset access in
    current_cpuset_is_being_rebound(), and thereby avoid this bug.

    This patch is a temporary solution that just addresses the bug mentioned
    above; it cannot fix the long-standing issue of cpuset.mems rebinding on
    fork():

    "When the forker's task_struct is duplicated (which includes
    ->mems_allowed) and it races with an update to cpuset_being_rebound
    in update_tasks_nodemask() then the task's mems_allowed doesn't get
    updated. And the child task's mems_allowed can be wrong if the
    cpuset's nodemask changes before the child has been added to the
    cgroup's tasklist."

    Signed-off-by: Gu Zheng
    Acked-by: Li Zefan
    Signed-off-by: Tejun Heo
    Cc: stable

    Gu Zheng
     

24 Jun, 2014

1 commit

  • In v2.6.34, commit 9d8cebd4bcd7 ("mm: fix mbind vma merge problem")
    introduced vma merging to mbind(), but it should have also changed the
    convention of passing start vma from queue_pages_range() (formerly
    check_range()) to new_vma_page(): vma merging may have already freed
    that structure, resulting in BUG at mm/mempolicy.c:1738 and probably
    worse crashes.

    Fixes: 9d8cebd4bcd7 ("mm: fix mbind vma merge problem")
    Reported-by: Naoya Horiguchi
    Tested-by: Naoya Horiguchi
    Signed-off-by: Hugh Dickins
    Acked-by: Christoph Lameter
    Cc: KOSAKI Motohiro
    Cc: Minchan Kim
    Cc: [2.6.34+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     

09 Jun, 2014

1 commit

  • Now that 3.15 is released, this merges the 'next' branch into 'master',
    bringing us to the normal situation where my 'master' branch is the
    merge window.

    * accumulated work in next: (6809 commits)
    ufs: sb mutex merge + mutex_destroy
    powerpc: update comments for generic idle conversion
    cris: update comments for generic idle conversion
    idle: remove cpu_idle() forward declarations
    nbd: zero from and len fields in NBD_CMD_DISCONNECT.
    mm: convert some level-less printks to pr_*
    MAINTAINERS: adi-buildroot-devel is moderated
    MAINTAINERS: add linux-api for review of API/ABI changes
    mm/kmemleak-test.c: use pr_fmt for logging
    fs/dlm/debug_fs.c: replace seq_printf by seq_puts
    fs/dlm/lockspace.c: convert simple_str to kstr
    fs/dlm/config.c: convert simple_str to kstr
    mm: mark remap_file_pages() syscall as deprecated
    mm: memcontrol: remove unnecessary memcg argument from soft limit functions
    mm: memcontrol: clean up memcg zoneinfo lookup
    mm/memblock.c: call kmemleak directly from memblock_(alloc|free)
    mm/mempool.c: update the kmemleak stack trace for mempool allocations
    lib/radix-tree.c: update the kmemleak stack trace for radix tree allocations
    mm: introduce kmemleak_update_trace()
    mm/kmemleak.c: use %u to print ->checksum
    ...

    Linus Torvalds
     

07 Jun, 2014

2 commits

  • printk is meant to be used with an associated log level. There are some
    instances of printk scattered around the mm code where the log level is
    missing. Add a log level and adhere to suggestions by
    scripts/checkpatch.pl by moving to the pr_* macros.

    Also add the typical pr_fmt definition so that print statements can be
    easily traced back to the modules where they occur, correlated one with
    another, etc. This will require the removal of some (now redundant)
    prefixes on a few print statements.

    Signed-off-by: Mitchel Humpherys
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mitchel Humpherys
     
  • The page table walker doesn't check non-present hugetlb entries in the
    common path, so hugetlb_entry() callbacks must check them. The reason
    for this behavior is that some callers want to handle such entries in
    their own way.

    [ I think that reason is bogus, btw - it should just do what the regular
    code does, which is to call the "pte_hole()" function for such hugetlb
    entries - Linus ]

    However, some callers don't check it now, which causes unpredictable
    results, for example when we have a race between migrating a hugepage
    and reading /proc/pid/numa_maps. This patch fixes it by adding
    !pte_present checks to the buggy callbacks.
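
    A hedged sketch of the kind of check added inside a hugetlb_entry()
    callback (the surrounding callback is omitted):

        pte_t entry = huge_ptep_get(pte);

        /* a hugepage under migration is not present; don't touch it */
        if (!pte_present(entry))
                return 0;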

    This bug has existed for years and became visible with the introduction
    of hugepage migration.

    ChangeLog v2:
    - fix if condition (check !pte_present() instead of pte_present())

    Reported-by: Sasha Levin
    Signed-off-by: Naoya Horiguchi
    Cc: Rik van Riel
    Cc: [3.12+]
    Signed-off-by: Andrew Morton
    [ Backported to 3.15. Signed-off-by: Josh Boyer ]
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi
     

05 Jun, 2014

4 commits

  • Memory migration uses a callback defined by the caller to determine how to
    allocate destination pages. When migration fails for a source page,
    however, it frees the destination page back to the system.

    This patch adds a memory migration callback defined by the caller to
    determine how to free destination pages. If a caller, such as memory
    compaction, builds its own freelist for migration targets, this can reuse
    already freed memory instead of scanning additional memory.

    If the caller provides a function to handle freeing of destination pages,
    it is called when page migration fails. If the caller passes NULL then
    freeing back to the system will be handled as usual. This patch
    introduces no functional change.
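
    A hedged sketch of the resulting interface (as it looked around this
    release; treat the exact parameter names as approximate):

        typedef struct page *new_page_t(struct page *page, unsigned long private,
                                        int **reason);
        typedef void free_page_t(struct page *page, unsigned long private);

        int migrate_pages(struct list_head *from, new_page_t get_new_page,
                          free_page_t put_new_page, unsigned long private,
                          enum migrate_mode mode, int reason);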

    Signed-off-by: David Rientjes
    Reviewed-by: Naoya Horiguchi
    Acked-by: Mel Gorman
    Acked-by: Vlastimil Babka
    Cc: Greg Thelen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • Also fixes kernel-doc warning

    Signed-off-by: Fabian Frederick
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Fabian Frederick
     
  • The nmask argument to set_mempolicy() is const according to the
    user-space header numaif.h, and since the kernel indeed does not modify
    it, it might as well be declared const in the kernel.
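
    A hedged sketch of the resulting declarations (the userspace prototype
    is from numaif.h; the kernel-side definition is shown without its body):

        /* userspace prototype */
        long set_mempolicy(int mode, const unsigned long *nmask,
                           unsigned long maxnode);

        /* kernel side now matches */
        SYSCALL_DEFINE3(set_mempolicy, int, mode,
                        const unsigned long __user *, nmask,
                        unsigned long, maxnode)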

    Signed-off-by: Rasmus Villemoes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rasmus Villemoes
     
  • The nmask argument to mbind() is const according to the userspace header
    numaif.h, and since the kernel indeed does not modify it, it might as
    well be declared const in the kernel.

    Signed-off-by: Rasmus Villemoes
    Acked-by: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rasmus Villemoes
     

08 Apr, 2014

2 commits

  • PF_MEMPOLICY is an unnecessary optimization for CONFIG_SLAB users.
    There's no significant performance degradation to checking
    current->mempolicy rather than current->flags & PF_MEMPOLICY in the
    allocation path, especially since this is considered unlikely().
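
    The idea, sketched (the real slab-allocation condition also involves
    cpuset slab spreading, so this is a simplification):

        /* before: a per-process flag mirrored whether a mempolicy was set */
        if (unlikely(current->flags & PF_MEMPOLICY))
                objp = alternate_node_alloc(cachep, flags);

        /* after: just test the mempolicy pointer directly */
        if (unlikely(current->mempolicy))
                objp = alternate_node_alloc(cachep, flags);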

    Running TCP_RR with netperf-2.4.5 through localhost on 16 cpu machine with
    64GB of memory and without a mempolicy:

    threads      before       after
         16     1249409     1244487
         32     1281786     1246783
         48     1239175     1239138
         64     1244642     1241841
         80     1244346     1248918
         96     1266436     1254316
        112     1307398     1312135
        128     1327607     1326502

    Per-process flags are a scarce resource so we should free them up whenever
    possible and make them available. We'll be using it shortly for memcg oom
    reserves.

    Signed-off-by: David Rientjes
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: KAMEZAWA Hiroyuki
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: Tejun Heo
    Cc: Mel Gorman
    Cc: Oleg Nesterov
    Cc: Rik van Riel
    Cc: Jianguo Wu
    Cc: Tim Hockin
    Cc: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • slab_node() is actually a mempolicy function, so rename it to
    mempolicy_slab_node() to make it clearer that it is used for processes
    with mempolicies.

    At the same time, cleanup its code by saving numa_mem_id() in a local
    variable (since we require a node with memory, not just any node) and
    remove an obsolete comment that assumes the mempolicy is actually passed
    into the function.

    Signed-off-by: David Rientjes
    Acked-by: Christoph Lameter
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: KAMEZAWA Hiroyuki
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: Tejun Heo
    Cc: Mel Gorman
    Cc: Oleg Nesterov
    Cc: Rik van Riel
    Cc: Jianguo Wu
    Cc: Tim Hockin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes