08 Apr, 2020

1 commit

  • commit aa9f7d5172fac9bf1f09e678c35e287a40a7b7dd upstream.

    Using an empty (malformed) nodelist that is not caught during mount option
    parsing leads to a stack-out-of-bounds access.

    The option string that was used was: "mpol=prefer:,". However,
    MPOL_PREFERRED requires a single node number, which is not being provided
    here.

    Add a check that 'nodes' is not empty after parsing for MPOL_PREFERRED's
    nodeid.
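
    A minimal sketch of the added validation, modeled on the MPOL_PREFERRED
    branch of mpol_parse_str() (context paraphrased rather than quoted from
    the patch):

    case MPOL_PREFERRED:
            /*
             * Insist on a nodelist of one node only: later code uses
             * first_node(nodes), so the parsed set must not be empty.
             */
            if (nodelist) {
                    char *rest = nodelist;
                    while (isdigit(*rest))
                            rest++;
                    if (*rest)
                            goto out;
                    if (nodes_empty(nodes))   /* the added check */
                            goto out;         /* reject "mpol=prefer:," */
            }
            break;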

    Fixes: 095f1fc4ebf3 ("mempolicy: rework shmem mpol parsing and display")
    Reported-by: Entropy Moe
    Reported-by: syzbot+b055b1a6b2b958707a21@syzkaller.appspotmail.com
    Signed-off-by: Randy Dunlap
    Signed-off-by: Andrew Morton
    Tested-by: syzbot+b055b1a6b2b958707a21@syzkaller.appspotmail.com
    Cc: Lee Schermerhorn
    Link: http://lkml.kernel.org/r/89526377-7eb6-b662-e1d8-4430928abde9@infradead.org
    Signed-off-by: Linus Torvalds
    Cc: Guenter Roeck
    Signed-off-by: Greg Kroah-Hartman

    Randy Dunlap
     

06 Feb, 2020

1 commit

  • commit c7a91bc7c2e17e0a9c8b9745a2cb118891218fd1 upstream.

    What we are trying to do is change the '=' character to a NUL terminator
    and then at the end of the function we restore it back to an '='. The
    problem is there are two error paths where we jump to the end of the
    function before we have replaced the '=' with NUL.

    We end up putting the '=' in the wrong place (possibly one element
    before the start of the buffer).
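
    In a simplified sketch of mpol_parse_str() (surrounding code
    abbreviated), the pattern at issue looks like this:

    char *nodelist = strchr(str, ':');
    char *flags = strchr(str, '=');

    /* ... nodelist parsing, with "goto out" error paths ... */

    if (flags)
            *flags++ = '\0';        /* terminate mode string */
    /* ... */
    out:
    if (flags)
            *--flags = '=';         /* restore string for error message */

    If an error path jumps to out before flags has been incremented past
    the '=', the restore writes one element too early. The fix moves the
    NUL write up, before those error paths, so the restore is always
    balanced.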

    Link: http://lkml.kernel.org/r/20200115055426.vdjwvry44nfug7yy@kili.mountain
    Reported-by: syzbot+e64a13c5369a194d67df@syzkaller.appspotmail.com
    Fixes: 095f1fc4ebf3 ("mempolicy: rework shmem mpol parsing and display")
    Signed-off-by: Dan Carpenter
    Acked-by: Vlastimil Babka
    Dmitry Vyukov
    Cc: Michal Hocko
    Cc: Dan Carpenter
    Cc: Lee Schermerhorn
    Cc: Andrea Arcangeli
    Cc: Hugh Dickins
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Dan Carpenter
     

16 Nov, 2019

1 commit

  • Commit d883544515aa ("mm: mempolicy: make the behavior consistent when
    MPOL_MF_MOVE* and MPOL_MF_STRICT were specified") fixed the return value
    of mbind() for a couple of corner cases. But it altered the errno for
    some other cases. For example, mbind() should return -EFAULT when part
    or all of the memory range specified by nodemask and maxnode points
    outside your accessible address space, or when there is an unmapped
    hole in the memory range specified by addr and len.

    Fix this by preserving the errno returned by queue_pages_range(). Also,
    the pagelist may not be empty even though queue_pages_range() returned
    an error; in that case, put the pages back on the LRU, since
    mbind_range() is not called to actually apply the policy, so those
    pages should not be migrated. This is also the old behavior before the
    problematic commit.
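
    A condensed sketch of the resulting do_mbind() flow (paraphrased; the
    surrounding code and labels are abbreviated):

    ret = queue_pages_range(mm, start, end, nmask,
                            flags | MPOL_MF_INVERT, &pagelist);
    if (ret < 0) {
            err = ret;              /* preserve -EFAULT etc.; was -EIO */
            goto up_out;
    }
    /* ... */
    up_out:
            if (!list_empty(&pagelist))
                    putback_movable_pages(&pagelist);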

    Link: http://lkml.kernel.org/r/1572454731-3925-1-git-send-email-yang.shi@linux.alibaba.com
    Fixes: d883544515aa ("mm: mempolicy: make the behavior consistent when MPOL_MF_MOVE* and MPOL_MF_STRICT were specified")
    Signed-off-by: Yang Shi
    Reported-by: Li Xinhai
    Reviewed-by: Li Xinhai
    Cc: Vlastimil Babka
    Cc: Michal Hocko
    Cc: Mel Gorman
    Cc: [4.19 and 5.2+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Yang Shi
     

29 Sep, 2019

4 commits

  • Merge hugepage allocation updates from David Rientjes:
    "We (mostly Linus, Andrea, and myself) have been discussing offlist how
    to implement a sane default allocation strategy for hugepages on NUMA
    platforms.

    With these reverts in place, the page allocator will happily allocate
    a remote hugepage immediately rather than try to make a local hugepage
    available. This incurs a substantial performance degradation when
    memory compaction would have otherwise made a local hugepage
    available.

    This series reverts those reverts and attempts to propose a more sane
    default allocation strategy specifically for hugepages. Andrea
    acknowledges this is likely to fix the swap storms that he originally
    reported that resulted in the patches that removed __GFP_THISNODE from
    hugepage allocations.

    The immediate goal is to return 5.3 to the behavior the kernel has
    implemented over the past several years so that remote hugepages are
    not immediately allocated when local hugepages could have been made
    available because the increased access latency is untenable.

    The next goal is to introduce a sane default allocation strategy for
    hugepages allocations in general regardless of the configuration of
    the system so that we prevent thrashing of local memory when
    compaction is unlikely to succeed and can prefer remote hugepages over
    remote native pages when the local node is low on memory."

    Note on timing: this reverts the hugepage VM behavior changes that got
    introduced fairly late in the 5.3 cycle, and that fixed a huge
    performance regression for certain loads that had been around since
    4.18.

    Andrea had this note:

    "The regression of 4.18 was that it was taking hours to start a VM
    where 3.10 was only taking a few seconds, I reported all the details
    on lkml when it was finally tracked down in August 2018.

    https://lore.kernel.org/linux-mm/20180820032640.9896-2-aarcange@redhat.com/

    __GFP_THISNODE in MADV_HUGEPAGE made the above enterprise vfio
    workload degrade like in the "current upstream" above. And it still
    would have been that bad as above until 5.3-rc5"

    where the bad behavior ends up happening as you fill up a local node,
    and without that change, you'd get into the nasty swap storm behavior
    due to compaction working overtime to make room for more memory on the
    nodes.

    As a result 5.3 got the two performance fix reverts in rc5.

    However, David Rientjes then noted that those performance fixes in turn
    regressed performance for other loads - although not quite to the same
    degree. He suggested reverting the reverts and instead replacing them
    with two small changes to how hugepage allocations are done (patch
    descriptions rephrased by me):

    - "avoid expensive reclaim when compaction may not succeed": just admit
    that the allocation failed when you're trying to allocate a huge-page
    and compaction wasn't successful.

    - "allow hugepage fallback to remote nodes when madvised": when that
    node-local huge-page allocation failed, retry without forcing the
    local node.

    but by then I judged it too late to replace the fixes for a 5.3 release.
    So 5.3 was released with behavior that harked back to the pre-4.18 logic.

    But now we're in the merge window for 5.4, and we can see if this
    alternate model fixes not just the horrendous swap storm behavior, but
    also restores the performance regression that the late reverts caused.

    Fingers crossed.

    * emailed patches from David Rientjes :
    mm, page_alloc: allow hugepage fallback to remote nodes when madvised
    mm, page_alloc: avoid expensive reclaim when compaction may not succeed
    Revert "Revert "Revert "mm, thp: consolidate THP gfp handling into alloc_hugepage_direct_gfpmask""
    Revert "Revert "mm, thp: restore node-local hugepage allocations""

    Linus Torvalds
     
  • For systems configured to always try hard to allocate transparent
    hugepages (thp defrag setting of "always") or for memory that has been
    explicitly madvised to MADV_HUGEPAGE, it is often better to fallback to
    remote memory to allocate the hugepage if the local allocation fails
    first.

    The point is to allow the initial call to __alloc_pages_node() to attempt
    to defragment local memory to make a hugepage available, if possible,
    rather than immediately fall back to remote memory. Local hugepages will
    always have a better access latency than remote (huge)pages, so an attempt
    to make a hugepage available locally is always preferred.

    If memory compaction cannot be successful locally, however, it is likely
    better to fallback to remote memory. This could take on two forms: either
    allow immediate fallback to remote memory or do per-zone watermark checks.
    It would be possible to fall back only when per-zone watermarks fail for
    order-0 memory, since that would require local reclaim for all subsequent
    faults so remote huge allocation is likely better than thrashing the local
    zone for large workloads.

    In this case, it is assumed that, because the system is configured to
    try hard to allocate hugepages or the vma has been explicitly madvised
    to try hard for hugepages, remote allocation is better when local
    allocation and memory compaction have both failed.
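
    A condensed sketch of the idea in the THP path of alloc_pages_vma()
    (paraphrased):

    /* first attempt: node-local only, so compaction can try to make a
     * local hugepage available */
    page = __alloc_pages_node(hpage_node, gfp | __GFP_THISNODE, order);
    /*
     * If defrag is "always" or the vma is madvised MADV_HUGEPAGE (so the
     * gfp mask allows direct reclaim), retry allowing remote nodes
     * rather than thrashing the local zone.
     */
    if (!page && (gfp & __GFP_DIRECT_RECLAIM))
            page = __alloc_pages_node(hpage_node, gfp, order);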

    Signed-off-by: David Rientjes
    Cc: Andrea Arcangeli
    Cc: Michal Hocko
    Cc: Mel Gorman
    Cc: Vlastimil Babka
    Cc: Stefan Priebe - Profihost AG
    Cc: "Kirill A. Shutemov"
    Cc: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • This reverts commit 92717d429b38e4f9f934eed7e605cc42858f1839.

    Since commit a8282608c88e ("Revert "mm, thp: restore node-local hugepage
    allocations"") is reverted in this series, it is better to restore the
    previous 5.2 behavior between the thp allocation and the page allocator
    rather than to attempt any consolidation or cleanup for a policy that is
    now reverted. It's less risky during an rc cycle and subsequent patches
    in this series further modify the same policy that the pre-5.3 behavior
    implements.

    Consolidation and cleanup can be done subsequent to a sane default page
    allocation strategy, so this patch reverts a cleanup done on a strategy
    that is now reverted and thus is the least risky option.

    Signed-off-by: David Rientjes
    Cc: Andrea Arcangeli
    Cc: Michal Hocko
    Cc: Mel Gorman
    Cc: Vlastimil Babka
    Cc: Stefan Priebe - Profihost AG
    Cc: "Kirill A. Shutemov"
    Cc: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • This reverts commit a8282608c88e08b1782141026eab61204c1e533f.

    The commit references the original intended semantic for MADV_HUGEPAGE
    which has subsequently taken on three unique purposes:

    - enables or disables thp for a range of memory depending on the system's
    config (is thp "enabled" set to "always" or "madvise"),

    - determines the synchronous compaction behavior for thp allocations at
    fault (is thp "defrag" set to "always", "defer+madvise", or "madvise"),
    and

    - reverts a previous MADV_NOHUGEPAGE (there is no madvise mode to only
    clear previous hugepage advice).

    These are the three purposes that currently exist in 5.2 and over the
    past several years that userspace has been written around. Adding a
    NUMA locality preference adds a fourth dimension to an already conflated
    advice mode.

    Based on the semantic that MADV_HUGEPAGE has provided over the past
    several years, there exist workloads that use the tunable based on these
    principles: specifically that the allocation should attempt to
    defragment a local node before falling back. It is agreed that remote
    hugepages typically (but not always) have a better access latency than
    remote native pages, although on Naples this is at parity for
    intersocket.

    The revert commit that this patch reverts allows hugepage allocation to
    immediately allocate remotely when local memory is fragmented. This is
    contrary to the semantic of MADV_HUGEPAGE over the past several years:
    that is, memory compaction should be attempted locally before falling
    back.

    The performance degradation of remote hugepages over local hugepages on
    Rome, for example, is 53.5% increased access latency. For this reason,
    the goal is to revert back to the 5.2 and previous behavior that would
    attempt local defragmentation before falling back. With the patch that
    is reverted by this patch, we see performance degradations at the tail
    because the allocator happily allocates the remote hugepage rather than
    even attempting to make a local hugepage available.

    zone_reclaim_mode is not a solution to this problem since it does not
    only impact hugepage allocations but rather changes the memory
    allocation strategy for *all* page allocations.

    Signed-off-by: David Rientjes
    Cc: Andrea Arcangeli
    Cc: Michal Hocko
    Cc: Mel Gorman
    Cc: Vlastimil Babka
    Cc: Stefan Priebe - Profihost AG
    Cc: "Kirill A. Shutemov"
    Cc: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     

26 Sep, 2019

1 commit

    This patch is part of a series that extends the kernel ABI to allow
    passing tagged user pointers (with the top byte set to something other
    than 0x00) as syscall arguments.

    This patch allows tagged pointers to be passed to the following memory
    syscalls: get_mempolicy, madvise, mbind, mincore, mlock, mlock2, mprotect,
    mremap, msync, munlock, move_pages.

    The mmap and mremap syscalls do not currently accept tagged addresses.
    Architectures may interpret the tag as a background colour for the
    corresponding vma.
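
    The mechanical change in each listed syscall is a single untagging of
    the incoming user address; e.g. for mbind() (sketch):

    static long kernel_mbind(unsigned long start, unsigned long len,
                             unsigned long mode,
                             const unsigned long __user *nmask,
                             unsigned long maxnode, unsigned int flags)
    {
            /* ... */
            start = untagged_addr(start);   /* strip the top-byte tag */
            /* ... */
    }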

    Link: http://lkml.kernel.org/r/aaf0c0969d46b2feb9017f3e1b3ef3970b633d91.1563904656.git.andreyknvl@google.com
    Signed-off-by: Andrey Konovalov
    Reviewed-by: Khalid Aziz
    Reviewed-by: Vincenzo Frascino
    Reviewed-by: Catalin Marinas
    Reviewed-by: Kees Cook
    Cc: Al Viro
    Cc: Dave Hansen
    Cc: Eric Auger
    Cc: Felix Kuehling
    Cc: Jens Wiklander
    Cc: Mauro Carvalho Chehab
    Cc: Mike Rapoport
    Cc: Will Deacon
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrey Konovalov
     

25 Sep, 2019

1 commit

    1) task_nodes = cpuset_mems_allowed(current);
       -> cpuset_mems_allowed() is guaranteed to return some non-empty
          subset of node_states[N_MEMORY].

    2) nodes_and(*new, *new, task_nodes);
       -> after nodes_and(), 'new' is either empty or an appropriate
          nodemask (online nodes with memory).

    After 1) and 2), we can remove the unnecessary check of whether 'new'
    AND node_states[N_MEMORY] is empty.
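
    Sketch of the resulting sequence in kernel_migrate_pages()
    (paraphrased; labels are illustrative):

    task_nodes = cpuset_mems_allowed(current);
    /* task_nodes is a non-empty subset of node_states[N_MEMORY] */
    nodes_and(*new, *new, task_nodes);
    if (nodes_empty(*new))
            goto out_put;
    /*
     * The removed follow-up (nodes_and(*new, *new,
     * node_states[N_MEMORY]) plus another nodes_empty() check) was a
     * no-op here: *new is already a subset of node_states[N_MEMORY].
     */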

    Link: http://lkml.kernel.org/r/20190806023634.55356-1-wangkefeng.wang@huawei.com
    Signed-off-by: Kefeng Wang
    Acked-by: Vlastimil Babka
    Cc: Andrea Arcangeli
    Cc: Dan Williams
    Cc: Michal Hocko
    Cc: Oscar Salvador
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kefeng Wang
     

07 Sep, 2019

2 commits

    The mm_walk structure currently mixes data and code. Split out the
    operations vector into a new mm_walk_ops structure, and while we are
    changing the API, also declare the mm_walk structure inside the
    walk_page_range and walk_page_vma functions.

    Based on patch from Linus Torvalds.
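
    The resulting interface looks roughly like this (member list
    abbreviated):

    struct mm_walk_ops {
            int (*pmd_entry)(pmd_t *pmd, unsigned long addr,
                             unsigned long next, struct mm_walk *walk);
            int (*pte_entry)(pte_t *pte, unsigned long addr,
                             unsigned long next, struct mm_walk *walk);
            /* ... further entry points ... */
    };

    /* callers pass a (typically static const) ops vector and a private
     * pointer; struct mm_walk itself is now an internal detail */
    int walk_page_range(struct mm_struct *mm, unsigned long start,
                        unsigned long end, const struct mm_walk_ops *ops,
                        void *private);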

    Link: https://lore.kernel.org/r/20190828141955.22210-3-hch@lst.de
    Signed-off-by: Christoph Hellwig
    Reviewed-by: Thomas Hellstrom
    Reviewed-by: Steven Price
    Reviewed-by: Jason Gunthorpe
    Signed-off-by: Jason Gunthorpe

    Christoph Hellwig
     
    Add a new header for the handful of users of the walk_page_range /
    walk_page_vma interface instead of polluting all users of mm.h with it.

    Link: https://lore.kernel.org/r/20190828141955.22210-2-hch@lst.de
    Signed-off-by: Christoph Hellwig
    Reviewed-by: Thomas Hellstrom
    Reviewed-by: Steven Price
    Reviewed-by: Jason Gunthorpe
    Signed-off-by: Jason Gunthorpe

    Christoph Hellwig
     

14 Aug, 2019

4 commits

  • This reverts commit 2f0799a0ffc033b ("mm, thp: restore node-local
    hugepage allocations").

    Commit 2f0799a0ffc033b was rightfully applied to avoid the risk of a
    severe regression that was reported by the kernel test robot at the end
    of the merge window. We now understand the regression was a false
    positive, caused by a significant increase in fairness during a
    swap-thrashing benchmark. So it's safe to re-apply the fix and continue
    improving the code from there. The benchmark that reported the
    regression is very useful, but it provides a meaningful result only
    when there is no significant alteration in fairness during the
    workload. The removal of __GFP_THISNODE increased fairness.

    __GFP_THISNODE cannot be used in the generic page faults path for new
    memory allocations under the MPOL_DEFAULT mempolicy, or the allocation
    behavior significantly deviates from what the MPOL_DEFAULT semantics are
    supposed to be for THP and 4k allocations alike.

    Setting THP defrag to "always" or using MADV_HUGEPAGE (with THP defrag
    set to "madvise") was never meant to provide an implicit MPOL_BIND on
    the "current" node the task is running on; doing so caused swap storms
    and provided a much more aggressive behavior than even
    zone_reclaim_mode = 3.

    Any workload that could have benefited from __GFP_THISNODE now has to
    enable zone_reclaim_mode=1||2||3. __GFP_THISNODE implicitly provided
    the zone_reclaim_mode behavior, but it only did so if THP was enabled:
    if THP was disabled, there would have been no chance to get any 4k page
    from the current node if the current node was full of pagecache, which
    further shows how this __GFP_THISNODE was misplaced in MADV_HUGEPAGE.
    MADV_HUGEPAGE has never been intended to provide any zone_reclaim_mode
    semantics, in fact the two are orthogonal, zone_reclaim_mode = 1|2|3
    must work exactly the same with MADV_HUGEPAGE set or not.

    The performance characteristic of memory depends on the hardware
    details. The numbers below are obtained on Naples/EPYC architecture and
    the N/A projection extends them to show what we should aim for in the
    future as a good THP NUMA locality default. The benchmark used
    exercises random memory seeks (note: the cost of the page faults is not
    part of the measurement).

    D0 THP | D0 4k | D1 THP | D1 4k | D2 THP | D2 4k | D3 THP | D3 4k | ...
    0% | +43% | +45% | +106% | +131% | +224% | N/A | N/A

    D0 means distance zero (i.e. local memory), D1 means distance one (i.e.
    intra socket memory), D2 means distance two (i.e. inter socket memory),
    etc...

    For the guest physical memory allocated by qemu and for guest mode
    kernel the performance characteristic of RAM is more complex and an
    ideal default could be:

    D0 THP | D1 THP | D0 4k | D2 THP | D1 4k | D3 THP | D2 4k | D3 4k | ...
    0% | +58% | +101% | N/A | +222% | N/A | N/A | N/A

    NOTE: the N/A are projections and haven't been measured yet, the
    measurement in this case is done on a 1950x with only two NUMA nodes.
    The THP case here means THP was used both in the host and in the guest.

    After applying this commit the THP NUMA locality order that we'll get
    out of MADV_HUGEPAGE is this:

    D0 THP | D1 THP | D2 THP | D3 THP | ... | D0 4k | D1 4k | D2 4k | D3 4k | ...

    Before this commit it was:

    D0 THP | D0 4k | D1 4k | D2 4k | D3 4k | ...

    Even if we ignore the breakage of large workloads that can't fit in a
    single node that the __GFP_THISNODE implicit "current node" mbind
    caused, the THP NUMA locality order provided by __GFP_THISNODE was still
    not the one we shall aim for in the long term (i.e. the first one at
    the top).

    After this commit is applied, we can introduce a new multi-order
    allocator API and replace those two alloc_pages_vma calls in the page
    fault path with a single multi-order call:

    unsigned int order = (1 << HPAGE_PMD_ORDER) | (1 << 0);
    page = alloc_pages_multi_order(..., &order);
    if (!page)
            goto out;
    if (!(order & (1 << 0))) {
            VM_WARN_ON(order != 1 << HPAGE_PMD_ORDER);
            /* THP fault */
    } else {
            VM_WARN_ON(order != 1 << 0);
            /* 4k fallback */
    }

    The page allocator logic has to be altered so that when it fails on any
    zone with order 9, it tries again with an order-0 allocation before
    falling back to the next zone in the zonelist.

    After that we need to do more measurements and evaluate if adding an
    opt-in feature for guest mode is worth it, to swap "DN 4k | DN+1 THP"
    with "DN+1 THP | DN 4k" at every NUMA distance crossing.

    Link: http://lkml.kernel.org/r/20190503223146.2312-3-aarcange@redhat.com
    Signed-off-by: Andrea Arcangeli
    Acked-by: Michal Hocko
    Acked-by: Mel Gorman
    Cc: Vlastimil Babka
    Cc: David Rientjes
    Cc: Zi Yan
    Cc: Stefan Priebe - Profihost AG
    Cc: "Kirill A. Shutemov"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     
  • Patch series "reapply: relax __GFP_THISNODE for MADV_HUGEPAGE mappings".

    The fixes for what was originally reported as "pathological THP
    behavior" were rightfully reverted to be sure not to introduce
    regressions at the end of a merge window after a severe regression
    report from the kernel bot. We can safely re-apply them now that we
    have had time to analyze the problem.

    The mm process worked fine, because the good fixes were eventually
    committed upstream without excessive delay.

    The regression reported by the kernel bot, however, forced us to revert
    the good fixes to be sure not to introduce regressions and to give us
    time to analyze the issue further. The silver lining is that this
    extra time allowed us to think more about this issue and also to plan
    a future direction to improve things further in terms of THP NUMA
    locality.

    This patch (of 2):

    This reverts commit 356ff8a9a78fb35d ("Revert "mm, thp: consolidate THP
    gfp handling into alloc_hugepage_direct_gfpmask"). So it reapplies
    89c83fb539f954 ("mm, thp: consolidate THP gfp handling into
    alloc_hugepage_direct_gfpmask").

    Consolidating the THP allocation flags in one place was meant to be a
    cleanup that makes otherwise scattered, maintenance-burdening code
    easier to handle. There were no real problems observed with the gfp
    mask consolidation, but the reversion was rushed through without a
    larger consensus regardless.

    This patch brings the consolidation back because this should make the
    long term maintainability easier as well as it should allow future
    changes to be less error prone.

    [mhocko@kernel.org: changelog additions]
    Link: http://lkml.kernel.org/r/20190503223146.2312-2-aarcange@redhat.com
    Signed-off-by: Andrea Arcangeli
    Acked-by: Michal Hocko
    Cc: Mel Gorman
    Cc: Vlastimil Babka
    Cc: David Rientjes
    Cc: Zi Yan
    Cc: Stefan Priebe - Profihost AG
    Cc: "Kirill A. Shutemov"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     
    When running syzkaller internally, we ran into the below bug on a 4.9.x
    kernel:

    kernel BUG at mm/huge_memory.c:2124!
    invalid opcode: 0000 [#1] SMP KASAN
    CPU: 0 PID: 1518 Comm: syz-executor107 Not tainted 4.9.168+ #2
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 0.5.1 01/01/2011
    task: ffff880067b34900 task.stack: ffff880068998000
    RIP: split_huge_page_to_list+0x8fb/0x1030 mm/huge_memory.c:2124
    Call Trace:
    split_huge_page include/linux/huge_mm.h:100 [inline]
    queue_pages_pte_range+0x7e1/0x1480 mm/mempolicy.c:538
    walk_pmd_range mm/pagewalk.c:50 [inline]
    walk_pud_range mm/pagewalk.c:90 [inline]
    walk_pgd_range mm/pagewalk.c:116 [inline]
    __walk_page_range+0x44a/0xdb0 mm/pagewalk.c:208
    walk_page_range+0x154/0x370 mm/pagewalk.c:285
    queue_pages_range+0x115/0x150 mm/mempolicy.c:694
    do_mbind mm/mempolicy.c:1241 [inline]
    SYSC_mbind+0x3c3/0x1030 mm/mempolicy.c:1370
    SyS_mbind+0x46/0x60 mm/mempolicy.c:1352
    do_syscall_64+0x1d2/0x600 arch/x86/entry/common.c:282
    entry_SYSCALL_64_after_swapgs+0x5d/0xdb
    Code: c7 80 1c 02 00 e8 26 0a 76 01 0b 48 c7 c7 40 46 45 84 e8 4c
    RIP [] split_huge_page_to_list+0x8fb/0x1030 mm/huge_memory.c:2124
    RSP

    with the below test:

    uint64_t r[1] = {0xffffffffffffffff};

    int main(void)
    {
            syscall(__NR_mmap, 0x20000000, 0x1000000, 3, 0x32, -1, 0);
            intptr_t res = 0;
            res = syscall(__NR_socket, 0x11, 3, 0x300);
            if (res != -1)
                    r[0] = res;
            *(uint32_t*)0x20000040 = 0x10000;
            *(uint32_t*)0x20000044 = 1;
            *(uint32_t*)0x20000048 = 0xc520;
            *(uint32_t*)0x2000004c = 1;
            syscall(__NR_setsockopt, r[0], 0x107, 0xd, 0x20000040, 0x10);
            syscall(__NR_mmap, 0x20fed000, 0x10000, 0, 0x8811, r[0], 0);
            *(uint64_t*)0x20000340 = 2;
            syscall(__NR_mbind, 0x20ff9000, 0x4000, 0x4002, 0x20000340, 0x45d4, 3);
            return 0;
    }

    Actually the test does:

    mmap(0x20000000, 16777216, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x20000000
    socket(AF_PACKET, SOCK_RAW, 768) = 3
    setsockopt(3, SOL_PACKET, PACKET_TX_RING, {block_size=65536, block_nr=1, frame_size=50464, frame_nr=1}, 16) = 0
    mmap(0x20fed000, 65536, PROT_NONE, MAP_SHARED|MAP_FIXED|MAP_POPULATE|MAP_DENYWRITE, 3, 0) = 0x20fed000
    mbind(..., MPOL_MF_STRICT|MPOL_MF_MOVE) = 0

    The setsockopt() would allocate compound pages (16 pages in this test)
    for packet tx ring, then the mmap() would call packet_mmap() to map the
    pages into the user address space specified by the mmap() call.

    When calling mbind(), it would scan the vma to queue the pages for
    migration to the new node. It would split any huge page, since 4.9
    doesn't support THP migration; however, the packet tx ring compound
    pages are not THP and are not even movable. So the above bug is
    triggered.

    However, the later kernel is not hit by this issue due to commit
    d44d363f6578 ("mm: don't assume anonymous pages have SwapBacked flag"),
    which just removes the PageSwapBacked check for a different reason.

    But, there is a deeper issue. According to the semantic of mbind(), it
    should return -EIO if MPOL_MF_MOVE or MPOL_MF_MOVE_ALL was specified and
    MPOL_MF_STRICT was also specified, but the kernel was unable to move all
    existing pages in the range. The tx ring of the packet socket is
    definitely not movable, however, mbind() returns success for this case.

    Although most socket files are associated with non-movable pages, XDP
    may have movable pages from gup. So it is not sufficient to just check
    the underlying file type of the vma in vma_migratable().

    Change migrate_page_add() to check whether the page is movable; if it
    is unmovable, just return -EIO. But do not abort the pte walk
    immediately, since there may be pages that are only temporarily off the
    LRU, and we should still migrate the other pages if MPOL_MF_MOVE* is
    specified. Set the has_unmovable flag if some pages could not be
    moved, then eventually return -EIO from mbind().

    With this change the above test would return -EIO as expected.
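
    A simplified sketch of the resulting migrate_page_add() (isolation
    statistics omitted):

    static int migrate_page_add(struct page *page,
                                struct list_head *pagelist,
                                unsigned long flags)
    {
            struct page *head = compound_head(page);

            /* avoid migrating a page that is shared with others */
            if ((flags & MPOL_MF_MOVE_ALL) || page_mapcount(head) == 1) {
                    if (!isolate_lru_page(head)) {
                            list_add_tail(&head->lru, pagelist);
                    } else if (flags & (MPOL_MF_MOVE | MPOL_MF_MOVE_ALL)) {
                            /*
                             * Non-movable page, or temporarily off the
                             * LRU: it cannot be isolated right now, so
                             * report the failure rather than silently
                             * skipping the page.
                             */
                            return -EIO;
                    }
            }
            return 0;
    }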

    [yang.shi@linux.alibaba.com: fix review comments from Vlastimil]
    Link: http://lkml.kernel.org/r/1563556862-54056-3-git-send-email-yang.shi@linux.alibaba.com
    Link: http://lkml.kernel.org/r/1561162809-59140-3-git-send-email-yang.shi@linux.alibaba.com
    Signed-off-by: Yang Shi
    Reviewed-by: Vlastimil Babka
    Cc: Michal Hocko
    Cc: Mel Gorman
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Yang Shi
     
    When both MPOL_MF_MOVE* and MPOL_MF_STRICT were specified, mbind()
    should try its best to migrate misplaced pages; if some of the pages
    could not be migrated, it should then return -EIO.

    There are three different sub-cases:
    1. vma is not migratable
    2. vma is migratable, but there are unmovable pages
    3. vma is migratable, pages are movable, but migrate_pages() fails

    If #1 happens, the kernel would just abort immediately and return -EIO,
    after a7f40cfe3b7a ("mm: mempolicy: make mbind() return -EIO when
    MPOL_MF_STRICT is specified").

    If #3 happens, the kernel would set the policy and migrate pages with
    best effort, but it won't roll back the migrated pages or reset the
    policy.

    Before that commit, they behaved in the same way, and it is better to
    keep their behavior consistent. But rolling back the migrated pages
    and resetting the policy is not feasible, so just make #1 behave the
    same as #3.

    Userspace will know that not everything was successfully migrated (via
    -EIO), and can take whatever steps it deems necessary - attempt
    rollback, determine which exact page(s) are violating the policy, etc.

    Make queue_pages_range() return 1 to indicate that there are unmovable
    pages or that the vma is not migratable.
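
    Sketch of the corresponding handling in do_mbind() (paraphrased):

    ret = queue_pages_range(mm, start, end, nmask,
                            flags | MPOL_MF_INVERT, &pagelist);
    /* ... mbind_range(), then the best-effort migration ... */
    nr_failed = migrate_pages(&pagelist, new_page, NULL, start,
                              MIGRATE_SYNC, MR_MEMPOLICY_MBIND);
    /* ret == 1 means an unmovable page or a non-migratable vma was
     * seen; with MPOL_MF_STRICT that still becomes -EIO after the
     * migration attempt */
    if ((ret > 0) || (nr_failed && (flags & MPOL_MF_STRICT)))
            err = -EIO;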

    Case #2 is not handled correctly in the current kernel; the following
    patch will fix it.

    [yang.shi@linux.alibaba.com: fix review comments from Vlastimil]
    Link: http://lkml.kernel.org/r/1563556862-54056-2-git-send-email-yang.shi@linux.alibaba.com
    Link: http://lkml.kernel.org/r/1561162809-59140-2-git-send-email-yang.shi@linux.alibaba.com
    Signed-off-by: Yang Shi
    Reviewed-by: Vlastimil Babka
    Cc: Michal Hocko
    Cc: Mel Gorman
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Yang Shi
     

03 Jul, 2019

1 commit

  • nouveau is currently using this through an odd hmm wrapper, and I plan
    to switch it to the real thing later in this series.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: John Hubbard
    Acked-by: Michal Hocko
    Reviewed-by: Dan Williams
    Signed-off-by: Jason Gunthorpe

    Christoph Hellwig
     

29 Jun, 2019

1 commit

    mpol_rebind_nodemask() is called for MPOL_BIND and MPOL_INTERLEAVE
    mempolicies when the task's cpuset's mems_allowed changes. For
    policies created without MPOL_F_STATIC_NODES or MPOL_F_RELATIVE_NODES,
    it works by remapping the policy's allowed nodes (stored in v.nodes)
    using the previous value of mems_allowed (stored in
    w.cpuset_mems_allowed) as the domain of the map and the new
    mems_allowed (passed as nodes) as the range of the map (see the
    comment of bitmap_remap() for details).

    The result of remapping is stored back as policy's nodemask in v.nodes,
    and the new value of mems_allowed should be stored in
    w.cpuset_mems_allowed to facilitate the next rebind, if it happens.

    However, 213980c0f23b ("mm, mempolicy: simplify rebinding mempolicies
    when updating cpusets") introduced a bug where the result of remapping
    is stored in w.cpuset_mems_allowed instead. Thus, a mempolicy's
    allowed nodes can evolve in an unexpected way after a series of
    rebinds caused by cpuset mems_allowed changes, possibly binding to a
    wrong node or to a smaller number of nodes which may e.g. become
    overloaded. This patch fixes the bug so that rebinding again works as
    intended.
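
    The fix itself is one line in mpol_rebind_nodemask() (sketch):

    else {
            nodes_remap(tmp, pol->v.nodes, pol->w.cpuset_mems_allowed,
                        *nodes);
            /* was: pol->w.cpuset_mems_allowed = tmp;  (the bug) */
            pol->w.cpuset_mems_allowed = *nodes;
    }
    /* ... */
    pol->v.nodes = tmp;     /* the remap result belongs here, not in w */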

    [vbabka@suse.cz: new changelog]
    Link: http://lkml.kernel.org/r/ef6a69c6-c052-b067-8f2c-9d615c619bb9@suse.cz
    Link: http://lkml.kernel.org/r/1558768043-23184-1-git-send-email-zhongjiang@huawei.com
    Fixes: 213980c0f23b ("mm, mempolicy: simplify rebinding mempolicies when updating cpusets")
    Signed-off-by: zhong jiang
    Reviewed-by: Vlastimil Babka
    Cc: Oscar Salvador
    Cc: Anshuman Khandual
    Cc: Michal Hocko
    Cc: Mel Gorman
    Cc: Andrea Arcangeli
    Cc: Ralph Campbell
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    zhong jiang
     

31 May, 2019

1 commit

  • Based on 1 normalized pattern(s):

    subject to the gnu public license version 2

    extracted by the scancode license scanner the SPDX license identifier

    GPL-2.0-only

    has been chosen to replace the boilerplate/reference in 1 file(s).

    Signed-off-by: Thomas Gleixner
    Reviewed-by: Steve Winslow
    Reviewed-by: Alexios Zavras
    Reviewed-by: Allison Randal
    Cc: linux-spdx@vger.kernel.org
    Link: https://lkml.kernel.org/r/20190528171440.319650492@linutronix.de
    Signed-off-by: Greg Kroah-Hartman

    Thomas Gleixner
     

30 Mar, 2019

1 commit

  • When MPOL_MF_STRICT was specified and an existing page was already on a
    node that does not follow the policy, mbind() should return -EIO. But
    commit 6f4576e3687b ("mempolicy: apply page table walker on
    queue_pages_range()") broke the rule.

    And commit c8633798497c ("mm: mempolicy: mbind and migrate_pages support
    thp migration") didn't return the correct value for THP mbind() too.

    If MPOL_MF_STRICT is set, ignore vma_migratable() to make sure the
    walk reaches queue_pages_pte_range() or queue_pages_pmd(), which check
    whether an existing page is already on a node that does not follow the
    policy. And, since a non-migratable vma may be used, return -EIO too
    if MPOL_MF_MOVE or MPOL_MF_MOVE_ALL was specified.
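
    A sketch of the idea in queue_pages_test_walk() (heavily simplified
    from the actual patch):

    if (!vma_migratable(vma) && !(flags & MPOL_MF_STRICT))
            return 1;       /* skip this vma */
    return 0;               /* walk it, so misplaced pages are detected */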

    Tested with https://github.com/metan-ucw/ltp/blob/master/testcases/kernel/syscalls/mbind/mbind02.c

    [akpm@linux-foundation.org: tweak code comment]
    Link: http://lkml.kernel.org/r/1553020556-38583-1-git-send-email-yang.shi@linux.alibaba.com
    Fixes: 6f4576e3687b ("mempolicy: apply page table walker on queue_pages_range()")
    Signed-off-by: Yang Shi
    Signed-off-by: Oscar Salvador
    Reported-by: Cyril Hrubis
    Suggested-by: Kirill A. Shutemov
    Acked-by: Rafael Aquini
    Reviewed-by: Oscar Salvador
    Acked-by: David Rientjes
    Cc: Vlastimil Babka
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Yang Shi
     

06 Mar, 2019

2 commits

  • Syzbot with KMSAN reports (excerpt):

    ==================================================================
    BUG: KMSAN: uninit-value in mpol_rebind_policy mm/mempolicy.c:353 [inline]
    BUG: KMSAN: uninit-value in mpol_rebind_mm+0x249/0x370 mm/mempolicy.c:384
    CPU: 1 PID: 17420 Comm: syz-executor4 Not tainted 4.20.0-rc7+ #15
    Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS
    Google 01/01/2011
    Call Trace:
    __dump_stack lib/dump_stack.c:77 [inline]
    dump_stack+0x173/0x1d0 lib/dump_stack.c:113
    kmsan_report+0x12e/0x2a0 mm/kmsan/kmsan.c:613
    __msan_warning+0x82/0xf0 mm/kmsan/kmsan_instr.c:295
    mpol_rebind_policy mm/mempolicy.c:353 [inline]
    mpol_rebind_mm+0x249/0x370 mm/mempolicy.c:384
    update_tasks_nodemask+0x608/0xca0 kernel/cgroup/cpuset.c:1120
    update_nodemasks_hier kernel/cgroup/cpuset.c:1185 [inline]
    update_nodemask kernel/cgroup/cpuset.c:1253 [inline]
    cpuset_write_resmask+0x2a98/0x34b0 kernel/cgroup/cpuset.c:1728

    ...

    Uninit was created at:
    kmsan_save_stack_with_flags mm/kmsan/kmsan.c:204 [inline]
    kmsan_internal_poison_shadow+0x92/0x150 mm/kmsan/kmsan.c:158
    kmsan_kmalloc+0xa6/0x130 mm/kmsan/kmsan_hooks.c:176
    kmem_cache_alloc+0x572/0xb90 mm/slub.c:2777
    mpol_new mm/mempolicy.c:276 [inline]
    do_mbind mm/mempolicy.c:1180 [inline]
    kernel_mbind+0x8a7/0x31a0 mm/mempolicy.c:1347
    __do_sys_mbind mm/mempolicy.c:1354 [inline]

    As it's difficult to report where exactly the uninit value resides in
    the mempolicy object, we have to guess a bit. mm/mempolicy.c:353
    contains this part of mpol_rebind_policy():

    if (!mpol_store_user_nodemask(pol) &&
    nodes_equal(pol->w.cpuset_mems_allowed, *newmask))

    "mpol_store_user_nodemask(pol)" is testing pol->flags, which I couldn't
    ever see being uninitialized after leaving mpol_new(). So I'll guess
    it's actually about accessing pol->w.cpuset_mems_allowed on line 354,
    but still part of statement starting on line 353.

    For w.cpuset_mems_allowed to be not initialized, and the nodes_equal()
    reachable for a mempolicy where mpol_set_nodemask() is called in
    do_mbind(), it seems the only possibility is a MPOL_PREFERRED policy
    with empty set of nodes, i.e. MPOL_LOCAL equivalent, with MPOL_F_LOCAL
    flag. Let's exclude such policies from the nodes_equal() check. Note
    the uninit access should be benign anyway, as rebinding this kind of
    policy is always a no-op. Therefore no actual need for stable
    inclusion.
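
    The resulting check in mpol_rebind_policy() (sketch):

    if (!mpol_store_user_nodemask(pol) && !(pol->flags & MPOL_F_LOCAL) &&
        nodes_equal(pol->w.cpuset_mems_allowed, *newmask))
            return;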

    Link: http://lkml.kernel.org/r/a71997c3-e8ae-a787-d5ce-3db05768b27c@suse.cz
    Link: http://lkml.kernel.org/r/73da3e9c-cc84-509e-17d9-0c434bb9967d@suse.cz
    Signed-off-by: Vlastimil Babka
    Reported-by: syzbot+b19c2dc2c990ea657a71@syzkaller.appspotmail.com
    Cc: Alexander Potapenko
    Cc: Dmitry Vyukov
    Cc: Andrea Arcangeli
    Cc: "Kirill A. Shutemov"
    Cc: Michal Hocko
    Cc: David Rientjes
    Cc: Yisheng Xie
    Cc: zhong jiang
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     
  • Patch series "Replace all open encodings for NUMA_NO_NODE", v3.

    All these places for replacement were found by running the following
    grep patterns on the entire kernel code. Please let me know if this
    might have missed some instances; it might also have replaced some
    false positives. I would appreciate suggestions, input, and review.

    1. git grep "nid == -1"
    2. git grep "node == -1"
    3. git grep "nid = -1"
    4. git grep "node = -1"

    This patch (of 2):

    At present there are multiple places where an invalid node number is
    encoded as -1. Even though this is implicitly understood, it is always
    better to use a macro. Replace these open encodings of an invalid node
    number with the global macro NUMA_NO_NODE. This helps remove
    NUMA-related assumptions like 'invalid node' from various places,
    redirecting them to a common definition.
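
    An illustrative before/after (a generic example, not a specific hunk
    from the series):

    /* before */
    int nid = -1;
    if (nid == -1)
            nid = numa_mem_id();

    /* after */
    int nid = NUMA_NO_NODE;
    if (nid == NUMA_NO_NODE)
            nid = numa_mem_id();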

    Link: http://lkml.kernel.org/r/1545127933-10711-2-git-send-email-anshuman.khandual@arm.com
    Signed-off-by: Anshuman Khandual
    Reviewed-by: David Hildenbrand
    Acked-by: Jeff Kirsher [ixgbe]
    Acked-by: Jens Axboe [mtip32xx]
    Acked-by: Vinod Koul [dmaengine.c]
    Acked-by: Michael Ellerman [powerpc]
    Acked-by: Doug Ledford [drivers/infiniband]
    Cc: Joseph Qi
    Cc: Hans Verkuil
    Cc: Stephen Rothwell
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Anshuman Khandual
     

22 Feb, 2019

1 commit

  • The system call, get_mempolicy() [1], passes an unsigned long *nodemask
    pointer and an unsigned long maxnode argument which specifies the length
    of the user's nodemask array in bits (which is rounded up). The manual
    page says that if the maxnode value is too small, get_mempolicy will
    return EINVAL but there is no system call to return this minimum value.
    To determine this value, some programs search /proc/<pid>/status for a
    line starting with "Mems_allowed:" and use the number of digits in the
    mask to determine the minimum value. A recent change to the way this line
    is formatted [2] causes these programs to compute a value less than
    MAX_NUMNODES so get_mempolicy() returns EINVAL.

    Change get_mempolicy(), the older compat version of get_mempolicy(), and
    the copy_nodes_to_user() function to use nr_node_ids instead of
    MAX_NUMNODES, thus preserving the defacto method of computing the minimum
    size for the nodemask array and the maxnode argument.
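
    Sketch of the change in kernel_get_mempolicy() (paraphrased):

    /* was: maxnode < MAX_NUMNODES */
    if (nmask != NULL && maxnode < nr_node_ids)
            return -EINVAL;

    /* and in copy_nodes_to_user(), size the copy by nr_node_ids: */
    unsigned int nbytes = BITS_TO_LONGS(nr_node_ids) * sizeof(long);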

    [1] http://man7.org/linux/man-pages/man2/get_mempolicy.2.html
    [2] https://lore.kernel.org/lkml/1545405631-6808-1-git-send-email-longman@redhat.com

    Link: http://lkml.kernel.org/r/20190211180245.22295-1-rcampbell@nvidia.com
    Fixes: 4fb8e5b89bcbbbb ("include/linux/nodemask.h: use nr_node_ids (not MAX_NUMNODES) in __nodemask_pr_numnodes()")
    Signed-off-by: Ralph Campbell
    Suggested-by: Alexander Duyck
    Cc: Waiman Long
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ralph Campbell
     

09 Dec, 2018

1 commit

  • This reverts commit 89c83fb539f95491be80cdd5158e6f0ce329e317.

    This should have been done as part of 2f0799a0ffc0 ("mm, thp: restore
    node-local hugepage allocations"). The movement of the thp allocation
    policy from alloc_pages_vma() to alloc_hugepage_direct_gfpmask() was
    intended to only set __GFP_THISNODE for mempolicies that are not
    MPOL_BIND whereas the revert could set this regardless of mempolicy.

    While the check for MPOL_BIND between alloc_hugepage_direct_gfpmask()
    and alloc_pages_vma() was racy, that has since been removed since the
    revert. What is left is the possibility to use __GFP_THISNODE in
    policy_node() when it is unexpected because the special handling for
    hugepages in alloc_pages_vma() was removed as part of the consolidation.

    Secondly, prior to 89c83fb539f9, alloc_pages_vma() implemented a somewhat
    different policy for hugepage allocations, which were allocated through
    alloc_hugepage_vma(). For hugepage allocations, if the allocating
    process's node is in the set of allowed nodes, allocate with
    __GFP_THISNODE for that node (for MPOL_PREFERRED, use that node with
    __GFP_THISNODE instead). This was changed for shmem_alloc_hugepage() to
    allow fallback to other nodes in 89c83fb539f9 as it did for new_page() in
    mm/mempolicy.c which is functionally different behavior and removes the
    requirement to only allocate hugepages locally.

    So this commit does a full revert of 89c83fb539f9 instead of the partial
    revert that was done in 2f0799a0ffc0. The result is the same thp
    allocation policy for 4.20 that was in 4.19.

    Fixes: 89c83fb539f9 ("mm, thp: consolidate THP gfp handling into alloc_hugepage_direct_gfpmask")
    Fixes: 2f0799a0ffc0 ("mm, thp: restore node-local hugepage allocations")
    Signed-off-by: David Rientjes
    Acked-by: Vlastimil Babka
    Cc: Andrea Arcangeli
    Cc: Mel Gorman
    Cc: Michal Hocko
    Cc: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     

06 Dec, 2018

1 commit

  • This is a full revert of ac5b2c18911f ("mm: thp: relax __GFP_THISNODE for
    MADV_HUGEPAGE mappings") and a partial revert of 89c83fb539f9 ("mm, thp:
    consolidate THP gfp handling into alloc_hugepage_direct_gfpmask").

    By not setting __GFP_THISNODE, applications can allocate remote hugepages
    when the local node is fragmented or low on memory when either the thp
    defrag setting is "always" or the vma has been madvised with
    MADV_HUGEPAGE.

    Remote access to hugepages often has much higher latency than local pages
    of the native page size. On Haswell, ac5b2c18911f was shown to have a
    13.9% access regression after this commit for binaries that remap their
    text segment to be backed by transparent hugepages.

    The intent of ac5b2c18911f is to address an issue where a local node is
    low on memory or fragmented such that a hugepage cannot be allocated. In
    every scenario where this was described as a fix, there is abundant and
    unfragmented remote memory available to allocate from, even with a greater
    access latency.

    If remote memory is also low or fragmented, not setting __GFP_THISNODE was
    also measured on Haswell to have a 40% regression in allocation latency.

    Restore __GFP_THISNODE for thp allocations.

    Fixes: ac5b2c18911f ("mm: thp: relax __GFP_THISNODE for MADV_HUGEPAGE mappings")
    Fixes: 89c83fb539f9 ("mm, thp: consolidate THP gfp handling into alloc_hugepage_direct_gfpmask")
    Cc: Andrea Arcangeli
    Cc: Mel Gorman
    Cc: Vlastimil Babka
    Cc: Michal Hocko
    Cc: Andrew Morton
    Signed-off-by: David Rientjes
    Signed-off-by: Linus Torvalds

    David Rientjes
     

04 Nov, 2018

2 commits

    THP allocation mode is quite complex and depends on the defrag mode.
    This complexity is currently hidden, for a large part, in
    alloc_hugepage_direct_gfpmask. The NUMA special casing (namely
    __GFP_THISNODE), however, is independent and currently placed in
    alloc_pages_vma. This both adds an unnecessary branch to all vma-based
    page allocation requests and makes the code unnecessarily more
    complex. Not to mention that e.g. shmem THP used to do the node
    reclaiming unconditionally, regardless of the defrag mode, until
    recently. This was not only unexpected behavior, it was also hardly a
    good default, and I strongly suspect it was just a side effect of the
    code sharing rather than a deliberate decision, which suggests that
    such a layering is wrong.

    Get rid of the thp special casing from alloc_pages_vma and move the
    logic to alloc_hugepage_direct_gfpmask. __GFP_THISNODE is applied to the
    resulting gfp mask only when the direct reclaim is not requested and
    when there is no explicit numa binding to preserve the current logic.

    Please note that there's also a slight difference wrt MPOL_BIND now. The
    previous code would avoid using __GFP_THISNODE if the local node was
    outside of policy_nodemask(). After this patch __GFP_THISNODE is avoided
    for all MPOL_BIND policies. So the difference is that if the local
    node is actually allowed by the bind policy's nodemask, previously
    __GFP_THISNODE would have been added, but now it won't be. From the
    behavior
    POV this is still correct because the policy nodemask is used.
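
    A heavily condensed sketch of the consolidated helper (the defrag-mode
    handling is compressed into comments; details are simplified):

    static gfp_t alloc_hugepage_direct_gfpmask(struct vm_area_struct *vma,
                                               unsigned long addr)
    {
            gfp_t this_node = 0;
            struct mempolicy *pol = get_vma_policy(vma, addr);

            /* no implicit node pinning with an explicit bind policy */
            if (pol->mode != MPOL_BIND)
                    this_node = __GFP_THISNODE;
            mpol_cond_put(pol);

            /* this_node is applied only to gfp variants that do not
             * request direct reclaim, e.g. defrag "defer":
             *   GFP_TRANSHUGE_LIGHT | __GFP_KSWAPD_RECLAIM | this_node
             * while defrag "always" returns plain GFP_TRANSHUGE
             * (direct reclaim, hence no __GFP_THISNODE). */
            return GFP_TRANSHUGE_LIGHT | this_node;
    }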

    Link: http://lkml.kernel.org/r/20180925120326.24392-3-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Acked-by: Vlastimil Babka
    Cc: Alex Williamson
    Cc: Andrea Arcangeli
    Cc: David Rientjes
    Cc: "Kirill A. Shutemov"
    Cc: Mel Gorman
    Cc: Stefan Priebe - Profihost AG
    Cc: Zi Yan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
    THP allocation might be really disruptive on a NUMA system with the
    local node full or hard to reclaim. Stefan has posted an allocation
    stall report on a 4.12-based SLES kernel which suggests the same
    issue:

    kvm: page allocation stalls for 194572ms, order:9, mode:0x4740ca(__GFP_HIGHMEM|__GFP_IO|__GFP_FS|__GFP_COMP|__GFP_NOMEMALLOC|__GFP_HARDWALL|__GFP_THISNODE|__GFP_MOVABLE|__GFP_DIRECT_RECLAIM), nodemask=(null)
    kvm cpuset=/ mems_allowed=0-1
    CPU: 10 PID: 84752 Comm: kvm Tainted: G W 4.12.0+98-ph 0000001 SLE15 (unreleased)
    Hardware name: Supermicro SYS-1029P-WTRT/X11DDW-NT, BIOS 2.0 12/05/2017
    Call Trace:
    dump_stack+0x5c/0x84
    warn_alloc+0xe0/0x180
    __alloc_pages_slowpath+0x820/0xc90
    __alloc_pages_nodemask+0x1cc/0x210
    alloc_pages_vma+0x1e5/0x280
    do_huge_pmd_wp_page+0x83f/0xf00
    __handle_mm_fault+0x93d/0x1060
    handle_mm_fault+0xc6/0x1b0
    __do_page_fault+0x230/0x430
    do_page_fault+0x2a/0x70
    page_fault+0x7b/0x80
    [...]
    Mem-Info:
    active_anon:126315487 inactive_anon:1612476 isolated_anon:5
    active_file:60183 inactive_file:245285 isolated_file:0
    unevictable:15657 dirty:286 writeback:1 unstable:0
    slab_reclaimable:75543 slab_unreclaimable:2509111
    mapped:81814 shmem:31764 pagetables:370616 bounce:0
    free:32294031 free_pcp:6233 free_cma:0
    Node 0 active_anon:254680388kB inactive_anon:1112760kB active_file:240648kB inactive_file:981168kB unevictable:13368kB isolated(anon):0kB isolated(file):0kB mapped:280240kB dirty:1144kB writeback:0kB shmem:95832kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 81225728kB writeback_tmp:0kB unstable:0kB all_unreclaimable? no
    Node 1 active_anon:250583072kB inactive_anon:5337144kB active_file:84kB inactive_file:0kB unevictable:49260kB isolated(anon):20kB isolated(file):0kB mapped:47016kB dirty:0kB writeback:4kB shmem:31224kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 31897600kB writeback_tmp:0kB unstable:0kB all_unreclaimable? no

    The defrag mode is "madvise" and from the above report it is clear that
    the THP has been allocated for a MADV_HUGEPAGE vma.

    Andrea has identified that the main source of the problem is
    __GFP_THISNODE usage:

    : The problem is that direct compaction combined with the NUMA
    : __GFP_THISNODE logic in mempolicy.c is telling reclaim to swap very
    : hard the local node, instead of failing the allocation if there's no
    : THP available in the local node.
    :
    : Such logic was ok until __GFP_THISNODE was added to the THP allocation
    : path even with MPOL_DEFAULT.
    :
    : The idea behind the __GFP_THISNODE addition, is that it is better to
    : provide local memory in PAGE_SIZE units than to use remote NUMA THP
    : backed memory. That largely depends on the remote latency though, on
    : threadrippers for example the overhead is relatively low in my
    : experience.
    :
    : The combination of __GFP_THISNODE and __GFP_DIRECT_RECLAIM results in
    : extremely slow qemu startup with vfio, if the VM is larger than the
    : size of one host NUMA node. This is because it will try very hard to
    : unsuccessfully swapout get_user_pages pinned pages as result of the
    : __GFP_THISNODE being set, instead of falling back to PAGE_SIZE
    : allocations and instead of trying to allocate THP on other nodes (it
    : would be even worse without vfio type1 GUP pins of course, except it'd
    : be swapping heavily instead).

    Fix this by removing __GFP_THISNODE for THP requests which request
    direct reclaim. This effectively reverts 5265047ac301 on the grounds
    that zone/node reclaim was known to be disruptive due to premature
    reclaim when there was memory free. While it made sense at the time
    for HPC workloads without NUMA awareness on rare machines, it was
    ultimately harmful in the majority of cases. The existing behaviour is
    similar, if not as widespread, as it applies to a corner case, but
    crucially it cannot be tuned around like zone_reclaim_mode can. The
    default behaviour should always be to cause the least harm for the
    common case.
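
    The fix itself is small; a sketch of the THP path in alloc_pages_vma()
    (paraphrased):

    /* pin the hugepage attempt to the local node only when the gfp mask
     * cannot invoke direct reclaim; otherwise allow remote fallback
     * instead of swapping the local node very hard */
    if (!(gfp & __GFP_DIRECT_RECLAIM))
            gfp |= __GFP_THISNODE;
    page = __alloc_pages_node(hpage_node, gfp, order);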

    If there are specialised use cases out there that want zone_reclaim_mode
    in specific cases, then it can be built on top. Longterm we should
    consider a memory policy which allows for the node reclaim like behavior
    for the specific memory ranges which would allow a

    [1] http://lkml.kernel.org/r/20180820032204.9591-1-aarcange@redhat.com

    Mel said:

    : Both patches look correct to me but I'm responding to this one because
    : it's the fix. The change makes sense and moves further away from the
    : severe stalling behaviour we used to see with both THP and zone reclaim
    : mode.
    :
    : I put together a basic experiment with usemem configured to reference a
    : buffer multiple times that is 80% the size of main memory on a 2-socket
    : box with symmetric node sizes and defrag set to "always". The defrag
    : setting is not the default but it would be functionally similar to
    : accessing a buffer with madvise(MADV_HUGEPAGE). Usemem is configured to
    : reference the buffer multiple times and while it's not an interesting
    : workload, it would be expected to complete reasonably quickly as it fits
    : within memory. The results were;
    :
    : usemem
    : vanilla noreclaim-v1
    : Amean Elapsd-1 42.78 ( 0.00%) 26.87 ( 37.18%)
    : Amean Elapsd-3 27.55 ( 0.00%) 7.44 ( 73.00%)
    : Amean Elapsd-4 5.72 ( 0.00%) 5.69 ( 0.45%)
    :
    : This shows the elapsed time in seconds for 1 thread, 3 threads and 4
    : threads referencing buffers 80% the size of memory. With the patches
    : applied, it's 37.18% faster for the single thread and 73% faster with two
    : threads. Note that 4 threads showing little difference does not indicate
    : the problem is related to thread counts. It's simply the case that 4
    : threads gets spread so their workload mostly fits in one node.
    :
    : The overall view from /proc/vmstats is more startling
    :
    : 4.19.0-rc1 4.19.0-rc1
    : vanilla noreclaim-v1r1
    : Minor Faults 35593425 708164
    : Major Faults 484088 36
    : Swap Ins 3772837 0
    : Swap Outs 3932295 0
    :
    : Massive amounts of swap in/out without the patch
    :
    : Direct pages scanned 6013214 0
    : Kswapd pages scanned 0 0
    : Kswapd pages reclaimed 0 0
    : Direct pages reclaimed 4033009 0
    :
    : Lots of reclaim activity without the patch
    :
    : Kswapd efficiency 100% 100%
    : Kswapd velocity 0.000 0.000
    : Direct efficiency 67% 100%
    : Direct velocity 11191.956 0.000
    :
    : Mostly from direct reclaim context as you'd expect without the patch.
    :
    : Page writes by reclaim 3932314.000 0.000
    : Page writes file 19 0
    : Page writes anon 3932295 0
    : Page reclaim immediate 42336 0
    :
    : Writes from reclaim context is never good but the patch eliminates it.
    :
    : We should never have default behaviour to thrash the system for such a
    : basic workload. If zone reclaim mode behaviour is ever desired but on a
    : single task instead of a global basis then the sensible option is to build
    : a mempolicy that enforces that behaviour.

    This was a severe regression compared to previous kernels that made
    important workloads unusable, and it started when __GFP_THISNODE was
    added to THP allocations under MADV_HUGEPAGE. It is not a significant
    risk to go back to the previous behavior from before __GFP_THISNODE
    was added; it worked like that for years.

    This was simply an optimization to some lucky workloads that can fit in
    a single node, but it ended up breaking the VM for others that can't
    possibly fit in a single node, so going back is safe.

    [mhocko@suse.com: rewrote the changelog based on the one from Andrea]
    Link: http://lkml.kernel.org/r/20180925120326.24392-2-mhocko@kernel.org
    Fixes: 5265047ac301 ("mm, thp: really limit transparent hugepage allocation to local node")
    Signed-off-by: Andrea Arcangeli
    Signed-off-by: Michal Hocko
    Reported-by: Stefan Priebe
    Debugged-by: Andrea Arcangeli
    Reported-by: Alex Williamson
    Reviewed-by: Mel Gorman
    Tested-by: Mel Gorman
    Cc: Zi Yan
    Cc: Vlastimil Babka
    Cc: David Rientjes
    Cc: "Kirill A. Shutemov"
    Cc: [4.1+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     

27 Oct, 2018

2 commits

    match_string() returns the index into an array at which a matching
    string is found, and can be used instead of an open-coded
    implementation.
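
    For example, the open-coded loop in mpol_parse_str() (sketch):

    for (mode = 0; mode < MPOL_MAX; mode++) {
            if (!strcmp(str, policy_modes[mode]))
                    break;
    }
    if (mode >= MPOL_MAX)
            goto out;

    becomes:

    mode = match_string(policy_modes, MPOL_MAX, str);
    if (mode < 0)
            goto out;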

    Link: http://lkml.kernel.org/r/1536988365-50310-1-git-send-email-zhongjiang@huawei.com
    Signed-off-by: zhong jiang
    Reviewed-by: Andrew Morton
    Acked-by: Michal Hocko
    Cc: Vlastimil Babka
    Cc: Andrey Ryabinin
    Cc: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    zhong jiang
     
    get_mempolicy(MPOL_F_NODE|MPOL_F_ADDR) called get_user_pages() in a way
    that does not wait for userfaults before failing, so the caller would
    hit SIGBUS instead. Using get_user_pages_locked/unlocked instead
    allows get_mempolicy to let userfaults resolve the fault and fill the
    hole before grabbing the node id of the page.

    If the user calls get_mempolicy() with MPOL_F_ADDR | MPOL_F_NODE for an
    address inside an area managed by uffd and there is no page at that
    address, the page allocation from within get_mempolicy() will fail
    because get_user_pages() does not allow for page fault retry required
    for uffd; the user will get SIGBUS.

    With this patch, the page fault will be resolved by the uffd and the
    get_mempolicy() will continue normally.
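
    A sketch of the reworked lookup_node() (paraphrased):

    static int lookup_node(struct mm_struct *mm, unsigned long addr)
    {
            struct page *p;
            int locked = 1;
            int err;

            /* unlike plain get_user_pages(), this variant may drop
             * mmap_sem and wait, letting a userfault fill the hole */
            err = get_user_pages_locked(addr & PAGE_MASK, 1, 0, &p,
                                        &locked);
            if (err > 0) {
                    err = page_to_nid(p);
                    put_page(p);
            }
            if (locked)
                    up_read(&mm->mmap_sem);
            return err;
    }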

    Background:

    Via code review: previously the syscall would have returned -EFAULT
    (vm_fault_to_errno); now it will block and wait for a userfault (if
    it's woken before the fault is resolved it will still return -EFAULT).

    This way get_mempolicy will give a chance to an "unaware" app to be
    compliant with userfaults.

    The reason this change is visible is that becoming "userfault
    compliant" cannot regress anything: all other syscalls, including
    read(2)/write(2), had to become "userfault compliant" a long time ago
    (that's one of the things userfaultfd can do that PROT_NONE and
    trapping segfaults can't).

    So this is just one more syscall that becomes "userfault compliant",
    like all the other major ones already are.

    This has been happening on a virtio-bridge dpdk process which just
    called get_mempolicy on the guest space post live migration, but
    before the memory had a chance to be migrated to the destination.

    I didn't run an strace to be able to show the -EFAULT going away, but
    I have confirmation of the below debug-aid information (only visible
    with CONFIG_DEBUG_VM=y) going away with the patch:

    [20116.371461] FAULT_FLAG_ALLOW_RETRY missing 0
    [20116.371464] CPU: 1 PID: 13381 Comm: vhost-events Not tainted 4.17.12-200.fc28.x86_64 #1
    [20116.371465] Hardware name: LENOVO 20FAS2BN0A/20FAS2BN0A, BIOS N1CET54W (1.22 ) 02/10/2017
    [20116.371466] Call Trace:
    [20116.371473] dump_stack+0x5c/0x80
    [20116.371476] handle_userfault.cold.37+0x1b/0x22
    [20116.371479] ? remove_wait_queue+0x20/0x60
    [20116.371481] ? poll_freewait+0x45/0xa0
    [20116.371483] ? do_sys_poll+0x31c/0x520
    [20116.371485] ? radix_tree_lookup_slot+0x1e/0x50
    [20116.371488] shmem_getpage_gfp+0xce7/0xe50
    [20116.371491] ? page_add_file_rmap+0x1a/0x2c0
    [20116.371493] shmem_fault+0x78/0x1e0
    [20116.371495] ? filemap_map_pages+0x3a1/0x450
    [20116.371498] __do_fault+0x1f/0xc0
    [20116.371500] __handle_mm_fault+0xe2e/0x12f0
    [20116.371502] handle_mm_fault+0xda/0x200
    [20116.371504] __get_user_pages+0x238/0x790
    [20116.371506] get_user_pages+0x3e/0x50
    [20116.371510] kernel_get_mempolicy+0x40b/0x700
    [20116.371512] ? vfs_write+0x170/0x1a0
    [20116.371515] __x64_sys_get_mempolicy+0x21/0x30
    [20116.371517] do_syscall_64+0x5b/0x160
    [20116.371520] entry_SYSCALL_64_after_hwframe+0x44/0xa9

    The above harmless debug message (not a kernel crash, just a
    dump_stack()) is shown with CONFIG_DEBUG_VM=y to more quickly identify
    and improve kernel spots that may have to become "userfaultfd
    compliant" like this one (without having to run an strace and search
    for syscall misbehavior). Spots like the above are closer to a kernel
    bug for the non-cooperative usages that Mike focuses on than for the
    dpdk qemu-cooperative usages that reproduced it, but it's still nicer
    to get this fixed for dpdk too.

    The only part of the patch that gave me pause is the implementation
    detail around mpol_get, but it looks like it should be safe no matter
    what kind of mempolicy structure is involved (the default static
    policy's refcount also starts at 1, so it will go to 2 and back to 1
    without ever dropping everything to 0).

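    For reference, the core of the change is roughly the following (a
    sketch of the lookup helper, assuming the get_user_pages_locked()
    signature of that era):

        static int lookup_node(struct mm_struct *mm, unsigned long addr)
        {
                struct page *p;
                int err, locked = 1;

                /*
                 * get_user_pages_locked() may drop and re-take mmap_sem
                 * to retry the fault, which lets userfaultfd resolve it.
                 */
                err = get_user_pages_locked(addr & PAGE_MASK, 1, 0, &p,
                                            &locked);
                if (err >= 0) {
                        err = page_to_nid(p);
                        put_page(p);
                }
                if (locked)
                        up_read(&mm->mmap_sem);
                return err;
        }
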
    [rppt@linux.vnet.ibm.com: changelog addition]
    http://lkml.kernel.org/r/20180904073718.GA26916@rapoport-lnx
    Link: http://lkml.kernel.org/r/20180831214848.23676-1-aarcange@redhat.com
    Signed-off-by: Andrea Arcangeli
    Reported-by: Maxime Coquelin
    Tested-by: Dr. David Alan Gilbert
    Reviewed-by: Mike Rapoport
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     

23 Aug, 2018

2 commits

  • zone->node is defined only when CONFIG_NUMA=y, so it is a good idea
    to have inline functions to access this field in order to avoid
    #ifdefs in .c files.

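    The accessors are roughly of this shape (a sketch following the usual
    CONFIG_NUMA stub pattern):

        #ifdef CONFIG_NUMA
        static inline int zone_to_nid(struct zone *zone)
        {
                return zone->node;
        }

        static inline void zone_set_nid(struct zone *zone, int nid)
        {
                zone->node = nid;
        }
        #else
        static inline int zone_to_nid(struct zone *zone)
        {
                return 0;
        }

        static inline void zone_set_nid(struct zone *zone, int nid) { }
        #endif
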
    Link: http://lkml.kernel.org/r/20180730101757.28058-3-osalvador@techadventures.net
    Signed-off-by: Pavel Tatashin
    Signed-off-by: Oscar Salvador
    Reviewed-by: Oscar Salvador
    Acked-by: Michal Hocko
    Acked-by: Vlastimil Babka
    Cc: Aaron Lu
    Cc: Dan Williams
    Cc: David Hildenbrand
    Cc: Joonsoo Kim
    Cc: Mel Gorman
    Cc: Pasha Tatashin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Pavel Tatashin
     
  • Zero out the vma in vma_init() rather than in vm_area_alloc(), to
    ensure that the various oddball stack-based vmas are in a good state.
    Some of the callers were zeroing them out, others were not.

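    With the zeroing moved into the helper, vma_init() looks roughly like
    this (a sketch, assuming the dummy_vm_ops arrangement used at the
    time):

        static inline void vma_init(struct vm_area_struct *vma,
                                    struct mm_struct *mm)
        {
                static const struct vm_operations_struct dummy_vm_ops = {};

                /* zero everything so stack vmas start in a known state */
                memset(vma, 0, sizeof(*vma));
                vma->vm_mm = mm;
                vma->vm_ops = &dummy_vm_ops;
                INIT_LIST_HEAD(&vma->anon_vma_chain);
        }
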
    Acked-by: Kirill A. Shutemov
    Cc: Russell King
    Cc: Dmitry Vyukov
    Cc: Oleg Nesterov
    Cc: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     

27 Jul, 2018

1 commit

  • Make sure to initialize all VMAs properly, not only those which come
    from vm_area_cachep.

    Link: http://lkml.kernel.org/r/20180724121139.62570-3-kirill.shutemov@linux.intel.com
    Signed-off-by: Kirill A. Shutemov
    Acked-by: Linus Torvalds
    Reviewed-by: Andrew Morton
    Cc: Dmitry Vyukov
    Cc: Oleg Nesterov
    Cc: Andrea Arcangeli
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     

12 Apr, 2018

3 commits

  • THP migration is hacked into the generic migration path with rather
    surprising semantics. The migration allocation callback is supposed to
    check whether the THP can be migrated at once, and if that is not the
    case it allocates a simple page to migrate instead. unmap_and_move
    then fixes that up by splitting the THP into small pages while moving
    the head page to the newly allocated order-0 page. The remaining pages
    are moved to the LRU list by split_huge_page. The same happens if the
    THP allocation fails. This is really ugly and error prone [1].

    I also believe that splitting the THP to the LRU lists is inherently
    wrong, because the tail pages are not migrated at all. Some callers
    just work around that by retrying (e.g. memory hotplug). Other pfn
    walkers are simply broken, though: e.g. madvise_inject_error will
    migrate the head page and then advance the next pfn by the huge page
    size. do_move_page_to_node_array and queue_pages_range (migrate_pages,
    mbind) will split the THP before migration if THP migration is not
    supported and fall back to single page migration, but if the THP
    migration path cannot allocate a fresh THP the tail pages are not
    handled, so we end up with ENOMEM and fail the whole migration, which
    is questionable behavior. Page compaction doesn't try to migrate large
    pages, so it should be immune.

    This patch tries to unclutter the situation by moving the special THP
    handling up to the migrate_pages layer, where it actually belongs. We
    simply split the THP into the existing migration list if unmap_and_move
    fails with ENOMEM, and retry. This way we will _always_ migrate all
    THP subpages, and specific migrate_pages users do not have to deal
    with this case in a special way.

    [1] http://lkml.kernel.org/r/20171121021855.50525-1-zi.yan@sent.com

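    The retry logic in migrate_pages() then becomes roughly the following
    (a sketch of the ENOMEM handling, not the exact diff):

        case -ENOMEM:
                /*
                 * THP migration might be unsupported or the allocation
                 * might have failed, so retry on the same page with the
                 * THP split back onto the migration list (hugetlb pages
                 * must not be split here).
                 */
                if (PageTransHuge(page) && !PageHuge(page)) {
                        lock_page(page);
                        rc = split_huge_page_to_list(page, from);
                        unlock_page(page);
                        if (!rc) {
                                list_safe_reset_next(page, page2, lru);
                                goto retry;
                        }
                }
                nr_failed++;
                goto out;
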
    Link: http://lkml.kernel.org/r/20180103082555.14592-4-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Acked-by: Kirill A. Shutemov
    Reviewed-by: Zi Yan
    Cc: Andrea Reale
    Cc: Anshuman Khandual
    Cc: Mike Kravetz
    Cc: Naoya Horiguchi
    Cc: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • No allocation callback is using this argument anymore. new_page_node
    used to use this parameter to convey the node_id, or respectively a
    migration error, up to the move_pages code (do_move_page_to_node_array).
    The error status never made it into the final status field, and we now
    have a better way to communicate the node id to the status field. All
    other allocation callbacks simply ignored the argument, so we can
    finally drop it.

    [mhocko@suse.com: fix migration callback]
    Link: http://lkml.kernel.org/r/20180105085259.GH2801@dhcp22.suse.cz
    [akpm@linux-foundation.org: fix alloc_misplaced_dst_page()]
    [mhocko@kernel.org: fix build]
    Link: http://lkml.kernel.org/r/20180103091134.GB11319@dhcp22.suse.cz
    Link: http://lkml.kernel.org/r/20180103082555.14592-3-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Reviewed-by: Zi Yan
    Cc: Andrea Reale
    Cc: Anshuman Khandual
    Cc: Kirill A. Shutemov
    Cc: Mike Kravetz
    Cc: Naoya Horiguchi
    Cc: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • Patch series "unclutter thp migration"

    Motivation:

    THP migration is hacked into the generic migration path with rather
    surprising semantics. The migration allocation callback is supposed to
    check whether the THP can be migrated at once, and if that is not the
    case it allocates a simple page to migrate instead. unmap_and_move
    then fixes that up by splitting the THP into small pages while moving
    the head page to the newly allocated order-0 page. The remaining pages
    are moved to the LRU list by split_huge_page. The same happens if the
    THP allocation fails. This is really ugly and error prone [2].

    I also believe that splitting the THP to the LRU lists is inherently
    wrong, because the tail pages are not migrated at all. Some callers
    just work around that by retrying (e.g. memory hotplug). Other pfn
    walkers are simply broken, though: e.g. madvise_inject_error will
    migrate the head page and then advance the next pfn by the huge page
    size. do_move_page_to_node_array and queue_pages_range (migrate_pages,
    mbind) will split the THP before migration if THP migration is not
    supported and fall back to single page migration, but if the THP
    migration path cannot allocate a fresh THP the tail pages are not
    handled, so we end up with ENOMEM and fail the whole migration, which
    is questionable behavior. Page compaction doesn't try to migrate large
    pages, so it should be immune.

    The first patch reworks do_pages_move, which relies on a very ugly
    calling convention in which the return status is pushed to the
    migration path via a private pointer. It uses pre-allocated,
    fixed-size batching to achieve that. We simply cannot do the same if a
    THP is to be split during the migration path, which is done in patch 3.
    Patch 2 is a follow-up cleanup which removes the mentioned
    return-status calling-convention ugliness.

    On a side note:

    There are some semantic issues I have encountered along the way while
    working on patch 1, but I am not addressing them here. E.g. trying to
    move THP tail pages will result in either success or EBUSY (the latter
    more likely once we isolate the head from the LRU list). Hugetlb
    reports EACCES on tail pages. Some errors are reported via the status
    parameter, but migration failures are not, even though the original
    `reason' argument suggests there was an intention to do so. From a
    quick look into git history this never worked. I have tried to keep
    the semantics unchanged.

    Then there is the relatively minor issue that page isolation might
    fail because pages are not on the LRU, e.g. because they are sitting
    in the per-cpu LRU caches. Easily fixable.

    This patch (of 3):

    do_pages_move is supposed to move user-defined memory (an array of
    addresses) to user-defined numa nodes (an array of nodes, one for each
    address). The user-provided status array then contains the resulting
    numa node for each address, or an error. The semantics of this
    function are a little bit confusing because only some errors are
    reported back; notably, a migrate_pages error is only reported via the
    return value. This patch doesn't try to address these semantic nuances
    but rather changes the underlying implementation.

    Currently we process user input (which can be really large) in batches
    which are stored in a temporarily allocated page. Each address is
    resolved to its struct page and stored in a page_to_node structure
    along with the requested target numa node. The array of these
    structures is then conveyed down the page migration path via a private
    argument. new_page_node then finds the corresponding structure and
    allocates the proper target page.

    What is the problem with the current implementation, and why change
    it? Apart from being quite ugly, it also doesn't cope with unexpected
    pages showing up on the migration list inside the migrate_pages path.
    That doesn't happen currently, but the follow-up patch would like to
    make the thp migration code clearer, and that would need to split a
    THP into the list in some cases.

    How does the new implementation work? Instead of batching into a
    fixed-size array, we simply batch all pages that should be migrated to
    the same node and isolate them onto a linked list, which doesn't
    require any additional storage. This should work reasonably well,
    because page migration usually migrates larger ranges of memory to a
    specific node, so the common case should work as well as the current
    implementation. Even if somebody constructs an input where the target
    numa nodes are interleaved, we shouldn't see a large performance
    impact, because page migration alone doesn't really benefit from
    batching. mmap_sem batching for the lookup is quite questionable, and
    isolate_lru_page, which would benefit from batching, does not use it
    even in the current implementation.

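    As an illustration of the batching idea (a simplified sketch;
    get_target_node(), isolate_and_add() and migrate_batch() are
    hypothetical stand-ins for the real helpers):

        LIST_HEAD(pagelist);
        int i, node, current_node = NUMA_NO_NODE;

        for (i = 0; i < nr_pages; i++) {
                node = get_target_node(i);              /* hypothetical */

                if (current_node == NUMA_NO_NODE) {
                        current_node = node;
                } else if (node != current_node) {
                        /* target changed: flush the batch collected so far */
                        migrate_batch(&pagelist, current_node);
                        current_node = node;
                }
                /* isolate the page and append it to the per-node batch */
                isolate_and_add(addresses[i], &pagelist);
        }
        if (!list_empty(&pagelist))
                migrate_batch(&pagelist, current_node);
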
    Link: http://lkml.kernel.org/r/20180103082555.14592-2-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Acked-by: Kirill A. Shutemov
    Reviewed-by: Andrew Morton
    Cc: Anshuman Khandual
    Cc: Zi Yan
    Cc: Naoya Horiguchi
    Cc: Vlastimil Babka
    Cc: Andrea Reale
    Cc: Kirill A. Shutemov
    Cc: Mike Kravetz
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     

03 Apr, 2018

4 commits

  • Pull removal of in-kernel calls to syscalls from Dominik Brodowski:
    "System calls are interaction points between userspace and the kernel.
    Therefore, system call functions such as sys_xyzzy() or
    compat_sys_xyzzy() should only be called from userspace via the
    syscall table, but not from elsewhere in the kernel.

    At least on 64-bit x86, it will likely be a hard requirement from
    v4.17 onwards to not call system call functions in the kernel: It is
    better to use a different calling convention for system calls
    there, where struct pt_regs is decoded on-the-fly in a syscall wrapper
    which then hands processing over to the actual syscall function. This
    means that only those parameters which are actually needed for a
    specific syscall are passed on during syscall entry, instead of
    filling in six CPU registers with random user space content all the
    time (which may cause serious trouble down the call chain). Those
    x86-specific patches will be pushed through the x86 tree in the near
    future.

    Moreover, rules on how data may be accessed may differ between kernel
    data and user data. This is another reason why calling sys_xyzzy() is
    generally a bad idea, and -- at most -- acceptable in arch-specific
    code.

    This patchset removes all in-kernel calls to syscall functions in the
    kernel with the exception of arch/. On top of this, it cleans up the
    three places where many syscalls are referenced or prototyped, namely
    kernel/sys_ni.c, include/linux/syscalls.h and include/linux/compat.h"

    * 'syscalls-next' of git://git.kernel.org/pub/scm/linux/kernel/git/brodo/linux: (109 commits)
    bpf: whitelist all syscalls for error injection
    kernel/sys_ni: remove {sys_,sys_compat} from cond_syscall definitions
    kernel/sys_ni: sort cond_syscall() entries
    syscalls/x86: auto-create compat_sys_*() prototypes
    syscalls: sort syscall prototypes in include/linux/compat.h
    net: remove compat_sys_*() prototypes from net/compat.h
    syscalls: sort syscall prototypes in include/linux/syscalls.h
    kexec: move sys_kexec_load() prototype to syscalls.h
    x86/sigreturn: use SYSCALL_DEFINE0
    x86: fix sys_sigreturn() return type to be long, not unsigned long
    x86/ioport: add ksys_ioperm() helper; remove in-kernel calls to sys_ioperm()
    mm: add ksys_readahead() helper; remove in-kernel calls to sys_readahead()
    mm: add ksys_mmap_pgoff() helper; remove in-kernel calls to sys_mmap_pgoff()
    mm: add ksys_fadvise64_64() helper; remove in-kernel call to sys_fadvise64_64()
    fs: add ksys_fallocate() wrapper; remove in-kernel calls to sys_fallocate()
    fs: add ksys_p{read,write}64() helpers; remove in-kernel calls to syscalls
    fs: add ksys_truncate() wrapper; remove in-kernel calls to sys_truncate()
    fs: add ksys_sync_file_range helper(); remove in-kernel calls to syscall
    kernel: add ksys_setsid() helper; remove in-kernel call to sys_setsid()
    kernel: add ksys_unshare() helper; remove in-kernel calls to sys_unshare()
    ...

    Linus Torvalds
     
  • Using the mm-internal kernel_[sg]et_mempolicy() helper allows us to get
    rid of the mm-internal calls to the sys_[sg]et_mempolicy() syscalls.

    This patch is part of a series which removes in-kernel calls to syscalls.
    On this basis, the syscall entry path can be streamlined. For details, see
    http://lkml.kernel.org/r/20180325162527.GA17492@light.dominikbrodowski.net

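    The conversion pattern is roughly the following (a sketch for
    set_mempolicy; the get side, and the mbind/migrate_pages patches
    below, are analogous):

        static long kernel_set_mempolicy(int mode,
                                         const unsigned long __user *nmask,
                                         unsigned long maxnode)
        {
                /* ... former sys_set_mempolicy() body moves here ... */
        }

        SYSCALL_DEFINE3(set_mempolicy, int, mode,
                        const unsigned long __user *, nmask,
                        unsigned long, maxnode)
        {
                return kernel_set_mempolicy(mode, nmask, maxnode);
        }
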
    Cc: Al Viro
    Cc: linux-mm@kvack.org
    Cc: Andrew Morton
    Signed-off-by: Dominik Brodowski

    Dominik Brodowski
     
  • Using the mm-internal kernel_mbind() helper allows us to get rid of the
    mm-internal call to the sys_mbind() syscall.

    This patch is part of a series which removes in-kernel calls to syscalls.
    On this basis, the syscall entry path can be streamlined. For details, see
    http://lkml.kernel.org/r/20180325162527.GA17492@light.dominikbrodowski.net

    Cc: Al Viro
    Cc: linux-mm@kvack.org
    Cc: Andrew Morton
    Signed-off-by: Dominik Brodowski

    Dominik Brodowski
     
  • Move compat_sys_migrate_pages() to mm/mempolicy.c and make it call a newly
    introduced helper -- kernel_migrate_pages() -- instead of the syscall.

    This patch is part of a series which removes in-kernel calls to syscalls.
    On this basis, the syscall entry path can be streamlined. For details, see
    http://lkml.kernel.org/r/20180325162527.GA17492@light.dominikbrodowski.net

    Cc: Al Viro
    Cc: linux-mm@kvack.org
    Cc: Andrew Morton
    Signed-off-by: Dominik Brodowski

    Dominik Brodowski
     

23 Mar, 2018

1 commit

  • Alexander reported a use of uninitialized memory in __mpol_equal(),
    which is caused by incorrect use of preferred_node.

    When a mempolicy is in mode MPOL_PREFERRED with the MPOL_F_LOCAL flag
    set, it uses numa_node_id() instead of preferred_node; __mpol_equal(),
    however, uses preferred_node without checking whether MPOL_F_LOCAL is
    set or not.

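    The fix in __mpol_equal() is roughly of this shape (a sketch of the
    MPOL_PREFERRED case):

        case MPOL_PREFERRED:
                /*
                 * a's flags equal b's flags (checked earlier), so a local
                 * policy never reads the possibly uninitialized node.
                 */
                if (a->flags & MPOL_F_LOCAL)
                        return true;
                return a->v.preferred_node == b->v.preferred_node;
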
    [akpm@linux-foundation.org: slight comment tweak]
    Link: http://lkml.kernel.org/r/4ebee1c2-57f6-bcb8-0e2d-1833d1ee0bb7@huawei.com
    Fixes: fc36b8d3d819 ("mempolicy: use MPOL_F_LOCAL to Indicate Preferred Local Policy")
    Signed-off-by: Yisheng Xie
    Reported-by: Alexander Potapenko
    Tested-by: Alexander Potapenko
    Reviewed-by: Andrew Morton
    Cc: Dmitriy Vyukov
    Cc: Vlastimil Babka
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Yisheng Xie
     

01 Feb, 2018

1 commit

  • Dan Carpenter has noticed that the mbind migration callback (new_page)
    can get a NULL vma pointer and choke on it inside alloc_huge_page_vma,
    which relies on the VMA to get the hstate. We used to BUG_ON this
    case, but the BUG_ON has been removed recently by "hugetlb, mempolicy:
    fix the mbind hugetlb migration".

    The proper way to handle this is to get the hstate from the migrated
    page and rely on huge_node (resp. get_vma_policy) to do the right
    thing with a NULL VMA. We currently fall back to the default mempolicy
    in that case, which is in line with what the THP path does here.

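    The callback then derives the hstate from the page itself, roughly (a
    sketch of the hugetlb branch in new_page()):

        if (PageHuge(page))
                /* take the hstate from the page, not a maybe-NULL vma */
                return alloc_huge_page_vma(page_hstate(compound_head(page)),
                                           vma, address);
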
    Link: http://lkml.kernel.org/r/20180110104712.GR1732@dhcp22.suse.cz
    Signed-off-by: Michal Hocko
    Reported-by: Dan Carpenter
    Cc: Naoya Horiguchi
    Cc: Mike Kravetz
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko