05 Mar, 2020

2 commits

  • commit f42f25526502d851d0e3ca1e46297da8aafce8a7 upstream.

    If thp defrag setting "defer" is used and a newline is *not* used when
    writing to the sysfs file, this is interpreted as the "defer+madvise"
    option.

    This is because we do prefix matching and if five characters are written
    without a newline, the current code ends up comparing to the first five
    bytes of the "defer+madvise" option and using that instead.

    Use the more appropriate sysfs_streq() that handles the trailing newline
    for us. Since this doubles as a nice cleanup, do it in enabled_store()
    as well.

    The current implementation relies on prefix matching: the number of
    bytes compared is either the number of bytes written or the length of
    the option being compared. With a newline, "defer\n" does not match
    "defer+madvise"; without a newline, however, "defer" is considered to
    match "defer+madvise" (prefix matching only compares the first five
    bytes). The end result is that writing "defer" is broken unless it has an
    additional trailing character.

    This means that writing "madv" in the past would match and set
    "madvise". With strict checking, that no longer is the case but it is
    unlikely anybody is currently doing this.
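
    As a standalone illustration (plain userspace C, not the kernel code itself),
    the difference between prefix matching on the written bytes and a
    sysfs_streq()-style comparison that tolerates one trailing newline could be
    sketched like this:

    #include <stdbool.h>
    #include <stdio.h>
    #include <string.h>

    /* prefix match: compare only the number of bytes that were written */
    static bool prefix_match(const char *buf, size_t count, const char *opt)
    {
            return strncmp(buf, opt, count) == 0;
    }

    /* sysfs_streq()-like: exact match, ignoring one trailing '\n' */
    static bool streq_newline(const char *buf, const char *opt)
    {
            size_t n = strlen(buf);

            if (n && buf[n - 1] == '\n')
                    n--;
            return strlen(opt) == n && strncmp(buf, opt, n) == 0;
    }

    int main(void)
    {
            /* "defer" with no newline wrongly prefix-matches "defer+madvise" */
            printf("%d\n", prefix_match("defer", 5, "defer+madvise")); /* 1 */
            printf("%d\n", streq_newline("defer", "defer+madvise"));   /* 0 */
            printf("%d\n", streq_newline("defer\n", "defer"));         /* 1 */
            return 0;
    }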

    Link: http://lkml.kernel.org/r/alpine.DEB.2.21.2001171411020.56385@chino.kir.corp.google.com
    Fixes: 21440d7eb904 ("mm, thp: add new defer+madvise defrag option")
    Signed-off-by: David Rientjes
    Suggested-by: Andrew Morton
    Acked-by: Vlastimil Babka
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    David Rientjes
     
  • commit cb829624867b5ab10bc6a7036d183b1b82bfe9f8 upstream.

    The page could be a tail page; if this is the case, this BUG_ON will
    never be triggered.

    Link: http://lkml.kernel.org/r/20200110032610.26499-1-richardw.yang@linux.intel.com
    Fixes: e9b61f19858a ("thp: reintroduce split_huge_page()")

    Signed-off-by: Wei Yang
    Acked-by: Kirill A. Shutemov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Wei Yang
     

23 Jan, 2020

1 commit

  • commit 97d3d0f9a1cf132c63c0b8b8bd497b8a56283dd9 upstream.

    Patch series "Fix two above-47bit hint address vs. THP bugs".

    The two get_unmapped_area() implementations have to be fixed to provide
    THP-friendly mappings if above-47bit hint address is specified.

    This patch (of 2):

    Filesystems use thp_get_unmapped_area() to provide THP-friendly
    mappings. For DAX in particular.

    Normally, the kernel doesn't create userspace mappings above 47-bit,
    even if the machine allows this (such as with 5-level paging on x86-64).
    Not all user space is ready to handle wide addresses. It's known that
    at least some JIT compilers use higher bits in pointers to encode their
    information.

    Userspace can ask for allocation from full address space by specifying
    hint address (with or without MAP_FIXED) above 47-bits. If the
    application doesn't need a particular address, but wants to allocate
    from whole address space it can specify -1 as a hint address.

    Unfortunately, this trick breaks thp_get_unmapped_area(): the function
    would not try to allocate a PMD-aligned area if *any* hint address was
    specified.

    Modify the routine to handle it correctly:

    - Try to allocate the space at the specified hint address with length
    padding required for PMD alignment.
    - If that fails, retry without length padding (but with the same hint
    address).
    - If the returned address matches the hint address, return it.
    - Otherwise, align the address as required for THP and return.

    The user-specified hint address is passed down to get_unmapped_area(), so
    an above-47-bit hint address will be taken into account without breaking the
    alignment requirements.
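
    A rough userspace sketch of the hint mechanism described above (illustration
    only; the hint value and mapping size are arbitrary):

    #include <stdio.h>
    #include <sys/mman.h>

    int main(void)
    {
            size_t len = 16UL << 20;            /* 16 MiB, a few PMD-sized units */
            void *hint = (void *)(1UL << 48);   /* above-47-bit hint, no MAP_FIXED */

            void *p = mmap(hint, len, PROT_READ | PROT_WRITE,
                           MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
            if (p == MAP_FAILED) {
                    perror("mmap");
                    return 1;
            }
            /* for file/DAX mappings the same kind of hint is what now reaches
             * thp_get_unmapped_area(), which keeps the PMD-friendly alignment */
            printf("mapped at %p\n", p);
            return 0;
    }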

    Link: http://lkml.kernel.org/r/20191220142548.7118-2-kirill.shutemov@linux.intel.com
    Fixes: b569bab78d8d ("x86/mm: Prepare to expose larger address space to userspace")
    Signed-off-by: Kirill A. Shutemov
    Reported-by: Thomas Willhalm
    Tested-by: Dan Williams
    Cc: "Aneesh Kumar K . V"
    Cc: "Bruggeman, Otto G"
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Kirill A. Shutemov
     

19 Oct, 2019

1 commit

  • Make sure split_huge_page_to_list() handles the state of shmem THP and
    file THP properly.

    Link: http://lkml.kernel.org/r/20191017164223.2762148-3-songliubraving@fb.com
    Fixes: 60fbf0ab5da1 ("mm,thp: stats for file backed THP")
    Signed-off-by: Kirill A. Shutemov
    Signed-off-by: Song Liu
    Tested-by: Song Liu
    Acked-by: Yang Shi
    Cc: Matthew Wilcox (Oracle)
    Cc: Oleg Nesterov
    Cc: Srikar Dronamraju
    Cc: William Kucharski
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     

29 Sep, 2019

3 commits

  • Merge hugepage allocation updates from David Rientjes:
    "We (mostly Linus, Andrea, and myself) have been discussing offlist how
    to implement a sane default allocation strategy for hugepages on NUMA
    platforms.

    With these reverts in place, the page allocator will happily allocate
    a remote hugepage immediately rather than try to make a local hugepage
    available. This incurs a substantial performance degradation when
    memory compaction would have otherwise made a local hugepage
    available.

    This series reverts those reverts and attempts to propose a more sane
    default allocation strategy specifically for hugepages. Andrea
    acknowledges this is likely to fix the swap storms that he originally
    reported that resulted in the patches that removed __GFP_THISNODE from
    hugepage allocations.

    The immediate goal is to return 5.3 to the behavior the kernel has
    implemented over the past several years so that remote hugepages are
    not immediately allocated when local hugepages could have been made
    available because the increased access latency is untenable.

    The next goal is to introduce a sane default allocation strategy for
    hugepages allocations in general regardless of the configuration of
    the system so that we prevent thrashing of local memory when
    compaction is unlikely to succeed and can prefer remote hugepages over
    remote native pages when the local node is low on memory."

    Note on timing: this reverts the hugepage VM behavior changes that got
    introduced fairly late in the 5.3 cycle, and that fixed a huge
    performance regression for certain loads that had been around since
    4.18.

    Andrea had this note:

    "The regression of 4.18 was that it was taking hours to start a VM
    where 3.10 was only taking a few seconds, I reported all the details
    on lkml when it was finally tracked down in August 2018.

    https://lore.kernel.org/linux-mm/20180820032640.9896-2-aarcange@redhat.com/

    __GFP_THISNODE in MADV_HUGEPAGE made the above enterprise vfio
    workload degrade like in the "current upstream" above. And it still
    would have been that bad as above until 5.3-rc5"

    where the bad behavior ends up happening as you fill up a local node,
    and without that change, you'd get into the nasty swap storm behavior
    due to compaction working overtime to make room for more memory on the
    nodes.

    As a result 5.3 got the two performance fix reverts in rc5.

    However, David Rientjes then noted that those performance fixes in turn
    regressed performance for other loads - although not quite to the same
    degree. He suggested reverting the reverts and instead replacing them
    with two small changes to how hugepage allocations are done (patch
    descriptions rephrased by me):

    - "avoid expensive reclaim when compaction may not succeed": just admit
    that the allocation failed when you're trying to allocate a huge-page
    and compaction wasn't successful.

    - "allow hugepage fallback to remote nodes when madvised": when that
    node-local huge-page allocation failed, retry without forcing the
    local node.

    but by then I judged it too late to replace the fixes for a 5.3 release.
    So 5.3 was released with behavior that harked back to the pre-4.18 logic.

    But now we're in the merge window for 5.4, and we can see if this
    alternate model fixes not just the horrendous swap storm behavior, but
    also restores the performance regression that the late reverts caused.

    Fingers crossed.

    * emailed patches from David Rientjes :
    mm, page_alloc: allow hugepage fallback to remote nodes when madvised
    mm, page_alloc: avoid expensive reclaim when compaction may not succeed
    Revert "Revert "Revert "mm, thp: consolidate THP gfp handling into alloc_hugepage_direct_gfpmask""
    Revert "Revert "mm, thp: restore node-local hugepage allocations""

    Linus Torvalds
     
  • This reverts commit 92717d429b38e4f9f934eed7e605cc42858f1839.

    Since commit a8282608c88e ("Revert "mm, thp: restore node-local hugepage
    allocations"") is reverted in this series, it is better to restore the
    previous 5.2 behavior between the thp allocation and the page allocator
    rather than to attempt any consolidation or cleanup for a policy that is
    now reverted. It's less risky during an rc cycle and subsequent patches
    in this series further modify the same policy that the pre-5.3 behavior
    implements.

    Consolidation and cleanup can be done subsequent to a sane default page
    allocation strategy, so this patch reverts a cleanup done on a strategy
    that is now reverted and thus is the least risky option.

    Signed-off-by: David Rientjes
    Cc: Andrea Arcangeli
    Cc: Michal Hocko
    Cc: Mel Gorman
    Cc: Vlastimil Babka
    Cc: Stefan Priebe - Profihost AG
    Cc: "Kirill A. Shutemov"
    Cc: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • This reverts commit a8282608c88e08b1782141026eab61204c1e533f.

    The commit references the original intended semantic for MADV_HUGEPAGE
    which has subsequently taken on three unique purposes:

    - enables or disables thp for a range of memory depending on the system's
    config (is thp "enabled" set to "always" or "madvise"),

    - determines the synchronous compaction behavior for thp allocations at
    fault (is thp "defrag" set to "always", "defer+madvise", or "madvise"),
    and

    - reverts a previous MADV_NOHUGEPAGE (there is no madvise mode to only
    clear previous hugepage advice).

    These are the three purposes that currently exist in 5.2 and over the
    past several years that userspace has been written around. Adding a
    NUMA locality preference adds a fourth dimension to an already conflated
    advice mode.
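
    For reference, a minimal userspace use of the advice (illustration only; the
    mapping size is arbitrary):

    #include <stdio.h>
    #include <sys/mman.h>

    int main(void)
    {
            size_t len = 64UL << 20;
            void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                           MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
            if (p == MAP_FAILED)
                    return 1;

            /* opt this range into THP, subject to the system-wide settings */
            if (madvise(p, len, MADV_HUGEPAGE))
                    perror("madvise(MADV_HUGEPAGE)");

            /* MADV_NOHUGEPAGE sets the opposite advice; there is no mode that
             * merely clears previous hugepage advice */
            if (madvise(p, len, MADV_NOHUGEPAGE))
                    perror("madvise(MADV_NOHUGEPAGE)");
            return 0;
    }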

    Based on the semantic that MADV_HUGEPAGE has provided over the past
    several years, there exist workloads that use the tunable based on these
    principles: specifically that the allocation should attempt to
    defragment a local node before falling back. It is agreed that remote
    hugepages typically (but not always) have a better access latency than
    remote native pages, although on Naples this is at parity for
    intersocket.

    The revert commit that this patch reverts allows hugepage allocation to
    immediately allocate remotely when local memory is fragmented. This is
    contrary to the semantic of MADV_HUGEPAGE over the past several years:
    that is, memory compaction should be attempted locally before falling
    back.

    The performance degradation of remote hugepages over local hugepages on
    Rome, for example, is 53.5% increased access latency. For this reason,
    the goal is to revert back to the 5.2 and previous behavior that would
    attempt local defragmentation before falling back. With the patch that
    is reverted by this patch, we see performance degradations at the tail
    because the allocator happily allocates the remote hugepage rather than
    even attempting to make a local hugepage available.

    zone_reclaim_mode is not a solution to this problem since it does not
    only impact hugepage allocations but rather changes the memory
    allocation strategy for *all* page allocations.

    Signed-off-by: David Rientjes
    Cc: Andrea Arcangeli
    Cc: Michal Hocko
    Cc: Mel Gorman
    Cc: Vlastimil Babka
    Cc: Stefan Priebe - Profihost AG
    Cc: "Kirill A. Shutemov"
    Cc: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     

25 Sep, 2019

3 commits

    Currently the THP deferred split shrinker is not memcg aware; this may cause
    premature OOM with some configurations. For example, the below test would
    run into premature OOM easily:

    $ cgcreate -g memory:thp
    $ echo 4G > /sys/fs/cgroup/memory/thp/memory/limit_in_bytes
    $ cgexec -g memory:thp transhuge-stress 4000

    transhuge-stress comes from kernel selftest.

    It is easy to hit OOM, but there are still a lot of THPs on the deferred split
    queue; memcg direct reclaim can't touch them since the deferred split
    shrinker is not memcg aware.

    Convert the deferred split shrinker to be memcg aware by introducing a
    per-memcg deferred split queue. A THP goes on the per-memcg deferred split
    queue if it belongs to a memcg, and on the per-node queue otherwise. When the
    page is migrated to another memcg, it will be moved to the target memcg's
    deferred split queue too.

    Reuse the second tail page's deferred_list for the per-memcg list, since the
    same THP can't be on multiple deferred split queues.
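
    The resulting queue selection can be sketched roughly as follows (an
    approximation of the queue-selection helper, not necessarily the exact
    code):

    static struct deferred_split *get_deferred_split_queue(struct page *page)
    {
            struct mem_cgroup *memcg = compound_head(page)->mem_cgroup;
            struct pglist_data *pgdat = NODE_DATA(page_to_nid(page));

            /* a THP charged to a memcg sits on that memcg's queue ... */
            if (memcg)
                    return &memcg->deferred_split_queue;
            /* ... otherwise it stays on the per-node queue */
            return &pgdat->deferred_split_queue;
    }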

    [yang.shi@linux.alibaba.com: simplify deferred split queue dereference per Kirill Tkhai]
    Link: http://lkml.kernel.org/r/1566496227-84952-5-git-send-email-yang.shi@linux.alibaba.com
    Link: http://lkml.kernel.org/r/1565144277-36240-5-git-send-email-yang.shi@linux.alibaba.com
    Signed-off-by: Yang Shi
    Acked-by: Kirill A. Shutemov
    Reviewed-by: Kirill Tkhai
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: "Kirill A . Shutemov"
    Cc: Hugh Dickins
    Cc: Shakeel Butt
    Cc: David Rientjes
    Cc: Qian Cai
    Cc: Vladimir Davydov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Yang Shi
     
  • Patch series "Make deferred split shrinker memcg aware", v6.

    Currently the THP deferred split shrinker is not memcg aware; this may cause
    premature OOM with some configurations. For example, the below test would
    run into premature OOM easily:

    $ cgcreate -g memory:thp
    $ echo 4G > /sys/fs/cgroup/memory/thp/memory/limit_in_bytes
    $ cgexec -g memory:thp transhuge-stress 4000

    transhuge-stress comes from kernel selftest.

    It is easy to hit OOM, but there are still a lot of THPs on the deferred split
    queue; memcg direct reclaim can't touch them since the deferred split
    shrinker is not memcg aware.

    Convert the deferred split shrinker to be memcg aware by introducing a
    per-memcg deferred split queue. A THP goes on the per-memcg deferred split
    queue if it belongs to a memcg, and on the per-node queue otherwise. When the
    page is migrated to another memcg, it will be moved to the target memcg's
    deferred split queue too.

    Reuse the second tail page's deferred_list for the per-memcg list, since the
    same THP can't be on multiple deferred split queues.

    Make the deferred split shrinker not depend on memcg kmem, since it is not
    slab. It doesn't make sense to skip shrinking THPs just because memcg kmem is
    disabled.

    With the above change, the test demonstrated above doesn't trigger OOM even
    with cgroup.memory=nokmem.

    This patch (of 4):

    Put split_queue, split_queue_lock and split_queue_len into a struct in
    order to reduce code duplication when we convert deferred_split to memcg
    aware in the later patches.
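
    The grouping is roughly the following (a sketch; the exact layout in the
    tree may differ):

    struct deferred_split {
            spinlock_t split_queue_lock;
            struct list_head split_queue;
            unsigned long split_queue_len;
    };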

    Link: http://lkml.kernel.org/r/1565144277-36240-2-git-send-email-yang.shi@linux.alibaba.com
    Signed-off-by: Yang Shi
    Suggested-by: "Kirill A . Shutemov"
    Acked-by: Kirill A. Shutemov
    Reviewed-by: Kirill Tkhai
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Hugh Dickins
    Cc: Shakeel Butt
    Cc: David Rientjes
    Cc: Qian Cai
    Cc: Vladimir Davydov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Yang Shi
     
  • Transparent Huge Pages are currently stored in i_pages as pointers to
    consecutive subpages. This patch changes that to storing consecutive
    pointers to the head page in preparation for storing huge pages more
    efficiently in i_pages.

    Large parts of this are "inspired" by Kirill's patch
    https://lore.kernel.org/lkml/20170126115819.58875-2-kirill.shutemov@linux.intel.com/

    Kirill and Huang Ying contributed several fixes.
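
    Conceptually, a page cache lookup then returns the head page and the caller
    derives the subpage from the index; a helper for that could be sketched as
    (an approximation, using compound_nr() as mentioned below):

    static inline struct page *find_subpage(struct page *head, pgoff_t index)
    {
            /* hugetlbfs stores and wants the head page regardless */
            if (PageHuge(head))
                    return head;
            return head + (index & (compound_nr(head) - 1));
    }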

    [willy@infradead.org: use compound_nr, squish uninit-var warning]
    Link: http://lkml.kernel.org/r/20190731210400.7419-1-willy@infradead.org
    Signed-off-by: Matthew Wilcox
    Acked-by: Jan Kara
    Reviewed-by: Kirill Shutemov
    Reviewed-by: Song Liu
    Tested-by: Song Liu
    Tested-by: William Kucharski
    Reviewed-by: William Kucharski
    Tested-by: Qian Cai
    Tested-by: Mikhail Gavrilov
    Cc: Hugh Dickins
    Cc: Chris Wilson
    Cc: Song Liu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matthew Wilcox (Oracle)
     

25 Aug, 2019

1 commit

    The THP splitting path is missing the split_page_owner() call that
    split_page() has.

    As a result, split THP pages are wrongly reported in the page_owner file
    as order-9 pages. Furthermore when the former head page is freed, the
    remaining former tail pages are not listed in the page_owner file at
    all. This patch fixes that by adding the split_page_owner() call into
    __split_huge_page().

    Link: http://lkml.kernel.org/r/20190820131828.22684-2-vbabka@suse.cz
    Fixes: a9627bc5e34e ("mm/page_owner: introduce split_page_owner and replace manual handling")
    Reported-by: Kirill A. Shutemov
    Signed-off-by: Vlastimil Babka
    Cc: Michal Hocko
    Cc: Mel Gorman
    Cc: Matthew Wilcox
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     

14 Aug, 2019

2 commits

  • This reverts commit 2f0799a0ffc033b ("mm, thp: restore node-local
    hugepage allocations").

    commit 2f0799a0ffc033b was rightfully applied to avoid the risk of a
    severe regression that was reported by the kernel test robot at the end
    of the merge window. We now understand that the regression was a false
    positive, caused by a significant increase in fairness during a
    swap-thrashing benchmark. So it's safe to re-apply the fix and continue
    improving the code from there. The benchmark that reported the
    regression is very useful, but it provides a meaningful result only when
    there is no significant alteration in fairness during the workload. The
    removal of __GFP_THISNODE increased fairness.

    __GFP_THISNODE cannot be used in the generic page faults path for new
    memory allocations under the MPOL_DEFAULT mempolicy, or the allocation
    behavior significantly deviates from what the MPOL_DEFAULT semantics are
    supposed to be for THP and 4k allocations alike.

    Setting THP defrag to "always" or using MADV_HUGEPAGE (with THP defrag
    set to "madvise") has never meant to provide an implicit MPOL_BIND on
    the "current" node the task is running on, causing swap storms and
    providing a much more aggressive behavior than even zone_reclaim_node =
    3.

    Any workload that could have benefited from __GFP_THISNODE now has to
    enable zone_reclaim_mode=1||2||3. __GFP_THISNODE implicitly provided
    the zone_reclaim_mode behavior, but it only did so if THP was enabled:
    if THP was disabled, there would have been no chance to get any 4k page
    from the current node if the current node was full of pagecache, which
    further shows how this __GFP_THISNODE was misplaced in MADV_HUGEPAGE.
    MADV_HUGEPAGE has never been intended to provide any zone_reclaim_mode
    semantics, in fact the two are orthogonal, zone_reclaim_mode = 1|2|3
    must work exactly the same with MADV_HUGEPAGE set or not.

    The performance characteristic of memory depends on the hardware
    details. The numbers below are obtained on Naples/EPYC architecture and
    the N/A projection extends them to show what we should aim for in the
    future as a good THP NUMA locality default. The benchmark used
    exercises random memory seeks (note: the cost of the page faults is not
    part of the measurement).

    D0 THP | D0 4k | D1 THP | D1 4k | D2 THP | D2 4k | D3 THP | D3 4k | ...
    0% | +43% | +45% | +106% | +131% | +224% | N/A | N/A

    D0 means distance zero (i.e. local memory), D1 means distance one (i.e.
    intra socket memory), D2 means distance two (i.e. inter socket memory),
    etc...

    For the guest physical memory allocated by qemu and for guest mode
    kernel the performance characteristic of RAM is more complex and an
    ideal default could be:

    D0 THP | D1 THP | D0 4k | D2 THP | D1 4k | D3 THP | D2 4k | D3 4k | ...
    0% | +58% | +101% | N/A | +222% | N/A | N/A | N/A

    NOTE: the N/A are projections and haven't been measured yet, the
    measurement in this case is done on a 1950x with only two NUMA nodes.
    The THP case here means THP was used both in the host and in the guest.

    After applying this commit the THP NUMA locality order that we'll get
    out of MADV_HUGEPAGE is this:

    D0 THP | D1 THP | D2 THP | D3 THP | ... | D0 4k | D1 4k | D2 4k | D3 4k | ...

    Before this commit it was:

    D0 THP | D0 4k | D1 4k | D2 4k | D3 4k | ...

    Even if we ignore the breakage of large workloads that can't fit in a
    single node that the __GFP_THISNODE implicit "current node" mbind
    caused, the THP NUMA locality order provided by __GFP_THISNODE was still
    not the one we shall aim for in the long term (i.e. the first one at
    the top).

    After this commit is applied, we can introduce a new allocator multi-order
    API and replace those two alloc_pages_vma calls in the page
    fault path with a single multi-order call:

    unsigned int order = (1 << HPAGE_PMD_ORDER) | (1 << 0);
    page = alloc_pages_multi_order(..., &order);
    if (!page)
            goto out;
    if (!(order & (1 << 0))) {
            VM_WARN_ON(order != 1 << HPAGE_PMD_ORDER);
            /* THP fault */
    } else {
            VM_WARN_ON(order != 1 << 0);
            /* 4k fallback */
    }

    The page allocator logic has to be altered so that when it fails on any
    zone with order 9, it tries again with order 0 before falling
    back to the next zone in the zonelist.

    After that we need to do more measurements and evaluate if adding an
    opt-in feature for guest mode is worth it, to swap "DN 4k | DN+1 THP"
    with "DN+1 THP | DN 4k" at every NUMA distance crossing.

    Link: http://lkml.kernel.org/r/20190503223146.2312-3-aarcange@redhat.com
    Signed-off-by: Andrea Arcangeli
    Acked-by: Michal Hocko
    Acked-by: Mel Gorman
    Cc: Vlastimil Babka
    Cc: David Rientjes
    Cc: Zi Yan
    Cc: Stefan Priebe - Profihost AG
    Cc: "Kirill A. Shutemov"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     
  • Patch series "reapply: relax __GFP_THISNODE for MADV_HUGEPAGE mappings".

    The fixes for what was originally reported as "pathological THP
    behavior" were rightfully reverted to be sure not to introduce
    regressions at the end of a merge window after a severe regression report
    from the kernel bot. We can safely re-apply them now that we have had time
    to analyze the problem.

    The mm process worked fine, because the good fixes were eventually
    committed upstream without excessive delay.

    The regression reported by the kernel bot however forced us to revert
    the good fixes to be sure not to introduce regressions and to give us
    the time to analyze the issue further. The silver lining is that this
    extra time allowed us to think more about this issue and also to plan a
    future direction for further improving THP NUMA
    locality.

    This patch (of 2):

    This reverts commit 356ff8a9a78fb35d ("Revert "mm, thp: consolidate THP
    gfp handling into alloc_hugepage_direct_gfpmask"). So it reapplies
    89c83fb539f954 ("mm, thp: consolidate THP gfp handling into
    alloc_hugepage_direct_gfpmask").

    Consolidation of the THP allocation flags in one place was meant to
    be a cleanup to more easily handle otherwise scattered code which
    imposes a maintenance burden. There were no real problems observed
    with the gfp mask consolidation, but the reversion was rushed through
    without a larger consensus regardless.

    This patch brings the consolidation back because it should make
    long-term maintainability easier, as well as allowing future
    changes to be less error prone.

    [mhocko@kernel.org: changelog additions]
    Link: http://lkml.kernel.org/r/20190503223146.2312-2-aarcange@redhat.com
    Signed-off-by: Andrea Arcangeli
    Acked-by: Michal Hocko
    Cc: Mel Gorman
    Cc: Vlastimil Babka
    Cc: David Rientjes
    Cc: Zi Yan
    Cc: Stefan Priebe - Profihost AG
    Cc: "Kirill A. Shutemov"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     

19 Jul, 2019

2 commits

  • Commit 7635d9cbe832 ("mm, thp, proc: report THP eligibility for each
    vma") introduced THPeligible bit for processes' smaps. But, when
    checking the eligibility for shmem vma, __transparent_hugepage_enabled()
    is called to override the result from shmem_huge_enabled(). It may
    result in the anonymous vma's THP flag overriding shmem's. For example,
    when running a simple test which creates THP for shmem, but with anonymous THP
    disabled, reading the process's smaps may show:

    7fc92ec00000-7fc92f000000 rw-s 00000000 00:14 27764 /dev/shm/test
    Size: 4096 kB
    ...
    [snip]
    ...
    ShmemPmdMapped: 4096 kB
    ...
    [snip]
    ...
    THPeligible: 0

    And, /proc/meminfo does show THP allocated and PMD mapped too:

    ShmemHugePages: 4096 kB
    ShmemPmdMapped: 4096 kB

    This doesn't make too much sense. The shmem objects should be treated
    separately from anonymous THP. Calling shmem_huge_enabled() together with
    a check of MMF_DISABLE_THP sounds good enough. And, we can skip the stack
    and DAX vma checks since we have already checked whether the vma is shmem.

    Also check if vma is suitable for THP by calling
    transhuge_vma_suitable().

    And minor fix to smaps output format and documentation.

    Link: http://lkml.kernel.org/r/1560401041-32207-3-git-send-email-yang.shi@linux.alibaba.com
    Fixes: 7635d9cbe832 ("mm, thp, proc: report THP eligibility for each vma")
    Signed-off-by: Yang Shi
    Acked-by: Hugh Dickins
    Cc: Kirill A. Shutemov
    Cc: Michal Hocko
    Cc: Vlastimil Babka
    Cc: David Rientjes
    Cc: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Yang Shi
     
    transhuge_vma_suitable() was only available for shmem THP, but anonymous
    THP has the same check except for the pgoff check. And, it will be used for the
    THP eligibility check in a later patch, so make it available for all kinds of
    THP. This also helps reduce code duplication slightly.

    Since anonymous THP doesn't have to check pgoff, make the pgoff check apply to
    shmem vmas only.

    Also regroup some functions in include/linux/mm.h to solve a compile issue,
    since transhuge_vma_suitable() needs to call vma_is_anonymous(), which was
    defined after huge_mm.h is included.
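
    The resulting check can be sketched roughly as (an approximation, not
    necessarily the exact code):

    static inline bool transhuge_vma_suitable(struct vm_area_struct *vma,
                    unsigned long haddr)
    {
            /* anonymous vmas don't have to line up with a file offset */
            if (!vma_is_anonymous(vma)) {
                    if (((vma->vm_start >> PAGE_SHIFT) & HPAGE_CACHE_INDEX_MASK) !=
                                    (vma->vm_pgoff & HPAGE_CACHE_INDEX_MASK))
                            return false;
            }

            if (haddr < vma->vm_start || haddr + HPAGE_PMD_SIZE > vma->vm_end)
                    return false;
            return true;
    }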

    [akpm@linux-foundation.org: fix typo]
    [yang.shi@linux.alibaba.com: v4]
    Link: http://lkml.kernel.org/r/1563400758-124759-2-git-send-email-yang.shi@linux.alibaba.com
    Link: http://lkml.kernel.org/r/1560401041-32207-2-git-send-email-yang.shi@linux.alibaba.com
    Signed-off-by: Yang Shi
    Acked-by: Hugh Dickins
    Cc: Kirill A. Shutemov
    Cc: Michal Hocko
    Cc: Vlastimil Babka
    Cc: David Rientjes
    Cc: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Yang Shi
     

06 Jul, 2019

1 commit

  • This reverts commit 5fd4ca2d84b249f0858ce28cf637cf25b61a398f.

    Mikhail Gavrilov reports that it causes the VM_BUG_ON_PAGE() in
    __delete_from_swap_cache() to trigger:

    page:ffffd6d34dff0000 refcount:1 mapcount:1 mapping:ffff97812323a689 index:0xfecec363
    anon
    flags: 0x17fffe00080034(uptodate|lru|active|swapbacked)
    raw: 0017fffe00080034 ffffd6d34c67c508 ffffd6d3504b8d48 ffff97812323a689
    raw: 00000000fecec363 0000000000000000 0000000100000000 ffff978433ace000
    page dumped because: VM_BUG_ON_PAGE(entry != page)
    page->mem_cgroup:ffff978433ace000
    ------------[ cut here ]------------
    kernel BUG at mm/swap_state.c:170!
    invalid opcode: 0000 [#1] SMP NOPTI
    CPU: 1 PID: 221 Comm: kswapd0 Not tainted 5.2.0-0.rc2.git0.1.fc31.x86_64 #1
    Hardware name: System manufacturer System Product Name/ROG STRIX X470-I GAMING, BIOS 2202 04/11/2019
    RIP: 0010:__delete_from_swap_cache+0x20d/0x240
    Code: 30 65 48 33 04 25 28 00 00 00 75 4a 48 83 c4 38 5b 5d 41 5c 41 5d 41 5e 41 5f c3 48 c7 c6 2f dc 0f 8a 48 89 c7 e8 93 1b fd ff 0b 48 c7 c6 a8 74 0f 8a e8 85 1b fd ff 0f 0b 48 c7 c6 a8 7d 0f
    RSP: 0018:ffffa982036e7980 EFLAGS: 00010046
    RAX: 0000000000000021 RBX: 0000000000000040 RCX: 0000000000000006
    RDX: 0000000000000000 RSI: 0000000000000086 RDI: ffff97843d657900
    RBP: 0000000000000001 R08: ffffa982036e7835 R09: 0000000000000535
    R10: ffff97845e21a46c R11: ffffa982036e7835 R12: ffff978426387120
    R13: 0000000000000000 R14: ffffd6d34dff0040 R15: ffffd6d34dff0000
    FS: 0000000000000000(0000) GS:ffff97843d640000(0000) knlGS:0000000000000000
    CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: 00002cba88ef5000 CR3: 000000078a97c000 CR4: 00000000003406e0
    Call Trace:
    delete_from_swap_cache+0x46/0xa0
    try_to_free_swap+0xbc/0x110
    swap_writepage+0x13/0x70
    pageout.isra.0+0x13c/0x350
    shrink_page_list+0xc14/0xdf0
    shrink_inactive_list+0x1e5/0x3c0
    shrink_node_memcg+0x202/0x760
    shrink_node+0xe0/0x470
    balance_pgdat+0x2d1/0x510
    kswapd+0x220/0x420
    kthread+0xfb/0x130
    ret_from_fork+0x22/0x40

    and it's not immediately obvious why it happens. It's too late in the
    rc cycle to do anything but revert for now.

    Link: https://lore.kernel.org/lkml/CABXGCsN9mYmBD-4GaaeW_NrDu+FDXLzr_6x+XNxfmFV6QkYCDg@mail.gmail.com/
    Reported-and-bisected-by: Mikhail Gavrilov
    Suggested-by: Jan Kara
    Cc: Michal Hocko
    Cc: Vlastimil Babka
    Cc: Matthew Wilcox
    Cc: Kirill Shutemov
    Cc: William Kucharski
    Cc: Andrew Morton
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

19 Jun, 2019

1 commit

  • Based on 1 normalized pattern(s):

    this work is licensed under the terms of the gnu gpl version 2 see
    the copying file in the top level directory

    extracted by the scancode license scanner the SPDX license identifier

    GPL-2.0-only

    has been chosen to replace the boilerplate/reference in 35 file(s).

    Signed-off-by: Thomas Gleixner
    Reviewed-by: Kate Stewart
    Reviewed-by: Enrico Weigelt
    Reviewed-by: Allison Randal
    Cc: linux-spdx@vger.kernel.org
    Link: https://lkml.kernel.org/r/20190604081206.797835076@linutronix.de
    Signed-off-by: Greg Kroah-Hartman

    Thomas Gleixner
     

15 May, 2019

5 commits

  • __thp_get_unmapped_area is only used in mm/huge_memory.c. Make it static.
    Tested by building and booting the kernel.

    Link: http://lkml.kernel.org/r/20190504102353.GA22525@bharath12345-Inspiron-5559
    Signed-off-by: Bharath Vedartham
    Acked-by: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Bharath Vedartham
     
    This updates each existing invalidation to use the correct mmu notifier
    event that represents what is happening to the CPU page table. See the
    patch which introduced the events for the rationale behind this.

    Link: http://lkml.kernel.org/r/20190326164747.24405-7-jglisse@redhat.com
    Signed-off-by: Jérôme Glisse
    Reviewed-by: Ralph Campbell
    Reviewed-by: Ira Weiny
    Cc: Christian König
    Cc: Joonas Lahtinen
    Cc: Jani Nikula
    Cc: Rodrigo Vivi
    Cc: Jan Kara
    Cc: Andrea Arcangeli
    Cc: Peter Xu
    Cc: Felix Kuehling
    Cc: Jason Gunthorpe
    Cc: Ross Zwisler
    Cc: Dan Williams
    Cc: Paolo Bonzini
    Cc: Radim Krcmar
    Cc: Michal Hocko
    Cc: Christian Koenig
    Cc: John Hubbard
    Cc: Arnd Bergmann
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jérôme Glisse
     
    CPU page table updates can happen for many reasons, not only as a result
    of a syscall (munmap(), mprotect(), mremap(), madvise(), ...) but also as
    a result of kernel activities (memory compression, reclaim, migration,
    ...).

    Users of the mmu notifier API track changes to the CPU page table and take
    specific action for them. The current API only provides the range of virtual
    addresses affected by the change, not why the change is happening.

    This patchset does the initial mechanical conversion of all the places that
    call mmu_notifier_range_init to also provide the default MMU_NOTIFY_UNMAP
    event as well as the vma if it is known (most invalidations happen against
    a given vma). Passing down the vma allows the users of the mmu notifier to
    inspect the new vma page protection.

    MMU_NOTIFY_UNMAP is always the safe default, as users of the mmu notifier
    should assume that every mapping in the range is going away when that event
    happens. A later patch converts the mm call paths to use a more appropriate
    event for each call.

    This is done as 2 patches so that no call site is forgotten, especially
    as it uses the following coccinelle patch:

    %vm_mm, E3, E4)
    ...>

    @@
    expression E1, E2, E3, E4;
    identifier FN, VMA;
    @@
    FN(..., struct vm_area_struct *VMA, ...) {
    }

    @@
    expression E1, E2, E3, E4;
    identifier FN, VMA;
    @@
    FN(...) {
    struct vm_area_struct *VMA;
    }

    @@
    expression E1, E2, E3, E4;
    identifier FN;
    @@
    FN(...) {
    }
    ---------------------------------------------------------------------->%

    Applied with:
    spatch --all-includes --sp-file mmu-notifier.spatch fs/proc/task_mmu.c --in-place
    spatch --sp-file mmu-notifier.spatch --dir kernel/events/ --in-place
    spatch --sp-file mmu-notifier.spatch --dir mm --in-place

    Link: http://lkml.kernel.org/r/20190326164747.24405-6-jglisse@redhat.com
    Signed-off-by: Jérôme Glisse
    Reviewed-by: Ralph Campbell
    Reviewed-by: Ira Weiny
    Cc: Christian König
    Cc: Joonas Lahtinen
    Cc: Jani Nikula
    Cc: Rodrigo Vivi
    Cc: Jan Kara
    Cc: Andrea Arcangeli
    Cc: Peter Xu
    Cc: Felix Kuehling
    Cc: Jason Gunthorpe
    Cc: Ross Zwisler
    Cc: Dan Williams
    Cc: Paolo Bonzini
    Cc: Radim Krcmar
    Cc: Michal Hocko
    Cc: Christian Koenig
    Cc: John Hubbard
    Cc: Arnd Bergmann
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jérôme Glisse
     
  • Transparent Huge Pages are currently stored in i_pages as pointers to
    consecutive subpages. This patch changes that to storing consecutive
    pointers to the head page in preparation for storing huge pages more
    efficiently in i_pages.

    Large parts of this are "inspired" by Kirill's patch
    https://lore.kernel.org/lkml/20170126115819.58875-2-kirill.shutemov@linux.intel.com/

    [willy@infradead.org: fix swapcache pages]
    Link: http://lkml.kernel.org/r/20190324155441.GF10344@bombadil.infradead.org
    [kirill@shutemov.name: hugetlb stores pages in page cache differently]
    Link: http://lkml.kernel.org/r/20190404134553.vuvhgmghlkiw2hgl@kshutemo-mobl1
    Link: http://lkml.kernel.org/r/20190307153051.18815-1-willy@infradead.org
    Signed-off-by: Matthew Wilcox
    Acked-by: Jan Kara
    Reviewed-by: Kirill Shutemov
    Reviewed-and-tested-by: Song Liu
    Tested-by: William Kucharski
    Reviewed-by: William Kucharski
    Tested-by: Qian Cai
    Cc: Hugh Dickins
    Cc: Song Liu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matthew Wilcox
     
  • Starting with c6f3c5ee40c1 ("mm/huge_memory.c: fix modifying of page
    protection by insert_pfn_pmd()") vmf_insert_pfn_pmd() internally calls
    pmdp_set_access_flags(). That helper enforces a pmd aligned @address
    argument via VM_BUG_ON() assertion.

    Update the implementation to take a 'struct vm_fault' argument directly
    and apply the address alignment fixup internally to fix crash signatures
    like:

    kernel BUG at arch/x86/mm/pgtable.c:515!
    invalid opcode: 0000 [#1] SMP NOPTI
    CPU: 51 PID: 43713 Comm: java Tainted: G OE 4.19.35 #1
    [..]
    RIP: 0010:pmdp_set_access_flags+0x48/0x50
    [..]
    Call Trace:
    vmf_insert_pfn_pmd+0x198/0x350
    dax_iomap_fault+0xe82/0x1190
    ext4_dax_huge_fault+0x103/0x1f0
    ? __switch_to_asm+0x40/0x70
    __handle_mm_fault+0x3f6/0x1370
    ? __switch_to_asm+0x34/0x70
    ? __switch_to_asm+0x40/0x70
    handle_mm_fault+0xda/0x200
    __do_page_fault+0x249/0x4f0
    do_page_fault+0x32/0x110
    ? page_fault+0x8/0x30
    page_fault+0x1e/0x30
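
    The shape of the described change, heavily simplified (a sketch, not the
    exact upstream code):

    vm_fault_t vmf_insert_pfn_pmd(struct vm_fault *vmf, pfn_t pfn, bool write)
    {
            /* apply the alignment fixup internally, so that
             * pmdp_set_access_flags() never sees an unaligned address */
            unsigned long addr = vmf->address & PMD_MASK;

            /* ... set up pgprot/pgtable and insert the entry at 'addr' ... */
    }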

    Link: http://lkml.kernel.org/r/155741946350.372037.11148198430068238140.stgit@dwillia2-desk3.amr.corp.intel.com
    Fixes: c6f3c5ee40c1 ("mm/huge_memory.c: fix modifying of page protection by insert_pfn_pmd()")
    Signed-off-by: Dan Williams
    Reported-by: Piotr Balcer
    Tested-by: Yan Ma
    Tested-by: Pankaj Gupta
    Reviewed-by: Matthew Wilcox
    Reviewed-by: Jan Kara
    Reviewed-by: Aneesh Kumar K.V
    Cc: Chandan Rajendra
    Cc: Souptick Joarder
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dan Williams
     

07 May, 2019

1 commit

  • Pull unified TLB flushing from Ingo Molnar:
    "This contains the generic mmu_gather feature from Peter Zijlstra,
    which is an all-arch unification of TLB flushing APIs, via the
    following (broad) steps:

    - enhance the APIs to cover more arch details

    - convert most TLB flushing arch implementations to the generic
    APIs.

    - remove leftovers of per arch implementations

    After this series every single architecture makes use of the unified
    TLB flushing APIs"

    * 'core-mm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    mm/resource: Use resource_overlaps() to simplify region_intersects()
    ia64/tlb: Eradicate tlb_migrate_finish() callback
    asm-generic/tlb: Remove tlb_table_flush()
    asm-generic/tlb: Remove tlb_flush_mmu_free()
    asm-generic/tlb: Remove CONFIG_HAVE_GENERIC_MMU_GATHER
    asm-generic/tlb: Remove arch_tlb*_mmu()
    s390/tlb: Convert to generic mmu_gather
    asm-generic/tlb: Introduce CONFIG_HAVE_MMU_GATHER_NO_GATHER=y
    arch/tlb: Clean up simple architectures
    um/tlb: Convert to generic mmu_gather
    sh/tlb: Convert SH to generic mmu_gather
    ia64/tlb: Convert to generic mmu_gather
    arm/tlb: Convert to generic mmu_gather
    asm-generic/tlb, arch: Invert CONFIG_HAVE_RCU_TABLE_INVALIDATE
    asm-generic/tlb, ia64: Conditionally provide tlb_migrate_finish()
    asm-generic/tlb: Provide generic tlb_flush() based on flush_tlb_mm()
    asm-generic/tlb, arch: Provide generic tlb_flush() based on flush_tlb_range()
    asm-generic/tlb, arch: Provide generic VIPT cache flush
    asm-generic/tlb, arch: Provide CONFIG_HAVE_MMU_GATHER_PAGE_SIZE
    asm-generic/tlb: Provide a comment

    Linus Torvalds
     

06 Apr, 2019

1 commit

  • With some architectures like ppc64, set_pmd_at() cannot cope with a
    situation where there is already some (different) valid entry present.

    Use pmdp_set_access_flags() instead to modify the pfn, since it is built to
    deal with modifying existing PMD entries.

    This is similar to commit cae85cb8add3 ("mm/memory.c: fix modifying of
    page protection by insert_pfn()")

    We also make a similar update w.r.t. insert_pfn_pud, even though ppc64
    doesn't support PUD pfn entries at the moment.

    Without this patch we also see the below message in the kernel log: "BUG:
    non-zero pgtables_bytes on freeing mm:"
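
    A simplified sketch of the insert_pfn_pmd() change being described (assumed
    shape, not the exact code):

    if (!pmd_none(*pmd)) {
            if (write) {
                    /* modify the existing valid entry with the helper that is
                     * built for exactly that ... */
                    if (pmdp_set_access_flags(vma, addr, pmd, entry, 1))
                            update_mmu_cache_pmd(vma, addr, pmd);
            }
            goto out_unlock;
    }
    /* ... otherwise fall through and create the new entry with set_pmd_at() */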

    Link: http://lkml.kernel.org/r/20190402115125.18803-1-aneesh.kumar@linux.ibm.com
    Signed-off-by: Aneesh Kumar K.V
    Reported-by: Chandan Rajendra
    Reviewed-by: Jan Kara
    Cc: Dan Williams
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Aneesh Kumar K.V
     

03 Apr, 2019

1 commit

  • Move the mmu_gather::page_size things into the generic code instead of
    PowerPC specific bits.

    No change in behavior intended.

    Signed-off-by: Peter Zijlstra (Intel)
    Acked-by: Will Deacon
    Cc: Andrew Morton
    Cc: Andy Lutomirski
    Cc: Aneesh Kumar K.V
    Cc: Borislav Petkov
    Cc: Dave Hansen
    Cc: H. Peter Anvin
    Cc: Linus Torvalds
    Cc: Nick Piggin
    Cc: Peter Zijlstra
    Cc: Rik van Riel
    Cc: Thomas Gleixner
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     

06 Mar, 2019

5 commits

  • Commit a00cc7d9dd93 ("mm, x86: add support for PUD-sized transparent
    hugepages") introduced pudp_huge_get_and_clear_full() but no one uses
    its return code.

    In order to not diverge from pmdp_huge_get_and_clear_full(), just change
    zap_huge_pud() to not assign the return value from
    pudp_huge_get_and_clear_full().

    mm/huge_memory.c: In function 'zap_huge_pud':
    mm/huge_memory.c:1982:8: warning: variable 'orig_pud' set but not used [-Wunused-but-set-variable]
    pud_t orig_pud;
    ^~~~~~~~

    Link: http://lkml.kernel.org/r/20190301221956.97493-1-cai@lca.pw
    Signed-off-by: Qian Cai
    Reviewed-by: Andrew Morton
    Cc: Matthew Wilcox
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Qian Cai
     
    We have a common pattern to access lru_lock from a page pointer:
    zone_lru_lock(page_zone(page))

    Which is silly, because it unfolds to this:
    &NODE_DATA(page_to_nid(page))->node_zones[page_zonenum(page)]->zone_pgdat->lru_lock
    while we can simply do
    &NODE_DATA(page_to_nid(page))->lru_lock

    Remove the zone_lru_lock() function, since it only complicates things. Use
    the 'page_pgdat(page)->lru_lock' pattern instead.

    [aryabinin@virtuozzo.com: a slightly better version of __split_huge_page()]
    Link: http://lkml.kernel.org/r/20190301121651.7741-1-aryabinin@virtuozzo.com
    Link: http://lkml.kernel.org/r/20190228083329.31892-2-aryabinin@virtuozzo.com
    Signed-off-by: Andrey Ryabinin
    Acked-by: Vlastimil Babka
    Acked-by: Mel Gorman
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Rik van Riel
    Cc: William Kucharski
    Cc: John Hubbard
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrey Ryabinin
     
  • Currently THP allocation events data is fairly opaque, since you can
    only get it system-wide. This patch makes it easier to reason about
    transparent hugepage behaviour on a per-memcg basis.

    For anonymous THP-backed pages, we already have MEMCG_RSS_HUGE in v1,
    which is used for v1's rss_huge [sic]. This is reused here as it's
    fairly involved to untangle NR_ANON_THPS right now to make it per-memcg,
    since right now some of this is delegated to rmap before we have any
    memcg actually assigned to the page. It's a good idea to rework that,
    but let's leave untangling THP allocation for a future patch.

    [akpm@linux-foundation.org: fix build]
    [chris@chrisdown.name: fix memcontrol build when THP is disabled]
    Link: http://lkml.kernel.org/r/20190131160802.GA5777@chrisdown.name
    Link: http://lkml.kernel.org/r/20190129205852.GA7310@chrisdown.name
    Signed-off-by: Chris Down
    Acked-by: Johannes Weiner
    Cc: Tejun Heo
    Cc: Roman Gushchin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Chris Down
     
  • When calling debugfs functions, there is no need to ever check the
    return value. The function can work or not, but the code logic should
    never do something different based on this.
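
    For example, in the spirit of this change, a caller simply creates the entry
    and moves on (illustrative only):

    /* nothing useful can be done if this fails, so ignore the return value */
    debugfs_create_file("split_huge_pages", 0200, NULL, NULL,
                        &split_huge_pages_fops);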

    Link: http://lkml.kernel.org/r/20190122152151.16139-14-gregkh@linuxfoundation.org
    Signed-off-by: Greg Kroah-Hartman
    Cc: Michal Hocko
    Cc: Vlastimil Babka
    Cc: David Rientjes
    Cc: Laura Abbott
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Greg Kroah-Hartman
     
  • Patch series "Replace all open encodings for NUMA_NO_NODE", v3.

    All these places for replacement were found by running the following
    grep patterns on the entire kernel code. Please let me know if this
    might have missed some instances. This might also have replaced some
    false positives. I will appreciate suggestions, inputs and review.

    1. git grep "nid == -1"
    2. git grep "node == -1"
    3. git grep "nid = -1"
    4. git grep "node = -1"

    This patch (of 2):

    At present there are multiple places where an invalid node number is
    encoded as -1. Even though this is implicitly understood, it is always better
    to have macros for it. Replace these open encodings for an invalid node
    number with the global macro NUMA_NO_NODE. This helps remove NUMA
    related assumptions like 'invalid node' from various places redirecting
    them to a common definition.
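
    The replacement itself is mechanical, e.g.:

    /* before */
    int nid = -1;
    /* after */
    int nid = NUMA_NO_NODE;     /* defined as -1 in <linux/numa.h> */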

    Link: http://lkml.kernel.org/r/1545127933-10711-2-git-send-email-anshuman.khandual@arm.com
    Signed-off-by: Anshuman Khandual
    Reviewed-by: David Hildenbrand
    Acked-by: Jeff Kirsher [ixgbe]
    Acked-by: Jens Axboe [mtip32xx]
    Acked-by: Vinod Koul [dmaengine.c]
    Acked-by: Michael Ellerman [powerpc]
    Acked-by: Doug Ledford [drivers/infiniband]
    Cc: Joseph Qi
    Cc: Hans Verkuil
    Cc: Stephen Rothwell
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Anshuman Khandual
     

05 Jan, 2019

1 commit

  • Patch series "Add support for fast mremap".

    This series speeds up the mremap(2) syscall by copying page tables at
    the PMD level even for non-THP systems. There is concern that the extra
    'address' argument that mremap passes to pte_alloc may do something
    subtle architecture related in the future that may make the scheme not
    work. Also we find that there is no point in passing the 'address' to
    pte_alloc since it's unused. This patch therefore removes this argument
    tree-wide, resulting in a nice negative diff as well. Along the way it also
    ensures that the enabled architectures do not do anything funky with the
    'address' argument that goes unnoticed by the optimization.

    Build and boot tested on x86-64. Build tested on arm64. The config
    enablement patch for arm64 will be posted in the future after more
    testing.

    The changes were obtained by applying the following Coccinelle script.
    (thanks Julia for answering all Coccinelle questions!).
    Following fix ups were done manually:
    * Removal of address argument from pte_fragment_alloc
    * Removal of pte_alloc_one_fast definitions from m68k and microblaze.

    // Options: --include-headers --no-includes
    // Note: I split the 'identifier fn' line, so if you are manually
    // running it, please unsplit it so it runs for you.

    virtual patch

    @pte_alloc_func_def depends on patch exists@
    identifier E2;
    identifier fn =~
    "^(__pte_alloc|pte_alloc_one|pte_alloc|__pte_alloc_kernel|pte_alloc_one_kernel)$";
    type T2;
    @@

    fn(...
    - , T2 E2
    )
    { ... }

    @pte_alloc_func_proto_noarg depends on patch exists@
    type T1, T2, T3, T4;
    identifier fn =~ "^(__pte_alloc|pte_alloc_one|pte_alloc|__pte_alloc_kernel|pte_alloc_one_kernel)$";
    @@

    (
    - T3 fn(T1, T2);
    + T3 fn(T1);
    |
    - T3 fn(T1, T2, T4);
    + T3 fn(T1, T2);
    )

    @pte_alloc_func_proto depends on patch exists@
    identifier E1, E2, E4;
    type T1, T2, T3, T4;
    identifier fn =~
    "^(__pte_alloc|pte_alloc_one|pte_alloc|__pte_alloc_kernel|pte_alloc_one_kernel)$";
    @@

    (
    - T3 fn(T1 E1, T2 E2);
    + T3 fn(T1 E1);
    |
    - T3 fn(T1 E1, T2 E2, T4 E4);
    + T3 fn(T1 E1, T2 E2);
    )

    @pte_alloc_func_call depends on patch exists@
    expression E2;
    identifier fn =~
    "^(__pte_alloc|pte_alloc_one|pte_alloc|__pte_alloc_kernel|pte_alloc_one_kernel)$";
    @@

    fn(...
    -, E2
    )

    @pte_alloc_macro depends on patch exists@
    identifier fn =~
    "^(__pte_alloc|pte_alloc_one|pte_alloc|__pte_alloc_kernel|pte_alloc_one_kernel)$";
    identifier a, b, c;
    expression e;
    position p;
    @@

    (
    - #define fn(a, b, c) e
    + #define fn(a, b) e
    |
    - #define fn(a, b) e
    + #define fn(a) e
    )

    Link: http://lkml.kernel.org/r/20181108181201.88826-2-joelaf@google.com
    Signed-off-by: Joel Fernandes (Google)
    Suggested-by: Kirill A. Shutemov
    Acked-by: Kirill A. Shutemov
    Cc: Michal Hocko
    Cc: Julia Lawall
    Cc: Kirill A. Shutemov
    Cc: William Kucharski
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joel Fernandes (Google)
     

29 Dec, 2018

4 commits

  • Userspace falls short when trying to find out whether a specific memory
    range is eligible for THP. There are usecases that would like to know
    that
    http://lkml.kernel.org/r/alpine.DEB.2.21.1809251248450.50347@chino.kir.corp.google.com
    : This is used to identify heap mappings that should be able to fault thp
    : but do not, and they normally point to a low-on-memory or fragmentation
    : issue.

    The only way to deduce this now is to query for the hg resp. nh flags and
    confront the state with the global setting. Except that there is also
    PR_SET_THP_DISABLE that might change the picture. So the final logic is
    not trivial. Moreover, the eligibility of the vma depends on the type of
    VMA as well. In the past we supported only anonymous memory VMAs, but
    things have changed and shmem-based vmas are supported as well these
    days, and the query logic gets even more complicated because the
    eligibility depends on the mount option and another global configuration
    knob.

    Simplify the current state and report the THP eligibility in
    /proc/<pid>/smaps for each existing vma. Reuse
    transparent_hugepage_enabled for this purpose. The original
    implementation of this function assumes that the caller knows that the vma
    itself is supported for THP, so move the core checks into
    __transparent_hugepage_enabled and use it for existing callers.
    __show_smap just uses the new transparent_hugepage_enabled, which also
    checks the vma support status (please note that this one has to be out of
    line due to include dependency issues).

    [mhocko@kernel.org: fix oops with NULL ->f_mapping]
    Link: http://lkml.kernel.org/r/20181224185106.GC16738@dhcp22.suse.cz
    Link: http://lkml.kernel.org/r/20181211143641.3503-3-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Acked-by: Vlastimil Babka
    Cc: Dan Williams
    Cc: David Rientjes
    Cc: Jan Kara
    Cc: Mike Rapoport
    Cc: Paul Oppenheimer
    Cc: William Kucharski
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
    To avoid having to change many call sites every time we want to add a
    parameter, use a structure to group all parameters for the mmu_notifier
    invalidate_range_start/end calls. No functional changes with this patch.

    [akpm@linux-foundation.org: coding style fixes]
    Link: http://lkml.kernel.org/r/20181205053628.3210-3-jglisse@redhat.com
    Signed-off-by: Jérôme Glisse
    Acked-by: Christian König
    Acked-by: Jan Kara
    Cc: Matthew Wilcox
    Cc: Ross Zwisler
    Cc: Dan Williams
    Cc: Paolo Bonzini
    Cc: Radim Krcmar
    Cc: Michal Hocko
    Cc: Felix Kuehling
    Cc: Ralph Campbell
    Cc: John Hubbard
    From: Jérôme Glisse
    Subject: mm/mmu_notifier: use structure for invalidate_range_start/end calls v3

    fix build warning in migrate.c when CONFIG_MMU_NOTIFIER=n

    Link: http://lkml.kernel.org/r/20181213171330.8489-3-jglisse@redhat.com
    Signed-off-by: Jérôme Glisse
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jérôme Glisse
     
  • Waiting on a page migration entry has used wait_on_page_locked() all along
    since 2006: but you cannot safely wait_on_page_locked() without holding a
    reference to the page, and that extra reference is enough to make
    migrate_page_move_mapping() fail with -EAGAIN, when a racing task faults
    on the entry before migrate_page_move_mapping() gets there.

    And that failure is retried nine times, amplifying the pain when trying to
    migrate a popular page. With a single persistent faulter, migration
    sometimes succeeds; with two or three concurrent faulters, success becomes
    much less likely (and the more the page was mapped, the worse the overhead
    of unmapping and remapping it on each try).

    This is especially a problem for memory offlining, where the outer level
    retries forever (or until terminated from userspace), because a heavy
    refault workload can trigger an endless loop of migration failures.
    wait_on_page_locked() is the wrong tool for the job.

    David Herrmann (but was he the first?) noticed this issue in 2014:
    https://marc.info/?l=linux-mm&m=140110465608116&w=2

    Tim Chen started a thread in August 2017 which appears relevant:
    https://marc.info/?l=linux-mm&m=150275941014915&w=2 where Kan Liang went
    on to implicate __migration_entry_wait():
    https://marc.info/?l=linux-mm&m=150300268411980&w=2 and the thread ended
    up with the v4.14 commits: 2554db916586 ("sched/wait: Break up long wake
    list walk") 11a19c7b099f ("sched/wait: Introduce wakeup boomark in
    wake_up_page_bit")

    Baoquan He reported "Memory hotplug softlock issue" 14 November 2018:
    https://marc.info/?l=linux-mm&m=154217936431300&w=2

    We have all assumed that it is essential to hold a page reference while
    waiting on a page lock: partly to guarantee that there is still a struct
    page when MEMORY_HOTREMOVE is configured, but also to protect against
    reuse of the struct page going to someone who then holds the page locked
    indefinitely, when the waiter can reasonably expect timely unlocking.

    But in fact, so long as wait_on_page_bit_common() does the put_page(), and
    is careful not to rely on struct page contents thereafter, there is no
    need to hold a reference to the page while waiting on it. That does mean
    that this case cannot go back through the loop: but that's fine for the
    page migration case, and even if used more widely, is limited by the "Stop
    walking if it's locked" optimization in wake_page_function().

    Add interface put_and_wait_on_page_locked() to do this, using "behavior"
    enum in place of "lock" arg to wait_on_page_bit_common() to implement it.
    No interruptible or killable variant needed yet, but they might follow: I
    have a vague notion that reporting -EINTR should take precedence over
    return from wait_on_page_bit_common() without knowing the page state, so
    arrange it accordingly - but that may be nothing but pedantic.

    __migration_entry_wait() still has to take a brief reference to the page,
    prior to calling put_and_wait_on_page_locked(): but now that it is dropped
    before waiting, the chance of impeding page migration is very much
    reduced. Should we perhaps disable preemption across this?

    shrink_page_list()'s __ClearPageLocked(): that was a surprise! This
    survived a lot of testing before that showed up. PageWaiters may have
    been set by wait_on_page_bit_common(), and the reference dropped, just
    before shrink_page_list() succeeds in freezing its last page reference: in
    such a case, unlock_page() must be used. Follow the suggestion from
    Michal Hocko, just revert a978d6f52106 ("mm: unlockless reclaim") now:
    that optimization predates PageWaiters, and won't buy much these days; but
    we can reinstate it for the !PageWaiters case if anyone notices.

    It does raise the question: should vmscan.c's is_page_cache_freeable() and
    __remove_mapping() now treat a PageWaiters page as if an extra reference
    were held? Perhaps, but I don't think it matters much, since
    shrink_page_list() already had to win its trylock_page(), so waiters are
    not very common there: I noticed no difference when trying the bigger
    change, and it's surely not needed while put_and_wait_on_page_locked() is
    only used for page migration.

    [willy@infradead.org: add put_and_wait_on_page_locked() kerneldoc]
    Link: http://lkml.kernel.org/r/alpine.LSU.2.11.1811261121330.1116@eggly.anvils
    Signed-off-by: Hugh Dickins
    Reported-by: Baoquan He
    Tested-by: Baoquan He
    Reviewed-by: Andrea Arcangeli
    Acked-by: Michal Hocko
    Acked-by: Linus Torvalds
    Acked-by: Vlastimil Babka
    Cc: Matthew Wilcox
    Cc: Baoquan He
    Cc: David Hildenbrand
    Cc: Mel Gorman
    Cc: David Herrmann
    Cc: Tim Chen
    Cc: Kan Liang
    Cc: Andi Kleen
    Cc: Davidlohr Bueso
    Cc: Peter Zijlstra
    Cc: Christoph Lameter
    Cc: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • totalram_pages and totalhigh_pages are made static inline functions.

    The main motivation was that managed_page_count_lock handling was
    complicating things. It was discussed at length here:
    https://lore.kernel.org/patchwork/patch/995739/#1181785
    So it seems better to remove the lock and convert the variables to atomic,
    preventing potential store-to-read tearing as a bonus.
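
    A minimal sketch of the resulting pattern, using the accessor names from
    this patch series (the real definitions live in include/linux/mm.h):

        extern atomic_long_t _totalram_pages;

        static inline unsigned long totalram_pages(void)
        {
                /* one atomic read: no lock, no store-to-read tearing */
                return (unsigned long)atomic_long_read(&_totalram_pages);
        }

        static inline void totalram_pages_add(long count)
        {
                atomic_long_add(count, &_totalram_pages);
        }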

    [akpm@linux-foundation.org: coding style fixes]
    Link: http://lkml.kernel.org/r/1542090790-21750-4-git-send-email-arunks@codeaurora.org
    Signed-off-by: Arun KS
    Suggested-by: Michal Hocko
    Suggested-by: Vlastimil Babka
    Reviewed-by: Konstantin Khlebnikov
    Reviewed-by: Pavel Tatashin
    Acked-by: Michal Hocko
    Acked-by: Vlastimil Babka
    Cc: David Hildenbrand
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Arun KS
     

22 Dec, 2018

1 commit

  • When splitting a huge migrating PMD, we'll transfer all the existing PMD
    bits and apply them again onto the small PTEs. However, we are fetching
    the bits unconditionally via pmd_soft_dirty(), pmd_write() or
    pmd_young(), while actually they don't make sense at all when it's a
    migration entry. Fix them up. While at it, drop the ifdef as well, since
    it is not needed.

    Note that, if my understanding of the problem is correct, then without
    the patch there is a chance of losing some of the dirty bits in the
    migrating pmd pages (on x86_64 we're fetching bit 11, which is part of
    the swap offset, instead of bit 2), and that could potentially corrupt
    the memory of a userspace program which depends on the dirty bit.
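
    A rough sketch of the distinction being made in __split_huge_pmd_locked()
    (simplified, not the exact diff):

        if (unlikely(pmd_migration)) {
                swp_entry_t entry = pmd_to_swp_entry(old_pmd);

                /*
                 * A migration entry is not a present pmd: the relevant bits
                 * live in the swap entry encoding, not in the pmd bits.
                 */
                write = is_write_migration_entry(entry);
                young = false;
                soft_dirty = pmd_swp_soft_dirty(old_pmd);
        } else {
                write = pmd_write(old_pmd);
                young = pmd_young(old_pmd);
                soft_dirty = pmd_soft_dirty(old_pmd);
        }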

    Link: http://lkml.kernel.org/r/20181213051510.20306-1-peterx@redhat.com
    Signed-off-by: Peter Xu
    Reviewed-by: Konstantin Khlebnikov
    Reviewed-by: William Kucharski
    Acked-by: Kirill A. Shutemov
    Cc: Andrea Arcangeli
    Cc: Matthew Wilcox
    Cc: Michal Hocko
    Cc: Dave Jiang
    Cc: "Aneesh Kumar K.V"
    Cc: Souptick Joarder
    Cc: Konstantin Khlebnikov
    Cc: Zi Yan
    Cc: [4.14+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Peter Xu
     

09 Dec, 2018

1 commit

  • This reverts commit 89c83fb539f95491be80cdd5158e6f0ce329e317.

    This should have been done as part of 2f0799a0ffc0 ("mm, thp: restore
    node-local hugepage allocations"). The movement of the thp allocation
    policy from alloc_pages_vma() to alloc_hugepage_direct_gfpmask() was
    intended to only set __GFP_THISNODE for mempolicies that are not
    MPOL_BIND whereas the revert could set this regardless of mempolicy.

    While the check for MPOL_BIND between alloc_hugepage_direct_gfpmask()
    and alloc_pages_vma() was racy, that has since been removed by the
    revert. What is left is the possibility of __GFP_THISNODE being used in
    policy_node() when it is unexpected, because the special handling for
    hugepages in alloc_pages_vma() was removed as part of the consolidation.

    Secondly, prior to 89c83fb539f9, alloc_pages_vma() implemented a somewhat
    different policy for hugepage allocations, which were allocated through
    alloc_hugepage_vma(). For hugepage allocations, if the allocating
    process's node is in the set of allowed nodes, allocate with
    __GFP_THISNODE for that node (for MPOL_PREFERRED, use that node with
    __GFP_THISNODE instead). This was changed for shmem_alloc_hugepage() in
    89c83fb539f9 to allow fallback to other nodes, as it did for new_page()
    in mm/mempolicy.c, which is functionally different behavior and removes
    the requirement to only allocate hugepages locally.
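
    Purely as illustration of that pre-89c83fb539f9 policy (thp_pick_nid()
    is a made-up helper name, and the mempolicy field accesses follow the
    4.20-era struct layout):

        static int thp_pick_nid(struct mempolicy *pol, gfp_t *gfp)
        {
                int nid = numa_node_id();

                if (pol->mode == MPOL_PREFERRED)
                        nid = pol->v.preferred_node;    /* preferred node wins */
                else if (!node_isset(nid, pol->v.nodes))
                        return NUMA_NO_NODE;            /* local node not allowed */

                *gfp |= __GFP_THISNODE;         /* pin the hugepage to that node */
                return nid;
        }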

    So this commit does a full revert of 89c83fb539f9 instead of the partial
    revert that was done in 2f0799a0ffc0. The result is the same thp
    allocation policy for 4.20 that was in 4.19.

    Fixes: 89c83fb539f9 ("mm, thp: consolidate THP gfp handling into alloc_hugepage_direct_gfpmask")
    Fixes: 2f0799a0ffc0 ("mm, thp: restore node-local hugepage allocations")
    Signed-off-by: David Rientjes
    Acked-by: Vlastimil Babka
    Cc: Andrea Arcangeli
    Cc: Mel Gorman
    Cc: Michal Hocko
    Cc: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     

06 Dec, 2018

1 commit

  • This is a full revert of ac5b2c18911f ("mm: thp: relax __GFP_THISNODE for
    MADV_HUGEPAGE mappings") and a partial revert of 89c83fb539f9 ("mm, thp:
    consolidate THP gfp handling into alloc_hugepage_direct_gfpmask").

    By not setting __GFP_THISNODE, applications can allocate remote hugepages
    when the local node is fragmented or low on memory when either the thp
    defrag setting is "always" or the vma has been madvised with
    MADV_HUGEPAGE.

    Remote access to hugepages often has much higher latency than local pages
    of the native page size. On Haswell, ac5b2c18911f was shown to cause a
    13.9% access regression for binaries that remap their text segment to
    be backed by transparent hugepages.

    The intent of ac5b2c18911f is to address an issue where a local node is
    low on memory or fragmented such that a hugepage cannot be allocated. In
    every scenario where this was described as a fix, there is abundant and
    unfragmented remote memory available to allocate from, even with a greater
    access latency.

    If remote memory is also low or fragmented, not setting __GFP_THISNODE was
    also measured on Haswell to cause a 40% regression in allocation latency.

    Restore __GFP_THISNODE for thp allocations.

    Fixes: ac5b2c18911f ("mm: thp: relax __GFP_THISNODE for MADV_HUGEPAGE mappings")
    Fixes: 89c83fb539f9 ("mm, thp: consolidate THP gfp handling into alloc_hugepage_direct_gfpmask")
    Cc: Andrea Arcangeli
    Cc: Mel Gorman
    Cc: Vlastimil Babka
    Cc: Michal Hocko
    Cc: Andrew Morton
    Signed-off-by: David Rientjes
    Signed-off-by: Linus Torvalds

    David Rientjes
     

01 Dec, 2018

2 commits

  • Huge tmpfs testing, on 32-bit kernel with lockdep enabled, showed that
    __split_huge_page() was using i_size_read() while holding the irq-safe
    lru_lock and page tree lock, but the 32-bit i_size_read() uses an
    irq-unsafe seqlock which should not be nested inside them.

    Instead, read the i_size earlier in split_huge_page_to_list(), and pass
    the end offset down to __split_huge_page(): all while holding the head
    page lock, which is enough to prevent truncation of that extent before
    the page tree lock has been taken.
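
    Schematically (using names from the patch; the intervening code is
    omitted):

        /* in split_huge_page_to_list(), with only the head page locked */
        end = DIV_ROUND_UP(i_size_read(mapping->host), PAGE_SIZE);

        /* ... later, after lru_lock and the page tree lock are taken ... */
        __split_huge_page(page, list, end, flags);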

    Link: http://lkml.kernel.org/r/alpine.LSU.2.11.1811261520070.2275@eggly.anvils
    Fixes: baa355fd33142 ("thp: file pages support for split_huge_page()")
    Signed-off-by: Hugh Dickins
    Acked-by: Kirill A. Shutemov
    Cc: Jerome Glisse
    Cc: Konstantin Khlebnikov
    Cc: Matthew Wilcox
    Cc: [4.8+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Huge tmpfs stress testing has occasionally hit shmem_undo_range()'s
    VM_BUG_ON_PAGE(page_to_pgoff(page) != index, page).

    Move the setting of mapping and index up before the page_ref_unfreeze()
    in __split_huge_page_tail() to fix this: so that a page cache lookup
    cannot get a reference while the tail's mapping and index are unstable.

    In fact, might as well move them up before the smp_wmb(): I don't see an
    actual need for that, but if I'm missing something, this way round is
    safer than the other, and no less efficient.
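
    In code terms the ordering becomes roughly (simplified from
    __split_huge_page_tail(); clear_compound_head() and the surrounding
    checks are omitted):

        /*
         * Publish mapping and index while the tail's refcount is still
         * frozen at zero, so a page cache lookup that wins a reference
         * can never see them half-updated.
         */
        page_tail->mapping = head->mapping;
        page_tail->index = head->index + tail;

        smp_wmb();      /* page flags visible before the refcount is released */

        page_ref_unfreeze(page_tail,
                          1 + (!PageAnon(head) || PageSwapCache(head)));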

    You might argue that VM_BUG_ON_PAGE(page_to_pgoff(page) != index, page) is
    misplaced, and should be left until after the trylock_page(); but left as
    is, it has not crashed since, and gives more stringent assurance.

    Link: http://lkml.kernel.org/r/alpine.LSU.2.11.1811261516380.2275@eggly.anvils
    Fixes: e9b61f19858a5 ("thp: reintroduce split_huge_page()")
    Requires: 605ca5ede764 ("mm/huge_memory.c: reorder operations in __split_huge_page_tail()")
    Signed-off-by: Hugh Dickins
    Acked-by: Kirill A. Shutemov
    Cc: Konstantin Khlebnikov
    Cc: Jerome Glisse
    Cc: Matthew Wilcox
    Cc: [4.8+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins